vsphere high availability: recommended practices for
TRANSCRIPT
#vmworld
HCI1870BU
vSphere High Availability: Recommended Practices for VMware vSAN
Duncan Epping, VMware, Inc.
#HCI1870BU
VMworld 2019 Content: Not for publication or distribution
©2019 VMware, Inc.
Disclaimer
This presentation may contain product features or functionality that are currently under development.
This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined.
The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.
Anyone Still Recall What an HA Event Looked like Before VMware?
> 90%
HA has changed the lives of a lot of customers forever!
vSAN Availability
The Basics
vSAN is an Object Based Distributed Storage Platform
A disk group consists of 1 flash device (cache) and 1-7 capacity devices
Maximum 5 disk groups per host
Benefits to >1 disk group per host (recommended):
• Better performance
• Host still serves storage upon disk group failure
• Adds more caching
Or simply said: VMware’s Hyperconverged Infrastructure platform
[Figure: All-Flash vSAN on vSphere; each host holds multiple disk groups, each with a cache tier and a capacity tier]
Disk groups contribute to a single vSAN datastore across the cluster
Modern Object Based Storage for vSphere
The RAID tree consists of leaves (components) that make up a given object
Components dispersed across hosts in a cluster
vSAN determines placement of object components
Adheres to assigned policy of object
“Witness” components used to determine quorum
vSAN objects and components
Components
Max size: 255GB
May be split due to policy settings or environmental conditions
[Figure: RAID-1 object with two replicas (each a copy of the object) made up of components C1, C2, C3, plus a witness component (W)]
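The 255GB component limit can be illustrated with a little arithmetic. The sketch below is a simplification that only accounts for the size limit; the function name and the 600GB example are hypothetical, and real placement may also split components because of policy settings or environmental conditions, as the slide notes.

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size stated on the slide


def min_components_per_replica(object_size_gb: float) -> int:
    """Lower bound on components per replica, driven only by the size limit."""
    if object_size_gb <= 0:
        raise ValueError("object size must be positive")
    return math.ceil(object_size_gb / MAX_COMPONENT_GB)


# Hypothetical 600GB object: each replica needs at least three components,
# comparable to the C1, C2, C3 layout drawn in the figure.
print(min_components_per_replica(600))
```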
Each component has one vote by default
Odd number of votes required to break tie – preserves data integrity
Greater than 50% of components (votes) must be online
Greater than 50% Must be Online to Achieve Quorum
Component A: 1 Vote
Component B: 1 Vote
Component C: 1 Vote
Component D: 1 Vote
Component E: 3 Votes
Component F: 2 Votes
Partition 1: Ghosted VM
Partition 2: Active VM
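The quorum rule can be sketched in a few lines. The vote counts are the ones on the slide (A to D one vote each, E three, F two); which components sit in which partition is not stated in the transcript, so the split below (A-D versus E-F) is an assumption chosen to match the "ghosted" and "active" labels.

```python
# Vote counts as shown on the slide.
VOTES = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 3, "F": 2}


def has_quorum(reachable, votes=VOTES):
    """An object is accessible only if MORE than 50% of all votes are reachable."""
    total = sum(votes.values())
    online = sum(votes[name] for name in reachable)
    return online * 2 > total


# Assumed partition layout (not stated in the source).
partition_1 = ["A", "B", "C", "D"]  # 4 of 9 votes
partition_2 = ["E", "F"]            # 5 of 9 votes

print(has_quorum(partition_1))  # no quorum: VM is ghosted on this side
print(has_quorum(partition_2))  # quorum: VM stays active on this side
```

Because the total number of votes is odd, the two sides can never tie, which is the point of the "odd number of votes" rule.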
vSphere HA
Let’s dive right into it
vSphere HA Basics
vSphere HA minimizes unplanned downtime
Provides automatic VM recovery in minutes
Protects against 3 types of failures
• Compute
• Storage Connectivity
• OS and Application
Failure types
vSphere HA APD/PDL response should be disabled for vSAN-only clusters.
The first thing we learned!
vSphere HA Basics
Cluster of ESXi hosts
• One of the hosts is elected as master
Heartbeats via network and storage to communicate availability
HA Network used by HA agents
• Management network
• vSAN network
Disable HA before you enable vSAN
Communications
Master Slave Slave
vSphere HA uses the vSAN network when vSAN is enabled!
Second thing we learned!
If we use the vSAN Network we need to take some things into consideration:
1. The Isolation Address we select needs to be an IP on the vSAN Network!
2. Disable the use of the Default Isolation Address
3. The Isolation Response we use should be “Power Off and Restart VMs”
Isolation Address Considerations!
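As a rough illustration of the first two considerations, the sketch below builds the HA advanced options for a cluster and refuses any isolation address that is not on the vSAN network. The option names das.usedefaultisolationaddress and das.isolationaddress0 through das.isolationaddress9 are vSphere HA advanced options; the helper function, the subnet, and the IP addresses are hypothetical, and the code only produces a dictionary of settings rather than applying them to a cluster.

```python
import ipaddress


def ha_isolation_options(vsan_network, isolation_addresses):
    """Build HA advanced options, validating each address against the vSAN subnet."""
    network = ipaddress.ip_network(vsan_network)
    if not 1 <= len(isolation_addresses) <= 10:
        raise ValueError("das.isolationaddress supports indexes 0 through 9")

    # Do not fall back to the default isolation address.
    options = {"das.usedefaultisolationaddress": "false"}
    for index, address in enumerate(isolation_addresses):
        if ipaddress.ip_address(address) not in network:
            raise ValueError(f"{address} is not on the vSAN network {network}")
        options[f"das.isolationaddress{index}"] = address
    return options


# Hypothetical vSAN subnet and a pingable address on it.
print(ha_isolation_options("192.168.10.0/24", ["192.168.10.1"]))
```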
Configure Host Isolation to “Power Off and Restart VMs”
What If You Have a Stretched Cluster?
[Figure: stretched vSphere/vSAN cluster with Site A and Site B (5ms RTT, 10GbE), a 3rd site for the witness, RAID-1 across sites and RAID-6 within each site]
What Happens in a Stretched Cluster During a Partition Then?
This is a partition, so the Isolation Response does not apply
APD/PDL response isn’t supported with vSAN
Yet, all VMs in the location without any connectivity will be killed!
vSAN has a mechanism for this (VSAN.AutoTerminateGhostVm)
[Figure: stretched vSphere/vSAN cluster with Site A and Site B (5ms RTT, 10GbE), a 3rd site for the witness, RAID-1 across sites and RAID-6 within each site]
Carefully select the right isolation address and change the Host Isolation Response!
Third thing we learned!
Isolation Address | Datastore Heartbeats | Observed behavior
IP on vSAN Network | Not configured | Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted
Management Network | Not configured | Can ping the isolation address, isolation not declared, yet rest of the cluster restarts the VMs even though they are still running on the isolated hosts
IP on vSAN Network | Configured | Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted
Management Network | Configured | VMs are not powered off and not restarted, as the “isolated host” can still ping the management network and the datastore heartbeat mechanism is used to inform the master about the state. So the master knows the HA network is not working, but the VMs are not powered off.
Behavior when vSAN network is isolated!
What About Those Heartbeat Datastores?
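The four observed behaviors can be captured as a small lookup, which makes it easier to see that only the location of the isolation address decides whether isolation is declared. This is a sketch of the table from the slide, not of any HA API; the type and field names are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    isolation_declared: bool
    vms_powered_off: bool
    vms_restarted: bool


# Key: (network the isolation address lives on, heartbeat datastores configured)
BEHAVIOR = {
    ("vsan", False): Outcome(True, True, True),
    ("management", False): Outcome(False, False, True),  # restart attempted while VMs still run
    ("vsan", True): Outcome(True, True, True),
    ("management", True): Outcome(False, False, False),
}


def observed_behavior(isolation_address_network, heartbeat_datastores):
    """Return the behavior observed when the vSAN network is isolated."""
    return BEHAVIOR[(isolation_address_network, heartbeat_datastores)]


for key, outcome in BEHAVIOR.items():
    print(key, outcome)
```

Only the rows with the isolation address on the vSAN network end with the VMs cleanly powered off and restarted, with or without heartbeat datastores.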
How Do I Disable Heartbeat Datastores?
vSphere HA Features Not Specific to vSAN
But shouldn’t be overlooked!
Use Admission Control!
There are three algorithms to reserve resources
• Percentage based
• Slots based
• Failover host
Slot Policy: based on the highest reservation + overhead
• If you have a 32GB mem reservation, your slot size is 32GB for memory + memory overhead
• For CPU the minimum slot size is 32MHz
• Number of slots = Total Amount of Resources / Slot Size
Percentage Policy: based on the reservation per VM
• Set aside a percentage for fail-over capacity
– defined by number of hosts as of 6.5
• And the reservations (and overhead) will be removed from the total percentage of available resources
Especially when you have a stretched cluster, it is the only way to guarantee a restart!
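The slot and percentage arithmetic above can be worked through with a short sketch. It is a simplification under stated assumptions: the function names, the 128MB memory overhead, and the host sizes are hypothetical, the slot count takes the lower of the CPU and memory results, and the percentage calculation assumes equally sized hosts.

```python
MIN_CPU_SLOT_MHZ = 32  # minimum CPU slot size from the slide


def slot_size(cpu_reservations_mhz, mem_reservations_mb, mem_overhead_mb):
    """Slot size is driven by the highest reservation (plus memory overhead)."""
    cpu_slot = max(max(cpu_reservations_mhz), MIN_CPU_SLOT_MHZ)
    mem_slot = max(mem_reservations_mb) + mem_overhead_mb
    return cpu_slot, mem_slot


def slots_per_host(host_cpu_mhz, host_mem_mb, cpu_slot, mem_slot):
    """Number of slots = total resources / slot size, limited by the scarcer resource."""
    return min(host_cpu_mhz // cpu_slot, host_mem_mb // mem_slot)


def failover_capacity_percent(num_hosts, host_failures_to_tolerate):
    """Percentage set aside when derived from a number of hosts (equal hosts assumed)."""
    return 100.0 * host_failures_to_tolerate / num_hosts


# Hypothetical cluster: one VM carries a 32GB memory reservation.
cpu_slot, mem_slot = slot_size(
    cpu_reservations_mhz=[0, 0, 500],
    mem_reservations_mb=[0, 4096, 32768],
    mem_overhead_mb=128,  # assumed overhead, for illustration only
)
print(cpu_slot, mem_slot)
print(slots_per_host(20000, 262144, cpu_slot, mem_slot))
print(failover_capacity_percent(num_hosts=4, host_failures_to_tolerate=1))
```

A single large reservation inflates the slot size for every VM, which is why the slot policy can look far more conservative than the percentage policy on the same cluster.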
Admission Control
A bit more in-depth info about reserving resources
Admission Control is all about static values… but what about actual VM performance?
Not Supported
Proactive HA
One Last Thing to Remember: /var/log/fdm.log
Slave:
2018-03-12T12:08:30.398Z verbose fdm[2111414] Waited 5 seconds for isolation icmp ping reply. Isolated
2018-03-12T12:08:30.398Z info fdm[2111414] Host isolated is true
2018-03-12T12:09:00.399Z verbose fdm[2111421] [LocalIsolationPolicy::GetIsolationResponseInfo] Isolation response for VM /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx is powerOff
2018-03-12T12:09:35.403Z verbose fdm[2111421] [LocalIsolationPolicy::InitiateCreatePowerOffFiles] Creating power-off file for /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx.
2018-03-12T12:09:35.424Z verbose fdm[2111427] [LocalIsolationPolicy::DoVmPowerOff] Powering off /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx
Master:
2018-03-12T12:07:51.708Z verbose fdm[2129221] Heartbeat still pending for slave @ host-10
2018-03-12T12:07:52.709Z verbose fdm[2129221] Forcing heartbeat check on datastore /vmfs/volumes/5a04393e-ff7b4ae1-dbb0-246e962f4910 for slave host-10
2018-03-12T12:07:59.718Z verbose fdm[2129221] Beginning ICMP pings every 1000000 microseconds to host-10
2018-03-12T12:08:02.725Z verbose fdm[2129221] Checking heartbeat datastore /vmfs/volumes/5a6f1dfb-65376d7c-a4af-246e962f4910 on localhost
2018-03-12T12:08:02.748Z verbose fdm[2129210] (VMFS) host-10 @ 24:6e:96:2f:49:10 is ALIVE
2018-03-12T12:09:33.864Z verbose fdm[2129221] ProcessSlavePowerOnListChanges host host-10 listVersion=2830482267 isolated=true poweredOnVms=0
2018-03-12T12:09:45.869Z verbose fdm[2129794] Issue failover start event for 2 Vms
2018-03-12T12:10:08.708Z verbose fdm[2129209] Failover operation in progress on 1 Vms: 0 VMs being restarted, 0 VMs waiting for a retry, 1 VMs waiting for resources, 0 inaccessible vSAN VMs.
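When reading fdm.log it often helps to pull out the timeline of an isolation event. The sketch below parses lines in the format shown on the slide and measures the time between two messages; the regular expression is inferred from these sample lines only, so other fdm.log entries may not match it, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Format inferred from the sample lines: "<timestamp> <level> fdm[<id>] <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+fdm\[(?P<id>\d+)\]\s+(?P<msg>.*)$")


def parse_line(line):
    """Return (timestamp, level, id, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
    return ts, match.group("level"), int(match.group("id")), match.group("msg")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first message containing start_marker to the next one containing end_marker."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, _id, msg = parsed
        if start is None:
            if start_marker in msg:
                start = ts
        elif end_marker in msg:
            return (ts - start).total_seconds()
    return None


SAMPLE = [
    "2018-03-12T12:08:30.398Z info fdm[2111414] Host isolated is true",
    "2018-03-12T12:09:35.424Z verbose fdm[2111427] [LocalIsolationPolicy::DoVmPowerOff] "
    "Powering off /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/"
    "Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx",
]

print(seconds_between(SAMPLE, "Host isolated is true", "Powering off"))
```

On a host, the same functions could be fed the lines of /var/log/fdm.log instead of the SAMPLE list.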
Summarizing…
vSphere HA uses the vSAN Network when vSAN is enabled
When vSAN is the only storage platform, there’s no need for Heartbeat Datastores
The Isolation Address needs to be an address on the vSAN Network
Disable the use of the Default Isolation Address
Add isolation addresses using das.isolationaddress[0-9]
The Isolation Response needs to be set to “power off and restart VMs”
APD/PDL response should be configured to “Disabled”
vSAN has a “kill” mechanism for partition scenarios in a stretched environment!
Proactive HA is not supported!
What Did We Learn in the Past 30 Minutes?