Page 1: vSphere High Availability: Recommended Practices for

#vmworld

HCI1870BU

vSphere High Availability: Recommended Practices for VMware vSAN

Duncan Epping, VMware, Inc.

#HCI1870BU

VMworld 2019 Content: Not for publication or distribution

Page 2: vSphere High Availability: Recommended Practices for

©2019 VMware, Inc.

Disclaimer

This presentation may contain product features or functionality that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new features, functionality, or technology discussed or presented have not been determined.

The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.

Page 3: vSphere High Availability: Recommended Practices for


Anyone Still Recall What an HA Event Looked like Before VMware?

Page 4: vSphere High Availability: Recommended Practices for

> 90%

HA has changed the lives of a lot of customers forever!

Page 5: vSphere High Availability: Recommended Practices for

vSAN Availability: The Basics

Page 6: vSphere High Availability: Recommended Practices for

vSAN is an Object Based Distributed Storage Platform

Or, simply said, VMware’s Hyperconverged Infrastructure platform

A disk group consists of 1 flash (cache) device and 1-7 capacity devices

Maximum 5 disk groups per host

Benefits of more than 1 disk group per host (recommended):

• Better performance

• Host still serves storage upon disk group failure

• Adds more caching

Disk groups contribute to a single vSAN datastore across the cluster

[Diagram: all-flash vSAN hosts running vSphere, each with multiple disk groups made up of a cache device and capacity devices]

Page 7: vSphere High Availability: Recommended Practices for

Modern Object Based Storage for vSphere

vSAN objects and components

Each object is a RAID tree whose leaves (components) make up the object

Components are dispersed across hosts in a cluster

vSAN determines the placement of object components

Placement adheres to the policy assigned to the object

“Witness” components are used to determine quorum

Components

Max size: 255GB

May be split due to policy settings or environmental conditions

[Diagram: a RAID-1 object with two replicas (each a copy of the object, made up of components C1, C2, and C3) and a witness component (W)]
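The 255GB component limit above can be turned into a quick estimate of how many components one copy of an object needs. The following is a minimal sketch, assuming size is the only reason for a split; the function name and the 600GB example are hypothetical, and policy settings or environmental conditions can cause further splits that this does not model.

```python
import math

# Maximum component size stated on the slide.
MAX_COMPONENT_GB = 255


def min_components_per_replica(object_size_gb: float) -> int:
    """Lower bound on components for ONE copy of an object.

    Assumption: the object is split only because of the 255GB limit.
    Stripe-width policies or environmental conditions may add more.
    """
    return max(1, math.ceil(object_size_gb / MAX_COMPONENT_GB))


if __name__ == "__main__":
    size_gb = 600  # hypothetical VMDK size
    per_replica = min_components_per_replica(size_gb)
    # A RAID-1 object with two replicas needs at least this many data
    # components, plus witness component(s) to determine quorum.
    print(f"{size_gb}GB object: at least {per_replica} components per replica, "
          f"{2 * per_replica} data components for two replicas")
```

The estimate is a floor, not a prediction: vSAN decides the actual layout.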

Page 8: vSphere High Availability: Recommended Practices for

Each component has one vote by default

Odd number of votes required to break tie – preserves data integrity

Greater than 50% of components (votes) must be online

Greater than 50% Must be Online to Achieve Quorum

Component A: 1 vote

Component B: 1 vote

Component C: 1 vote

Component D: 1 vote

Component E: 3 votes

Component F: 2 votes

[Diagram: the components are split across Partition 1 and Partition 2; one partition holds a ghosted VM, the other the active VM]
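The quorum rule above (more than 50% of the votes must be online) can be sketched in a few lines. The vote counts come from the slide; which components end up in which partition is not recoverable from the transcript, so the split used here is an assumption for illustration only.

```python
# Votes per component, as listed on the slide.
VOTES = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 3, "F": 2}


def has_quorum(online_components, votes=VOTES):
    """Return True if the online components hold MORE than 50% of all votes."""
    total = sum(votes.values())
    online = sum(votes[name] for name in online_components)
    # Strictly greater than half; integer math avoids float rounding.
    return 2 * online > total


if __name__ == "__main__":
    # Hypothetical partition layout (not taken from the slide).
    partition_1 = ["A", "B", "C", "D"]  # 4 of 9 votes
    partition_2 = ["E", "F"]            # 5 of 9 votes
    print("Partition 1 has quorum:", has_quorum(partition_1))
    print("Partition 2 has quorum:", has_quorum(partition_2))
```

Because the total is odd (9 votes), at most one side of any two-way split can hold a majority, which is why an odd number of votes is needed to break a tie.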

Page 9: vSphere High Availability: Recommended Practices for

vSphere HA: Let’s dive right into it

Page 10: vSphere High Availability: Recommended Practices for

vSphere HA Basics

vSphere HA minimizes unplanned downtime

Provides automatic VM recovery in minutes

Protects against 3 types of failures

• Compute

• Storage Connectivity

• OS and Application

Failure types

Page 11: vSphere High Availability: Recommended Practices for

Page 12: vSphere High Availability: Recommended Practices for

vSphere HA APD/PDL response should be disabled for vSAN-only clusters.

The first thing we learned!

Page 13: vSphere High Availability: Recommended Practices for

vSphere HA Basics

Cluster of ESXi hosts

• One of the hosts is elected as master

Heartbeats via network and storage to communicate availability

HA Network used by HA agents

• Management network (default)

• vSAN network (when vSAN is enabled)

Disable HA before you enable vSAN

Communications

[Diagram: one master host and two slave hosts]

Page 14: vSphere High Availability: Recommended Practices for

vSphere HA uses the vSAN network when vSAN is enabled!

The second thing we learned!

Page 15: vSphere High Availability: Recommended Practices for

Isolation Address Considerations!

If we use the vSAN network, we need to take a few things into consideration:

1. The isolation address we select needs to be an IP on the vSAN network!

2. Disable the use of the default isolation address

3. The isolation response we use should be “Power Off and Restart VMs”
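The first two considerations map onto HA advanced options: `das.isolationaddress[0-9]` (named later in this deck) and `das.usedefaultisolationaddress`. The following is a minimal sketch that only builds and validates the key/value pairs; it does not talk to vCenter, and the function name, the vSAN subnet, and the IP addresses are hypothetical.

```python
import ipaddress


def build_isolation_options(vsan_network: str, isolation_addresses: list) -> dict:
    """Build HA advanced options for isolation addresses on the vSAN network.

    Raises ValueError if an address is not inside the vSAN network, or if
    the number of addresses does not fit das.isolationaddress0..9.
    """
    network = ipaddress.ip_network(vsan_network)
    if not 1 <= len(isolation_addresses) <= 10:
        raise ValueError("provide between 1 and 10 isolation addresses")

    # Consideration 2: do not use the default isolation address.
    options = {"das.usedefaultisolationaddress": "false"}

    for index, address in enumerate(isolation_addresses):
        # Consideration 1: every isolation address must be on the vSAN network.
        if ipaddress.ip_address(address) not in network:
            raise ValueError(f"{address} is not on the vSAN network {network}")
        options[f"das.isolationaddress{index}"] = address
    return options


if __name__ == "__main__":
    # Hypothetical vSAN subnet and pingable addresses on it.
    for key, value in build_isolation_options(
        "192.168.10.0/24", ["192.168.10.1", "192.168.10.2"]
    ).items():
        print(f"{key} = {value}")
```

The resulting pairs would be entered as advanced options on the HA cluster; the isolation response (consideration 3) is a separate cluster setting and is not covered by this sketch.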

Page 16: vSphere High Availability: Recommended Practices for

Configure Host Isolation to “Power Off and Restart VMs”

Page 17: vSphere High Availability: Recommended Practices for

What If You Have a Stretched Cluster?

[Diagram: vSphere/vSAN stretched cluster with Site A and Site B (5ms RTT, 10GbE) and a 3rd site for the witness; labeled RAID-1 between the sites and RAID-6 within each site]

Page 18: vSphere High Availability: Recommended Practices for

What Happens in a Stretched Cluster During a Partition Then?

This is a partition, so the Isolation Response does not apply

APD/PDL response isn’t supported with vSAN

Yet, all VMs in the location without any connectivity will be killed!

vSAN has a mechanism for this (VSAN.AutoTerminateGhostVm)

[Diagram: vSphere/vSAN stretched cluster with Site A and Site B (5ms RTT, 10GbE) and a 3rd site for the witness; labeled RAID-1 between the sites and RAID-6 within each site]

Page 19: vSphere High Availability: Recommended Practices for

Carefully select the right isolation address and change the Host Isolation Response!

The third thing we learned!

Page 20: vSphere High Availability: Recommended Practices for

What About Those Heartbeat Datastores?

Behavior when vSAN network is isolated!

| Isolation Address | Datastore Heartbeats | Observed behavior |
| --- | --- | --- |
| IP on vSAN network | Not configured | Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted |
| Management network | Not configured | Can ping the isolation address, isolation not declared, yet the rest of the cluster restarts the VMs even though they are still running on the isolated hosts |
| IP on vSAN network | Configured | Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted |
| Management network | Configured | VMs are not powered off and not restarted, as the “isolated host” can still ping the management network and the datastore heartbeat mechanism is used to inform the master about the state. The master knows the HA network is not working, but the VMs are not powered off. |

Page 21: vSphere High Availability: Recommended Practices for

How Do I Disable Heartbeat Datastores?

Page 22: vSphere High Availability: Recommended Practices for

vSphere HA Features Not Specific to vSAN

But they shouldn’t be overlooked!

Page 23: vSphere High Availability: Recommended Practices for

Use Admission Control!

There are three algorithms to reserve resources

• Percentage based

• Slots based

• Failover host

Slot Policy: based on the highest reservation + overhead

• If you have a 32GB memory reservation, your slot size for memory is 32GB + memory overhead

• For CPU the minimum slot size is 32MHz

• Number of slots = total amount of resources / slot size

Percentage Policy: based on the reservation per VM

• Set aside a percentage of resources as failover capacity

– as of vSphere 6.5, defined as a number of hosts (host failures to tolerate)

• Reservations (and overhead) are subtracted from the total available resources

Especially when you have a stretched cluster, this is the only way to guarantee a restart!
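The slot and percentage arithmetic above can be worked through in a short sketch. The host sizes, the 200MB memory overhead, and the function names are hypothetical. Taking the smaller of the CPU and memory slot counts, and deriving the percentage as failures divided by host count, are assumptions that hold for equally sized hosts.

```python
def slots_per_host(host_cpu_mhz, host_mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Number of slots = total amount of resources / slot size.

    Computed per resource; the host can hold only as many slots as the
    scarcer resource allows (assumption: smaller of the two counts).
    """
    cpu_slots = host_cpu_mhz // slot_cpu_mhz
    mem_slots = host_mem_mb // slot_mem_mb
    return min(cpu_slots, mem_slots)


def failover_percentage(host_failures_to_tolerate, num_hosts):
    """Percentage of cluster resources to set aside.

    Assumption: all hosts are equally sized, so tolerating k of n hosts
    means reserving k/n of the cluster.
    """
    if not 0 < host_failures_to_tolerate < num_hosts:
        raise ValueError("failures to tolerate must be between 1 and n-1")
    return 100.0 * host_failures_to_tolerate / num_hosts


if __name__ == "__main__":
    # Slot policy: a 32GB memory reservation plus a hypothetical 200MB overhead.
    slot_mem_mb = 32 * 1024 + 200
    slot_cpu_mhz = 32  # minimum CPU slot size from the slide
    # Hypothetical host: 40 GHz of CPU and 512GB of memory.
    print("Slots per host:", slots_per_host(40_000, 512 * 1024, slot_cpu_mhz, slot_mem_mb))

    # Percentage policy: tolerate 1 host failure in a 4-host cluster.
    print("Reserved percentage:", failover_percentage(1, 4))
```

The slot example shows why a single large reservation hurts: one 32GB reservation sets the memory slot size for every VM, so the host ends up memory-bound on slots even though CPU would allow far more.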

Page 24: vSphere High Availability: Recommended Practices for

Admission Control

A bit more in-depth info about reserving resources

Admission Control is all about static values… but what about actual VM performance?

Page 25: vSphere High Availability: Recommended Practices for

Not Supported

Proactive HA

Page 26: vSphere High Availability: Recommended Practices for

One Last Thing to Remember: /var/log/fdm.log

Slave:

2018-03-12T12:08:30.398Z verbose fdm[2111414] Waited 5 seconds for isolation icmp ping reply. Isolated

2018-03-12T12:08:30.398Z info fdm[2111414] Host isolated is true

2018-03-12T12:09:00.399Z verbose fdm[2111421] [LocalIsolationPolicy::GetIsolationResponseInfo] Isolation response for VM /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx is powerOff

2018-03-12T12:09:35.403Z verbose fdm[2111421] [LocalIsolationPolicy::InitiateCreatePowerOffFiles] Creating power-off file for /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx.

2018-03-12T12:09:35.424Z verbose fdm[2111427] [LocalIsolationPolicy::DoVmPowerOff] Powering off /vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx

Master:

2018-03-12T12:07:51.708Z verbose fdm[2129221] Heartbeat still pending for slave @ host-10

2018-03-12T12:07:52.709Z verbose fdm[2129221] Forcing heartbeat check on datastore /vmfs/volumes/5a04393e-ff7b4ae1-dbb0-246e962f4910 for slave host-10

2018-03-12T12:07:59.718Z verbose fdm[2129221] Beginning ICMP pings every 1000000 microseconds to host-10

2018-03-12T12:08:02.725Z verbose fdm[2129221] Checking heartbeat datastore /vmfs/volumes/5a6f1dfb-65376d7c-a4af-246e962f4910 on localhost

2018-03-12T12:08:02.748Z verbose fdm[2129210] (VMFS) host-10 @ 24:6e:96:2f:49:10 is ALIVE

2018-03-12T12:09:33.864Z verbose fdm[2129221] ProcessSlavePowerOnListChanges host host-10 listVersion=2830482267 isolated=true poweredOnVms=0

2018-03-12T12:09:45.869Z verbose fdm[2129794] Issue failover start event for 2 Vms

2018-03-12T12:10:08.708Z verbose fdm[2129209] Failover operation in progress on 1 Vms: 0 VMs being restarted, 0 VMs waiting for a retry, 1 VMs waiting for resources, 0 inaccessible vSAN VMs.
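When reading fdm.log during troubleshooting, it can help to extract the timeline programmatically. The following is a minimal sketch that parses lines in the format shown on this slide and measures the time between the isolation declaration and the first VM power-off on the isolated host. The regular expression and function names are hypothetical and are based only on the sample lines above, not on a documented log format.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines on the slide (an assumption):
# <timestamp>Z <level> fdm[<id>] <message>
LINE_RE = re.compile(r"^(?P<ts>\S+Z) (?P<level>\w+) fdm\[(?P<id>\d+)\] (?P<msg>.*)$")


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
    return timestamp, match.group("level"), match.group("msg")


def seconds_from_isolation_to_power_off(lines):
    """Seconds between 'Host isolated is true' and the first DoVmPowerOff line."""
    isolated_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, _level, message = parsed
        if message == "Host isolated is true":
            isolated_at = timestamp
        elif "DoVmPowerOff" in message and isolated_at is not None:
            return (timestamp - isolated_at).total_seconds()
    return None


# Two lines copied from the slave log on the slide.
SAMPLE = [
    "2018-03-12T12:08:30.398Z info fdm[2111414] Host isolated is true",
    "2018-03-12T12:09:35.424Z verbose fdm[2111427] "
    "[LocalIsolationPolicy::DoVmPowerOff] Powering off "
    "/vmfs/volumes/5981ca3d-a9cb2e29-540a-246e962f4910/"
    "Clustering-Deep-Dive-01/Clustering-Deep-Dive-01.vmx",
]

if __name__ == "__main__":
    print("Seconds from isolation to power-off:",
          seconds_from_isolation_to_power_off(SAMPLE))
```

On a real host the same functions could be fed the lines of /var/log/fdm.log; wrapped or multi-line entries would need to be joined first.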

Page 27: vSphere High Availability: Recommended Practices for

Summarizing…

Page 28: vSphere High Availability: Recommended Practices for

vSphere HA uses the vSAN Network when vSAN is enabled

When vSAN is the only storage platform, there’s no need for Heartbeat Datastores

The Isolation Address needs to be an address on the vSAN Network

Disable the use of the Default Isolation Address

Add isolation addresses using das.isolationaddress[0-9]

The Isolation Response needs to be set to “power off and restart VMs”

APD/PDL response should be configured to “Disabled”

vSAN has a “kill” mechanism for partition scenarios in a stretched environment!

Proactive HA is not supported!

What Did We Learn in the Past 30 Minutes?
