

Technical Report

Implementation Guide: A Continuous Availability Solution for VMware vSphere and NetApp Storage Using VMware High Availability, Fault Tolerance, vCenter Server Heartbeat, and NetApp Stretch MetroCluster VEABU, NetApp May 2010 | TR-3854

EXECUTIVE SUMMARY This document provides a step-by-step guide to implementing a continuous availability solution for virtual data centers consisting of VMware® vSphere™ and NetApp® storage using VMware High Availability, Fault Tolerance, and NetApp stretch MetroCluster.


TABLE OF CONTENTS

1 INTRODUCTION
1.1 INTENDED AUDIENCE
1.2 SCOPE
1.3 ASSUMPTIONS AND PREREQUISITES
2 BACKGROUND
3 HIGH-LEVEL SOLUTION OVERVIEW AND COMPONENTS USED
3.1 SOFTWARE AND HARDWARE COMPONENTS
3.2 BILL OF MATERIALS USED IN THE LAB SETUP
4 INSTALLATION AND CONFIGURATION OF NETAPP STRETCH METROCLUSTER
4.1 STRETCH METROCLUSTER EQUIPMENT USED IN THE LAB SETUP
4.2 CABLING THE STRETCH METROCLUSTER SETUP
4.3 STRETCH METROCLUSTER CONFIGURATION IN SITE 1
4.4 STRETCH METROCLUSTER CONNECTION IN SITE 2
5 VMWARE VSPHERE 4 INSTALLATION AND CONFIGURATION
6 SETUP AND CONFIGURATION OF THE HIGH-AVAILABILITY SOLUTION
6.1 NETAPP STORAGE PROVISIONING
6.2 VMWARE ESX SERVER CONFIGURATION
6.3 VMWARE HA CLUSTER CONFIGURATION
6.4 SETUP AND CONFIGURATION OPTIONS FOR VCENTER SERVER
6.5 SETUP AND CONFIGURATION OF VMWARE FAULT TOLERANCE (FT)
6.6 TESTING THE CONTINUOUS AVAILABILITY SOLUTION IN DIFFERENT FAILURE SCENARIOS
7 CONCLUSION
APPENDIX: REFERENCES


1 INTRODUCTION
This document provides step-by-step guidance on how to implement a production-class, continuously available VMware vSphere and NetApp virtual infrastructure solution across data centers located within a campus (within 500 meters).

It is a companion document to the following technical report, which provides the overview of this solution:

TR-3788: “A Continuous Availability Solution for Virtual Infrastructure.”

1.1 INTENDED AUDIENCE
This document is for:

• Customers looking for implementation details regarding a continuous availability solution for their virtual infrastructure consisting of VMware vSphere and NetApp storage

• Administrators or professional services engineers who are responsible for architecting and deploying successful NetApp MetroCluster high-availability and disaster recovery configurations in a virtualization environment

1.2 SCOPE
What this document describes:

• Detailed implementation planning, installation, configuration, and operation of NetApp stretch MetroCluster
• Implementation of the VMware High Availability (HA), Fault Tolerance (FT), and vCenter Server Heartbeat solution for virtual infrastructure

Not in scope:

• This report does not replace any official manuals and documents from NetApp and VMware on the products used in the solution.

• This report does not discuss any performance impact and analysis from an end user perspective during a disaster.

• This report does not replace NetApp and VMware professional services documents or services.

• This report does not discuss a regional (long-distance) disaster recovery solution.

1.3 ASSUMPTIONS AND PREREQUISITES

Note

This implementation guide is the companion document to TR-3788, so it is assumed that you have read that technical report and are familiar with the continuous availability solutions it describes. If you are not, it is strongly recommended that you read TR-3788 before continuing with this TR.

This document assumes familiarity with the following:

• Basic knowledge of VMware virtualization technologies and products: VMware vCenter Server 4.0 and VMware vSphere4

• Basic knowledge of NetApp FAS systems and Data ONTAP®

For a complete list of documents that provide background on the technologies used in this solution, see Appendix: References.

2 BACKGROUND
The continuous availability solution for virtual infrastructure consisting of VMware vSphere and NetApp storage can have different tiers of protection, as illustrated in Figure 1.


Figure 1) Different tiers of virtual infrastructure protection with NetApp storage.

Table 1 summarizes various scenarios with the corresponding VMware and NetApp components.

Table 1) Continuous availability solution scenarios with the corresponding components.

1. Data-center-level protection
   VMware components (vSphere 4.0): VMware HA, VMware FT, vCenter Server Heartbeat
   NetApp component: NetApp active-active cluster (with or without SyncMirror®)
   Scope of protection: Complete protection against common server and storage failures, including but not limited to failure of the physical ESX server, power supplies, disk drives, disk shelves, cables, storage controllers, and so on

2. Cross-campus-level protection
   VMware components (vSphere 4.0): VMware HA, VMware FT, vCenter Server Heartbeat
   NetApp component: NetApp stretch MetroCluster
   Scope of protection: VMware HA cluster nodes and the NetApp FAS controllers located in different buildings within the same site (up to 500m); can handle building-level disasters in addition to the protections provided in tier 1

3. Metro (site-level) distance protection
   VMware components (vSphere 4.0): VMware HA, vCenter Server Heartbeat
   NetApp component: NetApp fabric MetroCluster
   Scope of protection: VMware HA cluster nodes and the NetApp FAS controllers located at different regional sites (up to 100km); can handle site-level disasters in addition to the protections provided in tier 1

4. Regional protection
   VMware component (vSphere 4.0): VMware SRM
   NetApp component: NetApp SnapMirror®

Note: The focus of this implementation guide is the "cross-campus-level protection" scenario of the continuous availability solution, which can handle building-level site disasters in addition to virtual infrastructure component failures. The products used in this solution are:

• VMware vSphere 4 or later, VMware HA, VMware FT, and VMware vCenter Server Heartbeat
• NetApp storage system: Data ONTAP 7.3.3, NetApp stretch MetroCluster

3 HIGH-LEVEL SOLUTION OVERVIEW AND COMPONENTS USED
Figure 2 illustrates the continuous availability solution that will be implemented across two data centers or sites inside a campus (within a 500-meter distance).

Figure 2) VMware HA, FT, and NetApp stretch MetroCluster solution in VMware vSphere 4.


3.1 SOFTWARE AND HARDWARE COMPONENTS

Table 2) Software and hardware components.

Required equipment for stretch MetroCluster:

• Storage system type: Two of the same type of NetApp storage systems.
• Storage system configuration: See the System Configuration Guide at http://now.netapp.com/NOW/knowledge/docs/hardware/NetApp/syscfg.
• Cluster interconnect adapter: IB cluster adapter (required only for systems that do not use an NVRAM5 or NVRAM6 adapter, which functions as the cluster interconnect adapter), or FC-VI adapter (required only for the FAS3140 and FAS3170 dual-controller systems). Note: When the FC-VI adapter is installed in a FAS3140 or FAS3170 system, the internal InfiniBand interconnects are automatically deactivated.
• FC-AL or FC HBA (FC HBA for disk) adapters: Two or four Fibre Channel HBAs. These HBAs are required for 4Gbps MetroCluster operation; onboard ports can be used for 2Gbps operation. Note: The ports on the Fibre Channel HBAs are labeled 1 and 2, but the software refers to them as A and B. You see these labeling conventions in the user interface and in system messages displayed on the console.
• Cables: Four SC/LC (standard connector to low-profile connector) controller-to-disk shelf cables, two SC/LC IB cluster adapter cables, and four SC/LC or LC/LC cables. Note: For information about required cables, see Best Practices for MetroCluster Design and Implementation and the MetroCluster Upgrade Planning Guide.
• Licenses: SyncMirror_local, Cluster_remote, Cluster, NFS (for VMware datastore), and iSCSI (for VMware datastore).

3.2 BILL OF MATERIALS USED IN THE LAB SETUP

Table 3) Materials used in the setup.

• Server: IBM, quantity 4. IBM x3550 servers with Intel® Xeon® E5420 processors (Intel VT); CPU: 50 GHz total in the cluster; memory: 70GB total in the cluster.
• Storage: NetApp FAS3170 with RAID-DP®.
• Switch (front-end SAN): Brocade, quantity 4. Brocade switch model 3800.
• Network adapter: Broadcom, 4 per server. Broadcom NetXtreme II BCM5708 1000Base-T.
• HBA: QLogic, 2 per server. QLogic QLA2432.
• Software: NetApp Data ONTAP 7.3.3; NetApp Cluster, cluster_remote, and SyncMirror_local licenses; VMware vSphere 4 Enterprise Plus; VMware vCenter Server 4.0; VMware vCenter Server Heartbeat 5.5 Update 1.

4 INSTALLATION AND CONFIGURATION OF NETAPP STRETCH METROCLUSTER

4.1 STRETCH METROCLUSTER EQUIPMENT USED IN THE LAB SETUP
Table 4 lists the equipment used for the stretch MetroCluster deployment for this solution.

Table 4) Equipment list for configuring stretch MetroCluster.

Required equipment for stretch MetroCluster:

• Storage system: Two FAS3170 storage systems installed with Data ONTAP 7.3.3 or later.
• Storage system configuration: See the System Configuration Guide at http://now.netapp.com/NOW/knowledge/docs/hardware/NetApp/syscfg.
• Cluster interconnect adapter: 2 (FC-VI adapters).
• Disk shelves: 4 (ESH4 disk shelves).
• Cables: 8 (OM-2, 50/125 µm, run at 2Gb/s). Supported fiber types and maximum distances:
  OM-2 (50/125 µm): 500m at 1Gb/s, 300m at 2Gb/s, 150m at 4Gb/s
  OM-3 (50/125 µm): 500m at 2Gb/s, 270m at 4Gb/s
• Licenses: SyncMirror_local, Cluster_remote, Cluster, NFS (for VMware datastore), and iSCSI (for VMware datastore).


4.2 CABLING THE STRETCH METROCLUSTER SETUP
Cable the stretch MetroCluster setup so that each controller can access its own storage and its partner's storage, with local storage mirrored at the partner site. Figure 3 illustrates the cabling of the stretch MetroCluster setup used. The cabling configuration involves the following steps:

STEP 1: CONNECT THE METROCLUSTER INTERCONNECT BETWEEN THE TWO CONTROLLERS
Site 1 (NetApp FAS3170) to site 2 (NetApp FAS3170)

STEP 2: CONNECT EACH CONTROLLER TO ITS OWN STORAGE
Site 1 (NetApp FAS3170) to pool 0/site 1/channel A
Site 1 (NetApp FAS3170) to pool 1/site 1/channel A
Site 2 (NetApp FAS3170) to pool 0/site 2/channel A
Site 2 (NetApp FAS3170) to pool 1/site 2/channel A

STEP 3: CONNECT EACH CONTROLLER TO ITS PARTNER'S STORAGE
Site 1 (NetApp FAS3170) to pool 0/site 2/channel B
Site 2 (NetApp FAS3170) to pool 0/site 1/channel B

STEP 4: CONNECT EACH CONTROLLER TO THE MIRRORS OF ITS STORAGE
Site 1 (NetApp FAS3170) to pool 1/site 2/channel B
Site 2 (NetApp FAS3170) to pool 1/site 1/channel B

Figure 3) Stretch MetroCluster connection diagram.


4.3 STRETCH METROCLUSTER CONFIGURATION IN SITE 1

STEP 1: VERIFY CONTROLLER LICENSES
Perform the following steps on the controller in site 1.
1. Using telnet or SSH, connect to the controller.
2. Log in as root using the appropriate credentials.
3. Check for the following licenses: Cluster, Cluster_remote, and SyncMirror_local.
4. If any of the preceding are not licensed, add the required licenses in the following order: Cluster, SyncMirror_local, Cluster_remote.
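A minimal console sketch of this check, assuming placeholder license codes (substitute the codes supplied with your purchased licenses):

controller1> license                    # list the licenses that are currently installed
controller1> license add XXXXXXX        # add the cluster license (placeholder code)
controller1> license add XXXXXXX        # add the syncmirror_local license (placeholder code)
controller1> license add XXXXXXX        # add the cluster_remote license (placeholder code)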

STEP 2: VERIFY CONNECTION
Confirm that the disks are visible and have dual paths by entering the following command on the console:

storage show disk -p

The output shows each disk connected to a primary and a secondary port:

Controller1> storage show disk -p
PRIMARY  PORT  SECONDARY  PORT  SHELF  BAY
0a.32    B     0b.32      A     2      0
0a.33    B     0b.33      A     2      1

If redundant paths are not shown for each disk, recheck the cabling.

STEP 3: DISK OWNERSHIP
In a stretch MetroCluster configuration, in which the disk shelves on each side are mirrored to the other side and are thus accessible by either controller, disk ownership comes into play. There are two methods of establishing disk ownership: hardware based and software based. Hardware-based ownership is the default for the FAS900 series and the FAS3020/3050. All other platforms (FAS6000 series, FAS3040/3070, FAS31xx) use software disk ownership. Hardware disk ownership establishes which controller owns which disks by how the shelves are connected. For more information, see the System Configuration Guide: http://now.netapp.com/NOW/knowledge/docs/hardware/NetApp/syscfg.

Table 5) Hardware disk ownership.

• FAS9XX: Slots 2, 3, 4, 5, and 7 are pool 0; slots 8, 9, and 10 are pool 1. Optional software-based ownership and pool selection (as of Data ONTAP 7.1.1; stretch MetroCluster only).
• FAS3020/3050: 0a, 0b, and slots 1 and 2 are pool 0 (slot 1 is usually NVRAM); 0c, 0d, and slots 3 and 4 are pool 1. Optional software-based ownership and pool selection (as of Data ONTAP 7.1.1; stretch MetroCluster only).
• FAS3040/3070: Software-based ownership and pool selection.
• FAS31xx: Software-based ownership and pool selection.
• FAS60xx: Software-based ownership and pool selection.

Software-based disk ownership in a MetroCluster requires a different configuration from systems that use hardware-based disk ownership. Disks are assigned with Data ONTAP software commands, or are auto-assigned by the software, because disk ownership is determined by the software rather than by the physical cabling of the shelves.

To assign disk ownership from the Data ONTAP command line:
1. Use the disk show -n command to view all disks that do not have assigned owners.
2. Use the following command to assign the disks that are labeled "Not Owned" to one of the system controllers:

disk assign {disk_name|all|-n count|auto} [-p pool] [-o ownername] [-s sysid] [-c block|zoned] [-f]

• disk_name specifies the disks that you want to assign to the system.
• all assigns all of the unowned disks to the system.
• -n count specifies the number of unassigned disks to be assigned to the system.
• auto causes disk auto-assignment to be performed.
• -p pool specifies which SyncMirror pool the disks are assigned to; the value is either 0 or 1. Note: Unassigned disks are not associated with a pool. To assign disks to a different pool, use the -f option. However, moving individual disks between pools can result in the loss of redundancy and can cause disk auto-assignment to be disabled for that loop, so if possible move all disks on that loop to the other pool.
• -o ownername specifies, by name, the system that the disks are assigned to.
• -s sysid specifies, by system ID, the system that the disks are assigned to.
• -c specifies the checksum type (block or zoned) for a LUN in V-Series systems.
• -f must be specified if a system already owns the disk or if you want to assign a disk to a different pool.

Enter man disk or see the Data ONTAP Installation and Administration Guide for details.
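As a brief sketch of the overall flow, assuming hypothetical disk names on a local loop that should land in pool 0:

controller1> disk show -n                      # list disks that are still unowned
controller1> disk assign 0a.16 0a.17 -p 0      # assign two example disks to this controller's pool 0 (disk names are placeholders)
controller1> disk show -v                      # verify owner, system ID, and pool for each disk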


Configuring software-based disk ownership:

Figure 4) Configuring software-based disk ownership.

1. Connect the new disk shelves (site 1 shelf 1 and site 2 shelf 1) and then reboot the controllers. After the reboot, the disks on these shelves are assigned to their respective controllers.

The ownership details are:

Controller   Sysid
Site 1       118061356
Site 2       118061310


2. Connect the other two disk shelves named site 2 shelf 1 mirror and site 1 shelf 1 mirror as shown in Figure 4 and check the disk shelves’ status.

3. Assign ownership of all the disks on disk shelf "site 1 shelf 1 mirror" to the site 1 controller.

Note:

a. The disk shelf "site 1 shelf 1 mirror" is connected to the site 1 controller through FC port 0d, which is why "0d" precedes the disk number in the following command.
b. The following command assigns ownership to the site 1 controller and creates pool 1.
c. Use Maintenance mode to perform this activity.

Example: *> disk assign 0d.32 -s 118061356 -p 1

4. Follow the same procedure (as in step 3) for the other disks connected to FC port 0d on the site 1 controller.

5. All the disks in the disk shelf "site 1 shelf 1 mirror" that are connected to FC port 0d have now changed ownership and are owned by the site 1 controller.

The mirror relationship has now been established between "site 1 shelf 1 pool 0" and "site 1 shelf 1 mirror pool 1," as shown in Figure 4.

*> disk show -v
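Because disk assign accepts a list of disk names, the remaining 0d disks can be assigned in a single command. A sketch using the system ID from this setup and hypothetical disk numbers for the rest of the shelf:

*> disk assign 0d.33 0d.34 0d.35 0d.36 -s 118061356 -p 1   # remaining disk names are placeholders
*> disk show -v                                            # confirm every disk on the shelf now shows pool 1 and the site 1 system ID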


6. Follow a similar disk assignment procedure for the disk shelf "site 2 shelf 1 mirror" to assign it to the site 2 controller.

a. The disk shelf "site 2 shelf 1 mirror" is connected to the site 2 controller through FC port 0b, which is why "0b" precedes the disk number in the following command.
b. The following command assigns ownership to the site 2 controller and creates pool 1.
c. Use Maintenance mode to perform this activity.

Example: *> disk assign 0b.22 -s 118061310 -p 1

7. Follow step 6 for the other disks of the disk shelf "site 2 shelf 1 mirror."

8. All the disks in the disk shelf "site 2 shelf 1 mirror" have now changed their ownership to the site 2 controller and are placed in pool 1.

9. The mirror relationship has now been established between "site 2 shelf 1 pool 0" and "site 2 shelf 1 mirror pool 1," as shown in Figure 4.

*> disk show -v


Note: The preceding procedures create pool 1 and assign disk ownership to the controllers in a stretch MetroCluster (SMC) environment.

STEP 4: SET UP MIRRORS
To ensure highly available access to data, all data on each node must be mirrored to the other node using SyncMirror. Although it is possible to have data on one node that is not mirrored on the other, NetApp does not recommend it. Keep in mind that mirrors can exist only between like drive types (FC to FC or ATA to ATA in the case of stretch MetroCluster).

With the SyncMirror license installed in step 1, disks are divided into pools. When a mirror is created, Data ONTAP pulls disks from pool 0 for the primary data and from pool 1 for the mirror. The selection of which disks to use for the mirror can be left up to Data ONTAP or chosen specifically.

It is important to verify the correct number of disks in each pool before creating the mirrored aggregate or traditional volume.

Note: For creating new aggregates and volumes, see section 6.

Any of these commands can be used to verify the number of drives in each pool:

sysconfig -r (gives the broadest information)
aggr status -r
vol status -r

Once the pools are verified, you can create the mirrors with one of the following:


{aggr | vol} create {aggrname | volname} -m ndisks[@disk-size]

For example, the command aggr create aggrA -m 6 creates a mirrored aggregate called aggrA with six drives (three for plex 0, three for plex 1).

An already existing aggregate can be mirrored using the following command:

{aggr | vol} mirror {aggrname | volname}

For example, the command aggr mirror aggr0 will mirror the already existing aggregate aggr0.

In both of the preceding cases, Data ONTAP is allowed to choose the specific drives to be used.
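If you prefer to choose the drives yourself rather than letting Data ONTAP pick them, the mirrored create command can take one -d disk list per plex (the first list comes from pool 0, the second from pool 1). A minimal sketch with hypothetical disk names:

controller1> aggr create aggrA -m -d 0a.16 0a.17 0a.18 -d 0c.16 0c.17 0c.18   # first -d list: plex 0 disks; second -d list: mirror plex disks
controller1> aggr status -r aggrA                                             # confirm that both plexes are present and their RAID groups are laid out as intended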

4.4 STRETCH METROCLUSTER CONNECTION IN SITE 2

VERIFY CONTROLLER LICENSES Follow the same procedure as performed at site 1.

VERIFY CONNECTIONS Follow the same procedure as performed at site 1.

DISK OWNERSHIP Follow the same procedure as performed at site 1.

SET UP MIRRORS Follow the same procedure as performed at site 1.

5 VMWARE VSPHERE 4 INSTALLATION AND CONFIGURATION
Installation and configuration of the VMware vSphere infrastructure for this solution setup was done in accordance with the ESX and vCenter Server Installation Guide. Similarly, the VMware vCenter Server was installed following the procedure described in Installing vCenter Server.

For this solution setup, the vCenter Server was installed inside a Microsoft® Windows® virtual machine, which runs on an ESX host of the VMware HA cluster.

Deploying the vCenter Server system in the virtual machine has the following advantages:

• Rather than dedicating a separate server to the vCenter Server system, you can place it in a virtual machine running on the same ESX host where other virtual machines run.

• You can provide high availability for the vCenter Server system by using VMware HA.

• You can migrate the virtual machine containing the vCenter Server from one host to another, enabling maintenance and other activities.

• You can create Snapshot™ copies of the vCenter Server virtual machine and use them for backups, archiving, and so on.


6 SETUP AND CONFIGURATION OF THE HIGH-AVAILABILITY SOLUTION

6.1 NETAPP STORAGE PROVISIONING

Note

See the “NetApp and VMware vSphere Storage Best Practices” document for the detailed description of the steps and the best practices to be followed before implementing your environment.

6.2 VMWARE ESX SERVER CONFIGURATION
The controller at site 1 has one volume (vsphereftiscsi) to house the iSCSI LUN-based active datastores. Another volume (vphereftnfs) contains the NFS export for the VMware NFS datastore.

The controller at site 2 has one volume (vsphereftiscsi_1) to house the iSCSI LUN-based active datastores. Another volume (vspherenfs_1) contains the NFS export for the VMware NFS datastore.

The iSCSI LUNs created are shown in Figures 5 and 6; sizes were chosen arbitrarily for these tests. The LUNs were assigned to an igroup called ftigroup_21 on the controller at site 1 and ftigroup_35 on the controller at site 2, each containing the iSCSI IQNs of all servers in the VMware HA cluster.
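A minimal Data ONTAP 7-mode console sketch of the site 1 provisioning described above; the aggregate name, sizes, initiator IQN, and ESX host names are placeholders, and the equivalent steps apply on the site 2 controller:

site1> vol create vsphereftiscsi aggr1 200g                  # volume that houses the iSCSI LUN (aggregate and size are placeholders)
site1> lun create -s 100g -t vmware /vol/vsphereftiscsi/lun0 # LUN backing the PROD1_ISCSI datastore
site1> igroup create -i -t vmware ftigroup_21 iqn.1998-01.com.vmware:esx01   # add one IQN per ESX host in the HA cluster
site1> lun map /vol/vsphereftiscsi/lun0 ftigroup_21          # map the LUN to the igroup
site1> vol create vphereftnfs aggr1 200g                     # volume exported over NFS for PROD1_NFS
site1> exportfs -p rw=esx01:esx02,root=esx01:esx02 /vol/vphereftnfs          # export to the ESX hosts (host names are placeholders)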

Figure 5) Site 1 FAS controller LUN configuration.

Figure 6) Site 2 FAS controller LUN configuration.


ISCSI AND LUN SETUP
For the ESX server to see the LUNs created, one or more steps are necessary. If the iSCSI storage adapter is already running, all that may be necessary is to tell the iSCSI storage adapter to rescan for devices. After a few moments, the new LUNs are displayed.
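A service console sketch of this step on classic ESX 4, assuming the software iSCSI initiator is used and that vmhba33 is its adapter name (the name varies by host); the NetApp target's discovery address is added under the adapter's Dynamic Discovery tab in the vSphere Client:

[root@esx01 ~]# esxcfg-swiscsi -e       # enable the software iSCSI initiator (harmless if already enabled)
[root@esx01 ~]# esxcfg-swiscsi -q       # confirm that the initiator is enabled
[root@esx01 ~]# esxcfg-rescan vmhba33   # rescan the iSCSI adapter so the new LUNs appear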

Figure 7) iSCSI adapter and LUNs.

DATASTORE
Four datastores were created for the purposes shown in Table 6.

Table 6) Datastore details for site 1 and site 2.

• PROD1_ISCSI: Site 1 storage for VMs (iSCSI LUNs)
• PROD1_NFS: Site 1 storage for VMs (NFS)
• PROD2_ISCSI: Site 2 storage for VMs (iSCSI LUNs)
• PROD2_NFS: Site 2 storage for VMs (NFS)

Once the datastores were created, visibility was verified in all the ESX servers in the HA cluster, as shown in Figure 8.
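A service console sketch of mounting the NFS exports as datastores on each ESX host, assuming hypothetical controller host names and that the export paths match the volume names in this setup:

[root@esx01 ~]# esxcfg-nas -a -o fas3170-site1 -s /vol/vphereftnfs PROD1_NFS    # mount the site 1 NFS export as PROD1_NFS
[root@esx01 ~]# esxcfg-nas -a -o fas3170-site2 -s /vol/vspherenfs_1 PROD2_NFS   # mount the site 2 NFS export as PROD2_NFS
[root@esx01 ~]# esxcfg-nas -l                                                   # list NFS datastores to confirm both are mounted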


Figure 8) Site 1 and site 2 datastores’ visibility to each ESX server in the cluster.

VIRTUAL MACHINES
Ten virtual machines were then created. All VMs ran Windows 2003 and were stored on the respective datastores from site 1 and site 2.

• Four of the VMs used an iSCSI LUN (PROD1_ISCSI and PROD2_ISCSI) from site 1 and site 2, respectively.
• Four of the VMs used NFS volumes (PROD1_NFS and PROD2_NFS) from site 1 and site 2, respectively.
• Two of the VMs were configured as vCenter Server Heartbeat servers and used the NFS datastores from site 1 and site 2, respectively.
• Some of the VMs were configured for VMware FT, as shown in Figure 9.

Figure 9) VMs and FT VMs placed across different ESX servers in the HA cluster.

Figure 10 shows the completed production site setup (VMs and datastores in site 1 and site 2). This site includes the datastores, ESX servers, VMs, FT VMs, and vCenter Server Heartbeat server (primary and secondary).


Figure 10) Map showing ESX servers, datastores, and VMs in site 1 and site 2.

6.3 VMWARE HA CLUSTER CONFIGURATION
VMware HA leverages multiple ESX/ESXi hosts configured as a cluster to provide rapid recovery from outages and cost-effective high availability for applications running in virtual machines. VMware HA protects application availability in two ways:

• It protects against a server failure by automatically restarting the virtual machines on other hosts within the cluster.
• It protects against application failure by continuously monitoring a virtual machine and resetting it in the event that a failure is detected.

Unlike other clustering solutions, VMware HA provides the infrastructure to protect all workloads within the infrastructure:

• No special software needs to be installed within the application or virtual machine. All workloads are protected by VMware HA. After VMware HA is configured, no actions are required to protect new virtual machines; they are automatically protected.
• VMware HA can be combined with VMware Distributed Resource Scheduler (DRS) not only to protect against failures but also to provide load balancing across the hosts within a cluster.

VMware HA has a number of advantages over traditional failover solutions:

• Minimal setup: After a VMware HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.

• Reduced hardware cost and setup: The virtual machine acts as a portable container for the applications, and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use VMware HA, you must have sufficient resources to fail over the number of hosts you want to protect with VMware HA. However, the vCenter Server system automatically manages resources and configures clusters.

• Increased application availability: Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and resetting nonresponsive virtual machines, VMware HA also protects against guest operating system crashes.

• DRS and VMotion® integration: If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, VMware HA can help recover from that failure.

CONFIGURING THE VMWARE HA CLUSTER
A VMware HA cluster enables a collection of ESX/ESXi hosts to work together so that, as a group, they provide higher levels of availability for virtual machines than each ESX/ESXi host could provide individually.

Best practice: When adding ESX hosts to a VMware HA cluster, the first five hosts added are considered primary hosts. The remaining hosts added are considered secondary hosts. Primary HA nodes hold node state information, which is synchronized between primary nodes and from the secondary nodes. To make sure that each site contains more than one primary HA node, the first five nodes added to the HA cluster should be added one at a time, alternating between sites. The sixth node and all remaining nodes can then be added in one operation.

VMware ESX hosts and NetApp FAS controller network ports are connected to the same subnet that is shared between site 1 and site 2.

The VMware ESX host’s FC HBA should be connected to the same fabric that is shared between site 1 and site 2.

To learn more about the installation and configuration of VMware HA, see the vSphere Availability Guide.

Figure 11 shows the VMware HA cluster with all the virtual machines associated with each ESX server.


Figure 11) VMs and FT VMs associated with their respective ESX servers.

6.4 SETUP AND CONFIGURATION OPTIONS FOR VCENTER SERVER

OPTION 1
The VMware vCenter Server runs inside a virtual machine (non-FT) in the HA cluster.

Another way of designing the vCenter Server deployment is to place it on a physical MSCS cluster with an MSCS cluster node in each site. If the storage housing the vCenter MSCS instance is at the failed site, it is necessary to perform the NetApp CFOD recovery: first recover the MSCS cluster and start vCenter, and then continue with the recovery process. For details on deploying vCenter Server with an MSCS cluster, see www.vmware.com/pdf/VC_MSCS.pdf.

OPTION 2 (BEST PRACTICE): VMWARE VCENTER SERVER HEARTBEAT
VMware vCenter Server manages multiple tier 1 applications, which makes vCenter Server itself a tier 1 application. It therefore needs to be made highly available, and that is where VMware vCenter Server Heartbeat comes in.

In this setup, the VMware vCenter Server Heartbeat (primary and secondary) runs inside a virtual machine (non-FT) in the HA cluster. Table 7 shows the various options to keep the vCenter Server and vCenter Server Heartbeat in a VMware HA and NetApp MetroCluster environment.

Table 7) vCenter server as a VM inside the HA cluster with vCenter Server Heartbeat.

Location of vCenter Server and its datastore:

• vCenter Server on any ESX server in site 2; datastore from the NetApp FAS controller in site 2
• vCenter Server on any ESX server in site 2; datastore from the NetApp FAS controller in site 1
• vCenter Server on any ESX server in site 1; datastore from the NetApp FAS controller in site 1
• vCenter Server on any ESX server in site 1; datastore from the NetApp FAS controller in site 2
• vCenter Server on any ESX server in site 1 with vCenter Server Heartbeat implemented; the primary vCenter Server uses the controller in site 1 and the secondary vCenter Server uses the controller in site 2
• vCenter Server on any ESX server in site 2 with vCenter Server Heartbeat implemented; the primary vCenter Server uses the controller in site 2 and the secondary vCenter Server uses the controller in site 1


VCENTER SERVER HEARTBEAT INSTALLATION AND CONFIGURATION

VMware vCenter Server Heartbeat provides availability and resiliency for VMware vCenter Server.

• Protects VMware vCenter Server availability by monitoring all components of VMware vCenter Server, including VMware License Server and other plug-ins
• Minimizes downtime of critical functions such as VMware VMotion and VMware DRS
• Protects VMware vCenter Server performance, alerts, and events information, keeping it up to date even if VMware vCenter Server experiences an outage
• Provides automatic failover and failback of VMware vCenter Server
• Enables administrators to schedule maintenance windows and maintain availability by initiating a manual switchover to the standby server
• Protects and recovers the VMware vCenter Server database
• Protects critical configuration, inventory, and other information stored in the VMware vCenter Server database, even if the database is installed on a separate server

Figure 12) vCenter Server Heartbeat deployment.

Architecturally, vCenter Server Heartbeat is implemented on active/passive vCenter server clones, running on physical or virtual machines. In addition to server and network hardware, vCenter Server Heartbeat monitors the actual vCenter server instance, its back-end database, and the underlying operating system. In the case of failure, the passive node takes over and the vCenter Server Heartbeat software restarts the vCenter service. Failover can occur over both LANs and WANs. To learn more about the installation and configuration of VMware vCenter Server Heartbeat, see VMware vCenter Server Heartbeat.


6.5 SETUP AND CONFIGURATION OF VMWARE FAULT TOLERANCE (FT)
VMware HA provides a base level of protection for your virtual machines by restarting them in the event of a host failure. VMware Fault Tolerance provides a higher level of availability, allowing users to protect any virtual machine from a host failure with no loss of data, transactions, or connections. VMware Fault Tolerance provides continuous availability for virtual machines by creating and maintaining a secondary VM that is identical to, and continuously available to replace, the primary VM in the event of a failover situation.

Fault Tolerance uses VMware vLockstep technology on the ESX/ESXi host platform to provide continuous availability. This is done by making sure that the states of the primary and secondary VMs are identical at any point in the instruction execution of the virtual machine. vLockstep accomplishes this by having the primary and secondary VMs execute identical sequences of x86 instructions. The primary VM captures all inputs and events, from the processor to virtual I/O devices, and replays them on the secondary VM. The secondary VM executes the same series of instructions as the primary VM, while only a single virtual machine image (the primary VM) is seen executing the workload. See VMware Fault Tolerance Recommendations and Considerations on VMware vSphere 4 for details.

To enable VMware Fault Tolerance, the VMware HA cluster must meet a few prerequisites. Complete the following tasks before attempting to enable Fault Tolerance:
• Enable host certificate checking (if upgrading from a previous version of Virtual Infrastructure).
• Configure networking (FT logging) for each host; a command-line sketch of this step follows this list.
• Create the VMware HA cluster, add hosts, and check compliance.
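A service console sketch of the FT logging network step on a classic ESX 4 host; the vSwitch name, port group name, and IP addressing are placeholders, and the Fault Tolerance Logging check box on the resulting VMkernel port is then enabled in the vSphere Client:

[root@esx01 ~]# esxcfg-vswitch -A FT_Logging vSwitch1                           # add a port group for FT logging traffic
[root@esx01 ~]# esxcfg-vmknic -a -i 192.168.10.21 -n 255.255.255.0 FT_Logging   # create a VMkernel NIC on that port group
[root@esx01 ~]# esxcfg-vmknic -l                                                # confirm the new VMkernel interface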

VMware Fault Tolerance can be turned on from vCenter Server using an account with cluster administrator permissions:
1. Select the Hosts & Clusters view.
2. Right-click a virtual machine and select Fault Tolerance > Turn On Fault Tolerance.

The specified virtual machine is designated as the primary VM, and a secondary VM is established on another host. The primary VM is now fault tolerant. To learn more about the installation and configuration of vSphere HA and FT, refer to the vSphere Availability Guide.

6.6 TESTING THE CONTINUOUS AVAILABILITY SOLUTION IN DIFFERENT FAILURE SCENARIOS

For details on testing the continuous availability solution in different failure scenarios, refer to A Continuous Availability Solution for Virtual Infrastructure.

7 CONCLUSION
VMware HA, FT, vCenter Server Heartbeat, and NetApp MetroCluster technologies work in synergy to provide a simple and robust continuous availability solution for planned and unplanned downtime in virtual data center environments. Planned site and component failovers, at both the server and storage levels, can be triggered without disrupting the environment, allowing scheduled maintenance without downtime. Similarly, this solution delivers complete protection against unplanned server and storage failures, including failure of the physical ESX server, NetApp storage controller, power supplies, disk drives, disk shelves, cables, and so on. Each failure scenario described earlier showcases the value of deploying VMware HA, FT, vCenter Server Heartbeat, and NetApp MetroCluster to recover from these failures.


APPENDIX: REFERENCES

Data ONTAP 7.3 Active/Active Configuration Guide: http://now.netapp.com/NOW/knowledge/docs/ontap/rel7311/pdfs/ontap/aaconfig.pdf

MetroCluster Design and Implementation Guide: http://media.netapp.com/documents/tr-3548.pdf

Active-Active Configuration Best Practices: http://media.netapp.com/documents/tr-3450.pdf

NetApp and VMware vSphere Storage Best Practices: http://media.netapp.com/documents/tr-3749.pdf

VMware vSphere Availability Guide: www.vmware.com/pdf/vsphere4/r40/vsp_40_availability.pdf

VMware Fault Tolerance Recommendations and Considerations on VMware vSphere 4: www.vmware.com/files/pdf/fault_tolerance_recommendations_considerations_on_vmw_vsphere4.pdf

NetApp provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer’s responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.

© Copyright 2010 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. NetApp, the NetApp logo, Go further, faster, Data ONTAP, RAID-DP, SnapMirror, Snapshot, and SyncMirror are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. Microsoft and Windows are registered trademarks of Microsoft Corporation. Linux is a registered trademark of Linus Torvalds. Intel is a registered trademark of Intel Corporation. VMware and VMotion are registered trademarks and vSphere is a trademark of VMware, Inc. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. TR-3854