jonathan frappier – challenge 2 design solution

Virtual Design Master After the Outbreak


Project: HA, Backup, and Disaster Recovery Solutions
Focus Area: Provide HA, backup and disaster recovery solutions
VMware vSphere, Active Directory, SQL, Network, Storage, Remote Access and other managed applications
Prepared By: Jonathan Frappier, @jfrappier, www.virtxpert.com

Project Quality Plan

Version Control

Version | Date | Author | Change Description
0.5 | 8/19/13 | Jonathan Frappier | Draft
0.75 | 8/19/13 | Phoummala Schmitt | Technical Consultant for Exchange HA Solution (verified solution is valid)
1.0 | 8/22/13 | Jonathan Frappier | Final


TABLE OF CONTENTS

1 EXECUTIVE SUMMARY
  1.1 DEFINITIONS
  1.2 SCOPE
2 SITE INFRASTRUCTURE UPDATES
  2.1 EQUIPMENT SPECIFICATIONS
  2.2 LAN & WAN
  2.3 PHYSICAL SERVERS
  2.4 STORAGE
  2.5 SLA DEFINITIONS
  2.6 ADDITIONAL EQUIPMENT REQUIRED
  2.7 UPDATED SITE DIAGRAM
3 REQUIREMENTS, ASSUMPTIONS, CONSTRAINTS & RISKS
  3.1 REQUIREMENTS
  3.2 ASSUMPTIONS
  3.3 CONSTRAINTS
  3.4 RISKS
4 APPLICATION BACKUP, HA, RPO AND RTO
  4.1 RPO AND RTO CLASSIFICATIONS
  4.2 SOFTWARE SOLUTIONS FOR BACKUP AND REPLICATION
  4.3 IDENTIFIED APPLICATIONS & FAILOVER PROCESS BRIEFS
5 IP ALLOCATION
6 APPENDICES
  6.1 HARDWARE MANIFEST
  6.2 SOFTWARE MANIFEST
  6.3 REFERENCE
  6.4 VMWARE CONFIGURATION MAXIMUMS
  6.5 ORIGINAL SITE DESIGN
  6.6 EXCHANGE 2010 SIZING CALCULATOR


1 EXECUTIVE SUMMARY

The world is in disarray after a virus outbreak that turned many into zombies. You’ve been recruited to build an infrastructure for a wealthy philanthropist to use in order to put the world back together. During phase 2 of the build-out, appropriate High Availability, backup and disaster recovery solutions must be designed, documented and implemented. The disaster recovery solution will fail over all virtual server workloads from the primary site to the secondary site, and all virtual desktop workloads to the tertiary site. The solution will use several methods for failover and recovery based on the RPO and RTO of each application. This will provide the most cost-effective solution for failing over all applications.

1.1 Definitions

Backup: The process of making a copy of important information that can be restored later.

Business Continuity Plan (BCP): A plan for how the business will continue to operate in the event of a disaster. Information in a BCP may include escalation procedures, designation of who may declare an emergency, contact and emergency contact information for key personnel, and directions on how to access systems after a DRP has been implemented.

Disaster Recovery Plan (DRP): A disaster recovery plan is the process for restoring access to infrastructure and applications in the event of a disaster. Disasters may range from small, isolated incidents, such as a server failure that affects access to key systems, to major, catastrophic incidents that cause major damage or prevent access to an entire site.

High Availability (HA): High Availability is the process of enabling application resiliency such that applications remain available, or are quickly restored, in the event of an outage. An example of HA may be a virtual server being restarted if it has become unresponsive, quickly and automatically restoring access to the application with little to no human intervention.

HA By Design (HABD): HA By Design is the concept of building HA into your normal application design such that during a disaster, connectivity remains unaffected, or requires very little intervention to be accessed (for example, using an alternate URL for access).

Recovery Point Objective (RPO): Recovery Point Objective defines an acceptable amount of data loss (or lag) in the event of a disaster. For example, a system with an RPO of one (1) hour must be recoverable to a point no more than one (1) hour before the outage or disaster.

Recovery Time Objective (RTO): Recovery Time Objective defines how long a system may remain offline after a disaster. The RTO and RPO do not need to be closely related. For example, a system may have an RPO of only five (5) or ten (10) minutes but during a disaster may not need to be brought back online for several hours, or even days; an RTO of five (5) days may be acceptable even with a low RPO. Conversely, a system with a low RTO may contain static data, such as a web server, and thus have a very high RPO because the data does not change often.
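The independence of RPO and RTO can be illustrated with a short sketch. It is only an illustration of the two definitions above; the function name, the timestamps and the chosen objectives are hypothetical and are not part of the design.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, outage_start, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single outage.

    RPO is judged on how old the newest recoverable data is at the moment
    of the outage; RTO is judged on how long the system stayed offline.
    """
    data_loss_window = outage_start - last_recovery_point
    downtime = service_restored - outage_start
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical outage: data replicated 4 minutes before the failure,
# service restored 6 hours later.
outage = datetime(2013, 8, 22, 12, 0)
rpo_met, rto_met = meets_objectives(
    last_recovery_point=outage - timedelta(minutes=4),
    outage_start=outage,
    service_restored=outage + timedelta(hours=6),
    rpo=timedelta(minutes=5),   # low RPO
    rto=timedelta(days=5),      # relaxed RTO
)
print(rpo_met, rto_met)
```

In this example both objectives are met even though the system was offline for hours, because only the data lag is held to minutes.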

1.2 Scope

Mr. Billionaire needs you to build him an infrastructure so he can continue his efforts across the globe. There are 3 locations that must be used because there is not enough power in any single location to host all of the equipment. The primary site supports up to 5000 virtual servers, the secondary site 1000 virtual servers and the tertiary site 500. The primary site also hosts 3000 virtual desktops available for full desktop access and mobile application delivery for at least 1500 devices. A plan for appropriate High Availability solutions, backup and Disaster Recovery must be implemented to support a major failure in the primary site. Business Continuity, communication and escalation procedures are outside the scope of this document.


2 Site Infrastructure Updates

2.1 Equipment Specifications

All systems used to support disaster recovery efforts will be the same make, model and configuration as the original installation. Additionally, this document lists only the required changes to each site; full system details can be found in the original site design (Appendix 6.5).

2.2 LAN & WAN

• LAN: An additional Cisco 6513 will be added to the secondary site to ensure support for up to 50 additional hosts if required.

• WAN: A 100Mbps internet link has been added to the secondary and tertiary sites. To support this, a Cisco 7600 router and Cisco ASA 5540 will be added to each site.

2.3 Physical Servers

The secondary site will need an additional 50 physical hosts to support full recovery of the virtual servers in the primary site. The tertiary site will require an additional 10 physical hosts to support the full recovery of the virtual desktops from the primary site.

2.4 Storage

During the original design, an assumption was made to install an EMC Celerra NS-480 with 32TB of storage at the tertiary site to support failover from the secondary site. Because the requirements for the DRP call for failover of the primary site only, the spare NS-480 in the tertiary site will be upgraded to 64TB to match the NS-480 in the primary site that supports the VDI workloads.

2.5 SLA Definitions

No service levels were defined; 99.9% availability will be used as the standard.
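To make the 99.9% figure concrete, the sketch below converts an availability percentage into an allowed downtime budget. The function name and the choice of a 365-day year and 30-day month are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24    # assumes a non-leap year
HOURS_PER_MONTH = 30 * 24    # assumes a 30-day month


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


yearly = allowed_downtime_hours(99.9)
monthly_minutes = allowed_downtime_hours(99.9, HOURS_PER_MONTH) * 60
print(f"{yearly:.2f} hours per year, {monthly_minutes:.1f} minutes per month")
```

At 99.9% this works out to roughly 8.76 hours of downtime per year, or about 43 minutes per month.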

2.6 Additional Equipment Required

Device Type | Manufacturer | Model | Quantity
Server | HP | DL580 G5 | 60 (50 to secondary, 10 to tertiary)
Add-On NIC | Broadcom | 5709 Based | 120 (2 cards per server)
Add-On HBA | EMC | Qlogic QLE2462-E-SP | 240 (4 cards per server)
Add-On HD | OCZ | 32GB SSD | 1560 (16 drives per server)
Storage Array | EMC | 146GB FC 15K | 232 (fill remaining DAEs in tertiary site 2nd Celerra)
Load Balancer | F5 | BigIP 6800 | 6 (4 primary site, 2 secondary site, 2 tertiary site)

2.7 Updated Site Diagram


3 Requirements, Assumptions, Constraints & Risks

3.1 Requirements

The purpose of this project is to define, document and create a highly available infrastructure capable of surviving a disaster at the primary site.

3.2 Assumptions

• Secondary and tertiary sites have been upgraded to support the failover of select capacity from the primary site, as listed in Section 2.6.

• 50 physical hosts will be required to meet the assumed server consolidation ratio in the primary datacenter.

• 10 physical hosts will be required to meet the assumed server consolidation ratio in the secondary datacenter.

• 5 physical hosts will be required to meet the assumed server consolidation ratio in the tertiary datacenter.

• Three hundred (300) desktop VMs to a single physical host will be an assumed average consolidation ratio (300-to-1).

• 10 physical hosts will be required to meet the assumed desktop consolidation ratio.

• The 100Mbps link will provide sufficient bandwidth for normal internal traffic (AD replication, vCenter management of hosts and system monitoring) as well as replication.
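The desktop host count in the assumptions above follows directly from the consolidation ratio. A minimal sketch of that arithmetic, using the 3000 virtual desktops from the scope and the assumed 300-to-1 ratio (the function name is hypothetical):

```python
def hosts_required(vm_count, vms_per_host):
    """Round up, because a partially filled host is still a whole host."""
    return -(-vm_count // vms_per_host)  # integer ceiling division


desktop_hosts = hosts_required(vm_count=3000, vms_per_host=300)
print(desktop_hosts)  # 10
```

Any change to the assumed ratio changes the host count proportionally; at 250-to-1, for example, the same desktops would need 12 hosts.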


3.3 Constraints

• Hardware is limited to stock on hand at a discovered warehouse; components are believed to be from 2008.

• Power and cooling at each location are limited.

• Each site has a 100Mbps link for connectivity to the other sites. This may impact RPO for select systems.

3.4 Risks

• Each site has a 100Mbps link for connectivity to the other sites. This may impact RPO for select systems.

• The requirements call for a DRP only for the primary site; failures of the secondary or tertiary sites are not accounted for.

• Hardware available is believed to be reliable and in working order.

• The number of hosts required to meet the assumed consolidation ratio is above the maximum supported for a single cluster; multiple clusters will have to be used.
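The last risk can be sized with a quick calculation. The sketch below assumes a limit of 32 hosts per cluster, which is the figure I recall from the vSphere 5.1 Configuration Maximums; confirm it against Appendix 6.4 before relying on it. The function names are hypothetical.

```python
# Assumption: 32 hosts per cluster (vSphere 5.1 Configuration Maximums);
# verify against Appendix 6.4.
MAX_HOSTS_PER_CLUSTER = 32


def clusters_required(host_count, max_hosts=MAX_HOSTS_PER_CLUSTER):
    """Smallest number of clusters that keeps every cluster within the limit."""
    return -(-host_count // max_hosts)  # integer ceiling division


def spread_hosts(host_count, max_hosts=MAX_HOSTS_PER_CLUSTER):
    """Distribute hosts as evenly as possible across the required clusters."""
    if host_count <= 0:
        return []
    clusters = clusters_required(host_count, max_hosts)
    base, extra = divmod(host_count, clusters)
    return [base + 1 if i < extra else base for i in range(clusters)]


print(clusters_required(50))  # 2
print(spread_hosts(50))       # [25, 25]
```

Spreading hosts evenly rather than filling one cluster to the limit leaves room in each cluster for HA failover capacity.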

4 Application Backup, HA, RPO and RTO

4.1 RPO and RTO Classifications

There will be five categories each classifying RPO and RTO for the systems deployed in the primary data center.

RPO Classifications

Class | RPO | Example Applications
Platinum | < 5 minutes | Active Directory, Exchange
Gold | 5 minutes | Web app DB tier
Silver | 10 minutes | Custom department applications (HR, Financial Planning)
Bronze | 30-60 minutes | File servers, Web app front end tier
Static | 1 day | Basic utility systems, monitoring systems, VDI

RTO Classifications

Class | RTO | Example Applications
Critical | < 5 minutes | Active Directory, Exchange
Priority | 30 minutes | Minimum primary web app & DB tier to restore access
Important | 5 hours | File servers, Basic utility systems, monitoring systems, VDI
Redundant | 12-24 hours | Redundant systems to ensure HA
Standby | 10 days | Custom department applications (HR, Financial Planning)

4.2 Software Solutions for Backup and Replication

We will use different approaches for backup, replication and recovery for various systems, based on their RPO/RTO, in order to be the most cost-effective.


Product | Use Cases | Expected Cost (MSRP/Published)
VMware HA | Automatic restart of VMs on a failed host within the same site. | Included with vSphere Enterprise Plus
VMware FT | Hot spare kept in lock step with primary VM; limited use case due to single vCPU requirement but useful for some web services. |
VMware PowerCLI / vMA | Scripting configuration and setup of hosts and VMs, to be used with lower tier RTO/RPO classifications. |
VMware vCenter Heartbeat | Used to deliver HA for vCenter server required components. | $9995
Zerto Virtual Replication v3 | Used for select upper tier RTO/RPO classifications to automate the replication and reconfiguration of systems at the secondary site. | $745 / protected VM
Unitrends Backup and Recovery | Will be used for backup of mid-tier RTO/RPO classifications such as specialized applications that can easily be scripted/installed. | $nnn / protected ??
EMC Celerra Replication | The Celerra will replicate select LUNs/Storage Groups to the appropriate DR site. | Included with NS-480
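Because Zerto is priced per protected VM, the number of VMs placed in the upper tiers drives the software cost. The sketch below uses only the two published prices listed above; the protected VM count of 100 is a hypothetical input, and the Unitrends price is left out because it is not stated.

```python
ZERTO_COST_PER_VM = 745          # USD per protected VM, from the table above
VCENTER_HEARTBEAT_COST = 9995    # USD, from the table above


def replication_software_cost(protected_vms, heartbeat_licenses=1):
    """Estimated list-price cost of Zerto plus vCenter Heartbeat."""
    return protected_vms * ZERTO_COST_PER_VM + heartbeat_licenses * VCENTER_HEARTBEAT_COST


# Hypothetical example: 100 VMs in the upper RTO/RPO tiers.
print(replication_software_cost(100))  # 84495
```

This is why only select upper tier systems are assigned to Zerto, with lower tiers handled by backup and scripting.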

4.3 Identified Applications & Failover Process Briefs

Below is a list of applications with their required RPO and RTO and a brief overview of how those will be achieved.

• Active Directory: All Domain Controllers will have a system state backup performed in Unitrends [14]. Because Domain Controllers will be created in all data centers, there will be no downtime and next to no data loss (if any) in the event of a site failure. Each site will contain at least one (1) Global Catalog Server and the FSMO roles will be separated based on Microsoft best practice [13]. In this configuration, I believe that the published best practice for FSMO role placement will provide the necessary resources for our domain.

RTO: Critical RPO: Platinum
Achieved in application design.

• Windows 2008 R2 Exchange 2010 Client Access Server (CAS) [1]: Hot stand-by CAS servers will be configured in the secondary site to ensure immediate access is available in the event of a disaster. The number of CAS servers ready will be ¼ (25%) of the production CAS servers in use (with a minimum of one running at all times). The CAS server will be configured to respond to an alternate DNS A record while the primary DNS records are changed and replicated. TTL for the CAS server DNS records will be set to 5 minutes (300s). CAS servers in the primary site will be protected by Zerto.

RTO: Redundant RPO: Bronze
Achieved by placing hot stand-by servers in the secondary site capable of handling workload during a disaster.
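The stand-by sizing rule for CAS servers can be written down as a small function. This is a sketch only: the design states 25% with a minimum of one, and rounding fractional results up is my assumption. The function and constant names are hypothetical.

```python
CAS_DNS_TTL_SECONDS = 300  # 5 minute TTL stated in the design


def standby_cas_count(production_cas_count):
    """25% of production CAS servers, rounded up (assumption), never below one."""
    quarter_rounded_up = -(-production_cas_count // 4)  # integer ceiling division
    return max(1, quarter_rounded_up)


for production in (2, 8, 9):
    print(production, "production ->", standby_cas_count(production), "stand-by")

# Once the primary A records are changed, a client that cached the old record
# just before the change keeps it for at most one TTL.
print("Worst-case cached record lifetime:", CAS_DNS_TTL_SECONDS // 60, "minutes")
```

The alternate A record covers the window in which clients may still hold the old production record.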


• Windows 2008 R2 Exchange 2010 Transport Server [1]: Hot stand-by Transport servers will be configured in the secondary site. Transport servers will have log files backed up by Unitrends, but will not need to be restored in the event of a disaster.

RTO: Redundant RPO: Static
Achieved by placing hot stand-by servers in the secondary site; no data is required from the primary site servers.

• Windows 2008 R2 Exchange 2010 Mailbox Server (BE) [1]: Hot stand-by mailbox servers will be configured in the secondary site as part of a Database Availability Group (DAG) so that mail can be replicated in near real time from the primary site to the secondary site, due to the organization's reliance on messaging. Based on the Exchange Server 2010 Role Requirements Calculator (Appendix 6.6), the 100Mbps site links between the data centers are sufficient; an estimated 43Mbps is required for this replication to be maintained at a near-zero RPO.

RTO: Critical RPO: Platinum
Achieved by application design, placing hot spare mailbox servers in the secondary data center and configuring them in a DAG. Email flow with primary site online. When failed over to the secondary site, an alternate URL will be available for immediate use while production DNS records are changed and replicated.
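A quick link budget shows how much of the 100Mbps site link remains once DAG replication takes its estimated 43Mbps. The figures come from the text above; the function name is hypothetical, and the sketch ignores protocol overhead and traffic bursts.

```python
def link_budget(link_mbps, replication_mbps):
    """Return (headroom in Mbps, fraction of the link used by replication)."""
    headroom = link_mbps - replication_mbps
    utilization = replication_mbps / link_mbps
    return headroom, utilization


headroom, utilization = link_budget(link_mbps=100, replication_mbps=43)
print(f"{headroom} Mbps left, {utilization:.0%} used by DAG replication")
```

The remaining 57Mbps has to carry AD replication, vCenter management, monitoring, Zerto and Unitrends replication, which is why the link is listed as both a constraint and a risk.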


Mailbox Server / DAG Design

• Windows 2008 R2 Standard Application Servers: There will be three (3) levels of application server classification. The first will include the minimum components required to bring the application back online and capable of supporting all necessary traffic. For example, a single web server, application server and database server may be able to handle the load for a given application, but due to its uptime requirements the application may have several other redundant servers to support it. In this scenario only 1 web, application and database server would be brought online initially, and the redundant systems would be brought back online once all primary systems were restored. Some application servers may only be providing utility level services.

RTO: Priority RPO: Gold
Achieved by continuous replication using Zerto Virtual Replication v3, which will allow the desired RPO level and assist with RTO by automating the changing of IP addresses in the secondary data center.

RTO: Redundant RPO: Bronze
Achieved by backing up systems with Unitrends and replicating Unitrends to the secondary site. Data will be refreshed from the primary systems within an application cluster. In some scenarios where tiers are stateless (i.e. no data is stored on the system), templates and scripts to restore the redundant systems to a working state will be used rather than backing up with Unitrends.

RTO: Standby RPO: Static
Achieved by automating the installation of specific applications such as system monitoring (OpsView). For systems that require data retention, the specific data sets will be backed up by Unitrends (i.e. not using a full VM backup, rather just individual directories or databases).

• VMware vCenter Server and SQL 2008 R2 Database Server (for vCenter) [2]: VMware vCenter Server, the required services that will also be installed on the vCenter server (SSO, Inventory Service), and the vCenter database server will be protected using VMware vCenter Heartbeat. Given that vCenter is crucial to the operation of the environment, I feel it is critical to provide a solid solution which is dedicated to this task. At $9995, the cost of the product should be justified given the size and scope of the environment [15].

RTO: Critical RPO: Platinum
Achieved by leveraging vCenter Heartbeat, which monitors vCenter and provides the ability to fail over and fail back. vCenter Heartbeat is capable of operating over a WAN environment, and with 100Mbps dedicated links between the data centers I feel this is an ideal solution.


Graphic from www.vmware.com vCenter Heartbeat product page

• VMware View Infrastructure: Because many of our users rely on the VMware View infrastructure for remote access to systems and applications, we have decided to stand up a warm VMware View infrastructure in the tertiary data center. This will allow remote staff, some of whom may be required to access systems during a disaster, an alternate method for access. In order to support users in this fashion, VMware View HTML Access will be configured. There is no data saved in the View environment; rather, it is saved in applications and file servers, so there is no RPO required. VM templates will be replicated from the primary data center.

RTO: Priority RPO: N/A
Achieved by building a warm stand-by View infrastructure in the tertiary data center.
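The RTO classes assigned in this section imply an order in which services should come back during a failover. The sketch below derives that order from the classifications in Section 4.1, using the upper bound of each RTO range as the limit; the short application labels and the function name are hypothetical, and the application server tiers are omitted for brevity.

```python
from datetime import timedelta

# Upper bound of each RTO class from Section 4.1.
RTO_LIMIT = {
    "Critical": timedelta(minutes=5),
    "Priority": timedelta(minutes=30),
    "Important": timedelta(hours=5),
    "Redundant": timedelta(hours=24),
    "Standby": timedelta(days=10),
}

# RTO class per application, as stated in the briefs above.
APPLICATION_RTO = {
    "Active Directory": "Critical",
    "Exchange CAS": "Redundant",
    "Exchange Transport": "Redundant",
    "Exchange Mailbox (DAG)": "Critical",
    "vCenter Server and SQL": "Critical",
    "VMware View": "Priority",
}


def recovery_order(app_rto):
    """Sort applications by tightest RTO first, then by name for stable output."""
    return sorted(app_rto, key=lambda app: (RTO_LIMIT[app_rto[app]], app))


for app in recovery_order(APPLICATION_RTO):
    print(f"{APPLICATION_RTO[app]:<10} {app}")
```

A list like this could seed a runbook, with the Critical items already satisfied by design because those services are running in the recovery sites before any disaster occurs.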


5 IP Allocation

The following IP allocation will be used across all data centers. Each site will have a class B range, sub-netted for traffic segmentation purposes. Routing and ACLs where appropriate will be handled by the switch.
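As an illustration of sub-netting a class B range per site, the sketch below carves /24 subnets out of a /16 for each traffic type. The address ranges, the segment names and the /24 size are all hypothetical placeholders, not the allocation used in this design.

```python
import ipaddress

# Hypothetical /16 (class B sized) range per site.
SITE_RANGES = {
    "primary": "172.16.0.0/16",
    "secondary": "172.17.0.0/16",
    "tertiary": "172.18.0.0/16",
}

# Hypothetical traffic segments; one /24 is taken for each, in order.
SEGMENTS = ["management", "vmotion", "storage", "servers", "desktops"]


def allocate(site_cidr, segments, new_prefix=24):
    """Map each segment name to the next free subnet in the site range."""
    subnets = ipaddress.ip_network(site_cidr).subnets(new_prefix=new_prefix)
    return dict(zip(segments, subnets))


for site, cidr in SITE_RANGES.items():
    plan = allocate(cidr, SEGMENTS)
    print(site, {name: str(net) for name, net in plan.items()})
```

Keeping the same segment order in every site means the third octet identifies the traffic type regardless of location, which simplifies ACLs on the switch.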

6 APPENDICES

6.1 Hardware Manifest

Device Type | Manufacturer | Model
Router | Cisco | 7600
Firewall | Cisco | ASA 5540
Network Switch | Cisco | Catalyst 6500
Storage Switch | Cisco | MDS 9513
Server | HP | DL580 G5
Add-On NIC | Broadcom | 5709 Based
Add-On HBA | EMC | Qlogic QLE2462-E-SP
Add-On HD | OCZ | 32GB SSD
Storage Array | EMC | Celerra NS480
Storage Array | EMC | 146GB FC 15K
Load Balancer | F5 | BigIP 6800

6.2 Software Manifest

Vendor | Software
Microsoft | Windows 2008 R2
Microsoft | Windows 7 64-bit
Microsoft | Office 2010
Microsoft | SQL 2008 R2
VMware | vSphere Enterprise Plus
VMware | Horizon View
VMware | Replication
VMware | vSphere Data Protection
VMware | Log Insight Manager
VMware | vMA
VMware | vSphere Support Assistant
VMware | vShield Endpoint
VMware | vCenter Server Heartbeat
Trend Micro | Deep Security
Opsview | Opsview Enterprise
Unitrends | Enterprise Backup
Indeni | Dynamic Knowledge Base
Zerto | Zerto Virtual Replication v3

6.3 Reference

1 - http://goo.gl/7ohe4 - Microsoft Exchange 2010 on VMware Best Practices
2 - http://goo.gl/F2B4w - Installing vCenter Server 5.1 Best Practices
3 - http://goo.gl/ToZfWc - How old is my server
4 - http://goo.gl/PtlKT4 - Ark.intel.com
5 - http://goo.gl/LFBys - wmarow.com IOPS calculator
6 - http://goo.gl/xcF0h - RAIDcalc
7 - http://goo.gl/zRdhqT - Cisco 5500 Series Release Notes
8 - http://communities.vmware.com/docs/DOC-22981 - vSphere 5.1 Hardening Guide
9 - http://goo.gl/WIC7Hb - vSphere 5.1 Documentation / Authentication
10 - http://goo.gl/KGJ7tK - vSphere 5.1 Host Conditions and Trigger States
11 - http://blogs.vmware.com/kb/2012/07/leveraging-multiple-nic-vmotion.html
12 - http://goo.gl/SEQfwH - MDS900 3.0(1) Release notes
13 - http://technet.microsoft.com/en-us/library/cc816945(v=ws.10).aspx - Managing Operations Master Roles
14 - http://support.unitrends.com/ikm/questions.php?questionid=891 - AD Restore
15 - http://www.vmware.com/products/vcenter-server-heartbeat/features.html - vCenter Heartbeat


6.4 VMware Configuration Maximums

vsphere-51-configuration-maximums.pdf

6.5 Original Site Design

jfrappier - challenge1.pdf

6.6 Exchange 2010 Sizing Calculator

vDM-ExchangeWorkBook.xlsm
