dag_2013

Exchange Server 2013 High Availability and Site Resilience

Agenda
- Storage
- High Availability
- Site Resilience

Storage

TechEd 2013 © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. 3/16/2015 9:19 AM

Storage Challenges
- Disks
  - Capacity is increasing, but IOPS are not
- Databases
  - Database sizes must be manageable
- Database Copies
  - Reseeds must be fast and reliable
  - Passive database copy IOPS are inefficient
  - Lagged copies have asymmetric storage requirements
  - Lagged copies require manual care

Storage Innovations
- Multiple Databases Per Volume
- Autoreseed
- Self-Recovery from Storage Failures
- Lagged Copy Innovations

Multiple databases per volume

[Figure: 4-member DAG with 4 databases (DB1 to DB4), 4 copies of each database, and 4 databases per volume. Legend: passive, active, lagged. Symmetrical design.]

Multiple databases per volume

[Figure: disks that each hold a single copy of DB1, with a seeding stream labeled 20 MB/s. Legend: passive, active, lagged.]

Single database copy/disk:
- Reseed 2 TB database = ~23 hrs
- Reseed 8 TB database = ~93 hrs

Multiple databases per volume

[Figure: disks that each hold copies of DB1 to DB4, with seeding streams labeled 12 MB/s, 12 MB/s, 20 MB/s, and 20 MB/s. Legend: passive, active, lagged.]

Single database copy/disk:
- Reseed 2 TB database = ~23 hrs
- Reseed 8 TB database = ~93 hrs

4 database copies/disk:
- Reseed 2 TB disk = ~9.7 hrs
- Reseed 8 TB disk = ~39 hrs
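The reseed figures above are throughput arithmetic, and a short sketch makes the relationship explicit. The function names and the decimal unit convention (1 TB = 1,000,000 MB) are assumptions made for illustration; the slides do not say how much of each disk is actually occupied. A completely full 2 TB database at 20 MB/s works out nearer 28 hours than the quoted ~23 hours, so the quoted figures presumably assume that less than the full capacity is copied. That reading is an inference, not something the slides state.

```python
def reseed_hours(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours to copy data_tb terabytes (decimal, 1 TB = 1,000,000 MB) at a sustained rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return data_tb * 1_000_000 / rate_mb_per_s / 3600


def implied_data_tb(hours: float, rate_mb_per_s: float) -> float:
    """Terabytes (decimal) that a copy of the given duration and rate would move."""
    return hours * 3600 * rate_mb_per_s / 1_000_000


# A completely full 2 TB database at 20 MB/s.
print(round(reseed_hours(2, 20), 1))      # 27.8
# Data volume implied by the quoted single-copy figures at 20 MB/s.
print(round(implied_data_tb(23, 20), 2))  # 1.66
print(round(implied_data_tb(93, 20), 2))  # 6.7
```

The same two functions can be applied to the 4-copies-per-disk figures once an aggregate rate is chosen; the slides label individual streams but do not state the aggregate.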

Multiple databases per volume
- Requirements
  - Single logical disk/partition per physical disk
- Recommendations
  - Databases per volume should equal the number of copies per database
  - Same neighbors on all servers

Autoreseed
- Seeding Challenges
  - Disk failure on active copy = database failover
  - Failed disk and database corruption issues need to be addressed quickly
  - Fast recovery to restore redundancy is needed
- Seeding Innovations
  - Automatically restore redundancy after disk failure using provisioned spares

[Figure: in-use storage and spare disks; one disk is marked failed (X).]

Disk reseed operation

Autoreseed Workflow
1. Detect a copy in a FailedAndSuspended (F&S) state for 15 minutes in a row
2. Try to resume the copy 3 times (with 5-minute sleeps in between)
3. Try assigning a spare volume 5 times (with 1-hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1-hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts from Step 1

Autoreseed Workflow
- Prerequisites
  - Copy is not ReseedBlocked or ResumeBlocked
  - Logs and database file(s) are on the same volume
  - Database and log folder structure matches the required naming convention
  - No active copies on the failed volume
  - All copies are F&S on the failed volume
  - No more than 8 F&S copies on the server (if there are more, it might be a controller failure)
- For InPlaceSeed
  - Up to 10 concurrent seeds are allowed
  - If a database file exists, wait for 2 days before in-place reseeding
  - Waiting period is based on the LastWriteTime of the database file

Autoreseed

[Figure: Autoreseed folder layout. Labels: \, ExchDbs, ExchVols, Vol1, Vol2, Vol3, MDB1, MDB2, MDB1.DB, MDB1.log, AutoDagDatabasesRootFolderPath, AutoDagVolumesRootFolderPath, AutoDagDatabaseCopiesPerVolume = 1.]
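The retry schedule in the Autoreseed workflow (3 resumes with 5-minute sleeps, then 5 spare assignments and 5 in-place seeds with 1-hour sleeps) can be sketched as a small loop. This is only a model of the counts and intervals listed on the slide. The stage names, the `actions` dictionary, and the injected `sleep` callable are hypothetical, and treating a successful action as the end of the workflow is a simplification; the 15-minute detection window and the 3-day reset are not modeled.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (stage name, attempts, sleep between attempts in seconds), as listed on the slide.
STAGES: Sequence[Tuple[str, int, int]] = (
    ("resume", 3, 5 * 60),
    ("assign_spare", 5, 60 * 60),
    ("in_place_seed", 5, 60 * 60),
)


def run_autoreseed(actions: Dict[str, Callable[[], bool]],
                   sleep: Callable[[int], None]) -> Optional[str]:
    """Try each stage in order; return the stage that succeeded, or None."""
    for name, attempts, pause in STAGES:
        for attempt in range(attempts):
            if actions[name]():
                return name
            # Sleep only between attempts of the same stage.
            if attempt < attempts - 1:
                sleep(pause)
    return None  # all retries exhausted; the workflow stops


# Example: resume never works, the first spare assignment succeeds.
slept: List[int] = []
result = run_autoreseed(
    {"resume": lambda: False,
     "assign_spare": lambda: True,
     "in_place_seed": lambda: False},
    slept.append,  # record the sleeps instead of actually waiting
)
print(result, slept)  # assign_spare [300, 300]
```

Injecting `sleep` keeps the sketch testable without waiting; a real scheduler would persist state between attempts rather than block.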
Autoreseed
- Requirements
  - Single logical disk/partition per physical disk
  - Specific database and log folder structure must be used
- Recommendations
  - Same neighbors on all servers
  - Databases per volume should equal the number of copies per database

Autoreseed
- Numerous fixes in CU1
  - Autoreseed not detecting spare disks correctly
  - Autoreseed not using spare disks
  - Increased Autoreseed copy limits (previously 4, now 8)
  - Better tracking around mount path and ExchangeVolume path
- Get-MailboxDatabaseCopyStatus displays ExchangeVolumeMountPoint
  - Shows the mount point of the database volume under C:\ExchangeVolumes
Update-MailboxDatabaseCopy includes new parameters designed to aid with automation

Parameter: Description
- BeginSeed: Useful for scripting reseeds. Task asynchronously starts the seeding operation and then exits the cmdlet.
- MaximumSeedsInParallel: Used with the Server parameter to specify the maximum number of parallel seeding operations across the specified server during a full server reseed operation. Default is 10.
- SafeDeleteExistingFiles: Used to perform a seeding operation with a single copy redundancy pre-check prior to the seed. Because this parameter includes the redundancy safety check, it requires a lower level of permissions than DeleteExistingFiles, enabling a limited permission administrator to perform the seeding operation.
- Server: Used as part of a full server reseed operation to reseed all database copies in an F&S state. Can be used with MaximumSeedsInParallel to start reseeds of database copies in parallel across the specified server, in batches of up to MaximumSeedsInParallel copies at a time.

Self-recovery from storage failures
- Recovery Challenges
  - Storage controllers are basically mini-PCs
  - As such, they can crash, hang, etc., requiring administrative intervention
  - Other operator-recoverable conditions can occur
    - Loss of vital system elements
    - Hung or highly latent IO

Lagged copy innovations
- Lagged Copy Challenges
  - Activation is difficult
  - Lagged copies require manual care
  - Lagged copies cannot be page patched
- Lagged Copy Innovations
  - Automatic log file replay
    - Low disk space (enable in registry)
    - Page patching (enabled by default)
    - Less than 3 other healthy copies (enable in Active Directory; configure in registry)
  - Integration with Safety Net
    - No need for log surgery or hunting for the point of corruption

High Availability
- High Availability Challenges
  - High availability focuses on database health
  - Best copy selection insufficient for new architecture
  - DAG network configuration still manual
- High Availability Innovations
  - Managed Availability
  - Best Copy and Server Selection
  - DAG Network Autoconfig

Managed Availability
- Key tenets for Exchange 2013
  - Access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the mailbox
  - If a protocol is down on a Mailbox server, all access to active databases on that server via that protocol is lost
- Managed Availability was introduced to detect and automatically recover from these kinds of failures
  - For most protocols, quick recovery is achieved via a restart action
  - If the restart action fails, a failover can be triggered

Managed Availability
- An internal framework used by component teams
- Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
- Enhances the Best Copy Selection algorithm by taking into account overall server health of source and target

Managed Availability
- MA failovers are a recovery action from failure
  - Detected via a synthetic operation or live data
  - Throttled in time and across the DAG
- MA failovers can happen at database or server level
  - Database: Store-detected database failure can trigger database failover
  - Server: Protocol failure can trigger server failover
- Single Copy Alert integrated into MA
  - ServerOneCopyInternalMonitorProbe (part of DataProtection Health Set)
  - Alert is per-server to reduce flow
  - Still triggered across all machines with copies
  - Logs 4138 (red) and 4139 (green) events
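The Managed Availability sequencing described above (restart first, fail over if the restart does not help, alert and escalate otherwise) can be illustrated with a minimal sketch. The function name, the callables, and the step labels are hypothetical; this does not reflect the internal framework's actual interfaces, and throttling is left out.

```python
from typing import Callable, List


def recover_protocol(restart: Callable[[], bool],
                     failover: Callable[[], bool]) -> List[str]:
    """Run recovery actions in order and return the steps that were taken."""
    steps = ["restart"]
    if restart():
        return steps  # quick recovery via the restart action
    steps.append("failover")
    if failover():
        return steps  # restart failed, failover restored service
    steps.append("escalate")  # assumption: escalation follows when both actions fail
    return steps


print(recover_protocol(lambda: False, lambda: True))  # ['restart', 'failover']
```

Returning the list of steps rather than a single status makes it easy to see which recovery actions ran before service was restored.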
Best Copy and Server Selection
- Best Copy Selection Challenges
  - Exchange 2010 used several criteria
    - Copy queue length
    - Replay queue length
    - Database copy status, including activation blocked
    - Content index status
  - Using just these criteria is not good enough for Exchange 2013, because protocol health is not considered

Best Copy and Server Selection
- Still an Active Manager algorithm, performed at *over time (switchover or failover) based on extracted health of the system
  - Replication health is still determined by the same criteria and phases
  - Criteria now include the health of the entire protocol stack
  - Considers a prioritized protocol health set in the selection, using four priorities: critical, high, medium, low
  - Failover responders trigger added checks to select a "protocol not worse" target

Best Copy and Server Selection
- Managed Availability imposes 4 new constraints on the Best Copy Selection algorithm

[Figure: the four constraints, numbered 1 to 4.]

BCSS Changes in CU1
- PAM tracks the number of active databases per server
  - Honors MaximumActiveDatabases, if configured
  - Allows Active Manager to exclude servers that are already hosting the maximum number of active databases when determining potential candidates for activation
- Keeps an in-memory state that tracks the number of active databases per server
  - When the PAM role moves or when the Exchange Replication service is restarted on the PAM, this information is rebuilt from the cluster database
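The CU1 behavior described above (count active databases per server, exclude servers already at MaximumActiveDatabases, rebuild the counts from the cluster database) can be sketched as follows. The data model is hypothetical: the cluster database is represented as a plain mapping from database name to the server hosting its active copy, and the `Server` class and function names are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    max_active: Optional[int]  # MaximumActiveDatabases; None means not configured


def rebuild_counts(cluster_db: Dict[str, str]) -> Counter:
    """Recount active databases per server from a database -> active server map."""
    return Counter(cluster_db.values())


def eligible_targets(servers: List[Server], counts: Counter) -> List[str]:
    """Names of servers that are not already hosting their maximum of active databases."""
    return [s.name for s in servers
            if s.max_active is None or counts[s.name] < s.max_active]


servers = [Server("mbx1", 2), Server("mbx2", None), Server("mbx3", 1)]
cluster_db = {"DB1": "mbx1", "DB2": "mbx1", "DB3": "mbx3", "DB4": "mbx2"}
counts = rebuild_counts(cluster_db)
print(eligible_targets(servers, counts))  # ['mbx2']
```

In this example mbx1 and mbx3 are at their configured limits, so only mbx2, which has no limit configured, remains a candidate. Replication and protocol health checks would apply on top of this filter and are not modeled here.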
DAG Network Innovations
- DAG Network Challenges
  - DAG networks must be manually collapsed in a multi-subnet deployment
  - Small remaining administrative burden for deployment and initial configuration
- DAG Network Innovations
  - Automatically collapsed in multi-subnet environment
  - Automatic or manual configuration
    - Default is Automatic
    - Requires specific settings on MAPI and Replication network interfaces
  - Manual edits and EAC controls blocked by default
  - Set DAG to manual network setup to edit or change DAG networks

Site Resilience
- Site Resilience Challenges
  - Operationally complex
  - Mailbox and Client Access recovery connected
  - Namespace is a SPOF

Site Resilience Innovations
- Key Characteristics
  - DNS resolves to multiple IP addresses
  - Almost all protocol access in Exchange 2013 is HTTP
  - HTTP clients have built-in IP failover capabilities
  - Clients skip past IPs that produce hard TCP failures
  - Admins can switch over by removing the VIP from DNS
  - Namespace no longer a SPOF
  - No dealing with DNS latency

Site Resilience Innovations
- Operationally simplified
- Mailbox and Client Access recovery independent
- Namespace provides redundancy

Site Resilience
- Operationally Simplified
  - Previously, loss of CAS, CAS array, VIP, LB, etc., required the admin to perform a datacenter switchover
  - In Exchange Server 2013, recovery happens automatically
  - The admin focuses on fixing the issue, instead of restoring service

Site Resilience
- Mailbox and CAS recovery independent
  - Previously, CAS and Mailbox server recovery were tied together in site recoveries
  - In Exchange Server 2013, recovery is independent, and may come automatically in the form of failover
  - This is dependent on business requirements and configuration

Site Resilience
- Namespace provides redundancy
  - Previously, the namespace was a single point of failure
  - In Exchange 2013, the namespace provides redundancy by leveraging multiple A records and the ability of the client OS/HTTP stack to fail over

Site Resilience
- Support for new deployment scenarios
  - With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes, three locations, if available, can simplify mailbox recovery in response to datacenter-level events
  - You must have at least three locations
    - Two locations with Exchange; one with witness server
    - Exchange sites must be well-connected
    - Witness server site must be isolated from network failures affecting Exchange sites

Site Resilience Failover Examples

[Figure: primary datacenter (Redmond) and alternate datacenter (Portland) with CAS servers cas1 to cas4; VIP 192.168.1.50 is marked failed (X), VIP 10.0.1.50 remains.]

- mail.contoso.com: 192.168.1.50, 10.0.1.50
- With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP(s)
- Removing the failing IP from DNS puts you in control of the in-service time of the VIP
- mail.contoso.com: 10.0.1.50
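The client-side behavior described above, where a client skips past IPs that produce hard TCP failures, can be sketched with a loop over the addresses returned for the namespace. The function name and the injected `connect` callable are hypothetical; a real client would open a TCP connection (for example to port 443) at that point, and actual HTTP stacks differ in ordering and timeouts.

```python
from typing import Callable, Iterable, Optional


def first_reachable(ips: Iterable[str],
                    connect: Callable[[str], None]) -> Optional[str]:
    """Return the first IP that accepts a connection, skipping hard failures."""
    for ip in ips:
        try:
            connect(ip)
        except OSError:
            continue  # hard TCP failure: move on to the next A record
        return ip
    return None


def fake_connect(ip: str) -> None:
    """Stand-in for a TCP connect; the failed VIP refuses connections."""
    if ip == "192.168.1.50":
        raise ConnectionRefusedError(ip)


# mail.contoso.com resolves to both VIPs; the first one is down.
print(first_reachable(["192.168.1.50", "10.0.1.50"], fake_connect))  # 10.0.1.50
```

Removing the failed VIP from DNS simply shortens the list, so clients no longer spend time on the dead address.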

Site Resilience Failover Examples

[Figure: dag1 with members mbx1 to mbx4 across the primary datacenter (Redmond) and the alternate datacenter (Portland), and a witness, with a third datacenter (Paris); one failure is marked (X).]

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover of active databases should occur.

Site Resilience Failover Examples

[Figure: dag1 with a witness and members mbx1 to mbx4 across the primary datacenter (Redmond) and the alternate datacenter (Portland); three failures are marked (X, X, X).]
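The difference between the two failover examples comes down to whether the surviving DAG members can still form a majority. The sketch below uses a simple static majority model in which each DAG member and the witness count as one vote. That model, and the assumption that the three X marks are the witness, mbx1, and mbx2, are illustrative and not stated on the slides; dynamic quorum features are not modeled.

```python
def has_quorum(total_voters: int, reachable_voters: int) -> bool:
    """True when the reachable voters form a strict majority of all voters."""
    return reachable_voters > total_voters // 2


TOTAL = 5  # mbx1, mbx2, mbx3, mbx4 plus the witness

# Witness in a third datacenter: mbx3, mbx4, and the witness remain.
print(has_quorum(TOTAL, 3))  # True

# Witness lost together with mbx1 and mbx2: only mbx3 and mbx4 remain.
print(has_quorum(TOTAL, 2))  # False
```

When the result is False, the surviving members cannot keep the cluster running on their own, and a manual datacenter switchover is needed.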

Site Resilience Failover Examples

[Figure: dag1 with a witness and members mbx1 to mbx4 across the primary datacenter (Redmond) and the alternate datacenter (Portland), plus an alternate witness; a failure is marked (X).]

1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Service ClusSvc
3. Activate DAG members in the second datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
