Exchange Server 2013 High Availability - Site Resilience
Exchange Server 2013 High Availability | Site Resilience
Scott Schnoll
Principal Technical Writer
Microsoft Corporation
Agenda
Storage
High Availability
Site Resilience
Storage
Storage Challenges
Capacity is increasing, but IOPS are not
Database sizes must be manageable
Reseeds must be fast and reliable
Passive copy IOPS are inefficient
Lagged copies have asymmetric storage requirements
Low agility when recovering from low disk space
Storage Innovations
Multiple Databases Per Volume
Automatic Reseed
Automatic Recovery from Storage Failures
Lagged Copy Enhancements
Multiple Databases Per Volume
Multiple Databases Per Volume
[Diagram: a 4-member DAG with active, passive, and lagged copies of DB1 through DB4 distributed symmetrically across the servers, four databases per volume]
4-member DAG
4 databases
4 copies of each database
4 databases per volume
Symmetrical design with balanced activation preference
Number of copies per database = number of databases per volume
Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
[Diagram: a single large database reseeding from its active copy to a passive copy at 20 MB/s]
Multiple Databases Per Volume
Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs
4 database copies/disk:
Reseed 2TB Disk = ~9.7 hrs
Reseed 8TB Disk = ~39 hrs
[Diagram: with four databases per volume, a failed disk's copies reseed in parallel from different source servers (at 20 MB/s and 12 MB/s), instead of one long serial copy]
Multiple Databases Per Volume
Requirements
Single logical disk/partition per physical disk
Best Practices
Same neighbors on all servers
Balance activation preferences (see the sketch below)
Database copies per volume = copies per database
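A minimal sketch of adding copies with balanced activation preferences, assuming placeholder names (DB1, MBX2-MBX4); Add-MailboxDatabaseCopy and its -ActivationPreference parameter are the documented way to set the preference:

```powershell
# Add passive copies of DB1 across the DAG, assigning a different
# activation preference to each (DB1 and MBX2-MBX4 are placeholder names).
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX2 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX3 -ActivationPreference 3
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX4 -ActivationPreference 4
```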
Autoreseed
Seeding Challenges
Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed
Seeding Innovations
Automatic Reseed (Autoreseed) - use spares to automatically restore database redundancy after a disk failure
Autoreseed
[Diagram: a failed disk in the in-use storage pool being replaced by a spare]
Autoreseed Workflow
[Diagram: Autoreseed workflow cycle: periodically scan for failed and suspended copies; check prerequisites (single copy, spare availability); allocate and remap a spare; start the seed; verify that the new copy is healthy; admin replaces the failed disk]
Autoreseed Workflow
1. Detect a copy in an F&S (Failed and Suspended) state for 15 minutes in a row
2. Try to resume the copy 3 times (with 5-minute sleeps in between)
3. Try assigning a spare volume 5 times (with 1-hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1-hour sleeps in between)
5. Once all retries are exhausted, the workflow stops
6. If 3 days have elapsed and the copy is still F&S, the workflow state is reset and starts again from Step 1
Autoreseed Workflow
Prerequisites
Copy is not ReseedBlocked or ResumeBlocked
Logs and EDB files are on the same volume
Database and log folder structure matches the required naming convention
No active copies on the failed volume
All copies are F&S on the failed volume
No more than 8 F&S copies on the server (if more, we might be in a controller failure situation)
For InPlaceReseed
If EDB files exist, wait 2 days before in-place reseeding (based on the LastWriteTime of the EDB file)
Only up to 10 concurrent seeds are allowed (a sketch for listing F&S copies follows)
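As a quick check of what this workflow will act on, you can list the Failed and Suspended copies yourself; a minimal sketch, assuming MBX1 as a placeholder server name:

```powershell
# List database copies on this server that are Failed and Suspended,
# the state the Autoreseed workflow watches for (MBX1 is a placeholder).
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object { $_.Status -eq 'FailedAndSuspended' } |
    Format-Table Name, Status, ContentIndexState -AutoSize
```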
Autoreseed
1. Configure storage subsystem with spare disks
2. Create DAG, add servers with configured storage
3. Create directory and mount points
4. Configure 3 AutoDag properties
5. Create mailbox databases and database copies
[Diagram: folder structure with two mount points per volume:
\ExchDbs\MDB1 and \ExchDbs\MDB2 (database mount points, each holding MDB1.DB/MDB1.log style folders)
\ExchVols\Vol1, \ExchVols\Vol2, \ExchVols\Vol3 (volume mount points, including a spare)]
AutoDagDatabasesRootFolderPath
AutoDagVolumesRootFolderPath
AutoDagDatabaseCopiesPerVolume = 1
(see the sketch below)
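A minimal sketch of setting the three AutoDag properties; DAG1 and the paths are placeholders, while the parameters are the documented Set-DatabaseAvailabilityGroup parameters:

```powershell
# Point DAG1 at the database and volume root folders, and declare one
# database copy per volume (DAG1 and both paths are placeholders).
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -AutoDagDatabasesRootFolderPath 'C:\ExchDbs' `
    -AutoDagVolumesRootFolderPath 'C:\ExchVols' `
    -AutoDagDatabaseCopiesPerVolume 1
```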
Autoreseed
Requirements
Single logical disk/partition per physical disk
Specific database and log folder structure must be used
Recommendations
Same neighbors on all servers
Databases per volume should equal the number of copies per database
Balance activation preferences
Configuration instructions at http://aka.ms/autoreseed
Autoreseed
Numerous fixes in CU1
Autoreseed not detecting spare disks correctly, or not using detected disks
Get-MailboxDatabaseCopyStatus has a new field 'ExchangeVolumeMountPoint' which shows the mount point of the database volume under C:\ExchangeVolumes (see the sketch below)
Better tracking around mount path and ExchangeVolume path
Increased autoreseed copy limits (previously 4, now 8)
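A quick way to inspect the volume-related fields, assuming DB1\MBX1 as a placeholder copy identity:

```powershell
# Show the volume-related fields on a copy's status, including the
# new ExchangeVolumeMountPoint (DB1\MBX1 is a placeholder identity).
Get-MailboxDatabaseCopyStatus -Identity 'DB1\MBX1' | Format-List Name, *Volume*
```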
Automatic Recovery From Storage Failures
Storage Challenges
Storage controllers are essentially mini-PCs and they can crash/hang
Other operator-recoverable conditions can occur:
Loss of vital system elements
Hung or highly latent IO
Storage Innovations
Exchange Server 2013 includes functionality to automatically recover from a variety of new storage-related failures
Innovations added in Exchange 2010 also carried forward
Even more behaviors added in CU1
Automatic Recovery from Storage Failures
Exchange Server 2010
ESE Database Hung IO (240s)
Failure Item Channel Heartbeat (30s)
SystemDisk Heartbeat (120s)
Exchange Server 2013
System Bad State (302s)
Long I/O times (41s)
MSExchangeRepl.exe memory threshold (4GB)
System Bus Reset (Event 129)
Replication service endpoints not responding
Lagged Copy Challenges
Lagged Copy Challenges
Activation is difficult
Requires manual care
Cannot be page patched
Lagged Copy Innovations
Automatic play down of log files in critical situations
Integration with Safety Net
Lagged Copy Innovations
Automatic log play down:
Low disk space (enable in registry)
Page patching (enabled by default)
Fewer than 3 other healthy copies (enable in AD; configure in registry; see the sketch below)
Simpler activation with Safety Net:
No need for log surgery or hunting for the point of corruption
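A minimal sketch of the AD-side switch for play down when fewer than three healthy copies remain; DAG1 is a placeholder, and the threshold itself lives in the registry, configured separately:

```powershell
# Enable the replay lag manager so lagged copies are automatically
# played down when fewer than 3 other healthy copies remain
# (DAG1 is a placeholder DAG name).
Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true
```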
High Availability
High Availability Challenges
High availability focuses on database health
Best copy selection insufficient for the new architecture
Management challenges around maintenance and DAG network configuration
High Availability Innovations
Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig
Managed Availability
Managed Availability
Key tenet for Exchange 2013: all access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active copy of the user's mailbox
If a protocol is down on a Mailbox server, all active databases on that server lose access via that protocol
Managed Availability was introduced to detect these kinds of failures and automatically correct them
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered
Managed Availability
An internal framework used by component teams
Sequencing mechanism to control when recovery actions are taken versus alerting and escalation
Includes a mechanism for taking servers in/out of service (maintenance mode; see the sketch after this list)
Enhancement to best copy selection algorithm
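The in/out-of-service mechanism surfaces through server component states; a minimal sketch of draining a server for maintenance, assuming MBX1 as a placeholder (a full maintenance procedure involves more steps than shown here):

```powershell
# Drain a server's components for maintenance, then return it to service
# (MBX1 is a placeholder server name).
Set-ServerComponentState MBX1 -Component ServerWideOffline -State Inactive -Requester Maintenance

# ...perform maintenance...

Set-ServerComponentState MBX1 -Component ServerWideOffline -State Active -Requester Maintenance
```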
Managed Availability
MA failovers come in two forms:
Server: protocol failure can trigger a server failover
Database: a Store-detected database failure can trigger a database failover
MA includes the Single Copy Alert:
Alert is per-server to reduce alert flow
Still triggered across all machines with copies
Monitoring is triggered through a notification
Logs 4138 (red) and 4139 (green) events based on 4113 (red) and 4114 (green) events (see the sketch below)
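A minimal sketch of checking for those events; the channel name below is an assumption based on where Exchange writes its high availability crimson-channel events, so adjust it for your environment:

```powershell
# Look for Single Copy Alert events: 4138 (redundancy lost, red) and
# 4139 (redundancy restored, green) in the high availability channel.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-HighAvailability/Monitoring'
    Id      = 4138, 4139
} -MaxEvents 20 | Format-Table TimeCreated, Id -AutoSize
```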
Best Copy and Server Selection
Best Copy Selection Challenges
Process for finding the “best” copy of a specific database to activate
Exchange 2010 uses several criteria:
Copy queue length
Replay queue length
Database copy status (including activation blocked)
Content index status
Not good enough for Exchange Server 2013, because protocol health is not considered
Best Copy and Server Selection
Still an Active Manager algorithm, performed at failover time based on extracted health of the system
Replication health still determined by the same criteria and phases
Criteria now include the health of the entire protocol stack:
Considers a prioritized protocol health set in the selection
Four priorities: critical, high, medium, low (all health sets have a priority)
Failover responders trigger added checks to select a "protocol not worse" target
Best Copy and Server Selection
All Healthy: checks for a server hosting a copy that has all health sets in a healthy state
Up to Normal Healthy: checks for a server hosting a copy that has all health sets Medium and above in a healthy state
All Better than Source: checks for a server hosting a copy that has health sets in a state that is better than the current server hosting the affected copy
Same as Source: checks for a server hosting a copy of the affected database that has health sets in a state that is the same as the current server hosting the affected copy
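These checks run whenever Active Manager has to pick an activation target, including an administrator-driven switchover; a minimal sketch of letting the algorithm choose, assuming DB1 as a placeholder database name:

```powershell
# Switch over DB1 without naming a target; Active Manager's best copy
# and server selection chooses where to activate (DB1 is a placeholder).
Move-ActiveMailboxDatabase -Identity DB1
```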
DAG Network Autoconfig
DAG Network Autoconfig
Automatic or manual DAG network configuration:
Default is automatic
Requires specific configuration settings on MAPI and Replication network interfaces
Manual edits and EAC controls are blocked when automatic networking is enabled:
Set the DAG to manual network setup to edit or change DAG networks (see the sketch below)
DAG networks are automatically collapsed in a multi-subnet environment
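A minimal sketch of switching a DAG to manual network configuration so its networks can be edited, assuming DAG1 as a placeholder:

```powershell
# Put DAG1 into manual network configuration mode so DAG networks can be
# edited; set back to $false to return to automatic (DAG1 is a placeholder).
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true
```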
Site Resilience
Site Resilience Challenges
Operationally complex
Mailbox and Client Access recovery connected
Namespace is a single point of failure (SPOF)
Site Resilience Innovations
Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy
Site Resilience
Key characteristics:
DNS resolves to multiple IP addresses (see the sketch below)
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switchover by removing a VIP from DNS
Namespace is no longer a SPOF
No dealing with DNS latency
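A minimal sketch of verifying the multi-IP namespace from a client; mail.contoso.com and the addresses match the example diagrams later in this deck:

```powershell
# A site resilient namespace returns one VIP per datacenter; HTTP
# clients move to the next address when one produces hard TCP failures.
Resolve-DnsName -Name mail.contoso.com -Type A | Format-Table Name, IPAddress -AutoSize
```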
Site Resilience
Previously, loss of CAS, the CAS array, the VIP, the load balancer, or some portion of the DAG required the admin to perform a datacenter switchover
In Exchange Server 2013, recovery happens automatically
The admin focuses on fixing the issue, instead of restoring service
Site Resilience
Previously, CAS and Mailbox server recovery were tied together in site recoveries
In Exchange Server 2013, recovery is independent, and may come automatically in the form of a failover
This depends on the customer's business requirements and configuration
Site Resilience
With the namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, de-coupling of CAS and Mailbox by AD site, and load balancing changes…
…three locations, if available, can simplify mailbox recovery in response to datacenter-level events
Site Resilience
You must have at least three locations:
Two locations with Exchange; one with the witness server
Exchange sites must be well-connected
Witness server site must be isolated from network failures affecting Exchange sites
[Diagram: primary datacenter (Redmond) and alternate datacenter (Portland), each with two CAS servers (cas1-cas4) behind VIPs 192.168.1.50 and 10.0.1.50; mail.contoso.com resolves to 192.168.1.50 and 10.0.1.50]
[Diagram: the same two datacenters with the VIP 192.168.1.50 failed]
Removing the failing IP from DNS puts you in control of the namespace's time in service
With multiple VIP endpoints sharing the same namespace, if one VIP fails, clients automatically fail over to the alternate VIP(s)
mail.contoso.com now resolves only to 10.0.1.50
[Diagram: dag1 with mbx1 and mbx2 in the primary datacenter (Redmond), mbx3 and mbx4 in the alternate datacenter (Portland), and the witness server in a third datacenter (Paris); the Redmond members have failed]
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur
Site Resilience
[Diagram: dag1 with mbx1, mbx2, and the witness in the primary datacenter (Redmond) and mbx3 and mbx4 in the alternate datacenter (Portland); the entire primary site, including the witness, has failed]
Site Resilience
[Diagram: activating dag1 in Portland (mbx3, mbx4) using the alternate witness after the Redmond site failure]
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond
2. Stop the Cluster service on the remaining DAG members: Stop-Clussvc
3. Activate the DAG members in the 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland
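A hedged sketch of the same switchover as runnable commands; DAG1, the site names, and the server list are placeholders, -ConfigurationOnly is the documented switch for marking unreachable servers as down, and Stop-Service replaces the slide's Stop-Clussvc shorthand:

```powershell
# 1. Mark the failed Redmond members as down; -ConfigurationOnly updates
#    AD when the failed servers cannot be contacted.
Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite Redmond -ConfigurationOnly

# 2. Stop the Cluster service on each surviving DAG member.
Invoke-Command -ComputerName mbx3, mbx4 -ScriptBlock { Stop-Service clussvc }

# 3. Activate the surviving Portland members, which uses the alternate
#    witness configured for the DAG.
Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite Portland
```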
Questions?
Scott Schnoll
Principal Technical Writer
scott.schnoll@microsoft.com
http://aka.ms/schnoll