High Availability & Disaster Recovery
Witt Mathot
Managing the Twin Risks to your Operations
Data Loss
Down Time
RTO
Resiliency
High Availability
Round Robin
Business Continuity
Terminology
A Spectrum, Not a Switch
[Figure: Business Continuity. Axes: cost versus downtime (decreasing: days, hours, minutes, seconds). Bands: business interruption, workday interruption, momentary interruption.]
Tier    Criticality        Recovery Time (RTO)  Recovery Point (RPO)  Others (test frequency, etc.)
Tier 1  Mission Critical   < 4 hours            < 1 hour              …
Tier 2  Business Critical  < 24 hours           < 1 hour              …
Tier 3  Significant        < 72 hours           < 48 hours            …
Tier 4  No Impact          < 1 week             < 1 week              …
NOTE: Traditional GIS deployments are typically seen in Tiers 3-4, but are becoming more prevalent in Tiers 1-2
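The tier table can be read as a lookup from measured recovery figures to a tier. A minimal sketch, assuming the thresholds from the table with 1 week taken as 168 hours; the names `Tier`, `meets_tier`, and `strictest_tier_met` are illustrative and not part of the original slides.

```python
from typing import NamedTuple, Optional

class Tier(NamedTuple):
    name: str
    criticality: str
    rto_hours: float  # upper bound on recovery time
    rpo_hours: float  # upper bound on data loss window

# Thresholds copied from the tier table; 1 week is assumed to be 168 hours.
TIERS = [
    Tier("Tier 1", "Mission Critical", 4, 1),
    Tier("Tier 2", "Business Critical", 24, 1),
    Tier("Tier 3", "Significant", 72, 48),
    Tier("Tier 4", "No Impact", 168, 168),
]

def meets_tier(tier: Tier, recovery_hours: float, data_loss_hours: float) -> bool:
    """True if a measured recovery stays strictly under both limits of the tier."""
    return recovery_hours < tier.rto_hours and data_loss_hours < tier.rpo_hours

def strictest_tier_met(recovery_hours: float, data_loss_hours: float) -> Optional[Tier]:
    """Return the most demanding tier whose RTO and RPO are both satisfied."""
    for tier in TIERS:  # ordered from strictest to most lenient
        if meets_tier(tier, recovery_hours, data_loss_hours):
            return tier
    return None
```

For example, a drill that restores service in 10 hours with 5 hours of lost edits satisfies Tier 3 but not Tier 2, because the Tier 2 recovery point is under 1 hour.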
SLAs
Business Continuity
Varies by System
The Three Approaches
• Backups: snapshot; ability to go back in time
• High Availability: no single point of failure; machine redundancy
• Disaster Recovery (Geographic Redundancy): no single point of failure; environment redundancy
Choosing Between Them
Complementary
Build On Each Other
Cost and Capability
Backup & Restore
Backups are…
Simple
Highly Effective
Not Disruptive
Underappreciated
ArcGIS Enterprise Backups – WebGIS DR Tool
What the tool backs up
Settings (Portal, Server, Data Store)
Services
Portal Content
ArcGIS Data Store Data (relational, scene tiles)
ArcGIS Enterprise Backups – WebGIS DR Tool
What the tool doesn’t back up
Enterprise geodatabase (EGDB) or file-based data
Traditional cache tiles
How to Backup Web GIS
Web GIS DR Tool
Property File
• Location
• Portal URL
• Credentials
• Scene Cache?
Automate
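Before automating a backup, it can help to confirm that the property file actually carries the four items listed above. A minimal sketch, assuming a plain `key=value` file; the key names are written from memory of the tool's template and should be treated as assumptions to check against the template shipped with your release. The sample values (file share, host name, account) are hypothetical.

```python
# Assumed key names for: location, portal URL, credentials, scene cache option.
REQUIRED_KEYS = (
    "SHARED_LOCATION",
    "PORTAL_ADMIN_URL",
    "PORTAL_ADMIN_USERNAME",
    "PORTAL_ADMIN_PASSWORD",
    "INCLUDE_SCENE_TILE_CACHES",
)

def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]

# Hypothetical example content.
SAMPLE = """
# where the tool stages its files
SHARED_LOCATION=//fileserver/webgisdr
PORTAL_ADMIN_URL=https://portal.example.com:7443/arcgis
PORTAL_ADMIN_USERNAME=siteadmin
PORTAL_ADMIN_PASSWORD=changeme
INCLUDE_SCENE_TILE_CACHES=false
"""

print(missing_keys(parse_properties(SAMPLE)))
```

Running the check at the start of a scheduled job gives an early, readable failure instead of a half-finished export.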
New at 10.5 and 10.5.1
• Reduced requirements for running the tool
- Different machine names
- Different internal URLs
• Incremental backups
• Cloud specific
- Different regions for primary and standby data centers
- Azure BLOB storage
- Ability to save a WebGIS DR backup to an S3 bucket
WebGIS DR Properties – Backup Restore Mode
WebGIS DR Properties – Amazon S3
Information for the backup portal content S3 bucket
WebGIS DR Properties – Amazon S3
Storing the WebGIS DR backup in an S3 bucket
WebGIS DR Properties – Azure
Credentials for the backup portal content container
WebGIS DR Tool – Usage
• Backup
- Runs concurrently
- No downtime while exporting
- Sample syntax
• Restore
- Runs sequentially
- Order: Data Store, Server, Portal
- Downtime while restoring
- Sample syntax
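A hedged sketch of the backup and restore invocations, assuming the command-line form `webgisdr --export --file <properties file>` for backup and `webgisdr --import --file <properties file>` for restore; verify it against the documentation for your release. The helper name and the properties path are hypothetical, and the command is only built here, not executed.

```python
def webgisdr_command(mode, properties_file, tool="webgisdr"):
    """Build the argument list for a backup ("export") or restore ("import")."""
    if mode not in ("export", "import"):
        raise ValueError("mode must be 'export' or 'import'")
    return [tool, f"--{mode}", "--file", properties_file]

# Hypothetical properties file location.
backup = webgisdr_command("export", "/opt/arcgis/webgisdr.properties")
restore = webgisdr_command("import", "/opt/arcgis/webgisdr.properties")
print(" ".join(backup))
print(" ".join(restore))

# To actually run it (restore causes downtime, per the slide):
#   import subprocess
#   subprocess.run(backup, check=True)
```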
Scheduling ArcGIS Enterprise backups - Windows
Scheduling ArcGIS Enterprise backups - Linux
Examples:
Run the WebGIS DR Tool at 12:00:00 AM every day:
• Creating a cronjob:
• Cronjob syntax:
Run the tool every 12 hours every day starting at 12:00:00 AM:
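The two schedules above map onto standard five-field cron expressions: `0 0 * * *` fires at midnight every day, and `0 */12 * * *` fires at 00:00 and 12:00. A minimal sketch that assembles the crontab lines; the tool path, script name, and properties path are hypothetical, and the `--export --file` form is an assumption to verify for your release.

```python
# Hypothetical install and properties locations.
COMMAND = (
    "/opt/arcgis/portal/tools/webgisdr/webgisdr.sh "
    "--export --file /opt/arcgis/webgisdr.properties"
)

def cron_line(schedule, command):
    """Join a five-field cron schedule (minute hour day month weekday) and a command."""
    if len(schedule.split()) != 5:
        raise ValueError("a cron schedule has exactly five fields")
    return f"{schedule} {command}"

daily_at_midnight = cron_line("0 0 * * *", COMMAND)
every_twelve_hours = cron_line("0 */12 * * *", COMMAND)

print(daily_at_midnight)
print(every_twelve_hours)
```

Either line can be pasted into the editor opened by `crontab -e` for the account that runs ArcGIS Enterprise.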
High Availability
Overview
• What is High Availability
• ArcGIS Enterprise High Availability
• What’s New at 10.5 and 10.5.1 – Native Cloud implementations
• Other factors for High Availability
High Availability (HA)
• Definition:
- A system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." (SLAs)
• Shorter down time costs more
• Elimination of single points of failure.
• Availability of a system depends on the availability of all components
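The last two bullets can be made concrete with a little arithmetic. A minimal sketch, assuming component failures are independent: components that must all be up multiply their availabilities, while redundant copies of one component fail only when every copy fails. The function names and the 0.99 figures are illustrative.

```python
from math import prod

def serial_availability(availabilities):
    """All components are required: the system is up only when every one is up."""
    return prod(availabilities)

def redundant_availability(availability, copies=2):
    """Any one copy suffices: the system is down only when all copies are down."""
    return 1 - (1 - availability) ** copies

# Portal, Server, and Data Store each at 99%, one machine apiece.
single = serial_availability([0.99, 0.99, 0.99])

# The same three components, each deployed as a redundant pair.
paired = serial_availability([redundant_availability(0.99)] * 3)

print(f"single machines: {single:.4f}")
print(f"redundant pairs: {paired:.4f}")
```

With single machines the chain is available roughly 97% of the time, which is worse than any one component; pairing each component brings the chain to roughly 99.97%.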
[Diagram: ArcGIS Enterprise components. Portal for ArcGIS (portal), ArcGIS Server (GIS services), ArcGIS Data Store (hosted feature and tile data).]
Portal for ArcGIS: Highly Available Deployment
[Diagram: “Highly Available Portal”. A load balancer in front of the portal machines, which share the portal content.]
Highly Available Portal
• Two Portal machines
• Both Portal machines take requests
• Internally, there is a difference between the two machines’ role:
- Primary
- Standby
• Behavior differs depending on which machine goes down:
- Standby machine is down (or Portal service stops): no interruption
- Primary is down (or Portal service stops): for a minute or two, Portal behaves as if the internet is slow
Portal for ArcGIS: Load Balancing Options
3rd Party Load Balancer
• Not provided by Esri
• Typically already fault tolerant
ArcGIS Web Adaptor
• Provided by Esri
• Web-Tier Authentication
• Availability dependent on web servers
Portal for ArcGIS: HA Deployment Patterns
HA Portal w/ Load Balancer
[Diagram: load balancer, portal machines, shared portal content]
• Simpler
• Need certain settings on LB
• Doesn’t support Web Tier Authentication
HA Portal w/ Load Balancer & Web Adaptors
[Diagram: load balancer, web adaptors, portal machines, shared portal content]
• More complex
• Web Tier Authentication
Portal for ArcGIS: Health Check
• Provided by Portal for ArcGIS
- https://<webadaptor machine>.domain.com/<context>/portaladmin/healthCheck
- https://<machine>.domain.com:7443/arcgis/portaladmin/healthCheck
• Checks whether Portal is ready to take requests. It does not check individual components, e.g. services, items, etc.
• Or your own customized health check
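A customized check can wrap the Portal endpoint above. A minimal sketch, assuming the endpoint accepts `f=json` and answers with a JSON body whose `status` is `"success"` when Portal is ready; confirm both against your deployment. The `fetch` parameter is a hypothetical hook so the logic can be exercised without a network.

```python
import json
from urllib.request import urlopen

def portal_health_url(machine, port=7443, context="arcgis"):
    """Build the direct (non web adaptor) health check URL for one portal machine."""
    return f"https://{machine}:{port}/{context}/portaladmin/healthCheck?f=json"

def _http_get(url, timeout):
    """Default fetcher: return (status code, body text)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")

def is_healthy(url, fetch=None, timeout=5):
    """True only for HTTP 200 with a JSON body reporting status 'success' (assumed shape)."""
    fetch = fetch or _http_get
    try:
        status, body = fetch(url, timeout)
    except OSError:  # connection refused, timeout, HTTP error responses
        return False
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "success"
    except (ValueError, AttributeError):  # not JSON, or not a JSON object
        return False
```

Usage: `is_healthy(portal_health_url("portal1.domain.com"))`, where the machine name is a placeholder.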
Portal for ArcGIS: Key Considerations for HA
• Two Portal machines
- Primary
- Standby
- Behaves a little bit differently when one machine is down
• Highly Available Load Balancer
- Web Tier Authentication
- No single Web Adaptor
• Health Check provided for Portal for ArcGIS
• Highly Available shared content store
ArcGIS Server: Multiple-Machine Architecture
[Diagram: ArcGIS Enterprise with an ArcGIS Server site behind a load balancer; the site shares the configuration store and server directories.]
• Multiple machines
• Identical Roles
• No interruption when any machine is down
• The config-store and server directories need to be accessible to all machines.
ArcGIS Server: HA Deployment Patterns
Server Site w/ Load Balancer & Web Adaptors
[Diagram: load balancer, web adaptors, server machines, shared config-store and server directories]
Server Site w/ Load Balancer
[Diagram: load balancer, server machines, shared config-store and server directories]
ArcGIS Server: Health Check
• Provided by ArcGIS Server
- https://<…..domain.com>/<context>/rest/info/healthcheck
- https://<machine>.domain.com:6443/arcgis/rest/info/healthcheck
• Server-level health check. It does not check individual services.
• Or your own customized health check
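A load balancer uses the health check to decide which machines in the site receive traffic. A minimal sketch of that decision, using the direct 6443 URL pattern above; `probe` is a hypothetical caller-supplied function that returns True when the URL answers as healthy, and the machine names in the usage note are placeholders.

```python
def server_health_url(machine, port=6443, context="arcgis"):
    """Build the direct health check URL for one ArcGIS Server machine."""
    return f"https://{machine}:{port}/{context}/rest/info/healthcheck"

def machines_in_rotation(machines, probe):
    """Keep only machines whose health check passes; probe errors count as failures."""
    healthy = []
    for machine in machines:
        try:
            ok = bool(probe(server_health_url(machine)))
        except Exception:
            ok = False
        if ok:
            healthy.append(machine)
    return healthy
```

Usage: `machines_in_rotation(["gis1.domain.com", "gis2.domain.com"], probe)` returns the subset that should keep receiving requests. Because all machines in a site have identical roles, any non-empty result can still serve traffic.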
Portal for ArcGIS and ArcGIS Server: Federation
[Diagram: Portal and Server federated. Communication runs over 443 on both sides; administrative communication runs over 7443 (Portal) and 6443 (Server).]
Portal for ArcGIS and ArcGIS Server: Federation
[Diagram: the same federation in an HA deployment, with load balancers in front of the portal machines (443, 7443) and the server machines (443, 6443).]
• Portalurl:443
• Services URL:443
• privatePortalurl:7443
• Administrative URL:6443
ArcGIS Server : Key Considerations for HA
• Multiple machines for scalability
• All machines have identical roles
- All Active roles
- No interruption when any machine is down or Server stops
• Highly Available Load Balancer
- Web Tier Authentication
- No single Web Adaptor
ArcGIS Server : Key Considerations for HA
• Highly Available shared config-store and server directories
• Health Check provided for ArcGIS Server
• Highly Available URLs when communicating with Portal
- Portal URL
- Private Portal URL
- Services URL
- Server Administrative URL
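One way to review the four federation URLs is to confirm that none of them names an individual machine and that each uses the port drawn on the federation slides (443, 7443, 443, 6443). A minimal sketch under those assumptions; the dictionary keys, host names, and function names are hypothetical, and a real load balancer may legitimately listen on other ports.

```python
from urllib.parse import urlsplit

# Ports as shown on the federation slides; adjust for your load balancers.
EXPECTED_PORTS = {
    "portal_url": 443,
    "private_portal_url": 7443,
    "services_url": 443,
    "admin_url": 6443,
}

def effective_port(url):
    """Explicit port if present, else the scheme default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return 443 if parts.scheme == "https" else 80

def federation_url_problems(urls, machine_hosts):
    """List URLs that bypass the load balancer or use an unexpected port."""
    problems = []
    for name, expected in EXPECTED_PORTS.items():
        url = urls[name]
        host = urlsplit(url).hostname
        if host in machine_hosts:
            problems.append(f"{name} points at a single machine: {host}")
        if effective_port(url) != expected:
            problems.append(f"{name} uses port {effective_port(url)}, expected {expected}")
    return problems

# Hypothetical example values.
urls = {
    "portal_url": "https://gis.example.com/portal",
    "private_portal_url": "https://portal-lb.example.com:7443/arcgis",
    "services_url": "https://gis.example.com/server",
    "admin_url": "https://server1.example.com:6443/arcgis",
}
print(federation_url_problems(urls, {"server1.example.com", "server2.example.com"}))
```

Here the administrative URL is flagged because it names `server1` directly, so federation traffic would stop if that one machine went down.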
ArcGIS Data Store: High Availability Architecture
[Diagram: “Highly Available ArcGIS Data Store” with a primary and a standby machine and shared backups. The server site is the ArcGIS Data Store’s load balancer.]
• Define failure: the primary ArcGIS Data Store stops working
- Computer crashes
- Gets unplugged
- Loses network connectivity
- etc.
• Not a “graceful” shutdown
- Data Store service stops
• http://server.arcgis.com/en/documentation/ Search “Fail over scenarios”
ArcGIS Data Store: Failover Scenarios
ArcGIS Enterprise High Availability Deployment
[Diagram: the full HA deployment. “Highly Available Portal” behind a load balancer with shared portal content; the server site behind a load balancer with shared configuration store and server directories; “Highly Available ArcGIS Data Store” with primary and standby machines and shared backups.]
What’s New at 10.5 and 10.5.1 – Native Cloud Implementations
• Portal Content Store
- Azure Blob
- AWS S3
• Create Portal through portaladmin
• Use Esri deployment tools
- Azure Cloud Builder
- Esri Amazon CloudFormation templates
What’s New at 10.5 and 10.5.1 – Native Cloud Implementations
• Server config-store
- Azure Table and Azure Blob
- AWS DynamoDB and S3
• Create Site through serveradmin
• Use Esri deployment tools
- Azure Cloud Builder
- Esri Amazon CloudFormation templates
What’s New at 10.5 and 10.5.1 – Native Cloud Implementations
• Cloud Store
- Amazon S3
- Azure Blob
• Caching Directory
- Consume Cache
- Cache management is coming in a future release
• Data Input Directory
• Backup/Restore to Cloud Storage
ArcGIS Enterprise HA: Part of Your HA Architecture
• Your Data
- Enterprise Geodatabase
- File-based data
• Software
- Web Server
- Software Load Balancer
• Hardware
- File Server
- Network
• People
- HA?
- IT strong?
GEOGRAPHIC REDUNDANCY
Disaster Recovery
Overview
• What is geographic redundancy
• Using the Web GIS DR tool
• Roadmap to being geographically redundant
• Recovering from failover
• Geographically separate data centers
• Components within data centers are typically highly available
• Duplicated configurations and data between the two data centers
• WebGIS DR Tool is used to move snapshots of data from primary to standby
• Complex disaster recovery option
Overview
Geographic Redundancy
[Diagram: a traffic manager in front of the East coast data center (primary) and the West coast data center (standby); addresses shown: 198.0.0.1 and 198.0.0.2.]
• Public Portal URL – https://mysite.esri.com/portal
• Services URL – https://mysite.esri.com/server
• Public portal URL and services URL need to be the same
• Referenced data paths need to be the same
Geographic Redundancy
[Diagram: East coast data center (primary) with WA1, WA2, P1, P2, S1, S2, DS1, DS2; West coast data center (standby) with WA3, WA4, P3, P4, S3, S4, DS3, DS4; a traffic manager in front of both.]
1. Duplicate the deployment between primary and standby data centers
2. Create snapshots of the primary data center
3. Apply snapshots to the standby data center
4. Monitor your standby data center
Roadmap for geographic redundancy
• Number of machines should be the same
• Identical URLs between data centers
- Public Portal URL
- Services URL
• Identical paths to data and connections to databases or enterprise geodatabases
[Diagram: Primary Data Center (P1, P2, S1, S2, DS1, DS2) duplicated by Standby Data Center (P3, P4, S3, S4, DS3, DS4).]
Duplication
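The duplication requirements lend themselves to an automated comparison of the two data centers. A minimal sketch, assuming each data center is described by a small record; the `DataCenter` fields and the data path are hypothetical, while the two URLs are the ones shown on the Geographic Redundancy slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCenter:
    name: str
    machine_count: int
    public_portal_url: str
    services_url: str
    data_paths: frozenset  # referenced data paths and database connections

def duplication_problems(primary, standby):
    """Compare a standby against the primary and list what breaks duplication."""
    problems = []
    if primary.machine_count != standby.machine_count:
        problems.append("number of machines differs")
    if primary.public_portal_url != standby.public_portal_url:
        problems.append("public portal URLs differ")
    if primary.services_url != standby.services_url:
        problems.append("services URLs differ")
    missing = primary.data_paths - standby.data_paths
    if missing:
        problems.append("paths missing on standby: " + ", ".join(sorted(missing)))
    return problems

east = DataCenter("east", 6, "https://mysite.esri.com/portal",
                  "https://mysite.esri.com/server", frozenset({"/data/gis"}))
west = DataCenter("west", 6, "https://mysite.esri.com/portal",
                  "https://mysite.esri.com/server", frozenset({"/data/gis"}))
print(duplication_problems(east, west))
```

An empty list means the standby mirrors the primary on the points the slide calls out; anything else is worth fixing before the first snapshot is applied.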
Creating snapshots
• Full snapshot
- Create an initial snapshot of all of the data within the ArcGIS Enterprise
- Internally defines a base time that will be used for an incremental snapshot
• Incremental snapshot
- Creates a snapshot of all of the data that has been created or modified since the last full snapshot
- Decreases the time it takes to synchronize content, services, and data between primary and standby
Creating incremental snapshots
Creates a snapshot of all data added or modified since the last full snapshot
[Figure: weekly timeline, Sunday through the following Sunday]
• Creates a snapshot of all data added or modified since the last full snapshot
- Portal
- Server
- Data Store
Creating incremental snapshots
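A weekly rhythm of one full snapshot followed by incrementals can be expressed as a small date rule. A minimal sketch, assuming the full snapshot is taken on Sunday (suggested by the weekly timeline, but an assumption) and that the mode names are `full` and `incremental`; function names are illustrative.

```python
from datetime import date, timedelta

FULL_SNAPSHOT_WEEKDAY = 6  # date.weekday(): Monday is 0, Sunday is 6

def snapshot_mode(day):
    """Full snapshot on the chosen weekday, incremental on every other day."""
    return "full" if day.weekday() == FULL_SNAPSHOT_WEEKDAY else "incremental"

def last_full_snapshot(day):
    """Date of the full snapshot that an incremental taken on `day` is measured from."""
    days_back = (day.weekday() - FULL_SNAPSHOT_WEEKDAY) % 7
    return day - timedelta(days=days_back)

wednesday = date(2017, 7, 12)
print(snapshot_mode(wednesday), "changes since", last_full_snapshot(wednesday))
```

Because each incremental covers everything since the last full snapshot, Wednesday's incremental also contains Monday's and Tuesday's changes, and it grows through the week until the next full snapshot resets the base time.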
• QA process on standby ArcGIS Enterprise
- Checking the index within Portal
- Validating federated Servers
- Validating data stores using Server Admin
- Checking important services or applications
• Detecting when components fail within a data center
- Monitoring the healthCheck URLs of Portal and Server
• Failing over data centers should be a manual, deliberate decision
Monitoring and Failover
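Monitoring can be automated while the failover decision stays manual. A minimal sketch of that split: the monitor counts consecutive failed health checks and produces a notification once a threshold is reached, but it never redirects traffic itself. The class name, threshold, and message text are hypothetical.

```python
class DataCenterMonitor:
    """Tracks consecutive health check failures and raises a notification, nothing more."""

    def __init__(self, name, threshold=3):
        self.name = name
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result; return a message when the threshold is first reached."""
        if healthy:
            self.consecutive_failures = 0
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures == self.threshold:
            return (f"{self.name}: {self.threshold} consecutive health check failures; "
                    "an operator should decide whether to fail over")
        return None

monitor = DataCenterMonitor("primary")
for result in (True, False, False, False, False):
    message = monitor.record(result)
    if message:
        print(message)
```

Requiring several failures in a row avoids paging someone for a single slow response, and returning a message instead of acting keeps the switch to the standby a deliberate human step.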
Recovering from a failure
[Diagram sequence: traffic manager, East coast data center (primary), West coast data center (standby)]
1. Bringing the primary back online
2. Move data back to primary
3. Point traffic manager back to primary
4. Resume applying snapshots to standby
Geographic Redundancy – Cloud deployments
[Diagram: traffic manager, East Coast Region, West Coast Region]
Geographic Redundancy – Cloud deployments
[Diagram: traffic manager, East Coast Region, West Coast Region, Central Region*]
*Support to store WebGIS DR backups in an Azure container coming at 10.6
Business Continuity Requires More Than Technology
Drills
Monitoring & Notifications
Documented Practices
People
Governance
• IT Managed
• Strong Technical Team
• Knowledge of GIS & IT
• Business Alignment
• Established SLAs
• Knowledge Management
• Training
• It is important to understand the requirements of geographic redundancy as a disaster recovery option
• Take advantage of the Web GIS DR tool to move snapshots of the deployment from primary to standby
• Geographic redundancy is a complex disaster recovery option
Takeaway points