best practices for zero downtime: when outages are not an option
TRANSCRIPT
Best Practices for Zero Downtime: When Outages Are Not an Option
Michael Smith Consulting Member of Technical Staff Hector Pujol Consulting Member of Technical Staff October 2 , 2014
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
3
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Defining Platinum Level Service
Application Continuity
Application Performance During Outages
1
2
3
4
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Defining Platinum Level Service
5
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Maximum Availability Architecture (MAA)
• Oracle-optimized architecture • Extensive HA best practices
– Oracle Database – Oracle Fusion Middleware – Oracle Applications – Cloud Control – Oracle Engineered Systems – Partner solutions
http://www.oracle.com/goto/maa
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Availability Service Levels for Unplanned and Planned Outages Oracle MAA Availability Tiers
BRONZE
SILVER
GOLD • Comprehensive HA and Disaster Protection • Recovery in seconds with zero or near-zero data loss
• High Availability (HA) for Recoverable Local Outages • Backups plus redo for Oracle data protection
• Basic Service Restart • Backups plus redo for Oracle data protection
PLATINUM • Zero Outage for Platinum Ready Applications • Zero data loss
7
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Reference Architectures Oracle MAA Availability Tiers
BRONZE
SILVER
GOLD
PLATINUM
Single Instance
Replication
Backups
Platinum-Ready Apps
Clusters
Backups
Clusters
Clusters and Replication
8
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Zero Application Outage for Platinum Ready Applications Platinum Tier
Site A Site B
Backups
Backups
Outages masked from applications, in-flight transactions preserved
– Application Continuity Zero data loss failover, LAN or WAN
– Active Data Guard / Far Sync Bi-directional replication and zero
downtime maintenance – Oracle GoldenGate
Online patching for applications – Edition-based Redefinition
Automated workload management – Oracle Global Data Services
Tape
Data Guard sync
Tape
Application Continuity
GoldenGate
RAC RAC
9
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Platinum-Level High Availability and Disaster Recovery
Zero Outage for Platinum Ready Applications
Events Downtime Data Loss Potential
Database instance failure Zero application outage Zero
Recoverable server failure Zero application outage Zero
Data corruption, database unable to restart, site failure Zero application outage Zero
Online file move, reorganization/redefinition, patching Zero application outage Zero Hardware or operating system maintenance and database patches that cannot be done online but are qualified for RAC rolling install
Zero application outage Zero
Database upgrades: patch sets, full database releases Zero application outage Zero
Platform migrations Zero application outage Zero
Application upgrades that modify database objects Zero application outage Zero
Plan
ned
Mai
nten
ance
U
npla
nned
O
utag
es
10
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
11
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• FAN enabled fast outage notification and recovery • Application continuity now masks outages from applications • Outage scenarios are tricky for applications to handle • Available in 12c with Standalone JDBC and packaged with:
– Oracle Universal Connection Pool (UCP) – Oracle WebLogic Server 12c – Oracle JDBC-based third party application servers
Masking Outages from Your Application
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Application Continuity masks these outages: – Process or service failure – Instance failure – Node failure – Database failure
• The application will not see errors – It may see a response time delay depending on the type of failure
Masking Outages from Your Application
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Key Components (Challenges) of Application Continuity – Services, FAN, load balancing, and FCF for the connection pool – Recoverable Errors – Commit Outcome using Transaction Guard – Request boundaries – Verification of expected database call results – Session consistency and integrity checks – Safe restoration of mutable values at replay
How it works
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
How it works
1. Work Request
2. DB Calls
5. Response
Oracle Database 12c
4. Replay
3. Errors
App Server running either Weblogic, 12.1.0.2 JDBC, UCP, or 3rd party app server with UCP or JDBC
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Assess your application for: – Request boundaries
• Automatic with WebLogic, UCP, 3rd Party app servers using UCP, or Oracle JDBC 12.1.0.2 with PooledConnection interface
– Eliminate deprecated JDBC concrete classes – Decide to disable calls that should NOT be replayed
• UTL_HTTP, use of external wall clock time, or other external dependencies
– Decide to allow restore of mutable functions • SYSDATE, SYSTIMESTAMP, SYS_GUID, sequence.NEXTVAL
– Session state changes requiring callbacks
Assess your Application
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Configure UCP to use the replay driver and set pool settings • Configure Java Net Connections • Configure Service Attributes • Grant Mutables • Check resources for client and server
Configuration Steps
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Replay Driver – UCP Example of using the replay driver: try { pds = PoolDataSourceFactory.getPoolDataSource(); pds.setConnectionFactoryClassName("oracle.jdbc.replay.OracleDataSourceImpl"); …
• Pool Settings – Ensure the pool doesn’t close down an inactive connection too soon
pds.setONSConfiguration("nodes=maa05:6200,maa06:6200,maa07:6200,maa08:6200") pds.setFastConnectionFailoverEnabled(true); pds.setInactiveConnectionTimeout(86400); pds.setConnectionWaitTimeout(86400); pds.setTimeoutCheckInterval(3600); pds.setAbandonedConnectionTimeout(86400);
Configure JDBC Replay Driver and Pool Settings
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity Configure Client Connections
url = "jdbc:oracle:thin:@(DESCRIPTION =
(CONNECT_TIMEOUT=90) (RETRY_COUNT=30)(RETRY_DELAY=10)
(ADDRESS_LIST = (LOAD_BALANCE=on)
(ADDRESS = (PROTOCOL = TCP)(HOST=austin-scan)(PORT=1521))
(ADDRESS = (PROTOCOL = TCP)(HOST=houston-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME = oltpworkload)))"
Retry while service is unavailable
Safe for logon storms
New
Balance IP’s in Scan
19
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Service attributes – FAILOVERTYPE : Set this to TRANSACTION to enable Application Continuity – COMMIT_OUTCOME : Set this to TRUE to enable Transaction Guard (mandatory) – REPLAY_INIT_TIME : Duration after which replay is not started – FAILOVERRETRY : Number of connection retries for each replay attempt. – FAILOVERDELAY : Delay in seconds between connection retries
Configure Service Attributes
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Configuring for Outages - Services
• Create role based services on primary and standby clusters – Ensures services start on the correct database – Efficient routing of work of different workloads
• Services should be created with Application Continuity and standby attributes previously mentioned
srvctl add service -db <db_unique_name> -service <service_name> [-role [PRIMARY][,PHYSICAL_STANDBY][,LOGICAL_STANDBY][,SNAPSHOT_STANDBY]] - failovertype TRANSACTION –commit_outcome TRUE
srvctl add service -db EMEA -service PLATINUM –role PRIMARY -failovertype TRANSACTION -commit_outcome TRUE -replay_init_time 600 -failoverretry 30 -failoverdelay 10
21
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Decide to allow mutables for delayed execution – You must understand how mutables are used before allowed to replay them – The ALTER SEQUENCE and GRANTs tell Application Continuity what to do
• Mutable Syntax: ALTER SEQUENCE.. [sequence object] [KEEP|NOKEEP];
GRANT KEEP SEQUENCE <SEQUENCE NAME> on sequence object [to USER] ;
- Needed for users that don’t own the sequence
GRANT [KEEP DATE TIME | KEEP SYSGUID].. [to USER]
Configure Approved Mutables
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• Middle Tier – Replay driver may use more memory than base driver – Amount of additional usage is negligibly more if number of calls/request are small – Allocate sufficient memory to the JVM (e.g. Xms4096m for 4 GB) to minimize
garbage collection impacts – Slight additional CPU usage for building proxy objects, etc
• Server Tier – Some additional CPU usage for managing the validation
Check Resources for Client and Server
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Continuity
• UCP Coding Best Practices – Get connections from the pool and CLOSE them after the request
• Request boundaries automatically set between getConnection() and close() • Keep request boundaries sensible for efficiency and timely planned maintenance operations
– Close connections as part of a finally{} block • Test connections to ensure they aren’t already closed nor null before closing them • Test connections with isValid() before rolling them back
Additional Tips
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
DBA Command Replays
srvctl stop service -db orcl -instance orcl2 -force YES
srvctl stop service -db orcl -node rws3 -force YES
srvctl stop service -db orcl -instance orcl2 –noreplay -force
srvctl stop service -db orcl -node rws3 –noreplay -force
alter system kill session … immediate YES
alter system kill session … noreplay
dbms_service.disconnect_session([service], dbms_service. noreplay)
Application Continuity Additional Tips on Killing Sessions
25
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Performance During Outages
26
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Defining Outages
• Planned – Software updates – Hardware changes
• Unplanned – Application outages – Partial site / Full site
27
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Planned Maintenance
• RAC rolling patch apply: – Drain resources from instance to be patched – Bring down instance – Apply patch – Restart instance and repeat on remaining instances
• Which patches are eligible for RAC rolling? – 96% of one off patches – All PSU/CPU – Nearly all bundle patches
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Planned Maintenance
• Data Guard Rolling upgrades for patchsets and major versions – Upgrade standby site – Drain connections from primary and switchover to standby – Reconnect connections to new primary at the higher version – Upgrade new standby via redo stream
• Same procedures as used for RAC rolling
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Planned Maintenance – Moving Connections
• Operational Steps 1. Check current status of the services
2. Stop services normally (not using FORCE option) on the node to be maintained
3. Disable the service to prevent unexpected restart
4. Disconnect long-running sessions after the current transaction completes
Run the DBMS_SERVICE.DISCONNECT_SESSION package
5. Repeat Steps 2 - 4 for all services affected by the maintenance
6. Shutdown the database instance IMMEDIATE
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• Operational Steps (Cont’d) 7. Perform desired maintenance
8. Start up instance(s) on the node
9. Enable services that were previously disabled
10. Start services that were stopped.
11. Observe that connections are appearing on the service again
12. Repeat this procedure for each node or instances that need to be shutdown for
maintenance
Planned Maintenance – Moving Connections
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
OLTP Workload During RAC Planned Maintenance
0
1000
2000
3000
4000
5000
6000
21:4
0:56
21
:45:
05
21:4
9:14
21
:53:
23
21:5
7:32
22
:01:
42
22:0
5:51
22
:10:
00
22:1
4:09
22
:18:
18
22:2
2:27
22
:26:
36
22:3
0:45
22
:34:
54
22:3
9:03
22
:43:
13
22:4
7:22
22
:51:
31
22:5
5:40
22
:59:
49
23:0
3:58
23
:08:
07
23:1
2:16
23
:16:
25
23:2
0:34
23
:24:
44
23:2
8:53
23
:33:
02
23:3
7:11
23
:41:
20
23:4
5:29
23
:49:
38
23:5
3:47
23
:57:
56
0:02
:05
0:06
:15
0:10
:24
0:14
:33
0:18
:42
0:22
:51
0:27
:00
0:31
:09
0:35
:18
0:39
:27
0:43
:36
0:47
:45
0:51
:55
0:56
:04
1:00
:13
1:04
:22
1:08
:31
1:12
:40
1:16
:49
1:20
:58
1:25
:07
1:29
:16
1:33
:25
1:37
:35
Transaction Per Second
No impact to application during service stop/starts
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Unplanned Outages
• Application Continuity provides zero application outage for: – RAC node / instance failure – Complete clusterware failure – Database failure – Site failure
• Full site failover vs partial site failover – Full site failover typically has redundant mid tiers at remote site
• Must start mid tiers at the remote site
– Partial site fails over application connections to the new primary • Mid tiers do not have to be restarted
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
MAA Configuration Options
• Second system deployed for remote DR – Asynchronous redo transport, Data Guard Maximum Performance – Active Data Guard: offload read-only reporting
Remote Disaster Recovery with Maximum Performance
Primary Remote Standby
Asynchronous Transport
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
MAA Configuration Options
• Second system deployed for local DR (within 200 miles) – Synchronous redo transport, Data Guard Maximum Availability – Active Data Guard: offload read-only reporting
Local Disaster Recovery with Zero Data Loss
Primary Local Standby
SYNC
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
MAA Configuration Options
• Dual standby configuration – Local standby is primary failover target with zero data loss – Remote standby is failover of last resort – Either is used to offload read-only workload, backups, rolling upgrades, test
Multi-Standby: Local HA Failover plus Geographic Protection
Primary Remote Standby
Asynchronous
Local Standby SYNC
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Integrated Application Failover
Standby Database
Database Tier
Application Tier
Database Services
Primary HA Standby
Data Guard SYNC Transport
Primary Database
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Data Guard failover, manual or automatic
1
Database services start automatically
2
Integrated Application Failover
Primary Database
Database Tier
Application Tier
Database Services
Primary HA Standby
Primary Database
FAN breaks clients out of TCP timeout, applications quickly reconnect to new primary
3
AC replays txns 4
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Demo
39
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Eliminating Logon Storms
• Before: 5000 simultaneous connections would experience some failures and take 46 seconds
• Now: 5000 simultaneous connections, zero client-side errors and all clients get connected in 6-8 seconds. – Increase listen backlog at OS level: net.core.somaxconn=6000
– Increase listener queuesize: TCP.QUEUESIZE=6000
• Opposite concern of throttling logon storms still leverage connection_rate from the listener
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Application Failover Timings UCP with AC/TG
FAN Event Failover Ends 24 Seconds Failover Begins
4-5 Seconds to build 1000 connections
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Wrap up
42
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Platinum Tier
Site A Site B
Backups
Backups
Bi-directional Replication
Application Continuity Outages masked from applications, in-
flight transactions preserved – Application Continuity
Zero data loss failover, LAN or WAN – Active Data Guard / Far Sync
Bi-directional replication and zero downtime maintenance
– Oracle GoldenGate Online patching for applications
– Edition-based Redefinition Automated workload management
– Oracle Global Data Services
Tape
Database Replication
sync
Tape
Zero Application Outage for Platinum Ready Applications
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
44
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 45