best practices for zero downtime: when outages are not an option

Best Practices for Zero Downtime: When Outages Are Not an Option

Michael Smith Consulting Member of Technical Staff Hector Pujol Consulting Member of Technical Staff October 2 , 2014

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |


Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

3


Program Agenda

Defining Platinum Level Service

Application Continuity

Application Performance During Outages

1

2

3

4


Defining Platinum Level Service

5


Maximum Availability Architecture (MAA)

• Oracle-optimized architecture • Extensive HA best practices

– Oracle Database – Oracle Fusion Middleware – Oracle Applications – Cloud Control – Oracle Engineered Systems – Partner solutions

http://www.oracle.com/goto/maa


Availability Service Levels for Unplanned and Planned Outages Oracle MAA Availability Tiers

BRONZE

SILVER

GOLD • Comprehensive HA and Disaster Protection • Recovery in seconds with zero or near-zero data loss

• High Availability (HA) for Recoverable Local Outages • Backups plus redo for Oracle data protection

• Basic Service Restart • Backups plus redo for Oracle data protection

PLATINUM • Zero Outage for Platinum Ready Applications • Zero data loss

7


Reference Architectures Oracle MAA Availability Tiers

BRONZE

SILVER

GOLD

PLATINUM

Single Instance

Replication

Backups

Platinum-Ready Apps

Clusters

Backups

Clusters

Clusters and Replication

8


Zero Application Outage for Platinum Ready Applications Platinum Tier

Site A Site B

Backups

Backups

Outages masked from applications, in-flight transactions preserved

– Application Continuity Zero data loss failover, LAN or WAN

– Active Data Guard / Far Sync Bi-directional replication and zero

downtime maintenance – Oracle GoldenGate

Online patching for applications – Edition-based Redefinition

Automated workload management – Oracle Global Data Services

Tape

Data Guard sync

Tape


GoldenGate

RAC RAC

9


Platinum-Level High Availability and Disaster Recovery

Zero Outage for Platinum Ready Applications

Events Downtime Data Loss Potential

Database instance failure Zero application outage Zero

Recoverable server failure Zero application outage Zero

Data corruption, database unable to restart, site failure Zero application outage Zero

Online file move, reorganization/redefinition, patching Zero application outage Zero Hardware or operating system maintenance and database patches that cannot be done online but are qualified for RAC rolling install

Zero application outage Zero

Database upgrades: patch sets, full database releases Zero application outage Zero

Platform migrations Zero application outage Zero

Application upgrades that modify database objects Zero application outage Zero

Plan

ned

Mai

nten

ance

U

npla

nned

O

utag

es

10



11



• FAN enabled fast outage notification and recovery • Application continuity now masks outages from applications • Outage scenarios are tricky for applications to handle • Available in 12c with Standalone JDBC and packaged with:

– Oracle Universal Connection Pool (UCP) – Oracle WebLogic Server 12c – Oracle JDBC-based third party application servers

Masking Outages from Your Application



• Application Continuity masks these outages: – Process or service failure – Instance failure – Node failure – Database failure

• The application will not see errors – It may see a response time delay depending on the type of failure

Masking Outages from Your Application



• Key Components (Challenges) of Application Continuity – Services, FAN, load balancing, and FCF for the connection pool – Recoverable Errors – Commit Outcome using Transaction Guard – Request boundaries – Verification of expected database call results – Session consistency and integrity checks – Safe restoration of mutable values at replay

How it works



How it works

1. Work Request

2. DB Calls

5. Response

Oracle Database 12c

4. Replay

3. Errors

App Server running either Weblogic, 12.1.0.2 JDBC, UCP, or 3rd party app server with UCP or JDBC



• Assess your application for: – Request boundaries

• Automatic with WebLogic, UCP, 3rd Party app servers using UCP, or Oracle JDBC 12.1.0.2 with PooledConnection interface

– Eliminate deprecated JDBC concrete classes – Decide to disable calls that should NOT be replayed

• UTL_HTTP, use of external wall clock time, or other external dependencies

– Decide to allow restore of mutable functions • SYSDATE, SYSTIMESTAMP, SYS_GUID, sequence.NEXTVAL

– Session state changes requiring callbacks

Assess your Application



• Configure UCP to use the replay driver and set pool settings • Configure Java Net Connections • Configure Service Attributes • Grant Mutables • Check resources for client and server

Configuration Steps



• Replay Driver – UCP Example of using the replay driver: try { pds = PoolDataSourceFactory.getPoolDataSource(); pds.setConnectionFactoryClassName("oracle.jdbc.replay.OracleDataSourceImpl"); …

• Pool Settings – Ensure the pool doesn’t close down an inactive connection too soon

pds.setONSConfiguration("nodes=maa05:6200,maa06:6200,maa07:6200,maa08:6200") pds.setFastConnectionFailoverEnabled(true); pds.setInactiveConnectionTimeout(86400); pds.setConnectionWaitTimeout(86400); pds.setTimeoutCheckInterval(3600); pds.setAbandonedConnectionTimeout(86400);

Configure JDBC Replay Driver and Pool Settings


Application Continuity Configure Client Connections

url = "jdbc:oracle:thin:@(DESCRIPTION =

(CONNECT_TIMEOUT=90) (RETRY_COUNT=30)(RETRY_DELAY=10)

(ADDRESS_LIST = (LOAD_BALANCE=on)

(ADDRESS = (PROTOCOL = TCP)(HOST=austin-scan)(PORT=1521))

(ADDRESS = (PROTOCOL = TCP)(HOST=houston-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME = oltpworkload)))"

Retry while service is unavailable

Safe for logon storms

New

Balance IP’s in Scan

19



• Service attributes – FAILOVERTYPE : Set this to TRANSACTION to enable Application Continuity – COMMIT_OUTCOME : Set this to TRUE to enable Transaction Guard (mandatory) – REPLAY_INIT_TIME : Duration after which replay is not started – FAILOVERRETRY : Number of connection retries for each replay attempt. – FAILOVERDELAY : Delay in seconds between connection retries

Configure Service Attributes


Configuring for Outages - Services

• Create role based services on primary and standby clusters – Ensures services start on the correct database – Efficient routing of work of different workloads

• Services should be created with Application Continuity and standby attributes previously mentioned

srvctl add service -db <db_unique_name> -service <service_name> [-role [PRIMARY][,PHYSICAL_STANDBY][,LOGICAL_STANDBY][,SNAPSHOT_STANDBY]] - failovertype TRANSACTION –commit_outcome TRUE

srvctl add service -db EMEA -service PLATINUM –role PRIMARY -failovertype TRANSACTION -commit_outcome TRUE -replay_init_time 600 -failoverretry 30 -failoverdelay 10

21



• Decide to allow mutables for delayed execution – You must understand how mutables are used before allowed to replay them – The ALTER SEQUENCE and GRANTs tell Application Continuity what to do

• Mutable Syntax: ALTER SEQUENCE.. [sequence object] [KEEP|NOKEEP];

GRANT KEEP SEQUENCE <SEQUENCE NAME> on sequence object [to USER] ;

- Needed for users that don’t own the sequence

GRANT [KEEP DATE TIME | KEEP SYSGUID].. [to USER]

Configure Approved Mutables



• Middle Tier – Replay driver may use more memory than base driver – Amount of additional usage is negligibly more if number of calls/request are small – Allocate sufficient memory to the JVM (e.g. Xms4096m for 4 GB) to minimize

garbage collection impacts – Slight additional CPU usage for building proxy objects, etc

• Server Tier – Some additional CPU usage for managing the validation

Check Resources for Client and Server



• UCP Coding Best Practices – Get connections from the pool and CLOSE them after the request

• Request boundaries automatically set between getConnection() and close() • Keep request boundaries sensible for efficiency and timely planned maintenance operations

– Close connections as part of a finally{} block • Test connections to ensure they aren’t already closed nor null before closing them • Test connections with isValid() before rolling them back

Additional Tips


DBA Command Replays

srvctl stop service -db orcl -instance orcl2 -force YES

srvctl stop service -db orcl -node rws3 -force YES

srvctl stop service -db orcl -instance orcl2 –noreplay -force

srvctl stop service -db orcl -node rws3 –noreplay -force

alter system kill session … immediate YES

alter system kill session … noreplay

dbms_service.disconnect_session([service], dbms_service. noreplay)

Application Continuity Additional Tips on Killing Sessions

25


Application Performance During Outages

26


Defining Outages

• Planned – Software updates – Hardware changes

• Unplanned – Application outages – Partial site / Full site

27


Planned Maintenance

• RAC rolling patch apply: – Drain resources from instance to be patched – Bring down instance – Apply patch – Restart instance and repeat on remaining instances

• Which patches are eligible for RAC rolling? – 96% of one off patches – All PSU/CPU – Nearly all bundle patches


Planned Maintenance

• Data Guard Rolling upgrades for patchsets and major versions – Upgrade standby site – Drain connections from primary and switchover to standby – Reconnect connections to new primary at the higher version – Upgrade new standby via redo stream

• Same procedures as used for RAC rolling


Planned Maintenance – Moving Connections

• Operational Steps 1. Check current status of the services

2. Stop services normally (not using FORCE option) on the node to be maintained

3. Disable the service to prevent unexpected restart

4. Disconnect long-running sessions after the current transaction completes

Run the DBMS_SERVICE.DISCONNECT_SESSION package

5. Repeat Steps 2 - 4 for all services affected by the maintenance

6. Shutdown the database instance IMMEDIATE


• Operational Steps (Cont’d) 7. Perform desired maintenance

8. Start up instance(s) on the node

9. Enable services that were previously disabled

10. Start services that were stopped.

11. Observe that connections are appearing on the service again

12. Repeat this procedure for each node or instances that need to be shutdown for

maintenance

Planned Maintenance – Moving Connections


OLTP Workload During RAC Planned Maintenance

0

1000

2000

3000

4000

5000

6000

21:4

0:56

21

:45:

05

21:4

9:14

21

:53:

23

21:5

7:32

22

:01:

42

22:0

5:51

22

:10:

00

22:1

4:09

22

:18:

18

22:2

2:27

22

:26:

36

22:3

0:45

22

:34:

54

22:3

9:03

22

:43:

13

22:4

7:22

22

:51:

31

22:5

5:40

22

:59:

49

23:0

3:58

23

:08:

07

23:1

2:16

23

:16:

25

23:2

0:34

23

:24:

44

23:2

8:53

23

:33:

02

23:3

7:11

23

:41:

20

23:4

5:29

23

:49:

38

23:5

3:47

23

:57:

56

0:02

:05

0:06

:15

0:10

:24

0:14

:33

0:18

:42

0:22

:51

0:27

:00

0:31

:09

0:35

:18

0:39

:27

0:43

:36

0:47

:45

0:51

:55

0:56

:04

1:00

:13

1:04

:22

1:08

:31

1:12

:40

1:16

:49

1:20

:58

1:25

:07

1:29

:16

1:33

:25

1:37

:35

Transaction Per Second

No impact to application during service stop/starts


Unplanned Outages

• Application Continuity provides zero application outage for: – RAC node / instance failure – Complete clusterware failure – Database failure – Site failure

• Full site failover vs partial site failover – Full site failover typically has redundant mid tiers at remote site

• Must start mid tiers at the remote site

– Partial site fails over application connections to the new primary • Mid tiers do not have to be restarted


MAA Configuration Options

• Second system deployed for remote DR – Asynchronous redo transport, Data Guard Maximum Performance – Active Data Guard: offload read-only reporting

Remote Disaster Recovery with Maximum Performance

Primary Remote Standby

Asynchronous Transport



• Second system deployed for local DR (within 200 miles) – Synchronous redo transport, Data Guard Maximum Availability – Active Data Guard: offload read-only reporting

Local Disaster Recovery with Zero Data Loss

Primary Local Standby

SYNC



• Dual standby configuration – Local standby is primary failover target with zero data loss – Remote standby is failover of last resort – Either is used to offload read-only workload, backups, rolling upgrades, test

Multi-Standby: Local HA Failover plus Geographic Protection

Primary Remote Standby

Asynchronous

Local Standby SYNC


Integrated Application Failover

Standby Database

Database Tier

Application Tier

Database Services

Primary HA Standby

Data Guard SYNC Transport

Primary Database


Data Guard failover, manual or automatic

1

Database services start automatically

2

Integrated Application Failover

Primary Database

Database Tier

Application Tier

Database Services

Primary HA Standby

Primary Database

FAN breaks clients out of TCP timeout, applications quickly reconnect to new primary

3

AC replays txns 4


Demo

39


Eliminating Logon Storms

• Before: 5000 simultaneous connections would experience some failures and take 46 seconds

• Now: 5000 simultaneous connections, zero client-side errors and all clients get connected in 6-8 seconds. – Increase listen backlog at OS level: net.core.somaxconn=6000

– Increase listener queuesize: TCP.QUEUESIZE=6000

• Opposite concern of throttling logon storms still leverage connection_rate from the listener


Application Failover Timings UCP with AC/TG

FAN Event Failover Ends 24 Seconds Failover Begins

4-5 Seconds to build 1000 connections


Wrap up

42


Platinum Tier

Site A Site B

Backups

Backups

Bi-directional Replication

Application Continuity Outages masked from applications, in-

flight transactions preserved – Application Continuity

Zero data loss failover, LAN or WAN – Active Data Guard / Far Sync

Bi-directional replication and zero downtime maintenance

– Oracle GoldenGate Online patching for applications

– Edition-based Redefinition Automated workload management

– Oracle Global Data Services

Tape

Database Replication

sync

Tape

Zero Application Outage for Platinum Ready Applications


Safe Harbor Statement The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

44

best practices for zero downtime: when outages are not an option

Documents