best practices for zero downtime: when outages are not an option

46

Upload: lamkhanh

Post on 30-Dec-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Best Practices for Zero Downtime: When Outages Are Not an Option
Page 2: Best Practices for Zero Downtime: When Outages Are Not an Option

Best Practices for Zero Downtime: When Outages Are Not an Option

Michael Smith Consulting Member of Technical Staff Hector Pujol Consulting Member of Technical Staff October 2 , 2014

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Page 3: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

3

Page 4: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Program Agenda

Defining Platinum Level Service

Application Continuity

Application Performance During Outages

1

2

3

4

Page 5: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Defining Platinum Level Service

5

Page 6: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Maximum Availability Architecture (MAA)

• Oracle-optimized architecture • Extensive HA best practices

– Oracle Database – Oracle Fusion Middleware – Oracle Applications – Cloud Control – Oracle Engineered Systems – Partner solutions

http://www.oracle.com/goto/maa

Page 7: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Availability Service Levels for Unplanned and Planned Outages Oracle MAA Availability Tiers

BRONZE

SILVER

GOLD • Comprehensive HA and Disaster Protection • Recovery in seconds with zero or near-zero data loss

• High Availability (HA) for Recoverable Local Outages • Backups plus redo for Oracle data protection

• Basic Service Restart • Backups plus redo for Oracle data protection

PLATINUM • Zero Outage for Platinum Ready Applications • Zero data loss

7

Page 8: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Reference Architectures Oracle MAA Availability Tiers

BRONZE

SILVER

GOLD

PLATINUM

Single Instance

Replication

Backups

Platinum-Ready Apps

Clusters

Backups

Clusters

Clusters and Replication

8

Page 9: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Zero Application Outage for Platinum Ready Applications Platinum Tier

Site A Site B

Backups

Backups

Outages masked from applications, in-flight transactions preserved

– Application Continuity Zero data loss failover, LAN or WAN

– Active Data Guard / Far Sync Bi-directional replication and zero

downtime maintenance – Oracle GoldenGate

Online patching for applications – Edition-based Redefinition

Automated workload management – Oracle Global Data Services

Tape

Data Guard sync

Tape

Application Continuity

GoldenGate

RAC RAC

9

Page 10: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Platinum-Level High Availability and Disaster Recovery

Zero Outage for Platinum Ready Applications

Events Downtime Data Loss Potential

Database instance failure Zero application outage Zero

Recoverable server failure Zero application outage Zero

Data corruption, database unable to restart, site failure Zero application outage Zero

Online file move, reorganization/redefinition, patching Zero application outage Zero Hardware or operating system maintenance and database patches that cannot be done online but are qualified for RAC rolling install

Zero application outage Zero

Database upgrades: patch sets, full database releases Zero application outage Zero

Platform migrations Zero application outage Zero

Application upgrades that modify database objects Zero application outage Zero

Plan

ned

Mai

nten

ance

U

npla

nned

O

utag

es

10

Page 11: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

11

Page 12: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• FAN enabled fast outage notification and recovery • Application continuity now masks outages from applications • Outage scenarios are tricky for applications to handle • Available in 12c with Standalone JDBC and packaged with:

– Oracle Universal Connection Pool (UCP) – Oracle WebLogic Server 12c – Oracle JDBC-based third party application servers

Masking Outages from Your Application

Page 13: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Application Continuity masks these outages: – Process or service failure – Instance failure – Node failure – Database failure

• The application will not see errors – It may see a response time delay depending on the type of failure

Masking Outages from Your Application

Page 14: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Key Components (Challenges) of Application Continuity – Services, FAN, load balancing, and FCF for the connection pool – Recoverable Errors – Commit Outcome using Transaction Guard – Request boundaries – Verification of expected database call results – Session consistency and integrity checks – Safe restoration of mutable values at replay

How it works

Page 15: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

How it works

1. Work Request

2. DB Calls

5. Response

Oracle Database 12c

4. Replay

3. Errors

App Server running either Weblogic, 12.1.0.2 JDBC, UCP, or 3rd party app server with UCP or JDBC

Page 16: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Assess your application for: – Request boundaries

• Automatic with WebLogic, UCP, 3rd Party app servers using UCP, or Oracle JDBC 12.1.0.2 with PooledConnection interface

– Eliminate deprecated JDBC concrete classes – Decide to disable calls that should NOT be replayed

• UTL_HTTP, use of external wall clock time, or other external dependencies

– Decide to allow restore of mutable functions • SYSDATE, SYSTIMESTAMP, SYS_GUID, sequence.NEXTVAL

– Session state changes requiring callbacks

Assess your Application

Page 17: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Configure UCP to use the replay driver and set pool settings • Configure Java Net Connections • Configure Service Attributes • Grant Mutables • Check resources for client and server

Configuration Steps

Page 18: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Replay Driver – UCP Example of using the replay driver: try { pds = PoolDataSourceFactory.getPoolDataSource(); pds.setConnectionFactoryClassName("oracle.jdbc.replay.OracleDataSourceImpl"); …

• Pool Settings – Ensure the pool doesn’t close down an inactive connection too soon

pds.setONSConfiguration("nodes=maa05:6200,maa06:6200,maa07:6200,maa08:6200") pds.setFastConnectionFailoverEnabled(true); pds.setInactiveConnectionTimeout(86400); pds.setConnectionWaitTimeout(86400); pds.setTimeoutCheckInterval(3600); pds.setAbandonedConnectionTimeout(86400);

Configure JDBC Replay Driver and Pool Settings

Page 19: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity Configure Client Connections

url = "jdbc:oracle:thin:@(DESCRIPTION =

(CONNECT_TIMEOUT=90) (RETRY_COUNT=30)(RETRY_DELAY=10)

(ADDRESS_LIST = (LOAD_BALANCE=on)

(ADDRESS = (PROTOCOL = TCP)(HOST=austin-scan)(PORT=1521))

(ADDRESS = (PROTOCOL = TCP)(HOST=houston-scan)(PORT=1521))) (CONNECT_DATA=(SERVICE_NAME = oltpworkload)))"

Retry while service is unavailable

Safe for logon storms

New

Balance IP’s in Scan

19

Page 20: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Service attributes – FAILOVERTYPE : Set this to TRANSACTION to enable Application Continuity – COMMIT_OUTCOME : Set this to TRUE to enable Transaction Guard (mandatory) – REPLAY_INIT_TIME : Duration after which replay is not started – FAILOVERRETRY : Number of connection retries for each replay attempt. – FAILOVERDELAY : Delay in seconds between connection retries

Configure Service Attributes

Page 21: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Configuring for Outages - Services

• Create role based services on primary and standby clusters – Ensures services start on the correct database – Efficient routing of work of different workloads

• Services should be created with Application Continuity and standby attributes previously mentioned

srvctl add service -db <db_unique_name> -service <service_name> [-role [PRIMARY][,PHYSICAL_STANDBY][,LOGICAL_STANDBY][,SNAPSHOT_STANDBY]] - failovertype TRANSACTION –commit_outcome TRUE

srvctl add service -db EMEA -service PLATINUM –role PRIMARY -failovertype TRANSACTION -commit_outcome TRUE -replay_init_time 600 -failoverretry 30 -failoverdelay 10

21

Page 22: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Decide to allow mutables for delayed execution – You must understand how mutables are used before allowed to replay them – The ALTER SEQUENCE and GRANTs tell Application Continuity what to do

• Mutable Syntax: ALTER SEQUENCE.. [sequence object] [KEEP|NOKEEP];

GRANT KEEP SEQUENCE <SEQUENCE NAME> on sequence object [to USER] ;

- Needed for users that don’t own the sequence

GRANT [KEEP DATE TIME | KEEP SYSGUID].. [to USER]

Configure Approved Mutables

Page 23: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• Middle Tier – Replay driver may use more memory than base driver – Amount of additional usage is negligibly more if number of calls/request are small – Allocate sufficient memory to the JVM (e.g. Xms4096m for 4 GB) to minimize

garbage collection impacts – Slight additional CPU usage for building proxy objects, etc

• Server Tier – Some additional CPU usage for managing the validation

Check Resources for Client and Server

Page 24: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Continuity

• UCP Coding Best Practices – Get connections from the pool and CLOSE them after the request

• Request boundaries automatically set between getConnection() and close() • Keep request boundaries sensible for efficiency and timely planned maintenance operations

– Close connections as part of a finally{} block • Test connections to ensure they aren’t already closed nor null before closing them • Test connections with isValid() before rolling them back

Additional Tips

Page 25: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

DBA Command Replays

srvctl stop service -db orcl -instance orcl2 -force YES

srvctl stop service -db orcl -node rws3 -force YES

srvctl stop service -db orcl -instance orcl2 –noreplay -force

srvctl stop service -db orcl -node rws3 –noreplay -force

alter system kill session … immediate YES

alter system kill session … noreplay

dbms_service.disconnect_session([service], dbms_service. noreplay)

Application Continuity Additional Tips on Killing Sessions

25

Page 26: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Performance During Outages

26

Page 27: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Defining Outages

• Planned – Software updates – Hardware changes

• Unplanned – Application outages – Partial site / Full site

27

Page 28: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Planned Maintenance

• RAC rolling patch apply: – Drain resources from instance to be patched – Bring down instance – Apply patch – Restart instance and repeat on remaining instances

• Which patches are eligible for RAC rolling? – 96% of one off patches – All PSU/CPU – Nearly all bundle patches

Page 29: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Planned Maintenance

• Data Guard Rolling upgrades for patchsets and major versions – Upgrade standby site – Drain connections from primary and switchover to standby – Reconnect connections to new primary at the higher version – Upgrade new standby via redo stream

• Same procedures as used for RAC rolling

Page 30: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Planned Maintenance – Moving Connections

• Operational Steps 1. Check current status of the services

2. Stop services normally (not using FORCE option) on the node to be maintained

3. Disable the service to prevent unexpected restart

4. Disconnect long-running sessions after the current transaction completes

Run the DBMS_SERVICE.DISCONNECT_SESSION package

5. Repeat Steps 2 - 4 for all services affected by the maintenance

6. Shutdown the database instance IMMEDIATE

Page 31: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

• Operational Steps (Cont’d) 7. Perform desired maintenance

8. Start up instance(s) on the node

9. Enable services that were previously disabled

10. Start services that were stopped.

11. Observe that connections are appearing on the service again

12. Repeat this procedure for each node or instances that need to be shutdown for

maintenance

Planned Maintenance – Moving Connections

Page 32: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

OLTP Workload During RAC Planned Maintenance

0

1000

2000

3000

4000

5000

6000

21:4

0:56

21

:45:

05

21:4

9:14

21

:53:

23

21:5

7:32

22

:01:

42

22:0

5:51

22

:10:

00

22:1

4:09

22

:18:

18

22:2

2:27

22

:26:

36

22:3

0:45

22

:34:

54

22:3

9:03

22

:43:

13

22:4

7:22

22

:51:

31

22:5

5:40

22

:59:

49

23:0

3:58

23

:08:

07

23:1

2:16

23

:16:

25

23:2

0:34

23

:24:

44

23:2

8:53

23

:33:

02

23:3

7:11

23

:41:

20

23:4

5:29

23

:49:

38

23:5

3:47

23

:57:

56

0:02

:05

0:06

:15

0:10

:24

0:14

:33

0:18

:42

0:22

:51

0:27

:00

0:31

:09

0:35

:18

0:39

:27

0:43

:36

0:47

:45

0:51

:55

0:56

:04

1:00

:13

1:04

:22

1:08

:31

1:12

:40

1:16

:49

1:20

:58

1:25

:07

1:29

:16

1:33

:25

1:37

:35

Transaction Per Second

No impact to application during service stop/starts

Page 33: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Unplanned Outages

• Application Continuity provides zero application outage for: – RAC node / instance failure – Complete clusterware failure – Database failure – Site failure

• Full site failover vs partial site failover – Full site failover typically has redundant mid tiers at remote site

• Must start mid tiers at the remote site

– Partial site fails over application connections to the new primary • Mid tiers do not have to be restarted

Page 34: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

MAA Configuration Options

• Second system deployed for remote DR – Asynchronous redo transport, Data Guard Maximum Performance – Active Data Guard: offload read-only reporting

Remote Disaster Recovery with Maximum Performance

Primary Remote Standby

Asynchronous Transport

Page 35: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

MAA Configuration Options

• Second system deployed for local DR (within 200 miles) – Synchronous redo transport, Data Guard Maximum Availability – Active Data Guard: offload read-only reporting

Local Disaster Recovery with Zero Data Loss

Primary Local Standby

SYNC

Page 36: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

MAA Configuration Options

• Dual standby configuration – Local standby is primary failover target with zero data loss – Remote standby is failover of last resort – Either is used to offload read-only workload, backups, rolling upgrades, test

Multi-Standby: Local HA Failover plus Geographic Protection

Primary Remote Standby

Asynchronous

Local Standby SYNC

Page 37: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Integrated Application Failover

Standby Database

Database Tier

Application Tier

Database Services

Primary HA Standby

Data Guard SYNC Transport

Primary Database

Page 38: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data Guard failover, manual or automatic

1

Database services start automatically

2

Integrated Application Failover

Primary Database

Database Tier

Application Tier

Database Services

Primary HA Standby

Primary Database

FAN breaks clients out of TCP timeout, applications quickly reconnect to new primary

3

AC replays txns 4

Page 39: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Demo

39

Page 40: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Eliminating Logon Storms

• Before: 5000 simultaneous connections would experience some failures and take 46 seconds

• Now: 5000 simultaneous connections, zero client-side errors and all clients get connected in 6-8 seconds. – Increase listen backlog at OS level: net.core.somaxconn=6000

– Increase listener queuesize: TCP.QUEUESIZE=6000

• Opposite concern of throttling logon storms still leverage connection_rate from the listener

Page 41: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Application Failover Timings UCP with AC/TG

FAN Event Failover Ends 24 Seconds Failover Begins

4-5 Seconds to build 1000 connections

Page 42: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Wrap up

42

Page 43: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Platinum Tier

Site A Site B

Backups

Backups

Bi-directional Replication

Application Continuity Outages masked from applications, in-

flight transactions preserved – Application Continuity

Zero data loss failover, LAN or WAN – Active Data Guard / Far Sync

Bi-directional replication and zero downtime maintenance

– Oracle GoldenGate Online patching for applications

– Edition-based Redefinition Automated workload management

– Oracle Global Data Services

Tape

Database Replication

sync

Tape

Zero Application Outage for Platinum Ready Applications

Page 44: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

44

Page 45: Best Practices for Zero Downtime: When Outages Are Not an Option

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 45

Page 46: Best Practices for Zero Downtime: When Outages Are Not an Option