
OpenEdge High Availability

Adam Backman
Grand Poobah – White Star Software

About the speaker

Head Winemaker – White Star Software
− One of the oldest and most respected consulting and training companies in the Progress OpenEdge sector

Lackey – DBAppraise
− Managed database services backed up by experienced Progress OpenEdge professionals, not rookies off the bench

Read a book or two

Snappy Dresser

Knows a bit about systems and OpenEdge

Agenda

Are you really 24X7?

Redundancy

Replication

Maintenance

Failing over

Conclusion

What is High Availability?

A real business need that requires full access to current data at any time of the day or night

Many sites are kind of 24X7 but only a small percentage of companies have real business requirements that necessitate access to the data 24 hours a day.

Some applications have high availability needs, but only during given hours, which simplifies maintenance

The need is growing every day

Are You Really 24X7?

Business runs 24 hours a day
− 3-shift manufacturing, utility, casino, website, …

Business needs access 24 hours
− Work during the day, report and plan at night

Weekend requirements

What is High Availability?

The ability to keep running your business

Continuous Access, which allows for failures with zero impact to the users

Minimally Invasive failure management, like using HACMP clustering with OpenEdge as a cluster service

Major Failover, where the physical location of the application must be changed

Minimal recovery time in case of disaster

It is not disaster recovery – DR is only used when HA fails

Before you begin

Understand your business

Understand the cost of downtime

Do not build a solution that costs more than what you are protecting

People

Who “owns” the data?

Be inclusive with invites; most will drop out

This is not solely an IT decision
− You are the keeper, not the owner of the data
− You know what is technically possible
− You know the cost of the tech needed to build the solution

The goal is to eliminate surprises if/when a problem occurs

Planning

Budget – it is not free

Hardware – fault tolerance, redundancy, …

Software – OpenEdge plus ALL the other stuff you have to run the operation

Knowledge – buy or rent

Time – schedule and outage time

Personnel constraints – who is on call and who is their backup

Causes of Downtime

Hardware
− Disks are most vulnerable as they are the only moving part, unless you have SSD
− Power – all the hardware requires power

Software
− OS bug
− OpenEdge (core or application) bug

Natural disaster
− Fire
− Flood

Sabotage

Human Error

Basic Rules

Good Hardware
− Trusted vendor
− Good support (local support if possible)

No Windows (OK, maybe 2008)

You need a good recovery plan

You will run with after imaging enabled (see the sketch below)
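A minimal sketch of enabling after imaging from the command line, assuming a hypothetical database name (mydb), an ai.st file describing the AI extents, and a hypothetical backup path; verify the exact steps against your OpenEdge release:

# add the AI extents described in ai.st while the database is offline (hypothetical file name)
prostrct add mydb ai.st

# after imaging can only be enabled against a freshly backed-up database
probkup mydb /backup/mydb.bak

# turn after imaging on
rfutil mydb -C aimage begin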

Redundancy

Hardware

Software

Personnel

Redundancy: Hardware

Power (UPS or UPS + generator)

Mirrored disks

Network – in machine and general network

Non-interleaved memory (some use FT memory)

Multiple CPUs

Support hardware (PCs, terminals, phones, …)

Complete failover environment

Hardware

Why have a UPS and a generator?
− UPS has limited capacity
− Generators can run for a long time
− Have a reliable source of extra fuel

Hardware

Do not let standby systems sit idle

Use them for development or test

Keep copies of all support files
− .pf
− .ini
− .d

Redundancy: Software

Host-based deployments are the least fault tolerant

Web-based deployments can provide a good environment, provided the AppServer calls are stateless

In the client/server model, remember that file servers need to be redundant as well

Redundancy: Software

NameServer on broadcast and clustered, or don’t use the NameServer at all

Cluster your AppServers so that if a single AppServer fails there is another to pick up the load

Redundancy: Staffing

Is the failover machine close?

Can it reliably be accessed remotely? (failure point)

Is it possible to call in additional resources?
− More hands
− Different skills
− Relief of tired staff

Is it necessary to support all functions or only core?

Replication of Data

Database data
− OpenEdge Replication (synchronous)
− Log-based replication (asynchronous)
− Hardware-based replication (?)

Application and user files
− OS utility (fsync, rsync, …) – see the sketch below
− Hardware (remote mirroring)
− Third-party (PolyServe)
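As a rough illustration of the OS-utility option, here is a minimal rsync sketch for pushing application and user files to a standby host; the paths and the host name (standby) are hypothetical:

# mirror the application directory to the standby host, removing files deleted on the source
rsync -avz --delete /app/live/ standby:/app/live/

# copy the support files (.pf, .ini, .d) without deleting anything on the target
rsync -avz /app/support/ standby:/app/support/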

Replication: OpenEdge

Pros:
− Supported product
− Synchronous
− Fast (really fast)

Cons:
− Cost
− Yet another thing to support
− Additional resource usage

Replication: Log-based

Pros:
− Cheap (not free, but close)
− Easy to set up and maintain (see the sketch below)

Cons:
− No formal support
− Additional resource utilization
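A minimal sketch of the manual AI-shipping cycle behind log-based replication, assuming after imaging is enabled on the source (proddb), the target (standbydb) was restored from a source backup and never started, and an /ai_archive directory exists on the standby host; all names are hypothetical and the rfutil options and output should be checked against your release:

# on the source: find the oldest full AI extent, ship it, then mark it empty for reuse
AIFILE=`rfutil proddb -C aimage extent full`
scp $AIFILE standby:/ai_archive/
rfutil proddb -C aimage extent empty $AIFILE

# on the target: roll the shipped extent forward into the standby database
rfutil standbydb -C roll forward -a /ai_archive/`basename $AIFILE`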

Hardware Replication

Pros:
− Easy setup
− Easy maintenance

Cons:
− Expensive
− Possibility of data corruption unless ALL writes are guaranteed

Maintenance

Script everything to eliminate human error (see the cron sketch below)

Scheduled maintenance
− Application changes
− Backups
− Index maintenance
− Adding space

Unscheduled maintenance
− Eliminate unscheduled maintenance by monitoring and trending
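A minimal cron sketch for running scripted maintenance off-hours; the scripts, schedule, and paths are hypothetical:

# crontab entries: scripted, logged, and scheduled outside busy times
30 1 * * *   /dba/scripts/nightly_backup.sh    >> /dba/logs/nightly_backup.log 2>&1
00 2 * * 0   /dba/scripts/weekly_idxcompact.sh >> /dba/logs/weekly_idxcompact.log 2>&1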

Maintenance: Application

Schema
− Use a fast schema add, then add the default value
− Still requires an outage for some changes due to table locks

Code changes
− If you are n-tier you can stop the AppServer to reduce the interruption
− Switch to a different propath and move clients over through natural attrition (see the sketch below)
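One possible way to stage a propath switch from a client start script; the directory layout, database name, and parameter file are hypothetical:

# point new sessions at the new r-code directory; existing sessions keep running the old code
PROPATH=/app/rcode/2024.2:/app/rcode/common
export PROPATH

# start the character client with the usual parameter file
mpro proddb -pf /app/config/client.pf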

Maintenance: Backups

Progress backup
− Reliable
− Online option (see the probkup sketch below)

Split mirror backup

Replication backup
− Eliminate overhead on production db
− Must be a no-recover backup for log-based replication
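A minimal sketch of an online Progress backup; the database and device names are hypothetical, and the prorest verify option should be checked against your release:

# full online backup of the production database while users stay connected
probkup online proddb /backup/proddb.bak

# partial verify of the backup media – reads the backup, restores nothing
prorest proddb /backup/proddb.bak -vp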

Maintenance: Index

Index rebuild cannot be run against a replicated database

Use index compact online (see the sketch below)

proutil <dbname> -C idxcompact <table.index>

Notes:
− Watch for open transactions, as idxcompact will do a significant amount of logging
− Schedule outside of busy times to allow replication to keep up
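A minimal wrapper sketch for compacting a handful of indexes one at a time; the database name, table.index names, and the 80% target are hypothetical:

# compact each index to roughly 80% utilization, one index at a time
for IDX in customer.cust-num order.order-num order-line.order-line
do
    proutil proddb -C idxcompact $IDX 80
done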

Maintenance: Add Space (Online and offline approaches)

Use prostrct addonline to add space while you are running

Process (see the sketch below):
− Make sure your umask is correct
− Validate your add.st file
− prostrct addonline db add.st
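A minimal sketch of the online add, assuming a hypothetical database name (proddb) and an add.st file in the database working directory:

# make sure new extents are created with workable permissions
umask 02

# review the structure additions before applying them
cat add.st

# add the new extents while the database stays up
prostrct addonline proddb add.st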

prostrct is supported for both source and target databases with the exception of prostrct unlock

Process (offline):
− Shut down source and target
− Make changes to source
− Make changes to target
− Start both databases

Maintenance

All maintenance should be scripted and tested in a test environment before proceeding with the production run
− Eliminate the human element (no typos)
− Know how long it will take
− Make sure maintenance does not cause a problem
− Apply and test schema changes thoroughly

Building a failover plan

Who
− Business and technical personnel
− Gets informed – email, conference call, call tree, …
− Makes decisions
− Does the work

What
− What resources are affected?

Where
− Location of physical resources
− Location of personnel
− Location of replacement/replication target

Building a failover plan - continued

When
− Times of backups
− Times of data archiving
− Times of backup archiving
− Times of log archiving

Why
− What are we protecting ourselves from?
− Why did we choose not to deal with some event?

Risk Assessment

Things to consider
− Risk – natural disaster, human caused, hardware, …
− Likelihood
− Impact to application environment
− Time to recover

It is OK to say we considered that and, in our eyes, it was not likely enough to justify a solution

Determine the dependency of each level
− Hardware requires power
− OpenEdge application requires PostalSoft

Solutions

Document redundancy where it exists

Document places where redundancy is missing or unknown (on purpose or by omission)

Ensure reasonable software update procedures are in place and documented

Verify security, division of responsibilities, and software release policies per layer

Develop a Risk Assessment form

Aspects of a failover plan

When
− When do we decide to move to the standby environment?
− Who makes the decision?
− Who does the work, along with a backup for who does the work
− Defined process
− Service level agreements with customers
− Milestones in the process

Why
− This is a tougher decision than you think
− Fix or flee – lost time vs. lost data

Documenting your plan

Your plan should be able to be executed by anyone

You cannot have too much detail

Automate as much of the process as possible to eliminate the human element

Document and automate both the failover and the failback

Test your plan

Switch over to your standby environment and run for a day or more

You don’t want to cause an extended outage testing your plan

You will only find issues if you run at full load

Do this at least once a year

Follow your document and correct mistakes as you go

Keep documents and support files up-to-date

Keep your failover and failback documents up-to-date

Keep contact lists up-to-date

Keep all individual process documents up-to-date

Keep copies of your support files
− Scripts
− Application (.pf, .ini, .properties, …)

Good password management

Keep everything accessible (online and hard copies)

Points to Remember

Build redundancy into all aspects of your operation

Look at the likelihood of a failure and its impact to the customer

Protect your entire application environment, both hardware and software

Build a total solution, but think about the cost/benefit of each component

Automate tasks to eliminate human error

Test your failover plan at least once a year

Questions?

Adam Backman

[email protected]

Thank you for your time!