Automating Disaster Recovery for PostgreSQL
TRANSCRIPT
Database Recovery
Creating an Automation Plan for Restoration
Preparation
Note database size, Postgres configuration
Enable archiving of database transactions
Continuous archive of WAL segments
Optional: Create restore points for PITR
Backup control function: pg_create_restore_point(name)
Can be done on each deploy
Initial Preparation
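The per-deploy restore point mentioned above can be scripted. A minimal Ruby sketch — the pg gem connection, the helper name, and the tag naming convention are assumptions, not part of the original talk:

```ruby
# require 'pg'  # the pg gem is needed for a real connection (assumption)

# Create a named restore point for later PITR; intended to run on each
# deploy. The tag format is a hypothetical convention (e.g. a deploy tag).
def create_restore_point(conn, tag)
  conn.exec_params('SELECT pg_create_restore_point($1);', [tag])
end

# Usage (assumes a local superuser connection):
#   conn = PG.connect(dbname: 'postgres', user: 'postgres')
#   create_restore_point(conn, "deploy-#{Time.now.strftime('%Y%m%d%H%M')}")
```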
Default logging depends on the packages used
Likely to be syslog or stderr
Have to use log_line_prefix to specify what's included
Can specify CSV format
Import to a table if needed
With CSV, no need to specify what's reported; all information is output
Logging
In postgresql.conf:
logging_collector = on (requires restart)
log_destination = 'csvlog'
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%a.log'
Logging
Records of every change made to the database's data files
Postgres maintains a write-ahead log in the pg_xlog/ subdirectory of the cluster's data directory
Can "replay" the log entries
Write Ahead Log (WAL) Files
https://github.com/wal-e/wal-e
Continuous WAL archiving Python tool
sudo python3 -m pip install wal-e[aws,azure,google,swift]
Works on most operating systems
Can push to S3, Azure Blob Store, Google Storage, Swift
Archiving WAL segments
If using a cloud-based solution, ensure proper roles and permissions for storing and retrieving
S3: IAM user roles and bucket policies
Azure: Custom Role-Based Access Control
Google Cloud Storage: Access Control Lists
Ensure the master can access and write to the bucket, and the backup can access and read
Don't use your root keys!
Storing WAL Files
Key commands:
backup-fetch
backup-push
wal-fetch
wal-push
delete
wal-e continuous archiving tool setup
/etc/wal-e.d/env environment variables (for S3):
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_REGION
WALE_S3_PREFIX
wal-e key commands
Pushes a base backup to storage
Point to the Postgres data directory
envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse backup-push /var/lib/pg/9.6/main
Recommend adding to a daily cron job
backup-push
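The daily cron job suggested above could wrap the backup-push command in a small script. A sketch in Ruby, matching the rest of the talk's tooling — the helper name and the cron schedule are examples, not the speaker's actual setup:

```ruby
# Build the backup-push command from the slide; the data directory default
# matches the example path shown above.
def backup_push_command(data_dir = '/var/lib/pg/9.6/main')
  "envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse backup-push #{data_dir}"
end

# A cron entry could then invoke it nightly, e.g. (hypothetical script path):
#   0 2 * * * postgres ruby /usr/local/bin/daily_backup_push.rb
# where the script simply runs: system(backup_push_command)
```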
List base backups
Should be able to run as the postgres user
Useful to test out wal-e configuration
backup-list
Restores a base backup from storage
Allows keyword LATEST for the latest base backup
Can specify a backup from backup-list
envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST
backup-fetch
Delete data from storage
Needs --confirm flag
Also accepts --dry-run
Accepts 'before', 'retain', 'everything'
wal-e delete [--confirm] retain 5
Deletes all backups and segment files older than the 5 most recent
delete
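The retention command above can be generated programmatically so the dry-run/confirm distinction is explicit. A sketch (envdir and binary paths follow the earlier slides; the helper name is hypothetical):

```ruby
# Build a wal-e delete command; without confirm: true this is effectively
# a dry run, since wal-e only deletes when --confirm is passed.
def delete_retain_command(count, confirm: false)
  parts = ['envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e delete']
  parts << '--confirm' if confirm
  parts << "retain #{count}"
  parts.join(' ')
end
```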
Use in the backup db's recovery.conf file to fetch WAL files
Accepts --prefetch parameter
Downloads additional WAL files while time is spent recovering
8 WAL files by default; can increase
wal-fetch
Set as archive_command in the master database server configuration
Increase throughput by pooling WAL segments together to send in groups
--pool-size parameter available (defaults to 8 as of version 0.7)
wal-push
archive_mode = on
Defaults to off; requires a database restart to take effect
archive_command = 'envdir /etc/wal-e.d/env/ /usr/local/wal-e/bin/wal-e --terse wal-push %p'
%p = relative path and filename of the WAL segment to be archived
Archiving WAL segments using wal-e
Avoid storing secret information in postgresql.conf
PostgreSQL users can check the pg_settings table and see archive_command
envdir as an alternative
Allows a command to use files as environment variables, with the file name as the key
Part of daemontools
Available in Debian; can write a wrapper script if not easily installable
envdir
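The file-per-variable layout that envdir reads can be set up with a short script. A sketch — the directory path and values are placeholders, not real credentials:

```ruby
require 'fileutils'

# Write an envdir-style directory: each file's name is a variable name and
# its contents are that variable's value.
def write_envdir(dir, vars)
  FileUtils.mkdir_p(dir)
  vars.each { |name, value| File.write(File.join(dir, name), value) }
end

# Usage (placeholder values):
#   write_envdir('/etc/wal-e.d/env',
#                'AWS_REGION' => 'us-east-1',
#                'WALE_S3_PREFIX' => 's3://some-bucket/production-database')
```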
S3 Archive
Restoring the Database
Spin up a server
Configure PostgreSQL settings
Create a recovery.conf file
Begin backup fetch
Start Postgres
Perform sample queries
Notify on success
Automated Restoration Script
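The checklist above can be sketched as an ordered pipeline. Every helper name below is hypothetical, standing in for the real script's implementation of each step:

```ruby
# Ordered steps of the restoration run, mirroring the slide's checklist.
RESTORE_STEPS = %i[
  spin_up_server
  configure_postgres_settings
  write_recovery_conf
  run_backup_fetch
  start_postgres
  run_sample_queries
  notify_success
].freeze

# Drive any object that implements the step methods; an exception in any
# step halts the run, leaving the failure visible in the logs.
def run_restore(runner)
  RESTORE_STEPS.each { |step| runner.public_send(step) }
end
```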
Script starts up an EC2 instance in AWS
Loads a custom AMI with scripts for setting up Postgres and starting the restoration, plus environment variables
Spinning up a server
Configure PostgreSQL settings
Create a recovery.conf file
Start backup fetch
Start Postgres
Perform sample queries
Notify on success
Automated Restoration Script
I, [2016-08-17T20:54:16.516658 #9196] INFO -- : Setting up configuration files
I, [2016-08-17T20:55:30.782533 #9300] INFO -- : Setup complete. Beginning backup fetch.
I, [2016-08-18T21:12:05.646145 #29825] INFO -- : Backup fetch complete.
I, [2016-08-18T22:20:06.445003 #29825] INFO -- : Starting postgres.
I, [2016-08-18T22:12:07.082780 #29825] INFO -- : Postgres started. Restore under way
I, [2016-08-18T24:12:07.082855 #29825] INFO -- : Restore complete. Reporting to Datadog
Install Postgres, tune postgresql.conf
Create recovery.conf
Done with a script or a configuration management/orchestration tool
May be quicker to start up with a script
Configure Postgres Settings
cat /var/lib/postgresql/9.6/main/recovery.conf
restore_command = 'envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse wal-fetch "%f" "%p"'
recovery_target_timeline = 'latest'
If point in time:
recovery_target_time = '2017-01-13 13:00:00'
or, to target a named restore point:
recovery_target_name = 'deploy tag'
recovery.conf setup
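A sketch of generating that recovery.conf from the restoration script — the helper name is hypothetical, and the point-in-time target is optional as described above:

```ruby
# Build recovery.conf contents for a wal-e restore; pass target_time only
# for point-in-time recovery (example timestamp from the slide).
def recovery_conf(target_time: nil)
  lines = [
    %q(restore_command = 'envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse wal-fetch "%f" "%p"'),
    "recovery_target_timeline = 'latest'"
  ]
  lines << "recovery_target_time = '#{target_time}'" if target_time
  lines.join("\n") + "\n"
end
```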
wal_e.main INFO MSG: starting WAL-E DETAIL: The subcommand is "backup-fetch". STRUCTURED: time=2017-02-16T16:22:33.088767-00 pid=5444
wal_e.worker.s3.s3_worker INFO MSG: beginning partition download DETAIL: The partition being downloaded is part_00000000.tar.lzo. HINT: The absolute S3 key is production-database/basebackups_005/base_000000010000230C00000039_00010808/tar_partitions/part_00000000.tar.lzo.
fetch log output
"archive recovery complete" text in the csv log
recovery.conf file -> recovery.done
Checking for Completion

require 'date'

def restore_complete?
  # %a = abbreviated day name, matching the log_filename pattern set earlier
  day = Date.today.strftime('%a')
  !`less /var/log/postgresql/postgresql-#{day}.csv | grep "archive recovery complete" | tail -n 1`.empty?
end
2017-03-02 21:52:44.282 UTC,,,5292,,58b89426.14ac,12,,2017-03-02 21:52:38 UTC,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""
2017-03-02 21:52:44.386 UTC,,,5292,,58b89426.14ac,13,,2017-03-02 21:52:38 UTC,1/0,0,LOG,00000,"MultiXact member wraparound protections are now enabled",,,,,,,,,""
2017-03-02 21:52:44.389 UTC,,,5290,,58b89426.14aa,3,,2017-03-02 21:52:38 UTC,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2017-03-02 21:52:44.389 UTC,,,5592,,58b8942c.15d8,1,,2017-03-02 21:52:44 UTC,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
Checking for Completion
Run queries against the database
Timestamps of frequently updated tables
Checking for Completion
Checking for Completion

def latest_session_page_timestamp
  PG.connect(dbname: 'procore', user: 'postgres')
    .exec("SELECT created_at FROM session_pages ORDER BY created_at DESC LIMIT 1;")[0]["created_at"]
end
Checking for Completion

DETAIL: The partition being downloaded is part_00000000.tar.lzo.
`cat /var/log/syslog | grep "The partition being downloaded is" | tail -n 1`
Reporting Completion

def report_back_results
  Datadog::Statsd.new('localhost', 8125).event(
    "Recovery successful",
    "Latest session page: #{latest_session_page_timestamp} latest transaction log: #{latest_transaction_log}",
    tags: ['domain:disaster-recovery.backup']
  )
end
Reporting Completion
Things to look out for
Incompatible configurations between the Postgres recovery server and the master db server
Instance not large enough to hold the recovered db
Incorrect keys for wal-e configuration
Check Postgres logs for troubleshooting!
Things to look out for
Run through the script, ssh to the server periodically to check in on logs
Double-check the final recorded transaction log and a frequently updated table timestamp
Don't wait for something to go wrong to test this!
Untested backups are not backups!
Testing Notes
Questions?
(Also, hi, yes, Procore is hiring!)
Tweet at me @enkei9
Email at: [email protected]@procore.com