Automating Disaster Recovery for PostgreSQL
TRANSCRIPT
Database Recovery
Creating an Automation Plan for Restoration
Preparation
Note database size, Postgres configuration
Enable archiving of database transactions
Continuous archive of WAL segments
Optional: Create restore points for PITR
Backup control function: pg_create_restore_point(name)
Can be done on each deploy
Initial Preparation
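The per-deploy restore point mentioned above can be scripted. A minimal Ruby sketch — the pg gem connection, the helper name, and the tag naming convention are assumptions, not part of the original talk:

```ruby
# require 'pg'  # the pg gem is needed for a real connection (assumption)

# Create a named restore point for later PITR; intended to run on each
# deploy. The tag format is a hypothetical convention (e.g. a deploy tag).
def create_restore_point(conn, tag)
  conn.exec_params('SELECT pg_create_restore_point($1);', [tag])
end

# Usage (assumes a local superuser connection):
#   conn = PG.connect(dbname: 'postgres', user: 'postgres')
#   create_restore_point(conn, "deploy-#{Time.now.strftime('%Y%m%d%H%M')}")
```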
Default logging depends on the packages used
Likely to be syslog or stderr
Have to use log_line_prefix to specify what's included
Can specify CSV format
Import to a table if needed
With CSV, no need to specify what's reported; all information is output
Logging
In postgresql.conf:
logging_collector = on (requires restart)
log_destination = 'csvlog'
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%a.log'
Logging
Records of every change made to the database's data files
Postgres maintains a write-ahead log in the pg_xlog/ subdirectory of the cluster's data directory
Can "replay" the log entries
Write Ahead Log (WAL) Files
https://github.com/wal-e/wal-e
Continuous WAL archiving Python tool
sudo python3 -m pip install wal-e[aws,azure,google,swift]
Works on most operating systems
Can push to S3, Azure Blob Store, Google Storage, Swift
Archiving WAL segments
If using a cloud-based solution, ensure proper roles and permissions for storing and retrieving
S3: IAM user roles and bucket policies
Azure: Custom Role-Based Access Control
Google Cloud Storage: Access Control Lists
Ensure the master can access and write to the bucket, and the backup can access and read
Don't use your root keys!
Storing WAL Files
Key commands:
backup-fetch
backup-push
wal-fetch
wal-push
delete
wal-e continuous archiving tool setup
/etc/wal-e.d/env environment variables (for S3):
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_REGION
WALE_S3_PREFIX
wal-e key commands
Pushes a base backup to storage
Point to the Postgres data directory
envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse backup-push /var/lib/pg/9.6/main
Recommend adding to a daily cron job
backup-push
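The daily cron job suggested above could wrap the backup-push command in a small script. A sketch in Ruby, matching the rest of the talk's tooling — the helper name and the cron schedule are examples, not the speaker's actual setup:

```ruby
# Build the backup-push command from the slide; the data directory default
# matches the example path shown above.
def backup_push_command(data_dir = '/var/lib/pg/9.6/main')
  "envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse backup-push #{data_dir}"
end

# A cron entry could then invoke it nightly, e.g. (hypothetical script path):
#   0 2 * * * postgres ruby /usr/local/bin/daily_backup_push.rb
# where the script simply runs: system(backup_push_command)
```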
List base backups
Should be able to run as the postgres user
Useful to test out wal-e configuration
backup-list
Restores a base backup from storage
Allows keyword LATEST for the latest base backup
Can specify a backup from backup-list
envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e backup-fetch /var/lib/postgresql/9.6/main LATEST
backup-fetch
Delete data from storage
Needs --confirm flag
Also accepts --dry-run
Accepts 'before', 'retain', 'everything'
wal-e delete [--confirm] retain 5
Deletes all backups and segment files older than the 5 most recent
delete
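The retention command above can be generated programmatically so the dry-run/confirm distinction is explicit. A sketch (envdir and binary paths follow the earlier slides; the helper name is hypothetical):

```ruby
# Build a wal-e delete command; without confirm: true this is effectively
# a dry run, since wal-e only deletes when --confirm is passed.
def delete_retain_command(count, confirm: false)
  parts = ['envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e delete']
  parts << '--confirm' if confirm
  parts << "retain #{count}"
  parts.join(' ')
end
```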
Use in the backup db's recovery.conf file to fetch WAL files
Accepts --prefetch parameter
Downloads additional WAL files while time is spent recovering
8 WAL files by default; can increase
wal-fetch
Set as archive_command in the master database server configuration
Increase throughput by pooling WAL segments together to send in groups
--pool-size parameter available (defaults to 8 as of version 0.7)
wal-push
archive_mode = on
Defaults to off; requires a database restart to take effect
archive_command = 'envdir /etc/wal-e.d/env/ /usr/local/wal-e/bin/wal-e --terse wal-push %p'
%p = relative path and filename of the WAL segment to be archived
Archiving WAL segments using wal-e
Avoid storing secret information in postgresql.conf
PostgreSQL users can check the pg_settings table and see archive_command
envdir as an alternative
Allows a command to use files as environment variables, with the file name as the key
Part of daemontools
Available in Debian; can write a wrapper script if not easily installable
envdir
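The file-per-variable layout that envdir reads can be set up with a short script. A sketch — the directory path and values are placeholders, not real credentials:

```ruby
require 'fileutils'

# Write an envdir-style directory: each file's name is a variable name and
# its contents are that variable's value.
def write_envdir(dir, vars)
  FileUtils.mkdir_p(dir)
  vars.each { |name, value| File.write(File.join(dir, name), value) }
end

# Usage (placeholder values):
#   write_envdir('/etc/wal-e.d/env',
#                'AWS_REGION' => 'us-east-1',
#                'WALE_S3_PREFIX' => 's3://some-bucket/production-database')
```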
S3 Archive
Restoring the Database
Spin up a server
Configure PostgreSQL settings
Create a recovery.conf file
Begin backup fetch
Start Postgres
Perform sample queries
Notify on success
Automated Restoration Script
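The checklist above can be sketched as an ordered pipeline. Every helper name below is hypothetical, standing in for the real script's implementation of each step:

```ruby
# Ordered steps of the restoration run, mirroring the slide's checklist.
RESTORE_STEPS = %i[
  spin_up_server
  configure_postgres_settings
  write_recovery_conf
  run_backup_fetch
  start_postgres
  run_sample_queries
  notify_success
].freeze

# Drive any object that implements the step methods; an exception in any
# step halts the run, leaving the failure visible in the logs.
def run_restore(runner)
  RESTORE_STEPS.each { |step| runner.public_send(step) }
end
```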
Script starts up an EC2 instance in AWS
Loads a custom AMI with scripts for setting up Postgres and starting the restoration, plus environment variables
Spinning up a server
Configure PostgreSQL settings
Create a recovery.conf file
Start backup fetch
Start Postgres
Perform sample queries
Notify on success
Automated Restoration Script
I, [2016-08-17T20:54:16.516658 #9196] INFO -- : Setting up configuration files
I, [2016-08-17T20:55:30.782533 #9300] INFO -- : Setup complete. Beginning backup fetch.
I, [2016-08-18T21:12:05.646145 #29825] INFO -- : Backup fetch complete.
I, [2016-08-18T22:20:06.445003 #29825] INFO -- : Starting postgres.
I, [2016-08-18T22:12:07.082780 #29825] INFO -- : Postgres started. Restore under way
I, [2016-08-18T24:12:07.082855 #29825] INFO -- : Restore complete. Reporting to Datadog
Install Postgres, tune postgresql.conf
Create recovery.conf
Done with a script or a configuration management/orchestration tool
May be quicker to start up with a script
Configure Postgres Settings
cat /var/lib/postgresql/9.6/main/recovery.conf
restore_command = 'envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse wal-fetch "%f" "%p"'
recovery_target_timeline = 'latest'
If point in time:
recovery_target_time = '2017-01-13 13:00:00'
or, to target a named restore point:
recovery_target_name = 'deploy tag'
recovery.conf setup
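A sketch of generating that recovery.conf from the restoration script — the helper name is hypothetical, and the point-in-time target is optional as described above:

```ruby
# Build recovery.conf contents for a wal-e restore; pass target_time only
# for point-in-time recovery (example timestamp from the slide).
def recovery_conf(target_time: nil)
  lines = [
    %q(restore_command = 'envdir /etc/wal-e.d/env /usr/local/wal-e/bin/wal-e --terse wal-fetch "%f" "%p"'),
    "recovery_target_timeline = 'latest'"
  ]
  lines << "recovery_target_time = '#{target_time}'" if target_time
  lines.join("\n") + "\n"
end
```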
wal_e.main INFO MSG: starting WAL-E DETAIL: The subcommand is "backup-fetch". STRUCTURED: time=2017-02-16T16:22:33.088767-00 pid=5444
wal_e.worker.s3.s3_worker INFO MSG: beginning partition download DETAIL: The partition being downloaded is part_00000000.tar.lzo. HINT: The absolute S3 key is production-database/basebackups_005/base_000000010000230C00000039_00010808/tar_partitions/part_00000000.tar.lzo.
fetch log output
"archive recovery complete" text in the csv log
recovery.conf file -> recovery.done
Checking for Completion

require 'date'

def restore_complete?
  # %a = abbreviated day name, matching the log_filename pattern set earlier
  day = Date.today.strftime('%a')
  !`less /var/log/postgresql/postgresql-#{day}.csv | grep "archive recovery complete" | tail -n 1`.empty?
end
2017-03-02 21:52:44.282 UTC,,,5292,,58b89426.14ac,12,,2017-03-02 21:52:38 UTC,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""
2017-03-02 21:52:44.386 UTC,,,5292,,58b89426.14ac,13,,2017-03-02 21:52:38 UTC,1/0,0,LOG,00000,"MultiXact member wraparound protections are now enabled",,,,,,,,,""
2017-03-02 21:52:44.389 UTC,,,5290,,58b89426.14aa,3,,2017-03-02 21:52:38 UTC,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2017-03-02 21:52:44.389 UTC,,,5592,,58b8942c.15d8,1,,2017-03-02 21:52:44 UTC,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
Checking for Completion
Run queries against the database
Timestamps of frequently updated tables
Checking for Completion
Checking for Completion

def latest_session_page_timestamp
  PG.connect(dbname: 'procore', user: 'postgres')
    .exec("SELECT created_at FROM session_pages ORDER BY created_at DESC LIMIT 1;")[0]["created_at"]
end
Checking for Completion

DETAIL: The partition being downloaded is part_00000000.tar.lzo.
`cat /var/log/syslog | grep "The partition being downloaded is" | tail -n 1`
Reporting Completion

def report_back_results
  Datadog::Statsd.new('localhost', 8125).event(
    "Recovery successful",
    "Latest session page: #{latest_session_page_timestamp} latest transaction log: #{latest_transaction_log}",
    tags: ['domain:disaster-recovery.backup']
  )
end
Reporting Completion
Things to look out for
Incompatible configurations between the Postgres recovery server and the master db server
Instance not large enough to hold the recovered db
Incorrect keys for wal-e configuration
Check Postgres logs for troubleshooting!
Things to look out for
Run through the script, ssh to the server periodically to check in on logs
Double-check the final recorded transaction log and a frequently updated table timestamp
Don't wait for something to go wrong to test this!
Untested backups are not backups!
Testing Notes
Questions?
(Also, hi, yes, Procore is hiring!)
Tweet at me @enkei9
Email at: [email protected]@procore.com