who wants a service with zero downtime? · postgresql’s replication part of core (fully open...
TRANSCRIPT
WHO WANTS A SERVICE WITH ZERO DOWNTIME?
… EVERYBODY
IS IT THAT GOOD?
NOT JUST TECHNOLOGY. RISKS, PROCEDURES, PEOPLE
FROM 0 TO ~100: BUSINESS CONTINUITY WITH POSTGRESQL
Giulio Calacoci Senior Developer @ 2ndQuadrant
DataOps 2019 Barcellona
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ABOUT MYSELF
▸ Open Source passionate since early 2k
▸ Member of the Italian and European PostgreSQL community
▸ Lean and DevOps practitioner
▸ Open Source Developer
▸ Member of the Barman team
▸ Continuous Delivery Architect @2ndQuadrant
▸ 24/7 support engineer @2ndQuadrant
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
BUSINESS CONTINUITY
▸ Disaster Recovery
▸ High Availability
▸ Types of disaster/failures
▸ Availability = Uptime / (Uptime + Downtime)
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
OBJECTIVES
▸ Recovery Point Objective (RPO)
▸ How much data can I afford to lose?
▸ Recovery Time Objective (RTO)
▸ How long will it take me to recover?
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SERVICE RELIABILITY
▸ Cost of downtime
▸ How many €/$/£/AUD/AED/…?
▸ Risk management
▸ SLI, SLO and SLA
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SOME NOTES FOR THIS PRESENTATION
▸ PostgreSQL on Linux
▸ Servers can be either physical or virtual
▸ Storage must be redundant
▸ RAID is required
▸ VOLUME: redundant disk mounted on a system
LET’S START
0. ONE POSTGRES SERVER
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Server name: HOPE
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ Why is RPO = ∞?
▸ Why is RTO = n/a?
▸ “Hope is not a strategy” (cit. Google)
▸ More common than you’d expect
10. ONE POSTGRES SERVER + LOGICAL BACKUPS
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Add systematic backups with pg_dump
LOGICAL BACKUP LOGICAL
BACKUPLOGICAL BACKUP …
Day 04AM
Day -1 4AM
Day -2 4AM
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ How do you feel now?
▸ Still: RPO = ∞ and RTO = n/a. Why?
▸ A backup is valid only if you have tested it
▸ Unfortunately, this is very common
20. ONE POSTGRES SERVER + LOGICAL BACKUPS + LOGICAL RESTORES
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Test your backups with pg_restore
LOGICAL BACKUP
Day 04AM
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
DEFINING SOME OBJECTIVES
▸ Measure time for pg_restore
▸ RPO = backup frequency
▸ RTO = maximum time of recovery
▸ Provision another server
▸ Configure another server (automated, right?)
▸ Time to restore the last backup (measure it)
HAVE WE REALLY THOUGHT ABOUT EVERYTHING?
TIME OF REACTION
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ Can this architecture work for you?
▸ We need reliable monitoring
▸ From now on, we assume we have it in place!
▸ We need to reduce both RPO and RTO
HOW?POINT-IN-TIME-RECOVERY
Using a time machine
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
POSTGRESQL’S PITR
▸ Part of core (fully open source)
▸ Rebuild a cluster at a point in time
▸ From crash recovery to sync streamrep (physical/logical)
▸ RPO = 0 (zero data loss)
▸ Hot base backup, continuous WAL archiving, Recovery
▸ API
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
BASIC CONCEPTS
▸ Continuous copy of WAL data (continuous archiving)
▸ Physical base backups
▸ Recovery:
▸ copy base backup to another location
▸ recovery mode (replay of WALs until target)
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
BARMAN
▸ Latest version: Barman 2.8
▸ Open Source (GNU GPL 3)
▸ Written in Python
▸ Developed and maintained by 2ndQuadrant
▸ Available at www.pgbarman.org
40. ONE POSTGRES SERVER + ONE BARMAN SERVER
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Continuous backup
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
BASIC CONCEPTS
▸ Remote backup and recovery
▸ Multiple server management
▸ Backup catalogue and WAL archive
▸ Retention policies
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
COPY METHOD
▸ PostgreSQL streaming
▸ Practical/Windows/Docker
▸ Rsync/SSH
▸ Incremental backup and recovery (via hard links)
▸ Parallel backup and recovery
▸ Network compression and bandwidth limitation
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
WAL SHIPPING METHOD
▸ “archiving”, through “archive_command”:
▸ RPO ~ 16MB of WAL data, or
▸ “archive_timeout”
▸ “streaming”, through streaming replication:
▸ “pg_receivewal” or “pg_receivexlog”
▸ continuous stream, RPO ~ 0
▸ PostgreSQL 9.2+ required
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
EXAMPLE FROM POSTGRESQL.CONF
archive_mode = on
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
archive_command = 'rsync -a %p
barman@HOST:/var/lib/barman/ID/incoming'
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
EXAMPLE FROM BARMAN.CONF[stark] description = “Tony Stark database" ssh_command = ssh postgres@stark conninfo = user=barman-avengers dbname=postgres host=stark retention_policy = RECOVERY WINDOW OF 6 MONTHS copy_method = rsync reuse_backup = link parallel_jobs = 4 archiver = true streaming_archiver = true slot_name = barman_streaming
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ How do you feel now?
▸ Still: RPO = ∞ and RTO = n/a. Why?
▸ A backup is valid only if you have tested it
▸ Barman reduces backup risks, does not exclude them
▸ Systematic tests (especially custom scripts)
▸ Business risk is very high
60. ONE POSTGRES SERVER + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Test your backups with barman
recover
WHAT A WASTE!
TESTING OR BI?HAVE YOU EVER THOUGHT OF USING IT FOR
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
HOOK SCRIPTS
▸ Barman has hook scripts:
▸ pre and post backup
▸ pre and post archiving
▸ with retry option (until the script returns SUCCESS)
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
EXAMPLE OF RECOVERY SCRIPT
▸ Write a bash script that:
▸ connects to a remote server via SSH
▸ stops the PostgreSQL server
▸ issues a “barman recover” with target “immediate”
▸ starts the PostgreSQL
▸ Set it as post-backup script
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SOME FOOD FOR THOUGHT
▸ Outcomes:
▸ Systematically test your backup
▸ Measure your recovery time
▸ Identical server? This is a backup server ready to start
▸ You can use a different data centre
▸ Be creative, PostgreSQL gives you infinite freedom!
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ RPO ~ 0 (your backups work, every time)
▸ RTO = Time of reaction + Recovery time
▸ Example: RPO ~0 and RTO < 1 day
▸ Acceptable or not acceptable?
▸ Entry level architecture for business continuity
▸ Priority now: improve RTO
HOW?REPLICATION
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
POSTGRESQL’S REPLICATION
▸ Part of core (fully open source)
▸ One master, multiple standby servers
▸ Evolution of PITR
▸ Standby server is in continuous recovery mode
▸ Hot standby (read-only)
▸ Both streaming (9.0+) and file based pulling of WAL
▸ Cascading from a standby
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SYNCHRONOUS REPLICATION
▸ Fine control (from global down to transaction level)
▸ 2-safe replication
▸ COMMIT of a write transactions waits until written on both the master and a standby (or more from 9.6)
▸ Read consistency of a cluster
▸ RPO = 0 (zero data loss)
80. TWO POSTGRES SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
barman_restore_wal
barman recover
Symmetric Cluster
master standby
STARK ROGERS
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
EXCERPT FROM ROGERS POSTGRESQL’S CONFIGURATIONpostgresql.conf:
hot_standby = on
recovery.conf:
standby_mode = ‘on' # Streaming primary_conninfo = ‘host=stark user=replica application_name=ha sslmode=require’ # Fallback via Barman restore_command = 'barman-wal-restore -U barman avengers stark %f %p’
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SWITCHOVER (PLANNED)
▸ Applications are paused (start of downtime)
▸ Shut down the master
▸ Allow the standby to catch up with the master
▸ Promote the standby
▸ Switch virtual IPs
▸ Resume applications (end of downtime)
▸ Reconfigure the former master as standby
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
FAILOVER (UNPLANNED)
▸ The master is down (start of downtime)
▸ Promote the standby
▸ Change the virtual IP
▸ DEGRADED SYSTEM
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
MANUAL SWITCHOVER AND FAILOVER
▸ Manual switchover != manual switchover procedure
▸ Manual switchover = manually triggered
▸ Automate the procedure!!!
▸ bash (good)
▸ Ansible (better)
▸ Enhance gradually
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
RECAP
▸ RPO ~ 0 (your backups work, every time)
▸ RTO = Time of reaction + Time of promotion
▸ Criticality: manual intervention
▸ Reliable monitoring
▸ Trained people (practice & docs!)
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
MANUAL FAILOVER VS AUTOMATED FAILOVER
▸ Risk management
▸ Split brain nightmare
▸ Automated is built on manual (test!)
▸ Your choice
▸ Very good solution for business continuity
▸ Uptime > 99.99% in a year
90. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
barman_restore_wal
barman recover
Synchronous
ZERO DATA LOSS
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
SYNCHRONOUS REPLICATION
▸ Primary: Barman
▸ Zero data loss backup
▸ Primary: Standby
▸ Zero data loss cluster (reduce RTO)
▸ Just one configuration line in PostgreSQL
▸ synchronous_standby_names = '1 (ha, barman_receive_wal)'
~100. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER + REPMGR (AUTO-FAILOVER)
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ARCHITECTURE
Potential synchronous
Synchronous
repmgr repmgr
repmgr witness
WHAT’S MORE?
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
PUSH THE BOUNDARIES
▸ Repeatable architectures in multiple data centres
▸ PgBouncer
▸ Virtual IPs
▸ S3 relay via Barman hook scripts
▸ Multiple standby servers and cascading replication
▸ Docker containers
▸ Logical replication backups
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
CONCLUSIONS
▸ Babysteps and KISS
▸ New? Explore and learn
▸ Practice is the only way to mastery (drills)
▸ Plan regular healthy downtimes
▸ Use switchovers to perform PostgreSQL updates
▸ Smart downtimes increase long-term uptime
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
ANY QUESTIONS?
▸ PostgreSQL: www.postgresql.org
▸ Barman: www.pgbarman.org #pgbarman
▸ PgBouncer: pgbouncer.github.io
▸ Repmgr: www.repmgr.org
▸ Our blog: blog.2ndquadrant.com
2ndquadrant.com
@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity
LICENCE
Attribution 4.0 International (CC BY 4.0)
You are free to:
▸ Share — copy and redistribute the material in any medium or format
▸ Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.