Life on a rollercoaster
Scaling the PostgreSQL backup and recovery
Federico Campoli
Transferwise
2 November 2016
Federico Campoli (Transferwise) Life on a rollercoaster 2 November 2016 1 / 52
Table of contents
1 2012 - Who am I? What I’m doing?
2 2013 - On thin ice
3 2014 - The battle of five armies
4 2015 - THIS IS SPARTA!
5 2016 - Breathe
6 Wrap up
Warning!
The story you are about to hear is true. Only the names have been changed to protect the innocent.
Dramatis personae
A brilliant startup: ACME
The clueless engineers: CE
An elephant on steroids: PG
The big cheese: HW
The real hero: DBA
In the beginning
Our story starts in the year 2012. The world was young and our DBA started a new, brilliant career at ACME.
After the usual onboarding period, our DBA was handed the production servers.
2012 - Who am I? What I’m doing?
Size does matter
PG, our powerful and friendly elephant, was used for storing the data in a multi-shard configuration. Not really big, actually, but very troubled indeed!
A small logger database - 50 GB
A larger configuration and auth database - 200 GB
Two archive databases - 4 TB each
One database for the business intelligence - 2 TB
Each database had a hot standby counterpart hosted on less powerful HW.
Our story tells the life of the BI database.
The carnival of monsters
In early 2013 our brave DBA addressed the several problems found in the current backup and recovery configuration.
Lagging standby
Suboptimal schema. Churn on large tables and a high WAL generation rate. The slave lagged just because autovacuum was running.
rsync used in the archive command. The WAL files were archived over the network using rsync+ssh. The *.ready files piling up in pg_xlog increased the risk of a cluster crash.
Base backup
Rudimentary init standby script. Just a pg_start_backup call followed by an rsync between the master and the slave.
The several tablespaces were synced using a single rsync process with the --delete option.
Slow dump
Remote pg_dump. Each cluster was dumped remotely on a separate server using the custom format.
The backup server had limited memory and CPU
Dump time between 3 hours and 2 days depending on the database size
The BI database was dumped on a daily basis, taking 14 to 18 hours.
2013 - On thin ice
Parallel rsync
Our DBA took the baby step approach. He started fixing one issue at a time without affecting ACME's activity.
The first enhancement was the init backup script.
Two bash arrays listed the origin and destination tablespaces
An empty bash array stored the rsync pids
The script called pg_start_backup
For each tablespace an rsync process was spawned and its pid was stored in the third array
A loop checked whether the pids were still present in the process list
When all the rsync processes had finished, pg_stop_backup was executed
An email was sent to DBA to tell him to start the slave
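The steps above can be sketched as follows. This is a minimal Python sketch, not the bash script described in the talk: the host name `standby`, the rsync options and both function names are assumptions made for illustration, and the pg_start_backup and pg_stop_backup calls are only indicated in comments.

```python
import subprocess
import time


def build_rsync_commands(origins, destinations, host="standby"):
    """Pair each origin tablespace with its destination, one rsync per pair.

    `host` and the rsync options are illustrative assumptions.
    """
    if len(origins) != len(destinations):
        raise ValueError("origin and destination lists differ in length")
    return [
        ["rsync", "-a", "--delete", f"{src}/", f"{host}:{dst}/"]
        for src, dst in zip(origins, destinations)
    ]


def run_in_parallel(commands, poll_seconds=1.0):
    """Spawn every command at once and wait until all of them have exited.

    Returns the exit codes in the same order as `commands`.
    """
    # Step before this call: SELECT pg_start_backup(...) on the master.
    # Counterpart of the "third array": the spawned processes.
    running = [subprocess.Popen(cmd) for cmd in commands]
    # Counterpart of the loop that looks for the pids in the process list.
    while any(proc.poll() is None for proc in running):
        time.sleep(poll_seconds)
    # Step after this call: SELECT pg_stop_backup() and the email to DBA.
    return [proc.returncode for proc in running]
```

Calling `run_in_parallel(build_rsync_commands(origins, destinations))` starts one rsync per tablespace; a non-zero entry in the returned list means that the corresponding tablespace needs another pass before the backup is stopped.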
Local archive
The high rate of WAL generation required a different archive strategy.
The archive command changed to a local copy
A simple rsync script copied the archives to the slave every minute
The script queried the slave remotely for the last restartpoint
The restartpoint was used by pg_archivecleanup on the master
Implementing this solution solved the *.ready files problem, but autovacuum still caused high lag.
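To illustrate the cleanup step, the sketch below selects the archived segments that are older than the slave's last restartpoint, which is roughly the selection pg_archivecleanup makes. It is an approximation in Python under stated assumptions: the function name is hypothetical, and `.backup` and `.partial` files are simply ignored.

```python
def segments_to_remove(archived_files, oldest_needed):
    """Return the archived WAL segments that precede `oldest_needed`.

    A WAL segment name has 24 hex digits: 8 for the timeline followed by
    16 for the log and segment numbers. Like pg_archivecleanup, this
    sketch ignores the timeline part when comparing names.
    """
    def is_segment(name):
        return len(name) == 24 and all(c in "0123456789ABCDEF" for c in name)

    if not is_segment(oldest_needed):
        raise ValueError(f"not a WAL segment name: {oldest_needed!r}")
    boundary = oldest_needed[8:]
    # Anything that is not a plain segment (e.g. history files) is kept.
    return sorted(
        name for name in archived_files
        if is_segment(name) and name[8:] < boundary
    )
```

The segment named in `oldest_needed` is kept, because the slave still needs it to restart.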
Autovacuum tune down
DBA investigated the autovacuum issue and finally found the cause.
The high lag on the slave appeared when autovacuum (or vacuum) hit a table that was being updated concurrently. This behaviour is normal and is caused by the standby code's design.
With large denormalised tables that are updated constantly, the only possible workaround was to increase the autovacuum cost delay to a large value (1 second or more).
When the autovacuum process reached an arbitrary cost during its execution, there was a 1 minute sleep before the activity resumed. The lag on the standbys disappeared at the cost of longer autovacuum runs.
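The trade-off can be seen in a simplified model of cost-based throttling: the worker accumulates a cost for every page it touches and sleeps each time the total reaches a limit. The sketch below is a toy model with made-up numbers, not PostgreSQL's implementation; the function and parameter names are hypothetical.

```python
def simulate_cost_based_delay(page_costs, cost_limit, delay_seconds):
    """Count the sleeps a throttled worker takes and the time spent asleep.

    The balance grows with every page; once it reaches `cost_limit`
    the worker sleeps for `delay_seconds` and the balance restarts at zero.
    """
    balance = 0
    sleeps = 0
    for cost in page_costs:
        balance += cost
        if balance >= cost_limit:
            sleeps += 1
            balance = 0
    return sleeps, sleeps * delay_seconds
```

With 100 pages costing 10 each and a limit of 200, the worker sleeps 5 times. A longer delay does not change the number of sleeps, only the time the run takes, which is why the lag disappeared while the autovacuum runs grew longer.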
In the meanwhile...
The CE decided to shard the business intelligence database using the hot standby copy
The three new databases initially had the same amount of data, which was slowly cleaned up later
But even with one third of the data on each shard, the daily dump was slow to the point of running over 24 hours and overlapping with the next one
A slowish dump
pg_dump connects to the running cluster like any other client; it pulls out the data in the COPY format
With the custom format the compression happens on the server where pg_dump runs
The backup server was hammered on both the network and the CPU
You got speed!
Our DBA wrote a bash script that performed the following steps:
Dump the database locally in custom format
Generate the file's md5 checksum
Ship the file to the backup server via rsync
Check the remote file's md5
Send a message to nagios reporting success or failure
The backup time for each cluster dropped dramatically to just 5 hours, including the copy and the checksum verification.
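The checksum steps can be sketched as follows. This is a Python illustration rather than the original bash script: the function names are hypothetical, and a plain file copy stands in for the rsync transfer so that the example stays self-contained.

```python
import hashlib
import shutil


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 of a file in chunks, so large dumps fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ship_and_verify(local_path, remote_path, copy=shutil.copyfile):
    """Copy the dump and compare checksums; True means the copy is intact.

    `copy` is a stand-in for the rsync step of the real procedure.
    """
    expected = md5_of_file(local_path)
    copy(local_path, remote_path)
    return md5_of_file(remote_path) == expected
```

The boolean result is what a wrapper would translate into the success or failure message for the monitoring system.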
Growing pains
Despite the business growth, the CE ignored the problems with the poor schema design.
Speed was achieved by brute force using expensive SSD storage
The amount of data stored in the BI db increased
The only accepted solution was to create new shards over and over again
By the end of 2013 the total size of the BI databases was 15 TB
In the meanwhile...
Our DBA upgraded all the PG clusters to version 9.2 with pg_upgrade
THANKS BRUCE!!!!!
2014 - The battle of five armies
Bloating data
Q1 2014 opened with another backup performance issue
The dump size increased over time
The database CPU usage increased constantly for no apparent reason
Most of the shards had their tablespace usage at 90%
Mostly harmless
Against all odds our DBA addressed the issue.
The table used by the BI database was causing the bloat
The table's design was technically a materialised view
The table was partitioned in some way
The table had a harmless hstore field
Where everybody added new keys just by changing the app code
And nobody did any housekeeping of their data
The row length jumped from 200 bytes to 1200 bytes in a few months
Each BI shard contained up to 2 billion rows...
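A back-of-the-envelope calculation shows what the row growth means at that scale. The sketch below only multiplies the two figures quoted above; it uses decimal terabytes and ignores row headers, indexes and any out-of-line storage, so the results are rough upper bounds and not measured sizes.

```python
ROWS_PER_SHARD = 2_000_000_000  # "up to 2 billion rows"


def table_size_tb(rows, row_bytes):
    """Raw row data in decimal terabytes (1 TB = 10**12 bytes)."""
    return rows * row_bytes / 1e12


size_before = table_size_tb(ROWS_PER_SHARD, 200)   # about 0.4 TB
size_after = table_size_tb(ROWS_PER_SHARD, 1200)   # about 2.4 TB
```

A sixfold increase in row length turns roughly 0.4 TB of row data into roughly 2.4 TB on a full shard.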
Just a little trick
Despite the impending doom and the CE's resistance, DBA succeeded in converting the hstore field to conventional table columns (SORRY OLEG!).
The storage usage dropped by 30%
The CPU usage dropped by 60%
The speed of ACME's product got a boost
ACME saved $BIG_BUNCH_OF_MONEY on the new HW otherwise required to shard the dying databases again
In the meanwhile...
DBA knew the fix was just a workaround
He asked the CE to help him redesign the schema
He told them things would become problematic again in just one year
Nobody listened
2015 - THIS IS SPARTA!
I hate to say that, but I told you so
As predicted by our DBA, the time required to back up the BI databases increased again, getting dangerously close to 24 hours.
Parallel is da way!
Back in 2013, PG 9.3 added the parallel export. But, to DBA's great disappointment, version 9.3 was initially cursed by some bugs causing data corruption. DBA could not use the parallel dump. However...
The parallel backup takes advantage of the snapshot export introduced in PG 9.2
The Debian packaging allows different PG major versions on the same machine
DBA installed the 9.3 client and used its pg_dump to dump the 9.2 clusters in parallel
It worked very well...
The wrapper script required some adjustments
Accept the -j parameter
Check if the 9.3+ client is installed
Override the format to directory if the parallel backup is possible
Adapt the checksum procedure to check the files in the dump directory
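The format override described above can be sketched like this. It is a Python illustration and not the original wrapper: the function names are hypothetical, while `-Fd`, `-Fc`, `-j` and `-f` are standard pg_dump options and the version string follows the usual `pg_dump --version` output.

```python
import re


def parse_client_version(version_output):
    """Extract (major, minor) from e.g. 'pg_dump (PostgreSQL) 9.3.5'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"cannot parse a version from {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def build_pg_dump_command(dbname, target, jobs, client_version):
    """Use the directory format with -j when possible, else the custom format."""
    parallel_possible = jobs > 1 and client_version >= (9, 3)
    if parallel_possible:
        # Parallel dumps are only available with the directory format.
        return ["pg_dump", "-Fd", "-j", str(jobs), "-f", target, dbname]
    return ["pg_dump", "-Fc", "-f", target, dbname]
```

With a 9.2 client the same call falls back to a single custom-format file, so the checksum step has to handle both a single file and a directory of files.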
...with just a little infinitesimal catch
All fine, right? Not exactly.
The restore test complained about the unknown parameter lock_timeout
The backup was the fastest since 2013
The schema was still the same as in 2013
The databases' performance was massively affected with 6 parallel jobs
DBA found that with just 4 parallel jobs the databases worked with minimal disruption
In the meanwhile...
Our DBA upgraded PG to the latest version, 9.4.
THANK YOU AGAIN BRUCE!!!!!
No more errors for the restore test.
2016 - Breathe
A new hope
The upgrade to PG 9.4 eased the performance issues and DBA had some time to breathe.
The script to ship the archived WAL was improved to support multiple slaves in cascading replication
Each slave had a dedicated rsync process, configurable with compression and protocol (rsync or rsync+ssh)
The script automatically determined the farthest slave by querying the remote control files and cleaned the local archive accordingly
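Choosing the farthest slave amounts to taking the oldest restart segment among all the slaves. The sketch below assumes the segment name of each slave's last restartpoint has already been extracted from its control file; the function name and the input mapping are hypothetical.

```python
def farthest_slave(restart_segments):
    """Return (slave, segment) for the slave that is furthest behind.

    `restart_segments` maps a slave name to the WAL segment file name of
    its last restartpoint. The first 8 hex digits (the timeline) are
    ignored in the comparison, which is a simplifying assumption.
    """
    if not restart_segments:
        raise ValueError("no slaves given")
    return min(restart_segments.items(), key=lambda item: item[1][8:])
```

The segment returned here is the oldest one any slave still needs, so the local archive can only be cleaned up to that point.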
A new hope
The init standby script switched to the rsync protocol
The automated restore script used ALTER SYSTEM, added in PG 9.4, to switch between the restore and production configurations
As a result, the restore time improved to at most 9 hours for the largest BI database (4.5 TB)
Working with BOFH JR, DBA wrapped the backup script in the $BACKUP_MANAGER pre- and post-execution hooks
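As an illustration of the configuration switch, the sketch below turns a dictionary of settings into ALTER SYSTEM statements followed by a configuration reload. The function name and the example settings are assumptions, not the values used in the restore script; a value of None resets the parameter.

```python
def alter_system_statements(settings):
    """Build ALTER SYSTEM statements for a settings profile.

    Names are checked to contain only letters, digits, '_' and '.';
    values are quoted with single quotes doubled.
    """
    statements = []
    for name, value in sorted(settings.items()):
        if not name.replace("_", "").replace(".", "").isalnum():
            raise ValueError(f"suspicious parameter name: {name!r}")
        if value is None:
            statements.append(f"ALTER SYSTEM RESET {name};")
        else:
            escaped = str(value).replace("'", "''")
            statements.append(f"ALTER SYSTEM SET {name} = '{escaped}';")
    # Reloading applies the settings that do not need a restart.
    statements.append("SELECT pg_reload_conf();")
    return statements
```

A restore profile such as `{"autovacuum": "off", "maintenance_work_mem": "2GB"}` could be applied before the restore and reverted afterwards with `{"autovacuum": None, "maintenance_work_mem": None}`. ALTER SYSTEM cannot run inside a transaction block, so each statement has to be sent on its own.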
The rise of the machines
In 2016 Q2 DBA finally completed the configuration for $DEVOP_TOOL and deployed the several scripts to the 17 BI databases with minimal effort.
Wrap up
BI database at a glance
Year  N. Databases  Average size  Total size  Version
2012  1             2 TB          2 TB        9.1
2013  5             3 TB          15 TB       9.2
2014  9             2.2 TB        19 TB       9.2
2015  13            2.7 TB        32 TB       9.4
2016  16            2.5 TB        40 TB       9.4
Few words of wisdom
Reading the product's source code is always a good practice.
Bad design can lead to disasters, in particular if the business is successful. It's never too early to book the CE on a SQL training course.
“One bad programmer can easily create two new jobs a year.” – David Parnas
If in doubt, ask your DBA for advice. If you don't have a DBA, get one hired ASAP!
Did you say hire?
WE ARE HIRING! https://transferwise.com/jobs/
That’s all folks!
QUESTIONS?
Boring legal stuff
LAPD badge - source wikicommons
Montparnasse derailment - source wikipedia
Base jumper - copyright Chris McNaught
Disaster girl - source memegenerator
Blue elephant - source memecenter
Commodore 64 - source memecenter
Deadpool - source memegenerator
Thin ice - source Boating on Lake Winnebago
Boromir - source memegenerator
Sparta birds - source memestorage
Darth Vader - source memegenerator
Angry old man - source memegenerator
Contacts and license
Twitter: 4thdoctor_scarf
Blog: http://www.pgdba.co.uk
Brighton PostgreSQL Meetup: http://www.meetup.com/Brighton-PostgreSQL-Meetup/
This document is distributed under the terms of the Creative Commons