
GitLab PostgresMortem: Lessons Learned

Alexey Lesovsky

alexey.lesovsky@dataegret.com

dataegret.com

01  31 January events
02  Failure's key points
03  Preventative measures

https://goo.gl/GO5rYJ

01  31 January events

17:20 - an LVM snapshot of the production database was taken.
19:00 - database load increased due to spam.
23:00 - the secondary's replication process started to lag behind.
23:30 - the PostgreSQL data directory was wiped.

02  Failure's key points

1.  LVM snapshots and staging provisioning.
2.  When a replica starts to lag.
3.  Do pg_basebackup properly - part 1.
4.  max_wal_senders was exceeded, but how?
5.  max_connections = 8000.
6.  pg_basebackup «stuck» - do pg_basebackup properly - part 2.
7.  strace: a good thing in the wrong place.
8.  rm or not rm?
9.  A bit about backup.
10. Different PG versions on production.
11. Broken mail.

Staging based on LVM snapshots

Snapshot impact on underlying storage.
Provisioning from backup.

When a replica started to lag

Re-initialize the standby.
Monitor replication with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use a WAL archive.
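Since the slide points at pg_stat_replication and wal_keep_segments, below is a minimal sketch of what that could look like on the primary. It assumes the PostgreSQL 9.6 naming used in the linked docs (in 10+ the functions are pg_current_wal_lsn()/pg_wal_lsn_diff() and the column is replay_lsn); the segment count is illustrative.

    # How far behind is each standby? pg_stat_replication is populated on the primary.
    psql -x -c "
      SELECT application_name, state,
             pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
               AS replay_lag_bytes
      FROM pg_stat_replication;"

    # Keep extra WAL on the primary while the standby is being repaired,
    # so it can catch up without a full re-seed (~16 GB at 16 MB per segment).
    psql -c "ALTER SYSTEM SET wal_keep_segments = 1000;"
    psql -c "SELECT pg_reload_conf();"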

Do pg_basebackup properly. Part 1

Run pg_basebackup into a clean (empty) directory.
Remove the «unnecessary» directory first.
Use mv instead of rm.
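A minimal sketch of re-seeding a standby the way the slide suggests: move the old directory aside instead of deleting it, then run pg_basebackup into an empty directory. Paths, service name and primary host are assumptions, not from the talk.

    DATADIR=/var/lib/postgresql/9.6/main

    # Stop the standby and move the broken data directory aside; delete it
    # later, once the new copy is confirmed good.
    sudo systemctl stop postgresql
    sudo mv "$DATADIR" "${DATADIR}.broken.$(date +%Y%m%d%H%M)"
    sudo install -d -o postgres -g postgres -m 700 "$DATADIR"

    # Re-seed the standby from the primary into the now-empty directory.
    sudo -u postgres pg_basebackup -h primary.example.com -U replication \
         -D "$DATADIR" -X stream -P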

max_wal_senders was exceeded

There was only one standby (which had failed).
Increase max_wal_senders.
Check who has taken the connections.
The limit was exceeded by concurrent pg_basebackups.
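A minimal sketch of checking what is occupying the WAL-sender slots before raising the limit; each streaming standby and each running pg_basebackup takes one slot. The value 10 is illustrative.

    # Who is using WAL-sender slots right now? (pg_basebackup shows up here too)
    psql -c "SELECT pid, application_name, state, client_addr
             FROM pg_stat_replication;"
    psql -c "SHOW max_wal_senders;"

    # max_wal_senders can only be changed with a restart of the primary.
    psql -c "ALTER SYSTEM SET max_wal_senders = 10;"
    # sudo systemctl restart postgresql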

max_connections = 8000

More than 500 is a bad idea.
Use pgbouncer to reduce the number of server connections.
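A minimal pgbouncer sketch in transaction-pooling mode, so thousands of client connections map onto a modest number of server connections; the database name, ports and pool sizes are illustrative, not from the talk.

    ; /etc/pgbouncer/pgbouncer.ini (illustrative)
    [databases]
    gitlabhq_production = host=127.0.0.1 port=5432

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 2000
    default_pool_size = 100

Clients connect to port 6432; PostgreSQL itself can then run with max_connections in the low hundreds instead of 8000. Note that transaction pooling breaks session-level features such as prepared statements and advisory locks, so pool_mode is a deliberate choice.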

Do pg_basebackup properly. Part 2

Don't run more than one pg_basebackup at a time.
It wasn't stuck, it was waiting for a checkpoint.
Use the «-c fast» option to force a fast checkpoint.
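By default pg_basebackup waits for the primary's next regular (spread) checkpoint before copying anything, which can look like a hang; «-c fast» requests an immediate checkpoint instead. A minimal sketch, with host and target directory as assumptions:

    sudo -u postgres pg_basebackup \
         -h primary.example.com -U replication \
         -D /var/lib/postgresql/9.6/main \
         -X stream -P -c fast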

A good thing in the wrong place

strace isn't a good tool for that case.
Use strace for tracing system call errors.
Check the stack trace via /proc/<pid>/stack or GDB.
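A minimal sketch of telling "stuck" apart from "waiting": the kernel stack and a GDB backtrace show where the process sits, while strace is better reserved for watching system calls and their errors. The PID is a placeholder.

    PID=12345

    # Kernel-side stack of the process (needs root):
    sudo cat /proc/$PID/stack

    # Userspace backtrace without attaching interactively:
    sudo gdb -p $PID -batch -ex 'bt'

    # strace, when what you actually need is the stream of syscalls and errors:
    sudo strace -p $PID -f -o /tmp/trace.$PID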

rm or not rm

Data directory was cleaned with rm.
Use mv instead of rm.

A bit about backup

Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Basebackup with WAL archive.
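For the last item, a minimal sketch of a base backup paired with continuous WAL archiving (PostgreSQL 9.6 settings; the archive destination and schedule are assumptions):

    # Continuous WAL archiving on the primary (wal_level/archive_mode need a restart).
    psql -c "ALTER SYSTEM SET wal_level = replica;"
    psql -c "ALTER SYSTEM SET archive_mode = on;"
    psql -c "ALTER SYSTEM SET archive_command =
             'test ! -f /backups/wal/%f && cp %p /backups/wal/%f';"

    # Periodic base backup to pair with the archived WAL (e.g. from cron).
    sudo -u postgres pg_basebackup -D /backups/base/$(date +%F) -X stream -c fast -P

With -X stream each base backup is self-contained, and the WAL archive additionally enables point-in-time recovery between backups.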

Different PG versions on production

Clean out old packages after a major upgrade.
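The hazard behind this bullet is a stale client binary (for example an old pg_dump) silently being picked up ahead of the one matching the server. A minimal sketch of checking for that; the Debian-style package query is an assumption about the platform:

    pg_dump --version                    # the binary cron actually invokes
    psql -Atc "SHOW server_version;"     # the running server

    # List every installed PostgreSQL package/major version (Debian/Ubuntu):
    dpkg -l 'postgresql*' | grep '^ii'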

Broken mail

Cron was set up, but notifications were forgotten.
Use a reliable notification system.
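A minimal sketch of a backup cron entry that reports failures somewhere that is actually watched, instead of relying on local mail only; the script paths and the on-call hook are placeholders:

    # /etc/cron.d/pg-backup (illustrative)
    MAILTO=dba-alerts@example.com
    0 2 * * * postgres /usr/local/bin/pg_backup.sh || /usr/local/bin/notify-oncall.sh "pg backup failed on db1"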

03  Preventative measures

1.  Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2.  Prometheus monitoring for backups.
3.  Set PostgreSQL's max_connections to a sane value.
4.  Investigate point-in-time recovery & continuous archiving for PostgreSQL.
5.  Hourly LVM snapshots of the production databases.
6.  Azure disk snapshots of production databases.
7.  Move staging to the ARM environment.
8.  Recover production replica(s).
9.  Automated testing of recovering PostgreSQL database backups.
10. Improve PostgreSQL replication documentation/runbooks.
11. Investigate pgbarman for creating PostgreSQL backups.
12. Investigate using WAL-E as a means of database backup and realtime replication.
13. Build Streaming Database Restore.
14. Assign an owner for data durability.

Preventative measures

1. Update PS1 across all hosts.
   Looks OK.

2. Prometheus monitoring for backups.
   Size, number, age and recovery status.

3. Set PostgreSQL's max_connections to a sane value.
   Better to use pgbouncer.

4. Investigate PITR & continuous archiving for PostgreSQL.
   Yes, as part of the backup strategy.
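For measure 2, a minimal sketch of exporting backup age and size through the node_exporter textfile collector, which Prometheus then scrapes and alerts on; directories and metric names are assumptions:

    BACKUP_DIR=/backups/base
    TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector

    latest=$(ls -1d "$BACKUP_DIR"/* 2>/dev/null | sort | tail -n 1)
    if [ -n "$latest" ]; then
        age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
        size=$(du -sb "$latest" | cut -f1)
    else
        age=-1; size=0
    fi

    {
      echo "# HELP pg_backup_last_age_seconds Age of the newest base backup."
      echo "# TYPE pg_backup_last_age_seconds gauge"
      echo "pg_backup_last_age_seconds $age"
      echo "# HELP pg_backup_last_size_bytes Size of the newest base backup."
      echo "# TYPE pg_backup_last_size_bytes gauge"
      echo "pg_backup_last_size_bytes $size"
    } > "$TEXTFILE_DIR/pg_backup.prom"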

Preventative measures

5. Hourly LVM snapshots of the production databases.
   Looks unnecessary.

6. Azure disk snapshots of production databases.
   Looks unnecessary.

7. Move staging to the ARM environment.
   Very, very suspicious.

8. Recover production replica(s).
   Do that ASAP.

Preventative measures

9. Automated testing of recovering database backups.
   YES!

10. Improve documentation/runbooks.
    You need a bureaucrat.

11. Investigate pgbarman.
    Looks OK, Barman is stable and reliable.

12. Investigate using WAL-E.
    Looks OK, WAL-E is «set up and forget».
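For measures 9 and 13, a minimal sketch of an automated restore test: start a throwaway instance from the newest base backup and run a sanity query. It assumes the base backups were taken with -X stream (so they are self-contained); paths and port are illustrative:

    RESTORE_DIR=/restore-test/data
    LATEST=$(ls -1d /backups/base/* | sort | tail -n 1)

    rm -rf "$RESTORE_DIR"
    cp -a "$LATEST" "$RESTORE_DIR"
    chown -R postgres:postgres "$RESTORE_DIR"
    chmod 700 "$RESTORE_DIR"

    # Start a throwaway instance on a spare port and check it answers queries.
    sudo -u postgres pg_ctl -D "$RESTORE_DIR" -o "-p 5499" -w start
    sudo -u postgres psql -p 5499 -Atc "SELECT count(*) FROM pg_database;" \
        || { echo "restore test FAILED"; exit 1; }
    sudo -u postgres pg_ctl -D "$RESTORE_DIR" -w stop -m fast
    echo "restore test OK"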

Preventative measures

13. Build Streaming Database Restore.
    Corresponds with point 9.

14. Assign an owner for data durability.
    Hire a DBA.

Lessons learned

Check and monitor backups.
Create emergency instructions.
Learn to use tools properly.

Links

Postmortem of database outage of January 31
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

PostgreSQL Statistics Collector: pg_stat_replication view
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW

pg_basebackup utility
https://www.postgresql.org/docs/current/static/app-pgbasebackup.html

PostgreSQL Replication
https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html

PgBouncer
https://pgbouncer.github.io/
https://wiki.postgresql.org/wiki/PgBouncer

Barman
http://www.pgbarman.org/

Thanks for watching!

dataegret.com
alexey.lesovsky@dataegret.com
