GitLab PostgresMortem: Lessons Learned

Alexey Lesovsky


Post on 13-Apr-2017


Page 1: GitLab PostgresMortem: Lessons Learned

GitLab PostgresMortem: Lessons Learned

Alexey Lesovsky

Page 2: GitLab PostgresMortem: Lessons Learned

01. 31 January events
02. Failure's key points
03. Preventative measures

https://goo.gl/GO5rYJ

dataegret.com

Page 3: GitLab PostgresMortem: Lessons Learned

01. 31 January events

Page 4: GitLab PostgresMortem: Lessons Learned

31 January events

17:20 - an LVM snapshot of the production database was taken.
19:00 - database load increased due to spam.
23:00 - the secondary's replication process started to lag behind.
23:30 - the PostgreSQL data directory on the primary was wiped.

Page 5: GitLab PostgresMortem: Lessons Learned

02. Failure's key points

Page 6: GitLab PostgresMortem: Lessons Learned

Failure's key points

1. LVM snapshots and staging provisioning.
2. When a replica starts to lag.
3. Do pg_basebackup properly, part 1.
4. max_wal_senders was exceeded, but how?
5. max_connections = 8000.
6. pg_basebackup «stuck»: do pg_basebackup properly, part 2.
7. strace: a good thing in the wrong place.
8. rm or not rm?
9. A bit about backup.
10. Different PG versions in production.
11. Broken mail.

Page 7: GitLab PostgresMortem: Lessons Learned

Staging based on LVM snapshots

Snapshots affect the performance of the underlying storage.
Provision staging from a backup instead.

Page 8: GitLab PostgresMortem: Lessons Learned

When a replica started to lag

Re-initialize the standby.
Monitor replication with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use a WAL archive.
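The monitoring advice above can be sketched as a small shell helper. This is only an illustration: it assumes PostgreSQL 9.6 names (pg_xlog_location_diff, pg_current_xlog_location, replay_location; PostgreSQL 10 and later rename them to pg_wal_lsn_diff, pg_current_wal_lsn and replay_lsn), and the function names and the threshold are made up for the example.

```shell
#!/bin/sh
# Sketch: report replication lag per standby, measured on the primary.
# Assumes PostgreSQL 9.6 function and column names (renamed in 10+).

# Print the lag query. Kept separate so it can be reviewed or reused.
replication_lag_sql() {
    cat <<'SQL'
SELECT application_name, client_addr, state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
           AS replay_lag_bytes
FROM pg_stat_replication;
SQL
}

# Succeed when the lag in bytes ($1) is above the threshold in bytes ($2).
lag_exceeds() {
    [ "$1" -gt "$2" ]
}

# Run the query on the primary (not called here: it needs a live server).
# Connection settings come from the usual PG* environment variables.
show_replication_lag() {
    replication_lag_sql | psql -X -A -t -F ' '
}
```

A check such as `lag_exceeds "$lag" 1073741824` could then drive an alert when a standby falls more than 1 GB behind. While the cause is investigated, a higher wal_keep_segments (wal_keep_size from PostgreSQL 13) keeps more WAL on the primary so the standby can still catch up.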

Page 9: GitLab PostgresMortem: Lessons Learned

Do pg_basebackup properly. Part 1

Run pg_basebackup into a clean directory.
Remove the «unnecessary» directory.
Use mv instead of rm.
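A minimal sketch of "mv instead of rm" combined with the clean-directory check. The function names and the ".old.<timestamp>" suffix are assumptions made for this example, not something prescribed by the slides.

```shell
#!/bin/sh
# Sketch: keep the old data directory instead of deleting it, then verify
# that the target is empty before a new base backup is taken into it.

# Succeed only if $1 is an existing directory with no entries.
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Rename $1 to $1.old.<timestamp>, recreate it empty, print the new name.
# Runs in a subshell so a failure cannot exit the calling shell.
set_aside() (
    dir=${1%/}
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; exit 1; }
    backup="$dir.old.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" || exit 1
    mkdir -m 700 "$dir" || exit 1
    echo "$backup"
)
```

The rename is cheap only when both names are on the same filesystem. The old copy can be deleted later, once the new standby is confirmed to work and the host has been double-checked.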

Page 10: GitLab PostgresMortem: Lessons Learned

max_wal_senders was exceeded.

There was only one standby (and it had failed).
Increase max_wal_senders.
Check who has taken the connections.
The limit was exceeded by concurrent pg_basebackup runs.
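One way to "check who has taken the connections" is to list the WAL senders on the primary before touching max_wal_senders. The sketch below assumes PostgreSQL 9.6 column names; the function names are illustrative.

```shell
#!/bin/sh
# Sketch: see who occupies WAL sender slots before raising max_wal_senders.

# One row per WAL sender: standbys show state 'streaming' or 'catchup',
# a running pg_basebackup shows state 'backup'.
wal_senders_sql() {
    cat <<'SQL'
SELECT pid, usename, application_name, client_addr, state, backend_start
FROM pg_stat_replication
ORDER BY backend_start;
SQL
}

# Print how many sender slots remain, given used ($1) and max_wal_senders ($2).
free_wal_senders() {
    echo $(( $2 - $1 ))
}

# Not called here: needs a live primary.
show_wal_senders() {
    wal_senders_sql | psql -X
    psql -X -A -t -c 'SHOW max_wal_senders;'
}
```

Note that a pg_basebackup run with -X stream takes two WAL sender connections, so a few concurrent runs can exhaust a small limit quickly.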

Page 11: GitLab PostgresMortem: Lessons Learned

max_connections = 8000

More than 500 is a bad idea.
Use pgbouncer to reduce the number of server connections.
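As an illustration of the pgbouncer advice, the following sketch writes a minimal pgbouncer.ini. The database name, addresses, file paths and pool sizes are placeholders chosen for the example; real values depend on the workload.

```shell
#!/bin/sh
# Sketch: write a minimal pgbouncer.ini so that many client connections
# share a small pool of server connections.
# Database name, addresses, paths and pool sizes are placeholders.

write_pgbouncer_ini() {
    cat > "$1" <<'INI'
[databases]
; clients connect to pgbouncer, which connects to the real server
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; return the server connection to the pool after each transaction
pool_mode = transaction
; clients allowed to connect to pgbouncer
max_client_conn = 8000
; server connections per user/database pair
default_pool_size = 50
INI
}
```

With a setup like this the application can keep thousands of client connections open while max_connections on the server stays in the low hundreds. Transaction pooling does not suit applications that rely on session state, so that choice needs checking per application.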

Page 12: GitLab PostgresMortem: Lessons Learned

Do pg_basebackup properly. Part 2

Don't run more than one pg_basebackup at a time.
It wasn't stuck; it was waiting for the checkpoint.
Use the «-c fast» option to request a fast checkpoint.
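Both points can be sketched in a small wrapper: refuse to start a second pg_basebackup and request an immediate checkpoint. The wrapper names, the DRY_RUN switch and the role name passed in are assumptions for this example; the pg_basebackup options themselves are standard.

```shell
#!/bin/sh
# Sketch: take one base backup at a time and ask for an immediate checkpoint.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Succeed if a pg_basebackup process is already running on this host.
basebackup_running() {
    pgrep -x pg_basebackup >/dev/null 2>&1
}

# $1 = primary host, $2 = replication role, $3 = empty target directory
#   -X stream  stream WAL while the backup runs
#   -c fast    request an immediate checkpoint instead of a spread one
#   -P         show progress
take_basebackup() {
    if basebackup_running; then
        echo "pg_basebackup is already running, not starting another" >&2
        return 1
    fi
    run pg_basebackup -h "$1" -U "$2" -D "$3" -X stream -c fast -P
}
```

Without -c fast the backup starts only after a spread checkpoint completes on the primary, which can take minutes and looks like a hang even though nothing is wrong.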

Page 13: GitLab PostgresMortem: Lessons Learned

strace: a good thing in the wrong place.

strace isn't the right tool in that case.
Use strace for tracing system call errors.
Check the stack trace from /proc/<pid>/stack or GDB.
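A rough sketch of the alternatives to strace mentioned above, for Linux only. The helper names are made up for this example; reading /proc/<pid>/stack and attaching GDB usually require root or matching privileges.

```shell
#!/bin/sh
# Sketch (Linux only): look at what a seemingly stuck process is doing
# without attaching strace to it.

# Print the scheduler state letter of PID $1 (R running, S sleeping,
# D uninterruptible sleep, ...), read from /proc/<pid>/status.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print the kernel stack of PID $1; usually requires root.
kernel_stack() {
    cat "/proc/$1/stack"
}

# Print the user-space backtrace of PID $1 with GDB, then detach.
# Not called here: needs gdb and permission to attach.
user_stack() {
    gdb -p "$1" -batch -ex bt
}
```

A process that sits in state S with a backtrace showing it waiting on the server is idle by design, which is a different situation from one that is failing system calls.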

Page 14: GitLab PostgresMortem: Lessons Learned

rm or not rm

The data directory was cleaned with rm.
Use mv instead of rm.

Page 15: GitLab PostgresMortem: Lessons Learned

A bit about backup

Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Basebackup with WAL archive.
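The last option, a base backup plus a WAL archive, depends on an archive_command that never overwrites an archived segment. The sketch below shows one possible archive script; the script path and archive directory in the comments are placeholders, and the function name is an assumption.

```shell
#!/bin/sh
# Sketch: an archive script for continuous WAL archiving, the building
# block behind "basebackup with WAL archive" and point-in-time recovery.
# Intended use in postgresql.conf (paths are placeholders):
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f /mnt/wal_archive'

# Copy WAL file $1 (path) under name $2 into directory $3.
# Fails if the name already exists: an archived segment must never be
# overwritten, and PostgreSQL retries the segment on a non-zero exit.
archive_wal() (
    src=$1 name=$2 dest=$3
    if [ -e "$dest/$name" ]; then
        echo "already archived: $name" >&2
        exit 1
    fi
    # Copy under a temporary name, then rename, so a partial copy is never
    # visible under the final name.
    cp "$src" "$dest/$name.tmp" || exit 1
    mv "$dest/$name.tmp" "$dest/$name"
)

# In a real script, finish with:  archive_wal "$@"
```

Tools such as Barman or WAL-E, both mentioned later in the deck, take over this job and add retention and restore handling on top.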

Page 16: GitLab PostgresMortem: Lessons Learned

Different versions in production

Clean out old packages after a major upgrade.
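Until old packages are removed, a backup job can at least verify that the pg_dump it is about to run matches the server's major version. This is a sketch with made-up helper names; it assumes the usual "pg_dump (PostgreSQL) X.Y.Z" output format.

```shell
#!/bin/sh
# Sketch: refuse to run pg_dump when its major version differs from the
# server's major version.

# Print the major version of a PostgreSQL version string:
# "9.6.1" -> 9.6, "10.4" -> 10 (two-part majors ended with version 10).
pg_major() {
    case $1 in
        [1-9][0-9]*) echo "${1%%.*}" ;;
        *)           echo "$1" | cut -d. -f1,2 ;;
    esac
}

# Print the version number from `pg_dump --version` output,
# e.g. "pg_dump (PostgreSQL) 9.2.18" -> 9.2.18
client_version() {
    echo "$1" | awk '{ print $3 }'
}

# Succeed when client output ($1) and server_version ($2) share a major.
same_major() {
    [ "$(pg_major "$(client_version "$1")")" = "$(pg_major "$2")" ]
}

# Not called here: needs pg_dump, psql and a live server.
check_pg_dump_version() {
    same_major "$(pg_dump --version)" \
               "$(psql -X -A -t -c 'SHOW server_version;')"
}
```

A cron job could call check_pg_dump_version first and abort with an alert when it fails, instead of producing an unusable dump.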

Page 17: GitLab PostgresMortem: Lessons Learned

Broken mail

Cron was set up, but notifications were forgotten.
Use reliable notification systems.
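A minimal sketch of a cron wrapper that reports failures explicitly rather than relying on cron mail. The notify function is a hypothetical stand-in: its body would be replaced by a call to whatever alerting system is in use.

```shell
#!/bin/sh
# Sketch: run a cron job and make failures loud instead of relying on
# cron mail being delivered.

# Hypothetical notifier: replace the body with a call to your alerting
# system (pager, chat webhook, monitoring event, ...).
notify() {
    echo "ALERT: $*" >&2
}

# Run the given command; on failure, report its name and exit code.
run_and_notify() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        notify "job failed (exit $rc): $*"
    fi
    return "$rc"
}
```

A crontab entry would then wrap the backup command, for example `run_and_notify pg_dump ...` inside a small script. Failure alerts alone are not enough: a job that never runs produces no alert, which is why backup age should also be monitored.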

Page 18: GitLab PostgresMortem: Lessons Learned

03. Preventative measures

Page 19: GitLab PostgresMortem: Lessons Learned

Preventative measures

1. Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2. Prometheus monitoring for backups.
3. Set PostgreSQL's max_connections to a sane value.
4. Investigate point-in-time recovery & continuous archiving for PostgreSQL.
5. Hourly LVM snapshots of the production databases.
6. Azure disk snapshots of production databases.
7. Move staging to the ARM environment.
8. Recover production replica(s).
9. Automated testing of recovering PostgreSQL database backups.
10. Improve PostgreSQL replication documentation/runbooks.
11. Investigate pgbarman for creating PostgreSQL backups.
12. Investigate using WAL-E as a means of database backup and realtime replication.
13. Build Streaming Database Restore.
14. Assign an owner for data durability.

Page 20: GitLab PostgresMortem: Lessons Learned

Preventative measures

1. Update PS1 across all hosts.
   Looks OK.
2. Prometheus monitoring for backups.
   Size, number, age and recovery status.
3. Set PostgreSQL's max_connections to a sane value.
   Better use pgbouncer.
4. Investigate PITR & continuous archiving for PostgreSQL.
   Yes, as part of the backup.
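For item 2, the size, number and age of backups can be exported as metrics. The sketch below prints them in the Prometheus text format, for example for a node_exporter textfile collector. The metric names and function name are invented for this example, and -mmin is a GNU/BSD find extension rather than POSIX.

```shell
#!/bin/sh
# Sketch: emit backup metrics in the Prometheus text format.
# Metric names are made up for this example.

# $1 = backup directory, $2 = maximum acceptable age in minutes.
# Prints the number of files, total size in kilobytes, and whether any
# file was modified within the last $2 minutes (1) or not (0).
backup_metrics() (
    dir=$1 max_age_min=$2
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    kbytes=$(du -sk "$dir" | cut -f1)
    if find "$dir" -type f -mmin "-$max_age_min" | grep -q .; then
        fresh=1
    else
        fresh=0
    fi
    echo "backup_files_total $count"
    echo "backup_size_kilobytes $kbytes"
    echo "backup_fresh $fresh"
)
```

Alert rules on a near-zero size or on backup_fresh being 0 would catch backups that silently stop or produce empty files. Recovery status needs a separate restore test.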

Page 21: GitLab PostgresMortem: Lessons Learned

Preventative measures

5. Hourly LVM snapshots of the production databases.
   Looks unnecessary.
6. Azure disk snapshots of production databases.
   Looks unnecessary.
7. Move staging to the ARM environment.
   Very, very suspicious.
8. Recover production replica(s).
   Do that ASAP.

Page 22: GitLab PostgresMortem: Lessons Learned

Preventative measures

9. Automated testing of recovering database backups.
   YES!
10. Improve documentation/runbooks.
   You need a bureaucrat.
11. Investigate pgbarman.
   Looks OK, Barman is stable and reliable.
12. Investigate using WAL-E.
   Looks OK, WAL-E is «set up and forget».
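Item 9 can start small. The following sketch restores a custom-format pg_dump into a scratch database, runs a sanity query and drops the database again. The function names, the DRY_RUN switch and the example arguments are assumptions for illustration; a real test would use queries that check application data.

```shell
#!/bin/sh
# Sketch: restore a custom-format pg_dump into a scratch database, run a
# sanity query, and drop the database again.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# $1 = dump file, $2 = scratch database name, $3 = sanity query
test_restore() {
    run createdb "$2" || return 1
    run pg_restore --exit-on-error -d "$2" "$1"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        run psql -X -A -t -d "$2" -c "$3"
        rc=$?
    fi
    # Always clean up the scratch database, even after a failed restore.
    run dropdb "$2"
    return "$rc"
}
```

Run nightly against the latest backup, the exit code gives the "recovery status" that backup monitoring should track.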

Page 23: GitLab PostgresMortem: Lessons Learned

Preventative measures

13. Build Streaming Database Restore.
   Corresponds to item 9.
14. Assign an owner for data durability.
   Hire a DBA.

Page 24: GitLab PostgresMortem: Lessons Learned

Lessons learned

Check and monitor backups.
Create emergency instructions.
Learn to use tools properly.

Page 25: GitLab PostgresMortem: Lessons Learned

Links

Postmortem of database outage of January 31
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

PostgreSQL Statistics Collector: pg_stat_replication view
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW

pg_basebackup utility
https://www.postgresql.org/docs/current/static/app-pgbasebackup.html

PostgreSQL Replication
https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html

PgBouncer
https://pgbouncer.github.io/
https://wiki.postgresql.org/wiki/PgBouncer

Barman
http://www.pgbarman.org/

Page 26: GitLab PostgresMortem: Lessons Learned

Thanks for watching!
