preventing and resolving mysql downtime
TRANSCRIPT
![Page 1: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/1.jpg)
Jervin Real, Michael Coburn
Percona
Preventing and Resolving MySQL Downtime
![Page 2: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/2.jpg)
About Us
•Jervin Real, Technical Services Manager
• Engineer Engineering Engineers
• APAC
•Michael Coburn, Principal Technical Account Manager
• Responsible for managing the technical relationship with Percona's highest revenue customers
![Page 3: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/3.jpg)
What is Downtime?
•When your Application is completely unavailable
•When your Application is in a degraded state
•Whenever your boss says so :)
![Page 4: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/4.jpg)
Why Prevent Downtime?
•Your business loses money when the Application is down
•Your reputation, and your team's, suffers
![Page 5: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/5.jpg)
Agenda
•Real world adventures
• Problems
• Solutions
• Prevention
•Putting them all together
![Page 6: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/6.jpg)
I Had a Crash On You
![Page 7: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/7.jpg)
I Had a Crash On You (1): Page Corruption
![Page 8: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/8.jpg)
I Had a Crash On You (1): Page Corruption > About
•Disk bad sectors problem, not monitored or checked
•Page corruption on disk level
•Server crashes when reading page from disk
•Keeps crashing :(
![Page 9: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/9.jpg)
I Had a Crash On You (1): Page Corruption > Solutions
•Percona Server, we tried:
• innodb_corrupt_table_action = salvage
•Worked!
•Dropped table, recreated - application back online
•Worst case:
• innodb_force_recovery > 0
• Data Recovery
![Page 10: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/10.jpg)
I Had a Crash On You (2): Assertion > About
•Running 5.6.11, early adopter, InnoDB FULLTEXT
•Upgraded to 5.6.18, MySQL crashed
•Data was unusable - bug#72079
![Page 11: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/11.jpg)
I Had a Crash On You (2): Assertion > Solutions
•Downgrade and restore from backup
•Re-execute upgrade to avoid the bug
![Page 12: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/12.jpg)
I Had a Crash On You (1): Page Corruption > Preventions
•innodb_corrupt_table_action=salvage / warn
•pt-table-checksum
• Regularly read through all your data and check the error log for errors
•RAID card health checks
• Can vary by vendor
•SMART checks
• Be vigilant for disk level errors
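A minimal sketch of the error log check from the slide above. The marker substrings are assumptions about what a corrupted page or assertion leaves in the log, not an exhaustive list; adjust them to what your server version actually prints.

```python
from typing import Iterable, List

# Assumed marker substrings (hypothetical list, tune for your server version).
CORRUPTION_MARKERS = (
    "page corruption",
    "assertion failure",
    "is marked as corrupted",
)


def find_corruption_lines(lines: Iterable[str]) -> List[str]:
    """Return error-log lines containing any marker, case-insensitively."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in CORRUPTION_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits
```

Feeding it the open error log file (for example after a scheduled pt-table-checksum run) and alerting on a non-empty result is one way to notice corruption before the next crash.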
![Page 13: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/13.jpg)
Nobody’s Watching
![Page 14: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/14.jpg)
Nobody’s Watching (1): Nobody Cared > About
•Percona XtraDB Cluster, 3 nodes
•A few months ago node 3 went down due to a conflict, but nobody noticed
•A few hours ago node 2 was killed by the OOM killer, and the cluster lost quorum
•EVERYBODY NOTICED!
![Page 15: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/15.jpg)
Nobody’s Watching (1): Nobody Cared > Solutions
•Bootstrap remaining node
• SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
•SST the second and third nodes
•Define wsrep_notify_cmd temporarily
•Implement better alerting
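A sketch of what a wsrep_notify_cmd handler could look like. It assumes Galera calls the script with --status, --uuid, --primary, --members and --index arguments; EXPECTED_NODES and the alert wording are hypothetical and only mirror the three-node cluster described above.

```python
import argparse
from typing import List, Optional

EXPECTED_NODES = 3  # assumption: the three-node cluster from the slide


def parse_notification(argv: List[str]) -> argparse.Namespace:
    """Parse the arguments the cluster passes to the notify script."""
    parser = argparse.ArgumentParser()
    for name in ("--status", "--uuid", "--primary", "--members", "--index"):
        parser.add_argument(name, default="")
    args, _unknown = parser.parse_known_args(argv)
    return args


def alert_reason(args: argparse.Namespace,
                 expected_nodes: int = EXPECTED_NODES) -> Optional[str]:
    """Return a human-readable reason to alert, or None if all looks fine."""
    members = [m for m in args.members.split(",") if m]
    if args.primary == "no":
        return "node is in a non-primary component (quorum lost)"
    if members and len(members) < expected_nodes:
        return f"cluster has {len(members)} of {expected_nodes} nodes"
    return None
```

Wired to a pager or e-mail hook, a handler like this would have fired when node 3 left, months before the second failure cost the cluster its quorum.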
![Page 16: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/16.jpg)
Nobody’s Watching (2): Dropped the Bomb > About
•New sysadmin received disk space alert
•du -hx --max-depth=1 /
•/var has lots of data
•find /var/ -size +5G -exec rm -rf {} \;
•Bam, ibdata1 gone!
•Restart maintenance occurred later in the day ...
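A safer pattern than piping find straight into rm is to list the candidates first and let a human decide. The sketch below only reports; the function name and the size threshold are illustrative, not part of the original incident.

```python
import os
from typing import List, Tuple


def list_large_files(root: str, min_bytes: int) -> List[Tuple[int, str]]:
    """Return (size, path) for regular files under root, largest first.

    Nothing is deleted: the result is meant to be reviewed by a person.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks, report real files only
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)
```

Seeing a path such as /var/lib/mysql/ibdata1 at the top of that list is a strong hint to stop and ask the DBA before removing anything.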
![Page 17: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/17.jpg)
Nobody’s Watching (2): Dropped the Bomb > Solutions
•Restore from backup
•Really, they were lucky!
![Page 18: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/18.jpg)
Nobody’s Watching: Prevention
•Percona Monitoring Plugins
• pmp-check-deleted-files
• pmp-check-mysql-status
• pmp-check-mysql-innodb
•Define a script executable by mysql user (wsrep_notify_cmd)
• Triggered on node state changes
•Take backups, and alert on failure
•If a data file was deleted, don't restart the server - file handles are still open!
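The idea behind checking for deleted files can be sketched on Linux by reading the symlinks in /proc/<pid>/fd: a target ending in " (deleted)" is a file that was unlinked while the process still holds it open. This is an illustration of the concept, not the implementation of pmp-check-deleted-files.

```python
import os
from typing import Dict

DELETED_SUFFIX = " (deleted)"


def deleted_targets(fd_links: Dict[str, str]) -> Dict[str, str]:
    """Filter {fd: link_target} down to descriptors whose file was unlinked."""
    return {
        fd: target[: -len(DELETED_SUFFIX)]
        for fd, target in fd_links.items()
        if target.endswith(DELETED_SUFFIX)
    }


def read_fd_links(pid: int) -> Dict[str, str]:
    """Read /proc/<pid>/fd (Linux only; needs permission to inspect pid)."""
    fd_dir = f"/proc/{pid}/fd"
    links = {}
    for fd in os.listdir(fd_dir):
        try:
            links[fd] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor closed while we were iterating
    return links
```

As long as mysqld keeps running, the data behind such a descriptor is still reachable through /proc/<pid>/fd/<fd>, which is why a restart is the worst possible next step.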
![Page 19: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/19.jpg)
Self Induced Pain
![Page 20: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/20.jpg)
Self Induced Pain (1): Query Cache
•“Waiting for query cache lock”
root# ~> pt-sift /var/lib/pt-stalk/
...
--processlist--
State
226
90 Waiting for query cache lock
4 Sending data
4 Master has sent all binlog to slave; waiting for binlog to be updated
2 init
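The per-state counts in the pt-sift output above are easy to reproduce from any processlist sample. A minimal sketch, assuming the State column has already been extracted into a list of strings (the function name is hypothetical):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_states(states: Iterable[str]) -> List[Tuple[int, str]]:
    """Count processlist states, most frequent first (ties sorted by name)."""
    counts = Counter(state if state else "(none)" for state in states)
    return sorted(
        ((n, state) for state, n in counts.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
```

A large count on a single wait state, such as the query cache lock here, usually points at one shared resource rather than at many unrelated slow queries.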
![Page 21: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/21.jpg)
Self Induced Pain (1): Query Cache > About
● Global mutex
● Point of contention
● Especially on hot dataset/table
● More so, with large QC
![Page 22: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/22.jpg)
Self Induced Pain (1): Query Cache > Solutions
● Set it to a small size to reduce performance overhead
● Disable it completely to avoid contention
● Hint offending queries to skip the query cache, e.g. SELECT SQL_NO_CACHE
![Page 23: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/23.jpg)
Self Induced Pain (2): Buffer Pool Dump/Restore
● Dumps buffer pool page list to disk
● Reloads buffer pool based on this list at startup
● Meant to help speed up buffer pool warmup
![Page 24: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/24.jpg)
Self Induced Pain (2): Buffer Pool Dump/Restore > About
● Maintenance restart, buffer pool dump and restore enabled
● Yay! Expecting everything to go well.
● 30 minutes in, performance is still really bad, IO thrashing
● Large buffer pool, busy read/write
![Page 25: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/25.jpg)
Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions
● Extend your maintenance period to let the server warm up if possible, otherwise the warmup and production traffic will contend on IO
● RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool
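A back-of-the-envelope estimate shows why the warmup hurt. The 100 MiB/s read rate below is an assumed, optimistic sequential figure for a small SATA mirror, not a measurement from the incident; page loads are often random reads that also compete with live traffic, so real numbers can be much worse.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def warmup_seconds(pool_bytes: int, read_bytes_per_sec: float) -> float:
    """Best-case time to read a full buffer pool's worth of pages from disk."""
    if read_bytes_per_sec <= 0:
        raise ValueError("read rate must be positive")
    return pool_bytes / read_bytes_per_sec


# 240 GiB at an assumed 100 MiB/s: roughly 41 minutes with the disks
# doing nothing else.
estimate = warmup_seconds(240 * GIB, 100 * MIB)
```

If the best case already exceeds the maintenance window, either extend the window or skip the restore on that hardware.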
![Page 26: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/26.jpg)
Self-Induced Pain Prevention
•Percona Toolkit
• pt-stalk
• pt-sift
• pt-kill
•Disable OOM killer
•Configure appropriate disk scheduler
•Check the error log for "Buffer pool(s) load completed"
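The last check can be scripted so that traffic is only let back in once the warmup has finished. A minimal sketch; the pattern is an assumption about the log wording and deliberately tolerates both "Buffer pool load complete" and "Buffer pool(s) load completed".

```python
import re
from typing import Iterable

# Assumed wording; matches with or without "(s)" and a trailing "d".
LOAD_DONE = re.compile(r"Buffer pool(\(s\))? load complete", re.IGNORECASE)


def buffer_pool_loaded(log_lines: Iterable[str]) -> bool:
    """Return True once any error-log line reports the load as finished."""
    return any(LOAD_DONE.search(line) for line in log_lines)
```

Polling this against the error log after a restart gives a simple gate for ending the maintenance window.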
![Page 27: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/27.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee?
![Page 28: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/28.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
•Slow queries
•Connections build up
•Slow response times
•Long running transactions
•Stop the World scenario
![Page 29: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/29.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
--innodb--
txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
0 queries inside InnoDB, 0 queries in queue
Main thread: sleeping, pending reads 0, writes 28, flush 1
Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191
![Page 30: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/30.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056
![Page 31: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/31.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions
•KILL long running transactions
•pt-kill for persistent long running transactions
•Deploy immediate code changes to disable the offending code
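To show what "find the long running transactions and KILL them" amounts to, here is a sketch that reads the transaction section of SHOW ENGINE INNODB STATUS in the format shown on the earlier slide. The function names and the threshold are hypothetical, and pt-kill does this far more robustly.

```python
import re
from typing import Iterable, List, Tuple

TRX_HEADER = re.compile(r"^---TRANSACTION (\d+), ACTIVE (\d+) sec")
THREAD_ID = re.compile(r"^MySQL thread id (\d+),")


def long_running_threads(status_lines: Iterable[str],
                         min_seconds: int) -> List[Tuple[int, int]]:
    """Return (thread_id, active_seconds) for transactions active long enough."""
    result = []
    active = None
    for line in status_lines:
        if line.startswith("---TRANSACTION"):
            header = TRX_HEADER.match(line)
            # "not started" transactions do not match and reset the state.
            active = int(header.group(2)) if header else None
            continue
        thread = THREAD_ID.match(line)
        if thread and active is not None:
            if active >= min_seconds:
                result.append((int(thread.group(1)), active))
            active = None
    return result


def kill_statements(threads: List[Tuple[int, int]]) -> List[str]:
    """Build KILL statements for review; nothing is executed here."""
    return [f"KILL {thread_id};" for thread_id, _secs in threads]
```

Printing the statements instead of running them keeps a human in the loop, which matters when the offender might be a legitimate batch job.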
![Page 32: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/32.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About
•MySQL is still responding
•All sorts of mutexes
• trx_sys->mutex
• block->lock
• lock_sys->mutex
• lock_sys->wait_mutex
•… and contention on them is killing latency
•Service impact means lost income
![Page 33: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/33.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions
•Set innodb_thread_concurrency > 0 to limit the threads running inside InnoDB
![Page 34: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/34.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
● “Opening tables”, “Closing tables”
--processlist--
State
578 Opening tables
32 closing tables
![Page 35: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/35.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
● Contention on LOCK_open mutex
● Risk of negative scalability
![Page 36: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/36.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions
● Tune table_open_cache/table_definition_cache
● table_open_cache_instances (5.6+)
● Shard either logically or horizontally, or run multiple MySQL instances, to reduce the number of objects per instance
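One way to judge whether table_open_cache is large enough is the hit ratio of the table cache. The sketch assumes a 5.6+ server that exposes the Table_open_cache_hits and Table_open_cache_misses status counters, collected into a dict by whatever client you use.

```python
from typing import Mapping


def table_cache_hit_ratio(status: Mapping[str, int]) -> float:
    """Fraction of table opens served from the table cache (1.0 if no data)."""
    hits = int(status["Table_open_cache_hits"])
    misses = int(status["Table_open_cache_misses"])
    total = hits + misses
    return hits / total if total else 1.0
```

A ratio that keeps falling while hundreds of threads sit in "Opening tables" suggests the cache is too small for the number of tables in use.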
![Page 37: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/37.jpg)
MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention
•pt-kill --log
•MySQL Server Configuration
a. Remember to tune innodb_thread_concurrency (default is 0)
b. table_open_cache + table_open_cache_instances
•Application Stack Configuration (Schema Design)
a. Single tenant per schema
b. Multiple tenants per schema (each table has client_id column)
c. All tenants in one schema
![Page 38: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/38.jpg)
Wizard of OS (1): Disk Performance
•Disk performance cascading to MySQL to application
![Page 39: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/39.jpg)
Wizard of OS (1): Disk Performance > About
•Slow writes, binlogs, redo logs, syncs
•Transactions stalling on COMMIT, updating, inserting …
•Replication getting delayed if node is a slave
•Translates to latency
![Page 40: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/40.jpg)
Wizard of OS (1): Disk Performance > Solutions
● RAID Controller in Write-Through
● Could also be a bad disk!
![Page 41: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/41.jpg)
Wizard of OS (2): Swapping
● Swapping heavily, with significant amount of RAM free
![Page 42: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/42.jpg)
Wizard of OS (2): Swapping > About
● Swapping induces significant amount of IO
● Swapping in and out of disk is mighty expensive
● Affects MySQL in magnificent ways
● Swap Insanity!
![Page 43: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/43.jpg)
Wizard of OS (2): Swapping > Solutions
● NUMA Interleave
● Percona Server is NUMA configurable
○ numa_interleave
○ flush_caches
● Check numastat - perl check_numa.pl
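The imbalance behind "swapping with RAM free" can be made visible by summing per-node page counts from /proc/<pid>/numa_maps, where each mapping line carries fields such as N0=250 N1=50. This is a sketch of the idea only, not a port of check_numa.pl.

```python
import re
from collections import Counter
from typing import Dict, Iterable

NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)")


def pages_per_node(numa_maps_lines: Iterable[str]) -> Dict[int, int]:
    """Sum the N<node>=<pages> fields of numa_maps lines per NUMA node."""
    totals = Counter()
    for line in numa_maps_lines:
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)
```

If nearly all of mysqld's pages sit on one node while the other is mostly free, the kernel may swap on the full node; interleaving the allocation spreads the buffer pool across nodes.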
![Page 44: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/44.jpg)
Wizard of OS: Prevention
● Tune:
○ vm.swappiness
○ NUMA policy
○ disk scheduler
○ mount options appropriately (ext4, xfs)
■ (nobarrier, noatime)
● pt-heartbeat - monitor replication delay
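Settings like these drift after reinstalls and kernel upgrades, so it helps to verify them rather than trust them. A small sketch for the disk scheduler, assuming the Linux sysfs layout where the active scheduler is shown in square brackets; the device name is whatever your data volume uses.

```python
import re
from typing import Optional


def active_scheduler(scheduler_file_text: str) -> Optional[str]:
    """Extract the bracketed (active) scheduler, e.g. 'noop [deadline] cfq'."""
    match = re.search(r"\[([^\]]+)\]", scheduler_file_text)
    return match.group(1) if match else None


def read_scheduler(device: str) -> Optional[str]:
    """Read the active scheduler for a block device such as 'sda' (Linux)."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return active_scheduler(handle.read())
```

The same pattern works for vm.swappiness by reading /proc/sys/vm/swappiness and comparing it with the value you intended to set.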
![Page 45: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/45.jpg)
Percona Server Features
•Enable InnoDB Buffer Pool warming
•Enable userstat for table & index statistics
•Enable verbose slow log
•Enable Query Response Time plugin
![Page 46: Preventing and Resolving MySQL Downtime](https://reader031.vdocuments.net/reader031/viewer/2022022412/58f2a51b1a28ab11208b4573/html5/thumbnails/46.jpg)
Thank You!
•Jervin Real [email protected]
• Technical Services Manager, APAC
•Michael Coburn [email protected]
• Principal Technical Account Manager, USA