Percona XtraDB Cluster:
Failure Scenarios and their Recovery
Krunal Bauskar (PXC Lead, Percona)
Alkin Tezuysal (Sr. Technical Manager, Percona)
Who we are?
Krunal Bauskar
● Database enthusiast.
● Practicing databases (MySQL) for over a decade now.
● Wide interest in data handling and management.
● Worked on some real big data that powered applications @ Yahoo, Oracle, Teradata.

Alkin Tezuysal (@ask_dba)
● Open Source Database Evangelist
● Global Database Operations Expert
● Cloud Infrastructure Architect AWS
● Inspiring Technical and Strategic Leader
● Creative Team Builder
● Speaker, Mentor, and Coach
● Outdoor Enthusiast
Agenda
● Quick sniff at PXC
● Failure Scenarios and their recovery
● PXC Genie - You wish. We implement.
● Q & A
Quick Sniff at PXC
What is PXC?
Auto-node provisioning
Multi-master
Performance tuned
Enhanced Security
Flexible topology
Network protection
(Geo-distributed)
Failure Scenarios and their recovery
Scenario: New node fails to connect to cluster
Joiner log
The DONOR log has no trace of the JOINER trying to join.
The administrator reviews the configuration settings (IP addresses and so on) and finds them sane and valid.
Still, the JOINER fails to connect.
Culprit: SELinux/AppArmor
Don’t confuse this error with an SST failure: the node has not yet been offered membership of the cluster, and SST happens only after membership.
● Solution 1:
○ Set the mode to PERMISSIVE or DISABLED.
● Solution 2:
○ Configure a policy that allows access in ENFORCING mode.
○ Related blogs:
■ “Lock Down: Enforcing SELinux with Percona XtraDB Cluster”. It probes which permissions are needed and adds rules accordingly.
■ “Lock Down: Enforcing AppArmor with Percona XtraDB Cluster”
■ Using this, we can continue to run with SELinux enabled. (You can also refer to the SELinux configuration on the Codership site.)
PXC can operate with SELinux/AppArmor.
Scenario: Catching up cluster (SST, IST)
● SST: complete copy-over of the data directory.
○ SST has multiple external components (SST script, XB, network aspects, etc.). Some of these are outside the control of the PXC process.
● IST: transfer of the missing write-sets (as the node is already a member of the cluster).
○ Intrinsic to the PXC process space.
#1
Joiner log
SST failed on the DONOR.
wsrep_sst_auth is not set on the DONOR.
wsrep_sst_auth should be set on the DONOR (users often set it only on the JOINER, and SST still fails). After SST, the JOINER gets the said user copied over from the DONOR.
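One way to catch this before starting the JOINER is to confirm that the DONOR's configuration really carries wsrep_sst_auth. The following is a minimal sketch, not a PXC tool: `find_option` is a hypothetical helper, it only understands simple `key=value` lines inside a section, and the `sstuser:s3cret` credentials are made up for illustration.

```python
def find_option(cnf_text, section, option):
    """Return the value of `option` in `[section]`, or None if it is absent."""
    current = None
    value = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "!")):
            continue  # skip blank lines, comments and !include directives
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current != section:
            continue
        key, _, val = line.partition("=")
        # MySQL treats '-' and '_' in option names as equivalent.
        if key.strip().replace("-", "_") == option:
            value = val.strip().strip('"').strip("'")
    return value


# Hypothetical my.cnf fragments for a DONOR and a JOINER.
donor_cnf = """
[mysqld]
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="sstuser:s3cret"
"""
joiner_cnf = """
[mysqld]
wsrep_sst_method=xtrabackup-v2
"""

print(find_option(donor_cnf, "mysqld", "wsrep_sst_auth"))   # -> sstuser:s3cret
print(find_option(joiner_cnf, "mysqld", "wsrep_sst_auth"))  # -> None
```

A `None` result for the DONOR's file would point at the situation described on this slide.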
#2
Donor log
Possible causes:
● The specified wsrep_sst_auth user doesn’t exist.
● Credentials are wrong.
● Insufficient privileges.
#3
Joiner log
An old-version JOINER is trying to join from a new-version DONOR (not supported).
The opposite is naturally allowed.
#4
Joiner log
Donor log
WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
#5
Faulty SSL configuration
PXC recommends:
● Use the same configuration on all nodes of the cluster.
● Old DONOR with new JOINER is OK.
● XB is an external tool and has its own set of configuration options (passed through the PXC my.cnf).
● The SST user should be present on the DONOR.
● Look at both the DONOR and the JOINER logs.
● wsrep_sst_receive_address/wsrep_node_address is needed.
● Advanced encryption options must match: a keyring on the DONOR with no keyring on the JOINER is not allowed.
● Ensure a stable network link between DONOR and JOINER.
● Check network rules (firewall, etc.). SST uses port 4444; IST uses port 4568.
● Errors are often local to XB. Check the XB log file, which can give a hint about the error.
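The network-rule item above can be probed with a plain TCP connect test. This is a rough sketch under stated assumptions: `port_reachable` and `check_node` are hypothetical helper names, the ports are the two named on the slide, and a refused connection may simply mean that no transfer is in progress on that node at the moment, so a timeout is the more telling symptom of a firewall rule.

```python
import socket

# Ports named on the slide: SST on 4444, IST on 4568.
PXC_PORTS = {"SST": 4444, "IST": 4568}


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, timeout=3.0):
    """Probe every port in PXC_PORTS on one node and report the results."""
    return {name: port_reachable(host, port, timeout) for name, port in PXC_PORTS.items()}
```

Calling `check_node` with the JOINER's address from the DONOR host (and the reverse) returns a small dictionary such as `{"SST": ..., "IST": ...}` that can be compared across nodes.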
Scenario: Cluster doesn’t come up on restart
● All your nodes are located in the same Data-Center (DC).
● The DC hits a power failure and all nodes are restarted.
● On restart, the recovery flow is executed to recover the wsrep coordinates.
Cluster still fails to come up
● A close look at the log shows that the original bootstrapping node has safe_to_bootstrap set to 0, so it refuses to come up.
● The other nodes of the cluster are left dangling (in non-primary state) in the absence of the original cluster-forming node.
Galera/PXC expects the user to identify the node that has the latest data and then use it to bootstrap. The safe_to_bootstrap flag was added as a safety check for this.
Identify the node that has the latest data (look at the wsrep-recovery coordinates).
Set safe_to_bootstrap to 1 in grastate.dat in that node's data directory.
Bootstrap the node.
Restart the other non-primary nodes (if they fail to auto-join).
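The first steps of this recovery can be sketched in a few lines. This is a minimal illustration, not a supported tool: `pick_bootstrap_node` and `mark_safe_to_bootstrap` are hypothetical helpers, the node names and the uuid are made up, and the input is assumed to be the `uuid:seqno` position each node reports during wsrep recovery.

```python
import re


def pick_bootstrap_node(recovered):
    """Given {node_name: "uuid:seqno"}, return the node with the highest seqno."""
    uuids = {pos.rsplit(":", 1)[0] for pos in recovered.values()}
    if len(uuids) != 1:
        raise ValueError("nodes report different cluster uuids")
    return max(recovered, key=lambda node: int(recovered[node].rsplit(":", 1)[1]))


def mark_safe_to_bootstrap(grastate_text):
    """Return grastate.dat content with safe_to_bootstrap set to 1."""
    new_text, count = re.subn(
        r"^(safe_to_bootstrap:[ \t]*)\d+[ \t]*$",
        r"\g<1>1",
        grastate_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one safe_to_bootstrap line")
    return new_text


# Hypothetical recovered positions from three nodes of the same cluster.
CLUSTER_UUID = "8bcf4a34-aedb-14e5-bcc3-d3e36277729f"
positions = {
    "node1": CLUSTER_UUID + ":100",
    "node2": CLUSTER_UUID + ":103",
    "node3": CLUSTER_UUID + ":101",
}
print(pick_bootstrap_node(positions))  # -> node2

# Hypothetical grastate.dat content on the chosen node.
grastate = (
    "# GALERA saved state\n"
    "version: 2.1\n"
    "uuid:    " + CLUSTER_UUID + "\n"
    "seqno:   -1\n"
    "safe_to_bootstrap: 0\n"
)
print(mark_safe_to_bootstrap(grastate))
```

The functions work on text only; writing the result back to grastate.dat and bootstrapping the node are left to the administrator, with the server stopped.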
I have the exact same setup, but I never face this issue. My cluster gets auto-restored after a power failure.
Am I losing data or doing something wrong?
Because you have bootstrapped your node using
wsrep_cluster_address=<node-ip> and pc.recovery=true (the default).
The error is observed if you have bootstrapped with:
wsrep_cluster_address="gcomm://"
OR wsrep_cluster_address="<node-ips>" but pc.recovery=false
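The rule on these two slides can be written down as a small predicate. This is only a restatement of the slide text, not a description of Galera internals: `expect_auto_recovery` is a hypothetical helper and the addresses are made up.

```python
def expect_auto_recovery(wsrep_cluster_address, pc_recovery=True):
    """Return True if, per the slides, the cluster should auto-restore after a full outage."""
    prefix = "gcomm://"
    addr = wsrep_cluster_address.strip()
    peers = addr[len(prefix):] if addr.startswith(prefix) else addr
    if not peers:
        # An empty gcomm:// address was used to bootstrap: no auto-restore.
        return False
    # With a node list, auto-restore depends on pc.recovery (true by default).
    return bool(pc_recovery)


print(expect_auto_recovery("gcomm://10.0.0.1,10.0.0.2"))            # -> True
print(expect_auto_recovery("gcomm://"))                             # -> False
print(expect_auto_recovery("gcomm://10.0.0.1", pc_recovery=False))  # -> False
```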
PXC can auto-restart after a DC failure, depending on the configuration options used.
Scenario: Data inconsistency
● 2 kinds of inconsistencies
○ Physical inconsistency: Hardware Issues
○ Logical inconsistency: Data Issues
Logical inconsistency is caused by cluster-specific operations such as locks, RSU, wsrep_on=off, etc.
PXC has zero tolerance for inconsistency, so it immediately isolates the affected nodes on detecting one.
Inconsistency detected
Cluster is healthy and running
ISOLATED NODE (SHUTDOWN)
Inconsistency detected (on two nodes)
Node states: shutdown, shutdown, non-prim
State marked as UNSAFE
majority group
minority group
Minority group has GOOD DATA
If there are multiple nodes in the minority group, identify the node that has the latest data.
Set pc.bootstrap=1 on the selected node.
A single-node cluster is formed.
Boot the other (majority group) nodes; they will join through SST.
CLUSTER RESTORED
Majority group has GOOD DATA
Nodes in the majority group are already SHUTDOWN. Initiate SHUTDOWN of the nodes in the minority group.
Fix grastate.dat on the nodes of the majority group (the inconsistency shutdown sequence has marked STATE=UNSAFE). A valid uuid can be copied over from a minority group node.
Bootstrap the cluster using one of the nodes from the majority group, and eventually get the other majority nodes to join.
Remove grastate.dat from the minority group nodes and restart them to join the newly formed cluster.
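The "fix grastate.dat" step for a majority node can be sketched as a text rewrite that copies in a valid cluster uuid. This is a hedged illustration only: `copy_cluster_uuid` is a hypothetical helper, the uuid value is made up, the all-zero uuid in the sample is an assumption about what the unsafe state looks like on disk, and only the uuid line is touched (the slides do not say whether seqno needs attention as well).

```python
import re
import uuid


def copy_cluster_uuid(grastate_text, valid_uuid):
    """Return grastate.dat content with its uuid line replaced by valid_uuid."""
    uuid.UUID(valid_uuid)  # raises ValueError if the string is not a UUID
    new_text, count = re.subn(
        r"^(uuid:[ \t]*)\S+[ \t]*$",
        lambda m: m.group(1) + valid_uuid,
        grastate_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one uuid line in grastate.dat")
    return new_text


# Assumed content of grastate.dat on a majority node after the unsafe shutdown.
unsafe_grastate = (
    "# GALERA saved state\n"
    "version: 2.1\n"
    "uuid:    00000000-0000-0000-0000-000000000000\n"
    "seqno:   -1\n"
    "safe_to_bootstrap: 0\n"
)

# Hypothetical uuid read from a minority node's grastate.dat.
valid = "8bcf4a34-aedb-14e5-bcc3-d3e36277729f"
print(copy_cluster_uuid(unsafe_grastate, valid))
```

The rewrite is done on text so it can be reviewed before anything is written back to the data directory of a stopped node.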
CLUSTER RESTORED
Scenario: Another aspect of data inconsistency
One of the nodes from the minority group
Transactions up to X
Transactions up to X - 1
Transaction X caused the inconsistency, so it never made it to these nodes.
![Page 68: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/68.jpg)
68
Scenario: Another aspect of data inconsistency
Transactions up to X
Transactions up to X - 1
![Page 69: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/69.jpg)
69
Scenario: Another aspect of data inconsistency
Transactions up to X
Transactions up to X - 1
Membership rejected because the joining node has one more transaction than the cluster state.
![Page 70: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/70.jpg)
70
Scenario: Another aspect of data inconsistency
The 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
![Page 71: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/71.jpg)
71
Scenario: Another aspect of data inconsistency
The 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
![Page 72: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/72.jpg)
72
Scenario: Another aspect of data inconsistency
The 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
The node got membership, and it joined through IST too?
![Page 73: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/73.jpg)
73
Scenario: Another aspect of data inconsistency
The 2-node cluster is up and has started processing transactions, moving the cluster state from X to X + 3.
The node has transactions up to X and the cluster says it has transactions up to X + 3.
Node joining doesn’t evaluate data. It all depends on the seqno.
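The point that joining depends on the seqno and not on the data can be illustrated with a small sketch. This is a simplified, hypothetical model, not Galera's actual code: `NodeState`, `join_decision`, and `gcache_first_seqno` are names invented for the illustration, and the only inputs are the state UUID and seqno that a node would record in grastate.dat.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    uuid: str   # cluster state UUID (as recorded in grastate.dat)
    seqno: int  # last committed seqno (as recorded in grastate.dat)


def join_decision(joiner: NodeState, cluster: NodeState, gcache_first_seqno: int) -> str:
    """Hypothetical membership check: compares identifiers only, never row data."""
    if joiner.uuid != cluster.uuid:
        return "SST"      # different history: a full state transfer is needed
    if joiner.seqno > cluster.seqno:
        return "REJECT"   # joiner claims more transactions than the cluster has
    if joiner.seqno == cluster.seqno:
        return "NONE"     # nothing to transfer
    if joiner.seqno + 1 >= gcache_first_seqno:
        return "IST"      # the missing write-sets are still in the donor's gcache
    return "SST"


x = 100
# Cluster at X - 1, joiner at X: membership is rejected.
print(join_decision(NodeState("u1", x), NodeState("u1", x - 1), gcache_first_seqno=90))
# Cluster moved on to X + 3: the same joiner is now accepted through IST,
# although its transaction X differs from the cluster's transaction X.
print(join_decision(NodeState("u1", x), NodeState("u1", x + 3), gcache_first_seqno=90))
```

Nothing in the check looks at the rows, which is why a node carrying a divergent transaction X can be accepted once the cluster's seqno has moved past X.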
![Page 74: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/74.jpg)
74
Scenario: Another aspect of data inconsistency
The user failed to remove grastate.dat, which caused all this confusion.
![Page 75: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/75.jpg)
75
Scenario: Another aspect of data inconsistency
trx-seqno=x on every node, but the transaction with that seqno carries a different update.
![Page 76: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/76.jpg)
76
Scenario: Another aspect of data inconsistency
Cluster restored, only to enter more inconsistency (which may be detected in the future).
trx-seqno=x on every node, but the transaction with that seqno carries a different update.
![Page 77: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/77.jpg)
77
Scenario: Cluster doesn’t come up on restart
Avoid running node-local operations.
If the cluster enters an inconsistent state, carefully follow the step-by-step guide to recover
(don’t fear SST, it is for your own good).
![Page 78: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/78.jpg)
78
Scenario: Delayed purging
![Page 79: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/79.jpg)
79
Scenario: Delayed purging
Gcache (staging area that holds replicated transactions)
![Page 80: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/80.jpg)
80
Scenario: Delayed purging
Transaction replicated and staged
![Page 81: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/81.jpg)
81
Scenario: Delayed purging
All nodes finished applying transaction
![Page 82: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/82.jpg)
82
Scenario: Delayed purging
Transactions can be removed from gcache
![Page 83: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/83.jpg)
83
Scenario: Delayed purging
● At a configured interval, each node notifies the other nodes of the cluster about its transaction-committed status.
● This configuration is controlled by 2 conditions:
○ gcache.keep_pages_size and gcache.keep_pages_count
○ static limits on the number of keys (1K), transactions (128), and bytes (128M).
● Accordingly, each node evaluates the cluster-level lowest watermark and initiates a gcache purge.
![Page 84: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/84.jpg)
84
Scenario: Delayed purging
Each node updates its local graph and evaluates the cluster purge watermark.
View held by each of the three nodes:
N1_purged_upto: x+1
N2_purged_upto: x+1
N3_purged_upto: x
![Page 85: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/85.jpg)
85
Scenario: Delayed purging
And accordingly all nodes purge their local gcache up to X.
View held by each of the three nodes:
N1_purged_upto: x+1
N2_purged_upto: x+1
N3_purged_upto: x
cluster-purge-water-mark=X
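The watermark evaluation shown on these slides amounts to taking the minimum of the positions reported by the nodes. A minimal sketch, assuming each node simply keeps a map of the latest reported position per node; the function names and the dictionary layout are illustrative and not taken from Galera:

```python
def cluster_purge_watermark(reported_upto: dict[str, int]) -> int:
    """The cluster can only purge up to the slowest node's reported position."""
    return min(reported_upto.values())


def purge(gcache: dict[int, str], watermark: int) -> dict[int, str]:
    """Keep only the write-sets with a seqno above the watermark."""
    return {seqno: ws for seqno, ws in gcache.items() if seqno > watermark}


x = 10
reports = {"N1": x + 1, "N2": x + 1, "N3": x}
watermark = cluster_purge_watermark(reports)   # equals x, held back by N3
gcache = {x - 1: "ws-a", x: "ws-b", x + 1: "ws-c"}
print(watermark, purge(gcache, watermark))
```

Because the minimum is taken over all nodes, a single node that stops reporting progress holds the watermark in place for everyone.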
![Page 86: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/86.jpg)
86
Scenario: Delayed purging
gcache page created and purged.
![Page 87: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/87.jpg)
87
Scenario: Delayed purging
New COMMIT CUT 2360 after 2360 from 1
purging index up to 2360
releasing seqno from gcache 2360
Got commit cut from GCS: 2360
![Page 88: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/88.jpg)
88
Scenario: Delayed purging
New COMMIT CUT 2360 after 2360 from 1
purging index up to 2360
releasing seqno from gcache 2360
Got commit cut from GCS: 2360
Each node regularly communicates its committed-up-to watermark, and purging is then initiated as per the protocol explained.
![Page 89: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/89.jpg)
89
Scenario: Delayed purging
![Page 90: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/90.jpg)
90
Scenario: Delayed purging
Gcache
STOP processing transaction
Transactions start to pile up in gcache
![Page 91: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/91.jpg)
91
Scenario: Delayed purging
Gcache
STOP processing transaction
Transactions start to pile up in gcache
● FTWRL, RSU, … any action that causes the node to pause and desync.
![Page 92: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/92.jpg)
92
Scenario: Delayed purging
● Given that one of the nodes is not making progress, it does not emit its transaction-committed status.
● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down.
● This means that, though the other nodes are making progress, they continue to pile up the Galera cache.
![Page 93: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/93.jpg)
93
Scenario: Delayed purging
● Given that one of the nodes is not making progress, it does not emit its transaction-committed status.
● This freezes the cluster-purge-water-mark, as the lowest transaction stays locked down.
● This means that, though the other nodes are making progress, they continue to pile up the Galera cache.
Galera has protection against this. If the number of transactions continues to grow beyond some hard limits, it forces a purge.
![Page 94: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/94.jpg)
94
Scenario: Delayed purging
trx map size: 16511 - check if status.last_committed is incrementing
purging index up to 11264
releasing seqno from gcache 11264
In-built mechanism to force purge.
![Page 95: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/95.jpg)
95
Scenario: Delayed purging
trx map size: 16511 - check if status.last_committed is incrementing
purging index up to 11264
releasing seqno from gcache 11264
Purge can get delayed but does not halt.
![Page 96: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/96.jpg)
96
Scenario: Delayed purging
Gcache
STOP processing transaction
Force purge done
![Page 97: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/97.jpg)
97
Scenario: Delayed purging
Gcache
STOP processing transaction
Purging means these entries are removed from the Galera-maintained purge array.
(Physical removal of the gcache.page.0000xx files is controlled by gcache.keep_pages_size and gcache.keep_pages_count.)
![Page 98: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/98.jpg)
98
Scenario: Delayed purging
All nodes should have the same configuration.
Keep a close watch if you plan to run a backup or any other operation that can cause a node to halt.
Monitor that each node is making progress by keeping watch on wsrep_last_applied/wsrep_last_committed.
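The monitoring advice can be sketched as a small check over successive readings of wsrep_last_committed. This is only an illustration: the helper name `stalled_nodes`, the `window` parameter, and the way readings are collected (for example by polling `SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'` on each node) are assumptions, and the sketch works on already-collected numbers so that it runs without a database.

```python
def stalled_nodes(samples: dict[str, list[int]], window: int = 3) -> list[str]:
    """Flag nodes whose wsrep_last_committed did not move over the last
    `window` readings while at least one other node advanced."""
    progress = {}
    for node, readings in samples.items():
        if len(readings) >= window:
            recent = readings[-window:]
            progress[node] = recent[-1] - recent[0]
    if not any(delta > 0 for delta in progress.values()):
        return []  # nobody advanced: the cluster is idle, not stalled
    return sorted(node for node, delta in progress.items() if delta == 0)


readings = {
    "n1": [100, 140, 180],
    "n2": [100, 139, 181],
    "n3": [100, 100, 100],  # e.g. paused by FTWRL during a backup
}
print(stalled_nodes(readings))
```

Comparing against the other nodes avoids flagging every node when the workload is simply idle.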
![Page 99: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/99.jpg)
99
Scenario: Network latency and related failures
![Page 100: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/100.jpg)
100
Scenario: Network latency and related failures
![Page 101: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/101.jpg)
101
Scenario: Network latency and related failures
![Page 102: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/102.jpg)
102
Scenario: Network latency and related failures
Why? What caused this weird behavior?
![Page 103: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/103.jpg)
103
Scenario: Network latency and related failures
![Page 104: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/104.jpg)
104
Scenario: Network latency and related failures
The cluster is neither completely down nor completely up. What’s going on? What is causing this weird behavior?
![Page 105: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/105.jpg)
105
Scenario: Network latency and related failures
All my writes are going to a single node, yet I am still getting this conflict?
![Page 106: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/106.jpg)
106
Scenario: Network latency and related failures
All nodes are able to reach each other
![Page 107: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/107.jpg)
107
Scenario: Network latency and related failures
If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both of them.
![Page 108: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/108.jpg)
108
Scenario: Network latency and related failures
If the link between 2 of the nodes is broken, packets can be relayed through a 3rd node that is reachable from both of them.
![Page 109: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/109.jpg)
109
Scenario: Network latency and related failures
The said node has a flaky network connection or, say, higher latency.
![Page 110: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/110.jpg)
110
Scenario: Network latency and related failures
Each node monitors the other nodes of the cluster every inactive_check_period (0.5 seconds).
If a node is not reachable from a given node after peer_timeout (3S), the cluster enables relaying of messages.
If all nodes vote for the said node's inactivity (suspect_timeout (5S)), it is pronounced DEAD.
If a node detects a delay in response from a given node, it tries to add that node to the delayed list.
While suspect_timeout needs consensus, inactive_timeout (15S) doesn’t. If the node doesn’t respond, it is marked DEAD.
A node waits for delay_margin (1S) before adding a node to the delayed list.
Even if the node becomes active again, it takes delayed_keep_period (30S) to remove it from the list.
![Page 111: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/111.jpg)
111
Scenario: Network latency and related failures
If a node detects a delay in response from a given node, it tries to add that node to the delayed list.
A node waits for delay_margin (1S) before adding a node to the delayed list.
Even if the node becomes active again, it takes delayed_keep_period (30S) to remove it from the list.
Each node monitors the other nodes of the cluster every inactive_check_period (0.5 seconds).
If a node is not reachable from a given node after peer_timeout (3S), the cluster enables relaying of messages.
If all nodes vote for the said node's inactivity (suspect_timeout (5S)), it is pronounced DEAD.
While suspect_timeout needs consensus, inactive_timeout (15S) doesn’t. If the node doesn’t respond, it is marked DEAD.
Runtime configurable
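How the timeouts stack up can be sketched as a simple classification of a peer by how long it has been silent. This is a rough model of the rules listed on the slide, using the default values quoted there; the `Timeouts` class, the `peer_status` function, and the status labels are made up for the illustration and do not mirror Galera's internal state machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    peer_timeout: float = 3.0       # seconds before messages get relayed
    suspect_timeout: float = 5.0    # seconds before a node can be voted DEAD
    inactive_timeout: float = 15.0  # seconds before DEAD without any vote


def peer_status(silent_for: float, all_peers_agree: bool, t: Timeouts = Timeouts()) -> str:
    """Classify a peer by how long it has been silent (simplified model)."""
    if silent_for >= t.inactive_timeout:
        return "DEAD"      # no consensus needed
    if silent_for >= t.suspect_timeout:
        return "DEAD" if all_peers_agree else "SUSPECT"
    if silent_for >= t.peer_timeout:
        return "RELAY"     # reach the peer through another node
    return "OK"


for silence, agree in [(1.0, False), (3.5, False), (6.0, False), (6.0, True), (16.0, False)]:
    print(silence, agree, peer_status(silence, agree))
```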
![Page 112: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/112.jpg)
112
Scenario: Network latency and related failures
Latency: < 1 ms between n1 and n2; 7 sec on each link to n3
![Page 113: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/113.jpg)
113
Scenario: Network latency and related failures
Latency: < 1 ms between n1 and n2; 7 sec on each link to n3
![Page 114: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/114.jpg)
114
Scenario: Network latency and related failures
Latency: < 1 ms between n1 and n2; 7 sec on each link to n3
Start sysbench workload
![Page 115: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/115.jpg)
115
Scenario: Network latency and related failures
Latency: < 1 ms between n1 and n2; 7 sec on each link to n3
Start sysbench workload
Given that the RTT between n1 and n3 is 7 sec, each trx needs 7 sec to complete even though it gets the ACK from n2 in < 1 ms.
![Page 116: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/116.jpg)
116
Scenario: Network latency and related failures
#1
![Page 117: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/117.jpg)
117
Scenario: Network latency and related failures
● TPS hits 0 for 5 secs and then resumes.
#1
![Page 118: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/118.jpg)
118
Scenario: Network latency and related failures
● TPS hits 0 for 5 secs and then resumes.
● This is because the trx is waiting for an ACK from n3 that would take 7 sec, but in the meantime the suspect_timeout timer goes off and marks n3 as DEAD, so the workload resumes after 5 secs.
#1
![Page 119: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/119.jpg)
119
Scenario: Network latency and related failures
● This temporarily makes the complete cluster unavailable.
● Unfortunately, the protocol design demands an ACK from the farthest node to ensure consistency.
● Of course, a latency of 7 sec is not realistic.
#1
![Page 120: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/120.jpg)
120
Scenario: Network latency and related failures
#2
![Page 121: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/121.jpg)
121
Scenario: Network latency and related failures
Latency: < 1 ms between n1 and n2; 2 sec on each link to n3
![Page 122: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/122.jpg)
122
Scenario: Network latency and related failures
● This time I reduced the latency from 7 to 2 sec. Because of this, every 2 sec (less than 5 sec) there was some communication between the nodes, and this prevented n3 from being marked as DEAD.
● After 10 secs we reverted the latency to its original value, so the snag is seen for 10 secs.
#2
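The difference between experiments #1 and #2 comes down to comparing the added latency with suspect_timeout. A deliberately simplified sketch of that arithmetic follows; `write_stall` is a hypothetical helper, and real behavior also depends on the other timers described earlier.

```python
def write_stall(latency_s: float, suspect_timeout_s: float = 5.0) -> tuple[bool, float]:
    """Return (slow node evicted?, seconds a write waits) in a simplified model."""
    if latency_s >= suspect_timeout_s:
        # The ACK never arrives in time: the node is marked DEAD after
        # suspect_timeout and the workload resumes without it.
        return True, suspect_timeout_s
    # The node keeps answering within the timeout, so it stays in the
    # cluster and every write pays the full latency.
    return False, latency_s


print(write_stall(7.0))  # experiment #1: node evicted after a 5 sec stall
print(write_stall(2.0))  # experiment #2: node stays, writes wait 2 sec each
```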
![Page 123: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/123.jpg)
123
Scenario: Network latency and related failures
All my writes are going to a single node, yet I am still getting this conflict?
#3
![Page 124: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/124.jpg)
124
Scenario: Network latency and related failures
Because when the view changes, the initial position is re-assigned, thereby purging history from the cert index. A follow-up transaction in certification that has a dependency on an old trx (which got purged) faces this conflict.
#3
![Page 125: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/125.jpg)
125
Scenario: Network latency and related failures
The farthest node dictates how the cluster operates, so latency is important.
A geo-distributed cluster has latency in milliseconds, so timeouts should be configured to avoid marking a node as UNSTABLE due to the added latency.
For a geo-distributed cluster, segment and window settings are other parameters to configure.
Flaky nodes are not good for overall transaction processing (they can cause certification failures).
![Page 126: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/126.jpg)
126
Scenario: Blocking Transaction and related failures
![Page 127: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/127.jpg)
127
Scenario: Blocking Transaction and related failures
● Failure to load a table with N rows.
![Page 128: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/128.jpg)
128
Scenario: Blocking Transaction and related failures
● Failure to load a table with N rows.
● Why?
○ Because PXC has a limit on how much data it can wrap in a write-set and replicate across the cluster.
○ The current limit allows a transaction of up to 2 GB of data (controlled through wsrep_max_ws_size).
But have you ever wondered why that is a limitation?
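One common way to stay under a write-set size limit is to load the table in several smaller transactions. The sketch below is an assumption about how such a loader could be structured, not something the slides prescribe: `batches` is a hypothetical helper, and the default `size_of` is a crude stand-in for the real write-set size, which the server computes differently.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def batches(
    rows: Iterable[Sequence],
    max_bytes: int,
    size_of: Callable[[Sequence], int] = lambda row: sum(len(str(col)) for col in row),
) -> Iterator[List[Sequence]]:
    """Group rows so that each group's estimated size stays within max_bytes.
    A single row larger than max_bytes still gets a group of its own."""
    batch: List[Sequence] = []
    used = 0
    for row in rows:
        cost = size_of(row)
        if batch and used + cost > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(row)
        used += cost
    if batch:
        yield batch


rows = [("aaaa",), ("bbbb",), ("cccc",)]
for group in batches(rows, max_bytes=8):
    # Each group would be inserted and committed as its own transaction.
    print(group)
```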
![Page 129: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/129.jpg)
129
Scenario: Blocking Transaction and related failures
execute → prepare → replicate → commit
![Page 130: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/130.jpg)
130
Scenario: Blocking Transaction and related failures
execute → prepare → replicate → commit
A transaction first executes on the local node. During this execution the transaction doesn’t block other non-dependent transactions.
The transaction replicates after it has been executed on the local node but before it is committed.
Replication involves transporting the write-set (binlog) to the other nodes.
![Page 131: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/131.jpg)
131
Scenario: Blocking Transaction and related failures
N1: execute → prepare → replicate → commit
N2: apply → commit
![Page 132: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/132.jpg)
132
Scenario: Blocking Transaction and related failures
N1: execute → prepare → replicate → commit
N2: apply → commit
To maintain data consistency across the cluster, the protocol needs transactions to commit in the same order on all the nodes.
![Page 133: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/133.jpg)
133
Scenario: Blocking Transaction and related failures
execute prepare replicate commit N1
apply commit N2
This means that transactions ordered after a large transaction can’t commit until it does, even though they are non-dependent and completed their APPLY ACTION before the large transaction.
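The ordering constraint above can be sketched with a small simulation. This is a toy model, not PXC code: the `Txn` class, the `commit_times` helper and the time units are all hypothetical, and the only rule encoded is the one stated on the slide (commits happen in the same global order, so a transaction waits for everything ordered before it).

```python
from dataclasses import dataclass


@dataclass
class Txn:
    seqno: int          # global order assigned when the write-set is replicated
    apply_done_at: int  # hypothetical time at which the apply step finishes


def commit_times(txns):
    """Return {seqno: commit time} when commits must follow seqno order."""
    commits = {}
    previous_commit = 0
    for txn in sorted(txns, key=lambda t: t.seqno):
        # A transaction commits once it has been applied AND every
        # transaction ordered before it has committed.
        previous_commit = max(previous_commit, txn.apply_done_at)
        commits[txn.seqno] = previous_commit
    return commits


# Seqno 1 is a large transaction that needs 100 time units to apply.
# Seqnos 2-4 are small, non-dependent, and finish applying almost at once.
backlog = [Txn(1, 100), Txn(2, 3), Txn(3, 4), Txn(4, 5)]
for seqno, at in commit_times(backlog).items():
    print(f"seqno={seqno} commits at t={at}")
```

In this model every small transaction commits at t=100, although each one finished applying by t=5.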
![Page 134: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/134.jpg)
134
Scenario: Blocking Transaction and related failures
![Page 135: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/135.jpg)
135
Scenario: Blocking Transaction and related failures
execute prepare replicate commit N1
apply commit N2
The bigger the transaction, the bigger the backlog of small transactions behind it, and this eventually causes FLOW_CONTROL.
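How the backlog leads to flow control can be sketched as follows. This is a deliberately simplified model under stated assumptions: the function name, the tick-based timing and the arrival rate are hypothetical, and `fc_limit` only plays the role of a queue-length threshold (comparable in spirit to the Galera `gcs.fc_limit` provider option); the real algorithm has more inputs.

```python
def first_flow_control_tick(big_txn_ticks, arrivals_per_tick, fc_limit):
    """Return the tick at which the backlog first exceeds fc_limit, or None.

    While the large transaction at the head of the queue is still being
    applied, nothing ordered after it can commit, so every write-set that
    arrives in the meantime stays in the queue.
    """
    queued = 0
    for tick in range(1, big_txn_ticks + 1):
        queued += arrivals_per_tick
        if queued > fc_limit:
            return tick
    return None


# Hypothetical numbers: the large transaction takes 100 ticks to apply,
# 5 small write-sets arrive per tick, and the threshold is 100 write-sets.
print(first_flow_control_tick(big_txn_ticks=100, arrivals_per_tick=5, fc_limit=100))

# A shorter transaction finishes before the backlog reaches the threshold.
print(first_flow_control_tick(big_txn_ticks=10, arrivals_per_tick=5, fc_limit=100))
```

With these numbers the first call prints 21 and the second prints None, which matches the slide: the longer the head transaction runs, the more likely the node is to emit flow control.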
![Page 136: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/136.jpg)
136
Scenario: Blocking Transaction and related failures
![Page 137: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/137.jpg)
137
Scenario: Blocking Transaction and related failures
![Page 138: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/138.jpg)
138
Scenario: Blocking Transaction and related failures
The first snag appears when the originating node blocks all resources to replicate a long-running transaction.
The second snag appears when the replicating node emits flow control.
![Page 139: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/139.jpg)
139
Scenario: Network latency and related failures
PXC doesn’t like long-running transactions.
To load data, use LOAD DATA INFILE, which causes an intermediate commit every 10K rows. Note: a random failure can cause partial data to get committed.
DDL can block or stall the complete cluster workload, as DDL statements need to execute in total isolation. (An alternative is to use RSU, but be careful, as it is a local operation on the node.)
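The idea behind the intermediate commits for data loads (many small transactions instead of one long-running one) can be illustrated outside the server. The sketch below is an assumption-laden stand-in, not how LOAD DATA INFILE is implemented: `batches`, `load_in_batches` and the `commit_batch` callback are hypothetical names, and the callback represents "insert these rows and commit".

```python
from itertools import islice


def batches(rows, batch_size=10_000):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def load_in_batches(rows, commit_batch, batch_size=10_000):
    """Hand each batch to commit_batch and return the number of rows committed.

    If commit_batch raises part-way through, the batches already committed
    stay committed: this is the partial-data risk noted on the slide.
    """
    committed = 0
    for chunk in batches(rows, batch_size):
        commit_batch(chunk)  # hypothetical callback: insert rows, then COMMIT
        committed += len(chunk)
    return committed


# Stand-in callback that only records the size of each committed batch.
sizes = []
total = load_in_batches(range(25_000), lambda chunk: sizes.append(len(chunk)))
print(total, sizes)
```

Each batch becomes a short transaction that replicates and commits on its own, so other transactions are never queued behind one huge write-set.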
![Page 140: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/140.jpg)
140
One last important note
● The majority of errors are due to misconfiguration or a difference in configuration between nodes.
● PXC recommends the same configuration on all nodes of the cluster.
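One way to spot such differences is to compare the settings of every node side by side. The sketch below assumes the settings have already been collected into one dictionary per node (for example from each node's configuration file); the `config_drift` helper, the node names and the values shown are all hypothetical.

```python
def config_drift(node_configs):
    """Return {setting: {node: value}} for settings that differ between nodes.

    A setting that is missing on a node is reported with the value None.
    """
    all_keys = set()
    for settings in node_configs.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {node: settings.get(key) for node, settings in node_configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical settings for a three-node cluster.
cluster = {
    "node1": {"max_connections": 500, "innodb_flush_log_at_trx_commit": 1},
    "node2": {"max_connections": 500, "innodb_flush_log_at_trx_commit": 2},
    "node3": {"max_connections": 500},
}
for setting, values in config_drift(cluster).items():
    print(setting, values)
```

Here only `innodb_flush_log_at_trx_commit` is reported, because `max_connections` is identical on all three nodes.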
![Page 141: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/141.jpg)
PXC Genie: You Wish. We implement
![Page 142: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/142.jpg)
142
PXC Genie: You Wish. We implement
● We would like to hear from you: what do you want next in PXC?
● Is there a specific module where you expect improvement?
● How can Percona help you with PXC or HA?
● Log an issue (mark it as a new improvement): https://jira.percona.com/projects/PXC/issue
● The PXC forum is another way to reach us.
![Page 143: Failure Scenarios and their Recovery Percona XtraDB Cluster · Failure Scenarios and their recovery PXC Genie - You wish. We implement. Q & A. Quick ... So as safety check safe_to_bootstrap](https://reader033.vdocuments.net/reader033/viewer/2022052720/5f0947e17e708231d426126e/html5/thumbnails/143.jpg)
Questions and Answers