replication election and consensus algorithm refinements for mongodb 3.2

24
Distributed Consensus in MongoDB Spencer T Brody Senior Software Engineer at MongoDB @stbrody

Upload: mongodb

Post on 11-Aug-2015

498 views

Category:

Technology


10 download

TRANSCRIPT

Page 1: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Distributed Consensus in MongoDB

Spencer T Brody

Senior Software Engineer at MongoDB

@stbrody

Page 2: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Page 3: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Why use Replication?

• Data redundancy• High availability

Page 4: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

What is Consensus?

• Getting multiple processes/servers to agree on something• Must handle a wide range of failure modes

• Disk failure• Network partitions• Machine freezes• Clock skews

Page 5: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Basic consensus

State Machine

X 3

Y 2

Z 7

State Machine

X 3

Y 2

Z 7

State Machine

X 3

Y 2

Z 7

Page 6: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Leader Based Consensus

State Machine

X 3

Y 2

Z 7

Replicated Log

X 1⬅️

Y 2⬅️

X 3⬅️

State Machine

X 3

Y 2

Z 7

Replicated Log

X 1⬅️

Y 2⬅️

X 3⬅️

State Machine

X 3

Y 2

Z 7

Replicated Log

X 1⬅️

Y 2⬅️

X 3⬅️

Page 7: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Page 8: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Elections

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 9: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Elections

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 10: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Elections

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 11: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Elections

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 12: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Elections

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 13: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Data Replication

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 14: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Page 15: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

• Goals and inspiration from Raft Consensus Algorithm• Preventing double voting• Monitoring node status• Calling for elections

Page 16: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Goals for MongoDB 3.2

• Decrease failover time• Speed up detection and resolution of false primary situations

Page 17: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Finding Inspiration in Raft

• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro: https://ramcloud.stanford.edu/raft.pdf

• Designed to address the shortcomings of Paxos• Easier to understand• Easier to implement in real applications

• Provably correct• Remarkably similar to what we’re doing already

Page 18: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Raft Concepts

• Term (election) IDs• Monitoring node status using existing data replication channel• Asymmetric election timeouts

Page 19: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Preventing Double Voting

• Can’t vote for 2 nodes in the same election• Pre-3.2: 30 second vote timeout• Post-3.2: Term IDs• Term:

• Monotonically increasing ID• Incremented on every election *attempt*• Lets voters distinguish elections so they can vote twice quickly in

different elections.

Page 20: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Monitoring Node Status

• Pre-3.2: Heartbeats• Sent every two seconds from every node to every other node• Volume increases quadratically as nodes are added to the replica set

• Post-3.2: Extra metadata sent via existing data replication channel• Utilizes chained replication• Faster heartbeats = faster elections and detection of false primaries

Page 21: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Data Replication

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

X 3

Y 2

Z 7

X ⬅️1

Y ⬅️2

X ⬅️3

Page 22: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Determining When To Call For An Election

• Tradeoff between failover time and spurious failovers• Node calls for an election when it hasn’t heard from the primary within

the election timeout• Starting in 3.2:

• Election timeout is configurable• Election timeout is varied randomly for each node• Varying timeouts help reduce tied votes• Fewer tied votes = faster failover

Page 23: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Conclusion

• MongoDB 3.2 will have• Faster failovers• Faster error detection• More control to prevent spurious failovers

• This means your systems are• More stable• More resilient to failure• Easier to maintain

Page 24: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2

Questions?