how raft consensus algorithm will make replication even better in mongodb 3.2 / henrik ingo...

26
How Raft consensus algorithm will make replication even better in MongoDB 3.2 Speaker: Henrik Ingo, @h_ingo Author: Spencer T Brody, @stbrody

Upload: ontico

Post on 18-Feb-2017

693 views

Category:

Engineering


1 download

TRANSCRIPT

How Raft consensus algorithm will make replication even better in MongoDB 3.2

Speaker: Henrik Ingo, @h_ingoAuthor: Spencer T Brody, @stbrody

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Why use Replication?

• Data redundancy• High availability

• Geographical distribution• Latency

What is Consensus?

• Getting multiple processes/servers to agree on state machine state

• Must handle a wide range of failure modes• Disk failure• Network partitions• Machine freezes• Clock skews• Bugs?

Basic consensus

State Machine

X 3Y 2

Z 7

State Machine

X 3

Y 2

Z 7

State Machine

X 3Y 2

Z 7

Leader Based Consensus

State Machine

X 3

Y 2

Z 7

Replicated Log

X 1⬅️

Y 2⬅️

X 3⬅️

State Machine

X 3Y 2

Z 7

Replicated Log

X 1⬅️Y 2⬅️

X 3⬅️

State Machine

X 3Y 2

Z 7

Replicated Log

X 1⬅️Y 2⬅️

X 3⬅️

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Elections

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

Elections

X 3

Y 2

Z 7

X 3

Y 2

Z 7X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

Elections

X 3

Y 2

Z 7

X 3

Y 2

Z 7X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

Elections

X 3

Y 2

Z 7

X 3

Y 2

Z 7X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

Elections

X 3

Y 2

Z 7

X 3

Y 2

Z 7X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

X 1⬅️

Y 2⬅️

X 3⬅️

Key requirements for elections

• There must be exactly 1 leader at any time• Majority election: there can only be 1 majority

• Note: 50% is not a majority!• Old leader must know to step down, even if

disconnected

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

Agenda

• Introduction to consensus• Leader-based replicated state machine• Elections and data replication in MongoDB• Improvements coming in MongoDB 3.2

• Goals and inspiration from Raft Consensus Algorithm

• Preventing double voting• Monitoring node status• Calling for elections

Goals for MongoDB 3.2

• Decrease failover time• Speed up detection and resolution of false

primary situations

Finding Inspiration in Raft

• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro: https://ramcloud.stanford.edu/raft.pdf

• Designed to address the shortcomings of Paxos• Easier to understand• Easier to implement in real applications

• Provably correct• Remarkably similar to what we’re doing already

Raft Concepts

• Term (election) IDs• Heartbeats using existing data replication

channel• Asymmetric election timeouts

Preventing Double Voting

• Can’t vote for 2 nodes in the same election• Pre-3.2: 30 second vote timeout• Post-3.2: Term IDs• Term:

• Monotonically increasing ID• Incremented on every election *attempt*• Lets voters distinguish elections so they can vote

twice quickly in different elections• Logical clock

MongoDB 3.0 OpLog entry

> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty(){

"ts" : Timestamp(1444466011, 1),"h" : NumberLong("-6240522391332325619"),"v" : 2,"op" : "i","ns" : "test.nulltest","o" : {

"_id" : 3,"a" : 3

}}

millisecond + counter

Random hash = gtid

MongoDB 3.2 OpLog entry> db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty(){

"ts" : Timestamp(1445632081, 861),"t" : NumberLong(42),"h" : NumberLong("5466055178864103715"),"v" : 2,"op" : "u","ns" : "ycsb.usertable","o2" : {

"_id" : "user645414720104877157"},"o" : {

"$set" : {"field4" :

BinData(0,"KT5sN1wrPl...Ny9wIg==")}

}}

millisecond + Raft counter

Random hash = gtid

Raft Term

Heartbeats

• MongoDB 3.0: Heartbeats• Sent every two seconds from every node to every

other node• Volume increases quadratically as nodes are added

to the replica set• Odd behaviors with asymmetric network partitions,

corner cases• MongoDB 3.2: Extra metadata sent

via existing data replication channel• Utilizes chained replication• Faster heartbeats = faster elections

and detection of false primaries

Daisy chaining

X 3Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

X 3Y 2

Z 7

X 1⬅️Y 2⬅️

X 3⬅️

X 3

Y 2

Z 7

X 1⬅️

Y 2⬅️

X 3⬅️

Election timeouts

• Tradeoff between failover time and spurious failovers

• Node calls for an election when it hasn’t heard

from the primary within the election timeout• Starting in 3.2:

• Single election timeout• Election timeout is configurable• Election timeout is varied randomly

for each node• Varying timeouts help reduce tied votes• Fewer tied votes = faster failover

rs.config().settings.

electionTimeoutMillis

Conclusion

• MongoDB 3.2 will have• Faster failovers• Faster error detection• More control to prevent spurious failovers• Robust also in corner cases

• This means your systems are• More resilient to failure• More uptime• Easier to maintain

Questions?