Webinar: Replication and Replica Sets

Summer 2013. William Zola, Member of Technical Staff, 10gen

Upload: mongodb

Posted on 16-Nov-2014

1,790 views

Category: Technology

DESCRIPTION

MongoDB supports replication for failover and redundancy. In this session we will introduce the basic concepts around replica sets, which provide automated failover and recovery of nodes. We'll cover how to set up, configure, and initiate a replica set; methods for using replication to scale reads; and proper architecture for durability.

TRANSCRIPT

Page 1: Webinar: Replication and Replica Sets

Summer 2013

Replication and Replica Sets

William Zola, Member of Technical Staff, 10gen

Page 2

Why Replication?

To keep your data safe

Page 3

Why Replication?

To keep your data available

Page 4

Why Replication?

Because bad things happen to good data centers

Page 5

What is replication and why do we need it?

(Diagram: replication produces copies of important data)

Page 6

Agenda

• Using replica sets for high availability
  – PRIMARY, SECONDARY, and ARBITER nodes
  – PRIMARY elections
• Using replica sets for disaster recovery
• Configuring a replica set so there is no single point of failure
• No-downtime maintenance
• Durability in a networked environment

Page 7

Audience

• Not new to DBA work or system administration
• New to MongoDB or MongoDB replication

Page 8

Use Cases

Page 9

Stakeholders

Pages 10-14

Use Cases

• High Availability (automatic failover)
• Disaster Recovery
• No downtime for maintenance
  – Backups
  – Maintenance (index rebuilds, compaction)
• Replica Set is "transparent" to the application
• Read Scaling (extra copies to read from)

Page 15

MongoDB Replication Basics

Page 16

Replica Set Features

• A cluster of N servers
• Any (one) node can be primary
• All writes go to the primary
• Reads go to the primary by default, optionally to a secondary
• Consensus election of the primary
• Automatic failover
• Automatic recovery

(Diagram: three nodes; the primary receives the writes and default reads, and secondaries can serve optional reads)
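The routing rules on this slide can be sketched as a toy model. This is illustrative Python, not driver code; names like `ToyReplicaSet` and `prefer_secondary` are invented for the example:

```python
# Toy model of replica-set read/write routing (illustrative only;
# real MongoDB drivers do this via read preferences).
class ToyReplicaSet:
    def __init__(self, nodes):
        self.primary = nodes[0]        # assume nodes[0] won the election
        self.secondaries = nodes[1:]

    def write(self, doc):
        # All writes go to the primary.
        return (self.primary, doc)

    def read(self, prefer_secondary=False):
        # Reads go to the primary by default, optionally to a secondary.
        if prefer_secondary and self.secondaries:
            return self.secondaries[0]
        return self.primary

rs = ToyReplicaSet(["node1", "node2", "node3"])
assert rs.write({"a": 1})[0] == "node1"
assert rs.read() == "node1"
assert rs.read(prefer_secondary=True) == "node2"
```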

Page 17

How MongoDB Replication Works

• A replica set is two or more nodes

(Diagram: Node 1, Node 2 [primary], Node 3)

Page 18

How MongoDB Replication Works

• An election establishes the PRIMARY
• Data replicates from the PRIMARY to the SECONDARIES

(Diagram: Node 2, the primary, replicates data to Nodes 1 and 3)
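In MongoDB, secondaries replicate by tailing the primary's operation log (the oplog) and replaying it in order. A minimal sketch of that idea, with a made-up operation format rather than the real oplog entries:

```python
# Minimal sketch of oplog-style replication: the primary logs each
# operation, and a secondary replays the log in order to catch up.
primary_oplog = []
primary_data, secondary_data = {}, {}

def primary_write(key, value):
    primary_data[key] = value
    primary_oplog.append(("set", key, value))   # record the operation

def secondary_sync():
    for op, key, value in primary_oplog:        # replay ops in order
        if op == "set":
            secondary_data[key] = value

primary_write("x", 1)
primary_write("y", 2)
secondary_sync()
assert secondary_data == primary_data           # secondary caught up
```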

Page 19

Types of Outage

Planned (handled with no-downtime maintenance):
– Hardware upgrade
– O/S or file-system tuning
– Relocation of data to a new file-system / storage
– Software upgrade

Unplanned (handled by automatic failover):
– Hardware failure
– Data center failure
– Region outage
– Human error
– Application corruption

Page 20

Mechanics of Automatic Failover

Page 21

Mechanics of Automatic Failover

• Data replicates from the PRIMARY to the SECONDARIES

(Diagram: Node 2, the primary, replicates data to Nodes 1 and 3)

Page 22

Mechanics of Automatic Failover

• An election establishes the PRIMARY
• Data replicates from the PRIMARY to the SECONDARIES
• The PRIMARY might FAIL

(Diagram: Node 2, the primary, fails mid-replication)

Page 23

Mechanics of Automatic Failover

• Automatic election of a new PRIMARY if a majority exists

(Diagram: Node 2 is DOWN; Nodes 1 and 3 negotiate a new primary)

Page 24

Mechanics of Automatic Failover

• A new PRIMARY is elected

(Diagram: Node 2 is DOWN; the negotiation finishes and one of the surviving nodes becomes the new primary)

Page 25

Mechanics of Automatic Failover

• Automatic recovery of the failed node
• Node 2 is RECOVERING; it can perform a full resync from a secondary if necessary

(Diagram: Node 2 rejoins while another node serves as primary)

Page 26

Mechanics of Automatic Failover

• Once caught up, the recovered node resumes syncing from the primary
• The original replica set configuration is re-established

(Diagram: Nodes 1, 2, and 3 are all back in the set)

Page 27

Cluster Size and Rules of Failover

Page 28

Primary Election

As long as a partition can see a majority (>50%) of the cluster, it will elect a primary.
A node must have a STRICT majority to be elected primary!

(Diagram: one primary and two secondaries)
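The strict-majority rule on this slide, and the failure scenarios on the next few, can be captured in a few lines of Python (a toy check, not MongoDB code):

```python
def can_elect_primary(visible, total):
    # A partition can elect a primary only if it sees a STRICT
    # majority (> 50%) of the replica set.
    return visible > total / 2

assert can_elect_primary(2, 3)        # 66% visible: a primary is elected
assert not can_elect_primary(1, 3)    # 33% visible: read-only mode
assert not can_elect_primary(2, 4)    # 50% is NOT a strict majority
```

This is why the later slides recommend an odd number of votes: with an even cluster, a clean 50/50 split leaves neither side able to elect.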

Page 29

Simple Failure

66% of the cluster is visible, so a primary is elected.

(Diagram: one failed node; the two surviving nodes are a primary and a secondary)

Page 30

Simple Failure

33% of the cluster is visible, so the surviving node runs in read-only mode.

(Diagram: two failed nodes; the remaining secondary cannot become primary)

Page 31

Network Partition

(Diagram: one primary and two secondaries, about to be partitioned)

Page 32

Network Partition

66% of the cluster is visible on the majority side, so a primary is elected there.

(Diagram: the old primary is cut off; a secondary on the majority side becomes the new primary)

Page 33

Network Partition

33% is visible on the minority side, so it runs in read-only mode.

(Diagram: the minority side cannot elect a primary)

Page 34

No "Split Brain" Problem

A node must be elected by a strict majority of the set in order to be a primary:
• Only the primary node can accept writes
• A replica set never has two primary nodes

(Diagram: one primary and two secondaries)

Page 35

Even Cluster Size

(Diagram: one primary and three secondaries)

Pages 36-37

Even Cluster Size ✗ (ODD = good)

Only 50% of the cluster is visible after the partition. That is not a strict majority, so no primary can be elected and the set is read-only.

(Diagram: a four-node set split 2-2; both halves are read-only)

Page 38

Types of Nodes

• Regular: a data node that holds a copy of your data
  – Primary: the data node that won the election
  – Secondary: any other data node
  – Secondaries can have different priorities and other configuration options
• Arbiter: holds no data, but it can vote; use it to break ties

Page 39

Add an Arbiter!

Add an arbiter node to break ties:
• Odd number of votes in the set
• The arbiter is lightweight; it does not store data

(Diagram: one primary, three secondaries, and an arbiter)
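The vote arithmetic behind "add an arbiter" can be sketched directly. This assumes the default of one vote per member, which is how replica sets are configured out of the box:

```python
def total_votes(data_nodes, arbiters):
    # By default, every data node and every arbiter casts one vote.
    return data_nodes + arbiters

def side_can_elect(side_votes, total):
    # A side of a partition needs a strict majority of all votes.
    return side_votes > total / 2

# Four data nodes split 2-2 by a partition: neither side has a majority.
assert not side_can_elect(2, total_votes(4, 0))
# Add an arbiter (5 votes total): whichever side sees the arbiter
# holds 3 of 5 votes and can elect a primary.
assert side_can_elect(2 + 1, total_votes(4, 1))
```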

Page 40

High Availability


Page 42

No Downtime Maintenance

1. Take a secondary out of the set
2. Perform maintenance
3. Return the secondary to the set
4. Wait for it to catch up ✓

(Diagram: one primary and three secondaries)

Page 43

No Downtime Maintenance

1. Take a secondary out of the set
2. Perform maintenance
3. Return the secondary to the set
4. Wait for it to catch up
5. Step down the primary (wait for a new primary to be elected)
6. Repeat steps 1-4 on the former primary ✓
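The rolling-maintenance procedure above can be walked through as a toy simulation (node names and the `maintain` helper are invented for the example):

```python
# Toy walk-through of rolling maintenance: service the secondaries
# first, then step down the primary and service it last.
nodes = {"node1": "PRIMARY", "node2": "SECONDARY", "node3": "SECONDARY"}
maintained = []

def maintain(name):
    # Stands in for steps 1-4: take out, service, rejoin, catch up.
    maintained.append(name)

for name, role in list(nodes.items()):
    if role == "SECONDARY":
        maintain(name)

# Step 5: step down the primary; an already-maintained secondary takes over.
nodes["node1"], nodes["node2"] = "SECONDARY", "PRIMARY"
maintain("node1")                      # step 6: repeat steps 1-4 on it

assert maintained == ["node2", "node3", "node1"]
assert list(nodes.values()).count("PRIMARY") == 1   # never two primaries
```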

Page 44

2 Replicas + Arbiter??

Is this a good configuration?

(Diagram: a primary, a secondary, and an arbiter)

Page 45

2 Replicas + Arbiter??

1. Take the secondary out of the set
2. Perform maintenance
3. The primary node crashes
   – Uh-oh!
   – The replica set is down
   – Data from the primary hasn't been replicated

Page 46

Use Three Data Nodes!

Use a minimum of three data nodes to assure high availability.

(Diagram: a primary and two secondaries)

Page 47

Avoid Single Points of Failure


Page 49

Avoid Single Points of Failure

(Diagram: a primary and two secondaries all behind one top-of-rack switch; the switch can fail, or the rack can fall over)

Page 50

Better

(Diagram: the nodes still share one data center; the remaining risks are loss of internet connectivity, or the DC burning down)

Page 51

Even Better

(Diagram: nodes spread across San Francisco and Dallas)

Page 52

Priorities

(Diagram: priority-1 nodes in San Francisco; a priority-0 secondary in Dallas)

The Dallas node sits in the disaster-recovery data center: with priority 0, it will never become primary automatically.
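The effect of priorities can be sketched as a simplified election (real MongoDB elections also weigh data freshness and votes; this toy model only shows that priority-0 nodes never win):

```python
def elect(candidates):
    # candidates: (name, priority) pairs for visible, up-to-date nodes.
    # Priority-0 nodes can never become primary; otherwise the
    # highest-priority candidate wins in this simplified model.
    eligible = [(p, n) for n, p in candidates if p > 0]
    return max(eligible)[1] if eligible else None

sf = [("sf1", 1), ("sf2", 1)]          # normal data-center nodes
dallas = [("dr1", 0)]                  # disaster-recovery node, priority 0

assert elect(sf + dallas) in ("sf1", "sf2")   # a priority-1 node wins
assert elect(dallas) is None                  # DR node alone never takes over
```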

Page 53

Even Better

(Diagram: five nodes spread across San Francisco, Dallas, and New York)

Page 54

Node Priority

(Diagram: five nodes with priorities 10, 10, 5, 5, and 0 across San Francisco, New York, and Dallas)

Page 55

Node Sizing

Nodes that can become primary should be sized equally:
• RAM
• Disk
• IOPS

(Diagram: the same five-node, three-data-center deployment as the previous slide)

Page 56

Recap

Page 57

Replica Set Review

A replica set contains N nodes:
• At most one node is the PRIMARY
• All writes go to the PRIMARY
• SECONDARY nodes contain up-to-date copies of the data
• SECONDARY nodes continually copy data from the PRIMARY

(Diagram: writes flow to the primary; two secondaries replicate from it)

Page 58

Failover Review

If the PRIMARY fails, the replica set can elect a new PRIMARY:
• A strict (>50%) majority is required for election
• The former PRIMARY rejoins the set as a SECONDARY when it recovers

(Diagram: writes move to the newly elected primary)

Page 59

Partition Review

A network partition prevents the nodes from communicating:
• The replica set treats a partition as a "down node"
• A node must get a strict majority of the votes to be elected PRIMARY
• Even numbers of votes reduce availability
• Use arbiters to break ties
• Spread your nodes across multiple data centers

Page 60

Using Applications with Replica Sets

Page 61

Application View

(Diagram: your application code talks to the MongoDB driver)

Page 62

Under the Covers

(Diagram: application code → MongoDB driver → replica set of one primary and two secondaries)

Replica set connection string:
my-set/host1:27017,host2:27017,host3:27017
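That connection string is just a set name plus a seed list of host:port pairs. A hypothetical parser (drivers do this for you; `parse_seed_list` is invented for illustration) makes the structure explicit:

```python
def parse_seed_list(uri):
    # Split "set-name/host:port,host:port,..." into its parts.
    # (Hypothetical helper for illustration; real drivers parse this.)
    set_name, _, hosts = uri.partition("/")
    return set_name, [tuple(h.split(":")) for h in hosts.split(",")]

name, members = parse_seed_list("my-set/host1:27017,host2:27017,host3:27017")
assert name == "my-set"
assert members[0] == ("host1", "27017")
assert len(members) == 3
```

The seed list does not have to name every member; the driver discovers the rest of the set (and which node is primary) from any reachable seed.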

Page 63

Secondary Reads

(Diagram: the driver reads from a secondary in the replica set)

Secondary reads are potentially stale!

Page 64

Failover

(Diagram: the primary fails ✗; the driver gets a connection exception)

Page 65

New Election

(Diagram: a secondary is elected as the new primary; the driver reconnects to it)
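From the application's point of view, failover shows up as a connection exception followed by a successful retry once the new primary is elected. A sketch of that retry pattern (toy code, not a driver API; `ConnectionException` and `with_retry` are invented names):

```python
# Sketch of retrying an operation across a failover: the first attempt
# hits a connection exception, the retry reaches the new primary.
class ConnectionException(Exception):
    pass

def with_retry(op, attempts=3):
    for _ in range(attempts):
        try:
            return op()
        except ConnectionException:
            continue               # driver rediscovers the new primary
    raise ConnectionException("no primary found")

calls = []
def flaky_write():
    calls.append(1)
    if len(calls) == 1:
        raise ConnectionException()   # old primary just went down
    return "ok"                       # new primary elected; write succeeds

assert with_retry(flaky_write) == "ok"
assert len(calls) == 2
```

Note that a blind retry can double-apply a write that actually succeeded before the crash; real applications should make retried operations idempotent.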

Page 66

Durability and Replica Sets

Page 67

Durability

Wikipedia: "In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently."

Page 68

The Lifetime of a Write Operation (single node)

(Diagram: application code → MongoDB driver → network write → validate data → update data in RAM → update journal)

Page 69

Get Last Error

(Diagram: after the network write, the driver sends a getLastError command; the server validates the data and returns the getLastError result)

Page 70

Write Concern

(Diagram: what the driver waits for at each level)
• {w:0}: network acknowledgement only
• {w:1}: check for error on the primary
• {j:1}: wait for a journal sync
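The write-concern levels on this slide form a ladder of guarantees. A toy model of the acknowledgement decision (the `state` flags are invented stand-ins for what actually happened to the write, not server fields):

```python
def is_acknowledged(w, j, state):
    # Toy model of the getLastError ladder on this slide.
    # state flags: "sent" (network write went out), "applied" (the
    # primary applied it without error), "journaled" (journal synced).
    if w == 0:
        return state["sent"]           # {w:0}: fire-and-forget
    if not state["applied"]:
        return False                   # {w:1}: surface the error
    if j:
        return state["journaled"]      # {j:1}: also wait for the journal
    return True

sent_only = {"sent": True, "applied": False, "journaled": False}
applied = {"sent": True, "applied": True, "journaled": False}

assert is_acknowledged(0, False, sent_only)       # {w:0} reports success
assert not is_acknowledged(1, False, sent_only)   # {w:1} catches the error
assert is_acknowledged(1, False, applied)         # {w:1} succeeds
assert not is_acknowledged(1, True, applied)      # {j:1} still waiting
```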

Page 71

Replica Sets and Durability

A write that has replicated to a majority of the nodes is durable:
• The most up-to-date node will be elected primary
• The write will be present on that node

There is no guarantee of which nodes will have the write:
• Use "tag sets" for finer-grained control

(Diagram: a five-node set; a write that has reached three of the five nodes is durable)
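Why does majority replication make a write durable? Because any electable majority must overlap the majority that acknowledged the write, and the most up-to-date visible node wins the election. A toy model of that argument (node names and integer "optimes" are invented for the example):

```python
def elect_most_up_to_date(optimes, visible):
    # The visible candidate with the highest replicated optime wins.
    return max(visible, key=lambda n: optimes[n])

# A write at optime 10 reached a majority (3 of 5 nodes, including the
# primary n1) before n1 died; n2 and n3 carry it, n4 and n5 do not.
optimes = {"n2": 10, "n3": 10, "n4": 9, "n5": 9}   # n1 is down
new_primary = elect_most_up_to_date(optimes, ["n2", "n3", "n4", "n5"])
assert optimes[new_primary] == 10                  # the write survives
```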

Page 72

Network Write Concern

• Specific number of nodes: {w: 2}
• Majority of data nodes: {w: 'majority'}
• Tag set: {w: 'my tag set'}
• Wait with a timeout: {w: 2, wtimeout: 2000}

(Diagram: the driver's getLastError waits until enough replica set members acknowledge the write)

Page 73

Wrapping it Up

Page 74

Why Replication?

To keep your data safe and available

Page 75

Features

• High Availability (auto-failover)
• Disaster Recovery
• No downtime for maintenance
• Replica Set is "transparent" to the application
• Writes are durable with the appropriate Write Concern

Page 76

Just Use It!

• Easy to set up
  – Try it on a single machine
  – Run multiple nodes on different ports on a single host
• Check the online documentation for replica set tutorials
  – http://docs.mongodb.org/manual/replication/#tutorials

Page 77

Questions?

Page 78

Thank You!