secondsite: disaster tolerance as a service

SECONDSITE: DISASTER TOLERANCE AS A SERVICE

Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield

2

FAILURES IN A DATACENTER

3

TOLERATING FAILURES IN A DATACENTER

The initial idea behind Remus was to tolerate datacenter-level failures.

REMUS

4

CAN A WHOLE DATACENTER FAIL?

Yes! It’s a “Disaster”!

5

DISASTERS

Illustrative Image courtesy of TangoPango, Flickr.

“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”

- Om Malik, GigaOM

“Truck driver in Texas kills all the websites you really use”

…Southlake FD found that he had low blood sugar

- valleywag.com

6

DISASTERS..

Water-main break cripples Dallas County computers, operations

The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.

- Dallas Morning News, Jun 2010

7

DISASTERS..

8

MORE FODDER BACK HOME

“An explosion … near our server bank … electrical box containing 580 fiber cables.

electrical box … was covered in asbestos … mandated the wearing of hazmat suits ....

Worse yet, the dynamic rerouting - which is the hallmark of the internet … did not function.

In other words, the perfect storm. Oh well. S*it happens.”

- Dan Empfield, Slowswitch.com - a Gossamer Threads customer.

9

DISASTER RECOVERY – THE OLD FASHIONED WAY

Storage replication between a primary and backup site.

Manually restore physical servers from backup images.

Data Loss and Long Outage periods.

Expensive Hardware – Storage Arrays, Replicators, etc.

10

STATE OF THE ART DISASTER RECOVERY

[Figure: a Protected Site and a Recovery Site, each running VirtualCenter with Site Recovery Manager, with Datastore Groups linked by Array Replication. Labels: VMs online in Protected Site; VMs offline (Recovery Site). On failure of the Protected Site (X): VMs become unavailable; VMs powered on.]

Source: VMware Site Recovery Manager – Technical Overview

http://communities.vmware.com/servlet/JiveServlet/download/1035618-13033/Site_Recovery_Manager_Technical_Overview.ppt


11

PROBLEMS WITH EXISTING SOLUTIONS

Data Loss & Service Disruption (RPO ~15min, RTO ~few hours)

Complicated Recovery Planning (e.g. service A needs to be up before B, etc.)

Application Level Recovery

Bottom Line: Current State of DR is

• Complicated

• Expensive

• Not suitable for a general purpose cloud-level offering.

12

DISASTER TOLERANCE AS A SERVICE?

Our Vision

13

OVERVIEW

• A Case for Commoditizing Disaster Tolerance

• SecondSite – System Design

• Evaluation & Experiences

14

PRIMARY & BACKUP SITES

5ms RTT

15

FAILOVER & FAILBACK WITHOUT OUTAGE

Primary Site: Vancouver | Backup Site: Kamloops

Primary Site: Vancouver | Primary Site: Kamloops

Primary Site: Kamloops | Backup Site: Vancouver

Complete State Recovery (CPU, disk, memory, network)

No Application Level Recovery

16

MAIN CONTRIBUTIONS

Remus (NSDI ’08)

• Checkpoint based State Replication

• Fully Transparent HA

• Recovery Consistency

• No Application level recovery

RemusDB (VLDB ’11)

• Optimize Server Latency

• Reduce Replication Bandwidth by up to 80% using Page Delta Compression and Disk Read Tracking

SecondSite (VEE ’12)

• Failover Arbitration in Wide Area

• Stateful Network Failover over Wide Area
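
To make the "Page Delta Compression" item concrete, here is a minimal sketch of the general idea: send only the compressed difference between the previous and current copy of a memory page. The function names, the 4 KiB page size, and the use of XOR plus zlib are assumptions made for illustration; the slides do not specify the encoding RemusDB actually uses.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this illustration


def encode_delta(prev_page: bytes, cur_page: bytes) -> bytes:
    """XOR the two page versions and compress the (mostly zero) result."""
    delta = bytes(a ^ b for a, b in zip(prev_page, cur_page))
    return zlib.compress(delta)


def apply_delta(prev_page: bytes, payload: bytes) -> bytes:
    """Rebuild the current page on the backup from its previous copy."""
    delta = zlib.decompress(payload)
    return bytes(a ^ b for a, b in zip(prev_page, delta))


if __name__ == "__main__":
    prev = bytes(range(256)) * (PAGE_SIZE // 256)
    cur = bytearray(prev)
    cur[100:108] = b"ABCDEFGH"  # a small in-place update to the page
    payload = encode_delta(prev, bytes(cur))
    print(len(payload), "bytes sent instead of", PAGE_SIZE)
```

The saving depends on how little of a page changes between checkpoints; a page that is rewritten entirely gains nothing from this scheme.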

17

CONTRIBUTIONS..

18

FAILURE DETECTION IN REMUS

[Figure: Primary and Backup hosts on a LAN, each with NIC1 and NIC2; checkpoints flow from Primary to Backup, and both hosts connect to the External Network.]

• A pair of independent dedicated NICs carry replication traffic.

• Backup declares Primary failure only if

  • It cannot reach Primary via NIC1 and NIC2

  • It can reach External N/W via NIC1

• Failure of Replication link alone results in Backup shutdown.

• Split Brain occurs only when both NICs/links fail.
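
The backup-side rules on this slide can be read as a small decision function. The sketch below is one possible reading, not the Remus implementation: the names are hypothetical, and two cases the slide leaves open are filled in by assumption (losing only one path to the primary is treated as the "replication link alone" case, and a backup that cannot reach the external network stands down instead of failing over).

```python
from enum import Enum


class BackupAction(Enum):
    KEEP_REPLICATING = "keep replicating"
    FAILOVER = "declare primary failed and take over"
    SHUTDOWN = "shut down the backup"


def backup_decision(primary_via_nic1: bool,
                    primary_via_nic2: bool,
                    external_via_nic1: bool) -> BackupAction:
    """Decide what the backup does, given three reachability probes."""
    if primary_via_nic1 and primary_via_nic2:
        # Both paths to the primary are healthy: nothing to do.
        return BackupAction.KEEP_REPLICATING
    if not primary_via_nic1 and not primary_via_nic2 and external_via_nic1:
        # Primary unreachable on both NICs while the outside world is
        # still reachable: the slide's condition for declaring failure.
        return BackupAction.FAILOVER
    # Assumption: any other partial loss makes the backup stand down,
    # so that at most one copy of the VM keeps running.
    return BackupAction.SHUTDOWN


if __name__ == "__main__":
    print(backup_decision(False, False, True).value)
```

Note that if both links fail while the primary itself is still running, this function still returns FAILOVER, which is exactly the split-brain case the slide calls out.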

19

FAILURE DETECTION IN WIDE AREA DEPLOYMENTS

Cannot distinguish between link and node failure.

Higher chances of Split Brain as the network is not reliable anymore

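[Figure: the same Primary/Backup pair (NIC1 and NIC2 on each host, checkpoints, External Network), but with the LAN replaced by a WAN: the Primary Datacenter and Backup Datacenter are connected by a Replication Channel across the Internet.]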

20

FAILOVER ARBITRATION

Local Quorum of Simple Reachability Detectors.

Stewards can be placed on third party clouds.

Google App Server implementation with ~100 LoC.

Provider/User could have other sophisticated implementations.

21

FAILOVER ARBITRATION..

[Figure: Primary and Backup, each running Quorum Logic and connected by the Replication Stream, poll Stewards 1 to 5 (POLL 1 … POLL 5) under an a priori steward set agreement. Failed links are marked X.]

Primary: “I need majority to stay alive”

Backup: “I need exclusive majority to failover”
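
The two quorum rules in this figure ("I need majority to stay alive" for the primary, "I need exclusive majority to failover" for the backup) can be sketched as follows. This is an illustrative reading under stated assumptions, not the SecondSite code: the function names are hypothetical, and "exclusive majority" is interpreted here as a majority of the agreed steward set that the backup can reach and that has not recently been polled by the primary.

```python
from typing import Set


def majority(steward_count: int) -> int:
    """Smallest number of stewards that forms a strict majority."""
    return steward_count // 2 + 1


def primary_may_stay_alive(stewards: Set[int],
                           reached_by_primary: Set[int]) -> bool:
    """Primary keeps running only while it reaches a majority."""
    return len(stewards & reached_by_primary) >= majority(len(stewards))


def backup_may_failover(stewards: Set[int],
                        reached_by_backup: Set[int],
                        recently_polled_by_primary: Set[int]) -> bool:
    """Backup takes over only with a majority the primary does not share."""
    exclusive = (stewards & reached_by_backup) - recently_polled_by_primary
    return len(exclusive) >= majority(len(stewards))


if __name__ == "__main__":
    stewards = {1, 2, 3, 4, 5}
    # Primary is cut off from stewards 3, 4 and 5; backup reaches them.
    print(primary_may_stay_alive(stewards, {1, 2}))            # False
    print(backup_may_failover(stewards, {3, 4, 5}, {1, 2}))    # True
```

Because both sides need more than half of the same agreed steward set, and the backup's half must exclude any steward the primary still reaches, the two conditions cannot hold at the same time, which is what rules out split brain in this model.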

22

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION

Remus – LAN - Gratuitous ARP from Backup Host

SecondSite – WAN/Internet – BGP Route Update from Backup Datacenter

Need support from upstream ISP(s) at both Datacenters

IP Migration achieved through BGP Multi-homing

23

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..

[Figure: BGP multi-homing through BCNet (AS-271) to the Internet. The Primary Site (Vancouver) and the Backup Site (Kamloops) both host the VMs in stub AS-64678 (134.87.3.0/24) and are linked by replication. Addresses shown: Vancouver 134.87.2.173 and 134.87.2.174; Kamloops 207.23.255.237 and 207.23.255.238. Traffic is routed to the Primary Site and re-routed to the Backup Site on failover. AS-path prepend settings shown: “64678 64678”, “64678 64678 64678 64678”, and “64678”.]
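
The effect of the AS-path prepend settings can be illustrated with a toy route-selection model. This is a simplified sketch: real BGP best-path selection considers several attributes before AS-path length, and the mapping of prepend counts to sites below (primary prepends twice, backup four times in normal operation and once on failover) is an assumption, since the figure labels do not survive in a form that says which site uses which setting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

STUB_AS = 64678  # the stub AS that announces 134.87.3.0/24 in the slides


@dataclass(frozen=True)
class Advertisement:
    site: str
    as_path: Tuple[int, ...]


def advertise(site: str, prepends: int) -> Advertisement:
    """Announce the prefix with the origin AS plus `prepends` extra copies."""
    return Advertisement(site, (STUB_AS,) * (1 + prepends))


def best_route(ads: List[Advertisement]) -> Optional[Advertisement]:
    """Toy upstream decision: prefer the shortest AS path."""
    return min(ads, key=lambda ad: len(ad.as_path)) if ads else None


if __name__ == "__main__":
    normal = [advertise("Vancouver", 2), advertise("Kamloops", 4)]
    print(best_route(normal).site)      # Vancouver

    # On failover the backup re-announces with a shorter path, so it wins
    # even if the old announcement from the primary has not been withdrawn.
    failover = [advertise("Vancouver", 2), advertise("Kamloops", 1)]
    print(best_route(failover).site)    # Kamloops
```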

24

OVERVIEW

• A Case for Commoditizing Disaster Tolerance

• SecondSite – System Design

• Evaluation & Experiences

25

EVALUATION

I want periodic failovers with no downtime!

Did you run regression tests?

Failover Works!!

More than one failure?

I will have to restart HA!

26

RESTARTING HA

Need to Resynchronize Storage.

Avoiding Service Downtime requires Online Resynchronization

Leverage DRBD – only resynchronizes blocks that have changed

Integrate DRBD with Remus

• Add checkpoint based asynchronous disk replication protocol.
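
Resynchronizing only the blocks that have changed can be pictured with a dirty-block set: while the peer is away, every write is recorded, and when it returns only those blocks are copied. The sketch below illustrates that idea in memory; it is not DRBD's implementation or API, and all names in it are hypothetical.

```python
from typing import List, Set


def record_write(disk: List[bytes], dirty: Set[int],
                 block: int, data: bytes) -> None:
    """Apply a write locally and remember that the peer is missing it."""
    disk[block] = data
    dirty.add(block)


def resynchronize(source: List[bytes], target: List[bytes],
                  dirty: Set[int]) -> int:
    """Copy only the dirty blocks to the peer; return how many were sent."""
    copied = 0
    for block in sorted(dirty):
        target[block] = source[block]
        copied += 1
    dirty.clear()
    return copied


if __name__ == "__main__":
    surviving = [b"old"] * 8          # disk at the site that kept running
    returning = list(surviving)       # disk at the site that was down
    dirty: Set[int] = set()
    record_write(surviving, dirty, 2, b"new-2")
    record_write(surviving, dirty, 5, b"new-5")
    print(resynchronize(surviving, returning, dirty), "of 8 blocks copied")
```

The amount of data to resynchronize then grows with the number of distinct blocks written during the outage, not with the size of the disk.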

27

REGRESSION TESTS

Synthetic Workloads to stress test the Replication Pipeline

Failovers every 90 minutes

Discovered some interesting corner cases

Page-table corruptions in memory checkpoints

Write-after-write I/O ordering in disk replication

28

SECONDSITE – THE COMPLETE PICTURE

• Service Downtime includes timeout for failure detection (10s)

• Failure Detection Timeout is configurable

4 VMs x 100 Clients/VM

29

REPLICATION BANDWIDTH CONSUMPTION

4 VMs x 100 Clients/VM

30

DEMO

Expect a real disaster (conference demos are not a good idea!)

31

APPLICATION THROUGHPUT VS. REPLICATION LATENCY

SPECWeb w/ 100 Clients

Kamloops

32

RESOURCE UTILIZATION VS. APPLICATION LOAD

Domain-0 CPU Utilization

Bandwidth usage on Replication Channel

Cost of HA as a function of Application Load (OLTP w/ 100 Clients)

33

RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD

OLTP Workload

34

SETUP WORKFLOW – RECOVERY SITE

The user creates a recovery plan, which is associated with one or more protection groups.




35

RECOVERY PLAN

VM Shutdown

High Priority VM Recovery

Prepare Storage

High Priority VM Shutdown

Normal Priority VM Recovery

Low Priority VM Recovery


