SecondSite: Disaster Tolerance as a Service

Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield

Post on 23-Feb-2016

TRANSCRIPT

Page 1: SecondSite: Disaster Tolerance as a Service

SECONDSITE: DISASTER TOLERANCE AS A SERVICE

Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield

FAILURES IN A DATACENTER

TOLERATING FAILURES IN A DATACENTER

The initial idea behind Remus was to tolerate datacenter-level failures.

REMUS

CAN A WHOLE DATACENTER FAIL?

Yes! It’s a “Disaster”!

DISASTERS

Illustrative Image courtesy of TangoPango, Flickr.

“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”

- Om Malik, GigaOM

“Truck driver in Texas kills all the websites you really use”

…Southlake FD found that he had low blood sugar

- valleywag.com

DISASTERS..

Water-main break cripples Dallas County computers, operations

The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.

- Dallas Morning News, Jun 2010

DISASTERS..

MORE FODDER BACK HOME

“An explosion … near our server bank … electrical box containing 580 fiber cables.

Electrical box … was covered in asbestos … mandated the wearing of hazmat suits ....

Worse yet, the dynamic rerouting, which is the hallmark of the internet … did not function.

In other words, the perfect storm. Oh well. S*it happens.”

- Dan Empfield, Slowtwitch.com - a Gossamer Threads customer.

DISASTER RECOVERY – THE OLD FASHIONED WAY

Storage replication between a primary and backup site.

Manually restore physical servers from backup images.

Data Loss and Long Outage periods.

Expensive Hardware – Storage Arrays, Replicators, etc.

STATE OF THE ART DISASTER RECOVERY

[Figure: a Protected Site and a Recovery Site, each with VirtualCenter, Site Recovery Manager, and Datastore Groups, linked by Array Replication. Figure labels: VMs online in Protected Site; VMs become unavailable; VMs offline; VMs powered on.]

Source: VMware Site Recovery Manager – Technical Overview

PROBLEMS WITH EXISTING SOLUTIONS

Data Loss & Service Disruption (RPO ~15 min, RTO ~few hours)

Complicated Recovery Planning (e.g. service A needs to be up before B, etc.)

Application Level Recovery

Bottom Line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

DISASTER TOLERANCE AS A SERVICE?

Our Vision

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

PRIMARY & BACKUP SITES

5ms RTT

FAILOVER & FAILBACK WITHOUT OUTAGE

Primary Site: Vancouver; Backup Site: Kamloops

Primary Site: Vancouver; Primary Site: Kamloops

Primary Site: Kamloops; Backup Site: Vancouver

Complete State Recovery (CPU, disk, memory, network)

No Application Level Recovery

MAIN CONTRIBUTIONS

Remus (NSDI ’08)

• Checkpoint based State Replication
• Fully Transparent HA
• Recovery Consistency
• No Application level recovery

RemusDB (VLDB ’11)

• Optimize Server Latency
• Reduce Replication Bandwidth by up to 80% using Page Delta Compression and Disk Read Tracking

SecondSite (VEE ’12)

• Failover Arbitration in Wide Area
• Stateful Network Failover over Wide Area
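
The slide names Page Delta Compression but does not say how it is done. As a rough sketch only, one common way to shrink a re-sent memory page is to XOR it against the previously sent version and compress the mostly-zero result. The function names and the choice of zlib here are assumptions for illustration, not the RemusDB implementation.

```python
import zlib


def encode_page_delta(old_page: bytes, new_page: bytes) -> bytes:
    """XOR two versions of a page and compress the (mostly zero) delta."""
    if len(old_page) != len(new_page):
        raise ValueError("pages must have the same size")
    delta = bytes(a ^ b for a, b in zip(old_page, new_page))
    return zlib.compress(delta)


def apply_page_delta(old_page: bytes, encoded: bytes) -> bytes:
    """Rebuild the new page on the backup from its old copy and the delta."""
    delta = zlib.decompress(encoded)
    return bytes(a ^ b for a, b in zip(old_page, delta))


# Hypothetical 4 KiB page in which only four bytes changed.
old = bytes(4096)
changed = bytearray(old)
changed[100:104] = b"\x01\x02\x03\x04"
new = bytes(changed)

encoded = encode_page_delta(old, new)
restored = apply_page_delta(old, encoded)
```

The backup must hold the same `old` page as the primary for the delta to apply, which is why this kind of scheme pairs naturally with checkpoint-based replication.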

CONTRIBUTIONS..

FAILURE DETECTION IN REMUS

[Figure: Primary and Backup hosts on a LAN, each with NIC1 and NIC2. Figure labels: External Network; Checkpoints.]

• A pair of independent dedicated NICs carry replication traffic.

• Backup declares Primary failure only if

• It cannot reach Primary via NIC1 and NIC2

• It can reach External N/W via NIC1

• Failure of Replication link alone results in Backup shutdown.

• Split Brain occurs only when both NICs/links fail.
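
The detection rules on this slide can be read as a small decision function on the backup. The sketch below is one possible reading, not Remus code: the names are hypothetical, and it assumes NIC2 is the link that carries checkpoints, so that losing NIC2 alone is the "failure of Replication link alone" case.

```python
from enum import Enum


class BackupAction(Enum):
    KEEP_REPLICATING = "keep replicating"
    SHUT_DOWN = "shut down backup"
    FAIL_OVER = "declare primary failed and take over"
    STAY_PASSIVE = "do nothing (backup may be isolated)"


def backup_decision(primary_via_nic1: bool,
                    primary_via_nic2: bool,
                    external_via_nic1: bool) -> BackupAction:
    """Map three reachability probes to the backup's next action.

    Assumption: NIC2 is the checkpoint (replication) link.
    """
    if primary_via_nic2:
        # Checkpoints can still flow, so nothing has failed from HA's view.
        return BackupAction.KEEP_REPLICATING
    if primary_via_nic1:
        # Primary is alive but the replication link alone is down.
        return BackupAction.SHUT_DOWN
    if external_via_nic1:
        # Primary unreachable on both NICs while the backup still sees
        # the outside world: treat it as a primary failure.
        return BackupAction.FAIL_OVER
    # Backup cannot reach anything; it is probably the isolated one.
    return BackupAction.STAY_PASSIVE
```

Under this reading, the `FAIL_OVER` branch is also where split brain can arise: if both links fail while the primary is still running, the backup cannot tell the difference.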

FAILURE DETECTION IN WIDE AREA DEPLOYMENTS

Cannot distinguish between link and node failure.

Higher chances of Split Brain as the network is not reliable anymore

[Figure: Primary in the Primary Datacenter and Backup in the Backup Datacenter, each with NIC1 and NIC2. Figure labels: External Network; Checkpoints; LAN; WAN; Replication Channel; INTERNET.]

FAILOVER ARBITRATION

Local Quorum of Simple Reachability Detectors.

Stewards can be placed on third party clouds.

Google App Server implementation with ~100 LoC.

Provider/User could have other sophisticated implementations.

FAILOVER ARBITRATION..

[Figure: Primary and Backup, each with Quorum Logic, connected by the Replication Stream. Each site sends POLL 1 to POLL 5 to Stewards 1 to 5; X marks appear on some of the links.]

A priori Steward Set Agreement

Primary: “I need majority to stay alive”

Backup: “I need exclusive majority to failover”
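
The two quorum rules in this figure can be sketched in a few lines. This is a simplified illustration rather than the SecondSite steward code: the class and function names are hypothetical, and it assumes "exclusive majority" means that a majority of all stewards are reachable by the backup and have not been polled by the primary within a recent time window.

```python
class Steward:
    """A reachability detector that remembers when each site last polled it."""

    def __init__(self):
        self.last_poll = {}  # site name -> timestamp of its latest poll

    def poll(self, site, now):
        self.last_poll[site] = now

    def heard_from(self, site, now, window):
        seen = self.last_poll.get(site)
        return seen is not None and now - seen <= window


def majority(count, total):
    return count > total // 2


def primary_stays_alive(reachable, total):
    """Primary keeps running only while it reaches a majority of stewards."""
    return majority(len(reachable), total)


def backup_may_fail_over(reachable, total, now, window):
    """Backup takes over only with an exclusive majority (see assumption above)."""
    exclusive = [s for s in reachable
                 if not s.heard_from("primary", now, window)]
    return majority(len(exclusive), total)


# Hypothetical partition: primary reaches stewards 1-3, backup reaches 3-5.
stewards = [Steward() for _ in range(5)]
for s in stewards[:3]:
    s.poll("primary", now=100)

partition_failover = backup_may_fail_over(stewards[2:], 5, now=101, window=10)
later_failover = backup_may_fail_over(stewards[2:], 5, now=200, window=10)
```

In the partition case the backup sees three stewards, but one of them has just heard from the primary, so it holds only two exclusive votes and does not fail over. Once the primary's polls age out, the same three stewards form an exclusive majority.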

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION

Remus – LAN - Gratuitous ARP from Backup Host

SecondSite – WAN/Internet – BGP Route Update from Backup Datacenter

Need support from upstream ISP(s) at both Datacenters

IP Migration achieved through BGP Multi-homing

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..

[Figure: BGP Multi-homing. Internet; BCNet (AS-271) with routers at Vancouver (134.87.2.173) and Kamloops (207.23.255.237); peer addresses 134.87.2.174 and 207.23.255.238; Primary Site and Backup Site, each hosting VMs and announcing AS-64678 (stub) (134.87.3.0/24); Replication between the two sites.]

Routing traffic to Primary Site

Re-routing traffic to Backup Site on Failover

as-path prepend 64678 64678

as-path prepend 64678 64678 64678 64678

as-path prepend 64678
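
AS-path prepending steers traffic because, all else being equal, BGP prefers the announcement with the shorter AS path. The toy model below illustrates only that one tie-breaker; real BGP route selection considers other attributes first. The slide does not say which prepend count belongs to which site or phase, so the assignment used here (primary with two prepends, backup with four in normal operation and one after failover) is an assumption.

```python
from dataclasses import dataclass

STUB_AS = 64678  # stub AS number shown on the slide


@dataclass(frozen=True)
class Announcement:
    site: str
    prepends: int  # extra copies of STUB_AS added to the path

    @property
    def as_path(self):
        # The origin AS appears once, plus one entry per prepend.
        return [STUB_AS] * (1 + self.prepends)


def preferred_site(announcements):
    """Pick the site whose announcement has the shortest AS path."""
    return min(announcements, key=lambda a: len(a.as_path)).site


# Normal operation: the backup advertises a deliberately longer path.
normal = [Announcement("primary", 2), Announcement("backup", 4)]

# Failover: the backup re-announces with a shorter path than the primary's.
failover = [Announcement("primary", 2), Announcement("backup", 1)]

before = preferred_site(normal)
after = preferred_site(failover)
```

With these numbers the backup wins after failover even if the primary's stale announcement has not yet been withdrawn upstream.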

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

EVALUATION

I want periodic failovers with no downtime!

Did you run regression tests?

Failover Works!!

More than one failure?

I will have to restart HA!

RESTARTING HA

Need to Resynchronize Storage.

Avoiding Service Downtime requires Online Resynchronization

Leverage DRBD – only resynchronizes blocks that have changed

Integrate DRBD with Remus: Add checkpoint based asynchronous disk replication protocol.
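
Resynchronizing only changed blocks depends on remembering which blocks were written while the peer was unreachable. The toy class below sketches that bookkeeping under simplifying assumptions: it is not DRBD's implementation, the names are hypothetical, and mirroring of writes while the peer is connected is left out.

```python
class ReplicatedDisk:
    """Toy block device that tracks blocks written while its peer is away."""

    def __init__(self, num_blocks, block_size=4096):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.dirty = set()          # block numbers changed during the outage
        self.peer_connected = True

    def write(self, block_no, data):
        self.blocks[block_no] = data
        if not self.peer_connected:
            self.dirty.add(block_no)

    def resync_to(self, peer):
        """Copy only the dirty blocks to the peer; return how many were sent."""
        sent = 0
        for block_no in sorted(self.dirty):
            peer.blocks[block_no] = self.blocks[block_no]
            sent += 1
        self.dirty.clear()
        self.peer_connected = True
        return sent


# Hypothetical outage: three writes touch two distinct blocks.
active = ReplicatedDisk(num_blocks=8, block_size=4)
stale = ReplicatedDisk(num_blocks=8, block_size=4)
active.peer_connected = False
active.write(2, b"abcd")
active.write(5, b"wxyz")
active.write(2, b"efgh")

blocks_sent = active.resync_to(stale)
```

Because a block is tracked once no matter how often it is rewritten, the resynchronization cost follows the number of distinct blocks touched during the outage, not the number of writes.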

REGRESSION TESTS

Synthetic Workloads to stress test the Replication Pipeline

Failovers every 90 minutes

Discovered some interesting corner cases

Page-table corruptions in memory checkpoints

Write-after-write I/O ordering in disk replication

SECONDSITE – THE COMPLETE PICTURE

• Service Downtime includes timeout for failure detection (10s)

• Failure Detection Timeout is configurable

4 VMs x 100 Clients/VM

REPLICATION BANDWIDTH CONSUMPTION

4 VMs x 100 Clients/VM

DEMO

Expect a real disaster (conference demos are not a good idea!)

APPLICATION THROUGHPUT VS. REPLICATION LATENCY

SPECWeb w/ 100 Clients

Kamloops

RESOURCE UTILIZATION VS. APPLICATION LOAD

Domain-0 CPU Utilization

Bandwidth usage on Replication Channel

Cost of HA as a function of Application Load (OLTP w/ 100 Clients)

RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD

OLTP Workload

SETUP WORKFLOW – RECOVERY SITE

The user creates a recovery plan, which is associated with one or more protection groups.

Source: VMware Site Recovery Manager – Technical Overview

RECOVERY PLAN

[Figure: recovery plan steps. Figure labels: VM Shutdown; High Priority VM Recovery; Prepare Storage; High Priority VM Shutdown; Normal Priority VM Recovery; Low Priority VM Recovery.]

Source: VMware Site Recovery Manager – Technical Overview