zero-downtime datacenter failovers · 2020. 4. 21. · zero-downtime datacenter failovers...

Post on 01-Jan-2021

12 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ZERO-DOWNTIME DATACENTER FAILOVERS(SWITCHING HOSTING PROVIDERS FOR DUMMIES)

1 — Luka Kladaric @ AWS Adria 2017.

WHO?Luka Kladaric

formerly a web developer for >10 years

now: freelancing, consulting, architecting, securing

2 — Luka Kladaric @ AWS Adria 2017.

migrating an entire company's infrastructure

from Rackspace to Amazon AWS

3 — Luka Kladaric @ AWS Adria 2017.

60 virtual machines

3 baremetal boxes (db)

assorted networking equipment

4 — Luka Kladaric @ AWS Adria 2017.

the migration took 2 months to execute

but a year and a half to prepare

5 — Luka Kladaric @ AWS Adria 2017.

FOUND STATE6 — Luka Kladaric @ AWS Adria 2017.

hand-crafted build server, unreproducible

7 — Luka Kladaric @ AWS Adria 2017.

half the servers are not deployable from scratch

or their deployability is unknown

8 — Luka Kladaric @ AWS Adria 2017.

same mysql account used by everyone everywhere

9 — Luka Kladaric @ AWS Adria 2017.

that mysql account is "root"

10 — Luka Kladaric @ AWS Adria 2017.

that mysql db is 1.5 TB big

11 — Luka Kladaric @ AWS Adria 2017.

no access to LB config

has a bunch of magic in it

changes often result in issues and outages

12 — Luka Kladaric @ AWS Adria 2017.

no server metrics / perfdata

no idea if overprovisioned and by how much

13 — Luka Kladaric @ AWS Adria 2017.

no access to disaster recovery instancein case the primary DC went down

(access goes through primary DC)

14 — Luka Kladaric @ AWS Adria 2017.

RACKSPACE WAS REALLY TERRIBLEa constant pain to deal with

unexpected outages of never explained causes

unresponsive support team

zero flexibility

15 — Luka Kladaric @ AWS Adria 2017.

HOW LONG WOULD IT TAKE TO MIGRATE THIS?optimistically: 3 months

conservatively: 6-9 months

realistically: a year

16 — Luka Kladaric @ AWS Adria 2017.

NO LEADERSHIP BUY-IN2 failed attempts to get approval

Infrastructure team makes a pact"Do Things The Right Way From Now On"

mask cleanup work with ongoing maintenance

17 — Luka Kladaric @ AWS Adria 2017.

A YEAR AND A HALF LATER...

majority of the issues were fixed

or at least significantly improved

18 — Luka Kladaric @ AWS Adria 2017.

PLOT TWISTRACKSPACE STARTS FALLING APART

19 — Luka Kladaric @ AWS Adria 2017.

New estimate: 19 man-days

(after final push for preparation)

20 — Luka Kladaric @ AWS Adria 2017.

SAVINGS ESTIMATE

$18k -> $6k

that's -66%

21 — Luka Kladaric @ AWS Adria 2017.

GOT APPROVAL!22 — Luka Kladaric @ AWS Adria 2017.

Actually executed in 25-30 man-days

over 2 months

23 — Luka Kladaric @ AWS Adria 2017.

HOW?24 — Luka Kladaric @ AWS Adria 2017.

"upgrading the fleet to Ubuntu 16.04"

all servers rebuilt and redeployed with Ansible

25 — Luka Kladaric @ AWS Adria 2017.

build server rebuilt from scratch

deployed from Ansible

all build jobs defined in code

no more tweaking jobs through UI

26 — Luka Kladaric @ AWS Adria 2017.

CloudFlare implemented for faster DNS failover

27 — Luka Kladaric @ AWS Adria 2017.

all LB logic slowly moved to our own haproxies

haproxy configuration auto-generated from Ansible

makes it easy to shuffle things around

28 — Luka Kladaric @ AWS Adria 2017.

all apps slowly migrated to be served through haproxies

avoiding Rackspace LB magic

29 — Luka Kladaric @ AWS Adria 2017.

VPN bridge between DCs~20 MB/s, ~20ms ping

good enough to treat as a "local" connectionfor shorter periods of time

30 — Luka Kladaric @ AWS Adria 2017.

mysql master-master replication between DCs

31 — Luka Kladaric @ AWS Adria 2017.

app servers in both DCs

32 — Luka Kladaric @ AWS Adria 2017.

haproxies in both DCs

aware of app servers in both DCsbut preferring local ones

"no request left behind"

33 — Luka Kladaric @ AWS Adria 2017.

failover with DNS at CloudFlare near-instantly

but even stray requests get handled

34 — Luka Kladaric @ AWS Adria 2017.

metrics, metrics, metrics

(Datadog ftw)

35 — Luka Kladaric @ AWS Adria 2017.

RESULTS36 — Luka Kladaric @ AWS Adria 2017.

core production migrated in days

internal tools migrated within a week or two

developer tools migrated within a month(git hosting, build server, etc)

obscure legacy services migrated within 2 months

37 — Luka Kladaric @ AWS Adria 2017.

all hardware at Rackspacedecomissioned within 3 months

38 — Luka Kladaric @ AWS Adria 2017.

sideffect: actual HA instead of fake HA

old "two or more of everything" approachtranslated well into Availability Zones

39 — Luka Kladaric @ AWS Adria 2017.

AND IT WAS GOOD40 — Luka Kladaric @ AWS Adria 2017.

41 — Luka Kladaric @ AWS Adria 2017.

QUESTIONS?42 — Luka Kladaric @ AWS Adria 2017.

THANK YOU!Luka Kladarichello@luka.io

@kll43 — Luka Kladaric @ AWS Adria 2017.

top related