ZERO-DOWNTIME DATACENTER FAILOVERS (SWITCHING HOSTING PROVIDERS FOR DUMMIES) 1 — Luka Kladaric @ AWS Adria 2017.

WHO?

Luka Kladaric

formerly a web developer for >10 years

now: freelancing, consulting, architecting, securing

migrating an entire company's infrastructure

from Rackspace to Amazon AWS

60 virtual machines

3 bare-metal boxes (db)

assorted networking equipment

the migration took 2 months to execute

but a year and a half to prepare

FOUND STATE

hand-crafted build server, unreproducible

half the servers are not deployable from scratch

or their deployability is unknown

same mysql account used by everyone everywhere

that mysql account is "root"

that mysql db is 1.5 TB big

no access to LB config

has a bunch of magic in it

changes often result in issues and outages

no server metrics / perfdata

no idea if overprovisioned and by how much

no access to disaster recovery instance in case the primary DC went down

(access goes through primary DC)

RACKSPACE WAS REALLY TERRIBLE

a constant pain to deal with

unexpected outages whose causes were never explained

unresponsive support team

zero flexibility

HOW LONG WOULD IT TAKE TO MIGRATE THIS?

optimistically: 3 months

conservatively: 6-9 months

realistically: a year

NO LEADERSHIP BUY-IN

2 failed attempts to get approval

Infrastructure team makes a pact: "Do Things The Right Way From Now On"

mask cleanup work as ongoing maintenance

A YEAR AND A HALF LATER...

majority of the issues were fixed

or at least significantly improved

PLOT TWIST

RACKSPACE STARTS FALLING APART

New estimate: 19 man-days

(after final push for preparation)

SAVINGS ESTIMATE

$18k -> $6k

that's -66%
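
The percentage on this slide is the relative change between the two cost figures. A minimal sketch of the arithmetic follows; the function name is made up for illustration, and the slides do not state the billing period the figures refer to. The exact value is about -66.7%, which the slide shows as -66%.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Figures from the slide: $18k before, $6k after.
change = percent_change(18_000, 6_000)
print(round(change, 1))  # -66.7
```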

GOT APPROVAL!

Actually executed in 25-30 man-days

over 2 months

HOW?

"upgrading the fleet to Ubuntu 16.04"

all servers rebuilt and redeployed with Ansible

build server rebuilt from scratch

deployed from Ansible

all build jobs defined in code

no more tweaking jobs through UI

CloudFlare implemented for faster DNS failover

all LB logic slowly moved to our own haproxies

haproxy configuration auto-generated from Ansible

makes it easy to shuffle things around
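
The slides do not show the Ansible templates themselves. As a rough sketch of the idea, the snippet below renders haproxy backend sections from a single inventory structure, with plain Python standing in for the templating step. The inventory contents, server names, addresses, and function names are all invented for illustration.

```python
# Hypothetical inventory: app name -> list of (server name, address) pairs.
INVENTORY = {
    "webapp": [("web1", "10.0.1.11:8080"), ("web2", "10.0.1.12:8080")],
    "api": [("api1", "10.0.2.11:9000")],
}

def render_backend(app, servers):
    """Render one haproxy backend section with health-checked servers."""
    lines = [f"backend {app}", "    balance roundrobin"]
    for name, address in servers:
        lines.append(f"    server {name} {address} check")
    return "\n".join(lines)

def render_config(inventory):
    """Render all backends in a stable (sorted) order."""
    sections = [render_backend(app, inventory[app]) for app in sorted(inventory)]
    return "\n\n".join(sections) + "\n"

if __name__ == "__main__":
    print(render_config(INVENTORY), end="")
```

With the configuration derived from one data structure, moving an app to different servers becomes an inventory edit followed by a regenerate-and-reload, rather than a hand edit of the load balancer.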

all apps slowly migrated to be served through haproxies

avoiding Rackspace LB magic

VPN bridge between DCs

~20 MB/s, ~20 ms ping

good enough to treat as a "local" connection for shorter periods of time

mysql master-master replication between DCs

app servers in both DCs

haproxies in both DCs

aware of app servers in both DCs but preferring local ones

"no request left behind"
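
The slides do not say how the local preference was configured; in haproxy, one common way is to mark remote servers with the `backup` keyword. The sketch below only models the routing rule in plain Python, under that assumption: use healthy servers in the local DC when any exist, otherwise fall back to healthy servers in the other DC. The `Server` class, the DC labels, and the function names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    name: str
    dc: str            # hypothetical DC label, e.g. "rackspace" or "aws"
    healthy: bool = True

def candidates(servers, local_dc):
    """Healthy local servers if any exist, otherwise all healthy servers."""
    healthy = [s for s in servers if s.healthy]
    local = [s for s in healthy if s.dc == local_dc]
    return local if local else healthy

def pick_server(servers, local_dc, rng=random):
    """Pick a server for one request, preferring the local DC."""
    pool = candidates(servers, local_dc)
    if not pool:
        raise RuntimeError("no healthy app servers in any DC")
    return rng.choice(pool)
```

Under this rule a request that lands on the "wrong" DC after a DNS switch is still served: the proxy there forwards it across the VPN bridge when no local app server is available.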

DNS failover at CloudFlare is near-instant

but even stray requests get handled

metrics, metrics, metrics

(Datadog ftw)

RESULTS

core production migrated in days

internal tools migrated within a week or two

developer tools migrated within a month (git hosting, build server, etc.)

obscure legacy services migrated within 2 months

all hardware at Rackspace decommissioned within 3 months

side effect: actual HA instead of fake HA

old "two or more of everything" approach translated well into Availability Zones

AND IT WAS GOOD

QUESTIONS?

THANK YOU!

Luka [email protected]

@kll