aws re:invent 2016: moving mission critical apps from one region to multi-region active/active...

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Alexander Filipchik Principal Engineer, Sony Interactive Entertainment Dustin Pham Principal Engineer, Sony Interactive Entertainment David Green Enterprise Solutions Architect, Amazon Web Services Moving Mission-Critical Apps from One Region to Multi-Region active/active November 30, 2016 ARC309


TRANSCRIPT

Page 1: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Alexander Filipchik – Principal Engineer, Sony Interactive Entertainment

Dustin Pham – Principal Engineer, Sony Interactive Entertainment

David Green – Enterprise Solutions Architect, Amazon Web Services

Moving Mission-Critical Apps from One

Region to Multi-Region active/active

November 30, 2016

ARC309

Page 2: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Thank you

Page 3: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What to expect from the session

• Architecture Background

• AWS global infrastructure

• Single vs Multi-Region?

• Multi-Region AWS Services

• Case Study: Sony’s Multi-Region Active/Active Journey

• Design approach

• Lessons learned

• Migrating without downtime

Page 4: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

AWS Global Infrastructure

Page 5: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

AWS worldwide locations

Region (14)

Coming Soon (4)

Page 6: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

AWS worldwide locations

Page 7: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Region topology

Page 8: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Region topology

[Diagram labels: Transit (×2), AZ (×5)]

Page 9: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Availability Zone

[Diagram labels: Transit (×2), AZ (×5)]

Page 10: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Availability Zone

[Diagram labels: Transit (×2), AZ (×5)]

Page 11: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Single region high-availability approach

• Leverage multiple Availability Zones (AZs)

Availability Zone A Availability Zone B Availability Zone C

us-east-1

Page 12: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Reminder: Region-wide AWS services

• Amazon Simple Storage Service (Amazon S3)

• Amazon Elastic File System (Amazon EFS)

• Amazon Relational Database Service (Amazon RDS)

• Amazon DynamoDB

• And many more…

Page 13: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

OK … should I use Multi-Region?

Page 14: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Good Reasons for Multi-Region

• Lower latency to a subset of customers

• Legal and regulatory compliance (e.g., data sovereignty)

• Satisfy disaster recovery requirements

Page 15: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

AWS Multi-Region services

Page 16: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Multi-Region services

• Amazon Route 53 (Managed DNS)

• S3 with cross-region replication

• RDS multi-region database replication

• EBS snapshots

• AMI

• And many more…

Page 17: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Amazon Route 53

• Health checks

• Send traffic to healthy infrastructure

• Latency-based routing

• Geo DNS

• Weighted Round Robin

• Global footprint via 60+ POPs

• Supports AWS and non-AWS resources

Page 18: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Example: Weighted with failover

[Diagram labels: example.net; prod.examp.net with weighted records prod-1 (95%) and prod-2 (5%), each with a health check; failover target examp-fail.s3-website]
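The weighted-with-failover behavior in this example can be sketched in a few lines. This is only a local simulation of the decision logic, not the Route 53 API: the function name `pick_endpoint` and the record dictionaries are hypothetical, and the names `prod-1`, `prod-2`, and `examp-fail.s3-website` are taken from the slide labels.

```python
import random


def pick_endpoint(records, failover_target, rng=random):
    """Pick a record by weight among healthy ones; fall back when none is healthy.

    `records` is a list of dicts with hypothetical keys: name, weight, healthy.
    """
    healthy = [r for r in records if r["healthy"] and r["weight"] > 0]
    if not healthy:
        # Every weighted record failed its health check: use the failover target.
        return failover_target
    point = rng.random() * sum(r["weight"] for r in healthy)
    running = 0.0
    for record in healthy:
        running += record["weight"]
        if point < running:
            return record["name"]
    return healthy[-1]["name"]


records = [
    {"name": "prod-1", "weight": 95, "healthy": True},
    {"name": "prod-2", "weight": 5, "healthy": True},
]
```

With both records healthy, roughly 95% of picks land on `prod-1`. If `prod-1` is marked unhealthy, all traffic goes to `prod-2`, and if both are unhealthy the failover target is returned.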

Page 19: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

S3 – cross-region replication

Automated, fast, and reliable asynchronous replication of data across AWS regions

• Only replicates new PUTs. Once S3 is configured, all new uploads into a source bucket will be replicated

• Entire bucket or prefix based

• 1:1 replication between any 2 regions / storage classes

• Transition S3 ownership from primary account to sub-account

Use cases:

• Compliance—store data hundreds of miles apart

• Lower latency—distribute data to regional customers

• Security—create remote replicas managed by separate AWS accounts

[Diagram: Source (Virginia) → Destination (Oregon)]

Page 20: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

RDS cross-region replication

• Move data closer to customers

• Satisfy disaster recovery requirements

• Relieve pressure on database master

• Promote read-replica to master

• AWS managed service

Page 21: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

RDS cross-region replication

Page 22: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Leverage existing resources

Page 23: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Many resources exist

AWS Reference Architecture Implementation Guides

Page 24: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What to expect from the session

• Architecture Background

• AWS global infrastructure

• Single vs Multi-Region?

• Enabling AWS services

• Case Study: Sony Multi-Region Active/Active

• Design approach

• Lessons learned

• Migrating without downtime

Page 25: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)
Page 26: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Who is talking?

Alexander Filipchik (PSN: LaserToy), Principal Software Engineer at Sony Interactive Entertainment

Dustin Pham, Principal Software Engineer at Sony Interactive Entertainment

Page 27: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Our active/active story

Page 28: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Small team, large responsibility

• Service team ran like a startup

• Fewer than 10 core people working on new PS3 store services

• PSN’s user base already numbered in the hundreds of millions

• Relied on quick iterations of architecture on AWS

Page 29: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Social

Page 30: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Video

Page 31: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Commerce

Page 32: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

THE YEAR OF VR

Multiple new virtual reality platform launches of varying experience level

Cardboard

Page 33: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Transforming the store

Page 34: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Delivered new store

• Great job, now onto the PS4

• PS4 launch – 1 million users at once on Day 1, Hour 1

• Designing for many different use cases at scale

Page 35: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Architecture phases

Proof of Concept → Scale → Optimize → Make Highly Available

SF Bay

Page 36: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Next step: make highly available

• Highly available for us: multiregional active/active

• Raising key questions:

• How does one move a large set of critical apps with hundreds of terabytes of live data?

• How do we architect every aspect to allow for multiregional active/active?

• How do we turn on active/active without user impact?

• User impact includes hardware (PS3, PS4, etc.) and game partners!

• Where do we even begin?

Page 37: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Starting with applications

Page 38: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Applications

• First question to answer: What does it mean to be multiregional?

• Different people had different answers:

• Active/stand-by vs. active/active

• Full data replication vs. partial

• Automatic failover vs. manual

• Etc.

Page 39: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

After some healthy discussions

Page 40: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Agreement

• “You should be able to lose 1 of anything” approach.

• Which means we should be able to survive, without any visible impact, the loss of:

• 1 server

• 1 Availability Zone

• 1 region

Page 41: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Starting with uncertainty

• Multiple macro and micro services

• Stateless and stateful services

• They depend on multiple technologies

• Some are multiregional and some are not

• Documentation was, as always, out of date

Page 42: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Inventory of dependencies

[Bar chart: % of applications (y-axis, 0 to 100) per technology (x-axis, Tech)]

Page 43: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What is multiregional by design?

With some customizations

Page 44: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Stages of grief

• Denial – can’t be true, let’s check again

• Anger – we told everyone to be active/active ready!!!

• Bargaining – active/stand-by?

• Depression – we can’t do it

• Acceptance – let’s work to fix it, we have 6 months…

Page 45: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What it tells us

• We can’t just put things in two regions and expect them

to work

• We will need to do some work to:

• Migrate services to technology which is multiregional by

design

• Somehow make underlying technology multiregional

Page 46: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Scheduling/optimization problem

• There is work to be done on both the application and infrastructure sides

• We need to schedule it so we can get results faster and minimize waits

• And we wanted a machine to help us

Page 47: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

The world’s leading graph database

That can store a graph of 30B nodes

Here to help us to deal with our problem

Page 48: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Why Neo4j

• It is a graph engine, and we are dealing with a graph

• Query language that is very powerful

• Can be populated programmatically

• Can show us something we didn’t expect

Page 49: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

How to use it?

• Model

• Identify nodes and relations

• Tracing

• Code analyzer

• Talking to people

• Generate the graph

• Run queries

Page 50: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Model example

• Nodes

• Users

• Technology: (Cassandra, Redis)

• multiregional: true/false

• Service (applications)

• stateless: true/false

• Edges

• Usage patterns (read, write)
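The model above can be illustrated with a plain-Python stand-in. This is not Neo4j or Cypher, just the same nodes and edges as in-memory objects. The service names (`catalog`, `cart`) and the `multiregional` flags assigned to Cassandra and Redis are hypothetical sample data, not the deck's actual inventory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Technology:
    name: str
    multiregional: bool  # node property from the model


@dataclass
class Service:
    name: str
    stateless: bool  # node property from the model
    uses: list = field(default_factory=list)  # edges: (Technology, "read" | "write")


def ready_to_move(services):
    """Services whose every dependency is multiregional by design."""
    return [s.name for s in services if all(t.multiregional for t, _ in s.uses)]


def blockers(services):
    """Map each non-multiregional technology to the services that depend on it."""
    blocked = {}
    for service in services:
        for tech, _usage in service.uses:
            if not tech.multiregional:
                blocked.setdefault(tech.name, []).append(service.name)
    return blocked


# Hypothetical sample data for illustration only.
cassandra = Technology("Cassandra", multiregional=True)
redis = Technology("Redis", multiregional=False)
services = [
    Service("catalog", stateless=True, uses=[(cassandra, "read")]),
    Service("cart", stateless=False, uses=[(cassandra, "write"), (redis, "write")]),
]
```

Here `ready_to_move(services)` lists only `catalog`, while `blockers(services)` shows that Redis is what holds `cart` back. In a graph database the same questions would be expressed as queries over the stored nodes and edges.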

Page 51: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Graph definition example

Page 52: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Graph example

Can be enriched with:

• Load balancers

• Security groups

• VPCs

• NATs

• Etc.

Page 53: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Ours looked more like

Page 54: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

And running some Neo4j magic

This one is important

Page 55: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Shows you what is ready to go

Page 56: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What to do next

• Validate multiregional technologies do actually work

• Figure out what to do with non-multiregional technologies

• Move services in the following order:

Page 57: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Validating our main DB (Cassandra)

A lot of unknowns:

• Will it work?

• Will performance degrade?

• How eventual is multiregional eventual consistency?

• Will we hit any roadblocks?

• Well, how many roadblocks will we hit?

Page 58: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

What did we know?

Netflix is doing it on AWS, and they actually tested it

They wrote 1M records in one region of a multiregion cluster

500 ms later, reads were initiated in the other clusters

All records were successfully read

Page 59: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Well…

Some questions to answer:

Should we just trust Netflix’s results, replicate the data, and see what happens?

Is their experiment applicable to our situation?

Can we do better?

[Cartoon: "How to get an engineer's attention": break something; free coffee; say, "there's gotta be a better way to do this"]

Page 60: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Cassandra validation strategy

• Use production load/data

• Simulate disruptions

• Track replication latencies

• Track lost mutations

• Cassandra modifications were required
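Tracking replication latencies and lost mutations, as listed above, could be reported along these lines. This is a minimal sketch under stated assumptions: it takes two hypothetical timestamp maps, one per region, and does not reflect the Cassandra modifications the speakers actually made. The metric names mirror the Pct95, Pct99, Pct999, and MaxLag series that appear in the deck's results chart.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def replication_report(written, observed):
    """Summarize cross-region replication lag.

    written:  {mutation_id: write timestamp in ms, taken in the source region}
    observed: {mutation_id: first-seen timestamp in ms, taken in the remote region}
    Both inputs are hypothetical; clocks are assumed to be synchronized.
    """
    lags = sorted(observed[m] - ts for m, ts in written.items() if m in observed)
    lost = sorted(m for m in written if m not in observed)
    return {
        "pct95": percentile(lags, 95),
        "pct99": percentile(lags, 99),
        "pct999": percentile(lags, 99.9),
        "max_lag": lags[-1],
        "lost": lost,
    }
```

A mutation counts as lost here simply because it never showed up in the remote region during the observation window, so the window has to be long enough to cover recovery after a simulated disruption.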

Page 61: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Preparation

Exporter

Region 1

Region 2

Ingester

Ingester

Page 62: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Test

Read/Write

Loader

Region 1

Read/Write

Loader

Region 2

Page 63: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Analysis

Page 64: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Sample results (usw1-usw2)

[Chart: two-DC connection cut-off and recovery; latency on a logarithmic scale (y-axis ticks from 1 to 10,000,000); x-axis time ticks from 617 14 to 619 14; series: Pct95, Pct99, Pct999, MaxLag]

Page 65: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Things that are not multiregional by design

We gave teams 2 options:

• Redesign if it is critical to the user’s experience

• If not in the critical path (batch jobs)

• active/passive

• master/slave

• Use Kafka as a replication backbone (recommended)

Page 66: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Solr example (pre active/active)

Indexer

Master

App1

App2

Replicator

Replicator

Read Replicas

Read Replicas

Page 67: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Solr example (easy active/active)

[Diagram labels: Indexer; Master; in each of Region 1 and Region 2: Replicators, Read Replicas, Apps]

Page 68: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Solr example (Kafka active/active)

[Diagram labels, in each of Region 1 and Region 2: Indexer, Solr Indexer, Read Replicas, Apps]
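One plausible reading of the Kafka-based design is that every region runs its own indexer, and each indexer consumes the same ordered stream of changes to build a local index. The sketch below illustrates that idea with an in-memory log standing in for a replicated Kafka topic. `ReplicatedLog` and `RegionalIndexer` are hypothetical names, and no real Kafka or Solr API is used.

```python
class ReplicatedLog:
    """In-memory stand-in for a topic that every region can read in full."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset):
        return list(enumerate(self._events[offset:], start=offset))


class RegionalIndexer:
    """Builds a region-local index by replaying the shared log in order."""

    def __init__(self, region, log):
        self.region = region
        self.log = log
        self.offset = 0  # next offset to consume
        self.index = {}

    def poll(self):
        for offset, (doc_id, doc) in self.log.read_from(self.offset):
            if doc is None:
                self.index.pop(doc_id, None)  # a None payload means delete
            else:
                self.index[doc_id] = doc
            self.offset = offset + 1


log = ReplicatedLog()
region1 = RegionalIndexer("region-1", log)
region2 = RegionalIndexer("region-2", log)
log.append(("game-1", {"title": "A"}))
log.append(("game-2", {"title": "B"}))
log.append(("game-1", None))
region1.poll()
region2.poll()
```

Because both indexers apply the same events in the same order, their indexes converge even if one region falls behind and catches up later. No region depends on a master located in another region.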

Page 69: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Are we missing anything?

Yes, infrastructure

Page 70: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Decompose and recompose

Page 71: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Breaking up the system into moveable parts

App + caching tier

Data tier

Inbound tier

Outbound tier

Clients

Page 72: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 1: Infrastructure

Private Subnet

Public Subnet

ELBs

Inbound tier

Outbound tier

Infrastructure to build/move:

• VPCs

• Subnets

• ACLs

• ELBs

• IGW

• NAT

• Egress

Page 73: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 1: Infrastructure key points

• Building infrastructure in a new region must be fully automated (Infrastructure as Code)

• Regional communication decisions

• VPNs?

• Over Internet?

• Do infrastructures have to match exactly?

• 1st region evolved organically

• 2nd region should be the blueprint for all new region DCs

Page 74: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 2: Data

Public subnet

ELBs

Data tier

Inbound tier

Outbound tier

Page 75: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 2: Data option 1 replication over VPN

Public Subnet

ELBs

Data tier

Inbound tier

Outbound tier

Region 2

VPN

Page 76: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 2: Data option 1 replication over VPN

• Pros

• Setting up VPN with current network architecture would be easier on data tier

• Secure

• Managing data nodes intercommunication is straightforward and has lower operational overhead

• Cons

• Limit on throughput

• Data set is large and can quickly saturate VPN

• Scaling more applications in future will be complicated!

Page 77: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 2: Data option 2 replication over ENIs with public IPs

Private subnet

Public subnet

ELBs

Data tier

Inbound tier

Outbound tier

Region 2

SSL

SSL

Page 78: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 2: Data option 2 replication over ENIs with public IPs

• Pros

• Not network constrained

• Able to add more applications + data without needing to build new infrastructure to support them

• Cons

• Operationally, more orchestration (Cassandra, for example, needs to know other nodes’ Elastic IPs)

• Internode data transfer security is a must

Page 79: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 3: App tier + cache strategy

Outbound Tier

Region 2

Page 80: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 3: App tier + cache strategy

• Applications communicate within a region only

• Applications do not call another region’s databases, caches, or applications

• Isolation creates predictable failure cases and clearly defines failure domains

• Monitoring and alerting are greatly simplified in this model

Page 81: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 4: Client routing

Region 1 Region 2

DNS

Page 82: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Phase 4: Client routing

• Predictable “sticky” routing via geo routing to avoid user bounce

• Data replication manages cross-region state

• Allows for routing to stateless services

• Ability to do percentage-based routing to manage different failure scenarios
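The combination of sticky geo routing and percentage-based shifting can be sketched as a deterministic function of the user. In the deck this decision is made at the DNS layer, so the code below is only an application-level illustration of the logic. The mapping `GEO_HOME`, the region names, and the `shift` parameter are hypothetical.

```python
import hashlib

# Hypothetical mapping from a coarse client geography to its home region.
GEO_HOME = {"NA": "region-1", "EU": "region-2"}


def bucket(user_id):
    """Stable bucket in [0, 100) derived from the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id, geo, shift=None):
    """Return the region for a user.

    shift: optional {home_region: (target_region, percent)} used to drain a
    share of a region's users elsewhere, for example during a partial failure.
    """
    home = GEO_HOME.get(geo, "region-1")
    target, percent = (shift or {}).get(home, (home, 0))
    return target if bucket(user_id) < percent else home
```

Because the bucket depends only on the user id, a given user keeps landing in the same region while the shift percentage is unchanged, which avoids bouncing users between regions. Raising the percentage to 100 drains the home region entirely.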

Page 83: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Putting it all together

Page 84: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Software design for multiregion deployments

• Typical software architecture

APIs

Business Logic

Data Access

Cross

Cutting

Config

Page 85: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Software design for multiregion deployments

Region 1 Region 2

Remember when we said application tier call patterns should be isolated within a region? How do we achieve this simply?

Page 86: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Software configuration approaches

• An application config to connect to a database could look like:

cassandra.seeds=10.0.1.16,10.0.1.17

• A naïve approach would be for an application to have multiple configs per deployable, depending on its region:

cassandra.seeds.region1=10.0.1.16,10.0.1.17

cassandra.seeds.region2=10.0.2.16,10.0.2.16

• This, of course, results in an app config management nightmare, especially now with 2 regions

Page 87: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Software configuration approaches

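• What if we implemented a basic “central” way of configuration?

[Diagram, Region x: the application config uses short names (cassandra.seeds=cass-seed1, cass-seed2); the application asks "Where are my C* seeds?" and a local DB in the same region answers "IPs are x.x.x.x" (cass-seed1 resolves to x.x.x.x)]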

Page 88: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Simplified software configuration (context)

• Context is made available to application which contains:

• Data Center/region

• Endpoint short-name resolution

• Environment (Dev, QA, Prod, A/B)

• Database connection details

• Context is the responsibility of the infrastructure itself and is provided through build automation, AWS tagging, etc.

• App is responsible for behaving correctly based on its context
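A context object along these lines could look like the sketch below. Everything here is an assumption for illustration: the `Context` class, the `load_context` helper, the tag keys, and the registry layout are hypothetical, and the region 2 addresses are sample values. The point is that the application config holds only short names such as `cass-seed1`, and the region-local context turns them into addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    region: str
    environment: str
    endpoints: dict  # short name -> address inside this region only

    def resolve(self, short_name):
        return self.endpoints[short_name]

    def resolve_list(self, value):
        """Resolve a comma-separated config value such as 'cass-seed1, cass-seed2'."""
        return [self.resolve(name.strip()) for name in value.split(",")]


def load_context(tags, registry):
    """Build the context from infrastructure-provided tags.

    tags:     hypothetical instance tags, e.g. {"region": ..., "environment": ...}
    registry: hypothetical lookup {(region, environment): {short_name: address}}
    """
    region, environment = tags["region"], tags["environment"]
    return Context(region, environment, dict(registry[(region, environment)]))


# Sample registry; addresses are illustrative.
REGISTRY = {
    ("region1", "prod"): {"cass-seed1": "10.0.1.16", "cass-seed2": "10.0.1.17"},
    ("region2", "prod"): {"cass-seed1": "10.0.2.16", "cass-seed2": "10.0.2.17"},
}
```

The same deployable with the same `cassandra.seeds=cass-seed1, cass-seed2` line then connects to different nodes depending on where it runs, and it can never resolve a name to another region's database.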

Page 89: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Infrastructure as code

• New regions must be built through automation

• Specification of services to Terraform

• An internal tool and DSL were built to manage domain-specific needs

• Example:

• Specify an app requires Cassandra and SNS

• Generates Terraform to create security groups for ports 9160, 7199-7999, build SNS, build ELB for app, etc.
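The internal tool and DSL are not shown in the deck, so the following is only a guess at the shape of such a generator. It expands a short app specification into a dictionary loosely modeled on Terraform's JSON syntax. The function `render`, the `SERVICE_PORTS` table, and the resource naming are hypothetical, and the output is deliberately incomplete and not validated against the AWS provider schema.

```python
import json

# Port ranges named on the slide for Cassandra; the table itself is hypothetical.
SERVICE_PORTS = {"cassandra": [(9160, 9160), (7199, 7999)]}


def render(app, requires, cidr):
    """Expand an app spec into Terraform-like resources (illustrative only)."""
    resources = {}
    for dependency in requires:
        rules = [
            {"from_port": low, "to_port": high, "protocol": "tcp", "cidr_blocks": [cidr]}
            for low, high in SERVICE_PORTS.get(dependency, [])
        ]
        if rules:
            group = {"name": f"{app}-{dependency}", "ingress": rules}
            resources.setdefault("aws_security_group", {})[f"{app}_{dependency}"] = group
        if dependency == "sns":
            topic = {"name": f"{app}-events"}
            resources.setdefault("aws_sns_topic", {})[f"{app}_events"] = topic
    # Every app gets a load balancer entry; listeners and subnets are omitted here.
    resources.setdefault("aws_elb", {})[app] = {"name": f"{app}-elb"}
    return {"resource": resources}


document = render("store", ["cassandra", "sns"], "10.0.0.0/16")
print(json.dumps(document, indent=2))
```

The value of this approach is that the app team states only what it needs (Cassandra, SNS), and the network rules that follow from that choice are generated the same way in every region.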

Page 90: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Database automation

• Ansible runs assist in building Cassandra in the public subnet and associating EIPs with every new node

• Manages network rules (whitelisting)

• Manages certificates and SSL

[Diagram labels: Region 2; Private Subnet; Public Subnet; ELBs; Outbound Tier; SSL (×2)]

Page 91: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Monitoring multiregional deployments

Page 92: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Monitoring through proper tagging

• Part of the “Context” applications are aware of is the region

• Adds “region” to any app logs

• Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice

• Cross-regional monitoring: key metrics and alerting

• Data replication (hints in Cassandra, seconds behind master in MySQL, etc.)

• Data in/out
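Adding the region to every log line can be done once, at logger setup, so application code does not have to pass it around. The sketch below uses Python's standard `logging` module. The `RegionFilter` class, the `build_logger` helper, and the log format are hypothetical choices, and the region value is assumed to come from the application's context.

```python
import io
import logging


class RegionFilter(logging.Filter):
    """Attach the deployment region to every record passing through a handler."""

    def __init__(self, region):
        super().__init__()
        self.region = region

    def filter(self, record):
        record.region = self.region
        return True  # never drop records, only annotate them


def build_logger(name, region, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("region=%(region)s level=%(levelname)s %(message)s")
    )
    handler.addFilter(RegionFilter(region))
    logger.handlers = [handler]  # replace any handlers left from earlier setup
    return logger


buffer = io.StringIO()
log = build_logger("store", "us-west-2", buffer)
log.info("order placed")
```

With a `region=` field on every line, logs and derived metrics from both regions can be shipped to one place and still be filtered or grouped by region in dashboards and alerts.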

Page 93: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Putting it all together

Region 1 Region 2

Create

infrastructure

Replicate

DNS

Page 94: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Lessons learned

Page 95: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Lessons learned

• Data synchronization is super critical, so build the dependency map from the data technologies first.

• Always run your own benchmarking.

• Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new.

• Applications must be context-driven.

• Depending on your data load, cross-regional VPNs may not make sense.

Page 96: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

PlayStation is hiring in SF:

Find us at hackitects.com

Page 97: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Thank you!

Page 98: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Remember to complete

your evaluations!

Page 99: AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

Related Sessions