AWS re:Invent 2016: Moving Mission-Critical Apps from One Region to Multi-Region Active/Active


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Alexander Filipchik – Principal Engineer, Sony Interactive Entertainment

Dustin Pham – Principal Engineer, Sony Interactive Entertainment

David Green – Enterprise Solutions Architect, Amazon Web Services

Moving Mission-Critical Apps from One Region to Multi-Region Active/Active

November 30, 2016

ARC309

Thank you

What to expect from the session

• Architecture Background

• AWS global infrastructure

• Single vs Multi-Region?

• Multi-Region AWS Services

• Case Study: Sony’s Multi-Region Active/Active Journey

• Design approach

• Lessons learned

• Migrating without downtime

AWS Global Infrastructure

AWS worldwide locations

Region (14)

Coming Soon (4)

Region topology

[Diagram, shown across several build-up slides: each region contains multiple Availability Zones (AZs) interconnected through redundant transit centers.]

Single region high-availability approach

• Leverage multiple Availability Zones (AZs)

[Diagram: Availability Zones A, B, and C within us-east-1.]

Reminder: Region-wide AWS services

• Amazon Simple Storage Service (Amazon S3)

• Amazon Elastic File System (Amazon EFS)

• Amazon Relational Database Service (Amazon RDS)

• Amazon DynamoDB

• And many more…

OK … should I use Multi-Region?

Good Reasons for Multi-Region

• Lower latency to a subset of customers

• Legal and regulatory compliance (e.g., data sovereignty)

• Satisfy disaster recovery requirements

AWS Multi-Region services

Multi-Region services

• Amazon Route 53 (managed DNS)

• S3 with cross-region replication

• RDS multi-region database replication

• EBS snapshots

• AMIs

• And many more…

Amazon Route 53

• Health checks

• Send traffic to healthy infrastructure

• Latency-based routing

• Geo DNS

• Weighted Round Robin

• Global footprint via 60+ POPs

• Supports AWS and non-AWS resources

Example: weighted with failover

[Diagram: example.net resolves 95% of queries to prod-1 and 5% to prod-2; Route 53 combines health checks with record weights, and prod.examp.net fails over to examp-fail.s3-website when the primary targets are unhealthy.]
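As a rough sketch of how a weighted-with-failover pair like this might be created with boto3 (the zone ID, record names, ELB targets, and health check IDs below are hypothetical):

    import boto3

    route53 = boto3.client("route53")

    # Route 53 skips any weighted record whose health check is failing,
    # so weights and health checks combine as on the slide.
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "prod.examp.net.",
                 "Type": "CNAME",
                 "SetIdentifier": "prod-1",
                 "Weight": 95,  # 95% of traffic
                 "TTL": 60,
                 "HealthCheckId": "11111111-aaaa-bbbb-cccc-example",
                 "ResourceRecords": [
                     {"Value": "prod-1-elb.us-east-1.elb.amazonaws.com"}]}},
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "prod.examp.net.",
                 "Type": "CNAME",
                 "SetIdentifier": "prod-2",
                 "Weight": 5,   # 5% of traffic
                 "TTL": 60,
                 "HealthCheckId": "22222222-aaaa-bbbb-cccc-example",
                 "ResourceRecords": [
                     {"Value": "prod-2-elb.us-west-2.elb.amazonaws.com"}]}},
        ]},
    )

A secondary failover record pointing at the examp-fail.s3-website endpoint can back this pair, so traffic lands on a static site if both targets go unhealthy.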

S3 – cross-region replication

Automated, fast, and reliable asynchronous replication of data across AWS regions:

• Only replicates new PUTs: once S3 is configured, all new uploads into a source bucket are replicated

• Entire bucket or prefix based

• 1:1 replication between any 2 regions / storage classes

• Can transition S3 ownership from the primary account to a sub-account

Use cases:

• Compliance—store data hundreds of miles apart

• Lower latency—distribute data to regional customers

• Security—create remote replicas managed by separate AWS accounts

[Diagram: replication from a source bucket in Virginia to a destination bucket in Oregon.]
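A minimal sketch of enabling this with boto3, assuming versioning is already enabled on both buckets; the bucket names and IAM role are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Versioning must be on for both buckets before replication works.
    s3.put_bucket_replication(
        Bucket="source-bucket-virginia",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",  # empty prefix = entire bucket
                    "Status": "Enabled",
                    "Destination": {
                        "Bucket": "arn:aws:s3:::destination-bucket-oregon",
                        "StorageClass": "STANDARD_IA",  # class may differ
                    },
                }
            ],
        },
    )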

RDS cross-region replication

• Move data closer to customers

• Satisfy disaster recovery requirements

• Relieve pressure on database master

• Promote read-replica to master

• AWS managed service
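A hedged boto3 sketch of the replicate-then-promote flow; the instance identifiers and source ARN are hypothetical:

    import boto3

    rds_west = boto3.client("rds", region_name="us-west-2")

    # Create a cross-region read replica from the us-east-1 master.
    rds_west.create_db_instance_read_replica(
        DBInstanceIdentifier="mydb-replica-west",
        SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:mydb",
        DBInstanceClass="db.r3.large",
    )

    # Later, in a DR event: promotion detaches the replica from its master
    # and makes it a standalone, writable instance.
    rds_west.promote_read_replica(DBInstanceIdentifier="mydb-replica-west")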


Leverage existing resources

Many resources exist: AWS reference architectures and implementation guides.

What to expect from the session

• Architecture Background

• AWS global infrastructure

• Single vs Multi-Region?

• Enabling AWS services

• Case Study: Sony Multi-Region Active/Active

• Design approach

• Lessons learned

• Migrating without downtime

Who is talking?

Alexander Filipchik (PSN: LaserToy), Principal Software Engineer at Sony Interactive Entertainment

Dustin Pham, Principal Software Engineer at Sony Interactive Entertainment

Our active/active story

Small team, large responsibility

• Service team ran like a startup

• Fewer than 10 core people working on the new PS3 store services

• PSN’s user base was already in the hundreds of millions of users

• Relied on quick iterations of architecture on AWS

[Slide montage: social, video, and commerce services; multiple new virtual reality platform launches of varying experience level; "the year of VR."]

Transforming the store

Delivered new store

• Great job, now onto the PS4

• PS4 launch – 1 million users at once on Day 1, Hour 1

• Designing for many different use cases at scale

Architecture phases

Proof of Concept → Scale → Make Highly Available → Optimize

Next step: make highly available

• Highly available for us: multi-region active/active

• Raising key questions:

• How does one move a large set of critical apps with hundreds of terabytes of live data?

• How do we architect every aspect to allow for multi-region active/active?

• How do we turn on active/active without user impact?

• User impact includes hardware (PS3/PS4/etc.) and game partners!

• Where do we even begin?

Starting with applications

Applications

• First question to answer: what does it mean to be multiregional?

• Different people had different answers:

• Active/stand-by vs. active/active

• Full data replication vs. partial

• Automatic failover vs. manual

• Etc.

After some healthy discussions

Agreement

• The “you should be able to lose 1 of anything” approach

• Which means we should be able to survive, without any visible impact, the loss of:

• 1 server

• 1 Availability Zone

• 1 region

Starting with uncertainty

• Multiple macro and micro services

• Stateless and stateful services

• They depend on multiple technologies

• Some are multiregional and some are not

• Documentation was, as always, out of date

Inventory of dependencies

0102030405060708090

100

Tech

% o

fapplic

ations

What is multiregional by design?

With some customizations

Stages of grief

• Denial – can’t be true, let’s check again

• Anger – we told everyone to be active/active ready!!!

• Bargaining – active/stand-by?

• Depression – we can’t do it

• Acceptance – let’s work to fix it, we have 6 months…

What it tells us

• We can’t just put things in two regions and expect them to work

• We will need to do some work to:

• Migrate services to technology that is multiregional by design

• Somehow make the underlying technology multiregional

Scheduling/optimization problem

• There is work that should be done on both the apps and the infrastructure side

• We need to schedule it so we can get results faster and minimize waits

• And we wanted a machine to help us

The world’s leading graph database

That can store a graph of 30B nodes

Here to help us deal with our problem

Why Neo4j

• It’s a graph engine, and we are dealing with a graph

• A very powerful query language

• Can be populated programmatically

• Can show us something we didn’t expect

How to use it?

• Model

• Identify nodes and relations

• Tracing

• Code analyzer

• Talking to people

• Generate the graph

• Run queries

Model example

• Nodes

• Users

• Technology (Cassandra, Redis)

• multiregional: true/false

• Service (applications)

• stateless: true/false

• Edges

• Usage patterns (read, write)

Graph definition example
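To illustrate the model above, a sketch of loading it with the Python Neo4j driver; the connection details, node names, and flags are hypothetical, not the team’s actual data:

    from neo4j import GraphDatabase

    # Hypothetical connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))

    with driver.session() as session:
        # Technology nodes carry a multiregional flag, services a stateless
        # flag; edges record the usage pattern (read/write).
        session.run(
            "MERGE (c:Technology {name: 'Cassandra', multiregional: true}) "
            "MERGE (r:Technology {name: 'Redis', multiregional: false}) "
            "MERGE (s:Service {name: 'store-api', stateless: true}) "
            "MERGE (s)-[:WRITES]->(c) "
            "MERGE (s)-[:READS]->(r)"
        )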

Graph example

Can be enriched with:

• Load balancers

• Security groups

• VPCs

• NATs

• Etc.

Ours looked more like this

And running some Neo4j magic

This one is important: it shows you which services are ready to go.
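One plausible shape for that query, as a sketch rather than the talk’s actual Cypher: stateless services with no edge to a non-multiregional technology are ready to move as-is.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))

    # Stateless services whose every dependency is multiregional by design
    # can be deployed to the second region without redesign.
    READY_TO_GO = """
    MATCH (s:Service {stateless: true})
    WHERE NOT (s)-->(:Technology {multiregional: false})
    RETURN s.name AS service
    """

    with driver.session() as session:
        for record in session.run(READY_TO_GO):
            print(record["service"])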

What to do next

• Validate that multiregional technologies actually work

• Figure out what to do with non-multiregional technologies

• Move services in the following order:

Validating our main DB (Cassandra)

A lot of unknowns:

• Will it work?

• Will performance degrade?

• How eventual is multiregional eventual consistency?

• Will we hit any roadblocks?

• Well, how many roadblocks will we hit?

What did we know?

Netflix is doing it on AWS, and they actually tested it: they wrote 1M records in one region of a multi-region cluster; 500 ms later, a read was initiated in the other regions; all records were successfully read.

Well…

Some questions to answer:

• Should we just trust Netflix’s results, replicate the data, and see what happens?

• Is their experiment applicable to our situation?

• Can we do better?

How to get an engineer's attention:

• Break something

• Free coffee

• Say, "there's gotta be a better way to do this"

Cassandra validation strategy

• Use production load/data

• Simulate disruptions

• Track replication latencies

• Track lost mutations

• Cassandra modifications were required
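The replication-latency tracking can be pictured as a probe along these lines; a sketch only, assuming a hypothetical latency_test keyspace and probes table (the real validation used production load and modified Cassandra):

    import time
    import uuid
    from cassandra.cluster import Cluster

    # Hypothetical seed IPs; one session pinned to each region.
    west1 = Cluster(["10.0.1.16"]).connect("latency_test")
    west2 = Cluster(["10.0.2.16"]).connect("latency_test")

    probe_id = uuid.uuid4()
    t0 = time.time()
    west1.execute("INSERT INTO probes (id, written_at) VALUES (%s, %s)",
                  (probe_id, t0))

    # Poll the other region until the mutation shows up, then report the lag.
    while west2.execute("SELECT id FROM probes WHERE id = %s",
                        (probe_id,)).one() is None:
        time.sleep(0.01)
    print("replication lag: %.0f ms" % ((time.time() - t0) * 1000))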

Preparation

[Diagram: an exporter copies production data, and an ingester loads it into the test cluster in each of Region 1 and Region 2.]

Test

[Diagram: read/write loaders drive traffic against the cluster in Region 1 and in Region 2.]

Analysis

Sample results (usw1-usw2)

[Chart: two-DC connection cut-off and recovery, replication latency over time on a logarithmic scale; series shown are Pct95, Pct99, Pct999, and MaxLag.]

Things that are not multiregional by design

We gave teams 2 options:

• Redesign if it is critical to the user’s experience

• If not in the critical path (batch jobs):

• active/passive

• master/slave

• Use Kafka as a replication backbone (recommended); see the sketch after the Solr examples below

Solr example (pre active/active)

[Diagram: an indexer writes to a Solr master; replicators fan out to read replicas serving App1 and App2.]

Solr example (easy active/active)

[Diagram: a single indexer and master; replicators maintain read replicas for apps in both Region 1 and Region 2.]

Solr example (Kafka active/active)

[Diagram: the indexer publishes to Kafka; a Solr indexer in each region consumes the stream and maintains local read replicas for its apps.]
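The Kafka-backbone pattern, sketched with kafka-python; the topic, broker, and helper names are hypothetical, and the topic is assumed to be mirrored between the regional Kafka clusters (e.g., with MirrorMaker):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Region 1: the indexer publishes every index mutation to a shared topic.
    producer = KafkaProducer(
        bootstrap_servers="kafka-region1:9092",
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )
    producer.send("solr-mutations", {"id": "doc-42", "title": "New game"})
    producer.flush()

    # Each region: a local consumer applies mutations to its own Solr cluster.
    consumer = KafkaConsumer(
        "solr-mutations",
        bootstrap_servers="kafka-region2:9092",
        group_id="solr-indexer-region2",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        index_into_local_solr(message.value)  # hypothetical helper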

Are we missing anything?

Yes, infrastructure

Decompose and recompose

Breaking up the system into moveable parts

[Diagram: the system decomposed into clients, an inbound tier, an app + caching tier, a data tier, and an outbound tier.]

Phase 1: Infrastructure

[Diagram: ELBs with the inbound and outbound tiers in a public subnet, and a private subnet behind them.]

Infrastructure to build/move:

• VPCs

• Subnets

• ACLs

• ELBs

• IGW

• NAT

• Egress

Phase 1: Infrastructure key points

• Building infrastructure in the new region must be fully automated (infrastructure as code)

• Regional communication decisions:

• VPNs?

• Over the Internet?

• Do the infrastructures have to match exactly?

• The 1st region evolved organically

• The 2nd region should be the blueprint for all new region DCs

Phase 2: Data

[Diagram: the data tier added alongside the ELBs and the inbound/outbound tiers.]

Phase 2: Data, option 1: replication over VPN

[Diagram: the data tiers in Region 1 and Region 2 replicate over a VPN tunnel.]

Phase 2: Data, option 1: replication over VPN

• Pros

• Setting up a VPN with the current network architecture would be easier on the data tier

• Secure

• Managing data node intercommunication is straightforward and has lower operational overhead

• Cons

• Limit on throughput

• The data set is large and can quickly saturate the VPN

• Scaling more applications in the future will be complicated!

Phase 2: Data, option 2: replication over ENIs with public IPs

[Diagram: the data tiers in Region 1 and Region 2 replicate over SSL via public IPs.]

Phase 2: Data, option 2: replication over ENIs with public IPs

• Pros

• Not network constrained

• Able to add more applications + data without needing to build new infrastructure to support them

• Cons

• Operationally, more orchestration (Cassandra, for example, needs to know the other nodes’ Elastic IPs)

• Internode data transfer security is a must
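On the Cassandra side, cross-region replication itself is keyspace configuration; a sketch, with hypothetical keyspace and data center names (the DC names must match what your snitch reports; with Ec2MultiRegionSnitch, nodes advertise their public IPs, matching the Elastic IP setup above):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.1.16"]).connect()

    # Place replicas in both data centers so each region has a full copy.
    session.execute(
        "ALTER KEYSPACE store WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'us-west-1': 3, 'us-west-2': 3}"
    )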

Phase 3: App tier + cache strategy

[Diagram: the app + caching tier stood up in Region 2, alongside the outbound tier.]

Phase 3: App tier + cache strategy

• Applications communicate within a region only

• Applications do not call another region’s databases, caches, or applications

• Isolation creates predictable failure cases and clearly defines failure domains

• Monitoring and alerting are greatly simplified in this model

Phase 4: Client routing

[Diagram: DNS routes clients across Region 1 and Region 2.]

Phase 4: Client routing

• Predictable “sticky” routing via geo-routing, to avoid users bouncing between regions

• Data replication manages cross-region state

• Allows for routing to stateless services

• Ability to do %-based routing to manage different failure scenarios
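Sticky geo-routing can be expressed as Route 53 geolocation records; a hedged boto3 sketch with hypothetical names, pinning North American users to Region 1 with a catch-all default:

    import boto3

    route53 = boto3.client("route53")

    # NA users stick to Region 1; everyone else falls through to the
    # catch-all default record pointing at Region 2.
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={"Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "store.example.net.", "Type": "CNAME",
                 "SetIdentifier": "north-america",
                 "GeoLocation": {"ContinentCode": "NA"},
                 "TTL": 60,
                 "ResourceRecords": [{"Value": "region1.example.net"}]}},
            {"Action": "UPSERT",
             "ResourceRecordSet": {
                 "Name": "store.example.net.", "Type": "CNAME",
                 "SetIdentifier": "default",
                 "GeoLocation": {"CountryCode": "*"},  # catch-all
                 "TTL": 60,
                 "ResourceRecords": [{"Value": "region2.example.net"}]}},
        ]},
    )

The %-based routing from the last bullet would use weighted records behind these names, as in the earlier Route 53 sketch.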

Putting it all together

Software design for multiregion deployments

• Typical software architecture

[Diagram: layers for APIs, business logic, and data access, with cross-cutting concerns and config alongside.]

Software design for multiregion deployments

[Diagram: the same layered stack deployed independently in Region 1 and Region 2.]

Remember when we mentioned that application-tier call patterns should be isolated within a region? How do we achieve this simply?

Software configuration approaches

• An application config to connect to a database could look like:

    cassandra.seeds=10.0.1.16,10.0.1.17

• A naïve approach would be for an application to have multiple configs per deployable depending on its region:

    cassandra.seeds.region1=10.0.1.16,10.0.1.17
    cassandra.seeds.region2=10.0.2.16,10.0.2.17

• This, of course, results in an app config management nightmare, especially now with 2 regions

Software configuration approaches

• What if we implemented a basic "central" way of configuration?

[Diagram: in each region, the app asks a local resolver "Where are my C* seeds?"; the config holds short names (cassandra.seeds=cass-seed1,cass-seed2), and cass-seed1 resolves to the region-local IP x.x.x.x.]

Simplified software configuration (context)

• Context is made available to the application and contains:

• Data center/region

• Endpoint short-name resolution

• Environment (Dev, QA, Prod, A/B)

• Database connection details

• Context is the responsibility of the infrastructure itself and is provided through build automation, AWS tagging, etc.

• The app is responsible for behaving correctly off of context (see the sketch below)
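A minimal sketch of what an application-side context might look like, assuming region and environment are injected by build automation or instance tags; all names here are hypothetical:

    import os

    class Context:
        def __init__(self):
            # Injected by the infrastructure, not hard-coded per region.
            self.region = os.environ["DEPLOY_REGION"]    # e.g. "us-west-2"
            self.environment = os.environ["DEPLOY_ENV"]  # Dev / QA / Prod / A-B

        def resolve(self, short_name):
            # Short names resolve to region-local endpoints; this naming
            # convention is made up for illustration.
            return "%s.%s.%s.internal.example.net" % (
                short_name, self.environment, self.region)

    ctx = Context()
    cassandra_seeds = [ctx.resolve("cass-seed1"), ctx.resolve("cass-seed2")]

The same deployable now behaves correctly in any region, because only the context differs.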

Infrastructure as code

• New regions must be built through automation

• Specification of services compiles down to Terraform

• An internal tool and DSL were built to manage domain-specific needs

• Example (sketched below):

• Specify that an app requires Cassandra and SNS

• Generates Terraform to create security groups for ports 9160 and 7199-7999, build the SNS topic, build the ELB for the app, etc.
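Not the internal tool itself, but a toy sketch of the idea: a tiny spec that expands into Terraform JSON (the port numbers come from the slide; everything else is hypothetical):

    import json

    # Hypothetical mini-DSL: an app declares what it needs, and a generator
    # emits Terraform JSON for the region it is being built in.
    spec = {"app": "store-api", "needs": ["cassandra", "sns"]}

    PORTS = {"cassandra": [(9160, 9160), (7199, 7999)]}

    def to_terraform(spec, region):
        ingress = [
            {"from_port": lo, "to_port": hi, "protocol": "tcp",
             "cidr_blocks": ["10.0.0.0/8"]}
            for dep in spec["needs"] for lo, hi in PORTS.get(dep, [])
        ]
        resources = {"aws_security_group": {
            spec["app"]: {"name": "%s-%s" % (spec["app"], region),
                          "ingress": ingress}}}
        if "sns" in spec["needs"]:
            resources["aws_sns_topic"] = {spec["app"]: {"name": spec["app"]}}
        return json.dumps({"resource": resources}, indent=2)

    print(to_terraform(spec, "us-west-2"))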

Database automation

• An Ansible run assists in building Cassandra in the public subnet and associating EIPs with every new node

• Manages network rules (whitelisting)

• Manages certificates and SSL

[Diagram: Cassandra nodes in the public subnet behind ELBs, replicating to Region 2 over SSL.]

Monitoring multiregional deployments

Monitoring through proper tagging

• Part of the “context” applications are aware of is the region

• Adds “region” to any app logs (see the sketch below)

• Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice

• Cross-regional monitoring of key metrics and alerting:

• Data replication (hints in Cassandra, seconds behind master in MySQL, etc.)

• Data in/out
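A sketch of stamping the region from context onto every log line with Python’s standard logging module; the logger name and example message are hypothetical:

    import logging

    class RegionFilter(logging.Filter):
        """Attach the deployment region to every log record."""
        def __init__(self, region):
            super(RegionFilter, self).__init__()
            self.region = region

        def filter(self, record):
            record.region = self.region
            return True

    logger = logging.getLogger("store-api")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s region=%(region)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(RegionFilter("us-west-2"))  # from the app's context

    logger.warning("cassandra hints accumulating")  # surfaces per region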

Putting it all together

[Diagram: create infrastructure in Region 2, replicate data from Region 1, then shift DNS across both regions.]

Lessons learned

• Data synchronization is super critical, so build the dependency map off of the data technologies first.

• Always run your own benchmarking.

• Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new.

• Applications must be context-driven.

• Depending on your data load, cross-regional VPNs may not make sense.

PlayStation is hiring in SF:

Find us at hackitects.com

Thank you!

Remember to complete your evaluations!
