high-availability infrastructure in the cloud - evan cooke - web 2.0 expo nyc 2011

twilio CLOUD COMMUNICATIONS SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD OCT 11, 2011, WEB 2.0 EVAN COOKE CO-FOUNDER & CTO

Upload: twilio

Post on 06-Dec-2014


DESCRIPTION

Designing a massively scalable, highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met using traditional sharding or scaling approaches. In this talk we first look at the challenges of running high-availability services in the cloud and then describe how we’ve separated “in-flight” and “post-flight” data into distinct datastores that can be implemented using a range of technologies.

TRANSCRIPT

Page 1: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

twilio: CLOUD COMMUNICATIONS

SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD

OCT 11, 2011, WEB 2.0

EVAN COOKE, CO-FOUNDER & CTO

Page 2: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

High-Availability: Sounds good, we need that!

Yummmm Technical Meat!

Page 3: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

High-Availability: Sounds good, we need that!

Availability = Uptime / (Uptime + Downtime)

Page 4: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

High-Availability: Sounds good, we need that!

Availability %             Downtime/yr     Downtime/mo
99.9% ("three nines")      8.76 hours      43.2 minutes
99.99% ("four nines")      52.56 minutes   4.32 minutes
99.999% ("five nines")     5.26 minutes    25.9 seconds
99.9999% ("six nines")     31.5 seconds    2.59 seconds
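The figures in the table follow directly from the availability definition. As a minimal sketch (the function name `downtime_budget` is made up for illustration, and a 30-day month is assumed because that is what reproduces the monthly column):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: the table uses a 30-day month


def downtime_budget(availability_pct: float) -> tuple[float, float]:
    """Return the allowed downtime in seconds as (per year, per month)."""
    unavailable = 1.0 - availability_pct / 100.0
    return unavailable * SECONDS_PER_YEAR, unavailable * SECONDS_PER_MONTH


for pct in (99.9, 99.99, 99.999, 99.9999):
    per_year, per_month = downtime_budget(pct)
    print(f"{pct}%: {per_year / 60:.2f} min/yr, {per_month:.2f} s/mo")
```

At five nines the yearly budget is only a few minutes, which is why the next slide argues for automation over human response.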

Page 5: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

High-Availability: Sounds good, we need that!

Availability %             Downtime/yr     Downtime/mo
99.9% ("three nines")      8.76 hours      43.2 minutes
99.99% ("four nines")      52.56 minutes   4.32 minutes
99.999% ("five nines")     5.26 minutes    25.9 seconds
99.9999% ("six nines")     31.5 seconds    2.59 seconds

Can’t rely on a human to respond within a 5-minute window! Must use automation.

Page 6: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Happens to the best

2.5 Hours Down, September 23, 2010

“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down, October 4, 2010

“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours, November 14, 2010

“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”

Page 7: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Causes of Downtime

• Lack of best practice change control
• Lack of best practice monitoring of the relevant components
• Lack of best practice requirements and procurement
• Lack of best practice operations
• Lack of best practice avoidance of network failures
• Lack of best practice avoidance of internal application failures
• Lack of best practice avoidance of external services that fail
• Lack of best practice physical environment
• Lack of best practice network redundancy
• Lack of best practice technical solution of backup
• Lack of best practice process solution of backup
• Lack of best practice physical location
• Lack of best practice infrastructure redundancy
• Lack of best practice storage architecture redundancy

E. Marcus and H. Stern, Blueprints for high availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.

Page 8: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: the causes of downtime sorted under four headings (Data Persistence, Change Control, Operations, Datacenter) and split into Cloud and Non-Cloud. Labels, in extraction order:]

• change control
• monitoring of the relevant components
• requirements
• procurement
• operations
• avoidance of internal app failures
• avoidance of external services that fail
• storage architecture redundancy
• technical solution of backup
• process solution of backup
• avoidance of network failures
• physical environment
• network redundancy
• physical location
• infrastructure redundancy

Page 9: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Happens to the best

2.5 Hours Down, September 23, 2010

“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down, October 4, 2010

“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours, November 14, 2010

“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”

[Labels added to the outages: Database, Database, Database; Change Control]

Page 10: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: the same grouping of downtime causes under Data Persistence, Change Control, Operations, and Datacenter.]

Today: Data Persistence, Change Control

lessons learned @twilio

Page 11: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011
Page 12: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Twilio provides web service APIs to automate Voice and SMS communications

[Diagram: Developer, End User, Carriers]

• Voice: Inbound Calls, Outbound Calls, Mobile/Browser VoIP

• SMS: Send To/From Phone Numbers, Short Codes

• Phone Numbers: Dynamically Buy Phone Numbers

Page 13: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

6

2009

20

2010

3

70+

2011

Page 14: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

100x Growth in Tx/Day over 1 Year

[Chart: growth over 1 year, with axis marks X, 10X, 100X]

Page 15: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

2009: 10 Servers

2010: 10’s of Servers

2011: 100’s of Servers

Page 16: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

• 100’s of prod hosts in continuous operation

• 80+ service types running in prod

• 50+ prod database servers

• Prod deployments several times/day across 7 engineering teams

2011

Page 17: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

• Frameworks

- PHP for frontend components

- Python Twisted & gevent for async network services

- Java for backend services

• Storage technology

- MySQL for core DB services

- Redis for queuing and messaging

2011

Page 18: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Data persistence is hard (especially in the cloud)

Page 19: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Data persistence is hard

Data persistence is the hardest technical problem most scalable SaaS businesses face

Page 20: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

What is data persistence?

Stuff that looks like this

Page 21: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

What is data persistence?

Databases

Queues

Files

Page 22: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: Incoming Requests → LB → Tier 1 (A, Q, A, Q) → Tier 2 (B, B, B, B) → Tier 3 (C, C, D, D), with Files, K/V, and SQL stores. Callout: Data Persistence!]

Page 23: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Why is persistence so hard?

• Difficult to change structure

- Huge inertia, e.g., large schema migrations

• Painful to recover from disk/node failures

- “just boot a new node” doesn’t work

• Woeful performance/scalability

- I/O is a huge bottleneck in modern servers (e.g. EC2)

• Freak’in complex!!!

- Atomic transactions/rollback, ACID, blah blah blah

Page 24: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Difficult to Change Structure

Id  Name   Value
1   Bob    12
2   Jane   78
3   Steve  56
...500 million rows

ALTER TABLE names DROP COLUMN Value

HOURS later...

Id  Name
1   Bob
2   Jane
3   Steve

‣ You live with data decisions for a long time

Page 25: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Painful to Recover from Failures

[Diagram: Primary DB and Secondary DB, with labels W, R, R]

Data on secondary? How much data? R/W consistency?

‣ Because of complexity, failover is a human process

Page 26: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Woeful Performance/Scalability

‣ Poor I/O on cloud today, 100x slower than real HW

Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda1       0.00    0.00    0.00     0.00   0.00   0.00      0.00      0.00   0.00   0.00   0.00
sdb      169.31  111.88   57.43   469.31   0.90   2.25     12.24      2.29   4.36   1.12  59.01
sdc      178.22  110.89   59.41   396.04   0.93   1.98     13.08      1.58   3.50   1.18  53.56
sdd      145.54  102.97   50.50   384.16   0.78   1.90     12.63      1.00   2.34   1.03  44.85
sde      166.34   95.05   54.46   337.62   0.85   1.69     13.27      1.12   2.84   1.22  47.92
md0        0.00    0.00  880.20  2007.92   3.44   7.82      7.99      0.00   0.00   0.00   0.00

~10 MB/s write

m1.xlarge, raid0, 4x ephemeral (ec2)

Page 27: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Woeful Performance/Scalability

DB DB DB DB DB DB

‣ Difficult to horizontally scale in the cloud

Page 28: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

BUFFER POOL AND MEMORY
----------------------
Total memory allocated 11655168000; in additional pool allocated 0
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 223758224 (179959576 + 43798648)
    Page hash           11248264
    Dictionary cache    45048690 (44991344 + 57346)
    File system         84400 (82672 + 1728)
    Lock system         28180376 (28119464 + 60912)
    Recovery system     0 (0 + 0)
    Threads             428608 (406936 + 21672)
Dictionary memory allocated 57346
Buffer pool size        693759
Buffer pool size, bytes 11366547456
Free buffers            1
Database pages          691085
Old database pages      255087
Modified db pages       326490
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 497782847, not young 0
24.78 youngs/s, 0.00 non-youngs/s
Pages read 447257683, created 16982810, written 405153433
24.82 reads/s, 1.14 creates/s, 33.36 writes/s
Buffer pool hit rate 993 / 1000, young-making rate 7 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.39/s
LRU len: 691085, unzip_LRU len: 0
I/O sum[2753]:cur[2], unzip sum[0]:cur[0]

• Incredibly complex configuration

- Billion knobs and buttons

- Whole companies exist just to tune DB’s

• Lots of consistency/transactional models

• Multi-region data is unsolved - Facebook and Google struggle

@!#$%^&* Complex

Page 29: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Deep breath, step back

Think about each problem (use @twilio examples)

• Software that runs in the cloud

• Open source

Page 30: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(1) Difficult to Change Structure

• Don’t have structure

- key/value databases (SimpleDB, Cassandra)

- document-oriented databases (CouchDB, MongoDB)

• Don’t store a lot of data...

Page 31: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(1) Don’t Store Stuff

• Outsource data as much as possible

• But NOT to your customers

Page 32: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(1) Don’t Store Stuff

• Aggressively archive and move data offline

[Diagram: ~500M rows moved to S3/SimpleDB (keep indices in memory)]

Build UX that supports longer/restricted access times to older data

Page 33: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(1) Don’t Store Stuff

• Avoid stateful systems/architectures where possible

[Diagram: Browser → Web, Web, Web → SessionDB; the browser sends Cookie:SessionID]

Page 34: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(1) Don’t Store Stuff

• Avoid stateful systems/architectures where possible

[Diagram: Browser → Web, Web, Web, SessionDB; the browser sends Cookie:enc($session)]

Store state in client browser
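A minimal sketch of keeping session state in the client cookie, so that any web node can serve any request without a shared session database. Note the swap: the slide writes `enc($session)`, i.e. an encrypted cookie, while this sketch only signs the cookie with an HMAC from the Python standard library. That detects tampering but leaves the contents readable by the client; real encryption would need a cryptography library. `SECRET_KEY`, `encode_session`, and `decode_session` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; load from config in practice


def encode_session(session: dict) -> str:
    """Serialize the session and append an HMAC so the server can detect tampering."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def decode_session(cookie: str) -> Optional[dict]:
    """Return the session if the signature verifies, otherwise None."""
    try:
        payload, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None  # no signature part at all
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_session({"user": "bob", "cart": [1, 2]})
print(decode_session(cookie))
```

The trade-off is that cookies have size limits and are sent on every request, so this suits small session state only.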

Page 35: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Painful to Recover from Failures

• Avoid single points of failure

- E.g., master-master (active/active)

- Complex to set up, complex failure modes

- Sometimes it’s the only solution

- Lots of great docs on web

• Minimize number of stateful nodes, separate stateful & stateless components...

2

Page 36: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(2) Separate Stateful and Stateless Components

[Diagram: Req → App A → App B → App C, plus a replacement App B]

On failure, even if we boot a replacement, we lose data

Page 37: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(2) Separate Stateful and Stateless Components

[Diagram: Req → App A → App B → App C, each app with its own queue, plus a replacement App B with a queue]

On failure, even if we boot a replacement, we lose data

Page 38: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(2) Separate Stateful and Stateless Components

[Diagram: Req → App A → App B → App C, with multiple instances of each app]

Keep connection open for whole app path! (hint: use evented framework)

On failure, we don’t lose a single request

Twilio’s SMS stack uses this approach
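One way to picture the "keep the connection open" idea is a chain of coroutines in which each stage awaits the next, so no request ever sits in a per-node queue. If an instance dies, its caller still holds the request and can retry on another instance. This is a toy sketch using `asyncio` as the evented framework; `InstanceDown`, `call_stage`, and the app functions are hypothetical and do not represent Twilio's actual SMS stack.

```python
import asyncio


class InstanceDown(Exception):
    """A (hypothetical) app instance died while handling a request."""


async def call_stage(instances, request):
    """Try each instance of a stage in turn; the caller still holds the request."""
    for instance in instances:
        try:
            return await instance(request)
        except InstanceDown:
            continue  # nothing was parked on the dead node, so retry elsewhere
    raise InstanceDown("no healthy instance left")


async def app_c(request):
    return f"C({request})"


async def dead_app_b(request):
    raise InstanceDown("App B crashed")


async def app_b(request):
    # App B keeps its call to C open until C answers.
    return await call_stage([app_c], f"B({request})")


async def app_a(request):
    # App A answers the client only after the whole path has completed.
    return await call_stage([dead_app_b, app_b], f"A({request})")


result = asyncio.run(app_a("req"))
print(result)
```

Retrying like this is only safe when the downstream work is idempotent or the failed instance did nothing durable before dying.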

Page 39: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Painful to Recover from Failures

• Avoid single points of failure

- E.g., master-master (active/active)

- Complex to set up, complex failure modes

- Sometimes it’s the only solution

- Lots of great blog docs on web

• Minimize number of stateful nodes, separate stateful & stateless components

• Build a data change control process to avoid mistakes and errors...

2

Page 40: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

• 100’s of prod hosts in continuous operation

• 80+ service types running in prod

• 50+ prod database servers

• Prod deployments several times/day across 7 engineering teams

Components deployed at different frequencies: Partially Continuous Deployment

Page 41: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

4 buckets: Deployment Frequency (Risk), log scale

Frequency  Component        Technology
1000x      Website Content  CMS
100x       Website Code     PHP/Ruby etc.
10x        REST API         Python/Java etc.
1x         Big DB Schema    SQL

Page 42: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Deployment Processes

Component        Process
Website Content  One Click
Website Code     One Click, CI Tests
REST API         One Click, CI Tests, Human Sign-off
Big DB Schema    Human Assisted Click, CI Tests, Human Sign-off
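The four deployment processes can be written down as data that a deploy tool checks before it acts. The sketch below is illustrative only: the dictionary layout, the check names, and the assumption that CI tests and sign-off must finish before the click are not from the talk.

```python
# Hypothetical encoding of the four buckets: checks that must pass, then the trigger.
PROCESSES = {
    "website_content": {"checks": [], "trigger": "one_click"},
    "website_code": {"checks": ["ci_tests"], "trigger": "one_click"},
    "rest_api": {"checks": ["ci_tests", "human_signoff"], "trigger": "one_click"},
    "big_db_schema": {
        "checks": ["ci_tests", "human_signoff"],
        "trigger": "human_assisted_click",
    },
}


def missing_checks(component: str, completed: set) -> list:
    """Return the checks that still block a deploy of this component type."""
    return [c for c in PROCESSES[component]["checks"] if c not in completed]


def can_deploy(component: str, completed: set) -> bool:
    """A deploy may be triggered once every required check has completed."""
    return not missing_checks(component, completed)


print(missing_checks("rest_api", {"ci_tests"}))
print(can_deploy("website_content", set()))
```

Encoding the process per component type lets low-risk changes ship often while riskier ones get more gates.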

Page 43: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Woeful Performance/Scalability

• If disk I/O is poor, avoid disk

- Tune tune tune. Keep your indices in memory

- Use an in-memory datastore e.g., Redis and configure replication such that if you have a master failure, you can always promote a slave

• When disk I/O saturates, shard

- Lots of sharding info on the web

- Method of last resort, single point of failure becomes multiple single points of failure

3
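As a rough illustration of sharding, the sketch below routes each key to one of several databases using a stable hash. The host names and `shard_for` are made up, and modulo placement is just the simplest scheme, not necessarily what Twilio used.

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical hosts


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to get the same answer on every node.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


for account in ("acct-1", "acct-2", "acct-3"):
    print(account, "->", shard_for(account))
```

Each key lives on exactly one shard, which is the slide's warning in miniature: every shard is now its own single point of failure. With modulo placement, changing the number of shards also remaps most keys.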

Page 44: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

@#$%^&* Complex

• Bring the simplest tool to the job

- Use a strictly consistent store only if you need it

- If you don’t need HA, don’t add the complexity

• There is no magic database. Decompose requirements, mix-and-match datastores as needed...

4

Magic Database does it all. Consistency, Availability, Partition-tolerance, it's got all three.

Magic Database

Page 45: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) Twilio Data Lifecycle

Operation     name  status  ret
CREATE        foo   INIT    0
UPDATE        foo   QUEUED  0
UPDATE        foo   GOING   0
(none shown)  foo   DONE    42

Twilio Examples: Call, SMS, Conference

Other Examples: Order, Workflow, $
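The lifecycle reads as a small state machine. A minimal sketch, assuming the statuses only ever move forward in the order shown (the slide does not say whether other transitions are allowed); `create` and `update` are hypothetical helpers:

```python
# Assumed transition table: each status may only advance to the next one.
ALLOWED = {
    "INIT": {"QUEUED"},
    "QUEUED": {"GOING"},
    "GOING": {"DONE"},
    "DONE": set(),
}


def create(name: str) -> dict:
    """CREATE: a new record starts in INIT with a placeholder return value."""
    return {"name": name, "status": "INIT", "ret": 0}


def update(record: dict, status: str, ret: int = 0) -> dict:
    """UPDATE: move to the next status, rejecting transitions not in ALLOWED."""
    if status not in ALLOWED[record["status"]]:
        raise ValueError(f"cannot go from {record['status']} to {status}")
    return {**record, "status": status, "ret": ret}


record = create("foo")
record = update(record, "QUEUED")
record = update(record, "GOING")
record = update(record, "DONE", ret=42)
print(record)
```

Once a record reaches DONE it no longer changes.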

Page 46: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) Twilio Data Lifecycle

[Diagram: the same lifecycle records (INIT, QUEUED, GOING, DONE), divided into In-Flight and Post-Flight phases]

Page 47: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) Twilio Data Lifecycle: Applications

In-Flight

• Atomically update part of a workflow

Post-Flight

• Billing

• Log Access

• Analytics

• Reporting

Page 48: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) Twilio Data Lifecycle: High-Availability Properties

In-Flight            Post-Flight
Strict Consistency   Eventual Consistency
Key/Value            Range Queries w/ Filters
~20ms                ~200ms

Page 49: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) Twilio Data Lifecycle

In-Flight: Data Store A

Post-Flight: Data Store B

Systems with very different access semantics
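To make the split concrete, here is a toy, in-process sketch of two stores with different access patterns: a key/value store for records that are still changing, and a time-ordered store for finished records that supports range queries with filters. The class names and `finish` are hypothetical, and nothing here models replication, latency, or eventual consistency.

```python
import bisect


class InFlightStore:
    """Key/value access to records that are still being updated (Data Store A)."""

    def __init__(self):
        self._records = {}

    def put(self, key, record):
        self._records[key] = dict(record)

    def get(self, key):
        return self._records[key]

    def pop(self, key):
        return self._records.pop(key)


class PostFlightStore:
    """Finished records ordered by completion time (Data Store B)."""

    def __init__(self):
        self._times = []  # sorted completion times
        self._rows = []   # records, in the same order as _times

    def add(self, finished_at, record):
        i = bisect.bisect_right(self._times, finished_at)
        self._times.insert(i, finished_at)
        self._rows.insert(i, dict(record))

    def range_query(self, start, end, **filters):
        """Return records finished in [start, end] that match every filter."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return [r for r in self._rows[lo:hi]
                if all(r.get(k) == v for k, v in filters.items())]


def finish(in_flight, post_flight, key, finished_at, ret):
    """Move a record out of the in-flight store once it is DONE."""
    record = in_flight.pop(key)
    record.update(status="DONE", ret=ret)
    post_flight.add(finished_at, record)
```

Because the in-flight store only holds live records, it stays small; the post-flight store grows without bound but never needs strict consistency.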

Page 50: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) In-Flight

• Strict Consistency

• Key/Value

• ~20ms

• 10k-1M

Post-Flight (fed through queues, Q)

• Logs (REST API): eventual consistency, range queries, filtered queries, ~200ms, billions

• Reporting: eventual consistency, arbitrary queries, high latency, billions

• Billing: idempotent, aggregation, key/value, billions
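The "idempotent aggregation" requirement for billing can be sketched as an aggregator that remembers which events it has already applied, so a queue that redelivers a message cannot double-charge. `BillingAggregator` and its fields are hypothetical, and a real system would need to persist and bound the set of seen IDs.

```python
from collections import defaultdict


class BillingAggregator:
    """Sum charges per account, applying each event at most once."""

    def __init__(self):
        self._seen = set()               # event IDs already applied
        self._totals = defaultdict(int)  # account -> total in cents

    def apply(self, event_id: str, account: str, amount_cents: int) -> bool:
        """Add a charge once; replays of the same event_id are ignored."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self._totals[account] += amount_cents
        return True

    def total(self, account: str) -> int:
        return self._totals.get(account, 0)


agg = BillingAggregator()
agg.apply("evt-1", "acct-a", 5)
agg.apply("evt-1", "acct-a", 5)  # redelivered by the queue; ignored
print(agg.total("acct-a"))
```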

Page 51: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

(4) In-Flight: MySQL, PostgreSQL, Redis, NDB

Post-Flight (fed through queues, Q): Billing, Logs (REST API), Reporting

Candidate technologies, in extraction order:

• SQL Sharded, Cassandra/Acunu, MongoDB, Riak, CouchDB

• SQL Sharded, Redis

• SQL Sharded, Redis

• Hadoop

Page 52: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Data

Page 53: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Why is persistence so hard?

• Difficult to change structure

- Huge inertia, e.g., large schema migrations

‣ Don’t store stuff! Go schema-less

• Painful to recover from disk/node failures

- “just boot a new node” doesn’t work

‣ Separate stateful/stateless. Change control processes

• Woeful performance/scalability

- I/O is a huge bottleneck in modern servers (e.g. EC2)

‣ Memory FTW. Shard

• Freak’in complex!!!

- Atomic transactions/rollback, ACID, blah blah blah

‣ Decompose data lifecycle. Minimize complexity

Page 54: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: Incoming Requests → LB → Tier 1 (A, Q, A, Q) → Tier 2 (B, B, B, B) → Tier 3 (C, C, D, D), with Files, K/V, and SQL stores]

Page 55: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: Incoming Requests → LB → Tier 1 (A, A) → Tier 2 (B, B, B, B) → Tier 3 (C, C, D, D), with stores SQL, SQL, Q, Q, SimpleDB, and S3. Annotations:]

• Aggregate into HA queues

• Master-Master MySQL

• Move file store to S3

• Move K/V to SimpleDB w/ local cache

• Idempotent request path
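The "K/V w/ local cache" annotation can be illustrated with a read-through cache placed in front of a remote key/value store. This is a generic sketch: `ReadThroughCache` is a hypothetical name, the `fetch` callable stands in for whatever client reads the remote store, and the time-based expiry is an assumption rather than a detail from the talk.

```python
import time


class ReadThroughCache:
    """Serve reads from local memory, falling back to the remote store on a miss."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable that reads one key from the remote store
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh local copy, no remote call
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


remote = {"plan:acct-a": "pro"}  # stands in for the remote K/V store
cache = ReadThroughCache(remote.__getitem__, ttl_seconds=30)
print(cache.get("plan:acct-a"))
```

A local cache trades freshness for latency, so it fits data that can tolerate being up to one TTL out of date.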

Page 56: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

[Diagram: the same grouping of downtime causes under Data Persistence, Change Control, Operations, and Datacenter.]

HA is Hard

Page 57: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD

Focus on data

• How you store it

• When you can delete it

• Control changes to it

• Where you store it

Page 58: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

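Open Problems...

[Diagram: In-Flight and Post-Flight stores (Billing, Logs (REST API), and Reporting fed through queues Q; Hadoop), annotated with open problems, in extraction order:]

• HA queue

• Simple multi-AZ, multi-region consistent K/V

• Massively scalable range queries, filterable, ~200ms

• Simple HA Hadoop

• Massively scalable aggregator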

Page 59: High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

twilio

http://www.twilio.com

@emcooke