Architecting for Failure in AWS - PuppetConf 2013

ARCHITECTING IN AWS for resilience & cost at scale Jos Boumans - @jiboumans http://rafaykhan619.wix.com/downhouse Thursday 22 August 13

Upload: puppet-labs

Post on 12-Nov-2014

DESCRIPTION

"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.

Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux gets one or more as well. We grew from zero traffic to several billion requests per day in the span of 2 years, and we did so exclusively in AWS. As anyone using AWS will tell you, there are good parts and there are bad ones. This is the story of all the pitfalls we encountered and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale & operate.

Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx and PuppetConf, where he mostly speaks on dealing with AWS operations from all angles.

TRANSCRIPT

Page 1: Architecting for Failure in AWS - PuppetConf 2013

ARCHITECTING IN AWS

for resilience & cost at scale

Jos Boumans - @jiboumans

http://rafaykhan619.wix.com/downhouse

Thursday 22 August 13

Page 3: Architecting for Failure in AWS - PuppetConf 2013

CANONICAL

http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775

Engineering manager for Ubuntu Server 10.04 & 10.10

http://www.ubuntu.com/business/server/overview

Page 5: Architecting for Failure in AWS - PuppetConf 2013

SOME OF OUR CUSTOMERS

Page 6: Architecting for Failure in AWS - PuppetConf 2013

LOTS OF TRAFFIC

http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

Page 7: Architecting for Failure in AWS - PuppetConf 2013

AVERAGE REQUESTS* / SEC

http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm

*Twitter: New tweets. Wikipedia: Articles read. Krux: New data points.

[Bar chart: average requests per second for Twitter, Wikipedia and Krux; axis from 0 to 15,000. Bar values not recoverable.]

Page 8: Architecting for Failure in AWS - PuppetConf 2013

MONTHLY UNIQUE USERS

[Bar chart: monthly unique users; axis from 0 to 800,000,000. Bar values not recoverable.]

http://en.wikipedia.org/wiki/Wikipedia
http://mashable.com/2013/03/21/happy-7th-birthday-twitter/

Page 10: Architecting for Failure in AWS - PuppetConf 2013

THERE ARE DOWNSIDES

http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

Page 11: Architecting for Failure in AWS - PuppetConf 2013

RESILIENCE & COST AT SCALE

Page 14: Architecting for Failure in AWS - PuppetConf 2013

ROOT CAUSE CATEGORIES

Software bugs & human error

[Chart: "Amazon Cloud Major Outage - Issues Categories", # of issues: Software 8, Automation 4, Process 14]

http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis

Page 17: Architecting for Failure in AWS - PuppetConf 2013

RESILIENCE @ SCALE

Embrace Failure: Hardware will fail. Humans will make errors. Nature will produce thunderstorms.

http://blabitcanada.com/category/twitter-2/

Page 18: Architecting for Failure in AWS - PuppetConf 2013

DEFINE 'AVAILABLE'

Things will break, so choose your degraded state.

http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris

Page 19: Architecting for Failure in AWS - PuppetConf 2013

BASIC API CALL

3 potential points of failure

Page 20: Architecting for Failure in AWS - PuppetConf 2013

FALLBACK PATTERNS

The cost of resilience should be accuracy or latency

http://redis.io/
http://memcached.org/
http://varnish-cache.org/
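
The fallback idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the class name `FallbackFetcher`, the in-process dict standing in for a cache such as Redis or memcached, and the one-hour staleness limit are all assumptions made for the example. When the live backend fails, the caller gets an older cached answer (paying with accuracy) or a chosen default (the degraded state), instead of an error.

```python
import time


class FallbackFetcher:
    """Try the live backend first, then a stale cached copy, then a default.

    Hypothetical helper: the dict below stands in for an external cache.
    """

    def __init__(self, backend, default, max_stale_seconds=3600.0,
                 clock=time.monotonic):
        self._backend = backend            # callable(key) -> value, may raise
        self._default = default            # the chosen degraded answer
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}                   # key -> (value, stored_at)

    def get(self, key):
        """Return (value, source), where source is 'live', 'stale' or 'default'."""
        try:
            value = self._backend(key)
        except Exception:
            # Backend failed: trade accuracy for availability.
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[1] <= self._max_stale:
                return cached[0], "stale"
            return self._default, "default"
        # Backend answered: remember the value for future failures.
        self._cache[key] = (value, self._clock())
        return value, "live"
```

Returning the source alongside the value lets the caller decide whether a stale or default answer is acceptable for a given request.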

Page 25: Architecting for Failure in AWS - PuppetConf 2013

USER EXPERIENCE

My tweet got posted

Page 26: Architecting for Failure in AWS - PuppetConf 2013

RESILIENCE TOOLS

Storage, Network & ACL

http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html

Page 27: Architecting for Failure in AWS - PuppetConf 2013

MANY SMALL NODES VERSUS A FEW LARGER NODES

The benefits of the many outweigh the benefits of the few

http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
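
One way to see why many small nodes help is to work out what a single node failure costs. The sketch below assumes equal-sized nodes and evenly spread load; the function names and the node counts in the usage note are illustrative assumptions, not figures from the talk.

```python
def capacity_after_failures(node_count, failed=1):
    """Fraction of total capacity left after `failed` equal-sized nodes die."""
    if node_count <= 0 or not 0 <= failed <= node_count:
        raise ValueError("need node_count > 0 and 0 <= failed <= node_count")
    return (node_count - failed) / node_count


def spare_fraction_needed(node_count):
    """Extra capacity, as a fraction of peak load, that must be provisioned
    so the remaining nodes can still carry peak load after one node fails."""
    if node_count < 2:
        raise ValueError("a single node cannot absorb its own failure")
    # N-1 survivors must carry the full load, so each node is sized at
    # load / (N-1); the fleet then totals N / (N-1) of the load.
    return 1 / (node_count - 1)
```

With 4 large nodes, losing one leaves 75% of capacity and surviving it at peak requires about 33% headroom; with 40 small nodes the same failure leaves 97.5% and requires under 3% headroom.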

Page 28: Architecting for Failure in AWS - PuppetConf 2013

DATABASES

CAP Theorem applies.

Your choice: sacrifice availability or consistency. Orange is a lie.

RDBMS

BigTable Based

Master / Slave based

CouchDB

Dynamo Based

http://ferd.ca/beating-the-cap-theorem-checklist.html

Page 29: Architecting for Failure in AWS - PuppetConf 2013

SIMPLE STORAGE SERVICE

S3: Arguably AWS' best feature

http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
http://aws.amazon.com/s3/
https://forums.aws.amazon.com/message.jspa?messageID=182919#182919

Page 30: Architecting for Failure in AWS - PuppetConf 2013

CACHE WHAT YOU CAN

HTTP Responses, DB Queries, User content

Browsers have caches too!

http://cruncht.com/95/drupal-caching/
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
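
As a small illustration of "browsers have caches too", the sketch below builds the standard `Cache-Control` and `ETag` response headers and checks a request's `If-None-Match` header. The helper names and the choice of a truncated SHA-256 digest as the ETag are assumptions for this example; real `If-None-Match` handling also has to deal with lists and weak validators, which this sketch ignores.

```python
import hashlib


def cache_headers(body: bytes, max_age_seconds: int) -> dict:
    """Headers that let browsers and proxies such as Varnish reuse a response."""
    # Content-derived ETag: changes exactly when the body changes.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "ETag": etag,
    }


def is_not_modified(request_headers: dict, etag: str) -> bool:
    """True if the client already holds this version (answer with HTTP 304)."""
    return request_headers.get("If-None-Match") == etag
```

A response that is still cached by the client never reaches the origin at all, and a conditional request that matches the ETag can be answered with an empty 304 instead of the full body.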

Page 31: Architecting for Failure in AWS - PuppetConf 2013

CLIENT SIDE STORAGE

Keep a copy of your users' data locally

http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/

Page 32: Architecting for Failure in AWS - PuppetConf 2013

USE ELASTIC LOAD BALANCERS

They will save you more than once

http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/

Page 33: Architecting for Failure in AWS - PuppetConf 2013

USE GLOBAL LOAD BALANCING

Fail over to the closest data center on region failure
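
The failover rule on this slide can be sketched as a simple selection: among the regions that currently pass health checks, send the client to the nearest one. In practice a DNS-based traffic manager makes this decision; the function name `pick_region`, the use of measured latency as the "closeness" metric, and the numbers in the usage note are assumptions for illustration only.

```python
def pick_region(client_latency_ms, healthy):
    """Return the lowest-latency healthy region, or None if all are down.

    client_latency_ms: dict of region name -> latency from this client (ms)
    healthy:           dict of region name -> bool from health checks
    """
    # Regions missing from the health map are treated as unhealthy.
    candidates = {
        region: ms
        for region, ms in client_latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

For a client seeing 20 ms to `us-east-1`, 80 ms to `us-west-2` and 110 ms to `eu-west-1`, the choice is `us-east-1` while it is healthy and `us-west-2` once it fails.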

Page 34: Architecting for Failure in AWS - PuppetConf 2013

SHOUT OUT: DYN

DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc

http://dyn.com

Page 35: Architecting for Failure in AWS - PuppetConf 2013

USE IAM ROLES FOR ACCESS

Humans make mistakes, including your humans

Page 36: Architecting for Failure in AWS - PuppetConf 2013

COST @ SCALE

Scaling without breaking the bank

http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg

Page 37: Architecting for Failure in AWS - PuppetConf 2013

EMR + SPOT INSTANCES

On demand rate: $0.165 / hour

http://aws.amazon.com/ec2/spot-instances/

Page 38: Architecting for Failure in AWS - PuppetConf 2013

AMAZON REDSHIFT

Economical Business Intelligence

Scales with data size

http://www.flitemedia.com/music.php
http://aws.amazon.com/redshift
http://www.tableausoftware.com/

Page 39: Architecting for Failure in AWS - PuppetConf 2013

AMAZON GLACIER

"Tapes for the Cloud Era"

Writes vastly cheaper than reads

http://aws.amazon.com/glacier/
http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html

Page 40: Architecting for Failure in AWS - PuppetConf 2013

AWS SIMPLE EMAIL SERVICE

Dealing with email is boring and time consuming

http://aws.amazon.com/ses/
http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/

Page 41: Architecting for Failure in AWS - PuppetConf 2013

AWS SIMPLE QUEUE SERVICE

Excellent for latency insensitive, small volume queues

http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html
http://aws.amazon.com/sqs/
http://colby.id.au/benchmarking-sqs

Page 42: Architecting for Failure in AWS - PuppetConf 2013

INSTANCE MARKETPLACE

Buy & sell reserved instances

http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg
http://aws.amazon.com/ec2/reserved-instances/marketplace/

Page 43: Architecting for Failure in AWS - PuppetConf 2013

AWS DYNAMO DB

Excellent for small keys & high read rates at known & consistent IOPS

http://hlbike.en.ecplaza.net/2.jpg
http://aws.amazon.com/dynamodb/

Page 44: Architecting for Failure in AWS - PuppetConf 2013

MAXIMIZE IOPS

RAID 0 Ephemeral drives

Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS

http://calculator.s3.amazonaws.com/calc5.html
http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
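
As a rough illustration of striping ephemeral drives, the sketch below only builds (it does not run) a Linux `mdadm` command for a RAID 0 array. The device names `/dev/xvdb` and friends and the array name `/dev/md0` are assumptions; the actual names depend on the instance type and AMI. Note that RAID 0 has no redundancy and ephemeral storage does not survive an instance stop, so this only suits data that can be rebuilt.

```python
import shlex


def raid0_command(devices, array="/dev/md0"):
    """Build the argv for creating a RAID 0 stripe across `devices`.

    The command is returned rather than executed, so it can be reviewed
    or handed to a provisioning tool.
    """
    if len(devices) < 2:
        raise ValueError("RAID 0 striping needs at least two devices")
    return [
        "mdadm", "--create", array,
        "--level=0",                          # striping, no redundancy
        f"--raid-devices={len(devices)}",
        *devices,
    ]


if __name__ == "__main__":
    # Hypothetical device names for the ephemeral drives.
    argv = raid0_command(["/dev/xvdb", "/dev/xvdc", "/dev/xvdd", "/dev/xvde"])
    print(shlex.join(argv))
```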

Page 45: Architecting for Failure in AWS - PuppetConf 2013

RED FLAGS

Anti-patterns to watch out for

http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/

Page 46: Architecting for Failure in AWS - PuppetConf 2013

PROVISIONED IOPS EBS

Ephemeral storage on c1/m1.xlarge or SSD is better

If you must: m*large or c1.xlarge for dedicated NIC

http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
http://navidoo.ru/interest/Nasha_jizn/17676.html

Page 47: Architecting for Failure in AWS - PuppetConf 2013

AWS DYNAMO DB

For high write rates or large/variable keys

http://aws.amazon.com/dynamodb/
http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx

Page 48: Architecting for Failure in AWS - PuppetConf 2013

HIGH IO/DISK/RAM NODES

Use them deliberately

http://elledecoration.co.za/2010/07/gigantic/

Page 49: Architecting for Failure in AWS - PuppetConf 2013

AWS CLOUDWATCH

Metric collection, Amazon style

Cost prohibitive & resolution too low

http://www.flickr.com/photos/65683080@N08/6893582132/
http://aws.amazon.com/cloudwatch/

Page 50: Architecting for Failure in AWS - PuppetConf 2013

LOWER COST PER METRIC

Use graphite & statsd

http://graphite.wikidot.com/
https://github.com/etsy/statsd
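
Part of what makes statsd cheap per metric is its wire format: a short text line sent over UDP, fire and forget. The sketch below formats a counter in the statsd line format (`name:value|c`, with an optional `|@rate` sample rate) and sends it with the standard library. The metric name, the helper names, and the host and port defaults (8125 is the conventional statsd port) are assumptions for this example.

```python
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Render a counter increment in the statsd line format."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # Tells statsd this is a sample, so it can scale the count up.
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP; nothing is reported if nobody listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A call such as `send_metric(format_counter("api.requests"))` adds one increment per event; because UDP is connectionless, a slow or missing statsd daemon does not block the application.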

Page 51: Architecting for Failure in AWS - PuppetConf 2013

HOSTED ALTERNATIVES

Circonus: All the insights you ever wanted

StackDriver: Optimized for AWS

http://circonus.com
http://stackdriver.com

Page 52: Architecting for Failure in AWS - PuppetConf 2013

AWS CLOUDFORMATION

Templatize your entire stack

Harder to use as complexity increases

http://aws.amazon.com/cloudwatch/
http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg

Page 53: Architecting for Failure in AWS - PuppetConf 2013

RDS FOR ANALYTICS/REPORTS

Paying OLTP prices for BI usage

Sharding will be a matter of time

http://nerds.airbnb.com/redshift-performance-cost
http://business901.com/blog1/understanding-your-customer-problem/

Page 54: Architecting for Failure in AWS - PuppetConf 2013

Q & A

http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html

@jiboumans

http://slideshare.net/jiboumans