Architecting for Failure in AWS - PuppetConf 2013
DESCRIPTION
"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.
Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux gets one or more requests as well. We grew from zero traffic to several billion requests per day in the span of two years, and we did so exclusively in AWS. As anyone using AWS will tell you, there are good parts and bad ones. This is the story of the pitfalls we encountered and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale & operate.
Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx and PuppetConf, where he mostly speaks on dealing with AWS operations from all angles.
TRANSCRIPT
ARCHITECTING IN AWS
for resilience & cost at scale
Jos Boumans - @jiboumans
http://rafaykhan619.wix.com/downhouse
Thursday 22 August 13
RIPE NCC
Engineering manager for RIPE Database
http://www.ripe.net/db
CANONICAL
http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775
Engineering manager for Ubuntu Server 10.04 & 10.10
http://www.ubuntu.com/business/server/overview
KRUX
VP of Operations & Infrastructure
http://www.krux.com/
SOME OF OUR CUSTOMERS
LOTS OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
AVERAGE REQUESTS* / SEC
[Bar chart comparing Twitter, Wikipedia and Krux; axis from 0 to 15,000 requests per second]
*Twitter: New tweets. Wikipedia: Articles read. Krux: New data points.
http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
MONTHLY UNIQUE USERS
[Bar chart; axis from 0 to 800,000,000 monthly unique users]
http://en.wikipedia.org/wiki/Wikipedia
http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
WE CHOSE 'THE CLOUD'
http://previewnetworks.com/blog/
THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
RESILIENCE & COST AT SCALE
FOCUS ON AWS
http://aws.amazon.com/
APRIL 21, 2011
http://aws.amazon.com/message/680587/
http://aws.amazon.com/message/680342/
http://aws.amazon.com/message/67457/
http://aws.amazon.com/message/65648/
Also: June 29, 2012 - October 22, 2012 - December 24, 2012
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
ROOT CAUSE CATEGORIES
Software bugs & human error
[Chart: "Amazon Cloud Major Outage - Issues Categories", # of issues: Software 8, Automation 4, Process 14]
http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
JUNE 29, 2012
http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper
http://aws.amazon.com/message/67457/
AWS OUTAGE = YOUR OUTAGE
http://it.mario.wikia.com/wiki/Lakitu
RESILIENCE @ SCALE
Embrace Failure: Hardware will fail. Humans will make errors. Nature will produce thunderstorms.
http://blabitcanada.com/category/twitter-2/
DEFINE 'AVAILABLE'
Things will break, so choose your degraded state.
http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris
BASIC API CALL
3 potential points of failure
FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
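One way to read this slide: when the primary source fails, the caller still gets an answer, only a less accurate one. The sketch below is a minimal illustration of that idea and not the implementation used at Krux. The class name FallbackReader, the fetch_primary callable and the in-memory dict (standing in for something like Redis or memcached) are all assumptions made for the example.

```python
class FallbackReader:
    """Serve fresh data when possible, stale or default data when not.

    Hypothetical helper: the in-memory dict stands in for an external
    cache such as Redis or memcached.
    """

    def __init__(self, fetch_primary, default=None):
        self._fetch_primary = fetch_primary  # callable(key) -> value, may raise
        self._cache = {}                     # last known good value per key
        self._default = default              # answer of last resort

    def get(self, key):
        """Return (value, quality) where quality is fresh, stale or default."""
        try:
            value = self._fetch_primary(key)
        except Exception:
            # Primary is down: pay for resilience with accuracy,
            # not with an error shown to the user.
            if key in self._cache:
                return self._cache[key], "stale"
            return self._default, "default"
        self._cache[key] = value
        return value, "fresh"
```

The quality label lets the caller decide how to present a degraded answer, which is one way of choosing the degraded state ahead of time.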
USER EXPERIENCE
My tweet got posted
RESILIENCE TOOLS
Storage, Network & ACL
http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html
MANY SMALL NODES VERSUS A FEW LARGER NODES
The benefits of the many outweigh the benefits of the few
http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
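The arithmetic behind many small nodes can be sketched as follows, assuming load is spread evenly across the fleet. The function names and the example fleet sizes are illustrative and do not come from the talk.

```python
def capacity_lost_fraction(node_count, failed_nodes=1):
    """Fraction of total capacity lost when failed_nodes of node_count die."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 <= failed_nodes <= node_count:
        raise ValueError("failed_nodes must be between 0 and node_count")
    return failed_nodes / node_count


def load_per_survivor(total_load, node_count, failed_nodes=1):
    """Load each remaining node must absorb after a failure."""
    survivors = node_count - failed_nodes
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_load / survivors


# Hypothetical fleets serving the same total load:
# 4 large nodes lose 25% of capacity on one failure, 40 small nodes lose 2.5%.
few = capacity_lost_fraction(4)
many = capacity_lost_fraction(40)
```

With 4 nodes, each survivor has to take a third more traffic after one failure. With 40 nodes, the increase per survivor is under 3%.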
DATABASES
CAP Theorem applies.
Your choice: sacrifice availability or consistency. Orange is a lie.
RDBMS
BigTable Based
Master / Slave based
CouchDB
Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html
SIMPLE STORAGE SERVICE
S3: Arguably AWS' best feature
http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
http://aws.amazon.com/s3/
https://forums.aws.amazon.com/message.jspa?messageID=182919#182919
CACHE WHAT YOU CAN
HTTP Responses, DB Queries, User content
Browsers have caches too!
http://cruncht.com/95/drupal-caching/
http://redis.io/
http://memcached.org/
http://varnish-cache.org/
CLIENT SIDE STORAGE
Keep a copy of your users' data locally
http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/
USE ELASTIC LOAD BALANCERS
They will save you more than once
http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
USE GLOBAL LOAD BALANCING
Fail over to the closest data center on region failure
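The failover rule on this slide can be sketched as a small selection function: among the regions that currently pass health checks, send the client to the one with the lowest latency. This is only an illustration of the routing decision, not how any particular DNS provider implements it. The function name pick_region and the latency figures are assumptions.

```python
def pick_region(latency_ms_by_region, healthy_regions):
    """Return the healthy region with the lowest latency for this client.

    latency_ms_by_region: dict of region name -> measured latency in ms
    healthy_regions: set of region names currently passing health checks
    """
    candidates = {
        region: latency
        for region, latency in latency_ms_by_region.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies for one client; the numbers are made up.
latencies = {"us-east-1": 20, "us-west-2": 80, "eu-west-1": 95}

normal = pick_region(latencies, {"us-east-1", "us-west-2", "eu-west-1"})
after_outage = pick_region(latencies, {"us-west-2", "eu-west-1"})
```

When us-east-1 drops out of the healthy set, the same client is routed to the next closest region, so the cost of the outage is latency and not availability.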
SHOUT OUT: DYN
DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc.
http://dyn.com
USE IAM ROLES FOR ACCESS
Humans make mistakes, including your humans
COST @ SCALE
Scaling without breaking the bank
http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg
EMR + SPOT INSTANCES
On demand rate: $0.165 / hour
http://aws.amazon.com/ec2/spot-instances/
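The transcript keeps only the on-demand rate, so the spot price the slide compared it against is unknown. The sketch below shows the comparison with the spot rate as an input you supply yourself. The function names and the cluster size in the usage lines are assumptions, and the spot rate used there is a placeholder, not a figure from the talk.

```python
ON_DEMAND_RATE = 0.165  # USD per instance-hour, the rate quoted on the slide


def job_cost(instances, hours, hourly_rate):
    """Total cost in USD of running a cluster for a batch job."""
    return instances * hours * hourly_rate


def spot_savings(instances, hours, spot_rate, on_demand_rate=ON_DEMAND_RATE):
    """Return (absolute savings in USD, savings as a fraction of on-demand cost)."""
    on_demand = job_cost(instances, hours, on_demand_rate)
    spot = job_cost(instances, hours, spot_rate)
    return on_demand - spot, 1 - spot / on_demand


# Hypothetical job: 20 instances for 10 hours.
# PLACEHOLDER_SPOT_RATE is made up; look up the real spot price yourself.
PLACEHOLDER_SPOT_RATE = 0.0165
saved_usd, saved_fraction = spot_savings(20, 10, PLACEHOLDER_SPOT_RATE)
```

Spot instances can be reclaimed when the market price rises above your bid, which is why they suit restartable batch work such as EMR jobs.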
AMAZON REDSHIFT
Economical Business Intelligence
Scales with data size
http://www.flitemedia.com/music.php
http://aws.amazon.com/redshift
http://www.tableausoftware.com/
AMAZON GLACIER
"Tapes for the Cloud Era"
Writes vastly cheaper than reads
http://aws.amazon.com/glacier/
http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html
AWS SIMPLE EMAIL SERVICE
Dealing with email is boring and time-consuming
http://aws.amazon.com/ses/
http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/
AWS SIMPLE QUEUE SERVICE
Excellent for latency-insensitive, small-volume queues
http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html
http://aws.amazon.com/sqs/
http://colby.id.au/benchmarking-sqs
INSTANCE MARKETPLACE
Buy & sell reserved instances
http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg
http://aws.amazon.com/ec2/reserved-instances/marketplace/
AWS DYNAMO DB
Excellent for small keys & high read rates at known & consistent IOPS
http://hlbike.en.ecplaza.net/2.jpg
http://aws.amazon.com/dynamodb/
MAXIMIZE IOPS
RAID 0 Ephemeral drives
Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS
http://calculator.s3.amazonaws.com/calc5.html
http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
RED FLAGS
Anti-patterns to watch out for
http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
PROVISIONED IOPS EBS
Ephemeral storage on c1/m1.xlarge or SSD is better
If you must: m*large or c1.xlarge for dedicated NIC
http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
http://navidoo.ru/interest/Nasha_jizn/17676.html
AWS DYNAMO DB
For high write rates or large/variable keys
http://aws.amazon.com/dynamodb/
http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx
HIGH IO/DISK/RAM NODES
Use them deliberately
http://elledecoration.co.za/2010/07/gigantic/
AWS CLOUDWATCH
Metric collection, Amazon style
Cost prohibitive & resolution too low
http://www.flickr.com/photos/65683080@N08/6893582132/
http://aws.amazon.com/cloudwatch/
LOWER COST PER METRIC
Use graphite & statsd
http://graphite.wikidot.com/
https://github.com/etsy/statsd
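Part of why statsd keeps the cost per metric low is that emitting one is a single fire-and-forget UDP datagram in a plain-text format. The sketch below uses only the Python standard library and the statsd line format (name:value|type, with an optional |@rate suffix). The metric names are hypothetical, and the host and port assume a statsd daemon listening on its default port 8125 on the local machine.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build one statsd line, e.g. 'api.requests:1|c'.

    metric_type is 'c' (counter), 'ms' (timer) or 'g' (gauge).
    """
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP; nothing is read back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names for an API tier.
counter_line = format_metric("api.requests", 1, "c")
timer_line = format_metric("api.latency", 320, "ms")
```

Calling send_metric(counter_line) from a request handler adds very little overhead. Because UDP is connectionless, a statsd outage does not take the application down with it.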
HOSTED ALTERNATIVES
Circonus: All the insights you ever wanted
StackDriver: Optimized for AWS
http://circonus.com
http://stackdriver.com
AWS CLOUDFORMATION
Templatize your entire stack
Harder to use as complexity increases
http://aws.amazon.com/cloudwatch/
http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg
RDS FOR ANALYTICS/REPORTS
Paying OLTP prices for BI usage
Sharding will be a matter of time
http://nerds.airbnb.com/redshift-performance-cost
http://business901.com/blog1/understanding-your-customer-problem/
Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@jiboumans
http://slideshare.net/jiboumans