Tweet @jedberg with feedback!
QConSP 2013
Do you have...
• A release engineer?
• A QA department?
• Chef or Puppet to manage your systems?
Do you have...
• Upwards of 100 releases a day?
Jeremy Edberg
What is Netflix?
Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. For one low monthly price, Netflix members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen.
Source: http://ir.netflix.com
The Netflix way
• Everything is “built for three”
• Fully automated build tools to test and make packages
• Fully automated machine image bakery
• Fully automated image deployment
• Independent teams responsible for both Dev and Ops
Philosophy
Freedom and Responsibility
• We hire responsible adults and keep rules and policies to a minimum
• Developers can change any code in production at any time
• And things don’t break (usually)
• Not eXtreme Go Horse
Automate all the things!
• Application startup
• Configuration
• Code deployment
• System deployment
Automation
• Standard base image
• Tools to manage all the systems
• Automated code deployment
Shared state should be stored in a shared service
Data on an instance should be replicated to other
instances
“Build for three”
We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
Personalization Engine
User Info
Movie Metadata
Movie Ratings
Similar Movies
API
Reviews
A/B Test Engine
2B requests per day into the Netflix API
12B outbound requests per day to API dependencies
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Browse
Play
Watch
Highly aligned, loosely coupled
• Services are built by different teams who work together to figure out what each service will provide.
• The service owner publishes an API that anyone can use.
Advantages of a Service Oriented Architecture
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more easily
• Narrow down the effects of a change
• More efficient local caching
Freedom and Responsibility
• Developers deploy when they want
• They also manage their own capacity and autoscaling
• And fix anything that breaks at 4am!
Decision making
Risk to my service
Risk to Netflix
Time of Day/Week
All system choices assume some part will fail at some point.
Reliability and $$
The Monkey Theory
• Simulate things that go wrong
• Find things that are different
Execution
Photo from I, Robot, copyright 20th Century Fox
Netflix built a global PaaS
• Service Oriented Architecture
• HTTP/REST interfaces between services
Netflix PaaS features
• Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of metrics
What AWS Provides
• Instances
• Machine Images
• Elastic IPs
• Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
Linux Base AMI (CentOS or Ubuntu)
Java (JDK 6 or 7)
Tomcat
Optional Apache
Monitoring
Log Rotation to S3
Appdynamics Machine Agent
Appdynamics App Agent
monitoring
Application war file, base servlet, platform, interface
jars for dependent services
GC and thread dump logging
Healthcheck, status servlets, JMX interface,
Servo autoscale
The Netflix Platform
• Discovery (Eureka)
• Entrypoints (Edda)
• Configuration (Archaius)
• Zookeeper (Exhibitor)
• logging (Blitz4j & Honu)
• NIWS (Ribbon)
• GeoBase
• Circuit Breakers (Hystrix)
• Cassandra (Priam & Astyanax & CassJMeter)
• Cryptex
• AKMS
• EvCache
• Zuul
• i18n
• L10n
Open Source
Finding things
• Discovery (Eureka)
• Application to instance mapping
• Heartbeat to keep track of health
• Entrypoints (Edda)
• Local database of AWS resources
• NIWS (Ribbon)
• On instance software load balancer
• Handles retry logic
• Geo (Geolocation library)
• Provides IP to Lat/Lon mapping for any service that needs it.
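The discovery and on-instance load balancing ideas above can be sketched roughly as follows. This is a minimal in-memory illustration, not the Eureka or Ribbon API: the names `Registry`, `heartbeat`, and `call_with_retry`, and the 30-second expiry, are assumptions made for the example.

```python
import itertools
import time


class Registry:
    """Toy application-to-instance map with heartbeat expiry (hypothetical)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return instances of app whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            inst for (a, inst), seen in self._last_seen.items()
            if a == app and now - seen <= self.ttl
        )


def call_with_retry(registry, app, send, max_attempts=3):
    """Round-robin over live instances, moving to the next one on failure."""
    live = registry.instances(app)
    if not live:
        raise RuntimeError(f"no live instances for {app}")
    rotation = itertools.cycle(live)
    last_error = None
    for _ in range(max_attempts):
        instance = next(rotation)
        try:
            return send(instance)
        except ConnectionError as err:
            last_error = err  # try the next instance in the rotation
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

The caller never needs a hostname: it asks for an application by name, and an instance that stops sending heartbeats simply drops out of the rotation.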
Entrypoints (Edda)
• REST API
• GET /REST/v1/instance/$id
• Keeps track of all resources
• Autoscaling groups, EIPs, Instances, Applications, Clusters, History
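A client for the REST call above might look something like this. Only the path `/REST/v1/instance/$id` comes from the slide; the host in `EDDA_BASE`, the function names, and the assumption that the response body is JSON are illustrative.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

EDDA_BASE = "http://edda.example.com:8080"  # hypothetical host, not from the talk


def instance_url(instance_id, base=EDDA_BASE):
    """Build the URL for GET /REST/v1/instance/$id."""
    return f"{base}/REST/v1/instance/{quote(instance_id, safe='')}"


def fetch_instance(instance_id, base=EDDA_BASE, opener=urlopen):
    """Fetch one instance record and decode it (assumes a JSON body)."""
    with opener(instance_url(instance_id, base)) as response:
        return json.load(response)
```

Passing `opener` explicitly keeps the sketch testable without a live service.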
Entrypoints Exploration
Find all active instances
all()
Find all instances in a group
%(cloudmonkey)
How many instances are not in an autoscale group?
count(all(),-info(eval(INSTANCES;asg())))
Which ELB contains a particular instance?
filter(TYPE;asg;*(i-4a12d3b9))
Keeping it all straight
• Configuration (Archaius)
• Global variables (Fast properties)
• Base
• Base system. Prod vs. Test, etc.
• Zookeeper (Curator)
• Locks, other similar coordination
• logging (Blitz4j and Honu)
• Keep track of what happened and store it for post analysis.
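One way to picture "fast properties" is a layered lookup in which a runtime override wins over the environment (prod vs. test) value, which wins over the default. The sketch below is an assumption about the general idea, not the Archaius API; `FastProperties` and its methods are hypothetical names.

```python
class FastProperties:
    """Toy layered configuration: runtime overrides beat env values beat defaults."""

    def __init__(self, defaults, by_env, env):
        # Layers are searched in order, so earlier layers take priority.
        self._layers = [{}, dict(by_env.get(env, {})), dict(defaults)]

    def set_override(self, key, value):
        self._layers[0][key] = value  # takes effect on the next read

    def clear_override(self, key):
        self._layers[0].pop(key, None)

    def get(self, key, fallback=None):
        for layer in self._layers:
            if key in layer:
                return layer[key]
        return fallback
```

Because every read goes through `get`, a changed value is picked up without redeploying or restarting the application.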
Keeping it secure
• Cryptex
• Service for key management
• High, medium and low value keys
• AKMS (Amazon Key Management System)
• Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance
Key Management
• Cryptex service provides keys
• Low value: Cookie encryption keys
• Med value: Device activation keys
• High value: Credit card encryption
Cryptex
• Pass in encrypted string, get decrypted string out
• Decryption is in a different place depending on value of key
• Always try to design for lowest value key
Translating it
• i18n (Internationalization)
• Make it easy to translate things from one language to another
• L10n (Localization)
• The library that actually does the translations
Storing it
• Cassandra (Priam, Astyanax)
• Configure and access Cassandra
• Provide OO abstractions; handle connection pooling and discovery of hosts
• EVCache (Eccentric Volatile Cache)
• Wrapper for memcached to handle zone awareness and replication
• Proxies
• Get data out of the datacenter and into the cloud.
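The zone awareness and replication that the EVCache bullet describes can be illustrated with a toy wrapper. Plain dictionaries stand in for the memcached pools here, and `ZoneAwareCache` with its write-to-all, read-local-first policy is an assumption for the example rather than EVCache's actual behavior.

```python
class ZoneAwareCache:
    """Toy stand-in for one memcached pool per zone (plain dicts here)."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        self._pools = {zone: {} for zone in zones}

    def set(self, key, value):
        # Replicate every write to all zones so any zone can serve reads.
        for pool in self._pools.values():
            pool[key] = value

    def get(self, key):
        # Prefer the local zone; fall back to the other zones on a miss.
        ordered = [self.local_zone] + [z for z in self._pools if z != self.local_zone]
        for zone in ordered:
            if key in self._pools[zone]:
                return self._pools[zone][key]
        return None

    def drop_zone(self, zone):
        """Simulate losing every cache node in one zone."""
        self._pools[zone].clear()
```

With this layout, losing a whole zone costs a slower read rather than a cache miss storm against the backing store.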
Data
What do we do with it all?
We store it!
• Cache (memcached)
• Cassandra
• RDS (MySQL)
Cassandra
Why Cassandra?
• Availability over consistency
• Writes over reads
• We know Java
• Open source + support
Cassandra Benefits
• Fast writes
• Fast negative lookups
• Easy incremental scalability
• Distributed -- No SPoF
Things we store in Cassandra
• Video Quality
• Network issues
• Usage History
• Playback Errors
• A/B Tests
A/B Testing
Online Data: Test Cell allocation, Test Metadata, Start/End date, UI Directives
Offline Data: Test tracking, Retention, Fraction Viewed, Pages Viewed
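Test cell allocation has to be stable, so that a member sees the same experience on every visit. One common way to get that is to hash the member and test identifiers; the talk does not say how Netflix allocates cells, so `allocate_cell` and the hashing scheme below are assumptions.

```python
import hashlib


def allocate_cell(customer_id, test_id, num_cells):
    """Deterministically map a customer to one of num_cells test cells."""
    if num_cells < 1:
        raise ValueError("num_cells must be positive")
    digest = hashlib.sha256(f"{test_id}:{customer_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a cell index.
    return int.from_bytes(digest[:8], "big") % num_cells
```

Including the test identifier in the hash means the same member can land in different cells for different tests.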
Using Cassandra at Netflix
• Priam
• Zero touch auto-config
• State management
• Token assignment
• Node replacement
• Backup/restore to/from S3
• Astyanax
• OO abstraction to Cassandra
• Multi-region support
Cassandra Architecture
For more info, see DAT202: Optimizing your Cassandra Database on AWS
Tools
• Asgard
• AWS usage
• Atlas
• Chronos
• Build system
• Explorers (Cassandra and SimpleDB)
Deploying Code; Step 1
Auto Scaling Group
Launch Configuration
Security Group
Amazon Machine Image
Instances
Configuration
Elastic Load Balancer
api-usprod-v007
api-frontend
api-usprod-v008
Netflix has moved the granularity from the
instance to the cluster
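The slides show `api-usprod-v007` and `api-usprod-v008` side by side behind `api-frontend`, which suggests a push where a whole new versioned group replaces the old one. A rough sketch of that idea follows; the three-digit `-vNNN` suffix is read off the slide, while the function names and the listed steps are assumptions, not Asgard's implementation.

```python
import re

_VERSION_SUFFIX = re.compile(r"^(?P<cluster>.+)-v(?P<version>\d{3})$")


def next_group_name(current):
    """Return the name of the next versioned group in the same cluster."""
    match = _VERSION_SUFFIX.match(current)
    if match is None:
        raise ValueError(f"no -vNNN suffix in {current!r}")
    version = int(match.group("version")) + 1
    return f"{match.group('cluster')}-v{version:03d}"


def red_black_plan(current):
    """List the steps of a push that keeps the old group around for rollback."""
    new = next_group_name(current)
    return [
        f"launch {new} from the newly baked image",
        f"wait until {new} passes health checks",
        f"send traffic to {new} and stop sending it to {current}",
        f"keep {current} idle for fast rollback, then delete it",
    ]
```

Rolling back in this model means pointing traffic at the old group again, with no instance being patched in place.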
Why Bake?
Traditional (Generic AMI → Instance):
• launch OS
• install packages
• install app
Netflix (App AMI → Instance):
• launch OS+app
Getting Baked
[Diagram: Jenkins build pipeline]
• sync: source from Perforce / Git
• resolve: libraries via Ivy from Artifactory (snapshot / release libraries / apps)
• compile: Ant targets
• build: Groovy all over
• test, report, publish
• Output: app bundles
Base Image Baking
[Diagram: the Bakery, running on ec2 slave instances, mounts the foundation AMI (Linux: CentOS, Fedora, Ubuntu) from S3 / EBS, installs RPMs (Apache, Java...) from Yum / Apt, and snapshots the result as the base AMI, ready for app bake.]
App Image Baking
[Diagram: the Bakery, running on ec2 slave instances, mounts the base AMI (Linux, Apache, Java, Tomcat) from S3 / EBS, installs the app bundle from Jenkins / Yum / Artifactory, and snapshots the result as the app AMI, ready to launch.]
Linux Base AMI (CentOS or Ubuntu)
Java (JDK 6 or 7)
Tomcat
Optional Apache
Monitoring
Log Rotation to S3
Appdynamics Machine Agent
Appdynamics App Agent
monitoring
Application war file, base servlet, platform, interface
jars for dependent services
GC and thread dump logging
Healthcheck, status servlets, JMX interface,
Servo autoscale
Linux Base AMI (CentOS or Ubuntu)
Java (JDK 6 or 7)
JBoss
Optional Apache
Monitoring
Log Rotation to S3
Appdynamics Machine Agent
Appdynamics App Agent
monitoring
Application war file, base servlet, platform, interface
jars for dependent services
GC and thread dump logging
Healthcheck, status servlets, JMX interface,
Servo autoscale
Linux Base AMI (CentOS or Ubuntu)
Python
Django
Optional Apache
Monitoring
Log Rotation to S3
Appdynamics Machine Agent
monitoring
Application file, base server, platform, interface
libs for dependent services
logging
The Monkey Theory
• Simulate things that go wrong
• Find things that are different
The simian army
• Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit violations
• Security -- Finds security issues and expiring certificates
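The first monkey on the list, the one that kills random instances, can be sketched in a few lines. This is an illustration of the idea only, not the Simian Army code: `pick_victims`, the per-group probability, and the rule of sparing single-instance groups are all assumptions.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Choose at most one instance per group to terminate.

    groups maps a group name to its list of instance IDs.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in sorted(groups.items()):
        # Skip groups of one so the sketch never removes a service entirely.
        if len(instances) > 1 and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims
```

A service that is "built for three" should see no user-visible effect when one of its instances is picked.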
What’s going on?!
Atlas
Example Alert Config
{
  "clusters": [
    "epic-aggregator",
    "epic-aggregator_dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin-NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin-NumDropped-Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": ?,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service-key-override": "?????",
        "require-instance-status-not-in": ["DOWN", "OUT-OF-SERVICE"],
        "email-override": "devnull@netflix.com"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin-MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": ?,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident-key": "${metricName}:${instanceId}",
        "service-key-override": "?????",
        "email-override": "devnull@netflix.com"
      },
      "severity": "minor"
    }
  ]
}
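The alert config on this slide nests condition types such as StaticThreshold and NumOccurrences. The slide does not define their semantics, so the evaluator below is a guess: it assumes StaticThreshold is true when a sample exceeds `max`, and NumOccurrences is true when the inner condition holds for at least `num` of the samples it is given. All function names are hypothetical.

```python
def static_threshold(max_value):
    """Condition that is true when a single sample exceeds max_value."""
    return lambda sample: sample > max_value


def num_occurrences(num, inner):
    """Condition that is true when inner holds for at least num samples."""
    return lambda samples: sum(1 for s in samples if inner(s)) >= num


def build_condition(spec):
    """Turn a nested {"type": ...} dict into a callable (assumed semantics)."""
    kind = spec["type"]
    if kind == "StaticThreshold":
        return static_threshold(spec["max"])
    if kind == "NumOccurrences":
        return num_occurrences(spec["num"], build_condition(spec["condition"]))
    raise ValueError(f"unknown condition type: {kind}")
```

Wrapping a threshold in an occurrence count is what keeps a single noisy sample on one instance from paging anyone.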
Alert Tuning
Alert Systems
[Diagram components: alerting, api, CORE Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, Atlas, AppDynamics]
Chronos
Best Practices
Incident Reviews
Ask the key questions:
• What went wrong?
• How could we have detected it sooner?
• How could we have prevented it?
• How can we prevent this class of problem in the future?
• How can we improve our behavior for next time?
Best Practices for Data
• Have multiple copies of all data
• Keep those copies in multiple AZs
• Avoid keeping state on a single instance
• Take frequent snapshots of EBS disks
• No secret keys on the instance
Circuit Breakers (Hystrix)
Be liberal in what you accept, strict in what you send
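The circuit breaker pattern itself fits in a short sketch. This is a generic illustration, not the Hystrix API: `CircuitBreaker`, its thresholds, and the fallback callable are assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Toy breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling func
            self._opened_at = None  # cooldown over: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            return fallback()
        self._failures = 0
        return result
```

While the breaker is open the struggling dependency gets no traffic at all, which gives it room to recover while callers are served the fallback.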
Netflix autoscaling
[Chart annotations: Traffic Peak, Deployment]
AWS Usage
Dollar amounts have been carefully removed
Going multi-zone
Benefits of Amazon’s Zones
• Loosely connected
• Low latency between zones
• 99.95% uptime guarantee per region
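To put the 99.95% figure in concrete terms, the arithmetic below converts an availability into allowed downtime and shows what redundancy would buy if failures were fully independent. That independence is a strong assumption made only for illustration, and the function names are hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability):
    """Minutes of downtime per 30-day month permitted at a given availability."""
    return (1.0 - availability) * MINUTES_PER_30_DAY_MONTH


def combined_availability(single, copies):
    """Availability of `copies` independent replicas when any one suffices."""
    return 1.0 - (1.0 - single) ** copies
```

At 99.95%, `allowed_downtime_minutes(0.9995)` comes to about 21.6 minutes per month. Real failures are often correlated, so the combined figure is an upper bound rather than a prediction.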
Going Multi-region
Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
Multi-Region Challenges
• Data replication
• Cache invalidation
• Misdirected users
• Sudden load increase during failover
• When do you fail over?
Data Replication
Cache Replication
• Three strategies available to users:
• No replication
• Invalidation only
• Full copy
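The three strategies differ only in what a write in one region does to the cache in another. A minimal sketch, with a dictionary standing in for the remote region's cache and hypothetical names throughout:

```python
from enum import Enum


class Replication(Enum):
    NONE = "no replication"
    INVALIDATE = "invalidation only"
    FULL_COPY = "full copy"


def replicate_write(strategy, key, value, remote_cache):
    """Apply one local cache write to a remote region's cache (a dict here)."""
    if strategy is Replication.NONE:
        return  # remote region keeps whatever it had
    if strategy is Replication.INVALIDATE:
        remote_cache.pop(key, None)  # force the remote side to refetch
    elif strategy is Replication.FULL_COPY:
        remote_cache[key] = value
```

Invalidation sends only keys across regions, while a full copy sends values too: more bandwidth in exchange for a warm cache after failover.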
Traffic Routing and Failover
• Need to scale up and not get overwhelmed
• Don’t want to suddenly give a bad experience to people
• Make sure that misrouted users are sent “home”
• Can’t failover at first sign of trouble, need to strike a balance
Coming soon...
• We’re in the testing phases now
• Expect to see more info and a tech blog post in the future
Just a quick reminder...
(Some of) Netflix is open source:
https://github.com/netflix
Netflix is hiring
http://jobs.netflix.com/jobs.html
Please don’t forget to vote!
Voting is how we know what to present to you next time. :)
Questions?
Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg