Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013
DESCRIPTION
If we start with the need to make the business more agile and responsive to opportunities and competitive threats, a big component of the time taken is the development and delivery of web services. Cloud Native architecture delivers speed, scalability and security through automation of continuously delivered single-function micro-services with a denormalized NoSQL back end. In the case of Netflix, the streaming service is deployed globally using Cassandra to provide cross-zone and cross-region replication. NetflixOSS is a set of open source components that anyone can use to help them adopt Cloud Native architectures, and there is even a prize for the best open source contributions to NetflixOSS at http://netflix.github.com
TRANSCRIPT
Cloud Native at Netflix: What Changed?
July 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Cloud Native
Netflix Architecture
NetflixOSS
Cloud Native
What is it? Why?
Engineers
Solve hard problems
Build amazing and complex things
Fix things when they break
Strive for perfection
Perfect code
Perfect hardware
Perfectly operated
But perfection takes too long…
Compromises… Time to market vs. Quality
Utopia remains out of reach
Where time to market wins big
Making a land-grab
Disrupting competitors (OODA)
Anything delivered as web services
Observe
Orient
Decide
Act
Land grab opportunity
Competitive move
Customer Pain Point
Analysis
Get buy-in
Plan response
Commit resources
Implement
Deliver
Engage customers
Research alternatives
BIG DATA
INNOVATION
CULTURE
CLOUD
Measure customers
Colonel Boyd, USAF
“Get inside your adversaries'
OODA loop to disorient them”
How Soon?
Code features in days instead of months
Get hardware in minutes instead of weeks
Incident response in seconds instead of hours
A new engineering challenge
Construct a highly agile and highly available service from ephemeral and
assumed broken components
Inspiration
How to get to Cloud Native
Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
Re-Org!
Four Transitions
• Management: Integrated Roles in a Single Organization– Business, Development, Operations -> BusDevOps
• Developers: Denormalized Data – NoSQL– Decentralized, scalable, available, polyglot
• Responsibility from Ops to Dev: Continuous Delivery– Decentralized small daily production updates
• Responsibility from Ops to Dev: Agile Infrastructure - Cloud– Hardware in minutes, provisioned directly by developers
Netflix BusDevOps Organization
Chief Product Officer
VP Product Management
Directors Product
VP UI Engineering
Directors Development
Developers + DevOps
UI Data Sources
AWS
VP Discovery Engineering
Directors Development
Developers + DevOps
Discovery Data Sources
AWS
VP Platform
Directors Platform
Developers + DevOps
Platform Data Sources
AWS
Denormalized, independently updated and scaled data
Cloud, self service updated & scaled infrastructure
Code, independently updated continuous delivery
Decentralized Deployment
Asgard Developer Portal
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
Push
Autoscale Up
Autoscale Down
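The push/autoscale cycle above can be sketched as a simple control loop. The thresholds, step sizes, and function name below are illustrative assumptions for this talk, not Netflix's actual configuration; in practice this logic lives in AWS Auto Scaling policies driven by CloudWatch metrics.

```python
def autoscale_decision(current_instances, avg_load_pct,
                       scale_up_at=60, scale_down_at=30,
                       min_size=3, max_size=500):
    """Return the desired instance count for one evaluation cycle.

    Illustrative sketch of an autoscaling policy: scale up aggressively
    when load is high, scale down conservatively when load is low.
    """
    if avg_load_pct > scale_up_at:
        # Scale up: add 10% capacity (at least one instance).
        desired = current_instances + max(1, current_instances // 10)
    elif avg_load_pct < scale_down_at:
        # Scale down slowly: shed 5% capacity (at least one instance).
        desired = current_instances - max(1, current_instances // 20)
    else:
        desired = current_instances
    # Clamp to the configured group bounds.
    return min(max_size, max(min_size, desired))
```

Scaling down slowly and clamping to a minimum size keeps enough headroom that a traffic spike or a zone failure does not catch the fleet undersized.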
Netflix Streaming
A Cloud Native Application based on an open source platform
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and
Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Streaming bandwidth, Nov 2012 to March 2013: mean bandwidth +39% in 6 months
Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group choosers (for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache Replicas
Zone A
Cassandra and Evcache Replicas
Zone B
Cassandra and Evcache Replicas
Zone C
Load Balancers
Chaos Gorilla
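Chaos Gorilla tests the zone-balanced design by simulating the loss of an entire availability zone. The idea can be sketched as a toy simulation; the function names and data shapes below are illustrative assumptions, not the real Simian Army code:

```python
import random

def chaos_gorilla(instances_by_zone, rng=random):
    """Simulate losing one whole availability zone.

    `instances_by_zone` maps zone name -> list of instance ids. Returns
    the surviving fleet and the zone that was taken down. Toy sketch of
    the idea, not the actual Simian Army implementation.
    """
    victim = rng.choice(sorted(instances_by_zone))
    survivors = {z: ids for z, ids in instances_by_zone.items() if z != victim}
    return survivors, victim

def is_balanced(instances_by_zone):
    """Balanced deployments keep roughly equal capacity per zone, so
    losing any single zone leaves ~2/3 of capacity serving traffic."""
    counts = [len(v) for v in instances_by_zone.values()]
    return max(counts) - min(counts) <= 1
```

Running this regularly in production (game days) is what turns "the system should survive a zone outage" into a verified property rather than a hope.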
Isolated Regions
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Cross Region Use Cases
• Geographic Isolation– US to Europe replication of subscriber data– Read intensive, low update rate– Production use since late 2011
• Redundancy for regional failover– US East to US West replication of everything– Includes write intensive data, high update rate– Testing now
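In Cassandra, both use cases are configured per keyspace by giving each datacenter (region) its own replication factor; with three replicas per region, one copy lands in each availability zone. The keyspace name and factors below are illustrative, not Netflix's actual schema:

```sql
-- Sketch: replicate subscriber data to both regions, three replicas
-- per datacenter (one per availability zone). Names are illustrative.
CREATE KEYSPACE subscriber_data
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
  };
```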
Benchmarking Global Cassandra
Write intensive test of cross region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total
192 TB of SSD in six locations up and running Cassandra in 20 min
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-West-2 Region - Oregon
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East-1 Region - Virginia
Test Load
Test Load
Validation Load
Inter-Zone Traffic
1 million writes at CL.ONE (wait for one replica to ack)
1 million reads after 500 ms at CL.ONE with no data loss
Inter-Region Traffic: up to 9 Gbit/s, 83 ms
18 TB backups from S3
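The fleet sizing on the benchmark slide is straightforward arithmetic, shown here as a quick sanity check (an hi1.4xlarge carries 2 TB of local SSD):

```python
# Sanity-check the benchmark fleet numbers from the slide.
NODES_PER_ZONE = 16
ZONES_PER_REGION = 3
REGIONS = 2          # us-east-1 and us-west-2
SSD_TB_PER_NODE = 2  # hi1.4xlarge: 2 x 1 TB of SSD

total_nodes = NODES_PER_ZONE * ZONES_PER_REGION * REGIONS
total_ssd_tb = total_nodes * SSD_TB_PER_NODE

print(total_nodes)   # 96 nodes across six availability zones
print(total_ssd_tb)  # 192 TB of SSD
```

The point of the slide is less the arithmetic than the elapsed time: 96 nodes holding 192 TB of SSD were provisioned and running Cassandra in about 20 minutes.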
Managing Multi-Region Availability
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
UltraDNS
DynECT DNS
AWS Route53
Denominator – manage traffic via multiple DNS providers with Java code
2013 Timeline – Concept Jan, Code Feb, OSS March, Production use May
Denominator
Incidents – Impact and Mitigation
PR: X incidents – Public Relations media impact
CS: XX incidents – high customer service call volume
Metrics impact (feature disabled): XXX incidents – affects A/B test results
No impact (fast retry or automated failover): XXXX incidents
Y incidents mitigated by Active Active, game day practicing
YY incidents mitigated by better tools and practices
YYY incidents mitigated by better data tagging
Cloud Security
Automated attack surface monitoring
Crypto key store management (CloudHSM)
Scale to resist DDoS attacks
http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
What Changed?
“Impossible” deployments are easy
Jointly building code with vendors in public
Highly available and secure despite scale and speed
The DIY Question
Why doesn’t Netflix build and run its own cloud?
Fitting Into Public Scale
Public / Grey Area / Private
1,000 Instances to 100,000 Instances
Startups, Netflix, Facebook
How big is Public?
AWS upper bound estimate based on the number of public IP addresses
Every provisioned instance gets a public IP by default (some VPC instances don’t)
AWS Maximum Possible Instance Count: 4.2 Million – May 2013
Growth >10x in Three Years, >2x Per Annum – http://bit.ly/awsiprange
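The estimate works by summing the sizes of AWS's published EC2 address ranges: since every classic EC2 instance gets a public IP, the total address space is an upper bound on concurrently running instances. A minimal sketch using Python's `ipaddress` module; the CIDR blocks in the test are made-up examples, not AWS's actual ranges (those are behind the bit.ly link above):

```python
import ipaddress

def max_instances(cidr_blocks):
    """Upper bound on concurrently provisionable classic EC2 instances.

    Each instance consumes one public IP, so summing the announced
    address ranges bounds the instance count from above.
    """
    return sum(ipaddress.ip_network(block).num_addresses
               for block in cidr_blocks)
```

It is deliberately an upper bound: reserved addresses, unused headroom, and VPC instances without public IPs all push the real count lower.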
A Cloud Native Open Source Platform
See netflix.github.com
Establish our solutions as Best Practices / Standards
Hire, Retain and Engage Top Engineers
Build up Netflix Technology Brand
Benefit from a shared ecosystem
Goals
Example Application – RSS Reader
ZUUL
Zuul Traffic Processing and Routing
Ice – Detailed AWS “Chargeback”
http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
Boosting the @NetflixOSS Ecosystem
See netflix.github.com
More Use Cases
More Features
Better portability
Higher availability
Easier to deploy
Contributions from end users
Contributions from vendors
What’s Coming Next?
Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Released June in V3.3
Offering $10K prize for integration work
Vendor and end user interest
Openstack “Heat” getting there
Paypal C3 Console based on Asgard
Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering a cloud native ecosystem
Rapid Evolution – Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
Slideshare.net/Netflix Details
• Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games
– http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
• Lightning Talks March S1E2 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
• Lightning Talks Feb S1E1 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1 – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Security Architecture – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix
What Changed?
Speed wins, Cloud Native helps you get there
NetflixOSS makes it easier for everyone to become Cloud Native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS