Scaling the Netflix API - OSCON
DESCRIPTION
In engineering, the term "scale" is often used to discuss systems and their ability to grow with the needs of their users. This is clearly an important aspect of scaling, but there are many other areas in which an engineering organization needs to scale to be successful in the long term. This presentation discusses some of those other areas and details how Netflix (and specifically the API team) addresses them.
TRANSCRIPT
Scaling the Netflix API
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Please read the notes associated with each slide for
the full context of the presentation
What do I mean by “scale”?
But There Are Many Ways to Scale!
Organization
Systems
Devices
Development
Testing
But first, some background…
Global Streaming Video for TV Shows and Movies
More than 36 Million Subscribers
More than 40 Countries
Netflix Accounts for 33% of Peak Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
Netflix REST API: One-Size-Fits-All (OSFA) Solution
Image courtesy of Jay Mac 3 on Flickr
Netflix API Requests by Audience at Launch in 2008
External Developers
Netflix API Requests by Audience from 2011
External Developers
Scaling…
Organization
Systems
Devices
Development
Testing
Distributed Architecture
1000+ Device Types
Dozens of Dependencies
[Diagram: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
[Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
http://www.slideshare.net/reed2001/culture-1798664
Scaling…
Organization
Systems
Devices
Development
Testing
System Resiliency
Distributed Architecture
Dependency Relationships
2,000,000,000 Requests Per Day to the Netflix API
30 Distinct, Direct Dependent Services for the Netflix API
14,000,000,000 Netflix API Calls Per Day to those Dependent Services
0 Dependent Services with 100% SLA
(99.99%)^30 ≈ 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month
(99.9%)^30 ≈ 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime Per Month
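The arithmetic on these slides can be reproduced with a short sketch. It assumes that a request succeeds only if all 30 dependencies succeed, that failures are independent, and that a month has 30 days; the function names are illustrative and not from the presentation. The unrounded results land slightly below the slide's rounded figures.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def compound_availability(per_service: float, dependencies: int) -> float:
    """Availability of a request that needs every dependency to succeed.

    Assumes failures are independent, so the probabilities multiply.
    """
    return per_service ** dependencies


def summarize(per_service: float, dependencies: int = 30,
              requests_per_day: int = 2_000_000_000) -> dict:
    """Return overall availability, failed requests per day, and downtime."""
    availability = compound_availability(per_service, dependencies)
    failure_rate = 1.0 - availability
    return {
        "availability": availability,
        "failures_per_day": failure_rate * requests_per_day,
        "downtime_hours_per_month": failure_rate * HOURS_PER_MONTH,
    }


for sla in (0.9999, 0.999):
    result = summarize(sla)
    print(f"{sla:.2%} per service: "
          f"{result['availability']:.2%} overall, "
          f"{result['failures_per_day'] / 1e6:.1f}M failures per day, "
          f"{result['downtime_hours_per_month']:.1f} hours down per month")
```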
[Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
Circuit Breaker Dashboard
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Successful, But Slower Than Expected
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Thread Pool & Task Queue Full, Delivering Fallbacks
Exceptions, Delivering Fallbacks
Error Rate: (# + # + # + #) / (# + # + # + # + #) = Error Rate
Status of Fallback Circuit
Requests per Second, Over Last 10 Seconds
SLA Information
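The dashboard legend above suggests how a circuit breaker classifies each call. The following is a minimal sketch of that idea, not the Netflix implementation. It reads the error-rate formula as the four fallback outcomes divided by those four plus successes, which is an interpretation of the slide. The class name, thresholds, and outcome keys are hypothetical, and the sketch omits the rolling 10-second window, thread-pool rejection, and any recovery of a tripped circuit.

```python
from collections import Counter

# Outcome keys that count as errors (hypothetical names mirroring the legend).
ERROR_OUTCOMES = ("short_circuited", "timeout", "rejected", "exception")


class CircuitBreaker:
    """Counts call outcomes and short-circuits once the error rate is too high."""

    def __init__(self, error_threshold=0.5, min_requests=5):
        self.error_threshold = error_threshold  # trip at or above this rate
        self.min_requests = min_requests        # do not trip on tiny samples
        self.counts = Counter()
        self.open = False                       # open circuit = calls are skipped

    def error_rate(self):
        errors = sum(self.counts[key] for key in ERROR_OUTCOMES)
        total = errors + self.counts["success"]
        return errors / total if total else 0.0

    def call(self, command, fallback):
        """Run command, or deliver the fallback if it fails or the circuit is open."""
        if self.open:
            self.counts["short_circuited"] += 1
            return fallback()
        try:
            outcome = command()
        except TimeoutError:
            self.counts["timeout"] += 1
            outcome = fallback()
        except Exception:
            self.counts["exception"] += 1
            outcome = fallback()
        else:
            self.counts["success"] += 1
        self._maybe_trip()
        return outcome

    def _maybe_trip(self):
        total = sum(self.counts.values())
        if total >= self.min_requests and self.error_rate() >= self.error_threshold:
            self.open = True
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function that calls a dependent service, so that a failing dependency yields a degraded response rather than an error.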
[Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]
[Diagram: the API and its dependencies (Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine), with a Fallback]
System Infrastructure
AWS Cloud
Autoscaling
More than 36 Million Subscribers
More than 40 Countries
Zuul: Gatekeeper for the Netflix Streaming Application
Zuul
• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication
Isthmus
Forced Failure
Scaling…
Organization
Systems
Devices
Development
Testing
Screen Real Estate
Controller
Technical Capabilities
One-Size-Fits-All API
[Diagram: many separate Requests sent to the One-Size-Fits-All API]
Scaling…
Organization
Systems
Devices
Development
Testing
Courtesy of South Florida Classical Review
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people
[Diagram: the REST (OSFA) API sits behind the network border, in front of Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, and Ratings]
[Diagram: the same OSFA API layout, split at the network border]
Server code: data gathering, formatting, and delivery
Client code: user interface rendering
Experience-Based Requests
• /ps3/homescreen
[Diagram: the Java API sits behind the network border, in front of Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, and Ratings]
[Diagram: a Groovy layer sits between the network border and the Java API, which fronts Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-up, and Ratings]
Server code
Client code
Client adapter code (written by client teams, dynamically uploaded to server)
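The experience-based model can be sketched as a registry of per-device adapters that run on the server and turn one request such as /ps3/homescreen into several backend calls. This is an illustrative Python sketch only: the slides describe Groovy adapters on top of a Java API, and every function name, field name, and return value below is a hypothetical stand-in.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the backend services behind the server-side API.
def get_recommendations(user_id: str) -> list:
    return ["title-1", "title-2"]


def get_ratings(user_id: str) -> dict:
    return {"title-1": 4, "title-2": 5}


def get_movie_data(title_id: str) -> dict:
    return {"id": title_id, "name": f"Movie {title_id}"}


SERVICES = {
    "recommendations": get_recommendations,
    "ratings": get_ratings,
    "movie_data": get_movie_data,
}

# Registry of adapters, keyed by request path. Client teams add entries at runtime.
ADAPTERS: Dict[str, Callable] = {}


def register_adapter(path: str, adapter: Callable) -> None:
    """Stands in for a client team uploading adapter code to the server."""
    ADAPTERS[path] = adapter


def handle(path: str, user_id: str) -> dict:
    """Dispatch one device request to the adapter registered for its path."""
    adapter = ADAPTERS.get(path)
    if adapter is None:
        raise KeyError(f"no adapter registered for {path}")
    return adapter(SERVICES, user_id)


def ps3_homescreen(services: dict, user_id: str) -> dict:
    """Gather and format everything one screen needs in a single response."""
    ratings = services["ratings"](user_id)
    rows = []
    for title_id in services["recommendations"](user_id):
        movie = services["movie_data"](title_id)
        rows.append({"title": movie["name"], "rating": ratings.get(title_id)})
    return {"device": "ps3", "rows": rows}


register_adapter("/ps3/homescreen", ps3_homescreen)
```

With this split, the device makes one call across the network border, while the fan-out to individual services happens on the server side.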
[Diagram: the same Groovy layer and Java API layout, labeled by responsibility]
Data gathering
Data formatting and delivery
User interface rendering
Scaling…
Organization
Systems
Devices
Development
Testing
Dependency Relationships
Testing Philosophy:
Act Fast, React Fast
That Doesn’t Mean We Don’t Test
• Unit tests
• Functional tests
• Regression scripts
• Continuous integration
• Capacity planning
• Load / Performance tests
Cloud-Based Deployment Techniques
[Diagram: API requests from the Internet flow to the current code in production]
Canary Analysis Automation
Single Canary Instance to Test New Code with Production Traffic (around 1% or less of traffic)
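A canary setup of this kind can be sketched in two parts: routing a small share of traffic to the canary instance, and comparing its error rate with the baseline. This is a minimal sketch under stated assumptions; the function names, the 1% default, and the tolerance value are hypothetical, and the analysis automation described in the talk is likely to compare many more metrics than error rate.

```python
import random


def choose_target(canary_fraction: float = 0.01, rng=random) -> str:
    """Route roughly canary_fraction of requests to the canary instance."""
    return "canary" if rng.random() < canary_fraction else "baseline"


def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   tolerance: float = 0.005) -> str:
    """Compare error rates; the canary may exceed the baseline by `tolerance`."""
    if baseline_total == 0 or canary_total == 0:
        return "insufficient data"
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return "pass" if canary_rate <= baseline_rate + tolerance else "fail"
```

A "fail" verdict corresponds to the "Error!" outcome on the following slides, where the canary is taken out of service, and a "pass" corresponds to "Perfect!", where the rollout continues.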
Error!
Perfect!
Stress Test with Zuul
[Diagram: the current code in production receives API requests from the Internet while the new code is getting prepared for production]
Error!
Perfect!
[Diagram: API requests from the Internet are routed to the new code]
https://www.github.com/Netflix
Help Wanted!