Netflix Cloud Architecture and Open Source


Post on 07-Jan-2017






Netflix Architecture and Open Source
Andrew Spyker
Senior Software Engineer, Netflix

About Netflix
69M members
2000+ employees (1400 tech)
80+ countries
> 100M hours watched per day
> NA internet download traffic
500+ microservices
Many 10s of thousands of VMs
3 regions across the world

About the Speaker
Cloud platform technologies
  Distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar
Container cloud
  Resource management and scheduling, making Docker containers operational in Amazon EC2/ECS
Open source
  Organize @NetflixOSS meetups and internal group
Performance
  Assist across Netflix, but focused mainly on cloud platform performance

With Netflix for ~ 1 year. Previously at IBM here in Raleigh/Durham (RTP)


Agenda
NetflixOSS
Netflix Cloud Architecture
Getting started

Why does Netflix open source?
Allows engineers to gather feedback
  Openly talk, through code, on our approach
  Collaboration on key projects with the world
Happily use proven outside open source
  And improve it for Netflix scale and availability
Netflix culture of freedom and responsibility
  Want to open source? Go for it, be responsible!
Recruiting and retention
  Candidates know exactly what they can work on
  NetflixOSS engineers choose to stay at Netflix

NetflixOSS is widely used
The architecture has shaped public cloud usage
  Immutability, red/black deploys, chaos, regional and worldwide high availability

Offerings
  Pivotal Spring Cloud

Large usage
  IBM Watson as a Service (on IBM Cloud)
  Nike Digital is hiring NetflixOSS experts

Interesting usage
  To help locate new troves of data claiming to be the files stolen from AshleyMadison, the company's forensics team has been using a tool that Netflix released last year called Scumblr

NetflixOSS Website Relaunch

Key aspects of NetflixOSS website
Show how the pieces fit together
  Projects now discussed with each other in context

OSS categories mirror internal teams
  No artificial categories, focal points for each area

Focus on projects that are core to Netflix
  Projects mentioned are core and strategic

Agenda
NetflixOSS
Netflix Cloud Architecture
Getting started

Elastic, Web and Hyper Scale
Doing this

Not doing that

Elastic, Web and Hyper Scale
Diagram labels: load balancers, front end, API, another microservice, temporal caching, durable storage

Strategy | Benefit
Make deployments automated | Without automation, impossible
Expose well-designed API to users | Offloads presentation complexity to clients
Remove state for mid-tier services | Allows easy elastic scale-out
Push temporal state to client and caching tier | Leverages clients, avoids data tier overload
Use partitioned data storage | Data design and storage scale with HA
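The "use partitioned data storage" strategy can be illustrated with a minimal sketch. The node names and the `partition_for` helper are hypothetical and not taken from any NetflixOSS project. It uses a stable hash modulo the node count, which is simpler than the consistent hashing that production data stores typically use.

```python
import hashlib

# Hypothetical storage nodes; the names are illustrative only.
NODES = ["storage-a", "storage-b", "storage-c"]


def partition_for(key, nodes=NODES):
    """Map a key to a node with a stable hash (not Python's per-process hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def group_by_partition(keys, nodes=NODES):
    """Group keys by the node that owns them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[partition_for(key, nodes)].append(key)
    return placement
```

Because every mid-tier instance computes the same owner for a key, no instance needs to hold state of its own, which is what lets the mid tier scale out elastically.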




HA and Automatic Recovery
Feeling this
Not feeling that


Highly Available Service Runtime Recipe
Diagram labels: microservice implementation, call microservice #2, execute call, Hystrix, Ribbon REST client with Eureka, Microservice #1 (REST services), app service, Microservice #2, Karyon, Eureka server(s)
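The idea behind a REST client that is aware of a service registry is client-side load balancing: the caller looks up the live instances of the target service and spreads calls across them, moving to another instance when one fails. The sketch below is a generic illustration of that idea, not the Ribbon or Eureka API. `Registry`, `RoundRobinClient`, the service name, and the addresses are all hypothetical.

```python
import itertools


class Registry:
    """Hypothetical in-memory stand-in for a discovery service."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def instances(self, service):
        return list(self._instances.get(service, []))


class RoundRobinClient:
    """Rotates across registered instances and retries on connection errors."""

    def __init__(self, registry, service, max_attempts=3):
        self.registry = registry
        self.service = service
        self.max_attempts = max_attempts
        self._counter = itertools.count()

    def call(self, send):
        """Run send(address) on successive instances until one succeeds."""
        last_error = None
        for _ in range(self.max_attempts):
            instances = self.registry.instances(self.service)
            if not instances:
                raise RuntimeError("no instances registered for " + self.service)
            address = instances[next(self._counter) % len(instances)]
            try:
                return send(address)
            except ConnectionError as err:
                last_error = err  # transient failure: try the next instance
        raise last_error
```

The instance list is re-read on every attempt, so an instance that drops out of the registry between attempts is no longer picked.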


Implementation Detail | Benefits
Decompose into microservices | Key user path always available; failure does not propagate across service boundaries
Karyon w/ automatic Eureka registration | New instances are quickly found; failing individual instances disappear
Ribbon client with Eureka awareness | Load balances and retries across instances with smarts; handles temporal instance failure
Hystrix as dependency circuit breaker | Allows for fast failure; provides graceful cross-service degradation/recovery
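The circuit breaker pattern behind "fast failure" and "graceful degradation" can be sketched in a few lines. This is a generic illustration and not the Hystrix API. The class name, thresholds, and injectable clock are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the cool-down the breaker is half-open: one trial call may pass.
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func, fallback):
        if self.is_open():
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open the caller gets the fallback immediately, so a slow or dead dependency does not tie up the caller's threads, and the failure stays on one side of the service boundary.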


IaaS High Availability
Diagram labels: region (us-east-1), us-east-1e, us-east-1c, ELBs, Eureka, web app, Service1, Service2, cluster auto recovery and scaling services (Auto Scaling Groups)

Rule | Why?
Always > 2 of everything | 1 is a SPOF; 2 doesn't web scale and makes DR recovery slow
Including IaaS and cloud services | You're only as strong as your weakest dependency
Use auto scaler/recovery monitoring | Clusters guarantee availability and service latency
Use application-level health checks | Instance on the network != healthy
Worldwide availability | Data replication, global front-end routing, cross-region traffic
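Two of these rules lend themselves to a small sketch: an application-level health check (an instance is healthy only if its own dependency checks pass, not merely because it answers on the network) and a cluster floor of more than two healthy instances. The function names, the check names, and the `MIN_HEALTHY` value of 3 are assumptions chosen for illustration.

```python
MIN_HEALTHY = 3  # reading "always > 2 of everything" as a floor of three


def instance_healthy(checks):
    """checks maps a check name to a zero-argument callable returning bool.

    A check that raises counts as failed. Returns (overall, per_check).
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results


def instances_to_launch(health_by_instance, min_healthy=MIN_HEALTHY):
    """How many replacements are needed to get back to the healthy floor."""
    healthy = sum(1 for ok in health_by_instance.values() if ok)
    return max(0, min_healthy - healthy)
```

In this sketch an auto scaler or recovery monitor would call `instances_to_launch` with the latest health results and start that many replacements.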


A truly global service
Replicate data across regions
Be able to redirect traffic from region to region
Be able to migrate regional traffic to other regions
Have automated control across regions
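Migrating regional traffic can be pictured as recomputing routing weights so that the surviving regions absorb a failed region's share. The sketch below is a simplified model, not Netflix's traffic tooling. The `evacuate` function is hypothetical, and the region names other than us-east-1 are illustrative.

```python
def evacuate(weights, failed_region):
    """Return new routing weights with failed_region drained.

    The drained share is spread over the other regions in proportion to
    their current weights, so the result still sums to 1.
    """
    survivors_total = sum(w for r, w in weights.items() if r != failed_region)
    if survivors_total <= 0:
        raise ValueError("no healthy region left to absorb traffic")
    return {
        region: 0.0 if region == failed_region else weight / survivors_total
        for region, weight in weights.items()
    }
```

A real shift also depends on the data already being replicated to the receiving regions and on those regions having the capacity to take the extra load.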

Flux Demo

Testing is the only way to prove HA
Chaos Monkey
  Kills instances in production, runs regularly
Chaos Gorilla
  Kills availability zones (single datacenter)
  Testing for split brain is also important
Chaos Kong
  Kills an entire region and shifts traffic globally
  Run frequently but with prior scheduling
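The instance-level case can be sketched as a selection routine that picks a random victim per cluster on each run. This is an illustration of the idea only and not the Chaos Monkey implementation. The `pick_victims` function, the per-run probability, and the rule of skipping single-instance clusters are assumptions.

```python
import random


def pick_victims(groups, probability, rng):
    """Choose at most one instance to terminate per cluster.

    groups maps a cluster name to a list of instance ids.
    rng is a random.Random, passed in so runs can be made repeatable.
    """
    victims = {}
    for name, instances in sorted(groups.items()):
        if len(instances) < 2:
            continue  # assumption: never take down a cluster's only instance
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims
```

For example, `pick_victims(groups, 0.2, random.Random())` would select a victim in roughly one cluster out of five on each run.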

Continuous Delivery

Reading this
Not this



Continuous Delivery
Diagram labels: Cluster v1, Canary v2, Cluster v2, continuous build server, baked to images (AMIs)

Step | Technology
Developers test locally | Unit test frameworks
Continuous build | Continuous build server based on Gradle builds
Build bakes full instance image | Aminator and deployment pipeline bake images from build artifacts
Developers work across dev and test | Archaius allows for environment-based context
Developers do canary tests, red/black deployments in prod | Asgard console provides app cluster common devops approach, security patterns, and visibility
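Canary testing and red/black deployment can be sketched as two small pieces: a comparison of the canary's error rate against the baseline cluster, and a switch that keeps the old cluster on standby for rollback. This is a simplified model and not the Asgard or Spinnaker implementation. The names, the error-rate metric, and the 1% tolerance are assumptions.

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests, tolerance=0.01):
    """True if the canary error rate is within tolerance of the baseline."""
    if baseline_requests == 0 or canary_requests == 0:
        return False  # not enough traffic to judge
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate + tolerance


class RedBlack:
    """Tracks which cluster takes traffic and which is kept for rollback."""

    def __init__(self, active):
        self.active = active
        self.standby = None

    def deploy(self, new_cluster, healthy):
        if not healthy:
            return self.active  # keep serving from the old cluster
        self.standby, self.active = self.active, new_cluster
        return self.active

    def rollback(self):
        if self.standby is None:
            raise RuntimeError("no previous cluster to roll back to")
        self.active, self.standby = self.standby, self.active
        return self.active
```

Because images are baked whole, the old cluster is an unchanged copy of the previous version, so rolling back means sending traffic to it again rather than rebuilding anything.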

From Asgard to Spinnaker
Spinnaker is our CI/CD solution
  CI/CD solution including baking and Jenkins integration
  Workflow engine for continuous delivery
  Pipeline-based deployment including baking
  Global visibility across all of our AWS regions
  Provides an API-first design
  A microservices runtime HA architecture
  More flexible cloud model so the community can contribute back improvements not related to AWS

Asgard continues to work side-by-side
Spinnaker is the new end-to-end CI/CD tool

Spinnaker Examples

Works at Netflix scale
Views of global pipelines
From simple Asgard-like deployment to advanced CI/CD pipelines

Operational Visibility
If you can't see it, you can't improve it


Operational Visibility

Diagram labels: Microservice #1, Microservice #2, Servo/Spectator, Hystrix/Turbine, external uptime monitoring, metric/event repositories, Logstash/Elasticsearch/Kibana

Visibility Point | Technology
Basic IaaS instance monitoring | Not enough (not scalable, not app specific)
User-like external monitoring | SaaS offerings or OSS like Uptime
Targeted performance, sampling | Vector performance and app-level metrics
Service-to-service interconnects | Hystrix streams, Turbine aggregation, Hystrix dashboard
Application-centric metrics | Servo/Spectator gauges, counters, timers sent to a metrics store like Atlas
Remote logging | Logstash/Kibana or similar log aggregation and analysis frameworks
Threshold monitoring and alerts | Services like Atlas and PagerDuty for incident management
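Application-centric metrics come in three basic shapes: counters, gauges, and timers. The sketch below shows how they differ using a small in-process registry. It is not the Servo or Spectator API. The `Metrics` class and the metric names are hypothetical, and a real setup would publish the snapshot to a metrics store instead of returning it.

```python
import time
from collections import defaultdict


class Metrics:
    """Minimal in-process registry for counters, gauges, and timers."""

    def __init__(self, clock=time.monotonic):
        self.counters = defaultdict(int)  # values that only go up
        self.gauges = {}                  # last observed value
        self.timers = defaultdict(list)   # recorded durations in seconds
        self.clock = clock

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def time(self, name, func):
        """Run func and record how long it took, even if it raises."""
        start = self.clock()
        try:
            return func()
        finally:
            self.timers[name].append(self.clock() - start)

    def snapshot(self):
        """Summarize current values, e.g. for sending to a metrics store."""
        return {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers": {
                name: {"count": len(values), "total_seconds": sum(values)}
                for name, values in self.timers.items()
            },
        }
```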





Security
Dynamic Security

Done in new ways
NOT

Dynamic, Web Scale & Simpler Security
Security Monkey
  Monitors security policies, tracks changes, alerts on situations
Scumblr
  Searches the internet for security nuggets (credentials, hacking discussions)
Sketchy
  A safe way to collect text and screenshots from websites
FIDO
  Automated event detection, analysis, enrichment, and enforcement
Sleepy Puppy
  Delayed cross-site scripting propagation testing framework
Lemur
  x.509 certificate orchestration framework

What did we not cover?
Over 50 GitHub projects
  NetflixOSS is technical indigestion as a service

Big Data, Data Persistence and UI Engineering
Big data tools used well beyond Netflix
Ephemeral, semi and fully persistent data systems
Recent addition of UI OSS and Falcor

Agenda
NetflixOSS
Netflix Cloud Architecture
Getting started

How do I get started?
All of the previous slides show NetflixOSS components
  Code: http://netflix.github.io
  Announcements:

Want to get running a bit faster?

ZeroToCloud
Workshop for getting started with build/bake/deploy in Amazon EC2

ZeroToDocker
Docker images containing running Netflix technologies (not production ready, but easy to understand)

ZeroToDocker Demo
Diagram labels: Mac OS X, VirtualBox, Ubuntu 14.04 (single kernel), Container #1 (filesystem + process), Eureka container, Zuul container, another container...

Docker running instances
Single kernel
Contained processes
Zookeeper and Exhibitor
A microservices app and surrounding NetflixOSS services (Zuul to Karyon with Eureka)