netflix and containers - titus

Netflix and Containers Titus Overview, January 2016 Andrew Spyker Cloud Platform Engineer

Upload: aspyker

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Netflix and Containers - Titus

Netflix and Containers
Titus Overview, January 2016

Andrew Spyker
Cloud Platform Engineer

Page 2: Netflix and Containers - Titus

About Netflix

● 75M+ members
● #NetflixEverywhere (Worldwide)
● 42.5B hours watched 2015
● > ⅓ NA internet download traffic
● 1000’s Microservices
● Many 10’s of thousands VM’s
● 3 regions across the world
● 2000+ employees

Page 3: Netflix and Containers - Titus

About me

● Cloud platform technologies
  ○ Distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar
● Container cloud
  ○ Resource management and scheduling, making Docker containers operational in Amazon EC2/ECS
● Open Source
  ○ Organize @NetflixOSS meetups & internal group
● Performance
  ○ Assist across Netflix, but focused mainly on cloud platform perf

With Netflix for ~1 year. Previously at IBM.

@aspyker

ispyker.blogspot.com

Page 4: Netflix and Containers - Titus

Team members

@aspyker
@amit_joshee
Andrew Leung
@podila
Andrei Ushakov
@williamthurston
@timbozarth
@dzapata

Page 5: Netflix and Containers - Titus

Agenda

● Why Containers for Netflix?
● Container runtime platform
● Container development experience

Page 6: Netflix and Containers - Titus

Why containers operationally?

Case 1: I have a job I want to run reliably and efficiently, but I don’t want to manage clusters myself

Case 2: I have lots of services and I want to reduce the number of VM’s I need to manage while keeping isolation between process instances

Page 7: Netflix and Containers - Titus

History - Project Titan

● Container management system
  ○ Predominantly batch processing system
● Higher level frameworks drive tasks
  ○ General workflow engine
  ○ DAG-based data processing
  ○ Misc reports, big data processing stages, interactive notebooks
● Tech
  ○ Rudimentary scheduling with Dynamo storage
  ○ Proven Docker execution environment
  ○ Using Mesos and Fenzo

Page 8: Netflix and Containers - Titus

History - Project Mantis

● Real time operational intelligence for streaming experience
  ○ Ad hoc and perpetual stream processing
● Tech
  ○ Proven scheduling with C* storage
  ○ Mantis fatjars deployed in cgroups
  ○ Using Mesos and Fenzo

Page 9: Netflix and Containers - Titus

Fenzo overview

● A generic, plugin-based scheduling library for Apache Mesos frameworks
● Features
  ○ Heterogeneous resources matched with varied tasks
  ○ Autoscaling of underlying cluster
  ○ Plugins for constraints and fitness
  ○ Support for fast (ms) scheduling rate
  ○ Visibility of scheduling actions

github.com/Netflix/Fenzo

Page 10: Netflix and Containers - Titus

Fenzo: fitness, constraints plugins

Fitness value (0.0 - 1.0)
● Degree of fitness - first fit, best fit, worst fit
  ○ Real world tradeoff between perfection and speed
● Composable evaluators
● e.g., bin packing

Constraints
● Hard constraints filter appropriate resources
● Soft constraints specify preferences
● e.g., zone balancing, instance type preferences
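The fitness and constraint ideas above can be sketched in a few lines. This is not Fenzo's API (Fenzo is a Java library); it is a minimal Python illustration, and every name in it (`Host`, `Task`, `has_capacity`, `cpu_bin_packing`, `prefer_zone`, `combine`, `pick_host`) is hypothetical. It assumes a hard constraint is a yes/no filter, a fitness evaluator returns a value from 0.0 to 1.0, a soft constraint can be modeled as one more fitness evaluator, and evaluators are composed by averaging. The zone preference is a simplified stand-in for zone balancing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Host:
    name: str
    zone: str
    total_cpus: float
    used_cpus: float = 0.0


@dataclass
class Task:
    name: str
    cpus: float


# Hypothetical plugin shapes: a hard constraint filters hosts,
# a fitness evaluator scores a host between 0.0 and 1.0.
HardConstraint = Callable[[Task, Host], bool]
Fitness = Callable[[Task, Host], float]


def has_capacity(task: Task, host: Host) -> bool:
    """Hard constraint: the host must have enough free CPU for the task."""
    return host.total_cpus - host.used_cpus >= task.cpus


def cpu_bin_packing(task: Task, host: Host) -> float:
    """Fitness: fraction of host CPU in use after placement (fuller is fitter)."""
    return (host.used_cpus + task.cpus) / host.total_cpus


def prefer_zone(zone: str) -> Fitness:
    """Soft constraint as a fitness: 1.0 in the preferred zone, 0.5 elsewhere."""
    return lambda task, host: 1.0 if host.zone == zone else 0.5


def combine(evaluators: List[Fitness]) -> Fitness:
    """Compose evaluators by averaging their scores (an assumed policy)."""
    return lambda task, host: sum(e(task, host) for e in evaluators) / len(evaluators)


def pick_host(task: Task, hosts: List[Host], hard: List[HardConstraint],
              fitness: Fitness, good_enough: float = 1.0) -> Optional[Host]:
    """Return the fittest eligible host, stopping early at a 'good enough' score."""
    best, best_score = None, -1.0
    for host in hosts:
        if not all(constraint(task, host) for constraint in hard):
            continue  # hard constraints remove the host from consideration
        score = fitness(task, host)
        if score > best_score:
            best, best_score = host, score
        if score >= good_enough:
            break  # trade perfection for speed
    return best


hosts = [
    Host("a", "zone-1", total_cpus=32, used_cpus=8),
    Host("b", "zone-2", total_cpus=32, used_cpus=20),
    Host("c", "zone-1", total_cpus=32, used_cpus=30),
]
task = Task("report", cpus=4)
packed = pick_host(task, hosts, [has_capacity], cpu_bin_packing)
zoned = pick_host(task, hosts, [has_capacity],
                  combine([cpu_bin_packing, prefer_zone("zone-1")]))
print(packed.name, zoned.name)
```

Lowering `good_enough` makes the search stop at the first acceptable host (closer to first fit), while leaving it at 1.0 scans every eligible host (closer to best fit).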

Page 11: Netflix and Containers - Titus

Project Titus

● Mantis (Scheduling, Job Mgmt) + Titan (Docker execution) = Titus (Andromedon)
● Titan API -> Mantis job mgmt/scheduler -> Titan executor
● Rolled out Q4 2015, took over all jobs in Jan 2016

Page 12: Netflix and Containers - Titus

Why Titus?

● Many other container management & scheduling systems, why build another?
● Key unique values
  ○ Deeply support Amazon (not trying to abstract IaaS)
  ○ Narrow focus (just container management)
  ○ Deep integration with existing Netflix systems
  ○ Complex job scheduling reqs and scale/reliability

Page 13: Netflix and Containers - Titus

Current Titus Numbers

● Autoscaling 100’s of r3.8xl’s (32 vCPU, 244G)
● Peak
  ○ thousands of cores, tens of TB’s memory
● thousands containers/day
● < 100 different images
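As a rough sanity check of these numbers, the per-instance figures can be multiplied out. The instance counts below (100 and 300) are assumptions standing in for "100’s"; the slide gives no exact fleet size. Memory is treated as decimal GB and TB for simplicity.

```python
from typing import Tuple

# Figures from the slide: one r3.8xl has 32 vCPUs and 244 GB of memory.
VCPUS_PER_INSTANCE = 32
MEMORY_GB_PER_INSTANCE = 244


def cluster_capacity(instances: int) -> Tuple[int, float]:
    """Return (total vCPUs, total memory in TB) for a number of r3.8xl instances."""
    total_vcpus = instances * VCPUS_PER_INSTANCE
    total_memory_tb = instances * MEMORY_GB_PER_INSTANCE / 1000
    return total_vcpus, total_memory_tb


# Assumed fleet sizes, chosen only to illustrate "100's of instances".
for n in (100, 300):
    vcpus, memory_tb = cluster_capacity(n)
    print(f"{n} instances: {vcpus} vCPUs, {memory_tb:.1f} TB memory")
```

Even the low end of the assumed range gives thousands of cores and tens of TB of memory, which is consistent with the peak figures on the slide.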

Page 14: Netflix and Containers - Titus

Also in containers

● Already
  ○ Long running data pipeline service style routing tier
    ■ 850 c3.4xl instances with ~10K long running containers
  ○ Mantis cgroups
    ■ 1000’s cores running varied stream processing jobs
● Soon
  ○ Media encoding (10’s of thousands of cores)
  ○ Service style (potentially VERY large)

Page 15: Netflix and Containers - Titus

Titus high level architecture

[Architecture diagram. Components shown: Titus UI, Rhea, Titus API, CI/CD, Docker Registry, Titus Master (Job Management & Scheduler, Fenzo), Cassandra, Zookeeper, S3, EC2 Autoscaling API, Mesos Master, and Mesos agents running docker, the docker executor, containers, a metrics agent, a logging agent and zfs.]

Page 16: Netflix and Containers - Titus

Titus User Console


Page 17: Netflix and Containers - Titus

Titus Spinnaker Integration

● Spinnaker is our CI/CD system

● Titus integration coming soon


Page 18: Netflix and Containers - Titus

Titus API (today)

POST http://titusapi/v2/jobs
GET http://titusapi/v2/jobs/JOBID
GET http://titusapi/v2/tasks/TASKID

[Diagram: JOB Titus-12345 with four tasks: (TaskIndex = 0, Num = 2), (TaskIndex = 1, Num = 3), (TaskIndex = 2, Num = 4), (TaskIndex = 1, Num = 5). The last task is labeled titus-12345-worker-1-5.]
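The one task name on the slide, titus-12345-worker-1-5, sits next to TaskIndex = 1 and Num = 5, which suggests a pattern of job ID, the literal "worker", task index, then task number. The sketch below parses names on that assumption. The pattern is inferred from this single example rather than from any API documentation, and `TaskName` and `parse_task_name` are hypothetical names.

```python
import re
from typing import NamedTuple


class TaskName(NamedTuple):
    job_id: str
    task_index: int
    num: int


# Assumed layout: <job id>-worker-<TaskIndex>-<Num>, inferred from the slide's example.
_TASK_NAME = re.compile(r"^(?P<job>[A-Za-z]+-\d+)-worker-(?P<index>\d+)-(?P<num>\d+)$")


def parse_task_name(name: str) -> TaskName:
    """Split a task name into job ID, task index and task number."""
    match = _TASK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized task name: {name!r}")
    return TaskName(match["job"], int(match["index"]), int(match["num"]))


parsed = parse_task_name("titus-12345-worker-1-5")
print(parsed)
```

Under this reading, two tasks can share a TaskIndex while having different Num values, as the diagram shows for TaskIndex = 1 (Num = 3 and Num = 5).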

Page 19: Netflix and Containers - Titus

Titus API (coming)

● Disparate use cases in a single API
  ○ Going beyond batch to service, stream and cron
● SLA based on job attributes
  ○ For batch, completion time
  ○ For service, user focused SLA (autoscaling, etc.)
● Ownership and cost accounting/metering
  ○ Group costs to owner and teams
● Aligned with existing continuous deployment system
  ○ Apps, clusters, asgs in Spinnaker

Page 20: Netflix and Containers - Titus

Titus Operational Views

Also API’s for
● cluster state
● cluster rolling updates
● leadership

● Titus app managed through Spinnaker

Page 21: Netflix and Containers - Titus

Dependency Versions (as of 1/16)

Docker
● Registry - 2.0.1
● Engine - 1.9.1
  ○ Plus Netflix logging driver

Mesos
● 0.24.1

Using Netflix C*, Zookeeper shared services

Page 22: Netflix and Containers - Titus

Container Agent Features (existing)

● Volumes with quota
  ○ Using ZFS with snapshots and S3 archival
● Logging
  ○ Streaming live stdout/err logs
  ○ Rotation & shipping stdout/err & app logs to S3
● Networking
  ○ IP per container integration with VPC
● Metrics
  ○ cgroup metrics tagged by job/task id and image

Page 23: Netflix and Containers - Titus

Container Agent Features (planned)

● Networking/Security
  ○ Extend driver to support security groups & IAM Roles
● Volume Drivers
  ○ Persistent volumes as required by EBS/EFS
● Isolation
  ○ Beyond CPU, Memory, Disk - Networking I/O Bandwidth
● Security
  ○ Host and container security hardening (AppArmor/SELinux)
● Insight
  ○ Performance (Vector) and adhoc debugging (ssh)

Page 24: Netflix and Containers - Titus

Unique Titus Scheduler Technology

● Job managers are separate from resource allocation
  ○ Less monolithic, more extensible
● Fenzo benefits
  ○ Bin packing, autoscaling, fitness/constraint configurability
  ○ Visibility into current state of the cluster
● Mesos reconciliation and task heartbeats
● Rate limiting of failing jobs and agents
● Thresholds and alerts for key aspects
  ○ Queue depth, idle hosts, etc
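The slide does not say how failing jobs and agents are rate limited. One common approach, shown here purely as an assumption, is exponential backoff per job or agent: each consecutive failure doubles the wait before the next launch attempt, up to a cap. The class name `FailureRateLimiter` and its parameters are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FailureRateLimiter:
    """Exponential backoff per key (a job ID or an agent hostname)."""
    base_delay_s: float = 1.0
    max_delay_s: float = 300.0
    _failures: Dict[str, int] = field(default_factory=dict)
    _not_before: Dict[str, float] = field(default_factory=dict)

    def record_failure(self, key: str, now: float) -> float:
        """Count a failure and return the delay imposed before the next launch."""
        count = self._failures.get(key, 0) + 1
        self._failures[key] = count
        delay = min(self.base_delay_s * 2 ** (count - 1), self.max_delay_s)
        self._not_before[key] = now + delay
        return delay

    def record_success(self, key: str) -> None:
        """A success clears the failure history for the key."""
        self._failures.pop(key, None)
        self._not_before.pop(key, None)

    def may_launch(self, key: str, now: float) -> bool:
        """True once the backoff window for the key has passed."""
        return now >= self._not_before.get(key, 0.0)


limiter = FailureRateLimiter()
first = limiter.record_failure("job-1", now=0.0)
second = limiter.record_failure("job-1", now=1.0)
print(first, second, limiter.may_launch("job-1", now=2.0))
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test.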

Page 25: Netflix and Containers - Titus

Integration with Netflix Infrastructure

● Goal: Make containers work with existing cloud systems (designed for virtual machines) vs. replace
● Areas
  ○ Service registration and discovery (Eureka)
  ○ IPC (Ribbon)
  ○ Continuous Delivery (Spinnaker)
  ○ Telemetry (Atlas)
  ○ Reliability (Chaos, Performance Insight)

Page 26: Netflix and Containers - Titus

Path to ECS

● Why we are considering ECS
  ○ Resource/cluster mgmt is undifferentiated heavy lifting
  ○ Expect ECS to have strong integration w/ EC2/AWS
● Have prototyped a Titus/Fenzo ECS port
  ○ Using our job mgmt/scheduling on top of ECS
● Working with the ECS team to add in
  ○ Simpler start task API (w/o defining a task first)
  ○ Event stream to power real time scheduling info
  ○ Extensibility in ECS events, resource types

Page 27: Netflix and Containers - Titus

Why containers for developers?

Case 1: I want a consistent local development and cloud deployment experience (in both directions)

Case 2: I want to specify what it means to run my process, not integrate into a one size fits most VM image

Page 28: Netflix and Containers - Titus

Developer Experience (coming)

Titus


Page 29: Netflix and Containers - Titus

Developer experience

NEWT
● One stop shop for creation, development, deployment of containers

Netflix Docker base layers
● Already integrated with runtime expectations
● Continuously rebuilt with small and controlled common support

Netflix Docker build tools
● Extend our bakery to produce Docker images and run locally
● More advanced image creation tools
  ○ Multi-inheritance, guaranteed metadata, metrics

Page 30: Netflix and Containers - Titus

We’re hiring

Come advance containers at Netflix!

Senior Software Engineer Container Platform - https://jobs.netflix.com/jobs/860487


Page 31: Netflix and Containers - Titus

Questions?
