AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration with AWS (CON313)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Andrew Spyker, Sr. Software Engineer, Netflix
December 2016
CON313
Netflix: Container Scheduling, Execution, and Integration with AWS
What to Expect from the Session
• Why containers?
• Including current use cases and scale
• How did we get there?
• Overview of our container cloud platform
• Collaboration with ECS
About Netflix
• 86.7M members
• 1000+ developers
• 190+ countries
• More than ⅓ of North American internet download traffic
• 500+ microservices
• Over 100,000 VMs
• 3 regions across the world
Why containers?
Given that our VM architecture is already …
amazingly resilient,
microservice driven,
cloud native,
CI/CD DevOps enabled,
and elastically scalable,
do we really need containers?
Our Container System Provides Innovation Velocity
• Iterative local development, deploy when ready
• Manage app and dependencies easily and completely
• Simpler way to express resources, let system manage
Innovation Velocity - Use Cases
• Media encoding - encoding research development time
• Using VMs - 1 month, using containers - 1 week
• Niagara
• Build all Netflix codebases in hours
• Saves development 100s of hours of debugging
• Edge Rearchitecture with Node.js
• Focus returns to app development
• Simplifies and speeds up testing and deployment
Why not use an existing container mgmt solution?
• Most solutions are focused on the datacenter
• Most solutions are
  • working to abstract the datacenter and go cross-cloud
  • delivering more than a cluster manager
  • not yet at our level of scale
• We wanted to leverage our existing cloud platform
• Not appropriate for Netflix
Batch
What do batch users want?
• Simple shared resources, run till done, job files
• NOT
  • EC2 instance sizes, automatic scaling, AMI/OS
• WHY
  • Offloads resource management ops; simpler
Historic use of containers
• General workflow (Meson) and stream processing (Mantis)
• Proven using Linux cgroups and Mesos
  • With simple isolation
  • Using specific packaging formats
Enter Titus
• Job Management: Batch
• Resource Management & Optimization
• Container Execution
• Integration
Sample batch use cases
• Algorithm model training (GPU usage)
  • Personalization and recommendation
  • Deep learning with neural nets / mini-batch
• Titus
  • Added g2 support using nvidia-docker-plugin
    • Mounts NVIDIA drivers and devices into Docker containers
  • Distribution of training jobs and infrastructure made self-service
  • Recently moved to p2.8xl instances
    • 2x performance improvement with the same CUDA-based code
Sample batch use cases
• Media encoding experimentation
• Digital watermarking
Sample batch use cases
• Ad hoc reporting
• Open Connect CDN reporting
Lessons learned from batch
• Docker helped generalize use cases
• Cluster automatic scaling adds efficiency
• Advanced scheduling required
• Initially ignored failures (with retries)
• Time-sensitive batch came later
Titus Batch Usage (Week of 11/7)
• Started ~ 300,000 containers during the week
• Peak of 1000 containers per minute
• Peak of 3,000 instances (mix of r3.8xls and m4.4xls)
Services
Adding Services to Titus
• Job Management: Batch, Service
• Resource Management & Optimization
• Container Execution
• Integration
Services are just long-running batches, right?
Services are more complex
Services resize constantly and run forever
• Automatic scaling
• Hard to upgrade underlying hosts
Have more state
• Ready for traffic vs. just started/stopped
• Even harder to upgrade
Existing, well-defined dev, deploy, runtime, & ops tools
Real Networking is Hard
Multi-Tenant Networking is Hard
• IP per container
• Security group support
• IAM role support
• Network bandwidth isolation
Solutions
• VPC networking driver
  • Supports ENIs - full IP functionality
  • With scheduling - security groups
  • Supports traffic control (isolation)
• EC2 metadata proxy
  • Adds container "node" identity
  • Delivers IAM roles
VPC Networking Integration with Docker
1. Titus executor / Titus networking driver: create and attach an ENI with
   - security group
   - IP address
2. Docker: launch the "pod root" container, which creates the network namespace
   - with the assigned IP address
   - using the "pause" container image
   - using --net=none
3. Titus networking driver, inside the pod root's namespace:
   - create a virtual ethernet (veth) pair
   - configure routing rules
   - configure metadata proxy iptables NAT
   - configure traffic control for bandwidth
4. Docker: create the app container with --net=container:pod_root_id
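The same pod-root pattern can be sketched with the Docker SDK for Python. This is only an illustration of the namespace-sharing idea, not how the Titus executor actually drives Docker; the pause image tag, the container name, and the nginx app image are assumptions.

```python
# Hedged sketch of the "pod root" pattern: one container owns the network
# namespace, the app container joins it. Image names are illustrative only.
import docker

client = docker.from_env()

# 1. Launch the pod-root container with no networking; the Titus networking
#    driver would then build the ENI/veth plumbing inside this namespace.
pod_root = client.containers.run(
    "gcr.io/google_containers/pause",  # a commonly used "pause" image (assumption)
    detach=True,
    network_mode="none",
    name="pod-root-example",
)

# 2. Launch the app container sharing the pod root's network namespace,
#    i.e. the equivalent of: docker run --net=container:<pod_root_id>
app = client.containers.run(
    "nginx:latest",                    # stand-in for the real application image
    detach=True,
    network_mode=f"container:{pod_root.id}",
)
```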
Metadata Proxy
Container requests to 169.254.169.254:80 are NATed by iptables over veth<id> to the Titus metadata proxy on host_ip:9999, which fronts the Amazon metadata service (169.254.169.254):
• What is my IP, instance ID, hostname? → return Titus-assigned values
• What is my AMI, instance type, etc.? → unknown
• Give me my role credentials → assume the container's role, return credentials
• Give me anything else → proxy to the real metadata service
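As a rough illustration of that decision logic (not the Titus implementation; the container IP, role ARN, and port below are made-up placeholders), a minimal Python sketch of such a proxy might look like:

```python
# Minimal sketch of a metadata-proxy shape: answer identity questions with
# container-assigned values, hide VM-specific ones, mint container-role
# credentials, and proxy everything else to the real metadata service.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

import boto3  # used only for the credentials path

CONTAINER_IP = "192.0.2.10"                                   # hypothetical assigned IP
CONTAINER_ROLE_ARN = "arn:aws:iam::123456789012:role/example-container-role"
REAL_METADATA = "http://169.254.169.254"

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("/local-ipv4"):
            self._reply(200, CONTAINER_IP)                     # Titus-assigned identity
        elif "/ami-id" in self.path or "/instance-type" in self.path:
            self._reply(404, "unknown")                        # hide VM details
        elif "/iam/security-credentials" in self.path:
            creds = boto3.client("sts").assume_role(
                RoleArn=CONTAINER_ROLE_ARN,
                RoleSessionName="container")["Credentials"]
            self._reply(200, json.dumps({
                "AccessKeyId": creds["AccessKeyId"],
                "SecretAccessKey": creds["SecretAccessKey"],
                "Token": creds["SessionToken"]}))
        else:
            with urlopen(REAL_METADATA + self.path) as resp:   # pass everything else through
                self._reply(resp.status, resp.read().decode())

    def _reply(self, code, body):
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9999), ProxyHandler).serve_forever()
```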
Putting it all together
[Diagram: a virtual machine host with a non-routable primary IP and three ENIs (ENI1 sg=A, ENI2 sg=X, ENI3 sg=Y,Z). Four containers, each made of a pod root, an app container, and a veth<id>, are mapped to ENIs by security group (sg=A, sg=X, sg=X, sg=Y,Z). The host runs the metadata proxy, Linux policy-based routing plus traffic control, and iptables NAT for 169.254.169.254.]
Additional AWS Integrations
• Access to live log files and log files rotated to S3
• Multi-tenant resource isolation (disk)
• Environmental context
• Automatic instance type selection
• Elastic scaling of underlying resource pool
Netflix Infrastructure Integration
• Spinnaker CI/CD
• Atlas telemetry
• Discovery/IPC
• Edda (and dependent systems)
• Healthcheck, system metrics pollers
• Chaos testing
Why? Single consistent cloud platform
[Diagram: inside a VPC on EC2, the AWS Auto Scaler manages VMs running service applications, while Titus job control manages containers running service and batch applications. All of them use the same cloud platform libraries (metrics, IPC, health) and integrate with Atlas, Edda, and Eureka.]
Titus Spinnaker Integration
• Deploy based on new Docker registry tags
• Deployment strategies same as Auto Scaling groups
• IAM roles and security groups per container
• Basic resource requirements
• Easily see health check & service discovery status
Fenzo – The heart of Titus scheduling
Extensible library for scheduling frameworks
• Plugin-based scheduling objectives
• Bin packing, etc.
• Heterogeneous resources & tasks
• Cluster automatic scaling
• Multiple instance types
• Plugin-based constraints evaluator
• Resource affinity, task locality, etc.
• Single offer mode added in support of ECS
Fenzo scheduling strategy
For each task
  On each host
    Validate hard constraints
    Evaluate fitness and soft constraints
  Until fitness is "good enough", and
  a minimum number of hosts has been evaluated
(constraints and fitness evaluators are plugins)
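A minimal Python sketch of that loop follows. Fenzo itself is a Java library, so the function and parameter names here are illustrative assumptions, not Fenzo's API.

```python
# Hedged sketch of the Fenzo-style placement loop described above.
def place_task(task, hosts, hard_constraints, fitness_fns,
               good_enough=0.9, min_hosts=20):
    best_host, best_fitness = None, -1.0
    for evaluated, host in enumerate(hosts, start=1):
        # Hard constraints are pass/fail (e.g. "host has a GPU").
        if not all(c(task, host) for c in hard_constraints):
            continue
        # Fitness and soft constraints score the host between 0 and 1
        # (e.g. bin packing: prefer hosts that are already fuller).
        fitness = sum(f(task, host) for f in fitness_fns) / len(fitness_fns)
        if fitness > best_fitness:
            best_host, best_fitness = host, fitness
        # Stop early once the score is "good enough" and a minimum
        # number of hosts has been evaluated.
        if best_fitness >= good_enough and evaluated >= min_hosts:
            break
    return best_host
```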
Scheduling – Capacity Guarantees
Titus maintains (via Fenzo in the Titus master):
• Critical tier: guaranteed capacity and start latencies
• Flex tier: more dynamic capacity and variable start latency
[Diagram: agent pools per tier with desired and max sizes]
Scheduling – Bin Packing, Elastic Scaling
User adds work tasks
• Titus does bin packing to ensure that we can downscale entire hosts efficiently
[Diagram: agent Auto Scaling group with min/desired/max; emptied hosts (✖) can be terminated]
Scheduling – Constraints, including zone balancing
User specifies constraints
• Availability Zone balancing
• Resource and task affinity
• Hard and soft
[Diagram: tasks balanced across Availability Zones A and B]
Scheduling – Rolling out new Titus code
Operator updates the Titus agent codebase
• New scheduling happens on the new cluster
• Batch jobs drain
• Service tasks are migrated via Spinnaker pipelines
• Old cluster scales down
[Diagram: Auto Scaling group version 001 drains (✖) while version 002 scales up]
Current Service Usage
• Approach
  • Started with internal applications
  • Moved on to line-of-fire Node.js (shadow first, prod 1Q17)
  • Moved on to stream processing (prod 4Q)
• Current: ~2,000 long-running containers
Timeline: 1Q batch → 2Q service pre-prod → 3Q service shadow → 4Q service prod
Collaboration with ECS
Why ECS?
• Decrease operational overhead of underlying cluster state management
• Allow open source collaboration on the ECS agent
• Work with Amazon and others on EC2 enablement
  • GPUs, VPC, security groups, IAM roles, etc.
  • Over time, this enablement should result in less maintenance
Titus Today
[Diagram: the Titus scheduler, with its EC2 integration, talks to a Mesos master; each container host runs a mesos-agent and the Titus executor, which launches containers.]
Outbound
- Launch/terminate container
- Reconciliation
Inbound
- Container host events (and offers)
- Container events
First Titus ECS Implementation
[Diagram: the Titus scheduler, with its EC2 integration, talks to ECS; each container host runs the ECS agent and the Titus executor, which launches containers. ✖ marks the pain points.]
Outbound
- Launch/terminate container
- Polling for
  - Container host events
  - Container events
Collaboration with ECS team starts
• Collaboration on ECS “event stream” that could provide
• “Real time” task & container instance state changes
• Event based architecture more scalable than polling
• Great engineering collaboration
• Face-to-face focus
• Monthly interlocks
• Engineer-to-engineer focused
Current Titus ECS Implementation
[Diagram: the Titus scheduler, with its EC2 integration, talks to ECS; each container host runs the ECS agent and the Titus executor. Events now flow back through CloudWatch Events and SQS instead of polling.]
Outbound
- Launch/terminate container
- Reconciliation
Inbound (via CloudWatch Events → SQS)
- Container host events
- Container events
Analysis - Periodic Reconciliation
For tasks in listTasks:
  describeTasks (in batches of 100)
Number of API calls: 1 + (number of tasks / 100) per reconcile
Example: 1,280 containers across 40 nodes ≈ 14 calls per reconcile
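A hedged sketch of that reconciliation pattern with boto3's ECS client follows; the cluster name is a placeholder, and this is only meant to show where the listTasks + describeTasks call counts come from.

```python
# Sketch of periodic reconciliation: list all task ARNs, then describe
# them in batches of 100 (the describeTasks limit), and compare the
# result against desired state elsewhere.
import boto3

ecs = boto3.client("ecs")
CLUSTER = "titus-example"   # hypothetical cluster name

def reconcile():
    task_arns = []
    for page in ecs.get_paginator("list_tasks").paginate(cluster=CLUSTER):
        task_arns.extend(page["taskArns"])

    tasks = []
    for i in range(0, len(task_arns), 100):   # describeTasks caps at 100 ARNs per call
        resp = ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns[i:i + 100])
        tasks.extend(resp["tasks"])
    return tasks
```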
Analysis - Scheduling
• Number of API calls: 2x the number of tasks
  • registerTaskDefinition and startTask
• Largest Titus historical job: 1,000 tasks per minute
  • Possible with increased rate limits
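Roughly, each task launch maps to two ECS API calls, which a boto3 sketch makes concrete; the task family, image, resource values, and container-instance placement below are illustrative assumptions, not the Titus configuration.

```python
# Sketch of why scheduling costs ~2 API calls per task:
# one registerTaskDefinition plus one startTask.
import boto3

ecs = boto3.client("ecs")

def launch(cluster, container_instance_arn):
    task_def_arn = ecs.register_task_definition(          # call 1 of 2
        family="titus-example-job",                       # hypothetical family name
        containerDefinitions=[{
            "name": "app",
            "image": "example/app:latest",                # stand-in image
            "cpu": 256,
            "memory": 512,
        }],
    )["taskDefinition"]["taskDefinitionArn"]

    return ecs.start_task(                                # call 2 of 2
        cluster=cluster,
        taskDefinition=task_def_arn,
        containerInstances=[container_instance_arn],      # Titus does its own placement
    )
```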
Continued areas of scheduling collaboration
• Combining/batching registerTaskDefinition and startTask
• More resource types in the control plane
• Disk, network bandwidth, ENIs
• To fit with existing scheduler approach
• Extensible message fields in task state transitions
• Named tasks (beyond ARNs) for terminate
• Starting vs. started state
Possible phases of ECS support in Titus
• Work in progress
  • ECS completing scheduling collaboration items
  • Complete transition to ECS for overall cluster management
  • Allows us to contribute Netflix cloud platform and EC2 integration points to the ECS agent open source
• Future
  • Provide Fenzo as the ECS task placement service
  • Extend Titus job management features to ECS
Titus Future Focus
Future Strategy of Titus
• Service automatic scaling and global traffic integration
• Service/batch SLA management
  • Capacity guarantees, fair shares, and pre-emption
• Trough / internal spot market management
• Exposing pods to users
• More use cases and scale
Questions?
Andrew Spyker (@aspyker)
Thank you!
Remember to complete
your evaluations!