Google confidential │ Do not distribute
SCALE 13x
Container Management at Google Scale
Tim Hockin <[email protected]>
Senior Staff Software Engineer
@thockin
Old Way: Shared machines
[Diagram: several apps on one machine, all sharing one set of libs and one kernel]
• No isolation
• No namespacing
• Common libs
• Highly coupled apps and OS
Old Way: Virtual machines
[Diagram: several VMs, each with its own kernel, libs, and apps, running on a host kernel]
• Some isolation
• Expensive and inefficient
• Still highly coupled to the guest OS
• Hard to manage
New Way: Containers
[Diagram: each app is packaged with its own libs, and all of them share one kernel]
But what ARE they?
Lightweight VMs
  • no guest OS, lower overhead than VMs, but no virtualization hardware
Better packages
  • no DLL hell
Hermetically sealed static binaries
  • no external dependencies
Provide Isolation (from each other and from the host)
  • Resources (CPU, RAM, Disk, etc.)
  • Users
  • Filesystem
  • Network
How?
Implemented by a number of (unrelated) Linux APIs:
• cgroups: Restrict the resources a process can consume
  • CPU, memory, disk IO, ...
• namespaces: Change a process’s view of the system
  • Network interfaces, PIDs, users, mounts, ...
• capabilities: Limit what a process is privileged to do
  • mount, kill, chown, ...
• chroots: Determine what parts of the filesystem a process can see
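As a rough illustration of the cgroups bullet, the sketch below turns a container's CPU and memory limits into cgroup v1 control-file settings. It only computes paths and values and writes nothing. The mount point `/sys/fs/cgroup`, the container name, the 1024-shares-per-core convention, and the 100 ms CFS period are assumptions made for the example, not the actual logic of LMCTFY or Docker. Namespaces, capabilities, and chroots are not modeled.

```python
from dataclasses import dataclass

CGROUP_ROOT = "/sys/fs/cgroup"  # assumed cgroup v1 mount point
CFS_PERIOD_US = 100_000         # assumed CFS scheduling period (100 ms)
SHARES_PER_CORE = 1024          # common convention: 1024 cpu.shares per core


@dataclass(frozen=True)
class Limits:
    cpu_cores: float  # e.g. 2.5
    memory_mb: int    # e.g. 4096


def cgroup_settings(name: str, limits: Limits) -> dict[str, str]:
    """Translate a container's limits into cgroup v1 control files and values.

    Returns {path: value}. Writing each value into its file (as root, after
    creating the directories) would apply the limits; nothing is written here.
    """
    cpu_dir = f"{CGROUP_ROOT}/cpu/{name}"
    mem_dir = f"{CGROUP_ROOT}/memory/{name}"
    return {
        # Relative weight used when CPUs are contended.
        f"{cpu_dir}/cpu.shares": str(round(limits.cpu_cores * SHARES_PER_CORE)),
        # Hard cap: at most `quota` microseconds of CPU time per `period`.
        f"{cpu_dir}/cpu.cfs_period_us": str(CFS_PERIOD_US),
        f"{cpu_dir}/cpu.cfs_quota_us": str(round(limits.cpu_cores * CFS_PERIOD_US)),
        # Hard memory limit; if usage hits it and cannot be reclaimed,
        # the kernel OOM-kills a process inside this cgroup.
        f"{mem_dir}/memory.limit_in_bytes": str(limits.memory_mb * 1024 * 1024),
    }


for path, value in cgroup_settings("web", Limits(cpu_cores=2.5, memory_mb=4096)).items():
    print(path, value)
```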
Google has been developing and using containers to manage our applications for over 10 years.
Images by Connie Zhou
Everything at Google runs in containers:
• Gmail, Web Search, Maps, ...
• MapReduce, batch, ...
• GFS, Colossus, ...
• Even GCE itself: VMs in containers
We launch over 2 billion containers per week.
Why containers?
• Performance
• Repeatability
• Isolation
• Quality of service
• Accounting
• Visibility
• Portability
A fundamentally different way of managing applications
Docker
[Figure: Google Trends chart for “Docker”]
Source: Google Trends
But what IS Docker?
An implementation of the container idea
A package format
An ecosystem
A company
An open-source juggernaut
A phenomenon
Hoorah! The world is starting to adopt containers!
LMCTFY
Also an implementation of the container idea (from Google)
Also open-source
Literally the same code that Google uses internally
“Let Me Contain That For You”
Probably NOT what you want to use!
Docker vs. LMCTFY
Docker is primarily about namespacing: control what you can see
  • resource and performance isolation were afterthoughts
LMCTFY is primarily about performance isolation: jobs cannot hurt each other
  • namespacing was an afterthought
Docker focused on making things simple and self-contained
  • “sealed” images, a repository of pre-built images, simple tooling
LMCTFY focused on solving the isolation problem very thoroughly
  • totally ignored images and tooling
About isolation
Principles:
• Apps must not be able to affect each other’s perf
  • if they can, it is an isolation failure
• Repeated runs of the same app should see ~equal perf
• Graduated QoS drives resource decisions in real time
• Correct in all cases, optimal in some
  • reduce unreliable components
• SLOs are the lingua franca
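Strong isolation
[Figure: a node with 4 CPU cores (y-axis, 0 to 4) and 8192 MB of memory (x-axis, 0 to 8192) carved into fixed slices of RAM=4GB CPU=2.5, RAM=2GB CPU=1.0, and RAM=1GB CPU=0.5; the remaining RAM=1GB is stranded because no CPU is left to go with it]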
Strong isolation
Pros:
  • Sharing: users don’t worry about interference (aka the noisy neighbor problem)
  • Predictable: allows us to offer strong SLAs to apps
Cons:
  • Stranding: arbitrary slices mean some resources get lost
  • Confusing: how do I know how much I need?
    • analog: what size VM should I use?
    • smart auto-scaling is needed!
  • Expensive: you pay for certainty
In reality this is a multi-dimensional bin-packing problem: CPU, memory, disk space, IO bandwidth, network bandwidth, ...
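A minimal sketch of the stranding problem, assuming just two dimensions (CPU in millicores, memory in MB) and a naive first-fit placement. Real schedulers score placements far more carefully, and `stranded()` here only flags the simple case where one dimension is fully used up. Applied to the slide's node and slices, it reproduces the stranded 1 GB of RAM.

```python
from typing import NamedTuple, Optional


class Res(NamedTuple):
    """A 2-D resource vector; CPU is in millicores to avoid float rounding."""
    millicores: int
    memory_mb: int

    def fits_in(self, free: "Res") -> bool:
        return self.millicores <= free.millicores and self.memory_mb <= free.memory_mb

    def minus(self, other: "Res") -> "Res":
        return Res(self.millicores - other.millicores, self.memory_mb - other.memory_mb)


def first_fit(nodes: list[Res], requests: list[Res]) -> tuple[list[Optional[int]], list[Res]]:
    """Place each request on the first node with room in *every* dimension.

    Returns (node index per request, or None if it fits nowhere; free capacity per node).
    """
    free = list(nodes)
    placement: list[Optional[int]] = []
    for req in requests:
        for i, f in enumerate(free):
            if req.fits_in(f):
                free[i] = f.minus(req)
                placement.append(i)
                break
        else:  # no node had room in all dimensions
            placement.append(None)
    return placement, free


def stranded(free: Res) -> Res:
    """Free capacity that is unusable because the other dimension is exhausted."""
    if free.millicores == 0:
        return Res(0, free.memory_mb)
    if free.memory_mb == 0:
        return Res(free.millicores, 0)
    return Res(0, 0)


# The slide's node (4 cores, 8192 MB) and its three fixed slices.
node = Res(4000, 8192)
slices = [Res(2500, 4096), Res(1000, 2048), Res(500, 1024)]
placement, free = first_fit([node], slices)
print(placement, stranded(free[0]))  # [0, 0, 0] Res(millicores=0, memory_mb=1024)
```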
A dose of reality
[Figure: the same node with the kernel (OS) taking CPU and memory off the top; the slices RAM=4GB CPU=2.5, RAM=2GB CPU=1.0, and RAM=1GB CPU=0.5 no longer fit, so the node is over-committed]
[Figure: the same node with both the kernel (OS) and system daemons (Sys) reserved; only the RAM=4GB CPU=2.5 and RAM=2GB CPU=1.0 slices remain]
The kernel itself uses some resources “off the top”
  • We can estimate it statistically, but we can’t really limit it
System daemons (e.g. our node agent) use some resources
  • We can (and do) limit these, but the failure modes are not always great
If ANYONE is uncontained, then all SLOs are void. We pretend that the kernel is contained, but only because we have no real choice. Experience shows this holds up most of the time. Hold this thought for later...
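The sketch below shows one way to account for the "off the top" usage before booking slices: estimate kernel memory from observed samples with a high percentile, subtract a limited reservation for system daemons, and check whether the bookings still fit. All reservation sizes and samples are made-up numbers, and the nearest-rank percentile is just one possible statistical estimate.

```python
import math
from typing import NamedTuple


class Res(NamedTuple):
    millicores: int
    memory_mb: int


def percentile(samples: list[int], p: float) -> int:
    """Nearest-rank percentile; a high p gives a conservative estimate."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def allocatable(capacity: Res, kernel: Res, system: Res) -> Res:
    """Capacity left for containers after the 'off the top' reservations."""
    return Res(capacity.millicores - kernel.millicores - system.millicores,
               capacity.memory_mb - kernel.memory_mb - system.memory_mb)


def overcommitted(alloc: Res, bookings: list[Res]) -> bool:
    """True if the bookings exceed what is allocatable in any dimension."""
    cpu = sum(b.millicores for b in bookings)
    mem = sum(b.memory_mb for b in bookings)
    return cpu > alloc.millicores or mem > alloc.memory_mb


# Hypothetical numbers: kernel memory observed at these levels (MB), and a
# system-daemon reservation that is enforced with its own limit.
kernel_mem_samples = [300, 320, 350, 410, 480]
kernel = Res(250, percentile(kernel_mem_samples, 99))  # estimated, not enforced
system = Res(250, 512)                                  # limited
alloc = allocatable(Res(4000, 8192), kernel, system)

slices = [Res(2500, 4096), Res(1000, 2048), Res(500, 1024)]
print(alloc)                             # Res(millicores=3500, memory_mb=7200)
print(overcommitted(alloc, slices))      # True
print(overcommitted(alloc, slices[:2]))  # False
```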
Results
Overall this works VERY well for latency-sensitive serving jobs
Shortcomings:
  • There are still some things that cannot be easily contained in real time
    • e.g. cache (see CPI2)
  • Some resource dimensions are really hard to schedule
    • e.g. disk IO: so little of it, so bursty, and SO SLOW
  • Low utilization: nobody uses 100% of what they request
  • Not well tuned for compute-heavy work (e.g. batch)
  • Users don’t really know how much CPU/RAM/etc. to request
Usage vs bookings
[Figure: bookings vs. actual usage on a node; x-axis Memory (MB), 0 to 8192; y-axis CPU (cores), 0 to 4]
Making better use of it all
Proposition: Re-sell unused resources with lower SLOs
  • Perfect for batch work
  • Probabilistically “good enough”
Shortcomings:
  • Even more emphasis on isolation failures
    • we can’t let batch hurt “paying” customers
  • Requires a lot of smarts in the lowest parts of the stack
    • e.g. deterministic OOM killing by priority
    • we have a number of kernel patches we want to mainline, but we have had a hard time getting the upstream kernel on board
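Re-selling unused resources and killing by priority can be sketched as two small functions. Everything here is hypothetical: the `Container` fields, the 20% safety margin, and the victim ordering (lowest priority first, then largest usage, then name) are assumptions for illustration, not Google's kernel patches or policy. The point is only that the choice of victim can be made fully deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str
    priority: int    # higher is more important (e.g. serving > batch)
    booked_mb: int   # memory the container reserved
    used_mb: int     # memory it is actually using right now


def reclaimable_mb(serving: list[Container], margin_pct: int = 20) -> int:
    """Booked-but-unused memory that could be re-sold to lower-SLO batch work.

    Keeps `margin_pct` of headroom above current usage, so a serving job can
    grow a little before batch work has to be evicted.
    """
    total = 0
    for c in serving:
        headroom = c.used_mb * (100 + margin_pct) // 100
        total += max(0, c.booked_mb - headroom)
    return total


def oom_victims(containers: list[Container], capacity_mb: int) -> list[str]:
    """Deterministically choose containers to kill until usage fits capacity.

    Lowest priority dies first; within a priority the biggest user goes first,
    and the name breaks any remaining tie, so the same input always yields
    the same victims.
    """
    used = sum(c.used_mb for c in containers)
    victims: list[str] = []
    for c in sorted(containers, key=lambda x: (x.priority, -x.used_mb, x.name)):
        if used <= capacity_mb:
            break
        victims.append(c.name)
        used -= c.used_mb
    return victims


serving = [Container("web", 100, 4096, 1500), Container("api", 100, 2048, 1000)]
batch = [Container("batch-a", 0, 1200, 1200), Container("batch-b", 0, 900, 900)]
print(reclaimable_mb(serving))             # 3144
print(oom_victims(serving + batch, 4096))  # ['batch-a']
```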
Usage vs bookings
[Figure: the same node, with batch containers packed into the capacity that was booked but not used]
Back to Docker
Container isolation today:
• ...does not handle most of this
• ...is fundamentally voluntary
• ...is an obvious area for improvement in the coming year(s)
More than just isolation
Scheduling: Where should my job be run?
Lifecycle: Keep my job running
Discovery: Where is my job now?
Constituency: Who is part of my job?
Scale-up: Making my jobs bigger or smaller
Auth{n,z}: Who can do things to my job?
Monitoring: What’s happening with my job?
Health: How is my job feeling?
...
Enter Kubernetes
Greek for “Helmsman”; also the root of the word “Governor”
• Container orchestrator
• Runs Docker containers
• Supports multiple cloud and bare-metal environments
• Inspired and informed by Google’s experiences and internal systems
• Open source, written in Go
Manage applications, not machines
Design principles
Declarative > imperative: State your desired results, let the system actuate
Control loops: Observe, rectify, repeat
Simple > Complex: Try to do as little as possible
Modularity: Components, interfaces, & plugins
Legacy compatible: Requiring apps to change is a non-starter
Network-centric: IP addresses are cheap
No grouping: Labels are the only groups
Cattle > Pets: Manage your workload in bulk
Open > Closed: Open Source, standards, REST, JSON, etc.
Pets vs. Cattle
High level design
[Diagram: users reach the master through the CLI, API, and UI; the master runs the apiserver and the scheduler; the apiserver talks to a kubelet on each node]
Primary concepts
Container: A sealed application package (Docker)
Pod: A small group of tightly coupled Containers
  • example: content syncer & web server
Controller: A loop that drives current state towards desired state
  • example: replication controller
Service: A set of running pods that work together
  • example: load-balanced backends
Labels: Identifying metadata attached to other objects
  • example: phase=canary vs. phase=prod
Selector: A query against labels, producing a set result
  • example: all pods where label phase == prod
Pods
Small group of containers & volumes
Tightly coupled
The atom of cluster scheduling & placement
Shared namespace
  • share IP address & localhost
Ephemeral
  • can die and be replaced
Example: data puller & web server
[Diagram: a Pod holding a File Puller and a Web Server that share a Volume; the File Puller pulls from a Content Manager and the Web Server serves Consumers]
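Docker networking
[Figure: three nodes on 10.1.1.0/24, 10.1.2.0/24, and 10.1.3.0/24; containers on every node get private addresses from the same range (172.16.1.1 and 172.16.1.2 on the first node, 172.16.1.1 on each of the others), so traffic between nodes has to pass through NAT]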
Pod networking
Pod IPs are routable
  • Docker default is private IP
Pods can reach each other without NAT
  • even across nodes
No brokering of port numbers
This is a fundamental requirement
  • several SDN solutions
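Pod networking
[Figure: the same three nodes, but each pod gets a routable IP from its node's subnet: 10.1.1.93 and 10.1.1.113 on 10.1.1.0/24, 10.1.2.118 on 10.1.2.0/24, and 10.1.3.129 on 10.1.3.0/24]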
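A toy model of the pod-networking requirement, assuming each node owns one /24 (as in the figure) and reserving `.1` on each subnet for the node's bridge. `PodNetwork` is a hypothetical class, not a Kubernetes API; it only shows that with routable pod IPs, delivering a packet reduces to finding which node owns the destination subnet.

```python
import ipaddress
from collections.abc import Iterator


class PodNetwork:
    """Hypothetical flat pod network: each node owns one subnet, every pod IP
    is routable, and delivery only needs to know which node owns the
    destination subnet. No NAT and no port brokering are involved."""

    def __init__(self) -> None:
        self._subnets: dict[str, ipaddress.IPv4Network] = {}
        self._free: dict[str, Iterator[ipaddress.IPv4Address]] = {}

    def add_node(self, node: str, cidr: str) -> None:
        subnet = ipaddress.IPv4Network(cidr)
        hosts = subnet.hosts()
        next(hosts)  # assumption: .1 is reserved for the node's bridge/gateway
        self._subnets[node] = subnet
        self._free[node] = hosts

    def allocate(self, node: str) -> ipaddress.IPv4Address:
        """Give a new pod on `node` the next unused IP in that node's subnet."""
        return next(self._free[node])  # raises StopIteration when the subnet is full

    def route(self, ip: str) -> str:
        """Return the node that traffic for a pod IP should be sent to."""
        addr = ipaddress.IPv4Address(ip)
        for node, subnet in self._subnets.items():
            if addr in subnet:
                return node
        raise LookupError(f"{ip} is not in any node's pod subnet")


net = PodNetwork()
for i in (1, 2, 3):
    net.add_node(f"node{i}", f"10.1.{i}.0/24")
pod_ip = net.allocate("node3")
print(pod_ip, net.route(str(pod_ip)))  # 10.1.3.2 node3
```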
Labels
Arbitrary metadata
Attached to any API object
Generally represent identity
Queryable by selectors
  • think SQL ‘select ... where ...’
The only grouping mechanism
  • pods under a ReplicationController
  • pods in a Service
  • capabilities of a node (constraints)
Example: “phase: canary”
[Figure: four pods labeled App: Nifty, Phase: Dev, Role: FE; App: Nifty, Phase: Dev, Role: BE; App: Nifty, Phase: Test, Role: FE; App: Nifty, Phase: Test, Role: BE]
Selectors
[Figure builds over the same four pods (App: Nifty, with Phase: Dev or Test and Role: FE or BE), highlighting what each selector matches:
  • App == Nifty: all four pods
  • App == Nifty, Role == FE: the two FE pods
  • App == Nifty, Role == BE: the two BE pods
  • App == Nifty, Phase == Dev: the two Dev pods
  • App == Nifty, Phase == Test: the two Test pods]
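Equality-based selectors like the ones in the figure can be sketched in a few lines. The pod names are invented, and treating an empty selector as "match everything" is a choice made for this sketch.

```python
Labels = dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """Equality-based selector: every key=value in the selector must appear
    in the object's labels (a logical AND). In this sketch an empty selector
    matches everything."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: Labels, objects: dict[str, Labels]) -> list[str]:
    """Names of all objects whose labels satisfy the selector, sorted."""
    return sorted(name for name, labels in objects.items() if matches(selector, labels))


# The four pods from the slides; the pod names are made up for illustration.
pods = {
    "dev-fe": {"App": "Nifty", "Phase": "Dev", "Role": "FE"},
    "dev-be": {"App": "Nifty", "Phase": "Dev", "Role": "BE"},
    "test-fe": {"App": "Nifty", "Phase": "Test", "Role": "FE"},
    "test-be": {"App": "Nifty", "Phase": "Test", "Role": "BE"},
}
print(select({"App": "Nifty"}, pods))                   # ['dev-be', 'dev-fe', 'test-be', 'test-fe']
print(select({"App": "Nifty", "Role": "FE"}, pods))     # ['dev-fe', 'test-fe']
print(select({"App": "Nifty", "Phase": "Test"}, pods))  # ['test-be', 'test-fe']
```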
Replication Controllers
Canonical example of control loops
Runs out-of-process with respect to the API server
Has 1 job: ensure N copies of a pod
  • if too few, start new ones
  • if too many, kill some
  • group == selector
Cleanly layered on top of the core
  • all access is by public APIs
Replicated pods are fungible
  • No implied ordinality or identity
Replication Controller
  - Name = “nifty-rc”
  - Selector = {“App”: “Nifty”}
  - PodTemplate = { ... }
  - NumReplicas = 4
[Diagram: the Replication Controller asks the API Server “How many?” and gets 3; it says “Start 1 more” and gets OK; it asks “How many?” again and gets 4]
![Page 55: Container Management at SCALE 13x Google Scale€¦ · “Let Me Contain That For You” Probably NOT what you want to use! Google confidential │ Do not distribute Docker vs. LMCTFY](https://reader033.vdocuments.net/reader033/viewer/2022043013/5faf36b3c4d3e310b653a80e/html5/thumbnails/55.jpg)
Google confidential │ Do not distribute
Replication Controllers
node 1
f0118
node 3
node 4node 2
d9376
b0111
a1209
Replication Controller- Desired = 4- Current = 4
![Page 56: Container Management at SCALE 13x Google Scale€¦ · “Let Me Contain That For You” Probably NOT what you want to use! Google confidential │ Do not distribute Docker vs. LMCTFY](https://reader033.vdocuments.net/reader033/viewer/2022043013/5faf36b3c4d3e310b653a80e/html5/thumbnails/56.jpg)
Google confidential │ Do not distribute
Replication Controllers
node 1
f0118
node 3
node 4node 2
Replication Controller- Desired = 4- Current = 4
d9376
b0111
a1209
![Page 57: Container Management at SCALE 13x Google Scale€¦ · “Let Me Contain That For You” Probably NOT what you want to use! Google confidential │ Do not distribute Docker vs. LMCTFY](https://reader033.vdocuments.net/reader033/viewer/2022043013/5faf36b3c4d3e310b653a80e/html5/thumbnails/57.jpg)
Google confidential │ Do not distribute
Replication Controllers
node 1
f0118
node 3
node 4
Replication Controller- Desired = 4- Current = 3
b0111
a1209
![Page 58: Container Management at SCALE 13x Google Scale€¦ · “Let Me Contain That For You” Probably NOT what you want to use! Google confidential │ Do not distribute Docker vs. LMCTFY](https://reader033.vdocuments.net/reader033/viewer/2022043013/5faf36b3c4d3e310b653a80e/html5/thumbnails/58.jpg)
Google confidential │ Do not distribute
Replication Controllers
node 1
f0118
node 3
node 4
Replication Controller- Desired = 4- Current = 4
b0111
a1209
c9bad
![Page 59: Container Management at SCALE 13x Google Scale€¦ · “Let Me Contain That For You” Probably NOT what you want to use! Google confidential │ Do not distribute Docker vs. LMCTFY](https://reader033.vdocuments.net/reader033/viewer/2022043013/5faf36b3c4d3e310b653a80e/html5/thumbnails/59.jpg)
Google confidential │ Do not distribute
Replication Controllers
node 1
f0118
node 3
node 4node 2
Replication Controller- Desired = 4- Current = 5
d9376
b0111
a1209
c9bad
Replication Controllers
[Diagram: node 1, node 2, node 3, node 4; pods f0118, d9376, b0111, a1209, c9bad; Replication Controller: Desired = 4, Current = 4]
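The slide sequence above is one reconciliation loop at work: the Replication Controller compares the desired replica count with the pods it observes and creates or deletes pods to close the gap. A minimal Python sketch, assuming a simplified in-memory model in which pods are just names; `reconcile`, `new_pod_name` and the rule for choosing which surplus pod to delete are hypothetical, not the actual Kubernetes controller code.

```python
import itertools

# Hypothetical source of fresh pod names, standing in for the random IDs on the slides.
_ids = itertools.count()


def new_pod_name() -> str:
    return f"pod-{next(_ids):04d}"


def reconcile(desired: int, observed: list[str]) -> list[str]:
    """One observe/diff/act pass of a replication-controller-style loop.

    observe: `observed` lists the pods actually seen running.
    diff:    compare that count with `desired`.
    act:     create missing pods or delete surplus ones.
    """
    pods = list(observed)
    gap = desired - len(pods)
    if gap > 0:
        # Too few replicas (e.g. a node died): start replacements.
        pods.extend(new_pod_name() for _ in range(gap))
    elif gap < 0:
        # Too many replicas (e.g. a dead node came back): keep the first
        # `desired` pods in the list and delete the rest. Which pods to
        # delete is a policy choice; this one is arbitrary.
        pods = pods[:desired]
    return pods


# Node 2 dies and takes d9376 with it: one replacement is created.
pods = reconcile(4, ["f0118", "b0111", "a1209"])
print(len(pods))  # 4

# Node 2 returns and d9376 is observed again: one surplus pod is deleted.
pods = reconcile(4, pods + ["d9376"])
print(len(pods))  # 4
```

Note that the loop never needs to know *why* the count drifted; it only acts on the observed difference, which is what lets the same code handle both the node failure and the node's return.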
Services
A group of pods that act as one == Service
• group == selector
Defines access policy
• only “load balanced” for now
Gets a stable virtual IP and port
• called the service portal
• also a DNS name
VIP is captured by kube-proxy
• watches the service constituency
• updates when backends change
Hide complexity - ideal for non-native apps
[Diagram: a Client reaching the service's pods through the Portal (VIP)]
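Because a service's group is defined by a selector, its membership is just the set of pods whose labels match. A minimal sketch of equality-based matching, assuming labels and selectors are plain string-to-string maps; the pod names reuse IDs from the earlier slides, while the labels and the `matches` and `constituency` helpers are hypothetical.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every key/value pair in the selector appears in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def constituency(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Names of the pods a service with this selector would send traffic to."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


# Hypothetical pods and labels.
pods = {
    "a1209": {"App": "Nifty", "Tier": "frontend"},
    "b0111": {"App": "Nifty"},
    "f0118": {"App": "Other"},
}

print(constituency({"App": "Nifty"}, pods))  # ['a1209', 'b0111']
```

In this model a pod joins or leaves the service purely by its labels; nothing has to register it with the service.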
Services
[Diagram: Client -> 10.0.0.1 : 9376 (TCP / UDP) -> iptables DNAT -> kube-proxy -> backends 10.240.1.1 : 8080, 10.240.2.2 : 8080, 10.240.3.3 : 8080 (TCP / UDP); kube-proxy watches the apiserver]
Service
- Name = “nifty-svc”
- Selector = {“App”: “Nifty”}
- Port = 9376
- ContainerPort = 8080
Portal IP is assigned
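In the diagram, traffic to the portal 10.0.0.1:9376 is redirected by iptables DNAT to kube-proxy, which forwards it to one of the backends on port 8080 and keeps its backend list current by watching the apiserver. The sketch below models only kube-proxy's bookkeeping, not the iptables rules or the byte-level proxying: a table from portal address to backends, replaced whenever a watch event arrives, with round-robin selection for each new connection. `PortalTable` and its methods are hypothetical, and round-robin is an assumed policy.

```python
import itertools


class PortalTable:
    """Hypothetical kube-proxy-style table: portal (VIP, port) -> backend endpoints."""

    def __init__(self) -> None:
        self._backends: dict[tuple[str, int], list[tuple[str, int]]] = {}
        self._cursors: dict[tuple[str, int], itertools.cycle] = {}

    def update(self, portal: tuple[str, int], endpoints: list[tuple[str, int]]) -> None:
        """Replace the backend list, e.g. when a watch on the apiserver reports a change."""
        self._backends[portal] = list(endpoints)
        self._cursors[portal] = itertools.cycle(self._backends[portal])

    def pick(self, portal: tuple[str, int]) -> tuple[str, int]:
        """Choose the backend for a new connection to the portal (round-robin)."""
        if not self._backends.get(portal):
            raise LookupError(f"no backends for {portal}")
        return next(self._cursors[portal])


table = PortalTable()
portal = ("10.0.0.1", 9376)
table.update(portal, [("10.240.1.1", 8080), ("10.240.2.2", 8080), ("10.240.3.3", 8080)])
for _ in range(4):
    # Cycles 10.240.1.1, 10.240.2.2, 10.240.3.3, then back to 10.240.1.1.
    print(table.pick(portal))
```

Because `update` replaces the whole list, a backend that drops out of the watch simply stops being chosen; a real proxy also has to deal with connections already in flight, which this sketch ignores.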
Kubernetes Status & plans
Open sourced in June 2014
• won the Black Duck “rookie of the year” award
• so did cAdvisor :)
Google launched Google Container Engine (GKE)
• hosted Kubernetes
• https://cloud.google.com/container-engine/
Roadmap:
• https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/roadmap.md
Driving towards a 1.0 release in O(months)
• O(100) nodes, O(50) pods per node
• focus on web-like app serving use-cases
Monitoring
Optional add-on to Kubernetes clusters
Run cAdvisor as a pod on each node
• gather stats from all containers
• export via REST
Run Heapster as a pod in the cluster
• just another pod, no special access
• aggregate stats
Run InfluxDB and Grafana in the cluster
• more pods
• alternatively: store in Google Cloud Monitoring
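The monitoring pipeline splits the work: cAdvisor collects per-container stats on each node, and Heapster, running as an ordinary pod, gathers them from every node and aggregates them. A minimal sketch of the aggregation step, assuming the per-node stats have already been fetched into plain dicts; the data shape, units and the `aggregate` function are hypothetical and do not reflect cAdvisor's actual REST format.

```python
from collections import defaultdict

# Hypothetical shape: {node: {(pod, container): {"cpu_m": millicores, "mem_mb": megabytes}}}
NodeStats = dict[str, dict[tuple[str, str], dict[str, int]]]


def aggregate(node_stats: NodeStats):
    """Roll per-container samples up to per-pod totals and a cluster-wide total."""
    per_pod: dict[str, dict[str, int]] = defaultdict(lambda: {"cpu_m": 0, "mem_mb": 0})
    for containers in node_stats.values():
        for (pod, _container), sample in containers.items():
            per_pod[pod]["cpu_m"] += sample["cpu_m"]
            per_pod[pod]["mem_mb"] += sample["mem_mb"]
    cluster = {
        "cpu_m": sum(p["cpu_m"] for p in per_pod.values()),
        "mem_mb": sum(p["mem_mb"] for p in per_pod.values()),
    }
    return dict(per_pod), cluster


stats = {
    "node1": {
        ("web-1", "app"): {"cpu_m": 250, "mem_mb": 128},
        ("web-1", "log-sidecar"): {"cpu_m": 50, "mem_mb": 32},
    },
    "node2": {("web-2", "app"): {"cpu_m": 300, "mem_mb": 140}},
}
per_pod, cluster = aggregate(stats)
print(per_pod["web-1"])  # {'cpu_m': 300, 'mem_mb': 160}
print(cluster)           # {'cpu_m': 600, 'mem_mb': 300}
```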
Logging
Optional add-on to Kubernetes clusters
Run fluentd as a pod on each node
• gather logs from all containers
• export to Elasticsearch
Run Elasticsearch as a pod in the cluster
• just another pod, no special access
• aggregate logs
Run Kibana in the cluster
• yet another pod
• alternatively: store in Google Cloud Logging
Kubernetes and isolation
We support isolation...
• ...inasmuch as Docker does
We want better isolation
• issues are open with Docker
• parent cgroups, GIDs, in-place updates
• will also need kernel work
• we have lots of tricks we want to share!
We have to meet users where they are
• strong isolation is new to most people
• we’ll all have to grow into it
Example: nested cgroups
[Diagram: nested cgroups on one machine]
machine: CPU 8 cores, Memory 16 GB
• pod1 cgroup: CPU 4 cores, Memory 8 GB
  • c1 cgroup: CPU 2 cores, Memory 4 GB
  • c2 cgroup: CPU 1 core, Memory 4 GB
  • c2 cgroup: CPU 1 core, Memory 4 GB
• pod2 cgroup: CPU 3 cores, Memory 5 GB
  • c1 cgroup: CPU 3 cores, Memory 5 GB
• pod3 cgroup: CPU <none>, Memory <none>
  • c1 cgroup: CPU <none>, Memory <none>
• leftovers: CPU 1 core, Memory 3 GB
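The leftover figure follows from the pod-level limits: pod1 and pod2 reserve 4 + 3 = 7 of the 8 cores and 8 + 5 = 13 of the 16 GB, leaving 1 core and 3 GB, while pod3 has no limits and is not counted. A small sketch of that bookkeeping, assuming `None` stands for <none>; the data layout and the `leftovers` helper are illustrative only.

```python
MACHINE = {"cpu_cores": 8, "memory_gb": 16}

# Pod-level cgroup limits from the diagram; None means no limit (pod3).
PODS = {
    "pod1": {"cpu_cores": 4, "memory_gb": 8},
    "pod2": {"cpu_cores": 3, "memory_gb": 5},
    "pod3": {"cpu_cores": None, "memory_gb": None},
}


def leftovers(machine: dict[str, int], pods: dict[str, dict]) -> dict[str, int]:
    """Machine capacity not reserved by any pod-level limit.

    Pods without a limit reserve nothing in this model.
    """
    return {
        resource: capacity - sum(p[resource] for p in pods.values() if p[resource] is not None)
        for resource, capacity in machine.items()
    }


print(leftovers(MACHINE, PODS))  # {'cpu_cores': 1, 'memory_gb': 3}
```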
The Goal: Shake things up
Containers are a new way of working
Requires new concepts and new tools
Google has a lot of experience...
...but we are listening to the users
Workload portability is important!
Kubernetes is Open Source
We want your help!
http://kubernetes.io
https://github.com/GoogleCloudPlatform/kubernetes
irc.freenode.net #google-containers
@kubernetesio
Questions?
Images by Connie Zhou
http://kubernetes.io
Backup Slides
Control loops
Drive current state -> desired state
Act independently
APIs - no shortcuts or back doors
Observed state is truth
Recurring pattern in the system
Example: ReplicationController
[Diagram: a loop of observe -> diff -> act]
Modularity
Loose coupling is a goal everywhere
• simpler
• composable
• extensible
Code-level plugins where possible
Multi-process where possible
Isolate risk by interchangeable parts
Example: ReplicationController
Example: Scheduler
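To illustrate code-level plugins with the scheduler example: a scheduler core can stay fixed while the policies it applies are passed in as interchangeable functions. Predicates filter out nodes that cannot run the pod, and a priority function ranks the rest. A minimal Python sketch; the `Node` and `Pod` types, both predicates and the scoring rule are hypothetical and are not the real Kubernetes scheduler interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem_gb: float
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    cpu: float
    mem_gb: float
    node_selector: dict[str, str] = field(default_factory=dict)


Predicate = Callable[[Pod, Node], bool]
Priority = Callable[[Pod, Node], float]


def fits_resources(pod: Pod, node: Node) -> bool:
    """Predicate: the node has enough free CPU and memory."""
    return pod.cpu <= node.free_cpu and pod.mem_gb <= node.free_mem_gb


def matches_node_selector(pod: Pod, node: Node) -> bool:
    """Predicate: the node carries every label the pod asks for."""
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


def spare_cpu_after(pod: Pod, node: Node) -> float:
    """Priority: prefer the node with the most CPU left after placement."""
    return node.free_cpu - pod.cpu


def schedule(pod: Pod, nodes: list[Node], predicates: list[Predicate], priority: Priority) -> Optional[str]:
    """Filter nodes through every predicate, then return the best-scoring node's name."""
    feasible = [n for n in nodes if all(pred(pod, n) for pred in predicates)]
    if not feasible:
        return None  # unschedulable for now
    return max(feasible, key=lambda n: priority(pod, n)).name


nodes = [
    Node("node1", free_cpu=2, free_mem_gb=4, labels={"zone": "a"}),
    Node("node2", free_cpu=6, free_mem_gb=8, labels={"zone": "b"}),
    Node("node3", free_cpu=4, free_mem_gb=16, labels={"zone": "a"}),
]
web = Pod("web", cpu=1, mem_gb=2, node_selector={"zone": "a"})

print(schedule(web, nodes, [fits_resources, matches_node_selector], spare_cpu_after))  # node3
print(schedule(web, nodes, [fits_resources], spare_cpu_after))  # node2
```

Swapping the predicate list or the priority function changes the placement policy without touching `schedule` itself, which is the kind of interchangeable part the slide describes.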
Atomic storage
Backing store for all master state
Hidden behind an abstract interface
Stateless means scalable
Watchable
• this is a fundamental primitive
• don’t poll, watch
Using CoreOS etcd
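"Don't poll, watch" means a component subscribes to changes instead of repeatedly re-reading state. A minimal in-memory sketch of a watchable store in which every write gets a monotonically increasing revision and is pushed to watchers of a key prefix. It is loosely modeled on the idea behind etcd's watch; `WatchableStore` and its methods are hypothetical, not the etcd API.

```python
import queue
import threading


class WatchableStore:
    """Toy key-value store: every write bumps a revision and notifies watchers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, str] = {}
        self._revision = 0
        self._watchers: list[tuple[str, queue.Queue]] = []

    def _notify(self, event: tuple) -> None:
        # Caller holds the lock, so each watcher sees events in revision order.
        key = event[1]
        for prefix, q in self._watchers:
            if key.startswith(prefix):
                q.put(event)

    def get(self, key: str):
        with self._lock:
            return self._data.get(key)

    def put(self, key: str, value: str) -> int:
        with self._lock:
            self._revision += 1
            self._data[key] = value
            self._notify(("PUT", key, value, self._revision))
            return self._revision

    def delete(self, key: str) -> int:
        with self._lock:
            self._revision += 1
            self._data.pop(key, None)
            self._notify(("DELETE", key, None, self._revision))
            return self._revision

    def watch(self, prefix: str) -> queue.Queue:
        """Return a queue that receives every later change to keys under `prefix`."""
        q: queue.Queue = queue.Queue()
        with self._lock:
            self._watchers.append((prefix, q))
        return q


store = WatchableStore()
events = store.watch("/pods/")
store.put("/pods/a1209", "running")
store.put("/services/nifty-svc", "10.0.0.1")  # different prefix: not delivered
store.delete("/pods/a1209")
print(events.get_nowait())  # ('PUT', '/pods/a1209', 'running', 1)
print(events.get_nowait())  # ('DELETE', '/pods/a1209', None, 3)
```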
Volumes
Pod scoped
Share pod’s lifetime & fate
Support various types of volumes
• Empty directory (default)
• Host file/directory
• Git repository
• GCE Persistent Disk
• ...more to come, suggestions welcome
[Diagram: a Pod whose two Containers share volumes: Git (backed by GitHub), Host (backed by the Host’s FS), GCE (backed by a GCE PD) and Empty]
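A pod-scoped volume such as the default empty directory is created with the pod, shared by its containers, and discarded with it. A minimal sketch that models this with Python's `tempfile.TemporaryDirectory`, assuming containers can be represented as callables that receive the volume path and run one after another; `run_pod`, `writer` and `reader` are hypothetical.

```python
import os
import tempfile


def run_pod(containers):
    """Run each 'container' (a callable taking the volume path) against one shared empty directory.

    The directory exists only for the pod's lifetime: it is created before the
    first container runs and deleted when the pod finishes, whatever the outcome.
    Containers run sequentially here for simplicity.
    """
    with tempfile.TemporaryDirectory() as volume:
        results = [container(volume) for container in containers]
        return results, volume


def writer(volume: str) -> str:
    with open(os.path.join(volume, "data.txt"), "w") as f:
        f.write("hello from container 1")
    return "wrote"


def reader(volume: str) -> str:
    with open(os.path.join(volume, "data.txt")) as f:
        return f.read()


results, path = run_pod([writer, reader])
print(results)               # ['wrote', 'hello from container 1']
print(os.path.exists(path))  # False: the volume went away with the pod
```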
Pod lifecycle
Once scheduled to a node, pods do not move
• restart policy means restart in-place
Pods can be observed pending, running, succeeded, or failed
• failed is really the end - no more restarts
• no complex state machine logic
Pods are not rescheduled by the scheduler or apiserver
• even if a node dies
• controllers are responsible for this
• keeps the scheduler simple
Apps should consider these rules
• Services hide this
• Makes pod-to-pod communication more formal
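The lifecycle rules can be captured in a few lines: a pod reports one of four phases, a restart happens in place on the same node, and the end states are final. A minimal sketch under those rules; the `PodPhase` enum, the `next_phase` function, treating `Succeeded` as terminal like `Failed`, and the restart-policy names `Always`, `OnFailure` and `Never` are assumptions for illustration, not taken from the slides.

```python
from enum import Enum


class PodPhase(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


TERMINAL = {PodPhase.SUCCEEDED, PodPhase.FAILED}


def next_phase(phase: PodPhase, exit_code: int, restart_policy: str) -> PodPhase:
    """Phase of a pod after its container exits with `exit_code`.

    A restart happens in place: the pod stays on its node and stays RUNNING.
    A terminal pod is never revived here; replacing it is a controller's job.
    """
    if phase in TERMINAL:
        return phase
    failed = exit_code != 0
    if restart_policy == "Always" or (restart_policy == "OnFailure" and failed):
        return PodPhase.RUNNING
    return PodPhase.FAILED if failed else PodPhase.SUCCEEDED


print(next_phase(PodPhase.RUNNING, 1, "OnFailure"))  # PodPhase.RUNNING
print(next_phase(PodPhase.RUNNING, 1, "Never"))      # PodPhase.FAILED
print(next_phase(PodPhase.FAILED, 0, "Always"))      # PodPhase.FAILED
```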
Cluster services
Logging, Monitoring, DNS, etc.
All run as pods in the cluster - no special treatment, no back doors
Open-source solutions for everything
• cadvisor + influxdb + heapster == cluster monitoring
• fluentd + elasticsearch + kibana == cluster logging
• skydns + kube2sky == cluster DNS
Can be easily replaced by custom solutions
• Modular clusters to fit your needs