there is no container - ori pekelman
#ContainerDayFR
There is no container
#ContainerDayFR Paris Container Day 2017
Ori Pekelman
GeekPush at Platform.sh
I am @OriPekelman everywhere (GitHub/Twitter/LinkedIn)
Co-Founder & VP of Marketing for Platform.sh, an innovative
second generation PaaS.
My role usually spans beyond the technological aspects to the
business strategy, process design and product marketing.
We are at Paris Container Day, so I can reasonably assume most people here understand the underpinnings of “containers”. But let’s have a show of hands to see how much time we should spend on each slide.
Group A
I don’t know much about containers. It sounds interesting. I came here to learn.
Group B
I use Docker. In production. It works and I never had to care about how it is implemented.
Group C
I implement my own container stuff. I have Kernel-Fu. I know how this stuff is built.
1. This is meant as an entry-level talk, but I will still discuss some nuts and bolts. So when I am unclear, interrupt me. I don’t mind.
2. I am rusty. They make me do marketing these days. So when I am wrong, interrupt me. I don’t mind.
3. Even more so as we have the incredible honor of having people like Jessie Frazelle with us, people who participated in building many of the nuts and some of the bolts.
So please, Jessie and you other experts, forgive the depths of my ignorance and any and all lies and errors I am about to spout.
What do containers solve? Why do we need containers?
Containers allow us to package complex software in a reusable format that is easy to deploy, making automation easier.
Sometimes they make updating software easier (with stateless systems, just build a new one and kill the old).
They have lower memory overhead than VMs, so they are less expensive, and we can have more of them.
They allow us to reason about the systems we run at a coarser granularity. AKA abstraction. In Greek, atom means “that which cannot be divided”. The container is our atom.
The canonical image of the container is something like
An orderly world where we put software in opaque boxes
The boxes have a common, simple interface, that is not influenced by their content
From the outside we don’t care what is inside. There are no dependencies on the exterior world.
That is our intuitive abstraction popularized by Docker™
We can move containers. Install them. Run them. Without ever knowing what was inside.
$ docker pull complex_piece_of_software:latest
$ docker run complex_piece_of_software:latest
The “Nuts and bolts” truth of the matter is probably inverse. The container does not create opacity from the outside in.
But from the inside out.
From the system’s point of view
This is the reality
From the outside, the kernel, UID 0, they see all. For them, there is no container.
It is from the “containerized” process’s point of view that the world changes. It becomes smaller.
When we create a container what happens is that using a bunch of different Kernel features and modules (cgroups, namespaces, seccomp...) we:
Limit the visibility on the outside world (namespaces)
Limit the availability of resources from the outside world (cgroups)
Sometimes outright lie about the world (namespaces)
And we limit what functionality the process can invoke from the kernel (seccomp... and more).
So.. let’s create a container.
There is an operating system. In our case, Linux. It abstracts away the hardware.
No software on a normal computer runs “outside” of the operating system. Yup. Even assembly / machine code. You can’t access the processor, memory or hardware without going through it. What you run on Linux are ELF binaries. Nothing else.
Your program interacts with its operating system through system calls. It can ask for memory or for access to stuff (like the network or the disk), and it can ask the operating system to run some other processes. A bunch of fun stuff.
Interactions with the OS pass through system calls, but sometimes the OS gets fancy and offers higher-level constructs to make things easy (like a pseudo-file-system). Most often we will use libraries and full-blown integrated apps to take care of talking to the OS. More on that later.
In Linux, processes are organized in a tree. Each process has an ID and a parent. Everything starts with 0, which is the scheduler, and 1, which is init. Everything else gets invoked from those on down.
In Linux we have three different calls involved in starting a process: exec(), which replaces the program running in the current process and which we don’t really care about here; fork(), which copies the current process into a child with a new PID; and clone(), which copies all or some of the current process and runs the new process as a child.
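To make the process tree a bit more tangible, here is a minimal sketch in Python (not from the talk; the function name `fork_and_report` is made up for illustration). It forks a child, lets the child report its own PID and its parent's PID through a pipe, and checks that they line up. It assumes a Unix-like system where `os.fork()` is available.

```python
import os


def fork_and_report():
    """Fork a child and return (parent_pid, child_pid, ppid_seen_by_child)."""
    read_end, write_end = os.pipe()
    child_pid = os.fork()  # the child is a copy of this process with a new PID
    if child_pid == 0:
        # Child branch: report who we are and who our parent is, then exit
        # without running any of the parent's remaining code.
        try:
            os.close(read_end)
            os.write(write_end, f"{os.getpid()} {os.getppid()}".encode())
        finally:
            os._exit(0)
    # Parent branch: read the child's report and reap the child.
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        reported_pid, reported_ppid = map(int, pipe.read().split())
    os.waitpid(child_pid, 0)
    assert reported_pid == child_pid  # fork() told the parent the same PID
    return os.getpid(), child_pid, reported_ppid


if __name__ == "__main__":
    parent, child, seen = fork_and_report()
    print(f"parent={parent} child={child} child_sees_parent={seen}")
```

The child really is just another node in the tree: a new PID, with our PID as its parent.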
So, how do we make the world seem smaller to a process?
When creating our process we can pass a couple of parameters to clone() that will tell our operating system how it is going to live.
A bunch of these parameters (or flags) are called CLONE_NEW[SOMETHING]. Some of them, not all, can also be applied later on using the unshare() system call.
For example the parameter CLONE_NEWUTS tells the operating system that:
1. Our newly created process can call sethostname(), and doing so, instead of changing the hostname for the whole OS, records the hostname just for that namespace.
2. So when the process later calls gethostname(), it gets back whatever was set through this namespace’s sethostname().
So unlike all of its cousins and parents, this process thinks the machine it is running on has a different name.
We tricked it! (remember the part about lying?)
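The hostname trick can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular runtime does it: it uses unshare() through ctypes instead of clone(), it adds CLONE_NEWUSER so that an unprivileged user may be allowed to create the UTS namespace, and the function name and the hostname "tricked" are made up. Many systems (and most container sandboxes) forbid this, in which case the function returns None.

```python
import ctypes
import os
import socket

CLONE_NEWUSER = 0x10000000  # values from <linux/sched.h>
CLONE_NEWUTS = 0x04000000


def hostname_in_new_uts_namespace(new_name="tricked"):
    """Return the hostname a child sees after renaming itself inside a fresh
    UTS namespace, or None if the kernel refuses to create the namespaces."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        seen = ""
        try:
            os.close(read_end)
            libc = ctypes.CDLL(None, use_errno=True)
            # The new user namespace gives the child the privilege it needs
            # to own a new UTS namespace without being root on the host.
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)   # only affects our namespace
                seen = socket.gethostname()
        except Exception:
            seen = ""
        finally:
            os.write(write_end, seen.encode())
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        seen = pipe.read()
    os.waitpid(pid, 0)
    return seen or None


if __name__ == "__main__":
    print("host still says:", socket.gethostname())
    print("child was told :", hostname_in_new_uts_namespace())
```

Whatever the child was told, the parent's gethostname() is unchanged, which is the whole point.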
Setting up namespaces
So.. we create a new process and attach a namespace to it in one of three ways: at its creation, with the flags we pass to clone(); later, using the unshare() system call, which moves the calling process into new namespaces for some of the namespaced resources; or using the setns() system call, which moves the calling process into an existing namespace.
Having a different machine name per process is cool.
But not that useful right? That is not a container.
What else can we isolate?
Isolating the file-system
As far as containers are concerned, the most important thing is the file-system. This is done through CLONE_NEWNS.
1. First we create the new mount namespace.
2. We can then unmount the stuff from the parent namespace and mount the various things we need in our target dir (we want to get to a usable root file system).
3. Run `pivot_root $TARGETDIR $PUT_OLD` (the second argument is a directory under the new root where the old root ends up) and voilà!
We can have different mounts and isolate parts of the file-system! As a side note, doing stuff like mounting requires “capabilities”, in this case CAP_SYS_ADMIN. More often than not these are going to have been dropped. So this is not always trivial.
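Here is a rough Python sketch of the "different mounts" idea. It is an assumption-laden illustration, not a container runtime: it stops well short of pivot_root, it gets CAP_SYS_ADMIN by also creating a user namespace (which many systems disallow), and the function name and the `marker` file are invented. A child mounts a tmpfs over a temporary directory in its own mount namespace and drops a file there; the parent then looks at the same path.

```python
import ctypes
import os
import shutil
import tempfile

CLONE_NEWUSER = 0x10000000  # values from <linux/sched.h>
CLONE_NEWNS = 0x00020000


def private_mount_demo():
    """Return (child_saw_marker, parent_sees_marker), or None when the kernel
    does not let this user create the namespaces or mount the tmpfs."""
    target = tempfile.mkdtemp()
    marker = os.path.join(target, "marker")
    uid, gid = os.geteuid(), os.getegid()
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        report = b"denied"
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                                   ctypes.c_char_p, ctypes.c_ulong,
                                   ctypes.c_void_p]
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) == 0:
                # Become "root" inside the new user namespace only.
                with open("/proc/self/setgroups", "w") as f:
                    f.write("deny")
                with open("/proc/self/uid_map", "w") as f:
                    f.write(f"0 {uid} 1")
                with open("/proc/self/gid_map", "w") as f:
                    f.write(f"0 {gid} 1")
                # This mount exists only in the child's mount namespace.
                if libc.mount(b"tmpfs", target.encode(), b"tmpfs", 0, None) == 0:
                    open(marker, "w").close()
                    report = b"seen" if os.path.exists(marker) else b"missing"
        except Exception:
            report = b"denied"
        finally:
            os.write(write_end, report)
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    parent_sees = os.path.exists(marker)
    shutil.rmtree(target, ignore_errors=True)
    if report == b"denied":
        return None
    return (report == b"seen", parent_sees)


if __name__ == "__main__":
    print(private_mount_demo())
```

When it is allowed to run, the child sees its file and the parent does not: same path, two different views of the file-system.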
We can decide what mounts are going to be shared from the “host”. We can totally decide that /var/lib is going to be common. Nothing disallows this.
We can use some crazy layered file system (like AUFS or OverlayFS) which will allow us to mix stuff, some coming from the underlying OS and some ‘overridden’ just for our namespace.
Now, “container runtimes” like Docker, LXC or runc are a lot about preparing an image of a file-system that can be mounted in such a way that a process can run in it. If you look at the OCI (Open Container Initiative), it has two specs: one for this, the file-system image, and one for the runtime.
Isolating Inter-Process Communications
With CLONE_NEWIPC we restrict our process to sending and receiving messages only to and from other processes in the same namespace.
We don’t want our nice isolated process to talk with strangers right?
Isolate Process IDs!
This (the PID namespace, CLONE_NEWPID) is how, when you run ps -aux, you only see processes in your own namespace and its children (the PIDs won’t match those seen from outside. This is complex).
Oops, I forgot to tell you: namespaces are hierarchical. Which is triple fun. So yes, containers can run inside other containers ad infinitum (really up to 32 levels, but, well, you know, details).
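One way to see that "the PIDs won't match" is the NSpid field of /proc/[pid]/status, which lists a process's PID in each nested PID namespace it belongs to. A small Python sketch (the function name is made up; the field only exists on reasonably recent kernels, so the function returns an empty list when it is missing):

```python
import os


def pids_across_namespaces(pid="self"):
    """Return the PID of a process in every PID namespace it is a member of,
    outermost visible namespace first, or [] if the kernel does not say."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("NSpid:"):
                    return [int(field) for field in line.split()[1:]]
    except OSError:
        pass
    return []


if __name__ == "__main__":
    # Outside any nested PID namespace this prints a single number. Looked at
    # from the host, a containerized process shows one number per level.
    print(pids_across_namespaces())
```

The last number is the PID the process itself sees; the ones before it are what the enclosing namespaces call it.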
Isolating the Network
This (the network namespace, CLONE_NEWNET) is how your container gets its own IP. Yay, now it is a big boy.
(We won’t get into this, but this is also where a lot of suffering will happen. Remember, from the kernel’s perspective this is just another interface. We will need to use either NAT, weird bridging or some creative uses of iptables to make sense of things. And this is clearly where we see how higher-level abstractions are a necessity.)
Isolate User and group IDs
This is oh so important for unprivileged containers.
Yes! Linux supports doing all of this from userspace.
This basically means that the UID running inside does not exist outside. And that your process can feel blessedly aloof.
man user_namespaces: USER_NAMESPACES(7)
A process's user and group IDs can be different inside and outside a user
namespace. In particular, a process can have a normal unprivileged user ID
outside a user namespace while at the same time having a user ID of 0
inside the namespace; in other words, the process has full privileges for
operations inside the user namespace, but is unprivileged for operations
outside the namespace.
This means quantum-state rootness! You are root and unprivileged at the same time!
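The quantum-state rootness can be observed with a short Python sketch. This is an illustration with a made-up function name, and it assumes the kernel allows unprivileged user namespaces (plenty of systems and sandboxes do not, in which case it returns None). A child creates a user namespace, maps its outside user ID to 0, and then asks who it is.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>


def uid_inside_new_user_namespace():
    """Return (uid_outside, uid_inside) for a child that enters a new user
    namespace, or None if the kernel does not allow creating one."""
    outside_uid = os.geteuid()
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        report = b""
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                # Map user 0 inside the namespace to our ordinary user outside.
                with open("/proc/self/uid_map", "w") as uid_map:
                    uid_map.write(f"0 {outside_uid} 1")
                report = str(os.geteuid()).encode()
        except Exception:
            report = b""
        finally:
            os.write(write_end, report)
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    return (outside_uid, int(report)) if report else None


if __name__ == "__main__":
    print(uid_inside_new_user_namespace())
```

The same process is an ordinary user on the outside and user 0 on the inside.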
Each process is a member of exactly one user namespace. A process
created via fork or clone without the CLONE_NEWUSER flag is a member
of the same user namespace as its parent. A single-threaded process can
join another user namespace with setns if it has the CAP_SYS_ADMIN in
that namespace; upon doing so, it gains a full set of capabilities in that
namespace.
Almost last, but not least. Isolate resources!
This is where this ties in to the earlier mechanism we were talking about, cgroups.
The cgroups themselves are what limit the resource usage of the process (and its children) in terms of memory, CPU usage and IO; CLONE_NEWCGROUP gives the process its own view of the cgroup hierarchy, so that its own cgroup looks like the root.
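To see a resource limit from the process's side, here is a hedged Python sketch. It assumes cgroup v2 (the unified hierarchy) mounted at /sys/fs/cgroup, which is common on recent distributions but not guaranteed; on other setups the made-up function `cgroup_v2_memory_limit` simply returns None.

```python
import os


def cgroup_v2_memory_limit(pid="self"):
    """Return the contents of memory.max for the process's cgroup ("max" or a
    number of bytes as a string), or None if it cannot be determined."""
    try:
        with open(f"/proc/{pid}/cgroup") as membership:
            for line in membership:
                hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
                # The cgroup v2 entry has hierarchy ID 0 and no controller list.
                if hierarchy_id == "0" and controllers == "":
                    limit_file = os.path.join(
                        "/sys/fs/cgroup", path.lstrip("/"), "memory.max")
                    with open(limit_file) as limit:
                        return limit.read().strip()
    except (OSError, ValueError):
        pass
    return None


if __name__ == "__main__":
    print("memory limit for this process:", cgroup_v2_memory_limit())
```

Inside a memory-limited container this would typically show a byte count; on an unconstrained cgroup it shows "max".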
Really last: isolate all the things and the Kernel.
This is of unholy complexity. Short story: Linux used to be mostly all or nothing. User 0 vs. the others. Now you have capabilities. A long list of capabilities. Which you can now go and set per process. And you have stuff like seccomp and seccomp-bpf to restrict which system calls a process can make.
And you can use a bunch of modules and kernel patch sets to make everything more robust. Like SELinux. grsecurity. Or AppArmor.
seccomp
Seccomp is a mechanism in the Linux kernel that allows a process to make a one-way transition to a restricted state where it can only perform a limited set of system calls.
If a process attempts any other system calls, it is killed via a SIGKILL signal. In its most restrictive mode, seccomp prevents all system calls other than read(), write(), _exit(), and sigreturn().
This would allow a program to initialize and then drop into a restricted mode where it could only read from/write to already-opened files.
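The one-way transition is easy to watch from Python, with some caveats. This sketch (function name made up) forks a child, asks for strict seccomp through prctl(), and then lets the child attempt a harmless but forbidden system call. It assumes a Linux kernel with seccomp support and a process that is not already running under a seccomp filter; when the kernel refuses the request the function returns None.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22      # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1  # from <linux/seccomp.h>


def strict_seccomp_kills_child():
    """Return True if a child in strict seccomp mode is killed with SIGKILL on
    a forbidden system call, None if strict mode could not be enabled."""
    pid = os.fork()
    if pid == 0:
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) == 0:
                os.getppid()  # not read/write/_exit/sigreturn: fatal from here on
        except Exception:
            pass
        os._exit(42)  # only reached normally when strict mode was refused
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return True
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 42:
        return None
    return False


if __name__ == "__main__":
    print("child killed by strict seccomp:", strict_seccomp_kills_child())
```

A Python interpreter is a terrible candidate for strict mode, since it needs far more than four system calls to do anything, which is exactly why the child does not survive.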
seccomp-bpf
If seccomp is a sledgehammer, seccomp-bpf is the fine-grained version that allows specifying a filter that is applied to every system call.
Looking under the hood
BTW, you get a nice pseudo-filesystem with which you can interact to inspect and control these values.
Try (replacing 8 and 800 with PIDs that exist on your system):
sudo ls -lai /proc/8/ns/
cat /proc/800/cgroup
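The same peek can be done from a program. A minimal Python sketch, assuming a Linux /proc and using made-up function names; with the default "self" it needs no privileges, while other PIDs may require root as in the commands above.

```python
import os


def namespaces_of(pid="self"):
    """Map each entry of /proc/<pid>/ns to its link target, e.g.
    {"uts": "uts:[4026531838]", ...}. Equal targets mean a shared namespace."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


def cgroups_of(pid="self"):
    """Return the lines of /proc/<pid>/cgroup (hierarchy:controllers:path)."""
    with open(f"/proc/{pid}/cgroup") as membership:
        return [line.rstrip("\n") for line in membership]


if __name__ == "__main__":
    for name, target in namespaces_of().items():
        print(f"{name:20} {target}")
    for line in cgroups_of():
        print(line)
```

Comparing the output for two processes tells you which namespaces they share, which is about as close to "seeing the container" as the kernel gets.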
Unlike other isolation techniques (Solaris Zones, BSD Jails, VMs) this is an emergent thing
This is not a “first class” citizen. This was not designed. Different projects assemble different types of isolation, with different semantics, from all of these elements.
● Docker is about packaging a single executable.
● LXC wants to give you what feels like a virtual machine.
● FireJail is there as a sandbox to run stuff you don’t trust. GUI much.
And this is a recent thing: user namespaces appeared in Kernel release 3.8, on 18 Feb 2013.
Quite far away from our intuitive abstraction popularized by Docker™
Everything in this world is “race-condition” prone, and much of it, because of the mix of tooling, is complex and hard.
Creating a Linux container, or “containerization”, is using these different mechanisms together in a coherent way so that the end result “feels” as if the process you are running is in an isolated machine.
A container runtime is a packaging of the above to make it simple.
The signatures and semantics of cgroups, namespaces and seccomp are different.
Container runtimes try to take something that, in reality, looks more like this
Into our abstract image
Containers as an abstraction
When you think about all these low-level knobs we can control (the machine name, the network interfaces, the file-system, the users etc.) you see something else emerging.
When we define how to “containerize” a piece of software we are extracting its contract.
We are defining the minimal subset of resources it needs.
And we are defining the minimal understanding of that piece of software that the runtime requires to run it reliably.
The simple Docker Contract
There were other isolation techniques before Docker. But because it exposed such a simple contract, it gained the incredible traction it had.
According to Docker the contract of a piece of software was:
● A base image (a state of a file-system), which can itself be layered.
● A working directory.
● A build step (which was basically a bash script).
● A TCP port exposed to the world.
● Environment variables.
● A command to run.
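Written down as a data structure, the contract above is strikingly small. The Python sketch below is purely illustrative: the class `DockerContract`, its field names and the example values are invented here, and the only real-world names used are the standard Dockerfile instructions (FROM, WORKDIR, RUN, ENV, EXPOSE, CMD) that each item of the contract roughly corresponds to.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DockerContract:
    """Hypothetical model of the six-item contract listed above."""
    base_image: str                                   # a state of a file-system
    working_directory: str
    build_steps: List[str] = field(default_factory=list)   # "a bash script"
    exposed_tcp_port: Optional[int] = None
    environment: Dict[str, str] = field(default_factory=dict)
    command: List[str] = field(default_factory=list)

    def to_dockerfile(self) -> str:
        """Render the contract as Dockerfile text (values are not quoted, so
        this sketch assumes they contain no spaces)."""
        lines = [f"FROM {self.base_image}", f"WORKDIR {self.working_directory}"]
        lines += [f"RUN {step}" for step in self.build_steps]
        lines += [f"ENV {key}={value}" for key, value in self.environment.items()]
        if self.exposed_tcp_port is not None:
            lines.append(f"EXPOSE {self.exposed_tcp_port}")
        lines.append(f"CMD {json.dumps(self.command)}")
        return "\n".join(lines)


if __name__ == "__main__":
    example = DockerContract("debian:stable", "/app", ["make"], 8080,
                             {"MODE": "prod"}, ["./server"])
    print(example.to_dockerfile())
```

Six fields and the whole thing fits on a slide, which goes some way toward explaining the traction.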
Choosing a contract
The incredible success it had shows that the Docker software and the Docker contract were good enough. And good enough is good. Sometimes great.
At Platform.sh we run a container-based PaaS and we chose not to use Docker.
● Partly because the nuts and bolts at the time didn’t fit (it was too new/buggy for production in 2013/14). No user namespaces until two months ago. No immutability. Weird networking.
● Partly because we thought the contract wasn’t correct for our use-case.
Is it an efficient contract?
● The idea of mutable, layered base-images made creating the first generation of Docker containers easy. Which explains a lot of its popularity. So yes.
● But it is a messy thing. This is something Docker has advanced on by allowing immutable containers. Still, the default is that the container is mutable. And this is what the ecosystem looks like.
● Build-oriented, reproducible, semantic base-images allow for orders of magnitude better memory utilisation through deduplication, and orders of magnitude simpler operations. This is not something you can bolt on easily later. There is still strong inertia here.
For some software (most software we cared about) this contract doesn’t really make sense. Not in the long run. Not at scale.
In order to be useful, the contract that describes software also needed to describe:
○ How to build it
○ Everything it depends on (you can’t run Wordpress without MySQL)
○ Its initial data structures (you can’t run Wordpress without some data in the MySQL)
○ Its basic configuration (most software needs to understand some things about its place in the world)
These days there are a billion and one projects that add those capabilities
○ And first, of course, the Kubernetes ecosystem.
○ But using 30 different tools strung together doesn’t scream “abstraction” to us, but more like DIY mess. And it hardly answers the questions:
■ What is the minimal subset of resources an app needs?
■ How can we make it run, reliably?
The obligatory XKCD 435
○ If our intuition is correct, and the minimal viable contract to run “arbitrary” software contains these other things, if the useful level to reason about software is the molecule, not the atom then we need an Organic Chemistry set; Not a physics set.
○ It doesn’t mean physics are wrong. Or that Docker is bad software.
What would be a perfect contract for us?
● RO / immutable base-image that is not opaque
○ A semantic representation of system-libraries (with lock files)
○ A reproducible, semantic, build system (with lock files)
○ Potentially, a build step (which can basically be a bash script).
● RW / mutable base-image (mutable state) - which is Content Addressable
● Mapping of working directories to the RW image.
● A list of exposed network protocols and their parameters
● Build time environment variables / Run time environment variables
● Relationships to other containers (some containers make no sense, and would not run, without a database), which should be semantic themselves.
● The capability to understand change (diff as part of the model).
Why are abstractions important?
● Because we chose a container description system that did not depend on the containerization method, we can swap out that part later, and this is a domain where everything moves fast. Shiny new becomes legacy in 6 months.
○ Our reproducible build system can create our base LXC systems (which we use in production), our VMs (which we also deploy when we need higher levels of isolation) or Docker images (which we use in our Gitlab-based CI system).
● Because we went for Read-Only containers separated from the R/W mounts, we have gained factors in terms of density because of the level of memory deduplication.
● Because we are describing the “minimal application” not as a single process but as a graph, and because we understand the protocol-layer interactions and what writes where to disk, we can have consistent operations over the cluster that are fast and safe.
● Which also means we do not suffer from the same limitations around running persistent services.
● It is easier to implement HA primitives when you understand who is writing to the disk and how, who has what ports opened etc.
● When your base system is not .yaml but .yaml + git, and when your .yaml represents something that has meaning, you can implement change with much less friction.
Platform.sh can clone an arbitrarily complex production cluster in less than a minute.
With all of the data.
To create ephemeral staging clusters on the fly.
Every branch gets a url with basically fail-proof deployments.
Git-driven infrastructure
With a single git push you can deploy an arbitrarily complex cluster (with micro-services, messages queues and the lot.)
Backup means a consistent point-in-time snapshot of the whole shebang.
Automatically redundant architecture
High-Performance, automatic high-availability
There is no container but the cluster
● This is a bonus slide in case I didn’t run out of time, which is fun as I had 66 slides for 30 minutes.
● At the beginning of our project we used the word Cluster to describe, well, half of the different primitives we had. But then it all became murky. So we started calling stuff Cluster, Kluster and Claster. Which stuck for a little bit but faded back again.
● Now cluster is back in all its glory, and, a bit like with Hebrew, my mother tongue, well, people just seem to be able to guess the correct meaning of cluster from the context.
● Oh we should really refresh that cluster.
I am @OriPekelman everywhere
Questions ?