there is no container - ori pekelman

#ContainerDayFR

There is no container

Paris Container Day 2017

Ori Pekelman, GeekPush at Platform.sh

I am @OriPekelman everywhere (GitHub/Twitter/LinkedIn)

Co-Founder & VP of Marketing for Platform.sh, an innovative second generation PaaS.

My role usually spans beyond the technological aspects to the business strategy, process design and product marketing.


2

https://platform.sh



We are at Paris Container Day, so I could rightly imagine most people around have an understanding of the underpinnings of “containers”. But let’s have a show of hands to see how much time we are going to spend on which slide.


3

Group A

I don’t know much about containers. It sounds interesting. I came here to learn.




4

Group B

I use Docker. In production. It works and I never had to care about how it is implemented.




5

Group C

I implement my own container stuff. I have Kernel-Fu. I know how this stuff is built.


1. This is meant as an entry-level talk, but I will still discuss some nuts and bolts. So when I am unclear, interrupt me. I don’t mind.

2. I am rusty. They make me do marketing these days. So when I am wrong, interrupt me. I don’t mind.

3. Even more so as we have the incredible honor of having people like Jessie Frazelle with us, people that participated in building many of the nuts and some of the bolts.

So, please, Jessie and you other experts, forgive the depths of my ignorance and any and all lies and errors I am about to spout.


6


What do containers solve? Why do we need containers?


7

Containers allow us to package complex software in a reusable format that is easy to deploy, making automation easier.

Sometimes they make updating software easier (with stateless systems… just build a new one, kill the old).

They have lower overhead in terms of memory usage than VMs, so they are less expensive, and we can have more of them.

They allow us to reason about the systems we run at a coarser granularity. AKA abstraction. In Greek, atom means “that which cannot be divided”. The container is our atom.



8



9



10



11



12


The canonical image of the container is something like


13


An orderly world where we put software in opaque boxes


14


The boxes have a common, simple interface, that is not influenced by their content


15


From the outside we don’t care what is inside. There are no dependencies on the exterior world.


16


That is our intuitive abstraction popularized by Docker™


17


We can move containers. Install them. Run them. Without ever knowing what was inside.


18

$ docker pull complex_piece_of_software:latest

$ docker run complex_piece_of_software:latest


The “nuts and bolts” truth of the matter is probably the inverse. The container does not create opacity from the outside in.


19


But from the inside out.


20


From the system’s point of view


21


This is the reality


22


From the outside, the kernel, UID 0, they see all. For them, there is no container.


23



It is from the “containerized” process’s point of view that the world changes. Becomes smaller.

24


When we create a container what happens is that using a bunch of different Kernel features and modules (cgroups, namespaces, seccomp...) we:


25


Limit the visibility on the outside world (namespaces)


26


Limit the availability of resources from the outside world (cgroups)


27


Sometimes outright lie about the world (namespaces)


28


And we limit what the process can invoke as functionality from the kernel (seccomp, capabilities and more…)


29



30


There is an operating system. In our case Linux. It abstracts away the hardware.

No software on a normal computer runs “outside” of the operating system. Yup. Even assembly / machine code. You can’t access the processor, memory or hardware without going through it. What you run on Linux are ELF binaries. Nothing else.

Your program interacts with its operating system through system calls. It can ask for memory, for access to stuff (like the network or the disk), and it can ask the operating system to run some other processes. A bunch of fun stuff.

So, let’s create a container.


31




32

Interactions with the OS pass through system calls, but sometimes it gets fancy and proposes higher-level constructs to make it easy (like a pseudo-file-system). Most often we will use libraries and full-blown integrated apps to take care of talking to the OS. More on that later.

In Linux, processes are organized in a tree. Each process has an ID and a parent. Everything starts with 0, which is the scheduler, and 1, which is init. Everything else is going to get invoked from those and down.

In Linux we have three different calls to start a process: exec(), which we don’t really care about here; fork(), which copies the current process with a new PID; and clone(), which copies all or some of the current process and runs the new process as a child.


So, how do we make the world seem smaller to a process?

When creating our process we can pass a couple of parameters to clone() that will tell our operating system how it is going to live.

A bunch of these parameters (or flags) are called CLONE_NEW<SOMETHING>. Some of these parameters, not all, can be modified later on using the unshare() system call.
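To make the flag family concrete, here is a small Python sketch (not from the talk) listing the CLONE_NEW* flags this talk touches on, with the numeric values from <linux/sched.h>. The helper names combine() and describe() are made up for illustration; the resulting mask is what you would hand to clone() or unshare().

```python
# Values of the CLONE_NEW* flags as defined in <linux/sched.h>.
CLONE_NEW_FLAGS = {
    "CLONE_NEWNS":     0x00020000,  # mount points
    "CLONE_NEWCGROUP": 0x02000000,  # view of the cgroup hierarchy
    "CLONE_NEWUTS":    0x04000000,  # hostname and domain name
    "CLONE_NEWIPC":    0x08000000,  # inter-process communication objects
    "CLONE_NEWUSER":   0x10000000,  # user and group IDs
    "CLONE_NEWPID":    0x20000000,  # process IDs
    "CLONE_NEWNET":    0x40000000,  # network devices, addresses, ports
}


def combine(*names):
    """OR the named flags together (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= CLONE_NEW_FLAGS[name]
    return mask


def describe(mask):
    """Return the names of the CLONE_NEW* flags set in mask (hypothetical helper)."""
    return [name for name, bit in CLONE_NEW_FLAGS.items() if mask & bit]


if __name__ == "__main__":
    mask = combine("CLONE_NEWUTS", "CLONE_NEWNS")
    print(hex(mask), describe(mask))
```

Each flag is a single bit, which is why asking for several namespaces at once is just a bitwise OR.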



33




34

For example the parameter CLONE_NEWUTS tells the operating system that:

1. Our newly created process can call sethostname(), and doing so, instead of changing the hostname for the whole OS, only records a hostname for that namespace.

2. So when the process later calls gethostname(), it gets back whatever was set through this namespace’s sethostname().

So unlike all of its cousins and parents this process thinks the name of the machine it is running on is different.

We tricked it! (remember the part about lying?)
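A minimal sketch of that trick in Python, not something shown in the talk. It assumes a Linux kernel that allows unprivileged user namespaces (CLONE_NEWUSER is added only so that an ordinary user gets the right to call sethostname() in the new UTS namespace), it calls unshare() through ctypes, and the function name and hostname are made up. Where the kernel refuses, the function simply returns None.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # from <linux/sched.h>
CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>


def hostname_in_new_uts_namespace(new_name):
    """Fork a child, give it a private UTS namespace and rename the host there.

    Returns the hostname the child ends up seeing, or None if the kernel
    did not let us create the namespaces (hypothetical helper).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the only process whose world gets smaller
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)  # only affects this namespace
                os.write(write_fd, socket.gethostname().encode())
        except Exception:
            pass  # anything going wrong just means "nothing to report"
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen or None


if __name__ == "__main__":
    print("parent sees:", socket.gethostname())
    print("child saw:  ", hostname_in_new_uts_namespace("tiny-world"))
    print("parent still sees:", socket.gethostname())
```

The parent's hostname is the same before and after the call: only the child was lied to.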


Setting up namespaces


35

So, we create a new process and attach a namespace to it in one of three ways: at its creation, with the flags we pass to clone(); later, using the unshare() system call, which moves the calling process into fresh namespaces for some of the namespaced resources; or using the setns() system call, which makes the calling process join a namespace that already exists.




36

Having a different machine name per process is cool.

But not that useful right? That is not a container.

What else can we isolate?


Isolating the file-system


37

As far as containers are concerned the most important thing is the file-system. This is done through CLONE_NEWNS.

1. First we create the new mount namespace.

2. We can then unmount the stuff from the parent namespace and mount the various things we need to mount in our target dir (we want to get to a usable root file system).

3. Run `pivot_root $TARGETDIR $PUT_OLD` (pivot_root takes a second argument, a directory under the new root where the old root gets parked) and voilà!

We can have different mounts and isolate parts of the file-system! As a side note, doing stuff like mounting requires “capabilities”, in this case CAP_SYS_ADMIN. More often than not these are going to have been dropped. So this is not always trivial.




38

We can decide what mounts are going to be shared from the “host”. We can totally decide that /var/lib is going to be common. Nothing disallows this.

We can use some crazy layered file system (like AUFS or OverlayFS) which will allow us to mix stuff, some coming from the underlying OS and some ‘overridden’ just for our namespace.

Now, “container runtimes” like Docker, LXC or runc are a lot about preparing an image of a filesystem that can be mounted in a way that a process could run. If you look at the OCI (Open Container Initiative), it has two specs: one for this, the file system image, and one for the runtime.


Isolating Inter-Process Communications


39

With CLONE_NEWIPC we limit our process’s ability to send and receive messages to processes that share the same namespace.

We don’t want our nice isolated process to talk with strangers right?


Isolate Process IDs!

With CLONE_NEWPID the process gets its own PID namespace. This is how, when you run ps -aux, you only see processes in your own namespace and its children (the PIDs won’t match the ones seen from outside. This is complex).

Oops, I forgot to tell you, namespaces are hierarchical. Which is triple fun. So yes, containers can run inside other containers ad infinitum (really up to 32 levels, but, well, you know, details).


40


Isolating the Network

With CLONE_NEWNET the process gets its own network stack. This is how your container gets its own IP. Yay, now it is a big boy.

(We won’t get into this, but this is also where a lot of suffering will happen. Remember, from the kernel’s perspective this is just another interface. We will need either to use NAT, weird bridging or some creative uses of iptables to make sense of things. And this is clearly where we see how higher-level abstractions are a necessity.)


41


Isolate User and group IDs

This is done with CLONE_NEWUSER, and it is oh so important for unprivileged containers.

Yes! Linux supports doing all of this from userspace, without being root.

This basically means that the uid running inside does not exist outside. And that your process can feel blessedly aloof.


42


man namespaces: USER_NAMESPACES(7)


43

A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.

This means quantum-state rootness! You are root and unprivileged at the same time!
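The bookkeeping behind this double life lives in /proc/<pid>/uid_map (and gid_map), where each line reads "first ID inside, first ID outside, length of the range". Here is a rough Python sketch of the translation, with made-up helper names and a made-up example map in which root inside the namespace is really uid 1000 outside.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def outside_id(inside, id_map):
    """Translate an ID as seen inside the namespace to the ID seen outside.

    Returns None when the inside ID is not covered by any mapped range.
    """
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside < inside_start + length:
            return outside_start + (inside - inside_start)
    return None


if __name__ == "__main__":
    # Hypothetical map: uid 0 inside corresponds to uid 1000 outside, range of 1.
    example = parse_id_map("         0       1000          1\n")
    print(outside_id(0, example))  # root inside is an ordinary user outside
    print(outside_id(1, example))  # unmapped
```

On a live system the same parser can be pointed at the content of /proc/self/uid_map.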


man namespaces: USER_NAMESPACES(7)


44

Each process is a member of exactly one user namespace. A process created via fork or clone without the CLONE_NEWUSER flag is a member of the same user namespace as its parent. A single-threaded process can join another user namespace with setns if it has the CAP_SYS_ADMIN in that namespace; upon doing so, it gains a full set of capabilities in that namespace.


Almost last, but not least. Isolate resources!

This is where this ties in to the earlier mechanism we were talking about, cgroups.

cgroups are what allow us to limit the resource usage of the process (and its children), in terms of memory, CPU usage and IO. CLONE_NEWCGROUP gives the process its own view of the cgroup hierarchy, rooted at its own cgroup.


45


Really last: isolate all the things and the Kernel.

This is of unholy complexity. Short story: Linux used to be mostly all or nothing. User 0 vs. the others. Now you have capabilities. A long list of capabilities. Which you can now go and set per process. And you have stuff like seccomp and seccomp-bpf to restrict which system calls a process can make.

And you can use a bunch of modules and kernel patch sets to make everything more robust. Like SELinux. grsecurity. Or AppArmor.


46


seccomp


47

Seccomp is a mechanism in the Linux kernel that allows a process to make a one-way transition to a restricted state where it can only perform a limited set of system calls.

If a process attempts any other system calls, it is killed via a SIGKILL signal. In its most restrictive mode, seccomp prevents all system calls other than read(), write(), _exit(), and sigreturn().

This would allow a program to initialize and then drop into a restricted mode where it could only read from/write to already-opened files.
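A hedged sketch of the strict mode in Python, not from the talk. It assumes Linux with seccomp enabled and goes through prctl() via ctypes; the function name is made up. A forked child switches itself into strict mode and then attempts getppid(), which is not on the allow list, so the parent should see it die from SIGKILL. On systems where the prctl() call is refused, the child just survives.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22      # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1  # from <linux/seccomp.h>


def fate_of_child_in_strict_mode():
    """Return "killed" if a child in strict seccomp mode died from SIGKILL,
    otherwise "survived" (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:  # child
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            # One-way transition: there is no call to leave strict mode.
            libc.prctl(
                PR_SET_SECCOMP,
                ctypes.c_ulong(SECCOMP_MODE_STRICT),
                ctypes.c_ulong(0),
                ctypes.c_ulong(0),
                ctypes.c_ulong(0),
            )
            os.getppid()  # a system call outside read/write/_exit/sigreturn
        except Exception:
            pass
        finally:
            os._exit(0)  # only reached if strict mode was not applied
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    return "survived"


if __name__ == "__main__":
    print(fate_of_child_in_strict_mode())
```

The fork matters: the restriction is applied to the child only, so the parent stays free to observe what happened.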


seccomp-bpf


48

If seccomp is a sledgehammer, seccomp-bpf is the fine-grained version that allows specifying a filter that is applied to every system call.


Looking under the hood

BTW, you get to have a nice pseudo filesystem with which you can interact to inspect and control these values.

Try:

$ sudo ls -lai /proc/8/ns/

$ cat /proc/800/cgroup
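The same peek can be done from a program. A small Python sketch, with made-up function names, that reads the namespace links of a process under /proc/<pid>/ns and parses the "hierarchy-ID:controllers:path" lines of /proc/<pid>/cgroup. It returns an empty result where /proc is missing or unreadable.

```python
import os


def namespaces_of(pid="self"):
    """Map namespace type to its identifier, e.g. {'uts': 'uts:[4026531838]'}."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            result[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        pass  # no /proc here, or not allowed to look at that process
    return result


def parse_cgroup_file(text):
    """Parse the content of /proc/<pid>/cgroup into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            (int(hierarchy_id), controllers.split(",") if controllers else [], path)
        )
    return entries


if __name__ == "__main__":
    for name, ident in namespaces_of().items():
        print(name, "->", ident)
    try:
        with open("/proc/self/cgroup") as handle:
            print(parse_cgroup_file(handle.read()))
    except OSError:
        print("no /proc/self/cgroup on this system")
```

Two processes are in the same namespace exactly when the identifiers behind these links match.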


49


Unlike other isolation techniques (Solaris Zones, BSD Jails, VMs) this is an emergent thing


50

This is not a “first class” citizen. This was not designed. Different projects assemble different types of isolation that have different semantics from all of these elements.

● Docker is about packaging a single executable.

● LXC wants to give you what feels like a virtual machine.

● Firejail is there as a sandbox to run stuff you don’t trust. GUI much.

And this is a recent thing, user namespaces appeared in Kernel release 3.8 on 18 Feb 2013



51


Quite far away from our intuitive abstraction popularized by Docker™


52


Everything in this world is “race-condition” prone, and much of it, because of the mix of tooling, is complex and hard.

Creating a Linux Container or “containerization” is using these different mechanisms together in a coherent way so as to have the end result “feel” as if the process you are running is in an isolated machine.

A container runtime is a packaging of the above to make it simple.

The signatures and semantics of cgroups, namespaces and seccomp are different.


53


Container runtimes try to turn something that in reality looks more like this


54


Into our abstract image


55


Containers as an abstraction

When you think about all these low-level knobs we can control (the machine name, the network interfaces, the file-system, the users etc.) you see something else emerging.

When we define how to “containerize” a piece of software we are extracting its contract.

We are defining the minimal subset of resources it needs.

And what is the minimal understanding of that piece of software that the runtime requires to reliably run it.


56


The simple Docker Contract

There were other isolation techniques before Docker. But because it exposed such a simple contract it gained the incredible traction it had.

According to Docker the contract of a piece of software was:

● A base image (a state of a file-system). Itself can be layered.

● A working directory.

● A build step (which was basically a bash script).

● A TCP port exposed to the world.

● Environment variables.

● A command to run.
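To see how small that contract is, here is an illustrative Python sketch (not from the talk) that models the six items as a data structure and renders them with the Dockerfile instructions FROM, WORKDIR, ENV, RUN, EXPOSE and CMD. The class name, field names and example values are all made up.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DockerContract:
    """The six things the simple contract asks for (hypothetical model)."""
    base_image: str                                   # a state of a file-system
    workdir: str                                      # a working directory
    build_steps: list = field(default_factory=list)   # basically a bash script
    exposed_port: int = 80                            # a TCP port exposed to the world
    env: dict = field(default_factory=dict)           # environment variables
    command: list = field(default_factory=list)       # a command to run

    def to_dockerfile(self):
        """Render the contract as Dockerfile text."""
        lines = [f"FROM {self.base_image}", f"WORKDIR {self.workdir}"]
        lines += [f"ENV {key}={value}" for key, value in self.env.items()]
        lines += [f"RUN {step}" for step in self.build_steps]
        lines.append(f"EXPOSE {self.exposed_port}")
        lines.append("CMD " + json.dumps(self.command))
        return "\n".join(lines)


if __name__ == "__main__":
    contract = DockerContract(
        base_image="debian:stable",
        workdir="/app",
        build_steps=["make"],
        exposed_port=8080,
        env={"MODE": "prod"},
        command=["./server"],
    )
    print(contract.to_dockerfile())
```

Note what has no field here: nothing says what the software depends on, what data it starts with, or how it relates to other containers.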


57


Choosing a contract

The incredible success it had shows that the Docker software and the Docker contract were good enough. And good enough is good. Sometimes great.

At Platform.sh we run a container-based PaaS and we chose not to use Docker.

● Partly because the nuts-and-bolts at the time didn’t fit (it was too new/buggy for production in 2013/14). No User namespaces until two months ago. No Immutability. Weird networking.

● Partly because we thought the contract wasn’t correct for our use-case.


58


Is it an efficient contract?

● The idea of mutable, layered base-images made creating the first generation of Docker containers easy. Which explains a lot of its popularity. So yes.

● But it is a messy thing. This is something Docker has advanced on by allowing immutable containers. Still, the default is that the container is mutable. And this is what the ecosystem looks like.

● Build-oriented, reproducible, semantic base-images allow for orders of magnitude better memory utilisation through deduplication, and orders of magnitude simpler operations. This is not something you can bolt on easily later. There is still strong inertia here.


59


For some software (most software we cared about) this contract doesn’t really make sense. Not in the long run. Not at scale.

In order to be useful, the contract that describes software also needs to describe:

○ How to build it

○ Everything it depends on (you can’t run WordPress without MySQL)

○ Its initial data structures (you can’t run WordPress without some data in the MySQL)

○ Its basic configuration (most software needs to understand some things about its place in the world)


60


These days there are a billion and one projects that add those capabilities

○ And first, of course, the Kubernetes ecosystem.

○ But using 30 different tools strung together doesn’t scream “abstraction” to us, but more like DIY mess. And it hardly answers the questions:

■ What is the minimal subset of resources an app needs?

■ How can we make it run, reliably?


61


The obligatory XKCD 435


62

○ If our intuition is correct, and the minimal viable contract to run “arbitrary” software contains these other things, if the useful level to reason about software is the molecule, not the atom, then we need an Organic Chemistry set. Not a physics set.

○ It doesn’t mean physics is wrong. Or that Docker is bad software.


What would be a perfect contract for us?

● RO / immutable base-image that is not opaque

○ A semantic representation of system-libraries (with lock files)

○ A reproducible, semantic, build system (with lock files)

○ Potentially, a build step (which can basically be a bash script).

● RW / mutable base-image (mutable state) - which is Content Addressable

● Mapping of working directories to the RW image.

● A list of exposed network protocols and their parameters

● Build time environment variables / Run time environment variables

● Relationships to other containers, which should be semantic themselves (some containers make no sense, would not run, without a database).

● The capability to understand change (diff as part of the model).


63


Why are abstractions important?

● Because we chose a container description system that did not depend on the containerization method, we can swap out that part later, and this is a domain where everything moves fast. Shiny new becomes legacy in 6 months.

○ Our reproducible build system can create our base LXC systems (which we use in production), our VMs (which we also deploy when we need higher levels of isolation) or Docker images (which we use in our GitLab-based CI system).

● Because we went for Read-Only containers separated from the R/W mounts, we have gained factors in terms of density because of the level of memory deduplication.


64


Why are abstractions important?

● Because we are describing the “minimal application” not as a single process but as a graph, and because we understand the protocol layer interactions and what writes where to disk, we can have consistent operations over the cluster that are fast and safe.

● Which also means we do not suffer from the same limitations around running persistent services.

● It is easier to implement HA primitives when you understand who is writing to the disk and how, who has what ports opened etc.

● When your base system is not .yaml but .yaml + git, and when your .yaml represents something that has meaning, you can implement change with much less friction.


65

Platform.sh can clone an arbitrarily complex production cluster in less than a minute.

With all of the data.

To create ephemeral staging clusters on the fly.

Every branch gets a url with basically fail-proof deployments.

Git-driven infrastructure

With a single git push you can deploy an arbitrarily complex cluster (with micro-services, message queues and the lot).

Backup means a consistent point-in-time snapshot of the whole shebang.

Automatically redundant architecture

High-Performance, automatic high-availability



69


There is no container but the cluster


70

● This is a bonus slide in case I didn’t run out of time, which is fun as I had 66 slides for 30 minutes.

● At the beginning of our project we used the word Cluster to describe, well, half of the different primitives we had. But then it all became murky. So we started calling stuff Cluster, Kluster and Claster. Which stuck for a little bit but faded back again.

● Now cluster is back in all its glory, and a bit like with Hebrew, my mother tongue, people just seem to be able to guess the correct meaning of cluster from the context.

● Oh, we should really refresh that cluster.


I am @OriPekelman everywhere


71

Questions ?

https://platform.sh
