
Clustered computing with CoreOS, fleet and etcd

DEVIEW 2014

Jonathan Boulle, CoreOS, Inc.

@baronboulle

jonboulle

Who am I?

Jonathan Boulle

South Africa -> Australia -> London -> San Francisco

Red Hat -> Twitter -> CoreOS

Linux, Python, Go, FOSS

@baronboulle

jonboulle

Agenda

● CoreOS Linux
  – Securing the internet
  – Application containers
  – Automatic updates
● fleet
  – cluster-level init system
  – etcd + systemd
● fleet and...
  – systemd: the good, the bad
  – etcd: the good, the bad
  – Golang: the good, the bad
● Q&A

CoreOS Linux

A minimal, automatically-updated Linux distribution, designed for distributed systems.

Why?

● CoreOS mission: “Secure the internet”
● Status quo: set up a server and never touch it
● Internet is full of servers running years-old software with dozens of vulnerabilities
● CoreOS: make updating the default, seamless option
  – Regular
  – Reliable
  – Automatic

How do we achieve this?

● Containerization of applications
● Self-updating operating system
● Distributed systems tooling to make applications resilient to updates

[Diagram: traditional stack (KERNEL, SYSTEMD, SSH, DOCKER; PYTHON, JAVA, NGINX, MYSQL, OPENSSL) with the APP tangled into the distro]

Application Containers (e.g. Docker)

[Diagram: minimal CoreOS host (KERNEL, SYSTEMD, SSH, DOCKER)]
- Minimal base OS (~100MB)
- Vanilla upstream components wherever possible
- Decoupled from applications
- Automatic, atomic updates

Automatic updates

How do updates work?
● Omaha protocol (check-in/retrieval)
  – Simple XML-over-HTTP protocol developed by Google to facilitate polling and pulling updates from a server

Omaha protocol

1. Client sends application id and current version to the update server:

<request protocol="3.0" version="CoreOSUpdateEngine-0.1.0.0">
  <app appid="{e96281a6-d1af-4bde-9a0a-97b76e56dc57}" version="410.0.0" track="alpha" from_track="alpha">
    <event eventtype="3"></event>
  </app>
</request>

2. Update server responds with the URL of an update to be applied:

<url codebase="https://commondatastorage.googleapis.com/update-storage.core-os.net/amd64-usr/452.0.0/"></url>
<package hash="D0lBAMD1Fwv8YqQuDYEAjXw6YZY=" name="update.gz" size="103401155" required="false">

3. Client downloads data, verifies hash & cryptographic signature, and applies the update.

4. Updater exits with response code, then reports the update to the update server.
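To make the check-in step concrete, here is a minimal Go sketch of the client side of such an exchange. It is an illustration only, not CoreOS's actual update_engine; the trimmed-down XML struct and the update-server URL are assumptions for the example.

package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"net/http"
)

// omahaRequest models only the fields shown in the request above.
type omahaRequest struct {
	XMLName  xml.Name `xml:"request"`
	Protocol string   `xml:"protocol,attr"`
	App      struct {
		AppID   string `xml:"appid,attr"`
		Version string `xml:"version,attr"`
		Track   string `xml:"track,attr"`
	} `xml:"app"`
}

func main() {
	req := omahaRequest{Protocol: "3.0"}
	req.App.AppID = "{e96281a6-d1af-4bde-9a0a-97b76e56dc57}"
	req.App.Version = "410.0.0"
	req.App.Track = "alpha"

	body, err := xml.Marshal(req)
	if err != nil {
		panic(err)
	}

	// Hypothetical update-server endpoint; the real endpoint is part of the
	// CoreOS update service configuration.
	resp, err := http.Post("https://updates.example.com/v1/update/", "text/xml", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("update server replied:", resp.Status)
}

The client would then parse the <url> and <package> elements out of the response, exactly as in the server reply shown above.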

Automatic updates

How do updates work?
● Omaha protocol (check-in/retrieval)
  – Simple XML-over-HTTP protocol developed by Google to facilitate polling and pulling updates from a server
● Active/passive read-only root partitions
  – One for running the live system, one for updates

Active/passive root partitions

1. Booted off partition A. Download update and commit to partition B. Change GPT.
2. Reboot (into partition B). If tests succeed, continue normal operation and mark success in GPT.

But what if partition B fails update tests?

3. Change GPT to point to previous partition, reboot. Try update again later.

Active/passive /usr partitions

core-01 ~ # cgpt show /dev/sda3
       start        size    contents
      264192     2097152    Label: "USR-A"
                            Type: Alias for coreos-rootfs
                            UUID: 7130C94A-213A-4E5A-8E26-6CCE9662
                            Attr: priority=1 tries=0 successful=1

core-01 ~ # cgpt show /dev/sda4
       start        size    contents
     2492416     2097152    Label: "USR-B"
                            Type: Alias for coreos-rootfs
                            UUID: E03DD35C-7C2D-4A47-B3FE-27F15780A
                            Attr: priority=2 tries=1 successful=0

Active/passive /usr partitions

● Single image containing most of the OS
  – Mounted read-only onto /usr
  – / is mounted read-write on top (persistent data)
  – Parts of /etc generated dynamically at boot
  – A lot of work moving default configs from /etc to /usr

# /etc/nsswitch.conf:
passwd: files usrfiles
shadow: files usrfiles
group:  files usrfiles

/usr/share/baselayout/passwd
/usr/share/baselayout/shadow
/usr/share/baselayout/group

Atomic updates

● Entire OS is a single read-only image

core-01 ~ # touch /usr/bin/foo
touch: cannot touch '/usr/bin/foo': Read-only file system

  – Easy to verify cryptographically
    ● sha1sum on AWS or bare metal gives the same result
  – No chance of inconsistencies due to partial updates
    ● e.g. pull a plug on a CentOS system during a yum update
    ● At large scale, such events are inevitable

Automatic, atomic updates are great! But...

● Problem:
  – updates still require a reboot (to use new kernel and mount new filesystem)
  – Reboots cause application downtime...
● Solution: fleet
  – Highly available, fault tolerant, distributed process scheduler
● ... The way to run applications on a CoreOS cluster
  – fleet keeps applications running during server downtime

fleet – the “cluster-level init system”

fleet is the abstraction between machine and application:
  – an init system manages processes on a machine
  – fleet manages applications on a cluster of machines

Similar to Mesos, but a very different architecture (e.g. based on etcd/Raft, not ZooKeeper/Paxos)

Uses systemd for machine-level process management, etcd for cluster-level co-ordination

fleet – low level view

fleetd binary (running on all CoreOS nodes)
  – encapsulates two roles:
    • engine (cluster-level unit scheduling – talks to etcd)
    • agent (local unit management – talks to etcd and systemd)

fleetctl command-line administration tool
  – create, destroy, start, stop units
  – Retrieve current status of units/machines in the cluster

HTTP API

fleet – high level view

[Diagram: single machine vs. cluster]

systemd

Linux init system (PID 1) – manages processes
  – Relatively new, replaces SysVinit, upstart, OpenRC, ...
  – Being adopted by all major Linux distributions

Fundamental concept is the unit
  – Units include services (e.g. applications), mount points, sockets, timers, etc.
  – Each unit configured with a simple unit file

Quick comparison

fleet + systemd

systemd exposes a D-Bus interface
  – D-Bus: message bus system for IPC on Linux
  – One-to-one messaging (methods), plus pub/sub abilities

fleet uses godbus to communicate with systemd
  – Sending commands: StartUnit, StopUnit
  – Retrieving current state of units (to publish to the cluster)
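As a rough illustration of what that D-Bus interaction looks like, here is a minimal godbus sketch that asks systemd to start a unit. This is a sketch, not fleet's actual code (fleet goes through the go-systemd wrapper), and hello.service is just an example unit name.

package main

import (
	"fmt"

	"github.com/godbus/dbus"
)

func main() {
	// systemd (PID 1) is reachable on the system bus.
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}

	// The systemd manager object lives at a well-known name and path.
	manager := conn.Object("org.freedesktop.systemd1", dbus.ObjectPath("/org/freedesktop/systemd1"))

	// StartUnit(name, mode) queues a start job and returns the job's object path.
	var job dbus.ObjectPath
	call := manager.Call("org.freedesktop.systemd1.Manager.StartUnit", 0, "hello.service", "replace")
	if call.Err != nil {
		panic(call.Err)
	}
	if err := call.Store(&job); err != nil {
		panic(err)
	}
	fmt.Println("queued job:", job)
}

StopUnit works the same way, and unit state can be read back through properties on the corresponding unit object.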

systemd is great!

Automatically handles:
  – Process daemonization
  – Resource isolation/containment (cgroups)
    • e.g. MemoryLimit=512M
  – Health-checking, restarting failed services
  – Logging (journal)
    • applications can just write to stdout, systemd adds metadata
  – Timers, inter-unit dependencies, socket activation, ...

fleet + systemd

systemd takes care of things so we don't have to

fleet configuration is just systemd unit files

fleet extends systemd to the cluster-level, and adds some features of its own (using [X-Fleet])
  – Template units (run n identical copies of a unit)
  – Global units (run a unit everywhere in the cluster)
  – Machine metadata (run only on certain machines)

systemd is... not so great

Problem: unreliable pub/sub
  – fleet agent initially used a systemd D-Bus subscription to track unit status
  – Every change in unit state in systemd triggers an event in fleet (e.g. “publish this new state to the cluster”)
  – Under heavy load, or byzantine conditions, unit state changes would be dropped
  – As a result, unit state in the cluster became stale

systemd is... not so great

Problem: unreliable pub/sub
Solution: polling for unit states
  – Every n seconds, retrieve state of units from systemd, and synchronize with cluster
  – Less efficient, but much more reliable
  – Optimize by caching state and only publishing changes
  – Any state inconsistencies are quickly fixed
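The poll-and-diff idea is simple enough to sketch in a few lines of Go. This is not fleet's actual agent code; fetchUnitStates and publishToCluster are hypothetical stand-ins for the D-Bus query and the etcd write.

package main

import "time"

// fetchUnitStates would ask systemd (over D-Bus) for the current state of every unit.
func fetchUnitStates() map[string]string {
	return map[string]string{"hello.service": "running"}
}

// publishToCluster would write a single unit's state into etcd.
func publishToCluster(unit, state string) {}

func main() {
	cache := map[string]string{}
	for range time.Tick(5 * time.Second) {
		current := fetchUnitStates()
		for unit, state := range current {
			// Only publish states that changed since the last poll.
			if cache[unit] != state {
				publishToCluster(unit, state)
			}
		}
		cache = current
	}
}

Because the full state is re-read on every tick, anything missed in between is picked up on the next pass, which is exactly the property the pure pub/sub approach lacked.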

systemd (and docker) are... not so great

Problem: poor integration with Docker
  – Docker is the de facto application container manager
  – Docker and systemd do not always play nicely together...
  – Both Docker and systemd manage cgroups and processes, and when the two are trying to manage the same thing, the results are mixed

systemd (and docker) are... not so great

Example: sending signals to a container
  – Given a simple container:

[Service]
ExecStart=/usr/bin/docker run busybox /bin/bash -c \
  "while true; do echo Hello World; sleep 1; done"

  – Try to kill it with systemctl kill hello.service
  – ... Nothing happens
  – The kill command sends SIGTERM, but bash is PID 1 inside the Docker container, and it happily ignores the signal...

systemd (and docker) are... not so great

Example: sending signals to a container

OK, SIGTERM didn't work, so escalate to SIGKILL:

systemctl kill -s SIGKILL hello.service

● Now the systemd service is gone:

hello.service: main process exited, code=killed, status=9/KILL

● But... the Docker container still exists?

# docker ps
CONTAINER ID   COMMAND                STATUS          NAMES
7c7cf8ffabb6   /bin/sh -c 'while tr   Up 31 seconds   hello

# ps -ef | grep '[d]ocker run'
root  24231  1  0 03:49 ?  00:00:00 /usr/bin/docker run -name hello ...

systemd (and docker) are... not so great

Why?
  – The Docker client does not run containers itself; it just sends a command to the Docker daemon, which actually forks and starts running the container
  – systemd expects processes to fork directly so they will be contained under the same cgroup tree
  – Since the Docker daemon's cgroup is entirely separate, systemd cannot keep track of the forked container

systemd (and docker) are... not so great

# systemctl cat hello.service
[Service]
ExecStart=/bin/bash -c 'while true; do echo Hello World; sleep 1; done'

# systemd-cgls
...
├─hello.service
│ ├─23201 /bin/bash -c while true; do echo Hello World; sleep 1; done
│ └─24023 sleep 1

Compare with the same service wrapping a Docker container:

# systemctl cat hello.service
[Service]
ExecStart=/usr/bin/docker run -name hello busybox /bin/sh -c \
  "while true; do echo Hello World; sleep 1; done"

# systemd-cgls
...
│ ├─hello.service
│ │ └─24231 /usr/bin/docker run -name hello busybox /bin/sh -c while true; do echo Hello World; sleep 1; done
...
│ ├─docker-51a57463047b65487ec80a1dc8b8c9ea14a396c7a49c1e23919d50bdafd4fefb.scope
│ │ ├─24240 /bin/sh -c while true; do echo Hello World; sleep 1; done
│ │ └─24553 sleep 1

systemd (and docker) are... not so great

Problem: poor integration with Docker
Solution: ... work in progress
  – systemd-docker – small application that moves cgroups of Docker containers back under systemd's cgroup
  – Use Docker for image management, but systemd-nspawn for runtime (e.g. CoreOS's toolbox)
  – (proposed) Docker standalone mode: client starts container directly rather than through daemon

fleet – high level view

[Diagram: single machine vs. cluster]

etcd

A consistent, highly available key/value store
  – Shared configuration, distributed locking, ...

Driven by Raft
  – Consensus algorithm similar to Paxos
  – Designed for understandability and simplicity

Popular and widely used
  – Simple HTTP API + libraries in Go, Java, Python, Ruby, ...

etcd connects CoreOS hosts

fleet + etcd

● fleet needs a consistent view of the cluster to make scheduling decisions: etcd provides this view
  – What units exist in the cluster?
  – What machines exist in the cluster?
  – What are their current states?
● All unit files, unit state, machine state and scheduling information is stored in etcd

etcd is great!

● Fast and simple API
● Handles all cluster-level/inter-machine communication so we don't have to
● Powerful primitives:
  – Compare-and-Swap allows for atomic operations and implementing locking behaviour
  – Watches (like pub-sub) provide event-driven behaviour
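For example, a watch in etcd's (v2) HTTP API is just a long-polling GET. The Go sketch below assumes a local etcd listening on the then-default client port 4001 and an arbitrary key directory; it blocks until something under that directory changes and then prints the event.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// ?wait=true turns the GET into a watch: it returns when the next change happens.
	resp, err := http.Get("http://127.0.0.1:4001/v2/keys/example/dir?wait=true&recursive=true")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	event, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println("change observed:", string(event))
}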

etcd is... not so great

● Problem: unreliable watches
  – fleet initially used a purely event-driven architecture
  – watches in etcd used to trigger events
    ● e.g. “machine down” event triggers unit rescheduling
  – Highly efficient: only take action when necessary
  – Highly responsive: as soon as a user submits a new unit, an event is triggered to schedule that unit
  – Unfortunately, many places for things to go wrong...

Unreliable watches

● Example:
  – etcd is undergoing a leader election or is otherwise unavailable; watches do not work during this period
  – Change occurs (e.g. a machine leaves the cluster)
  – Event is missed – fleet doesn't know the machine is lost!
  – Now fleet doesn't know to reschedule units that were running on that machine

etcd is... not so great

● Problem: limited event history
  – etcd retains a history of all events that occur
  – Can “watch” from an arbitrary point in the past, but...
  – History is a limited window!
  – With a busy cluster, watches can fall out of this window
  – Can't always replay event stream from the point we want to

Limited event history

● Example:
  – etcd holds history of last 1000 events
  – fleet sets watch at i=100 to watch for machine loss
  – Meanwhile, many changes occur in other parts of the keyspace, advancing index to i=1500
  – Leader election/network hiccup occurs and severs watch
  – fleet tries to recreate watch at i=100 and fails:

err="401: The event in requested index is outdated and cleared (the requested history has been cleared [1500/100])"

etcd is... not so great

● Problem: unreliable watches
  – Missed events lead to unrecoverable situations
● Problem: limited event history
  – Can't always replay entire event stream
● Solution: move to “reconciler” model

Reconciler model

In a loop, run periodically until stopped:

1. Retrieve current state (how the world is) and desired state (how the world should be) from datastore (etcd)
2. Determine necessary actions to transform current state --> desired state
3. Perform actions and save results as new current state

Example: fleet's engine (scheduler) looks something like:

for { // loop forever
	select {
	case <-stopChan: // if stopped, exit
		return
	case <-time.After(5 * time.Minute):
		units = fetchUnits()
		machines = fetchMachines()
		schedule(units, machines)
	}
}

etcd is... not so great

● Problem: unreliable watches
  – Missed events lead to unrecoverable situations
● Problem: limited event history
  – Can't always replay entire event stream
● Solution: move to “reconciler” model
  – Less efficient, but extremely robust
  – Still many paths for optimisation (e.g. using watches to trigger reconciliations)

fleet – high level view

[Diagram: single machine vs. cluster]

golang

● Standard language for all CoreOS projects (above OS)
  – etcd
  – fleet
  – locksmith (semaphore for reboots during updates)
  – etcdctl, updatectl, coreos-cloudinit, ...
● fleet is ~10k LOC (and another ~10k LOC tests)

Go is great!

● Fast!
  – to write (concise syntax)
  – to compile (builds typically <1s)
  – to run tests (O(seconds), including with race detection)
  – Never underestimate power of rapid iteration
● Simple, powerful tooling
  – Built-in package management, code coverage, etc.

Go is great!

● Rich standard library
  – “Batteries are included”
  – e.g.: completely self-hosted HTTP server, no need for reverse proxies or worker systems to serve many concurrent HTTP requests
● Static compilation into a single binary
  – Ideal for a minimal OS with no libraries
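To illustrate the “batteries included” point: a complete, concurrent HTTP server in Go is only a handful of lines (the handler and port below are arbitrary).

package main

import (
	"fmt"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello World")
	})
	// ListenAndServe serves each request in its own goroutine; no external
	// reverse proxy or worker pool is needed.
	http.ListenAndServe(":8080", nil)
}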

Go is... not so great

● Problem: managing third-party dependencies
  – modular package management but: no versioning
  – import “github.com/coreos/fleet” - which SHA?
● Solution: vendoring :-/
  – Copy entire source tree of dependencies into repository
  – Slowly maturing tooling: goven, third_party.go, Godep

Go is... not so great

● Problem: (relatively) large binary sizes
  – “relatively”, but... CoreOS is the minimal OS
  – ~10MB per binary, many tools, quickly adds up
● Solutions:
  – upgrading golang!
    ● go1.2 to go1.3 = ~25% reduction
  – sharing the binary between tools

Sharing a binary

● client/daemon often share much of the same code
  – Encapsulate multiple tools in one binary, symlink the different command names, switch off the command name
  – Example: fleetd/fleetctl

func main() {
	switch os.Args[0] {
	case "fleetctl":
		Fleetctl()
	case "fleetd":
		Fleetd()
	}
}

Before:
 9150032 fleetctl
 8567416 fleetd

After:
11052256 fleetctl
       8 fleetd -> fleetctl

Go is... not so great

● Problem: young language => immature libraries
  – CLI frameworks
  – godbus, go-systemd, go-etcd :-(
● Solutions:
  – Keep it simple
  – Roll your own (e.g. fleetctl's command line)

fleetctl CLI

type Command struct {
	Name        string
	Summary     string
	Usage       string
	Description string
	Flags       flag.FlagSet
	Run         func(args []string) int
}
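A hand-rolled CLI built around that struct can dispatch in just a few lines. The sketch below is illustrative only; the echo command is a made-up example, not part of fleetctl.

package main

import (
	"flag"
	"fmt"
	"os"
)

type Command struct {
	Name        string
	Summary     string
	Usage       string
	Description string
	Flags       flag.FlagSet
	Run         func(args []string) int
}

func main() {
	commands := []*Command{
		{
			Name:    "echo",
			Summary: "print its arguments (example command)",
			Run: func(args []string) int {
				fmt.Println(args)
				return 0
			},
		},
	}

	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: tool <command> [args...]")
		os.Exit(1)
	}
	// Look up the subcommand by name and run it, using its return value as the exit code.
	for _, c := range commands {
		if c.Name == os.Args[1] {
			os.Exit(c.Run(os.Args[2:]))
		}
	}
	fmt.Fprintf(os.Stderr, "unknown command: %s\n", os.Args[1])
	os.Exit(1)
}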

Wrap up/recap

CoreOS Linux
  – Minimal OS with cluster capabilities built-in
  – Containerized applications --> a[u]tom[at]ic updates

fleet
  – Simple, powerful cluster-level application manager
  – Glue between local init system (systemd) and cluster-level awareness (etcd)

golang++

Questions?

Thank you :-)

● Everything is open source – join us!
  – https://github.com/coreos
● Any more questions, feel free to email
  – jonathan.boulle@coreos.com

● CoreOS stickers!

References

● CoreOS updates: https://coreos.com/using-coreos/updates/
● Omaha protocol: https://code.google.com/p/omaha/wiki/ServerProtocol
● Raft algorithm: http://raftconsensus.github.io/
● fleet: https://github.com/coreos/fleet
● etcd: https://github.com/coreos/etcd
● toolbox: https://github.com/coreos/toolbox
● systemd-docker: https://github.com/ibuildthecloud/systemd-docker
● systemd-nspawn: http://0pointer.de/public/systemd-man/systemd-nspawn.html

Brief side-note: locksmith

● Reboot manager for CoreOS
● Uses a semaphore in etcd to co-ordinate reboots
● Each machine in the cluster:
  1. Downloads and applies update
  2. Takes lock in etcd (using Compare-And-Swap)
  3. Reboots and releases lock
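A rough Go sketch of that lock step against etcd's v2 HTTP API is below. The key name, the missing TTL handling and the local etcd address (then-default client port 4001) are assumptions for the example; the real locksmith is more involved (it supports, for instance, more than one concurrent lock holder).

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

const lockKey = "http://127.0.0.1:4001/v2/keys/example/reboot-lock"

// acquire atomically creates the lock key; it fails if another machine already holds it.
func acquire(machine string) (bool, error) {
	form := url.Values{"value": {machine}}
	req, err := http.NewRequest("PUT", lockKey+"?prevExist=false", strings.NewReader(form.Encode()))
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	// A 2xx means the key was created and we hold the lock; a 412 Precondition
	// Failed means some other machine already holds it.
	return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

// release deletes the lock key, but only if we still own it (compare on the stored value).
func release(machine string) error {
	req, err := http.NewRequest("DELETE", lockKey+"?prevValue="+url.QueryEscape(machine), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	ok, err := acquire("core-01")
	if err != nil {
		panic(err)
	}
	if !ok {
		fmt.Println("another machine holds the reboot lock; try again later")
		return
	}
	fmt.Println("lock held: safe to reboot; release after coming back up")
	if err := release("core-01"); err != nil {
		panic(err)
	}
}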
