How to swim with a whale Łukasz Siudut DevOpsKRK #9 2016-04-14


Page 1: How to swim with a whale

How to swim with a whale

Łukasz Siudut

DevOpsKRK #9

2016-04-14

Page 2: How to swim with a whale

What really makes Docker swim?

Page 3: How to swim with a whale

CGroups - hardware resource management

- CPU

- CPU pinning

- CPU accounting

- Memory limits

- Disk I/O access priority

- Devices access limitations

- net_cls - marking packets belonging to cgroup

- freezer - suspend/resume tasks in cgroup
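A minimal sketch for checking which cgroup a process ended up in, assuming a Linux host with cgroup v1 hierarchies. Each line of /proc/PID/cgroup has the form `hierarchy-ID:controller-list:path`. The helper name `cgroup_path` and its optional file argument are illustrative, not part of any tool.

```shell
#!/bin/sh
# Print the cgroup path of one controller (e.g. "memory") for a process.
# $1 = controller name
# $2 = file to read (defaults to /proc/self/cgroup)
# Hypothetical helper; line format is "hierarchy-ID:controller-list:path".
cgroup_path() {
    awk -F: -v ctrl="$1" '
        {
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == ctrl) {
                    sub(/^[^:]*:[^:]*:/, "")   # strip "ID:controllers:"
                    print
                    exit
                }
        }' "${2:-/proc/self/cgroup}"
}
```

For example, `cgroup_path memory /proc/1234/cgroup` (1234 being a placeholder PID) prints the memory cgroup of that process, or nothing if the controller is not listed.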

Page 4: How to swim with a whale

Namespaces - system resource management

- UTS - UNIX Timesharing System - hostname isolation

- IPC - Interprocess Communication isolation

- System V IPC objects

- POSIX message queues

- PID - process and process ID number isolation

- Network - resources associated with networking

- Mount - mountpoints isolation

- User - user and group ID number spaces (UID mapping)
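A small sketch for checking whether two processes share a namespace, assuming Linux, where /proc/PID/ns/TYPE is a symlink such as `uts:[4026531838]`. The names `same_ns` and `PROC_ROOT` are made up for this example; `PROC_ROOT` exists only so the helper can be pointed at a test tree.

```shell
#!/bin/sh
# Report whether two processes share a namespace of the given type
# (uts, ipc, pid, net, mnt or user).
# Identical symlink targets under /proc/PID/ns mean the same namespace.
# $1, $2 = PIDs to compare, $3 = namespace type.
# Returns 0 if shared, 1 if different, 2 if a link cannot be read.
same_ns() {
    root="${PROC_ROOT:-/proc}"
    a=$(readlink "$root/$1/ns/$3") || return 2
    b=$(readlink "$root/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}
```

Comparing a host shell with a process inside a container, for instance `same_ns $$ 1234 net` with a placeholder PID, should report a difference unless the container was started with `--net=host`. Reading the links of another user's process usually requires root.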

Page 5: How to swim with a whale

What can possibly go wrong?

Page 6: How to swim with a whale

What can possibly go wrong?

- Exceeding open files limits (sic!)

- Exceeding amount of interfaces in bridge

- Low performance storage engine

- Lack of inter-boxes routing

- Hanging docker master process

- Image building takes ages

- Logs stored in json? Srsly, docker?

- Random kernel panics

- Orphaned processes

Page 7: How to swim with a whale

What can possibly go wrong?

- If a remote TCP syslog server is down, docker does not start

- Docker daemon can send application/json type with invalid text payload

- Docker inspect gave default log options when the option is empty

- Networking or DNS not available immediately after container start

- EXPOSE and publish need to behave similarly with IPv4 and IPv6

- Container start fails, but volume remains mounted

- Cannot remove docker volume

- When you create a directory while bind-mounting it, it's owned by root

- error removing container (1.10, 1.11/master) with AUFS

- Cannot start container with low probability of repetition

- Need better handling/error path from tar/untar failures (especially in userns)

- Image layer contents are unstable

- Docker pushing too much

- Docker Volume not deleted after machine suspension

- Docker network in internal mode will not use IPv4 by default and is very slow.

- Streaming logs, json driver "Error streaming logs: unexpected EOF"

- Docker running on high CPU and crashes in the end

- Migration 1.9.1 to 1.10.3 failure: "Failed to register container XYZ: name is reserved" on RHEL7

Page 8: How to swim with a whale

What can possibly go wrong?

AND MOAR

Page 9: How to swim with a whale

Exceeding open files limit

Page 10: How to swim with a whale

Exceeding open files limit

# lsof -p `pidof docker` | wc -l
35
# docker run -tid alpine:3.3 cat -
# lsof -p `pidof docker` | wc -l
43
# docker run -tid alpine:3.3 cat -
# lsof -p `pidof docker` | wc -l
51
# docker run -tid alpine:3.3 cat -
# lsof -p `pidof docker` | wc -l
59
# docker run -tid alpine:3.3 cat -
# lsof -p `pidof docker` | wc -l
67
…
# docker run -tid alpine:3.3 cat -
# lsof -p `pidof docker` | wc -l

Page 11: How to swim with a whale

Exceeding open files limit

2016/04/12 14:20:21 http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 5ms
2016/04/12 14:20:21 http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 10ms
2016/04/12 14:20:21 http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 5ms

Page 12: How to swim with a whale

Exceeding open files limit

Each new container adds 8 open files to the Docker daemon:

1) /var/lib/docker/network/files/CONTAINER_ID.sock type=STREAM

2) net

3) /home/docker/containers/CONTAINER_ID/CONTAINER_ID.log

4) /dev/ptmx

5) /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.oom_control

6) [eventfd]

7) /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.oom_control

8) [eventfd]
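A back-of-the-envelope sketch of the descriptor budget. The figures 35 (daemon with no containers) and 8 (per container) come from the measurements on the slides; the limit of 1024 is an assumption, so check the real one with `grep 'Max open files' /proc/PID/limits`. Both function names are illustrative.

```shell
#!/bin/sh
# Count the descriptors a process currently holds (Linux /proc; reading
# another user's process requires root).
open_fds() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Estimate how many more containers fit under a descriptor limit.
# $1 = limit, $2 = descriptors in use now, $3 = descriptors per container.
containers_left() {
    echo $(( ($1 - $2) / $3 ))
}

# Slide figures: 35 in use, 8 per container. Assumed limit: 1024.
containers_left 1024 35 8   # prints 123
```

With those numbers the daemon stops accepting connections after roughly 123 containers, long before memory or CPU become a problem.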

Page 13: How to swim with a whale

Exceeding amount of interfaces in bridge

Page 14: How to swim with a whale

Exceeding amount of interfaces in bridge

WARN[0300] failed to cleanup ipc mounts:failed to umount /mnt/docker/containers/778954ce1188fd2deb1ac6bfbf82515e8a4d97284a8f100d7f9c07d6ab720472/shm: no such file or directory

ERRO[0300] Handler for POST /v1.22/containers/778954ce1188fd2deb1ac6bfbf82515e8a4d97284a8f100d7f9c07d6ab720472/start returned error: failed to create endpoint adoring_euler on network bridge: adding interface veth1dd191f to bridge docker0 failed: exchange full

WARN[0300] failed to cleanup ipc mounts:failed to umount /mnt/docker/containers/26b5ddbb4b0fe464768b74218699465c90e0c2292af540cfa77451b311c93514/shm: no such file or directory

ERRO[0300] Handler for POST /v1.22/containers/26b5ddbb4b0fe464768b74218699465c90e0c2292af540cfa77451b311c93514/start returned error: failed to create endpoint kickass_goldwasser6 on network bridge: adding interface veth2738bff to bridge docker0 failed: exchange full

WARN[0300] failed to cleanup ipc mounts:failed to umount /mnt/docker/containers/aad9becaa4823e7711aaea6144dc0278569a18f1285f040125c22713c1ee5045/shm: no such file or directory

ERRO[0300] Handler for POST /v1.22/containers/aad9becaa4823e7711aaea6144dc0278569a18f1285f040125c22713c1ee5045/start returned error: failed to create endpoint suspicious_gates on network bridge: adding interface vethf893b1c to bridge docker0 failed: exchange full

Page 15: How to swim with a whale

Exceeding amount of interfaces in bridge

DO YOU KNOW THE LIMIT?

1024

Page 16: How to swim with a whale

Exceeding amount of interfaces in bridge

#define BR_PORT_BITS 10

#define BR_MAX_PORTS (1<<BR_PORT_BITS)
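The two defines work out to 1 << 10 = 1024 ports per bridge. A short sketch of that arithmetic, plus a way to see how many ports a bridge already has, assuming the usual sysfs layout where each attached interface appears under /sys/class/net/BRIDGE/brif. The function names and the optional sysfs-root argument are illustrative.

```shell
#!/bin/sh
# BR_MAX_PORTS = 1 << BR_PORT_BITS, mirroring the kernel defines.
# $1 = number of port bits (defaults to 10).
br_max_ports() {
    echo $(( 1 << ${1:-10} ))
}

# Count the interfaces currently attached to a bridge.
# $1 = bridge name, $2 = sysfs net directory (defaults to /sys/class/net).
bridge_ports() {
    ls "${2:-/sys/class/net}/$1/brif" | wc -l | tr -d ' '
}
```

On a Docker host, comparing `bridge_ports docker0` with `br_max_ports` shows how close the default bridge is to the "exchange full" error.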

Page 17: How to swim with a whale

Exceeding amount of interfaces in bridge

- Put multiple containers in same network namespace

- OpenVSwitch

- Multiple networks for container groups

- Run containers without interface (yeah, sure)

Page 18: How to swim with a whale

Low performance disk engine

Page 19: How to swim with a whale

Low performance disk engine

- Devicemapper

- Sparse files (slow)

- Raw disk devices (faster, but slow)

- AUFS (requires custom kernel patch, unstable)

- OverlayFS (since kernel 3.18, immature)

- Btrfs - “production ready”

- VFS*

Page 20: How to swim with a whale

Docker storage drivers - devicemapper

- Developed by Red Hat and Docker to fulfill demands of customers

- Utilises in-kernel capabilities (dm-mapper + thin provisioning)

- Works on block device level thus doesn’t support page cache*

- Block size equals 64k, so small writes impact performance significantly

- Removing a file in a container does not free space on the host

- It’s memory hungry

- Writing new data is accomplished by allocate-on-demand

- Writing existing file uses copy-on-write operation on modified blocks

Page 21: How to swim with a whale

Docker storage drivers - AUFS

- Implements a union mount for Linux filesystems

- Idea is based on UnionFS

- First storage driver in use with Docker

- Works on top of another filesystem…

- … so adding multiple layers may consume a lot of inodes

- It also has to track the hierarchy tree, so it is memory consuming

- Works on file level - all AUFS CoW operations copy entire files…

- … however it happens only once

- It works on filesystem level, so page caching is utilised efficiently

- Deleted files are marked by a whiteout file (but they still exist in lower layers)

- I managed to crash Docker master process when using AUFS

Page 22: How to swim with a whale

Docker storage drivers - AUFS

mount("none", "/home/docker/aufs/mnt/1d5f54eef0ac66b82a007ed55141c377a4d947f3c4bb0d3f660b0ca10b184bb9", "aufs", 0, "br:/home/docker/aufs/diff/1d5f54eef0ac66b82a007ed55141c377a4d947f3c4bb0d3f660b0ca10b184bb9=rw:/home/docker/aufs/diff/1d5f54eef0ac66b82a007ed55141c377a4d947f3c4bb0d3f660b0ca10b184bb9-init=ro+wh:/home/docker/aufs/diff/4500e1e0e0f482e19f4f4626165c16a7107ccb3a2a54fbeeb52ce675e01d9841=ro+wh:/home/docker/aufs/diff/23450d8936d7685169bb1f17c49b523f759e569bc9b3e9a8b667634e233cf359=ro+wh:/home/docker/aufs/diff/984a940d62c9b682f0518fbe7412730d05a0d665814ed5253f93d1153adea37c=ro+wh:/home/docker/aufs/diff/eafbd7ec3933e09cca3625d1d48efc646509490ed15022b4c55917e700ee9a91=ro+wh:/home/docker/aufs/diff/1f9f739c84c72c69d5a48f5715e5eb4e4bb55ec7832d886bc71d4997662f0e20=ro+wh:/home/docker/aufs/diff/c028a82aa3ff93427df1327b50621a9c741f5d73f031bf3a49e21a29bb92b55c=ro+wh:/home/docker/aufs/diff/e79c2f7c809e4608a3712b3f6404d036ed294d45577d9770a7bfe2b7891eea34=ro+wh:/home/docker/aufs/diff/0ff5998bd3fa6c2e039fb6f9154d8faa0a3924c67e9b67052475f1685eb00a06=ro+wh:/home/docker/aufs/diff/2fdb0e1a146e2edf035fbee3030854d4911eea52158d5fcbcc57fd22c41503c3=ro+wh:/home/docker/aufs/diff/3f709244513aa38fbddb423d4e1b0bd1d51daa5d1448a3d3a52be9b90782886d=ro+wh:/home/docker/aufs/diff/301eb2b701883975c12282094e5ff4235d19ae924558c38de93d31051c01e064=ro+wh:/home/docker/aufs/diff/a4b54186305ced5a457467c44a7d3cb50f7bce45b13a718c23a4555d715c8d4d=ro+wh:/home/docker/aufs/diff/5234f32576bb6522c18676c5e8d274ea466f22a96c440e284ed927addbdc6a28=ro+wh:/home/docker/aufs/diff/b71baf15872ca55884daa5d17057fd40802240f6693e6ef79c8c2219d0e7c18e=ro+wh:/home/docker/aufs/diff/20272dd081b1ec8984438f29154342a9a8d89ba4683bc6c8029def6cbce5dd1b=ro+wh:/home/docker/aufs/diff/ad686c634be4f7bf1656db8208c1197d562136fd01182b2afbaaf9b42c99b524=ro+wh:/home/docker/aufs/diff/f0a30f2605bbeb8c3574393d1c215fab1d5512262931e686da5bca6323fb2fc9=ro+wh:/home/docker/aufs/diff/6283fbfe4d98f095eed8f039665b3bd0c72cd3c785cddfc96528a9e0b1f69a4c=ro+wh:/home/docker/aufs/diff/acb08f019cfcbc302a4f9df6f2fa30540285fa30e397cbfd1d1b1322c6aa36c5=ro+wh:/home/docker/aufs/diff/c5f1142781b6f053301bb11880c1cec448aef29bd5ddf4b8145189cb397e348a=ro+wh:/home/docker/aufs/diff/47eea3aa543710082284622dfcb3c3c19f8c2df80bef6fa15988d9e865f89c9e=ro+wh:/home/docker/aufs/diff/fe1231ddd1407a216250ae2a814a370bd660940db6046b7bea845b05e845775c=ro+wh:/home/docker/aufs/diff/0af574eab2d6f8e8b428a9d5d2e3f94cf383703d6194408c4067f22bc1d4b361=ro+wh,dio,xino=/dev/shm/aufs.xino,dirperm1" <unfinished…>

Page 23: How to swim with a whale

Docker storage drivers - AUFS

ERRO[0028] Error saving dying container to disk: open /mnt/docker/containers/00878fe1ad98a55c79295d6fbe152425873eb8714ded6c7ac443c36677bfb20c/config.v2.json: no such file or directory

ERRO[0028] Error removing mounted layer 00878fe1ad98a55c79295d6fbe152425873eb8714ded6c7ac443c36677bfb20c: rename /mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca /mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca-removing: device or resource busy

ERRO[0028] Handler for DELETE /v1.22/containers/00 returned error: Driver aufs failed to remove root filesystem 00878fe1ad98a55c79295d6fbe152425873eb8714ded6c7ac443c36677bfb20c: rename /mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca /mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca-removing: device or resource busy

Page 24: How to swim with a whale

Docker storage drivers - AUFS

# rm -rf /mnt/docker/*

rm: cannot remove '/mnt/docker/aufs/mnt/9f9f2c192645858a0ab1db63d4b29dc2ec4bbdfc130e218395e8d6cb4ff36d24': Device or resource busy

rm: cannot remove '/mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca': Device or resource busy

Page 25: How to swim with a whale

Docker storage drivers - AUFS

# rm -rf /mnt/docker/*

rm: cannot remove '/mnt/docker/aufs/mnt/9f9f2c192645858a0ab1db63d4b29dc2ec4bbdfc130e218395e8d6cb4ff36d24': Is a directory

rm: cannot remove '/mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca': Is a directory

Page 26: How to swim with a whale

Docker storage drivers - AUFS

# ls -l /mnt/docker/aufs/mnt/
ls: cannot access '/mnt/docker/aufs/mnt/9f9f2c192645858a0ab1db63d4b29dc2ec4bbdfc130e218395e8d6cb4ff36d24': Stale file handle
ls: cannot access '/mnt/docker/aufs/mnt/ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca': Stale file handle
total 0

Page 27: How to swim with a whale

Docker storage drivers - AUFS

d????????? ? ? ? ? ? 9f9f2c192645858a0ab1db63d4b29dc2ec4bbdfc130e218395e8d6cb4ff36d24
d????????? ? ? ? ? ? ad9829a80251e071ac2130f979a9e1253ae5a9a8b6ff0e09ff7ee31a111a31ca

Page 28: How to swim with a whale

Docker storage drivers - AUFS

# reboot

Page 29: How to swim with a whale

Docker storage drivers - OverlayFS

- Included in mainline kernel since 3.18

- Works similarly to AUFS (layers, page cache)

- Its design is simpler than AUFS (lowerdir, upperdir)

- Writing to files from lower layers issues copy_up operation (similar to AUFS)

- Deleting file creates whiteout file in upperdir

- Deleting directory creates opaque directory in upperdir (?)

- Implements only a subset of the POSIX standards

- I managed to crash Docker master process when using OverlayFS too :)

Page 30: How to swim with a whale

Docker storage drivers - OverlayFS

root:/etc# cat timezone
Etc/UTC
root:/etc# exec 3< timezone
root:/etc# exec 4<> timezone
root:/etc# read -n 4 <&4
root:/etc# echo -n . >&4
root:/etc# exec 4>&-
root:/etc# cat <&3
Etc/UTC
root:/etc# exec 3>&-
root:/etc# cat timezone
Etc/.TC

root:/etc# exec 3< timezone
root:/etc# exec 4<> timezone
root:/etc# read -n 5 <&4
root:/etc# echo -n . >&4
root:/etc# exec 4>&-
root:/etc# cat <&3
Etc/..C
root:/etc# cat timezone
Etc/..C
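The session above can be wrapped in a small probe. This is a sketch in POSIX sh (so `dd` stands in for bash's `read -n`), and `stale_fd_probe` is a made-up name. It prints what a read-only descriptor opened before the write sees afterwards. On a POSIX-compliant filesystem that is the new content. To reproduce the stale result from the first run, the file has to still live in a lower OverlayFS layer, which this sketch does not set up.

```shell
#!/bin/sh
# Replay the descriptor test against an arbitrary file.
# $1 = file to probe (its 5th byte is overwritten with ".").
# Runs in a subshell so descriptors 3 and 4 do not leak into the caller.
stale_fd_probe() (
    exec 3< "$1"                          # read-only descriptor, opened first
    exec 4<> "$1"                         # read-write open (copy_up on OverlayFS)
    dd bs=1 count=4 <&4 >/dev/null 2>&1   # advance fd 4 by 4 bytes
    printf . >&4                          # overwrite the 5th byte
    exec 4>&-
    cat <&3                               # what does the old descriptor see?
)
```

If the output differs from `cat FILE` run afterwards, the read-only descriptor is still attached to the old copy of the file.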

Page 31: How to swim with a whale

Docker storage drivers - Btrfs

- New generation CoW filesystem

- Present in kernel for some time already, considered “production ready”

- Native support for “layers” - subvolumes and snapshots

- Writing to new file invokes allocate-on-demand to allocate new data block

- Updating existing file invokes copy-on-write for new blocks -> little overhead

- Btrfs is prone to fragmentation due to CoW behavior

- Lot of small writes may

- Does not utilise page caching :(

- df is not reliable; use btrfs filesystem df instead

Page 32: How to swim with a whale
Page 33: How to swim with a whale

Docker storage drivers

# vmtouch data
Files: 1
Directories: 0
Resident Pages: 0/26214400 0/100G 0%
Elapsed: 0.38062 seconds

# for _ in {1..10}; do docker run -id alpine:3.3 cat -; done

# vmtouch data
Files: 1
Directories: 0
Resident Pages: 12877/26214400 50M/100G 0.0491%
Elapsed: 0.41442 seconds

Page 34: How to swim with a whale

Network drivers

Page 35: How to swim with a whale

Network drivers

- CNM / libnetwork

- Bridge

- Standard driver, limited by amount of interfaces that can be bound under one bridge

- Overlay

- OVS based interface, great performance, requires external k/v store service

- Others

- Weave

- Custom central routing

Page 36: How to swim with a whale

Building docker images

Page 37: How to swim with a whale

Building docker images

Building container images for large applications is still a challenge. If we are to rely on container images for testing, CI, and emergency deploys, we need to have an image ready in less than a minute.

Dockerfiles make this almost impossible for large applications.

Page 38: How to swim with a whale

Building docker images

- Microservices are the way to go

- You don’t have to build docker image every time!

- Prebuild containers with runtime environment

- Building code in separate container with build environment

- docker run -ti -v "$(pwd)/app_source":/build_dir mybuilder:latest

- Put code into prebuilt containers

- Use minimalistic base images - Alpine Linux ~5MB

- Use less reliable storage driver to speed up builds?
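A rough sketch of the "build in a separate container" idea. `mybuilder:latest` and `/build_dir` come from the slide; the function name, the `--rm` and `-w` choices and the `DOCKER` override are assumptions added for illustration. Setting `DOCKER=echo` previews the command without running anything.

```shell
#!/bin/sh
# Run the build inside a throwaway builder container, leaving artifacts
# in the mounted source directory.
# $1 = source directory, $2 = builder image (defaults to mybuilder:latest).
# DOCKER may be overridden, e.g. DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

build_in_container() {
    # Resolve to an absolute path; older Docker releases reject relative
    # host paths in -v.
    src=$(cd "$1" && pwd) || return 1
    $DOCKER run --rm -v "$src:/build_dir" -w /build_dir "${2:-mybuilder:latest}"
}
```

The build output can then be copied or mounted into a prebuilt runtime image instead of rebuilding that image on every change.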

Page 39: How to swim with a whale

Logs

Page 40: How to swim with a whale

Logging drivers

- Dealing with standard logger and json format is painful

- Available logging drivers

- json-file

- syslog

- journald

- gelf

- fluentd

- awslogs

- splunk

Page 41: How to swim with a whale

Logging drivers

[Diagram: container logs flowing through fluentd instances to AWS S3 and Elasticsearch]

Page 42: How to swim with a whale

Designing Docker friendly application

Page 43: How to swim with a whale

Designing Docker friendly application

- Docker loves cloud

- Cloud is not reliable - instances will always go down unexpectedly…

- … therefore infrastructure must be disposable!

- So are the applications

- Replication factor ~3-4

Page 44: How to swim with a whale

Designing Docker friendly application

- Logs go to standard output or standard error only

- Do not write to filesystem (mount ro); if unavoidable, use volumes

- No need to be worried about POSIX compliance or low performance!

- Run containers in non-privileged mode as non-root

- Assume that application MAY be killed unexpectedly

- Make containers configurable - ENTRYPOINT + startup script

- docker run -d -e KEY1=value1 -e KEY2=value2 myapp:latest

- Handle signals

- Avoid forking! If unavoidable, use some kind of container-init
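A minimal sketch of a startup script along these lines: configuration read from the environment and the stop signal forwarded to the application. `KEY1` and `KEY2` are the variables from the slide; the default values and both function names are hypothetical.

```shell
#!/bin/sh
# Print the effective configuration built from environment variables.
render_config() {
    printf 'key1=%s\nkey2=%s\n' "${KEY1:-default1}" "${KEY2:-default2}"
}

# Run a command as a child and forward SIGTERM/SIGINT to it.
# Returns the child's exit status, or the signal status (above 128)
# if the wait was interrupted by a forwarded signal.
run_app() {
    "$@" &                                   # start the application
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    status=0
    wait "$child" || status=$?               # interrupted early by a trapped signal
    trap - TERM INT
    wait "$child" 2>/dev/null || true        # reap the child after forwarding
    return "$status"
}
```

When the script has nothing to do after the application starts, ending it with `exec "$@"` is simpler still: the application replaces the shell and receives signals directly, with no extra process in the container.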

Page 45: How to swim with a whale

Questions? Answers?

Page 46: How to swim with a whale

Thank you!