N Problems of Linux Containers (with solutions!) Kir Kolyshkin <[email protected] > 6 June 2015 ContainerDays Boston

Upload: openvz

Post on 12-Aug-2015


Page 1: N problems of Linux containers

N Problems of Linux Containers

(with solutions!)

Kir Kolyshkin <[email protected]>

6 June 2015 ContainerDays Boston

Page 2: N problems of Linux containers

openvz.org || criu.org || odin.com

Problem: Effective virtualization

● Virtualization is partitioning
● Historical way: $M mainframes
● Modern way: virtual machines
● Problem: performance overhead
● Partial solution: hardware support (Intel VT, AMD-V)

Page 3: N problems of Linux containers

Solution: isolation

● Run many userspace instances on top of one single (Linux) kernel

● All processes see each other
– files, process information, network, shared memory, users, etc.

● Make them unsee it!

Page 4: N problems of Linux containers

One historical way to unsee

chroot()

Page 5: N problems of Linux containers

Namespaces

● Implemented in the Linux kernel
– PID (process tree)
– net (net devices, addresses, routing, etc.)
– IPC (shared memory, semaphores, msg queues)
– UTS (hostname, kernel version)
– mnt (filesystem mounts)
– user (UIDs/GIDs)

● clone() with CLONE_NEW* flags
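
As a rough illustration of the CLONE_NEW* flags, the sketch below maps the namespace names from the slide to the flag values defined in <linux/sched.h> and ORs them into one mask. The helper names are made up for this example, and it calls unshare(2) through ctypes rather than clone(), since unshare(2) accepts the same flags and is easier to reach from Python. It assumes Linux and, for most namespaces, root privileges.

```python
import ctypes
import os

# Flag values from <linux/sched.h>; the keys follow the names used on the slide.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_mask(names):
    """OR together the CLONE_NEW* flags for the given namespace names."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def unshare_namespaces(names):
    """Detach the caller from the listed namespaces via unshare(2).

    Linux only, usually needs root. For "pid" the new namespace applies
    to children created afterwards, not to the caller itself.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_mask(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, unshare_namespaces(["uts"]) followed by a hostname change would leave the host's hostname untouched.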

Page 6: N problems of Linux containers

Problem: Shared resources

● All containers share the same set of resources (CPU, RAM, disk, various in-kernel things ...)

● Need fair distribution of “goods” so everyone gets their share

● Need DoS prevention
● Need prioritization and SLAs

Page 7: N problems of Linux containers
Page 8: N problems of Linux containers

Solution: OpenVZ resource controls

● OpenVZ:
– user beancounters (control 20 parameters)
– hierarchical CPU scheduler
– per-container disk quota
– per-container I/O priority and I/O bandwidth limit

● Dynamic control, can “resize” at runtime

Page 9: N problems of Linux containers
Page 10: N problems of Linux containers

Solution 2: VSwap

● Only two primary parameters: RAM and swap
– others still exist, but are optional

● Swap is virtual, no actual I/O is performed
● The container is slowed down to emulate real swap
● Only when an actual global RAM shortage occurs does virtual swap go into the real swap

● Currently only available in OpenVZ kernel

Page 11: N problems of Linux containers

Solution: cgroups + controllers

● Cgroups is a mechanism to control resources per hierarchical groups of processes

● Cgroups is nothing without controllers:
– blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio

● Cgroups are orthogonal to namespaces
● Still working on it: just added kmem controller
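
To make the "groups plus controllers" idea concrete, here is a minimal sketch of how a group is typically configured through the cgroup filesystem: create a directory under a controller, write limits into its control files, then attach processes. It assumes a cgroup v1 layout mounted at /sys/fs/cgroup; the function name and the group name "ct101" are hypothetical. The root directory is a parameter so the sketch can also be tried against a scratch directory.

```python
import os


def configure_group(root, controller, group, settings, pids=()):
    """Create <root>/<controller>/<group> and write control files into it."""
    path = os.path.join(root, controller, group)
    os.makedirs(path, exist_ok=True)
    for filename, value in settings.items():
        with open(os.path.join(path, filename), "w") as f:
            f.write(str(value))
    for pid in pids:
        # On a real cgroup v1 mount, each write to "tasks" moves one thread
        # into the group; in a scratch directory it is just a plain file.
        with open(os.path.join(path, "tasks"), "w") as f:
            f.write(str(pid))
    return path
```

On a real host (as root) a call such as configure_group("/sys/fs/cgroup", "memory", "ct101", {"memory.limit_in_bytes": 512 * 1024 * 1024}, pids=[os.getpid()]) would cap the calling process at 512 MB under the memory controller. No namespace is involved, which is the sense in which the two mechanisms are orthogonal.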

Page 12: N problems of Linux containers

Solution 3: vcmmd

● 4th generation of OpenVZ resource mgmt
● A user-space daemon using kernel controls
● Monitors usage, tweaks limits
● Adds a “time” dimension
● More flexible limits, e.g. burstable

Page 13: N problems of Linux containers

Problem: fast live migration

● We can already live migrate a running OpenVZ container from one server to another without shutting it down

● We want to do it fast even for huge containers
– huge disk: use shared storage
– huge RAM: ???

Page 14: N problems of Linux containers

Live migration process (assuming shared storage)

● 1 Freeze the container
● 2 Dump its complete state to a dump file
● 3 Copy the dump file to destination server
● 4 Undump back to RAM, recreate everything
● 5 Unfreeze
● Problem: huge dump file -- takes long time* to dump, copy, undump

* seconds

Page 15: N problems of Linux containers

Solution 1: network swap

● 1 Dump the minimal memory, lock the rest
● 2 Restore the minimal memory, mark the rest as swapped out
● 3 Set up network swap from the source
● 4 Unfreeze. Missing RAM will be “swapped in”
● 5 Migrate the rest of RAM and kill it on source

Page 16: N problems of Linux containers

Solution 1: network swap

● 1 Dump the minimal memory, lock the rest
● 2 Copy, undump what we have, mark the rest as swapped out
● 3 Set up network swap served from the source
● 4 Unfreeze. Missing RAM will be “swapped in”
● 5 Migrate the rest of RAM and kill it on source
● PROBLEM: no way to rollback
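
The network swap steps can be modelled with a toy simulation. This is only a sketch: memory is a dict of page number to content, the "network" is a method call, and the class names are invented for the example. It is not how OpenVZ implements the feature.

```python
class SourceMemory:
    """Pages left behind on the source server, served on demand."""

    def __init__(self, pages):
        self.pages = dict(pages)

    def fetch(self, page_no):
        # Once a page is handed over, the source no longer keeps it.
        return self.pages.pop(page_no)


class MigratedContainer:
    """Container running on the destination with only part of its RAM."""

    def __init__(self, minimal_pages, source):
        self.resident = dict(minimal_pages)
        self.source = source
        self.faults = 0

    def read(self, page_no):
        if page_no not in self.resident:
            # Page is marked as swapped out: "swap it in" from the source.
            self.faults += 1
            self.resident[page_no] = self.source.fetch(page_no)
        return self.resident[page_no]

    def pull_rest(self):
        # Step 5: migrate whatever is still on the source.
        for page_no in list(self.source.pages):
            self.resident[page_no] = self.source.fetch(page_no)


memory = {n: f"data-{n}" for n in range(8)}
minimal = {n: memory.pop(n) for n in (0, 1)}   # steps 1-2: minimal set moves first
source = SourceMemory(memory)                   # step 3: source serves the rest
ct = MigratedContainer(minimal, source)         # step 4: unfreeze on destination
```

After the first read() that faults, the container's state is split between two machines, which illustrates why this scheme has no way to roll back.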

Page 17: N problems of Linux containers

Solution 2: Iterative RAM migration

● 1 Ask kernel to track modified pages
● 2 Copy all memory to the destination system's memory
● 3 Ask kernel for list of modified pages
● 4 Copy those pages
● 5 GOTO 3 until satisfied
● 6 Freeze and do migration as usual, but with much smaller set of pages
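
The iterative scheme can be sketched as a small simulation. Everything here is an assumption made for illustration: memory is a dict of page number to a counter, the kernel's modified-page tracker is a Python set, and the workload is modelled as dirtying half as many pages each round, on the reasoning that each round copies less data and so takes less time.

```python
import random


def iterative_migrate(source, dirty_per_round, threshold=4, max_rounds=10, seed=0):
    """Pre-copy `source` (page number -> int) while it keeps changing.

    Returns the destination copy and the number of pages that still had
    to be copied while the container was frozen.
    """
    rng = random.Random(seed)
    dest = dict(source)                      # step 2: copy all memory
    dirty = set()                            # step 1: stands in for the kernel tracker
    for round_no in range(max_rounds):
        # The container keeps running and modifies some pages.
        count = max(dirty_per_round >> round_no, 1)
        for page_no in rng.sample(sorted(source), count):
            source[page_no] += 1
            dirty.add(page_no)
        if len(dirty) <= threshold:          # step 5: "until satisfied"
            break
        for page_no in dirty:                # steps 3-4: copy modified pages
            dest[page_no] = source[page_no]
        dirty.clear()
    # Step 6: freeze, then copy only the pages that are still dirty.
    frozen_copy = len(dirty)
    for page_no in dirty:
        dest[page_no] = source[page_no]
    return dest, frozen_copy
```

With 64 pages and dirty_per_round=32, only a handful of pages are left for the frozen phase instead of all 64, which is the point of the technique: the freeze time depends on the final dirty set, not on total RAM.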

Page 18: N problems of Linux containers

Problem: upstreaming

● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop, then merge it upstream (i.e. to vanilla Linux kernel)
● Problem?

Page 19: N problems of Linux containers
Page 20: N problems of Linux containers

Problem: upstreaming

● OpenVZ was developed separately
● Same for many past IBM Linux projects (ELVM, CKRM, ...)
● Develop, then merge it upstream (i.e. to vanilla Linux kernel)
● Problem: grizzly bears upstream developers do not accept massive patchsets appearing out of nowhere

Page 21: N problems of Linux containers

Solution 1: rewrite from scratch

● User Beancounters -> CGroups + controllers
● PID namespace: 2 rewrites until accepted
● Network namespace – rewritten
● It works!
● 1500+ patches ended up in vanilla
● OpenVZ made it to top 10 contributors

Page 22: N problems of Linux containers

Solution 2: circumvent the system!

● We tried hard to merge checkpoint/restore
● Other people tried hard too, no luck
● Can't make it to the kernel? Let's riot! implement it in userspace
● With minimal kernel intervention when required
● Kernel exports most of the information already, so let's just add missing bits and pieces

Page 23: N problems of Linux containers

CRIU

● Checkpoint / Restore [mostly] In Userspace
● About 3 years old, tools at version 1.6
● Users: Google, Samsung, Huawei, ...
● LXC & Docker – integrated!
● Already in upstream 3.x kernel: CONFIG_CHECKPOINT_RESTORE
● Live migration: P.Haul http://criu.org/P.Haul
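
For orientation, a checkpoint and a restore with the criu command-line tool usually look like the argument lists built below. This is a sketch, not a recommended setup: the helper names and the image directory are made up, the commands need root and a kernel built with CONFIG_CHECKPOINT_RESTORE, and --shell-job is only needed for processes started from an interactive shell.

```python
def criu_dump_cmd(pid, images_dir, leave_running=False):
    """Build the argv for checkpointing the process tree rooted at `pid`."""
    cmd = ["criu", "dump", "--tree", str(pid),
           "--images-dir", images_dir, "--shell-job"]
    if leave_running:
        # Keep the processes alive after the dump instead of killing them.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir):
    """Build the argv for restoring from images written by `criu dump`."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]
```

The lists can be passed to subprocess.run(cmd, check=True); copying the image directory to another host between the two calls gives the simplest form of migration.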

Page 24: N problems of Linux containers

CRIU Linux kernel patches, per v
Total: 176 (+11 this year)

Page 25: N problems of Linux containers

Problem: common file system

● Container is just a directory on the host we chroot() into
● File system journal (metadata updates) is a bottleneck
● Lots of small-file I/O on CT backup/migration (sometimes rsync hangs or OOMs!)
● No sub-tree disk quota support in upstream
● No sub-tree snapshots
● Live migration: rsync -- changed inodes
● File system type and properties are fixed, same for all CTs

Page 26: N problems of Linux containers

Solution 1: LVM

● Works only on top of a block device
● Hard to manage (e.g. how to migrate a huge volume?)
● No thin provisioning

Page 27: N problems of Linux containers

Solution 2: loop device (filesystem within a file)

● VFS operations lead to double page-caching
– (already fixed in the recent kernels)

● No thin provisioning
● Limited feature set

Page 28: N problems of Linux containers

Solution 3: ZFS + zvol

● PRO: features
– zvol, thin provisioning, dedup, zfs send/receive

● CONTRA:
– Licensing is problematic
– Linux port issues (people report cache OOM)
– Was not available in 2008

Page 29: N problems of Linux containers

Solution 4: ploop

● Basic idea: same as block loop, just better
● Modular design:
– various image formats (qcow2 in TODO progress)
– various I/O backends (ext4, vfs O_DIRECT, nfs)

● Feature rich:
– online resize (grow and shrink, ballooning)
– instant live snapshots
– write tracker to facilitate faster live migration

Page 30: N problems of Linux containers

Any problems questions?

● [email protected]
● Twitter: @kolyshkin @_openvz_ @__criu__