Container: is it safe enough to run your application?
Container Security
Aleksey Zalesov
Agenda
1. Who am I?
2. Technology: namespaces, resources and security policies
3. How to escape a container, and how to prevent an app from doing so
Who am I?
Cloud Foundry engineer @ Altoros
PhysTech biophysics background
Mountaineering
Technology
namespaces, resources and security policies
Namespaces
Filesystem
PIDs
UIDs
network
hostname
PID namespaces
In the kernel since 2.6.24 (2008)
http://lwn.net/Articles/259217/
CLONE_NEWPID
http://linux.die.net/man/2/clone
Before PID namespaces
struct pid {
    atomic_t count;                       /* reference counter */
    int nr;                               /* the pid value */
    struct hlist_node pid_chain;          /* hash chain */
    struct hlist_head tasks[PIDTYPE_MAX]; /* lists of tasks */
    struct rcu_head rcu;                  /* RCU helper */
};
After PID namespaces
struct upid {
    int nr;                      /* moved from struct pid */
    struct pid_namespace *ns;    /* the namespace this value is visible in */
    struct hlist_node pid_chain; /* moved from struct pid */
};

struct pid {
    atomic_t count;
    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    int level;                   /* the number of upids */
    struct upid numbers[0];
};
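Playing with PID namespaces
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1048576];

/* Runs as the first process of the new PID namespace. */
static int child_fn(void *arg) {
    (void)arg;
    printf("PID: %ld\n", (long)getpid());
    return 0;
}

int main(void) {
    /* The stack grows down, so pass the top of the buffer. */
    pid_t child_pid = clone(child_fn, child_stack + 1048576,
                            CLONE_NEWPID | SIGCHLD, NULL);
    printf("clone() = %ld\n", (long)child_pid);
    waitpid(child_pid, NULL, 0);
    return 0;
}

$ sudo ./pid
clone() = 13175
PID: 1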
Resources
CPU
I/O
memory
network
https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
Demo: limiting I/O
# yum install libcgroup
# sudo service cgconfig start
# ls /cgroup/
blkio cpu cpuacct cpuset devices freezer memory net_cls

# cat /etc/cgconfig.conf
group limitio {
    blkio {
        blkio.throttle.read_bps_device = "8:0 2097152";
    }
}

# vim /etc/cgrules.conf
*:hdparm blkio limitio/
# service cgred start
CGROUP_DAEMON="blkio:limitio"

# hdparm --direct -t /dev/sda
without the limit:
/dev/sda:
Timing O_DIRECT disk reads: 1332 MB in 3.01 seconds = 443.04 MB/sec
with the limit of 2097152 bytes/s (2 MB/s):
/dev/sda:
Timing O_DIRECT disk reads: 6 MB in 3.00 seconds = 2.00 MB/sec
Security policies
dropping capabilities
seccomp
SELinux
Writing policies is hard!
Who should write the policy for a container?
Capabilities
a partitioning of the all-powerful root privilege into a set of distinct privileges
a per-thread attribute
http://man7.org/linux/man-pages/man7/capabilities.7.html
https://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.2/capfaq-0.2.txt
Capabilities example
$ ls -l /bin/ping
-rwsr-xr-x. 1 root root 38200 Jul 23 17:55 /bin/ping
$ cp /bin/ping .
$ ./ping ya.ru
ping: icmp open socket: Operation not permitted
$ sudo /usr/sbin/setcap cap_net_raw=ep ./ping
$ ./ping ya.ru
PING ya.ru (213.180.193.3) 56(84) bytes of data.
64 bytes from www.yandex.ru (213.180.193.3): icmp_seq=1 ttl=63 time=3.56
64 bytes from www.yandex.ru (213.180.193.3): icmp_seq=2 ttl=63 time=6.05
$ /usr/sbin/getcap ./ping
./ping = cap_net_raw+ep
How to escape from a container?
setuid executables in a container may compromise security on the host
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1244635
prerequisites: root access to the container and unprivileged access to the host
on the host, run a setuid binary straight from the container's rootfs:
/var/lib/lxc/NAME/rootfs/usr/bin/sudo -i
How to deal with it
# chmod 0700 /var/lib/lxc
# docker run -it --rm --cap-drop SETUID --cap-drop SETGID ...
// mount the filesystem with the nosuid option
# find / -perm -4000 -exec ls -l {} \; 2>/dev/null
# sudo chmod u-s filename
Same UIDs
leak from one container running as UID 1000 to another: both are the same user on the host
Solution
user namespaces
container 1: UID 1000 => UID 12437
container 2: UID 1000 => UID 13337
Privileged containers
https://www.exploit-db.com/exploits/33808/ (for Docker 0.11)
- the container can access ANY file on the host
- the container has separate PID and net namespaces, stripped capabilities and read-only bind mounts
- BUT it still holds CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH
- open_by_handle_at() opens a file by handle instead of by path
- the inode of / is always 2 (on ext filesystems), so a handle for the host's root can be constructed
prevent:
run trusted code only
map groups of mutually trusted containers to separate machines
use SELinux or AppArmor
drop capabilities
Buggy kernel calls
all containers share the same kernel, so if a kernel call allows gaining extra access or leaks data, the container gets that too
exploit example: vmsplice
prevent:
limit the attack surface through security policies
use separate kernels under a hypervisor
http://lwn.net/Articles/268783/
https://www.win.tue.nl/~aeb/linux/hh/hh-12.html
Sign your images!
Conclusion
containers are fast and provide a fine level of isolation for trusted apps
running untrusted apps safely in non-privileged mode takes effort
it is not safe to run untrusted apps in privileged mode