nuse (network stack in userspace) at #osio
NUSE talk at #osio http://operatingsystems.io/
Network Stack in Userspace (NUSE)
Hajime Tazaki, Ryo Nakamura
(University of Tokyo)
New Directions in Operating Systems, London, 2014
Motivation
Implementation of the Internet is not finished yet
Faster evolution of OSes (network stack)
OS personalization
I have a new Layer-3/4 protocol! Yay!
I have a new, great Layer-3/4 protocol! It will change the WORLD!
Replace network stack?
No: destroy my life?! (experimental? not tested?)
Yes: I wanna be your slave.
Slow evolution of network stack?
VM on personal device?
Virtual Machine?
Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010
Poll: “When you download and run software, how often do you use a virtual machine (to reduce security risks)?”
Rekindling Network Protocol Innovation with User-Level Stacks
Michio Honda*, Felipe Huici*, Costin Raiciu†, Joao Araujo‡, Luigi Rizzo§‖
NEC Europe Ltd.*, Universitatea Politehnica Bucuresti†, University College London‡, Università di Pisa§, International Computer Science Institute, Berkeley, CA‖
{first.last}@neclab.eu, [email protected], [email protected], [email protected]
ABSTRACT
Recent studies show that more than 86% of Internet paths allow well-designed TCP extensions, meaning that it is still possible to deploy transport layer improvements despite the existence of middleboxes in the network. Hence, the blame for the slow evolution of protocols (with extensions taking many years to become widely used) should be placed on end systems.
In this paper, we revisit the case for moving protocols stacks up into user space in order to ease the deployment of new protocols, extensions, or performance optimizations. We present MultiStack, operating system support for user-level protocol stacks. MultiStack runs within commodity operating systems, can concurrently host a large number of isolated stacks, has a fall-back path to the legacy host stack, and is able to process packets at rates of 10Gb/s.
We validate our design by showing that our mux/demux layer can validate and switch packets at line rate (up to 14.88 Mpps) on a 10 Gbit port using 1-2 cores, and that a proof-of-concept HTTP server running over a basic userspace TCP outperforms by 18–90% both the same server and nginx running over the kernel’s stack.
Categories and Subject Descriptors
C.2.2 [Computer-communication Networks]: Network Protocols; D.4.4 [Operating Systems]: Communications Management
Keywords
transport protocols, operating systems, deployability
1. INTRODUCTION
The TCP/IP protocol suite has been mostly implemented in the operating system kernel since the inception of UNIX to ensure performance, security and isolation between user processes. Over time, new protocols and features have appeared (e.g., SCTP, DCCP, MPTCP, improved versions of TCP), many of which have become part of mainstream OSes and distributions. Fortunately, the Internet is still able to accommodate the evolution of protocols: a recent study [10] has shown that as many as 86% of Internet paths still allow TCP extensions despite the existence of a large number of middleboxes.
However, the availability of a feature does not imply widespread, timely deployment. Being part of the kernel, new protocols/extensions have system-wide impact, and are typically enabled or installed during OS upgrades. These happen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely.
Figure 1: TCP options deployment over time. (Plot: ratio of flows, 0.00 to 1.00, by date, 2007 to 2012; options: SACK, Timestamp, Windowscale; directions: Inbound, Outbound.)
For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux and Windows Vista (and later) use newer algorithms that achieve better bandwidth utilization and mitigate RTT unfairness [21, 25]. From a user’s point of view there is no reason not to adopt such new algorithms, but they do not because it can only be done via OS upgrades that are often costly or unavailable. Even if they are available, OS default settings that disable such extensions or modifications can further hinder timely deployment.
Figure 1 shows another example, the usage of the three most pervasive TCP extensions: Window Scale (WS) [12], Timestamps (TS) [12] and Selective Acknowledgment (SACK) [16]*. For example, despite WS and TS being available since Windows 2000 and on by default since Windows Vista in 2006, as late as 2012 more than 30% and 70% of flows still did not negotiate these options (respectively), showing that it can take a long time to actually upgrade or change OSes and thus the network stacks in their kernels. We see wider deployment for SACK in 2007 (70%) compared to the other options thanks to it being on by default since Windows 2000, but even with this, 20% of flows still did not use this option as late as 2011. The argument remains
*We used a set of daily traces from the WIDE backbone network which provides connectivity to universities and research institutes in Japan [3].
ACM SIGCOMM Computer Communication Review, Volume 44, Number 2, April 2014, p. 53
Slow evolution of network stack
Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, Num. 2, April 2014
Meanwhile, in the filesystem world...
There is
Filesystem in Userspace (FUSE)
Userspace code can host new filesystems (sshfs, GmailFS, etc.)
Performance is bad, but that doesn’t matter
Flexibility and functionality do matter
http://fuse.sourceforge.net/
Alternatives
Container (LXC, OpenVZ, vimage)
share kernel with host operating system (no flexibility)
Library OS
full scratch: mtcp, Mirage, lwIP
Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)
Glue-layer: LKL (Linux-2.6), rumpkernel (NetBSD)
Network Stack in Userspace
What’s NUSE?
Network stack in Userspace
A library operating system
Library version of network stack (of monolithic kernel)
Linux (latest), FreeBSD (plan)
(UNIX) Process-based virtualization
(Diagram labels: nuse example; libnuse (TCP/IP, ARP/ndisc); glibc; userspace; kernel (bypassed); raw sock, netmap, DPDK (etc); NIC)
Why NUSE?
minimized porting effort
Linux (net-next) changes frequently
fully functional network stack for netmap, DPDK (any kernel-bypass technology)
(Architecture diagram labels: Application; POSIX glue; Kernel layer: ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling; NUSE core: bottom halves/rcu/timer/interrupt, struct net_device; petit-scheduler; RAW, DPDK, netmap, ...; NIC)
How it works
1. (monolithic) kernel source
2. scheduler
3. POSIX glue
redirect system calls
4. network I/O
raw socket, DPDK, netmap, etc.
1) kernel build
patch to kernel tree
with new (hw independent) arch (arch/sim)
robust to (frequent) mainstream changes
2) scheduler
offer alternate context primitives
interrupts, timer, thread, bottom halves (tasklet, workqueue, waiter, etc)
wrap with POSIX thread
easily debuggable
ucontext fiber for low overhead (not yet)
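The bottom-half idea in the list above can be sketched without any threads: an interrupt-like event only marks work as pending, and a later pass runs the handlers. This is a minimal illustration under stated assumptions, not NUSE's scheduler. Every `demo_` name is hypothetical, and in NUSE the run pass would execute inside a POSIX thread.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_SOFTIRQS 4

typedef void (*demo_handler_t)(void);

static demo_handler_t demo_handlers[DEMO_NR_SOFTIRQS];
static unsigned int demo_pending;   /* bit n set => slot n has work */

/* Register the deferred handler for slot nr. */
void demo_open_softirq(int nr, demo_handler_t fn)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_handlers[nr] = fn;
}

/* "Top half": cheap, only records that work is pending. */
void demo_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_pending |= 1u << nr;
}

/* Run each pending handler once, lowest slot first; returns how many ran. */
int demo_do_softirq(void)
{
    unsigned int pending = demo_pending;
    int ran = 0;

    demo_pending = 0;   /* handlers may raise again for the next pass */
    for (int nr = 0; nr < DEMO_NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && demo_handlers[nr] != NULL) {
            demo_handlers[nr]();
            ran++;
        }
    }
    return ran;
}

/* Example handlers that only count their invocations. */
static int demo_timer_count, demo_rx_count;
void demo_timer_action(void) { demo_timer_count++; }
void demo_rx_action(void)    { demo_rx_count++; }
int demo_timer_calls(void)   { return demo_timer_count; }
int demo_rx_calls(void)      { return demo_rx_count; }

/* Slot 0: timers, slot 1: packet receive (numbering is arbitrary here). */
void demo_setup(void)
{
    demo_open_softirq(0, demo_timer_action);
    demo_open_softirq(1, demo_rx_action);
}
```

Raising the same slot twice before a pass still runs its handler once, which is why handlers in this style drain their own queues.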
3) POSIX glue code
Hijack function calls
socket => nuse_socket
read => nuse_read
apps are not aware of the redirection
LD_PRELOAD=libnuse.so ..
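Once calls such as socket and read are redirected into the library, the library needs its own mapping from the integer handles an application holds to the stack's socket state. The sketch below shows one way such a table could look. It is an assumption for illustration only: the `demo_` names and the `struct demo_sock` layout are hypothetical and are not taken from libnuse.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_FDS 8

/* Hypothetical stand-in for the per-socket state kept inside the library. */
struct demo_sock {
    int family, type, proto;
    int in_use;
};

static struct demo_sock demo_fd_table[DEMO_MAX_FDS];

/* What a hijacked socket() could do: create state inside the library and
 * hand back a small integer that later calls use to find it again. */
int demo_socket(int family, int type, int proto)
{
    for (int fd = 0; fd < DEMO_MAX_FDS; fd++) {
        if (!demo_fd_table[fd].in_use) {
            demo_fd_table[fd] = (struct demo_sock){ family, type, proto, 1 };
            return fd;
        }
    }
    return -1;   /* table full */
}

/* Shared lookup for the other hijacked calls (read, write, ...). */
struct demo_sock *demo_lookup(int fd)
{
    if (fd < 0 || fd >= DEMO_MAX_FDS || !demo_fd_table[fd].in_use)
        return NULL;
    return &demo_fd_table[fd];
}

/* Release the slot so a later demo_socket() can reuse it. */
int demo_close(int fd)
{
    struct demo_sock *sk = demo_lookup(fd);

    if (sk == NULL)
        return -1;
    sk->in_use = 0;
    return 0;
}
```

Handles from such a table live in a separate number space from host file descriptors, so a real implementation would have to decide how the two coexist.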
4) network I/O
connect NUSE to NIC
options
raw socket (default)
DPDK (if available)
netmap (if available)
Tap
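A common way to keep several I/O options behind one stack-facing interface is a table of function pointers chosen by name. The sketch below illustrates that pattern only. The `demo_` names and the struct layout are assumptions, the backends are stubs, and NUSE's actual structures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-backend operations. */
struct demo_vif_ops {
    const char *name;                              /* e.g. "RAW", "TAP" */
    int (*write)(const void *frame, size_t len);   /* send one frame    */
};

/* Stub backends: a real one would hand the frame to a raw socket, a tap
 * device, a netmap ring or a DPDK port. These only report the length. */
static int demo_raw_write(const void *frame, size_t len)
{
    (void)frame;
    return (int)len;
}

static int demo_tap_write(const void *frame, size_t len)
{
    (void)frame;
    return (int)len;
}

static const struct demo_vif_ops demo_backends[] = {
    { "RAW", demo_raw_write },   /* first entry doubles as the default */
    { "TAP", demo_tap_write },
};

/* Pick a backend by name; NULL selects the default.
 * Returns NULL for a backend that is not in the table. */
const struct demo_vif_ops *demo_vif_select(const char *viftype)
{
    size_t n = sizeof(demo_backends) / sizeof(demo_backends[0]);

    if (viftype == NULL)
        return &demo_backends[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(demo_backends[i].name, viftype) == 0)
            return &demo_backends[i];
    return NULL;
}

/* The stack-facing transmit path calls through the selected backend. */
int demo_dev_xmit(const struct demo_vif_ops *ops, const void *frame, size_t len)
{
    assert(ops != NULL && ops->write != NULL);
    return ops->write(frame, len);
}
```

With this shape, optional backends such as DPDK or netmap would simply be extra table entries that exist only when support is compiled in.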
Usage
download
git clone git://github.com/libos-nuse/net-next-nuse
compile
make library ARCH=sim (NETMAP=yes) (DPDK=yes)
execute
sudo NUSECONF=nuse.conf ./nuse (application)
configs
# Interface definition.
interface eth0
    address 192.168.0.10
    netmask 255.255.255.0
    macaddr 00:01:01:01:01:01
    viftype TAP

interface p1p1
    address 172.16.0.1
    netmask 255.255.255.0
    macaddr 00:01:01:01:01:02

# route entry definition.
route
    network 0.0.0.0
    netmask 0.0.0.0
    gateway 192.168.0.1
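To make the structure of the configuration file above concrete, here is a small reader for it: a block keyword (interface or route) followed by key/value lines. It is a sketch under assumptions, not the parser NUSE ships. The `demo_` names, the field sizes, and the error rules are all invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_IFS 4

struct demo_if {
    char name[16], address[16], netmask[16], macaddr[18], viftype[8];
};

struct demo_conf {
    int nifs;
    struct demo_if ifs[DEMO_MAX_IFS];
    char gateway[16];
};

/* Parse config text. Returns 0 on success, -1 on too many interfaces,
 * an interface without a name, or a key outside a matching block. */
int demo_parse_conf(const char *text, struct demo_conf *out)
{
    struct demo_if *cur = NULL;   /* interface block being filled, if any */
    int in_route = 0;

    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    while (*text != '\0') {
        char line[128], key[16], val[32] = "";
        size_t len = strcspn(text, "\n");

        /* Copy one line (truncated to the buffer) and step past it. */
        snprintf(line, sizeof(line), "%.*s", (int)len, text);
        text += len;
        if (*text == '\n')
            text++;

        int fields = sscanf(line, "%15s %31s", key, val);
        if (fields < 1 || key[0] == '#')
            continue;             /* blank line or comment */

        if (strcmp(key, "interface") == 0) {
            if (fields < 2 || out->nifs == DEMO_MAX_IFS)
                return -1;
            cur = &out->ifs[out->nifs++];
            in_route = 0;
            snprintf(cur->name, sizeof(cur->name), "%s", val);
        } else if (strcmp(key, "route") == 0) {
            cur = NULL;
            in_route = 1;
        } else if (cur != NULL && strcmp(key, "address") == 0) {
            snprintf(cur->address, sizeof(cur->address), "%s", val);
        } else if (cur != NULL && strcmp(key, "netmask") == 0) {
            snprintf(cur->netmask, sizeof(cur->netmask), "%s", val);
        } else if (cur != NULL && strcmp(key, "macaddr") == 0) {
            snprintf(cur->macaddr, sizeof(cur->macaddr), "%s", val);
        } else if (cur != NULL && strcmp(key, "viftype") == 0) {
            snprintf(cur->viftype, sizeof(cur->viftype), "%s", val);
        } else if (in_route && strcmp(key, "gateway") == 0) {
            snprintf(out->gateway, sizeof(out->gateway), "%s", val);
        } else if (in_route && (strcmp(key, "network") == 0 ||
                                strcmp(key, "netmask") == 0)) {
            /* accepted but not stored in this sketch */
        } else {
            return -1;
        }
    }
    return 0;
}
```

An interface without a viftype line, such as p1p1 above, ends up with an empty `viftype` string, leaving the choice of default backend to the caller.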
(possible) use cases
New protocol deployment
Chrome + Linux mptcp (on NUSE)
Process-level virtual instance
% NUSE-linux-ovs | NUSE-freebsd-NAT | NUSE-router | NUSE-nginx
VM chaining via UNIX command line
Limitations (ongoing work)
no fork(2)/exec(2) support
no multi-process support
no sysctl/proc
(inefficient) thread scheduling
Experiments
1. Can we benefit from OS personalization?
present a custom (NUSE) kernel with an application (OS personalization)
2. How much overhead does NUSE add?
Simple performance measurements
Tested applications
working
ping, iperf, nginx (partially), sleep
need patches
nc, wget, dig, host
Setup: Performance measurement
(Diagram labels: Tx, 10G link, NUSE, 10G link, Rx; ping, flowgen; vnstat (packet count))
          NUSE node                             Tx/Rx nodes
CPU       Xeon E5-2650v2 @ 2.60GHz (16 core)    Xeon L3426 @ 1.87GHz (8 core)
Memory    32GB                                  4GB
NIC       Intel X520                            Intel X520
Host Tx (NUSE->Receiver)
(Diagram labels: ping, flowgen; NUSE; Rx)
ping (RTT), in ms
          avg      max      min
dpdk      2.610    8.000    0.156
netmap    0.370    0.494    0.252
raw       0.396    0.501    0.290
tap       0.397    0.538    0.303
(Plots: throughput in Mbps (1024 byte, UDP) and ping RTT in ms, each for dpdk, netmap, raw, tap)
L3 Routing
Sender->NUSE->Receiver
(Diagram labels: Tx, NUSE, Rx)
ping (RTT), in ms
          avg      max      min
dpdk      11.998   27.700   0.252
netmap    0.664    0.741    0.556
raw       0.663    0.761    0.575
tap       0.694    0.749    0.602
(Plots: throughput in Mbps (1024 byte, UDP) for netmap, raw, tap; ping RTT in ms for dpdk, netmap, raw, tap)
Discussions
performance is not so bad
we don’t care much about performance
network stack is fully functional
but supplemental tools are not sufficient
Network Simulator Integration (ns-3)
network stack + ns-3 network simulator
Direct Code Execution (DCE)
Established by Mathieu Lacage (2006)
part of ns-3 project
Features
reproducible (deterministic clock)
controllable (simulator’s facility)
http://www.nsnam.org/overview/projects/direct-code-execution/
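The reproducibility that a deterministic clock buys can be illustrated with a tiny discrete-event loop: virtual time never reads the wall clock and only jumps to the timestamp of the next scheduled event, so the same inputs always yield the same order and the same times. This is a generic sketch and not ns-3's or DCE's API. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_EVENTS 16

struct demo_event {
    unsigned long when;   /* virtual time at which the event fires */
    int id;               /* payload: here just an identifier      */
};

static struct demo_event demo_queue[DEMO_MAX_EVENTS];
static size_t demo_nevents;
static unsigned long demo_now;   /* virtual clock, never reads wall time */

/* Schedule an event `delay` ticks after the current virtual time. */
int demo_schedule(unsigned long delay, int id)
{
    if (demo_nevents == DEMO_MAX_EVENTS)
        return -1;
    demo_queue[demo_nevents++] = (struct demo_event){ demo_now + delay, id };
    return 0;
}

/* Pop the earliest event (ties go to the one scheduled first), jump the
 * clock to its timestamp and return its id; -1 when the queue is empty. */
int demo_run_next(void)
{
    size_t best = 0;

    if (demo_nevents == 0)
        return -1;
    for (size_t i = 1; i < demo_nevents; i++)
        if (demo_queue[i].when < demo_queue[best].when)
            best = i;

    struct demo_event ev = demo_queue[best];

    /* Shift later entries down so insertion order is preserved. */
    for (size_t i = best; i + 1 < demo_nevents; i++)
        demo_queue[i] = demo_queue[i + 1];
    demo_nevents--;

    assert(ev.when >= demo_now);   /* time never moves backwards */
    demo_now = ev.when;
    return ev.id;
}

unsigned long demo_time(void)
{
    return demo_now;
}
```

Because nothing here depends on how long the host takes to run a handler, a breakpoint or a slow machine does not change what the simulated nodes observe.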
NUSE vs DCE
                  NUSE                      DCE
kernel library    ARCH=sim                  ARCH=sim
scheduler         (host) pthread            simulator’s scheduler (deterministic)
POSIX             hijack                    hijack
network I/O       raw/netmap/DPDK/tap       ns3::NetDevice
execution         LD_PRELOAD                dlmopen(3), single proc/multi-instances
(Annotation on the slide: shared)
DCE Architecture
(Diagram labels: 1) Core layer: virtualization core layer (heap, stack, memory); ns-3 (network simulation core). 2) Kernel layer: ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling, bottom halves/rcu/timer/interrupt, struct net_device. 3) POSIX layer: Application (ip, iptables, quagga). Also shown: DCE, ns-3 application, ns-3 TCP/IP stack.)
Bug reproducibility
(Topology diagram labels: mobile node; handoff between Wi-Fi access points AP1 and AP2; Home Agent; correspondent node; ping6)
(gdb) b mip6_mh_filter if dce_debug_nodeid()==0
Breakpoint 1 at 0x7ffff287c569: file net/ipv6/mip6.c, line 88.
<continue>
(gdb) bt 4
#0  mip6_mh_filter (sk=0x7ffff7f69e10, skb=0x7ffff7cde8b0) at net/ipv6/mip6.c:109
#1  0x00007ffff2831418 in ipv6_raw_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:199
#2  0x00007ffff2831697 in raw6_local_deliver (skb=0x7ffff7cde8b0, nexthdr=135) at net/ipv6/raw.c:232
#3  0x00007ffff27e6068 in ip6_input_finish (skb=0x7ffff7cde8b0) at net/ipv6/ip6_input.c:197
Debugging
Memory error detection
among distributed nodes
in a single process
using Valgrind
==5864== Memcheck, a memory error detector
==5864== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==5864== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==5864== Command: ../build/bin/ns3test-dce-vdl --verbose
==5864==
==5864== Conditional jump or move depends on uninitialised value(s)
==5864==    at 0x7D5AE32: tcp_parse_options (tcp_input.c:3782)
==5864==    by 0x7D65DCB: tcp_check_req (tcp_minisocks.c:532)
==5864==    by 0x7D63B09: tcp_v4_hnd_req (tcp_ipv4.c:1496)
==5864==    by 0x7D63CB4: tcp_v4_do_rcv (tcp_ipv4.c:1576)
==5864==    by 0x7D6439C: tcp_v4_rcv (tcp_ipv4.c:1696)
==5864==    by 0x7D447CC: ip_local_deliver_finish (ip_input.c:226)
==5864==    by 0x7D442E4: ip_rcv_finish (dst.h:318)
==5864==    by 0x7D2313F: process_backlog (dev.c:3368)
==5864==    by 0x7D23455: net_rx_action (dev.c:3526)
==5864==    by 0x7CF2477: do_softirq (softirq.c:65)
==5864==    by 0x7CF2544: softirq_task_function (softirq.c:21)
==5864==    by 0x4FA2BE1: ns3::TaskManager::Trampoline(void*) (task-manager.cc:261)
==5864==  Uninitialised value was created by a stack allocation
==5864==    at 0x7D65B30: tcp_check_req (tcp_minisocks.c:522)
==5864==
Fine-grained parameter coverage
Code coverage measurement with DCE
With fine-grained network, node, protocol parameters
Continuous integration
http://ns-3-dce.cloud.wide.ad.jp/jenkins/job/daily-net-next-sim/
Summary
NUSE (Network Stack in Userspace)
OS personalization (fast evolution, easy deployment)
DCE (Direct Code Execution)
Flexible network experiment/test with deterministic clock
github: https://github.com/libos-nuse/net-next-nuse
DCE: http://bit.ly/ns-3-dce
twitter: @thehajime
Backups
Potentials of Userspace Networking
High-performance networking
Useful debugging facilities
Operating system personalization
1) kernel build
build kernel source tree w/ the patch
make menuconfig ARCH=sim
make library ARCH=sim
➔ libnuse-linux-3.17-rc1.so
Example: How timer works
add_timer()
TIMER_SOFTIRQ
timer_list
run_timer_softirq ()
timer handler
timer thread (timer_create(2))
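The chain on this slide (add a timer, keep it on a pending list, run expired handlers from the timer softirq) can be sketched as a sorted list plus an expiry pass. The code below is an illustration under assumptions and is not the kernel's or NUSE's timer code. The `demo_` names are hypothetical, and in NUSE the expiry pass would be driven by a host timer thread rather than called by hand.

```c
#include <assert.h>
#include <stddef.h>

struct demo_timer {
    unsigned long expires;             /* tick at which to fire           */
    void (*function)(unsigned long);   /* handler                         */
    unsigned long data;                /* argument passed to the handler  */
    struct demo_timer *next;           /* pending list, sorted by expires */
};

static struct demo_timer *demo_pending_timers;

/* Insert into the pending list, keeping it sorted by expiry.
 * Timers with equal expiry keep their insertion order. */
void demo_add_timer(struct demo_timer *t)
{
    struct demo_timer **pp = &demo_pending_timers;

    assert(t != NULL && t->function != NULL);
    while (*pp != NULL && (*pp)->expires <= t->expires)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Fire every timer whose expiry is <= now; returns how many fired. */
int demo_run_timers(unsigned long now)
{
    int fired = 0;

    while (demo_pending_timers != NULL &&
           demo_pending_timers->expires <= now) {
        struct demo_timer *t = demo_pending_timers;

        demo_pending_timers = t->next;   /* unlink before calling out */
        t->next = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}

/* Example handler: accumulates its argument so callers can observe it. */
static unsigned long demo_fired_sum;
void demo_on_timeout(unsigned long data) { demo_fired_sum += data; }
unsigned long demo_sum(void) { return demo_fired_sum; }
```

A handler that re-adds its own timer must pick an expiry later than the current tick, otherwise this simple pass would keep firing it.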
3) POSIX glue code
https://github.com/libos-nuse/net-next-nuse/blob/nuse/arch/sim/nuse-glue.c
extern int sim_sock_socket (int,int,int, struct socket **);
int socket (int family, int type, int proto)
{
  sim_update_jiffies ();
  struct socket *kernel_socket = sim_malloc (sizeof (struct socket));
  memset (kernel_socket, 0, sizeof (struct socket));
  int ret = sim_sock_socket (family, type, proto, &kernel_socket);
  g_fd_table[curfd++] = kernel_socket;
  sim_softirq_wakeup ();
  return curfd - 1;
}
Tx callgraph
sendmsg () (socket API)
sim_sock_sendmsg () (NUSE)
sock_sendmsg ()
ip_send_skb ()
ip_finish_output2 ()
dst_neigh_output () (ex-kernel)
neigh_resolve_output ()
arp_solicit ()
dev_queue_xmit ()
sim_dev_xmit () (NUSE)
nuse_vif_raw_write ()
Rx callgraph
vNIC rx:
start_thread () (pthread)
nuse_netdev_rx_trampoline ()
nuse_vif_raw_read () (NUSE)
sim_dev_rx ()
netif_rx () (ex-kernel)
softirq rx:
start_thread () (pthread)
do_softirq () (NUSE)
net_rx_action ()
process_backlog () (ex-kernel)
__netif_receive_skb_core ()
ip_rcv ()