
Page 1: Recent advance in netmap/VALE(mSwitch)

Recent advance in netmap/VALE(mSwitch)

Michio Honda, Felipe Huici (NEC Europe Ltd.)

Giuseppe Lettieri and Luigi Rizzo (Università di Pisa)

Kernel/VM@Jimbo-cho, Japan on Dec. 8 2013

[email protected] / @michioh

Page 2: Recent advance in netmap/VALE(mSwitch)

Outline

•  netmap API basics
  –  Architecture
  –  How to write apps
•  VALE (mSwitch)
  –  Architecture
  –  System design
  –  Evaluation
  –  Use cases

Page 3: Recent advance in netmap/VALE(mSwitch)

NETMAP API BASICS

Page 4: Recent advance in netmap/VALE(mSwitch)

netmap overview

•  A fast packet I/O mechanism between the NIC and user space
  –  Removes unnecessary metadata (e.g., sk_buff) allocation
  –  Amortizes system call costs, reduces or removes data copies

From http://info.iet.unipi.it/~luigi/netmap/

Page 5: Recent advance in netmap/VALE(mSwitch)

Performance

•  Saturates a 10 Gbps pipe even at low CPU frequency

From http://info.iet.unipi.it/~luigi/netmap/

Page 6: Recent advance in netmap/VALE(mSwitch)

netmap API (initialization)

•  open(“/dev/netmap”) returns a file descriptor
•  ioctl(fd, NIOCREG, arg) puts an interface in netmap mode
•  mmap(…, fd, 0) maps buffers and rings

From http://info.iet.unipi.it/~luigi/netmap/
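Roughly, that initialization sequence looks like the sketch below (error handling omitted). It assumes the headers and names shipped with the netmap distribution (net/netmap.h, net/netmap_user.h), where the register ioctl is spelled NIOCREGIF and takes a struct nmreq describing the interface.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Put 'ifname' into netmap mode and map its rings and buffers. */
struct netmap_if *
netmap_attach(const char *ifname, int *fd_out)
{
    struct nmreq req;
    void *mem;
    int fd = open("/dev/netmap", O_RDWR);       /* control device */

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);                 /* switch the NIC to netmap mode */

    /* map the shared region that holds all rings and packet buffers */
    mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    *fd_out = fd;
    return NETMAP_IF(mem, req.nr_offset);       /* per-interface descriptor */
}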

Page 7: Recent advance in netmap/VALE(mSwitch)

netmap API (TX)

•  TX
  –  Fill up to avail buffers, starting from slot cur
  –  ioctl(fd, NIOCTXSYNC) queues the packets
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
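A transmit loop following the two bullets above might look like this sketch (same includes as the initialization sketch). It uses the ring fields (cur, avail) and the helper macros (NETMAP_TXRING, NETMAP_BUF, NETMAP_RING_NEXT) as named in the netmap headers of this period.

/* Copy the same 'len'-byte frame into every available TX slot of ring 0,
 * then tell the kernel to queue the batch. */
void
send_batch(int fd, struct netmap_if *nifp, const char *frame, unsigned int len)
{
    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
    unsigned int cur = ring->cur;

    while (ring->avail > 0) {                   /* fill up to avail slots */
        struct netmap_slot *slot = &ring->slot[cur];
        char *buf = NETMAP_BUF(ring, slot->buf_idx);

        memcpy(buf, frame, len);                /* payload goes into the slot buffer */
        slot->len = len;
        cur = NETMAP_RING_NEXT(ring, cur);
        ring->avail--;
    }
    ring->cur = cur;
    ioctl(fd, NIOCTXSYNC, NULL);                /* hand the new slots to the NIC */
}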

Page 8: Recent advance in netmap/VALE(mSwitch)

netmap API (RX)

•  RX
  –  ioctl(fd, NIOCRXSYNC) reports newly received packets
  –  Process up to avail buffers, starting from slot cur
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
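The receive side mirrors the TX sketch; poll() with POLLIN gives the blocking variant, and NIOCRXSYNC is the non-blocking one. process_packet() is a hypothetical handler standing in for application code.

#include <poll.h>

void
recv_loop(int fd, struct netmap_if *nifp)
{
    struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        unsigned int cur;

        poll(&pfd, 1, -1);                      /* block until packets arrive     */
        cur = ring->cur;                        /* (or use ioctl(fd, NIOCRXSYNC)) */
        while (ring->avail > 0) {               /* process up to avail slots      */
            struct netmap_slot *slot = &ring->slot[cur];
            char *buf = NETMAP_BUF(ring, slot->buf_idx);

            process_packet(buf, slot->len);     /* hypothetical application hook  */
            cur = NETMAP_RING_NEXT(ring, cur);
            ring->avail--;
        }
        ring->cur = cur;                        /* return consumed slots to the kernel */
    }
}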

Page 9: Recent advance in netmap/VALE(mSwitch)

Other features

•  Multi-queue support
  –  One netmap ring is initialized per physical ring
  –  e.g., different pthreads can be assigned to different netmap/physical rings
•  Host stack support
  –  The NIC is put into netmap mode, resetting its PHY
  –  The host stack still sees the interface, and packets can be sent to/from the NIC via “software rings”
    •  Either implicitly by the kernel or explicitly by the app
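Both bindings are selected at registration time via the nr_ringid field of struct nmreq. A sketch (same includes as the initialization sketch), assuming the NETMAP_HW_RING and NETMAP_SW_RING flags from the netmap header of this period:

/* Restrict this file descriptor to a single hardware ring (one per thread),
 * or to the host-stack "software" rings. */
int
bind_rings(int fd, const char *ifname, int hw_ring /* -1 for host rings */)
{
    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
    if (hw_ring >= 0)
        req.nr_ringid = NETMAP_HW_RING | hw_ring;   /* only hardware ring #hw_ring */
    else
        req.nr_ringid = NETMAP_SW_RING;             /* only the host-stack rings   */
    return ioctl(fd, NIOCREGIF, &req);
}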

From http://info.iet.unipi.it/~luigi/netmap/

Page 10: Recent advance in netmap/VALE(mSwitch)

Implementation

•  Available for FreeBSD and Linux
  –  The Linux code is glued onto the FreeBSD one
•  Common code
  –  Control path, system call backends, memory allocator, etc.
•  Device-specific code
  –  Each supported driver implements a few functions:
    •  nm_register(struct ifnet *ifp, int onoff)
      –  Put the NIC into netmap mode, allocate netmap rings and slots
    •  nm_txsync(struct ifnet *ifp, u_int ring_nr)
      –  Flush out the packets from the netmap ring filled by the user
    •  nm_rxsync(struct ifnet *ifp, u_int ring_nr)
      –  Refill the netmap ring with received packets

Page 11: Recent advance in netmap/VALE(mSwitch)

Implementation (cont.)

•  Small modifications in device drivers

From http://info.iet.unipi.it/~luigi/netmap/

Page 12: Recent advance in netmap/VALE(mSwitch)

VALE (MSWITCH): A NETMAP-BASED, VERY FAST SOFTWARE SWITCH

Page 13: Recent advance in netmap/VALE(mSwitch)

Software switch

•  Switching packets between network interfaces
  –  On a general-purpose OS and processor
  –  Virtual and hardware interfaces
•  20 years ago
  –  Prototyping, a low-performance alternative
•  Now and the near future
  –  Replacement for hardware switches
  –  Hosting virtual machines (incl. virtual network functions)


Page 14: Recent advance in netmap/VALE(mSwitch)

Performance of today’s software switch

•  Forwarding packets from one 10 Gbps NIC to another
  –  Xeon E5-1650 (3.8 GHz), 1 CPU core used
  –  Below 1 Gbps for minimum-sized packets

[Figure: throughput (Gbps) vs. packet size (64–1024 bytes) for the FreeBSD bridge and Open vSwitch.]

Page 15: Recent advance in netmap/VALE(mSwitch)

Problems

1.  Inefficient packet I/O mechanisms
  –  Today's software switches use dynamically allocated, large metadata (e.g., sk_buff) designed for end systems
  –  This should be simplified, because the packet is just being forwarded
  –  For switches it is more important to process small packets efficiently
2.  Inefficient packet switching algorithms
  –  How to move packets from the source to the destination(s) efficiently?
  –  Traditional way:
    •  Lock a destination, send a single packet, then unlock the destination
    •  Inefficient due to locking cost/contention


Page 16: Recent advance in netmap/VALE(mSwitch)

Problems (cont.)

3.  Lack of flexibility in packet processing
  –  How is a packet's destination decided?
  –  One could use a layer 2 learning bridge to decide it
  –  One could use OpenFlow packet matching instead


Page 17: Recent advance in netmap/VALE(mSwitch)

Solutions

1.  “Inefficient packet I/O mechanisms”
  –  Simple, minimalistic packet representation (netmap API*)
    •  No metadata allocation cost
    •  Reduced cache pollution
2.  “Inefficient packet switching”
  –  Group multiple packets going to the same destination
  –  Lock the destination only once per group of packets


* Netmap, a novel framework for fast packet I/O: http://info.iet.unipi.it/~luigi/netmap/ (Luigi Rizzo, Università di Pisa)


Page 18: Recent advance in netmap/VALE(mSwitch)

Bitmap-based forwarding algorithm

•  Algorithm in the original VALE
•  Support for unicast, multicast and broadcast
  –  Get pointers to a batch of packets
  –  Identify the destination(s) of each packet and represent them as a bitmap
  –  Lock each destination, and send all the packets going there
•  Problem
  –  Scalability issue in the presence of many destinations

VALE, a Virtual Local Ethernet: http://info.iet.unipi.it/~luigi/vale/ (Luigi Rizzo, Giuseppe Lettieri, Università di Pisa)

Excerpt from the mSwitch paper (Figures 3 and 4 and Sections 4.2–4.4):

[Figure 3. Bitmap-based packet forwarding algorithm: packets are labeled from p0 to p4; for each packet, destination(s) are identified and represented in a bitmap (a bit for each possible destination port). The forwarder considers each destination port in turn, scanning the corresponding column of the bitmap to identify the packets bound to the current destination port.]

[Figure 4. List-based packet forwarding: packets are labeled from p0 to p4, destination port indices are labeled from d1 to d254 (254 means broadcast). The sender's "next" field denotes the next packet in a batch, the receiver's list has "head" and "tail" pointers. When the destinations for all packets have been identified (top part of graph), the forwarder serves each destination by scanning the destination's private list and the broadcast one (bottom of graph).]

[…] Cache pollution might be significant, so we may revise this decision in the future.

Copies are instead the best option when switching packets from/to virtual ports: buffers cannot be shared between virtual ports for security reasons, and altering page mappings to transfer ownership of the buffers is immensely more expensive.

4.2 Packet Forwarding Algorithm
The original VALE switch operates on batches of packets to improve efficiency, and groups packets to the same destination before forwarding so that locking costs can be amortized. The grouping algorithm uses a bitmap (see Figure 3) to indicate packets' destinations; this is an efficient way to support multicast, but has a forwarding complexity that is linear in the number of connected ports (even for unicast traffic and in the presence of idle ports). As such, this does not scale well for configurations with a large number of ports.

To reduce this complexity, in mSwitch we only allow unicast or broadcast packets (multicast is mapped to broadcast). This lets us replace the bitmap with N+1 lists, one per destination port plus one for broadcast traffic (Figure 4). In the forwarding phase, we only need to scan two lists (one for the current port, one for broadcast), which makes this a constant-time step, irrespective of the number of ports. Figures 8 and 9 show the performance gains.

4.3 Improving Output Parallelism
Once the list of packets destined to a given port has been identified, the packets must be copied to the destination port and made available on the ring. In the original VALE implementation, the sending thread locks the destination port for the duration of the entire operation; this is done for a batch of packets, not a single one, and so can take relatively long.

Instead, mSwitch improves parallelism by operating in two phases: a sender reserves (under a lock) a sufficiently large contiguous range of slots in the output queue, then releases the lock during the copy, allowing concurrent operation on the queue, and finally acquires the lock again to advance queue pointers. This latter phase also handles out-of-order completions of the copy phases, which may occur for many reasons (different batch or packet sizes, cache misses, page faults). With this new strategy, the queue is locked only for the short intervals needed to reserve slots and update queue pointers. As an additional side benefit, we can now tolerate page faults during the copy phase, which allows using userspace buffers as data sources. We provide evaluation results for this mechanism in Section 5.3.

4.4 Pluggable Packet Processing
VALE uses a hard-wired packet processing function that implements a learning Ethernet bridge. In mSwitch, each switch instance can configure its own packet processing function by loading a suitable kernel module which implements it. The packet processing function is called once for each packet in a batch, before the actual forwarding takes place, and the response is used to add the packet to the unicast or broadcast lists discussed in Section 4.2. The function receives a pointer to the packet's buffer and its length, and should return the destination port (with special values indicating "broadcast" or "drop"). The function can perform arbitrary actions on the packet, including modifying its content or length, within the size of the buffer. […]

(Footnote: the spinlock could be replaced by a lock-free scheme, but we doubt this would provide any measurable performance gain.)
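The two-phase scheme of Section 4.3 can be illustrated with the following user-space model (a pthread mutex stands in for the kernel spinlock; the names, sizes and reservation bookkeeping are illustrative, not mSwitch's actual data structures): reserve slots under the lock, copy without it, then advance the visible end of the queue under the lock while tolerating out-of-order completions.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define QSLOTS  1024                 /* slots in the output queue        */
#define SLOTSZ  2048                 /* bytes per slot buffer            */
#define MAXRES  64                   /* outstanding reservations         */

struct outq {
    pthread_mutex_t lock;
    uint32_t head;                   /* next slot to hand out            */
    uint32_t visible;                /* receiver may read up to here     */
    struct { uint32_t start, len, done; } res[MAXRES];
    uint32_t res_head, res_tail;     /* reservation FIFO (monotonic)     */
    char buf[QSLOTS][SLOTSZ];
};

/* Phase 1: under the lock, grab a contiguous range of n slots. */
static int reserve(struct outq *q, uint32_t n, uint32_t *start)
{
    int id;
    pthread_mutex_lock(&q->lock);
    id = q->res_head++ % MAXRES;      /* assume < MAXRES writers in flight */
    *start = q->head;
    q->res[id].start = q->head;
    q->res[id].len = n;
    q->res[id].done = 0;
    q->head = (q->head + n) % QSLOTS; /* assume the queue never fills      */
    pthread_mutex_unlock(&q->lock);
    return id;
}

/* Phase 2: copy the batch into the reserved slots, lock not held. */
static void fill(struct outq *q, uint32_t start, char **pkt, uint32_t *len, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        memcpy(q->buf[(start + i) % QSLOTS], pkt[i], len[i]);
}

/* Phase 3: under the lock, mark our reservation done and advance 'visible'
 * over every leading completed reservation (handles out-of-order finishes). */
static void publish(struct outq *q, int id)
{
    pthread_mutex_lock(&q->lock);
    q->res[id].done = 1;
    while (q->res_tail != q->res_head) {
        uint32_t t = q->res_tail % MAXRES;
        if (!q->res[t].done)
            break;                   /* an earlier copy is still in progress */
        q->visible = (q->res[t].start + q->res[t].len) % QSLOTS;
        q->res_tail++;
    }
    pthread_mutex_unlock(&q->lock);
}

In the kernel the reservation bookkeeping is per-ring state and the copies target netmap slots, but the locking pattern (lock only to reserve and to advance pointers) is the same.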

Page 19: Recent advance in netmap/VALE(mSwitch)

List-based forwarding algorithm

•  Algorithm in the current VALE (mSwitch)
•  Support for unicast and broadcast
  –  Make a linked list for each destination
  –  Broadcast packets are mapped to destination index 254
  –  Scan each destination; broadcast packets are inserted in order
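A minimal sketch of the data structure behind this (field names and bounds are illustrative; the real code also interleaves broadcast packets in order with each destination's own packets, which this sketch omits for brevity):

#include <stdint.h>

#define BATCH   256                 /* packets grouped per forwarding pass              */
#define NPORTS  254                 /* destination port indices                         */
#define BCAST   NPORTS              /* extra list for broadcast (the slide's index 254) */
#define NIL     0xffff

struct fwd_lists {
    uint16_t next[BATCH];                        /* per-packet "next" link                    */
    uint16_t head[NPORTS + 1], tail[NPORTS + 1]; /* one list per destination, plus broadcast  */
};

static void lists_reset(struct fwd_lists *f)
{
    for (int d = 0; d <= NPORTS; d++)
        f->head[d] = f->tail[d] = NIL;
}

/* Append packet i of the batch to destination d (or to BCAST). */
static void lists_append(struct fwd_lists *f, uint16_t i, uint16_t d)
{
    f->next[i] = NIL;
    if (f->head[d] == NIL)
        f->head[d] = i;
    else
        f->next[f->tail[d]] = i;
    f->tail[d] = i;
}

/* Serving a destination scans only its own list and the broadcast list,
 * so the per-port cost is independent of how many ports are attached. */
static void lists_serve(const struct fwd_lists *f, uint16_t d,
                        void (*deliver)(uint16_t pkt, uint16_t dst))
{
    for (uint16_t i = f->head[d]; i != NIL; i = f->next[i])
        deliver(i, d);
    for (uint16_t i = f->head[BCAST]; i != NIL; i = f->next[i])
        deliver(i, d);
}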

[Figures 3 and 4 and the accompanying paper excerpt are shown on the previous slide.]

Page 20: Recent advance in netmap/VALE(mSwitch)

Solutions (cont.)

3.  “Lack of flexibility in packet processing”
  –  Separate the switching fabric from packet processing
  –  Switching fabric
    •  Moves packets quickly
  –  Packet processing
    •  Decides a packet's destination and tells the switching fabric
    •  Returns:
      –  The index of the destination port for unicast
      –  NM_BDG_BROADCAST for broadcast
      –  NM_BDG_NOPORT to drop the packet
  –  By default, L2 learning is used


typedef u_int (*BDG_LOOKUP_T)(char *buf, u_int len, uint8_t *ring_nr, struct netmap_adapter *srcif);
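A packet-processing module is essentially one function with this prototype. The sketch below is a toy example, not the built-in L2 learning code: it steers IPv4 frames to switch port 1, floods other Ethernet traffic, and drops runt frames, using the NM_BDG_BROADCAST/NM_BDG_NOPORT return values listed above. struct netmap_adapter comes from the netmap kernel headers; the ring_nr and srcif arguments are simply ignored here.

#include <net/ethernet.h>
#include <netinet/in.h>      /* ntohs() */

static u_int
toy_lookup(char *buf, u_int len, uint8_t *ring_nr, struct netmap_adapter *srcif)
{
    const struct ether_header *eh = (const struct ether_header *)buf;

    (void)ring_nr;
    (void)srcif;
    if (len < sizeof(*eh))
        return NM_BDG_NOPORT;                 /* runt frame: drop           */
    if (ntohs(eh->ether_type) == ETHERTYPE_IP)
        return 1;                             /* unicast to switch port 1   */
    return NM_BDG_BROADCAST;                  /* everything else: flood     */
}

Registering the function with a switch instance goes through the kernel-module/control interface, which is not shown here.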

Page 21: Recent advance in netmap/VALE(mSwitch)

VALE (mSwitch) architecture

•  Packet forwarding (identifying each packet's destination (packet processing) and copying packets to the destination ring) takes place in the sender's context
  –  The receiver just consumes the packets

[Diagram: user apps and VMs (app1/vm1 … appN/vmN) attach to the switching fabric through the netmap API via virtual interfaces; NICs and the OS stack (reached through the socket API) attach as well; the pluggable packet-processing module sits next to the switching fabric inside the kernel.]

Page 22: Recent advance in netmap/VALE(mSwitch)

Other features

•  Indirect buffer support
  –  A netmap slot can contain a pointer to the actual buffer
  –  Useful for eliminating the data copy from a VM backend into a netmap slot
•  Support for large packets
  –  Multiple netmap slots (2048 bytes each by default) can be used to hold a single packet
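Both features are expressed through per-slot flags on a VALE virtual port's TX ring. A sketch, assuming the NS_INDIRECT and NS_MOREFRAG flag names and the slot "ptr" field from the netmap headers of this period; whether a single indirect slot may describe more than the nominal slot size is implementation dependent, so this sketch conservatively splits at 2048 bytes.

#include <stdint.h>

/* Queue one (possibly large) packet held in a user buffer, without copying
 * it into the netmap slot buffers. */
void
queue_indirect(struct netmap_ring *ring, char *payload, unsigned int len)
{
    unsigned int cur = ring->cur;
    const unsigned int chunk = 2048;            /* default slot buffer size */

    while (len > 0) {
        struct netmap_slot *slot = &ring->slot[cur];
        unsigned int now = len < chunk ? len : chunk;

        slot->ptr = (uintptr_t)payload;         /* indirect: use our buffer       */
        slot->flags = NS_INDIRECT;
        if (len > now)
            slot->flags |= NS_MOREFRAG;         /* packet continues in next slot  */
        slot->len = now;

        payload += now;
        len -= now;
        cur = NETMAP_RING_NEXT(ring, cur);
        ring->avail--;
    }
    ring->cur = cur;                            /* NIOCTXSYNC/poll() pushes it out */
}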

Page 23: Recent advance in netmap/VALE(mSwitch)

Bare mSwitch performance

•  NIC to NIC (10 Gbps)
•  “Dummy” processing module

[Figure: NIC-to-NIC throughput (Gbps) vs. packet size (64–1518 bytes) with the dummy module, against the 10 Gbps line rate.]

[Figure 5. Throughput between 10 Gb/s NICs and virtual ports for different packet sizes (64 B, 128 B, 256 B) and CPU frequencies (1.3–3.8 GHz): (a) NIC to NIC, (b) NIC to virtual interface, (c) virtual interface to NIC. Solid lines are for a single NIC, dashed ones for two.]

[…] (Intel Xeon [CPU model garbled in the transcript], 16 GB of RAM) with a similar card. Unless otherwise stated, we run the CPU of the mSwitch server at 3.8 GHz. In terms of operating system, we rely on FreeBSD 10 for most experiments (mSwitch can also run in Linux), and Linux 3.9 for the Open vSwitch experiments in the next section. To generate and count packets we use pkt-gen, a fast generator that uses the netmap API and so can be plugged into mSwitch's virtual ports. Throughout, we use Gb/s to mean Gigabits per second, Mp/s for millions of packets per second, and unless otherwise stated we use a batch size of 1024 packets.

5.1 Basic performance
For the first experiments, and to derive a set of baseline performance figures, we implemented a dummy plug-in module. The idea is that the source port is configured to mark packets with a static destination port. From there, the packet goes to the dummy module, which returns immediately (thus giving us a base figure for how much it costs for packets to go through the entire switch's I/O path), and then the switch sends the packet to the destination port.

We evaluate mSwitch's throughput for different packet sizes and combinations of NICs and virtual ports. We further vary our CPU's frequency by either using Turbo Boost to increase it or a sysctl to decrease it; this lets us shed light on CPU-bound bottlenecks.

To begin, we connect the two 10 Gb/s ports of the NIC to the switch, and have it forward packets from one to the other using a single CPU core (figure 5(a)). With this setup, we obtain 10 Gb/s line rate for all CPU frequencies for 256-byte packets or larger, and line rate for 128-byte ones starting at 2.6 GHz. Minimum-sized packets are more CPU-intensive, requiring us to enable Turbo Boost to reach 96% of line rate (6.9/7.1 Gb/s or 14.3/14.9 Mp/s). In a separate experiment we confirmed that this small deviation from line rate was a result of our lower-frequency receiver server.

[Figure 6. Forwarding performance between two virtual ports with different numbers of CPU cores and rings per port (throughput in Gbps vs. packet size, 60 B to 64 KB, for 1, 2 and 3 CPU cores).]

The previous set of experiments all involved NICs. To get an idea of mSwitch's raw switching performance, we attach a pair of virtual ports to the switch and forward packets […]

Experiments are done with a Xeon E5-1650 CPU (6 cores, 3.8 GHz with Turbo Boost), 16 GB DDR3 RAM (quad channel) and Intel X520-T2 10 Gbps NICs.

Page 24: Recent advance in netmap/VALE(mSwitch)

Bare mSwitch performance
•  Virtual port to virtual port
•  “Dummy” processing module

[Figure 6 (virtual port to virtual port forwarding with 1–3 CPU cores per port) and the accompanying paper excerpt are shown on the previous slide.]

Page 25: Recent advance in netmap/VALE(mSwitch)

Bare mSwitch performance

•  N virtual ports to N virtual ports
•  “Dummy” packet processing module

[Figure 7(a): experiment topologies, unicast (top) and broadcast (bottom).]

[Figure 7. Switching capacity with an increasing number of virtual ports (2–8) for 64 B, 1514 B and 64 KB packets: (b) unicast throughput, (c) broadcast throughput (both in Gbps). For unicast, each src/dst port pair is assigned a single CPU core; for broadcast, each port is given a core. For setups with more than 6 ports (our system has 6 cores) we assign cores in a round-robin fashion.]

[…] between them. We also leverage mSwitch's ability to assign multiple packet rings to each port (and a CPU core to each of the rings) to further scale performance. Figure 6 shows throughput when assigning an increasing number of CPU cores to each port (up to 3 per port, at which point all 6 cores in our system are in use). We see a rather high rate of about 185.0 Gb/s for 1514-byte packets and 25.6 Gb/s (53.3 Mp/s) for minimum-sized ones. We also see a peak of 215.0 Gb/s for 64 KB “packets”; given that we are not limited by memory bandwidth (our 1333 MHz quad-channel memory has a maximum theoretical rate of 10.6 GB/s per channel, so roughly 339 Gb/s in total), we suspect the limitation to be CPU frequency.

5.2 Switching Scalability
Having tested mSwitch's performance with NICs attached, as well as between a pair of virtual ports, we now investigate how its switching capacity scales with additional numbers of virtual ports.

We begin by testing unicast traffic, as shown in figure 7(a) (top). We use an increasing number of port pairs, each pair consisting of a sender and a receiver. Each pair is handled by a single thread which we pin to a separate CPU core (as long as there are more cores than port pairs; when that is not the case, we pin more than one pair of ports to a CPU core in a round-robin fashion).

Figure 7(b) shows the results. We observe high switching capacity. For minimum-sized packets, mSwitch achieves rates of 26.2 Gb/s (54.5 Mp/s) with 6 ports, a rate that decreases slightly down to 22.5 Gb/s when we start sharing cores among ports (8 ports). For 1514-byte packets we see a maximum rate of 172 Gb/s with 4 ports, down to 136.5 Gb/s for 8 ports.

For the second test, we perform the same experiment but this time each sender transmits broadcast packets (figure 7(a), bottom). An mSwitch sender is slower than a receiver because packet switching and processing happen in the sender's context. This experiment is thus akin to having more senders, meaning that the receivers should be less idle than in the previous experiment and thus cumulative throughput should be higher (assuming no other bottlenecks).

This is, in fact, what we observe (figure 7(c)): in this case, mSwitch yields a rate of 46 Gb/s (96 Mp/s) for 64-byte packets when using all 6 cores, going down to 38 Gb/s when sharing cores among 8 ports. 1514-byte packets result in a maximum rate of 205.4 Gb/s for 6 ports and 170.5 Gb/s for 8 ports. As expected, all of these figures are higher than in the unicast case.

[Figure 8. Packet forwarding throughput from a single source port to 1–5 destination ports using minimum-sized packets (aggregate throughput, Gbps). The graph compares mSwitch's forwarding algorithm (list) to VALE's (bitmap).]

Next, we evaluate the cost of having a single source send packets to an increasing number of destination ports, up to 5 of them, at which point all 6 cores in our system are in use. Figure 8 shows aggregate throughput from all destination ports […]


Page 26: Recent advance in netmap/VALE(mSwitch)

mSwitch’s Scalability

•  A single virtual port to many virtual ports
  –  Bitmap-based vs. list-based algorithm
  –  The list-based algorithm scales very well

[Figures 7 and 8 are shown on the previous slide.]

[…]ments over the bitmap algorithm, which yields throughputs of 12.1 Gbps and 9.8 Gbps respectively.

In the final baseline test we compare mSwitch's algorithm to that of VALE in the presence of large numbers of ports. In order to remove effects arising from context switches and the number of cores in our system from the comparison, we emulate the destination ports as receive packet queues that automatically empty once packets are copied to them. This mechanism incurs most of the costs of forwarding packets to the destination ports without being bottlenecked by having to wait for the destination ports' context/CPU core to run. We further use minimum-sized packets and a single CPU core.

Figure 9 clearly shows the advantage of the new algorithm: we see a linear decrease with an increasing number of ports, as opposed to an exponential one with the bitmap algorithm. As a point of comparison, for 60 ports the list algorithm yields a rate of 11.1 Gbps (23.2 Mpps) versus 3.0 Gbps (6.4 Mpps) for the bitmap one. This is a rather large improvement, especially if we consider that the list algorithm supports much larger numbers of ports (the bitmap algorithm supports up to 64 ports only; the graph shows only up to 60 for simplicity of illustration). In the presence of 250 destination ports mSwitch is still able to produce a rather respectable 7.2 Gbps forwarding rate.

[Figure 9. Comparison of mSwitch's forwarding algorithm (list) to that of VALE (bitmap) in the presence of a large number of active destination ports (1–250; single sender, minimum-sized packets; aggregate throughput in Gbps). As ports increase, each destination receives a smaller batch of packets in each round, hence reducing the efficiency of batch processing.]

5.3 Output Queue Management
We now evaluate the performance gains derived from using mSwitch's parallelization in handling output queues. Recall from Section 4.3 that mSwitch allows concurrent copies to […]


We use minimum-sized packets with a single CPU core.

Page 27: Recent advance in netmap/VALE(mSwitch)

Learning bridge performance

•  mSwitch-learn: pure learning-bridge packet processing
  –  Adds the cost of MAC address hashing to packet processing
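Illustratively, the extra per-packet work is roughly the following: a toy direct-mapped MAC table matching the BDG_LOOKUP_T prototype shown earlier. The hash, table layout and the port_of() helper are hypothetical stand-ins, not mSwitch-learn's actual code.

#include <stdint.h>
#include <string.h>

#define L2_TABSZ 4096                            /* entries, power of two */

struct l2_entry { uint8_t mac[6]; uint8_t port; uint8_t valid; };
static struct l2_entry l2_table[L2_TABSZ];

static uint32_t mac_hash(const uint8_t *mac)
{
    uint32_t h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 131 + mac[i];
    return h & (L2_TABSZ - 1);
}

static u_int
learn_lookup(char *buf, u_int len, uint8_t *ring_nr, struct netmap_adapter *srcif)
{
    const uint8_t *dst = (const uint8_t *)buf;   /* Ethernet dst MAC */
    const uint8_t *src = dst + 6;                /* Ethernet src MAC */
    uint32_t h;

    (void)ring_nr;
    if (len < 14)
        return NM_BDG_NOPORT;                    /* not even an Ethernet header */

    h = mac_hash(src);                           /* learn where the sender lives */
    memcpy(l2_table[h].mac, src, 6);
    l2_table[h].port = port_of(srcif);           /* hypothetical: source port index */
    l2_table[h].valid = 1;

    h = mac_hash(dst);                           /* look up the receiver */
    if (l2_table[h].valid && memcmp(l2_table[h].mac, dst, 6) == 0)
        return l2_table[h].port;
    return NM_BDG_BROADCAST;                     /* unknown destination: flood */
}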

[Figure 12. Packet forwarding performance for different mSwitch modules and packet sizes (throughput in Gbps, 64–1024 byte packets): (a) layer 2 learning bridge (FreeBSD bridge vs. mSwitch-learn), (b) 3-tuple filter for user-space network stack support (mSwitch-3-tuple), (c) Open vSwitch (OVS vs. mSwitch-OVS). For the learning bridge and Open vSwitch modules we compare them to their non-mSwitch versions (FreeBSD bridge and standard Open vSwitch, respectively).]

[…] We implement a filter module able to direct packets between mSwitch ports based on the <dst IP, protocol type, dst port> 3-tuple. User-level processes (i.e., the stacks) request the use of a certain 3-tuple, and if the request does not collide with previous ones, the module inserts the 3-tuple entry, along with a hash of it and the switch port number that the process is connected to, into its routing table. On packet arrival, the module ensures that packets are IPv4 and that their checksum is correct. If so, it retrieves the packet's 3-tuple, and matches a hash of it with the entries in the table. If there is a match, the packet is forwarded to the corresponding switch port, otherwise it is dropped. The module also includes additional filtering to make sure that stacks cannot spoof packets, and consists of about 200 lines of code in all, excluding protocol-related data structures.

To be able to analyze the performance of the module separately from that of any user-level stack that may make use of it, we run a test that forwards packets through the module and between two 10 Gb/s ports; this represents the same costs in mSwitch and the module that would be incurred if an actual stack were connected to a port. Figure 12(b) shows throughput results. We see a high rate of 5.8 Gb/s (12.1 Mp/s or 81% of line rate) for 64-byte packets, and full line rate for larger packet sizes, values that are only slightly lower than those from the learning bridge module.

To take this one step further, we built a simple user-level TCP stack library along with an HTTP server built on top of it. Using these, and relying on mSwitch's 3-tuple module for back-end support, we were able to achieve HTTP rates of up to 8 Gb/s for 16 KB and higher fetch sizes, and approximately 100 K requests per second for short requests.
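The per-packet lookup step described above is roughly the following. The offsets follow the standard Ethernet/IPv4/TCP/UDP layouts; the hash_tuple() and table_find() helpers and the table itself are hypothetical, and the checksum and anti-spoofing checks mentioned in the text are omitted.

#include <stdint.h>
#include <string.h>

struct tuple3 { uint32_t dst_ip; uint16_t dst_port; uint8_t proto; };

static u_int
tuple3_lookup(char *buf, u_int len, uint8_t *ring_nr, struct netmap_adapter *srcif)
{
    const uint8_t *p = (const uint8_t *)buf;
    struct tuple3 t;
    unsigned int ihl;
    int port;

    (void)ring_nr;
    (void)srcif;
    if (len < 14 + 20 + 4 || p[12] != 0x08 || p[13] != 0x00)
        return NM_BDG_NOPORT;                    /* not IPv4 (or too short): drop */

    ihl = (p[14] & 0x0f) * 4;                    /* IPv4 header length            */
    memcpy(&t.dst_ip, p + 14 + 16, 4);           /* destination address           */
    t.proto = p[14 + 9];                         /* protocol (TCP/UDP/...)        */
    memcpy(&t.dst_port, p + 14 + ihl + 2, 2);    /* TCP/UDP destination port      */

    port = table_find(hash_tuple(&t), &t);       /* hypothetical routing-table lookup */
    return port >= 0 ? (u_int)port : NM_BDG_NOPORT;
}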


Page 28: Recent advance in netmap/VALE(mSwitch)

Open vSwitch acceleration

•  mSwitch-OVS: mSwitch with Open vSwitch's packet processing

[Figure 12(c) (standard Open vSwitch vs. mSwitch-OVS) is shown on the previous slide.]

6.3 Open vSwitch
In our final use case, we attempt to see whether it is simple to port existing software packages to mSwitch, and whether doing so provides a performance boost. To this end we target Open vSwitch, modifying the code of the version in the 3.9.0 Linux kernel in order to adapt it to mSwitch; we use the term mSwitch-OVS to refer to this mSwitch module.

In the end we modified 476 lines of code (LoC). Only 59 LoC of changes are in the original files, which means that following future updates of Open vSwitch should be easy. 201 LoC add support for netmap mode on "internal-dev", an Open vSwitch interface type that connects to the host network stack. The rest of the modifications comprise control functions that instruct the mSwitch instance (e.g., attaching an interface) based on Open vSwitch control commands, glue code between Open vSwitch's packet processing and mSwitch's lookup routine, and the lookup routine itself.

As a short background, figure 13 illustrates Open vSwitch's datapath. The datapath typically runs in the kernel, and includes one or more NICs or virtual interfaces as ports; it is, at this high level, similar to mSwitch's datapath (recall figure 2). The datapath takes care of receiving packets on […]

[Figure 13. Overview of Open vSwitch's datapath: a kernel datapath module with device-specific and common receive paths, packet matching, and common and device-specific send paths, connected to NICs on either side.]


Page 29: Recent advance in netmap/VALE(mSwitch)

Conclusion

•  Our contributions
  –  VALE (mSwitch): a fast, modular software switch
  –  Very fast packet forwarding on bare metal
    •  > 200 Gbps between virtual ports (with 1500-byte packets and 3 CPU cores)
    •  Almost line rate using 1 CPU core and two 10 Gbps NICs
  –  Useful for implementing various systems
    •  Very fast learning bridge
    •  Accelerates Open vSwitch by up to 2.6 times
      –  Small modifications, preserving the control interface
    •  Fast protocol multiplexer/demultiplexer for user-space protocol stacks

•  Code (Linux, FreeBSD) is available at: –  http://info.iet.unipi.it/~luigi/netmap/
