nerd lunch
TRANSCRIPT
-
8/3/2019 Nerd Lunch
1/51
1
RouteBricks
Scaling Software Routers with Modern Servers
Kevin Fall
Intel Labs, Berkeley
Feb 24, 2010
Ericsson, San Jose, CA
-
Project Participants
Intel Labs
  Gianluca Iannaccone (co-PI, researcher)
  Sylvia Ratnasamy (co-PI, researcher)
  Kevin Fall (principal engineer)
  Allan Knies (principal engineer)
  Maziar Manesh (research engineer)
  Eddie Kohler (Click expert)
  Dan Dahle (tech strategy)
  Badarinath Kommandur (tech strategy)
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
  Katerina Argyraki (faculty)
  Mihai Dobrescu (student)
  Diaqing Chu (student)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
-
RouteBricks: in a nutshell
A high-speed router using IA server components
fully programmable: control and data plane
extensible: evolve networks via software upgrade
incrementally scalable: flat cost per bit
-
Motivation
Network infrastructure is doing more than ever before
Packet-pushing (routing) is no longer the whole story:
security, data loss protection, application optimization, etc.
This has led to a proliferation of special appliances,
and to notions that perhaps routers could do more:
Cisco, Juniper supporting open APIs
OpenFlow consortium: Stanford, HP, Broadcom, Cisco
But these platforms weren't born programmable
-
Motivation
If flexibility ultimately implies programmability...
Hard to beat IA platforms and their ecosystem
Or price
However, must deal with persistent folklore:
"IA can't do high-speed packet processing"
But today's IA isn't the IA you know from your youth:
multicore, multiple integrated mem-controllers, PCIe, multi-Q NICs, ...
-
Motivation
Combine a desire for more programmability...
with new router-friendly server trends:
a new opportunity for IA servers?
RouteBricks: how might we
build a big (~1Tbps) IA-based software router?
-
Challenge
traditional software routers
research prototypes (2007): 1 - 2 Gbps
Vyatta* datasheet (2009): 2 - 4 Gbps
current carrier-grade routers
line speeds: 10/40Gbps
aggregate switching speeds: 40Gbps to 92Tbps!
* Other names and brands may be claimed as properties of others
-
Strategy
1. A cluster-based router architecture
each server need only scale to line speeds (10-40Gbps), rather than aggregate speeds (40Gbps to 92Tbps)
2. Understand whether modern server architectures can scale to line speeds (10-40Gbps)
if not, why?
3. Leverage open-source control plane implementations
xorp, quagga, etc. [but we focus on data plane here]
-
Broader Benefits
1. infrastructure that is well-known and cheaper to evolve
familiar programming environment
separately-evolvable network software and hardware
reduced cost -> more frequent upgrade opportunity
2. networks with the benefits of the PC ecosystem
high-volume manufacturing
widespread supply/support
state-of-the-art process technologies (ride Moore's Law)
evolving PC platform features (power mgmt, crypto, etc.)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
-
Traditional router architecture
[diagram: router with ports 1..N, each running at R bps in each direction]
N ports, per-port speed R bps
-
Traditional router architecture
[diagram: switch fabric plus switch scheduler interconnecting N linecards; each linecard does IP address lookup, queue management, and shaping, and holds address tables, FIB, and ACLs; a control processor runs IOS/quagga/xorp, etc.]
each linecard runs at R bps; the fabric runs at N*R
-
Moving to a cluster-router
[diagram: same traditional architecture (switch fabric, switch scheduler, linecards with lookup/queue-management/FIB/ACL functions, control processor running IOS/quagga/xorp, etc.)]
step 1: single server implements one port; N ports -> N servers
-
Moving to a cluster-router
step 1: single server implements one port; N ports -> N servers
[diagram: one server replaces a linecard; the linecard functions (IP address lookup, queue management, shaping, address tables, FIB, ACLs) are implemented in software]
Each server must process at least 2R traffic (in + out)
-
Moving to a cluster-router
step 2: replace the switch fabric and scheduler
with a distributed, software-based solution
[diagram: N ports at R bps; switch fabric, switch scheduler, and control processor (runs IOS/quagga/xorp, etc.)]
-
Moving to a cluster-router
step 2: replace the switch fabric and scheduler
with a distributed, software-based solution
[diagram: N ports at R bps; servers connected by a server-to-server interconnect topology; control processor (runs IOS/quagga/xorp, etc.)]
distributed scheduling algorithms, based on Valiant Load Balancing (VLB)
-
Example: VLB over a mesh* (*other topologies offer different tradeoffs)
# servers:               N
internal fanout:         N-1
internal link capacity:  2R/(N-1)  (= R*N / [N*(N-1)/2])
processing/server:       3R (2R*)  [out + in + through]
N servers can achieve switching speeds of N*R bps, provided each
server can process packets at 3R (*2R for the Direct-VLB average case)
[diagram: full mesh of N servers; N ports, R bps port rate, each direction]
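The per-server numbers in the table can be sanity-checked with a few lines (a sketch; the function name is mine, the formulas are the ones from this slide):

```python
def vlb_mesh_requirements(n_servers, r_gbps):
    """Per-server requirements for an N-server VLB full mesh."""
    fanout = n_servers - 1                    # one link to every other server
    link_gbps = 2 * r_gbps / (n_servers - 1)  # = R*N / [N*(N-1)/2]
    proc_gbps = 3 * r_gbps                    # out + in + through (2R avg for Direct-VLB)
    return fanout, link_gbps, proc_gbps

# e.g. a 4-server mesh with 10Gbps ports:
fanout, link, proc = vlb_mesh_requirements(4, 10)
```

Note that the processing requirement stays 3R (or 2R) regardless of N; only the fanout and per-link capacity change with cluster size.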
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
Next steps
-
RB4: hardware architecture
[diagram: 4 interconnected servers, 10Gbps ports]
4 dual-socket NHM-EPs
  8x 2.8GHz cores (no SMT)
  8MB L3 cache
  6x 1GB DDR3
  2 PCIe 2.0 slots (8 lanes)
  default BIOS settings
2x 10Gbps Oplin cards per server
  dual port
  PCIe 1.1
  (now using Niantic / PCIe 2.0)
-
RB4: software architecture
10Gbps ports
[diagram, per server: Linux 2.6.24 kernel with the Click runtime; the RB data plane (VLB + packet processing, i.e., the linecard functions) implemented in Click; unmodified RB device driver; 4 NICs; user space provides hooks for new services, a place for value-added services (e.g., monitoring, energy proxy, management, etc.)]
-
Click Overview
Modular, extensible software router
built on Linux as kernel module
combines versatility and high performance
Architecture consists of:
  elements that implement packet processing functions
  a configuration language that connects elements into a packet data flow
  an internal scheduler that decides which element to run
Large open source library (200+ elements) means new routing applications can often be written with just a configuration script
slide material courtesy E.Kohler, UCLA
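Click itself is C++ with its own configuration language; as a loose illustration of the element/dataflow idea, here is a toy Python analogue (not Click's actual API; all names are mine):

```python
# Toy analogue of Click's element model: each element applies one packet
# processing function and pushes the result downstream.
class Element:
    def __init__(self, fn):
        self.fn = fn
        self.next = None

    def push(self, pkt):
        out = self.fn(pkt)
        if out is not None and self.next is not None:
            self.next.push(out)

def chain(*elems):
    """Connect elements into a linear dataflow (Click configs can express
    arbitrary graphs; a chain is the simplest case)."""
    for a, b in zip(elems, elems[1:]):
        a.next = b
    return elems[0]

# Two-element pipeline: strip a 'hdr' field, then collect the packet.
counted = []
pipeline = chain(Element(lambda p: {k: v for k, v in p.items() if k != "hdr"}),
                 Element(lambda p: counted.append(p) or p))
pipeline.push({"hdr": 14, "payload": b"abc"})
```

In real Click, the equivalent wiring is done in the configuration script, which is why new routing applications often need no new code at all.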
-
RB4: software architecture
[diagram: same per-server software stack (Linux 2.6.24, kernel Click runtime, RB data plane, unmodified RB device driver, NICs, user-space hooks for value-added services)]
Intel 10G driver:
  polling-only operation (no interrupts)
  transfers packets to memory in batches of k (we use k=16)
  RSS with up to 32/64 rx/tx NIC queues
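Batching k packets per transfer matters because it amortizes fixed per-transaction overhead across the batch. A toy cost model (both cost numbers are made-up assumptions, not measurements from these slides):

```python
# Toy model: each packet costs per_pkt_ns, and each transfer transaction
# costs per_txn_ns, shared by the k packets in the batch.
def pkts_per_sec(per_pkt_ns, per_txn_ns, k):
    return 1e9 / (per_pkt_ns + per_txn_ns / k)

no_batch = pkts_per_sec(per_pkt_ns=50, per_txn_ns=400, k=1)   # 1e9/450
batched  = pkts_per_sec(per_pkt_ns=50, per_txn_ns=400, k=16)  # 1e9/75
```

Under these assumed numbers, k=16 already pushes the per-packet transaction cost from 400ns down to 25ns, which is why small-packet workloads benefit most from batching.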
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
Cluster Scalability
recall: VLB over a mesh
# servers:               N
internal fanout:         N-1
internal link capacity:  2R/(N-1)
processing/server:       3R (2R)
[diagram: full mesh of N servers; N ports, R bps per port]
-
Cluster Scalability
[log-log plot: cost in # servers (y axis, 1 to 10000) vs. number of ports (x axis, 1 to 10000), with a y=x reference line]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, adding a curve: one server scales to 20Gbps, typical fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with curves: one server scales to 20Gbps, typical fanout; 20Gbps, higher fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with curves: one server scales to 20Gbps, typical fanout; 20Gbps, higher fanout; server scales to 40Gbps + higher fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with all three curves]
Conclusions so far
(1) a VLB-based server cluster scales well and is cost-effective
(2) feasible if a single server can scale to at least 20Gbps (2R)
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
RB4: software architecture
[diagram: same per-server software stack (Linux 2.6.24, kernel Click runtime, RB data plane with VLB, RB device driver, NICs, user-space hooks for value-added services)]
Tested 3 packet processing functions (so far):
1. simple forwarding (fwd)
2. IPv4 forwarding (rtr)
3. AES-128 encryption (ipsec)
-
Test Configuration
packet processing functions:
  simple forwarding (no header processing; ~ bridging)
  IPv4 routing (longest-prefix destination lookup, 256K-entry routing table)
  AES-128 packet encryption
test traffic:
  fixed-size packets (64B-1024B)
  abilene: a real-world packet trace from the Abilene/Internet2 backbone
[diagram: traffic-generation server feeds the test server (Click runtime, RB device driver, packet processing, NICs), which feeds a traffic sink]
-
Performance versus packet size
Performance for simple forwarding under different input traffic workloads;
results in bits-per-second (top) and packets-per-second (bottom).
In all our tests, the real-world Abilene and 1024B-packet workloads achieve
similar performance; hence, from here on, we only consider two extreme
traffic workloads: 64B and 1024B packets.
-
Performance with different packet processing functions (64B, 1KB pkts)
Simple forwarding and IPv4 forwarding for (realistic) traffic workloads
with larger packets achieve ~25Gbps; limited by traffic generation due
to the # of PCIe slots.
Encryption is CPU limited.
[plots: Simple Forwarding, IPv4 Forwarding, Encrypted Forwarding]
-
Memory Loading
64B workload, NHM
"nom" and "benchmark" represent upper bounds on available memory bandwidth,
normalized by packet rate to compare with actual apps. "nom" is based on
nominal rated capacity; "benchmark" refers to empirically observed load
using a stream-like random read/write access workload.
All applications are well below the estimated upper bounds.
Per-packet memory load is constant as a function of packet rate.
[plot: per-packet memory load vs. packet rate (Mpps)]
-
QuickPath (inter-socket) Loading
64B workload, NHM
"benchmark" refers to the maximum load on the inter-socket QuickPath link
with a stream-like workload.
All applications are well below the estimated upper bound.
Per-packet inter-socket load is constant versus packet rate.
-
QuickPath (I/O) Loading
64B workload, NHM
"benchmark" refers to the maximum load on the I/O QuickPath link we have
been able to generate with a NIC.
All applications are well below the estimated upper bound.
Per-packet I/O load is constant versus packet rate.
-
Per-packet load on CPU
64B workload, NHM
application          instr/pkt (CPI)
simple forwarding    1,033   (1.19)
ipv4 forwarding      1,595   (1.01)
encryption           14,221  (0.55)
All applications reach the CPU-cycles upper bound.
CPU load is (fairly) constant as a function of packet rate.
[plot annotation: CPU saturation]
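From the table, cycles/pkt = instr/pkt x CPI, which bounds the packet rate a server can sustain. A back-of-envelope sketch (the 8 x 2.8GHz cycle budget is my extrapolation from the RB4 hardware slide, not a number given here):

```python
# instr/pkt and CPI per application, from the table above.
apps = {
    "simple forwarding": (1_033, 1.19),
    "ipv4 forwarding":   (1_595, 1.01),
    "encryption":        (14_221, 0.55),
}
cycle_budget = 8 * 2.8e9  # cycles/s across 8 cores (assumed, from RB4 specs)

# Implied max packet rate in Mpps: budget / (cycles per packet).
rates = {name: cycle_budget / (instr * cpi) / 1e6
         for name, (instr, cpi) in apps.items()}
```

This ordering (forwarding fastest, encryption ~7x more cycles per packet than simple forwarding) is consistent with the CPU-saturation result for 64B workloads.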
-
Single server scalability
Key results
(1) NHM server performance is sufficient to enable VLB clustering for realistic input traffic
(2) falls short for worst-case traffic
(3) CPUs are the bottleneck for 64B packet workloads
(4) scaling: constant per-packet load with increasing packet rate
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
Next Steps
RB prototype
control plane
additional packet processing functions
new hardware when available
management interface
reliability / robustness improvements
power
packaging
-
Thanks
http://routebricks.org
Also: see paper in SOSP 2009
-
Backups
-
Click on multicore
Each core (or HW thread) runs one instance of Click
instance is statically scheduled and pinned to the core
best performance when one core handles the entire dataflow of a packet
Click runs internal scheduler to decide which element to run
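The "one core handles the entire dataflow of a packet" rule is typically enforced by hashing each flow to a fixed core/queue, RSS-style. A minimal sketch using a CRC-based hash (illustrative only; real RSS uses a Toeplitz hash computed in NIC hardware over the packet 5-tuple):

```python
import zlib

# Hash a flow identifier to a core index, so every packet of the same flow
# lands on the same core and its whole dataflow stays core-local.
def core_for_flow(src, dst, n_cores=8):
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % n_cores
```

Because the mapping is deterministic, no per-packet state ever needs to move between cores, which avoids cache-line bouncing and cross-core locking.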
-
4-port VLB mesh, 10Gbps ports
[diagram: 4 servers, 10Gbps external ports, 5Gbps internal links]
Each server has internal fanout = 3; each server runs at avg. 20Gbps
[diagram: 8 servers, 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
-
8-port VLB mesh, server @ 20Gbps / 40Gbps
[diagram: 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
[diagram: 10Gbps external ports, 5Gbps internal links]
Each server has internal fanout = 3; each server runs at avg. 40Gbps
-
8-port VLB mesh, server @ 20Gbps
[diagram: 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
And each server has max internal fanout = 32 (1Gbps ports)
-
1000-port VLB, server @ 20Gbps
1000 servers, each w/ a 10Gbps external port
Plus (log32(1000) - 1) * 1000 servers interconnected by a
32-ary-1000-fly topology (total 2000 servers)
Each server has fanout = 32
Each internal link runs at 0.625Gbps (= 2*10/32)
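The 2000-server count above follows from a simple rule: with per-server fanout f, a full mesh works up to f+1 ports; beyond that, a butterfly needs roughly ceil(log_f N) stages of N servers each. A sketch reproducing the slide's arithmetic (the function is a hypothetical helper, not from the talk):

```python
import math

def vlb_cluster_size(n_ports, fanout):
    """Total servers for an n_ports VLB cluster with the given per-server
    fanout: a full mesh when fanout suffices, else a fanout-ary butterfly
    with ceil(log_fanout(n_ports)) stages of n_ports servers each."""
    if fanout >= n_ports - 1:
        return n_ports                     # full mesh, one server per port
    stages = math.ceil(math.log(n_ports, fanout))
    return stages * n_ports

total = vlb_cluster_size(1000, 32)         # ceil(log32(1000)) = 2 stages
link_gbps = 2 * 10 / 32                    # per-link rate, as on the slide
```

For 1000 ports and fanout 32, this gives 2 x 1000 = 2000 servers and 0.625Gbps internal links, matching the slide.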
-
More generally
Different topologies offer tradeoffs between:
  per-server forwarding capability [input, for us]
  per-server fanout (#slots/server, ports/slot) [input, for us]
  number of servers required [dominates router cost]