nerd lunch
TRANSCRIPT
-
8/3/2019 Nerd Lunch
1/51
1
RouteBricks
Scaling Software Routers with Modern Servers
Kevin Fall
Intel Labs, Berkeley
Feb 24, 2010
Ericsson, San Jose, CA
-
Project Participants
Intel Labs
  Gianluca Iannaccone (co-PI, researcher)
  Sylvia Ratnasamy (co-PI, researcher)
  Kevin Fall (principal engineer)
  Allan Knies (principal engineer)
  Maziar Manesh (research engineer)
  Eddie Kohler (Click expert)
  Dan Dahle (tech strategy)
  Badarinath Kommandur (tech strategy)
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
  Katerina Argyraki (faculty)
  Mihai Dobrescu (student)
  Diaqing Chu (student)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
-
RouteBricks: in a nutshell
A high-speed router using IA server components
fully programmable: control and data plane
extensible: evolve networks via software upgrade
incrementally scalable: flat cost per bit
-
Motivation
Network infrastructure is doing more than ever before
Packet-pushing (routing) is no longer the whole story:
security, data loss protection, application optimization, etc.
This has led to a proliferation of special appliances,
and to notions that perhaps routers could do more:
Cisco, Juniper supporting open APIs
OpenFlow consortium: Stanford, HP, Broadcom, Cisco
But these platforms weren't born programmable
-
Motivation
If flexibility ultimately implies programmability...
Hard to beat IA platforms and their ecosystem
Or price
However, must deal with persistent folklore:
"IA can't do high-speed packet processing"
But today's IA isn't the IA you know from your youth:
multicore, multiple integrated mem-controllers, PCIe, multi-Q NICs, ...
-
Motivation
Combine a desire for more programmability...
with new router-friendly server trends:
a new opportunity for IA servers?
RouteBricks: how might we
build a big (~1Tbps) IA-based software router?
-
Challenge
traditional software routers
research prototypes (2007): 1 - 2 Gbps
Vyatta* datasheet (2009): 2 - 4 Gbps
current carrier-grade routers
line speeds: 10/40Gbps
aggregate switching speeds: 40Gbps to 92Tbps!
* Other names and brands may be claimed as properties of others
-
Strategy
1. A cluster-based router architecture
each server need only scale to line speeds (10-40Gbps), rather than aggregate speeds (40Gbps to 92Tbps)
2. Understand whether modern server architectures can scale to line speeds (10-40Gbps)
if not, why?
3. Leverage open-source control plane implementations
xorp, quagga, etc. [but we focus on data plane here]
-
Broader Benefits
1. infrastructure that is well-known and cheaper to evolve
familiar programming environment
separately-evolvable network software and hardware
reduced cost -> more frequent upgrade opportunity
2. networks with the benefits of the PC ecosystem
high-volume manufacturing
widespread supply/support
state-of-the-art process technologies (ride Moore's Law)
evolving PC platform features (power mgmt, crypto, etc.)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
-
Traditional router architecture
[diagram: router with ports 1..N, each running at R bps in each direction]
N ports, per-port speed R bps
-
Traditional router architecture
[diagram: switch fabric plus switch scheduler interconnecting N linecards; each linecard does IP address lookup, queue management, and shaping, and holds address tables, FIB, and ACLs; a control processor runs IOS/quagga/xorp, etc.]
each linecard runs at R bps; the fabric runs at N*R
-
Moving to a cluster-router
[diagram: same traditional architecture (switch fabric, switch scheduler, linecards with lookup/queue-management/FIB/ACL functions, control processor running IOS/quagga/xorp, etc.)]
step 1: single server implements one port; N ports -> N servers
-
Moving to a cluster-router
step 1: single server implements one port; N ports -> N servers
[diagram: one server replaces a linecard; the linecard functions (IP address lookup, queue management, shaping, address tables, FIB, ACLs) are implemented in software]
Each server must process at least 2R traffic (in + out)
-
Moving to a cluster-router
step 2: replace the switch fabric and scheduler
with a distributed, software-based solution
[diagram: N ports at R bps; switch fabric, switch scheduler, and control processor (runs IOS/quagga/xorp, etc.)]
-
Moving to a cluster-router
step 2: replace the switch fabric and scheduler
with a distributed, software-based solution
[diagram: N ports at R bps; servers connected by a server-to-server interconnect topology; control processor (runs IOS/quagga/xorp, etc.)]
distributed scheduling algorithms, based on Valiant Load Balancing (VLB)
-
Example: VLB over a mesh* (*other topologies offer different tradeoffs)
# servers:               N
internal fanout:         N-1
internal link capacity:  2R/(N-1)  (= R*N / [N*(N-1)/2])
processing/server:       3R (2R*)  [out + in + through]
N servers can achieve switching speeds of N*R bps, provided each
server can process packets at 3R (*2R for the Direct-VLB average case)
[diagram: full mesh of N servers; N ports, R bps port rate, each direction]
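The per-server numbers in the table can be sanity-checked with a few lines (a sketch; the function name is mine, the formulas are the ones from this slide):

```python
def vlb_mesh_requirements(n_servers, r_gbps):
    """Per-server requirements for an N-server VLB full mesh."""
    fanout = n_servers - 1                    # one link to every other server
    link_gbps = 2 * r_gbps / (n_servers - 1)  # = R*N / [N*(N-1)/2]
    proc_gbps = 3 * r_gbps                    # out + in + through (2R avg for Direct-VLB)
    return fanout, link_gbps, proc_gbps

# e.g. a 4-server mesh with 10Gbps ports:
fanout, link, proc = vlb_mesh_requirements(4, 10)
```

Note that the processing requirement stays 3R (or 2R) regardless of N; only the fanout and per-link capacity change with cluster size.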
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
Next steps
-
RB4: hardware architecture
[diagram: 4 interconnected servers, 10Gbps ports]
4 dual-socket NHM-EPs
  8x 2.8GHz cores (no SMT)
  8MB L3 cache
  6x 1GB DDR3
  2 PCIe 2.0 slots (8 lanes)
  default BIOS settings
2x 10Gbps Oplin cards per server
  dual port
  PCIe 1.1
  (now using Niantic / PCIe 2.0)
-
RB4: software architecture
10Gbps ports
[diagram, per server: Linux 2.6.24 kernel with the Click runtime; the RB data plane (VLB + packet processing, i.e., the linecard functions) implemented in Click; unmodified RB device driver; 4 NICs; user space provides hooks for new services, a place for value-added services (e.g., monitoring, energy proxy, management, etc.)]
-
Click Overview
Modular, extensible software router
built on Linux as kernel module
combines versatility and high performance
Architecture consists of:
  elements that implement packet processing functions
  a configuration language that connects elements into a packet data flow
  an internal scheduler that decides which element to run
Large open source library (200+ elements) means new routing applications can often be written with just a configuration script
slide material courtesy E.Kohler, UCLA
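Click itself is C++ with its own configuration language; as a loose illustration of the element/dataflow idea, here is a toy Python analogue (not Click's actual API; all names are mine):

```python
# Toy analogue of Click's element model: each element applies one packet
# processing function and pushes the result downstream.
class Element:
    def __init__(self, fn):
        self.fn = fn
        self.next = None

    def push(self, pkt):
        out = self.fn(pkt)
        if out is not None and self.next is not None:
            self.next.push(out)

def chain(*elems):
    """Connect elements into a linear dataflow (Click configs can express
    arbitrary graphs; a chain is the simplest case)."""
    for a, b in zip(elems, elems[1:]):
        a.next = b
    return elems[0]

# Two-element pipeline: strip a 'hdr' field, then collect the packet.
counted = []
pipeline = chain(Element(lambda p: {k: v for k, v in p.items() if k != "hdr"}),
                 Element(lambda p: counted.append(p) or p))
pipeline.push({"hdr": 14, "payload": b"abc"})
```

In real Click, the equivalent wiring is done in the configuration script, which is why new routing applications often need no new code at all.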
-
RB4: software architecture
[diagram: same per-server software stack (Linux 2.6.24, kernel Click runtime, RB data plane, unmodified RB device driver, NICs, user-space hooks for value-added services)]
Intel 10G driver:
  polling-only operation (no interrupts)
  transfers packets to memory in batches of k (we use k=16)
  RSS with up to 32/64 rx/tx NIC queues
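Batching k packets per transfer matters because it amortizes fixed per-transaction overhead across the batch. A toy cost model (both cost numbers are made-up assumptions, not measurements from these slides):

```python
# Toy model: each packet costs per_pkt_ns, and each transfer transaction
# costs per_txn_ns, shared by the k packets in the batch.
def pkts_per_sec(per_pkt_ns, per_txn_ns, k):
    return 1e9 / (per_pkt_ns + per_txn_ns / k)

no_batch = pkts_per_sec(per_pkt_ns=50, per_txn_ns=400, k=1)   # 1e9/450
batched  = pkts_per_sec(per_pkt_ns=50, per_txn_ns=400, k=16)  # 1e9/75
```

Under these assumed numbers, k=16 already pushes the per-packet transaction cost from 400ns down to 25ns, which is why small-packet workloads benefit most from batching.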
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
Cluster Scalability
recall: VLB over a mesh
# servers:               N
internal fanout:         N-1
internal link capacity:  2R/(N-1)
processing/server:       3R (2R)
[diagram: full mesh of N servers; N ports, R bps per port]
-
Cluster Scalability
[log-log plot: cost in # servers (y axis, 1 to 10000) vs. number of ports (x axis, 1 to 10000), with a y=x reference line]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, adding a curve: one server scales to 20Gbps, typical fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with curves: one server scales to 20Gbps, typical fanout; 20Gbps, higher fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with curves: one server scales to 20Gbps, typical fanout; 20Gbps, higher fanout; server scales to 40Gbps + higher fanout]
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Cluster Scalability
[same log-log plot, with all three curves]
Conclusions so far
(1) a VLB-based server cluster scales well and is cost-effective
(2) feasible if a single server can scale to at least 20Gbps (2R)
10Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
RB4: software architecture
[diagram: same per-server software stack (Linux 2.6.24, kernel Click runtime, RB data plane with VLB, RB device driver, NICs, user-space hooks for value-added services)]
Tested 3 packet processing functions (so far):
1. simple forwarding (fwd)
2. IPv4 forwarding (rtr)
3. AES-128 encryption (ipsec)
-
Test Configuration
packet processing functions:
  simple forwarding (no header processing; ~ bridging)
  IPv4 routing (longest-prefix destination lookup, 256K-entry routing table)
  AES-128 packet encryption
test traffic:
  fixed-size packets (64B-1024B)
  abilene: a real-world packet trace from the Abilene/Internet2 backbone
[diagram: traffic-generation server feeds the test server (Click runtime, RB device driver, packet processing, NICs), which feeds a traffic sink]
-
Performance versus packet size
Performance for simple forwarding under different input traffic workloads;
results in bits-per-second (top) and packets-per-second (bottom).
In all our tests, the real-world Abilene and 1024B-packet workloads achieve
similar performance; hence, from here on, we only consider two extreme
traffic workloads: 64B and 1024B packets.
-
Performance with different packet processing functions (64B, 1KB pkts)
Simple forwarding and IPv4 forwarding for (realistic) traffic workloads
with larger packets achieve ~25Gbps; limited by traffic generation due
to the # of PCIe slots.
Encryption is CPU limited.
[plots: Simple Forwarding, IPv4 Forwarding, Encrypted Forwarding]
-
Memory Loading
64B workload, NHM
"nom" and "benchmark" represent upper bounds on available memory bandwidth,
normalized by packet rate to compare with actual apps. "nom" is based on
nominal rated capacity; "benchmark" refers to empirically observed load
using a stream-like random read/write access workload.
All applications are well below the estimated upper bounds.
Per-packet memory load is constant as a function of packet rate.
[plot: per-packet memory load vs. packet rate (Mpps)]
-
QuickPath (inter-socket) Loading
64B workload, NHM
"benchmark" refers to the maximum load on the inter-socket QuickPath link
with a stream-like workload.
All applications are well below the estimated upper bound.
Per-packet inter-socket load is constant versus packet rate.
-
QuickPath (I/O) Loading
64B workload, NHM
"benchmark" refers to the maximum load on the I/O QuickPath link we have
been able to generate with a NIC.
All applications are well below the estimated upper bound.
Per-packet I/O load is constant versus packet rate.
-
Per-packet load on CPU
64B workload, NHM
application          instr/pkt (CPI)
simple forwarding    1,033   (1.19)
ipv4 forwarding      1,595   (1.01)
encryption           14,221  (0.55)
All applications reach the CPU-cycles upper bound.
CPU load is (fairly) constant as a function of packet rate.
[plot annotation: CPU saturation]
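From the table, cycles/pkt = instr/pkt x CPI, which bounds the packet rate a server can sustain. A back-of-envelope sketch (the 8 x 2.8GHz cycle budget is my extrapolation from the RB4 hardware slide, not a number given here):

```python
# instr/pkt and CPI per application, from the table above.
apps = {
    "simple forwarding": (1_033, 1.19),
    "ipv4 forwarding":   (1_595, 1.01),
    "encryption":        (14_221, 0.55),
}
cycle_budget = 8 * 2.8e9  # cycles/s across 8 cores (assumed, from RB4 specs)

# Implied max packet rate in Mpps: budget / (cycles per packet).
rates = {name: cycle_budget / (instr * cpi) / 1e6
         for name, (instr, cpi) in apps.items()}
```

This ordering (forwarding fastest, encryption ~7x more cycles per packet than simple forwarding) is consistent with the CPU-saturation result for 64B workloads.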
-
Single server scalability
Key results
(1) NHM server performance is sufficient to enable VLB clustering for realistic input traffic
(2) falls short for worst-case traffic
(3) CPUs are the bottleneck for 64B packet workloads
(4) scaling: constant per-packet load with increasing packet rate
-
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
RB4 prototype
Click overview
Performance results
cluster scalability
single server scalability
Next steps
-
Next Steps
RB prototype
control plane
additional packet processing functions
new hardware when available
management interface
reliability / robustness improvements
power
packaging
-
Thanks
http://routebricks.org
Also: see paper in SOSP 2009
-
Backups
-
Click on multicore
Each core (or HW thread) runs one instance of Click
instance is statically scheduled and pinned to the core
best performance when one core handles the entire dataflow of a packet
Click runs internal scheduler to decide which element to run
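The "one core handles the entire dataflow of a packet" rule is typically enforced by hashing each flow to a fixed core/queue, RSS-style. A minimal sketch using a CRC-based hash (illustrative only; real RSS uses a Toeplitz hash computed in NIC hardware over the packet 5-tuple):

```python
import zlib

# Hash a flow identifier to a core index, so every packet of the same flow
# lands on the same core and its whole dataflow stays core-local.
def core_for_flow(src, dst, n_cores=8):
    key = f"{src}->{dst}".encode()
    return zlib.crc32(key) % n_cores
```

Because the mapping is deterministic, no per-packet state ever needs to move between cores, which avoids cache-line bouncing and cross-core locking.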
-
4-port VLB mesh, 10Gbps ports
[diagram: 4 servers, 10Gbps external ports, 5Gbps internal links]
Each server has internal fanout = 3; each server runs at avg. 20Gbps
[diagram: 8 servers, 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
-
8-port VLB mesh, server @ 20Gbps / 40Gbps
[diagram: 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
[diagram: 10Gbps external ports, 5Gbps internal links]
Each server has internal fanout = 3; each server runs at avg. 40Gbps
-
8-port VLB mesh, server @ 20Gbps
[diagram: 10Gbps external ports, 2.5Gbps internal links]
Each server has internal fanout = 7; each server runs at avg. 20Gbps
And each server has max internal fanout = 32 (1Gbps ports)
-
1000-port VLB, server @ 20Gbps
1000 servers, each w/ a 10Gbps external port
Plus (log32(1000) - 1) * 1000 servers interconnected by a
32-ary-1000-fly topology (total 2000 servers)
Each server has fanout = 32
Each internal link runs at 0.625Gbps (= 2*10/32)
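The 2000-server count above follows from a simple rule: with per-server fanout f, a full mesh works up to f+1 ports; beyond that, a butterfly needs roughly ceil(log_f N) stages of N servers each. A sketch reproducing the slide's arithmetic (the function is a hypothetical helper, not from the talk):

```python
import math

def vlb_cluster_size(n_ports, fanout):
    """Total servers for an n_ports VLB cluster with the given per-server
    fanout: a full mesh when fanout suffices, else a fanout-ary butterfly
    with ceil(log_fanout(n_ports)) stages of n_ports servers each."""
    if fanout >= n_ports - 1:
        return n_ports                     # full mesh, one server per port
    stages = math.ceil(math.log(n_ports, fanout))
    return stages * n_ports

total = vlb_cluster_size(1000, 32)         # ceil(log32(1000)) = 2 stages
link_gbps = 2 * 10 / 32                    # per-link rate, as on the slide
```

For 1000 ports and fanout 32, this gives 2 x 1000 = 2000 servers and 0.625Gbps internal links, matching the slide.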
-
More generally
Different topologies offer tradeoffs between:
  per-server forwarding capability [input, for us]
  per-server fanout (#slots/server, ports/slot) [input, for us]
  number of servers required [dominates router cost]