Cilium: Fast IPv6 Container Networking with BPF and XDP

LinuxCon 2016, Toronto

Thomas Graf (@tgraf__), Kernel, Cilium & Open vSwitch Team, Noiro Networks (Cisco)

Upload: thomas-graf

Post on 06-Jan-2017



The Cilium Experiment

Scale

– Addressing: IPv6?

– Policy: Linear lists don’t scale. Alternative?

Extensibility

– Can we be as extensible as userspace networking in the kernel?

Simplicity

– What is an appropriate abstraction away from traditional networking?

Performance

– Do we sacrifice performance in the process?

Scaling Addressing

Solution:

– IPv6 addresses with host scope allocator

Pros:

– Everything is globally addressable

– No NAT

– Path to ILA for mobility of tasks

Cons:

– Legacy IPv4-only endpoints/applications

→ Optional IPv4 addressing (+ NAT)

→ NAT46: Provide IPv6-only applications to IPv4-only clients
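A minimal sketch of what a host scope allocator could look like: each host owns one IPv6 prefix and hands out container addresses from it locally, so no cluster-wide coordination is needed per container. The class name, the /112 prefix size and the 2001:db8:: documentation prefix are assumptions for illustration, not taken from Cilium.

```python
import ipaddress


class HostScopeAllocator:
    """Hands out container IPv6 addresses from a prefix owned by one host.

    Hypothetical helper: a simple bump allocator that never reuses
    released addresses.
    """

    def __init__(self, host_prefix: str):
        self.network = ipaddress.IPv6Network(host_prefix)
        self._next = 1        # skip the all-zero interface identifier
        self._in_use = {}     # container name -> allocated address

    def allocate(self, container: str) -> ipaddress.IPv6Address:
        # Allocation is idempotent per container name.
        if container in self._in_use:
            return self._in_use[container]
        if self._next >= self.network.num_addresses:
            raise RuntimeError("host prefix exhausted")
        addr = self.network[self._next]
        self._next += 1
        self._in_use[container] = addr
        return addr

    def release(self, container: str) -> None:
        self._in_use.pop(container, None)


allocator = HostScopeAllocator("2001:db8:0:1::/112")  # assumed per-host prefix
web = allocator.allocate("web")
db = allocator.allocate("db")
```

Because every address falls inside the host's prefix, routing between hosts only needs one route per host rather than one per container.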

IPv6 Status in Kubernetes/Docker

● Kubernetes (CNI): Almost there

– Pods are IPv6-only capable as of k8s 1.3.6 (PR23317, PR26438, PR26439, PR26441)

– Kubeproxy (services) not done yet

● Docker (libnetwork): Working on it

– PR826 - “Make IPv6 Great Again” (not merged yet)

Scaling Policy

[Diagram: LB, Frontend, Backend]

Policy pairs: LB and FE, FE and BE

NetworkPolicy: Kubernetes policy spec as discussed and standardized in the Networking SIG

https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/network-policy.md

Scaling Policy

[Diagram: LB, Frontend and Backend, each present in QA and Prod (LB QA, FE QA, BE QA, LB Prod, FE Prod, BE Prod)]

Policy pairs: LB and FE, FE and BE

Scaling Policy

[Diagram: the same six endpoints, with "requires" rules attached to QA and Prod]

Policy pairs: LB and FE, FE and BE, plus two "requires" rules for QA and Prod

Cilium extension: "requires" is not yet part of the Kubernetes spec

Scaling Policy Enforcement

[Diagram: policy pairs LB and FE, FE and BE, plus "requires" rules for QA and Prod]

Distributed Label ID Table: the label sets LB QA, FE QA, BE QA, LB Prod, FE Prod and BE Prod are assigned the IDs 10 to 15 (FE Prod: 14, BE Prod: 15).

This ID is carried in the packet as metadata to provide security context at the destination host.

Policy enforcement cost becomes a single hashtable lookup regardless of number of containers or policy complexity.
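The label ID idea can be sketched roughly as follows. All policy rules are expanded once into a set of allowed (source ID, destination ID) pairs, so the per-packet check is a single hash lookup. Only the IDs 14 and 15 are legible on the slide; the other ID assignments, the direction of the LB/FE and FE/BE pairs, and the reading of "requires" as "same environment" are assumptions made for this example.

```python
# (role, environment) -> label ID. IDs 10-13 are assumed; 14 and 15 are from the slide.
LABEL_IDS = {
    ("LB", "QA"): 10,
    ("FE", "QA"): 11,
    ("BE", "QA"): 12,
    ("LB", "Prod"): 13,
    ("FE", "Prod"): 14,
    ("BE", "Prod"): 15,
}

# Assumed direction: LB may talk to FE, FE may talk to BE.
ALLOWED_ROLES = {("LB", "FE"), ("FE", "BE")}


def build_policy_table(label_ids, allowed_roles):
    """Expand the rules into every allowed (source ID, destination ID) pair."""
    table = set()
    for (src_role, src_env), src_id in label_ids.items():
        for (dst_role, dst_env), dst_id in label_ids.items():
            role_ok = (src_role, dst_role) in allowed_roles
            env_ok = src_env == dst_env  # assumed meaning of "requires"
            if role_ok and env_ok:
                table.add((src_id, dst_id))
    return frozenset(table)


POLICY_TABLE = build_policy_table(LABEL_IDS, ALLOWED_ROLES)


def allow(src_id: int, dst_id: int) -> bool:
    """Per-packet check: one hash lookup keyed by the two label IDs."""
    return (src_id, dst_id) in POLICY_TABLE
```

The expensive part (expanding the rules) runs only when labels or policy change. The destination host only needs the source ID carried in the packet and its own local ID to decide.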

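Extensibility

BPF – Berkeley Packet Filter

[Diagram: in userspace, source code is compiled by LLVM/clang to byte code; in the kernel, the byte code passes through the verifier + JIT (example output: add eax,edx; shl eax,2) and runs at TC ingress and TC egress, between the netdevices and the network stack with its sockets]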

BPF Maps & Perf Ring Buffer

[Diagram: BPF programs in the kernel and processes in userspace share data through BPF maps (hashtable, array) and the perf ring buffer; BPF programs hand over to other BPF programs via tail calls]

BPF Features (As of Aug 2016)

● Efficient data sharing via maps

– Per-CPU/global arrays & hashtables

● Rewrite packet content

● Extend/trim packet size

● Redirect to other net_device

● Attachment of tunnel metadata

● Cgroups integration

● Access to high performance perf ring buffer

● …

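XDP – Express Data Path

[Diagram: same toolchain as for BPF (source code, LLVM/clang, byte code, verifier + JIT; example output: add eax,edx; shl eax,2), but the program runs in the driver with access to the DMA buffer, ahead of the netdevice, network stack and sockets]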

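Cilium Architecture

[Diagram: orchestration systems connect through plugins; the Cilium daemon holds the policy repository, performs code generation and injects bytecode into the kernel; the Cilium CLI and the Cilium monitor (events) sit alongside it; in the kernel, the Cilium layer consists of BPF programs with conntrack and policy, plus a BPF program at eth0]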

Why is this awesome?

On the fly BPF program generation means:

● Extensibility of userspace networking in the kernel

● MAC, IP, port number, … all become constants → compiler can optimize heavily!

● BPF programs can be recompiled and replaced without interrupting the container and its connections

– Features can be compiled in/out at runtime with container granularity

● Access to fast BPF maps and perf ring buffer to interact with userspace.

– Drop monitor in n*Mpps context

– Use notifications for policy learning, IDS, logging, ...
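To illustrate the "everything becomes a constant" point, here is a rough sketch of per-container code generation. A small generator emits a C header in which the container's MAC, IPv6 address, label ID and enabled features are preprocessor constants, which a compiler such as clang could then fold into the program. The function and macro names (render_endpoint_header, ENDPOINT_MAC, ENABLE_...) are hypothetical and not Cilium's actual ones.

```python
import ipaddress


def c_array(data: bytes) -> str:
    """Format bytes as a C initializer list, e.g. { 0x02, 0x00 }."""
    return "{ " + ", ".join(f"0x{b:02x}" for b in data) + " }"


def render_endpoint_header(mac, ipv6, label_id, features=()):
    """Return C header text with this container's values as constants."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must have 6 bytes")
    ip_bytes = ipaddress.IPv6Address(ipv6).packed  # 16 bytes, network order

    lines = [
        "/* Generated per container; macro names are hypothetical. */",
        f"#define ENDPOINT_MAC {c_array(mac_bytes)}",
        f"#define ENDPOINT_IPV6 {c_array(ip_bytes)}",
        f"#define ENDPOINT_LABEL_ID {label_id}",
    ]
    # Features are compiled in by defining a flag; leaving one out compiles it out.
    lines += [f"#define ENABLE_{name.upper()} 1" for name in sorted(features)]
    return "\n".join(lines) + "\n"


header = render_endpoint_header(
    "02:00:00:00:00:01", "2001:db8::1", 11, features=["nat46"]
)
```

Regenerating this header and recompiling yields a new program for one container only, which matches the container granularity described above.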

Available Building Blocks

● L3 forwarding (IPv6 & IPv4)

● Host connectivity

● Encapsulation (VXLAN/Geneve/GRE)

● ICMPv6 generation

● NDisc & ARP responder

● Access Control

Currently working on:

● Fragmentation handling

● Mobility

● Port Mapping (TCP/UDP)

● Connection tracking

● L3/L4 Load Balancer

● Statistics

● Events (perf ring buffer)

● Debugging framework

● NAT46

● End to end encryption
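For the NAT46 item, the address-mapping half of the problem can be sketched as below: an IPv4 client address is embedded in the low 32 bits of an IPv6 /96 prefix so that an IPv6-only application sees an IPv6 peer. The sketch uses the RFC 6052 well-known prefix 64:ff9b::/96; whether Cilium uses this prefix is an assumption, and header translation itself is not shown.

```python
import ipaddress

# RFC 6052 well-known prefix; using it here is an assumption for illustration.
NAT_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def v4_to_v6(v4: str, prefix=NAT_PREFIX) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of a /96 IPv6 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4_int = int(ipaddress.IPv4Address(v4))
    return ipaddress.IPv6Address(int(prefix.network_address) | v4_int)


def v6_to_v4(v6: str, prefix=NAT_PREFIX) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address for the return path."""
    addr = ipaddress.IPv6Address(v6)
    if addr not in prefix:
        raise ValueError("address is outside the NAT prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


mapped = v4_to_v6("192.0.2.33")
restored = v6_to_v4(str(mapped))
```

The mapping is stateless in both directions, so no per-connection table is needed for the address part of the translation.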

Networking should be invisible, it is not.

Simplicity


● L3 only (Calico gets this right)

– No L2 scaling issues, no broadcast domains, no L2 vulnerabilities

● No “Networks”

– No need for containers to join multiple networks to access multiple isolation domains. No need for multiple addresses.

● Policy definition independent of addressing

– As specified in Kubernetes Networking SIG

– All policies based on container labels

Performance

[Chart: Container to container on local node. x-axis: # Cores (1 to 22); y-axis: Gbit (0 to 600)]

netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e

1 TCP flow per core, 10’000 policies

Intel Xeon 3.5 GHz Sandy Bridge, 24 cores

Performance

[Chart: Container to container over 10GiB NICs. x-axis: # Cores (1 to 22); y-axis: MBit (0 to 10000); series: 64, 128, 256, 512, 1024, 64000]

netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e

1 TCP flow per core, 10’000 policies

Intel Xeon 3.5 GHz Sandy Bridge, 24 cores

<Insert Cool Demo Here>

Q&A

Image Sources:

● Cover (Toronto): Rick Harris (https://www.flickr.com/photos/rickharris/)

● The Invisible Man: Dr. Azzacov (https://www.flickr.com/photos/drazzacov/)

Start hacking with BPF for containers: http://github.com/cilium/cilium

Contact:

Slack: cilium.slack.com

Twitter: @tgraf__ Mail: [email protected]

Team:

● André Martins

● Daniel Borkmann

● Madhu Challa

● Thomas Graf