Cilium: Fast IPv6 Container Networking with BPF and XDP
LinuxCon 2016, Toronto
Thomas Graf (@tgraf__)
Kernel, Cilium & Open vSwitch Team
Noiro Networks (Cisco)
The Cilium Experiment
Scale
– Addressing: IPv6?
– Policy: Linear lists don’t scale. Alternative?
Extensibility
– Can we be as extensible as userspace networking in the kernel?
Simplicity
– What is an appropriate abstraction away from traditional networking?
Performance
– Do we sacrifice performance in the process?
Scaling Addressing
Solution:
– IPv6 addresses with host scope allocator
Pros:
– Everything is globally addressable
– No NAT
– Path to ILA for mobility of tasks
Cons:
– Legacy IPv4 only endpoints/applications
→ Optional IPv4 addressing (+ NAT)
→ NAT46: Provide IPv6 only applications to IPv4 only clients
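The host scope allocator can be sketched as follows: each host owns an IPv6 prefix and hands out container addresses from it locally, so no cluster-wide coordination is needed per container. This is only an illustration; the class name, the prefix (taken from the 2001:db8::/32 documentation range), and the reuse strategy are assumptions, not Cilium's implementation.

```python
import ipaddress


class HostScopeAllocator:
    """Hands out container addresses from a prefix owned by a single host.

    Hypothetical sketch: names and reuse strategy are illustrative only.
    """

    def __init__(self, host_prefix: str) -> None:
        self.network = ipaddress.IPv6Network(host_prefix)
        self.next_offset = 1   # offset 0 (the prefix address itself) stays unused
        self.released = []     # offsets returned by stopped containers

    def allocate(self) -> ipaddress.IPv6Address:
        """Return a free address, preferring previously released ones."""
        if self.released:
            offset = self.released.pop()
        elif self.next_offset < self.network.num_addresses:
            offset = self.next_offset
            self.next_offset += 1
        else:
            raise RuntimeError("host prefix exhausted")
        return self.network[offset]

    def release(self, address: ipaddress.IPv6Address) -> None:
        """Give an address back so a later container can reuse it."""
        if address not in self.network:
            raise ValueError("address does not belong to this host")
        base = int(self.network.network_address)
        self.released.append(int(address) - base)
```

Because the prefix identifies the host, routing to a container only needs one route per host rather than one per container.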
IPv6 Status in Kubernetes/Docker
● Kubernetes (CNI): Almost there
– Pods are IPv6-only capable as of k8s 1.3.6 (PR23317, PR26438, PR26439, PR26441)
– Kubeproxy (services) not done yet
● Docker (libnetwork): Working on it
– PR826 - “Make IPv6 Great Again” (not merged yet)
Scaling Policy
[Diagram: LB, FE (Frontend), and BE (Backend) containers]
Policy:
– LB → FE
– FE → BE
NetworkPolicy: Kubernetes policy spec as discussed and standardized in the Networking SIG
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/network-policy.md
Scaling Policy
[Diagram: containers LB QA, FE QA, BE QA and LB Prod, FE Prod, BE Prod; labels LB, Frontend, Backend, QA, Prod]
Policy:
– LB → FE
– FE → BE
Scaling Policy
[Diagram: containers LB QA, FE QA, BE QA and LB Prod, FE Prod, BE Prod; labels LB, Frontend, Backend, QA, Prod]
Policy:
– LB → FE
– FE → BE
– QA requires QA
– Prod requires Prod
“requires” is a Cilium extension, not yet part of the Kubernetes spec
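A minimal sketch of how such a label-based policy might be evaluated. The data layout and the reading of “requires” (a destination carrying a label admits only sources that carry the required labels) are assumptions made for illustration; the slide does not define the exact semantics.

```python
# Allow rules as (source label, destination label) pairs, as on the slide.
ALLOW = {("LB", "FE"), ("FE", "BE")}

# Hypothetical encoding of "requires": if the destination carries the key
# label, the source must carry every label in the value set.
REQUIRES = {"QA": {"QA"}, "Prod": {"Prod"}}


def allowed(src_labels: set, dst_labels: set) -> bool:
    """Return True if a container with src_labels may talk to dst_labels."""
    # Step 1: every "requires" constraint triggered by the destination
    # must be satisfied by the source.
    for label in dst_labels:
        if not REQUIRES.get(label, set()) <= src_labels:
            return False
    # Step 2: at least one allow rule must match a source/destination pair.
    return any((s, d) in ALLOW for s in src_labels for d in dst_labels)
```

Under these assumptions, `allowed({"LB", "QA"}, {"FE", "QA"})` holds while `allowed({"LB", "QA"}, {"FE", "Prod"})` does not, and no rule mentions an address.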
Scaling Policy Enforcement
Policy:
– LB → FE
– FE → BE
– QA requires QA
– Prod requires Prod
Distributed Label ID Table:
– Each label combination (LB QA, FE QA, BE QA, LB Prod, FE Prod, BE Prod) maps to one numeric ID, 10 to 15 on the slide (FE Prod = 14, BE Prod = 15)
This ID is carried in the packet as metadata to provide security context at the destination host.
Policy enforcement cost becomes a single hashtable lookup regardless of the number of containers or policy complexity.
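The split between policy resolution and per-packet enforcement can be sketched like this. All names are hypothetical, the ID numbering simply follows insertion order, and `example_allowed` is a simplified stand-in for the policy on the slide; the point is only that the per-packet step is one set lookup on the ID carried in the packet.

```python
class LabelIDTable:
    """Maps each distinct label set to one small integer ID (hypothetical)."""

    def __init__(self, first_id: int = 10) -> None:
        self._ids = {}
        self._next_id = first_id

    def id_for(self, labels) -> int:
        key = frozenset(labels)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


def build_policy_map(table, dst, candidates, allowed) -> set:
    """Slow path, run when policy or labels change: resolve rules into IDs."""
    return {table.id_for(src) for src in candidates if allowed(set(src), set(dst))}


def enforce(policy_map: set, src_id: int) -> bool:
    """Fast path, run per packet: a single hash lookup on the source ID."""
    return src_id in policy_map


def example_allowed(src: set, dst: set) -> bool:
    """Simplified stand-in policy: LB -> FE, FE -> BE, same environment only."""
    same_env = ("QA" in src) == ("QA" in dst)
    pair_ok = ("LB" in src and "FE" in dst) or ("FE" in src and "BE" in dst)
    return same_env and pair_ok


table = LabelIDTable()
label_sets = [{"LB", "QA"}, {"FE", "QA"}, {"BE", "QA"},
              {"LB", "Prod"}, {"FE", "Prod"}, {"BE", "Prod"}]
ids = [table.id_for(s) for s in label_sets]
# Policy map for the "BE Prod" container, computed once, consulted per packet.
be_prod_map = build_policy_map(table, {"BE", "Prod"}, label_sets, example_allowed)
```

The cost of `enforce` does not depend on how many containers share a label set or on how many rules went into `build_policy_map`.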
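BPF – Berkeley Packet Filter
[Diagram: in userspace, source code is compiled by LLVM/clang into bytecode. In the kernel, the bytecode passes the verifier + JIT, which emits native instructions (add eax,edx; shl eax,2), and is attached at TC ingress and TC egress between the netdevices and the network stack / sockets.]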
BPF Maps & Perf Ring Buffer
[Diagram: BPF programs in the kernel and userspace processes exchange data through BPF maps (hashtable, array) and the perf ring buffer; BPF programs chain to each other via tail calls.]
BPF Features (As of Aug 2016)
● Efficient data sharing via maps
– Per-CPU/global arrays & hashtables
● Rewrite packet content
● Extend/trim packet size
● Redirect to other net_device
● Attachment of tunnel metadata
● Cgroups integration
● Access to high performance perf ring buffer
● …
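To illustrate why per-CPU maps make data sharing cheap, here is a small userspace model, not kernel code. Each CPU updates only its own slot, so the update path needs no locking, and the reader sums the slots when it wants a total. The class and its methods are hypothetical and only mimic the idea behind a per-CPU array map.

```python
class PerCPUArray:
    """Userspace model of a per-CPU array map (illustrative only)."""

    def __init__(self, size: int, num_cpus: int) -> None:
        # One private copy of the array per CPU.
        self.slots = [[0] * size for _ in range(num_cpus)]

    def add(self, cpu: int, index: int, value: int = 1) -> None:
        """Update path: touches only the calling CPU's copy."""
        self.slots[cpu][index] += value

    def read_total(self, index: int) -> int:
        """Read path: the reader aggregates across all CPUs."""
        return sum(cpu_slots[index] for cpu_slots in self.slots)


# Example: slot 0 counts forwarded packets, slot 1 counts dropped packets.
counters = PerCPUArray(size=2, num_cpus=4)
counters.add(cpu=0, index=0)
counters.add(cpu=3, index=0, value=2)
counters.add(cpu=1, index=1)
```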
XDP – Express Data Path
[Diagram: in userspace, source code is compiled by LLVM/clang into bytecode. After the verifier + JIT, the program runs in the driver with access to the DMA buffer, below the netdevice, network stack, and sockets.]
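The shape of an XDP program can be modeled in a few lines: it receives the raw frame, checks bounds before every access (the in-kernel verifier insists on this), and returns a verdict. The sketch below is a Python model rather than a loadable program, and the "IPv6 only" rule is an arbitrary example; the verdict values match the kernel's XDP action codes.

```python
import struct

# XDP verdicts with the numeric values used by the kernel.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX = 0, 1, 2, 3

ETH_HLEN = 14          # destination MAC (6) + source MAC (6) + EtherType (2)
ETH_P_IPV6 = 0x86DD    # EtherType for IPv6


def xdp_ipv6_only(frame: bytes) -> int:
    """Model of a driver-level program: pass IPv6 frames, drop the rest."""
    # Bounds check first, mirroring what the verifier requires.
    if len(frame) < ETH_HLEN:
        return XDP_DROP
    # EtherType sits at byte offset 12, in network byte order.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return XDP_PASS if ethertype == ETH_P_IPV6 else XDP_DROP
```

Dropping at this point avoids the cost of building packet metadata and traversing the network stack for frames that would be discarded anyway.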
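Cilium Architecture
[Diagram: orchestration systems connect to the Cilium layer through plugins. In userspace, the Cilium daemon works with the policy repository and performs code generation, alongside the Cilium CLI and the Cilium monitor. The daemon injects bytecode into the kernel, where BPF programs (conntrack, policy) run, including one on eth0; events flow back from the kernel to userspace.]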
Why is this awesome?
On the fly BPF program generation means:
● Extensibility of userspace networking in the kernel
● MAC, IP, port number, … all become constants → compiler can optimize heavily!
● BPF programs can be recompiled and replaced without interrupting the container and its connections
– Features can be compiled in/out at runtime with container granularity
● Access to fast BPF maps and perf ring buffer to interact with userspace.
– Drop monitor in n*Mpps context
– Use notifications for policy learning, IDS, logging, ...
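On the fly generation can be pictured as the daemon writing a small per-container header full of constants, which is then compiled together with the shared BPF program source. The sketch below only produces the header text; the macro names and the function are hypothetical and do not reflect Cilium's actual generated code.

```python
import ipaddress


def c_array(data: bytes) -> str:
    """Render bytes as a C initializer list, e.g. { 0x02, 0x00 }."""
    return "{ " + ", ".join(f"0x{b:02x}" for b in data) + " }"


def generate_header(mac: str, ipv6: str, label_id: int) -> str:
    """Build the text of a per-container header (hypothetical macro names)."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must have 6 bytes")
    ip_bytes = ipaddress.IPv6Address(ipv6).packed  # 16 bytes, network order
    return "\n".join([
        "/* Generated per container; regenerate instead of editing. */",
        f"#define CONTAINER_MAC {c_array(mac_bytes)}",
        f"#define CONTAINER_IP {c_array(ip_bytes)}",
        f"#define CONTAINER_LABEL_ID {label_id}",
        "",
    ])
```

Since every value is a compile-time constant, comparisons against the container's own MAC or address can be folded into immediates by the compiler instead of being loaded from a map at runtime.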
Available Building Blocks
● L3 forwarding (IPv6 & IPv4)
● Host connectivity
● Encapsulation (VXLAN/Geneve/GRE)
● ICMPv6 generation
● NDisc & ARP responder
● Access Control
Currently working on:
● Fragmentation handling
● Mobility
● Port Mapping (TCP/UDP)
● Connection tracking
● L3/L4 Load Balancer
● Statistics
● Events (perf ring buffer)
● Debugging framework
● NAT46
● End to end encryption
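For the NAT46 item, one common way to represent an IPv4 only client inside an IPv6 only network is to embed its address in an IPv6 prefix. The sketch below uses the well-known prefix 64:ff9b::/96 from RFC 6052; whether Cilium uses this prefix or a different mapping is an assumption, as the slides do not say.

```python
import ipaddress

# Well-known NAT64/NAT46 prefix from RFC 6052 (assumed here for illustration).
NAT46_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def v4_to_v6(v4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the translation prefix."""
    base = int(NAT46_PREFIX.network_address)
    return ipaddress.IPv6Address(base | int(ipaddress.IPv4Address(v4)))


def v6_to_v4(v6: str) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address for the reply direction."""
    addr = ipaddress.IPv6Address(v6)
    if addr not in NAT46_PREFIX:
        raise ValueError("address is not inside the translation prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
```

With this mapping the IPv6 only application sees the client as an ordinary IPv6 peer, and the translation is stateless in both directions.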
Simplicity
● L3 only (Calico gets this right)
– No L2 scaling issues, no broadcast domains, no L2 vulnerabilities
● No “Networks”
– No need for containers to join multiple networks to access multiple isolation domains. No need for multiple addresses.
● Policy definition independent of addressing
– As specified in Kubernetes Networking SIG
– All policies based on container labels
Performance
[Chart: Container to container on local node. X axis: # Cores (1 to 22); Y axis: Gbit (0 to 600)]
netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e
1 TCP flow per core, 10’000 policies
Intel Xeon 3.5 GHz Sandy Bridge, 24 cores
Performance
[Chart: Container to container over 10 Gbit NICs. X axis: # Cores (1 to 22); Y axis: Mbit (0 to 10000); one series each for 64, 128, 256, 512, 1024, 64000]
netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e
1 TCP flow per core, 10’000 policies
Intel Xeon 3.5 GHz Sandy Bridge, 24 cores
Q&A
Image Sources:
● Cover (Toronto): Rick Harris (https://www.flickr.com/photos/rickharris/)
● The Invisible Man: Dr. Azzacov (https://www.flickr.com/photos/drazzacov/)
Start hacking with BPF for containers: http://github.com/cilium/cilium
Contact:
Slack: cilium.slack.com
Twitter: @tgraf__
Mail: [email protected]
Team:
● André Martins
● Daniel Borkmann
● Madhu Challa
● Thomas Graf