Understanding DPDK
A description of the techniques used to achieve high throughput on commodity hardware
How fast does the software have to work?
14.88 million 64-byte packets per second on a 10G interface
1.8 GHz -> 1 cycle = 0.55 ns
1 packet -> 67.2 ns = ~120 clock cycles
[Figure: a minimal 64-byte Ethernet frame occupies 84 bytes on the wire: IFG (12) + Preamble (8) + DST MAC, SRC MAC, Type, Payload (60) + CRC (4)]
Comparative speed values
CPU to memory speed = 6-8 GBytes/s
PCI-Express x16 speed = 5 GBytes/s
Access to RAM = 200 ns
Access to L3 cache = 4 ns
Context switch ~= 1000 ns (3.2 GHz)
Packet processing in Linux
[Diagram: NIC RX/TX queues -> kernel-space driver -> socket ring buffers -> user-space app]
Linux kernel overhead
System calls
Context switching on blocking I/O
Data copying from kernel to user space
Interrupt handling in kernel
Expense of sendto
Function       Activity                              Time (ns)
sendto         system call                           96
sosend_dgram   lock sock_buff, alloc mbuf, copy in   137
udp_output     UDP header setup                      57
ip_output      route lookup, ip header setup         198
ether_output   MAC lookup, MAC header setup          162
ixgbe_xmit     device programming                    220
Total                                                950
Packet processing with DPDK
[Diagram: the NIC's RX/TX queues are mapped through the UIO driver directly into user space; DPDK ring buffers feed the app with no kernel involvement in the data path]
Updating a register in Linux
[Diagram: ioctl() -> syscall -> VFS -> copy_from_user() -> iowrite() -> HW register]
Updating a register with DPDK
[Diagram: a plain assignment in user space writes the memory-mapped HW register directly]
What is used inside DPDK?
Processor affinity (separate cores)
Huge pages (no swap, fewer TLB misses)
UIO (no copying from kernel)
Polling (no interrupt overhead)
Lockless synchronization (avoid waiting)
Batched packet handling
SSE, NUMA awareness
Linux default scheduling
[Diagram: threads t1-t4 migrate freely across cores 0-3 under the default scheduler]
How to isolate a core for a process
To diagnose, use "top": press "f", then "j" to show each process's last-used CPU
Before boot, use the kernel parameter "isolcpus=2,4,6"
After boot, use cpuset: "cset shield -c 1-3", "cset shield -k on"
Run-to-completion model
[Diagram: Core 1 runs an RX/TX thread for Port 1 and Core 2 runs an RX/TX thread for Port 2; each packet is received, processed, and transmitted on the same core]
Pipeline model
[Diagram: an RX thread on Core 1 receives from Port 1 and passes packets through a ring to a TX thread on Core 2, which transmits on Port 2]
Linux paging model
[Diagram: page tables tree: cr3 -> Page Global Directory -> Page Middle Directory -> Page Table -> physical page in RAM; a virtual address splits into a virtual page number and an offset, and the TLB caches virtual-to-physical page translations so a hit skips the whole walk]
TLB characteristics
$ cpuid | grep -i tlb
size: 12–4,096 entries
hit time: 0.5–1 clock cycle
miss penalty: 10–100 clock cycles
miss rate: 0.01–1%
The TLB is a very scarce and expensive resource!
Solution - Hugepages
Benefit: optimized TLB usage (one entry covers far more memory), no swap
Hugepage size = 2M (1G also supported on x86_64)
Usage:
mount -t hugetlbfs nodev /mnt/huge
mmap with MAP_HUGETLB
Library - libhugetlbfs
Lockless ring design
A writer can preempt another writer and a reader
A reader cannot preempt a writer
A reader and a writer can work simultaneously on different cores
Memory barriers
CAS operations
Bulk enqueue/dequeue
Lockless ring (Single Producer)
[Diagram: (1) the producer advances prod_head to prod_next to reserve a slot, (2) writes the object and moves prod_tail up to the new head, (3) prod_head and prod_tail are equal again and the entry is visible to the consumer; cons_head/cons_tail are untouched]
Lockless ring (Single Consumer)
[Diagram: (1) the consumer advances cons_head to cons_next to claim an entry, (2) reads the object and moves cons_tail up to the new head, (3) cons_head and cons_tail are equal again and the slot is free for the producer; prod_head/prod_tail are untouched]
Lockless ring (Multiple Producers)
[Diagram: (1) two producers each compute their own prod_next; (2) each moves prod_head forward with a CAS, so the two reservations are serialized; (3)-(4) each producer writes its object into its reserved slot; (5) prod_tail is advanced in reservation order, making both entries visible to the consumer]
Kernel space network driver
[Diagram: the user-space app talks to the in-kernel IP stack; the driver exchanges data and descriptors with the NIC and handles its configuration; the NIC signals the kernel with interrupts]
UIO
"The most important devices can't be handled in user space, including, but not limited to, network interfaces and block devices." - LDD3
UIO
[Diagram: a small UIO framework driver in the kernel exposes the device through sysfs and /dev/uioX; the user-space driver mmap()s device memory and waits for interrupts with epoll() on /dev/uioX, while the app uses the user-space driver directly]
Access to device from user space
[Diagram: PCI configuration registers (Vendor Id, Device Id, Command, Status, Revision Id, ...) describe the device; BAR0-BAR5 point to its I/O and memory regions]
/sys/class/uio/uioX/maps/mapX
/sys/class/uio/uioX/portio/portX
/dev/uioX -> mmap (offset)
/sys/bus/pci/devices
DMA RX
[Diagram: the driver fills the descriptor ring in host memory with buffer addresses and updates RDT; the NIC DMAs descriptors into its RX queue, receives frames into the RX FIFO, DMAs each packet into the host buffer, and writes the descriptor back]
DMA TX
[Diagram: the driver fills the descriptor ring in host memory and updates TDT; the NIC DMAs descriptors into its TX queue, DMAs the packets into the TX FIFO, and transmits them]
Receive from SW side
[Diagram: a 6-entry descriptor ring (RDBA = 0, RDLEN = 6) with RDH = 1 and RDT = 5; the NIC owns descriptors between head and tail, and sets the DD (descriptor done) bit on descriptors it has filled with mbuf data; software polls DD, harvests mbuf1/mbuf2, refills the descriptors, and advances RDT]
Transmit from SW side
[Diagram: a 6-entry descriptor ring (TDBA = 0, TDLEN = 6) with TDH = 1 and TDT = 5; software fills descriptors with mbuf addresses and advances TDT; the NIC transmits, advances TDH, and sets DD in completed descriptors so software can reclaim mbuf1/mbuf2]
NUMA
[Diagram: two sockets, each with a CPU (cores + memory controller), local memory, an I/O controller, and PCI-E lanes, connected by QPI; access to the remote socket's memory or devices crosses the interconnect]
RSS (Receive Side Scaling)
[Diagram: a hash function over incoming traffic indexes an indirection table that spreads packets across queues 0..N, each served by its own CPU]
Flow director
[Diagram: incoming traffic is matched against a filter table (populated alongside the hash function, e.g. from observed outgoing traffic); matching packets are routed to a specific queue/CPU or dropped]
Virtualization - SR-IOV
[Diagram: the NIC exposes a physical function (PF) managed by the PF driver in the VMM, and virtual functions (VFs) assigned to VM1 and VM2 with their own VF drivers; a virtual bridge inside the NIC switches traffic between the VFs and the PF]
Slow path using bifurcated driver
[Diagram: the NIC's filter table splits traffic: selected flows go to the DPDK-owned VF, the rest go through the PF to the kernel; a virtual bridge in the NIC connects the two paths]
Slow path using TAP
[Diagram: DPDK owns the NIC RX/TX queues and ring buffers; slow-path packets are injected through a TAP device into the kernel TCP/IP stack and back]
Slow path using KNI
[Diagram: DPDK owns the NIC RX/TX queues and ring buffers; slow-path packets reach the kernel TCP/IP stack through a KNI (Kernel NIC Interface) device]
Application 1 - Traffic generator
[Diagram: a user-space streams generator and traffic analyzer on x86 HW, connected to the DUT]
Application 2 - Router
[Diagram: a user-space routing table cache, backed by the kernel routing table, on x86 HW forwarding between DUT1 and DUT2]
Application 3 - Middlebox
[Diagram: a user-space DPI application on x86 HW between DUT1 and DUT2]
References
Device Drivers in User Space
Userspace I/O drivers in a realtime context
The Userspace I/O HOWTO
The anatomy of a PCI/PCI Express kernel driver
From Intel® Data Plane Development Kit to Wind River Network Acceleration Platform
DPDK Design Tips (Part 1 - RSS)
Getting the Best of Both Worlds with Queue Splitting (Bifurcated Driver)
Design considerations for efficient network applications with Intel® multi-core processor-based systems on Linux
Introduction to Intel Ethernet Flow Director