TM
Low latency edge computing with QEMU/KVM: Challenges and future
August, 2015
Mihai Caraman, PhD | Virtualization Architect
TM
Mihai Caraman, August 2015 2
Agenda
• 5G Requirements
− Base Station Virtualization
• RT KVM
− Embedded PA and ARMv8
• I/O Virtualization
− Direct Assignment
− Virtio
• vSwitch Offload
• OpenStack Integration
• Tuning & Performance
• Future
5G Requirements
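(Figure: end-to-end latency path across UE, eNB, Metro, EPC, Core network, and Server)
− Per-segment latencies shown: 5 ms, 1 ms, 12 ms, 20 ms, 20 ms
− End-to-end target: < 80 ms
− PDCP crypto, IPSec
• Air interface enhancements
• Flexible deployment (small cell, macro, CRAN) through SDN and NFV
Source: METIS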
Base Station Virtualization
Proprietary
• BS appliance
− L3/control plane
− Transport
− L2/QoS
− PDCP
− MAC/RLC
− L1
• On-chip / on-board interconnect (AXI, sRIO, PCIe)
• L1 (DSP + accelerator)
Virtualized
• ARMv8 compute farm
− vAppliance: L3/control plane
− vAppliance: PDCP
− vAppliance: MAC/RLC
• iNIC (on-chip or off-chip)
− L2/QoS
− Transport
• SEC
• On-chip / on-board interconnect (AXI, PCIe)
• L1 (DSP + accelerator)
− L1
− MAC/RLC
• 10+ GbE (on-board or physically remote)
• Cloud
− OpenStack
− Transport control (SDN)
Virtual Base Station - PoC
• Scenarios
− L2/L3 stack in VM, end-to-end video download using commercial LTE dongle
− PDCP scaling
Platform Phase I
• QorIQ T4240
− PA Book III-E
− Security Engine
• QorIQ T4240 PCI SR-IOV intelligent NIC
Platform Phase II
• QorIQ LS2085A
− ARMv8
− DPAA2 Devices with Management Complex (MC) configuration bus
− Advanced I/O Processor (AIOP) integrated NPU
Virtualization requirements
Timing, latency requirements
• Transmission Time Interval (TTI)
− Synchronized between L2 and L1 at 1 ms
− Provisioned through GPS, IEEE1588/PTP (interrupt for demo purposes)
− L1 with a 1 Gbps interface adds 150 µs latency; with 10 Gbps, 15 µs
− TTI IRQ delivered to guest user space application: ~50 µs max latency
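The two link figures scale exactly with line rate, which is what pure serialization delay would look like. A minimal sketch of that arithmetic follows, assuming the quoted latency is serialization time only (the slide does not say what it consists of). The names `serialization_delay_us`, `PAYLOAD_BYTES`, and `remaining_budget_us` are illustrative, and the budget subtraction is a rough estimate rather than a measured figure.

```python
# Assumption: the quoted link latency is pure serialization delay,
# i.e. time = bits / line rate. All names here are illustrative.

def serialization_delay_us(payload_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to clock payload_bytes onto a link of link_gbps."""
    bits = payload_bytes * 8
    # A 1 Gbps link moves 1000 bits per microsecond.
    return bits / (link_gbps * 1000.0)

# 150 us at 1 Gbps corresponds to 150,000 bits, i.e. 18,750 bytes.
PAYLOAD_BYTES = 18_750

TTI_US = 1000.0          # 1 ms TTI, from the slide
IRQ_DELIVERY_US = 50.0   # ~50 us max TTI IRQ delivery latency, from the slide

def remaining_budget_us(link_gbps: float) -> float:
    """Rough TTI time left after the link transfer and worst-case IRQ delivery."""
    return TTI_US - serialization_delay_us(PAYLOAD_BYTES, link_gbps) - IRQ_DELIVERY_US

if __name__ == "__main__":
    for gbps in (1.0, 10.0):
        print(gbps, serialization_delay_us(PAYLOAD_BYTES, gbps), remaining_budget_us(gbps))
```

Under this assumption, moving from 1 Gbps to 10 Gbps returns 135 µs of the 1 ms TTI to the L2 application.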
KVM
− RT Linux guest
− One RT vcpu per cpu
− CPU oversubscription (nice to have)
− TTI IRQ
I/O Virtualization
− Direct assignment
− Virtio
OpenStack Integration
(Figure: vBTS in user space (US) of an RT Linux guest on RT KVM, with SEC and Eth devices, the TTI IRQ, and iNIC/NPU packet processing.)
RT KVM - PA Book III-E
MPIC emulation
• Replaced spinlocks with raw spinlocks, because the PREEMPT_RT spinlock implementation uses a sleeping mutex
• RT-friendly refactoring (to do):
− Increase lock granularity
− Minimize interrupt-disabled code paths
− Avoid races with (lazy) pending exceptions
Timer
• Moved the processing of the KVM decrementer from softirq (ksoftirqd kthread) to hardirq context
• Replaced wait queues with simple wait queues, because wait queue callbacks prevent the use of raw spinlocks
• Removed unnecessary tasklet used by the hrtimer
RT KVM - Timer Interrupt Latency
(Figure: two timelines, Initial and RT. Each shows low-priority and high-priority activity, the HWI, the Timer ISR, IPI/Schedule of the vcpu, and Cyclictest running against Stress load with interrupts (Int).)
RT KVM - TTI IRQ & Latency Tracer
TTI IRQ
• GPIO assigned
− Fast path delivery
• GPIO interrupt affinity
• Extensive host and guest debug statistics
Latency Tracer
• The tracer combines two of the available ftrace modes of operation:
− function-trace / function-graph
− max latency: retain the maximum latency of this execution chain
• Expose the maximum execution latency for injecting the TTI interrupt into the guest, and the associated code path.
• Used to analyze the causes for the TTI interrupt delivery latency
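The "max latency" idea above can be sketched in a few lines: keep only the slowest observed execution chain together with the code path that produced it. This is an illustrative user-space model, not the tracer from the talk; `MaxLatencyTracer` and the function names in the sample paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class MaxLatencyTracer:
    """Retains the worst-case latency and its code path (hypothetical helper)."""
    max_latency_ns: int = -1
    max_path: Tuple[str, ...] = ()

    def record(self, path: Sequence[str], latency_ns: int) -> bool:
        """Store the sample only if it is a new maximum; report whether it was."""
        if latency_ns > self.max_latency_ns:
            self.max_latency_ns = latency_ns
            self.max_path = tuple(path)
            return True
        return False

# Hypothetical samples: each is (call chain, measured latency in ns).
tracer = MaxLatencyTracer()
tracer.record(["gpio_irq", "kvm_inject", "vcpu_enter"], 12_000)
tracer.record(["gpio_irq", "kvm_inject", "vcpu_kick", "vcpu_enter"], 31_000)
tracer.record(["gpio_irq", "kvm_inject", "vcpu_enter"], 9_000)

if __name__ == "__main__":
    print(tracer.max_latency_ns, " -> ".join(tracer.max_path))
```

Keeping only the maximum keeps the overhead of tracing small while still pointing at the code path worth analyzing.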
RT KVM - Configuration
# defconfig
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_CPU_STALL_TIMEOUT=300
# CONFIG_MMC is not set
# bootargs
isolcpus=17-23 rcu_nocbs=17-23
# disable RT throttling (only the runtime accepts -1; sched_rt_period_us must stay positive)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
# disable RCU stall warnings
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
echo 0 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
# taskset & chrt
taskset -c $CPU_QEMU qemu-system-x …; chrt -p 95 $QEMUPID
taskset -pc $CPU_VCPU $VCPUPID; chrt -p 95 $VCPUPID
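The pinning and priority steps above can also be driven from a script. The sketch below is a rough Python equivalent of `taskset -pc` followed by `chrt -p`, assuming a Linux host and sufficient privileges; `parse_cpu_list` and `pin_and_prioritize` are hypothetical helper names, and SCHED_RR is used because that is the policy `chrt` applies when no policy flag is given.

```python
import os
from typing import Set

def parse_cpu_list(spec: str) -> Set[int]:
    """Parse a kernel-style CPU list such as "17-23" or "0,2,4-6"."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError("descending range: %r" % part)
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_and_prioritize(pid: int, cpus: Set[int], rt_priority: int = 95) -> None:
    """Pin a thread to cpus and give it a real-time priority (Linux only)."""
    os.sched_setaffinity(pid, cpus)
    # Needs CAP_SYS_NICE (typically root) to set a real-time policy.
    os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(rt_priority))

if __name__ == "__main__":
    # Matches the isolcpus=17-23 range from the boot arguments above.
    print(sorted(parse_cpu_list("17-23")))
```

A caller would pass the vcpu thread ID as `pid`, one isolated CPU per vcpu, in line with the "one RT vcpu per cpu" requirement.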
RT KVM - PV-SCHED
Porting para-virtualized scheduling to PA Book III-E
• Follows Jan Kiszka's original x86 implementation, adapted for PA and rebased on newer kernels
• The significant differences are in interrupt handling and delivery to the guest
I/O Virtualization - Direct Assignment
PCI SR-IOV VFs direct assignment (VFIO PCI)
• Place VFs in different IOMMU groups
− Access Control Services quirks
• PCI EP partitioning PoC
− Device ID allocation and programming
− Enable IOMMU entries after each device attach
− Dump IOMMU information for debugging
• Arch fix-ups are evil
− QEMU PCI EP inadvertently hidden
• Memory translation
• MSI interrupt affinity
PCI DMA memory translations:
T4-iNIC: iNIC app (receives physical/iova addr from driver)
|
T4-iNIC: DMA engine
|
T4-iNIC: outbound PCI ATMU
|
PCI bus (bus:dev:func)
|
T4-RDB: inbound PCI ATMU
|
T4-RDB: PCI controller maps bus:dev:func to LIODN
|
T4-RDB: PAMU (iova to real addr)
|
T4-RDB: DDR
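The last two stages of the chain above (bus:dev:func to LIODN, then iova to real address) can be modeled as two table lookups. This is a simplified sketch under stated assumptions: the ATMU stages are omitted, each LIODN owns a list of contiguous windows, and all numeric values are made up for illustration. `Window` and `InboundPath` are hypothetical names, not PAMU driver structures.

```python
from typing import Dict, List, NamedTuple, Tuple

BDF = Tuple[int, int, int]  # (bus, dev, func)

class Window(NamedTuple):
    iova_base: int
    size: int
    real_base: int

class InboundPath:
    """Toy model: PCI controller (BDF -> LIODN) followed by PAMU (iova -> real)."""

    def __init__(self, bdf_to_liodn: Dict[BDF, int], windows: Dict[int, List[Window]]):
        self.bdf_to_liodn = dict(bdf_to_liodn)
        self.windows = {liodn: list(ws) for liodn, ws in windows.items()}

    def translate(self, bdf: BDF, iova: int) -> int:
        if bdf not in self.bdf_to_liodn:
            raise PermissionError("no LIODN for %02x:%02x.%x" % bdf)
        liodn = self.bdf_to_liodn[bdf]
        for w in self.windows.get(liodn, []):
            if w.iova_base <= iova < w.iova_base + w.size:
                return w.real_base + (iova - w.iova_base)
        # An access outside every window is blocked rather than translated.
        raise PermissionError("iova 0x%x not mapped for LIODN %d" % (iova, liodn))

# Made-up example: one VF at 01:00.1 mapped to LIODN 42 with a 64 KiB window.
path = InboundPath(
    bdf_to_liodn={(1, 0, 1): 42},
    windows={42: [Window(iova_base=0x1000_0000, size=0x1_0000, real_base=0x8000_0000)]},
)

if __name__ == "__main__":
    print(hex(path.translate((1, 0, 1), 0x1000_0040)))
```

The model makes the isolation property visible: a device can only reach memory through windows attached to its own LIODN.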
I/O Virtualization - Direct Assignment
Security engine direct assignment (VFIO for Platform Devices)
• QEMU glue code
• Physical and virtual functions
Management Complex Bus device assignment (VFIO MC)
• VFIO for Management Complex
− QEMU integration
− Legacy interrupts with irqfd support
− Performance improvements
KVM ARM support for direct I/O Guest caching attributes (I/O portal)
I/O portal HW interrupt coalescing
I/O Virtualization - virtio-crypto
Provide binary compatibility across HW platforms (from different vendors) and machine migration
Virtio-crypto device
• supplies cryptographic offloading for the guest
Cryptographic transformations:
• ablkcipher - block ciphers (encryption)
• ahash - hashing (authentication)
• aead - authentication and encryption in one job
Virtqueues
• session, crypto, control
PoC
• QEMU integration
• Linux frontend driver
• vhost-crypto over crypto-API
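To make the "hashing (authentication)" class of transformation concrete, the sketch below computes and verifies a keyed authentication tag in software. It only illustrates the kind of job a guest would hand to such a device; it is not the virtio-crypto interface, and the choice of HMAC-SHA256 and the names `auth_tag` and `verify` are assumptions for illustration.

```python
import hashlib
import hmac

def auth_tag(key: bytes, message: bytes) -> bytes:
    """Keyed hash over the message (HMAC-SHA256 chosen for illustration)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Constant-time comparison of the expected and received tags."""
    return hmac.compare_digest(auth_tag(key, message), tag)

if __name__ == "__main__":
    key = b"session-key"       # hypothetical per-session key
    pdu = b"example payload"   # hypothetical packet contents
    tag = auth_tag(key, pdu)
    print(len(tag), verify(key, pdu, tag), verify(key, pdu + b"x", tag))
```

An aead transformation would combine this authentication step with encryption in a single job, which is what makes it attractive for per-packet offload.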
Packet Processing Performance
• Increasing complexity of infrastructure stack
• Performance bottleneck from software implementation of networking stack
(Figure: per-VM software stack: IP tables + SYN cookie, OF vSwitch, VxLAN decap/demux, IP reasm + IPSec, VM)

Native networking stack performance (throughput)
Packet size    TCP/IP  OVS   IPtables+OVS  OVS+VxLAN  IPtables+OVS+VxLAN  IPtables+OVS+VxLAN+IPsec
64             370     279   195           181        136                 49
390            2205    1652  1194          1072       824                 146
390 (1K conn)  2205    1514  1051          914        639                 98
1024           5722    4346  3085          2737       2080                190
1472           8042    6097  4334          2906       2365                197
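The size of the bottleneck is easier to see as a fraction of the plain TCP/IP figure. The sketch below recomputes that from the throughput numbers on this slide, assuming the six values per row map, in order, to TCP/IP, OVS, IPtables+OVS, OVS+VxLAN, IPtables+OVS+VxLAN, and IPtables+OVS+VxLAN+IPsec. `retained_fraction` is an illustrative helper name.

```python
from typing import Dict, List

COLUMNS = [
    "TCP/IP",
    "OVS",
    "IPtables+OVS",
    "OVS+VxLAN",
    "IPtables+OVS+VxLAN",
    "IPtables+OVS+VxLAN+IPsec",
]

# Throughput values copied from the slide, keyed by the row label.
ROWS: Dict[str, List[int]] = {
    "64": [370, 279, 195, 181, 136, 49],
    "390": [2205, 1652, 1194, 1072, 824, 146],
    "390 (1K conn)": [2205, 1514, 1051, 914, 639, 98],
    "1024": [5722, 4346, 3085, 2737, 2080, 190],
    "1472": [8042, 6097, 4334, 2906, 2365, 197],
}

def retained_fraction(row: List[int]) -> List[float]:
    """Each configuration's throughput as a fraction of the TCP/IP baseline."""
    baseline = row[0]
    return [value / baseline for value in row]

if __name__ == "__main__":
    for label, row in ROWS.items():
        cells = ", ".join(
            "%s=%.1f%%" % (name, 100 * frac)
            for name, frac in zip(COLUMNS, retained_fraction(row))
        )
        print(label, cells)
```

For the 1472 row, the full stack with IPsec retains about 2.4% of the TCP/IP figure (197 of 8042), which is the motivation for offloading.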
Packet Processing Offload
• Limited GPP involvement (management/CP only)
• Offload as much packet processing as possible to NPU/iNIC
− NPU implements fast path
− Direct connectivity to VM
• Faster connection rate
− IP Table Policy Caching
− Entire OF pipeline processing for switching
− All OF based data paths
(Figure labels: NPU/iNIC with IPSec fastpath, Firewall, Switching, VxLAN, OF control; kernel space; TCP/IP; user space; IPtables; IPSec dataplane; Nova compute; Neutron agent; ovs-vsctl; ovs-ofctl; OVSDB; OF agent; two VMs, each with Datapath, Control path, Management)
OpenStack - Integration
(Figure: numbered flow, steps 1 to 8, across the Openstack Controller (Openstack Dashboard, NSCS vBTS Plugin, DB, NSCS Relay Agent), the Openstack Compute Node (Config Daemon, L2 VM, NSCS Agent), the B9131 (NSCS Agent), and the EPC.)
Network Service Configuration Stack:
• Single Dashboard for Configuration relay
• Service Function Chaining Support
• OF-Controller, OF-Switch – Onboarding
• Dynamic Scalability
• AMQP fan-out exchange
• Configuration Versioning and Delta Support
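The fan-out pattern named in the list above delivers every published message to every bound queue, which suits relaying one configuration change to many agents. The sketch below is an in-process stand-in for that behavior, not an AMQP client and not the NSCS implementation; `FanoutExchange`, the queue names, and the message fields are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict

class FanoutExchange:
    """Toy fan-out exchange: each publish is copied to every bound queue."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def bind(self, queue_name: str) -> Deque[dict]:
        """Create (or return) the queue for one consumer."""
        return self._queues.setdefault(queue_name, deque())

    def publish(self, message: dict) -> int:
        """Deliver a copy to each queue; routing keys play no role in fan-out."""
        for queue in self._queues.values():
            queue.append(dict(message))
        return len(self._queues)

# Hypothetical agents, one per configured element.
exchange = FanoutExchange()
compute_q = exchange.bind("compute-node-agent")
l1_q = exchange.bind("b9131-agent")
delivered = exchange.publish({"version": 2, "delta": {"qos": "updated"}})

if __name__ == "__main__":
    print(delivered, compute_q[0], l1_q[0])
```

Carrying a version number in each message, as in this sketch, is one way an agent could apply only the delta it has not yet seen.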
OpenStack - L1 & eNodeB integration
Tuning & performance
Latency Benchmarks
• Cyclictest
− Stress: coremark, lmbench
• L2 application
Networking Benchmarks
• Iperf
Performance tools
− perf kvm
− CPU statistics
KVM ARM CPU accounting
RT KVM PA Latency Results - Cyclictest
Native vs Virtualized - 1 CPU, core isolation, 12 h
(Two panels, labeled Memory stress and CPU stress)

Latency (cycles)  Native  Virtual
Min               1660    2770
Avg               2220    6660
Max               3330    11660

Latency (cycles)  Native  Virtual
Min               1660    2220
Avg               2770    7220
Max               13880   25550
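A quick way to read these results is the ratio of virtualized to native latency for each statistic. The sketch below computes it from the cycle counts on this slide; the two result sets are simply called `first` and `second` in source order, since the extracted text does not make clear which one belongs to which stress type. `overhead_ratio` is an illustrative helper name.

```python
from typing import Dict, Tuple

# (native, virtual) latency in cycles, copied from the slide in source order.
RESULTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "first": {"Min": (1660, 2770), "Avg": (2220, 6660), "Max": (3330, 11660)},
    "second": {"Min": (1660, 2220), "Avg": (2770, 7220), "Max": (13880, 25550)},
}

def overhead_ratio(native: int, virtual: int) -> float:
    """How many times longer the virtualized latency is than the native one."""
    return virtual / native

if __name__ == "__main__":
    for name, table in RESULTS.items():
        for stat, (native, virtual) in table.items():
            print(name, stat, "%.2fx" % overhead_ratio(native, virtual))
```

In both result sets the worst-case virtualized latency stays within a small multiple of the native worst case, which is the property the RT KVM changes are meant to preserve.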
RT KVM PA Latency Results - Cyclictest
Virtualized pv-sched – host & guest coremark stress, 15 min
Prio host coremark: 10
Prio guest coremark: 0
CPU stress (PV-SCHED)

Latency (cycles)  Native
Min               5550
Avg               14440
Max               27770
Virtual Base Station - Benchmarking Setup
(Figure: T4 iNIC (T4240-iNIC hosting VM1 and VM2), Metro Switch, traffic generator, UE, BSC9131RDB-1, BSC9131RDB-2 (for latency), EPC, and OpenDaylight controller, connected by 1 Gig, 10 Gig, and 10 Gig x 4 links carrying 80 Gbps bidirectional traffic.)
RT KVM PA Latency Results - TTI External Interrupt
Virtualized – LTE L2 Stress

Latency (us)  Virtual
Min           8330
Avg           13880
Max           29444
KVM ARMv8 - Networking performance degradation
(Figure: performance degradation, 0% to 40%, versus IRQ coalescing time, 0 to 80 µs; the point of best absolute performance is marked.)
Iperf TCP client 750 flows - DPAA2 Direct Assignment
• 1 Network interface
• 1 VM, 1 vCPU
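The trade-off behind the coalescing axis can be sketched with a toy model: a longer coalescing window caps the interrupt rate the guest has to absorb, at the cost of up to one window of added delivery delay. This is a generic timer-based model and an assumption for illustration; it does not describe the DPAA2 I/O portal hardware, and the packet rates used are made up.

```python
def interrupt_rate_per_s(packet_rate_pps: float, coalesce_us: float) -> float:
    """Interrupts per second when at most one interrupt fires per window."""
    if coalesce_us <= 0:
        # No coalescing: one interrupt per packet.
        return packet_rate_pps
    return min(packet_rate_pps, 1_000_000.0 / coalesce_us)

def worst_case_added_delay_us(coalesce_us: float) -> float:
    """A packet arriving just after an interrupt waits up to one full window."""
    return max(coalesce_us, 0.0)

if __name__ == "__main__":
    rate = 500_000.0  # hypothetical packet rate, packets per second
    for window in (0, 10, 20, 40, 80):
        print(window, interrupt_rate_per_s(rate, window), worst_case_added_delay_us(window))
```

Fewer interrupts mean fewer guest exits and injections per packet, which is why some coalescing helps a virtualized receiver even though it adds latency.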
Future
− Scalability and performance
Stack disaggregation
Optimized I/O and accelerator access
CPU oversubscription
− Fast guest interrupt delivery
− IOMMU emulation
Summary
• “5G” networks are enabled by SDN and NFV
• Base station virtualization with RT KVM
• I/O virtualization using direct assignment and virtio
• NFV packet processing offload
• Virtual base station integration with OpenStack
• Low interrupt latency and low network performance degradation with KVM
Q & A
Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. All other product or service names are the property of their respective owners. ARM and Cortex are registered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. © 2015 Freescale Semiconductor, Inc.
www.Freescale.com
LS2085A
(Figure: LS2085A block diagram. Labels: ARM A57 cores, each with 48KB L1-I and 32KB L1-D; banked L2 (2MB and 1MB labels); Pre-fetch; Coherency Fabric; IO MMU; Secure Boot; Trust Zone; Flash Controller; Power Management; SDXC/eMMC; 2x DUART; 4x I2C; SPI, GPIO, JTAG; 64-bit DDR2/3 Memory Controller; 64-bit DDR4 Memory Controller; 32-bit DDR4 Memory Controller; 1MB Platform Cache; 2x USB3.0 + PHY; Queue Mgr.; Buffer Mgr.; SEC; DCE; PME; WRIOP; 8-Lane 10GHz SERDES; 8x1/10 + 8x1; Accelerated Packet Processor (AIOP); Buffer; L2 Switch; PCIe; SATA 3.0)
Other Parametrics
• 37.5x37.5 Flipchip
• 1mm Pitch
• 1292 pins
Datapath Acceleration
• SEC – crypto acceleration
• DCE – Data Compression Engine
• PME – Pattern Matching Engine
General Purpose Processing
• 8x ARM A57 CPUs, 64b, 2.0GHz
• 4MB Banked L2 cache
• HW L1 & L2 Prefetch Engines
• Neon SIMD in all CPUs
• 1MB L3 platform cache w/ECC
• 2x64b DDR4 up to 2.4GT/s
Accelerated I/O Processor
• 40Gbps Packet Processing
• 20Gbps SEC- crypto acceleration
• 15Gbps Pattern Match/RegEx
• 20Gbps Data Compression Engine
• 4MB Packet Express Buffer
Express Packet IO
• Supports 1x8, 4x4, 4x2, 4x1 PCIe Gen3 controllers
• 2 x SATA 3.0, 2 x USB 3.0 with PHY
Network IO
• Wire Rate IO Processor:
• 8x1/10GbE + 8x1G
• XAUI/XFI/KR and SGMII
• MACSec on up to 4x 1/10GbE