Embrace high performance storage with open arm
Richael Zhuang, Arm
2020-12-22 · TRANSCRIPT
© 2020 Arm Limited (or its affiliates)
What’s SPDK
• Storage Performance Development Kit
• A set of tools and libraries to create high-performance, scalable, user-mode storage applications
What’s SPDK
• Key techniques
  • User-mode drivers (uio/vfio)
  • Poll mode instead of interrupts
  • Shared-nothing thread model
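These techniques fit together: each reactor thread owns its own queue pair and reaps completions by polling, rather than sleeping until an interrupt fires. A toy sketch of the idea (class and function names are illustrative, not SPDK APIs):

```python
from collections import deque

class ToyQueuePair:
    """Illustrative stand-in for a hardware submission/completion queue pair."""
    def __init__(self):
        self._inflight = deque()
        self.completions = []

    def submit(self, io):
        # In real SPDK the user-mode driver writes a 64-byte command into
        # the device submission queue and rings a doorbell register.
        self._inflight.append(io)

    def process_completions(self, max_completions=32):
        """Poll-mode: check for finished I/Os without blocking or interrupts."""
        done = 0
        while self._inflight and done < max_completions:
            self.completions.append(self._inflight.popleft())
            done += 1
        return done  # number of completions reaped this poll

def reactor_loop(qp, budget=1000):
    """Shared-nothing: one such loop runs pinned per CPU core, touching
    only its own queue pair, so the hot path needs no locks."""
    reaped = 0
    for _ in range(budget):
        reaped += qp.process_completions()
        if not qp._inflight:
            break
    return reaped
```

The point of the sketch is the shape of the loop: progress comes from repeated cheap polls on core-local state, which is why SPDK dedicates whole cores to its reactors.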
SPDK on Arm64
• 50+ patches to enable and optimize SPDK are merged
  • Memory barriers
  • Base64, crc32, ISA-L
• Enable SPDK NVMe over Fabrics
  • RDMA
  • TCP (posix, uring, VPP, mTCP)
• Enable SPDK vhost target (for VMs)
• SPDK-CSI (for containers)
NVMe over Fabrics
• Local access: PCIe (shared memory)
  • NVMe: specification for SSD access via PCI Express (PCIe)
• Remote access: message-based transport
  • Fibre Channel / RDMA / TCP
[Diagram: the host's NVMe driver exposes an admin queue and per-core I/O queues (SQ/CQ pairs on CPU cores 0..n) to the NVMe controller through a transport-dependent interface: memory-mapped PCIe registers locally, fabric capsule operations remotely]
NVMe over RDMA
• RDMA
  • Host-offload, host-bypass (RNICs)
  • Queue pairs (QP = SQ + RQ) and completion queues (CQs)
• NVMe over RDMA
  • Each NVMe qpair is mapped to an RDMA qpair
  • Retains NVMe SQ/CQ CPU alignment
  • NVMe commands are encapsulated, put into RDMA qpairs, and sent over RNICs
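The encapsulation step can be sketched by packing a 64-byte NVMe submission queue entry, which is the unit a fabrics command capsule carries. The layout below follows the NVMe base specification (with reserved DW2-3 collapsed into one field); the helper itself is a hypothetical illustration, not SPDK code:

```python
import struct

# 64-byte NVMe submission queue entry, per the NVMe base spec:
# opcode, flags (FUSE/PSDT folded into one byte), command ID, namespace ID,
# reserved DW2-3, metadata pointer, data pointer (PRP1/PRP2), CDW10-15.
SQE = struct.Struct('<BBHIQQQQ6I')
assert SQE.size == 64

def build_read_capsule(cid, nsid, slba, nlb):
    """Encapsulate an NVMe Read (opcode 0x02) for transport over a fabric.
    Over RDMA the capsule is posted to the send queue of the RDMA qpair
    backing this NVMe qpair; the transport-specific wire framing around
    it is omitted here."""
    cdw10 = slba & 0xFFFFFFFF      # starting LBA, low 32 bits
    cdw11 = slba >> 32             # starting LBA, high 32 bits
    cdw12 = (nlb - 1) & 0xFFFF     # number of logical blocks, zero-based
    return SQE.pack(0x02, 0, cid, nsid,
                    0,             # reserved DW2-3
                    0,             # MPTR: no metadata
                    0, 0,          # DPTR: PRP1/PRP2, unused in this sketch
                    cdw10, cdw11, cdw12, 0, 0, 0)
```

Because the capsule preserves the ordinary SQE layout, the target can feed it to its local NVMe driver with minimal translation, which is what keeps the NVMe-oF model close to local PCIe access.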
[Diagram: on the host, the NVMe driver's admin queue and per-core I/O queues (SQ/CQ pairs) each map to an RDMA qpair, i.e. an RDMA fabric context with its own SQ, RQ, and CQ]
NVMe over RDMA performance
Bandwidth (MiB/s), 1 NVMe over RDMA vs 1 NVMe local PCIe:

cores  op         NVMe over RDMA  local PCIe
1      randwrite  1099            1123
1      randread   1327            1762
2      randwrite  1134            1137
2      randread   1691            1714
4      randwrite  1147            1133
4      randread   1724            1722

• 1 NVMe750 in target
• MLX5 NICs (RoCEv2)
• 4KB payload size, 128 queue depth
NVMe over TCP
[Diagram: the host-side NVMe-TCP transport sends command capsules and receives response capsules; the controller-side transport does the reverse; both sit on socket APIs (send/receive bytes) over a typical TCP/IP network stack, with each NVMe SQ/CQ pair riding one TCP connection]
• NVMe block storage protocol over standard TCP/IP transport
• TCP provides a reliable transport layer for the NVMe queueing model
• Each NVMe queue pair is mapped to a TCP connection
• NVMe-oF commands are sent over standard TCP/IP sockets
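The socket framing can be sketched with the NVMe/TCP PDU common header (TYPE, FLAGS, HLEN, PDO, PLEN). Digests and in-capsule data are left out, and the helper name is illustrative:

```python
import struct

# NVMe/TCP PDU common header: pdu_type, flags, header length,
# PDU data offset, total PDU length (little-endian).
CH = struct.Struct('<BBBBI')
CAPSULE_CMD = 0x04   # command-capsule PDU type per the NVMe/TCP spec

def frame_cmd_capsule(sqe_bytes):
    """Frame a 64-byte NVMe command as an NVMe/TCP command-capsule PDU.
    With no digests and no in-capsule data, HLEN and PLEN both cover
    just the common header plus the command; the result is what gets
    handed to send() on the TCP connection backing this queue pair."""
    hlen = CH.size + len(sqe_bytes)
    plen = hlen   # no data section in this sketch
    return CH.pack(CAPSULE_CMD, 0, hlen, 0, plen) + sqe_bytes

pdu = frame_cmd_capsule(bytes(64))
```

Since TCP delivers a byte stream rather than messages, the PLEN field is what lets the receiver find PDU boundaries, which is the essential difference from the RDMA transport's message-based delivery.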
NVMe over TCP in SPDK
• POSIX (released, stable, no dependency on the kernel version)
• Uring (released, experimental, requires Linux kernel > 5.4.3)
  • io_uring: a new Linux asynchronous I/O interface
• VPP (released; VPP integration testing will be stopped in 20.07)
  • Vector Packet Processing (VPP): a fast network data plane on top of DPDK
• Seastar (some work done, but not ready; needs further investigation)
  • an event-driven framework
[Diagram: a sock abstraction layer with pluggable POSIX, uring, VPP, and Seastar backends underneath]
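The sock abstraction idea, one interface with interchangeable backends, can be sketched as below. The names are illustrative, and SPDK's real abstraction is a C ops table rather than classes:

```python
import socket
from abc import ABC, abstractmethod

class SketchSock(ABC):
    """The transport codes against this interface; posix/uring/VPP
    backends plug in underneath without the transport changing."""
    @abstractmethod
    def writev(self, bufs): ...
    @abstractmethod
    def readv(self, nbytes): ...

class PosixSock(SketchSock):
    """Backend built on ordinary POSIX sockets."""
    def __init__(self, sock):
        self._sock = sock
    def writev(self, bufs):
        return self._sock.sendmsg(bufs)   # vectored send, like writev(2)
    def readv(self, nbytes):
        return self._sock.recv(nbytes)
```

An io_uring backend would implement the same two methods by queueing submission entries instead of issuing syscalls per call, which is why the experimental uring backend could be swapped in without touching the NVMe/TCP transport above it.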
NVMe over TCP performance
• 1 NVMe P4600 in target
• 4KB payload size, 128 queue depth
Bandwidth (MiB/s), SPDK NVMe over TCP, posix vs uring socket backends:

cores  op         posix  uring
1      randwrite  576    651
1      randread   481    480
2      randwrite  1096   1080
2      randread   934    940
4      randwrite  1443   1482
4      randread   1752   1730
8      randwrite  1534   1528
8      randread   2212   2229
SPDK & VMs
• Virtio
  • I/O paravirtualization specification
  • An abstraction layer above a set of common emulated devices in a paravirtualized hypervisor
  • A common mechanism and layouts for device discovery/configuration
  • A common mechanism for front-end and back-end to communicate
• Vhost
  • Virtio offloads part of its operations to the host (kernel or user mode)
  • vhost-kernel: a vhost module in the kernel transfers data with the guest
  • vhost-user: a vhost backend in user space transfers data with the guest
[Diagram: two guest VMs (Linux*, Windows*, FreeBSD*, etc.) with virtio front-end drivers; with plain virtio, device emulation and the virtio back-end drivers live in the hypervisor (e.g. QEMU/KVM) and communicate through virtqueues; with vhost, the virtqueues instead connect to a vhost target in the kernel or in userspace]
SPDK & VMs
• SPDK vhost target
  • Leverages the vhost-user protocol to provide backend storage for VMs
• Vhost-user slave
  • The VM shares hugepage memory with a userspace process (SPDK)
  • SPDK transfers data with the VM through virtqueues (data path)
  • A unix domain socket carries control messages between the processes (control path)
• vhost-scsi / vhost-blk / vhost-nvme (experimental)
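The control path can be sketched with vhost-user-style framing over a unix domain socket: a small fixed header (request, flags, payload size) followed by an optional payload. The header layout follows the vhost-user protocol; the demo keeps both ends in one process purely for illustration:

```python
import socket, struct

HDR = struct.Struct('<III')      # request, flags, payload size (vhost-user style)
VHOST_USER_GET_FEATURES = 1      # request code per the vhost-user spec

def send_msg(sock, request, payload=b''):
    # flags bit 0-1 carry the protocol version (1 in the current spec)
    sock.sendall(HDR.pack(request, 0x1, len(payload)) + payload)

def recv_msg(sock):
    req, flags, size = HDR.unpack(sock.recv(HDR.size))
    return req, sock.recv(size) if size else b''

# Control path: the master (QEMU) and slave (SPDK vhost target) exchange
# small messages like this over the unix domain socket. The data path
# bypasses it entirely, moving through virtqueues in shared hugepage memory.
master, slave = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
send_msg(master, VHOST_USER_GET_FEATURES)
req, payload = recv_msg(slave)
```

The asymmetry is the point: a handful of tiny control messages set up the shared memory, after which I/O never crosses the socket at all.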
SPDK & containers
• Container Storage Interface (CSI)
  • A standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration systems (COs) like Kubernetes
  • NOTE: CSI is a general protocol, not for Kubernetes only
• Components
  • Controller driver: talks to the Storage Provider (SP) to create/delete volumes
  • Node driver: mounts/unmounts remote volumes on the local host
[Diagram: a CO (K8s, Mesos, ...) with one controller driver on the master node and one node driver instance per worker node; the CO talks to the CSI drivers with CSI RPC messages]
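The controller/node split can be sketched with in-memory stand-ins. Real CSI drivers implement these as gRPC services (CreateVolume/DeleteVolume on the controller, NodeStageVolume/NodePublishVolume on the node); everything below, including the volume ID scheme, is illustrative only:

```python
class ControllerDriver:
    """Runs once, near the CO master: owns volume lifecycle against
    the storage provider (for SPDK-CSI, the SPDK target)."""
    def __init__(self):
        self.volumes = {}
    def create_volume(self, name, size_bytes):
        vol_id = f"vol-{len(self.volumes)}"   # hypothetical ID scheme
        self.volumes[vol_id] = {"name": name, "size": size_bytes}
        return vol_id
    def delete_volume(self, vol_id):
        self.volumes.pop(vol_id, None)

class NodeDriver:
    """Runs on every worker: makes an existing volume usable locally.
    A real node driver would connect over NVMe-oF/iSCSI and mount a
    filesystem at target_path."""
    def __init__(self):
        self.mounts = {}
    def node_publish_volume(self, vol_id, target_path):
        self.mounts[vol_id] = target_path
    def node_unpublish_volume(self, vol_id):
        self.mounts.pop(vol_id, None)
```

The CO orchestrates the two halves: it asks the one controller driver for a volume, then asks the node driver on whichever worker schedules the Pod to publish it.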
SPDK-CSI
• Kubernetes supports CSI well
  • CSI spec 1.0 supported since Kubernetes 1.13
  • See the Kubernetes CSI Drivers List
• SPDK-CSI: bring SPDK to Kubernetes
  • Brings SPDK into Kubernetes storage through NVMe-oF and iSCSI
  • Supports dynamic volume provisioning
  • Enables Pods to use SPDK for transient or persistent storage
  • Released in 20.07, initiated by Arm
SPDK-CSI overview
[Diagram: SPDK-CSI architecture overview; the components communicate over gRPC]
What’s next?
• NVMe over Fabrics
  • Integrate and optimize SPDK NVMe over TCP with mTCP as the user-space TCP stack
  • Optimize SPDK NVMe over TCP with the uring socket backend
• SPDK-CSI
  • Tests and improvements for production-level quality
  • New features: topology, volume expansion, snapshots, etc. (see Backlogs and Todos on the Trello board)
  • Integration with Rook: build a total solution for leveraging SPDK in Kubernetes
Contributions Welcome
• Code review at SPDK Gerrit
  • git clone https://review.spdk.io/spdk/spdk-csi
  • Github mirror: https://github.com/spdk/spdk-csi
• Development guidelines
  • https://spdk.io/development/
• Trello board
  • https://trello.com/b/nBujJzya/kubernetes-integration
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks