nica: an infrastructure for inline acceleration of network ... · nica applies these techniques to...
TRANSCRIPT
NICA: An Infrastructure for Inline Acceleration of Network ApplicationsHAGGAI ERAN†#, L IOR ZENO †, MAROUN TORK†, GABI MALKA †, MARK SILBERSTEIN †
†TECHNION – ISRAEL INSTITUTE OF TECHNOLOGY #MELLANOX TECHNOLOGIES
1
FPGA-Based SmartNICs
SmartNIC
Network FPGA
Azure SmartNICs implementing AccelNet have been deployed on all new Azure servers since late 2015 in a fleet of >1M hosts. [NSDI’18]
2
Inline Acceleration (Analogy)
https://www.pinterest.com/pin/334603447314940566/
https://www.amazon.com/WolVol-Friction-Powered-Garbage-Lights/dp/B00WH4Y9SG3
Inline Processing - Filtering
SmartNIC
Network FPGA
4
Inline Processing - Transformation
SmartNIC
Network FPGA
5
Val
A Key-Value Store Cache
SmartNIC
Network FPGA
ValKeyKeyKeyKey
ValValValValVal
6
Previous work: KV-Direct [SOSP’17], Floem [OSDI’18],LaKe [ReConFig’18]
CoAP Cryptographic Authentication
SmartNIC
Network FPGAIoT
server
7
Challenges for Cloud Inline Accelerators• No operating system abstractions
• No virtualization support:• performance & state isolation
The NICA infrastructure fulfills these requirements for arbitrary inline accelerators.
8
NICA – Contributions• ikernel OS abstraction for inline application acceleration
• SmartNIC virtualization:Fine-grain time sharing & strict performance isolation
9
The ikernel Abstraction
SmartNIC
Network
FPGA Process
ikernelAFUAFUAFU
AFU – Accelerator Functional Unit 10
Process
socket
SmartNIC
FPGA
Attaching an ikernel
AFU – Accelerator Functional Unit 11
Networkikernel
AFUAFUAFU
SmartNIC
Network
FPGA Process
ikernelAFUAFUAFU
socket
Attaching an ikernel
AFU – Accelerator Functional Unit 12
Interfaces of an ikernel
• Attach to sockets
• Use POSIX API for data
• Custom ring – direct producer-consumer interface
• Bypass the host network stack
• RPC – access AFU state
• E.g. configure cryptographic keys, read counters
13
The ikernel Abstraction// Create handle
k = ik_create(MEMCACHED_AFU);
ik_command(k, CONFIGURE, ...);
// Init a socket.
s = socket(...); bind(s, ...);
// Activate the ikernel
ik_attach(k, s);
// Use POSIX APIs to receive
while (recvmsg(s, buf, ...))...
107 lines of code for memcached integration14
Process
ikernel
socket
Virtualization Support
15
Computation
I/O
How FPGAs Are Shared [Background]• Space sharing
• Coarse grain time sharing
• Fine grain time sharing
-Limited utilization
16
How FPGAs Are Shared [Background]• Space sharing
• Coarse-grain time sharing
• Fine-grain time sharing
-Long context switch
17
AmorphOS [OSDI’18] combines the first two.
How FPGAs Are Shared [Background]• Space sharing
• Coarse-grain time sharing
• Fine-grain time sharing
Multiple tenants share the same AFU.
+Low latency – hardware context switch (packet granularity)
18
NICA applies these techniques to SmartNIC AFUs
Cloud Deployment Model
19
FPGA-as-a-Service
• Customers bring their own design
• Coarse-grain sharing
AFU Marketplace
• Cloud provider develops/audits AFUs
• AFUs trusted to implement fine-grain virtualization
SmartNIC
AFU
NICA I/O Virtualization
CPU
Network
Network stack
NICA hardware runtime
ProcessProcessVMvAFU 1vAFU 1vAFU 2vAFU 1
20
SmartNIC
AFU
Performance Isolation
CPU
Network
Network stack
NICA hardware runtime
ProcessProcessVMvAFU 1vAFU 1vAFU 2vAFU 1
21
In the Paper• NICA hardware runtime
• Network stack integration and TCP support
• Custom ring implementation using RDMA
• SR-IOV and para-virtual interfaces
• Implementation details
22
Evaluation40 Gbps bump-in-the-wire SmartNIC:
• Mellanox Innova Flex
• ConnectX 4 Lx EN
• Xilinx Kintex UltraScale FPGA
Similar to Microsoft Catapult.
FPGA NICNetwork
23
EvaluationBaseline:• VMA user-space networking
stack
Microbenchmarks:• UDP/TCP performance
• Virtualization overheads
• I/O isolation overheads
Applications:
• memcached• Comparison with
MICA [NSDI’14]; KV-Direct [SOSP’17]
• IoT message authentication• Node.js server• Integrated with 20 lines of
JavaScript
24
Memcached Cache• Host working set: 32M keys
• SmartNIC Cache: 2M keys
• 16 byte keys and values
• Memcached UDP ASCII protocol
SmartNIC
Network FPGA
25
Simplicity of host integration:
• POSIX – 107 lines of C code
• Custom ring – 135 lines of code
Key-Value Store Cache – Bare Metal R/O
1.7 1.96.8
40.3
14.8
0
10
20
30
40
50
0.69 0.79 0.89 0.99 1.09 1.19 1.29 1.39 1.49Thro
ugh
pu
t [M
tps]
Zipf distribution; Throughput by Zipf skew
CPU-only NICA NICA+CR line-rate
26
At Zipf(0.99), x4 through filtering,
x9 with both
Low hit rate: x2 throughput w. custom ring
Near line-rate hardware
throughput
Key-Value Store Cache - Virtualization• 1 core, 5 GB host RAM, 2M key cache, Zipf(0.9)
0
5
10
1 2 3 4 5 6
Thro
ugh
pu
t [M
tps]
Throughput by #VMs
CPU NICA+CR
27
Latency under virtualization• 60% hit-rate – 2.1 µs
• same as bare-metal
• Misses on the CPU – 12-62 µs
• Compared to 6 µs on bare-metal
• Negligible sharing overhead
28
Key-Value Store – Performance Isolation
0
5
10
15
20
25
0 5 10 15 20 25Thro
ugh
pu
t [M
tps]
Throughput over time
VM 1 VM 2 VM 3
29
• NICA framework enables SmartNIC inline processing in cloud environments.
• Find our code at https://github.com/acsl-technion/nica
Thank you!
Questions?
Conclusion
30