© Copyright 2004-2006, Michael Opdenacker. © Copyright 2003-2006, Oron Peled. © Copyright 2004-2006 Codefidence Ltd.
For full copyright information see last page. Creative Commons Attribution-ShareAlike 2.0 license.
Splice, Tee & Vmsplice: Zero Copy in Linux
Herzelinux, http://tuxology.net
Rights to copy
Attribution – ShareAlike 2.0
You are free
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions
Attribution. You must give the original author credit.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode
This kit contains work by the following authors:
© Copyright 2004-2006 Michael Opdenacker, michael@free-electrons.com, http://www.free-electrons.com
© Copyright 2003-2006 Oron Peled, http://www.actcom.co.il/~oron
© Copyright 2004-2008 Codefidence Ltd., http://www.codefidence.com
Kernel architecture
[Diagram: in user space, applications (App1, App2, ...) sit on top of the C library, which enters kernel space through the system call interface. Kernel subsystems: process management, memory management, filesystem support, device control, networking. Below them: CPU support code, filesystem types, storage drivers, character device drivers, network device drivers, CPU / MMU support code. Hardware: CPU, RAM, storage.]
Kernel Mode vs. User Mode
All modern CPUs support a dual mode of operation:
User mode, for regular tasks.
Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is willing to execute:
“Sensitive” instructions will not be executed when the CPU is in user mode.
The CPU mode is determined by one of the CPU registers, which stores the current “Ring Level”:
0 for supervisor mode, 3 for user mode, 1-2 unused by Linux.
The System Call Interface
When a user space task needs to use a kernel service, it makes a “System Call”.
The C library places the system call number and its parameters in registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to the kernel's entry point.
The kernel looks up the requested system call in the system call table and executes it.
Returning from the system call does not require a special instruction, since in supervisor mode the ring level can be changed directly.
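The steps above can be observed from user space by skipping the dedicated C library wrapper. The following is a minimal sketch, assuming Linux with glibc: the generic syscall() function takes the system call number (here SYS_getpid) and performs the register setup and trap on our behalf. The helper name raw_getpid() is made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/*
 * Hypothetical helper: ask the kernel for our PID through the generic
 * syscall() entry point instead of the getpid() wrapper. glibc loads
 * the system call number into a register and issues the trap.
 */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Calling raw_getpid() should give the same value as getpid(), since both end up in the same kernel function.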
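Linux System Call Path
[Diagram: a task makes a function call into glibc, which traps into the kernel. In the kernel, entry.S dispatches to sys_name(), which calls do_name(); control then returns to the task.]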
Exchanging Data With User-Space (1)
In kernel code, you can't just memcpy between an address supplied by user-space and the address of a buffer in kernel-space!
The two addresses belong to completely different address spaces (thanks to virtual memory).
The page behind the user-space address may be swapped out to disk.
The user-space address may be invalid (a user space process trying to access unauthorized data).
Exchanging Data With User-Space (2)
You must use dedicated functions such as the following ones in your read and write file operations code:
#include <asm/uaccess.h>   /* <linux/uaccess.h> in recent kernels */
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
Make sure that these functions return 0! They return the number of bytes that could not be copied, so any other value means the copy failed, at least in part.
DMA Off Load Engine
A DMA (Direct Memory Access) offload engine is a piece of hardware that performs memcpy-style copies without using the CPU.
Example: Intel I/OAT (I/O Acceleration Technology).
Makes the copy the job of an entity other than the CPU.
It's zero copy, if by copy you mean copy by the CPU.
Simple Client/Server Copies
[Diagram: the server application calls ret = send(s, buf); the data is copied from user space into the server's kernel and transmitted (Tx). The client's kernel receives it (Rx), and the client application's ret = recv(s, buf) copies it to user space.]
Simple Client/Server Copies
[Diagram: as on the previous slide, but the server first fills buf with ret = read(s, buf) from disk (copy to user) before calling ret = send(s, buf) (copy from user). The client calls ret = recv(s, buf). The three transfers between hardware and kernel memory (disk, Tx, Rx) use DMA.]
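For comparison with the zero copy calls that follow, the server side of this picture can be sketched as a plain forwarding loop. This is only an illustration, assuming POSIX read() and write() on already opened descriptors; the function name copy_with_buffer() and the 4096-byte buffer size are arbitrary choices, not taken from the slides. Every byte crosses the user/kernel boundary twice.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: forward everything readable from in_fd to out_fd
 * through a user-space buffer. Returns the number of bytes forwarded,
 * or -1 on error.
 */
ssize_t copy_with_buffer(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        /* Kernel -> user copy ("copy to user"). */
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;               /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        /* User -> kernel copy ("copy from user"), handling short writes. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The process never looks at the contents of buf; it only forwards them, which is exactly the case the rest of the talk addresses.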
Zero Copy
In-kernel buffer that the user has control over.
The buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.
splice() moves data to/from the buffer from/to an arbitrary file descriptor.
tee() copies data from one buffer to another, leaving it in the source buffer.
vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into the buffer.
Can be used anywhere where a process needs to send something from one end to another, but it doesn't need to touch or even look at the data, just forward it.
Zero Copy
In-kernel buffer that the user has control over.
Implemented as a pipe.
The pipe buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.
tee(), splice() and vmsplice() move data from the user program to the pipe and from one pipe to the next, without copying.
Use when a process needs to send something from one end to another, but doesn't need to touch or even look at the data.
Splice
long splice(int fd_in, off_t *off_in, int fd_out, off_t *off_out, size_t len, unsigned int flags);
splice() moves data to (from) the pipe from (to) an arbitrary file descriptor.
sendfile() is now internally implemented as splice().
Use the SPLICE_F_MOVE flag to ask for zero copy. The kernel moves pages only where possible, that is, for whole pages that have no other references; otherwise it copies.
Other flags: SPLICE_F_NONBLOCK, and SPLICE_F_MORE, which announces that more data will follow and works like TCP_CORK.
Tee
long tee(int fd_in, int fd_out, size_t len, unsigned int flags);
tee() duplicates data (read: copies references to it) from one pipe buffer to another.
Source pipe still holds the data.
Only useful flag is SPLICE_F_NONBLOCK.
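The point that the source pipe still holds the data can be checked with a small sketch. It assumes Linux with glibc, and that both descriptors refer to pipes, as tee() requires; the wrapper name duplicate_pipe() is made up for this illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* tee(), SPLICE_F_NONBLOCK */
#include <unistd.h>

/*
 * Hypothetical helper: duplicate up to len bytes that are waiting in
 * the pipe read end src_rd into the pipe write end dst_wr. The data
 * is not consumed from the source pipe. Returns the number of bytes
 * duplicated, 0 if the source has no writers left, or -1 on error.
 */
ssize_t duplicate_pipe(int src_rd, int dst_wr, size_t len)
{
    return tee(src_rd, dst_wr, len, SPLICE_F_NONBLOCK);
}
```

After a successful call, reading from the destination pipe and then from the source pipe should return the same bytes twice.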
Zero Copy of Example 1
[Diagram: the data travels from the HD controller into kernel memory and from kernel memory to the network chip by copy (using DMA). Inside kernel memory, the file holds a pointer to a page cache page and the socket buffer holds a pointer to the same page as part of its frag list. splice() * copies only the pointer. User space never touches the data.]
* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.
Tee Implemented using Tee & Splice

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>
    #include <errno.h>
    #include <limits.h>

    int main(int argc, char *argv[])
    {
        int fd;
        int len, slen;

        /* Usage: prog <file>, with stdin and stdout both being pipes. */
        assert(argc == 2);

        fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }

        do {
            /*
             * tee stdin to stdout.
             */
            len = tee(STDIN_FILENO, STDOUT_FILENO, INT_MAX, SPLICE_F_NONBLOCK);
            if (len < 0) {
                if (errno == EAGAIN)
                    continue;       /* nothing to duplicate yet, retry */
                perror("tee");
                exit(EXIT_FAILURE);
            } else if (len == 0)
                break;              /* writer closed stdin */

            /*
             * Consume stdin by splicing it to a file.
             */
            while (len > 0) {
                slen = splice(STDIN_FILENO, NULL, fd, NULL, len, SPLICE_F_MOVE);
                if (slen < 0) {
                    perror("splice");
                    exit(EXIT_FAILURE);
                }
                len -= slen;        /* bytes still to be consumed */
            }
        } while (1);

        close(fd);
        exit(EXIT_SUCCESS);
    }
Vmsplice
long vmsplice(int fd, const struct iovec *iov, unsigned long nr_segs, unsigned int flags);
struct iovec {
    void *iov_base;   /* Starting address */
    size_t iov_len;   /* Number of bytes */
};
vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a pipe.
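A minimal sketch of the call, assuming Linux with glibc and that pipe_wr is the write end of a pipe. The wrapper name buffer_to_pipe() is invented for this illustration, and no flags are passed.

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* vmsplice() */
#include <sys/uio.h>   /* struct iovec */
#include <unistd.h>

/*
 * Hypothetical helper: splice the user address range [buf, buf + len)
 * into the pipe whose write end is pipe_wr. Returns the number of bytes
 * spliced, or -1 on error. The pipe may reference the caller's pages
 * rather than a copy, so the buffer should be left untouched until the
 * reader has consumed the data.
 */
ssize_t buffer_to_pipe(int pipe_wr, const void *buf, size_t len)
{
    struct iovec iov = {
        .iov_base = (void *)buf,   /* iovec has no const variant */
        .iov_len  = len,
    };

    return vmsplice(pipe_wr, &iov, 1, 0);
}
```

A single iovec is used here; an array with nr_segs entries describes several ranges in one call.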
Zero Copy Vmsplice
Zero copy requires the SPLICE_F_GIFT flag.
The user pages are a gift to the kernel. The application must never modify this memory again, or the page cache and on-disk data may differ.
Gifting pages to the kernel means that a subsequent splice() with SPLICE_F_MOVE can successfully move the pages; if this flag is not specified, then a subsequent splice() with SPLICE_F_MOVE must copy the pages.
Data must also be properly page aligned, both in memory and length.
Zero Copy of Example 2
[Diagram: user space writes the data to memory (mem write) through the process page tables. vmsplice() * copies only a pointer: the skb in kernel memory points to the page as part of its frag list. The network chip then fetches the data by copy (using DMA).]
* In reality you have to make two calls: one vmsplice() to an intermediate pipe and one splice() from the pipe to the socket buffers.
More Information
Zero Copy I: User-Mode Perspective
http://www.linuxjournal.com/article/6345
Copyrights and Trademarks
© Copyright 2004-2006, Michael Opdenacker
© Copyright 2004-2008 Codefidence Ltd.
Tux Image Copyright: © 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a Creative Commons Attribution-ShareAlike 2.0 license