
Splice, Tee & Vmsplice: Zero Copy in Linux

Herzelinux
http://tuxology.net

© Copyright 2004-2006, Michael Opdenacker
© Copyright 2003-2006, Oron Peled
© Copyright 2004-2006 Codefidence Ltd.

For full copyright information see last page.
Creative Commons Attribution-ShareAlike 2.0 license


Rights to copy

Attribution – ShareAlike 2.0

You are free
    to copy, distribute, display, and perform the work
    to make derivative works
    to make commercial use of the work

Under the following conditions
    Attribution. You must give the original author credit.
    Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.

For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.

Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode

This kit contains work by the following authors:

© Copyright 2004-2006
Michael Opdenacker
michael@free-electrons.com
http://www.free-electrons.com

© Copyright 2003-2006
Oron Peled
http://www.actcom.co.il/~oron

© Copyright 2004 – 2008
Codefidence Ltd.
http://www.codefidence.com


Kernel architecture

[Diagram: layered view of the system.
User space: App1, App2, ... on top of the C library.
Kernel space: the system call interface, above process management, memory management, filesystem support, device control and networking; below these, CPU support code, filesystem types, storage drivers, character device drivers, network device drivers and CPU / MMU support code.
Hardware: CPU, RAM, storage.]


Kernel Mode vs. User Mode

All modern CPUs support a dual mode of operation:

User mode, for regular tasks.

Supervisor (or privileged) mode, for the kernel.

The mode the CPU is in determines which instructions the CPU is willing to execute:

“Sensitive” instructions will not be executed when the CPU is in user mode.

The CPU mode is determined by one of the CPU registers, which stores the current “Ring Level” (in x86 terms):

0 for supervisor mode, 3 for user mode; 1-2 are unused by Linux.


The System Call Interface

When a user space task needs to use a kernel service, it makes a “System Call”.

The C library places the system call number and its parameters in registers and then issues a special trap instruction.

The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to the kernel's entry point.

The kernel finds the required system call via the system call table and executes it.

Returning from the system call does not require a trap, since code running in supervisor mode is allowed to lower the ring level directly.


Linux System Call Path

[Diagram: Task -> (function call) -> Glibc -> (trap) -> Kernel: entry.S -> sys_name() -> do_name() -> back to the Task.]


Exchanging Data With User-Space (1)

In kernel code, you can't just memcpy between an address supplied by user-space and the address of a buffer in kernel-space!

They correspond to completely different address spaces (thanks to virtual memory).

The user-space address may be swapped out to disk.

The user-space address may be invalid (user space process trying to access unauthorized data).


Exchanging Data With User-Space (2)

You must use dedicated functions such as the following ones in your read and write file operations code:

    #include <asm/uaccess.h>    /* <linux/uaccess.h> in recent kernels */

    unsigned long copy_to_user(void __user *to,
                               const void *from,
                               unsigned long n);

    unsigned long copy_from_user(void *to,
                                 const void __user *from,
                                 unsigned long n);

Make sure that these functions return 0! Any other return value is the number of bytes that could not be copied, which means that the copy failed.


DMA Off Load Engine

A DMA (Direct Memory Access) offload engine is a piece of hardware that performs memory copies in place of the CPU.

Example: Intel I/OAT (I/O Acceleration Technology).

Makes the copy the job of an entity other than the CPU.

It's zero copy, if by copy you mean copy by the CPU.


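Simple Client/Server Copies

[Diagram: a Client and a Server, each split into a user space Application and the Kernel. The sender calls ret = send(s, buf), and the data is copied from user space into the kernel before Tx. The receiver's kernel gets the data on Rx, and ret = recv(s, buf) copies it to user space.]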


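Simple Client/Server Copies

[Diagram: the same client/server picture, with the sender first loading the data from Disk. ret = read(s, buf) copies the data to user space, then ret = send(s, buf) copies it from user space back into the kernel. The receiver still calls ret = recv(s, buf). The three transfers between hardware (Disk, Tx, Rx) and kernel memory are marked DMA.]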


Zero Copy

In-kernel buffer that the user has control over.

The buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.

splice() moves data to/from the buffer from/to an arbitrary file descriptor.

tee() copies data from one buffer to another, leaving it in the source.

vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into the buffer.

Can be used anywhere a process needs to send something from one end to another, but doesn't need to touch or even look at the data, just forward it.


Zero Copy

In-kernel buffer that the user has control over.

Implemented as a pipe.

The pipe buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.

tee(), splice() and vmsplice() move data from the user program to the pipe and from one pipe to the next, without copying.

Use when a process needs to send something from one end to another, but doesn't need to touch or even look at the data.


Splice

ssize_t splice(int fd_in, off_t *off_in, int fd_out, off_t *off_out, size_t len, unsigned int flags);

splice() moves data to (from) the pipe from (to) an arbitrary file descriptor. At least one of the two descriptors must refer to a pipe.

sendfile() is now internally implemented as splice().

Pass the SPLICE_F_MOVE flag to ask for zero copy. Pages are moved only if possible: they must be whole pages that nothing else holds a reference to. (According to the splice(2) man page, the flag has been a no-op since Linux 2.6.21.)

Other flags: SPLICE_F_NONBLOCK, and SPLICE_F_MORE, which works like TCP_CORK.


Tee

ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);

tee() duplicates data from one pipe into another by copying references to the pipe buffers, not the data itself. Both descriptors must refer to pipes.

Source pipe still holds the data.

Only useful flag is SPLICE_F_NONBLOCK.


Zero Copy of Example 1

[Diagram: three layers (User space, Kernel Memory, Hardware). The HD Controller copies the Data (using DMA) into a page cache page, which the File points to. Splice() copies only a pointer: the same page is referenced as part of the frag list of the Socket Buf, and the Network Chip receives it through another copy using DMA. Nothing passes through user space.]

* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.


Tee Implemented using Tee & Splice

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>
    #include <errno.h>
    #include <limits.h>

    int main(int argc, char *argv[])
    {
        int fd;
        int len, slen;

        assert(argc == 2);

        fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
            perror("open");
            exit(EXIT_FAILURE);
        }

        do {
            /*
             * tee stdin to stdout.
             */
            len = tee(STDIN_FILENO, STDOUT_FILENO,
                      INT_MAX, SPLICE_F_NONBLOCK);
            if (len < 0) {
                if (errno == EAGAIN)
                    continue;
                perror("tee");
                exit(EXIT_FAILURE);
            } else
                if (len == 0)
                    break;

            /*
             * Consume stdin by splicing it to a file.
             */
            while (len > 0) {
                slen = splice(STDIN_FILENO, NULL, fd, NULL,
                              len, SPLICE_F_MOVE);
                if (slen < 0) {
                    perror("splice");
                    break;
                }
                len -= slen;
            }
        } while (1);

        close(fd);
        exit(EXIT_SUCCESS);
    }


Vmsplice

ssize_t vmsplice(int fd, const struct iovec *iov, size_t nr_segs, unsigned int flags);

    struct iovec {
        void  *iov_base;            /* Starting address */
        size_t iov_len;             /* Number of bytes */
    };

vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a pipe (fd must refer to a pipe).


Zero Copy Vmsplice 

Zero copy requires the SPLICE_F_GIFT flag.

The user pages are a gift to the kernel. The application must never modify this memory again, or page cache and on-disk data may differ.

Gifting pages to the kernel means that a subsequent splice() SPLICE_F_MOVE can successfully move the pages; if this flag is not specified, then a subsequent splice() SPLICE_F_MOVE must copy the pages.

Data must also be properly page-aligned, both in memory address and in length.


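Zero Copy of Example 2

[Diagram: three layers (User space, Kernel Memory, Hardware). The application fills the Data with an ordinary memory write to pages mapped through the process page tables. VMSplice() copies only a pointer: the page becomes part of the frag list of an skb, and the Network Chip receives it through a copy using DMA.]

* In reality you have to do two calls: one vmsplice() to an intermediate pipe and one splice() from the pipe to the socket buffers.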


More Information

Zero Copy I: User-Mode Perspective

http://www.linuxjournal.com/article/6345


Copyrights and Trademarks

© Copyright 2004-2006, Michael Opdenacker
© Copyright 2004-2008 Codefidence Ltd.
Tux Image Copyright: © 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a Creative Commons Attribution-ShareAlike 2.0 license