epoll - from the kernel side


Upload: llj098

Post on 10-May-2015


TRANSCRIPT

Page 1: Epoll -  from the kernel side

Epoll - From the kernel side

“There are no secret messages in the source code.”

Lijin Liu <[email protected]>
Twitter: @llj098 & http://blog.fatlj.me

Page 2: Epoll -  from the kernel side

Some basics about I/O

• Q: What is I/O?
• A: I/O connects the CPU to the outside world.

Page 3: Epoll -  from the kernel side

Some basics about I/O

• Three kinds of I/O:
  – memory-mapped input/output
  – I/O-mapped (port-mapped) input/output
  – direct memory access (DMA)

Page 4: Epoll -  from the kernel side

Some basics about I/O

• PCI, ISA, EISA, NuBus...
• PCI controller
• Interrupt controller
  – also an I/O device
  – some devices can communicate with it directly, without talking to the CPU
• Polled I/O
• Deferred handling

Page 5: Epoll -  from the kernel side

I/O models – back to software world

• Blocking I/O
  – normal read/write/open... system calls
• Non-blocking I/O
  – fcntl/ioctl
• I/O multiplexing
  – SELECT
• Event driven
  – EPOLL/KQUEUE
• AIO
  – IOCP

Page 6: Epoll -  from the kernel side

NON/Blocking I/O

• User space API / system calls
  – read, write, accept, open, close...
• Blocking I/O
  – one thread/process per connection
• Non-blocking I/O
  – ioctl/fcntl
  – loop check: keep retrying the call until it succeeds
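As a minimal sketch of the non-blocking case: once O_NONBLOCK is set with fcntl(), a read() that would have slept returns -1 with errno set to EAGAIN, and the caller has to loop and check again. The helper names set_nonblocking() and read_empty_pipe_errno() are made up for this illustration, and a pipe stands in for a socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode; returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read from an empty pipe in non-blocking mode.
 * Returns the errno left by read(), or 0 if read() did not fail. */
int read_empty_pipe_errno(void)
{
    int p[2];
    char buf[16];
    int err = 0;

    int rc = pipe(p);
    assert(rc == 0);
    rc = set_nonblocking(p[0]);
    assert(rc == 0);

    /* Nothing was written: a blocking read would sleep here. */
    if (read(p[0], buf, sizeof buf) == -1)
        err = errno;

    close(p[0]);
    close(p[1]);
    return err;
}
```

Calling read_empty_pipe_errno() from a small main() should yield EAGAIN, which is the "loop check" signal: try again later.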

Page 7: Epoll -  from the kernel side

IO-Multiplex

• select/poll
• Shortcomings
  – the number of fds is limited
  – it is just another type of loop check
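Both shortcomings show up in a small select() sketch. The fd_set can only hold descriptors below FD_SETSIZE, and the set has to be rebuilt and rescanned on every call. The function names readable_now() and select_demo() are invented for this example, and a pipe is used in place of a socket.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error. */
int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };   /* zero timeout: check once, do not sleep */

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                  /* an fd_set cannot represent this fd */

    FD_ZERO(&rfds);                 /* the set is rebuilt before every call */
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/* Checks a pipe before and after one byte is written to it. */
int select_demo(int *before, int *after)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);

    *before = readable_now(p[0]);   /* empty pipe */
    ssize_t w = write(p[1], "x", 1);
    assert(w == 1);
    *after = readable_now(p[0]);    /* one byte pending */

    close(p[0]);
    close(p[1]);
    return 0;
}
```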

Page 8: Epoll -  from the kernel side

SELECT/POLL Internals - basics

• process: task_struct
• Linux has no separate thread abstraction in the kernel, just processes, or 'tasks'
• data structure
  – include/linux/list.h
• process scheduler
  – CFS
• Process state machine

Page 9: Epoll -  from the kernel side

SELECT/POLL Internals - basics

Page 10: Epoll -  from the kernel side

SELECT/POLL Internals - basics

• sleep/wake up mechanism in the Linux kernel
  – wait_queue
• structures:

    struct __wait_queue {
        unsigned int flags;
        void *private;
        wait_queue_func_t func;      /* callback function */
        struct list_head task_list;
    };
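The idea behind the structure above can be modelled in user space: a list of entries, each carrying a callback that is invoked when someone "wakes" the queue. This is a toy model only, not kernel code. All names here (wait_entry, wait_head, add_wait, wake_up_all, toy_task, toy_wake) are invented for the illustration, and a plain singly linked list replaces list_head.

```c
#include <assert.h>
#include <stddef.h>

struct wait_entry;
typedef int (*wait_func_t)(struct wait_entry *we);

/* Toy counterpart of struct __wait_queue. */
struct wait_entry {
    unsigned int flags;
    void *private_data;       /* in the kernel this usually points at the task */
    wait_func_t func;         /* callback run on wake-up */
    struct wait_entry *next;  /* stand-in for the list_head linkage */
};

struct wait_head {
    struct wait_entry *first;
};

/* Hook an entry onto the queue. */
void add_wait(struct wait_head *h, struct wait_entry *we)
{
    assert(h != NULL && we != NULL);
    we->next = h->first;
    h->first = we;
}

/* Walk the queue and run every callback; returns how many reported a wake-up. */
int wake_up_all(struct wait_head *h)
{
    int woken = 0;
    for (struct wait_entry *we = h->first; we != NULL; we = we->next)
        if (we->func != NULL)
            woken += we->func(we);
    return woken;
}

/* Toy "task" and a callback that marks it runnable. */
struct toy_task {
    int runnable;
};

int toy_wake(struct wait_entry *we)
{
    ((struct toy_task *)we->private_data)->runnable = 1;
    return 1;
}
```

The callback is the important part for epoll: the waker does not need to know who is waiting, it just runs whatever function each entry registered.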

Page 11: Epoll -  from the kernel side

SELECT/POLL Internals – basics

• How to wait:
• linux/kernel/sched/core.c : schedule() -> __schedule():

    ...
    next = pick_next_task(rq);
    ...
    context_switch(rq, prev, next);   /* unlocks the rq */
        /* called from inside context_switch(): */
        switch_mm(oldmm, mm, next);   /* arch dependent; x86: arch/x86/include/asm/mmu_context.h */
    ...

Page 12: Epoll -  from the kernel side

SELECT/POLL Internals - some basics

• Interruption
  – interrupt controller
    • Device
    • programmable
  – interrupt handler
    • Often, the device driver registers itself as the interrupt handler
  – Softirq
    • bottom halves
    • ksoftirqd

Page 13: Epoll -  from the kernel side

SELECT/POLL Internals

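• Why are poll/select not cool?
  – every call walks every fd in the set and calls its poll method, so the cost grows with the number of fds, not with the number of active ones
• fs/select.c do_select():

    for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
        ...
        file = fget_light(i, &fput_needed);
        if (file) {
            f_op = file->f_op;
            mask = DEFAULT_POLLMASK;
            if (f_op && f_op->poll)
                mask = (*f_op->poll)(file, retval ? NULL : wait);
            ...
        }
    }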
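The same linear scan is visible from user space. In this sketch, poll() is asked once about two pipes and the caller then has to walk the whole array to find the ready one. The names scan_ready() and poll_demo() are invented for the example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Ask once (timeout 0) which fds are readable, then scan the whole array.
 * Returns the number of readable fds, or -1 on error.
 * *first_ready receives the index of the first readable entry, or -1. */
int scan_ready(struct pollfd *fds, nfds_t nfds, int *first_ready)
{
    int ready = 0;

    *first_ready = -1;
    if (poll(fds, nfds, 0) < 0)
        return -1;

    /* Every entry is inspected, active or not. */
    for (nfds_t i = 0; i < nfds; i++) {
        if (fds[i].revents & POLLIN) {
            if (*first_ready < 0)
                *first_ready = (int)i;
            ready++;
        }
    }
    return ready;
}

/* Two pipes, only the second one has data. */
int poll_demo(int *first_ready)
{
    int a[2], b[2];
    int rc = pipe(a);
    assert(rc == 0);
    rc = pipe(b);
    assert(rc == 0);

    ssize_t w = write(b[1], "x", 1);
    assert(w == 1);

    struct pollfd fds[2] = {
        { .fd = a[0], .events = POLLIN },
        { .fd = b[0], .events = POLLIN },
    };
    int n = scan_ready(fds, 2, first_ready);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return n;
}
```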

Page 14: Epoll -  from the kernel side

SELECT/POLL Internals -- tcp_poll

• In the VFS part, the xxx_poll functions do not block
• net/ipv4/tcp.c : unsigned int tcp_poll()

    /* connect/close state checks omitted */
    ...
    if (tp->urg_seq == tp->copied_seq &&
        !sock_flag(sk, SOCK_URGINLINE) &&
        tp->urg_data)
        target++;
    if (tp->rcv_nxt - tp->copied_seq >= target)
        mask |= POLLIN | POLLRDNORM;
    ...
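The readable test above is plain unsigned arithmetic on sequence numbers, which keeps working when the 32-bit counters wrap around. A toy user-space model follows; the function name toy_tcp_readable() is invented, and the assumption that the target is at least one byte is mine for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the check "rcv_nxt - copied_seq >= target".
 * rcv_nxt:    next sequence number expected from the peer
 * copied_seq: first sequence number not yet copied to user space
 * target:     bytes that must be pending before the socket counts as readable */
int toy_tcp_readable(uint32_t rcv_nxt, uint32_t copied_seq, uint32_t target)
{
    assert(target >= 1);   /* assumption of this sketch */
    /* Unsigned subtraction gives the pending byte count even across wrap-around. */
    return (uint32_t)(rcv_nxt - copied_seq) >= target;
}
```

For example, with copied_seq at 0xFFFFFFFB and rcv_nxt at 5, ten bytes are pending even though rcv_nxt is numerically smaller.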

Page 15: Epoll -  from the kernel side

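Here Comes the EPOLL

• User space API
  – epoll_create(), epoll_ctl(), epoll_wait()
• structures

    typedef union epoll_data {
        void *ptr;
        int fd;
        uint32_t u32;
        uint64_t u64;
    } epoll_data_t;

    struct epoll_event {
        uint32_t events;      /* event mask: EPOLLIN, EPOLLOUT, EPOLLET, ... */
        epoll_data_t data;    /* user data, returned untouched by epoll_wait() */
    };

• LT/ET mode
  – LT (level-triggered, the default): reported as long as the fd is ready
  – ET (edge-triggered, EPOLLET): reported only when the fd becomes ready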
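The difference between the two modes can be observed with a pipe. This sketch writes one byte, never reads it, and calls epoll_wait() twice with a zero timeout; the function name second_wait_count() is invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register the read end of a pipe with EPOLLIN plus extra_flags
 * (0 for level-triggered, EPOLLET for edge-triggered), write one byte,
 * then call epoll_wait() twice without reading the byte.
 * Returns the number of events reported by the second call. */
int second_wait_count(uint32_t extra_flags)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd >= 0);

    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    ev.data.fd = p[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t w = write(p[1], "x", 1);
    assert(w == 1);

    struct epoll_event out;
    int first = epoll_wait(efd, &out, 1, 0);   /* both modes report the new data */
    assert(first == 1);
    int second = epoll_wait(efd, &out, 1, 0);  /* data is still unread */

    close(efd);
    close(p[0]);
    close(p[1]);
    return second;
}
```

In LT mode the unread byte is reported again on the second call. In ET mode nothing new has arrived, so the second call reports nothing, which is why ET users have to drain the fd until EAGAIN.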

Page 16: Epoll -  from the kernel side

Epoll code demo

    #define MAX_EVENTS 10

    struct epoll_event ev, events[MAX_EVENTS];
    int efd = epoll_create(1024);   /* the size is only a hint, but must be > 0 */
    ...
    epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev);
    ...
    while (1) {
        /* sleep until at least one registered fd is ready */
        int n = epoll_wait(efd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listenfd) {
                /* new connection: register it edge-triggered */
                conn = accept(listenfd, (struct sockaddr *)addr, &addrlen);
                setnonblocking(conn);
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = conn;
                epoll_ctl(efd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                do_work(events[i].data.fd);
            }
        }
    }

Page 17: Epoll -  from the kernel side

EPOLL Internals

• some structures
• kernel side:
  – eventpoll: the main structure, created by epoll_create()
  – epitem: wraps a file; this is the struct stored in the RB tree
  – eppoll_entry: wait structure for the poll hooks
  – epoll_event: same as in user space
• user space:
  – epoll_event: same as the kernel-side epoll_event above
  – epoll_data: custom data area
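Since epoll_data is a custom data area, a common pattern is to store a pointer to per-connection state in data.ptr and get it back from epoll_wait(). A minimal sketch, where struct conn and ptr_roundtrip() are hypothetical names and a pipe stands in for a connection:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection state owned by the application. */
struct conn {
    int fd;
    int id;
};

/* Registers a pipe with data.ptr pointing at a struct conn and returns
 * the id recovered from the event, or -1 if no event was reported. */
int ptr_roundtrip(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd >= 0);

    struct conn c = { .fd = p[0], .id = 42 };
    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.ptr = &c;               /* the kernel stores this value as-is */
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t w = write(p[1], "x", 1);
    assert(w == 1);

    struct epoll_event out;
    int id = -1;
    if (epoll_wait(efd, &out, 1, 0) == 1)
        id = ((struct conn *)out.data.ptr)->id;   /* same pointer comes back */

    close(efd);
    close(p[0]);
    close(p[1]);
    return id;
}
```

Because data is a union, ptr and fd cannot both be used for the same registration; when ptr is used, the fd has to live inside the pointed-to structure.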

Page 18: Epoll -  from the kernel side

EPOLL Internals

• Why does it work?
  – the kernel is event based; user space may not be
• How does it work?
  – add the fd to the epoll instance with epoll_ctl()
  – sleep in epoll_wait() to fish for active fds
  – an interrupt happens
  – the active fds are sent to user space
  – the sleeping process is woken up

Page 19: Epoll -  from the kernel side

EPOLL Internals

• add fd to epoll
• fs/eventpoll.c:
  – epoll_ctl() -> ep_insert() -> ep_rbtree_insert()
• when we add an fd to an eventpoll, first initialize the corresponding structure: epitem
• set up some callback functions for this file
• add the epitem to the rbtree
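One user-visible consequence of keeping one epitem per registered file: adding the same fd to the same epoll instance twice fails with EEXIST, as documented in the epoll_ctl man page. A small sketch, with add_twice_errno() as an invented name:

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Adds the same fd to one epoll instance twice.
 * Returns the errno of the second EPOLL_CTL_ADD, or 0 if it succeeded. */
int add_twice_errno(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd >= 0);

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = p[0];

    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);   /* first add succeeds */
    assert(rc == 0);

    int err = 0;
    if (epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        err = errno;                                 /* fd is already registered */

    close(efd);
    close(p[0]);
    close(p[1]);
    return err;
}
```

To change the event mask of an fd that is already registered, EPOLL_CTL_MOD is the operation to use.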

Page 20: Epoll -  from the kernel side

EPOLL Internals - how to sleep

• two wait_queues
  – one for the current process
  – one for ksoftirqd
• epoll_wait() system call
  – set the current process to TASK_INTERRUPTIBLE
  – schedule()

Page 21: Epoll -  from the kernel side

epoll Internals - how to wakeup

• Work flow
  – the interrupt handler runs
  – the fd becomes active
  – wait_queue #1 is activated in ksoftirqd
  – ep_poll_callback() fires and activates wait_queue #2
  – the ready fds are copied to user space
  – the user process is set to running
  – the user process is scheduled: wake up!
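Seen from user space, the flow above looks like a process that sleeps in epoll_wait() and is woken when data arrives. In this sketch a forked child plays the outside world and writes to a pipe after a short delay; wait_for_child_byte() is an invented name, and the 100 ms delay and 5 s timeout are arbitrary choices.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent sleeps in epoll_wait(); the child makes the fd active later.
 * Returns 1 if the parent was woken with the expected fd, 0 otherwise. */
int wait_for_child_byte(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int efd = epoll_create1(0);
    assert(efd >= 0);

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = p[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        poll(NULL, 0, 100);              /* child: wait about 100 ms */
        if (write(p[1], "x", 1) != 1)    /* this makes the fd active */
            _exit(1);
        _exit(0);
    }

    /* Parent: nothing is ready yet, so this call puts the process to sleep. */
    struct epoll_event out;
    int n = epoll_wait(efd, &out, 1, 5000);
    int ok = (n == 1 && out.data.fd == p[0]);

    waitpid(pid, NULL, 0);
    close(efd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```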

Page 22: Epoll -  from the kernel side

EPOLL Internals - how to wakeup

• TCP demo
  – cd net/ipv4/
  – af_inet.c: struct net_protocol tcp_protocol
  – tcp_ipv4.c: tcp_v4_rcv()
  – tcp_ipv4.c: tcp_v4_do_rcv()
  – tcp_input.c: tcp_rcv_established()
  – cd ../core
  – sock.c: sock->sk_data_ready()
  – sock.c: sock_def_readable() (the default sk_data_ready)
  – ep_poll_callback():
    • add the fd to the epoll's ready list
    • wake the process blocked above (in epoll_wait)
  – after the blocked process wakes:
    • ep_send_events()
    • ep_scan_ready_list(): copy the epoll's ready list to a temporary list (reference copy)
    • ep_send_events_proc(): transfer to user space
    • move the ovflist to the ready list of the epoll

Page 23: Epoll -  from the kernel side

EPOLL the whole picture

• two wait_queues: one for ksoftirqd, one for the user process; one fires the other
• three locks (two mutexes, one spinlock)
• an epitem red-black tree

Page 24: Epoll -  from the kernel side

Compare to the IOCP

• IOCP is AIO; EPOLL/KQUEUE is event-based multiplexing
• IOCP needs to take care of the I/O operation itself
• EPOLL is just a notification mechanism: light, flexible
• IOCP carries the overhead of a thread pool
