An Implementation of the select() Linux System Call Running on the Cal Poly Intelligent Network Interface Card Platform
A Senior Project Report Presented to the Computer Engineering Program
By
Jared Kwek
California Polytechnic State University, San Luis Obispo
Date Submitted: June 17, 2002 Advisor: Dr. Phillip Nico
Abstract
The Cal Poly Intelligent Network Interface Card (CiNIC) project is a research project at the
campus of California Polytechnic State University, San Luis Obispo, CA. This project is funded
by the 3Com Corporation and researches intelligent Network Interface Card (NIC) functionality
and performance. The CiNIC platform can offload the TCP/IP stack from an i686 (Pentium)
host computer running Linux to an EBSA-285 embedded system (co-host) running ARM/Linux,
thus freeing the host computer from having to process network traffic. This is done by
intercepting the system calls from the host machine and sending the parameters to the co-host
machine, where the system call is run and then returned back to the host.
This document describes an implementation of the select() system call for the CiNIC platform.
select() waits for events to occur on multiple files and sockets that are open on either the host
or the co-host. This means that the call cannot just run on the co-host; it must also be run
concurrently on the host. Furthermore, select() can sleep until an event occurs, which may
happen on one side (the host or co-host) but not the other. In order to prevent one side from
blocking forever, each side must be able to alert the other when it has completed.
First an overview of the project and the select() system call is presented. Then the Linux
implementation of select() is described. Following that is a discussion of my designs,
including one that did not work and one that did. Finally, my design to fix the blocking problem
is explored.
Acknowledgements
First off I would like to thank the faculty involved with the project: Dr. Hugh Smith, Dr. Phillip
Nico, and Dr. Jim Harris, for all their support and encouragement throughout the past year I have
been on the project. Also Rob McCready, Mark McClelland, and Jim Fischer for introducing me
to the project. To my partner in crime, Max “Neil a.k.a. Linux Hacker” Roth, it has been an
honor and a pleasure spending hours upon hours in the lab with you. You are always
entertaining, even into the early hours of the morning when we are drinking gallons of Sunkist
and making movies. Oh yeah, and please take your satellite dish home someday!
A couple of other accolades: Heather Heiman for always keeping the lab (and us) in order,
Americo “Bart” Melara for always being in the lab every time I come in, Jason Hatashita for
always reconfiguring the lab setup, and Clif Gordon for having senioritis together. And to
everybody else I have had the pleasure of working with in the past year, you rock!
Finally, I would like to thank my family and Michelle for all the love and support they have
given me. An extra special thanks goes to Michelle for keeping me from having a nervous
breakdown while writing this paper.
Table of Contents
Chapter 1 Introduction
Chapter 2 Background
  2.1 CiNIC Overview
  2.2 The State of the Project When I Started
  2.3 The Definition of select()
  2.4 The old_select() System Call
  2.5 The Challenges of Implementing select()
Chapter 3 The Linux Kernel’s Version of select()
  3.1 sys_select()’s Design
  3.2 do_select()’s Design
Chapter 4 The First Design
  4.1 Rob McCready’s Initial Design
  4.2 Initial Design Overview
  4.3 The Split and Merge Algorithms
  4.4 Reasons for Failure
Chapter 5 The Second Design
  5.1 Modifications to old_select() and sys_select()
  5.2 Transferring Parameters to the Co-Host
  5.3 The Co-Host Side Functions
  5.4 The Split and Merge Routines
Chapter 6 The Solution to the Blocking Problem
  6.1 The Investigation of Kernel Methods
  6.2 The Design
  6.3 The Code
  6.4 Other Issues
Chapter 7 Conclusion
References
Appendix A State Diagrams for the Solution to the Blocking Problem
Appendix B Test Plans
Appendix C select() Bit Macros
Appendix D Source Code
  D.1 select.h
  D.2 select_h.c
  D.3 select_e.c
  D.4 syscalls_h.c – n_sys_select()
  D.5 fd_map.c – split_select_bitmaps(), merge_select_bitmaps()
  D.6 CVS Version Differences
  D.7 Sample Test Program

List of Figures
  2.1 CiNIC Platform Setup
  2.2 Example select() Bitmap Setup
  4.1 Example Setup with File Descriptor Mask
  4.2 Early select() Flowchart with Problem Areas
  5.1 Example of Splitting File Descriptors
  5.2 select() Flowchart of Second Design
  6.1 Shared Memory with Queues
  A.1 do_select() on Host
  A.2 Tasklet on Host
  A.3 do_select() on Co-Host
  A.4 Tasklet on Co-Host
  C.1 The Representation of a Bitmap in Kernel Memory

List of Tables
  B.1 Test Plan Run on 9/11/01
  B.2 Test Plan Run on 2/22/02
1. Introduction
The proliferation of the Internet within our society has sparked a global revolution that continues
still today. It has brought about rapid change and growth as more and more users are connecting
to this global information superhighway. This growth has created a demand for faster and more
robust applications that can be run across the Internet. However, modern servers and computers
have not been able to keep up with these increasing demands, causing severe problems in
network and system performance. As a result, new ideas such as Intelligent Network Interface
Cards (iNICs) have come to the forefront of research for the next generation of network
technology.
The Cal Poly Intelligent Network Interface Card (CiNIC) project is a research project at the campus of
California Polytechnic State University, San Luis Obispo, CA. This project is funded by the
3Com Corporation and researches intelligent NIC functionality and performance. The CiNIC
platform can offload the TCP/IP stack from a host computer running Linux to an embedded
system (co-host) that is also running Linux. All network processing takes place on the co-host,
which will free the host computer from having to handle the potentially vast volumes of network
traffic it receives so that it can perform other tasks. The intelligent NIC also has the potential to
be able to support many advanced networking functions such as security and firewalls, web
caching, streaming media, and quality of service.
This document describes an implementation of the Linux select() system call on the CiNIC
platform. select() waits for open network connections (sockets) or files to change status. It is
unique because it is a system call that must be run on both the host and the co-host. This means
that the system call must be split between the two sides. However, if any of the open sockets or
files were to change status, both sides would have to immediately return. These extra issues
make select() one of the trickier system calls to implement for the CiNIC platform.
Nonetheless, it is a necessary one because it is used in a wide range of applications such as
Telnet, web browsing, and the X Window System (a graphical user interface for Linux).
The remainder of this document is organized as follows. Chapter 2 describes the background of
the CiNIC project and the select() system call. Chapter 3 explores the details of the Linux
kernel’s implementation of the system call. Chapter 4 goes over the initial design and why it
failed. Chapter 5 then details the successful second design, followed by Chapter 6, which
provides a solution to select() potentially blocking for long periods of time (or forever).
Chapter 7 provides a conclusion and a discussion of possible future work. Appendix A
illustrates the state diagrams for the Chapter 6 design while Appendix B shows two test plans I
implemented. Appendix C goes over some of the bit macros that are used to manipulate the
select() bitmaps. Finally, Appendix D contains the source code for this implementation.
2. Background
Before going into the details of the select() implementation, some background information
about the project and select() is needed. This chapter gives an overview of the
CiNIC project and the state of the project when I started, then defines select() and
old_select(), and finally describes the challenges of implementing select().
2.1 CiNIC Overview
The CiNIC project is run by a group of students and faculty members who call themselves the
Cal Poly Network Performance Research Group (CPNPRG). This group researches the
possibilities of offloading the TCP/IP stack of a Linux host computer onto a Linux co-host
computer. The hardware chosen for the co-host is an Intel EBSA-285 card that has an SA-110
StrongARM processor and an Intel 21285 logic chip that is used to interface the ARM processor
with the rest of the system. This card is connected to a secondary PCI bus via an Intel 21554
non-transparent PCI bridge. An SDRAM window is mapped onto the PCI bus by the 21285,
which is then translated to the primary PCI bus via the 21554. This allows the host to read from
and write to a shared memory region that exists on the co-host (see Figure 2.1). Mark
McClelland wrote the device drivers that facilitated this sharing of memory. [2]
To facilitate the offloading of the TCP/IP stack from the host to the co-host, network related
system calls are intercepted off the host and sent to the co-host, where they are run and then
returned to the host. The parameters for each system call are marshaled together into one
contiguous region (hereafter referred to as a “communications packet” or “com packet” for short)
and copied to shared memory, where the co-host un-marshals the parameters and makes the
system call on its system. When the call returns on the co-host, it copies the values back to
shared memory, where the host then picks up the return values and returns them to the user. To
maintain this process, a message passing protocol is implemented with kernel threads. Version 1
of the protocol was a polling protocol written by Rob McCready, hereafter referred to as the
“polling protocol,” that kept checking a value to see if data had been written to shared memory
[3]. Version 2, hereafter referred to as the “interrupt-driven protocol” was completed recently by
Max Roth and involves a faster interrupt mechanism through the 21554 [4].
2.2 The State of the Project When I Started
When I began on this project in Spring 2001, Mark McClelland and Rob McCready were getting
ready to graduate and were finishing their respective parts of the project. Their primary goal was
to get a File Transfer Protocol (FTP) session working across Mark’s shared memory driver [2]
and Rob’s polling protocol for transferring system calls to and from the co-host. In order to
accomplish this, the polling protocol had to intercept the socket-related system calls that FTP
uses and send their parameters to the co-host using the shared memory driver, where the call
would be run and the return values sent back to the host. This system worked successfully for
FTP by the time I arrived and focus was shifting to what needed to be done in the future.
Two major items needed attention by the time I arrived: A more powerful interrupt-driven
protocol for transferring system calls to and from the co-host and more system calls intercepted
to allow for more services. Max Roth, who started on the project roughly around the same time I
did, decided to handle the new protocol and I decided to look at the new system calls that needed
to be implemented. It was decided that select() and ioctl() were two of the most important
system calls that needed to be implemented, but they were also two of the hardest to incorporate
in the current system because they are not conventional socket calls. select() was needed for
a number of applications including web browsing, Telnet, and the X Window System. ioctl()
is used extensively in the X Window System and programs that gather information from network
device drivers. Since a goal of the project was to someday have the intelligent NIC be a device
driver, this function proved to be essential. Rob had already thought about how the select()
implementation would go, so I first turned my focus to its implementation.

Figure 2.1 – CiNIC Platform Setup. Source: [2]
2.3 The Definition of select()
The basic function of select() is to allow a process to look at a number of open file descriptors
to see if reading from or writing to them will block. The Linux manual page calls this
“synchronous I/O multiplexing”. A file descriptor is an integer number used by the kernel to
identify a file, pipe, or socket opened by a process. Each process has its own set of file
descriptors and can have up to 1024. The first three file descriptors (0, 1, 2) for each process are
reserved for standard input (stdin), standard output (stdout), and standard error output
(stderr), respectively. Three sets of file descriptors are watched: one to see if any file
descriptor in the set is ready to be read from, one to see if any file descriptor in the set is ready
to be written to, and one to see if any file descriptor in the set has an exception condition (e.g.
high-priority out-of-band data can be read without blocking) [5]. If any file descriptor in any of
the sets is ready for its given condition, select() returns the number of ready file descriptors
and modifies the sets to indicate which file descriptors are available. If none are available,
select() will block (sleep) for a specified period of time waiting for any of the file descriptors
to become ready. If any do, then select() wakes up and returns. If the time limit is reached
and no file descriptors are available, then select() returns 0. Sets that are NULL pointers are
not watched.
The select() call has 5 arguments:
int n                     The highest numbered file descriptor in any of the three sets, plus 1
fd_set* readfds           A pointer to file descriptors for reading
fd_set* writefds          A pointer to file descriptors for writing
fd_set* exceptfds         A pointer to file descriptors for exceptions
struct timeval* timeout   Maximum time to wait (setting to 0 means do not sleep and return
                          immediately, a NULL pointer means wait indefinitely until a file
                          descriptor becomes available)
struct timeval has two members: tv_sec for seconds, and tv_usec for microseconds. The
timeout parameters are modified in Linux to show how much time was remaining upon return.
This behavior is not uniform across multiple platforms, so portable code should not rely on this
value.
On error, select() returns –1 and the sets and timeout value become undefined. The following
errors are possible for errno (see man errno):
EBADF    One of the file descriptor sets specifies an invalid file descriptor
EINTR    A signal arrived before the time limit or any of the selected file descriptors became ready
EINVAL   Time limit value is incorrect or n is negative
ENOMEM   Unable to allocate memory for internal tables
There are a number of macros that can be used to manipulate the sets. fd_set is a structure that
contains an unsigned long array that has enough bits in it for the maximum number of file
descriptors. It acts as a bitmap for the file descriptors so, for example, bit 0 corresponds to file
descriptor 0, bit 1 corresponds to file descriptor 1, etc. A set bit tells select() to look at that
file descriptor while select() ignores cleared bits. Currently on the x86 platform, the limit is
1024 file descriptors. An unsigned long is 32 bits, so 1024/32 makes an array size of 32. The
following macros for fd_set can be found in the Linux kernel source in linux/time.h:
FD_ZERO(fd_set *fdset)             clear all bits in the set
FD_SET(int fd, fd_set *fdset)      set the bit for fd in the set
FD_CLR(int fd, fd_set *fdset)      clear the bit for fd in the set
FD_ISSET(int fd, fd_set *fdset)    test the bit for fd in the set
Figure 2.2 shows an example of calling FD_ZERO(&fdset) followed by FD_SET(4, &fdset).
Bit 4 is set to let select() know to look at file descriptor 4. For an example program using
select(), see Appendix D.7.
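The bitmap arithmetic described above can be sketched in a few lines of C. This is only an illustration, not the kernel's fd_set: the sk_ names are hypothetical, and the word size is fixed at 32 bits to match the x86 case discussed in the text.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: 32-bit words, mirroring unsigned long on the x86 host. */
#define SK_MAX_FDS   1024
#define SK_WORD_BITS 32
#define SK_NWORDS    (SK_MAX_FDS / SK_WORD_BITS)   /* 1024 / 32 = 32 words */

/* Hypothetical stand-in for fd_set: one bit per file descriptor. */
typedef struct {
    uint32_t words[SK_NWORDS];
} sk_fd_set;

/* Which array element holds the bit for fd. */
unsigned sk_word_index(int fd)
{
    return (unsigned)fd / SK_WORD_BITS;
}

/* Mask with a single 1 in the position for fd within its word. */
uint32_t sk_bit_mask(int fd)
{
    return (uint32_t)1 << ((unsigned)fd % SK_WORD_BITS);
}

/* Equivalent in spirit to FD_ZERO. */
void sk_zero(sk_fd_set *s)
{
    memset(s, 0, sizeof(*s));
}

/* Equivalent in spirit to FD_SET. */
void sk_set(int fd, sk_fd_set *s)
{
    s->words[sk_word_index(fd)] |= sk_bit_mask(fd);
}

/* Equivalent in spirit to FD_CLR. */
void sk_clr(int fd, sk_fd_set *s)
{
    s->words[sk_word_index(fd)] &= ~sk_bit_mask(fd);
}

/* Equivalent in spirit to FD_ISSET: nonzero if the bit for fd is set. */
int sk_isset(int fd, const sk_fd_set *s)
{
    return (s->words[sk_word_index(fd)] & sk_bit_mask(fd)) != 0;
}
```

Calling sk_zero() followed by sk_set(4, ...) leaves only bit 4 of the first word set, which is the situation Figure 2.2 depicts.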
select() has a number of uses. It is often used with sockets to see which sockets contain data to
be read, and its timeout lets the user specify how long to wait before asking for a
retransmission. Telnet uses it to check for data to be read from either STDIN or the Telnet socket
when it is waiting for the user to type something. It is also useful on a server that supports
multiple clients. accept() usually blocks if there is no connection, so it could only process one
connection at a time. However, select() can check multiple sockets for a connection and then
fork off threads to accept those. Finally, select() can be used to sleep for a given timeout by
setting n to 0, setting all three sets to NULL, and specifying a timeout value.
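Two of these uses can be illustrated with a minimal user-space sketch. It is not the test program from Appendix D.7; the function names are hypothetical, and a pipe stands in for a socket so that the example needs no network setup.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Watch the read end of a fresh pipe with select().
 * If write_first is nonzero, one byte is written before the call so the
 * read end is ready. Returns select()'s return value, or -1 on setup error.
 */
int pipe_select_demo(int write_first, long wait_usec)
{
    int p[2];
    fd_set readfds;
    struct timeval tv;
    int ret;

    if (pipe(p) < 0)
        return -1;
    if (write_first && write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    FD_ZERO(&readfds);
    FD_SET(p[0], &readfds);          /* watch only the read end */
    tv.tv_sec = 0;
    tv.tv_usec = wait_usec;

    /* n is the highest watched descriptor plus 1 */
    ret = select(p[0] + 1, &readfds, NULL, NULL, &tv);
    if (ret > 0 && !FD_ISSET(p[0], &readfds))
        ret = -1;                    /* should not happen: ready count without our bit */

    close(p[0]);
    close(p[1]);
    return ret;
}

/* Use select() purely as a timer: n = 0, all sets NULL, timeout only. */
int sleep_with_select(long usec)
{
    struct timeval tv;

    tv.tv_sec = usec / 1000000;
    tv.tv_usec = usec % 1000000;
    return select(0, NULL, NULL, NULL, &tv);
}
```

With a byte already in the pipe the call returns 1 immediately; with an empty pipe it returns 0 once the timeout expires.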
2.4 The old_select() System Call
While running strace on Netscape to determine how it uses select() (strace shows
information about system calls used by a program), I noticed that it used a system call called
old_select(). This is different from the system call used by other programs, which was regular
select(). I did some investigating and found out that there are two different select() calls in the
Linux kernel. old_select() has system call number 82 (__NR_select) and select() has a
system call number of 142 (__NR__newselect). old_select() comes from the days when
system calls could not have 5 arguments in them. So this function passes one pointer to all the
arguments for the kernel to handle. The new select() did not come around until the 2.x Linux
kernels. I do not know how or if this function can be called using the C library. My only guess
as to why Netscape uses it is for compatibility with older kernels or machines.
In the 2.x kernels, old_select() is located in arch/i386/kernel/sys_i386.c. All it does is
call copy_from_user() on the pointer to copy the parameters into a structure inside the kernel
containing the five parameters of sys_select(). For the pointers, only the pointer value is
copied, not the data it points to. This structure is then used to call sys_select() with the
appropriate parameters. The new select() goes straight to sys_select() when it is called.
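The relationship between the two entry points can be sketched in user space as follows. This is an assumption-laden illustration rather than the kernel code: the struct and function names are hypothetical, a plain structure copy stands in for copy_from_user(), and the C library's select() stands in for sys_select().

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical mirror of the five select() parameters packed into one block. */
struct sk_sel_args {
    int n;
    fd_set *inp, *outp, *exp;
    struct timeval *tvp;
};

/* Stand-in for sys_select(): takes the five parameters directly. */
int sk_new_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                  struct timeval *tvp)
{
    return select(n, inp, outp, exp, tvp);
}

/*
 * Stand-in for old_select(): receives one pointer, copies the block,
 * and forwards the unpacked values. Only the pointer values are copied,
 * not the sets or the timeval they point to.
 */
int sk_old_select(const struct sk_sel_args *user_args)
{
    struct sk_sel_args a = *user_args;   /* plays the role of copy_from_user() */

    return sk_new_select(a.n, a.inp, a.outp, a.exp, a.tvp);
}

/* Example call: no descriptors watched and a zero timeout, so it returns 0. */
int sk_old_select_demo(void)
{
    struct timeval tv = { 0, 0 };
    struct sk_sel_args args = { 0, NULL, NULL, NULL, &tv };

    return sk_old_select(&args);
}
```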
fd #   0  1  2  3  4  5  ...  1023
bits   0  0  0  0  1  0  ...  0

Figure 2.2 – Example select() Bitmap Setup
2.5 The Challenges of Implementing select()
The primary issue with implementing select() in our system is that it can contain file
descriptors for both the co-host and host sides. This means that we must somehow split the call
into two, run it on both the host and co-host, and then merge the results back together and return
to the user. All this must be transparent to the user as if we had not intercepted the call at all.
The reason why this is such a difficult task is because the current driver architecture intercepts
system calls that usually contain only one file descriptor. That file descriptor is checked to see if
it was created for the co-host. If it is, the parameters are sent to the co-host so that the system
call can run there. There is no need to run it on the host since all the work with that file
descriptor is done on the co-host. If the file descriptor is not created for the co-host, the original
system call on the host is called without sending anything to the co-host. With select(),
however, we have to be able to figure out which file descriptors go where, so the call can be split
up to run concurrently on both sides.
There are also issues with running select() concurrently on both sides. Since select() can
potentially block, we could find ourselves in the position where one side is blocking while the
other is finished. The blocking side could eventually time out and then return, but this could
cause a large delay in the system. Additionally, the system could potentially hang if there is no
timeout value and the blocking side never finds a ready file descriptor. Since the goal is to make
this seem like a normal select() call, there needs to be a mechanism where a side that is ready
to return can notify a side that is blocking so that it can wake up and also return.
Finally, there are issues with what to return to the user once the call has completed. There will
be two sets of parameters and two return values. The file descriptor bitmaps need to be merged
back together somehow so that the user can see both sides of file descriptors in the same sets.
There could potentially also be two different timeout values upon return, so it needs to be
decided which value to return. With the actual return values, we could simply add the two sides
together upon success, but it needs to be decided how to handle the situation where one or both
sides return an error.
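To make the split and merge problem concrete, here is a purely illustrative sketch. It is not the design described in the later chapters or the code in Appendix D.5. It assumes a hypothetical ownership mask with a 1 bit for every descriptor that lives on the co-host, assumes descriptors keep the same bit position on both sides, and picks one arbitrary policy for errors (the first negative value wins).

```c
#include <stdint.h>

#define SK_WORDS 32   /* 1024 descriptors / 32 bits per word */

/* Split one user bitmap into a host part and a co-host part. */
void sk_split(const uint32_t *user, const uint32_t *cohost_mask,
              uint32_t *host_out, uint32_t *cohost_out)
{
    for (int i = 0; i < SK_WORDS; i++) {
        cohost_out[i] = user[i] & cohost_mask[i];    /* bits owned by the co-host */
        host_out[i]   = user[i] & ~cohost_mask[i];   /* everything else stays on the host */
    }
}

/* Merge the two result bitmaps back into the set the user will see. */
void sk_merge(const uint32_t *host_res, const uint32_t *cohost_res,
              uint32_t *user_out)
{
    for (int i = 0; i < SK_WORDS; i++)
        user_out[i] = host_res[i] | cohost_res[i];
}

/*
 * Combine the two return values. Assumed policy: if either side failed,
 * report that error; otherwise add the two ready counts together.
 */
int sk_merge_ret(int host_ret, int cohost_ret)
{
    if (host_ret < 0)
        return host_ret;
    if (cohost_ret < 0)
        return cohost_ret;
    return host_ret + cohost_ret;
}
```

The sketch leaves out the timeout value and the blocking problem entirely, which are the harder parts of the challenge described above.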
The solutions to these issues that we came up with are presented in the following chapters.
3. The Linux Kernel’s Version of select()
This chapter will discuss the implementation of select() in the Linux 2.4.2 kernel. The two
main functions associated with select() are sys_select() and do_select(). The following
sections provide a high-level overview of each function followed by a walk-through of the code.
The macros are described in Appendix C.
3.1 sys_select()’s Design
sys_select(), which is located in fs/select.c, is the function called first once the system
call is transferred to kernel space. It is the wrapper function for the core of the select() call,
do_select(). First, it checks the n and timeout parameters to make sure that they are in the
correct range. The struct timeval value is changed to jiffies, which is the kernel’s internal
view of time, for use in do_select(). It then takes the user space file descriptor bitmaps and
sets them up in kernel memory. do_select() is then called. Once do_select() has finished,
the timeout value is converted back to a struct timeval and do_select()’s return value is
checked to see if an error was returned. If no error was returned, then it checks if the return
value is 0. If it is 0, then it checks if a signal is pending (i.e. it may have been interrupted). If a
signal is pending, then the system call will be restarted. The parameters are then copied back to
user space and the function returns.
Following are the declarations for sys_select():

    asmlinkage long
    sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp)
    {
        fd_set_bits fds;
        char *bits;
        long timeout;
        int ret, size;

1. Get the timeout value from user space and convert it from a struct timeval to a long
(jiffies). If the user space pointer is NULL or if the number of seconds is MAX_SELECT_SECONDS
or more, then the timeout value is set to MAX_SCHEDULE_TIMEOUT, which is the
equivalent of waiting forever.

        timeout = MAX_SCHEDULE_TIMEOUT;
        if (tvp) {
            time_t sec, usec;

            if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))
                || (ret = __get_user(sec, &tvp->tv_sec))
                || (ret = __get_user(usec, &tvp->tv_usec)))
                goto out_nofds;

            ret = -EINVAL;
            if (sec < 0 || usec < 0)
                goto out_nofds;

            if ((unsigned long) sec < MAX_SELECT_SECONDS) {
                timeout = ROUND_UP(usec, 1000000/HZ);
                timeout += sec * (unsigned long) HZ;
            }
        }

2. Check for invalid values of n. max_fdset is the current maximum number of file descriptors,
which is normally 1024. A negative n is rejected with -EINVAL, and an n larger than max_fdset
is clamped to max_fdset.

        ret = -EINVAL;
        if (n < 0)
            goto out_nofds;

        if (n > current->files->max_fdset)
            n = current->files->max_fdset;

3. Set up a memory region for an fd_set_bits (fds). The structure looks like this
(include/linux/poll.h):

    typedef struct {
        unsigned long *in, *out, *ex;
        unsigned long *res_in, *res_out, *res_ex;
    } fd_set_bits;
This structure allows for the memory region to be long-aligned and scalable. It is only as big
as the n parameter. Below, FDS_BYTES is used to determine how many bytes are needed for
a given value of n (see Appendix C). select_bits_alloc() then calls kmalloc() to
allocate a memory region that is 6*size bytes big. Then each pointer is assigned to a memory
address in that region. Here is the code from sys_select():

        ret = -ENOMEM;
        size = FDS_BYTES(n);
        bits = select_bits_alloc(size);
        if (!bits)
            goto out_nofds;
        fds.in      = (unsigned long *)  bits;
        fds.out     = (unsigned long *) (bits +   size);
        fds.ex      = (unsigned long *) (bits + 2*size);
        fds.res_in  = (unsigned long *) (bits + 3*size);
        fds.res_out = (unsigned long *) (bits + 4*size);
        fds.res_ex  = (unsigned long *) (bits + 5*size);

4. get_fd_set() is called to copy the file descriptor sets from user space into the newly set up
memory region. If any of the sets are NULL, the memory is filled with 0’s. zero_fd_set()
is then called to zero out the memory region occupied by the result (res) side of the struct
fd_set_bits. Now that the memory region is set up, do_select() is called, which returns
the number of file descriptors available in the bitmaps and populates the res bitmaps with
those file descriptors.

        if ((ret = get_fd_set(n, inp, fds.in)) ||
            (ret = get_fd_set(n, outp, fds.out)) ||
            (ret = get_fd_set(n, exp, fds.ex)))
            goto out;
        zero_fd_set(n, fds.res_in);
        zero_fd_set(n, fds.res_out);
        zero_fd_set(n, fds.res_ex);

        ret = do_select(n, &fds, &timeout);

5. The timeout value is put back into a struct timeval and copied to user space.

        if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
            time_t sec = 0, usec = 0;
            if (timeout) {
                sec = timeout / HZ;
                usec = timeout % HZ;
                usec *= (1000000/HZ);
            }
            put_user(sec, &tvp->tv_sec);
            put_user(usec, &tvp->tv_usec);
        }

6. A return value less than 0 means error. A zero return value could mean that the system call
was unable to finish, so check if a signal is pending and, if it is, then return
-ERESTARTNOHAND, which will re-execute the system call after the signal handler termination
(this value does not get passed to the user program).

        if (ret < 0)
            goto out;
        if (!ret) {
            ret = -ERESTARTNOHAND;
            if (signal_pending(current))
                goto out;
            ret = 0;
        }

7. set_fd_set() copies the information in the result part of fd_set_bits (populated by
do_select()) to user space if the user space address is not NULL. Then the memory region
allocated by kmalloc() is freed by calling kfree() inside select_bits_free() and the
return value is returned.

        set_fd_set(n, inp, fds.res_in);
        set_fd_set(n, outp, fds.res_out);
        set_fd_set(n, exp, fds.res_ex);

    out:
        select_bits_free(bits, size);
    out_nofds:
        return ret;
    }
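The timeout conversions in steps 1 and 5 can be tried out in user space with the sketch below. It is a worked example under stated assumptions, not kernel code: HZ is assumed to be 100 (10 ms per jiffy), the sk_ names are hypothetical, and the ROUND_UP and MAX_SELECT_SECONDS definitions are written from memory of the 2.4 sources and should be checked against fs/select.c.

```c
#include <limits.h>

#define SK_HZ 100                                   /* assumption: 100 jiffies per second */
#define SK_MAX_SCHEDULE_TIMEOUT LONG_MAX            /* "wait forever" */
#define SK_MAX_SELECT_SECONDS \
    ((unsigned long)(SK_MAX_SCHEDULE_TIMEOUT / SK_HZ) - 1)
#define SK_ROUND_UP(x, y) (((x) + (y) - 1) / (y))   /* integer division, rounding up */

/* Step 1: struct timeval fields to jiffies. */
long sk_timeval_to_jiffies(long sec, long usec)
{
    long timeout = SK_MAX_SCHEDULE_TIMEOUT;

    if ((unsigned long)sec < SK_MAX_SELECT_SECONDS) {
        /* partial jiffies are rounded up so the wait is never too short */
        timeout = SK_ROUND_UP(usec, 1000000 / SK_HZ);
        timeout += sec * (long)SK_HZ;
    }
    return timeout;
}

/* Step 5: remaining jiffies back to seconds and microseconds. */
void sk_jiffies_to_timeval(long timeout, long *sec, long *usec)
{
    *sec = timeout / SK_HZ;
    *usec = (timeout % SK_HZ) * (1000000 / SK_HZ);
}
```

Under these assumptions a timeout of 2.5 seconds becomes 250 jiffies, and a request for a single microsecond is rounded up to one full jiffy, so the value copied back to the user has a granularity of 10 ms.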
3.2 do_select()’s Design
do_select(), which is also located in fs/select.c, is the heart of the select() system call.
It takes care of checking each file descriptor in the bitmaps set up by sys_select(). The
available file descriptors are set in the result bitmaps of the fd_set_bits struct. First, the
maximum file descriptor is found in any of the sets. Then a list of wait queues is initialized if
there is a timeout value. This list is used by select() to sleep on a number of file descriptors.
It wakes up if an event happens on any of them. Next, each file descriptor up to the maximum
found is looked at to see if its bit is set in any of the sets. If it is set, then the poll() method is
called for the type of file descriptor that it is. The poll() method sets up the wait queues for
this file descriptor and adds them to the list [5]. A mask is then returned to indicate the status of
the file descriptor. The mask is compared to the mask it should have for the set(s) it is in and the
result bit is set if it matches. If no file descriptors are found to be available in any of the sets, the
process goes to sleep for the specified timeout period or until one of the file descriptors in the
wait queue list becomes ready. When it wakes up, this sequence is repeated to find out if a file
descriptor became available, if the timeout expired, or if a signal is pending. If any of these
conditions exist, the function returns the number of file descriptors found to be available. A 0
return value means either the timeout expired or a signal interrupted the system call.
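The mask comparison at the center of this loop can be sketched as follows. The per-set masks here are a simplified assumption modelled on the kernel's idea of "which poll bits count as readable, writable, or exceptional"; the kernel's own definitions include additional bits, and the sk_ names are hypothetical.

```c
#include <poll.h>

/* Simplified assumption: poll bits that satisfy each of the three sets. */
#define SK_POLLIN_SET  (POLLIN | POLLHUP | POLLERR)
#define SK_POLLOUT_SET (POLLOUT | POLLERR)
#define SK_POLLEX_SET  (POLLPRI)

/* One result bit per set for a single file descriptor. */
struct sk_result {
    int in, out, ex;
};

/*
 * Given the mask a poll() method returned for one descriptor and flags saying
 * which sets the descriptor was requested in, set the matching result bits
 * and return how many were set (this is what gets added to the return value).
 */
int sk_check_fd(short mask, int want_in, int want_out, int want_ex,
                struct sk_result *res)
{
    int ready = 0;

    res->in = res->out = res->ex = 0;
    if (want_in && (mask & SK_POLLIN_SET)) {
        res->in = 1;
        ready++;
    }
    if (want_out && (mask & SK_POLLOUT_SET)) {
        res->out = 1;
        ready++;
    }
    if (want_ex && (mask & SK_POLLEX_SET)) {
        res->ex = 1;
        ready++;
    }
    return ready;
}
```

A descriptor that is ready in two requested sets is counted twice, which is why select()'s return value can exceed the number of distinct descriptors.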
Following are the declarations for do_select():

    int do_select(int n, fd_set_bits *fds, long *timeout)
    {
        poll_table table, *wait;
        int retval, i, off;
        long __timeout = *timeout;

1. max_select_fd() checks for bad file descriptors and returns the maximum file descriptor in
any of the sets, plus one. If there is a bad file descriptor, i.e. no open file for that file
descriptor, then it returns -EBADF.

        read_lock(&current->files->file_lock);
        retval = max_select_fd(n, fds);
        read_unlock(&current->files->file_lock);

        if (retval < 0)
            return retval;
        n = retval;

2. poll_initwait() is called, which initializes the error value to 0 and the table value to NULL
inside the poll_table structure (these are the only two fields). The error value is an integer
and the table value is a pointer to a struct poll_table_page. This is how struct
poll_table_page looks (fs/select.c):

    struct poll_table_entry {
        struct file * filp;
        wait_queue_t wait;
        wait_queue_head_t * wait_address;
    };

    struct poll_table_page {
        struct poll_table_page * next;
        struct poll_table_entry * entry;
        struct poll_table_entry entries[0];
    };

wait is set to the address of table, but if there is no timeout value, then it is set to NULL (i.e.
just poll without any wait queues). The return value is initialized to 0.

        poll_initwait(&table);
        wait = &table;
        if (!__timeout)
            wait = NULL;
        retval = 0;
14
3. This is the main select() loop. The first step is to set the current state to
TASK_INTERRUPTIBLE, which allows the process to be woken up if wake_up() or
wake_up_interruptible() is called on any of the wait queues (more on this later). The
state is changed here rather than right before going to sleep to avoid a race condition where
the condition we sleep on changes between the time we test it and the time we go to sleep. If
we are sleeping on a wait queue whose condition has already occurred, there could be a delay
or lockup. Setting to TASK_INTERRUPTIBLE before checking all the file descriptors rather
than right before going to sleep ensures that, if any of the wait queues set the process’s state
to TASK_RUNNING (i.e. the condition occurred), then the worst that could happen when
schedule_timeout() is called is that the process would be rescheduled on the running
queue [5]. After setting the state, it goes through each file descriptor up to n doing the
following:
a) Each unsigned long is 8*sizeof(unsigned long) bits, which is what the constant
__NFDBITS is set to (32 on Intel). BIT puts a ‘1’ in the correct position in the
unsigned long variable ‘bit’ while ‘off’ finds the correct unsigned long word to
put it in. For example, if we were on file descriptor 5, bit would be set to 32
decimal, or 100000 binary and off would be 5/32 or 0. For another example, if the file
descriptor were 42, bit would be set to 1024, or 10000000000 binary and off would
be 42/32 or 1. File descriptor 5 is in the first unsigned long of the set while 42 is in
the second unsigned long of the set. See Appendix C for a further discussion.

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            for (i = 0 ; i < n; i++) {
                unsigned long bit = BIT(i);
                unsigned long mask;
                struct file *file;

                off = i / __NFDBITS;
b) BITS returns a mask of the bits set in all three of the file descriptor sets for a
particular offset (long word). Each set is OR’ed together to create this mask. Our bit
is then AND’ed with this value to see if any of the sets contain this file descriptor. If
not, it then skips the rest of the loop and goes on to the next file descriptor.

                if (!(bit & BITS(fds, off)))
                    continue;
c) The file structure is then filled in with the appropriate information and the poll()
method is called for that specific type of file or socket. The device method for
poll() is in charge of calling poll_wait() “on one or more wait queues that could
indicate a change in the poll status” and a bit mask is returned that indicates the
“operations that could be immediately performed without blocking” [5]. wait keeps
track of all the file descriptors and their wait queues.

                file = fget(i);
                mask = POLLNVAL;
                if (file) {
                    mask = DEFAULT_POLLMASK;
                    if (file->f_op && file->f_op->poll)
                        mask = file->f_op->poll(file, wait);
                    fput(file);
                }
d) For each of the three bitmaps, if the file descriptor is in that set, then the mask is
AND’ed with the poll mask for that set (POLLIN_SET for read, POLLOUT_SET for
write, and POLLEX_SET for exceptions). If the two masks have at least one bit in
common, then the bit is set in the result field for that bitmap, the return value is
incremented, and the poll table is set to NULL. The poll table is set to NULL after any
increment of the return value because we can stop populating the wait queues due to
the fact that this function will return. It is always set to NULL after the first full pass
through all the file descriptors because all the wait queues would then have been
populated.

                if ((mask & POLLIN_SET) && ISSET(bit, __IN(fds,off))) {
                    SET(bit, __RES_IN(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLOUT_SET) && ISSET(bit, __OUT(fds,off))) {
                    SET(bit, __RES_OUT(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLEX_SET) && ISSET(bit, __EX(fds,off))) {
                    SET(bit, __RES_EX(fds,off));
                    retval++;
                    wait = NULL;
                }
            }
            wait = NULL;
e) After each iteration through all the file descriptors, if any of the below conditions
are met, it breaks out of the loop.

            if (retval || !__timeout || signal_pending(current))
                break;
            if(table.error) {
                retval = table.error;
                break;
            }
            __timeout = schedule_timeout(__timeout);
        }
Otherwise, schedule_timeout() is called with the timeout value specified. Since
our state is TASK_INTERRUPTIBLE, schedule_timeout() will sleep for the period of
time specified by __timeout until either that time expires or it is awoken for another
reason, i.e. if one of the wait queues wakes up the process or a signal is received [1].
Also if MAX_SCHEDULE_TIMEOUT is passed to schedule_timeout(), like in the
instance when the timeout value passed from user space is NULL, it calls schedule()
with no bound on the timeout. This process will then sleep until woken up by
something else that set its state to TASK_RUNNING. Following is the code for
schedule_timeout (kernel/sched.c):

    signed long schedule_timeout(signed long timeout)
    {
        struct timer_list timer;
        unsigned long expire;

        switch (timeout)
        {
        case MAX_SCHEDULE_TIMEOUT:
            schedule();
            goto out;
        default:
            if (timeout < 0)
            {
                printk(KERN_ERR "schedule_timeout: wrong timeout "
                       "value %lx from %p\n", timeout,
                       __builtin_return_address(0));
                current->state = TASK_RUNNING;
                goto out;
            }
        }

        expire = timeout + jiffies;

        init_timer(&timer);
        timer.expires = expire;
        timer.data = (unsigned long) current;
        timer.function = process_timeout;

        add_timer(&timer);
        schedule();
        del_timer_sync(&timer);

        timeout = expire - jiffies;

    out:
        return timeout < 0 ? 0 : timeout;
    }
After schedule_timeout() is called, there is at least one more iteration through the
file descriptors.
4. After breaking out of the loop, the state is set to TASK_RUNNING so that it is no longer in a
sleep state and poll_freewait() is called to depopulate all the wait queues and free the poll
table pages. The timeout value is updated and the return value is returned.

        current->state = TASK_RUNNING;
        poll_freewait(&table);
        *timeout = __timeout;
        return retval;
    }
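The bit arithmetic in steps 3a and 3b can be tried outside the kernel. The sketch below is an illustration only: fd_bit(), fd_word(), and fd_is_watched() are hypothetical stand-ins for the roles played by BIT, off, and BITS, and it assumes 32-bit bitmap words (as on the i686 host) regardless of the machine it is compiled on.

```c
#include <stdint.h>

#define WORD_BITS 32u   /* assumption: 32-bit bitmap words, as on Intel */

/* Single-bit mask for fd within its word (the role BIT(i) plays). */
uint32_t fd_bit(unsigned fd)
{
    return (uint32_t)1 << (fd % WORD_BITS);
}

/* Index of the word that holds fd (the role 'off' plays). */
unsigned fd_word(unsigned fd)
{
    return fd / WORD_BITS;
}

/* Nonzero if fd is present in at least one of the three input sets.
 * OR-ing the three words mirrors what BITS(fds, off) does. */
int fd_is_watched(const uint32_t *in, const uint32_t *out,
                  const uint32_t *ex, unsigned fd)
{
    unsigned off = fd_word(fd);
    uint32_t watched = in[off] | out[off] | ex[off];

    return (watched & fd_bit(fd)) != 0;
}
```

Under these assumptions file descriptor 42 lands in word 1 with mask 1 << 10 (1024), and file descriptor 5 lands in word 0 with mask 1 << 5.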
4. The First Design
For the first design, I tried to make our implementation of select() similar to the way other
functions were intercepted in the kernel. The only exception was that the bitmaps had to be split
apart at the beginning and merged back together at the end. I went through many phases with
this first design as I was trying to understand both how select() was implemented in the kernel
and how the polling protocol worked. What follows is a rough timeline of the process,
then a description of the errors and shortcomings that ultimately led to the failure of this
design.
4.1 Rob McCready’s Initial Design
Rob McCready had already started looking at this system call by the time I came on the project.
He described to me the bitmap arrangement and how select() would be different than other
system calls. Additionally, he had started changing some of the design of the protocol to allow
the system call to be sent to the co-host while calling it locally. Before, the process would block
until the co-host returned.
His main idea was to create an fd_set bitmap in his file descriptor translation mapping structure
that sets the corresponding bit when socket() is called. This could then be used as a mask
when the select() bitmap parameters are passed in (see Figure 4.1). ANDing the bitmaps with
the mask tells which host-side file descriptors belong to the co-host. These file descriptors
would then have to be mapped to the co-host file descriptor numbers using the polling protocol’s
translation method. The way this works is that, when socket() is called, it is run on both the
host and the co-host and both sides return file descriptors. Then these file descriptors are copied
into two arrays: One tells the host what the corresponding co-host file descriptor is, the other
tells the co-host what the corresponding host file descriptor is. For example, in Figure 4.1, the
host returned a file descriptor number of 3 and the co-host returned a file descriptor number of 7,
so a 7 would be put in the third array element of the host-to-co-host translation and a 3 would be
put in the seventh element of the co-host-to-host translation [3]. The split function would return
two pointers to structures that contain the co-host and host file descriptor values (respectively) to
look at in each of the sets and a flag that indicates if any of the bits had been set.
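The two translation arrays can be sketched as follows. This is an assumption-laden illustration rather than the project's fd_map.c: the type fd_xlate_t and the xlate_* functions are hypothetical names, the table size of 1024 is assumed from the descriptor limit mentioned in this report, and -1 marks an unused entry.

```c
#define MAX_FDS    1024   /* assumption: one slot per possible descriptor */
#define NO_MAPPING (-1)   /* unused slots hold -1 */

typedef struct {
    int host_to_cohost[MAX_FDS];   /* index: host fd, value: co-host fd */
    int cohost_to_host[MAX_FDS];   /* index: co-host fd, value: host fd */
} fd_xlate_t;

/* Mark every slot in both directions as unmapped. */
void xlate_init(fd_xlate_t *t)
{
    int i;

    for (i = 0; i < MAX_FDS; i++) {
        t->host_to_cohost[i] = NO_MAPPING;
        t->cohost_to_host[i] = NO_MAPPING;
    }
}

/* Record the pair returned by socket() on each side. Returns 0, or -1 if
 * either descriptor is out of range. */
int xlate_add(fd_xlate_t *t, int host_fd, int cohost_fd)
{
    if (host_fd < 0 || host_fd >= MAX_FDS ||
        cohost_fd < 0 || cohost_fd >= MAX_FDS)
        return -1;
    t->host_to_cohost[host_fd] = cohost_fd;
    t->cohost_to_host[cohost_fd] = host_fd;
    return 0;
}

/* Look up the co-host descriptor for a host descriptor (or NO_MAPPING). */
int xlate_to_cohost(const fd_xlate_t *t, int host_fd)
{
    if (host_fd < 0 || host_fd >= MAX_FDS)
        return NO_MAPPING;
    return t->host_to_cohost[host_fd];
}

/* Look up the host descriptor for a co-host descriptor (or NO_MAPPING). */
int xlate_to_host(const fd_xlate_t *t, int cohost_fd)
{
    if (cohost_fd < 0 || cohost_fd >= MAX_FDS)
        return NO_MAPPING;
    return t->cohost_to_host[cohost_fd];
}
```

For the example in the text, xlate_add(&t, 3, 7) stores 7 at index 3 of the host-to-co-host array and 3 at index 7 of the co-host-to-host array.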
There were 4 cases he determined that we had to deal with. The first case is where all the file
descriptors are on the host, in which case the select() call would only be executed locally.
The second case is where all the file descriptors are on the co-host, whereby the call would be
sent to the co-host only. The third case is where there are both host and co-host file descriptors.
For this case, we would have to send the com packet to the co-host and then call the function
locally on the host. We would then attempt to lock waiting for the co-host to return. If it already
had returned, then the call would go through the lock, otherwise it would sleep until the co-host
returned. Then the sets would have to be translated and merged. The final case is where one or
both sides return an error. That error should be propagated to the user and, if both sides return an
error, one needs to take precedence over the other. I used this design as a basis for starting my
work.
4.2 Initial Design Overview
Figure 4.2 shows a flowchart of my initial assessment along with some of the problems I found
associated with it at the time. I found that there would be some major synchronization issues
with splitting the call into two. In the initial design phase, I had to come up with some basic
parameters:
[Figure 4.1 – Example Setup with File Descriptor Mask: the host->EBSA and EBSA->host file descriptor mapping arrays (array index and contents, with -1 marking unused entries) shown next to the fd_mask bitmap for file descriptors 0 through 1023, in which the bits for file descriptors 3 and 4 are set.]
1. Only do what is necessary to avoid large overhead. If after some initial checks the
parameters contain errors, immediately return and do not send anything to the co-
host. If there are only co-host-side file descriptors or only host-side file descriptors,
only call select() on the host or co-host.
2. There will need to be two sets of parameters, one for the host and one for the co-host,
and they need to be separated so that the system call can be made on both platforms
and then somehow merged back together.
3. Any errors are returned to the user. If both calls return an error, the host-side error is
returned. This is done because most of the applications running on the host would
find errors on the same machine more useful.
4. Return to the user the sum of the number of file descriptors found on both sides (or an
error), the three file descriptor sets containing both sides’ resulting file descriptors,
and the elapsed time as the timeout value.
5. Whenever a file descriptor is or becomes available, the select() call has to return.
If this is not done, splitting the calls could potentially cause large delays or hang the
system.
[Figure 4.2 – Early select() Flowchart with Problem Areas]
Flowchart boxes:
  - Intercept select() on host.
  - Find out which file descriptors are being watched. This is done by checking for 1’s
    (FD_ISSET) in each bitmap.
  - Of the file descriptors being watched, find out which ones are socket descriptors by
    checking the current process’s file descriptor translation table (Rob’s code).
  - Create two separate sets of arguments for select(): one to hold the local file descriptors,
    and one to hold the socket file descriptors. This will require making six bitmaps total and
    either masking bits or using FD_SET and FD_CLR. Reset to new values for n. The socket
    descriptor bitmaps will have to be translated from host descriptors to EBSA descriptors
    before placing them in the bitmap (Rob’s code). Make sure to have checks for NULL fd_set’s.
  - Execute select() on the host using the bitmaps that contain the local set of arguments.
  - Marshal into a com packet the arguments for select() that contain the socket set of
    arguments and place on queue to send to EBSA. Do not lock.
  - When/if select() on the EBSA returns, check return code.
  - When/if select() on the host returns, check the return code.
    (Branches are labeled Return -1, Return 0, Return > 0, and Return ≥ 0.)
  - Look in bitmaps to see which descriptors remain and translate from EBSA descriptors to
    host descriptors (Rob’s code).
  - Add return value from each call together and OR matching bitmaps.
  - Return with return value.
Problems noted on the flowchart:
  - Two select() calls, old_select() and sys_select().
  - Slow. May be faster to directly check fd_bits array, but not sure if this is okay to do.
    Can we intercept macros (doubt it)?
  - What about the timeval’s?
  - What about the error codes on the EBSA? Rob’s code does nothing with them.
Issues and potential problems with splitting up select():
  1. If one finishes before the other, how long to wait?
  2. What if one never returns?
  3. What if one returns an error and the other doesn’t?
  4. Synchronization issues, waiting too long and getting too many descriptors back (than if
     it was only called in one place).
Seems like splitting up can cause some major problems.
Taking all this into consideration, I designed an algorithm that I proposed at a design review on
July 31, 2001:

    typedef struct {
        int maxfd;
        fd_set *read, *write, *except;
    } select_split_t;

    select_split_t* host;
    select_split_t* ebsa;

1. If n < 0 return -EINVAL
2. For each file descriptor set that is not NULL,
a. Allocate the host side of the set (i.e. either host->read, host->write, or
host->except) and copy the parameter passed in from user space (i.e. either
readfds, writefds, or exceptfds) into it.
b. Allocate the co-host side of the set. For each file descriptor set that is NULL,
set the host and co-host sides for that set to NULL.
3. Set host->maxfd to n and if it is greater than 1024 (max number of file descriptors),
then set to 1024.
4. Split bitmaps.
5. If there are co-host-side file descriptors then marshal co-host side parameters into a
com packet and send to co-host to call select() with a timeout of 0.
6. If there are host side file descriptors then call select() on local host with a timeout
of 0.
7. Attempt to lock. If the co-host has already returned or was not called, it will go
through the lock without blocking. Otherwise it will wait for the co-host to return.
8. If any errors were returned, then, if there were any co-host file descriptors, merge
them with the host ones, copy host sets into the original parameters, and return.
9. Add the return values together. If the result is larger than 0, that means file
descriptors are ready. If there are co-host file descriptors, merge the bitmaps
together. Copy host sets back into parameters and return.
10. If both return 0, then set a timer to the timeout value specified by the fifth parameter.
This can be accomplished by either calling schedule_timeout() with the timeout
value or calling select() with all other parameters set to NULL or 0’s. If the timeout
value passed in is NULL, then we can set a timer to the timeout value
MAX_SCHEDULE_TIMEOUT and call schedule_timeout() that will, in turn, reschedule
the current process. We could also just call schedule() manually.
11. If there was a timeout value specified, then repeat steps 5-9. If the timer has not
expired, then repeat step 10 and then steps 5-10 again until either the timer
expires or either return value is not equal to 0. Anytime there is a return value greater
than 0, this process will stop on step 9. Anytime there is a return value less than 0,
this process will stop on step 8. If the timeout value is NULL, then steps 5-10 are repeated
indefinitely until either of the return values is not equal to 0.
Calling select() with a timeout of 0 seemed like a good enough solution to see if any of the
file descriptors were available immediately to avoid any sleeping. If none were available, then
we would sleep for the period of time passed in from user space and then check again with a
timeout of 0. I thought of this design before I had a full understanding of what select() was
actually doing, so there are a number of flaws that will be discussed in Section 4.4.
4.3 The Split and Merge Algorithms
One of the first things that had to be decided for the split and merge routines was how to go
about searching through the file descriptor sets. The polling protocol’s design called for masking
off the host file descriptors and then translating host-side file descriptors belonging to the co-host
to the corresponding co-host ones using the translation tables. However, this method would
create overhead when setting up the sockets because the mask would have to be modified every
time, even if select() was never called. Furthermore, once the mask is created, the host-side
file descriptors would have to be translated to the co-host-side file descriptors anyway, so we
would have to search through each bitmap to get the file descriptors to perform this translation
on. After talking this over with Rob McCready, we decided that, if we are going to go through
the bitmaps anyway, then we might as well check if each file descriptor that is set is a co-host-
side file descriptor and, if it is, then translate it. This creates only a little more overhead on
select() while the overhead is removed when a socket is created. We also thought that it could
be changed later if this caused a large loss of performance.
fd_map.c has a routine called fd_map_get_ebsa_fd() that finds the co-host mapping of a given
host file descriptor. It returns -1 if it does not find a mapping (the mapping table is initialized to
all -1’s before any file descriptors are put into it). However, calling this function for a large
number of file descriptors would produce much overhead not only because it would have to call
a function in a separate file, but also because it would have to search through a list of mappings
for the current process’s mapping. So it was decided to put the split and merge routines inside
fd_map.c so that they could access the mappings directly and would only have to search for the
current process’s mapping once. [3]
The algorithms for these functions were fairly simple. For splitting the file descriptors, three
fd_set variables are created for the co-host side and initialized to 0 with FD_ZERO. Then, for
each of the non-NULL sets, go through each file descriptor up to n (passed in from the user) and
see if there is a mapping for each file descriptor it finds set (check if it is set by using FD_ISSET).
If a mapping is found, clear the bit in the host set (FD_CLR) and then set it in the co-host set
(FD_SET). The variable maxfd in select_split_t would be updated for each side
when a file descriptor is found for it. Then, when select() is called on both sides, it will be
called with the corresponding maxfd+1 and the three sets for that side.
select() will return with an updated set of bitmaps that show the available file descriptors. The
merge algorithm then goes through each of these non-NULL file descriptor sets for the co-host.
It looks up the co-host-to-host mapping for each file descriptor it sees in the bitmap sets.
If it finds a mapping, the file descriptor is translated from the co-host to the host and is set on the
corresponding host side bitmap. When this algorithm is complete, the host-side bitmap will
contain all file descriptors that select() set.
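The split and merge steps described above can be sketched in user space with the standard fd_set macros. This is a simplified illustration, not the routines in fd_map.c: split_set() and merge_set() are hypothetical names, each handles a single set rather than all three, and the mapping tables are plain int arrays in which -1 means "no mapping".

```c
#include <sys/select.h>

#define NO_MAPPING (-1)

/* Move every descriptor in *host that has a co-host mapping into *cohost
 * under its co-host number. Descriptors without a mapping stay in *host.
 * Returns the highest co-host descriptor moved, or -1 if none were. */
int split_set(int n, fd_set *host, fd_set *cohost, const int *host_to_cohost)
{
    int fd, max_cohost = -1;

    FD_ZERO(cohost);
    for (fd = 0; fd < n; fd++) {
        int mapped;

        if (!FD_ISSET(fd, host))
            continue;
        mapped = host_to_cohost[fd];
        if (mapped == NO_MAPPING)
            continue;                 /* local descriptor: leave it alone */
        FD_CLR(fd, host);
        FD_SET(mapped, cohost);
        if (mapped > max_cohost)
            max_cohost = mapped;
    }
    return max_cohost;
}

/* Translate each descriptor set in *cohost back to its host number and set
 * it in *host. Returns the number of descriptors merged. */
int merge_set(int n_cohost, fd_set *host, fd_set *cohost,
              const int *cohost_to_host)
{
    int fd, merged = 0;

    for (fd = 0; fd < n_cohost; fd++) {
        int mapped;

        if (!FD_ISSET(fd, cohost))
            continue;
        mapped = cohost_to_host[fd];
        if (mapped == NO_MAPPING)
            continue;
        FD_SET(mapped, host);
        merged++;
    }
    return merged;
}
```

The value returned by split_set() corresponds to the co-host maxfd, so the co-host call would use that value plus one as its n.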
I was able to come up with split and merge functions that were able to manipulate the file
descriptor sets, but I quickly found out that there were issues with user space and kernel space
variables. sys_select() wants the variables to be in user space because the __get_user(),
get_fd_set(), put_user(), and set_fd_set() functions require that the function arguments
be from user space. I found this out because select() would continuously return -EFAULT
because it was trying to copy addresses from user space that were already kernel space addresses.
The only solution I could come up with was to copy the values from user space, split up the sets,
then copy them back to user space so that sys_select() could be called. This worked, albeit
slowly, for a few situations, but I ran into problems that forced me to abandon this idea altogether.
4.4 Reasons for Failure
There were a number of problems that caused this approach to be abandoned for the new
approach that is discussed in Chapter 5. First, relying on user space variables becomes a problem
when none exist. This is not a problem for the file descriptor sets because they are not processed if
they do not point to anything. However, the timeout value needs to have a user space variable so
that we can copy a 0 into that place. The user does have the option of making it a NULL pointer,
in which case we could not copy a value to user space. I tried to see if there was a way we could
create a user space variable from our kernel module but I could not find a way. Also, the co-host
does not have a user process at all, so there would definitely be no way we could copy a 0 to user
space even if there was a valid timeout value passed in.
Secondly, I found out that the select() call does not necessarily wait the entire timeout period
before checking all the file descriptors a second time. It populates wait queues for each file
descriptor and then sleeps. It wakes up if any of the file descriptors becomes available or if the
timeout expires, whichever comes first. If no timeout value is given (NULL), it sleeps indefinitely
until a file descriptor becomes available. Calling select() with a timeout of 0 would not
populate the wait queues and we could potentially sleep much longer than the original system
call would. Since this is not the type of behavior we would want for select(), a new design
had to be created.
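The difference in wake-up behavior can be captured in a deliberately simplified model. The two functions below are hypothetical and do not come from the project; they assume a single event that fires at tick event_tick and a finite, positive timeout, and they return the tick at which each approach would notice and return.

```c
/* First design: check once with a zero timeout, then sleep for the whole
 * timeout period before checking again. Nothing is registered on the wait
 * queues, so an event that fires during the sleep goes unnoticed until
 * the sleep ends. */
long first_design_return_tick(long event_tick, long timeout)
{
    if (event_tick <= 0)
        return 0;           /* already ready at the initial check */
    return timeout;         /* sleeps the full period regardless */
}

/* Kernel select(): the process sits on the wait queues, so it wakes when
 * the event fires or when the timeout expires, whichever comes first. */
long kernel_select_return_tick(long event_tick, long timeout)
{
    if (event_tick <= 0)
        return 0;
    return event_tick < timeout ? event_tick : timeout;
}
```

In this model, an event at tick 1 with a timeout of 100 ticks is noticed at tick 1 by the kernel's approach but only at tick 100 by the first design.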
5. The Second Design
Once I saw that the conventional method of intercepting system calls using the polling protocol
was not going to work, I tried to find alternative solutions that would allow select() to be
implemented with the CiNIC platform. Max Roth had suggested that I just include the kernel
code and modify that directly, but I was really hesitant to use this method because we wanted our
implementation to be as kernel independent as possible. However, seeing no alternative solution,
this is the method we went with for the second design. I included comments throughout the code
about what is kernel specific so that, if the kernel’s implementation of select() changes in
future kernel releases, one would be able to figure out what needs to be changed.
Four functions from the kernel had to be copied over: old_select(), sys_select(),
do_select(), and max_select_fd(). sys_select(), do_select(), and max_select_fd()
all come from fs/select.c in the Linux kernel, while old_select() comes from
arch/i386/kernel/sys_i386.c. For Chapter 5, sys_select() and old_select() were the
only functions that needed modification (do_select() is modified in Chapter 6); however, all of
these functions needed to be copied over because do_select() and max_select_fd() are static
functions used by select() that cannot be exported by the kernel. sys_select()’s and
old_select()’s modifications will be outlined in Section 5.1, followed by a description of how
the parameters get transferred to the co-host and run there in Sections 5.2 and 5.3. The split and
merge routines will be described in Section 5.4.
5.1 Modifications to old_select() and sys_select()
old_select()’s modifications are extremely straightforward. Since all it does is get the
parameters from a pointer and then call sys_select(), all that needed to be done was to add in
module use counting (MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT) and make it call our
version of sys_select() rather than the kernel’s version. Following is the code:

    long n_old_select(select_param_t *args)
    {
        select_param_t a;
        long retval;

        MOD_INC_USE_COUNT;
        if (copy_from_user(&a, args, sizeof(a))) {
            MOD_DEC_USE_COUNT;
            return -EFAULT;
        }
        retval = sys_host_select(a.n, a.inp, a.outp, a.exp, a.tvp);
        MOD_DEC_USE_COUNT;
        return retval;
    }
For sys_select() on the host, I renamed my version to sys_host_select(). I needed a way
to keep track of information that was specific to one side or the other, so I created a data
structure:

    typedef struct {
        int n;
        int size;
        long timeout;
        fd_set_bits fds;
        char* bits;
    } select_split_t;

n is the highest numbered file descriptor in the set, size is the size in bytes of one of the sets
(there are 6 all together, 3 input and 3 output), timeout is the timeout value, fds is a structure
that contains pointers to each of the 6 sets, and bits is a pointer to the beginning of the memory
region. A select_split_t structure is created for each side. We also needed a structure that is
used to send the parameter information over to the co-host:

    typedef struct {
        int numfds;
        long time_off;
        int sizefds;
        char bitmaps[0];
    } select_func_t;

    #define COM_SELECT_HEADER_SIZE 12

numfds, time_off, sizefds, and bitmaps[0] are equivalent to n, timeout, size, and bits in
the select_split_t, respectively. COM_SELECT_HEADER_SIZE defines the size of this structure
without bitmaps, which is used when calculating the size of the com packet to send to the
co-host. bitmaps is not a fixed-size member; it marks the start of a variable-length region, and I
declared it as an array of size 0 so that the data would be contiguous with the other members of
the struct.
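The header-plus-trailing-data layout can be sketched as follows. This is an illustration under stated assumptions, not the project's definition: select_wire_t and its helper functions are hypothetical names, the fields use fixed 32-bit types so that the header is 12 bytes on any machine (the hard-coded 12 in the report matches targets where int and long are both 4 bytes), and the C99 flexible array member replaces the older zero-length array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t numfds;     /* n for the remote side */
    int32_t time_off;   /* timeout in ticks */
    int32_t sizefds;    /* size in bytes of ONE of the six bitmaps */
    char    bitmaps[];  /* six bitmaps follow the header contiguously */
} select_wire_t;

/* Header size derived from the layout instead of a hard-coded constant. */
#define SELECT_WIRE_HEADER_SIZE offsetof(select_wire_t, bitmaps)

/* Total bytes needed for a header plus six bitmaps of sizefds bytes each. */
size_t select_wire_size(int32_t sizefds)
{
    return SELECT_WIRE_HEADER_SIZE + 6u * (size_t)sizefds;
}

/* Allocate and fill one message; the caller frees it with free().
 * 'bits' must point to 6 * sizefds bytes. Returns NULL on allocation failure. */
select_wire_t *select_wire_new(int32_t numfds, int32_t time_off,
                               int32_t sizefds, const char *bits)
{
    select_wire_t *w = malloc(select_wire_size(sizefds));

    if (!w)
        return NULL;
    w->numfds = numfds;
    w->time_off = time_off;
    w->sizefds = sizefds;
    memcpy(w->bitmaps, bits, 6u * (size_t)sizefds);
    return w;
}
```

Deriving the header size with offsetof avoids having to keep a constant in sync with the struct if a field is ever added or resized.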
sys_host_select() is modified so that a timeout value is computed for each side (they start out
equal). Then the host-side sets are set up and the file descriptors are copied in from user space.
Next, three new fd_set’s are created for the co-host file descriptors. Each one is large enough
to contain the maximum number of file descriptors (1024). The file descriptors are then split
between the host and co-host sets. If there are no co-host-side file descriptors,
do_host_select() is called on the host. Otherwise, a co-host memory region is set up that is
similar to the host’s and the parameters are transferred to the co-host. Upon return, the sets are
merged back together and the sets, along with the time elapsed sleeping, are copied back to user
space.
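The memory region for one side holds six bitmaps back to back: three input sets followed by three result sets. The sketch below shows one way to carve a single allocation into those six regions. It is a user-space illustration with hypothetical names (fd_regions_t, regions_alloc(), and the LONGS_FOR/BYTES_FOR macros, which are modeled on the kernel's convention of rounding the bitmap size up to whole unsigned longs); calloc() stands in for kmalloc() followed by zeroing.

```c
#include <stdlib.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))
#define LONGS_FOR(n)  (((n) + BITS_PER_WORD - 1) / BITS_PER_WORD)
#define BYTES_FOR(n)  (LONGS_FOR(n) * sizeof(unsigned long))

typedef struct {
    unsigned long *in, *out, *ex;               /* sets passed in */
    unsigned long *res_in, *res_out, *res_ex;   /* result sets */
} fd_regions_t;

/* Allocate one zeroed block large enough for six bitmaps covering n
 * descriptors and point each field of *r at its slice. Stores the size of
 * one bitmap in *size_out. Returns the block (free it with free()), or
 * NULL if n is not positive or the allocation fails. */
char *regions_alloc(int n, fd_regions_t *r, size_t *size_out)
{
    size_t size;
    char *bits;

    if (n <= 0)
        return NULL;
    size = BYTES_FOR((size_t)n);
    bits = calloc(6, size);
    if (!bits)
        return NULL;

    r->in      = (unsigned long *)(void *)bits;
    r->out     = (unsigned long *)(void *)(bits + size);
    r->ex      = (unsigned long *)(void *)(bits + 2 * size);
    r->res_in  = (unsigned long *)(void *)(bits + 3 * size);
    r->res_out = (unsigned long *)(void *)(bits + 4 * size);
    r->res_ex  = (unsigned long *)(void *)(bits + 5 * size);
    *size_out = size;
    return bits;
}
```

Because the six bitmaps are contiguous, the whole region can be copied or freed with a single call, which is what makes it convenient to ship to the other side.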
Following is a walkthrough of the code. I will break it up into chunks and show what needed to
be changed from the original sys_select() code.
When the select() system call is made, the function called is sys_host_select() in
select_h.c. A select_split_t is declared for each side along with 3 fd_set’s that will
be used for the co-host sets:

    long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                         struct timeval *tvp)
    {
        select_split_t host, ebsa;
        fd_set rfds, wfds, efds;
        long timeout, ret;

        MOD_INC_USE_COUNT;

1. This section is the same as the kernel except for setting the host and co-host timeout values to
the timeout value passed in from user space.

        timeout = MAX_SCHEDULE_TIMEOUT;
        if (tvp) {
            time_t sec, usec;

            if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))
                || (ret = __get_user(sec, &tvp->tv_sec))
                || (ret = __get_user(usec, &tvp->tv_usec)))
                goto out_nofds;

            ret = -EINVAL;
            if (sec < 0 || usec < 0)
                goto out_nofds;

            if ((unsigned long) sec < MAX_SELECT_SECONDS) {
                timeout = ROUND_UP(usec, 1000000/HZ);
                timeout += sec * (unsigned long) HZ;
            }
        }

        host.timeout = timeout;
        ebsa.timeout = timeout;

        ret = -EINVAL;
        if (n < 0)
            goto out_nofds;

        if (n > current->files->max_fdset)
            n = current->files->max_fdset;

2. Set up the host-side bitmaps and then call split_select_bitmaps(), which will split up the
host and co-host file descriptors and set the n values for each. The host side contains all file
descriptors initially as they are copied from user space into the host-side bitmaps. The host-
side file descriptors will be set up in this memory region and the co-host side will be set up in
each of the fd_set’s. We are using fd_set because we do not know yet how big to make
the co-host side bitmaps until we get a value of n for it (it could possibly be larger than the
value of n passed in from the call to select()).

        ret = -ENOMEM;
        host.size = FDS_BYTES(n);
        host.bits = kmalloc(6 * host.size, GFP_KERNEL);
        if (!host.bits)
            goto out_nofds;
        host.fds.in = (unsigned long *) host.bits;
        host.fds.out = (unsigned long *) (host.bits + host.size);
        host.fds.ex = (unsigned long *) (host.bits + 2*host.size);
        host.fds.res_in = (unsigned long *) (host.bits + 3*host.size);
        host.fds.res_out = (unsigned long *) (host.bits + 4*host.size);
        host.fds.res_ex = (unsigned long *) (host.bits + 5*host.size);

        if ((ret = get_fd_set(n, inp, host.fds.in)) ||
            (ret = get_fd_set(n, outp, host.fds.out)) ||
            (ret = get_fd_set(n, exp, host.fds.ex)))
            goto out;
        zero_fd_set(n, host.fds.res_in);
        zero_fd_set(n, host.fds.res_out);
        zero_fd_set(n, host.fds.res_ex);

        split_select_bitmaps(n, &host, &ebsa, &rfds, &wfds, &efds);

3. If there are co-host file descriptors, then we set up the co-host side bitmaps and copy the
information from the fd_set’s to this region. n_sys_select() is then called, which takes
care of sending the data to the co-host, calling the host side do_select() if needed, and
waiting for the co-host to return. The return value is the combined value of
do_host_select() run on the host and do_ebsa_select() run on the co-host (or an error
value if an error occurred on either side). The host and co-host side bitmaps are merged into
the host side by merge_select_bitmaps() and the co-host side bitmaps are freed. If there
are not any co-host file descriptors, do_host_select() is called just like in the kernel and
nothing is sent to the co-host.

        if (ebsa.n > 0) {
            ret = -ENOMEM;
            ebsa.size = FDS_BYTES(ebsa.n);
            ebsa.bits = kmalloc(6 * ebsa.size, GFP_KERNEL);
            if (!ebsa.bits)
                goto out;
            ebsa.fds.in = (unsigned long *) ebsa.bits;
            ebsa.fds.out = (unsigned long *) (ebsa.bits + ebsa.size);
            ebsa.fds.ex = (unsigned long *) (ebsa.bits + 2*ebsa.size);
            ebsa.fds.res_in = (unsigned long *) (ebsa.bits + 3*ebsa.size);
            ebsa.fds.res_out = (unsigned long *) (ebsa.bits + 4*ebsa.size);
            ebsa.fds.res_ex = (unsigned long *) (ebsa.bits + 5*ebsa.size);

            memcpy((void*)ebsa.fds.in, (void*)&rfds, ebsa.size);
            memcpy((void*)ebsa.fds.out, (void*)&wfds, ebsa.size);
            memcpy((void*)ebsa.fds.ex, (void*)&efds, ebsa.size);
            zero_fd_set(ebsa.n, ebsa.fds.res_in);
            zero_fd_set(ebsa.n, ebsa.fds.res_out);
            zero_fd_set(ebsa.n, ebsa.fds.res_ex);

            ret = n_sys_select(&host, &ebsa);
            merge_select_bitmaps(&host, &ebsa);
            kfree(ebsa.bits);
        } else {
            ret = do_host_select(host.n, &host.fds, &host.timeout);
        }

4. The next part is the same as the kernel except for determining the timeout value from the
co-host and host. The smaller timeout value is returned. I thought about returning the larger
timeout value because that is what the timeout value would be if this was running normally
on one computer only, but I decided later that the timeout value should reflect the time
elapsed sleeping, since the user would probably use this value to sleep further if it is waiting
on an event.

        if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
            time_t sec = 0, usec = 0;
            if (ebsa.timeout < host.timeout) {
                timeout = ebsa.timeout;
            } else {
                timeout = host.timeout;
            }
            if (timeout) {
                sec = timeout / HZ;
                usec = timeout % HZ;
                usec *= (1000000/HZ);
            }
            put_user(sec, &tvp->tv_sec);
            put_user(usec, &tvp->tv_usec);
        }

        if (ret < 0) {
            goto out;
        }
        if (!ret) {
            ret = -ERESTARTNOHAND;
            if (signal_pending(current))
                goto out;
            ret = 0;
        }

5. Copy the values from the host result fields, free the bits on the host side, and return.

        set_fd_set(n, inp, host.fds.res_in);
        set_fd_set(n, outp, host.fds.res_out);
        set_fd_set(n, exp, host.fds.res_ex);

    out:
        kfree(host.bits);
    out_nofds:
        MOD_DEC_USE_COUNT;
        return ret;
    }
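The timeout arithmetic in steps 1 and 4 can be checked in isolation. The sketch below is a user-space illustration with hypothetical function names; it assumes a tick rate of 100 per second (HZ_SKETCH), which should be replaced by the real HZ of the target kernel, and it reproduces the round-up division used when converting microseconds to ticks.

```c
#define HZ_SKETCH 100L   /* assumption: 100 timer ticks per second */
#define ROUND_UP_DIV(x, y) (((x) + (y) - 1) / (y))

/* Convert a (sec, usec) timeout into ticks, rounding the microsecond part
 * up so that the sleep is never shorter than requested. */
long timeval_to_ticks(long sec, long usec)
{
    return ROUND_UP_DIV(usec, 1000000L / HZ_SKETCH) + sec * HZ_SKETCH;
}

/* Convert a tick count back into (sec, usec) for the caller. */
void ticks_to_timeval(long ticks, long *sec, long *usec)
{
    *sec = ticks / HZ_SKETCH;
    *usec = (ticks % HZ_SKETCH) * (1000000L / HZ_SKETCH);
}

/* Pick the value to report when each side has its own timeout: the
 * smaller of the two, as in step 4 above. */
long reported_ticks(long host_ticks, long cohost_ticks)
{
    return cohost_ticks < host_ticks ? cohost_ticks : host_ticks;
}
```

Under the 100-tick assumption, a timeout of 2 seconds and 15000 microseconds becomes 202 ticks, which converts back to 2 seconds and 20000 microseconds because of the rounding up.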
5.2 Transferring Parameters to the Co-Host
The method of transferring parameters to the co-host is very similar to the other system calls. It
is for this reason that I decided I would put the function for it into syscalls_h.c and name it
according to the naming scheme of the other intercepted system calls. This function,
n_sys_select(), is only called if there are co-host-side file descriptors. Its first task is to set up
a com packet that will be transferred to shared memory. This com packet will contain header
information such as the length to copy, the function ID, the process ID, and the return value.
Additionally, it will contain the arguments for the system call that are needed on the co-host side.
Next the com packet is put on the outgoing queue to send to the co-host. The polling protocol
will take care of getting it there from here [3]. It then calls do_host_select() if there are host-
side file descriptors. Once the host side returns, it will wait for the co-host side to return or keep
going if it already has. The modified sets from the co-host are copied from the com packet and
the return values are examined. If either side reports an error, that error is returned; if both
sides report errors, the host-side error is returned. If there are no errors, the sum of the
return values is returned.
Following is the code for n_sys_select(). For simplicity, I have taken out some code that will
be described in Chapter 6.

    long n_sys_select(select_split_t* local, select_split_t* remote)
    {
        long err = 0;
        int ret_local = 0;
        int ret_remote = 0;
        pkt_queue_node_t *pqn;

1. Set up the com packet with the values it needs. It needs to fill in the values of the
select_func_t structure with the co-host’s n, timeout, and size values, and it needs to
copy over the co-host side bitmaps.

        pqn = proto_get_queue_node(
            COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size));
        pqn->pkt->copy_len = COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
        pqn->pkt->pkt_len = COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
        pqn->pkt->func_id = SYS_SELECT;
        pqn->pkt->pid = current->pid;
        pqn->pkt->ret_val = -1;
        pqn->pkt->func.select.numfds = remote->n;
        pqn->pkt->func.select.time_off = remote->timeout;
        pqn->pkt->func.select.sizefds = remote->size;
        memcpy((void*)&pqn->pkt->func.select.bitmaps[0], (void*)remote->bits,
               6*(remote->size));

2. The com packet is then put on a queue to send to the co-host. If there are host file descriptors
as well, do_host_select() is called with the host-side parameters. After it returns, the
process sleeps on the semaphore until the co-host returns. If the co-host has already returned,
it will go straight through the lock. The updated bitmaps are then copied back and the
timeout value from the co-host is updated. If either side returns an error, that error is
returned. I have designated that if both sides return an error, then the host error gets
precedence over the co-host error because the host error would most likely be needed by the
other applications running on the host. Otherwise, the return values are added together and
the sum is returned.

        proto_enqueue(pqn);

        if (local->n > 0) {
            ret_local = do_host_select(local->n, &local->fds, &local->timeout, 1);
        }

        if (down_interruptible(&(pqn->lock)) == -EINTR) {
            err = -EINTR;
            goto out_select;
        }

        ret_remote = pqn->pkt->ret_val;
        memcpy((void*)remote->bits, (void*)&pqn->pkt->func.select.bitmaps[0],
               6*(remote->size));
        remote->timeout = pqn->pkt->func.select.time_off;

        if (ret_local < 0) {
            err = ret_local;
        } else if (ret_remote < 0) {
            err = ret_remote;
        } else {
            err = ret_remote + ret_local;
        }

    out_select:
        proto_release_queue_node(pqn);
        return err;
    }

5.3 The Co-Host Side Functions
The handling of the system call on the co-host side is relatively straightforward. The polling
protocol takes care of retrieving the com packet from shared memory and putting it on a handler
queue. Currently, the only handler is the default handler, so the default handler thread, which is
the default_handler() function in handler_default.c, checks the function ID from the com
packet and makes the appropriate system call [3]:

        case SYS_SELECT:
            pkt->ret_val = sys_ebsa_select(pkt->func.select.numfds,
                                           pkt->func.select.sizefds,
                                           &pkt->func.select.time_off,
                                           &pkt->func.select.bitmaps[0]);
            break;

sys_ebsa_select() (select_e.c) is called with the parameters sent by the com packet. It sets
up a memory region on the co-host to put the bitmaps into and then calls do_ebsa_select():

    long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits)
    {
        fd_set_bits bmaps;
        long retval;

        bmaps.in = (unsigned long *) ebsa_bits;
        bmaps.out = (unsigned long *) (ebsa_bits + size);
        bmaps.ex = (unsigned long *) (ebsa_bits + 2*size);
        bmaps.res_in = (unsigned long *) (ebsa_bits + 3*size);
        bmaps.res_out = (unsigned long *) (ebsa_bits + 4*size);
        bmaps.res_ex = (unsigned long *) (ebsa_bits + 5*size);

        retval = do_ebsa_select(n, &bmaps, timeout);

        if (!retval) {
            if (signal_pending(current)) {
                retval = -ERESTARTNOHAND;
            }
        }
        return retval;
    }

When the system call is complete, the modified com packet is put on the queue to return to the
host.
5.4 The Split and Merge Routines
The split and merge routines are extremely important to this protocol as they are able to figure
out which file descriptors go to the co-host and which stay on the host. A separate set is used for
the co-host side. The split routine goes through each of the host-side sets, which contain all file
descriptors requested from the user. There are three sets, read, write, and exceptions, which are
bitmaps (see Section 2.3). A set bit (1) means that a file descriptor was requested. For example,
if bit 3 was set in the read bitmap, it tells select() to check if file descriptor 3 is ready to be
read from. So if the split routine finds a set bit, it will look up that file descriptor number in the
file descriptor translation table in the protocol code [3]. If it finds a –1 in the table, then this
means that this file descriptor is intended for the host side and the host-side number of file
descriptors is incremented. If it finds a non-negative value, this means that this file descriptor is
destined for the co-host. It clears the bit in the host bitmap and sets the translated bit in the co-
host bitmap. Figure 5.1 shows an example. Suppose the translation table is the same as in
Figure 4.1. Bit 3 is set in the host-side read bitmap and file descriptor 3 is seen to translate to file
descriptor 7 on the co-host, so bit 3 is cleared in the host-side read bitmap and bit 7 is set in the
co-host-side read bitmap. Notice in this figure that the co-host side has 1024 file descriptors in it
because we do not know yet what the highest numbered file descriptor will be on the co-host.
The merge routine works in the opposite direction, but it does not clear any bits. The co-host file
descriptors are translated to the host side and the host side will then contain both sets of bits. In
the previous example, if bit 7 in the co-host read bitmap was ready for reading, then it would be
translated back to file descriptor 3 on the host and set in the host read bitmap to be returned to
the user. It is important to note that the fd_set_bits structure has ‘in’ and ‘out’ bitmaps. The
‘in’ bitmaps are used to see which file descriptors select() should look at while the ‘out’
bitmaps are copied back to user space to show which file descriptors select() found available.
The split routine works with the ‘in’ bitmaps and the merge routine works with the ‘out’ bitmaps.
The split and merge routines reside in fd_map.c so that they have easy access to the file
descriptor mappings between the host and co-host. This placement is also faster because the
routines do not have to call fd_map_get_ebsa_fd() or fd_map_get_host_fd() every time they
want a file descriptor translated.
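The idea behind the two routines can be sketched in user space with plain bitmaps. This is a simplified illustration, not the driver code: `bitmap_t`, `bm_test()`, `bm_set()`, `bm_clear()`, `split_bitmap()`, and `merge_bitmap()` are hypothetical names, only one of the three sets is handled, and the table is limited to 64 entries instead of 1024. The test values used with it (host descriptor 3 mapped to co-host descriptor 7, host descriptor 5 left on the host) are assumptions based on the example discussed for Figure 5.1.

```c
#include <assert.h>
#include <string.h>

#define MAX_FDS 64                                  /* small table for illustration */
#define BPW ((int)(8 * sizeof(unsigned long)))      /* bits per bitmap word */

typedef struct {
    unsigned long bits[(MAX_FDS + BPW - 1) / BPW];
} bitmap_t;

static int bm_test(const bitmap_t *b, int fd) { return (int)((b->bits[fd / BPW] >> (fd % BPW)) & 1UL); }
static void bm_set(bitmap_t *b, int fd) { b->bits[fd / BPW] |= (1UL << (fd % BPW)); }
static void bm_clear(bitmap_t *b, int fd) { b->bits[fd / BPW] &= ~(1UL << (fd % BPW)); }

/* Split: move every translated descriptor from the host request bitmap to
 * the co-host request bitmap. host_to_ebsa[fd] is -1 for host-only
 * descriptors. On return, *host_n and *ebsa_n hold the highest descriptor
 * left on each side plus one (0 if that side has none). */
void split_bitmap(int n, const int *host_to_ebsa,
                  bitmap_t *host, bitmap_t *ebsa, int *host_n, int *ebsa_n)
{
    int hfd;

    memset(ebsa, 0, sizeof(*ebsa));
    *host_n = 0;
    *ebsa_n = 0;
    for (hfd = 0; hfd < n; hfd++) {
        int efd;

        if (!bm_test(host, hfd))
            continue;                   /* descriptor was not requested */
        efd = host_to_ebsa[hfd];
        if (efd < 0) {                  /* stays on the host */
            *host_n = hfd + 1;
            continue;
        }
        assert(efd < MAX_FDS);
        bm_set(ebsa, efd);              /* set the translated bit */
        bm_clear(host, hfd);            /* clear the original bit */
        if (*ebsa_n <= efd)
            *ebsa_n = efd + 1;
    }
}

/* Merge: translate every bit in the co-host result bitmap back to its host
 * descriptor and set it in the host result bitmap. No bits are cleared. */
void merge_bitmap(int ebsa_n, const int *ebsa_to_host,
                  const bitmap_t *ebsa_res, bitmap_t *host_res)
{
    int efd;

    for (efd = 0; efd < ebsa_n; efd++) {
        int hfd;

        if (!bm_test(ebsa_res, efd))
            continue;
        hfd = ebsa_to_host[efd];
        if (hfd >= 0)
            bm_set(host_res, hfd);
    }
}
```

The sketch keeps separate request and result bitmaps for the same reason the driver does: the split works only on what was requested, and the merge works only on what was found ready.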
    Host read bitmap       fd #:  0 1 2 3 4 5 ……
                           bits:  0 0 0 0 0 1

    Co-host read bitmap    fd #:  …… 5 6 7 8 9 …… 1023
                           bits:     0 0 1 0 0      0

Figure 5.1 – Example of Splitting File Descriptors

Split:

    void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
                              fd_set* remote_rfds, fd_set* remote_wfds,
                              fd_set* remote_efds)
    {
        int hfd, efd; /* host/ebsa file descriptor */
        int off;
        fd_translation_table_t *cur;

1. The macros in select.h require the file descriptor set to be of type fd_set_bits* but the
type that is passed in from select_split_t is fd_set_bits. An ampersand (&) cannot be
used with the macros because they dereference pointers. So instead of changing the macros,
I created a new variable of type fd_set_bits* that the address of local->fds could be
stored in. The macros then use this new variable. Next, the n values are set to 0. I originally
was not going to have separate n values, but then I realized that, since the value has to be one
larger than the highest numbered file descriptor in any of the sets, the co-host side may have
larger file descriptor numbers than the host and would therefore need a separate n value. The
file descriptor table (fd_table) is then searched to find the current process’s mappings. If
they are not found, then only host-side file descriptors have been established, i.e. there are no
co-host file descriptors, so set the host side n to the value of n passed in and return.

        /* hack to make macros in select.h work */
        fd_set_bits* lfds = &local->fds;

        local->n = 0;
        remote->n = 0;

        down(&table_lock);
        cur = fd_tables;
        while(cur != NULL && cur->task != current) {
            cur = cur->next;
        }
        up(&table_lock);

        if (!cur) {
            local->n = n;
            return; /* no fd translation table */
        }

2. Zero out the fd_set’s so that we can populate them with the co-host’s file descriptors.

        FD_ZERO(remote_rfds);
        FD_ZERO(remote_wfds);
        FD_ZERO(remote_efds);

3. Algorithm for searching through the file descriptors (note: each bit refers to a file descriptor):
a) Find the bit and offset for the file descriptor on the host (same way as kernel in
do_select()). See Section 3.2 and Appendix C.
b) If that bit is not set in any of the sets, then continue to the next file descriptor.
c) Get the file descriptor mapping if that bit is set in any of the sets. If it is less than 0,
i.e. –1, it is not a translated file descriptor; it is only on the host. So update n on the
host and go on to the next file descriptor.
d) If it is translated, then find out which sets it is in and set the translated co-host file
descriptor in the correct fd_set(s) and clear the corresponding host file descriptor in
the host bitmaps (FD_SET and CLR).
e) If the co-host side n is not larger than the translated file descriptor, then update it to
that file descriptor plus one. We check it first in case the mappings are not in ascending
order; we would not want n on the co-host to be smaller than the largest file descriptor.

        for (hfd = 0; hfd < n; hfd++) {
            unsigned long hbit = BIT(hfd);
            off = hfd / __NFDBITS;

            if (!(hbit & BITS(lfds, off))) {
                continue;
            }

            efd = cur->fd_host_ebsa[hfd];
            if (efd < 0) {
                local->n = hfd + 1;
                continue;
            }

            if (ISSET(hbit, __IN(lfds, off))) {
                FD_SET(efd, remote_rfds);
                CLR(hbit, __IN(lfds, off));
            }
            if (ISSET(hbit, __OUT(lfds, off))) {
                FD_SET(efd, remote_wfds);
                CLR(hbit, __OUT(lfds, off));
            }
            if (ISSET(hbit, __EX(lfds, off))) {
                FD_SET(efd, remote_efds);
                CLR(hbit, __EX(lfds, off));
            }

            if (remote->n <= efd)
                remote->n = efd + 1;
        }
    }

Merge:

    void merge_select_bitmaps(select_split_t* local, select_split_t* remote)
    {
        int hfd, efd; /* host/ebsa file descriptor */
        int off_local, off_remote;
        fd_translation_table_t *cur;

        /* hack to make macros in select.h work */
        fd_set_bits* lfds = &local->fds;
        fd_set_bits* rfds = &remote->fds;

        down(&table_lock);
        cur = fd_tables;
        while(cur != NULL && cur->task != current) {
            cur = cur->next;
        }
        up(&table_lock);

NOTE: This should not happen because merge_select_bitmaps() would not be called unless
there was an fd_table. I was thinking that I should return an error here, but after talking to Dr.
Nico, we decided that there is nothing that the user can do about it and the call would return
without the co-host file descriptors set anyway, so they would at least see that they were not
ready. Another reason is that we want this to look as much like the regular system call as
possible, so an unusual return value would not work in this case. Since this error would indicate
a more significant device driver failure, an error message is printed and the function returns.

        if (!cur) {
            PRINT_ERROR("merge_select_bitmaps: fd translation table missing, this should not happen\n");
            return; /* no fd translation table */
        }

Algorithm for searching through the file descriptors:
a) Find the bit and offset for that file descriptor on the co-host (same way as kernel).
See Section 3.2 and Appendix C.
b) If that bit is not set in any of the sets, then continue to the next file descriptor.
c) Get the file descriptor mapping if that bit is set in any of the sets. If it is less than 0,
i.e. –1, it is an error in the translation table because all file descriptors on the co-host
should be in the translation table. Find the bit and offset for the host side.
d) Find out which sets it is in and set the translated host file descriptor in the correct
result bitmap.

        for (efd = 0; efd < remote->n; efd++) {
            unsigned long ebit = BIT(efd);
            unsigned long hbit;
            off_remote = efd / __NFDBITS;

            if (!(ebit & RES_BITS(rfds, off_remote))) {
                continue;
            }

            hfd = cur->fd_ebsa_host[efd];
            if (hfd >= 0) {
                hbit = BIT(hfd);
                off_local = hfd / __NFDBITS;
            } else {
                PRINT_ERROR("merge_select_bitmaps: fd translation missing, this should not happen\n");
                continue;
            }

            if (ISSET(ebit, __RES_IN(rfds, off_remote))) {
                SET(hbit, __RES_IN(lfds, off_local));
            }
            if (ISSET(ebit, __RES_OUT(rfds, off_remote))) {
                SET(hbit, __RES_OUT(lfds, off_local));
            }
            if (ISSET(ebit, __RES_EX(rfds, off_remote))) {
                SET(hbit, __RES_EX(lfds, off_local));
            }
        }
    }

This implementation cannot handle more than 1024 file descriptors because the translation tables
are only 1024 entries long. This is the current default limit for Linux systems, but in the future
this scheme may need to be changed to be more robust.
The entire process from this chapter is laid out in Figure 5.2.
Host:
1. System call intercepted, sys_host_select() called. [syscalls_init(), syscalls_h.c]
2. sys_host_select() gets the values from user space and sets up the host side bitmaps.
   [sys_host_select(), select_h.c]
3. split_select_bitmaps() is called. It splits the file descriptors between the host and
   co-host and returns the maximum file descriptor plus one on both sides.
   [split_select_bitmaps(), fd_map.c]
4. Host file descriptors only: do_host_select() is called. This is where each of the file
   descriptors’ status is determined and where sleeping can happen. It returns the number
   of file descriptors available. [do_host_select(), select_h.c]
5. Co-host file descriptors present:
   a. Co-host side bitmaps are set up. n_sys_select() is called. [sys_host_select(), select_h.c]
   b. n_sys_select() sets up a com packet and sends it to the co-host.
      [n_sys_select(), syscalls_h.c]
   c. If there are host file descriptors it calls do_host_select(). When both sides are done,
      it adds both return values and returns. [n_sys_select(), syscalls_h.c]
   d. merge_select_bitmaps() is called. It merges both sets of file descriptors onto the
      host side. [merge_select_bitmaps(), fd_map.c]
6. Copy the new values back to user space and return the number of file descriptors
   available. [sys_host_select(), select_h.c]

Co-Host:
1. The default handler on the co-host picks up the com packet and calls sys_ebsa_select()
   with the values from it. [default_handler(), handler_default.c]
2. sys_ebsa_select() sets up a memory region for the co-host side bitmaps and calls
   do_ebsa_select(), which does the same thing as do_host_select() but on the co-host side.
   [sys_ebsa_select(), select_e.c]
3. The default handler then sends the com packet with the updated data back to the host.
   [default_handler(), handler_default.c]

Figure 5.2 – select() Flowchart of Second Design
6. The Solution to the Blocking Problem
By the end of summer 2001, I had succeeded in getting everything from Chapter 5 to work
correctly. select() worked in most situations and I was able to get the Lynx text-based web
browser to work using this setup. However, functionality still had to be added to wake up one
side when the other had returned. As we saw in the kernel’s version of do_select(),
schedule_timeout() is called to put the process to sleep for a specified period of time after
each iteration if no file descriptors are available, the timeout value is larger than 0, and no signals
are pending. This is implemented on both the co-host and host sides. Also, the way
n_sys_select() works, a com packet is sent to the co-host and then do_host_select() is
called. When it returns, it sleeps until the co-host com packet returns before moving on. With
this method, if one side returns early and is ready to move on, it must wait for the other to return.
This could take a while if it has to wait for the other side to time out. Furthermore, if there is no timeout
value, the wait could be forever if there are no file descriptors available and no signal is sent to
the process.
This is exactly what happens when using Telnet to connect to a remote host. It waits for input on
stdin or a socket when the user is typing. Since, with our implementation, sockets would go to
the co-host, the select() call would be split between the host and co-host. When the user types
a character, the select() call will return on the host because it sees the character from stdin.
However, the co-host side is still waiting for input on the socket, which never happens and
causes select() to never return. At this point Telnet cannot receive input and hangs.
This is why there needs to be some sort of mechanism where the side that is ready to return
notifies the other to return as well. There has to be a mechanism to wake up the sleeping process
and ask it to return. The following sections detail how I investigated the problem and came up
with a design. Then the code is discussed along with some of the issues I had to look at during
the implementation.
6.1 The Investigation of Kernel Methods
I first began an extensive investigation of kernel methods to try to figure out how I can go about
solving this problem. During my reading of Linux Device Drivers [5], I found that the kernel
uses wait queues to put processes to sleep and wake them up. Wait queues are queues of
processes waiting for various events. Processes can go to sleep on a wait queue by calling
sleep_on() or interruptible_sleep_on(). ‘Interruptible’ means the sleep can be interrupted
by a signal. There are also wake_up() and wake_up_interruptible(), which wake up either all
processes on a wait queue or only those in interruptible sleeps, respectively. The process would
go to sleep on a wait queue to wait for an event to occur and then code in another part of the
driver, usually in an interrupt handler, would wake up the process when this event occurs.
After thinking of a way this could apply to my problem, I decided I could use my own wait
queue in do_host_select() and do_ebsa_select() and then call
interruptible_sleep_on_timeout() on the wait queue with the timeout values from
select(). This would ensure that the timeout value is used (schedule_timeout() is called
internally by these functions) and that I could wake it up early. However, the problem of how
to wake it up still remained. I could not use interrupts to wake up the sleeping process because the
21554 has a limited interrupt mechanism that would already be used by the interrupt-driven
protocol. I needed to find something that could run separately from the process. Some type of
thread would work, but would also incur overhead because it would constantly be checking to
see if it has any processes to wake up, using up valuable CPU time. I finally found the answer in
task queues and tasklets. These enable execution of some task at a later time without using
interrupts. These can be run at various times in the kernel and can continually reschedule
themselves. The three predefined task queues are:
1. The scheduler queue- Runs in process context (as opposed to interrupt context) out of a
dedicated kernel thread called keventd. Sleeping is allowed since it is in a process
context.
2. The timer queue- Runs in interrupt context and runs at every clock tick.
3. The immediate queue- Runs via the bottom half mechanism, runs in interrupt time, and is
the fastest queue. Tasks should not be reregistered in this queue. It runs as soon as
possible, either on return from a system call or when schedule() is called.
Tasklets are another method used in the kernel. A tasklet is a new mechanism in the 2.4 kernel
and is a way to accomplish bottom half tasks, which are low priority interrupt space functions
that run when the kernel finds a convenient time [1]. In fact, in the 2.4 kernel, bottom halves run
as tasklets. Additionally, custom task queues can be defined which are not automatically
scheduled by the kernel. [5]
Taking these methods into account, I did some benchmarking tests using the Timestamp Counter
Register to see how fast these were. I ruled out custom queues because I wanted the kernel to
schedule when they ran. Tasklets, the scheduler queue, and the immediate queue all ran in
relatively the same amount of time, but the timer queue was about 20 times slower. Even on a
heavily loaded system, the timer queue is guaranteed to run at every clock tick, but I did not
think this would be an optimal solution due to the length of time between clock ticks. The tasks
using the scheduler queue can sleep, so I thought that one might be slow at times. I finally
decided to use tasklets because it was recommended that the immediate queue not be
rescheduled, and I definitely needed to reschedule my implementation. The only drawback of
tasklets is that they can only be used in kernel versions 2.4 and later.
6.2 The Design
The basic algorithm for the tasklet is to check a flag that will be set when the other side has
completed its part of the call. If this flag is set, then the sleeping process is woken up and runs
through one final iteration of the file descriptors before it returns. The tasklet is reregistered
each time it is run until the module is unloaded. The problem was finding a way to send a flag to
each side. I thought of two methods: Either send a com packet that tells the other side it has
finished or have a dedicated region in shared memory that contains this flag. For the interrupt-
driven protocol, which select() will eventually need to work with, a new thread would have to
be spawned if I sent a com packet. This would be an extremely slow process, so I decided that
the best method was to use shared memory.
This works if there is only one process calling select(). But since the process is sleeping,
other processes would get a chance to run and possibly call select().
wake_up_interruptible() would then wake up all processes on the wait queue. Since we do
not know which one the flag corresponds to, we cannot know which ones need to return and
which ones need to go back to sleep. A solution I came up with was to put in shared memory the
host process ID (PID) of the process that needs to return instead of passing a flag. The tasklet
would see that the PID in shared memory is non-negative and would wake up the processes on
the wait queue. Each process would wake up and, if its PID does not match, go back to sleep. If
the PID did match, the process would run through the file descriptors one last time and return.
The host PID could be used on both the host and co-host because its only use is to uniquely
identify a process that made a call to select(); it is not used with any PID functions such as
kill() or fork(). This value was the most convenient to use because it was already included in
the shared memory com packet header. A possible problem with this is if the wait queue
contained a lot of processes, this could incur quite a bit of overhead with all the context switches
between processes that are just going back to sleep (a.k.a. the “thundering herd” problem [5]). A
solution I proposed to solve this problem was to have local individual wait queues for each
process that contain only that process in its queue. We could then have a global linked list
(queue) of structures that contain the wait queue address and PID for each process. The tasklet
would look through the linked list for the PID and wake up the process on that wait queue.
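The per-process lookup described above can be sketched in user space as follows. This is a minimal illustration, not the driver code: `sleeper_t` and `wake_by_pid()` are hypothetical names, a plain singly linked list stands in for the kernel's list implementation, and a `woken` flag stands in for waking the process on its own wait queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for one entry in the sleeping list.
 * In the driver, the entry would carry the address of a kernel wait queue. */
typedef struct sleeper {
    int pid;
    int woken;              /* stands in for waking the process */
    struct sleeper *next;
} sleeper_t;

/* Find the entry for pid, unlink it from the list, mark it woken, and
 * return it. Returns NULL when no process with that PID is sleeping.
 * Only the matching process is disturbed, which avoids waking every
 * sleeper just to have most of them go back to sleep. */
sleeper_t *wake_by_pid(sleeper_t **head, int pid)
{
    sleeper_t **link;

    assert(head != NULL);
    for (link = head; *link != NULL; link = &(*link)->next) {
        if ((*link)->pid == pid) {
            sleeper_t *found = *link;

            *link = found->next;    /* unlink the entry */
            found->next = NULL;
            found->woken = 1;
            return found;
        }
    }
    return NULL;
}
```

The cost of this choice is a linear search on each wake-up, which seems acceptable as long as the number of processes sleeping in select() at any one time is small.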
Yet another thing I had to consider was if multiple processes were ready to return. I thought of
possibly using an array in shared memory to contain the PID of each process, but this would use
up shared memory and the proper length would be hard to determine. So I decided to keep a
linked list (queue) of the processes that are ready to return. Fortunately, the Linux kernel had a
circular linked list implementation that I could use. Not only would the tasklet be responsible for
waking up sleeping processes that the other side labels as needing to return, but also it would be
responsible for putting into shared memory the processes on its side that are ready to return.
Figure 6.1 shows how this would be set up. Two variables would be created in shared memory:
One for the host to write PID’s to for the co-host to read (host_pid) and one for the co-host to
write PID’s to for the host to read (ebsa_pid).
[Figure 6.1 – Shared Memory with Queues. The figure shows host_pid and ebsa_pid at the start of
shared memory, ahead of the rest of shared memory. The Host and the Ebsa each keep a Ready to
Return linked list and a Sleeping linked list. A process that is ready to return (Process A on
one side, Process J on the other) is added to that side's Ready to Return list. When ebsa_pid =
process G, process G is woken up and removed from the Sleeping linked list; when host_pid =
process E, process E is woken up and removed from the Sleeping linked list.]
The shared memory structure now looks like this (the data regions would have to shrink in order
to contain these variables):

    typedef struct {
        int host_pid;
        int host_stat;
        char host_data[(SHRMEM_SIZE/2)-8];
        int ebsa_pid;
        int ebsa_stat;
        char ebsa_data[(SHRMEM_SIZE/2)-8];
    } shrmem_t;

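The "-8" in each data region only works if the two integers in front of it occupy exactly eight bytes. A hedged way to make that assumption explicit is a set of compile-time checks, sketched below. The value of SHRMEM_SIZE is an assumption chosen for illustration, and the checks rely on the C11 `static_assert` macro, which the original driver would not have had available.

```c
#include <assert.h>
#include <stddef.h>

#define SHRMEM_SIZE 4096  /* assumed size, for illustration only */

typedef struct {
    int host_pid;
    int host_stat;
    char host_data[(SHRMEM_SIZE / 2) - 8];
    int ebsa_pid;
    int ebsa_stat;
    char ebsa_data[(SHRMEM_SIZE / 2) - 8];
} shrmem_t;

/* The "-8" accounts for two 4-byte ints at the start of each half. */
static_assert(sizeof(int) == 4, "layout assumes 4-byte ints");
/* Both halves together must fill the shared region exactly. */
static_assert(sizeof(shrmem_t) == SHRMEM_SIZE, "structure must match region size");
/* The co-host half must begin at the midpoint of the region. */
static_assert(offsetof(shrmem_t, ebsa_pid) == SHRMEM_SIZE / 2, "co-host half at midpoint");
```

Because the host and co-host are different architectures, checks like these would need to pass on both compilers for the two sides to agree on where each field lives.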
This design would have to be implemented on both sides, so both would have to have wait
queues, tasklets, and linked lists. Following is a design I envisioned:

Host inside select()
    Go through one iteration of the file descriptors.
    If a file descriptor is available or the timeout is 0,
        If ebsa_pid == current->pid, then set ebsa_pid to -1 and return,
        Else, put on Ready to Return linked list, and return.
    (No file descriptors available and the timeout > 0)
    If ebsa_pid == current->pid, then set ebsa_pid to -1 and return,
    Else, set up wait queue, add to sleeping linked list, and sleep on the timeout value.
    When awoken, repeat above steps.

Host in bottom half
    If host_pid < 0, remove next process from the Ready to Return linked list
        If there is a process to remove,
            If ebsa_pid == Ready_to_Return PID
                Set ebsa_pid to -1 and return
            Else, put Ready_to_Return PID into host_pid
    If ebsa_pid >= 0, then wake up process in sleeping linked list with PID == ebsa_pid
        If process not found, check Ready to Return list
            If not found, return
            Else, remove from Ready to Return list, set ebsa_pid to -1, and return
        Else, remove from sleeping linked list and return
    Else return

Co-Host inside select()
    Go through one iteration of the file descriptors.
    If a file descriptor is available or the timeout is 0,
        If host_pid == current->pid, then set host_pid to -1 and return,
        Else, put on Ready to Return linked list, and return.
    (No file descriptors available and the timeout > 0)
    If host_pid == current->pid, then set host_pid to -1 and return,
    Else, set up wait queue, add to sleeping linked list, and sleep on the timeout value.
    When awoken, repeat above steps.

Co-Host in bottom half
    If ebsa_pid < 0, remove next process from the Ready to Return linked list
        If there is a process to remove,
            If host_pid == Ready_to_Return PID
                Set host_pid to -1 and return
            Else, put Ready_to_Return PID into ebsa_pid
    If host_pid >= 0, then wake up process in sleeping linked list with PID == host_pid
        If process not found, check Ready to Return list
            If not found, return
            Else, remove from Ready to Return list, set host_pid to -1, and return
        Else, remove from sleeping linked list and return
    Else return

See Appendix A for a state diagram of this design. –1 is used in host_pid and ebsa_pid to
indicate that no PID is in there and that the next PID off of the ready-to-return linked list can be
copied there.
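The role of -1 as a "slot is free" marker can be sketched in user space as a one-entry mailbox. This is an illustration only: `NO_PID`, `mailbox_post()`, and `mailbox_take_if_mine()` are hypothetical names, and the sketch ignores the memory-ordering and bus-access concerns that real shared memory between the host and co-host would raise.

```c
#include <assert.h>

#define NO_PID (-1)   /* sentinel from the design: the slot holds no PID */

/* Try to publish a ready-to-return PID in a shared-memory slot.
 * Returns 1 if the PID was written, or 0 if the slot still holds a PID
 * that the other side has not consumed yet, in which case the caller
 * keeps the PID on its ready-to-return list and retries later. */
int mailbox_post(volatile int *slot, int pid)
{
    assert(pid >= 0);   /* negative values are reserved for the sentinel */
    if (*slot != NO_PID)
        return 0;
    *slot = pid;
    return 1;
}

/* Consume the slot if it names this process. Returns 1 and frees the slot
 * on a match, or 0 if the slot is empty or names another process. */
int mailbox_take_if_mine(volatile int *slot, int my_pid)
{
    if (*slot != my_pid)
        return 0;
    *slot = NO_PID;
    return 1;
}
```

In terms of the design above, the bottom half on one side plays the role of `mailbox_post()` on its own variable, while the select() code on the other side plays the role of `mailbox_take_if_mine()`.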
6.3 The Code
Following is the code for the design presented in Section 6.2. Since the majority of the code is
duplicated on the host and the co-host, the focus of this walkthrough will be on the host side.
The co-host side can easily be seen by replacing ‘host’ with ‘ebsa’ in the variables and functions.
Differences between the two sides will be noted.
The following include files are needed for wait queues, sleeping and waking up processes,
linked lists, and tasklets:

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/list.h>
    #include <linux/interrupt.h>

Because of dependency problems, interrupt.h could not compile without linux/spinlock.h
on the co-host nor asm/system.h on the host, so those files are included as well. Additionally,
com.h is needed to access the shared memory structure, along with ebsa.h to access the shared
memory pointer (g_shrmem) on the co-host and host.h to access the shared memory pointer
(module_info) on the host.
Global variables are declared for each side in select_h.c and select_e.c. The ready-to-return
and sleeping queues are initialized along with the tasklet. tasklet_host_sched_flag is used to
tell the tasklet when to stop rescheduling itself. This flag was needed because I could not get the
tasklet to stop with tasklet_kill() by itself. Setting this value to 1 when we want it to
reschedule and setting it to 0 when we do not want it to be rescheduled anymore allows the
tasklet to start and stop smoothly.

    LIST_HEAD(rr_list_head);
    LIST_HEAD(sleep_list_head);
    DECLARE_TASKLET(select_host_tasklet, select_host_tasklet_func, 0);
    int tasklet_host_sched_flag;

The shared memory area is initialized and the tasklet is started when syscalls_init() is called
in syscalls_h.c and syscalls_e.c.

    mem = (shrmem_t*)module_info.shared_mem_addr;
    mem->host_pid = -1;
    mem->ebsa_pid = -1;
    tasklet_host_sched_flag = 1;
    tasklet_schedule(&select_host_tasklet);

Likewise, the tasklet is stopped in syscalls_cleanup().

    tasklet_host_sched_flag = 0;
    tasklet_kill(&select_host_tasklet);

Two structures are used for the entries in the sleeping and ready-to-return linked lists. The
sleeping list entry has a PID, a pointer to the wait queue it can wake up on, and a struct
list_head that is used for putting it in and taking it out of the linked list. The ready-to-return
list entry is the same except that it has no wait queue.

    typedef struct {
        pid_t pid;
        wait_queue_head_t* wq;
        struct list_head sleep_list_entry;
    } sleep_list_t;

    typedef struct {
        pid_t pid;
        struct list_head rr_list_entry;
    } rr_list_t;

In select_h.c, do_host_select() has added functionality over the original do_select() in
the kernel. It was modified to include a wait queue to sleep on, shared memory checking, and 2
queues. Now it also checks if ebsa_pid in shared memory is equal to the current PID before
going to sleep and, if the process does go to sleep, the tasklet will wake it up when or if
ebsa_pid contains this PID. Also, when this function returns, it now puts its PID on the
ready-to-return queue so that it can be put in shared memory by the tasklet. ebsa_flag is passed in
and indicates if there are co-host-side file descriptors. If there are not, then this function runs the
same as it would in the regular Linux kernel. do_ebsa_select() does not have a host_flag
because this flag is used on the host to bypass all the extra functionality added if there are no file
descriptors destined for the co-host. 1. Added to the declaration list is a pointer to shared memory and pointers to an entry in each of
the queues. A wait queue is declared locally so that this process will be the only one on it.
When wake_up_interruptible() is called on the wait queue, only this process will
awaken. int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag) { poll_table table, *wait; /* list of wait queues */ int retval, i, off; /* off - u_long offset */ long __timeout = *timeout; shrmem_t* mem = NULL; /* shared memory */ rr_list_t* rr_entry = NULL; /* ready-to-return queue */ sleep_list_t* sleep_entry = NULL; /* sleeping queue */ DECLARE_WAIT_QUEUE_HEAD(select_sleep); if (ebsa_flag) { mem = (shrmem_t*)module_info.shared_mem_addr; } 2. The next section is the same as in the kernel except that a negative return value goes to
ready_to_return rather than just returning so that the process can be added to the
ready-to-return queue.

        read_lock(&current->files->file_lock);
        retval = max_select_fd(n, fds);
        read_unlock(&current->files->file_lock);
        if (retval < 0)
            goto ready_to_return;
        n = retval;

        poll_initwait(&table);
        wait = &table;
        if (!__timeout)
            wait = NULL;
        retval = 0;
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            for (i = 0; i < n; i++) {
                unsigned long bit = BIT(i);   /* fd bit in u_long */
                unsigned long mask;           /* poll mask */
                struct file *file;            /* file structure */

                off = i / __NFDBITS;
                if (!(bit & BITS(fds, off)))
                    continue;
                file = fget(i);
                mask = POLLNVAL;
                if (file) {
                    mask = DEFAULT_POLLMASK;
                    if (file->f_op && file->f_op->poll)
                        mask = file->f_op->poll(file, wait);
                    fput(file);
                }
                if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                    SET(bit, __RES_IN(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                    SET(bit, __RES_OUT(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                    SET(bit, __RES_EX(fds,off));
                    retval++;
                    wait = NULL;
                }
            }
            wait = NULL;
            if (retval || !__timeout || signal_pending(current))
                break;
            if(table.error) {
                retval = table.error;
                break;
            }

3. Here is where the majority of the changes occur. We only get to this code if the above if
statements do not break out of the loop. If ebsa_pid is the same as the current PID, then we
know that the other side has finished and we break out of the loop. Otherwise, a sleep queue
entry is created with the current PID and a pointer to the wait queue for this process. The
entry is then added to the list and the process goes to sleep on the wait queue for up to the
period of time specified by timeout. When it wakes up, the entry is deleted from the list and
freed, and the loop runs again. If there are no co-host file descriptors, then
schedule_timeout() is called instead and no wait queue is used. This process continues
until one of the if statements causes the loop to break.

            if (ebsa_flag) {
                if (mem->ebsa_pid == current->pid) {
                    break;
                }
                sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
                if (!sleep_entry) {
                    retval = -ENOMEM;
                    break;
                }
                sleep_entry->pid = current->pid;
                sleep_entry->wq = &select_sleep;
                list_add_tail(&sleep_entry->sleep_list_entry,
                              &sleep_list_head);
                __timeout = interruptible_sleep_on_timeout(
                                sleep_entry->wq, __timeout);
                list_del(&sleep_entry->sleep_list_entry);
                kfree(sleep_entry);
            } else {
                /* same as regular kernel if no EBSA fd's */
                __timeout = schedule_timeout(__timeout);
            }
        }
        current->state = TASK_RUNNING;
        poll_freewait(&table);
        *timeout = __timeout;

4. When the function has finished and is ready to return, it needs to add itself to the
ready-to-return queue if the other side has not indicated that it is ready to return. First ebsa_pid is
checked to see if it contains the current PID; if it does, then ebsa_pid is set to –1 and we
return. This means that the co-host side has indicated that it is done. Otherwise, we allocate
a new ready-to-return list entry, copy the current PID into it, and add it to the queue. Then
we return and the tasklet will pick up the rest of the work.

    ready_to_return:
        if (ebsa_flag) {
            if (mem->ebsa_pid == current->pid) {
                mem->ebsa_pid = -1;
            } else {
                rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
                if (!rr_entry) {
                    /* seems to be most practical return value */
                    retval = -ENOMEM;
                } else {
                    rr_entry->pid = current->pid;
                    list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
                }
            }
        }
        return retval;
    }

The tasklet maintains the two queues and manages host_pid and ebsa_pid in shared memory.
The first thing the host-side tasklet does is check host_pid. If there is currently no PID in
host_pid (the value is –1), then the next process in the ready-to-return queue is taken off the
queue and the PID of that process is examined. If the PID matches the PID in ebsa_pid, then
that means that the co-host side has also returned, so there is no need to write this PID to
host_pid. ebsa_pid is set to –1 and the tasklet exits. Otherwise, the PID is put into host_pid.
The next step is to check ebsa_pid. The tasklet would have gone straight to this step if there
were a PID in host_pid. If ebsa_pid has no PID in it, then the co-host side has nothing ready
to return, so the tasklet exits. Otherwise, it checks if the process with the same PID is in the
sleeping queue. If it is, it is removed from the queue and woken up, at which point the tasklet
will exit. If it is not found in the sleeping queue, the tasklet will see if it has already returned and
is possibly in the ready-to-return queue. If it is, then it is removed from this queue and ebsa_pid
is set to –1 to indicate that the next PID can be put into ebsa_pid by the co-host side. If it is not,
ebsa_pid is not changed and the tasklet exits. Each time the tasklet exits, it reschedules itself
until it sees that the value of tasklet_host_sched_flag is 0.
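The decision sequence described above can be tried out in user space. The following is only a sketch under stated assumptions, not the module code: the kernel linked lists are replaced by small fixed arrays, waking a process is modelled by recording its PID, and the names struct model, q_find, q_remove and host_tasklet_pass are hypothetical. As in the tasklet source, a woken sleeper is left on the sleeping queue, since the process removes its own entry after it wakes.

```c
#include <assert.h>

#define NO_PID (-1)
#define QMAX   8

/* Hypothetical user-space model of the host-side state. */
struct model {
    int host_pid;            /* shared-memory slot written by the host tasklet */
    int ebsa_pid;            /* shared-memory slot written by the co-host side */
    int rr[QMAX];            /* ready-to-return queue, oldest entry first */
    int rr_len;
    int sleeping[QMAX];      /* sleeping queue */
    int sleep_len;
    int woken;               /* last PID woken by the tasklet, NO_PID if none */
};

/* Return the index of pid in the queue, or -1 if it is not there. */
static int q_find(const int *q, int len, int pid)
{
    for (int i = 0; i < len; i++)
        if (q[i] == pid)
            return i;
    return -1;
}

/* Remove the entry at idx, keeping the remaining entries in order. */
static void q_remove(int *q, int *len, int idx)
{
    for (int i = idx; i + 1 < *len; i++)
        q[i] = q[i + 1];
    (*len)--;
}

/* One pass of the host-side tasklet (rescheduling is left out). */
static void host_tasklet_pass(struct model *m)
{
    int idx;

    /* host_pid is free: publish the next ready-to-return PID. */
    if (m->host_pid < 0 && m->rr_len > 0) {
        int pid = m->rr[0];

        q_remove(m->rr, &m->rr_len, 0);
        if (pid == m->ebsa_pid) {       /* both sides have returned */
            m->ebsa_pid = NO_PID;
            return;
        }
        m->host_pid = pid;
    }

    if (m->ebsa_pid < 0)                /* co-host has nothing ready */
        return;

    /* Co-host finished first: wake the matching sleeper, if any. */
    if (q_find(m->sleeping, m->sleep_len, m->ebsa_pid) >= 0) {
        m->woken = m->ebsa_pid;
        return;
    }

    /* Otherwise the process may already be waiting to return. */
    idx = q_find(m->rr, m->rr_len, m->ebsa_pid);
    if (idx >= 0) {
        q_remove(m->rr, &m->rr_len, idx);
        m->ebsa_pid = NO_PID;
    }
}
```

Calling host_tasklet_pass() repeatedly on a model whose queues are filled in by hand makes it easy to step through the cases listed above, for example a PID that is in ebsa_pid while the same process is still asleep on the host.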
1. The declarations include pointers used to navigate through the lists, a pointer to shared
memory, two structures that contain the information needed for each list element (along with
the list_head pointers to indicate its position within a list), and two flags that tell if a PID
was found in one of the lists. The parameter passed into the tasklet is not used.

    void select_host_tasklet_func(unsigned long ptr)
    {
        struct list_head* rr_list_ptr;
        struct list_head* sleep_list_ptr;
        shrmem_t* mem;
        sleep_list_t* sleep_entry = NULL;
        rr_list_t* rr_entry = NULL;
        int sleep_flag = 0;
        int rr_flag = 0;

2. mem is set up to be a pointer to shared memory. Then if host_pid is less than 0 and the
ready-to-return queue is not empty, the next entry from the queue (the one right after the
header) is deleted from the list. The macro list_entry returns a pointer to the structure that
contains a given list_head. This allows the values of this entry to be retrieved. If the PID in this
entry is the same value that is in ebsa_pid, then ebsa_pid is reset to –1 and the tasklet exits.
Otherwise, the PID value of this entry is put into host_pid.

        mem = (shrmem_t*)module_info.shared_mem_addr;
        if ((mem->host_pid < 0) && (!list_empty(&rr_list_head))) {
            rr_entry = list_entry(rr_list_head.next, rr_list_t,
                                  rr_list_entry);
            list_del(rr_list_head.next);
            if (rr_entry->pid == mem->ebsa_pid) {
                kfree(rr_entry);
                goto reset_ebsa;
            } else {
                mem->host_pid = rr_entry->pid;
                kfree(rr_entry);
            }
        }

3. If ebsa_pid is less than 0, then the tasklet stops processing and exits. Otherwise, it searches
the sleeping queue for the PID found in ebsa_pid. The macro list_for_each works like a
for loop. It goes through the entire list starting from the head. sleep_list_ptr points to
the current position in the list. For each position in the list, the PID is checked against the
value in ebsa_pid. If a match is found, then that process is woken up and the tasklet exits.

        if (mem->ebsa_pid < 0) {
            goto tasklet_complete;
        }
        list_for_each (sleep_list_ptr, &sleep_list_head) {
            sleep_entry = list_entry(sleep_list_ptr, sleep_list_t,
                                     sleep_list_entry);
            if (sleep_entry->pid == mem->ebsa_pid) {
                wake_up_interruptible(sleep_entry->wq);
                sleep_flag = 1;
                break;
            }
        }
        if (sleep_flag) {
            goto tasklet_complete;
        }

4. If the entry is not found in the sleeping queue, then the ready-to-return queue is checked. If a
match occurs, the entry is removed from the ready-to-return queue and ebsa_pid is reset to
–1. The tasklet then exits. If the value is not found in the queue, then the tasklet exits
without setting ebsa_pid to –1.

        list_for_each (rr_list_ptr, &rr_list_head) {
            rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
            if (rr_entry->pid == mem->ebsa_pid) {
                list_del(rr_list_ptr);
                rr_flag = 1;
                kfree(rr_entry);
                break;
            }
        }
        if (!rr_flag) {
            goto tasklet_complete;
        }
    reset_ebsa:
        mem->ebsa_pid = -1;

5. Before exiting, the tasklet checks the value of tasklet_host_sched_flag to see if it should
continue rescheduling itself. If the flag is clear, then it does not reschedule itself. If the flag
is set, it is rescheduled to run again.

    tasklet_complete:
        if (tasklet_host_sched_flag) {
            tasklet_schedule(&select_host_tasklet);
        }
    }

In syscalls_h.c, n_sys_select() makes sure the values in shared memory are reset to –1 if
they contain the current PID before the select() call completes. This makes sure there are no
PIDs in shared memory that the tasklet is looking for that it will never find. This case occurs
when there are only co-host file descriptors. The co-host side puts the PID value in shared
memory when it returns, but the host side is never able to reset it because do_host_select()
was never run.

    mem = (shrmem_t*)module_info.shared_mem_addr;
    if (mem->host_pid == current->pid) {
        mem->host_pid = -1;
    }
    if (mem->ebsa_pid == current->pid) {
        mem->ebsa_pid = -1;
    }

6.4 Other Issues
There were two other issues that had to be looked at during this design and implementation.
First, the issue of whether to use semaphores surfaced. I believed that race conditions could
occur when writing to and reading from shared memory or when adding to and deleting from a
linked list. This was a critical issue because, if I wanted to use tasklets, I could not sleep in
them. Fortunately, my design is such that access to shared memory is controlled; only the tasklet
is able to write the PID value to it. But each process reads from it and can set it to –1. Also, the
kernel implementation of linked lists was thought to be free of race conditions because I could
not find any literature or examples that show using semaphores with these lists. It was thought
that, since Linux is a non-preemptive kernel, the process would not go to sleep when
manipulating the lists, which makes this free of race conditions. This issue was never fully
resolved and semaphores were never implemented in the code. No problems have been observed
so far, but comments in the code mark the places that I thought could pose a potential race
condition.
The other issue deals with a cache problem I was having on the co-host during the final testing of
this implementation. I was finding that occasionally I would write a value to shared memory but
the previous value it was supposed to overwrite would still be in there. This happened only
when running two or more processes concurrently that called select(). I tried using volatile
and atomic variables, but neither worked. I saw that the difference between when it would work
and when it would fail depended on the order in which the calls occurred. Since within the
protocol there is a lot of sleeping, the order depends on when processes are scheduled. After I
had gone a while without finding a solution, Max Roth and Jason Hatashita found that it was a
problem with the EBSA’s cache. I do not understand the full details of the problem, but it has to
do with the cache updating adjacent memory locations in shared memory. In order to work around
this issue until it is fixed, I had to space out host_pid and ebsa_pid in shared memory like so:

    typedef struct {
        pid_t host_pid;             /* for use with select() */
        unsigned long spacer1[10];
        pid_t ebsa_pid;             /* for use with select() */
        unsigned long spacer2[10];
        int host_stat;
        char host_data[SHRMEM_DATA_SIZE];
        int ebsa_stat;
        char ebsa_data[SHRMEM_DATA_SIZE];
    } shrmem_t;
This allows the two variables to be far enough apart so that they are updated in separate caches.
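The spacing can be checked at compile or test time rather than by inspection. The sketch below is a user-space illustration only: it copies the field layout of shrmem_t into a hypothetical shrmem_model_t, uses int in place of pid_t, and picks an arbitrary SHRMEM_DATA_SIZE. The 32-byte figure is an assumption about the granularity at which the cache works; the text above does not state it.

```c
#include <assert.h>
#include <stddef.h>

#define SHRMEM_DATA_SIZE   1024   /* assumption: placeholder size for the sketch */
#define ASSUMED_LINE_BYTES 32     /* assumption: cache granularity, not from the text */

typedef int pid_model_t;          /* stand-in for pid_t */

/* Same field order as shrmem_t, rebuilt here so the sketch is self-contained. */
typedef struct {
    pid_model_t host_pid;
    unsigned long spacer1[10];
    pid_model_t ebsa_pid;
    unsigned long spacer2[10];
    int host_stat;
    char host_data[SHRMEM_DATA_SIZE];
    int ebsa_stat;
    char ebsa_data[SHRMEM_DATA_SIZE];
} shrmem_model_t;

/* Distance in bytes between the start of host_pid and the start of ebsa_pid. */
static size_t pid_field_gap(void)
{
    return offsetof(shrmem_model_t, ebsa_pid) -
           offsetof(shrmem_model_t, host_pid);
}

/*
 * Two addresses that share one aligned block of line_bytes bytes are always
 * less than line_bytes apart, so a gap of at least line_bytes guarantees
 * that the two fields land in different blocks.
 */
static int pids_in_separate_lines(size_t line_bytes)
{
    return pid_field_gap() >= line_bytes;
}
```

With ten unsigned longs of padding the gap is at least 40 bytes even when an unsigned long is 4 bytes wide, so pids_in_separate_lines(ASSUMED_LINE_BYTES) holds under the 32-byte assumption. A larger granularity would need a larger spacer.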
7. Conclusion
The select() system call is now fully functional and Telnet and Lynx are able to run through
the protocol as a result of this. It was a very long and challenging process because I had to do a
lot of research and learning in order to figure out how the protocol worked, how select()
worked inside the kernel, and how to use mechanisms available in the kernel to solve the
problems we were having with the implementation. A lot of the process was trial and error. If
one idea did not work, we would try another. For the first design, I continually added
functionality on top of what I already had, hoping that somehow it was going to work, but only
to find out later that this design would be impossible to implement. So I started over from the
beginning. Once select() was able to run successfully on the host and co-host, we then had to
deal with the blocking issue. This project definitely took a lot of resilience and determination, as
new issues continually came up that had to be considered. I gained a greater
respect for those who designed and continue to develop the Linux kernel, as I now see what it
takes to be a true Linux hacker.
There are a number of future work items that need to be done. First, this code needs to be ported
to the new version of the protocol. Unfortunately, the current design for the interrupt-driven
protocol of one file descriptor per thread would not work for select(), as it has to be able to
handle multiple file descriptors. Next, when this code is ported over, the possibility of using
interrupts rather than tasklets should be looked at. Tasklets are used because the polling protocol
did not have interrupts, but interrupts have the potential to be much faster. Also, the ultimate
goal would be to someday be able to get the X Window System working with the protocol. After
all, the reason select() was so important to implement in the first place was because it was
needed to run Netscape. In order to get closer to this goal, the Virtual File System issue and the
loopback issue must be addressed. First, with regard to the Virtual File System issue, when a
socket is created with our protocol, the default file operations for sockets are overridden with our
own implementation of the file operation functions. Currently all these do is print an error
message and return –EFAULT. A couple of these messages occur when attempting to start the X
Window System with our protocol, so resolving this issue may help in getting the X Window
System to work. Second, the X Window System uses sockets with the loopback interface
to communicate between its client and server, so getting the loopback interface to work with the
protocol is critical. Finally, continuing to implement system calls with the protocol is critical to
adding more functionality to the platform. This increased functionality will help the project to
realize the full potential of the system and will enable more research into its capabilities.
References

[1] Bovet, Daniel P. and Marco Cesati. Understanding the Linux Kernel. 1st edition. Sebastopol, CA: O’Reilly, 2001.
[2] McClelland, Mark. “Linux PCI Shared Memory Device Drivers for the Cal Poly Intelligent Network Interface Card.” Senior Project, California Polytechnic State University, San Luis Obispo, June 2001.
[3] McCready, Robert. “Design and Development of the CiNIC Host/Co-Host Protocol.” Senior Project, California Polytechnic State University, San Luis Obispo, February 2002.
[4] Roth, Max. “Design and Implementation for the CiNIC Device Driver v2.0.” Senior Project, California Polytechnic State University, San Luis Obispo, June 2002.
[5] Rubini, Alessandro and Jonathan Corbet. Linux Device Drivers. 2nd edition. Sebastopol, CA: O’Reilly, 2001.
Appendix A – State Diagrams for the Solution to the Blocking Problem

[Figure A.1 – do_select() on Host (state diagram)]
[Figure A.2 – Tasklet on Host (state diagram)]
[Figure A.3 – do_select() on Co-Host (state diagram)]
[Figure A.4 – Tasklet on Co-Host (state diagram)]
Appendix B – Test Plans

Test plan run on September 11, 2001, before solution to blocking problem (up through Chapter 5).

Test | Expected Result | Pass/Fail
-----|-----------------|----------
File descriptors on both host and co-host | Return with file descriptors | Pass
File descriptors on co-host only | Return with file descriptors | Pass
File descriptors on host only | Return with file descriptors | Pass
File descriptors on both, co-host blocks | Return with host file descriptors after timeout period * | Pass*
File descriptors on both, host blocks | Return with co-host file descriptors after timeout period * | Pass*
File descriptors on both, co-host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, timeout is 0 | Return with file descriptors | Pass
No file descriptors, only timeout | Sleep for given period of time | Pass
First argument n < 0 | Return –EINVAL | Pass
First argument n > 1024 | n changed to 1024 | Pass
Give invalid file descriptor | Return –EBADF | Pass
Bad memory address | Return –EFAULT | Pass
File descriptors 32 numbers or more apart | Return with file descriptors | Pass

* = Blocks now, but will not once blocking problem is solved.

Table B.1 – Test Plan Run on 9/11/01
Test plan run on February 22, 2002, after solution to blocking problem (up through Chapter 6).

Test | Expected Result | Pass/Fail
-----|-----------------|----------
File descriptors on both host and co-host | Return with file descriptors | Pass
File descriptors on co-host only | Return with file descriptors | Pass
File descriptors on host only | Return with file descriptors | Pass
File descriptors on both, co-host blocks | Return with host file descriptors immediately | Pass
File descriptors on both, host blocks | Return with co-host file descriptors immediately | Pass
File descriptors on both, co-host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, timeout is 0 | Return with file descriptors | Pass
No file descriptors, only timeout | Sleep for given period of time | Pass
First argument n < 0 | Return –EINVAL | Pass
First argument n > 1024 | n changed to 1024 | Pass
Give invalid file descriptor | Return –EBADF | Pass
Bad memory address | Return –EFAULT | Pass
File descriptors 32 numbers or more apart | Return with file descriptors | Pass
Multiple select() calls at one time | Return with file descriptors | Fail*
Multiple select() calls that block on either side | Return with file descriptors | Fail*
Errors on both | Host error is returned | Pass
Multiple co-host file descriptors | Return with file descriptors | Pass

* These failed because of the caching problem (Section 6.4). Once this was fixed, these passed.

Table B.2 – Test Plan Run on 2/22/02
Appendix C – select() Bit Macros

Since select() deals with three sets of bitmaps, there are quite a number of bit analyzing and
manipulation macros it uses. These can be very confusing to understand, so below I will briefly
outline how each one works.

FDS_BYTES is used in sys_host_select() and is defined in linux/poll.h. It is used to
determine how many bytes are needed for a given number of bits (nr):

    #define FDS_BITPERLONG (8*sizeof(long))
    #define FDS_LONGS(nr) (((nr)+FDS_BITPERLONG-1)/FDS_BITPERLONG)
    #define FDS_BYTES(nr) (FDS_LONGS(nr)*sizeof(long))

If a long is 4 bytes (i386), then FDS_BITPERLONG gives the number of bits in one long, which
would be 32. FDS_LONGS gives the number of longs a given number of bits would take up. For
example, if nr = 60, then FDS_LONGS would be 2. FDS_BYTES then gives the number of bytes for
a given number of bits. If nr is the same value as above, then FDS_BYTES would be 8. This
value is used by kmalloc() as the size of the region to allocate for each bitmap (there are 6
bitmaps altogether, 3 sets of ‘in’ bitmaps and 3 sets of ‘out’ bitmaps).
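The arithmetic above can be reproduced in user space. This is a small sketch, assuming the i386 case from the text by using a 32-bit type (long32_t, a name introduced here) in place of long, so that the numbers come out the same on any build machine. six_bitmaps_bytes() is a hypothetical helper that shows the total for all six bitmaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: model the i386 case from the text, where a long is 4 bytes. */
typedef uint32_t long32_t;

#define FDS_BITPERLONG (8 * sizeof(long32_t))
#define FDS_LONGS(nr)  (((nr) + FDS_BITPERLONG - 1) / FDS_BITPERLONG)
#define FDS_BYTES(nr)  (FDS_LONGS(nr) * sizeof(long32_t))

/* Hypothetical helper: bytes needed for all six bitmaps of nr descriptors. */
static size_t six_bitmaps_bytes(size_t nr)
{
    return 6 * FDS_BYTES(nr);
}

/* Worked values from the text, written as checks. */
static void check_fds_examples(void)
{
    assert(FDS_LONGS(60) == 2);   /* 60 bits need two 32-bit longs */
    assert(FDS_BYTES(60) == 8);   /* two longs of 4 bytes each */
}
```

The rounding term FDS_BITPERLONG-1 is what makes 33 bits take two longs while 32 bits still fit in one.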
The following, except for CLR, were all defined in fs/select.c and had to be copied over with
the kernel code. I defined CLR myself so I could use it in split_select_bitmaps():

    #define BIT(i) (1UL << ((i)&(__NFDBITS-1)))
    #define ISSET(i,m) (((i)&*(m)) != 0)
    #define SET(i,m) (*(m) |= (i))
    #define CLR(i,m) (*(m) &= (~(i)))

__NFDBITS is defined in linux/posix_types.h to be 8*sizeof(unsigned long). BIT sets a
bit in the correct position within an unsigned long. For file descriptors less than 32, this works as
expected. For example, if i = 8, then 1UL (meaning an unsigned long value 1) is moved left 8
spots (…100000000 binary). For file descriptors of 32 and above, it puts it in a bit position that
assumes there are a certain number of unsigned longs in front of it. If i = 60, then the bit is
shifted 28 spots, assuming that there is an unsigned long in front of it (28+32 = 60). SET and CLR
modify the bit selected by mask i (as produced by BIT) in the unsigned long that m points to.
ISSET checks that bit in the unsigned long that m points to and
returns 1 if it is set or 0 if it is not. All of these are used in the split and merge routines, as well
as do_select(), to manipulate the bitmaps. The following are used to find the correct unsigned
long location in each bitmap (originally located in fs/select.c):

    #define __IN(fds, n) (fds->in + n)
    #define __OUT(fds, n) (fds->out + n)
    #define __EX(fds, n) (fds->ex + n)
    #define __RES_IN(fds, n) (fds->res_in + n)
    #define __RES_OUT(fds, n) (fds->res_out + n)
    #define __RES_EX(fds, n) (fds->res_ex + n)

The first three are the ‘in’ bitmaps that the user passes to the system call. The last three are the
‘out’ bitmaps that the kernel copies to user space upon return. fds is of type fd_set_bits.
These macros are used as the m parameter in ISSET, SET, and CLR. Each of these finds the correct
unsigned long that a bit is set in. Figure C.1 shows a typical example of how one of these
bitmaps would be set up. The unsigned longs are contiguous, so n acts like an offset to find the
correct unsigned long. The offset is found by taking a file descriptor number and dividing it by
__NFDBITS. In the example of file descriptor 60, n = 1 because there is one unsigned long in
front of the one that file descriptor 60 is in.
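The split of a file descriptor number into a word offset and a bit within that word can be sketched as follows. This is an illustration rather than the kernel macros themselves: it fixes the word size at 32 bits with uint32_t (an assumption matching the i386 case in the text), and the function names fd_bit, fd_word, fd_set_bit, fd_clr_bit and fd_is_set are introduced here.

```c
#include <assert.h>
#include <stdint.h>

#define NFDBITS32 32u   /* assumption: 32-bit unsigned long, as on i386 */

/* Mask with only the bit for descriptor fd set, within its own word (role of BIT). */
static uint32_t fd_bit(unsigned fd)
{
    return UINT32_C(1) << (fd & (NFDBITS32 - 1));
}

/* Index of the word in the bitmap that holds descriptor fd (the offset n). */
static unsigned fd_word(unsigned fd)
{
    return fd / NFDBITS32;
}

/* Counterparts of SET, CLR and ISSET, taking the bitmap and the descriptor. */
static void fd_set_bit(uint32_t *map, unsigned fd)
{
    map[fd_word(fd)] |= fd_bit(fd);
}

static void fd_clr_bit(uint32_t *map, unsigned fd)
{
    map[fd_word(fd)] &= ~fd_bit(fd);
}

static int fd_is_set(const uint32_t *map, unsigned fd)
{
    return (map[fd_word(fd)] & fd_bit(fd)) != 0;
}

/* The two worked values from the text. */
static void check_fd_examples(void)
{
    assert(fd_bit(8) == 0x100u);                  /* 1 shifted left 8 spots */
    assert(fd_word(60) == 1);                     /* one word in front of fd 60 */
    assert(fd_bit(60) == (UINT32_C(1) << 28));    /* 28 + 32 = 60 */
}
```

Setting descriptor 60 in a three-word bitmap therefore touches only map[1], leaving map[0] and map[2] unchanged.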
The final two macros are used to check bits in three of the sets concurrently. BITS was defined
in fs/select.c while I defined RES_BITS so I could use it in the merge routine.

    #define BITS(fds, n) (*__IN(fds, n)|*__OUT(fds, n)|*__EX(fds, n))
    #define RES_BITS(fds, n) \
        (*__RES_IN(fds, n)|*__RES_OUT(fds, n)|*__RES_EX(fds, n))

These dereference the pointers in the macros above. All the bits from the same unsigned long
offset in each of the three sets are OR’ed together to find which bits are set in any of the bitmaps.
These macros allow a bit that is defined with the BIT macro to be AND’ed with either of these to
find out if that bit is set in any of the sets. For example, suppose each bitmap only had 4 bits.
*__IN is 0001, *__OUT is 0011, and *__EX is 0110. BITS would then be all three of these OR’ed
together: 0111. This shows that file descriptors 0-2 are set in at least one of the bitmaps and file
descriptor 3 is not set in any of the bitmaps. These macros are used in the split and merge
routines, as well as in do_select().
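The 4-bit example can be checked directly. The sketch below uses a hypothetical struct fdsets in place of fd_set_bits (holding only the three request bitmaps) and 32-bit words; any_bits() plays the role of BITS and fd_requested() combines it with a BIT-style mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for fd_set_bits, holding only the three request sets. */
struct fdsets {
    const uint32_t *in;
    const uint32_t *out;
    const uint32_t *ex;
};

/* Union of the three request bitmaps at word offset n (the role of BITS). */
static uint32_t any_bits(const struct fdsets *fds, unsigned n)
{
    return fds->in[n] | fds->out[n] | fds->ex[n];
}

/* Nonzero if descriptor fd is requested in at least one of the three sets. */
static int fd_requested(const struct fdsets *fds, unsigned fd)
{
    return (any_bits(fds, fd / 32) & (UINT32_C(1) << (fd % 32))) != 0;
}
```

With in = 0001, out = 0011 and ex = 0110 in word 0, any_bits() gives 0111, so fd_requested() is true for descriptors 0 through 2 and false for descriptor 3.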
[Figure C.1 – The Representation of a Bitmap in Kernel Memory: contiguous unsigned longs holding bits/fds 95–64, 63–32, and 31–0, with further unsigned longs following as needed]
Appendix D – Source Code

D.1 select.h

/*
 * select.h - Cal Poly 3Com CiNIC project
 *
 * Definitions and functions for the select() system call. Based off of
 * fs/select.c in kernel version 2.4.2. Code may need to be changed when newer
 * versions of the kernel are used.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select.h,v 1.1 2002/05/01 02:01:30 jkwek Exp $
 */

#ifndef _SELECT_H_
#define _SELECT_H_

#include "global.h"

/* these files included in fs/select.c in Linux kernel */
#include <linux/slab.h>
#include <linux/poll.h>
#include <linux/file.h>
#include <asm/uaccess.h>

/* needed for wait queues, waking up processes, and linked lists */
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/list.h>

/*
 * All of the following except RES_BITS and CLR are also defined in
 * fs/select.c. These two were added for the extra functionality I needed.
 */
#define ROUND_UP(x,y) (((x)+(y)-1)/(y))
#define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)

#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)

/*
 * Goes to correct u_long boundary in the bitmaps. The kernel routine sets
 * up the bitmaps along these boundaries for efficiency and speed.
 */
#define __IN(fds, n) (fds->in + n)
#define __OUT(fds, n) (fds->out + n)
#define __EX(fds, n) (fds->ex + n)
#define __RES_IN(fds, n) (fds->res_in + n)
#define __RES_OUT(fds, n) (fds->res_out + n)
#define __RES_EX(fds, n) (fds->res_ex + n)

/* checks if the bit is set in any of the three bitmaps for a given fd */
#define BITS(fds, n) (*__IN(fds, n)|*__OUT(fds, n)|*__EX(fds, n))
#define RES_BITS(fds, n) \
    (*__RES_IN(fds, n)|*__RES_OUT(fds, n)|*__RES_EX(fds, n))

/*
 * Bit manipulation routines
 * BIT puts a 1 in the correct u_long location.
 */
#define BIT(i) (1UL << ((i)&(__NFDBITS-1)))
#define ISSET(i,m) (((i)&*(m)) != 0)
#define SET(i,m) (*(m) |= (i))
#define CLR(i,m) (*(m) &= (~(i)))

/* longest timeout value */
#define MAX_SELECT_SECONDS \
    ((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)

/* parameters for old_select() */
typedef struct {
    unsigned long n;
    fd_set *inp, *outp, *exp;
    struct timeval *tvp;
} select_param_t;

/* divide parameters between host and EBSA */
typedef struct {
    int n;              /* highest numbered fd in split set */
    int size;           /* size of split set */
    long timeout;       /* timeout of split set */
    fd_set_bits fds;    /* pointers to bitmaps in the memory region */
    char* bits;         /* pointer to the memory region */
} select_split_t;

/* entry in sleeping queue */
typedef struct {
    pid_t pid;                          /* process id */
    wait_queue_head_t* wq;              /* ptr to waitq sleeping on */
    struct list_head sleep_list_entry;  /* positioning in linked list */
} sleep_list_t;

/* entry in ready-to-return queue */
typedef struct {
    pid_t pid;                          /* process id */
    struct list_head rr_list_entry;     /* positioning in linked list */
} rr_list_t;

/* functions in select_h.c */
int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag);
long n_old_select(select_param_t *args);
long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                     struct timeval *tvp);
void select_host_tasklet_func(unsigned long ptr);

/* functions in select_e.c */
int do_ebsa_select(int n, fd_set_bits *fds, long *timeout, pid_t remote_pid);
long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits,
                     pid_t remote_pid);
void select_ebsa_tasklet_func(unsigned long ptr);

/* select function in syscalls_h.c */
long n_sys_select(select_split_t* local, select_split_t* remote);

#endif

D.2 select_h.c

/*
 * select_h.c - Cal Poly 3Com CiNIC project
 *
 * Host-side select() implementation. Most of this code was taken from
 * fs/select.c in the Linux 2.4.2 kernel with bitmap manipulation and EBSA-side
 * communication added in. The conventional method of hijacking system calls
 * could not be performed on select() because it needs to be run concurrently
 * on the host and EBSA sides. Potential race conditions could occur when
 * writing to or reading from shared memory, or adding and deleting entries
 * from the lists. However, no problems have been seen as of yet. Code may
 * need to be changed when newer versions of the kernel are used. Refer to
 * senior project for a more detailed description.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select_h.c,v 1.1 2002/05/01 02:01:46 jkwek Exp $
 */

#include <asm/system.h>       /* for interrupt.h to compile */
#include <linux/interrupt.h>  /* for tasklets */
#include <linux/module.h>     /* MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT */
#include "global.h"
#include "select.h"
#include "fd_map.h"           /* split and merge bitmaps */
#include "host.h"             /* module_info */
#include "com.h"              /* shrmem_t */

/* Initialize host-side ready-to-return and sleeping queues */
LIST_HEAD(rr_list_head);
LIST_HEAD(sleep_list_head);

/* Initialize host-side tasklet */
DECLARE_TASKLET(select_host_tasklet, select_host_tasklet_func, 0);
int tasklet_host_sched_flag;

/*
 * All code in this function is taken from select.c. This had to be copied
 * because it is static. This function tests for bad file descriptors and
 * returns the maximum file descriptor in any of the sets, plus one. -EBADF
 * is returned for bad file descriptors.
 */
static int max_select_fd(unsigned long n, fd_set_bits *fds)
{
    unsigned long *open_fds;
    unsigned long set;
    int max;

    /* handle last in-complete long-word first */
    set = ~(~0UL << (n & (__NFDBITS-1)));
    n /= __NFDBITS;
    open_fds = current->files->open_fds->fds_bits+n;
    max = 0;
    if (set) {
        set &= BITS(fds, n);
        if (set) {
            if (!(set & ~*open_fds))
                goto get_max;
            return -EBADF;
        }
    }
    while (n) {
        open_fds--;
        n--;
        set = BITS(fds, n);
        if (!set)
            continue;
        if (set & ~*open_fds)
            return -EBADF;
        if (max)
            continue;
get_max:
        do {
            max++;
            set >>= 1;
        } while (set);
        max += n * __NFDBITS;
    }

    return max;
}

/*
 * This is the heart of select(). In the original kernel, it checks each file
 * descriptor in the bitmaps and sets the bit in the result bitmaps for each
 * available file descriptor. The number of available file descriptors is
 * returned. If no file descriptors are available, this function sleeps until
 * one becomes available or the timeout expires. The kernel code was modified
 * to include a wait queue to sleep on, shared memory checking, and 2 regular
 * queues. Now it also checks if ebsa_pid in shared memory is equal to the
 * current PID before going to sleep and, if the process does go to sleep,
 * the tasklet will wake it up when or if they are equal. Also, when this
 * function returns, it now puts its PID on the ready-to-return queue so that
 * it can be put in shared memory by the tasklet. ebsa_flag indicates if there
 * are EBSA-side file descriptors. If there are not, then this function runs
 * the same as it would in the regular Linux kernel.
 */
int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag)
{
    poll_table table, *wait;            /* list of wait queues */
    int retval, i, off;                 /* off - u_long offset */
    long __timeout = *timeout;
    shrmem_t* mem = NULL;               /* shared memory */
    rr_list_t* rr_entry = NULL;         /* ready-to-return queue */
    sleep_list_t* sleep_entry = NULL;   /* sleeping queue */
    DECLARE_WAIT_QUEUE_HEAD(select_sleep);

    /* point to shared memory */
    if (ebsa_flag) {
        mem = (shrmem_t*)module_info.shared_mem_addr;
    }

    /* get max numbered file descriptor */
    read_lock(&current->files->file_lock);
    retval = max_select_fd(n, fds);
    read_unlock(&current->files->file_lock);
    if (retval < 0)
        goto ready_to_return;
    n = retval;

    /* set up a list of wait queues to be used by the poll() method */
    poll_initwait(&table);
    wait = &table;
    if (!__timeout)
        wait = NULL;
    retval = 0;
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * For each file descriptor selected, call the poll() method
         * for that file system type. The poll() method will return
         * a mask that can be used to check its status and set the
         * appropriate bits in the result bitmaps.
         */
        for (i = 0; i < n; i++) {
            unsigned long bit = BIT(i);   /* fd bit in u_long */
            unsigned long mask;           /* poll mask */
            struct file *file;            /* file structure */

            off = i / __NFDBITS;
            if (!(bit & BITS(fds, off)))
                continue;
            file = fget(i);
            mask = POLLNVAL;
            if (file) {
                mask = DEFAULT_POLLMASK;
                if (file->f_op && file->f_op->poll)
                    mask = file->f_op->poll(file, wait);
                fput(file);
            }
            if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                SET(bit, __RES_IN(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                SET(bit, __RES_OUT(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                SET(bit, __RES_EX(fds,off));
                retval++;
                wait = NULL;
            }
        }
        wait = NULL;
        if (retval || !__timeout || signal_pending(current))
            break;
        if(table.error) {
            retval = table.error;
            break;
        }

        if (ebsa_flag) {
            /* check if EBSA is done before sleeping */
            if (mem->ebsa_pid == current->pid) {
                break;
            }

            sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
            if (!sleep_entry) {
                /* seems to be most practical return value */
                retval = -ENOMEM;
                break;
            }
            sleep_entry->pid = current->pid;
            sleep_entry->wq = &select_sleep;

            /* add to sleeping queue and go to sleep */
            list_add_tail(&sleep_entry->sleep_list_entry,
                          &sleep_list_head);
            __timeout = interruptible_sleep_on_timeout(
                            sleep_entry->wq, __timeout);
            list_del(&sleep_entry->sleep_list_entry);
            kfree(sleep_entry);
        } else {
            /* same as regular kernel if no EBSA fd's */
            __timeout = schedule_timeout(__timeout);
        }
    }
    current->state = TASK_RUNNING;

    poll_freewait(&table);

    /* Up-to-date the caller timeout */
    *timeout = __timeout;

ready_to_return:
    if (ebsa_flag) {
        /* if EBSA side is ready, we are done */
        if (mem->ebsa_pid == current->pid) {
            mem->ebsa_pid = -1;
        } else {
            /* add to ready-to-return queue if EBSA not ready */
            rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
            if (!rr_entry) {
                /* seems to be most practical return value */
                retval = -ENOMEM;
            } else {
                rr_entry->pid = current->pid;
                list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
            }
        }
    }
    return retval;
}

/*
 * old_select() is used by Netscape. The reason it may use this rather than
 * the newer version is for compatibility. This function was designed
 * to be used back in the days when you could not pass 5 parameters to a
 * system call due to register limitations. This code was mostly borrowed
 * from old_select() in arch/i386/kernel/sys_i386.c. It gets the arguments
 * from user space and then calls the new version.
 */
long n_old_select(select_param_t *args)
{
    select_param_t a;
    long retval;

    MOD_INC_USE_COUNT;

    if (copy_from_user(&a, args, sizeof(a))) {
        MOD_DEC_USE_COUNT;
        return -EFAULT;
    }
    retval = sys_host_select(a.n, a.inp, a.outp, a.exp, a.tvp);

    MOD_DEC_USE_COUNT;
    return retval;
}

/*
 * This is the wrapper function for select(). It takes the user space
 * parameters and sets them up in kernel memory. Once select() is finished,
 * it copies the parameters back to user space. This code comes from
 * sys_select() in fs/select.c and is modified to have both EBSA and host side
 * parameters that are split and merged together.
 */
long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                     struct timeval *tvp)
{
    select_split_t host, ebsa;  /* host and EBSA side values */
    fd_set rfds, wfds, efds;    /* for creating EBSA bitmaps */
    long timeout, ret;

    MOD_INC_USE_COUNT;

    /* get timeout value from user space and change to jiffies */
    timeout = MAX_SCHEDULE_TIMEOUT;
    if (tvp) {
        time_t sec, usec;

        if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))
            || (ret = __get_user(sec, &tvp->tv_sec))
            || (ret = __get_user(usec, &tvp->tv_usec)))
            goto out_nofds;

        ret = -EINVAL;
        if (sec < 0 || usec < 0)
            goto out_nofds;

        if ((unsigned long) sec < MAX_SELECT_SECONDS) {
            timeout = ROUND_UP(usec, 1000000/HZ);
            timeout += sec * (unsigned long) HZ;
        }
    }
    host.timeout = timeout;
    ebsa.timeout = timeout;

    ret = -EINVAL;
    if (n < 0)
        goto out_nofds;

    if (n > current->files->max_fdset)
        n = current->files->max_fdset;

    /*
     * We need 6 bitmaps (in/out/ex for both incoming and outgoing),
     * since we used fdset we need to allocate memory in units of
     * long-words.
     */
    ret = -ENOMEM;
    host.size = FDS_BYTES(n);
    host.bits = kmalloc(6 * host.size, GFP_KERNEL);
    if (!host.bits)
        goto out_nofds;
    host.fds.in      = (unsigned long *)  host.bits;
    host.fds.out     = (unsigned long *) (host.bits +   host.size);
    host.fds.ex      = (unsigned long *) (host.bits + 2*host.size);
    host.fds.res_in  = (unsigned long *) (host.bits + 3*host.size);
    host.fds.res_out = (unsigned long *) (host.bits + 4*host.size);
    host.fds.res_ex  = (unsigned long *) (host.bits + 5*host.size);

    /* get bitmaps from user space and set result bitmaps to 0 */
    if ((ret = get_fd_set(n, inp, host.fds.in)) ||
        (ret = get_fd_set(n, outp, host.fds.out)) ||
        (ret = get_fd_set(n, exp, host.fds.ex)))
        goto out;
    zero_fd_set(n, host.fds.res_in);
    zero_fd_set(n, host.fds.res_out);
    zero_fd_set(n, host.fds.res_ex);

    /*
     * move the EBSA descriptors to rfds, wfds, and efds
     * set the n values for each
     */
    split_select_bitmaps(n, &host, &ebsa, &rfds, &wfds, &efds);

    if (ebsa.n > 0) {
        /*
         * If there are EBSA file descriptors, set up the EBSA side
         * bitmaps and copy over from the fd_set's.
         */
        ret = -ENOMEM;
        ebsa.size = FDS_BYTES(ebsa.n);
        ebsa.bits = kmalloc(6 * ebsa.size, GFP_KERNEL);
        if (!ebsa.bits)
            goto out;
        ebsa.fds.in      = (unsigned long *)  ebsa.bits;
        ebsa.fds.out     = (unsigned long *) (ebsa.bits +   ebsa.size);
        ebsa.fds.ex      = (unsigned long *) (ebsa.bits + 2*ebsa.size);
        ebsa.fds.res_in  = (unsigned long *) (ebsa.bits + 3*ebsa.size);
        ebsa.fds.res_out = (unsigned long *) (ebsa.bits + 4*ebsa.size);
        ebsa.fds.res_ex  = (unsigned long *) (ebsa.bits + 5*ebsa.size);

        memcpy((void*)ebsa.fds.in, (void*)&rfds, ebsa.size);
        memcpy((void*)ebsa.fds.out, (void*)&wfds, ebsa.size);
        memcpy((void*)ebsa.fds.ex, (void*)&efds, ebsa.size);
        zero_fd_set(ebsa.n, ebsa.fds.res_in);
        zero_fd_set(ebsa.n, ebsa.fds.res_out);
        zero_fd_set(ebsa.n, ebsa.fds.res_ex);

        /* send to EBSA, run on both sides, merge bitmaps when done */
        ret = n_sys_select(&host, &ebsa);
        merge_select_bitmaps(&host, &ebsa);
        kfree(ebsa.bits);
    } else {
        /* run do_host_select() only if no EBSA descriptors */
        ret = do_host_select(host.n, &host.fds, &host.timeout, 0);
    }

    /* copy the smallest timeout value to user space (elapsed time) */
    if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
        time_t sec = 0, usec = 0;
        if (ebsa.timeout < host.timeout) {
            timeout = ebsa.timeout;
        } else {
            timeout = host.timeout;
        }
        if (timeout) {
            sec = timeout / HZ;
            usec = timeout % HZ;
            usec *= (1000000/HZ);
        }
        put_user(sec, &tvp->tv_sec);
        put_user(usec, &tvp->tv_usec);
    }

    if (ret < 0) {
        goto out;
    }

    /* a 0 return value could mean a signal is pending, restart select() */
    if (!ret) {
        ret = -ERESTARTNOHAND;
        if (signal_pending(current))
            goto out;
        ret = 0;
    }

    /* copy to user space */
    set_fd_set(n, inp, host.fds.res_in);
    set_fd_set(n, outp, host.fds.res_out);
    set_fd_set(n, exp, host.fds.res_ex);

out:
    kfree(host.bits);
out_nofds:
    MOD_DEC_USE_COUNT;
    return ret;
}

/*
 * This function is executed every time tasklets are scheduled to run. None
 * of this function came from the kernel. It is used to facilitate message
 * passing via PID's across shared memory. If host_pid in shared memory is -1,
 * then the next process id on the ready-to-return queue is put into this
 * region for the EBSA side to see that the host side has finished. If
 * ebsa_pid is a positive number, this means that the EBSA side has finished
 * and is ready to return, so this tasklet will wake the process up if it
 * is sleeping and remove it from any queues.
 */
void select_host_tasklet_func(unsigned long ptr)
{
    struct list_head* rr_list_ptr;      /* ptr to current list position */
    struct list_head* sleep_list_ptr;
    shrmem_t* mem;                      /* shared mem ptr */
    sleep_list_t* sleep_entry = NULL;   /* entries in sleep queue */
    rr_list_t* rr_entry = NULL;         /* entries in ready-to-ret queue */
    int sleep_flag = 0;                 /* set if process was woken */
    int rr_flag = 0;                    /* set if process has returned */

    /* setup shared mem ptr */
    mem = (shrmem_t*)module_info.shared_mem_addr;

    /*
     * If host_pid is -1 and the ready-to-return queue is not empty,
     * then remove the next entry from the list. If it is equal to
     * ebsa_pid, both sides are ready-to-return and we are done.
     * Otherwise, put the PID value into host_pid.
     */
    if ((mem->host_pid < 0) && (!list_empty(&rr_list_head))) {
        rr_entry = list_entry(rr_list_head.next, rr_list_t, rr_list_entry);
        list_del(rr_list_head.next);
        if (rr_entry->pid == mem->ebsa_pid) {
            kfree(rr_entry);
            goto reset_ebsa;
        } else {
            mem->host_pid = rr_entry->pid;
            kfree(rr_entry);
        }
    }

    /* nothing else to do if EBSA has no PID to offer */
    if (mem->ebsa_pid < 0) {
        goto tasklet_complete;
    }

    /*
     * Search through the sleeping queue for the process id in ebsa_pid.
     * If it is found, wake up the process and exit the tasklet.
     */
    list_for_each (sleep_list_ptr, &sleep_list_head) {
        sleep_entry = list_entry(sleep_list_ptr, sleep_list_t,
                                 sleep_list_entry);
        if (sleep_entry->pid == mem->ebsa_pid) {
            wake_up_interruptible(sleep_entry->wq);
            sleep_flag = 1;
            break;
        }
    }
    if (sleep_flag) {
        goto tasklet_complete;
    }

    /*
     * The process was not sleeping, so see if it is waiting to put its PID
     * on the ready-to-return queue. If it is found, remove the entry
     * and exit the tasklet.
     */
    list_for_each (rr_list_ptr, &rr_list_head) {
        rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
        if (rr_entry->pid == mem->ebsa_pid) {
            list_del(rr_list_ptr);
            rr_flag = 1;
            kfree(rr_entry);
            break;
        }
    }
    if (!rr_flag) {
        goto tasklet_complete;
    }

    /*
     * ebsa_pid = -1 signals to the EBSA side that it can put a new PID
     * value into it. We need to reset it here when ebsa_pid's value was
     * a PID that was on the ready-to-return queue. Otherwise it is
     * reset in do_host_select().
     */
reset_ebsa:
    mem->ebsa_pid = -1;

    /*
     * Continuously reschedule the tasklet as long as
     * tasklet_host_sched_flag is set.
     */
tasklet_complete:
    if (tasklet_host_sched_flag) {
        tasklet_schedule(&select_host_tasklet);
    }
}

D.3 select_e.c

/*
 * select_e.c - Cal Poly 3Com CiNIC project
 *
 * EBSA-side select() implementation. Most of this code was taken from
 * fs/select.c in the Linux 2.4.2 kernel with bitmap manipulation and host-side
 * communication added in. The conventional method of hijacking system calls
 * could not be performed on select() because it needs to be run concurrently
 * on the host and EBSA sides. In particular for the EBSA side,
 * sys_ebsa_select() is much different from sys_select() in the Linux kernel
 * because most of the memory manipulation is done on the host side before
 * it gets here. Potential race conditions could occur when writing to or
 * reading from shared memory, or adding and deleting entries from the lists.
 * However, no problems have been seen as of yet. Code may need to be changed
 * when newer versions of the kernel are used. Refer to senior project for a
 * more detailed description.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select_e.c,v 1.1 2002/05/01 02:01:54 jkwek Exp $
 */

#include <linux/spinlock.h>     /* for interrupt.h to compile */
#include <linux/interrupt.h>    /* for tasklets */
#include "global.h"
#include "select.h"
#include "ebsa.h"               /* g_shrmem */
#include "com.h"                /* shrmem_t */

/* Initialize EBSA-side ready-to-return and sleeping queues */
LIST_HEAD(rr_list_head);
LIST_HEAD(sleep_list_head);

/* Initialize EBSA-side tasklet */
DECLARE_TASKLET(select_ebsa_tasklet, select_ebsa_tasklet_func, 0);
int tasklet_ebsa_sched_flag;

/*
 * All code in this function is taken from select.c. This had to be copied
 * because it is static. This function tests for bad file descriptors and
 * returns the maximum file descriptor in any of the sets, plus one. -EBADF
 * is returned for bad file descriptors.
 */
static int max_select_fd(unsigned long n, fd_set_bits *fds)
{
    unsigned long *open_fds;
    unsigned long set;
    int max;

    /* handle last in-complete long-word first */
    set = ~(~0UL << (n & (__NFDBITS-1)));
    n /= __NFDBITS;
    open_fds = current->files->open_fds->fds_bits+n;
    max = 0;
    if (set) {
        set &= BITS(fds, n);
        if (set) {
            if (!(set & ~*open_fds))
                goto get_max;
            return -EBADF;
        }
    }
    while (n) {
        open_fds--;
        n--;
        set = BITS(fds, n);
        if (!set)
            continue;
        if (set & ~*open_fds)
            return -EBADF;
        if (max)
            continue;
get_max:
        do {
            max++;
            set >>= 1;
        } while (set);
        max += n * __NFDBITS;
    }

    return max;
}

/*
 * This is the heart of select(). In the original kernel, it checks each file
 * descriptor in the bitmaps and sets the bit in the result bitmaps for each
 * available file descriptor. The number of available file descriptors is
 * returned. If no file descriptors are available, this function sleeps until
 * one becomes available or the timeout expires. The kernel code was modified
 * to include a wait queue to sleep on, shared memory checking, and 2 regular
 * queues. Now it also checks if host_pid in shared memory is equal to the
 * current PID before going to sleep and, if the process does go to sleep,
 * the tasklet will wake it up when or if they are equal. Also, when this
 * function returns, it now puts its PID on the ready-to-return queue so that
 * it can be put in shared memory by the tasklet.
 */
int do_ebsa_select(int n, fd_set_bits *fds, long *timeout, pid_t remote_pid)
{
    poll_table table, *wait;            /* list of wait queues */
    int retval, i, off;                 /* off - u_long offset */
    long __timeout = *timeout;
    shrmem_t* mem;                      /* shared memory */
    rr_list_t* rr_entry = NULL;         /* ready-to-return queue */
    sleep_list_t* sleep_entry = NULL;   /* sleeping queue */
    DECLARE_WAIT_QUEUE_HEAD(select_sleep);

    /* point to shared memory */
    mem = (shrmem_t*)g_shrmem;

    /* get max numbered file descriptor */
    read_lock(&current->files->file_lock);
    retval = max_select_fd(n, fds);
    read_unlock(&current->files->file_lock);

    if (retval < 0)
        goto ready_to_return;
    n = retval;

    /* set up a list of wait queues to be used by the poll() method */
    poll_initwait(&table);
    wait = &table;
    if (!__timeout)
        wait = NULL;
    retval = 0;
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * For each file descriptor selected, call the poll() method
         * for that file system type. The poll() method will return
         * a mask that can be used to check its status and set the
         * appropriate bits in the result bitmaps.
         */
        for (i = 0; i < n; i++) {
            unsigned long bit = BIT(i);     /* fd bit in u_long */
            unsigned long mask;             /* poll mask */
            struct file *file;              /* file structure */

            off = i / __NFDBITS;
            if (!(bit & BITS(fds, off)))
                continue;
            file = fget(i);
            mask = POLLNVAL;
            if (file) {
                mask = DEFAULT_POLLMASK;
                if (file->f_op && file->f_op->poll)
                    mask = file->f_op->poll(file, wait);
                fput(file);
            }
            if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                SET(bit, __RES_IN(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                SET(bit, __RES_OUT(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                SET(bit, __RES_EX(fds,off));
                retval++;
                wait = NULL;
            }
        }
        wait = NULL;
        if (retval || !__timeout || signal_pending(current))
            break;
        if(table.error) {
            retval = table.error;
            break;
        }

        /* check if host is done before sleeping */
        if (mem->host_pid == remote_pid) {
            break;
        }

        sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
        if (!sleep_entry) {
            /* seems to be most practical return value */
            retval = -ENOMEM;
            break;
        }
        sleep_entry->pid = remote_pid;
        sleep_entry->wq = &select_sleep;

        /* add to sleeping queue and go to sleep */
        list_add_tail(&sleep_entry->sleep_list_entry, &sleep_list_head);
        __timeout = interruptible_sleep_on_timeout(sleep_entry->wq,
                                                   __timeout);
        list_del(&sleep_entry->sleep_list_entry);
        kfree(sleep_entry);
    }
    current->state = TASK_RUNNING;

    poll_freewait(&table);

    /* Up-to-date the caller timeout */
    *timeout = __timeout;

ready_to_return:
    /* if host side is ready, we are done */
    if (mem->host_pid == remote_pid) {
        mem->host_pid = -1;
    } else {
        /* add to ready-to-return queue if host not ready */
        rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
        if (!rr_entry) {
            /* seems to be most practical return value */
            retval = -ENOMEM;
        } else {
            rr_entry->pid = remote_pid;
            list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
        }
    }
    return retval;
}

/*
 * This is the wrapper function for the EBSA-side do_ebsa_select(). It takes
 * the bitmaps from the packet received from the host and divides them into
 * their respective regions for the 6 bitmaps. Then do_ebsa_select() is called
 * to do the work. This is all that is needed on the EBSA side, as the host
 * side takes care of the rest.
 */
long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits,
                     pid_t remote_pid)
{
    fd_set_bits bmaps;  /* struct with u_long pointers to bitmaps */
    long retval;

    bmaps.in      = (unsigned long *)  ebsa_bits;
    bmaps.out     = (unsigned long *) (ebsa_bits +   size);
    bmaps.ex      = (unsigned long *) (ebsa_bits + 2*size);
    bmaps.res_in  = (unsigned long *) (ebsa_bits + 3*size);
    bmaps.res_out = (unsigned long *) (ebsa_bits + 4*size);
    bmaps.res_ex  = (unsigned long *) (ebsa_bits + 5*size);

    retval = do_ebsa_select(n, &bmaps, timeout, remote_pid);

    if (!retval) {
        if (signal_pending(current)) {
            retval = -ERESTARTNOHAND;
        }
    }
    return retval;
}

/*
 * This function is executed every time tasklets are scheduled to run. None
 * of this function came from the kernel. It is used to facilitate message
 * passing via PID's across shared memory. If ebsa_pid in shared memory is -1,
 * then the next process id on the ready-to-return queue is put into this
 * region for the host side to see that the EBSA side has finished. If
 * host_pid is a positive number, this means that the host side has finished
 * and is ready to return, so this tasklet will wake the process up if it
 * is sleeping and remove it from any queues.
 */
void select_ebsa_tasklet_func(unsigned long ptr)
{
    struct list_head* rr_list_ptr;      /* ptr to current list position */
    struct list_head* sleep_list_ptr;
    shrmem_t* mem;                      /* shared mem ptr */
    sleep_list_t* sleep_entry = NULL;   /* entries in sleep queue */
    rr_list_t* rr_entry = NULL;         /* entries in ready-to-ret queue */
    int sleep_flag = 0;                 /* set if process was woken */
    int rr_flag = 0;                    /* set if process has returned */

    /* setup shared mem ptr */
    mem = (shrmem_t*)g_shrmem;

    /*
     * If ebsa_pid is -1 and the ready-to-return queue is not empty,
     * then remove the next entry from the list. If it is equal to
     * host_pid, both sides are ready-to-return and we are done.
     * Otherwise, put the PID value into ebsa_pid.
     */
    if ((mem->ebsa_pid < 0) && (!list_empty(&rr_list_head))) {
        rr_entry = list_entry(rr_list_head.next, rr_list_t, rr_list_entry);
        list_del(rr_list_head.next);
        if (rr_entry->pid == mem->host_pid) {
            kfree(rr_entry);
            goto reset_host;
        } else {
            mem->ebsa_pid = rr_entry->pid;
            kfree(rr_entry);
        }
    }

    /* nothing else to do if host has no PID to offer */
    if (mem->host_pid < 0) {
        goto tasklet_complete;
    }

    /*
     * Search through the sleeping queue for the process id in host_pid.
     * If it is found, wake up the process and exit the tasklet.
     */
    list_for_each (sleep_list_ptr, &sleep_list_head) {
        sleep_entry = list_entry(sleep_list_ptr, sleep_list_t,
                                 sleep_list_entry);
        if (sleep_entry->pid == mem->host_pid) {
            wake_up_interruptible(sleep_entry->wq);
            sleep_flag = 1;
            break;
        }
    }
    if (sleep_flag) {
        goto tasklet_complete;
    }

    /*
     * The process was not sleeping, so see if it is waiting to put its PID
     * on the ready-to-return queue. If it is found, remove the entry
     * and exit the tasklet.
     */
    list_for_each (rr_list_ptr, &rr_list_head) {
        rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
        if (rr_entry->pid == mem->host_pid) {
            list_del(rr_list_ptr);
            rr_flag = 1;
            kfree(rr_entry);
            break;
        }
    }
    if (!rr_flag) {
        goto tasklet_complete;
    }

    /*
     * host_pid = -1 signals to the host side that it can put a new PID
     * value into it. We need to reset it here when host_pid's value was
     * a PID that was on the ready-to-return queue. Otherwise it is
     * reset in do_ebsa_select().
     */
reset_host:
    mem->host_pid = -1;

    /*
     * Continuously reschedule the tasklet as long as
     * tasklet_ebsa_sched_flag is set.
     */
tasklet_complete:
    if (tasklet_ebsa_sched_flag) {
        tasklet_schedule(&select_ebsa_tasklet);
    }
}

D.4 syscalls_h.c – n_sys_select()

/*
 * This function sets up and sends the packet to the EBSA for select() and
 * calls do_host_select() if there are host-side file descriptors. After
 * do_host_select() returns and the EBSA's packet is received, this function
 * returns the sum of the return values from these or an error value if either
 * side has an error. This function has the same functionality as the other
 * functions in syscalls_h.c, but sys_host_select() must be called first to
 * setup the memory region and split up the bitmaps. None of this is Linux
 * kernel code.
 */
long n_sys_select(select_split_t* local, select_split_t* remote)
{
    long err = 0;               /* this function's return value */
    int ret_local = 0;          /* host return value */
    int ret_remote = 0;         /* EBSA return value */
    pkt_queue_node_t *pqn;      /* packet */
    shrmem_t* mem;              /* shared memory pointer */

    /* setup the packet */
    pqn = proto_get_queue_node(
            COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE +
            6*(remote->size));
    pqn->pkt->copy_len = COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE +
                         6*(remote->size);
    pqn->pkt->pkt_len = COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE +
                        6*(remote->size);
    pqn->pkt->func_id = SYS_SELECT;
    pqn->pkt->pid = current->pid;
    pqn->pkt->ret_val = -1;
    pqn->pkt->func.select.numfds = remote->n;
    pqn->pkt->func.select.time_off = remote->timeout;
    pqn->pkt->func.select.sizefds = remote->size;
    memcpy((void*)&pqn->pkt->func.select.bitmaps[0], (void*)remote->bits,
           6*(remote->size));

    /*
     * Send packet to EBSA and call do_host_select() if there are host
     * fd's. Then wait for packet to return.
     */
    proto_enqueue(pqn);
    if (local->n > 0) {
        ret_local = do_host_select(local->n, &local->fds,
                                   &local->timeout, 1);
    }
    if (down_interruptible(&(pqn->lock)) == -EINTR) {
        err = -EINTR;
        goto out_select;
    }

    /* put packet values into EBSA's parameters */
    ret_remote = pqn->pkt->ret_val;
    memcpy((void*)remote->bits, (void*)&pqn->pkt->func.select.bitmaps[0],
           6*(remote->size));
    remote->timeout = pqn->pkt->func.select.time_off;

    /*
     * Any errors are returned. If both sides return an error, the host
     * side arbitrarily gets precedence over the EBSA side.
     */
    if (ret_local < 0) {
        err = ret_local;
    } else if (ret_remote < 0) {
        err = ret_remote;
    } else {
        err = ret_remote + ret_local;
    }

out_select:
    proto_release_queue_node(pqn);

    /* cleanup shared memory - used mainly for case with EBSA fd's only */
    /* possible race condition writing to shared memory */
    mem = (shrmem_t*)module_info.shared_mem_addr;
    if (mem->host_pid == current->pid) {
        mem->host_pid = -1;
    }
    if (mem->ebsa_pid == current->pid) {
        mem->ebsa_pid = -1;
    }
    return err;
}
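
The precedence rule at the end of n_sys_select() can be isolated in a small user-space sketch. The helper name combine_select_results() and the check function are hypothetical and do not appear in the CiNIC sources; the sketch only assumes, as the listing above does, that a negative value is an error code and a non-negative value is a count of ready descriptors.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not part of the CiNIC module: combines the host
 * and EBSA return values the way the end of n_sys_select() does.
 * A host-side error wins over an EBSA-side error; if neither side
 * failed, the two ready-descriptor counts are added together.
 */
long combine_select_results(int ret_local, int ret_remote)
{
    if (ret_local < 0)
        return ret_local;       /* host error takes precedence */
    if (ret_remote < 0)
        return ret_remote;      /* otherwise report the EBSA error */
    return (long)ret_local + ret_remote;
}

/* Example calls; each assertion documents one branch of the rule. */
void check_combine_examples(void)
{
    assert(combine_select_results(2, 3) == 5);
    assert(combine_select_results(-EBADF, 3) == -EBADF);
    assert(combine_select_results(1, -ENOMEM) == -ENOMEM);
    assert(combine_select_results(-EBADF, -ENOMEM) == -EBADF);
}
```

Summing the counts is safe only because split_select_bitmaps() clears each EBSA-side descriptor from the host bitmaps, so no descriptor is counted on both sides.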
D.5 fd_map.c – split_select_bitmaps(), merge_select_bitmaps()

/*
 * This function splits a set of select() bitmaps into two sets: one that
 * goes to the host and one that goes to the EBSA. It is assumed that the
 * local (host) side has the original arguments and bitmaps from select().
 * This function is put here rather than in select_h.c for faster access to the
 * translation tables; otherwise, we would continually have to call
 * fd_map_get_ebsa_fd() for each descriptor found in the bitmaps. fd_set
 * is used for the remote (EBSA) side because it can handle descriptors up
 * to 1024 and we do not know what the translation will go to on the EBSA.
 * The macros are in select.h and are adapted from fs/select.c in the Linux
 * 2.4.2 kernel. The algorithm uses some of the same principles as do_select()
 * in this kernel version as well.
 */
void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
                          fd_set* remote_rfds, fd_set* remote_wfds,
                          fd_set* remote_efds)
{
    int hfd, efd;                   /* host/EBSA file descriptor */
    int off;                        /* u_long offset into bitmap */
    fd_translation_table_t *cur;    /* translation table pointer */

    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;

    /* start out highest numbered fd on each side at 0 */
    local->n = 0;
    remote->n = 0;

    /* find translation table for current process */
    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {
        cur = cur->next;
    }
    up(&table_lock);
    if (!cur) {     /* no fd translation table */
        local->n = n;
        return;
    }

    /* zero out the fd_set's so we can populate them with EBSA's fd's */
    FD_ZERO(remote_rfds);
    FD_ZERO(remote_wfds);
    FD_ZERO(remote_efds);

    /*
     * Algorithm:
     * For each file descriptor up to n:
     * 1. Find bit and offset for that fd on the host.
     * 2. Continue to next fd if the bit is not set in any of the sets.
     * 3. Get fd mapping for that bit if it is set in any of the sets.
     *    If it maps to -1, the fd is a local descriptor only, so update
     *    n on the host and go on to next fd.
     * 4. If it maps to a positive number, find out which sets this fd
     *    is in and set the translated EBSA fd in these sets. Clear the
     *    host fd in the host sets.
     * 5. If the EBSA side n is smaller than the translated fd, then
     *    update the EBSA side n.
     */
    for (hfd = 0; hfd < n; hfd++) {
        unsigned long hbit = BIT(hfd);

        off = hfd / __NFDBITS;
        if (!(hbit & BITS(lfds, off))) {
            continue;
        }
        efd = cur->fd_host_ebsa[hfd];
        if (efd < 0) {
            local->n = hfd + 1;
            continue;
        }
        if (ISSET(hbit, __IN(lfds, off))) {
            FD_SET(efd, remote_rfds);
            CLR(hbit, __IN(lfds, off));
        }
        if (ISSET(hbit, __OUT(lfds, off))) {
            FD_SET(efd, remote_wfds);
            CLR(hbit, __OUT(lfds, off));
        }
        if (ISSET(hbit, __EX(lfds, off))) {
            FD_SET(efd, remote_efds);
            CLR(hbit, __EX(lfds, off));
        }
        if (remote->n <= efd)
            remote->n = efd + 1;
    }
}

/*
 * This function merges two sets of select() bitmaps into one set. The local
 * (host) side contains the merged sets when this function is finished. This
 * function should be called after split_select_bitmaps(). This function is
 * put here rather than in select_h.c for faster access to the translation
 * tables; otherwise, we would continually have to call fd_map_get_host_fd()
 * for each descriptor found in the bitmaps. The macros are in select.h and
 * are adapted from fs/select.c in the Linux 2.4.2 kernel. The algorithm uses
 * some of the same principles as do_select() in this kernel version as well.
 */
void merge_select_bitmaps(select_split_t* local, select_split_t* remote)
{
    int hfd, efd;                   /* host/EBSA file descriptor */
    int off_local, off_remote;      /* u_long offset into bitmaps */
    fd_translation_table_t *cur;    /* translation table pointer */

    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;
    fd_set_bits* rfds = &remote->fds;

    /* find translation table for current process */
    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {
        cur = cur->next;
    }
    up(&table_lock);

    /*
     * Sanity check. Since split_select_bitmaps() should have been called
     * before this function, it would have seen there was no table and this
     * function would then not be called.
     */
    if (!cur) {
        PRINT_ERROR("merge_select_bitmaps: fd translation table missing, this should not happen\n");
        return;     /* no fd translation table */
    }

    /*
     * Algorithm:
     * For each file descriptor up to remote->n:
     * 1. Find bit and offset for that fd on the EBSA.
     * 2. Continue to next fd if the bit is not set in any of the sets.
     * 3. Get fd mapping for that bit if it is set in any of the sets
     *    and find the bit and offset on the host side.
     * 4. Find out which sets it is in and set the translated host fd in
     *    the correct result bitmap.
     */
    for (efd = 0; efd < remote->n; efd++) {
        unsigned long ebit = BIT(efd);
        unsigned long hbit;

        off_remote = efd / __NFDBITS;
        if (!(ebit & RES_BITS(rfds, off_remote))) {
            continue;
        }
        hfd = cur->fd_ebsa_host[efd];
        if (hfd >= 0) {
            hbit = BIT(hfd);
            off_local = hfd / __NFDBITS;
        } else {
            /*
             * Sanity check. If the fd is in the EBSA bitmaps,
             * then it should have a translation
             * (split_select_bitmaps() should have set this up).
             */
            PRINT_ERROR("merge_select_bitmaps: fd translation missing, this should not happen\n");
            continue;
        }
        if (ISSET(ebit, __RES_IN(rfds, off_remote))) {
            SET(hbit, __RES_IN(lfds, off_local));
        }
        if (ISSET(ebit, __RES_OUT(rfds, off_remote))) {
            SET(hbit, __RES_OUT(lfds, off_local));
        }
        if (ISSET(ebit, __RES_EX(rfds, off_remote))) {
            SET(hbit, __RES_EX(lfds, off_local));
        }
    }
}
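
The split and merge steps described in the two algorithm comments of Section D.5 can be tried outside the kernel with the following sketch. It is a simplified illustration under stated assumptions, not the module's code: it handles a single bitmap instead of the three in/out/ex sets, and bitmap_t, MAX_FD, bm_set(), bm_clr(), bm_test(), split_bitmap(), and merge_bitmap() are hypothetical names, with plain integer arrays standing in for the fd_host_ebsa and fd_ebsa_host translation tables.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_FD    64    /* hypothetical descriptor limit for this sketch */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define WORDS     ((MAX_FD + WORD_BITS - 1) / WORD_BITS)

typedef struct { unsigned long w[WORDS]; } bitmap_t;

static void bm_set(bitmap_t *b, int fd)
{
    b->w[fd / WORD_BITS] |= 1UL << (fd % WORD_BITS);
}

static void bm_clr(bitmap_t *b, int fd)
{
    b->w[fd / WORD_BITS] &= ~(1UL << (fd % WORD_BITS));
}

static int bm_test(const bitmap_t *b, int fd)
{
    return (int)((b->w[fd / WORD_BITS] >> (fd % WORD_BITS)) & 1UL);
}

/*
 * Move every descriptor that has an EBSA translation out of 'local' and
 * into 'remote' (under its translated number). host_to_ebsa[hfd] is -1
 * for host-only descriptors. On return *local_n and *remote_n hold the
 * highest descriptor seen on each side, plus one.
 */
void split_bitmap(int n, bitmap_t *local, bitmap_t *remote,
                  const int *host_to_ebsa, int *local_n, int *remote_n)
{
    int hfd, efd;

    memset(remote, 0, sizeof(*remote));
    *local_n = 0;
    *remote_n = 0;
    for (hfd = 0; hfd < n; hfd++) {
        if (!bm_test(local, hfd))
            continue;               /* fd not selected */
        efd = host_to_ebsa[hfd];
        if (efd < 0) {              /* host-only descriptor */
            *local_n = hfd + 1;
            continue;
        }
        assert(efd < MAX_FD);
        bm_set(remote, efd);        /* select it on the EBSA side */
        bm_clr(local, hfd);         /* and no longer on the host */
        if (*remote_n <= efd)
            *remote_n = efd + 1;
    }
}

/*
 * Copy each ready bit in the EBSA result bitmap back into the host
 * result bitmap under the host's descriptor number.
 */
void merge_bitmap(int remote_n, bitmap_t *local_res,
                  const bitmap_t *remote_res, const int *ebsa_to_host)
{
    int efd, hfd;

    for (efd = 0; efd < remote_n; efd++) {
        if (!bm_test(remote_res, efd))
            continue;
        hfd = ebsa_to_host[efd];
        if (hfd >= 0)               /* skip missing translations */
            bm_set(local_res, hfd);
    }
}
```

For example, if host descriptor 3 is a local file and host descriptor 5 maps to EBSA descriptor 2, splitting a bitmap with bits 3 and 5 set leaves bit 3 on the host (local n of 4) and sets bit 2 on the EBSA side (remote n of 3). Merging an EBSA result with bit 2 set then turns bit 5 back on in the host result bitmap.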
D.6 CVS Version Differences

The following output from cvs diff shows the changes that I made to files in the CVS
repository. In each diff, the first file is the version that was in the repository before my
changes and the second file is my modified version. Only my changes are shown. Diffs for
select_h.c, select_e.c, and select.h are omitted because these files were newly added to the
repository. n_sys_select(), split_select_bitmaps(), and merge_select_bitmaps() are also
omitted because they are already listed in Sections D.4 and D.5; the places where they would
appear are marked with double exclamation points (!!). Changes not pertaining to select()
are marked with double percent signs (%%) followed by an explanation of why the change was
made. A listing of the files added or changed in the repository comes first, followed by the
diff listings.

-r-xr-xr-x 1 jkwek jkwek 10341 Apr 30 18:52 com_e.c
-r-xr-xr-x 1 jkwek jkwek  8876 Apr 30 18:49 com.h
-r-xr-xr-x 1 jkwek jkwek  6322 Apr 30 18:52 com_h.c
-r-xr-xr-x 1 jkwek jkwek 10790 Apr 30 18:53 fd_map.c
-r-xr-xr-x 1 jkwek jkwek  2373 Apr 30 18:54 fd_map.h
-r-xr-xr-x 1 jkwek jkwek  6783 Apr 30 18:55 handler_default.c
-r-xr-xr-x 1 jkwek jkwek  3229 Apr 30 18:48 Makefile
-r-xr-xr-x 1 jkwek jkwek 10911 Apr 30 19:01 select_e.c
-r-xr-xr-x 1 jkwek jkwek  3610 Apr 30 19:01 select.h
-r-xr-xr-x 1 jkwek jkwek 15296 Apr 30 19:01 select_h.c
-r-xr-xr-x 1 jkwek jkwek  2009 Apr 30 18:57 syscalls_e.c
-r-xr-xr-x 1 jkwek jkwek 36234 Apr 30 19:00 syscalls_h.c

Index: com_e.c
===================================================================
RCS file: /home/cvsroot/module_p/com_e.c,v
retrieving revision 1.22
retrieving revision 1.23
diff -r1.22 -r1.23
6c6
< $Id: com_e.c,v 1.22 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: com_e.c,v 1.23 2002/05/01 01:52:45 jkwek Exp $
343a344
> case SYS_SELECT:
418a420
> case SYS_SELECT:

Index: com.h
===================================================================
RCS file: /home/cvsroot/module_p/com.h,v
retrieving revision 1.15
retrieving revision 1.16
diff -r1.15 -r1.16
7c7
< $Id: com.h,v 1.15 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: com.h,v 1.16 2002/05/01 01:49:38 jkwek Exp $
237,242c237,240
< int numfds; /* h -> e */
< unsigned long rfds_off; /* h -> e */
< unsigned long wfds_off; /* h -> e */
< unsigned long efds_off; /* h -> e */
< unsigned long time_off; /* h -> e */
< char data[0]; /* h <-> e */
---
> int numfds; /* h -> e */
> long time_off; /* h <-> e */
> int sizefds; /* h -> e */
> char bitmaps[0]; /* h <-> e */
244c242
<
---
> #define COM_SELECT_HEADER_SIZE 12
287,288c285,293
< Virtual struct overlaying shared memory to partition it for us.
< */
---
> * Virtual struct overlaying shared memory to partition it for us.
> *
> * The spacers are a temporary hack until Max fixes the problem. The problem
> * has to do with the cache and updating concurrent memory locations within
> * shared memory. The * spacers make sure the values are in shared memory
> * when they are written.
> */
> #define SHRMEM_DATA_SIZE (SHRMEM_SIZE/2)-(sizeof(int)+sizeof(pid_t)+10*sizeof(unsigned long))
>
289a295,298
> pid_t host_pid; /* for use with select() */
> unsigned long spacer1[10];
> pid_t ebsa_pid; /* for use with select() */
> unsigned long spacer2[10];
291c300
< char host_data[(SHRMEM_SIZE/2)-4];
---
> char host_data[SHRMEM_DATA_SIZE];
293c302
< char ebsa_data[(SHRMEM_SIZE/2)-4];
---
> char ebsa_data[SHRMEM_DATA_SIZE];

Index: com_h.c
===================================================================
RCS file: /home/cvsroot/module_p/com_h.c,v
retrieving revision 1.22
retrieving revision 1.23
diff -r1.22 -r1.23
6c6
< $Id: com_h.c,v 1.22 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: com_h.c,v 1.23 2002/05/01 01:52:58 jkwek Exp $
150a151
> case SYS_SELECT:
230a232
> case SYS_SELECT:

Index: fd_map.c
===================================================================
RCS file: /home/cvsroot/module_p/fd_map.c,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
7c7
< $Id: fd_map.c,v 1.1 2001/06/07 15:53:43 rob Exp $
---
> $Id: fd_map.c,v 1.2 2002/05/01 01:53:56 jkwek Exp $
226a227,393
!! split_select_bitmaps() and merge_select_bitmaps() !!

Index: fd_map.h
===================================================================
RCS file: /home/cvsroot/module_p/fd_map.h,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
7c7
< $Id: fd_map.h,v 1.1 2001/06/07 15:53:44 rob Exp $
---
> $Id: fd_map.h,v 1.2 2002/05/01 01:54:41 jkwek Exp $
12a13
> #include "select.h" /* select_split_t, fd_set, and macros */
69a71,82
>
> /*
> Split the file descriptors in a select() bitmap set into 2 sets: one to go
> to the host and one to go to the EBSA.
> */
> void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
> fd_set* remote_rfds, fd_set* remote_wfds, fd_set* remote_efds);
>
> /*
> Merge the resulting select() bitmaps into one set.
> */
> void merge_select_bitmaps(select_split_t* local, select_split_t* remote);

Index: handler_default.c
===================================================================
RCS file: /home/cvsroot/module_p/handler_default.c,v
retrieving revision 1.2
retrieving revision 1.3
diff -r1.2 -r1.3
6c6
< $Id: handler_default.c,v 1.2 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: handler_default.c,v 1.3 2002/05/01 01:55:49 jkwek Exp $
21a22
> #include "select.h" /* sys_ebsa_select() */
196a198,205
>
> case SYS_SELECT:
>     pkt->ret_val = sys_ebsa_select(pkt->func.select.numfds,
>         pkt->func.select.sizefds,
>         &pkt->func.select.time_off,
>         &pkt->func.select.bitmaps[0],
>         pkt->pid);
>     break;
212d220
<

Index: Makefile
===================================================================
RCS file: /home/cvsroot/module_p/Makefile,v
retrieving revision 1.14
retrieving revision 1.15
diff -r1.14 -r1.15
3c3
< # $Id: Makefile,v 1.14 2001/06/07 15:53:44 rob Exp $
---
> # $Id: Makefile,v 1.15 2002/05/01 01:48:54 jkwek Exp $
18,19c18,19
< HOST_OBJS=${COMMON_OBJS} com_h.o syscalls_h.o host.o fd_map.o proc_host.o proc_host_status.o
< EBSA_OBJS=${COMMON_OBJS} com_e.o syscalls_e.o ebsa.o handler_default.o
---
> HOST_OBJS=${COMMON_OBJS} com_h.o syscalls_h.o host.o fd_map.o proc_host.o proc_host_status.o select_h.o
> EBSA_OBJS=${COMMON_OBJS} com_e.o syscalls_e.o ebsa.o handler_default.o select_e.o
85c85
< fd_map.o: global.h syscalls_h.h fd_map.h fd_map.c
---
> fd_map.o: global.h syscalls_h.h fd_map.h select.h fd_map.c
88c88
< handler_default.o: global.h com_e.h handler_default.h handler_default.c
---
> handler_default.o: global.h com_e.h select.h handler_default.h handler_default.c
120a121,126
> select_e.o: global.h select.h ebsa.h com.h select_e.c
>     gcc ${FLAGS} -c select_e.c
>
> select_h.o: global.h fd_map.h select.h host.h com.h select_h.c
>     gcc ${FLAGS} -c select_h.c
>
124c130
< syscalls_h.o: global.h com.h host.h com_h.h fd_map.h syscalls_h.h syscalls_h.c
---
> syscalls_h.o: global.h com.h host.h com_h.h fd_map.h syscalls_h.h select.h syscalls_h.c

Index: syscalls_e.c
===================================================================
RCS file: /home/cvsroot/module_p/syscalls_e.c,v
retrieving revision 1.8
retrieving revision 1.9
diff -r1.8 -r1.9
2c2
%% - this changed because the file is syscalls_e.c, not syscalls_e.h
< syscalls_e.h - Cal Poly 3Com CiNIC project
---
> syscalls_e.c - Cal Poly 3Com CiNIC project
6c6
< $Id: syscalls_e.c,v 1.8 2001/06/07 15:53:44 rob Exp $
---
> $Id: syscalls_e.c,v 1.9 2002/05/01 01:57:41 jkwek Exp $
9a10,11
> #include <linux/spinlock.h> /* for interrupt.h to compile */
> #include <linux/interrupt.h> /* for tasklets */
12c14,15
<
---
> #include "ebsa.h" /* for shared memory */
> #include "com.h" /* for shared mem struct */
18a22,25
> /* tasklet declared in select_e.c */
> extern struct tasklet_struct select_ebsa_tasklet;
> extern int tasklet_ebsa_sched_flag;
>
33a41,42
> shrmem_t* mem;
>
35a45,51
> /* select tasklet and shared mem init */
> mem = (shrmem_t*)g_shrmem;
> mem->host_pid = -1;
> mem->ebsa_pid = -1;
> tasklet_ebsa_sched_flag = 1;
> tasklet_schedule(&select_ebsa_tasklet);
>
47a64,67
>
> /* stop the tasklet */
> tasklet_ebsa_sched_flag = 0;
> tasklet_kill(&select_ebsa_tasklet);

Index: syscalls_h.c
===================================================================
RCS file: /home/cvsroot/module_p/syscalls_h.c,v
retrieving revision 1.39
retrieving revision 1.40
diff -r1.39 -r1.40
8c8
< $Id: syscalls_h.c,v 1.39 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: syscalls_h.c,v 1.40 2002/05/01 02:00:55 jkwek Exp $
17c17,18
%% - non-useful comment taken out
< //hekllo
---
> #include <asm/system.h> /* for interrupt.h to compile */
> #include <linux/interrupt.h> /* for tasklets */
24a26
> #include "select.h" /* sys_host_select(), do_host_select() */
31a34,37
> /* tasklet declared in select_h.c */
> extern struct tasklet_struct select_host_tasklet;
> extern int tasklet_host_sched_flag;
>
42a49,51
> static long (*o_old_select)(select_param_t *args);
> static long (*o_sys_select)(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp);
>
101a111,112
> shrmem_t* mem;
>
106a118,124
> /* select() tasklet and shared mem init */
> mem = (shrmem_t*)module_info.shared_mem_addr;
> mem->host_pid = -1;
> mem->ebsa_pid = -1;
> tasklet_host_sched_flag = 1;
> tasklet_schedule(&select_host_tasklet);
>
134a153,160
> /* hijack old_select */
> o_old_select = sys_call_table[__NR_select];
> sys_call_table[__NR_select] = (void *) n_old_select;
>
> /* hijack sys_select */
> o_sys_select = sys_call_table[__NR__newselect];
> sys_call_table[__NR__newselect] = (void *) sys_host_select;
>
144a171,174
> /* stop the tasklet */
> tasklet_host_sched_flag = 0;
> tasklet_kill(&select_host_tasklet);
>
186a217,228
> if (o_old_select != NULL) {
>     sys_call_table[__NR_select] = o_old_select;
> } else {
>     PRINT_ERROR("syscalls_cleanup: old_select pointer not stored!\n");
> }
>
> if (o_sys_select != NULL) {
>     sys_call_table[__NR__newselect] = o_sys_select;
> } else {
>     PRINT_ERROR("syscalls_cleanup: sys_select pointer not stored!\n");
> }
>
538c580
%% - Want to run recv() rather than sendto() when recv() is called.
< return old_sys_socketcall(SYS_SENDTO, args);
---
> return old_sys_socketcall(SYS_RECV, args);
1098a1141,1229
!! n_sys_select() !!
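The fd_map.h diff above declares split_select_bitmaps() and merge_select_bitmaps(), but their bodies are elided from the listing. The user-space sketch below only illustrates the general idea of dividing one select() bitmap into a local and a remote set and later combining the results. It is not the project's code: the names split_fd_set(), merge_fd_sets(), fd_is_remote(), and the REMOTE_FD_BASE threshold are hypothetical, and the real module presumably consults its fd map rather than a fixed threshold.

```c
#include <assert.h>
#include <sys/select.h>

/* ASSUMPTION for illustration only: descriptors at or above this value
 * are treated as living on the co-host. */
#define REMOTE_FD_BASE 100

/* Hypothetical ownership test standing in for an fd-map lookup. */
static int fd_is_remote(int fd)
{
    return fd >= REMOTE_FD_BASE;
}

/* Copy every descriptor below n that is set in `all` into either
 * `local` or `remote`. Returns the number placed in `remote`. */
static int split_fd_set(int n, fd_set *all, fd_set *local, fd_set *remote)
{
    int fd;
    int nremote = 0;

    assert(n >= 0 && n <= FD_SETSIZE);
    FD_ZERO(local);
    FD_ZERO(remote);
    for (fd = 0; fd < n; fd++) {
        if (!FD_ISSET(fd, all))
            continue;
        if (fd_is_remote(fd)) {
            FD_SET(fd, remote);
            nremote++;
        } else {
            FD_SET(fd, local);
        }
    }
    return nremote;
}

/* Combine the two result sets: a descriptor is reported ready in `out`
 * if either side reported it. */
static void merge_fd_sets(int n, fd_set *local, fd_set *remote, fd_set *out)
{
    int fd;

    assert(n >= 0 && n <= FD_SETSIZE);
    FD_ZERO(out);
    for (fd = 0; fd < n; fd++) {
        if (FD_ISSET(fd, local) || FD_ISSET(fd, remote))
            FD_SET(fd, out);
    }
}
```

In this sketch the same split would be applied separately to the read, write, and exception sets, and the count returned by split_fd_set() tells the caller whether the remote side needs to be involved at all.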
D.7 Sample Test Program

#include <stdio.h>
#include <fcntl.h>
#include <sys/time.h>   /* struct timeval */
#include <sys/types.h>
#include <sys/select.h> /* select(), fd_set, FD_* macros */
#include <sys/socket.h>

int main(void)
{
    fd_set hi;  /* two fd_sets: hi is the read set */
    fd_set hi2; /* hi2 is the write set */
    int fd1, fd2;
    struct timeval time;

    time.tv_sec = 10; /* 10 sec sleep */
    time.tv_usec = 0;

    fd1 = open("atext", O_RDONLY | O_SYNC); /* local */
    fd2 = socket(AF_INET, SOCK_STREAM, 0);  /* remote */
    printf("fd1 is %i\n", fd1);
    printf("fd2 is %i\n", fd2);

    /* FD_SET on a negative descriptor is undefined, so stop here */
    if (fd1 < 0 || fd2 < 0) {
        perror("open/socket");
        return 1;
    }

    FD_ZERO(&hi);
    FD_ZERO(&hi2);
    FD_SET(fd1, &hi);
    FD_SET(fd2, &hi2);
    FD_SET(2, &hi);

    /*
     * check local fd and stderr for read, socket for write
     * fd and socket return available, stderr return NOT available
     * because nothing to read
     */
    select(fd2+1, &hi, &hi2, NULL, &time);

    if (FD_ISSET(fd1, &hi))
        printf("data available\n");
    else
        printf("data NOT available \n");

    if (FD_ISSET(fd2, &hi2))
        printf("data available\n");
    else
        printf("data NOT available \n");

    if (FD_ISSET(2, &hi))
        printf("data available\n");
    else
        printf("data NOT available \n");

    return 0;
}
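
The test program above ignores the return value of select(), so a timeout, an error, and a successful wakeup are only distinguishable through the bitmaps. A minimal sketch of a check that separates these three outcomes is given below. The helper name wait_readable() is hypothetical and is not part of the CiNIC sources; it uses only the standard select() interface.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper (not from the project): wait up to `sec` seconds
 * for `fd` to become readable.
 * Returns 1 if fd is readable, 0 on timeout, -1 if select() failed. */
static int wait_readable(int fd, long sec)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = sec;
    tv.tv_usec = 0;

    /* select() returns the number of ready descriptors,
     * 0 on timeout, or -1 on error (errno is set). */
    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

Called on the read end of an empty pipe with a zero timeout, this helper reports a timeout; after a byte is written to the other end, it reports the descriptor as readable.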