An Implementation of the select() Linux System Call Running on the Cal Poly Intelligent Network Interface Card Platform A Senior Project Report Presented to the Computer Engineering Program By Jared Kwek California Polytechnic State University, San Luis Obispo Date Submitted: June 17, 2002 Advisor: Dr. Phillip Nico



Abstract

The Cal Poly Intelligent Network Interface Card (CiNIC) project is a research project at the

campus of California Polytechnic State University, San Luis Obispo, CA. This project is funded

by the 3Com Corporation and researches intelligent Network Interface Card (NIC) functionality

and performance. The CiNIC platform can offload the TCP/IP stack from an i686 (Pentium)

host computer running Linux to an EBSA-285 embedded system (co-host) running ARM/Linux,

thus freeing the host computer from having to process network traffic. This is done by

intercepting the system calls from the host machine and sending the parameters to the co-host

machine, where the system call is run and its results are returned to the host.

This document describes an implementation of the select() system call for the CiNIC platform.

select() waits for events to occur on multiple files and sockets that are open on either the host

or the co-host. This means that the call cannot just run on the co-host; it must also be run

concurrently on the host. Furthermore, select() can sleep until an event occurs, which may

happen on one side (the host or co-host) but not the other. In order to prevent one side from

blocking forever, each side must be able to alert the other when it has completed.

First an overview of the project and the select() system call is presented. Then the Linux

implementation of select() is described. Following that is a discussion of my designs,

including one that did not work and one that did. Finally, my design to fix the blocking problem

is explored.


Acknowledgements

First off I would like to thank the faculty involved with the project: Dr. Hugh Smith, Dr. Phillip

Nico, and Dr. Jim Harris, for all their support and encouragement throughout the past year I have

been on the project. Also Rob McCready, Mark McClelland, and Jim Fischer for introducing me

to the project. To my partner in crime, Max “Neil a.k.a. Linux Hacker” Roth, it has been an

honor and a pleasure spending hours upon hours in the lab with you. You are always

entertaining, even into the early hours of the morning when we are drinking gallons of Sunkist

and making movies. Oh yeah, and please take your satellite dish home someday!

A couple of other accolades: Heather Heiman for always keeping the lab (and us) in order,

Americo “Bart” Melara for always being in the lab every time I come in, Jason Hatashita for

always reconfiguring the lab setup, and Clif Gordon for having senioritis together. And to

everybody else I have had the pleasure of working with in the past year, you rock!

Finally, I would like to thank my family and Michelle for all the love and support they have

given me. An extra special thanks goes to Michelle for keeping me from having a nervous

breakdown while writing this paper.


Table of Contents

Chapter 1 Introduction
Chapter 2 Background
    2.1 CiNIC Overview
    2.2 The State of the Project When I Started
    2.3 The Definition of select()
    2.4 The old_select() System Call
    2.5 The Challenges of Implementing select()
Chapter 3 The Linux Kernel’s Version of select()
    3.1 sys_select()’s Design
    3.2 do_select()’s Design
Chapter 4 The First Design
    4.1 Rob McCready’s Initial Design
    4.2 Initial Design Overview
    4.3 The Split and Merge Algorithms
    4.4 Reasons for Failure
Chapter 5 The Second Design
    5.1 Modifications to old_select() and sys_select()
    5.2 Transferring Parameters to the Co-Host
    5.3 The Co-Host Side Functions
    5.4 The Split and Merge Routines
Chapter 6 The Solution to the Blocking Problem
    6.1 The Investigation of Kernel Methods
    6.2 The Design
    6.3 The Code
    6.4 Other Issues
Chapter 7 Conclusion
References
Appendix A State Diagrams for the Solution to the Blocking Problem
Appendix B Test Plans
Appendix C select() Bit Macros
Appendix D Source Code
    D.1 select.h
    D.2 select_h.c
    D.3 select_e.c
    D.4 syscalls_h.c – n_sys_select()
    D.5 fd_map.c – split_select_bitmaps(), merge_select_bitmaps()
    D.6 CVS Version Differences
    D.7 Sample Test Program

List of Figures

    2.1 CiNIC Platform Setup
    2.2 Example select() Bitmap Setup
    4.1 Example Setup with File Descriptor Mask
    4.2 Early select() Flowchart with Problem Areas
    5.1 Example of Splitting File Descriptors
    5.2 select() Flowchart of Second Design
    6.1 Shared Memory with Queues
    A.1 do_select() on Host
    A.2 Tasklet on Host
    A.3 do_select() on Co-Host
    A.4 Tasklet on Co-Host
    C.1 The Representation of a Bitmap in Kernel Memory

List of Tables

    B.1 Test Plan Run on 9/11/01
    B.2 Test Plan Run on 2/22/02


1. Introduction

The proliferation of the Internet within our society has sparked a global revolution that continues

today. It has brought about rapid change and growth as more and more users connect

to this global information superhighway. This growth has created a demand for faster and more

robust applications that can be run across the Internet. However, modern servers and computers

have not been able to keep up with these increasing demands, causing severe problems in

network and system performance. As a result, new ideas such as Intelligent Network Interface

Cards (iNICs) have come to the forefront of research for the next generation of network

technology.

The Cal Poly Intelligent Network Interface Card (CiNIC) project is a research project at the campus of

California Polytechnic State University, San Luis Obispo, CA. This project is funded by the

3Com Corporation and researches intelligent NIC functionality and performance. The CiNIC

platform can offload the TCP/IP stack from a host computer running Linux to an embedded

system (co-host) that is also running Linux. All network processing takes place on the co-host,

which will free the host computer from having to handle the potentially vast volumes of network

traffic it receives so that it can perform other tasks. The intelligent NIC also has the potential to

be able to support many advanced networking functions such as security and firewalls, web

caching, streaming media, and quality of service.

This document describes an implementation of the Linux select() system call on the CiNIC

platform. select() waits for open network connections (sockets) or files to change status. It is

unique because it is a system call that must be run on both the host and the co-host. This means

that the system call must be split between the two sides. However, if any of the open sockets or

files were to change status, both sides would have to immediately return. These extra issues

make select() one of the trickier system calls to implement for the CiNIC platform.

Nonetheless, it is a necessary one because it is used in a wide range of applications such as

Telnet, web browsing, and the X Window System (a graphical user interface for Linux).


The remainder of this document is organized as follows. Chapter 2 describes the background of

the CiNIC project and the select() system call. Chapter 3 explores the details of the Linux

kernel’s implementation of the system call. Chapter 4 goes over the initial design and why it

failed. Chapter 5 then details the successful second design, followed by Chapter 6, which

provides a solution to select() potentially blocking for long periods of time (or forever).

Chapter 7 provides a conclusion and a discussion of possible future work. Appendix A

illustrates the state diagrams for the Chapter 6 design while Appendix B shows two test plans I

implemented. Appendix C goes over some of the bit macros that are used to manipulate the

select() bitmaps. Finally, Appendix D contains the source code for this implementation.


2. Background

Before going into the details of the select() implementation, some background information

about the project and select() is needed. This chapter gives an overview of the

CiNIC project and the state of the project when I started, then a definition of select() and

old_select(), and finally the challenges related to implementing select().

2.1 CiNIC Overview

The CiNIC project is run by a group of students and faculty members who call themselves the

Cal Poly Network Performance Research Group (CPNPRG). This group researches the

possibilities of offloading the TCP/IP stack of a Linux host computer onto a Linux co-host

computer. The hardware chosen for the co-host is an Intel EBSA-285 card that has an SA-110

StrongARM processor and an Intel 21285 logic chip that is used to interface the ARM processor

with the rest of the system. This card is connected to a secondary PCI bus via an Intel 21554

non-transparent PCI bridge. An SDRAM window is mapped onto the PCI bus by the 21285,

which is then translated to the primary PCI bus via the 21554. This allows the host to read from

and write to a shared memory region that exists on the co-host (see Figure 2.1). Mark

McClelland wrote the device drivers that facilitated this sharing of memory. [2]

To facilitate the offloading of the TCP/IP stack from the host to the co-host, network related

system calls are intercepted off the host and sent to the co-host, where they are run and then

returned to the host. The parameters for each system call are marshaled together into one

contiguous region (hereafter referred to as a “communications packet” or “com packet” for short)

and copied to shared memory, where the co-host un-marshals the parameters and makes the

system call on its system. When the call returns on the co-host, it copies the values back to

shared memory, where the host then picks up the return values and returns them to the user. To

maintain this process, a message passing protocol is implemented with kernel threads. Version 1

of the protocol was a polling protocol written by Rob McCready, hereafter referred to as the

“polling protocol,” that kept checking a value to see if data had been written to shared memory

[3]. Version 2, hereafter referred to as the “interrupt-driven protocol” was completed recently by

Max Roth and involves a faster interrupt mechanism through the 21554 [4].
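The marshaling step described above can be sketched in user-space C. This is only an illustration under stated assumptions: the struct com_packet layout, the COM_PAYLOAD_MAX limit, and the com_init(), com_put(), and com_get() helpers are hypothetical names invented here, not the format or functions used by the CiNIC drivers. The sketch shows parameters being packed into one contiguous region on one side and unpacked in the same order on the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COM_PAYLOAD_MAX 256   /* hypothetical size limit */

/* Hypothetical com packet: one contiguous region that could be copied
 * to shared memory in a single operation. */
struct com_packet {
    uint32_t syscall_nr;    /* which system call the co-host should run */
    int32_t  retval;        /* filled in by the co-host when the call returns */
    uint32_t payload_len;   /* number of payload bytes in use */
    uint8_t  payload[COM_PAYLOAD_MAX];
};

/* Start an empty packet for the given system call number. */
void com_init(struct com_packet *pkt, uint32_t syscall_nr)
{
    assert(pkt != NULL);
    memset(pkt, 0, sizeof *pkt);
    pkt->syscall_nr = syscall_nr;
}

/* Marshal: append len bytes to the payload. Returns 0, or -1 if it will not fit. */
int com_put(struct com_packet *pkt, const void *src, size_t len)
{
    if (len > COM_PAYLOAD_MAX - pkt->payload_len)
        return -1;
    memcpy(pkt->payload + pkt->payload_len, src, len);
    pkt->payload_len += (uint32_t)len;
    return 0;
}

/* Un-marshal: copy len bytes from offset *off and advance it. Returns 0 or -1. */
int com_get(const struct com_packet *pkt, size_t *off, void *dst, size_t len)
{
    if (*off > pkt->payload_len || len > pkt->payload_len - *off)
        return -1;
    memcpy(dst, pkt->payload + *off, len);
    *off += len;
    return 0;
}
```

A caller would run com_init(), call com_put() once per parameter, copy the packet to shared memory, and have the receiving side call com_get() in the same order. Pointer parameters need the data they point to copied as well, since an address from one machine means nothing on the other.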


2.2 The State of the Project When I Started

When I began on this project in Spring 2001, Mark McClelland and Rob McCready were getting

ready to graduate and were finishing their respective parts of the project. Their primary goal was

to get a File Transfer Protocol (FTP) session working across Mark’s shared memory driver [2]

and Rob’s polling protocol for transferring system calls to and from the co-host. In order to

accomplish this, the polling protocol had to intercept the socket-related system calls that FTP

uses and send their parameters to the co-host using the shared memory driver, where the call

would be run and the return values sent back to the host. This system worked successfully for

FTP by the time I arrived, and the focus was shifting to what needed to be done next.

Figure 2.1 – CiNIC Platform Setup. Source: [2]

Two major items needed attention by the time I arrived: a more powerful interrupt-driven

protocol for transferring system calls to and from the co-host, and more intercepted system calls

to allow for more services. Max Roth, who started on the project at roughly the same time I

did, decided to handle the new protocol and I decided to look at the new system calls that needed

to be implemented. It was decided that select() and ioctl() were two of the most important

system calls that needed to be implemented, but they were also two of the hardest to incorporate

in the current system because they are not conventional socket calls. select() was needed for

a number of applications including web browsing, Telnet, and the X Window System. ioctl()

is used extensively in the X Window System and programs that gather information from network

device drivers. Since a goal of the project was to someday have the intelligent NIC be a device

driver, this function proved to be essential. Rob had already thought about how the select()

implementation would go, so I first turned my focus to its implementation.

2.3 The Definition of select()

The basic function of select() is to allow a process to look at a number of open file descriptors

to see if reading from or writing to them will block. The Linux manual page calls this

“synchronous I/O multiplexing”. A file descriptor is an integer number used by the kernel to

identify a file, pipe, or socket opened by a process. Each process has its own set of file

descriptors and can have up to 1024. The first three file descriptors (0, 1, 2) for each process are

reserved for standard input (stdin), standard output (stdout), and standard error output

(stderr), respectively. Three sets of file descriptors are watched: one to see if any of the file

descriptors in the set are ready to be read from, one to see if any file descriptors in the set are ready

to be written to, and one to see if any file descriptors in the set have an exception condition (e.g.

high-priority out-of-band data can be read without blocking) [5]. If any file descriptor in any of

the sets is ready for its given condition, select() returns the number of ready file descriptors

and modifies the sets to indicate which file descriptors are available. If none are available,

select() will block (sleep) for a specified period of time waiting for any of the file descriptors

to become ready. If any do, then select() wakes up and returns. If the time limit is reached

and no file descriptors are available, then select() returns 0. Sets that are NULL pointers are

not watched.

The select() call has 5 arguments:

    int n                     The highest numbered file descriptor in any of the three sets, plus 1
    fd_set* readfds           A pointer to file descriptors for reading
    fd_set* writefds          A pointer to file descriptors for writing
    fd_set* exceptfds         A pointer to file descriptors for exceptions
    struct timeval* timeout   Maximum time to wait (setting to 0 means do not sleep and
                              return immediately, a NULL pointer means wait indefinitely
                              until a file descriptor becomes available)

struct timeval has two members: tv_sec for seconds, and tv_usec for microseconds. The

timeout parameter is modified in Linux to show how much time was remaining upon return.

This behavior is not uniform across multiple platforms, so portable code should not rely on this

value.

On error, select() returns -1 and the sets and timeout value become undefined. The following

errors are possible for errno (see man errno):

    EBADF    One of the file descriptor sets specifies an invalid file descriptor
    EINTR    A signal arrived before the time limit or any of the selected file descriptors became ready
    EINVAL   Time limit value is incorrect or n is negative
    ENOMEM   Unable to allocate memory for internal tables

There are a number of macros that can be used to manipulate the sets. fd_set is a structure that

contains an unsigned long array that has enough bits in it for the maximum number of file

descriptors. It acts as a bitmap for the file descriptors so, for example, bit 0 corresponds to file

descriptor 0, bit 1 corresponds to file descriptor 1, etc. A set bit tells select() to look at that

file descriptor while select() ignores cleared bits. Currently on the x86 platform, the limit is

1024 file descriptors. An unsigned long is 32 bits, so 1024/32 makes an array size of 32. The

following macros for fd_set can be found in the Linux kernel source in linux/time.h:

    FD_ZERO(fd_set *fdset)             clear all bits in the set
    FD_SET(int fd, fd_set *fdset)      set the bit for fd in the set
    FD_CLR(int fd, fd_set *fdset)      clear the bit for fd in the set
    FD_ISSET(int fd, fd_set *fdset)    test the bit for fd in the set

Figure 2.2 shows an example of calling FD_ZERO(&fdset) followed by FD_SET(4, &fdset).

Bit 4 is set to let select() know to look at file descriptor 4. For an example program using

select(), see Appendix D.7.
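As a short illustration of the arguments and macros described in this section (separate from the Appendix D.7 test program), the following sketch watches a single descriptor for readability. The helper names wait_readable() and demo_select_on_pipe() are made up for this example; the calls they make (select(), pipe(), write(), and the FD_* macros) are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to sec seconds plus usec microseconds for fd to become readable.
 * Returns 1 if readable, 0 if the timeout expired, -1 on error (errno is set). */
int wait_readable(int fd, long sec, long usec)
{
    fd_set readfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&readfds);          /* clear every bit in the set */
    FD_SET(fd, &readfds);       /* watch only this descriptor */
    tv.tv_sec = sec;
    tv.tv_usec = usec;

    /* The first argument is the highest descriptor in any set, plus 1.
     * The write and exception sets are NULL, so they are not watched. */
    return select(fd + 1, &readfds, NULL, NULL, &tv);
}

/* Demo on a pipe. Returns 1 if select() reported "not ready" before a byte
 * was written and "ready" afterwards, 0 otherwise. */
int demo_select_on_pipe(void)
{
    int p[2];

    if (pipe(p) != 0)
        return 0;

    int before = wait_readable(p[0], 0, 0);   /* zero timeout: check and return */
    int wrote = (write(p[1], "x", 1) == 1);
    int after = wait_readable(p[0], 1, 0);    /* data is waiting, so no sleep */

    close(p[0]);
    close(p[1]);
    return wrote && before == 0 && after == 1;
}
```

Because Linux may overwrite the timeout with the time remaining, wait_readable() builds a fresh struct timeval on every call instead of reusing one.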

select() has a number of uses. It is used a lot with sockets to see which sockets contain data to

be read. This allows the user to specify the amount of time to wait before asking for a

retransmission. Telnet uses it to check for data to be read from either STDIN or the Telnet socket


when it is waiting for the user to type something. It is also useful on a server that supports

multiple clients. accept() usually blocks if there is no connection, so it could only process one

connection at a time. However, select() can check multiple sockets for a connection and then

fork off threads to accept those. Finally, select() can be used to sleep for a given timeout by

setting n to 0, setting all three sets to NULL, and specifying a timeout value.

2.4 The old_select() System Call

While running strace on Netscape to determine how it uses select() (strace shows

information about system calls used by a program), I noticed that it used a system call called

old_select(). This is different from the system call used by other programs, which was regular

select(). I did some investigating and found out that there are two different select()’s in the

Linux kernel. old_select() has system call number 82 (__NR_select) and select() has a

system call number of 142 (__NR__newselect). old_select() comes from the days when

system calls could not have 5 arguments in them. So this function passes one pointer to all the

arguments for the kernel to handle. The new select() did not come around until the 2.x Linux

kernels. I do not know how or if this function can be called using the C library. My only guess

as to why Netscape uses it is for compatibility with older kernels or machines.

In the 2.x kernels, old_select() is located in arch/i386/kernel/sys_i386.c. All it does is

call copy_from_user() on the pointer to copy the parameters into a structure inside the kernel

containing the five parameters of sys_select(). For the pointers, only the pointer value is

copied, not the data it points to. This structure is then used to call sys_select() with the

appropriate parameters. The new select() goes straight to sys_select() when it is called.
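The single-pointer calling convention can be imitated in user space. In this sketch the struct sel_args type and the old_style_select() function are illustrative names chosen here, memcpy() stands in for copy_from_user(), and the ordinary select() library call stands in for sys_select(); none of this is the kernel's own code.

```c
#include <assert.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* One block holding all five select() arguments, in the spirit of the
 * structure that old_select() receives a pointer to. */
struct sel_args {
    unsigned long n;
    fd_set *inp, *outp, *exp;
    struct timeval *tvp;
};

/* Old-style entry point: takes one pointer, copies the argument block,
 * then forwards the five values to the five-argument call. */
int old_style_select(const struct sel_args *uarg)
{
    struct sel_args a;

    assert(uarg != NULL);
    /* Only the pointer values are copied here, not the sets or the
     * timeval they point to. */
    memcpy(&a, uarg, sizeof a);
    return select((int)a.n, a.inp, a.outp, a.exp, a.tvp);
}
```

Since only pointer values are copied, the five-argument routine still has to fetch the bitmaps and the timeout itself, which is what sys_select() does in either case.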

    fd #:   0  1  2  3  4  5  ...  1023
    bits:   0  0  0  0  1  0  ...  0

Figure 2.2 – Example select() Bitmap Setup
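The bitmap in Figure 2.2 can be modeled with a small array of 32-bit words. This is a sketch only: struct fd_bitmap and the bitmap_*() helpers are hypothetical stand-ins for fd_set and the FD_* macros, and uint32_t is used to mimic the 32-bit unsigned long of the x86 platform so the arithmetic is the same on any machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_FDS        1024
#define BITS_PER_WORD  32                          /* assumed word size, as on x86 */
#define NUM_WORDS      (MAX_FDS / BITS_PER_WORD)   /* 1024 / 32 = 32 words */

struct fd_bitmap {
    uint32_t words[NUM_WORDS];
};

/* Like FD_ZERO: clear every bit. */
void bitmap_zero(struct fd_bitmap *set)
{
    memset(set, 0, sizeof *set);
}

/* Like FD_SET: fd / 32 picks the word, fd % 32 picks the bit inside it. */
void bitmap_set(struct fd_bitmap *set, int fd)
{
    assert(fd >= 0 && fd < MAX_FDS);
    set->words[fd / BITS_PER_WORD] |= (uint32_t)1 << (fd % BITS_PER_WORD);
}

/* Like FD_ISSET: returns 1 if the bit for fd is set, 0 otherwise. */
int bitmap_isset(const struct fd_bitmap *set, int fd)
{
    assert(fd >= 0 && fd < MAX_FDS);
    return (int)((set->words[fd / BITS_PER_WORD] >> (fd % BITS_PER_WORD)) & 1u);
}
```

After bitmap_zero() and bitmap_set(&set, 4), word 0 holds the value 16 (binary 10000), which is the pattern drawn in the figure.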


2.5 The Challenges of Implementing select()

The primary issue with implementing select() in our system is that it can contain file

descriptors for both the co-host and host sides. This means that we must somehow split the call

into two, run it on both the host and co-host, and then merge the results back together and return

to the user. All this must be transparent to the user as if we had not intercepted the call at all.

This is a difficult task because the current driver architecture intercepts

system calls that usually contain only one file descriptor. That file descriptor is checked to see if

it was created for the co-host. If it is, the parameters are sent to the co-host so that the system

call can run there. There is no need to run it on the host since all the work with that file

descriptor is done on the co-host. If the file descriptor is not created for the co-host, the original

system call on the host is called without sending anything to the co-host. With select(),

however, we have to be able to figure out which file descriptors go where, so the call can be split

up to run concurrently on both sides.

There are also issues with running select() concurrently on both sides. Since select() can

potentially block, we could find ourselves in the position where one side is blocking while the

other is finished. The blocking side could eventually time out and then return, but this could

cause a large delay in the system. Additionally, the system could potentially hang if there is no

timeout value and the blocking side never finds a ready file descriptor. Since the goal is to make

this seem like a normal select() call, there needs to be a mechanism where a side that is ready

to return can notify a side that is blocking so that it can wake up and also return.

Finally, there are issues with what to return to the user once the call has completed. There will

be two sets of parameters and two return values. The file descriptor bitmaps need to be merged

back together somehow so that the user can see both sides of file descriptors in the same sets.

There could potentially also be two different timeout values upon return, so it needs to be

decided which value to return. With the actual return values, we could simply add the two sides

together upon success, but it needs to be decided how to handle the situation where one or both

sides return an error.
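To make the splitting and merging problem concrete, here is a purely illustrative sketch. It is not the design developed in the later chapters. It assumes a hypothetical cohost_mask bitmap whose set bits mark descriptors that live on the co-host, assumes both sides use the same descriptor numbers, and adopts one possible error policy (pass through the first error seen) only as a placeholder for the decision discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_WORDS 32   /* 1024 descriptors / 32 bits per word */

/* Split src into a host bitmap and a co-host bitmap.
 * cohost_mask is hypothetical: a set bit means "this fd is on the co-host". */
void sketch_split(const uint32_t *src, const uint32_t *cohost_mask,
                  uint32_t *host, uint32_t *cohost)
{
    assert(src != NULL && cohost_mask != NULL && host != NULL && cohost != NULL);
    for (int i = 0; i < SK_WORDS; i++) {
        cohost[i] = src[i] & cohost_mask[i];
        host[i]   = src[i] & ~cohost_mask[i];
    }
}

/* Merge the two result bitmaps into the single set the user expects. */
void sketch_merge(const uint32_t *host, const uint32_t *cohost, uint32_t *dst)
{
    assert(host != NULL && cohost != NULL && dst != NULL);
    for (int i = 0; i < SK_WORDS; i++)
        dst[i] = host[i] | cohost[i];
}

/* Combine return values: add the ready counts on success.
 * Assumed policy: if either side failed, return the first error. */
int sketch_combine(int host_ret, int cohost_ret)
{
    if (host_ret < 0)
        return host_ret;
    if (cohost_ret < 0)
        return cohost_ret;
    return host_ret + cohost_ret;
}
```

Even this simple version shows why the call is awkward to offload: every one of the three sets must be split before the call and merged after it, and the two return values need a rule for the error cases.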

The solutions to these issues that we came up with are presented in the following chapters.


3. The Linux Kernel’s Version of select()

This chapter will discuss the implementation of select() in the Linux 2.4.2 kernel. The two

main functions associated with select() are sys_select() and do_select(). The following

sections provide a high-level overview of each function followed by a walk-through of the code.

The macros are described in Appendix C.

3.1 sys_select()’s Design

sys_select(), which is located in fs/select.c, is the function called first once the system

call is transferred to kernel space. It is the wrapper function for the core of the select() call,

do_select(). First, it checks the n and timeout parameters to make sure that they are in the

correct range. The struct timeval value is changed to jiffies, which is the kernel’s internal

view of time, for use in do_select(). It then takes the user space file descriptor bitmaps and

sets them up in kernel memory. do_select() is then called. Once do_select() has finished,

the timeout value is converted back to a struct timeval and do_select()’s return value is

checked to see if an error was returned. If no error was returned, then it checks if the return

value is 0. If it is 0, then it checks if a signal is pending (i.e. it may have been interrupted). If a

signal is pending, then the system call will be restarted. The parameters are then copied back to

user space and the function returns.
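The two timeout conversions mentioned in this overview can be tried out in user space. The sketch below assumes a tick rate of 100 jiffies per second and uses stand-in names (SK_HZ, SK_ROUND_UP, SK_MAX_TIMEOUT, SK_MAX_SECONDS, and the two sketch_* functions) rather than the kernel's own definitions; the arithmetic follows the sys_select() code quoted in this section.

```c
#include <assert.h>
#include <limits.h>

#define SK_HZ 100                                  /* assumed: 100 jiffies per second */
#define SK_ROUND_UP(x, y) (((x) + (y) - 1) / (y))
#define SK_MAX_TIMEOUT LONG_MAX                    /* stands in for MAX_SCHEDULE_TIMEOUT */
#define SK_MAX_SECONDS ((unsigned long)(SK_MAX_TIMEOUT / SK_HZ) - 1)

/* struct timeval fields to jiffies, rounding microseconds up to a whole tick. */
long sketch_tv_to_jiffies(long sec, long usec)
{
    assert(sec >= 0 && usec >= 0);   /* sys_select() rejects negatives with -EINVAL */
    if ((unsigned long)sec >= SK_MAX_SECONDS)
        return SK_MAX_TIMEOUT;       /* too long to represent: wait forever */
    return SK_ROUND_UP(usec, 1000000 / SK_HZ) + sec * SK_HZ;
}

/* Remaining jiffies back to seconds and microseconds for the user. */
void sketch_jiffies_to_tv(long jiffies, long *sec, long *usec)
{
    *sec = jiffies / SK_HZ;
    *usec = (jiffies % SK_HZ) * (1000000 / SK_HZ);
}
```

With these assumptions a timeout of 2.5 seconds becomes 250 jiffies, while a timeout of a single microsecond is rounded up to one full jiffy (10 ms). The value written back to the user is therefore always a multiple of the tick length.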

Following are the declarations for sys_select():

    asmlinkage long
    sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp)
    {
        fd_set_bits fds;
        char *bits;
        long timeout;
        int ret, size;

1. Get the timeout value from user space and convert it from a struct timeval to a long (jiffies). If the user space pointer is NULL or if the period of time is MAX_SELECT_SECONDS or longer, then the timeout value is set to MAX_SCHEDULE_TIMEOUT, which is the equivalent of waiting forever.

        timeout = MAX_SCHEDULE_TIMEOUT;
        if (tvp) {
            time_t sec, usec;

            if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))
                || (ret = __get_user(sec, &tvp->tv_sec))
                || (ret = __get_user(usec, &tvp->tv_usec)))
                goto out_nofds;

            ret = -EINVAL;
            if (sec < 0 || usec < 0)
                goto out_nofds;

            if ((unsigned long) sec < MAX_SELECT_SECONDS) {
                timeout = ROUND_UP(usec, 1000000/HZ);
                timeout += sec * (unsigned long) HZ;
            }
        }

2. Check for invalid values of n. max_fdset is the current maximum number of file descriptors, which is normally 1024. A negative n is rejected with -EINVAL, and an n larger than max_fdset is clamped to max_fdset.

        ret = -EINVAL;
        if (n < 0)
            goto out_nofds;

        if (n > current->files->max_fdset)
            n = current->files->max_fdset;

3. Set up a memory region for an fd_set_bits (fds). The structure looks like this (include/linux/poll.h):

        typedef struct {
            unsigned long *in, *out, *ex;
            unsigned long *res_in, *res_out, *res_ex;
        } fd_set_bits;

   This structure allows for the memory region to be long-aligned and scalable. It is only as big as the n parameter requires. Below, FDS_BYTES is used to determine how many bytes are needed for a given value of n (see Appendix C). select_bits_alloc() then calls kmalloc() to allocate a memory region that is 6*size bytes big. Then each pointer is assigned to a memory address in that region. Here is the code from sys_select():

        ret = -ENOMEM;
        size = FDS_BYTES(n);
        bits = select_bits_alloc(size);
        if (!bits)
            goto out_nofds;
        fds.in      = (unsigned long *)  bits;
        fds.out     = (unsigned long *) (bits +   size);
        fds.ex      = (unsigned long *) (bits + 2*size);
        fds.res_in  = (unsigned long *) (bits + 3*size);
        fds.res_out = (unsigned long *) (bits + 4*size);
        fds.res_ex  = (unsigned long *) (bits + 5*size);

4. get_fd_set() is called to copy the file descriptor sets from user space into the newly set up memory region. If any of the sets are NULL, the memory is filled with 0’s. zero_fd_set() is then called to zero out the memory region occupied by the result (res) side of the struct fd_set_bits. Now that the memory region is set up, do_select() is called, which returns the number of file descriptors available in the bitmaps and populates the res bitmaps with those file descriptors.

        if ((ret = get_fd_set(n, inp, fds.in)) ||
            (ret = get_fd_set(n, outp, fds.out)) ||
            (ret = get_fd_set(n, exp, fds.ex)))
            goto out;
        zero_fd_set(n, fds.res_in);
        zero_fd_set(n, fds.res_out);
        zero_fd_set(n, fds.res_ex);

        ret = do_select(n, &fds, &timeout);

5. The timeout value is put back into a struct timeval and copied to user space.

        if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
            time_t sec = 0, usec = 0;
            if (timeout) {
                sec = timeout / HZ;
                usec = timeout % HZ;
                usec *= (1000000/HZ);
            }
            put_user(sec, &tvp->tv_sec);
            put_user(usec, &tvp->tv_usec);
        }

6. A return value less than 0 means error. A zero return value could mean that the system call was unable to finish, so check if a signal is pending and, if it is, then return -ERESTARTNOHAND. This value restarts the system call if the signal has no handler; if a handler runs, the call fails with EINTR instead. ERESTARTNOHAND itself does not get passed to the user program.

        if (ret < 0)
            goto out;
        if (!ret) {
            ret = -ERESTARTNOHAND;
            if (signal_pending(current))
                goto out;
            ret = 0;
        }

7. set_fd_set() copies the information in the result part of fd_set_bits (populated by do_select()) to user space if the user space address is not NULL. Then the memory region allocated by kmalloc() is freed by calling kfree() inside select_bits_free() and the return value is returned.

        set_fd_set(n, inp, fds.res_in);
        set_fd_set(n, outp, fds.res_out);
        set_fd_set(n, exp, fds.res_ex);

    out:
        select_bits_free(bits, size);
    out_nofds:
        return ret;
    }
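The single allocation carved into six bitmaps (step 3 above) can be reproduced in user space. In this sketch struct bits_sketch, the SK_* macros, and bits_sketch_alloc() are stand-in names, and calloc() replaces kmalloc(); the size arithmetic is an assumption meant to behave like the FDS_BYTES description above (round n up to whole longs, then convert to bytes).

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the six-pointer layout of fd_set_bits. */
struct bits_sketch {
    unsigned long *in, *out, *ex;
    unsigned long *res_in, *res_out, *res_ex;
};

#define SK_BITS_PER_LONG (8 * sizeof(unsigned long))
#define SK_LONGS(n)      (((n) + SK_BITS_PER_LONG - 1) / SK_BITS_PER_LONG)
#define SK_BYTES(n)      (SK_LONGS(n) * sizeof(unsigned long))

/* Allocate one region of 6*size bytes and point each bitmap into it.
 * Returns the base pointer (pass it to free() later) or NULL on failure. */
char *bits_sketch_alloc(int n, struct bits_sketch *fds, size_t *size_out)
{
    assert(n > 0 && fds != NULL && size_out != NULL);

    size_t size = SK_BYTES((size_t)n);
    /* calloc() zeroes all six bitmaps; the kernel code instead fills the
     * input bitmaps from user space and zeroes only the result bitmaps. */
    char *bits = calloc(6, size);
    if (!bits)
        return NULL;

    fds->in      = (unsigned long *)bits;
    fds->out     = (unsigned long *)(bits + size);
    fds->ex      = (unsigned long *)(bits + 2 * size);
    fds->res_in  = (unsigned long *)(bits + 3 * size);
    fds->res_out = (unsigned long *)(bits + 4 * size);
    fds->res_ex  = (unsigned long *)(bits + 5 * size);

    *size_out = size;
    return bits;
}
```

Because size is always a whole number of longs, every one of the six pointers stays long-aligned, and a call with a small n touches far less memory than six full 1024-bit sets would.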

3.2 do_select()’s Design

do_select(), which is also located in fs/select.c, is the heart of the select() system call.

It takes care of checking each file descriptor in the bitmaps set up by sys_select(). The

available file descriptors are set in the result bitmaps of the fd_set_bits struct. First, the

maximum file descriptor in any of the sets is found. Then a list of wait queues is initialized if

the timeout is nonzero. This list is used by select() to sleep on a number of file descriptors.

It wakes up if an event happens on any of them. Next, each file descriptor up to the maximum

found is looked at to see if its bit is set in any of the sets. If it is set, then the poll() method is

called for the type of file descriptor that it is. The poll() method sets up the wait queues for

this file descriptor and adds them to the list [5]. A mask is then returned to indicate the status of

the file descriptor. The mask is compared to the mask it should have for the set(s) it is in and the

result bit is set if it matches. If no file descriptors are found to be available in any of the sets, the

process goes to sleep for the specified timeout period or until one of the file descriptors in the

wait queue list becomes ready. When it wakes up, this sequence is repeated to find out if a file

descriptor became available, if the timeout expired, or if a signal is pending. If any of these

conditions exist, the function returns the number of file descriptors found to be available. A 0

return value means either the timeout expired or a signal interrupted the system call.
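One pass of this scan can be approximated in user space for the read set only. This is a sketch under clear assumptions: the poll() system call with a zero timeout stands in for calling each file's poll() method, SK_IN_EVENTS is a simplified stand-in for the mask the kernel compares against, and the sleeping and wait queue handling are left out entirely. The function names are invented for this example.

```c
#include <assert.h>
#include <poll.h>
#include <sys/select.h>
#include <unistd.h>

/* Simplified set of status bits that count as "ready for reading" here. */
#define SK_IN_EVENTS (POLLIN | POLLHUP | POLLERR)

/* One non-blocking pass over descriptors 0..n-1. For each descriptor whose
 * bit is set in `in`, ask for its status and set its bit in `res_in` if it
 * is readable. Returns the number of ready descriptors found. */
int scan_readable_once(int n, fd_set *in, fd_set *res_in)
{
    int ready = 0;

    assert(n >= 0 && n <= FD_SETSIZE);
    FD_ZERO(res_in);

    for (int fd = 0; fd < n; fd++) {
        if (!FD_ISSET(fd, in))
            continue;                       /* cleared bits are ignored */

        struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
        if (poll(&p, 1, 0) > 0 && (p.revents & SK_IN_EVENTS)) {
            FD_SET(fd, res_in);             /* record it in the result bitmap */
            ready++;
        }
    }
    return ready;
}

/* Demo on a pipe. Returns 1 if the scan finds nothing before a byte is
 * written and exactly one ready descriptor afterwards, 0 otherwise. */
int demo_scan_on_pipe(void)
{
    int p[2];
    fd_set in, res;

    if (pipe(p) != 0)
        return 0;
    FD_ZERO(&in);
    FD_SET(p[0], &in);

    int before = scan_readable_once(p[0] + 1, &in, &res);
    int wrote = (write(p[1], "x", 1) == 1);
    int after = scan_readable_once(p[0] + 1, &in, &res);
    int bit = FD_ISSET(p[0], &res) ? 1 : 0;

    close(p[0]);
    close(p[1]);
    return (wrote && before == 0 && after == 1 && bit) ? 1 : 0;
}
```

The real do_select() wraps a pass like this in a loop and sleeps between passes when nothing is ready, which is the part that creates the blocking problem on a split host and co-host system.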


Following are the declarations for do_select():

    int do_select(int n, fd_set_bits *fds, long *timeout)
    {
        poll_table table, *wait;
        int retval, i, off;
        long __timeout = *timeout;

1. max_select_fd() checks for bad file descriptors and returns the maximum file descriptor in any of the sets, plus one. If there is a bad file descriptor, i.e. no open file for that file descriptor, then it returns -EBADF.

        read_lock(&current->files->file_lock);
        retval = max_select_fd(n, fds);
        read_unlock(&current->files->file_lock);

        if (retval < 0)
            return retval;
        n = retval;

2. poll_initwait() is called, which initializes the error value to 0 and the table value to NULL inside the poll_table structure (these are the only two fields). The error value is an integer and the table value is a pointer to a struct poll_table_page. This is how struct poll_table_page looks (fs/select.c):

        struct poll_table_entry {
            struct file * filp;
            wait_queue_t wait;
            wait_queue_head_t * wait_address;
        };

        struct poll_table_page {
            struct poll_table_page * next;
            struct poll_table_entry * entry;
            struct poll_table_entry entries[0];
        };

   wait is set to the address of table, but if the timeout is 0, then it is set to NULL (i.e. just poll without any wait queues). The return value is initialized to 0.

        poll_initwait(&table);
        wait = &table;
        if (!__timeout)
            wait = NULL;
        retval = 0;

3. This is the main select() loop. The first step is to set the current state to

TASK_INTERRUPTIBLE, which allows the process to be woken up if wake_up() or

wake_up_interruptible() is called on any of the wait queues (more on this later). The

state is changed here rather than right before going to sleep to avoid a race condition where

the condition we sleep on changes between the time we test it and the time we go to sleep. If

we are sleeping on a wait queue whose condition has already occurred, there could be a delay

or lockup. Setting to TASK_INTERRUPTIBLE before checking all the file descriptors rather

than right before going to sleep ensures that, if any of the wait queues set the process’s state

to TASK_RUNNING (i.e. the condition occurred), then the worst that could happen when

schedule_timeout() is called is that the process would be rescheduled on the running

queue [5]. After setting the state, it goes through each file descriptor up to n doing the

following:

a) Each unsigned long is 8*sizeof(unsigned long) bits, which is what the constant
__NFDBITS is set to (32 on Intel). BIT puts a ‘1’ in the correct position in the
unsigned long variable ‘bit’ while ‘off’ finds the correct unsigned long word to
put it in. For example, if we were on file descriptor 5, bit would be set to 32
decimal, or 100000 binary, and off would be 5/32 or 0. For another example, if the file
descriptor were 42, bit would be set to 1024, or 10000000000 binary, and off would
be 42/32 or 1. File descriptor 5 is in the first unsigned long of the set while 42 is in
the second unsigned long of the set. See Appendix C for a further discussion.

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            for (i = 0 ; i < n; i++) {
                unsigned long bit = BIT(i);
                unsigned long mask;
                struct file *file;

                off = i / __NFDBITS;

b) BITS returns a mask of the bits set in all three of the file descriptor sets for a
particular offset (long word). The three sets are OR’ed together to create this mask. Our bit
is then AND’ed with this value to see if any of the sets contains this file descriptor. If
not, it then skips the rest of the loop and goes on to the next file descriptor.

                if (!(bit & BITS(fds, off)))
                    continue;

c) The file structure is then filled in with the appropriate information and the poll()
method is called for that specific type of file or socket. The device method for
poll() is in charge of calling poll_wait() “on one or more wait queues that could
indicate a change in the poll status” and a bit mask is returned that indicates the
“operations that could be immediately performed without blocking” [5]. wait keeps
track of all the file descriptors and their wait queues.

                file = fget(i);
                mask = POLLNVAL;
                if (file) {
                    mask = DEFAULT_POLLMASK;
                    if (file->f_op && file->f_op->poll)
                        mask = file->f_op->poll(file, wait);
                    fput(file);
                }

d) For each of the three bitmaps, if the file descriptor is in that set, then the mask is
AND’ed with the poll mask for that set (POLLIN_SET for read, POLLOUT_SET for
write, and POLLEX_SET for exceptions). If the two masks have at least one bit in
common, then the bit is set in the result field for that bitmap, the return value is
incremented, and the poll table is set to NULL. The poll table is set to NULL after any
increment of the return value because the function is now going to return, so there is
no need to keep populating the wait queues. It is always set to NULL after the first pass
through all the file descriptors because all the wait queues have been populated by then.

                if ((mask & POLLIN_SET) && ISSET(bit, __IN(fds,off))) {
                    SET(bit, __RES_IN(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLOUT_SET) && ISSET(bit, __OUT(fds,off))) {
                    SET(bit, __RES_OUT(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLEX_SET) && ISSET(bit, __EX(fds,off))) {
                    SET(bit, __RES_EX(fds,off));
                    retval++;
                    wait = NULL;
                }
            }
            wait = NULL;

e) After each iteration through all the file descriptors, if any of the conditions below
is met, it breaks out of the loop.

            if (retval || !__timeout || signal_pending(current))
                break;
            if(table.error) {
                retval = table.error;
                break;
            }
            __timeout = schedule_timeout(__timeout);
        }

Otherwise, schedule_timeout() is called with the timeout value specified. Since
our state is TASK_INTERRUPTIBLE, schedule_timeout() will sleep for the period of
time specified by __timeout until either that time expires or it is awoken for another
reason, i.e. if one of the wait queues wakes up the process or a signal is received [1].
Also, if MAX_SCHEDULE_TIMEOUT is passed to schedule_timeout(), as happens
when the timeout value passed from user space is NULL, it calls schedule()
with no bound on the timeout. This process will then sleep until woken up by
something else that sets its state to TASK_RUNNING. Following is the code for
schedule_timeout() (kernel/sched.c):

    signed long schedule_timeout(signed long timeout)
    {
        struct timer_list timer;
        unsigned long expire;

        switch (timeout)
        {
        case MAX_SCHEDULE_TIMEOUT:
            schedule();
            goto out;
        default:
            if (timeout < 0)
            {
                printk(KERN_ERR "schedule_timeout: wrong timeout "
                       "value %lx from %p\n", timeout,
                       __builtin_return_address(0));
                current->state = TASK_RUNNING;
                goto out;
            }
        }

        expire = timeout + jiffies;

        init_timer(&timer);
        timer.expires = expire;
        timer.data = (unsigned long) current;
        timer.function = process_timeout;

        add_timer(&timer);
        schedule();
        del_timer_sync(&timer);

        timeout = expire - jiffies;

    out:
        return timeout < 0 ? 0 : timeout;
    }

After schedule_timeout() is called, there is at least one more iteration through the

file descriptors.

4. After breaking out of the loop, the state is set to TASK_RUNNING so that it is no longer in a
sleep state and poll_freewait() is called to depopulate all the wait queues and free the poll
table pages. The timeout value is updated and the return value is returned.

        current->state = TASK_RUNNING;
        poll_freewait(&table);
        *timeout = __timeout;
        return retval;
    }
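The word and bit arithmetic used in step 3 can be checked outside the kernel. The sketch below is an illustration under stated assumptions, not the kernel's macros: it fixes the word size at 32 bits (the Intel case discussed above) so that the results do not depend on the machine it is compiled on, and fd_word(), fd_bit(), and watched_bits() are hypothetical names standing in for the roles of off, BIT, and BITS.

```c
/* Assumption: 32-bit words, as on the i386 host described in the text. */
#define SKETCH_NFDBITS 32u

/* Index of the word that holds descriptor fd (the role of 'off'). */
unsigned int fd_word(unsigned int fd)
{
    return fd / SKETCH_NFDBITS;
}

/* Mask with a single 1 at fd's position inside its word (the role of BIT). */
unsigned long fd_bit(unsigned int fd)
{
    return 1UL << (fd % SKETCH_NFDBITS);
}

/* OR of the three input sets for one word (the role of BITS). A nonzero
 * AND with fd_bit() means at least one set is watching that descriptor. */
unsigned long watched_bits(const unsigned long *in, const unsigned long *out,
                           const unsigned long *ex, unsigned int word)
{
    return in[word] | out[word] | ex[word];
}
```

For file descriptor 5 this gives word 0 and a mask of 32; for file descriptor 42 it gives word 1 and a mask of 1024.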

4. The First Design

For the first design, I tried to make our implementation of select() similar to the way other

functions were intercepted in the kernel. The only exception was that the bitmaps had to be split

apart at the beginning and merged back together at the end. I went through many phases with

this first design as I was trying to understand both how select() was implemented in the kernel

and how the polling protocol worked. Following is somewhat of a timeline of the process,

followed by a description of the errors and shortcomings that ultimately led to the failure of this

design.

4.1 Rob McCready’s Initial Design

Rob McCready had already started looking at this system call by the time I came on the project.

He described to me the bitmap arrangement and how select() would be different than other

system calls. Additionally, he had started changing some of the design of the protocol to allow

the system call to be sent to the co-host while calling it locally. Before, the process would block

until the co-host returned.

His main idea was to create an fd_set bitmap in his file descriptor translation mapping structure

that sets the corresponding bit when socket() is called. This could then be used as a mask

when the select() bitmap parameters are passed in (see Figure 4.1). ANDing the bitmaps with

the mask tells which host-side file descriptors belong to the co-host. These file descriptors

would then have to be mapped to the co-host file descriptor numbers using the polling protocol’s

translation method. The way this works is that, when socket() is called, it is run on both the

host and the co-host and both sides return file descriptors. Then these file descriptors are copied

into two arrays: One tells the host what the corresponding co-host file descriptor is, the other

tells the co-host what the corresponding host file descriptor is. For example, in Figure 4.1, the

host returned a file descriptor number of 3 and the co-host returned a file descriptor number of 7,

so a 7 would be put in the third array element of the host-to-co-host translation and a 3 would be

put in the seventh element of the co-host-to-host translation [3]. The split function would return

two pointers to structures that contain the co-host and host file descriptor values (respectively) to

look at in each of the sets and a flag that indicates if any of the bits had been set.
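The translation arrays and mask described above can be sketched in user space as follows. This is only an illustration of the idea, not the polling protocol's actual code: struct fd_translation and its helper functions are hypothetical names, and the arrays are sized to FD_SETSIZE for simplicity.

```c
#include <sys/select.h>

#define MAX_FDS    FD_SETSIZE
#define NO_MAPPING (-1)

/* Hypothetical per-process translation state. */
struct fd_translation {
    int host_to_ebsa[MAX_FDS];   /* indexed by host descriptor */
    int ebsa_to_host[MAX_FDS];   /* indexed by co-host descriptor */
    fd_set cohost_mask;          /* host descriptors that live on the co-host */
};

void fd_translation_init(struct fd_translation *t)
{
    int i;

    for (i = 0; i < MAX_FDS; i++) {
        t->host_to_ebsa[i] = NO_MAPPING;
        t->ebsa_to_host[i] = NO_MAPPING;
    }
    FD_ZERO(&t->cohost_mask);
}

/* Record the pair of descriptors returned by socket() on the two sides. */
int fd_translation_add(struct fd_translation *t, int host_fd, int ebsa_fd)
{
    if (host_fd < 0 || host_fd >= MAX_FDS || ebsa_fd < 0 || ebsa_fd >= MAX_FDS)
        return -1;
    t->host_to_ebsa[host_fd] = ebsa_fd;
    t->ebsa_to_host[ebsa_fd] = host_fd;
    FD_SET(host_fd, &t->cohost_mask);
    return 0;
}

/* Nonzero if host_fd is in the caller's set AND belongs to the co-host. */
int is_cohost_fd(const struct fd_translation *t, const fd_set *set, int host_fd)
{
    return FD_ISSET(host_fd, set) && FD_ISSET(host_fd, &t->cohost_mask);
}
```

With the pair from the example (host 3, co-host 7), fd_translation_add(&t, 3, 7) stores 7 at index 3 of the host-to-co-host array and 3 at index 7 of the co-host-to-host array.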

There were 4 cases he determined that we had to deal with. The first case is where all the file

descriptors are on the host, in which case the select() call would only be executed locally.

The second case is where all the file descriptors are on the co-host, whereby the call would be

sent to the co-host only. The third case is where there are both host and co-host file descriptors.

For this case, we would have to send the com packet to the co-host and then call the function

locally on the host. We would then attempt to lock waiting for the co-host to return. If it already

had returned, then the call would go through the lock, otherwise it would sleep until the co-host

returned. Then the sets would have to be translated and merged. The final case is where one or

both sides return an error. That error should be propagated to the user and, if both sides return an

error, one needs to take precedence over the other. I used this design as a basis for starting my

work.
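The first three cases amount to a dispatch on where the watched file descriptors live. A minimal sketch, with hypothetical names (enum select_plan, choose_plan()), is shown here; the fourth case, error propagation, is left out. Treating a call with no descriptors at all as host-only is an assumption made for this sketch.

```c
/* Hypothetical names; only the first three cases are modeled. */
enum select_plan {
    PLAN_HOST_ONLY,     /* case 1: run select() locally only */
    PLAN_COHOST_ONLY,   /* case 2: send the call to the co-host only */
    PLAN_BOTH           /* case 3: send to the co-host, then run locally */
};

enum select_plan choose_plan(int host_fd_count, int cohost_fd_count)
{
    /* Assumption: with no co-host descriptors (including no descriptors
     * at all) nothing needs to be sent across to the co-host. */
    if (cohost_fd_count == 0)
        return PLAN_HOST_ONLY;
    if (host_fd_count == 0)
        return PLAN_COHOST_ONLY;
    return PLAN_BOTH;
}
```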

4.2 Initial Design Overview

Figure 4.2 shows a flowchart of my initial assessment along with some of the problems I found

associated with it at the time. I found that there would be some major synchronization issues

with splitting the call into two. In the initial design phase, I had to come up with some basic

parameters:

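[Figure 4.1 – Example Setup with File Descriptor Mask. The figure shows the two file descriptor mapping arrays (host->EBSA and EBSA->host), each drawn as array indices with their contents, where -1 marks an entry with no mapping, together with the fd_mask bitmap covering file descriptors 0 through 1023. In the portion of fd_mask shown, the bits for file descriptors 3 and 4 are set and the bits for 0, 1, 2, 5, and 1023 are clear.]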

1. Only do what is necessary to avoid large overhead. If after some initial checks the

parameters contain errors, immediately return and do not send anything to the co-

host. If there are only co-host-side file descriptors or only host-side file descriptors,

only call select() on the host or co-host.

2. There will need to be two sets of parameters, one for the host and one for the co-host,

and they need to be separated so that the system call can be made on both platforms

and then somehow merged back together.

3. Any errors are returned to the user. If both calls return an error, the host-side error is

returned. This is done because most of the applications running on the host would

find errors on the same machine more useful.

4. Return to the user the sum of the number of file descriptors found on both sides (or an

error), the three file descriptor sets containing both sides’ resulting file descriptors,

and the elapsed time as the timeout value.

5. Whenever a file descriptor is or becomes available, the select() call has to return.

If this is not done, splitting the calls could potentially cause large delays or hang the

system.
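Parameters 3 and 4 describe how the two return values are combined. A minimal sketch of that rule, using a hypothetical function name, might look like this; merging the bitmaps and computing the elapsed time are not shown.

```c
/* Hypothetical helper: combine the return values of the host-side and
 * co-host-side select() calls. Negative values are error codes. */
int combine_select_results(int host_ret, int cohost_ret)
{
    if (host_ret < 0)
        return host_ret;            /* host-side error takes precedence */
    if (cohost_ret < 0)
        return cohost_ret;          /* otherwise report the co-host error */
    return host_ret + cohost_ret;   /* total number of ready descriptors */
}
```

For example, two ready descriptors on the host and one on the co-host yield 3, while an error on both sides yields the host's error code.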

Figure 4.2 – Early select() Flowchart with Problem Areas

Flowchart boxes:
- Intercept select() on host.
- Find out which file descriptors are being watched. This is done by checking for 1’s (FD_ISSET) in each bitmap.
- Of the file descriptors being watched, find out which ones are socket descriptors by checking the current process’s file descriptor translation table (Rob’s code).
- Create two separate sets of arguments for select(): one to hold the local file descriptors, and one to hold the socket file descriptors. This will require making six bitmaps total and either masking bits or using FD_SET and FD_CLR. Reset to new values for n. The socket descriptor bitmaps will have to be translated from host descriptors to EBSA descriptors before placing them in the bitmap (Rob’s code).
- Execute select() on the host using the bitmaps that contain the local set of arguments.
- Marshal into a com packet the arguments for select() that contain the socket set of arguments and place on queue to send to EBSA. Do not lock.
- When/if select() on the EBSA returns, check return code.
- When/if select() on the host returns, check the return code.
- Look in bitmaps to see which descriptors remain and translate from EBSA descriptors to host descriptors (Rob’s code).
- Add return value from each call together and OR matching bitmaps.
- Return with return value.

Branch labels on the return-code checks: Return -1, Return 0, Return > 0, Return ≥ 0.

Notes and problems marked on the flowchart:
- Make sure to have checks for NULL fd_set’s.
- Problem: Two select() calls, old_select() and sys_select().
- Problem: Slow. May be faster to directly check fd_bits array, but not sure if this is okay to do. Can we intercept macros (doubt it)?
- Problem: What about the timeval’s?
- Problem: What about the error codes on the EBSA? Rob’s code does nothing with them.

Issues and potential problems with splitting up select():
1. If one finishes before the other, how long to wait?
2. What if one never returns?
3. What if one returns an error and the other doesn’t?
4. Synchronization issues, waiting too long and getting too many descriptors back (than if was only called in one place).

Seems like splitting up can cause some major problems.

Taking all this into consideration, I designed an algorithm that I proposed at a design review on
July 31, 2001:

    typedef struct {
        int maxfd;
        fd_set *read, *write, *except;
    } select_split_t;

    select_split_t* host;
    select_split_t* ebsa;

1. If n < 0 return -EINVAL

2. For each file descriptor set that is not NULL,

a. Allocate the host side of the set (i.e. either host->read, host->write, or

host->except) and copy the parameter passed in from user space (i.e. either

readfds, writefds, or exceptfds) into it.

b. Allocate the co-host side of the set. For each file descriptor set that is NULL,

set the host and co-host sides for that set to NULL.

3. Set host->maxfd to n and if it is greater than 1024 (max number of file descriptors),

then set to 1024.

4. Split bitmaps.

5. If there are co-host-side file descriptors then marshal co-host side parameters into a

com packet and send to co-host to call select() with a timeout of 0.

6. If there are host side file descriptors then call select() on local host with a timeout

of 0.

7. Attempt to lock. If the co-host has already returned or was not called, it will go

through the lock without blocking. Otherwise it will wait for the co-host to return.

8. If any errors were returned, then, if there were any co-host file descriptors, merge

them with the host ones, copy host sets into the original parameters, and return.

9. Add the return values together. If the result is larger than 0, that means file

descriptors are ready. If there are co-host file descriptors, merge the bitmaps

together. Copy host sets back into parameters and return.

10. If both return 0, then set a timer to the timeout value specified by the fifth parameter.

This can be accomplished by either calling schedule_timeout() with the timeout

value or calling select() with all other parameters set to NULL or 0’s. If the timeout

value passed in is NULL, then we can set a timer to the timeout value

MAX_SCHEDULE_TIMEOUT and call schedule_timeout() that will, in turn, reschedule

the current process. We could also just call schedule() manually.

11. If there was a timeout value specified, then repeat steps 5-9. If the timer has not

expired then repeat step 10 and then again repeat steps 5-10 until either the timer

expires or either return value is not equal to 0. Anytime there is a return value greater

than 0, this process will stop on step 9. Anytime there is a return value less than 0,

this process will stop on step 8. If the timeout value is NULL, then 5-10 are repeated

indefinitely until either of the return values is not equal to 0.
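The shape of this poll-sleep-poll loop can be simulated in user space. The sketch below is a deliberately simplified model and not the proposed kernel code: struct fake_side is a hypothetical stand-in for one side (host or co-host), sleeping is replaced by a counter, and error handling (step 8) is omitted.

```c
/* Hypothetical stand-in for one side: it reports one ready descriptor
 * once it has been polled more than ready_after times. */
struct fake_side {
    int polls;
    int ready_after;
};

/* Models a select() call with a timeout of 0 on one side. */
static int poll_side(struct fake_side *s)
{
    s->polls++;
    return s->polls > s->ready_after ? 1 : 0;
}

/* Returns the combined ready count, or 0 if max_sleeps sleep periods
 * passed without either side reporting a ready descriptor. */
int first_design_select(struct fake_side *host, struct fake_side *cohost,
                        int max_sleeps)
{
    int sleeps = 0;

    for (;;) {
        /* Steps 5-7: poll both sides without sleeping. */
        int total = poll_side(host) + poll_side(cohost);

        if (total > 0)
            return total;           /* step 9: something is ready */
        if (sleeps == max_sleeps)
            return 0;               /* timer expired, nothing ready */
        sleeps++;                   /* step 10: stand-in for sleeping */
    }
}
```

The model makes one property of the design visible: a descriptor that becomes ready during a sleep period is only noticed at the next poll, after the sleep has finished.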

Calling select() with a timeout of 0 seemed like a good enough solution to see if any of the

file descriptors were available immediately to avoid any sleeping. If none were available, then

we would sleep for the period of time passed in from user space and then check again with a

timeout of 0. I thought of this design before I had full understanding of what select() was

actually doing, so there are a number of flaws that will be discussed in Section 4.4.

4.3 The Split and Merge Algorithms

One of the first things that had to be decided for the split and merge routines was how to go

about searching through the file descriptor sets. The polling protocol’s design called for masking

off the host file descriptors and then translating host-side file descriptors belonging to the co-host

to the corresponding co-host ones using the translation tables. However, this method would

create overhead when setting up the sockets because the mask would have to be modified every

time, even if select() was never called. Furthermore, once the mask is created, the host-side

file descriptors would have to be translated to the co-host-side file descriptors anyway, so we

would have to search through each bitmap to get the file descriptors to perform this translation

on. After talking this over with Rob McCready, we decided that, if we are going to go through

the bitmaps anyway, then we might as well check if each file descriptor that is set is a co-host-

side file descriptor and, if it is, then translate it. This creates only a little more overhead on

select() while the overhead is removed when a socket is created. We also thought that it could

be changed later if this caused a large loss of performance.

fd_map.c has a routine called fd_map_get_ebsa_fd() that finds the co-host mapping of a given

host file descriptor. It returns -1 if it does not find a mapping (the mapping table is initialized to

all -1’s before any file descriptors are put into it). However, calling this function for a large

number of file descriptors would produce much overhead not only because it would have to call

a function in a separate file, but also because it would have to search through a list of mappings

for the current process’s mapping. So it was decided to put the split and merge routines inside

fd_map.c so that they could access the mappings directly and would only have to search for the

current process’s mapping once. [3]

The algorithms for these functions were fairly simple. For splitting the file descriptors, three

fd_set variables are created for the co-host side and initialized to 0 with FD_ZERO. Then, for

each of the non-NULL sets, go through each file descriptor up to n (passed in from the user) and

see if there is a mapping for each file descriptor it finds set (check if it is set by using FD_ISSET).

If a mapping is found, clear the bit in the host set (FD_CLR) and then set it in the co-host set

(FD_SET). The variable maxfd in select_split_t would be updated for each side

when a file descriptor is found for it. Then, when select() is called on both sides, it will be

called with the corresponding maxfd+1 and the three sets for that side.

select() will return with an updated set of bitmaps that show the available file descriptors. The

merge algorithm then goes through each of these non-NULL file descriptor sets for the co-host.

It looks up the co-host-to-host mapping for each file descriptor it sees in the bitmap sets.

If it sees a mapping, the file descriptor is translated from the co-host to the host and is set on the

corresponding host side bitmap. When this algorithm is complete, the host-side bitmap will

contain all file descriptors that select() set.
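The split and merge steps for a single set can be sketched in user space with the standard fd_set macros. This is an illustration only, not the routines in fd_map.c: split_set() and merge_set() are hypothetical names, the mapping tables are passed in as plain arrays with -1 meaning "no mapping", and the sketch assumes that the co-host set is filled with co-host descriptor numbers at split time.

```c
#include <sys/select.h>

#define NO_MAPPING (-1)

/* Move every descriptor in host_set that has a co-host mapping into
 * cohost_set under its co-host number. host_to_ebsa must have at least
 * n entries. Returns the highest co-host descriptor moved, or -1. */
int split_set(int n, fd_set *host_set, fd_set *cohost_set,
              const int *host_to_ebsa)
{
    int fd, max_cohost = -1;

    FD_ZERO(cohost_set);
    for (fd = 0; fd < n; fd++) {
        int mapped;

        if (!FD_ISSET(fd, host_set))
            continue;
        mapped = host_to_ebsa[fd];
        if (mapped == NO_MAPPING)
            continue;               /* a genuine host-side descriptor */
        FD_CLR(fd, host_set);
        FD_SET(mapped, cohost_set);
        if (mapped > max_cohost)
            max_cohost = mapped;
    }
    return max_cohost;
}

/* Translate each descriptor set in cohost_set back to its host number
 * and set it in host_set. ebsa_to_host must have cohost_n entries. */
void merge_set(int cohost_n, const fd_set *cohost_set, fd_set *host_set,
               const int *ebsa_to_host)
{
    int fd;

    for (fd = 0; fd < cohost_n; fd++) {
        if (FD_ISSET(fd, cohost_set) && ebsa_to_host[fd] != NO_MAPPING)
            FD_SET(ebsa_to_host[fd], host_set);
    }
}
```

The value returned by split_set() plays the role of maxfd for the co-host side, so the co-host call would be made with that value plus one.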

I was able to come up with split and merge functions that were able to manipulate the file

descriptor sets, but I quickly found out that there were issues with user space and kernel space

variables. sys_select() wants the variables to be in user space because the __get_user(),

get_fd_set(), put_user(), and set_fd_set() functions require that the function arguments

be from user space. I found this out because select() would continuously return -EFAULT

because it was trying to copy addresses from user space that were already kernel space addresses.

The only solution I could come up with was to copy the values from user space, split up the sets,

then copy them back to user space so that sys_select() could be called. This worked, albeit

slowly, for a few situations, but I ran into problems that forced me to abandon this idea altogether.

4.4 Reasons for Failure

There were a number of problems that caused this approach to be abandoned for the new

approach that is discussed in Chapter 5. First, user space variables become a problem when there

are none there. It is not a problem for the file descriptor sets because they are not processed if

they do not point to anything. However, the timeout value needs to have a user space variable so

that we can copy a 0 into that place. The user does have the option of making it a NULL pointer,

in which case we could not copy a value to user space. I tried to see if there was a way we could

create a user space variable from our kernel module but I could not find a way. Also, the co-host

does not have a user process at all, so there would definitely be no way we could copy a 0 to user

space even if there was a valid timeout value passed in.

Secondly, I found out that the select() call does not necessarily wait the entire timeout period

before checking all the file descriptors a second time. It populates wait queues for each file

descriptor and then sleeps. It wakes up if any of the file descriptors becomes available or if the

timeout expires, whichever comes first. If no timeout value is given (NULL), it sleeps indefinitely

until a file descriptor becomes available. Calling select() with a timeout of 0 would not

populate the wait queues and we could potentially sleep much longer than the original system

call would. Since this is not the type of behavior we would want for select(), a new design

had to be created.

5. The Second Design

Once I saw that the conventional method of intercepting system calls using the polling protocol

was not going to work, I tried to find alternative solutions that would allow select() to be

implemented with the CiNIC platform. Max Roth had suggested that I just include the kernel

code and modify that directly, but I was really hesitant to use this method because we wanted our

implementation to be as kernel independent as possible. However, seeing no alternative solution,

this is the method we went with for the second design. I included comments throughout the code

about what is kernel specific so that, if the kernel’s implementation of select() changes in

future kernel releases, one would be able to figure out what needs to be changed.

Four functions from the kernel had to be copied over: old_select(), sys_select(),

do_select(), and max_select_fd(). sys_select(), do_select(), and max_select_fd()

all come from fs/select.c in the Linux kernel, while old_select() comes from

arch/i386/kernel/sys_i386.c. For Chapter 5, sys_select() and old_select() were the

only functions that needed modification (do_select() is modified in Chapter 6); however, all of

these functions needed to be copied over because do_select() and max_select_fd() are static

functions used by select() that cannot be exported by the kernel. sys_select()’s and

old_select()’s modifications will be outlined in Section 5.1, followed by a description of how

the parameters get transferred to the co-host and run there in Sections 5.2 and 5.3. The split and

merge routines will be described in Section 5.4.

5.1 Modifications to old_select() and sys_select()

old_select()’s modifications are extremely straightforward. Since all it does is get the

parameters from a pointer and then call sys_select(), all that needed to be done was to add in

module use counting (MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT) and make it call our

version of sys_select() rather than the kernel’s version. Following is the code:

    long n_old_select(select_param_t *args)
    {
        select_param_t a;
        long retval;

        MOD_INC_USE_COUNT;
        if (copy_from_user(&a, args, sizeof(a))) {
            MOD_DEC_USE_COUNT;
            return -EFAULT;
        }
        retval = sys_host_select(a.n, a.inp, a.outp, a.exp, a.tvp);
        MOD_DEC_USE_COUNT;
        return retval;
    }

For sys_select() on the host, I renamed my version to sys_host_select(). I needed a way

to keep track of information that was specific to one side or the other, so I created a data

structure:

    typedef struct {
        int n;
        int size;
        long timeout;
        fd_set_bits fds;
        char* bits;
    } select_split_t;

n is the highest numbered file descriptor in the set, size is the size of one of the sets (there are 6
altogether, 3 input and 3 output), timeout is the timeout value, fds is a structure that contains
pointers to each of the 6 sets, and bits is a pointer to the beginning of the memory region. A
select_split_t structure is created for each side. We also needed a structure that is used to
send the parameter information over to the co-host:

    typedef struct {
        int numfds;
        long time_off;
        int sizefds;
        char bitmaps[0];
    } select_func_t;

    #define COM_SELECT_HEADER_SIZE 12

numfds, time_off, sizefds, and bitmaps[0] are equivalent to n, timeout, size, and bits in

the select_split_t, respectively. COM_SELECT_HEADER_SIZE defines the size of this structure

without bitmaps, which is used when calculating the size of the com packet to send to the co-

host. bitmaps[0] is actually a pointer of arbitrary length, but I made this into an array of size 0

so that the data would be contiguous with the other members of the struct.
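The relationship between the header size and the trailing bitmap data can be sketched with offsetof. The structure below is a hypothetical variant of select_func_t that uses fixed-width 32-bit fields so the header is 12 bytes on any machine (matching int and long on the 32-bit platforms used here). That all six bitmaps travel in the packet is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width variant of the packet header. */
typedef struct {
    int32_t numfds;
    int32_t time_off;
    int32_t sizefds;
    char    bitmaps[];   /* bitmap bytes follow the header directly */
} select_packet_t;

#define SKETCH_SELECT_SETS 6   /* assumption: 3 input + 3 result bitmaps */

/* Size of the header alone, i.e. everything before bitmaps. */
size_t select_header_size(void)
{
    return offsetof(select_packet_t, bitmaps);
}

/* Total bytes for a packet whose bitmaps are bytes_per_set bytes each. */
size_t select_packet_size(size_t bytes_per_set)
{
    return select_header_size() + SKETCH_SELECT_SETS * bytes_per_set;
}
```

Computing the header size with offsetof rather than a hand-written constant keeps the two from drifting apart if a field is ever added.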

sys_host_select() is modified so that a timeout value is computed for each side (they start out

equal). Then the host-side sets are set up and the file descriptors are copied in from user space.

Next, three new fd_set’s are created for the co-host file descriptors. Each one is large enough

to contain the maximum number of file descriptors (1024). The file descriptors are then split

between the host and co-host sets. If there are no co-host-side file descriptors,

do_host_select() is called on the host. Otherwise, a co-host memory region is set up that is

similar to the host’s and the parameters are transferred to the co-host. Upon return, the sets are

merged back together and the sets, along with the time elapsed sleeping, are copied back to user

space.

Following is a walkthrough of the code. I will break it up into chunks and show what needed to

be changed from the original sys_select() code.

When the select() system call is made, the function called is sys_host_select() in

select_h.c. A select_split_t is declared for each side along with 3 fd_set’s that will
be used for the co-host sets:

    long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                         struct timeval *tvp)
    {
        select_split_t host, ebsa;
        fd_set rfds, wfds, efds;
        long timeout, ret;

        MOD_INC_USE_COUNT;

1. This section is the same as the kernel except for setting the host and co-host timeout values to
the timeout value passed in from user space.

        timeout = MAX_SCHEDULE_TIMEOUT;
        if (tvp) {
            time_t sec, usec;

            if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))
                || (ret = __get_user(sec, &tvp->tv_sec))
                || (ret = __get_user(usec, &tvp->tv_usec)))
                goto out_nofds;

            ret = -EINVAL;
            if (sec < 0 || usec < 0)
                goto out_nofds;

            if ((unsigned long) sec < MAX_SELECT_SECONDS) {
                timeout = ROUND_UP(usec, 1000000/HZ);
                timeout += sec * (unsigned long) HZ;
            }
        }
        host.timeout = timeout;
        ebsa.timeout = timeout;

        ret = -EINVAL;
        if (n < 0)
            goto out_nofds;

        if (n > current->files->max_fdset)
            n = current->files->max_fdset;

2. Set up the host-side bitmaps and then call split_select_bitmaps(), which will split up the

host and co-host file descriptors and set the n values for each. The host side contains all file

descriptors initially as they are copied from user space into the host-side bitmaps. The host-

side file descriptors will be set up in this memory region and the co-host side will be set up in

each of the fd_set’s. We are using fd_set because we do not know yet how big to make

the co-host side bitmaps until we get a value of n for it (it could possibly be larger than the

value of n passed in from the call to select()).

        ret = -ENOMEM;
        host.size = FDS_BYTES(n);
        host.bits = kmalloc(6 * host.size, GFP_KERNEL);
        if (!host.bits)
            goto out_nofds;
        host.fds.in = (unsigned long *) host.bits;
        host.fds.out = (unsigned long *) (host.bits + host.size);
        host.fds.ex = (unsigned long *) (host.bits + 2*host.size);
        host.fds.res_in = (unsigned long *) (host.bits + 3*host.size);
        host.fds.res_out = (unsigned long *) (host.bits + 4*host.size);
        host.fds.res_ex = (unsigned long *) (host.bits + 5*host.size);

        if ((ret = get_fd_set(n, inp, host.fds.in)) ||
            (ret = get_fd_set(n, outp, host.fds.out)) ||
            (ret = get_fd_set(n, exp, host.fds.ex)))
            goto out;
        zero_fd_set(n, host.fds.res_in);
        zero_fd_set(n, host.fds.res_out);
        zero_fd_set(n, host.fds.res_ex);

        split_select_bitmaps(n, &host, &ebsa, &rfds, &wfds, &efds);

3. If there are co-host file descriptors, then we set up the co-host side bitmaps and copy the

information from the fd_set’s to this region. n_sys_select() is then called, which takes

care of sending the data to the co-host, calling the host side do_select() if needed, and

waiting for the co-host to return. The return value is the combined value of

do_host_select() run on the host and do_ebsa_select() run on the co-host (or an error

value if an error occurred on either side). The host and co-host side bitmaps are merged into

the host side by merge_select_bitmaps() and the co-host side bitmaps are freed. If there

are not any co-host file descriptors, do_host_select() is called just like in the kernel and

nothing is sent to the co-host.

        if (ebsa.n > 0) {
            ret = -ENOMEM;
            ebsa.size = FDS_BYTES(ebsa.n);
            ebsa.bits = kmalloc(6 * ebsa.size, GFP_KERNEL);
            if (!ebsa.bits)
                goto out;
            ebsa.fds.in = (unsigned long *) ebsa.bits;
            ebsa.fds.out = (unsigned long *) (ebsa.bits + ebsa.size);
            ebsa.fds.ex = (unsigned long *) (ebsa.bits + 2*ebsa.size);
            ebsa.fds.res_in = (unsigned long *) (ebsa.bits + 3*ebsa.size);
            ebsa.fds.res_out = (unsigned long *) (ebsa.bits + 4*ebsa.size);
            ebsa.fds.res_ex = (unsigned long *) (ebsa.bits + 5*ebsa.size);

            memcpy((void*)ebsa.fds.in, (void*)&rfds, ebsa.size);
            memcpy((void*)ebsa.fds.out, (void*)&wfds, ebsa.size);
            memcpy((void*)ebsa.fds.ex, (void*)&efds, ebsa.size);
            zero_fd_set(ebsa.n, ebsa.fds.res_in);
            zero_fd_set(ebsa.n, ebsa.fds.res_out);
            zero_fd_set(ebsa.n, ebsa.fds.res_ex);

            ret = n_sys_select(&host, &ebsa);
            merge_select_bitmaps(&host, &ebsa);
            kfree(ebsa.bits);
        } else {
            ret = do_host_select(host.n, &host.fds, &host.timeout);
        }

4. The next part is the same as the kernel except for determining the timeout value from the
co-host and host. The smaller timeout value is returned. I thought about returning the larger

timeout value because that is what the timeout value would be if this was running normally

on one computer only, but I decided later that the timeout value should reflect the time

elapsed sleeping, since the user would probably use this value to sleep further if it is waiting

on an event.

    if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
        time_t sec = 0, usec = 0;
        if (ebsa.timeout < host.timeout) {
            timeout = ebsa.timeout;
        } else {
            timeout = host.timeout;
        }
        if (timeout) {
            sec = timeout / HZ;
            usec = timeout % HZ;
            usec *= (1000000/HZ);
        }
        put_user(sec, &tvp->tv_sec);
        put_user(usec, &tvp->tv_usec);
    }
    if (ret < 0) {
        goto out;
    }
    if (!ret) {
        ret = -ERESTARTNOHAND;
        if (signal_pending(current))
            goto out;
        ret = 0;
    }

5. Copy the values from the host result fields, free the bits on the host side, and return.

    set_fd_set(n, inp, host.fds.res_in);
    set_fd_set(n, outp, host.fds.res_out);
    set_fd_set(n, exp, host.fds.res_ex);
out:
    kfree(host.bits);
out_nofds:
    MOD_DEC_USE_COUNT;
    return ret;
}
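The timeout handling in step 4 can be tried outside the kernel. The following is a minimal user-space sketch, not code from the CiNIC source: it picks the smaller of the two remaining timeouts and converts it from clock ticks to seconds and microseconds with the same arithmetic as the listing above. The names smaller_timeout(), ticks_to_timeval(), and sketch_timeval_t are hypothetical, and the hz parameter stands in for the kernel's HZ constant.

```c
/* Hypothetical user-space sketch of the timeout merge in step 4. */

typedef struct {
    long sec;   /* whole seconds remaining */
    long usec;  /* leftover microseconds */
} sketch_timeval_t;

/* Return the smaller of the host and co-host remaining timeouts (in ticks). */
long smaller_timeout(long host_ticks, long ebsa_ticks)
{
    return (ebsa_ticks < host_ticks) ? ebsa_ticks : host_ticks;
}

/* Convert a tick count to seconds and microseconds; hz stands in for HZ. */
sketch_timeval_t ticks_to_timeval(long ticks, long hz)
{
    sketch_timeval_t tv = { 0, 0 };

    if (ticks) {
        tv.sec  = ticks / hz;
        tv.usec = (ticks % hz) * (1000000 / hz);
    }
    return tv;
}
```

With hz set to 100, a remaining timeout of 250 ticks converts to 2 seconds and 500000 microseconds.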

5.2 Transferring Parameters to the Co-Host

The method of transferring parameters to the co-host is very similar to the other system calls. It

is for this reason that I decided I would put the function for it into syscalls_h.c and name it

according to the naming scheme of the other intercepted system calls. This function,

n_sys_select(), is only called if there are co-host-side file descriptors. Its first task is to set up

a com packet that will be transferred to shared memory. This com packet will contain header

information such as the length to copy, the function ID, the process ID, and the return value.

Additionally, it will contain the arguments for the system call that are needed on the co-host side.

Next the com packet is put on the outgoing queue to send to the co-host. The polling protocol

will take care of getting it there from here [3]. It then calls do_host_select() if there are host-

side file descriptors. Once the host side returns, it will wait for the co-host side to return or keep


going if it already has. The modified sets from the co-host are copied from the com packet and

the return values are examined. If either side reports an error, that error is returned; if both

sides report errors, the host-side error is returned. If there are no errors, the sum of the return values is returned.

Following is the code for n_sys_select(). For simplicity, I have taken out some code that will

be described in Chapter 6.

long n_sys_select(select_split_t* local, select_split_t* remote)
{
    long err = 0;
    int ret_local = 0;
    int ret_remote = 0;
    pkt_queue_node_t *pqn;

1. Set up the com packet with the values it needs. It needs to fill in the values of the

select_func_t structure with the co-host’s n, timeout, and size values, and it needs to

copy over the co-host side bitmaps.

    pqn = proto_get_queue_node(
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size));
    pqn->pkt->copy_len =
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
    pqn->pkt->pkt_len =
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
    pqn->pkt->func_id = SYS_SELECT;
    pqn->pkt->pid = current->pid;
    pqn->pkt->ret_val = -1;
    pqn->pkt->func.select.numfds = remote->n;
    pqn->pkt->func.select.time_off = remote->timeout;
    pqn->pkt->func.select.sizefds = remote->size;
    memcpy((void*)&pqn->pkt->func.select.bitmaps[0], (void*)remote->bits,
           6*(remote->size));

2. The com packet is then put on a queue to send to the co-host. If there are host file descriptors

as well, do_host_select() is called with the host-side parameters. After it returns, the

process sleeps on the semaphore until the co-host returns. If the co-host has already returned,

it will go straight through the lock. The updated bitmaps are then copied back and the

timeout value from the co-host is updated. If either side returns an error, that error is

returned. I have designated that if both sides return an error, then the host error gets

precedence over the co-host error because the host error would most likely be needed by the


other applications running on the host. Otherwise, the return values are added together and

the sum is returned.

    proto_enqueue(pqn);
    if (local->n > 0) {
        ret_local = do_host_select(local->n, &local->fds, &local->timeout, 1);
    }
    if (down_interruptible(&(pqn->lock)) == -EINTR) {
        err = -EINTR;
        goto out_select;
    }
    ret_remote = pqn->pkt->ret_val;
    memcpy((void*)remote->bits, (void*)&pqn->pkt->func.select.bitmaps[0],
           6*(remote->size));
    remote->timeout = pqn->pkt->func.select.time_off;
    if (ret_local < 0) {
        err = ret_local;
    } else if (ret_remote < 0) {
        err = ret_remote;
    } else {
        err = ret_remote + ret_local;
    }
out_select:
    proto_release_queue_node(pqn);
    return err;
}
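The precedence rule for the two return values is easy to check in isolation. The sketch below restates it as a stand-alone function; combine_select_results() is a hypothetical name introduced here for illustration and does not appear in the CiNIC source. As in select(), negative values are treated as error codes and non-negative values as counts of ready file descriptors.

```c
/* Hypothetical helper restating the return-value rules of n_sys_select(). */
long combine_select_results(long ret_local, long ret_remote)
{
    if (ret_local < 0)
        return ret_local;           /* host error wins, even if both sides failed */
    if (ret_remote < 0)
        return ret_remote;          /* only the co-host failed */
    return ret_local + ret_remote;  /* no errors: total ready descriptors */
}
```

For example, two ready descriptors on the host and one on the co-host give 3, while an error on both sides reports the host's error code.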

5.3 The Co-Host Side Functions

The handling of the system call on the co-host side is relatively straightforward. The polling

protocol takes care of retrieving the com packet from shared memory and putting it on a handler

queue. Currently, the only handler is the default handler, so the default handler thread, which is

the default_handler() function in handler_default.c, checks the function ID from the com

packet and makes the appropriate system call [3]:

    case SYS_SELECT:
        pkt->ret_val = sys_ebsa_select(pkt->func.select.numfds,
                                       pkt->func.select.sizefds,
                                       &pkt->func.select.time_off,
                                       &pkt->func.select.bitmaps[0]);
        break;

sys_ebsa_select() (select_e.c) is called with the parameters sent by the com packet. It sets

up a memory region on the co-host to put the bitmaps into and then calls do_ebsa_select():

long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits)
{
    fd_set_bits bmaps;
    long retval;

    bmaps.in      = (unsigned long *) ebsa_bits;
    bmaps.out     = (unsigned long *) (ebsa_bits + size);
    bmaps.ex      = (unsigned long *) (ebsa_bits + 2*size);
    bmaps.res_in  = (unsigned long *) (ebsa_bits + 3*size);
    bmaps.res_out = (unsigned long *) (ebsa_bits + 4*size);
    bmaps.res_ex  = (unsigned long *) (ebsa_bits + 5*size);
    retval = do_ebsa_select(n, &bmaps, timeout);
    if (!retval) {
        if (signal_pending(current)) {
            retval = -ERESTARTNOHAND;
        }
    }
    return retval;
}

When the system call is complete, the modified com packet is put on the queue to return to the

host.
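Both sys_host_select() and sys_ebsa_select() lay six bitmaps end to end in one buffer. The following user-space sketch illustrates that layout under stated assumptions: sketch_fd_set_bits, sketch_fds_bytes(), and sketch_carve_bitmaps() are hypothetical names, calloc() stands in for kmalloc() plus the zeroing calls, and the size computation is written to mirror what the kernel's FDS_BYTES() macro computes (the descriptor count rounded up to whole unsigned longs).

```c
#include <stdlib.h>

/* Stand-in for the kernel's fd_set_bits: three request maps, three result maps. */
typedef struct {
    unsigned long *in, *out, *ex;
    unsigned long *res_in, *res_out, *res_ex;
} sketch_fd_set_bits;

/* Bytes needed to hold n descriptor bits, rounded up to whole unsigned longs. */
size_t sketch_fds_bytes(int n)
{
    size_t bits_per_long = 8 * sizeof(unsigned long);
    size_t longs = ((size_t)n + bits_per_long - 1) / bits_per_long;

    return longs * sizeof(unsigned long);
}

/* Allocate one zeroed buffer of 6 * size bytes and point each bitmap at its
 * slice. Returns the buffer (release it with free()) or NULL on failure. */
char *sketch_carve_bitmaps(sketch_fd_set_bits *fds, size_t size)
{
    char *bits = calloc(6, size);

    if (!bits)
        return NULL;
    fds->in      = (unsigned long *) bits;
    fds->out     = (unsigned long *) (bits + size);
    fds->ex      = (unsigned long *) (bits + 2 * size);
    fds->res_in  = (unsigned long *) (bits + 3 * size);
    fds->res_out = (unsigned long *) (bits + 4 * size);
    fds->res_ex  = (unsigned long *) (bits + 5 * size);
    return bits;
}
```

Because the size is always a multiple of sizeof(unsigned long), every slice stays correctly aligned, which is what allows the whole buffer to be copied into and out of the com packet with a single memcpy() of 6 * size bytes.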

5.4 The Split and Merge Routines

The split and merge routines are central to this protocol because they determine which file

descriptors go to the co-host and which stay on the host. A separate set is used for

the co-host side. The split routine goes through each of the host-side sets, which contain all file

descriptors requested from the user. There are three sets, read, write, and exceptions, which are

bitmaps (see Section 2.3). A set bit (1) means that a file descriptor was requested. For example,

if bit 3 was set in the read bitmap, it tells select() to check if file descriptor 3 is ready to be

read from. So if the split routine finds a set bit, it will look up that file descriptor number in the

file descriptor translation table in the protocol code [3]. If it finds a –1 in the table, then this

means that this file descriptor is intended for the host side and the host-side number of file

descriptors is incremented. If it finds a non-negative value, this means that this file descriptor is

destined for the co-host. It clears the bit in the host bitmap and sets the translated bit in the co-

host bitmap. Figure 5.1 shows an example. Suppose the translation table is the same as in


Figure 4.1. Bit 3 is set in the host-side read bitmap and file descriptor 3 is seen to translate to file

descriptor 7 on the co-host, so bit 3 is cleared in the host-side read bitmap and bit 7 is set in the

co-host-side read bitmap. Notice in this figure that the co-host side has 1024 file descriptors in it

because we do not know yet what the highest numbered file descriptor will be on the co-host.

The merge routine works in the opposite direction, but it does not clear any bits. The co-host file

descriptors are translated to the host side and the host side will then contain both sets of bits. In

the previous example, if bit 7 in the co-host read bitmap was ready for reading, then it would be

translated back to file descriptor 3 on the host and set in the host read bitmap to be returned to

the user. It is important to note that the fd_set_bits structure has ‘in’ and ‘out’ bitmaps. The

‘in’ bitmaps are used to see which file descriptors select() should look at while the ‘out’

bitmaps are copied back to user space to show which file descriptors select() found available.

The split routine works with the ‘in’ bitmaps and the merge routine works with the ‘out’ bitmaps.

The split and merge routines reside in fd_map.c so that they have easy access to the file

descriptor mappings between the host and co-host. This placement is also faster because the

routines can read the translation table directly instead of calling fd_map_get_ebsa_fd() or

fd_map_get_host_fd() every time they need a file descriptor translated.
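The split and merge behavior described above can be sketched in user space for a single bitmap. This is an illustration under assumptions, not the project's code: sk_bitmap, sk_split(), sk_merge(), and the two plain integer arrays standing in for the translation table are all hypothetical, and only one set (for example the read set) is handled instead of three.

```c
#include <limits.h>
#include <stddef.h>

#define SK_MAX_FDS   1024
#define SK_LONG_BITS (CHAR_BIT * sizeof(unsigned long))
#define SK_LONGS     (SK_MAX_FDS / SK_LONG_BITS)

typedef struct { unsigned long w[SK_LONGS]; } sk_bitmap;

int sk_isset(const sk_bitmap *b, int fd)
{
    return (int)((b->w[fd / SK_LONG_BITS] >> (fd % SK_LONG_BITS)) & 1UL);
}

void sk_set(sk_bitmap *b, int fd)
{
    b->w[fd / SK_LONG_BITS] |= 1UL << (fd % SK_LONG_BITS);
}

void sk_clr(sk_bitmap *b, int fd)
{
    b->w[fd / SK_LONG_BITS] &= ~(1UL << (fd % SK_LONG_BITS));
}

/* Move every translated descriptor from the host request map to the co-host
 * request map. host_to_ebsa[fd] is -1 for host-only descriptors. On return,
 * *host_n and *ebsa_n hold "highest descriptor + 1" for each side. */
void sk_split(int n, const int *host_to_ebsa, sk_bitmap *host, sk_bitmap *ebsa,
              int *host_n, int *ebsa_n)
{
    int hfd;

    *host_n = 0;
    *ebsa_n = 0;
    for (hfd = 0; hfd < n; hfd++) {
        int efd;

        if (!sk_isset(host, hfd))
            continue;
        efd = host_to_ebsa[hfd];
        if (efd < 0) {              /* stays on the host */
            *host_n = hfd + 1;
            continue;
        }
        sk_set(ebsa, efd);          /* hand over to the co-host */
        sk_clr(host, hfd);
        if (*ebsa_n <= efd)
            *ebsa_n = efd + 1;
    }
}

/* Copy every ready co-host descriptor back to its host number; nothing is
 * cleared. Descriptors with no translation are skipped. */
void sk_merge(int ebsa_n, const int *ebsa_to_host, const sk_bitmap *ebsa_res,
              sk_bitmap *host_res)
{
    int efd;

    for (efd = 0; efd < ebsa_n; efd++) {
        int hfd;

        if (!sk_isset(ebsa_res, efd))
            continue;
        hfd = ebsa_to_host[efd];
        if (hfd >= 0)
            sk_set(host_res, hfd);
    }
}
```

With host descriptor 3 mapped to co-host descriptor 7 and descriptor 5 left host-only, splitting a request for descriptors 3 and 5 leaves bit 5 on the host, sets bit 7 on the co-host, and yields n values of 6 and 8 respectively.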

[Figure: the host read bitmap and the co-host read bitmap, each shown as bits against fd #; the co-host bitmap extends to fd 1023.]

Figure 5.1 – Example of Splitting File Descriptors

Split:

void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
                          fd_set* remote_rfds, fd_set* remote_wfds,
                          fd_set* remote_efds)
{
    int hfd, efd; /* host/ebsa file descriptor */
    int off;
    fd_translation_table_t *cur;


1. The macros in select.h require the file descriptor set to be of type fd_set_bits* but the

type that is passed in from select_split_t is fd_set_bits. An ampersand (&) cannot be

used with the macros because they dereference pointers. So instead of changing the macros,

I created a new variable of type fd_set_bits* that the address of local->fds could be

stored in. The macros then use this new variable. Next, the n values are set to 0. I originally

was not going to have separate n values, but then I realized that, since the value has to be one

larger than the highest numbered file descriptor in any of the sets, the co-host side may have

larger file descriptor numbers than the host and would therefore need a separate n value. The

file descriptor table (fd_table) is then searched to find the current process’s mappings. If

they are not found, then only host-side file descriptors have been established, i.e. there are no

co-host file descriptors, so set the host side n to the value of n passed in and return.

    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;

    local->n = 0;
    remote->n = 0;
    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {
        cur = cur->next;
    }
    up(&table_lock);
    if (!cur) {
        local->n = n;
        return; /* no fd translation table */
    }

2. Zero out the fd_set’s so that we can populate them with the co-host’s file descriptors.

    FD_ZERO(remote_rfds);
    FD_ZERO(remote_wfds);
    FD_ZERO(remote_efds);

3. Algorithm for searching through the file descriptors (note: each bit refers to a file descriptor):

a) Find the bit and offset for the file descriptor on the host (same way as kernel in

do_select()). See Section 3.2 and Appendix C.

b) If that bit is not set in any of the sets, then continue to the next file descriptor.


c) Get the file descriptor mapping if that bit is set in any of the sets. If it is less than 0,

i.e. –1, it is not a translated file descriptor; it is only on the host. So update n on the

host and go on to the next file descriptor.

d) If it is translated, then find out which sets it is in and set the translated co-host file

descriptor in the correct fd_set(s) and clear the corresponding host file descriptor in

the host bitmaps (FD_SET and CLR).

e) If the co-host side n is not larger than the translated file descriptor, then set it to

that descriptor plus one. We check first because the mappings may not be in ascending

order, and n on the co-host must never be smaller than the largest file descriptor plus one.

    for (hfd = 0; hfd < n; hfd++) {
        unsigned long hbit = BIT(hfd);

        off = hfd / __NFDBITS;
        if (!(hbit & BITS(lfds, off))) {
            continue;
        }
        efd = cur->fd_host_ebsa[hfd];
        if (efd < 0) {
            local->n = hfd + 1;
            continue;
        }
        if (ISSET(hbit, __IN(lfds, off))) {
            FD_SET(efd, remote_rfds);
            CLR(hbit, __IN(lfds, off));
        }
        if (ISSET(hbit, __OUT(lfds, off))) {
            FD_SET(efd, remote_wfds);
            CLR(hbit, __OUT(lfds, off));
        }
        if (ISSET(hbit, __EX(lfds, off))) {
            FD_SET(efd, remote_efds);
            CLR(hbit, __EX(lfds, off));
        }
        if (remote->n <= efd)
            remote->n = efd + 1;
    }
}

Merge:

void merge_select_bitmaps(select_split_t* local, select_split_t* remote)
{
    int hfd, efd; /* host/ebsa file descriptor */
    int off_local, off_remote;
    fd_translation_table_t *cur;
    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;
    fd_set_bits* rfds = &remote->fds;

    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {
        cur = cur->next;
    }
    up(&table_lock);

NOTE: This should not happen because merge_select_bitmaps() would not be called unless

there was an fd_table. I was thinking that I should return an error here, but after talking to Dr.

Nico, we decided that there is nothing that the user can do about it and the call would return

without the co-host file descriptors set anyway, so they would at least see that they were not

ready. Another reason is that we want this to look as much like the regular system call as

possible, so an unusual return value would not work in this case. Since this error would indicate

a more significant device driver failure, an error message is printed and the function returns.

    if (!cur) {
        PRINT_ERROR("merge_select_bitmaps: fd translation table missing, this should not happen\n");
        return; /* no fd translation table */
    }

Algorithm for searching through the file descriptors:

a) Find the bit and offset for that file descriptor on the co-host (same way as kernel).

See Section 3.2 and Appendix C.

b) If that bit is not set in any of the sets, then continue to the next file descriptor.

c) Get the file descriptor mapping if that bit is set in any of the sets. If it is less than 0,

i.e. –1, it is an error in the translation table because all file descriptors on the co-host

should be in the translation table. Find the bit and offset for the host side.

d) Find out which sets it is in and set the translated host file descriptor in the correct

result bitmap.

    for (efd = 0; efd < remote->n; efd++) {
        unsigned long ebit = BIT(efd);
        unsigned long hbit;

        off_remote = efd / __NFDBITS;
        if (!(ebit & RES_BITS(rfds, off_remote))) {
            continue;
        }
        hfd = cur->fd_ebsa_host[efd];
        if (hfd >= 0) {
            hbit = BIT(hfd);
            off_local = hfd / __NFDBITS;
        } else {
            PRINT_ERROR("merge_select_bitmaps: fd translation missing, this should not happen\n");
            continue;
        }
        if (ISSET(ebit, __RES_IN(rfds, off_remote))) {
            SET(hbit, __RES_IN(lfds, off_local));
        }
        if (ISSET(ebit, __RES_OUT(rfds, off_remote))) {
            SET(hbit, __RES_OUT(lfds, off_local));
        }
        if (ISSET(ebit, __RES_EX(rfds, off_remote))) {
            SET(hbit, __RES_EX(lfds, off_local));
        }
    }
}

This implementation cannot handle more than 1024 file descriptors because the translation tables

have only 1024 entries. This matches the current default limit for Linux systems, but in the future this

scheme may need to be changed to be more robust.

The entire process from this chapter is laid out in Figure 5.2.

[Flowchart with Host and Co-Host columns. Each step lists the function and file it belongs to in brackets.]

1. Host: System call intercepted, sys_host_select() called. [syscalls_init(), syscalls_h.c]

2. Host: sys_host_select() gets the values from user space and sets up the host side bitmaps. [sys_host_select(), select_h.c]

3. Host: split_select_bitmaps() is called. It splits the file descriptors between the host and co-host and returns the maximum file descriptor plus one on both sides. [split_select_bitmaps(), fd_map.c]

4. Host file descriptors only:

   a) Host: do_host_select() is called. This is where each of the file descriptors’ status is determined and where sleeping can happen. It returns the number of file descriptors available. [do_host_select(), select_h.c]

5. Co-host file descriptors present:

   a) Host: Co-host side bitmaps are set up. n_sys_select() is called. [sys_host_select(), select_h.c]

   b) Host: n_sys_select() sets up a com packet and sends it to the co-host. [n_sys_select(), syscalls_h.c]

   c) Co-Host: The default handler on the co-host picks up the com packet and calls sys_ebsa_select() with the values from it. [default_handler(), handler_default.c]

   d) Co-Host: sys_ebsa_select() sets up a memory region for the co-host side bitmaps and calls do_ebsa_select(), which does the same thing as do_host_select() but on the co-host side. [sys_ebsa_select(), select_e.c]

   e) Co-Host: The default handler then sends the com packet with the updated data back to the host. [default_handler(), handler_default.c]

   f) Host: If there are host file descriptors it calls do_host_select(). When both sides are done, it adds both return values and returns. [n_sys_select(), syscalls_h.c]

   g) Host: merge_select_bitmaps() is called. It merges both sets of file descriptors onto the host side. [merge_select_bitmaps(), fd_map.c]

6. Host: Copy the new values back to user space and return the number of file descriptors available. [sys_host_select(), select_h.c]

Figure 5.2 – select() Flowchart of Second Design

6. The Solution to the Blocking Problem

By the end of summer 2001, I had succeeded in getting everything from Chapter 5 to work

correctly. select() worked in most situations and I was able to get the Lynx text-based web

browser to work using this setup. However, functionality still had to be added to wake up one

side when the other had returned. As we saw in the kernel’s version of do_select(),

schedule_timeout() is called to put the process to sleep for a specified period of time after

each iteration if no file descriptors are available, the timeout value is larger than 0, and no signals

are pending. This is implemented on both the co-host and host sides. Also, the way

n_sys_select() works, a com packet is sent to the co-host and then do_host_select() is

called. When it returns, it sleeps until the co-host com packet returns before moving on. With

this method, if one side returns early and is ready to move on, it must wait for the other to return.

This could take a while if it has to wait for the other side to time out. Furthermore, if there is no

timeout value, the wait could last forever if no file descriptors become available and no signal is

sent to the process.

This is exactly what happens when using Telnet to connect to a remote host. It waits for input on

stdin or a socket when the user is typing. Since, with our implementation, sockets would go to

the co-host, the select() call would be split between the host and co-host. When the user types

a character, the select() call will return on the host because it sees the character from stdin.

However, the co-host side is still waiting for input on the socket, which never happens and

causes select() to never return. At this point Telnet cannot receive input and hangs.

This is why there needs to be some sort of mechanism where the side that is ready to return

notifies the other to return as well. There has to be a mechanism to wake up the sleeping process

and ask it to return. The following sections detail how I investigated the problem and came up

with a design. Then the code is discussed along with some of the issues I had to look at during

the implementation.


6.1 The Investigation of Kernel Methods

I began with an extensive investigation of kernel methods to figure out how I could go about

solving this problem. During my reading of Linux Device Drivers [5], I found that the kernel

uses wait queues to put processes to sleep and wake them up. Wait queues are queues of

processes waiting for various events. Processes can go to sleep on a wait queue by calling

sleep_on() or interruptible_sleep_on(). ‘Interruptible’ means the sleep can be interrupted

by a signal. There are also wake_up() and wake_up_interruptible(), which wake up either all

processes on a wait queue or only those in interruptible sleeps, respectively. The process would

go to sleep on a wait queue to wait for an event to occur and then code in another part of the

driver, usually in an interrupt handler, would wake up the process when this event occurs.

After thinking of a way this could apply to my problem, I decided I could use my own wait

queue in do_host_select() and do_ebsa_select() and then call

interruptible_sleep_on_timeout() on the wait queue with the timeout values from

select(). This would ensure that the timeout value is used (schedule_timeout() is called

internally by these functions) and that I could wake it up early. But the problem of how to wake

it up still remained. I could not use interrupts to wake up the sleeping process because the

21554 has a limited interrupt mechanism that would already be used by the interrupt-driven

protocol. I needed to find something that could run separately from the process. Some type of

thread would work, but would also incur overhead because it would constantly be checking to

see if it has any processes to wake up, using up valuable CPU time. I finally found the answer in

task queues and tasklets. These enable execution of some task at a later time without using

interrupts. These can be run at various times in the kernel and can continually reschedule

themselves. The three predefined task queues are:

1. The scheduler queue- Runs in process context (as opposed to interrupt context) out of a

dedicated kernel thread called keventd. Sleeping is allowed since it is in a process

context.

2. The timer queue- Runs in interrupt context and runs at every clock tick.


3. The immediate queue- Runs via the bottom half mechanism, runs in interrupt time, and is

the fastest queue. Tasks should not be reregistered in this queue. It runs as soon as

possible, either on return from a system call or when schedule() is called.

Tasklets are another method used in the kernel. A tasklet is a new mechanism in the 2.4 kernel

and is a way to accomplish bottom half tasks, which are low priority interrupt space functions

that run when the kernel finds a convenient time [1]. In fact, in the 2.4 kernel, bottom halves run

as tasklets. Additionally, custom task queues can be defined which are not automatically

scheduled by the kernel. [5]

Taking these methods into account, I did some benchmarking tests using the Timestamp Counter

Register to see how fast these were. I ruled out custom queues because I wanted the kernel to

schedule when they ran. Tasklets, the scheduler queue, and the immediate queue all ran in

relatively the same amount of time, but the timer queue was about 20 times slower. Even on a

heavily loaded system, the timer queue is guaranteed to run at every clock tick, but I did not

think this would be an optimal solution due to the length of time between clock ticks. The tasks

using the scheduler queue can sleep, so I thought that one might be slow at times. I finally

decided to use tasklets because it was recommended that the immediate queue not be

rescheduled, and I definitely needed to reschedule my implementation. The only drawback of

tasklets is that they can only be used in kernel versions 2.4 and later.

6.2 The Design

The basic algorithm for the tasklet is to check a flag that will be set when the other side has

completed its part of the call. If this flag is set, then the sleeping process is woken up and runs

through one final iteration of the file descriptors before it returns. The tasklet is reregistered

each time it is run until the module is unloaded. The problem was finding a way to send a flag to

each side. I thought of two methods: Either send a com packet that tells the other side it has

finished or have a dedicated region in shared memory that contains this flag. For the interrupt-

driven protocol, which select() will eventually need to work with, a new thread would have to

be spawned if I sent a com packet. This would be an extremely slow process, so I decided that

the best method was to use shared memory.


This works if there is only one process calling select(). But since the process is sleeping,

other processes would get a chance to run and possibly call select().

wake_up_interruptible() would then wake up all processes on the wait queue. Since we do

not know which one the flag corresponds to, we cannot know which ones need to return and

which ones need to go back to sleep. A solution I came up with was to put in shared memory the

host process ID (PID) of the process that needs to return instead of passing a flag. The tasklet

would see that the PID in shared memory is non-negative and would wake up the processes on

the wait queue. Each process would wake up and, if its PID does not match, go back to sleep. If

the PID did match, the process would run through the file descriptors one last time and return.

The host PID could be used on both the host and co-host because its only use is to uniquely

identify a process that made a call to select(); it is not used with any PID functions such as

kill() or fork(). This value was the most convenient to use because it was already included in

the shared memory com packet header. One possible problem is that, if the wait queue

contained many processes, waking all of them could incur considerable overhead from context switches

between processes that just go back to sleep (the “thundering herd” problem [5]). The

solution I proposed for this problem was to have local individual wait queues for each

process that contain only that process in its queue. We could then have a global linked list

(queue) of structures that contain the wait queue address and PID for each process. The tasklet

would look through the linked list for the PID and wake up the process on that wait queue.

Yet another thing I had to consider was if multiple processes were ready to return. I thought of

possibly using an array in shared memory to contain the PID of each process, but this would use

up shared memory and the proper length would be hard to determine. So I decided to keep a

linked list (queue) of the processes that are ready to return. Fortunately, the Linux kernel had a

circular linked list implementation that I could use. Not only would the tasklet be responsible for

waking up sleeping processes that the other side labels as needing to return, but also it would be

responsible for putting into shared memory the processes on its side that are ready to return.

Figure 6.1 shows how this would be set up. Two variables would be created in shared memory:

one for the host to write PIDs to for the co-host to read (host_pid) and one for the co-host to

write PIDs to for the host to read (ebsa_pid).

[Figure: shared memory begins with host_pid and ebsa_pid, followed by the rest of shared memory. The Host and Ebsa sides each keep a Ready to Return linked list and a Sleeping linked list of processes. A process that becomes ready to return is added to its side's Ready to Return list. When host_pid or ebsa_pid names a process on the other side's Sleeping list, that process is woken up and removed from the list.]

Figure 6.1 – Shared Memory with Queues


The shared memory structure now looks like this (the data regions would have to shrink in order

to contain these variables):

    typedef struct {
        int  host_pid;
        int  host_stat;
        char host_data[(SHRMEM_SIZE/2)-8];
        int  ebsa_pid;
        int  ebsa_stat;
        char ebsa_data[(SHRMEM_SIZE/2)-8];
    } shrmem_t;
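The "-8" in each data array accounts for the two int fields that precede it, so that each side's region still occupies exactly half of shared memory. The check below is a small sketch of that arithmetic. It assumes 4-byte ints and uses an arbitrary SHRMEM_SIZE of 4096 for illustration only; the real value is not given here, and shrmem_layout_ok() is a hypothetical helper.

```c
#include <stddef.h>

#define SHRMEM_SIZE 4096   /* assumed value, for illustration only */

typedef struct {
    int  host_pid;
    int  host_stat;
    char host_data[(SHRMEM_SIZE / 2) - 8];
    int  ebsa_pid;
    int  ebsa_stat;
    char ebsa_data[(SHRMEM_SIZE / 2) - 8];
} shrmem_t;

/* Returns 1 when each half (two ints plus its data region) occupies exactly
 * SHRMEM_SIZE / 2 bytes, which holds when int is 4 bytes wide. */
int shrmem_layout_ok(void)
{
    return sizeof(int) == 4
        && offsetof(shrmem_t, host_data) == 8
        && offsetof(shrmem_t, ebsa_pid) == SHRMEM_SIZE / 2
        && sizeof(shrmem_t) == SHRMEM_SIZE;
}
```

On a platform where int is not 4 bytes, the constant 8 would need to become 2 * sizeof(int) for the halves to line up.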

This design would have to be implemented on both sides, so both would have to have wait

queues, tasklets, and linked lists. Following is a design I envisioned:

Host inside select()
    Go through one iteration of the file descriptors.
    If a file descriptor is available or the timeout is 0,
        If ebsa_pid == current->pid, then set ebsa_pid to -1 and return,
        Else, put on Ready to Return linked list, and return.
    (No file descriptors available and the timeout > 0)
    If ebsa_pid == current->pid, then set ebsa_pid to -1 and return,
    Else, set up wait queue, add to sleeping linked list, and sleep on the timeout value.
    When awoken, repeat above steps.

Host in bottom half
    If host_pid < 0, remove next process from the Ready to Return linked list
        If there is a process to remove,
            If ebsa_pid == Ready_to_Return PID
                Set co-host PID to –1 and return
            Else, put Ready_to_Return PID into host_pid
    If ebsa_pid >= 0, then wake up process in sleeping linked list with PID == ebsa_pid
        If process not found, check Ready to Return list
            If not found, return
            Else, remove from Ready to Return list, set ebsa_pid to –1, and return
        Else, remove from sleeping linked list and return
    Else return

Co-Host inside select()
    Go through one iteration of the file descriptors.
    If a file descriptor is available or the timeout is 0,
        If host_pid == current->pid, then set host_pid to -1 and return,
        Else, put on Ready to Return linked list, and return.
    (No file descriptors available and the timeout > 0)
    If host_pid == current->pid, then set host_pid to -1 and return,
    Else, set up wait queue, add to sleeping linked list, and sleep on the timeout value.
    When awoken, repeat above steps.

Co-Host in bottom half
    If ebsa_pid < 0, remove next process from the Ready to Return linked list
        If there is a process to remove,
            If host_pid == Ready_to_Return PID
                Set host PID to –1 and return
            Else, put Ready_to_Return PID into ebsa_pid
    If host_pid >= 0, then wake up process in sleeping linked list with PID == host_pid
        If process not found, check Ready to Return list
            If not found, return
            Else, remove from Ready to Return list, set host_pid to –1, and return
        Else, remove from sleeping linked list and return
    Else return

See Appendix A for a state diagram of this design. –1 is used in host_pid and ebsa_pid to

indicate that no PID is in there and that the next PID off of the ready-to-return linked list can be

copied there.
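The host bottom-half steps above can be simulated in user space to see how the two shared-memory words and the two lists interact. This is only a sketch under assumptions: fixed-size arrays stand in for the kernel linked lists, returning a PID stands in for calling wake_up_interruptible() on that process's wait queue, and every name (sk_pid_list, sk_host_bottom_half(), and so on) is hypothetical.

```c
#define SK_MAX 16

/* Array-backed stand-in for the sleeping and ready-to-return linked lists. */
typedef struct {
    int pids[SK_MAX];
    int count;
} sk_pid_list;

int sk_list_push(sk_pid_list *l, int pid)
{
    if (l->count >= SK_MAX)
        return 0;
    l->pids[l->count++] = pid;
    return 1;
}

/* Remove and return the oldest PID, or -1 if the list is empty. */
int sk_list_pop_front(sk_pid_list *l)
{
    int pid, i;

    if (l->count == 0)
        return -1;
    pid = l->pids[0];
    for (i = 1; i < l->count; i++)
        l->pids[i - 1] = l->pids[i];
    l->count--;
    return pid;
}

/* Remove pid if present; returns 1 if it was found, 0 otherwise. */
int sk_list_remove(sk_pid_list *l, int pid)
{
    int i, j;

    for (i = 0; i < l->count; i++) {
        if (l->pids[i] == pid) {
            for (j = i + 1; j < l->count; j++)
                l->pids[j - 1] = l->pids[j];
            l->count--;
            return 1;
        }
    }
    return 0;
}

/* One pass of the host bottom half. *host_pid and *ebsa_pid model the two
 * shared-memory words. Returns the PID to wake up, or -1 if there is none. */
int sk_host_bottom_half(int *host_pid, int *ebsa_pid,
                        sk_pid_list *ready, sk_pid_list *sleeping)
{
    if (*host_pid < 0) {                    /* slot free: publish next PID */
        int pid = sk_list_pop_front(ready);

        if (pid >= 0) {
            if (*ebsa_pid == pid) {         /* both sides already finished */
                *ebsa_pid = -1;
                return -1;
            }
            *host_pid = pid;
        }
    }
    if (*ebsa_pid >= 0) {                   /* co-host asks a process to return */
        if (sk_list_remove(sleeping, *ebsa_pid))
            return *ebsa_pid;               /* woken process clears ebsa_pid itself */
        if (sk_list_remove(ready, *ebsa_pid))
            *ebsa_pid = -1;                 /* it was already on its way out */
    }
    return -1;
}
```

In this sketch the woken process is left to reset ebsa_pid to -1 itself, matching the "inside select()" steps, while the bottom half resets it only when the named process is found on the ready-to-return list or was just taken off it.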

6.3 The Code

Following is the code for the design presented in Section 6.2. Since the majority of the code is

duplicated on the host and the co-host, the focus of this walkthrough will be on the host side.

The co-host side can easily be seen by replacing ‘host’ with ‘ebsa’ in the variables and functions.

Differences between the two sides will be noted.

The following include files are needed for wait queues, sleeping and waking up processes,

linked lists, and tasklets:

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/list.h>
    #include <linux/interrupt.h>

Because of dependency problems, interrupt.h would not compile without linux/spinlock.h

on the co-host or without asm/system.h on the host, so those files are included as well. Additionally,

com.h is needed to access the shared memory structure, along with ebsa.h to access the shared

memory pointer (g_shrmem) on the co-host and host.h to access the shared memory pointer

(module_info) on the host.

Global variables are declared for each side in select_h.c and select_e.c. The ready-to-return

and sleeping queues are initialized along with the tasklet. tasklet_host_sched_flag is used to

tell the tasklet when to stop rescheduling itself. This flag was needed because I could not get the

tasklet to stop with tasklet_kill() by itself. Setting this value to 1 when we want it to

reschedule and setting it to 0 when we do not want it to be rescheduled anymore allows the

tasklet to start and stop smoothly.

    LIST_HEAD(rr_list_head);
    LIST_HEAD(sleep_list_head);
    DECLARE_TASKLET(select_host_tasklet, select_host_tasklet_func, 0);
    int tasklet_host_sched_flag;

The shared memory area is initialized and the tasklet is started when syscalls_init() is called

in syscalls_h.c and syscalls_e.c.

    mem = (shrmem_t*)module_info.shared_mem_addr;
    mem->host_pid = -1;
    mem->ebsa_pid = -1;
    tasklet_host_sched_flag = 1;
    tasklet_schedule(&select_host_tasklet);

Likewise, the tasklet is stopped in syscalls_cleanup().

    tasklet_host_sched_flag = 0;
    tasklet_kill(&select_host_tasklet);

Two structures are used for the entries in the sleeping and ready-to-return linked lists. The

sleeping list entry has a PID, a pointer to the wait queue it can wake up on, and a struct

list_head that is used for putting it in and taking it out of the linked list. The ready-to-return

list entry is the same except that it has no wait queue. typedef struct { pid_t pid; wait_queue_head_t* wq; struct list_head sleep_list_entry; } sleep_list_t; typedef struct { pid_t pid; struct list_head rr_list_entry;

49

} rr_list_t; In select_h.c, do_host_select() has added functionality over the original do_select() in

the kernel. It was modified to include a wait queue to sleep on, shared memory checking, and two queues. It now also checks whether ebsa_pid in shared memory is equal to the current PID before going to sleep and, if the process does go to sleep, the tasklet will wake it up if and when ebsa_pid contains this PID. Also, when this function returns, it now puts its PID on the ready-to-return queue so that it can be put in shared memory by the tasklet. ebsa_flag is passed in and indicates whether there are co-host-side file descriptors. If there are not, then this function runs the same as it would in the regular Linux kernel. do_ebsa_select() does not have a corresponding host_flag because this flag is used on the host to bypass all the added functionality when there are no file descriptors destined for the co-host.

1. Added to the declaration list are a pointer to shared memory and pointers to an entry in each of

the queues. A wait queue is declared locally so that this process will be the only one on it. When wake_up_interruptible() is called on the wait queue, only this process will awaken.

    int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag)
    {
        poll_table table, *wait;            /* list of wait queues */
        int retval, i, off;                 /* off - u_long offset */
        long __timeout = *timeout;
        shrmem_t* mem = NULL;               /* shared memory */
        rr_list_t* rr_entry = NULL;         /* ready-to-return queue */
        sleep_list_t* sleep_entry = NULL;   /* sleeping queue */
        DECLARE_WAIT_QUEUE_HEAD(select_sleep);

        if (ebsa_flag) {
            mem = (shrmem_t*)module_info.shared_mem_addr;
        }

2. The next section is the same as in the kernel except that a negative return value goes to

ready_to_return rather than just returning so that the process can be added to the ready-to-return queue.

        read_lock(&current->files->file_lock);
        retval = max_select_fd(n, fds);
        read_unlock(&current->files->file_lock);

        if (retval < 0)
            goto ready_to_return;
        n = retval;

        poll_initwait(&table);
        wait = &table;
        if (!__timeout)
            wait = NULL;
        retval = 0;
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            for (i = 0; i < n; i++) {
                unsigned long bit = BIT(i);     /* fd bit in u_long */
                unsigned long mask;             /* poll mask */
                struct file *file;              /* file structure */

                off = i / __NFDBITS;
                if (!(bit & BITS(fds, off)))
                    continue;
                file = fget(i);
                mask = POLLNVAL;
                if (file) {
                    mask = DEFAULT_POLLMASK;
                    if (file->f_op && file->f_op->poll)
                        mask = file->f_op->poll(file, wait);
                    fput(file);
                }
                if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                    SET(bit, __RES_IN(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                    SET(bit, __RES_OUT(fds,off));
                    retval++;
                    wait = NULL;
                }
                if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                    SET(bit, __RES_EX(fds,off));
                    retval++;
                    wait = NULL;
                }
            }
            wait = NULL;
            if (retval || !__timeout || signal_pending(current))
                break;
            if(table.error) {
                retval = table.error;
                break;
            }

3. Here is where the majority of the changes occur. We only get to this code if the above if

statements do not break out of the loop. If ebsa_pid is the same as the current PID, then we know that the other side has finished and we break out of the loop. Otherwise, a sleep queue entry is created with the current PID and a pointer to the wait queue for this process. The entry is then added to the list and the process goes to sleep on the wait queue for up to the period of time specified by timeout. When it wakes up, the entry is deleted from the list and freed, and the process loops again. If there are no co-host file descriptors, then schedule_timeout() is called instead and no wait queue is used. This process continues until one of the if statements causes the loop to break.

            if (ebsa_flag) {
                if (mem->ebsa_pid == current->pid) {
                    break;
                }
                sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
                if (!sleep_entry) {
                    retval = -ENOMEM;
                    break;
                }
                sleep_entry->pid = current->pid;
                sleep_entry->wq = &select_sleep;
                list_add_tail(&sleep_entry->sleep_list_entry, &sleep_list_head);
                __timeout = interruptible_sleep_on_timeout(sleep_entry->wq, __timeout);
                list_del(&sleep_entry->sleep_list_entry);
                kfree(sleep_entry);
            } else {
                /* same as regular kernel if no EBSA fd's */
                __timeout = schedule_timeout(__timeout);
            }
        }
        current->state = TASK_RUNNING;
        poll_freewait(&table);
        *timeout = __timeout;

4. When the function has finished and is ready to return, it needs to add itself to the ready-to-return queue if the other side has not indicated that it is ready to return. First ebsa_pid is

checked to see if it contains the current PID; if it does, then ebsa_pid is set to -1 and we return. This means that the co-host side has indicated that it is done. Otherwise, we allocate a new ready-to-return list entry, copy the current PID into it, and add it to the queue. Then we return and the tasklet will pick up the rest of the work.

    ready_to_return:
        if (ebsa_flag) {
            if (mem->ebsa_pid == current->pid) {
                mem->ebsa_pid = -1;
            } else {
                rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
                if (!rr_entry) {
                    /* seems to be most practical return value */
                    retval = -ENOMEM;
                } else {
                    rr_entry->pid = current->pid;
                    list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
                }
            }
        }
        return retval;
    }

The tasklet maintains the two queues and manages host_pid and ebsa_pid in shared memory.

The first thing the host-side tasklet does is check host_pid. If there is currently no PID in

host_pid (the value is –1), then the next process in the ready-to-return queue is taken off the

queue and the PID of that process is examined. If the PID matches the PID in ebsa_pid, then

that means that the co-host side has also returned, so there is no need to write this PID to

host_pid. ebsa_pid is set to –1 and the tasklet exits. Otherwise, the PID is put into host_pid.

The next step is to check ebsa_pid. The tasklet would have gone straight to this step if there

were a PID in host_pid. If ebsa_pid has no PID in it, then the co-host side has nothing ready

to return, so the tasklet exits. Otherwise, it checks if the process with the same PID is in the

sleeping queue. If it is, it is removed from the queue and woken up, at which point the tasklet

will exit. If it is not found in the sleeping queue, the tasklet will see if it has already returned and

is possibly in the ready-to-return queue. If it is, then it is removed from this queue and ebsa_pid

is set to –1 to indicate that the next PID can be put into ebsa_pid by the co-host side. If it is not,

ebsa_pid is not changed and the tasklet exits. Each time the tasklet exits, it reschedules itself

until it sees that the value of tasklet_host_sched_flag is 0.
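The tasklet pass described above can be modeled in user space as a small state machine. The following is a minimal sketch under stated assumptions, not the project's code: fixed-size arrays stand in for the kernel linked lists, and host_model, pid_queue, host_tasklet_pass(), and the woken_pid field are hypothetical names introduced for illustration. The sketch follows the code listing in this chapter, in which a woken process deletes its own sleeping-queue entry, so the model only records which PID was woken.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PID  (-1)
#define QMAX    8

/* Tiny fixed-size PID queue standing in for the kernel linked lists. */
struct pid_queue {
    int pids[QMAX];
    size_t len;
};

/* Hypothetical model of the host-side state. */
struct host_model {
    int host_pid;               /* published by the host tasklet */
    int ebsa_pid;               /* published by the co-host side */
    struct pid_queue rr;        /* ready-to-return queue */
    struct pid_queue sleeping;  /* sleeping queue */
    int woken_pid;              /* PID woken by the last pass, or NO_PID */
};

void host_model_init(struct host_model *m)
{
    m->host_pid = NO_PID;
    m->ebsa_pid = NO_PID;
    m->rr.len = 0;
    m->sleeping.len = 0;
    m->woken_pid = NO_PID;
}

/* Append pid to q; returns 0 on success, -1 if the queue is full. */
int queue_push(struct pid_queue *q, int pid)
{
    if (q->len == QMAX)
        return -1;
    q->pids[q->len++] = pid;
    return 0;
}

/* Remove the entry at index i, preserving the order of the rest. */
void queue_remove_at(struct pid_queue *q, size_t i)
{
    for (; i + 1 < q->len; i++)
        q->pids[i] = q->pids[i + 1];
    q->len--;
}

/* Return the index of pid in q, or -1 if it is absent. */
int queue_find(const struct pid_queue *q, int pid)
{
    for (size_t i = 0; i < q->len; i++)
        if (q->pids[i] == pid)
            return (int)i;
    return -1;
}

/* One pass of the host-side tasklet, following the description above. */
void host_tasklet_pass(struct host_model *m)
{
    int idx;

    m->woken_pid = NO_PID;

    /* Publish the next ready-to-return PID if host_pid is free. */
    if (m->host_pid < 0 && m->rr.len > 0) {
        int pid = m->rr.pids[0];
        queue_remove_at(&m->rr, 0);
        if (pid == m->ebsa_pid) {       /* both sides have returned */
            m->ebsa_pid = NO_PID;
            return;
        }
        m->host_pid = pid;
    }

    /* React to whatever the co-host published in ebsa_pid. */
    if (m->ebsa_pid < 0)
        return;
    if (queue_find(&m->sleeping, m->ebsa_pid) >= 0) {
        m->woken_pid = m->ebsa_pid;     /* stands in for wake_up_interruptible() */
        return;
    }
    idx = queue_find(&m->rr, m->ebsa_pid);
    if (idx >= 0) {
        queue_remove_at(&m->rr, (size_t)idx);
        m->ebsa_pid = NO_PID;
    }
}
```

Keeping each pass free of blocking calls mirrors the constraint that a tasklet cannot sleep; rescheduling between passes is left out of this model.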

1. The declarations include pointers used to navigate through the lists, a pointer to shared memory, two pointers to the structures that contain the information needed for each list element (along with the list_head pointers to indicate its position within a list), and two flags that tell if a PID was found in one of the lists. The parameter passed into the tasklet is not used.

    void select_host_tasklet_func(unsigned long ptr)
    {
        struct list_head* rr_list_ptr;
        struct list_head* sleep_list_ptr;
        shrmem_t* mem;
        sleep_list_t* sleep_entry = NULL;
        rr_list_t* rr_entry = NULL;
        int sleep_flag = 0;
        int rr_flag = 0;

2. mem is set up to be a pointer to shared memory. Then if host_pid is less than 0 and the

ready-to-return queue is not empty, then the next entry from the queue (the one right after the header) is deleted from the list. The macro list_entry returns a pointer to the structure that contains a given list_head. This allows the values of this entry to be retrieved. If the PID of this entry is the same as the value in ebsa_pid, then ebsa_pid is reset to -1 and the tasklet exits. Otherwise, the PID of this entry is put into host_pid.

        mem = (shrmem_t*)module_info.shared_mem_addr;
        if ((mem->host_pid < 0) && (!list_empty(&rr_list_head))) {
            rr_entry = list_entry(rr_list_head.next, rr_list_t, rr_list_entry);
            list_del(rr_list_head.next);
            if (rr_entry->pid == mem->ebsa_pid) {
                kfree(rr_entry);
                goto reset_ebsa;
            } else {
                mem->host_pid = rr_entry->pid;
                kfree(rr_entry);
            }
        }

3. If ebsa_pid is less than 0, then the tasklet stops processing and exits. Otherwise, it searches

the sleeping queue for the PID found in ebsa_pid. The macro list_for_each works like a for loop. It goes through the entire list starting from the head. sleep_list_ptr points to the current position in the list. For each position in the list, the PID is checked against the value in ebsa_pid. If a match is found, then that process is woken up and the tasklet exits.

        if (mem->ebsa_pid < 0) {
            goto tasklet_complete;
        }
        list_for_each (sleep_list_ptr, &sleep_list_head) {
            sleep_entry = list_entry(sleep_list_ptr, sleep_list_t, sleep_list_entry);
            if (sleep_entry->pid == mem->ebsa_pid) {
                wake_up_interruptible(sleep_entry->wq);
                sleep_flag = 1;
                break;
            }
        }
        if (sleep_flag) {
            goto tasklet_complete;
        }

4. If the entry is not found in the sleeping queue, then the ready-to-return queue is checked. If a

match occurs, the entry is removed from the ready-to-return queue and ebsa_pid is reset to -1. The tasklet then exits. If the value is not found in the queue, then the tasklet exits without setting ebsa_pid to -1.

        list_for_each (rr_list_ptr, &rr_list_head) {
            rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
            if (rr_entry->pid == mem->ebsa_pid) {
                list_del(rr_list_ptr);
                rr_flag = 1;
                kfree(rr_entry);
                break;
            }
        }
        if (!rr_flag) {
            goto tasklet_complete;
        }

    reset_ebsa:
        mem->ebsa_pid = -1;

5. Before exiting, the tasklet checks the value of tasklet_host_sched_flag to see if it should

continue rescheduling itself. If the flag is clear, then it does not reschedule itself. If the flag is set, it is rescheduled to run again.

    tasklet_complete:
        if (tasklet_host_sched_flag) {
            tasklet_schedule(&select_host_tasklet);
        }
    }

In syscalls_h.c, n_sys_select() makes sure the values in shared memory are reset to -1 if

they contain the current PID before the select() call completes. This ensures that no PID is left in shared memory for the tasklet to look for but never find. This case occurs when there are only co-host file descriptors. The co-host side puts the PID value in shared memory when it returns, but the host side is never able to reset it because do_host_select() was never run.

    mem = (shrmem_t*)module_info.shared_mem_addr;
    if (mem->host_pid == current->pid) {
        mem->host_pid = -1;
    }
    if (mem->ebsa_pid == current->pid) {
        mem->ebsa_pid = -1;
    }

6.4 Other Issues

There were two other issues that had to be looked at during this design and implementation.

First, the issue of whether to use semaphores surfaced. I believed that race conditions could

occur when writing to and reading from shared memory or when adding to and deleting from a

linked list. This was a critical issue because, if I wanted to use tasklets, I could not sleep in

them. Fortunately, my design is such that access to shared memory is controlled; only the tasklet

is able to write the PID value to it. But each process reads from it and can set it to –1. Also, the

kernel implementation of linked lists was thought to be free of race conditions because I could

not find any literature or examples that show using semaphores with these lists. It was thought

that, since Linux is a non-preemptive kernel, the process would not go to sleep when

manipulating the lists, which makes this free of race conditions. This issue was never fully

resolved and semaphores were never implemented in the code. There have been no problems yet, but the comments in the code indicate places that I thought could pose a potential race condition.
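The kind of race in question is a check-then-act sequence, such as testing whether a PID slot still holds the current PID and then resetting it. The following is a minimal user-space sketch of how such a step can be made indivisible with C11 atomics. It is an illustration only: pid_slot_t, slot_publish(), and slot_release_if_owner() are hypothetical names, C11 atomics are not what the 2.4 kernel code uses, and whether any atomic primitive would behave correctly across the PCI shared memory is an open assumption that this report does not establish.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_PID (-1)

/* Hypothetical slot standing in for host_pid or ebsa_pid. */
typedef struct {
    atomic_int pid;
} pid_slot_t;

/* Publish pid only if the slot is currently empty; true on success. */
bool slot_publish(pid_slot_t *slot, int pid)
{
    int expected = NO_PID;
    return atomic_compare_exchange_strong(&slot->pid, &expected, pid);
}

/* Reset the slot to NO_PID only if it still holds pid; true on success.
 * The compare and the store happen as one indivisible step. */
bool slot_release_if_owner(pid_slot_t *slot, int pid)
{
    int expected = pid;
    return atomic_compare_exchange_strong(&slot->pid, &expected, NO_PID);
}
```

With plain reads and writes, another writer could change the slot between the comparison and the reset; the compare-and-exchange form closes that window for a single slot, but it does not by itself protect the linked lists.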

The other issue deals with a cache problem I was having on the co-host during the final testing of

this implementation. I was finding that occasionally I would write a value to shared memory but

the previous value it was supposed to overwrite would still be in there. This happened only

when running two or more processes concurrently that called select(). I tried using volatile

and atomic variables, but neither worked. I saw that the difference between when it would work

and when it would fail depended on the order in which the calls occurred. Since within the

protocol there is a lot of sleeping, the order depends on when processes are scheduled. After not

coming up with a solution for a while, Max Roth and Jason Hatashita found out that it was a

problem with the EBSA’s cache. I do not understand the full details of the problem, but it has to

do with the cache updating adjacent memory locations in shared memory. To work around this issue until it is fixed, I had to space out host_pid and ebsa_pid in shared memory like so:

    typedef struct {
        pid_t host_pid;                 /* for use with select() */
        unsigned long spacer1[10];
        pid_t ebsa_pid;                 /* for use with select() */
        unsigned long spacer2[10];
        int host_stat;
        char host_data[SHRMEM_DATA_SIZE];
        int ebsa_stat;
        char ebsa_data[SHRMEM_DATA_SIZE];
    } shrmem_t;

This allows the two variables to be far enough apart so that they are updated in separate caches.
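The effect of the spacers can be checked with offsetof. The following is a minimal sketch, assuming a structure with the same field order as shrmem_t; shrmem_sketch_t, pid_gap_bytes(), and the other helper names are hypothetical, the value of SHRMEM_DATA_SIZE is a placeholder, and the 32-byte line size is an assumption used for illustration rather than a figure taken from this report.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* pid_t */

#define SHRMEM_DATA_SIZE   64  /* placeholder; the real value is not shown here */
#define ASSUMED_LINE_BYTES 32  /* assumption: cache line size of the co-host CPU */

/* Same field order as the shrmem_t shown above. */
typedef struct {
    pid_t host_pid;
    unsigned long spacer1[10];
    pid_t ebsa_pid;
    unsigned long spacer2[10];
    int host_stat;
    char host_data[SHRMEM_DATA_SIZE];
    int ebsa_stat;
    char ebsa_data[SHRMEM_DATA_SIZE];
} shrmem_sketch_t;

/* Bytes between the start of host_pid and the start of ebsa_pid. */
size_t pid_gap_bytes(void)
{
    return offsetof(shrmem_sketch_t, ebsa_pid)
         - offsetof(shrmem_sketch_t, host_pid);
}

/* Bytes between ebsa_pid and the next live field, host_stat. */
size_t ebsa_to_stat_gap_bytes(void)
{
    return offsetof(shrmem_sketch_t, host_stat)
         - offsetof(shrmem_sketch_t, ebsa_pid);
}

/* Nonzero if the two PID fields cannot fall within one assumed line. */
int pids_on_separate_lines(void)
{
    return pid_gap_bytes() >= ASSUMED_LINE_BYTES;
}
```

With ten unsigned longs of padding the gap is at least 44 bytes when pid_t and unsigned long are both 4 bytes wide, which exceeds the assumed line size; the exact figure depends on the compiler's alignment rules.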

7. Conclusion

The select() system call is now fully functional and Telnet and Lynx are able to run through

the protocol as a result of this. It was a very long and challenging process because I had to do a

lot of research and learning in order to figure out how the protocol worked, how select()

worked inside the kernel, and how to use mechanisms available in the kernel to solve the

problems we were having with the implementation. A lot of the process was trial and error. If

one idea did not work, we would try another. For the first design, I continually added

functionality on top of what I already had, hoping that somehow it was going to work, but only

to find out later that this design would be impossible to implement. So I started over from the

beginning. Once select() was able to run successfully on the host and co-host, we then had to

deal with the blocking issue. This project definitely took a lot of resilience and determination, as new issues that had to be considered kept coming up. I gained a greater

respect for those who designed and continuously code the Linux kernel, as I now see what it

takes to be a true Linux hacker.

There are a number of future work items that need to be done. First, this code needs to be ported

to the new version of the protocol. Unfortunately, the current design for the interrupt-driven

protocol of one file descriptor per thread would not work for select(), as it has to be able to

handle multiple file descriptors. Next, when this code is ported over, the possibility of using

interrupts rather than tasklets should be looked at. Tasklets are used because the polling protocol

did not have interrupts, but interrupts have the potential to be much faster. Also, the ultimate

goal would be to someday be able to get the X Window System working with the protocol. After

all, the reason select() was so important to implement in the first place was because it was

needed to run Netscape. In order to get closer to this goal, the Virtual File System issue and the

loopback issue must be addressed. First, with regard to the Virtual File System issue, when a

socket is created with our protocol, the default file operations for sockets are overridden with our

own implementation of the file operation functions. Currently all these do is print an error

message and return -EFAULT. A couple of these messages occur when attempting to start the X

Window System with our protocol, so resolving this issue may help in getting the X Window

System to work. Secondly, the X Window System uses sockets with the loopback interface to communicate between its client and server, so getting the loopback interface to work with the

protocol is critical. Finally, continuing to implement system calls with the protocol is critical to

adding more functionality to the platform. This increased functionality will help the project to

realize the full potential of the system and will enable more research into its capabilities.

References

[1] Bovet, Daniel P. and Marco Cesati. Understanding the Linux Kernel. 1st edition. Sebastopol, CA: O’Reilly, 2001.

[2] McClelland, Mark. “Linux PCI Shared Memory Device Drivers for the Cal Poly Intelligent Network Interface Card.” Senior Project, California Polytechnic State University, San Luis Obispo, June 2001.

[3] McCready, Robert. “Design and Development of the CiNIC Host/Co-Host Protocol.” Senior Project, California Polytechnic State University, San Luis Obispo, February 2002.

[4] Roth, Max. “Design and Implementation for the CiNIC Device Driver v2.0.” Senior Project, California Polytechnic State University, San Luis Obispo, June 2002.

[5] Rubini, Alessandro and Jonathan Corbet. Linux Device Drivers. 2nd edition. Sebastopol, CA: O’Reilly, 2001.

Appendix A – State Diagrams for the Solution to the Blocking Problem

Host

Figure A.1 – do_select() on Host

[State diagram not reproduced; the labels recovered from the figure are listed below.]

States: From sys_host_select(); Check all file descriptors in bitmaps to see if any are ready; Check ebsa_pid (appears twice); Put process on sleeping linked list (using wait queues); Sleep; Remove from sleeping linked list; Set ebsa_pid to -1; Put on Ready to Return linked list; Return to sys_host_select().

Transition labels: 1 or more file descriptors available or timeout = 0; No file descriptors available and timeout != 0; ebsa_pid = current_pid; ebsa_pid != current_pid; Timeout not expired and process is not scheduled; Awoken by bottom half, a ready file descriptor, or an expired timeout.

Figure A.2 – Tasklet on Host

[State diagram not reproduced; the labels recovered from the figure are listed below.]

States: Check host_pid; Remove next process in Ready to Return linked list from the list; Compare ebsa_pid to pid of removed process; Put pid of process into host_pid; Check ebsa_pid; Find process in sleeping linked list with pid = ebsa_pid; Wake up process; Find process in Ready to Return linked list; Remove from Ready to Return linked list; Set ebsa_pid to -1; return.

Transition labels: host_pid < 0; host_pid >= 0; No processes to remove; Process removed; Pid’s are equal; Pid’s not equal; ebsa_pid >= 0; ebsa_pid < 0; Process is found; Process not found.

Co-Host

Figure A.3 – do_select() on Co-Host

[State diagram not reproduced; the labels recovered from the figure are listed below.]

States: From sys_ebsa_select(); Check all file descriptors in bitmaps to see if any are ready; Check host_pid (appears twice); Put process on sleeping linked list (using wait queues); Sleep; Remove from sleeping linked list; Set host_pid to -1; Put on Ready to Return linked list; Return to sys_ebsa_select().

Transition labels: 1 or more file descriptors available or timeout = 0; No file descriptors available and timeout != 0; host_pid = current_pid; host_pid != current_pid; Timeout not expired and process is not scheduled; Awoken by bottom half, a ready file descriptor, or an expired timeout.

Figure A.4 – Tasklet on Co-Host

[State diagram not reproduced; the labels recovered from the figure are listed below.]

States: Check ebsa_pid; Remove next process in Ready to Return linked list from the list; Compare host_pid to pid of removed process; Put pid of process into ebsa_pid; Check host_pid; Find process in sleeping linked list with pid = host_pid; Wake up process; Find process in Ready to Return linked list; Remove from Ready to Return linked list; Set host_pid to -1; return.

Transition labels: ebsa_pid < 0; ebsa_pid >= 0; No processes to remove; Process removed; Pid’s are equal; Pid’s not equal; host_pid >= 0; host_pid < 0; Process is found; Process not found.

Appendix B – Test Plans

Test plan run on September 11, 2001, before solution to blocking problem (up through Chapter 5).

Test | Expected Result | Pass/Fail
File descriptors on both host and co-host | Return with file descriptors | Pass
File descriptors on co-host only | Return with file descriptors | Pass
File descriptors on host only | Return with file descriptors | Pass
File descriptors on both, co-host blocks | Return with host file descriptors after timeout period * | Pass*
File descriptors on both, host blocks | Return with co-host file descriptors after timeout period * | Pass*
File descriptors on both, co-host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, timeout is 0 | Return with file descriptors | Pass
No file descriptors, only timeout | Sleep for given period of time | Pass
First argument n < 0 | Return -EINVAL | Pass
First argument n > 1024 | n changed to 1024 | Pass
Give invalid file descriptor | Return -EBADF | Pass
Bad memory address | Return -EFAULT | Pass
File descriptors 32 numbers or more apart | Return with file descriptors | Pass

* = Blocks now, but will not once blocking problem is solved.

Table B.1 – Test Plan Run on 9/11/01

Test plan run on February 22, 2002, after solution to blocking problem (up through Chapter 6).

Test | Expected Result | Pass/Fail
File descriptors on both host and co-host | Return with file descriptors | Pass
File descriptors on co-host only | Return with file descriptors | Pass
File descriptors on host only | Return with file descriptors | Pass
File descriptors on both, co-host blocks | Return with host file descriptors immediately | Pass
File descriptors on both, host blocks | Return with co-host file descriptors immediately | Pass
File descriptors on both, co-host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, host has >32 file descriptors | Return with file descriptors | Pass
File descriptors on both, timeout is 0 | Return with file descriptors | Pass
No file descriptors, only timeout | Sleep for given period of time | Pass
First argument n < 0 | Return -EINVAL | Pass
First argument n > 1024 | n changed to 1024 | Pass
Give invalid file descriptor | Return -EBADF | Pass
Bad memory address | Return -EFAULT | Pass
File descriptors 32 numbers or more apart | Return with file descriptors | Pass
Multiple select() calls at one time | Return with file descriptors | Fail*
Multiple select() calls that block on either side | Return with file descriptors | Fail*
Errors on both | Host error is returned | Pass
Multiple co-host file descriptors | Return with file descriptors | Pass

* These failed because of the caching problem (Section 6.4). Once this was fixed, these passed.

Table B.2 – Test Plan Run on 2/22/02

Appendix C – select() Bit Macros

Since select() deals with three sets of bitmaps, there are quite a number of bit analyzing and

manipulation macros it uses. These can be very confusing to understand, so below I will briefly

outline how each one works.

FDS_BYTES is used in sys_host_select() and is defined in linux/poll.h. It is used to determine how many bytes are needed for a given number of bits (nr):

    #define FDS_BITPERLONG (8*sizeof(long))
    #define FDS_LONGS(nr) (((nr)+FDS_BITPERLONG-1)/FDS_BITPERLONG)
    #define FDS_BYTES(nr) (FDS_LONGS(nr)*sizeof(long))

If a long is 4 bytes (i386), then FDS_BITPERLONG gives the number of bits in one long, which

would be 32. FDS_LONGS gives the number of longs a given number of bits would take up. For

example, if nr = 60, then FDS_LONGS would be 2. FDS_BYTES then gives the number of bytes for

a given number of bits. If nr is the same value as above, then FDS_BYTES would be 8. This

value is used by kmalloc() as the size of the region to allocate for each bitmap (there are 6

bitmaps altogether, 3 sets of ‘in’ bitmaps and 3 sets of ‘out’ bitmaps).
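The arithmetic above can be checked with a short program. The following is a minimal sketch in which uint32_t stands in for the 4-byte long of the i386 case, so that the numbers match the text on any machine; the FDS32_* macros and the fds32_*() helpers are hypothetical names that copy the shape of the kernel macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* uint32_t stands in for the 4-byte long discussed above. */
typedef uint32_t long32_t;

#define FDS32_BITPERLONG  (8 * sizeof(long32_t))
#define FDS32_LONGS(nr)   (((nr) + FDS32_BITPERLONG - 1) / FDS32_BITPERLONG)
#define FDS32_BYTES(nr)   (FDS32_LONGS(nr) * sizeof(long32_t))

/* Number of 32-bit words needed to hold nr bits (rounded up). */
size_t fds32_longs(size_t nr)
{
    return FDS32_LONGS(nr);
}

/* Number of bytes needed to hold nr bits, in whole words. */
size_t fds32_bytes(size_t nr)
{
    return FDS32_BYTES(nr);
}

/* Bytes for all six bitmaps: in, out, ex, res_in, res_out, res_ex. */
size_t fds32_total_bytes(size_t nr)
{
    return 6 * fds32_bytes(nr);
}
```

For nr = 60 this gives 2 words and 8 bytes per bitmap, as in the text, and 48 bytes for the six bitmaps together.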

The following, except for CLR, were all defined in fs/select.c and had to be copied over with the kernel code. I defined CLR myself so I could use it in split_select_bitmaps():

    #define BIT(i) (1UL << ((i)&(__NFDBITS-1)))
    #define ISSET(i,m) (((i)&*(m)) != 0)
    #define SET(i,m) (*(m) |= (i))
    #define CLR(i,m) (*(m) &= (~(i)))

__NFDBITS is defined in linux/posix_types.h to be 8*sizeof(unsigned long). BIT sets a

bit in the correct position within an unsigned long. For file descriptors less than 32, this works as

expected. For example, if i = 8, then 1UL (meaning an unsigned long value 1) is moved left 8

spots (…100000000 binary). For file descriptors 32 and above, it puts the bit in a position that assumes there are a certain number of unsigned longs in front of it. If i = 60, then the bit is shifted 28 spots, assuming that there is an unsigned long in front of it (28+32 = 60). SET and CLR set and clear bit i in the unsigned long pointed to by m. ISSET checks bit i in the unsigned long pointed to by m and returns 1 if it is set or 0 if it is not. In each of these three macros, i is a mask produced by BIT rather than a raw file descriptor number. All of these are used in the split and merge routines, as well as do_select(), to manipulate the bitmaps. The following are used to find the correct unsigned long location in each bitmap (originally located in fs/select.c):

    #define __IN(fds, n) (fds->in + n)
    #define __OUT(fds, n) (fds->out + n)
    #define __EX(fds, n) (fds->ex + n)
    #define __RES_IN(fds, n) (fds->res_in + n)
    #define __RES_OUT(fds, n) (fds->res_out + n)
    #define __RES_EX(fds, n) (fds->res_ex + n)

The first three are the ‘in’ bitmaps that the user passes to the system call. The last three are the

‘out’ bitmaps that the kernel copies to user space upon return. fds is of type fd_set_bits.

These macros are used as the m parameter in ISSET, SET, and CLR. Each of these finds the correct

unsigned long that a bit is set in. Figure C.1 shows a typical example of how one of these

bitmaps would be set up. The unsigned longs are contiguous, so n acts like an offset to find the

correct unsigned long. The offset is found by taking a file descriptor number and dividing it by

__NFDBITS. In the example of file descriptor 60, n = 1 because there is one unsigned long in

front of the one that file descriptor 60 is in.
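The file descriptor 60 example can be traced in code. The following is a minimal sketch that uses 32-bit words so that the results match the i386 numbers in the text regardless of the machine it runs on; BIT32, ISSET32, SET32, CLR32, and the bitmap_*() helpers are hypothetical names modeled on the macros above.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 32u   /* stands in for __NFDBITS with a 4-byte long */

/* Same shape as BIT, ISSET, SET, and CLR above, on 32-bit words.
 * The first argument of the last three is a mask produced by BIT32. */
#define BIT32(i)      (UINT32_C(1) << ((i) & (WORD_BITS - 1)))
#define ISSET32(b, m) (((b) & *(m)) != 0)
#define SET32(b, m)   (*(m) |= (b))
#define CLR32(b, m)   (*(m) &= ~(b))

/* Index of the word holding fd: the n passed to __IN(fds, n) and friends. */
unsigned fd_word_offset(unsigned fd)
{
    return fd / WORD_BITS;
}

/* Mark fd in a bitmap made of contiguous 32-bit words. */
void bitmap_set_fd(uint32_t *map, unsigned fd)
{
    SET32(BIT32(fd), map + fd_word_offset(fd));
}

/* Clear fd in the bitmap. */
void bitmap_clear_fd(uint32_t *map, unsigned fd)
{
    CLR32(BIT32(fd), map + fd_word_offset(fd));
}

/* Nonzero if fd is marked in the bitmap. */
int bitmap_has_fd(const uint32_t *map, unsigned fd)
{
    return ISSET32(BIT32(fd), map + fd_word_offset(fd));
}
```

For file descriptor 60 the word offset is 1 and the mask is 1 shifted left by 28, so only the second word of the bitmap changes.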

The final two macros are used to check bits in three of the sets concurrently. BITS was defined in fs/select.c while I defined RES_BITS so I could use it in the merge routine.

    #define BITS(fds, n) (*__IN(fds, n)|*__OUT(fds, n)|*__EX(fds, n))
    #define RES_BITS(fds, n) \
        (*__RES_IN(fds, n)|*__RES_OUT(fds, n)|*__RES_EX(fds, n))

These dereference the pointers in the macros above. All the bits from the same unsigned long

offset in each of the three sets are OR’ed together to find which bits are set in any of the bitmaps.

These macros allow a bit that is defined with the BIT macro to be AND’ed with either of these to

find out if that bit is set in any of the sets. For example, suppose each bitmap only had 4 bits.

*__IN is 0001, *__OUT is 0011, and *__EX is 0110. BITS would then be all three of these OR’ed

together: 0111. This shows that file descriptors 0-2 are set in at least one of the bitmaps and file

descriptor 3 is not set in any of the bitmaps. These macros are used in the split and merge

routines, as well as in do_select().
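The 4-bit worked example can be reproduced directly. The following is a minimal sketch in which fd_bits_sketch_t is a hypothetical, trimmed-down stand-in for fd_set_bits that holds only the three request bitmaps, and BITS32 copies the shape of BITS with its arguments parenthesized.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for fd_set_bits: the three request bitmaps only. */
typedef struct {
    uint32_t *in, *out, *ex;
} fd_bits_sketch_t;

/* Same shape as BITS above: OR word n of the three request bitmaps. */
#define BITS32(fds, n) \
    (*((fds)->in + (n)) | *((fds)->out + (n)) | *((fds)->ex + (n)))

/* Nonzero if bit number `bit` of word n is set in any of the three sets. */
int fd_requested(const fd_bits_sketch_t *fds, unsigned n, unsigned bit)
{
    return (BITS32(fds, n) & (UINT32_C(1) << bit)) != 0;
}

/* The worked example from the text: 0001 | 0011 | 0110. */
uint32_t worked_example(void)
{
    uint32_t in = 0x1, out = 0x3, ex = 0x6;
    fd_bits_sketch_t fds = { &in, &out, &ex };
    return BITS32(&fds, 0);
}
```

The result is 0111 binary: file descriptors 0 through 2 are requested in at least one set and file descriptor 3 in none.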

[Figure C.1 diagram: a bitmap drawn as a row of contiguous unsigned longs (…), with the bits/fd ranges 95-64, 63-32, and 31-0 marked, one range per unsigned long]

Figure C.1 – The Representation of a Bitmap in Kernel Memory

Appendix D – Source Code

D.1 select.h

/*
 * select.h - Cal Poly 3Com CiNIC project
 *
 * Definitions and functions for the select() system call. Based off of
 * fs/select.c in kernel version 2.4.2. Code may need to be changed when newer
 * versions of the kernel are used.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select.h,v 1.1 2002/05/01 02:01:30 jkwek Exp $
 */

#ifndef _SELECT_H_
#define _SELECT_H_

#include "global.h"

/* these files included in fs/select.c in Linux kernel */
#include <linux/slab.h>
#include <linux/poll.h>
#include <linux/file.h>
#include <asm/uaccess.h>

/* needed for wait queues, waking up processes, and linked lists */
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/list.h>

/*
 * All of the following except RES_BITS and CLR are also defined in
 * fs/select.c. These two were added for the extra functionality I needed.
 */
#define ROUND_UP(x,y) (((x)+(y)-1)/(y))
#define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)

/*
 * Goes to correct u_long boundary in the bitmaps. The kernel routine sets
 * up the bitmaps along these boundaries for efficiency and speed.
 */
#define __IN(fds, n) (fds->in + n)
#define __OUT(fds, n) (fds->out + n)
#define __EX(fds, n) (fds->ex + n)
#define __RES_IN(fds, n) (fds->res_in + n)
#define __RES_OUT(fds, n) (fds->res_out + n)
#define __RES_EX(fds, n) (fds->res_ex + n)

/* ORs the three bitmaps: a bit is set if it is set in any of them */
#define BITS(fds, n) (*__IN(fds, n)|*__OUT(fds, n)|*__EX(fds, n))
#define RES_BITS(fds, n) \
    (*__RES_IN(fds, n)|*__RES_OUT(fds, n)|*__RES_EX(fds, n))

/*
 * Bit manipulation routines
 * BIT puts a 1 in the correct u_long location.
 */
#define BIT(i) (1UL << ((i)&(__NFDBITS-1)))
#define ISSET(i,m) (((i)&*(m)) != 0)
#define SET(i,m) (*(m) |= (i))
#define CLR(i,m) (*(m) &= (~(i)))

/* longest timeout value */
#define MAX_SELECT_SECONDS \
    ((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)

/* parameters for old_select() */
typedef struct {
    unsigned long n;
    fd_set *inp, *outp, *exp;
    struct timeval *tvp;
} select_param_t;

/* divide parameters between host and EBSA */
typedef struct {
    int n;              /* highest numbered fd in split set */
    int size;           /* size of split set */
    long timeout;       /* timeout of split set */
    fd_set_bits fds;    /* pointers to bitmaps in the memory region */
    char* bits;         /* pointer to the memory region */
} select_split_t;

/* entry in sleeping queue */
typedef struct {
    pid_t pid;                          /* process id */
    wait_queue_head_t* wq;              /* ptr to waitq sleeping on */
    struct list_head sleep_list_entry;  /* positioning in linked list */
} sleep_list_t;

/* entry in ready-to-return queue */
typedef struct {
    pid_t pid;                          /* process id */
    struct list_head rr_list_entry;     /* positioning in linked list */
} rr_list_t;

/* functions in select_h.c */
int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag);
long n_old_select(select_param_t *args);
long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                     struct timeval *tvp);
void select_host_tasklet_func(unsigned long ptr);

/* functions in select_e.c */
int do_ebsa_select(int n, fd_set_bits *fds, long *timeout, pid_t remote_pid);
long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits,
                     pid_t remote_pid);
void select_ebsa_tasklet_func(unsigned long ptr);

/* select function in syscalls_h.c */
long n_sys_select(select_split_t* local, select_split_t* remote);

#endif

D.2 select_h.c

/*
 * select_h.c - Cal Poly 3Com CiNIC project
 *
 * Host-side select() implementation. Most of this code was taken from
 * fs/select.c in the Linux 2.4.2 kernel with bitmap manipulation and EBSA-side
 * communication added in. The conventional method of hijacking system calls
 * could not be performed on select() because it needs to be run concurrently
 * on the host and EBSA sides. Potential race conditions could occur when
 * writing to or reading from shared memory, or adding and deleting entries
 * from the lists. However, no problems have been seen as of yet. Code may
 * need to be changed when newer versions of the kernel are used. Refer to
 * senior project for a more detailed description.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select_h.c,v 1.1 2002/05/01 02:01:46 jkwek Exp $
 */

#include <asm/system.h>         /* for interrupt.h to compile */
#include <linux/interrupt.h>    /* for tasklets */
#include <linux/module.h>       /* MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT */
#include "global.h"
#include "select.h"
#include "fd_map.h"             /* split and merge bitmaps */
#include "host.h"               /* module_info */
#include "com.h"                /* shrmem_t */

/* Initialize host-side ready-to-return and sleeping queues */
LIST_HEAD(rr_list_head);
LIST_HEAD(sleep_list_head);

/* Initialize host-side tasklet */
DECLARE_TASKLET(select_host_tasklet, select_host_tasklet_func, 0);
int tasklet_host_sched_flag;

/*
 * All code in this function is taken from select.c. This had to be copied
 * because it is static. This function tests for bad file descriptors and
 * returns the maximum file descriptor in any of the sets, plus one. -EBADF
 * is returned for bad file descriptors.
 */
static int max_select_fd(unsigned long n, fd_set_bits *fds)
{
    unsigned long *open_fds;
    unsigned long set;

    int max;

    /* handle last in-complete long-word first */
    set = ~(~0UL << (n & (__NFDBITS-1)));
    n /= __NFDBITS;
    open_fds = current->files->open_fds->fds_bits+n;
    max = 0;
    if (set) {
        set &= BITS(fds, n);
        if (set) {
            if (!(set & ~*open_fds))
                goto get_max;
            return -EBADF;
        }
    }
    while (n) {
        open_fds--;
        n--;
        set = BITS(fds, n);
        if (!set)
            continue;
        if (set & ~*open_fds)
            return -EBADF;
        if (max)
            continue;
get_max:
        do {
            max++;
            set >>= 1;
        } while (set);
        max += n * __NFDBITS;
    }

    return max;
}

/*
 * This is the heart of select(). In the original kernel, it checks each file
 * descriptor in the bitmaps and sets the bit in the result bitmaps for each
 * available file descriptor. The number of available file descriptors is
 * returned. If no file descriptors are available, this function sleeps until
 * one becomes available or the timeout expires. The kernel code was modified
 * to include a wait queue to sleep on, shared memory checking, and 2 regular
 * queues. Now it also checks if ebsa_pid in shared memory is equal to the
 * current PID before going to sleep and, if the process does go to sleep,
 * the tasklet will wake it up when or if they are equal. Also, when this
 * function returns, it now puts its PID on the ready-to-return queue so that
 * it can be put in shared memory by the tasklet. ebsa_flag indicates if there
 * are EBSA-side file descriptors. If there are not, then this function runs
 * the same as it would in the regular Linux kernel.
 */
int do_host_select(int n, fd_set_bits *fds, long *timeout, int ebsa_flag)
{
    poll_table table, *wait;    /* list of wait queues */
    int retval, i, off;         /* off - u_long offset */
    long __timeout = *timeout;
    shrmem_t* mem = NULL;       /* shared memory */

    rr_list_t* rr_entry = NULL;         /* ready-to-return queue */
    sleep_list_t* sleep_entry = NULL;   /* sleeping queue */
    DECLARE_WAIT_QUEUE_HEAD(select_sleep);

    /* point to shared memory */
    if (ebsa_flag) {
        mem = (shrmem_t*)module_info.shared_mem_addr;
    }

    /* get max numbered file descriptor */
    read_lock(&current->files->file_lock);
    retval = max_select_fd(n, fds);
    read_unlock(&current->files->file_lock);

    if (retval < 0)
        goto ready_to_return;
    n = retval;

    /* set up a list of wait queues to be used by the poll() method */
    poll_initwait(&table);
    wait = &table;
    if (!__timeout)
        wait = NULL;
    retval = 0;
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * For each file descriptor selected, call the poll() method
         * for that file system type. The poll() method will return
         * a mask that can be used to check its status and set the
         * appropriate bits in the result bitmaps.
         */
        for (i = 0; i < n; i++) {
            unsigned long bit = BIT(i);     /* fd bit in u_long */
            unsigned long mask;             /* poll mask */
            struct file *file;              /* file structure */

            off = i / __NFDBITS;
            if (!(bit & BITS(fds, off)))
                continue;
            file = fget(i);
            mask = POLLNVAL;
            if (file) {
                mask = DEFAULT_POLLMASK;
                if (file->f_op && file->f_op->poll)
                    mask = file->f_op->poll(file, wait);
                fput(file);
            }
            if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                SET(bit, __RES_IN(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                SET(bit, __RES_OUT(fds,off));

                retval++;
                wait = NULL;
            }
            if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                SET(bit, __RES_EX(fds,off));
                retval++;
                wait = NULL;
            }
        }
        wait = NULL;
        if (retval || !__timeout || signal_pending(current))
            break;
        if(table.error) {
            retval = table.error;
            break;
        }

        if (ebsa_flag) {
            /* check if EBSA is done before sleeping */
            if (mem->ebsa_pid == current->pid) {
                break;
            }

            sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
            if (!sleep_entry) {
                /* seems to be most practical return value */
                retval = -ENOMEM;
                break;
            }
            sleep_entry->pid = current->pid;
            sleep_entry->wq = &select_sleep;

            /* add to sleeping queue and go to sleep */
            list_add_tail(&sleep_entry->sleep_list_entry,
                          &sleep_list_head);
            __timeout = interruptible_sleep_on_timeout(
                sleep_entry->wq, __timeout);
            list_del(&sleep_entry->sleep_list_entry);
            kfree(sleep_entry);
        } else {
            /* same as regular kernel if no EBSA fd's */
            __timeout = schedule_timeout(__timeout);
        }
    }
    current->state = TASK_RUNNING;

    poll_freewait(&table);

    /* Up-to-date the caller timeout */
    *timeout = __timeout;

ready_to_return:
    if (ebsa_flag) {
        /* if EBSA side is ready, we are done */
        if (mem->ebsa_pid == current->pid) {
            mem->ebsa_pid = -1;
        } else {
            /* add to ready-to-return queue if EBSA not ready */
            rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
            if (!rr_entry) {

                /* seems to be most practical return value */
                retval = -ENOMEM;
            } else {
                rr_entry->pid = current->pid;
                list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
            }
        }
    }

    return retval;
}

/*
 * old_select() is used by Netscape. The reason it may use this rather than
 * the newer version is for compatibility. This function was designed
 * to be used back in the days when you could not pass 5 parameters to a
 * system call due to register limitations. This code was mostly borrowed
 * from old_select() in arch/i386/kernel/sys_i386.c. It gets the arguments
 * from user space and then calls the new version.
 */
long n_old_select(select_param_t *args)
{
    select_param_t a;
    long retval;

    MOD_INC_USE_COUNT;

    if (copy_from_user(&a, args, sizeof(a))) {
        MOD_DEC_USE_COUNT;
        return -EFAULT;
    }

    retval = sys_host_select(a.n, a.inp, a.outp, a.exp, a.tvp);

    MOD_DEC_USE_COUNT;
    return retval;
}

/*
 * This is the wrapper function for select(). It takes the user space
 * parameters and sets them up in kernel memory. Once select() is finished,
 * it copies the parameters back to user space. This code comes from
 * sys_select() in fs/select.c and is modified to have both EBSA and host side
 * parameters that are split and merged together.
 */
long sys_host_select(int n, fd_set *inp, fd_set *outp, fd_set *exp,
                     struct timeval *tvp)
{
    select_split_t host, ebsa;  /* host and EBSA side values */
    fd_set rfds, wfds, efds;    /* for creating EBSA bitmaps */
    long timeout, ret;

    MOD_INC_USE_COUNT;

    /* get timeout value from user space and change to jiffies */
    timeout = MAX_SCHEDULE_TIMEOUT;
    if (tvp) {
        time_t sec, usec;

        if ((ret = verify_area(VERIFY_READ, tvp, sizeof(*tvp)))

            || (ret = __get_user(sec, &tvp->tv_sec))
            || (ret = __get_user(usec, &tvp->tv_usec)))
            goto out_nofds;

        ret = -EINVAL;
        if (sec < 0 || usec < 0)
            goto out_nofds;

        if ((unsigned long) sec < MAX_SELECT_SECONDS) {
            timeout = ROUND_UP(usec, 1000000/HZ);
            timeout += sec * (unsigned long) HZ;
        }
    }
    host.timeout = timeout;
    ebsa.timeout = timeout;

    ret = -EINVAL;
    if (n < 0)
        goto out_nofds;

    if (n > current->files->max_fdset)
        n = current->files->max_fdset;

    /*
     * We need 6 bitmaps (in/out/ex for both incoming and outgoing),
     * since we used fdset we need to allocate memory in units of
     * long-words.
     */
    ret = -ENOMEM;
    host.size = FDS_BYTES(n);
    host.bits = kmalloc(6 * host.size, GFP_KERNEL);
    if (!host.bits)
        goto out_nofds;
    host.fds.in      = (unsigned long *)  host.bits;
    host.fds.out     = (unsigned long *) (host.bits +   host.size);
    host.fds.ex      = (unsigned long *) (host.bits + 2*host.size);
    host.fds.res_in  = (unsigned long *) (host.bits + 3*host.size);
    host.fds.res_out = (unsigned long *) (host.bits + 4*host.size);
    host.fds.res_ex  = (unsigned long *) (host.bits + 5*host.size);

    /* get bitmaps from user space and set result bitmaps to 0 */
    if ((ret = get_fd_set(n, inp, host.fds.in)) ||
        (ret = get_fd_set(n, outp, host.fds.out)) ||
        (ret = get_fd_set(n, exp, host.fds.ex)))
        goto out;
    zero_fd_set(n, host.fds.res_in);
    zero_fd_set(n, host.fds.res_out);
    zero_fd_set(n, host.fds.res_ex);

    /*
     * move the EBSA descriptors to rfds, wfds, and efds
     * set the n values for each
     */
    split_select_bitmaps(n, &host, &ebsa, &rfds, &wfds, &efds);

    if (ebsa.n > 0) {
        /*
         * If there are EBSA file descriptors, set up the EBSA side
         * bitmaps and copy over from the fd_set's.
         */
        ret = -ENOMEM;
        ebsa.size = FDS_BYTES(ebsa.n);
        ebsa.bits = kmalloc(6 * ebsa.size, GFP_KERNEL);
        if (!ebsa.bits)
            goto out;
        ebsa.fds.in      = (unsigned long *)  ebsa.bits;
        ebsa.fds.out     = (unsigned long *) (ebsa.bits +   ebsa.size);
        ebsa.fds.ex      = (unsigned long *) (ebsa.bits + 2*ebsa.size);
        ebsa.fds.res_in  = (unsigned long *) (ebsa.bits + 3*ebsa.size);
        ebsa.fds.res_out = (unsigned long *) (ebsa.bits + 4*ebsa.size);
        ebsa.fds.res_ex  = (unsigned long *) (ebsa.bits + 5*ebsa.size);

        memcpy((void*)ebsa.fds.in, (void*)&rfds, ebsa.size);
        memcpy((void*)ebsa.fds.out, (void*)&wfds, ebsa.size);
        memcpy((void*)ebsa.fds.ex, (void*)&efds, ebsa.size);
        zero_fd_set(ebsa.n, ebsa.fds.res_in);
        zero_fd_set(ebsa.n, ebsa.fds.res_out);
        zero_fd_set(ebsa.n, ebsa.fds.res_ex);

        /* send to EBSA, run on both sides, merge bitmaps when done */
        ret = n_sys_select(&host, &ebsa);
        merge_select_bitmaps(&host, &ebsa);
        kfree(ebsa.bits);
    } else {
        /* run do_host_select() only if no EBSA descriptors */
        ret = do_host_select(host.n, &host.fds, &host.timeout, 0);
    }

    /* copy the smallest timeout value to user space (elapsed time) */
    if (tvp && !(current->personality & STICKY_TIMEOUTS)) {
        time_t sec = 0, usec = 0;

        if (ebsa.timeout < host.timeout) {
            timeout = ebsa.timeout;
        } else {
            timeout = host.timeout;
        }
        if (timeout) {
            sec = timeout / HZ;
            usec = timeout % HZ;
            usec *= (1000000/HZ);
        }
        put_user(sec, &tvp->tv_sec);
        put_user(usec, &tvp->tv_usec);
    }

    if (ret < 0) {
        goto out;
    }

    /* a 0 return value could mean a signal is pending, restart select() */
    if (!ret) {
        ret = -ERESTARTNOHAND;
        if (signal_pending(current))
            goto out;
        ret = 0;
    }

    /* copy to user space */
    set_fd_set(n, inp, host.fds.res_in);
    set_fd_set(n, outp, host.fds.res_out);
    set_fd_set(n, exp, host.fds.res_ex);

out:
    kfree(host.bits);
out_nofds:
    MOD_DEC_USE_COUNT;
    return ret;
}

/*
 * This function is executed every time tasklets are scheduled to run. None
 * of this function came from the kernel. It is used to facilitate message
 * passing via PID's across shared memory. If host_pid in shared memory is -1,
 * then the next process id on the ready-to-return queue is put into this
 * region for the EBSA side to see that the host side has finished. If
 * ebsa_pid is a positive number, this means that the EBSA side has finished
 * and is ready to return, so this tasklet will wake the process up if it
 * is sleeping and remove it from any queues.
 */
void select_host_tasklet_func(unsigned long ptr)
{
    struct list_head* rr_list_ptr;      /* ptr to current list position */
    struct list_head* sleep_list_ptr;
    shrmem_t* mem;                      /* shared mem ptr */
    sleep_list_t* sleep_entry = NULL;   /* entries in sleep queue */
    rr_list_t* rr_entry = NULL;         /* entries in ready-to-ret queue */
    int sleep_flag = 0;                 /* set if process was woken */
    int rr_flag = 0;                    /* set if process has returned */

    /* setup shared mem ptr */
    mem = (shrmem_t*)module_info.shared_mem_addr;

    /*
     * If host_pid is -1 and the ready-to-return queue is not empty,
     * then remove the next entry from the list. If it is equal to
     * ebsa_pid, both sides are ready-to-return and we are done.
     * Otherwise, put the PID value into host_pid.
     */
    if ((mem->host_pid < 0) && (!list_empty(&rr_list_head))) {
        rr_entry = list_entry(rr_list_head.next, rr_list_t,
                              rr_list_entry);
        list_del(rr_list_head.next);
        if (rr_entry->pid == mem->ebsa_pid) {
            kfree(rr_entry);
            goto reset_ebsa;

        } else {
            mem->host_pid = rr_entry->pid;
            kfree(rr_entry);
        }
    }

    /* nothing else to do if EBSA has no PID to offer */
    if (mem->ebsa_pid < 0) {
        goto tasklet_complete;
    }

    /*
     * Search through the sleeping queue for the process id in ebsa_pid.
     * If it is found, wake up the process and exit the tasklet.
     */
    list_for_each (sleep_list_ptr, &sleep_list_head) {
        sleep_entry = list_entry(sleep_list_ptr, sleep_list_t,
                                 sleep_list_entry);
        if (sleep_entry->pid == mem->ebsa_pid) {
            wake_up_interruptible(sleep_entry->wq);
            sleep_flag = 1;
            break;
        }
    }
    if (sleep_flag) {
        goto tasklet_complete;
    }

    /*
     * The process was not sleeping, so see if it is waiting to put its PID
     * on the ready-to-return queue. If it is found, remove the entry
     * and exit the tasklet.
     */
    list_for_each (rr_list_ptr, &rr_list_head) {
        rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
        if (rr_entry->pid == mem->ebsa_pid) {
            list_del(rr_list_ptr);
            rr_flag = 1;
            kfree(rr_entry);
            break;
        }
    }
    if (!rr_flag) {
        goto tasklet_complete;
    }

    /*
     * ebsa_pid = -1 signals to the EBSA side that it can put a new PID
     * value into it. We need to reset it here when ebsa_pid's value was
     * a PID that was on the ready-to-return queue. Otherwise it is
     * reset in do_host_select().
     */
reset_ebsa:
    mem->ebsa_pid = -1;

    /*
     * Continuously reschedule the tasklet as long as
     * tasklet_host_sched_flag is set.
     */
tasklet_complete:
    if (tasklet_host_sched_flag) {
        tasklet_schedule(&select_host_tasklet);
    }
}

D.3 select_e.c

/*
 * select_e.c - Cal Poly 3Com CiNIC project
 *
 * EBSA-side select() implementation. Most of this code was taken from
 * fs/select.c in the Linux 2.4.2 kernel with bitmap manipulation and host-side
 * communication added in. The conventional method of hijacking system calls
 * could not be performed on select() because it needs to be run concurrently
 * on the host and EBSA sides. In particular for the EBSA side,
 * sys_ebsa_select() is much different from sys_select() in the Linux kernel
 * because most of the memory manipulation is done on the host side before
 * it gets here. Potential race conditions could occur when writing to or
 * reading from shared memory, or adding and deleting entries from the lists.
 * However, no problems have been seen as of yet. Code may need to be changed
 * when newer versions of the kernel are used. Refer to senior project for a
 * more detailed description.
 *
 * Author: Jared Kwek
 * Date: 4/4/02
 *
 * $Id: select_e.c,v 1.1 2002/05/01 02:01:54 jkwek Exp $
 */

#include <linux/spinlock.h>     /* for interrupt.h to compile */
#include <linux/interrupt.h>    /* for tasklets */

#include "global.h"
#include "select.h"
#include "ebsa.h"               /* g_shrmem */
#include "com.h"                /* shrmem_t */

/* Initialize EBSA-side ready-to-return and sleeping queues */
LIST_HEAD(rr_list_head);
LIST_HEAD(sleep_list_head);

/* Initialize EBSA-side tasklet */
DECLARE_TASKLET(select_ebsa_tasklet, select_ebsa_tasklet_func, 0);
int tasklet_ebsa_sched_flag;

/*
 * All code in this function is taken from select.c. This had to be copied
 * because it is static. This function tests for bad file descriptors and
 * returns the maximum file descriptor in any of the sets, plus one. -EBADF
 * is returned for bad file descriptors.
 */
static int max_select_fd(unsigned long n, fd_set_bits *fds)

{
    unsigned long *open_fds;
    unsigned long set;
    int max;

    /* handle last in-complete long-word first */
    set = ~(~0UL << (n & (__NFDBITS-1)));
    n /= __NFDBITS;
    open_fds = current->files->open_fds->fds_bits+n;
    max = 0;
    if (set) {
        set &= BITS(fds, n);
        if (set) {
            if (!(set & ~*open_fds))
                goto get_max;
            return -EBADF;
        }
    }
    while (n) {
        open_fds--;
        n--;
        set = BITS(fds, n);
        if (!set)
            continue;
        if (set & ~*open_fds)
            return -EBADF;
        if (max)
            continue;
get_max:
        do {
            max++;
            set >>= 1;
        } while (set);
        max += n * __NFDBITS;
    }

    return max;
}

/*
 * This is the heart of select(). In the original kernel, it checks each file
 * descriptor in the bitmaps and sets the bit in the result bitmaps for each
 * available file descriptor. The number of available file descriptors is
 * returned. If no file descriptors are available, this function sleeps until
 * one becomes available or the timeout expires. The kernel code was modified
 * to include a wait queue to sleep on, shared memory checking, and 2 regular
 * queues. Now it also checks if host_pid in shared memory is equal to the
 * current PID before going to sleep and, if the process does go to sleep,
 * the tasklet will wake it up when or if they are equal. Also, when this
 * function returns, it now puts its PID on the ready-to-return queue so that
 * it can be put in shared memory by the tasklet.
 */
int do_ebsa_select(int n, fd_set_bits *fds, long *timeout, pid_t remote_pid)
{
    poll_table table, *wait;    /* list of wait queues */
    int retval, i, off;         /* off - u_long offset */
    long __timeout = *timeout;

    shrmem_t* mem;                      /* shared memory */
    rr_list_t* rr_entry = NULL;         /* ready-to-return queue */
    sleep_list_t* sleep_entry = NULL;   /* sleeping queue */
    DECLARE_WAIT_QUEUE_HEAD(select_sleep);

    /* point to shared memory */
    mem = (shrmem_t*)g_shrmem;

    /* get max numbered file descriptor */
    read_lock(&current->files->file_lock);
    retval = max_select_fd(n, fds);
    read_unlock(&current->files->file_lock);

    if (retval < 0)
        goto ready_to_return;
    n = retval;

    /* set up a list of wait queues to be used by the poll() method */
    poll_initwait(&table);
    wait = &table;
    if (!__timeout)
        wait = NULL;
    retval = 0;
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * For each file descriptor selected, call the poll() method
         * for that file system type. The poll() method will return
         * a mask that can be used to check its status and set the
         * appropriate bits in the result bitmaps.
         */
        for (i = 0; i < n; i++) {
            unsigned long bit = BIT(i);     /* fd bit in u_long */
            unsigned long mask;             /* poll mask */
            struct file *file;              /* file structure */

            off = i / __NFDBITS;
            if (!(bit & BITS(fds, off)))
                continue;
            file = fget(i);
            mask = POLLNVAL;
            if (file) {
                mask = DEFAULT_POLLMASK;
                if (file->f_op && file->f_op->poll)
                    mask = file->f_op->poll(file, wait);
                fput(file);
            }
            if ((mask & POLLIN_SET) && ISSET(bit,__IN(fds,off))) {
                SET(bit, __RES_IN(fds,off));
                retval++;
                wait = NULL;
            }
            if ((mask & POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) {
                SET(bit, __RES_OUT(fds,off));
                retval++;

                wait = NULL;
            }
            if ((mask & POLLEX_SET) && ISSET(bit,__EX(fds,off))) {
                SET(bit, __RES_EX(fds,off));
                retval++;
                wait = NULL;
            }
        }
        wait = NULL;
        if (retval || !__timeout || signal_pending(current))
            break;
        if(table.error) {
            retval = table.error;
            break;
        }

        /* check if host is done before sleeping */
        if (mem->host_pid == remote_pid) {
            break;
        }

        sleep_entry = kmalloc(sizeof(sleep_list_t), GFP_KERNEL);
        if (!sleep_entry) {
            /* seems to be most practical return value */
            retval = -ENOMEM;
            break;
        }
        sleep_entry->pid = remote_pid;
        sleep_entry->wq = &select_sleep;

        /* add to sleeping queue and go to sleep */
        list_add_tail(&sleep_entry->sleep_list_entry, &sleep_list_head);
        __timeout = interruptible_sleep_on_timeout(sleep_entry->wq,
                                                   __timeout);
        list_del(&sleep_entry->sleep_list_entry);
        kfree(sleep_entry);
    }
    current->state = TASK_RUNNING;

    poll_freewait(&table);

    /* Up-to-date the caller timeout */
    *timeout = __timeout;

ready_to_return:
    /* if host side is ready, we are done */
    if (mem->host_pid == remote_pid) {
        mem->host_pid = -1;
    } else {
        /* add to ready-to-return queue if host not ready */
        rr_entry = kmalloc(sizeof(rr_list_t), GFP_KERNEL);
        if (!rr_entry) {
            /* seems to be most practical return value */
            retval = -ENOMEM;
        } else {
            rr_entry->pid = remote_pid;
            list_add_tail(&rr_entry->rr_list_entry, &rr_list_head);
        }
    }

    return retval;

}

/*
 * This is the wrapper function for the EBSA-side do_ebsa_select(). It takes
 * the bitmaps from the packet received from the host and divides them into
 * their respective regions for the 6 bitmaps. Then do_ebsa_select() is called
 * to do the work. This is all that is needed on the EBSA side, as the host
 * side takes care of the rest.
 */
long sys_ebsa_select(int n, int size, long* timeout, char* ebsa_bits,
                     pid_t remote_pid)
{
    fd_set_bits bmaps;  /* struct with u_long pointers to bitmaps */
    long retval;

    bmaps.in      = (unsigned long *)  ebsa_bits;
    bmaps.out     = (unsigned long *) (ebsa_bits +   size);
    bmaps.ex      = (unsigned long *) (ebsa_bits + 2*size);
    bmaps.res_in  = (unsigned long *) (ebsa_bits + 3*size);
    bmaps.res_out = (unsigned long *) (ebsa_bits + 4*size);
    bmaps.res_ex  = (unsigned long *) (ebsa_bits + 5*size);

    retval = do_ebsa_select(n, &bmaps, timeout, remote_pid);
    if (!retval) {
        if (signal_pending(current)) {
            retval = -ERESTARTNOHAND;
        }
    }

    return retval;
}

/*
 * This function is executed every time tasklets are scheduled to run. None
 * of this function came from the kernel. It is used to facilitate message
 * passing via PID's across shared memory. If ebsa_pid in shared memory is -1,
 * then the next process id on the ready-to-return queue is put into this
 * region for the host side to see that the EBSA side has finished. If
 * host_pid is a positive number, this means that the host side has finished
 * and is ready to return, so this tasklet will wake the process up if it
 * is sleeping and remove it from any queues.
 */
void select_ebsa_tasklet_func(unsigned long ptr)
{
    struct list_head* rr_list_ptr;      /* ptr to current list position */
    struct list_head* sleep_list_ptr;
    shrmem_t* mem;                      /* shared mem ptr */
    sleep_list_t* sleep_entry = NULL;   /* entries in sleep queue */
    rr_list_t* rr_entry = NULL;         /* entries in ready-to-ret queue */
    int sleep_flag = 0;                 /* set if process was woken */
    int rr_flag = 0;                    /* set if process has returned */

    /* setup shared mem ptr */
    mem = (shrmem_t*)g_shrmem;

    /*

     * If ebsa_pid is -1 and the ready-to-return queue is not empty,
     * then remove the next entry from the list. If it is equal to
     * host_pid, both sides are ready-to-return and we are done.
     * Otherwise, put the PID value into ebsa_pid.
     */
    if ((mem->ebsa_pid < 0) && (!list_empty(&rr_list_head))) {
        rr_entry = list_entry(rr_list_head.next, rr_list_t,
                              rr_list_entry);
        list_del(rr_list_head.next);
        if (rr_entry->pid == mem->host_pid) {
            kfree(rr_entry);
            goto reset_host;
        } else {
            mem->ebsa_pid = rr_entry->pid;
            kfree(rr_entry);
        }
    }

    /* nothing else to do if host has no PID to offer */
    if (mem->host_pid < 0) {
        goto tasklet_complete;
    }

    /*
     * Search through the sleeping queue for the process id in host_pid.
     * If it is found, wake up the process and exit the tasklet.
     */
    list_for_each (sleep_list_ptr, &sleep_list_head) {
        sleep_entry = list_entry(sleep_list_ptr, sleep_list_t,
                                 sleep_list_entry);
        if (sleep_entry->pid == mem->host_pid) {
            wake_up_interruptible(sleep_entry->wq);
            sleep_flag = 1;
            break;
        }
    }
    if (sleep_flag) {
        goto tasklet_complete;
    }

    /*
     * The process was not sleeping, so see if it is waiting to put its PID
     * on the ready-to-return queue. If it is found, remove the entry
     * and exit the tasklet.
     */
    list_for_each (rr_list_ptr, &rr_list_head) {
        rr_entry = list_entry(rr_list_ptr, rr_list_t, rr_list_entry);
        if (rr_entry->pid == mem->host_pid) {
            list_del(rr_list_ptr);
            rr_flag = 1;
            kfree(rr_entry);
            break;
        }
    }
    if (!rr_flag) {

        goto tasklet_complete;
    }

    /*
     * host_pid = -1 signals to the host side that it can put a new PID
     * value into it. We need to reset it here when host_pid's value was
     * a PID that was on the ready-to-return queue. Otherwise it is
     * reset in do_ebsa_select().
     */
reset_host:
    mem->host_pid = -1;

    /*
     * Continuously reschedule the tasklet as long as
     * tasklet_ebsa_sched_flag is set.
     */
tasklet_complete:
    if (tasklet_ebsa_sched_flag) {
        tasklet_schedule(&select_ebsa_tasklet);
    }
}

D.4 syscalls_h.c – n_sys_select()

/*
 * This function sets up and sends the packet to the EBSA for select() and
 * calls do_host_select() if there are host-side file descriptors. After
 * do_host_select() returns and the EBSA's packet is received, this function
 * returns the sum of the return values from these or an error value if either
 * side has an error. This function has the same functionality as the other
 * functions in syscalls_h.c, but sys_host_select() must be called first to
 * setup the memory region and split up the bitmaps. None of this is Linux
 * kernel code.
 */
long n_sys_select(select_split_t* local, select_split_t* remote)
{
    long err = 0;           /* this function's return value */
    int ret_local = 0;      /* host return value */
    int ret_remote = 0;     /* EBSA return value */
    pkt_queue_node_t *pqn;  /* packet */
    shrmem_t* mem;          /* shared memory pointer */

    /* setup the packet */
    pqn = proto_get_queue_node(
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size));
    pqn->pkt->copy_len =
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
    pqn->pkt->pkt_len =
        COM_PKT_HEADER_SIZE + COM_SELECT_HEADER_SIZE + 6*(remote->size);
    pqn->pkt->func_id = SYS_SELECT;
    pqn->pkt->pid = current->pid;
    pqn->pkt->ret_val = -1;

    pqn->pkt->func.select.numfds = remote->n;
    pqn->pkt->func.select.time_off = remote->timeout;
    pqn->pkt->func.select.sizefds = remote->size;
    memcpy((void*)&pqn->pkt->func.select.bitmaps[0], (void*)remote->bits,
           6*(remote->size));

    /*
     * Send packet to EBSA and call do_host_select() if there are host
     * fd's. Then wait for packet to return.
     */
    proto_enqueue(pqn);
    if (local->n > 0) {
        ret_local = do_host_select(local->n, &local->fds,
                                   &local->timeout, 1);
    }
    if (down_interruptible(&(pqn->lock)) == -EINTR) {
        err = -EINTR;
        goto out_select;
    }

    /* put packet values into EBSA's parameters */
    ret_remote = pqn->pkt->ret_val;
    memcpy((void*)remote->bits, (void*)&pqn->pkt->func.select.bitmaps[0],
           6*(remote->size));
    remote->timeout = pqn->pkt->func.select.time_off;

    /*
     * Any errors are returned. If both sides return an error, the host
     * side arbitrarily gets precedence over the EBSA side.
     */
    if (ret_local < 0) {
        err = ret_local;
    } else if (ret_remote < 0) {
        err = ret_remote;
    } else {
        err = ret_remote + ret_local;
    }

out_select:
    proto_release_queue_node(pqn);

    /* cleanup shared memory - used mainly for case with EBSA fd's only */
    /* possible race condition writing to shared memory */
    mem = (shrmem_t*)module_info.shared_mem_addr;
    if (mem->host_pid == current->pid) {
        mem->host_pid = -1;
    }
    if (mem->ebsa_pid == current->pid) {
        mem->ebsa_pid = -1;
    }

    return err;
}
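
The return-value rule at the end of n_sys_select() can be looked at in isolation. The following is only a user-space sketch of that rule, not code from the CiNIC module: combine_select_results() is a hypothetical name, and it assumes both inputs follow the kernel convention of a negative errno value on failure and a non-negative count of ready descriptors on success.

```c
#include <assert.h>   /* only needed for the checks mentioned in the note below */
#include <errno.h>

/*
 * Hypothetical helper (not part of the CiNIC sources): combine the host-side
 * and EBSA-side results the same way the end of n_sys_select() does.
 *   - a host error takes precedence over an EBSA error
 *   - otherwise an EBSA error is returned
 *   - otherwise the two ready-descriptor counts are summed
 */
long combine_select_results(int ret_local, int ret_remote)
{
    if (ret_local < 0)
        return ret_local;
    if (ret_remote < 0)
        return ret_remote;
    return (long)ret_local + (long)ret_remote;
}
```

Under these assumptions, combine_select_results(2, 1) gives 3, while combine_select_results(-EBADF, -ENOMEM) gives -EBADF because the host side is checked first.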

D.5 fd_map.c – split_select_bitmaps(), merge_select_bitmaps()

/*
 * This function splits a set of select() bitmaps into two sets: one that
 * goes to the host and one that goes to the EBSA. It is assumed that the
 * local (host) side has the original arguments and bitmaps from select().
 * This function is put here rather than in select_h.c for faster access to the
 * translation tables; otherwise, we would continually have to call
 * fd_map_get_ebsa_fd() for each descriptor found in the bitmaps. fd_set
 * is used for the remote (EBSA) side because they can handle descriptors up
 * to 1024 and we do not know what the translation will go to on the EBSA.
 * The macros are in select.h and are adapted from fs/select.c in the Linux
 * 2.4.2 kernel. The algorithm uses some of the same principles as do_select()
 * in this kernel version as well.
 */
void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
                          fd_set* remote_rfds, fd_set* remote_wfds,
                          fd_set* remote_efds)
{
    int hfd, efd;                   /* host/EBSA file descriptor */
    int off;                        /* u_long offset into bitmap */
    fd_translation_table_t *cur;    /* translation table pointer */

    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;

    /* start out highest numbered fd on each side at 0 */
    local->n = 0;
    remote->n = 0;

    /* find translation table for current process */
    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {
        cur = cur->next;
    }
    up(&table_lock);
    if (!cur) {
        /* no fd translation table */
        local->n = n;
        return;
    }

    /* zero out the fd_set's so we can populate them with EBSA's fd's */
    FD_ZERO(remote_rfds);
    FD_ZERO(remote_wfds);
    FD_ZERO(remote_efds);

    /*
     * Algorithm:
     * For each file descriptor up to n:
     * 1. Find bit and offset for that fd on the host.
     * 2. Continue to next fd if the bit is not set in any of the sets.
     * 3. Get fd mapping for that bit if it is set in any of the sets.
     *    If it maps to -1, the fd is a local descriptor only, so update
     *    n on the host and go on to next fd.
     * 4. If it maps to a positive number, find out which sets this fd

     *    is in and set the translated EBSA fd in these sets. Clear the
     *    host fd in the host sets.
     * 5. If the EBSA side n is smaller than the translated fd, then
     *    update the EBSA side n.
     */
    for (hfd = 0; hfd < n; hfd++) {
        unsigned long hbit = BIT(hfd);

        off = hfd / __NFDBITS;
        if (!(hbit & BITS(lfds, off))) {
            continue;
        }
        efd = cur->fd_host_ebsa[hfd];
        if (efd < 0) {
            local->n = hfd + 1;
            continue;
        }
        if (ISSET(hbit, __IN(lfds, off))) {
            FD_SET(efd, remote_rfds);
            CLR(hbit, __IN(lfds, off));
        }
        if (ISSET(hbit, __OUT(lfds, off))) {
            FD_SET(efd, remote_wfds);
            CLR(hbit, __OUT(lfds, off));
        }
        if (ISSET(hbit, __EX(lfds, off))) {
            FD_SET(efd, remote_efds);
            CLR(hbit, __EX(lfds, off));
        }
        if (remote->n <= efd)
            remote->n = efd + 1;
    }
}

/*
 * This function merges two sets of select() bitmaps into one set. The local
 * (host) side contains the merged sets when this function is finished. This
 * function should be called after split_select_bitmaps(). This function is
 * put here rather than in select_h.c for faster access to the translation
 * tables; otherwise, we would continually have to call fd_map_get_host_fd()
 * for each descriptor found in the bitmaps. The macros are in select.h and
 * are adapted from fs/select.c in the Linux 2.4.2 kernel. The algorithm uses
 * some of the same principles as do_select() in this kernel version as well.
 */
void merge_select_bitmaps(select_split_t* local, select_split_t* remote)
{
    int hfd, efd;                   /* host/EBSA file descriptor */
    int off_local, off_remote;      /* u_long offset into bitmaps */
    fd_translation_table_t *cur;    /* translation table pointer */

    /* hack to make macros in select.h work */
    fd_set_bits* lfds = &local->fds;
    fd_set_bits* rfds = &remote->fds;

    /* find translation table for current process */
    down(&table_lock);
    cur = fd_tables;
    while(cur != NULL && cur->task != current) {

        cur = cur->next;
    }
    up(&table_lock);

    /*
     * Sanity check. Since split_select_bitmaps() should have been called
     * before this function, it would have seen there was no table and this
     * function would then not be called.
     */
    if (!cur) {
        PRINT_ERROR("merge_select_bitmaps: fd translation table missing, "
                    "this should not happen\n");
        return; /* no fd translation table */
    }

    /*
     * Algorithm:
     * For each file descriptor up to remote->n:
     * 1. Find bit and offset for that fd on the EBSA.
     * 2. Continue to next fd if the bit is not set in any of the sets.
     * 3. Get fd mapping for that bit if it is set in any of the sets
     *    and find the bit and offset on the host side.
     * 4. Find out which sets it is in and set the translated host fd in
     *    the correct result bitmap.
     */
    for (efd = 0; efd < remote->n; efd++) {
        unsigned long ebit = BIT(efd);
        unsigned long hbit;

        off_remote = efd / __NFDBITS;
        if (!(ebit & RES_BITS(rfds, off_remote))) {
            continue;
        }
        hfd = cur->fd_ebsa_host[efd];
        if (hfd >= 0) {
            hbit = BIT(hfd);
            off_local = hfd / __NFDBITS;
        } else {
            /*
             * Sanity check. If the fd is in the EBSA bitmaps,
             * then it should have a translation
             * (split_select_bitmaps() should have set this up).
             */
            PRINT_ERROR("merge_select_bitmaps: fd translation missing, "
                        "this should not happen\n");
            continue;
        }
        if (ISSET(ebit, __RES_IN(rfds, off_remote))) {
            SET(hbit, __RES_IN(lfds, off_local));
        }
        if (ISSET(ebit, __RES_OUT(rfds, off_remote))) {
            SET(hbit, __RES_OUT(lfds, off_local));
        }
        if (ISSET(ebit, __RES_EX(rfds, off_remote))) {
            SET(hbit, __RES_EX(lfds, off_local));
        }
    }
}
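
The split and merge steps listed in the comments above can be tried outside the kernel. The following is a simplified user-space sketch under several assumptions: it handles a single descriptor set instead of the three (in/out/ex) sets, it uses the standard fd_set macros from <sys/select.h> on both sides, and demo_fd_table_t, demo_split(), demo_merge(), and DEMO_MAX_FDS are hypothetical names that stand in for the module's translation table and are not part of the CiNIC sources.

```c
#include <assert.h>
#include <sys/select.h>

#define DEMO_MAX_FDS 16     /* hypothetical table size for this sketch */

/*
 * Hypothetical stand-in for the per-process translation table:
 * host_to_ebsa[hfd] is the co-host fd, or -1 if hfd is host-only;
 * ebsa_to_host[efd] is the reverse mapping, or -1 if there is none.
 */
typedef struct {
    int host_to_ebsa[DEMO_MAX_FDS];
    int ebsa_to_host[DEMO_MAX_FDS];
} demo_fd_table_t;

/*
 * Move every translated descriptor out of host_set and into ebsa_set.
 * host_n and ebsa_n receive the highest remaining fd plus one on each side.
 */
void demo_split(int n, const demo_fd_table_t *t, fd_set *host_set,
                fd_set *ebsa_set, int *host_n, int *ebsa_n)
{
    int hfd;

    assert(n >= 0 && n <= DEMO_MAX_FDS);
    FD_ZERO(ebsa_set);
    *host_n = 0;
    *ebsa_n = 0;

    for (hfd = 0; hfd < n; hfd++) {
        int efd;

        if (!FD_ISSET(hfd, host_set))
            continue;               /* fd not selected at all */
        efd = t->host_to_ebsa[hfd];
        if (efd < 0) {
            *host_n = hfd + 1;      /* host-only fd stays on the host */
            continue;
        }
        FD_SET(efd, ebsa_set);      /* selected on the co-host instead */
        FD_CLR(hfd, host_set);
        if (*ebsa_n <= efd)
            *ebsa_n = efd + 1;
    }
}

/* Translate ready co-host descriptors back into host numbering. */
void demo_merge(int ebsa_n, const demo_fd_table_t *t,
                fd_set *ebsa_result, fd_set *host_result)
{
    int efd;

    assert(ebsa_n >= 0 && ebsa_n <= DEMO_MAX_FDS);
    for (efd = 0; efd < ebsa_n; efd++) {
        int hfd;

        if (!FD_ISSET(efd, ebsa_result))
            continue;
        hfd = t->ebsa_to_host[efd];
        if (hfd >= 0)
            FD_SET(hfd, host_result);
    }
}
```

For instance, with host fd 5 mapped to co-host fd 2 and host fd 3 unmapped, splitting the set {3, 5} should leave {3} on the host (host_n of 4) and produce {2} for the co-host (ebsa_n of 3); merging a co-host result of {2} then sets fd 5 in the host result.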

D.6 CVS Version Differences

The following output from cvs diff shows the changes that I made to various files in the CVS repository. The first file in each diff is the version that was in the repository before I included my changes, and the second file in each diff is the version that I modified. Only my changes are shown. Diffs for select_h.c, select_e.c, and select.h are not included because these files were newly added to the repository. Also, n_sys_select(), split_select_bitmaps(), and merge_select_bitmaps() are not included because they are already shown in Sections D.4 and D.5. The places where these functions would appear are marked with double exclamation points (!!). Additionally, changes not pertaining to select() are marked with double percent signs (%%) followed by an explanation of why each was changed. First is a listing of the files added or changed in the repository, followed by the diff listings.

-r-xr-xr-x 1 jkwek jkwek 10341 Apr 30 18:52 com_e.c
-r-xr-xr-x 1 jkwek jkwek  8876 Apr 30 18:49 com.h
-r-xr-xr-x 1 jkwek jkwek  6322 Apr 30 18:52 com_h.c
-r-xr-xr-x 1 jkwek jkwek 10790 Apr 30 18:53 fd_map.c
-r-xr-xr-x 1 jkwek jkwek  2373 Apr 30 18:54 fd_map.h
-r-xr-xr-x 1 jkwek jkwek  6783 Apr 30 18:55 handler_default.c
-r-xr-xr-x 1 jkwek jkwek  3229 Apr 30 18:48 Makefile
-r-xr-xr-x 1 jkwek jkwek 10911 Apr 30 19:01 select_e.c
-r-xr-xr-x 1 jkwek jkwek  3610 Apr 30 19:01 select.h
-r-xr-xr-x 1 jkwek jkwek 15296 Apr 30 19:01 select_h.c
-r-xr-xr-x 1 jkwek jkwek  2009 Apr 30 18:57 syscalls_e.c
-r-xr-xr-x 1 jkwek jkwek 36234 Apr 30 19:00 syscalls_h.c

Index: com_e.c
===================================================================
RCS file: /home/cvsroot/module_p/com_e.c,v
retrieving revision 1.22
retrieving revision 1.23
diff -r1.22 -r1.23
6c6
< $Id: com_e.c,v 1.22 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: com_e.c,v 1.23 2002/05/01 01:52:45 jkwek Exp $
343a344
> case SYS_SELECT:
418a420
> case SYS_SELECT:
Index: com.h
===================================================================
RCS file: /home/cvsroot/module_p/com.h,v
retrieving revision 1.15
retrieving revision 1.16


diff -r1.15 -r1.16
7c7
< $Id: com.h,v 1.15 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: com.h,v 1.16 2002/05/01 01:49:38 jkwek Exp $
237,242c237,240
<     int numfds;                 /* h -> e */
<     unsigned long rfds_off;     /* h -> e */
<     unsigned long wfds_off;     /* h -> e */
<     unsigned long efds_off;     /* h -> e */
<     unsigned long time_off;     /* h -> e */
<     char data[0];               /* h <-> e */
---
>     int numfds;                 /* h -> e */
>     long time_off;              /* h <-> e */
>     int sizefds;                /* h -> e */
>     char bitmaps[0];            /* h <-> e */
244c242
<
---
> #define COM_SELECT_HEADER_SIZE 12
287,288c285,293
< Virtual struct overlaying shared memory to partition it for us.
< */
---
> * Virtual struct overlaying shared memory to partition it for us.
> *
> * The spacers are a temporary hack until Max fixes the problem. The problem
> * has to do with the cache and updating concurrent memory locations within
> * shared memory. The * spacers make sure the values are in shared memory
> * when they are written.
> */
> #define SHRMEM_DATA_SIZE (SHRMEM_SIZE/2)-(sizeof(int)+sizeof(pid_t)+10*sizeof(unsigned long))
>
289a295,298
>     pid_t host_pid;             /* for use with select() */
>     unsigned long spacer1[10];
>     pid_t ebsa_pid;             /* for use with select() */
>     unsigned long spacer2[10];
291c300
<     char host_data[(SHRMEM_SIZE/2)-4];
---
>     char host_data[SHRMEM_DATA_SIZE];
293c302
<     char ebsa_data[(SHRMEM_SIZE/2)-4];
---
>     char ebsa_data[SHRMEM_DATA_SIZE];

Index: com_h.c
===================================================================
RCS file: /home/cvsroot/module_p/com_h.c,v
retrieving revision 1.22
retrieving revision 1.23
diff -r1.22 -r1.23
6c6
< $Id: com_h.c,v 1.22 2001/09/18 17:39:37 hheiman Exp $


---
> $Id: com_h.c,v 1.23 2002/05/01 01:52:58 jkwek Exp $
150a151
> case SYS_SELECT:
230a232
> case SYS_SELECT:

Index: fd_map.c
===================================================================
RCS file: /home/cvsroot/module_p/fd_map.c,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
7c7
< $Id: fd_map.c,v 1.1 2001/06/07 15:53:43 rob Exp $
---
> $Id: fd_map.c,v 1.2 2002/05/01 01:53:56 jkwek Exp $
226a227,393
!! split_select_bitmaps() and merge_select_bitmaps() !!

Index: fd_map.h
===================================================================
RCS file: /home/cvsroot/module_p/fd_map.h,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
7c7
< $Id: fd_map.h,v 1.1 2001/06/07 15:53:44 rob Exp $
---
> $Id: fd_map.h,v 1.2 2002/05/01 01:54:41 jkwek Exp $
12a13
> #include "select.h" /* select_split_t, fd_set, and macros */
69a71,82
>
> /*
> Split the file descriptors in a select() bitmap set into 2 sets: one to go
> to the host and one to go to the EBSA.
> */
> void split_select_bitmaps(int n, select_split_t* local, select_split_t* remote,
>     fd_set* remote_rfds, fd_set* remote_wfds, fd_set* remote_efds);
>
> /*
> Merge the resulting select() bitmaps into one set.
> */
> void merge_select_bitmaps(select_split_t* local, select_split_t* remote);

Index: handler_default.c
===================================================================
RCS file: /home/cvsroot/module_p/handler_default.c,v
retrieving revision 1.2
retrieving revision 1.3
diff -r1.2 -r1.3
6c6
< $Id: handler_default.c,v 1.2 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: handler_default.c,v 1.3 2002/05/01 01:55:49 jkwek Exp $
21a22
> #include "select.h" /* sys_ebsa_select() */
196a198,205
>
> case SYS_SELECT:
>     pkt->ret_val = sys_ebsa_select(pkt->func.select.numfds,
>                                    pkt->func.select.sizefds,
>                                    &pkt->func.select.time_off,
>                                    &pkt->func.select.bitmaps[0],
>                                    pkt->pid);
>     break;
212d220
<

Index: Makefile
===================================================================
RCS file: /home/cvsroot/module_p/Makefile,v
retrieving revision 1.14
retrieving revision 1.15
diff -r1.14 -r1.15
3c3
< # $Id: Makefile,v 1.14 2001/06/07 15:53:44 rob Exp $
---
> # $Id: Makefile,v 1.15 2002/05/01 01:48:54 jkwek Exp $
18,19c18,19
< HOST_OBJS=${COMMON_OBJS} com_h.o syscalls_h.o host.o fd_map.o proc_host.o proc_host_status.o
< EBSA_OBJS=${COMMON_OBJS} com_e.o syscalls_e.o ebsa.o handler_default.o
---
> HOST_OBJS=${COMMON_OBJS} com_h.o syscalls_h.o host.o fd_map.o proc_host.o proc_host_status.o select_h.o
> EBSA_OBJS=${COMMON_OBJS} com_e.o syscalls_e.o ebsa.o handler_default.o select_e.o
85c85
< fd_map.o: global.h syscalls_h.h fd_map.h fd_map.c
---
> fd_map.o: global.h syscalls_h.h fd_map.h select.h fd_map.c
88c88
< handler_default.o: global.h com_e.h handler_default.h handler_default.c
---
> handler_default.o: global.h com_e.h select.h handler_default.h handler_default.c
120a121,126
> select_e.o: global.h select.h ebsa.h com.h select_e.c
>         gcc ${FLAGS} -c select_e.c
>
> select_h.o: global.h fd_map.h select.h host.h com.h select_h.c
>         gcc ${FLAGS} -c select_h.c
>
124c130
< syscalls_h.o: global.h com.h host.h com_h.h fd_map.h syscalls_h.h syscalls_h.c
---
> syscalls_h.o: global.h com.h host.h com_h.h fd_map.h syscalls_h.h select.h syscalls_h.c

Index: syscalls_e.c
===================================================================
RCS file: /home/cvsroot/module_p/syscalls_e.c,v
retrieving revision 1.8
retrieving revision 1.9


diff -r1.8 -r1.9
2c2
%% - this changed because the file is syscalls_e.c, not syscalls_e.h
< syscalls_e.h - Cal Poly 3Com CiNIC project
---
> syscalls_e.c - Cal Poly 3Com CiNIC project
6c6
< $Id: syscalls_e.c,v 1.8 2001/06/07 15:53:44 rob Exp $
---
> $Id: syscalls_e.c,v 1.9 2002/05/01 01:57:41 jkwek Exp $
9a10,11
> #include <linux/spinlock.h> /* for interrupt.h to compile */
> #include <linux/interrupt.h> /* for tasklets */
12c14,15
<
---
> #include "ebsa.h" /* for shared memory */
> #include "com.h" /* for shared mem struct */
18a22,25
> /* tasklet declared in select_e.c */
> extern struct tasklet_struct select_ebsa_tasklet;
> extern int tasklet_ebsa_sched_flag;
>
33a41,42
>     shrmem_t* mem;
>
35a45,51
>     /* select tasklet and shared mem init */
>     mem = (shrmem_t*)g_shrmem;
>     mem->host_pid = -1;
>     mem->ebsa_pid = -1;
>     tasklet_ebsa_sched_flag = 1;
>     tasklet_schedule(&select_ebsa_tasklet);
>
47a64,67
>
>     /* stop the tasklet */
>     tasklet_ebsa_sched_flag = 0;
>     tasklet_kill(&select_ebsa_tasklet);

Index: syscalls_h.c
===================================================================
RCS file: /home/cvsroot/module_p/syscalls_h.c,v
retrieving revision 1.39
retrieving revision 1.40
diff -r1.39 -r1.40
8c8
< $Id: syscalls_h.c,v 1.39 2001/09/18 17:39:37 hheiman Exp $
---
> $Id: syscalls_h.c,v 1.40 2002/05/01 02:00:55 jkwek Exp $
17c17,18
%% - non-useful comment taken out
< //hekllo
---
> #include <asm/system.h> /* for interrupt.h to compile */
> #include <linux/interrupt.h> /* for tasklets */
24a26
> #include "select.h" /* sys_host_select(), do_host_select() */
31a34,37
> /* tasklet declared in select_h.c */
> extern struct tasklet_struct select_host_tasklet;
> extern int tasklet_host_sched_flag;
>
42a49,51
> static long (*o_old_select)(select_param_t *args);
> static long (*o_sys_select)(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp);
>
101a111,112
>     shrmem_t* mem;
>
106a118,124
>     /* select() tasklet and shared mem init */
>     mem = (shrmem_t*)module_info.shared_mem_addr;
>     mem->host_pid = -1;
>     mem->ebsa_pid = -1;
>     tasklet_host_sched_flag = 1;
>     tasklet_schedule(&select_host_tasklet);
>
134a153,160
>     /* hijack old_select */
>     o_old_select = sys_call_table[__NR_select];
>     sys_call_table[__NR_select] = (void *) n_old_select;
>
>     /* hijack sys_select */
>     o_sys_select = sys_call_table[__NR__newselect];
>     sys_call_table[__NR__newselect] = (void *) sys_host_select;
>
144a171,174
>     /* stop the tasklet */
>     tasklet_host_sched_flag = 0;
>     tasklet_kill(&select_host_tasklet);
>
186a217,228
>     if (o_old_select != NULL) {
>         sys_call_table[__NR_select] = o_old_select;
>     } else {
>         PRINT_ERROR("syscalls_cleanup: old_select pointer not stored!\n");
>     }
>
>     if (o_sys_select != NULL) {
>         sys_call_table[__NR__newselect] = o_sys_select;
>     } else {
>         PRINT_ERROR("syscalls_cleanup: sys_select pointer not stored!\n");
>     }
>
538c580
%% - Want to run recv() rather than sendto() when recv() is called.
< return old_sys_socketcall(SYS_SENDTO, args);
---
> return old_sys_socketcall(SYS_RECV, args);
1098a1141,1229
!! n_sys_select() !!


D.7 Sample Test Program

#include <stdio.h>
#include <linux/unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>    /* struct timeval */
#include <sys/select.h>  /* select(), fd_set, and the FD_* macros */
#include <sys/socket.h>

int main()
{
    fd_set hi;   /* two fd_sets: hi for reading, hi2 for writing */
    fd_set hi2;
    int fd1, fd2;
    struct timeval time;

    time.tv_sec = 10;   /* 10 sec sleep */
    time.tv_usec = 0;

    fd1 = open("atext", O_RDONLY | O_SYNC);   /* local */
    fd2 = socket(AF_INET, SOCK_STREAM, 0);    /* remote */

    printf("fd1 is %i\n", fd1);
    printf("fd2 is %i\n", fd2);

    FD_ZERO(&hi);
    FD_ZERO(&hi2);
    FD_SET(fd1, &hi);
    FD_SET(fd2, &hi2);
    FD_SET(2, &hi);

    /*
     * Check the local fd and stderr for read, and the socket for write.
     * The fd and the socket return available; stderr returns NOT available
     * because there is nothing to read.
     */
    select(fd2+1, &hi, &hi2, NULL, &time);

    if (FD_ISSET(fd1, &hi))
        printf("data available\n");
    else
        printf("data NOT available \n");

    if (FD_ISSET(fd2, &hi2))
        printf("data available\n");
    else
        printf("data NOT available \n");

    if (FD_ISSET(2, &hi))
        printf("data available\n");
    else
        printf("data NOT available \n");

    return 0;
}