
Department of Computer Science

University of Salzburg

Lecture Special Topics in Operating Systems

UFO

Understanding the FreeBSD Operating System

Review of the FreeBSD 5.4 Kernel

February 2006

Project Members:

Dominik Knoll, Bernhard Sehorz, Simon Sigl

{dknoll, bsehorz, ssigl}@cosy.sbg.ac.at

Academic Supervisor:

Prof. Dr. Ing. Christoph Kirsch

[email protected]

Contents

1 Introduction
  1.1 FreeBSD
    1.1.1 A Short History of FreeBSD
    1.1.2 FreeBSD's Present State
  1.2 Document Conventions

2 Process Management
  2.1 Processes and Threads
  2.2 Kernel Structures
    2.2.1 Process
    2.2.2 Thread
    2.2.3 KSEG
    2.2.4 KSE
  2.3 Scheduling
    2.3.1 Kernel and User Level Scheduling
    2.3.2 SMP Support
  2.4 Kernel Interface
    2.4.1 Process Creation
    2.4.2 Loading a Program
  2.5 Deadlocks

3 Memory Management
  3.1 Architecture Independent Subsystem
    3.1.1 Overview
    3.1.2 VM Space
    3.1.3 VM Map
    3.1.4 VM Map Entry
    3.1.5 Memory Object
    3.1.6 Pager
    3.1.7 Page
  3.2 Architecture Dependent Subsystem
    3.2.1 Physical Map Services
  3.3 Mechanisms
    3.3.1 Memory Mapped Files
    3.3.2 Forking Processes
    3.3.3 Page Fault
    3.3.4 The Paging Daemon
    3.3.5 The Unified Buffer Cache

4 File Systems
  4.1 VFS
  4.2 UFS and FFS
    4.2.1 UFS Partition Layout
    4.2.2 UFS Inodes

5 Summary
  5.1 Suggestions

References


1 Introduction

In this document, we present a brief but decent review of the FreeBSD kernel, trying to bring together architectural concepts with some of the implementation details, which are—as far as it seemed to us—neglected in most documentation handbooks, as well as in papers and textbooks. This work was expected to cover five major areas:

• process management,

• memory management,

• input/output (I/O),

• file systems,

• and symmetric multiprocessing (SMP).

Since it is impossible to completely cover these topics in a reasonable number of pages, we tried to pick FreeBSD-specific features of the kernel that should cover each of these areas briefly. Thus, the reader should get an overall idea of how the FreeBSD kernel is designed after having read the paper.

We address an audience with knowledge of fundamental operating system concepts, and an understanding of terms like priority scheduling, deadlock prevention, demand paging or inode. (An introduction to these concepts is, for example, provided in [1] and [2].) A basic knowledge of some UNIX-like operating system will also help in understanding the explanations.

The focus of this work lies on FreeBSD 5.4, as it was the most recent stable version at the beginning of our project. (Since then, there has been at least one major stable release: FreeBSD 6.0.) For machine-dependent parts of the kernel, we only discuss Intel x86 architecture specific implementations.

1.1 FreeBSD

FreeBSD is an advanced operating system for x86 compatibles and many other architectures. It is a member of the Berkeley Software Distribution (BSD) family of operating systems. BSD was originally developed by the University of California, Berkeley (UCB), but split up into many independent projects after the final release by UCB. From its BSD roots, FreeBSD inherited the famous TCP/IP protocol stack implementation, which has often been referred to as the reference implementation.

1.1.1 A Short History of FreeBSD

The FreeBSD project was started in early 1993 as an unofficial patchkit for 386BSD, a somewhat unfinished operating system in use then. This patchkit soon turned into a full operating system of its own. At first, it was developed based on the main 4.xBSD releases for a short time. However, the 4.4BSD-Lite release in 1994 brought about significant changes. Due to a lawsuit, the code in 4.4BSD-Lite was incomplete (the previous releases incorporated some code by AT&T), which forced FreeBSD to re-implement much of the code. 4.4BSD-Lite was the last release by UCB, and many of today's operating systems share it as a common ancestor. Among those, the best known (besides FreeBSD) are NetBSD and OpenBSD.

It should be noted that these operating systems do not compete against each other but try to complement each other. Each of them sets its focus differently. For example, OpenBSD is focused on maximum security, whereas NetBSD's main goal is portability. FreeBSD's declared target is ease of use.

1.1.2 FreeBSD’s Present State

Since 1994, FreeBSD has developed many distinctive features. A discussion of the advantages of FreeBSD can be found, for example, in [3]. Summing up, FreeBSD is

• a complete, lean, and easy to install and to use operating system.

• available for a decent set of architectures.

• POSIX compliant.


• fast, offers decent security, and is well integrated with a number of networking tools. Together with the TCP/IP reference implementation, these features render FreeBSD perfectly suited for use in a network. It is among the most often used operating systems on server machines like web servers, file servers, etc.

• capable of symmetric multiprocessing (SMP) and offers decent scalability. Recently, SMP capability has been boosted by the introduction of a new m : n application level threading system (see Section 2.3.2) and some reworking of kernel subsystems to allow concurrent execution (such as the file system and the network protocol stack). Thus, it is also well-suited for running on SMP machines.

FreeBSD additionally offers

• a Linux binary compatibility module. This feature should ease the transition to FreeBSD for Linux users.

• relatively good documentation. In comparison to other (open) operating system projects, the kernel documentation provided by the FreeBSD project is very good. There is even a book dedicated to “The Design and Implementation of the FreeBSD Operating System” [4].

All in all, FreeBSD seems to stand out as a particularly fine operating system without major limitations or defects. It is, of course, not a swiss army knife. It does not match, for example, OpenBSD in terms of security, or NetBSD in the number of supported architectures (this is natural since those two lay their respective focus on these areas). But even there, FreeBSD is said to perform reasonably well, and probably even very well.

1.2 Document Conventions

When kernel data structures or functions are referred to in this document, their names are set in typewriter style. When mentioned for the first time, we specify their source file in the form [dir/file], where dir/ is the subdirectory (or possibly a subdirectory path) in the kernel root source directory, and file is the file name. The root source directory in a full FreeBSD installation is located in /usr/src/sys/.

For each supported architecture, architecture-dependent sources and binaries are kept in a subdirectory of sys/ with the same name as the architecture. All 32-bit Intel specifics are located in sys/i386/, for example. This directory again contains a subdirectory with the name i386/, which we refer to by machine/.

When we discuss listings, we only present lines that are relevant in the current context. These lines are then copy-pasted unmodified from the kernel source (including comments). One or more left-out lines of code are represented by “(...)”. The line numbers in listings do not correspond to source code line numbers.

2 Process Management

FreeBSD supports multiprogramming (including multithreading) for one or multiple central processing units (CPUs). Threading of processes is supported through a hybrid model between user-level and kernel-level threads, which is similar to Scheduler Activations [5]. This model is called the Kernel Scheduled Entities1 (KSE) threading system and is described in [7].

We start by introducing the concepts used in FreeBSD to implement processes and threads. This is followed by descriptions of the involved data structures: proc, thread, ksegrp, and kse. Then we briefly describe how the kernel does scheduling, especially when running on a multiprocessor machine. Another section is dedicated to the most important system calls that allow a user level program to interact with the process management subsystem. Finally, we present a short section on deadlock handling in the FreeBSD system.

1On the official website of the FreeBSD Project [6] it is also called Kernel Scheduler Entities.


2.1 Processes and Threads

As said above, FreeBSD uses a hybrid threading model, meaning that scheduling is not entirely done by the kernel but (optionally) in co-operation with a user level scheduler. Conceptually, we have to distinguish between the user's and the kernel's view of the threading model. From the application programmer's point of view, there are only the classical entities process and thread2. The kernel additionally knows two other entities, namely KSEs and KSE groups (KSEGs). These are used by the kernel scheduler and the user level thread scheduler (UTS) in order to jointly do the scheduling.

The relationships between these entities are shown in Figure 1.

[Figure 1: Overview of entities involved in process management. A process owns between 1 and limit KSEGs; each KSEG owns between 1 and #CPUs KSEs and n threads.]

A process consists of one or more KSEGs, which are requested and released through system calls. These KSEGs receive scheduling quanta (which are assigned by the kernel scheduler, see Section 2.3). Therefore, a process has a variable share of CPU time, which is limited by a maximum number (limit) of KSEGs per process.

Each thread belongs to exactly one KSEG. This enables system level scheduling contention, meaning that two threads of different KSEGs (possibly belonging to the same process) can compete for the same CPU quantum.

A KSEG has one or more KSEs, which are again requested and released through system calls. A KSE can be understood as a virtual CPU, being assigned to a physical CPU by the kernel scheduler. The UTS is responsible for assigning threads of a KSEG to its KSEs. The number of KSEs in a KSEG is called the KSEG's concurrency level, indicating that threads of the same KSEG can physically execute concurrently on multiple CPUs. Allowing the concurrency level to grow beyond the number of physical CPUs in the system would not make sense, since it is not possible to execute more threads in parallel than there are CPUs.

2.2 Kernel Structures

Now we focus on some implementation aspects of the discussed concepts. Most data structures presented in this section can be found in [sys/proc.h]; only the struct kse is located in [kern/sched_ule.c].

2.2.1 Process

We start our survey of data structures with the “old fashioned process”, as it is called in the code comments. Since this data structure is very large, we only present a small part of it.

 1 struct proc {
 2     LIST_ENTRY(proc) p_list;          /* (d) List of all processes. */
 3     TAILQ_HEAD(, ksegrp) p_ksegrps;   /* (c)(kg_ksegrp) All KSEGs. */
 4     (...)
 5     struct ucred *p_ucred;            /* (c) Process owner's identity. */
 6     struct filedesc *p_fd;            /* (b) Ptr to open files structure. */
 7     (...)
 8     struct pstats *p_stats;           /* (b) Accounting/statistics (CPU). */
 9     struct plimit *p_limit;           /* (c) Process limits. */
10     (...)
11     enum {
12         PRS_NEW = 0,                  /* In creation */
13         PRS_NORMAL,                   /* threads can be run. */
14         PRS_ZOMBIE
15     } p_state;                        /* (j/c) S* process status. */
16     pid_t p_pid;                      /* (b) Process identifier. */
17     (...)
18     struct vmspace *p_vmspace;        /* (b) Address space. */
19     (...)
20 };

Listing 1: Process

2Here, we use the term “application” in a strict sense. The user level scheduler also has to use another data structure, the KSE, though only through anonymous handles. But this will most often be encapsulated in the user level scheduling library used by a program.

Depending on its state, each struct proc is a member of a list of processes (line 2 in Listing 1). In line 3, a list of the process's KSEGs is kept. The user credentials (line 5) basically contain the user ID and group memberships of the process's owner. Line 6 contains the file descriptor table. In line 8, the standard process statistics (p_stats) that are written to an accounting file on exit are declared. The process resource limit (line 9) is basically an array of the maximum CPU time, file size, number of open files, etc. a process may hold.

In lines 11–15, the three possible process states are declared. PRS_NEW and PRS_ZOMBIE processes are currently undergoing creation or termination, respectively. A PRS_NORMAL process has runnable threads. The next field (line 16) is the process identifier.

A very important field is p_vmspace in line 18, which represents the address space of the process and is discussed in detail in Chapter 3.

2.2.2 Thread

In texts on operating systems, a thread is often referred to as the “unit of dispatching” (in contrast to the “unit of resource ownership”, the process). The FreeBSD developers put it this way in the code comments: “[A thread] is what is put to sleep and reactivated”. A thread knows its associated process and KSEG (lines 2–3 in Listing 2) and is entered in the list of threads of the current process and KSEG, respectively (lines 4–5). In line 7, a reference to the thread's mailbox can be found, which is used for passing thread-based scheduling information from the kernel scheduler to the UTS.

1 struct thread {
2     struct proc *td_proc;               /* (*) Associated process. */
3     struct ksegrp *td_ksegrp;           /* (*) Associated KSEG. */
4     TAILQ_ENTRY(thread) td_plist;       /* (*) All threads in this proc. */
5     TAILQ_ENTRY(thread) td_kglist;      /* (*) All threads in this ksegrp. */
6     (...)
7     struct kse_thr_mailbox *td_mailbox; /* (*) Userland mailbox address. */
8     (...)
9 };

Listing 2: Thread

2.2.3 KSEG

In Listing 3, the first field of the KSEG references the process it belongs to, and the second ties all KSEGs of one process together in one queue. The fields in lines 4 and 5 implement queues of threads belonging to this KSEG; one contains all threads and the other only runnable ones. The fields in lines 7 to 9 are used for communication between the kernel scheduler and the user level scheduler. The scheduling parameters are contained in lines 11 and 12. Line 14 contains scheduler specific data, such as the time quantum and the concurrency level.


 1 struct ksegrp {
 2     struct proc *kg_proc;            /* (*) Proc that contains this ksegrp. */
 3     TAILQ_ENTRY(ksegrp) kg_ksegrp;   /* (*) Queue of ksegrps in kg_proc. */
 4     TAILQ_HEAD(, thread) kg_threads; /* (td_kglist) All threads. */
 5     TAILQ_HEAD(, thread) kg_runq;    /* (td_runq) waiting RUNNABLE threads */
 6     (...)
 7     struct kse_thr_mailbox *kg_completed; /* (c) Completed thread mboxes. */
 8     int kg_nextupcall;               /* (n) Next upcall time. */
 9     int kg_upquantum;                /* (n) Quantum to schedule an upcall. */
10     (...)
11     u_char kg_pri_class;             /* (j) Scheduling class. */
12     u_char kg_user_pri;              /* (j) User pri from estcpu and nice. */
13     (...)
14     struct kg_sched *kg_sched;       /* (*) Scheduler-specific data. */
15 };

Listing 3: KSEG

2.2.4 KSE

The KSE structure in Listing 4 has three fields of interest: it is a member of a run queue used by the scheduler, it can have an associated active thread (set by the UTS), and it remembers the CPU that it should be assigned to (by the kernel scheduler).

1 struct kse {
2     TAILQ_ENTRY(kse) ke_procq; /* (j/z) Run queue. */
3     (...)
4     struct thread *ke_thread;  /* (*) Active associated thread. */
5     (...)
6     u_char ke_cpu;             /* CPU that we have affinity for. */
7 };

Listing 4: KSE

2.3 Scheduling

Programs can be characterized according to the amount of computation and the amount of I/O that they do. Scheduling policies typically attempt to balance resource utilization against the time that it takes for a program to complete.

FreeBSD's default scheduler is biased to favor interactive programs, such as text editors, over long-running batch-type jobs. Processes that execute for the duration of their time slice have their priority lowered, whereas processes that give up the CPU (usually because they do I/O) are allowed to remain at their priority. Thus, jobs that use large amounts of CPU time sink rapidly to a low priority, whereas interactive jobs that are mostly inactive remain at a high priority so that, when they are ready to run, they will preempt the long-running lower-priority jobs. FreeBSD also supports scheduling of real-time jobs.
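This penalize-the-CPU-hog behavior can be sketched in a few lines of C. The function name, the constants, and the priority scale below are purely illustrative and not taken from the kernel sources; they only show the direction in which priorities move.

```c
#include <assert.h>

#define PRI_MAX 127   /* hypothetical scale: numerically high = low priority */

/* Lower the priority of a thread that used up its whole time slice;
 * a thread that gave up the CPU early keeps its current priority. */
int adjust_priority(int prio, int used_full_slice)
{
    if (used_full_slice) {
        prio += 4;                /* penalize CPU-bound behavior */
        if (prio > PRI_MAX)
            prio = PRI_MAX;       /* clamp at the lowest priority */
    }
    return prio;
}
```

Repeatedly applying adjust_priority() to a CPU-bound job drives its priority toward PRI_MAX, while a mostly sleeping interactive job keeps its priority, matching the behavior described above.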

The system also needs a scheduling policy to deal with problems that arise from not having enough main memory to hold the execution contexts of all processes that want to execute. The scheduler tries to avoid this phenomenon, called thrashing, controlled by some thresholds and in co-operation with the VM subsystem. The last process causing a lot of swapping is marked as not being allowed to run, which allows the pageout daemon to push all the pages associated with the process to backing store. The memory freed by blocking the process can then be distributed to the remaining processes, which usually can then proceed.

Because a busy system makes thousands of scheduling decisions per second, the speed with which scheduling decisions are made is critical to the performance of the system as a whole. Therefore, FreeBSD allows selection of the scheduler (see function prototypes in [sys/sched.h]) only at compile time, and the actual scheduling code is streamlined for short execution time.

Together with the management of threads and resources, FreeBSD also incorporates a priority inheritance mechanism to prevent priority inversion.

2.3.1 Kernel and User Level Scheduling

As already stated, the actual scheduling of threads is done in cooperation between the kernel scheduler and a user level thread scheduler. The usage of a UTS is optional and has to be activated through a system call; if no UTS is present, a user application operates in single threaded mode.

Kernel Scheduler The kernel scheduler actually deals with KSEs when a runnable thread is associated with them, and with threads when they are blocked in a system call, but we use the term thread in both cases for simplification.

Threads can be of one of three scheduling classes: time-share, real-time, and idle, which are mapped to a range of priorities (see [sys/priority.h]). For time-share threads, a so-called interactivity score is periodically recalculated based on the amount of CPU time a thread has used and the time it voluntarily slept. The dynamic priority of such a thread and the size of its time slice are then determined by this interactivity score. For real-time threads, the scheduler uses a separate range of priorities, and their priority is not subject to dynamic priority adaptation. Additionally, they will not be preempted by another thread of smaller or equal real-time priority. A thread running at idle priority will run only when no other real-time or time-share thread is runnable and its idle priority is equal to or greater than that of all other runnable idle priority threads.
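The idea behind the interactivity score can be illustrated with a simplified calculation; the formula and the scale below are our own sketch, not the kernel's actual computation (which lives in sched_interact_score() in [kern/sched_ule.c]). A low score marks an interactive thread, a high score a batch-like one.

```c
#include <assert.h>

#define INTERACT_MAX 100  /* hypothetical scale: low score = interactive */

/* Simplified interactivity score: the ratio of voluntary sleep time to
 * CPU time decides whether a thread looks interactive or CPU-bound. */
int interact_score(int runtime, int slptime)
{
    if (slptime >= runtime)   /* mostly sleeping: interactive, low score */
        return (INTERACT_MAX / 2) * runtime / (slptime ? slptime : 1);
    /* mostly running: batch-like, high score */
    return INTERACT_MAX - (INTERACT_MAX / 2) * slptime / runtime;
}
```

A thread that slept 90 time units and ran 10 scores near the interactive end of the scale, while the inverse ratio scores near the batch end; equal run and sleep time lands in the middle.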

For enhanced SMP support (see Section 2.3.2) there is a run-queue for each CPU (struct kseq in [kern/sched_ule.c]), which is internally structured as follows. One queue is the idle queue, where all idle threads are stored. The other two queues are designated current and next and are used for time-share and real-time threads. Threads are picked to run for a certain time slice, in priority order, from the current queue until it is empty, at which point the current and next queues are swapped. The period from one queue switch to the next could be compared to an epoch known from other scheduling algorithms.

Real-time and interrupt threads are always inserted into the current queue so that they will have the least possible scheduling latency. The insertion of time-share threads depends on their interactivity score. If it is bigger than a certain threshold, they are put on the current queue to improve their interactive response. Otherwise they are put into the next queue and are scheduled to run when the queues are switched. Switching the queues guarantees that a thread gets to run at least once every two queue switches regardless of priority. Thereby an almost fair sharing of the processor is ensured, because the time from one queue switch to the next is more or less limited.
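The current/next queue pair can be sketched as follows; the structure and function names are illustrative and deliberately much simpler than the kernel's struct kseq.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the current/next queue pair: threads are consumed from
 * "current", most new time-share work goes to "next", and the two
 * pointers are swapped when "current" drains (an epoch boundary). */
struct runq { int nthreads; };

struct kseq_sketch {
    struct runq q[2];
    struct runq *curr, *next;
};

/* Return the queue to pick the next thread from, swapping on drain. */
struct runq *pick_queue(struct kseq_sketch *kq)
{
    if (kq->curr->nthreads == 0) {
        struct runq *tmp = kq->curr;   /* epoch boundary: swap queues */
        kq->curr = kq->next;
        kq->next = tmp;
    }
    return kq->curr;
}
```

Because everything waiting in next becomes current after one swap, no thread can be starved for more than two queue switches, which is the fairness property described above.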

Scheduler Communication In order to perform meaningful scheduling decisions, the two schedulers have to communicate with each other. The UTS makes use of the appropriate system calls to inform the kernel of its demands, such as obtaining a new KSE, or binding a KSE to a certain processor. The kernel scheduler uses so-called upcalls and mailboxes to inform the UTS of scheduling events, such as a blocking thread.

On creation of a new KSE, the UTS has to specify a mailbox (see struct kse_mailbox in [sys/kse.h]) where it wants to receive notifications from the kernel scheduler. The mailbox serves as a handle to a certain KSE. This data structure also contains a pointer to a function, which will be executed by an upcall. This mechanism radically changes the flow of control of a user level program, because upcalls can occur asynchronously. The kernel issues such upcalls on the following events: a KSE has consumed its time slice, or it has voluntarily given up control by invoking any blocking system call or causing a page fault (see Section 3.3.3). Additionally, the UTS will be informed if a previously blocked thread is able to continue.

Userlevel Thread Scheduler Even if we only describe the kernel part of the FreeBSD system, we want to give a brief overview of what the UTS has to perform upon receiving an upcall that says that a certain thread has relinquished control [7]:

1. Find the highest priority thread of the notified thread's KSEG. Optionally, heuristically try to improve cache locality by running a thread whose data may still be partially present in the processor cache.


2. Set a timer that will indicate the end of the thread's scheduling quantum.

3. Switch to the selected thread.
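The three steps above can be sketched as a single selection routine; the struct uthread type, its fields, and the quantum value are hypothetical stand-ins for whatever bookkeeping a real UTS keeps per thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread state kept by a UTS. A lower prio value
 * means a higher priority. */
struct uthread {
    int prio;
    int runnable;
    int quantum;
};

/* React to an upcall: pick the best runnable thread of the KSEG
 * (step 1), arm its quantum (step 2), and hand it back so the caller
 * can switch to it (step 3). */
struct uthread *uts_pick(struct uthread *t, size_t n)
{
    struct uthread *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (t[i].runnable && (best == NULL || t[i].prio < best->prio))
            best = &t[i];          /* step 1: highest priority runnable */
    if (best != NULL)
        best->quantum = 10;        /* step 2: set the quantum "timer" */
    return best;                   /* step 3: caller switches to it */
}
```

The cache-locality heuristic mentioned in step 1 would only change the tie-breaking inside the loop; it is omitted here for brevity.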

2.3.2 SMP Support

The current scheduler (see [kern/sched_ule.c]) is called ULE [8] and has been developed to support SMP by addressing the following issues:

• The need for processor affinity in SMP systems.

• Providing better support for symmetric multithreading (SMT) processors with multiple on-chip CPU cores.

• Improving the performance of the scheduling algorithm so that it is no longer dependent on the number of threads in the system.

Typically there are many runnable threads competing for a few processors, and the job of the scheduler is to ensure that the CPUs are always busy and are not wasting their cycles. Trying to preserve processor affinity is a must in an SMP scheduler, taking into account the cost of migrating a thread from one CPU to another. ULE enables processor affinity by maintaining a run queue for each CPU in the system. This also eliminates the need to iterate a global list of all runnable threads in the system when making a scheduling decision (as in the original FreeBSD scheduler, see [kern/sched_4bsd.c]).

There are two mechanisms used to migrate threads among multiple processors. The so-called pull migration is done when a CPU has no work to do and sets an idle flag readable by all processors. When an active CPU is about to add work to its own run queue and has excess work, it checks if another processor in the system is idle. If an idle processor is found, then the thread is migrated to the idle processor using an interprocessor interrupt (IPI). Seeking out idle processors by only inspecting these shared flags when adding a new task works well, and it spreads the load when it is presented to the system.
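A minimal sketch of the pull-migration decision, assuming plain flags and counters instead of the kernel's real per-CPU structures (and omitting the IPI and all locking):

```c
#include <assert.h>

#define NCPU 4

int idle_flag[NCPU];  /* set by a CPU that ran out of work */
int queue_len[NCPU];  /* length of each CPU's run queue */

/* Enqueue a new thread on behalf of CPU `self`; returns the CPU the
 * thread was placed on. With excess local work, the idle flags are
 * scanned and the thread is handed to an idle CPU instead (in the
 * kernel this handover happens via an interprocessor interrupt). */
int enqueue_thread(int self)
{
    if (queue_len[self] > 1) {          /* we already have excess work */
        for (int cpu = 0; cpu < NCPU; cpu++)
            if (cpu != self && idle_flag[cpu]) {
                idle_flag[cpu] = 0;     /* migrate to the idle CPU */
                queue_len[cpu]++;
                return cpu;
            }
    }
    queue_len[self]++;                  /* default: keep affinity */
    return self;
}
```

Note that the idle flags are only consulted when new work arrives, which is exactly why this mechanism is cheap: no CPU ever scans other run queues on the fast path.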

The second form of migration, called push migration, is done by the system on a periodic basis and more aggressively offloads work to other processors in the system, paying less attention to processor affinity. Twice per second, a routine picks the most-loaded and least-loaded processors in the system and equalizes their run queues. Push migration ensures fairness among the runnable threads, because every thread has to share the CPU it is running on with a (periodically balanced) number of other threads.

2.4 Kernel Interface

A number of system calls are actually dedicated to process management, but we only take the most important ones into consideration, which are used for creating a new process and loading a new program.

2.4.1 Process Creation

As in all UNIX systems, new processes in FreeBSD are created with the fork family of system calls. The system call fork (see [kern/kern_fork.c]) creates a complete copy of the parent process. The rfork system call creates a new process entry that shares a selected set of resources from its parent rather than making copies of everything. So it does not copy the virtual memory resources but initializes its memory with the text and data of a given program.

A call to fork returns the child process ID to the parent and zero to the child process. Thus, a program can identify whether it is the parent or child process after a fork by checking this return value. Forking involves three main steps:

1. Allocating and initializing a new process structure for the child process

2. Duplicating the context of the parent for the child process

3. Scheduling the child process to run
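The return-value convention described above is visible in any user program. The following self-contained sketch (the function name and the exit status 42 are our own choices) forks once and shows how the two return values tell parent and child apart:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork once; the child exits with a known status, the parent waits
 * for it and reports that status. Returns -1 on any failure. */
int run_fork_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                /* fork failed */
    if (pid == 0)
        _exit(42);                /* child: fork() returned 0 */
    int status;                   /* parent: fork() returned the child's PID */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child's side of the branch is where step 2 above becomes visible: the duplicated address space lets the child run the same code as the parent until it chooses to exit or exec a new program.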

The second step includes copying the parent's address space. To duplicate a process's memory footprint, the kernel invokes the memory management facilities through a call to vm_forkproc(), which allocates all the memory resources that the child will need to execute (described in Section 3.3.2).


The kernel begins by allocating memory for the new process and thread structures. They are initialized in three steps: one part is copied from the parent's corresponding structure, another part is zeroed, and the rest is explicitly initialized. The zeroed fields include recent CPU utilization, swap and sleep time, timers, tracing, and pending-signal information. The copied portions include the following:

• A reference to the parent’s set of open files

• A reference to the parent’s user credential

• A reference to the parent’s limits

The explicitly set information includes:

• A pointer to the process’ statistics structure, allocated in its user structure

• A pointer to the process’ signal actions structure, allocated in its user structure

• A new PID for the process

The new process is inserted into the following collections:

• The list of all processes

• The child list of the parent

• The parent’s process group list

• The hash structure that allows the process to be looked up by its PID

The new PID must be unique among all processes. Therefore, FreeBSD maintains a range of unallocated PIDs between lastpid and pidchecked. It allocates a new PID by incrementing and then using the value of lastpid. When the newly selected PID reaches pidchecked, the system calculates a new range of unused PIDs by making a single scan of all existing processes.
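The lastpid/pidchecked idea can be sketched as follows. The starting values and the rescan function are stand-ins; in the kernel the rescan walks the list of all processes to find the next range of genuinely unused PIDs.

```c
#include <assert.h>

static int lastpid = 100;     /* last PID handed out (illustrative value) */
static int pidchecked = 105;  /* end of the range known to be free */

/* Stand-in for the scan of all existing processes: pretend it finds
 * the next 100 PIDs free. */
static int rescan_pids(void)
{
    return pidchecked + 100;
}

/* Allocate a new PID: the common case is a bare increment; only when
 * the free range is exhausted does the (expensive) rescan run. */
int alloc_pid(void)
{
    if (++lastpid >= pidchecked)
        pidchecked = rescan_pids();
    return lastpid;
}
```

The point of the scheme is that the single scan is amortized over a whole range of allocations, so the common fork path pays only one increment.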

Once the child process is fully built, its thread is made known to the scheduler by being placed on the run queue.

2.4.2 Loading a Program

Loading a new program into memory and starting to execute it is done, as in other UNIX systems, via the system call execve()3 [sys/sysproto.h], which takes a proc structure and an execve_args structure as parameters, as shown in Listing 5.

1 int execve(struct proc *, struct execve_args *);
2
3 struct execve_args {
4     char *fname;
5     char **argv;
6     char **envv;
7 };

Listing 5: Function prototype and parameter type of the execve function

This system call actually transforms the calling process into a new process by replacing its memory with an executable file, whose name is specified as a parameter (see line 4 in Listing 5). This file is either an executable object file, or a file of data for an interpreter. An executable object file consists of an identifying header, followed by pages of data representing the initial program (text) and initialized data pages. An interpreter file begins with a line of the form #! interpreter. In such a case the system actually execves the specified interpreter, and the name of the originally execved file becomes the first argument.

The field argv is a pointer to a null-terminated array of character pointers to null-terminated character strings. These strings construct the argument list to be made available to the new process. At least one

3 execve() stands for execute vector with environment [9]. There are multiple slightly different frontends available for this system call (see the man page on execv [10]).


argument must be present in the array; by custom, the first element should be the name of the executed program (for example, the last component of path).

The field envv (commonly called envp) is also a pointer to a null-terminated array of character pointers to null-terminated strings. A pointer to this array is normally stored in the global variable environ. These strings pass information to the new process that is not directly an argument to the command.

As the execve() system call overlays the current process image with a new process image, a successful call has no process to return to. When unsuccessful, the return value informs the calling process about the error that occurred.
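From user level, execve() is therefore almost always combined with fork(), since the caller's own image would otherwise be replaced. A minimal POSIX sketch (the helper name run() and the environment string are our own inventions; error handling is abbreviated):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a program in a child process and return its exit status.
 * On success, execve() does not return in the child; if it does
 * return, the exec failed and the child exits with an error code. */
int run(char *path, char *argv[])
{
    char *envp[] = { "GREETING=hello", NULL };  /* environment for the new image */
    pid_t pid = fork();

    if (pid == -1) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {                 /* child: replace our image */
        execve(path, argv, envp);
        perror("execve");           /* only reached on failure */
        _exit(127);
    }
    int status;                     /* parent: wait for the child */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On conventional systems, running an always-succeeding program such as /bin/true through this helper yields exit status 0.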

If the set-user-ID mode bit of the new process image file is set (see filesystem, file access rights, Section 4.2.2), the effective user ID of the new process image is set to the owner ID of the new process image file. If the set-group-ID mode bit of the new process image file is set, the effective group ID of the new process image is set to the group ID of the new process image file.

The new process also inherits many attributes from the calling process, such as parent process ID, process group ID, access groups, working directory, resource limits and file mode mask.

2.5 Deadlocks

FreeBSD does not employ a general strategy to prevent or detect deadlocks (known as the Ostrich algorithm [1]). However, in some places of the implementation, the possibility of deadlocks is explicitly considered and steps toward deadlock prevention are taken. First to mention is the locking subsystem, which includes a mechanism that checks for a possible loop in the sleep chains on acquiring a lock, in order to prevent a resulting deadlock ([sys/lockf.h] line 68f. and [kern/kern_lockf.c] line 54ff.). When resolving a pathname in a file system, each directory node in the hierarchy is locked in turn. If the file system is not maintained in a strict tree hierarchy, this can result in a deadlock situation (see comments in [ufs/ufs/ufs_lookup.c] line 83, [gnu/fs/ext2fs/ext2_lookup.c] line 264 and [fs/msdosfs/msdosfs_lookup.c] line 509). In the memory management subsystem, there are some references to the prevention of low-memory deadlocks (e. g., [sys/vmmeter.h], line 121f. and [vm/swap_pager.c], line 1139ff.). Further examples are the paging daemon ([vm/vm_pageout.c], see comment in line 1175) and the page fault handler routine ([vm/vm_fault.c], comment in line 283).
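FreeBSD's witness(4) facility, enabled in debug kernels, takes this idea further by checking lock acquisition order at run time to catch potential cycles. A drastically simplified user-level model of the idea (the ranks, structures, and names here are invented for illustration):

```c
/* A witness-style lock-order check: every lock is assigned a rank,
 * and acquiring a lock whose rank is not greater than the rank of
 * the most recently acquired held lock is flagged as a potential
 * deadlock (a possible circular wait). */
#define MAXHELD 8

struct lockctx {
    int held[MAXHELD];   /* ranks of locks currently held, in order */
    int nheld;
};

/* Returns 0 and records the lock on success, -1 on an order violation. */
int acquire(struct lockctx *ctx, int rank)
{
    if (ctx->nheld > 0 && rank <= ctx->held[ctx->nheld - 1])
        return -1;                       /* out of order: possible cycle */
    ctx->held[ctx->nheld++] = rank;
    return 0;
}

void release(struct lockctx *ctx)
{
    if (ctx->nheld > 0)
        ctx->nheld--;                    /* strict LIFO release, for simplicity */
}
```

If every thread respects a single global rank order, no circular wait, and hence no deadlock, can arise among these locks.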

3 Memory Management

FreeBSD supports virtual memory (VM) including paging and swapping. Conceptually, each process (and the kernel) exclusively uses its own contiguous virtual address space. Although there are some differences between kernel and user process virtual memory, we will limit our discussion to the latter.

Basically, the VM system can be divided into two parts: a small architecture-dependent (or machine-dependent) part, and a larger machine-independent part. The machine-independent part uniformly maintains the virtual to physical page mapping data. The machine-dependent subsystem interacts with the memory management unit (MMU) by setting up page tables in the required format. Put simply, the machine-independent subsystem knows what to map, and the machine-dependent subsystem knows how.

Since each process has its own virtual address space, VM management has to be done on a per-process basis. Consequently, each process (proc [sys/proc.h], Section 2.2.1) has a reference to its vmspace data structure, containing a (machine-independent) VM address map (vm_map [vm/vm_map.h]) and a (machine-dependent) physical map (pmap [vm/pmap.h]). We will now discuss these subsystems in detail.

3.1 Architecture Independent Subsystem

The architecture-independent part of FreeBSD's VM system is shared by all supported processors. It contains high-level functionality, such as managing a process' file mappings, requesting data from secondary storage, demand paging, managing the allocation of physical memory, and managing copy-on-write memory. All information for mapping virtual to physical pages is completely stored in machine-independent data structures. Machine-specific page tables can be fully constructed based on this information (and are usually considered throwaway).


3.1.1 Overview

We will start our survey of the major machine-independent data structures by giving an organizational overview. As shown in Figure 2, each process' vmspace [vm/vm_map.h] data structure (see Section 3.1.2) contains a reference to a VM address map vm_map (Section 3.1.3), which is a set of map entries (vm_map_entry [vm/vm_map.h], Section 3.1.4).

Figure 2: Data structures of the VM system.

A vm_map_entry represents a contiguous area of the virtual address space (of the size of one or multiple pages) and is the unit of memory protection, meaning that all pages within a map entry have the same protection level. A map entry knows nothing about the actual content of the pages it manages, but defers this knowledge to its memory object (vm_object [vm/vm_object.h]).

A vm_object (see Section 3.1.5) maintains the pages associated with allocated virtual memory. It describes an area of anonymous memory, a memory mapped file, or a device mapped into the virtual address space. Memory objects know which of the maintained pages are currently present in physical memory and which are not.

A memory object uses a pager ([vm/pager.h], see Section 3.1.6) to transfer pages from backing storage to physical memory when necessary. The different pagers provide a generic interface that allows memory objects to access various backing stores (swap-backed on a swap partition, physical device-backed for memory-mapped I/O, or file-backed for memory-mapped files) in a uniform way.

A vm_page [vm/vm_page.h] (see Section 3.1.7) ultimately describes a frame of physical memory. Every frame is represented by a data structure of type vm_page that is initialized on system start-up ([vm/vm_init.c]).

3.1.2 VM Space

A VM space represents a process' virtual memory. As shown in Figure 2, it maintains references to vm_map and pmap (lines 2–3 in Listing 6).

This data structure reveals that FreeBSD does not support segmentation in the sense of multiple independent address spaces for one process. Instead, the classical memory layout of text, data and stack segments within the same address space is used, as indicated by the declarations in lines 5–10.

1 struct vmspace {
2         struct vm_map vm_map;           /* VM address map */
3         struct pmap vm_pmap;            /* private physical map */
4         (...)
5         segsz_t vm_tsize;               /* text size (pages) XXX */
6         segsz_t vm_dsize;               /* data size (pages) XXX */
7         segsz_t vm_ssize;               /* stack size (pages) */
8         caddr_t vm_taddr;               /* (c) user virtual address of text */
9         caddr_t vm_daddr;               /* (c) user virtual address of data */
10        caddr_t vm_maxsaddr;            /* user VA at max stack growth */
11        (...)
12 };

Listing 6: VM space

3.1.3 VM Map

A vm_map knows the entire virtual address space of a process. All virtual memory content that has been previously accessed (and therefore must be mapped) is managed by this data structure (and its substructures). As said above, the map is a set of map entries. These entries are organized both as a balanced binary search tree and as a doubly linked linear list. Both are ordered by the memory area they cover. This double organization is done for performance reasons: one can benefit from the binary tree organization when searching for a certain page, while one can quickly obtain neighboring entries using the list organization. References to these data structures can be found in Listing 7 in lines 2 and 11.

Furthermore, a vm_map contains locks to prevent concurrent modification of the content (lines 3–4), some meta-information (lines 5–10), and a reference to the physical map (line 12). The “version number” in line 7 is an access counter rather than a time stamp (as suggested by the variable name) and is used to detect concurrent modification.

1 struct vm_map {
2         struct vm_map_entry header;     /* List of entries */
3         struct sx lock;                 /* Lock for map data */
4         struct mtx system_mtx;
5         int nentries;                   /* Number of entries */
6         vm_size_t size;                 /* virtual size */
7         u_int timestamp;                /* Version number */
8         u_char needs_wakeup;
9         u_char system_map;              /* Am I a system map? */
10        vm_flags_t flags;               /* flags for this vm_map */
11        vm_map_entry_t root;            /* Root of a binary search tree */
12        pmap_t pmap;                    /* (c) Physical map */
13        (...)
14 };

Listing 7: VM map
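The address-range lookup this map supports can be modeled at user level. The sketch below walks only the sorted list; the kernel would answer the same query by descending the balanced tree and use the list for neighbor traversal (all names here are ours, not kernel identifiers):

```c
#include <stddef.h>

/* Simplified map entry: a contiguous virtual address range.
 * The real vm_map_entry also carries tree links, protection
 * bits, and a reference to its memory object. */
struct map_entry {
    struct map_entry *next;    /* linked-list order by address */
    unsigned long start, end;  /* covers [start, end) */
};

/* Find the entry covering addr by walking the sorted list. */
struct map_entry *lookup(struct map_entry *head, unsigned long addr)
{
    for (struct map_entry *e = head; e != NULL; e = e->next) {
        if (addr < e->start)
            break;              /* list is sorted: no later entry can match */
        if (addr < e->end)
            return e;
    }
    return NULL;                /* unmapped address: would raise a page fault */
}
```

A NULL result corresponds to an access outside any mapped region, which the fault handler would reject.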

3.1.4 VM Map Entry

A map entry’s purpose is to group pages with the same protection level. Map entries (Listing 8) for user processes have associated permissions (none, read, write, execute, copy-on-write, see [vm/vm.h]), expressed through the protection flags in lines 13–14 (kernel maps differ in that they do not use memory protection, since kernel memory is only accessed by the kernel itself). Lines 2–5 implement the list and tree structures in which entries are organized (see Section 3.1.3). Lines 6–8 describe the range of virtual addresses this map entry represents. The memory object that maintains the contents of the represented memory is referenced in line 10.

1 struct vm_map_entry {
2         struct vm_map_entry *prev;      /* previous entry */
3         struct vm_map_entry *next;      /* next entry */
4         struct vm_map_entry *left;      /* left child in binary search tree */
5         struct vm_map_entry *right;     /* right child in binary search tree */
6         vm_offset_t start;              /* start address */
7         vm_offset_t end;                /* end address */
8         vm_offset_t avail_ssize;        /* amt can grow if this is a stack */
9         (...)
10        union vm_map_object object;     /* object I point to */
11        (...)
12        /* Only in task maps: */
13        vm_prot_t protection;           /* protection code */
14        vm_prot_t max_protection;       /* maximum protection */
15        (...)
16 };

Listing 8: VM map entry

3.1.5 Memory Object

Memory objects maintain the contents of virtual memory. They know the pages that are currently present in physical memory (lines 5, 6, 12 in Listing 9), which are redundantly organized as a list and a binary search tree, similar to the map entry organization in a vm_map. To swap in non-present pages from backing storage, a memory object uses its pager (lines 15–17), whose type (line 10) is fixed at object creation time.

Memory objects can be shadowed (or chained). When pages are copied using copy-on-write semantics, a new memory object is created that represents the (yet only conceptually) copied data. The new object is said to shadow the old one. Initially, the new object does not contain any pages but forwards address translation requests to the shadowed memory object. Once one of the pages is modified, it is actually copied according to copy-on-write semantics. The shadowing memory object now answers address translation requests for the new page itself and still forwards requests for unmodified pages to the shadowed memory object. A known problem of this technique is that it may lead to growing object chains and to long lookup times when searching for a certain address mapping. Fields related to chaining can be found in lines 3, 4, 9 and 13.
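The fall-through lookup along a shadow chain can be modeled in a few lines of user-level C (object layout and names are invented; real vm_objects hold vm_page structures and a pager rather than string pointers):

```c
#include <stddef.h>

/* Toy memory object: a tiny page table plus a backing object. */
#define NPAGES 4

struct object {
    const char *pages[NPAGES];   /* NULL = page not present in this object */
    struct object *backing;      /* object we shadow, or NULL */
};

/* Resolve a page index by walking down the shadow chain, as done
 * after a fault on copy-on-write memory. */
const char *resolve(struct object *obj, int idx)
{
    for (; obj != NULL; obj = obj->backing)
        if (obj->pages[idx] != NULL)
            return obj->pages[idx];
    return NULL;                 /* would be fetched via the pager */
}

/* A write under copy-on-write semantics: install the new content
 * in the topmost (shadowing) object only. */
void cow_write(struct object *obj, int idx, const char *data)
{
    obj->pages[idx] = data;
}
```

The shadowed object is never modified; every chain walk stops at the first object that holds the page, which is exactly why long chains make lookups slow.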

Since a memory object can be referenced by many map entries (when shared memory is used), it needs a reference count in order to determine when it can be disposed of. This reference count can be found in line 8.

1 struct vm_object {
2         (...)
3         LIST_HEAD(, vm_object) shadow_head;   /* objects that this is a shadow for */
4         LIST_ENTRY(vm_object) shadow_list;    /* chain of shadow objects */
5         TAILQ_HEAD(, vm_page) memq;           /* list of resident pages */
6         vm_page_t root;                       /* root of the resident page splay tree */
7         (...)
8         int ref_count;                        /* How many refs?? */
9         int shadow_count;                     /* how many objects this is a shadow for */
10        objtype_t type;                       /* type of pager */
11        (...)
12        int resident_page_count;              /* number of resident pages */
13        struct vm_object *backing_object;     /* object that I'm a shadow of */
14        (...)
15        union {
16                (...)
17        } un_pager;
18 };

Listing 9: Memory object

3.1.6 Pager

A pager is not directly represented by a specific data structure. Instead, a pager is a structure of function pointers (pagerops [vm/pager.h]), hiding the actual type of backing store used by the pager. In this way, uniform access to various backing storage (e. g., a swap partition or a file) is provided to the memory object. Listing 10 shows the different operations a pager provides, the most important being getpages() and putpages() in lines 5 and 6, which are used for retrieving pages from and putting pages to backing store, respectively.

1 struct pagerops {
2         pgo_init_t *pgo_init;                 /* Initialize pager. */
3         pgo_alloc_t *pgo_alloc;               /* Allocate pager. */
4         pgo_dealloc_t *pgo_dealloc;           /* Disassociate. */
5         pgo_getpages_t *pgo_getpages;         /* Get (read) page. */
6         pgo_putpages_t *pgo_putpages;         /* Put (write) page. */
7         pgo_haspage_t *pgo_haspage;           /* Does pager have page? */
8         pgo_pageunswapped_t *pgo_pageunswapped;
9 };

Listing 10: Pager operations
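The dispatch style of pagerops can be illustrated with a self-contained sketch. The operations table below mimics the getpages()/putpages() pair for an invented in-memory "swap" store; none of this is kernel code:

```c
#include <string.h>

/* A stripped-down pagerops-style operations table; the member names
 * echo the FreeBSD struct, but the signatures and implementations
 * are invented for illustration. */
struct pager_ops {
    int (*getpages)(char *buf, int pindex);
    int (*putpages)(const char *buf, int pindex);
};

/* A fake swap backing store: four 16-byte "pages" in memory. */
static char swap_store[4][16];

static int swap_getpages(char *buf, int pindex)
{
    memcpy(buf, swap_store[pindex], 16);   /* page in */
    return 0;
}

static int swap_putpages(const char *buf, int pindex)
{
    memcpy(swap_store[pindex], buf, 16);   /* page out */
    return 0;
}

/* A memory object selects its ops table once, at creation time, and
 * thereafter calls through it without knowing the store type. */
static const struct pager_ops swap_pager_ops = {
    swap_getpages, swap_putpages
};
```

A file-backed store would supply a different table with the same signatures, which is the whole point of the indirection.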

3.1.7 Page

Each physical page frame in the system is represented by a vm_page. It has a usage state (wired, active, inactive, cache, free) describing the availability of the represented frame. This means that when a frame must be allocated to hold the contents of a page (that is swapped in or newly created), the system prefers to use free frames. If no free frames are available, cache frames are paged out and used to hold the new page. Wired frames cannot be paged out (a page is wired by the user through a system call or by the kernel under certain circumstances). The state of a vm_page is expressed through the membership of the page in one of the paging queues, where pages are kept in FIFO fashion. These queues are implemented using pageq in line 2 (in Listing 11).

Frames used by the same memory object are grouped in a list, listq in line 3, and in a binary search tree (lines 4–5). When frames are used by an object, they contain a reference to this object and their offset within the object (lines 6–7).

The actual physical memory address of the frame this vm_page is describing is contained in phys_addr in line 8. Machine-dependent information is encapsulated in md (line 9).

Furthermore, a vm_page contains several pieces of status metadata. The most important are the number of wirings of this page (line 11), a usage counter to determine the usage state mentioned above (line 13), and a busy counter that indicates whether the frame is currently being paged in or out, or whether another I/O operation is manipulating the frame contents (line 14).

1 struct vm_page {
2         TAILQ_ENTRY(vm_page) pageq;     /* queue info for FIFO queue or free list (P) */
3         TAILQ_ENTRY(vm_page) listq;     /* pages in same object (O) */
4         struct vm_page *left;           /* splay tree link (O) */
5         struct vm_page *right;          /* splay tree link (O) */
6         vm_object_t object;             /* which object am I in (O,P) */
7         vm_pindex_t pindex;             /* offset into object (O,P) */
8         vm_paddr_t phys_addr;           /* physical address of page */
9         struct md_page md;              /* machine dependant stuff */
10        (...)
11        u_short wire_count;             /* wired down maps refs (P) */
12        (...)
13        u_char act_count;               /* page usage count */
14        u_char busy;                    /* page busy count (O) */
15        (...)
16 };

Listing 11: VM page


3.2 Architecture Dependent Subsystem

The machine-independent code relies on services that are defined in pmap [vm/pmap.h] (abbreviation for physical map), which connects the machine-independent and machine-dependent subsystems. The task of the machine-dependent subsystem is to provide those services on a specific hardware platform without revealing the details of interaction with the specific hardware. The implementation of those services is provided in [machine/pmap.c].

3.2.1 Physical Map Services

The implementation of the pmap services mainly deals with the intricacies of the specific hardware and gives little substance for general discussion. Instead, we present the services that are expected from the machine-independent subsystem's point of view.

In Listing 12, one can see some operations on single physical page frames: clear the modification flag of the frame (line 2), clear the reference flag (line 3), get the respective values (lines 10 and 12), set the protection bits (line 14), zero the frame (line 18), or physically copy the content of one frame to another (line 5).

Examples of operations on pages of the virtual address space are: get the actual physical address of a virtual address (line 8), change the wiring state of a virtual page (line 1), insert a physical page at a virtual address (line 6), or remove a range of addresses from the map (line 16).

1 void pmap_change_wiring(pmap_t, vm_offset_t, boolean_t);
2 void pmap_clear_modify(vm_page_t m);
3 void pmap_clear_reference(vm_page_t m);
4 (...)
5 void pmap_copy_page(vm_page_t, vm_page_t);
6 void pmap_enter(pmap_t, vm_offset_t, vm_page_t, vm_prot_t, boolean_t);
7 (...)
8 vm_paddr_t pmap_extract(pmap_t pmap, vm_offset_t va);
9 (...)
10 boolean_t pmap_is_modified(vm_page_t m);
11 (...)
12 boolean_t pmap_ts_referenced(vm_page_t m);
13 (...)
14 void pmap_page_protect(vm_page_t m, vm_prot_t prot);
15 (...)
16 void pmap_remove(pmap_t, vm_offset_t, vm_offset_t);
17 (...)
18 void pmap_zero_page(vm_page_t);
19 (...)

Listing 12: Physical map services

3.3 Mechanisms

To complement our discussion of data structures as “static” entities, we now present some mechanisms implemented in the FreeBSD VM system, in order to show how these entities interact and how they are used to provide services such as system calls.

3.3.1 Memory Mapped Files

Files can be mapped into the virtual address space using the system call mmap. Memory mapped files can be directly read and modified by manipulating memory contents. The system keeps the mapped regions in sync with the file system by periodically flushing memory contents to disk (see Section 3.3.5 for more details). Calling mmap, the user can choose between private and shared mapping. Shared mappings can be used for interprocess communication (IPC), since processes mapping the same file into their virtual memory can see each other's modifications (access synchronization is left to the user programs).


Figure 3: Memory mapped files (shared and private).

As shown on the left of Figure 3, FreeBSD utilizes a file-backed memory object (using a vnode pager) to implement shared mappings. Map entries of sharing processes directly refer to this object. As a consequence, changes in the file are immediately visible to other processes sharing the same mapping. (The synchronization between physical memory and file can be user-controlled separately and is by default done periodically.)

A private mapping requires an object chain (Section 3.1.5), as shown on the right of Figure 3. The file-backed memory object is shadowed by another memory object that is used only by process A. On a write, a copy of the modified page is created and maintained by the shadowing object. When process B establishes a private mapping of the same file, it uses its own shadow object and does not see any modifications of process A's mapping.
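The user-visible difference between the two mapping types can be demonstrated with standard POSIX calls. This sketch is our own (the helper name is invented, and a writable path is assumed); it writes through a private mapping and checks that the underlying file is untouched:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file privately, modify the mapping, and report whether the
 * underlying file changed. Returns 1 if the file kept its original
 * byte (expected for MAP_PRIVATE), 0 otherwise, -1 on error. */
int private_write_is_invisible(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (write(fd, "A", 1) != 1)
        return -1;

    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    *p = 'B';                        /* copy-on-write: hits the shadow object only */
    munmap(p, 1);

    char c;
    if (pread(fd, &c, 1, 0) != 1)
        return -1;
    close(fd);
    return c == 'A';                 /* file still holds the original byte */
}
```

With MAP_SHARED instead of MAP_PRIVATE, the same store would eventually reach the file, since the map entry then refers directly to the file-backed object.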

3.3.2 Forking Processes

The fork() system call is used for process creation and duplicates a process context, including its virtual address space. This copying is done utilizing the copy-on-write mechanism. vmspace_fork [vm/vm_map.c] (called in fork1() [kern/kern_fork.c]) performs the duplication of the virtual address space by shadowing the memory objects of the parent process as described in Section 3.1.5. All memory objects of the child process then shadow the objects of the parent process, resulting in identical initial virtual memory contents.

Although copy-on-write semantics prevents actual page frames from being copied on a fork, a lot of overhead lies in the creation and shadowing of memory objects. This may be a serious performance problem. Additionally, fork is often used directly before execve (see Section 2.4.2), invalidating the copied address space. For this reason, the system call vfork() should be used when a fork-execve combination is intended, because it does not copy the address space of the parent process but creates a new one containing the text, stack and data of the specified program.
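The effect of copy-on-write duplication is observable from user level: after fork(), a write by the child lands on the child's private copy of the page, leaving the parent's copy untouched. A small POSIX sketch (the helper is our own, not FreeBSD code):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demonstrate fork()'s copy-on-write address-space duplication:
 * the child increments its copy of a variable and exits with it;
 * the parent's copy must be unaffected. Returns 0 on the expected
 * behavior, a nonzero value otherwise. */
int cow_demo(void)
{
    int value = 10;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        value++;                 /* write fault: the page is copied for the child */
        _exit(value);            /* child sees 11 */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 11)
        return 1;
    return value == 10 ? 0 : 2;  /* parent must still see 10 */
}
```

Under vfork() this demonstration would be invalid, since parent and child share the address space until the child calls execve() or exits.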

3.3.3 Page Fault

Usually, a page fault occurs when a program accesses a virtual page that is not currently present in physical memory. FreeBSD's page fault service routine (contained in [vm/vm_fault.c]) is called on such occasions. Additionally, this routine is called on page access violations (e. g., when a process writes to a read-only page).

In the case of a “classical” page fault, the service (or handler) routine uses the memory object to request the loading of the page into main memory. As discussed above (Section 3.1.6), the memory object itself relies on its pager to perform the necessary actions. Since programs are expected to operate on a set of nearby memory addresses, the page fault handler tries to minimize future page faults by anticipating them and loads neighboring pages (as far as they are maintained by the same memory object) into physical memory in advance. This not only reduces the number of page faults but also saves data transfer time, since a bulk transfer of many pages is more efficient than successive transfers of single pages.

A call to the page fault handler due to an access violation (e. g., a write fault) does not necessarily imply that the program that caused the violation is erroneous or dangerous. Write faults also occur when writing to memory that was previously copied using copy-on-write semantics. In this case, the page fault handler analyzes what actions must be taken, performs the copying, and updates the map entry and memory object data structures of the process that caused the write fault in order to reflect the new configuration.

3.3.4 The Paging Daemon

The paging daemon ([vm/vm_pageout.c]) is a kernel thread that collects rarely used pages and writes them to backing store. It works on a demand-triggered basis, and is additionally woken once per second if the system is idle. If it is called on demand, the paging daemon does no more than is necessary to satisfy the need.

The daemon targets several goals; one is keeping the paging queue (Section 3.1.7) lengths balanced (with minimal CPU consumption). It scans the queues (in a clock-algorithm fashion) and reattaches pages to other queues according to their usage count. Pages are normally not swapped out if there are enough free pages. But a number of pages must always be kept in the free state in order to guarantee enough space for interrupt handler routines.
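The queue-balancing scan can be caricatured in a few lines. In this toy model (the constants and decay policy are invented), each pass over the active set decays usage counts and deactivates pages that reach zero, roughly in the spirit of the clock algorithm:

```c
/* Toy page aging in the spirit of the pageout daemon's scan: pages
 * whose usage count decays to zero move from the active to the
 * inactive set, becoming candidates for pageout. In the kernel the
 * fault path would concurrently boost the count of referenced pages;
 * here we only decay. */
struct page {
    int act_count;   /* usage counter, cf. vm_page's act_count */
    int active;      /* 1 = on the active queue */
};

/* One scan over the active set; returns how many pages it deactivated. */
int scan_active(struct page pages[], int n)
{
    int deactivated = 0;
    for (int i = 0; i < n; i++) {
        if (!pages[i].active)
            continue;
        if (--pages[i].act_count <= 0) {
            pages[i].active = 0;         /* candidate for pageout */
            deactivated++;
        }
    }
    return deactivated;
}
```

Repeated scans gradually empty the active set unless pages are re-referenced, which is how rarely used pages drift toward the inactive queue.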

Under normal circumstances, the activity of the paging daemon will be sufficient to reach these goals and keep some portion of physical memory available all the time. When memory is scarce, and many processes compete for large numbers of pages, a lot of page faults occur. In such situations (indicated by threshold values, e. g., a minimal number of free pages), the system starts to swap entire processes instead of paging out portions of them. Under normal conditions, swapping will not occur.

3.3.5 The Unified Buffer Cache

In FreeBSD, there are two ways of manipulating file contents: using the read()/write() system call pair that operates on file descriptors, or using the mmap() system call (see Section 3.3.1) and accessing a file via virtual memory. If a file is accessed in both ways at the same time (possibly by different processes), consistency must be maintained. FreeBSD addresses this problem with a Unified Buffer Cache (UBC, see [12]).

UBC unifies the buffer cache used when reading/writing a file and the page cache involved when mmaping it into one single cache, as shown in Figure 4. The read()/write() system calls and the mmap() system call are then merely two different user interfaces to the same kernel mechanism. The only entities the kernel uses for file access are vnode-backed memory objects (see Section 3.1.5). This eliminates the consistency problem, saves memory (no double caching), and speeds up memory mapped file access (no extra copying).

Figure 4: The Unified Buffer Cache
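The coherence that a unified cache provides can be observed from user level: an update made through the descriptor interface is immediately visible through a shared mapping of the same file, because both paths end at the same vnode-backed memory object. A POSIX sketch (the helper is our own and assumes a writable path):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through the read()/write() interface and observe the change
 * through a shared mapping of the same file. Returns 1 if the update
 * is visible through the mapping, 0 otherwise, -1 on error. */
int interfaces_are_coherent(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (write(fd, "x", 1) != 1)            /* give the file some content */
        return -1;

    char *p = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    if (pwrite(fd, "y", 1, 0) != 1)        /* update via the descriptor */
        return -1;

    int ok = (*p == 'y');                  /* visible via the mapping */
    munmap(p, 1);
    close(fd);
    return ok;
}
```

On a system with two separate caches, the mapping could keep serving the stale byte until an explicit synchronization; with a unified cache there is only one copy to read.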


4 File Systems

4.1 VFS

Like several other operating systems, FreeBSD employs a file system abstraction layer (called Virtual File System, or VFS for short) to provide uniform access to various file systems. The principal entity in VFS is the virtual node (vnode [sys/vnode.h]), which can represent a file, directory, socket, block or character device, symbolic link, or a pipe. Since other parts of the kernel (e. g., the vnode-backed pager, Section 3.1.1) rely on vnodes, they can transparently use different underlying file systems.

Vnodes support the standard operations open, close, read, write, stat, etc., which they forward to the corresponding routine provided by the underlying file system (or socket/device driver) implementation (e. g., UFS [ufs/ufs/ufs_vfsops.c], NTFS [fs/ntfs/ntfs_vfsops.c], or EXT2 [gnu/ext2fs/ext2_vfsops.c]). Vnodes also represent file system-specific information in a uniform way. The mapping of file system-specific data to vnode data must be provided by the underlying file system. Since the VFS organization is very similar to the way the second extended file system EXT2 organizes its data, EXT2 interacts seamlessly and performs well with VFS.

4.2 UFS and FFS

On the World Wide Web, there are few comprehensive and up-to-date sources on UFS and FFS, but a handful of short and contradictory statements on whether the two are the same, which term means which file system, etc. We have tried to unravel the confusion introduced by these statements, and we will now explain what we found to be the resolution of the differing accounts.

The current standard file system used in FreeBSD was first introduced with FreeBSD 5.0 and is called UNIX file system 2 (UFS2). It is an extension to the Berkeley Fast File System (FFS), which is nowadays often referred to as the UNIX file system (UFS)4.

There is another factor that increases confusion: the current file system code in FreeBSD is also split into two layers, which are called FFS and UFS. UFS [13] (the lower layer) defines the on-disk layout, whereas FFS (the upper layer) provides disk access optimizations, such as soft updates [14] and snapshots. There are two subdirectories, ffs/ and ufs/, in the ufs/ directory, which contain the code for the respective layers.

We will now discuss UFS, because it specifies the on-disk layout and hence is what is commonly understood as the core of a file system. UFS is a modern hierarchical file system, and is standard in many UNIX-like operating systems.

4.2.1 UFS Partition Layout

UFS divides each partition into one or more cylinder groups, each being a set of adjacent physical disk cylinders5. All cylinder groups have the same basic layout (Figure 5), with an invariant portion containing meta-information consisting of

• a copy of the superblock (containing important information, e. g., the block size and the index of the root inode),

• a cylinder group map (a bitmap describing storage block availability for this group),

• and a fixed number of index-nodes (inodes, see Section 4.2.2).

The remaining space in a cylinder group contains data blocks. The boot block of the partition lies at the beginning of the first cylinder group.

It should be noted that the above-mentioned meta-information has a different offset in each cylinder group. This is done to improve failure safety in case of disk damage. If all superblocks were stored at the same place in each group, the failure of a single disk platter would corrupt all superblock copies.

4 Originally, UFS refers to “the file system” of System V UNIX.
5 This scheme was introduced by Berkeley in 1984 [1]. (There were two additional major changes introduced by that (re)design: the maximum length of file names was increased from 14 to 255 characters, and blocks could be fragmented, essentially providing a variable block size.)



Figure 5: The UFS Partition Layout

4.2.2 UFS Inodes

An inode (Figure 6) is used to describe each file and directory in the UFS file system. The inode structure of UFS is very similar to other file metadata structures: e. g., it utilizes the concept of (multiple) indirect block addressing in order to optimize metadata overhead for various file or directory sizes, and it accounts for the traditional UNIX access permission system with group ID and user ID.

Figure 6: A UFS inode

Listing 13 shows the dinode [ufs/ufs/dinode.h] structure that contains all the metadata associated with a UFS file or directory. This structure defines the on-disk format of an inode, so all its fields are defined by types with precise widths.

1 struct ufs2_dinode {
2         u_int16_t di_mode;              /*   0: IFMT, permissions; see below. */
3         int16_t di_nlink;               /*   2: File link count. */
4         u_int32_t di_uid;               /*   4: File owner. */
5         u_int32_t di_gid;               /*   8: File group. */
6         u_int32_t di_blksize;           /*  12: Inode block size. */
7         u_int64_t di_size;              /*  16: File byte count. */
8         u_int64_t di_blocks;            /*  24: Bytes actually held. */
9         ufs_time_t di_atime;            /*  32: Last access time. */
10        ufs_time_t di_mtime;            /*  40: Last modified time. */
11        ufs_time_t di_ctime;            /*  48: Last inode change time. */
12        ufs_time_t di_birthtime;        /*  56: Inode creation time. */
13        int32_t di_mtimensec;           /*  64: Last modified time. */
14        int32_t di_atimensec;           /*  68: Last access time. */
15        int32_t di_ctimensec;           /*  72: Last inode change time. */
16        int32_t di_birthnsec;           /*  76: Inode creation time. */
17        int32_t di_gen;                 /*  80: Generation number. */
18        u_int32_t di_kernflags;         /*  84: Kernel flags. */
19        u_int32_t di_flags;             /*  88: Status flags (chflags). */
20        int32_t di_extsize;             /*  92: External attributes block. */
21        ufs2_daddr_t di_extb[NXADDR];   /*  96: External attributes block. */
22        ufs2_daddr_t di_db[NDADDR];     /* 112: Direct disk blocks. */
23        ufs2_daddr_t di_ib[NIADDR];     /* 208: Indirect disk blocks. */
24        int64_t di_spare[3];            /* 232: Reserved; currently unused */
25 };

Listing 13: A UFS inode

This is a particularly nice example where the concept shown in Figure 6 has an equivalent counterpart in the source code.
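The reach of the block pointers in lines 22–23 can be worked out numerically. Assuming the UFS2 values NDADDR = 12 and NIADDR = 3, 8-byte block pointers (ufs2_daddr_t), and a 16 KB block size (a common UFS2 default; other sizes change the result), the di_db/di_ib scheme addresses roughly 128 terabytes:

```c
#include <stdint.h>

/* Maximum file size addressable through the di_db/di_ib pointers of
 * an inode, given a block size in bytes. NDADDR/NIADDR follow UFS2;
 * block pointers (ufs2_daddr_t) are 8 bytes wide. */
#define NDADDR 12   /* direct block pointers */
#define NIADDR 3    /* single, double, triple indirect pointers */

uint64_t max_file_bytes(uint64_t bsize)
{
    uint64_t ptrs = bsize / 8;           /* pointers per indirect block */
    uint64_t blocks = NDADDR;            /* blocks reachable directly */
    uint64_t level = 1;
    for (int i = 0; i < NIADDR; i++) {   /* each level multiplies the reach */
        level *= ptrs;
        blocks += level;
    }
    return blocks * bsize;
}
```

For a 16 KB block size this yields 12 + 2048 + 2048² + 2048³ addressable blocks, about 128 TB; in practice other fields (e. g., the 64-bit di_size) and implementation limits constrain real file sizes independently of this calculation.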

5 Summary

Although we have to leave many things unmentioned, we hope to have provided an overview of the FreeBSD kernel. We discussed the core components process and memory management, as well as the important topics SMP, deadlocks, I/O buffer cache, and file systems, while trying to strike a balance between structural and behavioral aspects.

FreeBSD is in several ways representative of many modern operating systems. Since it is technologically state of the art, widely used, and considered stable and reliable, we expected it to be well-designed and of high source code quality. But, after all, the FreeBSD kernel is a grown system and suffers from many problems known from legacy systems. Reading the continuously evolved source code base, we got the impression that an overall design emphasizing modularity and structure has been lost.

Another problem we encountered is the gap between high-level concepts and low-level code comments. From our point of view, the existing documentation (e. g., [4], [10], and [17]) does not address this sufficiently, which makes it hard for newcomers to systems programming to understand the implementation.

As is common practice, FreeBSD programmers use XXX in their comments to flag problematic code passages. Examples are:

XXX: this code doesn’t cover the general case, but it would be difficult to do so

or

XXX: This is not an ideal solution, but I believe it to be deadlock safe, all things considered

The kernel has more than 1,600 files that contain at least one XXX flag. Since this number only reflects the problems the programmers are aware of, we expect the true number of potential problems to be even higher, which does not strengthen our trust in the system.

We assume that FreeBSD is not alone in suffering from these problems, and that the code of many other operating systems has similar weaknesses. From our point of view, this is partially caused by deviations from standards (such as ANSI C) and by a lack of emphasis on code readability. This is probably due to several reasons, the most important of which could be runtime performance optimizations and the continuous re-engineering of the existing code base by different people.


5.1 Suggestions

We would like to conclude with some constructive criticism and have collected several proposals for improving code quality without touching runtime performance, while even facilitating re-engineering by different people. From our point of view, a lot could be gained if

• coding conventions were followed.

• variable names were spelled out, or variables had names that more appropriately described their purpose.

• very long procedures (> 100 lines) were split up into several (possibly inline) procedures, which would not affect runtime performance.

• directory and file names reflected a clear structure and followed a naming convention.

• all files began with at least a few comments describing what functionality they contain. Some newer files (like [sys/kse.h]) already contain such descriptions, but they are missing in many other important files.

• all kernel source directories contained a readme file on the functionality they are intended to contain.

• consistent terminology was used in comments and code throughout the project.

• the same programming practices were used throughout the project; e. g., not all code that utilizes queues uses the macros defined for this purpose (if this is done intentionally, it should be communicated through a comment).

None of these proposals impact runtime performance; they solely improve the readability of the code and ultimately reduce the likelihood of bugs.

References

[1] A. S. Tanenbaum. Modern Operating Systems. Prentice Hall, New Jersey, 2001. Second (international) edition.

[2] L. F. Bic and A. C. Shaw. Operating Systems Principles. Prentice Hall, New Jersey, 2003.

[3] F. Pohlmann. Why FreeBSD: A quick tour of the BSD alternative. 2005.

http://www-128.ibm.com/developerworks/opensource/library/os-freebsd/

[4] M. K. McKusick and G. Neville-Neil. The Design and Implementation of the FreeBSD Operating System. FreeBSD release 5.2. Addison-Wesley Professional, 2004.

The chapter on process management can be found online at http://www.awprofessional.com/articles/article.asp?p=366888.

[5] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53–79, 1992.

[6] The FreeBSD KSE Project. http://www.freebsd.org/kse/

[7] J. Evans. Kernel-Scheduled Entities for FreeBSD. 2003.

[8] J. Roberson. ULE: A Modern Scheduler For FreeBSD. In Proceedings of BSDCon, 2003.

[9] The Unix Acronym List. http://www.roesler-ac.de/wolfram/acro/

[10] The FreeBSD Hypertext Man Pages. http://www.freebsd.org/cgi/man.cgi

Yields a lot of useful information when searching for concepts, entities, system calls, or commands, e. g., KSE or execv.


[11] C. D. Cranor and G. M. Parulkar. The UVM Virtual Memory System. In Proceedings of the USENIX Annual Conference, Monterey, California, 1999.

[12] C. Silvers. UBC: An Efficient Unified I/O and Memory Caching Subsystem for NetBSD. In Proceedings of the USENIX Annual Conference, 2000.

[13] J. C. van Gelderen. Little UFS2 FAQ (2003/04/25, v15.a). http://sixshooter.v6.thrupoint.net/jeroen/faq.html

This FAQ is brief and out of date but still the most informative source on the World Wide Web on UFS2.

[14] M. K. McKusick and G. R. Ganger. Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem. In Proceedings of the USENIX Annual Conference, 1999.

[15] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A Fast File System for UNIX. 1984.

[16] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts with Java. Sixth Edition. Wiley, 2003.

This book dedicates a whole appendix to “The FreeBSD System”. This appendix is available online and presents a good overall introduction to (Free)BSD.

[17] The FreeBSD Architecture Handbook. http://www.freebsd.org/doc/en_US.ISO8859-1/books/arch-handbook/

[18] The FreeBSD Handbook. http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/

[19] Kerneltrap interview with Matthew Dillon. http://kerneltrap.org/node/8

Matthew Dillon contributed heavily to the FreeBSD VM subsystem. In this interesting interview he shares his thoughts on various aspects of the FreeBSD system, extending even to some short but informative technical discussion.

[20] Wikipedia. http://en.wikipedia.org/

The Wikipedia articles on BSD, 386BSD, and FreeBSD provide a good starting point for further research on the FreeBSD system. There are also some other entries on topics discussed in this work, such as UFS and KSE.
