operating systems (slides)
TRANSCRIPT
Textbooks
D.P. Bovet and M. Cesatı. Understanding The LinuxKernel. 3rd ed. O’Reilly, 2005.Silberschatz, Galvin, and Gagne. Operating SystemConcepts Essentials. John Wiley & Sons, 2011.Andrew S. Tanenbaum. Modern Operating Systems. 3rd.Prentice Hall Press, 2007.邹恒明. 计算机的心智:操作系统之哲学原理. 机械工业出版社,2009.
2 / 397
Course Web Site
Course web site: http://cs2.swfu.edu.cn/moodleLecture slides:
http://cs2.swfu.edu.cn/∼wx672/lecture_notes/os/slides/Source code: http://cs2.swfu.edu.cn/∼wx672/lecture_notes/os/src/Sample lab report:
http://cs2.swfu.edu.cn/∼wx672/lecture_notes/os/sample-report/
3 / 397
What’s an Operating System?
▶ “Everything a vendor ships when you order an operatingsystem”
▶ It’s the program that runs all the time▶ It’s a resource manager
- Each program gets time with the resource Each program gets space on the resource
▶ It’s a control program- Controls execution of programs to prevent errors and
improper use of the computer▶ No universally accepted definition
4 / 397
What’s in the OS?
Hardware
Hardware control
Device drivers
Characterdevices
Blockdevices
Buffercache
File subsystem
VFS
NFS · · · Ext2 VFAT
Process controlsubsystem
Inter-processcommunication
Scheduler
Memorymanagement
System call interface
Libraries
Kernel levelHardware level
User levelKernel level
User programs
trap
trap
5 / 397
AbstractionsTo hide the complexity of the actual implementations
Section 1.10 Summary 25
Figure 1.18Some abstractions pro-vided by a computersystem. A major themein computer systems is toprovide abstract represen-tations at different levels tohide the complexity of theactual implementations.
Main memory I/O devicesProcessorOperating system
Processes
Virtual memory
Files
Virtual machine
Instruction setarchitecture
sor that performs just one instruction at a time. The underlying hardware is farmore elaborate, executing multiple instructions in parallel, but always in a waythat is consistent with the simple, sequential model. By keeping the same execu-tion model, different processor implementations can execute the same machinecode, while offering a range of cost and performance.
On the operating system side, we have introduced three abstractions: files asan abstraction of I/O, virtual memory as an abstraction of program memory, andprocesses as an abstraction of a running program. To these abstractions we adda new one: the virtual machine, providing an abstraction of the entire computer,including the operating system, the processor, and the programs. The idea of avirtual machine was introduced by IBM in the 1960s, but it has become moreprominent recently as a way to manage computers that must be able to runprograms designed for multiple operating systems (such as Microsoft Windows,MacOS, and Linux) or different versions of the same operating system.
We will return to these abstractions in subsequent sections of the book.
1.10 Summary
A computer system consists of hardware and systems software that cooperateto run application programs. Information inside the computer is represented asgroups of bits that are interpreted in different ways, depending on the context.Programs are translated by other programs into different forms, beginning asASCII text and then translated by compilers and linkers into binary executablefiles.
Processors read and interpret binary instructions that are stored in mainmemory. Since computers spend most of their time copying data between memory,I/O devices, and the CPU registers, the storage devices in a system are arrangedin a hierarchy, with the CPU registers at the top, followed by multiple levelsof hardware cache memories, DRAM main memory, and disk storage. Storagedevices that are higher in the hierarchy are faster and more costly per bit thanthose lower in the hierarchy. Storage devices that are higher in the hierarchy serveas caches for devices that are lower in the hierarchy. Programmers can optimizethe performance of their C programs by understanding and exploiting the memoryhierarchy.
6 / 397
System Goals
Convenient vs. Efficient▶ Convenient for the user — for PCs▶ Efficient — for mainframes, multiusers▶ UNIX
- Started with keyboard + printer, none paid to convenience- Now, still concentrating on efficiency, with GUI support
7 / 397
History of Operating Systems
1401 7094 1401
(a) (b) (c) (d) (e) (f)
Cardreader
Tapedrive Input
tapeOutputtape
Systemtape
Printer
Fig. 1-2. An early batch system. (a) Programmers bring cards to1401. (b) 1401 reads batch of jobs onto tape. (c) Operator carriesinput tape to 7094. (d) 7094 does computing. (e) Operator carriesoutput tape to 1401. (f) 1401 prints output.
8 / 397
1945 - 1955 First generation- vacuum tubes, plug boards
1955 - 1965 Second generation- transistors, batch systems
1965 - 1980 Third generation- ICs and multiprogramming
1980 - present Fourth generation- personal computers
9 / 397
Multi-programming is the first instance where theOS must make decisions for the usersJob scheduling — decides which job should be loaded into
the memory.Memory management — because several programs in
memory at the same timeCPU scheduling — choose one job among all the jobs are
ready to runProcess management — make sure processes don’t offend
each other
Job 3
Job 2
Job 1
Operatingsystem
Memorypartitions
Fig. 1-4. A multiprogramming system with three jobs in memory.10 / 397
The Operating System Zoo
▶ Mainframe OS▶ Server OS▶ Multiprocessor OS▶ Personal computer OS▶ Real-time OS▶ Embedded OS▶ Smart card OS
11 / 397
OS ServicesLike a government
Helping the users:▶ User interface▶ Program execution▶ I/O operation▶ File system manipulation▶ Communication▶ Error detection
Keeping the system efficient:▶ Resource allocation▶ Accounting▶ Protection and security
12 / 397
A Computer System
Bankingsystem
Airlinereservation
Operating system
Webbrowser
Compilers Editors
Application programs
Hardware
Systemprograms
Commandinterpreter
Machine language
Microarchitecture
Physical devices
Fig. 1-1. A computer system consists of hardware, system pro-grams, and application programs. 13 / 397
CPU Working Cycle
Fetchunit
Decodeunit
Executeunit
1. Fetch the first instruction from memory2. Decode it to determine its type and operands3. execute it
Special CPU RegistersProgram counter(PC): keeps the memory address of the next
instruction to be fetchedStack pointer(SP): points to the top of the current stack in
memoryProgram status(PS): holds
- condition code bits- processor state
14 / 397
System BusMonitor
Keyboard Floppydisk drive
Harddisk drive
Harddisk
controller
Floppydisk
controller
Keyboardcontroller
VideocontrollerMemoryCPU
Bus
Fig. 1-5. Some of the components of a simple personal computer.Address Bus: specifies the memory locations (addresses) for
the data transfersData Bus: holds the data transfered. bidirectional
Control Bus: contains various lines used to route timing andcontrol signals throughout the system
15 / 397
Controllers and Peripherals
▶ Peripherals are real devices controlled by controller chips▶ Controllers are processors like the CPU itself, have
control registers▶ Device driver writes to the registers, thus control it▶ Controllers are connected to the CPU and to each other
by a variety of buses
16 / 397
ISAbridge
Modem
Mouse
PCIbridgeCPU
Mainmemory
SCSI USB
Local bus
Soundcard Printer Available
ISA slot
ISA bus
IDEdisk
AvailablePCI slot
Key-board
Mon-itor
Graphicsadaptor
Level 2cache
Cache bus Memory bus
PCI bus
Fig. 1-11. The structure of a large Pentium system17 / 397
Motherboard Chipsets
Intel Core 2(CPU)
Fron
tSide Bus
North bridgeChip
DDR2System RAM Graphics Card
DMI
Interface
South bridgeChipSerial ATA Ports Clock Generation
BIOS
USB Ports
Power Management
PCI Bus
18 / 397
▶ The CPU doesn’t know what it’s connected to- CPU test bench? network router? toaster? brain
implant?▶ The CPU talks to the outside world through its pins
- some pins to transmit the physical memory address- other pins to transmit the values
▶ The CPU’s gateway to the world is the front-side bus
Intel Core 2 QX6600▶ 33 pins to transmit the physical memory address
- so there are 233 choices of memory locations▶ 64 pins to send or receive data
- so data path is 64-bit wide, or 8-byte chunks
This allows the CPU to physically address 64GB of memory(233 × 8B)
19 / 397
Some physical memoryaddresses are mapped away!
▶ only the addresses, not the spaces▶ Memory holes
- 640KB ∼ 1MB- /proc/iomem
Memory-mapped I/O▶ BIOS ROM▶ video cards▶ PCI cards▶ ...
This is why 32-bit OSes have problemsusing 4 gigs of RAM.
0xFFFFFFFF +--------------------+ 4GB
Reset vector | JUMP to 0xF0000 |
0xFFFFFFF0 +--------------------+ 4GB - 16B
| Unaddressable |
| memory, real mode |
| is limited to 1MB. |
0x100000 +--------------------+ 1MB
| System BIOS |
0xF0000 +--------------------+ 960KB
| Ext. System BIOS |
0xE0000 +--------------------+ 896KB
| Expansion Area |
| (maps ROMs for old |
| peripheral cards) |
0xC0000 +--------------------+ 768KB
| Legacy Video Card |
| Memory Access |
0xA0000 +--------------------+ 640KB
| Accessible RAM |
| (640KB is enough |
| for anyone - old |
| DOS area) |
0 +--------------------+ 0
What if you don’t have 4G RAM?
20 / 397
the northbridge1. receives a physical memory request2. decides where to route it
- to RAM? to video card? to ...?- decision made via the memory address map
21 / 397
Bootstrapping
Can you pull yourself up by your own bootstraps?A computer cannot run without first loading software butmust be running before any software can be loaded.
BIOSInitialization MBR Boot Loader
EarilyKernel
Initialization
FullKernel
Initialization
FirstUser ModeProcess
BIOS Services Kernel ServicesHardware
CPU in Real Mode
Time flow
CPU in Protected Mode
Switch toProtected Mode
22 / 397
Intel x86 Bootstrapping1. BIOS (0xfffffff0)
à POST à HW init à Find a boot device (FD,CD,HD...) à
Copy sector zero (MBR) to RAM (0x00007c00)2. MBR – the first 512 bytes, contains
▶ Small code (< 446Bytes), e.g. GRUB stage 1, for loadingGRUB stage 2
▶ the primary partition table (= 64Bytes)▶ its job is to load the second-stage boot loader.
3. GRUB stage 2 — load the OS kernel into RAM4. startup5. init — the first user-space program
|<-------------Master Boot Record (512 Bytes)------------>|
0 439 443 445 509 511
+----//-----+----------+------+------//---------+---------+
| code area | disk-sig | null | partition table | MBR-sig |
| 440 | 4 | 2 | 16x4=64 | 0xAA55 |
+----//-----+----------+------+------//---------+---------+
$ sudo hd -n512 /dev/sda23 / 397
Why Interrupt?
While a process is reading a disk file, can we do...while(!done_reading_a_file())
{
let_CPU_wait();
// or...
lend_CPU_to_others();
}
operate_on_the_file();
24 / 397
Modern OS are Interrupt Driven
HW INT by sending a signal to CPUSW INT by executing a system call
Trap (exception) is a software-generated INT coursed by anerror or by a specific request from an userprogram
Interrupt vector is an array of pointers pointing to thememory addresses of interrupt handlers. Thisarray is indexed by a unique device number
$ less /proc/devices$ less /proc/interrupts
25 / 397
Programmable Interrupt Controllers
InterruptINT
IRQ 0 (clock)IRQ 1 (keyboard)
IRQ 3 (tty 2)IRQ 4 (tty 1)IRQ 5 (XT Winchester)IRQ 6 (floppy)IRQ 7 (printer)
IRQ 8 (real time clock)IRQ 9 (redirected IRQ 2)IRQ 10IRQ 11IRQ 12IRQ 13 (FPU exception)IRQ 14 (AT Winchester)IRQ 15
ACK
Master interrupt controller
INT
ACK
Slave interrupt controller
INT
CPU
INTAInterrupt
ack
s y s t e m d a t a b u s
Figure 2-33. Interrupt processing hardware on a 32-bit Intel PC.
26 / 397
Interrupt Processing
CPUInterruptcontroller
Diskcontroller
Disk drive
Current instruction
Next instruction
1. Interrupt3. Return
2. Dispatch to handler
Interrupt handler
(b)(a)
1
3
4 2
Fig. 1-10. (a) The steps in starting an I/O device and getting aninterrupt. (b) Interrupt processing involves taking the interrupt,running the interrupt handler, and returning to the user program.
27 / 397
Interrupt Timeline1.2 Computer-System Organization 9
userprocessexecuting
CPU
I/O interruptprocessing
I/Orequest
transferdone
I/Orequest
transferdone
I/Odevice
idle
transferring
Figure 1.3 Interrupt time line for a single process doing output.
the interrupting device. Operating systems as different as Windows and UNIXdispatch interrupts in this manner.
The interrupt architecture must also save the address of the interruptedinstruction. Many old designs simply stored the interrupt address in afixed location or in a location indexed by the device number. More recentarchitectures store the return address on the system stack. If the interruptroutine needs to modify the processor state—for instance, by modifyingregister values—it must explicitly save the current state and then restore thatstate before returning. After the interrupt is serviced, the saved return addressis loaded into the program counter, and the interrupted computation resumesas though the interrupt had not occurred.
1.2.2 Storage Structure
The CPU can load instructions only from memory, so any programs to run mustbe stored there. General-purpose computers run most of their programs fromrewriteable memory, called main memory (also called random-access memoryor RAM). Main memory commonly is implemented in a semiconductortechnology called dynamic random-access memory (DRAM). Computers useother forms of memory as well. Because the read-only memory (ROM) cannotbe changed, only static programs are stored there. The immutability of ROMis of use in game cartridges. EEPROM cannot be changed frequently and socontains mostly static programs. For example, smartphones have EEPROM tostore their factory-installed programs.
All forms of memory provide an array of words. Each word has itsown address. Interaction is achieved through a sequence of load or storeinstructions to specific memory addresses. The load instruction moves a wordfrom main memory to an internal register within the CPU, whereas the storeinstruction moves the content of a register to main memory. Aside from explicitloads and stores, the CPU automatically loads instructions from main memoryfor execution.
A typical instruction–execution cycle, as executed on a system with a vonNeumann architecture, first fetches an instruction from memory and storesthat instruction in the instruction register. The instruction is then decodedand may cause operands to be fetched from memory and stored in some
28 / 397
System Calls
A System Call▶ is how a program requests a service from an OS kernel▶ provides the interface between a process and the OS
29 / 397
Program 1 Program 2 Program 3
+---------+ +---------+ +---------+
| fork() | | vfork() | | clone() |
+---------+ +---------+ +---------+
| | |
+--v-----------v-----------v--+
| C Library |
+--------------o--------------+ User space
-----------------|------------------------------------
+--------------v--------------+ Kernel space
| System call |
+--------------o--------------+
| +---------+
| : ... :
| 3 +---------+ sys_fork()
o------>| fork() |---------------.
| +---------+ |
| : ... : |
| 120 +---------+ sys_clone() |
o------>| clone() |---------------o
| +---------+ |
| : ... : |
| 289 +---------+ sys_vfork() |
o------>| vfork() |---------------o
+---------+ |
: ... : v
+---------+ do_fork()
System Call Table 30 / 397
User program 2�
User program 1�
Kernel call
Service �
procedure�
Dispatch table�
User programs �
run in user mode �
Operating �
system �
runs in �
kernel mode
4�
3�
1
2�
Mai
n m
emor
y
Figure 1-16. How a system call can be made: (1) User pro-gram traps to the kernel. (2) Operating system determines ser-vice number required. (3) Operating system calls service pro-cedure. (4) Control is returned to user program.
31 / 397
The 11 steps in making the system callread(fd,buffer,nbytes)
Return to caller
410
6
0
9
7 8
321
11
DispatchSys callhandler
Address0xFFFFFFFF
User space
Kernel space (Operating system)
Libraryprocedureread
User programcalling read
Trap to the kernelPut code for read in register
Increment SPCall readPush fdPush &bufferPush nbytes
5
Fig. 1-17. The 11 steps in making the system callread(fd, buffer, nbytes).
32 / 397
Process management22222222222222222222222222222222222222222222222222222222222222222222222222222222222Call Description22222222222222222222222222222222222222222222222222222222222222222222222222222222222
pid = fork( ) Create a child process identical to the parent22222222222222222222222222222222222222222222222222222222222222222222222222222222222pid = waitpid(pid, &statloc, options) Wait for a child to terminate22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = execve(name, argv, environp) Replace a process’ core image22222222222222222222222222222222222222222222222222222222222222222222222222222222222exit(status) Terminate process execution and return status222222222222222222222222222222222222222222222222222222222222222222222222222222222221111111
1111111
1111111
File management22222222222222222222222222222222222222222222222222222222222222222222222222222222222Call Description22222222222222222222222222222222222222222222222222222222222222222222222222222222222
fd = open(file, how, ...) Open a file for reading, writing or both22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = close(fd) Close an open file22222222222222222222222222222222222222222222222222222222222222222222222222222222222n = read(fd, buffer, nbytes) Read data from a file into a buffer22222222222222222222222222222222222222222222222222222222222222222222222222222222222n = write(fd, buffer, nbytes) Write data from a buffer into a file22222222222222222222222222222222222222222222222222222222222222222222222222222222222position = lseek(fd, offset, whence) Move the file pointer22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = stat(name, &buf) Get a file’s status information222222222222222222222222222222222222222222222222222222222222222222222222222222222221111111111
1111111111
1111111111
Directory and file system management22222222222222222222222222222222222222222222222222222222222222222222222222222222222Call Description22222222222222222222222222222222222222222222222222222222222222222222222222222222222
s = mkdir(name, mode) Create a new directory22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = rmdir(name) Remove an empty directory22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = link(name1, name2) Create a new entry, name2, pointing to name122222222222222222222222222222222222222222222222222222222222222222222222222222222222s = unlink(name) Remove a directory entry22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = mount(special, name, flag) Mount a file system22222222222222222222222222222222222222222222222222222222222222222222222222222222222s = umount(special) Unmount a file system222222222222222222222222222222222222222222222222222222222222222222222222222222222221111111111
1111111111
1111111111
Miscellaneous2222222222222222222222222222222222222222222222222222222222222222222222222222222222Call Description2222222222222222222222222222222222222222222222222222222222222222222222222222222222
s = chdir(dirname) Change the working directory2222222222222222222222222222222222222222222222222222222222222222222222222222222222s = chmod(name, mode) Change a file’s protection bits2222222222222222222222222222222222222222222222222222222222222222222222222222222222s = kill(pid, signal) Send a signal to a process2222222222222222222222222222222222222222222222222222222222222222222222222222222222seconds = time(&seconds) Get the elapsed time since Jan. 1, 197022222222222222222222222222222222222222222222222222222222222222222222222222222222221111111
1111111
1111111
Fig. 1-18. Some of the major POSIX system calls. The return codes is −1 if an error has occurred. The return codes are as follows:pid is a process id, fd is a file descriptor, n is a byte count, positionis an offset within the file, and seconds is the elapsed time. Theparameters are explained in the text.
33 / 397
System Call Examples
fork()1 #include <stdio.h>
2 #include <unistd.h>
3
4 int main ()
5 {
6 printf("Hello World!\n");
7 fork();
8 printf("Goodbye Cruel World!\n");
9 return 0;
10 }
$ man 2 fork
34 / 397
exec()1 #include <stdio.h>
2 #include <unistd.h>
3
4 int main ()
5 {
6 printf("Hello World!\n");
7 if(fork() != 0 )
8 printf("I am the parent process.\n");
9 else {
10 printf("A child is listing the directory contents...\n");
11 execl("/bin/ls", "ls", "-al", NULL);
12 }
13 return 0;
14 }
$ man 3 exec
35 / 397
Hardware INT vs. Software INT
Device: Send electrical signal to interrupt controller.
Controller: 1. Interrupt CPU. 2. Send digital identification of interrupting device.
Kernel: 1. Save registers. 2. Execute driver software to read I/O device. 3. Send message. 4. Restart a process (not necessarily interrupted process).
Caller: 1. Put message pointer and destination of message into CPU registers. 2. Execute software interrupt instruction.
Kernel: 1. Save registers. 2. Send and/or receive message. 3. Restart a process (not necessarily calling process).
(a) (b)
Figure 2-34. (a) How a hardware interrupt is processed. (b)How a system call is made.
36 / 397
References
Wikipedia. Interrupt — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. System call — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.
37 / 397
Process
A process is an instance of a program in execution
Processes are like human beings:à they are generatedà they have a lifeà they optionally generate one or
more child processes, andà eventually they die
A small difference:▶ sex is not really common among
processes▶ each process has just one parent
Stack
Gap
DataText 0000
FFFF
39 / 397
Process Control Block (PCB)
ImplementationA process is the collection of datastructures that fully describes how far theexecution of the program has progressed.
▶ Each process is represented by a PCB▶ task_struct in
+-------------------+
| process state |
+-------------------+
| PID |
+-------------------+
| program counter |
+-------------------+
| registers |
+-------------------+
| memory limits |
+-------------------+
| list of open files|
+-------------------+
| ... |
+-------------------+
40 / 397
Process Creation
exit()exec()
fork() wait()
anything()parent
child
▶ When a process is created, it is almost identical to itsparent
▶ It receives a (logical) copy of the parent’s address space,and
▶ executes the same code as the parent▶ The parent and child have separate copies of the data
(stack and heap)41 / 397
Forking in C
1 #include <stdio.h>
2 #include <unistd.h>
3
4 int main ()
5 {
6 printf("Hello World!\n");
7 fork();
8 printf("Goodbye Cruel World!\n");
9 return 0;
10 }
$ man fork
42 / 397
exec()1 int main()
2 {
3 pid_t pid;
4 /* fork another process */
5 pid = fork();
6 if (pid < 0) { /* error occurred */
7 fprintf(stderr, "Fork Failed");
8 exit(-1);
9 }
10 else if (pid == 0) { /* child process */
11 execlp("/bin/ls", "ls", NULL);
12 }
13 else { /* parent process */
14 /* wait for the child to complete */
15 wait(NULL);
16 printf ("Child Complete");
17 exit(0);
18 }
19 return 0;
20 }
$ man 3 exec43 / 397
Process State Transition
1 23
4Blocked
Running
Ready
1. Process blocks for input2. Scheduler picks another process3. Scheduler picks this process4. Input becomes available
Fig. 2-2. A process can be in running, blocked, or ready state.Transitions between these states are as shown.
44 / 397
CPU Switch From Process To Process
45 / 397
Process vs. Thread
a single-threaded process = resource + executiona multi-threaded process = resource + executions
Thread Thread
Kernel Kernel
Process 1 Process 1 Process 1 Process
Userspace
Kernelspace
(a) (b)
Fig. 2-6. (a) Three processes each with one thread. (b) One processwith three threads.
A process = a unit of resource ownership, used to groupresources together;
A thread = a unit of scheduling, scheduled for executionon the CPU.
46 / 397
Process vs. Thread
multiple threads runningin one process:
multiple processes run-ning in one computer:
share an address space andother resources
share physical memory,disk, printers ...
No protection between threadsimpossible — because process is the minimum unit of
resource managementunnecessary — a process is owned by a single user
47 / 397
Threads
+------------------------------------+
| code, data, open files, signals... |
+-----------+-----------+------------+
| thread ID | thread ID | thread ID |
+-----------+-----------+------------+
| program | program | program |
| counter | counter | counter |
+-----------+-----------+------------+
| register | register | register |
| set | set | set |
+-----------+-----------+------------+
| stack | stack | stack |
+-----------+-----------+------------+
48 / 397
A Multi-threaded Web Server
Dispatcher thread
Worker thread
Web page cache
Kernel
Networkconnection
Web server process
Userspace
Kernelspace
Fig. 2-10. A multithreaded Web server.while (TRUE) { while (TRUE) {
get3next3request(&buf); wait3 for3work(&buf)handoff3work(&buf); look3for3page3 in3cache(&buf, &page);
} if (page3not3 in3cache(&page))read3page3 from3disk(&buf, &page);
return3page(&page);}
(a) (b)
Fig. 2-11. A rough outline of the code for Fig. 2-10. (a) Dispatcherthread. (b) Worker thread.
49 / 397
A Word Processor With 3 Threads
KernelKeyboard Disk
Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war testing whether that
nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their
lives that this nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we cannot dedicate, we cannot consecrate we cannot hallow this ground. The brave men, living and dead,
who struggled here have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember, what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us, that from these honored dead we take increased devotion to that cause for which
they gave the last full measure of devotion, that we here highly resolve that these dead shall not have died in vain that this nation, under God, shall have a new birth of freedom and that government of the people by the people, for the people
Fig. 2-9. A word processor with three threads.
50 / 397
Why Having a Kind of Process Within a Process?▶ Responsiveness
▶ Good for interactive applications.▶ A process with multiple threads makes a great server (e.g.
a web server):Have one server process, many ”worker” threads – if onethread blocks (e.g. on a read), others can still continueexecuting
▶ Economy – Threads are cheap!▶ Cheap to create – only need a stack and storage for
registers▶ Use very little resources – don’t need new address space,
global data, program code, or OS resources▶ switches are fast – only have to save/restore PC, SP, and
registers▶ Resource sharing – Threads can pass data via shared
memory; no need for IPC▶ Can take advantage of multiprocessors
51 / 397
Thread States TransitionSame as process states transition
1 23
4Blocked
Running
Ready
1. Process blocks for input2. Scheduler picks another process3. Scheduler picks this process4. Input becomes available
Fig. 2-2. A process can be in running, blocked, or ready state.Transitions between these states are as shown.
52 / 397
Each Thread Has Its Own Stack
▶ A typical stack stores local data and call information for(usually nested) procedure calls.
▶ Each thread generally has a different execution history.
Kernel
Thread 3's stack
Process
Thread 3Thread 1
Thread 2
Thread 1'sstack
Fig. 2-8. Each thread has its own stack.53 / 397
Thread Operations
Startwith onethreadrunning
thread1
thread2
thread3
end
releaseCPU
wait fora threadto exit
thread_create()
thread_create()
thread_create()
thread_exit()
thread_yield()
thread_exit()
thread_exit()
thread_join()
54 / 397
POSIX Threads
IEEE 1003.1c The standard for writing portable threadedprograms. The threads package it defines iscalled Pthreads, including over 60 function calls,supported by most UNIX systems.
Some of the Pthreads function callsThread call Descriptionpthread_create Create a new threadpthread_exit Terminate the calling threadpthread_join Wait for a specific thread to exitpthread_yield Release the CPU to let another
thread runpthread_attr_init Create and initialize a thread’s at-
tribute structurepthread_attr_destroy Remove a thread’s attribute struc-
ture
55 / 397
PthreadsExample 1
1 #include <pthread.h>
2 #include <stdlib.h>
3 #include <unistd.h>
4 #include <stdio.h>
5
6 void *thread_function(void *arg){
7 int i;
8 for( i=0; i<20; i++ ){
9 printf("Thread says hi!\n");
10 sleep(1);
11 }
12 return NULL;
13 }
14
15 int main(void){
16 pthread_t mythread;
17 if(pthread_create(&mythread, NULL, thread_function, NULL)){
18 printf("error creating thread.");
19 abort();
20 }
21
22 if(pthread_join(mythread, NULL)){
23 printf("error joining thread.");
24 abort();
25 }
26 exit(0);
27 }
56 / 397
Pthreads
pthread_t defined in pthread.h, is often called a ”thread id”(tid);
pthread_create() returns zero on success and a non-zerovalue on failure;
pthread_join() returns zero on success and a non-zero valueon failure;
How to use pthread?1. #include<pthread.h>2. $ gcc thread1.c -o thread1 -pthread3. $ ./thread1
57 / 397
PthreadsExample 2
1 #include <pthread.h>
2 #include <stdio.h>
3 #include <stdlib.h>
4
5 #define NUMBER_OF_THREADS 10
6
7 void *print_hello_world(void *tid)
8 {
9 /* prints the thread’s identifier, then exits.*/
10 printf ("Thread %d: Hello World!\n", tid);
11 pthread_exit(NULL);
12 }
13
14 int main(int argc, char *argv[])
15 {
16 pthread_t threads[NUMBER_OF_THREADS];
17 int status, i;
18 for (i=0; i<NUMBER_OF_THREADS; i++)
19 {
20 printf ("Main: creating thread %d\n",i);
21 status = pthread_create(&threads[i], NULL, print_hello_world, (void *)i);
22
23 if(status != 0){
24 printf ("Oops. pthread_create returned error code %d\n",status);
25 exit(-1);
26 }
27 }
28 exit(NULL);
29 }
58 / 397
Pthreads
With or without pthread_join()? Check it by yourself.
59 / 397
User-Level Threads vs. Kernel-Level Threads
Process ProcessThread Thread
Processtable
Processtable
Threadtable
Threadtable
Run-timesystem
Kernelspace
Userspace
KernelKernel
Fig. 2-13. (a) A user-level threads package. (b) A threads packagemanaged by the kernel.
60 / 397
User-Level ThreadsUser-level threads provide a library of functions to allow user
processes to create and manage their ownthreads.
© No need to modify the OS;© Simple representation
▶ each thread is represented simply by a PC, regs, stack, anda small TCB, all stored in the user process’ address space
© Simple Management▶ creating a new thread, switching between threads, and
synchronization between threads can all be done withoutintervention of the kernel
© Fast▶ thread switching is not much more expensive than a
procedure call© Flexible
▶ CPU scheduling (among threads) can be customized to suitthe needs of the algorithm – each process can use adifferent thread scheduling algorithm
61 / 397
User-Level Threads
§ Lack of coordination between threads and OS kernel▶ Process as a whole gets one time slice▶ Same time slice, whether process has 1 thread or 1000
threads▶ Also – up to each thread to relinquish control to other
threads in that process§ Requires non-blocking system calls (i.e. a multithreaded
kernel)▶ Otherwise, entire process will blocked in the kernel, even if
there are runnable threads left in the process▶ part of motivation for user-level threads was not to have to
modify the OS§ If one thread causes a page fault(interrupt!), the entire
process blocks
62 / 397
Kernel-Level Threads
Kernel-level threads kernel provides system calls to createand manage threads
© Kernel has full knowledge of all threads▶ Scheduler may choose to give a process with 10 threads
more time than process with only 1 thread© Good for applications that frequently block (e.g. server
processes with frequent interprocess communication)§ Slow – thread operations are 100s of times slower than
for user-level threads§ Significant overhead and increased kernel complexity –
kernel must manage and schedule threads as well asprocesses
▶ Requires a full thread control block (TCB) for each thread
63 / 397
Hybrid ImplementationsCombine the advantages of two
Multiple user threadson a kernel thread
Userspace
KernelspaceKernel threadKernel
Fig. 2-14. Multiplexing user-level threads onto kernel-levelthreads. 64 / 397
Programming Complications
▶ fork(): shall the child has the threads that its parenthas?
▶ What happens if one thread closes a file while another isstill reading from it?
▶ What happens if several threads notice that there is toolittle memory?
And sometimes, threads fix the symptom, but not theproblem.
65 / 397
Linux Threads
To the Linux kernel, there is no concept of a thread▶ Linux implements all threads as standard processes▶ To Linux, a thread is merely a process that shares certain
resources with other processes▶ Some OS (MS Windows, Sun Solaris) have cheap threads
and expensive processes.▶ Linux processes are already quite lightweight
On a 75MHz Pentium thread: 1.7µsfork: 1.8µs
66 / 397
Linux Threadsclone() creates a separate process that shares the address
space of the calling process. The cloned task behavesmuch like a separate thread.
Program 3
Kernel space
User space
fork()
clone()
vfork()
System Call Table
3
120
289
Program 1 Program 2
C Library
vfork()fork() clone()
System call sys_fork()
sys_clone()
sys_vfork()
do_fork()
67 / 397
clone()
1 #include <sched.h>2 int clone(int (*fn) (void *), void *child_stack,3 int flags, void *arg, ...);
arg 1 the function to be executed, i.e. fn(arg), whichreturns an int;
arg 2 a pointer to a (usually malloced) memory space to beused as the stack for the new thread;
arg 3 a set of flags used to indicate how much the callingprocess is to be shared. In fact,
clone(0) == fork()
arg 4 the arguments passed to the function.
It returns the PID of the child process or -1 on failure.$ man clone
68 / 397
The clone() System Call
Some flags:flag SharedCLONE_FS File-system infoCLONE_VM Same memory spaceCLONE_SIGHAND Signal handlersCLONE_FILES The set of open files
In practice, one should try to avoid calling clone()directly
Instead, use a threading library (such as pthreads) whichuse clone() when starting a thread (such as during a callto pthread_create())
69 / 397
clone() Example1 #include <unistd.h>
2 #include <sched.h>
3 #include <sys/types.h>
4 #include <stdlib.h>
5 #include <string.h>
6 #include <stdio.h>
7 #include <fcntl.h>
8
9 int variable;
10
11 int do_something()
12 {
13 variable = 42;
14 _exit(0);
15 }
16
17 int main(void)
18 {
19 void *child_stack;
20 variable = 9;
21
22 child_stack = (void *) malloc(16384);
23 printf("The variable was %d\n", variable);
24
25 clone(do_something, child_stack,
26 CLONE_FS | CLONE_VM | CLONE_FILES, NULL);
27 sleep(1);
28
29 printf("The variable is now %d\n", variable);
30 return 0;
31 }
7
70 / 397
clone() Example1 #include <unistd.h>
2 #include <sched.h>
3 #include <sys/types.h>
4 #include <stdlib.h>
5 #include <string.h>
6 #include <stdio.h>
7 #include <fcntl.h>
8
9 int variable;
10
11 int do_something()
12 {
13 variable = 42;
14 _exit(0);
15 }
16
17 int main(void)
18 {
19 void *child_stack;
20 variable = 9;
21
22 child_stack = (void *) malloc(16384);
23 printf("The variable was %d\n", variable);
24
25 clone(do_something, child_stack,
26 CLONE_FS | CLONE_VM | CLONE_FILES, NULL);
27 sleep(1);
28
29 printf("The variable is now %d\n", variable);
30 return 0;
31 }
770 / 397
Stack Grows Downwards
child_stack = (void**)malloc(8192) + 8192/sizeof(*child_stack);✓71 / 397
References
Wikipedia. Process (computing) — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2014.Wikipedia. Process control block — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. Thread (computing) — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.
72 / 397
Interprocess CommunicationExample:
ps | head -2 | tail -1 | cut -f2 -d' '
IPC issues:1. How one process can pass information to another2. Be sure processes do not get into each other’s way
e.g. in an airline reservation system, two processes competefor the last seat
3. Proper sequencing when dependencies are presente.g. if A produces data and B prints them, B has to wait until A
has produced some data
Two models of IPC:▶ Shared memory▶ Message passing (e.g. sockets)
74 / 397
Producer-Consumer Problem
compiler Assemblycode assembler Object
module loader
Processwants
printingfile Printer
daemon
75 / 397
Process SynchronizationProducer-Consumer Problem
▶ Consumers don’t try to remove objects from Buffer whenit is empty.
▶ Producers don’t try to add objects to the Buffer when it isfull.
Producer
1 while(TRUE){2 while(FULL);3 item = produceItem();4 insertItem(item);5 }
Consumer
1 while(TRUE){2 while(EMPTY);3 item = removeItem();4 consumeItem(item);5 }
How to define full/empty?
76 / 397
Producer-Consumer Problem— Bounded-Buffer Problem (Circular Array)
Front(out): the first full positionRear(in): the next free position
out
inc
ba
Full or empty when front == rear?
77 / 397
Producer-Consumer Problem
Common solution:Full: when (in+1)%BUFFER_SIZE == out
Actually, this is full - 1
Empty: when in == out
Can only use BUFFER_SIZE-1 elements
Shared data:1 #define BUFFER_SIZE 62 typedef struct {3 ...4 } item;5 item buffer[BUFFER_SIZE];6 int in = 0; //the next free position7 int out = 0;//the first full position
78 / 397
Bounded-Buffer Problem
Producer:1 while (true) {2 /* do nothing -- no free buffers */3 while (((in + 1) % BUFFER_SIZE) == out);45 buffer[in] = item;6 in = (in + 1) % BUFFER_SIZE;7 }
Consumer:1 while (true) {2 while (in == out); // do nothing3 // remove an item from the buffer4 item = buffer[out];5 out = (out + 1) % BUFFER_SIZE;6 return item;7 }
out
inc
ba
79 / 397
Race Conditions
Now, let’s have two producers
4
5
6
7
abc
prog.c
prog.nProcess A
out = 4
in = 7
Process B
Spoolerdirectory
Fig. 2-18. Two processes want to access shared memory at thesame time.
80 / 397
Race ConditionsTwo producers
1 #define BUFFER_SIZE 1002 typedef struct {3 ...4 } item;5 item buffer[BUFFER_SIZE];6 int in = 0;7 int out = 0;
Process A and B do the same thing:1 while (true) {2 while (((in + 1) % BUFFER_SIZE) == out);3 buffer[in] = item;4 in = (in + 1) % BUFFER_SIZE;5 }
81 / 397
Race Conditions
Problem: Process B started using one of the sharedvariables before Process A was finished with it.
Solution: Mutual exclusion. If one process is using ashared variable or file, the other processes willbe excluded from doing the same thing.
82 / 397
Critical RegionsMutual Exclusion
Critical Region: is a piece of code accessing a commonresource.
A enters critical region A leaves critical region
B attempts toenter critical
region
B enterscritical region
T1 T2 T3 T4
Process A
Process B
B blocked
B leavescritical region
Time
Fig. 2-19. Mutual exclusion using critical regions. 83 / 397
Critical Region
A solution to the critical region problem must satisfy threeconditions:Mutual Exclusion: No two process may be simultaneously
inside their critical regions.Progress: No process running outside its critical region
may block other processes.Bounded Waiting: No process should have to wait forever to
enter its critical region.
84 / 397
Mutual Exclusion With Busy WaitingDisabling Interrupts
1 {2 ...3 disableINT();4 critical_code();5 enableINT();6 ...7 }
Problems:▶ It’s not wise to give user process the power of turning off
INTs.▶ Suppose one did it, and never turned them on again
▶ useless for multiprocessor systemDisabling INTs is often a useful technique within the kernelitself but is not a general mutual exclusion mechanism foruser processes.
85 / 397
Mutual Exclusion With Busy WaitingLock Variables
1 int lock=0; //shared variable2 {3 ...4 while(lock); //busy waiting5 lock=1;6 critical_code();7 lock=0;8 ...9 }
Problem:▶ What if an interrupt occurs right at line 5?▶ Checking the lock again while backing from an interrupt?
86 / 397
Mutual Exclusion With Busy WaitingStrict Alternation
Process 0
1 while(TRUE){2 while(turn != 0);3 critical_region();4 turn = 1;5 noncritical_region();6 }
Process 1
1 while(TRUE){2 while(turn != 1);3 critical_region();4 turn = 0;5 noncritical_region();6 }
Problem: violates condition-2▶ One process can be blocked by another not in its critical
region.▶ Requires the two processes strictly alternate in entering
their critical region.
87 / 397
Mutual Exclusion With Busy WaitingPeterson’s Solution
int interest[0] = 0;int interest[1] = 0;int turn;
P0
1 interest[0] = 1;2 turn = 1;3 while(interest[1] == 14 && turn == 1);5 critical_section();6 interest[0] = 0;
P1
1 interest[1] = 1;2 turn = 0;3 while(interest[0] == 14 && turn == 0);5 critical_section();6 interest[1] = 0;
Wikipedia. Peterson’s algorithm — Wikipedia, The FreeEncyclopedia. [Online; accessed 23-February-2015].2015.
88 / 397
Mutual Exclusion With Busy WaitingHardware Solution: The TSL Instruction
Lock the memory busenter3region:
TSL REGISTER,LOCK | copy lock to register and set lock to 1CMP REGISTER,#0 | was lock zero?JNE enter3region | if it was non zero, lock was set, so loopRET | return to caller; critical region entered
leave3region:MOVE LOCK,#0 | store a 0 in lockRET | return to caller
Fig. 2-22. Entering and leaving a critical region using the TSLinstruction.
89 / 397
Mutual Exclusion Without Busy WaitingSleep & Wakeup
1 #define N 100 /* number of slots in the buffer */2 int count = 0; /* number of items in the buffer */
1 void producer(){2 int item;3 while(TRUE){4 item = produce_item();5 if(count == N)6 sleep();7 insert_item(item);8 count++;9 if(count == 1)
10 wakeup(consumer);11 }12 }
1 void consumer(){2 int item;3 while(TRUE){4 if(count == 0)5 sleep();6 item = rm_item();7 count--;8 if(count == N - 1)9 wakeup(producer);
10 consume_item(item);11 }12 }
Deadlock!
90 / 397
Mutual Exclusion Without Busy WaitingSleep & Wakeup
1 #define N 100 /* number of slots in the buffer */2 int count = 0; /* number of items in the buffer */
1 void producer(){2 int item;3 while(TRUE){4 item = produce_item();5 if(count == N)6 sleep();7 insert_item(item);8 count++;9 if(count == 1)
10 wakeup(consumer);11 }12 }
1 void consumer(){2 int item;3 while(TRUE){4 if(count == 0)5 sleep();6 item = rm_item();7 count--;8 if(count == N - 1)9 wakeup(producer);
10 consume_item(item);11 }12 }
Deadlock!
90 / 397
Producer-Consumer ProblemRace Condition
Problem1. Consumer is going to sleep upon seeing an empty buffer,
but INT occurs;2. Producer inserts an item, increasing count to 1, then call
wakeup(consumer);3. But the consumer is not asleep, though count was 0. So
the wakeup() signal is lost;4. Consumer is back from INT remembering count is 0, and
goes to sleep;5. Producer sooner or later will fill up the buffer and also
goes to sleep;6. Both will sleep forever, and waiting to be waken up by
the other process. Deadlock!
91 / 397
Producer-Consumer ProblemRace Condition
Solution: Add a wakeup waiting bit1. The bit is set, when a wakeup is sent to an awaken
process;2. Later, when the process wants to sleep, it checks the bit
first. Turns it off if it’s set, and stays awake.What if many processes try going to sleep?
92 / 397
What is a Semaphore?▶ A locking mechanism▶ An integer or ADT
that can only be operated with:
Atomic OperationsP() V()Wait() Signal()Down() Up()Decrement() Increment()... ...
1 down(S){2 while(S<=0);3 S--;4 }
1 up(S){2 S++;3 }
More meaningful names:▶ increment_and_wake_a_waiting_process_if_any()▶ decrement_and_block_if_the_result_is_negative()
93 / 397
Semaphore
How to ensure atomic?1. For single CPU, implement up() and down() as system
calls, with the OS disabling all interrupts while accessingthe semaphore;
2. For multiple CPUs, to make sure only one CPU at a timeexamines the semaphore, a lock variable should be usedwith the TSL instructions.
94 / 397
Semaphore is a Special Integer
A semaphore is like an integer, with threedifferences:1. You can initialize its value to any integer, but after that
the only operations you are allowed to perform areincrement (S++) and decrement (S- -).
2. When a thread decrements the semaphore, if the resultis negative (S ≤ 0), the thread blocks itself and cannotcontinue until another thread increments the semaphore.
3. When a thread increments the semaphore, if there areother threads waiting, one of the waiting threads getsunblocked.
95 / 397
Why Semaphores?
We don’t need semaphores to solve synchronizationproblems, but there are some advantages to using them:
▶ Semaphores impose deliberate constraints that helpprogrammers avoid errors.
▶ Solutions using semaphores are often clean andorganized, making it easy to demonstrate theircorrectness.
▶ Semaphores can be implemented efficiently on manysystems, so solutions that use semaphores are portableand usually efficient.
96 / 397
The Simplest Use of SemaphoreSignaling
▶ One thread sends a signal to another thread to indicatethat something has happened
▶ it solves the serialization problemSignaling makes it possible to guarantee that a section ofcode in one thread will run before a section of code inanother thread
Thread A
1 statement a12 sem.signal()
Thread B
1 sem.wait()2 statement b1
What’s the initial value of sem?
97 / 397
SemaphoreRendezvous Puzzle
Thread A
1 statement a12 statement a2
Thread B
1 statement b12 statement b2
Q: How to guarantee that1. a1 happens before b2, and2. b1 happens before a2
a1 → b2; b1 → a2Hint: Use two semaphores initialized to 0.
98 / 397
Thread A: Thread B:Solution 1:statement a1 statement b1sem1.wait() sem1.signal()sem2.signal() sem2.wait()statement a2 statement b2Solution 2:statement a1 statement b1sem2.signal() sem1.signal()sem1.wait() sem2.wait()statement a2 statement b2Solution 3:statement a1 statement b1sem2.wait() sem1.wait()sem1.signal() sem2.signal()statement a2 statement b2
Deadlock!
99 / 397
Thread A: Thread B:Solution 1:statement a1 statement b1sem1.wait() sem1.signal()sem2.signal() sem2.wait()statement a2 statement b2Solution 2:statement a1 statement b1sem2.signal() sem1.signal()sem1.wait() sem2.wait()statement a2 statement b2Solution 3:statement a1 statement b1sem2.wait() sem1.wait()sem1.signal() sem2.signal()statement a2 statement b2Deadlock!
99 / 397
Mutex▶ A second common use for semaphores is to enforce
mutual exclusion▶ It guarantees that only one thread accesses the shared
variable at a time▶ A mutex is like a token that passes from one thread to
another, allowing one thread at a time to proceed
Q: Add semaphores to the following example to enforcemutual exclusion to the shared variable i.
Thread A: i++ Thread B: i++
Why? Because i++ is not atomic.
100 / 397
i++ is not atomic in assembly language1 LOAD [i], r0 ;load the value of 'i' into2 ;a register from memory3 ADD r0, 1 ;increment the value4 ;in the register5 STORE r0, [i] ;write the updated6 ;value back to memory
Interrupts might occur in between. So, i++ needs to beprotected with a mutex.
101 / 397
Mutex Solution
Create a semaphore named mutex that is initializedto 11: a thread may proceed and access the shared variable0: it has to wait for another thread to release the mutex
Thread A
1 mutex.wait()2 i++3 mutex.signal()
Thread B
1 mutex.wait()2 i++3 mutex.signal()
102 / 397
Multiplex — Without Busy Waiting1 typedef struct{2 int space; //number of free resources3 struct process *P; //a list of queueing producers4 struct process *C; //a list of queueing consumers5 } semaphore;6 semaphore S;7 S.space = 5;
Producer1 void down(S){2 S.space--;3 if(S.space == 4){4 rmFromQueue(S.C);5 wakeup(S.C);6 }7 if(S.space < 0){8 addToQueue(S.P);9 sleep();
10 }11 }
Consumer1 void up(S){2 S.space++;3 if(S.space > 5){4 addToQueue(S.C);5 sleep();6 }7 if(S.space <= 0){8 rmFromQueue(S.P);9 wakeup(S.P);
10 }11 }
if S.space < 0,S.space == Number of queueing producers
if S.space > 5,S.space == Number of queueing consumers + 5
103 / 397
Barrier
Bar
rier
Bar
rier
Bar
rier
A A A
B B B
C C
D D D
Time Time Time
Process
(a) (b) (c)
C
Fig. 2-30. Use of a barrier. (a) Processes approaching a barrier.(b) All processes but one blocked at the barrier. (c) When the lastprocess arrives at the barrier, all of them are let through.
1. Processes approaching a barrier2. All processes but one blocked at the barrier3. When the last process arrives at the barrier, all of them
are let through
Synchronization requirement:specific_task()critical_point()
No thread executes critical_point() until after all threadshave executed specific_task().
104 / 397
Barrier Solution1 n = the number of threads2 count = 03 mutex = Semaphore(1)4 barrier = Semaphore(0)
count: keeps track of how many threads have arrivedmutex: provides exclusive access to countbarrier: is locked (≤ 0) until all threads arrive
When barrier.value<0,barrier.value == Number of queueing processes
Solution 1
1 specific_task();2 mutex.wait();3 count++;4 mutex.signal();5 if (count < n)6 barrier.wait();7 barrier.signal();8 critical_point();
Solution 2
1 specific_task();2 mutex.wait();3 count++;4 mutex.signal();5 if (count == n)6 barrier.signal();7 barrier.wait();8 critical_point();
Only one thread can pass the barrier!✓ Deadlock!
105 / 397
Barrier Solution1 n = the number of threads2 count = 03 mutex = Semaphore(1)4 barrier = Semaphore(0)
count: keeps track of how many threads have arrivedmutex: provides exclusive access to countbarrier: is locked (≤ 0) until all threads arrive
When barrier.value<0,barrier.value == Number of queueing processes
Solution 1
1 specific_task();2 mutex.wait();3 count++;4 mutex.signal();5 if (count < n)6 barrier.wait();7 barrier.signal();8 critical_point();
Solution 2
1 specific_task();2 mutex.wait();3 count++;4 mutex.signal();5 if (count == n)6 barrier.signal();7 barrier.wait();8 critical_point();
Only one thread can pass the barrier!✓ Deadlock!
105 / 397
Barrier Solution
Solution 3
1 specific_task();23 mutex.wait();4 count++;5 mutex.signal();67 if (count == n)8 barrier.signal();9
10 barrier.wait();11 barrier.signal();1213 critical_point();
Solution 4
1 specific_task();23 mutex.wait();4 count++;56 if (count == n)7 barrier.signal();89 barrier.wait();
10 barrier.signal();11 mutex.signal();1213 critical_point();
Blocking on a semaphore while holding a mutex!
Deadloc
k!
106 / 397
Barrier Solution
Solution 3
1 specific_task();23 mutex.wait();4 count++;5 mutex.signal();67 if (count == n)8 barrier.signal();9
10 barrier.wait();11 barrier.signal();1213 critical_point();
Solution 4
1 specific_task();23 mutex.wait();4 count++;56 if (count == n)7 barrier.signal();89 barrier.wait();
10 barrier.signal();11 mutex.signal();1213 critical_point();
Blocking on a semaphore while holding a mutex!
Deadloc
k!
106 / 397
barrier.wait();barrier.signal();
TurnstileThis pattern, a wait and a signal in rapid succession, occursoften enough that it has a name called a turnstile, because
▶ it allows one thread to pass at a time, and▶ it can be locked to bar all threads
107 / 397
SemaphoresProducer-Consumer Problem
Whenever an event occurs▶ a producer thread creates an event object and adds it to
the event buffer. Concurrently,▶ consumer threads take events out of the buffer and
process them. In this case, the consumers are called“event handlers”.
Producer
1 event = waitForEvent()2 buffer.add(event)
Consumer
1 event = buffer.get()2 event.process()
Q: Add synchronization statements to the producer andconsumer code to enforce the synchronizationconstraints
1. Mutual exclusion2. Serialization
108 / 397
Semaphores — Producer-Consumer ProblemSolution
Initialization:mutex = Semaphore(1)items = Semaphore(0)
▶ mutex provides exclusive access to the buffer▶ items:
+: number of items in the buffer−: number of consumer threads in queue
109 / 397
Semaphores — Producer-Consumer ProblemSolution
Producer
1 event = waitForEvent()2 mutex.wait()3 buffer.add(event)4 items.signal()5 mutex.signal()
or,Producer
1 event = waitForEvent()2 mutex.wait()3 buffer.add(event)4 mutex.signal()5 items.signal()
Consumer
1 items.wait()2 mutex.wait()3 event = buffer.get()4 mutex.signal()5 event.process()
or,Consumer
1 mutex.wait()2 items.wait()3 event = buffer.get()4 mutex.signal()5 event.process()
Danger: any time you wait for a semaphore while holdinga mutex!
Deadlock!
110 / 397
Semaphores — Producer-Consumer ProblemSolution
Producer
1 event = waitForEvent()2 mutex.wait()3 buffer.add(event)4 items.signal()5 mutex.signal()
or,Producer
1 event = waitForEvent()2 mutex.wait()3 buffer.add(event)4 mutex.signal()5 items.signal()
Consumer
1 items.wait()2 mutex.wait()3 event = buffer.get()4 mutex.signal()5 event.process()
or,Consumer
1 mutex.wait()2 items.wait()3 event = buffer.get()4 mutex.signal()5 event.process()
Danger: any time you wait for a semaphore while holdinga mutex!
Deadlock!
110 / 397
Producer-Consumer Problem WithBounded-Buffer
Given:semaphore items = 0;semaphore spaces = BUFFER_SIZE;
Can we?if (items >= BUFFER_SIZE)
producer.block();
if: the buffer is fullthen: the producer blocks until a consumer removes
an item
No! We can’t check the current value of a semaphore,because
! the only operations are wait and signal.? But...
7
111 / 397
Producer-Consumer Problem WithBounded-Buffer
Given:semaphore items = 0;semaphore spaces = BUFFER_SIZE;
Can we?if (items >= BUFFER_SIZE)
producer.block();
if: the buffer is fullthen: the producer blocks until a consumer removes
an itemNo! We can’t check the current value of a semaphore,because
! the only operations are wait and signal.? But...
7
111 / 397
Producer-Consumer Problem WithBounded-Buffer
1 semaphore items = 0;2 semaphore spaces = BUFFER_SIZE;
1 void producer() {2 while (true) {3 item = produceItem();4 down(spaces);5 putIntoBuffer(item);6 up(items);7 }8 }
1 void consumer() {2 while (true) {3 down(items);4 item = rmFromBuffer();5 up(spaces);6 consumeItem(item);7 }8 }
works fine when there is only one producer and oneconsumer, because putIntoBuffer() is not atomic.
!
112 / 397
Producer-Consumer Problem WithBounded-Buffer
1 semaphore items = 0;2 semaphore spaces = BUFFER_SIZE;
1 void producer() {2 while (true) {3 item = produceItem();4 down(spaces);5 putIntoBuffer(item);6 up(items);7 }8 }
1 void consumer() {2 while (true) {3 down(items);4 item = rmFromBuffer();5 up(spaces);6 consumeItem(item);7 }8 }
works fine when there is only one producer and oneconsumer, because putIntoBuffer() is not atomic.
!112 / 397
putIntoBuffer() could contain two actions:1. determining the next available slot2. writing into it
Race condition:1. Two producers decrement spaces2. One of the producers determines the next empty slot in
the buffer3. Second producer determines the next empty slot and
gets the same result as the first producer4. Both producers write into the same slot
putIntoBuffer() needs to be protected with a mutex.
113 / 397
With a mutex
1 semaphore mutex = 1; //controls access to c.r.2 semaphore items = 0;3 semaphore spaces = BUFFER_SIZE;
1 void producer() {2 while (true) {3 item = produceItem();4 down(spaces);5 down(mutex);6 putIntoBuffer(item);7 up(mutex);8 up(items);9 }
10 }
1 void consumer() {2 while (true) {3 down(items);4 down(mutex);5 item = rmFromBuffer();6 up(mutex);7 up(spaces);8 consumeItem(item);9 }
10 }
114 / 397
Monitors
Monitor a high-level synchronization object for achievingmutual exclusion.
▶ It’s a language concept, and Cdoes not have it.
▶ Only one process can be activein a monitor at any instant.
▶ It is up to the compiler toimplement mutual exclusion onmonitor entries.
▶ The programmer just needs toknow that by turning all thecritical regions into monitorprocedures, no two processeswill ever execute their criticalregions at the same time.
1 monitor example2 integer i;3 condition c;45 procedure producer();6 ...7 end;89 procedure consumer();
10 ...11 end;12 end monitor;
115 / 397
MonitorThe producer-consumer problem
1 monitor ProducerConsumer2 condition full, empty;3 integer count;45 procedure insert(item: integer);6 begin7 if count = N then wait(full);8 insert_item(item);9 count := count + 1;
10 if count = 1 then signal(empty)11 end;1213 function remove: integer;14 begin15 if count = 0 then wait(empty);16 remove = remove_item;17 count := count - 1;18 if count = N - 1 then signal(full)19 end;20 count := 0;21 end monitor;
1 procedure producer;2 begin3 while true do4 begin5 item = produce_item;6 ProducerConsumer.insert(item)7 end8 end;910 procedure consumer;11 begin12 while true do13 begin14 item = ProducerConsumer.remove;15 consume_item(item)16 end17 end;
116 / 397
Message Passing
▶ Semaphores are too low level▶ Monitors are not usable except in a few programming
languages▶ Neither monitor nor semaphore is suitable for distributed
systems▶ No conflicts, easier to implement
Message passing uses two primitives, send and receivesystem calls:
- send(destination, &message);- receive(source, &message);
117 / 397
Message Passing
Design issues▶ Message can be lost by network; — ACK▶ What if the ACK is lost? — SEQ▶ What if two processes have the same name? — socket▶ Am I talking with the right guy? Or maybe a MIM? —
authentication▶ What if the sender and the receiver on the same
machine? — Copying messages is always slower thandoing a semaphore operation or entering a monitor.
118 / 397
Message PassingTCP Header Format
0 1 2 30 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Source Port | Destination Port |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Sequence Number |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Acknowledgment Number |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Data | |U|A|P|R|S|F| || Offset| Reserved |R|C|S|S|Y|I| Window || | |G|K|H|T|N|N| |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Checksum | Urgent Pointer |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| Options | Padding |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+| data |+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
119 / 397
Message PassingThe producer-consumer problem
1 #define N 100 /* number of slots in the buffer */2 void producer(void)3 {4 int item;5 message m; /* message buffer */6 while (TRUE) {7 item = produce_item(); /* generate something to put in buffer */8 receive(consumer, &m); /* wait for an empty to arrive */9 build_message(&m, item); /* construct a message to send */
10 send(consumer, &m); /* send item to consumer */11 }12 }1314 void consumer(void)15 {16 int item, i;17 message m;18 for (i=0; i<N; i++) send(producer, &m); /* send N empties */19 while (TRUE) {20 receive(producer, &m); /* get message containing item */21 item = extract_item(&m); /* extract item from message */22 send(producer, &m); /* send back empty reply */23 consume_item(item); /* do something with the item */24 }25 }
120 / 397
The Dining Philosophers Problem
1 while True:2 think()3 get_forks()4 eat()5 put_forks()
How to implement get_forks() and put_forks() to ensure1. No deadlock2. No starvation3. Allow more than one philosopher to eat at the same time
121 / 397
The Dining Philosophers ProblemDeadlock
#define N 5 /* number of philosophers */
void philosopher(int i) /* i: philosopher number, from 0 to 4 */{
while (TRUE) {think( ); /* philosopher is thinking */take3fork(i); /* take left fork */take3fork((i+1) % N); /* take right fork; % is modulo operator */eat( ); /* yum-yum, spaghetti */put3 fork(i); /* put left fork back on the table */put3 fork((i+1) % N); /* put right fork back on the table */
}}
Fig. 2-32. A nonsolution to the dining philosophers problem.▶ Put down the left fork and wait for a while if the right one
is not available? Similar to CSMA/CD — Starvation
122 / 397
The Dining Philosophers ProblemWith One Mutex
1 #define N 52 semaphore mutex=1;34 void philosopher(int i)5 {6 while (TRUE) {7 think();8 wait(&mutex);9 take_fork(i);
10 take_fork((i+1) % N);11 eat();12 put_fork(i);13 put_fork((i+1) % N);14 signal(&mutex);15 }16 }
▶ Only one philosopher can eat at a time.▶ How about 2 mutexes? 5 mutexes?
123 / 397
The Dining Philosophers ProblemWith One Mutex
1 #define N 52 semaphore mutex=1;34 void philosopher(int i)5 {6 while (TRUE) {7 think();8 wait(&mutex);9 take_fork(i);
10 take_fork((i+1) % N);11 eat();12 put_fork(i);13 put_fork((i+1) % N);14 signal(&mutex);15 }16 }
▶ Only one philosopher can eat at a time.▶ How about 2 mutexes? 5 mutexes?
123 / 397
The Dining Philosophers ProblemAST Solution (Part 1)
A philosopher may only move into eating state ifneither neighbor is eating
1 #define N 5 /* number of philosophers */
2 #define LEFT (i+N-1)%N /* number of i’s left neighbor */
3 #define RIGHT (i+1)%N /* number of i’s right neighbor */
4 #define THINKING 0 /* philosopher is thinking */
5 #define HUNGRY 1 /* philosopher is trying to get forks */
6 #define EATING 2 /* philosopher is eating */
7 typedef int semaphore;
8 int state[N]; /* state of everyone */
9 semaphore mutex = 1; /* for critical regions */
10 semaphore s[N]; /* one semaphore per philosopher */
11
12 void philosopher(int i) /* i: philosopher number, from 0 to N-1 */
13 {
14 while (TRUE) {
15 think( );
16 take_forks(i); /* acquire two forks or block */
17 eat( );
18 put_forks(i); /* put both forks back on table */
19 }
20 }
124 / 397
The Dining Philosophers ProblemAST Solution (Part 2)
1 void take_forks(int i) /* i: philosopher number, from 0 to N-1 */
2 {
3 down(&mutex); /* enter critical region */
4 state[i] = HUNGRY; /* record fact that philosopher i is hungry */
5 test(i); /* try to acquire 2 forks */
6 up(&mutex); /* exit critical region */
7 down(&s[i]); /* block if forks were not acquired */
8 }
9 void put_forks(i) /* i: philosopher number, from 0 to N-1 */
10 {
11 down(&mutex); /* enter critical region */
12 state[i] = THINKING; /* philosopher has finished eating */
13 test(LEFT); /* see if left neighbor can now eat */
14 test(RIGHT); /* see if right neighbor can now eat */
15 up(&mutex); /* exit critical region */
16 }
17 void test(i) /* i: philosopher number, from 0 to N-1 */
18 {
19 if (state[i] == HUNGRY && state[LEFT] != EATING && state[RIGHT] != EATING) {
20 state[i] = EATING;
21 up(&s[i]);
22 }
23 }
Starvation!
125 / 397
The Dining Philosophers ProblemAST Solution (Part 2)
1 void take_forks(int i) /* i: philosopher number, from 0 to N-1 */
2 {
3 down(&mutex); /* enter critical region */
4 state[i] = HUNGRY; /* record fact that philosopher i is hungry */
5 test(i); /* try to acquire 2 forks */
6 up(&mutex); /* exit critical region */
7 down(&s[i]); /* block if forks were not acquired */
8 }
9 void put_forks(i) /* i: philosopher number, from 0 to N-1 */
10 {
11 down(&mutex); /* enter critical region */
12 state[i] = THINKING; /* philosopher has finished eating */
13 test(LEFT); /* see if left neighbor can now eat */
14 test(RIGHT); /* see if right neighbor can now eat */
15 up(&mutex); /* exit critical region */
16 }
17 void test(i) /* i: philosopher number, from 0 to N-1 */
18 {
19 if (state[i] == HUNGRY && state[LEFT] != EATING && state[RIGHT] != EATING) {
20 state[i] = EATING;
21 up(&s[i]);
22 }
23 }
Starvation!
125 / 397
The Dining Philosophers ProblemMore Solutions
▶ If there is at least one leftie and at least one rightie, thendeadlock is not possible
▶ Wikipedia: Dining philosophers problem
126 / 397
The Readers-Writers ProblemConstraint: no process may access the shared data for
reading or writing while another process iswriting to it.
1 semaphore mutex = 1;2 semaphore noOther = 1;3 int readers = 0;45 void writer(void)6 {7 while (TRUE) {8 wait(&noOther);9 writing();
10 signal(&noOther);11 }12 }
1 void reader(void)2 {3 while (TRUE) {4 wait(&mutex);5 readers++;6 if (readers == 1)7 wait(&noOther);8 signal(&mutex);9 reading();
10 wait(&mutex);11 readers--;12 if (readers == 0)13 signal(&noOther);14 signal(&mutex);15 anything();16 }17 }
Starvation The writer could be blocked forever if there arealways someone reading.
Starvation!
127 / 397
The Readers-Writers ProblemConstraint: no process may access the shared data for
reading or writing while another process iswriting to it.
1 semaphore mutex = 1;2 semaphore noOther = 1;3 int readers = 0;45 void writer(void)6 {7 while (TRUE) {8 wait(&noOther);9 writing();
10 signal(&noOther);11 }12 }
1 void reader(void)2 {3 while (TRUE) {4 wait(&mutex);5 readers++;6 if (readers == 1)7 wait(&noOther);8 signal(&mutex);9 reading();
10 wait(&mutex);11 readers--;12 if (readers == 0)13 signal(&noOther);14 signal(&mutex);15 anything();16 }17 }
Starvation The writer could be blocked forever if there arealways someone reading.
Starvation!
127 / 397
The Readers-Writers ProblemNo starvation
1 semaphore mutex = 1;2 semaphore noOther = 1;3 semaphore turnstile = 1;4 int readers = 0;56 void writer(void)7 {8 while (TRUE) {9 turnstile.wait();
10 wait(&noOther);11 writing();12 signal(&noOther);13 turnstile.signal();14 }15 }
1 void reader(void)2 {3 while (TRUE) {4 turnstile.wait();5 turnstile.signal();67 wait(&mutex);8 readers++;9 if (readers == 1)
10 wait(&noOther);11 signal(&mutex);12 reading();13 wait(&mutex);14 readers--;15 if (readers == 0)16 signal(&noOther);17 signal(&mutex);18 anything();19 }20 }
128 / 397
The Sleeping Barber Problem
Where’s the problem?▶ the barber saw an empty room right before a customer
arrives the waiting room;▶ Several customer could race for a single chair;
129 / 397
Solution1 #define CHAIRS 52 semaphore customers = 0; // any customers or not?3 semaphore bber = 0; // barber is busy4 semaphore mutex = 1;5 int waiting = 0; // queueing customers
1 void barber(void)2 {3 while (TRUE) {4 wait(&customers);5 wait(&mutex);6 waiting--;7 signal(&mutex);8 signal(&bber);9 cutHair();
10 }11 }
1 void customer(void)2 {3 wait(&mutex);4 if (waiting == CHAIRS){5 signal(&mutex);6 goHome();7 } else {8 waiting++;9 signal(&customers);
10 signal(&mutex);11 wait(&bber);12 getHairCut();13 }14 }
130 / 397
Solution2
1 #define CHAIRS 52 semaphore customers = 0;3 semaphore bber = ???;4 semaphore mutex = 1;5 int waiting = 0;67 void barber(void)8 {9 while (TRUE) {
10 wait(&customers);11 cutHair();12 }13 }
1 void customer(void)2 {3 wait(&mutex);4 if (waiting == CHAIRS){5 signal(&mutex);6 goHome();7 } else {8 waiting++;9 signal(&customers);10 signal(&mutex);11 wait(&bber);12 getHairCut();13 wait(&mutex);14 waiting--;15 signal(&mutex);16 signal(&bber);17 }18 }
131 / 397
References
Wikipedia. Inter-process communication — Wikipedia,The Free Encyclopedia. [Online; accessed21-February-2015]. 2015.Wikipedia. Semaphore (programming) — Wikipedia, TheFree Encyclopedia. [Online; accessed21-February-2015]. 2015.
132 / 397
Scheduling Queues
Job queue consists all the processes in the systemReady queue A linked list consists processes in the main
memory ready for executeDevice queue Each device has its own device queue
134 / 397
3.2 Process Scheduling 105
queue header PCB7
PCB3
PCB5
PCB14 PCB6
PCB2
head
head
head
head
head
readyqueue
disk unit 0
terminal unit 0
magtape
unit 0
magtape
unit 1
tail registers registers
tail
tail
tail
tail
•••
•••
•••
Figure 3.6 The ready queue and various I/O device queues.
The system also includes other queues. When a process is allocated theCPU, it executes for a while and eventually quits, is interrupted, or waits forthe occurrence of a particular event, such as the completion of an I/O request.Suppose the process makes an I/O request to a shared device, such as a disk.Since there are many processes in the system, the disk may be busy with theI/O request of some other process. The process therefore may have to wait forthe disk. The list of processes waiting for a particular I/O device is called adevice queue. Each device has its own device queue (Figure 3.6).
A common representation of process scheduling is a queueing diagram,such as that in Figure 3.7. Each rectangular box represents a queue. Two typesof queues are present: the ready queue and a set of device queues. The circlesrepresent the resources that serve the queues, and the arrows indicate the flowof processes in the system.
A new process is initially put in the ready queue. It waits there until it isselected for execution, or is dispatched. Once the process is allocated the CPUand is executing, one of several events could occur:
• The process could issue an I/O request and then be placed in an I/O queue.
• The process could create a new subprocess and wait for the subprocess’stermination.
• The process could be removed forcibly from the CPU as a result of aninterrupt, and be put back in the ready queue.
135 / 397
Queueing Diagram106 Chapter 3 Processes
ready queue CPU
I/O I/O queue I/O request
time sliceexpired
fork achild
wait for aninterrupt
interruptoccurs
childexecutes
Figure 3.7 Queueing-diagram representation of process scheduling.
In the first two cases, the process eventually switches from the waiting stateto the ready state and is then put back in the ready queue. A process continuesthis cycle until it terminates, at which time it is removed from all queues andhas its PCB and resources deallocated.
3.2.2 Schedulers
A process migrates among the various scheduling queues throughout itslifetime. The operating system must select, for scheduling purposes, processesfrom these queues in some fashion. The selection process is carried out by theappropriate scheduler.
Often, in a batch system, more processes are submitted than can be executedimmediately. These processes are spooled to a mass-storage device (typically adisk), where they are kept for later execution. The long-term scheduler, or jobscheduler, selects processes from this pool and loads them into memory forexecution. The short-term scheduler, or CPU scheduler, selects from amongthe processes that are ready to execute and allocates the CPU to one of them.
The primary distinction between these two schedulers lies in frequencyof execution. The short-term scheduler must select a new process for the CPUfrequently. A process may execute for only a few milliseconds before waitingfor an I/O request. Often, the short-term scheduler executes at least once every100 milliseconds. Because of the short time between executions, the short-termscheduler must be fast. If it takes 10 milliseconds to decide to execute a processfor 100 milliseconds, then 10/(100 + 10) = 9 percent of the CPU is being used(wasted) simply for scheduling the work.
The long-term scheduler executes much less frequently; minutes may sep-arate the creation of one new process and the next. The long-term schedulercontrols the degree of multiprogramming (the number of processes in mem-ory). If the degree of multiprogramming is stable, then the average rate ofprocess creation must be equal to the average departure rate of processes
136 / 397
Scheduling
▶ Scheduler uses scheduling algorithm to choose a processfrom the ready queue
▶ Scheduling doesn’t matter much on simple PCs, because1. Most of the time there is only one active process2. The CPU is too fast to be a scarce resource any more
▶ Scheduler has to make efficient use of the CPU becauseprocess switching is expensive
1. User mode → kernel mode2. Save process state, registers, memory map...3. Selecting a new process to run by running the scheduling
algorithm4. Load the memory map of the new process5. The process switch usually invalidates the entire memory
cache
137 / 397
Scheduling Algorithm Goals
All systemsFairness giving each process a fair share of the CPU
Policy enforcement seeing that stated policy is carried outBalance keeping all parts of the system busy
138 / 397
Batch systemsThroughput maximize jobs per hourTurnaround time minimize time between submission and
terminationCPU utilization keep the CPU busy all the time
Interactive systemsResponse time respond to requests quicklyProportionality meet users’ expectations
Real-time systemsMeeting deadlines avoid losing dataPredictability avoid quality degradation in multimedia
systems
139 / 397
Process BehaviorCPU-bound vs. I/O-bound
Types of CPU bursts:▶ long bursts – CPU bound (i.e. batch work)▶ short bursts – I/O bound (i.e. emacs)
Long CPU burst
Short CPU burst
Waiting for I/O
(a)
(b)
Time
Fig. 2-37. Bursts of CPU usage alternate with periods of waitingfor I/O. (a) A CPU-bound process. (b) An I/O-bound process.
As CPUs get faster, processes tend to get more I/O-bound.
140 / 397
Process ClassificationTraditionally
CPU-bound processes vs. I/O-bound processes
AlternativelyInteractive processes responsiveness
▶ command shells, editors, graphical appsBatch processes no user interaction, run in background,
often penalized by the scheduler▶ programming language compilers, database
search engines, scientific computationsReal-time processes video and sound apps, robot controllers,
programs that collect data from physical sensors
▶ should never be blocked by lower-priorityprocesses
▶ should have a short guaranteed responsetime with a minimum variance
141 / 397
Schedulers
Long-term scheduler (or job scheduler) selects whichprocesses should be brought into the readyqueue.
Short-term scheduler (or CPU scheduler) selects whichprocess should be executed next and allocatesCPU.
Midium-term scheduler swapping.
▶ LTS is responsible for a good process mix of I/O-boundand CPU-bound process leading to best performance.
▶ Time-sharing systems, e.g. UNIX, often have nolong-term scheduler.
142 / 397
Nonpreemptive vs. preemptive
A nonpreemptive scheduling algorithm lets a process run aslong as it wants until it blocks (I/O or waiting foranother process) or until it voluntarily releasesthe CPU.
A preemptive scheduling algorithm will forcibly suspend aprocess after it runs for sometime. — clockinterruptable
143 / 397
Scheduling In Batch SystemsFirst-Come First-Served
▶ nonpreemptive▶ simple▶ also has a disadvantage
What if a CPU-bound process (e.g. runs 1s at a time)followed by many I/O-bound processes (e.g. 1000 diskreads to complete)?
▶ In this case, a preemptive scheduling is preferred.
144 / 397
Scheduling In Batch SystemsShortest Job First
(a)
8
A
4
B
4
C
4
D
(b)
8
A
4
B
4
C
4
D
Fig. 2-39. An example of shortest job first scheduling. (a) Run-ning four jobs in the original order. (b) Running them in shortestjob first order.
Average turnaround time(a) (8 + 12 + 16 + 20)÷ 4 = 14
(b) (4 + 8 + 12 + 20)÷ 4 = 11
How to know the length of the next CPU burst?▶ For long-term (job) scheduling, user provides▶ For short-term scheduling, no way
145 / 397
Scheduling In Interactive SystemsRound-Robin Scheduling
(a)
Currentprocess
Nextprocess
B F D G A
(b)
Currentprocess
F D G A B
Fig. 2-41. Round-robin scheduling. (a) The list of runnableprocesses. (b) The list of runnable processes after B uses up itsquantum.
▶ Simple, and most widely used;▶ Each process is assigned a time interval, called itsquantum;
▶ How long shoud the quantum be?▶ too short — too many process switches, lower CPU
efficiency;▶ too long — poor response to short interactive requests;▶ usually around 20 ∼ 50ms.
146 / 397
Scheduling In Interactive SystemsPriority Scheduling
Priority 4
Priority 3
Priority 2
Priority 1
Queueheaders
Runable processes
(Highest priority)
(Lowest priority)
Fig. 2-42. A scheduling algorithm with four priority classes.▶ SJF is a priority scheduling;▶ Starvation — low priority processes may never execute;
▶ Aging — as time progresses increase the priority of theprocess;
$ man nice
147 / 397
Thread SchedulingProcess A Process B Process BProcess A
1. Kernel picks a process 1. Kernel picks a thread
Possible: A1, A2, A3, A1, A2, A3Also possible: A1, B1, A2, B2, A3, B3
Possible: A1, A2, A3, A1, A2, A3Not possible: A1, B1, A2, B2, A3, B3
(a) (b)
Order in whichthreads run
2. Runtime system picks a thread
1 2 3 1 3 2
Fig. 2-43. (a) Possible scheduling of user-level threads with a 50-msec process quantum and threads that run 5 msec per CPU burst.(b) Possible scheduling of kernel-level threads with the samecharacteristics as (a).
▶ With kernel-level threads, sometimes a full contextswitch is required
▶ Each process can have its own application-specific threadscheduler, which usually works better than kernel can
148 / 397
Process Scheduling In LinuxA preemptive, priority-based algorithm with two separatepriority ranges:1. real-time range (0 ∼ 99), for tasks where absolute
priorities are more important than fairness2. nice value range (100 ∼ 139), for fair preemptive
scheduling among multiple processes
149 / 397
In Linux, Process Priority is DynamicThe scheduler keeps track of what processes aredoing and adjusts their priorities periodically
▶ Processes that have been denied the use of a CPU for along time interval are boosted by dynamically increasingtheir priority (usually I/O-bound)
▶ Processes running for a long time are penalized bydecreasing their priority (usually CPU-bound)
▶ Priority adjustments are performed only on user tasks,not on real-time tasks
Tasks are determined to be I/O-bound orCPU-bound based on an interactivity heuristic
A task’s interactiveness metric is calculated based onhow much time the task executes compared to howmuch time it sleeps
150 / 397
Problems With The Pre-2.6 Scheduler
▶ an algorithm with O(n) complexity▶ a single runqueue for all processors
▶ good for load balancing▶ bad for CPU caches, when a task is rescheduled from one
CPU to another▶ a single runqueue lock — only one CPU working at a time
151 / 397
Scheduling In Linux 2.6 Kernel
▶ O(1) — Time for finding a task to execute depends not onthe number of active tasks but instead on the number ofpriorities
▶ Each CPU has its own runqueue, and schedules itselfindependently; better cache efficiency
▶ The job of the scheduler is simple — Choose the task onthe highest priority list to execute
How to know there are processes waiting in apriority list?A priority bitmap (5 32-bit words for 140 priorities) is used todefine when tasks are on a given priority list.
▶ find-first-bit-set instruction is used to find the highestpriority bit.
152 / 397
Scheduling In Linux 2.6 Kernel
Each runqueue has two priority arrays
153 / 397
Completely Fair Scheduling (CFS)
Linux’s Process Schedulerup to 2.4: simple, scaled poorly
▶ O(n)▶ non-preemptive
▶ single run queue(cache? SMP?)
from 2.5 on: O(1) scheduler© 140 priority lists — scaled well© one run queue per CPU — true SMP support© preemptive© ideal for large server workloads§ showed latency on desktop systems
from 2.6.23 on: Completely Fair Scheduler (CFS)© improved interactive performance
154 / 397
Completely Fair Scheduler (CFS)
For a perfect (unreal) multitasking CPU▶ n runnable processes can run at the same time▶ each process should receive 1
n of CPU powerFor a real world CPU
▶ can run only a single task at once — unfair© while one task is running
§§ the others have to wait▶ p->wait_runtime is the amount of time the task should
now run on the CPU for it becomes completely fair andbalanced.© on ideal CPU, the p->wait_runtime value would always be
zero▶ CFS always tries to run the task with the largest
p->wait_runtime value
155 / 397
CFS
In practice it works like this:▶ While a task is using the CPU, its wait_runtime decreases
wait_runtime = wait_runtime - time_runningif: its wait_runtime = MAXwait_runtime (among all processes)
then: it gets preempted▶ Newly woken tasks (wait_runtime = 0) are put into the
tree more and more to the right▶ slowly but surely giving a chance for every task to
become the “leftmost task” and thus get on the CPUwithin a deterministic amount of time
156 / 397
References
Wikipedia. Scheduling (computing) — Wikipedia, TheFree Encyclopedia. [Online; accessed21-February-2015]. 2015.
157 / 397
A Major Class of Deadlocks Involve Resources
Processes need access to resources in reasonableorderSuppose...
▶ a process holds resource A and requests resource B. Atsame time,
▶ another process holds B and requests ABoth are blocked and remain so
Examples of computer resources▶ printers▶ memory space▶ data (e.g. a locked record in a DB)▶ semaphores
159 / 397
Resourcestypedef int semaphore;
semaphore resource31; semaphore resource31;semaphore resource32; semaphore resource32;
void process3A(void) { void process3A(void) {down(&resource 31); down(&resource 31);down(&resource 32); down(&resource 32);use3both3resources( ); use3both3resources( );up(&resource32); up(&resource32);up(&resource31); up(&resource31);
} }
void process3B(void) { void process3B(void) {down(&resource 31); down(&resource 32);down(&resource 32); down(&resource 31);use3both3resources( ); use3both3resources( );up(&resource32); up(&resource31);up(&resource31); up(&resource32);
} }
(a) (b)
Fig. 3-2. (a) Deadlock-free code. (b) Code with a potentialdeadlock.
Deadlock!
160 / 397
Resources
Deadlocks occur when ...processes are granted exclusive access to resources
e.g. devices, data records, files, ...
Preemptable resources can be taken away from a processwith no ill effects
e.g. memoryNonpreemptable resources will cause the process to fail if
taken awaye.g. CD recorder
In general, deadlocks involve nonpreemptable resources.
161 / 397
Resources
Sequence of events required to use a resourcerequest à use à releaseopen() close()allocate() free()
What if request is denied?Requesting process
▶ may be blocked▶ may fail with error code
162 / 397
The Best Illustration of a Deadlock
A law passed by the Kansas legislature early in the20th century“When two trains approach each other at a crossing, bothshall come to a full stop and neither shall start up again untilthe other has gone.”
163 / 397
Introduction to Deadlocks
DeadlockA set of processes is deadlocked if each process in the set iswaiting for an event that only another process in the set cancause.
▶ Usually the event is release of a currently held resource▶ None of the processes can ...
▶ run▶ release resources▶ be awakened
164 / 397
Four Conditions For Deadlocks
Mutual exclusion condition each resource can only beassigned to one process or is available
Hold and wait condition process holding resources canrequest additional
No preemption condition previously granted resourcescannot forcibly taken away
Circular wait condition▶ must be a circular chain of 2 or more
processes▶ each is waiting for resource held by next
member of the chain
165 / 397
Four Conditions For Deadlocks
Unlocking a deadlock is to answer 4 questions:1. Can a resource be assigned to more than one process at
once?2. Can a process hold a resource and ask for another?3. can resources be preempted?4. Can circular waits exits?
166 / 397
Resource-Allocation Graph
(a) (b) (c)
T U
D
C
S
B
A
R
Fig. 3-3. Resource allocation graphs. (a) Holding a resource.(b) Requesting a resource. (c) Deadlock.
has wants
Strategies for dealing with Deadlocks1. detection and recovery2. dynamic avoidance — careful resource allocation3. prevention — negating one of the four necessary
conditions4. just ignore the problem altogether
167 / 397
Resource-Allocation Graph
168 / 397
Resource-Allocation Graph
Basic facts:▶ No cycles à no deadlock▶ If graph contains a cycle à
▶ if only one instance per resourcetype, then deadlock
▶ if several instances per resource type,possibility of deadlock
169 / 397
Deadlock Detection
▶ Allow system to enter deadlock state▶ Detection algorithm▶ Recovery scheme
170 / 397
Deadlock DetectionSingle Instance of Each Resource Type — Wait-for Graph
▶ Pi → Pj if Pi is waiting for Pj;▶ Periodically invoke an algorithm that searches for a cycle
in the graph. If there is a cycle, there exists a deadlock.▶ An algorithm to detect a cycle in a graph requires an
order of n2 operations, where n is the number of verticesin the graph.
171 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
E = ( 4 2 3 1 )
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
A = ( 2 1 0 0 )
Current allocation matrix
020
001
102
010
Request matrix
212
001
010
100
R =C =
Fig. 3-7. An example for the deadlock detection algorithm.Row n:C: current allocation to process nR: current requirement of process n
Column m:C: current allocation of resource class mR: current requirement of resource class m
172 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Resources in existence(E1, E2, E3, …, Em)
Current allocation matrix
C11C21
Cn1
C12C22
Cn2
C13C23
Cn3
C1mC2m
Cnm
Row n is current allocationto process n
Resources available(A1, A2, A3, …, Am)
Request matrix
R11R21
Rn1
R12R22
Rn2
R13R23
Rn3
R1mR2m
Rnm
Row 2 is what process 2 needs
Fig. 3-6. The four data structures needed by the deadlock detectionalgorithm.
n∑i=1
Cij + Aj = Ej
e.g.(C13 + C23 + . . . + Cn3) + A3 = E3
173 / 397
Maths recall: vectors comparisonFor two vectors, X and Y
X ≤ Y iff Xi ≤ Yi for 0 ≤ i ≤m
e.g. [1 2 3 4
]≤
[2 3 4 4
][1 2 3 4
]≰
[2 3 2 4
]
174 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
E = ( 4 2 3 1 )
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
A = ( 2 1 0 0 )
Current allocation matrix
020
001
102
010
Request matrix
212
001
010
100
R =C =
Fig. 3-7. An example for the deadlock detection algorithm.
A R(2 1 0 0) ≥ R3, (2 1 0 0)(2 2 2 0) ≥ R2, (1 0 1 0)(4 2 2 1) ≥ R1, (2 0 0 1)
175 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
E = ( 4 2 3 1 )
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
A = ( 2 1 0 0 )
Current allocation matrix
020
001
102
010
Request matrix
212
001
010
100
R =C =
Fig. 3-7. An example for the deadlock detection algorithm.
A R(2 1 0 0) ≥ R3, (2 1 0 0)
(2 2 2 0) ≥ R2, (1 0 1 0)(4 2 2 1) ≥ R1, (2 0 0 1)
175 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
E = ( 4 2 3 1 )
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
A = ( 2 1 0 0 )
Current allocation matrix
020
001
102
010
Request matrix
212
001
010
100
R =C =
Fig. 3-7. An example for the deadlock detection algorithm.
A R(2 1 0 0) ≥ R3, (2 1 0 0)(2 2 2 0) ≥ R2, (1 0 1 0)
(4 2 2 1) ≥ R1, (2 0 0 1)
175 / 397
Deadlock DetectionSeveral Instances of a Resource Type
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
E = ( 4 2 3 1 )
Tape
driv
es
Plotte
rs
Scann
ers
CD Rom
s
A = ( 2 1 0 0 )
Current allocation matrix
020
001
102
010
Request matrix
212
001
010
100
R =C =
Fig. 3-7. An example for the deadlock detection algorithm.
A R(2 1 0 0) ≥ R3, (2 1 0 0)(2 2 2 0) ≥ R2, (1 0 1 0)(4 2 2 1) ≥ R1, (2 0 0 1)
175 / 397
Recovery From Deadlock
▶ Recovery through preemption▶ take a resource from some other process▶ depends on nature of the resource
▶ Recovery through rollback▶ checkpoint a process periodically▶ use this saved state▶ restart the process if it is found deadlocked
▶ Recovery through killing processes
176 / 397
Deadlock AvoidanceResource Trajectories
Plotter
Printer
Printer
Plotter
B
A
u (Both processesfinished)
p q
r
s
t
I8
I7
I6
I5
I4I3I2I1
���� ���
deadzone
Unsafe region
▶ B is requesting a resource at point t. The system mustdecide whether to grant it or not.
▶ Deadlock is unavoidable if you get into unsafe region.177 / 397
Deadlock AvoidanceSafe and Unsafe States
Assuming E = 10
Unsafe
A
B
C
3
2
2
9
4
7
Free: 3(a)
A
B
C
4
2
2
9
4
7
Free: 2(b)
A
B
C
4
4 —4
2
9
7
Free: 0(c)
A
B
C
4
—
2
9
7
Free: 4(d)
Has Max Has Max Has Max Has Max
Fig. 3-10. Demonstration that the state in (b) is not safe.
unsafe
Safe
A
B
C
3
2
2
9
4
7
Free: 3(a)
A
B
C
3
4
2
9
4
7
Free: 1(b)
A
B
C
3
0 ––
2
9
7
Free: 5(c)
A
B
C
3
0
7
9
7
Free: 0(d)
–
A
B
C
3
0
0
9
–
Free: 7(e)
Has Max Has Max Has Max Has Max Has Max
Fig. 3-9. Demonstration that the state in (a) is safe.
178 / 397
Deadlock AvoidanceThe Banker’s Algorithm for a Single Resource
The banker’s algorithm considers each request as it occurs,and sees if granting it leads to a safe state.
A
B
C
D
0
0
0
0
6
Has Max
5
4
7
Free: 10
A
B
C
D
1
1
2
4
6
Has Max
5
4
7
Free: 2
A
B
C
D
1
2
2
4
6
Has Max
5
4
7
Free: 1
(a) (b) (c)
Fig. 3-11. Three resource allocation states: (a) Safe. (b) Safe.(c) Unsafe.
unsafe
!
unsafe = deadlock
179 / 397
Deadlock AvoidanceThe Banker’s Algorithm for Multiple Resources
Proce
ssTa
pe d
rives
Plotte
rsA
B
C
D
E
3
0
1
1
0
0
1
1
1
0
1
0
1
0
0
1
0
0
1
0
E = (6342)P = (5322)A = (1020)
Resources assignedPro
cess
Tape
driv
esPlo
tters
A
B
C
D
E
1
0
3
0
2
1
1
1
0
1
0
1
0
1
1
0
2
0
0
0
Resources still needed
Scann
ers
CD RO
Ms
Scann
ers
CD RO
Ms
Fig. 3-12.Thebanker’salgorithmwith multiple resources.D→ A→ B,C,ED→ E→ A→ B,C
180 / 397
Deadlock AvoidanceMission Impossible
In practice▶ processes rarely know in advance their max future
resource needs;▶ the number of processes is not fixed;▶ the number of available resources is not fixed.
Conclusion: Deadlock avoidance is essentially a missionimpossible.
181 / 397
Deadlock PreventionBreak The Four Conditions
Attacking the Mutual Exclusion Condition▶ For example, using a printing daemon to avoid exclusive
access to a printer.▶ Not always possible
▶ Not required for sharable resources;▶ must hold for nonsharable resources.
▶ The best we can do is to avoid mutual exclusion as muchas possible
▶ Avoid assigning a resource when not really necessary▶ Try to make sure as few processes as possible may actually
claim the resource
182 / 397
Attacking the Hold and Wait ConditionMust guarantee that whenever a process requests aresource, it does not hold any other resources.
Try: the processes must request all their resourcesbefore starting execution
if everything is availablethen can run
if one or more resources are busythen nothing will be allocated (just wait)
Problem:▶ many processes don’t know what they will
need before running▶ Low resource utilization; starvation possible
183 / 397
Attacking the No Preemption Conditionif a process that is holding some resources requests
another resource that cannot be immediately allocatedto it
then 1. All resources currently being held are released2. Preempted resources are added to the list of resources for
which the process is waiting3. Process will be restarted only when it can regain its old
resources, as well as the new ones that it is requesting
Low resource utilization; starvation possible
184 / 397
Attacking Circular Wait ConditionImpose a total ordering of all resource types, and requirethat each process requests resources in an increasing orderof enumeration
A1. Imagesetter2. Scanner3. Plotter4. Tape drive5. CD Rom drive
i
B
j
(a) (b)
Fig. 3-13. (a) Numerically ordered resources. (b) A resourcegraph.
It’s hard to find an ordering that satisfies everyone.
185 / 397
The Ostrich Algorithm
▶ Pretend there is no problem▶ Reasonable if
▶ deadlocks occur very rarely▶ cost of prevention is high
▶ UNIX and Windows takes thisapproach
▶ It is a trade off between▶ convenience▶ correctness
186 / 397
References
Wikipedia. Deadlock — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.
187 / 397
Memory Management
In a perfect world Memory is large, fast, non-volatile
In real world ...
Registers
Cache
Main memory
Magnetic tape
Magnetic disk
1 nsec
2 nsec
10 nsec
10 msec
100 sec
<1 KB
1 MB
64-512 MB
5-50 GB
20-100 GB
Typical capacityTypical access time
Fig. 1-7. A typical memory hierarchy. The numbers are very roughapproximations.
Memory manager handles the memory hierarchy.
189 / 397
Basic Memory ManagementReal Mode
In the old days ...▶ Every program simply saw the physical memory▶ mono-programming without swapping or paging
(a) (b) (c)
0xFFF …
0 0 0
Userprogram
Userprogram
Userprogram
Operatingsystem in
RAM
Operatingsystem in
RAM
Operatingsystem in
ROM
Devicedrivers in ROM
Fig. 4-1. Three simple ways of organizing memory with an operat-ing system and one user process. Other possibilities also exist.
old mainstream handhold, embedded MS-DOS
190 / 397
Basic Memory ManagementRelocation Problem
Exposingphysical memory
toprocessesis not
a good idea
191 / 397
Memory ProtectionProtected mode
We need▶ Protect the OS from access by user programs▶ Protect user programs from one another
Protected mode is an operational mode of x86-compatibleCPU.
▶ The purpose is to protect everyone else(including the OS) from your program.
192 / 397
Memory ProtectionLogical Address Space
Base register holds the smallest legal physical memoryaddress
Limit register contains the size of the range
278 Chapter 7 Main Memory
operands, results may be stored back in memory. The memory unit sees only astream of memory addresses; it does not know how they are generated (by theinstruction counter, indexing, indirection, literal addresses, and so on) or whatthey are for (instructions or data). Accordingly, we can ignore how a programgenerates a memory address. We are interested only in the sequence of memoryaddresses generated by the running program.
We begin our discussion by covering several issues that are pertinent to thevarious techniques for managing memory. This coverage includes an overviewof basic hardware issues, the binding of symbolic memory addresses to actualphysical addresses, and the distinction between logical and physical addresses.We conclude the section with a discussion of dynamically loading and linkingcode and shared libraries.
7.1.1 Basic Hardware
Main memory and the registers built into the processor itself are the onlystorage that the CPU can access directly. There are machine instructions that takememory addresses as arguments, but none that take disk addresses. Therefore,any instructions in execution, and any data being used by the instructions,must be in one of these direct-access storage devices. If the data are not inmemory, they must be moved there before the CPU can operate on them.
Registers that are built into the CPU are generally accessible within onecycle of the CPU clock. Most CPUs can decode instructions and perform simpleoperations on register contents at the rate of one or more operations perclock tick. The same cannot be said of main memory, which is accessed viaa transaction on the memory bus. Completing a memory access may takemany cycles of the CPU clock. In such cases, the processor normally needsto stall, since it does not have the data required to complete the instructionthat it is executing. This situation is intolerable because of the frequency ofmemory accesses. The remedy is to add fast memory between the CPU and
operatingsystem
0
256000
300040 300040
base
120900
limit420940
880000
1024000
process
process
process
Figure 7.1 A base and a limit register define a logical address space.
A pair of base and limit reg-isters define the logical ad-dress space
JMP 28
à
JMP 300068
193 / 397
Memory ProtectionBase and limit registers
7.1 Background 279
main memory. A memory buffer used to accommodate a speed differential,called a cache, is described in Section 1.8.3.
Not only are we concerned with the relative speed of accessing physicalmemory, but we also must ensure correct operation to protect the operatingsystem from access by user processes and, in addition, to protect user processesfrom one another. This protection must be provided by the hardware. It can beimplemented in several ways, as we shall see throughout the chapter. In thissection, we outline one possible implementation.
We first need to make sure that each process has a separate memory space.To do this, we need the ability to determine the range of legal addresses thatthe process may access and to ensure that the process can access only theselegal addresses. We can provide this protection by using two registers, usuallya base and a limit, as illustrated in Figure 7.1. The base register holds thesmallest legal physical memory address; the limit register specifies the size ofthe range. For example, if the base register holds 300040 and the limit register is120900, then the program can legally access all addresses from 300040 through420939 (inclusive).
Protection of memory space is accomplished by having the CPU hardwarecompare every address generated in user mode with the registers. Any attemptby a program executing in user mode to access operating-system memory orother users’ memory results in a trap to the operating system, which treats theattempt as a fatal error (Figure 7.2). This scheme prevents a user program from(accidentally or deliberately) modifying the code or data structures of eitherthe operating system or other users.
The base and limit registers can be loaded only by the operating system,which uses a special privileged instruction. Since privileged instructions canbe executed only in kernel mode, and since only the operating system executesin kernel mode, only the operating system can load the base and limit registers.This scheme allows the operating system to change the value of the registersbut prevents user programs from changing the registers’ contents.
The operating system, executing in kernel mode, is given unrestrictedaccess to both operating system memory and users’ memory. This provisionallows the operating system to load users’ programs into users’ memory, to
base
memorytrap to operating system
monitor—addressing error
address yesyes
nono
CPU
base � limit
≥ <
Figure 7.2 Hardware address protection with base and limit registers.
194 / 397
UNIX View of a Process’ Memory
max +------------------+
| Stack | Stack segment
+--------+---------+
| | |
| v |
| |
| ^ |
| | |
+--------+---------+
| Dynamic storage | Heap
|(from new, malloc)|
+------------------+
| Static variables |
| (uninitialized, | BSS segment
| initialized) | Data segment
+------------------+
| Code | Text segment
0 +------------------+
the size (text + data + bss) ofa process is established atcompile time
text: program codedata: initialized global and
static databss: uninitialized global and
static dataheap: dynamically allocated
with malloc, newstack: local variables
195 / 397
Stack vs. Heap
Stack Heap
compile-time allocation run-time allocation
auto clean-up you clean-up
inflexible flexible
smaller bigger
quicker slower
How large is the ...stack: ulimit -sheap: could be as large as your virtual memory
text|data|bss: size a.out
196 / 397
Multi-step Processing of a User ProgramWhen is space allocated?
7.1 Background 281
dynamiclinking
sourceprogram
objectmodule
linkageeditor
loadmodule
loader
in-memorybinary
memoryimage
otherobject
modules
compiletime
loadtime
executiontime (runtime)
compiler orassembler
systemlibrary
dynamicallyloadedsystemlibrary
Figure 7.3 Multistep processing of a user program.
7.1.3 Logical Versus Physical Address Space
An address generated by the CPU is commonly referred to as a logical address,whereas an address seen by the memory unit—that is, the one loaded intothe memory-address register of the memory—is commonly referred to as aphysical address.
The compile-time and load-time address-binding methods generate iden-tical logical and physical addresses. However, the execution-time address-binding scheme results in differing logical and physical addresses. In this case,we usually refer to the logical address as a virtual address. We use logical addressand virtual address interchangeably in this text. The set of all logical addressesgenerated by a program is a logical address space; the set of all physicaladdresses corresponding to these logical addresses is a physical address space.Thus, in the execution-time address-binding scheme, the logical and physicaladdress spaces differ.
The run-time mapping from virtual to physical addresses is done by ahardware device called the memory-management unit (MMU). We can choosefrom many different methods to accomplish thsi mapping, as we discuss in
Static: before program startrunning
▶ Compile time▶ Load time
Dynamic: as program runs▶ Execution time
197 / 397
Address BindingWho assigns memory to segments?
Static-binding: before a program starts runningCompile time: Compiler and assembler generate an object
file for each source fileLoad time:
▶ Linker combines all the object files into asingle executable object file
▶ Loader (part of OS) loads an executableobject file into memory at location(s)determined by the OS
- invoked via the execve system call
Dynamic-binding: as program runs▶ Execution time:
▶ uses new and malloc to dynamically allocate memory▶ gets space on stack during function calls
198 / 397
Static loading
▶ The entire program and all data of a process must be inphysical memory for the process to execute
▶ The size of a process is thus limited to the size ofphysical memory670 Chapter 7 Linking
main2.c vector.h
libvector.a libc.a
addvec.o printf.o and any othermodules called by printf.o
main2.o
Translators(cpp, cc1, as)
Linker (ld)
p2 Fully linkedexecutable object file
Relocatableobject files
Source files
Static libraries
Figure 7.7 Linking with static libraries.
To build the executable, we would compile and link the input files main.o andlibvector.a:
unix> gcc -O2 -c main2.c
unix> gcc -static -o p2 main2.o ./libvector.a
Figure 7.7 summarizes the activity of the linker. The -static argument tellsthe compiler driver that the linker should build a fully linked executable object filethat can be loaded into memory and run without any further linking at load time.When the linker runs, it determines that the addvec symbol defined by addvec.ois referenced by main.o, so it copies addvec.o into the executable. Since theprogram doesn’t reference any symbols defined by multvec.o, the linker doesnot copy this module into the executable. The linker also copies the printf.omodule from libc.a, along with a number of other modules from the C run-timesystem.
7.6.3 How Linkers Use Static Libraries to Resolve References
While static libraries are useful and essential tools, they are also a source ofconfusion to programmers because of the way the Unix linker uses them to resolveexternal references. During the symbol resolution phase, the linker scans therelocatable object files and archives left to right in the same sequential order thatthey appear on the compiler driver’s command line. (The driver automaticallytranslates any .c files on the command line into .o files.) During this scan, thelinker maintains a set E of relocatable object files that will be merged to form theexecutable, a set U of unresolved symbols (i.e., symbols referred to, but not yetdefined), and a set D of symbols that have been defined in previous input files.Initially, E, U , and D are empty.
. For each input file f on the command line, the linker determines if f is anobject file or an archive. If f is an object file, the linker adds f to E, updatesU and D to reflect the symbol definitions and references in f , and proceedsto the next input file.
199 / 397
Dynamic Linking
A dynamic linker is actually a special loader that loadsexternal shared libraries into a running process
▶ Small piece of code, stub, used to locate the appropriatememory-resident library routine
▶ Only one copy in memory▶ Don’t have to re-link after a library update
if ( stub_is_executed ){
if ( !routine_in_memory )
load_routine_into_memory();
stub_replaces_itself_with_routine();
execute_routine();
}
200 / 397
Dynamic LinkingSection 7.11 Loading and Linking Shared Libraries from Applications 683
Figure 7.15Dynamic linking withshared libraries.
main2.c
libc.solibvector.so
libc.solibvector.so
main2.o
p2
Translators(cpp,cc1,as)
Linker (ld)
Fully linkedexecutable in memory
Partially linkedexecutable object file
vector.h
Loader(execve)
Dynamic linker (ld-linux.so)
Relocatableobject file
Relocation andsymbol table info
Code and data
contains a .interp section, which contains the path name of the dynamic linker,which is itself a shared object (e.g., ld-linux.so on Linux systems). Instead ofpassing control to the application, as it would normally do, the loader loads andruns the dynamic linker.
The dynamic linker then finishes the linking task by performing the followingrelocations:
. Relocating the text and data of libc.so into some memory segment.
. Relocating the text and data of libvector.so into another memory segment.
. Relocating any references in p2 to symbols defined by libc.so and libvec-tor.so.
Finally, the dynamic linker passes control to the application. From this point on,the locations of the shared libraries are fixed and do not change during executionof the program.
7.11 Loading and Linking Shared Libraries from Applications
Up to this point, we have discussed the scenario in which the dynamic linker loadsand links shared libraries when an application is loaded, just before it executes.However, it is also possible for an application to request the dynamic linker toload and link arbitrary shared libraries while the application is running, withouthaving to link in the applications against those libraries at compile time.
201 / 397
Logical vs. Physical Address Space
▶ Mapping logical address space to physical address spaceis central to MMLogical address generated by the CPU; also referred to
as virtual addressPhysical address address seen by the memory unit
▶ In compile-time and load-time address binding schemes,LAS and PAS are identical in size
▶ In execution-time address binding scheme, they arediffer.
202 / 397
Logical vs. Physical Address SpaceThe user program
▶ deals with logical addresses▶ never sees the real physical addresses282 Chapter 7 Main Memory
�
MMU
CPU memory14346
14000
relocationregister
346
logicaladdress
physicaladdress
Figure 7.4 Dynamic relocation using a relocation register.
Sections 7.3 through 7.7. For the time being, we illustrate this mapping witha simple MMU scheme that is a generalization of the base-register schemedescribed in Section 7.1.1. The base register is now called a relocation register.The value in the relocation register is added to every address generated by a userprocess at the time the address is sent to memory (see Figure 7.4). For example,if the base is at 14000, then an attempt by the user to address location 0 isdynamically relocated to location 14000; an access to location 346 is mappedto location 14346. The MS-DOS operating system running on the Intel 80x86family of processors used four relocation registers when loading and runningprocesses.
The user program never sees the real physical addresses. The program cancreate a pointer to location 346, store it in memory, manipulate it, and compare itwith other addresses—all as the number 346. Only when it is used as a memoryaddress (in an indirect load or store, perhaps) is it relocated relative to the baseregister. The user program deals with logical addresses. The memory-mappinghardware converts logical addresses into physical addresses. This form ofexecution-time binding was discussed in Section 7.1.2. The final location ofa referenced memory address is not determined until the reference is made.
We now have two different types of addresses: logical addresses (in therange 0 to max) and physical addresses (in the range R + 0 to R + max for a basevalue R). The user generates only logical addresses and thinks that the processruns in locations 0 to max. The user program generates only logical addressesand thinks that the process runs in locations 0 to max. However, these logicaladdresses must be mapped to physical addresses before they are used.
The concept of a logical address space that is bound to a separate physicaladdress space is central to proper memory management.
7.1.4 Dynamic Loading
In our discussion so far, it has been necessary for the entire program and alldata of a process to be in physical memory for the process to execute. The sizeof a process has thus been limited to the size of physical memory. To obtainbetter memory-space utilization, we can use dynamic loading. With dynamic
203 / 397
MMUMemory Management Unit
CPUpackage
CPU
The CPU sends virtualaddresses to the MMU
The MMU sends physicaladdresses to the memory
Memorymanagement
unit
MemoryDisk
controller
Bus
Fig. 4-9. The position and function of the MMU. Here the MMUis shown as being a part of the CPU chip because it commonly isnowadays. However, logically it could be a separate chip and wasin years gone by.
204 / 397
Memory Protection
7.3 Contiguous Memory Allocation 287
In contiguous memory allocation, each process is contained in a singlecontiguous section of memory.
7.3.1 Memory Mapping and Protection
Before discussing memory allocation further, we must discuss the issue ofmemory mapping and protection. We can provide these features by using arelocation register, as discussed in Section 7.1.3, together with a limit register,as discussed in Section 7.1.1. The relocation register contains the value ofthe smallest physical address; the limit register contains the range of logicaladdresses (for example, relocation = 100040 and limit = 74600). With relocationand limit registers, each logical address must be less than the limit register; theMMU maps the logical address dynamically by adding the value in the relocationregister. This mapped address is sent to memory (Figure 7.6).
When the CPU scheduler selects a process for execution, the dispatcherloads the relocation and limit registers with the correct values as part of thecontext switch. Because every address generated by a CPU is checked againstthese registers, we can protect both the operating system and other users’programs and data from being modified by this running process.
The relocation-register scheme provides an effective way to allow theoperating system’s size to change dynamically. This flexibility is desirable inmany situations. For example, the operating system contains code and bufferspace for device drivers. If a device driver (or other operating-system service)is not commonly used, we do not want to keep the code and data in memory, aswe might be able to use that space for other purposes. Such code is sometimescalled transient operating-system code; it comes and goes as needed. Thus,using this code changes the size of the operating system during programexecution.
7.3.2 Memory Allocation
Now we are ready to turn to memory allocation. One of the simplestmethods for allocating memory is to divide memory into several fixed-sizedpartitions. Each partition may contain exactly one process. Thus, the degree
CPU memory
logicaladdress
trap: addressing error
no
yesphysicaladdress
relocationregister
��
limitregister
Figure 7.6 Hardware support for relocation and limit registers.
205 / 397
Swapping
284 Chapter 7 Main Memory
in it. Other programs linked before the new library was installed will continueusing the older library. This system is also known as shared libraries.
Unlike dynamic loading, dynamic linking generally requires help from theoperating system. If the processes in memory are protected from one another,then the operating system is the only entity that can check to see whether theneeded routine is in another process’s memory space or that can allow multipleprocesses to access the same memory addresses. We elaborate on this conceptwhen we discuss paging in Section 7.4.4.
7.2 Swapping
A process must be in memory to be executed. A process, however, can beswapped temporarily out of memory to a backing store and then broughtback into memory for continued execution. For example, assume a multipro-gramming environment with a round-robin CPU-scheduling algorithm. Whena quantum expires, the memory manager will start to swap out the process thatjust finished and to swap another process into the memory space that has beenfreed (Figure 7.5). In the meantime, the CPU scheduler will allocate a time sliceto some other process in memory. When each process finishes its quantum, itwill be swapped with another process. Ideally, the memory manager can swapprocesses fast enough that some processes will be in memory, ready to execute,when the CPU scheduler wants to reschedule the CPU. In addition, the quantummust be large enough to allow reasonable amounts of computing to be donebetween swaps.
A variant of this swapping policy is used for priority-based schedulingalgorithms. If a higher-priority process arrives and wants service, the memorymanager can swap out the lower-priority process and then load and executethe higher-priority process. When the higher-priority process finishes, the
operatingsystem
swap out
swap in
userspace
main memory
backing store
process P2
process P11
2
Figure 7.5 Swapping of two processes using a disk as a backing store.
Major part of swap time is transfer timeTotal transfer time is directly proportional to the amountof memory swapped
206 / 397
Contiguous Memory AllocationMultiple-partition allocation
(a)
Operatingsystem����
A
(b)
Operatingsystem
����
A
B
(c)
Operatingsystem
�A
B
C
(d)
Time
Operatingsystem
����
����
B
C
(e)
D
Operatingsystem����B
C
(f)
D
Operatingsystem
����
���C
(g)
D
Operatingsystem
�A
C
Fig. 4-5. Memory allocation changes as processes come intomemory and leave it. The shaded regions are unused memory.
Operating system maintains information about:a allocated partitionsb free partitions (hole)
207 / 397
Dynamic Storage-Allocation ProblemFirst Fit, Best Fit, Worst Fit
1000 10000 5000 200
150
150 1509850
leftover
50
leftover
firstfit
worst
fit
bestfit
208 / 397
First-fit: The first hole that is big enoughBest-fit: The smallest hole that is big enough
▶ Must search entire list, unless ordered bysize
▶ Produces the smallest leftover holeWorst-fit: The largest hole
▶ Must also search entire list▶ Produces the largest leftover hole
▶ First-fit and best-fit better than worst-fit in terms ofspeed and storage utilization
▶ First-fit is generally faster
209 / 397
Fragmentation
ProcessA
ProcessB
ProcessC
InternalFragmentation
externalFragmentation
Reduce external fragmentationby
▶ Compaction is possible only ifrelocation is dynamic, and is doneat execution time
▶ Noncontiguous memory allocation▶ Paging▶ Segmentation
210 / 397
Virtual MemoryLogical memory can be much larger than physical memory
Virtualaddress
space
Physicalmemoryaddress
60K-64K
56K-60K
52K-56K
48K-52K
44K-48K
40K-44K
36K-40K
32K-36K
28K-32K
24K-28K
20K-24K
16K-20K
12K-16K
8K-12K
4K-8K
0K-4K
28K-32K
24K-28K
20K-24K
16K-20K
12K-16K
8K-12K
4K-8K
0K-4K
Virtual page
Page frame
X
X
X
X
7
X
5
X
X
X
3
4
0
6
1
2
Fig. 4-10. The relation between virtual addresses and physicalmemory addresses is given by the page table.
Address translation
virtualaddress
page table−−−−−−→ physicaladdress
Page 0map to−−−−→ Frame 2
0virtualmap to−−−−→ 8192physical
20500vir(20k+ 20)vir
map to−−−−→ 12308phy(12k+ 20)phy
211 / 397
Page FaultVirtual
addressspace
Physicalmemoryaddress
60K-64K
56K-60K
52K-56K
48K-52K
44K-48K
40K-44K
36K-40K
32K-36K
28K-32K
24K-28K
20K-24K
16K-20K
12K-16K
8K-12K
4K-8K
0K-4K
28K-32K
24K-28K
20K-24K
16K-20K
12K-16K
8K-12K
4K-8K
0K-4K
Virtual page
Page frame
X
X
X
X
7
X
5
X
X
X
3
4
0
6
1
2
Fig. 4-10. The relation between virtual addresses and physicalmemory addresses is given by the page table.
MOV REG, 32780 ?Page fault & swapping
212 / 397
PagingAddress Translation Scheme
Address generated by CPU is divided into:Page number(p): an index into a page tablePage offset(d): to be copied into memoryGiven logical address space (2m) and page size (2n),
number of pages =2m
2n= 2m−n
Example: addressing to 0010000000000100
m−n=4︷ ︸︸ ︷0 0 1 0
n=12︷ ︸︸ ︷0 0 0 0 0 0 0 0 0 1 0 0︸ ︷︷ ︸
m=16
page number = 0010 = 2, page offset = 000000000100
213 / 397
1514131211109876543210
000000000000111000101000000000011100000110001010
0000101000111111 Present/
absent bit
Pagetable
12-bit offsetcopied directlyfrom inputto output
Virtual page = 2 is usedas an index into thepage table Incoming
virtualaddress(8196)
Outgoingphysicaladdress(24580)
110
1 1 0 0 0 0 0 0 0 0 0 0 1 0 0
00 1 0 0 0 0 0 0 0 0 0 0 1 0 0
Fig. 4-11. The internal operation of the MMU with 16 4-KBpages.
Virtual pages: 16Page size: 4k
Virtual memory: 64KPhysical frames: 8
Physical memory: 32K
214 / 397
Shared Pages 7.5 Structure of the Page Table 299
7
6
5
ed 24
ed 13
2
data 11
0
3
4
6
1
page tablefor P1
process P1
data 1
ed 2
ed 3
ed 1
3
4
6
2
page tablefor P3
process P3
data 3
ed 2
ed 3
ed 1
3
4
6
7
page tablefor P2
process P2
data 2
ed 2
ed 3
ed 1
8
9
10
11
data 3
2data
ed 3
Figure 7.13 Sharing of code in a paging environment.
of interprocess communication. Some operating systems implement sharedmemory using shared pages.
Organizing memory according to pages provides numerous benefits inaddition to allowing several processes to share the same physical pages. Wecover several other benefits in Chapter 8.
7.5 Structure of the Page Table
In this section, we explore some of the most common techniques for structuringthe page table.
7.5.1 Hierarchical Paging
Most modern computer systems support a large logical address space(232 to 264). In such an environment, the page table itself becomes excessivelylarge. For example, consider a system with a 32-bit logical address space. Ifthe page size in such a system is 4 KB (212), then a page table may consist ofup to 1 million entries (232/212). Assuming that each entry consists of 4 bytes,each process may need up to 4 MB of physical address space for the page tablealone. Clearly, we would not want to allocate the page table contiguously inmain memory. One simple solution to this problem is to divide the page tableinto smaller pieces. We can accomplish this division in several ways.
One way is to use a two-level paging algorithm, in which the page tableitself is also paged (Figure 7.14). For example, consider again the system with
215 / 397
Page Table EntryIntel i386 Page Table Entry
▶ Commonly 4 bytes (32 bits) long▶ Page size is usually 4k (212 bytes). OS dependent
$ getconf PAGESIZE▶ Could have 232−12 = 220 = 1M pages
Could address 1M× 4KB = 4GB memory
31 12 11 0
+--------------------------------------+-------+---+-+-+---+-+-+-+
| | | | | | |U|R| |
| PAGE FRAME ADDRESS 31..12 | AVAIL |0 0|D|A|0 0|/|/|P|
| | | | | | |S|W| |
+--------------------------------------+-------+---+-+-+---+-+-+-+
P - PRESENT
R/W - READ/WRITE
U/S - USER/SUPERVISOR
A - ACCESSED
D - DIRTY
AVAIL - AVAILABLE FOR SYSTEMS PROGRAMMER USE
NOTE: 0 INDICATES INTEL RESERVED. DO NOT DEFINE.
216 / 397
Page Table
▶ Page table is kept in main memory▶ Usually one page table for each process▶ Page-table base register (PTBR): A pointer to the page
table is stored in PCB▶ Page-table length register (PRLR): indicates size of the
page table▶ Slow
▶ Requires two memory accesses. One for the page tableand one for the data/instruction.
▶ TLB
217 / 397
Translation Lookaside Buffer (TLB)Fact: 80-20 rule
▶ Only a small fraction of the PTEs are heavily read; therest are barely used at all
296 Chapter 7 Main Memory
page table
f
CPU
logicaladdress
p d
f d
physicaladdress
physicalmemory
p
TLB miss
pagenumber
framenumber
TLB hit
TLB
Figure 7.11 Paging hardware with TLB.
contain valid virtual addresses but have incorrect or invalid physical addressesleft over from the previous process.
The percentage of times that a particular page number is found in the TLBis called the hit ratio. An 80-percent hit ratio, for example, means that wefind the desired page number in the TLB 80 percent of the time. If it takes 20nanoseconds to search the TLB and 100 nanoseconds to access memory, thena mapped-memory access takes 120 nanoseconds when the page number isin the TLB. If we fail to find the page number in the TLB (20 nanoseconds),then we must first access memory for the page table and frame number (100nanoseconds) and then access the desired byte in memory (100 nanoseconds),for a total of 220 nanoseconds. To find the effective memory-access time, weweight the case by its probability:
effective access time = 0.80 × 120 + 0.20 × 220= 140 nanoseconds.
In this example, we suffer a 40-percent slowdown in memory-access time (from100 to 140 nanoseconds).
For a 98-percent hit ratio, we have
effective access time = 0.98 × 120 + 0.02 × 220= 122 nanoseconds.
This increased hit rate produces only a 22-percent slowdown in access time.We will further explore the impact of the hit ratio on the TLB in Chapter 8.
218 / 397
Multilevel Page Tables▶ a 1M-entry page table eats
4M memory▶ while 100 processes running,
400M memory is gone forpage tables
▶ avoid keeping all the pagetables in memory all thetime
A two-level scheme:
page number | page offset
+----------+----------+------------+
| p1 | p2 | d |
+----------+----------+------------+
10 10 12
| |
| ‘-> pointing to 1k frames
‘--> pointing to 1k page tables
300 Chapter 7 Main Memory
•••
•••
outer pagetable
page ofpage table
page tablememory
929
900
929
900
708
500
100
1
0
•••
100
708
•••
•••
•••
•••
•••
•••
•••
•••
•••
1
500
Figure 7.14 A two-level page-table scheme.
a 32-bit logical address space and a page size of 4 KB. A logical address isdivided into a page number consisting of 20 bits and a page offset consistingof 12 bits. Because we page the page table, the page number is further dividedinto a 10-bit page number and a 10-bit page offset. Thus, a logical address is asfollows:
p1 p2 d
page number page offset
10 10 12
where p1 is an index into the outer page table and p2 is the displacementwithin the page of the inner page table. The address-translation method for thisarchitecture is shown in Figure 7.15. Because address translation works fromthe outer page table inward, this scheme is also known as a forward-mappedpage table.
The VAX architecture supports a variation of two-level paging. The VAX isa 32-bit machine with a page size of 512 bytes. The logical address space of aprocess is divided into four equal sections, each of which consists of 230 bytes.Each section represents a different part of the logical address space of a process.The first 2 high-order bits of the logical address designate the appropriatesection. The next 21 bits represent the logical page number of that section, andthe final 9 bits represent an offset in the desired page. By partitioning the page
219 / 397
Two-Level Page TablesExample
Don’t have to keep all the 1K page tables(1M pages) in memory. In this exampleonly 4 page tables are actuallymapped into memory
process+-------+| stack | 4M+-------+| || ... | unused| |+-------+| data | 4M+-------+| code | 4M+-------+
Top−levelpage table
Second−levelpage tables
Topages
Pagetable forthe top4M ofmemory
6543210
1023
1023
654321
0
1023
Bits 10 10 12
PT1 PT2 Offset
4M−1
8M−1
Page table
Page table
Page table
Page table
Page table
Page table
00000000010000000011000000000100PT1 = 1 PT2 = 3 Offset = 4
01
23 929
788
850
Page table
Page table
901
500
220 / 397
Problem With 64-bit Systems
if▶ virtual address space = 64bits▶ page size = 4KB = 212 B
then How much space would a simple single-levelpage table take?
if Each page table entry takes 4Bytesthen The whole page table (264−12 entries) will
take
264−12×4B = 254 B = 16PB (peta⇒ tera⇒ giga)!
And this is for ONE process!Multi-level?
if 10bits for each levelthen 64−12
10 = 5 levels are required5 memory accress for each address translation!
221 / 397
Inverted Page TablesIndex with frame number
Inverted Page Table:▶ One entry for each physical frame
▶ The physical frame number is the table index▶ A single global page table for all processes
▶ The table is shared — PID is required
▶ Physical pages are now mapped to virtual — each entrycontains a virtual page number instead of a physical one
▶ Information bits, e.g. protection bit, are as usual
222 / 397
Find index according to entry contents(pid, p)⇒ i
7.5 Structure of the Page Table 303
address, regardless of the latter’s validity). This table representation is a naturalone, since processes reference pages through the pages’ virtual addresses. Theoperating system must then translate this reference into a physical memoryaddress. Since the table is sorted by virtual address, the operating system isable to calculate where in the table the associated physical address entry islocated and to use that value directly. One of the drawbacks of this methodis that each page table may consist of millions of entries. These tables mayconsume large amounts of physical memory just to keep track of how otherphysical memory is being used.
To solve this problem, we can use an inverted page table. An invertedpage table has one entry for each real page (or frame) of memory. Each entryconsists of the virtual address of the page stored in that real memory location,with information about the process that owns the page. Thus, only one pagetable is in the system, and it has only one entry for each page of physicalmemory. Figure 7.17 shows the operation of an inverted page table. Compareit with Figure 7.7, which depicts a standard page table in operation. Invertedpage tables often require that an address-space identifier (Section 7.4.2) bestored in each entry of the page table, since the table usually contains severaldifferent address spaces mapping physical memory. Storing the address-spaceidentifier ensures that a logical page for a particular process is mapped to thecorresponding physical page frame. Examples of systems using inverted pagetables include the 64-bit UltraSPARC and PowerPC.
To illustrate this method, we describe a simplified version of the invertedpage table used in the IBM RT. Each virtual address in the system consists of atriple:
<process-id, page-number, offset>.
Each inverted page-table entry is a pair <process-id, page-number> where theprocess-id assumes the role of the address-space identifier. When a memory
page table
CPU
logicaladdress physical
address physicalmemory
i
pid p
pid
search
p
d i d
Figure 7.17 Inverted page table.
223 / 397
Std. PTE (32-bit sys.):page frame address | info
+--------------------+------+
| 20 | 12 |
+--------------------+------+
indexed by page number
if 220 entries, 4B eachthen SIZEpage table =
220 × 4 = 4MB(for each process)
Inverted PTE (64-bit sys.):pid | virtual page number | info
+-----+---------------------+------+
| 16 | 52 | 12 |
+-----+---------------------+------+indexed by frame number
if assuming▶ 16 bits for PID▶ 52 bits for virtual page
number▶ 12 bits of information
then each entry takes16 + 52 + 12 = 80bits = 10bytes
if physical mem = 1G (230 B), andpage size = 4K (212 B), we’llhave 230−12 = 218 pages
then SIZEpage table = 218×10B = 2.5MB(for all processes)
224 / 397
Inefficient: Require searching the entire table
225 / 397
Hashed Inverted Page TablesA hash anchor table — an extra level before the actual page
table▶ maps process IDs
virtual page numbers ⇒ page table entries▶ Since collisions may occur, the page table must do
chaining
302 Chapter 7 Main Memory
The next step would be a four-level paging scheme, where the second-levelouter page table itself is also paged, and so forth. The 64-bit UltraSPARC wouldrequire seven levels of paging—a prohibitive number of memory accesses—to translate each logical address. You can see from this example why, for 64-bitarchitectures, hierarchical page tables are generally considered inappropriate.
7.5.2 Hashed Page Tables
A common approach for handling address spaces larger than 32 bits is to usea hashed page table, with the hash value being the virtual page number. Eachentry in the hash table contains a linked list of elements that hash to the samelocation (to handle collisions). Each element consists of three fields: (1) thevirtual page number, (2) the value of the mapped page frame, and (3) a pointerto the next element in the linked list.
The algorithm works as follows: The virtual page number in the virtualaddress is hashed into the hash table. The virtual page number is comparedwith field 1 in the first element in the linked list. If there is a match, thecorresponding page frame (field 2) is used to form the desired physical address.If there is no match, subsequent entries in the linked list are searched for amatching virtual page number. This scheme is shown in Figure 7.16.
A variation of this scheme that is favorable for 64-bit address spaces hasbeen proposed. This variation uses clustered page tables, which are similar tohashed page tables except that each entry in the hash table refers to severalpages (such as 16) rather than a single page. Therefore, a single page-tableentry can store the mappings for multiple physical-page frames. Clusteredpage tables are particularly useful for sparse address spaces, where memoryreferences are noncontiguous and scattered throughout the address space.
7.5.3 Inverted Page Tables
Usually, each process has an associated page table. The page table has oneentry for each page that the process is using (or one slot for each virtual
hash table
q s
logical addressphysicaladdress
physicalmemory
p d r d
p rhashfunction
• • •
Figure 7.16 Hashed page table. 226 / 397
Hashed Inverted Page Table
227 / 397
Demand Paging
With demand paging, the size of the LAS is nolonger constrained by physical memory
▶ Bring a page into memory only when it is needed▶ Less I/O needed▶ Less memory needed▶ Faster response▶ More users
▶ Page is needed ⇒ reference to it▶ invalid reference ⇒ abort▶ not-in-memory ⇒ bring to memory
▶ Lazy swapper never swaps a page into memory unlesspage will be needed
▶ Swapper deals with entire processes▶ Pager (Lazy swapper) deals with pages
228 / 397
Valid-Invalid BitWhen Some Pages Are Not In Memory
324 Chapter 8 Virtual Memory
8.2.1 Basic Concepts
When a process is to be swapped in, the pager guesses which pages will beused before the process is swapped out again. Instead of swapping in a wholeprocess, the pager brings only those pages into memory. Thus, it avoids readinginto memory pages that will not be used anyway, decreasing the swap timeand the amount of physical memory needed.
With this scheme, we need some form of hardware support to distinguishbetween the pages that are in memory and the pages that are on the disk.The valid–invalid bit scheme described in Section 7.4.3 can be used for thispurpose. This time, however, when this bit is set to “valid,” the associated pageis both legal and in memory. If the bit is set to “invalid,” the page either is notvalid (that is, not in the logical address space of the process) or is valid butis currently on the disk. The page-table entry for a page that is brought intomemory is set as usual, but the page-table entry for a page that is not currentlyin memory is either simply marked invalid or contains the address of the pageon disk. This situation is depicted in Figure 8.5.
Notice that marking a page invalid will have no effect if the process neverattempts to access that page. Hence, if we guess right and page in all and onlythose pages that are actually needed, the process will run exactly as though wehad brought in all pages. While the process executes and accesses pages thatare memory resident, execution proceeds normally.
B
D
D EF
H
logicalmemory
valid–invalidbitframe
page table
1
0 4
62
3
4
5 9
6
7
1
0
2
3
4
5
6
7
i
v
v
i
i
v
i
i
physical memory
A
A BC
C
F G HF
1
0
2
3
4
5
6
7
9
8
10
11
12
13
14
15
A
C
E
G
Figure 8.5 Page table when some pages are not in main memory.229 / 397
Page Fault Handling8.2 Demand Paging 325
load M
referencetrap
i
page is onbacking store
operatingsystem
restartinstruction
reset pagetable
page table
physicalmemory
bring inmissing page
free frame
1
2
3
6
5 4
Figure 8.6 Steps in handling a page fault.
But what happens if the process tries to access a page that was not broughtinto memory? Access to a page marked invalid causes a page fault. The paginghardware, in translating the address through the page table, will notice thatthe invalid bit is set, causing a trap to the operating system. This trap is theresult of the operating system’s failure to bring the desired page into memory.The procedure for handling this page fault is straightforward (Figure 8.6):
1. We check an internal table (usually kept with the process control block)for this process to determine whether the reference was a valid or aninvalid memory access.
2. If the reference was invalid, we terminate the process. If it was valid, butwe have not yet brought in that page, we now page it in.
3. We find a free frame (by taking one from the free-frame list, for example).
4. We schedule a disk operation to read the desired page into the newlyallocated frame.
5. When the disk read is complete, we modify the internal table kept withthe process and the page table to indicate that the page is now in memory.
6. We restart the instruction that was interrupted by the trap. The processcan now access the page as though it had always been in memory.
In the extreme case, we can start executing a process with no pages inmemory. When the operating system sets the instruction pointer to the first
230 / 397
Copy-on-WriteMore efficient process creation
▶ Parent and childprocesses initially sharethe same pages inmemory
▶ Only the modified page iscopied upon modificationoccurs
▶ Free pages are allocatedfrom a pool of zeroed-outpages
330 Chapter 8 Virtual Memory
process1
physicalmemory
page A
page B
page C
process2
Figure 8.7 Before process 1 modifies page C.
Recall that the fork() system call creates a child process that is a duplicateof its parent. Traditionally, fork() worked by creating a copy of the parent’saddress space for the child, duplicating the pages belonging to the parent.However, considering that many child processes invoke the exec() systemcall immediately after creation, the copying of the parent’s address space maybe unnecessary. Instead, we can use a technique known as copy-on-write,which works by allowing the parent and child processes initially to share thesame pages. These shared pages are marked as copy-on-write pages, meaningthat if either process writes to a shared page, a copy of the shared page iscreated. Copy-on-write is illustrated in Figures 8.7 and Figure 8.8, which showthe contents of the physical memory before and after process 1 modifies pageC.
For example, assume that the child process attempts to modify a pagecontaining portions of the stack, with the pages set to be copy-on-write. Theoperating system will create a copy of this page, mapping it to the address spaceof the child process. The child process will then modify its copied page and notthe page belonging to the parent process. Obviously, when the copy-on-writetechnique is used, only the pages that are modified by either process are copied;all unmodified pages can be shared by the parent and child processes. Note, too,
process1
physicalmemory
page A
page B
page C
Copy of page C
process2
Figure 8.8 After process 1 modifies page C.
330 Chapter 8 Virtual Memory
process1
physicalmemory
page A
page B
page C
process2
Figure 8.7 Before process 1 modifies page C.
Recall that the fork() system call creates a child process that is a duplicateof its parent. Traditionally, fork() worked by creating a copy of the parent’saddress space for the child, duplicating the pages belonging to the parent.However, considering that many child processes invoke the exec() systemcall immediately after creation, the copying of the parent’s address space maybe unnecessary. Instead, we can use a technique known as copy-on-write,which works by allowing the parent and child processes initially to share thesame pages. These shared pages are marked as copy-on-write pages, meaningthat if either process writes to a shared page, a copy of the shared page iscreated. Copy-on-write is illustrated in Figures 8.7 and Figure 8.8, which showthe contents of the physical memory before and after process 1 modifies pageC.
For example, assume that the child process attempts to modify a pagecontaining portions of the stack, with the pages set to be copy-on-write. Theoperating system will create a copy of this page, mapping it to the address spaceof the child process. The child process will then modify its copied page and notthe page belonging to the parent process. Obviously, when the copy-on-writetechnique is used, only the pages that are modified by either process are copied;all unmodified pages can be shared by the parent and child processes. Note, too,
process1
physicalmemory
page A
page B
page C
Copy of page C
process2
Figure 8.8 After process 1 modifies page C.
231 / 397
Memory Mapped Files
Mapping a file (disk block) to one or more memory pages
▶ Improved I/Operformance — muchfaster than read() andwrite() system calls
▶ Lazy loading (demandpaging) — only a smallportion of file is loadedinitially
▶ A mapped file can beshared, like shared library
354 Chapter 8 Virtual Memory
whether the page in memory has been modified. When the file is closed, all thememory-mapped data are written back to disk and removed from the virtualmemory of the process.
Some operating systems provide memory mapping only through a specificsystem call and use the standard system calls to perform all other file I/O.However, some systems choose to memory-map a file regardless of whetherthe file was specified as memory-mapped. Let’s take Solaris as an example. Ifa file is specified as memory-mapped (using the mmap() system call), Solarismaps the file into the address space of the process. If a file is opened andaccessed using ordinary system calls, such as open(), read(), and write(),Solaris still memory-maps the file; however, the file is mapped to the kerneladdress space. Regardless of how the file is opened, then, Solaris treats allfile I/O as memory-mapped, allowing file access to take place via the efficientmemory subsystem.
Multiple processes may be allowed to map the same file concurrently, topermit sharing of data. Writes by any of the processes modify the data invirtual memory and can be seen by all others that map the same section ofthe file. Given our earlier discussions of virtual memory, it should be clearhow the sharing of memory-mapped sections of memory is implemented:the virtual memory map of each sharing process points to the same page ofphysical memory—the page that holds a copy of the disk block. This memorysharing is illustrated in Figure 8.23. The memory-mapping system calls canalso support copy-on-write functionality, allowing processes to share a file inread-only mode but to have their own copies of any data they modify. So that
process Avirtual memory
1
1
1 2 3 4 5 6
23
3
45
5
42
66
123456
process Bvirtual memory
physical memory
disk file
Figure 8.23 Memory-mapped files.
232 / 397
Need For Page ReplacementPage replacement: find some page in memory, but not really
in use, swap it out332 Chapter 8 Virtual Memory
monitor
load M
physicalmemory
1
0
2
3
4
5
6
7
H
load M
J
M
logical memoryfor user 1
0
PC1
2
3 B
M
valid–invalidbitframe
page tablefor user 1
i
A
B
D
E
logical memoryfor user 2
0
1
2
3
valid–invalidbitframe
page tablefor user 2
i
43
5
v
v
v
7
2 v
v
6 v
D
H
J
A
E
Figure 8.9 Need for page replacement.
Over-allocation of memory manifests itself as follows. While a user processis executing, a page fault occurs. The operating system determines where thedesired page is residing on the disk but then finds that there are no free frameson the free-frame list; all memory is in use (Figure 8.9).
The operating system has several options at this point. It could terminatethe user process. However, demand paging is the operating system’s attempt toimprove the computer system’s utilization and throughput. Users should notbe aware that their processes are running on a paged system—paging shouldbe logically transparent to the user. So this option is not the best choice.
The operating system could instead swap out a process, freeing all itsframes and reducing the level of multiprogramming. This option is a good onein certain circumstances, and we consider it further in Section 8.6. Here, wediscuss the most common solution: page replacement.
8.4.1 Basic Page Replacement
Page replacement takes the following approach. If no frame is free, we findone that is not currently being used and free it. We can free a frame by writingits contents to swap space and changing the page table (and all other tables) toindicate that the page is no longer in memory (Figure 8.10). We can now usethe freed frame to hold the page for which the process faulted. We modify thepage-fault service routine to include page replacement:
1. Find the location of the desired page on the disk.
2. Find a free frame:
a. If there is a free frame, use it.
233 / 397
Performance ConcernBecause disk I/O is so expensive, we must solve two majorproblems to implement demand paging.Frame-allocation algorithm If we have multiple processes in
memory, we must decide how many frames toallocate to each process.
Page-replacement algorithm When page replacement isrequired, we must select the frames that are tobe replaced.
PerformanceWe want an algorithm resulting in lowest page-fault rate
▶ Is the victim page modified?▶ Pick a random page to swap out?▶ Pick a page from the faulting process’ own pages? Or
from others?
234 / 397
Page-Fault Frequency Scheme
Establish ”acceptable” page-fault rate
235 / 397
FIFO Page Replacement Algorithm
▶ Maintain a linked list (FIFO queue) of all pages▶ in order they came into memory
▶ Page at beginning of list replaced▶ Disadvantage
▶ The oldest page may be often used▶ Belady’s anomaly
236 / 397
FIFO Page Replacement AlgorithmBelady’s Anomaly
▶ Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5▶ 3 frames (3 pages can be in memory at a time per
process)1 4 5 32 1 /1 43 2 /2 /5
9 page faults
▶ 4 frames1 /1 22 /2 33 5 44 1 5
10 page faults
▶ Belady’s Anomaly: more frames ⇒ more page faults
237 / 397
Optimal Page Replacement Algorithm (OPT)
▶ Replace page needed at the farthest point in future▶ Optimal but not feasible
▶ Estimate by ...▶ logging page use on previous runs of process▶ although this is impractical, similar to SJF CPU-scheduling,
it can be used for comparison studies
238 / 397
Least Recently Used (LRU) Algorithm
FIFO uses the time when a page was brought into memoryOPT uses the time when a page is to be usedLRU uses the recent past as an approximation of the near
future
Assume recently-used-pages will used again soonreplace the page that has not been used for the longestperiod of time
239 / 397
LRU ImplementationsCounters: record the time of the last reference toeach page
▶ choose page with lowest value counter▶ Keep counter in each page table entry▶ counter overflow — periodically zero the counter▶ require a search of the page table to find the LRU page▶ update time-of-use field in the page table every memory
reference!
Stack: keep a linked list (stack) of pages▶ most recently used at top, least (LRU) at bottom
▶ no search for replacement▶ whenever a page is referenced, it’s removed from the
stack and put on the top▶ update this list every memory reference!
240 / 397
Second Chance Page Replacement Algorithm
When a page fault occurs,the page the hand ispointing to is inspected.The action taken dependson the R bit: R = 0: Evict the page R = 1: Clear R and advance hand
AB
C
D
E
FG
H
I
J
K
L
Fig. 4-17. The clock page replacement algorithm.
241 / 397
Allocation of Frames
▶ Each process needs minimum number of pages▶ Fixed Allocation
▶ Equal allocation — e.g., 100 frames and 5 processes, giveeach process 20 frames.
▶ Proportional allocation — Allocate according to the size ofprocess
ai =si∑si
×m
Si: size of process pim: total number of framesai: frames allocated to pi
▶ Priority Allocation — Use a proportional allocationscheme using priorities rather than size
priorityi∑priorityi
or (si∑si,
priorityi∑priorityi
)
242 / 397
Global vs. Local Allocation
If process Pi generates a page fault, it can select areplacement frame
▶ from its own frames — Local replacement▶ from the set of all frames; one process can take a frame
from another — Global replacement▶ from a process with lower priority number
Global replacement generally results in greater systemthroughput.
243 / 397
Thrashing8.6 Thrashing 349
thrashing
degree of multiprogramming
CP
U u
tiliz
atio
n
Figure 8.18 Thrashing.
thrashing, they will be in the queue for the paging device most of the time. Theaverage service time for a page fault will increase because of the longer averagequeue for the paging device. Thus, the effective access time will increase evenfor a process that is not thrashing.
To prevent thrashing, we must provide a process with as many frames asit needs. But how do we know how many frames it “needs”? There are severaltechniques. The working-set strategy (Section 8.6.2) starts by looking at howmany frames a process is actually using. This approach defines the localitymodel of process execution.
The locality model states that, as a process executes, it moves from localityto locality. A locality is a set of pages that are actively used together (Figure8.19). A program is generally composed of several different localities that mayoverlap.
For example, when a function is called, it defines a new locality. In thislocality, memory references are made to the instructions of the function call, itslocal variables, and a subset of the global variables. When we exit the function,the process leaves this locality, since the local variables and instructions of thefunction are no longer in active use. We may return to this locality later.
Thus, we see that localities are defined by the program structure and itsdata structures. The locality model states that all programs will exhibit thisbasic memory reference structure. Note that the locality model is the unstatedprinciple behind the caching discussions so far in this book. If accesses to anytypes of data were random rather than patterned, caching would be useless.
Suppose we allocate enough frames to a process to accommodate its currentlocality. It will fault for the pages in its locality until all these pages are inmemory; then, it will not fault again until it changes localities. If we do notallocate enough frames to accommodate the size of the current locality, theprocess will thrash, since it cannot keep in memory all the pages that it isactively using.
8.6.2 Working-Set Model
As mentioned, the working-set model is based on the assumption of locality.This model uses a parameter, �, to define the working-set window. The idea
244 / 397
Thrashing
1. CPU not busy ⇒ add more processes2. a process needs more frames ⇒ faulting, and taking
frames away from others3. these processes also need these pages ⇒ also faulting,
and taking frames away from others ⇒ chain reaction4. more and more processes queueing for the paging
device ⇒ ready queue is empty ⇒ CPU has nothing to do⇒ add more processes ⇒ more page faults
5. MMU is busy, but no work is getting done, becauseprocesses are busy paging — thrashing
245 / 397
Demand Paging and ThrashingLocality Model
▶ A locality is a set of pagesthat are actively usedtogether
▶ Process migrates from onelocality to another
350 Chapter 8 Virtual Memory
18
20
22
24
26
28
30
32
34
page
num
bers
mem
ory
addr
ess
execution time
Figure 8.19 Locality in a memory-reference pattern.
is to examine the most recent � page references. The set of pages in the mostrecent � page references is the working set (Figure 8.20). If a page is in activeuse, it will be in the working set. If it is no longer being used, it will drop fromthe working set � time units after its last reference. Thus, the working set is anapproximation of the program’s locality.
For example, given the sequence of memory references shown in Figure8.20, if � = 10 memory references, then the working set at time t1 is {1, 2, 5,6, 7}. By time t2, the working set has changed to {3, 4}.
The accuracy of the working set depends on the selection of �. If � is toosmall, it will not encompass the entire locality; if � is too large, it may overlap
locality in a memory reference pattern
Why does thrashing occur?∑i=(0,n)
Localityi > total memory size
246 / 397
Working-Set ModelWorking Set (WS) The set of pages that a process is
currently(∆) using. (≈ locality) 8.6 Thrashing 351
page reference table. . . 2 6 1 5 7 7 7 7 5 1 6 2 3 4 1 2 3 4 4 4 3 4 3 4 4 4 1 3 2 3 4 4 4 3 4 4 4 . . .
Δ
t1WS(t1) = {1,2,5,6,7}
Δ
t2WS(t2) = {3,4}
Figure 8.20 Working-set model.
several localities. In the extreme, if � is infinite, the working set is the set ofpages touched during the process execution.
The most important property of the working set, then, is its size. If wecompute the working-set size, WSSi , for each process in the system, we canthen consider that
D =∑
WSSi ,
where D is the total demand for frames. Each process is actively using the pagesin its working set. Thus, process i needs WSSi frames. If the total demand isgreater than the total number of available frames (D > m), thrashing will occur,because some processes will not have enough frames.
Once � has been selected, use of the working-set model is simple. Theoperating system monitors the working set of each process and allocates tothat working set enough frames to provide it with its working-set size. If thereare enough extra frames, another process can be initiated. If the sum of theworking-set sizes increases, exceeding the total number of available frames,the operating system selects a process to suspend. The process’s pages arewritten out (swapped), and its frames are reallocated to other processes. Thesuspended process can be restarted later.
This working-set strategy prevents thrashing while keeping the degree ofmultiprogramming as high as possible. Thus, it optimizes CPU utilization.
The difficulty with the working-set model is keeping track of the workingset. The working-set window is a moving window. At each memory reference,a new reference appears at one end and the oldest reference drops off the otherend. A page is in the working set if it is referenced anywhere in the working-setwindow.
We can approximate the working-set model with a fixed-interval timerinterrupt and a reference bit. For example, assume that � equals 10,000references and that we can cause a timer interrupt every 5,000 references.When we get a timer interrupt, we copy and clear the reference-bit values foreach page. Thus, if a page fault occurs, we can examine the current referencebit and two in-memory bits to determine whether a page was used within thelast 10,000 to 15,000 references. If it was used, at least one of these bits will beon. If it has not been used, these bits will be off. Those pages with at least onebit on will be considered to be in the working set. Note that this arrangementis not entirely accurate, because we cannot tell where, within an interval of5,000, a reference occurred. We can reduce the uncertainty by increasing thenumber of history bits and the frequency of interrupts (for example, 10 bitsand interrupts every 1,000 references). However, the cost to service these morefrequent interrupts will be correspondingly higher.
∆: Working-set window. In this example,∆ = 10 memory access
WSS: Working-set size. WS(t1) = {WSS=5︷ ︸︸ ︷
1, 2, 5, 6, 7}
▶ The accuracy of the working set depends on theselection of ∆
▶ Thrashing, if ∑WSSi > SIZEtotalmemory
247 / 397
The Working-Set Page Replacement AlgorithmTo evict a page that is not in the working set
Information aboutone page 2084
2204 Current virtual time
2003
1980
1213
2014
2020
2032
1620
Page table
1
1
1
0
1
1
1
0
Time of last use
Page referencedduring this tick
Page not referencedduring this tick
R (Referenced) bit
Scan all pages examining R bit: if (R == 1) set time of last use to current virtual time
if (R == 0 and age > τ) remove this page
if (R == 0 and age ≤ τ) remember the smallest time
Fig. 4-21. The working set algorithm.
age = Current virtual time− Time of last use248 / 397
The WSClock Page Replacement AlgorithmCombine Working Set Algorithm With Clock Algorithm
2204 Current virtual time
1213 0
2084 1 2032 1
1620 0
2020 12003 1
1980 1 2014 1
Time oflast use
R bit
(a) (b)
(c) (d)
New page
1213 0
2084 1 2032 1
1620 0
2020 12003 1
1980 1 2014 0
1213 0
2084 1 2032 1
1620 0
2020 12003 1
1980 1 2014 0
2204 1
2084 1 2032 1
1620 0
2020 12003 1
1980 1 2014 0
Fig. 4-22. Operation of the WSClock algorithm. (a) and (b) givean example of what happens when R = 1. (c) and (d) give anexample of R = 0.
249 / 397
Other Issues — Prepaging
8.7 Memory-Mapped Files 353
WORKING SETS AND PAGE FAULT RATES
There is a direct relationship between the working set of a process and itspage-fault rate. Typically, as shown in Figure 8.20, the working set of a processchanges over time as references to data and code sections move from onelocality to another. Assuming there is sufficient memory to store the workingset of a process (that is, the process is not thrashing), the page-fault rate ofthe process will transition between peaks and valleys over time. This generalbehavior is shown in Figure 8.22.
1
0time
working set
page fault rate
Figure 8.22 Page fault rate over time.
A peak in the page-fault rate occurs when we begin demand-paging a newlocality. However, once the working set of this new locality is in memory,the page-fault rate falls. When the process moves to a new working set, thepage-fault rate rises toward a peak once again, returning to a lower rate oncethe new working set is loaded into memory. The span of time between thestart of one peak and the start of the next peak represents the transition fromone working set to another.
8.7.1 Basic Mechanism
Memory mapping a file is accomplished by mapping a disk block to a page (orpages) in memory. Initial access to the file proceeds through ordinary demandpaging, resulting in a page fault. However, a page-sized portion of the fileis read from the file system into a physical page (some systems may optto read in more than a page-sized chunk of memory at a time). Subsequentreads and writes to the file are handled as routine memory accesses, therebysimplifying file access and usage by allowing the system to manipulate filesthrough memory rather than incurring the overhead of using the read() andwrite() system calls. Similarly, as file I/O is done in memory — as opposedto using system calls that involve disk I/O — file access is much faster as well.
Note that writes to the file mapped in memory are not necessarilyimmediate (synchronous) writes to the file on disk. Some systems may chooseto update the physical file when the operating system periodically checks
▶ reduce faulting rate at (re)startup▶ remember working-set in PCB
▶ Not always work▶ if prepaged pages are unused, I/O and memory was wasted
250 / 397
Other Issues — Page Size
Larger page sizeà Bigger internal fragmentationà longer I/O time
Smaller page sizeà Larger page tableà more page faults
▶ one page fault for each byte, if page size = 1 Byte▶ for a 200K process, with page size = 200K, only one page
faultNo best answer
$ getconf PAGESIZE
251 / 397
Other Issues — TLB Reach
▶ Ideally, the working set of each process is stored in theTLB
▶ Otherwise there is a high degree of page faults▶ TLB Reach — The amount of memory accessible from the
TLBTLB Reach = (TLB Size)× (Page Size)
▶ Increase the page sizeInternal fragmentation may be increased
▶ Provide multiple page sizes▶ This allows applications that require larger page sizes the
opportunity to use them without an increase infragmentation
▶ UltraSPARC supports page sizes of 8KB, 64KB, 512KB, and 4MB▶ Pentium supports page sizes of 4KB and 4MB
252 / 397
Other Issues — Program Structure
Careful selection of data structures and programmingstructures can increase locality, i.e. lower the page-fault rateand the number of pages in the working set.
Example▶ A stack has a good locality, since access is always made
to the top▶ A hash table has a bad locality, since it’s designed to
scatter references▶ Programming language
▶ Pointers tend to randomize access to memory▶ OO programs tend to have a poor locality
253 / 397
Other Issues — Program Structure
Example▶ int[i][j] = int[128][128]▶ Assuming page size is 128 words, then▶ Each row (128 words) takes one page
If the process has fewer than 128 framesProgram 1:
for(j=0;j<128;j++)
for(i=0;i<128;i++)
data[i][j] = 0;
Worst case:
128× 128 = 16, 384 page faults
Program 2:
for(i=0;i<128;i++)
for(j=0;j<128;j++)
data[i][j] = 0;
Worst case:
128 page faults
254 / 397
Other Issues — I/O interlock
Sometimes it is necessary to lock pages in memory so thatthey are not paged out.
Example▶ The OS▶ I/O operation — the frame into which the I/O device was
scheduled to write should not be replaced.▶ New page that was just brought in — looks like the best
candidate to be replaced because it was not accessedyet, nor was it modified.
255 / 397
Other Issues — I/O interlockCase 1
Be sure the following sequence of events does notoccur:1. A process issues an I/O request, and then queueing for
that I/O device2. The CPU is given to other processes3. These processes cause page faults4. The waiting process’ page is unluckily replaced5. When its I/O request is served, the specific frame is now
being used by another process
256 / 397
Other Issues — I/O interlockCase 2
Another bad sequence of events:1. A low-priority process faults2. The paging system selects a replacement frame. Then,
the necessary page is loaded into memory3. The low-priority process is now ready to continue, and
waiting in the ready queue4. A high-priority process faults5. The paging system looks for a replacement
5.1 It sees a page that is in memory but not been referencednor modified: perfect!
5.2 It doesn’t know the page is just brought in for thelow-priority process
257 / 397
Two Views of A Virtual Address Space
One-dimensionala linear array of bytes
max +------------------+
| Stack | Stack segment
+--------+---------+
| | |
| v |
| |
| ^ |
| | |
+--------+---------+
| Dynamic storage | Heap
|(from new, malloc)|
+------------------+
| Static variables |
| (uninitialized, | BSS segment
| initialized) | Data segment
+------------------+
| Code | Text segment
0 +------------------+
Two-dimensionala collection ofvariable-sized segments
258 / 397
User’s View
▶ A program is a collection of segments▶ A segment is a logical unit such as:
main program procedure functionmethod object local variablesglobal variables common block stacksymbol table arrays
259 / 397
Logical And Physical View of Segmentation
8.46 Silbersc hatz, Galvin and Gagne ©2009Operating S ystem Con cepts – 8th Edition
Logical View of Segmentation
1
3
2
4
1
4
2
3
user space physical memory space
à
260 / 397
Segmentation Architecture
▶ Logical address consists of a two tuple:<segment-number, offset>
▶ Segment table maps 2D virtual addresses into 1Dphysical addresses; each table entry has:
▶ base contains the starting physical address where thesegments reside in memory
▶ limit specifies the length of the segment▶ Segment-table base register (STBR) points to the
segment table’s location in memory▶ Segment-table length register (STLR) indicates number
of segments used by a program;segment number s is legal if s < STLR
261 / 397
Segmentation hardware306 Chapter 7 Main Memory
CPU
physical memory
s d
< +
trap: addressing error
no
yes
segment table
limit base
s
Figure 7.19 Segmentation hardware.
Libraries that are linked in during compile time might be assigned separatesegments. The loader would take all these segments and assign them segmentnumbers.
7.6.2 Hardware
Although the user can now refer to objects in the program by a two-dimensionaladdress, the actual physical memory is still, of course, a one-dimensionalsequence of bytes. Thus, we must define an implementation to map two-dimensional user-defined addresses into one-dimensional physical addresses.This mapping is effected by a segment table. Each entry in the segment tablehas a segment base and a segment limit. The segment base contains the startingphysical address where the segment resides in memory, and the segment limitspecifies the length of the segment.
The use of a segment table is illustrated in Figure 7.19. A logical addressconsists of two parts: a segment number, s, and an offset into that segment, d.The segment number is used as an index to the segment table. The offset d ofthe logical address must be between 0 and the segment limit. If it is not, we trapto the operating system (logical addressing attempt beyond end of segment).When an offset is legal, it is added to the segment base to produce the addressin physical memory of the desired byte. The segment table is thus essentiallyan array of base–limit register pairs.
As an example, consider the situation shown in Figure 7.20. We have fivesegments numbered from 0 through 4. The segments are stored in physicalmemory as shown. The segment table has a separate entry for each segment,giving the beginning address of the segment in physical memory (or base) andthe length of that segment (or limit). For example, segment 2 is 400 bytes longand begins at location 4300. Thus, a reference to byte 53 of segment 2 is mapped
262 / 397
7.7 Example: The Intel Pentium 307
logical address space
subroutine stack
symbol table
main program
Sqrt
1400
physical memory
2400
3200
segment 24300
4700
5700
6300
6700
segment table
limit0 1 2 3 4
1000 400 400
1100 1000
base1400 6300 4300 3200 4700
segment 0
segment 3
segment 4
segment 2segment 1
segment 0
segment 3
segment 4
segment 1
Figure 7.20 Example of segmentation.
onto location 4300 + 53 = 4353. A reference to segment 3, byte 852, is mapped to3200 (the base of segment 3) + 852 = 4052. A reference to byte 1222 of segment0 would result in a trap to the operating system, as this segment is only 1,000bytes long.
7.7 Example: The Intel Pentium
Both paging and segmentation have advantages and disadvantages. In fact,some architectures provide both. In this section, we discuss the Intel Pentiumarchitecture, which supports both pure segmentation and segmentation withpaging. We do not give a complete description of the memory-managementstructure of the Pentium in this text. Rather, we present the major ideas onwhich it is based. We conclude our discussion with an overview of Linuxaddress translation on Pentium systems.
In Pentium systems, the CPU generates logical addresses, which are givento the segmentation unit. The segmentation unit produces a linear address foreach logical address. The linear address is then given to the paging unit, whichin turn generates the physical address in main memory. Thus, the segmentationand paging units form the equivalent of the memory-management unit (MMU).This scheme is shown in Figure 7.21.
7.7.1 Pentium Segmentation
The Pentium architecture allows a segment to be as large as 4 GB, and themaximum number of segments per process is 16 K. The logical-address space
(0, 1222) ⇒ Trap!(3, 852) ⇒ 3200 + 852 = 4052
(2, 53) ⇒ 4300 + 53 = 4353
263 / 397
Advantages of Segmentation▶ Each segment can be
▶ located independently▶ separately protected▶ grow independently
▶ Segments can be shared between processes
Problems with Segmentation▶ Variable allocation▶ Difficult to find holes in physical memory▶ Must use one of non-trivial placement algorithm
▶ first fit, best fit, worst fit▶ External fragmentation
264 / 397
Linux prefers paging to segmentation
Because▶ Segmentation and paging are somewhat redundant▶ Memory management is simpler when all processes
share the same set of linear addresses▶ Maximum portability. RISC architectures in particular
have limited support for segmentation
The Linux 2.6 uses segmentation only when required by the80x86 architecture.
265 / 397
Case Study: The Intel PentiumSegmentation With Paging
|<--------------- MMU --------------->|
+--------------+ +--------+ +----------+
+-----+ Logical | Segmentation | Linear | Paging | Physical | Physical |
| CPU |---------->| unit |---------->| unit |----------->| memory |
+-----+ address +--------------+ address +--------+ address +----------+
selector | offset
+----+---+---+--------+
| s | g | p | |
+----+---+---+--------+
13 1 2 32
page number | page offset
+----------+----------+------------+
| p1 | p2 | d |
+----------+----------+------------+
10 10 12
| |
| ‘-> pointing to 1k frames
‘--> pointing to 1k page tables
266 / 397
267 / 397
Segmentation
Logical Address ⇒ Linear Address7.7 Example: The Intel Pentium 309
logical address selector
descriptor table
segment descriptor +
32-bit linear address
offset
Figure 7.22 Intel Pentium segmentation.
detail in Figure 7.23. The 10 high-order bits reference an entry in the outermostpage table, which the Pentium terms the page directory. (The CR3 registerpoints to the page directory for the current process.) The page directory entrypoints to an inner page table that is indexed by the contents of the innermost10 bits in the linear address. Finally, the low-order bits 0–11 refer to the offsetin the 4-KB page pointed to in the page table.
One entry in the page directory is the Page Size flag, which—if set—indicates that the size of the page frame is 4 MB and not the standard 4 KB.If this flag is set, the page directory points directly to the 4-MB page frame,bypassing the inner page table; and the 22 low-order bits in the linear addressrefer to the offset in the 4-MB page frame.
page directory
page directory
CR3register
pagedirectory
pagetable
4-KBpage
4-MBpage
page table
offset
offset
(linear address)
31 22 21 12 11 0
2131 22 0
Figure 7.23 Paging in the Pentium architecture.
268 / 397
Segment Selectors
A logical address consists of two parts:segment selector : offset
16 bits 32 bits
Segment selector is an index into GDT/LDT
selector | offset
+-----+-+--+--------+ s - segment number
| s |g|p | | g - 0-global; 1-local
+-----+-+--+--------+ p - protection use
13 1 2 32
269 / 397
Segment Descriptor Tables
All the segments are organized in 2 tables:GDT Global Descriptor Table
▶ shared by all processes▶ GDTR stores address and size of the GDT
LDT Local Descriptor Table▶ one process each▶ LDTR stores address and size of the LDT
Segment descriptors are entries in either GDT or LDT, 8-bytelong
AnalogyProcess ⇐⇒ Process Descriptor(PCB)
File ⇐⇒ InodeSegment ⇐⇒ Segment Descriptor
270 / 397
Segment Registers
The Intel Pentium has▶ 6 segment registers, allowing 6 segments to be
addressed at any one time by a process▶ Each segment register + an entry in LDT/GDT
▶ 6 8-byte micro program registers to hold descriptorsfrom either LDT or GDT
▶ avoid having to read the descriptor from memory for everymemory reference
0 1 2 3~8 9
+--------+--------|--------+---//---+--------+
| Segment selector| Micro program register |
+--------+--------|--------+---//---+--------+
Programmable Non-programmable
271 / 397
Fast access to segment descriptors
An additional nonprogrammable register for eachsegment register
DESCRIPTOR TABLE SEGMENT
+--------------+ +------+
| ... | ,------->| |<--.
+--------------+ | | | |
.--->| Segment |____/ | | |
| | Descriptor | +------+ |
| +--------------+ |
| | ... | |
| +--------------+ |
| Nonprogrammable |
| Segment Registor Register |
\__+------------------+--------------------+____/
| Segment Selector | Segment Descriptor |
+------------------+--------------------+
272 / 397
Segment registers hold segment selectorscs code segment register
CPL 2-bit, specifies the Current Privilege Level ofthe CPU00 - Kernel mode11 - User mode
ss stack segment registerds data segment register
es/fs/gs general purpose registers, may refer to arbitrarydata segments
273 / 397
Example: A LDT entry for code segment
Privilege level (0-3)
Relativeaddress
0
4
Base 0-15 Limit 0-15
Base 24-31 Base 16-23Limit16-19G D 0 P DPL Type
0: Li is in bytes1: Li is in pages
0: 16-Bit segment1: 32-Bit segment
0: Segment is absent from memory1: Segment is present in memory
Segment type and protection
S
��
0: System1: Application
32 Bits
Fig. 4-44. Pentium code segment descriptor. Data segments differslightly.
Base: Where the segment starts
Limit: 20 bit, ⇒ 220 in size
G: Granularity flag
0 - segment size in bytes1 - in 4096 bytes
S: System flag
0 - system segment, e.g.LDT
1 - normal code/datasegment
D/B: 0 - 16-bit offset1 - 32-bit offset
Type: segment type (cs/ds/tss)TSS: Task status, i.e. it’s
executing or notDPL: Descriptor Privilege Level. 0
or 3P: Segment-Present flag
0 - not in memory1 - in memory
AVL: ignored by Linux
274 / 397
The Four Main Linux Segments
Every process in Linux has these 4 segmentsSegment Base G Limit S Type DPL D/B Puser code 0x00000000 1 0xfffff 1 10 3 1 1user data 0x00000000 1 0xfffff 1 2 3 1 1kernel code 0x00000000 1 0xfffff 1 10 0 1 1kernel data 0x00000000 1 0xfffff 1 2 0 1 1
All linear addresses start at 0, end at 4G-1▶ All processes share the same set of linear addresses▶ Logical addresses coincide with linear addresses
275 / 397
Pentium PagingLinear Address ⇒ Physical Address
Two page size in Pentium:
4K:2-level paging (Fig. 219)
4M:1-level paging (Fig. 214)
page number | page offset
+----------+----------+------------+
| p1 | p2 | d |
+----------+----------+------------+
10 10 12
| |
| ‘-> pointing to 1k frames
‘--> pointing to 1k page tables
7.7 Example: The Intel Pentium 309
logical address selector
descriptor table
segment descriptor +
32-bit linear address
offset
Figure 7.22 Intel Pentium segmentation.
detail in Figure 7.23. The 10 high-order bits reference an entry in the outermostpage table, which the Pentium terms the page directory. (The CR3 registerpoints to the page directory for the current process.) The page directory entrypoints to an inner page table that is indexed by the contents of the innermost10 bits in the linear address. Finally, the low-order bits 0–11 refer to the offsetin the 4-KB page pointed to in the page table.
One entry in the page directory is the Page Size flag, which—if set—indicates that the size of the page frame is 4 MB and not the standard 4 KB.If this flag is set, the page directory points directly to the 4-MB page frame,bypassing the inner page table; and the 22 low-order bits in the linear addressrefer to the offset in the 4-MB page frame.
page directory
page directory
CR3register
pagedirectory
pagetable
4-KBpage
4-MBpage
page table
offset
offset
(linear address)
31 22 21 12 11 0
2131 22 0
Figure 7.23 Paging in the Pentium architecture.
276 / 397
Paging In Linux4-level paging for both 32-bit and 64-bit
Global directory Upper directory Middle directory Page Offset
Page
Page tablePage middle
directoryPage upper
directoryPage global
directory
Virtual address
cr3
277 / 397
4-level paging for both 32-bit and 64-bit▶ 64-bit: four-level paging
1. Page Global Directory2. Page Upper Directory3. Page Middle Directory4. Page Table
▶ 32-bit: two-level paging1. Page Global Directory2. Page Upper Directory — 0 bits; 1 entry3. Page Middle Directory — 0 bits; 1 entry4. Page Table
The same code can work on 32-bit and 64-bit architectures
Page Address Paging AddressArch size bits levels splittingx86 4KB(12bits) 32 2 10 + 0 + 0 + 10 + 12x86-PAE 4KB(12bits) 32 3 2 + 0 + 9 + 9 + 12x86-64 4KB(12bits) 48 4 9 + 9 + 9 + 9 + 12
278 / 397
References
Wikipedia. Memory management — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. Virtual memory — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.
279 / 397
Long-term Information Storage Requirements
▶ Must store large amounts of data▶ Information stored must survive the termination of the
process using it▶ Multiple processes must be able to access the
information concurrently
281 / 397
File-System Structure
File-system design addressing two problems:1. defining how the FS should look to the user
▶ defining a file and its attributes▶ the operations allowed on a file▶ directory structure
2. creating algorithms and data structures to map thelogical FS onto the physical disk
282 / 397
File-System — A Layered Design
APPs⇓
Logical FS⇓
File-org module⇓
Basic FS⇓
I/O ctrl⇓
Devices
▶ logical file system — manages metadatainformation
- maintains all of the file-system structure(directory structure, FCB)
- responsible for protection and security▶ file-organization module
- logical blockaddress
translate−−−−−→ physical blockaddress
- keeps track of free blocks▶ basic file system issues generic
commands to device driver, e.g- “read drive 1, cylinder 72, track 2,
sector 10”▶ I/O Control — device drivers, and INT
handlers- device driver:
high-levelcommands
translate−−−−−→ hardware-specificinstructions
283 / 397
The Operating Structure
APPs⇓
Logical FS⇓
File-org module⇓
Basic FS⇓
I/O ctrl⇓
Devices
Example — To create a file1. APP calls creat()2. Logical FS
2.1 allocates a new FCB2.2 updates the in-mem dir structure2.3 writes it back to disk2.4 calls the file-org module
3. file-organization module3.1 allocates blocks for storing the file’s
data3.2 maps the directory I/O into disk-block
numbers
Benefit of layered designThe I/O control and sometimes the basic file system code canbe used by multiple file systems.
284 / 397
FileA Logical View Of Information Storage
User’s viewA file is the smallest storage unit on disk.
▶ Data cannot be written to disk unless they are within a file
UNIX viewEach file is a sequence of 8-bit bytes
▶ It’s up to the application program to interpret this bytestream.
285 / 397
FileWhat Is Stored In A File?
Source code, object files, executable files, shell scripts,PostScript...
Different type of files have different structure▶ UNIX looks at contents to determine type
Shell scripts start with “#!”PDF start with “%PDF...”
Executables start with magic number▶ Windows uses file naming conventions
executables end with “.exe” and “.com”MS-Word end with “.doc”MS-Excel end with “.xls”
286 / 397
File Naming
Vary from system to system▶ Name length?▶ Characters? Digits? Special characters?▶ Extension?▶ Case sensitive?
287 / 397
File Types
Regular files: ASCII, binaryDirectories: Maintaining the structure of the FS
In UNIX, everything is a fileCharacter special files: I/O related, such as terminals,
printers ...Block special files: Devices that can contain file systems, i.e.
disksdisks — logically, linear collections of
blocks; disk driver translates theminto physical block addresses
288 / 397
Binary files
(a) (b)
Header
Header
Header
Magic number
Text size
Data size
BSS size
Symbol table size
Entry point
Flags
Text
Data
Relocationbits
Symboltable
Objectmodule
Objectmodule
Objectmodule
Modulename
Date
Owner
Protection
Size
���H
eade
r
Fig. 6-3. (a) An executable file. (b) An archive.
An UNIX executable file An UNIX archive
289 / 397
File Attributes — Metadata
▶ Name only information kept in human-readable form▶ Identifier unique tag (number) identifies file within file
system▶ Type needed for systems that support different types▶ Location pointer to file location on device▶ Size current file size▶ Protection controls who can do reading, writing,
executing▶ Time, date, and user identification data for protection,
security, and usage monitoring
290 / 397
File OperationsPOSIX file system calls
1. fd = creat(name, mode)2. fd = open(name, flags)3. status = close(fd)4. byte_count = read(fd, buffer, byte_count)5. byte_count = write(fd, buffer, byte_count)6. offset = lseek(fd, offset, whence)7. status = link(oldname, newname)8. status = unlink(name)9. status = truncate(name, size)
10. status = ftruncate(fd, size)11. status = stat(name, buffer)12. status = fstat(fd, buffer)13. status = utimes(name, times)14. status = chown(name, owner, group)15. status = fchown(fd, owner, group)16. status = chmod(name, mode)17. status = fchmod(fd, mode)
291 / 397
An Example Program Using File System Calls/* File copy program. Error checking and reporting is minimal. */
#include <sys/types.h> /* include necessary header files */#include <fcntl.h>#include <stdlib.h>#include <unistd.h>
int main(int argc, char *argv[]); /* ANSI prototype */
#define BUF3SIZE 4096 /* use a buffer size of 4096 bytes */#define OUTPUT3MODE 0700 /* protection bits for output file */
int main(int argc, char *argv[]){
int in3 fd, out3 fd, rd3count, wt3count;char buffer[BUF3SIZE];
if (argc != 3) exit(1); /* syntax error if argc is not 3 */
/* Open the input file and create the output file */in3fd = open(argv[1], O3RDONLY); /* open the source file */if (in3 fd < 0) exit(2); /* if it cannot be opened, exit */out3 fd = creat(argv[2], OUTPUT3MODE); /* create the destination file */if (out3fd < 0) exit(3); /* if it cannot be created, exit */
/* Copy loop */while (TRUE) {
rd3count = read(in3 fd, buffer, BUF3SIZE); /* read a block of data */if (rd3count <= 0) break; /* if end of file or error, exit loop */
wt3count = write(out3fd, buffer, rd3count); /* write data */if (wt3count <= 0) exit(4); /* wt3count <= 0 is an error */
}
/* Close the files */close(in3fd);close(out3 fd);if (rd3count == 0) /* no error on last read */
exit(0);else
exit(5); /* error on last read */}
Fig. 6-5. A simple program to copy a file.292 / 397
open()
fd open(pathname, flags)A per-process open-file table is kept in the OS
▶ upon a successful open() syscall, a new entry is added intothis table
▶ indexed by file descriptor (fd)To see files opened by a process, e.g. init
$ lsof -p 1
Why open() is needed?To avoid constant searching
▶ Without open(), every file operation involves searchingthe directory for the file.
293 / 397
DirectoriesSingle-Level Directory Systems
All files are contained in the same directoryRoot directory
A A B C
Fig. 6-7. A single-level directory system containing four files,owned by three different people, A, B, and C.
- contains 4 files- owned by 3 different
people, A, B, and C
Limitations- name collision- file searching
Often used on simple embedded devices, such as telephone,digital cameras...
294 / 397
DirectoriesTwo-level Directory Systems
A separate directory for each user
Files
Userdirectory
A A
A B
B
C
CC C
Root directory
Fig. 6-8. A two-level directory system. The letters indicate theowners of the directories and files.
Limitation: hard to access others files
295 / 397
DirectoriesHierarchical Directory Systems
Userdirectory
User subdirectoriesC C
C
C C
C
B
B
A
A
B
B
C C
C
B
Root directory
User file
Fig. 6-9. A hierarchical directory system. 296 / 397
DirectoriesPath Names
ROOT
bin boot dev e t c home var
grub pa s swd staff s t u d mail
w x 6 7 2 20081152001
dir
file
2 0 0 8 1152001
297 / 397
DirectoriesDirectory Operations
Create Delete Rename LinkOpendir Closedir Readdir Unlink
298 / 397
File System Implementation
A typical file system layout|<---------------------- Entire disk ------------------------>|
+-----+-------------+-------------+-------------+-------------+
| MBR | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
+-----+-------------+-------------+-------------+-------------+
_______________________________/ \____________
/ \
+---------------+-----------------+--------------------+---//--+
| Boot Ctrl Blk | Volume Ctrl Blk | Dir Structure | Files |
| (MBR copy) | (Super Blk) | (inodes, root dir) | dirs |
+---------------+-----------------+--------------------+---//--+
|<-------------Master Boot Record (512 Bytes)------------>|
0 439 443 445 509 511
+----//-----+----------+------+------//---------+---------+
| code area | disk-sig | null | partition table | MBR-sig |
| 440 | 4 | 2 | 16x4=64 | 0xAA55 |
+----//-----+----------+------+------//---------+---------+
299 / 397
On-Disk Information Structure
Boot control block a MBR copyUFS: Boot block
NTFS: Partition boot sectorVolume control block Contains volume details
number of blocks size of blocksfree-block count free-block pointersfree FCB count free FCB pointers
UFS: SuperblockNTFS: Master File Table
Directory structure Organizes the files FCB, File controlblock, contains file details (metadata).
UFS: I-nodeNTFS: Stored in MFT using a relatiional database
structure, with one row per file
300 / 397
Each File-System Has a Superblock
SuperblockKeeps information about the file system
▶ Type — ext2, ext3, ext4...▶ Size▶ Status — how it’s mounted, free blocks, free inodes, ...▶ Information about other metadata structures
# dumpe2fs /dev/sda1 | grep -i superblock
301 / 397
Implementing FilesContiguous Allocation
572 CHAPTER 12 / FILE MANAGEMENT
access, degree of multiprogramming, other performance factors in the system,disk caching, disk scheduling, and so on.
File Allocation Methods Having looked at the issues of preallocation versusdynamic allocation and portion size, we are in a position to consider specific file al-location methods. Three methods are in common use: contiguous, chained, and in-dexed. Table 12.3 summarizes some of the characteristics of each method.
With contiguous allocation, a single contiguous set of blocks is allocated to afile at the time of file creation (Figure 12.7). Thus, this is a preallocation strategy,using variable-size portions. The file allocation table needs just a single entry foreach file, showing the starting block and the length of the file. Contiguous allocationis the best from the point of view of the individual sequential file. Multiple blockscan be read in at a time to improve I/O performance for sequential processing. It isalso easy to retrieve a single block. For example, if a file starts at block b, and the ithblock of the file is wanted, its location on secondary storage is simply b $ i % 1. Con-tiguous allocation presents some problems. External fragmentation will occur, mak-ing it difficult to find contiguous blocks of space of sufficient length. From time totime, it will be necessary to perform a compaction algorithm to free up additional
Table 12.3 File Allocation Methods
Contiguous Chained Indexed
Preallocation? Necessary Possible Possible
Fixed or variable size portions? Variable Fixed blocks Fixed blocks Variable
Portion size Large Small Small Medium
Allocation frequency Once Low to high High Low
Time to allocate Medium Long Short Medium
File allocation table size One entry One entry Large Medium
0 1 2 3 4
5 6 7
File A
File Allocation Table
File B
File C
File E
File D
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
File Name
File AFile BFile CFile DFile E
29183026
35823
Start Block Length
Figure 12.7 Contiguous File Allocation
M12_STAL6329_06_SE_C12.QXD 2/21/08 9:40 PM Page 572
- simple;- good for read only;
- fragmentation
302 / 397
Linked List (Chained) Allocation A pointer in each disk block
12.6 / SECONDARY STORAGE MANAGEMENT 573
space on the disk (Figure 12.8).Also, with preallocation, it is necessary to declare thesize of the file at the time of creation, with the problems mentioned earlier.
At the opposite extreme from contiguous allocation is chained allocation(Figure 12.9). Typically, allocation is on an individual block basis. Each block con-tains a pointer to the next block in the chain. Again, the file allocation table needsjust a single entry for each file, showing the starting block and the length of the file.Although preallocation is possible, it is more common simply to allocate blocks asneeded. The selection of blocks is now a simple matter: any free block can be addedto a chain. There is no external fragmentation to worry about because only one
Figure 12.9 Chained Allocation
0 1 2 3 4
5 6 7
File A
File Allocation Table
File B
File C
File E File D
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
File Name
File AFile BFile CFile DFile E
0381916
35823
Start Block Length
Figure 12.8 Contiguous File Allocation (After Compaction)
0 1 2 3 4
5 6 7
File Allocation Table
File B
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
File B
File Name Start Block Length
1 5
M12_STAL6329_06_SE_C12.QXD 2/21/08 9:40 PM Page 573
- no wasteblock;
- slow randomaccess;
- not 2n
303 / 397
Linked List (Chained) Allocation Though there is no externalfragmentation, consolidation is still preferred.574 CHAPTER 12 / FILE MANAGEMENT
block at a time is needed.This type of physical organization is best suited to sequen-tial files that are to be processed sequentially. To select an individual block of a filerequires tracing through the chain to the desired block.
One consequence of chaining, as described so far, is that there is no accommo-dation of the principle of locality. Thus, if it is necessary to bring in several blocks ofa file at a time, as in sequential processing, then a series of accesses to different partsof the disk are required. This is perhaps a more significant effect on a single-usersystem but may also be of concern on a shared system. To overcome this problem,some systems periodically consolidate files (Figure 12.10).
Indexed allocation addresses many of the problems of contiguous and chainedallocation. In this case, the file allocation table contains a separate one-level index foreach file; the index has one entry for each portion allocated to the file. Typically, thefile indexes are not physically stored as part of the file allocation table. Rather, thefile index for a file is kept in a separate block, and the entry for the file in the file al-location table points to that block.Allocation may be on the basis of either fixed-sizeblocks (Figure 12.11) or variable-size portions (Figure 12.12). Allocation by blockseliminates external fragmentation, whereas allocation by variable-size portions im-proves locality. In either case, file consolidation may be done from time to time. Fileconsolidation reduces the size of the index in the case of variable-size portions, butnot in the case of block allocation. Indexed allocation supports both sequential anddirect access to the file and thus is the most popular form of file allocation.
Free Space Management
Just as the space that is allocated to files must be managed, so the space that is notcurrently allocated to any file must be managed. To perform any of the file alloca-tion techniques described previously, it is necessary to know what blocks on the diskare available. Thus we need a disk allocation table in addition to a file allocationtable. We discuss here a number of techniques that have been implemented.
0 1 2 3 4
5 6 7
File Allocation Table
File B
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
File B
File Name Start Block Length
0 5
Figure 12.10 Chained Allocation (After Consolidation)
M12_STAL6329_06_SE_C12.QXD 2/21/08 9:40 PM Page 574
304 / 397
FAT: Linked list allocation with a table in RAM
▶ Taking the pointer out of eachdisk block, and putting it into atable in memory
▶ fast random access (chain is inRAM)
▶ is 2n
▶ the entire table must be in RAM
disk↗⇒ FAT↗⇒ RAMused ↗
Physicalblock
File A starts here
File B starts here
Unused block
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
10
11
7
3
2
12
14
-1
-1
Fig. 6-14. Linked list allocation using a file allocation table inmain memory.
305 / 397
Indexed Allocation 12.6 / SECONDARY STORAGE MANAGEMENT 575
Bit Tables This method uses a vector containing one bit for each block on thedisk. Each entry of a 0 corresponds to a free block, and each 1 corresponds to ablock in use. For example, for the disk layout of Figure 12.7, a vector of length 35 isneeded and would have the following value:
00111000011111000011111111111011000
A bit table has the advantage that it is relatively easy to find one or a con-tiguous group of free blocks. Thus, a bit table works well with any of the file allo-cation methods outlined. Another advantage is that it is as small as possible.
Figure 12.11 Indexed Allocation with Block Portions
0 1 2 3 4
5 6 7
File Allocation Table
File B
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
File B
File Name Index Block
24
183
1428
0 1 2 3 4
5 6 7
File B
8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
25 26 27 28 29
30 31 32 33 34
Start Block
12814
341
Length
File Allocation Table
File B
File Name Index Block
24
Figure 12.12 Indexed Allocation with Variable-Length Portions
M12_STAL6329_06_SE_C12.QXD 2/21/08 9:40 PM Page 575
I-node A data structure for each file. An i-node is inmemory only if the file is open
filesopened ↗ ⇒ RAMused ↗
306 / 397
I-node — FCB in UNIX
Directory inode (128B)
Type Mode
User ID Group ID
File size # blocks
# links Flags
Timestamps (×3)
Triple indirect
Double indirect
Single indirect
Direct blocks (×12)
.
..
passwd
fstab
… …
Directory block
File inode (128B)
Type Mode
User ID Group ID
File size # blocks
# links Flags
Timestamps (×3)
Triple indirect
Double indirect
Single indirect
Direct blocks (×12)
Indirect block
inode #
inode #
inode #
inode #
Direct blocks (×512)
Block #s ofmoredirectoryblocks
Block # ofblock with512 singleindirectentries
Block # ofblock with512 doubleindirectentries
File data block
Data
File type Description0 Unknown1 Regular file2 Directory3 Character device4 Block device5 Named pipe6 Socket7 Symbolic link
Mode: 9-bit pattern
307 / 397
Inode QuizGiven: block size is 1KB
pointer size is 4B Addressing: byte offset 9000byte offset 350,000
+----------------+
0 | 4096 |
+----------------+ ---->+----------+ Byte 9000 in a file
1 | 228 | / | 367 | |
+----------------+ / | Data blk | v
2 | 45423 | / +----------+ 8th blk, 808th byte
+----------------+ /
3 | 0 | / -->+------+
+----------------+ / / 0| |
4 | 0 | / / +------+
+----------------+ / / : : :
5 | 11111 | / / +------+ Byte 350,000
+----------------+ / ->+-----+/ 75| 3333 | in a file
6 | 0 | / / 0| 331 | +------+\ |
+----------------+ / / +-----+ : : : \ v
7 | 101 | / / | | +------+ \ 816th byte
+----------------+/ / | : | 255| | \-->+----------+
8 | 367 | / | : | +------+ | 3333 |
+----------------+ / | : | 331 | Data blk |
9 | 0 | / | | Single indirect +----------+
+----------------+ / +-----+
S | 428 (10K+256K) | / 255| |
+----------------+/ +-----+
D | 9156 | 9156 /***********************
+----------------+ Double indirect What about the ZEROs?
T | 824 | ***********************/
+----------------+
308 / 397
UNIX In-Core Data Structure
mount table — Info about each mounted FSdirectory-structure cache — Dir-info of recently accessed dirsinode table — An in-core version of the on-disk inode table
file table▶ global▶ keeps inode of each open file▶ keeps track of
▶ how many processes are associated witheach open file
▶ where the next read and write will start▶ access rights
user file descriptor table▶ per process▶ identifies all open files for a process
309 / 397
UNIX In-Core Data StructureUser
File DescriptorTable
FileTable
InodeTable
open()/creat()1. add entry in each table2. returns a file descriptor — an index into the user file
descriptor table
310 / 397
The Tables
Two levels of internal tables in the OSA per-process table tracks all files that a process has open.
Stores▶ the current-file-position pointer (not really)▶ access rights▶ more...
a.k.a file descriptor tableA system-wide table keeps process-independent
information, such as▶ the location of the file on disk▶ access dates▶ file size▶ file open count — the number of processes
opening this file
311 / 397
Per-process FDT
Process 1
+------------------+ System-wide
| ... | open-file table
+------------------+ +------------------+
| Position pointer | | ... |
| Access rights | +------------------+
| ... |\ | ... |
+------------------+ \ +------------------+
| ... | --------->| Location on disk |
+------------------+ | R/W |
| Access dates |
Process 2 | File size |
+------------------+ | Pointer to inode |
| Position pointer | | File-open count |
| Access rights |----------->| ... |
| ... | +------------------+
+------------------+ | ... |
| ... | +------------------+
+------------------+
312 / 397
A process executes the following code:fd1 = open("/etc/passwd", O_RDONLY);fd2 = open("local", O_RDWR);fd3 = open("/etc/passwd", O_WRONLY);
STDINSTDOUTSTDERR
...
UserFile descriptor
table
012345...
...count 1
R...
count 1RW...
count 1W
Globalopen filetable
...
...
...(/etc/passwd)
count 2...
(local)count 1
...
Inode table
313 / 397
One more process B:fd1 = open("/etc/passwd", O_RDONLY);fd2 = open("private", O_RDONLY);
STDINSTDOUTSTDERR
...
Proc B
UserFile descriptor
table
012345...
...count 1
R...
count 1RW...
count 1R...
count 1W...
count 1R
Globalopen filetable
...
...(/etc/passwd)
count 3......
(local)count 1
...
...(private)count 1
...
...
Inode table
STDINSTDOUTSTDERR
...
Proc B
01234...
314 / 397
Why File Table?To allow a parent and child to share a file position, but toprovide unrelated processes with their own values.
Mode
i-node
Link count
Uid
Gid
File size
Times
Addresses offirst 10
disk blocks
Single indirect
Double indirect
Triple indirect
Parent’sfile
descriptortable
Child’sfile
descriptortable
Unrelatedprocess
filedescriptor
table
Open filedescription
File positionR/W
Pointer to i-node
File positionR/W
Pointer to i-node
Pointers todisk blocks
Tripleindirectblock Double
indirectblock Single
indirectblock
‘
Fig. 10-33. The relation between the file descriptor table, the openfile description table, and the i-node table.
315 / 397
Why File Table?
Where To Put File Position Info?Inode table? No. Multiple processes can open the same file.
Each one has its own file position.User file descriptor table? No. Trouble in file sharing.
Example#!/bin/bash
echo hello
echo world
Where should the “world” be?$ ./hello.sh > A
316 / 397
Implementing Directories
(a)
games
news
work
attributes
attributes
attributes
attributes
Data structurecontaining theattributes
(b)
games
news
work
Fig. 6-16. (a) A simple directory containing fixed-size entries withthe disk addresses and attributes in the directory entry. (b) A direc-tory in which each entry just refers to an i-node.
(a) A simple directory (Windows)▶ fixed size entries▶ disk addresses and attributes in directory entry
(b) Directory in which each entry just refers to an i-node(UNIX)
317 / 397
How Long A File Name Can Be?
File 1 entry length
File 1 attributes
Pointer to file 1's name
File 1 attributes
Pointer to file 2's name
File 2 attributes
Pointer to file 3's nameFile 2 entry length
File 2 attributes
File 3 entry length
File 3 attributes
p
e
b
e
r
c
u
t
o
t
d
j
-
g
p
e
b
e
r
c
u
t
o
t
d
j
-
g
p
e r s o
n n e l
f o o
p
o
l
e
n
r
n
f o o
s
e
Entry
for one
file
Heap
Entry
for one
file
(a) (b)
File 3 attributes
318 / 397
UNIX Treats a Directory as a File
Directory inode (128B)
Type Mode
User ID Group ID
File size # blocks
# links Flags
Timestamps (×3)
Triple indirect
Double indirect
Single indirect
Direct blocks (×12)
.
..
passwd
fstab
… …
Directory block
File inode (128B)
Type Mode
User ID Group ID
File size # blocks
# links Flags
Timestamps (×3)
Triple indirect
Double indirect
Single indirect
Direct blocks (×12)
Indirect block
inode #
inode #
inode #
inode #
Direct blocks (×512)
Block #s ofmoredirectoryblocks
Block # ofblock with512 singleindirectentries
Block # ofblock with512 doubleindirectentries
File data block
Data
Example. 2.. 2bin 11116545boot 2cdrom 12dev 3...
...
319 / 397
The steps in looking up /usr/ast/mbox
Root directoryI-node 6 is for /usr
Block 132 is /usr
directory
I-node 26 is for
/usr/ast
Block 406 is /usr/ast directory
Looking up usr yields i-node 6
I-node 6 says that /usr is in
block 132
/usr/ast is i-node
26
/usr/ast/mbox is i-node
60
I-node 26 says that
/usr/ast is in block 406
1
1
4
7
14
9
6
8
.
..
bin
dev
lib
etc
usr
tmp
6
1
19
30
51
26
45
dick
erik
jim
ast
bal
26
6
64
92
60
81
17
grants
books
mbox
minix
src
Mode size
times
132
Mode size
times
406
320 / 397
File SharingMultiple Users
User IDs identify users, allowing permissions andprotections to be per-user
Group IDs allow users to be in groups, permitting groupaccess rights
Example: 9-bit patternrwxr-x--- means:
user group otherrwx r-x ---111 1-1 0007 5 0
321 / 397
File SharingRemote File Systems
Networking — allows file system access betweensystems
▶ Manually via programs like FTP▶ Automatically, seamlessly using distributed file systems▶ Semi automatically, via the world wide web
C/S model — allows clients to mount remote file systemsfrom servers
▶ NFS — standard UNIX client-server file sharing protocol▶ CIFS — standard Windows protocol▶ Standard system calls are translated into remote calls
Distributed Information Systems (distributed namingservices)
▶ such as LDAP, DNS, NIS, Active Directory implementunified access to information needed for remote computing
322 / 397
File SharingProtection
▶ File owner/creator should be able to control:▶ what can be done▶ by whom
▶ Types of access▶ Read▶ Write▶ Execute▶ Append▶ Delete▶ List
323 / 397
Shared FilesHard Links vs. Soft Links
Root directory
B
B B C
C C
CA
B C
B
? C C C
A
Shared file
Fig. 6-18. File system containing a shared file. 324 / 397
Hard LinksHard links + the same inode
325 / 397
DrawbackC's directory B's directory B's directoryC's directory
Owner = C Count = 1
Owner = C Count = 2
Owner = C Count = 1
(a) (b) (c)
326 / 397
Symbolic LinksA symbolic link has its own inode + a directory entry.
327 / 397
Disk Space ManagementStatistics
328 / 397
▶ Block size is chosen while creating the FS▶ Disk I/O performance is conflict with space utilization
▶ smaller block size ⇒ better space utilization▶ larger block size ⇒ better disk I/O performance$ dumpe2fs /dev/sda1 | grep "Block size"
329 / 397
Keeping Track of Free Blocks1. Linked List10.5 Free-Space Management 443
0 1 2 3
4 5 7
8 9 10 11
12 13 14
16 17 18 19
20 21 22 23
24 25 26 27
28 29 30 31
15
6
free-space list head
Figure 10.10 Linked free-space list on disk.
of a large number of free blocks can now be found quickly, unlike the situationwhen the standard linked-list approach is used.
10.5.4 Counting
Another approach takes advantage of the fact that, generally, several contigu-ous blocks may be allocated or freed simultaneously, particularly when space isallocated with the contiguous-allocation algorithm or through clustering. Thus,rather than keeping a list of n free disk addresses, we can keep the address ofthe first free block and the number (n) of free contiguous blocks that follow thefirst block. Each entry in the free-space list then consists of a disk address anda count. Although each entry requires more space than would a simple diskaddress, the overall list is shorter, as long as the count is generally greater than1. Note that this method of tracking free space is similar to the extent methodof allocating blocks. These entries can be stored in a B-tree, rather than a linkedlist, for efficient lookup, insertion, and deletion.
10.5.5 Space Maps
Sun’s ZFS file system was designed to encompass huge numbers of files,directories, and even file systems (in ZFS, we can create file-system hierarchies).The resulting data structures could have been large and inefficient if they hadnot been designed and implemented properly. On these scales, metadata I/Ocan have a large performance impact. Consider, for example, that if the free-space list is implemented as a bit map, bit maps must be modified both whenblocks are allocated and when they are freed. Freeing 1 GB of data on a 1-TBdisk could cause thousands of blocks of bit maps to be updated, because thosedata blocks could be scattered over the entire disk.
2. Bit map (n blocks)
0 1 2 3 4 5 6 7 8 .. n-1
+-+-+-+-+-+-+-+-+-+-//-+-+
|0|0|1|0|1|1|1|0|1| .. |0|
+-+-+-+-+-+-+-+-+-+-//-+-+
bit[i] ={0⇒ block[i] is free1⇒ block[i] is occupied
330 / 397
Journaling File Systems
Operations required to remove a file in UNIX:1. Remove the file from its directory
- set inode number to 02. Release the i-node to the pool of free i-nodes
- clear the bit in inode bitmap3. Return all the disk blocks to the pool of free disk blocks
- clear the bits in block bitmap
What if crash occurs between 1 and 2, or between 2 and 3?
331 / 397
Journaling File Systems
Keep a log of what the file system is going to dobefore it does it
▶ so that if the system crashes before it can do its plannedwork, upon rebooting the system can look in the log tosee what was going on at the time of the crash and finishthe job.
▶ NTFS, EXT3, and ReiserFS use journaling among others
332 / 397
Ext2 File System
Physical Layout+------------+---------------+---------------+--//--+---------------+
| Boot Block | Block Group 0 | Block Group 1 | | Block Group n |
+------------+---------------+---------------+--//--+---------------+
__________________________/ \_____________
/ \
+-------+-------------+------------+--------+-------+--------+
| Super | Group | Data Block | inode | inode | Data |
| Block | Descriptors | Bitmap | Bitmap | Table | Blocks |
+-------+-------------+------------+--------+-------+--------+
1 blk n blks 1 blk 1 blk n blks n blks
333 / 397
Ext2 Block groups
The partition is divided into Block Groups▶ Block groups are same size — easy locating▶ Kernel tries to keep a file’s data blocks in the same block
group — reduce fragmentation▶ Backup critical info in each block group▶ The Ext2 inodes for each block group are kept in the
inode table▶ The inode-bitmap keeps track of allocated and
unallocated inodes
334 / 397
Group descriptor▶ Each block group has a group descriptor▶ All the group descriptors together make the group
descriptor table▶ The table is stored along with the superblock▶ Block Bitmap: tracks free blocks▶ Inode Bitmap: tracks free inodes▶ Inode Table: all inodes in this block group
▶Free blocks countFree Inodes count
Used dir count
}counters
See more: # dumpe2fs /dev/sda1
335 / 397
Maths
Given block size = 4Kblockbitmap = 1blk , then
blockspergroup = 8bits× 4K = 32K
How large is a group?
groupsize = 32K× 4K = 128M
How many block groups are there?
≈ partition sizegroupsize =
partition size128M
How many files can I have in max?
≈ partition sizeblock size =
partition size4K
336 / 397
Ext2 Block Allocation Policies626 Chapter 15 The Linux System
allocating scattered free blocks
allocating continuous free blocks
block in use bit boundaryblock selectedby allocator
free block byte boundarybitmap search
Figure 15.9 ext2fs block-allocation policies.
these extra blocks to the file. This preallocation helps to reduce fragmentationduring interleaved writes to separate files and also reduces the CPU cost ofdisk allocation by allocating multiple blocks simultaneously. The preallocatedblocks are returned to the free-space bitmap when the file is closed.
Figure 15.9 illustrates the allocation policies. Each row represents asequence of set and unset bits in an allocation bitmap, indicating used andfree blocks on disk. In the first case, if we can find any free blocks sufficientlynear the start of the search, then we allocate them no matter how fragmentedthey may be. The fragmentation is partially compensated for by the fact thatthe blocks are close together and can probably all be read without any diskseeks, and allocating them all to one file is better in the long run than allocatingisolated blocks to separate files once large free areas become scarce on disk. Inthe second case, we have not immediately found a free block close by, so wesearch forward for an entire free byte in the bitmap. If we allocated that byteas a whole, we would end up creating a fragmented area of free space betweenit and the allocation preceding it, so before allocating we back up to make thisallocation flush with the allocation preceding it, and then we allocate forwardto satisfy the default allocation of eight blocks.
15.7.3 Journaling
One popular feature in a file system is journaling, whereby modificationsto the file system are sequentially written to a journal. A set of operationsthat performs a specific task is a transaction. Once a transaction is written tothe journal, it is considered to be committed, and the system call modifyingthe file system (write()) can return to the user process, allowing it tocontinue execution. Meanwhile, the journal entries relating to the transactionare replayed across the actual file-system structures. As the changes are made, a
337 / 397
Ext2 inode
338 / 397
Ext2 inodeMode: holds two pieces of information
1. Is it a{file|dir|sym-link|blk-dev|char-dev|FIFO}?
2. PermissionsOwner info: Owners’ ID of this file or directory
Size: The size of the file in bytesTimestamps: Accessed, created, last modified timeDatablocks: 15 pointers to data blocks (12 + S+D+ T)
339 / 397
Max File SizeGiven: {
block size = 4Kpointer size = 4B
,
We get:
MaxFileSize = numberof pointers× block size
= (
number of pointers︷ ︸︸ ︷12︸︷︷︸
direct
+ 1K︸︷︷︸1−indirect
+ 1K× 1K︸ ︷︷ ︸2−indirect
+1K× 1K× 1K︸ ︷︷ ︸3−indirect
)× 4K
= 48K+ 4M+ 4G+ 4T
340 / 397
Ext2 Superblock
Magic Number: 0xEF53Revision Level: determines what new features are availableMount Count and Maximum Mount Count: determines if the
system should be fully checkedBlock Group Number: indicates the block group holding this
superblockBlock Size: usually 4KBlocks per Group: 8bits× block sizeFree Blocks: System-wide free blocksFree Inodes: System-wide free inodesFirst Inode: First inode number in the file systemSee more: ∼# dumpe2fs /dev/sda1
341 / 397
Ext2 File Types
File type Description0 Unknown1 Regular file2 Directory3 Character device4 Block device5 Named pipe6 Socket7 Symbolic link
Device file, pipe, and socket: No data blocks are required.All info is stored in the inode
Fast symbolic link: Short path name (< 60 chars) needs nodata block. Can be stored in the 15 pointer fields
342 / 397
Ext2 Directories0 11|12 23|24 39|40
+----+--+-+-+----+----+--+-+-+----+----+--+-+-+----+----+--//--+
| 21 |12|1|2|. | 22 |12|2|2|.. | 53 |16|5|2|hell|o | |
+----+--+-+-+----+----+--+-+-+----+----+--+-+-+----+----+--//--+
,--------> inode number
| ,---> record length
| | ,---> name length
| | | ,---> file type
| | | | ,----> name
+----+--+-+-+----+
0 | 21 |12|1|2|. |
+----+--+-+-+----+
12| 22 |12|2|2|.. |
+----+--+-+-+----+----+
24| 53 |16|5|2|hell|o |
+----+--+-+-+----+----+
40| 67 |28|3|2|usr |
+----+--+-+-+----+----+
52| 0 |16|7|1|oldf|ile |
+----+--+-+-+----+----+
68| 34 |12|4|2|sbin|
+----+--+-+-+----+
▶ Directories are special files▶ “.” and “..” first▶ Padding to 4×▶ inode number is 0 — deleted
file
343 / 397
Many different FS are in use
Windowsuses drive letter (C:, D:, ...) to identify each FS
UNIXintegrates multiple FS into a single structure
▶ From user’s view, there is only one FS hierarchy
$ man fs
344 / 397
$ cp /floppy/TEST /tmp/test
cp
VFS
ext2 MS-DOS/tmp/test /floppy/TEST
1 inf = open("/floppy/TEST", O_RDONLY, 0);2 outf = open("/tmp/test",3 O_WRONLY|O_CREAT|O_TRUNC, 0600);4 do {5 i = read(inf, buf, 4096);6 write(outf, buf, i);7 } while (i);8 close(outf);9 close(inf);
345 / 397
ret = write(fd, buf, len);
ptg
263Unix Filesystems
sys_write() system call that determines the actual file writing method for the filesystem on which fd resides.The generic write system call then invokes this method, which is part of the filesystem implementation, to write the data to the media (or whatever this filesys-tem does on write). Figure 13.2 shows the flow from user-space’s write() call through the data arriving on the physical media. On one side of the system call is the genericVFS interface, providing the frontend to user-space; on the other side of the system call is the filesystem-specific backend, dealing with the implementation details.The rest of this chap-ter looks at how theVFS achieves this abstraction and provides its interfaces.
Unix FilesystemsHistorically, Unix has provided four basic filesystem-related abstractions: files, directory entries, inodes, and mount points.
A filesystem is a hierarchical storage of data adhering to a specific structure. Filesystems contain files, directories, and associated control information.Typical operations performed on filesystems are creation, deletion, and mounting. In Unix, filesystems are mounted at a specific mount point in a global hierarchy known as a namespace.1 This enables all mounted filesystems to appear as entries in a single tree. Contrast this single, unified tree with the behavior of DOS and Windows, which break the file namespace up into drive letters, such as C:.This breaks the namespace up among device and partition boundaries, “leaking” hardware details into the filesystem abstraction.As this delineation may be arbi-trary and even confusing to the user, it is inferior to Linux’s unified namespace.
A file is an ordered string of bytes.The first byte marks the beginning of the file, and the last byte marks the end of the file. Each file is assigned a human-readable name for identification by both the system and the user.Typical file operations are read, write,
1 Recently, Linux has made this hierarchy per-process, to give a unique namespace to each process.
Because each process inherits its parent’s namespace (unless you specify otherwise), there is seem-
ingly one global namespace.
user-space VFS filesystem physical media
write( ) sys_write( ) filesystem’swrite method
Figure 13.2 The flow of data from user-space issuing a write() call, through the VFS’s generic system call, into the filesystem’s specific write method, and finally
arriving at the physical media.
From the Library of Wow! eBook
346 / 397
Virtural File SystemsPut common parts of all FS in a separate layer
▶ It’s a layer in the kernel▶ It’s a common interface to several kinds of file systems▶ It calls the underlying concrete FS to actual manage the
data
User process
FS 1 FS 2 FS 3
Buffer cache
Virtual file system
File system
VFS interface
POSIX
347 / 397
Virtual File System▶ Manages kernel level file abstractions in one format for
all file systems▶ Receives system call requests from user level (e.g. write,
open, stat, link)▶ Interacts with a specific file system based on mount
point traversal▶ Receives requests from other parts of the kernel, mostly
from memory management
Real File Systems▶ managing file & directory data▶ managing meta-data: timestamps, owners, protection,
etc.▶ disk data, NFS data... translate←−−−−−→ VFS data
348 / 397
File System Mounting
/
a b a
c
p q r q q r
d
/
c d
b
Diskette
/
Hard diskHard disk
x y z
x y z
Fig. 10-26. (a) Separate file systems. (b) After mounting.
349 / 397
A FS must be mounted before it can be usedMount — The file system is registered with the VFS
▶ The superblock is read into the VFS superblock▶ The table of addresses of functions the VFS requires is
read into the VFS superblock▶ The FS’ topology info is mapped onto the VFS superblock
data structure
The VFS keeps a list of the mounted file systemstogether with their superblocksThe VFS superblock contains:
▶ Device, blocksize▶ Pointer to the root inode▶ Pointer to a set of superblock routines▶ Pointer to file_system_type data structure▶ more...
350 / 397
V-node
▶ Every file/directory in the VFS has a VFS inode, kept inthe VFS inode cache
▶ The real FS builds the VFS inode from its own info
Like the EXT2 inodes, the VFS inodes describe▶ files and directories within the system▶ the contents and topology of the Virtual File System
351 / 397
VFS Operationread()
...
Process table
0
File descriptors
...
V-nodes
openreadwrite
Function pointers
...2
4
VFS
Read function
FS 1
Call from VFS into FS 1
352 / 397
Linux VFS
The Common File ModelAll other filesystems must map their own concepts into thecommon file model
For example, FAT filesystems do not have inodes.
▶ The main components of the common file model aresuperblock – information about mounted filesystem
inode – information about a specific filefile – information about an open file
dentry – information about directory entry▶ Geared toward Unix FS
353 / 397
The Superblock Object▶ is implemented by each FS and is used to store
information describing that specific FS▶ usually corresponds to the filesystem superblock or the
filesystem control block▶ Filesystems that are not disk-based (such as sysfs, proc)
generate the superblock on-the-fly and store it inmemory
▶ struct super_block in <linux/fs.h>▶ s_op in struct super_block + struct super_operations —
the superblock operations table▶ Each item in this table is a pointer to a function that
operates on a superblock object
354 / 397
The Inode Object▶ For Unix-style filesystems, this information is simply read
from the on-disk inode▶ For others, the inode object is constructed in memory in
whatever manner is applicable to the filesystem▶ struct inode in <linux/fs.h>▶ An inode represents each file on a FS, but the inode
object is constructed in memory only as files areaccessed
▶ includes special files, such as device files or pipes▶ i_op + struct inode_operations
355 / 397
The Dentry Object▶ components in a path▶ makes path name lookup easier▶ struct dentry in <linux/dcache.h>▶ created on-the-fly from a string representation of a path
name
356 / 397
Dentry State▶ used▶ unused▶ negative
Dentry Cacheconsists of three parts:1. Lists of “used”dentries2. A doubly linked “least recently used”list of unused and
negative dentry objects3. A hash table and hashing function used to quickly
resolve a given path into the associated dentry object
357 / 397
The File Object▶ is the in-memory representation of an open file▶ open() ⇒ create; close() ⇒ destroy▶ there can be multiple file objects in existence for the
same file▶ Because multiple processes can open and manipulate a file
at the same time▶ struct file in <linux/fs.h>
358 / 397
Process 1
Process 2
Process 3
File object
File object
File object
dentryobject
dentryobject
inodeobject
Superblockobject
diskfile
fd f_dentryd_inode
i_sb
359 / 397
References
Wikipedia. Computer file — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. Ext2 — Wikipedia, The Free Encyclopedia.[Online; accessed 21-February-2015]. 2015.Wikipedia. File system — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. Inode — Wikipedia, The Free Encyclopedia.[Online; accessed 21-February-2015]. 2015.Wikipedia. Virtual file system — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2014.
360 / 397
I/O Hardware
Different people, different view▶ Electrical engineers: chips, wires, power supplies,
motors...▶ Programmers: interface presented to the software
▶ the commands the hardware accepts▶ the functions it carries out▶ the errors that can be reported back▶ ...
362 / 397
I/O Devices
Roughly two Categories:1. Block devices: store information in fix-size blocks2. Character devices: deal with streams of characters
363 / 397
Device ControllersI/O units usually consist of1. a mechanical component2. an electronic component is called the device controller
or adaptere.g. video adapter, network adapter...
Monitor
Keyboard Floppydisk drive
Harddisk drive
Harddisk
controller
Floppydisk
controller
Keyboardcontroller
VideocontrollerMemoryCPU
Bus
Fig. 1-5. Some of the components of a simple personal computer. 364 / 397
Port: A connection point. A device communicates withthe machine via a port — for example, a serialport.
Bus: A set of wires and a rigidly defined protocol thatspecifies a set of messages that can be sent onthe wires.
Controller: A collection of electronics that can operate aport, a bus, or a device.
e.g. serial port controller, SCSI bus controller,disk controller...
365 / 397
The Controller’s Job
Examples:▶ Disk controllers: convert the serial bit stream into a block
of bytes and perform any error correction necessary▶ Monitor controllers: read bytes containing the characters
to be displayed from memory and generates the signalsused to modulate the CRT beam to cause it to write onthe screen
366 / 397
Inside The Controllers
Control registers: for communicating with the CPU (R/W,On/Off...)
Data buffer: for example, a video RAM
Q: How the CPU communicates with the control registersand the device data buffers?
A: Usually two ways...
367 / 397
(a) Each control register is assigned an I/O port number,then the CPU can do, for example,
▶ read in control register PORT and store the result in CPUregister REG.IN REG, PORT
▶ write the contents of REG to a control registerOUT PORT, REG
Note: the address spaces for memory and I/O aredifferent. For example,
▶ read the contents of I/O port 4 and puts it in R0IN R0, 4 ; 4 is a port number
▶ read the contents of memory word 4 and puts it in R0MOV R0, 4 ; 4 is a memory address
368 / 397
Address spacesTwo address One address space Two address spaces
Memory
I/O ports
0xFFFF…
0
(a) (b) (c)
Fig. 5-2. (a) Separate I/O and memory space. (b) Memory-mappedI/O. (c) Hybrid.
(a) Separate I/O and memory space(b) Memory-mapped I/O: map all the control registers into
the memory space. Each control register is assigned aunique memory address.
(c) Hybrid: with memory-mapped I/O data buffers andseparate I/O ports for the control registers. For example,Intel Pentium
▶ Memory addresses 640K to 1M being reserved for devicedata buffers
▶ I/O ports 0 through 64K.369 / 397
Advantages of Memory-mapped I/O
No assembly code is needed (IN, OUT...)With memory-mapped I/O, device control registers are justvariables in memory and can be addressed in C the sameway as any other variables. Thus with memory-mapped I/O,a I/O device driver can be written entirely in C.
No special protection mechanism is neededThe I/O address space is part of the kernel space, thuscannot be touched directly by any user space process.
370 / 397
Disadvantages of Memory-mapped I/OCaching problem
▶ Caching a device control register would be disastrous▶ Selectively disabling caching adds extra complexity to
both hardware and the OS
Problem with multiple buses
CPU Memory I/O
BusAll addresses (memory
and I/O) go here
CPU Memory I/O
CPU reads and writes of memorygo over this high-bandwidth bus
This memory port isto allow I/O devicesaccess to memory
(a) (b)
Fig. 5-3. (a) A single-bus architecture. (b) A dual-bus memoryarchitecture.
Figure: (a) A single-bus architecture. (b) A dual-bus memoryarchitecture.
371 / 397
Programmed I/OHandshaking
...........
busy = ?
.
0
.....
write a byte
.....
command-ready = 1
.....
command-ready = ?
.
1
.....
busy = 1
.....
write == 1?
.
1
.....
read
.
1 byte
.....
command-ready = 0, error = 0, busy = 0
.Host:. status:Register. control:Register. data-out:Register. Controller:
372 / 397
ExampleSteps in printing a string
String tobe printedUser
space
Kernelspace
ABCDEFGH
Printedpage
(a)
ABCDEFGH
ABCDEFGH
Printedpage
(b)
ANext
(c)
ABNext
Fig. 5-6. Steps in printing a string.
1 copy_from_user(buffer, p, count); /* p is the kernel bufer */2 for (i = 0; i < count; i++) { /* loop on every character */3 while(*printer_status_reg != READY); /* loop until ready */4 *printer_data_register = p[i]; /* output one character */5 }6 return_to_user();
373 / 397
Pulling Is Inefficient
A better wayInterrupt The controller notifies the CPU when the device
is ready for service.1.2 Computer-System Organization 9
userprocessexecuting
CPU
I/O interruptprocessing
I/Orequest
transferdone
I/Orequest
transferdone
I/Odevice
idle
transferring
Figure 1.3 Interrupt time line for a single process doing output.
the interrupting device. Operating systems as different as Windows and UNIXdispatch interrupts in this manner.
The interrupt architecture must also save the address of the interruptedinstruction. Many old designs simply stored the interrupt address in afixed location or in a location indexed by the device number. More recentarchitectures store the return address on the system stack. If the interruptroutine needs to modify the processor state—for instance, by modifyingregister values—it must explicitly save the current state and then restore thatstate before returning. After the interrupt is serviced, the saved return addressis loaded into the program counter, and the interrupted computation resumesas though the interrupt had not occurred.
1.2.2 Storage Structure
The CPU can load instructions only from memory, so any programs to run mustbe stored there. General-purpose computers run most of their programs fromrewriteable memory, called main memory (also called random-access memoryor RAM). Main memory commonly is implemented in a semiconductortechnology called dynamic random-access memory (DRAM). Computers useother forms of memory as well. Because the read-only memory (ROM) cannotbe changed, only static programs are stored there. The immutability of ROMis of use in game cartridges. EEPROM cannot be changed frequently and socontains mostly static programs. For example, smartphones have EEPROM tostore their factory-installed programs.
All forms of memory provide an array of words. Each word has itsown address. Interaction is achieved through a sequence of load or storeinstructions to specific memory addresses. The load instruction moves a wordfrom main memory to an internal register within the CPU, whereas the storeinstruction moves the content of a register to main memory. Aside from explicitloads and stores, the CPU automatically loads instructions from main memoryfor execution.
A typical instruction–execution cycle, as executed on a system with a vonNeumann architecture, first fetches an instruction from memory and storesthat instruction in the instruction register. The instruction is then decodedand may cause operands to be fetched from memory and stored in some
374 / 397
ExampleWriting a string to the printer using interrupt-driven I/O
When the print system call is made...1 copy_from_user(buffer, p, count);2 enable_interrupts();3 while(*printer_status_reg != READY);4 *printer_data_register = p[0];5 scheduler();
Interrupt service procedure for the printer1 if (count == 0) {2 unblock_user();3 } else {4 *printer_data_register = p[1];5 count = count - 1;6 i = i + 1;7 }8 acknowledge_interrupt();9 return_from_interrupt();
375 / 397
Direct Memory Access (DMA)
Programmed I/O (PIO) tying up the CPU full time until all theI/O is done (busy waiting)
Interrupt-Driven I/O an interrupt occurs on every charcater(wastes CPU time)
Direct Memory Access (DMA) Uses a special-purposeprocessor (DMA controller) to work with the I/Ocontroller
376 / 397
ExamplePrinting a string using DMA
In essence, DMA is programmed I/O, only with the DMAcontroller doing all the work, instead of the main CPU.
When the print system call is made...1 copy_from_user(buffer, p, count);2 set_up_DMA_controller();3 scheduler();
Interrupt service procedure1 acknowledge_interrupt();2 unblock_user();3 return_from_interrupt();
The big win with DMA is reducing the number of interruptsfrom one per character to one per buffer printed.
377 / 397
DMA Handshaking ExampleRead from disk
..........
read
.
data
.....
data ready?
.
yes
.....
DMA-request
.......
DMA-acknowledge
...
memory address
.....
data
.Disk:. Disk Controller:. DMA Controller:. Memory:
378 / 397
The DMA controller▶ has access to the system bus independent of the CPU▶ contains several registers that can be written and read
by the CPU. These includes▶ a memory address register▶ a byte count register▶ one or more control registers
▶ specify the I/O port to use▶ the direction of the transfer (read/write)▶ the transfer unit (byte/word)▶ the number of bytes to transfer in one burst.
379 / 397
CPUDMA
controllerDisk
controllerMain
memoryBuffer
1. CPUprogramsthe DMAcontroller
Interrupt whendone
2. DMA requeststransfer to memory 3. Data transferred
Bus
4. Ack
Address
Count
Control
Drive
Fig. 5-4. Operation of a DMA transfer.
380 / 397
I/O Software Layers
I/Orequest
LayerI/Oreply I/O functions
Make I/O call; format I/O; spooling
Naming, protection, blocking, buffering, allocation
Set up device registers; check status
Wake up driver when I/O completed
Perform I/O operation
User processes
Device-independentsoftware
Device drivers
Interrupt handlers
Hardware
Fig. 5-16. Layers of the I/O system and the main functions of eachlayer.
381 / 397
Interrupt Handlers
Mauerer runc14.tex V2 - 09/04/2008 5:37pm Page 851
Chapter 14: Kernel Activities
mode stack. However, this alone is not sufficient. Because the kernel also uses CPU resources to executeits code, the entry path must save the current register status of the user application in order to restore itupon termination of interrupt activities. This is the same mechanism used for context switching duringscheduling. When kernel mode is entered, only part of the complete register set is saved. The kernel doesnot use all available registers. Because, for example, no floating point operations are used in kernel code(only integer calculations are made), there is no need to save the floating point registers.3 Their valuedoes not change when kernel code is executed. The platform-specific data structure pt_regs that lists allregisters modified in kernel mode is defined to take account of the differences between the various CPUs(Section 14.1.7 takes a closer look at this). Low-level routines coded in assembly language are responsiblefor filling the structure.
InterruptHandler
Schedulingnecessary?
Signals?
Restore registers
Deliver signalsto process
Activate user stack
Switch tokernel stack
Save registers
Interrupt
schedule
Figure 14-2: Handling an interrupt.
In the exit path the kernel checks whether
❑ the scheduler should select a new process to replace the old process.
❑ there are signals that must be delivered to the process.
Only when these two questions have been answered can the kernel devote itself to completing its regulartasks after returning from an interrupt; that is, restoring the register set, switching to the user modestack, switching to an appropriate processor mode for user applications, or switching to a differentprotection ring.4
Because interaction between C and assembly language code is required, particular care must be takento correctly design data exchange between the assembly language level and C level. The correspondingcode is located in arch/arch/kernel/entry.S and makes thorough use of the specific characteristicsof the individual processors. For this reason, the contents of this file should be modified as seldom aspossible — and then only with great care.
3Some architectures (e.g., IA-64) do not adhere to this rule but use a few registers from the floating comma set and save them eachtime kernel mode is entered. The bulk of the floating point registers remain ‘‘untouched‘‘ by the kernel, and no explicit floating pointoperations are used.4Some processors make this switch automatically without being requested explicitly to do so by the kernel.
851
382 / 397
Device Drivers
Userspace
Kernelspace
User process
Userprogram
Rest of the operating system
Printerdriver
Camcorderdriver
CD-ROMdriver
Printer controllerHardware
Devices
Camcorder controller CD-ROM controller
Fig. 5-11. Logical positioning of device drivers. In reality allcommunication between drivers and device controllers goes overthe bus.
383 / 397
Device-Independent I/O Software
Functions▶ Uniform interfacing for device drivers▶ Buffering▶ Error reporting▶ Allocating and releasing dedicated devices▶ Providing a device-independent block size
384 / 397
Uniform Interfacing for Device Drivers
Operating system Operating system
Disk driver Printer driver Keyboard driver Disk driver Printer driver Keyboard driver
(a) (b)
Fig. 5-13. (a) Without a standard driver interface. (b) With a stan-dard driver interface.
385 / 397
Buffering
User process
Userspace
Kernelspace
2 2
1 1 3
Modem Modem Modem Modem
(a) (b) (c) (d)
Fig. 5-14. (a) Unbuffered input. (b) Buffering in user space.(c) Buffering in the kernel followed by copying to user space. (d)Double buffering in the kernel.
386 / 397
User-Space I/O Software
Examples:▶ stdio.h, printf(), scanf()...▶ spooling(printing, USENET news...)
387 / 397
Disk458 Chapter 11 Mass-Storage Structure
track t
sector s
spindle
cylinder c
platter
arm
read-writehead
arm assembly
rotation
Figure 11.1 Moving-head disk mechanism.
A read–write head “flies” just above each surface of every platter. Theheads are attached to a disk arm that moves all the heads as a unit. The surfaceof a platter is logically divided into circular tracks, which are subdivided intosectors. The set of tracks that are at one arm position makes up a cylinder.There may be thousands of concentric cylinders in a disk drive, and each trackmay contain hundreds of sectors. The storage capacity of common disk drivesis measured in gigabytes.
When the disk is in use, a drive motor spins it at high speed. Most drivesrotate 60 to 200 times per second. Disk speed has two parts. The transferrate is the rate at which data flow between the drive and the computer. Thepositioning time, sometimes called the random-access time, consists of thetime necessary to move the disk arm to the desired cylinder, called the seektime, and the time necessary for the desired sector to rotate to the disk head,called the rotational latency. Typical disks can transfer several megabytes ofdata per second, and they have seek times and rotational latencies of severalmilliseconds.
Because the disk head flies on an extremely thin cushion of air (measuredin microns), there is a danger that the head will make contact with the disksurface. Although the disk platters are coated with a thin protective layer, thehead will sometimes damage the magnetic surface. This accident is called ahead crash. A head crash normally cannot be repaired; the entire disk must bereplaced.
A disk can be removable, allowing different disks to be mounted as needed.Removable magnetic disks generally consist of one platter, held in a plastic caseto prevent damage while not in the disk drive. Floppy disks are inexpensiveremovable magnetic disks that have a soft plastic case containing a flexibleplatter. The head of a floppy-disk drive generally sits directly on the disksurface, so the drive is designed to rotate more slowly than a hard-disk driveto reduce the wear on the disk surface. The storage capacity of a floppy disk
388 / 397
Disk SchedulingFirst Come First Serve (FCFS)
464 Chapter 11 Mass-Storage Structure
0 14 37 536567 98 122124 183199
queue � 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.4 FCFS disk scheduling.
in that order. If the disk head is initially at cylinder 53, it will first move from53 to 98, then to 183, 37, 122, 14, 124, 65, and finally to 67, for a total headmovement of 640 cylinders. This schedule is diagrammed in Figure 11.4.
The wild swing from 122 to 14 and then back to 124 illustrates the problemwith this schedule. If the requests for cylinders 37 and 14 could be servicedtogether, before or after the requests for 122 and 124, the total head movementcould be decreased substantially, and performance could be thereby improved.
11.4.2 SSTF Scheduling
It seems reasonable to service all the requests close to the current head positionbefore moving the head far away to service other requests. This assumption isthe basis for the shortest-seek-time-first (SSTF) algorithm. The SSTF algorithmselects the request with the least seek time from the current head position.Since seek time increases with the number of cylinders traversed by the head,SSTF chooses the pending request closest to the current head position.
For our example request queue, the closest request to the initial headposition (53) is at cylinder 65. Once we are at cylinder 65, the next closestrequest is at cylinder 67. From there, the request at cylinder 37 is closer than theone at 98, so 37 is served next. Continuing, we service the request at cylinder 14,then 98, 122, 124, and finally 183 (Figure 11.5). This scheduling method resultsin a total head movement of only 236 cylinders—little more than one-thirdof the distance needed for FCFS scheduling of this request queue. Clearly, thisalgorithm gives a substantial improvement in performance.
SSTF scheduling is essentially a form of shortest-job-first (SJF) scheduling;and like SJF scheduling, it may cause starvation of some requests. Rememberthat requests may arrive at any time. Suppose that we have two requests inthe queue, for cylinders 14 and 186, and while the request from 14 is beingserviced, a new request near 14 arrives. This new request will be servicednext, making the request at 186 wait. While this request is being serviced,another request close to 14 could arrive. In theory, a continual stream of requestsnear one another could cause the request for cylinder 186 to wait indefinitely.
389 / 397
Disk SchedulingShortest Seek Time First (SSTF)
11.4 Disk Scheduling 465
0 14 37 536567 98 122124 183199
queue � 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.5 SSTF disk scheduling.
This scenario becomes increasingly likely as the pending-request queue growslonger.
Although the SSTF algorithm is a substantial improvement over the FCFSalgorithm, it is not optimal. In the example, we can do better by moving thehead from 53 to 37, even though the latter is not closest, and then to 14, beforeturning around to service 65, 67, 98, 122, 124, and 183. This strategy reducesthe total head movement to 208 cylinders.
11.4.3 SCAN Scheduling
In the SCAN algorithm, the disk arm starts at one end of the disk and movestoward the other end, servicing requests as it reaches each cylinder, until it getsto the other end of the disk. At the other end, the direction of head movementis reversed, and servicing continues. The head continuously scans back andforth across the disk. The SCAN algorithm is sometimes called the elevatoralgorithm, since the disk arm behaves just like an elevator in a building, firstservicing all the requests going up and then reversing to service requests theother way.
Let’s return to our example to illustrate. Before applying SCAN to schedulethe requests on cylinders 98, 183, 37, 122, 14, 124, 65, and 67, we need to knowthe direction of head movement in addition to the head’s current position.Assuming that the disk arm is moving toward 0 and that the initial headposition is again 53, the head will next service 37 and then 14. At cylinder 0,the arm will reverse and will move toward the other end of the disk, servicingthe requests at 65, 67, 98, 122, 124, and 183 (Figure 11.6). If a request arrives inthe queue just in front of the head, it will be serviced almost immediately; arequest arriving just behind the head will have to wait until the arm moves tothe end of the disk, reverses direction, and comes back.
Assuming a uniform distribution of requests for cylinders, consider thedensity of requests when the head reaches one end and reverses direction. Atthis point, relatively few requests are immediately in front of the head, sincethese cylinders have recently been serviced. The largest density of requests is
390 / 397
Disk SchedulingSCAN Scheduling466 Chapter 11 Mass-Storage Structure
0 14 37 536567 98 122124 183199
queue � 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.6 SCAN disk scheduling.
at the other end of the disk. These requests have also waited the longest, sowhy not go there first? That is the idea of the next algorithm.
11.4.4 C-SCAN Scheduling
Circular SCAN (C-SCAN) scheduling is a variant of SCAN designed to providea more uniform wait time. Like SCAN, C-SCAN moves the head from one endof the disk to the other, servicing requests along the way. When the headreaches the other end, however, it immediately returns to the beginning ofthe disk without servicing any requests on the return trip (Figure 11.7). TheC-SCAN scheduling algorithm essentially treats the cylinders as a circular listthat wraps around from the final cylinder to the first one.
0 14 37 53 65 67 98 122124 183199
queue = 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.7 C-SCAN disk scheduling.
391 / 397
Disk SchedulingCircular SCAN (C-SCAN) Scheduling
466 Chapter 11 Mass-Storage Structure
0 14 37 536567 98 122124 183199
queue � 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.6 SCAN disk scheduling.
at the other end of the disk. These requests have also waited the longest, sowhy not go there first? That is the idea of the next algorithm.
11.4.4 C-SCAN Scheduling
Circular SCAN (C-SCAN) scheduling is a variant of SCAN designed to providea more uniform wait time. Like SCAN, C-SCAN moves the head from one endof the disk to the other, servicing requests along the way. When the headreaches the other end, however, it immediately returns to the beginning ofthe disk without servicing any requests on the return trip (Figure 11.7). TheC-SCAN scheduling algorithm essentially treats the cylinders as a circular listthat wraps around from the final cylinder to the first one.
0 14 37 53 65 67 98 122124 183199
queue = 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.7 C-SCAN disk scheduling.
392 / 397
Disk SchedulingC-LOOK Scheduling 11.4 Disk Scheduling 467
0 14 37 536567 98 122124 183199
queue = 98, 183, 37, 122, 14, 124, 65, 67head starts at 53
Figure 11.8 C-LOOK disk scheduling.
11.4.5 LOOK Scheduling
As we described them, both SCAN and C-SCAN move the disk arm across thefull width of the disk. In practice, neither algorithm is often implemented thisway. More commonly, the arm goes only as far as the final request in eachdirection. Then, it reverses direction immediately, without going all the way tothe end of the disk. Versions of SCAN and C-SCAN that follow this pattern arecalled LOOK and C-LOOK scheduling, because they look for a request beforecontinuing to move in a given direction (Figure 11.8).
11.4.6 Selection of a Disk-Scheduling Algorithm
Given so many disk-scheduling algorithms, how do we choose the best one?SSTF is common and has a natural appeal because it increases performance overFCFS. SCAN and C-SCAN perform better for systems that place a heavy load onthe disk, because they are less likely to cause a starvation problem. For anyparticular list of requests, we can define an optimal order of retrieval, but thecomputation needed to find an optimal schedule may not justify the savingsover SSTF or SCAN. With any scheduling algorithm, however, performancedepends heavily on the number and types of requests. For instance, supposethat the queue usually has just one outstanding request. Then, all schedulingalgorithms behave the same, because they have only one choice of where tomove the disk head: they all behave like FCFS scheduling.
Requests for disk service can be greatly influenced by the file-allocationmethod. A program reading a contiguously allocated file will generate severalrequests that are close together on the disk, resulting in limited head movement.A linked or indexed file, in contrast, may include blocks that are widelyscattered on the disk, resulting in greater head movement.
The location of directories and index blocks is also important. Since everyfile must be opened to be used, and opening a file requires searching thedirectory structure, the directories will be accessed frequently. Suppose that adirectory entry is on the first cylinder and a file’s data are on the final cylinder. Inthis case, the disk head has to move the entire width of the disk. If the directory
393 / 397
The Best Scheduling Algorithm?
Performance depends heavily on the number andtypes of requestsFile system design can be influential
▶ File-allocation method (contiguous? indexed?)▶ Location of directories and index blocks
394 / 397
RAID Structure
Redundant Arrays of Independent/InexpensiveDisksß Replace SLED with RAIDß Replace the disk controller card with a RAID controller© Better performance and better reliability© No software changes are required to use the RAID
395 / 397
RAID Levels 11.7 RAID Structure 477
(a) RAID 0: non-redundant striping.
(b) RAID 1: mirrored disks.
C C C C
(c) RAID 2: memory-style error-correcting codes.
(d) RAID 3: bit-interleaved parity.
(e) RAID 4: block-interleaved parity.
(f) RAID 5: block-interleaved distributed parity.
P P
P
P
P P P
(g) RAID 6: P � Q redundancy.
PP P
PPPP P
PPP
PP
Figure 11.11 RAID levels.
P indicates error-correcting bits and C indicates a second copy of the data). Inall cases depicted in the figure, four disks’ worth of data are stored, and theextra disks are used to store redundant information for failure recovery.
• RAID level 0. RAID level 0 refers to disk arrays with striping at the level ofblocks but without any redundancy (such as mirroring or parity bits), asshown in Figure 11.11(a).
• RAID level 1. RAID level 1 refers to disk mirroring. Figure 11.11(b) showsa mirrored organization.
• RAID level 2. RAID level 2 is also known as memory-style error-correcting-code (ECC) organization. Memory systems have long detected certainerrors by using parity bits. Each byte in a memory system may have aparity bit associated with it that records whether the number of bits in thebyte set to 1 is even (parity = 0) or odd (parity = 1). If one of the bits in the
396 / 397
References
Wikipedia. Hard disk drive — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.Wikipedia. Input/output — Wikipedia, The FreeEncyclopedia. [Online; accessed 21-February-2015].2015.
397 / 397