CS9222 Advanced Operating Systems
TRANSCRIPT
Unit - V
Dr. A. Kathirvel
Professor & Head/IT - VCEW
Structures – Design Issues – Threads – Process Synchronization – Processor Scheduling – Memory Management – Reliability / Fault Tolerance; Database Operating Systems – Introduction – Concurrency Control – Distributed Database Systems – Concurrency Control Algorithms.
Motivation for Multiprocessors
Enhanced Performance -
Concurrent execution of tasks for increased throughput (between processes)
Exploit Concurrency in Tasks (Parallelism within process)
Fault Tolerance -
graceful degradation in face of failures
Basic MP Architectures
Single Instruction Single Data (SISD) - conventional uniprocessor designs.
Single Instruction Multiple Data (SIMD) - Vector and Array Processors
Multiple Instruction Single Data (MISD) - Not Implemented.
Multiple Instruction Multiple Data (MIMD) - conventional MP designs
MIMD Classifications
Tightly Coupled System - all processors share the same global memory and have the same address spaces (Typical SMP system).
Main memory for IPC and Synchronization.
Loosely Coupled System - memory is partitioned and attached to each processor. Hypercube, Clusters (Multi-Computer).
Message passing for IPC and synchronization.
MP Block Diagram
[Diagram: four CPU modules, each with its own cache and MMU, connected through an interconnection network to multiple main-memory modules (MM).]
Memory Access Schemes
• Uniform Memory Access (UMA)
– Centrally located
– All processors are equidistant (access times)
• Non-Uniform Memory Access (NUMA)
– physically partitioned but accessible by all
– processors have the same address space
• NO Remote Memory Access (NORMA)
– physically partitioned, not accessible by all
– processors have own address space
Other Details of MP
Interconnection technology
Bus
Cross-Bar switch
Multistage Interconnect Network
Caching - Cache Coherence Problem!
Write-update
Write-invalidate
bus snooping
MP OS Structure - 1
Separate Supervisor -
all processors have their own copy of the kernel.
Some share data for interaction
dedicated I/O devices and file systems
good fault tolerance
bad for concurrency
• Master/Slave Configuration
– master monitors the status and assigns work to other processors (slaves)
– Slaves are a schedulable pool of resources for the master
– master can be bottleneck
– poor fault tolerance
MP OS Structure - 2
Symmetric Configuration - Most Flexible.
all processors are autonomous, treated equal
one copy of the kernel executed concurrently across all processors
Synchronize access to shared data structures:
Lock entire OS - Floating Master
Mitigated by dividing OS into segments that normally have little interaction
multithread kernel and control access to resources (continuum)
MP OS Structure - 3
MP Overview
MultiProcessor
SIMD MIMD
Shared Memory
(tightly coupled) Distributed Memory
(loosely coupled)
Master/Slave Symmetric
(SMP)
Clusters
SMP OS Design Issues
Threads - effectiveness of parallelism depends on performance of primitives used to express and control concurrency.
Process Synchronization - disabling interrupts is not sufficient.
Process Scheduling - efficient, policy-controlled task scheduling (processes/threads); global versus per-CPU scheduling
Task affinity for a particular CPU
resource accounting and intra-task thread dependencies
Memory Management - complicated since main memory is shared by possibly many processors. Each processor must maintain its own map tables for each process
cache coherence
memory access synchronization
balancing overhead with increased concurrency
Reliability and fault Tolerance - degrade gracefully in the event of failures
SMP OS design issues - 2
Typical SMP System
[Diagram: four CPU/cache/MMU modules on a 500MHz system/memory bus, with main memory (about 50ns access) and a bridge to the I/O subsystem; typical I/O bus: 33MHz/32-bit (132MB/s) or 66MHz/64-bit (528MB/s), carrying ether, scsi and video devices plus system functions (timer, BIOS, reset) and interrupts.
Issues: memory contention, limited bus bandwidth, I/O contention, cache coherence.]
Some Definitions
Parallelism: degree to which a multiprocessor application achieves parallel execution
Concurrency: Maximum parallelism an application can achieve with unlimited processors
System Concurrency: kernel recognizes multiple threads of control in a program
User Concurrency: User space threads (coroutines) provide a natural programming model for concurrent applications. Concurrency not supported by system.
Process and Threads
Process: encompasses
set of threads (computational entities)
collection of resources
Thread: Dynamic object representing an execution path and computational state.
threads have their own computational state: PC, stack, user registers and private data
Remaining resources are shared amongst threads in a process
Threads
Effectiveness of parallel computing depends on the performance of the primitives used to express and control parallelism
Threads separate the notion of execution from the Process abstraction
Useful for expressing the intrinsic concurrency of a program regardless of resulting performance
Three types: User threads, kernel threads and Light Weight Processes (LWP)
User Level Threads
User level threads - supported by user level (thread) library
Benefits:
no modifications required to kernel
flexible and low cost
Drawbacks:
cannot block without blocking the entire process
no parallelism (not recognized by kernel)
Kernel Level Threads
Kernel level threads - kernel directly supports multiple threads of control in a process. The thread is the basic scheduling entity.
Benefits:
coordination between scheduling and synchronization
less overhead than a process
suitable for parallel application
Drawbacks:
more expensive than user-level threads
generality leads to greater overhead
Light Weight Processes (LWP)
Kernel supported user thread
Each LWP is bound to one kernel thread.
a kernel thread may not be bound to an LWP
LWP is scheduled by kernel
User threads scheduled by library onto LWPs
Multiple LWPs per process
Thread operations in user space:
create, destroy, synch, context switch
kernel threads implement a virtual processor
Coarse-grain scheduling in the kernel - preemptive scheduling
Communication between kernel and threads library
shared data structures.
Software interrupts (user upcalls or signals). Example, for scheduling decisions and preemption warnings.
Kernel scheduler interface - allows dissimilar thread packages to coordinate.
First Class threads (Psyche OS)
Scheduler Activations
An activation:
serves as execution context for running thread
notifies thread of kernel events (upcall)
space for kernel to save processor context of current user thread when stopped by kernel
kernel is responsible for processor allocation => preemption by kernel.
Thread package responsible for scheduling threads on available processors (activations)
Support for Threading
• BSD: process model only; 4.4BSD enhancements.
• Solaris: provides user threads, kernel threads and LWPs.
• Mach: supports kernel threads and tasks; thread libraries provide the semantics of user threads, LWPs and kernel threads.
• Digital UNIX: extends Mach to provide the usual UNIX semantics; Pthreads library.
Process Synchronization: Motivation
Sequential execution runs correctly but concurrent execution (of the same program) runs incorrectly.
Concurrent access to shared data may result in data inconsistency
Maintaining data consistency requires mechanisms to ensure the orderly execution of cooperating processes
Let’s look at an example: consumer-producer problem.
Producer-Consumer Problem
Producer:
    while (true) {
        /* produce an item and put in nextProduced */
        while (count == BUFFER_SIZE)
            ;   // do nothing
        buffer[in] = nextProduced;
        in = (in + 1) % BUFFER_SIZE;
        count++;
    }
count: the number of items in the buffer (initialized to 0)
Consumer:
    while (true) {
        while (count == 0)
            ;   // do nothing
        nextConsumed = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        count--;
        // consume the item in nextConsumed
    }
What can go wrong in concurrent execution?
Race Condition
count++ could be implemented as
    register1 = count
    register1 = register1 + 1
    count = register1
count-- could be implemented as
    register2 = count
    register2 = register2 - 1
    count = register2
Consider this execution interleaving with "count = 5" initially:
    S0: producer executes register1 = count          {register1 = 5}
    S1: producer executes register1 = register1 + 1  {register1 = 6}
    S2: consumer executes register2 = count          {register2 = 5}
    S3: consumer executes register2 = register2 - 1  {register2 = 4}
    S4: producer executes count = register1          {count = 6}
    S5: consumer executes count = register2          {count = 4}
What are all possible values from concurrent execution?
How to prevent race conditions? Define a critical section in each process around reads and writes of the common variables, and make sure that only one process can execute in its critical section at a time.
What synchronization code goes into the entry and exit sections to prevent race conditions?
    do {
        entry section
        critical section
        exit section
        remainder section
    } while (TRUE);
Solution to Critical-Section Problem
1. Mutual Exclusion - If process Pi is executing in its critical section, then no other processes can be executing in their critical sections
2. Progress - If no process is executing in its critical section and there exist some processes that wish to enter their critical section, then the selection of the processes that will enter the critical section next cannot be postponed indefinitely
3. Bounded Waiting - A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted
What is the difference between
Progress and Bounded Waiting?
Peterson’s Solution
Simple 2-process solution
Assume that the LOAD and STORE instructions are atomic; that is, cannot be interrupted.
The two processes share two variables:
int turn;
Boolean flag[2]
The variable turn indicates whose turn it is to enter the critical section.
The flag array is used to indicate if a process is ready to enter the critical section. flag[i] = true implies that process Pi is ready!
Algorithm for Process Pi:
    while (true) {
        flag[i] = TRUE;             // entry section
        turn = j;
        while (flag[j] && turn == j)
            ;
        CRITICAL SECTION
        flag[i] = FALSE;            // exit section
        REMAINDER SECTION
    }
Mutual exclusion
Only one process enters critical section at a time.
Proof: can both processes pass the while loop (and enter critical section) at the same time?
Progress
Selection of a process waiting to enter the critical section is not postponed indefinitely.
Proof: can Pi wait at the while loop forever (after Pj leaves critical section)?
Bounded Waiting
Limited time in waiting for other processes.
Proof: can Pj win the critical section twice while Pi waits?
Algorithm for Process Pi:
    while (true) {
        flag[i] = TRUE;             // entry section
        turn = j;
        while (flag[j] && turn == j)
            ;
        CRITICAL SECTION
        flag[i] = FALSE;            // exit section
        REMAINDER SECTION
    }
Algorithm for Process Pj:
    while (true) {
        flag[j] = TRUE;             // entry section
        turn = i;
        while (flag[i] && turn == i)
            ;
        CRITICAL SECTION
        flag[j] = FALSE;            // exit section
        REMAINDER SECTION
    }
Synchronization Hardware
Many systems provide hardware support for critical section code
Uniprocessors – could disable interrupts
Currently running code would execute without preemption
Generally too inefficient on multiprocessor systems
Operating systems using this not broadly scalable
Modern machines provide special atomic hardware instructions
Atomic = non-interruptable
TestAndSet(target): atomically test a memory word and set its value
Swap(a, b): atomically swap the contents of two memory words
TestAndSet Instruction
• Definition:
boolean TestAndSet (boolean *target) {
    boolean rv = *target;
    *target = TRUE;
    return rv;
}
Solution using TestAndSet
Shared boolean variable lock, initialized to false.
Solution:
    while (true) {
        while (TestAndSet(&lock))
            ;                      /* do nothing: entry section */
        // critical section
        lock = FALSE;              // exit section
        // remainder section
    }
Does it satisfy mutual exclusion?
How about progress and bounded waiting?
How to fix this?
Bounded-Waiting TestAndSet
• Shared variables:
    boolean waiting[n];   // initialized to FALSE
    boolean lock;         // initialized to FALSE
• Solution:
    do {
        waiting[i] = TRUE;
        while (waiting[i] && TestAndSet(&lock))
            ;                         // entry section
        waiting[i] = FALSE;
        // critical section
        j = (i + 1) % n;
        while ((j != i) && !waiting[j])
            j = (j + 1) % n;
        if (j == i)
            lock = FALSE;             // no one waiting: release the lock
        else
            waiting[j] = FALSE;       // pass the critical section to Pj
        // remainder section
    } while (TRUE);
Mutual exclusion
Proof: can two processes pass the while loop (and enter critical section) at the same time?
Bounded Waiting
Limited time in waiting for other processes.
What is waiting[] for? When is waiting[i] set to FALSE?
Proof: how long does Pi wait until waiting[i] becomes FALSE?
Progress
Proof: the exit section either unblocks at least one process's waiting[] entry or sets the lock to FALSE.
Swap Instruction
• Definition:
void Swap (boolean *a, boolean *b) {
    boolean temp = *a;
    *a = *b;
    *b = temp;
}
Solution using Swap
Shared Boolean variable lock, initialized to FALSE. Each process has a local Boolean variable key.
Solution:
while (true) {
key = TRUE;
while ( key == TRUE)
Swap (&lock, &key );
// critical section
lock = FALSE;
// remainder section
}
Mutual exclusion? Progress and Bounded Waiting?
Notice a performance problem with Swap & TestAndSet solutions?
Processor Scheduling
Processor scheduling (PS): ready tasks are assigned to the processors so that performance is maximized.
Because tasks cooperate and communicate through shared variables or message passing, processor scheduling in a multiprocessor system is a difficult problem.
PS is very critical to the performance of multiprocessor systems because a naive scheduler can degrade performance substantially.
Issues in Processor Scheduling
Three major causes of performance degradation are:
Preemption inside spinlock-controlled critical sections
    This situation occurs when a task is preempted inside a critical section while other tasks are spinning on the lock to enter the same critical section.
Cache corruption
    The big chunk of data needed by the previous task must be purged from the cache and new data must be brought in, so a processor switched to another task sees a very high miss ratio.
Context switching overheads
    Execution of a large number of instructions to save and restore the registers, to initialize the registers, to switch address spaces, etc.
Co-Scheduling in the Medusa OS
Co-scheduling was proposed by Ousterhout for the Medusa OS on Cm*.
All runnable tasks of an application are scheduled on the processors simultaneously.
Context switching occurs between applications rather than between tasks of several different applications.
Problem: tasks waste resources in lock-spinning while they wait for a preempted task to release the critical section.
Smart Scheduling
Proposed by Zahorjan et al. - two nice features:
It avoids preempting a task when the task is inside its critical section.
It avoids rescheduling tasks that were busy-waiting at the time of their preemption until the task that is executing the corresponding critical section releases it.
This eliminates the resource waste due to a processor spinning on a lock.
It does nothing, however, to reduce the overhead due to context switching or the performance degradation due to cache corruption.
Scheduling in the NYU Ultracomputer
Proposed by Edler et al.; it combines the strategies of the previous two scheduling techniques.
Tasks can be formed into groups and scheduled in any of the following ways:
A task is scheduled or preempted in the normal manner.
All tasks in a group are scheduled or preempted simultaneously.
Tasks in a group are never preempted.
Memory Management: The Mach Operating System
The virtual memory management of the Mach OS, developed at CMU.
Design Issues:
    Portability
    Data sharing
    Protection
    Efficiency
The Mach Kernel: the basic primitives necessary for building parallel and distributed applications.
The Mach Kernel
[Diagram: user processes run in user space on top of OS emulators (4.3 BSD, System V, HP/UX, others), which form a software emulation layer; beneath them the Mach microkernel runs in kernel space.]
The kernel manages five principal abstractions:
1. Processes.
2. Threads.
3. Memory objects.
4. Ports.
5. Messages.
Process Management in Mach
[Diagram: a Mach process contains threads and an address space, and communicates with the kernel through ports: a process port, a bootstrap port, an exception port, and registered ports.]
The process port is used to communicate with the kernel.
The bootstrap port is used for initialization when a process starts up.
The exception port is used to report exceptions caused by the process. Typical exceptions are division by zero and illegal instruction executed.
The registered ports are normally used to provide a way for the process to communicate with standard system servers.
Process States
A process can be runnable or blocked.
If a process is runnable, those threads that are also runnable can be scheduled and run.
If a process is blocked, its threads may not run, no matter what state they are in.
Process Management Primitives
Create - Create a new process, inheriting certain properties
Terminate - Kill a specified process
Suspend - Increment the suspend counter
Resume - Decrement the suspend counter; if it reaches 0, unblock the process
Priority - Set the priority for current or future threads
Assign - Tell which processor new threads should run on
Info - Return information about execution time, memory usage, etc.
Threads - Return a list of the process' threads
Threads
Mach threads are managed by the kernel. Thread creation and destruction are done by the kernel.
Fork - Create a new thread running the same code as the parent thread
Exit - Terminate the calling thread
Join - Suspend the caller until a specified thread exits
Detach - Announce that the thread will never be joined (waited for)
Yield - Give up the CPU voluntarily
Self - Return the calling thread's identity to it
Scheduling algorithm
When a thread blocks, exits, or uses up its quantum, the CPU it is running on first looks on its local run queue to see if there are any active threads.
If the local run queue's count is nonzero, run the highest-priority thread, starting the search at the queue specified by the hint.
If the local run queue is empty, the same algorithm is applied to the global run queue; the global queue must be locked first.
Scheduling
[Diagram: each processor set has a global run queue of 32 priority levels (0 = high, 31 = low), with a count of queued threads and a hint to the highest occupied priority; e.g. processor set 1's queue is free with count 6 and hint 2, while processor set 2's is busy with count 7 and hint 4.]
Memory Management in Mach
Mach has a powerful, elaborate, and highly flexible memory management system based on paging.
The code of Mach’s memory management is split into three parts. The first part is the pmap module, which runs in the kernel and is concerned with managing the MMU.
The second part, the machine-independent kernel code, is concerned with processing page faults, managing address maps, and replacing pages.
The third part of the memory management code runs as a user process called a memory manager. It handles the logical part of the memory management system, primarily management of the backing store (disk).
Virtual Memory
The conceptual model of memory that Mach user processes see is a large, linear virtual address space. The address space is supported by paging.
A key concept relating to the use of virtual address space is the memory object. A memory object can be a page or a set of pages, but it can also be a file or other, more specialized data structure.
An address space with allocated regions, mapped objects, and unused addresses:
[Diagram: text region, data region, stack region, and a mapped "file xyz" region, separated by unused address ranges.]
System calls for virtual address space manipulation
Allocate Make a region of virtual address space usable
Deallocate Invalidate a region of virtual address space
Map Map a memory object into the virtual address space
Copy Make a copy of a region at another virtual address
Inherit Set the inheritance attribute for a region
Read Read data from another process’ virtual address
space
Write Write data to another process’ virtual address space
Memory Sharing
[Diagram: processes 1, 2 and 3 map the same file into their address spaces, sharing the mapped pages.]
Operation of Copy-on-Write
[Diagram: just after process creation, the prototype's address space and the child's address space (pages 0-7) map the same physical pages; the shared mappings are marked read-only (RO) so that the first write to any page can be trapped, while unshared pages remain read-write (RW).]
Operation of Copy-on-Write
[Diagram: after the child writes to page 7, the kernel copies it to a new physical page (page 8) and maps the copy read-write into the child's address space; the remaining pages are still shared read-only.]
Advantages of Copy-on-write
1. some pages are read-only, so there is no need to copy them.
2. other pages may never be referenced, so they do not have to be copied.
3. still other pages may be writable, but the child may deallocate them rather than using them.
Disadvantages of Copy-on-write
1. the administration is more complicated.
2. requires multiple kernel traps, one for each page that is ultimately written.
3. does not work over a network.
External Memory Managers
Each memory object that is mapped in a process’ address space must have an external memory manager that controls it. Different classes of memory objects are handled by different memory managers.
Three ports are needed to do the job.
The object port is created by the memory manager and will later be used by the kernel to inform the memory manager about page faults and other events relating to the object.
The control port is created by the kernel itself so that the memory manager can respond to these events.
The name port is used as a kind of name to identify the object.
Distributed Shared Memory in Mach
The idea is to have a single, linear, virtual address space that is shared among processes running on computers that do not have any physical shared memory. When a thread references a page that it does not have, it causes a page fault. Eventually, the page is located and shipped to the faulting machine, where it is installed so that the thread can continue executing.
Communication in Mach
The basis of all communication in Mach is a kernel data structure called a port.
When a thread in one process wants to communicate with a thread in another process, the sending thread writes the message to the port and the receiving thread takes it out.
Each port is protected to ensure that only authorized processes can send to it and receive from it.
Ports support unidirectional communication. A port that can be used to send a request from a client to a server cannot also be used to send the reply back from the server to the client. A second port is needed for the reply.
A Mach port
Message queue
Current message count
Maximum messages
Port set this port belongs to
Counts of outstanding capabilities
Capabilities to use for error reporting
Queue of threads blocked on this port
Pointer to the process holding the RECEIVE capability
Index of this port in the receiver’s capability list
Pointer to the kernel object
Miscellaneous items
Message passing via a port
[Diagram: a sending thread performs a send on a port and a receiving thread performs a receive; the port and its message queue are managed by the kernel.]
Capabilities
[Diagram: processes A and B each have a capability list (slots 1-4) maintained in the kernel; A holds a capability with the SEND right on one port and B holds a capability with the RECEIVE right on another (ports X and Y).]
Primitives for Managing Ports
Allocate Create a port and insert its capability in the capability list
Destroy Destroy a port and remove its capability from the list
Deallocate Remove a capability from the capability list
Extract_right Extract the n-th capability from another process
Insert_right Insert a capability in another process’ capability list
Move_member Move a capability into a capability set
Set_qlimit Set the number of messages a port can hold
Sending and Receiving Messages
mach_msg(&hdr, options, send_size, rcv_size, rcv_port, timeout, notify_port);
The first parameter, hdr, is a pointer to the message to be sent or to the place where the incoming message is put, or both.
The second parameter, options, contains a bit specifying that a message is to be sent, and another one specifying that a message is to be received. Another bit enables a timeout, given by the timeout parameter. Other bits in options allow a SEND that cannot complete immediately to return control anyway, with a status report being sent to notify_port later.
The send_size and rcv_size parameters tell how large the outgoing message is and how many bytes are available for storing the incoming message, respectively.
Rcv_port is used for receiving messages. It is the capability name of the port or port set being listened to.
The Mach message format
[Diagram: a Mach message consists of a header - message size, capability index for the destination port, capability index for the reply port, message kind, function code, plus destination-rights, reply-rights and complex/simple bits - followed by a message body of descriptor/data-field pairs (descriptor 1, data field 1, descriptor 2, data field 2, ...) that is not examined by the kernel.]
Complex message field descriptor
[Diagram: a 32-bit descriptor packing the data field size in bits (12 bits), the data field type (8 bits: bit, byte, unstructured word, 8/16/32-bit integer, character, 32 booleans, floating point, string, capability), the number of items in the data field (8 bits), and four flag bits: out-of-line data present or not, short or long form descriptor, and whether the sender keeps or deallocates out-of-line data.]
Reliability/Fault Tolerance: the SEQUOIA System
The Sequoia system - a tightly coupled, fault-tolerant multiprocessor system.
Attains a high level of fault tolerance by performing fault detection in hardware and fault recovery in the OS.
Design Issues:
    Fault detection and isolation
    Fault recovery
    Efficiency
The Sequoia Architecture
Reliability/Fault Tolerance: the SEQUOIA System
Fault detection:
    Error-detecting codes
    Comparison of duplicated operations
    Protocol monitoring
Fault recovery:
    Recovery from processor failures
    Recovery from main memory failures
    Recovery from I/O failures
Database Operating Systems
Database systems have traditionally been implemented as applications on top of a general-purpose OS.
Requirements of a DBOS:
    Transaction management
    Support for complex, persistent data
    Buffer management
Concurrency Control
CC is the process of controlling concurrent access to a database to ensure that the correctness of the database is maintained.
Database systems
Set of shared data objects that can be accessed by users.
Transactions
A transaction consists of a sequence of read, compute, and write steps that refer to the data objects of a database.
Conflicts
Transactions conflict if they access the same data objects.
Transaction processing
A transaction is executed by executing its actions one by one from the beginning to the end.
A concurrency control model of DBS
Three software modules:
Transaction manager (TM)
    Supervises the execution of transactions
Scheduler
    Responsible for enforcing concurrency control
Data manager (DM)
    Manages access to the stored database
Distributed Database System
A distributed database is a database in which storage devices are not all attached to a common processing unit such as the CPU.
It may be stored in multiple computers, located in the same physical location; or may be dispersed over a network of interconnected computers.
Unlike parallel systems, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components.
Model of Distributed Database System
Distributed Database System
Motivations: a DDBS offers several advantages over a centralized database system, such as:
    Sharing
    Higher system availability (reliability)
    Improved performance
    Easy expandability
    Large databases
Transaction Processing Model
Serializability condition in DDBS
Data replication
Complications due to Data replication
Fully Replicated Database Systems:
    1. Enhanced reliability
    2. Improved responsiveness
    3. No directory management
    4. Easier load balancing
Concurrency Control Algorithms
It controls the interleaving of conflicting actions of transactions so that the integrity of a database is maintained, i.e., their net effect is a serial execution.
Basic synchronization primitives
Locks
    A transaction can request, hold, or release the lock on a data object.
    A data object can be locked in two modes: exclusive and shared.
Timestamps
    A unique number is assigned to a transaction or a data object, chosen from a monotonically increasing sequence.
    Commonly generated using Lamport's scheme.
Lock based algorithms
Static locking
Two Phase Locking (2PL)
Problems with 2PL: Price for Higher concurrency
2PL in DDBS
Timestamp Based locking
Conflict Resolution
Wait, Restart, Die, Wound
Non-two-phase locking
Timestamp Based Algorithms
Basic timestamp ordering algorithm
Thomas Write Rule (TWR)
Multiversion timestamp ordering algorithm
Conservative timestamp ordering algorithm
Thank You