distributed operating systems - introduction

Distributed Operating Systems - Introduction

Prof. Nalini Venkatasubramanian(also slides borrowed from Prof. Petru Eles)

Process/Thread Management Scheduling Communication Synchronization

Memory ManagementStorage ManagementFileSystems ManagementProtection and SecurityNetworking

What does an OS do?

Distributed Operating System

Manages a collection of independent computers and makes them appear to the users of the system as if it were a single computer.

Hardware Architectures

Multiprocessors Tightly coupled Shared memory

Cache

CPU

Cache

CPUMemory

Parallel Architecture

Hardware Architectures

Multicomputers Loosely coupled Private memory Autonomous

Memory

CPU

Memory

CPU

Memory

CPU

Distributed Architecture

Workstation Model: Issues

How to find an idle workstation?

How is a process transferred from one workstation to another?

What happens to a remote process if a user logs onto a workstation that was idle, but is no longer idle now?

Other models - processor pool, workstation server...

ws1

ws1ws1

ws1

ws1

Communication Network

Distributed Operating System (DOS)

Distributed Computing Systems commonly use two types of Operating Systems. Network Operating Systems Distributed Operating System

Differences between the two types System Image Autonomy Fault Tolerance Capability

Operating System Types

Multiprocessor OSLooks like a virtual uniprocessor, contains only one

copy of the OS, communicates via shared memory, single run queue

Network OS Does not look like a virtual uniprocessor, contains n

copies of the OS, communicates via shared files, n run queues

Distributed OS Looks like a virtual uniprocessor (more or less),

contains n copies of the OS, communicates via messages, n run queues

Design Issues

TransparencyPerformanceScalabilityReliabilityFlexibility (Micro-kernel architecture)

IPC mechanisms, memory management, Process management/scheduling, low level I/O

Heterogeneity Security

Transparency

Location transparency processes, cpu’s and other devices, files

Replication transparency (of files)Concurrency transparency

(user unaware of the existence of others)Parallelism

User writes serial program, compiler and OS do the rest

Performance

Throughput - response timeLoad Balancing (static, dynamic)Communication is slow compared to

computation speed fine grain, coarse grain parallelism

Design Elements

Process Management Task Partitioning, allocation, load balancing,

migration

Communication Two basic IPC paradigms used in DOS

Message Passing (RPC) and Shared Memory

synchronous, asynchronous

FileSystems Naming of files/directories File sharing semantics Caching/update/replication

Remote Procedure Call

A convenient way to construct a client-server connection without explicitly writing send/ receive type programs (helps maintain transparency).

Remote Procedure Calls (RPC)

General message passing model. Provides programmers with a familiar mechanism for building distributed applications/systems

Familiar semantics (similar to LPC) Simple syntax, well defined interface, ease of use,

generality and IPC between processes on same/different machines.

It is generally synchronous Can be made asynchronous by using multi-

threading

A typical model for RPC

Caller Process Server

Process

Call procedure and wait for reply Request Message

(contains Remote Procedure’s parameters

Receive request and startProcedure execution

Procedure Executes

Send reply and wait For next message

Resume Execution

Reply Message( contains result of procedure execution)

RPC continued…

Transparency of RPC Syntactic Transparency Semantic Transparency

Unfortunately achieving exactly the same semantics for RPCs and LPCs is close to impossible

Disjoint address spaces More vulnerable to failure Consume more time (mostly due to communication delays)

Implementing RPC Mechanism

Uses the concept of stubs; A perfectly normal LPC abstraction by concealing from programs the interface to the underlying RPC

Involves the following elements The client The client stub The RPC runtime The server stub The server

Remote Procedure Call (cont.)

Client procedure calls the client stub in a normal way Client stub builds a message and traps to the kernel Kernel sends the message to remote kernel Remote kernel gives the message to server stub Server stub unpacks parameters and calls the server Server computes results and returns it to server

stub Server stub packs results in a message and traps to

kernel Remote kernel sends message to client kernel Client kernel gives message to client stub Client stub unpacks results and returns to client

RPC servers and protocols…

RPC Messages (call and reply messages) Server Implementation

Stateful servers Stateless servers

Communication Protocols Request(R)Protocol Request/Reply(RR) Protocol Request/Reply/Ack(RRA) Protocol

RPC NG: DCOM & CORBA

Object models allow services and functionality to be called from distinct processes

DCOM/COM+(Win2000) and CORBA IIOP extend this to allow calling services and objects on different machines

More OS features (authentication,resource management,process creation,…) are being moved to distributed objects.

Distributed Shared Memory (DSM)

Two basic IPC paradigms used in DOS Message Passing (RPC) Shared Memory

Use of shared memory for IPC is natural for tightly coupled systems

DSM is a middleware solution, which provides a shared-memory abstraction in the loosely coupled distributed-memory processors.

General Architecture of DSM

CPU1

CPU n

MemoryMemory

MMU

Memory

MMU

CPU1

CPU n…

Memory

MMU

CPU1

CPU n

Communication Network

Distributed Shared Memory(exists only virtually)

Node 1 Node n

Issues in designing DSM

Granularity of the block size Synchronization Memory Coherence (Consistency models) Data Location and Access Replacement Strategies Thrashing Heterogeneity

Synchronization

Inevitable in Distributed Systems where distinct processes are running concurrently and sharing resources.

Synchronization related issues Clock synchronization/Event Ordering (recall happened

before relation) Mutual exclusion Deadlocks Election Algorithms

Distributed Mutual Exclusion

Mutual exclusion ensures that concurrent processes have serialized

access to shared resources - the critical section problem

Shared variables (semaphores) cannot be used in a distributed system

• Mutual exclusion must be based on message passing, in the context of unpredictable delays and incomplete knowledge

In some applications (e.g. transaction processing) the resource is managed by a server which implements its own lock along with mechanisms to synchronize access to the resource.

Non-Token Based Mutual Exclusion Techniques

Central Coordinator Algorithm

Ricart-Agrawala Algorithm

Ricart-Agrawala Algorithm

In a distributed environment it seems more natural to implement mutual exclusion, based upon distributed agreement - not on a central coordinator. Shared variables (semaphores) cannot be used in a distributed system Mutual exclusion must be based on message passing, in the context of

unpredictable delays and incomplete knowledge In some applications (e.g. transaction processing) the resource is managed by a

server which implements its own lock along with mechanisms to synchronize access to the resource.

It is assumed that all processes keep a (Lamport’s) logical clock which is updated according to the clock rules. The algorithm requires a total ordering of requests. Requests are ordered

according to their global logical timestamps; if timestamps are equal, process identifiers are compared to order them.

The process that requires entry to a CS multicasts the request message to all other processes competing for the same resource. Process is allowed to enter the CS when all processes have replied to this

message. The request message consists of the requesting process’ timestamp (logical

clock) and its identifier. Each process keeps its state with respect to the CS: released, requested, or

held.

• Ricart-Agrawala Second Algorithm

• Token Ring Algorithm

Token-Based Mutual Exclusion

Ricart-Agrawala Second Algorithm A process is allowed to enter the critical section when it gets the

token. Initially the token is assigned arbitrarily to one of the processes.

In order to get the token it sends a request to all other processes competing for the same resource. The request message consists of the requesting process’ timestamp

(logical clock) and its identifier.

When a process Pi leaves a critical section it passes the token to one of the processes which are waiting for it; this

will be the first process Pj, where j is searched in order [ i+1, i+2, ..., n, 1, 2, ..., i-2, i-1] for which there is a pending request.

If no process is waiting, Pi retains the token (and is allowed to enter the CS if it needs); it will pass over the token as result of an incoming request.

How does Pi find out if there is a pending request? Each process Pi records the timestamp corresponding to the last

request it got from process Pj, in requestPi[ j]. In the token itself, token[ j] records the timestamp (logical clock) of Pj’s last holding of the token. If requestPi[ j] > token[ j] then Pj has a pending request.

Election Algorithms

Many distributed algorithms require one process to act as a coordinator or, in general, perform some special role.

Examples with mutual exclusion Central coordinator algorithm

At initialization or whenever the coordinator crashes, a new coordinator has to be elected.

Token ring algorithmWhen the process holding the token fails, a new

process has to be elected which generates the new token.

Election Algorithms

It doesn’t matter which process is elected. What is important is that one and only one process is chosen (we call

this process the coordinator) and all processes agree on this decision.

Assume that each process has a unique number (identifier). In general, election algorithms attempt to locate the process with the

highest number, among those which currently are up.

Election is typically started after a failure occurs. The detection of a failure (e.g. the crash of the current coordinator) is

normally based on time-out a process that gets no response for a period of time suspects a failure and initiates an election process.

An election process is typically performed in two phases: Select a leader with the highest priority. Inform all processes about the winner.

The Bully Algorithm

A process has to know the identifier of all other processes (it doesn’t know, however, which one is still up); the process with the highest

identifier, among those which are up, is selected. Any process could fail during the election procedure. When a process Pi detects a failure and a coordinator has to be elected

it sends an election message to all the processes with a higher identifier and then waits for an answer message:

If no response arrives within a time limit Pi becomes the coordinator (all processes with higher identifier are down) it broadcasts a coordinator message to all processes to let them know.

If an answer message arrives, Pi knows that another process has to become the coordinator it waits in order to

receive the coordinator message. If this message fails to arrive within a time limit (which means that a potential

coordinator crashed after sending the answer message) Pi resends the election message. When receiving an election message from Pi

a process Pj replies with an answer message to Pi and then starts an election procedure itself( unless it has already started one) it

sends an election message to all processes with higher identifier. Finally all processes get an answer message, except the one which

becomes the coordinator.

The Ring-based Algorithm

We assume that the processes are arranged in a logical ring Each process knows the address of one other process, which is its

neighbor in the clockwise direction. The algorithm elects a single coordinator, which is the process

with the highest identifier. Election is started by a process which has noticed that the

current coordinator has failed. The process places its identifier in an election message that is

passed to the following process. When a process receives an election message

It compares the identifier in the message with its own. If the arrived identifier is greater, it forwards the received election

message to its neighbor If the arrived identifier is smaller it substitutes its own identifier in the

election message before forwarding it. If the received identifier is that of the receiver itself this will be the

coordinator. The new coordinator sends an elected message through the

ring.

The Ring-based Algorithm- An Optimization

Several elections can be active at the same time. Messages generated by later elections should be killed as soon as

possible. Processes can be in one of two states

Participant or Non-participant. Initially, a process is non-participant.

The process initiating an election marks itself participant. Rules

For a participant process, if the identifier in the election message is smaller than the own, does not forward any message (it has already forwarded it, or a larger one, as part of another simultaneously ongoing election).

When forwarding an election message, a process marks itself participant.

When sending (forwarding) an elected message, a process marks itself non-participant.

Summary (Distributed Mutual Exclusion) In a distributed environment no shared variables (semaphores) and local

kernels can be used to enforce mutual exclusion. Mutual exclusion has to be based only on message passing.

There are two basic approaches to mutual exclusion: non-token-based and token-based.

The central coordinator algorithm is based on the availability of a coordinator process which handles all the requests and provides exclusive access to the resource. The coordinator is a performance bottleneck and a critical point of failure. However, the number of messages exchanged per use of a CS is small.

The Ricart-Agrawala algorithm is based on fully distributed agreement for mutual exclusion. A request is multicast to all processes competing for a resource and access is provided when all processes have replied to the request. The algorithm is expensive in terms of message traffic, and failure of any process prevents progress.

Ricart-Agrawala’s second algorithm is token-based. Requests are sent to all processes competing for a resource but a reply is expected only from the process holding the token. The complexity in terms of message traffic is reduced compared to the first algorithm. Failure of a process (except the one holding the token) does not prevent progress.

Summary (Distributed Mutual Exclusion) The token-ring algorithm very simply solves mutual exclusion. It is

requested that processes are logically arranged in a ring. The token is permanently passed from one process to the other and the process currently holding the token has exclusive right to the resource. The algorithm is efficient in heavily loaded situations.

For many distributed applications it is needed that one process acts as a coordinator. An election algorithm has to choose one and only one process from a group, to become the coordinator. All group members have to agree on the decision.

The bully algorithm requires the processes to know the identifier of all other processes; the process with the highest identifier, among those which are up, is selected. Processes are allowed to fail during the election procedure.

The ring-based algorithm requires processes to be arranged in a logical ring. The process with the highest identifier is selected. On average, the ring based algorithm is more efficient then the bully algorithm.

Deadlocks

Mutual exclusion, hold-and-wait, No-preemption and circular wait.

Deadlocks can be modeled using resource allocation graphs

Handling Deadlocks Avoidance (requires advance knowledge of processes

and their resource requirements) Prevention (collective/ordered requests, preemption) Detection and recovery (local/global WFGs,

local/centralized deadlock detectors; Recovery by operator intervention, termination and rollback)

Resource Management Policies

Load Estimation Policy How to estimate the workload of a node

Process Transfer Policy Whether to execute a process locally or remotely

Location Policy Which node to run the remote process on

Priority Assignment Policy Which processes have more priority (local or remote)

Migration Limiting policy Number of times a process can migrate

Process Management

Process migration Freeze the process on the source node and restart it at

the destination node Transfer of the process address space Forwarding messages meant for the migrant process Handling communication between cooperating

processes separated as a result of migration Handling child processes

Process migration in heterogeneous systems

Process Migration

Load Balancing Static load balancing - CPU is determined at

process creation. Dynamic load balancing - processes

dynamically migrate to other computers to balance the CPU (or memory) load.

Migration architecture One image system Point of entrance dependent system (the

deputy concept)

A Mosix Cluster

Mosix (from Hebrew U): Kernel level enhancement to Linux that provides dynamic load balancing in a network of workstations.

Dozens of PC computers connected by local area network (Fast-Ethernet or Myrinet).

Any process can migrate anywhere anytime.

An Architecture for Migration

Architecture that fits one system image.Needs location transparent file system.

(Mosix previous versions)

Architecture for Migration (cont.)

Architecture that fits entrance dependant systems.Easier to implement based on current Unix.

(Mosix current versions)

Mosix: File Access

Each file access must go back to deputy…

= = Very Slow for I/O apps.

Solution: Allow processes to access a distributed file system through the current kernel.

Mosix: File Access

DFSA Requirements (cache coherent, monotonic timestamps, files

not deleted until all nodes finished) Bring the process to the files.

MFS

Single cache (on server)

/mfs/1405/var/tmp/myfiles

Other Considerations for Migration

Not only CPU load!!!

Memory.

I/O - where is the physical device?

Communication - which processes

communicate with which other processes?

Resource Management of DOS

A new online job assignment policy based on economic principles, competitive analysis.

Guarantees near-optimal global lower-bound performance.

Converts usage of heterogeneous resources (CPU, memory, IO) into a single, homogeneous cost using a specific cost function.

Assigns/migrates a job to the machine on which it incurs the lowest cost.

Distributed File Systems (DFS)

DFS is a distributed implementation of the classical file system model Issues - File and directory naming, semantics of file sharing

Important features of DFS Transparency, Fault Tolerance

Implementation considerations caching, replication, update protocols

The general principle of designing DFS: know the clients have cycles to burn, cache whenever possible, exploit usage properties, minimize system wide change, trust the fewest possible entries and batch if possible.

File and Directory Naming

Machine + path /machine/pathone namespace but not transparent

Mounting remote filesystems onto the local file hierarchy

view of the filesystem may be different at each computer

Full naming transparency A single namespace that looks the same on

all machines

File Sharing Semantics

One-copy semanticsUpdates are written to the single copy and are

available immediately

SerializabilityTransaction semantics (file locking protocols

implemented - share for read, exclusive for write).

Session semanticsCopy file on open, work on local copy and copy

back on close

Example: Sun-NFS

Supports heterogeneous systemsArchitecture

• Server exports one or more directory trees for access by remote clients

• Clients access exported directory trees by mounting them to the client local tree

• Diskless clients mount exported directory to the root directory

Protocols• Mounting protocol• Directory and file access protocol - stateless, no open-

close messages, full access path on read/write

Semantics - no way to lock files

Example: Andrew File System

Supports information sharing on a large scale

Uses a session semantics Entire file is copied to the local machine

(Venus) from the server (Vice) when open. If file is changed, it is copied to server when closed.

Works because in practice, most files are changed by one person

AFS File Validation

Older AFS Versions On open: Venus accesses Vice to see if its

copy of the file is still valid. Causes a substantial delay even if the copy is valid.

Vice is stateless

Newer AFS Versions

The Coda File System

Descendant of AFS that is substantially more resilient to server and network failures.

Support for “mobile” users. Directories are replicated in several servers

(Vice) When the Venus is disconnected, it uses local

versions of files. When Venus reconnects, it reintegrates using optimistic update scheme.

Naming and Security

Naming Important for achieving location transparency Facilitates Object Sharing Mapping is performed using directories. Therefore name

service is also known as Directory Service

Security Client-Server model makes security difficult Cryptography is the solution

distributed operating systems - introduction

Documents

queuesdistributed os

idle workstation

single run queuenetwork

remote process

operating systemmanages

types of operating systems

virtual uniprocessor

n copies