distributed operating systems

Distributed Operating Systems

By:

Akshay Dabholkar, Mayur Palankar, Amol Pandit

Based on the paper by Andrew S. Tanenbaum and Robbert Van Renesse

Outline

What is a Distributed Operating System?

How is it different?

Why Distributed Operating Systems?

Problems with Distributed Operating Systems

Distributed Operating System Models

Design Issues

Comparison of some Distributed Operating Systems

Conclusion

What is a Distributed Operating System?

A Distributed Operating System is one that runs on multiple autonomous CPUs yet gives its users the illusion of an ordinary centralized operating system running on a virtual uniprocessor.

Distributed Operating Systems provide resource transparency to the user processes.

“If you can tell which computer you are using, you are not using a distributed operating system.” - Tanenbaum

How is it different?

There is a single, system-wide Distributed Operating System, and it resides on different CPUs.

User processes can run on any of the CPUs as allocated by the Distributed Operating System.

Data can reside on any machine that is part of the Distributed System.

Not all multi-machine systems are Distributed Systems.

“It is the software not the hardware that determines whether a system is distributed or not” - Tanenbaum

Distributed OS vs. Network OS

Distributed OS:

User is not aware of the multiple CPUs.

Each machine runs a part of the Distributed Operating System.

The system is fault-tolerant.

Network OS:

User is aware of the existence of multiple CPUs.

Each machine has its own private Operating System.

The system is not fault-tolerant.

Why Distributed Operating Systems?

Price/Performance advantage (Availability of cheap and powerful Microprocessors).

Incremental growth.

Reliability and Availability.

Simplicity of Software (Theoretically).

Provides Transparency.

Creates another level of abstraction (e.g. Process creation).

Problems with Distributed Operating Systems

Communication Protocol Overhead.

Lack of Simplicity.

A high degree of fault tolerance is required.

Lack of global state information (e.g. No global Process Tables).

Atomic Transactions.

Process and Data Migration (e.g. During Load Balancing and Paging respectively).

Distributed Operating System Models

Minicomputer Model

It consists of a few minicomputers, each with multiple users. It is a simple outgrowth of the central time-sharing systems. Each user is locally logged on to one machine and remotely logged on to other machines. (Available CPUs / Logged-in Users) < 1

Workstation Model

Each user has a personal workstation and nearly all work is done on that workstation. Each user is locally logged on to one machine and remotely logged on to other machines. It supports a single, global file system that provides location-independent data access. (Available CPUs / Logged-in Users) ~ 1

Processor Pool Model

When a user needs to perform computation, a processor is allocated from the processor pool to the user task. (Available CPUs / Logged-in Users) > 1

Design Issues

Communication Primitives

Naming and Protection

Resource Management

Fault-Tolerance

Services


Message Passing

Types of Message Passing Primitives:

Blocking versus Non-Blocking Primitives

Buffered versus Unbuffered Primitives

Client-Server Model of Communication:

Client sends request message

Server receives request message
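The difference between blocking and non-blocking primitives on a buffered channel can be sketched as follows. This is a minimal illustration, assuming Python's standard `queue.Queue` stands in for a one-message mailbox; the names `send_nonblocking` and `receive_blocking` are hypothetical and do not come from the paper.

```python
import queue


def send_nonblocking(mailbox: queue.Queue, message) -> bool:
    """Non-blocking send: return at once, reporting whether the buffer took the message."""
    try:
        mailbox.put_nowait(message)
        return True
    except queue.Full:
        return False


def receive_blocking(mailbox: queue.Queue, timeout: float = 1.0):
    """Blocking receive: wait until a message is available (or the timeout expires)."""
    return mailbox.get(timeout=timeout)


# A buffered mailbox that can hold exactly one pending request (assumed size).
mailbox = queue.Queue(maxsize=1)
client_sent = send_nonblocking(mailbox, "request")   # client sends request message
server_got = receive_blocking(mailbox)               # server receives request message
```

With an unbuffered primitive there would be no mailbox at all, so the sender would have to wait for (or lose the message without) a receiver that is already listening.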


Remote Procedure Call (RPC)

The idea is to make the semantics of inter-machine communication as similar as possible to normal procedure calls.

RPC Design Issues:

Parameter Passing: Passing reference parameters over the network is not easy. A unique system-wide pointer for each object is needed to access it remotely.

Parameter Representation: Machines on the network may represent data in incompatible ways. Conversion to and from a standard format is expensive, and wasteful when the sender and receiver already use the same format.

Client-Server Binding: Sometimes it is important to know the details of the servers while handling RPC calls (e.g. systems with multiple file servers). This is difficult to achieve.
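The parameter representation issue can be sketched with client and server stubs that convert arguments to and from a standard network format. This is only an illustration under assumptions: the wire format (big-endian 32-bit integers), the procedure id `1`, and the names `marshal_request`, `serve` and `remote_add` are all made up here, and the network round trip is replaced by a direct function call.

```python
import struct

# Assumed wire format: unsigned 32-bit procedure id, then two signed 32-bit
# integer arguments, all in network (big-endian) byte order.
REQUEST_FORMAT = "!Iii"
REPLY_FORMAT = "!i"

# Hypothetical procedure table on the server; id 1 is an "add" procedure.
PROCEDURES = {1: lambda a, b: a + b}


def marshal_request(proc_id: int, a: int, b: int) -> bytes:
    """Client stub: convert native values to the standard network format."""
    return struct.pack(REQUEST_FORMAT, proc_id, a, b)


def serve(message: bytes) -> bytes:
    """Server stub: unmarshal the request, run the procedure, marshal the reply."""
    proc_id, a, b = struct.unpack(REQUEST_FORMAT, message)
    return struct.pack(REPLY_FORMAT, PROCEDURES[proc_id](a, b))


def remote_add(a: int, b: int) -> int:
    """Looks like a local call to the caller; serve() stands in for the network."""
    reply = serve(marshal_request(1, a, b))
    return struct.unpack(REPLY_FORMAT, reply)[0]
```

Both sides pay the conversion cost on every call, even when they share a native format, which is the waste the slide points out.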

Naming and Protection

An OS supports a large number of objects such as files, directories, segments, mailboxes, processes, services, servers, nodes and I/O devices. Names are required to identify these objects.

Naming as Mapping: Naming is the problem of mapping between two domains.

Name Servers

Name Server Models:

o Centralized Name Server Model: A single server accepts names in one domain and maps them to names in another domain.

o Distributed Name Lookup Model: Partition the system into domains, with each domain having its own name server.

Name servers maintain a table or database of the name-to-object mapping.

Services, processes, etc. need to register with the underlying naming system.
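The two name server models can be sketched as below. This is a simplified illustration, not the design of any particular system: the classes `NameServer` and `DistributedNameLookup`, the `domain/name` path syntax and the example identifiers are assumptions made for the example.

```python
class NameServer:
    """Centralized model: one table mapping names to object identifiers."""

    def __init__(self):
        self._table = {}

    def register(self, name, object_id):
        self._table[name] = object_id

    def lookup(self, name):
        return self._table.get(name)  # None if the name was never registered


class DistributedNameLookup:
    """Distributed model: the name space is partitioned into domains,
    and each domain has its own name server."""

    def __init__(self):
        self._domains = {}

    def add_domain(self, domain):
        self._domains[domain] = NameServer()

    def register(self, path, object_id):
        domain, name = path.split("/", 1)  # assumed "domain/name" syntax
        self._domains[domain].register(name, object_id)

    def lookup(self, path):
        domain, name = path.split("/", 1)
        server = self._domains.get(domain)
        return server.lookup(name) if server is not None else None


names = DistributedNameLookup()
names.add_domain("files")
names.add_domain("printers")
names.register("files/report.txt", 1017)   # a service registers its object
names.register("printers/laser", 42)
```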

Resource Management

Managing resources without having accurate global state information is difficult.

A Distributed OS does not have tables that provide up-to-date status information for all the resources being managed.

Considerations:

Processor Allocation

Scheduling

Load balancing

Distributed Deadlock Detection

Processor Allocation

Processors are organized in a logical hierarchy independent of the physical structure of the network (MICROS).

Each manager keeps track of roughly how many free processors it has under it.

If it has enough free processors for a request, it allocates them; otherwise it forwards the request to its immediate superior.
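The forwarding rule can be sketched as follows. This is a deliberately simplified model, assuming each manager owns its own count of free processors; the class `Manager` and its fields are hypothetical, and the way a real hierarchy such as MICROS splits a request among subordinates is not modelled.

```python
class Manager:
    """One node in the logical hierarchy of processor managers."""

    def __init__(self, name, free, boss=None):
        self.name = name
        self.free = free    # free processors this manager can hand out (assumed exact)
        self.boss = boss    # immediate superior, or None at the top

    def allocate(self, count):
        """Return the name of the manager that satisfied the request, or None."""
        if self.free >= count:
            self.free -= count
            return self.name
        if self.boss is None:
            return None     # nobody higher up to ask
        return self.boss.allocate(count)


root = Manager("root", free=8)
dept = Manager("dept", free=2, boss=root)

small = dept.allocate(1)    # satisfied locally
large = dept.allocate(4)    # forwarded to the superior
huge = dept.allocate(10)    # cannot be satisfied anywhere
```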

Scheduling

With multiple processors, a way is needed to ensure that processes that communicate frequently run simultaneously, so they are scheduled together as a group on different processors.

It is difficult to dynamically determine the inter-process communication (IPC) patterns.

Ousterhout has proposed several algorithms based on the concept of Coscheduling, which takes IPC patterns into account while scheduling to ensure that all members of a group run at the same time.

One idea is to have each processor use a round-robin scheduling algorithm and schedule all processes that communicate with each other on different processors in the same slot, to achieve N-fold parallelism.

The disadvantage of this approach is the high overhead incurred for performing IPC between processes of a group that run on different processors over the network.

To avoid the high cost of IPC over the network, closely related groups of processes should instead be scheduled on the same processor.
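The round-robin idea (all members of a communicating group in the same time slot, each on a different processor) can be sketched as a slot-by-processor matrix. This is a first-fit illustration under assumptions, not Ousterhout's actual algorithms; the function `coschedule` and the group names are made up.

```python
def coschedule(groups, num_processors):
    """Build a list of time slots; each slot is a list with one entry per
    processor (a process name or None). Every group is placed entirely
    inside a single slot, so its members run at the same time."""
    slots = []
    for group, processes in groups.items():
        if len(processes) > num_processors:
            raise ValueError(f"group {group} needs more processors than exist")
        for slot in slots:
            free = [cpu for cpu, proc in enumerate(slot) if proc is None]
            if len(free) >= len(processes):
                break                        # first slot with enough room
        else:
            slot = [None] * num_processors   # no slot fits: open a new one
            slots.append(slot)
            free = list(range(num_processors))
        for cpu, proc in zip(free, processes):
            slot[cpu] = proc
    return slots


schedule = coschedule(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1"]},
    num_processors=4,
)
```

Each processor then cycles through the slots round-robin, so in slot 0 all of group A (and C) run together and in slot 1 all of group B run together.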

Load balancing

Load balancing is required to keep any one processor from becoming heavily loaded.

Techniques:

Graph-theoretic Model:

Requires the CPU and memory requirements of each process, and the average amount of traffic between each pair of processes, to be known in advance.

The system can be represented as a graph with each process as a node and each pair of communicating processes connected by an arc.

The problem of allocating all the processes to k processors reduces to the problem of partitioning the graph into k disjoint subgraphs.

Drawback: This model is only of theoretical importance, as the information it requires is generally not known in advance.
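The graph-partitioning formulation can be sketched for a tiny system by brute force. This is an illustration only: the per-processor `capacity` (a simple process count) stands in for the CPU and memory constraints, the traffic figures are invented, and exhaustive search is used purely because the example is small.

```python
from itertools import product


def cut_traffic(traffic, assignment):
    """Total traffic on arcs whose two processes sit on different processors."""
    return sum(
        weight
        for (p, q), weight in traffic.items()
        if assignment[p] != assignment[q]
    )


def best_assignment(processes, traffic, k, capacity):
    """Try every assignment of processes to k processors and keep the one
    with the least inter-processor traffic that respects the capacity."""
    best = None
    for combo in product(range(k), repeat=len(processes)):
        if any(combo.count(cpu) > capacity for cpu in range(k)):
            continue
        assignment = dict(zip(processes, combo))
        cost = cut_traffic(traffic, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Made-up traffic between pairs of communicating processes.
traffic = {("A", "B"): 10, ("C", "D"): 8, ("B", "C"): 1}
cost, assignment = best_assignment(["A", "B", "C", "D"], traffic, k=2, capacity=2)
```

The search grows as k to the power of the number of processes, which is one more reason the model is of theoretical rather than practical interest.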

Heuristic Load Balancing:

Each processor continuously estimates its own load, processors exchange load information, and this information is used for process creation and migration.
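A heuristic of this kind can be sketched as two small decisions made from the exchanged load figures. The functions `place_new_process` and `pick_migration`, the load numbers and the `threshold` parameter are assumptions for illustration; real systems differ in how load is measured and how stale the exchanged information may be.

```python
def place_new_process(loads):
    """Process creation: start the new process on the least-loaded processor."""
    return min(loads, key=loads.get)


def pick_migration(loads, threshold):
    """Process migration: if the gap between the busiest and the idlest
    processor exceeds the threshold, suggest moving one process across."""
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    if loads[busiest] - loads[idlest] > threshold:
        return busiest, idlest
    return None


# Load estimates as last reported by each processor (made-up numbers).
loads = {"cpu0": 5, "cpu1": 1, "cpu2": 3}
target = place_new_process(loads)
move = pick_migration(loads, threshold=2)
```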

Practical Considerations of load balancing (How to do process migration?).

Fault-Tolerance

A fault-tolerant system is one that can continue functioning, perhaps in a degraded form, even if something goes wrong.

One of the advantages of Distributed Operating Systems is that there are enough resources to achieve fault tolerance.

Two radically different approaches:

Redundancy Techniques

Atomic Transactions

Redundancy Techniques

Redundancy through backup process.

Provides every process with a backup process on a different processor.

All messages sent to a process are also sent to the backup process.

If one process crashes, the other can clone itself to make a new backup and continue.

Redundancy through recorder process.

A special recorder process records all messages sent on the network.

Every process checkpoints itself onto a remote disk periodically.

On a crash the process is started on an idle processor from the most recent checkpoint. The recorder process sends it all the messages the original process received between the checkpoint and the crash.
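The recorder approach can be sketched with a toy process whose whole state is a running total. This is a simplified illustration: the classes `Recorder` and `Counter`, the function `recover`, and the use of a log position as the checkpoint marker are assumptions, and the remote disk is reduced to two saved values.

```python
class Recorder:
    """Records every message sent on the (simulated) network."""

    def __init__(self):
        self.log = []   # entries are (sequence number, destination, message)

    def record(self, destination, message):
        self.log.append((len(self.log), destination, message))

    def messages_since(self, destination, sequence):
        """Messages for one process logged at or after a checkpoint position."""
        return [m for s, d, m in self.log if d == destination and s >= sequence]


class Counter:
    """Toy process: its state is the sum of the messages it has received."""

    def __init__(self, total=0):
        self.total = total

    def receive(self, message):
        self.total += message


def recover(name, checkpoint_total, checkpoint_sequence, recorder):
    """Restart from the checkpoint, then replay what arrived before the crash."""
    process = Counter(checkpoint_total)
    for message in recorder.messages_since(name, checkpoint_sequence):
        process.receive(message)
    return process
```

Replay only reproduces the lost state if the process is deterministic, which this toy counter is by construction.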

Atomic Transactions

The property of either running to completion or doing nothing at all is called an atomic update.

One technique for achieving Atomic Transactions, proposed by Lampson, is to build up a hierarchy of abstractions.

It makes use of abstraction layers such as careful disk, stable storage and stable processors to implement multicomputer atomic transactions.
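The stable storage layer can be sketched roughly as two copies of each block, each carrying a checksum, so that a damaged copy is detected and the other one is used. This is a loose illustration of the idea rather than Lampson's design: the class `StableStorage`, the use of CRC-32 and the in-memory "disk" are all assumptions.

```python
import zlib


class StableStorage:
    """Keeps two checksummed copies of one block of data."""

    def __init__(self):
        self.blocks = [None, None]   # two simulated disk blocks

    @staticmethod
    def _seal(data: bytes):
        return (data, zlib.crc32(data))

    @staticmethod
    def _is_valid(block) -> bool:
        return block is not None and zlib.crc32(block[0]) == block[1]

    def write(self, data: bytes) -> None:
        # Write the copies one after the other, verifying each write, so a
        # crash in the middle leaves at least one copy intact.
        for i in range(len(self.blocks)):
            self.blocks[i] = self._seal(data)
            if not self._is_valid(self.blocks[i]):
                raise IOError("write verification failed")

    def read(self) -> bytes:
        # Return the first copy whose checksum matches its contents.
        for block in self.blocks:
            if self._is_valid(block):
                return block[0]
        raise IOError("both copies are damaged")
```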

How to implement Mutual Exclusion? When two processes on different CPUs try to access shared memory using remote semaphores, the network becomes the bottleneck.

Services

In a Distributed Operating System, it is useful to have user-level server processes provide functions that have traditionally been provided by the operating system, leading to the microkernel approach to operating system design.

Server Structure (Single-threaded or Multi-threaded).

File Service (disk, flat file & directory services).

Print Service.

Process Service (Remote process creation and caching of servers possible).

Terminal Service.

Mail Service.

Time Service.

Boot Service.

Gateway Service.

Comparison of some Distributed Operating Systems

Systems compared: Cambridge, Amoeba, V Kernel, Eden Project

Developed By
Cambridge: Computing Laboratory @ Univ. of Cambridge
Amoeba: Tanenbaum @ Vrije Universiteit, Amsterdam
V Kernel: David Cheriton @ Stanford University
Eden Project: University of Washington, Seattle

Communication Primitives
Cambridge: RPC
Amoeba: RPC
V Kernel: RPC
Eden Project: RPC

Naming and Protection
Cambridge: Single Name Server
Amoeba: Sparse capabilities with encryption
V Kernel: Three-level naming mechanism
Eden Project: Capabilities without protection

Resources
Cambridge: Processor Bank
Amoeba: Processor Pool
V Kernel: Workstation Model
Eden Project: Workstation Model

Fault tolerance
Cambridge: Small server to start up services
Amoeba: Some fault tolerance through boot server
V Kernel: No fault tolerance
Eden Project: Uses Recorder process

File Server
Cambridge: Universal file service and Filing Machine
Amoeba: Several file services
V Kernel: Similar to Unix
Eden Project: No file server. One process for each file

Conclusion

Distributed systems are an interesting and fruitful area of research for the future.

They advocate the use of the microkernel approach to operating system design.

Latest Research:

Plan 9 @ Bell Labs

2K @ UIUC

Inferno @ Vita Nuova

The Sprite OS @ Berkeley

Mach @ CMU

AgentOS @ UCI

WebOS @ Berkeley
