Distributed Systems, Unit 1 - Devendra Gautam, A.P. CSE

Characteristics of DS

The term distributed system describes a system that consists of several loosely coupled computers.

What does this mean ??

• No common or shared memory
• No common clock
• Computers communicate via some communication network
• Each computer has its own processor, memory, and operating system

[Figure: four computers, each with its own CPU and memory (MEM), connected by a Communication Network]

Motivation for DS

What are the advantages of building a DS ??

Do the advantages outweigh the associated cost and complexity of the DS ??

What are the issues that must be considered ??

In order to evaluate price/performance tradeoffs, we must decide how to measure (or express) the performance of a DS !!

The advantages of DS over traditional time-sharing systems include:

• Resource Sharing: i.e., utilization of resources from remote systems

• Enhanced Performance: i.e., higher throughput by utilizing concurrent execution

• Improved Reliability/Availability: i.e., intrinsic backup through duplication

• Modular Expandability/Scalability: i.e., addition of new HW/SW without replacing existing elements.

Issues in DS

When designing a DS or the algorithms within it, a number of issues must be addressed:

1. Global State / Knowledge
2. Naming
3. Scalability
4. Compatibility
5. Process Synchronization
6. Resource Management
7. Security
8. Structure

Ideally, the underlying mechanisms that address or resolve these issues should be transparent to the user / programmer.

The users view the DS as a virtual (abstracted) uni-processor, and not as a collection of individual computers.

Global State / Knowledge

In a shared memory system, the state of all processes and resources (i.e., of the entire system) is available to every process.

Based on this global knowledge, decisions and actions among processes can effectively be coordinated.

The lack of global information, i.e., the absence of shared memory and a single globally accessible clock makes coordination of actions in DS difficult.

Example: Would the youngest and oldest student in this class please stand !!

So, what do we need ??

Means to reach consensus among processes in the DS regarding the current time.

Means to coordinate the actions of processes based on events.

Means to allow for the ordering of events.

Resource Management

Resource Management in DS is concerned with making both local and remote resources available to the user.

Resource Management is roughly divided into three areas:

1. Data Migration
2. Computation Migration
3. Distributed Scheduling

Data Migration:

The program, i.e., the location of the computation, must be able to access data.

These data are generally thought of as being contained in a file that is referenced (open, read, write, modify, etc.) by the program.

Hence, one of the issues in data migration is the design of a file system that hides the location of the data w.r.t. the location of the program (Network Transparency).

Resource Management cont’d

Another issue in Data Migration is shared memory access among multiple computers.

Distributed Shared Memory (SHMEM) must guarantee that the consistency of data is maintained.

Think about how SHMEM could be implemented !!

1. How can we achieve network transparency?
2. How can we manage memory across all computers?

Computation Migration:

Here, the computation “moves” to another location, where the data may be available.

Paradigms for computation migration include:

• Remote Procedure Calls (RPC)
• Mobile Agents (MA)
• Process Migration

We shall look at RPC and MA in more detail !

Resource Management cont’d

The fact that jobs (or processes) may be transferred to remote machines requires a distributed scheduling strategy.

Distributed Scheduling must assure that the resources of the entire system are utilized optimally and that specific job requirements are met.

This is of even greater importance in GRID computing environments where resources are generally owned and operated by different organizations.

Issues related to distributed scheduling include:

Load Balancing

Resource Utilization

Quality of Service (for Jobs)

Cost / Performance Tradeoff

Security (maybe)

Networks and Communication

Any DS is built on top of some type of Communication Infrastructure.

Remote hosts are connected to sub-networks via switches. Subnets connect to other subnets or a backbone via routers.

We can distinguish different types of networks / network classes:

1. Wide Area Networks
2. Local Area Networks
3. Metropolitan Area Networks
4. Wireless (not really a network class)
5. Ad-Hoc (not really a network class)

Main Network Types

Local Area Networks (LANs)
• Processors distributed over small geographical areas, e.g. a single building or neighboring buildings
• Replacement of large mainframe computer systems
• High speed, reliability and quality (usually more expensive, high-quality cables are used)

Wide Area Networks (WANs)
• Autonomous processors distributed over a large geographical area, e.g. the U.S.
• Originally an academic project to provide communication between sites
• First WAN: Arpanet
• Examples: Internet, inter-connection of company sites
• Subject to slower, unreliable channels

DS and Networks

Recall from Computer Networks Class that all communication is governed by some communication protocol (e.g. TCP/IP).

A DS may span as little as a small LAN or as much as multiple WANs.

Different Topologies may determine the structure of algorithms used to facilitate a DS.

1. Ring Topology
2. Star Topology
3. Tree Topology
4. Bus Topology
5. Switched subnets

The concept of topology may be extended beyond the physical layer into the logical structure of the DS.

This structure may be exploited in formulating more efficient communication sequences.

In this course, we will mostly disregard the specific structure of the network and make some simplifying assumptions about the reliability of the underlying communication.

Communication

Naming and Name Resolution
• <host name, identifier>
• Host name must be unique within the network.
• Domain Name System (DNS)
  - Name-to-address resolution
  - Hierarchical system of name servers
  - Before: all hosts needed to know all addresses

Routing
• Fixed routing: fixed communication path
• Virtual routing: path fixed for the session
• Dynamic routing: use of packets; can take link load into account

Quality of Service (QoS)
• Loss
• Latency

Communication Protocols

Modes of communication
• Asynchronous
  - Large overhead
• Synchronous
  - Set-up phase
  - Negotiate communication parameters
• Connectionless
  - User Datagram Protocol (UDP)
• Connection-oriented
  - Transmission Control Protocol/Internet Protocol (TCP/IP)

International Organization for Standardization (ISO) OSI layers
• Physical layer: mechanical and electrical details
• Data-link layer: handle frames, error detection and recovery
• Network layer: provide connections, routing
• Transport layer: transfer messages between clients, packet order, etc.
• Session layer: process-to-process communication
• Presentation layer: formats, character conversion, etc.

Communication Primitives

Unless the network architecture provides functions to access the lower layers of the protocol stack, communication primitives are used at the application layer.

Most DS algorithms and protocols assume only 2 primitives:

1. SEND message
2. RECEIVE message

We need to distinguish, however, two modes of primitives:

1. Synchronous
2. Asynchronous

[Figure: protocol stacks showing the application, transport, network, data link, and physical layers]

Communication Primitives

In synchronous communication, a SEND is blocked until a corresponding RECEIVE is executed on the remote machine (or process).

With asynchronous communication, a SEND operation is buffered and the process is not blocked to wait for a corresponding receive.

What are possible implications of these semantics ??
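The two SEND semantics can be illustrated with a small sketch. It assumes two threads stand in for two processes and an in-memory queue stands in for the network; the `Channel` class and its method names are made up for illustration.

```python
import queue
import threading


class Channel:
    """A toy one-way channel; threads stand in for processes (an assumption)."""

    def __init__(self):
        self._buffer = queue.Queue()  # unbounded message buffer

    def send_async(self, msg):
        # Asynchronous SEND: buffer the message and return immediately.
        self._buffer.put((msg, None))

    def send_sync(self, msg, timeout=None):
        # Synchronous SEND: block until the matching RECEIVE has executed
        # (or until the optional timeout expires).
        received = threading.Event()
        self._buffer.put((msg, received))
        return received.wait(timeout)  # True once a receiver took the message

    def receive(self):
        # RECEIVE: block until a message is available, then release a
        # synchronous sender that may be waiting on this message.
        msg, received = self._buffer.get()
        if received is not None:
            received.set()
        return msg
```

With no receiver running, `send_async` returns at once while `send_sync` stays blocked. This hints at the implications: asynchronous sends need buffer space, while synchronous sends can stall the sender (or deadlock two processes that both SEND first).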

RPC

One frequently used communication vehicle in DS is the Remote Procedure Call (RPC).

RPC provides access to functionality that is available on remote hosts.

In the program, the RPC is “identical” to a local procedure call.

The fact that the procedure is actually executed remotely is hidden.

Robustness

Failure Detection
• Hard to distinguish the type of failure in a DS
  - Link failure
  - Site failure
  - Message loss
• Handshaking procedure
  - Sending “I-am-up” and “Are-you-up?” messages
  - Combine with a time-out scheme
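A minimal sketch of the time-out scheme, assuming each site periodically sends an “I-am-up” message and that the caller supplies the current time as a plain number. The class and method names are illustrative only.

```python
class HeartbeatMonitor:
    """Tracks the last "I-am-up" message seen from each site."""

    def __init__(self, timeout):
        self.timeout = timeout   # silence longer than this raises suspicion
        self.last_seen = {}      # site -> time of its latest "I-am-up"

    def record_i_am_up(self, site, now):
        self.last_seen[site] = now

    def suspected(self, now):
        # A silent site is only *suspected*: a link failure, a site failure,
        # and a lost message all look the same from here.
        return sorted(site for site, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

For example, with `timeout=5`, a site last heard from at time 0 is suspected at time 6, but the monitor cannot tell which kind of failure occurred.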

Reconfiguration
• Assume a failure has been discovered
• Link failure: broadcast the information
• Assumed site failure: notify every site in the DS
• Possibly elect a new coordinator site
• What if the site did not really fail?

Recovery
• Notify that the link is back up
• Notify that the site is back up

It’s Time…

Time is an important theoretical construct in understanding how distributed computations unfold!

In order to know at what time a particular event occurred at a particular computer, it is necessary to synchronize its clock with an authoritative external time source!

Distributed algorithms that depend on synchronized time have been developed. These include:

1. Maintaining consistency of distributed data
2. Serialization of transactions
3. Authentication and authorization based on tickets/certificates (Kerberos)

Thanks Herr Einstein!!

The Special Theory of Relativity establishes the consequences that follow from the observation that the speed of light is constant for all observers, regardless of their relative speed.

From this assumption, it has been proven that two events that are judged (perceived) to be simultaneous in one frame of reference are not necessarily simultaneous according to observers in frames of reference that are in relative motion.

For example: an observer on Earth and an observer traveling away from Earth will disagree on the time interval between events.

In the extreme – the order of two events may be reversed for two different observers!

The reversal of events by two different observers cannot occur if the two events are causally dependent.

The physical effect follows the physical cause for all observers.

The elapsed time between the occurrence of events may, however, vary.

The timing of physical events was hence shown to be relative to the observer, thereby discrediting Newton’s notion of absolute physical time !!

CONCLUSION: There is no special physical clock in the universe to which we can appeal to precisely measure intervals of time !

Events, Clocks, and Ticks

A distributed system is generally viewed as a collection P of N processes pi, i = 1, 2, …, N.

Each process pi executes on a single processor and has a state si associated with it.

We can view each process pi as executing a sequence of actions that fall into one of the following categories:

1. Sending a message;
2. Receiving a message;
3. Performing a computation that alters its state si.

We define an event to be the execution of a single action by pi.

The sequence of events within a single process pi can be totally ordered, which is generally denoted by e ->i e’.

We can then define the history of process pi to be the sequence of events that take place within it:

history(p_i) = h_i = <e_i^0, e_i^1, e_i^2, …>

Why do we care about history ??

We will later use the histories of concurrent processes pi and pj to argue that any allowable execution sequence (or schedule) is consistent with executing pi and pj sequentially.

This will lead to the notion of serializability – the property that assures that events (histories) of two or more concurrent processes can be ordered.

Note: In order to globally order events that occur at different hosts in the DS, nodes must agree on when these events happen!

This necessitates the introduction of protocols that allow nodes to obtain the current time.

It further requires that nodes agree on what time it is when events take place.

Clocks, Drift, and Skew

Each node in the DS contains its own physical clock !

Physical clocks are HW devices that count the oscillations of a crystal, typically quartz.

After a specified number of oscillations, the clock increments a register, thereby adding one clock-tick to a counter that represents the passing of time.

The operating system transforms the HW clock into a software-based clock by reading the clock register !

The OS reads the HW-clock value Hi(t), scales it, and adds an offset to produce a SW-clock Ci(t) = α Hi(t) + β.

Ci(t) approximates the physical time t at process pi.

Ci(t) may be implemented by a 64-bit word, representing nanoseconds that have elapsed at time t.

Successive events can be timed if the clock resolution is smaller than the time interval between the two events.


Clock Skew

Computer clocks, like any other clocks, tend not to be in perfect agreement !!

Definition: Clock Drift is the effect that a clock experiences when its crystal is subject to physical variations and oscillates at different rates. One possible cause: temperature variation.

Drift in ordinary quartz crystals is generally limited to 10^-6 sec/sec, or 1 second every 11.6 days. High-precision crystals: 10^-7 – 10^-8 sec/sec.

Definition: Clock Skew is the instantaneous difference between the readings of any two clocks !!

i.e., |Ci(t) – Cj(t)|
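The drift figure and the skew definition can be checked with a few lines of arithmetic. This is a sketch that assumes a clock with a constant drift rate; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_drift(drift_rate, seconds=1.0):
    """Days a clock with this drift rate (sec/sec) needs to gain or lose `seconds`."""
    return seconds / drift_rate / SECONDS_PER_DAY


def clock_reading(t, drift_rate, offset=0.0):
    """Reading at real time t of a clock with a constant drift rate (assumed)."""
    return offset + t * (1.0 + drift_rate)


def skew(reading_i, reading_j):
    """Clock skew |Ci(t) - Cj(t)| between two readings taken at the same instant."""
    return abs(reading_i - reading_j)


# 1 / 10^-6 seconds is about 11.57 days, matching the figure quoted above.
print(round(days_to_drift(1e-6), 2))
```

Two clocks that start in agreement and drift at +10^-6 and -10^-6 sec/sec show a skew of about 0.17 seconds after one day.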

Coordinated Universal Time (UTC)

What is the correct time ??

Since we cannot refer to any universal authority to answer this question, we must rely on the availability of highly reliable physical clocks.

Atomic oscillators have a drift rate of about 10^-13. The output of these clocks is used as the standard for elapsed real time, known as International Atomic Time.

The standard second has been defined as 9,192,631,770 periods of the transition between the two hyperfine levels of the ground state of Caesium-133 (Cs133).

Seconds, days, months, and years are rooted in astronomical time.

UTC is based on atomic time, but leap seconds are inserted or deleted to keep it in step with astronomical time.

Clock Synchronization

So, how does UTC make it into the nodes of the DS ??

UTC signals are transmitted from land-based radio stations and satellites covering many parts of the earth.

• Satellite sources include GPS
• Receivers are available commercially
• Land-based station accuracy = 0.1 – 10 ms
• GPS accuracy is about 1 μs
• UTC is available via phone line
• Accuracy over phone line is several ms
• Sources include: NIST

We need to distinguish:

1. External Synchronization
2. Internal Synchronization

Internal/External Synchronization

External synchronization refers to synchronization of the processes’ clocks Ci with an authoritative external source S.

Let D > 0 be the synchronization bound and S be the source of UTC.

Then |S(t) – Ci(t)| < D for i = 1, 2, …, N and for all real times t.

We say that the clocks Ci are accurate within the bound of D.

Internal synchronization refers to synchronization of the processes’ clocks Ci with each other.

Let D > 0 be the synchronization bound and let Ci and Cj be the clocks at processes pi and pj, respectively.

Then |Ci(t) – Cj(t)| < D for i, j = 1, 2, …, N and for all real times t.

We say that the clocks Ci, Cj agree within the bound of D.

Note that clocks that are internally synchronized are not necessarily externally synchronized, i.e., even though they agree with each other, they may drift collectively from the external source of time.
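The two definitions translate directly into checks on a snapshot of clock readings taken at one real time t. This is a sketch; the function names and the sample values are made up for illustration.

```python
from itertools import combinations


def externally_synchronized(clocks, source, bound):
    """True if |S(t) - Ci(t)| < D for every clock reading Ci(t)."""
    return all(abs(source - c) < bound for c in clocks)


def internally_synchronized(clocks, bound):
    """True if |Ci(t) - Cj(t)| < D for every pair of clock readings."""
    return all(abs(ci - cj) < bound for ci, cj in combinations(clocks, 2))


# Three clocks that agree with each other but have drifted away from S together.
readings = [100.4, 100.5, 100.6]
print(internally_synchronized(readings, bound=0.3))                 # True
print(externally_synchronized(readings, source=100.0, bound=0.3))   # False
```

The converse direction is more forgiving: clocks that are externally accurate within D necessarily agree with each other within 2D.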

[Figure: Clock Synchronization. When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.]

[Figure: Clock Synchronization Algorithms. The relation between clock time and UTC when clocks tick at different rates.]

Synchronization Protocols

The simplest case of clock synchronization involves two processes in a synchronous system.

Here, bounds are known for:
• drift rate of clocks
• maximum transmission delay
• time for each step in the process

pi can then send the time t on its clock Ci in a message m to other processes pj. The receiving process pj sets its clock to t + Ttrans, where Ttrans is the time taken to transmit message m.

Unfortunately, Ttrans is not constant and is subject to variation. In general, Ttrans is not known.

In a synchronous system, we have an upper and a lower bound on the transmission delay Ttrans. Hence, the uncertainty in Ttrans is u = (max – min).

Setting the clock to t + min will result in a clock skew of as much as u. Similarly, if the clock is set to t + max, the skew may be as large as u.

If, however, we set the clock to t + (min + max)/2, the skew is at most u/2.

more Synchronization

Lundelius and Lynch have shown that the optimal bound that can be achieved on clock skew when synchronizing N clocks is u(1 – 1/N).

Most DS found in practice are asynchronous:

• factors leading to message delays are not bounded;
• there is no max on Ttrans

see the Internet !!

Here, Ttrans = min + x, where x ≥ 0 is unknown.

External Synchronization as proposed by Cristian (1989)…..

He suggested the use of time servers, connected to a device that receives signals from a UTC source.

Upon request, server process S supplies the time according to its clock.

A process p requests the time via a message mr and receives the time value t via a message mt. p records the total round-trip time Tround. p can do so with reasonable precision if its rate of clock drift is small.

Cristian’s approach

For example: the round-trip delay in a LAN is on the order of 1 – 10 ms. A clock drift rate of 10^-6 sec/sec will cause a drift of at most 10^-5 ms.

p should set its clock to t + Tround/2, which assumes that the delay is split equally in both directions.

If min is known or can be estimated conservatively, the clock accuracy can be computed as follows:

The earliest time that S could have placed t into mt was min after p sent mr.

The latest point at which this could have been done was min before mt arrived at p.

The time on S’s clock when mt arrives at p hence lies in the range:

[t + min, t + Tround – min]

The accuracy is thus ±(Tround/2 – min).
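The computation that p performs can be sketched as follows. `request_time` is a hypothetical callable standing in for the mr/mt exchange with the server S, and `clock` is p's local clock; neither name comes from Cristian's paper.

```python
import time


def cristian_estimate(server_time, t_round, t_min=0.0):
    """Return (new clock value, accuracy) from the server's t and Tround."""
    estimate = server_time + t_round / 2   # assumes equal delay in each direction
    accuracy = t_round / 2 - t_min         # i.e. +/- (Tround/2 - min)
    return estimate, accuracy


def sync_with_server(request_time, t_min=0.0, clock=time.monotonic):
    """Measure Tround around one request and apply the estimate."""
    sent_at = clock()
    server_time = request_time()           # hypothetical: send mr, wait for mt
    t_round = clock() - sent_at
    return cristian_estimate(server_time, t_round, t_min)
```

With t = 1000.0 s, Tround = 10 ms, and min = 2 ms, p would set its clock to 1000.005 s with an accuracy of ±3 ms.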

[Figure: Cristian's Algorithm. Getting the current time from a time server.]

Discussion

Of course, Cristian’s approach suffers from several disadvantages including:

• Single point of failure: if S fails, no synchronization is possible !
• Faulty or corrupt time servers may reply with spurious time values !
• An imposter may deliberately reply with incorrect times and wreak havoc.

Cristian advocated the use of groups of time servers to avoid some of these problems. However, this would require the coordination of time servers, i.e., internal synchronization among the Si.

Imposters and faulty time servers are beyond the scope of clock synchronization. They are, however, addressed in the context of the Byzantine Generals problem, which deals with the ability to compute correct values in a DS even in the presence of faulty nodes.

The Berkeley Approach

Gusella and Zatti (1989) developed an algorithm for internal synchronization.

In it, one node is chosen as coordinator to act as master. The master periodically contacts the nodes and requests their current time.

Upon receiving their responses, the master estimates their corresponding Ci(t) by observing round-trip delays.

It then averages the values of all nodes (including its own). This averaging cancels out the individual clock drifts.

The master then returns to each node the amount of time by which its individual Ci(t) should be adjusted (i.e., a + or – number).

In order to address the issue of faulty clocks, which could have adverse effects on the average, a fault-tolerant average is computed.

For this, only a subset of nodes with Ci(t) values close to each other is considered.

The Berkeley Algorithm

a) The time daemon asks all the other machines for their clock values
b) The machines answer
c) The time daemon tells everyone how to adjust their clock

The Network Time Protocol

Cristian’s and the Berkeley algorithms are designed for use in small, delineated network (DS) environments.

NTP defines an architecture for time services and a protocol for the distribution of time information across the Internet.

NTP has the following design aims:

• to provide a service that enables clients across the Internet to synchronize accurately.

• to provide a reliable service that can overcome lengthy losses of connectivity.

• to enable clients to resynchronize frequently enough to offset the drift rates.

• to protect against interference with the time service, both malicious and accidental.

This is too much for this course, but you can read more about NTP at http://www.ntp.org; also, check out RFCs 1305 & 2030.

Events and Logical Clocks

Lamport’s 1978 paper: Time, Clocks, and the Ordering of Events in Distributed Systems.

• Theoretical Foundation
• Logical Clocks
• Partial and Total Event Ordering
• Towards distributed mutual exclusion

Theoretical Foundations

Inherent limitations of a distributed system:

• Absence of a global clock:

Suppose a global clock is available to all the processes: two processes can observe a global clock value at different instants due to unpredictable message delays; therefore, they may perceive two different instants in physical time to be a single instant in physical time.

Suppose instead there is a physical clock for each computer: these clocks can drift from the physical time, and the drift rate may vary from clock to clock; therefore, processes may again perceive two different instants in physical time as a single instant.

• Impact: Due to the absence of a global clock, it is difficult to reason about the temporal order of events in a distributed system, e.g. scheduling is more difficult.

Inherent Limitation -- cont...

• Absence of shared memory: an up-to-date state of the entire system is not available to any process.

• A view is coherent if all the observations of different processes (computers) are made at the same physical time.

• A complete view (global state) encompasses the local views (local states) at all the processes (computers) and any messages that are in transit.

• A process in a distributed system can obtain a coherent but partial view of the system or a complete but incoherent view of the system.

Lamport’s Logical Clocks

The execution of processes is characterized by a sequence of events; e.g. execution of an instruction or a procedure, sending or receiving messages.

Lamport proposed a scheme to order events in a distributed system.

Note that due to the absence of perfectly synchronized clocks and global time in distributed systems, the order in which two events occur at two different computers cannot be determined based on the local time at which they occur.

Happened Before Relation

The happened before relation captures the causal dependencies between events, i.e. whether two events are causally related or not.

a->b if a and b are events in the same process and a occurred before b.

a->b if a is the event of sending a message m in a process and b is the event of receipt of the same message m by another process.

If a->b and b->c, then a->c, i.e. the happened before relation is transitive.

That is, past events causally affect future events.

Concurrent Events

Two distinct events a and b are concurrent (a||b) if not (a->b or b->a). In other words, concurrent events do not causally affect each other.

For any two distinct events a and b in a distributed system, either a->b, b->a or a||b.

Logical Clocks

There is a clock Ci at each process Pi in the distributed system.

The clock Ci can be thought of as a function that assigns a number Ci(a) to any event a, called the timestamp of event a, at Pi.

These clocks can be implemented by counters and have no relation to physical time.

Conditions Satisfied by the System of Clocks

For any events a and b: if a->b, then C(a)<C(b).

The happened before relation can now be realized by using the logical clocks if the following two conditions are met:

• [C1] For any two events a and b in a process Pi, if a occurs before b, then Ci(a) < Ci(b).

• [C2] If a is the event of sending a message m in process Pi and b is the event of receiving the same message m at process Pj, then Ci(a) < Cj(b).

Implementation Rules

[IR1] Clock Ci is incremented between any two successive events in process Pi: Ci:=Ci+d (d>0). Note that if a and b are two successive events in Pi and a->b, then Ci(b):=Ci(a)+d. Note: d is usually 1.

[IR2] If event a is the sending of message m by process Pi, then message m is assigned a timestamp tm=Ci(a) (note that the value of Ci(a) is obtained after applying rule IR1). On receiving the same message m by process Pj, Cj is set to a value greater than or equal to its present value and greater than tm. Cj:=max(Cj, tm+d) (d>0).
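A minimal sketch of the two implementation rules. The class and method names are illustrative; treating a receive as an event that first applies IR1 and then IR2 is a choice made here so that both C1 and C2 hold.

```python
class LamportClock:
    """Logical clock Ci of one process Pi."""

    def __init__(self, d=1):
        self.d = d        # increment d > 0, usually 1
        self.time = 0

    def tick(self):
        # IR1: increment the clock between any two successive events.
        self.time += self.d
        return self.time

    def send(self):
        # Sending is an event: apply IR1, then use the value as timestamp tm.
        return self.tick()

    def receive(self, tm):
        # Receiving is an event too: apply IR1, then IR2 (Cj := max(Cj, tm + d)).
        self.time = max(self.time + self.d, tm + self.d)
        return self.time


p1, p2 = LamportClock(), LamportClock()
p1.tick()                 # local event at P1, timestamp 1
tm = p1.send()            # send event at P1, timestamp 2
print(p2.receive(tm))     # receive event at P2, timestamp 3
```

The receive timestamp is always larger than tm (condition C2) and larger than the receiver's previous event (condition C1).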

Total Ordering of Events

Lamport’s happened before relation defines an irreflexive partial order among the events.

The set of all events in a distributed computation can be totally ordered (denoted by =>) using the above system of clocks as follows: if a is any event at process Pi and b is any event at process Pj, then a=>b iff either
• Ci(a)<Cj(b), or
• Ci(a)=Cj(b) and Pi<Pj, where < is any arbitrary total ordering of the processes (e.g. by process id).
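In code, the relation => amounts to sorting events by the pair (timestamp, process id). This is a sketch that assumes events are given as (timestamp, process id, label) tuples, a representation chosen here for illustration.

```python
def total_order(events):
    """Sort events by Lamport timestamp, breaking ties by process id."""
    return sorted(events, key=lambda event: (event[0], event[1]))


events = [(2, 2, "b"), (1, 3, "c"), (2, 1, "a")]
print(total_order(events))   # [(1, 3, 'c'), (2, 1, 'a'), (2, 2, 'b')]
```

The tie-break is arbitrary but must be the same at every site, so that all sites arrive at the same total order.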

Virtual Time

Lamport’s system of logical clocks implements an approximation to global/physical time, which is referred to as virtual time.

Virtual time advances along with the progression of events and is therefore discrete.

If no events occur in the system, virtual time stops, unlike physical time which continuously progresses.

Limitation of Lamport’s Clocks

Note that in Lamport’s system of clocks, if a->b then C(a)<C(b).

However, the reverse is not necessarily true if the events have occurred in different processes: if a and b are events in different processes and C(a)<C(b), then a->b is not necessarily true; events a and b may be causally related or may not be causally related.
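A tiny counter-example makes the limitation concrete. It assumes two processes that never exchange messages: P1 executes event a, and P2 executes x followed by b. The event names and the helper function are illustrative.

```python
def happened_before(a, b, edges):
    """True if b is reachable from a by following direct causal edges."""
    stack, seen = [a], set()
    while stack:
        event = stack.pop()
        for nxt in edges.get(event, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Direct causal edges: only x -> b (same process P2). No messages are sent.
edges = {"x": ["b"]}
# Lamport timestamps with d = 1: a is P1's first event, b is P2's second.
timestamps = {"a": 1, "x": 1, "b": 2}

print(timestamps["a"] < timestamps["b"])     # True
print(happened_before("a", "b", edges))      # False: a and b are concurrent
```

So C(a)<C(b) holds even though a||b; the timestamps alone cannot tell this case apart from a genuine a->b.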

Simple Solution to DME (Distributed Mutual Exclusion)

A site, called the control site, is assigned the task of granting permission for the execution of the critical section (CS).

To request the CS, a site sends a request message to the control site, which queues up the requests and grants them one by one.

Requires 3 messages per CS execution.

Drawbacks:
• single point of failure
• control site may be overloaded and nearby communication links may be congested
• low system throughput.
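The control-site scheme can be sketched in a few lines. The message names REQUEST, GRANT, and RELEASE are assumptions used to account for the 3 messages per CS execution; the class itself is illustrative.

```python
from collections import deque


class ControlSite:
    """Grants the CS to one requesting site at a time, in arrival order."""

    def __init__(self):
        self.queue = deque()   # pending requests
        self.holder = None     # site currently in the CS, if any
        self.messages = 0      # total messages exchanged
        self.grants = []       # order in which sites entered the CS

    def request(self, site):
        self.messages += 1     # REQUEST: site -> control site
        self.queue.append(site)
        self._grant_next()

    def release(self, site):
        if site != self.holder:
            raise ValueError("only the current holder may release the CS")
        self.messages += 1     # RELEASE: site -> control site
        self.holder = None
        self._grant_next()

    def _grant_next(self):
        if self.holder is None and self.queue:
            self.holder = self.queue.popleft()
            self.messages += 1  # GRANT: control site -> site
            self.grants.append(self.holder)
```

Every request passes through this one object, which is exactly why the control site is both a bottleneck and a single point of failure.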

Lamport’s Algorithm

Based on Lamport’s clock synchronization scheme.

For all i, the request set Ri={S1,S2,.......,Sn}

Every site Si keeps a request queue, which contains requests ordered by their timestamp.

Assume that messages are delivered in FIFO order between every pair of sites.

Requesting CS:
• To request the CS, a site Si sends a REQUEST(tsi,i) message to all sites in Ri and places the request on its request queue.
• When a site Sj receives the REQUEST message from Si, it returns a timestamped REPLY message to Si and places the REQUEST in its request queue.

Executing CS: Site Si enters the CS when
• Si has received a message with timestamp larger than (tsi,i) from all other sites, and
• Si’s request is at the top of its own request queue.

Releasing CS:
• Site Si removes its request from the top of its request queue and sends a timestamped RELEASE message to all sites in Ri.
• When a site Sj receives the RELEASE message, it removes Si’s REQUEST from its request queue.
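The steps above can be simulated in a single program. This is a sketch, not a networked implementation: it assumes one global FIFO queue models the channels (which preserves FIFO order per pair of sites), and the `Site` and `Network` classes are made up for illustration.

```python
import heapq
from collections import deque


class Site:
    def __init__(self, sid, network):
        self.sid, self.net = sid, network
        self.clock = 0
        self.queue = []          # heap of (timestamp, site id) requests
        self.my_request = None
        self.last_seen = {}      # sender -> stamp of its latest message

    def _tick(self, tm=0):
        self.clock = max(self.clock, tm) + 1   # Lamport clock update
        return self.clock

    def request_cs(self):
        self.my_request = (self._tick(), self.sid)
        heapq.heappush(self.queue, self.my_request)
        self.net.broadcast(self.sid, ("REQUEST", self.my_request))

    def can_enter(self):
        if self.my_request is None or self.queue[0] != self.my_request:
            return False         # own request is not at the top of the queue
        others = (s for s in self.net.sites if s != self.sid)
        return all(self.last_seen.get(s, (0, 0)) > self.my_request for s in others)

    def release_cs(self):
        heapq.heappop(self.queue)              # own request is at the top
        self.my_request = None
        self.net.broadcast(self.sid, ("RELEASE", (self._tick(), self.sid)))

    def on_message(self, sender, kind, stamp):
        self._tick(stamp[0])
        self.last_seen[sender] = stamp
        if kind == "REQUEST":
            heapq.heappush(self.queue, stamp)
            self.net.send(self.sid, sender, ("REPLY", (self._tick(), self.sid)))
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != sender]
            heapq.heapify(self.queue)


class Network:
    def __init__(self, n):
        self.pending = deque()   # one global FIFO queue of messages in transit
        self.sent = 0
        self.sites = {i: Site(i, self) for i in range(1, n + 1)}

    def send(self, src, dst, msg):
        self.sent += 1
        self.pending.append((src, dst, msg))

    def broadcast(self, src, msg):
        for dst in self.sites:
            if dst != src:
                self.send(src, dst, msg)

    def deliver_all(self):
        while self.pending:
            src, dst, (kind, stamp) = self.pending.popleft()
            self.sites[dst].on_message(src, kind, stamp)
```

A possible run: create `Network(3)`, call `request_cs()` on sites 1 and 2 before any delivery, then `deliver_all()`. Only site 1 can enter; after it calls `release_cs()` and the messages are delivered, site 2 can enter.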

Correctness Proof

By contradiction: suppose two sites Si and Sj are executing the CS concurrently. Then both the conditions for executing the CS must hold at both sites, i.e. both Si and Sj have their own requests at the top of their request queues. WLOG, assume that Si’s request has a smaller timestamp than that of Sj.

Since Sj has received a message from Si with a timestamp larger than that of its own request, and messages are delivered in FIFO order, the request of Si must be present in Sj’s request queue when Sj is executing the CS. This contradicts the assumption that Sj’s own request is at the top of its request queue, because a request with a smaller timestamp is present.

Performance and Optimization

Number of messages required: 3(N-1) messages per CS invocation, i.e. (N-1) REQUEST, (N-1) REPLY, and (N-1) RELEASE messages.

Synchronization delay: T, where T is the average message delay.

Optimization: REPLY messages can be suppressed under certain conditions:
• For example, suppose site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si’s request. In this case site Sj need not send a REPLY message to site Si.
