lecture 2 introduction to principles of distributed computing

Sergio Rajsbaum 2006

Lecture 2: Introduction to Principles of Distributed Computing

Sergio Rajsbaum

Math Institute, UNAM, Mexico


Lecture 2

• Part I: Refresh from Lecture I. What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation. Two-phase commit

• Part II: Coordinated attack, consensus


Part I: What is a distributed system

The need for a theoretical foundation. Two-phase commit


Principles of Distributed Computing

• Distributed computing studies systems where components interact and collaborate

• Principles of distributed computing tries to understand the fundamental possibilities and limitations of such systems, with a precise, scientific approach

• Goal: to design efficient and reliable systems, and techniques to design them, analyze them and prove them correct, or to prove impossibility results when no protocol exists


What is distributed computing?

• Any system where several independent computing components interact

• This broad definition encompasses

– VLSI chips, and any modern PC

– tightly-coupled shared memory multiprocessor

– local area cluster of workstations

– internet, WEB, Web services

– wireless networks, sensor networks, ad-hoc networks

– cooperating robots, mobile agents, P2P systems


Computing components

• Referred to as processors or processes in the literature

• Can represent a

– microprocessor

– process in a multiprocessing operating system

– Java thread

– mobile agent, mobile node (e.g. laptop), robot

– computing element in a VLSI chip


Interaction – message passing vs. shared memory

• Processors need to communicate with each other to collaborate, via

• Message passing

– Point-to-point channels, defining an interconnection graph

– All-to-all using an underlying infrastructure (e.g. TCP/IP)

– Broadcast; wireless, satellite

• Shared memory

– Shared objects: read/write, test&set, compare&swap, etc.

– Usually harder to implement, easier to program


A distributed system

[Figure: processors collaborate through a communication medium]


Failures

• Any system that includes many components running over a long period of time must consider the possibility of failures

• of processors and communication media

• of different severity

– from processor crashes or message losses, to

– malicious Byzantine behavior


Many kinds of problems

• Clock synchronization

• Routing

• Broadcasting

• Naming

• P2P, how to share and find resources

• Sharing resources, mutual exclusion

• Increasing fault-tolerance, failure detection

• Security, authentication, cryptography

• Database transactions, atomic commitment

• Backups, reliable storage, file systems

• Applications, airline reservation, banking, electronic commerce, publish/subscribe systems, web search, web caching, …


Multi-layered, complex interactions: An example

• A fault-tolerant broadcast service is useful to build a higher level database transaction module

• Naming and authentication are required

• And may work more efficiently if clocks are tightly synchronized

• And good routing schemes should exist

• If the clock synchronization is attacked, the whole system may be compromised


Chaos

We need a good foundation,

principles of distributed computing


Chaos

• Too many models, problems and orthogonal, interacting issues

• Very hard to get things right, to reproduce operating scenarios

• Sometimes it is easy to adapt a solution to a different model, sometimes a small change in the model makes a problem unsolvable


Distributed computing theory

• Models

– Good models [Schneider Ch.2 in Distributed Systems, Mullender (Ed.)]

– Relation between models: solve a problem only once; solve it in the strongest possible model

• Problems

– Search for paradigms that represent fundamental distributed computing issues

– Relations between problems: hierarchies of solvable and unsolvable problems; reductions

• Solutions

– Design algorithms, verification techniques, programming abstractions

– Impossibility results and lower bounds

• Efficiency measures

– Time, communication, failures, recovery time, bottlenecks, congestion


Distributed Commit

An example of a distributed protocol

Fundamental part of distributed DBMS


Distributed Commit

• A distributed transaction with components at several sites should execute atomically

• Example: A manager of a chain of stores wants to query all the stores, find the inventory of toothbrushes at each, and issue instructions to move toothbrushes from store to store in order to balance the inventory.

• The operation is done by a single global transaction T that has a component Ti at the i-th store and a component T0 at the office where the manager is located.


Sequence of activities performed by T

1. Component T0 is created at the site of the manager

2. T0 sends messages to all the stores instructing them to create components Ti

3. Each Ti executes a query at store i to discover the number of toothbrushes in inventory and reports this number to T0

4. T0 takes these numbers and determines, by some algorithm we shall not discuss, what shipments of toothbrushes are desired. T0 then sends messages such as “store 10 should ship 500 toothbrushes to store 7” to the appropriate stores

5. Stores receiving instructions update their inventory and perform the shipments


Atomicity

• We must make sure this never happens: some of the actions of T get executed, but others do not

• We do assume atomicity of each Ti, through mechanisms such as logging and recovery

• Failures make atomicity of T difficult to achieve

– A site fails or is disconnected from the network

– A bug in the algorithm to redistribute toothbrushes instructs store 10 to ship more than it has


Example of failures

• Suppose T10 replies to T0’s 1st message with its inventory.

• The machine at store 10 then crashes, so the instructions from T0 are never received by T10

• However, T7 sees no problem, and receives the instructions from T0

• Can distributed transaction T ever commit?
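The failure scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inventory numbers and the helper `run_instructions` are hypothetical, and a crash is modeled simply as "the instruction is never applied".

```python
def run_instructions(inventory, instructions, crashed):
    """Deliver T0's instructions; a crashed store never receives its own."""
    result = dict(inventory)
    applied = {}
    for store, delta in instructions.items():
        applied[store] = store not in crashed
        if applied[store]:
            result[store] += delta
    return result, applied


# Hypothetical inventories reported to T0 before the crash.
before = {7: 100, 10: 900}
# "store 10 should ship 500 toothbrushes to store 7"
instructions = {10: -500, 7: +500}

after, applied = run_instructions(before, instructions, crashed={10})
# T is atomic only if every component applied its part, or none did.
is_atomic = len(set(applied.values())) == 1
print(after, applied, is_atomic)
```

In this run store 7 records 500 incoming toothbrushes that store 10 never shipped, so only part of T has taken effect.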


Agreement Paradigms

Coordinated attack

Consensus


Coordinated Attack: An important abstraction

• a pair of allied generals A and B have agreed to attack simultaneously or not at all.

• they can only communicate via carrier pigeon; message loss is possible

[Figure: generals A and B]


Difficulty: uncertainty

• Suppose general A sends B the message “attack at dawn”

• General A won’t attack alone, and A doesn’t know whether B has received the message. B understands A’s predicament, so B sends an acknowledgment “agreed”


Impossible

Theorem: Assume that communication is unreliable. Any protocol that guarantees that if one of the generals attacks, then the other does so at the same time, is a protocol in which necessarily neither general attacks.

[Figure: A sends “attack at dawn” to B (did B get it?); B sends “ack” to A (did A get it?)]


It never ends

• There is always uncertainty about whether the last message was delivered or not

• Corollary: If a decision must be made within a fixed time period, then unreliable communication prevents database commitment protocols

[Figure: A sends “ack your ack” to B (did B get it?); B sends “ack your ack to my ack” to A (did A get it?)]
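The uncertainty behind the exchange of acknowledgments can be made concrete with a small sketch. It is a simplification, not the proof: it assumes messages strictly alternate (A to B, then B to A, and so on), that the chain stops at the first lost message, and that a general's view is just the number of messages it has received. The function names are hypothetical.

```python
def views(k):
    """Messages received by (A, B) when exactly the first k messages arrive.

    Odd-numbered messages go A -> B, even-numbered ones go B -> A.
    """
    received_by_a = k // 2
    received_by_b = (k + 1) // 2
    return received_by_a, received_by_b


def sender_of(k):
    """General who sends message number k (1-based)."""
    return "A" if k % 2 == 1 else "B"


def indistinguishable_to_sender(k):
    """True if the sender of message k sees the same view whether or not it arrives."""
    idx = 0 if sender_of(k) == "A" else 1
    return views(k)[idx] == views(k - 1)[idx]


# For every k, the sender of the k-th message cannot tell the two runs apart.
chain = [indistinguishable_to_sender(k) for k in range(1, 11)]
print(chain)
```

Since the sender of the last message must act identically in both runs, and the generals must act together, the decision taken after k messages is the same as after k - 1, and so on down to the run in which no message arrives at all.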




Agreement Problems in Distributed Computing are common…

Because processes have different views of the system's state and history, due to:

• Delays

• Failures

NASA plunged the Galileo spacecraft into Jupiter’s turbulent atmosphere today. The unmanned spacecraft dived into the atmosphere at 2:57 p.m. Eastern time. The last of Galileo’s data arrived on Earth today after the spacecraft was destroyed, taking 52 minutes to cross half a billion miles of space

The New York Times, 21 Sept. 2003


… and Agreement Problems are Important

• In a replicated data system: to execute the same sequence of operations on the replicated data

• In a replicated sensor system: to agree on the values of the sensors

• In a timed system: to synchronize a set of clocks

• In a broadcast system: to deliver the same messages in the same order

• In a database system: to commit or abort a transaction

Etc.


Consensus

The king of agreement problems


CONSENSUS: A fundamental abstraction

Each process has an input and should decide an output such that:

Agreement: correct processes’ decisions are the same

Validity: decision is input of one process

Termination: eventually all correct processes decide

There are at least two possible input values 0 and 1
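The three requirements can be read as predicates over one finished run. The sketch below is a hypothetical checker, not part of the lecture: it assumes a run is described by the list of inputs, the list of decisions (`None` for a process that never decided) and the set of crashed process indices. Termination is an "eventually" property, so checking it on a finished run is only an approximation.

```python
def check_consensus(inputs, decisions, crashed):
    """Check agreement, validity and termination on one finished run."""
    correct = [i for i in range(len(inputs)) if i not in crashed]
    # Termination: every correct process has decided.
    termination = all(decisions[i] is not None for i in correct)
    # Agreement: correct processes that decided chose the same value.
    decided = {decisions[i] for i in correct if decisions[i] is not None}
    agreement = len(decided) <= 1
    # Validity: every decision is the input of some process.
    validity = all(d in inputs for d in decisions if d is not None)
    return {"agreement": agreement, "validity": validity, "termination": termination}


good = check_consensus(inputs=[0, 1, 1], decisions=[1, 1, None], crashed={2})
bad = check_consensus(inputs=[0, 1, 1], decisions=[0, 1, None], crashed={2})
print(good, bad)
```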


A Solution to Consensus: For a group of people sitting in a room


A Solution to Consensus: Each one raises a card with its input

[Figure: people in a room raise cards showing 2, 0, 0, 1 and 0]


A Solution to Consensus: Follow a coordinator

[Figure: cards 2, 0, 0, 1, 0; everyone adopts the coordinator's value and decides 1]


A Solution to Consensus: Majority wins (breaking ties with the largest)

[Figure: cards 2, 0, 0, 1, 0; everyone decides 0]
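Both rules for the room setting are easy to write down when everyone sees every card. The sketch below assumes the cards are 2, 0, 0, 1, 0 as they appear on the slide, reads "majority" as "most frequent value", and uses hypothetical function names.

```python
from collections import Counter


def follow_coordinator(cards, coordinator):
    """Everyone decides the value on the coordinator's card."""
    return cards[coordinator]


def majority_wins(cards):
    """Decide the most frequent value, breaking ties with the largest."""
    counts = Counter(cards)
    top = max(counts.values())
    return max(value for value, count in counts.items() if count == top)


cards = [2, 0, 0, 1, 0]
print(follow_coordinator(cards, coordinator=3))  # the person holding card 1
print(majority_wins(cards))
print(majority_wins([1, 1, 0, 0]))  # a tie, broken by the largest value
```

Either rule yields agreement here only because every participant applies it to the same full set of cards.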


A Solution to Consensus: Failures are no problem (choose another coordinator, or majority of non-failed)

[Figure: cards 2, 0, 1, 0; one participant has failed (%!#)]


A Solution to Consensus: … because this cannot happen!!

[Figure: cards 2, 0, 1, 0, 1 and a failed participant (%!#)]


Consensus in Distributed Systems: This can happen: delays

[Figure: a process knows its own value 1; the other values are still unknown (?)]


Consensus in Distributed Systems: and then there are different views

[Figure: one process holds the view 01020, others hold 1020?, and one process has crashed (†)]


Consensus in Distributed Systems: so we try to reconcile views with another round

[Figure: after another round one process holds the view 10201, while others still hold 1020?; one process has crashed (†)]


Consensus in Distributed Systems: but we could have the same problem!!

[Figure: some processes now hold the view 10201, while others still hold 1020?; one process has crashed (†)]


So, is consensus solvable? If so, how long does it take to solve it?

• It depends on what exactly the model is

• But what is a realistic model?

• And what are the common scenarios within the model? The nature of a distributed system is to include complex combinations of failures and delays


Basic Model – asynchronous crash failure model

• Message passing (another option would be a shared memory model)

• Channels between every pair of processes

• Crash failures, with a bound t < n on the number of potential failures out of n > 1 processes

• No message loss among correct processes

• Unbounded message delays, unpredictable processor speeds


Distributed algorithms (protocols)

• A set of algorithms, each one runs on a different processor (or as a thread in the same computer)

• The code includes instructions to communicate with other processors:

– Send (M) to p

– Upon receiving a message from q do
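The two communication primitives can be sketched as a tiny in-memory simulation. Everything here is an assumption made for illustration: the `Process` class, the shared `network` queue and the ping/pong behaviour are hypothetical, and delivery is in FIFO order, whereas the asynchronous model allows arbitrary delays.

```python
from collections import deque


class Process:
    def __init__(self, pid, network):
        self.pid = pid
        self.network = network  # shared queue of (sender, receiver, message)
        self.log = []

    def send(self, message, to):
        """Send (M) to p: put the message on the network."""
        self.network.append((self.pid, to, message))

    def on_receive(self, message, sender):
        """Upon receiving a message from q do: record it, answer a ping."""
        self.log.append((sender, message))
        if message == "ping":
            self.send("pong", to=sender)


def run(processes, network):
    """Deliver pending messages one at a time until none are left."""
    while network:
        sender, to, message = network.popleft()
        processes[to].on_receive(message, sender)


network = deque()
procs = {name: Process(name, network) for name in ("p", "q")}
procs["p"].send("ping", to="q")
run(procs, network)
print(procs["q"].log, procs["p"].log)
```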


A consensus protocol

1. val ← input

2. send val to all

3. wait until at least n - t messages have been received

4. let V[j] be the val received from process j, else ‘-’

5. return h (V) = largest value in V

- This same code is executed by every process

- Each one receives the value input from some application

- h is a predefined function that all processors know
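A minimal sketch of this protocol, under stated assumptions, helps to see how a process's decision depends on which n - t messages it happens to receive. The names `decide` and `run_protocol` are hypothetical, the asynchronous delivery is replaced by an explicit `schedule` saying whom each process heard from, and a process is assumed to receive its own value.

```python
def decide(inputs, heard_from, h=max):
    """Steps 4-5: build V from the received values and return h(V)."""
    V = [inputs[j] if j in heard_from else "-" for j in range(len(inputs))]
    return h([v for v in V if v != "-"])


def run_protocol(inputs, t, schedule):
    """schedule[i] is the set of processes whose value reached process i."""
    n = len(inputs)
    decisions = {}
    for i, heard_from in schedule.items():
        # Step 3: a process waits for at least n - t messages.
        assert len(heard_from) >= n - t
        decisions[i] = decide(inputs, heard_from)
    return decisions


# n = 3, t = 1: each process decides after hearing from two processes.
schedule = {0: {0, 1}, 1: {1, 2}, 2: {0, 2}}
same_inputs = run_protocol([1, 1, 1], t=1, schedule=schedule)
mixed_inputs = run_protocol([0, 0, 1], t=1, schedule=schedule)
print(same_inputs, mixed_inputs)
```

With identical inputs every process decides the same value, while with inputs 0, 0, 1 this particular schedule leads process 0 to a different decision than the others, which is why the set of allowed input vectors matters.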


Is this protocol correct ?

• It depends on the set C of possible inputs

• An input to the protocol is a vector I, where I[j] contains the local input of the j-th process

• The local input of pj is known only to pj

• And is taken from some universe of possible values V not including ‘-’

• Let C be the set of possible input vectors to the protocol


Exercise 1

1. Define a set C as large as possible for which the protocol is correct

2. Prove that the protocol is correct for this C

3. Do you need to assume t < n / 2 ?

Correct means that for every I in C, in every execution with input I where at most t processes crash, the consensus requirements are satisfied:

Termination: eventually all correct processes decide

Agreement: correct processes’ decisions are the same

Validity: decision is input of one process


Exercise 2

The protocol uses h (V) = largest value in V

1. Define another such function h’

2. Repeat the previous exercise with respect to your h’


Exercise 3

Consider the set C that includes every possible input vector formed with values from V, where | V | is at least 2

1. Is there a function h for which the protocol is correct ?

If so, give one such h and prove the protocol is correct, otherwise, give a brief intuitive argument of why there is no such h


Bibliography: Theory of distributed computing textbooks

• Attiya, Welch, Distributed Computing, Wiley-Interscience, 2 ed., 2004

• Garg, Elements of Distributed Computing, Wiley-IEEE, 2002

• Lynch, Distributed Algorithms, Morgan Kaufmann, 1997

• Tel, Introduction to Distributed Algorithms, Cambridge U., 2 ed. 2001


Bibliography: others

• Distributed Algorithms and Systems http://www.md.chalmers.se/~tsigas/DISAS/index.html

• Conferences: DISC, PODC,…

• Journals: Distributed Computing,…

– Special issue PODC 20th anniversary, Sept. 2003

• ACM SIGACT News Distributed Computing Column. Also one in EATCS Bulletin
