
Upload: jerome-rodgers

Post on 11-Jan-2016


Page 1: Reliable Distributed Systems Agenda for March 23 rd, 2006 Group Membership Services Use of Multicast to support replications Virtual Synchrony Model Applications

Reliable Distributed Systems

Agenda for March 23rd, 2006
  Group Membership Services
  Use of Multicast to support replication
  Virtual Synchrony Model
  Applications - Examples

All based on Ken Birman’s slide set

Page 2

Architecture

Membership agreement: “join/leave” and “P seems to be unresponsive”

3PC-like protocols use membership changes instead of failure notification

Applications use replicated data for high availability

Page 3

Issues?

How to “detect” failures
  Can use timeouts
  Or could use other system monitoring tools and interfaces
  Sometimes can exploit hardware
Tracking membership
  Basically, need a new replicated service
  System membership “lists” are the data it manages
  We’ll say it takes join/leave requests as input and produces “views” as output
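The "join/leave requests in, views out" behaviour can be sketched in a few lines. This is only a single-process illustration, assuming hypothetical names (`MembershipService`, `join`, `leave`); a real GMS would itself be replicated and fault-tolerant.

```python
class MembershipService:
    """Toy, single-process stand-in for a GMS (all names are hypothetical)."""

    def __init__(self):
        self.members = []   # kept in join order
        self.views = []     # every view produced so far, oldest first

    def _publish(self):
        # A view is (view id, members in join order).
        self.views.append((len(self.views), tuple(self.members)))

    def join(self, p):
        if p not in self.members:
            self.members.append(p)
            self._publish()

    def leave(self, p):
        # Also the path taken when p is reported as unresponsive.
        if p in self.members:
            self.members.remove(p)
            self._publish()


gms = MembershipService()
for name in ("A", "D", "C"):
    gms.join(name)
gms.leave("A")
print(gms.views[-1])   # (3, ('D', 'C'))
```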

Page 4

Architecture

[Figure: application processes A, B, C, D issue join and leave requests, plus the report “A seems to have failed”, to the GMS processes X, Y, Z. The GMS outputs the membership views {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}.]

Page 5

Issues

Group membership service (GMS) has just a small number of members
  This core set tracks membership for a large number of system processes
  Internally it runs a group membership protocol (GMP)
Full system membership list is just replicated data managed by GMS members, updated using multicast

Page 6

GMP design

What protocol should we use to track the membership of the GMS?
  Must avoid the split-brain problem
  Desire continuous availability
We’ll see that a version of 3PC can be used
  But can’t “always” guarantee liveness

Page 7

Multicast Primitives

To support replication

Page 8

Ordering properties: FIFO

FIFO or sender-ordered multicast: fbcast

Messages are delivered in the order they were sent (by any single sender)

[Figure: timelines for processes p, q, r, s showing multicasts a and e.]

Page 9

Ordering properties: FIFO

FIFO or sender-ordered multicast: fbcast

Messages are delivered in the order they were sent (by any single sender)

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, d, e. Delivery of c to p is delayed until after b is delivered.]

Page 10

Implementing FIFO order

Basic reliable multicast algorithm has this property
  Without failures, all we need is to run it on FIFO channels (like TCP, except “wired” to our GMS)
  With failures, need to be careful about the order in which things are done, but the problem is simple
Multithreaded applications: must carefully use locking, or order can be lost as soon as delivery occurs!
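Where the underlying channels are not FIFO, one common alternative is per-sender sequence numbers with a hold-back buffer. The sketch below illustrates that technique as an assumption; it is not the scheme the slide describes (which relies on FIFO channels), and `FifoReceiver` and the sender names are hypothetical.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.held = defaultdict(dict)      # sender -> {seq: msg} waiting for a gap to fill
        self.delivered = []

    def receive(self, sender, seq, msg):
        if seq < self.next_seq[sender]:
            return                         # duplicate of something already delivered
        self.held[sender][seq] = msg
        # Deliver for as long as the next expected message is available.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1


r = FifoReceiver()
r.receive("q", 1, "c")   # arrives early, so it is held back
r.receive("q", 0, "b")   # fills the gap: b, then c, are delivered
r.receive("s", 0, "e")   # other senders are ordered independently
print(r.delivered)       # ['b', 'c', 'e']
```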

Page 11

Ordering properties: Causal

Causal or happens-before ordering: cbcast

If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations

[Figure: timelines for processes p, q, r, s showing multicasts a and b.]

Page 12

Ordering properties: Causal

Causal or happens-before ordering: cbcast

If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c. Delivery of c to p is delayed until after b is delivered.]

Page 13

Ordering properties: Causal

Causal or happens-before ordering: cbcast

If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, e. Delivery of c to p is delayed until after b is delivered. e is sent (causally) after b.]

Page 14

Ordering properties: Causal

Causal or happens-before ordering: cbcast

If send(a) → send(b) then deliver(a) occurs before deliver(b) at common destinations

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, d, e. Delivery of c to p is delayed until after b is delivered. Delivery of e to r is delayed until after b and c are delivered.]
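The delaying behaviour in these figures can be reproduced with vector timestamps, which the deck mentions later as "VT". Below is a minimal sketch of the standard vector-timestamp delivery rule, assuming hypothetical names (`CausalReceiver`, integer process ids); it is not presented as the Isis implementation.

```python
class CausalReceiver:
    """Holds a multicast until everything that causally precedes it has been delivered."""

    def __init__(self, n):
        self.vt = [0] * n      # vt[j] = number of multicasts from process j delivered here
        self.pending = []
        self.delivered = []

    def _deliverable(self, sender, ts):
        # It is the next multicast from its sender, and nothing the sender
        # had already delivered when sending is still missing here.
        return ts[sender] == self.vt[sender] + 1 and all(
            ts[k] <= self.vt[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, sender, ts, msg):
        self.pending.append((sender, ts, msg))
        progress = True
        while progress:        # one delivery may unblock others
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.vt[s] += 1
                    self.delivered.append(m)
                    progress = True


# Process 0 sends b; process 1 delivers b and then sends c.
r = CausalReceiver(3)
r.receive(1, [1, 1, 0], "c")   # c arrives first and is delayed
r.receive(0, [1, 0, 0], "b")   # b arrives: b is delivered, then c
print(r.delivered)             # ['b', 'c']
```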

Page 15

Insights about c/fbcast

These two primitives are asynchronous:
  Sender doesn’t get blocked and can deliver a copy to itself without “stopping” to learn a safe delivery order
  If used this way, the multicast can seem to sit in the output buffers a long time, leading to surprising behavior
  But this also gives the system a chance to concatenate multiple small messages into one larger one

Page 16

Concatenation

[Figure: the application sends 3 asynchronous cbcasts to the multicast subsystem. The message layer of the multicast system combines them in a single packet.]
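A rough sketch of the concatenation idea: sends return immediately and the message layer later flushes everything buffered as one packet. The `Batcher` class and the explicit `flush` call are assumptions for illustration; a real system would decide when to flush by timer or buffer size.

```python
class Batcher:
    """Buffers asynchronous sends and transmits them as a single packet."""

    def __init__(self, transmit):
        self.transmit = transmit   # callable that sends one packet (a list of messages)
        self.buffer = []

    def send(self, msg):
        self.buffer.append(msg)    # returns at once: the sender is not blocked

    def flush(self):
        if self.buffer:
            self.transmit(list(self.buffer))
            self.buffer.clear()


packets = []
b = Batcher(packets.append)
for m in ("m1", "m2", "m3"):
    b.send(m)
b.flush()
print(packets)   # [['m1', 'm2', 'm3']]
```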

Page 17

State Machine Concept

Sometimes, we want a replicated object or service that advances through a series of “state machine transitions”
Clearly will need all copies to make the same transitions
Leads to a need for totally ordered multicast

Page 18

Ordering properties: Total

Total or locally total multicast: abcast

Messages are delivered in same order to all recipients (including the sender)

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, d, e. All deliver a, b, c, d, then e.]

Page 19

Ordering properties: Total

Can visualize as “closely synchronous”
  Real delivery is less synchronous, as on the previous slide

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, d, e. All deliver a, b, c, d, then e.]

Page 20

Often conceive of causal order as a form of total order!

Point is that causal order is totally ordered for any single causal chain.
We’ll use this observation later

[Figure: timelines for processes p, q, r, s showing multicasts a, b, c, d, e. All receive a, b, c, d, then e.]

Page 21

Implementing Total Order

Many ways have been proposed
  Just have a token that moves around
    Token has a sequence number
    When you hold the token you can send the next burst of multicasts
  Extends to a cbcast-based version
    We use this when there are multiple concurrent threads sending messages
Transis and Totem extend VT causal order to a total order
  But their scheme involves delaying messages and sometimes sends extra multicasts
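The token idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `Token` and `Sender` classes and the shared `network` list are hypothetical, and a real receiver would deliver incrementally and wait for missing sequence numbers rather than sort a complete batch.

```python
class Token:
    def __init__(self):
        self.next_seq = 0     # sequence number carried by the token


class Sender:
    def __init__(self, name):
        self.name = name
        self.outbox = []      # multicasts waiting for the token

    def on_token(self, token, network):
        # While holding the token, stamp and send the next burst of multicasts.
        for msg in self.outbox:
            network.append((token.next_seq, self.name, msg))
            token.next_seq += 1
        self.outbox.clear()


def deliver_in_order(received):
    # Every receiver orders by the token's sequence number, so all agree.
    return [msg for _, _, msg in sorted(received)]


network = []
token = Token()
p, q = Sender("p"), Sender("q")
p.outbox += ["a", "c"]
q.outbox += ["b"]
for holder in (p, q):         # the token moves from p to q
    holder.on_token(token, network)

# Arrival order does not matter: the result is the same for every receiver.
print(deliver_in_order(reversed(network)))   # ['a', 'c', 'b']
```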

Page 22

How to think about order?

Usually, we think in terms of state machines and total order
But all the total order approaches are costly
  There is always someone who may need to wait for a token or may need to delay delivery
  Loses benefit of asynchronous execution
  Could be several orders of magnitude slower!
So often we prefer to find sneaky ways to use fbcast or cbcast instead

Page 23

Reliable Distributed Systems

Virtual Synchrony

Page 24

Virtual Synchrony

A powerful programming model!
  Called virtual synchrony
It offers
  Process groups with state transfer, automated fault detection and membership reporting
  Ordered reliable multicast, in several flavors
  Extremely good performance

Page 25

Why “virtual” synchrony?

What would a synchronous execution look like?

In what ways is a “virtual” synchrony execution not the same thing?

Page 26

A synchronous execution

[Figure: timelines for processes p, q, r, s, t, u.]

With true synchrony, executions run in genuine lock-step.

Page 27

Virtual Synchrony at a glance

With virtual synchrony, executions only look “lock step” to the application

[Figure: timelines for processes p, q, r, s, t, u.]

Page 28

What about membership changes?

Virtual synchrony model synchronizes membership change with multicasts
Idea is that:
  Between any pair of successive group membership views…
  … same set of multicasts are delivered to all members
If you implement code, this makes algorithms much simpler for you!

Page 29

Process groups with joins, failures

[Figure: timelines for processes p, q, r, s, t across the views G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, G3={q,r,s,t}. Labeled events: r, s request to join; r, s added, state xfer; p fails (crash); t requests to join; t added, state xfer.]

Page 30

Implementation?

When membership view is changing, we also need to terminate any pending multicasts
  Involves wiring the fault-tolerance mechanism of the multicast to the view change notification
  Tricky but not particularly hard to do
Resulting scheme performs well if implemented carefully

Page 31

Virtual Synchrony at a glance

[Figure: timelines for processes p, q, r, s, t, u.]

We use the weakest (hence fastest) form of communication possible

Page 32

Chances to “weaken” ordering

Suppose that any conflicting updates are synchronized using some form of locking
  Multicast sender will have mutual exclusion
  Hence simply because we used locks, cbcast delivers conflicting updates in the order they were performed!
If our system ever does see concurrent multicasts… they must not have conflicted. So it won’t matter if cbcast delivers them in different orders at different recipients!

Page 33

Causally ordered updates

Each thread corresponds to a different lock

[Figure: timelines for processes p, r, s, t with two threads of numbered updates, one numbered 1 to 5 and one numbered 1 to 2.]

In effect: red “events” never conflict with green ones!

Page 34

In general?

Replace “safe” (dynamic uniformity) with a standard multicast when possible
Replace abcast with cbcast
Replace cbcast with fbcast
Unless replies are needed, don’t wait for replies to a multicast

Page 35

Why “virtual” synchrony?

The user sees what looks like a synchronous execution
  Simplifies the developer’s task
But the actual execution is rather concurrent and asynchronous
  Maximizes performance
  Reduces risk that lock-step execution will trigger correlated failures

Page 36

Correlated failures

Why do we claim that virtual synchrony makes these less likely?
  Recall that many programs are buggy
  Often these are Heisenbugs (order sensitive)
With lock-step execution each group member sees group events in identical order
  So all die in unison
With virtual synchrony orders differ
  So an order-sensitive bug might only kill one group member!

Page 37

Programming with groups

Many systems just have one group
  E.g. replicated bank servers
  Cluster mimics one highly reliable server
But we can also use groups at finer granularity
  E.g. to replicate a shared data structure
  Now one process might belong to many groups
A further reason that different processes might see different inputs and event orders

Page 38

Embedding groups into “tools”

We can design a groups API: pg_join(), pg_leave(), cbcast()…
But we can also use groups to build other higher level mechanisms
  Distributed algorithms, like snapshot
  Fault-tolerant request execution
  Publish-subscribe

Page 39

Distributed algorithms

Processes that might participate join an appropriate group
Now the group view gives a simple leader election rule
  Everyone sees the same members, in the same order, ranked by when they joined
  Leader can be, e.g., the “oldest” process
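Because every member sees the same view in the same order, each one can compute the leader locally without exchanging messages. A minimal sketch, assuming the view is simply a list of member names ranked oldest first (the `leader` function is a hypothetical name):

```python
def leader(view):
    """view: members ranked by when they joined, oldest first."""
    return view[0] if view else None


view = ["p", "q", "r", "s"]
print(leader(view))   # p

view.remove("p")      # p fails and a new view is reported
print(leader(view))   # q
```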

Page 40

Distributed algorithms

A group can easily solve consensus
  Leader multicasts: “what’s your input?”
  All reply: “Mine is 0. Mine is 1”
  Initiator picks the most common value and multicasts that: the “decision value”
If the leader fails, the new leader just restarts the algorithm
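The leader's decision step might look like the sketch below. The multicast and the replies are replaced by a dictionary lookup, and the tie-break rule (smallest value wins) is an assumption, since the slide does not say how ties are resolved; `decide` is a hypothetical name.

```python
from collections import Counter


def decide(view, inputs):
    """Leader-side step: collect one input per member and pick the most common value."""
    replies = [inputs[m] for m in view]   # stands in for the multicast and its replies
    counts = Counter(replies)
    best = max(counts.values())
    # Assumed tie-break: the smallest of the most common values.
    return min(v for v, c in counts.items() if c == best)


view = ["p", "q", "r"]
inputs = {"p": 0, "q": 1, "r": 1}
print(decide(view, inputs))   # 1
```

If the leader fails, the new leader would call the same function again over the new view.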

Page 41

Distributed algorithms

A group can easily do consistent snapshot algorithm
  Either use cbcast throughout system, or build the algorithm over gbcast
  Two phases:
    Start snapshot: a first cbcast
    Finished: a second cbcast, collect process states and channel logs

Page 42

Distributed algorithms: Summary

Leader election
Consensus and other forms of agreement like voting
Snapshots, hence deadlock detection, auditing, load balancing

Page 43

More tools: fault-tolerance

Suppose that we want to offer clients “fault-tolerant request execution”
  We can replace a traditional service with a group of members
  Each request is assigned to a primary (ideally, spread the work around) and a backup
    Primary sends a “cc” of the response to the request to the backup
  Backup keeps a copy of the request and steps in only if the primary crashes before replying
Sometimes called “coordinator/cohort” just to distinguish from “primary/backup”
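A small in-memory sketch of the coordinator/cohort pattern follows. Everything here is illustrative: the `Member` class, the `alive` flag standing in for a crash, the round-robin assignment by request id, and the `execute` placeholder are all assumptions, and the view is assumed to hold at least two members.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.pending = {}     # request id -> request, kept while acting as backup
        self.done = {}        # request id -> response cc'd by the primary

    def execute(self, req):
        return req.upper()    # placeholder for the real service logic


def handle(req_id, req, view):
    # Spread the work around: choose primary and backup from the request id.
    primary = view[req_id % len(view)]
    backup = view[(req_id + 1) % len(view)]
    backup.pending[req_id] = req              # backup keeps a copy of the request
    if primary.alive:
        resp = primary.execute(req)
        backup.done[req_id] = resp            # the "cc" of the response
        backup.pending.pop(req_id)
        return primary.name, resp
    # Primary crashed before replying: the backup steps in.
    return backup.name, backup.execute(backup.pending.pop(req_id))


view = [Member("p"), Member("q")]
print(handle(0, "ping", view))   # ('p', 'PING')
view[1].alive = False            # q, the primary for request 1, crashes
print(handle(1, "pong", view))   # ('p', 'PONG')
```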

Page 44

Publish / Subscribe

Goal is to support a simple API:
  Publish(“topic”, message)
  Subscribe(“topic”, event_handler)
We can just create a group for each topic
  Publish multicasts to the group
  Subscribers are the members
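The topic-to-group mapping can be sketched in a single process as below. The function names mirror the slide's API in lowercase; the `groups` dictionary and the direct handler calls stand in for real group membership and multicast, and are assumptions for illustration.

```python
from collections import defaultdict

groups = defaultdict(list)      # topic -> handlers of the members of that topic's group


def subscribe(topic, event_handler):
    groups[topic].append(event_handler)   # subscribing is joining the topic's group


def publish(topic, message):
    for handler in groups[topic]:         # stands in for a multicast to the group
        handler(message)


seen = []
subscribe("quotes", seen.append)
publish("quotes", "tick 1")
publish("news", "nobody is subscribed to this topic")
print(seen)   # ['tick 1']
```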

Page 45

Scalability warnings!

Many existing group communication systems don’t scale incredibly well
  E.g. JGroups, Ensemble, Spread
  Group sizes limited to perhaps 50-75 members
  And individual processes limited to joining perhaps 50-75 groups (Spread: see next slide)
Overheads soar as these sizes increase
  Each group runs protocols oblivious of the others, and this creates huge inefficiency

Page 46

Other “toolkit” ideas

We could embed group communication into a framework in a “transparent” way
  Example: CORBA fault-tolerance specification does lock-step replication of deterministic components
  The client simply can’t see failures
But the determinism assumption is painful, and users have been unenthusiastic
  And exposed to correlated crashes

Page 47

Other similar ideas

There was some work on embedding groups into programming languages
  But many applications want to use them to link programs coded in different languages and systems
  Hence an interesting curiosity but just a curiosity
More work is needed on the whole issue

Page 48

Existing toolkits: challenges

Tensions between threading and ordering
  We need concurrency (threads) for performance
  Yet we need to preserve the order in which “events” are delivered
This poses a difficult balance for the developers

Page 49

Preserving order

[Figure: timelines for processes p, q, r against a time axis, with application multicasts m1, m2, m3, m4 and views G1={p,q}, G2={p,q,r}. The Group Communication Subsystem is a library linked to the application, perhaps with its own daemon processes.]

Page 50

Features of major virtual synchrony platforms

Isis: First and no longer widely used
  But was perhaps the most successful; has major roles in NYSE, Swiss Exchange, French Air Traffic Control system (two major subsystems of it), US AEGIS Naval warship
  Also was first to offer a publish-subscribe interface that mapped topics to groups

Page 51

Features of major virtual synchrony platforms

Totem and Transis
  Sibling projects, shortly after Isis
  Totem (UCSB) went on to become Eternal and was the basis of the CORBA fault-tolerance standard
  Transis (Hebrew University) became a specialist in tolerating partitioning failures, then explored link between vsync and FLP

Page 52

Features of major virtual synchrony platforms

Horus, JGroups and Ensemble
  All were developed at Cornell: successors to Isis
  These focus on flexible protocol stack linked directly into application address space
    A stack is a pile of micro-protocols
    Can assemble an optimized solution fitted to specific needs of the application by plugging together “properties this application requires”, lego-style
    The system is optimized to reduce overheads of this compositional style of protocol stack
  JGroups is very popular.
  Ensemble is somewhat popular and supported by a user community.
  Horus works well but is not widely used.

Page 53

JGroups (part of JBoss)

Developed by Bela Ban
  Implements group multicast tools
  Virtual synchrony was on their “to do” list
  But they have group views, multicast, weaker forms of reliability
  Impressive performance!
  Very popular for Java community
Downloads from http://www.JGroups.org

Page 54

Spread Toolkit

Developed at Johns Hopkins
  Focused on a sort of “RISC” approach
  Very simple architecture and system
  Fairly fast, easy to use, rather popular
Supports one large group within which user sees many small “lightweight” subgroups that seem to be free-standing
Protocols implemented by Spread “agents” that relay messages to apps

Page 55

Summary?

Role of a toolkit is to package commonly used, popular functionality into simple API and programming model
Group communication systems have been more popular when offered in toolkits
  If groups are embedded into programming languages, we limit interoperability
  If groups are used to transparently replicate deterministic objects, we’re too inflexible
Many modern systems let you match the protocol to your application’s requirements

Page 56

Reliable Distributed Systems

Applications

Page 57

Applications of GCS

Over the past three weeks we’ve heard about group communication
  Process groups
  Membership tracking and reporting “new views”
  Reliable multicast, ordered in various ways
  Dynamic uniformity (safety), quorum protocols
So we know how to build group multicast… but what good are these things?

Page 58

Applications of GCS

Today, we’ll review some practical applications of the mechanisms we’ve studied
  Each is representative of a class
  Goal is to illustrate the wide scope of these mechanisms, their power, and the ways you might use them in your own work

Page 59

Sample Applications

Wrappers and Toolkits
Distributed Programming Languages
Wrapping a Simple RPC server
Wrapping a Web Site
Hardening Other Aspects of the Web
Unbreakable Stream Connections
Reliable Distributed Shared Memory

Page 60

What should the user “see”?

Presentation of group communication tools to end users has been a controversial topic for decades!
Some schools of thought:
  Direct interface for creating and using groups
  Hide in a familiar abstraction like publish-subscribe or Windows event notification
  Use inside something else, like a cluster mgt. platform or a new programming language
Each approach has pros and cons

Page 61

Toolkits

Most systems that offer group communication directly have toolkit interfaces
  User sees a library with various calls and callbacks
  These are organized into “tools”

Page 62

Style of coding?

User writes a program in Java, C, C++, C#...
  The program declares “handlers” for events like new views, arriving messages
  Then it joins groups and can send/receive multicasts
Normally, it would also use threads to interact with users via a GUI or do other useful things

Page 63

Toolkit approach: Isis

Join a group, state transfer:

    gid = pg_join("group-name", PG_INIT, init_func, PG_NEWVIEW, got_newview,
                  XFER_IN, rcv_state, XFER_OUT, snd_state, … 0);

Multicast to a group:

    nr = abcast(gid, REQ, "%s,%d", "a string", 1234, ALL, "%f", &fltvec);

Register a callback handler for incoming messages:

    isis_entry(REQ, got_msg);

Receive a multicast:

    void got_msg(message *mp) {
        Msg_scan("%s,%d", &astring, &anint);
        Reply(mp, "%f", 123.45);
    }

A group is created when a join is first issued. In this case the group initializer function is called. The user needs to code that function. Here the “new view” function, also supplied by the user, gets called when the group membership changes.

If the group already exists, a leader is automatically selected and its XFER_OUT routine is called. It calls xfer_out repeatedly to send state. Each call results in a message delivered to the XFER_IN routine, which extracts the state from the message.

To send a multicast (here, a totally ordered one), you specify the group identifier from a join or lookup, a request code (an integer), and then the message. This multicast builds a message using a C-style format string. This abcast wants a reply from all members; the replies are floating point numbers and the set of replies is stored in a vector specified by the caller. Abcast tells the caller how many replies it actually got (nr).

This is how an application registers a callback handler. In this case the application is saying that messages with the specified request code should be passed to the procedure “got_msg”.

Here’s got_msg. It gets invoked when a multicast arrives with the matching request code. This particular procedure extracts a string and an integer from the message and sends a reply. Abcast will collect all of those replies into a vector, set the caller’s pointer to point to that vector, and return the number of replies it received (namely, the number of members in the current view).

Page 64

Threading

A tricky topic in Isis

The user needs threads, e.g. to deal with I/O from the client while also listening for incoming messages, or to accept new requests while waiting for replies to an RPC or multicast

But the user also needs to know that messages and new views are delivered in order, hence concurrent threads pose issues

Solution? Isis acts like a “monitor” with threads, but running them one at a time unless the user explicitly “exits” the monitor

Page 65

A tricky model to work with!

We have…
  Threads, which many people find tricky
  Virtual synchrony, including choices of ordering
  A new distributed “abstraction” (groups)
Developers will be making lots of choices, some with big performance implications, and this is a negative

Page 66

Examples of tools in toolkit

Group join, state xfer
Leader selection
Holding a “token”
Checkpointing a group
Data replication
Locking
Primary-backup
Load-balancing
Distributed snapshot

Page 67

How toolkits work

They offer a programmer API
  More procedures, e.g.
    Create_replicated_data(“name”, type)
    Lock_replica(“name”)
    Update_replica(“name”, value)
    V = (type)Read_replica(“name”)
Internally, these use groups & multicast
  Perhaps, asynchronous cbcast as discussed last week…
  Toolkit builder optimizes extensively, etc…
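To make the shape of such an API concrete, here is an in-memory sketch loosely modeled on the procedure names above. It is not the Isis toolkit: `ReplicaGroup`, the member indices, and `unlock_replica` are hypothetical, and the loop over copies merely stands in for a multicast to the group.

```python
class ReplicaGroup:
    """In-memory stand-in for a group whose members each hold a copy of named values."""

    def __init__(self, n_members):
        self.copies = [dict() for _ in range(n_members)]
        self.lock_holder = {}                  # name -> index of the member holding the lock

    def lock_replica(self, member, name):
        if self.lock_holder.get(name, member) != member:
            raise RuntimeError("lock held by another member")
        self.lock_holder[name] = member

    def unlock_replica(self, member, name):
        if self.lock_holder.get(name) == member:
            del self.lock_holder[name]

    def update_replica(self, member, name, value):
        if self.lock_holder.get(name) != member:
            raise RuntimeError("update requires the lock")
        for copy in self.copies:               # stands in for a multicast to the group
            copy[name] = value

    def read_replica(self, member, name):
        return self.copies[member][name]       # reads are served from the local copy


g = ReplicaGroup(3)
g.lock_replica(0, "x")
g.update_replica(0, "x", 42)
g.unlock_replica(0, "x")
print(g.read_replica(2, "x"))   # 42
```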

Page 68

How programmers use toolkits

Two main styles
  Replicating a data structure
    For example, “air traffic sector D-5”
    Consists of all the data associated with that structure… could be quite elaborate
    Processes sharing the structure could be very different (maybe not even the same language)
  Replicating a service
    For high availability, load-balancing

Page 69:

Experience is mixed….
Note that many systems use group communication but don’t offer “toolkits” to developers/end users
Major toolkit successes include the New York and Swiss Stock Exchanges, the French Air Traffic Control System, the US AEGIS warship, various VLSI fab systems, etc.
But building them demanded special programmer expertise and knowledge of a large, complex platform
Not every tool works in every situation!
Performance surprises & idiosyncratic behavior are common.
Toolkits never caught on the way transactions did when they became standard
But there are several popular toolkits, like JGroups, Spread and Ensemble. Many people do use them

Page 70:

Leads to notion of “wrappers”

Suppose that we could have a magic wand and wave it at some system component “Replicatum transparentus!”

Could we “automate” the use of tools and hide the details from programmers?

Page 71:

Wrapper examples
Transparently…

Take an existing service and “wrap” it so as to replicate inputs, making it fault-tolerant

Take a file or database and “wrap” it so that it will be replicated for high availability

Take a communication channel and “wrap” it so that instead of connecting to a single server, it connects to a group

Page 72:

Experience with wrappers?
Transparency isn’t always a good thing
CORBA has a fault-tolerance wrapper
  In CORBA, programs are “active objects”
  The wrapper requires that these be deterministic objects with no GUI (e.g. servers)
  CORBA replaces the object with a group, and uses abcast to send requests to the group.
  Members do the same thing, “state machine” style
  So replies are identical. Give the client the first one
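The “state machine” style can be illustrated with a small sketch. It assumes an in-process simulation in which abcast is modelled as delivering each request to every replica in the same order. `Counter` and `ReplicatedObject` are hypothetical names and do not correspond to any CORBA interface.

```python
class Counter:
    """A deterministic 'active object': same requests, same order, same replies."""
    def __init__(self):
        self.value = 0

    def handle(self, request):
        self.value += request
        return self.value


class ReplicatedObject:
    """Stands in for the wrapper: the client sees one object, backed by a group."""
    def __init__(self, factory, n):
        self.replicas = [factory() for _ in range(n)]

    def invoke(self, request):
        # Simulated abcast: every replica gets every request in the same order.
        replies = [r.handle(request) for r in self.replicas]
        # Determinism is what makes the replies interchangeable.
        assert len(set(replies)) == 1, "replicas diverged (non-determinism?)"
        return replies[0]  # hand the client the first reply


obj = ReplicatedObject(Counter, 3)
obj.invoke(5)
result = obj.invoke(7)
print(result)
```

If `handle` consulted a clock or a thread scheduler, the internal assertion could fail, which is exactly the constraint the wrapper places on the object.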

Page 73:

Why CORBA f.tol. was a flop
Users find the determinism assumption too constraining
  Prevents use of threads, shared memory, system clock, timers, multiple I/O channels…
  Real programs sometimes use these sorts of things unknown to the programmer
  Who knows how the .NET I/O library was programmed by Microsoft? Could it have threads inside, or timers?
Moreover, costs were high
  Twice as much hardware… slower performance!
  Also, had to purchase the technology separately from your basic ORB (and for almost the same price)

Page 74:

Files and databases?

Here, the issue is that there are other ways to solve the same problem
  A file, for example, could be put on a RAID file server
  This provides high speed and high capacity and fault-tolerance too
  Software replication can’t easily compete

Page 75:

How about “TCP to a group?”
This is a neat application and very interesting to discuss. We saw it before. Let’s look at it again, carefully
Goals:
  Client system runs standard, unchanged TCP
  Server replaced by a group… leader owns the TCP endpoint but if it crashes, someone else takes over and client sees no disruption at all!

Page 76:

How would this work?
Revisit idea from before:
Reminder: TCP is a kind of state machine
  Events occur (incoming IP packets, timeouts, read/write requests from app)
  These trigger “actions” (sending data packets, acks, nacks, retransmission)
We can potentially checkpoint the state of a TCP connection or even replicate it in realtime!

Page 77:

How to “move” a TCP connection

We need to move the IP address
  We know that in the modern internet, IP addresses do move, all the time
  NATs and firewalls do this, why can’t we?
We would also need to move the TCP connection “state”
  Depending on how TCP was implemented this may actually be easy!

Page 78:

Migrating a TCP connection

[Figure: client, Initial Server, New Server; “TCP state” shown at both servers]

Client “knows” the server by its TCP endpoint: an IP address and port that speak TCP and have the state of this connection

The server-side state consists of the contents of the TCP window (on the server), the socket to which the IP address and port are bound, and timeouts or ACK/NACK “pending actions”

We can write this into a checkpoint record

We transmit the TCP state (with any other tasks we migrate) to the new server. It opens a socket, binds to the SAME IP address, initializes its TCP stack out of the checkpoint received from the old server

The client never even notices that the channel endpoint was moved!

The old server discards its connection endpoint

Page 79:

TCP connection state
Includes:
The IP address, port # used by the client and the IP address and port on the server
  Best to think of the server as temporarily exhibiting a “virtual address”
  That address can be moved
Contents of the TCP “window”
  We can write this down and move it too
ACK/NACK state, timeouts
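As an illustration only, the state listed above could be captured in a checkpoint record along the following lines. The `TcpCheckpoint` type and its field names are hypothetical, and the addresses are documentation-range placeholders. A real TCP stack keeps this state inside the kernel and does not normally expose it this way.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TcpCheckpoint:
    """Hypothetical checkpoint record; field names are illustrative only."""
    server_ip: str          # the movable "virtual address"
    server_port: int
    client_ip: str
    client_port: int
    window: list = field(default_factory=list)           # contents of the TCP window
    pending_actions: list = field(default_factory=list)  # timeouts, ACK/NACKs due


def write_checkpoint(state):
    """Serialize the record so it can be shipped to the new server."""
    return json.dumps(asdict(state)).encode("utf-8")


def read_checkpoint(raw):
    """Rebuild the record on the new server."""
    return TcpCheckpoint(**json.loads(raw.decode("utf-8")))


old = TcpCheckpoint("192.0.2.10", 80, "198.51.100.7", 51234,
                    window=["bytes 101-200"], pending_actions=["ack 100"])
new = read_checkpoint(write_checkpoint(old))
print(new == old)
```

JSON is used here only because it is easy to read; any encoding that round-trips the record would do.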

Page 80:

Generalizing the idea

Create a process group
  Use multicasts when each event occurs (abcast)
  All replicas can track state of the leader
Now if a new view shows that the leader has failed, a replica can take over by binding to the IP address
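A toy simulation of this takeover, assuming the convention that the first member listed in a view is the leader. The `Member` class, `install_view`, and `multicast` are hypothetical helpers, and “binding to the IP address” is reduced to a boolean flag.

```python
class Member:
    """A group member that tracks the leader's TCP events."""
    def __init__(self, name):
        self.name = name
        self.events = []      # replicated log of the leader's TCP events
        self.owns_ip = False  # stands in for "has bound to the IP address"

    def on_event(self, event):
        self.events.append(event)

    def on_view(self, view):
        # Assumed convention: the first member of the view is the leader.
        self.owns_ip = (view[0] == self.name)


members = {n: Member(n) for n in ("A", "B", "C")}


def install_view(view):
    for n in view:
        members[n].on_view(view)


def multicast(view, event):
    for n in view:
        members[n].on_event(event)


view = ["A", "B", "C"]
install_view(view)
multicast(view, "SYN from client")
multicast(view, "data segment 1")

# A crashes; the membership service reports a new view without it.
view = ["B", "C"]
install_view(view)
print(members["B"].owns_ip, members["B"].events)
```

Because B saw every event the old leader multicast, it already holds the state it needs when the new view names it leader.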

Page 81:

Fault-tolerant TCP connection

[Figure: client, Initial Server, New Server]

With replication technology we could continuously replicate the connection state (as well as any “per task” state needed by the server)

Page 82:

Fault-tolerant TCP connection

[Figure: client, Initial Server, New Server]

After a failure, the new server could take over, masking the fault. The client doesn’t notice anything

Page 83:

What’s new?

Before we didn’t know much about multicast… now we do

This lets us ask how costly the solution would be

In particular
  Which multicast should be used?
  When would a delay be incurred?

Page 84:

Choice of multicast

We need to be sure that everyone sees events in the identical order
  Sounds like abcast
But in fact there is only a single sender at a time, namely the leader
  Fbcast is actually adequate!
  Advantage: leader doesn’t need to multicast to itself, only to the replicas
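One common way to get FIFO (per-sender) ordering is for the sender to number its messages and for each receiver to hold back anything that arrives early. The sketch below assumes that scheme. `FifoReceiver` is a hypothetical class, not the Isis fbcast implementation.

```python
class FifoReceiver:
    """Delivers each sender's messages in sequence-number order."""
    def __init__(self):
        self.next_seq = {}   # sender -> next expected sequence number
        self.pending = {}    # sender -> {seq: msg} held back until deliverable
        self.delivered = []

    def receive(self, sender, seq, msg):
        buf = self.pending.setdefault(sender, {})
        buf[seq] = msg
        expected = self.next_seq.get(sender, 0)
        # Deliver as long as the next expected message is available.
        while expected in buf:
            self.delivered.append(buf.pop(expected))
            expected += 1
        self.next_seq[sender] = expected


r = FifoReceiver()
r.receive("leader", 1, "ack 100")      # arrives early, held back
r.receive("leader", 0, "data 1-100")   # releases both messages
r.receive("leader", 2, "data 101-200")
print(r.delivered)
```

With a single sender, per-sender order and total order coincide, which is why the cheaper protocol is enough here.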

Page 85:

Timeline picture

[Figure: timeline with rows for client, leader, and replica]

An IP packet generated by TCP

Leader fbcasts the “event description”

Leader bound to IP address

Replica binds to IP address, now it owns the TCP stack

Leader doesn’t need to wait (to “sync”) here because the client can’t see any evidence of the leader’s TCP protocol stack state

Leader does need to wait (to “sync”) before sending this IP packet to the client, to be sure that if the leader crashes, the replica’s TCP stack will be in the same state as the leader’s was

Page 86:

Asynchronous multicast
This term is used when we can send a multicast without waiting for replies
Our example uses asynchronous fbcast
  An especially cheap protocol: often just sends a UDP packet
  Acks and so forth can happen later and be amortized over many multicasts
“Sync” is slower: must wait for an ack
  But often occurs in background while leader is processing the request, “hiding” the cost!
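The difference between the asynchronous send and the “sync” step can be sketched as follows, under the assumption that acknowledgements are simply counted. `Leader`, `Replica`, `fbcast_async`, and `sync` are hypothetical names, and the network is simulated by an in-memory inbox.

```python
class Replica:
    """Receives event descriptions; acks them when it gets around to processing."""
    def __init__(self):
        self.inbox = []  # sent by the leader, not yet processed
        self.log = []    # processed events

    def process_all(self):
        acked = len(self.inbox)
        self.log.extend(self.inbox)
        self.inbox.clear()
        return acked     # one ack per processed event


class Leader:
    def __init__(self, replicas):
        self.replicas = replicas
        self.unacked = 0
        self.sent_to_client = []

    def fbcast_async(self, event):
        # Asynchronous: hand the event off and return without waiting.
        for r in self.replicas:
            r.inbox.append(event)
        self.unacked += len(self.replicas)

    def sync(self):
        # Wait until every outstanding multicast has been acknowledged.
        for r in self.replicas:
            self.unacked -= r.process_all()

    def send_to_client(self, packet):
        self.sync()  # replicas must be caught up before the client sees anything
        assert self.unacked == 0
        self.sent_to_client.append(packet)


leader = Leader([Replica(), Replica()])
leader.fbcast_async("received segment 1")
leader.fbcast_async("app read 100 bytes")
outstanding = leader.unacked       # acks still pending, nothing has blocked yet
leader.send_to_client("ACK 100")   # the only point where the leader waits
print(outstanding, leader.unacked)
```

Only the packet to the client forces a wait; the two event multicasts before it cost the leader nothing beyond the send itself.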

Page 87:

Sources of delay?
Building event messages to represent TCP state, sending them
  But this can occur concurrently with handing data to the application and letting it do whatever work is required
  Unless TCP data is huge, delay is very small
Synchronization before sending packets of any kind to client
  Must be certain that replica is in the identical state

Page 88:

Using our solution?

Now we can wrap a web site or some other service
  Run one copy on each of two or more machines
  Use our replicated TCP
Application sees identical inputs and produces identical outputs…

Page 89:

Repeat of CORBA f.tol. idea?
Not exactly…

We do need determinism with respect to the TCP inputs

But in fact we don’t need to legislate that “the application must be a deterministic object”

Users could, for example, use threads as long as they ensure that identical TCP inputs result in identical replies

Page 90:

Would users accept this?

Unknown: This style of wrapping has never been explored in commercial products

But the idea seems appealing… perhaps someone in the class will find out…

Page 91:

Recap of today’s lecture

… we’ve looked at each of these topics and seen that with a group multicast platform, the problem isn’t hard to solve

Wrappers and Toolkits

Distributed Programming Languages

Wrapping a Simple RPC server

Wrapping a Web Site

Hardening Other Aspects of the Web

Unbreakable Stream Connections

Reliable Distributed Shared Memory [skipped]