Communication and Data Sharing for Dynamic Distributed Systems
Nancy Lynch, MIT
Alex Shvartsman, UConn
Motivation and Focus
• Constructing distributed applications for highly dynamic environments is difficult
• In practice, considerable effort is required to make applications resilient to
– changes in client requirements
– evolution of the underlying computing medium
• Focus of our work
– design and analysis of distributed services
– that provide useful guarantees and
– that make the construction of sophisticated distributed applications easier.
Our Approach
• Traditionally
– research on distributed services emphasized specification and correctness, while
– research on distributed algorithms emphasized complexity and performance
• We combine these concerns, leading to
– algorithms that perform efficiently and degrade gracefully in dynamic distributed settings, and
– whose correctness, performance, and fault-tolerance guarantees are expressed by precisely-defined global services.
Research Direction Summary
• Develop and analyze algorithms to solve problems of communication and data sharing in highly dynamic distributed environments
• “Dynamic” encompasses
– Changes in network topology
– Processor mobility
– Changing sets of participants
– Wide range of failures
– Timing variations
Research Direction (cont’d)
• The properties we study include
– ordering and reliability guarantees for communication
– coherence guarantees for data sharing
• The algorithmic results will be accompanied by
– lower bound and impossibility results,
– which describe inherent limitations on what problems can be solved, and at what cost.
RAMBO
Reconfigurable Atomic Memory for Read/Write Objects
Nancy Lynch
Alex Shvartsman
Design Goals
• RAMBO
– Reconfigurable Atomic Memory for Basic Objects (Read/Write) for message-passing systems
• Dynamic replication for availability and survivability
• Loosely-coupled on-the-fly reconfiguration
• High concurrency
• Low latency
• Safety for any patterns of asynchrony and failures
• Good performance under partial asynchrony and for moderate failures
Algorithmic Ideas
• Reconfigurable quorum systems
– Quorums maintain consistency during modest and transient changes
– Reconfigurations accommodate more drastic and permanent changes
• Read/write operations are frequent
– Use quorum access and allow concurrency
– Isolate from reconfiguration
• Reconfigurations are infrequent
– Use consensus to impose total order (Paxos)
– Optimistic dissemination without formal installation
– Conservative garbage collection of obsolete config-s
Related Prior Work
• Atomic read/write memory in message-passing models
– Upfal Wigderson 86
– Attiya Bar-Noy Dolev 91, 95
– Lynch Shvartsman 97
– Englert Shvartsman 01
• Consensus (Paxos)
– Lamport 89, 98
• Quorums
– Gifford 79, Thomas 79
– and many others
Methodology
• Specify algorithm
– Interacting state machines
– Using non-deterministic “gossip”
• Show correctness/safety for
– arbitrary patterns of asynchrony
– assuming arbitrary crash-failures and message loss
• Analyze performance for a subset of timed executions
– Bounded message delay, 0-time local processing
– Some “gossip” becomes deliberate, some periodic
– Non-failure of certain quorums for certain periods
– Reason about operation latency
– (Of course none of this impacts safety)
Showing Read/Write Atomicity
• We show atomicity using a partial order
• Atomicity of a sequence $\beta$ of reads/writes
– Let $\prec$ be an irreflexive PO of all op-s in $\beta$. Show:
– For any $\pi$, finitely many $\phi$ with $\phi \prec \pi$
– If $\pi$ precedes $\phi$ in $\beta$, then not $\phi \prec \pi$
– If $\pi$ is a write then, for any $\phi$, either $\pi \prec \phi$ or $\phi \prec \pi$
– Any read returns the value written by the last preceding write, per $\prec$
[Lynch, Lemma 13.16]
Approach: Values and Tags
• Each value v has an associated tag t
– Tag is made up of the sequence-processor pair
• Reads:
– a set of value-tag pairs is obtained
– the result is the value with the maximum tag
• Writes:
– a set of value-tag pairs is obtained
– new-value is propagated with a new-tag that is a lexicographic increment of the maximum tag:
new-tag := < tag.seq + 1, pid >
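
The tag rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the RAMBO code: tags are modeled as `(seq, pid)` tuples, which Python compares lexicographically, and the helper names `read_result` and `next_tag` are hypothetical.

```python
from typing import Any, Iterable, Tuple

Tag = Tuple[int, int]  # (sequence number, processor id); an assumed encoding


def read_result(pairs: Iterable[Tuple[Any, Tag]]) -> Tuple[Any, Tag]:
    """Return the (value, tag) pair that carries the maximum tag."""
    return max(pairs, key=lambda pair: pair[1])


def next_tag(pairs: Iterable[Tuple[Any, Tag]], pid: int) -> Tag:
    """Lexicographic increment: bump the largest seq seen, stamp the writer's pid."""
    _, (seq, _) = read_result(pairs)
    return (seq + 1, pid)


replies = [("a", (3, 1)), ("b", (3, 2)), ("c", (2, 7))]
value, tag = read_result(replies)   # the pair carrying the largest tag
new_tag = next_tag(replies, pid=5)  # strictly larger than every tag in replies
```

Including the processor id in the tag breaks ties between writers that pick the same sequence number, so any two distinct writes end up with comparable, distinct tags.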
Using Quorum Systems
• Given a set I (a set of processor ids)
• A quorum system is a pair
– < read-quorums, write-quorums >
• Where
– Read-quorums is a collection of subsets of I
– Write-quorums is a collection of subsets of I
• Such that
– For any R in read-quorums and W in write-quorums, $R \cap W \neq \emptyset$
– For any W1 and W2 in write-quorums, $W_1 \cap W_2 \neq \emptyset$
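
The two intersection conditions can be checked mechanically. The sketch below is illustrative only: `is_quorum_system` and `majorities` are hypothetical helper names, and majority quorums are just one familiar way to satisfy the conditions.

```python
from itertools import combinations, product


def is_quorum_system(read_quorums, write_quorums) -> bool:
    """Check read/write and write/write intersection for the given quorums."""
    rq = [frozenset(q) for q in read_quorums]
    wq = [frozenset(q) for q in write_quorums]
    rw_ok = all(r & w for r, w in product(rq, wq))           # every R meets every W
    ww_ok = all(w1 & w2 for w1, w2 in combinations(wq, 2))   # every W1 meets every W2
    return rw_ok and ww_ok


def majorities(ids):
    """All strict-majority subsets of ids."""
    ids = sorted(ids)
    k = len(ids) // 2 + 1
    return [frozenset(c) for c in combinations(ids, k)]


I = {1, 2, 3, 4, 5}
maj = majorities(I)
ok = is_quorum_system(maj, maj)                       # majorities always intersect
bad = is_quorum_system([{1, 2}], [{3, 4}, {1, 5}])    # {1, 2} misses {3, 4}
```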
High-Level Functions
• Joiner
– Introduces new participants to the system
• Reader-Writer
– Routine read and write operations
– Two-phased algorithm using all “known” configurations
– Using tags
• Reconfiguration
– Chooses new next configuration
– Informs members of the previous configuration
• Garbage collection (“packaged” with Reader-Writer)
– Identify and remove obsolete configurations
RAMBO System
[Figure: components of the RAMBO system: Joiner, Reader-Writer, Recon, Cons, and Network.]
Architectural View
• Each component is formally specified
– Input/Output Automata [Tuttle Lynch]
• Joiners are specified as Joiner_i for i in I
• Reader-Writers are Reader-Writer_i for i in I
• Reconfigurers are Recon_i for i in I
• Consensus instances are Cons(k,c) for k in N, c in C
– Where the members of configuration c decide on the configuration number k
• Network is specified in terms of Channel_{i,j} for i, j in I
– Assumed only to be “honest”
• The System is then the composition of all automata
Configurations and Config Maps
• Configuration c
– members(c) -- set of members of configuration c
– read-quorums(c) -- set of read quorums
– write-quorums(c) -- set of write quorums
• Configuration map cm
– mapping from naturals to configurations
– cm(k) is the configuration k, and it can be
– defined, undefined (⊥), garbage-collected (±)
[Figure: a configuration map drawn as a sequence of four regions, in order: G-C-ed (±), Defined (c), “Mixed”, Undefined (⊥).]
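
One way to picture a configuration map in code is sketched below. Everything here is an assumption made for illustration: the names `Config`, `lookup`, `merge_entry`, `merge` and `simple_config` are hypothetical, and the merge rule (garbage-collected beats defined, defined beats undefined) is not stated on the slide.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union

UNDEFINED = "undefined"  # stands for the slide's ⊥
GC = "gc"                # stands for the slide's ±


@dataclass(frozen=True)
class Config:
    members: FrozenSet[int]
    read_quorums: FrozenSet[FrozenSet[int]]
    write_quorums: FrozenSet[FrozenSet[int]]


Entry = Union[str, Config]
CMap = Dict[int, Entry]  # indices absent from the dict count as UNDEFINED


def lookup(cm: CMap, k: int) -> Entry:
    return cm.get(k, UNDEFINED)


def merge_entry(a: Entry, b: Entry) -> Entry:
    """Assumed rule: garbage-collected beats defined, defined beats undefined."""
    if GC in (a, b):
        return GC
    return b if a == UNDEFINED else a


def merge(cm1: CMap, cm2: CMap) -> CMap:
    """Combine two maps index by index, e.g. when a gossip message arrives."""
    return {k: merge_entry(lookup(cm1, k), lookup(cm2, k))
            for k in cm1.keys() | cm2.keys()}


def simple_config(ids) -> Config:
    q = frozenset({frozenset(ids)})  # a single quorum containing every member
    return Config(frozenset(ids), q, q)


c0, c1 = simple_config({1, 2, 3}), simple_config({2, 3, 4})
mine = {0: GC, 1: c1}            # configuration 0 already collected locally
theirs = {0: c0, 2: UNDEFINED}   # a peer that still holds configuration 0
merged = merge(mine, theirs)
```

Under this rule information only moves forward: once an index is marked garbage-collected anywhere, merging never resurrects the old configuration.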
Configuration Maps
[Figure: evolution of a configuration map over TIME, one row per snapshot:
c0 . . .
c0 c1 . . .
c0 c1 c2 ck . . .
± c1 c2 ck . . .
± ± c2 ck . . .
± ± ± c3 ck . . .
± ± ± ± ± c c c c . . .]
Reader-Writer Protocol
• One “gossip” message
– < World, value, tag, cmap, ns, nr >
• Message from a sender s to a receiver r is such that
– World is s’s set of participants, and r in World
– value and tag are the object value and its tag at s
– cmap is the configuration map at s
– ns and nr are sender’s and best known receiver’s phase numbers used to identify “fresh” messages
• These messages are
– Sent non-deterministically
– For performance analysis we impose an additional deterministic send policy
• Certain actions are taken when “enough” info is gathered
Read/Write Protocol
[Figure: nodes RAMBO_i, RAMBO_j, . . ., RAMBO_n, each containing Reader-Writer and Recon components, exchange gossip messages; the actions shown at node i are read_i, read-ack(v)_i, write(v)_i, write-ack_i, and new-config(c,k)_i.]
Reader-Writer Code
[Figure: Reader-Writer code listing, with parts labeled Start read, Start write, End read, End write, New cfg, Receive, Send, Query fix, and Prop fix.]
The Phase Pattern
• Send to a collection of processes in “known” configs
• Collect responses and update configuration information
• Continue until a certain predicate is satisfied
[Figure: flowchart of one phase: Start, Send, Collect responses (Recv), then the test “Fixpoint reached?”; on “no”, continue sending; on “yes”, End.]
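
The send / collect / test loop can be sketched as follows. This is a simplified, synchronous illustration rather than the RAMBO automaton: `run_phase`, `ask`, and the encoding of a configuration as a plain list of quorums are hypothetical, and a single quorum list per configuration stands in for the separate read- and write-quorums.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Quorums = List[FrozenSet[int]]
Configs = Dict[int, Quorums]  # configuration index -> its quorums (assumed encoding)


def run_phase(known: Configs, ask: Callable[[int], Configs]) -> Tuple[Set[int], Configs]:
    """Contact members of all known configurations until a fixed point is reached.

    ask(p) returns the configurations that process p knows, or raises
    ConnectionError if p does not answer.
    """
    known = dict(known)
    responded: Set[int] = set()
    failed: Set[int] = set()
    while True:
        members = {p for quorums in known.values() for q in quorums for p in q}
        targets = members - responded - failed
        for p in sorted(targets):
            try:
                known.update(ask(p))  # a response may reveal new configurations
                responded.add(p)
            except ConnectionError:
                failed.add(p)
        # Predicate: some quorum of every known configuration has responded.
        if all(any(q <= responded for q in quorums) for quorums in known.values()):
            return responded, known
        if not targets:
            raise RuntimeError("fixed point unreachable: no live quorum")


def majorities3(a, b, c):
    return [frozenset({a, b}), frozenset({a, c}), frozenset({b, c})]


# Process 2 already knows configuration 1; process 1 never answers.
views = {1: {}, 2: {1: majorities3(3, 4, 5)}, 3: {}, 4: {}, 5: {}}


def ask(p: int) -> Configs:
    if p == 1:
        raise ConnectionError("process 1 is down")
    return views[p]


responded, final = run_phase({0: majorities3(1, 2, 3)}, ask)
```

In the example the phase starts with configuration 0 only, learns about configuration 1 from process 2, and therefore has to reach a quorum of configuration 1 as well before it can finish.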
Read and Write Operations
• Reads and Writes use Query and Propagation phases involving known quorum configurations
– Query obtains information about “latest” operations from read quorums & updates configurations
– Propagation disseminates the results of “latest” operation to write quorums & updates configurations
• Fixed point must be reached -- discovery of new configurations requires new quorums to be reached
[Figure: timeline of a Read or Write operation: a Query phase (Start Query to End Query) followed by a Propagate phase (Start Prop. to End Prop.).]
Reader-Writer: Send/Recv
Reader-Writer: Fixed Points
Why Readers Propagate
• If the readers do not propagate, atomicity can be easily violated:
[Figure: a slow Write of v1 overlaps two reads while replicas hold a mix of v0 and v1; a Read of v1 is followed by a Read of v0.]
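
The violation can be reproduced with a toy model. The sketch below assumes three replicas with majority quorums and a slow write whose update has so far reached one replica only; the function names, the tag values, and the choice of quorums are all hypothetical.

```python
from typing import Dict, Set, Tuple

Tag = Tuple[int, int]
Replicas = Dict[int, Tuple[str, Tag]]


def read(replicas: Replicas, query_q: Set[int], prop_q: Set[int], propagate: bool) -> str:
    """Query phase picks the max-tag value; the optional second phase writes it back."""
    value, tag = max((replicas[p] for p in query_q), key=lambda vt: vt[1])
    if propagate:
        for p in prop_q:
            if replicas[p][1] < tag:
                replicas[p] = (value, tag)
    return value


def scenario(propagate: bool) -> Tuple[str, str]:
    # The slow write of v1 (tag (1, 9)) has so far reached replica 3 only.
    replicas: Replicas = {1: ("v0", (0, 0)), 2: ("v0", (0, 0)), 3: ("v1", (1, 9))}
    first = read(replicas, {2, 3}, {1, 2}, propagate)   # sees replica 3
    second = read(replicas, {1, 2}, {1, 2}, propagate)  # starts after first ends
    return first, second


without_prop = scenario(propagate=False)
with_prop = scenario(propagate=True)
```

Without propagation the later read returns the older value, which no atomic ordering allows; with propagation the first read leaves v1 at a quorum, so every later read finds it.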
Joining Protocol
[Figure: nodes RAMBO_i (Joiner_i, Reader-Writer_i, Recon_i) and RAMBO_j (Joiner_j, Reader-Writer_j, Recon_j); arrows are labeled join(J)_i, join, join-ack, and gossip.]
Garbage Collection
• When a process has the following configuration map cmap, it can garbage-collect configuration cmap(k) = c_k:
± ± . . . c_k c_{k+1} . . .
• Two-phase protocol using the “gossip” messages
– Update own tag & value by obtaining the “best” tag and value from a read- and write-quorum of cmap(k)
– Propagate tag & value to a write-quorum of cmap(k+1)
– Set cmap(k) to ±
• This “bootstraps” configuration k in case it is “too new”
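
The two phases can be sketched over an in-memory store. This is an illustration under simplifying assumptions: each configuration is a dict holding one chosen read quorum and one chosen write quorum, messages are replaced by direct dictionary access, and the names `can_gc` and `garbage_collect` are hypothetical.

```python
GC = "±"  # marker for a garbage-collected configuration


def can_gc(cmap, k):
    """Everything below k is collected, and both k and k+1 are defined."""
    return (all(cmap.get(j) == GC for j in range(k))
            and cmap.get(k) not in (None, GC)
            and cmap.get(k + 1) not in (None, GC))


def garbage_collect(cmap, k, store, local):
    """cmap: index -> GC or {'read_q': set, 'write_q': set}; store: pid -> (value, tag)."""
    if not can_gc(cmap, k):
        raise ValueError("configuration %d cannot be garbage-collected yet" % k)
    old, new = cmap[k], cmap[k + 1]
    # Phase 1: obtain the best (value, tag) from a read- and a write-quorum of cmap(k).
    contacted = old["read_q"] | old["write_q"]
    best = max([local] + [store[p] for p in contacted], key=lambda vt: vt[1])
    # Phase 2: propagate it to a write-quorum of cmap(k+1).
    for p in new["write_q"]:
        if store[p][1] < best[1]:
            store[p] = best
    cmap[k] = GC
    return best


cmap = {0: GC,
        1: {"read_q": {1, 2}, "write_q": {2, 3}},
        2: {"read_q": {4, 5}, "write_q": {5, 6}}}
store = {p: ("v0", (0, 0)) for p in range(1, 7)}
store[2] = ("v7", (7, 2))  # the latest write lives in configuration 1 only
best = garbage_collect(cmap, 1, store, local=("v0", (0, 0)))
```

After the call, a write quorum of the newer configuration holds the latest value, so dropping the older configuration loses nothing.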
Reconfiguration
• Very simple protocol for Recon_i
– Reconfiguration is free of atomicity concerns
• Initiator i (multiple initiators are allowed)
– Accepts reconfiguration request recon(c,c’)_i from environment: reconfigure from c to c’
– If c is the locally-known “latest” configuration k-1, informs members of c of the reconfiguration
– Calls Paxos for k to decide on “next” configuration c’
– Informs Reader-Writer_i of the new configuration
• Participants i
– Learn about the initiation of reconfiguration
– Participate in Paxos
– Inform Reader-Writer_i of the new configuration
Latency Analysis
• Certain gossip and messages become “important”
– Messages to members of “active” configurations when read or write is performed
– Messages to configurations k and k+1 when garbage collection is performed
– Specific messages when joining and reconfiguring
– Responses to such messages
• Consider “good” timed executions
– Bounded message delay d
– 0 local processing time
– Environment is well-formed
Additional Assumptions
• These assumptions are used in some results
• Configuration-viability for time parameter e
– If c becomes “known” as configuration k anywhere
– Then either one read- and one write-quorum of c stays alive forever
– Or if by time t another configuration is decided upon by non-faulty members of c, then one read- and one write-quorum of c stays alive until t+e
• Reconfiguration-spacing for time parameter e
– recon(c,*)_i occurs at least e time after report(c)_i
• Join-connectivity for time parameter e
– If i and j join by time t then they learn about each other by time t+e
Latency Bounds (selected)
• Joining:
– 2d, provided “joiner” and “joinee” do not fail
• Reconfiguration:
– In 0-configuration-viable executions
– If recon(c,c’)_i action occurs by time t and no members of c fail after t, then recon-ack_i occurs at t+12d+
• Garbage-collection of c_k at non-faulty i:
– 4d, if R in read-quorums(c_k), W1 in write-quorums(c_k), and W2 in write-quorums(c_{k+1}) do not fail
• Read and write operations in “stable” systems
– If no reconfig-s in progress, then process with “up-to-date” config map completes its operation in 4d
• (These do not depend on “gossip”)
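
To get a feel for these bounds, the multiples of d can be turned into concrete numbers. The message delay of 50 ms below is an arbitrary assumption chosen for illustration, and the reconfiguration bound is left out because its expression is incomplete in the source.

```python
# Latency bounds from the slide, expressed as multiples of the message delay d.
BOUNDS_IN_D = {
    "join": 2,
    "garbage_collect": 4,
    "read_or_write_stable": 4,
}


def latency_ms(d_ms: int) -> dict:
    """Scale each bound by an assumed message delay given in milliseconds."""
    return {op: mult * d_ms for op, mult in BOUNDS_IN_D.items()}


estimates = latency_ms(50)  # hypothetical d = 50 ms
```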
More Latency (1)
• These bounds depend on periodic gossip
• Learning new configurations
– If i and j are “old enough” and do not fail, then information from i is conveyed to j within time 2d
• Garbage-collection when reconfigurations are 6d-spaced and executions are 6d-configuration-viable
– If recon(c,*) occurs before t and c is “known” by t-6d then any non-faulty process that is “old enough” learns about c and garbage-collects any older configuration by time t+6d
– All non-faulty “old enough” processes have one or two defined configurations in their configuration maps
More Latency (2)
• Read and write operations (with periodic gossip)
– Complete in time 8d for non-faulty processes that are “old enough”, provided execution satisfies 12.1d-recon-spacing and 6d-configuration-viability
• Learning in failure-free executions
– Let J be the set of processes that joined by time t
1. Then by time t + log|J|, $J \subseteq world_i$ for any i in J
2. If i in J “knows” a configuration at time t’, then any j in J learns about it by max(t + log|J|, t’) + 2d
Algorithmic Innovations
• Dynamic owners of data:
– Any and all owners may request reconfiguration
– the set of owners can be changed dynamically
• Dynamic configurations:
– Arbitrary configurations can be installed
– no constraints on intersection of quorum sets or member sets in distinct configurations.
• Loosely-coupled reconfiguration:
– Concurrent reads, writes and reconfiguration
– If finitely many reconfigurations occur during a read or write operation, then its completion does not depend on whether any reconfigurations complete
Algorithmics (cont’d)
• Efficient “steady-state”:
– Assuming bounded delays, infrequent reconfig-s, and periodic gossip, reads and writes complete in time that is a constant multiple of the message delay
– Assuming periodic garbage collection, readers/writers only deal with 1 or 2 configurations
• Fast “catch-up”:
– New “joiners” with out-of-date configurations can catch up after a logarithmic number of message exchanges provided the “joiners graph” is connected
Comparison with Other Approaches
• Paxos or a similar consensus service can be used to agree on global order of operations
– We only agree on the sequence of configurations
– Consensus termination impacts only Recon
– Reads/writes are not affected by consensus
• Group communication systems can also be used
– Our algorithm is “from scratch”: low-level send-receive, no hidden/relative costs
– Reads/writes work during “new view” establishment
• Dynamic quorums / dynamic configurations work
– We allow arbitrary new configurations - no static
• Our earlier work also solves this problem
– New work: concurrent recon-s and garbage-collect
Work in Progress and Futures
• Full-fledged implementation is under development
• Additional analysis in progress
– “Normal timing” starts at some point
– Trade-off between configuration-viability and garbage collection
– Analysis of “join-connectivity” graphs
• Algorithmic refinements
– Elimination of unnecessary communication
– Explicit “leave” protocol
– Gossip: “owners” vs. “users” of objects