Replication and Fault Tolerance
Introduction
Reasons for Replication
1. Reliability
– Maintaining multiple copies: if one replica crashes, work continues with another replica
2. Performance
– Divide the work among multiple servers
– Place data close to the process that is using it, so that data access time is reduced
– Any drawbacks? Cost? Inconsistency? (e.g., a bank account)
– Accessing Web pages: pages can be cached, but the cache needs to be kept up to date
Object Replication (1)
Organization of a distributed remote object shared by two different clients.
If remote objects are replicated:
– operations need to be performed in the correct (same) order at all replicas
– first, concurrent invocations on each replica need to be handled correctly
How do you prevent concurrent access to distributed Objects?
Two choices:
1. Let the object itself handle it
– Java allows methods to be synchronized
– In C++, use pthreads, mutexes, …
2. The middleware handles it
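The first choice (the object protects itself) can be sketched in Python with a lock, in the same spirit as a synchronized method in Java. This is a minimal illustration only: the `Counter` class and the `run_clients` helper are hypothetical names, not part of the slides.

```python
import threading


class Counter:
    """Hypothetical shared object that guards its own state with a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this critical section.
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def run_clients(counter, n_threads=4, n_calls=1000):
    """Invoke increment() concurrently from several threads."""
    def work():
        for _ in range(n_calls):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()
```

With the lock in place, no increment is lost regardless of how the threads are scheduled. In the second choice, the same serialization would be done by the middleware (for example an object adapter) before the invocation reaches the object.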
Object Replication (2)
a) A remote object capable of handling concurrent invocations on its own.
b) A remote object for which an object adapter is required to handle concurrent invocations.
Object Replication (3)
a) A distributed system for replication-aware distributed objects.
b) A distributed system responsible for replica management.
Data-Centric Consistency Models
The general organization of a logical data store, physically distributed and replicated across multiple processes.
A consistency model is a contract between processes and the data store (file system, shared memory, shared database): if the processes agree to obey certain rules, the data store promises to work correctly.
E.g., a process that reads a data item expects to get the value stored by the last write operation.
Strict Consistency
Behavior of two processes operating on the same data item.
a) A strictly consistent store.
b) A store that is not strictly consistent.
Any read on a data item x returns the value of the most recent write on x.
Observation: It doesn't make sense to talk about "the most recent" in a distributed environment.
• Assume all data items have been initialized to NIL
• W(x)a: value a is written to x
• R(x)a: reading x returns the value a
Sequential Consistency (1)
a) A sequentially consistent data store.
b) A data store that is not sequentially consistent.
Sequential consistency: the result of any execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in that sequence in the order specified by its program.
Any valid interleaving of read and write operations is acceptable, but all processes must see the same interleaving of operations.
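The definition can be tested by brute force on very small histories: search for one interleaving that respects each process's program order and in which every read returns the latest written value. The sketch below is only an illustration under that assumption; the function name and the tuple encoding of W(x)a / R(x)a are made up here, and the search is exponential, so it is not a practical checker.

```python
def sequentially_consistent(histories, initial=None):
    """Return True if some single interleaving explains every read.

    histories: one list per process of ("W" | "R", item, value) tuples,
    given in program order. Brute-force search for tiny examples only.
    """
    def search(positions, store):
        if all(p == len(h) for p, h in zip(positions, histories)):
            return True
        for i, hist in enumerate(histories):
            p = positions[i]
            if p == len(hist):
                continue
            kind, item, value = hist[p]
            nxt = positions[:i] + (p + 1,) + positions[i + 1:]
            if kind == "W":
                if search(nxt, {**store, item: value}):
                    return True
            elif store.get(item, initial) == value:
                # A read is only legal if it returns the current value.
                if search(nxt, store):
                    return True
        return False

    return search(tuple(0 for _ in histories), {})


# Assumed example histories (not taken from the slide text):
# P1 writes a, P2 writes b, P3 and P4 each read x twice.
SAME_ORDER = [
    [("W", "x", "a")],
    [("W", "x", "b")],
    [("R", "x", "b"), ("R", "x", "a")],
    [("R", "x", "b"), ("R", "x", "a")],
]
DIFFERENT_ORDER = [
    [("W", "x", "a")],
    [("W", "x", "b")],
    [("R", "x", "b"), ("R", "x", "a")],
    [("R", "x", "a"), ("R", "x", "b")],
]
```

In `SAME_ORDER` both readers see b before a, so one interleaving fits everyone. In `DIFFERENT_ORDER` the two readers disagree on the order of the writes, and no single interleaving exists.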
Causal Consistency (1)
Necessary condition: Writes that are potentially causally related must be seen by all processes in the same order.
Concurrent writes may be seen in a different order on different machines.
Causal Consistency (2)
This sequence is allowed with a causally-consistent store, but not with sequentially or strictly consistent store.
a) A data store that is not sequentially consistent.
Causal Consistency (3)
a) A violation of a causally consistent store.
b) A correct sequence of events in a causally consistent store.
Figure annotation: concurrent writes.
FIFO Consistency (1)
Necessary condition: Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.
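One common way to meet this condition, assumed here for illustration and not stated on the slide, is to tag each write with a per-process sequence number and let every replica buffer writes that arrive early. The `FifoReplica` class and its fields are hypothetical names.

```python
class FifoReplica:
    """Hypothetical replica that applies each sender's writes in issue order."""

    def __init__(self):
        self.store = {}
        self.applied = []      # log of (sender, seq) in the order applied
        self._next_seq = {}    # sender -> next expected sequence number
        self._pending = {}     # sender -> {seq: (item, value)}

    def receive(self, sender, seq, item, value):
        """Accept a write that may arrive out of order over the network."""
        self._pending.setdefault(sender, {})[seq] = (item, value)
        expected = self._next_seq.get(sender, 0)
        # Apply every buffered write from this sender that is now in order.
        while expected in self._pending[sender]:
            it, val = self._pending[sender].pop(expected)
            self.store[it] = val
            self.applied.append((sender, expected))
            expected += 1
        self._next_seq[sender] = expected
```

Writes from one sender are never reordered, but nothing constrains how writes from different senders interleave, so two replicas may end up applying them in different orders.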
FIFO Consistency (2)
A valid sequence of events of FIFO consistency
FIFO
Three concurrently executing processes.
| Process P1 | Process P2 | Process P3 |
|---|---|---|
| x = 1; | y = 1; | z = 1; |
| print(y, z); | print(x, z); | print(x, y); |
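For comparison, the following sketch enumerates what these three processes can print when all of them see the same interleaving (that is, under sequential consistency, not FIFO consistency). It assumes x, y and z start at 0; the names `PROGRAMS`, `interleavings` and `signature` are illustrative only.

```python
from itertools import permutations

# Each process runs two statements: a write, then a print of the other two.
PROGRAMS = {
    "P1": [("write", "x"), ("print", ("y", "z"))],
    "P2": [("write", "y"), ("print", ("x", "z"))],
    "P3": [("write", "z"), ("print", ("x", "y"))],
}


def interleavings():
    """Yield every ordering of the six statements that keeps program order.

    Each occurrence of a process name means "run that process's next
    statement", so program order within a process is preserved.
    """
    slots = ["P1", "P1", "P2", "P2", "P3", "P3"]
    for order in sorted(set(permutations(slots))):
        yield order


def signature(order):
    """Run one interleaving; return the outputs of P1, P2, P3 concatenated."""
    store = {"x": 0, "y": 0, "z": 0}
    pc = {p: 0 for p in PROGRAMS}
    out = {p: "" for p in PROGRAMS}
    for proc in order:
        kind, arg = PROGRAMS[proc][pc[proc]]
        pc[proc] += 1
        if kind == "write":
            store[arg] = 1
        else:
            out[proc] = "".join(str(store[v]) for v in arg)
    return out["P1"] + out["P2"] + out["P3"]


ALL_ORDERS = list(interleavings())
SIGNATURES = {signature(o) for o in ALL_ORDERS}
```

There are 90 such interleavings. The signature 000000 never appears among them, because every print follows a write by the same process. Under FIFO consistency each process may observe its own interleaving, which is why additional combinations of outputs become possible.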
FIFO Consistency (3)
Statement execution as seen by the three processes from the previous slide. The statements in bold are the ones that generate the output shown.
| (a) | (b) | (c) |
|---|---|---|
| **x = 1;** | x = 1; | y = 1; |
| **print(y, z);** | **y = 1;** | print(x, z); |
| y = 1; | **print(x, z);** | **z = 1;** |
| print(x, z); | print(y, z); | **print(x, y);** |
| z = 1; | z = 1; | x = 1; |
| print(x, y); | print(x, y); | print(y, z); |
| Prints: 00 | Prints: 10 | Prints: 01 |
FIFO Consistency (4)
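Two concurrent processes. Under FIFO consistency, both processes can be killed: P1 may read y = 0 before it sees P2's write W(y)1, and P2 may likewise read x = 0 before it sees P1's write W(x)1.
| Process P1 | Process P2 |
|---|---|
| x = 1; | y = 1; |
| if (y == 0) kill(P2); | if (x == 0) kill(P1); |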
Summary of Consistency Models
(a) Consistency models not using synchronization operations.
| Consistency | Description |
|---|---|
| Strict | Absolute time ordering of all shared accesses matters. |
| Linearizability | All processes must see all shared accesses in the same order. Accesses are furthermore ordered according to a (nonunique) global timestamp. |
| Sequential | All processes see all shared accesses in the same order. Accesses are not ordered in time. |
| Causal | All processes see causally related shared accesses in the same order. |
| FIFO | All processes see writes from each other in the order they were issued. Writes from different processes may not always be seen in that order. |
Distribution Protocols
Replica Placement
Update Propagation
Epidemic Protocols
Replica Placement
The logical organization of different kinds of copies of a data store into three concentric rings.
Replica Placement
Permanent replicas:
– The initial set of replicas: processes or machines that always hold a copy
– Examples: replicated Web site files, mirroring (all of the content), distributed databases
Server-initiated replicas:
– A process that can dynamically host a replica at the request of another server in the data store (push caches)
– A replica is created when a burst of requests arrives from a certain location
The algorithm:
– Replication takes place to reduce the load on a server
– Specific files on a server can be migrated to a server close to the clients that request them
Server-Initiated Replicas
Counting access requests from different clients.
E.g., a Web hosting service
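The counting idea can be sketched as a decision function with two thresholds. The thresholds, the shape of the access log, and the function name are assumptions made for this illustration; the slides only state that access requests are counted.

```python
from collections import Counter


def placement_decision(access_log, replicate_at, delete_below):
    """Decide what a server should do with one file.

    access_log: list of server names, one per request, naming the server
    closest to the client that issued the request (hypothetical input).
    replicate_at, delete_below: hypothetical thresholds, with
    replicate_at > delete_below.
    """
    counts = Counter(access_log)
    total = sum(counts.values())
    if total < delete_below:
        return ("delete", None)
    if total >= replicate_at:
        # Place a new replica near where most of the requests come from.
        busiest, _ = counts.most_common(1)[0]
        return ("replicate", busiest)
    return ("keep", None)
```

For example, `placement_decision(["P", "P", "Q"], replicate_at=3, delete_below=1)` suggests replicating toward server P, because most requests originate near P.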
Client-initiated replicas:
– Client caches
– Local storage used temporarily to store a copy of the data just requested
– Managing the cache is left to the client
– Access time improves when a cache hit occurs
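A client cache of this kind can be sketched in a few lines. The `ClientCache` class and the `fetch` callable are hypothetical; a real cache would also need a policy for eviction and for keeping copies fresh.

```python
class ClientCache:
    """Hypothetical client-side cache in front of a fetch function."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that contacts the server
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1       # cache hit: no server round trip
            return self._data[key]
        self.misses += 1
        value = self._fetch(key)
        self._data[key] = value  # keep a copy of the data just requested
        return value

    def invalidate(self, key):
        """Drop a stale copy; the client decides when to call this."""
        self._data.pop(key, None)
```

The second `get` for the same key is answered locally, which is where the improvement in access time comes from.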
Update propagation
Updates are initiated at a client
They are forwarded to one of the copies and then propagated to the other copies
Some design issues to consider in propagating the update.
State versus operations
Pull vs Push Protocol
Push-Based and Pull-Based Approaches
Push-based approach
– Also referred to as a server-based protocol
– Updates are propagated directly to the replicas without being requested
Pull-based approach
– Also referred to as a client-based protocol
– A client asks a server to send any updates it has at the moment
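The difference between the two approaches can be sketched with one server and two replicas. All class and method names here are hypothetical, and version numbers are an assumed way of detecting that a copy is stale.

```python
class Server:
    """Hypothetical server holding one versioned data item."""

    def __init__(self):
        self.version = 0
        self.value = None
        self.subscribers = []    # state kept only for push-based propagation

    def write(self, value):
        self.version += 1
        self.value = value
        # Push: send the update to every registered replica without a request.
        for replica in self.subscribers:
            replica.apply(self.version, self.value)

    def poll(self, known_version):
        # Pull: answer a client that asks whether anything is newer.
        if known_version < self.version:
            return (self.version, self.value)
        return None


class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        self.version, self.value = version, value

    def pull_from(self, server):
        update = server.poll(self.version)
        if update is not None:
            self.apply(*update)
```

A pushed replica is current as soon as `write` returns, at the cost of the server tracking its subscribers. A pulling replica stays stale until it next calls `pull_from`, but the server keeps no per-client state.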
Push versus Pull Protocols
A comparison between push-based and pull-based protocols in the case of multiple client, single server systems.
| Issue | Push-based | Pull-based |
|---|---|---|
| State of server | List of client replicas and caches | None |
| Messages sent | Update (and possibly fetch update later) | Poll and update |
| Response time at client | Immediate (or fetch-update time) | Fetch-update time |
Quorum-Based Protocols
Three examples of the voting algorithm:
a) A correct choice of read and write set
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)
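The three cases can be checked against the usual quorum constraints for N replicas with read quorum N_R and write quorum N_W: N_R + N_W > N and N_W > N/2. These constraints and the sample numbers used in the usage note are assumptions added for illustration, since the slide shows only the figure captions; `check_quorum` is a hypothetical helper.

```python
def check_quorum(n, n_read, n_write):
    """List the conflicts that a read/write quorum choice allows.

    Assumed constraints: n_read + n_write > n  and  n_write > n / 2.
    An empty list means the choice satisfies both.
    """
    problems = []
    if n_read + n_write <= n:
        problems.append("read-write conflict possible")
    if 2 * n_write <= n:
        problems.append("write-write conflict possible")
    return problems
```

With 12 replicas as an assumed example, `check_quorum(12, 3, 10)` and the ROWA choice `check_quorum(12, 1, 12)` report no problems, while `check_quorum(12, 7, 6)` reports a possible write-write conflict, because two disjoint groups of 6 replicas could each accept a different write.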
Fault Tolerance
Basic Concepts
Failure Models
Introduction
• A partial failure in a distributed system may happen when one component fails
– It may affect the operation of some components
– while leaving other components totally unaffected
• The design goal in a distributed system is to
– build a system that automatically recovers from a partial failure
– without seriously affecting overall performance
Basic Concepts
Dependability includes
– Availability
– Reliability
– Safety
– Maintainability
Availability
• The system is ready to be used immediately
• In general, the system is operating correctly at any given moment and is available to perform its functions
– Percentage of availability = (total elapsed time - sum of downtime) / total elapsed time × 100
• Example: 99.9%
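The availability formula translates directly into a small function. The function name and the choice of passing downtimes as a list are assumptions of this sketch; any time unit works as long as it is used consistently.

```python
def availability_percent(total_elapsed, downtimes):
    """Availability as a percentage of the total elapsed time.

    total_elapsed: length of the observation period.
    downtimes: list with the length of each outage, in the same unit.
    """
    if total_elapsed <= 0:
        raise ValueError("total_elapsed must be positive")
    return (total_elapsed - sum(downtimes)) / total_elapsed * 100
```

For example, two outages of half an hour each during 1000 hours of operation give an availability of roughly 99.9%.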
Reliability
• The system can run continuously without failure
• A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time
• One measure used to define a component's or system's reliability is mean time between failures (MTBF)
– MTBF = (total elapsed time - sum of downtime) / number of failures
• A related measurement is mean time to repair (MTTR): the average time interval (usually expressed in hours) that it takes to repair a failed component
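Both measures can be computed from a list of outage durations. The sketch assumes that each entry in `downtimes` corresponds to exactly one failure and that repair time equals the length of the outage; the function names are illustrative.

```python
def mtbf(total_elapsed, downtimes):
    """Mean time between failures: uptime divided by number of failures."""
    if not downtimes:
        raise ValueError("no failures recorded")
    return (total_elapsed - sum(downtimes)) / len(downtimes)


def mttr(downtimes):
    """Mean time to repair: average length of one outage."""
    if not downtimes:
        raise ValueError("no failures recorded")
    return sum(downtimes) / len(downtimes)
```

Under these assumptions, MTBF / (MTBF + MTTR) equals the fraction of time the system was up, which ties reliability back to availability.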
Safety
• Nothing catastrophic will happen if a system temporarily fails to operate correctly.
Maintainability
• Refers to how easily a failed system can be repaired
Terminology
Failure: When a component is not living up to its specifications, a failure occurs
Error: That part of a component's state that can lead to a failure
Fault: The cause of an error
Fault prevention: prevent the occurrence of a fault
Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults
Failure Models
Different types of failures.
| Type of failure | Description |
|---|---|
| Crash failure | A server halts, but is working correctly until it halts |
| Omission failure | A server fails to respond to incoming requests |
| – Receive omission | A server fails to receive incoming messages |
| – Send omission | A server fails to send messages |
| Timing failure | A server's response lies outside the specified time interval |
| Response failure | The server's response is incorrect |
| – Value failure | The value of the response is wrong |
| – State transition failure | The server deviates from the correct flow of control |
| Arbitrary failure | A server may produce arbitrary responses at arbitrary times |
Failure Models(cont)
Timing failures: The output of a component is correct, but lies outside a specified real-time interval (performance failures: too slow)
Response failures: The output of a component is incorrect
– Value failure: The wrong value is produced
– State transition failure: Execution of the component's service brings it into a wrong state
Failure Models(cont)
Crash failures: A component halts but behaves correctly before halting
Omission failures: A component fails to respond
– Receive omission: A server fails to receive incoming messages
– Send omission: A server fails to send messages
Arbitrary failures: A component may produce arbitrary output and be subject to arbitrary timing failures