
Page 1: CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture XIII: Replication-II

CMPT 401 Summer 2007

Dr. Alexandra Fedorova

Lecture XIII: Replication-II

Page 2: Outline

• Google File System – a real replicated file system
• Paxos – a consensus algorithm used in real systems
• Harp – a replicated research file system

Page 3: Google File System

• A real, massive distributed file system
• Hundreds of servers and clients
– The largest cluster has >1000 storage nodes, over 300 TB of disk storage, and hundreds of clients
• Metadata replication
• Data replication
• Design driven by application workload and technological environment
• Avoided many of the difficulties traditionally associated with replication by designing for a specific use case

Page 4: Specifics of the Google Environment

• FS is built of hundreds of storage machines, built of inexpensive commodity parts
• Component failures are the norm
– Application and OS bugs
– Human errors
– Hardware failures: disks, memory, network, power supplies
• Millions of files, each 100 MB or larger
• Multi-GB files are common
• Applications are written for GFS
• Allows co-design of the file system and applications

Page 5: Specifics of the Google Workload

• Most files are mutated by appending new data – large sequential writes
• Random writes are very uncommon
• Files are written once, then they are only read
• Reads are sequential
• Large streaming reads and small random reads
• High bandwidth is more important than low latency
• Google applications:
– Data analysis programs that scan through data repositories
– Data streaming applications
– Archiving
– Applications producing (intermediate) search results

Page 6: GFS Architecture

Page 7: GFS Architecture (cont.)

• Single master
• Multiple chunkservers
• Multiple clients
• Each is a commodity Linux machine; a server is a user-level process
• Files are divided into chunks
• Each chunk has a handle (an ID assigned by the master)
• Each chunk is replicated (on three machines by default)
• Master stores metadata, manages chunks, does garbage collection, etc.
• Clients communicate with the master for metadata operations, but with chunkservers for data operations
• No additional caching (besides the Linux in-memory buffer cache)

Page 8: Client/GFS Interaction

• Client:
– Takes a file name and offset
– Translates the offset into a chunk index within the file
– Sends a request to the master containing the file name and chunk index
• Master:
– Replies with the corresponding chunk handle and the locations of the replicas (the master must know where the replicas are)
• Client:
– Caches this information
– Contacts one of the replicas (i.e., a chunkserver) for the data (see the sketch below)
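A minimal sketch of this lookup path, assuming a 64 MB chunk size and hypothetical master/chunkserver stubs (lookup_chunk and read_chunk are illustrative names, not the real GFS RPC API):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (file_name, chunk_index) -> (handle, replica locations)

    def read(self, file_name, offset, length):
        chunk_index = offset // CHUNK_SIZE          # translate offset to chunk index
        key = (file_name, chunk_index)
        if key not in self.cache:
            # metadata request goes to the master...
            handle, replicas = self.master.lookup_chunk(file_name, chunk_index)
            self.cache[key] = (handle, replicas)
        handle, replicas = self.cache[key]
        # ...but the data itself is fetched from a chunkserver replica
        return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)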

Page 9: Master

• Stores metadata
– The file and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk's replicas
• Interacts with clients
• Creates chunk replicas
• Orchestrates chunk modifications across multiple replicas
– Ensures atomic concurrent appends
– Locks concurrent operations
• Deletes old files (via garbage collection)

Page 10: Metadata On Master

• Metadata – data about the data:
– File names
– Mapping of file names to chunk IDs
– Chunk locations
• Metadata is kept in memory
• File names and chunk mappings are also kept persistent in an operation log
• Chunk locations are kept in memory only
– They are lost if the master crashes
– The master asks chunkservers about their chunks at startup and builds a table of chunk locations (see the sketch below)
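A toy illustration of that split, with hypothetical field and method names (not GFS's actual data structures): file-to-chunk mappings are persisted via the operation log, while chunk locations are rebuilt from chunkserver reports at startup.

class MasterMetadata:
    def __init__(self):
        self.file_chunks = {}      # file name -> list of chunk handles (persisted via operation log)
        self.chunk_locations = {}  # chunk handle -> set of chunkservers (in memory only)

    def rebuild_locations(self, chunkservers):
        # Chunk locations are never read from disk; the master polls each
        # chunkserver for the chunks it holds and rebuilds the table.
        self.chunk_locations.clear()
        for cs in chunkservers:
            for handle in cs.report_chunks():   # hypothetical RPC
                self.chunk_locations.setdefault(handle, set()).add(cs)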

Page 11: Why Keep Metadata In Memory?

• To keep master operations fast
• The master can periodically scan its internal state in the background, in order to implement:
– Garbage collection
– Re-replication (in case of chunkserver failures)
– Chunk migration (for load balancing)
• But isn't the file system size limited by the amount of memory on the master?
– This has not been a problem for GFS – metadata is compact

Page 12: Why Not Keep Chunk Locations Persistent?

• Chunk location – which chunkserver has a replica of a given chunk
• The master polls chunkservers for that information on startup
• Thereafter, the master keeps itself up-to-date:
– It controls all initial chunk placement, migration and re-replication
– It monitors chunkserver status with regular HeartBeat messages
• Motivation: simplicity
• Eliminates the need to keep the master and chunkservers synchronized
• Synchronization would be needed when chunkservers:
– Join and leave the cluster
– Change names
– Fail and restart

Page 13: Operation Log

• Historical record of metadata changes
• Maintains the logical order of concurrent operations
• The log is used for recovery – the master replays it in the event of failures
• The master periodically checkpoints the log
• A checkpoint is a B-tree data structure
– Can be loaded into memory
– Used for namespace lookup without extra parsing
• Checkpointing can be done in the background

Page 14: Data Consistency in GFS

• Loose data consistency – applications are designed for it
• Applications may see inconsistent data – data is different on different replicas
• Applications may see data from partially completed writes – an undefined file region
• On successful modification the file region is consistent
• A write may leave the region undefined – if a client reads the file before another client's write is complete
• Replicas are not guaranteed to be bytewise identical (we'll see why later, and how clients deal with this)

Page 15: Data Consistency in GFS (cont.)

• Failures:
– A modification may fail at one or more replicas
– On modification failure, the file region is inconsistent
• Successes:
– Modifications are applied to a chunk in the same order on all replicas
– After a number of successful modifications, the file region is guaranteed to be defined:
• All replicas have the same data
• All replicas contain all the data written by all the write operations

Page 16: Implications of Loose Data Consistency For Applications

• Applications are designed to handle loose data consistency
• Example 1: a file is generated from beginning to end
– An application creates a file with a temporary name
– Atomically renames the file
– May periodically checkpoint the file while it is written
– The file is written via appends – more resilient to failures than random writes
• Example 2: producer-consumer file
– Many writers concurrently append to one file (for merged results)
– Each record is self-validating (contains a checksum)
– Clients filter out padding and duplicate records

Page 17: Updates of Replicated Data

• Each mutation (modification) is performed at all the replicas
• Modifications are applied in the same order across all replicas
• The master grants a chunk lease to one replica – i.e., the primary
• The primary picks a serial order for all mutations to the chunk
• The client pushes data to all replicas
• The primary tells the replicas in which order they should apply modifications

Page 18: Updates of Replicated Data (cont.)

1. Client asks the master for replica locations
2. Master responds
3. Client pushes data to all replicas; replicas store it in a buffer cache
4. Client sends a write request to the primary (identifying the data that has been pushed)
5. Primary forwards the request to the secondaries (identifying the order)
6. The secondaries respond to the primary
7. The primary responds to the client (see the sketch below)
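A minimal sketch of steps 3-7, with hypothetical Primary/Secondary classes standing in for chunkservers (the names and method signatures are illustrative, not GFS's actual interfaces):

class Secondary:
    def __init__(self):
        self.buffer = {}          # data_id -> bytes, staged before the write is ordered
        self.chunk = bytearray()

    def push_data(self, data_id, data):          # step 3: data flow
        self.buffer[data_id] = data

    def apply(self, data_id, offset):            # step 5: control flow, in primary's order
        data = self.buffer.pop(data_id)
        if len(self.chunk) < offset:             # pad if needed (replicas may contain padding)
            self.chunk.extend(b"\0" * (offset - len(self.chunk)))
        self.chunk[offset:offset + len(data)] = data
        return True                              # step 6: ack to the primary

class Primary(Secondary):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_offset = 0

    def write(self, data_id):                    # step 4: client's write request
        offset = self.next_offset                # the primary picks the serial order
        self.next_offset += len(self.buffer[data_id])
        self.apply(data_id, offset)
        acks = [s.apply(data_id, offset) for s in self.secondaries]
        return all(acks), offset                 # step 7: reply to the client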

Page 19: Failure Handling During Updates

• If a write fails at the primary:
– The primary may report failure to the client – the client will retry
– If the primary does not respond, the client retries from Step 1 by contacting the master
• If a write succeeds at the primary, but fails at several replicas:
– The client retries several times (Steps 3-7)

Page 20: Data Flow

• Data flow is decoupled from control flow
• Data is pushed linearly across all chunkservers in a pipelined fashion (not necessarily from client to primary and from primary to secondary)
• The client forwards data to the closest replica; that replica forwards to the next closest replica, etc.
• Pipelined fashion: while data is still incoming, the server begins forwarding it to the next replica
• This design ensures good network utilization

Page 21: Atomic Record Appends

• An atomic record append is a write – but GFS (the primary replica) chooses the offset where the append happens, and returns that offset to the client
• This way GFS can decide on a serial order of concurrent appends without client synchronization
• If an append fails at some replicas – the client retries
• As a result, the file may contain multiple copies of the same record, and replicas may be bytewise different
• But after a successful update all replicas will be defined – they will all have the data written by the client at the same offset

Page 22: Non-Identical Replicas

• Because of failed and retried record appends, replicas may be non-identical bytewise
• Some replicas may have duplicate records (because of failed and retried appends)
• Some replicas may have padded file space (empty space filled with junk) – if the primary chooses a record offset higher than the first available offset at a replica
• Clients must deal with this: they write self-identifying records so they can distinguish valid data from junk (see the sketch below)
• If they cannot tolerate duplicates, they must insert version numbers in records
• GFS pushes complexity to the client; without this, a complex failure-recovery scheme would need to be in place
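A toy sketch of the client-side convention described above (record framing, checksum validation, and duplicate filtering). The record layout is an assumption for illustration, not the format used by Google's applications:

import struct, zlib

def pack_record(record_id, payload):
    # Self-identifying record: id + length + checksum + payload.
    header = struct.pack("!QI", record_id, len(payload))
    return header + struct.pack("!I", zlib.crc32(header + payload)) + payload

def scan_records(data):
    seen, pos = set(), 0
    while pos + 16 <= len(data):
        record_id, length = struct.unpack_from("!QI", data, pos)
        (crc,) = struct.unpack_from("!I", data, pos + 12)
        end = pos + 16 + length
        if end > len(data) or crc != zlib.crc32(data[pos:pos + 12] + data[end - length:end]):
            pos += 1          # junk or padding: skip forward until a valid record is found
            continue
        if record_id not in seen:     # filter duplicates from retried appends
            seen.add(record_id)
            yield data[end - length:end]
        pos = end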

Page 23: Snapshot

• A copy of a file or a directory tree – used by applications for fast copies of data sets and for checkpointing
• Steps involved in snapshotting directory A (see the sketch below):
1. Master revokes leases on directory A
2. Master logs the operation to disk, then copies the metadata for A to A' in its memory: both A and A' point to the same files on disk
3. When a client wants to write to chunk C in A, the master defers replying to the client and creates a new chunk handle C'
4. Master asks each chunkserver that has a replica of C to create a copy in chunk C' – this ensures that copies are created locally, not over the network
5. All new client modifications go to chunk C'
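A compressed sketch of this copy-on-write idea, reusing the dictionary-based master metadata from the earlier sketch; revoke_leases, log, refcount, new_handle and clone_chunk are hypothetical helpers, not GFS calls:

def snapshot(master, src, dst):
    master.revoke_leases(src)                                  # step 1
    master.log("snapshot", src, dst)                           # step 2: log, then copy metadata
    master.file_chunks[dst] = list(master.file_chunks[src])    # both point to the same chunks

def chunk_for_write(master, file_name, i):
    handle = master.file_chunks[file_name][i]
    if master.refcount(handle) > 1:                  # chunk is shared with a snapshot
        new_handle = master.new_handle()             # steps 3-4: clone locally on each chunkserver
        for cs in master.chunk_locations[handle]:
            cs.clone_chunk(handle, new_handle)
        master.file_chunks[file_name][i] = new_handle
        handle = new_handle
    return handle                                    # step 5: new modifications go to the clone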

Page 24: Namespace Management and Locking

• Each file or directory has an associated read/write lock
• Each operation on the master acquires a set of read/write locks before it runs
• Read locks are acquired on all files/directories that are being accessed, i.e., each intermediate directory in /d1/d2/…/dn
• Write locks are acquired on:
– The snapshotted directory (to prevent creation of new files in the directory during the snapshot)
– The file name – when that file is created
• No write lock on the directory is needed for file creation – there is no directory inode to modify, so multiple file creations can be done concurrently (a lock-acquisition sketch follows below)
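A minimal sketch of acquiring locks over a path name, assuming a per-name lock table. Python's standard library has no reader/writer lock, so plain locks stand in for both; this illustrates the scheme described above, not GFS's implementation:

import threading

class Namespace:
    def __init__(self):
        self.locks = {}                    # full path name -> lock object
        self.table_lock = threading.Lock()

    def _lock_for(self, name):
        with self.table_lock:
            return self.locks.setdefault(name, threading.RLock())

    def acquire_for_create(self, path):
        # e.g. path = "/d1/d2/leaf": lock /d1 and /d1/d2 (read role), then /d1/d2/leaf (write role).
        parts = path.strip("/").split("/")
        names = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
        acquired = []
        for name in names:
            lock = self._lock_for(name)
            lock.acquire()                 # parents act as "read" locks, the leaf as the "write" lock
            acquired.append(lock)
        return acquired                    # caller releases these in reverse order when done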

Page 25: Garbage Collection

• File deletion is not done immediately – space from deleted files is garbage collected lazily
• When a file is deleted, the master logs the operation and renames the file to a hidden name
• During the regular metadata scan, the master deletes that file's metadata (after at least three days)
• During the regular scan of the chunk namespace, the master identifies orphaned chunks and deletes their metadata
• The master tells chunk replicas to delete orphaned chunks

Page 26: Load Balancing

• Goals:
– Maximize data availability and reliability
– Maximize network bandwidth utilization
• Google infrastructure:
– A cluster consists of hundreds of racks
– Each rack has a dozen machines
– Racks are connected by network switches
– A rack is on a single power circuit
• Must balance load across machines and across racks

Page 27: Creation, Re-replication, Rebalancing

• Creation (initial replica placement):
– On chunkservers with low disk space utilization
– Limit the number of recent creations on each chunkserver – recent creations mean heavy write traffic
– Spread replicas across racks
• Re-replication:
– When the number of replicas falls below the replication target
– When a chunkserver becomes unavailable
– When a replica becomes corrupted
– A new replica is copied directly from an existing one
• Rebalancing:
– The master periodically examines the replica distribution and moves replicas to meet load-balancing criteria

Page 28: Fault Tolerance

• Fast recovery
– No distinction between normal and abnormal shutdown
– Servers are routinely restarted by "killing" a server process
– Servers are designed for fast recovery – all state can be recovered from the log
• Chunk replication
• Master replication
• Data integrity
• Diagnostic tools

Page 29: Chunk Replication

• Each chunk is replicated on multiple chunkservers on different racks

• Users can specify different replication levels for different parts of the file namespace (default is 3)

• The master clones existing replicas as needed to keep each chunk fully replicated

Page 30: Single Master

• Simplifies the design
• The master can make sophisticated load-balancing decisions involving chunk placement using global knowledge
• To prevent the master from becoming the bottleneck:
– Clients communicate with the master only for metadata
– The master keeps metadata in memory
– Clients cache metadata
– File data is transferred from chunkservers

Page 31: Master Replication

• Master state is replicated on multiple machines, so a new server can become master if the old master fails
• What is replicated: operation logs and checkpoints
• A modification is considered successful only after it has been logged on all master replicas
• A single master is in charge; if it fails, it restarts almost instantaneously
• If the machine fails and the master cannot restart itself, a failure detector outside GFS starts a new master with a replicated operation log (no master election)
• Master replicas are the master's shadows – they operate similarly to the master w.r.t. updating the log, the in-memory metadata, and polling the chunkservers

Page 32: Data Integrity

• Disks often fail – this may cause data corruption
• Detect corrupt replicas by comparing with other chunkservers?
– Not a good idea – divergent replicas may be legal
• Each chunkserver verifies its own replicas using checksums
• Checksums are kept in memory and stored persistently in the log
• Small effect on read performance – checksums are kept in memory, and checksum computation can be overlapped with I/O
• Write performance: checksum computation is optimized for appends
• A checksum can be computed incrementally for a checksum block (64 KB)
• If corruption is detected, the master creates new replicas using data from correct chunks
• During idle periods chunkservers scan inactive chunks for corruption (see the sketch below)

Page 33: Detecting Stale Replicas

• A replica may become stale if it misses a modification while its chunkserver was down
• Each chunk has a version number; version numbers are used to detect stale replicas (see the sketch below)
• A stale replica will never be given to a client as a chunk location, and will never participate in a mutation
• A client may read from a stale replica (because the client caches metadata)
– But this window is limited, because cache entries time out
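A small illustration of the version check, assuming the master bumps a chunk's version whenever it grants a new lease; the field and method names are hypothetical:

def grant_lease(master, handle):
    master.version[handle] += 1            # new version for each new lease
    live = [cs for cs in master.chunk_locations[handle]
            if cs.chunk_version(handle) == master.version[handle]]
    master.chunk_locations[handle] = set(live)   # stale replicas are dropped from the location list
    return live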

Page 34: Diagnostic Tools

• GFS servers perform diagnostic logging
• Helps debugging and performance analysis
• Diagnostic logs record:
– Chunkservers going up and down
– All RPC requests and replies
• RPC requests and responses from different machines' logs can be collated and analyzed to determine the exact interaction between machines
• Logs are also used for load testing and performance analysis

Page 35: GFS Summary

• A real replicated file system
• Uses commodity hardware – hundreds of commodity PCs and disks
• Two levels of replication:
– Metadata is replicated via replicated masters
– Data is replicated on replicated chunkservers
• Designed for a specific use case – for Google applications
– And the applications are designed for GFS
• This is why it is simple and it actually works

Page 36: GFS Summary (cont.)

• Design philosophy:
– A replicated FS can't do all things right and do them all well:
– Strong data consistency?
– Identical replicas?
– Fast concurrent operations?
– That's too hard…
– So make a few operations fast and make them the common case
• Common-case operation – atomic appends
• Clients deal with weak consistency:
– Write self-identifying records
– Deal with duplicate records and padding
• Something to learn: if a generic design is too hard, design for your use case – that's your only hope!

Page 37: Outline

• Google File System – a real replicated file system
• Paxos – a consensus algorithm used in real systems
– Used in Chubby – Google's distributed lock service
– Why a consensus algorithm? Many replicated FSs use consensus algorithms
• Harp – a replicated research file system

Page 38: The Consensus Problem

• A collection of processes can propose values
• Only a single one of the proposed values must be chosen
• Three classes of agents:
– Proposers (propose values)
– Acceptors (accept values)
– Learners (learn the chosen value)
• System model:
– Asynchronous system
– Failstop failures

Page 39: Acceptors

• Naïve solution:
– A single acceptor
– Accepts the first proposed value it receives
– Problem: the algorithm cannot terminate if the acceptor fails
• So let's have multiple acceptors
– A value is chosen if a majority of acceptors accept it
• We want a value to be chosen even if only one value has been proposed, so we have a requirement:

P1: An acceptor must accept the first proposal that it receives

Page 40: Accepting More than One Proposal

• P1: An acceptor must accept the first proposal that it receives
• There is a problem:
– multiple proposers propose different values
– each acceptor has accepted a value
– no single value is accepted by a majority
• So we must allow an acceptor to accept more than one proposal
• We distinguish proposals by numbers: a proposal is a pair (n, v) of a proposal number n and a value v

Page 41: Choosing a Value

• A value is considered chosen when it has been accepted by a majority of acceptors
• But acceptors may accept many different proposals!
• We must ensure that all accepted proposals have the same value!

(Figure: acceptors A1-A3 accepting proposals with different numbers, e.g. 1, 4 and 5, but all with the same value X)

Page 42: Same Value for All Proposals

• We must ensure that all accepted proposals have the same value! So we have another requirement:

P2: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v

• This ensures that even if acceptors accept different proposals, the values will be the same

Page 43: Same Value for All Proposals (cont.)

(Figure: proposers P1-P3 and acceptors A1-A3; proposals numbered 1 and 2, both with value X, have already been accepted by some acceptors)

How does P3 learn X?

Page 44: Learning the "Right" Value for a Proposal

• A proposer decides to issue a proposal numbered n
• The proposer must learn the value of the highest-numbered proposal less than n, such that:
– That proposal has been accepted in the past, or
– That proposal will be accepted in the future
• Learning the proposals accepted thus far is easy – just ask around
• Predicting the future (which proposals will be accepted?) is hard
• So the proposer controls the future!
– It makes the acceptors promise not to accept any proposals numbered less than n

Page 45: Proposer-Acceptor Dialogue

Proposer: "Hey, what value have you accepted so far?"
Acceptor: "I accepted X, with proposal #5."
Proposer: "OK, do me a favour: don't accept any other proposals numbered < 5."
Acceptor: "You got it!"

Page 46: Algorithm at the Proposer

• A proposer chooses a proposal number n and sends a prepare request to some set of acceptors, asking each to respond with:
– The highest-numbered proposal < n that it has accepted
– A promise to never accept another proposal numbered < n
• If the proposer receives responses from a majority of acceptors, it chooses as the value v for its new proposal n the value of the highest-numbered proposal among the responses, and sends it to everyone
• If the responses say that the acceptors accepted no proposals, the proposer chooses any value v and issues proposal n
• Once v is chosen, the proposer sends an accept request carrying n and v

Page 47: Algorithm at the Acceptor

• An acceptor responds to a prepare request
• An acceptor responds to an accept request numbered n only if it has not responded to a prepare request with a number > n
• Several optimizations:
– An acceptor does not respond to a prepare request numbered n if it has already responded to a prepare request with a number > n (because it will not accept proposal n anyway)
– An acceptor ignores a prepare request numbered n if it has already accepted a proposal with a number > n

Page 48: The Entire Algorithm

• Phase 1:
a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors
b) An acceptor responds to the request (unless it knows to ignore it) with:
– A promise not to accept lower-numbered requests
– The highest-numbered proposal it has accepted so far
• Phase 2:
a) If the proposer receives responses to its prepare request from a majority, it learns (or chooses) the right v and sends an accept request to the acceptors
b) If an acceptor receives an accept request numbered n, it accepts the value unless it has promised another proposer not to accept a proposal with that number (see the sketch below)

Page 49: Let's Play Paxos

• We have two proposers, p1 and p2
• We have k acceptors, a1, …, ak
• Each person in class is either a proposer or an acceptor; I orchestrate the actions of proposers/acceptors
• We will use the following notation:
– PR(i) – prepare request for proposal i
– respPR(i, v) – response to PR(i) with previously accepted value v
– respPR(i, -) – response to PR(i) if no proposal had been accepted
– AR(i, v) – accept request for proposal i, value v
– respAR(i, v) – response accepting value v

Page 50: Ensuring Different Proposal Numbers

• Each new proposal must have a different proposal number
• How do different proposers ensure that they do not use the same numbers?
• They each draw from disjoint number sets:
– E.g., one uses even numbers, another one odd numbers, etc. (see the sketch below)
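One common way to carve out disjoint number sets, generalizing the even/odd example (an assumption for illustration, not something prescribed by the paper): interleave by proposer id, so proposer i uses numbers congruent to i modulo the number of proposers.

def next_proposal_number(last_n, proposer_id, num_proposers):
    # Smallest number greater than last_n that is congruent to proposer_id (mod num_proposers).
    n = last_n + 1
    n += (proposer_id - n) % num_proposers
    return n

# With 2 proposers this reduces to the even/odd scheme from the slide:
assert next_proposal_number(0, 0, 2) == 2   # proposer 0 uses 2, 4, 6, ...
assert next_proposal_number(0, 1, 2) == 1   # proposer 1 uses 1, 3, 5, ...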

Page 51: Learning the Chosen Value

• Learner – a process that learns which value has been chosen

• Whenever an acceptor accepts a value it sends a message to the learner, so the learner knows the chosen value

• For fault tolerance we can have multiple learners

Page 52: Making Progress

• A scenario in which no progress is made:
– Proposer p1 issues proposal number n1
– Proposer p2 issues proposal number n2 > n1; proposal n1 is not accepted
– Proposer p1 issues proposal number n3 > n2; proposal n2 is not accepted
– … and so on
• The paper suggests electing a distinguished proposer – only this proposer sends proposals, the others stay silent
• The distinguished proposer must be elected (and we can't use Paxos for that)
• Non-distinguished proposers must know if the distinguished proposer fails (and we know how easy that is in an asynchronous system)

Page 53: Paxos Implementation

• Choose a distinguished proposer
• An acceptor records its intended response in stable storage before sending the response
– In case of failure the acceptor knows what it has promised and accepted
• Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue
– So it does not issue different proposals with the same number

Page 54: Paxos Summary

• A consensus algorithm that tolerates failstop failures
• In an asynchronous system it eventually terminates if network and process failures are repaired
• The algorithm proceeds in rounds, so it can tolerate acceptor and proposer failures
• How is it better than other consensus algorithms we studied?
– Non-blocking
– Does not rely on a single coordinator (unlike two-phase commit)
– Multiple proposers can act concurrently without violating correctness
• Caveat: it needs a distinguished leader
– Must be elected
– Must detect when it fails so we can elect a new one

Page 55: Outline

• Google File System – a real replicated file system
• Paxos – a consensus algorithm used in real systems
• Harp – a replicated research file system

Page 56: Overview of Harp

• Uses primary copy replication for:
– Reliability
– Availability
• A single primary server, backups, and a witness
• Accessed via the NFS interface
• Performance was a concern – the operation log is kept in memory only:
– To guard against machine failures: other replicas will have the log in memory
– To guard against power failures: each machine has a UPS, so upon power failure there is time to flush the log to persistent storage

Page 57: Access via NFS Interface

(Figure: a user application talks to the OS NFS client, which talks to an NFS server backed by the replicated FS: primary, backup, and witness)

Page 58: Failover Transparent to Clients

(Figure: the NFS client sends requests to a single address, e.g. 192.168.51.2, behind which sit the primary, backup, and witness NFS servers)

• Data is sent to a multicast address
• It reaches all potential primaries
• It is discarded by hardware at all servers except the primary

Page 59: Goals and Environment of Harp

• Provide highly available file system service via replication
• Assume failstop failures
• Survive network partitions
• Assume a synchronous system (?) – probably, because they rely on timeouts when detecting node failures
• In many systems, replication caused performance degradation – replica communication slowed down the sending of the response to the client
• Harp's goal was to provide reliability and availability without performance loss

Page 60: Harp's Components

• In the presence of network partitions, a system must have 2n + 1 replicated components to survive n failures
• A quorum (a majority, n + 1 servers) gets to form a new group and elect a new primary
• Usually data is replicated on all 2n + 1 replicas
• In Harp, data is replicated on only n + 1 servers
• The other n servers are used only to create a quorum
• They are called witnesses

Page 61: Harp's Witness

• The backup and the primary cannot communicate – who should be the primary?
• Scenario 1: the witness resolves the tie in favor of the primary – data survives at the primary
• Scenario 2: the witness resolves the tie in favor of the backup – data survives at the backup

(Figure: the two partition scenarios, each showing primary, backup, and witness)

Page 62: Harp: Normal Operation

1. Client sends a request to the primary
2. Primary records the operation in its in-memory log
3. Primary forwards the request to the backup
4. Backup records the operation in its in-memory log
5. Backup responds to the primary
6. Primary "commits" the operation – marks it as committed in memory
7. Primary responds to the client
8. Primary tells the backup to commit (see the sketch below)
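A minimal sketch of this message flow, assuming one primary and one backup (the classes and method names are illustrative, not Harp's actual interfaces):

class Backup:
    def __init__(self):
        self.log = []                      # in-memory operation log
        self.committed = 0                 # commit pointer

    def replicate(self, op):               # steps 3-5
        self.log.append(op)
        return True                        # ack to the primary

    def commit(self, index):               # step 8 (piggybacked on later messages in practice)
        self.committed = index

class Primary(Backup):
    def __init__(self, backup):
        super().__init__()
        self.backup = backup

    def handle_request(self, op):          # steps 1-2
        self.log.append(op)
        if self.backup.replicate(op):      # steps 3-5
            self.committed = len(self.log) # step 6: commit in memory
            self.backup.commit(self.committed)   # step 8
            return "ok"                    # step 7: reply to the client
        return "retry"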

Page 63: In-Memory Logging

• Client operations are recorded in the in-memory logs (at the primary and at the backup) when the response is sent to the client
• Operations are applied to the file system later, in the background
• This is done to remove disk access from the critical path when communicating with the client
• What if the primary fails?
– That's okay, because the in-memory log survives at the backup
• What if there is a power failure?
– The machine will operate for a while on UPS – this time is used to apply the operations in the log to the file system

Page 64: Write-Behind Logging

• The log holds a sequence of event records (record n, n+1, n+2, …); several pointers mark how far each stage of processing has advanced:
– CP – commit pointer – the most recently committed event record
– AP – the most recently applied event record
– LB – the most recent event record that has reached the local disk
– GLB – the most recent event record that has reached the local disk at both primary and backup
• On failure the server restores the log and re-does all committed operations in the log (see the sketch below)
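A toy model of these pointers over an in-memory log, under the ordering that follows from the definitions above (GLB <= LB <= AP <= CP); this is an illustration, not Harp's data structure:

class WriteBehindLog:
    def __init__(self):
        self.records = []   # in-memory event records
        self.cp = 0         # commit pointer: most recently committed record
        self.ap = 0         # most recently applied record            (ap <= cp)
        self.lb = 0         # most recent record on the local disk    (lb <= ap)
        self.glb = 0        # most recent record on disk at both servers (glb <= lb);
                            # records up to glb could safely be dropped from memory

    def append_and_commit(self, op):
        self.records.append(op)
        self.cp = len(self.records)        # committed once the backup has acked (not shown)
        return self.cp

    def apply_in_background(self, fs):
        # Apply committed-but-unapplied records to the FS, off the client's critical path.
        while self.ap < self.cp:
            fs.apply(self.records[self.ap])
            self.ap += 1

    def recover(self, fs):
        # On restart: restore the log and re-do all committed operations.
        for op in self.records[:self.cp]:
            fs.apply(op)
        self.ap = self.cp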

Page 65: A Potential Failure Scenario

1. Primary receives an operation from the client
2. Primary forwards it to the backup
3. Backup records the operation in its log
4. Backup responds to the primary
5. Primary commits the operation
6. Primary responds to the client
7. Primary crashes

• The backup does not know whether the operation was committed
• Does it assume it was not committed and discard the log entry?
• Does it assume it was committed and apply the result?

Page 66: Handling Failures: View Changes

• View – the composition of the group and the roles of its members
• When some members fail, the view has to change
• A view change selects the members of the new view and makes sure that the state of the new view reflects all committed operations from previous views
• The designated primary and backup monitor other group members to detect changes in communication ability
• If they cannot communicate with some of the members, a view change is needed
• Either the primary or a backup can initiate a view change (the witness cannot)

Page 67: Causes and Outcomes of View Changes

• The primary fails, so a new primary is needed
– A backup will become the primary after a view change
• A backup fails, so someone else needs to replicate the state at the primary
– The witness is configured to act as a backup – the witness is promoted
• A primary that had failed comes back
– It will bring itself up-to-date (using the other servers' logs) and will become the primary again
• A backup that had failed comes back
– It will bring itself up-to-date; the previously promoted witness will no longer act as backup – the witness is demoted

Page 68: View Change: The Structure

• The node that starts the view change acts as the coordinator
• Phase 1:
– The coordinator tells the others it wants to start a view change
– The others stop processing any operations and send the coordinator their state, i.e., the log records the coordinator does not already have
– The coordinator applies the log records to bring itself up-to-date
• Phase 2:
– The coordinator writes the new view number to disk
– If both the backup and the witness responded, the witness will be demoted
– If only the witness responded, the witness will be promoted

Page 69: A Promoted Witness

• The witness does not have a copy of the file system state
• In the absence of failures the witness does not participate in the processing of file system operations
• If the witness is promoted, it begins participating in the processing of file system operations
• Two important differences:
– Since it has no copy of the file system, it does not apply changes to disk; it only records them in its log
– It never discards log records (so it can later help bring the failed server up-to-date)
– If the log gets large, old log entries are recorded on disk or tape
• When a witness is promoted it receives records of all operations that have not reached the disk at either the backup or the primary

Page 70: Optimizations for Fast View Changes

• User operations are not processed during a view change, so view changes must be fast
• A view change may be slow if the server that must bring itself up-to-date has to receive many log records from other servers
• Therefore, the server that must bring itself up-to-date in a new view (e.g., a primary that comes back after a failure) brings itself up-to-date before initiating the view change
• If the server's disk is intact, it gets log records from the witness
• If the disk is damaged, it gets the FS state from the backup and then gets log records from the witness

Page 71: Guarding Against a "Killer Packet"

• Many crashes are due to software bugs
• Some bugs may cause simultaneous failure at the primary and the backup – e.g., an OS bug triggered by a certain FS operation
• To guard against this, the backup waits to apply changes to the FS until they have been applied at the primary
• If the primary fails after applying a certain change, the backup will likely initiate a view change and send its log to the witness
• So even if the backup then fails after applying the same operation that crashed the primary, the record of that operation won't be lost

Page 72: Summary

• A primary-copy file system
• Unlike other replicated file systems, it provides good performance, because disk writes are not in the critical path
• Needs at least 2n + 1 participants to handle n failures
• Data is replicated on only n + 1 servers, to save disk space
• Questions for discussion:
– How the system behaves across view changes
– What happens if a component crashes during a view change?
– What happens to log records of uncommitted operations?