cs 347: distributed databases and transaction processing data replication

CS 347 Notes08 1

CS 347: Distributed Databases and Transaction Processing

Data Replication

Hector Garcia-Molina

CS 347 Notes08 2

Replication Space

• Updates
  – at any copy
  – at fixed (primary) copy
  – at one copy, but control can migrate
  – no updates

CS 347 Notes08 3

Replication Space

• Correctness
  – no consistency
  – local consistency
  – order preserving
  – serializable schedule
  – 1-copy serializability

CS 347 Notes08 4

Replication Space

• Expected Failures
  – processors: fail-stop, byzantine?
  – network: reliable, partitions, in-order msgs?
  – storage: stable disk?

CS 347 Notes08 5

Replication Space

• Implementation Details
  – update propagation
    – physical log records
    – logical log records
    – SQL updates
    – transactions
  – reads at backup?
  – architecture
    – cross backups
    – multi-computer copy
  – initialization of backup copy

CS 347 Notes08 6

Cross Backups

[Figure: site A holds the primary copy of DB1 and the backup copy of DB2; site B holds the primary copy of DB2 and the backup copy of DB1.]

CS 347 Notes08 7

Multi-Computer Sites

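[Figure: the primary site has computers P1, P2, P3, each with its own log and data (L1, X1; L2, X2; L3, X3). The backup site has matching computers B1, B2, B3 (L1’, Y1; L2’, Y2; L3’, Y3).]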

CS 347 Notes08 8

1-Safe Backups

– Transactions commit at primary
– Redo log records propagated
– Transactions commit at backup

[Figure: primary P1 (L1, X1) and backup B1 (L1’, Y1).]

CS 347 Notes08 9

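1-Safe Backups

– Transactions can get lost

[Figure: primary P1 (L1, X1) and backup B1 (L1’, Y1), drawn twice. Top: the primary has T1, T2, T3 and the backup has T1, T2. Bottom: the primary has T1, T2, T3 and the backup has T1, T2, T4, T5, so T3 is missing at the backup.]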
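The lost-transaction case on this slide can be replayed with a small in-memory model. This is only a sketch under simplifying assumptions: `Site`, `OneSafePair`, `ship_log`, and the `shipped` cursor are hypothetical names introduced for illustration, and a "log" is reduced to a list of transaction names.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One copy of the database, reduced to its list of committed transactions."""
    name: str
    committed: list = field(default_factory=list)


@dataclass
class OneSafePair:
    """Primary commits locally; redo records reach the backup asynchronously."""
    primary: Site
    backup: Site
    shipped: int = 0  # number of primary log records already sent to the backup

    def commit(self, txn: str) -> None:
        # 1-safe: the commit returns as soon as the primary has logged it.
        self.primary.committed.append(txn)

    def ship_log(self) -> None:
        # Propagate any redo records the backup has not seen yet.
        self.backup.committed.extend(self.primary.committed[self.shipped:])
        self.shipped = len(self.primary.committed)

    def lost_if_primary_fails(self) -> list:
        # Transactions committed at the primary but not yet at the backup.
        return self.primary.committed[self.shipped:]


pair = OneSafePair(Site("P1"), Site("B1"))
pair.commit("T1")
pair.commit("T2")
pair.ship_log()
pair.commit("T3")                      # committed at P1, not yet shipped
lost = pair.lost_if_primary_fails()    # assume the primary crashes here
pair.backup.committed += ["T4", "T5"]  # backup takes over and runs new work
print(lost, pair.backup.committed)
```

The window between `commit` and `ship_log` is where a primary failure loses work in this model.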

CS 347 Notes08 10

2-Safe Backups

– Transactions do 2-phase commit
– Redo log records propagated in prepare
– Transactions not lost, but
  • longer delay, contention
  • cannot process unless both sites are up
– After failure, go to 1-safe (no backup)

[Figure: primary P1 (L1, X1) and backup B1 (L1’, Y1).]
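A rough sketch of the 2-safe behaviour listed above, assuming a single primary and a single backup held in memory. The classes `Primary` and `Backup`, the `mode` field, and the return strings are hypothetical; a real system would run a full two-phase commit with logging and timeouts.

```python
class Backup:
    def __init__(self):
        self.up = True
        self.prepared = {}   # txn -> redo records received during prepare
        self.committed = []

    def prepare(self, txn, redo):
        # Redo records travel with the prepare message.
        if not self.up:
            return False
        self.prepared[txn] = redo
        return True

    def commit(self, txn):
        self.prepared.pop(txn)
        self.committed.append(txn)


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.committed = []
        self.mode = "2-safe"

    def run(self, txn, redo):
        if self.mode == "2-safe":
            if self.backup.prepare(txn, redo):
                # Both sites hold the redo records, so the commit cannot be lost.
                self.committed.append(txn)
                self.backup.commit(txn)
                return "committed (2-safe)"
            # Backup unreachable: continue in 1-safe mode with no backup.
            self.mode = "1-safe"
        self.committed.append(txn)
        return "committed (1-safe)"


backup = Backup()
primary = Primary(backup)
first = primary.run("T1", ["redo for T1"])
backup.up = False                      # backup site fails
second = primary.run("T2", ["redo for T2"])
print(first, second, primary.mode)
```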

CS 347 Notes08 11

What is Correctness?

• In 2-safe
• In 1-safe

CS 347 Notes08 12

What is in Paper You Read?

• Specific Scenario
  – updates at fixed primary site
  – each site has multiple computers
  – primary-backup sites are matched
  – clean site failures; stable storage; reliable network
  – log shipping
  – no reads at backup
  – no initialization

CS 347 Notes08 13

Main Problem: Update Dependencies

[Figure: primary site (P1 with L1, X1; P2 with L2, X2) and backup site (B1 with L1’, Y1; B2 with L2’, Y2). Transactions shown at the primary site: Ta(1), Ta(2), Tb.]

data dependency: Ta → Tb

CS 347 Notes08 14


[Figure: same primary and backup sites. At the primary: Ta(1), Ta(2), Tb. At the backup: Ta(1) and Tb are shown, plus a "?".]

CS 347 Notes08 15


[Figure: same primary and backup sites. At the primary: Ta(1), Ta(2), Tb. At the backup: Ta(1) and Tb are shown, plus a "?".]

• should not install Ta
• should not install Tb

CS 347 Notes08 16

Dependency Reconstruction Algorithm

• Locking at backup to detect dependencies

• Ensure locks granted in same order as they were granted at primary

CS 347 Notes08 17

Example: Dependency Reconstruction

[Figure: same primary and backup sites. At the primary: Ta(1), Ta(2), Tb, with tickets 5, 6 and 18.]

tickets reflect local commit order

CS 347 Notes08 18


[Figure: same sites and tickets (5, 6; 18) at the primary. At the backup: Ta(1) with ticket 5 and Tb with ticket 6 are shown, plus a "?".]

CS 347 Notes08 19


[Figure: same sites and tickets (5, 6; 18) at the primary. At the backup: Ta(1) with ticket 5 and Tb with ticket 6 are shown, plus a "?".]

• Say Tb requests lock first at B1;
• Tb request delayed until all locks with tickets <6 have been granted
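The delay rule in the last bullet can be sketched as a small ticket-ordered grant queue. This is an illustrative simplification, not the paper's algorithm: it assumes one ticket per transaction per node and consecutive ticket numbers at each node, and `TicketOrderedLocks` and `next_ticket` are hypothetical names.

```python
import heapq


class TicketOrderedLocks:
    """Grants lock requests at one backup node strictly in ticket order."""

    def __init__(self, next_ticket):
        self.next_ticket = next_ticket  # smallest ticket not yet granted
        self.waiting = []               # min-heap of (ticket, txn)
        self.granted = []               # txns in the order they were granted

    def request(self, ticket, txn):
        heapq.heappush(self.waiting, (ticket, txn))
        # Grant every request whose predecessors have all been granted.
        while self.waiting and self.waiting[0][0] == self.next_ticket:
            _, ready = heapq.heappop(self.waiting)
            self.granted.append(ready)
            self.next_ticket += 1


b1 = TicketOrderedLocks(next_ticket=5)
b1.request(6, "Tb")
after_tb = list(b1.granted)     # Tb must wait: ticket 5 is still outstanding
b1.request(5, "Ta(1)")
print(after_tb, b1.granted)
```

Because Tb is held until Ta(1) has been granted, the backup reproduces the order in which the primary granted the locks.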

CS 347 Notes08 20

Epoch Algorithm

• Backup updates are installed in batches

• Epoch delimiters written on log

CS 347 Notes08 21

Writing Delimiters at Primary

[Figure: log timelines for a master and two slaves; each log carries epoch delimiters 15 and 16 along a "log time" axis.]

CS 347 Notes08 22

Problem with Commits

[Figure: log timelines for a master and two slaves with epoch delimiters 15 and 16; the prepare and commit of transaction T are marked.]

T’s commit record in Epoch 15 in some logs; in Epoch 16 in others

CS 347 Notes08 23

Solution: Bump Epoch

[Figure: log timelines for a master and two slaves with epoch delimiters 15 and 16; the prepare and commit of transaction T are marked.]

prepare ack reports epoch number; coordinator bumps epoch if necessary
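One plausible reading of the bump rule, sketched under stated assumptions: the commit epoch is taken as the largest epoch among the coordinator and the reported prepare acks, and any node that is behind writes the missing delimiters before logging the commit. The names `Node`, `bump_to`, and `commit_with_epoch_bump` are hypothetical, and the sketch ignores the case where a node moves to a later epoch before the commit message arrives.

```python
class Node:
    def __init__(self, name, epoch):
        self.name = name
        self.epoch = epoch
        self.log = []  # ("epoch", n) delimiters and ("commit", txn) records

    def bump_to(self, epoch):
        # Write delimiters until this node's log has reached the given epoch.
        while self.epoch < epoch:
            self.epoch += 1
            self.log.append(("epoch", self.epoch))

    def prepare_ack(self):
        return self.epoch  # the ack carries the participant's current epoch

    def write_commit(self, txn, epoch):
        self.bump_to(epoch)
        self.log.append(("commit", txn))
        return self.epoch


def commit_with_epoch_bump(coordinator, participants, txn):
    reported = [p.prepare_ack() for p in participants]
    commit_epoch = max([coordinator.epoch] + reported)
    # Every node logs the commit record in the same epoch.
    return {n.name: n.write_commit(txn, commit_epoch)
            for n in [coordinator] + participants}


coord = Node("coord", 15)
part = Node("part", 16)
epochs = commit_with_epoch_bump(coord, [part], "T")
print(epochs, coord.log, part.log)
```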

CS 347 Notes08 24

Installing an Epoch at Backup

[Figure: backup log timelines for a master and two slaves with epoch delimiters 15 and 16; annotations: "end of 16" (twice) and "install 16".]

CS 347 Notes08 25

To Install Epoch X at Backup J

• Redo transactions:
  – If commit(T) ≤ X, commit T
  – If prepare(T) ≤ X but commit(T) > X:
    • If T’s primary peer was coordinator, do not commit;
    • Else check with the backup of T’s coordinator B’:
      – If B’ committing T in epoch X, then we commit T
      – Else do not commit T
  – Otherwise do not commit T (defer to next epoch)

commit(T) ≤ X means that T’s commit record is found in epoch X (or earlier) at node J.
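The decision rules above can be written as a single function. This is a sketch only: `install_decision` and its parameters are hypothetical names, the query to B’ is modeled as a callback, and a record that has not been seen at J (`None`) is treated the same as one in a later epoch, which is an assumption.

```python
from typing import Callable, Optional


def install_decision(x: int,
                     prepare_epoch: Optional[int],
                     commit_epoch: Optional[int],
                     peer_was_coordinator: bool,
                     coordinator_backup_commits: Callable[[], bool]) -> bool:
    """Return True if backup J commits T while installing epoch x.

    prepare_epoch / commit_epoch: epoch in which J's log holds T's prepare /
    commit record, or None if the record has not been seen.
    """
    if commit_epoch is not None and commit_epoch <= x:
        return True                          # commit(T) <= X
    if prepare_epoch is not None and prepare_epoch <= x:
        # prepare(T) <= X but commit(T) > X (or not yet seen)
        if peer_was_coordinator:
            return False
        return coordinator_backup_commits()  # ask B'
    return False                             # defer to next epoch


# T prepared in epoch 15, commit record seen in epoch 16, J installing epoch 15
print(install_decision(15, 15, 16, False, lambda: True))
print(install_decision(15, 15, 16, False, lambda: False))
```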

CS 347 Notes08 26

Why Do We Need Coordinator Check?

• Assignment: Construct 2 scenarios that look the same to backup J:
  – In Scenario 1, T should be installed
  – In Scenario 2, T should not be installed

CS 347 Notes08 27

Scenario 1

[Figure: log timelines for B’ and a slave with epoch delimiters 15 and 16; P(T) and C(T) are marked in each log.]

CS 347 Notes08 28

Scenario 2

[Figure: log timelines for B’ and a slave with epoch delimiters 15 and 16; P(T) and C(T) are marked in each log.]

CS 347 Notes08 29

Scenario 3: Possible?

[Figure: log timelines for B’ and a slave with epoch delimiters 15, 16 and 17; P(T) and C(T) are marked in each log.]

Note that T commits at slave but not at B’!!

CS 347 Notes08 30

Scenario 4: Possible?

[Figure: log timelines for B’ and a slave with epoch delimiters 15, 16 and 17; P(T) and C(T) are marked in each log.]

Note that T commits at B’ but not at slave!!

CS 347 Notes08 31

Comparison of Options

• 2-safe
• 1-safe
  – dep reconstruction
  – epoch

• Specific Scenario
  – updates at fixed primary site
  – each site has multiple computers
  – primary-backup sites are matched
  – clean site failures; stable storage; reliable network
  – log shipping
  – no reads at backup
  – no initialization

CS 347 Notes08 32

How to Evaluate

• What system?
  – actual system(s)
  – simulation
  – testbed

• What transactions?
  – real transactions
  – synthetic transactions

CS 347 Notes08 33

Metrics

• IO utilization
• CPU utilization
• Throughput (given max delay?)
• Transaction commit delay
• Backup copy lag
• Network overhead
• Probability of inconsistency

CS 347 Notes08 34

Sample Results

CS 347 Notes08 35

Sample Results

CS 347 Notes08 36

And Now For Something Completely Different:


have seen

next: available copies

CS 347 Notes08 37

PC-lock available copies

• Transactions write lock at all available copies
• Transactions read lock at any available copy
• Primary site (static) manages U, the set of available copies

[Figure: copies X1, X2, X3, X4, with the labels "*", "down" and "primary".]

CS 347 Notes08 38

Update Transaction

(1) Get U from primary
(2) Get write locks from U nodes
(3) Commit at U nodes

[Figure: C0 (primary), C1 (backup), C2 (backup). Transaction T3 obtains U={C0, C1} from the primary, then sends updates and runs 2PC.]

CS 347 Notes08 39

A potential problem - example

Now: U={C0, C1}

[Figure: C0 (primary), C1 (backup), C2 (backup, recovering). C2 announces "I am recovering".]

CS 347 Notes08 40

A potential problem - example

Later: U={C0, C1, C2}

[Figure: C0 (primary), C1 (backup), C2 (backup, recovering). C2 is told "You missed T0, T1, T2"; two "T3 updates" messages are shown.]

CS 347 Notes08 41

Solution:

• Initially transaction T gets a copy U’ of U from primary (or uses cached value)

• At commit of T, check U’ against the current U at primary (if different, abort T)

CS 347 Notes08 42

Solution Continued

• When CX recovers:
  – request missed and pending transactions from primary (primary updates U)
  – set write locks for pending transactions

• Primary polls nodes to detect failures (updates U)
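The commit-time check of U’ against U can be illustrated with a toy model. It is a sketch under simplifying assumptions: locking and 2PC are omitted, `PrimaryDirectory`, `run_update`, and the `during` hook are hypothetical names, and each copy is just a list of applied transactions.

```python
class PrimaryDirectory:
    """Primary site's view of U, the set of available copies."""

    def __init__(self, available):
        self.u = frozenset(available)

    def get_u(self):
        return self.u

    def node_recovered(self, node):
        self.u = self.u | {node}

    def node_failed(self, node):
        self.u = self.u - {node}


def run_update(primary, txn, copies, during=None):
    u_prime = primary.get_u()          # (1) get U' from primary
    # (2) write locks would be requested at every node in U' here
    if during:
        during()                       # events that happen while T is running
    if u_prime != primary.get_u():     # commit-time check of U' against U
        return "abort"
    for node in u_prime:               # (3) commit at the U' nodes
        copies[node].append(txn)
    return "commit"


primary = PrimaryDirectory({"C0", "C1"})
copies = {"C0": [], "C1": [], "C2": []}
# C2 recovers while T3 is running with the stale U' = {C0, C1}
r1 = run_update(primary, "T3", copies, during=lambda: primary.node_recovered("C2"))
r2 = run_update(primary, "T3", copies)  # retry sees U = {C0, C1, C2}
print(r1, r2, copies)
```

Aborting on a changed U keeps T3 from committing at C0 and C1 only, which would leave the recovered C2 without T3's updates.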

CS 347 Notes08 43

Example Revisited

[Figure: C0 (primary), C1 (backup), C2 (backup, recovering). C2 sends "I am recovering" and is told "You missed T0, T1, T2"; U changes from {C0, C1} to {C0, C1, C2}. Two prepare messages are shown, followed by a reject.]

CS 347 Notes08 44

Available Copies — No Primary

• Let all nodes have a copy of U (not just primary)

• To modify U, run a special atomic transaction at all available sites (use commit protocol)
  – E.g.: U1={C1, C2} → U2={C1, C2, C3}; only C1, C2 participate in this transaction
  – E.g.: U2={C1, C2, C3} → U3={C1, C2}; only C1, C2 participate in this transaction
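Both examples are consistent with a simple rule: the nodes that run the U-change transaction are those present in both the old and the new U. That rule is an inference from the two examples, not something the slide states, and `u_change_participants` is a hypothetical helper.

```python
def u_change_participants(old_u, new_u):
    """Nodes available both before and after the change run the U-change transaction."""
    return set(old_u) & set(new_u)


# U1 -> U2: C3 is joining, so it does not take part
adding = u_change_participants({"C1", "C2"}, {"C1", "C2", "C3"})
# U2 -> U3: C3 is being dropped, so it does not take part
removing = u_change_participants({"C1", "C2", "C3"}, {"C1", "C2"})
print(sorted(adding), sorted(removing))
```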

CS 347 Notes08 45

• Details are tricky...
• What if commit of U-change blocks?

CS 347 Notes08 46

Node Recovery (no primary)

• Get missed updates from any active node
• No unique sequence of transactions
• If all nodes fail, wait for
  – all to recover
  – majority to recover

CS 347 Notes08 47

Example

[Figure: three nodes, one of them labeled "recovering node", with these states:
  – Committed: A,B,C,D,E,F; Pending: G
  – Committed: A,C,B,E,D; Pending: F,G,H
  – Committed: A,B]

How much information (update values) must be remembered? By whom?

CS 347 Notes08 48

Correctness with replicated data

S1: r1[X1] r2[X2] w1[X1] w2[X2]

Is this schedule serializable?

[Figure: copies X1 and X2.]

CS 347 Notes08 49

One copy serializable (1SR)

A schedule S on replicated data is 1SR if it is equivalent to a serial history of the same transactions on a one-copy database

CS 347 Notes08 50

To check 1SR

• Take schedule
• Treat ri[Xj] as ri[X] and wi[Xj] as wi[X] (Xj is a copy of X)
• Compute P(S)
• If P(S) acyclic, S is 1SR
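The four steps above can be sketched in a few lines. This is an illustration of the slide's test only: `parse`, `precedence_edges`, `is_acyclic`, and `passes_1sr_test` are hypothetical helper names, and the parser assumes the `r1[X1]` notation used in these notes, where a copy is an item name followed by a number.

```python
import re
from itertools import combinations


def parse(schedule):
    """Parse 'r1[X1] w2[X2]' into (op, txn, item), dropping the copy index."""
    ops = []
    for op, txn, item in re.findall(r"([rw])(\d+)\[([A-Za-z]+)\d*\]", schedule):
        ops.append((op, int(txn), item))
    return ops


def precedence_edges(ops):
    """Edge Ti -> Tj for each conflicting pair where Ti's operation comes first."""
    edges = set()
    for (op1, t1, x1), (op2, t2, x2) in combinations(ops, 2):
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def is_acyclic(edges):
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        # Repeatedly remove nodes that have no incoming edge.
        free = [n for n in nodes if all(dst != n for _, dst in remaining)]
        if not free:
            return False  # every remaining node lies on or behind a cycle
        nodes -= set(free)
        remaining = {(s, d) for s, d in remaining if s not in free}
    return True


def passes_1sr_test(schedule):
    return is_acyclic(precedence_edges(parse(schedule)))


s1 = "r1[X1] r2[X2] w1[X1] w2[X2]"
s2 = "r1[X1] w1[X1] w1[X2] r2[X1] w2[X1] w2[X2]"
print(passes_1sr_test(s1), passes_1sr_test(s2))
```

The two schedules are the ones used as examples in these notes.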

CS 347 Notes08 51

Example

S1: r1[X1] r2[X2] w1[X1] w2[X2]
S1’: r1[X] r2[X] w1[X] w2[X]

P(S1): T1 → T2, T2 → T1

S1 is not 1SR!

CS 347 Notes08 52

Second example

S2: r1[X1] w1[X1] w1[X2]

r2[X1] w2[X1] w2[X2]

S2’: r1[X] w1[X] w1[X]

r2[X] w2[X] w2[X]

P(S2): T1 → T2

S2 is 1SR

CS 347 Notes08 53

Second example

S2: r1[X1] w1[X1] w1[X2]

r2[X1] w2[X1] w2[X2]

S2’: r1[X] w1[X] w1[X]

r2[X] w2[X] w2[X]

• Equivalent serial schedule

SS: r1[X] w1[X]

r2[X] w2[X]

CS 347 Notes08 54

Summary


have seen

available copies
