
Page 1:

Efficient replica maintenance for distributed storage systems

Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris. 2006.

Page 2:

About the Paper

• 137 citations (Google Scholar)
• In Proceedings of the 3rd Conference on Networked Systems Design & Implementation (NSDI '06), Vol. 3, USENIX Association, Berkeley, CA, USA, 4-4.

• Part of the OceanStore project

Page 3:

Credit

• Modified version of http://nrl.iis.sinica.edu.tw/Web2.0/presentation/ERMfDSS.ppt

Page 4:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion

Page 5:

Related terms

• Distributed storage systems: storage systems that aggregate the disks of many nodes scattered throughout the Internet

Page 6:

Motivation

• One of the most important goals of a distributed storage system: robustness

• Solution? → replication

• Durability: objects that an application has put into the system are not lost due to disk failure

• Availability: a get will be able to return the object promptly

Page 7:

Motivation

• Transient failure → affects availability
  • Out of power
  • Scheduled maintenance
  • Network problems
• Permanent failure → affects durability
  • Disk failure

Page 8:

Contribution

• Develop and implement Carbonite, an algorithm that stores immutable objects durably and at low bandwidth cost in a distributed storage system

• Create replicas fast enough to handle failures
• Keep track of all replicas
• Use a model to determine a reasonable number of replicas

Page 9:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Model of the relationship between:

• Network capacity
• Amount of replicated data
• Number of replicas
• Durability

Page 10:

Providing Durability

• Durability is more practical and useful than availability

• Challenges to durability:
  • Create new replicas faster than they are lost
  • Reduce network bandwidth
  • Distinguish transient failures from permanent disk failures
  • Reintegration

Page 11:

Challenges to Durability

• Create new replicas faster than replicas are destroyed
  • Creation rate < failure rate → the system is infeasible
  • A higher number of replicas does not allow the system to survive a higher average failure rate
  • Creation rate = failure rate + ε (ε small) → bursts of failures may destroy all of the replicas

Page 12:

Number of Replicas as a Birth-Death Process

• Assumption: independent, exponentially distributed inter-failure and inter-repair times

• λf: average failure rate

• μi: average repair rate at state i

• rL: lower bound on the number of replicas; the target number of replicas needed to survive bursts of failures
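As a concrete illustration (my own sketch, not the paper's analytical treatment), a minimal Monte Carlo simulation of this birth-death chain, assuming the exponential inter-failure and inter-repair times stated above:

```python
import random

def survival_probability(lam, mu, r_l, years, trials=10_000):
    """Estimate Pr[object survives] under the birth-death model: each of the
    current replicas fails at rate lam, and while fewer than r_l replicas
    exist, a repair completes at rate mu."""
    survived = 0
    for _ in range(trials):
        replicas, t = r_l, 0.0
        while replicas > 0 and t < years:
            fail_rate = lam * replicas            # independent exponential failures
            repair_rate = mu if replicas < r_l else 0.0
            t += random.expovariate(fail_rate + repair_rate)
            if t < years and random.random() < fail_rate / (fail_rate + repair_rate):
                replicas -= 1                     # a "death": one replica lost
            elif t < years:
                replicas += 1                     # a "birth": a repair completes
        survived += replicas > 0
    return survived / trials
```

For example, `survival_probability(0.439, 3.0, 3, 4.0)` plugs in PlanetLab-style numbers like those on the next slides.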

Page 13:

Model Simplification

• With fixed μ and λ, the equilibrium number of replicas is Θ = μ/λ

• Higher values of Θ decrease the time it takes to repair an object

Page 14:

Example: PlanetLab

• PlanetLab trace
  • 490 nodes
  • Average inter-failure time: 39.85 hours
  • 150 KB/s bandwidth
• Assumptions
  • 500 GB per node
  • rL = 3

• λ = 365 / (490 × (39.85 / 24)) ≈ 0.439 disk failures / year

• μ = 365 / (500 GB × 3 / 150 KB/s, converted to days) ≈ 3 disk copies / year

• Θ = μ/λ ≈ 6.85


PlanetLab:
• A large research testbed for computer networking and distributed systems research, with nodes located around the world
• Different organizations each donate one or more computers


λ depends on:
• Number of nodes
• Frequency of failure

μ depends on:
• Amount of data
• Bandwidth

Θ = μ/λ = constant × bandwidth × #nodes × inter-failure time / (amount of data × rL)
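A back-of-the-envelope check of these numbers (a sketch; the slide's rounding differs slightly from the raw arithmetic):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

nodes = 490
inter_failure_hours = 39.85     # mean time between failures, system-wide
bandwidth_kb_s = 150            # per-node copy bandwidth, KB/s
data_gb = 500                   # data stored per node
r_l = 3

failures_per_year = 365 * 24 / inter_failure_hours
lam = failures_per_year / nodes                  # ~0.45 failures/disk/year
                                                 # (the slide rounds to 0.439)
copy_seconds = data_gb * r_l * 1024 * 1024 / bandwidth_kb_s   # KB / (KB/s)
mu = SECONDS_PER_YEAR / copy_seconds             # ~3 disk copies / year

print(mu / lam)                                  # Θ ≈ 6.7, vs. the slide's 6.85
```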

Page 15:

Impact of Θ


• Bandwidth ↑ → μ ↑ → Θ ↑

• rL ↑ → μ ↓ → Θ ↓

• Θ is the theoretical upper limit on the number of replicas the system can sustain
• If Θ < 1, the system can no longer maintain full replication, regardless of rL


Θ = constant × bandwidth / rL

Page 16:

Choosing rL

• Guidelines
  • Large enough to ensure durability
  • At least one more than the maximum burst of simultaneous failures
  • Small enough to avoid waste: if a low value of rL would suffice, the bandwidth spent maintaining extra replicas would be wasted

Page 17:

rL vs. Durability

• Higher rL means higher cost but tolerates larger bursts of failures

• Larger data size → λ ↑ → a higher rL is needed

[Figure: analytical results from 4 years of PlanetLab traces]

Page 18:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Proper placement of replicas on servers

Page 19:

Node’s Scope

• Definition: each node n designates a set of other nodes that can potentially hold copies of the objects that n is responsible for. The size of that set is the node's scope.

• scope ∈ [rL, N]

• N: total number of nodes in the system

Page 20:

Effect of Scope

• Small scope
  • Easy to keep track of objects
  • Needs more time to create new replicas
• Big scope
  • Reduces repair time, and thus increases durability
  • Needs to monitor more nodes
  • With a large number of objects and random placement, may increase the likelihood of simultaneous failures (see the sketch below)
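A minimal sketch of scope-bounded placement on a consistent-hashing ring; `ring_pos` and `scope_of` are hypothetical helpers, and the successor-list layout is an assumption on my part rather than a detail taken from these slides:

```python
import hashlib

def ring_pos(name: str) -> int:
    """Place a node on a consistent-hashing ring via SHA-1 (illustrative)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def scope_of(owner: str, nodes: list[str], scope: int) -> list[str]:
    """The `scope` successors of `owner` on the ring: the only nodes allowed
    to hold replicas of objects that `owner` is responsible for."""
    ring = sorted(nodes, key=ring_pos)
    i = ring.index(owner)
    return [ring[(i + k) % len(ring)] for k in range(1, scope + 1)]

nodes = [f"node{i}" for i in range(8)]
print(scope_of("node0", nodes, 3))   # small scope: few repair candidates
print(scope_of("node0", nodes, 7))   # big scope: repair spreads over more links
```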

Page 21:

Scope vs. Repair Time

• Scope ↑ → repair work is spread over more access links and completes faster

• rL ↓ → scope must be higher to achieve the same durability

Page 22:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Reducing bandwidth wasted due to transient failures:
• How to distinguish transient from permanent failures
• How to reintegrate replicas stored on nodes after transient failures

Page 23:

Reintegration

• Reintegrate replicas stored on nodes after transient failures

• System must be able to track all the replicas

Page 24:

Effect of Node Availability

• a: the average fraction of time that a node is available

• Pr[a new replica needs to be created] = Pr[fewer than rL replicas are available] = Σ_{i=0}^{rL-1} C(n, i) a^i (1-a)^(n-i), where n is the total number of replicas

• Chernoff bound: 2rL/a replicas are needed to keep rL copies available
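To make the bound concrete, a small sketch that evaluates the binomial expression above; the parameter values are illustrative, not from the slides:

```python
from math import comb

def prob_repair_needed(n: int, r_l: int, a: float) -> float:
    """Pr[fewer than r_l of n replicas are available], with each replica
    independently available with probability a (binomial model)."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(r_l))

# With r_l = 3 and a = 0.7, the 2*r_l/a ≈ 9 replicas suggested by the
# Chernoff bound make new-replica creation rare:
print(prob_repair_needed(9, 3, 0.7))   # ≈ 0.004
```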

Page 25:

Node Availability vs. Reintegration

• Reintegration can work safely with 2rL/a replicas

• 2/a is the penalty for not being able to distinguish transient from permanent failures

• rL = 3

Page 26:

Create replicas as needed

• The probability of making new replicas depends on a
• How can a be estimated? (see the sketch below)
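One simple estimator: average, over monitored nodes, the fraction of probes that succeeded. This is only a sketch under an assumed probe log format; the paper does not prescribe this exact scheme:

```python
def estimate_a(probes: dict[str, list[bool]]) -> float:
    """Average, over all nodes, of the fraction of probes that succeeded.
    probes[n] is a list of reachability samples for node n, taken at a
    fixed interval (hypothetical monitoring data)."""
    fractions = [sum(p) / len(p) for p in probes.values() if p]
    return sum(fractions) / len(fractions)

probes = {"node1": [True, True, False, True], "node2": [True, False, False, True]}
print(estimate_a(probes))  # 0.625
```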

Page 27:

Timeouts

• A heuristic to avoid misclassifying temporary failures as permanent

• Setting a timeout can reduce the response to transient failures, but its success depends greatly on its relationship to the downtime distribution, and in some instances it can reduce durability as well

Page 28:

Four Replication Algorithms

• Cates
  • Fixed number of replicas (rL)
  • Timeout

• Total Recall
  • Batch: in addition to rL replicas, make e additional copies so that repair runs less frequently

• Carbonite
  • Timeout + reintegration (a sketch follows below)

• Oracle
  • A hypothetical system that can differentiate transient failures from permanent failures
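A minimal sketch of a Carbonite-style maintenance loop; `reachable_replicas`, `nodes_in_scope`, and `copy_object` are hypothetical stand-ins for the system's monitoring and transfer machinery, not APIs from the paper:

```python
def maintain(objects, r_l):
    """Sketch: keep at least r_l reachable replicas of every object, and
    never delete replicas, so copies on recovering nodes are reintegrated."""
    for obj in objects:
        live = reachable_replicas(obj)        # nodes currently up that hold obj
        if live and len(live) < r_l:
            # Repair: copy onto an in-scope node that lacks a replica.
            targets = [n for n in nodes_in_scope(obj) if n not in live]
            if targets:
                copy_object(obj, src=live[0], dst=targets[0])
        # Replicas on nodes in transient failure are not recreated eagerly
        # (timeout heuristic), and they count toward r_l again once the
        # node returns (reintegration).
```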

Page 29:

Comparison

Page 30:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion


Challenges, such as node monitoring

Page 31:

Node Monitoring for Failure Detection

• Carbonite requires that each node know the number of available replicas of each object for which it is responsible

• The goal of monitoring is to allow nodes to track the number of available replicas

• Challenges (a toy sketch follows below):
  • Monitoring consistent-hashing systems
  • Monitoring host availability
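A toy heartbeat-based tracker to illustrate host-availability monitoring; the real protocol is more involved, and the class and its names here are hypothetical:

```python
import time

class Monitor:
    """Mark a node available iff a heartbeat arrived within `timeout` seconds."""
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        self.last_seen[node] = time.monotonic()

    def is_available(self, node: str) -> bool:
        seen = self.last_seen.get(node)
        return seen is not None and time.monotonic() - seen < self.timeout
```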

Page 32:

Outline

• Motivation
• Understanding durability
• Improving repair time
• Reducing transient costs
• Implementation
• Conclusion

Page 33:

Conclusion

• Described a set of techniques that allow wide-area systems to efficiently store and maintain large amounts of data

• Carbonite keeps all data durable while using only 44% more network traffic than a hypothetical system that responds only to permanent failures; in comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system

Page 34:


Questions or Comments?

Thanks!