
Page 1: Failure Independence in OceanStore Archive

Hakim Weatherspoon, University of California, Berkeley

Page 2: Questions About Information

• Where is persistent information stored?
– Want: geographic independence for availability, durability, and the freedom to adapt to circumstances.

Page 3: m of n Encoding

[Figure: a data object is encoded into redundant fragments; the legend distinguishes fragments received from fragments not received.]

• Redundancy without the overhead of replication.
• Divide the object into m fragments.
• Recode into n fragments.
• A rate r = m/n code increases storage by a factor of 1/r.
• Key: reconstruct from any m fragments (see the sketch below).
• E.g., r = 1/4, m = 16, n = 64 fragments; storage increases by a factor of four.
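A minimal sketch of the reconstruct-from-any-m property, using a Reed-Solomon-style polynomial code over the prime field GF(257). The talk does not specify which code the archive uses; this is just one standard construction with that property, for illustration only.

P = 257  # prime field GF(257); each symbol is a byte value 0..255

def encode(data, n):
    # Treat the m data symbols as the coefficients of a degree-(m-1)
    # polynomial and evaluate it at n distinct nonzero points (n < P):
    # each (point, value) pair is one fragment.
    assert len(data) <= n < P
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(data)) % P)
            for x in range(1, n + 1)]

def decode(fragments, m):
    # Recover the polynomial's m coefficients from ANY m fragments
    # by Lagrange interpolation over GF(257).
    pts = fragments[:m]
    coeffs = [0] * m
    for j, (xj, yj) in enumerate(pts):
        basis, denom = [1], 1          # basis polynomial, low order first
        for k, (xk, _) in enumerate(pts):
            if k != j:
                basis = ([(-xk * basis[0]) % P] +
                         [(basis[i - 1] - xk * basis[i]) % P
                          for i in range(1, len(basis))] +
                         [basis[-1]])  # multiply basis by (x - xk)
                denom = denom * (xj - xk) % P
        inv = pow(denom, P - 2, P)     # modular inverse via Fermat
        for i in range(m):
            coeffs[i] = (coeffs[i] + yj * inv * basis[i]) % P
    return coeffs

data = [7, 1, 4, 2]                         # m = 4 symbols
frags = encode(data, 16)                    # n = 16, rate r = m/n = 1/4
assert decode(frags[5:9], len(data)) == data  # any 4 fragments suffice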

Page 4: Assumptions

• OceanStore: a collection of independently failing disks.
• Failed disks are replaced by new, blank ones.
• Each fragment of a given block is placed on a unique, randomly selected disk.
• A repair epoch: the time between global sweeps, in which a repair process scans the system and attempts to restore redundancy (simulated in the sketch below).
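A minimal sketch of this failure-and-repair model, assuming each fragment's disk fails independently within an epoch with probability q; both q and the choice to restore full redundancy in a single sweep are illustrative stand-ins, not figures from the talk.

import random

def epochs_until_loss(q, m, n, max_epochs=10**6):
    # Each epoch: every surviving fragment's disk fails w.p. q;
    # the repair sweep then restores the block to n fragments,
    # but only if it is still recoverable (>= m fragments alive).
    alive = n
    for epoch in range(1, max_epochs + 1):
        alive -= sum(random.random() < q for _ in range(alive))
        if alive < m:
            return epoch               # block lost before repair
        alive = n                      # sweep restores redundancy
    return max_epochs

print(epochs_until_loss(0.3, 2, 4))    # toy parameters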

Page 5: Availability

• Exploit statistical stability from a large number of components.
• E.g., given a million machines, each 90% available (computed in the sketch below):
– 2 replicas yield 2 9's of availability.
– 16 fragments yield 5 9's of availability.
– 32 fragments yield 8 9's of availability.
• "More than 6 9's of availability requires world peace." – Steve Gribble, 2001.
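A minimal sketch of the binomial calculation behind these figures, assuming independent servers with availability p = 0.9 and, for the fragment cases, a rate-1/2 code so that a block needs any m = n/2 of its n fragments; the rate-1/2 assumption is inferred from the numbers, not stated on the slide.

from math import comb

def availability(p, m, n):
    # P(at least m of n independently available fragments are up).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(m, n + 1))

p = 0.9
print(availability(p, 1, 2))     # 2 replicas: 0.99 (2 nines)
print(availability(p, 8, 16))    # 16 fragments: ~5 nines
print(availability(p, 16, 32))   # 32 fragments: ~8 nines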

Page 6: Durability

• E.g., MTTF_block = 10^35 years for a particular block.
– n = 64, r = 1/4, and repair epoch e = 6 months.
– MTTF_block = 35 years for replication, at the same storage cost and repair epoch!
• Need 36 replicas to reach MTTF_block = 10^35 years (see the sketch after the figure).

[Figure: MTTF vs. repair epoch. X-axis: repair epoch, 0-12 months; y-axis: MTTF in years, log scale from 10^0 to 10^60; one curve per configuration n = 4, 8, 16, 32, 64, all at rate r = 1/4.]
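A minimal sketch of the kind of estimate behind this plot, assuming (as the slides do) independently failing disks: if each fragment's disk is lost within an epoch with probability q, the block dies when more than n - m fragments fail before the sweep, so MTTF_block ≈ epoch / P(death in one epoch). The value of q is an assumed stand-in, not a number from the talk.

from math import comb

def p_death(q, m, n):
    # P(more than n - m of the n fragments fail within one epoch).
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(n - m + 1, n + 1))

def mttf_years(q, m, n, epoch_months):
    return (epoch_months / 12) / p_death(q, m, n)

print(f"{mttf_years(0.05, 16, 64, 6):.2e}")   # n = 64, r = 1/4, e = 6 months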

Page 7: Erasure Coding vs. Replication

• Fix storage overhead and repair epoch.
– MTTF for erasure codes is orders of magnitude higher.
• Fix MTTF_system and repair epoch.
– Storage, bandwidth, and disk seeks for erasure codes are an order of magnitude lower.
• Storage_replica / Storage_erasure = R * r
• BW_replica / BW_erasure = R * r
• DiskSeeks_replica / DiskSeeks_erasure = R / n
– = R * r with a smart storage server.
• E.g., 2^16 users at 35 MB/hr/user → 10^17 blocks.
– Want MTTF_system = 10^20 years.
– R = 22 replicas, or r = m/n = 32/64; repair epoch = 4 months.
• Storage_replica / Storage_erasure = 11
• BW_replica / BW_erasure = 11
• DiskSeeks_replica / DiskSeeks_erasure = 11 best case, 0.29 worst case.
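A back-of-envelope check of the slide's own ratios under its stated parameters; note the naive per-fragment seek ratio R/n comes out near 0.34, while the slide quotes 0.29 for the worst case, presumably from a finer accounting.

R, m, n = 22, 32, 64           # slide parameters
r = m / n                      # rate 1/2
print(R * r)                   # storage and bandwidth ratio: 11.0
print(R / n)                   # naive worst-case seek ratio: ~0.34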

Page 8: Requirements

• Can this be real?
– Three requirements must be met:
• Failure independence.
• Data integrity.
• Efficient repair.

Page 9: Failure Independence Model

• Model Builder.
– Draws on various sources.
– Models failure correlation.
• Set Creator.
– Queries random nodes.
– Builds dissemination sets: storage servers that fail with low correlation.
• Disseminator.
– Sends fragments to members of the set.

[Figure: archive pipeline. Introspection, human input, and network monitoring feed a model to the Model Builder; the Set Creator probes nodes by type and hands dissemination sets to the Disseminators, which send fragments to the storage servers.]

Page 10: Model Builder I

• Correlation of failure among types of storage servers.
– Type: an enumeration of server properties.
• Collects availability statistics on storage server types.
– Computes marginal and pair-wise joint probabilities of failure (sketched below).
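A minimal sketch of those statistics, assuming availability logs arrive as aligned 0/1 up/down samples per server type; the log format and dictionary layout are illustrative, not details from the talk.

from itertools import combinations

def failure_probs(logs):
    # logs: dict type -> equally long list of samples (1 = up, 0 = down).
    # Returns marginal P(down) per type and P(both down) per type pair.
    marg = {t: sum(1 - s for s in xs) / len(xs) for t, xs in logs.items()}
    joint = {(a, b): sum((1 - x) * (1 - y)
                         for x, y in zip(logs[a], logs[b])) / len(logs[a])
             for a, b in combinations(logs, 2)}
    return marg, joint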

Page 11: Model Builder II

• Mutual information: computes the pairwise correlation of two nodes' failures.
– I(X,Y) = H(X) - H(X|Y)
– I(X,Y) = H(X) + H(Y) - H(X,Y)
• I(X,Y) is the mutual information; H(X) is the entropy of X.
– Entropy is a measure of randomness.
– I(X,Y) is the reduction in the entropy of Y given X.
– E.g., X, Y ∈ {up, down}.

[Figure: Venn diagram relating H(X), H(Y), H(X,Y), the conditionals H(X|Y) and H(Y|X), and their overlap I(X,Y).]

I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
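A minimal sketch of this computation, assuming each node's history is an aligned sequence of "up"/"down" samples; the log format is illustrative.

from collections import Counter
from math import log2

def mutual_information(xs, ys):
    # I(X,Y) in bits, from the empirical marginal and joint
    # distributions of two nodes' paired up/down samples.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Two perfectly correlated nodes share one full bit of information:
a = ["up", "down", "up", "down"]
print(mutual_information(a, a))   # 1.0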

Page 12: Model Builder III

• Implemented mutual information in Matlab.
– Used the fault load from [1] to compute mutual information.
– Reflects the rate of failures of the underlying network.
• Measured interconnectivity among eight sites:
– Verio (CA), Rackspace (TX), Rackspace (U.K.), University of California, Berkeley, University of California, San Diego, University of Utah, University of Texas, Austin, and Duke.
• The test ran six days; day one had the highest failure rate, 0.17%.
• Results:
– Different levels of service availability.
– Some site fault loads had the same average failure rates but differed in the timing and nature of failures.
– Sites fail with low correlation according to mutual information.

[1] Yu and Vahdat. The Costs and Limits of Availability for Replicated Services. SOSP 2001.

Page 13: Set Creator I

• Node Discovery
– Collects information on a large set of storage servers.
– Uses properties of Tapestry.
– Scans the node address space:
• The node address space is sparse.
• Node names are random.
– Willing servers respond with a signed statement of type.
• Set Creation
– Clusters servers that fail with high correlation.
– Creates dissemination sets: servers that fail with low correlation (see the sketch below).
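A minimal sketch of set creation, assuming pairwise mutual-information scores from the Model Builder; the greedy strategy and the threshold value are illustrative choices, not details given in the talk.

def make_dissemination_set(candidates, mi, n, threshold=0.05):
    # candidates: server ids; mi: dict (a, b) -> I(a,b) in bits.
    # Keep a server only if it is weakly correlated with every
    # server already chosen; stop once the set has n members.
    def pair(a, b):
        return mi.get((a, b), mi.get((b, a), 0.0))
    chosen = []
    for s in candidates:
        if all(pair(s, c) < threshold for c in chosen):
            chosen.append(s)
            if len(chosen) == n:
                break
    return chosen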

Page 14: Disseminator I

• Disseminator Server Architecture
– Requests to archive objects are received through the network layers.
– Consistency mechanisms decide whether to archive an object.

[Figure: disseminator server architecture. Stages for Consistency, Location & Routing, the Disseminator, and Introspection Modules sit above a dispatch layer and thread scheduler on the Java Virtual Machine, with asynchronous disk and network I/O provided by the operating system.]

Page 15: SEDA I

• Staged Event-Driven Architecture (SEDA) as the server framework.
– By Matt Welsh.
• High concurrency
– Similar to traditional event-driven design.
• Load conditioning
– Drop, filter, or reorder events.
• Code modularity
– Stages are developed and maintained independently.
• Scalable I/O
– Asynchronous network and disk I/O.
• Debugging support
– Queue-length profiler.
– Event-flow tracing.

Page 16: SEDA II

• Stage: a self-contained application component consisting of:
– an event handler,
– an incoming event queue,
– a thread pool.
• Each stage is managed by a controller that affects scheduling and resource allocation.
• Operation (sketched after the figure below):
– A thread pulls an event off the stage's incoming event queue.
– It invokes the supplied event handler.
– The event handler processes each task and dispatches zero or more tasks by enqueuing them on the event queues of other stages.

[Figure: a SEDA stage. The incoming event queue feeds a thread pool that runs the event handler and emits outgoing events; a controller manages the stage.]
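A minimal sketch of that operation, using Python's standard queue and threading modules to stand in for SEDA's Java runtime; the class shape is illustrative, not SEDA's actual API.

import queue, threading

class Stage:
    # A self-contained stage: event handler + incoming queue + threads.
    def __init__(self, handler, n_threads=2):
        self.handler = handler
        self.events = queue.Queue()        # incoming event queue
        for _ in range(n_threads):         # the stage's thread pool
            threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            event = self.events.get()      # pull event off the queue
            # The handler returns zero or more (stage, event) pairs,
            # which are dispatched onto the other stages' queues.
            for stage, out in self.handler(event):
                stage.enqueue(out)

    def enqueue(self, event):
        self.events.put(event)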

Page 17: Disseminator Control Flow

[Figure: disseminator control flow among the ConsistencyStage(s), GenerateChkptStage, GenerateFragsStage, DisseminatorStage, and CacheStage. Messages include GenerateFragsChkptReq/Resp, GenerateFragsReq/Resp, and DisseminateFragsReq/Resp, plus cache traffic (BlkReq/BlkResp, SpaceReq/SpaceResp); the DisseminatorStage sends fragments to the storage servers.]

Page 18: Storage Control Flow

[Figure: storage control flow. A storage request reaches the StorageStage, which passes a StoreReq to the CacheStage and sends a MAC'd acknowledgment; a BufferedStoreReq then flows to the DiskStage.]
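A minimal sketch of the MAC'd acknowledgment in this flow, assuming a key shared between the storage server and the requester; the key distribution and message layout are illustrative, not from the talk.

import hmac, hashlib

def mac_ack(key: bytes, fragment_id: bytes) -> bytes:
    # HMAC over the fragment id lets the requester verify that the
    # acknowledgment really came from the keyed storage server.
    return hmac.new(key, b"stored:" + fragment_id, hashlib.sha256).digest()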

Page 19: Performance I

• Focus
– Performance of an OceanStore server archiving an object.
– Analysis restricted to the operations of archiving or recoalescing an object.
• The network phases of disseminating or requesting fragments are not analyzed.
• Performance of the archival layer.
– The OceanStore server was analyzed on a single-processor, 850 MHz Pentium III machine with 768 MB of memory, running the Debian distribution of the Linux 2.4.1 kernel.
– Used BerkeleyDB when reading and writing blocks to disk.
– Simulated a number of independent event streams (read or write).

Page 20: Performance II

• Throughput
– Users created traffic without delay.
– The archive sustains ~30 requests/sec, and this remains constant.
• Turnaround time
– Response time: the user-perceived latency.
– Increases linearly with the number of events.

Page 21: Discussion

• Distribute the creation of models.
• Byzantine commit on dissemination sets.
• Cache
– "Old hat": LRU, second-chance algorithm, free list, multiple databases, etc.
– Critical to the performance of the server.

Page 22: Issues

• Number of disk heads needed.
• Are erasure codes good for streaming media?
– Caching layer.
• Delete.
– Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else (denial of service).
– Must have an "eradication certificate" or similar.