Failure Independence in Oceanstore Archive
Hakim Weatherspoon, University of California, Berkeley
ROC/Oceanstore ©2002 Hakim Weatherspoon/UC Berkeley OceanStore Archive:2
Questions About Information
• Where is persistent information stored?
– Want: geographic independence for availability, durability, and freedom to adapt to circumstances.
m of n Encoding
[Figure: a data object encoded into fragments; legend: fragment received, fragment not received, redundant fragments]

• Redundancy without the overhead of replication.
• Divide an object into m fragments.
• Recode into n fragments.
• A rate r = m/n code.
• Increases storage by 1/r.
• Key: reconstruct from any m fragments.
• E.g. r = 1/4, m = 16, n = 64 fragments; increases storage by a factor of four.
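The any-m-of-n property can be sketched with a toy Reed-Solomon-style code over a prime field. The field choice, point layout, and function names here are ours for illustration; OceanStore's actual archival coder differs in the details.

```python
# Toy m-of-n erasure code (Reed-Solomon style) over a prime field.
P = 2**31 - 1  # prime modulus; every data symbol must be smaller than P

def _interp_eval(points, x0):
    """Evaluate, at x0, the unique degree < len(points) polynomial through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x0 - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # inverse via Fermat
    return total

def encode(data, n):
    """Encode m data symbols into n (index, value) fragments."""
    m = len(data)
    pts = list(enumerate(data, start=1))           # fragments 1..m hold the data itself
    extra = [(x, _interp_eval(pts, x)) for x in range(m + 1, n + 1)]
    return pts + extra

def decode(frags, m):
    """Reconstruct the m data symbols from any m surviving fragments."""
    pts = frags[:m]
    return [_interp_eval(pts, x) for x in range(1, m + 1)]
```

Any m of the n fragments determine the same degree < m polynomial, so reconstruction succeeds no matter which n − m fragments are lost; with m = 16, n = 64 the storage cost is 1/r = 4×, as on the slide.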
Assumptions
• OceanStore: a collection of independently failing disks.
• Failed disks are replaced by new, blank ones.
• Each fragment of a given block is placed on a unique, randomly selected disk.
• A repair epoch: the time period between global sweeps, in which a repair process scans the system, attempting to restore redundancy.
Availability
• Exploit statistical stability from a large number of components.
• E.g. given 90% availability for each of a million machines:
– 2 replicas yield 2 9's of availability.
– 16 fragments yield 5 9's of availability.
– 32 fragments yield 8 9's of availability.
• "More than 6 9's of availability requires world peace." – Steve Gribble, 2001.
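The nines arithmetic can be checked with a short binomial computation, assuming independent machine failures. The slide's fragment figures appear consistent with rate-1/2 codes (n fragments, any n/2 sufficient); that reading, and the function names, are our assumptions.

```python
# Availability of a replicated vs. fragmented object, machines up independently
# with probability p (0.9 on the slide).
from math import comb, log10

def replica_availability(p, k):
    """Object readable if at least one of k replicas is up."""
    return 1 - (1 - p) ** k

def fragment_availability(p, m, n):
    """Object readable if at least m of its n fragments are reachable."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m, n + 1))

def nines(a):
    """Number of leading nines in availability a, e.g. 0.999 -> 3."""
    return int(-log10(1 - a) + 1e-9)
```

With p = 0.9, two replicas give 0.99 (2 nines), while 16 fragments at rate 1/2 (any 8 recover) give roughly 5 nines and 32 fragments roughly 8, matching the slide.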
Durability
• E.g. MTTF_block = 10^35 years for a particular block.
– n = 64, r = 1/4, and repair epoch e = 6 months.
– MTTF_block = 35 years for replication, at the same storage cost and repair epoch!
• Need 36 replicas for MTTF_block = 10^35 years for a particular block.

[Chart: MTTF vs. Repair Epoch — MTTF (in years, log scale from 10^0 to 10^60) against repair epoch (0–12 months), for n = 4, 8, 16, 32, 64 fragments at r = 1/4]
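A crude model reproduces the chart's shape (this is our simplification, not the paper's full analysis): a block is lost only if more than n − m fragments fail within one repair epoch, so MTTF ≈ epoch / P(loss per epoch).

```python
# Back-of-the-envelope block MTTF, assuming each fragment independently fails
# with probability p_frag during one repair epoch.
from math import comb

def p_block_loss(p_frag, m, n):
    """P(block unrecoverable): more than n - m fragments lost in one epoch."""
    return sum(comb(n, k) * p_frag**k * (1 - p_frag)**(n - k)
               for k in range(n - m + 1, n + 1))

def mttf_years(p_frag, m, n, epoch_months):
    """Expected epochs until loss, converted to years."""
    return (epoch_months / 12) / p_block_loss(p_frag, m, n)
```

Because loss requires 49 simultaneous fragment failures at n = 64, m = 16, the loss probability is astronomically small, which is why the chart's MTTF curves climb so steeply with n.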
Erasure Coding vs. Replication

• Fix storage overhead and repair epoch:
– MTTF for erasure codes is orders of magnitude higher.
• Fix MTTF_system and repair epoch:
– Storage, bandwidth, and disk seeks for erasure codes are an order of magnitude lower.
• Storage_replica / Storage_erasure = R · r
• BW_replica / BW_erasure = R · r
• DiskSeeks_replica / DiskSeeks_erasure = R / n
– = R · r with a smart storage server.
• E.g.
– 2^16 users, 35 MB/hr/user → 10^17 blocks.
– Want MTTF_system = 10^20 years.
– R = 22 replicas or r = m/n = 32/64, repair epoch = 4 months.
• Storage_replica / Storage_erasure = 11
• BW_replica / BW_erasure = 11
• DiskSeeks_replica / DiskSeeks_erasure = 11 best case or 0.29 worst case.
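Plugging the slide's example numbers into its ratio formulas (the function and key names are ours):

```python
# Cost of R replicas relative to a rate r = m/n erasure code, at equal
# system MTTF and repair epoch, per the slide's formulas.
def cost_ratios(R, m, n):
    r = m / n
    return {"storage": R * r,
            "bandwidth": R * r,
            "seeks_smart": R * r,   # smart storage server coalesces fragment reads
            "seeks_naive": R / n}   # otherwise one seek per fragment
```

With R = 22 and r = 32/64 the storage and bandwidth ratios both come out to 11, as on the slide; the naive seek ratio R/n with these inputs is about 0.34, so the slide's 0.29 worst case presumably reflects slightly different parameters.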
Requirements
• Can this be real? Three requirements must be met:
– Failure independence.
– Data integrity.
– Efficient repair.
Failure Independence Model
• Model Builder.
– Draws on various sources.
– Models failure correlation.
• Set Creator.
– Queries random nodes.
– Builds dissemination sets: storage servers that fail with low correlation.
• Disseminator.
– Sends fragments to members of the set.

[Diagram: introspection, human input, and network monitoring feed the Model Builder, which passes a model to the Set Creator; the Set Creator probes storage servers by type and hands sets to Disseminators, which send fragments to storage servers]
Model Builder I
• Correlation of failure among types of storage servers.
– A type is an enumeration of server properties.
• Collects availability statistics on storage server types.
– Computes marginal and pair-wise joint probabilities of failure.
Model Builder II
• Mutual information
– computes the pairwise correlation of two node failures.
– I(X,Y) = H(X) – H(X|Y)
– I(X,Y) = H(X) + H(Y) – H(X,Y)
• I(X,Y) is the mutual information.
• H(X) is the entropy of X.
– Entropy is a measure of randomness.
– I(X,Y) is the reduction in the entropy of Y given X.
– E.g. X, Y ∈ {up, down}.

I(X,Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) · log( p(x,y) / ( p(x) · p(y) ) )

[Venn diagram: H(X) and H(Y) overlap in I(X,Y); the non-overlapping parts are H(X|Y) and H(Y|X); the union is H(X,Y)]
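As a sketch of the statistic itself, assuming each server's history is an up/down trace sampled at common instants (the representation is ours):

```python
# Mutual information of two availability traces, estimated from empirical
# joint and marginal frequencies.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X,Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Two servers that go down together score high (perfectly correlated traces give I = H(X)); servers whose outages are unrelated score near zero, which is what makes them good companions in a dissemination set.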
Model Builder III
• Implemented mutual information in Matlab.
– Used the fault load from [1] to compute mutual information.
– Reflects the rate of failures of the underlying network.
• Measured interconnectivity among eight sites:
– Verio (CA), Rackspace (TX), Rackspace (U.K.), University of California, Berkeley, University of California, San Diego, University of Utah, University of Texas, Austin, and Duke.
• Test ran six days; day one had the highest failure rate, 0.17%.
• Results:
– Different levels of service availability.
– Some site fault loads had the same average failure rates but differed in the timing and nature of failures.
– Sites fail with low correlation according to mutual information.

[1] Yu and Vahdat. The Costs and Limits of Availability for Replicated Services. SOSP 2001.
Set Creator I
• Node discovery
– Collects information on a large set of storage servers.
– Uses properties of Tapestry.
– Scans the node address space.
• Node address space is sparse.
• Node names are random.
– Willing servers respond with a signed statement of their type.
• Set creation
– Clusters servers that fail with high correlation.
– Creates dissemination sets: servers that fail with low correlation.
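A hypothetical greedy sketch of set creation, assuming pairwise correlation scores (e.g. from the mutual-information analysis) are already available; the threshold and selection policy here are our assumptions, not the paper's algorithm.

```python
# Greedy dissemination-set construction: keep adding candidate servers that are
# weakly correlated with every server already chosen.
def dissemination_set(servers, corr, size, threshold=0.1):
    """corr[(a, b)] is a symmetric failure-correlation score in [0, 1]."""
    def c(a, b):
        return corr.get((a, b), corr.get((b, a), 0.0))
    chosen = [servers[0]]
    for s in servers[1:]:
        if len(chosen) == size:
            break
        if all(c(s, t) < threshold for t in chosen):
            chosen.append(s)
    return chosen
```

Servers in the same cluster (high mutual correlation) exclude one another, so the resulting set approximates the "fail with low correlation" property the slide asks for.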
Disseminator I
• Disseminator server architecture
– Requests to archive objects are received through the network layers.
– Consistency mechanisms decide to archive an object.

[Diagram: server stack on the operating system and Java virtual machine, with a thread scheduler and dispatch; asynchronous disk and network I/O feed stages for consistency, location & routing, the disseminator, and introspection modules]
SEDA I
• Staged Event-Driven Architecture (SEDA) as the server framework.
– By Matt Welsh.
• High concurrency
– Similar to traditional event-driven design.
• Load conditioning
– Drop, filter, or reorder events.
• Code modularity
– Stages are developed and maintained independently.
• Scalable I/O
– Asynchronous network and disk I/O.
• Debugging support
– Queue-length profiler.
– Event-flow tracer.
SEDA II
• Stage: a self-contained application component consisting of:
– an event handler,
– an incoming event queue,
– a thread pool.
• Each stage is managed by a controller that affects scheduling and resource allocation.
• Operation
– A thread pulls an event off the stage's incoming event queue.
– Invokes the supplied event handler.
– The event handler processes each task, and dispatches zero or more tasks by enqueuing them on the event queues of other stages.

[Diagram: a stage's event queue feeds a thread pool running the event handler; a controller oversees both, and outgoing events flow to other stages]
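The operation described above can be sketched as a minimal stage. This assumes a fixed-size thread pool in place of SEDA's dynamic controller, and the class and method names are ours, not SEDA's API.

```python
# Minimal SEDA-style stage: incoming queue + thread pool + event handler that
# dispatches follow-on events to other stages' queues.
import queue
import threading

class Stage:
    def __init__(self, handler, pool_size=2):
        self.events = queue.Queue()          # incoming event queue
        self.handler = handler               # returns iterable of (stage, event)
        for _ in range(pool_size):           # fixed thread pool (no controller)
            threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            event = self.events.get()        # thread pulls event off the queue
            for nxt, out in self.handler(event):
                nxt.events.put(out)          # dispatch to another stage's queue
            self.events.task_done()

    def enqueue(self, event):
        self.events.put(event)
```

A two-stage pipeline then looks like: an upstream stage transforms each event and enqueues the result on a downstream sink stage, exactly the "zero or more tasks to other stages" flow the slide describes.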
Disseminator Control Flow
[Control-flow diagram: stages GenerateChkptStage, GenerateFragsStage, DisseminatorStage, ConsistencyStage(s), and CacheStage exchange GenerateFragsChkptReq/Resp, DisseminateFragsReq/Resp, GenerateFragsReq/Resp, BlkReq/Resp, and SpaceReq/Resp messages; fragments are sent to storage servers]
Storage Control Flow
[Control-flow diagram: a storage request (StoreReq) arrives at the StorageStage, which sends a MAC'd acknowledgment; a BufferedStoreReq flows through the CacheStage to the DiskStage]
Performance I
• Focus
– Performance of an OceanStore server in archiving an object.
– Restrict analysis to only the operations of archiving or recoalescing an object.
• Do not analyze the network phases of disseminating or requesting fragments.
• Performance of the archival layer.
– OceanStore server was analyzed on a single-processor, 850 MHz Pentium III machine with 768 MB of memory, running the Debian distribution of the Linux 2.4.1 kernel.
– Used BerkeleyDB when reading and writing blocks to disk.
– Simulated a number of independent event streams (read or write).
Performance II
• Throughput
– Users created traffic without delay.
– Archive sustains ~30 req/sec.
– Remains constant.
• Turnaround time
– Response time: user-perceived latency.
– Increases linearly with the number of events.
Discussion
• Distribute the creation of models.
• Byzantine commit on dissemination sets.
• Cache
– "Old hat": LRU, second-chance algorithm, free list, multiple databases, etc.
– Critical to the performance of the server.
Issues
• Number of disk heads needed.
• Are erasure codes good for streaming media?
– Caching layer.
• Delete.
– Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else! (denial of service)
– Must have an "eradication certificate" or similar.