Failure Independence in Oceanstore Archive
Hakim Weatherspoon, University of California, Berkeley
ROC/Oceanstore ©2002 Hakim Weatherspoon/UC Berkeley OceanStore Archive:2
Questions About Information
• Where is persistent information stored?
– Want: geographic independence for availability, durability, and freedom to adapt to circumstances.
m of n Encoding
[Figure: a data object encoded into fragments; legend: fragment received, fragment not received, redundant fragments]

• Redundancy without the overhead of replication.
• Divide an object into m fragments.
• Recode into n fragments.
• A rate r = m/n code.
• Increases storage by 1/r.
• Key: reconstruct from any m fragments.
• E.g. r = 1/4, m = 16, n = 64 fragments; increases storage by a factor of four.
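The any-m-of-n property can be sketched with a toy Reed-Solomon-style code over a prime field. The field choice, point layout, and function names here are ours for illustration; OceanStore's actual archival coder differs in the details.

```python
# Toy m-of-n erasure code (Reed-Solomon style) over a prime field.
P = 2**31 - 1  # prime modulus; every data symbol must be smaller than P

def _interp_eval(points, x0):
    """Evaluate, at x0, the unique degree < len(points) polynomial through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x0 - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # inverse via Fermat
    return total

def encode(data, n):
    """Encode m data symbols into n (index, value) fragments."""
    m = len(data)
    pts = list(enumerate(data, start=1))           # fragments 1..m hold the data itself
    extra = [(x, _interp_eval(pts, x)) for x in range(m + 1, n + 1)]
    return pts + extra

def decode(frags, m):
    """Reconstruct the m data symbols from any m surviving fragments."""
    pts = frags[:m]
    return [_interp_eval(pts, x) for x in range(1, m + 1)]
```

Any m of the n fragments determine the same degree < m polynomial, so reconstruction succeeds no matter which n − m fragments are lost; with m = 16, n = 64 the storage cost is 1/r = 4×, as on the slide.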
Assumptions
• OceanStore: a collection of independently failing disks.
• Failed disks are replaced by new, blank ones.
• Each fragment of a given block is placed on a unique, randomly selected disk.
• A repair epoch: the time period between global sweeps, in which a repair process scans the system, attempting to restore redundancy.
Availability
• Exploit statistical stability from a large number of components.
• E.g. given 90% availability for each of a million machines:
– 2 replicas yield 2 9's of availability.
– 16 fragments yield 5 9's of availability.
– 32 fragments yield 8 9's of availability.
• "More than 6 9's of availability requires world peace." – Steve Gribble, 2001.
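The nines arithmetic can be checked with a short binomial computation, assuming independent machine failures. The slide's fragment figures appear consistent with rate-1/2 codes (n fragments, any n/2 sufficient); that reading, and the function names, are our assumptions.

```python
# Availability of a replicated vs. fragmented object, machines up independently
# with probability p (0.9 on the slide).
from math import comb, log10

def replica_availability(p, k):
    """Object readable if at least one of k replicas is up."""
    return 1 - (1 - p) ** k

def fragment_availability(p, m, n):
    """Object readable if at least m of its n fragments are reachable."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m, n + 1))

def nines(a):
    """Number of leading nines in availability a, e.g. 0.999 -> 3."""
    return int(-log10(1 - a) + 1e-9)
```

With p = 0.9, two replicas give 0.99 (2 nines), while 16 fragments at rate 1/2 (any 8 recover) give roughly 5 nines and 32 fragments roughly 8, matching the slide.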
Durability
• E.g. MTTF_block = 10^35 years for a particular block.
– n = 64, r = 1/4, and repair epoch e = 6 months.
– MTTF_block = 35 years for replication, at the same storage cost and repair epoch!
• Need 36 replicas for MTTF_block = 10^35 years for a particular block.

[Chart: MTTF vs. Repair Epoch — MTTF (in years, log scale from 10^0 to 10^60) against repair epoch (0–12 months), for n = 4, 8, 16, 32, 64 fragments at r = 1/4]
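A crude model reproduces the chart's shape (this is our simplification, not the paper's full analysis): a block is lost only if more than n − m fragments fail within one repair epoch, so MTTF ≈ epoch / P(loss per epoch).

```python
# Back-of-the-envelope block MTTF, assuming each fragment independently fails
# with probability p_frag during one repair epoch.
from math import comb

def p_block_loss(p_frag, m, n):
    """P(block unrecoverable): more than n - m fragments lost in one epoch."""
    return sum(comb(n, k) * p_frag**k * (1 - p_frag)**(n - k)
               for k in range(n - m + 1, n + 1))

def mttf_years(p_frag, m, n, epoch_months):
    """Expected epochs until loss, converted to years."""
    return (epoch_months / 12) / p_block_loss(p_frag, m, n)
```

Because loss requires 49 simultaneous fragment failures at n = 64, m = 16, the loss probability is astronomically small, which is why the chart's MTTF curves climb so steeply with n.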
Erasure Coding vs. Replication

• Fix storage overhead and repair epoch:
– MTTF for erasure codes is orders of magnitude higher.
• Fix MTTF_system and repair epoch:
– Storage, bandwidth, and disk seeks for erasure codes are an order of magnitude lower.
• Storage_replica / Storage_erasure = R · r
• BW_replica / BW_erasure = R · r
• DiskSeeks_replica / DiskSeeks_erasure = R / n
– = R · r with a smart storage server.
• E.g.
– 2^16 users, 35 MB/hr/user → 10^17 blocks.
– Want MTTF_system = 10^20 years.
– R = 22 replicas or r = m/n = 32/64, repair epoch = 4 months.
• Storage_replica / Storage_erasure = 11
• BW_replica / BW_erasure = 11
• DiskSeeks_replica / DiskSeeks_erasure = 11 best case or 0.29 worst case.
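Plugging the slide's example numbers into its ratio formulas (the function and key names are ours):

```python
# Cost of R replicas relative to a rate r = m/n erasure code, at equal
# system MTTF and repair epoch, per the slide's formulas.
def cost_ratios(R, m, n):
    r = m / n
    return {"storage": R * r,
            "bandwidth": R * r,
            "seeks_smart": R * r,   # smart storage server coalesces fragment reads
            "seeks_naive": R / n}   # otherwise one seek per fragment
```

With R = 22 and r = 32/64 the storage and bandwidth ratios both come out to 11, as on the slide; the naive seek ratio R/n with these inputs is about 0.34, so the slide's 0.29 worst case presumably reflects slightly different parameters.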
Requirements
• Can this be real? Three requirements must be met:
– Failure independence.
– Data integrity.
– Efficient repair.
Failure Independence Model
• Model Builder.
– Draws on various sources.
– Models failure correlation.
• Set Creator.
– Queries random nodes.
– Builds dissemination sets: storage servers that fail with low correlation.
• Disseminator.
– Sends fragments to members of the set.

[Diagram: introspection, human input, and network monitoring feed the Model Builder, which passes a model to the Set Creator; the Set Creator probes storage servers by type and hands sets to Disseminators, which send fragments to storage servers]
Model Builder I
• Correlation of failure among types of storage servers.
– A type is an enumeration of server properties.
• Collects availability statistics on storage server types.
– Computes marginal and pair-wise joint probabilities of failure.
Model Builder II
• Mutual information
– computes the pairwise correlation of two node failures.
– I(X,Y) = H(X) – H(X|Y)
– I(X,Y) = H(X) + H(Y) – H(X,Y)
• I(X,Y) is the mutual information.
• H(X) is the entropy of X.
– Entropy is a measure of randomness.
– I(X,Y) is the reduction in the entropy of Y given X.
– E.g. X, Y ∈ {up, down}.

I(X,Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) · log( p(x,y) / ( p(x) · p(y) ) )

[Venn diagram: H(X) and H(Y) overlap in I(X,Y); the non-overlapping parts are H(X|Y) and H(Y|X); the union is H(X,Y)]
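As a sketch of the statistic itself, assuming each server's history is an up/down trace sampled at common instants (the representation is ours):

```python
# Mutual information of two availability traces, estimated from empirical
# joint and marginal frequencies.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X,Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Two servers that go down together score high (perfectly correlated traces give I = H(X)); servers whose outages are unrelated score near zero, which is what makes them good companions in a dissemination set.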
Model Builder III
• Implemented mutual information in Matlab.
– Used the fault load from [1] to compute mutual information.
– Reflects the rate of failures of the underlying network.
• Measured interconnectivity among eight sites:
– Verio (CA), Rackspace (TX), Rackspace (U.K.), University of California, Berkeley, University of California, San Diego, University of Utah, University of Texas, Austin, and Duke.
• Test ran six days; day one had the highest failure rate, 0.17%.
• Results:
– Different levels of service availability.
– Some site fault loads had the same average failure rates but differed in the timing and nature of failures.
– Sites fail with low correlation according to mutual information.

[1] Yu and Vahdat. The Costs and Limits of Availability for Replicated Services. SOSP 2001.
Set Creator I
• Node discovery
– Collects information on a large set of storage servers.
– Uses properties of Tapestry.
– Scans the node address space.
• Node address space is sparse.
• Node names are random.
– Willing servers respond with a signed statement of their type.
• Set creation
– Clusters servers that fail with high correlation.
– Creates dissemination sets: servers that fail with low correlation.
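A hypothetical greedy sketch of set creation, assuming pairwise correlation scores (e.g. from the mutual-information analysis) are already available; the threshold and selection policy here are our assumptions, not the paper's algorithm.

```python
# Greedy dissemination-set construction: keep adding candidate servers that are
# weakly correlated with every server already chosen.
def dissemination_set(servers, corr, size, threshold=0.1):
    """corr[(a, b)] is a symmetric failure-correlation score in [0, 1]."""
    def c(a, b):
        return corr.get((a, b), corr.get((b, a), 0.0))
    chosen = [servers[0]]
    for s in servers[1:]:
        if len(chosen) == size:
            break
        if all(c(s, t) < threshold for t in chosen):
            chosen.append(s)
    return chosen
```

Servers in the same cluster (high mutual correlation) exclude one another, so the resulting set approximates the "fail with low correlation" property the slide asks for.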
Disseminator I
• Disseminator server architecture
– Requests to archive objects are received through the network layers.
– Consistency mechanisms decide to archive an object.

[Diagram: server stack on the operating system and Java virtual machine, with a thread scheduler and dispatch; asynchronous disk and network I/O feed stages for consistency, location & routing, the disseminator, and introspection modules]
SEDA I
• Staged Event-Driven Architecture (SEDA) as the server framework.
– By Matt Welsh.
• High concurrency
– Similar to traditional event-driven design.
• Load conditioning
– Drop, filter, or reorder events.
• Code modularity
– Stages are developed and maintained independently.
• Scalable I/O
– Asynchronous network and disk I/O.
• Debugging support
– Queue-length profiler.
– Event-flow tracer.
SEDA II
• Stage: a self-contained application component consisting of:
– an event handler,
– an incoming event queue,
– a thread pool.
• Each stage is managed by a controller that affects scheduling and resource allocation.
• Operation
– A thread pulls an event off the stage's incoming event queue.
– Invokes the supplied event handler.
– The event handler processes each task, and dispatches zero or more tasks by enqueuing them on the event queues of other stages.

[Diagram: a stage's event queue feeds a thread pool running the event handler; a controller oversees both, and outgoing events flow to other stages]
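The operation described above can be sketched as a minimal stage. This assumes a fixed-size thread pool in place of SEDA's dynamic controller, and the class and method names are ours, not SEDA's API.

```python
# Minimal SEDA-style stage: incoming queue + thread pool + event handler that
# dispatches follow-on events to other stages' queues.
import queue
import threading

class Stage:
    def __init__(self, handler, pool_size=2):
        self.events = queue.Queue()          # incoming event queue
        self.handler = handler               # returns iterable of (stage, event)
        for _ in range(pool_size):           # fixed thread pool (no controller)
            threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            event = self.events.get()        # thread pulls event off the queue
            for nxt, out in self.handler(event):
                nxt.events.put(out)          # dispatch to another stage's queue
            self.events.task_done()

    def enqueue(self, event):
        self.events.put(event)
```

A two-stage pipeline then looks like: an upstream stage transforms each event and enqueues the result on a downstream sink stage, exactly the "zero or more tasks to other stages" flow the slide describes.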
Disseminator Control Flow
[Control-flow diagram: stages GenerateChkptStage, GenerateFragsStage, DisseminatorStage, ConsistencyStage(s), and CacheStage exchange GenerateFragsChkptReq/Resp, DisseminateFragsReq/Resp, GenerateFragsReq/Resp, BlkReq/Resp, and SpaceReq/Resp messages; fragments are sent to storage servers]
Storage Control Flow
[Control-flow diagram: a storage request (StoreReq) arrives at the StorageStage, which sends a MAC'd acknowledgment; a BufferedStoreReq flows through the CacheStage to the DiskStage]
Performance I
• Focus
– Performance of an OceanStore server in archiving an object.
– Restrict analysis to only the operations of archiving or recoalescing an object.
• Do not analyze the network phases of disseminating or requesting fragments.
• Performance of the archival layer.
– OceanStore server was analyzed on a single-processor, 850 MHz Pentium III machine with 768 MB of memory, running the Debian distribution of the Linux 2.4.1 kernel.
– Used BerkeleyDB when reading and writing blocks to disk.
– Simulated a number of independent event streams (read or write).
Performance II
• Throughput
– Users created traffic without delay.
– Archive sustains ~30 req/sec.
– Remains constant.
• Turnaround time
– Response time: user-perceived latency.
– Increases linearly with the number of events.
Discussion
• Distribute the creation of models.
• Byzantine commit on dissemination sets.
• Cache
– "Old hat": LRU, second-chance algorithm, free list, multiple databases, etc.
– Critical to the performance of the server.
Issues
• Number of disk heads needed.
• Are erasure codes good for streaming media?
– Caching layer.
• Delete.
– Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else! (denial of service)
– Must have an "eradication certificate" or similar.