OceanStore Status and Directions ROC/OceanStore Retreat 1/13/03 John Kubiatowicz University of California at Berkeley


Page 1

OceanStore Status and Directions

ROC/OceanStore Retreat 1/13/03

John Kubiatowicz, University of California at Berkeley

Page 2

OceanStore:2 ROC/OceanStore Jan’02

Everyone’s Data, One Utility

• Millions of servers, billions of clients...
• 1000-YEAR durability (excepting fall of society)
• Maintains Privacy, Access Control, Authenticity
• Incrementally Scalable (“Evolvable”)
• Self Maintaining!
• Not quite peer-to-peer:
– Utilizing servers in infrastructure
– Some computational nodes more equal than others

Page 3

OceanStore Data Model

• Versioned Objects
– Every update generates a new version
– Can always go back in time (Time Travel)
• Each Version is Read-Only
– Can have permanent name
– Much easier to repair
• An Object is a signed mapping between permanent name and latest version
– Write access control/integrity involves managing these mappings

[Figure: Comet Analogy (updates, versions)]
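The data model above can be sketched in a few lines of Python. This is a minimal illustration, not the OceanStore implementation: the class name `VersionedObject`, the SHA-256 version IDs, and the use of an HMAC as a stand-in for a real public-key signature are all assumptions made for the example.

```python
import hashlib
import hmac


class VersionedObject:
    """Append-only object: every update adds a new read-only version."""

    def __init__(self, name: str, signing_key: bytes):
        self.name = name
        self._key = signing_key   # hypothetical stand-in for the owner's signing key
        self._versions = {}       # version id -> immutable bytes
        self._history = []        # version ids, oldest first
        self.latest = None        # (version id, signature) once a version exists

    def _sign(self, vid: str) -> str:
        # HMAC is an assumption; a real system would use a public-key signature.
        msg = f"{self.name}:{vid}".encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def update(self, data: bytes) -> str:
        # Name the version by hashing the previous version id plus the new content,
        # so each version gets a permanent name and old versions are never changed.
        prev = self._history[-1] if self._history else ""
        vid = hashlib.sha256(prev.encode() + data).hexdigest()
        self._versions[vid] = bytes(data)
        self._history.append(vid)
        # The "object" is the signed mapping from the name to the latest version.
        self.latest = (vid, self._sign(vid))
        return vid

    def read(self, vid: str = None) -> bytes:
        # With no version id, read the latest version; otherwise travel back in time.
        if vid is None:
            vid = self.latest[0]
        return self._versions[vid]

    def verify_latest(self) -> bool:
        vid, sig = self.latest
        return hmac.compare_digest(sig, self._sign(vid))


doc = VersionedObject("notes.txt", b"demo-key")
v1 = doc.update(b"first draft")
v2 = doc.update(b"second draft")
print(doc.read(), doc.read(v1), doc.verify_latest())
```

Only the name-to-latest-version mapping changes on a write, which is why write access control reduces to controlling who may sign that mapping.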

Page 4

Self-Verifying Objects

[Figure: versions VGUIDi and VGUIDi+1 of a data B-tree; root blocks M (with backpointer), indirect blocks, and data blocks d1 to d9; blocks d'8 and d'9 created by copy on write. Other labels: Updates; Heartbeats + Read-Only Data]

AGUID = hash{name+keys}

Heartbeat: {AGUID, VGUID, Timestamp}signed

Patrick Eaton: Discussions of the Future formats
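A rough sketch of the self-verifying idea: if every block is named by the hash of its contents, a reader can check each block against its name, and a new version can reuse unchanged blocks (copy on write). The flat one-level tree, the SHA-256 hash, the newline-separated root format, and all function names below are assumptions for illustration; the real structure is a B-tree with indirect blocks.

```python
import hashlib

store = {}  # block GUID -> block bytes (content-addressed, so blocks are read-only)


def put(block: bytes) -> str:
    """Store a block under the hash of its contents and return that GUID."""
    guid = hashlib.sha256(block).hexdigest()
    store[guid] = block
    return guid


def aguid(name: str, owner_key: bytes) -> str:
    """AGUID = hash{name+keys}; the exact encoding here is an assumption."""
    return hashlib.sha256(name.encode() + owner_key).hexdigest()


def build_version(data_blocks, prev_root=None) -> str:
    """Store data blocks, then a root block listing a backpointer and child GUIDs."""
    child_guids = [put(b) for b in data_blocks]
    root = "\n".join([prev_root or "-"] + child_guids).encode()
    return put(root)  # the root's GUID plays the role of the VGUID


def read_version(root_guid: str):
    """Fetch every block reachable from the root, verifying each against its GUID."""
    def fetch(guid):
        block = store[guid]
        if hashlib.sha256(block).hexdigest() != guid:
            raise ValueError("block does not match its GUID")
        return block

    lines = fetch(root_guid).decode().split("\n")
    backpointer, child_guids = lines[0], lines[1:]
    return backpointer, [fetch(g) for g in child_guids]


v1 = build_version([b"d1", b"d2", b"d3"])
v2 = build_version([b"d1", b"d2", b"d3-modified"], prev_root=v1)
print(len(store))  # unchanged blocks d1 and d2 are stored only once
```

Because unchanged blocks hash to the same GUID, the second version adds only the modified data block and a new root.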

Page 5

The Path of an OceanStore Update

[Figure labels: Clients, Inner-Ring Servers, Multicast trees, Second-Tier Caches]

Page 6

OceanStore Goes Global!

• Planet Lab global network
– 98 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
– 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
– North American machines (2/3) on Internet2

OceanStore components running “globally”:

• Word on the street: it was “straightforward.”
– Basic architecture scales
– Lots of communications issues (NAT, timeouts, etc.)
– Locality really important
• Challenge: Stability and fault tolerance!
• Dennis Geels: Analysis (FAST 2003 paper)
• Steve Czerwinski/B. Hoon Kang: Tentative Updates

Page 7

Enabling Technology: DOLR (Decentralized Object Location and Routing) “TAPESTRY”

[Figure: DOLR network with objects labeled GUID1 (two copies) and GUID2]
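To make the DOLR idea concrete, here is a simplified sketch of digit-by-digit prefix routing: each hop moves to a node that shares a longer ID prefix with the target GUID, and the object's location pointer lives at the node where routing ends. This is a toy with global knowledge of all nodes, not Tapestry's actual algorithm; the node IDs, the tie-breaking rule in `root_for`, and the `DOLR` class are assumptions for illustration. Real Tapestry nodes keep per-level routing tables and also leave location pointers along the publish path.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def root_for(nodes, guid):
    """Deterministic root: longest shared prefix, ties broken by smallest ID (assumption)."""
    return min(nodes, key=lambda n: (-shared_prefix_len(n, guid), n))


def route(nodes, start, guid):
    """Hop toward the GUID, matching more leading digits at each step."""
    path = [start]
    current = start
    while True:
        level = shared_prefix_len(current, guid)
        better = [n for n in nodes if shared_prefix_len(n, guid) > level]
        if not better:
            break
        # Prefer the smallest improvement, so hops resolve digits gradually.
        current = min(better, key=lambda n: (shared_prefix_len(n, guid), n))
        path.append(current)
    root = root_for(nodes, guid)
    if current != root:
        path.append(root)  # final tie-break hop to the agreed root
    return path


class DOLR:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.pointers = {n: {} for n in self.nodes}  # node -> {guid: server}

    def publish(self, guid, server):
        # Store the location pointer at the GUID's root node.
        self.pointers[root_for(self.nodes, guid)][guid] = server

    def locate(self, start, guid):
        path = route(self.nodes, start, guid)
        return self.pointers[path[-1]].get(guid), path


nodes = ["3274", "4577", "5544", "AE87", "3213", "9098", "1167", "6003", "0128"]
dolr = DOLR(nodes)
dolr.publish("3279", server="4577")
server, path = dolr.locate("AE87", "3279")
print(server, path)
```

Any client, starting anywhere, ends at the same root for a given GUID, which is what lets publishers and readers meet without a central directory.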

Page 8

Self-Organizing Second-Tier

• Preliminary results show that this is effective
– Dennis will talk about effectiveness for streaming updates
• Have simple algorithms for placing replicas on nodes in the interior
– Intuition: locality properties of network help place replicas
– DOLR helps associate parents and children to build multicast tree

Page 9

Tapestry Stability under Faults

• Instability is the common case...!
– Small half-life for P2P apps (1 hour????)
– Congestion, flash crowds, misconfiguration, faults
• Must Use DOLR under instability!
– The right thing must just happen
• Tapestry is natural framework to exploit redundant elements and connections
– Multiple Roots, Links, etc.
– Easy to reconstruct routing and location information
– Stable, repairable layer
• Thermodynamic analogies:
– Heat Capacity of DOLR network
– Entropy of Links (decay of underlying order)

Page 10

Single Node Tapestry

[Figure: layered architecture of a single Tapestry node]

• Applications: OceanStore, Application-Level Multicast, Other Applications
• Application Interface / Upcall API
• Dynamic Node Management; Router (Routing Table & Object Pointer DB)
• Network Link Management
• Transport Protocols

Page 11

It’s Alive On Planetlab!

• Tapestry Java deployment
– 6-7 nodes on each physical machine
– IBM Java JDK 1.30
– Node virtualization inside JVM and SEDA
– Scheduling between virtual nodes increases latency
• Dynamic insertion algorithms mostly working
– Experiments with many simultaneous insertions
– Node deletion getting there
• Tomorrow: Ben Zhao on Tapestry Deployment

Page 12

Object Location

[Figure: RDP (min, median, 90%) on the y-axis (0 to 25) versus Client to Obj RTT Ping time (1ms buckets) on the x-axis (0 to 200)]

Page 13

Tradeoff: Storage vs Locality

Tomorrow: Jeremy Stribling on Locality

Page 14

Archival Dissemination of Fragments

Page 15

Fraction of Blocks Lost per Year (FBLPY)

• Exploit law of large numbers for durability!
• 6 month repair, FBLPY:
– Replication: 0.03
– Fragmentation: 10^-35
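The gap between replication and fragmentation can be illustrated with a simple binomial model. The sketch below assumes each stored piece fails independently with probability `p` before the next repair, and that a block coded into `n` fragments survives if any `m` of them survive. The value of `p` and the 16-of-32 coding are made-up illustration parameters; this does not attempt to reproduce the slide's figures.

```python
from math import comb


def loss_probability(n: int, m: int, p: float) -> float:
    """Probability that fewer than m of n independent pieces survive.

    The block is lost when more than n - m pieces fail.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n - m + 1, n + 1)
    )


p = 0.1  # hypothetical per-piece failure probability within one repair period

# Both schemes below use 2x storage overhead.
replication = loss_probability(n=2, m=1, p=p)      # two full copies
fragmentation = loss_probability(n=32, m=16, p=p)  # any 16 of 32 fragments suffice

print(f"replication:   {replication:.3e}")
print(f"fragmentation: {fragmentation:.3e}")
```

With the same storage overhead, losing a fragmented block requires many simultaneous independent failures, which is the law-of-large-numbers effect the slide refers to.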

Page 16

The Dissemination Process: Achieving Failure Independence

[Figure labels: Network Monitoring, Human Input, Introspection, Model Builder, Set Creator, Inner Ring (two instances); arrows labeled probe, type, model, set, fragments]

Page 17

Active Data Maintenance

• Tapestry enables “data-driven multicast”
– Mechanism for local servers to watch each other
– Efficient use of bandwidth (locality)

[Figure: Tapestry mesh of nodes 3274, 4577, 5544, AE87, 3213, 9098, 1167, 6003, 0128 connected by L1, L2, and L3 links; Ring of L1 Heartbeats]
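One way servers can watch each other is with timed heartbeats: each server records when it last heard from the neighbors it watches and suspects any that stay silent past a timeout. The sketch below is an assumption-laden illustration, not the OceanStore mechanism: arranging watchers in a ring sorted by node ID, the 30-second timeout, and the names `ring_watchers` and `HeartbeatMonitor` are all hypothetical.

```python
def ring_watchers(node_ids):
    """Map each node to the next node in sorted-ID order (the one it watches)."""
    ring = sorted(node_ids)
    return {node: ring[(i + 1) % len(ring)] for i, node in enumerate(ring)}


class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each watched neighbor."""

    def __init__(self, neighbors, timeout):
        self.timeout = timeout
        self.last_seen = {n: None for n in neighbors}

    def heartbeat(self, neighbor, now):
        self.last_seen[neighbor] = now

    def suspects(self, now):
        # A neighbor is suspect if it was never heard from or has been silent too long.
        return sorted(
            n for n, t in self.last_seen.items()
            if t is None or now - t > self.timeout
        )


watchers = ring_watchers(["3274", "3213", "0128"])
monitor = HeartbeatMonitor([watchers["3213"]], timeout=30)  # 3213 watches 3274
monitor.heartbeat("3274", now=0)
print(monitor.suspects(now=10), monitor.suspects(now=45))
```

A suspected neighbor would then trigger whatever repair action the system defines, such as regenerating the fragments that server held.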

Page 18

Project Seagull

• Push for long-term stable archive
– Fault Tolerant Networking
– Periodic restart of servers
– Correlation analysis for fragment placement
– Efficient heart-beats for fragment tracking
– Repair mechanisms
• Use for Backup system
– Conversion of dump to use OceanStore
– With versioning: yields first-class archival system
• Use for Web browsing
– Versioning yields long-term history of web sites

Page 19

PondStore Prototype

Page 20

First Implementation [Java]:

• Event-driven state-machine model
– 150,000 lines of Java code and growing
• Included Components
– DOLR Network (Tapestry)
  • Object location with Locality
  • Self Configuring, Self Repairing
– Full Write path
  • Conflict resolution and Byzantine agreement
– Self-Organizing Second Tier
  • Replica Placement and Multicast Tree Construction
– Introspective gathering of tacit info and adaptation
  • Clustering, prefetching, adaptation of network routing
– Archival facilities
  • Interleaved Reed-Solomon codes for fragmentation
  • Independence Monitoring
  • Data-Driven Repair
• Downloads available from www.oceanstore.org

Page 21

Event-Driven Architecture of an OceanStore Node

• Data-flow style
– Arrows indicate flow of messages
• Potential to exploit small multiprocessors at each physical node

[Figure: stages of an OceanStore node connected by message arrows, with the outside labeled World]
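A data-flow, event-driven node can be sketched as stages that subscribe to message types and communicate only by sending messages through a dispatcher. The prototype itself is Java; the Python below is only a sketch of the style, and the `Dispatcher` class, the stage names, and the message types are all hypothetical.

```python
from collections import deque


class Dispatcher:
    """Delivers queued messages to the stages subscribed to each message type."""

    def __init__(self):
        self.handlers = {}    # message type -> list of stage callbacks
        self.queue = deque()  # pending (message type, payload) pairs

    def subscribe(self, msg_type, handler):
        self.handlers.setdefault(msg_type, []).append(handler)

    def send(self, msg_type, payload):
        self.queue.append((msg_type, payload))

    def run(self):
        # Single-threaded event loop: handlers never block, they only enqueue messages.
        while self.queue:
            msg_type, payload = self.queue.popleft()
            for handler in self.handlers.get(msg_type, []):
                handler(self, payload)


log = []


def network_stage(dispatcher, payload):
    # Decode raw bytes from the outside world into an update message.
    dispatcher.send("update", payload.decode())


def update_stage(dispatcher, payload):
    log.append(f"applied {payload}")
    dispatcher.send("ack", payload)


def ack_stage(dispatcher, payload):
    log.append(f"acked {payload}")


d = Dispatcher()
d.subscribe("packet", network_stage)
d.subscribe("update", update_stage)
d.subscribe("ack", ack_stage)
d.send("packet", b"write #1")
d.run()
print(log)
```

Because stages share nothing except messages, separate stages could in principle run on separate processors, which is the multiprocessor potential the slide mentions.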

Page 22

Working Applications

Page 23

MINO: Wide-Area E-Mail Service

• Complete mail solution
– Email inbox
– IMAP folders

[Figure labels: Internet, Traditional Mail Gateways, Client, Local network, Replicas, IMAP Proxy, SMTP Proxy, Mail Object Layer, OceanStore Client API, OceanStore Objects]

Page 24

Riptide: Caching the Web with OceanStore

Page 25

Other Apps

• Long-running archive
– Project Seagull
• File system support
– NFS with time travel (like VMS)
– Windows Installable file system (soon)
• Anonymous file storage:
– Mnemosyne uses Tapestry by itself
• Palm-pilot synchronization
– Palm data base as an OceanStore DB
• Come see OceanStore demo at Poster Session: IMAP on OceanStore/Versioned NFS

Page 26

Future Challenges

• Fault Tolerance
– Network/Tapestry layer
– Inner Ring
• Repair
– Continuous monitoring/restart of components
• Online/offline validation
– What mechanisms can be used to increase confidence and reliability in systems like OceanStore?
• More intelligent replica management
• Security
– Data Level security
– Tapestry-level admission control
• “Eat our Own Dogfood”
– Continuous deployment of OceanStore components
• Large-Scale Thermodynamic Design
– Is there a science of aggregate systems design?

Page 27

OceanStore Sessions
http://10.0.0.1/

• ROC: Monday (3:30pm – 5:00pm)
– OceanStore Pond Deployment
– Evolution of Data Format and Structure
– Tentative Updates
• Shared: Monday (5:30pm – 6:00pm)
– OceanStore Long-Term Archival Storage
• Sahara: Tuesday (8:30am – 9:10am)
– Tapestry status and deployment
– Peer-to-peer Benchmarking (Chord/Tapestry)
– Tapestry Locality Enhancement
• Sahara: Tuesday (11:35-12:00am)
– Peer-to-peer APIs

Page 28

For more info: http://oceanstore.org

• OceanStore vision paper for ASPLOS 2000: “OceanStore: An Architecture for Global-Scale Persistent Storage”
• OceanStore Prototype (FAST 2003): “Pond: the OceanStore Prototype”
• Tapestry algorithms paper (SPAA 2002): “Distributed Object Location in a Dynamic Network”
• Upcoming Tapestry Deployment Paper (JSAC): “Tapestry: a Global-Scale Overlay for Rapid Service Deployment”
• Probabilistic Routing (INFOCOM 2002): “Probabilistic Location and Routing”
• Upcoming CACM paper (not until February): “Extracting Guarantees from Chaos”