Edelweiss: Automatic Storage Reclamation for Distributed Programming

Neil Conway, Peter Alvaro, Emily Andrews, Joseph M. Hellerstein
University of California, Berkeley


Post on 30-Dec-2015



TRANSCRIPT

Page 1: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Edelweiss: Automatic Storage Reclamation for Distributed Programming

Neil Conway, Peter Alvaro, Emily Andrews, Joseph M. Hellerstein

University of California, Berkeley

Page 2: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Mutable shared state

Frequent source of bugs

Hard to scale

Page 3: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Event Logging

• Accumulate & exchange sets of immutable events
– No mutation/deletion

• To delete: add new event “Event X should be ignored”

• Current state: query over event log

Page 4: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Example: Key-Value Store

Mutable State (update-in-place, deletion)

tbl = Hash.new

Insert(k, v): tbl[k] = v

Delete(k): tbl.delete(k)

View(): tbl

Event Logging (set union, compute “live” keys)

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)
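The two designs on this slide can be compared in a minimal plain-Ruby sketch. The class names `MutableKVS` and `LoggedKVS` are hypothetical, and the `view` method is only a stand-in for the slide's `i_log.notin(d_log, :k => :k)`, written with standard-library `Set` rather than the deck's own runtime.

```ruby
require 'set'

# Mutable-state version: one hash, updated and deleted in place.
class MutableKVS
  def initialize
    @tbl = {}
  end

  def insert(k, v)
    @tbl[k] = v          # update-in-place
  end

  def delete(k)
    @tbl.delete(k)       # destructive deletion
  end

  def view
    @tbl.dup
  end
end

# Event-logging version: two grow-only sets; nothing is ever removed.
class LoggedKVS
  def initialize
    @i_log = Set.new     # insertion events [k, v]
    @d_log = Set.new     # deletion events k
  end

  def insert(k, v)
    @i_log << [k, v]     # set union
  end

  def delete(k)
    @d_log << k          # set union
  end

  # Stand-in for i_log.notin(d_log, :k => :k): keep the insertions
  # whose key has no matching deletion (the "live" keys).
  def view
    Set.new(@i_log.reject { |(k, _v)| @d_log.include?(k) })
  end
end

kvs = LoggedKVS.new
kvs.insert("a", 1)
kvs.insert("b", 2)
kvs.delete("b")
p kvs.view               # only the pair for "a" is still live
```

One difference is worth noting: in the logged version a deletion masks every insertion of that key, including later ones, because the deletion event itself is never removed.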

Page 5: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Benefits of Event Logging

1. Concurrency
2. Replication
3. Undo/redo
4. Point-in-time query, audit trails

(Sometimes: performance!)

Page 6: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Example Applications

• Multi-version concurrency control (MVCC)
• Write-ahead logging (WAL)
• Stream processing
• Log-structured file systems

Also: CRDTs, tombstones, purely functional data structures, accounting ledgers.

Page 7: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Observation: Logs consume unbounded storage

Solution: Discard log entries that are “no longer useful” (garbage collection)

Page 8: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Observation: Logs consume unbounded storage

Challenge: Discard log entries that are “no longer useful” (garbage collection)

Page 9: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Traditional Approach

“No longer useful” defined by application semantics
– No framework support
– Every system requires custom GC logic
– Reinvented many times
  • >25 papers propose ~same scheme!

Page 10: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Engineering Challenges

1. Difficult to implement correctly
– Too aggressive: destroy live data
– Too conservative: storage leak

2. Ongoing maintenance burden
– GC scheme and application code must be updated together

Page 11: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Our Approach

1. New language: Edelweiss
– Based on Datalog
– No constructs for deletion or mutation!

2. Automatically generate safe, application-specific distributed GC protocols

3. Present several in-depth case studies
– Reliable unicast/broadcast, key-value store, causal consistency, atomic registers

Page 12: Edelweiss: Automatic Storage Reclamation for Distributed Programming

[Figure: Base Data (“Event Logs”) → Query → Derived Data (“Live View”)]

Page 13: Edelweiss: Automatic Storage Reclamation for Distributed Programming
Page 14: Edelweiss: Automatic Storage Reclamation for Distributed Programming

The queries define how log entries contribute to the view.

Goal: Find log entries that will never contribute to the view in the future.

A log entry is useful iff it might contribute to the view.

Page 15: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Semantics of Base Data

• Accumulate and broadcast to other nodes

• Datalog: monotonic
– Set union: grows over time

• CALM Theorem [CIDR’11]: event log guaranteed to be eventually consistent
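A small sketch may help show why logs accumulated by set union converge. It is not a statement of the CALM theorem, only an illustration that union is insensitive to delivery order and duplicates. The `Replica` class and the event tuples are hypothetical.

```ruby
require 'set'

# Hypothetical replica that accumulates events by set union only.
class Replica
  attr_reader :log

  def initialize
    @log = Set.new
  end

  # Receiving a batch is a union: commutative, associative and
  # idempotent, so arrival order and duplicates do not matter.
  def receive(events)
    @log.merge(events)
    self
  end
end

batch1 = [[:insert, "a", 1], [:insert, "b", 2]]
batch2 = [[:delete, "b"]]

r1 = Replica.new.receive(batch1).receive(batch2)
# Same batches, reordered and with one delivered twice.
r2 = Replica.new.receive(batch2).receive(batch1).receive(batch2)

puts r1.log == r2.log   # prints true
```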

Page 16: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Semantics of Derived Data

Grows and shrinks over time
– e.g., KVS keys added and removed

Hence, not monotonic

Page 17: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Common Pattern

Live View = set difference between growing sets

Key-Value Store: insertions that haven’t been deleted

Reliable Broadcast: outbound messages that haven’t been acknowledged

Causal Consistency: writes that haven’t been replaced by a causally later write to the same key
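The pattern above can be sketched as one generic helper applied to two of the listed cases. The helper name `notin` mirrors the operator used elsewhere in the deck, but this version is a hypothetical plain-Ruby function over standard `Set` values, and the sample data is invented for illustration.

```ruby
require 'set'

# Generic anti-join: the tuples of xs whose key has no match in ys.
# The block extracts the join key from a tuple of xs.
def notin(xs, ys)
  xs.reject { |x| ys.include?(yield(x)) }.to_a
end

# Key-value store: insertions that have not been deleted.
i_log = Set[["a", 1], ["b", 2]]
d_log = Set["b"]
p notin(i_log, d_log) { |(k, _v)| k }

# Reliable broadcast: outbound messages that have not been acknowledged.
sent  = Set[[1, "hello"], [2, "world"], [3, "again"]]
acked = Set[1, 3]
p notin(sent, acked) { |(id, _payload)| id }
```

In both cases the two inputs only grow, while the result can shrink as the second input gains matching entries.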

Page 18: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Semantics of Set Difference

X = Y – Z
– Z grows: X shrinks
– If t appears in Z, t will never again appear in X
– “Anti-monotone with respect to Z”

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)

Can reclaim from i_log upon match in d_log
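A minimal sketch of this reclamation step, in plain Ruby with hypothetical helper names (`live_view`, `reclaim_insertions!`), shows that dropping matched insertions leaves the view untouched. It covers only a single node and does not model the distributed protocol.

```ruby
require 'set'

# The live view: insertions whose key has no matching deletion.
def live_view(i_log, d_log)
  i_log.reject { |(k, _v)| d_log.include?(k) }.to_a
end

# Reclaim every insertion already matched in d_log. Because d_log only
# grows, a matched insertion can never reappear in the view.
# Returns the number of tuples reclaimed.
def reclaim_insertions!(i_log, d_log)
  before = i_log.size
  i_log.delete_if { |(k, _v)| d_log.include?(k) }
  before - i_log.size
end

i_log = Set[["a", 1], ["b", 2], ["c", 3]]
d_log = Set["b", "c"]

view_before = live_view(i_log, d_log)
reclaimed   = reclaim_insertions!(i_log, d_log)
view_after  = live_view(i_log, d_log)

puts view_before == view_after   # the view is unchanged by reclamation
puts reclaimed                   # number of i_log tuples dropped
```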

Page 19: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Other Analysis Techniques

• Reclaim from negative notin input
– Often called “tombstones”
– E.g., how to reclaim from d_log in the KVS

• Reclaim from join input tables
• Disseminate GC metadata automatically
• Exploit user knowledge for better GC
– Punctuations [Tucker & Maier ’03]

Page 20: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Whole Program Analysis

• For each query q, find condition when input t will never contribute to q’s output
– “Reclamation condition” (RC)

• For each tuple t, find the conjunction of the RCs for t over all queries
– When all consumers no longer need t: safe to reclaim
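The conjunction rule can be illustrated with a small sketch. Here a `sent` log feeds two queries, and each contributes one reclamation condition. The second query (messages not yet archived), the helper name `reclaimable`, and the data are all hypothetical, and the conditions are written by hand rather than derived by any analysis.

```ruby
require 'set'

# Tuples for which every consumer's reclamation condition (RC) holds.
# rcs is a list of predicates, one per query that reads the tuples.
def reclaimable(tuples, rcs)
  tuples.select { |t| rcs.all? { |rc| rc.call(t) } }.to_a
end

sent     = Set[[1, "a"], [2, "b"], [3, "c"]]
acked    = Set[1, 2]   # negative input of q1: unacknowledged messages
archived = Set[2, 3]   # negative input of q2 (hypothetical): unarchived messages

rcs = [
  ->(t) { acked.include?(t[0]) },     # q1 will never output t again
  ->(t) { archived.include?(t[0]) }   # q2 will never output t again
]

garbage = reclaimable(sent, rcs)
sent.subtract(garbage)   # only message 2 satisfied both RCs

p garbage
p sent.to_a
```

Message 1 is acknowledged but not archived, and message 3 is archived but not acknowledged, so each is still needed by one consumer and stays in the log.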

Page 21: Edelweiss: Automatic Storage Reclamation for Distributed Programming

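[Figure: compilation pipeline]

Edelweiss Input Program → Source-to-Source Rewriter → Datalog Output Program → Datalog Evaluator

– Edelweiss Input Program: “positive” program, no deletion or state mutation
– Source-to-Source Rewriter: compute RCs, add deletion rules
– Datalog Output Program: input program + deletion rules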

Page 22: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Comparison of Program Size

Only 19 rules!

Page 23: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Takeaways

No storage management code!
– Similar to malloc/free vs. GC

Programs are concise and declarative
– Developer: just compute current view
– Log entries removed automatically

Reclamation logic and application code always in sync

Page 24: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Conclusions

• Event logging: powerful design pattern
– Problem: need for hand-written distributed storage reclamation code

• Datalog: natural fit for event logging
• Storage reclamation as a compiler rewrite?

Results:
– Automatic, safe GC synthesis!
– High-level, declarative programs
  • No storage management code
  • Focus on solving domain problem

Page 25: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Thank You!

Page 26: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Future Work: Checkpoints

• Closely related to simple event logging
– Summarize many log entries with a single “checkpoint” record
– View = last checkpoint + Query(ΔLogs)

• General goal: reclaim space by structural transformation, not just discarding data
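As a rough illustration of "view = last checkpoint + query over the log entries since then", the sketch below uses a ledger whose view is a running balance. The ledger, the `Checkpoint` struct, and the function names are hypothetical and are not taken from the talk; the point is only that folded entries can be discarded without changing the view.

```ruby
# Hypothetical ledger: the log is a list of signed amounts and the
# view is the balance. A checkpoint summarizes a prefix of the log.
Checkpoint = Struct.new(:balance, :entries_covered)

# View = last checkpoint + query over the entries logged since then.
def balance_view(checkpoint, delta_log)
  checkpoint.balance + delta_log.sum
end

# Fold the current delta log into a new checkpoint; the folded
# entries can then be discarded.
def take_checkpoint(checkpoint, delta_log)
  Checkpoint.new(balance_view(checkpoint, delta_log),
                 checkpoint.entries_covered + delta_log.size)
end

cp    = Checkpoint.new(0, 0)
delta = [100, -30, 45]
puts balance_view(cp, delta)   # prints 115

cp    = take_checkpoint(cp, delta)
delta = []                     # storage for the three entries is reclaimed
delta << -15                   # new events keep accumulating
puts balance_view(cp, delta)   # prints 100
```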

Page 27: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Future Work: Theory

• Current analysis is somewhat ad hoc
• If program does not reclaim storage, two possibilities:
1. Program is “not reclaimable” in principle
  • (Possible program bug!)
2. Our analysis is not complete
  • (Possible analysis bug!)

How to characterize the class of “not reclaimable” programs?

Page 28: Edelweiss: Automatic Storage Reclamation for Distributed Programming

Reclaiming KVS Deletions

• Good question
• X.notin(Y): how to reclaim from Y?
1. Y is a dense ordered set; compress it.
2. Prove that each Y tuple matches exactly one X tuple

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)

k is a key of i_log
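The first option, compressing a dense ordered set, might look like the sketch below. The slide does not say how compression is done, so this is one possible approach under a stated assumption: the entries of Y are positive integer ids handed out consecutively (as with acknowledged sequence numbers). The class name `DenseIdSet` is hypothetical.

```ruby
require 'set'

# Hypothetical compressed set of positive integer ids (1, 2, 3, ...).
# Every id <= watermark is a member; ids above it are stored one by one.
class DenseIdSet
  attr_reader :watermark, :sparse

  def initialize
    @watermark = 0
    @sparse = Set.new
  end

  def add(id)
    @sparse << id if id > @watermark
    # Absorb any run that is now contiguous with the watermark.
    @watermark += 1 while @sparse.delete?(@watermark + 1)
    self
  end

  def include?(id)
    id <= @watermark || @sparse.include?(id)
  end
end

ids = DenseIdSet.new
[1, 2, 4, 5].each { |id| ids.add(id) }
p [ids.watermark, ids.sparse.size]   # a gap at 3 keeps 4 and 5 stored individually

ids.add(3)
p [ids.watermark, ids.sparse.size]   # the gap is filled, so one integer now covers all five ids
```

Membership answers stay the same before and after compression; only the storage needed to represent the set shrinks.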