foundation usenix08 talk


Fast, Inexpensive Content-Addressed Storage in Foundation

Sean Rhea* (Meraki, Inc.)

Russ Cox, Alex Pesterev* (MIT CSAIL)

*Work done while at Intel Research, Berkeley.



As a community, we're not bad at storing important data over the long term.

We've only just begun to think about how we'll interpret that data 30 years from now.



For Example

Viewing an old PowerPoint presentation

Do we still have PowerPoint at all? And Windows?

Does the presentation use non-standard fonts/codecs?

Has some newer application overwritten a shared library with an incompatible version (DLL Hell)?

Not just a Microsoft problem: consider a web page

Even current IE/Safari/Firefox don't agree on formatting

All kinds of plugins necessary: sound, video, Flash



The Foundation Idea

Make daily backups of entire software stack

Archives user's applications, OS, and configuration state

Don't worry about identifying dependencies

Just save it all: every byte, every night

To recover an obscure file, boot the relevant stack in an emulator

View file with the application that created it



Foundation FAQ

Why preserve the entire disk?

Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit

Works for all applications, not just ones designed for preservation

Why daily images?

Want to preserve machine state as close as possible to last write of user's data (i.e., preserve image before something changes)

Also allows recovery from user errors

Why emulate hardware?

Much better track record than emulating software

Software example: OpenOffice emulating Microsoft Word (yikes)

Hardware emulators available today for Amiga, PDP-11, Nintendo



I would love to give a talk about why Foundation is a great solution to the digital preservation problem.

Really, though, I think it's just a pretty good start.

Instead, I'm going to talk about a fun problem we had to solve to make it work.



Every Byte, Every Night? Indefinitely? Really?

Plan 9 did exactly that

Archive changed blocks every night to optical jukebox

Found that storage capacity grew faster than usage

Later with Content-Addressable Storage (Venti)

Automatically coalesces duplicate data to save space

Required multiple, high-speed disks for performance

Challenge for Foundation: provide similar storage efficiency on consumer hardware

Time Machine model: one external USB drive



Talk Outline

Introduction

What is Foundation?

Review of Content-Addressed Storage (Venti)

Contributions

Making Cheap Content-Addressed Storage Fast

Avoiding Concerns over Hash Collisions

Related Work

Conclusions



Venti Review

Plan 9 file system was two-level

Spinning storage, mostly a normal file system

Archival storage, optical write-once jukebox

Venti replaced optical jukebox

Still write-once

Chunks of data named by their SHA-1 hashes

Content-Addressable Storage (CAS)

Automatically coalesces duplicate writes
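
The idea of naming chunks by their SHA-1 hashes can be sketched in a few lines of Python. This is a toy in-memory model, not Venti's implementation: the class name, the list standing in for the data log, and the dict standing in for the on-disk index are all assumptions made for illustration.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory model of a write-once, content-addressed block store."""

    def __init__(self):
        self.log = []    # append-only data log (hypothetical stand-in for the disk log)
        self.index = {}  # SHA-1 digest -> offset of the block in the log

    def write(self, block: bytes) -> bytes:
        """Store a block and return its SHA-1 digest, which serves as its name."""
        digest = hashlib.sha1(block).digest()
        if digest not in self.index:   # seen it before? if so, no log write
            self.index[digest] = len(self.log)
            self.log.append(block)
        return digest

    def read(self, digest: bytes) -> bytes:
        """Map a digest to its log offset and return the block stored there."""
        return self.log[self.index[digest]]


store = ContentAddressedStore()
first = store.write(b"A" * 512)
second = store.write(b"B" * 512)
again = store.write(b"A" * 512)   # duplicate write is coalesced
```

Writing the same 512 bytes twice returns the same name and leaves only two blocks in the log, which is the coalescing behavior the slide describes.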



Venti Review

[Figure: Venti archival process. Components: User's Hard Drive, External USB Drive (index of hash to offset, plus data log), RAM (summary). The archival process reads a block, hashes it, and asks the index "seen it before?". A new block is appended to the log and the index is updated. For a duplicate (the 4th block in the example) there is no log write. The block's hash is appended to the summary.]



Venti Review

[Figure: Venti restore process. Components: index of hash to offset, data log, RAM (summary of block hashes). Steps: look up hash of 1st block in the summary, map hash to log offset, read block from log, restore block. The figure also carries the label "Crash!".]

Final step (not shown): archive summary in data log as well





Venti Review

[Figure: Venti archival process. The archival process reads the 4th block and asks the index "seen it before?".]

Have to seek to the right bucket




Notes on Venti

The Good News: CAS stores each block with particular contents only once

Changing any one block and re-archiving uses only one more block in archive

Adding a duplicate file from a different source uses no additional storage

The Bad News: Synchronous, random reads to on-disk index

Best case, one-disk performance for 512-byte blocks: one 5 ms seek per 512 bytes archived ≈ 100 kB/s

That's 12 days to archive a 100 GB disk!

Larger blocks give better throughput, less sharing
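
The arithmetic behind those figures can be checked with a short calculation. It assumes decimal units (100 GB = 100e9 bytes) and exactly one seek per block; the function name and the 8 KB comparison point are illustrative choices, not from the talk.

```python
def archive_days(disk_bytes: float, block_bytes: int, seek_s: float) -> float:
    """Days to archive a disk when every block costs one index seek."""
    throughput = block_bytes / seek_s          # bytes per second
    return disk_bytes / throughput / 86_400    # 86,400 seconds per day


throughput_kb_s = 512 / 0.005 / 1000             # 512-byte blocks, 5 ms seek
days_512 = archive_days(100e9, 512, 0.005)       # 100 GB disk, 512-byte blocks
days_8k = archive_days(100e9, 8192, 0.005)       # same disk, 8 KB blocks (assumed size)
print(f"{throughput_kb_s:.1f} kB/s, {days_512:.1f} days; 8 KB blocks: {days_8k:.2f} days")
```

Unrounded, this comes out at roughly 102 kB/s and a little over 11 days, in line with the slide's rounded 100 kB/s and 12 days. Blocks 16 times larger cut the time by the same factor, at the cost of less sharing between similar data.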



Notes on Venti (cont.)

Venti's solution: use 8 high-speed disks for index

Untenable in consumer space

Wears disks out pretty quickly, too

The compare-by-hash controversy:

Fear of hash collisions: two different blocks with same hash breaks Venti

May be very unlikely, but cost (data corruption) is huge

Does CAS really require a cryptographically strong hash?





Making Inexpensive CAS Fast

The problem: disk seeks

Secure hash randomizes an otherwise sequential disk-to-disk transfer

To reduce seeks, must reduce hash table lookups

When do hash table lookups occur?

1. When writing data, to determine if we've seen it before

2. When writing data, to update the index

3. When reading data, to map hashes to disk locations






Updating the Index

[Figure: Venti archival process. The archival process reads the 2nd block, appends it to the data log, and then updates the index of hash to offset.]



2. Updating the Index

After appending a block to the data log, must update the index

Pseudorandom hash causes a seek

Easy to fix: use a write-back index cache

Store index writes in memory

Flush to disk sequentially in large batches

On crash, reconstruct index from the data log
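
A write-back index cache along these lines might look like the sketch below. It is a toy model under stated assumptions: the class and function names, the bucket count, the flush threshold, and the in-memory lists standing in for the on-disk index are all hypothetical, and sorting by bucket number is just one way to make the flush a single sequential pass.

```python
import hashlib


class WriteBackIndex:
    """Toy model: buffer index updates in RAM, then flush them in bucket order."""

    def __init__(self, num_buckets: int, flush_threshold: int):
        self.num_buckets = num_buckets
        self.flush_threshold = flush_threshold
        self.disk_buckets = [dict() for _ in range(num_buckets)]  # stands in for the on-disk index
        self.pending = {}   # digest -> log offset, not yet written to disk
        self.flushes = 0

    def _bucket(self, digest: bytes) -> int:
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def insert(self, digest: bytes, offset: int) -> None:
        self.pending[digest] = offset          # no disk access here
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Visit buckets in ascending order so the writes form one sequential pass.
        for digest in sorted(self.pending, key=self._bucket):
            self.disk_buckets[self._bucket(digest)][digest] = self.pending[digest]
        self.pending.clear()
        self.flushes += 1

    def lookup(self, digest: bytes):
        if digest in self.pending:
            return self.pending[digest]
        return self.disk_buckets[self._bucket(digest)].get(digest)


def rebuild_index(log, index: WriteBackIndex) -> None:
    """Crash recovery: rescan the data log and re-insert every block."""
    for offset, block in enumerate(log):
        index.insert(hashlib.sha1(block).digest(), offset)
    index.flush()


log = [bytes([i]) * 512 for i in range(10)]
index = WriteBackIndex(num_buckets=4, flush_threshold=4)
for offset, block in enumerate(log):
    index.insert(hashlib.sha1(block).digest(), offset)

recovered = WriteBackIndex(num_buckets=4, flush_threshold=100)
rebuild_index(log, recovered)   # pretend the buffered entries were lost in a crash
```

Because the data log is the authoritative copy, losing the buffered entries costs only a rescan, which is why the cache can safely be write-back.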



3. Mapping Hashes to Disk Locations During Reads

To restore disk

Start with the list of original blocks' hashes

Look up each block in index

Read block from data log and restore to disk

Observation: data log is mostly ordered

Duplicate blocks often occur as part of duplicate files

Idea: add another index, ordered by log offset

Read-ahead in this index to eliminate future lookups in original index
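
The read-ahead idea can be modeled by counting index seeks during a restore. The sketch below is an assumption-laden toy: the function name, the read-ahead window of four entries, the unbounded RAM cache, and the rule that every primary lookup and every secondary read-ahead costs exactly one seek are all choices made for illustration.

```python
import hashlib


def restore(summary, primary, by_offset, readahead=4):
    """Return (offsets, index_seeks) for restoring the blocks named in summary.

    primary:   dict mapping digest -> log offset (each lookup costs one seek)
    by_offset: list where by_offset[i] is the digest of the block at offset i
    """
    cache = {}        # prefetched digest -> offset entries held in RAM
    offsets = []
    index_seeks = 0
    for digest in summary:
        if digest not in cache:
            offset = primary[digest]          # seek in the original index
            index_seeks += 1
            # Read ahead in the offset-ordered index (one more seek).
            window = by_offset[offset:offset + readahead]
            index_seeks += 1
            for i, d in enumerate(window):
                cache[d] = offset + i
        offsets.append(cache[digest])
    return offsets, index_seeks


digests = [hashlib.sha1(bytes([i]) * 512).digest() for i in range(8)]
primary = {d: i for i, d in enumerate(digests)}

in_order, seeks_in_order = restore(digests, primary, digests)
backwards, seeks_backwards = restore(digests[::-1], primary, digests)
```

When the summary follows log order, eight blocks need four index seeks instead of eight. When it runs against log order, every block misses the cache and pays for both indexes, so the seek count doubles.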



Index by Offset

[Figure: restore process with a new secondary index, sorted by offset, that maps log offsets to hashes. To restore the 1st block, the restore process maps its hash to a log offset in the original index (seek!), prefetches the hashes for the next few offsets from the secondary index (seek!), and reads the block from the log (seek!). To restore the 2nd block, it looks up the block's hash, finds the log offset in the prefetched secondary index entries (no seek!), and reads the block from the log (no seek!).]



1. Is a Block New, or Duplicate?

Optimization for reads also helps duplicate writes

Index misses on first duplicate block

Hits on subsequent blocks rewritten in same order

Doesn't help for new data

Every lookup in primary index fails

Still suffer a seek for every new block



1. Is a Block New, or Duplicate?

Idea: use a Bloom filter to identify new blocks

Lossy representation of the primary index

Uses much less memory than index itself

For any given block, Bloom filter tells us:

It's definitely new: append to log, update index

It might be a duplicate: look up in index

If it really is a duplicate, we get the prefetch benefit

Otherwise, it's called a false positive

Using enough memory keeps false positives at ~1%
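
A Bloom filter in front of the index could be sketched as follows. This is a generic textbook construction, not Foundation's code: the class and function names, the double-hashing scheme that derives bit positions from the SHA-1 digest, and the sizing (10 bits per block, 7 probes) are assumptions chosen to land near a 1% false-positive rate.

```python
import hashlib


class BloomFilter:
    """Lossy set membership: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, digest: bytes):
        # Derive the probe positions from two halves of the digest (double hashing).
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, digest: bytes) -> None:
        for p in self._positions(digest):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, digest: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(digest))


def archive(blocks, bloom, index, log) -> int:
    """Archive blocks and return how many index lookups were needed."""
    index_lookups = 0
    for block in blocks:
        digest = hashlib.sha1(block).digest()
        if bloom.might_contain(digest):
            index_lookups += 1      # might be a duplicate: look up in index
            if digest in index:
                continue            # real duplicate, no log write
            # Otherwise a false positive: fall through and store as new.
        index[digest] = len(log)    # definitely new: append to log, update index
        log.append(block)
        bloom.add(digest)
    return index_lookups


blocks = [i.to_bytes(4, "big") * 128 for i in range(1000)]   # 1000 distinct 512-byte blocks
bloom, index, log = BloomFilter(num_bits=10_000, num_hashes=7), {}, []
first_pass = archive(blocks, bloom, index, log)    # all new data
second_pass = archive(blocks, bloom, index, log)   # all duplicates
```

On the first pass almost every block skips the index lookup. On the second pass every block is a true duplicate, so every one goes to the index and nothing is appended to the log.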



Results

Do these optimizations pay off?

Buffering index writes is an obvious win

Bloom filter is, too: removes 99% of seeks when writing new data

Both trade RAM for seeks

Benefit of secondary index less clear

If duplicate data comes in long sequences, it reduces index seeks to two per sequence

If duplicate data comes in little fragments, it doubles the number of index seeks

Need traces of real data to answer this question



Results (cont.)

Research group at MIT has been running Venti as its backup server for two years

We looked at 400 nightly snapshots

Simulated archiving and restoring these in both Venti and Foundation

                          Venti       Foundation
Average archival speed    < 1 MB/s    20.1 MB/s
% time spent seeking      96%         10%
Average restore speed     1.2 MB/s    13.6 MB/s
% time spent seeking      95%         58%





Eliminating Compare by Hash

Some worried that same SHA-1 doesn't imply same contents (i.e., hash collisions are possible)

Even if very rare, consequences (corruption) too great

Stepping back a bit, CAS as a black box:

Give it a data block, get back an opaque ID

Give it an opaque ID, get back the data block

Do we care that the ID is a SHA-1 hash?

What if the opaque ID was just the block's location in the data log?



2nd Disk Arm to the Rescue

Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data

Can instead put it to work doing byte-by-byte comparisons of suspected duplicates

                      Foundation
           Venti      By Hash      By Value
Archival   < 1 MB/s   20.1 MB/s    15.4 MB/s
Restore    1.2 MB/s   13.6 MB/s    15.0 MB/s
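
Compare-by-value with location-based IDs can be illustrated with a toy store in which the hash only nominates suspected duplicates. Everything here is a sketch under assumptions: the class name, the per-hash list of candidate offsets, and the pluggable hash function are hypothetical, and the deliberately weak hash in the usage lines exists only to force a collision.

```python
import hashlib


class ByValueStore:
    """Toy CAS whose opaque IDs are log offsets; the hash is only a hint."""

    def __init__(self, hash_fn=lambda b: hashlib.sha1(b).digest()):
        self.hash_fn = hash_fn
        self.log = []
        self.candidates = {}   # hash -> list of log offsets whose blocks have that hash

    def put(self, block: bytes) -> int:
        h = self.hash_fn(block)
        for offset in self.candidates.get(h, []):
            if self.log[offset] == block:   # byte-by-byte comparison
                return offset               # true duplicate: reuse its location
        offset = len(self.log)              # new contents, even if the hash collided
        self.log.append(block)
        self.candidates.setdefault(h, []).append(offset)
        return offset

    def get(self, block_id: int) -> bytes:
        return self.log[block_id]           # the ID is the location: no index lookup


# A deliberately terrible hash (first byte only) to force collisions.
weak = ByValueStore(hash_fn=lambda b: b[:1])
id_x = weak.put(b"aX")
id_y = weak.put(b"aY")        # same hash as b"aX", different contents
id_x_again = weak.put(b"aX")  # genuine duplicate
```

Even with colliding hashes the two distinct blocks get distinct IDs and read back intact, which is why this scheme does not depend on the hash being collision-free. Reads also skip the hash-to-offset lookup entirely, which may help explain the restore speed in the By Value column.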



Conclusions

Consumer-grade CAS works now

A single, external USB drive is enough

Just have to be crafty about avoiding seeks

Lots of uses other than preservation

E.g., inexpensive household backup server that automatically coalesces duplicate media collections

Doesn't require a collision-free hash function
