foundation usenix08 talk
8/4/2019 Foundation Usenix08 Talk
Fast, Inexpensive Content-Addressed Storage in Foundation
Sean Rhea* (Meraki, Inc.)
Russ Cox, Alex Pesterev* (MIT CSAIL)
*Work done while at Intel Research, Berkeley.
-
As a community, we're not bad at storing
important data over the long term.
We've only just begun to think about how
we'll interpret that data 30 years from now.
-
For Example
Viewing an old PowerPoint presentation
Do we still have PowerPoint at all? And Windows?
Does the presentation use non-standard fonts/codecs?
Has some newer application overwritten a shared library with an incompatible version (DLL Hell)?
Not just a Microsoft problem: consider a web page
Even current IE/Safari/Firefox don't agree on formatting
All kinds of plugins necessary: sound, video, Flash
-
The Foundation Idea
Make daily backups of entire software stack
Archives users' applications, OS, and configuration state
Don't worry about identifying dependencies
Just save it all: every byte, every night
To recover an obscure file, boot the relevant stack in an emulator
View file with the application that created it
-
Foundation FAQ
Why preserve the entire disk?
Preserve software stack dependencies: preserve the data with the right application, libraries, and operating system as a single unit
Works for all applications, not just ones designed for preservation
Why daily images?
Want to preserve machine state as close as possible to last write of user's data (i.e., preserve image before something changes)
Also allows recovery from user errors
Why emulate hardware?
Much better track record than emulating software
Software example: OpenOffice emulating Microsoft Word (yikes)
Hardware emulators available today for Amiga, PDP-11, Nintendo
-
I would love to give a talk about why Foundation is a great solution to the
digital preservation problem.
Really, though, I think it's just a pretty good start.
Instead, I'm going to talk about a fun problem we had to solve to make it work.
-
Every Byte, Every Night? Indefinitely? Really?
Plan 9 did exactly that
Archive changed blocks every night to optical jukebox
Found that storage capacity grew faster than usage
Later with Content-Addressable Storage (Venti)
Automatically coalesces duplicate data to save space
Required multiple, high-speed disks for performance
Challenge for Foundation: provide similar storage efficiency on consumer hardware
Time Machine model: one external USB drive
-
Talk Outline
Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions
-
Venti Review
Plan 9 file system was two-level
Spinning storage, mostly a normal file system
Archival storage, optical write-once jukebox
Venti replaced optical jukebox
Still write-once
Chunks of data named by their SHA-1 hashes: Content-Addressable Storage (CAS)
Automatically coalesces duplicate writes
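To make the naming scheme concrete, here is a minimal in-memory sketch of a hash-addressed, write-once store. It is not Venti's actual code or API: the class name `HashAddressedStore` and its `write`/`read` methods are hypothetical, and a Python list and dict stand in for the on-disk data log and index.

```python
import hashlib


class HashAddressedStore:
    """Toy content-addressed store (hypothetical names, in-memory only)."""

    def __init__(self):
        self.log = []     # append-only data log; list position acts as the offset
        self.index = {}   # SHA-1 hex digest -> offset in the log

    def write(self, block: bytes) -> str:
        """Store a block and return its name, the SHA-1 of its contents."""
        digest = hashlib.sha1(block).hexdigest()
        if digest not in self.index:      # unseen contents: append exactly once
            self.log.append(block)
            self.index[digest] = len(self.log) - 1
        return digest                     # duplicates return the same name

    def read(self, digest: str) -> bytes:
        """Map a name back to the stored block."""
        return self.log[self.index[digest]]


store = HashAddressedStore()
first = store.write(b"same contents")
second = store.write(b"same contents")    # coalesced with the first write
```

Because the name is derived from the contents, the second write adds nothing to the log, which is the coalescing behaviour described on the slide.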
-
Venti Review
[Diagram: the archival process reads blocks from the user's hard drive and hashes each one. For every block it asks the on-disk index (hash to offset) on the external USB drive "seen it before?". A new block is appended to the data log, the index is updated, and the block's hash is appended to the summary held in RAM. Reading the 4th block, a duplicate, causes no log write.]
-
Venti Review
[Diagram: restore process after a crash. Starting with the 1st block, look up each hash in the summary, map the hash to a log offset through the index on the external USB drive, read the block from the data log, and restore it to the user's hard drive.]
Final step (not shown): archive summary in data log as well
-
Notes on Venti
The Good News:
CAS stores each block with particular contents only once
Changing any one block and re-archiving uses only one more block in archive
Adding a duplicate file from a different source uses no additional storage
The Bad News:
Synchronous, random reads to on-disk index
-
Venti Review
[Diagram: archival process reading the 4th block. Answering "seen it before?" means seeking to the right bucket of the on-disk index.]
-
Venti Review
[Diagram: restore process. Looking up the hash of the 1st block and mapping it to a log offset means seeking to the right bucket of the on-disk index.]
-
Notes on Venti
The Good News:
CAS stores each block with particular contents only once
Changing any one block and re-archiving uses only one more block in archive
Adding a duplicate file from a different source uses no additional storage
The Bad News:
Synchronous, random reads to on-disk index
Best case, one-disk performance for 512-byte blocks: one 5 ms seek per 512 bytes archived = 100 kB/s
That's 12 days to archive a 100 GB disk!
Larger blocks give better throughput, less sharing
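The arithmetic behind these figures can be checked with a few lines. This is only a back-of-the-envelope model, assuming exactly one index seek per archived block and nothing else limiting throughput; the function names are made up for illustration.

```python
def seek_bound_throughput(block_bytes, seek_seconds):
    """Bytes archived per second when every block costs one index seek."""
    return block_bytes / seek_seconds


def archive_days(disk_bytes, block_bytes, seek_seconds):
    """Days needed to archive a whole disk at the seek-bound rate."""
    seconds = disk_bytes / seek_bound_throughput(block_bytes, seek_seconds)
    return seconds / 86_400  # seconds per day


# Figures from the slide: 512-byte blocks, 5 ms per seek, 100 GB disk.
rate = seek_bound_throughput(512, 0.005)        # about 102 kB/s
days = archive_days(100e9, 512, 0.005)          # between 11 and 12 days

# Larger blocks amortize the same seek over more data (at the cost of sharing).
days_8k = archive_days(100e9, 8192, 0.005)      # under one day
```

The slide rounds the rate down to 100 kB/s, which gives 10^6 seconds, or a little under 12 days.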
-
Notes on Venti (cont.)
Venti's solution: use 8 high-speed disks for index
Untenable in consumer space
Wears disks out pretty quickly, too
The compare-by-hash controversy:
Fear of hash collisions: two different blocks with same hash breaks Venti
May be very unlikely, but cost (data corruption) is huge
Does CAS really require a cryptographically strong hash?
-
Talk Outline
Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions
-
Making Inexpensive CAS Fast
The problem: disk seeks
Secure hash randomizes an otherwise sequential disk-to-disk transfer
To reduce seeks, must reduce hash table lookups
When do hash table lookups occur?
1. When writing data, to determine if we've seen it before
2. When writing data, to update the index
3. When reading data, to map hashes to disk locations
-
2. Updating the Index
After appending a block to the data log, must update the index
Pseudorandom hash causes a seek
-
Updating the Index
[Diagram: archival process reading the 2nd block. After the block is appended to the data log, updating the index means seeking to the right bucket.]
-
2. Updating the Index
After appending a block to the data log, must update the index
Pseudorandom hash causes a seek
Easy to fix: use a write-back index cache
Store index writes in memory
Flush to disk sequentially in large batches
On crash, reconstruct index from the data log
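A minimal sketch of the write-back idea follows. It is an illustration under stated assumptions, not Foundation's implementation: `WriteBackIndex`, `flush_threshold`, and `rebuild_index` are hypothetical names, a dict stands in for the on-disk hash table, and the "sequential" flush is modelled simply by applying the buffered entries in sorted key order.

```python
import hashlib


class WriteBackIndex:
    """Toy index (hash -> log offset) that buffers updates in RAM."""

    def __init__(self, flush_threshold=4):
        self.on_disk = {}        # stands in for the on-disk hash table
        self.pending = {}        # index writes not yet flushed
        self.flush_threshold = flush_threshold
        self.flushes = 0         # how many batched passes over the disk so far

    def put(self, digest, offset):
        self.pending[digest] = offset          # no disk access here
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Apply the whole batch in key order, so buckets are visited in one
        # sweep instead of one random seek per update.
        for digest in sorted(self.pending):
            self.on_disk[digest] = self.pending[digest]
        self.pending.clear()
        self.flushes += 1

    def get(self, digest):
        if digest in self.pending:             # newest entries live in RAM
            return self.pending[digest]
        return self.on_disk.get(digest)


def rebuild_index(data_log):
    """After a crash, recompute hash -> offset by scanning the data log."""
    index = {}
    for offset, block in enumerate(data_log):
        index.setdefault(hashlib.sha1(block).digest(), offset)
    return index
```

Losing `pending` in a crash is safe in this model because every buffered entry can be recomputed from the blocks already in the log.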
-
3. Mapping Hashes to Disk Locations During Reads
To restore disk
Start with the list of original blocks' hashes
Look up each block in index
Read block from data log and restore to disk
-
[Diagram: restore process. Looking up the hash of the 1st block and mapping it to a log offset means seeking to the right bucket of the on-disk index.]
-
3. Mapping Hashes to Disk Locations During Reads
To restore disk
Start with the list of original blocks' hashes
Look up each block in index
Read block from data log and restore to disk
Observation: data log is mostly ordered
Duplicate blocks often occur as part of duplicate files
Idea: add another index, ordered by log offset
Read-ahead in this index to eliminate future lookups in original index
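The read-ahead idea can be sketched as follows. This is an in-memory illustration with hypothetical names (`ReadAheadStore`, `readahead`, `primary_lookups`); it only counts how often the primary index would have to be consulted and makes no claim about Foundation's on-disk layout or read-ahead size.

```python
import hashlib


def sha1(block):
    return hashlib.sha1(block).digest()


class ReadAheadStore:
    """Toy restore path with a secondary index ordered by log offset."""

    def __init__(self, blocks, readahead=4):
        self.log = list(blocks)                        # data log (unique blocks)
        self.by_offset = [sha1(b) for b in self.log]   # secondary index: offset -> hash
        self.by_hash = {h: i for i, h in enumerate(self.by_offset)}  # primary index
        self.readahead = readahead
        self.prefetched = {}       # hash -> offset entries already read ahead
        self.primary_lookups = 0   # each one stands for a random seek

    def locate(self, digest):
        if digest in self.prefetched:          # answered from read-ahead
            return self.prefetched[digest]
        self.primary_lookups += 1              # random lookup in primary index
        offset = self.by_hash[digest]
        # Read the next few (offset, hash) entries from the secondary index.
        end = min(offset + 1 + self.readahead, len(self.by_offset))
        for i in range(offset + 1, end):
            self.prefetched[self.by_offset[i]] = i
        return offset

    def restore(self, summary):
        """Rebuild the original block sequence from its list of hashes."""
        return [self.log[self.locate(d)] for d in summary]


blocks = [bytes([i]) * 4 for i in range(10)]
store = ReadAheadStore(blocks, readahead=4)
restored = store.restore([sha1(b) for b in blocks])
```

When blocks are restored in the same order they were logged, only every fifth block in this example touches the primary index; the rest are found among the prefetched entries.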
-
Index by Offset
[Diagram: restore process with a new secondary index, sorted by offset, stored alongside the primary index and data log on the external USB drive.
1st block: look up its hash, map hash to log offset in the primary index (seek!), prefetch hashes for the next few offsets from the secondary index (seek!), read block from log (seek!), restore block.
2nd block: look up its hash, find log offset in the prefetched secondary-index entries (no seek!), read block from log (no seek!).]
-
1. Is a Block New, or Duplicate?
Optimization for reads also helps duplicate writes
Index misses on first duplicate block
Hits on subsequent blocks rewritten in same order
Doesn't help for new data
Every lookup in primary index fails
Still suffer a seek for every new block
-
1. Is a Block New, or Duplicate?
Idea: use a Bloom filter to identify new blocks
Lossy representation of the primary index
Uses much less memory than index itself
For any given block, Bloom filter tells us:
It's definitely new: append to log, update index
It might be duplicate: look up in index
If it really is a duplicate, we get the prefetch benefit
Otherwise, called a false positive
Using enough memory keeps false positives at ~1%
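The decision procedure above can be sketched with a small Bloom filter. Everything here is illustrative: the `BloomFilter` class, its sizing (`num_bits`, `num_hashes`), the salted-SHA-1 bit positions, and the `archive` helper are assumptions made for the example, and a dict stands in for the on-disk primary index.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a lossy, in-RAM summary of the keys in the index."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits               # must be a multiple of 8 here
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-1 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely new"; True means "might be a duplicate".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def archive(block, bloom, index, log):
    """Archive one block and return its log offset."""
    digest = hashlib.sha1(block).digest()
    if bloom.might_contain(digest):
        offset = index.get(digest)     # stands in for the on-disk lookup (a seek)
        if offset is not None:
            return offset              # real duplicate: no log write
        # Otherwise this was a false positive; fall through and treat as new.
    log.append(block)
    index[digest] = len(log) - 1
    bloom.add(digest)
    return index[digest]
```

The filter never reports a stored key as new, so skipping the index lookup on a "definitely new" answer cannot lose a duplicate; the only cost of its lossiness is an occasional wasted lookup.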
-
Results
Do these optimizations pay off?
Buffering index writes is an obvious win
Bloom filter is, too: removes 99% of seeks when writing new data
Both trade RAM for seeks
Benefit of secondary index less clear
If duplicate data comes in long sequences, it reduces index seeks to two per sequence
If duplicate data comes in little fragments, it doubles the number of index seeks
Need traces of real data to answer this question
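The trade-off for the secondary index can be put into a tiny counting model. It is a simplification, assuming read-ahead always covers a whole run of consecutive duplicate blocks; `index_seeks` and `run_lengths` are names invented for this sketch.

```python
def index_seeks(run_lengths, use_secondary_index):
    """Count index seeks for duplicate data arriving in runs.

    run_lengths: number of consecutive duplicate blocks in each run.
    """
    if use_secondary_index:
        # One seek in the primary index plus one in the secondary, per run.
        return 2 * len(run_lengths)
    # Without it, every block costs one seek in the primary index.
    return sum(run_lengths)


long_run = [1000]            # one sequence of 1000 duplicate blocks
fragments = [1] * 1000       # 1000 isolated duplicate blocks

long_with = index_seeks(long_run, True)        # 2
long_without = index_seeks(long_run, False)    # 1000
frag_with = index_seeks(fragments, True)       # 2000
frag_without = index_seeks(fragments, False)   # 1000
```

Which regime dominates depends entirely on the run-length distribution of real data, which is why the slide calls for traces.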
-
Results (cont.)
Research group at MIT has been running Venti as its backup server for two years
We looked at 400 nightly snapshots
Simulated archiving and restoring these in both Venti and Foundation

                          Venti      Foundation
Average archival speed    < 1 MB/s   20.1 MB/s
  % time spent seeking    96%        10%
Average restore speed     1.2 MB/s   13.6 MB/s
  % time spent seeking    95%        58%
-
Talk Outline
Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions
-
Eliminating Compare by Hash
Some worried that same SHA-1 doesn't imply same contents (i.e., hash collisions are possible)
Even if very rare, consequences (corruption) too great
Stepping back a bit, CAS as a black box:
Give it a data block, get back an opaque ID
Give it an opaque ID, get back the data block
Do we care that the ID is a SHA-1 hash?
What if the opaque ID was just the block's location in the data log?
-
2nd Disk Arm to the Rescue
Once we eliminate most index reads (via our previous optimizations), the backup disk is otherwise idle while backing up duplicate data
Can instead put it to work doing byte-by-byte comparisons of suspected duplicates

           Venti      Foundation (By Hash)   Foundation (By Value)
Archival   < 1 MB/s   20.1 MB/s              15.4 MB/s
Restore    1.2 MB/s   13.6 MB/s              15.0 MB/s
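A compare-by-value store can be sketched as below: the opaque ID handed back to the caller is the block's position in the data log, and the checksum is only a hint for finding suspected duplicates, which are then confirmed byte by byte. The class `ByValueStore`, its `checksum` parameter, and the use of CRC-32 as the default hint are assumptions for illustration, not details taken from Foundation.

```python
import zlib


class ByValueStore:
    """Toy CAS whose IDs are log offsets; hashes are hints, never names."""

    def __init__(self, checksum=zlib.crc32):
        self.log = []          # append-only data log
        self.hints = {}        # checksum -> offsets of blocks with that checksum
        self.checksum = checksum

    def put(self, block: bytes) -> int:
        key = self.checksum(block)
        for offset in self.hints.get(key, []):
            if self.log[offset] == block:      # byte-by-byte comparison
                return offset                  # confirmed duplicate
        self.log.append(block)                 # new contents (or a mere collision)
        offset = len(self.log) - 1
        self.hints.setdefault(key, []).append(offset)
        return offset

    def get(self, block_id: int) -> bytes:
        return self.log[block_id]


# Even a checksum that collides on every input cannot corrupt data here.
worst_case = ByValueStore(checksum=lambda block: 0)
id_a = worst_case.put(b"block A")
id_b = worst_case.put(b"block B")      # same checksum, different contents
id_a_again = worst_case.put(b"block A")
```

A collision in this design costs an extra comparison rather than silently aliasing two different blocks, which is why a collision-free hash is no longer required.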
-
Talk Outline
Introduction
What is Foundation?
Review of Content-Addressed Storage (Venti)
Contributions
Making Cheap Content-Addressed Storage Fast
Avoiding Concerns over Hash Collisions
Related Work
Conclusions
-
Conclusions
Consumer-grade CAS works now
A single, external USB drive is enough
Just have to be crafty about avoiding seeks
Lots of uses other than preservation
E.g., inexpensive household backup server that automatically coalesces duplicate media collections
Doesn't require a collision-free hash function