EUDAT 3rd Conference Sustainability Session: Economic Sustainability of Digital Preservation - Prof. David Rosenthal, Chief Scientist LOCKSS, Stanford University Libraries - Wednesday 24th September 2014, Amsterdam, the Netherlands
Economic Sustainability of Digital Preservation
David S. H. Rosenthal
LOCKSS Program
Stanford University Libraries
http://www.lockss.org/
http://blog.dshr.org/
© 2014 David S. H. Rosenthal
Journals move to the Web
● Access for current readers better:
  ● Links, search, data spreadsheets behind graphs, ...
  ● No need to go to the library
● Access for future readers worse:
  ● Not purchase but rental: no rent payment, no access
  ● Not many copies, but one on short-lived rewritable media
Paper Libraries
● Interesting example of fault-tolerance:
  ● Loosely-coupled network of many independent peers
  ● Each storing a selection of available content
  ● On durable, somewhat tamper-evident media
  ● Market in copies, fewer copies → more care
● Easy to find a copy, hard to find all copies
  ● Interlibrary loan & copy to repair loss or damage
LOCKSS Program
● LOCKSS box acts as persistent Web cache:
  ● Crawls Web to preload with subscribed content
  ● If can't get publisher copy, readers get library copy
  ● Boxes cooperate to detect, repair loss & damage
● Timeline:
  ● 1998 NSF funded prototype
  ● 1999 NSF, Sun funded alpha: 1 journal, 15 boxes
  ● 2000 Mellon, Sun funded beta: ~40 libraries
  ● 2004 Production
  ● 2005 Mellon matching grant
  ● 2007 Sustainability!
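The idea that boxes "cooperate to detect, repair loss & damage" can be illustrated with a toy sketch. This is not the LOCKSS polling protocol, which the slides do not describe; it is a minimal majority-vote comparison of content hashes, and the peer names and the `detect_and_repair` helper are hypothetical.

```python
import hashlib
from collections import Counter

def digest(content):
    """SHA-256 hex digest of one peer's copy."""
    return hashlib.sha256(content).hexdigest()

def detect_and_repair(copies):
    """copies maps a (hypothetical) peer name to that peer's bytes.
    Returns (repaired copies, sorted list of peers that were damaged)."""
    digests = {peer: digest(content) for peer, content in copies.items()}
    # The digest held by most peers is treated as the good version.
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority: cannot tell which copy is good")
    good = next(copies[p] for p, d in digests.items() if d == winner)
    damaged = sorted(p for p, d in digests.items() if d != winner)
    # Every peer ends up holding the majority content.
    repaired = {peer: good for peer in copies}
    return repaired, damaged

boxes = {"box1": b"article text", "box2": b"article text", "box3": b"article tex!"}
repaired, damaged = detect_and_repair(boxes)
print("damaged peers:", damaged)
```

With independent peers, a damaged copy is outvoted by the intact ones, which is the same property the paper-library slide attributes to many loosely-coupled copies.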
LOCKSS: Businesses
● Develop & support use of LOCKSS software:
  ● Free & open-source, but pay for support (cf. Red Hat)
  ● ~150 libraries using the software
● Under contract, run CLOCKSS network:
  ● Dark archive of e-journals & e-books
  ● Not-for-profit managed jointly by publishers and libraries
  ● 12 nodes worldwide
  ● Triggered if unavailable from any publisher, CC license
  ● Certified “Trustworthy Repository”, score 13/15
  ● Technologies, Technical Infrastructure, Security – 5/5
The Half-Empty Archive
● E-journals: less than half preserved
  ● ARL vs. Keepers: ~40% of serials preserved
  ● Faria et al.: <50% of serials preserved
● Public web pages, Ainsworth et al.:
  ● Search engine sampled URLs: ~2/3 preserved
  ● Bit.ly random URLs: ~1/3 preserved
● Choices:
  ● Do nothing
  ● Double the budget
  ● Halve the cost per unit content
Cost Data?
● Lots of research into preservation costs:
  ● CMDP, LIFE, KRDS, PrestoPrime, ENSURE, ...
  ● Serious lack of usable data
  ● Inconsistent accounting, hidden costs, content variability
● My rule of thumb summarizing the research:
  ● Ingest 1/2, preservation 1/3, access 1/6 of lifetime cost
● 4C project: please submit cost data to:
  ● http://www.4cproject.eu/
  ● Curation Cost Exchange
Kryder's Law
● Bit density on disk platters:
  ● Doubles every 18 months
● Thus $ per GB:
  ● Drops 30-40% per year
● If you can afford to store stuff for a few years
  ● You can afford to store it forever
Source: Preeti Gupta, SSRC, UC Santa Cruz
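The "store it for a few years, store it forever" claim follows from a geometric series. The sketch below is a simplified illustration, assuming the yearly cost of keeping a fixed amount of data falls by a constant fraction every year; the function name and parameters are illustrative, not from the talk.

```python
def lifetime_storage_cost(first_year_cost, annual_drop, years=None):
    """Total cost of keeping a fixed amount of data when the yearly cost
    falls by the fraction `annual_drop` each year (a geometric series)."""
    if not 0 < annual_drop < 1:
        raise ValueError("annual_drop must be between 0 and 1")
    keep = 1.0 - annual_drop
    if years is None:
        # Infinite horizon: c + c*keep + c*keep**2 + ... = c / annual_drop
        return first_year_cost / annual_drop
    # Finite horizon: partial sum of the same series.
    return first_year_cost * (1.0 - keep ** years) / annual_drop

for drop in (0.30, 0.40):
    five = lifetime_storage_cost(1.0, drop, years=5)
    forever = lifetime_storage_cost(1.0, drop)
    print(f"{drop:.0%}/yr drop: 5 years cost {five:.2f}, forever costs {forever:.2f}")
```

Under this assumption the total for "forever" is only a few times the first year's cost, and most of it is paid in the first few years. The conclusion holds only while the price drop continues at that rate.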
Stored Safe in the Cloud?
● Cloud storage sold as “cheaper”:
  ● If all charges accounted for, not cheaper for preservation
  ● It's made of the same disks you use locally
  ● Economies of scale captured by the provider
● Cloud storage locks you in:
  ● Free to store, costs to access
  ● Changing providers slow, expensive – you will be gouged
  ● Not a competitive market – dominated by Amazon
● To avoid lock-in, must keep a copy yourself
  ● To allow you to change providers without paying arm+leg
Blue Ribbon Task Force
● Sustainable Digital Preservation & Access:
  ● 2-year study, report in 2010
  ● NSF, Mellon, Library of Congress, JISC, CLIR, NARA
● Preservation has to be justified by access:
  ● D'oh!
  ● Dark archives (e.g. CLOCKSS) hard to sustain
  ● Scholars don't like to, no budget to, pay for access to data
“Big Data”
● Research on past access to archives:
  ● Rare, sparse, except for integrity checks
  ● “Cold” data
● Future access will be different:
  ● Scholars want to data-mine from archive collections
  ● Access much more intense, expensive
  ● Data “warm” to “hot”
● How much more expensive?
  ● Compare S3 (warm) vs Glacier (cold)
  ● S3 2.5 times more expensive
Cloud For Access?
● Cloud ideal for data-mining from collections:
  ● Spiky demand
  ● Charging mechanism
● Amazon Free Public Datasets:
  ● No charge to data owner
  ● Amazon charges readers for compute they use for access
● Library of Congress & Twitter feed (public):
  ● Store copy in Amazon Reduced Redundancy Storage
  ● Charge scholars for access to pay storage cost of copy
  ● Scholars pay Amazon for compute to access copy
Sustaining Open Source Preservation
● Open source essential for preservation:
  ● No “just trust me” like closed-source encryption
  ● … or cloud storage
● Niche market – not like Linux, Apache, ...:
  ● No foundation with large industry sponsors
  ● Red Hat needs frequent, visible upgrades to motivate $
  ● Hard to devote resources to infrastructure improvements
● Mellon recognizes this problem:
  ● Small grant for infrastructure
  ● AJAX crawler, Shibboleth support, protocol improvements
A Petabyte for a Century
● Black Box:
  ● Put PB in, wait 100yrs, take PB out
  ● Whatever media, replication, algorithms you like inside
  ● 50% chance every bit undamaged
● This defines bit half-life:
  ● Approx 60M times the age of the Universe
  ● No feasible benchmark of adequate reliability
● Stuff will get lost or damaged:
  ● Only question is “how much damage for how many $?”
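The "60M times the age of the Universe" figure can be checked with a short calculation. This is a sketch under stated assumptions that are not on the slide: each bit decays independently and exponentially, a petabyte is 10^15 bytes, and the age of the Universe is taken as roughly 13.8 billion years.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15      # assumption: 1 PB = 10**15 bytes
AGE_OF_UNIVERSE_YEARS = 1.38e10     # assumption: roughly 13.8 billion years

def required_bit_half_life(n_bits, years, survival_probability=0.5):
    """Half-life each bit needs so that all n_bits survive `years` with
    the given probability, assuming independent exponential decay."""
    # P(one bit survives)  = 2 ** (-years / half_life)
    # P(all bits survive)  = 2 ** (-n_bits * years / half_life)
    return n_bits * years / -math.log2(survival_probability)

half_life = required_bit_half_life(BITS_PER_PETABYTE, 100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS
print(f"bit half-life: {half_life:.1e} years")
print(f"which is {ratio:.1e} times the age of the Universe")
```

Under these assumptions the required half-life is about 8 x 10^17 years, which is consistent with the slide's order of magnitude and shows why no benchmark could measure such reliability directly.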
Threat Model
Media failure
Hardware failure
Software failure
Network failure
Obsolescence
Natural Disaster
Operator error
External Attack
Insider Attack
Economic Failure
Organization Failure
Is More Reliable Better?
● Two systems, same budget for a decade:
  ● A) zero loss rate
  ● B) 1%/yr loss rate, 50% less $/yr than A per unit content
  ● B's loss rate is clearly unacceptable
● After a decade:
  ● B preserves 1.89 times as much at the same cost
● After 3 decades:
  ● B preserves more than 5 times as much
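The decade figure can be reproduced with a simple accumulation model. The model is an assumption, since the slide does not spell it out: each year A's budget adds one unit of content, B's adds two (half the cost per unit), and B then loses 1% of everything it holds. Only the decade figure is checked here; the model behind the three-decade figure is not stated on the slide.

```python
def preserved_after(years, units_added_per_year, annual_loss_rate):
    """Content held after `years`, adding a fixed amount each year and
    then losing a fixed fraction of everything accumulated so far."""
    held = 0.0
    for _ in range(years):
        held += units_added_per_year      # this year's budget buys new content
        held *= 1.0 - annual_loss_rate    # then some of the total is lost
    return held

a = preserved_after(10, units_added_per_year=1.0, annual_loss_rate=0.0)
b = preserved_after(10, units_added_per_year=2.0, annual_loss_rate=0.01)
print(f"B preserves {b / a:.2f} times as much as A after a decade")
```

The point of the comparison is that, at a fixed budget, a cheaper system with a small loss rate can end up preserving more content than a perfectly reliable but more expensive one.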
The Good News
● Sustainable digital preservation possible:
  ● LOCKSS is an example
The Bad News
● Expectations way out of line with reality:
  ● Can't preserve as much as people assume is being preserved
  ● Nor as reliably as people assume it is being preserved
● Mismatch will get worse:
  ● Expect lots more data, no more money
  ● People expect costs to drop rapidly; experts say slowly, if at all
● Technology won't save us:
  ● Research data, libraries, archives are a niche market
  ● Hard problems, no big payoff for solution, so little research
  ● Build systems from stuff designed to do something else