Multi-Archival Syndicated Storage Platform
Bryan Beecher, University of Michigan
Director, Computing & Network Services
W: http://www.icpsr.umich.edu/ICPSR/staff/beecher.html/

Micah Altman, Harvard University
Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Sciences
E: [email protected]
W: http://maltman.hmdc.harvard.edu/
Roadmap
- Why replicate for preservation?
- What is the institutional model for replication used in Data-PASS?
- How do we build on LOCKSS to support these institutional needs?

Collaborators and Conspirators
Leonid Andreev, IQSS; Steve Burling, ICPSR; Jonathan Crabtree, Odum; Marc Maynard, Roper; Nancy McGovern, ICPSR
Technical Threats
- Media failure: storage conditions, media characteristics
- Format obsolescence
- Preservation infrastructure software failure
- Storage infrastructure software failure
- Storage infrastructure hardware failure
External Threats to Institutions
- Third-party attacks
- Institutional funding
- Change in legal regimes
Quis custodiet ipsos custodes?
- Unintentional curatorial modification
- Loss of institutional knowledge & skills
- Intentional curatorial deaccessioning
- Change in institutional mission
There are potential single points of failure in technology, organization, and legal regimes:
- Diversify your portfolio: multiple software systems, hardware platforms, organizations
- Find diverse partners: diverse business models, legal regimes
http://failblog.org/2008/02/08/floppy-fail/
Consider organizational credentials
No organization is absolutely certain to be reliable
Consider the trust relationships across institutions
http://flickr.com/photos/phauly/35555985/
Policy Driven
- Institutional policy creates formal replication commitments
- Replication commitments are described in metadata, using a schema
- Metadata drives:
  - configuration of the replication network
  - auditing of the replication network
Asymmetric Commitments
Partners vary in …
- … storage commitments to replication
- … size of holdings being replicated
- … what holdings of other partners they replicate
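The asymmetry described above can be made concrete with a small sketch. This is a hypothetical illustration only: partner names are taken from the Data-PASS membership, but the field names, storage figures, and replication assignments are invented for the example and are not the actual Data-PASS schema.

```python
# Hypothetical sketch of asymmetric replication commitments.
# Field names and figures are illustrative, not the Data-PASS schema.
from dataclasses import dataclass, field

@dataclass
class Commitment:
    storage_gb: int                                  # storage pledged to the network
    replicates: list = field(default_factory=list)   # whose holdings this partner replicates

commitments = {
    "ICPSR": Commitment(storage_gb=2000, replicates=["Odum", "Roper", "MRA"]),
    "Odum":  Commitment(storage_gb=500,  replicates=["ICPSR"]),
    "Roper": Commitment(storage_gb=250,  replicates=["ICPSR", "MRA"]),
    "MRA":   Commitment(storage_gb=750,  replicates=["ICPSR", "Odum"]),
}

def replicas_of(owner):
    """Partners that have committed to replicate `owner`'s holdings."""
    return [p for p, c in commitments.items() if owner in c.replicates]
```

Note the asymmetry: each partner pledges a different amount of storage and replicates a different subset of the other partners' holdings, so `replicas_of("ICPSR")` and `replicas_of("Roper")` can have very different sizes.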
Completeness
- Complete public holdings of each partner
- Retain previous versions of holdings
- Include: metadata; data documentation; legal agreements
Restoration Guarantees
- Restore groups of versioned content:
  - to the owning archive
  - to replication hosts
- Institutional failure restoration: transfer the entire holdings of an archive to another
Trust & Verification
Each partner is trusted …
- … to hold the public content of others (not to disseminate it improperly)
- … to add units to be harvested

No partner is trusted to be a “super-user”:
- No deletion (or direct manipulation) of replication storage owned by another partner
- Legal agreements reinforce the trust model

Schema-based auditing is used to …
- … verify that replication guarantees are met
- … record replication and storage commitments
- … document related TRAC criteria
Network level
- Identification: name; description; contact; access point URI
- Capabilities: protocol version; number of replicates maintained; replication frequency; versioning/deletion support
- Human-readable documentation: restrictions on content that may be placed in the network; services guaranteed by the network; Virtual Organization policies relating to network maintenance
Host level
- Identification: name; description; contact; access point URI
- Capabilities: protocol version; storage available
- Human-readable terms of use
- Documentation of hardware, software, and operating personnel in support of TRAC criteria
Archival unit level
- Identification: name; description; contact; access point URI
- Attributes: update frequency; plugin required for harvesting; storage required
- Terms of use: required statement of content compliance with network terms; dissemination terms and conditions
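The three metadata levels described above (network, host, archival unit) can be pictured as a nested structure. The sketch below is a plain-Python mirror of that nesting; every element name, URL, and value here is an illustrative assumption, not the literal Data-PASS schema.

```python
# Illustrative mirror of the three metadata levels; names and values
# are assumptions for the example, not the actual schema elements.
network = {
    "identification": {
        "name": "Data-PASS PLN",
        "contact": "admin@example.org",
        "access_point_uri": "https://pln.example.org/",
    },
    "capabilities": {
        "protocol_version": 1,
        "replicates_maintained": 5,
        "replication_frequency_days": 7,
        "supports_versioning": True,
        "supports_deletion": False,    # nothing is ever deleted
    },
    "hosts": [
        {
            "identification": {"name": "icpsr-node",
                               "access_point_uri": "https://node1.example.org/"},
            "capabilities": {"protocol_version": 1, "storage_tb": 2},
        },
    ],
    "archival_units": [
        {
            "identification": {"name": "mra-collection-1"},
            "attributes": {
                "update_frequency_days": 30,
                "harvest_plugin": "DataversePlugin",
                "storage_gb": 40,
            },
        },
    ],
}
```

The nesting mirrors the auditing story: network-level capabilities (e.g. `supports_deletion: False`) constrain what any host or archival unit entry may promise.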
TRAC Integration
- A number of elements comprise documentation showing how the replication system itself supports relevant TRAC criteria
- Other elements may be used to include text, or to reference external text, documenting evidence of compliance with TRAC criteria
- Specific TRAC criteria are identified implicitly, and can be explicitly identified with attributes
- The schema documentation describes each element's relevance to TRAC and its mapping to particular TRAC criteria
Operations
- Initialization: given a schema instance, distribute AU harvesting responsibility to hosts
- Auditing: does the current host harvesting allocation & history match the replication commitments in the schema?
- Recovery of hosts
- Delivery of AU content to the source archive
- Addition of AUs and hosts
- Growth of AUs over the initial commitment

Assumptions
- Nothing is deleted
- Resources in the network grow monotonically
- Off-the-path behavior is detected automatically, resolved manually
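The audit step above reduces to comparing two maps: what the schema says each AU's replica set should be, versus what the hosts actually hold. Here is a minimal sketch under the stated assumptions (nothing is deleted; discrepancies are reported for manual resolution); the data structures and function name are invented for illustration.

```python
# Sketch of the audit check: compare committed replica sets against
# observed holdings. Structures are illustrative assumptions.

def audit(committed, observed):
    """Return human-readable discrepancies.

    committed: {au_name: set of hosts pledged to hold it}
    observed:  {au_name: set of hosts actually holding it}
    """
    problems = []
    for au, pledged in committed.items():
        holding = observed.get(au, set())
        missing = pledged - holding
        if missing:
            problems.append(f"{au}: not yet replicated on {sorted(missing)}")
    # Nothing is ever deleted, so an AU held outside any commitment is
    # off-the-path behavior, flagged for manual resolution.
    for au in observed:
        if au not in committed:
            problems.append(f"{au}: held but not in any commitment")
    return problems

problems = audit(
    {"mra-au-1": {"ICPSR", "Odum"}, "roper-au-7": {"ICPSR"}},
    {"mra-au-1": {"ICPSR"}, "roper-au-7": {"ICPSR"}},
)
# problems == ["mra-au-1: not yet replicated on ['Odum']"]
```

Because resources only grow monotonically, the audit never needs a "should this have been deleted?" branch: every discrepancy is either a replica not yet made or content held outside a commitment.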
LOCKSS & CLOCKSS
- Very easy to build and deploy: ~5 minutes
- Very easy to plug into the public LOCKSS network: ~5 minutes
- Very easy to manage thereafter; it is basically an appliance
- CLOCKSS is also easy to set up
- Grouping your CLOCKSS devices into a private network, and paring the 20k-line configuration file down to the right 200-line configuration file, is not
- You are now managing a network, not a device
- Standard LOCKSS used for: harvesting; recovery
- New LOCKSS bulk-update mechanism used for: initial configuration; adding AUs
- CLOCKSS mechanisms (certificates, cache monitor) used for: content delivery; optimized recovery; auditing
- Data-PASS customizations for schema processing: translating a schema instance into bulk update requests; reporting on compliance based on the cache monitor database
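The "translating a schema instance into bulk update requests" step can be sketched as a simple grouping: walk the AUs in the schema instance and emit one update request per host listing the AUs that host must add. The request shape, host names, and function below are assumptions for illustration; they are not the actual LOCKSS bulk-update format.

```python
# Illustrative sketch: turn a schema instance (a plain dict here) into
# per-host bulk update requests. The request format is an assumption,
# not the real LOCKSS bulk-update wire format.

def to_bulk_updates(schema):
    """Group committed AUs by host into one update request per host."""
    requests = {}
    for au in schema["archival_units"]:
        for host in au["hosts"]:
            req = requests.setdefault(host, {"host": host, "add_aus": []})
            req["add_aus"].append({"name": au["name"], "plugin": au["plugin"]})
    return list(requests.values())

schema = {
    "archival_units": [
        {"name": "mra-au-1", "plugin": "DataversePlugin",
         "hosts": ["icpsr-node", "odum-node"]},
        {"name": "roper-au-7", "plugin": "DataversePlugin",
         "hosts": ["icpsr-node"]},
    ],
}
updates = to_bulk_updates(schema)
```

Consistent with the add-only assumptions above, the translation only ever emits `add_aus` entries; there is no removal path, so re-running it against a grown schema instance simply produces larger requests.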
Timeline
- Summer 2007: Attended the MetaArchive LOCKSS tutorial (a very good overview of LOCKSS)
- Summer 2007: SSP system requirements developed & approved
- Winter 2007: First public LOCKSS network nodes built at two Data-PASS sites
- Winter 2007: SSP replication commitment schema developed
- Spring 2008: Completed test harvest of MRA collection into LOCKSS
- Spring 2008: SSP system use cases developed
- Spring 2008: Prototype plugin developed to harvest Dataverse Networks
- Spring 2008: Data-PASS sites joined into a single Private LOCKSS Network (PLN)
- Spring 2008: Met with LOCKSS developers to review use cases
- SSP will leverage functionality “in the works” by the LOCKSS team
Summary
- Replication ameliorates institutional risks to preservation
- Data-PASS requires policy-based, auditable, asymmetric replication commitments
- Formalize policy in a schema
- (Re)configure & audit LOCKSS using the schema
- Replication uses standard LOCKSS mechanisms