XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions
DESCRIPTION
Digital preservation deals with the problem of retaining the meaning of digital information over time to ensure its accessibility. The process often involves a workflow that transforms the digital objects. The workflow defines document pipelines containing transformations and validation checkpoints, either to facilitate migration for persistent archival or to extract metadata. These transformations are computationally expensive, however, so digital preservation can be out of reach for an organization whose core operation is not data conservation. The operations described in the document workflow, on the other hand, do not recur frequently. This paper combines an implementation-independent workflow designer with cloud computing to support small institutions in the ad hoc peak computing needs that stem from their digital preservation efforts.
TRANSCRIPT
XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions
Peter Wittek
Swedish School of Library and Information Science
University of Boras
16/05/11
Outline
1 Workflows and Digital Preservation
2 Computational Requirements of Digital Preservation
3 Preservation Workflow in the Cloud
4 Experimental Results
5 Open Issues
6 Conclusions
Workflows and Digital Preservation
Fundamental Issues in Digital Preservation
Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks
Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart
Workflows and Digital Preservation
Migration, Enrichment, and Other Approaches
Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)
Dynamic collections: scalability
Reuse
Exploitation with a novel purpose
Sufficient metadata at document and collection level
Workflows and Digital Preservation
An Example of Enrichment: ToC Extraction
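The slide's original illustration is not preserved in this transcript. As a rough idea of what table-of-contents extraction can look like, here is a minimal sketch using Python's standard library. The input layout (nested `section` elements, each with a `title` child) and the function name `extract_toc` are assumptions made for illustration, not the format or code used in the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: nested <section> elements, each carrying a <title> child.
SAMPLE = """<document>
  <section><title>Introduction</title></section>
  <section><title>Methods</title>
    <section><title>Data</title></section>
  </section>
</document>"""


def extract_toc(element, depth=0):
    """Return (depth, title) pairs for every nested <section>, in document order."""
    entries = []
    for section in element.findall("section"):
        title = section.findtext("title", default="").strip()
        entries.append((depth, title))
        # Recurse so that subsections follow their parent in the listing.
        entries.extend(extract_toc(section, depth + 1))
    return entries


toc = extract_toc(ET.fromstring(SAMPLE))
for depth, title in toc:
    print("  " * depth + title)
```

The extracted entries could then be stored as document-level metadata alongside the preserved object.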
Workflows and Digital Preservation
Preserving the Pipeline
Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process
Workflows and Digital Preservation
An XML Processing Pipeline
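The pipeline diagram from this slide is not preserved here. The description speaks of document pipelines containing transformations and validation checkpoints; a minimal sketch of that idea follows, with each step modelled as a function that takes and returns an element tree. All step names and the sample document are hypothetical and only stand in for whatever the designed workflow actually contains.

```python
import xml.etree.ElementTree as ET


def require_title(root):
    """Validation checkpoint: stop the pipeline if there is no non-empty <title>."""
    if not (root.findtext("title") or "").strip():
        raise ValueError("validation failed: missing <title>")
    return root


def strip_empty_paragraphs(root):
    """Transformation: drop <p> elements that have no text and no children."""
    # Materialise the traversal first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "p" and len(child) == 0 and not (child.text or "").strip():
                parent.remove(child)
    return root


def add_word_count(root):
    """Enrichment: record the document's word count as a root attribute."""
    words = " ".join(root.itertext()).split()
    root.set("wordCount", str(len(words)))
    return root


# Hypothetical pipeline: one checkpoint followed by two transformations.
PIPELINE = [require_title, strip_empty_paragraphs, add_word_count]


def run_pipeline(xml_text, steps=PIPELINE):
    """Parse the document, apply every step in order, and serialise the result."""
    root = ET.fromstring(xml_text)
    for step in steps:
        root = step(root)
    return ET.tostring(root, encoding="unicode")


result = run_pipeline("<doc><title>Report</title><p>two words</p><p> </p></doc>")
print(result)
```

Keeping the steps as an ordered list of plain functions means the list itself is a record of how a document reached its current form.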
Workflows and Digital Preservation
Deployment
Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS
Integrated Rule-Oriented Data System
Policy-based data grid software system
Current experiment using Amazon Web Services
Computational Requirements of Digital Preservation
Conversion
Steps of a workflow are computationally expensive
XSLT processors
Processing a single large document tree can take hours
Deep parsing and named entity recognition
May involve high-complexity natural language processing
Ad-hoc computations
Computational Requirements of Digital Preservation
Learning
A step towards digital curation
SaaS approach to digital curation
Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout
Preservation Workflow in the Cloud
MapReduce and Deployment
No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
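Because the process has no internal dependencies, each document can be handled by an independent map call. The sketch below simulates such a driver in a single Python process to show the shape of the computation; it is not the driver used in the experiments. `process_document` is a hypothetical stand-in for the exported workflow, and the status keys are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def process_document(xml_text):
    """Hypothetical stand-in for the exported workflow: counts elements."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter())


def map_document(doc_id, xml_text):
    """Map: run the process on one document and emit a (status, doc_id) pair."""
    try:
        process_document(xml_text)
        yield "ok", doc_id
    except ET.ParseError:
        yield "failed", doc_id


def reduce_status(status, doc_ids):
    """Reduce: collect the documents that ended with the same status."""
    return status, sorted(doc_ids)


def run_job(documents):
    """Drive map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for doc_id, xml_text in documents.items():
        for key, value in map_document(doc_id, xml_text):
            groups[key].append(value)
    return dict(reduce_status(key, values) for key, values in groups.items())


report = run_job({"a.xml": "<doc><p/></doc>", "b.xml": "<doc>", "c.xml": "<doc/>"})
print(report)
```

On a real cluster the framework would distribute the map calls across processing cores; since documents do not depend on one another, no coordination between mappers is needed.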
Preservation Workflow in the Cloud
The Proposed Architecture
Experimental Results
Cost
Figure: Comparison of average cost of computations with different collection sizes (x-axis: Number of Processing Cores, 1 to 80; y-axis: Average Cost in USD, 0 to 0.08; series: collection sizes 100, 1000, and 10000)
Experimental Results
Running time
Figure: Comparison of running times with different collection sizes (x-axis: Number of Processing Cores, 1 to 80; y-axis: Running Time (Mins), 0 to 8000; series: collection sizes 100, 1000, and 10000)
Open Issues
Obstacles to Adoption
Persistence and high-reliability
MapReduce
Not just a technological issue
Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers
Conclusions
Acknowledgment
Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/
Conclusions
Summary
Digital preservation is an attractive area to be offered as SaaS
Computational needs
Expertise
Complexity
Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research