XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Peter Wittek

Swedish School of Library and Information Science
University of Borås

16/05/11


Outline

1 Workflows and Digital Preservation

2 Computational Requirements of Digital Preservation

3 Preservation Workflow in the Cloud

4 Experimental Results

5 Open Issues

6 Conclusions


Workflows and Digital Preservation

Fundamental Issues in Digital Preservation

Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks

Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart



Migration, Enrichment, and Other Approaches

Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)

Dynamic collections: scalability
Reuse

Exploitation with a novel purpose
Sufficient metadata at document and collection level



An Example of Enrichment: ToC Extraction
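
As a minimal sketch of what ToC extraction can look like, the following assumes a hypothetical XML layout in which nested `<section>` elements each carry a `<title>` child. The element names and the sample document are assumptions made for illustration; they are not taken from the collections discussed in this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: nested <section> elements, each with a <title> child.
SAMPLE = """<document>
  <section><title>Introduction</title></section>
  <section>
    <title>Methods</title>
    <section><title>Data</title></section>
  </section>
</document>"""


def extract_toc(element, depth=0):
    """Return (depth, title) pairs for every <section> below element, in document order."""
    entries = []
    # findall("section") matches direct children only, so nesting is handled by recursion.
    for section in element.findall("section"):
        title = section.findtext("title", default="(untitled)")
        entries.append((depth, title.strip()))
        entries.extend(extract_toc(section, depth + 1))
    return entries


toc = extract_toc(ET.fromstring(SAMPLE))
for depth, title in toc:
    print("  " * depth + title)
```

The extracted entries could then be stored as document-level metadata alongside the original file.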



Preserving the Pipeline

Reuse of digital content requires metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process



An XML Processing Pipeline



Deployment

Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS

Integrated Rule-Oriented Data System
Policy-based data grid software system

Current experiment using Amazon Web Services
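
The idea of translating an abstract workflow description into executable steps can be sketched as follows. This is only an illustration under stated assumptions: the workflow is described as an ordered list of step names, and a small registry maps each name to a Python function. The step names, the `STEPS` registry, and the cleanup step are hypothetical; this is not the code that the Eclipse Modeling Framework generates, and it does not use iRODS.

```python
import xml.etree.ElementTree as ET

# Abstract, implementation-independent description: an ordered list of step names.
WORKFLOW = ["parse", "strip_empty_paragraphs", "serialize"]


def parse(text):
    """Turn an XML string into an element tree."""
    return ET.fromstring(text)


def strip_empty_paragraphs(root):
    """Hypothetical cleanup step: drop <p> elements with no text and no children."""
    # Materialize the traversal first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "p" and len(child) == 0 and not (child.text or "").strip():
                parent.remove(child)
    return root


def serialize(root):
    """Turn the element tree back into an XML string."""
    return ET.tostring(root, encoding="unicode")


# Concrete binding of abstract step names to one particular implementation.
STEPS = {
    "parse": parse,
    "strip_empty_paragraphs": strip_empty_paragraphs,
    "serialize": serialize,
}


def run_workflow(workflow, document):
    """Apply each named step in order, feeding every result into the next step."""
    result = document
    for name in workflow:
        result = STEPS[name](result)
    return result


output = run_workflow(WORKFLOW, "<doc><p>kept</p><p> </p></doc>")
print(output)
```

Keeping `WORKFLOW` as plain data separates the intent of the process from the code that runs it, so the same description could in principle be bound to a different implementation later.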


Computational Requirements of Digital Preservation

Conversion

Steps of a workflow are computationally expensive
XSLT processors

Processing a single large document tree can take hours
Deep parsing and named entity recognition

May involve high-complexity natural language processing

Ad-hoc computations


Learning

A step towards digital curation
SaaS approach to digital curation

Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout
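
To make the indexing step concrete, here is a toy inverted index in plain Python. It is a deliberately simplified stand-in and does not use or imitate the Lucene, Nutch, or Mahout APIs; the function names, the tokenization rule, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical two-document collection.
docs = {"d1": "Digital preservation workflow", "d2": "Cloud workflow"}
index = build_index(docs)
print(search(index, "workflow"))
print(search(index, "digital workflow"))
```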


Preservation Workflow in the Cloud

MapReduce and Deployment

No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
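
Because the process has no internal dependencies, each document can be handled on its own. The sketch below shows the shape of a MapReduce-style driver in plain Python: a map phase applies a per-document function and a reduce phase combines the results. It is an illustration only: `process_document` is a hypothetical stand-in for the exported process, a local thread pool stands in for the cluster, and no Hadoop or Amazon Web Services API is used.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def process_document(item):
    """Map step (hypothetical): parse one document and count its elements."""
    doc_id, xml_text = item
    root = ET.fromstring(xml_text)
    return {"documents": 1, "elements": sum(1 for _ in root.iter())}


def combine(left, right):
    """Reduce step: add up the per-document counters."""
    return {key: left[key] + right[key] for key in left}


def run_job(documents, workers=4):
    """Run the map step on every document independently, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(process_document, documents.items()))
    return reduce(combine, mapped, {"documents": 0, "elements": 0})


# Hypothetical two-document collection.
collection = {
    "a.xml": "<doc><p>one</p></doc>",
    "b.xml": "<doc><p>two</p><p>three</p></doc>",
}
summary = run_job(collection)
print(summary)
```

Since the map step touches only one document at a time, the number of workers can be changed without altering the process itself.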


The Proposed Architecture


Experimental Results

Cost

Figure: Comparison of average cost of computations with different collection sizes (x-axis: number of processing cores, 1 to 80; y-axis: average cost in USD, 0 to 0.08; legend: collection sizes 100, 1000, 10000)


Running time

Figure: Comparison of running times with different collection sizes (x-axis: number of processing cores, 1 to 80; y-axis: running time in minutes, 0 to 8000; legend: collection sizes 100, 1000, 10000)


Open Issues

Obstacles to Adoption

Persistence and high-reliability
MapReduce
Not just a technological issue

Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers


Conclusions

Acknowledgment

This work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/


Summary

Digital preservation is an attractive area to be offered as SaaS

Computational needs
Expertise
Complexity

Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research
