XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Peter Wittek, Swedish School of Library and Information Science, University of Borås, 16/05/11

Upload: peter-wittek

Post on 11-May-2015

DESCRIPTION

Digital preservation deals with the problem of retaining the meaning of digital information over time to ensure its accessibility. The process often involves a workflow that transforms the digital objects. The workflow defines document pipelines containing transformations and validation checkpoints, either to facilitate migration for persistent archival or to extract metadata. The transformations, however, are computationally expensive, and digital preservation can therefore be out of reach for an organization whose core operation is not data conservation. The operations described in the document workflow, on the other hand, do not recur frequently. This paper combines an implementation-independent workflow designer with cloud computing to support small institutions in the ad-hoc peak computing needs that stem from their digital preservation efforts.

TRANSCRIPT

Page 1: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Peter Wittek

Swedish School of Library and Information Science, University of Borås

16/05/11

Page 2: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Outline

1 Workflows and Digital Preservation

2 Computational Requirements of Digital Preservation

3 Preservation Workflow in the Cloud

4 Experimental Results

5 Open Issues

6 Conclusions

Page 3: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

Fundamental Issues in Digital Preservation

Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks

Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart

Page 4: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

Migration, Enrichment, and Other Approaches

Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)

Dynamic collections: scalability
Reuse

Exploitation with a novel purpose
Sufficient metadata at document and collection level

Page 5: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

An Example of Enrichment: ToC Extraction
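
As a rough illustration of this kind of enrichment, the sketch below derives a table of contents from an XML document using Python's standard library. The element names (`section`, `title`), the helper `extract_toc`, and the sample document are assumptions made for illustration; the slides do not specify the input schema or the extraction method actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: nested <section> elements, each with a <title> child.
SAMPLE = """<document>
  <section><title>Introduction</title></section>
  <section>
    <title>Methods</title>
    <section><title>Data</title></section>
  </section>
</document>"""


def extract_toc(element, depth=1):
    """Return (depth, title) pairs for every <section>, in document order."""
    entries = []
    for section in element.findall("section"):
        title = section.findtext("title", default="(untitled)").strip()
        entries.append((depth, title))
        # Recurse so that nested sections appear below their parent.
        entries.extend(extract_toc(section, depth + 1))
    return entries


toc = extract_toc(ET.fromstring(SAMPLE))
for depth, title in toc:
    print("  " * (depth - 1) + title)
```

The resulting list of (depth, title) pairs could then be stored as document-level metadata alongside the original object.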

Page 6: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

Preserving the Pipeline

Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process
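
One way to picture an architecture-independent process description is as plain data that names each step, with a small interpreter binding those names to concrete code. The sketch below is a hypothetical illustration in Python: the step names, the `STEP_REGISTRY` mapping, the `processed-by` attribute, and the sample document are all assumptions, not the format used by the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarative description: step names only, no implementation.
PIPELINE = ["normalize_whitespace", "validate_has_title", "add_provenance"]


def normalize_whitespace(root):
    # Transformation: trim the text content of every element.
    for el in root.iter():
        if el.text:
            el.text = el.text.strip()
    return root


def validate_has_title(root):
    # Validation checkpoint: stop the pipeline if the title is missing or empty.
    if not root.findtext("title"):
        raise ValueError("validation failed: missing <title>")
    return root


def add_provenance(root):
    # Record which steps produced the current form of the document.
    root.set("processed-by", " ".join(PIPELINE))
    return root


# Binds each step name in the description to an implementation.
STEP_REGISTRY = {
    f.__name__: f
    for f in (normalize_whitespace, validate_has_title, add_provenance)
}


def run_pipeline(xml_text, steps=PIPELINE):
    root = ET.fromstring(xml_text)
    for name in steps:
        root = STEP_REGISTRY[name](root)
    return root


result = run_pipeline("<doc><title>  Annual report </title></doc>")
print(ET.tostring(result, encoding="unicode"))
```

Because the description is only a list of names, the same list could in principle be bound to a different registry on another platform, and it doubles as a record of how the document reached its current form.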

Page 7: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

An XML Processing Pipeline

Page 8: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Workflows and Digital Preservation

Deployment

Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS

Integrated Rule-Oriented Data System
Policy-based data grid software system

Current experiment using Amazon Web Services

Page 9: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Computational Requirements of Digital Preservation

Conversion

Steps of a workflow are computationally expensive
XSLT processors

Processing a single large document tree can take hours
Deep parsing and named entity recognition

May involve high-complexity natural language processing

Ad-hoc computations

Page 10: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Computational Requirements of Digital Preservation

Learning

A step towards digital curation
SaaS approach to digital curation

Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout

Page 11: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Preservation Workflow in the Cloud

MapReduce and Deployment

No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
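
Because the documents are processed independently, the job maps naturally onto MapReduce: each map call handles one document, and the reduce step only summarizes the outcomes. The sketch below simulates the map, shuffle, and reduce phases locally in plain Python; it is not the driver used in the experiments. The function `process_document` is a hypothetical stand-in for the exported workflow, and the sample documents are invented.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def process_document(xml_text):
    """Hypothetical stand-in for the exported workflow: count elements."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())


def mapper(doc_id, xml_text):
    # Each document is handled on its own, so map calls can run on any node.
    try:
        result = process_document(xml_text)
    except ET.ParseError:
        yield "failed", (doc_id, None)
    else:
        yield "ok", (doc_id, result)


def reducer(status, values):
    # Summarize each status as a sorted list of document ids.
    return status, sorted(doc_id for doc_id, _ in values)


def run_job(documents):
    """Local simulation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for doc_id, xml_text in documents.items():
        for key, value in mapper(doc_id, xml_text):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


docs = {
    "a.xml": "<doc><p>one</p></doc>",
    "b.xml": "<doc><p>two</p><p>three</p></doc>",
    "c.xml": "<doc><p>broken</doc>",  # malformed on purpose
}
summary = run_job(docs)
print(summary)
```

On a real cluster the framework would distribute the map calls across nodes; only the `mapper` and `reducer` functions would carry over from this sketch.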

Page 12: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Preservation Workflow in the Cloud

The Proposed Architecture

Page 13: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Experimental Results

Cost

Figure: Comparison of average cost of computations with different collection sizes. X-axis: number of processing cores (1, 4, 10, 20, 40, 80). Y-axis: average cost in USD (0 to 0.08). Series: collection sizes of 100, 1000, and 10000.
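
To make the notion of average cost concrete, the sketch below computes a per-document cost under a simple billing model. It is an assumption-laden illustration, not the calculation behind the figure: it assumes that each instance is billed per started hour, and the function name, hourly rate, instance count, and running time are all hypothetical.

```python
import math


def average_cost_per_document(num_docs, num_instances, wall_clock_minutes,
                              hourly_rate_usd):
    """Average cost per document, assuming billing per started instance-hour."""
    billed_hours = math.ceil(wall_clock_minutes / 60)
    total_cost = num_instances * billed_hours * hourly_rate_usd
    return total_cost / num_docs


# Hypothetical inputs: 1000 documents, 10 minutes on 4 instances at 0.10 USD/hour.
cost = average_cost_per_document(1000, 4, 10, 0.10)
print(f"{cost:.4f} USD per document")
```

Under this model, a small collection spread over many instances pays for mostly idle hours, which is one reason per-document cost can rise as resources are added.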

Page 14: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Experimental Results

Running time

Figure: Comparison of running times with different collection sizes. X-axis: number of processing cores (1, 4, 10, 20, 40, 80). Y-axis: running time in minutes (0 to 8000). Series: collection sizes of 100, 1000, and 10000.

Page 15: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Open Issues

Obstacles to Adoption

Persistence and high-reliability
MapReduce
Not just a technological issue

Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers

Page 16: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Conclusions

Acknowledgment

Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/

Page 17: XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Conclusions

Summary

Digital preservation is an attractive area to be offered as SaaS

Computational needs
Expertise
Complexity

Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research