leveraging on high-performance computing and cloud technologies in digital libraries: a case study

17

Click here to load reader

Upload: peter-wittek

Post on 11-May-2015

164 views

Category:

Documents


0 download

DESCRIPTION

With the emergence of high-performance computing instances in the cloud, massive scale computations have become available to technically every organization. Digital libraries typically employ a data-intensive infrastructure, but given the resources, advanced services based on data and text mining could be developed. A fundamental issue is the ease of development and integration of such services. We demonstrate the feasibility by providing a case study on a visual machine learning algorithm with MapReduce running in the cloud in a small cluster.

TRANSCRIPT

Page 1: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computingand Cloud Technologies in Digital Libraries: A

Case Study

Peter Wittek

Swedish School of Library and Information ScienceUniversity of Boras

29/11/11

Page 2: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Outline

1 Workflows and Digital Libraries

2 Computational Requirements of Digital Libraries

3 A Workflow in Cloud HPC

4 Experimental Results

5 Open Issues

6 Conclusions

Page 3: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Workflows and Digital Libraries

Fundamental Issues in Digital Libraries

Digital objects remain authentic and accessible for futurere-use

Component and management failuresNatural disastersAttacks

Materials resulting from digital reformattingInformation that is born-digital and has no analogcounterpartHow do you facilitate future re-use?

Page 4: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Workflows and Digital Libraries

Digital libraries and HPC?

Advantages:No need for upfront investmentGo beyond full-text searchMachine learningPattern matchingSocial media and graph mining

Page 5: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Computational Requirements of Digital Libraries

Digital Preservation

Future-proofing document collectionsEmulationMigration

Workflows are often tremendously compute-intensive

Page 6: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Computational Requirements of Digital Libraries

Machine Learning and Advanced Services

A step towards digital curationSaaS approach to digital curation

Indexing by Lucene/NutchCollection-level metadata extraction by MahoutTopic modellingVisualization

Page 7: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

A Workflow in Cloud HPC

A Middleware Architecture

SupportServices:-Documentprocesses-Contextsearch-Datamining

Map

Red

uce

En

gin

e

PolicyEnforcement

ArchivalStorageInterface

Middleware

Grid or Cloud Storage Grid or Cloud Computing

A middleware to make adoption by DL practitioners easierMoving towards computational science

Page 8: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

A Workflow in Cloud HPC

MapReduce Frameworks

Language Distributed MapReduceHadoop Java Yes/Custom PureMR-MPI C/C++ Yes/MPI MixedPhoenix++ C++ No Pure

Table: Comparison of some open source MapReduce frameworks

Page 9: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

A Workflow in Cloud HPC

Adapting SOM to Digital Libraries

Batch formulation of update formulaMapReduce version in MR-MPINeeds to work on sparse structures

Sparse topic model by random indexing

Page 10: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Experimental Results

Visual clusters in a 17th-century Corpus

Page 11: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Experimental Results

Hardware and software

AWS Cluster Instances8 cores per instance24Gbyte of RAMLinux environmentOpenMPI

Page 12: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Experimental Results

Running time

0

2000

4000

6000

8000

10000

12000

8 16 32 64

Tim

e (

s)

Number of cores

SOM (RI)SOM (LSI)

Figure: Running time of SOM

Page 13: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Experimental Results

Speed-up

1

2

3

4

5

6

7

8

8 16 32 64

Speed-u

p

Number of cores

SOM (RI)SOM (LSI)

Linear

Figure: Speed-up of SOM

Page 14: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Open Issues

Obstacles to Adoption

Persistence and high-reliabilityMapReduceNot just a technological issue

Service-level agreementParticularly problematicAnother EU FP7 project working on it: SLA@SOINiche for alternative cloud providers

Difficulty of integration

Page 15: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Conclusions

Ongoing work

Use GPU-aware MapReduce

Language Distributed MapReduceMars C/C++ No PureGPMR C/C++ Yes/MPI Mixed

Table: Comparison of GPU-aware open source MapReduceframeworks

Adapting MR-MPISpeed-up: 10x

Page 16: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Conclusions

Acknowledgment

Work has been funded by Sustaining Heritage Accessthrough Multivalent ArchiviNg (SHAMAN), an EU FP7large integrated project.http://shaman-ip.eu/shaman/

Additional funding has been received from Amazon WebServices.http://aws.amazon.com/

Page 17: Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Conclusions

Summary

Cloud and HPC: a solution looking for a problemDigital libraries

Computational requirementsExpertiseComplexity and integration

Contact: [email protected]