Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study
DESCRIPTION
With the emergence of high-performance computing instances in the cloud, massive-scale computations have become available to practically every organization. Digital libraries typically employ a data-intensive infrastructure, but given the resources, advanced services based on data and text mining could be developed. A fundamental issue is the ease of development and integration of such services. We demonstrate feasibility with a case study of a visual machine learning algorithm implemented with MapReduce and run on a small cluster in the cloud.
TRANSCRIPT
Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study
Peter Wittek
Swedish School of Library and Information Science
University of Borås
29/11/11
Outline
1 Workflows and Digital Libraries
2 Computational Requirements of Digital Libraries
3 A Workflow in Cloud HPC
4 Experimental Results
5 Open Issues
6 Conclusions
Workflows and Digital Libraries
Fundamental Issues in Digital Libraries
Digital objects remain authentic and accessible for future re-use
Component and management failures
Natural disasters
Attacks
Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart
How do you facilitate future re-use?
Digital libraries and HPC?
Advantages:
  No need for upfront investment
  Go beyond full-text search
  Machine learning
  Pattern matching
  Social media and graph mining
Computational Requirements of Digital Libraries
Digital Preservation
Future-proofing document collections
Emulation
Migration
Workflows are often tremendously compute-intensive
Machine Learning and Advanced Services
A step towards digital curation
SaaS approach to digital curation
Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout
Topic modelling
Visualization
A Workflow in Cloud HPC
A Middleware Architecture
Figure: Middleware architecture. The middleware comprises support services (document processes, context search, data mining), a MapReduce engine, policy enforcement, and an archival storage interface, layered over grid or cloud storage and grid or cloud computing.
A middleware to make adoption by DL practitioners easier
Moving towards computational science
MapReduce Frameworks
Framework   Language   Distributed   MapReduce
Hadoop      Java       Yes/Custom    Pure
MR-MPI      C/C++      Yes/MPI       Mixed
Phoenix++   C++        No            Pure
Table: Comparison of some open source MapReduce frameworks
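The programming model shared by the frameworks in the table can be illustrated with a minimal in-process sketch. It does not use Hadoop, MR-MPI or Phoenix++; the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_counts`) are hypothetical, and term counting is chosen only as a familiar example of a map step that emits key-value pairs and a reduce step that combines the values of each key.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, 1) pair per token of a document."""
    return [(token.lower(), 1) for token in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def term_counts(documents):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    emitted = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in documents.items()
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    print(term_counts({"a": "the cat", "b": "The dog"}))
```

In a distributed framework the map calls run on different nodes and the shuffle moves data over the network; here everything runs in one process so that only the structure of the computation is visible.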
Adapting SOM to Digital Libraries
Batch formulation of update formula
MapReduce version in MR-MPI
Needs to work on sparse structures
Sparse topic model by random indexing
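The points above can be sketched as one epoch of a batch self-organizing map written in map/reduce form. This is an illustration under stated assumptions, not the MR-MPI implementation from the talk: it assumes the standard batch update (each codebook vector becomes the neighbourhood-weighted mean of the data), a Gaussian neighbourhood, and sparse input vectors stored as `{index: value}` dictionaries. All function names are hypothetical, and everything runs in a single process.

```python
import math


def neighbourhood(grid, a, b, radius):
    """Gaussian neighbourhood weight between map nodes a and b."""
    d2 = (grid[a][0] - grid[b][0]) ** 2 + (grid[a][1] - grid[b][1]) ** 2
    return math.exp(-d2 / (2.0 * radius ** 2))


def best_matching_unit(x, weights, norms):
    """Nearest codebook vector to the sparse vector x (dict: index -> value)."""
    def score(j):
        # ||w - x||^2 minus the constant ||x||^2; only nonzeros of x are read.
        return norms[j] - 2.0 * sum(weights[j][i] * v for i, v in x.items())
    return min(range(len(weights)), key=score)


def map_chunk(chunk, weights, grid, radius):
    """Map step: partial numerators and denominators for one chunk of data."""
    n_nodes, dim = len(weights), len(weights[0])
    norms = [sum(v * v for v in w) for w in weights]
    num = [[0.0] * dim for _ in range(n_nodes)]
    den = [0.0] * n_nodes
    for x in chunk:
        winner = best_matching_unit(x, weights, norms)
        for j in range(n_nodes):
            h = neighbourhood(grid, winner, j, radius)
            den[j] += h
            for i, v in x.items():  # sparse accumulation
                num[j][i] += h * v
    return num, den


def reduce_partials(partials, weights):
    """Reduce step: sum the partial results and form the new codebook."""
    new_weights = []
    for j, old in enumerate(weights):
        total = sum(den[j] for _, den in partials)
        if total == 0.0:  # no data reached this node; keep the old vector
            new_weights.append(list(old))
            continue
        new_weights.append(
            [sum(num[j][i] for num, _ in partials) / total
             for i in range(len(old))]
        )
    return new_weights


def batch_epoch(chunks, weights, grid, radius):
    """One batch SOM epoch: map over the chunks, then a single reduce."""
    partials = [map_chunk(c, weights, grid, radius) for c in chunks]
    return reduce_partials(partials, weights)
```

Because the numerators and denominators are plain sums, the result does not depend on how the data is split into chunks, which is what makes the batch formulation a natural fit for MapReduce. Calling `batch_epoch([data[:2], data[2:]], weights, grid, 1.0)` should give the same codebook as `batch_epoch([data], weights, grid, 1.0)` up to floating-point rounding.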
Experimental Results
Visual clusters in a 17th-century Corpus
Hardware and software
AWS Cluster Instances
8 cores per instance
24 Gbyte of RAM
Linux environment
OpenMPI
Running time
Figure: Running time of SOM. Time (s, axis from 0 to 12000) against number of cores (8, 16, 32, 64) for SOM (RI) and SOM (LSI).
Speed-up
Figure: Speed-up of SOM. Speed-up (axis from 1 to 8) against number of cores (8, 16, 32, 64) for SOM (RI) and SOM (LSI), with linear speed-up shown for reference.
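As a reading aid for the speed-up plot, the sketch below computes relative speed-up and parallel efficiency from a table of running times. It assumes that speed-up is measured against the smallest configuration (8 cores), which the axis ranges suggest but the slide does not state. The function name `speedup` is hypothetical, and the timings in the example are invented for illustration; they are not the measured values from the talk.

```python
def speedup(times, baseline_cores=None):
    """Relative speed-up and efficiency from a dict of {cores: seconds}.

    The baseline defaults to the smallest core count present, so linear
    scaling from 8 to 64 cores corresponds to a speed-up of 8.
    """
    base = baseline_cores if baseline_cores is not None else min(times)
    result = {}
    for cores in sorted(times):
        s = times[base] / times[cores]   # how much faster than the baseline
        ideal = cores / base             # linear speed-up for this core count
        result[cores] = (s, s / ideal)   # (speed-up, parallel efficiency)
    return result


if __name__ == "__main__":
    # Invented timings, for illustration only.
    example = {8: 8000.0, 16: 4000.0, 64: 2000.0}
    for cores, (s, eff) in speedup(example).items():
        print(f"{cores} cores: speed-up {s:.2f}, efficiency {eff:.2f}")
```

An efficiency of 1.0 corresponds to the linear reference line in the plot; values below 1.0 show how far a run falls short of it.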
Open Issues
Obstacles to Adoption
Persistence and high-reliability
MapReduce
Not just a technological issue
Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers
Difficulty of integration
Conclusions
Ongoing work
Use GPU-aware MapReduce
Framework   Language   Distributed   MapReduce
Mars        C/C++      No            Pure
GPMR        C/C++      Yes/MPI       Mixed
Table: Comparison of GPU-aware open source MapReduce frameworks
Adapting MR-MPI
Speed-up: 10x
Acknowledgment
Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project.
http://shaman-ip.eu/shaman/
Additional funding has been received from Amazon Web Services.
http://aws.amazon.com/
Summary
Cloud and HPC: a solution looking for a problem
Digital libraries
Computational requirements
Expertise
Complexity and integration
Contact: [email protected]