2013.10.12 SLIDE 1 DID Meeting - Montreal
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley
Paul Watry, University of Liverpool
Richard Marciano, University of North Carolina, Chapel Hill
SLIDE 2
• Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
• Goals:
  – Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
• Data:
  – Internet Archive Books Collection (with associated MARC where available) ~7.2 TB
  – JSTOR ~1 TB
  – Context sources: SNAC Archival and Library Authority records
• Tools:
  – Cheshire3: DL search and retrieval framework
  – iRODS: policy-driven distributed data storage
  – Amazon S3 storage and EC2 computing
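The extraction goal above can be illustrated with a minimal sketch. It is not the project's pipeline: the `AUTHORITY` table, its record ids, and the capitalized-word pattern are hypothetical stand-ins for a real NLP extractor and for context sources such as SNAC authority records.

```python
import re
from collections import defaultdict

# Hypothetical authority file: surface form -> (entity type, record id).
# Real context sources (e.g. SNAC authority records) are far richer.
AUTHORITY = {
    "Charles Darwin": ("person", "auth:0001"),
    "London": ("place", "auth:0002"),
    "Galapagos Islands": ("place", "auth:0003"),
}

# Naive candidate pattern: runs of capitalized words (not a real NER model).
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_entities(text):
    """Return authority-matched entities grouped by entity type."""
    found = defaultdict(list)
    for match in CANDIDATE.finditer(text):
        name = match.group(0)
        if name in AUTHORITY:
            etype, record_id = AUTHORITY[name]
            # Keep the character offset so the hit can be tied back to its context.
            found[etype].append((name, record_id, match.start()))
    return dict(found)


sample = "In 1835 Charles Darwin reached the Galapagos Islands, far from London."
entities = extract_entities(sample)
```

Exact string lookup misses spelling variants and ambiguous names, which is where authority records with alternate name forms would come in.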
SLIDE 3
Grid-Based Digital Libraries: Needs
• Large-scale distributed storage requirements and technologies
• Organizing distributed digital collections
• Shared metadata: standards and requirements
• Managing distributed digital collections
• Security and access control
• Collection replication and backup
• Distributed Information Retrieval support and algorithms
SLIDE 4
But…
• Haven't Hadoop and its menagerie already solved everything?
  – Yes: many tasks can now be done with great scale-up
  – And no: most Hadoop solutions are batch oriented and geared more towards summarization than information access
  – Maybe: we are looking at replacing or supplementing the low-level data management with Hadoop or Spark tools
SLIDE 5
Grid/Cloud IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
• Unlike most typical Grid/Cloud processes, IR is potentially less computing intensive and more data intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
• We have developed the Cheshire3 system to evaluate and manage these issues. The Cheshire3 system is actually one component in a larger Grid-based environment
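The link between Grid IR and distributed search can be sketched as follows. This is an illustration, not Cheshire3 code: the term-count `score` function, the document ids, and the two-shard split are all assumptions. Each shard returns its local top-k, and a merge step produces the global top-k.

```python
import heapq


def score(query_terms, doc_terms):
    """Toy relevance score: total occurrences of the query terms in the document."""
    return sum(doc_terms.count(t) for t in query_terms)


def search_shard(shard, query_terms, k):
    """Return one shard's local top-k as (score, doc_id) pairs."""
    scored = [(score(query_terms, terms), doc_id) for doc_id, terms in shard.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    # Descending score, then ascending doc_id as a deterministic tie-break.
    return sorted(scored, key=lambda p: (-p[0], p[1]))[:k]


def distributed_search(shards, query_terms, k):
    """Merge the per-shard top-k lists into a global top-k."""
    partial = [hit for shard in shards for hit in search_shard(shard, query_terms, k)]
    return heapq.nsmallest(k, partial, key=lambda p: (-p[0], p[1]))


# Hypothetical four-document collection, split over two shards.
docs = {
    "d1": "grid storage for digital libraries".split(),
    "d2": "distributed retrieval on the grid grid".split(),
    "d3": "text mining of scholarly books".split(),
    "d4": "grid retrieval and distributed retrieval".split(),
}
shards = [
    {"d1": docs["d1"], "d2": docs["d2"]},
    {"d3": docs["d3"], "d4": docs["d4"]},
]
query = ["grid", "retrieval"]

central = search_shard(docs, query, 2)          # single-index baseline
merged = distributed_search(shards, query, 2)   # sharded, then merged
```

Here the merged ranking matches the single-index ranking because the toy score depends only on the document itself. Weighting schemes that use collection-wide statistics (IDF, for example) would need those statistics shared across shards to keep precision and recall unchanged.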
SLIDE 6
Cheshire3 Environment
[Diagram not reproduced; the only surviving label is "or iRODS"]
SLIDE 7
Cheshire3 IR Overview
• XML Information Retrieval Engine
  – 3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool
  – Uses Python for flexibility and extensibility, with C/C++-based libraries for processing speed
  – Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few
  – Grid/Cloud capable: uses distributed configuration files, workflow definitions, and PVM or MPI to scale from one machine to thousands of parallel nodes
  – Free and Open Source Software
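As an illustration of the standards listed above, a CQL query can be sent to an SRU endpoint as an ordinary HTTP GET. The sketch below only builds the request URL. The base URL and database name are hypothetical, and the `dc.title` / `dc.creator` indexes are assumed to be configured on the server; the parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU specification.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def sru_search_url(base_url, cql_query, start=1, maximum=10, version="1.1"):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,          # CQL text; urlencode handles the escaping
        "startRecord": start,        # SRU record positions are 1-based
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; a real deployment defines its own host and path.
BASE = "http://localhost:8080/sru/example_db"
url = sru_search_url(BASE, 'dc.title = "data mining" and dc.creator = "larson"')
```

Fetching `url` (with `urllib.request.urlopen`, for example) would return an XML searchRetrieveResponse from a server that supports these indexes.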
SLIDE 8
Cheshire3 Object Model
SLIDE 9
Current Version
• iRODS and C3 on Amazon EC2 and S3
[Diagram labels: Amazon S3 (Bucket 1, Bucket 2); Amazon EC2; iRODS; Cache Resource; iCAT; Rule Engine; Cheshire3 (Indexing, Retrieval); Data Ingestion; Data Presentation]
SLIDE 10
Sample demo
SLIDE 14
Summary
• Indexing and IR work very well in the Grid/Cloud environment, with the expected scaling behavior for multiple processes
• Still in progress:
  – We are still collecting and processing the books collection from the Internet Archive
  – We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC)
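The scaling claim for multiple processes rests on indexing being easy to partition. A minimal sketch, assuming a whitespace tokenizer and hypothetical record ids: each worker builds an inverted index for its own partition, and the partial indexes are then merged. A thread pool stands in here for the PVM or MPI worker processes mentioned on the Cheshire3 overview slide, so it shows the data flow rather than real speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_partition(partition):
    """Build an inverted index (term -> set of doc ids) for one partition."""
    index = defaultdict(set)
    for doc_id, text in partition:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Union the posting sets of all partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


def build_index(docs, workers=2):
    """Split docs round-robin, index each partition in a worker, then merge."""
    partitions = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_partition, partitions))
    return merge_indexes(partials)


# Hypothetical records standing in for book texts.
docs = [
    ("b1", "Montreal and Liverpool"),
    ("b2", "Berkeley and Liverpool"),
    ("b3", "Chapel Hill"),
]
index = build_index(docs, workers=2)
```

Because the merge is a plain set union, the final index does not depend on how many workers were used.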
SLIDE 15
Thank you!
iRODS available via https://www.irods.org
Project web site http://diggingintodata.web.unc.edu
Cheshire3 available via https://github.com/cheshire3
Special thanks to John Harrison (Liverpool), Chien-Yi Hou (UNC), Shreyas and Luis Aguilar (UCB)