Using SRB and iRODS with the Cheshire3 Information Framework. Building Data Grids with iRODS, 27-30 May, 2008, National e-Science Centre, Edinburgh. Dr Robert Sanderson, Dept. of Computer Science, University of Liverpool. [email protected] http://www.cheshire3.org/



Using SRB and iRODS with the Cheshire3 Information Framework

Building Data Grids with iRODS
27-30 May, 2008
National e-Science Centre
Edinburgh

Dr Robert Sanderson
Dept. of Computer Science
University of Liverpool
[email protected]

http://www.cheshire3.org/

Building Data Grids with iRODS

iRODS Workshop, May 27th 2008 Slide 1


Overview

Cheshire3
  Introduction
  Architecture

SRB Integration
  Architecture
  Grid
  Usage

iRODS Integration
  Possible Architectures

iRODS Workshop, May 27th 2008 Slide 2


Introduction

Cheshire3: Information Analysis Framework

Digital Library/Information Retrieval engine with ...
  Data Mining/Machine Learning
  Text Mining/Natural Language Processing
  Computational Grid
  Data Grid

Standards Based: Unicode, XML/XPath, MPI, Z39.50/SRU, ...

Object Oriented Architecture

Easy to develop and extend in Python,

... but heavy lifting possible in imported C libraries

Developed at University of Liverpool, plus UC Berkeley

Version: 0.9.10

Mostly stable, needs thorough testing/documentation

iRODS Workshop, May 27th 2008 Slide 3


Context

iRODS Workshop, May 27th 2008 Slide 4


Architecture

iRODS Workshop, May 27th 2008 Slide 5

[Diagram: Cheshire3 object architecture and ingest process. Components shown: Server, ConfigStore, UserStore, User, Object, Database, ProtocolHandler, Query, ResultSet, DocumentFactory, Document, PreParser, Parser, Record, Transformer, Index, Extractor, Tokenizer, TokenMerger, Normalizer, Terms, DocumentStore, RecordStore, IndexStore.]


Architecture 2

iRODS Workshop, May 27th 2008 Slide 6

[Diagram: indexing detail. Components shown: Record, XPath, Object, Extractor, Tokenizer, TokenMerger, Normalizer (several chains in parallel), Index (several), IndexStore.]
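
The component names on the two architecture slides suggest a chain in which parts of a record are selected, their text extracted, split into tokens, merged into term counts and normalized before reaching an index. A minimal Python sketch of such a chain follows. The ordering is an assumption read off the diagram labels, and every function name here is hypothetical rather than part of the Cheshire3 API.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def select(record_xml, path):
    """Select elements from a record using a (limited) XPath expression."""
    return ET.fromstring(record_xml).findall(path)


def extract(elements):
    """Extractor: pull the text content out of the selected elements."""
    return ["".join(element.itertext()) for element in elements]


def tokenize(texts):
    """Tokenizer: split each extracted string into word tokens."""
    return [token for text in texts for token in re.findall(r"\w+", text)]


def merge(tokens):
    """TokenMerger: collapse repeated tokens into term -> occurrence count."""
    return Counter(tokens)


def normalize(term_counts):
    """Normalizer: case-fold terms, summing counts of terms that now match."""
    normalized = Counter()
    for term, count in term_counts.items():
        normalized[term.lower()] += count
    return normalized


def index_record(record_xml, path):
    """Run one record through the whole (assumed) chain."""
    return normalize(merge(tokenize(extract(select(record_xml, path)))))


record = "<doc><title>Data Grid</title><body>grid storage for data</body></doc>"
terms = index_record(record, ".//*")
```

Because each stage only consumes the output of the previous one, a different Normalizer or Tokenizer could be swapped in per index without touching the others, which is what the several parallel chains in the diagram appear to show.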


SRB Integration

iRODS Workshop, May 27th 2008 Slide 7

RecordStore / DocumentStore

Storage backends: Filesystem, Berkeley DB, SQL RDBMS (PostgreSQL), SRB

[Diagram labels: record, document; data]
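
The slide presents SRB as one more backend behind the same RecordStore / DocumentStore role as the filesystem, Berkeley DB and SQL options. A rough sketch of that idea, assuming a two-method store interface: the class and method names are hypothetical, not Cheshire3's, and an in-memory dictionary and a temporary directory stand in for the real backends.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical common interface shared by all storage backends."""

    @abstractmethod
    def store(self, identifier, data):
        """Persist bytes under the given identifier."""

    @abstractmethod
    def fetch(self, identifier):
        """Return the bytes previously stored under the identifier."""


class MemoryStore(Store):
    """Stand-in for a database-backed store."""

    def __init__(self):
        self._items = {}

    def store(self, identifier, data):
        self._items[identifier] = data

    def fetch(self, identifier):
        return self._items[identifier]


class FileSystemStore(Store):
    """One file per identifier inside a directory."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, identifier):
        return os.path.join(self.directory, identifier)

    def store(self, identifier, data):
        with open(self._path(identifier), "wb") as handle:
            handle.write(data)

    def fetch(self, identifier):
        with open(self._path(identifier), "rb") as handle:
            return handle.read()


def round_trip(store):
    """Calling code is identical whichever backend is configured."""
    store.store("doc-1", b"<doc>example</doc>")
    return store.fetch("doc-1")


results = [round_trip(MemoryStore()), round_trip(FileSystemStore(tempfile.mkdtemp()))]
```

An SRB-backed store would implement the same two methods using an SRB client; it is left out here because it needs a running SRB installation.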


SRB Integration

iRODS Workshop, May 27th 2008 Slide 8

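IndexStore

[Diagram: an Index's terms are split into dbs by term range (a-b, c-d, e-f, g-h, ...) held in SRB; the db with the query term is passed between SRB and the IndexStore.]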
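
The labels on this slide suggest that an index is split into several dbs by term range, so that a search only needs the one db containing the query term. A minimal sketch under that assumption follows; the letter ranges beyond those shown on the slide, the "other" bucket and all function names are illustrative only.

```python
# Letter ranges as shown on the slide; the slide elides the remainder.
RANGES = ["ab", "cd", "ef", "gh"]


def partition_for(term):
    """Name of the db that would hold this term."""
    first = term[0].lower()
    for letters in RANGES:
        if first in letters:
            return f"{letters[0]}-{letters[1]}"
    return "other"  # catch-all for ranges not listed above


def build_partitions(postings):
    """Split a term -> record ids mapping into one small db per range."""
    partitions = {}
    for term, record_ids in postings.items():
        partitions.setdefault(partition_for(term), {})[term] = record_ids
    return partitions


def search(partitions, term):
    """Look at only the db that can contain the query term."""
    db = partitions.get(partition_for(term), {})
    return db.get(term, [])


postings = {"archive": [1, 4], "data": [2], "grid": [2, 3], "zip": [5]}
partitions = build_partitions(postings)
hits = search(partitions, "grid")
```

With the dbs stored as separate objects in the data grid, only the partition returned by `partition_for` would have to be transferred at query time rather than the whole index.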


Grid Implementation

iRODS Workshop, May 27th 2008 Slide 9

Focus on ingest, not discovery (yet)

Instantiate architecture on every node

Assign one node as master, rest as slaves.

Master then divides the processing as appropriate.

Calls between slaves possible

Calls are kept as small and simple as possible:

(objectIdentifier, functionName, *arguments)

Typically:

(workflow_id, 'process', document_id)
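
The call format above can be sketched in plain Python. This is an in-process illustration only: the object registry, the `Workflow` class and the round-robin division of work are assumptions made for the example, and the transport between nodes is left out entirely.

```python
class Workflow:
    """Hypothetical stand-in for a configured workflow object."""

    def __init__(self):
        self.processed = []

    def process(self, document_id):
        self.processed.append(document_id)
        return f"processed {document_id}"


# Every node instantiates the same architecture, so the same identifier
# resolves to an equivalent object wherever the call arrives.
OBJECTS = {"ingest-workflow": Workflow()}


def handle_call(call):
    """Receiving side: unpack (objectIdentifier, functionName, *arguments)."""
    object_identifier, function_name, *arguments = call
    target = OBJECTS[object_identifier]
    return getattr(target, function_name)(*arguments)


def divide(document_ids, node_count):
    """Master side: deal one call per document to the nodes in turn."""
    queues = [[] for _ in range(node_count)]
    for position, document_id in enumerate(document_ids):
        call = ("ingest-workflow", "process", document_id)
        queues[position % node_count].append(call)
    return queues


queues = divide(["doc-1", "doc-2", "doc-3"], 2)
results = [handle_call(call) for queue in queues for call in queue]
```

Since a call carries only identifiers, the message stays small however large the document is; the receiving node fetches the document itself.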


Grid Architecture

iRODS Workshop, May 27th 2008 Slide 10

[Diagram: the Master Task sends (workflow, process, document) to Slave Task 1 ... Slave Task N. Other elements: Data Grid, GPFS Temporary Storage. Arrow labels for each slave: fetch document, document, extracted data.]


Grid Architecture 2

iRODS Workshop, May 27th 2008 Slide 11

[Diagram: the Master Task sends (index, load) to Slave Task 1 ... Slave Task N. Other elements: Data Grid, GPFS Temporary Storage. Arrow labels for each slave: fetch extracted data, store index.]
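
Read together, the two grid architecture slides appear to describe a two-phase ingest: documents are fetched and their extracted data written to temporary storage, then the extracted data is loaded into indexes that are stored back in the data grid. The sketch below illustrates that reading only. A dictionary stands in for the data grid, a temporary directory for the GPFS storage, and all names are hypothetical.

```python
import json
import os
import tempfile

# Stand-ins: a dict for the data grid, a temp directory for GPFS storage.
data_grid = {"doc-1": "data grid storage", "doc-2": "grid computing"}
scratch = tempfile.mkdtemp()


def extract_phase(document_id):
    """Phase 1: fetch a document and write its extracted terms to scratch."""
    terms = data_grid[document_id].split()
    with open(os.path.join(scratch, document_id + ".json"), "w") as handle:
        json.dump(terms, handle)


def load_phase(index_name):
    """Phase 2: fetch the extracted data, build the index, store it."""
    index = {}
    for filename in sorted(os.listdir(scratch)):
        document_id = filename[: -len(".json")]
        with open(os.path.join(scratch, filename)) as handle:
            for term in json.load(handle):
                index.setdefault(term, []).append(document_id)
    data_grid[index_name] = index
    return index


for document_id in ["doc-1", "doc-2"]:
    extract_phase(document_id)
index = load_phase("index:terms")
```

Separating the phases means the extraction calls can run on any number of nodes at once, while each index is assembled in a single place from the shared temporary storage.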


Usage

NARA ERA Demonstrator
  20 GB of web-crawled data in SRB, indexes stored in SRB
  Interface generated by easily deployable Python layer

Medline Dataset Experiments
  16.5 million abstracts plus associated metadata
  Parsed data stored in SRB
  Indexes in filesystem

NSDL Grade Level Analysis
  NSDL web crawl data (3 TB+)
  Data already in SRB, analysis stored to SRB

iRODS Workshop, May 27th 2008 Slide 12


iRODS Integration

Simple integration (à la SRB) is possible:
  Store data in iRODS for Storage classes
  Requires Python interface to iRODS
  Doesn't really benefit from rule capabilities

Other (more interesting) options:
  Cheshire3 as External Microservice Platform
  Cheshire3 as Internal Microservice Platform
  Cheshire3 as Rules Platform(?)

iRODS Workshop, May 27th 2008 Slide 13


External Microservice Platform

iRODS Workshop, May 27th 2008 Slide 14

[Diagram: iRODS and Cheshire3 as separate systems. Elements: Microservice, C3 Interface Microservice, C3 Microservice (x2). Arrow labels: data, data, processed data.]

Possible Interfaces:
  MPI/PVM
  RPC
  SOAP
  XML over HTTP
  Arbitrary Transport Protocol
  etc.

Loose Coupling via Client Interface
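
One way to picture the loose coupling is an interface microservice that only knows how to serialize a request and hand it to some transport. The sketch below is an illustration under that assumption: JSON and a direct function call stand in for whichever interface from the list is chosen, and none of the names come from iRODS or Cheshire3.

```python
import json


def cheshire3_service(request_bytes):
    """Stand-in for the Cheshire3 side; a real one would run a workflow."""
    request = json.loads(request_bytes)
    tokens = request["data"].split()
    reply = {"workflow": request["workflow"], "tokens": tokens}
    return json.dumps(reply).encode()


class InterfaceMicroservice:
    """Hypothetical iRODS-side client that hides the transport in use."""

    def __init__(self, send):
        # 'send' is any callable taking request bytes and returning reply
        # bytes; it could wrap HTTP, RPC or another transport.
        self.send = send

    def process(self, workflow_id, data):
        request = json.dumps({"workflow": workflow_id, "data": data}).encode()
        return json.loads(self.send(request))


microservice = InterfaceMicroservice(send=cheshire3_service)
reply = microservice.process("tokenize", "data grid")
```

All data has to cross the `send` boundary in both directions, which is the bandwidth cost that the internal platform option avoids.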


Internal Microservice Platform

iRODS Workshop, May 27th 2008 Slide 15

[Diagram: Cheshire3 inside iRODS. Elements: iRODS, C3 Microservice (x2), Cheshire3. Arrow label: data.]

Requires iRODS to have Python interpreter as alternative Microservice platform, rather than a Python client API.

Much tighter integration: Cheshire3 would have access to iRODS internal information rather than just what was passed over interface.

Microservice definition problem becomes Cheshire3 Workflow definition – XML description

No bandwidth problems of transferring large amounts of data back and forth

Tight Coupling via Python Integration


Rules Platform?

iRODS Workshop, May 27th 2008 Slide 16

[Diagram: elements shown are iRODS, Cheshire3, Rules, C3 Microservice (x2), Microservices. Arrow label: data.]

Requires Python interpreter at the Rules execution level, rather than (as well as) at the Microservice level.

More flexible in terms of rule design

Easier to write rules than current rule language

Event system rather than rules execution?

Integration of Computational Grid for rule/microservice execution?


Website: http://www.cheshire3.org/

Me: [email protected]

Acknowledgements:

SHAMAN: EU 7th Framework Programme

Cheshire3: JISC, NSF

Questions?

Thank You!

iRODS Workshop, May 27th 2008 Slide 17