Provenance in myGrid

Jun Zhao

School of Computer Science

The University of Manchester, U.K.

21 October, 2004

Outline

• myGrid

• Motivation

• Challenges

• myGrid approach

• Related work

• Conclusions

myGrid Project

• http://www.mygrid.co.uk
• A pilot e-Science project in the U.K.
• Targeted at biologists and bioinformaticians
• Three bio test beds
• Provides middleware services in a Grid environment, orchestrated through workflows

e-Science in silico Experiments (workflows)

• Automate the process of experiments;

• Orchestrate distributed resources and Web/Grid services;

• Transparent, seamless access to remote data and computation resources

• Increase the collaboration and results sharing across multi-scale communities

[Figure: the experiment lifecycle supported by myGrid. Labels: Forming experiments; Executing and monitoring experiments; Managing lifecycle, provenance and results of experiments; Sharing services & experiments; Discovering and reusing experiments and resources; Personalisation; Soaplab]

Problems when doing in silico experiments

Experiments being performed repeatedly, at different sites, different times, by different users or groups;

Scientists

• A large repository of zipped records about experiments!!
• Frequently updated resources
• Volatile, distributed environment

Problems when doing in silico experiments

Scientists

• Verification of data
• “Recipes” for experiment designs
• Explanation for the impact of changes
• Ownership
• Performance of services
• Data quality

PROVENANCE

Provenance Forms

• Derivations
– A workflow log
– Linking items, in a directed graph
– When, who, how, which, what, where
– Execution process-centric

• Annotations
– Attached to items or collections of items, in a structured, semi-structured or free-text form
– Annotations on one item or linking items
– Why, when, where, who, what, how
– Data-centric
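The two forms above can be sketched in a few lines of Python. This is only an illustration, not the myGrid implementation: the `ProvenanceStore` class, its method names, and the item names (`dna_sequence`, `blast_report`, `summary`) are all hypothetical. Derivations are kept as a process-centric log whose entries form a directed graph, and annotations are kept as data-centric notes attached to items.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One entry of a workflow log: which output came from which inputs."""
    output: str
    inputs: tuple
    process: str  # how
    who: str
    when: str


class ProvenanceStore:
    """Hypothetical toy store holding both provenance forms side by side."""

    def __init__(self):
        self.derivations = []                 # process-centric log entries
        self.annotations = defaultdict(dict)  # data-centric notes per item

    def record(self, output, inputs, process, who, when):
        self.derivations.append(
            Derivation(output, tuple(inputs), process, who, when)
        )

    def annotate(self, item, **notes):
        self.annotations[item].update(notes)

    def lineage(self, item):
        """Return every item that `item` was directly or indirectly derived from."""
        parents = defaultdict(set)
        for entry in self.derivations:
            parents[entry.output].update(entry.inputs)
        seen, stack = set(), [item]
        while stack:
            for source in parents[stack.pop()]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Illustrative data only; none of these names come from the slides.
store = ProvenanceStore()
store.record("blast_report", ["dna_sequence"],
             process="BLAST", who="alice", when="2004-10-21")
store.record("summary", ["blast_report"],
             process="summarise", who="alice", when="2004-10-21")
store.annotate("blast_report", why="checking a candidate gene", quality="unreviewed")

print(sorted(store.lineage("summary")))
print(store.annotations["blast_report"])
```

Walking the derivation edges answers "where did this come from?", while the annotation dictionary answers "why was this done and how far should it be trusted?".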

[Figure: a set of derived data items, each labelled by its parameter settings. The labels range from mass = 200 alone to combinations that add decay (WW, ZZ or bb), stability (1 or 3), plot = 1, event = 8, and LowPt = 20 with HighPt = 10000]

Challenges

• Cross-referencing across runs and within an experiment
• Provenance of *good* metadata annotation
• Bridging provenance islands
• Moreover…

Challenges: Complex cross-referencing information

• Complex control flow
• Iterative data and process flow
• Repetitive running, producing cross-referencing information
• Human interaction activities vs. service invocations
• Service failure and experiment re-composition

[Figure: an experiment design file, a revised experiment, an experiment run with interactions, and an experiment run with failures. Other labels: State controls; Iterative service]

Challenges

• Annotations:
– Mandatory / automatic
– Who did that
– How much should be trusted
– Security control
– Authenticity validation
– Quality
– Cross-referencing
– Versioning

Challenges: provenance islands

[Figure: provenance islands. Labels: Workflow 1; Service 1; Service 2; Data 1; Experimental Investigation 1; Diverse information; Diverse metadata of information]

Moreover

• Intellectual property
• Preservation
• Archiving
• Query and access
• Integration
• Investigation
• Impact analysis
• …

myGrid Approach

• Taverna workflow workbench
– Provenance plug-in
– mIR (myGrid Information Repository) plug-in

• myGrid information model
– Based on the CCLRC scientific metadata model
– Provides a shared model for service and component interactions

• Semantic Web technologies
– RDF (Resource Description Framework)
– Ontologies

• LSIDs and URNs

http://taverna.sourceforge.net
http://freefluo.sourceforge.net

B. Matthews and S. Sufi: The CLRC Scientific Metadata Model, version 1, DL TR 02001, CLRC, February 2001

RDF in a Nutshell

• Resource Description Framework
• Common model for metadata
• A graph of triples
• <subject, predicate, object>
• RDQL, repositories, integration tools, presentation tools
• Jena, Haystack

http://www.w3.org/RDF/
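The triple model can be illustrated with a minimal in-memory sketch. It is deliberately library-free and is not how Jena or an RDQL engine works internally. The `TripleGraph` class and the `ex:` names are assumptions made up for this example. The `match` method shows the core idea behind triple-pattern queries: any position left as `None` acts as a wildcard.

```python
class TripleGraph:
    """Hypothetical toy graph: a set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples that fit the pattern; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Illustrative statements only; the "ex:" names are invented for this sketch.
graph = TripleGraph()
graph.add("ex:blast_report", "ex:derived_from", "ex:dna_sequence")
graph.add("ex:blast_report", "ex:created_by", "ex:alice")
graph.add("ex:summary", "ex:derived_from", "ex:blast_report")

# Everything known about one resource.
for triple in graph.match(subject="ex:blast_report"):
    print(triple)

# Every derivation statement in the graph.
for s, p, o in graph.match(predicate="ex:derived_from"):
    print(s, p, o)
```

Because every statement has the same three-part shape, metadata from different sources can be merged simply by taking the union of their triple sets.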

[Figure: the myGrid provenance model]

Levels: Organisation level provenance; Process level provenance; Data/knowledge level provenance

Entities: project, Experiment design, Workflow design, Workflow run, Person, Organisation, Service, Process, Event, Data, DNA sequence, Blast Result

Relations:
• instanceOf, partOf, subClass, run for, hasInput, hasOutput
• componentProcess, e.g. web service invocation of BLAST @ NCBI
• componentEvent, e.g. completion of a web service invocation at 12.04pm
• runBy, e.g. BLAST @ NCBI
• data derivation, e.g. output data derived from input data
• knowledge statements, e.g. similar protein sequence to

Users can add templates to each workflow process to determine knowledge links between data items.

Representing links

• Identify link type
– Again use URI
– Allows us to use RDF infrastructure
  • Repositories
  • Ontologies

urn:lsid:taverna.sf.net:datathing:45fg6

urn:lsid:taverna.sf.net:datathing:23ty3

http://www.mygrid.org.uk/ontology#derived_from
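As a rough sketch of how such a link might be handled, the code below splits an LSID into its standard parts (`urn:lsid:authority:namespace:object`, with an optional revision) and writes the link as one N-Triples statement. The two LSIDs and the predicate URI are the ones shown on the slide; the helper names `parse_lsid` and `to_ntriple` are hypothetical, and the direction of the link (45fg6 derived from 23ty3) is an assumption, since the slide text does not say which item is the subject.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    return LSID(*parts[2:])


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one link between two URI-identified items as N-Triples."""
    return f"<{subject}> <{predicate}> <{obj}> ."


DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"

# Assumption: the first item is the derived one, the second its source.
derived = "urn:lsid:taverna.sf.net:datathing:45fg6"
source = "urn:lsid:taverna.sf.net:datathing:23ty3"

print(parse_lsid(derived))
print(to_ntriple(derived, DERIVED_FROM, source))
```

Identifying the link type with a URI means the link is an ordinary RDF statement, so generic repositories and ontologies can store and reason over it without knowing anything about workflows.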

Provenance Web

[Figure labels: LSID for GenBank; Data; Personalization view]

Reflection

First attempt:
• Bridging the island
• Provenance modelling: relational + schema-less model
• Provenance collection

Moreover:
• Provenance slicing
• Security control
• Authenticity validation
• Provenance versioning and (long-term) preservation

Related Work

• Chimera: provenance cross-referencing
– www.griphyn.org/chimera/

• CombeChem
– www.combechem.org/

• PASOA (Provenance Aware Service Oriented Architecture)
– http://twiki.pasoa.ecs.soton.ac.uk/bin/view/PASOA/WebHome

• CMCS (Collaboratory for Multi-Scale Chemical Science)
– http://cmcs.ca.sandia.gov/index.php

• ESSW (Earth System Science Workbench)
– http://essw.bren.ucsb.edu/

Acknowledgement

– myGrid team:
• esp. Carole Goble, Robert Stevens, Chris Wroe, Mark Greenwood, Phil Lord

– IBM:
• Dennis Quan

– Williams Group:
• esp. Hannah Tipney