Provenance in myGrid
Jun Zhao
School of Computer Science
The University of Manchester, U.K.
21 October, 2004
myGrid Project
• http://www.mygrid.co.uk
• A pilot e-Science project in the U.K.;
• Targeted at biologists and bioinformaticians;
• Three bio-test beds;
• Providing middleware services in a Grid environment, which are orchestrated through workflows.
e-Science in silico Experiments (workflows)
• Automate the process of experiments;
• Orchestrate distributed resources and Web/Grid services;
• Transparent, seamless access to remote data and computation resources
• Increase the collaboration and results sharing across multi-scale communities
[Figure: experiment lifecycle. Labels: Forming experiments; Executing and monitoring experiments; Managing lifecycle, provenance and results of experiments; Sharing services & experiments; Discovering and reusing experiments and resources; Personalisation; Soaplab.]
Problems when doing in silico experiments
Experiments being performed repeatedly, at different sites, different times, by different users or groups;
Scientists
• A large repository of zipped records about experiments!!
• Frequently updated resources;
• Volatile, distributed environment
• Verification of data;
• “Recipes” for experiment designs;
• Explanation for the impact of changes;
• Ownership;
• Performance of services;
• Data quality;
PROVENANCE
Provenance Forms
• Derivations
  – A workflow log.
  – Linking items, in a directed graph.
  – When, who, how, which, what, where.
  – Execution process-centric.
• Annotations
  – Attached to items or collections of items, in a structured, semi-structured or free-text form.
  – Annotations on one item or linking items.
  – Why, when, where, who, what, how.
  – Data-centric.
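The two forms above can be sketched in a few lines of Python. This is a minimal illustration, not the myGrid implementation: the `Derivation` and `ProvenanceStore` names, their fields, and the sample item names are all hypothetical. Derivations are kept as a log of process-centric records that together form a directed graph; annotations are free text attached to an item.

```python
from dataclasses import dataclass, field


@dataclass
class Derivation:
    """One workflow-log entry: which process produced what, from what."""
    output: str    # what was produced
    inputs: list   # which items it was derived from
    process: str   # how (the service or step that ran)
    who: str       # who ran it
    when: str      # when it ran (kept as a plain string for simplicity)


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store holding both provenance forms."""
    derivations: list = field(default_factory=list)   # process-centric log
    annotations: dict = field(default_factory=dict)   # data-centric notes

    def record(self, derivation):
        self.derivations.append(derivation)

    def annotate(self, item, note):
        self.annotations.setdefault(item, []).append(note)

    def lineage(self, item):
        """Return every item that `item` was directly or indirectly derived from."""
        seen = set()
        stack = [item]
        while stack:
            current = stack.pop()
            for d in self.derivations:
                if d.output == current:
                    for source in d.inputs:
                        if source not in seen:
                            seen.add(source)
                            stack.append(source)
        return seen


# Sample data (invented for illustration).
store = ProvenanceStore()
store.record(Derivation("blast_report", ["dna_sequence"], "BLAST", "alice", "2004-10-21T12:04"))
store.record(Derivation("alignment", ["blast_report"], "align", "alice", "2004-10-21T12:10"))
store.annotate("blast_report", "Top hit resembles a known protein family.")

print(sorted(store.lineage("alignment")))  # ['blast_report', 'dna_sequence']
```

Walking the derivation graph answers "where did this result come from", while the annotation dictionary carries the "why" that a workflow log cannot capture on its own.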
[Figure: graph of nodes labelled with parameter sets, from "mass = 200" through "mass = 200, decay = WW" and "mass = 200, decay = WW, stability = 1" to "mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000". Other parameter values shown: decay = ZZ, decay = bb, stability = 3, plot = 1, event = 8.]
Challenges
• Cross-referencing across runs and within an experiment;
• Provenance of *good* metadata annotation;
• Bridging provenance islands;
• Moreover...
Challenges: Complex cross-referencing information
• Complex control flow
• Iterative data and process flow
• Repetitive running producing cross-referencing information
• Human interaction activities vs. service invocations
• Service failure and experiment re-composition
[Figure labels: Experiment design file; State controls; Iterative service; Experiment run with interactions; Revised experiment; Experiment run with failures.]
Challenges
• Annotations:
  – Mandatory / automatic
  – Who did that
  – How much should be trusted
  – Security control
  – Authenticity validation
  – Quality
  – Cross-referencing
  – Versioning
Challenges: provenance islands
[Figure: separate islands labelled Workflow 1, Service 1, Service 2, Data 1 and Experimental Investigation 1. Captions: "Diverse information"; "Diverse metadata of information".]
Moreover
• Intellectual property
• Preservation
• Archiving
• Query and access
• Integration
• Investigation
• Impact analysis
• ……
myGrid Approach
• Taverna workflow workbench
  – Provenance plug-in;
  – mIR (myGrid Information Repository) plug-in;
• myGrid information model
  – Based on CCLRC scientific metadata model
  – Providing a shared model for service and component interactions
• Semantic Web technologies
  – RDF (Resource Description Framework)
  – Ontologies
• LSIDs and URNs
http://taverna.sourceforge.net
http://freefluo.sourceforge.net
B. Matthews and S. Sufi: The CLRC Scientific Metadata Model, version 1, DL TR 02001, CLRC, February 2001
RDF in a Nutshell
• Resource Description Framework
• Common model for metadata
• A graph of triples
• <subject, predicate, object>
• RDQL, repositories, integration tools, presentation tools
• Jena, Haystack
http://www.w3.org/RDF/
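To make "a graph of triples" concrete, here is a small Python sketch that holds <subject, predicate, object> triples as plain tuples and answers simple pattern queries. It is only an illustration of the data model: it is not RDQL, it does not use Jena, and the `ex:` names and the `match` helper are invented for this example.

```python
# Each statement is one <subject, predicate, object> triple.
# All "ex:" identifiers are made-up placeholders, not a real vocabulary.
triples = {
    ("ex:blast_report", "ex:derived_from", "ex:dna_sequence"),
    ("ex:blast_report", "ex:created_by", "ex:blast_service"),
    ("ex:dna_sequence", "ex:type", "ex:DNASequence"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything stated about the BLAST report:
for triple in match(triples, s="ex:blast_report"):
    print(triple)

# Which items were derived from the DNA sequence?
derived = [t[0] for t in match(triples, p="ex:derived_from", o="ex:dna_sequence")]
print(derived)  # ['ex:blast_report']
```

Because every statement has the same three-part shape, metadata from different sources can be merged by taking the union of their triple sets, which is what makes RDF attractive as a common model.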
[Figure: provenance model.
Levels: Organisation level provenance; Process level provenance; Data/knowledge level provenance.
Nodes: Project; Experiment design; Workflow design; Workflow run; Person; Organisation; Service; Process; Event; Data; Blast Result; DNA sequence.
Relations: instanceOf; partOf; subClass; hasInput; hasOutput; run for; runBy (e.g. BLAST @ NCBI); componentProcess (e.g. web service invocation of BLAST @ NCBI); componentEvent (e.g. completion of a web service invocation at 12.04pm); data derivation (e.g. output data derived from input data); knowledge statements (e.g. similar protein sequence to).]
Users can add templates to each workflow process to determine knowledge links between data items.
Representing links
• Identify link type
  – Again use URI
  – Allows us to use RDF infrastructure
    • Repositories
    • Ontologies
Example link between two items:
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
Link type: http://www.mygrid.org.uk/ontology#derived_from
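As a rough sketch of how such a link could be handled as an RDF statement, the Python below splits an LSID into its parts and writes the link as one N-Triples line. The three URIs are the ones shown on the slide; everything else is an assumption made for illustration. In particular, the slide does not say which data item is the subject, so the direction used here (45fg6 derived from 23ty3) is a guess, and `parse_lsid` and `to_ntriple` are hypothetical helper names.

```python
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"


def parse_lsid(lsid):
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>] into its parts."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def to_ntriple(subject, predicate, obj):
    """Serialise a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# ASSUMPTION: the direction of the link is not given in the source.
link = (
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    DERIVED_FROM,
    "urn:lsid:taverna.sf.net:datathing:23ty3",
)

line = to_ntriple(*link)
print(line)
print(parse_lsid(link[0]))
```

Naming the link type with a URI means the statement can be stored in any RDF repository and the meaning of `derived_from` can be defined once in an ontology.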
Reflection
• First attempt
• Bridging the island
• Provenance modelling: relational + schema-less model
• Provenance collection
• Moreover:
  – Provenance slicing
  – Security control
  – Authenticity validation
  – Provenance versioning and (long-term) preservation
Related Work
• Chimera: Provenance cross-referencing
  – www.griphyn.org/chimera/
• CombeChem:
  – www.combechem.org/
• PASOA (Provenance Aware Service Oriented Architecture)
  – http://twiki.pasoa.ecs.soton.ac.uk/bin/view/PASOA/WebHome
• CMCS (Collaboratory for Multi-Scale Chemical Science)
  – http://cmcs.ca.sandia.gov/index.php
• ESSW (Earth System Science Workbench)
  – http://essw.bren.ucsb.edu/