Provenance in myGrid and beyond www.mygrid.org.uk Luc Moreau, University of Southampton, UK

TRANSCRIPT

  • Slide 1
  • Provenance in myGrid and beyond www.mygrid.org.uk Luc Moreau, University of Southampton, UK
  • Slide 2
  • or the Provenance of my interest for Provenance Luc Moreau, University of Southampton, UK
  • Slide 3
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 4
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 5
  • Large amounts of data: EMBL (July 2001), 150 Gbytes. Microarray, 1 petabyte per annum. Sanger Centre, 20 terabytes of data. Genome sequences increase 4x per annum. http://www3.ebi.ac.uk/Services/DBStats/
  • Slide 6
  • Complexity, Diversity, Heterogeneity. Disease, Drug, Disease, Clinical trial, Phenotype, Protein, Protein Structure, Protein Sequence, P-P interactions, Proteome, Gene sequence, Genome sequence, Gene expression, homology. Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns & motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors),
  • Slide 7
  • Heterogeneity Data types & forms Community Autonomy Over 500 different databases Different formats, structure, schemas, coverage Web interfaces, flat file distribution,
  • Slide 8
  • Heterogeneous Data Multimedia Images & Video Text annotations & literature Descriptive as well as numeric Knowledge-based Text Extraction
  • Slide 9
  • Bioinformatics Analysis Different algorithms BLAST, FASTA, pSW Different implementations WU-BLAST, NCBI-BLAST Different service providers NCBI, EBI, DDBJ
  • Slide 10
  • Drug Discovery
  • Slide 11
  • In silico experimentation: Discovery of resources and tools, staging of operations, sharing of results. Process is as important as outcome. Science is dynamic: change happens. Scientific discovery is personal & global. Provenance and history.
  • Slide 12
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 13
  • myGrid: EPSRC funded pilot project. Generic middleware within application setting. 36 months in a 42-month performance period. Start: 1st October 2001. 16 full-time post docs altogether, 6 DTA studentships, 1 technical project manager.
  • Slide 14
  • myGrid consortium Scientific Team Biologists and Bioinformaticians GSK, AZ, Merck KGaA, Manchester, EBI Technical Team Manchester, Southampton, Newcastle, Sheffield, EBI, Nottingham IBM, SUN GeneticXchange Network Inference, Epistemics Ltd
  • Slide 15
  • myGrid outcomes e-Scientists Bioinformatics demonstrator (Graves disease and Williams syndrome) Developers myGrid-in-a-Box developers kit (currently myGrid 0.4) Integrating some existing bioinformatics tools with myGrid (EBI services)
  • Slide 16
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 17
  • Graves disease Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland resulting in hyperthyroidism Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
  • Slide 18
  • The Biology: GD is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system. Why is the lymphocyte causing the antibodies that attack the thyroid cell?
  • Slide 19
  • Graves Disease Experimental Process What genes are associated with Graves Disease? Candidate Gene What is known about my candidate gene? Literature Previous Research Databases Experimental Annotation Pipeline What SNPs (single nucleotide polymorphisms) in my candidate gene might be relevant? Verify relevance of SNPs. How can I visualise annotations to my candidate gene? Genotype Assay Design System 3D Protein Structure & SNP Visualisation
  • Slide 20
  • Experiment life cycle Executing experiments Workflow enactment Distributed Query processing Job execution Provenance generation Single sign-on authorisation Event notification Resource & service discovery Repository creation Workflow creation Database query formation Discovering and reusing experiments and resources Workflow discovery & refinement Resource & service discovery Repository creation Provenance Managing experiments Information repository Metadata management Provenance management Workflow evolution Event notification Providing services & experiments Service registration Workflow deposition Metadata Annotation Third party registration Personalisation Personalised registries Personalised workflows Info repository views Personalised annotations Personalised metadata Security Forming experiments
  • Slide 21
  • A work bench for demonstrating services myView on the mIR Workflow Metadata about workflow note about workflow
  • Slide 22
  • Workflows: A workflow represents an experiment that can be run on the Grid. A workflow takes data as input. It performs activities, which are the steps involved in analysing the data, including using tools and services, querying databases and running other workflows. A workflow can be run on the user's local machine, or remotely, taking advantage of distributed resources. A data-intensive grid has to deal with the heterogeneity of the data and processes.
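The description of a workflow above (data in, a series of activities, a result out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_workflow`, `clean_sequence` and `count_bases` are hypothetical names, not part of myGrid, and real workflows in the talk are enacted by a workflow engine rather than a local loop.

```python
from typing import Any, Callable, List

# Hypothetical illustration: an activity is any function from data to data.
Activity = Callable[[Any], Any]


def run_workflow(activities: List[Activity], data: Any) -> Any:
    """Feed the input through each activity in order and return the final result."""
    for activity in activities:
        data = activity(data)
    return data


# Toy activities standing in for tools, services or database queries.
def clean_sequence(seq: str) -> str:
    return seq.strip().upper()


def count_bases(seq: str) -> dict:
    return {base: seq.count(base) for base in "ACGT"}


# A workflow can itself be used as an activity of another workflow.
inner = lambda data: run_workflow([clean_sequence], data)
result = run_workflow([inner, count_bases], "  acgtta ")
```

Because an activity is just a function, "running other workflows" falls out naturally: the nested workflow `inner` is passed in like any other step.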
  • Slide 23
  • myGrid schematic Graves disease scenario Workbench Workflow editor Event Notification Workflow Enactment Information repository Service Registry Knowledge management Text services Bio services Distributed query processing Services Core components Generic Applications Exemplars Talisman SoapLab Gateway
  • Slide 24
  • Notification Service Knowledge Services DB2 Registry Service Oriented Architecture Semantic registration Service Structural registration Knowledge Service Ontology Server Reasoner Matcher Registry DB2 Workflow templates DataProvenance mInfo Repository Workflow enactment engine Workflow instances Build/Edit Workflow Service Discovery Test Data Notification Service WSFL JMS Distributed Query Processor Information Extraction PASTA Job Execution SoapLab mIR Provenance service Component Discovery MetadataConcepts Registry View UDDI UDDI-M
  • Slide 25
  • myGrid Deployment
  • Slide 26
  • myGrid 0.4 (Nov 2003). Describer (MAN): a tool for attaching semantic descriptions to WS and workflows; Find Service (MAN): a component for classifying and discovering services and workflows via their semantic descriptions; Ontology Server (MAN): the DAML+OIL reasoner; Workbench (NOT): a NetBeans module for examining and updating the MIR and submitting workflows for enactment; e-Science Gateway (NOT): an API giving access to myGrid core services; MIR (myGrid information repository) (MAN/NEW): a Web Service accessing a repository that can hold data for an individual scientist or a team of scientists; Notification Service (IAM): a general-purpose Web Service that supports a publish/subscribe model of event notification, based on JMS; Registry View service (IAM): a Web Service supporting a registry of published Web Services and workflows annotated with metadata, including semantic descriptions; Freefluo (ITI): workflow enactment engine; Taverna (EBI): workflow editing environment.
  • Slide 27
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 28
  • Provenance: definition. Main Entry: provenance. Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s. Function: noun. Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come -- more at PRO-, COME. Date: 1785. 1 : ORIGIN, SOURCE 2 : the history of ownership of a valued object or work of art or literature
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Provenance: Experiment is repeatable, if not reproducible, and explained by provenance records. Who, what, where, why, when, (w)how? The traceability of knowledge as it evolves and as it is derived. Immutable metadata. Migration: travels with its data but may not be stored with it. Private vs shared provenance records. Credit. Provenance is related to:
  • Slide 34
  • Early Provenance Capture: A full provenance record is linked with the results. It's a log of execution.
  • Slide 35
  • Kinds of Provenance: Backward Derivation. An explanation of when, by whom, and how something was produced. Linking items, usually in a directed graph. Execution. Process-centric. To be contrasted with forward derivation, which is a path like a workflow, script or query. [Figure: derivation graph linking parameter sets, with node labels such as mass = 200, decay = WW (also ZZ, bb), stability = 1 (also 3), event = 8, plot = 1, LowPt = 20, HighPt = 10000]
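To make "linking items, usually in a directed graph" concrete, here is a minimal sketch of a backward derivation query. The graph contents and the names `derived_from` and `backward_derivation` are assumptions made for illustration; they are not taken from the slides.

```python
from typing import Dict, List, Set

# Hypothetical derivation graph: each item maps to the items it was derived from.
derived_from: Dict[str, List[str]] = {
    "plot": ["events"],
    "events": ["simulation_params", "detector_config"],
    "simulation_params": [],
    "detector_config": [],
}


def backward_derivation(item: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every item that contributed, directly or indirectly, to `item`."""
    seen: Set[str] = set()
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


ancestors = backward_derivation("plot", derived_from)
```

Walking the edges from a result back to its sources answers "how was this produced?", whereas a workflow or script (forward derivation) describes what will be produced.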
  • Slide 36
  • Kinds of Provenance: Annotations. Attached to items or collections of items, in a structured, semi-structured or free text form. Annotations on one item or linking items. An explanation of why, when, where, who, what, how. Data-centric.
  • Slide 37
  • Kinds of Provenance in myGrid Derivations Workflow Enactment Engine provides a detailed provenance record stored in the myGrid Information Repository (mIR) describing what was done, with what services and when XML document, soon to be an RDF model Annotations Every mIR object has Dublin Core provenance properties described in an attribute value model
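The two kinds of record named above can be pictured with a pair of small data structures. This is a sketch only: the class names, field names and `example:` identifiers are hypothetical and do not reflect the actual mIR schema. `dc:creator` and `dc:date` are genuine Dublin Core element names, used here as plain string keys.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class InvocationRecord:
    """One step of a derivation: what was done, with what service, and when."""
    service: str
    inputs: List[str]
    outputs: List[str]
    start_time: datetime
    end_time: datetime


@dataclass
class RepositoryObject:
    """A stored item carrying Dublin Core style attribute-value annotations."""
    identifier: str
    annotations: Dict[str, str] = field(default_factory=dict)


record = InvocationRecord(
    service="example:retrieve_service",  # hypothetical service name
    inputs=["example:gene-1"],
    outputs=["example:snp-1"],
    start_time=datetime(2003, 11, 1, 10, 0, 0),
    end_time=datetime(2003, 11, 1, 10, 0, 5),
)
item = RepositoryObject(
    identifier="example:snp-1",
    annotations={"dc:creator": "A. Scientist", "dc:date": "2003-11-01"},
)
duration = (record.end_time - record.start_time).total_seconds()
```

The derivation record is process-centric (it describes a run), while the annotations are data-centric (they describe a stored object).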
  • Slide 38
  • Provenance of data: Operational execution trail. Graph labels: process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, Gene:AC005412.6, SNP:000010197
  • Slide 39
  • From Provenance to Knowledge: Declarative semantic execution trail. Graph labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, contains_single_nucleotide_polymorphism, as stated by
  • Slide 40
  • From Provenance to Knowledge: Trust and attribution. Graph labels: Gene:AC005412.6, SNP:000010197, process, start time, end time, lsid:HGVBase_retrieve, input, by_service, urn: Claire Jennings, run_for, output, contains_single_nucleotide_polymorphism, as stated by, urn: Carole Goble, disputed by
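The three slides above show a graph that grows from an execution trail into attributed, disputable knowledge. A rough sketch of that idea as subject-predicate-object statements follows. The transcript preserves only the labels, so the way edges connect nodes here is an assumption, as are the `process:1` and `urn:example:` identifiers and the helper names.

```python
from typing import Dict, List, NamedTuple, Optional


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    stated_by: Optional[str] = None  # attribution, if any


# Assumed arrangement of edges; the slides only list the labels.
statements: List[Statement] = [
    Statement("process:1", "by_service", "lsid:HGVBase_retrieve"),
    Statement("process:1", "input", "Gene:AC005412.6"),
    Statement("process:1", "output", "SNP:000010197"),
    Statement("Gene:AC005412.6", "contains_single_nucleotide_polymorphism",
              "SNP:000010197", stated_by="process:1"),
]

# Disputes are kept separately so the original statement stays immutable.
disputed_by: Dict[Statement, List[str]] = {}


def about(subject: str) -> List[Statement]:
    """Return all statements whose subject is `subject`."""
    return [s for s in statements if s.subject == subject]


def dispute(statement: Statement, who: str) -> None:
    """Record that `who` disputes `statement`."""
    disputed_by.setdefault(statement, []).append(who)


claim = statements[3]
dispute(claim, "urn:example:scientist-b")  # hypothetical identifier
```

The first three statements are the operational trail; the fourth is knowledge derived from it and attributed to the process that produced it; the dispute is a further statement about that statement.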
  • Slide 41
  • Provenance vs ... Provenance vs Annotation: provenance of an annotation; annotation of provenance. Provenance vs Workflow: provenance describes past execution; a workflow is a script for future execution.
  • Slide 42
  • What is Provenance? Annotations may be subject to interpretation (e.g. Alice believes annotation X, whereas Bob does not). Provenance should aim at recording an undisputed view of an execution.
  • Slide 43
  • What is Provenance? Provenance traces execution Provenance must be generated automatically Annotations can be either generated automatically or created by the user Annotations can contain semantic augmentation, which can be derived automatically or supplied manually.
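One way to read "provenance must be generated automatically" is that recording happens in the infrastructure around a service call rather than in user code. The sketch below illustrates this with a Python decorator; `record_provenance`, `provenance_log` and `reverse_complement` are hypothetical names chosen for the example, not myGrid components.

```python
import functools
import time
from typing import Any, Callable, Dict, List

provenance_log: List[Dict[str, Any]] = []  # hypothetical in-memory log


def record_provenance(func: Callable) -> Callable:
    """Wrap a service call so each invocation is logged without user action."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.time()
        result = func(*args, **kwargs)
        provenance_log.append({
            "service": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "start_time": start,
            "end_time": time.time(),
        })
        return result
    return wrapper


@record_provenance
def reverse_complement(seq: str) -> str:
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


output = reverse_complement("ACCG")
```

The caller never touches the log, which is the point: the trace is a by-product of execution. Annotations, by contrast, could be added to a log entry later by a person or a tool.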
  • Slide 44
  • Generating provenance RDF registry mIR FreeFluo WFEE Workflow execution Template OWL descriptions Identify workflow Execution Provenance log Data and metadata from the run Input data & parameters startTime, endTime, service instances invoked Bind services Knowledge Provenance log Workflow knowledge template RDF+OWL Knowledge arising from workflow Scufl
  • Slide 45
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 46
  • Provenance in a Bioinformatics Grid: myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments. Provenance in the drug discovery process: FDA requirement on drug companies to keep a record of provenance of drug discovery as long as the drug is in use (up to 50 years sometimes).
  • Slide 47
  • Provenance in Aerospace Engineering: Provenance requirement: to maintain a historical record of outputs from each sub-system involved in simulations. Aircraft provenance data need to be kept for up to 99 years when sold to some countries. Currently, little direct support is available for this.
  • Slide 48
  • Provenance in Organ Transplant Management: Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors' and surgeons' knowledge. Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made. Application of these rules must be ensured, be auditable and may change over time. Provenance allows tracking previous decisions: crucial to maximise the efficiency in matching and recovery rate of patients.
  • Slide 49
  • The Grid and Virtual Organisations: The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01]. Effort is required to allow users to place their trust in the data produced by such virtual organisations. Understanding how a given service is likely to modify data flowing into it, and how this data has been generated, is crucial.
  • Slide 50
  • Provenance and Virtual Organisations: Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded? The lack of information about the origin of results does not help users to trust such open environments.
  • Slide 51
  • Provenance and Workflows: Workflow enactment has become popular in the Grid and Web Services communities. Workflow enactment can be seen as a scripted form of virtual organisation. The problem is similar: how can we determine the origin of enactment results?
  • Slide 52
  • Provenance: Definition. Provenance is data that can explain how a particular result has been derived. In a service-oriented architecture, provenance identifies what data is passed between services, what services are available, and what results are generated for particular sets of input values, etc. Using provenance, a user can trace the process that led to the aggregation of services producing a particular output.
  • Slide 53
  • Overview Bioinformatics background myGrid facts Services and Workflows Provenance in myGrid Beyond myGrid Provenance Architectural vision Conclusions
  • Slide 54
  • What is the problem? Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments. Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.
  • Slide 55
  • Architectural Vision
  • Slide 56
  • Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services. Provenance data will be submitted to one or more provenance repositories acting as storage for provenance data. Upon user request, some analysis, navigation and reasoning over provenance data can be undertaken.
  • Slide 57
  • Architectural Vision: Storage could be achieved by a provenance service. The provenance service would provide support for analysis, navigation or reasoning over provenance. Client-side support for submitting provenance data to the provenance service.
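The division of labour described above (clients submit, a provenance service stores and answers queries) can be sketched with a single in-memory class. This is an assumption-laden stand-in, not the prototype's interface: the class name, the `submit` and `query` methods and the record fields are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ProvenanceService:
    """Hypothetical in-memory stand-in for a provenance repository."""

    def __init__(self) -> None:
        self._records: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    def submit(self, session_id: str, record: Dict[str, Any]) -> None:
        """Store one provenance record under a workflow session."""
        self._records[session_id].append(record)

    def query(self, session_id: str, **criteria: Any) -> List[Dict[str, Any]]:
        """Return the session's records whose fields match all criteria."""
        return [
            r for r in self._records.get(session_id, [])
            if all(r.get(key) == value for key, value in criteria.items())
        ]


service = ProvenanceService()
service.submit("run-1", {"service": "blast", "output": "hits-1"})
service.submit("run-1", {"service": "align", "output": "alignment-1"})
blast_records = service.query("run-1", service="blast")
```

In a deployed system the enactment engine, its client and the invoked services would each call `submit` over the network, and browsing or validation tools would sit on top of `query`.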
  • Slide 58
  • A First Prototype (Szomszor, Moreau 03): A service-oriented architecture for provenance support in Grid and Web Services environments, based on the idea of a provenance service; a client-side API for recording provenance data for Web Service invocation; a data model for storing provenance data; a server-side interface for querying provenance data; two components making use of provenance: provenance browsing and provenance validation.
  • Slide 59
  • Prototype Overview
  • Slide 60
  • Prototype Sequence Diagram
  • Slide 61
  • To identify the interactions between the provenance service, client-side library and enactment engine. Creation of a session. Need to be able to support the most complex workflows, including conditional branching, iteration, recursion and parallel execution. Support asynchronous submission of provenance data so that provenance submission does not delay workflow execution.
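The requirement that submission should not delay workflow execution is commonly met with a queue drained by a background worker. The sketch below shows that pattern under stated assumptions: `AsyncProvenanceClient` and its methods are hypothetical, and the `stored` list stands in for the remote provenance service that a real client would call over the network.

```python
import queue
import threading
import uuid
from typing import Any, Dict, List, Tuple


class AsyncProvenanceClient:
    """Hypothetical client-side library that submits records in the background."""

    def __init__(self) -> None:
        # Stands in for the remote provenance service.
        self.stored: List[Tuple[str, Dict[str, Any]]] = []
        self._queue: "queue.Queue[Tuple[str, Dict[str, Any]]]" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def create_session(self) -> str:
        """Create an identifier that groups the records of one workflow run."""
        return str(uuid.uuid4())

    def submit(self, session_id: str, record: Dict[str, Any]) -> None:
        """Return immediately; delivery happens on the background thread."""
        self._queue.put((session_id, record))

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            self.stored.append(item)  # a real client would make a network call here
            self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued record has been delivered."""
        self._queue.join()


client = AsyncProvenanceClient()
session = client.create_session()
for step in range(3):
    client.submit(session, {"step": step})
client.flush()
```

With a single worker the records arrive in submission order; a workflow with parallel branches would additionally need each record to carry enough context to be placed correctly in the trace.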
  • Slide 62
  • Prototype Provenance Data Model
  • Slide 63
  • Must support recording of all information necessary to replay execution. Must support all complex forms of workflows (recursion, iterations, parallel execution).
  • Slide 64
  • Prototype Provenance Browser
  • Slide 65
  • Discussion: In order for provenance data to be useful, we expect such a protocol to support some classical properties of distributed algorithms. Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server, and vice-versa, a provenance server can ensure that it receives data from a given service. With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result. We anticipate that cryptographic techniques will be useful to ensure such properties.
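As a small illustration of the kind of cryptographic technique the discussion anticipates, the sketch below attaches a keyed message authentication code (HMAC-SHA256, from the Python standard library) to a provenance record. This is a swapped-in, simpler technique: a shared-key MAC lets the provenance server check that a record came from a key holder and was not altered, but it does not give non-repudiation, since both parties hold the same key. Non-repudiation would need public-key signatures, which are outside the standard library. The function names and the record are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def sign_record(record: Dict[str, Any], key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: Dict[str, Any], tag: str, key: bytes) -> bool:
    """Check the tag in constant time; False means altered or wrong key."""
    return hmac.compare_digest(sign_record(record, key), tag)


shared_key = b"example-key-agreed-out-of-band"  # hypothetical shared secret
record = {"service": "blast", "input": "seq-1", "output": "hits-1"}
tag = sign_record(record, shared_key)

tampered = dict(record, output="hits-2")
```

Serialising with `sort_keys=True` matters: both sides must hash exactly the same bytes for the tags to match.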
  • Slide 66
  • Towards Trust
  • Slide 67
  • Using the provenance of data, trust metrics of the data can be derived from: trust the user places in invoked services; trust the user places in the input data; trust the user places in the enacted workflow; trust the user places in the enactor; trust the user places in the provenance service.
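The slide lists the ingredients of a trust metric but not how to combine them. Purely as an illustration, the sketch below multiplies per-factor trust values in the range 0 to 1, so that distrust in any one factor lowers trust in the result. The multiplication rule, the function name and the factor keys are assumptions, not something the talk specifies.

```python
from typing import Dict


def combined_trust(trust: Dict[str, float]) -> float:
    """Multiply per-factor trust values, each expected in [0, 1]."""
    score = 1.0
    for name, value in trust.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {name!r} must be between 0 and 1")
        score *= value
    return score


# Hypothetical values a user might assign to each factor from the list above.
user_trust = {
    "invoked_services": 1.0,
    "input_data": 0.5,
    "enacted_workflow": 1.0,
    "enactor": 1.0,
    "provenance_service": 0.5,
}
score = combined_trust(user_trust)
```

Other combinations, such as taking the minimum or a weighted average, would fit the same list of factors; the provenance record is what tells the user which services, inputs and enactor were actually involved.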
  • Slide 68
  • The purpose of project PASOA is to investigate provenance in Grid architectures. Funded by EPSRC under the fundamental computer science for e-Science call. In collaboration with Cardiff. www.pasoa.org
  • Slide 69
  • Conclusion: Provenance is a rather unexplored domain. Strategic to bring trust in open environments. Necessity to design a configurable architecture capable of supporting multiple requirements from very different application domains. Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.
  • Slide 70
  • Publications: [SM03] Martin Szomszor and Luc Moreau. Recording and reasoning over data provenance in web and grid services. In International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer Science, pages 603-620, Catania, Sicily, Italy, November 2003. [MCS+03] Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based trust for grid computing - position paper. 2003. [GGS+03] Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren Marvin, Luc Moreau, and Tom Oinn. Provenance of e-science experiments - experience from bioinformatics. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham, UK, September 2003.
  • Slide 71
  • Acknowledgements: The myGrid Southampton Team: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne; Mark Greenwood, Carole Goble, Manchester; Martin Szomszor, Southampton; Syd Chapman, IBM; Omer Rana, Cardiff; Andreas Schreiber and Rolf Hempel, DLR; Laszlo Varga, SZTAKI; Ulises Cortes and Steven Willmott, UPC
  • Slide 72
  • www.mygrid.org.uk