Provenance and Workflows. Yolanda Gil, USC/ISI, gil@isi.edu. March 6, 2015


TRANSCRIPT

  • Slide 1

Provenance and Workflows
Yolanda Gil, USC/ISI
gil@isi.edu
March 6, 2015
http://en.wikipedia.org/wiki/Certificate_of_origin#mediaviewer/File:Coal_from_the_Titanic.jpg
http://commons.wikimedia.org/wiki/File:The_seal_of_National_Taiwan_University.png
https://www.flickr.com/photos/alterschwede08/3203630740/ (CC BY-ND 2.0)

  • Slide 2

Provenance in Science

  • Slide 3

Goals Today
1. Understand what provenance is in a scientific article
2. Understand how to provide proper provenance in an article
https://www.flickr.com/photos/vizzzual-dot-com/2655969483/

  • Slide 4

Reproducibility Today: Can We Afford It?
Typically:
Paper in PDF
Datasets may be available, often undocumented
Analytic steps are described at a high level
Not all details of analysis are published
Often kept in notes, emails
Communication with author
Research assistants leave lab
Effort required deters many from attempting replication

  • Slide 5

Reproducibility Today: A Fishing Expedition
Reconstruct the pipeline
Software installation
Data transformations
Access the data
Often not documented
Feels like extracting from a vault
Costly and iterative process

  • Slide 6

Reproducibility: From Costly to Impractical
Data processing, however, is often not described well enough to allow for exact reproduction of the results, leading to exercises in forensic bioinformatics where aspects of raw data and reported results are used to infer what methods must have been employed. [Baggerly and Coombes 09]
Analysis of 18 quantitative papers published in Nature Genetics in the past two years found that reproducibility was not achievable even in principle in 10 cases, even when datasets are published [Ioannidis et al 09]

  • Slide 7

Reproducibility: An Absolute Necessity
Avoiding errors:
Unfortunately, poor documentation and irreproducibility can shift from an inconvenience to an active danger when it obscures not just methods but errors.
This can lead to scenarios where well-meaning investigators argue in good faith for treating patients with apparently promising drugs that are in fact ineffective or even contraindicated. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common. [Baggerly and Coombes 09]

  • Slide 8

NSF Workshop on Challenges of Scientific Workflows
Despite investments in computing infrastructure as an enabler of a significant paradigm change in science:
Exponential growth in compute, sensors, data storage, network
BUT the growth of science is not following the same exponential
Human bottleneck:
Perceived importance of capturing and sharing process in accelerating pace of scientific advances
Process (method/protocol) is increasingly complex and highly distributed
Reproducibility, a cornerstone of the scientific method, is difficult
Workflows need to be first-class citizens in science
http://www.isi.edu/nsf-workflows06

  • Slide 9

Workflows as Representations of Processes
1. Workflows of human activities
   E.g., checking patient in hospital
   E.g., manually extracting and verifying findings from the literature
2. Workflows of services
   E.g., integration of business services
   E.g., accessing databases in biology
3. Computational workflows
   E.g., document classification

  • Slide 10

Semantic Workflows in WINGS
http://www.wings-workflows.org

  • Slide 11

Reuse of Workflow Fragments
[Sethi et al MM13]

  • Slide 12

Measuring Reproducibility Effort in Reproducibility Maps
2 months of effort in reproducing published method (in PLoS10)
Authors' expertise was required
Comparison of ligand binding sites
Comparison of dissimilar protein structures
Graph network generation
Molecular Docking
[Garijo et al PLOS CB13]

  • Slide 13

Provenance of Articles
Typical Published Article
   Text: Narrative of method, software packages used
   Data: Key datasets and figures/plots
   NOT published, loosely recorded:
   Software/Workflow: scripted codes + manual steps + notes/emails
Article with full provenance
   Text: Narrative of method, software packages used
   Software: Scripts and codes
   Workflow: Workflow/scripts with dataflow, codes, and parameters
   Data: Key datasets and figures/plots

  • Slide 14

What is Provenance?
Provenance covers:
1. Processes
2. Documents (resources)
3. Entities
http://www.thestaffingstream.com/2012/08/06/the-buzz-about-talent-communities/

  • Slide 15

1) Provenance as Process: Workflows
Computational workflows represent dataflow dependencies across individual computations

  • Slide 16

2) Provenance as Documents

  • Slide 17

3) Provenance as Entities
Ex: NY Times article from REUTERS reporting:
At a press conference last Monday, Buckingham Palace was adamant that Prince Larry did not inhale.

  • Slide 18

A Working Definition of Provenance
Is provenance = metadata? Or = trust? Or = authentication?
Provenance can be seen as metadata, but not all metadata is provenance
Provenance provides a substrate for deriving different trust metrics
Provenance records can be used to verify and authenticate, among other uses
Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource.
Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility.
Notice:
Provenance assertions can have their own provenance
Inference is useful if provenance records are incomplete/erroneous
There may be alternative accounts of provenance of the same resource
http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance

  • Slide 19

A Well-Known Provenance Vocabulary: The Dublin Core
http://dublincore.org/documents/dcq-rdf-xml/
From library sciences
http://dublincore.org/documents/dcmi-terms/

  • Slide 20

A Provenance Standard for the Web: W3C PROV
http://www.w3.org/TR/prov-primer/

  • Slide 21

Provenance of Articles
Typical Published Article
   Text: Narrative of method, software packages used
   Data: Key datasets and figures/plots
   NOT published, loosely recorded:
   Software/Workflow: scripted codes + manual steps + notes/emails
Article with full provenance
   Text: Narrative of method, software packages used
   Software: Scripts and codes
   Workflow: Workflow/scripts with dataflow, codes, and parameters
   Data: Key datasets and figures/plots

  • Slide 22

Execution vs Method

  • Slide 23

Describing Workflows at Different Levels of Abstraction
METHODS: Term Weighting, Correlation Scoring
ALGORITHMS: TF-IDF, Chi Squared
IMPLEMENTATIONS: Java code, R code

  • Slide 24

Developing Workflows
What the paper may describe
What the authors really did
Feature selection

  • Slide 25

From a Workflow Sketch to a Formal Workflow
Sketch
Formal workflow

  • Slide 26

Code for Preparing Data
Scientists and engineers spend more than 60% of their time just preparing the data for model input or data-model comparison (NASA A40)
[Garijo et al FGCS13]

  • Slide 27

Goals Today
1. Understand what provenance is in a scientific article
2. Understand how to provide proper provenance in an article
https://www.flickr.com/photos/vizzzual-dot-com/2655969483/

  • Slide 28

Suggested Approach
1. Describe the workflow in text
   Data + software + workflow
   Specify unique identifiers for data and software, versions, credit all sources
2. Develop a workflow sketch
   Capture high-level dataflow across components
3. Specify the formal workflow
   Command lines + parameter values
   Dataflow across components
   Options:
   1. Describe it as a graph
   2. Use the PROV standard
   3. Use a workflow system

  • Slide 29

How to show provenance in your article?
Describe your workflow in text
   In a separate Methods section
Include your workflow sketch
   As a figure in the article
Publish your formal workflow and assign a unique identifier
   Cite it in the paper
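
Illustrative sketch (not part of the original slides): the "formal workflow" step of the Suggested Approach could be written down as a small graph of components with command lines, parameter values, and dataflow, and then exported as statements in the style of W3C PROV-N. Everything named below is a hypothetical assumption made for illustration: the `Step` class, the step, dataset, and file names, the `min_df` parameter, and the `ex` prefix. The pairing of TF-IDF with a Java program and Chi Squared with an R script is also assumed. The output only imitates PROV-N and has not been validated against the standard.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    """One workflow component: a command line plus its dataflow."""
    name: str
    command: str                 # hypothetical command line
    inputs: tuple                # names of datasets consumed
    outputs: tuple               # names of datasets produced
    params: tuple = ()           # (name, value) pairs


# Hypothetical two-step workflow; all identifiers are illustrative only.
STEPS = [
    Step("term_weighting", "java -jar tfidf.jar",
         inputs=("documents_v1",), outputs=("weighted_terms",),
         params=(("min_df", "2"),)),
    Step("correlation_scoring", "Rscript chi_squared.R",
         inputs=("weighted_terms", "labels_v1"), outputs=("term_scores",)),
]


def execution_order(steps):
    """Derive a valid run order from the dataflow between components."""
    producer = {out: s.name for s in steps for out in s.outputs}
    deps = {s.name: {producer[i] for i in s.inputs if i in producer}
            for s in steps}
    return list(TopologicalSorter(deps).static_order())


def prov_statements(steps, prefix="ex"):
    """Emit PROV-N style statements (assumed formatting, not validated)."""
    entities = []
    for s in steps:
        for item in s.inputs + s.outputs:
            if item not in entities:
                entities.append(item)
    lines = [f"entity({prefix}:{e})" for e in entities]
    for s in steps:
        attrs = (("command", s.command),) + s.params
        attr_text = ", ".join(f'{prefix}:{k}="{v}"' for k, v in attrs)
        lines.append(f"activity({prefix}:{s.name}, [{attr_text}])")
        lines += [f"used({prefix}:{s.name}, {prefix}:{i})" for i in s.inputs]
        lines += [f"wasGeneratedBy({prefix}:{o}, {prefix}:{s.name})"
                  for o in s.outputs]
    return lines


if __name__ == "__main__":
    print(" -> ".join(execution_order(STEPS)))
    print("\n".join(prov_statements(STEPS)))
```

A record like this, published with a unique identifier, is the kind of artifact the last slide suggests citing in the paper. A workflow system would normally generate it rather than the author writing it by hand.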