cis 602: provenance & scientific data management...

46
CIS 602, Fall 2014 CIS 602: Provenance & Scientific Data Management Provenance Dr. David Koop

Upload: others

Post on 09-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

CIS 602: Provenance & Scientific Data Management

Provenance

Dr. David Koop

Page 2: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz

2

Page 3: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz1.Which of the following best describes provenance?

(a) The process of verifying that code performs as it is supposed to. (b) The process description (or sequence of steps) that, together

with input data and parameters, led a data product’s creation. (c) A well-defined language for specifying complex tasks from

simpler ones. (d) A mechanism that captures operating-system traces via a

filesystem interface or a system call tracer.

3

Page 4: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz1.Which of the following best describes provenance?

(a) The process of verifying that code performs as it is supposed to. (b)The process description (or sequence of steps) that,

together with input data and parameters, led a data product’s creation.

(c) A well-defined language for specifying complex tasks from simpler ones.

(d) A mechanism that captures operating-system traces via a filesystem interface or a system call tracer.

4

Page 5: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz2.Retrospective provenance captures:

(a) A computational task’s specification (e.g. the workflow or script).(b) The semantics of a set of tasks (e.g. in a biological language).(c) The steps executed as well as information about the

environment used to derive a specific data product.(d) The user views of provenance, enabled via abstractions.

5

Page 6: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz2.Retrospective provenance captures:

(a) A computational task’s specification (e.g. the workflow or script).(b) The semantics of a set of tasks (e.g. in a biological language).(c) The steps executed as well as information about the

environment used to derive a specific data product.(d) The user views of provenance, enabled via abstractions.

6

Page 7: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz3.Three key components of a provenance management solution

discussed in the paper are:(a) Capture mechanisms, provenance models, and methods for

storing, accessing, and querying provenance. (b) Workflow systems, workflow evolution, and capture

mechanisms.(c) Workflow systems, OS-based systems, and Process-based

systems.(d) Provenance models, layered provenance, and normalized

representation.

7

Page 8: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz3.Three key components of a provenance management solution

discussed in the paper are:(a) Capture mechanisms, provenance models, and methods for

storing, accessing, and querying provenance. (b) Workflow systems, workflow evolution, and capture

mechanisms.(c) Workflow systems, OS-based systems, and Process-based

systems.(d) Provenance models, layered provenance, and normalized

representation.

8

Page 9: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz4.Provenance can be stored in:

(a) Relational databases(b) Text files(c) XML databases(d) All of the above(e) (b) and (c) only

9

Page 10: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Quiz4.Provenance can be stored in:

(a) Relational databases(b) Text files(c) XML databases(d)All of the above(e) (b) and (c) only

10

Page 11: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Response Example• http://www.cis.umassd.edu/~dkoop/cis602/response.html

11

Page 12: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Presentation

12

Page 13: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance for Computational Tasks: A Survey

J. Freire, D. Koop, E. Santos, C. Silva

Presented by: David Koop

Page 14: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

What is Provenance?• Dictionary: “the source or origin of an object; its history and

pedigree; a record of the ultimate derivation and passage of an item through its various owners.”

• Focus on causality–the sequence of steps that detail how a result was generated

• Provenance itself is data, this list of steps along with metadata for each step: when it occurred, who initiated it, notes about it

• Can be used to preserve information about an experiment and to answer many questions

14

Page 15: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance in Science• Provenance is as (or more)

important as the result!• Old solution:

- Lab notebooks where detailed notes are kept

• New problems:- Large volumes of data- Complex analyses- Writing notes doesn’t scale

• Let computers handle this automatically!

15

[DNA Recombination, Lederberg]

Page 16: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance in Science• Provenance is as (or more)

important as the result!• Old solution:

- Lab notebooks where detailed notes are kept

• New problems:- Large volumes of data- Complex analyses- Writing notes doesn’t scale

• Let computers handle this automatically!

15

Date

Annotations

Observed Data

[DNA Recombination, Lederberg]

Page 17: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Questions about provenance• How does one capture provenance?• How does one manage provenance for later use?

- Answer questions like:• Who create this data product?• Who modified this data file?• What process was used to generate this data file?• Was the same data used to generate more than one result?

• What are the approaches and their tradeoffs?

16

Page 18: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Related Work• Bose and Frew: “Lineage Retrieval for Scientific Data Processing: A

Survey”• Simmhan, Plale, and Gannon: “A Survey of Data Provenance in E-

Science”• Tan, “Provenance in Databases: Past, Current, and Future”

17

Page 19: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance Management• Provenance can be generated from tasks/programs/scripts/etc.• Properties of provenance is related to the computational model

- a specific application with a graphical interface- a script that automates the use of several command-line tools- a scientific workflow that combines several tools

18

Page 20: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

MAY/JUNE 2008 21

tems in wide use that illustrate different solutions. Applications that use provenance appear in other articles in this special issue. The problem of man-aging fine-grained provenance recorded for items in a database is out of scope for this survey—a de-tailed overview appears elsewhere3.

Provenance Management: An OverviewBefore discussing the specific trade-offs among provenance systems and models, let’s examine the general aspects of provenance management. Spe-cifically, let’s explore the methods for modeling computational tasks and the types of provenance information we can capture from these tasks. To illustrate these themes, we use a scenario that’s common in visualizing medical data—the cre-ation of multiple visualizations of a volumetric computer tomography (CT) data set.

Modeling Computational TasksTo allow reproducibility, we can represent compu-tational tasks with a variety of mechanisms, includ-ing computer programs, scripts, and workflows, or

construct them interactively by using specialized tools (such as ParaView [www.paraview.org] for scientific visualization and GenePattern [www.broad.mit.edu/cancer/software/genepattern] for biomedical research). Some complex computa-tional tasks require weaving tools together, such as loosely coupled resources, specialized libraries, or Web services. To analyze a CT scan’s results, for example, we might need to preprocess data with different parameters, visualize each result, and then compare them. To ensure the reproducibility of the entire task, it’s beneficial to have a description that captures these steps and the parameter values used. One approach is to order computational processes and organize them into scripts; the session log in-formation that some software tools expose can also help document and reproduce results. However, these approaches have shortcomings—specifically, the user is responsible for manually checking-in in-cremental script changes or saving session log files. Moreover, the saved information often isn’t in an easily queried format.

Recently, workflows and workflow-based sys-tems have emerged as an alternative to these

(b)

(a)

(c)

Read File

ExtractIsosurface

Display onScreen

import vtk

data = vtk.vtkStructuredPointsReader()data.setFileName("../../../examples/data/head.120.vtk")

contour = vtk.vtkContourFilter()contour.SetInput(0, data.GetOutput())contour.SetValue(0, 67)

mapper = vtk.vtkPolyDataMapper()mapper.SetInput(contour.GetOutput())mapper.ScalarVisibilityOff()

actor = vtk.vtkActor()actor.SetMapper(mapper)

cam = vtk.vtkCamera()cam.SetViewUp(0,0,-1)cam.SetPosition(745,-453,369)cam.SetFocalPoint(135,135,150)cam.ComputeViewPlaneNormal()

ren = vtk.vtkRenderer()ren.AddActor(actor)ren.SetActiveCamera(cam)ren.ResetCamera()

renwin = vtk.vtkRenderWindow()renwin.AddRenderer(ren)

style = vtk.vtkInteractorStyleTrackballCamera()iren = vtk.vtkRenderWindowIneractor()iren.SetRenderWindow(renwin)iren.SetInteractorStyle(style)iren.Initialize()iren.Start()

123456789101112131415161718192021222324252627282930313233343536

vtkStructuredPointsReader

vtkCountourFilter

vtkDataSetMapper

vtkActor

vtkRenderer

vtkCamera

VTKCell

Figure 1. Different abstractions for a data flow. (a) A Python script containing a series of Visualization Toolkit (VTK) calls; (b) a workflow that produces the same result as the script; and (c) a simplified view of the workflow that hides some of its details.

Abstraction

19

The amount of abstraction (script, workflow, abstract workflow) impacts provenance

Page 21: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Prospective and Retrospective Provenance

20

• Example: Baking a Cake• Prospective Provenance (Recipe):

1. Gather ingredients (3/4 cup butter, 3/4 cocoa, 3/4 cup flour, ...)2. Preheat oven to 350 degrees3. Grease cake pan4. Mix wet ingredients in large bowl5. Mix dry ingredients in a separate bowl6. Add dry mixture to wet mixture7. Pour batter into cake pan8. Put pan in the oven and bake for 30 minutes9. Take cake out of oven and let it cool

Page 22: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Prospective and Retrospective Provenance• Retrospective Provenance (What actually happened)

1. Went to store to buy butter2. Gathered ingredients (3/4 cup butter, 3/4 cocoa, 1 cup flour, ...)3. Greased cake pan4. Preheated oven to 350 degrees5. Mixed wet ingredients in large bowl6. Mixed dry ingredients in a separate bowl7. Added wet mixture to dry mixture8. Poured batter into cake pan9. Put pan in the oven and baked for 35 minutes10.Took cake out of oven and let it cool for 10 minutes

21

Page 23: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Prospective and Retrospective Provenance• Prospective provenance is what was specified/intended

- a workflow, script, list of steps• Retrospective provenance is what actually happened

- actual data, actual parameters, errors that occurred, timestamps• Do not need prospective provenance to have retrospective

provenance!• Retrospective provenance the same type of information as

prospective plus more• Could have multiple retrospective provenance traces for one

prospective provenance listing

22

Page 24: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance & Causality• Knowing what data/steps influenced other data/steps is important!• Data dependencies: this output file depended on this input file• Data-process dependencies: this output figure depended on these

processes• Causality can often be represented as a graph where connections

represent dependencies

23

Page 25: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

User-defined provenance• Goal: capture lots of provenance automatically based on what

steps are executed• Problem: not everything can be captured automatically• Annotations offer ability to keep notes about processes• Users might also specify known causal links that cannot be

automatically determined (e.g. a step depends on three system files that were not specified as inputs in the workflow)

24

Page 26: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance Management Solution• What is needed to capture, store, and use provenance?

1.Capture mechanism2.Model for representing provenance3.Tools to store, query, and analyze provenance

25

Page 27: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance Capture Mechanisms• Workflow-based

- Since workflow execution is controlled, keep track of all the workflow modules, parameters, etc. as they are executed

• Process-based - Each process is required to write out its own provenance

information (not centralized like workflow-based)• OS-based

- The OS or filesystem is modified so that any activity it does it monitored and the provenance subsystem organizes it

• Tradeoffs:- Workflow- and process-based have better abstraction, OS-based

requires minimal user effort once installed and can capture “hidden dependencies”

26

Page 28: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance Models• How provenance is represented (more abstract than the details of

how it is actually stored)• PROV (W3C Standard) has different storage backends for

provenance but all of it conforms to the same model• Model the objects involved and their relationships (e.g. activities,

dependencies)• Interoperability is a concern

- Why? May use multiple tools/techniques to achieve a result, want to analyze the entire provenance chain

- 2nd Provenance Challenge noted some efforts to integrate provenance from different sources

27

Page 29: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Layered Provenance• As with relational databases, want to normalize provenance to

minimize redundant information• Example: Don’t store workflow specification each time that

workflow is executed–store it once and reference it• Also allow different layers for different aspects of provenance

28

24 COMPUTING IN SCIENCE & ENGINEERING

proaches require processes to be wrapped—in the former, so that the workflow engine can invoke them, and in the latter, so that instrumentation can capture and publish provenance information.

Because workflow systems have access to work-flow definitions and control their execution, they can capture both prospective and retrospective provenance. OS- and process-based mechanisms only capture retrospective provenance: they must reconstruct causal relationships through prov-enance queries. The ES3 system (http://eil.bren.ucsb.edu), for example, monitors the interactions between arbitrary applications and their environ-ments (via arguments, file I/O, system, and calls), and then uses this information to assemble a prov-enance graph to describe what actually happened during execution.6

In fact, by capturing provenance at the OS level, we can record detailed information about all system calls and files touched during a task’s execution. This forms a superset of the information captured in workflow- and process-based systems, whose granularity is determined by the wrapping provid-ed for individual processes. Consider, for example, a command-line tool integrated in a workflow sys-tem that creates and depends on temporary files not explicitly defined in its wrapper. The causal depen-dencies the workflow system captures won’t include the temporary files, but we can capture these de-pendencies at the OS level. However, because even simple tasks can lead to a large number of low-level calls, the amount of provenance that OS-based ap-proaches record can be prohibitive, making it hard to query and reason about the information.7

Provenance ModelsResearchers have proposed several provenance models in the literature.9,10,12 All these models support some form of retrospective provenance, and most of those that workflow systems use pro-vide the means to capture prospective provenance. Many of the models also support annotations.

Although these models differ in several ways, including their use of structures and storage strat-egies, they all share an essential type of informa-tion: process and data dependencies. In fact, a recent exercise to explore interoperability issues among provenance models showed that it’s possible to integrate information that conform to different provenance models (http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge).

Despite a base commonality, provenance mod-els tend to vary according to domain and user needs. Even though most models strive to store general concepts, specific use cases often influ-ence model design—for example, Taverna was de-veloped to support the creation and management of workflows in the bioinformatics domain, and therefore provides an infrastructure that includes support for ontologies available in this domain. VisTrails was designed to support exploratory tasks in which workflows are iteratively refined, and thus uses a model that treats workflow speci-fications as first-class data products and captures the provenance of workflow evolution.

Because the provenance information a model must represent varies both by type and specificity, it’s advantageous to structure a model as a set of layers to enable a normalized, configurable repre-sentation. The ability to represent provenance at different levels of abstraction also leads to simpler queries and more intuitive results. Consider the REDUX system,16 which uses the layered model depicted in Figure 3. The first layer corresponds to an abstract description of a workflow, in which each module corresponds to a class of activities. This ab-stract description is bound to specific services and data sets defined in the second layer—for example, in the workflow shown in Figure 1, the abstract activity extract isosurface is bound to a call to the vtkContourFilter—a specific implemen-tation of isosurface extraction provided by VTK. The third layer captures information about input data and parameters supplied at runtime, and the fourth layer captures operational details, such as the workflow execution’s start and end time.

Structuring provenance information into mul-tiple layers leads to a normalized representation that avoids the storage of redundant information. Some models, for example, store a workflow’s

Figure 3. Layered provenance models. For REDUX, the first layer corresponds to an abstract description, the second layer describes the binding of specific services and data to the abstract description, the third layer captures runtime inputs and parameters, and the final layer captures operational data. Other models use layers in different ways. The top-layer in VisTrails captures provenance of workflow evolution, and Pegasus uses an additional layer to represent the workflow execution plan over grid resources.

Page 30: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Storing Provenance• Files, relational databases, XML databases, RDF (linked data)• Log files are good for preserving data but can be bad to query or

analyze• Relational databases are great for column-specific queries but can

be bad for dependency queries• XML databases are more portable than relational databases but are

usually less efficient for queries• RDF triples are better for dependencies and integrating domain-

specific knowledge but can be slower

29

Page 31: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Querying Provenance• Query methods are often tied to storage• SQL, XQuery, Prolog, SPARQL, ...

30

26 COMPUTING IN SCIENCE & ENGINEERING

ate views of provenance data would benefit OS- and process-based provenance models as well.

The ability to query a computational task’s prov-enance also enables knowledge reuse. By querying a set of tasks and their provenance, users can not only identify suitable tasks and reuse them, but also compare and understand differences between different tasks. Provenance information is often associated with data products (such as images or graphs), so this data helps users pose structured queries over unstructured data as well.

A common feature across many approaches to querying provenance is that their solutions are closely tied to the storage models used. Hence, they require users to write queries in languages such as SQL,16 Prolog,20 and SPARQL.10,11 Although such general languages are useful to those already famil-iar with their syntax, they weren’t designed specifi-cally for provenance, which means simple queries can be awkward and complex to write. Figure 5 compares three representations of a single query in the First Provenance Challenge that asked for tasks

using a specific module (Align Warp) with given parameters executed on a Monday. The VisTrails approach uses a language specifically designed to query workflows and their provenance, whereas REDUX and myGrid use native languages for their storage choices. Because the VisTrails lan-guage abstracts details about physical storage, it leads to much more concise queries.

However, even queries that use a language designed for provenance are likely to be too complicated for many users because provenance contains structural information represented as a graph. Thus, text-based query interfaces effec-tively require a subgraph query to be encoded as text. The VisTrails query-by-example (QBE) in-terface (see Figure 6) addresses this problem by letting users quickly construct expressive que-ries using the same familiar interface they use to build workflow.21 The query’s results are also displayed visually.

Some provenance models use Semantic Web technology both to represent and query provenance

VisTrails

REDUX

MyGrid

SELECT Execution.ExecutableWorkflowId, Execution.ExecutionId, Event.EventId, ExecutableActivity.ExecutableActivityIdfrom Execution, Execution_Event, Event, ExecutableWorkflow_ExecutableActivity, ExecutableActivity, ExecutableActivity_Property_Value, Value, EventType as ETwhere Execution.ExecutionId=Execution_Event.ExecutionId and Execution_Event.EventId=Event.EventId and ExecutableActivity.ExecutableActivityId=ExecutableActivity_Property_Value.ExecutableActivityId and ExecutableActivity_Property_Value.ValueId=Value.ValueId and Value.Value=Cast('-m 12' as binary) and ((CONVERT(DECIMAL, Event.Timestamp)+0)%7)=0 and Execution_Event.ExecutableWorkflow_ExecutableActivityId= ExecutableWorkflow_ExecutableActivity.ExecutableWorkflow_ExecutableActivityIdand ExecutableWorkflow_ExecutableActivity.ExecutableWorkflowId=Execution.ExecutableWorkflowIdand ExecutableWorkflow_ExecutableActivity.ExecutableActivityId=ExecutableActivity.ExecutableActivityIdand Event.EventTypeId=ET.EventTypeId and ET.EventTypeName='Activity Start';

wf{*}: x where x.module='AlignWarp' and x.parameter('model')='12' and (log{x}: y where y.dayOfWeek='Monday')

SELECT ?pwhere (?p <http://www.mygrid.org.uk/provenance#startTime> ?time) and (?time > date)using ns for <http://www.mygrid.org.uk/provenance#> xsd for <http://www.w3.org/2001/XMLSchema#>

SELECT ?p where <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0>(?p <http://www.mygrid.org.uk/provenance#runsProcess> ?processname . ?p <http://www.mygrid.org.uk/provenance#processInput> ?inputParameter .?inputParameter <ont:model> <ontology:twelfthOrder>) using ns for <http://www.mygrid.org.uk/provenance#> ont for <http://www.mygrid.org.uk/ontology#>

Figure 5. Provenance query implemented by three different systems. REDUX uses SQL, VisTrails uses a language specialized for querying workflows and their provenance, and myGrid uses SPARQL.

Page 32: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Querying Provenance• Graph-type queries can be problematic

- Could use graph database languages or provenance-specific languages

- Could use visual query-by-example style methods

31

MAY/JUNE 2008 27

information.10,11,15,22 Semantic Web languages such as RDF and OWL provide a natural way to model provenance graphs and the ability to represent com-plex knowledge, such as annotations and metadata. This technology has the potential to simplify in-teroperability across different provenance models, but it’s an open issue whether this technology scales to handle large provenance stores.

Provenance-Enabled SystemsNow let’s review a selection of different prove-nance-enabled systems. We don’t intend to provide a comprehensive guide to all existing systems—rather, we aim to describe a representative subset. Table 1 summarizes the various systems’ features.

Workflow-Based SystemsTaverna is a workflow system used in the myGrid project (www.mygrid.org.uk), whose goal is to le-verage Semantic Web technologies and ontologies available for bioinformatics to simplify data anal-ysis processes. In Taverna, the workflow engine is responsible for capturing provenance. It stores any prospective provenance information as Scufl specifications (an XML dialect) and retrospective provenance as RDF triples in a MySQL database.

Karma (www.extreme.indiana.edu/karma) was developed to support dynamic workflows in weather forecasting simulations, where the ex-

ecution path can change rapidly due to external events.14 Karma collects retrospective provenance in the form of a workflow trace; it also explic-itly models data products’ derivation history. Although Karma is a workflow-based system, in-dividual services that compose a workflow publish their own provenance to minimize performance overheads at the workflow-engine level (just as in process-based capture). Karma records the pub-lished provenance messages in a central database server: this information is exchanged between services and server as XML, but it’s translated into relational tuples before final storage. Karma also stores prospective provenance using the Busi-ness Process Execution Language.

The Kepler workflow system stores prospective provenance in the Modeling Markup Language (MoML) format,1 which is an XML dialect for representing workflows; it captures retrospective provenance by using a variation of MoML that omits irrelevant information, such as the coordi-nates for workflow modules drawn on the screen. Currently, it stores provenance in files; query sup-port is under development.

Pegasus (http://pegasus.isi.edu) is a workflow-mapping engine that, while taking into account resource efficiency and dynamic availability, au-tomatically maps high-level workflow specifica-tions into executable plans that run on distributed

vtkClipPolyData

VolumeRendering

SW

ClippingPlaneSW

IsosurfaceScript

Visual query Results

vtkStructurePointsReader

vtkCountourFilter

Visual query canvas

VolumeRendering

SW

ClippingPlaneSW

IsosurfaceScript

CombinedRendering

SW

ImageSlicesSW

Histogram

HistogramFile

Isosurface

vtkCamera

vtkInteractionHandler

vtkPlane

vtkColorTransferFunction

vtkRenderer

VTKCell

vtkVolumeRayCastMapper

vtkPolyDataMapper

vtkVolume

vtkVolumeProperty

vtkVolume

vtkImplicitPlaneWidget

vtkPiecewiseFunction

vtkActor

vtkProperty

vtkVolumeRayCastCompositeFunction

vtkClipPolyData vtkCamera

vtkInteractionHandler

vtkPlane

vtkColorTransferFunction

vtkRenderer

VTKCell

vtkVolumeRayCastMapper

vtkPolyDataMapper

vtkVolume

vtkVolumePropertyy

vtkVolume

vtkImplicitPlaneWidget

vtkPiecewiseFunction

vtkActor

vtkPropertyty

vtkVolumeRayCastCompositeFunction

vtkStructuredPointsReader

vtkCountourFilter

History view

Workflow view

Figure 6. Querying provenance by example. We can specify a query visually in the same interface used to construct workflows. Besides the structure defined by the set of modules and connections, we can also specify conditions for parameter values.

Page 33: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Provenance Overload• Can be difficult to look at large volumes of provenance• Try to abstract the provenance as with workflows (user views)

32

MAY/JUNE 2008 21

tems in wide use that illustrate different solutions. Applications that use provenance appear in other articles in this special issue. The problem of man-aging fine-grained provenance recorded for items in a database is out of scope for this survey—a de-tailed overview appears elsewhere3.

Provenance Management: An OverviewBefore discussing the specific trade-offs among provenance systems and models, let’s examine the general aspects of provenance management. Spe-cifically, let’s explore the methods for modeling computational tasks and the types of provenance information we can capture from these tasks. To illustrate these themes, we use a scenario that’s common in visualizing medical data—the cre-ation of multiple visualizations of a volumetric computer tomography (CT) data set.

Modeling Computational TasksTo allow reproducibility, we can represent compu-tational tasks with a variety of mechanisms, includ-ing computer programs, scripts, and workflows, or

construct them interactively by using specialized tools (such as ParaView [www.paraview.org] for scientific visualization and GenePattern [www.broad.mit.edu/cancer/software/genepattern] for biomedical research). Some complex computa-tional tasks require weaving tools together, such as loosely coupled resources, specialized libraries, or Web services. To analyze a CT scan’s results, for example, we might need to preprocess data with different parameters, visualize each result, and then compare them. To ensure the reproducibility of the entire task, it’s beneficial to have a description that captures these steps and the parameter values used. One approach is to order computational processes and organize them into scripts; the session log in-formation that some software tools expose can also help document and reproduce results. However, these approaches have shortcomings—specifically, the user is responsible for manually checking-in in-cremental script changes or saving session log files. Moreover, the saved information often isn’t in an easily queried format.

Recently, workflows and workflow-based sys-tems have emerged as an alternative to these

(b)

(a)

(c)

Read File

ExtractIsosurface

Display onScreen

import vtk

data = vtk.vtkStructuredPointsReader()data.setFileName("../../../examples/data/head.120.vtk")

contour = vtk.vtkContourFilter()contour.SetInput(0, data.GetOutput())contour.SetValue(0, 67)

mapper = vtk.vtkPolyDataMapper()mapper.SetInput(contour.GetOutput())mapper.ScalarVisibilityOff()

actor = vtk.vtkActor()actor.SetMapper(mapper)

cam = vtk.vtkCamera()cam.SetViewUp(0,0,-1)cam.SetPosition(745,-453,369)cam.SetFocalPoint(135,135,150)cam.ComputeViewPlaneNormal()

ren = vtk.vtkRenderer()ren.AddActor(actor)ren.SetActiveCamera(cam)ren.ResetCamera()

renwin = vtk.vtkRenderWindow()renwin.AddRenderer(ren)

style = vtk.vtkInteractorStyleTrackballCamera()iren = vtk.vtkRenderWindowIneractor()iren.SetRenderWindow(renwin)iren.SetInteractorStyle(style)iren.Initialize()iren.Start()

123456789101112131415161718192021222324252627282930313233343536

vtkStructuredPointsReader

vtkCountourFilter

vtkDataSetMapper

vtkActor

vtkRenderer

vtkCamera

VTKCell

Figure 1. Different abstractions for a data flow. (a) A Python script containing a series of Visualization Toolkit (VTK) calls; (b) a workflow that produces the same result as the script; and (c) a simplified view of the workflow that hides some of its details.

Page 34: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

28 COMPUTING IN SCIENCE & ENGINEERING

infrastructures such as the TeraGrid.11 Although Pegasus models prospective provenance using OWL, it captures retrospective provenance by using the Virtual Data System (VDS; a precursor of Swift) and then stores it in a relational database. Queries that span prospective and retrospective provenance must combine two different query languages: SPARQL and SQL.

REDUX extends the Windows Workflow Foundation engine to transparently capture the workflow execution trace. As discussed earlier, it uses a layered provenance model to normalize data and avoid redundancy. REDUX stores prov-enance data (both prospective and retrospective) in a relational database’s set of tables that can be queried with SQL. The system can also return an executable workflow as the result of a provenance query (for example, a query that requests all the steps used to derive a particular data product).

Swift (www.ci.uchicago.edu/swift) builds on and includes technology previously distributed as the GriPhyN VDS.23 The system combines a scripting language (SwiftScript) with a power-ful runtime system for the concise specification and reliable execution of large, loosely coupled computations. Swift specifies these computations as scripts, which the runtime system translates into an executable workflow. A launcher program invokes the workflow’s tasks, monitors the exe-cution process, and records provenance informa-tion, including the executable name, arguments, start time, duration, machine information, and exit status. Similar to VDS, Swift captures the relationships among data, programs, and com-

putations and uses this information for data and program discovery as well as for workflow sched-uling and optimization.

VisTrails is a workflow and provenance man-agement system designed to support exploratory computational tasks. An important goal of the VisTrails project is to build intuitive interfaces for users to query and reuse provenance infor-mation. Besides its QBE interface (which is built on top of its specialized provenance query lan-guage), VisTrails provides a visual interface to compare workflows side by side12 and a mecha-nism for refining workflows by analogy—users can modify workflows by example without hav-ing to directly edit their definitions.21 VisTrails internally represents prospective provenance as Python objects that can be serialized into XML and relations; it stores retrospective provenance in a relational database.

OS-Based Systems PASS (www.eecs.harvard.edu/syrah/pass) op-erates at the level of a shared storage system: it automatically records information about which programs are executed, their inputs, and any new files created as output. The capture mechanism consists of a set of Linux kernel modules that transparently record provenance—it doesn’t re-quire any changes to computational tasks. PASS also constructs a provenance graph stored as a set of tables in Berkeley DB. Users can pose prov-enance queries using nq, a proprietary tool that supports recursive searches over the provenance graph. As discussed earlier, the fine granularity

Table 1. Provenance-enabled systems.

System Capture mechanism Prospective provenanceRetrospective provenance Workflow evolution Storage Query support

Available as open source?

REDUX Workflow-based Relational Relational No Relational database management system (RDBMS)

SQL No

Swift Workflow-based SwiftScript Relational No RDBMS SQL Yes

VisTrails Workflow-based XML and relational Relational Yes RDBMS and files Visual query by example, specialized language

Yes

Karma Workflow- and process-based

Business Process Execution Language

XML No RDBMS Proprietary API Yes

Kepler Workflow-based MoML MoML variation Under development Files; RDBMS planned Under development Yes

Taverna Workflow-based Scufl RDF Under development RDBMS SPARQL Yes

Pegasus Workflow-based OWL Relational No RDBMS SPARQL for metadata and workflow; SQL for execution log

Yes

PASS OS-based N/A Relational No Berkeley DB nq (proprietary query tool) No

ES3 OS-based N/A XML No XML database XQuery No

PASOA/PreServ Process-based N/A XML No Filesystem, Berkeley DB XQuery, Java query API Yes

Provenance-Enabled Systems

33

Page 35: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

28 COMPUTING IN SCIENCE & ENGINEERING

infrastructures such as the TeraGrid.11 Although Pegasus models prospective provenance using OWL, it captures retrospective provenance by using the Virtual Data System (VDS; a precursor of Swift) and then stores it in a relational database. Queries that span prospective and retrospective provenance must combine two different query languages: SPARQL and SQL.

REDUX extends the Windows Workflow Foundation engine to transparently capture the workflow execution trace. As discussed earlier, it uses a layered provenance model to normalize data and avoid redundancy. REDUX stores prov-enance data (both prospective and retrospective) in a relational database’s set of tables that can be queried with SQL. The system can also return an executable workflow as the result of a provenance query (for example, a query that requests all the steps used to derive a particular data product).

Swift (www.ci.uchicago.edu/swift) builds on and includes technology previously distributed as the GriPhyN VDS.23 The system combines a scripting language (SwiftScript) with a power-ful runtime system for the concise specification and reliable execution of large, loosely coupled computations. Swift specifies these computations as scripts, which the runtime system translates into an executable workflow. A launcher program invokes the workflow’s tasks, monitors the exe-cution process, and records provenance informa-tion, including the executable name, arguments, start time, duration, machine information, and exit status. Similar to VDS, Swift captures the relationships among data, programs, and com-

putations and uses this information for data and program discovery as well as for workflow sched-uling and optimization.

VisTrails is a workflow and provenance man-agement system designed to support exploratory computational tasks. An important goal of the VisTrails project is to build intuitive interfaces for users to query and reuse provenance infor-mation. Besides its QBE interface (which is built on top of its specialized provenance query lan-guage), VisTrails provides a visual interface to compare workflows side by side12 and a mecha-nism for refining workflows by analogy—users can modify workflows by example without hav-ing to directly edit their definitions.21 VisTrails internally represents prospective provenance as Python objects that can be serialized into XML and relations; it stores retrospective provenance in a relational database.

OS-Based Systems PASS (www.eecs.harvard.edu/syrah/pass) op-erates at the level of a shared storage system: it automatically records information about which programs are executed, their inputs, and any new files created as output. The capture mechanism consists of a set of Linux kernel modules that transparently record provenance—it doesn’t re-quire any changes to computational tasks. PASS also constructs a provenance graph stored as a set of tables in Berkeley DB. Users can pose prov-enance queries using nq, a proprietary tool that supports recursive searches over the provenance graph. As discussed earlier, the fine granularity

Table 1. Provenance-enabled systems.

System Capture mechanism Prospective provenanceRetrospective provenance Workflow evolution Storage Query support

Available as open source?

REDUX Workflow-based Relational Relational No Relational database management system (RDBMS)

SQL No

Swift Workflow-based SwiftScript Relational No RDBMS SQL Yes

VisTrails Workflow-based XML and relational Relational Yes RDBMS and files Visual query by example, specialized language

Yes

Karma Workflow- and process-based

Business Process Execution Language

XML No RDBMS Proprietary API Yes

Kepler Workflow-based MoML MoML variation Under development Files; RDBMS planned Under development Yes

Taverna Workflow-based Scufl RDF Under development RDBMS SPARQL Yes

Pegasus Workflow-based OWL Relational No RDBMS SPARQL for metadata and workflow; SQL for execution log

Yes

PASS OS-based N/A Relational No Berkeley DB nq (proprietary query tool) No

ES3 OS-based N/A XML No XML database XQuery No

PASOA/PreServ Process-based N/A XML No Filesystem, Berkeley DB XQuery, Java query API Yes

MAY/JUNE 2008 29

of PASS’s capture mechanism often leads to very large volumes of provenance information; another limitation of this approach is that it’s restricted to local filesystems. It can’t, for example, track files in a grid environment.

ES3’s goal is to extract provenance information from arbitrary applications by monitoring their in-teractions with the execution environment.6 These interactions are logged to the ES3 database, which stores the information as provenance graphs, rep-resented in XML. ES3 currently supports a Linux plugin, which uses system call tracing to capture provenance. As in PASS, ES3 requires no changes to the underlying processes, but provenance cap-ture is restricted to applications that run on ES3-supported environments.

Process-Based Systems The Provenance-Aware Service Oriented Ar-chitecture (PASOA) project (www.pasoa.org) developed a provenance architecture that relies on individual services to record their own prov-enance.5 The system doesn’t model the notion of a workflow—rather, it captures assertions produced by services that reflect the relationships between the represented services and data. The system must infer the complete provenance of a task or data product by combining these assertions and recursively following the relationships they repre-sent. The PASOA architecture distinguishes the notion of process documentation—that is, the prove-nance recorded specifically about a process—from the notion of a data item’s provenance, which is de-rived from the process documentation. The PA-

SOA project developed an open source software package called PreServ that lets developers inte-grate process documentation recording into their applications. PreServ also supports multiple back end storage systems, including files and relational databases; users can pose provenance queries by using its Java-based query API or XQuery.

P rovenance management is a new area, but it is advancing rapidly. Researchers are actively pursuing several directions in this area, including the ability to in-

tegrate provenance derived from different systems and enhanced analytical and visualization mech-anisms for exploring provenance information. Provenance research is also enabling several new applications, such as science collaboratories, which have the potential to change the way people do sci-ence—sharing provenance information at a large scale exposes researchers to techniques and tools to which they wouldn’t otherwise have access. By exploring provenance information in a collabora-tory, scientists can learn by example, expedite their scientific work, and potentially reduce their time to insight. The “wisdom of the crowds,” in the context of scientific exploration, can avoid duplica-tion and encourage continuous, documented, and reproducible scientific progress.24

AcknowledgmentsThis work was partially supported by the US Nation-al Science Foundation, the US Department of Energy, and IBM faculty awards.

Table 1. Provenance-enabled systems.

System Capture mechanism Prospective provenanceRetrospective provenance Workflow evolution Storage Query support

Available as open source?

REDUX Workflow-based Relational Relational No Relational database management system (RDBMS)

SQL No

Swift Workflow-based SwiftScript Relational No RDBMS SQL Yes

VisTrails Workflow-based XML and relational Relational Yes RDBMS and files Visual query by example, specialized language

Yes

Karma Workflow- and process-based

Business Process Execution Language

XML No RDBMS Proprietary API Yes

Kepler Workflow-based MoML MoML variation Under development Files; RDBMS planned Under development Yes

Taverna Workflow-based Scufl RDF Under development RDBMS SPARQL Yes

Pegasus Workflow-based OWL Relational No RDBMS SPARQL for metadata and workflow; SQL for execution log

Yes

PASS OS-based N/A Relational No Berkeley DB nq (proprietary query tool) No

ES3 OS-based N/A XML No XML database XQuery No

PASOA/PreServ Process-based N/A XML No Filesystem, Berkeley DB XQuery, Java query API Yes

Provenance-Enabled Systems

34

Page 36: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Conclusions

35

• Provenance solutions have many benefits to scientists and engineers

• Capture mechanisms seem well-developed• Storage, interoperability, and querying methods are still being

improved• Different types of systems have different benefits and challenges• Sharing provenance can help other people better understand

exactly how an experiment was done or how an application works

Page 37: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Discussion• How might workflow-based and OS-based provenance systems be

integrated or used concurrently?• Are current data storage/query tools good enough for provenance

or do provenance-specific solutions need to be implemented?• What other uses of provenance (beyond causality, checking steps)

are there?

• Other questions?

36

Page 38: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Reading Presentation• Should be 15-25 minutes• Should cover the whole paper, explaining the material presented• Do not simply copy/quote the paper!• Look up references if something is unclear and needs to be

explained• Detail points of discussion, potential issues with the approach• Insert your own viewpoints into the presentation• If figures or images from the paper are useful, please include them

in the presentation- May also create your own diagrams if they help explain something

37

Page 39: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project• Many questions about examples of projects, language or system

requirements• Goal is to create something new relating to provenance, scientific

workflows, or scientific data management• No language requirement, but I need to be able to run the code on

a standard computer unless we agree upon something else• I will allow groups of 2, but I will expect more projects that are twice

as involved as individual projects. I will also expect equal participation from both group members in presentations as well as reports.

• Will require a proposal, progress report, and final report- I will give feedback about my impression of your project so you

know what is expected for the final project and report

38

Page 40: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas• Add meaningful provenance capture to an existing tool that

currently lacks it. Such a project could use a custom solution or capture provenance in an existing format (like PROV or using the VisTrails SDK). For example, you might use the VisTrails SDK to store and replay provenance from an application like a web browser. (http://www.vistrails.com/sdk.html)

39

Page 41: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas

40

Page 42: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas• Wrap an existing library for use in a scientific workflow system (e.g.

Kepler, VisTrails, Taverna) and create new useful workflows. You should then be able to create provenance by running those workflows.

41

Page 43: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas• Create a new technique for analyzing, visualizing, or querying

provenance using existing provenance (e.g. the ProvBench datasets: https://github.com/provbench). For example, you might design a new visualization technique for organizing similar provenance traces in an efficient manner.

42

Page 44: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas• Use a scientific or graph database to store and query information

that is used in a tool. For example, you might use neo4j to support a recommendation system for movies based on a social network.

43

Page 45: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Course Project Ideas• Some other project of your choice that relates to provenance,

workflows, scientific data management.

44

Page 46: CIS 602: Provenance & Scientific Data Management …dkoop/cis602-2014fa/lecture02.pdf2.Retrospective provenance captures: (a) A computational task’s specification (e.g. the workflow

CIS 602, Fall 2014

Next Class (9/11/2014)• New paper: Scientific Workflow Management and the Kepler

System (download from course web page)• Send Reading Response to [email protected] by 12pm

- Text or PDF (no Word documents, please)• Keep thinking about course project ideas

- Please stop by during office hours if you want to discuss a project idea

• I will be asking for preferences for reading presentations soon. I will email the class on the procedure.

45