Linking Prospective and Retrospective Provenance of Scripts

Saumen Dey, Khalid Belhajjame, David Koop, Meghan Raul, Bertram Ludäscher

Upload: khalid-belhajjame

Post on 14-Aug-2015



TaPP'15 2

Retrospective Provenance Is Useful … But Needs Abstraction

Scientists process and analyze their datasets using scripts.

Retrospective provenance information, e.g., as captured by noWorkflow [Murta et al., IPAW’14], can help scientists analyze script executions.

While useful, such fine-grained traces can contain an overwhelming amount of information for end users. There is a need for abstraction techniques that focus the attention of users on the provenance information relevant for their analyses.


Related Work

Workflow-Oriented Proposals

Zoom*UserViews [Biton et al., ICDE’08]: Workflow views are used for abstracting the retrospective provenance of (potentially complex) workflows.

LabelFlow [Alper et al., IPAW’14]: Both prospective and retrospective workflow provenance are summarized using user annotations and reduction rules.

Script-Oriented Proposals

YesWorkflow [McPhillips et al., TaPP’15]: As shown in the presentation by B. Ludäscher, a URI scheme can be used to reconstruct part of the retrospective provenance without recording it at run time. This scheme is useful when the data products used and generated, including intermediate ones, are stored within the file system.

We adopt an approach similar to Zoom*UserViews and apply it to scripts: we use the workflow specifications (i.e., prospective provenance) extracted by YesWorkflow to abstract the retrospective provenance captured by noWorkflow.


Approach

[Figure: YesWorkflow pipeline. Labels: Script, Annotate, Extract Workflow Specification, Workflow Description.]



Approach

[Figure: approach diagram with noWorkflow added. Labels: Script, Run, Retrospective Provenance (noWorkflow); Annotate, Extract Workflow Specification, Workflow Description (YesWorkflow).]

• variable(6, 93, row, 14, "[0.0]", 1430231173.397779).

• dependency(6, 34, 94, 93).


Approach

[Figure: complete approach diagram. Labels: Script, Run, Retrospective Provenance (noWorkflow); Annotate, Extract Workflow Specification, Workflow Description (YesWorkflow); Link*, Query Provenance, User.]

Link*: links retrospective and prospective provenance.


Linking noWorkflow Data Instances to YesWorkflow Variables

Variable names are not reliable IDs: two different variables may have the same name within a script, e.g., within the scope of different functions.

noWorkflow tracks line numbers in the script to identify the variable.

YesWorkflow does not provide this information. However, it provides the line numbers of the start and end of a block, when requested.

Using these two pieces of information, we are able to connect data values in noWorkflow to their corresponding YesWorkflow variables.
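The linking idea above can be sketched as a lookup of the line number reported for a variable within the line ranges of annotated blocks. This is only an illustrative sketch: the `Block` record, the `find_block` helper, and the sample line ranges are hypothetical and do not reflect the actual output formats of noWorkflow or YesWorkflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical record for an annotated block and its line range."""
    name: str
    start_line: int
    end_line: int
    variables: frozenset  # variable names declared for this block


def find_block(blocks, var_name, line):
    """Return the narrowest block that spans `line` and declares `var_name`."""
    candidates = [
        b for b in blocks
        if b.start_line <= line <= b.end_line and var_name in b.variables
    ]
    if not candidates:
        return None
    # Prefer the narrowest enclosing block when blocks are nested.
    return min(candidates, key=lambda b: b.end_line - b.start_line)


# Assumed sample data: two blocks that both use a variable named "row".
blocks = [
    Block("load_temperature", 5, 20, frozenset({"row", "path"})),
    Block("load_precipitation", 21, 40, frozenset({"row"})),
]

# Assumed observation: a value of "row" recorded at line 14 of the script.
match = find_block(blocks, "row", 14)
print(match.name if match else "no matching block")
```

The same name `row` resolves to different blocks depending on the line number, which is why the name alone is not enough.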


Annotating data values with YesWorkflow annotations

The variables in YesWorkflow are associated with user annotations.

Such annotations can be used to provide more information about the data values within the noWorkflow retrospective provenance.

Note that YesWorkflow allows specifying variables that are not mapped to any variable within the script.


Control Flow

[Figure labels: model, pastTemperatureData, pastPrecipitationData, simulatedWeather; model_1, Model_2, simulatedWeather.]


Data Dependencies

We found gaps in the noWorkflow dependency graph (when exported to Prolog):

• Function returns do not always link back to the correct variable.

• An object that is modified (e.g., via a list.append call) is not always captured in the dependency graph.

We also observed that noWorkflow does not capture the name of the script in the retrospective provenance. This is required for uniquely identifying variables.
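To make the two patterns concrete, the sketch below embeds a small hypothetical script containing a function return and a `list.append` call, and uses Python's standard `ast` module to locate the in-place append. The static scan is an assumption made for illustration only: it is neither how noWorkflow records dependencies nor the repair procedure used in this work.

```python
import ast

# Hypothetical script fragment showing the two patterns mentioned above.
SCRIPT = '''
def read_row(line):
    return [float(x) for x in line.split(",")]

rows = []
row = read_row("0.0,1.5")   # value flows back through a function return
rows.append(row)            # "rows" changes without being reassigned
'''


def find_inplace_appends(source):
    """Return (target, argument, line) for each `name.append(name)` call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "append"
            and isinstance(node.func.value, ast.Name)
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Name)
        ):
            found.append((node.func.value.id, node.args[0].id, node.lineno))
    return found


# Each result suggests a dependency edge: the target depends on the argument.
print(find_inplace_appends(SCRIPT))
```

Because `rows` is never on the left-hand side of an assignment after its creation, a tracker that only follows assignments would miss that it depends on `row`.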


Objects Within Objects

An object may be nested, i.e., it may have other objects as its children.

Any change to the child object is attributed to the parent object.

noWorkflow attributes any changes to any child object to the ultimate parent.
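The attribution rule can be pictured as following containment links upward until an object with no parent is reached. The sketch below is a minimal illustration under that reading; the `parent_of` mapping and the object names are hypothetical and are not part of noWorkflow.

```python
def ultimate_parent(obj_id, parent_of):
    """Follow parent links until an object with no parent is reached."""
    seen = set()
    while obj_id in parent_of:
        if obj_id in seen:
            raise ValueError("cycle in containment links")
        seen.add(obj_id)
        obj_id = parent_of[obj_id]
    return obj_id


# Assumed containment: "row" is inside "readings", which is inside "station".
parent_of = {"row": "readings", "readings": "station"}

# Under the rule above, a change to "row" would be attributed to "station".
print(ultimate_parent("row", parent_of))
```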


Framework


Dependency Repair


Provenance Integrator: Abstracting Retrospective Provenance

[Figure label: iDepAbs]


Example Query

Q1: Find the temperature file used.

fileName(N) :- map(V,A), A="temperatureDataFile", iData(V,N).

Q2: Show how the temperature file was used.

fileUsedBy(X,Y) :- iDepAbs(X,Y), map(X,A), A="temperatureDataFile".

fileUsedBy(X,Y) :- fileUsedBy(X,Z), iDepAbs(Z,Y).
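For readers less familiar with Datalog, the two queries can be mimicked in Python over sets of tuples. This is a rough sketch under assumptions: the facts below are invented, and the reading of `map(V,A)` as "variable V carries annotation A", `iData(V,N)` as "variable V holds value N", and `iDepAbs(X,Y)` as an abstracted dependency edge from X to Y is an interpretation of the rules, not something the slides define.

```python
def file_name(map_facts, idata_facts, annotation="temperatureDataFile"):
    """Q1: values N such that map(V, A), A = annotation, iData(V, N)."""
    annotated = {v for v, a in map_facts if a == annotation}
    return {n for v, n in idata_facts if v in annotated}


def file_used_by(map_facts, idep_abs, annotation="temperatureDataFile"):
    """Q2: pairs (X, Y) where Y is reachable from annotated X via iDepAbs."""
    sources = {v for v, a in map_facts if a == annotation}
    # Base rule: direct iDepAbs edges leaving an annotated variable.
    result = {(x, y) for x, y in idep_abs if x in sources}
    while True:
        # Recursive rule: fileUsedBy(X, Y) :- fileUsedBy(X, Z), iDepAbs(Z, Y).
        new = {(x, y) for x, z in result for z2, y in idep_abs if z == z2}
        if new <= result:
            return result  # fixed point reached
        result |= new


# Hypothetical facts; identifiers and file names are made up.
map_facts = {("v1", "temperatureDataFile"), ("v2", "precipitationDataFile")}
idata_facts = {("v1", "temp.csv"), ("v2", "precip.csv")}
idep_abs = {("v1", "v3"), ("v3", "v4"), ("v2", "v4")}

print(file_name(map_facts, idata_facts))
print(sorted(file_used_by(map_facts, idep_abs)))
```

The loop in `file_used_by` repeats the recursive rule until no new pairs appear, which is the usual fixed-point evaluation of a transitive-closure query.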


Conclusions

We have implemented an approach for linking the retrospective provenance of a script to a more abstract, user-defined prospective provenance.

This solution is complementary to the YesWorkflow solution for capturing retrospective provenance of data files [1].

The issues that we came across while using YesWorkflow and noWorkflow were communicated to the development teams of these tools so that they can be addressed.

Linking prospective and retrospective provenance for multiple scripts, as opposed to a single script: designers may organize the implementation of their data analyses into multiple scripts.

[1] T. M. McPhillips et al. Retrospective provenance without a run-time provenance recorder. In TaPP, 2015.
