Report on Provenance Challenge 3, at the PC3 meeting, 2009

Provenance Challenge 3 and Taverna

Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
Provenance Challenge 3 meeting
Amsterdam, June 10-11, 2009
Interpreting the challenge

1. Implement the challenge workflow
   ➡ Expose the same amount of data that is visible in the Trident version of the workflow
2. Answer the core (+optional) challenge queries
   ➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?
3. Produce and export OPM
   ➡ Export the smallest OPM graph that contains the query answer
   ➡ (the graph for the entire run is a special “degenerate” case)
4. Import and consume OPM
   ➡ Map an OPM graph to an instance of a TP causal graph
   ➡ Use TP queries to answer the same challenge queries as above
Our challenges

➡ Expose the same amount of data that is visible in the Trident version of the workflow:
   ➡ map the control flow to a pure dataflow model
➡ Is the current Taverna Provenance (TP) query model sufficient to answer the queries?
   ➡ the challenge queries gave us good new requirements
➡ Export the smallest OPM graph that contains the query answer
   ➡ OPM generation is now fully integrated into TP query answering
➡ Map an OPM graph to an instance of a TP causal graph
➡ Use TP queries to answer the same challenge queries as above
   ➡ TP query processing requires a representation of the originating workflow
   ➡ This required generating the “minimal plausible originating dataflow” (MPOD) for the OPM graph
The challenge workflow as a Taverna dataflow

Annotations on the workflow diagram:
- control links: “first LoadCSVFileIntoTable, then UpdateComputedColumns”
- data links
- this produces a list...
- ...these consume one item of the list at a time...
- the resulting iteration over the lists occurs automatically
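The implicit iteration noted above can be illustrated with a minimal Python sketch. It is not Taverna's implementation: the `depth` and `invoke` helpers and the two step functions are hypothetical names introduced here, and the sketch only assumes that a processor declares the list depth it expects and is called once per element when it receives a deeper value.

```python
from typing import Any, Callable, List


def depth(value: Any) -> int:
    """Nesting depth of a value: 0 for a single item, 1 for a list of items, and so on."""
    if isinstance(value, list):
        return 1 + (depth(value[0]) if value else 0)
    return 0


def invoke(processor: Callable[[Any], Any], value: Any, declared_depth: int = 0) -> Any:
    """Call `processor` once if the value is no deeper than declared, else once per element."""
    if depth(value) <= declared_depth:
        return processor(value)
    # The value is deeper than the processor expects: iterate over it implicitly.
    return [invoke(processor, item, declared_depth) for item in value]


# Hypothetical steps: one produces a list, the other consumes a single item.
def list_csv_files() -> List[str]:
    return ["a.csv", "b.csv", "c.csv"]


def load_csv_file_into_table(path: str) -> str:
    return f"loaded:{path}"


results = invoke(load_csv_file_into_table, list_csv_files())
```

Here `load_csv_file_into_table` is written for one file, yet `results` holds one entry per file, because the mismatch between the list produced upstream and the declared depth of 0 triggers the iteration.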
TP query model and new requirements

Annotations on the workflow diagram: trace the lineage of values observed here... ...at these points in the workflow

- queries are purely structural, no semantics
- queries can be answered only at the same level of granularity as the service encapsulation of the data in the workflow
- challenge query 1 (“detailed version”) seems to require more knowledge of data dependencies than can be obtained from this structural model
- the level of detection ID is not available unless the query black box in LoadCSVFileIntoTable is opened up
- i.e., CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,?,?,?,?)
Example: query 3

query.vars= LoadCSVFileIntoTable / LoadCSVFileIntoTableOutput / 1
query.processors=ALL
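To make the idea of a purely structural lineage query concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the TP query engine: the `lineage` function, the `UpstreamStep` processor and the `FileEntry` and `out` port names are hypothetical, input and output port names are assumed not to collide within a processor, and the iteration index (the trailing `1` in `query.vars`) is left out.

```python
from typing import Dict, List, Optional, Set, Tuple

Port = Tuple[str, str]  # (processor name, port name)


def lineage(start: Port,
            links: Dict[Port, Port],
            inputs: Dict[str, List[str]],
            processors: Optional[Set[str]] = None) -> List[Port]:
    """Walk upstream from `start` and report the ports visited.

    links:      maps an input port to the output port that feeds it (data links)
    inputs:     maps each processor to the names of its input ports
    processors: processors to report on; None plays the role of ALL
    """
    seen: Set[Port] = set()
    answer: List[Port] = []
    stack = [start]
    while stack:
        proc, port = stack.pop()
        if (proc, port) in seen:
            continue
        seen.add((proc, port))
        if processors is None or proc in processors:
            answer.append((proc, port))
        if port in inputs.get(proc, []):
            # Input port: follow the data link upstream, if there is one.
            upstream = links.get((proc, port))
            if upstream is not None:
                stack.append(upstream)
        else:
            # Output port: structurally it depends on every input of its processor.
            stack.extend((proc, p) for p in inputs.get(proc, []))
    return answer


# Hypothetical two-step dataflow ending in the variable named in the query.
inputs = {"UpstreamStep": [], "LoadCSVFileIntoTable": ["FileEntry"]}
links = {("LoadCSVFileIntoTable", "FileEntry"): ("UpstreamStep", "out")}

everything = lineage(("LoadCSVFileIntoTable", "LoadCSVFileIntoTableOutput"), links, inputs)
focused = lineage(("LoadCSVFileIntoTable", "LoadCSVFileIntoTableOutput"), links, inputs,
                  processors={"UpstreamStep"})
```

Because the traversal uses only ports and links, it treats every output as depending on every input of the same processor, which is the "structural, no semantics" limitation noted in the query model.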
Integrated OPM generation as part of TP query
➡ the answer to any TP query can be viewed as an OPM graph
➡ encoded as RDF/XML using the Tupelo provenance API.
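MPOD generation rules - I

MPOD = Minimal Plausible Originating Dataflow
- induced from an OPM graph
- the TP query model requires a workflow structure
- this is a first approximation...subject to refinement

Rules:
- A wasGeneratedBy(R) P ➡ processor P with output port R
- P used(R) A ➡ processor P with input port R

Example (diagram): an OPM graph over artifacts A1, A2, A3, A4 and processes P1, P2, P3, with edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5) into P1 and wgb(R6) into P2. The induced MPOD has processor P1 with output port R5, processor P2 with output port R6, and processor P3 with input ports R3, R4 and output ports R1, R2.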
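The two rules can be sketched in a few lines of Python. This is an illustrative reading, not the Taverna code: the `induce_mpod` name and the tuple encoding of edges are assumptions, and the edge endpoints in the example are a guess at the slide's diagram, which is not fully recoverable from the transcript.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed edge encoding:
#   ("wgb",  artifact, role, process)   artifact wasGeneratedBy(role) process
#   ("used", process,  role, artifact)  process used(role) artifact
Edge = Tuple[str, str, str, str]


def induce_mpod(edges: Iterable[Edge]) -> Dict[str, Dict[str, Set[str]]]:
    """Create one processor per OPM process, with one port per role."""
    mpod: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: {"in": set(), "out": set()})
    for kind, source, role, target in edges:
        if kind == "wgb":
            mpod[target]["out"].add(role)   # rule 1: the role becomes an output port
        elif kind == "used":
            mpod[source]["in"].add(role)    # rule 2: the role becomes an input port
    return dict(mpod)


# Edge endpoints below are assumed; only the labels come from the slide.
example_edges = [
    ("wgb", "A3", "R5", "P1"),
    ("wgb", "A4", "R6", "P2"),
    ("used", "P3", "R3", "A3"),
    ("used", "P3", "R4", "A4"),
    ("wgb", "A1", "R1", "P3"),
    ("wgb", "A2", "R2", "P3"),
]
mpod = induce_mpod(example_edges)
```

With these edges, `mpod["P3"]` ends up with input ports R3 and R4 and output ports R1 and R2, matching the processor boxes described on the slide.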
MPOD generation rules - II

Derived property: A2 wasDeterminedFrom A1

This is usually inferred, i.e. there exist P, R1, R2 such that:
- A2 wasGeneratedBy(R2) P
- P used(R1) A1
which gives processor P with input port R1 and output port R2.

Note 1: if the corresponding “wgby” and “used” edges are not found, then new P, R1, R2 are created and added to the graph
‣ however, in all cases encountered so far, wasDeterminedFrom was inferred: P, R1, R2 appear in existing wgb(R) and used(R) edges
Note 2: wasControlledBy and wasTriggeredBy are ignored for now
Note 3: a separate MPOD is created for each account in the OPM graph
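Note 1 can be sketched as follows. This is a hedged illustration rather than the actual import code: `resolve_derivations`, the edge tuples and the generated names `P_new1`, `R_in1`, `R_out1` are hypothetical, and accounts (Note 3) are not modelled.

```python
from itertools import count
from typing import List, Set, Tuple

# Assumed edge encoding:
#   ("wgb",  artifact, role, process)   artifact wasGeneratedBy(role) process
#   ("used", process,  role, artifact)  process used(role) artifact
Edge = Tuple[str, str, str, str]


def resolve_derivations(derived: List[Tuple[str, str]], edges: Set[Edge]) -> Set[Edge]:
    """For each (a2, a1) meaning "a2 wasDeterminedFrom a1", make sure some process
    both generated a2 and used a1; create a fresh process and roles if none does."""
    result = set(edges)
    fresh = count(1)
    for a2, a1 in derived:
        generators = {e[3] for e in result if e[0] == "wgb" and e[1] == a2}
        users = {e[1] for e in result if e[0] == "used" and e[3] == a1}
        if generators & users:
            continue  # the dependency is already implied by existing edges
        n = next(fresh)
        process = f"P_new{n}"
        result.add(("used", process, f"R_in{n}", a1))
        result.add(("wgb", a2, f"R_out{n}", process))
    return result


existing = {("wgb", "A2", "R2", "P"), ("used", "P", "R1", "A1")}
inferred = resolve_derivations([("A2", "A1")], existing)   # nothing to add
created = resolve_derivations([("A3", "A1")], existing)    # needs a new process
```

In the first call the derivation is covered by P, so the edge set is unchanged, which is the case the slide reports for all graphs encountered so far. The second call shows the fallback that adds a new process with one input and one output role.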
Importing OPM for the challenge

• OPM contributions successfully imported so far:
  – UC Davis
  – NCSA
  – SOTON
• Example (UC Davis) (links to PC3 / UoM wiki page)
  Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
  query.variables: LoadCSVFileIntoTable:2 / out
  query.processors=ALL

MPOD example!
Example -- query 3 result

• Ideally, the imported graph + MPOD allow provenance queries to be submitted to the imported graph just as if it were a native TP graph
• The answer is itself viewed as a new OPM graph
Lossless mappings using OPM and MPOD

TP = Taverna Provenance model

Execution: f → exec → trace(f), held in TP as 〈f, trace(f)〉_TP
Provenance Query: Q applied to 〈f, trace(f)〉_TP yields OPM_Q(trace(f)), the subgraph with the query answer only
Export to OPM: export is just a query exp(trace(f)) that returns the entire trace, giving OPM_exp(trace(f))
Import from OPM: OPM → import (using the MPOD) → 〈f, trace(f)〉_TP

When is this transformation lossless?
More on lossless-ness and OPM

f (a Taverna dataflow) → exec → trace(f)
〈f, trace(f)〉_TP → export → OPM_exp(trace(f)) → import → 〈f', trace(f')〉_TP → export → OPM_exp(trace(f'))

Question: OPM_exp(trace(f')) =?= OPM_exp(trace(f))

This is indeed lossless when f is itself a Taverna dataflow:
export ( import (export (trace(f)))) =?= export (trace(f))
(requires proof)
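The round-trip property can be checked on a toy model, sketched below in Python. This is neither the TP model nor a proof: `export_opm`, `import_opm` and the trace records are hypothetical, and the sketch assumes each processor is invoked exactly once (no iteration over lists) and that roles map one-to-one onto ports.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed edge encoding:
#   ("wgb",  artifact, role, process)   artifact wasGeneratedBy(role) process
#   ("used", process,  role, artifact)  process used(role) artifact
Edge = Tuple[str, str, str, str]
Invocation = Dict[str, object]  # {"proc": str, "inputs": {port: artifact}, "outputs": {port: artifact}}


def export_opm(trace: List[Invocation]) -> FrozenSet[Edge]:
    """exp(trace(f)): turn every port binding in the trace into an OPM edge."""
    edges = set()
    for inv in trace:
        for port, artifact in inv["inputs"].items():
            edges.add(("used", inv["proc"], port, artifact))
        for port, artifact in inv["outputs"].items():
            edges.add(("wgb", artifact, port, inv["proc"]))
    return frozenset(edges)


def import_opm(edges: FrozenSet[Edge]):
    """Rebuild a trace and an MPOD (processor -> input ports, output ports) from edges."""
    by_proc: Dict[str, Invocation] = {}
    for kind, source, role, target in sorted(edges):
        proc, artifact = (source, target) if kind == "used" else (target, source)
        inv = by_proc.setdefault(proc, {"proc": proc, "inputs": {}, "outputs": {}})
        inv["inputs" if kind == "used" else "outputs"][role] = artifact
    mpod = {p: (sorted(inv["inputs"]), sorted(inv["outputs"])) for p, inv in by_proc.items()}
    return mpod, list(by_proc.values())


trace = [
    {"proc": "P1", "inputs": {}, "outputs": {"out": "A1"}},
    {"proc": "P2", "inputs": {"in": "A1"}, "outputs": {"out": "A2"}},
]
first_export = export_opm(trace)
mpod, reimported = import_opm(first_export)
second_export = export_opm(reimported)
round_trip_is_lossless = second_export == first_export
```

On this toy trace the second export equals the first, mirroring export(import(export(trace(f)))) = export(trace(f)). The general claim for real Taverna dataflows still needs the proof the slide calls for.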