paper talk @ edbt'10: fine-grained and efficient lineage querying of collection-based workflow...
DESCRIPTION
Missier, P., Paton, N., & Belhajjame, K. (2010). Fine-grained and efficient lineage querying of collection-based workflow provenance. Procs. EDBT. Lausanne, Switzerland.TRANSCRIPT
Paolo Missier, Norman Paton, Khalid BelhajjameInformation Management Group
School of Computer Science, University of Manchester, UK
EDBT ConferenceLausanne, Switzerland, March 2010
Fine-grained and efficient lineage querying of collection-based workflow provenance
EDBT, Lausanne, March 2010
• Setting:– Black box provenance of workflow data products
• Fine-grained provenance:– tracking provenance through collections: motivation– functional model of collection-oriented workflow
processing
• Efficient query processing:– leveraging the functional model to achieve efficient
processing for a simple query model
• Experimental evaluation
Problem statement and outline
EDBT, Lausanne, March 2010
Computing provenance through a graph• Provenance graph is an unfolding of the workflow graph
structure– large: grows with size of input– lineage queries involve graph traversal
3From: Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," Procs. ICDE, 2009.
EDBT, Lausanne, March 2010
Main result
• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph
• potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
EDBT, Lausanne, March 2010
Main result
• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph
• potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
[1][n]
[]
[][1]
EDBT, Lausanne, March 2010
Main result
• Query the provenance of individual collections elements• But, avoid computing transitive closures on the provenance graph
• potentially very large• Traverse the workflow graph instead -- much smaller• This results in substantial performance improvement for typical queries
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
[1][n]
[]
[][1]
EDBT, Lausanne, March 2010
Workflow as data integrator
EDBT, Lausanne, March 2010
Workflow as data integrator
QTLgenomicregions
genesin QTL
metabolicpathways(KEGG)
EDBT, Lausanne, March 2010
Workflow as data integrator
QTLgenomicregions
genesin QTL
metabolicpathways(KEGG)
EDBT, Lausanne, March 2010
Motivation for fine-grained provenance List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
EDBT, Lausanne, March 2010
Motivation for fine-grained provenance List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
geneIDs pathways
••
••
••
•
••
•
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
EDBT, Lausanne, March 2010
Motivation for fine-grained provenance List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
geneIDs pathways
••
••
••
•
••
•
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
EDBT, Lausanne, March 2010
Motivation for fine-grained provenance List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
geneIDs pathways
••
••
••
•
••
•
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
EDBT, Lausanne, March 2010
Motivation for fine-grained provenance List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
geneIDs pathways
••
••
••
•
••
•
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
EDBT, Lausanne, March 2010
• Setting:– Black box provenance of workflow data products
• Fine-grained provenance:– tracking provenance through collection elements– motivation➡ functional model of collection-oriented workflow
processing
Problem statement and outline
EDBT, Lausanne, March 2010
Functional model for collection processing /1Simple processing: service expects atomic values,receives atomic values
v1 v2 v3
w1 w2
X1 X2
Y1
P
X3
Y2
①
EDBT, Lausanne, March 2010
Functional model for collection processing /1Simple processing: service expects atomic values,receives atomic values
v1 v2 v3
w1 w2
X1 X2
Y1
P
X3
Y2
① ②
Simple iteration:service expects atomic values,receives input list
X
Y
P1
X
Y
Pn
v1
w1 wn
vn
v = [v1 ... vn]
w = [w1 ... wn]
v = [v1 ... vn]
X
Y
P ➠
w = [w1 ... wn]
...
EDBT, Lausanne, March 2010
Functional model for collection processing /1Simple processing: service expects atomic values,receives atomic values
v1 v2 v3
w1 w2
X1 X2
Y1
P
X3
Y2
① ②
Simple iteration:service expects atomic values,receives input list
X
Y
P1
X
Y
Pn
v1
w1 wn
vn
v = [v1 ... vn]
w = [w1 ... wn]
v = [v1 ... vn]
X
Y
P ➠
w = [w1 ... wn]
...
lineage(wi) = vi
EDBT, Lausanne, March 2010
Functional model for collection processing /1Simple processing: service expects atomic values,receives atomic values
v1 v2 v3
w1 w2
X1 X2
Y1
P
X3
Y2
① ②
Simple iteration:service expects atomic values,receives input list
X
Y
P1
X
Y
Pn
v1
w1 wn
vn
v = [v1 ... vn]
w = [w1 ... wn]
v = [v1 ... vn]
X
Y
P ➠
w = [w1 ... wn]
...dd = 0
ad = 1
δ = 1
lineage(wi) = vi
EDBT, Lausanne, March 2010
Functional model for collection processing /1Simple processing: service expects atomic values,receives atomic values
v1 v2 v3
w1 w2
X1 X2
Y1
P
X3
Y2
①
➠
v = [[...], ...[...]]
w = [[..] ...[...]]
X
Y
Pdd = 0
ad =2
δ = 2
lineage(wii) = vij
③Extension:service expects atomic values,receives input nested list
②
Simple iteration:service expects atomic values,receives input list
X
Y
P1
X
Y
Pn
v1
w1 wn
vn
v = [v1 ... vn]
w = [w1 ... wn]
v = [v1 ... vn]
X
Y
P ➠
w = [w1 ... wn]
...dd = 0
ad = 1
δ = 1
lineage(wi) = vi
EDBT, Lausanne, March 2010
Functional model /2
v = [[...], ...[...]]
w = [[..] ...[...]] - depth = n-m
X
Y
Pdd = m
ad =n
δ = n-m ≥ 0
The simple iteration modelgeneralises by induction to a generic δ=n-m
EDBT, Lausanne, March 2010
Functional model /2
v = [[...], ...[...]]
w = [[..] ...[...]] - depth = n-m
X
Y
Pdd = m
ad =n
δ = n-m ≥ 0
The simple iteration modelgeneralises by induction to a generic δ=n-m
v = !a1 . . . an"
(evall P v) =
!(P v) if l = 0(map (evall!1 P ) v) if l > 0
This leads to a recursive functional formulation for simple collection processing:
EDBT, Lausanne, March 2010
Functional model - multiple inputs /3
X1 X2
Y
P
X3v1 = [v11 ... vin] v3 = [v31 ... v3m]
v2 = [v21 ... v2k]
w = [ [w11 ... w1n], ... [wm1 ...wmn] ]
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Functional model - multiple inputs /3
X1 X2
Y
P
X3v1 = [v11 ... vin] v3 = [v31 ... v3m]
v2 = [v21 ... v2k]
w = [ [w11 ... w1n], ... [wm1 ...wmn] ]
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
Cross-product involving v1 and v2 (but not v3):
v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product
and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]
EDBT, Lausanne, March 2010
Functional model - multiple inputs /3
X1 X2
Y
P
X3v1 = [v11 ... vin] v3 = [v31 ... v3m]
v2 = [v21 ... v2k]
w = [ [w11 ... w1n], ... [wm1 ...wmn] ]
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
Cross-product involving v1 and v2 (but not v3):
v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ] // cross product
and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]
lineage(wii) = < v1i, v2, v3j>
a! b = [["ai, bj#]|bj $ b]|ai $ a]
(eval2 P !a, b") = (map (eval1 P ) a # b)
EDBT, Lausanne, March 2010
Generalised cross product
Binary product, δ = 1:
a! b = [["ai, bj#]|bj $ b]|ai $ a]
(eval2 P !a, b") = (map (eval1 P ) a # b)
EDBT, Lausanne, March 2010
Generalised cross product
!i:1...n(vi, di)
(v, d1)! (w, d2) =
!"""#
"""$
[[(vi, wj)|wj " w]|vi " v] if d1 > 0, d2 > 0[(vi, w)|vi " v] if d1 > 0, d2 = 0[(v, wj)|wj " w] if d1 = 0, d2 > 0(v, w) if d1 = 0, d2 = 0
Generalized to arbitrary depths:
...and to n operands:
Binary product, δ = 1:
a! b = [["ai, bj#]|bj $ b]|ai $ a]
(eval2 P !a, b") = (map (eval1 P ) a # b)
EDBT, Lausanne, March 2010
Generalised cross product
!i:1...n(vi, di)
(v, d1)! (w, d2) =
!"""#
"""$
[[(vi, wj)|wj " w]|vi " v] if d1 > 0, d2 > 0[(vi, w)|vi " v] if d1 > 0, d2 = 0[(v, wj)|wj " w] if d1 = 0, d2 > 0(v, w) if d1 = 0, d2 = 0
Generalized to arbitrary depths:
...and to n operands:
Binary product, δ = 1:
Finally: general functional semantics for collection-based processing
(evall P !(v1, d1), . . . , (vn, dn)")
=
!(P !v1, . . . , vn") if l = 0(map (evall!1 P ) #i:1...n !vi, di") if l > 0
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] =
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] =
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] = δ1
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] =
X1
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] = δ2
X1
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] =
X1 X2
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] = δk
X1 X2
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Static mapping of output to input values
X1 X2
Y
P
X3
(0,1) (0,1)(1,1)
(0,2)
this leads to a simple mapping rule:
index of an output list value ➔ {index of input values}
Y[i.j] → X1[i], X2[], X3[j]
[i1 . i2 . ... . ik] =
X1 X2 Xk
The iteration structure can be determined statically
dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1
EDBT, Lausanne, March 2010
Workflow graph as index for query processing
1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
EDBT, Lausanne, March 2010
Workflow graph as index for query processing
1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
[1][n]
[]
[][1]
EDBT, Lausanne, March 2010
Workflow graph as index for query processing
1. traverse the workflow graph (small) rather than the provenance trace (large)2. use static prediction of iterations to trace through collection elements3. “parachute” into the actual trace only at the end
y11
a1 b1
ymn
bman
wv1 vn
... ...
...
...
Provenance graphWorkflow graph
X1 X2
Y
P
X3
y = [ [y11 ... y1n], ... [ym1 ... ymn] ]
X
Y
R
X
Y
Q
[1][n]
[]
[][1]
EDBT, Lausanne, March 2010
From granularity to efficient query processing
Summary so far:
• whenever iterations are involved, we can trace the provenance of individual elements of a processor’s output
• iterations are explained in terms of a functional model and based on list depth discrepancies
• The relationships between output and input indexes are derived using the workflow specification graph (statically)
How about expressivity and efficient processing of lineage queries?
EDBT, Lausanne, March 2010
Lineage query model /1
I - Focusing:Not all processors are interesting:–report lineage only at
specified nodes in the graph
EDBT, Lausanne, March 2010
Lineage query model /2List-structured KEGG gene ids:
[ [ mmu:26416 ], [ mmu:328788 ] ]
[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]
[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling , path:mmu04620 Toll-like receptor, ...] ]
II - Granularity:Trace lineage for individual elements within collections- when possible!
EDBT, Lausanne, March 2010
Lineage query model and language
<pquery> <scope workflow="keggPathways"> <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/> </scope> <select> <outputPort name="paths_per_gene" index="[1,2]"/> <outputPort name="paths_per_gene" index="[3,4]"/> <outputPort name="commonPathways" index="[1]"/> <processor name="getPathwayDescriptions"> <outputPort name="return"/> </processor> </select> <focus> <processor name="get_pathway_by_genes" /> </focus></pquery>
optionally specifies one or more runs for the target workflow
port values for which lineage is sought:global outputs or processor-qualified
processors where lineage is to be reported- possibly workflow-qualified
workflow scopedefaults to latest run
EDBT, Lausanne, March 2010
Fine granularity + efficient processing• Scalability:
– query time depends on size of workflow graph, not size of provenance graph
– workflow graphs are small, fit in memory, can be indexed easily
• Graceful degradation:– worst case is a completely unfocused query– no worse than other approaches
• Fine-grain answers provided at the same time
EDBT, Lausanne, March 2010
• Assumption:– Black box provenance of workflow data products
✓Fine-grained provenance:– tracking provenance through collection elements– motivation, functional model of collection-oriented
workflow processing
✓Efficient query processing:✓leveraging the functional model to achieve efficient
processing for a simple query model
➡Experimental evaluation
Outline
EDBT, Lausanne, March 2010
Experimental setup - I
20
• Performance evaluation performed on programmatically generated dataflows
– the “T-towers”
parameters:- size of the lists involved- length of the paths- includes one cross product
EDBT, Lausanne, March 2010
Experimental results - II
10 28 50 75 100 150 2000
250
500
750
1000
1250
1500
1750
2000
2250
2500
2750
3000
3250
3500
124256
440
700
1000
2024
3257
workflow pre-processing time by graph size
path length l
tim
e (
ms)
1.33% 6.67% 13.33% 20.00% 26.67% 33.33% 40.00% 46.67%
0
10
20
30
40
50
60
70
80
90
100
110
120
130
response times for PP on unfocused queries (l=150)
% of processors in target set
tim
e (
ms)
performance degradation on fully unfocused queries
10 28 50 75 100 150
0
25
50
75
100
125
150
175
200
225
d=10
NI
NGQ
PP
path length l
tim
e (
ms)
10 25 50 75 100 150
0
25
50
75
100
125
150
175
200
225
d=150
NR
NGQ
PP
path length l
tim
e (
ms)
10 28 50 75 100 150
0
25
50
75
100
125
150
175
200
225
d=10
NI
NGQ
PP
path length l
tim
e (
ms)
10 25 50 75 100 150
0
25
50
75
100
125
150
175
200
225
d=150
NR
NGQ
PP
path length l
tim
e (
ms)
Naive traversal of provenance graph
single multi-join query workflow as index(our approach)
10 elements in input list
150 elements in input list
EDBT, Lausanne, March 2010
Summary• A simple lineage query model for Taverna
–grounded in the semantics of collection-oriented processing
–combines fine-grain answers with efficient query processing
• Ongoing work:– space compression, indexing– QLP?– semantic provenance (initial paper submitted)
• Currently part of the Taverna 2.1 release