

Provenance-Based Analysis of Data-Centric Processes

Daniel Deutch · Yuval Moskovitch · Val Tannen


Abstract We consider in this paper static analysis of

the possible executions of data-dependent applications,

namely applications whose control flow is guided by a

finite state machine, as well as by the state of an un-

derlying database. We note that previous work in this

context has not addressed two important features of

such analysis, namely analysis under hypothetical sce-

narios, such as changes to the application’s state ma-

chine and/or to the underlying database, and the con-

sideration of meta-data, such as cost or access priv-

ileges. Observing that semiring-based provenance has

been proven highly effective in supporting these two

features for database queries, we develop in this paper

a semiring-based provenance framework for the anal-

ysis of data-dependent processes, accounting for hypo-

thetical reasoning and meta-data. The development ad-

dresses two interacting new challenges: (1) combining

provenance annotations for both information that re-

sides in the database and information about external

inputs (e.g., user choices), and (2) finitely capturing in-

finitely many process executions. We have implemented

our framework as part of the PROPOLIS system.

This research was partially supported by the Israeli Ministry of Science, the Israeli Science Foundation, the National Science Foundation (NSF IIS 1217798), the US-Israel Binational Science Foundation, the Broadcom Foundation and Tel Aviv University Authentication Initiative.

Daniel Deutch, Tel Aviv University

Yuval Moskovitch, Tel Aviv University

Val Tannen, University of Pennsylvania

1 Introduction

There are two dominant approaches in the modeling

and specification of applications: the classic, process-

centric, approach focusing on the application control

flow while abstracting away data (see e.g. [32]), and a

data-centric approach (e.g. [18,13]), where the database

underlying the process, and queries over it that guide

the course of execution, are explicitly modeled. We fo-

cus here on the data-centric approach, and model such

applications through data-dependent processes (DDPs)

which are finite state machines (FSM) whose transitions

are guided by both external input and by the result of

internal database queries.

As a simple example, consider a typical E-commerce

application where users can navigate through a selec-

tion of products (classified through categories and sub-

categories), and proposed discount deals. The under-

lying process semantics can be captured via an FSM

whose states are associated with web-pages and/or properties defined by the application’s business logic. The

transitions between states can be governed both by

user clicks or text input and by queries on the prod-

uct database.

The typical complexity of these applications requires

automatic (static) analysis of their possible executions,

verifying properties of interest. In particular, an analyst

may be interested in questions such as “can a user view

a particular category without being a club member?”,

or “what is the minimum number of clicks allowing a

user to view the daily deals being offered?”. Some as-

pects of such analysis tasks may be captured via tempo-

ral logics; evaluating temporal logic formulas (extended

with first order predicates to query the underlying data

[18]) with respect to the possible executions of data-

centric processes is the subject of extensive investiga-

tion [18,13]. There are, however, two important aspects

that should be addressed as part of the analysis, and are

absent from current solutions: analysis under hypothet-

ical scenarios, and analysis in presence of meta-data.

We next explain these two features which are the focus

of the present paper.

Hypothetical Reasoning. The result of evaluating a

temporal logic formula with respect to a process for-

malism is typically boolean, indicating the existence or

nonexistence of an execution that satisfies the property

specified by the formula (e.g. “can a user view a particu-

lar category”). However, to obtain a better understand-

ing of the application, an analyst may want to do more

than pose static analysis questions with respect to its

current specification. Indeed, she may also want to test

and explore the effect on analysis results of hypothetical


scenarios. E.g. if the answer to the above question is

“no”, then the analyst may wish to explore changes to

the application that would lead to a positive answer.

The scenarios may involve different modifications to the

business logic (e.g., the application’s link structure) or

to the underlying product data. In realistic cases, there

may be many possible scenarios and their combinations

that are of interest. An analyst would like to interac-

tively explore the different combinations, refining them

according to the analysis results (e.g., gradually remov-

ing links and observing the effect).

Meta-data awareness. An additional important feature absent from previous work is the analysis in presence of data from meta-domains such as cost (e.g., number of clicks), trust (e.g., confidence scores), or security

(e.g., access control/clearance levels). Such meta-data

is typically associated with the data as well as with the

control flow logic, and accounting for it is required as

part of analysis tasks such as those exemplified above

(“number of clicks” or “club membership” in the above

examples).

A provenance-based solution. A main observation at the core of our research is that in the

context of relational queries, it has been shown that the

use of provenance annotations, based on the provenance

semiring approach, is effective for the two above chal-

lenges (hypothetical reasoning and accounting for meta-

data). It was shown in [16] that provenance annotations

enable such interactive exploration under combinations

of hypothetical scenarios, a technique referred to as

provisioning. This technique incorporates in the query

answer compactly stored provenance expressions corre-

sponding to the hypothetical scenarios. It was further

shown in [26,21,6] that it suffices to compute prove-

nance and then evaluate it (specialize it) in specific

meta-domains. While provenance was shown to be effec-

tive for these tasks in the context of relational queries,

provenance support for process analysis was not previ-

ously studied. Therefore, the main goal of this paper is

to develop the foundations of provenance tracking for

data-dependent process analysis.

Several conceptual and technical challenges need to

be addressed in this context. We review here these chal-

lenges and our contributions in addressing them.

First, one needs to design a formal model of the process specification. As described above, we

introduce a simple notion of DDPs (Data-Dependent

Process), specified by an FSM whose transitions are ei-

ther guarded by a yes/no query on a static underlying

database, or otherwise are considered as non-deterministic.

Intuitively, non-deterministic transitions correspond to

external effects such as user choices, interacting appli-

cations, etc. For database boolean query guards we es-

sentially allow an arbitrary class of queries, as long as

provenance support (captured by an “oracle”) is present

for this class, and satisfies some natural desiderata. We

later show an instantiation of the query language with

the quite expressive positive relational algebra with ag-

gregates. Similarly, a formalism for specifying analysis

questions with respect to the process’s possible execu-

tions is required. Here we follow common practice and

use the Linear Temporal Logic (LTL) formalism [32].

LTL analysis of DDP specifications can determine if

certain properties of possible executions hold. We fo-

cus here on analysis of finite executions, following the

perspective used in workflow provenance (see e.g. [14]),

but there may still be infinitely many such executions

if the process logic involves loops.

To support meta-data from various meta-domains

(modeling e.g., cost, access control privileges etc.), as

well as provisioning for analysis under hypothetical sce-

narios, we further develop a provenance model for the

result of LTL formulas with respect to a DDP. The de-

velopment of the provenance model must account in

particular for the different kinds of transition choices

(depending on queries on the data or on external ef-

fects) and for the possibly infinitely many executions.

A starting point towards a solution is the semiring-

based model [26]. However, due to the above features,

the model cannot be used as is, and a novel construc-

tion is required. We start by allowing the separate an-

notation of data tuples and external choices, using two

different semirings. This separation allows us, for exam-

ple, to simultaneously do provisioning for data tuples

and provenance specialization for external choices. An

example of interest might be “what is the minimum

number of clicks to view daily deals, if the database is

modified in a particular way”. Then, we propose a novel

construction that, intuitively, allows capturing pairs of

provenance annotations. The two components of a pair

correspond to external choices and to query answers on

the underlying database (resp.). A novel construction

is needed because completely separating the two kinds

of provenance is not a good idea. In semiring prove-

nance alternatives are modeled by addition and joint

use is modeled by multiplication [26]. The semantics

of DDPs involves alternative paths. If we accumulate

the database provenance separately from the external

effects provenance we lose track of when both happen

along the same path. Therefore instead of just taking

the cartesian product of the two semirings, we need to

factor by certain algebraic congruences on pairs. The

construction is in fact a tensor product of two semirings.

This allows us to both greatly simplify the resulting


expressions and to validate semiring homomorphisms

that compute the meta-domain annotations in prove-

nance specialization. In addition, the new construction

that we develop has to accommodate the tracking of

provenance over possibly infinitely many different ex-

ecutions. This means using a semiring that supports

infinite sums (countable sums suffice). To this end we

define the tensor product construction so that it manipulates infinite bags of pairs before factoring through

the desired algebraic congruences. The result is a closed

semiring [39] that supports directly the desired infinite

sums. We then define the provenance for LTL formulas

in this structure as the (possibly infinite) sum over all

the paths that realize the formula; along each path we

compute provenance as a multiplication of provenance

of the individual transitions.
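In symbols, the sentence above can be read as follows. This is only a sketch of the shape of the definition: the names $\mathrm{prov}$, $\mathrm{Exec}(s)$ and $\mathrm{trans}(e)$ are notation assumed here for illustration and are not taken from the paper, whose formal definition appears in the sequel. Both the sum and the product are taken in the closed tensor product semiring, and the sum may range over countably many executions.

```latex
\[
  \mathrm{prov}(s, \phi)
  \;=\;
  \sum_{\substack{e \in \mathrm{Exec}(s) \\ e \models \phi}}
  \;\prod_{t \in \mathrm{trans}(e)} \mathrm{prov}(t)
\]
```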

Several properties of the framework are then of in-

terest, for it to be of practical use. We provide here

intuitive descriptions of these properties, which are for-

malized in the sequel.

Semantics properties. A first important prop-

erty of provenance tracking is that it faithfully extends

the semantics of the underlying computation as per-

formed while ignoring provenance. This property was

formalized as set-compatibility for positive relational algebra with aggregates in [6], meaning that for a suitable

choice of semiring for annotations (namely the boolean

semiring), the semantics for computing query result (in-

cluding provenance) coincides with the standard set-

semantics for queries. The counterpart here is compati-

bility with the usual boolean semantics of LTL for pro-

cess analysis. Second, we provide an efficient (given an

efficient oracle for the individual queries) algorithm for

computing, given a (provenance-aware) DDP specifica-

tion and an LTL formula, the provenance of the formula

with respect to the specification. The output of our al-

gorithm is a finitely described expression in the closed

tensor product structure that captures the provenance.

The algorithm leverages and combines two techniques:

one for computing the provenance of database queries,

and the second for computing a regular expression that

is equivalent to a given Finite State Machine. Third,

we consider commutation with homomorphism, which

is essential for the soundness of applying provenance to

provisioning and to specialization in meta-domains of

interest. This is because the provenance expressions can

be built using indeterminate parameters (variables) as

input annotations and then both hypothetical scenarios

and meta-domain assumptions correspond to valuations

of these parameters in specific semirings. These valu-

ations determine homomorphisms through which the

evaluation of provenance expressions is done.

We formalize these properties and show that they

hold for our construction, given that counterpart prop-

erties hold for the provenance oracle given for guard-

ing queries. Indeed, such counterparts have been shown

in previous work for expressive languages (specifically

positive relational algebra with aggregates), and so such

expressive languages may be incorporated in our model.

Implementation and experiments. We have im-

plemented our model and algorithms in the context

of a system prototype called PROPOLIS (PROvenance

for data-dependent PrOcess anaLysIS) [17]. PROPOLIS

allows analysts to define, in addition to an LTL for-

mula, annotations (from e.g. cost, trust, and access-

control meta-domains) on the process model. This cap-

tures hypothetical scenarios by parameterizing the pro-

cess specification (its logical flow and/or its underly-

ing database) at points of interest. Then, PROPOLIS

computes (offline) a provenance expression for the given

LTL formula with respect to the process specification.

This expression is the compact result of evaluating the

LTL formula with respect to applying all possible combinations of specified scenarios (parameter values) to

the process specification. The provenance expression is

passed to a module which allows for rapid exploration

of scenarios as well as provenance specialization, using

the provisioned expression while avoiding further costly

access to the database or costly reevaluation of the LTL

formula. We provide a detailed description of the sys-

tem, accounting for challenges in implementation and

different optimizations. We have further used PROPOLIS

to conduct an extensive experimental study, showing its

effectiveness and scalability.

2 DDPs and Their Analysis

We next define our model for data-dependent processes

(DDPs) and the temporal formalism used in its analy-

sis. Provenance will be introduced in Section 3.

2.1 Data-dependent Processes

We consider a simple model for DDP specifications. The

logical flow is specified by a state machine in which

some transitioning is governed by boolean queries in a

class Q over an underlying Database of schema D. We

will later explain how analysts are able to examine the

effects of changes to the current database state as well

as to the state machine.

We keep the class of queries Q abstract for now, and

later (Section 5) discuss a particular query language of

interest (namely positive relational algebra with aggre-

gates). Our examples use SQL syntax for simplicity.


Definition 1 A Data-Dependent Process (DDP) specification is a tuple $(V, E, V_q, V_e, F_q, D)$ such that $(V, E)$ is a directed graph referred to as the DDP state machine, with a distinguished “initial” node $v_i \in V_q \cup V_e$ having no ingoing edges. $V_q, V_e \subseteq V$ are subsets of nodes referred to as query nodes and external effect nodes respectively, such that $V_q \cap V_e = \emptyset$ and $V_q \cup V_e = V$. Every node $v_q \in V_q$ has exactly two outgoing edges, and the end nodes of these edges are denoted $\mathit{true}(v_q)$ and $\mathit{false}(v_q)$. $F_q : V_q \mapsto Q$ maps query nodes to queries deciding their transitioning. Finally, $D$ is a database over the schema $\mathcal{D}$.

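[Figure 1: state machine of the running example. Nodes: Homepage, New products, Cat., SubCat., Product, Shopping Cart, Daily Deals, Payment, Exit, PayExit. Some transitions carry numeric weights (the values 2, 3 and 5 appear); others carry the guards Q1 = 0, Q1 ≠ 0, Q2 = 0 and Q2 ≠ 0.]

Fig. 1: Data-Driven Process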

Example 1 Consider the (partial) process logic in Fig-

ure 1. Each node intuitively stands for a web page, and

edges model links. The initial node is HomePage. Some

transitioning is based on the user decision, such as the

one from “Shopping Cart”, depending on the user nav-

igation choice (ignore for now the numbers annotat-

ing some transitions). Other transition choices depend

on the underlying database: for instance the transition from “Cat.” (standing for a page where the user chooses

a category) to a “SubCat.” page (where the user is

presented a set of sub-categories) or to a “Product”

page (listing relevant products) depends on availability

of sub-categories, modeled as the truth value for the

boolean query “Q1 = 0” (checking for nonexistence of

a sub-category of available categories). Q1 is given in

Fig. 3, with respect to the underlying database whose

fragment is given in Fig. 2 (ignore for now the Prov.

column).
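The tuple of Definition 1 can be written down as a small data structure. The following is a minimal sketch, not the PROPOLIS implementation: the class `DDP`, its field names, and the helper `q1_is_zero` are hypothetical, and the graph is only an assumed fragment of Figure 1 (the edge from Product to Exit in particular is an assumption made so that the fragment is closed).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Database = Dict[str, List[tuple]]           # relation name -> rows
BooleanQuery = Callable[[Database], bool]   # yes/no guard over the database


@dataclass
class DDP:
    edges: Dict[str, List[str]]             # adjacency lists of the graph (V, E)
    initial: str                            # distinguished initial node
    query_nodes: Set[str]                   # V_q; every other node is in V_e
    guards: Dict[str, BooleanQuery]         # F_q
    true_succ: Dict[str, str]               # true(v_q)
    false_succ: Dict[str, str]              # false(v_q)
    db: Database                            # the static database D

    def external_nodes(self) -> Set[str]:
        """V_e, obtained as the complement of V_q in V."""
        return set(self.edges) - self.query_nodes

    def validate(self) -> None:
        """Check the structural conditions stated in Definition 1."""
        nodes = set(self.edges)
        targets = {t for succs in self.edges.values() for t in succs}
        assert targets <= nodes, "edge targets must be declared nodes"
        assert self.initial in nodes and self.initial not in targets
        assert self.query_nodes <= nodes
        for v in self.query_nodes:
            assert v in self.guards
            # exactly two outgoing edges, ending in true(v) and false(v)
            assert sorted(self.edges[v]) == sorted(
                [self.true_succ[v], self.false_succ[v]])


def q1_is_zero(db: Database) -> bool:
    """Guard 'Q1 = 0': no hierarchy row joins with an available category."""
    available = {cat for (cat,) in db["AvailableCat"]}
    return not any(cat in available for cat, _sub in db["CategoryHierarchy"])


# Assumed fragment of the running example (not the full Figure 1).
ddp = DDP(
    edges={"Homepage": ["Cat."], "Cat.": ["Product", "SubCat."],
           "SubCat.": ["Exit"], "Product": ["Exit"], "Exit": []},
    initial="Homepage",
    query_nodes={"Cat."},
    guards={"Cat.": q1_is_zero},
    true_succ={"Cat.": "Product"},
    false_succ={"Cat.": "SubCat."},
    db={"AvailableCat": [("Cell Phones",), ("Computers",), ("Fashion",)],
        "CategoryHierarchy": [("Cell Phones", "Smartphones")]},
)
ddp.validate()
```

Deriving V_e as the complement of V_q is a design shortcut that makes the conditions on V_q and V_e (disjoint, jointly covering V) hold by construction.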

A valid execution of a DDP follows query results for

transitioning out of query nodes, while performing ar-

bitrary choices for other nodes. We limit our attention

to finite executions. To simplify definitions, we also as-

sume that an execution terminates in a node from which

there is no outgoing edge (such as Exit and PayExit

in the running example).

Definition 2 (Executions) An execution of a DDP $(V, E, V_q, V_e, F_q, D)$ is a finite path $[v_1, v_2, \ldots, v_n]$ in the graph $(V, E)$ such that $v_n$ has no outgoing edges and, for every $v_i \in V_q$, if the query $F_q(v_i)$ is satisfied by the database $D$ then $v_{i+1} = \mathit{true}(v_i)$ and otherwise $v_{i+1} = \mathit{false}(v_i)$.

(a) AvailableCat

    Cat.           Prov.
    · · ·          · · ·
    Cell Phones    d1
    Computers      d2
    Fashion        d3
    · · ·          · · ·

(b) PaymentSystems

    Cat.           Prov.
    · · ·          · · ·
    PayPal         d5
    · · ·          · · ·

(c) CategoryHierarchy

    Cat            SubCat.        Prov.
    · · ·          · · ·          · · ·
    Cell Phones    Smartphones    d4
    · · ·          · · ·          · · ·

Fig. 2: Underlying Database

    Q1: SELECT COUNT(*)
        FROM CategoryHierarchy CH,
             AvailableCat AC
        WHERE CH.Cat = AC.Cat

    Q2: SELECT COUNT(*)
        FROM PaymentSystems PS

Fig. 3: SQL Queries
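To make the guards concrete, the two queries of Fig. 3 can be run against the visible rows of Fig. 2. The sketch below uses Python's standard `sqlite3` module with an in-memory database; it is only an illustration, under the assumptions that the elided rows (shown as "· · ·") are simply absent, that the Prov. column is dropped, and that the columns are named `Cat` and `SubCat` without the trailing dot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AvailableCat (Cat TEXT);
    CREATE TABLE PaymentSystems (Cat TEXT);
    CREATE TABLE CategoryHierarchy (Cat TEXT, SubCat TEXT);

    -- Only the rows visible in Fig. 2; elided rows are not guessed.
    INSERT INTO AvailableCat VALUES ('Cell Phones'), ('Computers'), ('Fashion');
    INSERT INTO PaymentSystems VALUES ('PayPal');
    INSERT INTO CategoryHierarchy VALUES ('Cell Phones', 'Smartphones');
""")

# Q1 of Fig. 3: number of hierarchy rows whose category is available.
q1 = conn.execute("""
    SELECT COUNT(*)
    FROM CategoryHierarchy CH, AvailableCat AC
    WHERE CH.Cat = AC.Cat
""").fetchone()[0]

# Q2 of Fig. 3: number of payment systems.
q2 = conn.execute("SELECT COUNT(*) FROM PaymentSystems PS").fetchone()[0]

# The boolean guards used on the transitions of Figure 1.
print("Q1 = 0 holds:", q1 == 0)
print("Q2 = 0 holds:", q2 == 0)
```

On this fragment both counts are 1, so neither "Q1 = 0" nor "Q2 = 0" holds.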

Example 2 Reconsider the DDP specification in Exam-

ple 1. Consider the path P = [Homepage, Cat, SubCat,

Exit], intuitively corresponding to a user making a choice

of a category, then of a sub-category, then exiting with-

out completing a purchase. Cat. is the only query node

(i.e. Cat. ∈ Vq) in this execution, associated with the

query Q1 = 0. Since the current state of D does not

satisfy Q1 = 0, the node that follows Cat must be

SubCat, so P is an execution of the specification but

e.g. P′ = [Homepage, Cat, Product, ...] is not.
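The condition of Definition 2 is easy to check mechanically for a given path. The following is a minimal sketch under stated assumptions: the function name `is_execution` and its parameters are hypothetical, the guard outcomes are passed in precomputed (the database is static, so each guard has a fixed truth value), and the graph is an assumed fragment of Figure 1 rather than the full state machine.

```python
from typing import Dict, List, Set


def is_execution(path: List[str],
                 edges: Dict[str, List[str]],
                 query_nodes: Set[str],
                 guard_holds: Dict[str, bool],
                 true_succ: Dict[str, str],
                 false_succ: Dict[str, str]) -> bool:
    """Check the conditions of Definition 2 for a candidate path."""
    if not path:
        return False
    for v, nxt in zip(path, path[1:]):
        if nxt not in edges.get(v, []):
            return False                      # not a path in (V, E)
        if v in query_nodes:
            expected = true_succ[v] if guard_holds[v] else false_succ[v]
            if nxt != expected:
                return False                  # contradicts the query result
    return not edges.get(path[-1], [])        # last node has no outgoing edge


# Assumed fragment of the running example.
edges = {"Homepage": ["Cat."], "Cat.": ["Product", "SubCat."],
         "SubCat.": ["Exit"], "Product": ["Exit"], "Exit": []}
query_nodes = {"Cat."}
guard_holds = {"Cat.": False}                 # D does not satisfy Q1 = 0
true_succ = {"Cat.": "Product"}
false_succ = {"Cat.": "SubCat."}

p = ["Homepage", "Cat.", "SubCat.", "Exit"]
p_prime = ["Homepage", "Cat.", "Product", "Exit"]
print(is_execution(p, edges, query_nodes, guard_holds, true_succ, false_succ))
print(is_execution(p_prime, edges, query_nodes, guard_holds, true_succ, false_succ))
```

Like the definition as stated, the check does not require the path to start at the initial node.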

Note that a DDP may admit infinitely many (finite)

executions, if its FSM includes cycles. We next consider

LTL as a formalism for specifying properties of

interest with respect to such executions.

2.2 Analysis Formalism

We revisit the syntax of LTL formulas [32], as well as their semantics with respect to finite process executions [23]¹. We follow the common practice [23] of considering for finite runs only the next-free fragment of LTL (we may also account for the next operator, with a particular choice of semantics for the last state). Formulas are built up from a finite set of propositional variables P, the connectives ¬ and ∨, and the modal operator U (until). I.e. if p ∈ P then p is an LTL formula, and for every two LTL formulas f1, f2, it holds that f1 ∨ f2, ¬f1, and f1 U f2 are LTL formulas.

¹ The semantics of LTL is more commonly defined with respect to infinite executions; however, from a provenance perspective we will only be interested in finite prefixes, as is also the case in [23], which works directly with traces.

The semantics of LTL formulas defines their satis-

faction with respect to an execution with the assump-

tion that each state can be tested for satisfaction of a

propositional variable in P . In our examples P simply

corresponds to the state names, and a state only satis-

fies the variable corresponding to its name.

Definition 3 [23] Let $\phi$ be an LTL formula over a set of propositional variables $P$. Let $e = v_0, v_1, \ldots, v_n$ be an execution and for $0 \le i \le n$ let $e_i = v_i, v_{i+1}, \ldots, v_n$. We say that $e \models \phi$ if

– $\phi$ is a propositional variable $p$ and the state $v_0$ satisfies $p$.
– $\phi = \neg\phi_1$ for some LTL formula $\phi_1$ and $e \not\models \phi_1$.
– $\phi = \phi_1 \vee \phi_2$ for some LTL formulas $\phi_1$ and $\phi_2$ such that $e \models \phi_1$ or $e \models \phi_2$.
– $\phi = \phi_1 U \phi_2$ and there exists some $0 \le i \le n$ such that $e_i \models \phi_2$, and for all $0 \le k < i$, $e_k \models \phi_1$.

The basic modal operator U allows the definition of

many derived operators such as “Finally” (Fφ), “Be-

fore” (φ1Bφ2), or “Globally” (Gφ). Fφ states that φ

holds eventually, and it can be written as trueUφ. The

formula φ1Bφ2 states that φ1 holds before φ2 and can

be written as ¬φ2Uφ1; and Gφ states that φ holds

throughout the execution, and can be written as ¬F (¬φ).
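The finite-trace semantics of Definition 3 and the derived operators can be sketched as a small recursive evaluator. This is an illustration only: the tuple encoding of formulas, the function names, and the reserved proposition `"true"` are assumptions of this sketch, and, as in the examples of this paper, a state satisfies only the propositional variable that equals its name.

```python
from typing import List, Union

# Formula encoding (an assumption of this sketch):
#   "p" | ("not", f) | ("or", f, g) | ("until", f, g)
Formula = Union[str, tuple]


def holds(phi: Formula, trace: List[str], i: int = 0) -> bool:
    """Does the suffix trace[i:] satisfy phi under Definition 3?"""
    if isinstance(phi, str):
        return phi == "true" or trace[i] == phi
    op = phi[0]
    if op == "not":
        return not holds(phi[1], trace, i)
    if op == "or":
        return holds(phi[1], trace, i) or holds(phi[2], trace, i)
    if op == "until":
        for j in range(i, len(trace)):
            if holds(phi[2], trace, j):
                return True           # phi2 reached, phi1 held before it
            if not holds(phi[1], trace, j):
                return False          # phi1 broke before phi2 was reached
        return False
    raise ValueError(f"unknown operator: {op!r}")


# Derived operators, written exactly as in the text above.
def F(f: Formula) -> Formula:               # Finally: true U f
    return ("until", "true", f)


def G(f: Formula) -> Formula:               # Globally: not F (not f)
    return ("not", F(("not", f)))


def B(f: Formula, g: Formula) -> Formula:   # f Before g: (not g) U f
    return ("until", ("not", g), f)


def AND(f: Formula, g: Formula) -> Formula: # conjunction via De Morgan
    return ("not", ("or", ("not", f), ("not", g)))


trace = ["Homepage", "Cat.", "SubCat.", "Exit"]
print(holds(F("SubCat."), trace))
print(holds(G(("not", "DailyDeals")), trace))
print(holds(B(("or", "Exit", "PayExit"), "DailyDeals"), trace))
```

The `until` case stops at the first position where the right-hand formula holds, which is sufficient because any later witness would require the left-hand formula to hold at every earlier position.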

Example 3 An analyst of our running example DDP

may be interested in various execution properties, such

as whether the execution involves “A user exiting with-

out viewing the daily deals”. This is captured by the

LTL formula (Exit OR PayExit) B DailyDeals. Other

properties that may be of interest to the analyst are “A

user views product sub-categories for some category”,

“A user views proposed daily deals, and later proceeds

to the payment page”, or “A user never views the daily

deals”. These properties can be captured by the LTL

formulas F SubCat., F (DailyDeals ∧ F Payment), and

G (¬DailyDeals) respectively.

Note that an LTL formula expresses a property of

a given execution, while in general we are interested

in evaluating the formula with respect to the (possibly

infinitely many) possible executions of a given DDP. We

next define the (boolean) semantics of such evaluation.

Definition 4 Given a DDP s and a LTL formula φ, we

say that s |= φ if there exists an execution e of s such

that e |= φ.

The above definition is essentially restricted to ask-

ing for the existence of a path satisfying the formula,

and accounts for neither data in meta-domains

nor hypothetical reasoning. To this end we present a

provenance model. The boolean semantics will serve as

a yardstick in our development: we will generalize it.

3 Provenance Model

We present in this section a semiring-based provenance

model for the LTL-based analysis of DDPs. We first

recall some basic algebraic notions required for our de-

velopments, then define provenance-aware DDPs and a

provenance construction for their executions.

3.1 Semirings and homomorphisms

A commutative monoid is an algebraic structure $(M, +_M, 0_M)$ where $+_M$ is an associative and commutative binary operation and $0_M$ is an identity for $+_M$. A monoid homomorphism is a mapping $h : M \to M'$ where $M, M'$ are monoids, and $h(0_M) = 0_{M'}$, $h(a +_M b) = h(a) +_{M'} h(b)$. A semiring homomorphism is a mapping $h : K \to K'$ where $K, K'$ are semirings, and $h(0_K) = 0_{K'}$, $h(1_K) = 1_{K'}$, $h(a +_K b) = h(a) +_{K'} h(b)$, $h(a \cdot_K b) = h(a) \cdot_{K'} h(b)$.

We will consider database operations on relations whose tuples are annotated with elements from commutative semirings. These are structures $(K, +_K, \cdot_K, 0_K, 1_K)$ where $(K, +_K, 0_K)$ and $(K, \cdot_K, 1_K)$ are commutative monoids, $\cdot_K$ is distributive over $+_K$, and $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$.

Example 4 Examples, which we will use in the sequel,

include the boolean semiring ({true, false}, ∨, ∧, false, true) and the tropical semiring (N∞, min, +, ∞, 0) (also termed a cost semiring) where N∞ includes all natural numbers as well as ∞. Given a set X of provenance tokens which

correspond to “atomic” provenance information, e.g.,

tuple identifiers, (N[X],+, ·, 0, 1) is the semiring of poly-

nomials with natural coefficients. For x1, x2, x3 ∈ X,

2x1 + x2 · x3 is an example of an element in the semir-

ing. (N[X],+, ·, 0, 1) was shown in [26] to most generally

capture provenance for positive relational queries and

is thus termed the “provenance semiring”.
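The semirings of Example 4, and the way a polynomial in N[X] is specialized to one of them, can be sketched as follows. The names `Semiring`, `BOOLEAN`, `TROPICAL` and `evaluate` are hypothetical, as are the valuations chosen for x1, x2, x3. The sketch relies on the fact that a valuation of the tokens extends to a semiring homomorphism from N[X], in which a natural coefficient n stands for an n-fold sum.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Generic, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Semiring(Generic[T]):
    zero: T                       # identity of addition
    one: T                        # identity of multiplication
    add: Callable[[T, T], T]      # models alternative use
    mul: Callable[[T, T], T]      # models joint use


BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
TROPICAL = Semiring(math.inf, 0, min, lambda a, b: a + b)

# A polynomial in N[X]: monomial (tuple of tokens) -> natural coefficient.
Polynomial = Dict[Tuple[str, ...], int]


def evaluate(poly: Polynomial, valuation: Dict[str, T], k: Semiring) -> T:
    """Evaluate poly in semiring k under a valuation of its tokens."""
    total = k.zero
    for monomial, coeff in poly.items():
        term = k.one
        for token in monomial:
            term = k.mul(term, valuation[token])
        for _ in range(coeff):    # coefficient n means term + ... + term
            total = k.add(total, term)
    return total


p = {("x1",): 2, ("x2", "x3"): 1}   # the element 2*x1 + x2*x3 of N[X]

cost = evaluate(p, {"x1": 5, "x2": 1, "x3": 2}, TROPICAL)
truth = evaluate(p, {"x1": False, "x2": True, "x3": True}, BOOLEAN)
print(cost, truth)
```

With these hypothetical valuations the tropical result is min(5, 5, 1 + 2) = 3, and the boolean result is true because the monomial x2 · x3 survives.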

3.2 Provenance-Aware DDPs

We next define the notion of a Provenance-Aware DDP

(PADDP for short). We propose to use two semirings

for annotations, accounting for the inherently different

types of transitioning. A first semiring will be used to


annotate the database tuples, following [26]; a second

semiring will be used for annotation of external effects.

Definition 5 A Provenance-Aware DDP (PADDP) is a tuple $(S, K_{ext}, K_{data}, A_{ext}, A_{data})$ where $S$ is a DDP, $K_{ext}, K_{data}$ are commutative semirings, $A_{ext}$ maps transitions out of external effect nodes of $S$ to elements of $K_{ext}$, and $A_{data}$ maps tuples of the underlying database of $S$ to elements of $K_{data}$.

For any two commutative semirings K, K′ we will use the term (K, K′)-PADDP for a PADDP whose annotations for external effects are elements of K and whose annotations for data are elements of K′. We have not

defined yet the way that these annotations propagate

through executions, but we can already exemplify their

use based on the intuitive correspondence of the semir-

ing + operation with alternative use (of data / transi-

tions), and of the semiring · operation with joint use.

Example 5 Re-consider the running example and sup-

pose that the analyst is interested in questions of the

flavor “What is the minimum user effort² required for

a user to view the daily deals and later proceed to the

“Payment” page”. We have already shown how this con-

straint on executions can be expressed in LTL, but we

need to also capture user effort. This can be done by

choosing Kext to be the tropical semiring (identifying

“cost” here with user effort), and using Aext to map

transitions to natural numbers corresponding to user

effort associated with them (see weights next to tran-

sitions in Figure 1). Since multiplication in the tropi-

cal semiring corresponds to natural numbers addition,

weights of joint choices along an execution are summed

up; and since addition corresponds to min, provenance

for multiple executions (e.g. all those satisfying the con-

straint that daily deals are viewed) captures their small-

est weight. We will show concrete examples for such

weights in the examples to follow.
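
The role of the tropical semiring described here can be illustrated with a minimal Python sketch. The function names and the transition weights below are hypothetical (they are not read off Figure 1); the sketch only shows that semiring multiplication accumulates effort along one execution, while semiring addition keeps the cheapest among alternative executions.

```python
import math

# Tropical (min, +) semiring over natural numbers extended with infinity.
TROP_ZERO = math.inf  # neutral element of semiring addition (min)
TROP_ONE = 0          # neutral element of semiring multiplication (+)


def trop_add(a, b):
    """Semiring addition: alternative use keeps the cheaper option."""
    return min(a, b)


def trop_mul(a, b):
    """Semiring multiplication: joint use accumulates effort."""
    return a + b


def execution_cost(weights):
    """Product of the weights of the transitions along one execution."""
    cost = TROP_ONE
    for w in weights:
        cost = trop_mul(cost, w)
    return cost


def minimal_cost(executions):
    """Sum (i.e. min) over the costs of alternative executions."""
    best = TROP_ZERO
    for weights in executions:
        best = trop_add(best, execution_cost(weights))
    return best


# Hypothetical weights, chosen only for illustration.
executions = [[5, 2, 3], [7, 2, 3]]
print(minimal_cost(executions))  # prints 10
```

Note that the sum over an empty set of executions is the tropical 0 (infinity), matching the convention that an unsatisfiable constraint has no finite cost.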

Furthermore, the analyst may be interested in per-

forming the analysis assuming the database will change,

e.g. some products or categories are no longer available.

This can be captured by associating provenance with

the database tuples. Specifically, we can use Kdata =

N[D], the semiring of polynomials over the set of prove-

nance tokens D = {d1, ..., d5}. The annotation function

Adata is given as the Prov. column in Figure 2. Intu-

itively, D may be considered as indeterminates, and hy-

pothetical scenarios will be modeled in the sequel via

truth assignments to them. We will show concrete hy-

pothetical scenarios in Example 13.

2 Where user effort is some quantification associated with different actions such as following a link, filling in a text box, etc.

We now start defining provenance: first for individ-

ual queries (in an abstract way), then for executions,

and finally for a possibly infinite set of executions con-

forming to a given LTL formula.

3.3 Provenance for Individual Queries

Recall that we have abstracted away the choice of boolean

query language Q. We next define in an abstract way

a notion of a (semiring-based) “provenance oracle” for

queries in Q, specifying desiderata with respect to this

oracle that will allow for provenance tracking for PAD-

DPs. A concrete such oracle will be shown in Section

5.

Semiring-Annotated Relations. We start by recalling

the notion of semiring-annotated databases [26]. We use

the named perspective [1] of the relational model in

which tuples are functions t : U → D with U a finite

set of attributes and D a domain of values. We fix the

domain D and denote the set of all such U -tuples by

U -Tup. (Usual) relations over U are subsets of U -Tup.

Annotations are then modeled by a function from all

possible tuples to elements of a semiring K, with those

tuples not considered to be “in” the relation tagged

with the 0 value of the semiring.

Definition 6 Let K be a semiring. A K-relation over a finite set of attributes U is a function R : U-Tup → K such that its support, defined by supp(R) := {t | R(t) ≠ 0}, is finite. We say that R is annotated by elements of K. A K-database is a collection of K-relations.

Example 6 Different semirings may be used for annota-

tion, capturing meta-data of different domains. Perhaps

the simplest example is standard (set-theoretic) rela-

tions, carrying no meta-data, which may be regarded

as relations with annotations from the boolean semir-

ing, with all tuples present in the relations annotated

by true (tuples mapped to “false” are not part of the

support). Similarly, annotations from the cost semir-

ing may be used to associate an access cost with every

tuple, and the provenance semiring is typically used

to associate an “abstract tag” from some domain with

each tuple (which will be further used in the tracking

of provenance for query results, see following discus-

sion). Such an annotated relation, with basic annota-

tions d1, d2, ... is shown in Figure 2 (in Section 3).
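
As an illustration of Definition 6, the following Python sketch models a K-relation as a function with finite support. The class name `KRelation`, the helper `row`, and the sample tuples are assumptions made for this sketch and are not the contents of Figure 2; provenance tokens are represented simply as strings.

```python
import math


def row(**attrs):
    """A tuple in the named perspective: a mapping from attributes to values."""
    return tuple(sorted(attrs.items()))


class KRelation:
    """A function from tuples to K that is non-zero on finitely many tuples."""

    def __init__(self, zero, annotations):
        self.zero = zero
        # Store only the support; every other tuple is implicitly mapped to zero.
        self._ann = {t: k for t, k in annotations.items() if k != zero}

    def __call__(self, t):
        return self._ann.get(t, self.zero)

    def support(self):
        return set(self._ann)


tv = row(name="TV", category="Electronics")
sofa = row(name="Sofa", category="Furniture")

# Boolean semiring: an ordinary set relation (sofa is not "in" the relation).
as_set = KRelation(False, {tv: True, sofa: False})
# Cost (tropical) semiring: the zero element is infinity.
with_cost = KRelation(math.inf, {tv: 3, sofa: 1})
# Provenance tokens, written here as strings; 0 stands for the semiring zero.
with_tokens = KRelation(0, {tv: "d1", sofa: "d2"})
```

The same class serves all three semirings because only the zero element is needed to decide membership in the support.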

Provenance for guarding queries. Let Q be a (relational) boolean query language, and let ProvOracle_K be a “provenance oracle” (parameterized by a commutative semiring K) for queries in Q, i.e. ProvOracle_K : Q × K^db ↦ $\hat{K}$, where K^db is the class of all K-databases and $\hat{K}$ is a commutative semiring, linked to K through the notion of homomorphism lift: for every homomorphism h : K ↦ K′ there is a “lift” $\hat{h} : \hat{K} \mapsto \hat{K'}$. The specification of this lift, as well as the definition of how to obtain $\hat{K}$ from an arbitrary semiring K, is part of the oracle description. The simplest case is of course when $\hat{K} = K$ (and then $\hat{h} = h$), but the above definition allows for further flexibility in the choice of the “output” semiring.

Three desiderata naturally arise with respect to ProvOracle and will be important at different stages of our development (as will be made explicit below). The first is set-compatibility, which specifies that the semantics of ProvOracle_B on B-databases (i.e. standard set-databases) coincides with the standard set semantics (i.e. the result is true if and only if the query is satisfied by the underlying database).

The second is for complexity analysis purposes, and states that ProvOracle_K(q, D) may be computed in time polynomial in the size of D for every query q, K-database D and semiring K (and then we say that ProvOracle is polynomial-time computable).

The third desideratum is commutation with homomorphism:

Definition 7 We say that ProvOracle commutes with homomorphism if for every two semirings K, K′, every homomorphism h : K ↦ K′, every query q in Q and every K-database D, we have that ProvOracle_K′(q, h(D)) = $\hat{h}$(ProvOracle_K(q, D)), where $\hat{h}$ is the lift of h and h(D) is the K′-database obtained by applying h to every annotation from K appearing in D.

As we will show in the sequel, commutation with homomorphism is crucial for hypothetical reasoning, as hypothetical scenarios may be captured by such homomorphisms.
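
To make the commutation requirement concrete, here is a small Python sketch under strong simplifying assumptions: the oracle `toy_oracle` is a hypothetical one (not the oracle of Section 5) that handles only queries of the form “some tuple satisfies a predicate”, its output semiring is taken to be the input semiring itself, and the homomorphism maps the natural-number (bag) semiring to the boolean semiring by testing for positivity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    zero: object
    one: object
    add: Callable
    mul: Callable


NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)


def toy_oracle(K, predicate, db):
    """Provenance of 'some tuple satisfies predicate': sum of matching annotations."""
    total = K.zero
    for t, ann in db.items():
        if predicate(t):
            total = K.add(total, ann)
    return total


def apply_hom(h, db):
    """Apply h to every annotation of an annotated database."""
    return {t: h(ann) for t, ann in db.items()}


def h(n):
    """Semiring homomorphism from the naturals to the booleans."""
    return n > 0


# Hypothetical N-annotated relation (annotation 0 means "absent").
db = {("TV", "Electronics"): 2, ("Phone", "Electronics"): 0, ("Sofa", "Furniture"): 3}


def is_electronics(t):
    return t[1] == "Electronics"


lhs = toy_oracle(BOOL, is_electronics, apply_hom(h, db))  # oracle after h
rhs = h(toy_oracle(NAT, is_electronics, db))              # h after oracle
```

Here `lhs` and `rhs` coincide, which is the equation of Definition 7 for this toy setting. A hypothetical scenario such as “the TV tuple is deleted” can be modeled by a different homomorphism applied to the same provenance value, without re-evaluating the query.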

We next introduce a provenance structure that can

accommodate provenance for PADDP executions, given

such an oracle for computing provenance for the indi-

vidual queries associated with transitions.

Notations. The model and construction of provenance for PADDP executions are naturally parameterized by the provenance model and construction for the individual queries annotating its transitions (i.e. by ProvOracle). Where clear from context, we omit ProvOracle from the notation. Furthermore, for clarity of the examples, we will use expressions of the sort [Q = 0] or [Q ≠ 0] (in addition to the semiring constants 0 and 1) to denote the provenance of such boolean queries (comparing the value computed by Q to 0, in these examples). For now these should only be considered as symbols which are assumed to be elements of $\hat{K}_{data}$ (the provenance semiring with respect to the “data semiring” of the PADDP); a “semantics” for these expressions as well as the precise structure accommodating them will be given in Section 5.

3.4 Provenance for PADDP Executions

We start by presenting the construction in a fairly gen-

eral way and will then show how to use it in our context.

A main challenge here is the existence of two, possibly

distinct, semirings – one for the user choices and one

for the result of queries (as computed by a given or-

acle), that need to be combined while accounting for

possibly infinitely many executions. To this end, we will

show that for every two commutative semirings K and

L we can define a tensor product-like structure with

additional closure properties that will be required. This

structure will be introduced in two steps. First, we will construct a semiring whose elements are possibly infinite bags of elements from K × L. Intuitively, each such pair holds the provenance of a single execution: the first element of the pair holds the “external effect” portion of the provenance and the second one holds the “data” portion. A bag of such pairs then stands for

the provenance of multiple (possibly infinitely many)

executions. We will show that the obtained structure is

commutative and closed, thus suitable for provenance

tracking, and indeed will use it for defining provenance

for LTL (Sec. 3.5). In the second step (Sec. 3.6) we

will introduce congruences to the structure, that will

be crucial for effective provenance tracking. The result-

ing structure will be called K ⊗ L.

The construction details now follow.

Bags of pairs We start our construction with a simple step, introducing the set of pairs over items of K and L, denoted K × L. Abusing notation, we denote its elements by k ⊗ l as well as 〈k, l〉, interchangeably. The intuition is that a single pair k ⊗ l will capture the provenance of an entire execution: k will capture the “external provenance” of the execution and l will capture its “data provenance”.

A significant difficulty that is addressed lies in correctly managing provenance for possibly infinitely many alternative executions, accounting for operations on these pairs. To this end, we consider the set Bag(K × L) of possibly infinite bags of such pairs. There are two useful ways of working with bags. One is to consider them as N∞-valued functions. The other is to observe that bags, with bag union and the empty bag, form a commutative monoid whose elements are uniquely representable as (possibly infinite) sums of singleton bags. In this second perspective, if we abuse notation and denote singleton bags by the unique element they contain, we can write each bag of pairs from K × L as

\[ \sum_{i \in I} k_i \otimes l_i \]

where i ranges over a (possibly infinite) set of indices I. Moreover, we can always rename the indices without loss of generality such that when we work with several bags, the sets of indices used in their sum representations are pairwise disjoint.

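It will be convenient to already denote bag union by $+_{K \otimes L}$ and the empty bag by $0_{K \otimes L}$. That is,

\[ \Big(\sum_{i \in I} k_i \otimes l_i\Big) +_{K \otimes L} \Big(\sum_{j \in J} k_j \otimes l_j\Big) = \sum_{h \in I \cup J} k_h \otimes l_h \]

\[ 0_{K \otimes L} = \sum_{i \in \emptyset} k_i \otimes l_i \]

Recalling the intuitive correspondence between summation and alternative use, a bag (sum) of pairs corresponds to alternative paths.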

Now we define

\[ \Big(\sum_{i \in I} k_i \otimes l_i\Big) \cdot_{K \otimes L} \Big(\sum_{j \in J} k_j \otimes l_j\Big) = \sum_{(i,j) \in I \times J} (k_i \cdot_K k_j) \otimes (l_i \cdot_L l_j) \]

This is again consistent with the intuition of multiplication as joint use and summation as alternatives: following one of the alternatives of the first bag combined with one of the alternatives of the second bag corresponds exactly to all alternatives obtained by such combinations.
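
The two operations just defined can be sketched in Python for finite bags, using `collections.Counter` to hold multiplicities. The encodings are assumptions made for this sketch: K is taken to be the tropical semiring (so its product is numeric addition), and elements of L are monomials written as sorted tuples of tokens. Infinite bags, which the construction allows, are outside the scope of the sketch.

```python
from collections import Counter


# Assumed encodings: K is the tropical semiring (product is +, unit is 0);
# L-elements are monomials written as sorted tuples of tokens (unit is ()).
def mul_k(a, b):
    return a + b


def mul_l(a, b):
    return tuple(sorted(a + b))


EMPTY = Counter()             # the empty bag, playing the role of 0
UNIT = Counter({(0, ()): 1})  # singleton bag holding the pair of units


def bag_union(b1, b2):
    """Semiring addition: multiplicities of equal pairs are added."""
    return b1 + b2


def bag_product(b1, b2):
    """Semiring multiplication: every pair of b1 is combined with every pair of b2."""
    out = Counter()
    for (k1, l1), m1 in b1.items():
        for (k2, l2), m2 in b2.items():
            out[(mul_k(k1, k2), mul_l(l1, l2))] += m1 * m2
    return out


b1 = Counter({(5, ()): 1, (7, ()): 1})
b2 = Counter({(2, ("d1",)): 1, (0, ("d2",)): 1})
product = bag_product(b1, b2)
```

The product contains one pair per combination of alternatives, four in this case, which is the behaviour the defining equation prescribes.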

We have defined a mathematical structure to capture bags of pairs. For this structure to be consistent with our intuition of summation capturing alternatives and product capturing joint use, its operations should satisfy the axioms of a commutative semiring. Indeed, we next show that this is the case.

Proposition 1 $(Bag(K \times L), +_{K \otimes L}, \cdot_{K \otimes L}, 0_{K \otimes L}, 1_K \otimes 1_L)$ is a commutative semiring.

Proof

– For $b_1, b_2, b_3 \in Bag(K \times L)$, by properties of bag union we immediately have that $b_1 + b_2 = b_2 + b_1$ (symmetry of bag union), $b_1 + 0 = b_1$ (union with the empty bag), and $(b_1 + b_2) + b_3 = b_1 + (b_2 + b_3)$ (associativity of bag union).

– For $b_1, b_2 \in Bag(K \times L)$ we observe that $b_1 \cdot b_2 = b_2 \cdot b_1$, due to the fact that $k_i \cdot k_j = k_j \cdot k_i$ and $l_i \cdot l_j = l_j \cdot l_i$ for $k_i, k_j \in K$ and $l_i, l_j \in L$ (since K, L are commutative semirings). We further have that $1_K \otimes 1_L$ is the neutral element with respect to multiplication, i.e. $b_1 \cdot (1_K \otimes 1_L) = b_1$, since every pair in the bag has its K component multiplied by the neutral $1_K$ and its L component multiplied by $1_L$.

– For $b_1, b_2, b_3 \in Bag(K \times L)$ we have

\[ (b_1 + b_2) \cdot b_3 = \sum_{i \in b_1 + b_2,\, j \in b_3} (k_i \cdot_K k_j) \otimes (l_i \cdot_L l_j) = \sum_{i \in b_1,\, j \in b_3} (k_i \cdot_K k_j) \otimes (l_i \cdot_L l_j) + \sum_{i \in b_2,\, j \in b_3} (k_i \cdot_K k_j) \otimes (l_i \cdot_L l_j) = b_1 \cdot b_3 + b_2 \cdot b_3. \]

Overloading notation, we will simply use Bag(K × L) to denote $(Bag(K \times L), +_{K \otimes L}, \cdot_{K \otimes L}, 0_{K \otimes L}, 1_K \otimes 1_L)$.

Next, we must make sure that the semiring “behaves

properly” with respect to infinite sums. The latter will

intuitively be necessary as we use the structure to cap-

ture possibly infinitely many executions.

We recall the definition of a closed semiring:

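Definition 8 [39] A semiring K is closed if for all elements of K it holds that (1) $\sum_{i \in \emptyset} x_i = 0$, (2) $\sum_{i \in \{k\}} x_i = x_k$, (3) if there is $\{I_j\}$ such that $I = \bigcup_{j \in J} I_j$ is a disjoint partition, then $\sum_{i \in I} x_i = \sum_{j \in J} \big(\sum_{i \in I_j} x_i\big)$, and (4) $b \cdot (\sum_i a_i) = \sum_i b \cdot a_i = (\sum_i a_i) \cdot b$.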

The following proposition guarantees that infinite sums in the structure interact in the “expected” way with the semiring operations. In particular, this means that they preserve the intuition of joint and alternative use in executions. It will also be the basis for defining simplifications of expressions in the sequel.

Proposition 2 For every two commutative semirings

K and L, it holds that Bag(K×L) is a closed semiring.

Proof We show a stronger result: Bag(K × L) is in fact ω-continuous. Every ω-continuous semiring is closed (see [39]). We first recall the property of ω-continuity. Given a semiring K, define the binary relation ⊑ such that a ⊑ b if and only if ∃c ∈ K. a + c = b. We say that K is ω-continuous if (1) ⊑ is a partial order, (2) every infinite chain $a_1 ⊑ a_2 ⊑ \ldots$ has a supremum (least upper bound), and (3) for every semiring element a we have $a + \sup a_i = \sup(a + a_i)$ and $a \cdot \sup a_i = \sup(a \cdot a_i)$.

We then note that the relation ⊑ in $(Bag(K \times L), +_{K \otimes L}, \cdot_{K \otimes L}, 0_{K \otimes L}, 1_K \otimes 1_L)$ is bag inclusion, namely a ⊑ b if a is bag-included in b. We then show:

1. ⊑ is a partial order relation. Observe that (1) a ⊑ a, (2) if a ⊑ b and b ⊑ a then a = b, and (3) if a ⊑ b and b ⊑ c then a ⊑ c. These claims all follow from the definition of bag inclusion and the partial order relation ≤ on N∞.

2. Let $a_1 ⊑ a_2 ⊑ a_3 ⊑ \ldots$ be an infinite chain. We claim that the chain supremum is an element of Bag(K × L). To this end, we note that every bag $a_i$ on a domain D (in our case D is a domain of pairs) can be represented as a function $f_i : D \mapsto N^\infty$, where $f_i(d)$ is the number of occurrences of d in the bag $a_i$. Now since N∞ is ω-continuous, there exists a supremum of $f_1(d) \le f_2(d) \le \ldots$ (the inequalities follow from the definition of bag inclusion), denoted sup(d). Let S be the bag defined by the function $f : D \mapsto N^\infty$, f(d) = sup(d). S is a supremum for $\{a_1, a_2, \ldots\}$, denoted $\sup_i a_i$.

3. Let a ∈ Bag(K × L); then a is a possibly infinite bag defined by a function $f_a : D \mapsto N^\infty$. Now $a + \sup_i a_i$ is, by definition, captured by the function $f' = f_a + f$, where f is the function defined in item (2) of this proof. Now, $\sup_i (a + a_i)$ is captured by $f''$ such that $f''(d)$ is the supremum of $f_a(d) + f_1(d) \le f_a(d) + f_2(d) \le \ldots$. Since N∞ is ω-continuous (and in particular using the additive identity for supremum in N∞), we have that $f'' = f'$.

4. Let a ∈ Bag(K × L) and let $f_a$ be as in item (3). Now $a \cdot \sup_i a_i$ is the bag defined by

\[ f'(d) = \sum_{\{x_1, x_2 \mid d = x_1 \cdot x_2\}} f_a(x_1) \cdot f(x_2) \]

(by $x_1 \cdot x_2$ we mean pointwise multiplication of the pairs $x_1$, $x_2$, and f is the function capturing the bag that is the supremum of $\{a_i\}$). For $\sup_i (a \cdot a_i)$ we get a bag captured by

\[ f''(d) = \sup_i \Big( \sum_{\{x_1, x_2 \mid d = x_1 \cdot x_2\}} f_a(x_1) \cdot f_i(x_2) \Big). \]

Let $(x_1^1, x_2^1), \ldots, (x_1^n, x_2^n)$ be all pairs satisfying $x_1 \cdot x_2 = d$. We have:

\[ f''(d) = \sup_i \{ f_a(x_1^1) \cdot f_i(x_2^1) + \ldots + f_a(x_1^n) \cdot f_i(x_2^n) \} = \sup_i \{ f_a(x_1^1) \cdot f_i(x_2^1) \} + \ldots + \sup_i \{ f_a(x_1^n) \cdot f_i(x_2^n) \}. \]

Then by repeated application of the multiplicative identity that follows from the ω-continuity of N∞, we can push the $\sup_i$ to be applied only on elements of the form $f_i(x)$. We get $f''(d) = f_a(x_1^1) \cdot f(x_2^1) + \ldots + f_a(x_1^n) \cdot f(x_2^n)$. We thus get $f''(d) = f'(d)$, as needed.

We obtain that Bag(K × L) is ω-continuous, and

thus is closed.

We are now ready to define provenance for executions of PADDPs. A transition outgoing from an external choice node is annotated with the singleton bag {〈k, $1_{K_{data}}$〉}, where k ∈ Kext is the provenance associated with this transition according to the PADDP specification, and $1_{K_{data}}$ is the neutral value with respect to multiplication in Kdata; intuitively, this is because such a transition has no effect with respect to data provenance. Similarly, provenance for a transition out of a query node is defined as {〈$1_{K_{ext}}$, k′〉}, where k′ ∈ Kdata is the annotation defined by the oracle ProvOracle for the corresponding query with respect to the underlying annotated database.

Given a transition t of a PADDP we use Prov(t)

to denote the transition provenance according to the

above. We then define the provenance of an execution

path to be the multiplication of provenance expressions

associated with its transitions.

Definition 9 Given a PADDP execution $e = (v_0, v_1, \ldots, v_n)$, the provenance of e, denoted $Prov(e) \in Bag(K_{ext} \times K_{data})$, is defined as $\prod_{(v_i, v_{i+1}) \in e} Prov((v_i, v_{i+1}))$.

Note that the provenance of an execution involves multiplication of bags in Bag(Kext × Kdata). This allows us to “mix” annotations of Kdata and annotations of Kext in the same expression. To simplify notation, we will identify 〈k, k′〉 with the singleton bag {〈k, k′〉}.

Example 7 Reconsider the PADDP of Example 5. The provenance of the path [Home page, Cat., Sub Cat., Exit] is

〈5, 1〉 · 〈0, [Q1 ≠ 0]〉 · 〈2, 1〉

where [Q1 ≠ 0] should currently be interpreted only as a symbol standing for the provenance of Q1 not being equal to 0 (and will later be replaced with a concrete construction for queries). Each pair in this multiplication corresponds to a single transition, and they are multiplied (multiplication is done in Bag(Kext × Kdata)) since they are used together in an execution. For example, the item 〈5, 1〉 stems from the (external effect) transition from Home page to Cat. and is shorthand for the singleton bag containing a pair whose first element is the natural number 5 in the tropical semiring (signaling a user cost of 5), and whose second element 1 is the neutral element with respect to multiplication in Kdata = N[D]. The item 〈0, [Q1 ≠ 0]〉 originates in the transition from Cat. to Sub Cat., which depends on the underlying database; note that the natural number 0 is the neutral element with respect to multiplication in the tropical semiring.

Finally, we can use the definition of bag multiplication above to simplify the expression by performing “point-wise multiplication”, to obtain ($\cdot_T$ and $\cdot_{N[D]}$ stand for multiplication in the tropical and N[D] semirings, respectively)

〈5 $\cdot_T$ 0 $\cdot_T$ 2, 1 $\cdot_{N[D]}$ [Q1 ≠ 0] $\cdot_{N[D]}$ 1〉 ≡ 〈7, [Q1 ≠ 0]〉

The last equivalence is due to multiplication in the tropical semiring corresponding to addition of natural numbers. Intuitively, the accumulated cost (“user effort”) for this path is 7 and the accumulated condition with respect to the database is [Q1 ≠ 0].
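
The computation of Example 7 can be replayed with a short Python sketch. The encoding is an assumption of the sketch: the external component is a tropical cost, and the data component is a monomial over opaque symbols, written as a sorted tuple of strings, with the empty tuple standing for the unit 1.

```python
from functools import reduce

# Pair of units: tropical 1 (the number 0) and the empty monomial.
ONE = (0, ())


def pair_mul(p, q):
    """Point-wise product of two <external, data> pairs."""
    (k1, l1), (k2, l2) = p, q
    return (k1 + k2, tuple(sorted(l1 + l2)))


def execution_provenance(transition_annotations):
    """Product of the annotations of the transitions along one execution."""
    return reduce(pair_mul, transition_annotations, ONE)


# Annotations along [Home page, Cat., Sub Cat., Exit], as in Example 7.
path = [(5, ()), (0, ("[Q1!=0]",)), (2, ())]
print(execution_provenance(path))  # prints (7, ('[Q1!=0]',))
```

The string `"[Q1!=0]"` is only a placeholder symbol here, in line with the text's remark that its semantics is deferred to a later section.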

3.5 Provenance for LTL

Building on the fact that Bag(Kext×Kdata) is a closed

semiring, we next define provenance for an LTL formula

with respect to a PADDP, as the (possibly infinite) sum

of provenances of executions conforming to the formula.

Let exec(S) denote the (possibly infinite) set of (finite)

executions of a PADDP S. We define (this definition is

of course parameterized by the choice of ProvOracle):

Definition 10 Given a PADDP S, a provenance oracle ProvOracle for its guarding queries and an LTL formula f, we define the result of evaluating f on S (denoted $f_{ProvOracle}(S)$) as $\sum_{\{e \in Paths(S) \mid e \models f\}} Prov(e)$.

Whenever the particular choice of provenance oracle

is irrelevant to the discussion or clear from context we

omit it from the notation and simply write f(S).

Definition 10 does not give an explicit way of representing the infinite sums. To this end, we introduce the Kleene star operation:

\[ a^* = 1 + a + a^2 + \ldots \]

and note that it is well-defined in closed semirings.

Example 8 In the tropical semiring, since multiplication corresponds to natural number addition, addition corresponds to min, and the tropical 1 is the natural number 0 (i.e. less than or equal to any other number in the semiring), we get a* = 0 (the tropical 1, i.e. the natural number 0). For the Boolean semiring, we get a* = true ∨ a ∨ a ∨ ... = true ∨ a = true.
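
The two instances in Example 8 can be checked numerically on finite prefixes of the star series. The helper names below are hypothetical, and the sketch only evaluates partial sums; it does not by itself establish the value of the infinite sum.

```python
import math


def star_prefix(a, n_terms, zero, one, add, mul):
    """Return 1 + a + a^2 + ... + a^(n_terms - 1) in the given semiring."""
    total, power = zero, one
    for _ in range(n_terms):
        total = add(total, power)
        power = mul(power, a)
    return total


def tropical_prefix(a, n_terms):
    # Tropical semiring: addition is min, multiplication is +.
    return star_prefix(a, n_terms, math.inf, 0, min, lambda x, y: x + y)


def boolean_prefix(a, n_terms):
    # Boolean semiring: addition is "or", multiplication is "and".
    return star_prefix(a, n_terms, False, True,
                       lambda x, y: x or y, lambda x, y: x and y)


print(tropical_prefix(3, 10))      # prints 0
print(boolean_prefix(False, 10))   # prints True
```

In both semirings the first term (the semiring 1) already dominates every later term, so every non-empty prefix has the same value as the full series.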

Proposition 3 For any PADDP S and LTL formula

f , f(S) may be represented as a finite starred expres-

sion.

This proposition will follow from the correctness of the algorithm for provenance generation, presented in Section 4. We can already note that intuitively, the obtained starred expression corresponds to a regular expression over the annotation pairs, capturing exactly those paths that satisfy the LTL formula. Based on this intuition we exemplify the obtained expressions.

Example 9 Reconsider the LTL formula F(DailyDeals ∧ F Payment) and the running example PADDP. Intuitively, to represent alternative paths (alternative ways of “realizing” the LTL property) we use a sum of pairs where each pair captures the provenance of a single path. Also, using the introduced axioms, we may in fact generate sub-expressions for multiple partial executions, and then combine them. For instance, observe that all executions reaching DailyDeals start by reaching Cat. The two partial executions reaching Cat. have, together, as provenance, the sum 〈5, 1〉 + 〈7, 1〉. This expression may then be multiplied by an expression capturing the provenance of reaching DailyDeals (and then Payment) from Cat. The latter will involve star to capture possible loops. We obtain:

((〈5, 1〉 + 〈7, 1〉) · (〈2, 1〉 · 〈0, [Q1 ≠ 0]〉 + 〈0, [Q1 = 0]〉) · 〈3, 1〉 · 〈2, 1〉)* · (〈5, 1〉 + 〈7, 1〉) · (〈2, 1〉 · 〈0, [Q1 ≠ 0]〉 + 〈0, [Q1 = 0]〉) · 〈3, 1〉 · 〈2, 1〉 · 〈3, 1〉 · (〈0, [Q2 ≠ 0]〉 + 〈0, [Q2 = 0]〉)

The obtained expression is quite long but we can already simplify it using the axioms of our structure. In particular, we have seen before that we can simplify multiplication expressions, to get e.g. 〈3, 1〉 · 〈2, 1〉 · 〈3, 1〉 = 〈8, 1〉. Further simplifications lead to:

((〈5, 1〉 + 〈7, 1〉) · (〈7, [Q1 ≠ 0]〉 + 〈5, [Q1 = 0]〉))* · (〈5, 1〉 + 〈7, 1〉) · (〈10, [Q1 ≠ 0]〉 + 〈8, [Q1 = 0]〉) · (〈0, [Q2 ≠ 0]〉 + 〈0, [Q2 = 0]〉)

In this expression, every pair represents “joint” prove-

nance terms in the two domains (Kext being tropical,

and Kdata being N[D]). For instance, 〈5, 1〉 intuitively

means a cost of 5 and no dependency on data. A sum

of such pairs reflects alternative paths, e.g. the sub-

expression (〈5, 1〉 + 〈7, 1〉) corresponds to the two op-

tions of following a transition with cost 5 or following

one with cost 7 (both with no dependency on data). A

product of such sub-terms corresponds to joint use (i.e.

in conjunction with taking either of these transitions,

we also continue the execution). Finally, Kleene star is

applied to provenance of sub-executions appearing in a

loop.

The expression that we get is still quite complex but

we will show later (Example 10), that by introducing

congruence axioms, we can further simplify it.

3.6 Introducing a Congruence

We have defined provenance for LTL formula evaluated

over PADDPs, but observed that their representation

may become quite complex. There are certain equiva-

lence axioms that are “natural” in this setting. For in-

stance if the same data provenance is used repeatedly

in multiple execution paths, one expects to be able to

write an equivalent expression where it appears only

once.

To allow for simplifications, we need to identify elements of Bag(K × L) with other, “simpler” elements. This is done via the introduction of a congruence relation. Since we are dealing with possibly infinite bags, we need to consider infinitary congruences:

Definition 11 An equivalence relation ≡ is an infinitary congruence if it satisfies:

– $k_1 ≡ k_3,\ k_2 ≡ k_4 \implies k_1 \cdot k_2 ≡ k_3 \cdot k_4$

– $\forall i\ a_i ≡ b_i \implies \sum_i a_i ≡ \sum_i b_i$

where the second item applies to both finite and infinite sums.

Next, let ∼ be the smallest infinitary congruence on Bag(K × L) with respect to $+_{K \otimes L}$ and $\cdot_{K \otimes L}$ that contains (for all k, k′, l, l′):

\[ (k +_K k') \otimes l \sim k \otimes l +_{K \otimes L} k' \otimes l \]

\[ 0_K \otimes l \sim 0_{K \otimes L} \]

\[ k \otimes (l +_L l') \sim k \otimes l +_{K \otimes L} k \otimes l' \]

\[ k \otimes 0_L \sim 0_{K \otimes L} \]

We denote by K ⊗ L the set of congruence classes of bags of pairs modulo ∼ and by $1_{K \otimes L}$ the congruence class of $1_K \otimes 1_L$. As usual when we take the quotient by a congruence, the result $(K \otimes L, +_{K \otimes L}, \cdot_{K \otimes L}, 0_{K \otimes L}, 1_{K \otimes L})$, which we will also denote K ⊗ L, overloading notation, is also a commutative semiring.

When we define provenance for LTL queries in K ⊗ L, all constructions and results go through, and in addition we can perform significant expression simplifications.

Example 10 Reconsider the provenance expression obtained in Example 9. Note that the introduced equivalence axioms allow for simplifications that were not possible so far. For instance, we have the sub-expression 〈5, 1〉 + 〈7, 1〉, intuitively corresponding to two sub-executions that involve the same dependency on data (in this case, no dependency) with different costs. Using the congruence relation, this expression is equivalent to 〈5 + 7, 1〉 = 〈5, 1〉 (recall that + in the tropical semiring is min). Intuitively we have “factored out” the common dependency on data, and computed the minimal cost out of the two options with respect to user effort.

Via repeated applications of the congruence relation we may perform further partial computations, and obtain:

(〈5 + 7, 1〉 · (〈7, [Q1 ≠ 0]〉 + 〈5, [Q1 = 0]〉))* · 〈5 + 7, 1〉 · (〈10, [Q1 ≠ 0]〉 + 〈8, [Q1 = 0]〉) · (〈0, [Q2 ≠ 0]〉 + 〈0, [Q2 = 0]〉)

= (〈12, [Q1 ≠ 0]〉 + 〈10, [Q1 = 0]〉)* · (〈15, [Q1 ≠ 0]〉 + 〈13, [Q1 = 0]〉) · 〈0, [Q2 ≠ 0] + [Q2 = 0]〉

An important property of the construction thus far was that the obtained semiring was closed. We can show that this property still holds for the structure obtained by taking the quotient by a congruence, and in particular:

Proposition 4 For every two semirings K,L, it holds

that K ⊗ L is a closed semiring.

Proof Recall that Bag(K × L) is a closed semiring. We obtained K ⊗ L from Bag(K × L) by taking the quotient by a congruence; that is, all equalities that hold in Bag(K × L) also hold in K ⊗ L. In addition, the definition of a closed semiring consists only of equalities. Therefore K ⊗ L is also a closed semiring.

Intuitively, this means that we can use K ⊗ L to track both K-annotations and L-annotations; by keeping all K-annotations as $1_K$ we can track the L-annotations, and by keeping all L-annotations as $1_L$ we can track the K-annotations.
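
The simplification carried out in Example 10 can be sketched in Python for finite bags, under the assumption that the external semiring is tropical. The sketch applies only the first congruence axiom, merging pairs that share the same data component by summing (taking the min of) their external components; it does not implement the symmetric axiom on the data side. All names are hypothetical.

```python
import math


def multiply(b1, b2):
    """Product of two finite bags of <cost, data-monomial> pairs."""
    return [(c1 + c2, tuple(sorted(d1 + d2))) for c1, d1 in b1 for c2, d2 in b2]


def merge_same_data(bag):
    """Apply (k + k') (x) l ~ k (x) l + k' (x) l with tropical + (min).

    Returns a dict mapping each data component to its merged cost.
    """
    merged = {}
    for cost, data in bag:
        merged[data] = min(merged.get(data, math.inf), cost)
    return merged


# Sub-expressions taken from Example 10; strings stand for opaque symbols.
reach_cat = [(5, ()), (7, ())]
loop_body = [(7, ("[Q1!=0]",)), (5, ("[Q1=0]",))]

print(merge_same_data(reach_cat))
# prints {(): 5}
print(merge_same_data(multiply(reach_cat, loop_body)))
# prints {('[Q1!=0]',): 12, ('[Q1=0]',): 10}
```

The second result matches the starred sub-expression 〈12, [Q1 ≠ 0]〉 + 〈10, [Q1 = 0]〉 of Example 10.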

4 Properties of the construction

We next analyze the properties of the construction, and

show a correspondence between the desiderata imposed

for ProvOracle, and important properties of our con-

struction.

4.1 Faithful extension of LTL semantics

A basic sanity check is that the definition is consistent

with the semantics of LTL for DDPs without annota-

tion, namely that it in fact faithfully extends it. Indeed

we can show:

Proposition 5 The following property (*) holds if and only if ProvOracle is set-compatible.

(*) For any LTL formula f and (B, B)-PADDP S, it holds that $f_{ProvOracle}(S)$ ≡ {(true, true)} if and only if S′ ⊨ f, where S′ is a DDP obtained from S by deleting all tuples and transitions annotated by false, and keeping the rest with no annotation.

Intuitively, (true, true) corresponds to the boolean

true. We will formalize this intuition in the sequel.

Proof Assume first that ProvOracle is set-compatible. Further assume that S′ ⊨ f; therefore there is a path p in the DDP S′ such that p ⊨ f. Since S′ is a DDP obtained from S by deleting all transitions annotated by false, p is also a path in S. We claim that Prov(p) = (true, true). The external effects provenance is true since all such transitions in p are labeled with true (those labeled with false do not appear in S′). The data provenance is true since the following holds: p is a path in S′, and so all queries occurring along p are satisfied. Since ProvOracle is set-compatible, queries that are “satisfied” on the corresponding set-database are associated with provenance value true. Thus the data provenance is a multiplication of true values, which yields true. Next, the provenance (after simplifications) of a path e in a (B, B)-PADDP can be one of the following: (true, true), (true, false), (false, true) and (false, false). Now since $f(S) = \sum_{\{e \in Paths(S) \mid e \models f\}} Prov(e)$, and there exists at least one path in S (namely p) such that Prov(p) = (true, true), we get $f(S) = \sum_{\{e \in Paths(S) \mid e \models f\}} Prov(e)$ = (true, true). This is because (true, true) + (true, false) = (true, true) + (false, true) = (true, true) + (false, false) = (true, true) + (true, true) = (true, true).

Now assume f(S) ≡ {(true, true)}. Therefore there exists a path p in the (B, B)-PADDP S such that Prov(p) = (true, true) and p ⊨ f (otherwise we would not get $f(S) = \sum_{\{e \in Paths(S) \mid e \models f\}} Prov(e)$ = (true, true)). Since for the path $p = (v_0, v_1, \ldots, v_n)$ we have $Prov(p) = \prod_{(v_i, v_{i+1}) \in p} Prov((v_i, v_{i+1}))$, and Prov(p) = (true, true), necessarily $Prov((v_i, v_{i+1}))$ = (true, true) for all 0 ≤ i < n. Namely, all the annotations of the path p are true. Due to set-compatibility (data provenance is true only if the corresponding query is satisfied), this implies that p is also a path in S′.

In contrast, if ProvOracle is not set-compatible then there exists a boolean query Q and a B-database instance D such that the provenance of Q with respect to D is false (true) but Q is (respectively, is not) satisfied by the instance D′ obtained from D by keeping only the tuples labeled true, and omitting the annotations. Consider a PADDP S with three nodes labeled init (initial), A and B, such that the transitions from init to A and B are guarded by Q (A is the true node and B is the false node). Further consider the LTL formula f = F A (Finally A), and note that either f(S) ≡ {(true, false)} while S′ ⊨ f, or f(S) ≡ {(true, true)} while S′ ⊭ f, depending on the direction in which ProvOracle errs.

4.2 Efficient Provenance Generation

We next show the following crucial proposition (also proving Proposition 3):

Proposition 6 For any LTL formula φ and a PADDP S, and given a polynomial-time ProvOracle for provenance of guarding queries, we can compute a description of $φ_{ProvOracle}(S)$ in time polynomial in the size of the underlying database and exponential in the size of the state machine of S.³

Algorithm The high-level flow of an algorithm that generates provenance for an LTL formula with respect to a PADDP is given in Algorithm 1. Some important details, from a practical perspective, on its implementation are given in Section 7. The first step is to compute provenance expressions for the queries associated with query nodes by simply applying ProvOracle to each of them. The obtained annotated structure is named S′. The LTL formula is compiled (Line 2) into an FSM Sφ consistent with its finite semantics, based on the algorithm of [23]. Sφ is then (Line 3) intersected with S′ in the standard way of automata intersection (see e.g. [39]), yielding an automaton S″ whose annotations are kept consistent with those of S′: namely, a transition in the intersection automaton S″ that has origin (u, ψ) and destination (v, ψ′) will be annotated by the annotation in S′ of (u, v). Note that S″ is a PADDP with some states designated as accepting. The last step (Line 4) is to transform S″ into a starred expression, which is an element of the tensor product semiring. This is done by a direct application of Kleene’s algorithm (see e.g. [39]), which is a generalization of the standard translation of FSMs to regular expressions. Given the obtained regular expression, we may simply interpret the (regular expression) addition, multiplication and Kleene star operations as their counterparts in the tensor product semiring.

3 We follow common practice of analyzing data complexity, so φ and the guarding queries are considered of constant size.

Correctness We show that for any LTL formula φ

and a PADDP S, the output of Algorithm 1 is equiva-

lent (in the tensor product semiring) to φ(S). For that,

we consider the algorithm’s execution step by step. Af-

ter Line 1 of the algorithm, every transition (non-deterministic

or data-dependent) of S′ is labeled by its corresponding

provenance; this is due to the correctness of computing

provenance for queries with aggregates. The translation

of φ to an FSM follows the algorithm of [23] and Sφ is thus

guaranteed to satisfy that the finite executions of the

obtained FSM exactly characterize those that satisfy

the LTL formula; by intersecting Sφ with S′ we thus

get that the labeled sequences obtained for executions

of S′′ are exactly those of S′ satisfying φ. Following the

correctness of Kleene’s algorithm [39], the last step of

the algorithm generates a regular expression that compactly captures this set of label sequences (which in

turn is the set of label sequences of all executions of S

satisfying φ), i.e. is equivalent to φ(S).

Complexity For any (fixed-size) LTL formula φ and PADDP S, the time complexity and output size of Algorithm 1 are polynomial in the database size of S,

with the exponent possibly dependent on the state ma-

chine size. To observe that this holds, note that the gen-

eration of provenance expressions for database queries

was shown to be in polynomial time with respect to

the database size. The translation to starred expres-

sion (step 4), however, may in the worst case incur an

(unavoidable [28]) exponential blow up in the size of

the state machine.

Algorithm 1 Algorithm for Expression Generation
Input: PADDP specification S; LTL formula φ; provenance oracle ProvOracle
Output: Provenance expression φ(S)
1: S′ := QueriesToProvenance(S, ProvOracle)
2: QueryAutomaton := TransformToFSM(φ)
3: S′′ := Intersect(S′, QueryAutomaton)
4: exp := TranslateToStarredExpression(S′′)
5: return exp
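The intersection step (Line 3) can be illustrated by a minimal Python sketch of a product construction that carries the PADDP annotations over to the product transitions. This is not the PROPOLIS implementation: the dictionary encoding, the function name `intersect`, the convention that the formula automaton reads the name of the PADDP state being entered, and the placeholder annotations "a", "b", "c" are all assumptions made for illustration.

```python
def intersect(paddp_trans, paddp_init, formula_trans, formula_init, formula_accepting):
    """Product of an annotated PADDP with an (unannotated) formula automaton.

    paddp_trans:   dict (u, v) -> annotation of the PADDP transition u -> v
    formula_trans: dict (psi, v) -> psi2; the formula automaton reads the name
                   of the PADDP state being entered (an assumption of this sketch)
    A product transition from (u, psi) to (v, psi2) keeps the annotation of (u, v).
    """
    transitions = {}
    accepting = set()
    initial = (paddp_init, formula_init)
    frontier = [initial]
    seen = {initial}
    while frontier:
        u, psi = frontier.pop()
        if psi in formula_accepting:
            accepting.add((u, psi))
        for (src, v), annotation in paddp_trans.items():
            if src != u or (psi, v) not in formula_trans:
                continue
            target = (v, formula_trans[(psi, v)])
            transitions[((u, psi), target)] = annotation
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return transitions, initial, accepting


# Hypothetical three-node PADDP (init, A, B) with placeholder annotations.
paddp = {("init", "A"): "a", ("init", "B"): "b", ("A", "A"): "c"}
# Hand-written automaton for F A: stay in "wait" until A is entered, then "done".
f_a = {("wait", "A"): "done", ("wait", "B"): "wait",
       ("done", "A"): "done", ("done", "B"): "done"}
product_trans, product_init, product_acc = intersect(paddp, "init", f_a, "wait", {"done"})
```

Only the reachable part of the product is built, and the formula automaton contributes no annotations of its own, matching the rule that the annotation of a product transition is the annotation of (u, v) in S′.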


4.3 Commutation with homomorphism

Now that we have defined provenance for DDP analy-

sis, and have designed an algorithm for computing it,

we turn to setting the grounds for using it (for hypo-

thetical reasoning). An important principle underlying

the semiring-based provenance framework, and the one

allowing hypothetical reasoning, is that one can com-

pute an “abstract” representation of the provenance

and then “specialize” it in any domain. This “special-

ization” is formalized through the use of semiring ho-

momorphism. Recall that function h is a semiring ho-

momorphism if for every x, y ∈ K we have h(x + y) =

h(x) + h(y) and h(x · y) = h(x) · h(y).

To allow for a similar use of provenance in our set-

ting, we extend the notion of homomorphism to the

tensor product structure, and study properties of the

construction. A fundamental new challenge lies in the

mapping from elements of the tensor (which are essen-

tially bags of pairs), to single semiring elements. Such

mapping “makes sense” in some cases, as we next show.

We first show how to “shift” between meta-domains.

The idea is that two mappings from the individual semir-

ings may be combined into a single homomorphism

whose domain is the tensor structure.

Proposition 7 Let K1, K2, K3, K4 be commutative semirings, and let h1 : K1 ↦ K3, h2 : K2 ↦ K4 be semiring homomorphisms. Let h(k1 ⊗ k2) = h1(k1) ⊗ h2(k2) and extend h to a full mapping by defining h(0 ⊗ 0) = 0, h(s1 + s2) = h(s1) + h(s2) and h(s1 · s2) = h(s1) · h(s2). Then h : K1 ⊗ K2 ↦ K3 ⊗ K4 is a semiring homomorphism.

We use h1 | h2 to denote the homomorphism h obtained, according to the above construction, from h1 and h2.

Proof We have that h maps 0 to 0 by definition, and

observe that h(1⊗1) = 1⊗1 (the former is the 1 element

of K1⊗K2 and the latter is the 1 element of K3⊗K4).

By definition it further holds that h commutes with

addition and with multiplication.

It remains to show that h is a well-defined mapping,

i.e. to show that if x ∼ y then h(x) ∼ h(y). We show

that this is the case for every axiom of the tensor struc-

ture; since every equivalence is by repeated use of the

axioms, the proposition follows.

– h(k ⊗ 0) = h1(k) ⊗ h2(0) = h1(k) ⊗ 0 = 0 = h(0)

– h(k1 ⊗ l + k2 ⊗ l) = h(k1 ⊗ l) + h(k2 ⊗ l) = h1(k1) ⊗ h2(l) + h1(k2) ⊗ h2(l) ≡ (h1(k1) + h1(k2)) ⊗ h2(l) = h1(k1 + k2) ⊗ h2(l) = h((k1 + k2) ⊗ l), and symmetrically for the other direction of distributivity.

And symmetrically for the other direction of associativity. We also have h(k ⊗ 0) = h1(k) ⊗ h2(0) = h1(k) ⊗ 0 = 0 and h(1 ⊗ 1) = h1(1) ⊗ h2(1) = 1 ⊗ 1 = 1.
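To make the construction of h1 | h2 concrete, the following Python sketch represents an element of the tensor structure as a plain list of pairs (a formal sum of simple tensors, without normalizing modulo the congruence) and applies the two homomorphisms summand by summand. The choice of semirings (tropical costs paired with natural-number multiplicities, mapped to tropical costs paired with booleans) and all function names are assumptions made for illustration.

```python
def tensor_add(s, t):
    """Sum of two formal sums of simple tensors (bag union)."""
    return s + t


def tensor_mul(s, t, mul1, mul2):
    """Product, distributed over summands: (a (x) b) . (c (x) d) = (a.c) (x) (b.d)."""
    return [(mul1(a, c), mul2(b, d)) for (a, b) in s for (c, d) in t]


def lift(h1, h2):
    """The combined mapping h1 | h2, applied summand by summand."""
    return lambda s: [(h1(a), h2(b)) for (a, b) in s]


# K1 = K3 = tropical semiring (its product is +), with h1 the identity.
# K2 = N (product is *), K4 = B (product is "and"), with h2(n) = (n > 0).
h = lift(lambda cost: cost, lambda n: n > 0)
trop_mul = lambda a, c: a + c
nat_mul = lambda b, d: b * d
bool_mul = lambda b, d: b and d

s = [(2, 3), (5, 0)]
t = [(1, 4)]
lhs_mul = h(tensor_mul(s, t, trop_mul, nat_mul))      # map after multiplying
rhs_mul = tensor_mul(h(s), h(t), trop_mul, bool_mul)  # multiply after mapping
lhs_add = h(tensor_add(s, t))
rhs_add = tensor_add(h(s), h(t))
```

On this sample, mapping before or after the operation gives the same formal sum, which is the commutation with addition and multiplication that the proposition states in general.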

We can now extend the notion of semiring homo-

morphism to homomorphisms on PADDPs:

Definition 12 Let K1, K2, K3, K4 be four semirings, let h1 : K1 ↦ K3, h2 : K2 ↦ K4 be semiring homomorphisms and let S be a (K1, K2)-PADDP. We use (h1 | h2)(S) to denote the (K3, K4)-PADDP S′ obtained from S by replacing every annotation k1 ∈ K1 by h1(k1) and every annotation k2 ∈ K2 by h2(k2).

Next we show that provenance propagation com-

mutes with homomorphisms, namely:

Proposition 8 The following (*) holds if and only if

ProvOracle satisfies commutation with homomorphism:

(*) For every LTL formula φ, a (K1, K2)-PADDP S and semiring homomorphisms h1 : K1 ↦ K3, h2 : K2 ↦ K4, it holds that (h1 | h2)(φ(S)) ≡ φ((h1 | h2)(S)).

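Proof By Definition 10:

\[ (h_1 \mid h_2)(\varphi(S)) = (h_1 \mid h_2)\Big(\sum_{\{e \in Paths(S) \,\mid\, e \models \varphi\}} prov(e)\Big). \]

Since h1 | h2 is a homomorphism we have:

\[ (h_1 \mid h_2)\Big(\sum_{\{e \in Paths(S) \,\mid\, e \models \varphi\}} prov(e)\Big) = \sum_{\{e \in Paths(S) \,\mid\, e \models \varphi\}} (h_1 \mid h_2)(prov(e)). \]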

Now assume that ProvOracle satisfies commutation

with homomorphism. We then claim that (h1 | h2)(prov(e)) =

prov((h1 | h2)(e)) where (h1 | h2)(e) (with respect to

an underlying annotated database D) is the annotated

execution obtained by replacing every external effect

annotation a in e by h1(a) and replacing every anno-

tation a′ in D by h2(a′) and denoting the obtained

database by D′. Equality then holds for the external

effect part of the provenance expression by the basic

properties of homomorphisms (h(x+ y) = h(x) + h(y),

h(x · y) = h(x) · h(y)). For the database provenance

part we first use the commutation of ProvOracle (so

the provenance for D′ for each individual query Q is

the same as the result of applying h2 to the provenance

computed for Q with respect to D), and then use again the basic properties of homomorphisms to obtain that the commutation holds for the entire provenance expression. Finally we can again apply Definition 10 to get φ((h1 | h2)(S)).

If ProvOracle does not satisfy commutation with ho-

momorphism (failing for some query Q and annotated

database D) then, similarly to the proof of Proposition

8, we construct a PADDP S with three nodes labeled

init (initial), A and B, such that the transitions from

init to A and B are guarded by Q. The provenance


for the LTL formula FA is then simply the result of

ProvOracle applied to Q and D, and so commutation

with homomorphism fails for the LTL formula.

We have shown a provenance construction for PAD-

DPs that is parameterized by a provenance oracle for

the queries annotating transitions out of its query nodes,

and have shown that desirable properties of such or-

acle, namely set-compatibility, commutation with ho-

momorphism, and computability in polynomial time,

carry over to the settings of DDPs via our construc-

tion. Such “oracles” (having all three desiderata) were

given in previous work for positive relational algebra

[26] and enriched with aggregates (and a specific form

of difference) in [6]. Thus, any of these models may eas-

ily be “plugged into” our constructions. To complete

the picture, we next give an outline of the construction

for positive relational algebra with aggregates; this also

allows us, in the section that follows, to complete our

examples.

5 Provenance oracle for aggregate queries

Let K be any commutative semiring (intuitively used

for annotations) and M be any commutative monoid,

used for aggregates. To support provenance for aggre-

gate queries we have designed in [6] a structure that

combines (“pairs”) annotations from K with elements

from M , as follows. We start with K ×M , denote its

elements k⊗m instead of 〈k,m〉 and call them “simple

tensors”. Next we consider finite bags of such simple

tensors, which, with bag union and the empty bag, form

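a commutative monoid. It will be convenient to denote bag union by $+_{K \otimes M}$, the empty bag by $0_{K \otimes M}$, and to abuse notation by denoting singleton bags by the unique element they contain. Then, every non-empty bag of simple tensors can be written (repeating summands by multiplicity) $k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n$. Now we define

\[ k *_{K \otimes M} \sum k_i \otimes m_i = \sum (k \cdot_K k_i) \otimes m_i \]

Let ∼ be the smallest congruence w.r.t. $+_{K \otimes M}$ and $*_{K \otimes M}$ that satisfies (for all k, k′, m, m′):

\begin{align*}
(k +_K k') \otimes m &\sim k \otimes m +_{K \otimes M} k' \otimes m \\
0_K \otimes m &\sim 0_{K \otimes M} \\
k \otimes (m +_M m') &\sim k \otimes m +_{K \otimes M} k \otimes m' \\
k \otimes 0_M &\sim 0_{K \otimes M}
\end{align*}

We denote by K ⊗ M the set of tensors, i.e., equivalence classes of bags of simple tensors modulo ∼.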
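As a sanity check of the four congruence axioms, the following Python sketch interprets bags of simple tensors in one concrete model, chosen here purely for illustration: K = N (multiplicities) and M = (N, +, 0) (the SUM monoid), with k ⊗ m read as the number k · m. The function names `value` and `scale` are hypothetical, and the sketch only shows that this particular interpretation respects the axioms; it says nothing about other choices of K and M.

```python
def value(bag):
    """Interpret a bag of simple tensors (k, m), with K = N and M = (N, +, 0)."""
    return sum(k * m for k, m in bag)


def scale(k, bag):
    """k * sum(k_i (x) m_i) = sum((k . k_i) (x) m_i)."""
    return [(k * ki, mi) for ki, mi in bag]


k, k2, m, m2 = 3, 4, 5, 7
axioms = {
    "(k + k') (x) m ~ k (x) m + k' (x) m":
        value([(k + k2, m)]) == value([(k, m), (k2, m)]),
    "0_K (x) m ~ 0":
        value([(0, m)]) == value([]),
    "k (x) (m + m') ~ k (x) m + k (x) m'":
        value([(k, m + m2)]) == value([(k, m), (k, m2)]),
    "k (x) 0_M ~ 0":
        value([(k, 0)]) == value([]),
}

bag = [(1, 10), (2, 20)]
scaled_ok = value(scale(k, bag)) == k * value(bag)
```

In this model a tuple with multiplicity k and value m contributes k · m to the sum, so equivalent bags are exactly those denoting the same aggregate.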

This structure suffices for defining “simple” aggregate queries, namely queries where the aggregate op-

eration must appear as the last one (i.e. the aggregate

operation is disallowed in a nested query). For the out-

put relations of our algebra queries, we thus need re-

sults of aggregation (i.e., the elements of K ⊗M) to

also be part of the domain out of which the tuples are

constructed. Thus for the output domain we will as-

sume that K ⊗M ⊆ D, i.e. the result “combines an-

notations with values”. The elements of M (e.g., real

numbers for sum or max aggregation) are still present,

but only via the embedding $\iota : M \to K \otimes M$ defined by $\iota(m) = 1_K \otimes m$. Now we may define $AGG_M(R)$ as a one-attribute relation with one tuple whose annotation is $1_K$ and whose content is $SetAgg_{K \otimes M}(\iota(R))$, which is equal to

\[ k_1 *_{K \otimes M} \iota(m_1) +_{K \otimes M} \cdots +_{K \otimes M} k_n *_{K \otimes M} \iota(m_n) = k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n \]

We define the annotation of the only tuple in the output of $AGG_M$ to be $1_K$, since this tuple is always available.

However, the content of this tuple does depend on R.

Next, we need to support the use of aggregates which

are not the last operation (and in particular to com-

pare the obtained aggregate value to a given value, in

a boolean query, as done in our DDP example). To this

end we have designed in [6] a semiring whose elements

are polynomials, in which equation (and inequality) elements are additional indeterminates. To achieve that, we introduce, for any semiring K and any commutative monoid M, the “domain” equation

\[ \hat{K} = \mathbb{N}\big[K \cup \{[c_1 = c_2],\ [c_1 \neq c_2] \mid c_1, c_2 \in \hat{K} \otimes M\}\big]. \]

The right-hand side is a monotone, in fact continuous, operator w.r.t. the usual set inclusion, hence this equation has a set-theoretic least solution (no need for order-theoretic domain theory). The solution also has an obvious commutative semiring structure induced by that of polynomials. The solution semiring is $\hat{K} = (X, +_{\hat{K}}, \cdot_{\hat{K}}, 0_{\hat{K}}, 1_{\hat{K}})$, and we continue by taking the quotient on $\hat{K}$ defined by the following axioms. For all $k_1, k_2 \in K$ and $c_1, c_2, c_3, c_4 \in \hat{K} \otimes M$:

\begin{align*}
0_{\hat{K}} &\sim 0_K \\
1_{\hat{K}} &\sim 1_K \\
k_1 +_{\hat{K}} k_2 &\sim k_1 +_K k_2 \\
k_1 \cdot_{\hat{K}} k_2 &\sim k_1 \cdot_K k_2 \\
[c_1 = c_3] &\sim [c_2 = c_4] \quad (\text{if } c_1 =_{\hat{K} \otimes M} c_2,\ c_3 =_{\hat{K} \otimes M} c_4)
\end{align*}

and the symmetric treatment for the $[a \neq b]$ expressions. If K and M are such that ι defined by $\iota(m) = 1_K \otimes m$ is an isomorphism (and let h be its inverse), we further take the quotient defined by: for all $a, b \in K \otimes M$,

\begin{align*}
[a = b] &\sim 1_K \quad (\text{if } h(a) =_M h(b)) \\
[a = b] &\sim 0_K \quad (\text{if } h(a) \neq_M h(b))
\end{align*}


We use $K^M$ to denote the semiring obtained by applying the above construction on a semiring K and a

commutative monoid M .

The extended semiring construction allows us to de-

sign a semantics for general aggregation queries. Intu-

itively, when the existence of a tuple in the result re-

lies on the result of a comparison involving aggregate

values (as in the result of applying selection or joins),

we multiply the tuple annotation by the corresponding equation annotation. In the sequel we assume, to simplify the definition, that the query aggregates and compares only values of $K^M \otimes M$ (a value m ∈ M is first replaced by $\iota(m) = 1_K \otimes m$). In what follows, let R, R1 and R2 be $(M, K^M)$-relations on an attribute set U. Recall that for a tuple t, t(u) (where u ∈ U) is the value of the attribute u in t; also, for U′ ⊆ U, recall that we use t|U′ to denote the restriction of t to the attributes in U′. Last, we use $(K^M \otimes M)^U$ to denote the set of all tuples on attribute set U, with values from $K^M \otimes M$. The semantics follows:

1. empty relation: $\forall t\ \ \emptyset(t) = 0$.

2. union:
\[ (R_1 \cup R_2)(t) = \begin{cases} \sum_{t' \in supp(R_1)} R_1(t') \cdot \prod_{u \in U} [t'(u) = t(u)] + \sum_{t' \in supp(R_2)} R_2(t') \cdot \prod_{u \in U} [t'(u) = t(u)] & \text{if } t \in supp(R_1) \cup supp(R_2) \\ 0 & \text{otherwise.} \end{cases} \]

3. projection: Let U′ ⊆ U, and let T = {t|U′ | t ∈ supp(R)}. Then
\[ (\Pi_{U'}(R))(t) = \begin{cases} \sum_{t' \in supp(R)} R(t') \cdot \prod_{u \in U'} [t(u) = t'(u)] & \text{if } t \in T \\ 0 & \text{otherwise.} \end{cases} \]

4. selection: If P is an equality predicate equating some attribute u ∈ U with a value m ∈ M, then $(\sigma_P(R))(t) = R(t) \cdot [t(u) = \iota(m)]$.

5. value based join: We assume for simplicity that R1 and R2 have disjoint sets of attributes, U1 and U2 resp., and that the join is based on comparing a single attribute of each relation. Let u1 ∈ U1 and u2 ∈ U2 be the attributes to join on. For every $t \in (K^M \otimes M)^{U_1 \cup U_2}$:
\[ (R_1 \bowtie_{R_1.u_1 = R_2.u_2} R_2)(t) = R_1(t|_{U_1}) \cdot R_2(t|_{U_2}) \cdot [t(u_1) = t(u_2)]. \]

6. aggregation:
\[ AGG_M(R)(t) = \begin{cases} 1 & \text{if } t(u) = \sum_{t' \in supp(R)} R(t') \otimes t'(u) \\ 0 & \text{otherwise,} \end{cases} \]
where t′(u) is the value of t′ in the aggregated column, or the natural number 1 if the aggregation is COUNT.

Boolean queries We have used the notation Q = c, Q ≠ c, where Q is an aggregate query whose result

is a single tuple with a single value and c is a constant.

This is just a notation for the corresponding nested ag-

gregate query, comparing the aggregate result to c, and

its provenance definition thus follows immediately.

Example 11 Consider the aggregation query Q1 given

in Figure 3. The result of the query evaluation with

respect to the relations given in Figure 2 consists of a

single tuple with the aggregated value. Since the aggre-

gation function in this example is COUNT, the aggrega-

tion monoid is that of natural numbers, t′(u) = 1 for all

t′ ∈ R (intuitively the contribution of each tuple in R

to the count of tuples is 1). The provenance expression

of the single tuple in the table resulting by the inner

query is (d1 ·d4), therefore, the value of the tuple in the

result is the expression (d1 · d4)⊗ 1.

Thus, the provenance annotation of the nested, boolean,

query Q1 = 0 is simply [(d1 ·d4)⊗ 1 = 0]. Similarly, the

provenance of the query Q2 = 0 is [d5 ⊗ 1 = 0].
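To illustrate how an equation token such as [(d1 · d4) ⊗ 1 = 0] is resolved once the tuple variables are given truth values, here is a minimal Python sketch. It assumes a specialization to the boolean semiring with COUNT aggregation, so that a simple tensor k ⊗ m contributes m when k evaluates to true and nothing otherwise. The encoding of a monomial as a tuple of variable names and the function names are hypothetical.

```python
def count_value(tensors, valuation):
    """Evaluate a sum of simple tensors (monomial, m) under a truth valuation.

    A monomial is a tuple of tuple-variable names; it holds iff all of its
    variables are mapped to True (i.e. all the corresponding tuples are present).
    """
    total = 0
    for monomial, m in tensors:
        if all(valuation[v] for v in monomial):
            total += m
    return total


def eq_token(tensors, constant, valuation):
    """Truth value of the token [tensors = constant] under the valuation."""
    return count_value(tensors, valuation) == constant


q1 = [(("d1", "d4"), 1)]  # (d1 . d4) (x) 1, the value computed in Example 11
q2 = [(("d5",), 1)]       # d5 (x) 1

all_present = {"d1": True, "d4": True, "d5": True}
d1_deleted = {"d1": False, "d4": True, "d5": True}

q1_is_zero_all = eq_token(q1, 0, all_present)  # the count is 1, so Q1 = 0 fails
q1_is_zero_del = eq_token(q1, 0, d1_deleted)   # the count is 0, so Q1 = 0 holds
q2_is_zero_all = eq_token(q2, 0, all_present)
```

Deleting the tuple annotated by d1 flips the truth value of Q1 = 0, which is the effect a hypothetical deletion has on the guard.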

Incorporation in the DDP model The following proposition was shown in [6].

Proposition 9 The semantics for positive relational

algebra with (nested) aggregates satisfies polynomial time

computability, set-compatibility and commutation with

homomorphism.

Using our results, this means that the provenance

construction for positive relational algebra with (nested)

aggregates, following the above semantics, may be incorporated as an oracle in our construction. The semiring used for data provenance is then simply $K^M$, where K is the semiring used to annotate database tuples and M is the aggregation monoid, and the lift of a homomorphism h is obtained simply by replacing each element k of K appearing in the expression by h(k) [6]. We note that provenance constructions for additional query constructs, such as the one for queries

with difference in [6], may similarly be incorporated as

well.

6 Further examples

Now that we have defined a provenance model for aggregate queries, we revisit and complete our running example of PADDP provenance, showing both the obtained

full-fledged provenance expression as well as examples

for its usefulness.

We first show the obtained expression:

Example 12 Reconsider the provenance expression given

in Example 10, by substitution of the database queries


provenance we obtain the following expression:

(〈12, [d1 · d4 ≠ 0]〉 + 〈10, [d1 · d4 = 0]〉)* · (〈15, [d1 · d4 ≠ 0]〉 + 〈13, [d1 · d4 = 0]〉) · (〈0, [d5 ≠ 0] + [d5 = 0]〉)

Here [d1 · d4 ≠ 0] is in fact shorthand for [[(d1 · d4) ⊗ 1 = 0] = 0], which is the negation of the provenance for Q1 = 0. Intuitively, it means that both tuples annotated with d1 and d4 need to be present for the query to be satisfied;

this will next be reflected in the effect of homomor-

phisms.

Then, the expression may be used e.g. for deletion

propagation, through the notion of homomorphisms and

leveraging the commutation property we have estab-

lished:

Example 13 Say that we are interested in finding the

minimal cost of realizing our running example LTL query,

but this time under the assumption that the Cell Phone

category is no longer available. This corresponds to prop-

agating the hypothetical deletion of the tuple annotated

by d1 (and other tuples are kept), and re-performing the

analysis. Instead, Proposition 8 suggests a much more

efficient alternative: given the computed provenance ex-

pression of Example 10, we can use the homomorphism

h2 : N[D] ↦ B, where (in particular) h2(d1) = false, h2(d4) = true and h2(d5) = true. We use the identity homomorphism as h1 (assuming the user effort quantification stays intact), and obtain:

(h1 | h2)((〈12, [d1 · d4 ≠ 0]〉 + 〈10, [d1 · d4 = 0]〉)* · (〈15, [d1 · d4 ≠ 0]〉 + 〈13, [d1 · d4 = 0]〉) · (〈0, [d5 ≠ 0]〉 + 〈0, [d5 = 0]〉))

= (〈12, false〉 + 〈10, true〉)* · (〈15, false〉 + 〈13, true〉) · 〈0, true〉

and via more simplifications, based on the congruence axioms, we obtain:

〈13, true〉

Indeed, the minimal effort required to reach the goal

given the deletions is 13, realized e.g. through the execu-

tion [Home page, Cat., Product, Shopping Cart, Daily

deals, Payment, PayExit].

We next show another use-case where the generated expression and its properties are leveraged.

Example 14 As another example, from a different meta-

domain, assume that the application supports three lev-

els of memberships. A user can be a club member in one

of the clubs: silver (S), gold (G) and platinum (P ), and

each product or category is annotated with minimal

level of membership. Namely a silver club member can-

not see a product that is annotated with G, but gold

or platinum club members can. Again, we want to find

the minimal cost of realizing our running example LTL

query, but we want to do it now under the assumption that the user is a gold club member. We can use the semiring C = ({1c, S, G, P, 0c}, min, max, 0c, 1c), where 1c < S < G < P < 0c, and in a similar way to the above example, apply the homomorphism h2 : N[D] ↦ C.

For instance, say that the cell phone category is

available to all users (i.e. h2(d1) = 1c), the computers

category is unavailable (h2(d2) = 0c), the fashion cat-

egory is available only to club members (h2(d3) = S),

silver or higher, the smartphones sub category is avail-

able only to gold club members or higher (h2(d4) = G),

and h2(d5) = 1c. As before, we use the identity homomorphism as h1 and obtain:

(h1 | h2)((〈12, [d1 · d4 ≠ 0]〉 + 〈10, [d1 · d4 = 0]〉)* · (〈15, [d1 · d4 ≠ 0]〉 + 〈13, [d1 · d4 = 0]〉) · (〈0, [d5 ≠ 0]〉 + 〈0, [d5 = 0]〉))

= (〈12, [1c · G ≠ 0]〉 + 〈10, [1c · G = 0]〉)* · (〈15, [1c · G ≠ 0]〉 + 〈13, [1c · G = 0]〉) · (〈0, [1c ≠ 0]〉 + 〈0, [1c = 0]〉)

Note that 1c · G = max(1c, G) = G, thus we obtain:

(〈12, [G ≠ 0]〉 + 〈10, [G = 0]〉)* · (〈15, [G ≠ 0]〉 + 〈13, [G = 0]〉) · (〈0, [1c ≠ 0]〉 + 〈0, [1c = 0]〉)

To denote that the user is a gold club member we apply an additional homomorphism h′2, that maps x ∈ C s.t. x ≤ G to 1, and otherwise to 0, for example h′2(G) = 1 and h′2(P) = 0 (we again use the identity homomorphism as h′1). The result of the application is the following expression:

(〈12, true〉 + 〈10, false〉)* · (〈15, true〉 + 〈13, false〉) · (〈0, true〉 + 〈0, false〉)

and via more simplifications, based on the congruence axioms, we obtain:

〈15, true〉

The minimal effort required to reach the goal for a

gold club member is indeed 15, through the execution

[Home page, Cat., SubCat., Product, Shopping Cart,

Daily deals, Payment, PayExit].

Applying homomorphisms So far we have demon-

strated that provenance computation can be done in

K ⊗ L for arbitrary positive K and L. To then get an

item (not a pair) in K (symmetrically L) we can first

apply the homomorphism h : K ⊗ L ↦ K ⊗ K that combines (via Proposition 7) the two mappings h1 : K ↦ K defined by h1(k) = k and h2 : L ↦ K defined by h2(l) = 0K if l = 0 and h2(l) = 1K otherwise. We

get a correct provenance expression in K ⊗K.

Finally, it is desirable that the final provenance outcome should reflect a single value (e.g. a cost, or a truth value) rather than a bag of pairs. For that, we apply a mapping defined by h(k1 ⊗ k2) = k1 · k2, and extended to a full mapping by defining h(s1 + s2) = h(s1) + h(s2) and h(s1 · s2) = h(s1) · h(s2).

Fig. 4: System Architecture (components shown: LTL query Engine, Interactive Explorer, Admin Interface, Spec, Analyst Annotations, View LTL query, Provenance Expression, Homomorphism, Results, State Machine, DB (SQL Server), Provenance Generator, Dashboard)

Example 15 Reconsider the provenance expression obtained in Example 13, 〈13, true〉. We use the identity homomorphism h1 and the homomorphism h2 from the boolean semiring to the tropical semiring, mapping true to the tropical 1 element and false to the tropical 0 element. Composing the mapping h(k1 ⊗ k2) = k1 · k2 with h1 | h2, and applying the result to 〈13, true〉, we finally get 13 as the final answer under the particular hypothetical scenario.
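The two steps of Examples 13 and 15 (specializing the data annotations to truth values, then collapsing each pair to a single tropical value) can be traced with a short Python sketch. It hard-codes the starred expression of Example 13 and assumes non-negative costs, so that the tropical Kleene star of any cost is the tropical 1 element (cost 0). The function names and the dictionary encoding of h2 are assumptions made for illustration.

```python
import math

INF = math.inf  # tropical 0 element (neutral for min); tropical 1 is the cost 0


def collapse(cost, present):
    """h(k1 (x) k2) = k1 . k2, after mapping true to cost 0 and false to INF."""
    return cost + (0 if present else INF)


def t_add(*xs):
    """Tropical addition is min."""
    return min(xs)


def t_mul(*xs):
    """Tropical multiplication is ordinary addition of costs."""
    return sum(xs)


def t_star(x):
    """min(0, x, 2x, ...), which is 0 for the non-negative costs assumed here."""
    assert x >= 0
    return 0


def min_cost(h2):
    """Evaluate the starred expression of Example 13 under a truth assignment h2."""
    q1_nonzero = h2["d1"] and h2["d4"]
    q2_nonzero = h2["d5"]
    loop = t_add(collapse(12, q1_nonzero), collapse(10, not q1_nonzero))
    tail = t_add(collapse(15, q1_nonzero), collapse(13, not q1_nonzero))
    last = t_add(collapse(0, q2_nonzero), collapse(0, not q2_nonzero))
    return t_mul(t_star(loop), tail, last)


# The scenario of Example 13: the tuple annotated by d1 is deleted.
cost_without_d1 = min_cost({"d1": False, "d4": True, "d5": True})
```

Re-running `min_cost` with a different assignment explores another hypothetical scenario without recomputing the provenance expression itself.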

7 Prototype Implementation and Optimizations

We have implemented our provenance management framework in the context of the PROPOLIS system. PROPOLIS is implemented in C# with a WPF GUI using the .NET Framework, and runs on Windows 7. It uses MS SQL Server as

its underlying database management system. We start

by a high-level description of the system architecture,

and then focus on the implementation and optimiza-

tion of its main component, namely the generation of

provenance expressions.

7.1 Architecture

The system architecture is shown in Figure 4. First, the

administrator interface allows specification of a DDP

instance (including both the logical flow in the form of

a Finite State Machine, and the underlying Database,

but no annotations). Then, an analyst specifies the an-

notations, both on the database tuples and the user

choice transitions of the state machine. Intuitively, she

could use annotations to parameterize the analysis at

points whose change she would like to further examine,

in which case variables are used; other annotations may

be used to quantify certain aspects of the computation

(e.g. cost in our running example), so that the tempo-

ral analysis result will further reflect this quantification.

We thus obtain a PADDP.

After the annotations are set, and the analyst has

further specified an analysis task in LTL, the computa-

tion may begin. The Provenance Generator module is

then invoked, applying our algorithm to compute the

provenance expression for the specified LTL query with

respect to the PADDP specification (see an extended

discussion of the module’s implementation below). The

obtained expression is then fed to the Dashboard. The

latter, in turn, displays to the user a list of all vari-

ables occurring in the expression. Interacting with the

dashboard and by assigning values to the variables, the

analyst can interactively and repeatedly explore the ef-

fect of hypothetical scenarios on the LTL analysis re-

sult. The scenarios are specified, using the dashboard,

through the assignment of different values, from semir-

ings of the analyst’s choice (e.g. boolean, costs etc.) to

the different parameters. In effect, this defines a homo-

morphism of the analyst choice, to a chosen structure. A

simple example involves deciding on a subset of tuples /

transitions that are (hypothetically) deleted and assign-

ing truth values to parameters to reflect the choice of

deletion (as in Examples 13 and 15). A dedicated GUI

allows for such choices. The analyst is then presented

with the result of the LTL query for the obtained sce-

nario, and can repeatedly and interactively change the

scenarios (i.e. change the homomorphism), to explore

the effect on the LTL query result.

7.2 Efficient Provenance Generation

We next describe in detail the implementation and op-

timization of the provenance generator module. This

entails both optimization of the generation algorithm

(which is otherwise highly time-consuming, as discussed

next) as well as optimizing the size of the outputted

provenance; the latter naturally greatly affects the time

of interacting with PROPOLIS through the exploration

of hypothetical scenarios.

7.2.1 Implementation Details

The high-level algorithm for provenance generation was

given in Algorithm 1 and we next provide more details

on its implementation. We start by introducing a data

structure that is useful as intermediate representation.

Equation system of starred expressions. We use here

an idea due to [9]. Let Ri be a new “variable”, whose

values intuitively range over starred expressions. Eventually we will store in Ri the starred expression representing the provenance of all possible qualifying sub-executions that start at state qi. Towards achieving that, we generate an equation system, with an equation for each such Ri, of the form:

\[ R_i = \sum_{q_i \xrightarrow{a_j} q_j} a_j R_j \]

If qi is an accepting state then we introduce the term $1_{K \otimes L}$ in the right-hand side of the equation of Ri.

The idea is that each such equation represents the

starred expression Ri in terms of the starred expressions

Rj for all possible following states qj . After intersecting

the LTL formula with the DDP’s state machine (which

is done in a standard manner, see [32]), we generate

this equation system by simply considering the obtained

FSM state by state, in arbitrary order, generating an

equation for each state based on its outgoing transitions

and immediate neighbors.

Example 16 Consider the PADDP of the running ex-

ample, and the LTL formula f = F (PayExit). After

intersection we get a state machine whose structure is

exactly as the one in Figure 1 with PayExit being the

only final state. The equation system is (see state vari-

ables mapping in Table 1):

R0 = 〈2, 1〉R1 + 〈5, 1〉R2
R1 = 〈5, 1〉R2
R2 = 〈0, [Q1 = 0]〉R4 + 〈0, [Q1 ≠ 0]〉R3
R3 = 〈2, 1〉R4 + 〈2, 1〉R8
R4 = 〈3, 1〉R5
R5 = 〈2, 1〉R0 + 〈2, 1〉R6 + 〈2, 1〉R7
R6 = 〈3, 1〉R7 + 〈2, 1〉R8
R7 = 〈0, [Q2 = 0]〉R8 + 〈0, [Q2 ≠ 0]〉R9
R8 = 0
R9 = 1

For example, the equation for R0 is due to its two

out-going transitions: to R1 associated with provenance

〈2, 1〉 and to R2 associated with provenance 〈5, 1〉.
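Generating such an equation system is a single pass over the transitions, as the following Python sketch suggests. It uses only a fragment of the running example (the transitions leaving Daily Deals and Payment, as they appear in the equations for R6 and R7 above); the tuple encoding of transitions, the ASCII rendering of annotations and the function names are assumptions made for illustration.

```python
def build_equations(transitions, accepting):
    """transitions: list of (source, annotation, target) triples.

    Returns a dict mapping each state to the terms (annotation, target) of the
    right-hand side of its equation; an accepting state gets the extra term 1.
    """
    equations = {}
    for src, ann, dst in transitions:
        equations.setdefault(src, []).append((ann, dst))
        equations.setdefault(dst, [])
    for q in accepting:
        equations.setdefault(q, []).append(("1", None))
    return equations


def render(equations, names):
    """Pretty-print the system; a state with no terms gets the right-hand side 0."""
    lines = []
    for q, terms in equations.items():
        rhs = " + ".join(a if t is None else f"{a}{names[t]}" for a, t in terms) or "0"
        lines.append(f"{names[q]} = {rhs}")
    return lines


# Fragment of the running example; variable names follow Table 1.
names = {"Daily Deals": "R6", "Payment": "R7", "Exit": "R8", "PayExit": "R9"}
fragment = [
    ("Daily Deals", "<3, 1>", "Payment"),
    ("Daily Deals", "<2, 1>", "Exit"),
    ("Payment", "<0, [Q2 = 0]>", "Exit"),
    ("Payment", "<0, [Q2 != 0]>", "PayExit"),
]
system = render(build_equations(fragment, accepting={"PayExit"}), names)
```

The rendered lines reproduce the equations for R6 through R9 of the example, up to the ASCII notation.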

Solving the equations. As pointed out in [9], solving

the equations may be done via standard substitution

of variables (i.e. “plugging in” the righthand side of

the equation for some Ri instead of an occurrence of

Ri), in some order (the choice of order applies both

to the choice of Ri and the choice of equation to plug

it into). After substitution, some Rj may occur in the

righthand side of the equation for Rj (due to loops in

the underlying FSM). We then use the following result:

Proposition 10 (adapted from [9]) Given an equation Ri = A · Ri + B where multiplication and addition are in a closed semiring K, Ri is a variable and A, B are arbitrary expressions (possibly involving variables and/or constants), its solution (for Ri, in terms of A and B), is Ri = A∗ · B.

State            Variable
Home Page        R0
New products     R1
Cat.             R2
Sub Cat.         R3
Products         R4
Shopping Cart    R5
Daily Deals      R6
Payment          R7
Exit             R8
PayExit          R9

Table 1: State variables

Since we have shown that the semiring obtained

in our construction is closed, we can thus leverage this

result.
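The substitution procedure, together with the rule Ri = A∗ · B, can be sketched in Python for one concrete closed semiring. The sketch below works in the tropical semiring (min, +) with non-negative costs, where A∗ is always the tropical 1 element (cost 0), so a self-loop can simply be dropped; this is a simplification that does not hold in the general tensor product semiring, where a symbolic starred term has to be kept. The system being solved is a hypothetical toy, not the running example, and all names are illustrative.

```python
import math

INF = math.inf  # tropical 0 element (neutral for min); tropical 1 is the cost 0


def solve(equations, order, target):
    """Solve R_i = sum_j a_ij . R_j + b_i in the tropical semiring (min, +).

    equations: dict var -> (dict var -> coefficient, constant)
    order:     the variables to eliminate, i.e. all variables except `target`
    """
    eqs = {v: (dict(coeffs), const) for v, (coeffs, const) in equations.items()}
    for v in order:
        coeffs, const = eqs.pop(v)
        # Proposition 10: v = A.v + B is solved by A*.B. For non-negative costs
        # A* is the tropical 1 (cost 0), so the self-loop term is just dropped.
        coeffs.pop(v, None)
        for w, (w_coeffs, w_const) in list(eqs.items()):
            a = w_coeffs.pop(v, INF)
            if a == INF:
                continue  # v does not occur in the equation of w
            # Plug the right-hand side of v into the equation of w.
            for x, c in coeffs.items():
                w_coeffs[x] = min(w_coeffs.get(x, INF), a + c)
            eqs[w] = (w_coeffs, min(w_const, a + const))
    coeffs, const = eqs[target]
    assert set(coeffs) <= {target}  # only a self-loop may remain
    return const  # A* . B with A* = 0


# Hypothetical toy system; R3 is accepting, so its constant is the tropical 1.
toy = {
    "R0": ({"R1": 2, "R2": 5}, INF),
    "R1": ({"R2": 5}, INF),
    "R2": ({"R0": 3, "R3": 4}, INF),
    "R3": ({}, 0),
}
best = solve(toy, order=["R3", "R2", "R1"], target="R0")
```

In this specialization the solution for R0 is the cheapest accepting execution from the initial state; the elimination order does not change the result, only the size of the intermediate expressions.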

Example 17 Continuing the above example, the elimination of R8 and R9 results in the following equations:

R3 = 〈2, 1〉R4
R6 = 〈3, 1〉R7
R7 = 〈0, [Q2 ≠ 0]〉

We use the arithmetic operations of the semiring K ⊗ L and the congruence axioms; for instance, substituting the expression of R7 in the equation of R6 results in

R6 = 〈3, 1〉 · 〈0, [Q2 ≠ 0]〉 = 〈3, [Q2 ≠ 0]〉

After the elimination of R1, . . . , R9 we get an equation for R0:

R0 = (〈10, [Q1 = 0]〉 + 〈13, [Q1 ≠ 0]〉) R0 + 〈10, [Q1 = 0] · [Q2 ≠ 0]〉 + 〈12, [Q1 ≠ 0] · [Q2 = 0]〉

The regular expression for R0 is thus

R0 = (〈10, [Q1 = 0]〉 + 〈13, [Q1 ≠ 0]〉)* · (〈10, [Q1 = 0] · [Q2 ≠ 0]〉 + 〈12, [Q1 ≠ 0] · [Q2 = 0]〉)

7.2.2 Optimizations

To allow for scalability of the algorithm, both in terms of

its execution time and size of the outputted expression,

we have employed multiple optimizations in different

parts of the implementation, and we explain them next.

Leveraging Congruences. Beyond the theoretical guar-

antees, the fact that the obtained provenance struc-

ture (K⊗L) forms a semiring and the particular con-

gruence axioms that we have designed allow for sim-

plifications of the provenance expression, reducing its

size. A particularly important axiom in this respect is

(k1 ⊗ l1) · (k2 ⊗ l2) = (k1 · k2)⊗ (l1 · l2) which allows us


to collapse every summand in the provenance expres-

sion into a single paired element. We then check for

the applicability of the other axioms that reduce the

expression size, applying them in a greedy manner.

Delaying Query Evaluation. Recall that the prove-

nance expression combines provenance stemming from

queries on the underlying database, with external prove-

nance. Recall also that Algorithm 1 starts by essen-

tially “replacing” every query with the corresponding

data provenance it admits. A first simple observation is

that there is no need to re-evaluate the same query in

its multiple occurrences, but rather provenance may be

computed only once. A more subtle decision is whether to compute provenance for the queries, and plug the computed provenance into the overall expression, before or after simplifications. Computing provenance for the individual queries before simplifications has the advantage of possibly allowing for more simplifications (that involve combining parts of expressions of different queries), but the disadvantage of manipulating larger expressions (even when sub-expression sharing is employed as discussed below). Our experiments show that the disadvantage consistently outweighs the advantage, and thus we defer the computation of provenance for the individual queries: we first generate a starred expression with query identifiers as “abstract tokens”, and only then replace those tokens with their respective provenance.
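A minimal sketch of this deferred substitution is given below. It assumes the starred expression is a nested tuple whose string leaves are the abstract query tokens; the tuple encoding and the stand-in function fake_provenance are hypothetical and only serve to show that each query's provenance is computed once, however often its token occurs.

```python
def substitute(expr, compute_provenance, cache=None):
    """Replace every query token (a str leaf) by its provenance.

    Each distinct token is passed to compute_provenance exactly once;
    later occurrences reuse the cached result.
    """
    if cache is None:
        cache = {}
    if isinstance(expr, str):
        if expr not in cache:
            cache[expr] = compute_provenance(expr)
        return cache[expr]
    op, *args = expr
    return (op, *[substitute(a, compute_provenance, cache) for a in args])

calls = []

def fake_provenance(token):
    # Hypothetical stand-in for evaluating a query with provenance tracking.
    calls.append(token)
    return ("prov", token)

# Token-level expression, already simplified: Q1 . Q2 + (Q1)*
tokens_expr = ("+", ("*", "Q1", "Q2"), ("star", "Q1"))
full_expr = substitute(tokens_expr, fake_provenance)
```

Simplification operates on the small token-level expression, and the potentially large query provenance is plugged in only at the end.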

Sub-expression Sharing. It is extremely common to

have a sub-expression occurring multiple times in the

overall provenance expression. This in particular hap-

pens when a single transition occurs in multiple paths.

Following ideas from [19], this property is exploited by generating an optimized data structure that maintains pointers to sub-expressions (which are the provenance of queries) rather than the expressions themselves. Provenance for every query is thus computed and stored only once, and the multiple sub-expressions containing the provenance of the query simply point at the stored representation. As explained below, we may

further consider a more complex type of sharing, that

crosses the boundaries of individual queries and looks

at the expression as a whole in an attempt to find and

exploit further commonalities.
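One common way to realize such pointer-based sharing is to intern expression nodes, as in the sketch below. This is an illustrative technique and not necessarily the data structure used in the implementation; the class ExprPool and the string payloads are hypothetical.

```python
class ExprPool:
    """Interns expression nodes so that equal sub-expressions are stored once."""

    def __init__(self):
        self._ids = {}   # node key -> node id
        self.nodes = []  # node id -> node key

    def intern(self, op, *children):
        """Return the id of the node (op, children), creating it if needed.

        Children are ids of previously interned nodes (or a leaf payload),
        so a parent refers to its sub-expressions by pointer, not by copy.
        """
        key = (op, *children)
        if key not in self._ids:
            self._ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self._ids[key]

pool = ExprPool()
q1 = pool.intern("leaf", "prov(Q1)")    # provenance of a query, stored once
a = pool.intern("leaf", "a")
b = pool.intern("leaf", "b")
p1 = pool.intern("*", a, q1)            # two different paths ...
p2 = pool.intern("*", b, q1)            # ... sharing the same query provenance
total = pool.intern("+", p1, p2)
q1_again = pool.intern("leaf", "prov(Q1)")
```

Requesting the provenance of the same query again returns the existing node, so the pool holds six nodes rather than seven.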

Dynamic Programming. When the underlying FSM

is acyclic, we may employ a further optimized dynamic-

programming computation. The computation follows

naturally the structure of the obtained automaton, as

follows. We maintain a table which is indexed by state

and execution length. For each state s and length i, the

entry in the table contains a compact representation

of the provenance of all executions of length i ending

in state s. Given the entries for i − 1 and the possible predecessors of s, the entry for i and s is computed, while avoiding the creation of a copy of the expressions already in the table, and instead pointing at them as sub-expressions.

Fig. 5: Dynamic Programming Table

Example 18 Let P be the fragment of the PADDP of

our running example, that consists of the states Home

page, New products, Cat., Sub cat. and Products, and

the transitions between them. P is a PADDP with no

loops. The starred expression capturing all executions

in P that start at Homepage and end at Products can

be captured by the table shown in Figure 5. The entry

[3,Products] for instance, represents the two possible

executions of length 3 ending in Products. By following

the pointers we obtain their provenance, which is 〈2, 1〉 · 〈5, 1〉 · 〈0, [Q1 = 0]〉 and 〈5, 1〉 · 〈0, [Q1 ≠ 0]〉 · 〈2, 1〉. The expression for all executions ending in Products is the sum of the expressions in the entries [i,Products] for 0 ≤ i ≤ 4, in this case:

〈5, 1〉 · 〈0, [Q1 = 0]〉 + 〈2, 1〉 · 〈5, 1〉 · 〈0, [Q1 = 0]〉 + 〈5, 1〉 · 〈0, [Q1 ≠ 0]〉 · 〈2, 1〉 + 〈2, 1〉 · 〈5, 1〉 · 〈0, [Q1 ≠ 0]〉 · 〈2, 1〉
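The table-based computation can be sketched as follows for an acyclic state machine. The sketch uses a small hypothetical graph (states H, N, C and P with symbolic transition labels a to e) rather than the fragment of the running example, and the function names are assumptions. Each entry stores only pairs of a predecessor state and a transition label, which act as pointers into the previous row of the table.

```python
from collections import defaultdict

def build_table(edges, start, max_len):
    """table[i][s] lists (predecessor, label) pairs for executions of length i
    ending in s; each pair points at table[i - 1][predecessor] instead of
    copying the expressions stored there."""
    table = [defaultdict(list) for _ in range(max_len + 1)]
    reachable = {start}                 # states reached by executions of length i - 1
    for i in range(1, max_len + 1):
        nxt = set()
        for (u, v, label) in edges:
            if u in reachable:
                table[i][v].append((u, label))
                nxt.add(v)
        reachable = nxt
    return table

def expand(table, start, i, s):
    """Follow the pointers to list the label sequences of all executions of
    length i that start at `start` and end in s."""
    if i == 0:
        return [[]] if s == start else []
    return [path + [label]
            for (u, label) in table[i].get(s, [])
            for path in expand(table, start, i - 1, u)]

# Hypothetical acyclic FSM; labels stand for transition annotations.
edges = [("H", "P", "a"), ("H", "N", "b"), ("N", "P", "c"),
         ("H", "C", "d"), ("C", "P", "e")]
table = build_table(edges, "H", 2)
length_1 = expand(table, "H", 1, "P")
length_2 = expand(table, "H", 2, "P")
```

The overall expression for a target state is then the sum, over all lengths i, of the products of labels along each expanded sequence.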

The effect of optimization on the execution time of

the algorithm is discussed in the following section.

8 Experimental Evaluation

We have conducted experiments whose main goals were

examining (1) the scalability of the approach in terms

of the generated provenance size and generation time,

and (2) the extent of usefulness of the approach, namely

the time it takes to specialize the provenance expression

for applications such as those described in this paper.

All experiments were executed on Windows 7, 64-

bit, with 4GB of RAM and Intel Core Duo i5 3.10 GHz

processor.

8.1 Evaluation Benchmark

We have developed a dedicated benchmark that in-

volves both synthetic and real data as follows.

Fig. 6: Arctic Station Finite State Machines [5]. Panels: (a) Serial, (b) Parallel, (c) Dense (fan-out 3, 3 levels).

E-commerce We used two different DDPs for E-

commerce process specifications to examine the per-

formance (provenance size, and generation and usage

time) as a function of the underlying database size. The

first dataset uses the fixed topology of the state machine

used in our running example (Figure 1), which demon-

strates all features of the model, including an FSM

with cycles, queries on a database and external choices.

The second DDP is partially based on the proposed

benchmark for e-commerce applications, described in

[24] (which we translated to an FSM and enriched with

queries and external choices). The underlying database

in both cases was populated with synthetically and ran-

domly generated data, of growing size, up to 5M tuples

(to examine scalability with respect to the database

size). We have annotated some transitions and tuples.

We have examined various LTL queries including those

presented as examples and their counterparts for the

DDP based on [24].

Arctic Stations The second set of DDPs that we

have examined is based on the real-life “Arctic stations”

data, used as a benchmark also in [5] (albeit in a differ-

ent context, of tracking provenance of actually running

executions rather than temporally analyzing possible

executions). This benchmark includes a variety of pro-

cesses that model the operation of meteorological sta-

tions in the Russian Arctic. Their flows are based on

three kinds of topologies, serial, parallel, and dense as

shown in Figure 6. The process specifications include

queries with aggregates, and we have synthetically in-

troduced external effect choices. The underlying real

data consists of 25000 tuples and includes monthly me-

teorological observations, collected in 1961-2000. The

number of nodes in the different process specifications

(corresponding to stations) is at most 25. But to exam-

ine scalability with respect to the FSM size, we have

also considered varying FSM sizes of up to 5000 nodes.

While doing that, we preserved the topologies: for the serial and parallel structures we concatenated copies of the depicted structure such that the last state of the i-th component is the first state of the (i + 1)-th component. For the dense structure we have varied both the fan-out and the number of levels; we report results obtained for increasing FSM size by setting the fan-out to 10 and increasing the number of levels.
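The concatenation of copies can be sketched as below. The four-state component, its labels and the function names are hypothetical, and the sketch assumes that the first and last states of a component are distinct.

```python
def chain_copies(edges, first, last, copies):
    """Concatenate `copies` copies of a component so that the last state of
    copy i is identified with the first state of copy i + 1."""
    def name(state, i):
        if state == first:
            return ("join", i)        # entry point of copy i
        if state == last:
            return ("join", i + 1)    # exit of copy i, entry of copy i + 1
        return (state, i)             # internal state, private to copy i

    out = []
    for i in range(copies):
        out.extend((name(u, i), name(v, i), label) for (u, v, label) in edges)
    return out

def states(edges):
    return {s for (u, v, _) in edges for s in (u, v)}

# Hypothetical parallel-style component with 4 states.
component = [("s", "a", "x"), ("s", "b", "y"), ("a", "t", "z"), ("b", "t", "w")]
chained = chain_copies(component, "s", "t", 3)
```

Chaining k copies of an n-state component in this way yields k·(n − 1) + 1 states, since consecutive copies share one state; here 3 copies of a 4-state component give 10 states.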

Scientific Workflows We have further employed our analysis in the context of Scientific Workflows from MyExperiment.org [36]. We show the results for three representative such workflows, named here Workflow1 (No. 16 in [36]), Workflow2 (No. 204 there) and Workflow3 (No. 72 there), enriched with queries and external choices. The underlying database was populated with synthetically and randomly generated data, of growing size, up to 10M tuples. The number of nodes in these process specifications is at most 55, but to examine scalability with respect to the FSM size, we have also considered varying FSM sizes of up to 1500 nodes, by chaining copies of the original state machines.

Business Processes We have also experimented

with (centralized) Business Process specifications from

[8]. Here again we show the results for three represen-

tative structures, named here BP1 (“order processing”

process in [8]), BP2 (“order fulfilment and procure-

ment”), BP3 (“shipment process of hardware retailer”).

We have considered synthetically and randomly gener-

ated databases with up to 10M tuples and have syn-

thetically considered increasingly growing sizes of the

process specification state machines, up to 1000 nodes,

to further study the scalability of our solution.

We next describe our experimental results with re-

spect to provenance size, provenance computation time

and provenance use time. We have examined all three

aspects with respect to growing DB size and growing

state machine size. The results with respect to grow-

ing DB size are summarized in Figures 7, 8, 9, and the

results with respect to growing state machine size are

summarized in Figures 10, 11 and 12.

8.2 Provenance Size

The first set of experiments aims at studying the size

of obtained provenance expressions, as a function of

the DDP size (state machine and underlying database

sizes). In Figure 7(a) we present the expression size ob-

tained using our two E-commerce datasets and varying

the database size from 0 to 5M tuples. We observe a

moderate growth of the provenance size with respect to

growing database sizes, indicating the scalability of the

approach (expression sizes of about 140MB and 180MB for the two DDPs, for a database of 5M tuples).

We note that the provenance size also depends (in an

expected way) on factors such as the size of join results

for guarding queries. For this dataset we have set the

join result size to be proportional to the input DB size.

We have also varied the join result size and observed an approximately linear dependency of the provenance size on it.

Fig. 7: Expression Size, Generation and Usage Time as Function of DB Size (E-commerce dataset). Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against DB Size; series: E-commerce Running Example and E-commerce Benchmark, with a Competitor series for each in (c).

Fig. 8: Expression Size, Generation and Usage Time as Function of DB Size for Scientific Workflows. Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against DB Size; series: Workflow1, Workflow2 and Workflow3, with a Competitor series for each in (c).

Fig. 9: Expression Size, Generation and Usage Time as Function of DB Size for Business Processes Workflows. Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against DB Size; series: BP 1, BP 2 and BP 3, with a Competitor series for each in (c).

Figures 8(a) and 9(a) show the expression size as a function of the database size for the scientific workflows and business processes respectively, where the database size varied from 0 to 10M tuples. The provenance size grows moderately as a function of the database size, up to 162MB and 61MB for 10M tuples for the scientific workflows and business processes respectively.

In Figure 10(a), we examine the effect of the state machine size on the provenance expression size, showing the results for the Arctic Stations dataset. The figure shows the results for the three topologies, with the number of nodes increased up to 5000 (i.e. up to 200 times the real size). The expressions are compactly represented, allowing for scalability even for the dense structure. Our experiments indicate that for relatively simple structures commonly found in workflows, the theoretical exponential bound on the expression size is not met and feasibly small expressions are obtained. We observed similar results using the scientific workflows and business processes, as shown in Figures 11(a) and 12(a).

8.3 Provenance Generation Time

The second set of experiments aims at assessing the time it takes to generate the provenance expression, again as a function of the DDP size. Figures 7(b), 8(b) and 9(b) present the time it takes to generate the expression with respect to increased database size, and Figures 10(b), 11(b) and 12(b) present the time it takes to generate the expression with respect to increased FSM size.

Fig. 10: Expression Size, Generation and Usage Time as Function of FSM Size (Arctic Stations Dataset). Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against the number of States; series: Serial, Parallel and Dense, with a Competitor series for each in (c).

Fig. 11: Expression Size, Generation and Usage Time as Function of FSM Size for Scientific Workflows. Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against the number of States; series: Workflow1, Workflow2 and Workflow3, with a Competitor series for each in (c).

Fig. 12: Expression Size, Generation and Usage Time as Function of FSM Size for Business Processes Workflows. Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against the number of States; series: BP 1, BP 2 and BP 3, with a Competitor series for each in (c).

Figure 7(b) presents the generation time of the prove-

nance expression, on the two E-commerce DDPs, with

database of increasing size from 0 to 5M tuples. Observe

that the generation time is just under 35 and 25 seconds

for DB size of 5M tuples (for the two DDPs). Recall

that the provenance generation is done offline, render-

ing this computation time very reasonable. Figures 8(b)

and 9(b) present the generation time for the scientific

and business processes workflows with database of in-

creasing size up to 10M tuples. The generation time

was up to about 12 seconds for 10M tuples DB in both

cases.

Figure 10(b) shows provenance generation time as a function of FSM size using the Arctic Stations dataset. For the “serial” and “parallel” structures, generation is very fast even for very large FSMs. For the complex “dense” structure, generation is significantly slower due to the complex graph structure and the need for simplification, but still takes under 1.5 minutes for a very large FSM with 5000 states. The generation times as a function of FSM size for scientific and business process workflows are presented in Figures 11(b) and 12(b) respectively. The observed time was about a minute in the worst case for scientific workflows and 22 seconds for business processes.

Fig. 13: Effect of optimization. Panels: (a) Expression Size [MB], (b) Generation Time [sec], (c) Usage Time [sec], each against the number of States; series: No Optimizations and With Optimizations.

8.4 Using Provenance Information

The last set of experiments aims at assessing the use-

fulness of the approach by examining the time it takes

to “interact” with the expression, i.e. apply homomor-

phisms to observe results under hypothetical scenarios.

In particular we have considered deletion propagation,

and compared the observed times with those obtained

through a simple baseline approach (the “competitor”)

of applying the deletions directly to the DDP and per-

forming optimized temporal analysis of the obtained

specification.
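To make the notion of “interacting” with the expression concrete, the sketch below specializes the starred expression obtained for R0 in Example 17 under different sets of deleted tuples, without re-analyzing the process. Several simplifying assumptions are made for illustration: costs are interpreted in the tropical semiring (min, +) and are non-negative, a guard is decided by checking whether the query result survives the deletions, and the provenance of Q1 and Q2 (tuples t1, t2, t3) is invented. The tuple-based encoding and the function names are hypothetical.

```python
import math

def nonempty(monomials, deleted):
    """A query result survives if some monomial uses no deleted tuple."""
    return any(not (set(m) & deleted) for m in monomials)

def evaluate(expr, prov, deleted):
    """Minimal cost of an execution under the deletions (inf if none exists)."""
    op = expr[0]
    if op == "pair":
        _, cost, guards = expr
        # A guard (q, True) stands for [q != 0]; (q, False) stands for [q = 0].
        ok = all(nonempty(prov[q], deleted) == want for (q, want) in guards)
        return cost if ok else math.inf
    if op == "+":
        return min(evaluate(e, prov, deleted) for e in expr[1:])
    if op == "*":
        return sum(evaluate(e, prov, deleted) for e in expr[1:])
    if op == "star":
        return 0.0  # with non-negative costs, zero iterations are cheapest
    raise ValueError(op)

# R0 = (<10,[Q1=0]> + <13,[Q1!=0]>)* . (<10,[Q1=0].[Q2!=0]> + <12,[Q1!=0].[Q2=0]>)
R0 = ("*",
      ("star", ("+", ("pair", 10, [("Q1", False)]),
                     ("pair", 13, [("Q1", True)]))),
      ("+", ("pair", 10, [("Q1", False), ("Q2", True)]),
            ("pair", 12, [("Q1", True), ("Q2", False)])))

# Hypothetical query provenance: each monomial lists the tuples it joins.
prov = {"Q1": [("t1",)], "Q2": [("t2", "t3")]}

no_deletion = evaluate(R0, prov, set())
delete_t1 = evaluate(R0, prov, {"t1"})
delete_t2 = evaluate(R0, prov, {"t2"})
```

Each hypothetical scenario costs one traversal of the precomputed expression, whereas the baseline described above applies the deletions to the DDP and repeats the temporal analysis.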

Results for deletion propagation using the two approaches are reported in Figures 7(c), 8(c) and 9(c) for growing DB size, and Figures 10(c), 11(c) and 12(c) for growing FSM size. In Fig. 7(c) we see that by using the provenance expression we can consistently and significantly outperform the baseline approach. For a DB of 5M tuples, the gain is about 65% for the benchmark DDP that follows [24], and 53% for our running example DDP. The results also indicate scalability of the approach, allowing the expression to be used in about 2.5 and 3.5 seconds for the benchmark DDP and our running example (resp.), for a DB size of 5M tuples (and much faster for smaller databases). We note that for our examples, all input tuples are reflected in the output and thus in the provenance. The performance of our approach, in all measures and specifically usage time and gain with respect to the competitor, improves significantly as the percentage of such tuples drops (e.g. in a database with higher join selectivity). We observed similar trends in the scientific workflows and business processes, as shown in Figures 8(c) and 9(c). The gain for 10M tuples was up to 92% for the scientific workflows and up to 94% for business processes. Using the provenance expression required less than 1 second for all business workflows we examined, and between 1 and 5 seconds for the scientific workflows.

The results in Fig. 10(c) indicate that for FSM structures commonly found in the context of workflows, applying a homomorphism to the computed expression is very fast. It was instantaneous for all structures of the Arctic Stations dataset (with their actual sizes). Even when extending the FSM size up to 5000 nodes, it required less than 2.8 seconds for the dense structure, 1 second for the parallel structure and 0.8 seconds for the serial one. With respect to the competitor, a very significant gain was achieved for the dense and parallel structures (about 96% and 58% improvement for 5000 nodes). For the simple serial structure, there was no significant gain; this is due to the extreme simplicity of the FSM structure, where the pre-processing effect diminishes. Figures 11(c) and 12(c) show the gain for deletion propagation for growing FSM size using the workflows from [36] and [8]. The average gain was 96%-98% for the scientific workflows and 97%-99% for business processes. Using the expression required less than 0.3 seconds and less than 0.1 seconds respectively.

8.5 Effect of Optimizations

Last, we have experimentally examined the effect of our optimizations on all aspects of the computation: provenance generation time, resulting provenance size, and provenance usage time. Representative results, with and without the optimizations, for the Arctic Stations dataset with dense flow, with fan-out 6 and an increasing number of levels, are shown in Figure 13. Figure 13(a) shows the expression size as a function of the state machine size. For example, while, without optimization, the expression size is up to 16MB for a state machine with 60 nodes, the expression size generated by the optimized algorithm for the same state machine was only 34.5KB. We observed similar trends for other datasets, and for the provenance generation and usage time, as shown in Figures 13(b) and 13(c) respectively. For example, provenance generation time for an FSM with 60 nodes has dropped to only 85 milliseconds using the optimized algorithm, compared to over 35 seconds without optimizations. Usage time has dropped to 1 millisecond from 34 milliseconds for the same input. Overall, the observed gain obtained using the optimizations was very significant: up to 99% in terms of the expression size and up to 97% in terms of generation and usage time.

9 Related Work

Several lines of work are related to the present paper.

Data and Workflow Provenance Provenance for data

transformations has been considered in many papers

(see e.g. [19,11,7,26,10,12]). Provenance was shown use-

ful in a variety of applications, including access con-

trol for data management and provisioning for database

queries [16], and automatic optimization based on hy-

pothetical reasoning for such queries [33,34]. Unique

technical challenges in our settings include accounting

for the workflow in addition to data (including treat-

ment of possibly infinitely many executions) in conjunc-

tion with the combination of different kinds of choices

(external and data-dependent). The work of [31] has

studied combination of annotations, but especially in

the context of dependency between them and only for

the relational algebra.

Workflow Provenance Different approaches for captur-

ing workflow provenance appear in the literature (e.g.

[14,13,3,29,20,38,35]), but none of these models cap-

ture semiring-based provenance for temporal analysis

results of workflow executions. Semiring-based prove-

nance for executions of data-centric workflows was stud-

ied in [5], however the development there (1) tracks

provenance along with the execution rather than sup-

porting analysis over all possible executions, (2) as-

sumes a DAG-based control flow (no cycles), and (3)

does not capture provenance for external effects.

Process analysis and semiring-weighted graphs Tempo-

ral (and specifically LTL) analysis of processes was stud-

ied extensively (see e.g. [32,37] for an overview). The

use of elements of a semiring to weigh transitions has

also been extensively explored, see e.g. [25], and sev-

eral works have further laid the foundations for param-

eterization of temporal analysis based on weights, with

or without the design of semirings that accommodate

these weights (see e.g. [30,15]). The main distinction

from our work is that the process models in these works

are not data-centric, and in particular do not involve

queries to an underlying database. The dependency on

data combined with the “external effects” on control

flow lead to multiple novel challenges in our settings.

These challenges include (1) the need to “correctly”

combine two semirings rather than weigh transitions

based on a single semiring, which is addressed in our

algebraic construction, (2) the formulation and proof of

the properties that hold for the construction and are re-

quired for the usefulness of the approach, and (3) novel

implementation issues.

W3C Prov The W3C PROV family [40] is a specification of a standard for modeling and querying provenance for the web. Applications include explanations, assessments of quality, reliability or trustworthiness of data items, etc. The standard describes a range of transformations for data on the web, such as data generation, use and revision by multiple agents, as well as a formal model for provenance validation [41]. It also describes different perspectives on data through a notion of specialization that resembles inheritance.

We note that provenance for the analysis of possible

executions of a data-dependent process, is studied here

for the first time (through a precise algebraic notion). It

is interesting to explore what aspects described in the

W3C standard may be modeled through the semiring

framework. The aspects that “fit” the framework, may

possibly also be incorporated as part of the PADDP

model, and perhaps accounted for in the analysis. For

instance, trustworthiness may be captured by elements

of the tropical semiring (see [27]); joint or alternative

use of data by different agents may be represented by

the semiring · or + operations. Incorporating into the

semiring framework other concepts such as specializa-

tion/inheritance requires further work.

Business Processes and Web Services There is a wealth

of research, in many different lines, on the analysis of

data-aware (and data-centric) processes. This includes

data-centric Business Processes (e.g. [4,18]) and Web

Services and applications [22,2], and the analysis goals

vary from verification of fully specified processes to the

automatic synthesis of new processes combining a given

set of components. The results obtained in these lines of

research are not applicable to our needs since no prove-

nance model was proposed there. Rather, the focus was

on analyzing a concrete given instance of a (rich) pro-

cess model, and not on accounting for different weight-

ing and for hypothetical changes as in our work.

Since our process model is quite generic, many as-

pects of Business Process (BP) and workflow models

may be captured, and then our results apply to these

models (as demonstrated by our experiments). How-

ever, more work is needed to be able to capture every

aspect of the more complex distributed BP and work-

flow models accounting for communication and paral-

lelism. Developing provenance support for such models

is an intriguing direction for future work.

10 Conclusion

We have presented in this paper a semiring-based provenance framework for temporal analysis of data-dependent

processes. We have studied properties of the frame-

work, from theoretical and experimental perspectives,

and have demonstrated its usefulness. There are ad-

ditional important aspects of data-dependent processes

such as communication and parallelism, for which prove-

nance modeling requires further work. We believe that

the foundations laid in this paper will serve as sound

grounds for the incorporation of provenance support for

such features.

References

1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

2. S. Abiteboul, V. Vianu, B. Fordham, and Y. Yesha. Relational transducers for electronic commerce. In PODS, 1998.

3. A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific workflow management by database management. In SSDBM, 1998.

4. Lakhdar Akroun, Boualem Benatallah, Lhouari Nourine, and Farouk Toumani. Decidability and complexity of simulation preorder for data-centric web services. In ICSOC, pages 535–542, 2014.

5. Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, 5(4):346–357, 2011.

6. Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In Proc. of PODS, 2011.

7. O. Benjelloun, A.D. Sarma, A.Y. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. VLDB J., 17, 2008.

8. http://www.bpmn.org/.

9. Janusz A. Brzozowski. Derivatives of regular expressions. J. ACM, 11(4):481–494, October 1964.

10. P. Buneman, J. Cheney, and S. Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4), 2008.

11. P. Buneman, S. Khanna, and W.C. Tan. Why and where: A characterization of data provenance. In ICDT, 2001.

12. James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

13. D. Cohn and R. Hull. Business artifacts: A data-centric approach to modeling business operations and processes. IEEE Data Eng. Bull., 32(3), 2009.

14. S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, 2008.

15. Conrado Daws. Symbolic and parametric model checking of discrete-time markov chains. In Theoretical Aspects of Computing-ICTAC 2004, pages 280–294. Springer, 2005.

16. D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.

17. D. Deutch, Y. Moskovitch, and V. Tannen. Propolis: Provisioned analysis of data-centric processes (demo). In VLDB, 2013.

18. A. Deutsch, L. Sui, V. Vianu, and D. Zhou. A system for specification and verification of interactive, data-driven web applications. In SIGMOD Conference, 2006.

19. Robert Fink, Larisa Han, and Dan Olteanu. Aggregation in probabilistic databases via knowledge compilation. PVLDB, 5(5), 2012.

20. I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, 2002.

21. J. Nathan Foster, Todd J. Green, and Val Tannen. Annotated xml: queries and provenance. In PODS, pages 271–280, 2008.

22. X. Fu, T. Bultan, and J. Su. Wsat: A tool for formal analysis of web services. In CAV, 2004.

23. Dimitra Giannakopoulou and Klaus Havelund. Automata-based verification of temporal properties on running programs. In ASE, pages 412–416, 2001.

24. M. Gillmann, R. Mindermann, and G. Weikum. Benchmarking and configuration of workflow management systems. In CoopIS, 2000.

25. Michel Gondran and Michel Minoux. Graphs, Dioids and Semirings: New Models and Algorithms. Springer Publishing Company, Incorporated, 2008.

26. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.

27. Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Provenance in orchestra. IEEE Data Eng. Bull., 33(3):9–16, 2010.

28. H. Gruber and M. Holzer. Finite automata, digraph connectivity, and regular expression size. In ICALP, 2008.

29. D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. Pocock, P. Li, and T. Oinn. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34, 2006.

30. T. Hune, J. Romijn, M. Stoelinga, and F. Vaandrager. Linear parametric model checking of timed automata. Springer, 2001.

31. Egor V. Kostylev and Peter Buneman. Combining dependent annotations for relational algebra. In ICDT, pages 196–207, 2012.

32. Z. Manna and A. Pnueli. The temporal logic of reactive and concurrent systems - specification. Springer, 1992.

33. A. Meliou and D. Suciu. Tiresias: the database oracle for how-to queries. In SIGMOD, 2012.

34. Alexandra Meliou, Wolfgang Gatterbauer, and Dan Suciu. Reverse data management. PVLDB, 4(12), 2011.

35. P. Missier, N. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, 2010.

36. http://www.myexperiment.org/.

37. Amir Pnueli. Applications of temporal logic to the specification and verification of reactive systems: a survey of current trends. Springer, 1986.

38. Y. L. Simhan, B. Plale, and D. Gammon. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., 5(2), 2008.

39. Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press, 1989.

40. Prov-overview, w3c working group note, 2013. http://www.w3.org/TR/prov-overview/.

41. Constraints of the prov data model, w3c working group note, 2014. http://www.w3.org/TR/2013/REC-prov-constraints-20130430/.