data exchange with arithmetic comparisonschenli/pub/deac-tr.pdf · data exchange with arithmetic...

30
Data Exchange with Arithmetic Comparisons * Foto Afrati 1 Chen Li 2 Vassia Pavlaki 1 1 National Technical University of Athens (NTUA), Greece {afrati,vpavlaki}@softlab.ntua.gr 2 University of California, Irvine, USA [email protected] Abstract In this paper, the problem of data exchange in the presence of arithmetic comparisons is studied. For this purpose, tuple generating dependencies with arithmetic comparisons (tgd-AC) and arithmetic comparison generating dependencies (acgd) are introduced to capture the need to use arithmetic comparisons in source-to- target and in target constraints (both in their antecedent and in their consequent). Arithmetic comparisons are interpreted over densely totally ordered domains. A chase procedure called AC-chase is introduced to handle such constraints. AC-chase is a tree rather than a sequence and its depth is polynomial on the size of the source instance if the set of target constraints is the union of a set of target acgds with a weakly acyclic set of target tgd-ACs. It is shown that AC-chase computes a universal solution which can be used to compute certain answers for unions of conjunctive queries with arithmetic comparisons (UCQAC). The complexity of existence of a solution is shown to be in NP. The complexity of computing certain answers of UCQAC queries is shown to be coNP-complete. In order to identify polynomial cases, we introduce the succinct AC-chase which however does not always produce a solution. We identify cases where its result is a universal solution and we investigate the syntactic conditions of the query under which we can use the result of succinct AC-chase to compute the certain answers in polynomial time. We show that the latter is also feasible in cases where the result of succinct chase is not a universal solution. 1 Introduction Data exchange has been recognized as a fundamental problem decades ago and has recently attracted the interest of many researchers (e.g., [AL05, Ber03, FKMP03, FKP03, Got05, GN06, KPT06, Lib06, VR07]). In the data exchange problem, we want to take data structured under a schema, called the source schema, and transform it into data structured under another schema, called the target schema, for the purpose of being able to answer queries posed over the target schema. In this paper we investigate the data exchange problem in the presence of arithmetic comparisons which is admittedly a much harder problem. The following is an example of such a problem. A real estate company, called A, has a database S that stores information about apartments and owners in Orange County, California. Fig. 1 shows its two tables, namely Home and Owner. Due to business reasons, the company needs to send some of its data to another real estate company, called B, which has a database T with two tables: Apartment(aID, apartAddr, price, zip, name); Person(name, Addr). The table Apartment contains information about apartments, including the ID (aID), address (apartAddr), price (price), zip code (zip), and owner’s name (name). The table Person stores information about the names (name) and addresses (Addr) of owners. Thus, in order to facilitate the data transfer from the database of company A to the database of company B, we need to transform automatically the data from the source schema to the * Work partially done when all authors were visiting Stanford University. The project is co - funded by the European Social Fund (75%) and National Resources (25%) - Operational Program for Educational and Vocational Training II (EPEAEK II) and particularly the Program HERAKLEITOS.

Upload: others

Post on 29-Mar-2020

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Data Exchange with Arithmetic Comparisons∗

Foto Afrati1 Chen Li2 Vassia Pavlaki†1

1 National Technical University of Athens (NTUA), Greece{afrati,vpavlaki}@softlab.ntua.gr2University of California, Irvine, USA

[email protected]

Abstract

In this paper, the problem of data exchange in the presence of arithmetic comparisons is studied. Forthis purpose, tuple generating dependencies with arithmetic comparisons (tgd-AC) and arithmetic comparisongenerating dependencies (acgd) are introduced to capture the need to use arithmetic comparisons in source-to-target and in target constraints (both in their antecedent and in their consequent). Arithmetic comparisonsare interpreted over densely totally ordered domains. A chase procedure called AC-chase is introduced tohandle such constraints. AC-chase is a tree rather than a sequence and its depth is polynomial on the size ofthe source instance if the set of target constraints is the union of a set of target acgds with a weakly acyclicset of target tgd-ACs. It is shown that AC-chase computes a universal solution which can be used to computecertain answers for unions of conjunctive queries with arithmetic comparisons (UCQAC). The complexity ofexistence of a solution is shown to be in NP. The complexity of computing certain answers of UCQAC queries isshown to be coNP-complete. In order to identify polynomial cases, we introduce the succinct AC-chase whichhowever does not always produce a solution. We identify cases where its result is a universal solution and weinvestigate the syntactic conditions of the query under which we can use the result of succinct AC-chase tocompute the certain answers in polynomial time. We show that the latter is also feasible in cases where theresult of succinct chase is not a universal solution.

1 Introduction

Data exchange has been recognized as a fundamental problem decades ago and has recently attracted the interestof many researchers (e.g., [AL05, Ber03, FKMP03, FKP03, Got05, GN06, KPT06, Lib06, VR07]). In the dataexchange problem, we want to take data structured under a schema, called the source schema, and transformit into data structured under another schema, called the target schema, for the purpose of being able to answerqueries posed over the target schema. In this paper we investigate the data exchange problem in the presenceof arithmetic comparisons which is admittedly a much harder problem. The following is an example of such aproblem.

A real estate company, called A, has a database S that stores information about apartments and owners inOrange County, California. Fig. 1 shows its two tables, namely Home and Owner. Due to business reasons, thecompany needs to send some of its data to another real estate company, called B, which has a database T withtwo tables:

Apartment(aID, apartAddr, price, zip, name);Person(name, Addr).

The table Apartment contains information about apartments, including the ID (aID), address (apartAddr), price(price), zip code (zip), and owner’s name (name). The table Person stores information about the names (name)and addresses (Addr) of owners. Thus, in order to facilitate the data transfer from the database of companyA to the database of company B, we need to transform automatically the data from the source schema to the

∗Work partially done when all authors were visiting Stanford University.†The project is co - funded by the European Social Fund (75%) and National Resources (25%) - Operational Program for

Educational and Vocational Training II (EPEAEK II) and particularly the Program HERAKLEITOS.

Page 2: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

target schema. Fagin et al. [FKMP03] showed that tuple-generating dependencies (tgds) and equality-generatingdependencies (egds) are very powerful to define many kinds of semantic correspondences between the schemas and,hence, they can be used to transform data in a data exchange problem. Intuitively, a tuple generating dependencyexpresses that if some facts exist in a database, then some other facts should also exist. Thus, in our example,the following two tgds could facilitate the data transformation: dst1 : Home(I, A, P) → Apartment(I, A, P, Z, N) anddst2 : Owner(I, N, A) → Person(N, A).

Home:aID apartAddr price

13 82 Main St 45054 12 Alton Ave 30055 14 Alton Ave 800

Owner:aID name Addr

13 Tom Smith Highland Ave54 John Parker Franklin Ave43 Ken Johnson Culver Dr

Figure 1: Source database at company A.

However, suppose we want to transfer data only about apartments with price ≤ 500 thousand dollars. Supposealso that we know that the zip code of apartments in company A is less than 92000 (e.g., because we know that allapartments are in Orange County), and we should store this information in the database of company B. Tgds andegds cannot handle either of these two cases. In this paper, we introduce tuple generating dependencies that usealso arithmetic comparisons, called tuple-generating dependencies with arithmetic comparisons (tgd-ACs in short),by extending the class of tgds to include comparison predicates. We also extend the class of egds to allow forany arithmetic comparison on the consequent and we call them arithmetic-comparison-generating dependencies(acgds in short). The need for including arithmetic comparisons to data dependencies has been observed in earlierwork [BCW99, IO93, MS96, Mah97].

In our example, the following two tgd-ACs will transform the data while satisfying also the two conditionsabove: dst1 : Home(I, A, P) ∧ P ≤ 500 → Apartment(I, A, P, Z, N) ∧ Z > 92000 and dst2 : Owner(I, N, A) →Person(N, A). Finally, besides using tgd-ACs and acgds to declare how we want the data to be transformedfrom the source schema to the target schema (source-to-target constraints), we may also use them to declare thatsome constraints are required to be satisfied on the target data (target constraints).

The data exchange problem that we investigate is the following: We are given a source schema S, a targetschema T and a set of source-to-target constraints and target constraints. For any source instance, we want tofind a target instance which satisfies the constraints (existence of solution problem). Moreover, for a given queryq, we want to find the certain answers of q (query answering problem). These problems and their complexityare investigated in [FKMP03] for the case of constraints that are tgds and egds and for conjunctive queries andconjunctive queries with inequalities (6=). In the present paper, we investigate the complexity of these problemsfor the case of constraints that are tgd-ACs and acgds and for conjunctive queries with arithmetic comparisons(<,≤, >,≥, 6=, =).Summary of Results: It is known that reasoning on logical formulas with arithmetic comparisons is morecomplicated than logical formulas without them. More specifically, many of the technical tools used to solve dataexchange problems are closely related to query containment tests. Conjunctive query containment is recognizedas a much simpler problem (NP-complete) than containment of conjunctive queries in the presence of arithmeticcomparisons (ΠP

2 -complete) [CM77, Klu88, Mey92]. Thus, in our setting we have to deal with all additionalcomplications that are introduced by the presence of arithmetic comparisons. The organization of the paper andour contributions are as follows:

In Section 2, we define the concepts that we use. Since arithmetic comparisons are interpreted over totallyordered domains, the concept of a tgd-AC being satisfied on a database instance which includes labeled nulls isnot straightforward. Hence, new notions are needed. The notion of universal solution (which is used to computecertain answers of queries) is extended to be a set of solutions rather than a single solution. Interestingly, alsoin our setting universal solutions have the property that any two of them are equivalent, if viewed as unions ofconjunctive queries with arithmetic comparisons, (Proposition 2.1) which is an extension of the property noticedin [FKMP03].

In Section 3, we define a novel chase procedure called AC-chase which is a tree rather than a sequence andcan handle tgd-ACs and acgds. We show that the result of AC-chase is a universal solution (Theorem 3.1). We

Page 3: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Constraints Query Language Complexity Reference

tgds and egds UCQ with two 6= per CQ coNP-hard [FKMP03]

tgds and egds CQ with seven 6= coNP-hard [AD98]

tgds and egds CQ with two 6= coNP-hard [Mad05]

tgds and egds UCQ with inequalities ( 6=) coNP [FKMP03]

tgd-ACs and acgds UCQAC coNP Section 4

tgds and egds UCQ, at most one 6= per CQ PTIME [FKMP03]

full tgd-ACs UCQAC, each CQAC has the PTIME Section 5and acgds homomorphism property

tgd-ACs and acgds UCQAC, each CQAC has the PTIME Section 5no target constraints homomorphism property

tgds and egds UCQAC, each CQAC uses open PTIME Section 5(closed) LSI (RSI) comparisons only

Table 1: Complexity results on query answering in data exchange for weakly acyclic dependencies. “CQ” =conjunctive query; “UCQ” = union of conjunctive queries; “(U)CQAC” = (union of) conjunctive query (queries)with arithmetic comparisons.

extend the definition of weakly acyclic set of tgds to define weakly acyclic set of tgd-ACs. We prove that for aset of weakly acyclic set of tgd-ACs and acgds, the AC-chase always produces a tree of polynomial length (on thesize of the input source instance). From that, we derive the complexity results, which show that the problem ofexistence of solution is in NP (Theorem 3.2).

In Section 4, we show that we can compute certain answers of unions of conjunctive queries with arithmeticcomparisons using the result of the AC-chase (Theorem 4.1(1)). Again, since arithmetic comparisons areinterpreted over totally ordered domains, computing such queries on a database instance with labeled nulls isa challenge and the definition of some new concepts is needed. In particular, we first define the concept ofcomputing certain answers on an instance. We prove that the data complexity of the problem of computingcertain answers for the data exchange problem with arithmetic comparisons is co-NP complete (Theorem 4.2).Interestingly, we have been able to extend another observation made in [FKMP03] about properties of universalsolution (Theorem 4.1(2)).

In Section 5, we identify polynomial cases of the problem. For that, we define another chase procedurecalled succinct AC-chase which is a sequence and whenever it uses acgds and weakly acyclic tgd-ACs, its lengthis polynomial (on the size of the source instance) (Theorem 5.1). However, it does not produce always auniversal solution. We show that universal solutions are produced by the succinct AC-chase in two cases: (a)when there are no target constraints and (b) when we use full tgd-ACs (i.e., without existentially quantifiedvariables)(Theorem 5.2). On query answering we show that whenever the query has the homomorphism propertythen we can compute certain answers in polynomial time (Lemma 5.1) on a given instance. A consequence of thatis Theorem 5.4 which settles the problem of computing certain answers in polynomial time for the cases wheresuccinct AC-chase produces a universal solution.However, we identify an additional case where certain answers of queries with the homomorphism property canbe computed in polynomial time on the result of the succinct AC-chase, although the latter is not a universalsolution – it is when we use only tgds and egds and the query has a certain property (Theorem 5.5)1. A summaryof all our polynomial results can be found in Section 5.5. Table 1 summarizes existing complexity results oncomputing certain answers.

The usefulness of the homomorphism property has been observed for reducing the complexity of the querycontainment problem [ALM04, Klu88] in the presence of arithmetic comparisons. It contains large classes ofqueries such as queries which use only comparisons of the form X < c where X is a variable and c is a constant.

1Note our notion of solution does not coincide with that in [FKMP03]. Hence, although when we use tgds and egds the succinctAC-chase reduces to the chase, its result does not qualify as solution. This remark is implicitly made in [FKMP03] by noticing thatthis result cannot be used to compute certain answers of queries with 6= and hence the disjunctive chase was introduced there.

Page 4: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

2 Data Exchange with Arithmetic Comparisons.

In this section we define a) dependencies with arithmetic comparisons, b) data exchange setting with arithmeticcomparisons, and c) the concepts of solution and universal solution. We prove the first important result which saysthat universal solutions as defined here have good properties in that any pair of universal solutions are equivalentif viewed as queries (Proposition 2.1). For that, we review the query containment problem for conjunctive querieswith arithmetic comparisons.

2.1 Dependencies with Arithmetic Comparisons. We first define the class of arithmetic comparisongenerating dependencies, which extends equality-generating dependencies (egds). We then define the class oftuple-generating dependencies with arithmetic comparisons, which extends tuple-generating dependencies (tgds).We consider arithmetic comparison predicates of the form “AθB”, where A is a variable, B is a variable or aconstant, and θ is in {≤,≥, <, >, 6=,=}. They are interpreted over densely totally ordered domains. We call anarithmetic comparison open if its operator is < or >, and closed if its operator is ≤ or ≥. We call an arithmeticcomparison AθB semi-interval (SI in short) if either A or B is a constant. If B is a constant we call it leftsemi-interval and if A is a constant we call it right semi-interval.

definition 2.1. (acgd & tgd-AC)Let D be a database schema.• An arithmetic-comparison-generating dependency (acgd in short) is a first-order formula of the form

∀x(φ(x) ∧ βφ → (x1 θ x2)).

x1 is a variable from x and x2 is either a variable from x or constant.• A tuple-generating dependency with arithmetic comparisons (tgd-AC in short) is a first-order formula of theform

∀x(φ(x) ∧ βφ → ∃yψ(x,y) ∧ βψ

).

In the formulas, φ(x) is a conjunction of atomic relational formulas over D, with variables in x. Each variablein x occurs in at least one formula in φ(x). In addition, ψ(x,y) is a conjunction of atomic relational formulaswith variables in x and y, and each variable in y occurs in at least one formula in ψ(x,y). Also βφ and βψ areeach a conjunction of comparison predicates with variables from φ(x) and ψ(x,y), respectively.

In general, a conjunctive formula with arithmetic comparisons is denoted as F = F0 + βF , where F0 contains therelational atoms and βF the comparisons of F . A conjunctive formula is in compact form if the conjunction of itscomparisons βF do not imply equalities.

2.2 Preliminary Definitions. Let S ={S1, . . . , Sn} and T = {T1, . . . , Tm} be two disjoint schemas, whereeach element in the sets is a relation. A source-to-target dependency in Σst is a tgd-AC of the form:∀x(

φS(x) ∧ βφ → ∃yψT(x,y) ∧ βψ

). A target dependency is a dependency over the target schema T. Every

target dependency in Σt is either a tgd-AC of the form ∀x(φT(x) ∧ βφ → ∃yψT(x,y) ∧ βψ

), or an acgd of the

form ∀x(φT(x) ∧ βφ → (x1θx2)). In source-to-target and target dependencies, φS(x), ψT(x,y), and φT(x) are

each a conjunction of atomic relational formulas over S or T respectively, and βφ and βψ are conjunctions ofcomparison predicates.

A data exchange setting with arithmetic comparisons, denoted M = (S,T, Σst,Σt), consists of a source schemaS, a target schema T, a set Σst of source-to-target dependencies, and a set Σt of target dependencies. We useConst to denote the set of constants that may occur in the database. We assume constants are from an infinitetotally ordered dense domain such as the rationals or reals. We assume an infinite set Var of values, called labelednulls, such that Const∩ Var = ∅. We define instances as conjunctive formulas with arithmetic comparisons usingconstants and labeled nulls. For an instance K, we use Var(K) and Const(K) to denote the set of labeled nullsand the set of constants in K, respectively. An instance K where Var(K) is empty, is called ground instance. Weuse Const(Σst ∪ Σt) to denote the set of constants appearing in source-to-target and target dependencies.

If M = (S,T, Σst,Σt) is a data exchange setting with arithmetic comparisons, and I is a ground instanceof S, then an instance K is called a t-instance with respect to M and I, or simply a t-instance (if M andI are understood from the context), if the arithmetic comparisons of K define a total order with respect toC = Const(K) ∪ Var(K) ∪ Const(Σst ∪ Σt) ∪ Const(I). A t-instance K ′ is called a t-instance induced by aninstance K if K ′ results from K by adding to K additional arithmetic comparisons so that a total order is definedamong the labeled nulls and constants in C, possibly with some labeled nulls or constants equated. We define the

Page 5: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

concept of t-homomorphism as a homomorphism which a) satisfies the arithmetic comparisons and b) it maps aconjunctive formula on a t-instance. The formal definition follows.

definition 2.2. (t-homomorphism)Let F = F0 + βF be a conjunctive formula and K = K0 + βK be a t-instance in compact form over the sameschema. A t-homomorphism from F to K is a mapping h from Const(F ) ∪ Var(F ) to Const(K) ∪ Var (K) withthe following properties:• Every constant in F is mapped to itself.• For every fact R(x1, . . . , xk) in F we have R(h(x1), . . . , h(xk)) in K.• βK logically implies h(βF ), denoted by βK ⇒ h(βF ).

We say F t-homomorphically maps to K if there is a t-homomorphism from F to K. Note that a t-homomorphismh from F to K directly implies a homomorphism h0 from F0 to K0. We call h0 the underlined homomorphismof h.

There might not be a ground instance corresponding to an instance. We say an instance K is consistent if thereis a non-empty ground instance K ′ such that there is a t-homomorphism from K to K ′. We check whether aninstance K is consistent by constructing the inequality graph using the arithmetic comparisons and finding cyclesin this graph [Klu88]. Using the cycles in the inequality graph, we can also convert a t-instance to a compactt-instance. It can be done in polynomial time. Example 2.1 illustrates the concepts defined here such as instance,t-instance, ground instance.

example 2.1. Let M = (S,T, Σst,Σt) be a data exchange setting with arithmetic comparisons in which S hasa relation E(A,B), T a relation H(C, D), Σt = ∅, and Σst = {E(X, Y), X < Y→ ∃Z(H(X, Z), H(Z, Y), Z < X)}. LetI = {E(2, 5), E(4, 7)} be a ground instance of the source S. Consider the following instances.

J1 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z2 < z1 < 2};J2 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 = z2 < 2};J3 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < z2 < 2};J4 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < 2 = z2};J5 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < 2 < z2 < 4};J6 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < 2, z2 > 7};J7 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z2 < z1 < 3};J8 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < 2, z2 ≤ 2};J9 = {H(2, z1), H(z1, 5), H(4, z2), H(z2, 7), z1 < 2, z2 < 4};J10 = {H(2, 1), H(1, 5), H(4, 0), H(0, 7)};J11 = {H(2, 2), H(2, 5), H(4, 0), H(0, 7), H(8, 7)}.

J1 is a t-instance, since its arithmetic comparisons define a total order of its labeled nulls (z1 and z2), its constants(2, 4, 5, and 7), and the constants in I (there is no constant in the tgd-AC). Similarly, J2, J3, J4, J5, and J6

are also t-instances. J7, J8, and J9 are instances, but not t-instances, since the arithmetic comparisons of each ofthem do not define a total order of its values (constants and labeled nulls). Both J10 and J11 are ground instancesover T.

There is a t-homomorphism h from J7 to J1, where h is the identity mapping from Const ∪ Var(J7) to Const∪ Var(J1). In particular, the mapping turns z2 < z1 < 3 to z2 < z1 < 3, which is implied by z2 < z1 < 2.Similarly, J1 t-homomorphically maps to the ground instance J10. There is no t-homomorphism from J6 to J1.Finally, J1, J2, J3, and J4 are all the t-instances induced by J8.

J1 is a compact t-instance, since its comparisons do not imply equalities. J2 is not, since it has an equalityz1 = z2. A compact t-instance of J2 is J ′2 = {H(2, z1), H(z1, 5), H(4, z1), H(z1, 7), z1 < 2}, which is obtained fromJ2 by replacing z2 with z1.

Let d be a tgd-AC or an acgd, and K be a t-instance of the corresponding schema. K satisfies d if wheneverthere is a t-homomorphism h from the left-hand side (lhs in short) of d to K, there exists an extension h′ of h,where h′ is a t-homomorphism from the conjunction of the lhs and the right-hand side (rhs in short) of d to K.Notice that for an egd or a tgd, there is always a non-empty database on which the dependency is satisfied. Thisproperty does not hold for tgd-ACs and acgds. We call a tgd-AC or an acgd consistent if there exists a non-emptydatabase instance on which this dependency is satisfied.

Page 6: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

2.3 T-solutions, universal t-solutions, universal solutions. Let M = (S,T, Σst,Σt) be a data exchangesetting with arithmetic comparisons, and I a ground instance of the source schema S. We say an instance Jover T is a t-solution for I under M (or, simply a t-solution if I and M are clear from the context), if J is at-instance, and the instance 〈I, J〉 = I ∪ J satisfies Σst ∪ Σt. A t-solution that is a ground instance is called aground solution. We say an instance J of T is a solution for I under M (or, simply a solution if I and M areclear from the context), if every t-instance induced by J is a t-solution. A universal solution for I under M , orsimply a universal solution is a set J = {J1, . . . , Jm} of solutions for I, such that for every ground solution K,there is a solution Ji ∈ J that t-homomorphically maps to K. A universal t-solution is a universal solution whichcontains only t-solutions.

example 2.2. In Example 2.1, the set J1 = {J1, . . . , J5} is a universal t-solution. J2 = {J5, J8} and J3 = {J9}are two universal solutions. It is easy to check that the source-to-target constraint is satisfied on the union of Iand every t-instance in J1. Also, since J1, . . . , J5 are all the induced instances of J8, the constraint is also satisfiedon J8. How to prove that J1 is a universal solution is explained in the next section (Example 3.3). How to provethat J3 is a universal solution is explained in Section 5 (Example 5.2). If J3 is a universal solution, then J2 isalso a universal solution because if there is a t-homomorphism from an instance F to a t-instance K, then thereis some induced t-instance of F which t-homomorphically maps on K (the proof can be found in Lemma 3.2).

To every instance K we associate a boolean conjunctive query with arithmetic comparisons qK , whose bodycontains exactly all the facts in K as subgoals (labeled nulls are replaced by variables), and the (possibly partial)order is added to qK as arithmetic comparison predicates. The predicate in the head uses a fresh predicate name.We call qK the corresponding boolean query of K. To a set of instances we associate a corresponding boolean queryby creating a boolean query for each instance with the same head predicate and taking the union. Proposition2.1 shows that the universal solution has good properties. In what follows we use SOL(M, I) to denote the setof all solutions for I.

Proposition 2.1. Let M = (S,T, Σst,Σt) be a data exchange setting with arithmetic comparisons. The followinghold.

1. If J and J ′ are two universal solutions for a source instance I, then the corresponding boolean queries qJ

and qJ′are equivalent as queries.

2. Let I and I ′ be two source instances, J be a universal solution for I, and J ′ be a universal solution for I ′

under the same data exchange setting M . Then, SOL(M, I) ⊆ SOL(M, I ′) iff qJ is contained in qJ′. Moreover,

SOL(M, I) = SOL(M, I ′) iff qJ and qJ′are equivalent as queries.

The proof of this proposition uses techniques from query containment tests. Subsections 2.4 and 2.5 review thetechniques for checking containment between two conjunctive query with arithmetic comparisons (and betweenunions of them) and in Subsection 2.6 we prove Proposition 2.1.

We consider query containment over densely totally ordered domains. The following is an example whichillustrates why containment is different if interpreted over discrete domains.

example 2.3. Consider queries Q1() : −r(X), 3 < X < 4 and Q2() : −r(X), 13 < X < 14. If the comparisonsare interpreted over the integers then they are both empty on all databases, hence they are equivalent. However ifcomparisons are interpreted over the reals then on database {r(3.5)}, query Q1 computes to true whereas queryQ2 computes to false.

2.4 Containment of Conjunctive Queries with Arithmetic Comparisons (CQACs). In this sectionwe review the problem of checking containment between two conjunctive queries with arithmetic comparisons(CQACs). Kolaitis [Kol05] observed that tgds can be seen as containment between two conjunctive queries. Inthat sense the intuition behind the check for containment between two CQACs appears in many proofs in thepresent paper.

A conjunctive query with arithmetic comparisons (CQAC) is of the formh(X) : −e1(X1), . . . , ek(Xk), C1, . . . , Cm

where each Ci is an arithmetic comparison.In general, testing containment between two CQACs is more complex than checking containment between

two conjunctive queries. The main reason is that when checking containment of two conjunctive queries with

Page 7: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

arithmetic comparisons (CQACs), a single containment mapping in general does not suffice, as it is the case forconjunctive queries. The following example illustrates the complexities that arise when checking containment oftwo CQACs.

example 2.4. Consider the following two queries, which are graphically illustrated in Figure 2.

Q1() : −r(X1, X2), r(X2, X3), r(X3, X4), r(X4, X5),r(X5, X1), X1 < X2.

Q2() : −r(X1, X2), r(X2, X3), r(X3, X4), r(X4, X5),r(X5, X1), X1 < X3.X1rX5 rX4 r r X2X3r

X1rX5 rX4 r r X2X3rQ1 : X1 < X2 Q2 : X1 < X3Figure 2: Graph representation of Q1 and Q2.

Although the two queries have different comparisons, surprisingly, Q1 ≡ Q2. To show Q1 v Q2, we considerthe five mappings from the five ordinary subgoals of Q2 to the five of Q1. Each mapping corresponds to a“rotation” of the variables. Under these mappings, the ACs of Q2 become X1 < X3, X2 < X4, X3 < X5,X4 < X1, and X5 < X2, respectively. According to the containment test between CQACs of Gupta and Zhang-Ozsoyoglu [Gup94, ZO93] (see below),2 it is easy, to see that Q1 v Q2 since if the right hand side of the followingimplication is false, then X1 = X2:(X1 < X2) ⇒ (X1 < X3) ∨ (X2 < X4) ∨ (X3 < X5)∨

(X4 < X1) ∨ (X5 < X2).Similarly, we can prove that Q2 v Q1. Notice that a single containment mapping does not suffice to checkcontainment of two CQACs.

Let Q1 and Q2 be two CQACs. There are two algorithms in the literature to test whether Q2 v Q1: a) thetest of canonical databases and b) the test based on logical implication [GSUW94, Klu88]. We briefly presenteach of them for completeness (in our proofs we make explicit use of the test based on the canonical databases).

a) Containment Test based on Canonical Databases.This test is a generalization of the Levy-Sagiv test [LS93], and it is due to Klug [Klu88]. The test producesan exponential number of canonical databases, each of which could be a counterexample to the containment.Suppose we want to test Q1 v Q2. We do the following:

1. Consider all partitions of the variables of Q1. For each partition, consider all possible total orders of theblocks of this partition, and assign to each block of the partition a unique constant so that the assigned constantssatisfy the total order. Then substitute the variables in each block of the partition by the corresponding constant.Thus we obtain a number of canonical databases D1, D2, . . . , Dk, one database for each different order in eachpartition. Each Di consists of the frozen subgoals of Q1 excluding the subgoals having comparison predicates.

2. For all Di (together with the corresponding total order) that make the body of Q1 true, test whetherQ2(Di) includes the frozen head of Q1. The frozen head of Q1 is obtained by making the same substitution ofconstants for variables that yielded Di.

3. Q1 v Q2 if and only if (2) holds.

2Technically the containment testing requires normalization of the two queries. We can show that in this example the argumentalso holds after the normalization.

Page 8: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

b) Containment Test based on Logical Implication.This test was introduced by Gupta [Gup94] and Zhang-Ozsoyoglu [ZO93] in parallel work. First we need tonormalize both queries Q1 and Q2 to Q′

1, Q′2 respectively as follows:1. Replace each occurrence of a shared variable X in the ordinary subgoals except the first occurrence, by a

new distinct variable Xi, and add X = Xi to the comparisons of the query.2. Replace each constant c in the query by a new distinct variable Z, and add Z = c to the comparisons of

the query.The following theorem states the main claim of the second algorithm.

Theorem 2.1. [GSUW94, Klu88] Let Q1, Q2 be CQACs and Q′1 = Q′1,0 + β′1, Q′2 = Q′

2,0 + β′2 be the queriesafter normalization. Let µ1, . . . , µk be all the mappings (homomorphisms) from Q′1,0 to Q′2,0. Then Q2 v Q1 ifand only if the following logical implication ϕ is true:

ϕ : β′2 ⇒ µ1(β′1) ∨ . . . ∨ µk(β′1).

That is, the comparisons in the normalized query Q′2 logically imply (denoted “⇒”) the disjunction of the imagesof the comparisons of the normalized query Q′

1 under these mappings.

Notice that in Theorem 2.1, the “or” operation (∨) in the implication is critical, since there might not be asingle mapping µi from Q1,0 to Q2,0, such that β2 ⇒ µi(β1). Another important observation is the following. Totest the containment of two queries Q1 and Q2, using the result in Theorem 2.1, we need to normalize them first.Introducing more comparisons to the queries in the normalization can make the implication test computationallymore expensive. Thus, we may want to have a containment result that does not require the queries to benormalized. [ALM04] presented two cases, in which even if Q1 and Q2 are not normalized, we still have Q2 v Q1

if and only if β2 ⇒ µ1(β1) ∨ . . . ∨ µk(β1), where the µi’s are all the homomorphisms from the normal subgoals inQ1 to the normal subgoals in Q2.

2.5 Containment test for unions of CQACs. To prove Proposition 2.1 we need to use a test for checkingcontainment between two unions of CQACs. Figure 3 shows a procedure to test if Q1 v Q2, where Q1, Q2 areunions of CQACs, which was first presented in [Klu88].

Input: Two unions of CQACs: Q1 = ∪Qi1, Q2 = ∪Qj

2.Output: Boolean value to show if Q1 v Q2.Method:(1) For every CQAC query Qi

1 in Q1 do:

(a) Consider all partitions of the variables of Qi1. For each partition, consider all possible total orders of the

blocks of this partition, and assign to each block of the partition a unique constant so that the assignedconstants satisfy the total order. Then substitute the variables in each block of the partition by thecorresponding constant. Thus we obtain a number of canonical databases D1, D2, . . . , Dk, one databasefor each different order in each partition. Each Dj consists of the frozen subgoals of Qi

1 excluding thesubgoals having comparison predicates.

(b) For all Dj (together with the corresponding total order) that make the body of Qi1 true, test whether

Q2(Dj) includes the frozen head of Qi1. The frozen head of Qi

1 is obtained by making the same substitutionof constants for variables that yielded Dj .

(2) If step (1) holds for all Qi1 in Q1, then output “yes”, otherwise, output “no”.

Figure 3: Procedure for testing containment of unions of CQACs

Theorem 2.2. The procedure in Figure 3 can correctly test if Q1 v Q2 [Klu88].

Proof. Suppose the test output “yes.” Let D be a database and let t be a tuple in the answers of query Q1

on D, i.e., t is in Q1(D). Then, there is a CQAC query Qi1 in Q1, such that t is in Qi

1(D). Hence there is a

Page 9: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

homomorphism from the ordinary subgoals in the body of Qi1 to D, which maps the head of Qi

1 on t and satisfiesthe arithmetic comparisons. The image of this homomorphism is isomorphic to one of the databases Dj thatmake the body of Qi

1 true. Thus the answers of Q2 on D also contain the tuple t. Therefore, Q1 v Q2.Suppose Q1 v Q2. Then every database Dj that makes the body of some Qi

1 true is such that the frozenhead of Qi

1 is in the answers of Q1. Since Q1 v Q2, the frozen head of Qi1 is in the answers of Q2 too. Hence the

test will report “yes.”

2.6 Proof of Proposition 2.1. First we need the following lemma which proves a property of qJ .

Lemma 2.1. • Let K be an instance and qK its corresponding boolean query. Then the following hold: a) forevery canonical database D of qK such that qK(D) is true, there is a t-instance induced by K such that thereis a surjective and injective t-homomorphism from K to D; b) for every t-instance Ki induced by K, there isa canonical database D of qK , where qK(D) is true, and there is a surjective and injective t-homomorphismfrom Ki to D.

• Let J be a universal solution and qJ its corresponding boolean query (a union of CQACs). Then thefollowing hold: a) for every CQAC query qJi in qJ it holds: for every canonical database D of qJi such thatqJi (D) is true, there is a t-instance Jjk

induced by a solution Jj in J , such that there is a surjective andinjective t-homomorphism from Jjk

to D; and b) for every solution Jj in J it holds: for every t-instanceJjk

induced by Jj, there is a CQAC qJi in qJ , such that there is a canonical database D of qJi where qJi (D)is true, and there is a surjective and injective t-homomorphism from Jjk

to D.

Proof. We will prove each item separately.

• (a) Let D be a canonical database of qK . Hence there is a t-homomorphism h from qK to D which issurjective. Since qK is the corresponding boolean query of K, K essentially is obtained after renaming thevariables of qK . Hence t-homomorphism h implies a t-homomorphism h′ from K to D, which is surjective.

(b) Let KD be a t-instance induced by K. Hence there is a t-homomorphism h from K to KD which issurjective. Since qK is the corresponding boolean query of K, for the same reason as above, t-homomorphismh implies a t-homomorphism h′ from qK to KD, which is surjective.

• This is a consequence of the claims in the first item and the way we extend the definition of the correspondingboolean query to refer also to unions of solutions.

We proceed to the proof of Proposition 2.1.

Proof. First item: By the definition of universal solution, the following holds: For every t-solution J1 inducedfrom a solution in J , there exists a solution J ′1 in J ′ such that there exists a t-homomorphism from J ′1 to J1. ByLemma 2.1, this statement is equivalent to the containment test for unions of CQACs presented in Section 2.5.Hence qJ is contained in qJ

′. The same reasoning also applies to the other direction to prove qJ

′ v qJ .Second item: Suppose SOL(I) ⊆ SOL(I ′). For every t-solution Ji induced from a solution in J , since Ji is inSOL(I ′), there is a t-homomorphism from a solution in J ′ to Ji, because J ′ is a universal solution for I ′. Hence,according to the containment test of union of CQACs presented in Section 2.5, qJ is contained in qJ

′.

For the other direction, assume qJ is contained in qJ′. By Lemma 2.1, for every t-solution Jj induced by

a solution in J , there is a solution J ′j in J ′ and a t-homomorphism h that maps J ′j to Jj (denoted by (I) inFigure 4).

Let J∗ be a solution for I. We must show that J∗ is a solution for I ′, which amounts to showing that 〈I ′, J∗〉satisfies Σst and J∗ satisfies Σt. Since J∗ is a solution for I, we know J∗ satisfies Σt. So it suffices to show that〈I ′, J∗〉 satisfies Σst.Consider a tgd-AC

d : ϕ(x) ∧ βϕ → ψ(x,y) ∧ βψ

in Σst. To show 〈I ′, J∗〉 satisfies d, we need to show that for every t-solution J∗i induced from J∗, 〈I ′, J∗i 〉 satisfiesd. For every t-solution J ′i induced from a solution in J ′, 〈I ′, J ′i〉 satisfies d since we deal with source-to-targetconstraints (i.e. the schemata are disjoint), it follows that for every vector ~a of constants such that I ′ satisfies

Page 10: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

ϕ(~a) ∧ βϕ, there is a vector ~b of elements of J ′i such that J ′i satisfies ψ(~a,~b) ∧ βψ (denoted by (II) in Figure 4).Since J is a universal solution for I, by definition of universal solution, for t-solution J∗i induced from J∗ there isa t-homomorphism h∗ from a t-solution Jj in J to J∗i (denoted by (III) in Figure 4). Also, as we observed at thebeginning of this direction, for Jj there is a solution J ′j in J ′ such that there is a t-homomorphism h from J ′j toJj (denoted by (I) in Figure 4). Note that atomic formulas are preserved under t-homomorphisms. So applyingt-homomorphism h followed by h∗ (i.e, the composition h∗ ◦ h) to ~a we get again ~a because ~a contains constantsonly. It follows that 〈I ′, J∗i 〉 satisfies the tgd-AC d because J∗i satisfies ψ(~a, h∗ ◦ h~b) ∧ βψ (denoted by (IV) inFigure 4). This argument applies to every t-solution induced from J∗, hence J∗ satisfies the tgd-AC d.

III

IV

� � � � � � � � � � � � � � � � � � � � � II ����� � � � �� �III � � �Figure 4: Proof of Proposition 2.1

3 Computing Universal Solutions

3.1 AC-Chase. We define a chase procedure, called AC-chase, which deals with arithmetic comparisonsand computes a universal solution. The procedure shares the main idea of the chase procedure in theliterature [MMS79, BV84b, BV84a]. However, AC-chase produces a tree rather than a sequence for the followingreason. Since the result of an AC-chase step on a t-instance may not be a t-instance, whenever necessary, we needto replace this resulting instance with the set of its induced t-instances.

definition 3.1. (AC-Chase Step)Let K be a t-instance. We have three stages:Stage I: We construct an instance K ′ by adding relational facts and comparisons to K. We have two possiblecases:

a) (acgd) Let d be an acgd φ(x)∧βφ → (x1θx2). Let h be a t-homomorphism from φ(x)∧βφ to K = K0 +βK

which cannot be extended. We say that d can be applied to K with t-homomorphism h. We add h(x1θx2) to Kand produce K ′.

b) (tgd-AC) Let d : φ(x)∧ βφ → ψ(x,y)∧ βψ be a tgd-AC. Let h be a t-homomorphism from φ(x)∧ βφ to K,such that there is no extension of h to a t-homomorphism from φ(x) ∧ βφ ∧ ψ(x,y) ∧ βψ to K. We say that dcan be applied to K with t-homomorphism h. Let h0 be the underlined homomorphism of h. Let K ′

0 be the unionof K0 with the set of facts obtained by: (a) extending h0 to a homomorphism h′0, such that each variable in y isassigned a fresh labeled null, followed by (b) taking the image of the atoms on the rhs of d under h′0. We add toK ′

0 all the comparisons h′0(x1θx2) for each comparison x1θx2 in βψ and obtain K ′.Stage II: We check K ′ for consistency. We distinguish two cases:

1. If K ′ is not consistent we say that the result of applying d to K with h is “failure” and write Kd,h−→ ⊥.

2. Otherwise, we say that the result of applying d to K with h is K ′, and write Kd,h−→ K ′.

Stage III: We produce all induced t-instances K ′1,K

′2, . . . ,K

′n of K ′ in compact form and write also (equivalently)

Kd,h−→ {K ′

1,K′2, . . . ,K

′n}.

The following example illustrates an AC-chase step.

example 3.1. Consider the following tgd-AC d and a t-instance K:

d: R(A) → S(B), A < B;K: {R(x), T (y, 3), x < y < 3}.

Page 11: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

x and y are labeled nulls. There is a t-homomorphism h from the lhs of d to K which maps A to x. Thereis no extension of this mapping to a t-homomorphism from the rhs of d to K. Applying d to K underh derives one fact S(b) with a comparison x < b, where b is a fresh labeled null. The new instance isK ′ : {R(x), T (y, 3), S(b), x < y < 3, x < b}. Stage II checks that K ′ is consistent, which it is, because thereare the following values that satisfy the comparisons: x = 1, y = 2, b = 5. Hence this is a successful AC-chasestep. K ′ is not a t-instance, since it does not define a total order of b w.r.t. the labeled null b and the constant3. Hence in stage III, we produce all induced t-instances of K ′. In this case, we have five induced t-instances K ′

i

where each is as follows: K ′i : {R(x), T (y, 3), S(b), x < y < 3, x < b,Ci}, where Ci is a conjunction of comparisons:

C1 : b < y, C2 : b = y, C3 : y < b < 3, C4 : b = 3, C1 : b > 3.

definition 3.2. (AC-Chase)Let Σ be a set of tgd-ACs and acgds, and J be a t-instance on the same database schema. An AC-chase tree of Jwith Σ is a tree (finite or infinite) such that:

• The root is J .• Each node in the tree is a t-instance.• For every node K in the tree, let {K ′

1, . . . , K′l} be the set of its children. Then there must exist a tgd-AC

or an acgd d in Σ, a t-homomorphism h from the lhs of d to K, and an instance K ′, such that Kd,h−→ K ′, and

K ′1, . . . , K

′l are all the t-instances induced by K ′.

A finite AC-chase of J with Σ is a finite chase tree such that each leaf is either ⊥, or satisfies Σ. The resultof a finite AC-chase is the union of all the leaf nodes that are not ⊥. If the result is not empty, we refer to it asthe result of a successful AC-chase.

Theorem 3.1 states that AC-chase can be used to compute a universal solution.

Theorem 3.1. Let M = (S,T, Σst, Σt) be a data exchange setting with arithmetic comparisons, and I be a sourceground instance.

1. Let L = {〈I, J1〉, 〈I, J2〉, . . . , 〈I, Jk〉} be the result of a successful finite AC-chase of 〈I, ∅〉 with Σ = Σst∪Σt,then J = {J1, J2, . . . , Jk} is a universal t-solution.

2. If the result of a finite chase of 〈I, ∅〉 with Σ = Σst ∪ Σt is empty, then there is no solution.

The proof of the Theorem 3.1 makes use of a basic property of an AC-chase step, as described in the followinglemma. This property has been used and proven in [BV84b, MUV84, FKMP03] in different contexts or settingsthat are more restricted than ours. Lemma 3.1 essentially proves that the instance K ′ produced in Stage II ofthe AC-chase Step t-homomorphically maps on a solution E as long as the instance K that created K ′ in thisstep, t-homomorphically maps on E. However, in Stage III of the AC-chase Step we “replace” K ′ by its inducedinstances. Thus, we need to prove that at least one of those induced instances t-homomorphically maps to E.This is done in Lemma 3.2.

Lemma 3.1. Let d be a tgd-AC of the form, K be a t-instance, and Kd,h−→ K ′ be an AC-chase step (where

K ′ 6= ⊥) with a t-homomorphism h from the lhs of d to K. Let E be a t-instance such that: (i) E satisfies d; and(ii) K t-homomorphically maps to E. Then, K ′ t-homomorphically maps to E.

Proof. Let d be: φ(x) ∧ βφ → ψ(x,y) ∧ βψ. Since the result of composing two t-homomorphisms is still a t-homomorphism, µ ◦ h is a t-homomorphism from φ(x) ∧ βφ to E, as illustrated in Figure 5. Since E satisfies d,there exists an extension of µ ◦ h, denoted by h′, which is a t-homomorphism from φ(x)∧ βφ ∧ψ(x,y)∧ βψ to E.

For each variable y ∈ y, denote by Λy the labeled null replacing y in the AC-chase step. Define ν on Var(K ′) toE as follows: ν(Λ) = µ(Λ) if Λ ∈ Var(K), and ν(Λy) = h′(y) for y ∈ y. Now we show that ν is a t-homomorphismfrom K ′ to E.

1. ν maps the relational facts of K ′ to relational facts of E. This claim is true for those relational facts ofK ′ coming from K, since µ is a t-homomorphism. Let T (x0,y0) be an arbitrary relational atom in theconjunction ψ(x,y). (Here x0 and y0 contain variables in x and y, respectively.) Then K ′ contains, inaddition to those relational facts of K, a relational fact T (h(x0),Λy0). The image under ν of this fact is, bydefinition of ν, the fact T (µ(h(x0)), h′(y0)). Since h′(x0) = µ(h(x0)), this is the same as T (h′(x0), h′(y0)).But h′ homomorphically maps all relational atoms in φ(x)∧ψ(x,y), in particular, T (x0,y0), into relationalfacts of E.

Page 12: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

t-instance: E

extension of

ψϕ βψβφ ∧∃→∧ ),()(: yxyxd

hoµ

h

µ

K hoµ:'hdh,

ν

'K

Figure 5: Proof of Lemma 3.1

2. ν preserves the order of K ′. We use the symbol θ to denote a comparison operator, including <, ≤, =, 6=,>, or ≥. Let α1θα2 be an arbitrary comparison in the conjunction βψ. Without loss of generality, we haveto consider the following cases:

• α1 ∈ x and α2 ∈ x. Then K ′ contains the comparison h(α1)θh(α2). Similarly, the image underν of this comparison is, by definition of ν, the comparison µ(h(α1))θµ(h(α2)). This is the same ash′(α1)θh′(α2). By the definition of h′, this image is true in E.

• α1 ∈ x and α2 ∈ y. Then K ′ contains the comparison h(α1)θh(Λα2). Then the image under νof this comparison is, by definition of ν, the comparison µ(h(α1))θh′(α2), which is still the same ash′(α1)θh′(α2). Using the same reasoning above, this image is also true in E.

• α1 ∈ y and α2 ∈ y. The reasoning is similar to the above.

Thus, ν is a t-homomorphism from K ′ to E.

Lemma 3.2. Let K be an instance (not necessarily t-instance) and E be a t-instance. If there is a t-homomorphismh from K to E, then there is a t-instance Ki induced by K, such that h is also a t-homomorphism from Ki to E.

Proof. For every two values (constant or labeled null) α1 and α2 in K, let h(α1)θh(α2) be the order between theirimages in E under the mapping h. For each comparison C in K, we can see that α1θα2 is not contradictory toC, since otherwise the image of the comparison C under h will be contradictory to the comparison h(α1)θh(α2)in E. If α1θα2 is not in K, we add it to K. We repeat this addition step for every pair of values in K, and get at-instance induced by K. Clearly h is also a t-homomorphism from this t-instance to E.

We proceed to the proof of Theorem 3.1.

Proof. Claim 1: By Definition 3.2, every leaf node 〈I, Ji〉 in the finite AC-chase tree is a t-instance that satisfiesΣ. By definition of t-solution, Ji is also a t-solution. Now consider every ground solution E. Notice that theidentity mapping is a t-homomorphism from 〈I, ∅〉 to 〈I, E〉. By Lemmas 3.1 and 3.2, there exists a Ji in J , suchthat there is a t-homomorphism from 〈I, Ji〉 to 〈I, E〉. Thus, Ji can t-homomorphically map to E. Therefore, Jis a universal t-solution.

Claim 2: If the result of a finite AC-chase of 〈I, ∅〉 with Σ = Σst ∪ Σt is empty, then each leaf node in thecorresponding AC-chase tree, which is ⊥, corresponds to a set of derived facts that contain contradictory facts.Suppose there exists a t-solution E. Following the same argument as in the proof of Claim 1, there is a leaf nodeL, such that there exists a t-homomorphism ν from 〈I, FL〉 to 〈I, E〉, in which FL is the set of facts correspondingto the leaf node L. Since the image of the pair of contradictory comparisons under ν is still a pair of contradictorycomparisons, we have that E contains a pair of contradictory comparisons, which is a contradiction to the factthat E is a ground instance.

The following is an example that a) illustrates an AC-chase tree and b) is used in Proposition 3.1 to provethat in this simple case, there is no universal solution which contains a single instance.

Page 13: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

example 3.2. Consider the following data exchange setting.

Target acgd d1: E(X,Y, Z), X < Z → X = YTarget acgd d2: E(X,Y, Z), X = Z → X = 5Source-to-target tgd-AC d3: A(Y ) → E(X, Y, Z).

Let I = {A(8)} be the source instance. Then, after chasing with d3, we get E(X, 8, Z). When we chase with d1,we get the solution {E(8, 8, Z), 8 < Z}, whereas when we chase with d2 we get the solution {E(5, 8, 5)}. These twosolutions together with {E(X, 8, Z), X > Z} comprise a universal solution. Still, there is no singleton universalsolution as Proposition 3.1 proves. In Figure 6 we have the AC-chase tree for this example. The three t-solutionsthat comprise the universal solution are the leaf nodes in the oval.

A(8)

d3

E(X,8,Z)

E(X,8,Z),X<Z E(X,8,Z),X=Z E(X,8,Z),X>Z

d1

E(8,8,Z), 8<Z E(5,8,5)

d2

Figure 6: AC-Chase Tree for Example 3.1.

Proposition 3.1. There is a data exchange setting with arithmetic comparisons which uses a single tgd in thesource-to-target constraints and two acgds as target constraints such that there is no universal solution whichcontains one instance.

Proof. We use the Example 3.2. We have proved that any two universal solutions are equivalent if viewedas Boolean queries. Hence if there exists a single instance universal solution then its corresponding Booleanquery qJ is equivalent to the union of Boolean queries: Q1 : −E(8, 8, Z), 8 < Z, Q2 : −E(5, 8, 5) andQ3 : −E(X, 8, Z), X > Z. Let J be this single universal solution. Then J must have no E atom with aconstant in the first argument because otherwise there is either no mapping on Q1 or no mapping on Q2 sincethese two have different constants in the first arguments. Thus let E(A,C, B) be an atom of J . Then (using thecontainment test with canonical databases), there is a canonical database D of qJ in which A < B. Hence, thereis no mapping from Q3. However, there is no mapping from Q1 or Q2 either since constants should map to thesame constants.

The following is an example continued from Example 2.2, where we mentioned that J1 is a universal solutionof the setting of Example 2.1.

example 3.3. To prove that J1 is a universal solution of the setting of Example 2.1, apply an AC-chase stepto the source instance I = {E(2, 5), E(4, 7)} with t-homomorphism: X → 2, Y → 5. This step will produce anumber of t-instances (as a consequence of stage III). To each of these t-instances, apply an AC-chase step witht-homomorphism: X → 4, Y → 7.

3.2 Complexity. We investigate the conditions under which an AC-chase procedure terminates and we studythe complexity of the problem. We show that an AC-chase tree is finite when the target dependencies is a set of

Page 14: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

weakly acyclic tgd-ACs and a set of acgds. The definition of weak acyclicity for tgd-ACs is an extension of thedefinition of weak acyclicity for tgds as in [FKMP03]3.

definition 3.3. (Weakly acyclic set of tgd-ACs)Let Σ be a set of tgd-ACs over a database schema in compact form. We construct a directed graph (called thedependency graph) as follows: (1) There is a node for every pair (R,A) with a relation symbol R of the schemaand an attribute A of R. We call such a pair (R, A) a position. (2) Add edges as follows: for every tgd-ACφ(x)∧ βφ → ψ(x,y)∧ βψ in Σ and for every x in x that occurs in ψ(x,y)∧ βψ, we call x a propagated variable.For such a propagated variable x, for every occurrence of x in φ(x) in position (R, Ai), do the following:

(i) For every occurrence of x in ψ(x,y) in position (S,Bj), add an edge (R, Ai) → (S, Bj) (if it does notalready exist);

(ii) In addition, for every existentially quantified variable y in y and for every occurrence of y in ψ(x,y) inposition (T, Ck), add a special edge (R,Ai) → (T,Ck) (if it does not already exist).

We say Σ is weakly acyclic if the dependency graph has no cycle going through a special edge.

The above definition differs from the one in [FKMP03] in that a variable from the lhs of a tgd-AC that appearsin an arithmetic comparison on the rhs affects acyclicity even if this variable does not appear in any relationalatom on the rhs. Example 3.4 shows such a case.

Theorem 3.2. Let M = (S,T,Σst, Σt) be a data exchange setting with arithmetic comparisons, such that Σt

is the union of a weakly acyclic set of tgd-ACs and a set of acgds and let I be a ground source instance. Thenthe following hold: (1) There is a polynomial in the size of I that bounds the depth of every AC-chase beginningwith I. (2) There is a nondeterministically polynomial-time algorithm such that it tests whether a t-solution forI exists. (3) AC-chase takes exponential time to compute a universal t-solution for I.

Before proceeding to the proof of Theorem 3.2 we start with an example about weak acyclicity which showswhy we had to modify the definition in the presence of arithmetic comparisons.

example 3.4. Consider a database schema with a relation R(A) and a relation S(B). We consider first thefollowing set of tgds: Σ = {d1, d2}, where d1 : R(X) → S(Y ) and d2 : S(Y ) → R(Y ). The dependency graph isweakly acyclic so and a chase on any input terminates. Now we change d1 slightly, i.e., we consider the followingset of tgd-ACs: Σ′ = {d′1, d2}, where d′1 : R(X) → S(Y ), X < Y. The dependency graph of Σ′ is not weaklyacyclic. In fact, it is easy to see that every AC-chase on the instance {R(1)} using Σ′ does not terminate becauseevery time we apply tgd-AC d′1 we must add an atom with a value for its argument which is larger than the valuein the atom we added in the previous application of d′1.

Theorem 3.2 covers interesting cases that arise in many applications. For instance, the class of weakly acyclictgd-ACs includes the class of full tgd-ACs (no existentially quantified variables). To illustrate the technical partof the proof of Theorem 3.2, we first prove part (1) of the theorem (as Theorem 3.3) and then the other parts canbe easily proved from part (1), and are included again in separate statements (Theorem 3.4 and Theorem 3.5).

Theorem 3.3. Let Σ be a set of weakly cyclic tgd-ACs. Then, there is a polynomial in the size of a t-instanceK that bounds the depth of every AC-chase tree of K with Σ.

Proof. We start with introducing some notations used in the proof. Every node in the dependency graph of Σis of the form (R, A), where R is a relation symbol of the schema and A an attribute of R. We call such a pair(R,A) a position. An incoming path for (R,A) is a path (finite or infinite) ending in (R, A). We define the rankof (R,A), denoted by rank(R, A), as the maximum number of special edges on any incoming path. rank(R, A)is finite since Σ is weakly acyclic. We use the following notations in the proof.

r: maximum rank rank(R, A) over all positions (R, A);n: number of distinct values (constants or labeled nulls)

in the t-instance K;D: number of dependencies in Σ.

3The concept of weak acyclicity for the case of tgds also appears in [DT03] as “constraints with stratified witness.”

Page 15: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

We start the proof by partitioning the nodes in the graph according to their rank, into subsets N0, N1, . . . , Nr,where Ni is the set of all nodes with rank i. Let K ′ be a leaf node (a t-instance) of some arbitrary AC-chase treeof K using Σ. We prove by induction on i the following claim:

Claim: For every i, there exists a polynomial Qi such that the total number of distinct values thatoccur in K ′ at positions that are restricted to be in Ni is at most Qi(n).

Base step: If (R,A) is a position in N0, then there are no incoming paths with special edges. Thus the onlypossible way to introduce new labeled nulls for this position is from a tgd-AC g of the form:

g : φ(x) ∧ βφ → ψ(y) ∧ βψ,

in which no variable (in x) from the lhs appears in the rhs (i.e., there is no propagated variable in g), andthere is a variable in the rhs corresponding to the position (R, A). Such a tgd-AC can be used in at mostone AC-chase step in each AC-chase tree, since the newly added facts for the rhs of g guarantee that thereis a t-homomorphism from the rhs of g to each induced t-instance of the new instance after the step. Thist-homomorphism can be used to extend every possible t-homomorphism from the lhs of g to a later t-instancein the tree, since the rhs of g does not share variables with the lhs. Therefore, the number of labelednulls for the positions in N0 is at most D × V , where V is the maximal number of variables in the rhs of eachtgd-AC that do not appear in the lhs (i.e., “y”). Thus we can take Q0(n) = n+D×V , in which D×V is a constant.

Inductive step: There are two kinds of values that may appear in K ′. The first kind of values are thosealready in K. This number is bounded by n. The second kind of values are those newly generated labeled nulls.

We show the number of the second kind is also bound by a polynomial. Let (R, A) be a position of Ni. Thereare two possible ways a new labeled null can be generated in (R, A) during an AC-chase step. The first way isfrom a tgd-AC of the form g as described above, which does not have propagated variables.. We have seen thatthere are at most D × V such labeled nulls.

The second way a new labeled null can be generated is due to special edges, i.e., they are derived because ofpropagated variables. Note that a special edge entering (R,A) must have started at a node in N0 ∪ . . . ∪ Ni−1.That is, it cannot originate at nodes in Nj with j > i. Otherwise, assume that there exists j > i and there existsa special edge from some position of Nj to a position (R, A) of Ni. Then the rank of (R,A) would have to belarger than i, which is a contradiction.

By the inductive hypothesis, the number of distinct values may appear in all the nodes in N0 ∪ . . . ∪ Ni−1

is bounded by P (n) = Q0(n) + . . . + Qi−1(n). Let d be the maximum number of special edges that enter aposition from all positions in the schema. Then for every choice of d values in N0 ∪ . . . ∪ Ni−1 (one value foreach special edge that can enter a position) and for every dependency in Σ, there is at most one new value thatcan be generated at position (R, A). (The way those nonpropagated variables are mapped does not affect thisnumber.) Thus the total number of new values that can be generated in (R, A) due to special edges is at most(P (n))d ×D, which is polynomial in n since the schema and Σ are fixed. Considering all positions (R, A) in Ni,the total number of values that can be generated due to special edges is at most pi × (P (n))d ×D, where pi isthe number of positions in Ni. Let G(n) = pi × (P (n))d ×D. Therefore, G(n) is polynomial.

Thus Qi(n) = n + (P × V ) + G(n), which is polynomial, bounds the number of distinct values in K ′. So theClaim is proven.

Notice that i is bounded by the maximum rank r, which is a constant. Hence, there exists a fixed polynomialQ such that the number of distinct values that can exist in K ′ over all positions is bounded by Q(n). The numberof distinct values that can exist in K ′ at a single position is also bounded by Q(n). Therefore, the total number ofrelational tuples that can exist in one relation in K ′ is bounded by Q(n)p (where p is the total number of positions,for each relation, in the schema), since the maximum number of attributes in one relation is bounded by p. Itfollows that the total number of relational tuples that can exist in K ′ over all relations is at most s × Q(n)p,where s is the number of relations in the schema. This is a polynomial in n since s and p are constants. SinceK ′ is a t-instance with at most Q(n) distinct values, and its comparisons define a total order of these values, weneed at most Q(n) − 1 nonredundant comparisons to define this total order. Finally, since every AC-chase stepwith a tgd-AC adds at least one relational tuple or a nonredundant comparison in K ′, it follows that the depthof every AC-chase tree is at most s× (Q(n))p + (Q(n)− 1).

Page 16: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Theorem 3.4. Let M = (S,T,Σst, Σt) be a data exchange setting with arithmetic comparisons, such that Σt isthe union of a weakly acyclic set of tgd-ACs and a set of acgds. Then there is a nondeterministically polynomial-time algorithm such that, given a source instance I, it tests whether a t-solution for I exists.

Proof. The result of an AC-chase is always a universal solution. Thus if there is a solution, there is at least onesuccessful leaf of an AC-chase. Let K be a t-solution in a successful leaf node of an AC-chase. As a consequenceof Theorem 3.3, K is of polynomial size. Moreover, we can check if the tgd-ACs are satisfied on K in polynomialtime, since the size of each tgd-AC is constant, and it takes polynomial time to consider all mappings from atgd-AC or from an acgd to the K. Thus, if thre is a solution, we guess a t-solution (there is one of polynomialsize) and check in polynomial time if the constraints are satisfied.

Theorem 3.5. Let M = (S,T,Σst, Σt) be a data exchange setting with arithmetic comparisons, such that Σt isthe union of a weakly acyclic set of tgd-ACs and a set of acgds. Given a source instance I, an AC-chase takesexponential time to compute a universal t-solution for I.

Proof. In each AC-chase step Kd,h−→ K ′, we need to consider all the induced t-instances of the newly derived

instance K ′. Notice that K is already a t-instance, and we need to define a total order on the values in K and thenew labeled nulls in K ′. Therefore, the number of these t-instances (these are all the children of K in an AC-chasetree) is bounded by na+1 where n is the number of distinct values in K, and a is the number of new variables inthe rhs of d that do not appear in the lhs. a is a constant that depends on the tgd-AC d only, and it bounds thenumber of newly introduced labeled nulls in this AC-chase step. Thus an AC-chase tree has polynomial depthon the size of I, and each internal node has degree which is also polynomial on the size of I. Thus an AC-chasetakes exponential time to compute all the leaf nodes.

4 Query Answering and Complexity.

In this section we study the query answering problem in data exchange settings with arithmetic comparisons. Weconsider unions of conjunctive queries with arithmetic comparisons (UCQAC) posed over the target schema. Aconjunctive query with arithmetic comparisons (CQAC) is of the form: h(X) :- F (Y ), where F (Y ) is a conjunctiveformula with arithmetic comparisons.

definition 4.1. (Certain answers)Let M = (S,T, Σst, Σt) be a data exchange setting with arithmetic comparisons, and let q be a query over thetarget schema T. Let I be a source instance. The certain answers of q on I with respect to M , denoted bycertainM (q, I), are:

certainM (q, I) = ∩{q(J) : J ∈ G(M, I)}where G(M, I) is the set of all the ground solutions for I under M , and q(J) is the answers of query q on theground solution J .

We need to define the concept of qt-instance. The following example and the discussion that follows show thereason for the necessity of defining qt-instances. The lemma in the end of this subsection formalizes the fact thatqt-instances are sufficient to compute certain answers of a query on a particular instance.

example 4.1. Given a query:Q : ans(A) : −R(A), A < 3.

Consider a t-instance J = {R(x), x < 7}. If we replace the labeled null x with a constant 2, then the answer to thequery on the new ground instance is {2}. But if we replace x with another constant 5, the answer to the query onthis new ground instance becomes empty. Notice that although J defines a total order of its constant and labelednull, the query could introduce a new constant, and the order between the labeled null and the new constant is notpredefined.

This example shows that a t-instance may no longer define a total order on those new constants introducedin a query. However, when the query contains a constant not mentioned in the t-instance on which the queryis applied, we might wonder whether it is possible to produce anything but an empty set of answers (as is theexample above). The next example shows that we may still produce a non-empty set of answers.

Page 17: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

example 4.2. Consider the following t-instance and query:

t-instance J : {R(x, y), R(y, z), 3 < x < z < 4 < y};query Q: ans : −R(A, B), A > 3, B < 5.

To test if the query is true on the database, we need to find a homomorphism from the single subgoal of thequery to one of the two facts in J , which moreover satisfies the comparisons in the query. If we use homomorphismµ1, which maps A to x and B to y, then subgoal R(A,B) maps to fact R(x, y). We also require that µ1(A) > 3and µ1(B) < 5, i.e., we require that x > 3 and y < 5. Inequality x > 3 is true in J , and we do not know anythingabout inequality y < 5. However, we know that there are two cases for y: either y < 5 or y ≥ 5. In the first case,µ1 will do to answer the query. In the second case, consider µ2 that maps A to y and B to z. Now we requirethat µ2(A) > 3 and µ2(B) < 5 or y > 3 and z < 5. Since y ≥ 5, the first is satisfied. The second is satisfied toobecause z < 4. Thus, the answer of Q on J is true.

The above query is a boolean query. For the following pair of t-instance and non-boolean query, the set ofanswers is non-empty for similar reasons above:T-instance J :{T(8, x), T(w, y), R(x, y), R(y, z), 3 < x < z < 4 < y < 8};Query Q:ans(C) : −T(C, A), R(A, B), A > 3, B < 5.Here the answers to Q contain a single tuple ans(8).

The following lemma states that the certain answers of a query on an instance (as defined in Definition 4.2)can be computed considering a finite set of qt-instances (i.e., the problem is decidable).

Lemma 4.1. Let M = (S,T,Σst, Σt) be a data exchange setting with arithmetic comparisons, J be an instanceover T under M , and q be a CQAC query over the same database schema. Let Jq = J (q, M, J) be the set of allqt-instances induced by J under M . Then:

certain(q, J) = ∩{q(K) ↓: K ∈ Jq}.Proof. Let G(J) be the set of ground instances to which J can t-homomorphically map.“(⊇)”: We need to show the following:

∀K ∈ G(J), ∃K ′ ∈ Jq, q(K) ⊇ q(K ′)↓.

For each K ∈ G(J), by definition, there exists a t-homomorphism h from J to K. Notice that h maps labellednulls to constants. In addition, because of the compact form and the total order imposed among the labelled nullsof J , it never maps two distinct labelled nulls to the same constant. Thus we can use the targets of the labellednulls under h to decide a total order of the labelled nulls of J with respect to the constants also in the query.Let K ′ be the instance in Jq corresponding to this total order. Then h is also a t-homomorphism from K ′ to K.Let t be a tuple in the answers q(K ′)↓. There exists a t-homomorphism µ from q to K ′, such that the head ofq is mapped to t under µ. By the definition of the ↓ operator, t has constants only. We know h is an identitymapping on constants. Therefore, (µ ◦ h)(head of q) = µ(h(head of q)) = µ(t) = t.“(⊆)”: We need to show the following:

∀K ′ ∈ Jq, ∃K ∈ G(J), q(K) ⊆ q(K ′)↓.

In K ′ there exists a total order of the labelled nulls w.r.t. the constants in q. Since the domain is densely ordered,we can replace each labeled null with a unique constant that respects the order, thus obtaining K. Let thisreplacement be a mapping h from K ′ to K. Thus if there is a mapping µ from q to K ′, then the mapping µ ◦ his from q to K.

Let M = (S,T, Σst, Σt) be a data exchange setting with arithmetic comparisons, q be a CQAC query, andJ be an instance over T. Let C = Const(J) ∪ Const(Σst ∪ Σt) ∪ Const(q). An instance K is called a qt-instance(with respect to query q and instance J under M) if it is a t-instance with respect to C and the labeled nulls ofK. We evaluate query q on a qt-instance K by treating the labeled nulls in K as distinct constants and use anoperator “R ↓” to remove those tuples with labeled nulls.

Page 18: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

However, since a solution (even a t-instance) may not be a qt-solution, we need the following intermediatedefinition of what means to compute the answers of query q on an instance. For that, we define the notion ofcertain answers of a query q on an instance J .

definition 4.2. Let J be an instance and q be a query. Let G(J) be the (possibly infinite) set of ground instances,such that J t-homomorphically maps to each of them. The certain answers of q on J , denoted by certain(q, J),are defined as:

certain(q, J) = ∩{q(K) : K ∈ G(J)},where q(K) is the answers to q on ground instance K.

We evaluate a query q on an instance K by considering the set J (q, M, J) of all induced qt-instances by K andcomputing ∩{q(K) ↓: K ∈ J (q, M, J)}. We can show that certain(q, J) = ∩{q(K) ↓: K ∈ J (q, M, J)} (seeLemma 4.1). Observe that the number of distinct instances in J (q,M, J) is bounded, whereas the number of allground instances in Definition 4.2 may not be.

The following theorem shows the usefulness of universal solutions in computing certain answers of CQACqueries in a data exchange setting with arithmetic comparisons.

Theorem 4.1. Assume that M = (S,T,Σst, Σt) is a data exchange setting with arithmetic comparisons.

1. Let q be a union of CQAC queries over the target schema T, and I be a source instance. If J ={J1, J2, . . . , Jk} is a universal solution for I, then certainM (q, I) = ∩{certain(q, Ji) : Ji ∈ J }.

2. Let I be a source instance and J be a set of solutions, such that for every union of CQAC queries q overschema T, we have certainM (q, I) = ∩{certain(q, Ji) : Ji ∈ J }, then J is a universal solution.

Proof. First claim: We focus on the case where q is a CQAC, and the results can be used to prove the claim forthe case where q is a union of CQACs.

For each Ji in J , let GS(Ji) be the set of ground solutions to which Ji can t-homomorphically map. LetG(M, I) be the set of all the ground solutions for I under M . According to definition of universal solution,G(M, I) = GS(J1) ∪ . . . ∪GS(Jk). In addition, by definition, certainM (q, I) = ∩{q(K) : K ∈ G(M, I)}. Thus,

certainM (q, I) = ∩{q(K) : K ∈ GS(J1) ∪ . . . ∪GS(Jk)}= ∩(∩{q(K) : K ∈ GS(Ji)}

).

On the other hand, for each Ji in J , by definition,

certain(q, Ji) = ∩{q(K) : K ∈ G(Ji)},

where G(Ji) be the set of ground instances, such that Ji t-homomorphically maps to each of them. Therefore, toprove the first claim, we can prove the following for each Ji:

∩{q(K) : K ∈ G(Ji)} = ∩{q(K) : K ∈ GS(Ji)}.

Notice that the difference between G(Ji) and GS(Ji) is that the former could include some ground instances thatare not ground solutions.

The “⊆” direction is straightforward, since G(Ji) ⊇ GS(Ji). To show the “⊇” direction, consider each tuple tin ∩{q(K) : K ∈ GS(Ji)}. For each K in G(Ji), let h be a t-homomorphism from Ji to K. Now we show that theimage of Ji under this homomorphism, i.e., h(Ji), is a ground solution for M . To see the reason, notice that Ji is at-instance with a total order on its labeled nulls, constants, and those constants in I and dependencies. Therefore,the t-homomorphism h is one-to-one between the values in Ji and those in h(Ji). Thus a t-homomorphism fromthe lhs (all the conjuncts, respectively) of a dependency to Ji corresponds to a t-homomorphism from the lhs (allthe conjuncts, respectively) of the dependency to h(Ji), and vice versa. Therefore, h(Ji) is also a ground solution,i.e., h(Ji) is in GS(Ji). As a consequence, t is in q(h(Ji)). Since h(Ji) is a subset of K, we have t ∈ q(K). Thust is also in ∩{q(K) : K ∈ G(Ji)}, proving the “⊇” direction.

Page 19: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Second claim: We need to show that for each ground solution K, there is a solution J in J , such that Jt-homomorphically maps to K. Let qJ be the corresponding boolean union CQAC of J . We are given

certainM (qJ , I) = ∩{certain(qJ , J), J ∈ J }.

For every J ∈ J , we have certain(qJ , J) = true, because one of the CQACs in qJ is qJ (the correspondingboolean query of J), and certain(qJ , J) = true. Therefore, certainM (qJ , I) = ∩{certain(qJ , J), J ∈ J } = true.

Since certainM (qJ , I) = ∩{qJ (K),K ∈ G(M, I)} = true, for each ground solution K in G(M, I), we haveqJ (K) = true. So there exists a J ∈ J such that qJ(K) = true. Hence, there is a t-homomorphism h : qJ → K,therefore, there is a t-homomorphism h′ : J → K. Hence, J has a member that t-homomorphically maps to K.Since each J in J is a solution, by definition, J is a universal solution.

The following example shows how to compute certain answers of conjunctive query with arithmetic comparisonsusing the universal solution.

example 4.3. Consider a data exchange setting M = (S,T, Σ) where S contains one binary predicate v andT contains three unary predicates a, b and c. Let the source instance I = v(1, 1). We consider the queryQ : q() : −a(X), c(X), b(X), X < 7. We discuss three cases of tgd-ACs:1) Σ consists of d1:

d1 : v(T, U) → a(S), c(T ), b(U), T ≤ S, S ≤ U.

Applying the AC-chase, we get the universal solution which consists of the following solution:J = {a(L), c(1), b(1), L = 1}

Then, the certain answer of Q is true.

2) Σ consists of d2:d2 : v(T, U) → a(S), c(T ), b(U), T ≤ S, S < U.

Then, no solution exists and the certain answer of Q is false.

3) Σ consists of d3:d3 : v(T, U) → a(S), c(T ), b(U), S ≤ T.

Then, the universal solution J consists of two solutions:J1 = {a(L), c(1), b(1), L < 1}J2 = {a(L), c(1), b(1), L = 1}

Since the universal solution J = {J1, J2} we compute the certain answer of Q by computing Q(J1) and Q(J2) andtaking the intersection. The first is false and the second is true. Thus the certain answer of Q is false. Let usconsider another query, Q′ : q():- b(X), X < 7. The certain answer of Q′ in this case is true because Q′ computesto true on both J1 and J2.

The following proposition makes an interesting observation on the second claim of Theorem 4.1. In particular,it shows that it does not hold for the language of CQACs (only for the language of unions of CQACs). It is notsurprising though since universal solutions in our framework are sets of t-solutions, not singletons.

Proposition 4.1. There exists a data exchange setting with arithmetic comparisons and with Σt = ∅, such thatthe following holds. There is a source instance I and a set of solutions J , which is not a universal solution, suchthat for every CQAC query q over schema T, we have certainM (q, I) = ∩{certain(q, Ji) : Ji ∈ J }.

We prove the proposition on the following example.

example 4.4. Consider a data exchange setting M with the following tgd-ACs:

Source-to-target tgd-AC d1: Z(1) → R(X, Y );Target tgd-AC d2: R(X, Y ) → S(X, Y ), X = Y ;Target tgd-AC d3: R(X, Y ) → T (X,Y ), X < Y ;Target tgd-AC d4: R(X, Y ) → W (X, Y ), X > Y .

Page 20: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Let I = {Z(1)} be a source instance. Then J = {J1, J2, J3} is a universal solution, where:

J1 = {S(x, y), x = y},J2 = {T (x, y), x < y},J3 = {W (x, y), x > y}.

Consider another set of solutions, J ′ = {J1, J2}. We can show that for every CQAQ q on the target schema,certainM (q, I) = ∩{certain(q, Ji) : Ji ∈ J ′} = ∅. Intuitively, each such query q returns an empty certain answerfor at least one of the three t-solutions in J , and the same argument holds for the two t-solutions in J ′. But J ′is not a universal solution.

This argument will not hold for queries in the language of unions of CQACs. In particular, consider theunion of CQACs q = q1 ∪ q2 ∪ q3, where

q1 : q(5) : −S(X,Y ), X = Y ,q2 : q(5) : −T (X,Y ), X < Y ,q3 : q(5) : −W (X, Y ), X > Y .

This query has a certain answer q(5) for the data exchange setting, but an empty set for ∩{certain(q, Ji) : Ji ∈ J ′}.

Let J to be a t-solution. Now we illustrate the differences between computing certain answers using varioussets of instances for J . Figure 7 depicts three sets of instances for a t-solution:

• The set of ground instances (denoted by A) isomorphically mapped from t-solution J , i.e., using a one-to-oneonto t-homomorphism.

• The set of ground solutions (denoted by B) on which the t-solution can map to. Each such ground solutioncould have additional facts that are not used in the t-homomorphism.

• The set of ground instances (denoted by C) on which J can map to. It could include some ground instancesthat are not ground solutions.

Figure 7: Instances and Certain Answers

Apparently, A ⊆ B ⊆ C. In addition, using the proof in for the first claim of Theorem 4.1, we can show thefollowing:

Certain(q, J) = ∩{q(K) : K ∈ A} = ∩{q(K) : K ∈ B} = ∩{q(K) : K ∈ C}.Theorem 4.2 states the complexity of the query answering problem as a function of the size of the source

instance, assuming the query on the target schema and the dependencies are of constant size.

Theorem 4.2. Let M = (S,T,Σst, Σt) be a data exchange setting with arithmetic comparisons, such that Σt isthe union of a weakly acyclic set of tgd-ACs and a set of acgds. Let q be a union of CQACs query. Given a sourceinstance I and a tuple t, the problem “is t in certainM (q, I)?” is co-NP complete.

Page 21: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Proof. If t is not a certain answer to q, according to Theorem 4.1, there is a solution J in a universal t-solution,such that t is not in certain answers of q on J . According to Theorem 3.1, the chase procedure produces auniversal t-solution by taking all the t-solutions in the leaves of a chase tree. By Theorem 3.3, the depth of achase tree is polynomial in the size of the source instance. Hence a certificate for t not being a certain answerincludes a chase path from the root to a leaf J , such that t is not in certain answers of q on J . Finally, thecertificate includes also an induced qt-instance of J for which t is not in the answers of the query. This provesmembership in co-NP.

Fagin et al. [FKMP03] proved that computing certain answers of a CQAC query on a data exchange settingwithout inequalities in the tgds is co-NP hard. Thus the problem in our setting is co-NP complete.

5 Polynomial Results.

We investigate the conditions under which universal solutions and certain answers of conjunctive query witharithmetic comparisons can be computed in PTIME in a data exchange setting with arithmetic comparisons. Thefollowing example shows that there are cases where there is a universal solution which is a singleton.

example 5.1. Consider a data exchange setting with arithmetic comparisons, where: Σst ={E(Y) → A(Y, Z), Y ≤ Z}, and Σt = ∅. Suppose I = {E(5)} is the source instance. Using an AC-chase procedurewe compute a universal t-solution which contains two t-solutions and is the set: {A(5, z) ∧ 5 < z, A(5, 5)}. It iseasy to see that the set {A(5, z) ∧ 5 ≤ z} is also a universal solution, which is a singleton one.

5.1 Succinct AC-chase. We introduce a new chase procedure which we call succinct AC-chase and is asequence rather than a tree. We will use succinct AC-chase to show how to compute singleton universal solutionefficiently and how to compute certain answers efficiently. First, we need to define some new concepts.

We use the notions of p-homomorphism and p-homomorphically map, which naturally extend the conceptsof t-homomorphism and t-homomorphically map respectively, by allowing a mapping from a conjunctive formulato an instance possibly with a partial order. Similarly, we use the concept of extending a p-homomorphismand p-satisfaction of a tgd-AC or an acgd. These concepts are natural extensions of the t-homomorphism andt-satisfaction concepts. Their formal definition follows.

5.2 Definitions. We start with formally defining the concepts of p-homomorphism and p-homomorphicallymap.

Definition 5.1. (p-homomorphism)Let F = F0 + βF be a conjunctive formula and K = K0 + βK be an instance in compact form over the same

schema. The comparisons (βK) in K do not imply equalities. A p-homomorphism from F to K is a mapping hfrom Const(F ) ∪ Var(F ) to Const(K) ∪ Var (K) with the following properties:

• Every constant in F is mapped to itself.

• For every relational fact R(x1, . . . , xk) in F , we also have R(h(x1), . . . , h(xk)) in K.

• βK ⇒ h(βF ) where “⇒” denotes the logical implication.

We say F p-homomorphically maps to K if there is a p-homomorphism from F to K. Observe that a p-homomorphism h from F to K directly implies a homomorphism h0 from F0 to K0. We call h0 the underlinedhomomorphism of h.

The following proposition relates a p-homomorphism to t-homomorphisms.

Proposition 5.1. Let F = F0 + βF be a conjunctive formula over a schema S and K = K0 + βK an instancein compact form over S. If there is a p-homomorphism from F to K then, there is a t-homomorphism from F toevery induced t-instance of K.

We proceed to the definition of extending a p-homomorphism. Actually, the extension does not have to relateto acgds or tgd-ACs but we restrict the definition below to these cases for convenience (it is easy to generalizethe definition) because we only use it for such cases here.

Page 22: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Definition 5.2. (extending a p-homomorphism)Let K be an instance. We have two possible cases:(acgd): Let d be an acgd ϕ(x) → (x1θx2). Let h be a p-homomorphism from ϕ(x) to K = K0 +βK . We say thath can be extended to a p-homomorphism h′ from ϕ(x) ∧ (x1θx2) to K if it holds βK ⇒ h(x1θx2).(tgd-AC): Let d : φ(x)∧ βφ → ψ(x,y)∧ βψ be a tgd-AC. Let h be a p-homomorphism from φ(x)∧ βφ to K. Wesay that h can be extended to a p-homomorphism h′ from φ(x) ∧ βφ ∧ ψ(x,y) ∧ βψ to K if either of the followingtwo happen:

i) if βK ⇒ h(x1θx2) for each AC x1θx2 in βψ.ii) the underlined homomorphism h0 of h can be extended to a homomorphism h′0 from φ(x) ∧ ψ(x,y) to K0

such that it holds βK ⇒ h′0(x1θx2) for each AC x1θx2 in βψ.

Definition 5.3. (p-satisfaction)Consider a tgd-AC: d : φ(x)∧ βφ → ψ(x,y)∧ βψ. Let K be an instance of the corresponding schema. We say K

p-satisfies d if whenever there is a p-homomorphism h from the left-hand side (lhs) of d (φ(x) ∧ βφ) to K, thereexists an extension h′ of h, where h′ is a p-homomorphism from the conjunction of lhs and the right-hand side(rhs) of d (ψ(x,y) ∧ βψ) to K.

Now, we proceed to the definition of succinct AC-chase. Intuitively, a succinct AC-chase step consists of thefirst two stages of Definition 3.1, and it does not do the final conversion of an instance to its induced t-instancesin Stage III. Correspondingly, every node in the succinct AC-chase step does not produce multiple children, buta single one, which is an instance possibly with a partial order. The formal definition follows.

Definition 5.4. (Succinct AC-Chase Step)Let K be an instance. We have two stages:Stage I: We construct instance K by adding relational facts and/or ACs to K. We have two possible cases:

a) (acgd): Let d be an acgd ϕ(x) → (x1θx2). Let h be a p-homomorphism from ϕ(x) to K = K0 +βK whichcannot be extended, i.e., such that βK ⇒ h(x1θx2). We say that d can be applied to K with p-homomorphism h.We add h(x1θx2) to K and produce K ′.

b) (tgd-AC): Let d : φ(x) ∧ βφ → ψ(x,y) ∧ βψ be a tgd-AC. Let h be a p-homomorphism from φ(x) ∧ βφ toK, such that there is no extension of h to a p-homomorphism from φ(x)∧βφ∧ψ(x,y)∧βψ to K. Then we do thefollowing. Let h0 be the underlined p-homomorphism. Let K ′

0 be the union of K0 with the set of facts obtained by:(a) extending h0 to a p-homomorphism h′0, such that each variable in y is assigned a fresh labeled null, followedby (b) taking the image of the atoms in the rhs of d under h′0. Then we do the following: for each AC x1θx2 inβψ, we add h′0(x1θx2) to K ′

0 and obtain K ′ in compact form.Stage II: We check K ′ for consistency. We distinguish two cases:

1. If K ′ is not consistent we say that the result of applying d to K with h is “failure” and write Kd,h−→ ⊥.

2. Otherwise, we turn K ′ in compact form and say that the result of applying d to K with h is K ′, and writeK

d,h−→ K ′.

definition 5.1. (Succinct AC-Chase)Let M = (S, T, Σ) be a data exchange setting with arithmetic comparisons, where each dependency in Σ isnormalized. Let K be a t-instance.• A chase sequence of K with Σ is a sequence (finite or infinite) of succinct AC-chase steps starting from K.• A finite succinct AC-chase of K with Σ is a finite succinct AC-chase sequence, with the requirement that eithera) the last instance Ke in the sequence is ⊥, or b) the last instance satisfies all the dependencies in Σ. We saythat the last instance is the result of the finite chase. We refer to case (a) as the case of a failing finite chase andto case (b) as the case of a successful finite chase.

Theorem 5.1 shows that the succinct AC-chase terminates in polynomial time for weakly acyclic tgd-ACs. Theproof is similar to the proof of Theorem 3.2 part (1).

Theorem 5.1. Let Σ be a set of weakly acyclic tgd-ACs and a set of acgds. Then there is a polynomial in thesize of a t-instance K that bounds the length of every succinct AC-chase sequence of K with Σ.

Page 23: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

5.3 Computing Universal Solutions in PTIME. The succinct AC-chase is not guaranteed to produce auniversal solution. In this section we identify two cases where it does produce a universal solution.

First, we observed that if we do not have target constraints, then the AC-chase step is simpler in that when wecheck whether a certain constraint d can be applied to an instance, then the variables on the lhs of this constraint(which is a tgd-AC) map always to constants. Thus, we have no need to “expand” the instance obtained in eachchase step so that its labeled nulls are totally ordered. The reason is because the total order is not of any use tothe AC-chase which thus reduces to the succinct AC-chase. The second case is when the tgd-ACs are full, i.e.,there are no existential variables on the rhs of the tgd-AC. This means that the AC-chase will not create anylabeled nulls, so any instance obtained by an AC-chase step does not contain labeled nulls, hence it is alreadytotally ordered. Thus in these two cases, we obtain a universal solution which is a singleton. The followingtheorem states these results.

Theorem 5.2. Assume a data exchange setting with arithmetic comparisons M = (S,T, Σst, Σt), where eithera) Σst ∪ Σt is a set of full tgd-ACs and acgds; or b) there are no target constraints (Σt = ∅). Then:

• Given a source instance I, if there is a solution for I under M , there is a universal solution for I that is asingleton.

• There is a polynomial-time algorithm in the size of the source instance that is based on a succinct AC-chaseprocedure such that, given a source instance I, it tests whether a solution for I exists, and, if so, it producesa singleton universal solution for I.

Proof. First claim: For the first case, no new labeled nulls are introduced after an AC-chase step. After addingnew facts, the new instance is still a t-instance. Hence, an AC-chase reduces to a succinct AC-chase. Forthe second case, in a succinct AC-chase step, the new facts for the target schema will not introduce any newt-homomorphism from the lhs of a dependency to each induced t-instance after the step. Thus the resultingt-instances of an AC-chase are the t-instances induced by the result of a succinct AC-chase.Second claim: It is a corollary of Theorem 5.1.

In Example 5.1, we only need one chase step to compute the universal solution {A(5, z) ∧ 5 ≤ z}, which is asingleton. Another example follows.

example 5.2. In Example 2.2, we mentioned that J3 is a universal solution of the setting of Example 2.1. This isproved by applying a succinct AC-chase step to the source instance I = {E(2, 5), E(4, 7)} with p-homomorphism:X → 2, Y → 5. This will produce a single instance K11 = {E(2, 5), E(4, 7),H(2, z1),H(z1, 5), z1 < 2}. Then weapply another succinct AC-chase step to K11 with p-homomorphism: X → 4, Y → 7.

5.4 Computing Certain Answers in PTIME. In the next paragraph, we consider a given instance (couldbe the result of the succinct AC-chase or an arbitrary instance) and show how the problem of computing certainanswers of a query on this instance (see Definition 4.2) can be computed efficiently when the query has a certaingood property, specifically the homomorphism property.

5.4.1 Homomorphism Property

Definition 5.5. (Homomorphism Property for Queries – HP in short)1. Let Q be a conjunctive query with arithmetic comparisons, and K be an instance with labeled nulls and

comparisons. We say that the homomorphism property holds on Q with respect to K if the following is true: Qp-homomorphically maps to K such that the head of Q is mapped to a tuple of constants t if and only if for everyt-instance induced by K, there is a t-homomorphism from Q to this t-instance such that the head of Q is mappedto t.

2. Let Q be a class of conjunctive queries with arithmetic comparisons and K be a class of instances. We saythat the homomorphism property holds on Q with respect to K if every Q in Q has this property with respect toeach instance in K.

3. If the class of instances K is understood from the context (usually we mean conjunctive formulas witharithmetic comparisons), we simply say that the CQAC Q or the union of CQACs Q has the homomorphismproperty.

Page 24: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Lemma 5.1 and Theorem 5.3 (which is a corollary of Lemma 5.1) show that if the homomorphism propertyholds then we can compute certain answers on an instance more efficiently.

Lemma 5.1. Let K be an instance, and q be a CQAC query. Assume the homomorphism property holds on q withrespect of K. Let π(q, K) = {h(head(q))|h is a p-homomorphism from q to K}. Then, π(q,K)↓ = certain(q, K).

Proof. We first prove certain(q, K) ⊇ π(q, K)↓. Let t be a tuple in π(q, K)↓ produced by a p-homomorphism hfrom q to K. Let G be a ground instance, such that there is a t-homomorphism µ from K to G. Then h ◦ µ is at-homomorphism from q to G, which is the identity mapping on the constants. Thus h ◦ µ will produce t in theanswer q(G). Thus t is also in certain(q,K).

Now we prove certain(q, K) ⊆ π(q,K)↓. Let t be a tuple in certain(q, K). By definition, for each groundinstance G, such that there is a t-homomorphism µ from K to G, there is a t-homomorphism from q to G thatmaps the head of q to t. Thus there is a t-homomorphism from q to every induced t-instance of K, which mapsthe head of q to t. Since q has the homomorphism property, there is a p-homomorphism h from q to K, such thatthe head of q is mapped to t. Hence, t ∈ π(q,K).

Theorem 5.3. Suppose a CQAC query Q is given for which the homomorphism property holds. Then, for anyinstance K, we can compute the certain answers of q on K in PTIME.

The work in [ALM04] presented conditions on a class of conjunctive formulas that can guarantee thehomomorphism property. The class of queries that have this property includes CQAC that use either onlyopen (closed respectively) left semi-interval (LSI) comparisons, i.e., of the form X < c (X ≤ c respectively) whereX is a variable and c is a constant, and the analogous cases for right semi-interval comparisons (RSI).

Theorem 5.4 states that we can compute certain answers in polynomial time using a universal solution. It isa direct consequence of Theorem 5.2 and Theorem 5.3.

Theorem 5.4. Assume a data exchange setting with arithmetic comparisons M = (S,T, Σst, Σt), where eithera) Σst ∪ Σt is a set of full tgd-ACs and acgds; or b) there are no target constraints (Σt = ∅). Let Q be a querywhich has the homomorphism property. Then given a source instance I we can compute the certain answers of Qin PTIME.

Interestingly, in some cases the result of succinct AC-chase can be used to compute certain answers althoughit is not itself a universal solution. One such case, where the constraints do not contain arithmetic comparisons,is stated in the following theorem.

Theorem 5.5. Assume that M = (S, T, Σst, Σt) is a data exchange setting which uses tgds and egds for source-to-target constraints and egds and weakly acyclic tgds for target constraints. Let I be a source instance and q aCQAC query which uses only open (closed respectively) LSI (RSI respectively) comparisons. Then we can use theresult of the succinct AC-chase to compute the certain answers of q in polynomial time.

To prove Theorem 5.5 we need to deal with a series of technical challenges. Mostly, they emanate from thefact that the result J of a succinct AC-chase might not be a universal solution for the source instance I underthe data-exchange setting M . Recall that a universal solution has two properties; a) it is a solution, and b) ithomomorphically maps on every solution. According to definition of solution, an instance is a solution if everyt-instance induced by it is a t-solution. However, for J there are induced t-instances that are not t-solutions. Thereason for that is that in some of those induced t-instances the constraints may not be satisfied.First, we need to prove the following lemma:

Lemma 5.2. Let M be a data exchange setting without arithmetic comparisons. Let J be the result of succinctAC-chase for a given source instance I, and Q be CQAC query, where either of the following happens: a) all thecomparisons of Q are LSI (open or closed), or b) all the comparisons of Q are RSI (open or closed). Let A be theset of all solutions. Then, the following holds; ∩Ji∈AQ↓(Ji) = certain(Q, J).

Proof. We will prove for left semi intervals (LSI) comparisons. The proof for right semi intervals (RSI) is similar.Let B be the set of all t-instances K, such that there is a t-homomorphism from J to K. We want to prove thefollowing: ∩Ji∈AQ↓(Ji) = ∩Ji∈BQ↓(Ji).

Page 25: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

Since A ⊆ B, the direction ∩Ji∈AQ↓(Ji) ⊇ ∩Ji∈BQ↓(Ji) is straightforward. For the other direction, let t bea tuple of constants in I. Suppose t ∈ ∩Ji∈AQ↓(Ji).

Let c be the greatest constant of the query. Take a solution J ′ which is J where all labeled nulls are equatedto distinct constants greater than c. J ′ is a solution because if it were not, then there is a t-homomorphismfrom the lhs of a tgd or egd to J ′ which cannot be extended. However this implies immediately that there is at-homomorphism from the lhs of a tgd or egd to J (because J and J ′ are isomorphic and remember that we usetgds and egds which means that there are no ACs) which cannot be extended. This leads to contradiction. Thisis the critical observation: J ′ is isomorphic to J and also there is the obvious isomorphism h (i.e., the one whichjust maps each labeled null of J to its value in J ′) which has the following property: it maps the labeled nullsvarJ of J to a set of constants h(varJ) in J ′ which are greater than c, hence they (the constants in h(varJ)) aregreater than any constant in the query. Thus the constants in J ′ are partitioned to a) the constants in h(varJ )and b) the constants in h(constJ) and not in h(varJ).

Consider a t-homomorphism µ from Q to J ′ that satisfies all ACs of Q and maps the head of Q to t (i.e., itcomputes t). Recall that any AC in Q is either of the form X ≤ c1 or of the form X < c1. Hence any variableX of the query that also appears in an AC in the query, is mapped by µ to a constant in J ′ which is less thanor equal to c (recall that c is the greatest constant appearing in the query). Therefore, such a variable X is notmapped to any constant in h(varJ), consequently X is mapped by µ to some constant in h(constJ ).

Because of isomorphism h, µ implies a homomorphism µ′ from the subgoals of the query to J . Moreover (asa consequence of the observations above) µ′ maps any variable X of the query that also appears in an AC in thequery to a constant in J (i.e., not to a labeled null). Hence µ′ also satisfies the ACs in the query.

Let J1 be any t-instance in B. There is a t-homomorphism ν from J to J1. The composition of µ′ and ν givesa t-homomorphism from Q to J1 that satisfies the ACs in Q and maps the head of Q to t. The reason that theACs in Q are satisfied is that variables in Q that also appear in ACs in Q are mapped by µ′ to constJ and hencethese variables are mapped by the composition of µ′ and ν also to constJ . Therefore, t ∈ ∩Ji∈BQ↓(Ji), and wehave ∩Ji∈AQ↓(Ji) ⊆ ∩Ji∈BQ↓(Ji).

We proceed to the proof of Theorem 5.5.

Proof. Suppose t is in the certain answers of Q on I. Let B the set of all induced t-instances of the result ofthe succinct AC-chase. From Lemma 5.2 t ∈ ∩Ji∈BQ↓(Ji). We have polynomially many tuples to check whetherthey are in certain answers. According to Theorem 5.3, for each tuple, it suffices to check query containment ofa CQAC query of size equal to the size of the universal solution J to a query of constant size (the query Q).Since homomorphism property holds for Q, one homomorphism suffices to check containment. This can be donein polynomial time.

5.5 Summary of our Polynomial Results. The polynomial results presented in this work are the following:

(A) We can compute a universal solution using succinct AC-chase in polynomial time in the following two cases:

(a) Σt = ∅;(b) Σst ∪ Σt are full tgd-ACs.

(B) We can compute certain answers in polynomial time whenever the query is a conjunctive query witharithmetic comparisons which has the homomorphism property and moreover it holds:

(a) We can compute the universal solution is polynomial time, hence in the two cases in (A) above.(b) The query uses only open (closed respectively) LSI (RSI respectively) comparisons and the data

exchange setting uses only tgds and egds.

5.6 Negative results. In this subsection we present some negative results. In particular, we show some casesin which the result of the succinct chase can not be used to compute the certain answers even when queries posedon the target schema do not have arithmetic comparisons (they are plain CQs).

The following example shows that we can not compute the certain answers of a conjunctive query witharithmetic comparisons using the result of the succinct chase in the case where we have a single full target tgd-ACand its rhs contains no arithmetic comparisons.

Page 26: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

example 5.3. Suppose after some chase steps we have the instance:

K : {a(5, x), a(5, y), b(x, y), 5 ≤ x, 5 ≤ y}and the tgd-AC d: a(x, y), 5 < y → b(x, x).

When applying the succinct chase, there is no p-homomorphism from the lhs of d to K, hence the resultof the succinct chase is K.

Suppose we have the Boolean query Q:- b(A,A). Q computes to FALSE on K. However the certain answersof Q on this data exchange setting is TRUE because if we do the AC-chase, then:

• an induced t-instance of K is a(5, x), a(5, y), b(x, x) on which the AC-chase decides that it is a solution.

• all other induced t-instances of K have the property that one of the x or y (in K) are strictly greater than5, hence a chase step is applied producing a number of solutions all of which have an atom b(w, w) (eitherw = x or w = y).

Hence the query Q computes to TRUE on all solutions.

The following example shows that the result of the succinct chase can not be used to compute the certainanswers of a conjunctive query with arithmetic comparisons in the case where Σt is a set of (or single) targettgd-ACs who rhs contains only equality comparisons.

example 5.4. Suppose after some chase steps we have the instance:

K : {a(5, x), a(5, y), b(x, y), 5 ≤ x, 5 ≤ y}and the acgd d′ : a(x, y), 5 < y → x = y.

When applying the succinct chase, there is no p-homomorphism from the lhs of d′ to K, hence the resultof the succinct chase is K.

Suppose we have the Boolean query Q:- b(A,A). Q computes to FALSE on K. However the certain answersof Q on this data exchange setting is TRUE because if we do the AC-chase, then:

• an induced t-instance of K is a(5, 5), a(5, 5), b(5, 5) on which the AC-chase decides that it is a solution.

• all other induced t-instances of K have the property that one of the x or y (in K) are strictly greater than5, hence a chase step is applied producing a number of solutions all of which have an atom b(w, w) (eitherw = x or w = y).

Hence the query Q computes to TRUE on all solutions.

example 5.5. Suppose after some chase steps we have the instance:

K : {a(7, x), a(7, y), b(x, y), x ≤ 5, y ≤ 5}and the tgd-AC d : b(x, x) → a(7, z), z < 5.

Suppose we have the Boolean query Q:- a(7, x), x < 5. Certain answers can not be obtained using theresult of the succinct chase.

example 5.6. Suppose after some chase steps we have the instance:

K : {a(7, x), a(7, y), b(x, y), x ≤ 5, y ≤ 5}and the tgd-AC d : b(x, x) → c(x, x).

For any query posed on K, certain answers can be obtained using the result of the succinct chase but theresult of the succinct chase is not a universal solution. Two conditions must hold for the result to be a universalsolution; a) it represents all solutions and b) it is a solution itself. In this example the first condition holds sincethere is a t-homomorphism from K to every solution but the second condition does not hold because there is aninduced instance of K which is not a solution.

Page 27: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

It remains an interesting open problem to identify broader classes of schema mappings for which succinct chasecan be used to compute the certain answers of a conjunctive query with arithmetic comparisons. The followingare some cases which seem interesting to be investigated. One of them is when Σt contains only tgd-ACs with noarithmetic comparisons on their lhs.

6 Optimization Using Type Information.

So far we assume attributes take values from a single totally ordered dense domain, and all constants are containedin this domain. In many applications we do not compare values from different domains. For example, it is notmeaningful to compare a price with a year. Generally, arithmetic comparisons are between values from the samedomain. That is, values are typed, and comparisons only involve values from the same type (domain). In theexample in Section 1, the values of the attribute zip and the values of the price attribute are from differentdomains. Given a data exchange setting, we can deduce domain information or type information by drawing anundirected graph with vertices for the attributes and edges representing the fact that two attributes are containedin the same comparison in one of the given dependencies. Each connected component defines one domain (ortype).

In the presence of such type information, a typed t-instance is defined to be an instance where its comparisonsdefine a total order among labeled nulls and constants of the same domain. That is, we do not need to definean order between values in different domains, if we know that we will never encounter a dependency that has acomparison between them.

In general, all concepts introduced in previous sections can be modified slightly to enforce the variablesappearing in arithmetic comparisons to belong to the same domain. In fact all definitions are practically simplified.In the last stage of an AC-chase step in Definition 3.1, we only construct all typed t-instances induced by K ′. Inthis way, we reduce the number of children of each node in an AC-chase tree. The only concept that does notneed to change is the weakly acyclic graph. To create the dependency graph, we allow the interrelation betweennodes of different types (domains), since such an interrelation captures how different attributes affect each otherin terms of introducing new labeled nulls. Even for the query answering problem, all results transfer naturallyto typed database schemas by requiring the variables appearing in the arithmetic comparisons of the query tobelong to the same domain.

6.1 Typed Data Exchange Settings. A database schema is called typed if each attribute A in a relation Rin the schema is associated with a domain, denoted by dom(R.A), which includes the set of possible values forthis attribute. Some of these attributes could share the same domain. For instance, Table 2 shows the domains ofthe attributes in the source schema of Example in the Introduction. Attributes Home.aID and Owner.aID sharethe same domain.

Attribute DomainHome.aID Apartment IDs.Home.appartAddr Apartment addresses.Home.price Prices.Owner.aID Apartment IDs.Owner.name Person names.Owner.Addr Owner’s address.

Table 2: Domains of attributes.

Given a query, tgd-AC, acgd, or an instance (possibly with labeled nulls), we use the following two rules tocompute the domains of its variables, constants, and/or labeled nulls.

r1: In a conjunct R(X1, . . . , Xk), the domain of Xi, denoted by Dom(Xi), is the corresponding domain of thej-th attribute in relation R.

r2: For each comparison X θ c between a variable or labeled null X and a constant c, we set the domain Dom(c)of the constant c to be the domain Dom(X) of X.

Page 28: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

For instance, after applying these rules on the tgd-AC d1 in Section 1, we have dom(I) = dom(Home.aID), anddom(P) = dom(Home.price). The information about the domains in a query, a tgd-AC, an acgd, or an instance,can be obtained in polynomial time on its number of variables, constants, and labeled nulls. We say a query,a tgd-AC, an acgd, or an instance on a typed schema is typed if the following are true when applying the tworules r1 and r2 on it: (1) If there is a shared variable or labeled nulls in two formulas, then their correspondingattributes have the same domain. (2) For each comparison X θ Y , where X and Y are variables or labeled nulls,Dom(X) and Dom(Y ) are the same.

6.2 Extending Results. We extend the results developed so far to typed data exchange settings witharithmetic comparisons and typed queries on the target schema. To start with, we revisit the concept of “t-instance” extending it to “typed t-instance” as follows. A typed t-instance over a typed schema is a typed instance,such that for each domain of its constants and labeled nulls, its comparisons define a total order of the constantsand labeled nulls in this domain. Notice that our earlier results are developed based on two important concepts:“instance” and “t-instance”. We replace each reference of “instance” by “typed instance” in these results, andeach reference of “t-instance” by “typed t-instance.” All the earlier results hold after the replacements.

This extension results in a more efficient chase procedure, since we do not need to consider all total ordersof labeled nulls and constants. Instead, we only need to have total orders within each domain, while the orderbetween two variables/constants belonging to different domains does not need to be defined. In particular, inthe definition of AC-Chase Step (Definition 3.1, in Stage III, we can convert each instance to a set of typedt-instances.

Theorem 6.1. Assume a typed data exchange setting with arithmetic comparisons, for which there is a singletonuniversal solution for a source instance I. Suppose that a query on the target schema is such that, for each type(domain) D, either a) there is only one comparison of type D in the query, or b) all the comparisons of type Dare open (closed) LSI (RSI). Then, there is a polynomial algorithm to compute the certain answers to the queryon the source instance I.

[[[ Proof of theorem missing ]]] <==Notice that Theorem 6.1 allows the comparisons of one type to be open LSI, while the comparisons of

another type to be open RSI. Thus, the theorem extends the polynomial results in the case of typed informationsignificantly.

7 Related Work and Conclusions.

Recent work on data exchange can be found in [AL05, Ber03, FKMP03, FKP03, Got05, GN06, KPT06, Lib06,VR07]. Chase is a useful tool for reasoning about dependencies [MMS79, BV84b, BV84a]. Extending dependencytheories for interpreted domains has not received much attention. Constraint-generating dependencies overinterpreted domains and especially over totally ordered domains were investigated in [BCW99], where theimplication and consistency problem was studied. It is shown that the implication problem is in co-NP foracgds, and in PTIME for acgds with no constraints on the antecedent. The implication problem for a subclass ofthe constraint-generating dependencies in [BCW99] is studied in [Mah97] and some more polynomial cases wereidentified.

A broader class than tgd-ACs was studied in [MS96], and two chase algorithms were given for the implicationproblem which are not complete. In [JWM01] the class in [MS96] is extended to allow also disjunctions. Theintuition of the chase in [JWM01] is drawn from the CQAC query containment test based on logical implication.There are no complexity results in [JWM01]. Tgd-ACs are considered implicitly in constraint checking in [IO93].

The definition of certain answers was introduced as the standard concept used in incomplete databases[Gra91, Mey98]. It also became the standard semantics of query answering in data integration [AD98, Len02]and data exchange [FKMP03]. Computing certain answers of conjunctive queries with inequalities is shown to bein coNP in [FKMP03]. If we have a union of conjunctive queries with inequalities, where at most one inequalityis allowed in every conjunctive query in the union, then the problem is polynomial-time computable, whereaswhen two inequalities are used per conjunctive query in the union then it is shown to be coNP-hard [FKMP03].Abiteboul and Duschka [AD98] showed that there is a single conjunctive query with seven inequalities and aschema mapping with no target constraints for which computing the certain answers in a coNP-complete problem.Madry [Mad05] proved the same result for a single conjunctive query with two inequalities. Containment tests for

Page 29: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

conjunctive queries [CM77] and conjunctive queries with comparisons [Klu88, Gup94, ALM02] provide some ofthe technical tools for studying data exchange problems both in the absence and in the presence of comparisons.Conclusions: In this work, we defined a framework of data exchange with arithmetic comparisons. We developeda novel chase procedure, AC-chase, to compute a universal solution for a source instance under a data exchangesetting with arithmetic comparisons. AC-chase is based on intuition gained from the CQAC query containmenttest based on canonical databases. We investigated the conditions under which the AC-chase terminates andwe studied the problem of query answering. We developed polynomial results for computing universal solutionsand computing certain answers. Polynomial results on computing certain answers of conjunctive queries withinequalities were also presented in [FKMP03]. For this purpose a chase called disjunctive was developed whichhowever it is tailored to a specific query, whereas the AC-chase presented in this paper can be used to computecertain answers of any query. We are currently working on developing more polynomial results. There areexamples showing that delineating tractable cases is not easy. Another interesting future direction is to considerdiscrete ordered domains, such as the integers. We conjecture that most of the results will change significantly,although the initial definitions may easily carry over. AC-chase can be used to study implication for the class oftgd-ACs and it seems that some of the results in this paper can solve specific cases of the implication problem ina rather straightforward way.

References

[AD98] S. Abiteboul and O. M. Duschka. Complexity of answering queries using materialized views. In PODS, 1998.[AL05] M. Arenas and L. Libkin. XML data exchange: Consistency and query answering. In PODS, 2005.[ALM02] F. Afrati, C. Li, and P. Mitra. Answering queries using views with arithmetic comparisons. In PODS, 2002.[ALM04] F. Afrati, C. Li, and P. Mitra. On containment of conjunctive queries with arithmetic comparisons. In EDBT,

2004.[BCW99] M. Baudinet, J. Chomicki, and P. Wolper. Constraint-generating dependencies. J. Comput. Syst. Sci., 59(1):94–

115, 1999.[Ber03] P. A. Bernstein. Generic model management: A database infrastructure for schema manipulation. In Keynote

Address, IDM 2003 Workshop, Seattle, Washington, 2003.[BV84a] C. Beeri and M. Y. Vardi. Formal systems for tuple and equality generating dependencies. SIAM J. on Computing,

13(1):76–98, 1984.[BV84b] C. Beeri and M. Y. Vardi. A proof procedure for data dependencies. Journal of the ACM, 31(4):718–741, 1984.[CM77] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In

STOC, 1977.[DT03] Alin Deutsch and Val Tannen. Reformulation of XML queries and constraints. In ICDT, 2003.[FKMP03] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. In ICDT,

2003. Full version: TCS 336(1): 89–124, 2005.[FKP03] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: Getting to the core. In PODS, 2003. Full version: ACM

TODS 30(1): 174–210, 2005.[GN06] G. Gottlob and A. Nash. Data exchange: Computing cores in polynomial time. In PODS, 2006.[Got05] G. Gottlob. Computing cores for data exchange: New algorithms and practical solutions. In PODS, 2005.[Gra91] G. Grahne. The problem of incomplete information in relational databases, volume 554. Spring-Verlag, Berlin,

Lecture Notes in Computer Science, 1991.[GSUW94] A. Gupta, Y. Sagiv, J. D. Ullman, and J. Widom. Constraint checking with partial information. In PODS,

1994.[Gup94] A. Gupta. Partial Information Based Integrity Constraint Checking. PhD thesis, Stanford, 1994.[IO93] N. Ishakbeyoglu and Z. M. Ozsoyoglu. On the maintenance of implication integrity constraints. In DEXA, 1993.[JWM01] R. Topor J. Wang and M. Maher. Reasoning with Disjunctive Constrained Tuple–Generating Dependencies. In

DEXA, 2001.[Klu88] A. Klug. On conjunctive queries containing inequalities. Journal of the ACM, 35(1):146–160, 1988.[Kol05] P. G. Kolaitis. Schema Mappings, Data Exchange, and Metadata Management. In PODS, 2005.[KPT06] P. Kolaitis, J. Panttaja, and W.-C. Tan. The complexity of data exchange. In PODS, 2006.[Len02] M. Lenzerini. Data integration: A theoretical perspective. In PODS, 2002.[Lib06] L. Libkin. Data exchange and incomplete information. In PODS, 2006.[LS93] A. Levy and Y. Sagiv. Queries independent of updates. In VLDB, 1993.[Mad05] A. Madry. Data exchange: on complexity of answering queries with inequalities. Information Processing Letters,

94(6):253–257, 2005.

Page 30: Data Exchange with Arithmetic Comparisonschenli/pub/deac-tr.pdf · Data Exchange with Arithmetic Comparisons⁄ Foto Afrati1 Chen Li2 Vassia Pavlakiy1 1 National Technical University

[Mah97] M.J. Maher. Constrained dependencies. Theoretical Computer Science, 173(1):113–149, 1997.[Mey92] R. van der Meyden. The complexity of querying indefinite data about linearly ordered domains. In PODS, 1992.[Mey98] R. van der Meyden. Logical approaches to incomplete information: A survey. In Logics for Databases and

Information Systems, pages 307–356, 1998.[MMS79] D. Maier, A. O. Mendelzon, and Y. Sagiv. Testing implications of data dependencies. ACM TODS, 4(4):455–469,

1979.[MS96] Michael J. Maher and Divesh Srivastava. Chasing constrained tuple-generating dependencies. In PODS, 1996.[MUV84] D. Maier, J. D. Ullman, and M. Y. Vardi. On the foundations of the universal relation model. ACM TODS,

9(2):283–308, 1984.[VR07] Adrien Vieilleribiere and Michel De Rougemont. Approximate data exchange. In ICDT, 2007.[ZO93] X. Zhang and M. Z. Ozsoyoglu. On efficient reasoning with implication constraints. In DOOD, 1993.