orsi vldb11

Post on 22-Nov-2014

341 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Optimizing Query Answering

under

Ontological Constraints

Giorgio Orsi1,2 and Andreas Pieris2

1Institute for the Future of Computing

Oxford Martin School

University of Oxford

2Department of Computer Science

University of Oxford

VLDB 2011

Ontological Databases

Ontological Reasoning DB Constraints

Ontological DB

Ontological Databases

D

D

D

ABox

TBox

Ontological Reasoning DB Constraints

Ontological DB

Ontological Databases

D

D

D

Q(X) 9Y (X,Y)

ABox

TBox

Ontological Reasoning DB Constraints

Ontological DB

Ontological Databases

D

D

D

, ABox

TBox

{ t | D [ ² 9u (t,u) }

Ontological Reasoning DB Constraints

Ontological DB

Q(X) 9Y (X,Y)

Ontological Constraints (examples)

Concept Inclusions: 8X emp(X) person(X)

(Inverse) Relation Inclusion:

Relation Transitivity: 8X8Y8Z mgs(X,Y),mgs(Y,Z) mgs(X,Z)

8X8Y manages(X,Y) isManaged(Y,X)

Participation: 8X emp(X) 9Y report(X,Y)

Disjointness: 8X emp(X), customer(X) ?

Functionality: 8X8Y8Z reports(X,Y),reports(X,Z) Y = Z

Datalog§

¡ Datalog variant allowing in the head:

- 9-variables ! TGDs 8X8Y (X,Y) 9Z (X,Z)

- Equality atoms ! EGDs 8X (X) Xi=Xj

- Constant false (?) ! NCs 8X (X) ?

Datalog+

[Cali’ et Al, PODS 09]

Datalog+

Datalog§ [Cali’ et Al, PODS 09]

¡ Datalog variant allowing in the head:

- 9-variables ! TGDs 8X8Y (X,Y) 9Z (X,Z)

- Equality atoms ! EGDs 8X (X) Xi=Xj

- Constant false (?) ! NCs 8X (X) ?

¡ But, query answering under Datalog+ is undecidable

Datalog+

Datalog§ [Cali’ et Al, PODS 09]

¡ Datalog variant allowing in the head:

- 9-variables ! TGDs 8X8Y (X,Y) 9Z (X,Z)

- Equality atoms ! EGDs 8X (X) Xi=Xj

- Constant false (?) ! NCs 8X (X) ?

¡ Datalog+ is syntactically restricted ! Datalog§

¡ But, query answering under Datalog+ is undecidable

Datalog+

Datalog§ [Cali’ et Al, PODS 09]

¡ Datalog variant allowing in the head:

- 9-variables ! TGDs 8X8Y (X,Y) 9Z (X,Z)

- Equality atoms ! EGDs 8X (X) Xi=Xj

- Constant false (?) ! NCs 8X (X) ?

¡ Datalog+ is syntactically restricted ! Datalog§

¡ But, query answering under Datalog+ is undecidable

¡ TGDs more expressive than inclusion dependencies

8D8P8A runs(D,P),area(P,A) 9E employee(E,D,A)

The Chase Procedure

Input: Database D, set of TGDs

Output: A model of D [

person(john)

8X person(X) 9Y father(Y,X) 8X8Y father(X,Y) person(X)

D

chase(D,) = D [ ?

The Chase Procedure

Input: Database D, set of TGDs

Output: A model of D [

person(john)

D

chase(D,) = D [ {father(z1,john)

8X person(X) 9Y father(Y,X) 8X8Y father(X,Y) person(X)

The Chase Procedure

Input: Database D, set of TGDs

Output: A model of D [

person(john)

D

chase(D,) = D [ {father(z1,john), person(z1)

8X person(X) 9Y father(Y,X) 8X8Y father(X,Y) person(X)

The Chase Procedure

Input: Database D, set of TGDs

Output: A model of D [

person(john)

D

chase(D,) = D [ {father(z1,john), person(z1), father(z2,z1)

8X person(X) 9Y father(Y,X) 8X8Y father(X,Y) person(X)

The Chase Procedure

Input: Database D, set of TGDs

Output: A model of D [

person(john)

D

chase(D,) = D [ {father(z1,john), person(z1), father(z2,z1), …}

8X person(X) 9Y father(Y,X) 8X8Y father(X,Y) person(X)

Query Answering via Chase

[see, e.g., Deutsch, Nash & Remmel, PODS 08]

D [ ² Q , chase(D,) ² Q

D

. . .

C = chase(D,)

M1

M2

h1 h2

h1(C) h2(C)

Q h

Query Answering via Chase

D

. . .

C = chase(D,)

M1

M2

h1 h2

h1(C) h2(C)

Q h

Possibly Infinite!

Q

Query answering via rewriting

Q

Q

compilation

Query answering via rewriting

Q

evaluation

Q

Q

compilation

D

Query answering via rewriting

Chase vs rewriting

Linear TGDs

8X8Y r(X,Y) 9Z (X,Z)

single body atom

¡ Properly generalize inclusion dependencies.

¡ Enjoy the bounded-derivation depth property.

¡ FO-rewritable query answering in AC0 (data complexity).

Bounded derivation depth property (BDDP)

D

Q

Bounded derivation depth property (BDDP)

D

Q k levels

constant

w.r.t. D

chase(D,) ² Q ) chasek(D,) ² Q

Bounded derivation depth property (BDDP)

BDDP ) First-Order Rewritability

D

Q k levels

constant

w.r.t. D

Q

q promotesTo(A,B), customer(B) (original query)

promoter(X) Y promotesTo(X,Y)

promotesTo(X,Y) customer(Y)

q promotesTo(A,B), customer(B) Q

FO-rewritability: Example [Gottlob et Al., ICDE 11]

q promotesTo(A,B), customer(B)

q promotesTo(A,B), customer(V0,B)

{ Y = B }

( V0 is fresh )

promoter(X) Y promotesTo(X,Y)

promotesTo(X,Y) customer(Y)

Q

FO-rewritability: Example [Gottlob et Al., ICDE 11]

Q

q promotesTo(A,B), customer(B)

q promotesTo(A,B), customer(B)

q promotesTo(A,B), promotesTo(V0,B) ans(A) promotesTo(A,B)

factorization

{ A = V0 }

promoter(X) Y promotesTo(X,Y)

promotesTo(X,Y) customer(Y)

Q

FO-rewritability: Example [Gottlob et Al., ICDE 11]

Q

q promotesTo(A,B), customer(B)

q promoter(A)

promoter(X) Y promotesTo(X,Y)

promotesTo(X,Y) customer(Y)

Q

FO-rewritability: Example [Gottlob et Al., ICDE 11]

Q

q promotesTo(A,B), customer(B)

q promotesTo(A,B) {X = A, Y = B}

q promotesTo(A,B), customer(B)

UCQ rewriting

(first-order)

promoter(X) Y promotesTo(X,Y)

promotesTo(X,Y) customer(Y)

Q

FO-rewritability: Example [Gottlob et Al., ICDE 11]

Q

q promoter(A)

q promotesTo(A,B), customer(B)

q promotesTo(A,B)

q promotesTo(A,B), customer(B)

FO-rewritability

¡ Desirable properties of a FO-rewriting:

independent on the DB

executable by any DBMS

easy to compute (e.g., polynomial time)

small size (e.g., polynomial size)

FO-rewritability

¡ Unions of Conjunctive Queries (UCQs)

executable by any DBMS

DB independent

easy to optimize and distribute

worst-case exponential size in Q and

Calvanese et Al, JAR 07

Perez Urbina et Al, JAL 09

Cali’ et Al, PODS 09

Gottlob et Al, ICDE 11

and others…

¡ Desirable properties of a FO-rewriting:

independent on the DB

executable by any DBMS

easy to compute (e.g., polynomial time)

small size (e.g., polynomial size)

¡ Combined and hybrid FO-rewriting

good computational properties

(e.g., polynomial in size)

requires access to the DB

Kontchakov et Al., KR 10

Gottlob and Schwentick, DL 11

FO-rewritability

¡ Purely intensional Datalog rewriting

very compressed representation

purely intensional

requires view-creation or Datalog engine

¡ Combined and hybrid FO-rewriting

good computational properties

(e.g., polynomial in size)

requires access to the DB

Kontchakov et Al., KR 10

Gottlob and Schwentick, DL 11

Perez Urbina et Al, JAL 09

Rosati and Almatelli., KR 10

Orsi and Pieris., VLDB 11

FO-rewritability

Datalog rewriting: keep it first-order!

¡ A Datalog query is (in general) not a first-order query

a non-recursive Datalog query is a first-order query

a bounded Datalog query is a first-order query

an output-predicate-bounded Datalog query is a first-order query

Datalog rewriting: keep it first-order!

¡ Input:

a (w.l.o.g. boolean) conjunctive query Q = <q,ρ>

Q : q(X) p(X), s(X,Y) <q, q(X) p(X),s(X,Y) >

a set of linear TGDs

¡ Output:

a bounded Datalog query Q = <q,π >

¡ A Datalog query is (in general) not a first-order query

a non-recursive Datalog query is a first-order query

a bounded Datalog query is a first-order query

an output-predicate-bounded Datalog query is a first-order query

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

t is unbounded t(X) cannot be computed using

a DB-independent FO-query

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

t is unbounded t(X) cannot be computed using

a DB-independent FO-query

t(X) r(X,Y), t(Y)

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

t is unbounded t(X) cannot be computed using

a DB-independent FO-query

t(X) r(X,Y), r(Y,Z), t(Z)

t(X) r(X,Y), t(Y) v

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

t is unbounded t(X) cannot be computed using

a DB-independent FO-query

t(X) r(X,Y), r(Y,Z), t(Z) v

t(X) r(X,Y), t(Y) v

t(X) r(X,Y), r(Y,Z), r(Z,W), t(W)

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

t is unbounded t(X) cannot be computed using

a DB-independent FO-query

t(X) r(X,Y), r(Y,Z), t(Z) v

t(X) r(X,Y), t(Y) v

t(X) r(X,Y), r(Y,Z), r(Z,W), t(W) v

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

q is bounded q(X) can be computed using

a DB-independent FO-query

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

q is bounded q(X) can be computed using

a DB-independent FO-query

q(X) r(X,Y)

Predicate-bounded Datalog

r(X,Y), t(Y) → t(X)

r(X,Y ) → r(Y,X)

r(X,Y ), t(Z) → q(X)

¡ bounded Datalog program

all IDB predicates are bounded.

¡ p-bounded Datalog program

the predicate p is bounded.

q is bounded q(X) can be computed using

a DB-independent FO-query

q(X) r(Y,X)

q(X) r(X,Y) v

Output-predicate-bounded Datalog and BDDP

OPB

DATALOG

PROGRAM

Output-predicate-bounded Datalog and BDDP

INPUT

RULES

OPB

DATALOG

PROGRAM

BDDP

RULES

OUTPUT

RULES

Output-predicate-bounded Datalog and BDDP

OPB

DATALOG

PROGRAM

enjoy

BDDP

property

non recursive non recursive

INPUT

RULES

BDDP

RULES

OUTPUT

RULES

Datalog Rewriting: Skolemization (and renaming)

r(X,Y) Z s(Y,Z)

s(X,Y) Z p(Y,Y,Z)

p(X,Y,Z) t(Z)

Datalog Rewriting: Skolemization (and renaming)

r(X,Y) Z s(Y,Z)

s(X,Y) Z p(Y,Y,Z)

p(X,Y,Z) t(Z)

r(X1,Y1) s(Y1,f1(Y1))

s(X2,Y2) p(Y2,Y2,f2(Y2))

p(X3,Y3,Z3) t(Z3)

f

Datalog Rewriting: Skolemization (and renaming)

r(X,Y) Z s(Y,Z)

s(X,Y) Z p(Y,Y,Z)

p(X,Y,Z) t(Z)

¡ f and are equisatisfiable (not equivalent)

¡ Introduce one Skolem function for each existential variable

r(X1,Y1) s(Y1,f1(Y1))

s(X2,Y2) p(Y2,Y2,f2(Y2))

p(X3,Y3,Z3) t(Z3)

f

Datalog Rewriting: Rule Saturation

¡ Apply resolution inference rule to rules in f

at least one of the rules contains Skolem terms

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

f

Datalog Rewriting: Rule Saturation

¡ Apply resolution inference rule to rules in f

at least one of the rules contains Skolem terms

f [f]

r(X1,Y1) p(f1(Y1) ,f1(Y1), f2(f1(Y1)))

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

Datalog Rewriting: Properties of Rule Saturation

¡ [f] mimics the chase derivations.

Datalog Rewriting: Properties of Rule Saturation

¡ [f] mimics the chase derivations.

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

Datalog Rewriting: Properties of Rule Saturation

¡ [f] mimics the chase derivations.

¡ [f] depends only on .

¡ [f] is possibly infinite

linear TGDs have BDDP: suffices to construct it up to k steps [f]k.

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

Datalog Rewriting: Query Saturation

¡ resolve [f] with the query Q.

use only rules with Skolem terms.

Datalog Rewriting: Query Saturation

¡ resolve [f] with the query Q.

use only rules with Skolem terms.

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

[δ12]] : r (X1,Y1) p(f1(Y1) ,f1(Y1), f2(f1(Y1)))

Q s(A,B), p(B,B,C) [f] Q

Q r(X1,Y1), p(f1(Y1), f1(Y1),C)

[Q,f]

Datalog Rewriting: Query Saturation

¡ bypasses chase derivations with function symbols

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

[δ12]] : r (X1,Y1) p(f1(Y1) ,f1(Y1), f2(f1(Y1)))

Q s(A,B), p(B,B,C)

[f]

Q

Datalog Rewriting: Finalization

¡ keep only the function-free rules from [f] [ [Q,f]

¡ derivations producing certain answers are captured by function-

symbol-free rules.

¡ (optional) add auxiliary rules to make the rewriting a proper Datalog

query i.e., clear distinction IDB/EDB.

OPB-Datalog and BDDP revisited

OPB

DATALOG

PROGRAM

INPUT

RULES

BDDP

RULES

OUTPUT

RULES

ff([f]) ff([Q,f]) aux

enjoy

BDDP

property

non recursive non recursive

¡ use the predicate graph to reduce the number of rules in f

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

f

Q s(A,B), p(B,B,C)

Q

Optimizations: Pruning

¡ use the predicate graph to reduce the number of rules in f

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

f

Q s(A,B), p(B,B,C)

Q

Optimizations: Pruning

r s

p t

Q

Optimizations: Pruning

¡ use the predicate graph to reduce the number of rules in f

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

f

Q s(A,B), p(B,B,C)

Q

r s

p t

Q

Optimizations: Pruning

¡ use the predicate graph to reduce the number of rules in f

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

δ3 : p(X3,Y3,Z3) t(Z3)

f

Q s(A,B), p(B,B,C)

Q

¡ we are no longer

independent on Q!

r s

p t

Q

Optimizations: Query Elimination

¡ eliminate implied atoms during query saturation

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

f Q

Q s(A,B), p(B,B,C)

Optimizations: Query Elimination

¡ eliminate implied atoms during query saturation

f Q

s(A,B) ² p(B,B,C)

Q s(A,B), p(B,B,C)

atom coverage

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

Optimizations: Query Elimination

¡ eliminate implied atoms during query saturation

f

Q s(A,B), p(B,B,C) ≡ Q s(A,B)

Q

s(A,B) ² p(B,B,C)

Q s(A,B), p(B,B,C)

atom coverage

δ1 : r (X1,Y1) s(Y1,f1(Y1))

δ2 : s(X2,Y2) p(Y2,Y2,f2(Y2))

Optimizations: Query Elimination

¡ unique elimination strategy (w.r.t. the number of eliminated atoms)

see paper.

¡ given m = |body(ρ)| and n = ||

worst-case size of [f] [ [Q,f] is O((n∙m)m)

worst-case size of Q = <q,π > is O(n+m)m

¡ atom coverage under linear TGDs can be checked in polynomial time

see paper.

Experimental Results

Discussion

¡ Datalog rewriting is substantially more compact than UCQ rewriting.

¡ Does Datalog improve the end-to-end performance?

¡ Extend the procedure to larger classes of TGDs

guarded TGDs [Cali’ et Al, PODS 09] non FO-rewritable

sticky-join TGDs [Cali’ et Al, VLDB 10]

The Datalog Family

Thomas

Lukasiewicz

Georg

Gottlob

Andreas

Pieris

Andrea

Calì

Giorgio

Orsi

Thank you!

top related