li chen

KapQuilt: Semantic Caching for Quilt Queries

--- A New Quilt Query Answerable By Cached Ones?

Li Chen

Outline
• Background
• Motivation
• Goals: semantic cache for Quilt queries
• Overall task list
• Immediate task
• Approaches for containment and rewriting
• Case studies
• Module design
• Timetable

Background

[Diagram: answering queries using views. Labels: query optimization; database design; web site; integration; independence of physical & logical data; containment theory & algorithms; semantic cache; materialized view maintenance; views; query performance; query efficiency; query quality; data warehousing; data mining; Argos; SWEEP; ECA; dynamically decide materialized views; control concurrency of queries & updates; web site management]

Dimensions of Semantic Caching

• outcomes: rewritten query, query plan
• languages
– SQL: simple select-project-join (SPJ); group, aggregation, query blocks; datalog
– OQL
– TSL
– Quilt
• containment relationships: fully contained, max-contained

Motivations

• What’s new about semantic caching?
– Web proxies just cache web page hits, not real computed queries
– Web information integration needs expressive XML queries
– Semantic caching for XML queries is new
– Quilt is a full-capability XML query language, promising for the integration of web information, and Kweelt is an implemented Quilt query engine

Goals

• Goal: build a semantic caching (SC) system for Quilt queries
– to better answer popular queries
– quicker, less expensive and more up-to-date
• KapQuilt comes to the rescue

Limitation

We start from the core subset of Quilt queries, ignoring nested queries and regular expression queries for now.

KapQuilt System Architecture

[Architecture diagram. Labels: KSP Client; DOM; XML Parser; XML Source 1, XML Source 2, … XML Source n; other node factories; Doc RDB; Parser Wrapper; Kweelt Engine; Kweelt API; Parser; Evaluator; Query Rewriter; query plans; Cost Estimator; DTDM; Query Matcher; Cached Views (q1, q2, q3); Query Decomposer; PQ; KapQuilt; RQ; remote query requests; CIS]

Task List

• Answer whether a query is computable by cached ones
• If answerable, compute PQ (probe query) and RQ (remainder query)
• If there are many PQ candidates, pick the one that benefits most
• Decide whether a query is worth caching, and when to cache it
• In case of cache space limitation, apply a replacement policy
• Decompose and coalesce the query segments in the cache
• Concurrency control of queries and updates
• Analyze costs in various web query architectures
• ? Keep cached views always fresh
– integrate Argos with KapQuilt

Immediate Task --- MQP project goal as well

• Design and implement the core functions of KapQuilt
– input
  • a set of cached queries S={s1,s2...}
  • a new query q
– output
  • a probe query (PQ)
    – might be null if q is not answerable at all
    – if not null, PQ is posed against the cached (s1, s2, …, sn)
  • a remainder query (RQ)
    – might be null if q is fully contained in S
    – if not null, RQ goes down to query against the data sources

Approaches

• Analyze quilt query process and its variable binding mechanism

• Set up cache index structure (CIS) to represent elements of a quilt query

• Warm up cache by initializing CIS with decomposed queries

• Implement the query containment and rewriting algorithm for quilt

• Conduct experimental studies for cost analysis

• Integrate with Argos system for cached view maintenance

A Taxonomy for XML Query

[Diagram relating the languages: XML-QL, XSL Patterns, XPointer, XQL, XQL-99, XPath, OQL, SQL, Quilt]

Briefs on Quilt

• Quilt is a functional language
• A query is an expression, composed of
– FLWR expressions
    FOR ... LET ... WHERE ... RETURN
– Filters
– XPath expressions
    document("bids.xml")//bid[itemno="47"]/bid_amount
– Operators and functions
– Element constructors
    <bid><userid> $u </userid> ,<bid_amount> $a </bid_amount></bid>

Data Flow in a FLWR Expression

XML
  → FOR/LET → list of tuples of bound variables
      ($x = value, $y = value, $z = value),
      ($x = value, $y = value, $z = value),
      ($x = value, $y = value, $z = value)
  → WHERE → list of tuples of bound variables
  → RETURN → XML
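The FLWR data flow can be mimicked in a few lines of Python. This is only a minimal sketch: the in-memory `bids` list stands in for bids.xml, and the function names `for_let`, `where` and `ret` are hypothetical, not part of Quilt or Kweelt.

```python
# Hypothetical stand-in for bids.xml: each dict plays the role of one <bid> element.
bids = [
    {"userid": "U01", "itemno": "47", "bid_amount": 35},
    {"userid": "U02", "itemno": "47", "bid_amount": 40},
    {"userid": "U03", "itemno": "12", "bid_amount": 15},
]

def for_let(source):
    """FOR/LET stage: bind variables, producing a list of tuples (dicts here)."""
    return [{"$b": b, "$u": b["userid"], "$a": b["bid_amount"]} for b in source]

def where(tuples, predicate):
    """WHERE stage: keep only the tuples that satisfy the predicate."""
    return [t for t in tuples if predicate(t)]

def ret(tuples):
    """RETURN stage: invoke the element constructor once per surviving tuple."""
    return "".join(
        f"<bid><userid>{t['$u']}</userid><bid_amount>{t['$a']}</bid_amount></bid>"
        for t in tuples
    )

tuples = for_let(bids)
kept = where(tuples, lambda t: t["$b"]["itemno"] == "47")
xml = ret(kept)
print(xml)
```

Each stage consumes and produces a tuple list, which is why condition evaluation and return invocation can be treated as tuple-based.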

Quilt Compared to XQL

• A superset of XQL
• Overcomes shortcomings of XQL
– no variable bindings, joins, transformations, ordering, aggregate functions, etc.
– no data integration from multiple XML sources
– can do semi-joins, but in a rather non-intuitive syntax:
    book[author=//book[title='Moby Dick']/author]
• Covers queries on structured documents (including SGML), and even relational data

Quilt Query Process

• Variable binding is an important means
– a query can define multiple variables, in order
– dependency relationships exist among variables
– a tuple list is bound to each variable; condition evaluation and return invocation are tuple-based
– tuple lists are handles to data tree components, of which the answer tree is composed

Cache Index Structure (CIS)

• A structure to capture the essential elements of a Quilt query
• What are the essential elements of a Quilt query?
– variable bindings, conditions and returning nodes all refer to some element nodes in the DTD
– a query can be identified by its variable nodes V, return nodes T, condition nodes F and their dependency relationships
– each element node in a DTD tree can be assigned a unique number (with a unique absolute XPath)

Example DTD

<?xml version="1.0"?>
<!DOCTYPE bib [
<!ELEMENT bib (book* )>
<!ELEMENT book (title, (author+ | editor+ ), publisher, price )>
<!ATTLIST book year CDATA #REQUIRED >
<!ELEMENT author (last, first )>
<!ELEMENT editor (last, first, affiliation )>
<!ELEMENT title (#PCDATA )>
<!ELEMENT last (#PCDATA )>
<!ELEMENT first (#PCDATA )>
<!ELEMENT affiliation (#PCDATA )>
<!ELEMENT publisher (#PCDATA )>
<!ELEMENT price (#PCDATA )>
]>

DTD tree with element node numbers:

1 bib
  2 book
    3 year (attribute)
      4 CDATA
    5 title
      6 PCDATA
    7 author
      8 last
        9 PCDATA
      10 first
        11 PCDATA
    12 editor
      13 last
        14 PCDATA
      15 first
        16 PCDATA
      17 affiliation
        18 PCDATA
    19 publisher
      20 PCDATA
    21 price
      22 PCDATA
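The numbering in the DTD tree is consistent with a preorder walk. The sketch below reproduces it under that assumption; the nested-tuple encoding of the DTD and the name `number_nodes` are illustrative only, and the paths ending in CDATA/PCDATA are labels for the text nodes rather than real XPath steps.

```python
# Hypothetical encoding of the example DTD as (name, [children]) tuples.
TEXT = ("PCDATA", [])
DTD = ("bib", [
    ("book", [
        ("@year", [("CDATA", [])]),
        ("title", [TEXT]),
        ("author", [("last", [TEXT]), ("first", [TEXT])]),
        ("editor", [("last", [TEXT]), ("first", [TEXT]), ("affiliation", [TEXT])]),
        ("publisher", [TEXT]),
        ("price", [TEXT]),
    ]),
])

def number_nodes(tree):
    """Assign 1-based preorder numbers; return {number: absolute path}."""
    numbered = {}

    def visit(node, parent_path):
        name, children = node
        path = f"{parent_path}/{name}"
        numbered[len(numbered) + 1] = path  # next free number, in visiting order
        for child in children:
            visit(child, path)

    visit(tree, "")
    return numbered

nodes = number_nodes(DTD)
for number in (2, 3, 5, 7, 12, 17, 19, 21):
    print(number, nodes[number])
```

Because every number maps to exactly one absolute path, a query's variable, condition and return nodes can be compared across queries by number alone.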

Quilt Query Sampler I

Q1
<bib>
  FOR $book IN document("bib.xml")//book[@year.>=.1991 AND publisher="Addison-Wesley"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

CIS for Q1 (numbers are DTD element nodes):
• variable nodes: v1 = /bib/book (node 2)
• condition nodes: f1 = v1/@year .>=. 1991 (node 3); f2 = v1/publisher = "Addison-Wesley" (node 19)
• return nodes: t1 = v1/@year (node 3); t2 = v1/title (node 5)
• result structure: /book, with children 3 and 5

Evaluation example: v1 binds the books b1, b2, b3, with years y1[1993], y2[1995], y3[1990] and publishers p1, p2, p3. f1 evaluates to 1, 1, 0 and f2 to 1, 0, 1, so only b1 qualifies and the result r1 contains y1[1993] and t1.

Quilt Query Sampler II

Q2
FOR $author IN DISTINCT document("bib.xml")//author,
    $book IN document("bib.xml")//book[author = $author]
RETURN <result> $book/title, $author </result>

CIS for Q2:
• variable nodes: v1 = /bib/book/author (node 7); v2 = /bib/book[author = v1] (node 2)
• return nodes: t1 = v2/title (node 5); t2 = v1 (node 7)
• result structure: /result, with children 5 and 7

Evaluation example: v1 binds the distinct authors a1, a2, a3; for each author, v2 binds the books b1, b2, b3 that list that author. With titles t1, t2, t3 of b1, b2, b3, the results are r1 = (a1, t1), r2 = (a1, t2), r3 = (a2, t1), r4 = (a2, t3), r5 = (a3, t2).

Quilt Query Sampler III

Q3
<results>
  FOR $author IN DISTINCT document("bib.xml")//author
  RETURN <result>
    $author,
    document("bib.xml")//book[author = $author]/title
  </result>
</results>

CIS for Q3:
• variable nodes: v1 = /bib/book/author (node 7)
• return nodes: t1 = v1 (node 7); t2 = /bib/book[author = v1]/title (node 5)
• result structure: /result, with children 7 and 5

Evaluation example: v1 binds the distinct authors a1, a2, a3. With book titles t1, t2, t3, the results are r1 = (a1, t1, t2), r2 = (a2, t1, t3), r3 = (a3, t2).

Quilt Query Sampler IV

Q4
<books-with-prices>
  FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
      $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
  RETURN <book-with-prices>
    $b_book/title,
    <price-amazon>$a_book/price/text()</price-amazon>,
    <price-bn>$b_book/price/text()</price-bn>
  </book-with-prices>
</books-with-prices>

CIS for Q4 (primed numbers mark the second binding of the same DTD node):
• variable nodes: v1 = bib/book[source = "www.amazon.com"] (node 2); v2 = bib/book[source = "www.bn.com"][title = v1/title] (node 2’)
• return nodes: t1 = v2/title (node 5’); t2 = v1/price/text() (node 22); t3 = v2/price/text() (node 22’)
• result structure: /book-with-prices, with the title and the new elements price-amazon and price-bn

Evaluation example: v1 binds b1, b2, b3 with prices $12.5, $23, $54; v2 binds b1’, b2’, b3’ with prices $21, $22, $47. Joined on title, the results are (t1, $12.5, $21), (t2, $23, $22), (t3, $54, $47).

More Quilt Query Samplers

Q5
<bib>
  FOR $book IN document("bib.xml")//book[price.<=.$50]
  RETURN <book year=$book/@year><editors>$book/editor</editors></book>
</bib>

CIS for Q5:
• variable nodes: v1 = /bib/book (node 2)
• condition nodes: f1 = v1/price .<=. $50 (node 21)
• return nodes: t1 = v1/@year (node 3); t2 = v1/editor (node 12)

Q6
<bib>
  FOR $book IN document("bib.xml")//book[price.<=.$50],
      $author IN /bib[book=$book]//author[last="Abiteboul"]
  RETURN <book>$book/title, $book/price, $author</book>
</bib>

CIS for Q6:
• variable nodes: v1 = /bib/book (node 2); v2 = /bib[book = v1]//author (node 7)
• condition nodes: f1 = v1/price .<=. $50 (node 21)
• return nodes: t1 = v1/price (node 21); t2 = v1/title (node 5); t3 = v2 (node 7)

Query Containment for Relational Queries

[Figure: three diagrams relating the return nodes and condition nodes (Tq, Fq) of a new query q to those (Ts, Fs) of a cached query s]

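Our Containment Theorem

Given a set of cached queries S={s1,s2...}, and a new query q,

q can be fully answerable by S if the following conditions hold. Here V, T and F are the variable, return and condition nodes of a query; a primed node belongs to a cached query and refers to the same DTD element node as its unprimed counterpart in q; $f \preceq f'$ means that $f'$ has the same or a looser predicate than $f$; and $v \rightarrow x$ means that node $x$ is derived from variable node $v$.

1. $\forall f \in F_q:\ (\exists f' \in F_{s_i},\ f \preceq f')\ \text{or}\ (\exists t' \in T_{s_j}) \quad (s_i, s_j \in S)$

2. $\forall f_i \in F_q:\ (\nexists t' \in T_{s_j}) \Rightarrow \big(\exists f_i' \in F_{s_i},\ f_i \preceq f_i',\ \text{and}\ \forall f_k' \in F_{s_i},\ \exists f_k \in F_q,\ f_k \preceq f_k'\big) \quad (s_i, s_j \in S)$

3. $\forall t \in T_q,\ \exists t' \in T_{s_i} \quad (s_i \in S)$

4. $\forall t_i, t_j \in T_q\ \text{with}\ t_i' \in T_{s_i},\ t_j' \in T_{s_j},\ i \neq j:\ \exists t_k' \in T_{s_i},\ t_h' \in T_{s_j},\ t_k' = t_h' \quad (s_i, s_j \in S)$

5. $\forall v \in V_q,\ t \in T_q\ \text{with}\ v \rightarrow t:\ \exists v' \in V_{s_i},\ t' \in T_{s_i},\ v' \rightarrow t' \quad (s_i \in S)$

6. $\forall v \in V_q,\ t \in T_q\ \text{with}\ v \rightarrow t:\ \exists v' \in V_{s_i},\ t' \in T_{s_i},\ v' \rightarrow t',\ \text{and}\ \forall f' \in F_{s_i}\ \text{with}\ v' \rightarrow f',\ \exists f \in F_q,\ v \rightarrow f \quad (s_i \in S)$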

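Explanations

Condition 1: $\forall f \in F_q:\ (\exists f' \in F_{s_i},\ f \preceq f')\ \text{or}\ (\exists t' \in T_{s_j}) \quad (s_i, s_j \in S)$

For every condition node f of q, it must either also be a condition node, with a looser predicate, of some si in the cache, or be a return node of some sj.

Condition 2: $\forall f_i \in F_q:\ (\nexists t' \in T_{s_j}) \Rightarrow \big(\exists f_i' \in F_{s_i},\ f_i \preceq f_i',\ \text{and}\ \forall f_k' \in F_{s_i},\ \exists f_k \in F_q,\ f_k \preceq f_k'\big) \quad (s_i, s_j \in S)$

For every condition node fi of q, if it is not a return node of any sj, then it must be a condition node, with a looser predicate, of some si in the cache, and any other condition node fk’ of si must also be a condition node fk of q.

In set form: $\exists\, \{s_i, s_j, \dots, s_k\} \subseteq S:\ F'_{s_i, s_j, \dots, s_k} \subseteq F_q \subseteq F'_{s_i, s_j, \dots, s_k} \cup T'_{s_i, s_j, \dots, s_k}$

There is a subset of S whose condition nodes are a subset of those of q, but whose condition nodes and return nodes together are a superset of the condition nodes of q.

Explanations (cont.)

Condition 4: $\forall t_i, t_j \in T_q\ \text{with}\ t_i' \in T_{s_i},\ t_j' \in T_{s_j},\ i \neq j:\ \exists t_k' \in T_{s_i},\ t_h' \in T_{s_j},\ t_k' = t_h' \quad (s_i, s_j \in S)$

For every pair of return nodes ti and tj of q, if their counterparts are in different segments si and sj, then there must be a common return node in si and sj.

Condition 6: $\forall v \in V_q,\ t \in T_q\ \text{with}\ v \rightarrow t:\ \exists v' \in V_{s_i},\ t' \in T_{s_i},\ v' \rightarrow t',\ \text{and}\ \forall f' \in F_{s_i}\ \text{with}\ v' \rightarrow f',\ \exists f \in F_q,\ v \rightarrow f \quad (s_i \in S)$

For every return node t of q, if it is derived from a variable node v, then its counterpart in the cache should also be derived from the same variable node, and all the condition nodes derived from this v in the cache should also have their counterparts derived from v in q.

Condition 3: $\forall t \in T_q,\ \exists t' \in T_{s_i} \quad (s_i \in S)$

For every return node t of q, it must also be a return node of some si in the cache.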
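The first three conditions can be tried out on a single cached query with a small sketch. Everything here is a simplifying assumption rather than the KapQuilt implementation: conditions are reduced to (node, operator, value) triples, "looser" is only decided for `=`, `>=` and `<=`, and joins across several cached queries (condition 4) and variable derivation (conditions 5 and 6) are left out. The names `Cond`, `Query`, `implied_by` and `answerable_by` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cond:
    node: int      # DTD element node number
    op: str        # "=", ">=", "<=" or an opaque operator such as "like"
    value: object

@dataclass(frozen=True)
class Query:
    conds: tuple        # condition nodes F
    returns: frozenset  # return nodes T, as DTD node numbers

def implied_by(cached, new):
    """True if `cached` is the same as, or looser than, `new` on the same node."""
    if cached.node != new.node:
        return False
    if cached == new:
        return True
    if cached.op == ">=" and new.op in ("=", ">="):
        return new.value >= cached.value
    if cached.op == "<=" and new.op in ("=", "<="):
        return new.value <= cached.value
    return False

def answerable_by(q, s):
    # Condition 1: each f of q matches a looser f' of s, or is a return node of s.
    for f in q.conds:
        if not (any(implied_by(fp, f) for fp in s.conds) or f.node in s.returns):
            return False
    # Condition 2 (single-segment form): s has no condition that q lacks.
    for fp in s.conds:
        if not any(implied_by(fp, f) for f in q.conds):
            return False
    # Condition 3: every return node of q is a return node of s.
    return q.returns <= s.returns

# s1 caches Q1: @year >= 1991 (node 3), publisher = "Addison-Wesley" (node 19).
s1 = Query((Cond(3, ">=", 1991), Cond(19, "=", "Addison-Wesley")), frozenset({3, 5}))

q_affiliation = Query((Cond(17, "=", "WPI"),), frozenset({3, 5}))
q_publisher = Query((Cond(19, "=", "Addison-Wesley"),), frozenset({3, 5}))
q_java = Query((Cond(3, "=", 1997), Cond(5, "like", "JAVA*"),
                Cond(19, "=", "Addison-Wesley")), frozenset({3, 5}))

for name, q in [("affiliation", q_affiliation), ("publisher", q_publisher), ("java", q_java)]:
    print(name, answerable_by(q, s1))
```

The affiliation query fails condition 1 (node 17 appears nowhere in s1), the publisher-only query fails condition 2 (s1 has already filtered on year), and the 1997 JAVA query passes because its title filter lands on a return node of s1.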

Query Rewriting Rules

If a query is judged to be computable by cached views, the following rules can be followed to derive the rewritten q’.

1. Decide which filters to keep (not yet evaluated by any cached query) and which filters to remove (already evaluated by some cached query), and remember the cached queries S with established F mappings.

• keep every f that has a t’ match, and every f that matches a looser f’; these still appear as condition nodes in the probe query

• remove every f with an exact f’ match

• for each non-exact f’ match, remember its s, so as to know which s to associate with the left-over filters

Query Rewriting Rules (cont.)

2. Figure out the semantic meaning of the newly constructed nodes in the returning structure of each cached query; they are associated with new XPaths that replace their old ones. A newly constructed node can be seen as a renaming of some old element node. Return nodes usually appear under each newly constructed node, hence a mapping from the new node to the old one can be inferred from those return nodes.

• in the new query q, replace the old XPaths with the new XPath to the newly constructed node in the cached views

3. When a query is rewritten using joins of more than one s with a common t pair, be sure to add such joins as new conditions.

• if there is no variable binding in the new q, a new binding should be produced for one of the common t pair, so that there is a way to join with its counterpart
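Rule 1 can be illustrated with a short sketch that splits the filters of a new query into those kept for the probe query and those removed. It assumes containment has already been established, represents each condition as a hypothetical (node, predicate text) pair, and treats textual equality as an "exact match"; `split_filters` is an illustrative name.

```python
def split_filters(query_conds, cached_conds, cached_returns):
    """Split query filters into (keep, remove) against one cached query."""
    cached = dict(cached_conds)          # node number -> cached predicate text
    keep, remove = [], []
    for node, pred in query_conds:
        if cached.get(node) == pred:
            remove.append((node, pred))  # exact f' match: already evaluated
        elif node in cached or node in cached_returns:
            keep.append((node, pred))    # looser f' match, or t' match
        else:
            raise ValueError(f"condition on node {node} has no match in the cached query")
    return keep, remove

# Cached s1 (Q1) and a new query filtering on year, title and publisher.
s1_conds = [(3, "@year.>=.1991"), (19, 'publisher="Addison-Wesley"')]
s1_returns = {3, 5}
q_conds = [(3, "@year=1997"), (5, 'title like "JAVA*"'), (19, 'publisher="Addison-Wesley"')]

keep, remove = split_filters(q_conds, s1_conds, s1_returns)
probe_path = '/book[source = "s1"][' + " AND ".join(pred for _, pred in keep) + "]"
print(probe_path)
```

The publisher filter is dropped because s1 has already applied it, while the year and title filters survive as left-over conditions on the view of s1.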

Query Containment I

Suppose that queries q1, q2, q3, q4, q5, q6 are cached in C={s1,s2,s3,s4,s5,s6}, and a new query q comes in.

case q of

<bib>
  FOR $book IN document("bib.xml")//book[editor/affiliation="WPI"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

It does not even satisfy the first condition, so it is not answerable: f1 refers to element node 17, which has no match in s1 to s6.

<bib>
  FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

It satisfies the first condition, but not the second one, so it is not answerable: f1 <--> f2’ in s1, but the other condition node f1’ of s1 is not a condition node of q.

Query Rewriting I

case q of

<bib>
  FOR $book IN …/bib/book[@year=1997 AND title like "JAVA*" AND publisher="Addison-Wesley"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

It satisfies all the conditions, so it is answerable:
1) f1 <--> f1’ in s1, f2 <--> t2’ of s1, f3 <--> f2’ in s1
2) there is no other f’ in s1
3) t1 <--> t1’ in s1, t2 <--> t2’ in s1
4) t1’ and t2’ are both from the same s1
5) t1 and t2 are derived from v1, and so are t1’ and t2’ from v1’; v1’ --> f1’, f2’, and v1 --> f1, f2, f3

Rewrite the query as

<bib>
  FOR $book IN /book[source = "s1"][@year=1997 AND title like "JAVA*"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

[Diagram: v1 = /bib/book with filters f1, f2, f3 is rewritten as /book[source = "s1"] over the view of s1, whose return nodes are t1 = v1/@year (node 3) and t2 = v1/title (node 5); f1 and f2 remain as left-over filters]

Query Rewriting II

case q of

<bib>
  FOR $book IN …/bib/book[@year.>=.1991 AND publisher="Addison-Wesley" AND price.<=.$50]
  RETURN <book>$book/title,<editors>$book/editor</editors></book>
</bib>

It satisfies all the conditions, so it is answerable:
1) f1 <--> f1’ in s1, f2 <--> f2’ in s1, f3 <--> f1’ of s5
2) there is no other f’ in s1 and s5
3) t1 <--> t2 in s1, t2 <--> t2 in s5
4) t1 in s1 = t1 in s5
5) t1 and t2 are derived from v1, and so are t1’ and t2’ from v1’; v1’ --> f1’, f2’, and v1 --> f1, f2, f3

[Diagram: v1 = /bib/book with filters f1, f2, f3 is rewritten as /book1[source = "s1"] and /book2[source = "s5"]. View of s1: /book with t1 = v1/@year (node 3) and t2 = v1/title (node 5). View of s5: /book with t1 = v1/@year (node 3) and t2 = v1/editor (node 12)]

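Rewrite the query as

<bib>
  FOR $book1 IN /book[source = "s1"],
      $book2 IN /book[source = "s5"][@year = $book1/@year]
  RETURN <book>$book1/title,<editors>$book2/editor</editors></book>
</bib>

The join is on @year because it is the only return node that s1 and s5 have in common (t1 in both); the view of s5 contains no title.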

Query Rewriting III

Suppose that q2 is cached in S but q3 is not; instead, q3 comes in as a new query.

case q of

<bib>
  FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

<bib>
  FOR $book IN … //book[@year=1997 AND title like "JAVA*" AND publisher="Addison-Wesley"]
  RETURN <book year=$book/@year>$book/title</book>
</bib>

Query Rewriting IV

Suppose that q2 is cached in S but q3 is not; instead, q3 comes in as a new query.

Q4
<books-with-prices>
  FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
      $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
  RETURN <book-with-prices>
    $b_book/title,
    <price-amazon>$a_book/price/text()</price-amazon>,
    <price-bn>$b_book/price/text()</price-bn>
  </book-with-prices>
</books-with-prices>

Input: query q, semantic cache C
Output: result of q

AnsweringQuery Procedure:
  answerable, fullyAns <--- False;
  T <--- current timestamp;
  C = {s1, s2, ...} <--- set up CIS for each segment si;
  s <--- set up CIS for q;
  M <--- matched node set, null at the beginning;
  R <--- not-matched node set, initially the RENs of s;
  RC <--- CENs of s;
  si <--- look for the first q-related si in C;
  S <--- put si into a candidate set;
  While (si can be found) {
    answerable <--- True;
    si.Ts <--- T;
    (MS, RM) <--- query_trimming(si, s);
      // MS: matching nodes of s and si
      // RM: remaining nodes of s not covered by si
    R = R - MS; M = M + MS; RC = RC + RemainingCENS;
    If (R = null) { fullyAns <--- True; break; }
    si <--- next q-related segment in C;
  }
  PQ <--- query_rewriting(S, JoinCISs, M, RC);
  MatV <--- materialized view sets referred to by S;
  ResPQ, Result <--- process PQ against MatV;
  If (fullyAns = False) {
    RQ <--- query_rewriting(S, JoinCISs, M, RC);
    ResRQ <--- process RQ at the server;
    Result <--- coalesce(ResPQ, ResRQ);
  }
  create a new segment Snew that contains the result of q;
  Snew.Ts <--- T;
  If there isn’t enough space, do cache replacement;
  cache Snew;
  return (Result).

Input: CIS structure s for q and a segment si; current candidate set S and JoinCISs
Output: judge whether si is related to q; if yes, add it into S; matching nodes MS and remaining nodes RM

Query_trimming Procedure:
  IsRelated <--- False;
  MS <--- null;
  RM <--- boundRENs of s;
  RS <--- RENs in si;
  RNS <--- RENs in all s of S;
  RemainingCENS <--- boundCENs of s;
  While (RM ≠ null) {
    matchingRENS <--- match nodes in RM and RS by applying the theorem;
    if (matchingRENS ≠ null) {
      commonSet <--- RNS ^ RS;
      if (commonSet ≠ null) {
        IsRelated <--- True;
        <si, sj, commonSet> <--- sj is the segment in S that has commonSet with si;
        JoinCISs <--- add <si, sj, commonSet> into JoinCISs;
        (MS, RM) <--- (matchingRENS, RM - M);
        RemainingCENS <--- left-over ones except exact-match CENS;
      }
    }
  }
  return (S, JoinCISs, MS, RM, RemainingCENS).
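The covering loop at the heart of the two procedures can be sketched compactly. This is a deliberately reduced model, not the procedure itself: each cached segment is represented only by the set of DTD node numbers it returns, segments are scanned in insertion order, and condition matching, join bookkeeping (JoinCISs) and timestamps are omitted. The name `answer_plan` is hypothetical.

```python
def answer_plan(query_rens, cache):
    """Pick cached segments that cover the query's return nodes.

    Returns (candidates, matched, remaining): the chosen segment names,
    the return nodes they cover (for the probe query), and the nodes
    left for the remainder query.
    """
    matched, remaining, candidates = set(), set(query_rens), []
    for name, rens in cache.items():
        covered = remaining & rens
        if covered:                      # segment is related to the query
            candidates.append(name)
            matched |= covered
            remaining -= covered
        if not remaining:                # fully answerable from the cache
            break
    return candidates, matched, remaining

# Return nodes of the cached Q1 (s1) and Q5 (s5).
cache = {"s1": {3, 5}, "s5": {3, 12}}

full = answer_plan({5, 12}, cache)       # title and editor
partial = answer_plan({5, 21}, cache)    # title and price
print(full)
print(partial)
```

An empty `remaining` set corresponds to fullyAns being true, so no remainder query is sent to the server; otherwise the left-over nodes are what the remainder query has to fetch.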

[Module design diagram. Set-up path: to-be-cached queries → QueryDecomposer → CacheIndexStructs, indexed by QueryIndex, VENIndex, CENIndex and RENIndex (over VariableEleNode, ConditionEleNode, ReturnEleNode); DTD → DTDWalker → DTDTree (ElementNode) → ENIndex; MatView (ViewDTDTree). AnsweringQuery path: new query → QueryDecomposer → NewQuery → QueryTrimmer → contained? (Y/N) → fully contained? (Y/N) → QueryRewritter (NewQuery with MatchingCENPair, MatchingRENPair, MatchingENPair) → ProbeQuery, and RemainderQuery for the remainingRENs → ResultCoalescer → Result]

[Class diagram, listed as class: attributes; methods]

ElementNode: enId:int, dtdRef:DTDTree, enName:String, absXpath:String, parentEN:ElementNode, childrenEN:Vector; equalTo(EN):boolean, relativeXpathForm(EN):String

VariableEleNode: venName:String, cisRef:CIS, parentVEN:VEN, childrenVEN:Vector, childrenCEN:Vector, childrenREN:Vector

ConditionEleNode: cenName:String, cisRef:CIS, condition:String, parentVEN:VEN; stricterThan(CEN):boolean

ReturnEleNode: renName:String, cisRef:CIS, parentVEN:VEN

DTDTree: dtdName:String, dtdLoc:String, rootEN:ElementNode, elementNodes:Vector

ViewDTDTree: matchingENPairs:Vector

ENIndex: dtdElements:Vector

MatchingCENPair: newCEN:ConditionEleNode, oldCEN:ConditionEleNode, oldREN:ReturnEleNode, remainingCond:String; stricterThan(CEN):boolean

MatchingRENPair: newTEN:ReturnEleNode, oldTEN:ReturnEleNode, otherOldTENs:Vector; diffCISFrom(MatchingRENPair):boolean, overlapTENWith(MatchingRENPair):boolean, sameParentVEN():boolean, newWithPVENhasMoreCENs():boolean

MatchingENPair: newConstructEN:ElementNode, originalDtdEN:ElementNode, source:MatView

CacheIndexStructs: cisId:String, qstring:String, viewDTD:ViewDTDTree, matRef:MatView, boundVENs:Vector, boundCENs:Vector, boundRENs:Vector

NewQuery: candidateCISSet:Vector, matchingCENPairs:Vector, remainingCENs:Vector, matchingRENPairs:Vector, remainingRENs:Vector; hasMoreCENsThan(CIS):boolean, hasOverlapTENs(MatchingRENPair, MatchingRENPair):boolean

ProbeQuery: boundVENs:Vector, boundCENs:Vector, boundRENs:Vector, newDTD:ViewDTDTree

MatView: matId:String, cisRef:CacheIndexStructs

VENIndex: cachedVENinCISs:Hashtable

CENIndex: cachedCENinCISs:Hashtable

RENIndex: cachedRENinCISs:Hashtable

QueryIndex: cachedQueries:Hashtable

Timetable
• By 11/15: design due, implementation starts
• By 11/30: coding half finished
• By 12/10: coding fully finished
• By 12/20: finish integration
• By 1/15: test designed cases
• By 1/30: design and do experiments
• By 2/15: collect experiment results
• By 2/28: document code, writing
• By 3/15: summarize

Task Assignment
• Lily:
– design classes, containment, rewriting and candidate picking algorithms, design experiments
– ideas for query decomposition, result combination, cache decomposing/coalescing, replacement policy, data updates handling
• Jake:
– implement containment algorithm, ...
• Ian:
– implement classes of EN, VEN, CEN, TEN, ...
• Amar:
– module of rewriting algorithm

Implementation Toolsuites

• JDK1.2, Servlet

• XML Parser, DTD Parser

• Quilt Parser, Kweelt(Quilt) Query Engine
