KapQuilt: Semantic Caching for Quilt Queries
A New Quilt Query Answerable by Cached Ones?
Li Chen
Outline
• Background
• Motivation
• Goals: semantic cache for Quilt queries
• Overall task list
• Immediate task
• Approaches for containment and rewriting
• Case studies
• Module design
• Timetable
Background
[Figure: "answering queries using views" at the center, with containment theory and algorithms. Surrounding labels: query optimization; database design; web site integration; independence of physical and logical data; semantic cache; materialized view maintenance; data warehousing; data mining; web site management. Views are annotated with query performance, query efficiency, and query quality. View maintenance is annotated with Argos, SWEEP, and ECA: dynamically decide materialized views, control concurrency of queries and updates.]
Dimensions of Semantic Caching
[Figure: three axes of semantic caching.
• languages: SQL (simple select-project-join (SPJ); group, aggregation, query blocks; datalog), OQL, TSL, Quilt
• containment relationships: fully contained, max-contained
• outcomes: rewritten query, query plan]
Motivations
• What's new about semantic caching?
– Web proxies just cache web page hits, not the results of computed queries
– Web information integration needs expressive XML queries
– Semantic caching for XML queries is new
– Quilt is a full-capability XML query language, promising for the integration of web information, and Kweelt is an implemented Quilt query engine
Goals
• Goal: build a semantic caching (SC) system for Quilt queries
– to better answer popular queries
– quicker, less expensive, and more up to date
• KapQuilt comes to the rescue
Limitation
We start from the core subset of Quilt queries, ignoring nested queries and regular expression queries for now.
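KapQuilt System Architecture
[Figure: architecture diagram.
• KSP client: sends remote query requests to KapQuilt
• KapQuilt: Query Decomposer, Query Matcher, Query Rewriter, Cost Estimator, cache index structure (CIS), DTDM, cached views (q1, q2, q3), query plans; produces the probe query (PQ) and the remainder query (RQ)
• Kweelt engine: Kweelt API, Parser, Evaluator
• Node factories: DOM (via XML parser), Doc, RDB (via parser and wrapper), other node factories
• Data sources: XML Source 1, XML Source 2, ..., XML Source n]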
Task List
• Answer whether a query is computable from cached ones
• If answerable, compute PQ (probe query) and RQ (remainder query)
• If there are many PQ candidates, pick the one that benefits most
• Decide whether a query is worth caching, and when to cache it
• When cache space is limited, apply a replacement policy
• Decompose and coalesce the query segments in the cache
• Concurrency control of queries and updates
• Analyze costs in various web query architectures
• Open question: keep cached views always fresh
– integrate Argos with KapQuilt
Immediate Task (the MQP project goal as well)
• Design and implement the core functions of KapQuilt
– input
  • a set of cached queries S = {s1, s2, ...}
  • a new query q
– output
  • a probe query (PQ)
    – might be null if q is not answerable at all
    – if not null, PQ is evaluated over the cached results of (s1, s2, …, sn)
  • a remainder query (RQ)
    – might be null if q is fully contained in S
    – if not null, RQ goes down to be evaluated against the data sources
Approaches
• Analyze quilt query process and its variable binding mechanism
• Set up cache index structure (CIS) to represent elements of a quilt query
• Warm up cache by initializing CIS with decomposed queries
• Implement the query containment and rewriting algorithm for quilt
• Conduct experimental studies for cost analysis
• Integrate with Argos system for cached view maintenance
A Taxonomy for XML Query
[Figure: taxonomy diagram relating XML-QL, XSL Patterns, XPointer, XQL, XQL-99, XPath, OQL, SQL, and Quilt.]
Briefs on Quilt
• Quilt is a functional language
• A query is an expression, composed of
– FLWR expressions
    FOR ... LET ... WHERE ... RETURN
– Filters
– XPath expressions
    document("bids.xml")//bid[itemno="47"]/bid_amount
– Operators and functions
– Element constructors
    <bid><userid> $u </userid>, <bid_amount> $a </bid_amount></bid>
Data Flow in a FLWR Expression
[Figure: XML → FOR/LET → list of tuples of bound variables → WHERE → list of tuples of bound variables → RETURN → XML.
Example tuple list: ($x = value, $y = value, $z = value), ($x = value, $y = value, $z = value), ($x = value, $y = value, $z = value)]
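The data flow above can be imitated in a few lines of Python. This is only a sketch under simplifying assumptions: each FOR variable ranges over an independent list (real FLWR bindings may depend on earlier variables), and the bid records, the `flwr` helper, and its parameter names are hypothetical illustrations, not part of Quilt or Kweelt.

```python
from itertools import product

def flwr(bindings, where, build):
    """Imitate FOR -> WHERE -> RETURN over independent variable bindings."""
    names = list(bindings)
    # FOR: one tuple of bound variables per combination of values.
    tuples = [dict(zip(names, values)) for values in product(*bindings.values())]
    # WHERE: keep only the tuples that satisfy the condition.
    kept = [t for t in tuples if where(t)]
    # RETURN: invoke the constructor once per surviving tuple.
    return [build(t) for t in kept]

# Hypothetical sample data standing in for bids.xml.
bids = [
    {"userid": "u1", "itemno": "47", "bid_amount": 30},
    {"userid": "u2", "itemno": "47", "bid_amount": 45},
    {"userid": "u3", "itemno": "12", "bid_amount": 10},
]

result = flwr(
    {"b": bids},
    where=lambda t: t["b"]["itemno"] == "47",
    build=lambda t: "<bid><userid>{userid}</userid>"
                    "<bid_amount>{bid_amount}</bid_amount></bid>".format(**t["b"]),
)
```

Condition evaluation and result construction both work on whole tuples, which is the property the caching design relies on later.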
Quilt Compared to XQL
• A superset of XQL
• Overcomes shortcomings of XQL
– no variable bindings, joins, transformations, ordering, aggregate functions, etc.
– no data integration from multiple XML sources
– semi-joins are possible, but in a rather non-intuitive syntax:
    book[author=//book[title='Moby Dick']/author]
• Covers queries on structured documents (including SGML) and even relational data
Quilt Query Process
• Variable binding is an important mechanism
– a query can define multiple variables, in order
– dependency relationships exist among variables
– a tuple list is bound to each variable; condition evaluation and return invocation are tuple-based
– tuple lists are handles to the data tree components of which the answer tree is composed
Cache Index Structure (CIS)
• A structure to capture the essential elements of a quilt query
• What are the essential elements of a Quilt query?
– variable bindings, conditions, and returning nodes all refer to some element nodes in the DTD
– a query can be identified by its variable nodes V, return nodes T, condition nodes F, and their dependency relationships
– each element node in a DTD tree can be assigned a unique number (with a unique absolute XPath)
Example DTD
  <?xml version="1.0"?>
  <!DOCTYPE bib [
  <!ELEMENT bib (book* )>
  <!ELEMENT book (title, (author+ | editor+ ), publisher, price )>
  <!ATTLIST book year CDATA #REQUIRED >
  <!ELEMENT author (last, first )>
  <!ELEMENT editor (last, first, affiliation )>
  <!ELEMENT title (#PCDATA )>
  <!ELEMENT last (#PCDATA )>
  <!ELEMENT first (#PCDATA )>
  <!ELEMENT affiliation (#PCDATA )>
  <!ELEMENT publisher (#PCDATA )>
  <!ELEMENT price (#PCDATA )>
  ]>
[Figure: the DTD tree with a number on every node.
1 bib
  2 book
    3 year (attribute): 4 CDATA
    5 title: 6 PCDATA
    7 author
      8 last: 9 PCDATA
      10 first: 11 PCDATA
    12 editor
      13 last: 14 PCDATA
      15 first: 16 PCDATA
      17 affiliation: 18 PCDATA
    19 publisher: 20 PCDATA
    21 price: 22 PCDATA]
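The numbering in the tree is what a pre-order walk of the DTD produces, with the `year` attribute visited before the child elements. A minimal sketch, assuming the content model is flattened into an ordered parent-to-children table (the `DTD` dictionary, the `@year` marker, and `number_nodes` are illustrative names, and the paths given to CDATA/PCDATA leaves are pseudo-paths rather than real XPath):

```python
# Example bib DTD reduced to parent -> ordered children.
# "@year" marks the attribute; CDATA/PCDATA leaves are numbered as well.
DTD = {
    "bib": ["book"],
    "book": ["@year", "title", "author", "editor", "publisher", "price"],
    "@year": ["CDATA"],
    "title": ["PCDATA"],
    "author": ["last", "first"],
    "editor": ["last", "first", "affiliation"],
    "last": ["PCDATA"],
    "first": ["PCDATA"],
    "affiliation": ["PCDATA"],
    "publisher": ["PCDATA"],
    "price": ["PCDATA"],
}

def number_nodes(root):
    """Assign 1, 2, 3, ... in pre-order and record each node's absolute path."""
    numbered = {}

    def visit(name, path):
        numbered[len(numbered) + 1] = path
        for child in DTD.get(name, []):
            visit(child, path + "/" + child)

    visit(root, "/" + root)
    return numbered

node_ids = number_nodes("bib")
```

With this table, `node_ids[2]` is `/bib/book` and `node_ids[17]` is `/bib/book/editor/affiliation`, matching the numbers used in the query samples. A recursive DTD would need a cycle guard that this sketch omits.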
Quilt Query Sampler I
Q1
  <bib>
    FOR $book IN document("bib.xml")//book[@year.>=.1991 AND publisher="Addison-Wesley"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
CIS of Q1 (numbers in parentheses are DTD node numbers)
• variable nodes: v1 = /bib/book (2)
• condition nodes: f1 = v1/@year.>=.1991 (3); f2 = v1/publisher="Addison-Wesley" (19)
• return nodes: t1 = v1/@year (3); t2 = v1/title (5), constructed under /book
[Figure: evaluation of Q1 over three books b1, b2, b3 with years y1[1993], y2[1995], y3[1990] and publishers p1, p2, p3. f1 yields 1, 1, 0 and f2 yields 1, 0, 1, so only b1 qualifies and the single result r1 holds y1 and t1.]
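One way to hold the CIS of Q1 in memory is sketched below. The `CIS` class and its field layout are assumptions made for illustration; they are not the class design given later in the deck.

```python
from dataclasses import dataclass, field

@dataclass
class CIS:
    """Illustrative cache index structure for one query."""
    variables: dict = field(default_factory=dict)   # name -> (node id, absolute xpath)
    conditions: dict = field(default_factory=dict)  # name -> (node id, variable, predicate)
    returns: dict = field(default_factory=dict)     # name -> (node id, variable, relative path)

    def nodes(self, kind):
        """Sorted DTD node numbers used by 'variables', 'conditions' or 'returns'."""
        return sorted({entry[0] for entry in getattr(self, kind).values()})

# Q1: books from 1991 on, published by Addison-Wesley, returning year and title.
q1 = CIS(
    variables={"v1": (2, "/bib/book")},
    conditions={
        "f1": (3, "v1", "@year >= 1991"),
        "f2": (19, "v1", 'publisher = "Addison-Wesley"'),
    },
    returns={"t1": (3, "v1", "@year"), "t2": (5, "v1", "title")},
)
```

Keying everything by DTD node number means two queries can be compared by set operations on node ids before any predicate is inspected.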
Quilt Query Sampler II
Q2
  FOR $author IN DISTINCT document("bib.xml")//author,
      $book IN document("bib.xml")//book[author = $author]
  RETURN <result> $book/title, $author </result>
CIS of Q2
• variable nodes: v1 = /bib/book/author (7); v2 = /bib/book[author = v1] (2)
• return nodes: t1 = v2/title (5); t2 = v1 (7), constructed under /result
[Figure: evaluation of Q2 over books b1, b2, b3 and distinct authors a1, a2, a3. v1 binds a1, a2, a3; v2 binds the books of each author; the results r1 to r5 pair each author with one title: (a1, t1), (a1, t2), (a2, t1), (a2, t3), (a3, t2).]
Quilt Query Sampler III
Q3
  <results>
    FOR $author IN DISTINCT document("bib.xml")//author
    RETURN <result>
      $author,
      document("bib.xml")//book[author = $author]/title
    </result>
  </results>
CIS of Q3
• variable nodes: v1 = /bib/book/author (7)
• return nodes: t1 = v1 (7); t2 = /bib/book[author = v1]/title (5), constructed under /result
[Figure: evaluation of Q3 over books b1, b2, b3 and distinct authors a1, a2, a3. Each result groups one author with all of that author's titles: r1 = (a1, t1, t2), r2 = (a2, t1, t3), r3 = (a3, t2).]
Quilt Query Sampler IV
Q4
  <books-with-prices>
    FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
        $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
    RETURN <book-with-prices>
      $b_book/title,
      <price-amazon>$a_book/price/text()</price-amazon>,
      <price-bn>$b_book/price/text()</price-bn>
    </book-with-prices>
  </books-with-prices>
CIS of Q4
• variable nodes: v1 = bib/book[source = "www.amazon.com"] (2); v2 = bib/book[source = "www.bn.com"][title = v1/title] (2')
• return nodes: t1 = v2/title (5'); t2 = v1/price/text() (22); t3 = v2/price/text() (22'), constructed under /book-with-prices as price-amazon and price-bn
[Figure: evaluation of Q4. v1 binds the Amazon books b1, b2, b3 with prices $12.5, $23, $54; v2 binds the matching bn.com books b1', b2', b3' with prices $21, $22, $47. The results pair the prices per title: t1 with $12.5 and $21, t2 with $23 and $22, t3 with $54 and $47.]
More Quilt Query Samplers
Q5
  <bib>
    FOR $book IN document("bib.xml")//book[price.<=.$50]
    RETURN <book year=$book/@year><editors>$book/editor</editors></book>
  </bib>
CIS of Q5
• variable nodes: v1 = /bib/book (2)
• condition nodes: f1 = v1/price.<=.$50 (21)
• return nodes: t1 = v1/@year (3); t2 = v1/editor (12)
Q6
  <bib>
    FOR $book IN document("bib.xml")//book[price.<=.$50],
        $author IN /bib[book=$book]//author[last="Abiteboul"]
    RETURN <book>$book/title, $book/price, $author</book>
  </bib>
CIS of Q6
• variable nodes: v1 = /bib/book (2); v2 = /bib[book = v1]//author (7)
• condition nodes: f1 = v1/price.<=.$50 (21)
• return nodes: v1/title (5); v1/price (21); v2 (7)
Query Containment for Relational Queries
[Figure: three diagrams, each comparing a query q (return part T_q, filter part F_q) with a cached segment s (return part T_s, filter part F_s).]
Our Containment Theorem
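Given a set of cached queries $S = \{s_1, s_2, \ldots\}$ and a new query $q$, with variable nodes $V$, condition nodes $F$, and return nodes $T$,
$q$ is fully answerable by $S$ if

(1) $\forall f \in F_q,\ \exists f' \in F_{s_i},\ f \subseteq f'\ \text{or}\ \exists t' \in T_{s_j},\quad (s_i, s_j \in S)$

(2) $\forall f_i \in F_q,\ \nexists t' \in T_{s_j}:\ \exists f_i' \in F_{s_i},\ f_i \subseteq f_i',\ \left(\forall f_k' \in F_{s_i},\ \exists f_k \in F_q,\ f_k \subseteq f_k'\right),\quad (s_i, s_j \in S)$

(3) $\forall t \in T_q,\ \exists t' \in T_{s_i},\quad (s_i \in S)$

(4) $\forall t_i, t_j \in T_q$ with counterparts $t_i' \in T_{s_i},\ t_j' \in T_{s_j},\ i \neq j:\ \exists t_k' \in T_{s_i},\ t_h' \in T_{s_j},\ t_k' = t_h',\quad (s_i, s_j \in S)$

(5) $\forall v \in V_q,\ v \rightarrow t \in T_q,\ \exists v' \in V_{s_i},\ v' \rightarrow t' \in T_{s_i},\quad (s_i \in S)$

(6) $\forall v \in V_q,\ v \rightarrow t \in T_q,\ \exists v' \in V_{s_i},\ v' \rightarrow t' \in T_{s_i},\ \left(\forall v' \rightarrow f' \in F_{s_i},\ \exists v \rightarrow f \in F_q\right),\quad (s_i \in S)$

Here $f \subseteq f'$ means that the cached condition $f'$ is the same as or looser than $f$, and $v \rightarrow t$ means that $t$ is derived from the variable node $v$.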
Explanations

$\forall f \in F_q,\ \exists f' \in F_{s_i},\ f \subseteq f'\ \text{or}\ \exists t' \in T_{s_j},\quad (s_i, s_j \in S)$
Every condition node f of q must either also be a condition node, with a looser predicate, of some si in the cache, or be a return node of some sj.

$\forall f_i \in F_q,\ \nexists t' \in T_{s_j}:\ \exists f_i' \in F_{s_i},\ f_i \subseteq f_i',\ \left(\forall f_k' \in F_{s_i},\ \exists f_k \in F_q,\ f_k \subseteq f_k'\right),\quad (s_i, s_j \in S)$
For every condition node fi of q, if it is not a return node of any sj, then it must be a condition node, with a looser predicate, of some si in the cache, and any other condition node fk' of si must correspond to a condition node fk of q.

$\exists\, s_i, s_j, \ldots, s_k \in S:\ F_{s_i, s_j, \ldots, s_k} \subseteq F_q \subseteq F_{s_i, s_j, \ldots, s_k} \cup T_{s_i, s_j, \ldots, s_k}$
There is a subset of S whose condition nodes are a subset of those of q, but whose condition nodes and return nodes together are a superset of the condition nodes of q.
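Explanations (cont.)

$\forall t_i, t_j \in T_q$ with counterparts $t_i' \in T_{s_i},\ t_j' \in T_{s_j},\ i \neq j:\ \exists t_k' \in T_{s_i},\ t_h' \in T_{s_j},\ t_k' = t_h',\quad (s_i, s_j \in S)$
For every pair of return nodes ti and tj of q, if their counterparts are in different segments si and sj, then there must be a common return node in si and sj.

$\forall v \in V_q,\ v \rightarrow t \in T_q,\ \exists v' \in V_{s_i},\ v' \rightarrow t' \in T_{s_i},\ \left(\forall v' \rightarrow f' \in F_{s_i},\ \exists v \rightarrow f \in F_q\right),\quad (s_i \in S)$
For every return node t of q, if it is derived from a variable node v, then its counterpart in the cache must also be derived from the same variable node, and all the cached condition nodes derived from this variable must have counterparts derived from v in q.

$\forall t \in T_q,\ \exists t' \in T_{s_i},\quad (s_i \in S)$
Every return node t of q must also be a return node of some si in the cache.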
Query Rewriting Rules
If a query is judged to be computable from cached views, the following rules can be followed to work out the rewritten query q'.
1. Decide which filters to keep (not yet evaluated by any cached query) and which filters to remove (already evaluated by some cached query), and remember the cached queries S with established F mappings.
• keep every f that has a t' match and every f whose f' match is looser; they still appear as condition nodes in the probe query
• remove every f with an exact f' match
• for each non-exact f' match, remember its s, so that the left-over filters can be associated with the right s
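Rule 1 amounts to sorting the filters of q into a "keep" list and a "remove" list. The sketch below assumes that containment has already been established, represents a filter as a (DTD node number, predicate text) pair, and treats "same node, different predicate" as a looser match; `split_filters` and the sample filter lists are hypothetical names and data modelled on the book queries in this deck.

```python
def split_filters(query_filters, cached_filters, cached_return_nodes):
    """Split the filters of q into those kept in the probe query and those dropped."""
    keep, remove = [], []
    cached_nodes = {node for node, _ in cached_filters}
    for node, predicate in query_filters:
        if (node, predicate) in cached_filters:
            # Exact f' match: the cached query already evaluated this filter.
            remove.append((node, predicate))
        elif node in cached_nodes or node in cached_return_nodes:
            # Looser f' match or t' match: must still be applied to the cached result.
            keep.append((node, predicate))
        else:
            raise ValueError("filter on node %d has no match in the cache" % node)
    return keep, remove

# Cached segment: year >= 1991 and publisher = Addison-Wesley, returning year (3) and title (5).
cached_filters = [(3, "@year >= 1991"), (19, 'publisher = "Addison-Wesley"')]
cached_returns = {3, 5}

# New query: a stricter year filter, a title filter, and the same publisher filter.
query_filters = [
    (3, "@year = 1997"),
    (5, 'title like "JAVA*"'),
    (19, 'publisher = "Addison-Wesley"'),
]
keep, remove = split_filters(query_filters, cached_filters, cached_returns)
```

Here the year and title filters are kept and the publisher filter is removed, which is the set of left-over filters one would expect in a probe query over this segment.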
Query Rewriting Rules (cont.)
2. Work out the semantic meaning of the newly constructed nodes in the returning structure of each cached query; they are associated with new XPaths that replace their old ones. A newly constructed node can be seen as the renaming of some old element node. Return nodes usually appear under each newly constructed node, so a mapping from the new node to the old one can be inferred from those return nodes.
• in the new query q, replace the old XPaths with the new XPaths to the newly constructed nodes in the cached views
3. If a query is rewritten using joins of more than one s with a common t pair, be sure to add such joins as new conditions.
• if there is no variable binding for it in the new q, produce a new binding for one of the common t pair, so that there is a way to join it with its counterpart
Query Containment I
Suppose that queries q1, q2, q3, q4, q5, q6 are cached in C = {s1, s2, s3, s4, s5, s6}, and a new query q comes in.
case q of
  <bib>
    FOR $book IN document("bib.xml")//book[editor/affiliation="WPI"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
It does not even satisfy the first condition, so it is not answerable: f1 refers to element node 17, which has no match in s1 to s6.
case q of
  <bib>
    FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
It satisfies the first condition but not the second one, so it is not answerable: f1 <--> f1' in s1, but another condition node f2' of s1 is not a condition node of q.
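The two cases can be replayed against a small model of s1 (the cached Q1). This is a sketch under stated assumptions, not the full theorem: only the first two conditions are checked, only one cached segment is considered instead of s1 to s6, conditions are (node number, operator, value) triples, and `Segment`, `implies`, `condition_one`, and `condition_two` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    conditions: list                             # (node id, operator, value) triples
    returns: set = field(default_factory=set)    # node ids of the return nodes

def implies(f, g):
    """True if condition f is the same as, or stricter than, condition g."""
    (f_node, f_op, f_val), (g_node, g_op, g_val) = f, g
    if f_node != g_node:
        return False
    if (f_op, f_val) == (g_op, g_val):
        return True
    if g_op == ">=":
        return f_op in ("=", ">=") and f_val >= g_val
    if g_op == "<=":
        return f_op in ("=", "<=") and f_val <= g_val
    return False

def condition_one(q, s):
    """Every condition of q has a looser cached condition or hits a cached return node."""
    return all(
        any(implies(f, g) for g in s.conditions) or f[0] in s.returns
        for f in q.conditions
    )

def condition_two(q, s):
    """Every cached condition has a counterpart among the conditions of q."""
    return all(any(implies(f, g) for f in q.conditions) for g in s.conditions)

# s1: @year >= 1991 (node 3) and publisher = "Addison-Wesley" (node 19); returns year, title.
s1 = Segment([(3, ">=", 1991), (19, "=", "Addison-Wesley")], {3, 5})

affiliation_query = Segment([(17, "=", "WPI")], {3, 5})
publisher_query = Segment([(19, "=", "Addison-Wesley")], {3, 5})
```

With this model, `condition_one(affiliation_query, s1)` is false because node 17 appears nowhere in s1, while `publisher_query` passes the first check and fails `condition_two` because s1 also filtered on the year, so its cached result may be missing books that q needs.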
Query Rewriting I
case q of
  <bib>
    FOR $book IN …/bib/book [@year=1997 AND title like "JAVA*" AND publisher="Addison-Wesley"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
It satisfies all the conditions, so it is answerable:
1) f1 <--> f1' in s1, f2 <--> t2' of s1, f3 <--> f2' in s1
2) there is no other f' in s1
3) t1 <--> t1' in s1, t2 <--> t2' in s1
4) t1' and t2' are both from the same s1
5) t1 and t2 are derived from v1, as t1' and t2' are from v1'; v1' --> f1', f2' and v1 --> f1, f2
Rewrite the query as
  <bib>
    FOR $book IN /book [source = "s1"][@year=1997 AND title like "JAVA*"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
[Figure: in q, v1 = /bib/book carries the filters f1, f2, f3 and the return nodes t1 = v1/@year (3) and t2 = v1/title (5). The binding is rewritten as /book [source = "s1"] over the cached view s1, and the left-over filters are kept.]
Query Rewriting II
case q of
  <bib>
    FOR $book IN … /bib/book [@year.>=.1991 AND publisher="Addison-Wesley" AND price.<=.$50]
    RETURN <book>$book/title,<editors>$book/editor</editors></book>
  </bib>
It satisfies all the conditions, so it is answerable:
1) f1 <--> f1' in s1, f2 <--> f2' in s1, f3 <--> f1' of s5
2) there is no other f' in s1 and s5
3) t1 <--> t2 in s1, t2 <--> t2 in s5
4) t1 in s1 = t1 in s5
5) t1 and t2 are derived from v1, as t1' and t2' are from v1'; v1' --> f1', f2' and v1 --> f1, f2, f3
Rewrite the query as
  <bib>
    FOR $book1 IN /book [source = "s1"],
        $book2 IN /book [source = "s5"][title =$book1/title]
    RETURN <book>$book1/title,<editors>$book2/editor</editors></book>
  </bib>
[Figure: in q, v1 = /bib/book carries the filters f1, f2, f3. The binding is rewritten as /book1[source = "s1"] and /book2[source = "s5"]. s1 has v1 = /bib/book with t1 = v1/@year (3) and t2 = v1/title (5); s5 has v1 = /bib/book with t1 = v1/@year (3) and t2 = v1/editor.]
Query Rewriting III
Suppose that q2 is cached in S, but q3 is not cached; instead, it is a new query.
case q of
  <bib>
    FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
case q of
  <bib>
    FOR $book IN … //book [@year=1997 AND title like "JAVA*" AND publisher="Addison-Wesley"]
    RETURN <book year=$book/@year>$book/title</book>
  </bib>
Query Rewriting IV
Suppose that q2 is cached in S, but q3 is not cached; instead, it is a new query.
Q4
  <books-with-prices>
    FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"],
        $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title]
    RETURN <book-with-prices>
      $b_book/title,
      <price-amazon>$a_book/price/text()</price-amazon>,
      <price-bn>$b_book/price/text()</price-bn>
    </book-with-prices>
  </books-with-prices>
Input: Query q, Semantic Cache C
Output: Result of q
AnsweringQuery Procedure:
  answerable, fullyAns <--- False;
  T <--- current timestamp;
  C = {s1, s2, ..} <--- set up CIS for each segment si;
  s <--- set up CIS for q;
  M <--- matched node set, null at the beginning;
  R = RENs <--- not yet matched node set;
  RC <--- CENs of s;
  si <--- look for the first q-related si in C;
  S <--- put si into a candidate set;
  While (si can be found) {
    answerable <--- True;
    STs <--- T;
    (MS, RM) <--- query_trimming(si, s);
      MS <--- matching nodes of s and si;
      RM <--- remaining nodes of s not covered by si;
    R = R-RM; M = M+MS; RC = RC+RemainingCENS;
    If (R = null) { fullyAns <--- True; break; }
    si <--- next q-related segment in C;
  }
  PQ <--- query_rewriting(S, JoinCISs, M, RC);
  MatV <--- materialized view sets referred to by S;
  ResPQ, Result <--- process PQ against MatV;
  If (fullyAns = False) {
    RQ <--- query_rewriting(S, JoinCISs, M, RC);
    ResRQ <--- process RQ at the server;
    Result <--- coalesce(ResPQ, ResRQ);
  }
  create a new segment Snew that contains the result of q;
  SnewTs <--- T;
  If there isn't enough space, do cache replacement;
  Cache Snew;
  return(Result).
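The core of the loop, collecting related segments until every return node of q is covered, can be sketched as follows. This deliberately ignores condition nodes, join information, timestamps, and cache replacement; the function name `collect_segments` and the two-segment cache (return nodes of Q1 and Q5 by DTD node number) are illustrative assumptions.

```python
def collect_segments(query_returns, cache):
    """Walk the cache and gather segments until all return nodes of q are matched."""
    remaining = set(query_returns)   # return nodes of q not yet covered
    matched = set()                  # return nodes of q covered so far
    candidates = []                  # names of the related segments, in cache order
    for name, segment_returns in cache.items():
        covered = remaining & segment_returns
        if not covered:
            continue                 # segment is unrelated to what is still missing
        candidates.append(name)
        matched |= covered
        remaining -= covered
        if not remaining:
            break                    # q is fully answerable from the candidates
    return candidates, matched, remaining

# Return nodes: s1 returns @year (3) and title (5); s5 returns @year (3) and editor (12).
cache = {"s1": {3, 5}, "s5": {3, 12}}

full = collect_segments({5, 12}, cache)      # title and editor
partial = collect_segments({5, 21}, cache)   # title and price
```

In the first call both segments are needed and nothing remains, so only a probe query is required; in the second call node 21 (price) remains uncovered, which is the situation where a remainder query goes to the data sources.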
Input: CIS structure s for q and a segment si; current candidate set S and JoinCISs
Output: whether si is related to q; if yes, si is added into S, together with matching nodes MS and remaining nodes RM
Query_trimming Procedure:
  IsRelated <--- False;
  MS <--- null;
  RM <--- boundRENs of s;
  RS <--- RENs in si;
  RNS <--- RENs in all s of S;
  RemainingCENS <--- boundCENs of s;
  While (RM ≠ null) {
    matchingRENS <--- match nodes in RM and RS by applying the theorem;
    if (matchingRENS ≠ null) {
      commonSet <--- RNS ^ RS;
      if (commonSet ≠ null) {
        IsRelated <--- True;
        <si, sj, commonSet> <--- sj is the segment in S that has commonSet with si;
        JoinCISs <--- add <si, sj, commonSet> into JoinCISs;
        (MS, RM) <--- (matchingRENS, RM-M);
        RemainingCENS <--- left-over ones except exact-match CENS;
      }
    }
  }
  return (S, JoinCISs, MS, RM, RemainingCENS).
[Figure: module design.
• set up: DTD → DTDWalker → DTDTree (ElementNode) → ENIndex; to-be-cached queries → QueryDecomposer → CacheIndexStructs, indexed by QueryIndex, VENIndex (VariableEleNode), CENIndex (ConditionEleNode), and RENIndex (ReturnEleNode); MatView (ViewDTDTree)
• AnsweringQuery: new query → QueryDecomposer → NewQuery → QueryTrimmer (MatchingCENPair, MatchingRENPair, MatchingENPair) → contained? (Y/N) → QueryRewritter → ProbeQuery; fully contained? (Y/N) → remainingRENs → QueryRewritter → RemainderQuery; ResultCoalescer → Result]
ElementNode
  enId: int
  dtdRef: DTDTree
  enName: String
  absXpath: String
  parentEN: ElementNode
  childrenEN: Vector
VariableEleNode
  venName: String
  cisRef: CIS
  parentVEN: VEN
  childrenVEN: Vector
  childrenCEN: Vector
  childrenREN: Vector
ConditionEleNode
  cenName: String
  cisRef: CIS
  condition: String
  parentVEN: VEN
ReturnEleNode
  renName: String
  cisRef: CIS
  parentVEN: VEN
stricterThan(CEN): boolean
equalTo(EN): boolean
relativeXpathForm(EN): String
DTDTree
  dtdName: String
  dtdLoc: String
  rootEN: ElementNode
  elementNodes: Vector
ViewDTDTree
  matchingENPairs: Vector
ENIndex
  dtdElements: Vector
MatchingCENPair
  newCEN: ConditionEleNode
  oldCEN: ConditionEleNode
  oldREN: ReturnEleNode
  remainingCond: String
MatchingRENPair
  newTEN: ReturnEleNode
  oldTEN: ReturnEleNode
  otherOldTENs: Vector
stricterThan(CEN): boolean
CacheIndexStructs
  cisId: String
  qstring: String
  viewDTD: ViewDTDTree
  matRef: MatView
  boundVENs: Vector
  boundCENs: Vector
  boundRENs: Vector
NewQuery
  candidateCISSet: Vector
  matchingCENPairs: Vector
  remainingCENs: Vector
  matchingRENPairs: Vector
  remainingRENs: Vector
hasMoreCENsThan(CIS): boolean
hasOverlapTENs(MatchingRENPair, MatchingRENPair): boolean
initialize
diffCISFrom(MatchingRENPair): boolean
overlapTENWith(MatchingRENPair): boolean
sameParentVEN(): boolean
newWithPVENhasMoreCENs(): boolean
VENIndex
  cachedVENinCISs: Hashtable
CENIndex
  cachedCENinCISs: Hashtable
RENIndex
  cachedRENinCISs: Hashtable
QueryIndex
  cachedQueries: Hashtable
MatView
  matId: String
  cisRef: CacheIndexStructs
MatchingENPair
  newConstructEN: ElementNode
  originalDtdEN: ElementNode
  source: MatView
ProbeQuery
  boundVENs: Vector
  boundCENs: Vector
  boundRENs: Vector
  newDTD: ViewDTDTree
Timetable
• By 11/15: design due, implementation starts
• By 11/30: coding half finished
• By 12/10: coding fully finished
• By 12/20: finish integration
• By 1/15: test designed cases
• By 1/30: design and run experiments
• By 2/15: collect experiment results
• By 2/28: document code, writing
• By 3/15: summarize
Task Assignment
• Lily:
– design classes, containment, rewriting, and candidate-picking algorithms; design experiments
– ideas for query decomposition, result combination, cache decomposing/coalescing, replacement policy, handling of data updates
• Jake:
– implement containment algorithm, ...
• Ian:
– implement classes of EN, VEN, CEN, TEN, ...
• Amar:
– module for the rewriting algorithm
Implementation Toolsuites
• JDK1.2, Servlet
• XML Parser, DTD Parser
• Quilt Parser, Kweelt(Quilt) Query Engine