efficient processing of partially specified twig queries
1
Efficient Processing of Partially Specified Twig Queries
Junfeng Zhou
Renmin University of China
2
Outline
• Introduction
• Preliminary
• PTwigStack
• Conclusion
4
Introduction(1)
• XML has been used extensively as a standard for information representation and exchange
• More and more data is stored and exchanged in XML format
• Effective and efficient querying of XML data is indispensable
5
Introduction(2)
• Using a standard query language (XPath or XQuery)
• How can we write a proper query when:
– the structure or schema is not fully available, or
– information is extracted from different data sources with different structures?
[Figure: XML document tree with nodes bibliography(1), bib(2), bib(…), year(3), book(4), title(5), author(6), article(7), title(8), author(9), author(10) and text values 1999, XML, Joe, Mary, Bob; twig query Q: book with branches title and author]
6
Introduction (4)
• Using keyword-based query
• For example [1]
– Find title and author of the publications
[Figure: XML document tree with nodes bibliography(1), bib(2), bib(…), year(3), book(4), title(5), author(6), article(7), title(8), author(9), author(10)]
The answer is: (5,6), (8,9,10)
[1] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proceedings of VLDB 2004, pages 72-83, 2004
7
Introduction (5)
• Using keyword-based query
• What if nodes 6 and 8 are removed from the document?
– Find title and author of the publications
[Figure: XML document tree with nodes bibliography(1), bib(2), bib(…), year(3), book(4), title(5), article(7), author(9), author(10)]
The answer is: (5,9,10)
Meaningless result: (5,9,10)
Correct answer: (5,NULL), (NULL,9,10)
8
Introduction (6)
• Using Partially Specified Twig Query (PSTQ) [2]
– Gives users the most flexibility
• But
– No existing method can process a PSTQ efficiently
[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006
9
Introduction(7)
• Objective
– A concise but effective way to specify more flexible semantic constraints in a twig query
– An efficient approach to process a PSTQ holistically, without deriving twig queries and processing them one by one
• Scan once: each stream whose elements' tag appears in the twig pattern is scanned only once.
• No redundant output: none of the intermediate path solutions is useless.
• Bounded space complexity: the space required by the algorithm is bounded by a factor that is independent of the source document size.
10
Outline
• Introduction
• Preliminary
– Holistic Twig Join
– Partially Specified Twig Query
• PTwigStack
• Conclusion
11
Preliminary - Holistic Twig Join [3]
• Query Processing
– Output useful path solutions
– Merge all path solutions to get final results
• Data Structure
– Each query node is associated with a stack and an element stream
• Benefits
– No useless path solutions
[Figure: XML document with root R and elements a1, a2, b1, b2, c1; twig query Q with root A and branches B and C]
[3] N. Bruno, N. Koudas, and D. Srivastava: Holistic twig joins: Optimal XML pattern matching. Technical Report, Columbia University, March 2002
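A minimal sketch of the element streams mentioned on this slide, assuming the common (start, end) region encoding (later slides compare Cq.start and Cq.end, which suggests it). The nested-tuple document and the names Elem, encode and is_ancestor are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from collections import defaultdict
from typing import NamedTuple


class Elem(NamedTuple):
    tag: str
    start: int
    end: int


def encode(tree, streams=None, counter=None):
    """Assign (start, end) positions by depth-first traversal.

    `tree` is a hypothetical (tag, [children]) tuple. Returns a dict that maps
    each tag to its stream: the elements with that tag, in start order.
    """
    if streams is None:
        streams, counter = defaultdict(list), [0]
    tag, children = tree
    counter[0] += 1
    start = counter[0]
    slot = len(streams[tag])
    streams[tag].append(None)  # reserve the slot so the stream stays in start order
    for child in children:
        encode(child, streams, counter)
    counter[0] += 1
    streams[tag][slot] = Elem(tag, start, counter[0])
    return streams


def is_ancestor(a, d):
    """a is an ancestor of d when a's region strictly contains d's region."""
    return a.start < d.start and d.end < a.end


# Hypothetical document: A(B, A(B, C))
doc = ("A", [("B", []), ("A", [("B", []), ("C", [])])])
streams = encode(doc)
```

Each query node would then read only the stream for its own tag, which is what makes a single sequential scan per stream possible.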
12
Preliminary - Partially Specified Twig Query [2]
• Q1 consists of two partial paths (PP), p1 and p2
• In p1, Y is a descendant of W
• In p2, W and A are on the same path
• p1 shares W with p2
• "*" means p2 is the output path
[Figure: Q1 with partial path p1 (W, Y) and partial path p2* (W, A)]
• Compared with a twig query:
– Some nodes are specified only as being on the same path as other nodes, without fixing which one precedes the other
• Compared with a keyword-based query:
– Each part of the query can be a path expression, not just a keyword
• Benefits of using PSTQ:
– Users can specify a query with whatever partial knowledge they have
[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006
13
Preliminary - Partially Specified Twig Query
• Query processing of a PSTQ: a naïve method
– Derive twig queries
– Process each twig query
• Problems of the naïve method
– Processing cost is too high
– Redundant results must be eliminated
[Figure: PSTQ Q with partial paths p1 (A, C) and p2* (A, B); derived twig queries Q1 to Q4 over nodes A, B, C; XML document with elements a1, b1, c1]
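To make the redundancy problem concrete, here is a small sketch of the naïve method. It assumes (start, end) region codes, assumes the document is the chain a1 > b1 > c1, and reads the four derived queries as A//B//C, A//C//B, B//A//C and A with separate branches B and C. That reading of the figure, and all names in the code, are assumptions made for illustration.

```python
from itertools import product
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def anc(x, y):
    """x is an ancestor of y under region encoding."""
    return x.start < y.start and y.end < x.end


# Assumed document: a1 contains b1, which contains c1.
A = [Elem("a1", 1, 6)]
B = [Elem("b1", 2, 5)]
C = [Elem("c1", 3, 4)]

# The four derived twig queries, written as predicates over (a, b, c).
derived = {
    "Q1": lambda a, b, c: anc(a, b) and anc(b, c),  # A//B//C
    "Q2": lambda a, b, c: anc(a, c) and anc(c, b),  # A//C//B
    "Q3": lambda a, b, c: anc(b, a) and anc(a, c),  # B//A//C
    "Q4": lambda a, b, c: anc(a, b) and anc(a, c),  # A with branches B and C
}

# Naive method: evaluate every derived query, then collect all matches.
all_matches = []
for name, pred in derived.items():
    for a, b, c in product(A, B, C):
        if pred(a, b, c):
            all_matches.append((name, a.name, b.name, c.name))

# The same (a, b, c) tuple can be produced by more than one derived query.
distinct = {m[1:] for m in all_matches}
```

Under these assumptions both Q1 and Q4 return (a1, b1, c1), so the naïve method pays for four evaluations and still needs a duplicate-elimination pass.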
14
Outline
• Introduction
• Preliminary
• PTwigStack
• Conclusion
15
PTwigStack: PSTQ Expression
• Extending XPath by adding an operator
– “ ” is used to denote the being-on-the-same-path relationship
• A B is equivalent to A//B or B//A
• A B C ?
[Figure: query Q and the derived twig queries Q1 to Q7 over nodes A, B, C]
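The equivalence stated on this slide (same path means A//B or B//A) can be sketched as a predicate over elements. The sketch assumes (start, end) region codes; Elem, is_ancestor and same_path are illustrative names, not part of the proposed XPath extension.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def is_ancestor(a, d):
    """a//d holds when a's region strictly contains d's region."""
    return a.start < d.start and d.end < a.end


def same_path(x, y):
    """Same-path relationship: x//y or y//x."""
    return is_ancestor(x, y) or is_ancestor(y, x)


# Hypothetical elements: a1 contains both b1 and c1, which are siblings.
a1 = Elem("a1", 1, 10)
b1 = Elem("b1", 2, 5)
c1 = Elem("c1", 6, 9)
```

The predicate is symmetric, which is the point of the operator: the user does not need to know which of the two nodes is the ancestor.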
16
PTwigStack
• Objective
– Scan once
– No redundant output
– Bounded space complexity
• Problems
– Which query node should be processed first? According to a special order in the given query
– Which element should be processed first? The element with a solution extension
– How to guarantee that no useless path solutions are produced? An element that cannot participate in answers is not pushed onto a stack
[Figure: document with elements a1, a2, b1, b2, b3, c1; query Q and derived twig queries Q1 to Q4 over nodes A, B, C]
17
PTwigStack
• Problems (1)
– Which query node should be processed first?
– Depth-first order: A, B, C
[Figure: document with elements a1, a2, b1, b2, b3, c1; query Q and derived twig queries Q1 to Q4 over nodes A, B, C]
18
PTwigStack
• Problems (2)
– Which element should be processed first?
– The element with a Partial Solution Extension (PSE)
[Figure: document with elements a1, a2, b1, b2, b3, c1; query Q and derived twig queries Q1 to Q4 over nodes A, B, C]
• Partial Solution Extension
– We say a query node q has a PSE iff q satisfies any one of the following conditions:
• If q is a leaf node, Cq is not NULL.
• If q is not a leaf node, for each q’ ∈ children(q):
– If q//q’, then Cq is an ancestor of Cq’
[Figure: example with elements a1 and c1]
19
PTwigStack
• Problems (2)
– Which element should be processed first?
– The element with a Partial Solution Extension (PSE)
[Figure: document with elements a1, a2, b1, b2, b3, c1; query Q and derived twig queries Q1 to Q4 over nodes A, B, C]
• Partial Solution Extension
– We say a query node q has a PSE iff q satisfies any one of the following conditions:
• If q is a leaf node, Cq is not NULL.
• If q is a non-leaf node, for each q’ ∈ children(q):
– If q//q’, then Cq is an ancestor of Cq’
– If q q’ (being on the same path) and q’ has a PSE, then Cq covers Cq’ or is covered by Cq’, or Cq.end < Cq’.start
[Figure: example element configurations over a1, b1, c0, c1]
20
PTwigStack
• Problems (2)
– Which element should be processed first?
– The element with a Partial Solution Extension (PSE)
[Figure: document with elements a1, a2, b1, b2, b3, c1; query Q and derived twig queries Q1 to Q4 over nodes A, B, C]
• Partial Solution Extension
– We say a query node q has a PSE iff q satisfies any one of the following conditions:
• If q is a leaf node, Cq is not NULL.
• If q is a non-leaf node, for each q’ ∈ children(q):
– If q//q’, then Cq is an ancestor of Cq’
– If q q’ (being on the same path) and q’ has a PSE, then Cq covers Cq’ or is covered by Cq’, or Cq.end < Cq’.start
– If q q’ and q’ has no PSE, let p be a descendant of q’ that has a PSE; then Cq.start < Cp.start
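The per-child tests in this definition can be sketched as a single function. This is only a sketch of the first two edge rules as worded on the slide, assuming (start, end) region codes and reading "covers" as strict region containment. The rule for a child without a PSE and the recursion over the query tree are deliberately left out; the names Elem, covers and edge_condition are hypothetical.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def covers(x, y):
    """x covers y when x's region strictly contains y's region (assumed reading)."""
    return x.start < y.start and y.end < x.end


def edge_condition(kind, cq, cq_child):
    """Check one (q, q') edge against the current elements Cq and Cq'.

    kind == "desc":      q//q', so Cq must be an ancestor of Cq'.
    kind == "same_path": q and q' on the same path (and q' has a PSE), so
                         Cq covers Cq', or Cq' covers Cq, or Cq ends before
                         Cq' starts.
    """
    if kind == "desc":
        return covers(cq, cq_child)
    if kind == "same_path":
        return (
            covers(cq, cq_child)
            or covers(cq_child, cq)
            or cq.end < cq_child.start
        )
    raise ValueError(f"unknown edge kind: {kind}")


# Hypothetical current elements.
a1 = Elem("a1", 1, 10)
c1 = Elem("c1", 2, 3)     # inside a1
b_after = Elem("b", 11, 12)   # starts after a1 ends
b_before = Elem("b", 0, 0)    # lies entirely before a1
```

Note how the same-path rule is strictly weaker than the descendant rule: an element that starts after Cq ends fails the descendant test but still passes the same-path test as stated.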
21
PTwigStack
• Feature of Partial Solution Extension
– If E has a PSE, E must have a solution extension for some twig query derived from the given PSTQ, which means CE may participate in final results.
• Usage of Partial Solution Extension
– Guiding the execution of PTwigStack
22
PTwigStack
• Problems (3)
– How to guarantee that no useless path solutions are produced?
• Prevent useless elements from being pushed onto the stack
– What is a useless element?
• One that cannot satisfy the query requirement together with the top elements of the correlated stacks or the head element of each element stream
[Figure: three example documents over elements a0, a1, b1, c1, shown against a query over nodes A, B, C]
23
PTwigStack
• Data Structure
– Stack
• Each query node is associated with a stack that compactly represents intermediate results
– Tag index
• Each query node is associated with an element stream
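One way to picture this data structure is a small record per query node that holds its stream, a cursor for the current element Cq, and its stack. The class below is a hypothetical sketch, assuming elements are (start, end) pairs sorted by start; the field and method names are not from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # (start, end) of one element


@dataclass
class QueryNode:
    tag: str
    stream: List[Region]  # elements with this tag, sorted by start
    cursor: int = 0       # index of the current element Cq in the stream
    stack: List[Region] = field(default_factory=list)
    children: List["QueryNode"] = field(default_factory=list)

    def current(self) -> Optional[Region]:
        """Return Cq, or None once the stream is exhausted."""
        if self.cursor < len(self.stream):
            return self.stream[self.cursor]
        return None

    def advance(self) -> None:
        """Move to the next element; the cursor never moves backwards."""
        if self.cursor < len(self.stream):
            self.cursor += 1
```

Because the cursor only moves forward, each stream is read once, which matches the scan-once objective.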
24
PTwigStack
PTwigStack(root)
  // the first stage
  while not end(root)
    q = getNext(root)
    clean all stacks related to q and output the relevant path solutions
    if Cq can be pushed onto stack Sq
      Push(Sq, Cq)
    process the other elements Cq’ iteratively, where q’ is a child of q in the query and Cq’.start < Cq.start
    output all possible path solutions
    Advance(Cq)
  // the second stage
  MergeAllPathSolution()
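The stack-cleaning and push steps of the loop can be sketched as follows. The pop rule used here (discard stack entries that end before the current element starts) is the one used by TwigStack-style holistic joins and is assumed, not confirmed, for PTwigStack. The test "Cq can be pushed" and the output of path solutions are not modelled, and clean_and_push is a hypothetical name.

```python
def clean_and_push(stack, elem):
    """Pop entries that cannot contain `elem`, then push `elem`.

    `stack` is a list of (start, end) regions with the top at the end.
    An entry whose end lies before elem's start is not an ancestor of
    elem or of anything that follows it in document order, so it is
    removed. Returns the popped entries.
    """
    start, _end = elem
    popped = []
    while stack and stack[-1][1] < start:
        popped.append(stack.pop())
    stack.append(elem)
    return popped


# Hypothetical stack: (2, 5) is nested inside (1, 20).
stack = [(1, 20), (2, 5)]
popped = clean_and_push(stack, (7, 8))
```

After the call the stack holds only a chain of nested elements, which is what lets it represent many partial solutions compactly.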
25
PTwigStack
[Figure: running example, initial state. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Output: none yet. Final result: none yet.]
26
PTwigStack
[Figure: running example, next step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Element shown: c1. Output: none yet. Final result: none yet.]
27
PTwigStack
[Figure: running example, next step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Elements shown: a1, b1. Output: none yet. Final result: none yet.]
28
PTwigStack
[Figure: running example, next step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Elements shown: a1, b1. Output: none yet. Final result: none yet.]
29
PTwigStack
[Figure: running example, next step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Elements shown: a1, b1, c2. Output: a1c2. Final result: none yet.]
30
PTwigStack
[Figure: running example, next step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Elements shown: a1, b1, b2. Output: a1c2, a1b2. Final result: none yet.]
31
PTwigStack
[Figure: running example, final step. Document elements a1, a2, a3, b1, b2, c1, c2; query nodes A, B, C; query Q and derived twig queries Q1 to Q4. Elements shown: a1, b1. Output: a1c2, a1b2, a1b1. Final result: a1b1c2, a1b2c2.]
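The second stage of this running example can be sketched as a join of path solutions on their shared A element. The sketch assumes a plain equi-join on the first component, which reproduces the final result listed for the example; whether MergeAllPathSolution performs further checks is not stated in the slides, and merge_path_solutions is a hypothetical name.

```python
def merge_path_solutions(ab_paths, ac_paths):
    """Join (a, b) and (a, c) path solutions on the shared a element."""
    c_by_a = {}
    for a, c in ac_paths:
        c_by_a.setdefault(a, []).append(c)
    return sorted(
        (a, b, c)
        for a, b in ab_paths
        for c in c_by_a.get(a, [])
    )


# Path solutions from the running example's output: a1c2, a1b2, a1b1.
ab = [("a1", "b2"), ("a1", "b1")]
ac = [("a1", "c2")]
result = merge_path_solutions(ab, ac)
```

With these inputs the join yields (a1, b1, c2) and (a1, b2, c2), matching the final result shown for the example.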
32
PTwigStack
• Properties:
– Each element is scanned only once
– Each element in a stack must participate in at least one final result
– No "eliminating operation" for redundant results
– Space is bounded by |Q|×L, where L is the length of the longest path in the XML source document and |Q| is the number of nodes in the given query Q
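A rough illustration of where the |Q|×L bound comes from: if each stack only ever holds a chain of nested elements, its size cannot exceed the depth of the document, and there is one stack per query node. The sketch below replays a single hypothetical stream using a TwigStack-style pop rule (an assumption, as is the name max_stack_depth) and records the largest stack size reached.

```python
def max_stack_depth(stream):
    """Replay one tag stream of (start, end) regions sorted by start.

    Before each push, entries that end before the new element starts are
    popped, so the stack always holds a chain of nested ancestors.
    Returns the largest stack size observed.
    """
    stack, deepest = [], 0
    for start, end in stream:
        while stack and stack[-1][1] < start:
            stack.pop()
        stack.append((start, end))
        deepest = max(deepest, len(stack))
    return deepest


# Hypothetical stream: three nested elements, then two later siblings.
stream = [(1, 12), (2, 7), (3, 4), (8, 9), (10, 11)]
deepest = max_stack_depth(stream)
```

The stream has five elements but the stack never holds more than three, the nesting depth; multiplying that per-stack limit by the number of query nodes gives a bound of the form |Q|×L.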
33
Outline
• Introduction
• Preliminary
• PTwigStack
• Conclusion
34
Conclusion
• We propose a concise but effective way to express the semantics of being on the same path by extending XPath
• We propose a new concept, Partial Solution Extension, to guide the execution of getNext
• We propose a new holistic join method to process a PSTQ with a root node
35
Future Work
• The above method cannot be applied directly to a query that does not specify a root node, e.g.
– #[//A]//B
– #[//A//B]//C
– #[//A B]//C
• Possible solutions
– Implement a special algorithm to process a PSTQ that does not specify a root node (using Dewey code)
– Use ORA-SS [4] to construct a twig query with more semantic constraints (using range code)
[4] Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee: ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000.
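For context on the Dewey code option: a Dewey label records the path of child positions from the root, so the ancestor test becomes a prefix test and the same-path test follows directly. The sketch below shows only that labelling property, not the proposed future algorithm; the tuple labels and function names are hypothetical.

```python
def is_ancestor(a, d):
    """Under Dewey labelling, a is an ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def same_path(x, y):
    """Two nodes are on the same path when one label is a prefix of the other."""
    return is_ancestor(x, y) or is_ancestor(y, x)


# Hypothetical labels: root 1, its children 1.2 and 1.3, and grandchild 1.2.1.
root = (1,)
a = (1, 2)
b = (1, 2, 1)
c = (1, 3)
```

A Dewey label also exposes every ancestor of a node without consulting other elements, which is one reason it is attractive when the query does not fix a root node.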
36
Thank You !
Q & A