Post on 22-Dec-2015
1
Optimizing Cursor Movement in Holistic Twig Joins
Marcus Fontoura, Vanja Josifovski, Eugene Shekita (IBM Almaden Research Center)
Beverly Yang (Stanford)
CIKM’2005
2
Motivation
for $a in //article[year = "2005" or keyword = "XML"]
for $s in $a/section
return $s/title
In an index-based method, 7 tags and text elements need to be verified to process this query
Running time is dominated by the I/O for manipulating the cursors over these elements
Twig join algorithms are not optimized for I/O and do not exploit the query's extraction points
Figure: query tree. article -> AND -> (OR, section); OR -> (year -> "2005", keyword -> "XML"); section -> title
3
Our Contributions
1. TwigOptimal, a new holistic twig join algorithm that supports a large fraction of XQuery (including AND/OR branches)
2. Description of how extraction points improve query performance
3. Experimental evaluation that shows how TwigOptimal outperforms current algorithms
4
Agenda
Background
TwigOptimal algorithm
Experimental results
Conclusions
5
XML Indexing
Begin/End/Level encoding
Begin: preorder position of tag/text
End: preorder position of last descendant
Level: depth
Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed)
Example tree with (begin, end, level) labels:
R (0,7,0)
  A1 (1,5,1)
    B1 (2,2,2)
    B2 (3,5,2)
      C1 (4,4,3)
      D1 (5,5,3)
  B3 (6,7,1)
    C2 (7,7,2)
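The encoding and the containment test can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `encode` and `contains` helpers are hypothetical names, and the tree shape is inferred from the (begin, end, level) triples on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory tree node, used only for this illustration.
    name: str
    children: list = field(default_factory=list)


def encode(root):
    """Assign (begin, end, level) to every node with one preorder traversal."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        begin = counter
        counter += 1
        for child in node.children:
            visit(child, level + 1)
        # end is the preorder position of the last descendant
        # (the node's own position if it is a leaf).
        labels[node.name] = (begin, counter - 1, level)

    visit(root, 0)
    return labels


def contains(x, y):
    """X contains Y iff X.begin < Y.begin <= X.end."""
    return x[0] < y[0] <= x[1]


# Tree shape inferred from the triples on the slide (an assumption).
doc = Node("R", [
    Node("A1", [Node("B1"), Node("B2", [Node("C1"), Node("D1")])]),
    Node("B3", [Node("C2")]),
])
bel = encode(doc)
print(bel["B2"])                        # (3, 5, 2)
print(contains(bel["A1"], bel["D1"]))   # True
print(contains(bel["A1"], bel["C2"]))   # False
```

The containment check needs only two integer comparisons per pair, which is why the index can answer ancestor/descendant questions without touching the document itself.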
6
Basic Access Path
Inverted lists
Posting: <Token, Location>
Token = <term/tag>
Location = <DocumentID, Position>
Supported method on cursor: CB.forwardTo(Position p)
Figure: the example tree (R, A1, B1, B2, C1, D1, B3, C2) with the inverted lists for B (B1 B2 B3) and C (C1 C2)
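A cursor over one inverted list might look like the sketch below. It assumes a single document (no DocumentID), postings sorted by begin position, and that forwardTo(p) means "advance to the first posting with begin >= p"; the slide does not spell out these semantics. The class name, `forward_to`, and the `physical_moves` counter are hypothetical.

```python
from bisect import bisect_left


class Cursor:
    """Cursor over one inverted list of (begin, end, level) postings."""

    def __init__(self, postings):
        self.postings = sorted(postings)            # ordered by begin
        self.begins = [p[0] for p in self.postings]
        self.index = 0
        self.physical_moves = 0                     # stand-in for I/O cost

    def at_end(self):
        return self.index >= len(self.postings)

    def current(self):
        return None if self.at_end() else self.postings[self.index]

    def forward_to(self, position):
        """Advance to the first posting with begin >= position (assumed
        semantics). The cursor never moves backward."""
        target = bisect_left(self.begins, position, lo=self.index)
        if target != self.index:
            self.index = target
            self.physical_moves += 1
        return self.current()


# Inverted list for tag B from the example tree: B1, B2, B3.
cb = Cursor([(2, 2, 2), (3, 5, 2), (6, 7, 1)])
print(cb.forward_to(3))   # (3, 5, 2)
print(cb.forward_to(1))   # (3, 5, 2), no backward move
print(cb.forward_to(6))   # (6, 7, 1)
print(cb.forward_to(8))   # None, the list is exhausted
```

In a real index each forward seek is a B-tree lookup, so the counter here approximates the I/O the talk is trying to reduce.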
7
Joins in XML
Structural (containment) joins
Twig joins
Figure: query patterns shown on the slide: A||B; the twig A||B with branches ||C and ||D; B||C; B||D; A||B||C
8
LocateExtension
“Extension” (w.r.t. query node q): a solution for the subquery rooted at q
Input: q
Result: the cursors of all descendants of q point to an extension for q
Figure: query twig A||B with branches ||C and ||D, over a document fragment with nodes A, B1, B3, C1, C2, D1, D2, X1, X2
9
LocateExtension
while (not end(q) && not hasExtension(q)) {
    (p, c) = PickBrokenEdge(q);
    ZigZagJoin(p, c);
}
Figure: query twig A||B with branches ||C and ||D, over a document fragment with nodes A, B1, B3, C1, C2, D1, D2, X1, X2
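The slide does not define ZigZagJoin, so the following is one plausible reading for a single ancestor/descendant edge, not the authors' implementation. It alternately skips the descendant list and the ancestor list forward until the current pair satisfies the containment condition. The function name `zigzag_join` and the use of plain (begin, end) tuples are assumptions.

```python
from bisect import bisect_right


def zigzag_join(ancestors, descendants):
    """Return the first (ancestor, descendant) pair found such that
    a.begin < d.begin <= a.end, or None if no pair exists.

    Both inputs are lists of (begin, end) tuples sorted by begin.
    """
    d_begins = [d[0] for d in descendants]
    a_idx = d_idx = 0
    while a_idx < len(ancestors) and d_idx < len(descendants):
        a_begin, a_end = ancestors[a_idx]
        d_begin = descendants[d_idx][0]
        if d_begin <= a_begin:
            # Descendant cursor is behind: skip to the first begin > a_begin.
            d_idx = bisect_right(d_begins, a_begin, lo=d_idx)
        elif d_begin > a_end:
            # Ancestor ends too early: skip ancestors that end before d_begin.
            while a_idx < len(ancestors) and ancestors[a_idx][1] < d_begin:
                a_idx += 1
        else:
            return ancestors[a_idx], descendants[d_idx]
    return None


# (begin, end) pairs for the B and C elements of the earlier example tree.
b_list = [(2, 2), (3, 5), (6, 7)]
c_list = [(4, 4), (7, 7)]
print(zigzag_join(b_list, c_list))     # ((3, 5), (4, 4))
print(zigzag_join([(6, 7)], [(5, 5)])) # None
```

Every skipped element is one that cannot take part in any remaining match, which is what lets the join jump over long runs of postings instead of scanning them.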
10
TwigOptimal Algorithm
Tests if the cursor with the minimal location has an extension
If not, tries to virtually move cursors until they form an extension
Only moves cursors physically if no more virtual moves are possible
A virtual move just sets the begin value of the cursor, so no I/O is involved:
    Cq.begin = new begin value for Cq;
    Cq.virtual = true; // indicates that the cursor is virtual
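A minimal sketch of the virtual/physical distinction, assuming that a virtual begin value acts as a lower bound that is resolved later by a single seek. The class `VCursor` and the method names `virtual_move` and `materialize` are hypothetical and do not come from the talk.

```python
from bisect import bisect_left


class VCursor:
    """Cursor whose position may be virtual (a lower bound only)."""

    def __init__(self, begins):
        self.begins = sorted(begins)      # begin positions in the inverted list
        self.index = 0
        self.begin = self.begins[0] if self.begins else float("inf")
        self.virtual = False
        self.physical_moves = 0           # stand-in for I/O cost

    def virtual_move(self, new_begin):
        # No I/O: only record that the cursor must be at or after new_begin.
        if new_begin > self.begin:
            self.begin = new_begin
            self.virtual = True

    def materialize(self):
        # Physical move: one seek to the first real posting >= self.begin.
        if self.virtual:
            self.index = bisect_left(self.begins, self.begin, lo=self.index)
            self.physical_moves += 1
            at_end = self.index >= len(self.begins)
            self.begin = float("inf") if at_end else self.begins[self.index]
            self.virtual = False


c = VCursor([1, 10, 50, 90])
c.virtual_move(20)
c.virtual_move(60)        # two virtual moves, still no I/O
print(c.begin, c.virtual, c.physical_moves)   # 60 True 0
c.materialize()           # a single seek resolves both
print(c.begin, c.virtual, c.physical_moves)   # 90 False 1
```

Under this reading, several virtual moves collapse into one physical seek, which is where the I/O saving would come from.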
11
Checking Extension
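We have an extension for cursor q if:
All cursors underneath q are properly aligned
All cursors underneath q have physical locations
Figure: query twig A||B with branches ||C and ||D, over a document fragment with nodes A, B1, B3, C1, C2, D1, D2, X1, X2; for the cursor positions shown, the check returns false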
12
Checking Extension
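We have an extension for cursor q if:
All cursors underneath q are properly aligned
All cursors underneath q have physical locations
Figure: query twig A||B with branches ||C and ||D, over a document fragment with nodes A, B1, B3, C1, C2, D1, D2, X1, X2; for the cursor positions shown, the check returns true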
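The two conditions can be written as a short recursive check. The sketch below assumes a twig with only AND branches and ancestor/descendant edges, reads "properly aligned" as "each child cursor is contained in its parent's current element", and also requires q itself to be physical. `QNode` and `has_extension` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    """Query node together with the current position of its cursor."""
    name: str
    begin: int
    end: int
    virtual: bool = False
    children: list = field(default_factory=list)


def has_extension(q):
    # A virtual position is only a lower bound, not a real element.
    if q.virtual:
        return False
    # Every child must sit inside q's current element and, recursively,
    # have an extension of its own.
    return all(
        q.begin < c.begin <= q.end and has_extension(c)
        for c in q.children
    )


# Cursor positions taken from the earlier example tree: B2 contains C1 and D1.
c = QNode("C", 4, 4)
d = QNode("D", 5, 5)
b = QNode("B", 3, 5, children=[c, d])
a = QNode("A", 1, 5, children=[b])
print(has_extension(a))   # True

c.virtual = True          # same begin value, but no longer a physical location
print(has_extension(a))   # False
```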
13
Moving Cursors
Two passes over the query tree
Bottom-up: move each parent cursor forward so it contains its child cursors
Top-down: move the child cursors forward so they are contained by their parents
14
Move Cursors Example
Query = //x[.//y and .//z]
Figure: document with elements x1, x2, y1 to y5, z1, z2; seven numbered cursor moves, each marked as a virtual move or a physical move
15
Comparing with TSGeneric+
Query = //w//x//y//z
Figure: document with elements w1, w2, x1 ... x50, y1 ... y100, z1, z2; legend: current cursor position, virtual move, physical move
16
Comparing with TSGeneric+
Query = //w//x//y//z
Figure: document with elements w1, w2, x1 ... x50, y1 ... y100, z1, z2; legend: current cursor position, physical move
17
Extraction Points Optimization
If neither q nor its descendants in the query are extraction points, we can virtually move these cursors within q's parent
Figure: query twig A with branches ||B and ||C, over a document with nodes A1, A2, B1, B2, B3, C1, C99, C100
18
Prototype
Implemented over Berkeley DB B-tree
Inverted lists
Posting: <Token, Location>
Token = <term/tag>
Location = <DocumentID, Position>
Position is BEL (Begin/End/Level)
19
Data Sets
XMark: 10 documents of size ~100MB each
Synthetic: 4 tags (W, X, Y, Z); uncorrelated, no self-nesting; same frequency
20
Experimental Results
Chart: physical cursor moves (axis 0 to 4000) for queries //w[.//x], //w[.//x//z], //w[.//x//y//z]; series: TSGeneric+ and TwigOptimal
21
Experimental Results
Chart: running time in ms (axis 0 to 16) for queries //w[.//x], //w[.//x//z], //w[.//x//y//z]; series: TSGeneric+ and TwigOptimal
22
Experimental Results
Chart: physical cursor moves (axis 0 to 2000000) for the small XMark query (4 nodes) and the large XMark query (10 nodes); series: TSGeneric+ and TwigOptimal
23
Experimental Results
Chart: physical cursor moves (axis 0 to 50) for queries //w//x//y//z, //w//x//y[.//z], //w//x[.//y//z], //w[.//x//y//z]
24
Experimental Results
Chart: running time in ms (axis 0 to 35) for queries //w//x//y//z, //w//x//y[.//z], //w//x[.//y//z], //w[.//x//y//z]
25
Conclusion
The TwigOptimal algorithm outperforms existing twig join algorithms by more than 40%, especially for larger queries
Optimized for I/O, which is the performance bottleneck
The extraction points optimization improves performance