structural indexes of xml databases
Master Informatique 1Dr. Vu Le Anh Structural indexes of XML Databases
Structural indexes of XML Databases
Dr. Vu Le Anh [email protected]
Outline
1. Motivation
2. Regular query processing over XML datasets
3. Indexes over XML datasets
4. Structural indexes
5. Structural indexes for distributed XML datasets
6. Summary
NCBI GEO dataset
• GEO is a public functional genomics data repository supporting MIAME-compliant data submissions.
• About 600 gigabytes (Feb. 2009). The data are stored as XML datasets.
[Figure: a gene map written as an XML file, and its XML graph.]
Virtual observatory
• A collection of interoperating data archives and software tools which utilize the internet to form a scientific research environment in which astronomical research programs can be conducted.
• IVOA (International Virtual Observatory Alliance): building an international community
• Uses very large XML datasets for storing and exchanging data
Problem
• Efficient query processing over big (distributed) XML databases
• Two “interesting” ideas:
1. Storing the XML database in a relational database. Rewriting the XML queries into SQL and Datalog. Rewriting and combining the results.
2. Indexing the XML database. Using the indexes for query processing.
Data Graph – Data Model for XML
• Data graph: a directed, rooted, labelled graph $G = (V, E, r, \Sigma, label)$.
$V$: set of nodes. $\Sigma$: set of label values.
$E \subseteq V \times V$: set of edges, with $E = E_b \cup E_f$.
$E_b$: set of basic edges.
$E_f$: set of reference edges.
$r \in V$: the root.
$label : V \to \Sigma$: labeling function.
Publication XML document
<CSDepartment>
  <PhDStudents>
    <Student id="s1">
      <Name>John</Name>
      <Papers>
        <Paper id="pp1">
          <Title>ABC</Title>
          <Author>Dr.Ben</Author>
          <Author idref="p1"></Author>
        </Paper>
      </Papers>
    </Student>
    <Student id="s2">
      <Name>Tom</Name>
    </Student>
  </PhDStudents>
  <Professors>
    <Professor id="p1">
      <Name>Dr. Kiss</Name>
      <Papers>
        <Paper idref="pp1"></Paper>
        <Paper>
          <Title>DEF</Title>
        </Paper>
      </Papers>
    </Professor>
    <Professor id="p2">
      <Name>Dr. Baker</Name>
      <Papers>
        <Paper>
          <Title>XYZ</Title>
        </Paper>
      </Papers>
    </Professor>
  </Professors>
</CSDepartment>
XML data graph
[Figure: the data graph of the publication XML document.]
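A minimal Python sketch of how a data graph could be built from the publication document. It rests on assumptions the slides do not state: every element becomes a node numbered in document order, element nesting gives the basic edges, and an `idref` attribute gives a reference edge from the referring element to the element with the matching `id`. The name `build_data_graph` is hypothetical.

```python
import xml.etree.ElementTree as ET

PUBLICATION_XML = """
<CSDepartment>
  <PhDStudents>
    <Student id="s1"><Name>John</Name>
      <Papers><Paper id="pp1"><Title>ABC</Title>
        <Author>Dr.Ben</Author><Author idref="p1"></Author></Paper></Papers>
    </Student>
    <Student id="s2"><Name>Tom</Name></Student>
  </PhDStudents>
  <Professors>
    <Professor id="p1"><Name>Dr. Kiss</Name>
      <Papers><Paper idref="pp1"></Paper><Paper><Title>DEF</Title></Paper></Papers>
    </Professor>
    <Professor id="p2"><Name>Dr. Baker</Name>
      <Papers><Paper><Title>XYZ</Title></Paper></Papers>
    </Professor>
  </Professors>
</CSDepartment>
"""


def build_data_graph(xml_text):
    """Return (labels, basic_edges, ref_edges, root) for an XML string."""
    labels = {}           # node number -> element tag (the label function)
    basic_edges = set()   # parent -> child edges (E_b)
    by_id = {}            # value of an id attribute -> node number
    pending_refs = []     # (node number, idref value), resolved after the walk

    def visit(elem, parent):
        node = len(labels)            # assumed numbering: document order
        labels[node] = elem.tag
        if parent is not None:
            basic_edges.add((parent, node))
        if "id" in elem.attrib:
            by_id[elem.attrib["id"]] = node
        if "idref" in elem.attrib:
            pending_refs.append((node, elem.attrib["idref"]))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), None)
    # Reference edges (E_f): assumed to point from the referrer to the target.
    ref_edges = {(n, by_id[ref]) for n, ref in pending_refs if ref in by_id}
    return labels, basic_edges, ref_edges, 0


labels, basic_edges, ref_edges, root = build_data_graph(PUBLICATION_XML)
print(len(labels), "nodes,", len(basic_edges), "basic edges,", sorted(ref_edges))
```

With this numbering the two reference edges connect the empty `Author` element to professor `p1` and the empty `Paper` element to paper `pp1`.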
Regular queries
• Query languages for XML:
– XQuery, XPath, UnQL, Lorel, XQL, XML-QL, etc.
• Built around regular expressions.
• 3 basic operations:
– Concatenation: . or /
– Union: |
– Iteration: *
• For short: _ stands for any label value; // stands for (_)*, any sequence of label values
• Example: //(Student | Professor)//Paper/Title
Regular queries
• A pair of nodes (u, v) matches the regular query R if there is a route from u to v whose label sequence matches R.
• The result of R, with I the input set and O the output set:
$R_{IO}(G) = \{(u, v) \in I \times O \mid (u, v) \text{ matches } R\}$
• General case: I = {root} and O = V.
• Every regular expression R can be represented by a nondeterministic finite automaton (NFA) that accepts the language L(R). The query graph is the graph representing this automaton.
Query processing based on the automata
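• The query graph of //B/D has the states q0, q1, q2 and transitions labelled *, B, D.
• Input: I={0}; Output: O={0,1,…,15}
[Figure: the query automaton run over a data tree with nodes 0 to 15 (labels A to F); each data node is annotated with the automaton states reached there.]
The result = {(0,3),(0,11),(0,13)}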
Transform to edge-labeled graph
[Figure: a node-labeled graph and the corresponding edge-labeled graph.]
The query graph is an edge-labeled graph, so the data graph is transformed into an edge-labeled graph as well.
State-Data (SD) graph
• SD graph = query graph joined with the data graph

• The SD graph may not be connected.

• SD nodes: (data node, state node) pairs

• SD labeled edges: built by matching the labels of the data edges with the labels of the query (state) edges.
Joining R:= a/(b|c)*/a and data graph
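[Figure: the query graph with states s0, s1, s2 and edges labelled a, b, c, a; the data graph with nodes 1 to 5 and edges labelled a, b, c; and the resulting SD-graph over (data node, state) pairs such as (1,s0), (2,s1), (3,s1), (4,s2), (5,s2).]
Result: (1,4), (1,5)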
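The join of a query graph and a data graph can be sketched in Python as a breadth-first search over (data node, state) pairs. This is only an illustration: the query graph encodes R := a/(b|c)*/a, but the data graph below is a hypothetical five-node example, since the edges of the slide's figure are not recoverable from the transcript. The function name `sd_reachable` is also an assumption.

```python
from collections import deque


def sd_reachable(data_edges, query_edges, start_node, start_state, final_states):
    """Explore the SD graph and return the data nodes reached in a final state.

    data_edges:  iterable of (u, label, v) edges of the data graph
    query_edges: iterable of (s, label, t) edges of the query graph (NFA)
    """
    data_out, query_out = {}, {}
    for u, label, v in data_edges:
        data_out.setdefault(u, []).append((label, v))
    for s, label, t in query_edges:
        query_out.setdefault(s, []).append((label, t))

    seen = {(start_node, start_state)}      # SD nodes discovered so far
    queue = deque(seen)
    while queue:
        u, s = queue.popleft()
        for data_label, v in data_out.get(u, []):
            for query_label, t in query_out.get(s, []):
                # An SD edge exists when a data edge and a query edge share a label.
                if data_label == query_label and (v, t) not in seen:
                    seen.add((v, t))
                    queue.append((v, t))
    return {v for v, s in seen if s in final_states}


# Query graph for a/(b|c)*/a.
QUERY = [("s0", "a", "s1"), ("s1", "b", "s1"), ("s1", "c", "s1"), ("s1", "a", "s2")]
# Hypothetical edge-labeled data graph (not the one drawn on the slide).
DATA = [(1, "a", 2), (2, "b", 3), (3, "a", 4), (2, "a", 5), (5, "c", 1)]

result = sd_reachable(DATA, QUERY, start_node=1, start_state="s0", final_states={"s2"})
print(sorted((1, v) for v in result))
```

Only the SD nodes reachable from the start pair are materialised, which is why the SD graph as a whole need not be connected.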
SD-graph representation on relational database [KissVu05]
• Main results:
– The data graph and the query graph can be represented by tables.
– SD graph (table) = join of the data table and the query table.
– The result is computed from the SD table.
– Regular query processing is expressed in DATALOG and SQL.
– Indexes are built to support the SQL computation.
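A small sketch of the relational idea, using Python's built-in `sqlite3` module and a recursive SQL query. The table names `data_edge` and `query_edge`, the column names, and the sample data are all assumptions made for illustration; they are not the schema of [KissVu05]. The query graph encodes a/(b|c)*/a and the data graph is a hypothetical five-node example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_edge  (src INTEGER, label TEXT, dst INTEGER);
CREATE TABLE query_edge (src TEXT,    label TEXT, dst TEXT);
""")
conn.executemany(
    "INSERT INTO data_edge VALUES (?, ?, ?)",
    [(1, "a", 2), (2, "b", 3), (3, "a", 4), (2, "a", 5), (5, "c", 1)],
)
conn.executemany(
    "INSERT INTO query_edge VALUES (?, ?, ?)",
    [("s0", "a", "s1"), ("s1", "b", "s1"), ("s1", "c", "s1"), ("s1", "a", "s2")],
)

# The recursive table sd holds the reachable (data node, state) pairs.
# Each step joins a data edge with a query edge on equal labels; UNION
# removes duplicates, so the recursion terminates even on cyclic graphs.
rows = conn.execute("""
WITH RECURSIVE sd(node, state) AS (
    SELECT 1, 's0'
    UNION
    SELECT d.dst, q.dst
    FROM sd
    JOIN data_edge  d ON d.src = sd.node
    JOIN query_edge q ON q.src = sd.state AND q.label = d.label
)
SELECT DISTINCT node FROM sd WHERE state = 's2' ORDER BY node
""").fetchall()

print(rows)
```

An index on `data_edge(src, label)` would be the natural candidate for speeding up the join, though whether it matches the indexes meant on the slide is not known.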
1. Step: Transform data graph to edge labeled graph
2. step: Query graph representation
3. step: Using DATALOG, SQL for the computation
4. step: Computation in Relational Databases
results: {4,5,6}
Classes of XML indexes
1. Indexing the basic values
– The basic values are indexed (e.g. data(//emp/salary))
– Using a B+-tree
2. Indexing the text values
– Keywords should be indexed
3. Indexes for XML trees
– Quickly checking and computing the label sequence of the route between a pair of nodes.
– Applicable to near-tree XML datasets.
4. Structural indexes
– Simulating the data graph by a smaller one to reduce the cost of computation
XML-tree pre/post computing [Dietz82]
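• A preorder/postorder walk of the tree computes (pre(x), post(x)) for every node x
[Figure: a tree whose nodes carry (pre, post) labels: the root (1,7); its children (2,4) and (6,6); the leaves (3,1), (4,2), (5,3) under (2,4); the leaf (7,5) under (6,6).]
x is a descendant of y <=> pre(x) > pre(y) and post(x) < post(y)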
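The numbering can be sketched in a few lines of Python. The test relies on the standard property that a descendant is entered after, and left before, its ancestor, so it has a larger preorder number and a smaller postorder number. The tree below mirrors the shape implied by the (pre, post) labels in the figure; the node names and the function names `pre_post` and `is_descendant` are made up for the example.

```python
def pre_post(children, root):
    """Number every node with its preorder and postorder rank (starting at 1)."""
    pre, post = {}, {}
    pre_count = post_count = 0

    def walk(node):
        nonlocal pre_count, post_count
        pre_count += 1
        pre[node] = pre_count            # numbered when the node is entered
        for child in children.get(node, []):
            walk(child)
        post_count += 1
        post[node] = post_count          # numbered when the node is left

    walk(root)
    return pre, post


def is_descendant(x, y, pre, post):
    """True if x is a proper descendant of y."""
    return pre[x] > pre[y] and post[x] < post[y]


# Hypothetical node names; the shape follows the figure's labels.
TREE = {"root": ["u", "w"], "u": ["u1", "u2", "u3"], "w": ["w1"]}
pre, post = pre_post(TREE, "root")

print({n: (pre[n], post[n]) for n in pre})
print(is_descendant("u2", "u", pre, post), is_descendant("w1", "u", pre, post))
```

The ancestor test needs only two integer comparisons per pair of nodes, with no traversal of the tree at query time.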
Tree- Structure Improvement [Li&Moon VLDB 2001]
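• Every node x is labelled with (order(x), size(x))
[Figure: a tree whose nodes carry (order, size) labels: the root (1,100); its children (10,30) and (41,10); the nodes (11,5), (17,5), (25,5) under (10,30); the node (45,5) under (41,10).]
x is a descendant of y <=> order(y) < order(x) and order(x) <= order(y) + size(y)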
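A short Python check of the (order, size) test on the labels shown in the figure. It assumes the usual reading of this scheme: the descendants of y occupy the interval from order(y) + 1 to order(y) + size(y), and the unused numbers inside the interval leave room for later insertions. The function name is hypothetical.

```python
def is_descendant(x, y):
    """x and y are (order, size) pairs; True if x lies inside y's interval."""
    order_x, _ = x
    order_y, size_y = y
    return order_y < order_x <= order_y + size_y


root, left, right = (1, 100), (10, 30), (41, 10)
left_leaves = [(11, 5), (17, 5), (25, 5)]
right_child = (45, 5)

print(all(is_descendant(leaf, left) for leaf in left_leaves))
print(is_descendant(right_child, left))
```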
Regular query processing over XML –tree and near tree
• Very efficient when based on tree-structured indexes

• [KissVu06]: applied to near-tree XML datasets

• Link graph: connects the link nodes.

• Tree-structured indexes are used for the basic (tree) structure
Family of Structural indexes
1-index [Milo & Suciu, LNCS 1997]
Idea: group all “equivalent” data nodes into one index node. The index nodes are computed with the bisimulation relation ≈ instead of the equivalence ≡.
• The index graph is smaller than the data graph.
• Works for every regular query.
• Computing the bisimulation takes polynomial time (PTIME).
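Bisimulation
• A bisimulation ≈: if x1 ≈ x2, then
– x1 and x2 have the same label;
– if (y1,x1) is an edge, then there exists an edge (y2,x2) such that y1 ≈ y2.
[Figure: the node pairs y1 ≈ y2 and x1 ≈ x2, with the edges (y1,x1) and (y2,x2) and the labels a and b.]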
Example 1-index
[Figure: a data graph with nodes 1 to 18 (labels paper, section, title, algorithm, proof, uses, exp, about) and its 1-index, whose index nodes include {1}, {2,4,8,13}, {3,5,9,14}, {6,10}, {7}, {11}, {12}, {15,16}, {17,18}.]
Example query: /paper/section/algorithm
Using 1-index?
• Good: works for all regular queries.
• Bad: not small enough!
• Idea: design the index graph only for the most frequently used queries. The index graph then becomes very small.
• A new equivalence relation between nodes has to be defined.
• If a query is not supported, re-check on the data graph.
Structural indexes and a given set of queries
• Important query classes and their indexes:
– //a0/a1/…/ai (i<=k), paths not longer than k
  • A(k)-index
– Dynamic indexes
  • APEX, D(k)-index
– //S0/S1/…/Sk, SAPE queries
  • DL-1, DL-A*(k)-index
– Forward-backward queries
  • F&B-index
A(k)-Index [Kaushik et al. 02]

• Supports the queries //a0/a1/…/ai (i<=k)
• Based on k-bisimulation.

• The relation ≈k (k-bisimulation):
– u ≈0 v if u and v have the same label,
– u ≈k v if u ≈k-1 v and
  • if (u’,u) is an edge, there exists an edge (v’,v) with u’ ≈k-1 v’
  • if (v’,v) is an edge, there exists an edge (u’,u) with u’ ≈k-1 v’
A(k)-index
[Figure: a data graph and its A(0)-, A(1)- and A(2)-indexes.]
Data graph: 1 imdb, 2 movie, 3 director, 4 name, 5 tv, 6 director, 7 name, 8 director, 9 name
A(0)-index: {1} imdb, {2} movie, {5} tv, {3,6,8} director, {4,7,9} name
A(1)-index: {1} imdb, {2} movie, {3} director, {5} tv, {6,8} director, {4,7,9} name
A(2)-index (1-index): {1} imdb, {2} movie, {3} director, {4} name, {5} tv, {6,8} director, {7,9} name
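The A(k) partitions of the imdb example can be reproduced with a naive Python sketch of k-bisimulation. It refines the label partition k times, splitting nodes whose parents fall into different classes of the previous round. The edge list is inferred from the index partitions in the figure (the transcript does not list the edges), and the function name `a_k_partition` is hypothetical. This is a direct reading of the definition, not the partition-refinement algorithm a real implementation would use.

```python
def a_k_partition(labels, edges, k):
    """Return the classes of k-bisimilarity as a set of frozensets of nodes."""
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    # Round 0: two nodes are equivalent iff they carry the same label.
    signature = dict(labels)
    for _ in range(k):
        # Round i: same class in round i-1 and same set of parent classes.
        signature = {
            v: (signature[v], frozenset(signature[p] for p in parents[v]))
            for v in labels
        }

    classes = {}
    for v, sig in signature.items():
        classes.setdefault(sig, set()).add(v)
    return {frozenset(c) for c in classes.values()}


LABELS = {1: "imdb", 2: "movie", 3: "director", 4: "name", 5: "tv",
          6: "director", 7: "name", 8: "director", 9: "name"}
# Assumed edges: imdb -> movie -> director -> name, and imdb -> tv with
# two director children, each of which has a name child.
EDGES = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (5, 8), (8, 9)]

for k in range(3):
    print(k, sorted(sorted(c) for c in a_k_partition(LABELS, EDGES, k)))
```

On this graph the partition no longer changes after k = 2, which agrees with the figure's remark that the A(2)-index coincides with the 1-index.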
Split Operation
[Figure: a data graph with the nodes R, A, B and C1 to C6, next to its index graphs. A(2) (= 1-index): {C1}, {C2,C3}, {C4}, {C5,C6}; A(1): {C1}, {C2,C3}, {C4,C5,C6}; A(0): {C1,C2,C3,C4,C5,C6}.]
Refinement (1. step)
[Figure: the data graph (R, A, B, C1 to C6) with its A(2) (= 1-index), A(1) and A(0) index graphs, showing the first refinement step.]
Refinement (2. step)
[Figure: the data graph (R, A, B, C1 to C6) with its A(2) (= 1-index), A(1) and A(0) index graphs, showing the second refinement step.]
DL-1-index [KissVu06]
• //S0/S1/…/Sk (SAPE = Simple Alternation Path Expression).
• Dynamic index (dynamic labelling).
The SAPE query //(d|e)/f
[Figure: a data graph with nodes 0 to 13 and labels a, b, c, d, e, f, g.]
SAPE query: //(d|e)/f, R := S0/S1
S0 = {d, e}; S1 = {f}
The pairs (4,9), (5,10), (6,11) and (7,12) match R.
The result: T_G(R) = {9,10,11,12}
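A SAPE query //S0/S1/…/Sk can be evaluated directly on a node-labeled graph by advancing a frontier one step at a time, as in the Python sketch below. The graph is only an assumed fragment consistent with the matches listed on the slide: which of the nodes 4 to 7 carries d and which carries e, and the extra branch 8 to 13, are guesses, because the figure is not recoverable. The function name `eval_sape` is hypothetical.

```python
def eval_sape(labels, edges, steps):
    """Return the end nodes of all paths v0 -> ... -> vk with label(vi) in steps[i]."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)

    # The leading // lets a match start at any node whose label is in S0.
    frontier = {v for v, label in labels.items() if label in steps[0]}
    for allowed in steps[1:]:
        frontier = {
            child
            for node in frontier
            for child in children.get(node, [])
            if labels[child] in allowed
        }
    return frontier


# Assumed fragment of the data graph (labels of nodes 4 to 8 are guesses).
LABELS = {4: "d", 5: "e", 6: "d", 7: "e", 8: "d",
          9: "f", 10: "f", 11: "f", 12: "f", 13: "g"}
EDGES = [(4, 9), (5, 10), (6, 11), (7, 12), (8, 13)]

result = eval_sape(LABELS, EDGES, [{"d", "e"}, {"f"}])
print(sorted(result))
```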
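Example: DL-1-index supporting the queries //(K|L) and //(B|C)/E
[Figure, four panels (a) to (d).
(a) The data graph with nodes 0 to 12 (labels A; K, L, M, N; B, C, D; E, F). The data graph and the 1-index are the same.
(b) The DL-1-index at the beginning: {0} A; {1,2,3,4} K,L,M,N; {5,6,7,8} B,C,D; {9,10,11,12} E,F.
(c) A refined index: {0} A; {1,2} K,L; {3,4} M,N; {5,6} B,C; {7,8} C,D; {9,10} E; {11,12} E,F.
(d) A refined index: {0} A; {1,2} K,L; {5,6} B,C; {9,10} E; {3,4} M,N; the nodes 7, 8, 11 and 12 kept as separate index nodes.
The refinements make the index support R1 = //(K|L) and R2 = //(B|C)/E.]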
The DL-A*(k)-index [KissVu06]
1. The A(i)-index is a special case of the DL-A*(k)-index.
2. The DL-A*(k)-index supports a given set of SAPE queries of length at most k.
DL-A*(1)-index supporting the queries //(K|L) and //(B|C)/E
[Figure: the data graph (nodes 0 to 12, labels A to F and K to N), the initial index, the index after the //(K|L) refinement, and the index after the //(B|C)/E refinement.]
Experiments
1. DL-1 vs. 1-index
2. DL-A*(k) vs. A(k)-index
• 2 datasets:
- XMark: 100 MB, 1,681,342 nodes.
- TreeBank: 82 MB, 2,437,667 nodes.
Distributed XML-tree
• XML tree = fragments (subtrees).
• Each server stores some fragments.
• There are link edges between fragments.
• Question: how to find an efficient protocol for regular query processing? Waiting time vs. computing time.
• Can structural indexes be applied?
Processing //a/b//a on an XML tree using 2 servers
Flow model (SPIDER algorithm)
• Processing begins from the root.
• When processing crosses a link edge from (F, q) to (F’, q’):
1. Processing on F stops.
2. Processing continues on F’ with state q’.
3. When processing over F’ finishes, the result is sent to F.
4. F continues.
Drawback: waiting time!
2-phase parallel model
• Servers: each server computes every possible state on its own site.
• The servers send the link edges to the coordinator.
• The coordinator examines the link edges and requests the results from the servers.
• The servers send the results to the coordinator.
• Drawback: computing time!
1-phase parallel model [KissVu07]
• The coordinator builds the structural tree-index for the whole system in order to determine the connected (F,q) states.
• The index is processed first to compute the connected states.
Good: efficient processing.
Bad: the index may be big.
Structural Tree-index
[Figure: an XML tree with nodes 1 to 15 (labels A to F) split into the fragments F0 to F5, and its structural tree-index (Fa-index) over the fragments, annotated with the automaton states q0 and q1.]
(F2,q1), (F2,q2): not connected
Connected states: (F0,q0), (F1,q0), …
Experiments
• 19 Linux local-servers.
• Waiting time:
1IP : 2P : SP = 1 : 1.94 : 37.52
• Computing time:
1IP : 2P : SP = 1 : 1.77 : 2.75
Native XML database systems http://www.rpbourret.com/xml/XMLDatabaseProds.htm#native
Product | Developer | License | Database type
Qizx/db | XMLMind | Commercial | Proprietary
Sedna XML DBMS | ISP RAS MODIS | Free | Proprietary
Sekaiju / Yggdrasill | Media Fusion | Commercial | Proprietary
SQL/XML-IMDB | QuiLogic | Commercial | Proprietary (native XML and relational)
Sonic XML Server | Sonic Software | Commercial | Object-oriented (ObjectStore)
Tamino | Software AG | Commercial | Proprietary. Relational through ODBC
TeraText DBS | TeraText Solutions | Commercial | Proprietary
TEXTML Server | IXIASOFT, Inc. | Commercial | Proprietary
TigerLogic XDMS | Raining Data | Commercial | Pick
Timber | University of Michigan | Open Source (non-commercial only) | Shore, Berkeley DB
TOTAL XML | Cincom | Commercial | Object-relational
Virtuoso | OpenLink Software | Commercial | Proprietary. Relational through ODBC
XDBM | Matthew Parry, Paul Sokolovsky | Open Source | Proprietary
XDB | ZVON.org | Open Source | Relational (PostgreSQL)
XediX TeraSolution | AM2 Systems | Commercial | Proprietary
X-Hive/DB | X-Hive Corporation | Commercial | Proprietary. Relational through JDBC
Xindice | Apache Software Foundation | Open Source | Proprietary
xml.gax.com | GAX Technologies | Commercial | Proprietary
Xpriori XMS | Xpriori | Commercial | Proprietary
XQuantum XML Database Server | Cognetic Systems | Commercial | Proprietary
XStreamDB Native XML Database | Bluestream Db. Soft. Corp. | Commercial | Proprietary
Xyleme Zone Server | Xyleme SA | Commercial | Proprietary
Summary
1. Big XML is used in many applications.
2. Our problem: efficient processing of regular queries over XML databases.
3. Two ideas:
   1. Using relational databases
   2. Building special indexes for XML databases
Summary
4. Tree indexes can be applied to XML trees and to near-tree XML (using the link graph).
5. Structural indexes simulate the data graph by smaller graphs, the index graphs. Their construction is based on equivalence relations.
6. Structural indexes are designed to support only a given set of queries.
7. They can be applied in distributed XML database query processing (cloud, social networks).
References
• [Chung et al., SIGMOD 2002]
– Chin-Wan Chung, Jun-Ki Min, Kyuseok Shim. APEX: an adaptive path index for XML data. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, June 03-06, 2002, Madison, Wisconsin. doi:10.1145/564691.564706
• [Dietz82]
– Dietz, P. F. 1982. Maintaining order in a linked list. In Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing (San Francisco, California, United States, May 05 - 07, 1982). STOC '82. ACM, New York, NY, 122-127. DOI= http://doi.acm.org/10.1145/800070.802184
• [Goldman & Widom VLDB 97]
– Goldman, R. and Widom, J. 1997. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (August 25 - 29, 1997). M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, Eds. Morgan Kaufmann Publishers, San Francisco, CA, 436-445.
• [Kaushik et al. 02]
– Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, Ehud Gudes. Exploiting Local Similarity for Indexing Paths in Graph-Structured Data. 18th International Conference on Data Engineering (ICDE'02), 2002, p. 0129.
• [KissVu05]
– Attila Kiss, Vu Le Anh. A solution for regular queries on XML Data. PUMA Volume 15 (2005), Issue No. 2, pp. 179-202.
• [KissVu06]
– Attila Kiss, Vu Le Anh. Efficient Processing SAPE Queries Using the Dynamic Labelling Structural Indexes. ADBIS 2006: 232-247.
• [KissVu07]
– Attila Kiss, Vu Le Anh. Efficient Processing Regular Queries In Shared-Nothing Parallel Database Systems Using Tree- And Structural Indexes. ADBIS Research Communications 2007.
• [Li&Moon VLDB 2001]
– Li, Q., Moon, B. 2001. Indexing and querying XML data for regular expressions. In: Proceedings of VLDB 2001, pp. 367–370.
• [Milo & Suciu, LNCS 1997]
– Milo, T., Suciu, D. (1999). Index structures for path expressions. 7th International Conference on Database Theory (ICDT), pp. 277-95.
• [Paige & Tarjan 87]
– Paige, R. and Tarjan, R. E. 1987. Three partition refinement algorithms. SIAM J. Comput. 16, 6 (Dec. 1987), 973-989. DOI= http://dx.doi.org/10.1137/0216062
Thank you!