TopX 2.0: A (Very) Fast Object-Store for Top-k XPath Query Processing
Martin Theobald, Stanford University
Ralf Schenkel, Max-Planck Institute
Mohammed AbuJarour, Hasso-Plattner Institute
[Figure: two example article trees from the INEX ’03-’05 IEEE Collection (elements: article, sec, par, title, bib, item, inproc, url; text snippets such as “Native XML data base systems can store schemaless data ...” and “What does XML add for retrieval? It adds formal ways …”), matched against the query below. Callouts: RANKING, VAGUENESS, EARLY PRUNING.]

//article[.//bib[about(.//item, "W3C")]]//sec[about(.//, "XML retrieval")]//par[about(.//, "native XML databases")]
[Figure: TopX system architecture (query processor labeled 1.0 and 2.0)]
• Frontends: Web Interface, Web Service, API
• TopX Query Processor: Non-conjunctive Top-k XPath Query Processing
  • Top-k Queue, Candidate Queue, Scan Threads
  • Sequential Access (SA) and Random Access (RA) to the index
  • Probabilistic Candidate Pruning
  • Probabilistic Index Access Scheduling
  • Dynamic Query Expansion
• Expensive Predicates: Path Conditions, Phrases & Proximity, Other Full-Text Ops
• Index Metadata: Selectivities, Histograms, Correlations
• Ontology/Large Thesaurus: WordNet, OpenCyc, etc.
• Relational DBMS Backend (via JDBC): Unified Text & XML Schema
• Indexer/Crawler
Data Model
• XML trees (no XLink/ID/IDRef)
• Pre-/postorder ranges for the structural index
• Redundant full-content text nodes

Example document:

  <article>
    <title>XML Data Management</title>
    <abs>XML management systems vary widely in their expressive power.</abs>
    <sec>
      <title>Native XML Data Bases.</title>
      <par>Native XML data base systems can store schemaless data.</par>
    </sec>
  </article>

Pre-/postorder labels and full-content text per element:

Element  Pre  Post  Full content
article  1    6     "xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data"
title    2    1     "xml data manage"
abs      3    2     "xml manage system vary wide expressive power"
sec      4    5     "native xml data base native xml data base system store schemaless data"
title    5    3     "native xml data base"
par      6    4     "native xml data base system store schemaless data"

ftf("xml", article_1) = 4
ftf("xml", sec_4) = 2
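The pre-/postorder labels and full-content term frequencies above can be reproduced with a short sketch. This is only an illustration using Python's standard XML parser; the function names `label` and `ftf` are made up for this example, and the tokenization is naive (lowercasing only, no stemming or stopword removal as on the slide).

```python
import xml.etree.ElementTree as ET

DOC = """<article><title>XML Data Management</title>
<abs>XML management systems vary widely in their expressive power.</abs>
<sec><title>Native XML Data Bases.</title>
<par>Native XML data base systems can store schemaless data.</par></sec></article>"""

def label(root):
    """Return (tag, pre, post, full_content) per element, in preorder (1-based ranks)."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        # Full content = text of the element and all of its descendants.
        text = " ".join("".join(node.itertext()).split())
        row = [node.tag, counters["pre"], None, text]
        rows.append(row)
        for child in node:
            visit(child)
        counters["post"] += 1          # postorder rank is assigned on the way back up
        row[2] = counters["post"]

    visit(root)
    return [tuple(r) for r in rows]

def ftf(term, text):
    """Full-content term frequency with naive tokenization (assumption: no stemming)."""
    tokens = text.lower().replace(".", " ").split()
    return sum(1 for tok in tokens if tok == term)

rows = label(ET.fromstring(DOC))
for tag, pre, post, text in rows:
    print(tag, pre, post, ftf("xml", text))
```

Because every element carries the text of its whole subtree, the term "xml" is counted again at each ancestor, which is the redundancy the slide refers to.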
Scoring Model [INEX ‘05/’06/’07]
XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)
Content Index (Tag-Term Pairs):

DocID  Tag      Term  Pre  Post  FTF
1      article  xml   1    6     4
1      sec      xml   4    5     2
1      title    xml   5    3     1
1      par      xml   6    4     1
…      …        …     …    …     …

Element Statistics:

Tag      N     AvLen
article  659K  269.2
sec      1.6M  89.1
title    2.2M  2.8
par      2.8M  34.1
…        …     …

Element Freq.:

Tag      Term  EF
article  xml   863
sec      xml   947
title    xml   62
par      xml   674
…        …     …

Example: bib["transactions"] vs. par["transactions"]
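The slide does not spell out the scoring formula, so the following is only a sketch of how an element-level BM25 score could combine the three statistics above (FTF, element frequency EF, and per-tag N and AvLen). The function name `element_bm25` and the parameter values `k1=1.2`, `b=0.75` are textbook BM25 defaults chosen for illustration, not values taken from TopX.

```python
import math

def element_bm25(ftf, length, avg_length, n_tag, ef, k1=1.2, b=0.75):
    """BM25-style score of one term for one element of a given tag.

    ftf        : full-content term frequency of the term in the element
    length     : number of terms in the element's full content
    avg_length : average full-content length of elements with this tag (AvLen)
    n_tag      : number of elements with this tag (N)
    ef         : number of elements with this tag that contain the term (EF)
    k1, b      : assumed textbook defaults, not TopX's actual settings
    """
    k = k1 * ((1 - b) + b * length / avg_length)          # length normalization
    tf_part = (k1 + 1) * ftf / (k + ftf)                   # saturating tf component
    idf_part = math.log((n_tag - ef + 0.5) / (ef + 0.5))   # tag-specific idf
    return tf_part * idf_part

# sec["xml"] with the statistics listed above; length 12 is the toy <sec> element.
score_ftf1 = element_bm25(ftf=1, length=12, avg_length=89.1, n_tag=1_600_000, ef=947)
score_ftf2 = element_bm25(ftf=2, length=12, avg_length=89.1, n_tag=1_600_000, ef=947)
print(score_ftf1, score_ftf2)
```

Keeping N, AvLen, and EF per tag is what lets the same term score differently under different tags, as in the bib vs. par example.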
TopX 1.0: Relational Schema
• Precompute & materialize the scoring model into a combined inverted index over tag-term pairs
• Supports sorted access (by MaxScore) and random access (by DocID)
• Two B+trees

Index list for sec["xml"]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.5    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
3      4    24    0.7    0.7
3      12   18    0.6    0.7
3      17   25    0.3    0.7
…      …    …     …      …

Sorted access (SA):

  SELECT DocID, Pre, Post, Score
  FROM TagTermIndex
  WHERE Tag='sec' AND Term='xml'
  ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC

Random access (RA):

  SELECT Pre, Post, Score
  FROM TagTermIndex
  WHERE DocID=3 AND Tag='sec' AND Term='xml'
  ORDER BY Pre ASC, Post DESC
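To try the two access paths, the schema can be mocked up in an in-memory SQLite database. This is a sketch only: SQLite stands in for the relational backend, the index names `sa_idx` and `ra_idx` are invented here, and the two indexes merely play the role of the two B+trees mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE TagTermIndex (
    DocID INTEGER, Tag TEXT, Term TEXT,
    Pre INTEGER, Post INTEGER, Score REAL, MaxScore REAL)""")
# One index per access path (hypothetical names).
conn.execute("""CREATE INDEX sa_idx ON TagTermIndex
    (Tag, Term, MaxScore DESC, DocID DESC, Pre, Post DESC)""")
conn.execute("""CREATE INDEX ra_idx ON TagTermIndex
    (DocID, Tag, Term, Pre, Post DESC)""")

rows = [
    (2, "sec", "xml", 2, 15, 0.9, 0.9),
    (2, "sec", "xml", 10, 8, 0.5, 0.9),
    (1, "sec", "xml", 23, 48, 0.8, 0.8),
    (1, "sec", "xml", 45, 87, 0.2, 0.8),
    (3, "sec", "xml", 4, 24, 0.7, 0.7),
    (3, "sec", "xml", 12, 18, 0.6, 0.7),
    (3, "sec", "xml", 17, 25, 0.3, 0.7),
]
conn.executemany("INSERT INTO TagTermIndex VALUES (?,?,?,?,?,?,?)", rows)

# Sorted access: whole index list in descending MaxScore order.
sorted_access = conn.execute("""
    SELECT DocID, Pre, Post, Score FROM TagTermIndex
    WHERE Tag='sec' AND Term='xml'
    ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC""").fetchall()

# Random access: all matching elements of one document.
random_access = conn.execute("""
    SELECT Pre, Post, Score FROM TagTermIndex
    WHERE DocID=? AND Tag='sec' AND Term='xml'
    ORDER BY Pre ASC, Post DESC""", (3,)).fetchall()

print(sorted_access)
print(random_access)
```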
Top-k XPath on a Relational Schema [VLDB ’05]

• Content-only (CO) & "structure enriched" queries:
  //sec[about(.//, "XML") and about(.//title, "native")]//par[about(.//, "retrieval")]

sec["xml"]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.8    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
…      …    …     …      …

title["native"]:

DocID  Pre  Post  Score  MaxScore
17     2    15    0.9    0.9
3      14   10    0.8    0.9
2      4    12    0.5    0.5
31     12   23    0.4    0.4
…      …    …     …      …

par["retrieval"]:

DocID  Pre  Post  Score  MaxScore
1      12   21    1.0    1.0
2      8    14    0.8    0.8
5      3    7     1      0.7
4      6    4     1      0.7
…      …    …     …      …

• Sequentially (mostly) scan each index list in descending order of MaxScore
• Hash-join element blocks by DocID in memory
• Do "some" incremental XPath evaluation using Pre/Post indices
• Aggregate Score along connected path fragments
• Use a variant of Fagin’s threshold algorithm for top-k-style early termination

• Content-and-structure (CAS) queries:
  //article//sec[about(.//, "XML")]
• Expensive predicate probes (RA) to the structure index (3rd B+tree):

  SELECT Pre, Post
  FROM TagIndex
  WHERE DocID=2123 AND Tag='article'
  ORDER BY Pre ASC, Post DESC

article:

DocID  Pre  Post
1      1    245
2      1    123
3      1    176
4      1    89
…      …    …

• Non-conjunctive XPath evaluations: dynamically relax content- & structure-related query conditions (top-k results entirely driven by score aggregations for content & structure conditions)
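The early-termination idea can be illustrated with the textbook form of Fagin's threshold algorithm. This sketch is not TopX's variant: it works on one score per document and list (for example the MaxScore), ignores structure, and treats a missing entry as score 0 to mimic non-conjunctive evaluation. The function name `threshold_topk` and the toy lists are made up for the example.

```python
import heapq

def threshold_topk(index_lists, k):
    """Classic threshold algorithm over lists of (doc_id, score), each sorted by score desc.

    Returns (top_k, depth), where depth is the number of sorted-access rounds performed.
    """
    lookup = [dict(lst) for lst in index_lists]     # random access by doc_id
    scores = {}                                     # doc_id -> aggregated score
    depth = 0
    longest = max(len(lst) for lst in index_lists)
    while depth < longest:
        frontier = []                               # last score seen in each list
        for lst in index_lists:
            if depth < len(lst):
                doc_id, score = lst[depth]
                frontier.append(score)
                if doc_id not in scores:
                    # Random accesses: a document missing from a list scores 0 there.
                    scores[doc_id] = sum(d.get(doc_id, 0.0) for d in lookup)
            else:
                frontier.append(0.0)                # exhausted list contributes nothing
        depth += 1
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        # No unseen document can score higher than the sum of the frontier scores.
        if len(top) == k and top[-1][1] >= sum(frontier):
            break
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]), depth

# Toy document-level lists (hypothetical values).
sec_xml = [(2, 0.9), (1, 0.8), (3, 0.7)]
title_native = [(17, 0.9), (3, 0.8), (2, 0.5), (31, 0.4)]
par_retrieval = [(1, 1.0), (2, 0.8), (5, 0.4), (4, 0.3)]

top2, rounds = threshold_topk([sec_xml, title_native, par_retrieval], k=2)
print(top2, rounds)
```

On this toy input the scan stops after three rounds, before the tails of the longer lists are read.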
Relational Schema (cont’d)
• 20,810,942 distinct tag-term pairs for the 4.38 GB Wikipedia collection
• 1,107 distinct tags
• No shredding into a DTD-specific relational schema! No DTD at all for INEX Wikipedia!
Relational Schema (cont’d)

Content Index:

DocID  Tag    Term  Pre  Post  Score  MaxScore
2      sec    xml   2    15    0.9    0.9
2      sec    xml   10   8     0.5    0.9
17     title  xml   5    3     0.5    0.5
1      par    xml   6    4     0.7    0.7
…      …      …     …    …     …      …

(4+4+4+4+4+4+4) bytes × 567,262,445 tag-term pairs ≈ 16 GB

Structure Index:

DocID  Tag      Pre  Post
1      article  1    245
2      article  1    123
3      sec      2    15
3      sec      10   8
…      …        …    …

(4+4+4+4) bytes × 52,561,559 tags ≈ 0.85 GB

• 2-dimensional source of redundancy
  • Full-content scoring model (#terms times avg. depth of a text node, 6.7 for INEX Wiki)
  • De-normalized relational schema
• High overhead in the architecture (Java -> JDBC -> DBMS & back)
• Element-block sizes are data-driven; not easy to control layout on disk
• Hashing too slow compared to very efficient in-memory merge-joins
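TopX 2.0: Object-Oriented Storage

[Figure: the index list sec["xml"] stored as element blocks, each labeled with its DocID and MaxScore. DocID 2: (Pre, Post, Score) = (2, 15, 0.9), (10, 8, 0.5); DocID 1: (23, 48, 0.8), (45, 87, 0.2).]

Offset index:

sec["xml"]    0
title["xml"]  122,564
…
par["xml"]    432,534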
Content index size:
• Relational: (4+4+4+4+4+4+4) × 567,262,445 ≈ 16 GB
• Object-oriented: 4 × 456,466,649 + (4+4+4) × 567,262,445 ≈ 8.6 GB (+ (4+4) × 20,810,942 = 166 MB for the offset index)
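The size estimates follow directly from the counts on the slides; the snippet below simply redoes the arithmetic, reading GB and MB as decimal units (10^9 and 10^6 bytes). The variable names are chosen for this example.

```python
PAIRS = 567_262_445           # tag-term pairs (element entries)
ELEMENT_BLOCKS = 456_466_649  # element blocks (one DocID each)
DISTINCT_PAIRS = 20_810_942   # distinct tag-term pairs (index lists)
TAGS = 52_561_559             # elements in the structure index

# Relational: 7 four-byte attributes per content row, 4 per structure row.
relational_content = 7 * 4 * PAIRS
relational_structure = 4 * 4 * TAGS

# Object-oriented: DocID once per element block, (Pre, Post, Score) per element.
oo_content = 4 * ELEMENT_BLOCKS + 3 * 4 * PAIRS
offset_index = (4 + 4) * DISTINCT_PAIRS

print(f"relational content index:   {relational_content / 1e9:.1f} GB")
print(f"relational structure index: {relational_structure / 1e9:.2f} GB")
print(f"object-oriented content:    {oo_content / 1e9:.1f} GB")
print(f"offset index:               {offset_index / 1e6:.0f} MB")
```

Most of the saving comes from dropping the repeated Tag, Term, and MaxScore columns and storing DocID once per element block rather than once per row.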
DocID Tag Term Pre Post Score MaxScore
2 sec xml 2 15 0.9 0.9
2 sec xml 10 8 0.5 0.9
… … … … … … …
17 title xml 2 15 0.9 0.9
… … … … … … …
11 par xml 6 4 0.7 0.7
… … … … … … …
[Figure: binary file layout. Element block for DocID 17: (2, 15, 0.9), (14, 5, 0.2), (27, 32, 0.4); element block for DocID 3: (1, 6, 0.9). B: element block separator; L: index list separator.]
• Group element blocks with similar MaxScore into document blocks of fixed length (e.g. 256KB)
• Sort element blocks within each document block by DocID
• Supports
  • Sorted access by MaxScore
  • Merge-joins by DocID
• Raw disk access
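A minimal sketch of this grouping step, under simplifying assumptions: a document block holds a fixed number of element blocks (the `capacity` parameter, standing in for the 256KB byte budget), and element blocks are plain tuples. The function name `build_document_blocks` and the toy data are invented for illustration.

```python
def build_document_blocks(element_blocks, capacity):
    """Group element blocks into document blocks.

    element_blocks: list of (doc_id, max_score, elements) tuples
    capacity: element blocks per document block (stand-in for a byte budget)
    """
    # 1. Order all element blocks of the index list by MaxScore, best first.
    by_score = sorted(element_blocks, key=lambda eb: eb[1], reverse=True)
    document_blocks = []
    for start in range(0, len(by_score), capacity):
        chunk = by_score[start:start + capacity]
        # 2. Inside a document block, order by DocID so blocks can be merge-joined.
        document_blocks.append(sorted(chunk, key=lambda eb: eb[0]))
    return document_blocks

element_blocks = [
    (1, 0.8, [(23, 48, 0.8), (45, 87, 0.2)]),
    (2, 0.9, [(2, 15, 0.9), (10, 8, 0.5)]),
    (3, 0.5, [(5, 23, 0.5)]),
    (5, 0.7, [(2, 24, 0.7), (3, 11, 0.3)]),
    (6, 0.6, [(6, 15, 0.6)]),
]
blocks = build_document_blocks(element_blocks, capacity=3)
doc_ids = [[eb[0] for eb in block] for block in blocks]
print(doc_ids)
```

Reading document blocks front to back still visits the best-scoring documents first, while the DocID order inside each block is what makes a merge-join possible.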
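Object-Oriented Storage w/ Block-Merging

[Figure: index lists sec["xml"] (offset 0) and title["xml"] (offset 122,564) laid out as document blocks in descending MaxScore order; inside each document block, element blocks (separated by B) are sorted by DocID. The example shows element blocks for DocIDs 1, 2, 5 in the first document block and DocIDs 3, 6 in the second; L separates index lists.]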
Merging Document Blocks
• Sequential access and efficient merge-joins on top of large document blocks

Query: //sec[about(.//, "XML")]//par[about(.//, "retrieval")]

[Figure: sorted access (SA) over the document blocks of sec["xml"] and par["retrieval"]; element blocks within the document blocks are merge-joined by DocID (annotated scores: 1.0, 0.8, 0.8, 0.7).]
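As an illustration of a merge-join over two DocID-sorted document blocks, the sketch below pairs sec and par elements of the same document and keeps those where the par lies below the sec according to the pre-/postorder labels. It is a simplification: it is conjunctive (both conditions must match), the function names are invented, and the element values are toy data rather than the figure's.

```python
def merge_join(left, right):
    """Merge-join two document blocks: lists of (doc_id, elements) sorted by doc_id."""
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            matches.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
    return matches

def is_descendant(ancestor, descendant):
    """Pre-/postorder test: larger preorder rank and smaller postorder rank."""
    return ancestor[0] < descendant[0] and descendant[1] < ancestor[1]

def sec_par_pairs(sec_block, par_block):
    """Return (doc_id, sec_pre, par_pre, score) for every sec//par match."""
    results = []
    for doc_id, secs, pars in merge_join(sec_block, par_block):
        for sec in secs:
            for par in pars:
                if is_descendant(sec, par):
                    results.append((doc_id, sec[0], par[0], round(sec[2] + par[2], 2)))
    return results

# Elements are (Pre, Post, Score); toy values.
sec_block = [(1, [(23, 48, 0.8)]), (2, [(2, 15, 0.9), (10, 8, 0.5)]), (5, [(2, 24, 0.7), (3, 11, 0.3)])]
par_block = [(2, [(12, 9, 0.9)]), (3, [(3, 17, 0.9)]), (5, [(4, 20, 1.0), (30, 43, 0.5)])]

pairs = sec_par_pairs(sec_block, par_block)
print(pairs)
```

Each block is traversed once with two cursors and no hash table is built, which is the property the slides contrast with the hash-joins of the relational version.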
Compressed Number Encoding
• Multi-attribute (4), double-nested block-index structure
  • Delta encoding only works for DocID (and to some extent for Pre)
  • No specific assumptions on distributions of Pre/Post or Score
• No Unary or Huffman coding (prefix-free but additional coding table)
  • Sophisticated compression schemes may be expensive to decode
  • No Zip, etc.
• But known number ranges
  • DocID [1, 659,388] -> 3 bytes (254^3 = 16,387,064, lossless)
  • Pre/Post [1, 43,114] -> 2 bytes (254^2 = 64,516, lossless)
  • Score [0,1] -> rounded to 1 byte (254 buckets, lossy)
• Variable-length byte encoding w/ leading length-indicator byte

Len  Pre  Post  Score
4    3    26    7      (5 bytes)
9    225  332   192    (10 bytes)
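The fixed-width part of this scheme can be sketched as a base-254 encoding. The assumptions here are mine: digits take the values 0 to 253 so that two byte values stay free (plausibly for the B and L separators), digits are stored most significant first, and scores are decoded to the midpoint of their bucket. The length-indicator variant is not covered, and all function names are hypothetical.

```python
BASE = 254  # 254 usable values per byte; the remaining two are assumed to be reserved

def encode_fixed(value, width):
    """Encode a non-negative integer as `width` base-254 digits, most significant first."""
    if not 0 <= value < BASE ** width:
        raise ValueError(f"{value} does not fit into {width} base-{BASE} digits")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    return bytes(reversed(digits))

def decode_fixed(data):
    """Inverse of encode_fixed."""
    value = 0
    for byte in data:
        value = value * BASE + byte
    return value

def encode_score(score):
    """Quantize a score in [0, 1] into one of 254 buckets (lossy)."""
    return bytes([min(BASE - 1, int(score * BASE))])

def decode_score(data):
    """Return the midpoint of the stored bucket."""
    return (data[0] + 0.5) / BASE

doc_id = encode_fixed(659_388, 3)    # largest DocID on the slide, 3 bytes
pre = encode_fixed(43_114, 2)        # largest Pre/Post on the slide, 2 bytes
score = encode_score(0.9)
print(list(doc_id), list(pre), list(score), decode_score(score))
```

With most-significant-first digits, two encoded values of the same width compare byte-wise in the same order as the integers they represent, so DocID and Pre/Post could be compared without decoding.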
Some more tricks…
• Dump leading histogram blocks into index list headers
  • Histograms only for index lists that exceed one document block (<5% of all lists)
• Own native compare methods for DocID, Pre/Post
• Decode only Score for arithmetic ops (mostly perform pointer operations at query-processing time)
• Incrementally read & process precomputed memory image for fast top-k queries on top of large disk blocks

[Figure: index list sec["xml"] with a leading histogram block (score vs. freq; labeled "36 bytes" and "10"), followed by document blocks DB1, DB2, … DBl (256 KB each), each containing element blocks EB 1, EB 2, … EB k with example scores between 1.0 and 0.2.]
Block Access Scheduling [VLDB ’06]
• SA Scheduling
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of Score Reduction
• RA Scheduling
  • 2-phase probing: schedule RAs "late & last", i.e., clean up the queue if [condition not preserved in the transcript]
  • Extended probabilistic cost model for integrated SA & RA scheduling

[Figure: Inverted Block-Index (256KB doc-blocks) with three SA scans and one RA; example look-aheads Δ1,3 = 0.8 and Δ3,3 = 0.2.]
Object Storage Summary (4.38 GB Wikipedia XML sources)

Content Index (incl. histograms):
• 567,262,445 tag-term pairs
• 20,810,942 distinct tag-term pairs
• 20,815,884 document blocks (256KB)
• 456,466,649 element blocks
• 4,703,385,686 total bytes (8.3 bytes/tag-term pair)

Structure Index:
• 52,561,559 tags (elements)
• 1,107 distinct tags
• 2,323 document blocks (256KB)
• 8,999,193 element blocks
• 246,601,752 total bytes (4.7 bytes/tag)

[Figure: bar chart comparing Relational vs. Object-Oriented storage (axis 0 to 16) for the Content Index (incl. histograms) and the Structure Index.]
Preliminary Runtime Experiments
CO (top-10, non-conjunctive)
Preliminary Runtime Experiments
CAS (top-10, non-conjunctive)
Some INEX Results
CAS (top-1,500, non-conjunctive)
Conclusions & Outlook
• Scalable and efficient XML-IR with vague search
  • Mature system, reference engine for INEX topic development & interactive tracks [VLDB Special Issue on DB&IR Integration ’08]
• Brand-new TopX 2.0 prototype
  • Very efficient reimplementation in C++
  • Object-oriented XML storage, moderate compression rates
  • 10 to 20 times better sequential throughput than relational
• More features
  • Generalized proximity search, graph top-k
  • Updates (gaps within document blocks)
  • XQuery Full-Text (top-k-style bounds over IF, For-Let) …
http://www.inex.otago.ac.nz/