topx 2.0 — a (very) fast object-store for top-k xpath query processing

TopX 2.0 A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Stanford University Ralf Schenkel Max-Planck Institute Mohammed AbuJarour Hasso-Plattner Institute


Page 1: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

TopX 2.0—

A (Very) Fast Object-Store for Top-k XPath Query Processing

Martin Theobald, Stanford University

Ralf Schenkel, Max-Planck Institute

Mohammed AbuJarour, Hasso-Plattner Institute

Page 2: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Example query over documents from the INEX ’03-’05 IEEE Collection:

//article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]

[Figure: two example article trees matched against the query. Node labels include article, sec, par, bib, item, inproc, title, and url.

First article, title “Current Approaches to XML Data Management”, with the text snippets:
“Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …”
“Native XML Data Bases.”
“Native XML data base systems can store schemaless data ...”
“XML queries with an expressive power similar to that of Datalog …”
“XML-QL: A Query Language for XML.”
“Proc. Query Languages Workshop, W3C, 1998.”

Second article, with the titles “The XML Files”, “The Ontology Game”, and “The Dirty Little Secret”, and the text snippets:
“What does XML add for retrieval? It adds formal ways …”
“Sophisticated technologies developed by smart people.”
“There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”
“w3c.org/xml”
url “XML”]

Callouts: RANKING, VAGUENESS, EARLY PRUNING

Page 3: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

[Figure: architecture of the TopX 1.0 query processor, with a “2.0” marker.]

Components shown in the diagram:
• Frontends: Web Interface, Web Service, API
• TopX 1.0 Query Processor: non-conjunctive top-k XPath query processing, with a top-k queue, a candidate queue, and scan threads
• Probabilistic candidate pruning
• Probabilistic index access scheduling
• Dynamic query expansion, with an ontology/large thesaurus (WordNet, OpenCyc, etc.)
• Sequential access (SA) and random access (RA)
• Expensive predicates: path conditions, phrases & proximity, other full-text op’s
• Index metadata: selectivities, histograms, correlations
• Indexer/Crawler
• Relational DBMS backend with a unified text & XML schema, accessed via JDBC

Page 4: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Data Model

• XML trees (no XLink/ID/IDRef)
• Pre-/postorder ranges for the structural index
• Redundant full-content text nodes

Example document:

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Pre-/postorder labels and full-content text per element:

Element  Pre  Post  Full content
article  1    6     “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
title    2    1     “xml data manage”
abs      3    2     “xml manage system vary wide expressive power”
sec      4    5     “native xml data base native xml data base system store schemaless data”
title    5    3     “native xml data base”
par      6    4     “native xml data base system store schemaless data”

Full-content term frequencies:

ftf(“xml”, article1) = 4
ftf(“xml”, sec4) = 2
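The labelling on this slide can be reproduced with a short sketch. It walks the example document once, assigns preorder and postorder numbers, and counts a term in each element's full content (its own text plus the text of all descendants). The function names and the plain lowercase tokenizer are assumptions made for illustration; the slide's text is additionally stemmed, which is skipped here because it does not affect the count for "xml".

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""


def label_elements(root):
    """Return {(tag, pre): (post, full_content_terms)} for every element."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1          # preorder rank: assigned on entry
        pre = counters["pre"]
        for child in node:
            visit(child)
        counters["post"] += 1         # postorder rank: assigned on exit
        # Full content: text of the element and of all its descendants.
        text = " ".join(node.itertext()).lower()
        terms = re.findall(r"[a-z0-9]+", text)  # hypothetical tokenizer, no stemming
        labels[(node.tag, pre)] = (counters["post"], terms)

    visit(root)
    return labels


def ftf(labels, term, tag, pre):
    """Full-content term frequency of `term` in the element (tag, pre)."""
    return labels[(tag, pre)][1].count(term)


labels = label_elements(ET.fromstring(DOC))
for (tag, pre), (post, terms) in sorted(labels.items(), key=lambda kv: kv[0][1]):
    print(tag, pre, post, terms.count("xml"))
```

Because every ancestor repeats the text of its descendants, the same occurrence of "xml" is counted once per nesting level, which is the redundancy the slide refers to.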

Page 5: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Scoring Model [INEX ‘05/’06/’07]

XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)

Content Index (Tag-Term Pairs)

DocID  Tag      Term  Pre  Post  FTF
1      article  xml   1    6     4
1      sec      xml   4    5     2
1      title    xml   5    3     1
1      par      xml   6    4     1
…      …        …     …    …     …

Element Freq.

Tag      Term  EF
article  xml   863
sec      xml   947
title    xml   62
par      xml   674
…        …     …

Element Statistics

Tag      N     AvLen
article  659K  269.2
sec      1.6M  89.1
title    2.2M  2.8
par      2.8M  34.1
…        …     …

Example: bib[“transactions”] vs. par[“transactions”]
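The slide names the ingredients of the score (FTF per element, element frequency EF per tag-term pair, and N and AvLen per tag) but not the formula itself. The following is a minimal sketch of how an Okapi-BM25-style element score could combine them. The exact functional form, the function name, and the parameters k1 and b are assumptions borrowed from standard BM25, not taken from the slide.

```python
import math


def bm25_element_score(ftf, ef, n, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of one term for one element of a given tag.

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements with this tag that contain the term
    n          -- number of elements with this tag
    length     -- full-content length of the element (in terms)
    avg_length -- average full-content length for this tag
    k1, b      -- standard BM25 tuning parameters (assumed defaults)
    """
    # Length normalization: long elements need more occurrences to score well.
    k = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (k + ftf)
    # Inverse element frequency, computed per tag instead of per collection.
    ief_part = math.log((n - ef + 0.5) / (ef + 0.5))
    return tf_part * ief_part


# sec["xml"] with the statistics from the slide; the element length of 12
# terms is the full content of the example sec element.
score_sec = bm25_element_score(ftf=2, ef=947, n=1_600_000, length=12, avg_length=89.1)
print(round(score_sec, 3))
```

Keeping N, AvLen, and EF per tag is what lets the same term score differently in different contexts, as in the bib["transactions"] vs. par["transactions"] example.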

Page 6: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

TopX 1.0: Relational Schema

• Precompute & materialize the scoring model into a combined inverted index over tag-term pairs
• Supports sorted access (by MaxScore) and random access (by DocID)
• Two B+trees

Index list for sec[“xml”]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.5    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
3      4    24    0.7    0.7
3      12   18    0.6    0.7
3      17   25    0.3    0.7
…      …    …     …      …

Sorted access (SA):

Select DocID, Pre, Post, Score From TagTermIndex Where tag='sec' and term='xml' Order by MaxScore desc, DocID desc, Pre asc, Post desc

Random access (RA):

Select Pre, Post, Score From TagTermIndex Where DocID=3 and tag='sec' and term='xml' Order by Pre asc, Post desc
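To make the two access paths concrete, here is a small sketch that loads the sec[“xml”] rows from this slide into an in-memory SQLite table and runs both queries. SQLite, the index names, and the index column orders are stand-ins chosen for illustration; the slide only says that TopX 1.0 uses a relational DBMS with two B+trees.

```python
import sqlite3

# (DocID, Pre, Post, Score, MaxScore) rows of the sec["xml"] index list.
ROWS = [
    (2, 2, 15, 0.9, 0.9), (2, 10, 8, 0.5, 0.9),
    (1, 23, 48, 0.8, 0.8), (1, 45, 87, 0.2, 0.8),
    (3, 4, 24, 0.7, 0.7), (3, 12, 18, 0.6, 0.7), (3, 17, 25, 0.3, 0.7),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TagTermIndex (DocID INTEGER, Tag TEXT, Term TEXT, "
    "Pre INTEGER, Post INTEGER, Score REAL, MaxScore REAL)"
)
conn.executemany(
    "INSERT INTO TagTermIndex VALUES (?, 'sec', 'xml', ?, ?, ?, ?)", ROWS
)
# Two B-tree indexes, one per access path (names and column order are assumptions).
conn.execute(
    "CREATE INDEX sa_idx ON TagTermIndex "
    "(Tag, Term, MaxScore DESC, DocID DESC, Pre, Post DESC)"
)
conn.execute(
    "CREATE INDEX ra_idx ON TagTermIndex (DocID, Tag, Term, Pre, Post DESC)"
)

# Sorted access: scan one index list in descending MaxScore order.
sorted_access = conn.execute(
    "SELECT DocID, Pre, Post, Score FROM TagTermIndex "
    "WHERE Tag = ? AND Term = ? "
    "ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC",
    ("sec", "xml"),
).fetchall()

# Random access: fetch all matching elements of a single document.
random_access = conn.execute(
    "SELECT Pre, Post, Score FROM TagTermIndex "
    "WHERE DocID = ? AND Tag = ? AND Term = ? "
    "ORDER BY Pre ASC, Post DESC",
    (3, "sec", "xml"),
).fetchall()

print(sorted_access[:2])
print(random_access)
```

Ordering by MaxScore first keeps all elements of a document adjacent in the scan, so each document arrives as one element block.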

Page 7: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Top-k XPath on a Relational Schema [VLDB ’05]

• Content-only (CO) & “structure enriched” queries: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]

• Sequentially (mostly) scan each index list in descending order of MaxScore
• Hash-join element blocks by DocID in-memory
• Do “some” incremental XPath evaluation using Pre/Post indices
• Aggregate Score along connected path fragments
• Use a variant of Fagin’s threshold algorithm for top-k-style early termination

Index list for sec[“xml”]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.8    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
…      …    …     …      …

Index list for title[“native”]:

DocID  Pre  Post  Score  MaxScore
17     2    15    0.9    0.9
3      14   10    0.8    0.9
2      4    12    0.5    0.5
31     12   23    0.4    0.4
…      …    …     …

Index list for par[“retrieval”]:

DocID  Pre  Post  Score  MaxScore
1      12   21    1.0    1.0
2      8    14    0.8    0.8
5      3    7     1      0.7
4      6    4     1      0.7
…      …    …     …
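The early-termination idea behind the threshold algorithm can be sketched at document level. The code below is a simplified no-random-access variant written for illustration, not TopX's actual algorithm: it ignores XPath evaluation, element blocks, and probabilistic pruning, and simply sums per-list scores. The function name, the two input lists, and their scores are made up for the example.

```python
def top_k(index_lists, k):
    """Round-robin sorted access over score-descending (doc, score) lists.

    Keeps a lower bound (worst score) and an upper bound (best score) per
    candidate and stops once no other document can still enter the top-k.
    Missing list entries contribute 0 (non-conjunctive evaluation).
    """
    positions = [0] * len(index_lists)
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]  # last score seen per list
    seen = {}  # doc -> {list index: score}
    while True:
        progressed = False
        for i, lst in enumerate(index_lists):
            if positions[i] < len(lst):
                doc, score = lst[positions[i]]
                positions[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = score
                progressed = True
            else:
                high[i] = 0.0  # exhausted list: nothing more to gain
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: worst[d] + sum(h for i, h in enumerate(high) if i not in s)
            for d, s in seen.items()
        }
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top = ranked[:k]
        if len(top) == k:
            min_k = worst[top[-1]]
            others_best = max((best[d] for d in ranked[k:]), default=0.0)
            # sum(high) bounds the score of any document not seen so far.
            if min_k >= others_best and min_k >= sum(high):
                return [(d, worst[d]) for d in top]
        if not progressed:
            return [(d, worst[d]) for d in top]


# Hypothetical document-level lists: (DocID, score), sorted by score.
SEC_XML = [(2, 0.9), (1, 0.8), (3, 0.7)]
PAR_RETRIEVAL = [(1, 1.0), (2, 0.8), (5, 0.7)]

result = top_k([SEC_XML, PAR_RETRIEVAL], k=2)
print(result)
```

With these inputs the scan stops after two rounds: both top documents are fully scored, and no unseen document can exceed the sum of the last scores read. The returned set is final at that point, while the reported scores are lower bounds.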

Page 8: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Top-k XPath on a Relational Schema [VLDB ’05]

• Content-and-structure (CAS) queries: //article//sec[about(.//, “XML”)]
• Expensive predicate probes (RA) to the structure index (3rd B+tree)
• Non-conjunctive XPath evaluations
• Dynamically relax content- & structure-related query conditions (top-k results entirely driven by score aggregations for content & structure conditions)

Sorted access (SA) on the content index list for sec[“xml”]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.8    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
…      …    …     …

Random access (RA) on the structure index list for article:

DocID  Pre  Post
1      1    245
2      1    123
3      1    176
4      1    89
…      …    …

Select Pre, Post From TagIndex Where DocID=2123 and Tag='article' Order by Pre asc, Post desc
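A structure probe of this kind reduces to a comparison of pre-/postorder ranks: an element is an ancestor of another one in the same document exactly when it has a smaller preorder rank and a larger postorder rank. The sketch below illustrates that test with the labels from the Data Model example; the type and function names are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    doc_id: int
    tag: str
    pre: int
    post: int


def is_ancestor(anc, desc):
    """True if `anc` is a proper ancestor of `desc` in the same document."""
    return anc.doc_id == desc.doc_id and anc.pre < desc.pre and anc.post > desc.post


def has_ancestor(element, structure_rows):
    """Probe the rows fetched from the structure index for a matching ancestor."""
    return any(is_ancestor(row, element) for row in structure_rows)


# Labels from the example document on the Data Model slide.
article = Element(1, "article", 1, 6)
first_title = Element(1, "title", 2, 1)
sec = Element(1, "sec", 4, 5)
par = Element(1, "par", 6, 4)

print(has_ancestor(sec, [article]))      # the sec element lies inside the article
print(is_ancestor(first_title, par))     # the first title is not an ancestor of par
```

In a CAS query such as //article//sec[...], the rows returned by the random access for Tag='article' would play the role of `structure_rows`.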

Page 9: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Relational Schema (cont’d)

• 20,810,942 distinct tag-term pairs for the 4.38 GB Wikipedia collection
• 1,107 distinct tags
• No shredding into a DTD-specific relational schema!
• No DTD at all for INEX Wikipedia!

Content index list for sec[“xml”]:

DocID  Pre  Post  Score  MaxScore
2      2    15    0.9    0.9
2      10   8     0.8    0.9
1      23   48    0.8    0.8
1      45   87    0.2    0.8
…      …    …     …      …

Structure index list for article:

DocID  Pre  Post
1      1    245
2      1    123
3      1    176
4      1    89
…      …    …

Page 10: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Relational Schema (cont’d)

Content Index

DocID  Tag    Term  Pre  Post  Score  MaxScore
2      sec    xml   2    15    0.9    0.9
2      sec    xml   10   8     0.5    0.9
17     title  xml   5    3     0.5    0.5
1      par    xml   6    4     0.7    0.7
…      …      …     …    …     …      …

Size: (4+4+4+4+4+4+4) bytes × 567,262,445 tag-term pairs = 16 GB

Structure Index

DocID  Tag      Pre  Post
1      article  1    245
2      article  1    123
3      sec      2    15
3      sec      10   8
…      …        …    …

Size: (4+4+4+4) bytes × 52,561,559 tags = 0.85 GB

• 2-dimensional source of redundancy
  - Full-content scoring model (#terms times avg. depth of a text node, 6.7 for INEX Wiki)
  - De-normalized relational schema
• High overhead in the architecture (Java -> JDBC -> DBMS & back)
• Element-block sizes are data-driven; not easy to control layout on disk
• Hashing too slow compared to very efficient in-memory merge-joins

Page 11: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

TopX 2.0: Object-Oriented Storage

Relational content index (one row per tag-term pair):

DocID  Tag    Term  Pre  Post  Score  MaxScore
2      sec    xml   2    15    0.9    0.9
2      sec    xml   10   8     0.5    0.9
…      …      …     …    …     …      …
17     title  xml   2    15    0.9    0.9
…      …      …     …    …     …      …
11     par    xml   6    4     0.7    0.7
…      …      …     …    …     …      …

[Figure: binary file layout. Each index list is stored as a run of element blocks, with DocID and MaxScore labels on the blocks. An element block carries its DocID followed by (Pre, Post, Score) entries, e.g. DocID 2: (2, 15, 0.9), (10, 8, 0.5) and DocID 1: (23, 48, 0.8), (45, 87, 0.2). B = element block separator, L = index list separator.]

Offset index into the binary file:

sec[“xml”]    0
title[“xml”]  122,564
par[“xml”]    432,534

Storage size:

Relational: (4+4+4+4+4+4+4) × 567,262,445 = 16 GB
Object-oriented: 4 × 456,466,649 + (4+4+4) × 567,262,445 = 8.6 GB (+ (4+4) × 20,810,942 = 166 MB for the offset index)
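The size estimates on this slide can be checked with a few lines of arithmetic. The sketch assumes 4-byte fields throughout, as the slide's formulas do, and reads them as follows: the relational layout stores seven fields per tag-term pair, while the object-oriented layout stores one DocID per element block plus three fields (Pre, Post, Score) per pair. The constant names are chosen for illustration.

```python
FIELD_BYTES = 4
TAG_TERM_PAIRS = 567_262_445
ELEMENT_BLOCKS = 456_466_649
DISTINCT_TAG_TERM_PAIRS = 20_810_942

# Relational: DocID, Tag, Term, Pre, Post, Score, MaxScore for every pair.
relational_bytes = 7 * FIELD_BYTES * TAG_TERM_PAIRS

# Object-oriented: DocID once per element block; Pre, Post, Score per pair.
object_oriented_bytes = FIELD_BYTES * ELEMENT_BLOCKS + 3 * FIELD_BYTES * TAG_TERM_PAIRS

# Offset index: two 4-byte fields per distinct tag-term pair.
offset_index_bytes = 2 * FIELD_BYTES * DISTINCT_TAG_TERM_PAIRS

print(f"relational:      {relational_bytes / 1e9:.1f} GB")
print(f"object-oriented: {object_oriented_bytes / 1e9:.1f} GB")
print(f"offset index:    {offset_index_bytes / 1e6:.0f} MB")
```

The saving comes from dropping the repeated Tag, Term, and MaxScore fields and from storing each DocID once per element block instead of once per row.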

Page 12: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Object-Oriented Storage w/Block-Merging

• Group element blocks with similar MaxScore into document blocks of fixed length (e.g. 256KB)
• Sort element blocks within each document block by DocID
• Supports sorted access by MaxScore and merge-joins by DocID
• Raw disk access

[Figure: index lists sec[“xml”] (offset 0) and title[“xml”] (offset 122,564) in the binary file. Each list is a sequence of document blocks ordered by MaxScore; inside a document block, the element blocks are ordered by DocID and separated by B markers, and lists are separated by L markers. Example element blocks hold (Pre, Post, Score) entries such as (2, 15, 0.9), (10, 8, 0.5) and (23, 48, 0.8).]

Page 13: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Merging Document Blocks

Sequential access and efficient merge-joins on top of large document blocks

Example query: //sec[about(.//, “XML”)]//par[about(.//, “retrieval”)]

[Figure: document blocks of the index lists sec[“xml”] and par[“retrieval”], read by sequential access (SA) and merge-joined by DocID. Element blocks hold (Pre, Post, Score) entries and are separated by B markers; the scores 1.0, 0.8, 0.8, and 0.7 are shown next to the SA marker.]
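Since the element blocks inside a document block are sorted by DocID, two document blocks can be joined in a single linear pass. The sketch below shows such a merge-join. It is written as a full outer join so that documents matching only one condition survive, which is an assumption made here to reflect non-conjunctive evaluation. The function name is hypothetical, and the block contents are loosely transcribed from the slide's figure, so treat them as illustrative data.

```python
def merge_join(left, right):
    """Full outer merge-join of two DocID-sorted lists of element blocks.

    Each input is a list of (doc_id, entries); the result is a list of
    (doc_id, left_entries or None, right_entries or None).
    """
    i = j = 0
    out = []
    while i < len(left) or j < len(right):
        left_doc = left[i][0] if i < len(left) else None
        right_doc = right[j][0] if j < len(right) else None
        if right_doc is None or (left_doc is not None and left_doc < right_doc):
            out.append((left_doc, left[i][1], None))      # only in left
            i += 1
        elif left_doc is None or right_doc < left_doc:
            out.append((right_doc, None, right[j][1]))    # only in right
            j += 1
        else:
            out.append((left_doc, left[i][1], right[j][1]))  # in both
            i += 1
            j += 1
    return out


# One document block per index list; entries are (Pre, Post, Score).
SEC_XML_BLOCK = [
    (1, [(23, 48, 0.8)]),
    (2, [(2, 15, 0.9), (10, 8, 0.5)]),
    (5, [(2, 24, 0.7), (3, 11, 0.3)]),
]
PAR_RETRIEVAL_BLOCK = [
    (2, [(12, 48, 0.9)]),
    (5, [(3, 17, 0.9), (13, 9, 0.2)]),
    (7, [(65, 21, 1.0), (72, 43, 0.5)]),
]

joined = merge_join(SEC_XML_BLOCK, PAR_RETRIEVAL_BLOCK)
for doc_id, sec_entries, par_entries in joined:
    print(doc_id, sec_entries, par_entries)
```

Unlike a hash-join, this needs no auxiliary table and touches each element block exactly once, which fits the sequential reads over large document blocks described on this slide.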

Page 14: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Compressed Number Encoding

• Multi-attribute (4), double-nested block-index structure
• Delta encoding only works for DocID (and to some extent for Pre)
• No specific assumptions on distributions of Pre/Post or Score
• No Unary or Huffman coding (prefix-free, but needs an additional coding table)
• Sophisticated compression schemes may be expensive to decode
• No Zip, etc.
• But known number ranges:
  - DocID [1, 659,388] -> 3 bytes (254^3 = 16,387,064, lossless)
  - Pre/Post [1, 43,114] -> 2 bytes (254^2 = 64,516, lossless)
  - Score [0,1] -> rounded to 1 byte (254 buckets, lossy)
• Variable-length byte encoding w/leading length-indicator byte

[Figure: two encoded example entries with the fields Len, Pre, Post, Score: (4, 3, 26, 7), annotated “5 bytes”, and (9, 225, 332, 192), annotated “10 bytes”.]
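The number ranges above suggest a fixed-width encoding in base 254, where each byte takes one of 254 values. The sketch below shows one way to implement it. The reading that the two remaining byte values are kept free (for instance for separator markers), the big-endian digit order, and all function names are assumptions; the slide only gives the ranges and byte counts.

```python
BASE = 254  # each byte holds a digit 0..253; values 254 and 255 stay unused (assumption)


def encode_fixed(value, width):
    """Encode a non-negative integer as `width` base-254 digits, big-endian."""
    if not 0 <= value < BASE ** width:
        raise ValueError(f"{value} does not fit into {width} base-{BASE} bytes")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        digits.append(digit)
    return bytes(reversed(digits))


def decode_fixed(data):
    """Inverse of encode_fixed."""
    value = 0
    for byte in data:
        value = value * BASE + byte
    return value


def quantize_score(score):
    """Lossy mapping of a score in [0, 1] to one of 254 buckets (one byte)."""
    return min(BASE - 1, int(score * BASE))


doc_id_bytes = encode_fixed(659_388, 3)   # largest DocID from the slide
pre_bytes = encode_fixed(43_114, 2)       # largest Pre/Post from the slide
print(len(doc_id_bytes), decode_fixed(doc_id_bytes))
print(len(pre_bytes), decode_fixed(pre_bytes))
print(quantize_score(0.9))
```

With big-endian digits, comparing two encoded values byte by byte gives the same order as comparing the numbers, so DocID and Pre/Post could be compared without decoding.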

Page 15: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Some more tricks…

• Dump leading histogram blocks into index list headers
  - Histograms only for index lists that exceed one document block (<5% of all lists)
• Own native compare methods for DocID, Pre/Post
  - Decode only Score for arithmetic op’s (mostly perform pointer operations at qp time)
• Incrementally read & process precomputed memory image for fast top-k queries on top of large disk blocks

[Figure: index list sec[“xml”] with a leading histogram block (score vs. freq; labels “36 bytes” and “10”), followed by document blocks DB1, DB2, …, DBl (256 KB each), each containing element blocks EB 1, EB 2, …, EB k.]

Page 16: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Block Access Scheduling [VLDB ’06]

SA Scheduling
• Look-ahead Δi through precomputed score histograms
• Knapsack-based optimization of Score Reduction

RA Scheduling
• 2-phase probing: schedule RAs “late & last”, i.e., clean up the queue if … [condition not preserved in the transcript]

Extended probabilistic cost model for integrated SA & RA scheduling

[Figure: inverted block-index (256KB doc-blocks) with three index lists, shown as the score columns (1.0, 0.9, 0.8, 0.8), (1.0, 0.9, 0.9, 0.2), and (1.0, 0.9, 0.7, 0.6); look-ahead annotations Δ3,3 = 0.2 and Δ1,3 = 0.8; SA and RA markers.]

Page 17: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Object Storage Summary

4.38 GB Wikipedia XML sources

Content Index (incl. histograms)
• 567,262,445 tag-term pairs
• 20,810,942 distinct tag-term pairs
• 20,815,884 document blocks (256KB)
• 456,466,649 element blocks
• 4,703,385,686 total bytes (8.3 bytes/tag-term pair)

Structure Index
• 52,561,559 tags (elements)
• 1,107 distinct tags
• 2,323 document blocks (256KB)
• 8,999,193 element blocks
• 246,601,752 total bytes (4.7 bytes/tag)

[Figure: bar chart comparing relational and object-oriented index sizes; axis ticks 0, 4, 8, 12, 16.]

Page 18: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Preliminary Runtime Experiments

CO (top-10, non-conjunctive)

Page 19: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Preliminary Runtime Experiments

CAS (top-10, non-conjunctive)

Page 20: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Some INEX Results

CAS (top-1,500, non-conjunctive)

Page 21: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Some INEX Results

CAS (top-1,500, non-conjunctive)

Page 22: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

Conclusions & Outlook

• Scalable and efficient XML-IR with vague search
  - Mature system, reference engine for INEX topic development & interactive tracks [VLDB Special Issue on DB&IR Integration ’08]
• Brand-new TopX 2.0 prototype
  - Very efficient reimplementation in C++
  - Object-oriented XML storage, moderate compression rates
  - 10 to 20 times better sequential throughput than relational
• More features
  - Generalized proximity search, graph top-k
  - Updates (gaps within document blocks)
  - XQuery Full-Text (top-k-style bounds over IF, For-Let)
  - …

Page 23: TopX  2.0 — A (Very) Fast Object-Store for  Top-k  XPath  Query Processing

http://www.inex.otago.ac.nz/