
XML Storage and Query Processing

Yanlei Diao

University of Massachusetts Amherst

Some slide content courtesy of Donald Kossmann


XML Storage Alternatives

• Plain Text
• Trees with Navigation
• Tuples (i.e., mapping to RDBMS)


Plain Text

Use XML standards to encode data

Advantages:
• simple, universal
• indexing possible

Disadvantages:
• need to re-parse (re-validate) all the time
• no compliance with XQuery data model (collections)
• not an option for XQuery processing


Trees

XML data model uses tree semantics
• use Trees/Forests to represent XML instances
• annotate nodes of tree with data model info

Examples:
• Document Object Model (DOM) http://www.w3.org/DOM/
• Object Exchange Model (OEM)

[Figure: example tree with nodes f1 to f8, rooted at f1]


DataGuides [Goldman & Widom 97]

Schema-based environments

[Figure: a Schema generates the Data; the Schema is used to formulate Queries, which execute against the Data]


DataGuides [Goldman & Widom 97]

Schema-free environments:
• don't know the schema in advance.
• semantic heterogeneity (i.e. a mix of schemas)

[Figure: App-specific Templates generate the Data; the Data is summarized into DataGuides; DataGuides are used to formulate Queries, which execute against the Data]


Schema vs. DataGuides

A DataGuide only includes info that exists in a DB.

A schema can be a superset of any DB that conforms to it.

So, a schema defines a superset of a DataGuide.

Issues addressed in the paper:
• Summarize data into DataGuides;
• Use them for query formulation and optimization.


Object Exchange Model (OEM)

• Each object has an id (oid) and a value (atomic or a set of subobjects).
• Each edge links an object to one of its subobjects with a label; a subobject may have multiple parents.

Label path: a seq. of labels

Data path: an alternating seq. of labels and oids

Target set: a set of all objects reached by traversing a label path
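The definitions of label path and target set can be made concrete with a small sketch. The class name `OEM`, its methods, and the restaurant-style labels below are assumptions chosen for illustration; they are not taken from the slides or from the DataGuides paper.

```python
from collections import defaultdict


class OEM:
    """A tiny edge-labelled object graph: oid -> list of (label, child oid)."""

    def __init__(self, root):
        self.root = root
        self.edges = defaultdict(list)

    def add_edge(self, parent, label, child):
        self.edges[parent].append((label, child))

    def target_set(self, label_path):
        """Set of all oids reached from the root by following label_path."""
        current = {self.root}
        for label in label_path:
            current = {
                child
                for oid in current
                for (edge_label, child) in self.edges[oid]
                if edge_label == label
            }
        return current


# Hypothetical data: two restaurants that share one owner object.
db = OEM(root=1)
db.add_edge(1, "Restaurant", 2)
db.add_edge(1, "Restaurant", 3)
db.add_edge(2, "Name", 4)
db.add_edge(3, "Name", 5)
db.add_edge(2, "Owner", 6)
db.add_edge(3, "Owner", 6)  # a subobject with two parents

names = db.target_set(["Restaurant", "Name"])
owners = db.target_set(["Restaurant", "Owner"])
```

Here the label path Restaurant.Name has a target set of two objects, while Restaurant.Owner reaches a single shared object through two different data paths.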


Definition of a DataGuide

Conciseness: a DataGuide describes every unique label path of a source exactly once

Accuracy: a DataGuide does not encode any label path that does not appear in the source

Convenience: represented as an OEM model, like the data

A DataGuide reflects the structure of a DB; it contains no atomic values.


From Data to DataGuides

Creating a DataGuide is equivalent to converting an NFA to a DFA!
• Consider a label path (query) as a string to be accepted by the data source and the DataGuide.
• Intuition: The data source has multiple matches, so execution is non-deterministic. But the DataGuide has only one path, so execution is deterministic.

Cost of creation
• Source DB is a tree: linear
• Worst-case: exponential in the number of objects and edges in the source
• Empirical results: average performance for certain datasets is quite encouraging
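The NFA-to-DFA analogy can be sketched as a subset construction in which every DataGuide node is identified with a set of source objects (a target set). This is a minimal sketch under that assumption, not necessarily the exact algorithm of the paper; the function names and the example graph are hypothetical.

```python
def build_dataguide(edges, root):
    """Determinize an edge-labelled graph.

    edges: dict mapping oid -> list of (label, child oid).
    Returns (guide, start), where guide maps each DataGuide node
    (a frozenset of source oids) to a dict {label: DataGuide node}.
    """
    start = frozenset([root])
    guide = {}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in guide:  # already expanded; also guards against cycles
            continue
        by_label = {}
        for oid in node:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        guide[node] = {label: frozenset(t) for label, t in by_label.items()}
        stack.extend(guide[node].values())
    return guide, start


def lookup(guide, start, label_path):
    """Follow label_path deterministically; None if the path does not exist."""
    node = start
    for label in label_path:
        node = guide[node].get(label)
        if node is None:
            return None
    return node


# Hypothetical source: object 3 is reachable via both A and B.
edges = {1: [("A", 2), ("A", 3), ("B", 3)], 2: [("C", 4)], 3: [("C", 5)]}
guide, start = build_dataguide(edges, 1)
```

Each label path now has exactly one route through `guide`, and the node it ends at is the path's target set in the source. The worst case is exponential because the number of distinct object sets can be exponential in the size of the source.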


Multiple DataGuides

An OEM source may have multiple DataGuides
• A single NFA may have many equivalent DFAs.

Minimal DataGuide
• Can be created using DFA minimization

Minimality may not always be desirable
• Hard to maintain as the data source changes, a well-known problem with DFAs.
• Does not allow annotations.


Annotations

Annotation: a property of the target set of a label path l in the data source s
• Statistical information: e.g. # occurrences of l in s
• Pointers to objects reachable via l
• …

Issue with minimality

[Figure: a DataGuide node carrying the annotation for A.C, with the open question of where the annotation for B.C goes]


Strong DataGuides

Each set of label paths that share a node in the DataGuide is the set of label paths that share the same target set in the source.
• Label paths can be merged in the DataGuide if they lead to the same target set.

There is a one-to-one correspondence between source target sets and DataGuide objects.

Creation from the data source
• A DFS algorithm that examines source target sets reachable by all possible label paths…

Maintenance uses a similar set of data structures…


Query Formulation & Optimization

Query formulation
• Query by example: click buttons to select a path and add value filters
• Blurs the distinction between formulating a query and browsing a query result

Query optimization
• Uses the DataGuide for structural matching (e.g. A.B.C) and retrieves the target set
• Uses value indexes (e.g. B+trees) for value filters for a specific label (e.g. C.price>100)
• Intersects the two resulting sets of objects
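The three optimization steps can be sketched as a set intersection. Everything concrete here is an assumption for illustration: the target set is a precomputed DataGuide annotation, a sorted list stands in for the B+tree, and the index is assumed to map each price to the oid of the C object that carries it.

```python
import bisect

# Hypothetical DataGuide annotation: label path -> oids in its target set.
target_sets = {("A", "B", "C"): {10, 11, 12, 13}}

# Hypothetical value index on "price": sorted (value, oid) pairs
# standing in for a B+tree.
price_index = sorted([(50, 10), (120, 11), (300, 12), (150, 99)])


def oids_with_value_above(index, threshold):
    """Range scan: oids whose indexed value is strictly greater than threshold."""
    pos = bisect.bisect_right(index, (threshold, float("inf")))
    return {oid for _, oid in index[pos:]}


def evaluate(label_path, threshold):
    structural = target_sets.get(label_path, set())          # step 1: structure
    by_value = oids_with_value_above(price_index, threshold)  # step 2: value filter
    return structural & by_value                              # step 3: intersect


result = evaluate(("A", "B", "C"), 100)
```

Object 99 passes the value filter but is not in the target set of A.B.C, so the intersection drops it.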


XML Data Stored as Tuples

Motivation: Use an RDBMS infrastructure to store and process the XML data
• query optimization
• scalability
• richness and maturity of RDBMS

Alternative relational storage approaches:
• Map XML schema to relational schema
• Generic shredding of the data (edge, binary, …)
• New XML storage integrated tightly with the relational processor


Relational Support for XML [Zhang et al. 2001]

Goal: relational support for path queries, including storage and query processing

Assumption: we have the DTD/schema

Problem addressed: to support XML path queries
• Can we use a relational DBMS?
• Shall we design a native XML store, i.e. using novel storage and indexing techniques?


Representation of XML

Each XML document is parsed to a seq. of items:
• Start tag
• Text word
• End tag

All items are numbered, from 1.

   <?xml version="1.0" ?>
 1 <book>
 2 <section id="intro" difficulty="easy">
 3 <title>
 4 XML
 5 </title>
 6 <section difficulty="easy">
 7 <title>
 8 XML
 9 Processing
10 </title>
11 <figure source="g1.jpg">
12 <title>
13 XML
14 Processing
15 Cost
16 </title>
17 </figure>
18 </section>
19 <figure source="g2.jpg">
20 <title>
21 Scalability
22 </title>
23 </figure>
24 </section>
25 </book>
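The numbering scheme can be sketched with a small regular-expression tokenizer. This is a simplified illustration, not a real XML parser: the function name is hypothetical, and comments, CDATA, entities, and self-closing tags are not handled (a self-closing tag would be counted as a start tag).

```python
import re

# One token is the XML declaration, a tag, or a whitespace-delimited text word.
TOKEN = re.compile(r"<\?.*?\?>|<[^>]+>|[^<\s]+")


def number_items(xml_text):
    """Return (position, kind, text) triples; kind is 'start', 'word' or 'end'."""
    items = []
    position = 0
    for token in TOKEN.findall(xml_text):
        if token.startswith("<?"):
            continue  # the XML declaration is not numbered
        position += 1
        if token.startswith("</"):
            kind = "end"
        elif token.startswith("<"):
            kind = "start"
        else:
            kind = "word"
        items.append((position, kind, token))
    return items


doc = ('<?xml version="1.0" ?><book><section id="intro">'
       '<title>XML Processing</title></section></book>')
items = number_items(doc)
```

Attributes stay inside the start tag's item and do not receive positions of their own, which matches the numbering of the example document.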


Element Index

An Element Index (E-index) records occurrences of each element name inside the entire collection of documents.

Each index entry in an E-index corresponds to one occurrence of the element name. It has:
• document identifier,
• start position of the element in the doc, i.e. position of its start tag,
• end position of the element in the doc, i.e. position of its end tag,
• document level of the element in the doc, i.e. level from the root.

An E-index is sorted in increasing order of <document id, start position, end position>.
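A stack of open elements is enough to produce E-index entries in one pass over the numbered items. The sketch below is an illustration rather than the paper's implementation; it assumes the document has already been split into tokens with attributes stripped, and the function name is hypothetical. The token list mirrors the 25-item example document.

```python
from collections import defaultdict


def build_e_index(doc_id, tokens, index=None):
    """Return {element name: sorted [(doc_id, start_pos, end_pos, level), ...]}."""
    index = defaultdict(list) if index is None else index
    open_elements = []  # stack of (name, start position)
    for pos, tok in enumerate(tokens, start=1):
        if tok.startswith("</"):
            name = tok[2:-1]
            open_name, start = open_elements.pop()
            if open_name != name:
                raise ValueError(f"mismatched end tag at position {pos}")
            level = len(open_elements) + 1  # the root element is at level 1
            index[name].append((doc_id, start, pos, level))
        elif tok.startswith("<"):
            open_elements.append((tok[1:-1], pos))
        # text words consume a position but produce no E-index entry
    for entries in index.values():
        entries.sort()  # order by (doc id, start position, end position)
    return index


tokens = (
    "<book> <section> <title> XML </title> "
    "<section> <title> XML Processing </title> "
    "<figure> <title> XML Processing Cost </title> </figure> </section> "
    "<figure> <title> Scalability </title> </figure> </section> </book>"
).split()
e_index = build_e_index(1, tokens)
```

Entries are emitted when an end tag is seen, so an outer element is appended after its inner elements; the final sort restores the order by start position.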


Example of E-Index

For the numbered example document (items 1 to 25), with entries written as (doc id, start:end, level):

<book>     (1, 1:25, 1) (2, …
<section>  (1, 2:24, 2) (1, 6:18, 3) (2, …
<title>    (1, 3:5, 3) (1, 7:10, 4) (1, 12:16, 5) (1, 20:22, 4) (2, …
<figure>   (1, 11:17, 4) (1, 19:23, 3) (2, …


Text Index

A Text Index (T-index) records the occurrences of each text word inside the entire collection of documents, similar to E-Index.

The difference is that each index entry in a T-index contains a single word position, instead of the pair of start and end positions.

Similarly, a T-index is sorted in increasing order of <document identifier, word position>.
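A T-index can be built in the same single pass, keeping only a depth counter instead of a stack. As a sketch under the same assumptions as before (pre-split tokens without attributes, hypothetical function name), with a word's level taken to be one more than the number of open elements around it:

```python
from collections import defaultdict


def build_t_index(doc_id, tokens, index=None):
    """Return {word: sorted [(doc_id, word_pos, level), ...]}."""
    index = defaultdict(list) if index is None else index
    depth = 0  # number of currently open elements
    for pos, tok in enumerate(tokens, start=1):
        if tok.startswith("</"):
            depth -= 1
        elif tok.startswith("<"):
            depth += 1
        else:
            index[tok].append((doc_id, pos, depth + 1))
    for entries in index.values():
        entries.sort()  # order by (doc id, word position)
    return index


tokens = (
    "<book> <section> <title> XML </title> "
    "<section> <title> XML Processing </title> "
    "<figure> <title> XML Processing Cost </title> </figure> </section> "
    "<figure> <title> Scalability </title> </figure> </section> </book>"
).split()
t_index = build_t_index(1, tokens)
```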


Example of T-Index

For the numbered example document (items 1 to 25), with entries written as (doc id, position:position, level):

"XML"          (1, 4:4, 4) (1, 8:8, 5) (1, 13:13, 6) (2, …
"Processing"   (1, 9:9, 5) (1, 14:14, 6) (2, …
"Cost"         (1, 15:15, 6) (2, …
"Scalability"  (1, 21:21, 5) (2, …


Relational Storage

(a) Element-Index

term       doc_id  start_pos  end_pos  doc_level
<book>     1       1          25       1
<book>     2       …          …        …
<section>  1       2          24       2
<section>  1       6          18       3
<section>  2       …          …        …
<title>    1       3          5        3
<title>    1       7          10       4
<title>    1       12         16       5
<title>    1       20         22       4
<title>    2       …          …        …
<figure>   1       11         17       4
<figure>   1       19         23       3
<figure>   2       …          …        …

(b) Text-Index

term           doc_id  word_pos  doc_level
"XML"          1       4         4
"XML"          1       8         5
"XML"          1       13        6
"XML"          2       …         …
"Processing"   1       9         5
"Processing"   1       14        6
"Processing"   2       …         …
"Cost"         1       15        6
"Cost"         2       …         …
"Scalability"  1       21        5
"Scalability"  2       …         …


Relational Storage (contd.)

One relation for elements, one for text words

Clustered B+trees over each table
• On (term, docno)
• On all columns: leads to index-only plans


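"//section//title"

[Query plan: a containment join (//) whose left input l is an Index Scan on <section> and whose right input r is an Index Scan on <title>]

Join predicate:
l.doc_id = r.doc_id and l.start_pos < r.start_pos and l.end_pos > r.end_pos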
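The join predicate can be tried directly on the E-index entries of the example document. The sketch below uses a plain nested-loop join for clarity; a real system would more likely use a merge-style join over the sorted index entries. The `child_only` flag, which compares levels to express a parent/child step, is an added assumption and does not appear on the slide.

```python
def containment_join(ancestors, descendants, child_only=False):
    """Pairs (a, d) where element d lies strictly inside element a.

    Both inputs are lists of (doc_id, start_pos, end_pos, level) entries.
    """
    result = []
    for a in ancestors:
        for d in descendants:
            inside = a[0] == d[0] and a[1] < d[1] and a[2] > d[2]
            if not inside:
                continue
            if child_only and d[3] != a[3] + 1:
                continue  # descendant, but not a direct child
            result.append((a, d))
    return result


# E-index entries for document 1 of the example.
sections = [(1, 2, 24, 2), (1, 6, 18, 3)]
titles = [(1, 3, 5, 3), (1, 7, 10, 4), (1, 12, 16, 5), (1, 20, 22, 4)]

descendant_pairs = containment_join(sections, titles)
child_pairs = containment_join(sections, titles, child_only=True)
```

The outer section contains all four titles and the inner section contains two of them, so the descendant join yields six pairs; restricting to direct children leaves one title per section.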


Questions


Outline

Storage and Query Processing
• DataGuides [Goldman and Widom 97]
• Relational Approach [Zhang et al. 2001]

Other Research Topics
• Query Rewriting
• Benchmarking
• …


Node Identifiers

XQuery Data Model Requirements
• identify a node uniquely (implementing identity)
• lives as long as node lives
• robust to updates

Identifiers might include additional information
• Schema/type information
• Document order
• Parent/child relationship
• Ancestor/descendant relationship
• Document information

Required for indexes


Simple Node Identifiers

Examples:
• Alternative 1 (data: trees)
    • id of document (integer)
    • pre-order number of node in document (integer)
• Alternative 2 (data: plain text)
    • file name
    • offset in file

Encode document ordering (Alternative 1)
• identity: doc1 = doc2 AND pre1 = pre2
• order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)

Assessment:
• bad: Not robust to updates
• bad: Not able to answer more complex queries
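The identity and order tests of Alternative 1 translate almost literally into code. The type and function names below are hypothetical.

```python
from typing import NamedTuple


class NodeId(NamedTuple):
    doc: int  # document id
    pre: int  # pre-order number of the node within its document


def same_node(a: NodeId, b: NodeId) -> bool:
    """identity: doc1 = doc2 AND pre1 = pre2"""
    return a.doc == b.doc and a.pre == b.pre


def precedes(a: NodeId, b: NodeId) -> bool:
    """order: doc1 < doc2 OR (doc1 = doc2 AND pre1 < pre2)"""
    return a.doc < b.doc or (a.doc == b.doc and a.pre < b.pre)


n1, n2, n3 = NodeId(1, 7), NodeId(1, 12), NodeId(2, 3)
```

The weakness noted in the assessment is visible here: inserting a node early in a document shifts the pre-order number of every later node, and the pair alone cannot tell whether one node is an ancestor of another.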


Dewey Order

Idea:
• Generate surrogates for each path
• 1.2.3 identifies the third child of the second child of the first child of the given root

Assessment:
• good: order comparison, ancestor/descendant easy
• bad: updates expensive, space overhead

Improvement: ORDPath Bit Encoding
• O'Neil et al. 2004 (Microsoft SQL Server)
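Why order comparison and ancestor/descendant tests are easy with Dewey identifiers can be shown in a few lines. This is a sketch of plain Dewey order only, not of the ORDPath encoding; the function names are hypothetical.

```python
def parse(dewey):
    """'1.2.3' -> (1, 2, 3); components are compared as integers."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(a, d):
    """True if a is a proper prefix of d."""
    a, d = parse(a), parse(d)
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(a, d):
    """True if d extends a by exactly one component."""
    a, d = parse(a), parse(d)
    return len(d) == len(a) + 1 and d[:len(a)] == a


def document_order(ids):
    """Tuple comparison puts a node before its descendants and later siblings."""
    return sorted(ids, key=parse)


ordered = document_order(["1.2.1.2", "1.10", "1.2", "1"])
```

Comparing components as integers matters: as strings, "1.10" would sort before "1.2". The update cost mentioned in the assessment comes from renumbering, since inserting a new first child forces every following sibling and its whole subtree to change identifiers.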


Example: Dewey Order

[Figure: tree of person, name, child, and hobby nodes labeled with the Dewey identifiers 1, 1.1, 1.2, 1.2.1, 1.2.1.1, 1.2.1.2, 1.2.1.3]


XML Storage Alternatives

• Plain Text
• Trees with Random Access
• Tuples (i.e., mapping to RDBMS)



Trees

XML data model uses tree semantics
• use Trees/Forests to represent XML instances
• annotate nodes of tree with data model info

Example

<f1>
  <f2>..</f2>
  <f3>..</f3>
  <f4>
    <f7/>
    <f8>..</f8>
  </f4>
  <f5/>
  <f6>..</f6>
</f1>

[Figure: the corresponding tree, with root f1, children f2, f3, f4, f5, f6, and f7, f8 as children of f4]


Trees

Advantages
• natural representation of XML data
• good support for navigation and updates
• index built into the data structure
• compliance with DOM standard interface

Disadvantages
• difficult to partition
• high overhead: mixes indexes and data
• index everything

Example: Document Object Model (DOM)
• http://www.w3.org/DOM/


Edge Approach (Florescu & Kossmann 99)

Edge Table

Source  Label   Target
0       person  4711
0       person  666
4711    name    v1
4711    child   i314
666     name    v2
666     child   i314

Value Table (String)

Id  Value
v1  Lilly Potter
v2  James Potter
v3  Harry Potter

Value Table (Integer)

Id  Value
v4  12
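With the data shredded this way, a path expression becomes a chain of self-joins on the edge table. The sketch below loads only the rows shown above into an in-memory SQLite database; the table and column names are assumptions, and the SQL is one possible translation of the path /person/name rather than the one from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edge (source TEXT, label TEXT, target TEXT);
CREATE TABLE value_string (id TEXT PRIMARY KEY, value TEXT);
INSERT INTO edge VALUES
  ('0', 'person', '4711'), ('0', 'person', '666'),
  ('4711', 'name', 'v1'), ('4711', 'child', 'i314'),
  ('666', 'name', 'v2'), ('666', 'child', 'i314');
INSERT INTO value_string VALUES
  ('v1', 'Lilly Potter'), ('v2', 'James Potter'), ('v3', 'Harry Potter');
""")

# /person/name: one self-join per path step, then a join to the value table.
rows = conn.execute("""
    SELECT v.value
    FROM edge AS p
    JOIN edge AS n ON n.source = p.target
    JOIN value_string AS v ON v.id = n.target
    WHERE p.source = '0' AND p.label = 'person' AND n.label = 'name'
    ORDER BY v.value
""").fetchall()
names = [value for (value,) in rows]
conn.close()
```

Each additional path step adds another join on the edge table, which is the main cost of this generic mapping.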


XML Example

<person id="4711">
  <name> Lilly Potter </name>
  <child>
    <person id="314">
      <name> Harry Potter </name>
      <age> 12 </age>
    </person>
  </child>
</person>
<person id="666">
  <name> James Potter </name>
  <child idref="314"/>
</person>


[Figure: graph for the XML example. Root object 0 has two person edges, to objects 4711 and 666. Object 4711 has name Lilly Potter and object 666 has name James Potter. Both have a child edge to object i314, a person with name Harry Potter and age 12.]


Kinds of Indexes

1. Value Indexes
• index atomic values; e.g., //emp/salary/fn:data(.)
• use B+ trees (like in relational world)
• (integration into query optimizer more tricky)

2. Structure Indexes
• materialize results of path expressions
• (counterpart of relational join indexes and OO path indexes)

3. Full text indexes
• Keyword search, inverted files
• (IR world, text extenders)

Any combination of the above


Outline

• XML Storage
• XML Indexing
• Query Processing
• Other Research Topics
    • Query Rewriting
    • Benchmarking
    • …


What is a Correct Rewriting

E1 -> E2 is a legal rewriting iff
• Type(E2) is a subtype of Type(E1)
• FreeVar(E2) is a subset of FreeVar(E1)
• For a binding of free variables, either
    • E1 or E2 return ERROR (possibly different errors)
    • Or E1 and E2 return the same result

This definition allows the rewrite E1 -> ERROR
• Trust your vendor: she does not do that for all E1!


Handling Backwards Navigation

Replace backwards navigation with forward navigation

Original:
  for $x in $input/a/b
  return <c>{$x/.., $x/d}</c>

Rewritten:
  for $y in $input/a,
      $x in $y/b
  return <c>{$y, $x/d}</c>

Original:
  for $x in $input/a/b
  return <c>{$x//e/..}</c>

Rewritten: ??

Enables streaming


FLWR Unnesting

Traditional database technique

Nested:
  for $x in $input/a/b
  where $x/c eq 3
  return (for $y in $x/d
          where $x/e eq 4
          return $y)

Unnested:
  for $x in $input/a/b,
      $y in $x/d
  where ($x/e eq 4) and ($x/c eq 3)
  return $y

Problem simpler than in OQL/ODMG
• No nested collections in XML
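The equivalence of the nested and unnested forms can be checked on a Python analogue. This is only an analogy: dictionaries stand in for the b elements, and Python's `==` on plain values ignores the sequence and error semantics of XQuery's `eq`. The data is hypothetical.

```python
# Each dict stands in for one b element with children c, e and a list of d's.
bs = [
    {"c": 3, "e": 4, "d": ["d1", "d2"]},
    {"c": 3, "e": 5, "d": ["d3"]},
    {"c": 1, "e": 4, "d": ["d4"]},
]


def nested(elements):
    """Outer loop filters on c; the inner loop filters on e and returns the d's."""
    out = []
    for x in elements:
        if x["c"] == 3:
            out.extend(y for y in x["d"] if x["e"] == 4)
    return out


def unnested(elements):
    """Single flat loop over (x, y) pairs with both predicates combined."""
    return [y for x in elements for y in x["d"] if x["e"] == 4 and x["c"] == 3]
```

Because the inner result is spliced into one flat sequence, the flat loop produces the same items in the same order, which is why the absence of nested collections makes the problem simpler.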


XML Query Processing

Techniques vary a lot, depending on
• Storage model
• Indexes available
• Algebra used
• …

A large body of ongoing work
• Research community: McHugh and Widom 1999, Zhang et al. 2001, Bruno et al. 2002, Guha et al. 2002, Chen et al. 2003, Paparizos et al. 2004, Jagadish 2004, … (just look at SIGMOD and VLDB proceedings in recent years!)
• Industry: IBM DB2, Oracle, SQL Server, …


XML Processing Benchmark

We cannot really compare approaches until we decide on a comparison basis
• XML processing very broad
• Industry not mature enough
• Usage patterns not clear enough
• Existing XML benchmarks (XMark, etc.) limited
• Strong need for a TP benchmark