web databases - comp.polyu.edu.hkcstyng/webdb.07/lectures/lesson10.pdf · comparative performance...

Post on 08-Jun-2020


XML Benchmarks 1

Web Databases

XML System Benchmarks

Benchmarks

• Many XML DB tools
  – Design and adopt benchmarks to allow comparative performance analysis
• Four key criteria (Jim Gray 1993)
  – Relevance
  – Portability
  – Scalability
  – Simplicity

Existing XML Benchmarks

• Application benchmarks
  – XOO7 (National University of Singapore, University of Auckland, Arizona State University)
  – XMach-1 (University of Leipzig)
  – XMark (CWI, Inria, Microsoft, BEA, XQRL, Fraunhofer-IPSI)
  – XBench (University of Waterloo + IBM)

Benchmark Dataset

• Must be complex enough to capture all characteristics of XML data representation
• Capture the document (ordering) and navigation (references) features
• Scalability
  – The depth of the tree can be controlled by varying the number of repetitions of recursive elements
  – The width of the tree can be adjusted by varying the cardinality of some elements
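The two scaling knobs can be illustrated with a small Python sketch. It is not taken from any of the benchmarks: the function names, the `node` tag, and the uniform fan-out are assumptions made only for illustration. `depth` plays the role of the number of repetitions of a recursive element and `fanout` the cardinality of a child element.

```python
import xml.etree.ElementTree as ET


def build_tree(depth: int, fanout: int, tag: str = "node") -> ET.Element:
    """Build a tree with `depth` levels where every inner node has `fanout` children."""
    root = ET.Element(tag)
    if depth > 1:
        for _ in range(fanout):
            # Recursion models a recursive element type nested inside itself.
            root.append(build_tree(depth - 1, fanout, tag))
    return root


def count_elements(elem: ET.Element) -> int:
    """Count the element itself plus all of its descendants."""
    return sum(1 for _ in elem.iter())


def max_depth(elem: ET.Element) -> int:
    """Number of levels on the longest root-to-leaf path."""
    return 1 + max((max_depth(child) for child in elem), default=0)


if __name__ == "__main__":
    tree = build_tree(depth=3, fanout=2)
    print(count_elements(tree), max_depth(tree))  # 7 3
```

Raising `depth` makes the document deeper without changing the fan-out, while raising `fanout` widens every level, so the two dimensions can be scaled independently.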

Benchmark Queries

• Types of queries
  – Data-centric
    • Join, aggregation, sorting (R9, R10, R11)
  – Document-centric
    • Element/document ordering (R17, R21)
  – Navigational
    • Traversal (R13, R20)

Benchmark Queries

• R1
  – Query all data types and collections of possibly multiple XML documents
• R2
  – Allow data-oriented, document-oriented, and mixed queries
• R3
  – Accept streaming data
• R4
  – Support operations on various data models
• R5
  – Allow conditions/constraints on text elements
• R6
  – Support hierarchical and sequence queries

Benchmark Queries

• R7
  – Manipulate NULL values
• R8
  – Support quantifiers (some, all, not) in queries
• R9
  – Allow queries that combine different parts of document(s)
• R10
  – Support for aggregation
• R11
  – Able to generate sorted results
• R12
  – Support composition of operations

Benchmark Queries

• R13
  – Allow navigation (reference traversals)
• R14
  – Able to use environment information as part of queries
• R15
  – Able to support XML updates if data model allows
• R16
  – Support type coercion
• R17
  – Preserve the structure of the documents
• R18
  – Transform and create XML documents

Benchmark Queries

• R19
  – Support ID creation
• R20
  – Structural recursion
• R21
  – Element ordering

XOO7

• Derived from the OO7 benchmark
• XOO7
  – Bressan, Dobbie 2001
  – Bressan, Lee 2001
  – http://www.comp.nus.edu.sg/~ebh/XOO7.html
• Allows datasets of varying sizes

XOO7 - DTD

<!ELEMENT Module (Manual, ComplexAssembly)>
<!ATTLIST Module MyID      NMTOKEN #REQUIRED
                 type      CDATA   #REQUIRED
                 buildDate NMTOKEN #REQUIRED>

<!ELEMENT Manual (#PCDATA)>
<!ATTLIST Manual MyID    NMTOKEN #REQUIRED
                 title   CDATA   #REQUIRED
                 textLen NMTOKEN #REQUIRED>

<!ELEMENT ComplexAssembly (ComplexAssembly+ | BaseAssembly+)>
<!ATTLIST ComplexAssembly MyID      NMTOKEN #REQUIRED
                          type      CDATA   #REQUIRED
                          buildDate NMTOKEN #REQUIRED>

<!ELEMENT BaseAssembly (CompositePart+)>
<!ATTLIST BaseAssembly MyID      NMTOKEN #REQUIRED
                       type      CDATA   #REQUIRED
                       buildDate NMTOKEN #REQUIRED>

XOO7 - DTD

<!ELEMENT CompositePart (Document, Connection+)>
<!ATTLIST CompositePart MyID      NMTOKEN #REQUIRED
                        type      CDATA   #REQUIRED
                        buildDate NMTOKEN #REQUIRED>

<!ELEMENT Document (#PCDATA | para)+>
<!ATTLIST Document MyID  NMTOKEN #REQUIRED
                   title CDATA   #REQUIRED>

<!ELEMENT para (#PCDATA)>

<!ELEMENT Connection (AtomicPart, AtomicPart)>
<!ATTLIST Connection type   CDATA   #REQUIRED
                     length NMTOKEN #REQUIRED>

<!ELEMENT AtomicPart EMPTY>
<!ATTLIST AtomicPart MyID      NMTOKEN #REQUIRED
                     type      CDATA   #REQUIRED
                     buildDate NMTOKEN #REQUIRED
                     x         NMTOKEN #REQUIRED
                     y         NMTOKEN #REQUIRED
                     docId     NMTOKEN #REQUIRED>
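To make the DTD concrete, the following Python sketch parses a tiny hand-written instance shaped after the element declarations above. All attribute values and text are invented for illustration and are not taken from a generated XOO7 dataset. Note that `xml.etree.ElementTree` does not validate against a DTD, so the sketch only checks the required AtomicPart attributes by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal instance: one Module, one assembly chain, one CompositePart.
SAMPLE = """\
<Module MyID="1" type="type001" buildDate="1990">
  <Manual MyID="1" title="Manual 1" textLen="11">manual text</Manual>
  <ComplexAssembly MyID="1" type="type001" buildDate="1990">
    <BaseAssembly MyID="2" type="type008" buildDate="1991">
      <CompositePart MyID="3" type="type002" buildDate="1992">
        <Document MyID="3" title="Composite Part 00000003"><para>first paragraph</para></Document>
        <Connection type="type003" length="60001">
          <AtomicPart MyID="4" type="type004" buildDate="1993" x="1" y="2" docId="3"/>
          <AtomicPart MyID="5" type="type004" buildDate="1994" x="3" y="4" docId="3"/>
        </Connection>
      </CompositePart>
    </BaseAssembly>
  </ComplexAssembly>
</Module>
"""

# Attributes the DTD declares as #REQUIRED on AtomicPart.
REQUIRED = ("MyID", "type", "buildDate", "x", "y", "docId")

root = ET.fromstring(SAMPLE)
atomic_ids = [a.get("MyID") for a in root.iter("AtomicPart")]
missing = [
    (a.get("MyID"), name)
    for a in root.iter("AtomicPart")
    for name in REQUIRED
    if a.get(name) is None
]

if __name__ == "__main__":
    print(atomic_ids)  # ['4', '5']
    print(missing)     # []
```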

XOO7 ERD

[Figure: entity-relationship diagram with the entities Module, Manual, Document, ComplexAssembly, CompositeParts, BaseAssembly, Assembly, DesignObj and AtomicPart]

XOO7 Queries

• Query 1 (R1, R2)
  – Randomly generate 5 numbers in the range of AtomicPart's MyID, then return the AtomicParts with those MyIDs.
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart[@MyID = 221 or @MyID = 1000 or @MyID = 535 or @MyID = 13 or @MyID = 2000]
    RETURN $a
• Query 2 (R1, R2)
  – Randomly generate 5 titles for Documents, then return the first paragraph of each Document found by lookup on these titles.
  – FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[@title = "Composite Part 00000009" or @title = "Composite Part 00000050" or @title = "Composite Part 00000034" or @title = "Composite Part 00000022" or @title = "Composite Part 00000080"]
    RETURN $d/para[1]

XOO7 Queries

• Query 3 (R4)
  – Select 5% of AtomicParts via buildDate (in a certain period).
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart[@buildDate .>=. 1900 and @buildDate .<. 1950]
    Return $a
• Query 4 (R13)
  – Find each CompositePart whose buildDate is later than that of the BaseAssembly using it.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $c IN $b/CompositePart[@buildDate .>. $b/@buildDate]
    RETURN $c
• Query 5 (R9)
  – Within the same BaseAssembly, return the AtomicParts whose docId equals the MyID of a Document.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $d IN $b/CompositePart/Document
    LET $a := $b/CompositePart/Connection/AtomicPart
    WHERE $d/@MyID = $a/@docId
    RETURN $a

XOO7 Queries

• Query 6 (R9)
  – Select all BaseAssemblies from one XML database that have the same "type" attribute as, and an earlier buildDate than, BaseAssemblies in another database.
  – FOR $b1 IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $b2 IN document("/export/home/liyg/genxml/small32.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly
    Where $b1/@type = $b2/@type
      and $b1/@buildDate .<. $b2/@buildDate
    RETURN $b1
• Query 7 (R5)
  – Randomly pick two phrases among all phrases in Documents. Select those Documents containing both phrases.
  – FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[contains(., "00000010") and contains(., "document")]
    Return $d
• Query 8
  – (to be changed)

XOO7 Queries

• Query 9 (R18)
  – Select all AtomicParts with corresponding CompositeParts as their sub-elements.
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart
    Return <AtomicPart $a/@*>
             shallow($a/../..)
           </AtomicPart>
• Query 10 (R17, R18)
  – Select all ComplexAssembly elements with type "type008" without knowledge of the path.
  – FOR $ca IN document("small31.xml")//ComplexAssembly[./@type = "type008"]
    RETURN $ca
• Query 11 (R9, R21)
  – Among the first 5 Connections of each CompositePart, select those with length greater than "len".
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[position() .<=. 5][@length .>. 60000]

XOO7 Queries

• Query 12 (R9, R21)
  – For each CompositePart, select the first 5 Connections with length greater than "len".
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[@length .>. 60000][position() .<=. 5]
• Query 13 (R9, R10)
  – For each BaseAssembly, count the number of Documents.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly
    LET $d := $b/CompositePart/Document
    RETURN count($d)
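Queries 11 and 12 differ only in the order of their two predicates, which changes the result because positions are renumbered after each filter. The Python sketch below mimics the two predicate chains on a plain list of Connection lengths; the function names and the sample lengths are made up for illustration, and the list stands in for the Connection children of one CompositePart.

```python
def query11(lengths, limit=5, threshold=60000):
    """[position() <= limit][@length > threshold]: take the first `limit`, then filter."""
    return [n for n in lengths[:limit] if n > threshold]


def query12(lengths, limit=5, threshold=60000):
    """[@length > threshold][position() <= limit]: filter, then take the first `limit`."""
    return [n for n in lengths if n > threshold][:limit]


# Hypothetical @length values of the Connections of one CompositePart, in document order.
lengths = [10, 70000, 20, 80000, 30, 90000, 95000]

if __name__ == "__main__":
    print(query11(lengths))  # [70000, 80000]
    print(query12(lengths))  # [70000, 80000, 90000, 95000]
```

Query 11 can never return more than the long Connections among the first five, whereas Query 12 keeps scanning past position 5 until it has found up to five long ones.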

XOO7 Queries

• Query 14 (R11, R14)
  – Sort CompositeParts in descending order where buildDate is within a year of the current year.
  – FUNCTION year() { "2002" }
    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    Where $c/@buildDate .>=. (year()-1)
    RETURN <result>
             $c
           </result> sortby (buildDate DESCENDING)
• Query 15 (R8)
  – Find BaseAssemblies whose type is not "type008".
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly[not(@type='type008')]
    RETURN $b

XOO7 Queries

• Query 16 (R18)
  – Return all BaseAssemblies of type "type008" without any child nodes.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly[./@type="type008"]
    Return shallow($b)
• Query 17 (R9, R10)
  – Return, without child elements, all Connection elements whose length is greater than Avg(length) within the same CompositePart.
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $con IN $c/Connection[./@length .>. avg($c/Connection/@length)]
    Return shallow($con)

XOO7 Queries

• Query 18 (R17, R18)
  – For CompositePart of type "type008", give 'Result' containing ID of CompositePart and Document.
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    Return <Result $c/@MyID>
             $c/Document
           </Result>
• Query 19
  – Select all CompositePart, Document and AtomicPart elements.
  – <Result>
      Let $m := document("small31.xml") FILTER (self::CompositePart OR self::Document OR self::AtomicPart)
      return $m
    </Result>

XOO7 Queries

• Query 20
  – Select the last Connection of each CompositePart.
  – For $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    return $c/Connection[position() = last()]
• Query 21
  – Select the AtomicParts of the third Connection of each CompositePart.
  – for $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $cn in $c/Connection[position() = 3]
    return $cn/AtomicPart

XOO7 Queries

• Query 22
  – Select each AtomicPart whose MyID is smaller than its sibling's and which occurs before that sibling.
  – for $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection,
        $a1 in $c/AtomicPart
    return $c/AtomicPart[(. BEFORE $a1) AND (./@MyID .<. $a1/@MyID)]
• Query 23
  – Select all Documents after the Document with MyID = 25.
  – FOR $doc in document("small31.xml")
    LET $d := $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[@MyID = 25]
    return <After_DOC>
             $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document AFTER $d
           </After_DOC>

XOO7 DB Parameters

Parameter               Small     Medium    Large
NumAtomicPerComposite   20        200       200
NumConnPerAtomic        3, 6, 9   3, 6, 9   3, 6, 9
DocumentSize (bytes)    500       1000      1000
ManualSize (bytes)      2000      4000      4000
NumCompositePerModule   50        50        500
NumAssmPerAssm          3         3         3
NumAssmLevels           5         5         7
NumCompositePerAssm     3         3         3
NumModules              1         1         1
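A rough feel for how these parameters drive database size can be obtained with a few lines of Python. The sketch rests on two assumptions that the slides do not state explicitly: each module has a single root ComplexAssembly, and NumAssmLevels counts the BaseAssembly level as the last level (which matches the four ComplexAssembly steps followed by BaseAssembly in the query paths for the small database). The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyParams:
    num_assm_per_assm: int       # NumAssmPerAssm
    num_assm_levels: int         # NumAssmLevels
    num_composite_per_assm: int  # NumCompositePerAssm


def base_assemblies(p: AssemblyParams) -> int:
    """Leaves of the assembly tree, assuming one root and a uniform fan-out."""
    return p.num_assm_per_assm ** (p.num_assm_levels - 1)


def composite_part_slots(p: AssemblyParams) -> int:
    """CompositePart children summed over all BaseAssemblies (same assumptions)."""
    return base_assemblies(p) * p.num_composite_per_assm


SMALL = AssemblyParams(num_assm_per_assm=3, num_assm_levels=5, num_composite_per_assm=3)
LARGE = AssemblyParams(num_assm_per_assm=3, num_assm_levels=7, num_composite_per_assm=3)

if __name__ == "__main__":
    print(base_assemblies(SMALL), composite_part_slots(SMALL))  # 81 243
    print(base_assemblies(LARGE), composite_part_slots(LARGE))  # 729 2187
```

Under these assumptions, going from 5 to 7 assembly levels multiplies the number of BaseAssemblies by 9, which is why NumAssmLevels is the dominant knob for the depth and size of the assembly hierarchy.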

XBench

• Capture different application database characteristics
• Capture different application workload characteristics
• Capture full XQuery functionality
• XBench
  – db.uwaterloo.ca/~ddbms/projects/xbench

XBench

• "Relevant, portable, scalable and simple"
• Text and non-text documents
  – Text (e.g., digital libraries)
    • Order of elements important
    • Mixed content
  – Non-text (e.g., transactional data)
    • Only child elements and only data
    • Structured (schema-based) and non-structured (schema-less)
• Single and multiple documents
• Ability to deal with XML Schema definitions/DTDs as well as the lack of them

XBench

• Scalability
  – Small: 10MB, Normal: 100MB, Large: 1GB, Huge: 10GB
• XML Documents
  – Balanced and skewed tree structures
  – Exploit XML features (links, notations, entities, namespaces)
• Workload
  – Queries, updates, bulk loading
• XQuery compatibility
• Implementation independence

System Under Test

• Single machine
• All applications on the same machine
  – A DBMS
  – A client
    • Send / receive
    • Measure & log
• No Web interaction overhead in this version
• Similar to XMark, different from XMach-1
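The client's "send / receive, measure & log" role can be sketched as a small timing harness in Python. This is not XBench's actual driver: `measure`, its parameters, and the stand-in workload are assumptions for illustration, and `run_query` is a placeholder for whatever call submits a query to the DBMS on the same machine and waits for the result.

```python
import statistics
import time
from typing import Callable, Dict


def measure(run_query: Callable[[], object], repetitions: int = 5) -> Dict[str, float]:
    """Run `run_query` several times and summarize the wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # send the query and receive the full result
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; a real client would call into the DBMS here.
    summary = measure(lambda: sum(range(100_000)), repetitions=3)
    print(summary)  # log the measurements
```

Because client and DBMS share one machine, the measured time covers query processing and result delivery but no network or Web-server overhead.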

Database Design

• Characterization
  – Text-centric (TC) vs. data-centric (DC): application
  – Single document (SD) vs. multiple documents (MD): document

      SD                                              MD
TC    GCIDE Dictionary, OED                           Reuters news corpus, Springer DL, DBLP
DC    E-commerce catalogs, IMDB (Internet Movie DB)   E-commerce transactional data

Document Characteristics

Statistics of Parameters in Some Application Domains
(A/E = attributes per element; the number after a collection name is its document count)

Application       Size           Elems        Attrs      Avg A/E  Min A/E  Max A/E  Avg Depth  Min Depth  Max Depth  Avg FanOut   Min FanOut   Max FanOut   Text          Text%  AttV         AttV%
cXML (46)
  Avg             2,340.0        33.2         20.9       0.8      0.0      3.9      4.2        2.5        5.1        1.9          1.0          4.6          189.0         7.9%   192.0        10.3%
  Min             294.0          3.0          6.0        0.3      0.0      2.0      2.0        1.0        2.0        1.0          1.0          1.0          0.0           0.0%   59.0         4.3%
  Max             6,954.0        87.0         60.0       2.0      0.0      5.0      8.1        3.0        10.0       3.4          1.0          10.0         543.0         16.9%  515.0        29.8%
DBLP (4362)
  Avg             581.8          15.4         3.9        0.2      0.0      1.0      1.0        1.0        1.0        14.4         14.4         14.4         271.1         47.7%  40.3         6.3%
  Min             233.0          5.0          1.0        0.0      0.0      1.0      1.0        1.0        1.0        4.0          1.0          4.0          98.0          16.0%  12.0         0.6%
  Max             5,937.0        138.0        125.0      0.9      0.0      1.0      1.2        1.0        2.0        137.0        137.0        137.0        2,273.0       67.4%  1,462.0      33.1%
WCS               1,598,165.0    11,294.0     65,107.0   5.8      1.0      27.0     1.0        1.0        1.0        11,293.0     11,293.0     11,293.0     11,294.0      0.7%   458,222.0    28.7%
GCIDE             57,917,440.0   2,267,510.0  23.0       0.0      0.0      2.0      2.4        1.0        7.0        4.0          1.0          239,185.0    33,594,747.0  58.0%  572.0        0.0%
IMDB1             4,006,587.0    155,887.0    21,960.0   0.1      0.0      1.0      2.0        2.0        2.0        8.4          3.0          18,590.0     1,143,552.0   28.5%  36,437.0     0.9%
IMDB2             11,036,866.0   283,065.0    24,881.0   0.1      0.0      1.0      3.0        2.0        4.0        5.4          1.0          11,024.0     4,446,604.0   40.3%  72,054.0     0.7%
OLAPCube          63,618.0       662.0        3,805.0    5.7      0.0      8.0      2.1        2.0        4.0        55.1         1.0          613.0        2,488.0       3.9%   14,557.0     22.9%
Reuters (1952)
  Avg             2,967.1        38.6         40.7       1.1      0.0      4.0      2.2        1.0        4.0        3.6          1.0          14.1         1,557.0       47.5%  480.8        18.1%
  Min             1,146.0        20.0         23.0       0.2      0.0      4.0      1.9        1.0        4.0        2.2          1.0          7.0          223.0         17.8%  263.0        2.5%
  Max             11,214.0       229.0        161.0      1.5      0.0      4.0      2.9        1.0        4.0        20.9         1.0          200.0        10,005.0      89.2%  1,966.0      30.2%
Shakespeare (37)
  Avg             213,448.5      4,856.5      0.0        0.0      0.0      0.0      3.9        1.0        5.0        5.6          1.0          155.3        126,893.6     59.5%  0.0          0.0%
  Min             141,345.0      3,153.0      0.0        0.0      0.0      0.0      3.9        1.0        4.0        4.6          1.0          71.0         84,458.0      55.8%  0.0          0.0%
  Max             288,735.0      6,636.0      0.0        0.0      0.0      0.0      4.0        1.0        5.0        7.0          2.0          434.0        170,648.0     64.2%  0.0          0.0%
XMark             116,524,435.0  1,666,315.0  381,878.0  0.2      0.0      2.0      4.6        2.0        11.0       3.7          1.0          25,500.0     81,286,567.0  69.8%  4,284,980.0  3.7%

Database Characterization

• Element types
• Tree structure of element types
• Distribution of children to elements
• Distribution of element values to types
• Attribute names
• Distribution of attribute values to names
• Distribution of attributes to elements

Data Gathering Methodology

• Analysis
• Abstraction
  – Statistical analysis to develop probability distributions for each document
• Generalization
  – Statistically combining the two document characteristics to come up with one document
• Database generation
  – Use ToxGene from University of Toronto

      SD                                                         MD
TC    GCIDE Dictionary, OED                                      Reuters news corpus, Springer DL
DC    TPC-W (ITEM+AUTHOR+ADDRESS+COUNTRY tables; catalog data)   TPC-W (All tables; transactional data)

Analysis - DC (TPC-W -> XML)

• Element-oriented mapping vs. attribute-oriented mapping
• Existing mapping methods
  – Flat translation (FT)
  – Nesting-based translation (NeT)
  – Constraint-based translation (CoT)
• Improved mapping methods are used

Analysis - TC

• Stats of occurrence of <chapter>
• Stats of occurrence of <section> for each <chapter>
• Stats of occurrence of <p> for each <chapter>
• Stats of occurrence of <p> for each <section>
• Stats of lengths of content of <p>
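The occurrence statistics listed above can be gathered with a short Python sketch. The helper names and the tiny sample document are assumptions for illustration; they are not part of XBench's tooling. Only direct children are counted here, so a <p> nested inside a <section> does not count towards its enclosing <chapter>.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def child_count_histogram(root: ET.Element, parent_tag: str, child_tag: str) -> Counter:
    """Map 'number of direct <child_tag> children' to 'how many <parent_tag> elements have it'."""
    return Counter(len(parent.findall(child_tag)) for parent in root.iter(parent_tag))


def text_lengths(root: ET.Element, tag: str) -> list:
    """Length of the text content of every <tag> element, in document order."""
    return [len("".join(elem.itertext())) for elem in root.iter(tag)]


# Hypothetical two-chapter document.
SAMPLE = (
    "<book>"
    "<chapter><section><p>aa</p><p>bbb</p></section><section><p>c</p></section></chapter>"
    "<chapter><p>dddd</p><section><p>ee</p></section></chapter>"
    "</book>"
)

root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(child_count_histogram(root, "chapter", "section"))  # one chapter with 2, one with 1
    print(child_count_histogram(root, "section", "p"))        # one section with 2, two with 1
    print(text_lengths(root, "p"))                            # [2, 3, 1, 4, 2]
```

Histograms like these are the raw material from which occurrence distributions for a synthetic generator can be fitted.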

Generalization - TC

• Merge two or more semantically identical element types
  – Same document
  – Different documents
• Assumptions
  – All data sources are equally important
  – Frequencies change proportionally w.r.t. data size

Generation - ToxGene

• Template-based tool generating synthetic XML documents
• The ToxGene Specification Language (TSL) is based on XML Schema
• Features
  – Distribution
  – Re-use

TSL

<tox-distribution name = "c1" type = "exponential" minInclusive = "5" maxInclusive = "100" mean = "35"/>
...
<simpleType name = "my_float">
  <restriction base = "float">
    <tox-number tox-distribution = "c1"/>
  </restriction>
</simpleType>
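What a declaration like "c1" asks for can be approximated in Python as drawing from an exponential distribution with mean 35 and keeping only values in [5, 100]. This is a sketch under assumptions: how ToxGene itself enforces the bounds is not described in the slides, and rejection sampling is used here only because it is simple. Note that after truncation the observed mean is no longer exactly 35.

```python
import random


def sample_bounded_exponential(mean: float, lo: float, hi: float, rng: random.Random) -> float:
    """Draw from an exponential distribution, rejecting values outside [lo, hi]."""
    while True:
        value = rng.expovariate(1.0 / mean)  # rate parameter is 1 / mean
        if lo <= value <= hi:
            return value


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs generate the same data
    values = [sample_bounded_exponential(35, 5, 100, rng) for _ in range(1000)]
    print(min(values) >= 5, max(values) <= 100)  # True True
```

A fixed seed matters for benchmarking: every system under test should be loaded with the same synthetic database.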

Example Database Schema – TC/MD

[Figure: schema diagram]

TPC-W Schema

[Figure: schema diagram]

DC/SD Schema

[Figure: schema diagram]

Synthetic Data Characteristics

Sources                   Size Par.       # Files        Small    Normal   Large    Huge
TC/SD (dictionary.xml)    entry_num       1              1K       10K      100K     1000K
TC/MD (articleXX.xml)     article_num     55-55555       55       555      5555     55555
DC/SD (catalog.xml)       num_items       1              3111     31110    311100   3111000
DC/MD (customer.xml)      num_customers   1              2880     28800    288000   2880000
DC/MD (item.xml)          num_items       1              10       100      1000     10000
DC/MD (orderXX.xml)       num_orders      2592-2592000   2592     25920    259200   2592000
DC/MD (author.xml)        num_authors     1              4        40       400      4000
DC/MD (country.xml)       fixed           1              -        -        -        -
DC/MD (address.xml)       num_addresses   1              5760     57600    576000   5760000

Workload

• Core queries
  – Exact match (shallow/deep): 8 queries
  – Function application:
  – Ordered access (relative/absolute):
  – Queries with quantifiers (existential and universal):
  – Sorting queries (by string types/by others):
• Text-centric queries
  – Document construction (structure preserving/transforming):
  – Irregular data (missing elements/null values):
  – Individual document retrieval:
  – Text search (single/multiple word):
• Data-centric queries
  – References and joins:
  – Data type casting:


End of Lecture
