on relational support for xml publishing

02.19.2004 CS561

On Relational Support for XML Publishing

Beyond Sorting and Tagging

Surajit Chaudhuri, Raghav Kaushik, Jeffrey F. Naughton

Presented by: Conn Doherty


Outline

Motivation & Observations
XML
Topic of Paper
GApply Operator Approach
Transformation Rules
Experiments and Results
Related Work
Conclusions
Future Problems


Motivation

Does the need for efficient XML publishing bring any new requirements for relational query engines, or is sorting query results in the relational engine and tagging them in middleware sufficient?


Observations

The mismatch between the XML data model and the relational model requires relational engines to be enhanced for efficiency

Support for relation-valued variables is needed


XML

Extensible Markup Language (really a metalanguage, or meta-metalanguage)

Rapidly emerging as a standard for exchanging business data

Substantial interest in publishing existing relational data as XML


Current XML Publishing

Most focus has been on issues external to the RDBMS
  – Determining the class of XML views that can be defined
  – Languages used to specify the conversion from relational data to XML
  – Methods of composing XML queries with XML views

Data warehousing has prompted a focus on similar issues internal to the RDBMS


Primary Topic of Paper

Focus closely on the class of SQL queries that are typically generated by XML publishing applications

Ask whether anything needs to be changed within the relational engine to evaluate these queries efficiently


YES!

Differences in the XML and relational data models
  – Cause awkward and inefficient translations of XML queries to relational SQL queries

Main Issue
  – XML’s hierarchical model makes it very convenient and natural to apply operators to subtrees


Part Supplier Example

Part and Supplier Data Set
  – supplier(s_key, s_name)
  – partsupp(ps_suppkey, ps_partkey)
  – part(p_partkey, p_name, p_retailprice)


Part Supplier Example

Query Q1: For each supplier element, return the names and retail prices of all parts supplied by that supplier, and also the overall average retail price of all parts supplied

Example XML Document

<suppliers>
  <supplier>
    <sname>S1</sname>
    <parts>
      <part>
        <pname>P1</pname>
        <retailprice>10</retailprice>
      </part>
      <part>
        <pname>P2</pname>
        <retailprice>10</retailprice>
      </part>
    </parts>
  </supplier>
  <supplier>
    <sname>S2</sname>
    <parts>
      <part>
        <pname>P21</pname>
        <retailprice>12</retailprice>
      </part>
      <part>
        <pname>P22</pname>
        <retailprice>13</retailprice>
      </part>
    </parts>
  </supplier>
</suppliers>


Example Queries

XQuery
  For $s in /doc(tpch.xml)/suppliers/supplier
  Return <ret> $s/s_suppkey
    <parts>
      For $p in $s/part
      Return <part>
        $p/p_name
        $p/p_retailprice
      </part>
    </parts>
    avg($s/part/p_retailprice)
  </ret>

SQL
  (select ps_suppkey, p_name, p_retailprice, null
   from partsupp, part
   where ps_partkey = p_partkey
   union all
   select ps_suppkey, null, null, avg(p_retailprice)
   from partsupp, part
   where ps_partkey = p_partkey
   group by ps_suppkey)
  order by ps_suppkey

The query is hard to express in SQL (relational data model) and inefficient to evaluate
  – SQL is unable to bind a variable to a set of tuples and execute subqueries on that set
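The sorted outer union can be tried on the four parts from the example XML document. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in relational engine, and the key values (1, 2, 21, 22) and the column alias avg_price are assumptions, not taken from the slides. Note that the partsupp/part join appears twice, once per branch of the union.

```python
import sqlite3

# In-memory stand-in for the relational engine; key values are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE partsupp (ps_suppkey INTEGER, ps_partkey INTEGER);
    CREATE TABLE part (p_partkey INTEGER, p_name TEXT, p_retailprice REAL);
    INSERT INTO part VALUES (1, 'P1', 10), (2, 'P2', 10),
                            (21, 'P21', 12), (22, 'P22', 13);
    INSERT INTO partsupp VALUES (1, 1), (1, 2), (2, 21), (2, 22);
""")

# Sorted outer union: detail rows and per-supplier averages share one
# schema, padded with NULLs, and are ordered by supplier for the tagger.
SORTED_OUTER_UNION = """
    SELECT ps_suppkey, p_name, p_retailprice, NULL AS avg_price
      FROM partsupp, part
     WHERE ps_partkey = p_partkey
    UNION ALL
    SELECT ps_suppkey, NULL, NULL, AVG(p_retailprice)
      FROM partsupp, part
     WHERE ps_partkey = p_partkey
     GROUP BY ps_suppkey
    ORDER BY ps_suppkey
"""

rows = conn.execute(SORTED_OUTER_UNION).fetchall()
for row in rows:
    print(row)
```

Each supplier contributes its part rows plus one average row; the relative order of rows within a supplier is not fixed by this query.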


3 Angle Approach

1) New operator, GApply
  – Binds a variable to sets of tuples
  – Allows subqueries to be executed over the set of tuples (temporary relation) bound to a variable

2) Propose transformation rules to modify query plan trees containing the GApply operator

3) Expose the GApply operator in SQL syntax


GApply Operator

Syntax: GApply(GCols, PGQ)
  – GCols: grouping/partitioning columns
  – PGQ: per-group query

The input tuple stream is partitioned on GCols

PGQ is applied to each group

The output is the union of the PGQ results over all groups
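The operator's logical behavior, as described above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine implementation: tuples are modeled as dictionaries, and the names gapply, avg_price, and joined are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Sequence

Row = Dict[str, Any]  # assumption: a tuple is modeled as a column -> value dict


def gapply(rows: Iterable[Row], gcols: Sequence[str],
           pgq: Callable[[List[Row]], List[Row]]) -> List[Row]:
    """Partition rows on gcols, apply pgq to each group, union the results."""
    groups: Dict[tuple, List[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in gcols)].append(row)
    output: List[Row] = []
    for group in groups.values():
        output.extend(pgq(group))  # bag union over all groups
    return output


def avg_price(group: List[Row]) -> List[Row]:
    """Hypothetical per-group query: average retail price within one group."""
    prices = [r["p_retailprice"] for r in group]
    return [{"ps_suppkey": group[0]["ps_suppkey"],
             "avg_price": sum(prices) / len(prices)}]


# Sample input stream: the partsupp/part join for the example data.
joined = [
    {"ps_suppkey": 1, "p_name": "P1", "p_retailprice": 10},
    {"ps_suppkey": 1, "p_name": "P2", "p_retailprice": 10},
    {"ps_suppkey": 2, "p_name": "P21", "p_retailprice": 12},
    {"ps_suppkey": 2, "p_name": "P22", "p_retailprice": 13},
]
result = gapply(joined, ["ps_suppkey"], avg_price)
print(result)
```

The per-group query only ever sees the rows of one group, which is the restriction that lets the engine treat each group as a small temporary relation.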


Terminology

Outer tuple stream: input tuple stream

Inner query: per-group query

Outer child of GApply: root of outer query

Inner child of GApply: root of inner query


PGQ Restrictions

The PGQ may operate only on the temporary relation associated with the group of tuples

This type of operation is also known as groupwise processing

Operators allowed in PGQ: scan, select, project, distinct, apply, exists, union (all), groupby, aggregate, and orderby


Physical Implementation

Two Phases:
  – Partitioning Phase
      – Implemented using sorting or hashing
  – Execution Phase
      – Performed in nested-loop fashion
      – PGQ is evaluated on each group of tuples
      – Each group is a temporary relation bound to a relation-valued parameter $group
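The two phases can be sketched separately to show that sorting and hashing are interchangeable ways of producing the groups. This is an illustrative Python sketch, assuming rows are dictionaries; the function names partition_by_sort, partition_by_hash, and execute are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition_by_sort(rows, gcols):
    """Partitioning phase via sorting: equal keys become adjacent runs."""
    key = itemgetter(*gcols)
    for _, run in groupby(sorted(rows, key=key), key=key):
        yield list(run)


def partition_by_hash(rows, gcols):
    """Partitioning phase via hashing: one bucket per distinct key."""
    key = itemgetter(*gcols)
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row)
    yield from buckets.values()


def execute(groups, pgq):
    """Execution phase: nested-loop style, one PGQ evaluation per group."""
    output = []
    for group in groups:           # each group plays the role of $group
        output.extend(pgq(group))
    return output


rows = [{"k": 2, "v": 1}, {"k": 1, "v": 2}, {"k": 2, "v": 3}]
by_sort = list(partition_by_sort(rows, ["k"]))
by_hash = list(partition_by_hash(rows, ["k"]))
sizes = execute(by_hash, lambda group: [len(group)])
print(by_sort, by_hash, sizes)
```

Both strategies yield the same groups; they differ only in the order in which groups are produced, which matters if a later operator needs sorted output.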


Implementation Diagram

[Diagram: outer child (outer query), partition phase; inner child (inner query), execution phase; NL = nested loop; temporary relation: $group]


Expose GApply in Syntax

It is difficult for the parser and optimizer to determine when GApply applies

Tests on Microsoft SQL Server 2000 with the GApply operator not exposed in the syntax
  – The optimizer only sometimes identifies the need for GApply
  – In each case where it is used, GApply considerably speeds up the query


Proposed Syntax

Proposed extension to SQL syntax

SQL query performing groupwise processing:
  – select gapply(PGQ(x)) as <column list>
    from <relation list>
    where <conditions>
    group by <grouping columns> : x
  – x is a relation-valued variable


Example Query in Syntax

Query Q1:
  – select gapply(PGQ1(tmpSupp))
    from partsupp, part
    where ps_partkey = p_partkey
    group by ps_suppkey : tmpSupp
  – PGQ1(tmpSupp)
      select p_name, p_retailprice, null
      from tmpSupp
      union all
      select null, null, avg(p_retailprice)
      from tmpSupp


Transformation Rules

Precise semantics of the operators

Three categories
  – 1) Pushing Computation into the Outer Query
      – Placing Projections Before GApply
      – Placing Selections Before GApply
      – Converting GApply to groupby
  – 2) Group Selection
  – 3) Pushing GApply Below Joins


Rule 2

Group Selection
  – Consider a PGQ that returns either the whole group (subtree) or nothing, based on a predicate
  – Two methods to evaluate
      – Join suppliers and parts, group by suppkey, check the selection condition on each group, and return the group if it holds
      – Apply the selection first to get the qualifying suppkeys, then join to return their groups
  – The second method wins if the predicate is highly selective


Rule 2 cont.

  – Example
      For $s in /doc(tpch.xml)/suppliers/supplier[/part/p_retailprice > 1000]
      Return $s
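The two evaluation strategies for group selection can be compared on the small example data. The Python sketch below is only an illustration of the idea, not the rule as implemented in an optimizer: the function names are hypothetical, rows are dictionaries, and a threshold of 11 is used instead of 1000 so that the sample data produces a non-empty result.

```python
from collections import defaultdict

# The partsupp/part join for the example data (an assumed input).
joined = [
    {"ps_suppkey": 1, "p_name": "P1", "p_retailprice": 10},
    {"ps_suppkey": 1, "p_name": "P2", "p_retailprice": 10},
    {"ps_suppkey": 2, "p_name": "P21", "p_retailprice": 12},
    {"ps_suppkey": 2, "p_name": "P22", "p_retailprice": 13},
]


def select_after_grouping(rows, threshold):
    """Method 1: build every group, test the predicate per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["ps_suppkey"]].append(row)
    output = []
    for group in groups.values():
        # Existential predicate: some part in the group is expensive enough.
        if any(r["p_retailprice"] > threshold for r in group):
            output.extend(group)
    return output


def select_keys_then_join(rows, threshold):
    """Method 2: find qualifying supplier keys first, then fetch their rows."""
    keys = {r["ps_suppkey"] for r in rows if r["p_retailprice"] > threshold}
    return [r for r in rows if r["ps_suppkey"] in keys]


method1 = select_after_grouping(joined, 11)
method2 = select_keys_then_join(joined, 11)
print(method1 == method2, [r["p_name"] for r in method2])
```

When few suppliers qualify, the second method touches far fewer groups, which is the intuition behind the claim that it wins for highly selective predicates.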


Integrating Rules in Optimizer

None of the rules above can loop, so the optimizer terminates

The optimizer must estimate the cost of the GApply operation


Preliminary Experiments

Performance study
  – Determine the efficacy of the GApply operator in speeding up queries
  – Understand the impact of each proposed transformation rule

Microsoft SQL Server 2000
  – Supports GApply without syntax exposure
  – Control over GApply invocation is needed
      – Simulate the operation of GApply on the client side


Client Side Simulation of GApply

Partition
  – Sorting
  – Hashing (simulation)

Execute
  – Store the result of the outer query in a temporary table
  – For each distinct group (temporary relation), evaluate the PGQ on that relation, then union all results
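A client-side simulation along these lines can be sketched with Python's built-in sqlite3 module standing in for the server. This is an assumption-laden illustration rather than the setup used in the experiments: the table name outer_result, the sample key values, and the way each group is selected with a WHERE clause on the grouping key are all choices made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE partsupp (ps_suppkey INTEGER, ps_partkey INTEGER);
    CREATE TABLE part (p_partkey INTEGER, p_name TEXT, p_retailprice REAL);
    INSERT INTO part VALUES (1, 'P1', 10), (2, 'P2', 10),
                            (21, 'P21', 12), (22, 'P22', 13);
    INSERT INTO partsupp VALUES (1, 1), (1, 2), (2, 21), (2, 22);

    -- Step 1: store the result of the outer query in a temporary table.
    CREATE TEMP TABLE outer_result AS
    SELECT ps_suppkey, p_name, p_retailprice
      FROM partsupp, part
     WHERE ps_partkey = p_partkey;
""")

# Per-group query for Q1, restricted to one group through the :k parameter.
PGQ1 = """
    SELECT p_name, p_retailprice, NULL
      FROM outer_result WHERE ps_suppkey = :k
    UNION ALL
    SELECT NULL, NULL, AVG(p_retailprice)
      FROM outer_result WHERE ps_suppkey = :k
"""

# Step 2: evaluate the PGQ once per distinct group, then union all results.
keys = [k for (k,) in conn.execute(
    "SELECT DISTINCT ps_suppkey FROM outer_result ORDER BY ps_suppkey")]
result = []
for k in keys:
    result.extend(conn.execute(PGQ1, {"k": k}).fetchall())

for row in result:
    print(row)
```

Unlike the sorted outer union, the join is computed once; the cost moves to issuing one query per group from the client, which is overhead a server-side operator would avoid.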


Estimate Running Time

Measure both elapsed time and CPU time

Operator trees have GApply as the topmost operator

Real elapsed time is expected to be lower in a full server-side implementation


Setup

Experimental Setup
  – TPC-H benchmark data
  – 5 GB database
  – Server
      – 1 GHz processor
      – 784 MB main memory
      – 512 MB buffer pool
  – Each query was run several times and the average taken


Results

Effectiveness of GApply
  – Results are comparable whether partitioning is performed by sorting or by hashing
  – Tested 4 queries representing a wide range of queries


GApply Effectiveness Results

  – Main conclusions:
      – GApply is a useful operator even for simple XQuery queries
      – Yields speedups of up to a factor of 2
      – The queries are representative of a wide class of queries
      – Q4 took 20% longer with the client-side implementation
      – Q1, Q2, and Q3 are expected to show performance improvements with a server-side implementation

(hash-based partitioning)


Results cont.

Effectiveness of Optimization Rules
  – Tested the improvement obtained by firing each rule
  – Performance metric is elapsed time
  – Method:
      – Choose a relevant parameterized query
      – Vary the parameter and find the performance benefit for each value
      – Benefit ratio: elapsed time without the rule divided by elapsed time with the rule fired
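The benefit ratio is simple arithmetic, shown here as a small Python helper. The function name and the timings are hypothetical placeholders for illustration, not measurements from the experiments.

```python
def benefit_ratio(elapsed_without_rule: float, elapsed_with_rule: float) -> float:
    """Ratio above 1 means the rule helped; below 1 means it hurt."""
    if elapsed_with_rule <= 0:
        raise ValueError("elapsed time with the rule must be positive")
    return elapsed_without_rule / elapsed_with_rule


# Hypothetical timings in seconds for one parameter value.
ratio = benefit_ratio(6.0, 3.0)
print(ratio)
```

A ratio of 2.0 would mean the query ran twice as fast with the rule fired; a ratio below 1 would indicate that firing the rule increased the cost for that parameter value.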


Rule Effectiveness Example

Query:
  – For $s in /doc(tpch.xml)/suppliers/supplier[/part/p_retailprice > x]
    Return $s
  – The parameter x determines the selectivity of the selection


Results cont.

Effectiveness of Optimization Rules
  – Main conclusions:
      – The proposed rules can have a significant impact on the elapsed time of a query involving GApply
      – Some rules always lowered the cost of the query, while others sometimes lowered and sometimes increased it
      – The benefit of converting GApply to groupby is comparatively lower


Related Work

XPERANTO Project
  – Concluded that pushing as much computation as possible into the relational engine is best

SilkRoute Project
  – Language to specify the conversion between relational data and XML

ROLEX Project
  – To avoid inefficient parsing in applications, the relational engine returns a navigable result tree

Difference
  – This paper asks whether the whole process of XML publishing has any impact on the core relational operators (YES)


Conclusions

The relational engine must provide support for binding a variable to sets of tuples

The required support can be enabled through the GApply operator, with seamless integration into existing relational engines

The operator should be exposed in the syntax

Optimization rules are needed


Future Problems

How should the modified syntax be exploited by algorithms that translate XML queries over XML views of relational data?

Are any other changes needed to meet the requirements of XML publishing?

What changes are needed in the optimizer if the relational database returns navigable results?


Other Papers

D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In VLDB, 1996.
  – Extension to SQL syntax with relational algebra implementation

D. Chatziantoniou and K. A. Ross. Groupwise processing of relational queries. In VLDB, 1997.
  – Methods to identify group query components

C. A. Galindo-Legaria and M. M. Joshi. Orthogonal optimization of subqueries and aggregation. In SIGMOD, 2001.
  – Introduction of the segmentApply operator and many transformation rules
