Using Semantics in XML Data Management

Page 1: Using Semantics in XML Data Management

April 9, 2007 SWIIS, Bangkok 1

Using Semantics in XML Data Management

Tok Wang Ling
Department of Computer Science
National University of Singapore

Gillian Dobbie
Department of Computer Science
University of Auckland

Page 2: Using Semantics in XML Data Management

Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [6]
3. The applications of ORA-SS
  • Semantic query optimization in XML
4. Conclusion

[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.

Page 3: Using Semantics in XML Data Management

Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
  • Semantic query optimization in XML
4. Conclusion

Page 4: Using Semantics in XML Data Management

1. XML – Brief introduction

• XML (eXtensible Markup Language) is
  – Released by W3C
  – An application of SGML
  – A promising standard for publishing, integrating and exchanging data on the web
• XML schemas
  – DTD (Document Type Definition) [4]
  – XSD (XML Schema Definition), the W3C recommended standard [8, 9, 10]

[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

Page 5: Using Semantics in XML Data Management

1. XML – A motivating example

• Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where
  – The document has a root element psj;
  – Under psj, there is a sequence of part elements;
  – Under part, there is a sequence of supplier elements;
  – Under supplier, there is a sequence of project elements.

Page 6: Using Semantics in XML Data Management

Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno>
    <pname>Nut</pname>
    <color>Silver</color>
    <supplier>
      <sno>S001</sno>
      <sname>Alfa</sname>
      <city>Atlanta</city>
      <price>5</price>
      <project>
        <jno>J001</jno>
        <jname>Rocket boots</jname>
        <budget>20000</budget>
        <qty>60</qty>
      </project>
      <project>
        <jno>J003</jno>
        <jname>Firework launcher</jname>
        <budget>250000</budget>
        <qty>650</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S002</sno>
      <sname>Beta</sname>
      <city>Atlanta</city>
      <city>New York</city>
      <price>5.5</price>
      <project>
        <jno>J002</jno>
        <jname>Diving helm</jname>
        <budget>18000</budget>
        <qty>70</qty>
      </project>
      <project>
        <jno>J003</jno>
        <jname>Firework launcher</jname>
        <budget>250000</budget>
        <qty>50</qty>
      </project>
    </supplier>
  </part>
  <part>
    <pno>P002</pno>
    <pname>Nut</pname>
    <color>Copper</color>
    <supplier>
      <sno>S001</sno>
      <sname>Alfa</sname>
      <city>Atlanta</city>
      <price>4.6</price>
      <project>
        <jno>J002</jno>
        <jname>Diving helm</jname>
        <budget>18000</budget>
        <qty>60</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S003</sno>
      <sname>Beta</sname>
      <city>New York</city>
      <price>5</price>
      <project>
        <jno>J001</jno>
        <jname>Rocket boots</jname>
        <budget>20000</budget>
        <qty>20</qty>
      </project>
      <project>
        <jno>J004</jno>
        <jname>Blue fireworks</jname>
        <budget>20000</budget>
        <qty>50</qty>
      </project>
    </supplier>
  </part>
</psj>

Figure 1. Example XML document

Page 7: Using Semantics in XML Data Management

1. XML – the DTD of “psj.xml”

(a) “psj.dtd”, the DTD of “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(b) psj.dtd as a DataGuide

psj
  part
    pno
    pname
    color
    supplier
      sno
      sname
      city
      price
      project
        jno
        jname
        budget
        qty

Figure 2. DTD and DataGuide of Example XML document

Page 8: Using Semantics in XML Data Management

1. XML – what the DTD says

• A DTD is a simple schema definition for XML documents, in which users can define
  – Element/attribute types
  – Occurrence constraints (e.g. ?, +, *)
  – Containment among different element types (the structure)
• DTD cannot express
  – Occurrence constraints in numbers (e.g. 2 to 8)
  – Uniqueness/key constraints on a combination of attributes/elements (in a DTD, an ID can only be declared on a single attribute)
  – Relationship types among elements and their degrees
  – The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type

Simple elements are element types that contain only PCDATA and have no attributes.

Page 9: Using Semantics in XML Data Management

1. XML – XSD

<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

“psj.xsd”, the XSD schema of the motivating example data.

• maxOccurs="unbounded": XSD definition of an element occurrence constraint. (The part element also needs it, since the DTD declares part+ and the example document contains two part elements.)
• xs:key: XSD definition of a key constraint, which requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.

Figure 3. XML Schema of Example XML document

Page 10: Using Semantics in XML Data Management

1. XML – what XSD can tell

• XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors, which
  – has extensible XML syntax,
  – supports more data types (user-defined types and 44 built-in types: 19 primitive and 25 derived),
  – is able to represent uniqueness/key constraints for both attribute types and element types,
  – and has many other improvements in comparison with DTD.

Page 11: Using Semantics in XML Data Management

1. XML – XSD still has flaws

XSD is not sufficient for expressing the relational semantics in XML data, such as:

1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD. It is quite different from the key constraint in relational databases.
  – E.g. in the previous XSD, the values of the key, pno of part, must be unique within the set of part elements in the whole document.
  – Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively: the same supplier or project is repeated under different parents (e.g. S001 appears under both P001 and P002), so its sno or jno value is not unique among the selected elements.
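The uniqueness argument can be checked on the example data. The following is a minimal Python sketch, not part of the original slides: it uses an abridged copy of psj.xml that keeps only the identifiers, and the helper name `duplicate_values` is hypothetical. It mimics the selector/field idea of an XSD key by collecting the field value of every selected element and reporting the values that repeat.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged copy of psj.xml: only the identifiers needed for the check.
PSJ = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S002</sno></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S003</sno></supplier>
  </part>
</psj>
"""

def duplicate_values(root, selector, field):
    """Return the field values that occur more than once among the selected elements."""
    values = [el.findtext(field) for el in root.findall(selector)]
    return sorted(v for v, n in Counter(values).items() if n > 1)

root = ET.fromstring(PSJ)
part_dups = duplicate_values(root, "part", "pno")             # document-wide key on pno
supplier_dups = duplicate_values(root, ".//supplier", "sno")  # document-wide key on sno

print("repeated pno values:", part_dups)
print("repeated sno values:", supplier_dups)
```

On this data pno never repeats, while sno value S001 repeats because supplier S001 is nested under two parts, which is why a document-wide key on sno would be violated even though sno identifies the supplier.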

Page 12: Using Semantics in XML Data Management

1. XML – XSD still has flaws (cont.)

• The key element must contain the following (in order):
  a) One and only one selector element
     – contains an XPath expression that specifies the set of elements across which the values specified by the field must be unique
  b) One or more field elements
     – each contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element
• The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.

Page 13: Using Semantics in XML Data Management

1. XML – XSD still has flaws (cont.)

2. XSD does not support relationship types and other relational semantic constraints.
  – E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.

3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
  – E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.

Page 14: Using Semantics in XML Data Management

Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
  • Semantic query optimization in XML
4. Conclusion

Page 15: Using Semantics in XML Data Management

2. ORA-SS in a nutshell

• ORA-SS is a semantically rich data model for semi-structured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.
• In comparison with a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.

Page 16: Using Semantics in XML Data Management

2. ORA-SS in a nutshell

• A complete ORA-SS model has 4 diagrams
  – Schema diagram
    • Represents the structure of and constraints (business rules) on XML documents
  – Instance diagram
    • Visually represents the graphical structure of XML data
  – Functional dependency diagram
    • Represents FDs in relationship types
  – Inheritance diagram
    • Represents the specialization/generalization relationships among different object classes in ORA-SS

Page 17: Using Semantics in XML Data Management

2. ORA-SS data models

• Object class
  – attributes of object class
  – ordering on object class
• Relationship type
  – degree of relationship type
  – participating object classes in relationship type
  – attributes of relationship type
  – disjunctive relationship type
  – recursive relationship type
  – ID dependent relationship type

Page 18: Using Semantics in XML Data Management

2. ORA-SS data models (cont.)

• Attribute
  – attributes of object class or relationship type
  – key attribute (OID)
  – foreign key / referential constraint (IDREF/IDREFS)
  – composite attribute
  – disjunctive attribute
  – attribute with unknown structure
  – ordering on attributes
  – fixed or default value of attribute
  – derived attribute

Page 19: Using Semantics in XML Data Management

The ORA-SS schema diagram of Example 1

• Part, supplier and project are modeled as object classes.
• pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• PS is a binary relationship type between part and supplier.
• PSJ is a ternary relationship type defined among part, supplier and project.
• price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

[Figure 4 diagram: part (pno, pname, color) above supplier (sno, sname, city +) above project (jno, jname, budget); the edge from part to supplier is labeled "PS, 2, +, +" and the edge from supplier to project is labeled "PSJ, 3, +, +"; price is labeled PS and qty is labeled PSJ]

Figure 4. ORA-SS schema diagram of Example XML document
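To make the distinction between object-class attributes and relationship attributes concrete, the schema described above can be written down as plain data. This is a minimal Python sketch under stated assumptions: the class names `ObjectClass` and `RelationshipType` and the helper `owner_of` are hypothetical illustrations and not an ORA-SS API; the attribute assignments follow the slide text and the DTD of psj.xml.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectClass:
    name: str
    key: str                                   # object identifier (OID)
    attributes: list = field(default_factory=list)

@dataclass
class RelationshipType:
    name: str
    participants: list                         # object class names, top of the hierarchy first
    attributes: list = field(default_factory=list)

    @property
    def degree(self):
        """Number of participating object classes (2 = binary, 3 = ternary)."""
        return len(self.participants)

part = ObjectClass("part", key="pno", attributes=["pname", "color"])
supplier = ObjectClass("supplier", key="sno", attributes=["sname", "city"])
project = ObjectClass("project", key="jno", attributes=["jname", "budget"])

ps = RelationshipType("PS", ["part", "supplier"], attributes=["price"])
psj = RelationshipType("PSJ", ["part", "supplier", "project"], attributes=["qty"])

SCHEMA = [part, supplier, project, ps, psj]

def owner_of(attribute, owners):
    """Return the name of the object class or relationship type that owns the attribute."""
    for owner in owners:
        if attribute in owner.attributes or attribute == getattr(owner, "key", None):
            return owner.name
    return None

print(owner_of("sname", SCHEMA), owner_of("price", SCHEMA), psj.degree)
```

In the XML document price and sname are sibling elements under supplier, yet here they resolve to different owners, which is exactly the information a DTD or XSD does not record.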

Page 20: Using Semantics in XML Data Management

ORA-SS – Semantic Advantages

• ORA-SS can represent the following semantics that DTD and XML Schema cannot:
  – Attribute vs. object class
  – Multi-valued attribute vs. object class
  – Identifier (ID)
  – IDREF or foreign key
  – n-ary relationship type
  – Attribute of object class vs. attribute of relationship type
  – View of XML document

Page 21: Using Semantics in XML Data Management

Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
  • Semantic query optimization in XML
4. Conclusion

Page 22: Using Semantics in XML Data Management

3. ORA-SS applications

• Due to the rich semantics in ORA-SS, the model can be widely used in
  – Normal form XML schema
  – Relational/object-relational storage of XML data
  – XML schema/data integration
  – XML query optimization [12]
  – XML aggregates evaluation
  – XML view creation and validation [2]
  – XML graphical query language and output [7]
  – XML keyword search [13]
  – etc.

We will illustrate these in detail.

[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland, Oct 7-11, 2002.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.

Page 23: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization

• The semantic information represented in ORA-SS is helpful in optimizing XML queries.
  – Many algorithms have been proposed for XML query optimization, e.g. TwigStack [1] and its variants.
  – When the ORA-SS semantics of the data are known, they can be taken into account for query optimization.

[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference, 2002.

Page 24: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization

Example: Consider the following simple query:

(Query 1) Display the budget of project “J001”.

  //project[jno = "J001"]/budget

• Traditional processing scans the whole XML document, checking every project with jno = "J001" and finding all the corresponding budget values.
• However, in ORA-SS, since jno is the object ID of project, we have the functional dependency

  jno → budget

  so the optimized processing only needs to find the first project instance with jno = "J001" and return the corresponding budget value.
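The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration on an abridged copy of psj.xml, not the optimizer described in the talk; the function names are hypothetical, and the counters exist only to show how many project elements each strategy inspects.

```python
import xml.etree.ElementTree as ET

# Abridged copy of psj.xml: project J001 occurs under two different suppliers.
PSJ = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J003</jno><budget>250000</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
    </supplier>
  </part>
</psj>"""

def budgets_full_scan(root, jno):
    """Visit every project element and collect all matching budget values."""
    visited, budgets = 0, []
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            budgets.append(project.findtext("budget"))
    return budgets, visited

def budget_first_match(root, jno):
    """Stop at the first match; this is only safe because jno -> budget holds."""
    visited = 0
    for project in root.iter("project"):
        visited += 1
        if project.findtext("jno") == jno:
            return project.findtext("budget"), visited
    return None, visited

root = ET.fromstring(PSJ)
all_budgets, scanned = budgets_full_scan(root, "J001")
first_budget, probed = budget_first_match(root, "J001")
print(all_budgets, scanned)
print(first_budget, probed)
```

The full scan inspects every project and returns the same budget twice, whereas the early-terminating version stops after the first hit; without the functional dependency the second version could silently drop a different budget value.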

Page 25: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

• Most existing algorithms focus on the structural search of twig pattern queries.
• Few of them pay much attention to content search, i.e. searching on the values of elements.
• They treat content nodes (or values) the same as element nodes.
• Disadvantages:
  – Too many label streams for contents
  – Difficult to find the actual values of labels as output solutions
• We propose VERT (Value Extraction with Relational Table).

Page 26: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

• Idea of VERT:
  1. Introduce relational tables to store document values instead of treating them as nodes and labeling them.
  2. Rewrite and optimize XML twig queries based on the underlying relational tables.
  3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).

Page 27: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

1. Introduce relational tables to store document values instead of treating them as nodes and labeling them.

E.g. the values for price (title, etc.) of the XML tree in Figure 5 can be stored with the labels of the price (title, etc.) elements in Figure 6.

Figure 5. Example XML document 2

Figure 6. Example VERT tables

Page 28: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

2. Rewrite and optimize XML twig queries based on the underlying relational tables.

E.g.
  – Rewrite the twig query in Figure 7(a) to the twig in Figure 7(b).
  – Execute SQL on table Rprice of Figure 6 to get the labels of all price elements with value greater than 15, and form the stream Tprice>15.
  – Perform structural joins based on these labels for the price elements (i.e. Tprice>15) with the book and ISBN elements.

Benefits:
• Saves the stream merging of all price elements with values > 15
• Saves the structural join between price elements and their values

[Figure 7 diagrams: (a) twig query: book with children ISBN and price, where price has the value node ">15"; (b) rewritten query: book with children ISBN and "price>15"]

Figure 7. Example twig query: (a) twig query, (b) rewritten query
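The value-table step can be sketched with an in-memory SQLite database. This is an illustrative sketch only: Figure 6 is not reproduced in this transcript, so the table layout Rprice(label, value), the Dewey-style labels and the sample prices are all assumptions, and the parent lookup by label prefix stands in for the structural join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout of the VERT table for price: one row per price element,
# keyed by a hypothetical Dewey-style label (book 1.2 has price element 1.2.3).
conn.execute("CREATE TABLE Rprice (label TEXT PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO Rprice VALUES (?, ?)",
    [("1.1.3", 12.0), ("1.2.3", 30.0), ("1.3.3", 18.5)],
)

# Content search is answered by SQL instead of by scanning value nodes.
rows = conn.execute(
    "SELECT label FROM Rprice WHERE value > 15 ORDER BY label"
).fetchall()
t_price_gt_15 = [label for (label,) in rows]   # plays the role of stream Tprice>15

# With Dewey-style labels the parent book is the label prefix, so the
# structural join with book reduces to a prefix computation in this sketch.
book_labels = [label.rsplit(".", 1)[0] for label in t_price_gt_15]

print(t_price_gt_15)
print(book_labels)
```

Only the labels that satisfy the predicate enter the stream, so the later structural joins never see price elements with values of 15 or less.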

Page 29: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).

Optimization 1 (VERT-1): store the values of price (title, etc.) with the labels of the book objects, since price (title) is a property of the book object class according to the semantics captured in ORA-SS (shown in Figure 8).

Benefit: further saves the structural joins between price and book and between ISBN and book for the query in Figure 7.

Figure 8. VERT tables with optimization 1

Page 30: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

3. Further optimize the relational tables for query processing if more semantic information is available (i.e. more semantics, better optimization).

Optimization 2 (VERT-2): pre-merge the tables of title, price, etc. in Figure 8 if we further know that they are single-valued attributes of the book object class according to the semantics in ORA-SS (shown in Figure 9). (Note: the multi-valued attribute author should not be merged.)

Benefit: saves expensive structural joins by using an efficient selection on the merged table for the query in Figure 7.

Figure 9. VERT tables with optimization 2
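A pre-merged table turns the twig query of Figure 7 into a single selection. The sketch below assumes a layout for the merged table, since Figure 9 is not reproduced in this transcript: the table names Rbook and Rauthor, the column names, the labels and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single-valued attributes of book are pre-merged into one row per book object.
conn.execute(
    "CREATE TABLE Rbook (label TEXT PRIMARY KEY, isbn TEXT, title TEXT, price REAL)"
)
# The multi-valued attribute author stays in its own table, one row per value.
conn.execute("CREATE TABLE Rauthor (label TEXT, author TEXT)")

conn.executemany(
    "INSERT INTO Rbook VALUES (?, ?, ?, ?)",
    [
        ("1.1", "isbn-001", "Title A", 12.0),
        ("1.2", "isbn-002", "Title B", 30.0),
        ("1.3", "isbn-003", "Title C", 18.5),
    ],
)
conn.executemany(
    "INSERT INTO Rauthor VALUES (?, ?)",
    [("1.2", "Author X"), ("1.2", "Author Y")],
)

# Query of Figure 7 (book with ISBN and price > 15) as one selection, no joins.
matches = conn.execute(
    "SELECT label, isbn FROM Rbook WHERE price > 15 ORDER BY label"
).fetchall()
print(matches)
```

Keeping author separate avoids repeating the book row once per author, which is the reason the slide excludes multi-valued attributes from the merge.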

Page 31: Using Semantics in XML Data Management

3. ORA-SS applications – Semantic query optimization: content search

Experimental results on three datasets, i.e. NASA, DBLP and XMark (Figure 10)

• VERT outperforms TwigStack in query processing time.
• VERT-2 is superior to VERT-1, which is in turn better than the original VERT.

Figure 10. Experimental results of VERT

Page 32: Using Semantics in XML Data Management

3. ORA-SS applications – XML query with aggregates

• XML semantics captured in ORA-SS are crucial for correctly writing queries with aggregates.

Example. Consider the query:

(Query 3) Find the average budget of all the projects.

Two potential XQuery expressions are:

XQ.3a
  let $bgts := for $pid in distinct-values(//project/jno)
               return (//project[jno = $pid])[1]/budget
  return <avg_bgt>{avg($bgts)}</avg_bgt>

XQ.3b
  let $bgts := //project/budget
  return <avg_bgt>{avg($bgts)}</avg_bgt>

Page 33: Using Semantics in XML Data Management

3. ORA-SS applications – XML query with aggregates

Example (cont.)

• If we know from ORA-SS that jno is the OID or key of the project object class, i.e.

  jno → budget

  then we can easily judge that XQ.3a is a correct XQuery expression while XQ.3b is incorrect, because some projects may appear more times than other projects in the XML document and their budgets would then be counted repeatedly.

• If we do not know these semantics, it is difficult to say which XQuery expression is correct.
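The effect of the duplicates can be worked through on the example document. The sketch below lists the seven project occurrences of psj.xml (Figure 1) as (jno, budget) pairs and computes both averages in Python; the variable names are hypothetical, and the two computations correspond to counting each project once versus counting every occurrence.

```python
# Every project occurrence in psj.xml (Figure 1), in document order.
PROJECT_OCCURRENCES = [
    ("J001", 20000), ("J003", 250000),   # part P001, supplier S001
    ("J002", 18000), ("J003", 250000),   # part P001, supplier S002
    ("J002", 18000),                     # part P002, supplier S001
    ("J001", 20000), ("J004", 20000),    # part P002, supplier S003
]

# Counting every occurrence: repeated projects are weighted more heavily.
naive_avg = sum(b for _, b in PROJECT_OCCURRENCES) / len(PROJECT_OCCURRENCES)

# Counting each project once: valid because jno -> budget holds.
budget_by_project = {}
for jno, budget in PROJECT_OCCURRENCES:
    budget_by_project.setdefault(jno, budget)
distinct_avg = sum(budget_by_project.values()) / len(budget_by_project)

print(round(naive_avg, 2), distinct_avg)
```

There are four distinct projects with budgets summing to 308000, so the per-project average is 77000, while the seven occurrences sum to 596000 and give roughly 85142.86, inflated by the repeated J003.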

Page 34: Using Semantics in XML Data Management

3. ORA-SS applications – Define and validate XML views

• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join

For example, consider the following swap operation, which exchanges the hierarchical levels of supplier and part:

[Figure 11 diagrams: the source ORA-SS schema diagram (as in Figure 4), followed by a valid view and an invalid view over supplier, part and project with the attributes price and qty]

• Valid view vs. invalid view: because price is a relationship attribute, it cannot be moved up with the supplier elements; that would be semantically meaningless in the resulting view.

Figure 11. Example view definition 1

Page 35: Using Semantics in XML Data Management

3. ORA-SS applications – Define and validate XML views

As another example, consider the following projection operation, which drops supplier from the structure:

[Figure 12 diagrams: the source ORA-SS schema diagram (as in Figure 4), followed by a valid view over part and project with the attributes Avg_price and Total_qty, and an invalid view over part and project with the attributes price and qty]

• Valid view vs. invalid view: dropping supplier makes price and qty multi-valued attributes, so we should apply aggregation functions to get a meaningful view.

Figure 12. Example view definition 2
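The aggregated view can be computed directly from the example data. The sketch below is one possible reading of Figure 12, stated as an assumption: since price belongs to PS, Avg_price is aggregated per part, and since qty belongs to PSJ, Total_qty is aggregated per (part, project) pair. The variable names are hypothetical; the values are taken from psj.xml (Figure 1).

```python
from collections import defaultdict

# price is an attribute of PS: one value per (part, supplier) pair.
PS_PRICE = {
    ("P001", "S001"): 5.0, ("P001", "S002"): 5.5,
    ("P002", "S001"): 4.6, ("P002", "S003"): 5.0,
}

# qty is an attribute of PSJ: one value per (part, supplier, project) triple.
PSJ_QTY = [
    ("P001", "S001", "J001", 60), ("P001", "S001", "J003", 650),
    ("P001", "S002", "J002", 70), ("P001", "S002", "J003", 50),
    ("P002", "S001", "J002", 60),
    ("P002", "S003", "J001", 20), ("P002", "S003", "J004", 50),
]

# Dropping supplier leaves several prices per part: aggregate with an average.
prices_by_part = defaultdict(list)
for (pno, _sno), price in PS_PRICE.items():
    prices_by_part[pno].append(price)
avg_price = {pno: sum(ps) / len(ps) for pno, ps in prices_by_part.items()}

# Dropping supplier leaves several quantities per (part, project): aggregate with a sum.
total_qty = defaultdict(int)
for pno, _sno, jno, qty in PSJ_QTY:
    total_qty[(pno, jno)] += qty

print(avg_price["P001"], total_qty[("P001", "J003")])
```

For part P001 the two supplier prices 5 and 5.5 collapse to an average of 5.25, and the two supplies of project J003 (650 and 50) collapse to a total of 700; leaving price and qty unaggregated would attach several unrelated values to one element.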

Page 36: Using Semantics in XML Data Management

3. ORA-SS applications – Graphical XML query based on ORA-SS

A graphical XML query language has been designed on the basis of ORA-SS.

• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.

Query 1: Select and display the projects that do not have any suppliers located in Atlanta.

Figure 13. Screenshot of the user interface of our graphical query language

Page 37: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

• Keyword search is a user-friendly way to query XML documents.
• Most existing algorithms are based on either the tree data model or the graph (digraph) data model of XML, without the semantics.

Page 38: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

• Tree data model (LCA [11])
  – Lowest Common Ancestor (LCA)
    • Contains all the keywords
    • Has no descendant node containing all the keywords
• Graph (digraph) data model (BANKS [5])
  – Reduced sub-tree
    • A tree T in the graph (digraph) containing all the keywords
    • No proper sub-tree of T contains all the keywords
• Limitations of keyword search without semantics
  – May have difficulty in representing results
  – May return many irrelevant results

[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference, pages 505-516, 2005.
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, pages 537-538, 2005.
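The tree-model definition above (a node that contains all the keywords and has no descendant that does) can be implemented naively in a few lines. This is a minimal sketch for illustration, not the algorithm of [11]: the toy document is hypothetical (Figure 14 is not reproduced in this transcript), keyword matching is a plain case-insensitive substring test, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy bibliography.
DOC = """<dblp>
  <paper><title>semistructured data</title><author>Widom</author></paper>
  <paper><title>query processing</title><author>Ling</author></paper>
</dblp>"""

def contains_all(node, keywords):
    """True if the text in the subtree rooted at node mentions every keyword."""
    text = " ".join(node.itertext()).lower()
    return all(k.lower() in text for k in keywords)

def lowest_common_ancestors(node, keywords):
    """Nodes whose subtree contains every keyword while no descendant's subtree does."""
    if not contains_all(node, keywords):
        return []
    results = []
    for child in node:
        results.extend(lowest_common_ancestors(child, keywords))
    # If no child subtree qualifies, this node itself is a lowest match.
    return results or [node]

root = ET.fromstring(DOC)
q1_tags = [n.tag for n in lowest_common_ancestors(root, ["Widom"])]
q2_tags = [n.tag for n in lowest_common_ancestors(root, ["semistructured", "query", "processing"])]
print(q1_tags, q2_tags)
```

On this toy data a single-keyword query returns only the author element, and a query whose keywords are spread over two papers returns the document root, which illustrates the two limitations listed above: too little context in one case and far too much in the other.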

Page 39: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

Example:
• Q1 = {Widom}
  – LCA and reduced sub-tree give node 1.1.1
  – Not enough information
• Q2 = {semistructured query processing}
  – LCA(Q2) = dblp (i.e. the whole XML database) … overwhelming information
  – Reduced sub-tree results include all papers with either “semistructured” or “query processing”. However, not all “query processing” papers are about “semistructured”.

Figure 14. Example XML document 3

Page 40: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

• Therefore, we propose ICA (Interested Common Ancestor) and IRA (Interested Related Ancestors) to exploit the semantics for ranked keyword search.
• Ideas:
  1. The DBA defines the set of interested object classes and the conceptual connections between objects.
     E.g. in DBLP, publication and author can be the interested object classes; references/citations can be one type of conceptual connection between publications.
     Note: we can group all publications for each author object.

Page 41: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

• Ideas:
  2. The results of a keyword query include interested objects based on the ICA and IRA semantics.
     – The results of ICA (Interested Common Ancestor) include all objects that each contain all the query keywords.
     – The results of IRA (Interested Related Ancestors) include all object pairs (o, o’) such that
       – the pair together contains all the keywords, AND
       – o and o’ are conceptually connected.
     Note: we output a list of IRA objects instead of IRA pairs.

Intuitive meaning of IRA:

For the query “semistructured query processing”, if a paper P with title “query processing” cites or is cited by a paper with title “semistructured”, then P is considered related to the query; at least it is a better result than “query processing” papers that neither cite nor are cited by “semistructured” papers.
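The two result definitions can be sketched on a toy set of objects. The following Python sketch only illustrates the definitions as stated on the slide and is not the authors' system: the papers, their word sets and the single citation edge are hypothetical, and keyword containment is modeled as set inclusion.

```python
from itertools import combinations

# Hypothetical interested objects (papers) and the words they contain.
PAPERS = {
    "p1": {"semistructured", "data", "model"},
    "p2": {"query", "processing", "techniques"},
    "p3": {"query", "processing", "survey"},
    "p4": {"semistructured", "query", "processing"},
}
# Hypothetical conceptual connections: p2 cites p1.
CITES = {("p2", "p1")}

def connected(a, b):
    """Two papers are conceptually connected if one cites the other."""
    return (a, b) in CITES or (b, a) in CITES

def ica(keywords):
    """Objects that each contain all query keywords."""
    return sorted(o for o, words in PAPERS.items() if keywords <= words)

def ira(keywords):
    """Objects belonging to some connected pair that jointly contains all keywords."""
    hits = set()
    for a, b in combinations(PAPERS, 2):
        if connected(a, b) and keywords <= (PAPERS[a] | PAPERS[b]):
            hits.update((a, b))
    return sorted(hits)

query = {"semistructured", "query", "processing"}
ica_result = ica(query)
ira_result = ira(query)
print(ica_result, ira_result)
```

Here p4 is an ICA result on its own, p1 and p2 are IRA results because the citation links them and together they cover the query, and p3 is returned by neither because it is a "query processing" paper with no connection to a "semistructured" one.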

Page 42: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

• Ideas:
  3. The system automatically ranks the result objects for output based on the following metrics.
     – RelevanceRank
       Intuitive meaning: for the query “semistructured query processing”, given two papers P1 and P2 containing “query processing”, if P1 cites or is cited by many “semistructured” papers whereas P2 cites or is cited by few “semistructured” papers, then P1 is considered more relevant to the query.
     – Keyword Proximity Rank (ProxRank)
       Intuition: the fewer elements within an object that are needed to directly contain all the keywords, the better the object is as a result.

Page 43: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

Experimental evaluation based on DBLP

• Our approach outperforms most existing academic demos in both execution time and result quality.

Figure 15. Execution time

Figure 16. Comparison of relevant results in the top-10, 20, 30 answers among academic demos

Page 44: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

Experimental evaluation based on DBLP

• Our approach is comparable or superior to the commercial systems Google Scholar and Microsoft Libra in terms of result quality, even though they can search much more web data.

Figure 17. Comparison of relevant results in the top-10, 20, 30 answers with commercial systems

Page 45: Using Semantics in XML Data Management

3. ORA-SS applications – XML keyword search with semantics

A demo prototype of our keyword search system on DBLP data is available at

http://xmldb.ddns.comp.nus.edu.sg

Figure 18. User interface of the demo system

Page 46: Using Semantics in XML Data Management

Roadmap

1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
  • Semantic query optimization in XML
4. Conclusion

Page 47: Using Semantics in XML Data Management

4. Conclusion

1. We have presented a data-centric XML document and shown the limitations of the current XML schema standards in representing relational semantics and constraints.

Page 48: Using Semantics in XML Data Management

4. Conclusion

2. We have shown that semantics in XML data are crucial in many applications, such as
  • XML query optimization
  • XML query optimization for content search
  • XML aggregate computation
  • XML view creation and validation
  • XML graphical query language and output
  • XML keyword search
  • etc.

Page 49: Using Semantics in XML Data Management

4. Conclusion

3. Much of the semantic information of XML data can be expressed in ORA-SS, which is a semantically rich data model, but not in DTD or XML Schema.

Page 50: Using Semantics in XML Data Management

References:

[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference, 2002.
[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland, Oct 7-11, 2002.
[3] C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company, 1981.
[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation, 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference, pages 505-516, 2005.
[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation, 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, pages 537-538, 2005.
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.

Page 51: Using Semantics in XML Data Management

Q & A

Page 52: Using Semantics in XML Data Management

The End