xml and databases - uzhc39a954d-4083-4c02... · interplay between xml and databases. the following...

34
XML and Databases Spring Semester 2017 Date: Thursday, 08:15-9:45, BIN-2.A.10 Lecturer: Dr. Can Türker (Functional Genomics Center Zurich) Email: [email protected] WWW: http://www.ifi.uzh.ch/dbtg/teaching/courses/xmlanddatabases.html

Upload: others

Post on 03-Jun-2020

25 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

XML and Databases

Spring Semester 2017

Date: Thursday, 08:15-9:45, BIN-2.A.10

Lecturer: Dr. Can Türker (Functional Genomics Center Zurich)

Email: [email protected]

WWW: http://www.ifi.uzh.ch/dbtg/teaching/courses/xmlanddatabases.html

Page 2: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-2Lecture "XML and Databases" - Dr. Can Türker

Google Indicator (February 13, 2017)

Database 940'000'000

XML 427'000'000

FIFA 353'000'000

SQL 224'000'000

UEFA 160'000'000

Zürich 98'900'000

W3C 60'000'000

ETH 38'500'000

UBS 32'900'000

DB2 19'500'000

FGCZ 42'500

Can Türker 23'700

XML and Databases: Both terms seam to be highly relevant!

In this lecture you will learn more about XML and its interplay with databases!

Page 3: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

Lecture Summary

Description: Today, the W3C standard XML is widely used as document format for exchanging data over the Internet. While the generation of XML data is easy, the management of XML data requires systems that are able to efficiently store, query, and process XML data. With other words, database technology is required for handling XML data. The goal of this lecture is to teach the interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages (XPath, XQuery) for declarative access to XML data, XML processor technologies, mapping between XML and databases including efficient storage and index structures for XML data. A further central concern of this lecture is to show the practical relevance of all presented concepts by demonstrating how they are realized in major database management systems such as Oracle, IBM DB2, Microsoft SQL Server, and PostgreSQL.

Goal: Achieve deep understanding of XML and its interplay with database technology

Prerequisite: Databases (Bachelor level), i.e., basic knowledge in databases

Recommended: MSc Studies

Documents: Before each lesson, presentations and demo scripts are provided for download via the lecture’s home page http://www.ifi.uzh.ch/dbtg/teaching/courses/xmlanddatabases.html

1-3Lecture "XML and Databases" - Dr. Can Türker

Page 4: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

Topics

1-4Lecture "XML and Databases" - Dr. Can Türker

Storage

UpdateGenerate

Query & Search

XML Databases

Transformation

Create

XML Application

XML Document

Access

Process

Output

Presentation

5/6 Mapping between XML and Database

9 XML Update

10 XML Support in Database Systems

2 XML Language

3 XML Processors

4 XML Query

Languages

7 SQL/XML

8 XML Indexing

Page 5: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-5Lecture "XML and Databases" - Dr. Can Türker

Schedule

Week Date Chapter

8 23.02.2017 Introduction and Motivation

9 02.03.2017 XML & XML-DTD

10 09.03.2017 XML Schema & XML Processors

11 16.03.2017 XML Processors

12 23.03.2017 XML Query Languages

13 30.03.2017 XML Query Languages

14 06.04.2017 XML Query Languages

15 13.04.2017 Mapping between XML and Databases

16 20.04.2017 No lecture!

17 27.04.2017 Mapping between XML and Databases

18 04.05.2017 SQL/XML

19 11.05.2017 SQL/XML

20 18.05.2017 XML Indexing & XML Update

21 25.05.2017 No lecture!

22 01.06.2017 XML Support in Databases Systems

24 15.06.2017 Examination

Page 6: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-6Lecture "XML and Databases" - Dr. Can Türker

Literature

M. Klettke, H. Meyer: XML & Datenbanken. dpunkt Verlag, 2002

H. Schöning: XML und Datenbanken - Konzepte and Systeme. Carl Hanser Verlag, 2002

M. Seemann: Native XML-Datenbanken im Praxiseinsatz. Software & Support Verlag, 2003

A. B. Chaudhri, A. Rashid, R. Zicari: XML Data Management: Native XML and XML-Enabled Database Systems. Addison-Wesley, 2003

S. Abiteboul, P. Buneman, D. Suciu: Data on the Web: From Relations to Semistructured Data and XML.Morgan Kaufmann, 1999

G. Powell: Beginning XML Databases. Wiley / Wrox, 2007

Page 7: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

Chapter 1

Introduction and Motivation

Motivation of XML via

1) HTML

2) Databases

3) Information Retrieval

Page 8: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-8Lecture "XML and Databases" - Dr. Can Türker

HTML vs. XML

<html><head><title>The Tragedy of Hamlet, Prince of Denmark</title></head><body><h1>The Tragedy of Hamlet, Prince of Denmark</h1><h2>Dramatis Personae</h2><p>CLAUDIUS, king of Denmark. </p><p>HAMLET, son to the late, and nephew to the present

king. </p><h2>Scene Description</h2><p>SCENE Denmark.</p><h2>Subtitle</h2><p>HAMLET</p><h2>ACT I</h2><h3>SCENE I. Elsinore. A platform before the

castle.</h3><p><i>FRANCISCO at his post. Enter to him

BERNARDO</i><br>BERNARDO: Who'sthere?</p>…

<?xml version="1.0" encoding="ISO-8859-1"?><play><title>The Tragedy of Hamlet, Prince of

Denmark</title><personae><title>Dramatis Personae</title><persona>CLAUDIUS, king of Denmark.</persona><persona>HAMLET, son to the late, and nephew to the

present king.</persona></personae><scndescr>SCENE Denmark.</scndescr><playsubt>HAMLET</playsubt><act> <title>ACT I</title><scene><title>SCENE I. Elsinore. A platform before the

castle.</title><stagedir>FRANCISCO at his post. Enter to him

BERNARDO</stagedir><speech> <speaker>BERNARDO</speaker> <line>Who'sthere?</line></speech> ...

</scene>...

</act>...

</play>

Page 9: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-9Lecture "XML and Databases" - Dr. Can Türker

Web provides information embedded in HTML documents

– Positive:

Platform-independent, producer-independent

Always accessible from anywhere

Easy access/download via HTTP protocol

Applications can easily generate HTML documents

– Negative:

HTML document processing is cumbersome within applications since HTML primarily describes layout but not structure of documents

HTML provides only a fixed set of structuring elements, which can neither be updated nor extended by users

Only simple, text-based queries supported

XML tackles the disadvantages of HTML without giving up it advantages

– Data transfer is based on Internet protocols such as HTTP

– Structuring elements can be user-defined

– Structuring elements determine not only layout but also structure

– User-defined structuring elements support arbitrary complex queries

Motivation via HTML

Page 10: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-10Lecture "XML and Databases" - Dr. Can Türker

XML as Web Standard

XML documents

– Generated by applications

– Consumed by applications

– Used for data exchange

Data description and exchange as main scenarios for XML usage

– Weakly structured data such as Shakepears dramas available as text documents

– Strongly structured data such customer/order data coming from a relational database

Power of XML goes far beyond HTML

– XML vs. classic data models vs. information retrieval models: How they relate?

– From semantic to semi-structured data models: A schema capable of describing all objects?

Page 11: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-11Lecture "XML and Databases" - Dr. Can Türker

Motivation via Databases and Semantic Data Models

Semantic data models distinguish between

– Schema and instance

– Type and variable

Object-oriented databases focus on atomic, abstract data types

Relational databases build upon constructed data types

Schema definition implicitly determines elementary integrity constraints

data type

atomic

constructedset

tuple

surrogate

abstract

concrete basic types

Page 12: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-12Lecture "XML and Databases" - Dr. Can Türker

Simple Data Modeling Example (1)

In our mini world, called PC, we distinguish persons and cars. Cars are described by their model, fabrication year, and owner. Persons are characterized by their name and address. Every car belongs to exactly one person while a person can own several cars.

Name Address Model Year

Id

Person Car

owns

owner

Object

Page 13: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-13Lecture "XML and Databases" - Dr. Can Türker

Simple Data Modeling Example (2)

Person Id Name Address

1 Hugo Zürich

2 Franz Basel

Car Id Model Year Owner

3 VW 2013 2

NameHugo

AddressZürich

Person1

NameFranz

AddressBasel

Person2

ModelVW

Year2013

Car3

Owner

Suppose there are following instances respresenting two persons and a car

The relational representation would looks as follows:

Page 14: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-14Lecture "XML and Databases" - Dr. Can Türker

Self-Describing Data

Refers to data with integrated meta-data which annotates (describes) the data

Remember: databases strictly separate between meta-data (schema) and data (instance)

– Advantage: logical independence as basis for optimization

– Disadvantage: instance without schema complicates data exchange

Data should be self-describing! But how?

Idea: Introduce a „semi-structured“ data model

– Use a descriptor-oriented database schema

– Uniform for all applications hence often named as „meta schema“

Page 15: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-15Lecture "XML and Databases" - Dr. Can Türker

Simple Descriptor-Oriented Data Model

• Each object can be described by its characteristics

• Name and type information are part of the instance: Self-Description!

• Each object has an id and a set C of descriptors D describing the characteristics of the object

• Each descriptor D has a name N, a type T, and a value V

Fixed schema („meta schema“): • fixed names Id, C, D, N, T, V• application-independent

O

N

Id

id c

D

C

T V

Page 16: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-16Lecture "XML and Databases" - Dr. Can Türker

Simple Descriptor-Oriented Data Model: Relational Representation

Entity Id Name Type Value

1 Name String Hugo

1 Address String Zürich

2 Name String Franz

2 Address String Basel

3 Model String VW

3 Year Integer 2013

3 Owner Integer 2

Person

Person

Car

Entity Id Name Type Value

1 Name String Hugo

1 Address String Zürich

1 Class String Person

2 Name String Franz

2 Address String Basel

2 Class String Person

3 Model String VW

3 Year Integer 2013

3 Owner Integer 2

3 Class String Car

Add further descriptor foreach entity to maintainthe class information

Page 17: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-17Lecture "XML and Databases" - Dr. Can Türker

Simple Descriptor-Oriented Data Model: Advantages and Limitations

+ Each object described by variable set of characteristics (descriptors)

– No null values

– No forced introduction of sub types when more characteristics are known

– Each individual can be described individually

- Object names (classfication) disappear: must explicitly be defined as additional descriptor

– Which type of an object is the object with the id=1? Person or Car?

– No comprising context (class)

Page 18: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-18Lecture "XML and Databases" - Dr. Can Türker

XML Representation of Relational Data: Variant 1

Id

1

Name

Hugo

Address

Zürich

Row

Person Car

PC

<PC><Person>

<Row> <Id>1</Id> <Name>Hugo</Name> <Address>Zürich</Address> </Row><Row> <Id>2</Id> <Name>Franz</Name> <Address>Basel</Address> </Row>

</Person><Car>

<Row> <Id>3</Id> <Model>VW</Model> <Year>2013</Year> <Owner>2</Owner></Row></Car>

</PC>

Person Id Name Address

1 Hugo Zürich

2 Franz Basel

Car Id Model Year Owner

3 VW 2013 2

Id

1

Name

Franz

Address

Basel

Row

Id

1

Model

VW

Year

2013

Row

Owner

2

Page 19: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-19Lecture "XML and Databases" - Dr. Can Türker

XML Representation of Relational Data: Variant 2

Id

1

Name

Hugo

Address

Zürich

Person

PC

<PC><Person> <Id>1</Id> <Name>Hugo</Name> <Address>Zürich</Address> </Person><Person> <Id>2</Id> <Name>Franz</Name> <Address>Basel</Address> </Person><Car> <Id>3</Id> <Model>VW</Model> <Year>2013</Year> <Owner>2</Owner></Car>

</PC>

Id

1

Name

Hugo

Address

Zürich

Person

Id

1

Model

VW

Year

2013

Car

Owner

2

Person Id Name Address

1 Hugo Zürich

2 Franz Basel

Car Id Model Year Owner

3 VW 2013 2

Page 20: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-20Lecture "XML and Databases" - Dr. Can Türker

XML Representation of Relational Data: Variant 3

Id

1

Name

Hugo

Address

Zürich

Person

Row Row

PC

<PC><Row> <Person> <Id>1</Id> <Name>Hugo</Name> <Address>Zürich</Address> </Person></Row><Row> <Person> <Id>2</Id> <Name>Franz</Name> <Address>Basel</Address> </Person></Row><Row> <Car> <Id>3</Id> <Model>VW</Model> <Year>2013</Year> <Owner>2</Owner> </Car></Row>

</PC>

Id

1

Name

Hugo

Address

Zürich

Person

Id

1

Model

VW

Year

2013

Car

Owner

2

Row

Person Id Name Address

1 Hugo Zürich

2 Franz Basel

Car Id Model Year Owner

3 VW 2013 2

Page 21: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-21Lecture "XML and Databases" - Dr. Can Türker

Union operator u introduced as additional type constructor!

In the context of XML, often also named sequence (ordered set)

Semi-Structured Data Models

union

... evolve by extending the simpe descriptor-oriented model by a union data type:

data type

atomic

constructedset

tuple

surrogate

abstract

concrete basic types

u

Page 22: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-22Lecture "XML and Databases" - Dr. Can Türker

Extended Descriptor-Oriented (Meta) Schema

An object O is described by set C of descriptors D which consist of a name N and either

– another object O or

– a type-value pair A with the type T and the value V

Flexibility introduced by union type!

Semi-structured data model

O

N

Id

T V

Id c

D

u

OA

C

Page 23: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-23Lecture "XML and Databases" - Dr. Can Türker

Tree Representation with Edge and Node IDs

PC

PersonsCars

Person PersonCar

Hugo Zürich Franz Basel

VW 2013

1

2 5

3 4 6

Name

Address

Name

Address

Model

Year

2

Owner1Id

2Id

3Id

Person Id Name Address

1 Hugo Zürich

2 Franz Basel

Car Id Model Year Owner

3 VW 2013 2

Page 24: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-24Lecture "XML and Databases" - Dr. Can Türker

Flexibility of Semi-Structured Data Models

Data must not be compliant to a fixed schema a priori

Reasons against explicit structuring of data

– Not meaningful

– Not uniquely possible

– Not known (completely)

– Complete structuring to expensive

Data not always easily mappable to simple tables

– Optional characteristics

– Repetition of elements

– Element order potentially relevant

– Structuring however desired, e.g. for fast data access or fine-grained search

Page 25: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-25Lecture "XML and Databases" - Dr. Can Türker

Semi-Structured Data Models: Example (1)

The Autobiographyof Benjamin Franklin

bookstore

book

title

book

author price title author price

first-name

last-name

name

Benjamin Franklin

8.99

Plato

9.99The Gorgias

Authors do not have always a first and a last name

Page 26: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-26Lecture "XML and Databases" - Dr. Can Türker

Semi-Structured Data Models: Example (2)

Transaction Processing

bookstore

book

title

book

author title author

name homepagename

Jim Gray

http://www.research.microsoft.com/~gray/

Plato

The Gorgias

Avoid null values by allowing different author structures

email

[email protected]

author

name

Andreas Reuter

homepage

http://www.eml.org/english/homes/reuter.html

Page 27: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-27Lecture "XML and Databases" - Dr. Can Türker

Semi-Structured Data Models: Example (3)

The Autobiographyof Benjamin Franklin

bookstore

book

title

book

reviewtitle

Good stuff The Gorgiasreviewer

Torsten Grabs

review

Gorgeous! reviewer

Hans-JörgSchek

review

I liked it. reviewer

Can Türker

book

title

The ConfidenceMan

Books can have arbitrary reviews but must not have one

Page 28: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-28Lecture "XML and Databases" - Dr. Can Türker

Conclusions: Motivation for XML via Semantic Data Models

Semi-structured data model supports flexible structures

– XML documents of a given type can be structured differently

Self-description avoids separation of schema and instance

– XML documents can describe data as well as data structure

Page 29: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-29Lecture "XML and Databases" - Dr. Can Türker

Motivation of XML via Information Retrieval

Database systems manage well-structured data

Information-Retrieval-Systems manage less-structured documents

– Text documents: arbitrary content length, potentially structured in paragraphs

– Hyper-Text documents: additionally contain links to related text positions

– Multi-Media documents: may include images, audios, and videos

– Documents may significantly differ in structure and size

Compared to flat information retrieval models, XML supports

– Better structuring possibilities

– Integrated meta-data description

Information-Retrieval may benefit from XML

– Structure information is explicitly available in contrast to HTML

– Access granularity is more fine-granule in contrast to flat document structures

Page 30: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-30Lecture "XML and Databases" - Dr. Can Türker

Information Retrieval on Flat Documents

Searchers

D1: Thedogswalkedhome…

TEXT DOCS

D2:Homeon therange.

Dogs at home

QUERY

TRANSFORMEDQUERY

dogs home

dogs --> D1, D5, D78walked --> D1, . . .home --> D1, D2, . . .on --> D2, . . .range --> D2, . . .. . .

INVERTED INDEX

RSV(Q,D1) = .75RSV(Q,D2) = .34

SIMILARITY MEASURE

D1D2

LIST OF RANKEDDOCUMENTS

DocumentAuthor

Document index and query described by characteristics

Query hits based on characteristics match

Page 31: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-31Lecture "XML and Databases" - Dr. Can Türker

Information Retrieval on Structured Documents

Searchers

<DOC> <DOCNO> D1 </DOCNO><TEXT> The dogs walked home. … </TEXT> </DOC>

<DOC><DOCNO>D2</DOCNO><TEXT> Homeon the range.</TEXT></DOC>

TAGGED DOCS

"Dogs at home" in /DOC/TEXT

QUERY

TRANSFORMEDQUERY

dogs home

dogs --> D1, D5, D78walked --> D1, . . .home --> D1, D2, . . .on --> D2, . . .range --> D2, . . .. . .

INVERTED INDEX

RSV(Q,D1) = .75RSV(Q,D2) = .34

SIMILARITY MEASURE

D1: /DOC/TEXTD2: /DOC/TEXT

LIST OF RANKEDELEMENTS

DocumentAuthor

Can we use the same ranking function as for flat documents?

Page 32: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

1-32Lecture "XML and Databases" - Dr. Can Türker

Conclusions: Motivation of XML via IR

Information-Retrieval-Systems

– Support retrieval of relevant text fragments

– Extract words and word sequences

– Abstract from document structure

Information Retrieval with XML

– XML makes structure of certain text elements explicit

– Finer access and retrieval granularity, e.g., restriction to certain XML elements

– Increases document search hit potential by exploiting XML document structure

– Better retrieval results by improved ranking based on integrated structure information

Tradional IR ranking approaches such as the vector-space model deliver better quality when document structure is included

bookstore

medicine computer science

book

title chapterauthor

Retrieve only books from the computer

science category

book

title chapterauthor

Page 33: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

Conclusions: Motivation for XML

General advantages

Platform-independent

Producer-independent

Language-independent

Unicode-based („international“)

Extensible

Usable at all application levels

– Storage

– Application logic

– Presentation

Advantages from business perspective

Support extended search possibilites on distributed, heterogenous data

Allow new functionality on existing data

Ease data transformation

Connect applications

Reduce application development time

No (little) license costs since many XML tools are freely available

Page 34: XML and Databases - UZHc39a954d-4083-4c02... · interplay between XML and databases. The following aspects are studied in detail: semi-structured data model of XML, query languages

Motivation for XML in Databases

XML as data exchange format

– XML documents describing business relevant data

– Archiving original documents

– Transformation into other formats

– Storing XML data in pipeline steps

Applications on XML data

– XML as internal data structure, e.g., to describe configuration files, application data, or style sheets

– Requirements for adequat XML storage?

Advantages of storing XML in databases

– Powerful and efficient search functions

– Transactional, persistent storage

– Concurrency control

– Access control

1-34Lecture "XML and Databases" - Dr. Can Türker