Post on 19-Dec-2015
Well-designed XML Data
Marcelo Arenas and Leonid Libkin
University of Toronto
Outline
Part 1 - Database Normalization from the 1970s and 1980s.
Part 2 - Classical theory revisited: normalizing XML documents.
Part 3 - Classical theory re-done: new justifications for normalization.
Part 1: Classical Normalization
Design: decide how to represent the information in a particular data model.
• Even for simple application domains there is a large number of ways of representing the data of interest.
We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
Designing a Database: An Example
Attributes: number, title, section, room.
Data dependency: every course number is associated with only one title.
Relational Schema:
BAD alternative:
R(number, title, section, room), number → title
GOOD alternative:
S(number, title), number → title
T(number, section, room)
Problems with BAD: Update Anomaly
number  title                  section  room
CSC258  Computer Organization  1        LP266
CSC258  Computer Organization  2        GB258
CSC258  Computer Organization  3        GB248
CSC434  Database Systems       1        GB248
Title of CSC258 is changed to Computer Organization I.
After the update, the new title has to be written to every CSC258 row:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248
The instance stores redundant information.
Deletion Anomaly
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC434  Database Systems         1        GB248
CSC434 is not given in this term.
After the CSC434 row is deleted:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
Additional effect: all the information about CSC434 was deleted.
Insertion Anomaly
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
A new course is created: (CSC336, Numerical Methods)
After the insertion, the section and room of CSC336 are unknown:
number  title                    section  room
CSC258  Computer Organization I  1        LP266
CSC258  Computer Organization I  2        GB258
CSC258  Computer Organization I  3        GB248
CSC336  Numerical Methods        ?        ?
The instance stores attributes that are not directly related.
Avoiding Update Anomalies
number  title
CSC258  Computer Organization
CSC434  Database Systems

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
CSC434  1        GB248
Title of CSC258 is changed to Computer Organization I.
After the update, only one row changes:
number  title
CSC258  Computer Organization I
CSC434  Database Systems

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
CSC434  1        GB248
The instance does not store redundant information.
CSC434 is not given in this term. After its row is deleted from the second relation:
number  title
CSC258  Computer Organization I
CSC434  Database Systems

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
The title of CSC434 is not removed from the instance.
A new course is created: (CSC336, Numerical Methods)
number  title
CSC258  Computer Organization I
CSC434  Database Systems
CSC336  Numerical Methods

number  section  room
CSC258  1        LP266
CSC258  2        GB258
CSC258  3        GB248
No information about sections has to be provided.
Each relation stores attributes that are directly related.
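The benefit of splitting the course data into two relations can be sketched in Python. This is only an illustration: relations are modeled as sets of tuples, and the helper names `project` and `natural_join` are hypothetical, not part of the slides.

```python
# Relation R(number, title, section, room) from the BAD design, as a set of tuples.
R = {
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
}

def project(rel, cols):
    """Project a set of tuples onto the given column indexes (duplicates collapse)."""
    return {tuple(row[c] for c in cols) for row in rel}

def natural_join(S, T):
    """Join S(number, title) with T(number, section, room) on number."""
    return {(n, title, sec, room)
            for (n, title) in S
            for (m, sec, room) in T
            if n == m}

S = project(R, (0, 1))       # S(number, title)
T = project(R, (0, 2, 3))    # T(number, section, room)

# The decomposition loses no information: joining S and T gives back R.
lossless = natural_join(S, T) == R

# Renaming CSC258 touches three rows of R but only one row of S.
rows_touched_in_R = sum(1 for row in R if row[0] == "CSC258")
rows_touched_in_S = sum(1 for row in S if row[0] == "CSC258")
```

Here `lossless` is true for this instance, and the title of CSC258 is stored once in `S` rather than once per section.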
Normalization Theory
Main idea: a normal form defines a condition that a well-designed database should satisfy.
Normal form: syntactic condition on the database schema.
• Defined for a class of data dependencies.
Main problems:
• How to test whether a database schema is in a particular normal form.
• How to transform a database schema into an equivalent one satisfying a particular normal form.
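For functional dependencies and BCNF, the first problem is commonly solved with attribute closures. The sketch below is a minimal illustration under assumptions: the function names `closure` and `is_bcnf` are hypothetical, FDs are pairs of frozensets, and only the listed FDs are checked, which is adequate for a schema given together with its own FDs.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if every non-trivial listed FD has a superkey on its left-hand side."""
    for lhs, rhs in fds:
        if rhs <= lhs:                        # trivial FD, nothing to check
            continue
        if closure(lhs, fds) != set(schema):  # lhs does not determine all attributes
            return False
    return True

number_title = [(frozenset({"number"}), frozenset({"title"}))]

bad_in_bcnf = is_bcnf({"number", "title", "section", "room"}, number_title)
s_in_bcnf = is_bcnf({"number", "title"}, number_title)
t_in_bcnf = is_bcnf({"number", "section", "room"}, [])
```

With the course example, R(number, title, section, room) fails the test because number is not a key, while S(number, title) and T(number, section, room) pass.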
Normalization Theory Today
Normalization theory for relational databases was developed in the 1970s and 1980s.
Why do we need normalization theory today?
• New data models have emerged: XML.
• XML documents can contain redundant information.
Redundant information in XML documents:
• Can be discovered if the user provides semantic information.
• Can be eliminated.
Part 2: XML and Normalization
XML Document:
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1">
        <name> Fox </name>
        <grade> B+ </grade>
      </student>
    </taken_by>
  </course>
</courses>
DTD:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade
name → #PCDATA
grade → #PCDATA
XML Document with several course elements:
<courses>
  <course cno="CSC258">
    …
  </course>
  <course cno="CSC434">
    …
  </course>
</courses>
XML Databases
XML Schema: (D, Σ)
D:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade
Σ: Two students with the same @sno value must have the same name.
Redundancy in XML
[Figure: XML tree. The root courses has two course children with @cno = "CSC258" and @cno = "CSC434". Under taken_by, each course has a student with @sno = "st1", name "Fox", and a grade ("B+" in one course, "A+" in the other), so the name "Fox" is stored twice. An info subtree holding a single student with @sno = "st1" and name "Fox" is also shown.]
XML Database Normalization
DTD:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → name, grade
Data dependency:
Two students with the same @sno value must have the same name.
DTD after normalization:
courses → course*, info*
course → @cno
course → taken_by
taken_by → student*
student → @sno
student → grade
info → @sno
info → name
@sno is the identifier of info elements.
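The move from the original DTD to the normalized one can be sketched on a concrete document with Python's standard `xml.etree.ElementTree` module. This is an illustrative sketch only: the sample document, the pairing of grades with courses, and the helper name `move_names_to_info` are assumptions, not the authors' algorithm.

```python
import xml.etree.ElementTree as ET

doc = """
<courses>
  <course cno="CSC258">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>B+</grade></student>
    </taken_by>
  </course>
  <course cno="CSC434">
    <taken_by>
      <student sno="st1"><name>Fox</name><grade>A+</grade></student>
    </taken_by>
  </course>
</courses>
"""

def move_names_to_info(root):
    """Store each student name once, in an info element identified by @sno."""
    names = {}
    # list(...) takes a snapshot so the tree can be modified inside the loop.
    for student in list(root.iter("student")):
        name = student.find("name")
        if name is not None:
            names[student.get("sno")] = name.text
            student.remove(name)
    for sno, text in names.items():
        info = ET.SubElement(root, "info", {"sno": sno})
        ET.SubElement(info, "name").text = text
    return root

root = move_names_to_info(ET.fromstring(doc))
normalized = ET.tostring(root, encoding="unicode")
```

After the transformation each student element keeps only its grade, and the name "Fox" appears once, under the single info element for st1.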
A “Non-relational” Example
[Figure: XML tree of DBLP. The root DBLP has conf children. The conf shown has title "ICDT" and two issue children; each issue has article children with title, author, and @year. The articles shown under one issue all carry @year = "1999", and those under the other carry @year = "2001"; authors shown include "Dong" and "Jarke".]
XNF: XML Normal Form
It eliminates two types of anomalies.
It was defined for XML functional dependencies:
DBLP.conf.@title → DBLP.conf
DBLP.conf.issue → DBLP.conf.issue.article.@year
Problems to Address
Functional dependencies for XML.
Normal form for XML documents (XNF).
• Generalizes BCNF.
Algorithm for normalizing XML documents.
• Implication problem for functional dependencies.
Framework: Paths in DTDs
Paths(D): all paths in a DTD D
courses.course
courses.course.@cno
courses.course.student.name
courses.course.student.name.S
We distinguish three kinds of elements: attributes (@), strings (S) and element types.
FDs are defined by means of a relational representation of XML documents.
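For a non-recursive DTD, Paths(D) can be enumerated by a simple traversal. The sketch below is illustrative: the dictionary encoding of the DTD and the function name `paths` are assumptions made for this example, and recursive DTDs (which have infinitely many paths) are not handled.

```python
# Hypothetical encoding of a DTD: element -> (attributes, child elements, has #PCDATA).
DTD = {
    "courses": ([], ["course"], False),
    "course": (["cno"], ["student"], False),
    "student": (["sno"], ["name", "grade"], False),
    "name": ([], [], True),
    "grade": ([], [], True),
}

def paths(dtd, root):
    """Enumerate Paths(D) for a non-recursive DTD, starting from the root element."""
    result = []

    def walk(element, prefix):
        path = prefix + [element]
        result.append(".".join(path))                      # element-type path
        attributes, children, has_text = dtd[element]
        for attribute in attributes:
            result.append(".".join(path + ["@" + attribute]))  # attribute path
        if has_text:
            result.append(".".join(path + ["S"]))          # string path
        for child in children:
            walk(child, path)

    walk(root, [])
    return result

all_paths = paths(DTD, "courses")
```

The result contains the three kinds of paths named above: paths ending in an element type, in an attribute (@), or in a string (S).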
Framework: XML Trees
[Figure: XML tree with vertices v0, …, v7. The root v0 is labeled courses and has course children, of which v1 is shown with @cno = "cs100" and two student children, v2 and v5. Student v2 has @sno = "123", name v3 with string "Fox", and grade v4 with string "B+". Student v5 has @sno = "456", name v6 with string "Smith", and grade v7 with string "A-".]
Tree Tuples
Relational representation: tree tuples - mappings
t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
A tree tuple represents an XML tree:
[Figure: XML tree in which v0 (courses) has the child v1 (course), and v1 has the attribute @cno = "cs100" and the child v2 (student).]
t(courses) = v0
t(courses.course) = v1
t(courses.course.@cno) = "cs100"
t(courses.course.student) = v2
t(p) = ⊥, for the remaining paths
XML Tree: set of Tree Tuples
[Figure: the XML tree with vertices v0, …, v7 shown next to two of its tree tuples. One tuple follows student v2 (@sno = "123", name v3 "Fox", grade v4 "B+"); the other follows student v5 (@sno = "456", name v6 "Smith", grade v7 "A-"). Both tuples share v0 (courses), v1 (course), and @cno = "cs100".]
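The way a tree breaks up into tree tuples can be sketched in Python. The code is an illustration under assumptions: the `vertex` and `tree_tuples` helpers and the dictionary representation are hypothetical, only the course shown in the figure is modeled, and a path missing from a dictionary stands for ⊥.

```python
from itertools import product

def vertex(vid, label, attrs=None, text=None, children=()):
    """Build one tree vertex as a plain dictionary."""
    return {"id": vid, "label": label, "attrs": attrs or {},
            "text": text, "children": list(children)}

def tree_tuples(node, prefix=""):
    """Return the tree tuples of the subtree at node as dicts from paths to values."""
    path = prefix + node["label"]
    base = {path: node["id"]}
    for name, value in node["attrs"].items():
        base[path + ".@" + name] = value
    if node["text"] is not None:
        base[path + ".S"] = node["text"]
    # Group children by label; each tuple picks one child per label.
    by_label = {}
    for child in node["children"]:
        by_label.setdefault(child["label"], []).append(child)
    options = []
    for same_label in by_label.values():
        alternatives = []
        for child in same_label:
            alternatives.extend(tree_tuples(child, path + "."))
        options.append(alternatives)
    result = []
    for choice in product(*options):   # one combination of choices per tuple
        merged = dict(base)
        for part in choice:
            merged.update(part)
        result.append(merged)
    return result

T = vertex("v0", "courses", children=[
    vertex("v1", "course", attrs={"cno": "cs100"}, children=[
        vertex("v2", "student", attrs={"sno": "123"}, children=[
            vertex("v3", "name", text="Fox"), vertex("v4", "grade", text="B+")]),
        vertex("v5", "student", attrs={"sno": "456"}, children=[
            vertex("v6", "name", text="Smith"), vertex("v7", "grade", text="A-")]),
    ]),
])
tuples = tree_tuples(T)
```

For this tree the sketch yields two tuples, one per student, which agree on courses, courses.course, and courses.course.@cno.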
Functional Dependencies for XML
Expressions of the form: X → Y
defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
XML tree T can be tested for satisfaction of X → Y if:
X ∪ Y ⊆ Paths(T) ⊆ Paths(D)
T ⊨ X → Y if for every pair u, v of tree tuples in T:
u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
FD: Examples
University DTD:
courses → course*
course → @cno, student*
student → @sno, name, grade
Two students with the same @sno value must have the same name:
courses.course.student.@sno → courses.course.student.name.S
Every student can have at most one grade in every course:
{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
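Both dependencies can be checked mechanically once a document is given as a set of tree tuples. The following Python sketch is illustrative: the two tuples are written by hand for a hypothetical document in which student st1 takes two courses, the names `project` and `satisfies` are assumptions, and u.X ≠ ⊥ is read as "no path in X is null".

```python
BOTTOM = None  # stands for the null value ⊥

def project(t, paths):
    """Values of tree tuple t on a list of paths; missing paths count as ⊥."""
    return [t.get(p, BOTTOM) for p in paths]

def satisfies(tuples, X, Y):
    """True if u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y for all tree tuples u, v."""
    for u in tuples:
        for v in tuples:
            ux = project(u, X)
            if ux == project(v, X) and BOTTOM not in ux:
                if project(u, Y) != project(v, Y):
                    return False
    return True

S = "courses.course.student"
# Two hand-written tree tuples: student st1 (Fox) enrolled in two courses.
t1 = {"courses": "v0", "courses.course": "v1", "courses.course.@cno": "CSC258",
      S: "v2", S + ".@sno": "st1", S + ".name": "v3", S + ".name.S": "Fox",
      S + ".grade": "v4", S + ".grade.S": "B+"}
t2 = {"courses": "v0", "courses.course": "v5", "courses.course.@cno": "CSC434",
      S: "v6", S + ".@sno": "st1", S + ".name": "v7", S + ".name.S": "Fox",
      S + ".grade": "v8", S + ".grade.S": "A+"}
tuples = [t1, t2]

same_name = satisfies(tuples, [S + ".@sno"], [S + ".name.S"])
one_grade = satisfies(tuples, ["courses.course", S + ".@sno"], [S + ".grade.S"])
sno_fixes_grade = satisfies(tuples, [S + ".@sno"], [S + ".grade.S"])
```

On these tuples the two dependencies from the slide hold, while @sno alone does not determine the grade, because st1 has different grades in the two courses.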
Implication Problem for FD
Given a DTD D and a set of functional dependencies Σ ∪ {φ}:
(D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ
(D, Σ)+ = { φ | (D, Σ) ⊢ φ }
Functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ
XNF: XML Normal Form
XML specification: a DTD D and a set of functional dependencies Σ.
A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
(D, Σ) is in XNF if:
For each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.
Back to DBLP
DBLP is not in XNF:
DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D, Σ)+
DBLP.conf.issue → DBLP.conf.issue.article ∉ (D, Σ)+
Proposed solution is in XNF.
Normalization Algorithm
The algorithm applies two transformations until the schema is in XNF.
If there is an anomalous FD of the form:
DBLP.conf.issue → DBLP.conf.issue.article.@year
then apply the "DBLP example rule".
Otherwise: choose a minimal anomalous FD and apply the "University example rule".
Normalizing XML Documents
Theorem: The decomposition algorithm terminates and outputs a specification in XNF.
Furthermore, it does not lose information:
[Figure: query Q1 maps the unnormalized XML document to the normalized XML document, and query Q2 maps the normalized document back to the unnormalized one.]
Q1, Q2 are XQuery core queries.
Part 3: What was Missing? Justification!
What is a good database design?
• Well-known solutions: BCNF, 4NF, …
But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless decomposition, dependency preservation.
Previous work was specific to the relational model.
• Classical problems have to be revisited in the XML context.
Justification of Normal Forms
Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
Our approach is based on information theory.
Information Theory
Entropy measures the amount of information provided by a certain event.
Assume that an event can have n different outcomes with probabilities p1, …, pn.
Amount of information gained by knowing that event i occurred: $\log \frac{1}{p_i}$
Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Entropy is maximal if each pi = 1/n: $\log n$
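These quantities are easy to compute directly. The Python sketch below is a minimal illustration; the function names are hypothetical, and base-2 logarithms are assumed, which agrees with values such as log 5 ≈ 2.322 used elsewhere in the deck.

```python
import math

def information(p):
    """Information gained from an outcome of probability p, in bits."""
    return math.log2(1 / p)

def entropy(probabilities):
    """Average information; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

uniform_5 = entropy([1 / 5] * 5)    # maximal entropy for five outcomes
certain = entropy([1.0, 0.0, 0.0])  # a certain outcome carries no information
```

For a uniform distribution over n outcomes the result equals log n, the maximum stated above.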
Entropy and Redundancies
Database schema: R(A,B,C), A → B
Instance I:
A B C
1 2 3
1 2 4
Pick a domain properly containing adom(I): {1, …, 6}
Position of the value 3 (column C, first row):
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
Position of the value 2 (column B, first row):
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
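The two entropies can be reproduced by enumerating the values a position may take. The sketch below reflects one reading of the slide and is not the authors' definition: a value is admissible if the instance still satisfies A → B and still has two distinct tuples, and admissible values are taken to be equally likely. The function names are hypothetical.

```python
import math

def satisfies_fd(rows, lhs, rhs):
    """Check the FD column lhs -> column rhs on a list of tuples."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def position_entropy(rows, position, domain, lhs=0, rhs=1):
    """Entropy of one position, assuming a uniform choice among admissible values."""
    r, c = position
    allowed = []
    for a in domain:
        candidate = [list(row) for row in rows]
        candidate[r][c] = a
        as_tuples = [tuple(row) for row in candidate]
        # Keep a only if no two tuples collapse and the FD still holds.
        if len(set(as_tuples)) == len(rows) and satisfies_fd(as_tuples, lhs, rhs):
            allowed.append(a)
    p = 1 / len(allowed)
    return sum(p * math.log2(1 / p) for _ in allowed), allowed

I = [(1, 2, 3), (1, 2, 4)]
domain = range(1, 7)                      # {1, ..., 6}

entropy_c, allowed_c = position_entropy(I, (0, 2), domain)   # the value 3
entropy_b, allowed_b = position_entropy(I, (0, 1), domain)   # the value 2
```

The C position admits five values (everything except 4), giving log 5, while the B position is forced to 2 by A → B, giving entropy 0.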
Entropy and Normal Forms
Let Σ be a set of FDs over a schema S.
Theorem: (S, Σ) is in BCNF if and only if for every instance I of (S, Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
A similar result holds for 4NF and MVDs.
This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
Problems with the Measure
The measure cannot distinguish between different types of data dependencies.
It cannot distinguish between different instances of the same schema R(A,B,C), A → B (_ marks the position being measured):
A B C
1 2 3
1 2 4
1 _ 5
entropy = 0
A B C
1 2 3
1 _ 4
entropy = 0
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
1 2 3
1 2 4
Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.
A General Measure
Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.
A B C
1 2 3
1 2 4
35
![Page 63: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/63.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
Initial setting: pick a position p Pos(I) and pick k such that adom(I) {1, …, k}. For example, k = 7.
A B C
1 3
1 2 4
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
35
![Page 64: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/64.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 3
1 2 4
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
35
![Page 65: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/65.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
35
![Page 66: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/66.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
35
![Page 67: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/67.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
2 3
1 2
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
35
![Page 68: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/68.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
1 2 3
1 2 1
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
35
![Page 69: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/69.jpg)
A General Measure
Instance I of schema R(A,B,C), A B :
A B C
4 2 3
1 2 7
Computation: for every X Pos(I) – {p}, compute probability distribution P(a | X), a {1, …, k}.
P(2 | X) =
35
![Page 70: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/70.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
1 2 3
1 2 3
Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
P(2 | X) = 48/
35
![Page 71: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/71.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
_ _ 3
1 2 _
Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
35
![Page 72: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/72.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
_ a 3
1 2 _
Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
35
![Page 73: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/73.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
2 a 3
1 2 7
Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
P(2 | X) = 48/
For a ≠ 2, P(a | X) =
35
![Page 74: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/74.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
1 a 3
1 2 6
Computation: for every X ⊆ Pos(I) – {p}, compute the probability distribution P(a | X), a ∈ {1, …, k}.
P(2 | X) = 48/(48 + 6·42) = 0.16
For a ≠ 2, P(a | X) = 42/(48 + 6·42) = 0.14
Entropy ≈ 2.8057 (log 7 ≈ 2.8073)
35
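The counts on this slide can be reproduced by brute force. The sketch below is a minimal illustration, not the authors' implementation. It assumes that p is the B-position of the first tuple, that X fixes the C-value of the first tuple and the A- and B-values of the second, that k = 7, and that completions making the two tuples identical are not counted (an assumption that is consistent with the 48 shown above). The function names are hypothetical.

```python
from itertools import product
from math import log2

K = 7  # domain {1, ..., K}; assumed from the "log 7" on the slide


def satisfies_fd(rows):
    """Check the functional dependency A -> B on a list of (A, B, C) rows."""
    return all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])


def count_completions(a):
    """Count ways to fill the two free positions when p holds the value a.

    Row 1 is (free, a, 3) and row 2 is (1, 2, free).
    """
    total = 0
    for a1, c2 in product(range(1, K + 1), repeat=2):
        rows = [(a1, a, 3), (1, 2, c2)]
        # Assumption: the instance is a set, so the two rows must differ.
        if rows[0] != rows[1] and satisfies_fd(rows):
            total += 1
    return total


counts = {a: count_completions(a) for a in range(1, K + 1)}
norm = sum(counts.values())
probs = {a: c / norm for a, c in counts.items()}
entropy = -sum(p * log2(p) for p in probs.values() if p > 0)

print(counts[2], counts[1])                    # 48 42
print(round(probs[2], 2), round(probs[1], 2))  # 0.16 0.14
print(round(entropy, 4))                       # 2.8057
```

The distribution is close to uniform, which is why the entropy is only slightly below log 7.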
![Page 75: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/75.jpg)
A General Measure
Instance I of schema R(A,B,C), A → B:
A B C
1 3
1 2 4
Value $\mathrm{Inf}^k_I(p \mid \Sigma)$: we consider the average over all sets X ⊆ Pos(I) – {p}.
• Average: 2.4558 < log 7 (maximal entropy)
• It corresponds to conditional entropy.
• It depends on the value of k ...
35
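The average can be checked by enumerating every set X. This is a brute-force sketch under the same assumptions as the single-X computation: p is taken to be the B-position of the first tuple, k = 7, and completions in which the two tuples coincide are not counted. The names `entropy_given`, `INSTANCE`, and `P` are hypothetical.

```python
from itertools import combinations, product
from math import log2

INSTANCE = [(1, 2, 3), (1, 2, 4)]  # I over R(A, B, C)
P = (0, 1)                         # assumed position p: row 0, column B
K = 7


def satisfies_fd(rows):
    """Check the functional dependency A -> B."""
    return all(r[1] == s[1] for r in rows for s in rows if r[0] == s[0])


def entropy_given(known):
    """Entropy of the value at P when exactly the positions in `known` keep their values."""
    free = [(i, j) for i in range(2) for j in range(3)
            if (i, j) != P and (i, j) not in known]
    counts = []
    for a in range(1, K + 1):
        n = 0
        for values in product(range(1, K + 1), repeat=len(free)):
            rows = [list(r) for r in INSTANCE]
            rows[P[0]][P[1]] = a
            for (i, j), v in zip(free, values):
                rows[i][j] = v
            rows = [tuple(r) for r in rows]
            # Assumption: rows must stay distinct (set semantics).
            if len(set(rows)) == len(rows) and satisfies_fd(rows):
                n += 1
        counts.append(n)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


others = [(i, j) for i in range(2) for j in range(3) if (i, j) != P]
subsets = [set(c) for r in range(len(others) + 1)
           for c in combinations(others, r)]
average = sum(entropy_given(x) for x in subsets) / len(subsets)
print(len(subsets), round(average, 4))  # 32 2.4558
```

Each of the 32 sets X is weighted equally, so the result is the conditional entropy of the value at p given a uniformly chosen X.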
![Page 76: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/76.jpg)
A General Measure
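Previous value: $\mathrm{Inf}^k_I(p \mid \Sigma)$
For each k, we consider the ratio: $\mathrm{Inf}^k_I(p \mid \Sigma) / \log k$
• How close the given position p is to having the maximum possible information content.
General measure:
$$\mathrm{Inf}_I(p \mid \Sigma) = \lim_{k \to \infty} \frac{\mathrm{Inf}^k_I(p \mid \Sigma)}{\log k}$$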
36
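As a worked instance of the ratio, the numbers quoted earlier for the two-tuple instance with k = 7 (average 2.4558, log 7 ≈ 2.8073) can be plugged in directly, writing Σ for the set of constraints:

```latex
\frac{\mathrm{Inf}^{7}_I(p \mid \Sigma)}{\log 7}
  \approx \frac{2.4558}{2.8073}
  \approx 0.8748
```

This is only the value of the ratio at k = 7; the general measure is the limit of this ratio as k grows.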
![Page 77: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/77.jpg)
Basic Properties
The measure is well defined:
For every set Σ of first-order constraints defined over a schema S, every I ∈ inst(S, Σ), and every p ∈ Pos(I): $\mathrm{Inf}_I(p \mid \Sigma)$ exists.
Bounds: $0 \le \mathrm{Inf}_I(p \mid \Sigma) \le 1$
37
![Page 78: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/78.jpg)
Basic Properties
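The measure does not depend on a particular representation of constraints. If Σ1 and Σ2 are equivalent: $\mathrm{Inf}_I(p \mid \Sigma_1) = \mathrm{Inf}_I(p \mid \Sigma_2)$.
It overcomes the limitations of the simple measure: R(A,B,C), A → B
A B C
1 2 3
1 2 4
Value for a position in column B: 0.875
A B C
1 2 3
1 2 4
1 2 5
Value for a position in column B: 0.781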
38
![Page 79: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/79.jpg)
Well-Designed Databases
Definition A database specification (S, Σ) is well-designed if for every I ∈ inst(S, Σ) and every p ∈ Pos(I), $\mathrm{Inf}_I(p \mid \Sigma) = 1$.
In other words, every position in every instance carries the maximum possible amount of information.
We would like to test this definition in the relational world ...
39
![Page 80: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/80.jpg)
Relational Databases
Σ is a set of data dependencies over a schema S:
Σ = ∅: (S, Σ) is well-designed.
Σ is a set of FDs: (S, Σ) is well-designed if and only if (S, Σ) is in BCNF.
Σ is a set of FDs and MVDs: (S, Σ) is well-designed if and only if (S, Σ) is in 4NF.
Σ is a set of FDs and JDs:
• If (S, Σ) is in PJ/NF or in 5NFR, then (S, Σ) is well-designed. The converse is not true.
• A syntactic characterization of being well-designed is given in [AL03].
40
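For the FD-only case, the equivalence with BCNF means that well-designedness of a single relation can be tested syntactically. The following is a minimal sketch of the standard attribute-closure test, not code from the talk; the helper names `closure`, `is_bcnf`, and `fd` are hypothetical. For one relation it is enough to check the listed FDs.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(schema, fds):
    """True if every nontrivial FD in `fds` has a superkey of `schema` on its left."""
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial FD, never a violation
        if not set(schema) <= closure(lhs, fds):
            return False
    return True


def fd(lhs, rhs):
    """Build an FD lhs -> rhs from two iterables of attribute names."""
    return (frozenset(lhs), frozenset(rhs))


print(is_bcnf({"A", "B", "C"}, [fd({"A"}, {"B"})]))                # False
print(is_bcnf({"A", "B", "C"}, []))                                # True
print(is_bcnf({"number", "title"}, [fd({"number"}, {"title"})]))   # True
```

The first call corresponds to R(A,B,C) with A → B, where some positions carry less than the maximum information; the second is the case with no dependencies.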
![Page 81: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/81.jpg)
Relational Databases
The problem of verifying whether a relational schema is well-designed is undecidable.
If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.
Now we would like to apply our definition in the XML world ...
41
![Page 82: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/82.jpg)
XML Databases
XML schema: (D, Σ).
• D is a DTD.
• Σ is a set of data dependencies over D.
We would like to evaluate XML normal forms.
The notion of being well-designed extends from relations to XML.
• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).
42
![Page 83: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/83.jpg)
Positions in an XML Tree
[Figure: XML tree T for a DBLP document. The root DBLP has conf children; a conf has a title ("ICDT") and issue children; each issue has article children carrying title, author, and @year values ("Dong", "Jarke", "1999", "2001", ". . ."). The leaf values are repeated on the slide to mark the positions of T.]
43
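One way to make the idea of positions concrete is to list every attribute value and text value together with the node that holds it. The sketch below is only an illustration under assumptions: the document is a hypothetical reconstruction in the spirit of the DBLP example (the exact tree on the slide is not recoverable), and treating text content as a position alongside attributes is a simplification chosen here, not the talk's formal definition of Pos(T).

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names follow the slide.
DOC = """
<DBLP>
  <conf>
    <title>ICDT</title>
    <issue>
      <article year="1999"><title>...</title><author>Dong</author></article>
      <article year="1999"><title>...</title></article>
    </issue>
    <issue>
      <article year="2001"><title>...</title><author>Jarke</author></article>
    </issue>
  </conf>
</DBLP>
"""


def positions(elem, path="", index=0):
    """Return (node path, slot, value) triples for attribute and text values."""
    here = f"{path}/{elem.tag}[{index}]"
    found = [(here, "@" + name, value) for name, value in elem.attrib.items()]
    text = (elem.text or "").strip()
    if text:
        found.append((here, "text", text))
    for i, child in enumerate(elem):
        found.extend(positions(child, here, i))
    return found


pos = positions(ET.fromstring(DOC.strip()))
for node, slot, value in pos:
    print(node, slot, value)
```

In this sample the year "1999" occupies two distinct positions, which is the kind of repetition the measure is meant to detect.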
![Page 84: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/84.jpg)
Well-Designed XML Data
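We consider k such that adom(T) ⊆ {1, …, k}.
For each k: $\mathrm{Inf}^k_T(p \mid \Sigma)$
We consider the ratio: $\mathrm{Inf}^k_T(p \mid \Sigma) / \log k$
General measure:
$$\mathrm{Inf}_T(p \mid \Sigma) = \lim_{k \to \infty} \frac{\mathrm{Inf}^k_T(p \mid \Sigma)}{\log k}$$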
44
![Page 85: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/85.jpg)
XNF: XML Normal Form
For arbitrary XML data dependencies:
Definition An XML specification (D, Σ) is well-designed if for every T ∈ inst(D, Σ) and every p ∈ Pos(T), $\mathrm{Inf}_T(p \mid \Sigma) = 1$.
For functional dependencies:
Theorem An XML specification (D, Σ) is in XNF if and only if (D, Σ) is well-designed.
45
![Page 86: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/86.jpg)
Normalization Algorithms
The information-theoretic measure can also be used for reasoning about normalization algorithms.
For BCNF and XNF decomposition algorithms:
Theorem After each step of these decomposition algorithms, the amount of information in each position does not decrease.
46
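For reference, a textbook-style BCNF decomposition step can be sketched as follows. This is an illustrative version, not the algorithm as stated in the talk: it finds a violating attribute set by exhaustive search over subsets (exponential, fine for small schemas) and uses closures under the original FDs in place of explicitly projected dependencies. The function names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violation(schema, fds):
    """Find X whose closure inside `schema` is neither X nor all of `schema`."""
    attrs = sorted(schema)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = set(combo)
            cl = closure(x, fds) & schema
            if cl != x and cl != schema:
                return x, cl
    return None


def bcnf_decompose(schema, fds):
    """Split `schema` until no sub-schema has a BCNF violation."""
    schema = frozenset(schema)
    found = violation(schema, fds)
    if found is None:
        return [schema]
    x, cl = found
    left = cl                   # X together with everything it determines
    right = x | (schema - cl)   # X together with the remaining attributes
    return bcnf_decompose(left, fds) + bcnf_decompose(right, fds)


fds = [(frozenset({"number"}), frozenset({"title"}))]
parts = bcnf_decompose({"number", "title", "section", "room"}, fds)
print([sorted(p) for p in parts])
# [['number', 'title'], ['number', 'room', 'section']]
```

On the course example this yields S(number, title) and T(number, section, room). In this sketch both parts of every split are strictly smaller than the schema being split, so the recursion terminates.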
![Page 87: Well-designed XML Data Marcelo Arenas and Leonid Libkin University of Toronto](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d385503460f94a11142/html5/thumbnails/87.jpg)
Future Work
We would like to consider more complex XML constraints and characterize good designs they give rise to.
We would like to characterize 3NF by using the measure developed in this paper.
• In general, we would like to characterize “non-perfect” normal forms.
We would like to develop better characterizations of normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good? Why does it always stop?
47