IS432: Semi-Structured Data, Dr. Azeddine Chikh



Page 1: IS432: Semi-Structured Data Dr. Azeddine Chikh. 1. Semi Structured Data Object Exchange Model

IS432: Semi-Structured Data

Dr. Azeddine Chikh


1. Semi-Structured Data: Object Exchange Model


Introduction

From a database perspective, the Web has generated an enormous demand for recently developed database architectures for database integration, such as data warehouses and mediation systems.

The Web has led to the development of the semistructured data model, with languages adapted to this model.


The emergence of XML as a standard for data representation on the Web is expected to greatly facilitate the publication of electronic data by providing a simple syntax that is both human- and machine-readable.


Although the document and database viewpoints were, until quite recently, irreconcilable, there is now a convergence in technologies brought about by the development of XML for data on the Web and the closely related development of semistructured data in the database community.


Unstructured Data

data can be of any type
  not necessarily following any format or sequence
  does not follow any rules
  is not predictable
examples include text, video, sound, and images


Structured Data

data is organized in semantic chunks (entities)
similar entities are grouped together (relations or classes)
entities in the same group have the same descriptions (attributes)
descriptions for all entities in a group (schema)
  have the same defined format
  have a predefined length
  are all present
  and follow the same order


Semi-Structured Data

idea predates XML but not HTML
data is available electronically in
  database systems
  file systems, e.g., bibliographic data, Web data
  data exchange formats, e.g., EDI, scientific data
attempt to reconcile database and document "worlds"
semi-structured data
  organized in semantic entities
  similar entities are grouped together
  entities in same group may not have same attributes
  order of attributes not necessarily important
  not all attributes may be required
  size of same attributes in a group may differ
  type of same attributes in a group may differ


Example of Semi-Structured Data

name: Azeddine CHIKH
email: [email protected], [email protected]

name:
  first name: Mourad
  last name: Benchikh
email: [email protected]

name: Ashraf Youcef
affiliation: IS Department
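The three records above can be sketched as Python dictionaries to show how entities in the same group may differ in their attributes. This is only an illustration: the variable names are assumptions, and the email values are kept as the redacted placeholder that appears in the source.

```python
# Three "person" entities from the example; variable names are illustrative.
REDACTED = "[email protected]"  # email addresses are redacted in the source

people = [
    {"name": "Azeddine CHIKH", "email": [REDACTED, REDACTED]},
    {"name": {"first name": "Mourad", "last name": "Benchikh"}, "email": REDACTED},
    {"name": "Ashraf Youcef", "affiliation": "IS Department"},
]

# Attributes present in every entity vs. attributes present in only some.
all_attrs = set().union(*(p.keys() for p in people))
common_attrs = set.intersection(*(set(p) for p in people))
optional_attrs = all_attrs - common_attrs

# The same attribute may have a different type in different entities.
name_types = {type(p["name"]).__name__ for p in people}
email_types = {type(p["email"]).__name__ for p in people if "email" in p}
```

Only `name` is shared by all three entities, and even `name` is a plain string in two of them and a nested structure in the third.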


Semi-Structured Data Models

based on labelled graphs rather than labelled trees
used for data exchange among, and integration of, heterogeneous data sources
schema information is in the edge labels
  sometimes called schemaless or self-describing
data stored at the leaves


Graph Terminology (1)

a (directed) graph G = (N, E) consists of a set N of nodes and a set E of edges
each edge in E is an (ordered) pair of nodes (x, y), where x is the source and y is the target
a path from x_1 to x_n is a sequence of edges (x_1, x_2), (x_2, x_3), ..., (x_{n-1}, x_n)
the length of a path is the number of edges in it
a node r is a root for graph G if there is a path from r to every other node in G
a cycle is a path from a node to itself
a graph with no cycles is called acyclic
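The definitions above can be sketched in a few lines of Python. The example graph and the function names (`successors`, `reachable`, `is_root`, `is_acyclic`) are assumptions made for illustration, not part of the slides.

```python
from collections import deque

# A directed graph G = (N, E): E is a set of ordered pairs (source, target).
N = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("a", "d")}

def successors(edges, x):
    """Targets of all edges whose source is x."""
    return {t for (s, t) in edges if s == x}

def reachable(edges, x):
    """Nodes y such that a path of length >= 1 leads from x to y."""
    seen, queue = set(), deque(successors(edges, x))
    while queue:
        y = queue.popleft()
        if y not in seen:
            seen.add(y)
            queue.extend(successors(edges, y))
    return seen

def is_root(nodes, edges, r):
    """r is a root if there is a path from r to every other node."""
    return nodes - {r} <= reachable(edges, r)

def is_acyclic(nodes, edges):
    """A graph is acyclic if no node has a path back to itself."""
    return all(n not in reachable(edges, n) for n in nodes)
```

In this example `a` is a root and the graph is acyclic; adding the edge `("c", "a")` would create a cycle.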


Graph Terminology (2)

a graph is rooted if it has a single root
a tree is a rooted graph G in which there is a unique path from the root to every other node in G
a node is a leaf if it is not the source of any edge
graphs can have node labels and/or edge labels
in an edge-labelled graph G = (N, E, FE), FE is an edge labelling function that maps each edge to a label
in a node-labelled graph G = (N, E, FN), FN is a node labelling function that maps each node to a label
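A minimal Python sketch of these terms follows. The example graph and all function names are assumptions. The tree test does not count paths directly; it assumes the usual equivalent characterization for graphs with more than one node (the root has no incoming edge and every other node has exactly one).

```python
# Edge-labelled graph G = (N, E, FE); FE maps each edge to a label.
N = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (3, 4)}
FE = {(1, 2): "title", (1, 3): "author", (3, 4): "lastname"}

def reachable(edges, x):
    """Nodes reachable from x by a path of length >= 1."""
    seen, stack = set(), [t for (s, t) in edges if s == x]
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(t for (s, t) in edges if s == y)
    return seen

def roots(nodes, edges):
    """All nodes from which every other node is reachable."""
    return {r for r in nodes if nodes - {r} <= reachable(edges, r)}

def is_rooted(nodes, edges):
    """Rooted means the graph has a single root."""
    return len(roots(nodes, edges)) == 1

def is_tree(nodes, edges):
    """Rooted, the root has no incoming edge, and every other node has
    exactly one, so the path from the root to each node is unique."""
    if not is_rooted(nodes, edges):
        return False
    (root,) = roots(nodes, edges)
    indegree = {n: 0 for n in nodes}
    for (_, t) in edges:
        indegree[t] += 1
    return all(indegree[n] == (0 if n == root else 1) for n in nodes)

def leaves(nodes, edges):
    """A node is a leaf if it is not the source of any edge."""
    return nodes - {s for (s, _) in edges}
```

Adding the edge `(2, 4)` keeps the graph rooted but gives node 4 two incoming edges, so it is no longer a tree.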


Object Exchange Model (OEM)

original OEM used only node labels
we use a variant in which the edges are labelled
an OEM data graph is a rooted, labelled, directed graph
  its edge labels map to strings
  only its leaf nodes have labels, which map to data values
  no ordering of edges leaving a node


OEM Syntax

example may be written as { book: { author: "Coetzee", title: "Disgrace", year: 1999 } }
simple label-value pairs
labels can be repeated, e.g., for multiple authors
this is a serialization syntax for the graph
what about graphs that are not trees?
  introduce object identifiers (oids) for nodes
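As a rough sketch, the tree-shaped case of this syntax can be produced from a nested Python structure. The representation (a list of label-value pairs for a complex object) and the `serialize` function are assumptions chosen so that labels can be repeated; they are not an official OEM API.

```python
# An OEM object is either an atomic value (a leaf) or a list of
# (label, value) pairs. A list of pairs, unlike a dict, allows a
# label to be repeated.
book = [("book", [("author", "Coetzee"), ("title", "Disgrace"), ("year", 1999)])]

def serialize(obj):
    """Write a tree-shaped OEM object in the label-value syntax."""
    if isinstance(obj, list):
        inner = ", ".join(f"{label}: {serialize(value)}" for label, value in obj)
        return "{ " + inner + " }"
    if isinstance(obj, str):
        return f'"{obj}"'
    return str(obj)

print(serialize(book))
# { book: { author: "Coetzee", title: "Disgrace", year: 1999 } }
```

Because this walks the structure as a tree, it cannot express a node that is shared by two parents; that is the gap object identifiers fill.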


Example of OEM Data Graph (1)


Example of OEM Data Graph (2)


Example of OEM Syntax

bib: &1
{ paper: &2 { ... },
  book: &3 { ... },
  paper: &4
  { author: &10
    { firstname: &15 "Serge",
      lastname: &16 "Abiteboul" },
    author: &11 { ... },
    title: &12 { ... },
    pages: &13
    { first: &17 122,
      last: &18 133 },
    references: &2,
    references: &3
  }
}
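One possible in-memory form of this example, assuming a dictionary keyed by oid, is sketched below. The helper names `follow` and `shared` are hypothetical. Objects that the source elides as `{ ... }` are kept as empty edge lists rather than filled in.

```python
# oid -> atomic value, or list of (label, target oid) edges.
# Objects elided as { ... } in the source stay empty here.
oem = {
    "&1": [("paper", "&2"), ("book", "&3"), ("paper", "&4")],
    "&2": [],
    "&3": [],
    "&4": [("author", "&10"), ("author", "&11"), ("title", "&12"),
           ("pages", "&13"), ("references", "&2"), ("references", "&3")],
    "&10": [("firstname", "&15"), ("lastname", "&16")],
    "&11": [],
    "&12": [],
    "&13": [("first", "&17"), ("last", "&18")],
    "&15": "Serge",
    "&16": "Abiteboul",
    "&17": 122,
    "&18": 133,
}

def follow(graph, start, labels):
    """Oids reached from start by following the given sequence of edge labels."""
    current = {start}
    for label in labels:
        current = {t for oid in current
                   if isinstance(graph[oid], list)
                   for (l, t) in graph[oid] if l == label}
    return current

def shared(graph):
    """Oids that are the target of more than one edge."""
    count = {}
    for value in graph.values():
        if isinstance(value, list):
            for (_, t) in value:
                count[t] = count.get(t, 0) + 1
    return {t for t, c in count.items() if c > 1}
```

Here `&2` and `&3` are each reached both from `&1` and through a `references` edge from `&4`, which is why the graph is not a tree and needs oids to be written down.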


Characteristics of SSData

structure is irregular: missing or additional attributes (labels)
parts of data lack structure, e.g., images
some may yield little structure, e.g., plain text
a-priori schema vs a-posteriori dataguide
  db: fix the schema, then populate the db
  web: design pages, then design schema to facilitate access
schema is large
schema is often ignored, e.g., information retrieval queries
schema is rapidly evolving


Schema Graphs

given some semi-structured data, might want to extract a schema that describes it
useful for
  browsing the data by types
  optimizing queries by reducing the number of paths searched
  improving storage of data
schema graph specifies what edges are permitted in a data graph
every path in the data graph occurs in the schema graph
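The path condition in the last line can be checked directly on small graphs. The sketch below compares sets of label paths up to a fixed length; the two graphs, the node names, and the `label_paths` function are hypothetical, and the length bound is an assumption needed because a schema graph may contain cycles.

```python
def label_paths(graph, root, max_len):
    """All sequences of edge labels on paths of length 1..max_len from root."""
    paths, frontier = set(), {((), root)}
    for _ in range(max_len):
        frontier = {(path + (label,), target)
                    for (path, node) in frontier
                    for (label, target) in graph.get(node, [])}
        paths |= {path for (path, _) in frontier}
    return paths

# Hypothetical data graph: two papers, one of which references the other.
data = {
    "d0": [("bib", "d1")],
    "d1": [("paper", "d2"), ("paper", "d3")],
    "d2": [("title", "d4")],
    "d3": [("title", "d5"), ("references", "d2")],
}
# Hypothetical schema graph permitting those edges (note the cycle on s2).
schema = {
    "s0": [("bib", "s1")],
    "s1": [("paper", "s2")],
    "s2": [("title", "s3"), ("references", "s2")],
}

data_paths = label_paths(data, "d0", max_len=5)
schema_paths = label_paths(schema, "s0", max_len=5)
conforms = data_paths <= schema_paths  # every data path occurs in the schema
```

The schema admits paths the data does not use, such as `references` followed by `references`; the condition only requires inclusion in one direction.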


Example of a Schema Graph


Data Graph Satisfying a Schema Graph

given data graph D and schema graph S
D is an instance of S (or D satisfies S) if there exists a simulation R from D to S such that (root(D), root(S)) is in R
a simulation is a relation R between nodes:
  if (u,v) is in R and (u,x) labelled l is in D, then there exists (v,y) labelled l in S such that (x,y) is in R
for our example:
  node &1 in D related under R to node at target of edge labelled bib in S
  &2 and &4 related to node at target of edge labelled paper
  &3 related to node at target of edge labelled book
  note that above two cases need to satisfy requirements of edges labelled references as well
  &10 and &11 related to node at target of edge labelled author
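One way to test the definition is to compute the largest simulation by starting from all node pairs and deleting pairs that violate the condition until nothing changes. The sketch below assumes that approach. The schema graph is hypothetical (the original figure is not available), the data graph is a reduced version of the bib example, and the names `root`, `s_root`, `s_bib`, and so on are invented for illustration.

```python
def greatest_simulation(data, schema):
    """Largest R such that (u, v) in R and an edge u -l-> x in data
    imply some edge v -l-> y in schema with (x, y) in R."""
    d_nodes = set(data) | {t for es in data.values() for (_, t) in es}
    s_nodes = set(schema) | {t for es in schema.values() for (_, t) in es}
    R = {(u, v) for u in d_nodes for v in s_nodes}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(R):
            ok = all(
                any(l2 == l and (x, y) in R for (l2, y) in schema.get(v, []))
                for (l, x) in data.get(u, [])
            )
            if not ok:
                R.discard((u, v))
                changed = True
    return R

def satisfies(data, d_root, schema, s_root):
    """D satisfies S if the two roots are related by some simulation."""
    return (d_root, s_root) in greatest_simulation(data, schema)

# Reduced data graph: node -> list of (label, target) edges.
data = {
    "root": [("bib", "&1")],
    "&1": [("paper", "&2"), ("book", "&3"), ("paper", "&4")],
    "&4": [("author", "&10"), ("references", "&2"), ("references", "&3")],
}
# Hypothetical schema graph.
schema = {
    "s_root": [("bib", "s_bib")],
    "s_bib": [("paper", "s_paper"), ("book", "s_book")],
    "s_paper": [("author", "s_author"),
                ("references", "s_paper"), ("references", "s_book")],
}
```

With these graphs `satisfies(data, "root", schema, "s_root")` holds; giving `&3` an edge whose label the schema does not permit removes every pair involving `&3`, and the failure propagates up to the roots.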


A Less Specific Schema Graph


Data Guides

data guide is a concise and accurate summary of a data graph
  accurate: every path in the data occurs in the data guide, and vice versa
  concise: every path in the data guide occurs exactly once
data guide is the most specific schema graph for a given data graph
  i.e., there is a simulation from the data guide to every other schema graph the data graph satisfies
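A data guide with these two properties can be built in the same way a nondeterministic automaton is made deterministic: each guide node stands for the set of data nodes reached by one label path. The sketch below assumes that construction, which is not necessarily the one used in the lecture; the data graph and the `data_guide` function are hypothetical.

```python
def data_guide(data, root):
    """Build a guide whose nodes are frozensets of data nodes.
    Each guide node has at most one outgoing edge per label."""
    start = frozenset({root})
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for (label, target) in data.get(node, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {label: frozenset(ts) for label, ts in by_label.items()}
        todo.extend(guide[state].values())
    return start, guide

# Hypothetical data graph: node -> list of (label, target) edges.
data = {
    "root": [("bib", "b")],
    "b": [("paper", "p1"), ("paper", "p2")],
    "p1": [("title", "t1")],
    "p2": [("title", "t2"), ("author", "a1")],
}

start, guide = data_guide(data, "root")
```

The two `paper` edges collapse into a single guide edge whose target represents both papers, and that target offers `title` and `author` once each, even though only one paper has an author.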


Example of a Data Guide (1)


Example of a Data Guide (2)


References

www.cis.upenn.edu/~db/tutorials.html: tutorial on semi-structured data by Peter Buneman, from the Symposium on Principles of Database Systems, 1997

www-db.stanford.edu/lore/research/data.html

Abiteboul S., Buneman P., Suciu D., «Data on the Web - From Relations to Semistructured Data and XML», Morgan Kaufmann Publishers, San Francisco, California