
Discovering, Maintaining, and Using Semantics for Database Schemas

Yuan An, Ph.D.
iSchool at Drexel

February 23, 2009
CS Department at Villanova Univ.


Background

• Information integration is the problem of sharing and using data across disparate information sources.

• Information integration is challenging because information sources are often distributed, autonomous, and heterogeneous.


Example of Information Integration

• Patient healthcare and medical data usually resides in multiple sources, such as different units of hospitals, labs, clinics, personal data management devices, and even drugstores.

• Example tasks for information integration:
– obtaining a holistic view of patient health status
– merging data for multiple healthcare providers


Information Integration


A Central Issue

• A key component of any solution for information integration is the definition of mappings between different data sources/schemas.

• Despite a decade’s effort, building schema mappings remains a very difficult problem.

• The difficulty lies in the requirement of understanding the meaning of the schemas being mapped.


An Example

[Figure: two relational schemas, with the legible table and column labels listed here.
Philadelphia General Hospital DB: Admission(ID, PatRef, AdmDate, DisDate); Treatment(ID, PatRef, Doc, Date, Desc); Patient(ID, MedCr#, Name, Diagnosis).
Boston Mass General Hospital DB: Coronary(ID, Policy#, Enter, Leave, Patient); Pulmonary(ID, Policy#, Enter, Leave, Patient); Admission(ID); Treatment(ID, ProgID, Doc, Date); Progress(ID, PatRef, Symptom).]

Task: transfer patient medical information from Philadelphia General Hospital to Boston Mass General Hospital.


Schema Semantics

[Figure: the hospital schemas from the previous slide, with two Boston tables annotated by conceptual-model fragments.
Progress(ID, PatRef, Symptom) corresponds to the concepts Progress (hasID, hasSymptom) and Patient (hasRefN, hasName), linked by the relationship relate (* to 1).
Treatment(ID, ProgID, Doc, Date) corresponds to the concepts Progress (hasID, hasSymptom), Treatment (hasID, hasDate), and Doctor (hasPhyID, hasName), linked by the relationships prescribe and apply.]


Discovering Semantics

• We aim to develop an automatic tool for discovering semantic mappings from database schemas to conceptual models (CMs).

[Figure: a database (DB) linked to its conceptual model.]


Benefits of Discovering Semantics for Schemas

[Figure: the Philadelphia General Hospital DB schema table Treatment(ID, PatRef, Doc, Date, Desc) shown against the Boston Mass Hospital DB conceptual model, which has the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by the relationships prescribe, recommend, apply, monitor, and relate with * and 1 cardinalities.]


Maintaining Semantics

• We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution.

[Figure: a DB and its conceptual model evolving into DB’ and conceptual model’.]


Using Semantics for Discovering Schema Mapping

[Figure: DB1 with conceptual model 1, and DB2 with conceptual model 2.]


Roadmap

• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions


Challenges

• Conceptual models carry much more semantics, e.g., weak entities, partOf, n-ary relationships, ISA relationships…

• Need to distinguish them all from schema structures.

[Figure: the table Treatment(ID, PatRef, Doc, Date, Desc) and the conceptual model with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by prescribe, recommend, apply, monitor, and relate.]

• Goal: discover all and only the “reasonable” trees, which we call semantic trees, that are plausible semantics of the table.


Existing Mapping Tools

• Schema matching tools: associate atomic elements in different schemas using syntactic links.

• Schema mapping tools: infer query expressions for translating/exchanging data.
– unable to discover the expected semantics of a schema in terms of a conceptual model.

[Figure: two Treatment tables, Treatment(ID, PatRef, Doc, Date, Desc) and Treatment(ID, ProgID, Doc, Date).]


Our Solution for Discovering Semantics

[Figure: the table Treatment(ID, PatRef, Doc, Date, Desc) with correspondences from its columns to the conceptual model of Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by prescribe, recommend, apply, monitor, and relate.]

• Simple correspondences can be specified manually or by using a schema matching tool.

• We focus on deriving semantic trees connecting the individual concepts using “reasonable” links.

• The key is to discover “reasonable” links based on:
1. an analysis of key and foreign key constraints in schemas.
2. a careful study of standard database design principles.


Discovering Semantic Trees

[Figure: the table Treatment(ID, PatRef, Doc, Date, Desc) and the conceptual model of Treatment, Doctor, Progress, and Patient, connected by prescribe, recommend, apply, monitor, and relate, annotated with the four steps.]

Step 1: determine a skeleton tree and its anchor from the key columns.

Step 2: determine the skeleton trees, and their anchors, that correspond to the foreign key (f.k.) columns.

Step 3: link the skeleton trees using shortest functional paths.

Step 4: link any concepts corresponding to unaccounted-for columns.
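Step 3 can be illustrated with a minimal sketch. The graph below is an assumption made for illustration only: the slides do not state which relationships are functional or in which direction, so the edges from Treatment to Doctor (prescribe), from Treatment to Progress (apply), and from Progress to Patient (relate) are hypothetical, as are the function names. The sketch links an anchor to the anchors of the foreign-key skeleton trees by breadth-first search over functional edges only.

```python
from collections import deque

# Hypothetical functional (many-to-one) edges of a toy conceptual model.
# The direction and functionality of each edge are assumptions.
FUNCTIONAL_EDGES = {
    "Treatment": [("prescribe", "Doctor"), ("apply", "Progress")],
    "Progress": [("relate", "Patient")],
    "Doctor": [],
    "Patient": [],
}


def shortest_functional_path(start, goal, edges=FUNCTIONAL_EDGES):
    """Breadth-first search that follows functional edges only.

    Returns a list of (source, relationship, target) triples,
    or None when the goal cannot be reached.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relationship, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relationship, neighbor)]))
    return None


def link_skeleton_trees(anchor, fk_anchors, edges=FUNCTIONAL_EDGES):
    """Union of the shortest functional paths from the key anchor
    to each foreign-key anchor (a sketch of Step 3 only)."""
    tree = []
    for target in fk_anchors:
        path = shortest_functional_path(anchor, target, edges)
        if path is None:
            continue  # no functional connection; left to later steps
        for edge in path:
            if edge not in tree:
                tree.append(edge)
    return tree


tree = link_skeleton_trees("Treatment", ["Patient", "Doctor"])
for source, relationship, target in tree:
    print(f"{source} --{relationship}--> {target}")
```

Breadth-first search is used here because every edge counts the same, so the first path found is a shortest one; a tool that weights relationships differently would need a different search.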


“Divide and Conquer”

• A gradual manner:
1. ER0 – an initial subset with binary relationships.
2. ER1 – adding n-ary relationships.
3. ER2 – adding ISA relationships.


“Good” Properties of the Algorithm

• Guarantees only for “standard” relational schemas.
1. A sense of “completeness”: the algorithm finds all the “correct” semantics.
2. A sense of “soundness”: for multiple candidates, each one would result in an “indistinguishable” table by the standard database design methodology.


The MAPONTO Tool

[Figure: the MAPONTO tool, with a callout labeled “the mapping formulas”.]


Evaluation Results

Schema              # of Tables   # of Columns   Ontology              # of Nodes   # of Links
UTCS Department     8             32             Academic Department   62           1913
VLDB Conference     9             38             Academic Conference   27           143
DBLP Bibliography   5             27             Bibliographic Data    75           1178
OBSERVER Project    8             115            Bibliographic Data    75           1178
Country             6             18             CIA Factbook          52           125


Evaluation Results

• Correct semantics were found for 85% of the tested tables.

• The maximum number of semantics candidates is 4.

• Average execution time is less than 1 second.


Roadmap

• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions


Maintaining Semantics

• We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution.

[Figure: a DB and its conceptual model evolving into DB’ and conceptual model’.]


Challenges in Maintenance

• What to maintain: how to define the property to be maintained and how to detect violations of that property.

• How to capture changes to CMs and relational schemas.

• How to reconcile CMs and schemas according to the intent of users.


Our Goals of Mapping Maintenance

• To keep the mapping consistent: a consistent conceptual-relational mapping allows two-way translation of legal instances.

• To reconcile the conceptual model when the associated schema evolves.

• To update the mapping when the associated conceptual model evolves.


Capturing CM/Schema Changes

• A user can change a CM/schema in different ways:
– Modifying the original model.
– Generating a new model.

• It is difficult to ask the user to provide a sequence of primitive actions.

• It would be easier to ask the user to draw correspondences.

Biosample(bsid,species,organ,…,donor_disease)

Biosample(bsid,species,organ,…) tissue(bsid,donor_disease)
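The idea of asking the user for correspondences rather than a sequence of primitive actions can be sketched as follows. This is a minimal illustration, not the tool's actual data model: it assumes the first Biosample line is the original schema and the second line is the evolved one, it leaves out the columns elided by “…”, and the names `OLD`, `NEW`, `CORRESPONDENCES`, and `moved_columns` are hypothetical.

```python
# Assumed reading of the slide: OLD is the original schema, NEW the evolved one.
# Columns hidden behind the "…" on the slide are not listed.
OLD = {"Biosample": ["bsid", "species", "organ", "donor_disease"]}
NEW = {
    "Biosample": ["bsid", "species", "organ"],
    "tissue": ["bsid", "donor_disease"],
}

# User-drawn correspondences: (old table, old column) -> list of (new table, new column).
CORRESPONDENCES = {
    ("Biosample", "bsid"): [("Biosample", "bsid"), ("tissue", "bsid")],
    ("Biosample", "species"): [("Biosample", "species")],
    ("Biosample", "organ"): [("Biosample", "organ")],
    ("Biosample", "donor_disease"): [("tissue", "donor_disease")],
}


def moved_columns(correspondences):
    """Return (old, new) pairs whose correspondence lands in a different table.

    Such pairs hint that a table was split or that a column was moved,
    without the user having to spell out the primitive edit actions.
    """
    return [
        (old, new)
        for old, targets in correspondences.items()
        for new in targets
        if new[0] != old[0]
    ]


for old, new in moved_columns(CORRESPONDENCES):
    print(f"{old[0]}.{old[1]} -> {new[0]}.{new[1]}")
```

Here the two reported pairs, the shared key `bsid` and the attribute `donor_disease`, are what one would expect when part of a table is moved out into a new table that keeps the same key.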


Reconciling CM and Schema

• Analyzing the existing semantics in the original mappings in terms of skeleton trees and connections between anchors.

• Discovering changes through correspondences between old and new models.

• Synchronizing models and adapting the mapping accordingly.


Evaluation Methodology and Results

• The same data sets as for discovering conceptual-relational mappings.

• Measuring efficiency and benefits in comparison to the mapping reconstruction approach.

• Comparing the number of mapping candidates generated by the maintenance and reconstruction approaches.

• The maintenance approach can save at least 80% of user effort for reaching consistent mappings. Execution time is insignificant: avg. < 1 sec.


Roadmap

• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions


Using CM-Relational Mappings for Discovering Schema Mapping

[Figure: DB1 with conceptual model 1, and DB2 with conceptual model 2.]


Current Solutions for Schema Mapping

SOURCE: Treatment(ID, ProgID, Doc, Date) and Progress(ID, PatRef, Symptom)

TARGET: Treatment(ID, PatRef, Doc, Date, Desc)

Compose Progress(ID,PatRef,Symptom) with Treatment(ID’,ProgID,Doc,Date) where Progress.ID=Treatment.ProgID → Treatment(ID’,PatRef,Doc,Date,Symptom).
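The composition above amounts to a join over the foreign key. The following sketch, using Python's standard `sqlite3` module with made-up sample rows, shows one way the source tables could be joined to produce rows of the shape Treatment(ID’, PatRef, Doc, Date, Symptom). The column types and the sample data are assumptions, not part of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source tables as named on the slide; column types are assumed.
conn.executescript(
    """
    CREATE TABLE Progress (
        ID INTEGER PRIMARY KEY,
        PatRef TEXT,
        Symptom TEXT
    );
    CREATE TABLE Treatment (
        ID INTEGER PRIMARY KEY,
        ProgID INTEGER REFERENCES Progress(ID),
        Doc TEXT,
        "Date" TEXT
    );
    -- Made-up sample rows for illustration only.
    INSERT INTO Progress VALUES (1, 'p17', 'chest pain');
    INSERT INTO Treatment VALUES (10, 1, 'Dr. Lee', '2009-02-23');
    """
)

# Join on the referential constraint Progress.ID = Treatment.ProgID.
rows = conn.execute(
    """
    SELECT t.ID, p.PatRef, t.Doc, t."Date", p.Symptom
    FROM Treatment AS t
    JOIN Progress AS p ON p.ID = t.ProgID
    """
).fetchall()

for row in rows:
    print(row)
```

The join condition comes straight from the referential integrity constraint, which is exactly the information this kind of mapping approach relies on.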


Using the Semantics

[Figure: conceptual models with the concepts Employee (ssn, name), Doctor (ssn, clinic), and Scientist (ssn, lab); source tables Doctor(ssn, name, clinic) and Scientist(ssn, name, lab); target table employee(eid, name, clinic, lab).]

1. Load Doctor.name and Doctor.clinic into employee as employee.name and employee.clinic in the target.

2. Load Scientist.name and Scientist.lab into employee as employee.name and employee.lab in the target.

3. Compose Doctor(ssn,name’,clinic) with Scientist(ssn,name,lab) where they have the same ssn → employee(z,name,clinic,lab).
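A minimal sketch of how the three mappings listed on this slide could populate the target together is given below. Everything beyond the table and column names is an assumption: the sample rows are made up, `build_employee` is a hypothetical helper, and generating `eid` as a running number stands in for the existential value z in the third mapping.

```python
# Made-up sample source rows.
doctors = [
    {"ssn": "111", "name": "Ann", "clinic": "Cardiology"},
    {"ssn": "222", "name": "Bob", "clinic": "Oncology"},
]
scientists = [
    {"ssn": "222", "name": "Bob", "lab": "Genomics"},
    {"ssn": "333", "name": "Cy", "lab": "Imaging"},
]


def build_employee(doctors, scientists):
    """Merge Doctor and Scientist rows into employee(eid, name, clinic, lab).

    Rows sharing an ssn are combined into one employee (mapping 3);
    the remaining rows are loaded with a missing clinic or lab (mappings 1 and 2).
    """
    merged = {}
    for d in doctors:
        merged[d["ssn"]] = {"name": d["name"], "clinic": d["clinic"], "lab": None}
    for s in scientists:
        row = merged.setdefault(
            s["ssn"], {"name": s["name"], "clinic": None, "lab": None}
        )
        row["name"] = s["name"]  # mapping 3 takes the name from Scientist
        row["lab"] = s["lab"]
    # eid is a generated identifier; ssn is not carried into the target.
    return [
        {"eid": eid, **row}
        for eid, (_ssn, row) in enumerate(sorted(merged.items()), start=1)
    ]


for employee in build_employee(doctors, scientists):
    print(employee)
```

The point of the sketch is that a person who is both a doctor and a scientist ends up as a single employee row rather than two partial ones.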


Principles of the Semantic Approach

• Discovering two conceptual subgraphs (CSGs) that are “semantically similar” (≠ “structurally match”) and then translating the CSGs into algebraic expressions.

1. connections between corresponding pairs of nodes are semantically similar or compatible, e.g., ISA, partOf…

2. maintaining desirable properties in database queries.

3. the principle of parsimony: smallest trees.


Evaluation Methodology

• Comparison between the semantic approach and traditional approaches based on referential integrity constraints.

• Manually specified mapping expressions as a “gold standard”.

• Traditional “precision” and “recall” as evaluation criteria.

• Data collection from a variety of domains.
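As a reminder of the evaluation criteria, precision is the fraction of generated mappings that appear in the gold standard, and recall is the fraction of gold-standard mappings that were generated. The sketch below computes both for sets of mapping identifiers; the identifiers and the helper name `precision_recall` are hypothetical, and the numbers are not results from the talk.

```python
def precision_recall(found, gold):
    """Precision and recall of generated mappings against a gold standard."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical mapping identifiers, for illustration only.
gold_standard = {"m1", "m2", "m3", "m4"}
generated = {"m1", "m2", "m5"}

precision, recall = precision_recall(generated, gold_standard)
print(f"precision={precision:.2f} recall={recall:.2f}")
```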


Test Data

Schema                # tables   Associated CM                           # nodes in CM   # mappings tested
DBLP1 / DBLP2         22 / 9     Bibliographic / DBLP2 ER                75 / 7          6
Mondial1 / Mondial2   28 / 26    Factbook / Mondial2 ER                  52 / 26         5
Amalgam1 / Amalgam2   15 / 27    Amalgam1 ER / Amalgam2 ER               8 / 26          7
3Sdb1 / 3Sdb2         9 / 9      3Sdb1 ER / 3Sdb2 ER                     3 3
UTCS / UTDB           8 / 13     KA ontology / CS dept. ontology         105 / 62        2
HotelA / HotelB       6 / 5      hotelA ontology / hotelB ontology       7 / 7           5
NetworkA / NetworkB   18 / 19    networkA ontology / networkB ontology   28 / 27         6


Summary of the Evaluation Results

• Found all the expected mappings that the traditional approach found.

• Improved precision (in 70% of the test cases) by eliminating suspicious pairings.

• Improved recall (in 40% of the test cases) by considering ISA as a functional relationship.

• Where the semantics are not very complicated, there is no improvement.


Roadmap

• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions


Conclusions

• A novel and effective tool for discovering semantics for schemas in terms of conceptual models.

• A round-trip engineering process for maintaining semantic mappings.

• A semantic approach for improving schema mappings using the semantics.

• A suite of tools for assisting users to discover and maintain mappings between different data representations in a variety of information integration situations.


Thank You!
