Post on 17-Jan-2016

Page 1: Integrating data sources on the World-Wide Web

Integrating data sources on the World-Wide Web

Ramon Lawrence and Ken Barker

U. of Manitoba, U. of Calgary

umlawren,[email protected]

Page 2: Integrating data sources on the World-Wide Web

Introduction

• Integration of data is required when accessing multiple databases within an organization or on the WWW.

• Our focus is on automatically combining database schemas through schema integration.

• Schema integration requires knowledge of data semantics and use of metadata.

Page 3: Integrating data sources on the World-Wide Web

Motivation

• Organizations have several database systems which must interoperate.

• Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form.

• Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.

Page 4: Integrating data sources on the World-Wide Web

Background

• Schema integration involves combining diverse database schemas into an integrated view by resolving conflicts.

• Schema conflicts include naming, structural, and semantic conflicts.

• Schema integration is required for database interoperability, but it is currently a manual process.

Page 5: Integrating data sources on the World-Wide Web

Previous Work

• Research systems:
– integrating systems by logical rules (Sheth)
– defining global dictionaries (Castano)
– Carnot Project using the Cyc knowledge base

• Industrial systems and standards:
– Metadata Interchange Specification (MDIS)
– XML, BizTalk, E-commerce portals

Page 6: Integrating data sources on the World-Wide Web

Architecture Components: The Global Dictionary

• A global dictionary (GD) provides standardized terms to capture data semantics.
– Hierarchy of terms related by IS-A or Has-A links
– Contains base set of common database concepts, but new concepts can be added

• A GD term is a single, unambiguous semantic definition.
– Several GD entries for a single English word are required if the word has multiple definitions.
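The structure described on this slide can be sketched as a small in-memory dictionary. This is only an illustration under assumptions: the class names, method names, term keys, definitions, and the direction of the Claimant/Customer link are all invented here and are not taken from the authors' system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One GD entry: a single, unambiguous definition (hypothetical layout)."""
    key: str                       # unique identifier of the term
    definition: str
    is_a: Optional[str] = None     # key of the more general term, if any
    has_a: List[str] = field(default_factory=list)  # keys of component terms


class GlobalDictionary:
    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, key: str, definition: str, is_a: Optional[str] = None) -> None:
        """Add a new concept; the parent of an IS-A link must already exist."""
        if key in self.terms:
            raise ValueError(f"duplicate term: {key}")
        if is_a is not None and is_a not in self.terms:
            raise KeyError(f"unknown parent term: {is_a}")
        self.terms[key] = Term(key, definition, is_a)

    def link_has_a(self, owner: str, part: str) -> None:
        """Record a Has-A link between two existing terms."""
        if part not in self.terms:
            raise KeyError(f"unknown term: {part}")
        self.terms[owner].has_a.append(part)

    def ancestors(self, key: str) -> List[str]:
        """Follow IS-A links upward, nearest ancestor first."""
        chain = []
        parent = self.terms[key].is_a
        while parent is not None:
            chain.append(parent)
            parent = self.terms[parent].is_a
        return chain

    def is_kind_of(self, specific: str, general: str) -> bool:
        return specific == general or general in self.ancestors(specific)


gd = GlobalDictionary()
gd.add("Customer", "A party that does business with the organization")
gd.add("Claimant", "A customer who files a claim", is_a="Customer")  # assumed link
gd.add("Claim", "A request for payment")
gd.add("Payment", "Money paid out")
gd.link_has_a("Claim", "Payment")
# One English word with two meanings gets two separate entries.
gd.add("Amount (money)", "A sum of money")
gd.add("Amount (quantity)", "A count or measure of items")
```

Keeping one entry per meaning means a lookup never has to guess which sense of a word a schema element intends.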

Page 7: Integrating data sources on the World-Wide Web

Architecture Components: Using the Global Dictionary

• GD terms are used to build semantic names to describe the semantics of schema elements.

• Semantic names have the form:
– semantic name = "[" CT [ [;CT] | [,CT] ] "]" CN
– CT = context term, CN = concept name
– each CT and CN is a single term from the GD

• Semantic names are included in RIM specifications describing a data source.
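A semantic name in this form can be split into its context terms and concept name with a few lines of code. The sketch below is an assumption-laden reading of the grammar: it accepts any number of context terms, records the ";" and "," separators without interpreting them (the slides do not say how they differ), and allows the concept name to be absent, as it is for table-level names such as `[Claim]` in the RIM spec examples. The names `SemanticName` and `parse_semantic_name` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# "[" context terms "]" followed by an optional concept name.
_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(\S.*)?$")


@dataclass(frozen=True)
class SemanticName:
    contexts: Tuple[str, ...]     # context terms (CT), outermost first
    separators: Tuple[str, ...]   # ";" or "," between consecutive CTs
    concept: Optional[str]        # concept name (CN), None if absent


def parse_semantic_name(text: str) -> SemanticName:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    inner, concept = match.groups()
    # Splitting on a capturing group keeps the separators in the result.
    parts = re.split(r"([;,])", inner)
    contexts = tuple(p.strip() for p in parts[0::2])
    if any(not c for c in contexts):
        raise ValueError(f"empty context term in: {text!r}")
    return SemanticName(contexts, tuple(parts[1::2]),
                        concept.strip() if concept else None)


payment_amount = parse_semantic_name("[Claim;Payment] Amount")
claim_table = parse_semantic_name("[Claim]")
```

A real implementation would also check each term against the global dictionary; that lookup is left out here.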

Page 8: Integrating data sources on the World-Wide Web

Architecture Components: The Relational Integration Model

• Database metadata and semantic names are combined into Relational Integration Model (RIM) specifications (RIM specs). A RIM spec:
– contains information on a relational schema
– is organized into database, table, and field levels
– stores semantic names to describe and integrate schema elements
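The three levels of a RIM spec map naturally onto nested records. The following is a minimal sketch, assuming only a system name and a semantic name per element; the class names are hypothetical, and a real spec would carry more metadata (data types, keys, and so on). The sample data is the ABC claims table used later in the slides.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FieldSpec:
    system_name: str      # column name in the database
    semantic_name: str    # name built from global dictionary terms


@dataclass
class TableSpec:
    system_name: str
    semantic_name: str
    fields: List[FieldSpec] = field(default_factory=list)


@dataclass
class RimSpec:
    database: str
    tables: List[TableSpec] = field(default_factory=list)

    def rows(self) -> Iterator[Tuple[str, str, str]]:
        """Flatten the spec into (type, system name, semantic name) rows."""
        for table in self.tables:
            yield ("Table", table.system_name, table.semantic_name)
            for f in table.fields:
                yield ("Field", f.system_name, f.semantic_name)


abc_spec = RimSpec("ABC", [
    TableSpec("Claims_tb", "[Claim]", [
        FieldSpec("claim_id", "[Claim] Id"),
        FieldSpec("claimant", "[Claim;Claimant] Name"),
        FieldSpec("net_amount", "[Claim] Amount"),
        FieldSpec("paid_amount", "[Claim;Payment] Amount"),
    ]),
])
```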

Page 9: Integrating data sources on the World-Wide Web

Architecture Components: Integrating RIM Specs

• Each database to be integrated is described using a RIM specification.

• Identical concepts in different databases are identified by similar semantic names.

• Concepts with identical (or hierarchically related) semantic names are combined regardless of their physical representation in the individual databases.
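One plausible reading of "identical or hierarchically related" is a term-by-term comparison in which two terms match if they are equal or if one is an IS-A ancestor of the other. The slides do not give the actual matching algorithm, so the sketch below is an assumption throughout: the `IS_A` table, the helper names, and the simplification of treating ";" and "," alike are all illustrative.

```python
from typing import Dict, List, Tuple

# child term -> parent term; an assumed fragment of the global dictionary
IS_A: Dict[str, str] = {"Claimant": "Customer"}


def ancestors(term: str) -> List[str]:
    """Follow IS-A links upward from a term."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def related(a: str, b: str) -> bool:
    """Equal terms, or one term is an IS-A ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def split_name(name: str) -> Tuple[List[str], str]:
    """Split '[CT;CT] CN' into its context terms and concept name."""
    context, _, concept = name.partition("]")
    terms = context.lstrip("[").replace(",", ";").split(";")
    return [t.strip() for t in terms], concept.strip()


def names_match(x: str, y: str) -> bool:
    contexts_x, concept_x = split_name(x)
    contexts_y, concept_y = split_name(y)
    return (concept_x == concept_y
            and len(contexts_x) == len(contexts_y)
            and all(related(a, b) for a, b in zip(contexts_x, contexts_y)))
```

Under these assumptions, `names_match("[Claim;Claimant] Name", "[Claim;Customer] Name")` holds because Claimant IS-A Customer, while `[Claim] Amount` and `[Claim;Payment] Amount` stay distinct because their contexts differ in length.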

Page 10: Integrating data sources on the World-Wide Web

Integration Architecture

• Our integration architecture consists of two separate phases:
– capture process: RIM specs are constructed for each data source independently
– integration process: RIM specs are combined using the integration algorithm, which matches semantic names using the global dictionary

Page 11: Integrating data sources on the World-Wide Web

Integration Architecture: The Capture Process

• Capture process involves:
– automatically extracting the schema information and metadata using a specification editor
– assigning semantic names to each schema element (tables and fields) to capture their semantics
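The two capture steps can be sketched against SQLite, which exposes its catalog through the `sqlite_master` table and `PRAGMA table_info`. This is an illustration only: the slides do not say which database systems or catalog queries the specification editor uses, and the function name, row layout, and hand-written `assigned` mapping are assumptions standing in for the DBA's work in the editor.

```python
import sqlite3
from typing import Dict, List, Optional


def extract_schema(conn: sqlite3.Connection) -> List[Dict[str, Optional[str]]]:
    """Step 1: read table and column names from the database catalog."""
    rows: List[Dict[str, Optional[str]]] = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        rows.append({"type": "Table", "system_name": table, "semantic_name": None})
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        for info in conn.execute(f'PRAGMA table_info("{table}")'):
            rows.append({"type": "Field", "system_name": info[1],
                         "semantic_name": None})
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Claims_tb (claim_id INTEGER, claimant TEXT, "
    "net_amount REAL, paid_amount REAL)"
)
draft = extract_schema(conn)

# Step 2: semantic names are assigned by a person, here as a plain mapping.
assigned = {
    "Claims_tb": "[Claim]",
    "claim_id": "[Claim] Id",
    "claimant": "[Claim;Claimant] Name",
    "net_amount": "[Claim] Amount",
    "paid_amount": "[Claim;Payment] Amount",
}
for row in draft:
    row["semantic_name"] = assigned.get(row["system_name"])
```

Only the first step is automatic; the second needs someone who understands what each column means.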

Page 12: Integrating data sources on the World-Wide Web

Integration Architecture: The Capture Process

[Figure: capture process diagram. Labels: Relational Schema, Automatic Extraction, Specification Editor, DBA, Lookup of terms, Global Dictionary, RIM Spec]

Page 13: Integrating data sources on the World-Wide Web

Integration Architecture: The Integration Process

• Integration process involves:
– automatically identifying identical concepts by matching semantic names
– constructing a global view of database concepts consisting of a hierarchy of concept terms
– resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
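Resolving a structural difference at query time amounts to looking up, per database, where a semantic name physically lives. The sketch below uses the two claims schemas from the integration example later in the slides. It is a simplified assumption, not the authors' query generator: the `SPECS` layout and `subquery` function are hypothetical, and the sketch only handles requests that a single table can answer.

```python
from typing import Dict, List, Tuple

# database -> list of (table, column, semantic name)
SPECS: Dict[str, List[Tuple[str, str, str]]] = {
    "ABC": [
        ("Claims_tb", "claim_id", "[Claim] Id"),
        ("Claims_tb", "claimant", "[Claim;Claimant] Name"),
        ("Claims_tb", "net_amount", "[Claim] Amount"),
        ("Claims_tb", "paid_amount", "[Claim;Payment] Amount"),
    ],
    "XYZ": [
        ("T_claims", "id", "[Claim] Id"),
        ("T_claims", "customer", "[Claim;Customer] Name"),
        ("T_claims", "claim_amt", "[Claim] Amount"),
        ("T_payments", "cid", "[Claim] Id"),
        ("T_payments", "pid", "[Claim;Payment] Id"),
        ("T_payments", "amount", "[Claim;Payment] Amount"),
    ],
}


def subquery(database: str, wanted: List[str]) -> str:
    """Build a single-table SELECT answering the requested semantic names."""
    by_table: Dict[str, Dict[str, str]] = {}
    for table, column, semantic in SPECS[database]:
        if semantic in wanted:
            by_table.setdefault(table, {})[semantic] = column
    for table, columns in by_table.items():
        if all(name in columns for name in wanted):
            select = ", ".join(columns[name] for name in wanted)
            return f"SELECT {select} FROM {table}"
    raise NotImplementedError("request needs a join; not covered by this sketch")


request = ["[Claim] Id", "[Claim;Payment] Amount"]
abc_sql = subquery("ABC", request)
xyz_sql = subquery("XYZ", request)
```

The same request becomes a query on `Claims_tb` in ABC, where the payment is a field, and a query on `T_payments` in XYZ, where it is a table. The user never sees either name.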

Page 14: Integrating data sources on the World-Wide Web

Integration Architecture: The Integration Process

[Figure: integration process diagram. Labels: Client (several), Integration Site, Subtransactions, RDBMS (several), RIM spec (shown twice)]

Page 15: Integrating data sources on the World-Wide Web

Integration Architecture Benefits

• The benefits of the two-phase architecture are:
– Dynamic integration: schemas integrated as needed
– RIM Specs are constructed only once and independently of each other
– Automatic conflict resolution by integrating based on semantic name rather than physical structure
– Users are isolated from system names and organization by querying through a global view using semantic names for concepts

Page 16: Integrating data sources on the World-Wide Web

Integration Example

• Two claims databases to be integrated:
– ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount)
– XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount)

• First step is to construct RIM specs for each database.

Page 17: Integrating data sources on the World-Wide Web

Integration Example: ABC Database RIM Spec

Type   System Name   Semantic Name
Table  Claims_tb     [Claim]
Field  Claim_id      [Claim] Id
Field  Claimant      [Claim;Claimant] Name
Field  Net_amount    [Claim] Amount
Field  Paid_amount   [Claim;Payment] Amount

Page 18: Integrating data sources on the World-Wide Web

Integration Example: XYZ Database RIM Spec

Type   System Name   Semantic Name
Table  T_claims      [Claim]
Field  id            [Claim] Id
Field  customer      [Claim;Customer] Name
Field  claim_amt     [Claim] Amount
Table  T_payments    [Claim;Payment]
Field  cid           [Claim] Id
Field  pid           [Claim;Payment] Id
Field  amount        [Claim;Payment] Amount

Page 19: Integrating data sources on the World-Wide Web

Integration Example: Integrated View

• Global view after integration:
– [Claim]
  • Id
  • Net amount
  • [Customer]
    – name
  • [Payment]
    – id
    – amount
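A view of this shape can be derived by merging the semantic names of both RIM specs into one tree. The sketch below is an assumption-based illustration: `build_view` and the `GENERALIZE` table are hypothetical, Claimant is folded into Customer on the assumption that Claimant IS-A Customer in the global dictionary, and leaves use the concept names from the RIM specs (so the claim amount appears as `Amount`).

```python
from typing import Dict, List

ABC: List[str] = [
    "[Claim]", "[Claim] Id", "[Claim;Claimant] Name",
    "[Claim] Amount", "[Claim;Payment] Amount",
]
XYZ: List[str] = [
    "[Claim]", "[Claim] Id", "[Claim;Customer] Name", "[Claim] Amount",
    "[Claim;Payment]", "[Claim] Id", "[Claim;Payment] Id",
    "[Claim;Payment] Amount",
]

# specific term -> general term it is merged into (assumed IS-A link)
GENERALIZE: Dict[str, str] = {"Claimant": "Customer"}


def build_view(*specs: List[str]) -> dict:
    """Merge semantic names into a nested dict of contexts and concepts."""
    view: dict = {}
    for spec in specs:
        for name in spec:
            context, _, concept = name.partition("]")
            node = view
            for term in context.lstrip("[").split(";"):
                term = GENERALIZE.get(term.strip(), term.strip())
                # Context terms are stored in brackets so they cannot
                # collide with concept names at the same level.
                node = node.setdefault(f"[{term}]", {})
            if concept.strip():
                node.setdefault(concept.strip(), {})
    return view


global_view = build_view(ABC, XYZ)
```

Because the tree is keyed on semantic names alone, the payment field in ABC and the payment table in XYZ land on the same `[Payment]` node.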

Page 20: Integrating data sources on the World-Wide Web

Integration Example: Discussion

• Important points:
– system and field names are not presented to the user, who queries based on semantic names
– database structure is not shown to the user
– different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database)
– hierarchically related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary

Page 21: Integrating data sources on the World-Wide Web

Applications to the WWW

• Integrating diverse data sources is involved in constructing a data warehouse and other operational systems.

• The WWW is a diverse collection of databases that users access.

• Automatically integrating web data sources in a browser or portal reduces the effort users spend formulating queries and integrating results.

Page 22: Integrating data sources on the World-Wide Web

Conclusions

• Automatic integration of database schemas is possible by using a global dictionary of terms and constructing semantic names for schema elements.

• Integration of data sources has applications to the WWW and construction of data warehouses.

Page 23: Integrating data sources on the World-Wide Web

Important Changes

• The integration architecture is constantly being refined. Some notable differences in this presentation versus the paper:
– Our integration system uses XML to represent a RIM spec, which is renamed an X-Spec.
– An integration site is used as a central portal for integration and management.
– No longer using semantic distance calculations between terms.
– Format of semantic name has been simplified.
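To give a feel for what an XML-encoded spec might look like, the sketch below serializes the ABC claims spec with Python's standard `xml.etree.ElementTree`. The slides do not show the X-Spec format, so every element and attribute name here (`spec`, `table`, `field`, `systemName`, `semanticName`) is invented for illustration and should not be read as the actual X-Spec schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (type, system name, semantic name) rows for the ABC claims database
ROWS: List[Tuple[str, str, str]] = [
    ("Table", "Claims_tb", "[Claim]"),
    ("Field", "claim_id", "[Claim] Id"),
    ("Field", "claimant", "[Claim;Claimant] Name"),
    ("Field", "net_amount", "[Claim] Amount"),
    ("Field", "paid_amount", "[Claim;Payment] Amount"),
]


def to_xml(database: str, rows: List[Tuple[str, str, str]]) -> str:
    """Serialize spec rows as XML; fields nest under the preceding table."""
    root = ET.Element("spec", {"database": database})  # hypothetical names
    table = None
    for kind, system_name, semantic_name in rows:
        attrs = {"systemName": system_name, "semanticName": semantic_name}
        if kind == "Table":
            table = ET.SubElement(root, "table", attrs)
        elif table is None:
            raise ValueError("field row appears before any table row")
        else:
            ET.SubElement(table, "field", attrs)
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml("ABC", ROWS)
```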

Page 24: Integrating data sources on the World-Wide Web

Future Work

• The integration architecture is evolving along with XML standards and now captures metadata in XML documents.

• The system is being tested on sample problems, and a query mechanism is a work in progress.

• We are refining a prototype of the system called Unity.