semantic integration in heterogeneous databases using neural networks

Semantic Integration in Heterogeneous Databases Using Neural Networks Wen-Syan Li, Chris Clifton Presentation by Jeff Roth

Page 1: Semantic Integration in Heterogeneous Databases Using Neural Networks

Semantic Integration in Heterogeneous Databases Using

Neural Networks

Wen-Syan Li, Chris Clifton

Presentation by Jeff Roth

Page 2: Semantic Integration in Heterogeneous Databases Using Neural Networks

Introduction

Basic schema matching problem

GTE’s data integration project included 27,000 data elements

This took 4 hours per data element, or 25 full-time employees 2 years to complete

This method -> 0.1 seconds, 144,000 x faster

“how to match knowledge is discovered”
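The speedup figure follows from simple arithmetic on the numbers quoted on this slide. A quick check, assuming the 0.1 seconds is the automated matching time per data element; the hours-per-person-year value is derived here and is not stated in the slides:

```python
# Figures quoted on the slide.
elements = 27_000
manual_hours_per_element = 4
automated_seconds_per_element = 0.1  # assumed to be a per-element time

# 4 hours expressed in seconds, divided by the automated time per element.
manual_seconds_per_element = manual_hours_per_element * 3600
speedup = round(manual_seconds_per_element / automated_seconds_per_element)

# Total manual effort, and what it implies for 25 people over 2 years.
total_manual_hours = elements * manual_hours_per_element
person_years = 25 * 2
hours_per_person_year = total_manual_hours / person_years

print(speedup, total_manual_hours, hours_per_person_year)
```

The two manual-effort figures agree with each other: 108,000 hours spread over 50 person-years is 2,160 hours per person per year.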

Page 3: Semantic Integration in Heterogeneous Databases Using Neural Networks

Method Outline

“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”

Page 4: Semantic Integration in Heterogeneous Databases Using Neural Networks

Automated semantic integration methods

Attribute Name Comparison

This method is not used in this paper

Attribute values and domains comparison

Equal, Contains, Overlap, Contained-in and Disjoint

Used but not with the above measures

Field Specifications

Data type, field length constraints and others.

This is also used in this method

Page 5: Semantic Integration in Heterogeneous Databases Using Neural Networks

Field Specifications

The following measures are used:

data types

Each possible data type has a network input, with the field's data type having a value of 1 and all the others having a value of 0

field length

Length = 2 * (1/(1 + k^(-length)) - 0.5)

format specifications

similar to data type

constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.)

similar to data type
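A minimal sketch of how these field specifications could be turned into network inputs. It reads the field length formula as 2 * (1/(1 + k^(-length)) - 0.5). The list of data types and the value k = 1.01 are assumptions for illustration and are not given in the slides:

```python
# Hypothetical list of data types; a DBMS-specific parser would supply its own.
DATA_TYPES = ["char", "varchar", "integer", "float", "date"]

def encode_data_type(data_type):
    """One input per possible type: 1.0 for the field's type, 0.0 elsewhere."""
    return [1.0 if t == data_type else 0.0 for t in DATA_TYPES]

def normalize_length(length, k=1.01):
    """Squash a non-negative field length into the range [0, 1).

    k = 1.01 is an assumed constant, chosen only for this example.
    """
    return 2.0 * (1.0 / (1.0 + k ** (-length)) - 0.5)

def encode_field(data_type, length):
    """Concatenate the data type inputs with the normalized length input."""
    return encode_data_type(data_type) + [normalize_length(length)]

print(encode_field("integer", 4))
print(encode_field("varchar", 255))
```

Constraints such as primary key or nullability could be appended in the same way as further 0/1 inputs.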

Page 6: Semantic Integration in Heterogeneous Databases Using Neural Networks

Attribute Values and Domains

Divide measures into character fields and numeric fields

Patterns for Character fields

1. Ratio of numerical characters

Address: 146 South 920 West would score 6/18

2. Ratio of white space

Address: 146 South 920 West would score 3/18

3. Length Statistics

Average, variance, and coefficient of variation of the “used” length relative to the maximum length
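The character-field patterns can be computed directly from the stored values. The sketch below reproduces the address example from this slide; the function names are illustrative, and the use of population variance is an assumption:

```python
from statistics import mean, pvariance

def digit_ratio(value):
    """Fraction of characters in the value that are digits."""
    return sum(ch.isdigit() for ch in value) / len(value) if value else 0.0

def whitespace_ratio(value):
    """Fraction of characters in the value that are white space."""
    return sum(ch.isspace() for ch in value) / len(value) if value else 0.0

def length_statistics(values, max_length):
    """Average, variance, and coefficient of variation of used length / max length."""
    used = [len(v) / max_length for v in values]
    avg = mean(used)
    var = pvariance(used)
    cv = (var ** 0.5) / avg if avg else 0.0
    return avg, var, cv

address = "146 South 920 West"
print(digit_ratio(address))       # 6 of the 18 characters are digits
print(whitespace_ratio(address))  # 3 of the 18 characters are spaces
print(length_statistics([address, "12 Main St"], max_length=30))
```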

Page 7: Semantic Integration in Heterogeneous Databases Using Neural Networks

Attribute Values and Domains cont.

Patterns for numeric fields

1. Average (mean)

2. Variance

3. Coefficient of variation

Recognizes similarity between values of different units and granularity

This can also help recognize which fields may need unit conversions

4. Grouping

For example: area code, zip code, first three digits of SSN
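The coefficient of variation (standard deviation divided by the mean) is unchanged when every value is multiplied by a constant, which is why it can match fields stored in different units. A small sketch with made-up sample values:

```python
from statistics import mean, pstdev, pvariance

def numeric_features(values):
    """Return (mean, variance, coefficient of variation) for a numeric field."""
    avg = mean(values)
    var = pvariance(values)
    cv = pstdev(values) / abs(avg) if avg else 0.0
    return avg, var, cv

# Hypothetical data: the same heights stored in metres and in centimetres.
heights_m = [1.5, 1.6, 1.8, 2.0]
heights_cm = [v * 100 for v in heights_m]

mean_m, var_m, cv_m = numeric_features(heights_m)
mean_cm, var_cm, cv_cm = numeric_features(heights_cm)

# Mean and variance differ by the unit factor; the coefficient of variation does not.
print(mean_m, mean_cm)
print(cv_m, cv_cm)
```

When two fields have matching coefficients of variation but very different means, that is a hint that a unit conversion may be needed.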

Page 8: Semantic Integration in Heterogeneous Databases Using Neural Networks

Self-Organizing Grouping algorithm

N = number of possible discriminators

M = number of categories; this can be adjusted by the user. “ideally this is |attributes| - |foreign keys|”

This is unsupervised, i.e. you don’t have to provide a correct classification, it simply groups based on similarity
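To illustrate what grouping purely by similarity means, here is a minimal sketch using a simple distance-threshold grouping. This is not the self-organizing algorithm from the paper. The `radius` parameter is hypothetical and stands in for the user's control over how many categories (M) are produced:

```python
import math

def group_by_similarity(vectors, radius):
    """Assign each vector to the nearest group center within `radius`,
    or start a new group if none is close enough. No labels are needed."""
    centers, members = [], []
    for idx, vec in enumerate(vectors):
        best, best_dist = None, radius
        for c, center in enumerate(centers):
            d = math.dist(vec, center)
            if d <= best_dist:
                best, best_dist = c, d
        if best is None:
            centers.append(list(vec))
            members.append([idx])
        else:
            members[best].append(idx)
            n = len(members[best])
            # Move the center to the running mean of its members.
            centers[best] = [c + (v - c) / n for c, v in zip(centers[best], vec)]
    return centers, members

# Four attributes described by N = 2 discriminators each (made-up values).
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
centers, members = group_by_similarity(vectors, radius=0.3)
print(members)
```

A smaller radius yields more categories and a larger one yields fewer.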

Page 9: Semantic Integration in Heterogeneous Databases Using Neural Networks

Training the Back-Prop Network

Inputs (N) are identical to classifier

Outputs (M) are trained using Back-Propagation and classifier’s results

Categories are labeled with the attributes they grouped together*
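A minimal sketch of this training step, assuming one hidden layer, sigmoid units, and a squared-error loss; none of these details are given in the slides. The sample data, layer sizes, and learning rate are made up, and the target vectors stand in for the categories produced by the classifier:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    def layer(rows, cols):
        # One extra weight per row acts as the bias.
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols + 1)] for _ in range(rows)]
    return layer(n_hidden, n_in), layer(n_out, n_hidden)

def forward(net, x):
    w_hidden, w_out = net
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, y

def train(net, samples, epochs=2000, rate=0.5):
    w_hidden, w_out = net
    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(net, x)
            # Output deltas: error times the sigmoid derivative.
            d_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
            # Hidden deltas: output deltas propagated back through w_out.
            d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
                     for j, hj in enumerate(h)]
            for k, row in enumerate(w_out):
                for j, v in enumerate(h + [1.0]):
                    row[j] += rate * d_out[k] * v
            for j, row in enumerate(w_hidden):
                for i, v in enumerate(x + [1.0]):
                    row[i] += rate * d_hid[j] * v

def error(net, samples):
    return sum((t - o) ** 2 for x, target in samples
               for t, o in zip(target, forward(net, x)[1]))

# N = 2 inputs, M = 2 categories; targets come from the (assumed) classifier output.
samples = [([0.0, 0.1], [1.0, 0.0]), ([0.1, 0.0], [1.0, 0.0]),
           ([1.0, 0.9], [0.0, 1.0]), ([0.9, 1.0], [0.0, 1.0])]
net = init_network(2, 3, 2)
before = error(net, samples)
train(net, samples)
after = error(net, samples)
print(before, after)
```

After training, a new attribute's input vector is passed through `forward`, and the output with the highest value indicates the most similar category.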

Page 10: Semantic Integration in Heterogeneous Databases Using Neural Networks

What is the classifier for?

Ease of training:

“ideally [M] is |attributes| - |foreign keys|” and it is less computationally expensive to train M classifications where M < |attributes| - |foreign keys|

It is less computationally complex to compare new elements to the M classifications than to every attribute of the training database or |attributes| - |foreign keys|

Networks can be trained in which there are attributes that are identical

Page 11: Semantic Integration in Heterogeneous Databases Using Neural Networks

Integration Procedure

1. DBMS Specific Parser

2. Classify (Categorize) Training Data

3. Train Neural Network

4. DBMS Specific Parser

5. Classification by Neural Network

6. User Checks Results


Page 12: Semantic Integration in Heterogeneous Databases Using Neural Networks

Results

Page 13: Semantic Integration in Heterogeneous Databases Using Neural Networks

Conclusion and Future Work

Human Effort needed for semantic integration is minimized

Different Systems have different attribute properties available - automated solution

Extend to automated information integration

C source code available at eecs.nwu.edu/pub/semint