Semantic Integration in Heterogeneous Databases Using
Neural Networks
Wen-Syan Li, Chris Clifton
Presentation by Jeff Roth
Introduction
Basic schema matching problem
GTE's data integration project included 27,000 data elements
This took 4 hours per data element, or 25 full-time employees 2 years to complete
This method -> 0.1 seconds per element, 144,000x faster
"How to match" knowledge is discovered
Method Outline
“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”
Automated semantic integration methods
Attribute Name Comparison
This method is not used in this paper
Attribute values and domains comparison
Equal, Contains, Overlap, Contained-in, and Disjoint
Used, but not with the above measures
Field Specifications
Data type, field length, constraints, and others
This is also used in this method
Field Specifications
The following measures are used:
Data types
Each possible data type has a network input, with the field's data type having a value of 1 and all others a value of 0
Field length
Length = 2 * (1 / (1 + k^(-length)) - 0.5), which maps a length into [0, 1)
Format specifications
Similar to data types
Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.)
Similar to data types
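The field-specification inputs above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `DATA_TYPES` list and the constant `k = 1.01` are assumptions, and the length formula is read as 2 * (1/(1 + k^(-length)) - 0.5).

```python
# Illustrative sketch of field-specification inputs (assumed type list and k).
DATA_TYPES = ["int", "float", "char", "date"]  # one network input per type

def data_type_inputs(field_type):
    """One-hot encoding: the field's data type gets 1, all others 0."""
    return [1.0 if t == field_type else 0.0 for t in DATA_TYPES]

def length_input(length, k=1.01):
    """Normalize a field length into [0, 1) via 2 * (1/(1 + k^-length) - 0.5)."""
    return 2.0 * (1.0 / (1.0 + k ** -length) - 0.5)
```

A length of 0 maps to 0, and longer fields approach (but never reach) 1, so very different raw lengths stay on a comparable scale for the network.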
Attribute Values and Domains
Divide measures into character fields and numeric fields
Patterns for character fields
1. Ratio of numerical characters
Address: 146 South 920 West would score 6/18
2. Ratio of white space
Address: 146 South 920 West would score 3/18
3. Length statistics
Average, variance, and coefficient of variation of the "used" length relative to the maximum length
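A minimal sketch of the character-field patterns above, using the slide's address example. The function names are illustrative, not from the paper.

```python
def char_field_features(value):
    """Per-value character patterns: ratio of digits and of whitespace."""
    n = len(value)
    digits = sum(c.isdigit() for c in value)
    spaces = sum(c.isspace() for c in value)
    return digits / n, spaces / n

def length_statistics(values, max_length):
    """Average, variance, and coefficient of variation of the used
    length relative to the declared maximum field length."""
    ratios = [len(v) / max_length for v in values]
    mean = sum(ratios) / len(ratios)
    var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    cv = (var ** 0.5) / mean if mean else 0.0
    return mean, var, cv

# "146 South 920 West" is 18 characters: 6 digits (6/18), 3 spaces (3/18)
```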
Attribute Values and Domains cont.
Patterns for numeric fields
1. Average (mean)
2. Variance
3. Coefficient of variation
Recognizes similarity between values with different units and granularity
This can also help recognize which fields may need unit conversions
4. Grouping
For example: area code, zip code, first three digits of SSN
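The unit-invariance point about the coefficient of variation can be checked directly: because it is the standard deviation divided by the mean, multiplying every value by a conversion factor leaves it unchanged. A small sketch with made-up distance data:

```python
def coefficient_of_variation(values):
    """std / mean -- dimensionless, so unchanged by unit conversion."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (var ** 0.5) / mean

miles = [10.0, 20.0, 30.0]
km = [v * 1.609344 for v in miles]  # same field stored in different units
# mean and variance differ between the two, but the CV matches
```

This is why two fields holding the same quantity in different units can still be matched, while the differing means flag a needed unit conversion.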
Self-Organizing Grouping algorithm
N = number of possible discriminators
M = number of categories; this can be adjusted by the user. "Ideally this is |attributes| - |foreign keys|"
This is unsupervised, i.e. you don't have to provide a correct classification; it simply groups based on similarity
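One way to picture the unsupervised grouping is as incremental clustering over the N-dimensional discriminator vectors: each attribute joins the first category whose center is close enough, otherwise it starts a new one. The Euclidean distance and radius threshold here are assumptions for illustration; the paper's self-organizing classifier may differ in detail.

```python
# Illustrative sketch only: threshold-based incremental grouping of
# N-dimensional discriminator vectors (distance measure is an assumption).
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group(vectors, radius):
    categories = []  # each category is a list of member vectors
    for v in vectors:
        for cat in categories:
            center = [sum(xs) / len(xs) for xs in zip(*cat)]
            if euclidean(v, center) <= radius:
                cat.append(v)  # close enough: join this category
                break
        else:
            categories.append([v])  # no category fits: start a new one
    return categories
```

No labels are supplied anywhere; the number of resulting categories M falls out of the radius, which is the knob the user can adjust.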
Training the Back-Prop Network
Inputs (N) are identical to classifier
Outputs (M) are trained using Back-Propagation and classifier’s results
Categories are labeled with the attributes they grouped together
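The training step above can be sketched as an ordinary single-hidden-layer network fitted with back-propagation to the (discriminator vector, category) pairs the classifier produced. Layer size, learning rate, and epoch count below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, M, hidden=8, lr=0.5, epochs=3000, seed=0):
    """Fit an N-input, M-output network to the classifier's category labels."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.5, (hidden, M))
    T = np.eye(M)[y]                     # one-hot targets from the classifier
    for _ in range(epochs):
        H = sigmoid(X @ W1)              # forward pass
        O = sigmoid(H @ W2)
        dO = (O - T) * O * (1 - O)       # backward pass, squared-error loss
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dO
        W1 -= lr * X.T @ dH
    return W1, W2

def classify(x, W1, W2):
    """Category index with the strongest output for a new attribute vector."""
    return int(np.argmax(sigmoid(sigmoid(x @ W1) @ W2)))
```

Once trained, classifying a new attribute is a single forward pass, which is where the sub-second matching time comes from.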
What is the classifier for?
Ease of training:
"Ideally [M] is |attributes| - |foreign keys|", and it is less computationally expensive to train M classifications when M < |attributes| - |foreign keys|
It is less computationally complex to compare new elements to the M categories than to every attribute of the training database (|attributes| - |foreign keys| comparisons)
Networks can be trained in which there are attributes that are identical
Integration Procedure
1. DBMS-specific parser (extract metadata from the training database)
2. Classify (categorize) the training data
3. Train the neural network
4. DBMS-specific parser (extract metadata from the database to integrate)
5. Classification by the neural network
6. User checks the results
Results
Conclusion and Future Work
Human effort needed for semantic integration is minimized
Different systems have different attribute properties available - an automated solution works with those that are present
Extend to automated information integration
C source code available at eecs.nwu.edu/pub/semint