Semantic Integration in Heterogeneous Databases Using
Neural Networks
Wen-Syan Li, Chris Clifton
Presentation by Jeff Roth
Introduction
Basic schema matching problem
GTE's data integration project included 27,000 data elements
This took 4 hours per data element, or 25 full-time employees 2 years to complete
This method -> 0.1 seconds per element, 144,000x faster
"How to match" knowledge is discovered
Method Outline
“The end user is able to distinguish between unreasonable and reasonable answers, and exact results aren’t critical. This method allows a user to obtain reasonable answers requiring database integration at a low cost”
Automated semantic integration methods
Attribute Name Comparison
This method is not used in this paper
Attribute values and domains comparison
Equal, Contains, Overlap, Contained-in, and Disjoint
Used, but not with the above measures
Field Specifications
Data type, field length, constraints, and others
This is also used in this method
Field Specifications
The following measures are used:
Data types
Each possible data type has a network input, with the field's data type having a value of 1 and all others a value of 0
Field length
Length = 2 * (1 / (1 + k^(-length)) - 0.5), which maps a length into [0, 1)
Format specifications
Similar to data types
Constraints (primary key, foreign key, disallowing nulls, access restrictions, etc.)
Similar to data types
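The field-specification inputs above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `DATA_TYPES` list and the constant `k = 1.01` are assumptions, and the length formula is read as 2 * (1/(1 + k^(-length)) - 0.5).

```python
# Illustrative sketch of field-specification inputs (assumed type list and k).
DATA_TYPES = ["int", "float", "char", "date"]  # one network input per type

def data_type_inputs(field_type):
    """One-hot encoding: the field's data type gets 1, all others 0."""
    return [1.0 if t == field_type else 0.0 for t in DATA_TYPES]

def length_input(length, k=1.01):
    """Normalize a field length into [0, 1) via 2 * (1/(1 + k^-length) - 0.5)."""
    return 2.0 * (1.0 / (1.0 + k ** -length) - 0.5)
```

A length of 0 maps to 0, and longer fields approach (but never reach) 1, so very different raw lengths stay on a comparable scale for the network.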
Attribute Values and Domains
Divide measures into character fields and numeric fields
Patterns for character fields
1. Ratio of numerical characters
Address: 146 South 920 West would score 6/18
2. Ratio of white space
Address: 146 South 920 West would score 3/18
3. Length statistics
Average, variance, and coefficient of variation of the "used" length relative to the maximum length
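A minimal sketch of the character-field patterns above, using the slide's address example. The function names are illustrative, not from the paper.

```python
def char_field_features(value):
    """Per-value character patterns: ratio of digits and of whitespace."""
    n = len(value)
    digits = sum(c.isdigit() for c in value)
    spaces = sum(c.isspace() for c in value)
    return digits / n, spaces / n

def length_statistics(values, max_length):
    """Average, variance, and coefficient of variation of the used
    length relative to the declared maximum field length."""
    ratios = [len(v) / max_length for v in values]
    mean = sum(ratios) / len(ratios)
    var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    cv = (var ** 0.5) / mean if mean else 0.0
    return mean, var, cv

# "146 South 920 West" is 18 characters: 6 digits (6/18), 3 spaces (3/18)
```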
Attribute Values and Domains cont.
Patterns for numeric fields
1. Average (mean)
2. Variance
3. Coefficient of variation
Recognizes similarity between values with different units and granularity
This can also help recognize which fields may need unit conversions
4. Grouping
For example: area code, zip code, first three digits of SSN
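The unit-invariance point about the coefficient of variation can be checked directly: because it is the standard deviation divided by the mean, multiplying every value by a conversion factor leaves it unchanged. A small sketch with made-up distance data:

```python
def coefficient_of_variation(values):
    """std / mean -- dimensionless, so unchanged by unit conversion."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (var ** 0.5) / mean

miles = [10.0, 20.0, 30.0]
km = [v * 1.609344 for v in miles]  # same field stored in different units
# mean and variance differ between the two, but the CV matches
```

This is why two fields holding the same quantity in different units can still be matched, while the differing means flag a needed unit conversion.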
Self-Organizing Grouping algorithm
N = number of possible discriminators
M = number of categories; this can be adjusted by the user. "Ideally this is |attributes| - |foreign keys|"
This is unsupervised, i.e. you don't have to provide a correct classification; it simply groups based on similarity
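One way to picture the unsupervised grouping is as incremental clustering over the N-dimensional discriminator vectors: each attribute joins the first category whose center is close enough, otherwise it starts a new one. The Euclidean distance and radius threshold here are assumptions for illustration; the paper's self-organizing classifier may differ in detail.

```python
# Illustrative sketch only: threshold-based incremental grouping of
# N-dimensional discriminator vectors (distance measure is an assumption).
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group(vectors, radius):
    categories = []  # each category is a list of member vectors
    for v in vectors:
        for cat in categories:
            center = [sum(xs) / len(xs) for xs in zip(*cat)]
            if euclidean(v, center) <= radius:
                cat.append(v)  # close enough: join this category
                break
        else:
            categories.append([v])  # no category fits: start a new one
    return categories
```

No labels are supplied anywhere; the number of resulting categories M falls out of the radius, which is the knob the user can adjust.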
Training the Back-Prop Network
Inputs (N) are identical to classifier
Outputs (M) are trained using Back-Propagation and classifier’s results
Categories are labeled with the attributes they grouped together
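The training step above can be sketched as an ordinary single-hidden-layer network fitted with back-propagation to the (discriminator vector, category) pairs the classifier produced. Layer size, learning rate, and epoch count below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, M, hidden=8, lr=0.5, epochs=3000, seed=0):
    """Fit an N-input, M-output network to the classifier's category labels."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.5, (hidden, M))
    T = np.eye(M)[y]                     # one-hot targets from the classifier
    for _ in range(epochs):
        H = sigmoid(X @ W1)              # forward pass
        O = sigmoid(H @ W2)
        dO = (O - T) * O * (1 - O)       # backward pass, squared-error loss
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dO
        W1 -= lr * X.T @ dH
    return W1, W2

def classify(x, W1, W2):
    """Category index with the strongest output for a new attribute vector."""
    return int(np.argmax(sigmoid(sigmoid(x @ W1) @ W2)))
```

Once trained, classifying a new attribute is a single forward pass, which is where the sub-second matching time comes from.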
What is the classifier for?
Ease of training:
"Ideally [M] is |attributes| - |foreign keys|", and it is less computationally expensive to train M classifications when M < |attributes| - |foreign keys|
It is less computationally complex to compare new elements to the M categories than to every attribute of the training database (|attributes| - |foreign keys| comparisons)
Networks can be trained in which there are attributes that are identical
Integration Procedure
1. DBMS-specific parser (extract metadata from the training database)
2. Classify (categorize) the training data
3. Train the neural network
4. DBMS-specific parser (extract metadata from the database to integrate)
5. Classification by the neural network
6. User checks the results
Results
Conclusion and Future Work
Human effort needed for semantic integration is minimized
Different systems have different attribute properties available - an automated solution works with those that are present
Extend to automated information integration
C source code available at eecs.nwu.edu/pub/semint