Page 1: [Lecture Notes in Computer Science] Conceptual Modeling — ER ’99 Volume 1728 || Dealing with Semantic Heterogeneity During Data Integration

J. Akoka et al. (Eds.): ER'99, LNCS 1728, pp. 325-339, 1999. Springer-Verlag Berlin Heidelberg 1999

Dealing with Semantic Heterogeneity during Data Integration

Zoubida Kedad 1 and Elisabeth Métais 1

1 Laboratoire PRiSM, Université de Versailles, 45 avenue des Etats-Unis, F-78035 Versailles cedex, France

{Zoubida.Kedad, Elisabeth.Metais}@prism.uvsq.fr

Abstract. Multi-source information systems, such as data warehouse systems, involve heterogeneous sources. In this paper, we deal with the semantic heterogeneity of the data instances. Problems may occur when confronting sources whenever different levels of denomination have been used for the same value, e.g. "vermilion" in one source and "red" in another. We propose to manage this semantic heterogeneity by using a linguistic dictionary. "Semantic operators" allow linguistic flexibility in the queries, e.g. two tuples with the values "red" and "vermilion" could match in a semantic join on the "color" attribute. A particularity of our approach is that it states the scope of the flexibility by defining classes of equivalent values by means of "priority nodes". These are used as parameters that let the user define the scope of the flexibility in a very natural manner, without specifying any distance.

1 Introduction

A Multi-Source Information System (MSIS) is composed of a set of heterogeneous and distributed data sources, and a set of views (or queries) defining user requirements over these sources. Examples of MSIS are web systems and data warehouse systems. The main difference between an MSIS and a classical information system resides in its definition and its feeding with data. While the database schema of a classical information system is defined as an integrated data structure, the database structure of an MSIS is defined as a set of (possibly independent) views. Moreover, while the first database is fed by the users through their applications, the second is fed automatically by the system from the data sources. MSIS applications aim only at using this data, without any direct update.

One possible solution for accessing distributed and heterogeneous data sources is to build, on top of the local systems, a global system which integrates the different data sources [23],[27]. The integration, exportation and importation of data are done by the federated system. Several systems follow this architecture, such as the TSIMMIS system [4],[18], Information Manifold [17], and SIMS [2].

An alternative way to provide access to a set of distributed and heterogeneous data sources is the data warehouse (DW) approach, in which the relevant data is extracted from each source, then translated, cleaned and integrated to be stored in a centralized repository. Among the systems representing this approach are the WHIPS prototype [10], H2O [28] and DWQ [11].

MSIS have introduced new design activities such as the selection of relevant data sources, the query containment problem [17], the selection of views to materialize [26], the update propagation of source changes [15], the definition and generation of mediators [27], the definition of computing expressions for each view in the system [14] and the data integration and cleaning [5],[1],[3].

One key problem in designing MSIS is the reconciliation of the data contained in the data sources. When merging the values contained in the sources, the following problems may occur:

- Differences in data format, for example when the instances of two equivalent attributes have different lengths;

- Differences in scale, when the instances of two equivalent attributes are expressed using different units, such as French francs and dollars;

- Differences in the encoding of attributes, for example the attribute sex in two different data sources can be encoded using the sets of instances {1, 2} and {F, M} respectively;

- Differences in the denominations of the instances, for example, the same attribute color may have the two values "azure" and "blue".

Natural language techniques could be useful for solving some of the cases in which the denominations of the instances are different. These techniques have been used for schema integration. [8] proposes the use of fuzzy and incomplete terminological knowledge to determine the correspondence assertions between the compared objects. In [12] and [13], correspondence assertions between the schemas to integrate are determined using case grammar and speech act theory. [19] and [20] suggest the use of semantic dictionaries storing concept hierarchies and canonical graphs to retrieve the semantics of the schemas. [21] uses a fuzzy thesaurus built using linguistic tools to compute a fuzzy semantic nearness coefficient between objects. In the Carnot project [24], which provides a framework for heterogeneous database integration, a common ontology is used to semantically relate different schemas with each other and thereby enable the interoperation of the underlying databases. The common ontology is expressed using CYC, a knowledge base providing the "commonsense" knowledge of an application domain.

In this paper we propose to deal with semantic heterogeneity by using metadata that includes a linguistic dictionary. The main idea of this technique is to provide linguistic flexibility in the queries, without any distance set by the user. Section 2 gives an overview of our objective. Section 3 discusses the linguistic techniques supporting the methodology, and sections 4 and 5 detail the proposed semantic operators.


2. The General Approach

Using traditional relational algebra operators for querying an MSIS may lead to meaningless results; the example of Fig. 1 illustrates the problem we aim to solve. We can see in this example that the result of the union is hard to interpret and that it contains inconsistencies. Flexible relations as defined in [1] allow a correct chaining with other operators, without propagating errors. However, our purpose is different in the sense that we want to obtain the result of Fig. 2, by way of "semantic operators", which are described in detail in section 4.

Union of

CAR   n°   color
      1    azure
      2    vermilion

and

CAR   n°   color
      1    blue
      2    blue
      3    red

gives

CAR   n°   color
      1    azure
      1    blue
      2    vermilion
      2    blue
      3    red

Fig. 1. Union without linguistic knowledge

SemUnion of

CAR   n°   color
      1    azure
      2    vermilion

and

CAR   n°   color
      1    blue
      2    blue
      3    red

gives

CAR   n°   color
      1    azure
      3    red

+ INCONSISTENCIES   n°   color       color
                    2    vermilion   blue

Fig. 2. Semantic Union using linguistic knowledge

Remark: In Fig. 2 and in the remainder of this paper, we use the term "inconsistency" for the case of two tuples "matching" on all the key attribute values but having conflicting values for some non-key attributes, as it is used in [6] and [1].

The basic problem is that, given the two source schemas R1(K,A1,A2,A3) and R2(K',A4,A5,A6), once the correspondences between attribute names are stated at the schema level (for example, A1 is equivalent to A4), it is still necessary to have knowledge at the instance level (the a1 value of a tuple r1 of R1 is the same as the a4 value of a tuple r2 of R2, in spite of their coding heterogeneity).


Considering the thousands of tuples from tens of sources, it is not realistic to store these assertions the way schema assertions are stored. Furthermore, these assertions are partially user dependent. For example, for the same query "search for cars having the same color as their cover-seats", some users expect blue cars and covers, whatever the kinds of blue are, while other users expect only cars and covers that are both light-blue, or both deep-blue.

A preliminary cleansing of the data is also inadequate because the denominations are not exactly equivalent (unlike a code "1" and a code "M" for "male"). Thus, cleaning "azure" into "blue" in the first relation would lose semantics, and cleaning "blue" into "azure" in the second relation is impossible without first knowing that the value in the other relation is "azure" and not "navy".

Consequently, our approach integrates the linguistic matching into the processing of the queries. These queries may have different purposes and may come from different "users" (e.g. user, interfaces, mediators).

[Figure: diagram with the following components: user interface (query with semantic operators, set priority level, answer to the query); query transformation tool; linguistic matching tool; meta database containing the linguistic dictionary; SQL queries/answers without semantic operators; relations extracted from sources]

Fig. 3. Global methodology

Linguistic matching (i.e. linguistic flexibility in the comparison of values) is offered to the user by means of a set of semantic operators, such as the one of Fig. 2. These operators are the SemUnion, the SemJoin, the SemRestriction, the SemDifference and the SemIntersection. It is important to notice that the linguistic flexibility must be triggered by the user, since one user may search for red cars and expect vermilion cars and other kinds of red cars, while another user may search for red cars expecting cars having exactly the "red" value for the attribute "color". This is an important difference from other cleaning procedures, such as currency conversion, which have no impact on the semantics of the query and thus may be performed automatically.

A traditional challenge when introducing some flexibility is to define its scope. Numerous works [22], [25] have proposed semantic distances for integration or retrieval purposes. However, although this distance could be computed using a dictionary, we did not want the user to specify it, because this is not a natural way of thinking. In our approach, classes of values follow automatically from the priority nodes that the user sets in a hierarchical dictionary. Two values are then considered semantically close if they belong to the same class, and semantically far if they do not.

The semantic operators are transformed into classical operators and then executed by the DBMS on the relations extracted from the sources, as shown in Fig. 3. This transformation needs metadata, which is provided by a dictionary adapted from linguistic dictionaries.

3 The Linguistic Knowledge Supplied by the Meta Database

Besides procedures and conversion tables, the meta database has to contain a linguistic dictionary in order to reconcile denominations. In this section, we describe how linguistic dictionaries may be customized to be more suitable for the purpose of data reconciliation in MSIS.

3.1 General Linguistic Dictionaries

Several projects aiming at building electronic linguistic dictionaries are in progress, such as CYC [16], EDR [7] and WordNet [9]. Beyond morphologic and syntactic information, they provide semantic links between words such as synonymy, links relating words to concepts, and links between concepts such as antonymy, hyponymy/hypernymy (is-a) and meronymy/holonymy (part-of). The is-a links are useful for conceptual schema elaboration and are organized in a "hierarchy of concepts". Other links are supported by "canonical graphs" which specify relationships expected among the concepts involved in a given action.

These general linguistic dictionaries are very helpful for retrieving the semantics of the nouns in schemas or data for their integration. They make it possible to recover the semantics lost during the modeling process. In previous works [19] [20] we used WordNet to integrate database schemas. For example, one contribution of such dictionaries is the detection of generalization links between objects of different views. Another contribution is the deduction of the roles played by each entity in a relationship.

General linguistic dictionaries are very useful for schema integration. However, they seem to have some limitations for data integration in an MSIS, mainly at three levels: the vocabulary, the links and the normalization.


Their vocabularies are adapted to describing the most common concepts of the real world, and are therefore very suitable for knowledge representation, including database schemas. But they cannot capture all the possible values of instances. Most of these values are specific to their owner, and only part of them may be considered words of the common vocabulary.

The links they provide are also suitable for the schema level (semantic roles, converse verbs, ...), but other links would be closer to the problems of data. For example, the cardinality from an object to a property should help in decision making when the same property has different values in distinct sources. Such a link is not found in a general dictionary, but is one of the basic links between data.

Concerning normalization, general dictionaries do not provide a preference among a set of synonyms, or inside a branch of the concept hierarchy.

3.2 Proposal for a More Suitable Linguistic Dictionary

Due to the limitations described above, we propose to enrich the linguistic dictionary so that it is more oriented toward data reconciliation. We illustrate this section with the example of a dictionary of color denominations, mainly devoted to vehicle colors. Colors are essential in car marketing and are mined in many sales analyses. This example illustrates the case of data that may have a huge number of denominations, since these are usually specific to a manufacturer.

The Concept Hierarchy

The main structure of the linguistic dictionary is the concept hierarchy, which stores the "is-a" relationship. As we can see in Fig. 4, the leaf nodes of the concept hierarchy are often enterprise denominations. This example highlights a particularity of a linguistic dictionary devoted to data instances, which is to gather both common vocabulary names and names that look like codes, integrated in the same hierarchy.

The elaboration of such a dictionary requires the cooperation of the main enterprises concerned. Thus an automatic procedure has to be proposed in order to feed the dictionary with data. For the particular example of color denominations, such a procedure exists. A color may be encoded with three coordinates according to the three basic color axes. While some colors correspond to a single point in this space (usually manufacturer colors such as "Glacier"), others (usually common color names such as "red") cover a volume in this space. The hierarchy is then built automatically as follows: a color subsumes another color if its volume contains the volume of the other color (which may be a point).
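The containment rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each color is modeled as an axis-aligned box in a three-coordinate space, all coordinates are invented for the example, and the names `COLORS`, `contains` and `build_hierarchy` are hypothetical rather than part of the prototype.

```python
from typing import Dict, List, Tuple

# A color volume as (min, max) ranges on three axes; a point has min == max.
Box = Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]

# Hypothetical coordinates, invented for illustration only.
COLORS: Dict[str, Box] = {
    "blue": ((0, 120), (0, 200), (130, 255)),
    "light blue": ((60, 120), (120, 200), (200, 255)),
    "Glacier": ((100, 100), (180, 180), (230, 230)),  # a single point
    "red": ((150, 255), (0, 90), (0, 90)),
    "vermilion": ((220, 255), (50, 80), (20, 60)),
}


def contains(outer: Box, inner: Box) -> bool:
    """True if the volume `outer` contains the volume `inner` on every axis."""
    return all(
        lo_o <= lo_i and hi_i <= hi_o
        for (lo_o, hi_o), (lo_i, hi_i) in zip(outer, inner)
    )


def build_hierarchy(colors: Dict[str, Box]) -> Dict[str, List[str]]:
    """Map each color to its direct parents (its smallest containing volumes).

    Assumes that no two colors have exactly the same volume.
    """
    parents: Dict[str, List[str]] = {}
    for name, box in colors.items():
        containers = [
            other for other, obox in colors.items()
            if other != name and contains(obox, box)
        ]
        # Keep a container only if it contains no other container.
        direct = [
            c for c in containers
            if not any(d != c and contains(colors[c], colors[d]) for d in containers)
        ]
        parents[name] = sorted(direct)
    return parents
```

With these invented coordinates, `build_hierarchy(COLORS)` places "Glacier" directly under "light blue", "light blue" under "blue", and "vermilion" under "red".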


[Figure: tree of color concepts rooted at T. Node labels: blue, green, red (circled as priority nodes), light blue, deep blue, azure, navy, Glacier, vermilion, ruby, Sevilla]

Fig. 4. Excerpt of the concept hierarchy

We note that, as with other cleaning metadata, a part of the dictionary may be shared by a large number of applications and users, while another part is an enrichment which is specific to an application. The dictionary also contains a set of parameters which may be updated for each user. They are described in the following section.

The Priority Level

We can intuitively introduce the notion of priority nodes in the hierarchy as a way for the user to specify classes of equivalent values. These nodes are graphically represented by a circle, as shown in Fig. 4, where "blue", "green" and "red" are priority nodes. The class induced by a priority node consists of the sub-tree having the priority node as its top (i.e. the class of a priority node contains the set of values that it subsumes).

The choice of the priority nodes is arbitrary and may change from one query to another. The user may change the priority nodes as parameters. This is a very easy and natural way for the user to specify his classes of semantically close values, without manipulating any distance.

However, it is important to notice that in practice this choice is quite stable. Usually, the priority node of a branch is the node most frequently referred to in the language among all the nodes of this branch. Default priority nodes could be those which are most frequently referred to in the definitions of both their subsuming and subsumed nodes. As an example, in usual dictionaries, both "vermilion" and "warm color" are defined using the "red" denomination.

The Cardinality Links

To reconcile data according to one criterion, we need to know whether it is possible for an object to have several values for this criterion. The dictionary stores the "cardinality" relationship, well known in the database area. In our example, a table stores the cardinality from objects toward the color property. An excerpt of this table is (car, monovalued), (flag, multivalued).


3.3 Assertions Provided by a Priority Level Linguistic Dictionary

Thanks to the linguistic hierarchy and its priority level described above, the dictionary provides assertions describing the existing relation between two values. We introduce different kinds of assertions, describing respectively the equivalence relation, the inclusion relation, and the co-class and dis-class relations. These assertions are used by the semantic operators described in the sections below.

Definition 1: Two values A and B are equivalent if they are either equal or synonyms.

Definition 2: A value A is included in a value B if and only if there exists a path from A towards B in the concept hierarchy (in other words, if B subsumes A). Conversely, we say that B includes A. As an example, in the dictionary of Fig. 4, "Glacier" is included in "light blue".

Definition 3: Two values A and B are co-class if and only if there exists a priority node C including both A and B. For example, in the dictionary of Fig. 4, "azure" and "navy" are co-class, since they are both "blue", which has been stated as a priority node.

Definition 4: Two values A and B are dis-class if and only if there is no priority node including both A and B (that is, if they are not co-class).
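The four assertions can be sketched as small predicates over a child-to-parents table. The sketch below is illustrative only: the exact parent links are an assumption (the text confirms only that "Glacier" is under "light blue", that "azure" and "navy" are under "blue", and that "blue", "green" and "red" are priority nodes), the synonym pair is invented, and all names (`PARENTS`, `PRIORITY`, `SYNONYMS`, the function names) are hypothetical.

```python
# Assumed excerpt of the hierarchy: child -> list of direct parents.
PARENTS = {
    "blue": ["T"], "green": ["T"], "red": ["T"],
    "light blue": ["blue"], "deep blue": ["blue"],
    "azure": ["light blue"], "Glacier": ["light blue"],
    "navy": ["deep blue"],
    "vermilion": ["red"], "ruby": ["red"], "Sevilla": ["red"],
}
PRIORITY = {"blue", "green", "red"}
# Invented synonym pair, only to exercise Definition 1.
SYNONYMS = {frozenset({"azure", "sky blue"})}


def ancestors(value):
    """All nodes reachable by following parent links upward from `value`."""
    seen, stack = set(), list(PARENTS.get(value, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def equivalent(a, b):
    """Definition 1: equal values or synonyms."""
    return a == b or frozenset({a, b}) in SYNONYMS


def included_in(a, b):
    """Definition 2: there is a path from a up to b (b subsumes a)."""
    return b in ancestors(a)


def priority_classes(value):
    """Priority nodes whose sub-tree contains `value` (the node itself counts)."""
    return (ancestors(value) | {value}) & PRIORITY


def co_class(a, b):
    """Definition 3: some priority node includes both values."""
    return bool(priority_classes(a) & priority_classes(b))


def dis_class(a, b):
    """Definition 4: no priority node includes both values."""
    return not co_class(a, b)
```

Under these assumptions, `included_in("Glacier", "light blue")` and `co_class("azure", "navy")` hold, while "azure" and "vermilion" are dis-class. Changing the contents of `PRIORITY` is all that is needed to change the scope of the flexibility.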

4 The Semantic Operators

In this section, we introduce the semantic operators that make it possible to deal with linguistic heterogeneity. These operators are denoted SemUnion, SemJoin, SemRestriction, SemDifference and SemIntersection. Particular attention is paid to the way they use the metadata provided by the linguistic dictionary.

4.1 Notation of the Semantic Operators

The semantic operators have to be parameterized with the set of attributes the linguistic matching must be applied to. For example, the semantic union of the relations CAR(..., paint, type, ...) and VEHICLE(..., color, type, ...) is denoted: SemUnion((paint,color),(type,type)).

Notational convenience: When the semantic flexibility concerns only one pair of attributes, and these attributes have the same name in the two relations, we write more briefly: color-SemUnion of the relation car(..., color, ...) and the relation vehicle(..., color, ...). In the same way, we denote the other operators "color-SemJoin", "color-SemRestriction", "color-SemIntersection" and "color-SemDifference".


4.2 The Semantic Join Operator

The SemJoin operator is used for queries such as "Do people buy cover-seats and cars having the same color?". The denominations of the colors for the cars and the cover-seats may be very different, since they have different manufacturers, each one having its own denominations. For this query, the user is not interested in an exact match of the color names. He just expects these colors to be the same with respect to the priority levels of the linguistic dictionary.

Use of the linguistic Assertions in the SemJoin Algorithm

Let us consider the Ai-SemJoin of R1(A1,...,Ai,...,An) and R2(B1,...,Bi-1,Ai,Bi+1,...,Bm) on a criterion R1.Ai = R2.Ai and Pi, where Pi is a set of other join predicates (e.g. R1.A3 = R2.B4).

For each matching comparison between R1.Ai and R2.Ai:
- If R1.Ai is equivalent to R2.Ai then the matching succeeds,
- If R1.Ai includes R2.Ai then the matching succeeds,
- If R1.Ai is included in R2.Ai then the matching succeeds,
- If R1.Ai and R2.Ai are co-class then the matching succeeds,
- If R1.Ai and R2.Ai are dis-class then the matching fails.

We notice that the result of the SemJoin operator strongly depends on the priority level (the "co-class" case leads to a success while the "dis-class" case leads to a failure). This is coherent with the idea of introducing natural language in the queries, because the priority level answers exactly the question of what we mean in everyday natural language by sentences like "do they have the same color?".
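As a rough illustration of the SemJoin matching rules, the following Python sketch applies them in a nested-loop join over lists of dictionaries. The hierarchy links are an assumed excerpt, synonymy is reduced to plain equality because no synonym table is assumed here, and the names `sem_join_match` and `sem_join` are hypothetical.

```python
# Assumed excerpt of the hierarchy: child -> list of direct parents.
PARENTS = {
    "blue": ["T"], "green": ["T"], "red": ["T"],
    "light blue": ["blue"], "deep blue": ["blue"],
    "azure": ["light blue"], "Glacier": ["light blue"],
    "navy": ["deep blue"],
    "vermilion": ["red"], "ruby": ["red"], "Sevilla": ["red"],
}
PRIORITY = {"blue", "green", "red"}


def ancestors(value):
    """All nodes reachable by following parent links upward from `value`."""
    seen, stack = set(), list(PARENTS.get(value, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def included_in(a, b):
    return b in ancestors(a)


def co_class(a, b):
    def classes(v):
        return (ancestors(v) | {v}) & PRIORITY
    return bool(classes(a) & classes(b))


def sem_join_match(a, b):
    """Succeeds for equal, including, included or co-class values; fails otherwise."""
    return a == b or included_in(a, b) or included_in(b, a) or co_class(a, b)


def sem_join(r1, r2, attr):
    """Nested-loop semantic join of two lists of dicts on one attribute."""
    return [(t1, t2) for t1 in r1 for t2 in r2 if sem_join_match(t1[attr], t2[attr])]


cars = [{"id": "c1", "color": "azure"}, {"id": "c2", "color": "Glacier"}]
covers = [{"id": "s1", "color": "navy"}, {"id": "s2", "color": "vermilion"}]
pairs = [(t1["id"], t2["id"]) for t1, t2 in sem_join(cars, covers, "color")]
```

Here both cars pair with the navy cover, since all three values fall under the priority node "blue", and neither pairs with the vermilion cover.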

4.3 Semantic Restriction

The SemRestriction operator aims at answering queries such as "Do red cars have more accidents than others?", even if the stored colors for the cars are not "red" but "vermilion" or "Sevilla".

Use of the linguistic assertions in the SemRestriction Algorithm

Let us consider the Ai-SemRestriction on the relation R(A1,...,Ai,...,An) with the criterion R.Ai = valA.

For each matching comparison between R.Ai and valA:
- If R.Ai is equivalent to valA then the matching succeeds,
- If R.Ai includes valA then the matching fails,
- If R.Ai is included in valA then the matching succeeds,
- If R.Ai and valA are co-class then the matching fails,
- If R.Ai and valA are dis-class then the matching fails.

We notice that the SemRestriction operator uses the hierarchy of the linguistic dictionary without considering the priority level. For example, when a user wants all the vermilion cars, he obviously wants all the cars whose colors are subsumed by vermilion, whatever the priority level is. Including in the result the cars for which we only know that they are red would require separating, in the result, the cars which "are" vermilion from the cars which "may be" vermilion but may also be another kind of red. In this work, we did not address this possibility.
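The SemRestriction rules reduce to "equal or included in the requested value", which the following Python sketch illustrates. The hierarchy links are an assumed excerpt, synonymy is reduced to equality, and the name `sem_restriction` is hypothetical.

```python
# Assumed excerpt of the hierarchy: child -> list of direct parents.
PARENTS = {
    "blue": ["T"], "green": ["T"], "red": ["T"],
    "light blue": ["blue"], "deep blue": ["blue"],
    "azure": ["light blue"], "Glacier": ["light blue"],
    "navy": ["deep blue"],
    "vermilion": ["red"], "ruby": ["red"], "Sevilla": ["red"],
}


def ancestors(value):
    """All nodes reachable by following parent links upward from `value`."""
    seen, stack = set(), list(PARENTS.get(value, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def sem_restriction(relation, attr, val):
    """Keep tuples whose value equals `val` or is subsumed by it.

    Priority nodes play no role: a stored value that merely includes `val`,
    or is only co-class with it, is rejected.
    """
    return [t for t in relation if t[attr] == val or val in ancestors(t[attr])]


cars = [
    {"id": 1, "color": "vermilion"},
    {"id": 2, "color": "Sevilla"},
    {"id": 3, "color": "red"},
    {"id": 4, "color": "azure"},
]
red_ids = [t["id"] for t in sem_restriction(cars, "color", "red")]
vermilion_ids = [t["id"] for t in sem_restriction(cars, "color", "vermilion")]
```

Asking for "red" returns cars 1, 2 and 3, while asking for "vermilion" returns only car 1: the car recorded simply as "red" is not returned, in line with the rule that a stored value including the requested one fails.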

4.4 The Semantic Union Operator

In a data warehouse system, the role of a union operator is to state whether two tuples correspond to the same object and to merge them in the resulting relation. The SemUnion operator can technically help with both. But in practice, it is particularly suited to improving the second point, that is, a correct merge of two tuples corresponding to the same object. For this purpose, the operator has to be applied to non-key attributes. An exact matching (by equality or through a conversion procedure) is then applied to the keys, assuming that the tuples correspond to the same object, while a linguistic matching is applied to the other attributes in order to merge included or co-class values.

Use of the linguistic assertions in the SemUnion Algorithm

Let us consider the Ai-SemUnion of R1(A1,...,Ai,...,An) and R2(A1,...,Ai,...,An). For each pair of tuples from R1 × R2 such that all Aj values match for j ≠ i:
- If R1.Ai is equivalent to R2.Ai then the two tuples are merged,
- If R1.Ai includes R2.Ai then the two tuples are merged, and the chosen value for Ai is R2.Ai,
- If R1.Ai is included in R2.Ai then the two tuples are merged, and the chosen value for Ai is R1.Ai,
- If (R1.Ai and R2.Ai are co-class) or (R1.Ai and R2.Ai are dis-class) then:
    if cardinality(R1,Ai) and cardinality(R2,Ai) are monovalued, then the two tuples are not merged and are considered inconsistent,
    else the two tuples are inserted in the result.

We notice that this algorithm uses not only the concept hierarchy but also additional links of the linguistic dictionary, namely the cardinality links.
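The SemUnion rules can be sketched in Python on the car example of Fig. 2. Several simplifications are assumptions of this sketch: each relation is a dictionary from key to a single color, the hierarchy links are an assumed excerpt, synonymy is reduced to equality, and tuples whose key appears in only one relation are copied to the result as in an ordinary union (inferred from the figure rather than stated by the algorithm). The name `sem_union` is hypothetical.

```python
# Assumed excerpt of the hierarchy: child -> list of direct parents.
PARENTS = {
    "blue": ["T"], "green": ["T"], "red": ["T"],
    "light blue": ["blue"], "deep blue": ["blue"],
    "azure": ["light blue"], "Glacier": ["light blue"],
    "navy": ["deep blue"],
    "vermilion": ["red"], "ruby": ["red"], "Sevilla": ["red"],
}


def ancestors(value):
    """All nodes reachable by following parent links upward from `value`."""
    seen, stack = set(), list(PARENTS.get(value, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def included_in(a, b):
    return b in ancestors(a)


def sem_union(r1, r2, monovalued=True):
    """Return (result, inconsistencies) for two {key: color} relations."""
    result, inconsistencies = [], []
    for key in sorted(set(r1) | set(r2)):
        if key not in r2:
            result.append((key, r1[key]))      # only in R1: copied (assumption)
        elif key not in r1:
            result.append((key, r2[key]))      # only in R2: copied (assumption)
        else:
            a, b = r1[key], r2[key]
            if a == b:
                result.append((key, a))        # equivalent: merged
            elif included_in(b, a):
                result.append((key, b))        # R1 includes R2: keep R2's value
            elif included_in(a, b):
                result.append((key, a))        # R1 included in R2: keep R1's value
            elif monovalued:
                inconsistencies.append((key, a, b))   # co-class or dis-class
            else:
                result.extend([(key, a), (key, b)])   # multivalued: keep both
    return result, inconsistencies


car1 = {1: "azure", 2: "vermilion"}
car2 = {1: "blue", 2: "blue", 3: "red"}
merged, conflicts = sem_union(car1, car2)
```

With these inputs, `merged` is `[(1, "azure"), (3, "red")]` and `conflicts` is `[(2, "vermilion", "blue")]`, matching the result shown for the semantic union of the two CAR relations. The `monovalued` flag stands in for the cardinality link, which the dictionary would supply per object type.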

An interesting use of this operator is to check the consistency of some inter-schema assertions stored in the meta database, namely the equivalence assertions between the attributes of the schemas. If two attributes are related by an assertion stating their semantic equivalence, and if none of their values are found to be linguistically equivalent, this probably means that the attributes have different meanings, and that the semantic equivalence assertion between them must be reconsidered.


4.5 Semantic Intersection and Semantic Difference

The SemIntersection and SemDifference operators are close to the SemUnion operator, and they use the linguistic knowledge in the same way. In this section we just give an example of each operator, in Fig. 5 and Fig. 6.

SemIntersection of

CAR   n°   color
      1    azure
      2    vermilion

and

CAR   n°   color
      1    blue
      2    blue
      3    red

gives

CAR   n°   color
      1    azure

+ INCONSISTENCIES   n°   color       color
                    2    vermilion   blue

Fig. 5. An example of SemIntersection

SemDifference of

CAR   n°   color
      1    azure
      2    vermilion
      4    Glacier

and

CAR   n°   color
      1    blue
      2    blue
      3    red

gives

CAR   n°   color
      4    Glacier

+ INCONSISTENCIES   n°   color       color
                    2    vermilion   blue

Fig. 6. An example of SemDifference

5. Implementation of the Algorithms by the Way of Classical Algebra Operators

The semantic operators are computed by calling classical operators, after an automatic modification of the relations, by means of embedded SQL. The implementation is performed using PL/SQL (for the ORACLE 8 DBMS). In this prototype, the linguistic dictionary is also stored using ORACLE 8.

Implementation of the Semantic Join

In order to compute the A-semantic join of two relations R1(..., A, ...) and R2(..., A, ...) on the criterion R1.X = R2.Y, where X is a set of attributes of R1 including A and Y is a set of attributes of R2 including A, three steps have to be performed:

Step 1: A new attribute SemA is added to the schema of the R1 relation and a new attribute SemA is added to the schema of the R2 relation. Let us call these two new relations R1' and R2'.


Step 2: The instances of R1' and R2' are computed by the following algorithm: each tuple r1 of R1 is translated into a tuple r1' of R1' by filling the value of SemA with the priority node subsuming A in the linguistic dictionary. If no priority node subsumes A, then the value is A itself. If several priority nodes subsume A, the r1 tuple yields several tuples in R1', one for each priority node, each having that node as its SemA value. The same process is applied to fill R2'.

Step 3: The semantic join between R1 and R2 on the criterion X = Y is transformed into a classical join between R1' and R2' with the criterion X' = Y', where X' = (X ∪ {SemA}) - {A} and Y' = (Y ∪ {SemA}) - {A}.

CAR   color   owner
      azure   100
      blue    200
      red     300

COVER   color       owner
        sky         100
        vermilion   300
        lemon       400

color-SemJoin on CAR.color = COVER.color and CAR.owner = COVER.owner

Fig. 7. An example of semantic join

CAR   color   owner   Semcolor
      azure   100     blue
      blue    200     blue
      red     300     red

COVER   color       owner   Semcolor
        sky         100     blue
        vermilion   300     red
        lemon       400     yellow

Classical join on CAR.Semcolor = COVER.Semcolor and CAR.owner = COVER.owner

Fig. 8. Transformation of the semantic join of Fig. 7 into a classical join

[Figure: result table of the classical join of Fig. 8]

Fig. 9. The rough result of the join, before projections

An example of a query involving such a join may be "how many people buy cars and seat-covers having the same color?". This is a typical marketing problem and the results may have an impact on further offers. Fig. 7 shows the semantic join, Fig. 8 shows the transformation of the semantic join into a classical join according to the three-step algorithm presented above, and Fig. 9 shows the result of this classical join. A projection then gives the set {100, 300} of persons having the same color for their cover-seats and their cars.
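The three steps can be replayed on this example with a small Python script. This is a sketch and not the prototype: it swaps PL/SQL on ORACLE 8 for Python's standard `sqlite3` module with an in-memory database, it replaces the dictionary lookup by a hand-written table of priority nodes taken from the example, and the names `PRIORITY_NODES`, `with_sem_column` and `color_sem_join_owners` are hypothetical.

```python
import sqlite3

# Priority node(s) subsuming each value, as used in the example.
# A value missing from this table is treated as its own SemA value.
PRIORITY_NODES = {
    "azure": ["blue"], "blue": ["blue"], "red": ["red"],
    "sky": ["blue"], "vermilion": ["red"], "lemon": ["yellow"],
}


def with_sem_column(conn, table):
    """Steps 1 and 2: build <table>_sem with an extra semcolor attribute."""
    conn.execute(
        f"CREATE TABLE {table}_sem (color TEXT, owner INTEGER, semcolor TEXT)"
    )
    rows = conn.execute(f"SELECT color, owner FROM {table}").fetchall()
    for color, owner in rows:
        # One output tuple per subsuming priority node.
        for node in PRIORITY_NODES.get(color, [color]):
            conn.execute(
                f"INSERT INTO {table}_sem VALUES (?, ?, ?)", (color, owner, node)
            )


def color_sem_join_owners():
    """Step 3: classical join on semcolor and owner, then project on owner."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE car (color TEXT, owner INTEGER)")
    conn.execute("CREATE TABLE cover (color TEXT, owner INTEGER)")
    conn.executemany(
        "INSERT INTO car VALUES (?, ?)",
        [("azure", 100), ("blue", 200), ("red", 300)],
    )
    conn.executemany(
        "INSERT INTO cover VALUES (?, ?)",
        [("sky", 100), ("vermilion", 300), ("lemon", 400)],
    )
    with_sem_column(conn, "car")
    with_sem_column(conn, "cover")
    rows = conn.execute(
        """
        SELECT DISTINCT car_sem.owner
        FROM car_sem
        JOIN cover_sem
          ON car_sem.semcolor = cover_sem.semcolor
         AND car_sem.owner = cover_sem.owner
        ORDER BY car_sem.owner
        """
    ).fetchall()
    conn.close()
    return [owner for (owner,) in rows]
```

Calling `color_sem_join_owners()` returns `[100, 300]`, the same set of persons as the projection described above. The join itself is an ordinary SQL join; all of the linguistic knowledge sits in the added `semcolor` column.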

The optimization of the execution of the semantic operators is not addressed in this paper, and may be a weakness of the method. Obviously, if the semantic operators were implemented inside a DBMS, they could take advantage of a range of optimization techniques and would be more efficient. However, we do think that a main advantage of our method is its compatibility with existing DBMSs.

6. Conclusion

In this paper, we have proposed a method for dealing with linguistic heterogeneity by introducing some linguistic flexibility in the queries. We have defined the semantic operators and their implementation, which uses classical operators applied to transformed relations. The linguistic metadata supporting these transformations is provided by a linguistic dictionary. We discussed the particularities of this dictionary with respect to a general linguistic dictionary such as those used for schema integration.

Values in the dictionary are organized in a subsumption hierarchy in which some nodes are considered "priority nodes". This means that each sub-tree having a priority node as its top represents a class of "linguistically close" values. The priority nodes are expected to be the most frequently referred-to denominations in the common language. They are used as parameters that let the user define the scope of the flexibility in a very natural manner, without specifying any distance.

This approach introduces some linguistic flexibility in the queries, with a way of setting its scope that is well suited to decision-making tools over MSIS. It makes it possible to deal with the high-level concept values used in sales or statistical analysis, such as "water", "coffee" or "blue", while sources have recorded intermediate or basic values such as "plane water", "Lazarro", "azure" or "Glacier". Moreover, this technique, which contributes to a better understanding, may be used in other contexts such as extraction on the Web.

Acknowledgments

We wish to thank Mokrane Bouzeghoub for his valuable suggestions during this work.

References

1. Agarwal, S., Keller, A.M., Wiederhold, G., Krichna, S.: Flexible relation: an approach for integrating data from multiple, possibly inconsistent databases. Eleventh International Conference on Data Engineering, IEEE (1995)

2. Arens, Y., Knoblock, C.A., Shen, W.M.: Query Reformulation for dynamic information integration. International Journal on Intelligent and Cooperative Information Systems (6) 2/3 (1996)

3. Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., Rosati, R.: Source Integration: conceptual modeling and reasoning support. Proceedings of the 6th Internat. Conf. on Cooperative Information Systems (CoopIS'98) (1998)

4. Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J.: The TSIMMIS project: Integration of Heterogeneous Information Sources. IPSJ'94 (1994)

5. Cluet, S., Delobel, C., Simeon, J., Katarzyna, S.: Your Mediators Need Data Conversion! ACM SIGMOD, Seattle (1998)

6. DeMichiel, L.G.: Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering, vol. 1, n° 4 (1989)

7. EDR Electronic Dictionary Technical Guide, Japan Electronic Dictionary Research Institute, Ltd. Mita-Kokusai-Bldg (1993)

8. Fankhauser, P., Kracker, M., Neuhold, E.J.: Semantic vs. Structural Resemblance of Classes. SIGMOD Record, 20 (4), October (1991)

9. Fellbaum, C. et al.: WordNet: An electronic lexical database. MIT Press, ISBN 0-262-06197-X (1998)

10. Hammer, J., Garcia-Molina, H., Widom, J., Labio, W., Zhuge, Y.: The Stanford Data Warehousing Project. Bulletin of the Technical Committee on Data Engineering (1995)

11. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P. (eds.): Fundamentals of Data Warehouses. Springer (1999)

12. Johannesson, P.: Logic-based approach to schema integration. Proc. of the 10th ER Conference (1991)

13. Johannesson, P.: Using conceptual graph theory to support schema integration. Proc. of the 12th ER Conf. (1993)

14. Kedad, Z., Bouzeghoub, M.: Conception de systèmes d’information multi-source. XVIIe Congrès Informatique des organisations et systèmes d'information et de décision (INFORSID) (1999)

15. Labio, W., Zhuge, Y., Wiener, J., Gupta, H., Garcia-Molina, H., Widom, J.: The WHIPS prototype for Data Warehouse creation and Maintenance. SIGMOD (1997)

16. Lenat, D.B., Guha, R.V., Pittman, K., Pratt, D., Shepherd, M.: CYC: Toward Programs with Common Sense. Communications of the ACM, vol. 33, n° 8 (1990)

17. Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying Heterogeneous Information Sources Using Source Descriptions. VLDB (1996)

18. Li, C., Yerneni, R., Vassalos, V., Garcia-Molina, H., Papakonstantinou, Y., Valiveti, M.: Capability Based Mediation in TSIMMIS. IPSJ (1994)

19. Métais, E., Meunier, J.N., Levreau, G.: Database Schema Design, Validation and View Integration: a Perspective from Natural Language. Proc. of the 12th ER Conference (1993)

20. Métais, E., Kedad, Z., Comyn-Wattiau, I., Bouzeghoub, M.: Using Linguistic Knowledge in View Integration: toward a third generation of tools. Data and Knowledge Engineering, vol. 23, n° 1, June (1997)

21. Mirbel, I.: Semantic integration of conceptual schemes. First International Workshop on Application of Natural Language to Data Bases (1995)

22. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI'95 (1995)

23. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous and autonomous databases. ACM Computing Surveys, vol. 22, n° 3, September (1990)

24. Singh, M.P., Cannata, P.E., Huhns, M.N., Jacobs, N., Ksiezyk, T., Ong, K., Sheth, A.P., Tomlinson, C., Woelk, D.: The Carnot Heterogeneous Database Project: Implemented Applications. Distributed and Parallel Databases, vol. 5, n° 2, pages 207-225, April (1997)

25. Song, W.W., Johannesson, P., Bubenko, J.A.: Semantic similarity relations and computation in schema integration. Data and Knowledge Engineering, 19 (1996)

26. Theodoratos, D., Sellis, T.: Data Warehouse Schema and Instance Design. ER'98 (1998)

27. Wiederhold, G., Genesereth, M.: The basis for mediation. Proc. of CoopIS'95 (1995)

28. Zhou, G., Hull, R., King, R., Franchitti, J.C.: Data Integration and Warehousing using H2O. Bulletin of the Technical Committee on Data Engineering (1995)

