
  • J. Akoka et al. (Eds.): ER'99, LNCS 1728, pp. 325-339, 1999. Springer-Verlag Berlin Heidelberg 1999

    Dealing with Semantic Heterogeneity during Data Integration

    Zoubida Kedad and Elisabeth Métais

    Laboratoire PRiSM, Université de Versailles, 45 avenue des Etats-Unis, F-78035 Versailles cedex, France

    {Zoubida.Kedad, Elisabeth.Metais}@prism.uvsq.fr

    Abstract. Multi-source information systems, such as data warehouse systems, involve heterogeneous sources. In this paper, we deal with the semantic heterogeneity of the data instances. Problems may occur when confronting sources whenever different levels of denomination have been used for the same value, e.g. "vermilion" in one source and "red" in another. We propose to manage this semantic heterogeneity by using a linguistic dictionary. "Semantic operators" allow linguistic flexibility in the queries, e.g. two tuples with the values "red" and "vermilion" could match in a semantic join on the "color" attribute. A particularity of our approach is that it states the scope of the flexibility by defining classes of equivalent values by means of "priority nodes". These are used as parameters that let the user define the scope of the flexibility in a very natural manner, without specifying any distance.

    1 Introduction

    A Multi-Source Information System (MSIS) is composed of a set of heterogeneous and distributed data sources, and a set of views (or queries) defining user requirements over these sources. Examples of MSIS are web systems and data warehouse systems. The main difference between an MSIS and a classical information system resides in its definition and its feeding with data. While the database schema of a classical information system is defined as an integrated data structure, the database structure of an MSIS is defined as a set of (possibly independent) views. Moreover, while the first database is fed by the users through their applications, the second is fed automatically by the system from the data sources. MSIS applications aim only at using this data, without any direct update.

    One possible solution for accessing distributed and heterogeneous data sources is to build, on top of the local systems, a global system that integrates the different data sources [23],[27]. The integration, export and import of data are done by the federated system. Several systems follow this architecture, such as the TSIMMIS system [4],[18], Information Manifold [17], and SIMS [2].

    An alternative way to provide access to a set of distributed and heterogeneous data sources is the data warehouse (DW) approach, in which the relevant data is extracted from each source, then translated, cleaned and integrated to be stored in a centralized repository. Among the systems representing this approach are the WHIPS prototype [10], H2O [28] and DWQ [11].

    MSIS have introduced new design activities such as the selection of relevant data sources, the query containment problem [17], the selection of views to materialize [26], the update propagation of source changes [15], the definition and generation of mediators [27], the definition of computing expressions for each view in the system [14] and data integration and cleaning [5],[1],[3].

    One key problem in designing an MSIS is the reconciliation of the data contained in the data sources. When merging the values contained in the sources, the following problems may occur:

    - Differences in data format, for example when the instances of two equivalent attributes have different lengths;

    - Differences in scale, when the instances of two equivalent attributes are expressed using different units, such as French francs and dollars;

    - Differences in the encoding of attributes, for example the attribute sex in two different data sources can be encoded using the sets of instances {1, 2} and {F, M} respectively;

    - Differences in the denominations of the instances, for example, the same attribute color may have the two values "azure" and "blue".

    Natural language techniques could be useful for solving some of the cases in which the denominations of the instances are different. These techniques have been used for schema integration. [8] proposes the use of fuzzy and incomplete terminological knowledge to determine the correspondence assertions between the compared objects. In [12] and [13], correspondence assertions between the schemas to integrate are determined using case grammar and speech act theory. [19] and [20] suggest the use of semantic dictionaries storing concept hierarchies and canonical graphs to retrieve the semantics of the schemas. [21] uses a fuzzy thesaurus built using linguistic tools to compute a fuzzy semantic nearness coefficient between objects. In the Carnot project [24], which provides a framework for heterogeneous database integration, a common ontology is used to semantically relate different schemas with each other and thereby enable the interoperation of the underlying databases. The common ontology is expressed using CYC, a knowledge base providing the "commonsense" knowledge of an application domain.

    In this paper we propose to deal with semantic heterogeneity by using metadata including a linguistic dictionary. The main idea of this technique is to provide linguistic flexibility in the queries, without any distance set by the user. Section 2 gives an overview of our objective. Section 3 discusses the linguistic techniques supporting the methodology, and sections 4 and 5 detail the proposed semantic operators.


    2 The General Approach

    Using traditional relational algebra operators for querying an MSIS may lead to meaningless results; the example of Fig. 1 illustrates the problem we aim to solve. We can see in this example that the result of the union is hard to interpret and that it contains inconsistencies. Flexible relations as defined in [1] allow a correct chaining with other operators, without propagating errors. However, our purpose is different in the sense that we want to obtain the result of Fig. 2, by way of "semantic operators", which are described in detail in section 4.

    Union

    CAR (operand 1)     CAR (operand 2)     CAR (result)
    n   color           n   color           n   color
    1   azure           1   blue            1   azure
    2   vermilion       2   blue            1   blue
                        3   red             2   vermilion
                                            2   blue
                                            3   red

    Fig. 1. Union without linguistic knowledge
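    The situation of Fig. 1 can be reproduced in a few lines. This is only an illustrative sketch: the relations are modeled as Python sets of (n, color) tuples, and the variable names are assumptions made for the example, not notation from the paper.

```python
# Relations of Fig. 1, as sets of (n, color) tuples.
car_1 = {(1, "azure"), (2, "vermilion")}
car_2 = {(1, "blue"), (2, "blue"), (3, "red")}

# Plain relational union: tuples are compared by strict equality,
# so "azure" and "blue" are simply two different strings.
plain_union = sorted(car_1 | car_2)

# Group the colors found for each key in the result.
colors_by_key = {}
for n, color in plain_union:
    colors_by_key.setdefault(n, []).append(color)

# Keys that now carry more than one color: the result is hard to interpret.
ambiguous = {n: colors for n, colors in colors_by_key.items() if len(colors) > 1}
```

    Here cars 1 and 2 each end up with two colors, and nothing in the result says whether "azure" and "blue" agree or conflict.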

    SemUnion

    CAR (operand 1)     CAR (operand 2)     CAR (result)
    n   color           n   color           n   color
    1   azure           1   blue            1   azure
    2   vermilion       2   blue            3   red
                        3   red

    + INCONSISTENCIES
    n   color       color
    2   vermilion   blue

    Fig. 2. Semantic Union using linguistic knowledge

    Remark: In Fig. 2 and in the remainder of this paper, we use the term "inconsistency" for the case of two tuples "matching" on all the key attribute values but having conflicting values for some non-key attributes, as it is used in [6] and [1].
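    The result of Fig. 2 can be approximated with a small sketch. The operator itself is defined later in the paper, so everything below is an assumption made for illustration: the dictionary is reduced to a hypothetical `HYPERNYM` table of "is a kind of" links, two values are taken to match when one is a generalization of the other, and the more specific value is kept.

```python
# Hypothetical linguistic dictionary: each term points to its more general term.
HYPERNYM = {"azure": "blue", "navy": "blue", "vermilion": "red"}

def ancestors(value):
    """Return the value followed by its successively more general terms."""
    chain = [value]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def reconcile(a, b):
    """Return the more specific of two compatible values, or None on conflict."""
    if b in ancestors(a):
        return a
    if a in ancestors(b):
        return b
    return None

def sem_union(r1, r2):
    """Union of two {n: color} relations with linguistic matching on color."""
    result, inconsistencies = {}, []
    for key in sorted(r1.keys() | r2.keys()):
        if key in r1 and key in r2:
            merged = reconcile(r1[key], r2[key])
            if merged is None:
                # Same key, conflicting non-key values: an inconsistency.
                inconsistencies.append((key, r1[key], r2[key]))
            else:
                result[key] = merged
        else:
            result[key] = r1[key] if key in r1 else r2[key]
    return result, inconsistencies

car_1 = {1: "azure", 2: "vermilion"}
car_2 = {1: "blue", 2: "blue", 3: "red"}
result, inconsistencies = sem_union(car_1, car_2)
```

    With these inputs, car 1 keeps "azure", car 3 keeps "red", and car 2 is reported as an inconsistency between "vermilion" and "blue", which matches the tables of Fig. 2.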

    The basic problem is that, given the two source schemas R1(K,A1,A2,A3) and R2(K',A4,A5,A6), once the correspondences between attribute names are stated at the schema level (for example A1 is equivalent to A4), it is necessary to have knowledge at the instance level (the a1 value of a tuple r1 of R1 is the same as the a4 value of a tuple r2 of R2, in spite of their coding heterogeneity).


    Considering the thousands of tuples from tens of sources, it is not realistic to store these assertions the way schema assertions are stored. Furthermore, these assertions are partially user-dependent. For example, for the same query "search for cars having the same color as their seat covers", some users expect blue cars and covers, whatever the kind of blue, while other users expect only cars and covers that are both light-blue, or both deep-blue.
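    The two user expectations can be illustrated with a sketch in which the user chooses the level of the hierarchy at which values are considered equivalent. This is one possible reading of the "priority nodes" mentioned in the abstract, not the paper's definition; the `HYPERNYM` table, the `class_of` helper and the chosen node sets are all hypothetical.

```python
# Hypothetical color hierarchy: each term points to its more general term.
HYPERNYM = {
    "light-blue": "blue",
    "deep-blue": "blue",
    "blue": "color",
    "red": "color",
}

def class_of(value, priority_nodes):
    """Walk up the hierarchy until a user-chosen priority node is reached.

    If no priority node lies above the value, the value stands for itself.
    """
    node = value
    while node not in priority_nodes and node in HYPERNYM:
        node = HYPERNYM[node]
    return node if node in priority_nodes else value

def same_color(car, cover, priority_nodes):
    """Two values match when they fall into the same equivalence class."""
    return class_of(car, priority_nodes) == class_of(cover, priority_nodes)

coarse = {"blue", "red"}            # any kind of blue is acceptable
fine = {"light-blue", "deep-blue"}  # the shade itself must agree

coarse_match = same_color("light-blue", "deep-blue", coarse)
fine_match = same_color("light-blue", "deep-blue", fine)
```

    The same pair of values matches for the first user and not for the second, and neither user had to supply a numeric distance.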

    A preliminary cleansing of the data is also inadequate because the denominations are not exactly equivalent (unlike a code "1" and a code "M" for "male"). Thus, cleaning "azure" into "blue" in the first relation would lead to a loss of semantics; and cleaning "blue" into "azure" in the second relation is impossible without knowing that the value in the other relation is "azure" and not "navy".
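    The contrast between the two kinds of cleaning can be made concrete. In the sketch below, the code table and the `HYPERNYM` links are assumptions chosen for illustration; in particular, the pairing of code 2 with "F" is inferred, since the text only states that "1" and "M" both denote "male".

```python
# Encoding heterogeneity: a one-to-one code table, so translation loses nothing.
SEX_CODE = {1: "M", 2: "F"}  # 2 <-> "F" is an inferred pairing
SEX_DECODE = {letter: code for code, letter in SEX_CODE.items()}

# Denomination heterogeneity: hypothetical "is a kind of" links.
HYPERNYM = {"azure": "blue", "navy": "blue"}

def generalize(value):
    """Replace a value by its more general term, if it has one."""
    return HYPERNYM.get(value, value)

def specializations(value):
    """List the more specific terms a general value could stand for."""
    return sorted(term for term, parent in HYPERNYM.items() if parent == value)

# Codes survive a round trip through the table.
round_trip_ok = all(SEX_DECODE[SEX_CODE[code]] == code for code in SEX_CODE)

# Generalizing erases the difference between two distinct shades.
distinction_lost = generalize("azure") == generalize("navy")

# Specializing "blue" has more than one candidate, so it cannot be done blindly.
candidates = specializations("blue")
```

    The code translation is reversible, while generalizing merges "azure" and "navy", and "blue" alone does not say which of the two it should become.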

    Consequently, our approach integrates the linguistic matching into the processing of the queries. These queries may have different purposes and may come from different "users" (e.g. users, interfaces, mediators).

    [Figure: architecture diagram. Labeled elements: user interface; query with semantic operators; answer to the query; set priority level; meta database; linguistic dictionary; relations extracted from sources; query transformation tool; linguistic matching tool; SQL queries/answers without semantic operators.]

    Fig. 3. Global methodology

    A linguistic matching (i.e. linguistic flexibility in the comparison of values) is offered to the user by means of a set of semantic operators, such as the one of Fig. 2. These operators are the SemUnion, the SemJoin, the SemRestriction, the SemDifference and the SemIntersection. It is important to notice that the linguistic flexibility must be triggered by the user, since one user may search for red cars and expect vermilion cars and other kinds of red cars, while another user may search for red cars expecting cars having exactly the "red" value for the attribute "color". This is an important difference from other cleaning procedures, such as currency conversion, which have no impact on the semantics of the query and thus may be performed automatically.
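    The point that flexibility is requested by the user, rather than applied silently, can be sketched with a restriction on the "color" attribute. The function name `sem_restriction`, its `flexible` flag and the `HYPERNYM` table are hypothetical; the paper's own operator definitions are given in its later sections.

```python
# Hypothetical linguistic dictionary: each term points to its more general term.
HYPERNYM = {"vermilion": "red", "azure": "blue"}

def is_kind_of(value, target):
    """True if value equals target or target is one of its generalizations."""
    while value != target and value in HYPERNYM:
        value = HYPERNYM[value]
    return value == target

def sem_restriction(rows, attribute, target, flexible):
    """Select rows on one attribute, with or without linguistic flexibility."""
    if flexible:
        return [row for row in rows if is_kind_of(row[attribute], target)]
    return [row for row in rows if row[attribute] == target]

cars = [
    {"n": 1, "color": "red"},
    {"n": 2, "color": "vermilion"},
    {"n": 3, "color": "azure"},
]

# One user wants exactly "red"; another accepts any kind of red.
strict_ids = [row["n"] for row in sem_restriction(cars, "color", "red", flexible=False)]
flexible_ids = [row["n"] for row in sem_restriction(cars, "color", "red", flexible=True)]
```

    The strict query returns only car 1, while the flexible query also returns the vermilion car 2; the data is the same, and only the user's choice differs.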

    A traditional challenge when introducing some flexibility is to define its scope. Numerous w