[Lecture Notes in Computer Science] Conceptual Modeling — ER ’99 Volume 1728 || Dealing with Semantic Heterogeneity During Data Integration


Post on 07-Mar-2017






<ul><li><p>J. Akoka et al. (Eds.): ER'99, LNCS 1728, pp. 325-339, 1999. © Springer-Verlag Berlin Heidelberg 1999</p><p>Dealing with Semantic Heterogeneity during Data Integration</p><p>Zoubida Kedad¹ and Elisabeth Métais¹</p><p>¹ Laboratoire PRiSM, Université de Versailles, 45 avenue des Etats-Unis, F-78035 Versailles cedex, France</p><p>{Zoubida.Kedad, Elisabeth.Metais}@prism.uvsq.fr</p><p>Abstract. Multi-source information systems, such as data warehouse systems, involve heterogeneous sources. In this paper, we deal with the semantic heterogeneity of the data instances. Problems may occur when confronting sources whenever different levels of denomination have been used for the same value, e.g. "vermilion" in one source and "red" in another. We propose to manage this semantic heterogeneity by using a linguistic dictionary. "Semantic operators" allow linguistic flexibility in the queries, e.g. two tuples with the values "red" and "vermilion" could match in a semantic join on the "color" attribute. A particularity of our approach is that it states the scope of the flexibility by defining classes of equivalent values by means of "priority nodes". They are used as parameters that let the user define the scope of the flexibility in a very natural manner, without specifying any distance.</p><p>1 Introduction</p><p>A Multi-Source Information System (MSIS) is composed of a set of heterogeneous and distributed data sources, and a set of views (or queries) defining user requirements over these sources. Examples of MSIS are web systems and data warehouse systems. The main difference between an MSIS and a classical information system resides in its definition and its feeding with data. While the database schema of a classical information system is defined as an integrated data structure, the database structure of an MSIS is defined as a set of (possibly independent) views. 
Moreover, while the feeding of the first database is done by the users through their applications, the feeding of the second database is done automatically by the system from the data sources. MSIS applications aim only at using this data, without any direct update.</p><p>One possible solution for accessing distributed and heterogeneous data sources is to build, on top of the local systems, a global system which integrates the different data sources [23],[27]. The integration, exportation and importation of data is done by the federated system. Several systems represent this architecture, such as the TSIMMIS system [4],[18], Information Manifold [17], and SIMS [2].</p><p>An alternative way to provide access to a set of distributed and heterogeneous data sources is the data warehouse (DW) approach, in which the relevant data is extracted from each source, then translated, cleaned and integrated to be stored in a centralized repository. Among the systems representing this approach, we can find the WHIPS prototype [10], H2O [28] and DWQ [11].</p></li><li><p>MSIS have introduced new design activities such as the selection of relevant data sources, the query containment problem [17], the selection of views to materialize [26], the update propagation of source changes [15], the definition and generation of mediators [27], the definition of computing expressions for each view in the system [14] and data integration and cleaning [5],[1],[3].</p><p>One key problem in designing MSIS is the reconciliation of the data contained in the data sources. 
When merging the values contained in the sources, the following problems may occur:</p><p>- Differences in data format, for example when the instances of two equivalent attributes have different lengths;</p><p>- Differences in scale, when the instances of two equivalent attributes are expressed using different units, such as French francs and dollars;</p><p>- Differences in the encoding of attributes, for example the attribute sex in two different data sources can be encoded using the sets of instances {1, 2} and {F, M} respectively;</p><p>- Differences in the denominations of the instances, for example, the same attribute color may have the two values "azure" and "blue".</p><p>Natural language techniques could be useful for solving some of the cases in which the denominations of the instances are different. These techniques have been used for schema integration. [8] proposes the use of fuzzy and incomplete terminological knowledge to determine the correspondence assertions between the compared objects. In [12] and [13], correspondence assertions between the schemas to integrate are determined using case grammar and speech act theory. [19] and [20] suggest the use of semantic dictionaries storing concept hierarchies and canonical graphs to retrieve the semantics of the schemas. [21] uses a fuzzy thesaurus built using linguistic tools to compute a fuzzy semantic nearness coefficient between objects. In the Carnot project [24], which provides a framework for heterogeneous database integration, a common ontology is used to semantically relate different schemas with each other and thereby enable the interoperation of the underlying databases. The common ontology is expressed using CYC, a knowledge base providing the "commonsense" knowledge of an application domain.</p><p>In this paper we propose to deal with semantic heterogeneity by using metadata including a linguistic dictionary. 
The main idea of this technique is to provide linguistic flexibility in the queries, without any distance set by the user. Section 2 gives an overview of our objective. Section 3 discusses the linguistic techniques supporting the methodology and sections 4 and 5 detail the proposed semantic operators.</p></li><li><p>2 The General Approach</p><p>Using traditional relational algebra operators for querying an MSIS may lead to meaningless results; the example of Fig. 1 illustrates the problem we aim to solve. We can see in this example that the result of the union is hard to interpret and that it contains inconsistencies. Flexible relations as defined in [1] allow a correct chaining with other operators, without propagating errors. However, our purpose is different in the sense that we want to obtain the result of Fig. 2, by way of "semantic operators", which are described at length in section 4.</p><p>Union</p><p>CAR(n, color), first source: (1, azure), (2, vermilion)</p><p>CAR(n, color), second source: (1, blue), (2, blue), (3, red)</p><p>CAR(n, color), result: (1, azure), (1, blue), (2, vermilion), (2, blue), (3, red)</p><p>Fig. 1. Union without linguistic knowledge</p><p>SemUnion</p><p>CAR(n, color), first source: (1, azure), (2, vermilion)</p><p>CAR(n, color), second source: (1, blue), (2, blue), (3, red)</p><p>CAR(n, color), result: (1, azure), (3, red)</p><p>+ INCONSISTENCIES(n, color, color): (2, vermilion, blue)</p><p>Fig. 2. Semantic Union using linguistic knowledge</p><p>Remark: In Fig. 
2 and in the remainder of this paper, we use the term "inconsistency" for the case of two tuples "matching" on all the key attribute values but having conflicting values for some non-key attributes, as it is used in [6] and [1].</p><p>The basic problem is that, given the two source schemas R1(K,A1,A2,A3) and R2(K',A4,A5,A6), once the correspondences between attribute names are stated at the schema level (for example A1 is equivalent to A4), it is necessary to have the knowledge at the instance level (the a1 value of a tuple r1 of R1 is the same as the a4 value of a tuple r2 of R2, in spite of their coding heterogeneity).</p></li><li><p>Considering the thousands of tuples from the tens of sources, it is not realistic to store these assertions the way schema assertions are stored. Furthermore, these assertions are partially user dependent. For example, for the same query "search for cars having the same color as their seat covers", some users expect blue cars and covers, whatever the kinds of blue are, while other users expect only cars and covers that are both light-blue, or cars and covers that are both deep-blue.</p><p>A preliminary cleansing of the data is also inadequate because the denominations are not exactly equivalent (as opposed to a code "1" and a code "M" for "male"). Thus, cleaning "azure" into "blue" in the first relation would lose semantics, and cleaning "blue" into "azure" in the second relation is impossible without knowing that the value in the other relation is "azure" and not "navy".</p><p>Consequently, our approach integrates the linguistic matching in the processing of the queries. These queries may have different purposes and may come from different "users" (e.g. 
user, interfaces, mediators).</p><p>Fig. 3. Global methodology (diagram components: user interface; query with semantic operators; answer to the query; set priority level; query transformation tool; linguistic matching tool; meta database with linguistic dictionary; SQL queries/answers without semantic operators; relations extracted from sources)</p><p>A linguistic matching (i.e. a linguistic flexibility in the comparison of values) is offered to the user by means of a set of semantic operators, such as the one of Fig. 2. These operators are the SemUnion, the SemJoin, the SemRestriction, the SemDifference and the SemIntersection. It is important to notice that the linguistic flexibility must be triggered by the user, since one user may search for red cars and expect vermilion cars and other kinds of red cars, while another user may search for red cars expecting cars having exactly the "red" value for the attribute "color". This is an important difference from other cleaning procedures such as currency conversion, which have no impact on the semantics of the query, and thus may be performed automatically.</p></li><li><p>A traditional challenge when introducing some flexibility is to define its scope. Numerous works [22], [25] have proposed semantic distances for integration or retrieval purposes. However, although this distance could be computed using a dictionary, we did not want the user to specify it because this is not a natural way of thinking. In our approach, the user automatically defines classes of values, by means of priority nodes in a hierarchical dictionary. Then, two values are considered semantically close if they belong to the same class, and semantically far if they do not.</p><p>The semantic operators are transformed into classical operators and then executed by the DBMS on the relations extracted from the sources, as shown in Fig. 3. 
This transformation needs metadata, which are provided by a dictionary adapted from linguistic dictionaries.</p><p>3 The Linguistic Knowledge Supplied by the Meta Database</p><p>Besides procedures and conversion tables, the meta database has to contain a linguistic dictionary in order to reconcile denominations. In this section, we describe how linguistic dictionaries may be customized in order to be more suitable for the purpose of data reconciliation in MSIS.</p><p>3.1 General Linguistic Dictionaries</p><p>Several projects aiming at building electronic linguistic dictionaries are in progress, such as CYC [16], EDR [7] and WordNet [9]. Beyond morphological and syntactic information, they provide semantic links between words such as synonymy, links relating words to concepts, and links between concepts such as antonymy, hyponymy/hypernymy (is-a) and meronymy/holonymy (part-of). The is-a links are useful for conceptual schema elaboration and are organized in a "hierarchy of concepts". Other links are supported by "canonical graphs", which specify the relationships expected among the concepts involved in a given action.</p><p>These general linguistic dictionaries are very helpful for retrieving the semantics of the nouns in schemas or data for their integration. They make it possible to recover all the semantics lost during the modeling process. In previous works [19] [20] we used WordNet to integrate database schemas. For example, one contribution of such dictionaries is the detection of generalization links between objects of different views. Another contribution is the deduction of the roles played by each entity in a relationship.</p><p>General linguistic dictionaries are very useful for schema integration. 
However, they seem to have some limitations for data integration in an MSIS, mainly at three levels: the vocabulary, the links and the normalization.</p></li><li><p>Their vocabularies are adapted to describe the most common concepts of the real world, and are therefore very suitable for knowledge representation, including database schemas. But they cannot capture all the possible values of instances. Most of these values are specific to their owner, and only part of them may be considered as words of the common vocabulary.</p><p>The links they provide are also suitable for the schema level (semantic roles, converse verbs, ...), but other links would be closer to the problems raised by data. For example, the cardinality from an object to a property should help in decision making when the same property has different values in distinct sources. Such a link is not found in a general dictionary, but is one of the basic links between data.</p><p>Concerning normalization, general dictionaries do not provide a preference among a set of synonyms, or inside a branch of the concept hierarchy.</p><p>3.2 Proposal for a More Suitable Linguistic Dictionary</p><p>Due to the limitations described above, we propose to enrich the linguistic dictionary so that it is more oriented towards data reconciliation. We will illustrate this section with the example of a dictionary for color denominations, mainly devoted to vehicle colors. Colors are essential in car marketing and are mined for many sales analyses. This example illustrates the case of data which may have a huge number of denominations, since they are usually specific to a constructor.</p><p>The Concept Hierarchy</p><p>The main structure of the linguistic dictionary is the concept hierarchy, which stores the "is-a" relationship. As we can see in Fig. 4, the leaf nodes of the concept hierarchy are often enterprise denominations. 
This example highlights a particularity of a linguistic dictionary devoted to data instances, which is to gather both common vocabulary names and names that look like codes, integrated in the same hierarchy.</p><p>The elaboration of such a dictionary requires the cooperation of the main enterprises concerned. Thus an automatic procedure has to be proposed in order to feed the dictionary with data. Concerning the particular example of color denominations, this procedure does exist. A color may be encoded with three coordinates along the three basic color axes. While some colors have only one point in this space (usually constructor colors such as "Glacier"), others (usually common name colors such as "red") cover a volume in this space. The hierarchy is then built automatically as follows: a color subsumes another color if the corresponding volume contains the volume of the other color (which may be a point).</p></li><li><p>Fig. 4. Excerpt of the concept hierarchy (nodes shown: T, blue, green, red, navy, light blue, deep blue, azure, ruby, vermilion, Sevilla, Glacier)</p><p>We notice that, like other cleaning metadata, a part of the dictionary may be shared by a large number of applications and users, and another part is an enrichment which is specific to an application. The dicti...</p></li></ul>
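<p>The containment rule described above for building the color hierarchy can be sketched as follows. This is only a minimal illustration, not the authors' procedure: it assumes that each color volume is an axis-aligned box in a three-axis color space (a point being a box whose two corners coincide). The names ColorVolume, build_hierarchy and EXAMPLE_COLORS, all coordinates, and the placement of "Glacier" under "azure" are invented for the example.</p>

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class ColorVolume:
    """Hypothetical model: an axis-aligned box in a 3-axis color space.

    A single-point color (e.g. a constructor color) has low == high.
    """
    name: str
    low: Point
    high: Point

    def contains(self, other: "ColorVolume") -> bool:
        # True when `other` lies entirely inside this box on every axis.
        return all(
            s_lo <= o_lo and o_hi <= s_hi
            for s_lo, s_hi, o_lo, o_hi in zip(self.low, self.high, other.low, other.high)
        )

    def size(self) -> float:
        # Box volume; zero for a point.
        result = 1.0
        for lo, hi in zip(self.low, self.high):
            result *= hi - lo
        return result


def build_hierarchy(colors: List[ColorVolume], root: str = "T") -> Dict[str, str]:
    """Map each color name to its parent: the smallest volume strictly containing it.

    Colors that no other volume contains are attached to the root node.
    """
    parents: Dict[str, str] = {}
    for color in colors:
        # Strict containment: skip the color itself and any identical volume.
        containers = [
            other for other in colors
            if other is not color and other.contains(color) and not color.contains(other)
        ]
        parent = min(containers, key=lambda c: c.size(), default=None)
        parents[color.name] = parent.name if parent is not None else root
    return parents


# Invented coordinates, for illustration only.
EXAMPLE_COLORS = [
    ColorVolume("blue", (0, 0, 128), (120, 200, 255)),
    ColorVolume("azure", (0, 110, 200), (60, 160, 255)),
    ColorVolume("Glacier", (30, 140, 230), (30, 140, 230)),  # a single point
    ColorVolume("red", (128, 0, 0), (255, 100, 100)),
    ColorVolume("vermilion", (200, 40, 20), (255, 80, 60)),
]

if __name__ == "__main__":
    for child, parent in build_hierarchy(EXAMPLE_COLORS).items():
        print(f"{child} is-a {parent}")
```

<p>With these invented coordinates, "vermilion" is attached to "red" and "Glacier" to "azure", its smallest enclosing volume, while colors with no enclosing volume hang directly under the root T. Choosing the smallest enclosing volume as the parent is a design choice of this sketch; it keeps each node linked to its most specific subsumer.</p>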