
Using Ontologies for Measuring Semantic Similarity in Data Warehouse Schema Matching Process

M. Banek*, B. Vrdoljak* and A.M. Tjoa**
* Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
** Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria
{marko.banek, boris.vrdoljak}@fer.hr, amin@ifs.tuwien.ac.at

Abstract - The key step of data warehouse integration is the construction of mappings that link mutually compatible components of data warehouse schemas: dimensions, aggregation levels, attributes and facts. In order to perform the integration process in a semi-automated manner, we must define similarity functions that compare the names and substructures of those structure elements. During the last decade, many approaches to measuring semantic similarity between lexical terms have been introduced, most of them based either on the taxonomy of WordNet, a large lexical and thesaurus database of the English language, or on a previously measured language statistics corpus. This paper presents a novel semantic similarity technique, based on edge counting, which combines WordNet and domain ontologies written in OWL and is implemented as Java software. Ontologies are designed by domain experts and thus provide a better and more trustworthy source for calculating similarity, and the fact that the terms are related more closely than in WordNet results in a higher similarity.

I. INTRODUCTION

Data warehouses are the principal data sources for decision-making in the information systems of large businesses. Integration of data warehouses is nowadays a hot topic, as the greater amount of data and their greater statistical significance offer a broader base for decision-making and knowledge discovery. Large businesses integrate their separately developed regional warehouses, newly merged companies join their warehouses to enable the business to be run centrally, and independent organizations unite their warehouses when this benefits all of them.

The integration process starts with comparing the schema structures of the component warehouses. Possible conflicts are noted. Next, mappings are produced that state which warehouse elements represent the same information within different heterogeneous sources and hence can be joined together. Conflicts arise from the use of different names and/or structures to describe the same information (e.g. a dimension describing a hospital may be called hospital in one component data warehouse, and clinic in another) or from different specifications of the same structure [9].

Automatic integration of heterogeneous schemas is necessary to reduce the duration of the complex process of warehouse integration. Automation of the process can solve easy and moderate conflicts and offer several solutions to some complex problems, on which the engineers designing the integrated system can concentrate.

The intensive, more than two decades long research in the domains of schema integration, natural language processing and information retrieval has pointed out semantic similarity functions as the crucial tool for comparing objects that belong to heterogeneous schemas. Originally, semantic similarity functions give the degree of mutual relatedness between two words (more specifically, between two certain meanings of those words). They were invented by psycholinguists, who tried to find an objective, quantified measurement of human perception. The degree of relatedness of two input words (or senses) is represented as a probability function. For two different words with the same meaning (i.e. synonyms) the value of the function is exactly one. The value of zero means that the two input words are not related at all.

Semantic similarity functions are used to solve naming conflicts in schema matching. Complex structure conflicts can be recursively reduced to a set of atomic naming and data type conflicts. This means that the whole integration process actually turns into applying a semantic similarity function to several hundred or several thousand words.

Semantic similarity functions cannot be used without the existence of a machine-readable dictionary, lexicon or thesaurus where words are organized into a semantic space. The basic dictionary source for English words is WordNet [14][19], a handcrafted taxonomy of the English language, which contains more than 100,000 words [14]. The recent emergence of domain ontologies offers another important source of relations among words or terms. While WordNet can produce the semantic similarity for any pair of input words, domain ontologies contain a much smaller number of words, but describe their relations in a more precise way, restricting the meaning of words exclusively to that specific domain, and not regarding their general or most standard meaning. For specific, domain-oriented purposes, we thus consider domain ontologies a more reliable source for similarity functions than WordNet.

This paper presents an approach that incorporates ontologies written in the Web Ontology Language (OWL) into WordNet and thus forms a unified source for our semantic similarity function. We regard OWL classes as atomic structures equal to synonym sets in WordNet. We further analyze different OWL constructs and translate them into relations that correspond to those in WordNet.



We provide a Java software implementation of our similarity technique to support the process of schema matching in data warehouse integration.

The paper is structured as follows. Section II describes our methodology for automated integration of data warehouse schemas and the need to use semantic similarity functions. The principles of word taxonomy organization in WordNet are briefly outlined in Section III. An overview of the related work in the area of semantic similarity concerning WordNet is given in Section IV. OWL constructs that correspond to taxonomy relations between words in WordNet are depicted in Section V. Section VI shows how semantic similarity between words is computed when at least one of them can be found in a domain ontology. Some brief details of the software implementation of our approach are described in Section VII. Conclusions are drawn in Section VIII.

II. DATA WAREHOUSE INTEGRATION

The easiest way of data warehouse integration is the construction of a new warehouse, to which all data from the independent source warehouses are copied. However, in certain cases the privacy policy of the organizations, as well as the legal protection of sensitive data, prohibits the data from being copied outside the organizations where they are stored. We faced that challenge while integrating heterogeneous data warehouses of several independent health insurance organizations in order to develop a medical knowledge base. The solution to the problem is a logical integration, a data warehouse federation [3].

According to Sheth and Larson [18], a federated database is "a collection of cooperating database systems that are autonomous and possibly heterogeneous". A federated data warehouse is a functional warehouse, a "big umbrella". Instead of creating a new large data warehouse that collects data from the smaller component warehouses, the latter are functionally integrated into a single unit from the conceptual point of view, using a unique common conceptual model. A federation user must know nothing about the heterogeneous and distributed character of the federation. On the other hand, the existence of a federation must not have an impact on the local users of the component warehouses, and each of them must work independently of the federation [18].

B. Existing Approaches to Warehouse Integration and Schema Matching

An approach to prove whether particular dimensions and facts in the schemas of heterogeneous data marts are compatible for integration is given in [4]. Special attention is paid to matching dimensions and preserving their hierarchies. In the final step, the instance data are checked and tuples shared between the sources are selected. However, multidimensional structures that are candidate pairs for matching must be chosen manually, using the human knowledge of their semantic meaning.

A query language aimed at integrating autonomous data warehouses into a multi data warehouse system, which solves different conflicts between heterogeneous multidimensional structures, is presented in [2]. Again, match candidates are supposed to be chosen in advance.

Generally, techniques that automate the process of matching structured data (databases or semi-structured data) can be divided into schema-based and instance-based (schema-based matching considers only schema information, not instance data) [15]. Schema-based solutions either perform a linguistic semantic comparison of the source schema components or translate them into graphs and then compare the structure of those graphs. Instance-based approaches apply different learning and mining techniques to compare instance data, together with schema metadata (thus being able to outperform schema-based techniques) [5].

C. Methodology for Automated Integration of Data Warehouse Schemas

Since we deal with legally protected healthcare data, which cannot be reproduced for processing in order to perform warehouse schema matching, we must restrict ourselves to warehouse schema metadata as the only source for the matching process.

The goal of the matching process is to determine which multidimensional components of a data warehouse are mutually equivalent. The automatically proposed result should correspond to the solution manually created by a data warehouse designer to the highest possible degree. There are five types of multidimensional structures: facts, measures, dimensions, aggregation levels and dimensional attributes, but only those of the same type can be compatible for matching. Mapping cardinalities can be one-to-one, one-to-many or many-to-many.

The matching algorithm consists of two basic steps:
1. comparison of the multidimensional components
2. creation of mappings

In the first step we determine how similar two multidimensional structures are to each other. We compare their names and substructures (if the latter exist). Attributes and measures are atomic elements of data warehouse schemas and their similarity is calculated as the semantic similarity between their names. Similarity of complex elements (dimensions, aggregation levels, facts), sim, can be computed as a weighted sum of the name similarity (nsim) and substructure similarity (ssim):

    sim = w_name * nsim + (1 - w_name) * ssim,   w_name in [0, 1].
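To make the combination above concrete, the following minimal Java sketch (not part of the original implementation; the class and method names are illustrative) computes the weighted similarity of a complex schema element from a given name similarity and substructure similarity:

    // Illustrative sketch of the weighted combination sim = w_name * nsim + (1 - w_name) * ssim.
    public final class ComplexElementSimilarity {

        private final double nameWeight; // w_name, chosen from [0, 1]

        public ComplexElementSimilarity(double nameWeight) {
            if (nameWeight < 0.0 || nameWeight > 1.0) {
                throw new IllegalArgumentException("w_name must lie in [0, 1]");
            }
            this.nameWeight = nameWeight;
        }

        /** Combines name similarity (nsim) and substructure similarity (ssim). */
        public double combine(double nameSimilarity, double substructureSimilarity) {
            return nameWeight * nameSimilarity
                    + (1.0 - nameWeight) * substructureSimilarity;
        }
    }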

Having calculated the similarity for all attribute pairs, we turn to aggregation levels. The substructure similarity formula for aggregation levels uses the neighborhood similarity approach [17] to combine the similarities of the attributes that form the levels. Recursively, dimension similarity is computed by combining the similarity between dimension names and the neighborhood similarity formula applied to the calculated similarities between pairs of aggregation levels. Finally, dimension and measure similarity is joined with fact name similarity to calculate the similarity between facts.

In the second step of the algorithm we use heuristic principles to determine which multidimensional structures will be mapped as equivalent, among a much larger number of possible mappings. Some basic mapping principles are discussed in [12].



The hierarchical organization of aggregation levels in dimensions brings additional mapping constraints that are inherent to the multidimensional conceptual model. As mappings are created one after another, the newest mapping must not violate the partial order in hierarchies established by previous mappings. If aggregation levels Ai and Bj are mapped, we are not allowed to create mappings Ai-1 - Bj+1 and Ai+1 - Bj-1 (see the left part of Fig. 1). A similar idea is expressed in [4], where the coherence of partial orders in dimension hierarchies is stated as a necessary condition for their match compatibility.

If an aggregation level has no corresponding counterpart in the other schema, it can be mapped to the counterpart of the next finer level (as in the case of two typical store dimensions describing sales in the right part of Fig. 1: city1 is mapped to store2), since the records it contains can be obtained by summing the records of the finer level (e.g. city1 records can be obtained by summing store1 records, which, on the other hand, correspond to the records in store2).

[Figure 1 shows two example situations: on the left, mapped aggregation levels Ai and Bj with their neighboring levels Ai-1, Ai+1, Bj-1, Bj+1; on the right, two store dimensions, store1 - city1 - county1 and store2 - county2.]

Figure 1. Constraints on mapping aggregation levels inherent to the multidimensional conceptual model

The similarity between the names of multidimensional structures is, actually, the semantic similarity between the words (terms) that correspond to the names. Semantic similarity calculation will be further discussed in Section IV, after introducing some necessary definitions in Section III.

III. WORDNET

WordNet divides the lexicon into four basic categories: nouns, verbs, adjectives and adverbs, in order to exploit the fundamental differences in the semantic organization of these syntactic categories. Nouns are organized in the lexical memory as topical hierarchies, verbs are organized by a variety of entailment relations, and adjectives and adverbs are organized as N-dimensional hyperspaces [19].

Nouns are the central part of WordNet: the number of nouns is greater than that of all other words together, all nouns form a single tree-like hierarchy (i.e. all of them are related) and the number of semantic relations between nouns is greater than between other word types. There are also links from adverbs to related adjectives and from adjectives and verbs to related nouns. Fortunately, most names in data warehouse schemas are exactly nouns or combinations of nouns.

A word may have several meanings. For instance, a street means both a thoroughfare (usually including sidewalks) that is lined with buildings, as in the sentence "he lives on Nassau Street", and people living or working on the same street, as in the sentence "the whole street protested the absence of street lights" [14]. These meanings are called word senses. A word sense may have one or more synonyms (e.g. sovereign and monarch). They form a synset. The noun hierarchy is a directed graph whose vertices are different synsets while their semantic relations are edges. WordNet also includes synsets where some of the synonyms are labeled by several words (like horizontal surface or physical object; see Fig. 2).

[Figure 2 shows a portion of the WordNet noun taxonomy: entity - object/physical object - artifact/artefact, with the hyponyms surface, way and decoration; below them horizontal surface/layer, road/route and arabesque, then paved surface and thoroughfare; pavement/paving is a meronym of street, and curbside a meronym of pavement/paving.]

Figure 2. A portion of the taxonomy in WordNet

Semantic relations between nouns are the following. Antonymy is a symmetric relation determining that two nouns are opposite (e.g. beauty and ugliness) and exists for nouns derived from adjectives.

A noun synset {x, x', ...} is said to be a hyponym of the noun synset {y, y', ...} if native speakers of the language accept that an x is a (kind of) y. For instance, every artifact is an object (Fig. 2). We also say that the synset {y, y', ...} is a hypernym of the synset {x, x', ...}, e.g. surface, way and decoration have a common hypernym artifact/artefact. Hyponymy and hypernymy are equal to the subordination and superordination relations. Both are transitive and asymmetrical. Each noun synset in WordNet has in general a single hypernym and thus the hierarchical semantic structure of nouns in WordNet is tree-like. The noun entity, denoting something which is perceived or known or inferred to have its own distinct existence (living or nonliving), is situated at the top of the hierarchy, i.e. it has no hypernym, and all nouns are its direct or indirect hyponyms (see Fig. 2, where hypernymy and hyponymy are denoted by solid and dashed lines, respectively).

The part-whole (or HAS-A) relation is known to lexical semanticists as meronymy/holonymy. A concept represented by the synset {x, x', ...} is a meronym of a concept represented by the synset {y, y', ...} if native speakers of the language accept that a y has an x (as a part) or an x is a part of y, as pavement/paving is a part (i.e. meronym) of street (Fig. 2). The inverse relation is called holonymy. Meronymy is asymmetrical, and is not transitive generally, but only in some cases. For instance, a finger is a part of a hand, a hand is a part of a body, and a common sense conclusion would be that a finger is also a part of a body. On the other hand, a branch is a part of a tree, and a tree is a part of a forest, but it would sound odd to most people to say that a branch is a part of a forest [13].
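Although this section only summarizes WordNet, the taxonomy it describes is essentially a directed graph of synsets with typed edges, which is the structure the path-based similarity computation of Section IV operates on. The following minimal Java sketch (illustrative only; it does not reproduce the JWordNet API used in Section VII) models that structure:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Relation types between noun synsets, as described in Section III. */
    enum Relation { HYPERNYM, HYPONYM, HOLONYM, MERONYM, ANTONYM }

    /** A synset: a set of synonymous words plus typed links to other synsets. */
    final class Synset {
        final List<String> words = new ArrayList<String>();
        final Map<Relation, List<Synset>> links = new HashMap<Relation, List<Synset>>();

        Synset(String... synonyms) {
            for (String w : synonyms) {
                words.add(w);
            }
        }

        void addLink(Relation type, Synset target) {
            List<Synset> list = links.get(type);
            if (list == null) {
                list = new ArrayList<Synset>();
                links.put(type, list);
            }
            list.add(target);
        }
    }

    // Example from Fig. 2: pavement/paving is a meronym (part) of street.
    //   Synset street = new Synset("street");
    //   Synset pavement = new Synset("pavement", "paving");
    //   street.addLink(Relation.MERONYM, pavement);
    //   pavement.addLink(Relation.HOLONYM, street);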



IV. EXISTING APPROACHES TO CALCULATING SEMANTIC SIMILARITY

A. Overview

Two basic approaches to measuring semantic similarity have been developed: methods based on counting edges in a graph representing a dictionary/thesaurus [8][10][21] and information theory-based methods, which rely on the co-occurrence of words (word senses) in a language corpus [16].

In [16] the information content of a taxonomy concept c (a vertex in the taxonomy graph) is specified as the negative logarithmic likelihood -log p(c), where p(c) is the frequency of the concept (word) appearing in the language corpus (the frequencies have been measured on newspaper articles). The method uses the synonym and the hypernym/hyponym relations only. Given two words, their semantic similarity is the information content of their nearest common ancestor concept.

The edge counting approaches compute the similarity between two concepts by constructing all possible paths between the representing vertices. Some of them assign weights to each edge in the WordNet taxonomy graph and sum [8] or multiply [21] them across the paths, returning the highest score as the similarity value. Others use path properties, such as path length, possibly with some other properties of the taxonomy itself (the depth of the target vertices) [10].

For all four presented methods the same benchmark against human judgments was used. They can be ordered according to their efficiency in ascending manner as follows: [16], [8], [10], [21]. We use the newest and most efficient method, presented by Yang and Powers [21], as the starting point for developing a semantic similarity calculation technique that combines multiple dictionary sources: WordNet and domain ontologies written in OWL.

Differently from the other approaches ([8], [10], [16]), the one presented in [21] also takes into account the antonym and holonym/meronym relations. Hence, a vertex in the taxonomy graph can always have more than one parent. Since all synsets in WordNet are descendants of the entity synset, it is possible to construct a path between any pair of synsets by searching for their common ancestor. Given two target synsets, the path construction algorithm starts bidirectionally, adding to both synsets their hypernyms, and proceeds concurrently and recursively until the common ancestor is found (i.e. until the same synset can be found in both paths). Thus, considering only hypernymy/hyponymy, the path between pavement and road/route is the following (Fig. 2): pavement - paved surface - horizontal surface/layer - surface - artifact/artefact - way - road/route. It contains six hypernym/hyponym edges. The same number of edges is contained in the path between pavement and arabesque, meaning that pavement is equally similar to road as to arabesque, which is obviously not true. Even worse, the path between pavement and street contains eight edges, making pavement less related to street than to arabesque.

Yang and Powers solve the problem by introducing the holonym/meronym relations as possible members of the path between synsets. In this way, new, possibly shorter paths can be constructed between the target synsets, which results in a higher value of the similarity function. As the holonymy/meronymy relation is generally not transitive, a valid path must not contain more than one holonymy/meronymy edge, but may contain any number of hypernym/hyponym edges. The path between pavement and street now consists of a single edge, while the one between pavement and road/route contains three (a better result than six edges, before including holonymy/meronymy edges as possible path members). Comparing the weight product scores associated with all valid paths between synsets, we choose the highest as the final measure of their semantic similarity.
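The bidirectional hypernym expansion described above can be sketched as a breadth-first search from both target synsets followed by an intersection of the reached ancestors. The Java fragment below is a simplified illustration using the Synset and Relation types sketched in Section III; it is not the Yang-Powers reference implementation and ignores holonym/meronym edges and path weighting:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    final class CommonAncestorSearch {

        /** Length (in edges) of the shortest hypernym-only path joining a and b
         *  through a common ancestor, or -1 if no common ancestor is reachable. */
        static int hypernymPathLength(Synset a, Synset b) {
            Map<Synset, Integer> distA = ancestorDistances(a);
            Map<Synset, Integer> distB = ancestorDistances(b);
            int best = -1;
            for (Map.Entry<Synset, Integer> e : distA.entrySet()) {
                Integer db = distB.get(e.getKey());
                if (db != null) {
                    int total = e.getValue() + db;
                    if (best < 0 || total < best) best = total;
                }
            }
            return best;
        }

        /** Breadth-first expansion along HYPERNYM links, recording edge distances. */
        private static Map<Synset, Integer> ancestorDistances(Synset start) {
            Map<Synset, Integer> dist = new HashMap<Synset, Integer>();
            Set<Synset> frontier = new HashSet<Synset>();
            dist.put(start, 0);
            frontier.add(start);
            while (!frontier.isEmpty()) {
                Set<Synset> next = new HashSet<Synset>();
                for (Synset s : frontier) {
                    List<Synset> hypernyms = s.links.get(Relation.HYPERNYM);
                    if (hypernyms == null) continue;
                    for (Synset h : hypernyms) {
                        if (!dist.containsKey(h)) {
                            dist.put(h, dist.get(s) + 1);
                            next.add(h);
                        }
                    }
                }
                frontier = next;
            }
            return dist;
        }
    }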

Each type of relation is given a fixed weight and the semantic similarity is the product of the weights of the edges that form the path between the vertices. Given two concepts (synsets), c1 and c2, Yang and Powers compute their semantic similarity as

    sim(c1, c2) = α_t · ∏ β_t   (product taken over the dist(c1, c2) edges of the path),   if dist(c1, c2) < γ
    sim(c1, c2) = 0,                                                                       if dist(c1, c2) ≥ γ

where α_t is the link type factor, applied to a path (α_t = 1 for two identical concepts, α_t = 0.9 for synonyms/antonyms and α_t = 0.85 for hypernyms, hyponyms, holonyms or meronyms), and β_t is the depth factor (β_t = 0.7 for hypernyms, hyponyms, holonyms or meronyms). γ is a threshold on the distance, representing human incompetence to state very distant terms as related (the recommended value for γ is 12).
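A direct reading of the formula above, under the stated parameter values, can be sketched in Java as follows (an illustrative fragment, not the authors' implementation; the edge weights are supplied by the caller):

    /** Illustrative evaluation of the Yang-Powers path score described above. */
    final class YangPowersScore {

        static final int GAMMA = 12; // recommended distance threshold

        /**
         * @param alpha       link type factor of the path (1, 0.9 or 0.85)
         * @param edgeWeights depth factors of the edges on the path (e.g. 0.7 each)
         */
        static double similarity(double alpha, double[] edgeWeights) {
            if (edgeWeights.length >= GAMMA) {
                return 0.0; // too distant: humans no longer perceive the terms as related
            }
            double score = alpha;
            for (double beta : edgeWeights) {
                score *= beta;
            }
            return score;
        }
    }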

V. TAXONOMY RELATIONS IN OWL ONTOLOGIES

There are three basic components of OWL ontologies: instances (individuals), classes and properties [20]. Properties are binary relations that link two individuals. For example, Germany is linked to Peter by the property livesIn (Peter lives in Germany). Classes group instances that share some common characteristics. The instance Peter may belong to a class Person, while Germany may belong to a class Country (Fig. 3). On the other hand, properties do not link every possible individual to any other individual, but we define them only for all instances of a certain class (the class is called domain) so that they can be linked to any instance of another class (called range). The class Person is the domain and the class Country is the range of the property livesIn (Fig. 4). In general, properties are not inverse (obviously, Germany does not live in Peter).

    <Country rdf:ID="Germany"/>

    <Person rdf:ID="Peter">
      <livesIn rdf:resource="#Germany"/>
    </Person>

Figure 3. Instances of classes Country and Person in OWL


Classes can be arranged in a taxonomy. A subclass inherits all properties of its superclass and may define its own properties. All its instances are also instances of its superclass. Each class may have zero or more subclasses and zero or more superclasses, although the latter is very inconvenient, i.e. there is usually a single superclass. Multiple inheritance, on the other hand, is achieved by using complex classes with set operators.

    <owl:Class rdf:ID="Person"/>
    <owl:Class rdf:ID="Country"/>

    <owl:ObjectProperty rdf:ID="livesIn">
      <rdfs:domain rdf:resource="#Person"/>
      <rdfs:range rdf:resource="#Country"/>
    </owl:ObjectProperty>

Figure 4. The domain and the range of an OWL property

    <owl:Class rdf:ID="Insurant">
      <rdfs:subClassOf rdf:resource="#Person"/>
    </owl:Class>

    <owl:Class rdf:ID="Employee">
      <rdfs:subClassOf rdf:resource="#Person"/>
    </owl:Class>

    <Employee rdf:ID="Georg">
      <livesIn rdf:resource="#Germany"/>
    </Employee>

Figure 5. OWL definitions of subclasses

Fig. 5 shows the definitions of classes Employee and Insurant, both of which are subclasses of Person and thus inherit its property livesIn.

OWL enables the creation of complex classes using three set operators: intersection, union and complement.

A. Intersection

Class EmployedInsurant in Fig. 6 is the intersection of classes Employee and Insurant. An instance of EmployedInsurant is both an instance of Employee and of Insurant. Therefore, it also has all properties of both classes (for the reason of simplicity, both classes are shown without properties). Considering taxonomy relations, this means that the intersection class behaves as a subclass (hyponym) of all classes that are enumerated within the tag <owl:intersectionOf>. Consequently, the intersection class has more than one superclass (hypernym). The class Insurant equally treats the complex class EmployedInsurant and its "classic" subclass DependentInsurant (an insurant who is not employed, but uses the basic policy of an employed insurant): both are simply its subclasses. The tag <owl:disjointWith> indicates that no instance of EmployedInsurant can also be an instance of DependentInsurant and vice versa.

    <owl:Class rdf:ID="EmployedInsurant">
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Class rdf:about="#Insurant"/>
        <owl:Class rdf:about="#Employee"/>
      </owl:intersectionOf>
    </owl:Class>

    <owl:Class rdf:ID="DependentInsurant">
      <rdfs:subClassOf rdf:resource="#Insurant"/>
      <owl:disjointWith rdf:resource="#EmployedInsurant"/>
    </owl:Class>

Figure 6. Definition of classes EmployedInsurant and DependentInsurant

The resulting taxonomy is shown in Fig. 7 (the same symbols denote superclasses and hypernyms as in Fig. 2, and likewise for subclasses and hyponyms).

[Figure 7: Person has the subclasses Insurant and Employee; DependentInsurant and EmployedInsurant are subclasses of Insurant, and EmployedInsurant is additionally a subclass of Employee.]

Figure 7. Taxonomy of the intersection class EmployedInsurant and the related classes

B. Union

Instead of defining EmployedInsurant and DependentInsurant as subclasses of Insurant, another possible definition would be to declare them as direct subclasses of Person (in this example, let us forget that EmployedInsurant is also a subclass of Employee). Then we can define Insurant as a union of EmployedInsurant and DependentInsurant (Fig. 8). The union class contains instances of both "basic" classes and does not inherit their properties. Actually, the union class behaves as a superclass of both basic classes, while they can be regarded as subclasses of the union class. The taxonomy graph (Fig. 9) shows that EmployedInsurant and DependentInsurant both have two superclasses: Person and Insurant. Insurant is not directly connected to Person (as it was in Fig. 7).

    <owl:Class rdf:ID="EmployedInsurant">
      <rdfs:subClassOf rdf:resource="#Person"/>
    </owl:Class>

    <owl:Class rdf:ID="DependentInsurant">
      <rdfs:subClassOf rdf:resource="#Person"/>
      <owl:disjointWith rdf:resource="#EmployedInsurant"/>
    </owl:Class>

    <owl:Class rdf:ID="Insurant">
      <owl:unionOf rdf:parseType="Collection">
        <owl:Class rdf:about="#EmployedInsurant"/>
        <owl:Class rdf:about="#DependentInsurant"/>
      </owl:unionOf>
    </owl:Class>

Figure 8. An example of using the unionOf construct in OWL


[Figure 9: EmployedInsurant and DependentInsurant each have two superclasses, Person and the union class Insurant; Insurant is not directly connected to Person.]

Figure 9. Taxonomy of the union class Insurant and the related classes

C. Complement

The complementOf construct selects all instances from the domain of discourse that do not belong to a certain class. Thus, all instances existing in the ontology either belong to ConsumableThing or to NonConsumableThing (Fig. 10, [20]). Such constructions are used rarely (complementOf is much more frequently used as a nested part of an intersection class definition, producing relations that are not useful for this study). The complement relation corresponds to the similarly rare antonymy relation in WordNet.

    <owl:Class rdf:ID="ConsumableThing"/>

    <owl:Class rdf:ID="NonConsumableThing">
      <owl:complementOf rdf:resource="#ConsumableThing"/>
    </owl:Class>

Figure 10. Two complement classes in OWL

D. Properties

Ontology properties link classes that do not form a superclass-subclass hierarchy but are related in some sense. It is the existence of properties that makes ontologies semantically much richer than pure hierarchical structures (like WordNet). Although the property link between a domain and a range class (and vice versa) in ontologies is substantially different from the holonym/meronym relation in WordNet, they are both intransitive and will be similarly interpreted when constructing edge paths across ontology and/or WordNet taxonomy graphs.

VI. SEMANTIC SIMILARITY COMPUTATION

As stated in Section V, we interpret the superclass/subclass edge in the taxonomy graph of an ontology as equal to a hypernym/hyponym edge in the WordNet taxonomy graph. We treat property links (i.e. the edges connecting the vertices belonging to the domain and the range class of the property) as equal to holonymy/meronymy links. Again, we allow paths that contain no more than one property link, which must be at the "beginning" or the "end" of the path.
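The translation of OWL constructs into WordNet-style relations summarized above can be sketched against the Jena ontology API, which the implementation described in Section VII uses. The fragment below is only an illustration of the idea, not the authors' code; the relation names reuse the illustrative Relation vocabulary from Section III, and property links (domain/range edges, treated as holonym/meronym-like links) are omitted for brevity:

    import com.hp.hpl.jena.ontology.OntClass;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.util.iterator.ExtendedIterator;

    /** Sketch: reading taxonomy-like relations out of an OWL ontology with Jena 2.x. */
    final class OwlRelationExtractor {

        static OntModel load(String documentUrl) {
            OntModel model = ModelFactory.createOntologyModel(); // in-memory OWL model
            model.read(documentUrl);
            return model;
        }

        /** Prints the relations of one class, interpreted as described in Sections V and VI. */
        static void printRelations(OntClass cls) {
            // rdfs:subClassOf -> hypernym-like edge
            for (ExtendedIterator it = cls.listSuperClasses(true); it.hasNext(); ) {
                OntClass sup = (OntClass) it.next();
                if (sup.isURIResource()) {
                    System.out.println("hypernym: " + sup.getLocalName());
                }
            }
            // owl:intersectionOf -> every operand behaves as a superclass (hypernym)
            if (cls.isIntersectionClass()) {
                for (ExtendedIterator it = cls.asIntersectionClass().listOperands(); it.hasNext(); ) {
                    System.out.println("hypernym: " + ((OntClass) it.next()).getLocalName());
                }
            }
            // owl:unionOf -> every operand behaves as a subclass (hyponym)
            if (cls.isUnionClass()) {
                for (ExtendedIterator it = cls.asUnionClass().listOperands(); it.hasNext(); ) {
                    System.out.println("hyponym: " + ((OntClass) it.next()).getLocalName());
                }
            }
            // owl:complementOf -> antonym-like relation
            if (cls.isComplementClass()) {
                System.out.println("antonym: " + cls.asComplementClass().getOperand().getLocalName());
            }
        }
    }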

A. Semantic Similarity for Two Words Corresponding to Ontology Classes

Given two target terms, we calculate in parallel the similarity from the taxonomy graph of the OWL ontology and from the WordNet taxonomy. The higher of the two results is returned as the final similarity.

If any of the target terms is an adjective, it appears as an additional vertex in the WordNet noun taxonomy, together with a new edge that depicts the link to the related noun (as stated in Section III). Adverb vertices are similarly connected with the related adjective vertices.

Our semantic similarity function measures similarity between the names of data warehouse structures. Many attribute names actually consist of several words (e.g. horizontal surface, employed insurant, insurance policy). WordNet mostly contains single-word terms, whereas multiple-word terms (like horizontal surface) appear only occasionally, when it is impossible to create a hierarchy without including them or when they explain the sense of a single-word synonym that has many other senses (horizontal surface/level as different to degree/level/stage/point). The reason for this is the primary role of WordNet as a dictionary and a thesaurus. On the other hand, ontologies may cover any concept of human interest, where a name is needed as an unambiguous description of a term, and thus it becomes natural to use complex, multiple-word names (Wine, WhiteWine, WhiteNonFrenchWine [20]). It is a standard in languages that use XML syntax, like OWL, to use composite, multiple-word terms, where the words are concatenated, but the starting letter of each new word is written in upper case (e.g. EmployedInsurant). In this way, the separate words can be easily extracted by a machine.
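A minimal sketch of such an extraction step (illustrative; the original tool's tokenizer may handle further cases, such as digits or all-caps abbreviations) could look like this:

    import java.util.ArrayList;
    import java.util.List;

    /** Splits composite OWL-style names such as "EmployedInsurant" into their words. */
    final class NameTokenizer {

        static List<String> split(String compositeName) {
            List<String> words = new ArrayList<String>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < compositeName.length(); i++) {
                char c = compositeName.charAt(i);
                if (Character.isUpperCase(c) && current.length() > 0) {
                    words.add(current.toString().toLowerCase());
                    current.setLength(0);
                }
                current.append(c);
            }
            if (current.length() > 0) {
                words.add(current.toString().toLowerCase());
            }
            return words; // "EmployedInsurant" -> [employed, insurant]
        }
    }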

Calculating the WordNet similarity of two target multiple-word terms, we decompose them into two sets of words, S1 and S2 [1]. We compute the similarity for each word pair from S1 and S2. For each word we find its maximum (i.e. the highest) similarity with any word in the other set and then produce the average of those maximums over both sets (as defined by [1]):

    sim(S1, S2) = ( Σ_{w1 ∈ S1} max_{w2 ∈ S2} sim(w1, w2) + Σ_{w2 ∈ S2} max_{w1 ∈ S1} sim(w1, w2) ) / ( |S1| + |S2| )
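Read literally, the formula averages the best match of every word against the opposite set. A compact Java rendering (illustrative only; WordSimilarity stands for any word-level similarity function, e.g. the Yang-Powers score from Section IV) follows:

    import java.util.List;

    /** Word-level similarity, e.g. the Yang-Powers score from Section IV. */
    interface WordSimilarity {
        double sim(String w1, String w2);
    }

    /** Set-based similarity of two multiple-word terms, as defined by [1]. */
    final class TermSetSimilarity {

        static double similarity(List<String> s1, List<String> s2, WordSimilarity wordSim) {
            if (s1.isEmpty() || s2.isEmpty()) {
                return 0.0;
            }
            double total = sumOfBestMatches(s1, s2, wordSim) + sumOfBestMatches(s2, s1, wordSim);
            return total / (s1.size() + s2.size());
        }

        /** For every word in 'from', adds its highest similarity with any word in 'to'. */
        private static double sumOfBestMatches(List<String> from, List<String> to, WordSimilarity wordSim) {
            double sum = 0.0;
            for (String w1 : from) {
                double best = 0.0;
                for (String w2 : to) {
                    best = Math.max(best, wordSim.sim(w1, w2));
                }
                sum += best;
            }
            return sum;
        }
    }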

When both target terms exist in the ontology, we first find all existing paths between them. While in WordNet one or mostly two paths can generally exist, ontologies provide a better connection between terms, so several paths are always to be expected. We adopt the Yang-Powers formula to compute the ontology similarity. Again, α_t = 1 for two identical concepts, and α_t = 0.9 for synonyms and complement classes (the latter being equal to antonymy). Also, α_t = 0.85 for paths that contain superclass/subclass edges or property links. The depth factor is β_t = 0.7 for superclasses/subclasses, as well as for property links.

When there is no ontology class whose name corresponds to a target multiple-word term, we do not decompose it into particles, as we do for WordNet.


We only search for terms with another ordering of the component words that form the class name (if there are no classes EmployedInsurant or WhiteWine, we search for classes InsurantEmployed and WineWhite). As we explained in Section I, domain ontologies describe terms in a precise way, with a very specific and restricted meaning. Thus, if a domain ontology does not contain the term, we believe that it was the decision of the ontology creator not to define it as such. In that case, we state that the ontology similarity is zero and calculate only the similarity in WordNet.

Although ontology classes are well linked, it is possible that no valid path exists between two classes that correspond to our target terms. The only path may contain a property link in the middle and hence becomes invalid for our calculus. Such a case mostly happens for very "general" classes, i.e. those at the top of superclass/subclass hierarchies. Examples of that situation are the classes Person and Company, as shown in the ontology taxonomy graph in Fig. 11. Since all user-defined ontology classes are subclasses of the predefined abstract class owl:Thing [20], we connect the top classes (i.e. those without superclasses) of the two separated subtaxonomies to owl:Thing, producing two additional edges that correspond to the subclass/superclass relation. Now, the path between Person and Company consists of three subclass/superclass edges and the similarity can easily be calculated.

[Figure 11: two separated subtaxonomies, Person - Insurant and Organization - Company - InsuranceCompany, linked only by the property isInsuredBy (and its inverse); both top classes are connected to owl:Thing.]

Figure 11. Connecting two separated subtaxonomies using the owl:Thing class

B. Semantic Similarity between a Word Corresponding to an Ontology Class and a WordNet Synset

If the ontology contains only one of the two target terms, and the other can be found in WordNet (the term itself and not only the component words of a multi-word term), we connect the taxonomy of the ontology and the WordNet taxonomy into a single graph and compute the similarity.

To the best of our knowledge, the only existing approach that integrates WordNet and domain ontology snippets is COMA++ [1]. However, the aim of COMA++ is the integration of ontologies, so the input data for the merging process are the ontologies themselves, while WordNet remains the only merger ontology resource. Given ontologies O1 and O2 to be merged, the similarities in COMA++ are always calculated using O1-WordNet-O2 paths. In our case, ontologies are part of the merger dictionary while the input data are warehouse schema metadata. Ontology and WordNet together serve as the means of producing the similarity calculation in order to match other data. We create O-WordNet paths in a more restricted fashion than COMA++.

In Fig. 12 the names of ontology classes are written in bold letters, while WordNet synsets are written in italics. Given an ontology class that corresponds to the target term, we first produce all possible paths in the superclass direction (the direction "up"), as well as those in the subclass direction (the direction "down"), never changing the direction. For all ontology classes whose names correspond to WordNet synsets (e.g. Beverage to beverage/drink, RedWine to red wine), we cross into WordNet and continue the path in the same direction (to all leaf hyponyms in the direction down and to entity in the direction up).

If the other target term is a member of the path, we can use the Yang-Powers method to compute similarity. The path between Wine and beaujolais consists of two edges, producing the similarity value of 0.417 (compared to 3 edges or 0.292 in "pure" WordNet, i.e. without using the ontology).

The constructed paths only capture vertex pairs that are related as ancestors/descendants of each other. Following the widely adopted principle of calculating the similarity of two synsets in WordNet by finding their common ancestor [8][16][21], we produce additional paths in the direction up, starting in the target ontology class. In each following vertex (which is then interpreted as the common ancestor) we can turn in the direction down and continue until we find the other target term or reach a leaf vertex without success. In this way, we can calculate the similarity between WhiteWine and cappuccino (four edges or 0.204, instead of five or 0.143 in "pure" WordNet), Riesling and coffee/java (four edges or 0.204, instead of six or 0.100) or Riesling and beaujolais (again four edges instead of six).
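The quoted values follow directly from the Yang-Powers product with α_t = 0.85 and β_t = 0.7 per edge; a few lines of Java reproduce them (illustrative check only):

    /** Reproduces the similarity values quoted above: 0.85 * 0.7^d for a path of d edges. */
    final class EdgeCountCheck {
        public static void main(String[] args) {
            int[] edgeCounts = {2, 3, 4, 5, 6};
            for (int d : edgeCounts) {
                double sim = 0.85 * Math.pow(0.7, d);
                System.out.printf("%d edges -> %.3f%n", d, sim);
            }
            // Prints approximately: 0.417, 0.292, 0.204, 0.143, 0.100
        }
    }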

The reduced number of edges is a result of using ontologies, where specific domain terms are closer than in WordNet. Again, by joining the taxonomy of an ontology to the WordNet taxonomy we follow our conviction of an ontology being a better and more reliable source for similarity functions than WordNet.

Calculating semantic similarities between words is the starting step of the automated process for matching heterogeneous data warehouse schemas. Structure similarity formulas are used in the next step to complete the comparison of multidimensional structures. The compared structures are finally mapped according to their similarity. The entire solution is produced automatically.

[Figure 12: the ontology classes Beverage, Wine, RedWine and WhiteWine joined with the WordNet synsets beverage/drink, coffee/java, cappuccino, red wine, beaujolais and Riesling; Beverage corresponds to beverage/drink and RedWine to red wine.]

Figure 12. Joining the taxonomy of an ontology with the WordNet taxonomy

VII. SOFTWARE IMPLEMENTATION


The proposed approach was implemented entirely in Java 1.5. We encoded our own implementation of the Yang-Powers similarity method using the Java extension for WordNet, JWordNet (version 1.3) [6], corresponding to WordNet 2.0. The Jena API (version 2.4) [7] was used to manipulate OWL ontologies. The software serves as the base of our data warehouse schema matching algorithm.
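As described in Section VI.A, the tool computes the ontology-based and the WordNet-based similarity independently and keeps the higher value. A schematic outline of that top-level step is given below (hypothetical names; the real classes of the implementation are not published in the paper; the illustrative WordSimilarity interface from Section VI is reused):

    /** Hypothetical top-level combination of the two similarity sources (Section VI.A). */
    final class CombinedSimilarity {

        private final WordSimilarity wordNetSimilarity;  // e.g. Yang-Powers over WordNet via JWordNet
        private final WordSimilarity ontologySimilarity; // e.g. Yang-Powers over the OWL taxonomy via Jena

        CombinedSimilarity(WordSimilarity wordNetSimilarity, WordSimilarity ontologySimilarity) {
            this.wordNetSimilarity = wordNetSimilarity;
            this.ontologySimilarity = ontologySimilarity;
        }

        /** Returns the higher of the WordNet-based and ontology-based similarities. */
        double similarity(String term1, String term2) {
            double fromWordNet = wordNetSimilarity.sim(term1, term2);
            double fromOntology = ontologySimilarity.sim(term1, term2);
            return Math.max(fromWordNet, fromOntology);
        }
    }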

VIII. CONCLUSION

Automating the matching process for heterogeneous data warehouse schemas significantly shortens the warehouse integration process as a whole. The matching process determines which multidimensional components of a data warehouse (facts, measures, dimensions, aggregation levels and attributes) are mutually equivalent. The equivalence of multidimensional components is expressed by means of a probability function called a similarity function, which combines the similarity of their names and their substructures.

This paper focuses on name similarity calculation, presenting a combined edge-counting-based semantic similarity technique, which integrates domain ontologies, written in OWL, with the always-reliable WordNet into a single taxonomy, used as the base for similarity calculations. When the similarity of terms from the domain described by an ontology is calculated, the ontology provides a source designed by domain experts, better and more trustworthy than WordNet and with more closely related terms. In our opinion, the presented technique could be used to automate matching processes outside the data warehouse domain, whenever name similarity calculation is needed.

We believe that in domain ontologies unnecessary and formal relations are avoided, thus resulting in a smaller number of edges and a higher similarity in comparison with WordNet. On the other hand, when the similarity obtained from the ontology is smaller than the WordNet similarity, we believe that the ontology is focused on other relationships and take the higher value (the one obtained from WordNet) as the final result.

REFERENCES

[1] D. Aumueller, H.-H. Do, S. Massmann, and E. Rahm: "Schema and Ontology Matching with COMA++," Proc. SIGMOD 2005, ACM Press, New York, NY, USA, 2005, pp. 906-908
[2] S. Berger and M. Schrefl: "Analysing Multi-dimensional Data across Autonomous Data Warehouses," Lecture Notes in Computer Science, vol. 4081, 2006, pp. 120-133 [Proc. Int. Conf. Data Warehousing and Knowledge Discovery]
[3] M. Banek, A.M. Tjoa, and N. Stolba: "Integrating Different Grain Levels in a Medical Data Warehouse Federation," Lecture Notes in Computer Science, vol. 4081, 2006, pp. 185-194 [Proc. Int. Conf. Data Warehousing and Knowledge Discovery]
[4] L. Cabibbo and R. Torlone: "Integrating Heterogeneous Multidimensional Databases," Proc. Int. Conf. Scientific and Statistical Database Management (SSDBM 2005), IEEE Computer Society Press, 2005, pp. 205-214
[5] A.H. Doan and A.Y. Halevy: "Semantic Integration Research in the Database Community: A Brief Survey," AI Magazine, vol. 26 (1), special issue on semantic integration, March 2005, pp. 83-94
[6] George Washington University: "JWordNet. A Java Implementation of WordNet," (last visited January 15, 2007), http://jwn.sourceforge.net
[7] Hewlett Packard Laboratories: "Jena 2. A Semantic Web Framework," version 2.4 (last visited January 15, 2007), http://jena.sourceforge.net
[8] J. Jiang and D. Conrath: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy," Proc. Int. Conf. Research on Computational Linguistics (ROCLING X), 1997
[9] W. Kim and J. Seo: "Classifying Schematic and Data Heterogeneity in Multidatabase Systems," IEEE Computer, vol. 24 (12), 1991, pp. 12-18
[10] Y. Li, Z. Bandar, and D. McLean: "An Approach for Measuring Semantic Similarity Between Words Using Multiple Information Sources," IEEE Trans. Knowledge and Data Engineering, vol. 15 (4), 2003, pp. 871-882
[11] J. Madhavan, P.A. Bernstein, and E. Rahm: "Generic Schema Matching with Cupid," Proc. 27th Int. Conf. Very Large Data Bases (VLDB 2001), Morgan Kaufmann, 2001, pp. 49-58
[12] S. Melnik, H. Garcia-Molina, and E. Rahm: "Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching," Proc. Int. Conf. Data Engineering (ICDE 2002), IEEE Computer Society Press, 2002, pp. 117-128
[13] G.A. Miller: "Nouns in WordNet: A Lexical Inheritance System," Int. J. Lexicography, vol. 3 (4), 1990, pp. 245-264 [revised August 1993]
[14] Princeton University Cognitive Science Laboratory: "WordNet, a lexical database of the English language," (as of January 2007), http://wordnet.princeton.edu
[15] E. Rahm and P.A. Bernstein: "A Survey of Approaches to Automatic Schema Matching," VLDB Journal, vol. 10 (4), 2001, pp. 334-350
[16] P. Resnik: "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language," J. Artificial Intelligence Research, vol. 11, 1999, pp. 95-130
[17] M.A. Rodriguez and M.J. Egenhofer: "Determining Semantic Similarity among Entity Classes from Different Ontologies," IEEE Trans. Knowledge and Data Engineering, vol. 15 (2), 2003, pp. 442-456
[18] A.P. Sheth and J.A. Larson: "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases," ACM Computing Surveys, vol. 22 (3), ACM Press, New York, NY, USA, 1990, pp. 183-236
[19] WordNet: An Electronic Lexical Database, C. Fellbaum, Ed., MIT Press, Cambridge, MA, USA, 1998
[20] World Wide Web Consortium (W3C): "OWL Web Ontology Language Guide," W3C Recommendation (as of 10 February 2004), http://www.w3.org/TR/2004/REC-owl-guide-20040210/
[21] D. Yang and D.M.W. Powers: "Measuring Semantic Similarity in the Taxonomy of WordNet," Proc. Australasian Computer Science Conference (ACSC 2005), CRPIT, vol. 38, 2005, pp. 315-322
