sdd-matcher: a semantic-driven data matching framework · 2012-10-30 · keywords: semantic...




SDD-Matcher: a Semantic-Driven Data Matching Framework

Yan Tang (a) and Gang Zhao (b)

(a) VUB STARLab, 10G731, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Elsene, Brussels, Belgium; (b) Interlartes, BP8 Waterloo, 1410 Belgium

Abstract. A generic semantic-driven data matching framework (SDD-Matcher) has been designed and developed for matching data objects across organizations. It contains matching algorithms at three levels: string, lexical and graph; the graph level is also called the ontological or conceptual level. These matching algorithms are the basic building blocks of an SDD-Matcher matching strategy, each of which contains at least a graph matching algorithm. In principle, we can freely choose the matching algorithms when composing a matching strategy. In this article, we focus on a full engineering cycle of SDD-Matcher, which includes five phases – framework design, use case design, implementation, evaluation and strategy enhancement. The Lexon Matching Strategy (LexMA) and the Controlled, Fully Automated, Ontology-based Matching Strategy (C-FOAM) are revised and refined in SDD-Matcher. We also illustrate a new SDD-Matcher strategy – the Ontology Graph Measuring Strategy (OntoGraM) – in this paper. Semantic decision tables are used as a means to easily configure the matching strategies. In the phases of use case design, evaluation and strategy enhancement, we use real case data from a large enterprise in the fields of human resource management (HRM) and eLearning/training.

Keywords: semantic decision table (SDT), domain ontology, matching, ontology-based data matching

1. Introduction

Data matching is defined as a process of bringing together data from different and heterogeneous data sources and comparing them in order to find out whether they represent the same or similar real-world objects [13]. It is a key problem in data management processes such as data or schema integration, querying across domains, data cleansing, data mining and fuzzy searching. The authors in [13], [16] provided surveys on data matching and marked the importance of data matching in the mentioned fields.

The data sources concerned in data matching can be (local or remote) databases, web data (e.g., web pages) or content data in natural language (e.g., textual documents). Although the data source types are different, the underlying principle is rather similar: data matching happens at the level of either schemas (or structures) or data instances (or values). The approaches discussed in [16] focus on performing data value analyses; the ones shown in [13] emphasize structure analyses. The scope of this paper is essentially structure analyses.

Our problem is similar to schema matching, which is defined as a process of taking two schemas as input and producing a mapping between those elements of the two schemas that correspond semantically to each other [27], [29]. Rahm and Bernstein provided a survey of approaches to automatic schema matching in [34]. A schema matching technique, no matter whether it is automatic, semi-automatic or manual, can be adapted for ontology matching.

Ontology matching is a process of finding correspondences between semantically related entities of ontologies [37]. An ontology is a specification of a conceptualization [21]. A conceptualization is the set of intended models, within which a set of logical axioms is designed to account for the intended meaning of a vocabulary [19].

Although we can exploit schema matching approaches for ontology matching, we need to keep in mind that a database schema, by default, is not an ontology model. As discussed in [28], the conceptualisation and the vocabulary of a database schema are


not intended a priori to be shared by other applications. An ontology, by definition, contains knowledge shared within a domain and serves as a knowledge base for multiple applications. The level of shareability is the main difference.

There is a considerable amount of existing work on ontology matching and integration1 [37], e.g., S-Match [20], Optima [14] and AgreementMaker [10]. As will be explained in the next section, there is very little related work on semantic-driven data matching (or ontology-based data matching). The problem of semantic-driven data matching is not exactly the problem of ontology matching. The goal of ontology matching is to solve the problem of semantic inconsistency while integrating or merging two or more ontologies. Our goal (and the scope of this paper) is to find the similarities between two data sets, each of which corresponds to one part of the ontology. There is only one ontology in this particular problem.

SDD-Matcher is designed based on an ontology-based data matching framework (ODMF), which was initially designed by VUB STARLab in 2007 during the EC Prolix project2. It has been gradually enriched and tested in the EC 3DAH project3, the EC TAS3 project4 and the ITEA 2 DYSE project5.

In Prolix, we have used ODMF to calculate competency gaps between employees' profiles and learning modules (e.g., learning materials and courses) in order to find the most suitable learning modules for the employees [43], [45]. The ODMF use case in 3DAH focused on how to evaluate medical students and provide them with personalized suggestions of online learning materials [7]. In TAS3, we have developed a tool-supported method to provide semantic support to process modelers during the design phase of a secure business process model; ODMF was used to discover the user's design intent from a dedicated knowledge base [8]. In DYSE, we have embedded ODMF in a component discoverer and recommender system, which assists amateurs who want to create their own personalized smart environment (also called Do-It-Yourself activities). ODMF has been used to find the most suitable hardware (e.g., sensors and actuators) and software (e.g., pieces of code and RSS feeds) components according to the needs of end users [46]. These use cases cover

1 http://www.ontologymatching.org
2 http://www.prolixproject.org
3 http://3dah.miralab.ch
4 http://www.tas3.eu
5 http://www.dyse.org (last retrieved: July 13th, 2011)

the domains of eLearning, 3D anatomy, human resource management, security and ubiquitous computing. ODMF is not restricted to a particular domain.

Although a few of the SDD-Matcher matching strategies have been discussed in [7], [8], [43], [44], [45], [46], we feel it necessary to provide an integration of SDD-Matcher that covers a full engineering cycle containing five phases – framework design, use case design, implementation, evaluation, and strategy enhancement. This is one of the main contributions of this article.

The other contributions are as follows. Two main ODMF strategies, namely the Lexon Matching Strategy (LexMA) and the Controlled, Fully Automated, Ontology-based Matching Strategy (C-FOAM), are revised and refined in SDD-Matcher. The Ontology Graph Measuring Strategy (OntoGraM) – a new matching strategy – will be reported. We use semantic decision tables (SDT, [42]) as a means to configure those strategies.

We have developed an evaluation method for justifying the SDD-Matcher matching strategies. In order to get the best matching result, we have designed and implemented an algorithm called the Semantic Decision Table Self-Configuring Algorithm (SDT-SCA), which is used in the phase of strategy enhancement.

This article is organized as follows. Sec. 2 covers the related work. We describe the engineering cycle and the framework overview of SDD-Matcher in Sec. 3. A generic use case and a detailed use case from British Telecom (BT) are illustrated in Sec. 4. We present the SDD-Matcher matching strategies in Sec. 5. The evaluation method is described in Sec. 6. We show the SDD-Matcher implementation, the evaluation results and SDT-SCA in Sec. 7. In Sec. 8, we conclude with discussions and future work.

2. Related Work

Part of the related work has already been discussed in the previous section. As mentioned, the problem of semantic-driven data matching is not exactly ontology matching. The problem of ontology matching occurs when more than one ontology is involved and the ontologies need to be integrated or merged; hence the approaches to this problem mainly deal with checking and ensuring consistency between several ontologies. The problem of semantic-driven data matching is a data matching problem: there is only one ontology, and the solutions are to find the connections


and measurements between two data sets using an ontology.

Our problem is similar to schema matching [16]. In particular, we want to compare two sub-schemas in a large schema. These schemas are not simple database schemas; instead, they are ontology models, meaning that they are shared and used by multiple applications within a certain domain.

There exist few approaches to semantic-driven data matching that make good use of an ontology. TODE [39] uses ontologies (in particular, SUMO6 and WordNet [17]) to categorize web pages. It only deals with matching with very limited structure information; in particular, only the relations of "is-a" (hypernymy) and "part-of" (meronymy) are used in TODE. Oracle RDBMS [11] embeds ontology-based semantic matching components in its relational database management system. Oracle RDBMS provides four ontology-related operators – ONT_RELATED, ONT_EXPAND, ONT_PATH and ONT_DISTANCE. They are implemented based on the OWL relations, which are richer than the ones in TODE. However, the matching mechanism used by these database queries is one-to-one matching. The ontology-based resource matching approach illustrated in [47] shows how to share resources in a Grid environment. It uses domain ontologies to describe domain terminology and request queries; the matching rules are embedded in a policy ontology. Similar to many existing ontology-based searching approaches, this work is based on ontological queries, which support one-to-one exact matching. Compared to their work, SDD-Matcher uses all kinds of ontological relations to process matching, and the matching is not only one-to-one but also one-to-many.

We consider an ontology as a connected graph or network. Our problem then becomes how to find the connections between two sub-graphs. The ideas in [2], [18], [25], [30], [49] represent the related work. Barrett et al. [2] illustrated an algorithm for finding the shortest path between two nodes in a labelled and weighted network. The authors in [25] tried to discover data based on feature distances. The authors in [49] focused on how to find the data objects or web pages that belong to the same or similar contexts. Although their work deals with sub-graphs in one graph, it focuses more on how to draw a boundary between searching spaces, especially as discussed in [25], [49]. Compared to their work, we care about

6 http://www.ontologyportal.org (last retrieved: July 18th, 2011)

neither how the boundary between two sub-graphs is drawn, nor how they graphically overlap. Instead, we need to know how each element (arcs and vertices) of one sub-graph is linked with the elements of the other. In addition, our graph is specific: it is an ontology, hence the arcs and vertices are meaningful and semantically rich.

Ferrara et al. [18] presented a survey on data linking for the Semantic Web. The task is to determine whether two object descriptions refer to the same real-world object, or to discover a relationship between the two respective real-world objects. Another interesting piece of related work that assumes there is only one ontology for the two datasets to be linked is the approach of managing open, linked data7 [30]. We try to find the commonality between two data sets that correspond to two different real-world objects; their relationship is multi-dimensional instead of one-dimensional.

3. Engineering Cycle and Framework Overview

The SDD-Matcher engineering cycle contains five phases – framework design, use case design, implementation, evaluation and strategy enhancement – as illustrated in Fig. 1.

Fig. 1 SDD-Matcher Engineering Cycle

The arrows in Fig. 1 indicate the execution flow. The initial phase is the phase of framework design. After the phase of strategy enhancement, the flow reiterates and the framework is updated. In the rest of this paper, we will explain the method, design and/or results for each phase.

In the phase of framework design, we have surveyed the approaches to semantic-driven data

7 http://linkeddata.org/



matching. As a result, a generic design of SDD-Matcher is illustrated in Fig. 2. The input is two strings pointing to two different data objects in the real world; the output is a similarity score. The solid arrow-tipped bars indicate the execution flow of an SDD-Matcher matching strategy, the entry points of which are illustrated with double-line arrows. The dotted arrow-tipped bars explain how the final scores are reached.

Fig. 2 A generic design of SDD-Matcher

As shown in Fig. 2, an SDD-Matcher strategy contains algorithms at three levels: string (or alphabetical, morphological), lexical and graph. They are the basic components of an SDD-Matcher strategy. Each strategy contains at least one and at most three algorithms. When a strategy contains one algorithm, this algorithm must be at the graph matching level. When it contains three algorithms, each algorithm must correspond to a different level. It is not recommended to combine algorithms from the same level.

As visualized in Fig. 2, the entry point can be a matching at the string, lexical or graph level. Suppose we have three algorithms a1, a2 and a3, where a1 is at the string level, a2 at the lexical level and a3 at the graph level. The following are the only possible combinations for an SDD-Matcher matching strategy: ⟨a1, a2, a3⟩, ⟨a1, a3⟩, ⟨a2, a3⟩ and ⟨a3⟩.

The above combinations also show that a matching process can start with matching at the string level, followed by matching at the lexical and graph levels. It can also start with matching at the lexical level followed by matching at the graph level. Or, it starts directly with graph matching.

Note that the goal of the matching at the string and lexical levels is to form proper sub-graphs. If an SDD-Matcher matching strategy is ⟨a1, a2, a3⟩, then its execution flow is as follows. First, a1 is used to check possible typos or similar labels for the two input strings. Then, it uses a2 to find lexically connected labels. If it cannot find any, then it uses a1 again in order to find 'similar' lexically connected labels. Those labels, together with the ones generated in the first step, correspond to two sets of concepts in an ontology. Two sub-graphs are then formed.

In the last step, it calls a3 to measure the connectivity between these two sub-graphs and yields a similarity score. Note that each step gives a score. The ones generated by a1 and a2 need to be reworked with a penalty, or they need to pass a threshold.
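The execution flow described above can be sketched as a small pipeline. This is a minimal illustration, not the SDD-Matcher API: the function names, the expander interface and the Jaccard-style stand-in for the graph-level score are our own assumptions.

```python
def run_strategy(expanders, graph_alg, s1, s2):
    """Sketch of one strategy run: the string/lexical steps expand each
    input label into a set of related labels (which would correspond to
    concepts, i.e. sub-graphs, in the ontology); the mandatory
    graph-level step then scores the two label sets."""
    labels1, labels2 = {s1}, {s2}
    for expand in expanders:  # zero, one or two pre-graph steps
        labels1 |= {x for lbl in labels1 for x in expand(lbl)}
        labels2 |= {x for lbl in labels2 for x in expand(lbl)}
    return graph_alg(labels1, labels2)  # similarity score in [0, 1]

# Toy stand-ins: a "string level" normaliser and a Jaccard overlap
# playing the role of the graph-level connectivity measure.
def lowercase(label):
    return [label.lower()]

def jaccard(set1, set2):
    return len(set1 & set2) / len(set1 | set2)

score = run_strategy([lowercase], jaccard, "Heart", "heart")
```

Here score is 0.5: "Heart" expands to {"Heart", "heart"}, which overlaps the second label set in one of two labels.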

Fig. 2 will be further studied with three matching strategies in Sec. 5.

4. A Generic Case and a BT Use Case

In the phase of use case design (see Fig. 1), a generic use case and a specific use case are created. They are used to illustrate and test the SDD-Matcher strategies in the phase of implementation, and to evaluate these matching strategies in the phase of evaluation.

4.1. A Generic Use Case

Fig. 3 shows a generic use case of SDD-Matcher. Two data objects from two different sub-domains, enterprises or departments have data descriptions in natural language. These descriptions can be in a well-defined structure or free text; the annotation server takes them as input and generates two annotation sets as output. These two annotation sets correspond to two sub-graphs in the ontology graph. Based on the ontology and the annotation sets, SDD-Matcher calculates a similarity score, which is a float number in the range [0, 1].

Fig. 3 An SDD-Matcher Generic Case


The ontology server supports not only querying and reasoning over ontologies, but also ontology versioning.

Our ontology consists of two separate parts: a set of binary fact types (also called "lexons") and a set of commitments. A lexon is a tuple ⟨γ, t1, r1, r2, t2⟩, where t1 and t2 represent two concepts identified by the context identifier γ; r1 and r2 are the two roles that these two concepts can possibly play; r2 is the inverse role of r1. We often use a document label or URI as the context identifier to point to a resource where t1 and t2 are originally defined, and where r1 and r2 become meaningful.

Fig. 4 shows a lexon example, which presents the fact that "a teacher teaches a student, and a student is taught by a teacher" in the context identified by "http://en.wikipedia.org/wiki/Teacher". "Teacher" and "Student" are the two terms that point to two concepts; "teaches" and "is taught by" are the two roles.

Fig. 4 A lexon example

In this example, r2 ("is taught by") can easily be deduced from r1 ("teaches") using a linguistic thesaurus. In specific cases, it is not easy to deduce r2 from r1; that is the reason why we sometimes need to keep r2.
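The lexon tuple ⟨γ, t1, r1, r2, t2⟩ can be rendered as a small value type. This is our own sketch (the class and field names are not from the SDD-Matcher code base), populated with the Fig. 4 example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexon:
    """A binary fact type <context, term1, role, co_role, term2>,
    where co_role is the inverse of role, and context (a document
    label or URI) points to the resource where the two terms and
    two roles are meaningful."""
    context: str
    term1: str
    role: str
    co_role: str
    term2: str

# The lexon of Fig. 4: "a teacher teaches a student, and a student
# is taught by a teacher".
teaches = Lexon(
    context="http://en.wikipedia.org/wiki/Teacher",
    term1="Teacher",
    role="teaches",
    co_role="is taught by",
    term2="Student",
)
```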

A commitment (or "ontological commitment") is an agreement made by a community (also called a "group of interests"). It often contains constraints that can be expressed in some language, such as Description Logic (DL, [1]). An example of a constraint is a mandatory constraint, e.g., "each teacher teaches at least one student", which can be written in DL as Teacher ⊑ ≥1 teaches.Student.

A commitment can also be a query, with which we select data from an ontology-based database. For instance, we can use a SPARQL8 query for an RDF9 triple store.

Note that it does not matter in which format our domain ontology will be stored and published in the end. It can be in the Web Ontology Language (OWL10), RDF Schema (RDF(S)) or the Semantic Web Rule Language (SWRL11). Hence, we will not focus on those

8 Simple Protocol and Resource Description Framework (RDF) Query Language: http://www.w3.org/TR/rdf-sparql-query/
9 http://www.w3.org/TR/rdf-primer/
10 http://www.w3.org/TR/owl-ref/
11 http://www.w3.org/Submission/SWRL/

issues. The important consideration here is the content/models in the domain ontology.

The generic use case shown in this subsection can be extended for particular domains. In the next subsection, we will illustrate a use case from the Amsterdam branch of British Telecom (BT) in the fields of human resource management (HRM) and eLearning/training.

4.2. A BT Use Case

This use case has been worked out during the EC Prolix project.

At BT, the human resource management (HRM) department uses textual descriptions of "company values" (such as "trustworthy", "straightforward" and "heart") to rate its employees. The results are recorded in an assessment form. The training department uses competency notations in terms of skills and abilities to categorize learning materials. A learning material has an ID, e.g., "course ITIL 1". Each department uses its own supporting tools and terms. In order to improve the collaboration between them and enhance the interoperability of the applications across departments, we have developed an HRM ontology.

A use case scenario is as follows. Inge works for BT and is rated periodically. One day, she received her assessment form from the HRM department. She realized that her "heart" had received a low rating and wanted to train herself with the learning materials that could help to improve her "heart". She fed "heart" as input to SDD-Matcher, which generated a sorted list of learning materials. Inge took this list and downloaded the recommended materials from the website of the training department. In the end, Inge trained herself and her "heart" improved.

What SDD-Matcher does in this use case scenario is as follows. Before the first usage, SDD-Matcher needs to execute an annotation task, which is a pre-processing task. We can use the algorithm illustrated in [41] to automate this task, or do it manually or semi-automatically. As it is out of the scope of this paper, we refer to [41] and its related work for further reading. Note that if a knowledge engineer cannot find a properly defined concept during the annotation task, then he needs to define this new concept by executing an ontology creation/versioning method.

In real time, SDD-Matcher compares the annotation set of "heart" with the annotation set of each learning material. Each comparison gives a score. After all the comparison tasks are successfully executed, SDD-Matcher provides a sorted list of learning materials.

We will further enrich this use case with alternative inputs in Sec. 5.3.

5. SDD-Matcher Matching Strategies

In the phase of implementation (see Fig. 1), we need to design the SDD-Matcher matching strategies based on the use cases that are created in the phase of use case design.

The SDD-Matcher string matching algorithms include the ones from the SecondString project [9], which contains implementations of UnsmoothedJS [23], [24], [51], JaroWinklerTFIDF [23], [24], [51] and TFIDF (term frequency–inverse document frequency, [38]).

The lexical matching algorithms are based on the lexical information from WordNet [17] and a domain dictionary. In WordNet, synonyms are grouped into unordered sets (Synsets), which are the smallest entities in the linguistic database. Besides synonymy, the most frequently used relations are 1) the is-a subsumption relationship (called hypernymy and hyponymy in WordNet; e.g., "jet black" is a hyponym of "black" and "achromatic color" is a hypernym of "black"); 2) the type-instance relationship (e.g., "Mao Zedong" is an instance of "communist"); 3) the part-whole meronym relationship (e.g., "neck" is a part of "body"); 4) troponymy in verb Synsets, which expresses a different manner, precision or volume of a verb; for instance, "yawl" is a troponym of "shout".
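For illustration, the is-a lookups described above can be mimicked over a hand-made map. The entries below mirror the "black" example from the text; they are stand-ins, not real WordNet data or the API of any WordNet library.

```python
# A tiny stand-in for WordNet's hypernym (is-a) relation, keyed by
# word; real lookups would go through the WordNet linguistic database.
HYPERNYM = {
    "jet black": "black",          # "jet black" is a hyponym of "black"
    "black": "achromatic color",   # "achromatic color" is its hypernym
}

def hypernym_chain(word):
    """Follow is-a links upward until the map runs out."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain
```

For example, hypernym_chain("jet black") walks up to ["black", "achromatic color"].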

As discussed, we use lexons to express concepts and relations. t1 and t2 are often two nouns, and r1 and r2 are often two verbs. It is rare to see adverbs in such a lexon; therefore, the WordNet relations concerning adverbs are currently not used by SDD-Matcher. Note that the antonym relation for nouns or verbs is not used either. We consider two terms (or roles) separated when they do not connect with each other, and antonymy is a specific case of this kind of separation.

Note that it is possible to use a user-specific dictionary that contains data dictionaries and equivalences. This is simple from a computational point of view, yet important from a business point of view.

Graph theory has been studied as a classic research area for decades. West [50] presented a survey of its algorithms. The current SDD-Matcher takes commonly used graph matching algorithms, such as bipartite matching [50] and Dijkstra's shortest path [12]. As will be discussed in the following sections, the matching algorithm in LexMA is a simplified version of bipartite matching. The GRASIM strategy, which we have studied in [43], is based on the calculation of Dijkstra's shortest path.

In the following subsections, three matching strategies from SDD-Matcher will be illustrated.

5.1. Lexon Matching Strategy (LexMA)

LexMA is a matching strategy that only contains a graph matching algorithm. It is an enhanced version of LeMaSt [44].

Suppose we have two graphs: G1, which contains n1 lexons, and G2, which contains n2 lexons. We use "·" to indicate the source of a lexon; for example, G1·l1 is a lexon from G1.

We identify four kinds of relations between two lexons. The first one is equivalence. We consider l1 (from G1) and l2 (from G2) equivalent, denoted as "l1 = l2", if it can be deduced using the following formula:

(l1·t1 = l2·t1) ∩ (l1·r1 = l2·r1) ∩ (l1·r2 = l2·r2) ∩ (l1·t2 = l2·t2) ∪ (l1·t1 = l2·t2) ∩ (l1·r1 = l2·r2) ∩ (l1·r2 = l2·r1) ∩ (l1·t2 = l2·t1) → l1 = l2.

The second one is inequality. We consider l1 and l2 unequal, denoted as "l1 ≠ l2", if it can be deduced using the following formula:

(l1·t1 ≠ l2·t1) ∩ (l1·t1 ≠ l2·t2) ∩ (l1·t2 ≠ l2·t1) ∩ (l1·t2 ≠ l2·t2) ∩ (l1·r1 ≠ l2·r1) ∩ (l1·r1 ≠ l2·r2) ∩ (l1·r2 ≠ l2·r1) ∩ (l1·r2 ≠ l2·r2) → l1 ≠ l2.

The third one records the situation of "same vertices, different edges" (very similar). We say that l1 is very similar to l2, denoted as l1 ≅ l2, if it can be deduced using the following formula:

(l1·t1 = l2·t1) ∩ (l1·t2 = l2·t2) ∩ (l1·r1 ≠ l2·r1) ∩ (l1·r2 ≠ l2·r2) ∪ (l1·t1 = l2·t2) ∩ (l1·t2 = l2·t1) ∩ (l1·r1 ≠ l2·r2) ∩ (l1·r2 ≠ l2·r1) → l1 ≅ l2.

The last relation describes the situation of "connected with one vertex" (connected), which we denote as l1 ∼ l2. The formula is:

¬(l1 = l2) ∩ ¬(l1 ≠ l2) ∩ [(l1·t1 = l2·t1) ∪ (l1·t1 = l2·t2) ∪ (l1·t2 = l2·t1) ∪ (l1·t2 = l2·t2)] → l1 ∼ l2.
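The four relations can be approximated in code over simplified lexons (context identifiers dropped, a lexon written as a (t1, r1, r2, t2) tuple). This follows our reading of the formulas and is a sketch, not the reference implementation.

```python
def equivalent(l1, l2):
    """l1 = l2: the same fact, read in the same or reverse direction."""
    t1, r1, r2, t2 = l1
    u1, s1, s2, u2 = l2
    return ((t1 == u1 and r1 == s1 and r2 == s2 and t2 == u2)
            or (t1 == u2 and r1 == s2 and r2 == s1 and t2 == u1))

def unequal(l1, l2):
    """l1 != l2: no term and no role in common at all."""
    t1, r1, r2, t2 = l1
    u1, s1, s2, u2 = l2
    return not ({t1, t2} & {u1, u2}) and not ({r1, r2} & {s1, s2})

def very_similar(l1, l2):
    """l1 ~= l2: same pair of vertices (terms), different edges (roles)."""
    t1, r1, r2, t2 = l1
    u1, s1, s2, u2 = l2
    return ((t1 == u1 and t2 == u2 and r1 != s1 and r2 != s2)
            or (t1 == u2 and t2 == u1 and r1 != s2 and r2 != s1))

def connected(l1, l2):
    """l1 ~ l2: neither equivalent nor unequal, but sharing a vertex."""
    t1, _, _, t2 = l1
    u1, _, _, u2 = l2
    return (not equivalent(l1, l2) and not unequal(l1, l2)
            and bool({t1, t2} & {u1, u2}))
```

For instance, ("Teacher", "teaches", "is taught by", "Student") is equivalent to its reversed reading ("Student", "is taught by", "teaches", "Teacher").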

The similarity score of comparing G1 and G2 is calculated using the following equation:


S = w1 × m_{1=2}/(n1 + n2 − m_{1=2}) + w2 × m_{1≅2}/(n1 + n2 − m_{1≅2}) + w3 × m_{1∼2}/n1 + w4 × m_{2∼1}/n2

The parameters w1, w2, w3 and w4 are real-number weights (w1, w2, w3, w4 ∈ ℝ), where 0 ≤ w1, w2, w3, w4 ≤ 1 and w1 + w2 + w3 + w4 = 1. We use n1 to indicate the total number of lexons in G1 and n2 to indicate the total number of lexons in G2. m_{1=2} denotes the size of G_{1=2}, the sub-graph of G1 whose lexons each have an equivalent lexon in G2 (∀G_{1=2}·l_i ∃G2·l_j such that G_{1=2}·l_i = G2·l_j, 1 ≤ j ≤ n2, G_{1=2} ⊆ G1). m_{1≅2} indicates the size of G_{1≅2}, defined analogously for the very-similar relation (∀G_{1≅2}·l_i ∃G2·l_j such that G_{1≅2}·l_i ≅ G2·l_j, 1 ≤ j ≤ n2, G_{1≅2} ⊆ G1). Similarly, we use m_{1∼2} to denote the size of G_{1∼2}, the sub-graph of G1 whose lexons are connected to some lexon of G2, and m_{2∼1} for the size of the corresponding sub-graph of G2.
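One possible reading of the LexMA score is sketched below; the normalisation of the connected terms in particular is our assumption, so treat this as illustrative rather than as the published formula.

```python
def lexma_score(n1, n2, m_eq, m_sim, m_c12, m_c21, w):
    """LexMA-style similarity between G1 (n1 lexons) and G2 (n2 lexons).

    m_eq  -- lexons of G1 with an equivalent (=) lexon in G2
    m_sim -- lexons of G1 with a very similar lexon in G2
    m_c12 -- lexons of G1 connected (~) to some lexon of G2
    m_c21 -- lexons of G2 connected (~) to some lexon of G1
    w     -- the weights (w1, w2, w3, w4), summing to 1
    """
    w1, w2, w3, w4 = w
    return (w1 * m_eq / (n1 + n2 - m_eq)
            + w2 * m_sim / (n1 + n2 - m_sim)
            + w3 * m_c12 / n1
            + w4 * m_c21 / n2)

# The "heart" vs "course ITIL 1" figures with balanced weights:
s = lexma_score(4, 6, 1, 2, 1, 1, (0.25, 0.25, 0.25, 0.25))
```

With these figures this sketch yields roughly 0.19, slightly below the ≈0.2014 reported for the worked example in Sec. 5.1, so the paper's exact normalisation of the connected terms likely differs; the sketch is meant to convey the structure of the score, not to reproduce the published number.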

We use a semantic decision table (SDT, [45]) to configure w1, w2, w3 and w4. An SDT is an extension of a decision table: it is a decision table properly annotated with concepts from the domain ontology.

An SDT contains a decision table, a set of SDT lexons and a set of SDT commitments. A decision table consists of a set of conditions, actions and decision rules. A condition is a combination of a condition stub and a condition entry. An action is a combination of an action stub and an action entry. A decision rule is a decision column in the table.

An example is illustrated in Table 1. "Profile" is a condition stub and "Optimistic" is a condition entry; the pair ⟨Profile, Optimistic⟩ is a condition. "w1" is an action stub and "1" is an action entry; the pair ⟨w1, 1⟩ is an action. Columns 1~8 are the decision rules.

Table 1 An SDT of configuring the weights for LexMA

Condition    1     2     3     4     5     6     7     8
Profile      Optimistic  Pessimistic  Balanced    Customized
G1 ≥ G2?     Y     N     Y     N     Y     N     Y     N

Action
w1           1     0.1   0     0     0.25  0.25  0.8   0.7
w2           0     0.1   0     0     0.25  0.25  0.2   0.2
w3           0     0.4   0.5   0.5   0.25  0.25  0     0.1
w4           0     0.4   0.5   0.5   0.25  0.25  0     0.1

SDT Commitments
1  Weight ⊑ ∀hasValue.float[≥0 ∧ ≤1]
2  Weight ≡ {w1, w2, w3, w4}
3  w1 + w2 + w3 + w4 = 1

Table 1 decides the weights based on the profile type and on whether $G_1$ is larger than $G_2$ or not. We consider $G_1 \ge G_2$ iff $n_1 \ge n_2$.
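Read as a lookup table, the decision rules of Table 1 map a profile and the $G_1 \ge G_2$ condition to a weight vector, while the SDT commitments act as integrity checks. The following is a minimal sketch assuming the Optimistic, Pessimistic and Balanced columns as we read them; the Customized profile is omitted, and all function names are illustrative:

```python
# Sketch: configuring the LexMA weights from the SDT of Table 1.
# The weight vectors below follow our reading of Table 1.

WEIGHT_TABLE = {
    # (profile, G1 >= G2?) -> (w1, w2, w3, w4)
    ("optimistic", True): (1.0, 0.0, 0.0, 0.0),
    ("optimistic", False): (1.0, 0.0, 0.0, 0.0),
    ("pessimistic", True): (0.1, 0.1, 0.4, 0.4),
    ("pessimistic", False): (0.0, 0.0, 0.5, 0.5),
    ("balanced", True): (0.25, 0.25, 0.25, 0.25),
    ("balanced", False): (0.25, 0.25, 0.25, 0.25),
}

def configure_weights(profile: str, g1_ge_g2: bool) -> tuple:
    """Look up the weights and enforce the SDT commitments of Table 1."""
    w = WEIGHT_TABLE[(profile.lower(), g1_ge_g2)]
    # Commitment 1: every weight is a float in [0, 1].
    assert all(0.0 <= wi <= 1.0 for wi in w)
    # Commitment 3: the weights must sum to 1.
    assert abs(sum(w) - 1.0) < 1e-9
    return w
```

The commitment checks illustrate the role of the SDT: a reasoning engine can reject a (customized) weight configuration that violates them before the strategy is run.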

An SDT commitment can be formalized in DL, in First-Order Logic, or as a mathematical formula, depending on the required modeling feasibility and on the type of reasoning engine that will be used in the end. There are three commitments in Table 1. We use 풮풪풬(풟) – a DL dialect – for axioms 1 and 2 in order to express the value-range and set constraints. Commitment 3 in Table 1 is a mathematical formula, with which we express that the weights must sum to 1.

Let us use the use case in Sec. 4.2. Suppose the company value "heart" is annotated with the concepts from $G_1$ in Fig. 5 and a learning material called "course ITIL 1" is annotated with the ones from $G_2$.

Fig. 5 Annotation graph examples; $G_1$: "heart", $G_2$: "course ITIL 1"

We get the following figures by using LexMA: $n_1 = 4$, $n_2 = 6$, $m_{1=2} = 1$ (see ⟨Person, manage, Emotion⟩¹²), and $m_{1 \cong 2} = 2$ (⟨Person, interact with, Person⟩ and ⟨Person, trust, Person⟩ from $G_1$ share the same vertex with ⟨Person, subordinate, Person⟩ from $G_2$). Furthermore, $m_{1 \sim 2} = 1$ and $m_{2 \sim 1} = 1$. If we take the weights defined by a "balanced" profile (see columns 5 and 6 in Table 1), then we get the similarity score $S \approx 0.2014$.

12 We omit the context identifiers and co-roles here.
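The LexMA score can be sketched as a short function over the six counts. This is a minimal sketch assuming the combination given in the equation above – two Jaccard-style terms for the $=$ and $\cong$ matches and two per-graph normalised terms for the $\sim$ matches; the function and parameter names are illustrative, not the paper's implementation:

```python
# Sketch: the LexMA similarity score for two annotation graphs G1, G2.
# n1, n2: numbers of lexons in G1 and G2; m_eq: exactly matched lexons;
# m_cong: vertex-sharing lexons; m_sim12, m_sim21: "similar" lexons
# counted from each side; w: the four SDT-configured weights.

def lexma_score(n1, n2, m_eq, m_cong, m_sim12, m_sim21, w):
    w1, w2, w3, w4 = w
    return (w1 * m_eq / (n1 + n2 - m_eq)
            + w2 * m_cong / (n1 + n2 - m_cong)
            + w3 * m_sim12 / n1
            + w4 * m_sim21 / n2)
```

As a sanity check on the Jaccard-style term: two identical graphs compared with $w = (1, 0, 0, 0)$ score exactly 1, since then $m_{1=2} = n_1 = n_2$.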


5.2. Ontology Graph Measuring Strategy (OntoGraM)

The OntoGraM strategy is a composition of the WordNet lexical matching algorithms13 (e.g., finding synonyms) and a simple graph algorithm that uses an ontological model to perform the matching.

In OntoGraM, an annotation set needs to be a tree. Suppose that we have an ontology model as illustrated in Fig. 6. In the BT use case illustrated in Sec. 4.2, a company value (such as "heart") is an object of "Qualification", which contains a set of competencies. A competency can be a task, tool, knowledge, skill or ability. A learning material, e.g., "course ITIL 1", also contains a set of competencies.

Fig. 6 An ontology model in UML for matching and annotation. The model contains four classes – Person (hasFunction), Qualification (Label, ID; hasCompetency), Learning Material (Name, ID; hasCompetency) and Competency (ID, Title; hasTask, useTool, hasKnowledge, hasSkill, hasAbility) – linked by "has"/"is of" associations.

Fig. 7 Two competency trees of data objects

When we annotate "heart", we select the competency items from a competency ontology as shown in Fig. 7. "English Language" and "Customer and Personal Service" are selected as the knowledge for "heart". "Comprehension", "Monitoring" and "Speaking" are the skills. "Vision" is a required ability. We use the same method to annotate the learning material "Course ITIL 1". Note that the annotation tree has only two levels. The root indicates the data object.

13 http://lyle.smu.edu/~tspell/jaws/index.html

The rest are the concepts from the domain ontology. The relation between a node and its parent node is “part-of”.

The annotation sets that we use here are different from the ones in Sec. 5.1 (see Fig. 5). Fig. 5 contains domain canonical relations (e.g., "interact with") while Fig. 7 contains only the "part-of" mereological relation. This is because LexMA deals with text-based domain canonical relations, while OntoGraM deals with trees, which are a specific kind of graph.

OntoGraM is described as follows. Given two trees $G_1$ and $G_2$, categories $C_1, C_2, \dots, C_n$ and annotated concepts $G_1 \cdot t_1, \dots, G_1 \cdot t_{n_1}$ and $G_2 \cdot t_1, \dots, G_2 \cdot t_{n_2}$, we compare the concepts (from $G_1$ and $G_2$) that belong to the same category $C_i$ ($1 \le i \le n$).

If $G_1 \cdot t_i = G_2 \cdot t_j$ ($i \le n_1$, $j \le n_2$), then we say that $G_1 \cdot t_i$ and $G_2 \cdot t_j$ exactly match (e.g., "Monitoring" in Fig. 7). The total number of exactly matched concepts is denoted $n_{=}$.

If $G_1 \cdot t_i$ is a subtype of $G_2 \cdot t_j$ ($i \le n_1$, $j \le n_2$), then we say that $G_2 \cdot t_j$ subsumes $G_1 \cdot t_i$ (e.g., "Near Vision" subsumes "Vision" in Fig. 7). The total number of such matched concepts is denoted $n_{\sqsubset}$. If $G_1 \cdot t_i$ is a supertype of $G_2 \cdot t_j$, then we say that $G_1 \cdot t_i$ subsumes $G_2 \cdot t_j$, and the total number of such matched concepts is denoted $n_{\sqsupset}$.

If $G_1 \cdot t_i$ is a synonym of $G_2 \cdot t_j$, then we say that $G_1 \cdot t_i$ is similar to $G_2 \cdot t_j$ (e.g., "Comprehension" and "Understanding" in Fig. 7). The total number of similar concepts is denoted $n_{\approx}$. Note that in some cases, a relation like "similar" or "similar to"14 is defined in the domain ontology. In this case, we can directly use this ontological relation.

If $G_1 \cdot t_i$ and $G_2 \cdot t_j$ appear in the same lexon and none of the above situations occurs, then we say that $G_1 \cdot t_i$ is connected with $G_2 \cdot t_j$. For instance, "Speaking" and "Communication" are connected in ⟨Communication, use, Speaking⟩. We denote the number of connected concepts $n_{-}$.

If $G_1 \cdot t_i$ is connected with $G_2 \cdot t_j$ through domain canonical relations that are not "similar" or "similar to" and none of the above situations occurs, then we say that $G_1 \cdot t_i$ and $G_2 \cdot t_j$ are strongly connected. For example, we have a domain canonical relation "has member". "MS Excel" and "Spreadsheet Software"

14 We call this kind of relation a domain canonical relation. We define a domain canonical relation as a specific relation that must be interpreted and implemented


are strongly connected if we have ⟨MS Excel, is member of, Spreadsheet Software⟩ in our domain ontology. We denote the number of strongly connected concepts $n_{\backsim}$.

At the end, we use the following equation to calculate the matching score:

$$S = \frac{w_1 n_{=} + w_2 n_{\sqsubset} + w_3 n_{\sqsupset} + w_4 n_{\approx} + w_5 n_{-} + w_6 n_{\backsim}}{n_1}$$

where $w_1, \dots, w_6$ are the weights ($w_1 + w_2 + \dots + w_6 = 1$, $0 \le w_1, w_2, \dots, w_6 \le 1$) and $n_1$ is the number of annotated concepts in $G_1$ (the number of vertices excluding the root). Similarly to LexMA, we can use an SDT to configure the weights.

Suppose all the weights are $1/6$; we then get the matching score for Fig. 7 as $S = \frac{1}{6} \times \frac{4}{6} \approx 0.1111$.
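The OntoGraM score is thus a weighted count of the six match types, normalised by the number of annotated concepts in $G_1$. A minimal sketch (all names are illustrative):

```python
# Sketch: the OntoGraM matching score.  counts and weights are 6-tuples
# in the order: exact, subsumed, subsuming, similar, connected,
# strongly connected; n1 is the number of annotated concepts in G1
# (tree vertices excluding the root).

def ontogram_score(counts, weights, n1):
    assert abs(sum(weights) - 1.0) < 1e-9  # the weights must sum to 1
    return sum(w * n for w, n in zip(weights, counts)) / n1

# The Fig. 7 example: one exact match ("Monitoring"), one subsumption
# ("Vision"), one synonym ("Comprehension") and one connection
# ("Speaking"), with n1 = 6 and all weights 1/6:
score = ontogram_score((1, 1, 0, 1, 1, 0), (1 / 6,) * 6, 6)  # ≈ 0.1111
```

With equal weights the score reduces to the fraction of $G_1$'s concepts that match at all, scaled by $1/6$, which reproduces the 0.1111 above.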

5.3. Controlled, Fully Automated, Ontology-based Matching Strategy (C-FOAM)

C-FOAM contains two modules: 1) an interpreter and 2) a comparator. The interpreter uses a lexical dictionary, a domain dictionary, a domain ontology and a string matching algorithm to interpret an end user's input. Given a term that denotes either (a) a concept in the domain ontology or (b) an instance in the ontology, the interpreter returns the correct concept(s) defined in the ontology or lexical dictionary, and an annotation set. Fig. 8 shows the design of the C-FOAM processes.

Fig. 8 Design of the C-FOAM Structure

Related work for the C-FOAM interpreter is named entity recognition (NER, [5]) in the fields of text mining and information retrieval. It locates and classifies atomic elements in a text into predefined categories or contexts.

There are two thresholds in the interpreter. The first one is the threshold for the internal output of the string matching. If C-FOAM finds the concept in the ontology, then it goes directly to the step of graph matching. Otherwise, the filtered terms become the input for the lexical searching components. If the interpreter cannot find any concepts after executing the string matching, then it consults the lexical matching algorithm. The second threshold filters the output of the lexical searching components.

There is one threshold in the comparator. As illustrated in the use case (Sec. 4.2), the final output is a list of learning materials. This threshold selects the most relevant learning materials after executing the graph matching.

Suppose we select JaroWinklerTFIDF [23], [24], [51] as the string matching algorithm, the WordNet [17] synonym finder as the lexical matching algorithm and LexMA (Sec. 5.1) as the graph matching algorithm to run our example.

Situation 1: If we set the threshold to 0.5 and a user enters the string "hearty", JaroWinklerTFIDF yields 0.97 as the matching score for "hearty" and "heart", which is over the threshold (0.97 > 0.5). The interpreter feeds "heart" as the input to the comparator, which finds a list of learning materials based on the annotation of "heart".

Situation 2: If the user enters "warmness", JaroWinklerTFIDF yields 0.0 when comparing "warmness" with all the company values. The interpreter then consults the WordNet synonym finder, which finds "heart". Afterwards, C-FOAM executes the comparator and does the rest as discussed in the previous scenario.

Situation 3: If the user provides "tender" as the input, which does not match any of the company values according to JaroWinklerTFIDF or the WordNet synonym finder, then the interpreter executes the following steps: 1) load the synonyms of all the company values, e.g., the synonyms of "heart" are "bosom", "pump", "ticker", "core", "warmness", "tenderness" etc.; 2) compare these synonyms with the user's input using JaroWinklerTFIDF; if the score is above the threshold (0.5), then the corresponding company value is found; 3) C-FOAM activates the comparator. In this example, the JaroWinklerTFIDF score for "tender" and "tenderness" is 0.92 (0.92 > 0.5). "Tenderness" is then provided as the input for the comparator. This scenario is also applicable when the user misspells his input.
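The three situations form a cascade: direct string matching, then a synonym lookup, then string matching against the synonyms of every company value. A minimal sketch, where `string_sim` stands in for JaroWinklerTFIDF and `synonyms` for the WordNet synonym finder – both are placeholders, not the paper's implementation:

```python
# Sketch: the C-FOAM interpreter cascade for Situations 1-3.

def interpret(user_input, company_values, string_sim, synonyms, threshold=0.5):
    # Situation 1: direct string match against the company values.
    best = max(company_values, key=lambda v: string_sim(user_input, v))
    if string_sim(user_input, best) > threshold:
        return best
    # Situation 2: the input is itself a synonym of a company value.
    for value in company_values:
        if value in synonyms(user_input):
            return value
    # Situation 3: string-match the input against the synonyms of every
    # company value (this also covers misspelled input).
    for value in company_values:
        for syn in synonyms(value):
            if string_sim(user_input, syn) > threshold:
                return value
    return None  # nothing above the thresholds
```

The matched concept is then handed to the comparator, which runs the graph matching against the annotation sets of the learning materials.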

Advanced C-FOAM: In the above scenarios, we only use the WordNet synonym finder for the interpreter. In that case, the second threshold needs to be 1 because a term either is a synonym of another or is not. In the advanced C-FOAM, we use an SDT (e.g.,


Table 2) to set the scores for different kinds of WordNet relations.

Compared to Table 1, Table 2 has a different layout. Commitments 2 and 3 in Table 2 contain implications, which mean that if the relation type is antonym, then the score should be 0, and if the relation type is holonym, then the score is the minimum allowed value. These two commitments are modeled in FOL and not in DL, as they cannot be modeled in DL.

Table 2 An SDT that decides lexical matching scores based on WordNet relations

Condition/Relation type    Action/Score
Antonym                    0
Synonym                    0.9
Holonym                    0.1
Hypernym                   0.9
Hyponym                    0.5
Instance                   0.6
Meronym                    0.2

SDT Commitments
1 Score ⊑ ∀hasValue.float[≥0 ∧ ≤1]
2 RelationType(Antonym) ⇒ hasValue(Score, 0)
3 RelationType(Holonym) ⇒ hasValue(Score, MIN)

According to Table 2, if we use 0.5 as the threshold for the lexical matching, then synonyms, hypernyms, hyponyms and instances are the only types used for the lexical matching.

Note that the interpreter carries penalty values for the string matching and lexical matching. Normally, the penalty for the string matching is higher than the one for the lexical matching because we believe that lexical information is more “semantic” than textual information in many cases.

The final matching score of C-FOAM is calculated using the following formulas:

Situation 1: $S = p_s \times S_g$
Situation 2: $S = p_l \times S_g$
Situation 3: $S = p_s \times p_l \times S_g$

where $p_s$ and $p_l$ are the penalty values for the string matching and the lexical matching, and $S_g$ is the graph matching score. For example, if we want to compare "hearty" with "course ITIL 1", the annotation sets of which are illustrated in Fig. 5, and suppose that $p_s = 0.5$, then we get $S = p_s \times S_g = 0.5 \times 0.2014 \approx 0.1$ (see Situation 1).
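The penalty formulas can be sketched directly; the penalties are applied only for the matching steps that were actually used. The default of `None` (no penalty) and the name `cfoam_score` are illustrative; the paper's example fixes only the string penalty at 0.5:

```python
# Sketch: the final C-FOAM score.  The penalties are multiplied onto
# the graph matching score depending on which situation applied.

def cfoam_score(graph_score, p_str=None, p_lex=None):
    s = graph_score
    if p_str is not None:  # string matching was used (Situations 1, 3)
        s *= p_str
    if p_lex is not None:  # lexical matching was used (Situations 2, 3)
        s *= p_lex
    return s
```

With `p_str = 0.5` and the LexMA score 0.2014 of Sec. 5.1, this reproduces the result of roughly 0.1 given for Situation 1.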

Once C-FOAM gets all the matching scores between the user's input and all the learning materials, SDD-Matcher sorts the scores and only shows a list of learning materials with high scores.

6. An Evaluation Method

After the matching strategies are implemented in the phase of implementation (see Fig. 1), we go to the phase of evaluation.

In the literature, no general evaluation method for semantic-driven data matching exists. Evaluation methods are often trivial and application specific. Hence we need to design an evaluation method for the SDD-Matcher matching strategies.

Program evaluation is the systematic collection of information about the activities, characteristics and outcomes of programs to make judgments about the program and inform decisions about future programming [32]. It is "the systematic assessment of the operation and/or outcomes of a program or policy, compared to a set of explicit or implicit standards as a means of contributing to the improvement of program or policy" [6], [26].

Utilization-focused evaluation [40] is a comprehensive approach to doing evaluations that are useful, practical, ethical and accurate. Examples of such methods are the evaluation methods for non-experimental data [4], which show how to use non-experimental methods to evaluate social programs.

Purpose-oriented evaluation methodologies [3] comprise three kinds of evaluation: formative evaluation, pretraining evaluation and summative evaluation. Formative evaluation focuses on processes. Pretraining evaluation focuses on a judgment of the value before the implementation. Summative evaluation focuses on an outcome.

Other evaluation types, which we consider not directly linked to our work, are product evaluation, personnel evaluation, self evaluation, advocacy evaluation, policy evaluation, organizational evaluation, cluster evaluation etc. We refer to [22], [31], [35] for examples of product evaluation, self evaluation and cluster evaluation methodologies.

The SDD-Matcher evaluation method adapts the principles of the methodologies for program evaluation and purpose-oriented evaluation.

The basic principle in program evaluation methodologies is that an evaluation methodology needs to help a system improve its services and/or functions, and also to ensure that the system is delivering the right services and/or functions. The SDD-Matcher evaluation method needs to help SDD-Matcher improve the matching results, and to ensure that SDD-Matcher is delivering a correct list of recommended learning materials.

The principles in purpose-oriented evaluation methodologies are as follows:
− It must be able to determine what information exists in the process of a system, which is important so that the engineers can analyze the processing information;
− We need to use it to test and collect continuous feedback in order to revise the process;
− It must have a precondition analysis and a post-condition analysis;
− End users must be able to judge the outcome of a system.

Accordingly, our SDD-Matcher evaluation method needs to determine the information produced during the processes of SDD-Matcher, with which we analyze which matching strategy performs best within certain contexts. For instance, it must show the internal output of each matching step to our engineers. It needs to continuously analyze the comparison between users' expectations and the outcome of an SDD-Matcher matching strategy. The method also needs to provide a mechanism with which end users can judge the matching results.

The above evaluation principles give an overview of evaluating an SDD-Matcher strategy at a general level. This kind of judgment is called macro judgment. These principles are fundamental to judging the quality of a matching strategy.

We draw up evaluation criteria for the SDD-Matcher matching strategies as shown in Table 3. They are used to evaluate a matching strategy at a detailed level. This kind of judgment is called micro judgment. The reason why we choose these criteria is explained in the "motivation/evaluation principle" column of Table 3.

Table 3 Evaluation criteria for the SDD-Matcher evaluation method

Evaluation criteria | Explanation | Motivation/Evaluation principle(s)
Difficulty of managing the required knowledge resource | To check whether it is difficult to manage the knowledge base of a strategy. | To evaluate whether this specific strategy is useful and easily used.
Difficulty of using the strategy | To check whether it is difficult to adjust the parameters of a strategy. | To continuously analyze (with different parameters) the comparison between users' expected scores and the scores calculated by this strategy.
Results of the matching strategy | To check whether the similarity scores match users' expectations. | To evaluate whether this strategy is delivering the right services and good functional results.
What affects the matching score | To find the factors with which this strategy delivers the right services and good functional results. | To help a system that uses this strategy improve its services and/or functions.
Advantage and disadvantage | To explain the situations where this strategy is applicable and inapplicable. | To evaluate whether this strategy is delivering the right services and good functional results; to help a system that uses this strategy improve its services and/or functions in the future.
Performance analysis | To check whether it is expensive to run a strategy. | To evaluate whether this strategy is delivering services with limited resources.

The SDD-Matcher evaluation method is illustrated in Fig. 9. The arrow-tipped bars indicate the activity flows. When the result of an activity yields a valuable input for its previous activities, we should re-execute those previous activities using this input. For example, after the step of "design a test suite", we can come back to the step of "design a use case story", if necessary.

Fig. 9 The SDD-Matcher evaluation method

In what follows, the steps in Fig. 9 are explained.

Step 1 (collect working documents): This is a preparation step, in which we scope the problem. Design terms (e.g., actors15 and triggers16) are gathered

15 An actor is a person (or another entity external to SDD-Matcher) who interacts with SDD-Matcher and performs test cases or use cases to accomplish a test task.

16 A trigger is an identifier of the event that initiates a use case or test case.


from BT. The output of this step is a report containing design issues.

Step 2 (design a use case story): We specify the problem in this step. We refine the design terms from the previous step by specifying the types of information (e.g., process information) used by SDD-Matcher. We also analyze the preconditions and post-conditions of the use case. The output of this step is a report containing a detailed SDD-Matcher use case story, e.g., Table 4.

Table 4 A story of a detailed SDD-Matcher use case

Title: SDD-Matcher for learning material recommendation
ID: OMMR_V1.0
Scope: e-learning, training, material recommendation, ontology-based gap analysis
Purpose: This story describes a use case of using an ontology-based gap analysis framework for the recommendation of learning materials for a company.

Settings
S1 An employee has competencies, which can be evaluated by reviewers and stored in a De.
S2 Learning materials contain the methods of improving skills of BT employees. The formats of these learning materials vary from documents to multimedia resources.

Characters
C1 Every employee at BT has one function. His function (level) gets raised when he gets a good evaluation.
C2 The reviewer evaluates the employee. The evaluation result, which is stored in a DPR, is the input of the recommendation process.
C3 The trainer is responsible for training BT employees with appropriate materials.

Episodes: Episode I – use case that involves the graph matching algorithm
EI-1 An employee gets the DPR from a reviewer. The employee's actual performance rating of the values is recorded in the DPR.
EI-2 The employee provides input. (annotated)
EI-2.1 (mandatory) The employee provides the actual competency scores from the DPR.
EI-2.2 (optional) The employee provides the expected competency scores as input 2.
EI-3 The SDD-Matcher performs the calculation.
EI-3.1 If input 2 is not provided, then capacities with value NI are collected as the capacity set that needs to be improved.
EI-3.2 If input 2 is provided, then capacities with a value lower than expected are collected as the capacities that need to be improved.
EI-3.3 The framework collects all the capacities that do not need to be improved.
EI-3.4 The framework compares the gap between the capacities in the set of existing capacities and the capacities in the set of capacities that need to be improved.
EI-3.4.1 The framework generates two graphs in the ontology.
EI-3.4.2 The framework combines the graphs of the capacities in the first set into graph 1.
EI-3.4.3 The framework combines the graphs of the capacities in the second set into graph 2.
EI-3.4.4 The framework compares graph 1 and graph 2.
EI-3.4.5 The framework finds the difference between graph 1 and graph 2; an internal competency gap set is generated. Suppose this gap set is gap 1.
EI-3.4.6 The framework finds the learning materials that are annotated with the concepts in gap 1.
EI-4 The matching framework generates the output.
EI-4.1 Output 1: a set of recommended learning materials.
EI-4.2 Output 2: reasons for the recommendation, e.g., each learning material is illustrated with relevant concepts concerning competency in the ontology, and each capacity is also illustrated with relevant concepts concerning competency in the ontology.
EI-4.3 Output 3: others, e.g., the steps of graph matching (log).

The preconditions of the story in Table 4 are as follows: 1) there exists a competency ontology; 2) there exist several matching strategies in SDD-Matcher; and 3) there is at least one good example from the test bed.

The post-conditions of the story (see Table 4) are as follows: 1) SDD-Matcher needs to illustrate the definition of a term and the relations of concepts in the ontology base; 2) SDD-Matcher needs to explain the matching results from the different matching strategies.

Step 3 (Design test and evaluation data): In this step, we design the test data used by SDD-Matcher (not by end users). The output of this step is a report containing a list of test and evaluation data. For example, for the BT use case in Sec. 4.2, the test data set contains 26 learning materials, which are categorized into 10 soft skills. There are in total 10 company values. The ontology contains 1365 lexons, which cover 382 different concepts and 208 different role pairs. All the selected learning materials and the 10 company values are properly annotated with the ontology.

Step 4 (Design a user test suite): In this step, a knowledge engineer designs a user test suite, the data in which is provided by an evaluator. This evaluator knows the domains of both HRM and training well but has not contributed to modelling the domain ontology. We do not allow an evaluator to be an ontology modeller because the evaluation result is more valuable if he is not influenced by the models in the ontology. The output of this step is a report containing a test suite.

An example of a user test suite is illustrated in Table 5. A level of relevance can be 1, 2, 3, 4 or 5: 1 means that $G_1$ and $G_2$ are completely irrelevant; 2 means "not very relevant (or I don't know)"; 3 means "relevant"; 4 means "very relevant"; and 5 means "100% relevant".


Table 5 A BT test suite

G1                          G2             Level of relevance
Heart                       Course ITIL1   3
Helpful                     Course ITIL8   2
Straightforward             PD0236         4
Inspiring                   SKpd_04_a05    5
Trustworthy                 BTAMG001       1
Customer connected          HMM23          2
Team work                   FS-POSTA01     4
Coaching for performance    SKCUST0154     5
…                           …              …

We use this test suite to measure whether or not a particular SDD-Matcher matching strategy provides “good” similarity scores.

Step 5 (Analyse SDD-Matcher outputs vs. users' expectations): The output of this step is a comparison report, which can be a figure or a data sheet. We will illustrate such reports in the next section.

Step 6 (Analyse and conclude): In this step, we analyse the comparison report produced in step 5 and draw conclusions. We will illustrate such conclusions in the following sections.

7. Evaluation Result and Strategy Enhancement

Guided by the criteria illustrated in Table 3, we have implemented a tool called ODMatcher, shown in Fig. 10, to test and evaluate the strategies.

Fig. 10 The screenshot of ODMatcher

The satisfactory rate is one such criterion; we calculate it based on Table 5, as described in the following subsection.

7.1. Satisfactory Rates

In order to calculate the satisfactory rate of a matching strategy, we need the scale of its similarity scores. A scale is defined as the range of the scores generated by the matching strategy. For instance, the minimum and maximum scores of LexMA are 0 and 0.3225, so the scale of LexMA similarity scores is [0, 0.3225]. Seeing that there are 5 levels in Table 5, we split this scale into 5 parts. If we split it equally, then we get the following mapping:

Relevance level 5: S > 0.258
Relevance level 4: 0.1935 < S ≤ 0.258
Relevance level 3: 0.129 < S ≤ 0.1935
Relevance level 2: 0.0645 < S ≤ 0.129
Relevance level 1: S ≤ 0.0645

A similarity score is "completely satisfied" if it falls in the expected range; otherwise, we need to calculate the bias. A bias is defined as the quantified difference between an actual score (generated by a matching strategy) and its expected score. It is the minimum of the low boundary bias and the high boundary bias, which are calculated using the following algorithm.

IF (Similarity Score < Low Boundary) THEN
    Low Boundary Bias = Low Boundary – Similarity Score
ELSE
    Low Boundary Bias = Similarity Score – Low Boundary

IF (Similarity Score < High Boundary) THEN
    High Boundary Bias = High Boundary – Similarity Score
ELSE
    High Boundary Bias = Similarity Score – High Boundary

For instance, if the similarity score for relevance level 4 is 0.2, then the low boundary bias is 0.2 − 0.1935 = 0.0065 and the high boundary bias is 0.258 − 0.2 = 0.058. The bias is then set to 0.0065.

If the bias is less than 0.0645 (one interval), then we say that the similarity score is "satisfied". If it is more than 0.0645 and less than 0.129 (two intervals), then we say that it is "not really satisfied". All the remaining scores are "completely unsatisfied".

The satisfactory rate is the total of "completely satisfied" and "satisfied" scores. For instance, suppose we have 100 similarity scores: 51 of them perfectly match the users' expectations and 27 satisfy the users' expectations. Then the satisfactory rate is 78% ((51 + 27)/100 = 78%).
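The level mapping, bias and satisfactory rate can be sketched together; the interval 0.0645 comes from splitting LexMA's scale [0, 0.3225] into five equal parts, and all function names are illustrative:

```python
# Sketch: classifying a similarity score against an expected relevance
# level, and computing the satisfactory rate over a list of labels.

INTERVAL = 0.3225 / 5  # = 0.0645, one fifth of the LexMA scale

def classify(score, expected_level):
    low = (expected_level - 1) * INTERVAL
    high = expected_level * INTERVAL
    if low < score <= high or (expected_level == 1 and score <= high):
        return "completely satisfied"
    # Bias: the minimum of the low and high boundary biases.
    bias = min(abs(score - low), abs(score - high))
    if bias < INTERVAL:          # within one interval
        return "satisfied"
    if bias < 2 * INTERVAL:      # within two intervals
        return "not really satisfied"
    return "completely unsatisfied"

def satisfactory_rate(labels):
    ok = sum(1 for l in labels if l in ("completely satisfied", "satisfied"))
    return ok / len(labels)
```

For the example above, a score of 0.2 against expected level 4 is completely satisfied, and 51 + 27 satisfying scores out of 100 give a rate of 0.78.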


Fig. 11 shows the result of comparing users' expectations with the SDD-Matcher scores generated by LexMA, OntoGraM and C-FOAM. As we can see, LexMA has the highest satisfactory rate. OntoGraM has the highest rate of "completely satisfied" scores.


Fig. 11 SDD-Matcher scores vs. user’s expectations

Note that we can increase the satisfactory rate of a matching strategy by modifying its parameters. This is a necessary step in the phase of strategy enhancement (see Fig. 1).

As has been discussed in this article, SDTs are used to configure the parameters. In what follows, we illustrate an algorithm called the SDT Self-Configuring Algorithm (SDT-SCA) for finding the best configuration.

7.2. Semantic Decision Table Self-Configuring Algorithm

The algorithm is illustrated in Fig. 12. In the preparation phase, we take a test suite as shown in Table 5 and decide which matching strategy to use. In the steps "run a complete test and calculate score ranges" and "calculate bias", we get a score range and bias, the processing details of which have been discussed in the previous section. Then, we cluster the scores with satisfactory rates.

Fig. 12 SDT-SCA

In the step "increase/decrease parameters in SDT", we increase/decrease the parameters by assigning any possible floats that are accurate to a certain number of decimal places. For example, the action entries for the changeable actions may take any value in the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} if we use one decimal place and allow them to vary from 0.1 to 0.9.

We repeat steps 2~5 until we cannot increase or decrease the parameters anymore. Suppose we have $x$ changeable actions in an SDT, and we allow the action entries to vary from $y_1$ to $y_2$ (with the calibration of $c$). A complete test of SDT-SCA contains a loop that executes exactly $((y_2 - y_1)/c + 1)^x$ times.

We collect the satisfaction rates for all the possible parameter combinations, and select the one that has the best satisfaction rate.
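SDT-SCA is essentially an exhaustive grid search over the changeable action entries. A minimal sketch, assuming an `evaluate` function that runs the chosen strategy with a given parameter combination on the test suite and returns its satisfaction rate (all names are illustrative):

```python
from itertools import product

# Sketch: SDT-SCA as an exhaustive grid search over parameter
# combinations; evaluate(params) returns the satisfaction rate.

def sdt_sca(evaluate, n_params=4, y1=0.1, y2=0.9, c=0.1):
    steps = int(round((y2 - y1) / c)) + 1
    values = [round(y1 + i * c, 10) for i in range(steps)]
    best_combo, best_rate = None, -1.0
    # The loop body executes ((y2 - y1)/c + 1) ** n_params times.
    for combo in product(values, repeat=n_params):
        rate = evaluate(combo)
        if rate > best_rate:
            best_combo, best_rate = combo, rate
    return best_combo, best_rate
```

With four changeable actions and one decimal place this is 9⁴ = 6561 evaluations, which is what makes the exhaustive search of Table 6 feasible.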

For instance, Table 6 shows a part of the results of running SDT-SCA for LexMA.

Table 6 Test results of running SDT-SCA for LexMA

ID   1    2    3    4    5    6    7    8    9
1    0.1  0.1  0.1  0.1  43   65   72   54   0.462
2    0.1  0.1  0.1  0.2  43   65   72   54   0.462
3    0.1  0.1  0.1  0.3  43   65   72   54   0.462
4    0.1  0.1  0.1  0.4  43   65   72   54   0.462
5    0.1  0.1  0.1  0.5  44   65   69   56   0.464
6    0.1  0.1  0.1  0.6  43   65   71   54   0.463
7    0.1  0.1  0.1  0.7  43   65   72   54   0.461
8    0.1  0.1  0.1  0.8  43   65   71   54   0.463
9    0.1  0.1  0.1  0.9  43   65   72   54   0.462
10   0.1  0.1  0.2  0.1  44   80   73   36   0.533
…    …    …    …    …    …    …    …    …    …
11   0.1  0.1  0.8  0.6  65   109  53   6    0.747
12   0.1  0.1  0.8  0.7  66   110  52   6    0.752
13   0.1  0.1  0.8  0.8  66   111  51   6    0.756
14   0.1  0.1  0.8  0.9  66   112  51   6    0.759
15   0.1  0.1  0.9  0.1  67   119  43   5    0.796
16   0.1  0.1  0.9  0.2  67   120  43   5    0.7959
17   0.1  0.1  0.9  0.3  67   119  43   5    0.795
18   0.1  0.1  0.9  0.4  67   119  43   5    0.795
19   0.1  0.1  0.9  0.5  66   112  51   6    0.757
20   0.1  0.1  0.9  0.6  66   111  51   6    0.756
…    …    …    …    …    …    …    …    …    …

The meaning of the numbers in columns 1~9 of Table 6 is explained as follows:
1: parameter for a one-term match (denoted $x_1$)
2: parameter for a two-term match (denoted $x_2$)
3: parameter for a one-term-one-role match (denoted $x_3$)
4: parameter for a two-term-one-role match (denoted $x_4$)
5: number of scores that are completely satisfied (denoted $n_{cs}$)
6: number of scores that are satisfied (denoted $n_s$)
7: number of scores that are not really satisfied (denoted $n_{ns}$)
8: number of scores that are completely unsatisfied (denoted $n_{cu}$)
9: satisfaction rate, calculated using the formula $(n_{cs} + n_s)/(n_{cs} + n_s + n_{ns} + n_{cu})$

Accordingly, (0.1, 0.1, 0.9, 0.1) is the best parameter set (see the row with ID 15 in Table 6).

8. Conclusion, Discussion and Future Work

We have presented a generic semantic-driven data matching framework called SDD-Matcher. It is used to find similarities between two objects using their annotations. We have illustrated a full engineering cycle of SDD-Matcher, which contains five phases – framework design, use case design, implementation, evaluation and strategy enhancement.

The framework is a computational framework using semantic terms. It covers the computation of semantic tokens (e.g., strings, labels and words) and types (e.g., structural or contextual information, generic semantics from the ontology). An SDD-Matcher contains matching algorithms at the levels of string, lexical and graph. A matching strategy from SDD-Matcher is a composition of those algorithms. Each strategy contains at least one graph matching algorithm. String and lexical matching algorithms are used to align different labels in a natural language into unified concepts, and to form proper sub-graphs in an ontology. Each sub-graph corresponds to an annotation set of a data object. In this article, we have shown the LexMA, enhanced C-FOAM and OntoGraM matching strategies. Together with BT, we have validated our use cases, evaluated the three matching strategies and enhanced the strategies. We have used semantic decision tables to configure matching strategies in SDD-Matcher. In the phase of strategy enhancement, in order to obtain the most satisfactory matching scores, we have designed the SDT-SCA algorithm to automatically adjust the semantic decision tables.

SDD-Matcher covers the computation of semantic tokens and types. A future direction may be twofold: the levels of data pragmatics and data syntax. With regard to the level of data pragmatics, we will need to interpret and compare specific semantics in the data context, i.e., context-specific and personalized data matching. Concerning the level of data syntax, we will study data schemas for structured data and natural-language structure for unstructured data. The future matching strategies can be dictionary-based, rule-based or decision table-based.

SDD-Matcher uses an ontology, the format of which is not restricted. For example, we use UML diagrams and simple trees for OntoGraM, and directed graphs for LexMA and C-FOAM.

It is important to know that there is only one ontology in our settings. A multiple-ontology scenario is also possible, in which case we need a preprocessing module to establish the equivalence between those ontologies. This problem can be solved by ontology alignment techniques.

The SDT-SCA algorithm is used to enhance the matching strategies. Note that it is not the only way to enhance a strategy. We can try other means, e.g., introducing more parameters or modifying the formulas, which will be one of our future works.

We have designed a graph matching algorithm for LexMA. Other graph matching algorithms, which are commonly used but not yet implemented in SDD-Matcher, are maximum flow algorithms [36], Edmonds's non-bipartite matching algorithm [15], Tutte's 1-factor theorem [48] and the algorithm for finding a minimum spanning tree [33]. How to use those algorithms in constructing SDD-Matcher matching strategies is one of our future works.
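For illustration only, here is a minimal Kruskal-style sketch of the minimum spanning tree computation mentioned above; it is not part of SDD-Matcher, and the example graph and weights are made up:

```python
# Illustrative sketch of Kruskal's minimum spanning tree algorithm,
# one of the graph algorithms listed above (not implemented in
# SDD-Matcher); the example graph below is hypothetical.
def kruskal(n, edges):
    """edges: list of (weight, u, v) over vertices 0..n-1.
    Returns (mst_edges, total_weight)."""
    parent = list(range(n))

    def find(v):                      # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst, total = [], 0
    for w, u, v in sorted(edges):     # consider edges by ascending weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # keep the edge only if it joins two trees
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

# A small weighted graph on 4 vertices.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (3, 1, 3), (5, 2, 3)]
tree, weight = kruskal(4, edges)
print(weight)  # 6
```

A matching strategy could, for instance, run such an algorithm over a weighted annotation sub-graph, but how to integrate it into SDD-Matcher remains future work, as stated above.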

Another interesting future work is to apply the matching strategies from SDD-Matcher to linking data for the Semantic Web. Although the requirements are not exactly the same, some principles and works illustrated in [18] can be considered as useful approaches to applying the strategies and designing other SDD-Matcher matching strategies in the future.

9. Acknowledgement

The data are taken from the EC FP6 Prolix project. It is the authors' pleasure to thank our ex-colleague Peter De Baer for the implementation of OD-Matcher. With regard to the use case illustrated in this article, we are very grateful to have worked collaboratively with Hans Dirkzwager, Joke de Laaf and Johanneke Stam from BT. We shall also thank Prof. Tharam Dillon from Curtin University for his valuable suggestions and ideas on the paper.


References

[1] Baader, F., Calvanese, D., McGuinness, D., Nardi, D., and Patel-Schneider, P., editors. The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press, 2007.

[2] Barrett, C., Bisset, K., Holzer, M., Konjevod, G., Marathe, M. and Wagner, D. (2008): Engineering Label-Constrained Shortest-Path Algorithms, in proceedings of the 4th International Conference, AAIM 2008, Springer Berlin / Heidelberg, LNCS 5034/2008, ISBN 978-3-540-68865-5, Shanghai, China

[3] Bhola, H. S. (1990): Evaluating “Literacy for development” projects, programs and campaigns: Evaluation planning, design and implementation, and utilization of evaluation results. Hamburg, Germany: UNESCO Institute for Education

[4] Blundell, R., and Costa Dias, M. (2000): Evaluation methods for non-experimental data, Fiscal Studies, Volume 21, Number 4, December 2000, pp. 427-468

[5] Borthwick, A. (1999): A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, New York University

[6] CDC (2009): Developing Process Evaluation Questions, At the National Center for Chronic Disease Prevention and Health Promotion, Healthy Youth, Program Evaluation Resources, http://www.cdc.gov/healthyyouth/evaluation/resources.htm

[7] Ciuciu, I.G. and Tang, Y. (2010): A Personalized and Collaborative eLearning Materials Recommendation Scenario using Ontology-based Data Matching Strategies, in proc. of 5th International Workshop on Enterprise Integration, Interoperability and Networking (EI2N’2010), special track of “Semantic & Decision Support” (SeDeS’2010), OTM 2010 workshops, Springer, LNCS 6428, p. 575 ff, Hersonissou, Crete, Greece, Oct. 24~28, 2010

[8] Ciuciu, I. G., Tang, Y. and Meersman, R. (2011): Towards Retrieving and Recommending Security Annotations for Business Process Models Using an Ontology-based Data Matching Strategy, in Proc. of the first international symposium on data-driven process discovery and analysis (SIMPDA'11), IFIP working group 2.6 and 2.12, ISBN 978-88-903120-2-1, vol. 1, pp. 71-81, Campione d'Italia, Italy, June 29th ~ July 1st, 2011

[9] Cohen, W. W., and Ravikumar, P. (2003): Secondstring: An open source java toolkit of approximate string-matching techniques. Project web page, http://secondstring.sourceforge.net

[10] Cruz, I. F., Antonelli, F. P., and Stroe, C. (2009): AgreementMaker: efficient matching for large real-world schemas and ontologies, Journal of the VLDB Endowment, Vol. 2 Issue 2, August 2009

[11] Das, S., Chong, E. I., Eadon, G., and Srinivasan, J. (2004): Supporting Ontology-based Semantic Matching in RDBMS, VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, ISBN: 0-12-088469-0

[12] Dijkstra, E.W. (1959): A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271

[13] Dorneles, C. F., Gonçalves, R., and Mello, R. d. S. (2011): Approximate data instance matching: a survey, Knowledge and Information Systems, Volume 27, Number 1, 1-21, DOI: 10.1007/s10115-010-0285-0

[14] Doshi, P., Kolli, R., and Thomas, C. (2009): Inexact Matching of Ontology Graphs Using Expectation-Maximization, Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 7, 2, April 2009, pp. 90-106, Elsevier

[15] Edmonds, Jack (1965): Paths, trees, and flowers, Canad. J. Math. 17: 449–467. doi:10.4153/CJM-1965-045-4

[16] Elmagarmid, A. K., Ipeirotis, P. G., Verykios, V. S. (2007): Duplicate record detection: a survey. IEEE Trans Knowledge Data Engineering 19:1–16

[17] Fellbaum, C. (1998): WordNet: an electronic lexical database (Language, Speech, and Communication). ISBN-10: 026206197X, ISBN-13: 978-0262061971, MIT Press, Cambridge, 1998

[18] Ferrara, A., Nikolov, A., and Scharffe, F. (2011): Data Linking for the Semantic Web. Int. J. Semantic Web Inf. Syst. 7(3):46-76

[19] Guarino, N., Carrara, M., and Giaretta, P. (1994): Formalizing ontological commitments, in Proceedings of the twelfth national conference on AI (vol. 1), pp. 560-567, 1994, ISBN: 0-262-61102-3, USA

[20] Giunchiglia, F., Shvaiko, P., and Yatskevich, M. (2004): S-Match: an algorithm and an implementation of semantic matching. In Proceedings of ESWS 2004, Heraklion (GR), pages 61–75, 2004

[21] Gruber, T. R. (1993): A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220

[22] Huber, F., Herrmann, A., and Gustafson, A. (2007): On the influence of the evaluation methods in conjoint design – some empirical results, in book “Conjoint Measurement”, fourth edition, Springer, Berlin, Heidelberg, ISBN 978-3-540-71403-3, pp. 93-112

[23] Jaro, M. A. (1989): Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.

[24] Jaro, M. A. (1995): Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491–498.

[25] Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E. and Protopapas, Z. (1996): Fast Nearest Neighbor Search in Medical Databases. In International Conference on Very Large Databases (VLDB), pages 215-226, India

[26] Langmeyer, D. B., Huntington, G. S. (1992): Developing Evaluation Questions, ARCH National Resource Center for Respite and Crisis Care Services, http://www.archrespite.org/archfs13.htm

[27] Li, W. and Clifton, C. (1994): Semantic integration in heterogeneous databases using neural networks. In: Proc. 20th Int. Conf. on Very Large Data Bases, pp. 1–12

[28] Meersman, R. (1999): Ontologies and databases: More than a Fleeting Resemblance. In: Proceedings of the International Symposium on Methodologies for Intelligent Systems, vol. 1609, Springer

[29] Miller, R., Ioannidis, Y. E., and Ramakrishnan, R. (1994): Schema equivalence in heterogeneous systems: bridging theory and practice. Inf Syst 19(1):3–31

[30] Noessner, J., Niepert, M., Meilicke, C., and Stuckenschmidt, H. (2010): Leveraging terminological structure for object reconciliation. In: Aroyo L et al. eds. Proc. of the 7th Extended Semantic Web Conference, LNCS 6088, Heidelberg: Springer-Verlag, 2010. 334-348.

[31] Ovando, M. N. (2001): Teachers’ perceptions of a learner-centered teacher evaluation system, Journal of Personnel Evaluation in Education, Volume 15, Number 3, September 2001, Springer, Netherlands

[32] Patton, M. Q. (2002): Qualitative Research and Evaluation Methods, 3rd edition, Sage Publications, Inc, London, UK, ISBN 0-7619-1971-6

[33] Peng C., Tan, Y., X, N., Y, L. T., and Zhu, H. (2009): Approximation algorithms for inner-node weighted minimum spanning trees, IJCSSE, Tharam Dillon (ed.), Vol 24 No 3, May 2009


[34] Rahm, E., and Bernstein, P. A. (2001): A Survey of Approaches to Automatic Schema Matching, the VLDB Journal 10: 334–350 (2001)

[35] Ruan, J. and Zhang, W. (2007): Identification and evaluation of functional modules in Gene Co-expression Networks, LNCS 4532/2007, in book “Systems Biology and Computational Proteomics”, pp. 57-76, Springer Berlin/Heidelberg

[36] Schrijver, A. (2002): On the history of the transportation and maximum flow problems, Mathematical Programming 91 (2002) 437-445

[37] Shvaiko, P., Euzenat, J. (2012): Ontology Matching: State of the Art and Future Challenges, IEEE Transactions on Knowledge and Data Engineering, 13 December 2011

[38] Spärck Jones, K. (1972): A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28 (1): 11–21

[39] Stamou, S., Ntoulas, A., and Christodoulakis, D. (2007): TODE - an ontology based model for the dynamic population of web directories, in book “Data Mining with Ontologies: Implementations, Findings and Frameworks”, edited by Hector Oscar Nigro, Sandra Elizabeth Gonzalez Cisaro, Daniel Hugo Xodo, IGI Global

[40] Stufflebeam, D. L., Madaus, G. F., and Kellaghan, T. (2006): Utilization-Focused Evaluation, in book “Evaluation in Education and Human Services”, Volume 49, Second edition, Springer, Netherlands, pp. 425~438

[41] Tang, Y. (2011): Onto-Ann: An Automatic and Semantically Rich Annotation Component for Do-It-Yourself Assemblage. OTM Workshops 2011: pp. 424-433, LNCS, Vol. 7046/2011, Robert Meersman, Tharam S. Dillon and Pilar Herrero (eds.), Crete, Greece, October 17-21, 2011

[42] Tang, Y. (2010): Semantic Decision Tables - A New, Promising and Practical Way of Organizing Your Business Semantics with Existing Decision Making Tools, ISBN 978-3-8383-3791-3, LAP LAMBERT Academic Publishing AG & Co. Saarbrücken, Germany

[43] Tang, Y. (2010): Towards Evaluating GRASIM for Ontology-based Data Matching, in proc. of the 9th international conference on ontologies, databases, and applications for semantics (ODBASE'2010), Springer Verlag, LNCS 6427, p. 1009 ff, Hersonissou, Crete, Greece, Oct 26~28, 2010

[44] Tang, Y., Meersman, R., Ciuciu, I. G., Leenarts, E., and Pudney, K. (2010): Towards Evaluating Ontology Based Data Matching Strategies, in proc. of the fourth IEEE Research Challenges in Information Science (RCIS'10), Peri Loucopoulos and Jean Louis Cavarero (eds.), pp. 137-146, Nice, France, May 19-21, 2010

[45] Tang, Y., Meersman, R. and Vanthienen, J. (2011): Adding Semantics to Decision Tables: a New Approach to Data and Knowledge Engineering? in book "Applied Semantic Technologies: Using Semantics in Intelligent Information Processing", Vijayan Sugumaran and Jon Atle Gulla (eds.), to be published by Taylor and Francis

[46] Tang, Y. and Meersman, R. (2011): DIY-CDR: An Ontology-based, Do-it-Yourself Components Discoverer and Recommender, Theme Issue on Adaptation and Personalization for Ubiquitous Computing, journal of Personal and Ubiquitous Computing, Springer, Z. Yu, D. Cheng, I. Khalil, J. Kay, D. Heckmann (eds.), DOI 10.1007/s00779-011-0416-y, ISSN 1617-4909, June 21, 2011

[47] Tangmunarunkit, H., Decker, S., and Kesselman, C. (2003): Ontology-based Resource Matching in the Grid — The Grid meets the Semantic Web, the Semantic Web - ISWC 2003, LNCS 2003, Vol. 2870/2003, pp. 706-721, Sanibel-Captiva Islands, Oct. 20~23, 2003

[48] Tutte, W. T. (1947): The factorization of linear graphs, J. Lond. Math. Soc., 22 (1947), 459-474

[49] Venkateswaran, J., Kahveci, T. and Camoglu, O. (2006): Finding Data Broadness Via Generalized Nearest Neighbors, in proc. of Advances in Database Technology - EDBT 2006, Springer Berlin / Heidelberg, ISSN 0302-9743, Volume 3896/2006

[50] West, D. B. (2000): Introduction to Graph Theory (2nd Edition), Prentice Hall, ISBN-10: 0130144002, ISBN-13: 978-0130144003, September 1, 2000

[51] Winkler, W. E. (1999): The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html

[52] Zhou, X. and Geller, J. (2008): Raising, to Enhance Rule Mining in Web Marketing with the Use of an Ontology, in "Data Mining with Ontologies: Implementations, Findings, and Frameworks", ISBN13: 9781599046181, ISBN10: 1599046180, EISBN13: 9781599046204, IGI Global