Large-scale biomedical ontology matching with ServOMap

Available online at www.sciencedirect.com
IRBM 34 (2013) 56–59
Digital technologies for healthcare

M. Ba, G. Diallo*
Université Bordeaux Segalen, LESIM-ISPED, 146, rue Léo-Saignat, 33000 Bordeaux, France
Received 15 November 2012; received in revised form 15 December 2012; accepted 17 December 2012
Available online 13 February 2013
* Corresponding author. E-mail addresses: [email protected] (M. Ba), [email protected], [email protected] (G. Diallo).

Abstract

The proliferation of biomedical applications that rely on different knowledge organization systems, such as ontologies and thesauri, raises the issue of automatically identifying the correspondences between these models, in particular for data integration. Significant effort has been devoted to tackling this issue of ontology alignment. However, few systems are able to deal with ontologies containing tens of thousands of entities, as may be the case in the biomedical domain, where resources such as SNOMED-CT, the FMA or the NCI thesaurus are commonly used. We present in this paper ServOMap, an efficient system for large-scale ontology alignment. It relies on an Ontology Server (ServO) and uses Information Retrieval techniques for computing similarity between entities. The system participated with two configurations in the 2012 Ontology Alignment Evaluation Initiative campaign. We report the very promising results obtained by the system for the alignment of large biomedical ontologies. ServOMap is freely available for download at http://code.google.com/p/servo/.
© 2013 Elsevier Masson SAS. All rights reserved.

1. Introduction

The proliferation of biomedical knowledge-based applications and the emergence of Linked Open Data (LOD) have led to the development of various knowledge organization systems (KOS). These KOS are used for data annotation and therefore facilitate data sharing. Following a long tradition of categorization, the biomedical domain is very active in providing these KOS, which range from terminologies and structured vocabularies to ontologies. Due to their increasing availability, their ever larger size and their possible heterogeneity, it becomes necessary to provide efficient tools for performing automated alignment between them [1]. Significant effort has been made recently on alignment both at the entity level and at the instance level [2]. However, matching large-scale ontologies remains a great challenge and few systems are able to deal with it.

This is the aim of the ServOMap ontology matching system. It is designed to facilitate time-efficient interoperability between different applications that are based on heterogeneous KOS. The heterogeneity comes from their language format, their level of formalism, etc. The system relies on Information Retrieval techniques and a dynamic description of the entities of the different KOS for computing their similarity. In the same vein as systems such as LogMap [3], it is dedicated to the task of matching large-scale ontologies.

The main contribution of the system is a new generic approach for computing similarity between large KOS using neither background knowledge nor specific external resources (e.g. lexical resources such as WordNet or repositories such as the Unified Medical Language System, UMLS). This approach relies on the ServO Ontology Repository (OR) system [4]. We define an OR, a.k.a. Ontology Server, as a semantic index that can be maintained in main memory or in the file system and that stores a "representation" of several KOS, which is later used for performing some meta-operations, including computing similarity between entities.
The notion of OR we refer to differs from the notion represented by systems such as OWLIM [5] and, more generally, by Ontology-Based Database systems and RDF repositories such as Sesame [6]. OWLIM and Sesame focus on RDF data storage and querying and are not designed for ontology matching.

In the rest of the paper, we briefly describe in section 2 the ontology matching process followed by ServOMap. In section 3, we report the evaluation of the system, which was conducted with a set of large biomedical ontologies. We then conclude and outline some perspectives for future work in section 4.

1959-0318/$ - see front matter © 2013 Elsevier Masson SAS. All rights reserved. http://dx.doi.org/10.1016/j.irbm.2012.12.011



Fig. 1.

2. The ServOMap matching approach

ServOMap is based on the ServO OR system, which is a decentralized ontology repository-building tool for managing heterogeneous knowledge resources. The design principle of ServO is based on the analogy that can be made between classical information retrieval (IR) techniques over a corpus of documents and the retrieval of semantic resources available within an ontology. Therefore, an ontology is seen as a corpus of documents to process. Each entity, either a concept or a property (object or datatype), is a semantic document to process. To save space, we refer the reader to [4] for more details about ServO.

In the following sections, we briefly outline the process followed for matching two input ontologies (Fig. 1).

2.1. Computing ontology metrics

After parsing and loading the input ontologies, the first step of the matching process is to compute a set of metrics that are used in the next steps as tuning parameters for the system and for optimization purposes. For an input ontology, these metrics include the list of languages used to denote entity labels or their annotation properties, and the longest set of synonym labels used to describe a concept (Fig. 1).
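As an illustration only, the two metrics named above could be computed as in the following Python sketch. The in-memory layout (a dictionary from concept identifier to a list of `(label, language)` pairs), the function name `ontology_metrics` and the choice to report the size of the longest synonym set are assumptions made for this example; the paper does not describe ServOMap's internal data structures.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form: concept id -> list of (label text, language tag).
Labels = Dict[str, List[Tuple[str, str]]]


def ontology_metrics(labels: Labels) -> dict:
    """Collect the languages used in labels and the size of the largest label set."""
    languages = sorted({lang for entries in labels.values() for _, lang in entries})
    max_synonyms = max((len(entries) for entries in labels.values()), default=0)
    return {"languages": languages, "max_synonyms": max_synonyms}


# Toy ontology with two concepts (illustrative data, not taken from the paper).
toy = {
    "C1": [("heart", "en"), ("cardiac organ", "en"), ("coeur", "fr")],
    "C2": [("lung", "en")],
}
metrics = ontology_metrics(toy)
```

Metrics of this kind can then drive later steps, for example deciding whether language-specific term processing is needed at all.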

2.2. Lexical and contextual indexing

ServOMap constructs an inverted index using the Ontology Indexing Module of ServO, which relies on the Apache Lucene API (http://lucene.apache.org/). According to the parameters computed during the previous step, a description of each entity is generated dynamically: each entity is described according to the features it holds. Thus, some concepts may have synonyms in several languages or may have comments, while others may only have English synonym terms. In this step, several pre-processing operations are performed: lowercasing the labels of entities, stop word removal, word concatenation, etc.
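The indexing step described above can be pictured with a small, self-contained Python sketch. It is not ServO's code and does not use Lucene: a plain dictionary stands in for the inverted index, and the field names (`labels`, `synonyms`, `comments`), the stop word list and the helper names are assumptions for illustration. Word concatenation is left out for brevity.

```python
from collections import defaultdict

# Assumed, deliberately tiny stop word list for the example.
STOP_WORDS = {"of", "the", "and", "a", "an", "in"}


def preprocess(label):
    """Lowercase a label and drop stop words, returning the remaining tokens."""
    return [tok for tok in label.lower().split() if tok not in STOP_WORDS]


def describe(entity):
    """Build a description only from the fields this entity actually holds."""
    doc = {}
    for field in ("labels", "synonyms", "comments"):
        values = entity.get(field)
        if values:
            doc[field] = [tok for value in values for tok in preprocess(value)]
    return doc


def build_index(entities):
    """Map each token to the set of entity ids whose description contains it."""
    index = defaultdict(set)
    for entity_id, entity in entities.items():
        for tokens in describe(entity).values():
            for tok in tokens:
                index[tok].add(entity_id)
    return index


# Two toy entities with different available features (illustrative data).
entities = {
    "e1": {"labels": ["Heart"], "synonyms": ["Cardiac organ"]},
    "e2": {"labels": ["Lung"], "comments": ["Organ of the respiratory system"]},
}
index = build_index(entities)
```

The description of `e1` has no `comments` field and that of `e2` has no `synonyms` field, which is the sense in which the description is dynamic.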

2.3. Computing lexical-based similarity

After the indexing phase, ServOMap proceeds to the computation of the lexical-based similarity. This step relies on the Ontology Retrieval Module of the ServO system and uses a vector-based similarity. Each entity is represented as a vector of terms. According to the flag indicating which ontologies are indexed, ServOMap calls the Ontology Processing Module of ServO to retrieve the entities used for searching over the previously built index. Thus, if both input ontologies are indexed, the first one, say O1, is used as query provider to search over the index I2 of the second ontology and, vice versa, the ontology O2 is used as query provider to search over the index I1 of the first ontology. If the flag indicates that only one of the ontologies is indexed, then ServOMap performs only a one-way search.

The similarity between concepts can include the use of the properties (both object and datatype) attached to them. To do so, dedicated fields are used to complete the description of a concept. They include: PDomain (which stores the list of the domains of all the properties declared for the given concept), PRange for the list of ranges, DDomain for the list of datatype names and DRange for the datatype ranges (String, Date, etc.). This is dynamic as it depends on the features that a particular concept holds.
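The two-way search can be sketched in Python as follows. This is a simplified stand-in under stated assumptions: cosine similarity over raw term counts replaces the scoring actually performed by ServO and Lucene, the threshold value is arbitrary, and the names `cosine`, `search` and `vectors` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(queries, indexed, threshold):
    """For each query entity, keep its best-scoring indexed entity above the threshold."""
    hits = {}
    for q_id, q_vec in queries.items():
        best_id, best_score = None, threshold
        for d_id, d_vec in indexed.items():
            score = cosine(q_vec, d_vec)
            if score > best_score:
                best_id, best_score = d_id, score
        if best_id is not None:
            hits[q_id] = (best_id, best_score)
    return hits


def vectors(ontology):
    """Turn entity id -> token list into entity id -> term-count vector."""
    return {eid: Counter(tokens) for eid, tokens in ontology.items()}


# Toy ontologies O1 and O2 (illustrative data).
o1 = vectors({"a1": ["heart"], "a2": ["left", "lung"]})
o2 = vectors({"b1": ["heart"], "b2": ["lung"], "b3": ["left", "kidney"]})

forward = search(o1, o2, threshold=0.5)   # O1 provides queries over the index of O2
backward = search(o2, o1, threshold=0.5)  # O2 provides queries over the index of O1
```

When only one ontology is indexed, only one of the two calls is possible, which corresponds to the one-way search. How the two result sets are combined is not specified in the text, so the sketch leaves them separate.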

2.4. Computing contextual-based similarity

The context-based similarity is based on the idea that when two entities are similar, it is likely that their surrounding concepts are also similar. The surrounding concepts are the set of ancestor, descendant and sibling concepts. Therefore, in the context-based similarity, the description of a concept is also based on the description of its surrounding concepts. Note that the context-based similarity is applied only to concepts and not to properties. ServOMap restricts the contextual similarity computation to the concepts that have not yet been mapped to any other concept during the lexical similarity phase. The lexical-based similarity is considered the most precise one.
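A minimal Python sketch of this idea is given below, assuming the hierarchy is available as a map from each concept to the set of its direct parents. The function names, the toy hierarchy and the way surrounding tokens are appended to a concept's own tokens are illustrative assumptions, not the ServOMap implementation.

```python
def surrounding(concept, parents):
    """Ancestors, descendants and siblings of a concept (parents: child -> set of parents)."""
    def ancestors(c):
        seen, stack = set(), list(parents.get(c, ()))
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(parents.get(p, ()))
        return seen

    anc = ancestors(concept)
    desc = {c for c in parents if concept in ancestors(c)}
    sib = {c for c in parents
           if c != concept and parents[c] & parents.get(concept, set())}
    return anc | desc | sib


def context_descriptions(tokens, parents, already_mapped):
    """Extend each unmapped concept's tokens with those of its surrounding concepts."""
    out = {}
    for concept, own in tokens.items():
        if concept in already_mapped:
            continue  # lexically mapped concepts are skipped in the contextual phase
        ctx = list(own)
        for other in sorted(surrounding(concept, parents)):
            ctx.extend(tokens.get(other, []))
        out[concept] = ctx
    return out


# Toy hierarchy: organ > heart, organ > lung > left_lung (illustrative data).
parents = {"organ": set(), "heart": {"organ"}, "lung": {"organ"}, "left_lung": {"lung"}}
tokens = {"organ": ["organ"], "heart": ["heart"], "lung": ["lung"], "left_lung": ["left", "lung"]}
ctx_docs = context_descriptions(tokens, parents, already_mapped={"organ", "heart"})
```

The enriched descriptions in `ctx_docs` would then be indexed and searched in the same way as the lexical descriptions.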

Table 1. Parameters used by the two versions of the ServOMap system.

Parameter | ServOMap-lt | ServOMap
Terms processing | The same for all languages | According to the language of the labels
Entities taken into account | Only concepts | All entities
Ontologies indexed | One | Both
Searching strategy | One way | Two ways
Stemming | Yes | No
Arity | 1:n | 1:1


2.5. Refining the mappings provided by the context-based similarity

The mappings obtained with the context-based similarity are less accurate, because the description of a concept is completed by that of its surrounding concepts, which leads to more false positive mappings. The idea is thus to avoid keeping a pair of concepts obtained from the context-based similarity if one of its entries is already mapped to another concept during the lexical process. This strategy, which constitutes the last step, removes most of the unwanted mappings and increases the recall at the same time. However, according to our experiments, the precision obtained with the lexical-based mappings tends to be reduced.

3. Evaluation

We report in this section the results obtained by the system in the experiments conducted during the 2012 Ontology Alignment Evaluation Initiative (OAEI) campaign (http://oaei.ontologymatching.org/2012/). OAEI is an international campaign for the systematic evaluation of ontology matching systems, i.e. software programs capable of finding correspondences (called alignments) between the vocabularies of a given set of input ontologies. The objectives of the OAEI campaign include assessing the strengths and weaknesses of alignment/matching systems and comparing the performance of techniques.

In this edition, 21 systems participated in the 7th competition, which comprised six different tracks for entity matching. LogMap was among the OAEI 2012 participating systems with two different configurations: a light version, LogMaplt, which relies only on lexical-based similarity, and the full LogMap version.

The participating systems follow various approaches, ranging from lexical or structural similarity to logic-based similarity computing. In addition, most of them rely on external resources such as WordNet, the UMLS or Wikipedia.

The ontologies or thesauri involved in the campaign contain a number of entities ranging from tens to hundreds of thousands. The SEALS platform was used for evaluating all the systems [7]. SEALS is an independent, open, scalable, extensible and sustainable infrastructure that allows the remote evaluation of semantic technologies.

ServOMap participated in the campaign with two configurations, which differ by the parameters used to tune the matching process. The main differences between the two versions are given in Table 1.

As can be seen, the first version of the system, which we refer to as ServOMap-lt, uses the same processing technique for the terms of the entities being matched regardless of their language (English, French, etc.). In addition, only concepts are taken into account, contrary to the second version, which we refer to as ServOMap. Also, only one of the input ontologies is indexed with ServOMap-lt. However, it performs 1:n mappings while ServOMap takes into account only 1:1 mappings.

3.1. The dataset used

The evaluation reported here was conducted on the large biomedical ontologies (LargeBio) dataset of the OAEI 2012 campaign. The LargeBio track is one of the most challenging tasks in terms of scalability and complexity. The ontologies in this dataset are semantically rich and contain tens of thousands of classes. Indeed, the track consists of finding alignments between the Foundational Model of Anatomy (FMA), which contains 78,989 concepts, SNOMED-CT, which contains 306,591 concepts, and the National Cancer Institute Thesaurus (NCI), which contains 66,724 concepts. The track consists of three matching problems: the FMA-NCI, FMA-SNOMED and SNOMED-NCI matching problems. Each matching problem is divided into three tasks involving different fragments of the considered ontologies, i.e. small, big and whole. This leads to nine sub-tasks.

The mappings provided by the UMLS Metathesaurus are used as the gold standard and serve as the basis for the track reference alignments [8]. This reference contains 3,024, 9,008 and 18,324 mappings for the FMA-NCI, FMA-SNOMED and SNOMED-NCI tasks, respectively.

3.2. Results

The evaluation was run on a server with 16 CPUs and 15 GB of RAM allocated. Fifteen out of the 21 participating systems/configurations were able to cope with at least one of the tasks of the track's matching problems. The performance of the two versions of the system is given in Table 2 and Table 3, and we refer the reader to the OAEI 2012 web site for the complete results of the evaluation. We have averaged the results obtained over all the sub-tasks (small, big and whole). Thus, instead of presenting separate results, we have one line per matching problem (FMA-NCI, FMA-SNOMED, SNOMED-NCI). The last entry of each table gives the average over the entire LargeBio track and the total computation time.

Table 2. ServOMap-lt performance on the LargeBio dataset.

Task | Precision | Recall | F1-measure | Time (s)
FMA-NCI | 0.931 | 0.8 | 0.86 | 366
FMA-SNOMED | 0.956 | 0.60 | 0.802 | 790
SNOMED-NCI | 0.875 | 0.593 | 0.706 | 1,248
AVERAGE | 0.890 | 0.699 | 0.780 | 2,405

Table 3. ServOMap performance on the LargeBio dataset.

Task | Precision | Recall | F1-measure | Time (s)
FMA-NCI | 0.945 | 0.747 | 0.834 | 327
FMA-SNOMED | 0.953 | 0.656 | 0.777 | 893
SNOMED-NCI | 0.901 | 0.554 | 0.687 | 1,089
AVERAGE | 0.903 | 0.657 | 0.758 | 2,310

FMA: Foundational Model of Anatomy; NCI: National Cancer Institute Thesaurus.
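For readers unfamiliar with the measures reported for the LargeBio track, the following Python sketch shows the standard definitions of precision, recall and F1-measure for a set of mappings compared with a reference alignment. The function names and the toy mappings are assumptions for illustration; the actual OAEI evaluation was run on the SEALS platform and may treat some mappings differently.

```python
def evaluate(found, reference):
    """Precision, recall and F1 of a set of mappings against a reference alignment."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1-measure as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy reference alignment and system output (illustrative pairs, not OAEI data).
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
p, r, f = evaluate(found, reference)

# Consistency check against the FMA-NCI row reported for ServOMap.
fma_nci_f1 = round(f1_from(0.945, 0.747), 3)
```

With the precision (0.945) and recall (0.747) reported for ServOMap on FMA-NCI, the harmonic mean rounds to the reported F1-measure of 0.834.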

Overall, among all the systems that completed at least one task of the LargeBio track, ServOMap-lt provided the best results in terms of F-measure and precision for the FMA-SNOMED task, while ServOMap generated the most precise mappings when all the tasks are averaged, with 90.3%. ServOMap-lt finished second in terms of F-measure with 78%, closely behind the YAM++ system (78.2%) [9]. Regarding computation time, ServOMap finished all nine tasks in 2,310 seconds (38.5 min), in second position behind the LogMaplt system (711 seconds) [3], while YAM++ completed them in 18 hours.

4. Conclusion

We have briefly described the ServOMap approach for large-scale ontology matching. The system, which participated in the 2012 edition of the Ontology Alignment Evaluation Initiative, showed very promising results and is ranked among the top three systems in the large biomedical track both in terms of F-measure and computation time. However, ServOMap relies heavily on the lexical description of the entities being mapped.

As future work, we intend to investigate the following directions. First, we plan to improve the algorithm used for filtering out the mappings provided by the context-based matching in order to increase recall without reducing precision. ServOMap does not use any external resource in the similarity computing process; we intend to use the UMLS resource to better discard incorrect mappings. Moreover, the current version can neither compute oriented mappings nor map two ontologies described in two different languages (e.g. English vs. French). Thus, one improvement of the system is the implementation of a cross-lingual ontology matching approach and the investigation of the oriented mappings issue. Finally, we plan to introduce a logical assessment of computed mappings [10] and to implement a user-friendly interface.

References

[1] Euzenat J, Meilicke C, Stuckenschmidt H, Shvaiko P, Trojahn C. Ontology alignment evaluation initiative: six years of experience. J Data Semantics 2011.
[2] Shvaiko P, Euzenat J. Ontology matching: state of the art and future challenges. IEEE Trans Knowledge Data Eng 2013;25(1):158–76, http://dx.doi.org/10.1109/TKDE.2011.253.
[3] Ruiz EJ, Grau BC, Zhou Y, Horrocks I. Large-scale interactive ontology matching: algorithms and implementation. In: Proceedings of the 20th European Conference on Artificial Intelligence (ECAI). 2012. p. 444–9.
[4] Diallo G. Efficient building of local repository of distributed ontologies. In: IEEE proceedings of international conference on signal-image technology and internet based systems. 2012. p. 159–66.
[5] Kiryakov A, Damova M. Storing the semantic web: semantic repositories. In: Semantic web handbook. Heidelberg, Germany: Springer Verlag; 2011.
[6] Schenk S, Petrak J. Sesame RDF repository extensions for remote querying. In: ZNALOSTI conference. 2008.
[7] Esteban-Gutiérrez M, García-Castro R, Gómez-Pérez A. Executing evaluations over semantic technologies using the SEALS platform. In: Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST2010). 2010, volume 666, http://CEUR-WS.org
[8] Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32(database issue):D267–70.
[9] Ngo D, Bellahsene Z. YAM++: a multi-strategy based approach for ontology matching task. EKAW 2012:421–5.
[10] Meilicke C, Stuckenschmidt H, Sváb-Zamazal O. A reasoning-based support tool for ontology mapping evaluation. ESWC 2009:878–82.