a survey on large scale schema and ontology matching techniques

8/3/2019 A Survey on Large Scale Schema and Ontology Matching Techniques

1/7

A Survey on Large Scale Schema andOntology Matching Techniques

K. Saruladha, Dr. G. Aghila, and B. Sathiya

AbstractWith the fast growth and the increasing usage of the large variety of data like XML schemas, ontologies. In manydomains, the heterogeneity among the data increases enormously. Hence solving such heterogeneity needs matching

techniques. In areas like E-business, web and life sciences the size of XML schema and ontology is large. But most of the

existing matching techniques address only small match tasks. So, we present an overview of recently proposed matching

techniques like early pruning of the search space, divide and conquer strategies, parallel matching and some renowned

ontology matching tools which achieve high match efficiency or/and high quality for large-scale matching. In addition to this the

pros and cons of various matching techniques is summarized.

Index TermsSimilarity Measure, Schema Matching, Ontology Matching, Ontology Alignment.

1 INTRODUCTION

Ngeneral

relational

and

XML

database

is

called

as

Schemaandthedatabaserepresentedbysemantic lan

guage likeWebOntologyLanguage (OWL)orResource

DescriptionLanguage(RDF)iscalledasOntology.Match

ing technique finds semantic correspondences between

cartesianproductofentitypairof thegiven twoontolo

gies. These correspondences may stand for equivalent,

subsumption, or disjoint relationbetween the ontology

entities.Theresultofmatchingtechniqueisthesetofse

mantic correspondences called alignments. Fig.1 depicts

the general framework for the schema matching tech

niques which is also applicable for ontologymatching

techniques.As

shown

in

figure

the

input

of

the

matching

technique istwoschemaswhicharefirst importedintoa

format suitable forprocessing.For a faster computation

the schemamaybe preprocessed, e.g., preparing neig

bourlistforeachentitytofastenthestructuralsimilarity

calculation, removing redundant information from sche

ma,etc.Thesimilarityvalueiscalculatedforallcartesian

product entitypairsone from eachof twopreprocessed

schemas using a match workflow consisting of set of

matcher (e.g., lexical, structural and semanticmatcher).

Thesemanticvalueranges from0 to1, i.e.,value1 indi

catesequivalentandvalue0meansdisjointrelation.The

matcherworkflow

can

be

sequential,

parallel,

iterative

or

in some mixed fashion. The similarity value obtained

fromthematchworkflowcanbeaggregatedtoobtainthe

final alignment. For large scale schema matching the

workflowwill

slightly

vary

to

incorporate

techniques

like

early pruning andpartitioning of schema to reduce the

searchspace,parallelization,etc.

According toShvaikoPandEuzenatJ [11]oneof the

toughestchallengeformatchingsystemishandlinglarge

scaleschemasorontologyandonthebasisofRahmE[12]

work domain where large scale matching is necessary

includeebusiness[13],webdata[14],lifescienceontolo

gies,andmedicine [15] andweb directories [16]. Even

thoughtheadvancementsaremadeinthecurrentmatch

ing systems, they are stillunable to achievegood effec

tiveness (correctnessandcompletenessof thealignment)

andgood

efficiency

(time

and

space

efficiency).

The

effec

tiveness ismeasuredby precision and recallwhile the

efficiency ismeasured by execution time andmemory

used.

The remainder of thispaper is organized as follows:Section 2 discusses various large scale matching techniques. Section 3 gives abrief comparison of the largescalematching techniques. Finally, Section 4 concludesthesurveyofontologymatchingtechniques.

Fig. 1 General Framework for Matching Process

Mrs. K. Saruladha is with the Computer Science Department, Pondi-cherry Engineering College, Puducherry, Pin 605014, India.

Dr. G. Aghila is with the Computer Science Department, PondicherryUniversity, Puducherry, Puducherry, Pin 605014, India.

Ms. B. Sathiya is with the Computer Science Department, PondicherryEngineering College, Puducherry, Pin 605014, India.

I

2011 Journal of Computing Press, NY, USA, ISSN 2151-9617

http://sites.google.com/site/journalofcomputing/

InputSchemas

Preprocessor

Preprocessor

Similarity

Computation

Similarity

AggregationAlignments

JOURNAL OF COMPUTING, VOLUME 3, ISSUE 10, OCTOBER 2011, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING

WWW.JOURNALOFCOMPUTING.ORG 55


2/7

2 APPROACHESUSEDFORLARGESCALEMATCHINGTECHNIQUE

In this section, we discuss about various large scale

matching techniques.Thematching techniques are cate

gorizedintofourmajorcategoriesasshownbelow

1) Earlypruningstrategybasedmatchingtechnique

a.

Quick

Ontology

Matching

algorithm

(QOM)

b. Eric peukert et al. schema and ontology

matchingalgorithm

2) Partitionbasedmatchingtechnique

a. Coma++ schema and ontologymatching al

gorithm

b. FalconAOontologymatchingalgorithm

c. Taxomapontologymatchingalgorithm

d. Anchorfloodontologymatchingalgorithm

3) Parallelmatching technique

a. Grossetal.ontologymatchingalgorithm

4) Othermatchingtool

a. RiMOMontologymatchingtoolb. ASMOV (AutomatedSemanticMatchingof

Ontologies with Verification) ontology

matchingtool

c. Agreementmaker schema and ontology

matchingtool

2.1Early Pruning Strategy Matching TechniqueTheaimofearlypruningtechniqueistoreducethesearch

space formatching, e.g., onematcher can prune entity

pairswhose semantic correspondencevalue isvery low,

thus reducing search space for the subsequentmatcher.

Thisidea

is

suitable

for

both

sequential

and

iterative

matcherworkflow.QOM [1]was the first system to im

plement this ideausing iterativematcherworkflowand

Peukert et al. [2] System implemented both matcher

workflows.

2.1.1 QOMQOM is theontologymatchingalgorithmadapted from

theNaiveOntologyMatching (NOM). Thebasicdiffer

encebetweenNOMandQOMisQOMusesheuristicand

dynamic programming approach to reduce the search

space with marginally compromising on effectiveness.

QOM isan iterativematcher consistingof the following

six

steps

to

match

the

ontologies:

Feature

engineering,

search step selection, similarity computation, similarity

aggregation,interpretationanditeration.

FeatureEngineering.Theprocessofdeterminingwheth

ertwoentitiesaresameornotisbasedontheentityfea

tures.RDF features likeentitiesunifiedresource identifi

ers (URIs), its label, its parent entities, similar entities,

childrenentitiesanddomainspecificfeaturesareusedfor

matching.Thisstepconstructsthesearchspaceofontolo

gy entitiesby computing the cartesianproductofentity

pairsusingentityfeaturesofbothontologies.

SearchStepSelection.Theentitypairsobtainedwillbe

pruned using heuristics strategies to reduce the search

space contructed in the previous step. The selected

matchingpairswillbeprocessed forsimilaritycomputa

tionandtheirinterpretation.Onceagainthesystemhasto

determinewhichmatchingpairstoaddtotheagendafor

thenext iteration.Theheuristics strategies arebasedon

labelsof entities, entityneighbour to thealignmentsob

tainedfrom

the

previous

iteration,

hierarchy

of

entities,

randompickorcombinationoftheabovestrategies.

Similarity Computation. The similarity between the

above selected entity pairs can be measured by wide

range of similarity functions likeLevenshteins editdis

tance,Dicecoefficientandcosinesimilarityvalue .Based

on the feature of the entity any of the above similarity

functioncanbechosen.

SimilarityAggregation.QOMusesweightedaverageof

similarityvaluesforagivenentitypair.Theweightofthe

similarityvaluesiscalculatedbysigmoidfunctionwhich

emphasizeshighindividualsimilaritiesanddeemphasiz

eslowindividualsimilarities.

Interpretation.First,

matching

pairs

with

similarity

value

less than a threshold isdiscarded.Next similaritypairs

whichviolatesbijectivity(onetooneandontoconstraint)

ofthematchingarediscarded.

Iteration. Initial iteration uses lexical knowledgewhile

later iterations use ontology structure knowledge for

matching.The number of iteration required irrespective

ofontologysizebasedontheirexperimentsisten.

BasedontheevaluationQOMisnearlytwentytimesfast

erthanNOM.ThedrawbackofQOMisthat,effectiveness

ismarginally lower thanNOM sincenotall entitypairs

areevaluated.

2.1.2 Peukert et al. MethodPeukertetal.proposedaschemaandontologymatching

algorithm.Itisarulebasedoptimizationtechniquewhich

rewritesmatchworkflow to improve performance. The

filteroperatorswithinmatchworkfloweliminatedissimi

larentitypairs (whosesimilarityvalue is less thansome

threshold) from intermediatematch results and thus re

ducingsearchspaceforfurthermatchers.Thethresholdis

either statically predetermined or dynamically derived

fromthesimilaritythresholdused inthepreviousmatch

workflowtoselectalignment.

TheinputofthisprocessisXMLschemas,metamod

els

or

ontologies.

These

input

formats

are

converted

into

a

standard internal format. The user must design the

matchingprocesswhich isrepresentedasagraphwhere

theverticesrepresentmatcherfromthelibraryandedges

representexecutionorderanddataflow.Sincethegraph

isdesignedbytheuserthematchworkflowcanbeparal

lelor serialor iterative.Theuser canutilize the rewrite

recommendation system, to increase efficiency of the

matchingworkflow.These rewrite recommendation sys

tem is similar to costbased rewrites in database query

optimizatione.g,theuseofrewriterulesaresimilartothe

useofpredicatepushdownrulesfordatabasequeryop





3/7

timizationwhich reduce thenumberof input tuples for

joinsandotherexpensiveoperators.Asimplecostmodel

isusedtochoosewhichpartsofamatchingworkflowto

rewrite.Theserewritescanbeacceptedorrejectedbythe

user.Once thematching process graph is finalised the

graphcanbeexecutedtoobtainthealignment.

Thedrawbacksof thismethodareasfollows.Rewrite

rules,only

focus

on

efficiency

improvements

but

not

on

effectiveness improvement.Thematcher library consists

ofonlyfewnumbersofmatchers.

2.2 Partition Based Matching Technique

Theideaofthisapproachistopartitiontheontologyand

executeapartitionwisematching.Thepartitioningisper

formedinsuchawaythateachpartitionoffirstontology

ismatchedwithonlysmallsubsetofthepartitions(ideal

ly,onlywithonepartition)of the secondontology.This

methodreducesthesearchspaceandthusbetterefficien

cy.Space complexityof thematchingprocess is also re

duced. Four partitionbasedmethodsCOMA++ [5], Fal

conAO

[3],Taxomap

[4],

anchor

flood

[6]

will

be

dis

cussedbelow.

2.2.1 COMA++COMA++ is a generic schema and ontology matching

toolwith a library ofmatchers and a flexible option to

combinethematcherstorefinethematchingresults.This

methodcanprocessXMLschema,relationaldatabaseand

OWLontology.First the schema isprocessed to identify

the relevant components (node, path or fragment) for

matching. Then depending on the component chosen a

matcherworkflowisusedtocomputecomponentsimilar

ities.Finallysimilaritycombinationmethodsareused to

findthe

correspondences

between

components.

Fragmentcomponentbasedmatchingprocess iscapa

ble of handling large scale schema.A rooted subgraph

downtothe leaf level intheschemagraph isreferredas

fragment. Ingeneral, fragments shouldhave littleorno

overlap to avoid repeated similarity computations and

overlappingmatch results.The three typesofautomatic

fragmenttypesareschema,subschema,andshared frag

ments.Usercanalsoselectasubstructureasafragment.

Based on the fragment type, the schema is fragmented

andfragmentpairsfromeachontologyismatchedtofind

the most similar fragment pair.The search for similar

fragments

is

some

kind

of

light

weight

matching,

e.g.,

basedonthesimilarityofthefragmentroots.Nextsimilar

fragmentpair ismatchedelementwiseto find thealign

ments.

Thedrawbacksofthismethodareasfollows.COMA++

isdevelopedforXMLschemasandhenceitisnotsuitable

forcomplexontologygraph.Ituserelativelysimpleheu

risticrules topartition the inputschemasresultingoften

in too few or too many partitions and to find similar

fragmentpaironlytherootnodefeaturesareusedwhich

willleadtolessmatchingquality.

2.2.2 Falcon-AO

FalconAO is an ontologymatching toolwhich aims at

finding alignments of the given twoOWL/RDF ontolo

gies.Thematcher libraryofFalconAO consistofVDoc

[17] and ISub [19] which are lightweight linguistic

matchers,GMO [18] an iterative structuralmatcher and

PBM(PartitionBasedMatcher)whichadoptsthedivide

and

conquer

strategy

to

find

block

mappings

between

largescale ontologies. The three phase of PBM are ex

plainedbelow

Partitioning ontologies: Structural proximitiesbetween

entitiesare calculatedbasedonhow closely theyare re

latedinthehierarchies.Theontologyclustersareformed

based on structural proximities using modified ROCK

[20] clustering algorithm. RDF Blocks are constructed

fromtheclustersbyassigningeachRDFtripletoacluster

inwhichatleasttwoentitiesarecontained.

Matchingblocks:Thealignmentwithhighsimilaritiesis

referred as anchors. A lightweight string comparison

technique, ISUB, is firstly employed to exploit anchors

betweentwo

full

ontologies

and

then

the

blocks

from

the

twoontologiesarematchedbasedon thedistributionof

theanchors.

Discovering alignments:VDOC adopts a linguistic ap

proach toontologymatching.GMO isan iterativestruc

turalmatcher.Similaritycombinationisaheuristicstrate

gytotunethethresholdsoftheabovetwomatchers.

Thedrawbacksof thismethodareas follows.Theentire

ontologyneeds tobeprocessedtofindanchorsandthus

efficiencyofFalconAOisreduced.Maximumnumberof

entityinaclusterisdeterminedbytheGMOmatcherand

theclusteringalgorithm terminatesabruptly if it reaches

maximum

number

which

will

lead

to

poor

clustering.

2.2.3 TaxomapTaxomapisanontologymatchingalgorithmconsistingof

twopartitioningalgorithmnamelyPartitionAnchorParti

tion(PAP)andAnchorPartitionPartition(APP)whichhave

been designed to take the alignment objective into ac

count in the partitioning process. Themost structured

ontologyisreferredastargetontologyandthelessstruc

tured is referred as sourceontology.PAP is suitable for

structuredontologyvsunstructuredontologyandAPPis

suitable for structured ontology vs structured ontology.

Theentitypaironefromeachoftwoontologywhichhas

identicallabelsiscalledasanchorswhichwillbeusedin

bothPAPandAPP.Thealignmentisbasedonlexicalandstructure(subclass)similaritymeasure.

PartitionAnchorPartition(PAP):

1. UsePBMmatchertopartitionthetargetontologyinto

setofblocks.

2. Identify the set of anchorsbetween two ontologies.

This setwillbe the centerof the futureblockwhich

willbegeneratedfromthesourceontology.

3. Use PBM matcher to partition the source ontology

aroundtheidentifiedcenters.

4. Aligneachblockwiththecorrespondingblock.





4/7

AnchorPartitionPartition(PAP):

1. Identifythesetofanchorsacrossontologies.

2. Partitionboththetargetontologyandsourceontology

by modifying PBM matcher in order to take into

accountsharedanchors.

3. Alignblocksthatsharemaximalnumberofanchors.

The

drawbacks

of

this

method

are

as

follows.

The

ef

fectivenessof thismethoddependson theavailabilityof

identical labels across ontologies. Only labels and hie

rarchystructureisusedformatchingandhencecompara

tivelylessrecall.

2.2.4 AnchorFlood

AnchorFlood is an ontologymatching algorithmwhich

implements a dynamic partition based matching that

avoidstheprioripartitioningoftheontologies.Thesimi

laritymeasure for alignment isbased on internal struc

ture,externalstructureandJaroWingler stringdistance.

The process starts ofwith the set of anchorswhere an

anchor is defined as a exact stringmatch of concepts,

propertiesorinstancespair.The algorithm preprocesses ontologies to normalize

the textual contents of entities. It startswith an anchor

and collects neighboring entity of the chosen anchor

acrossontologies.Thesegmentpairconstructedfromthe

abovecollectedneighbors isprocessed toproducealign

mentpairs.Theprocess repeatsuntileitherall the col

lected entity are explored, or no new aligned pair is

found.Theanchorwillbediscardedasmisalignment if

theconstructedsegmentof theanchordoesnotproduce

sufficientalignments.

Thedrawbacksofthismethodareasfollows.Semantic

similaritybetween

entities

is

not

explored.

It

ignores

cer

tain distantly placed aligned pairs since segments are

constructed fromneighborofanchors.Onlyexact string

match of concepts, properties and instances are consi

deredasanchorswhichleadtoinefficientalignments.

2.3 Parallel Matching Technique

A straightforwardmethod to increase the efficiency of

largescalematching is toexecutematcher inparallelon

severalprocessors.Thetwokindsofparallelmatchingare

intermatcher and intramatcher parallelization. Inter

matcher parallelization dealswith parallel execution of

independently executablematchers while intramatcher

parallelizationdeals

with

internal

decomposition

of

indi

vidualmatchersormatcherpartsintoseveralmatchtasks

thatcanbeexecuted inparallel.Grossetal.[7]matching

system implementsbothintermatcherandintramatcher

parallelizationwhichwillbediscussedbelow

Gross et al. proposed a parallel ontology matching

system with a distributed infrastructure to incorporate

intramatcherandintermatcherparallelism.Theelement

level and structurelevelmatching are also parallelized.

For intramatcher parallelism, a sizebased partitioning

algorithm hasbeen proposedby this system leading to

better loadbalancing, limitedmemoryconsumptionand

scalability without reducing the effectiveness of match

results.

The system first generates the context attributes for

each entity. Then the ontology is partitioned into set of

partitionbased on the sizebasedpartitioning algorithm

achievingintra

matcher

parallelism.

Now

the

partition

pairsareconstructedonefromeachofthetwoontologies.

Eachpartitionpair is assigned aprocessor.Within each

processorthepartitionpairisprocessedinparallelbythe

elementlevel,structurelevel, instancebasedmatchersso

as to achieve inter matcherparallelism.Thealignments

fromallthematchercanbeaggregatedtooutputthefinal

alignment.

The drawbacks of this method are as follows.The

matcherlibraryconsistsofonlyfewnumberofmatchers.

The partition algorithm uses only simple strategies for

partitioningwhichcanbeimproved.

2.4 Other Matching ToolsRiMOM [8] is the firstontologymatchingalgorithmde

velopedwithautomaticanddynamicselectionofmatch

ers for ontology alignment tasks. The input ontology

shouldbe inOWL format. Itconsiders lexical,structural

and instancesimilarities.Basedonthe featuresofthe in

put ontology and the predefined rules, appropriate

matchers are chosen to apply for thematching task.Ri

MOMconsistofsixstepsandareexplainedbelow.

OntologyPreprocessingandFeatureFactorsEstimation.

Foreachentityofbothontologies,generatethefeaturesof

theentitylikeitsname,label,children,etc.Thenthelabel

and

structure

similarity

of

the

entity

pairs

are

calculated

whichwillbeusedinthefollowingstep.

Strategy (Matcher)Selection.Thebasic ideaof strategy

selection is that, if two ontologies have some feature in

common, thenmatcherbased on these feature informa

tionareemployedwithhighweight;andifsomefeature

factorsaretwo low,then thesematchermay notbeem

ployed.Theentitiesarefirstlinguisticallymatched;struc

turalmatchingisonlyappliediftheschemasexhibitsuf

ficientstructuralsimilarity.

Single strategy execution. The selected strategies from

theabovestepisusedtofindthealignmentindependent

ly.Eachstrategyoutputsanalignmentresult.

Alignmentcombination.

In

this

phase,

RiMOM

combines

thealignmentresultsobtainedby theselectedstrategies.

The combination is conductedby a linear interpolation

method.

Similaritypropagation (optional). If the two ontologies

havehighstructuresimilarityfactor,RiMOMemploysto

findnewalignmentaccording to the structural informa

tion.

Alignment refinement. It refines the alignment results

fromtheprevioussteps.Itdefinedseveralheuristicrules

toremovetheunreliablealignments.





5/7

TABLE 1Comparison of Matching Technique

*Notavailable**optional

The drawback of RiMOM is its inefficiency for dealing

withlarge

scale

ontologies.

Eventhough

it

shows

avery

good effectiveness for large scale ontologies it consume

longtimeand largeamountofmemorysince itdoesnot

incorporate search space reduction techniques like early

pruning,partitionofontologyorparalleltechnique.

ASMOV(Automated Semantic Matching of Ontologies

with Verification) [9] is an iterative ontologymatching

algorithm. The strength of ASMOV lies in the post

processing of the alignment to remove the alignments

whicharesemanticallyinconsistent.ItusesWordNetand

theUnifiedMedicalLanguageSystem(UMLS)toincrease

theeffectiveness.The inputontologyshouldbe inOWL

DLformat.

The

input

of

ASMOV

is

two

ontologies

and

an

optionalpredeterminedalignmentset.Thelexical,struc

tural and instance based similarity value between all

possiblepairsofentities,onefromeachofthetwoontol

ogies is calculated and is used as the optional input

alignmenttoadjustanycalculatedmeasures.Asimilarity

matrix containing the calculated similarity values for

everypairofentitiesisobtainedfromtheabovestep.For

eachentitychoose themaximumsimilarityvalueaspre

alignmentfromthesimilaritymatrix.

Thisprealignmentmustundergoaprocessofseman

ticverification,whichisanextensivepostprocessingto

eliminate potential inconsistencies among the set pre

alignment.Five

different

kinds

of

inconsistencies

are

checked.One such inconsistency rule isMultipleentity

correspondences,e.g., if twoalignments (a,b)and (b, c)

exist then theremustalsobealignment like (a, c), ifnot

the above two alignment cannotbe verified and hence

removed.Theoutputoftheabovestepisthesemantically

verified similaritymatrix,which is then testedagainsta

termination condition. If this condition is true,nomore

iteration is needed and the process stops.The resulting

alignmentisfinalalignmentset.

Thedrawbacksofthissystemareasfollows.Whenthe

given two ontologies are dissimilar the effectiveness is

decreased. Sometime semanticverification system elimi

natestoo

many

alignments

leading

to

less

recall.

For

each

iteration, theASMOVneedspolynomial time, thus lead

ingtoinefficiency.

AgreeementMaker[10]isaschemaandontologymatch

ing algorithm consisting of wide range ofmatcher for

lexicalandstructuralfeaturesoftheontology.Itprovides

bothserialandparallelmatcherworkflows.Thestrength

of the system lies inGUIwhich enable user to choose,

control and execute the iterativematchers and their re

sults.ThroughGUItheusercanchoosematcherfromthe

matcher librarybased on thematching granularity (ele

QOM Peukert

etal.

COMA++ Falcon

AO

Taxo

map

Anchor

Flood

Gross

etal.

RiMOM ASMOV Agreement

Maker

Inputas

XMLSchemas

No Yes Yes No No No No No No (Yes)

Inputas

Ontologies

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

GUI No Yes Yes (Yes)** No No No No No Yes

Linguistic

Matcher

Yes Na* Yes Yes Yes Yes Yes (Yes) Yes Yes

Structural

Matcher

Yes Na Yes Yes Yes Yes Yes (Yes) Yes Yes

Instance

Matcher

Yes Na Yes No No No Yes (Yes) Yes Yes

Earlypruning Yes Yes No No No No No No No No

Schemaparti

tion

No No Yes Yes Yes Yes Yes No No No

Parallel

match

ingNo

(Yes)

No

No

No

No

Yes

No

No

No

Dyn.matcher

selection

No No No No No No No Yes No No

Mapping

reuse

No Yes Yes No No No No No No No

OAEIpartici

pation

No No Yes Yes Yes Yes No Yes Yes Yes

Useofexternal

dictionary

Yes No Yes No No No No Yes Yes Yes





6/7


7/7

in Distributed computing, information security and ontology basedinformation retrieval. She is currently pursuing her Ph.D. in ontologybased information retreival systems.

Dr. G. Aghila working as Professor in Pondicherry University, Indiahas got a total of 21 years of teaching expereince. She has graduat-ed from Anna University Chennai, India. She has published nearly40 research papers in web crawlers, ontology based informationretrieval. She is currently guiding 8 Ph.D. scholars. She was in re-ceipt of schrneiger award. She is an expert in onology development.Her area of interest includes artificial intelligence, text mining andsemantic web technologies.

B. Sathiya is a post graduate student pursuing her M.Tech in Com-puter Science and Engineering (Distributed Computing Systemsspecialization) in Pondicherry Engineering College, Puducherry ,India.




a survey on large scale schema and ontology matching techniques

Documents