on matching large life science ontologies in parallel2 matching large life science... · • needed...
TRANSCRIPT
ON MATCHING LARGE LIFE SCIENCE
ONTOLOGIES IN PARALLEL
Anika Groß, Michael Hartung, Toralf Kirsten, Erhard Rahm
Database Group Leipzig
http://dbs.uni-leipzig.de
Interdisciplinary Centre for Bioinformatics
http://www.izbi.uni-leipzig.de
Göteborg, 25th August 2010
ONTOLOGY MAPPINGS
• Ontologies are often highly related
• Identify overlapping information
• Crucial for data integration, enhanced dataanalysis across ontologies, ontology merging, …
On Matching Large Life Science Ontologies in Parallel 2 / 20
GALEN MouseAnatomy
NCIThesaurus
SNOMEDCT
FMA
NCI Thesaurus(Anatomic Structure..)
Vertebra
Sacral_Vertebra
Sacrum
S2_Vertebra S1_Vertebra…
……
Adult Mouse Anatomical Dictionary
vertebra
sacral vertebra
pelvis bone
sacral vertebra 2sacral vertebra 1 …
… …
• Manual creation is complex or even infeasible
• (Semi-) automatic determination of similarities between ontology elements to find correspondences
• Ontology Mapping = set of semantic correspondences between concepts of different ontologies
?
?
??
Preprocessing
• Use metadata (element-, structure-level), instance data
• High quality mappings
• Combined execution of several matchers in more complex match workflows
On Matching Large Life Science Ontologies in Parallel 3 / 20
• Time consuming and memory intensive task
ONTOLOGY MATCHING
… …
O1
O2
… …
Combination SelectionMatchResult
Mapping
M2
M1
M3
Matcher execution
→ Long execution times, high memory requirements
→ Reasonable to improve efficiency
EXAMPLE – MATCH SUB-ONTOLOGIES OF GENE ONTOLOGY
On Matching Large Life Science Ontologies in Parallel 4 / 20
Molecular Function
Biological Process
2*104
104
Evaluation ofCartesian product
2·108 comparisons
Multiplied by number of different
matchers
6·108 comparisons
HOW TO DEAL WITH PERFORMANCE AND MEMORY ISSUES?
On Matching Large Life Science Ontologies in Parallel 5 / 20
→ Further methods
• Reduction of search space: Avoid evaluation of Cartesian product using e.g. early pruning, cluster- or fragment-based methods
• Reuse of match results: Save recomputation (ontology evolution)
• …
→ Combination of approaches
→ Parallelization
• Parallel execution of ontology matching on multiple compute nodes
• Use the broad availability of multi-core systems
• Reduce execution time, memory requirements
CONTRIBUTIONS
• Two strategies for parallel ontology matching• Inter-matcher parallelization: execute independent matchers in parallel
• Intra-matcher parallelization: internal parallelization of matchers based on partitioning of the ontologies
• Parallelization of different kinds of matchers• Element-level, structure-level, instance-based matchers
• Implementation and evaluation• Distributed infrastructure for parallel ontology matching
• Evaluation of the approaches for matching large life science ontologies
On Matching Large Life Science Ontologies in Parallel 6 / 20
OVERVIEW
• Motivation
• Parallelization Strategies
• Inter-matcher Parallelization
• Intra-matcher Parallelization
• Infrastructure
• Evaluation
• Conclusion & Future Work
On Matching Large Life Science Ontologies in Parallel 7 / 20
INTER-MATCHER PARALLELIZATION
• Parallel execution of independently executable matchers
• Process matchers on different cores or computing nodes
On Matching Large Life Science Ontologies in Parallel 8 / 20
+ Improve execution time up to factor n (n=|matchers|)
– Limited degree of parallelization (|independently executable matchers|)
– Slowest matcher limits the achievable speedup
– Memory requirements are not reduced since matchers evaluate complete ontologies
...
M1
O2
O1
Mn
...MatchResult
M1
O2
O1
M3
...MatchResult
M2
INTRA-MATCHER PARALLELIZATION
• Internal decomposition of individual matchers or matcher parts
• Partitioning the input data (ontologies)
• Many small match tasks can be executed in parallel
On Matching Large Life Science Ontologies in Parallel 9 / 20
...
...Ontology Partitioning
M11
M1k
O2
O1
∪
...
Ontology Partitioning
Match Task Generation M1
Match Task Generation Mn
MatchResult
...
Mn1
Mnk
∪
...
+ Match tasks of limited complexity → reduced memory and processing requirements
+ Applicable to sequential and independently executable matchers
• Each match task matches one O1 against one O2 partition
SIZE-BASED PARTITIONING
• Enable parallel matching of Cartesian product• Partition input ontologies into partitions of equal size (|concepts|)
On Matching Large Life Science Ontologies in Parallel 10 / 20
...
p10
p1
...
p1
p10
+ Scalable to large ontologies+ Good load balancing + Optimizes performance without sacrificing match quality+ Applicable to element-level, structure-level and instance-based matchers
… … … …
|O1|= 10,000concepts
Generate 100match tasks
|O2|= 10,000concepts
Acc: c1
Name: c1.name
Acc: c3
Name: c3.name
Acc: d1
Name: d1.name
Acc: d2
Name: d2.nameAcc: d3
Name: d3.name
Acc: d4
Name: d4.name
Acc: d5
Name: d5.nameAcc: c2
Name: c2.name
PARALLELIZATION OF ELEMENT-LEVEL MATCHERS
• Needed information is directly associated with concepts themselves
On Matching Large Life Science Ontologies in Parallel 11 / 20
→ Match-information is retained
• Attribute values like names, synonyms …
0.70.90.9
0.7
Acc: c1
Name: c1.name
Acc: c3
Name: c3.name
Acc: d1
Name: d1.name
Acc: d2
Name: d2.nameAcc: d3
Name: d3.name
Acc: d4
Name: d4.name
Acc: d5
Name: d5.nameAcc: c2
Name: c2.name
c∈O1, d∈O2
→ Easy partitioning of ontologies into subsets of concepts for element-level matchers
• Needed information is not provided by concepts themselves
• Use information from the neighborhood of concepts, e.g. children, siblings, namePath, …
PARALLELIZATION OF STRUCTURE-LEVEL MATCHERS
On Matching Large Life Science Ontologies in Parallel 12 / 20
Approach
• Extend the concept-level information within special multi-valued context attributes, e.g. Child, NamePath, …
• Preprocessing step of linear effort
→ Size-based partitioning as for element-level matching
• Note: parallelization of similarity propagation algorithms is more difficult (need the whole structural information)
→ Only parallelization of the initial matcher
• Average element similarity between children of concepts
→Compare all their Child attributes
Acc: c1
Name: c1.name
Acc: c3
Name: c3.name
Acc: d1
Name: d1.name
Acc: d2
Name: d2.nameAcc: d3
Name: d3.name
Acc: d4
Name: d4.name
Acc: d5
Name: d5.nameAcc: c2
Name: c2.name
EXAMPLE: CHILD MATCHER
• Use a multi-valued Child context attribute: get name values of child concepts in a preprocessing step
On Matching Large Life Science Ontologies in Parallel 13 / 20
corr(c,d) simChildren(c,d)
c1 – d1 (0.9 + 0.7 + 0.7 + 0.9) / (2·2) = 0.8
c1 – d4 (0.7 + 0 + 0 + 0.9) / (2·2) = 0.4
• Similarly applicable for other local context matchers, e.g. namePath
0.8 0.4
Acc: c1
Name: c1.nameChild: c2.nameChild: c3.name
Acc: c3
Name: c3.nameChild: …
Acc: d1
Name: d1.nameChild: d2.nameChild: d3.name
Acc: d2
Name: d2.nameChild: …
Acc: d3
Name: d3.nameChild: …
Acc: d4
Name: d4.nameChild: d3.nameChild: d5.name
Acc: d5
Name: d5.nameChild: …
Acc: c2
Name: c2.nameChild: …
0.7 0.90.90.7
INFRASTRUCTURE INTRA-MATCHER PARALLELIZATION EXAMPLE
On Matching Large Life Science Ontologies in Parallel 14 / 20
CPU1
CPU2
CPU1
CPU2
Data Service
… …… …
Workflow Service
...
c1c2
ck-1ck
...
c1c2
cl-1cl
Concepts with context-attributes
...
pn
p1
...
p1
pm
Partitions (sets of concepts)
Match Service 1
Match Service 2
j1j1
j2
j3
j4
js
j2
j3
j4
js
Job queue
......
INFRASTRUCTURE – INTRA-MATCHER PARALLELIZATION EXAMPLE
On Matching Large Life Science Ontologies in Parallel 15 / 20
CPU1
CPU2
CPU1
CPU2
Data Service
Workflow Service
CPU1...
...
...
...
...
...
...
Job executionUnify Partial Match Results
Match Service 1
Match Service 2
Further aggregation,selection, …
in the match workflow...
...
...
...
• Two match problems to evaluate efficiency (execution times)
EVALUATION
On Matching Large Life Science Ontologies in Parallel 16 / 20
Adult Mouse Anatomical Dictionary
NCI Thesaurus Anatomic Structure or Substance
Biological Processes
Molecular Functions
~2,700 · ~3,300 = 9 · 106 comparisons
~9,400 · ~17,100 = 1.6 · 108 comparisons
• Matchers (element and structure-level)
• NameSynonym (NS) Max TriGram similarity for the name and multi-valued synonym attr.
• Children (CH) Avg TriGram similarity on the context attributes Child
• NamePath (NP) Avg TriGram similarity on the context attributes NamePath
• Computing environmentFour nodes consisting of four cores (up to 16 cores)
Medium
Scale
Large
Scale
• 1 node, 4 cores, 8 threads
INTRA-MATCHER PARALLELIZATION OF INDIVIDUAL MATCHERS
On Matching Large Life Science Ontologies in Parallel 17 / 20
• ↑↑↑↑ degree of parallelization → ↓↓↓↓ execution times
• Almost linear speedup for up to 4 threads
6h15
• NP - most expensive, long concatenated names, multiple inheritance
• NS - many synonyms per concept in GO
1h15
NP NS CH speedup NP speedup NS speedup CH
Medium Scale Large Scale
PARALLELIZATION STRATEGIES
• Parallelization strategies benefit from multiple threads/cores
• Combined approach slightly better, because execution delays between matchers are avoided
On Matching Large Life Science Ontologies in Parallel 18 / 20
• 4 nodes, 16 cores; matcher combination of NP, CH, NS
11h
50min
Intra-matcher parallelization speedup Intra
Intra- & Inter-matcher parallelization speedup Intra&Inter
→ Intra and the combined approach are very effective and thus especially valuable for parallel ontology matching
Speedup12.5
CONCLUSIONS AND FUTURE WORK
• General, flexible strategies for parallel ontology matching on multiple compute nodes to improve efficiency (inter- and intra-matcher parallelization)
• Size-based partitioning enabled good load balancing, scalable, reduced memory consumptions, quality not affected
• Parallelization of element-level, structure-level, instance-based matchers using multi-valued context attributes
• Implemented a distributed infrastructure, evaluation for large life science ontology match problems
On Matching Large Life Science Ontologies in Parallel 19 / 20
• Investigate parallel ontology matching for additional matchers, evaluate effectiveness
• Combine parallelization with advanced fragmentation strategies
THANK YOU FOR YOUR ATTENTION
On Matching Large Life Science Ontologies in Parallel 20 / 20