on matching large life science ontologies in parallel2 matching large life science... · • needed...

20
O N M ATCHING L ARGE L IFE S CIENCE O NTOLOGIES IN P ARALLEL Anika Groß, Michael Hartung, Toralf Kirsten, Erhard Rahm Database Group Leipzig http://dbs.uni-leipzig.de Interdisciplinary Centre for Bioinformatics http://www.izbi.uni-leipzig.de Göteborg, 25 th August 2010

Upload: others

Post on 01-Jan-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

ON MATCHING LARGE LIFE SCIENCE

ONTOLOGIES IN PARALLEL

Anika Groß, Michael Hartung, Toralf Kirsten, Erhard Rahm

Database Group Leipzig

http://dbs.uni-leipzig.de

Interdisciplinary Centre for Bioinformatics

http://www.izbi.uni-leipzig.de

Göteborg, 25th August 2010

Page 2: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

ONTOLOGY MAPPINGS

• Ontologies are often highly related

• Identify overlapping information

• Crucial for data integration, enhanced dataanalysis across ontologies, ontology merging, …

On Matching Large Life Science Ontologies in Parallel 2 / 20

GALEN MouseAnatomy

NCIThesaurus

SNOMEDCT

FMA

NCI Thesaurus(Anatomic Structure..)

Vertebra

Sacral_Vertebra

Sacrum

S2_Vertebra S1_Vertebra…

……

Adult Mouse Anatomical Dictionary

vertebra

sacral vertebra

pelvis bone

sacral vertebra 2sacral vertebra 1 …

… …

• Manual creation is complex or even infeasible

• (Semi-) automatic determination of similarities between ontology elements to find correspondences

• Ontology Mapping = set of semantic correspondences between concepts of different ontologies

?

?

??

Page 3: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

Preprocessing

• Use metadata (element-, structure-level), instance data

• High quality mappings

• Combined execution of several matchers in more complex match workflows

On Matching Large Life Science Ontologies in Parallel 3 / 20

• Time consuming and memory intensive task

ONTOLOGY MATCHING

… …

O1

O2

… …

Combination SelectionMatchResult

Mapping

M2

M1

M3

Matcher execution

Page 4: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

→ Long execution times, high memory requirements

→ Reasonable to improve efficiency

EXAMPLE – MATCH SUB-ONTOLOGIES OF GENE ONTOLOGY

On Matching Large Life Science Ontologies in Parallel 4 / 20

Molecular Function

Biological Process

2*104

104

Evaluation ofCartesian product

2·108 comparisons

Multiplied by number of different

matchers

6·108 comparisons

Page 5: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

HOW TO DEAL WITH PERFORMANCE AND MEMORY ISSUES?

On Matching Large Life Science Ontologies in Parallel 5 / 20

→ Further methods

• Reduction of search space: Avoid evaluation of Cartesian product using e.g. early pruning, cluster- or fragment-based methods

• Reuse of match results: Save recomputation (ontology evolution)

• …

→ Combination of approaches

→ Parallelization

• Parallel execution of ontology matching on multiple compute nodes

• Use the broad availability of multi-core systems

• Reduce execution time, memory requirements

Page 6: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

CONTRIBUTIONS

• Two strategies for parallel ontology matching• Inter-matcher parallelization: execute independent matchers in parallel

• Intra-matcher parallelization: internal parallelization of matchers based on partitioning of the ontologies

• Parallelization of different kinds of matchers• Element-level, structure-level, instance-based matchers

• Implementation and evaluation• Distributed infrastructure for parallel ontology matching

• Evaluation of the approaches for matching large life science ontologies

On Matching Large Life Science Ontologies in Parallel 6 / 20

Page 7: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

OVERVIEW

• Motivation

• Parallelization Strategies

• Inter-matcher Parallelization

• Intra-matcher Parallelization

• Infrastructure

• Evaluation

• Conclusion & Future Work

On Matching Large Life Science Ontologies in Parallel 7 / 20

Page 8: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

INTER-MATCHER PARALLELIZATION

• Parallel execution of independently executable matchers

• Process matchers on different cores or computing nodes

On Matching Large Life Science Ontologies in Parallel 8 / 20

+ Improve execution time up to factor n (n=|matchers|)

– Limited degree of parallelization (|independently executable matchers|)

– Slowest matcher limits the achievable speedup

– Memory requirements are not reduced since matchers evaluate complete ontologies

...

M1

O2

O1

Mn

...MatchResult

M1

O2

O1

M3

...MatchResult

M2

Page 9: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

INTRA-MATCHER PARALLELIZATION

• Internal decomposition of individual matchers or matcher parts

• Partitioning the input data (ontologies)

• Many small match tasks can be executed in parallel

On Matching Large Life Science Ontologies in Parallel 9 / 20

...

...Ontology Partitioning

M11

M1k

O2

O1

...

Ontology Partitioning

Match Task Generation M1

Match Task Generation Mn

MatchResult

...

Mn1

Mnk

...

+ Match tasks of limited complexity → reduced memory and processing requirements

+ Applicable to sequential and independently executable matchers

Page 10: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

• Each match task matches one O1 against one O2 partition

SIZE-BASED PARTITIONING

• Enable parallel matching of Cartesian product• Partition input ontologies into partitions of equal size (|concepts|)

On Matching Large Life Science Ontologies in Parallel 10 / 20

...

p10

p1

...

p1

p10

+ Scalable to large ontologies+ Good load balancing + Optimizes performance without sacrificing match quality+ Applicable to element-level, structure-level and instance-based matchers

… … … …

|O1|= 10,000concepts

Generate 100match tasks

|O2|= 10,000concepts

Page 11: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

PARALLELIZATION OF ELEMENT-LEVEL MATCHERS

• Needed information is directly associated with concepts themselves

On Matching Large Life Science Ontologies in Parallel 11 / 20

→ Match-information is retained

• Attribute values like names, synonyms …

0.70.90.9

0.7

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

c∈O1, d∈O2

→ Easy partitioning of ontologies into subsets of concepts for element-level matchers

Page 12: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

• Needed information is not provided by concepts themselves

• Use information from the neighborhood of concepts, e.g. children, siblings, namePath, …

PARALLELIZATION OF STRUCTURE-LEVEL MATCHERS

On Matching Large Life Science Ontologies in Parallel 12 / 20

Approach

• Extend the concept-level information within special multi-valued context attributes, e.g. Child, NamePath, …

• Preprocessing step of linear effort

→ Size-based partitioning as for element-level matching

• Note: parallelization of similarity propagation algorithms is more difficult (need the whole structural information)

→ Only parallelization of the initial matcher

Page 13: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

• Average element similarity between children of concepts

→Compare all their Child attributes

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

EXAMPLE: CHILD MATCHER

• Use a multi-valued Child context attribute: get name values of child concepts in a preprocessing step

On Matching Large Life Science Ontologies in Parallel 13 / 20

corr(c,d) simChildren(c,d)

c1 – d1 (0.9 + 0.7 + 0.7 + 0.9) / (2·2) = 0.8

c1 – d4 (0.7 + 0 + 0 + 0.9) / (2·2) = 0.4

• Similarly applicable for other local context matchers, e.g. namePath

0.8 0.4

Acc: c1

Name: c1.nameChild: c2.nameChild: c3.name

Acc: c3

Name: c3.nameChild: …

Acc: d1

Name: d1.nameChild: d2.nameChild: d3.name

Acc: d2

Name: d2.nameChild: …

Acc: d3

Name: d3.nameChild: …

Acc: d4

Name: d4.nameChild: d3.nameChild: d5.name

Acc: d5

Name: d5.nameChild: …

Acc: c2

Name: c2.nameChild: …

0.7 0.90.90.7

Page 14: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

INFRASTRUCTURE INTRA-MATCHER PARALLELIZATION EXAMPLE

On Matching Large Life Science Ontologies in Parallel 14 / 20

CPU1

CPU2

CPU1

CPU2

Data Service

… …… …

Workflow Service

...

c1c2

ck-1ck

...

c1c2

cl-1cl

Concepts with context-attributes

...

pn

p1

...

p1

pm

Partitions (sets of concepts)

Match Service 1

Match Service 2

j1j1

j2

j3

j4

js

j2

j3

j4

js

Job queue

......

Page 15: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

INFRASTRUCTURE – INTRA-MATCHER PARALLELIZATION EXAMPLE

On Matching Large Life Science Ontologies in Parallel 15 / 20

CPU1

CPU2

CPU1

CPU2

Data Service

Workflow Service

CPU1...

...

...

...

...

...

...

Job executionUnify Partial Match Results

Match Service 1

Match Service 2

Further aggregation,selection, …

in the match workflow...

...

...

...

Page 16: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

• Two match problems to evaluate efficiency (execution times)

EVALUATION

On Matching Large Life Science Ontologies in Parallel 16 / 20

Adult Mouse Anatomical Dictionary

NCI Thesaurus Anatomic Structure or Substance

Biological Processes

Molecular Functions

~2,700 · ~3,300 = 9 · 106 comparisons

~9,400 · ~17,100 = 1.6 · 108 comparisons

• Matchers (element and structure-level)

• NameSynonym (NS) Max TriGram similarity for the name and multi-valued synonym attr.

• Children (CH) Avg TriGram similarity on the context attributes Child

• NamePath (NP) Avg TriGram similarity on the context attributes NamePath

• Computing environmentFour nodes consisting of four cores (up to 16 cores)

Medium

Scale

Large

Scale

Page 17: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

• 1 node, 4 cores, 8 threads

INTRA-MATCHER PARALLELIZATION OF INDIVIDUAL MATCHERS

On Matching Large Life Science Ontologies in Parallel 17 / 20

• ↑↑↑↑ degree of parallelization → ↓↓↓↓ execution times

• Almost linear speedup for up to 4 threads

6h15

• NP - most expensive, long concatenated names, multiple inheritance

• NS - many synonyms per concept in GO

1h15

NP NS CH speedup NP speedup NS speedup CH

Page 18: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

Medium Scale Large Scale

PARALLELIZATION STRATEGIES

• Parallelization strategies benefit from multiple threads/cores

• Combined approach slightly better, because execution delays between matchers are avoided

On Matching Large Life Science Ontologies in Parallel 18 / 20

• 4 nodes, 16 cores; matcher combination of NP, CH, NS

11h

50min

Intra-matcher parallelization speedup Intra

Intra- & Inter-matcher parallelization speedup Intra&Inter

→ Intra and the combined approach are very effective and thus especially valuable for parallel ontology matching

Speedup12.5

Page 19: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

CONCLUSIONS AND FUTURE WORK

• General, flexible strategies for parallel ontology matching on multiple compute nodes to improve efficiency (inter- and intra-matcher parallelization)

• Size-based partitioning enabled good load balancing, scalable, reduced memory consumptions, quality not affected

• Parallelization of element-level, structure-level, instance-based matchers using multi-valued context attributes

• Implemented a distributed infrastructure, evaluation for large life science ontology match problems

On Matching Large Life Science Ontologies in Parallel 19 / 20

• Investigate parallel ontology matching for additional matchers, evaluate effectiveness

• Combine parallelization with advanced fragmentation strategies

Page 20: On Matching Large Life Science Ontologies in Parallel2 Matching Large Life Science... · • Needed information is not provided by concepts themselves • Use information from the

THANK YOU FOR YOUR ATTENTION

On Matching Large Life Science Ontologies in Parallel 20 / 20