matching anatomy ontologies -...

28
Matching Anatomy Ontologies - Quality and Performance 1/28 Matching Anatomy Ontologies - Quality and Performance Anika Groß Leipzig, 9 th December 2009 Interdisciplinary Centre for Bioinformatics http://www.izbi.uni-leipzig.de Database Group Leipzig http:// http:// dbs dbs .uni .uni - - leipzig leipzig .de .de

Upload: dangcong

Post on 30-Mar-2018

228 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 1/28

Matching Anatomy Ontologies

- Quality and Performance

Anika Groß

Leipzig, 9th December 2009

Interdisciplinary Centre for Bioinformaticshttp://www.izbi.uni-leipzig.de

Database Group Leipzighttp://http://dbsdbs.uni.uni--leipzigleipzig.de.de

Page 2: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 2/28

Ontologies

• Formal representation to model the entities and their relationships within a domain

• Life science example domains� Biological processes� Anatomy� ...

• Often multiple ontologies for the same domain, e.g.:Open Biomedical Ontologies provide 34 anatomy (related) ontologies

GALEN

FMA

MouseAnatomy

NCIThesaurus

SNOMEDCT

??

??

??

??

??

??

??

??

�Overlapping

information?�Need for

creating mappings

Page 3: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 3/28

Ontology Mappings

• Set of semantic correspondences between concepts of different ontologies

• Automatically generated by ontology matching approaches• Crucial for

� Data integration� Enhanced data analysis across ontologies� ...

Adult Mouse Anatomical Dictionary

lumbar vertebra coat hair

lumbar vertebra 3

NCI Thesaurus (Anatomic Structure,

System, or Substance)

Hair Lumbar Vertebrae

L3 Vertebra …

Page 4: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 4/28

Overview

• Ontologies, ontology mappings

• Matching of anatomy ontologies

- Dataset

- Match workflow (stepwise)

- Evaluation w.r.t. quality (precision, recall, …)

• Distributed Matching

- Resources

- Distributed Architecture

- Evaluation w.r.t. performance

• Conclusion & future work

Page 5: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 5/28

Matching Anatomy OntologiesMatching Anatomy Ontologies

Aim Find overlapping information in anatomy ontologies

Possible application Aligning anatomies across model organisms to compare functional information about genes/proteins

Example Alignment (OAEI Anatomy track)

� Mouse anatomy: Adult Mouse Anatomical Dictionary (MA)

� Human anatomy: Anatomy subset of NCI Thesaurus (NCI)

Perfect mapping contains 1523 correspondences

Given partial reference mapping

� contains 988 correspondences (934 trivial, 54 non-trivial)

� “Real” evaluation of my match results:

““[[……] it is possible to send us an submission [] it is possible to send us an submission [……] you will be informed about ] you will be informed about precision, recall and fprecision, recall and f--measure [measure [……] ] we restrict ourselveswe restrict ourselves to doto do this for eachthis for each systemsystemonly two times only two times [[……]]””**

� Match result should be as good as possible before I try …

* http://oaei.ontologymatching.org/2007/results/anatomy/samples.html

Page 6: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 6/28

WorkflowWorkflow

Matching

Postprocessing

Combination Strategy

Preprocessing

- Basic statistics of input ontologies- Frequency of word occurrence- Origin of words (language), e.g. English, Latin, Greek- Depth of graph structure- …

Initial analysis

Page 7: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 7/28

Initial Analysis

• Basic statistics

� semi-automatic elimination of most frequent terms that carry no valuable information (e.g. „of“)?

• Frequent word analysis (Top 10)• occword -- # occurrences of word in attribute set (names, synonyms)

TODO• Application of this knowledge in preprocessing• Analyze origin of words (language), e.g. English, Latin, Greek

� used to find further synonyms• Depth of graph structure …

Page 8: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 8/28

- Basic statistics of ontologies- Frequency of word occurrence - Origin of words (language), e.g. English, Latin, Greek- Depth of graph structure

WorkflowWorkflow

Matching

Postprocessing

Combination Strategy

Initial analysis

Attribute normalization

Get additional knowledgen Google-URLs,

UMLS Metathesaurus, …

Precomputen-Level name path,

- Remove delimiters / punctuation- Upper case to lower case- Roman to arabic numbers …

Preprocessing

Page 9: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 9/28

Preprocessing

Attribute Normalization• Remove all delimiters/punctuations, replace upper case letters

Right_Lung_Respiratory_Bronchiole � right lung respiratory bronchioleTODO ���� Conversion to simplest word part (stemming)

� Alphabetic order

Precomputation of attribute values• n-Level name path: path from concept to root including n levels

� Consider all non-redundant paths (is_a, part_of) to the root

ID: MA_0001161Name: peripheral nervous system ganglionPath: /nervous system/ganglion/peripheral nervous system ganglion

ID: NCI_C12719Name: Ganglion*Path: /nervous system/peripheral nervous system/ganglion

Example Get 3-Level name path for concepts „NCI_C12719“, „MA_0001161“

* Ganglion* Ganglion German:German: EinEin vonvon einer Kapsel umschlossener Nervenzellknoteneiner Kapsel umschlossener Nervenzellknoten;;Spanish:Spanish: Formaciones nodulares queFormaciones nodulares que hay en elhay en el trayectotrayecto dede los nervioslos nervios

Page 10: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 10/28

Preprocessing

• Get additional knowledge n Google-URLs (different query strategies)• E.g. compare only wiki-URLs

TODO• Try different query strategies, e.g.

(anatomy OR anatomical structure) „globus pallidus“• Use UMLS-Metathesaurus

= comprehensive thesaurus of biomedical concepts

Precomputation of attribute values

ID: MA_0000889Name: pallidum1st-Wiki-URL: http://en.wikipedia.org/wiki/Globus_pallidus

ID: NCI_C12449Name: Globus_Pallidus *1st-Wiki-URL: http://en.wikipedia.org/wiki/Globus_pallidus

Query "globus pallidus" site:en.wikipedia.org

Query "pallidum" site:en.wikipedia.org

* Globus pallidus German: Teil des Mittelhirns; Spanish: parte del cerebro medio

Page 11: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 11/28

Attribute normalization- Punctuation marks to space- Upper case to lower case- Roman to arabic numbers

Get additional knowledgen Google-URLs

UMLS Metathesaurus

Precomputen-Level name path

Preprocessing

- Basic statistic of ontology- Frequency of word occurrence (distribution) - Origin of words (language), e.g. English, Latin, Greek- Depth of graph structure

WorkflowWorkflow

Postprocessing

Initial analysis

Combination Strategy

MatchingMatching Attr.: Name, synonymMatcher: Trigram,…

Attr.: n-Level name pathMatcher: Trigram,…

Attr.:GoogleURLsOverlap first n=1..8Metric: Dice, …

Aggregation: Max, …Direction: Small�Large, …Candidate Selection:Threshold, MaxN, MaxDelta

Page 12: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 12/28

Matching

• Look for 1,523 correspondences�Mapping size should be only slightly more/less• Search predominantly 1:1 correspondences for

anatomy concepts

• Use normalized and precomputed attributes to matchMA and NCI � name&synonym� n-Level name path� n googleURLs (only wikipedia)

• Similarity computed using� Trigram� Overlap (dice)

• Combination� Aggregation: union of single matcher results (max similarity)� Candidate Selection: Max1, MaxDelta

Page 13: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 13/28

MaxDelta-Threshold

• MaxDelta strategy*: only correspondences with maximal (and slightly differing) similarity (e.g., 0.03)

• From small to large ontology

• Example: Correspondences for one MA-concept(used matcher: 2-level name path TRIGRAM, minimal threshold 0.75)

MA_0000728 thyroid gland left lobe **

NCI_C33491 Right_Thyroid_Gland_Lobe; 0.756

NCI_C33784 Thyroid_Gland_Lobe; 0.889

NCI_C32973 Left_Thyroid_Gland_Lobe; 0.877

MA NCI

False

True

Not false, but more general accepted

notaccepted

* Do, H.H.; Rahm, E.: COMA - A System for Flexible Combination of Schema Matching Approaches,Proc. VLDB, 2002** thyroid gland left lobe German: linker Schilddrüsenflügel; Spanish: lóbulo tiroideo izquierdo

Page 14: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 14/28

Evaluation Setup

Combination of matchers: • TRIGRAM: Name&Synonym, n-Level name path• TRIGRAM: Name&Synonym, n-Level name path, google(only wikipedia)

Different configurations:• Minimal threshold• MaxDelta parameter• Number of levels (name path)• #wikiURLs• URL Overlap

Measures to evaluate quality of match results:• PRM = Partial Reference Mapping• MR = Match Result• SizePRM,SizeMR• PrecPRM,RecPRM ,F-MeasPRM - Precision, Recall, F-Measure w.r.t. PRM

Page 15: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 15/28

Evaluation Results

• 2- vs. 3-Level name path no big difference (use nearest semantic neighborhood )• 0.9 (MaxDelta 0.03) to restrictive � small mapping size

� Good if focus on precision• wiki1 produces to much results (~6,000)

Problem: fine and general concepts are described on the same wiki page• wiki8 seams to be promising …

Ma

pp

ing

Siz

e

0

10

20

30

40

50

60

70

80

90

100

0.8 | 3L 0.8; MaxD 0.03

0.8 | 2L 0.8; MaxD 0.03

0.75 | 2L 0.75; Max1

0.9 | 2L 0.9; MaxD 0.03

0.9 | 2L 0.9 | 8url 0.8; MaxD 0.03

0.9 | 2L 0.9 | 1url; Max1

0

1000

2000

3000

4000

5000

6000PrecPRM

RecPRM

F-MeasPRM

SizeMR

SizePRM

Page 16: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 16/28

Get additional knowledgen Google-URLs

UMLS Metathesaurus

Aggregation: Max, …Direction: Small�Large, …Candidate Selection:Threshold, MaxN, MaxDelta

Combination Strategy

Attribute normalization

Precomputen-Level name path

- Punctuation marks to space- Upper case to lower case- Roman to arabic numbersPreprocessing

MatchingMatchingName, synonym

TRIGRAMThreshold, MaxDelta, Max1

n-Level name pathTRIGRAM

Threshold, MaxDelta, Max1

Overlap first n=1..8 URLsDICE, …

- Basic statistic of ontology- Frequency of word occurrence (distribution) - Origin of words (language), e.g. English, Latin, Greek- Depth of graph structure

Initial analysis

WorkflowWorkflow

Remove structural conflicts

Postprocessing Same number Similar node depth

Page 17: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 17/28

Postprocessing

Todo:• Use different integrity constraints to filter false correspondences

• Same numbers� if concepts of one correspondence contain numbers

(roman or arabic) � allow only same number

• Remove structural conflicts� Use of semantic constraints, e.g., high-level disjointness� Opposite hierarchy of “as similar” identified concepts

• Require similar node depth� Concepts of a correspondence should have the same (a similar)

distance to the root

• …

Page 18: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 18/28

How much time took it?

Example match task:

M1 = Trigram Name,Synonyms t=0.8 & M2 Trigram 2-level name path 0.8

Nested loop: ~18,000,000 comparisons

• 1 CPU:

• 11.5 min (M1 � 5.3 min, M2 � 6.2 min )

Why only use one CPU out of 2/4/.. available CPUs?

• 6 CPUs:

• 2 min (M1 ���� 0.9 min, M2 ���� 1.1 min)

50 runs with different configurations: only ~2h instead of 9.6h

How was it realized?

Page 19: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 19/28

Distributed Matching *Distributed Matching *

Basic idea• Use the CPUs/main memory of available resources (e.g.,

in our group) to speed up ontology match task significantly• Top database group resources

2x Quad-Core

Dual-Core

9 Desktop clients18 CPUs à2.66 GHz

2 Server of db group16 CPUs à2.66 GHz

ΣΣΣΣ: 42 CPUs à 2.66 GHz

Quad-Core

2 Desktop clients8 CPUs à2.66 GHz

* Joined work with Toralf and Michael

• Utilization of resources when they are not used (lunch break, over night, weekend, holiday, …)

Page 20: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 20/28

Distributed Matching *Distributed Matching *

Basic idea• Use the CPUs/main memory of available resources (e.g.,

in our group) to speed up ontology match task significantly• Top database group resources

2x Quad-Core

Dual-Core

9 Desktop clients18 CPUs à2.66 GHz

2 Server of db group16 CPUs à2.66 GHz

ΣΣΣΣ: 42 CPUs à 2.66 GHz

Quad-Core

2 Desktop clients8 CPUs à2.66 GHz

* Joined work with Toralf and Michael

• Utilization of resources when they are not used (lunch break, over night, weekend, holiday, …)

Free Free capacitycapacity

Example Example dbserv2dbserv2

Page 21: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 21/28

Distributed ArchitectureDistributed Architecture

Application

Distributed MatchClient• Job building• Job distribution• Final result

composition

…MatchService1

MatchServicen

MatchRepository(Gomma)

DataService

uses

RMI – Remote Method Invocation

data / query constraint

Data access

Page 22: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 22/28

Ontology Partitioning for Distributed MatchingOntology Partitioning for Distributed Matching

• Partition each ontology and match parts on services • Merge of single results into overall result

Ontology size“manageable”on Client

Ontology size not“manageable”on Client

Structure-basedmatching

Non-structure-based matching

Structure-basedmatching

Non-structure-based matching

• Distribution of ontology concepts into buckets

• Preprocessing• Flatten the structure into attributes, e.g. NamePath•…

• Dynamic fetch of ontology concepts on service using constraints• Use hashing to construct constraints, e.g., based on accession numbers or other attribute values

• Find ontology parts (sub graphs) that converse the structure• Problem of overlaps between ontology parts caused by DAG structure

Result: buckets with associated concepts

Result: buckets with constraints describing the associated concepts or ontology parts

Page 23: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 23/28

Experiment

Experimental setup• NCI Thesaurus self match using TRIGRAM• 68,855*68,855 concepts = nested loop 4.8*109 comparisons• Split in 32 partitions à ~2,150 concepts on each side � 1024 jobs• Partition: Hashing of accession number, example bucket:SELECT ... WHERE accession LIKE „%00“

OR accession LIKE „%01“OR accession LIKE „%02“

Possible to match it in the lunch break? *• 19 CPUs from the db group* and IZBI server• 65 minutes• 1,223 comparisons per millisecond

How is the performance for less CPUs?

* Thanks to Stefan, Lars and Andreas for providing us their free resources ☺

Page 24: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 24/28

0

100

200

300

400

500

600

700

800

900

1000

1 2 4 8 13 19

real caseideal case (based on 1 CPU run)

Running Times for Different #CPUsRunning Times for Different #CPUs

• Significant speed up even for one machine (2/4 CPUs)� Reduced from ~16h to ~1h

# CPUs

Ela

pse

d t

ime in m

inute

s

db19db19 db19db19

db19db19dbserv2dbserv2

db19db19

db19db19dbserv2dbserv2

db7db7apriliaaprilia

db19db19dbserv2dbserv2

db7db7apriliaapriliaLausenLausen

db3db3db10db10

Distributed Matching NCIxNCI – Running times

Page 25: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 25/28

Real vs. Theoretic Speed upReal vs. Theoretic Speed upDistributed Matching NCIxNCI - Speed up

0

2

4

6

8

10

12

14

16

18

20

1 2 4 8 13 19

speed up realspeed up ideal

Speed u

p c

om

pare

d t

o 1

CPU

# CPUs

Page 26: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 26/28

Assignment of JobsAssignment of Jobs%Jobs (Overall: 1024 Jobs)

db10

(2x2.66 GHz)

11%

db7

(2x2.66 GHz)

6% db3

(2x2.66 GHz)

11%

lausen

(2x2.66 GHz)

12%

aprilia

(3x2.0GHz)

13%

dbserv2

(4x2.66Ghz)

23%db19

(4x2.66 GHz)

24%

lessless GHzGHz

DistributedDistributed

Match ClientMatch Client

Page 27: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 27/28

Conclusion and Future WorkConclusion and Future Work

(1) Match workflow• Experiences in aligning large life science ontologies with metadata-based

strategies

• Future work

� Improve the workflow (e.g., Google query strategies, …)

� Evaluation of the data (send to OAEI)

� Investigate and compare the stability/robustness of metadata-based and instance-based matching approaches

(2) Distributed matching• Significant reduction of elapsed time for “nested-loop” matching

of large ontologies/data sets

• Future work

� Combination of distributed matching with intelligent blocking strategies

� Interesting for object matching?

Page 28: Matching Anatomy Ontologies - uni-leipzig.dedbs.uni-leipzig.de/file/Gross_Oberseminar_WS09-10.pdf · Matching Anatomy Ontologies ... Matching Anatomy Ontologies - Quality and Performance

Matching Anatomy Ontologies - Quality and Performance 28/28

Thank you for your attention!Thank you for your attention!