
Page 1: Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA

Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA

Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm
Database Research Group, University of Leipzig

9th Intl. Conf. on Data Integration in the Life Sciences (DILS)
Montreal, July 2013

getSim(str1,str2)

Page 2

Ontologies
- Multiple interrelated ontologies in a domain
- Example: anatomy

- Identify overlapping information between ontologies → information exchange, data integration purposes, reuse, …
- Need to create mappings between ontologies

(Figure: anatomy-related ontologies, e.g., MeSH, GALEN, UMLS, SNOMED, NCI Thesaurus, Mouse Anatomy, FMA)

Page 3

Matching Example

Two ‘small’ anatomy ontologies O and O’
- Concepts with attributes (name, synonym)

Possible match strategy in GOMMA*
- Compare name/synonym values of concepts by a string similarity function, e.g., n-gram or edit distance
- Two concepts match if one value pair is highly similar

Concept  O: Attr. values    Concept  O’: Attr. values
c0       head               c0’      head
c1       trunk, torso       c1’      torso, truncus
c2       limb, limbs        c2’      extremity

5x4 = 20 similarity computations

M_O,O’ = {(c0,c0’), (c1,c1’), (c2,c2’)}

* Kirsten, Groß, Hartung, Rahm: GOMMA: A Component-based Infrastructure for Managing and Analyzing Life Science Ontologies and their Evolution. Journal of Biomedical Semantics, 2011

Page 4

Problems
- Evaluation of the Cartesian product O x O’
  - Especially costly for large life science ontologies
  - Different strategies: pruning, blocking, mapping reuse, …
- Excessive usage of similarity functions
  - Applied O(|O|·|O’|) times during matching
  - How efficiently (runtime, space) does a similarity function work?

Experiences from GOMMA
1. Optimized implementation of the n-gram similarity function
2. Application on massively parallel hardware: Graphics Processing Units (GPUs), multi-core CPUs

getSim(str1,str2)

Page 5

Trigram (n=3) Similarity with Dice

Trigram similarity
- Input: two strings A and B to be compared
- Output: similarity sim(A,B) ∈ [0,1] between A and B

Approach
1. Split A and B into tokens of length 3
2. Compute the intersect (overlap) of both token sets
3. Calculate the Dice metric based on the sizes of the intersect and the token sets

(Optional) Assign pre-/postfixes of length 2 (e.g., ##, $$) to A and B before tokenization
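The three steps can be sketched in Python (a minimal illustration, not GOMMA's actual implementation; the function name trigram_sim and the pad parameter for the optional pre-/postfix step are ours):

```python
def trigram_sim(a: str, b: str, pad: bool = False) -> float:
    """Dice similarity over trigram token sets, following the three steps above."""
    if pad:  # optional step: pre-/postfixes of length 2 so boundary characters form tokens
        a, b = "##" + a + "$$", "##" + b + "$$"
    # 1. split A and B into tokens of length 3
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    # 2. intersect (overlap) of both token sets
    overlap = len(ta & tb)
    # 3. Dice metric over the set sizes
    return 2 * overlap / (len(ta) + len(tb))
```

For example, trigram_sim('TRUNK', 'TRUNCUS') yields 0.5, matching the worked example on the next slide.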

Page 6

Trigram Similarity - Example

sim(‘TRUNK’, ‘TRUNCUS’)
1. Token sets: {TRU, RUN, UNK} and {TRU, RUN, UNC, NCU, CUS}
2. Intersect: {TRU, RUN}
3. Dice metric: 2·2 / (3+5) = 4/8 = 0.5

Page 7

Naïve Solution

Method
- Tokenization (trivial) → two token arrays aTokens and bTokens
- Intersect computation with nested loop (main part): for each token in aTokens, look for the same token in bTokens; if found, increase the overlap counter and go on with the next token in aTokens
- Final similarity (trivial): 2·overlap / (|aTokens| + |bTokens|)

aTokens: {TRU, RUN, UNK}

bTokens: {TRU, RUN, UNC, NCU, CUS}

overlap: 0 → 1 → 2

#string-comparisons: 8
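A sketch of the naive nested-loop intersect (Python; the function name overlap_nested_loop and the comparison counter are ours, added to reproduce the counts shown above):

```python
def overlap_nested_loop(a_tokens, b_tokens):
    """For each token of aTokens, scan bTokens linearly for an equal token."""
    overlap = comparisons = 0
    for ta in a_tokens:
        for tb in b_tokens:
            comparisons += 1          # one (expensive) string comparison
            if ta == tb:
                overlap += 1
                break                 # go on with the next token of aTokens
    return overlap, comparisons
```

On the example, overlap_nested_loop(['TRU', 'RUN', 'UNK'], ['TRU', 'RUN', 'UNC', 'NCU', 'CUS']) returns (2, 8): an overlap of 2 at the cost of the 8 string comparisons noted above.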

Page 8

“Sort-Merge”-like Solution

Optimization ideas
1. Avoid string comparisons
   - String comparisons are expensive, especially for equal-length strings (e.g., equals of the String class in Java)
   - Dictionary-based transformation of tokens into unique integer values
2. Avoid nested-loop complexity
   - O(m·n) comparisons to determine the intersect of token sets
   - A-priori sorting of token arrays → make use of ordered tokens during comparison (O(m+n), see sort-merge join)
   - Amortization of sorting: token sets are used multiple times for comparison

(‘UNC’ = ‘UNK’? becomes 3 = 8? after integer conversion)

Page 9

“Sort-Merge”-like Solution - Example

sim(TRUNK, TRUNCUS)
- Tokenization → integer conversion → sorting:
  TRUNK → {TRU, RUN, UNK} → {1, 2, 3}
  TRUNCUS → {TRU, RUN, UNC, NCU, CUS} → {1, 2, 4, 5, 6}

Intersect with interleaved linear scans

aTokens: {1, 2, 3}

bTokens: {1, 2, 4, 5, 6}

overlap: 0 → 1 → 2

#integer-comparisons: 3
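Both optimizations can be sketched in Python (dictionary encoding plus a merge-style scan; the function names are ours, and the integer ids simply follow insertion order):

```python
def encode_sorted(token_lists):
    """Dictionary-based transformation: map each distinct token to a unique
    integer, then sort each id array (done once, amortized over many comparisons)."""
    dictionary = {}
    out = []
    for tokens in token_lists:
        out.append(sorted(dictionary.setdefault(t, len(dictionary) + 1) for t in tokens))
    return out

def overlap_sort_merge(a_ids, b_ids):
    """Intersect two sorted integer arrays with interleaved linear scans, O(m+n)."""
    i = j = overlap = 0
    while i < len(a_ids) and j < len(b_ids):
        if a_ids[i] == b_ids[j]:
            overlap, i, j = overlap + 1, i + 1, j + 1
        elif a_ids[i] < b_ids[j]:
            i += 1
        else:
            j += 1
    return overlap
```

On the example, encode_sorted yields [1, 2, 3] and [1, 2, 4, 5, 6], and overlap_sort_merge finds the overlap of 2 using only cheap integer comparisons.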

Page 10

GPU as Execution Environment

Design goals
- Scalability with 100’s of cores and 1000’s of threads
- Focus on parallel algorithms
- Example: CUDA programming model

CUDA kernels and threads
- Kernel: function that runs on a device (GPU, CPU)
- Many CUDA threads can execute each kernel
- CUDA vs. CPU threads: CUDA threads are extremely lightweight (little creation overhead, instant switching; 1000’s of threads used to achieve efficiency), while multi-core CPUs can only consume a few threads

Drawbacks
- A-priori memory allocation, only basic data structures

Page 11

Bringing n-gram to the GPU

Problems to be solved
1. Which data structures are possible for input / output?
2. How to cope with fixed / limited memory?
3. How can n-gram be parallelized on the GPU?

Page 12

Input-/Output Data Structure

Three-layered index structure for each ontology
- ci: concept index representing all concepts
- ai: attribute index representing all attributes
- gi: gram (token) index containing all tokens

Two arrays to represent the top-k matches per concept
- A-priori memory allocation possible (array size of k·|O|)
- short (2 bytes) instead of float (4 bytes) data type for similarities → reduces memory consumption
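A toy version of the three-layered index (Python; the concrete offset layout with trailing sentinels is our assumption based on the slide, not GOMMA's exact GPU structures):

```python
# Each layer holds offsets into the next, with a trailing sentinel, so a whole
# ontology becomes three flat arrays that can be copied to the GPU in one piece.
ci = [0, 1, 3]         # concept c owns attribute slots ci[c] .. ci[c+1]-1
ai = [0, 2, 3, 5]      # attribute j owns tokens gi[ai[j] : ai[j+1]]
gi = [1, 2, 3, 4, 7]   # dictionary-encoded, sorted token ids

def tokens_of_concept(c):
    """All token-id arrays of concept c's attribute values."""
    return [gi[ai[j]:ai[j + 1]] for j in range(ci[c], ci[c + 1])]
```

Here tokens_of_concept(1) returns [[3], [4, 7]]: concept 1 has two attribute values, with one and two token ids respectively.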

Page 13

Input-/Output Data Structure - Example

(Figure: the running example encoded in the index structure. Input: for each of O and O’, the concept index ci points into the attribute index ai, which points into the gram index gi holding the dictionary-encoded token-id sets of the attribute values, e.g., for O: c0 → [1,2]; c1 → [3,4,5], [6,7,8]; c2 → [9,10], [11,12,13,14,15,16,17]. Output for top-k=2 ∧ sim > 0.7: corrs = {c0-c0’, c1-c1’, c2-c2’} with sims = {1000, 1000, 800}, similarities stored as scaled short values.)

Page 14

Limited Memory / Parallelization
- Ontologies and mapping too large for the GPU
- Size-based ontology partitioning*: ideal case: one ontology fits completely in GPU memory
- Each kernel computes n-gram similarities between one concept of Pi and all concepts of Qj
- Re-use of the already stored partition on the GPU
- Hybrid execution on GPU/CPU possible

(Figure: match-task matrix of partitions P0-P3 x Q0-Q4, with tasks assigned to the GPU thread and CPU thread(s). The GPU keeps the indexes (ci, ai, gi) of P0 in global memory, loads Q3 next to them, runs kernels 0 .. |P0|-1 to compute M_P0,Q3, then replaces Q3 with Q4 and reads back M_P0,Q4.)

* Groß, Hartung, Kirsten, Rahm: On Matching Large Life Science Ontologies in Parallel. Proc. DILS, 2010
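The partitioning and the match-task loop can be sketched as follows (Python; the function names, the partition size, and the match_pair callback are illustrative placeholders, not GOMMA's API):

```python
def partition(concepts, max_size):
    """Size-based partitioning: split an ontology into chunks of at most
    max_size concepts, so each chunk fits into (GPU) memory."""
    return [concepts[i:i + max_size] for i in range(0, len(concepts), max_size)]

def run_match_tasks(P, Q, match_pair):
    """Process the P x Q task matrix. The outer partition stays 'resident'
    (on the GPU in the real system) while Q-partitions are streamed past it,
    so each P_i is transferred only once."""
    results = {}
    for i, p in enumerate(P):          # load P_i once
        for j, q in enumerate(Q):      # replace only the Q side each step
            results[(i, j)] = match_pair(p, q)
    return results
```

For instance, partitioning 5 concepts with max_size 2 yields three partitions, and matching them against two Q-partitions produces a 3x2 task matrix.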

Page 15

Evaluation
- FMA-NCI match problem from the OAEI 2012 Large Bio task
- Three sub-tasks (small, large, whole)
- With blocking step, to be comparable with the OAEI 2012 results*

Hardware
- CPU: Intel i5-2500 (4x 3.30 GHz, 8 GB)
- GPU: Asus GTX660 (960 CUDA cores, 2 GB)

Match     FMA                     NCI
subtask   #concepts  #attValues   #concepts  #attValues   #comparisons
small         3,720       8,502       6,551      13,308      0.12·10^9
large        28,885      50,728       6,895      13,125      0.66·10^9
whole        79,042     125,304      11,675      21,070      2.64·10^9

* Groß, Hartung, Kirsten, Rahm: GOMMA Results for OAEI 2012. Proc. 7th Intl. Ontology Matching Workshop, 2012

Page 16

Results for One CPU or GPU
- Different implementations of trigram: naïve nested loop, hash-set lookup, sort-merge
- The sort-merge solution performs significantly better
- The GPU further reduces execution times (~20% of the CPU time)

Page 17

Results for Hybrid CPU/GPU Usage
- No GPU vs. GPU, with increasing number of CPU threads
- Good scalability for multiple CPU threads (speedup of 3.6)
- Slightly better execution time with hybrid CPU/GPU usage; one thread is required to communicate with the GPU

Page 18

Summary and Future Work

Experiences from optimizing GOMMA’s ontology matching workflows
- Tuning of the n-gram similarity function: preprocessing (integer conversion, sorting) for more efficient similarity computation
- Execution on the GPU: overcoming GPU drawbacks (fixed memory, a-priori allocation)
- Significant reduction of execution times: 104 min → 99 sec for the FMA-NCI (whole) match task

Future work
- Execution of other similarity functions on the GPU
- Flexible ontology matching / entity resolution workflows
- Choice between CPU, GPU, cluster, cloud infrastructure, …

Page 19

Thank You for Your Attention!

Questions?