Enhanced Ontological Searching of Medical Scientific Information

Contents

Abstract
Declaration
Intellectual Property Statement
Acknowledgements
List of Abbreviations
List of Tables
List of Figures

1 Introduction
  1.1 Problem Context
  1.2 Motivation
  1.3 Contribution
  1.4 Thesis Organization

2 Ontologies
  2.1 Modern Ontology Definition
  2.2 Ontology vs. Terminology
  2.3 Notable Biomedical Ontologies and Terminologies
    2.3.1 SNOMED CT
    2.3.2 NDF-RT
    2.3.3 ICD-10
    2.3.4 MedDRA
    2.3.5 NCI Thesaurus

3 Similarity Metrics
  3.1 Similarity Metric vs. Distance Metric
  3.2 Lexical Similarity
    3.2.1 Character-based Similarity Measures
      Longest Common Substring
      Hamming Similarity
      Levenshtein Similarity
      Jaro Similarity
      Jaro-Winkler Similarity
      N-gram Similarity
    3.2.2 Word-based Similarity Measures
      Dice Similarity
      Jaccard Similarity
      Cosine Similarity
      Manhattan Similarity
      Euclidean Similarity
  3.3 Ontological Semantic Similarity
    3.3.1 Intra-ontology Semantic Similarity
      Distance-based Metrics
      Information-Based Metrics
      Feature-Based Measures
    3.3.2 Inter-ontology Semantic Similarity

4 Search Interfaces
  4.1 Information Seeking Models
  4.2 Query Specification
  4.3 Presentation of Search Results
  4.4 Query Reformulation

5 Requirements
  5.1 Feature Specification

6 Design
  6.1 Stage I: Access to Medical Ontologies
    6.1.1 Database and Table Creation
    6.1.2 Populating the Database Tables
  6.2 Stage II: Computation of Semantic Similarity
    6.2.1 Term Neighborhoods
    6.2.2 Semantic Similarity Calculation
  6.3 Stage III: Interface Design Data Presentation
  6.4 Summary of Technology Choices

7 Implementation
  7.1 Structure
  7.2 Search Entry Form
  7.3 Handling the Input Query
    7.3.1 Typing Speed
    7.3.2 Querying the Database
    7.3.3 Ranking and Grouping of Search Results
    7.3.4 Return-key or Mouse-click Search
    7.3.5 Auto-completion Search
  7.4 Error Correction
  7.5 Term Information Presentation
  7.6 Navigation

8 Evaluation
  8.1 Testing the Failed Queries
  8.2 Comparison to BioPortal Search Services
    8.2.1 Auto-completion
    8.2.2 Results Ranking
    8.2.3 Error Correction
    8.2.4 Visualization
  8.3 Comments from an AstraZeneca Search Specialist

9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work

Bibliography

Number of Words in the Document: 25648

University of Manchester

School of Computer Science

Degree Programme of Advanced Computer Science

ABSTRACT OF

MASTERS THESIS

Author: Christos Karaiskos

Title: Enhanced Ontological Searching of Medical Scientific Information

Supervisors: Prof. Andrew Brass (University of Manchester)

Dr. Jennifer Bradford (AstraZeneca)

Abstract: An enormous amount of biomedical knowledge is encoded in narrative textual format. In an attempt to discover new or hidden knowledge, extensive research is being conducted to extract and exploit term relationships from plain text, with the aid of technology. A common approach for the identification of biomedical entities in plain text involves usage of ontologies, i.e., knowledge bases which provide formal machine-understandable representations of domains of variable specificity. In addition to term extraction, ontologies may be used as controlled vocabularies or as a means for automatic knowledge acquisition through their inherent inference capabilities. Visualization of the content of ontologies is, thus, very important for researchers in the biomedical domain. Unfortunately, many of these researchers find it difficult to deal with formal logic and would prefer that ontology search interfaces completely hide any structural or functional references to ontologies. This thesis proposes a strategy for building a web-based ontology search application that exploits ontologies behind the scenes, transparently to the end user, and presents relevant concept information in such a way that searchers can successfully and quickly find what they are looking for. The proposed search interface features various search tools for enhanced ontological searching, including term auto-completion, error correction, clever results ranking, and similar term visualizations based on semantic similarity metrics. Evaluation of the developed application shows that its features can improve enterprise-strength ontology search applications, such as BioPortal.

Keywords: search interface design, ontology hiding, biomedical ontology,

semantic similarity, usability, data integration

Declaration

No portion of the work referred to in the dissertation has been submitted in

support of an application for another degree or qualification of this or any other

university or other institute of learning.

Intellectual Property Statement

i. The author of this dissertation (including any appendices and/or schedules

to this dissertation) owns certain copyright or related rights in it (the Copy-

right) and he has given The University of Manchester certain rights to use

such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard

or electronic copy, may be made only in accordance with the Copyright,

Designs and Patents Act 1988 (as amended) and regulations issued under

it or, where appropriate, in accordance with licensing agreements which the

University has entered into. This page must form part of any such copies

made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other

intellectual property (the Intellectual Property) and any reproductions of

copyright works in the dissertation, for example graphs and tables (Repro-

ductions), which may be described in this dissertation, may not be owned by

the author and may be owned by third parties. Such Intellectual Property

and Reproductions cannot and must not be made available for use with-

out the prior written permission of the owner(s) of the relevant Intellectual

Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's Guidance for the Presentation of Dissertations.

Acknowledgements

I am deeply grateful to my supervisors, Prof. Andrew Brass (University of Manch-

ester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and

support throughout the duration of this project. I have greatly benefited from

experiencing the different perspectives of academia and industry, which have both

contributed to shaping the final outcome of this project.

I would like to thank Sebastian Philipp Brandt (University of Manchester),

for his suggestions on making the search application even better. Also, I would

like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time

to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on

improving the performance and security of the application.

Finally, I would like to thank Matina for her patience and love, and my par-

ents, Ioannis and Stavroula, for always being there.

List of Abbreviations

AI Artificial Intelligence

AJAX Asynchronous JavaScript and XML

API Application Programming Interface

CSS Cascading Style Sheets

DAG Directed Acyclic Graph

HLGT High Level Group Term

HLT High Level Term

HTTP Hypertext Transfer Protocol

IC Information Content

ICD International Classification of Diseases

JDBC Java Database Connectivity

JSON JavaScript Object Notation

LCS Least Common Subsumer

MedDRA Medical Dictionary for Regulatory Activities

NCIT National Cancer Institute Thesaurus

NDF-RT National Drug File Reference Terminology

NHS UK National Health Service

NLP Natural Language Processing

OBO Open Biomedical Ontologies

OWL Web Ontology Language

PHP PHP Hypertext Preprocessor

PT Preferred Term

RDF Resource Description Framework

RDF-S Resource Description Framework Schema

REST Representational State Transfer

RF2 Release Format 2

SNOMED CT Systematized Nomenclature of Medicine Clinical Terms

SNOMED RT Systematized Nomenclature of Medicine Reference Terminology

SOC System Organ Class

UMLS Unified Medical Language System

URI Uniform Resource Identifier

URL Uniform Resource Locator

UX User Experience

VA U.S. Department of Veterans Affairs

WHO World Health Organization

XHTML Extensible HyperText Markup Language

XML Extensible Markup Language

List of Figures

2.1 The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
4.1 The Google search engine entry form.
4.2 Facebook uses grayed-out descriptive text to help in the formulation of user queries.
4.3 Bing's search interface features a powerful dynamic search suggestion, where prefixes are highlighted with grayed-out font and the remaining text is in bold.
4.4 The Safari browser's embedded search interface explicitly states which queries are suggestions and which belong to the user's recent search history.
4.5 The Firefox browser's embedded search interface contains recent queries on top, and separates them from suggestions using a solid line.
4.6 Google's search results page is a typical scrollable vertical list of captions. Metadata facets, which restrict results to a particular type of information, are also present in the interface (e.g. the "Images" tab).
4.7 Amazon's search interface provides facets as a left panel to the results page, helping the user dynamically refine the initial search.
4.8 PubMed's results page includes term expansion in two ways. On the right of the screen, there is a "Related searches" panel that preserves the initial query and adds a new related term to it. Also, right below the entry form there is a "See also" feature which suggests complete or partial modifications in the initial query.
6.1 A part of the XML response for the "get all terms" query of Table 6.2.
6.2 The provided methods of the ontoCAT API (Adamusiak et al., 2011).
6.3 Populating the Ontologies database is performed with the help of the ontoCAT API.
7.1 The organization of the files that comprise the web application. These files are responsible for the presentation, styling and interactive behavior of the web application.
7.2 The main window of the search application. The search box is placed at the top of the screen, with central horizontal alignment. A submit button labeled "Search" is also provided, to assist users who prefer mouse-clicking.
7.3 Once the user clicks inside the search box, the grey help message disappears and a blinking cursor takes its place.
7.4 Terms that would otherwise appear on their own table row are grouped under a term that lexically matches the query more closely, when their semantic similarity to that term is higher than a threshold.
7.5 Pressing the Return key or clicking the "Search" button submits the query to index.php and a table of search results is added to the interface.
7.6 Part of the JSON response from performQuery.php, for the input query "rash". Each JSON object represents a term matching the query, and contains information that can be used for its presentation.
7.7 Pressing any key other than Return submits the query through AJAX to performQuery.php and an auto-completion pop-up menu is created from the JSON response.
7.8 Error correction when the input query is "lyng". The closest term is suggested, as a clickable link.
7.9 When the user places the mouse cursor on a circle, a tooltip immediately appears, containing the full term name and the semantic similarity score with the viewed term.
7.10 Presentation page for the NCIT term "Recurrent NSCLC". On the left side, the basic term information is shown, along with an XML representation of highly similar terms. On the right side, a visualization of highly similar terms is provided, using the D3 JavaScript library.
7.11 Presentation page for the MedDRA term "Rash". The term has very close relations with terms that are not in the hierarchy. This is illustrated using blue color.
7.12 The XML representation of a term. It includes basic term information and highly similar terms.
7.13 Help is provided through tooltips that activate on mouse-over.
8.1 The term "DIHS" is not found, but this is expected, since it is not part of any of the supported ontologies. Instead, the term "DIOS" is proposed, in case the user had misspelt the query.
8.2 The term "NMDA Antagonist" is not found, but this is expected, since it is not part of any of the supported ontologies. No Soundex match is found, so no error corrections are suggested.
8.3 The term "Hepatotoxicity" is shown in the auto-completion dialogue.
8.4 The term "NSCLC" is shown in the auto-completion dialogue.
8.5 The term "DRESS syndrome" is shown in the auto-completion dialogue.
8.6 The query "LHRH" produces two different 100%-matching results. Unlike in the previous search application, the user can now see that "Gonadotropin Releasing Hormone" is a preferred term for "LHRH".
8.7 The results for the query "VEGFR" illustrate a semantic grouping of 4 similar terms, namely "VEGFR", "Vascular Endothelial Growth Factor Receptor 1", "Vascular Endothelial Growth Factor Receptor 2" and "Vascular Endothelial Growth Factor Receptor 3". The latter three are grouped under the parent term.
8.8 The BioPortal interface is a simple text box, similar to this project's main page.
8.9 BioPortal also offers advanced options to improve the search results.
8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353 ontologies offered by BioPortal, so that comparisons to this project's work are achievable.
8.11 Auto-completion pop-up menu of the BioPortal NCIT widget when the user has typed "nsc". Only preferred terms are shown. The user might be confused when seeing the term "Becatecarin" in the results, since it does not contain "nsc".
8.12 Auto-completion pop-up menu of this project's search application when the user has typed "nsc".
8.13 Searching for "Denatonium Benzoate" through its preferred term name.
8.14 Searching for "Denatonium Benzoate" through its synonym "THS-839".
8.15 Searching for "Denatonium Benzoate" through its synonym "WIN 16568".
8.16 BioPortal search results rankings for "nsclc". All terms are grouped according to the ontology they belong to, under the preferred name of the most lexically-relevant term to the query.
8.17 This project's search results rankings for "nsclc". Terms in the results are rearranged into groups that show high semantic similarity.
8.18 BioPortal returns no search results for the erroneously spelt term "nsclca".
8.19 BioPortal returns no search results for the erroneously spelt term "caancer".
8.20 This project's search application returns a search suggestion of "nsclc" for the erroneously spelt term "nsclca".
8.21 This project's search application returns a search suggestion of "cancer" for the erroneously spelt term "caancer".
8.22 BioPortal uses a graph to visualize hierarchical relations. Edges are annotated with a description of the relationship between the connected nodes (e.g. subclassOf).
8.23 This project's application focuses on inexperienced users and attempts to completely hide any formal-logic relationships that might confuse the user.
8.24 Search results depicting causal associations between smoking and cancer, as presented by the I2E text mining application.
8.25 Search results for the term "MEK inhibitor" in NCIT, when the I2E application is used.

Chapter 1

Introduction

Ontologies are knowledge bases which provide formal machine-understandable representations of domains of variable specificity. Given a domain of discourse, concepts that belong to the domain are well documented in formal logic, along with their inter-relations. Ontologies, as representations, cannot perfectly capture the part of the world that they attempt to describe (Davis et al., 1993). They are based on the open world assumption, which states that if something is not represented in a knowledge base, it does not mean that it does not exist in the real world (Hustadt et al., 1994). As our knowledge about a domain increases, ontologies are updated and become more complex. This has become evident in the biomedical domain, where ontologies have already attained a high degree of specificity, which has led to their quick adoption for data integration and knowledge discovery purposes.

1.1 Problem Context

Within biomedicine, ontologies can help researchers communicate, by promoting consistent use of biomedical terms and concepts. The construction of an ontology itself involves mediating across multiple views and requires that a number of domain experts reach a consensus that reflects the diverse viewpoints of the community. Ontologies are viewed as tools that provide opportunities for new knowledge acquisition, due to the complex semantic relations that they model. Inferences in a huge ontology may reveal connections that the human eye would miss. This is especially important in the pharmaceutical sector, where drug discovery has slowed down significantly as a process, and in the biological sector, where attempts to demystify genome patterns associated with disease are still at an initial stage. Another common use for ontologies in the biomedical domain is as controlled vocabularies that feed filtered terms into computer applications. Finally, ontologies may be used to connect terms found in plain text to their semantic representations. Term extraction with the help of ontologies is an active research topic in biomedicine, due to the vast amounts of medical information stored in plain text. Because of the importance of ontologies, researchers in the biomedical field commonly require access to their content.

1.2 Motivation

In the past, AstraZeneca employees were provided with a web-based search form

that enabled them to look for concepts in one or more biomedical ontologies and

select the most suitable from a list of search results. The chosen concepts were, in

turn, conveyed to a text mining application. Understanding the results required

the user to be familiar with the content and structure of the ontology from which

the terms were retrieved. Unfortunately, most users did not feel comfortable

with the idea of ontologies and struggled, or even refused, to use the provided

interfaces, even though no logic-based content was there to confuse them.

In many cases, though, this was not solely the fault of the users. The interface gave the users freedom to select the ontologies to be searched for the specified query. Inexperienced users usually did not know or care about which ontology contains the desired query term. For example, a user wished to search for "Non-small cell lung carcinoma", by its abbreviation "NSCLC". Querying "NSCLC" in the MedDRA terminology [1] returned no results, since the concept is not present in the terminology. Although this behavior is correct, it seems wrong to the inexperienced user and may lead to loss of trust in the system.

But even if the term is present in the ontology, the user should not be forced to know its exact spelling. For example, querying for "NSCLC" in the NCIT thesaurus also returned no results, despite the fact that the actual concept exists in the ontology. The searcher needed to know that the preferred term for the NSCLC concept is "Non-small cell lung carcinoma". Abbreviations and dissimilar synonyms are common in the biomedical field, so expecting the user to know the preferred term for each concept is problematic.

In addition to the above, presentation of results was not always straightforward. Terms that demonstrate a strong semantic relation to each other were presented as stand-alone terms in the search results, misleading users into assuming that the terms were independent. It was up to the user to judge the relevance of results to the query. For example, the results for "Non-small cell lung carcinoma" in NCIT included, among others, the terms "Non-small cell lung carcinoma" and "Stage I non-small cell lung carcinoma" equally spaced, in a way that users could not infer the connections between them. In fact, the latter term is a specialization of the former. In practice, users chose all terms, even though they were looking for the broad term, because they became confused and did not want to take the risk of selecting only one.

This breakdown at the human-computer interface has motivated AstraZeneca to try to build tools that take advantage of the ontology structure and, at the same time, completely hide it from the user in order to facilitate the search procedure.

[1] The difference between terminology and ontology is described in Section 2.2.

1.3 Contribution

The outcome of this thesis is the development of a user-friendly search application that allows users to find information about concepts present in a medical ontology, without requiring them to understand the underlying structure of the ontology. Information about a concept includes its accession code within the given ontology, the term for its preferred name, its definition and all available synonym terms. In order to facilitate the search procedure and enhance User Experience (UX), the search application includes features such as dynamic term suggestion, spelling correction and similar term visualization tools.

The main challenge lies in the presentation of results; as stated in section 1.2,

users are usually not sure about which term(s) to choose, when multiple similarly-

spelt terms appear. Ranking of terms is performed with the aid of both lexical

and semantic similarity. The former screens those terms that best match the user

query and ranks them according to a string relevance metric. These results are

processed by the latter, so that terms showing a strong semantic connection are

grouped together.
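
The two-step ranking described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation developed in this thesis: the standard-library difflib ratio stands in for the string relevance metric, the PARENT table is a hypothetical toy hierarchy rather than real NCIT content, and the function names, the toy semantic score and both threshold values are invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical toy hierarchy (child term -> parent term); not real NCIT data.
PARENT = {
    "Stage I non-small cell lung carcinoma": "Non-small cell lung carcinoma",
    "Stage II non-small cell lung carcinoma": "Non-small cell lung carcinoma",
    "Non-small cell lung carcinoma": "Lung carcinoma",
    "Small cell lung carcinoma": "Lung carcinoma",
}


def lexical_similarity(query, term):
    """String relevance in [0, 1]; difflib's ratio is a stand-in metric."""
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()


def semantic_similarity(a, b):
    """Toy score: 1.0 for identical terms, 0.8 for parent/child, else 0.0."""
    if a == b:
        return 1.0
    if PARENT.get(a) == b or PARENT.get(b) == a:
        return 0.8
    return 0.0


def rank_and_group(query, terms, lexical_cutoff=0.5, semantic_threshold=0.7):
    """Rank terms lexically, then group semantically close terms together."""
    # Step 1: score every term against the query and keep the good matches,
    # best match first.
    scored = [(lexical_similarity(query, t), t) for t in terms]
    ranked = [t for score, t in sorted(scored, key=lambda p: p[0], reverse=True)
              if score >= lexical_cutoff]

    # Step 2: place each term under the first better-ranked group head that
    # it is semantically close to; otherwise it starts a group of its own.
    groups = []  # list of (head term, [terms grouped under the head])
    for term in ranked:
        for head, members in groups:
            if semantic_similarity(head, term) >= semantic_threshold:
                members.append(term)
                break
        else:
            groups.append((term, []))
    return groups


if __name__ == "__main__":
    for head, members in rank_and_group("non-small cell lung carcinoma",
                                        list(PARENT) + ["Rash"]):
        print(head, members)
```

In this sketch a term is attached to the first group head it is close to, so the best lexical match always remains the visible representative of its group.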

Ideally, the search application should bridge across terms from multiple ontologies. Due to the diversity in the format and annotation of different ontologies, this is not a straightforward generalization. Most importantly, within the biomedical community, the term "ontology" is often used erroneously to describe plain terminologies that, in fact, violate basic ontological principles [2]. Therefore, ontology-specific difficulties are expected to arise, if semantic similarity measures are to be deployed.

In summary, the goals of this thesis are:

1. To develop user-friendly search tools that allow users to build search queries based on the terms present in a medical ontology, without the need for the users to understand the actual structure of the ontology.

2. To exploit the semantic annotations of the underlying ontology in order to enhance the quality and presentation of results.

3. To intermix results originating from different ontologies.

[2] In MedDRA, the synonym of a term may be a child node of the term itself.

1.4 Thesis Organization

The thesis is organized into nine chapters. Chapter 2 includes an introduction to ontologies and a brief description of some notable biomedical ontologies. Chapter 3 presents the background needed for understanding the different measures of lexical and semantic similarity. Chapter 4 discusses interface design principles for user-centered search applications. In chapter 5, the requirements and feature specifications for the final search application are addressed. Chapter 6 describes the design considerations that were taken into account for the ontological search application, while chapter 7 presents the final implementation. Chapter 8 includes the evaluation of the search application. Finally, conclusions are drawn in chapter 9, along with possible future directions.

Chapter 2

Ontologies

The term "ontology" is an uncountable noun coined in the philosophical field, by ancient Greek philosophers (Guarino, 1998). It involves the study of the nature of existence, at a fairly abstract level. In the world of computer science, the word "ontology" refers to the encoding of human knowledge in a format that allows for computational use. This chapter includes an introduction to the modern definition of ontology, along with a brief description of some of the most notable biomedical ontologies.

2.1 Modern Ontology Definition

In Artificial Intelligence (AI), an ontology is commonly defined as a specification of a (shared) conceptualization (Gruber et al., 1995). A conceptualization refers to an individual's knowledge about a specific domain, acquired through experience, observation or introspection (Huang et al., 2010). Ontologies are shared conceptualizations, meaning that multiple participants, usually domain experts, contribute to their construction, maintenance and expansion. Conflicts are certain to arise among the different participants, so an important aspect of ontology design is to bridge across multiple views of the desired domain into a single concrete representation. A specification, in turn, is a transformation of this shared conceptualization into a formal representation language.

The outcome of a formal representation of a domain is a collection of entities, expressions and axioms. Entities include:

• concepts or classes, which are sets of individuals (e.g., Country, which contains all countries),

• individuals, which are specific instances of classes (e.g., Greece as an instance of Country),

• data types (e.g., string, integer),

• literals, which are specific values of a given data type (e.g., 1, 2, 3, or string values),

• properties (e.g., hasDisease, hasAge).

Expressions refer to descriptions of entities in a formal representation language. The standardized family of languages for formal ontology representation is the Web Ontology Language (OWL), which builds on the Extensible Markup Language (XML), Resource Description Framework (RDF) and RDF-Schema (RDF-S) standards to provide a highly expressive means for representing knowledge (McGuinness et al., 2004). The underlying format of the resulting OWL document can vary among several types, with the most common being RDF/XML.

Finally, axioms relate entities/expressions. This connection can be made

class-to-class (i.e. SubClassOf), individual-to-class (i.e. ClassAssertion), property-

to-property (i.e. SubPropertyOf), among others. These relations can be asserted

explicitly or inferred by a reasoner. Inferences are made based on the logical relations

of concepts. As an example of a simple inference, a concept's ancestors can

be inferred automatically, once the parent concept is specified.
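The ancestor inference mentioned above can be sketched as a transitive closure over asserted parent links. This is only an illustration of the idea, not an OWL reasoner; the function name `infer_ancestors` and the toy axioms are assumptions introduced here.

```python
from typing import Dict, Set


def infer_ancestors(parents: Dict[str, Set[str]], concept: str) -> Set[str]:
    """Collect every concept reachable from `concept` by following parent links."""
    ancestors: Set[str] = set()
    stack = list(parents.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in ancestors:
            ancestors.add(current)
            stack.extend(parents.get(current, ()))
    return ancestors


# Hypothetical asserted SubClassOf axioms (child -> direct parents).
asserted = {
    "MyocardialInfarction": {"HeartDisease"},
    "HeartDisease": {"CardiovascularDisease"},
    "CardiovascularDisease": {"Disease"},
    "Disease": set(),
}

print(sorted(infer_ancestors(asserted, "MyocardialInfarction")))
```

Only the direct parent of each concept is asserted; the remaining ancestors follow from the transitivity of subsumption.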

An ontology may be visualized as a graph, in which concepts are nodes and

relations are edges between nodes. Furthermore, if transitive hierarchical rela-

tions are isolated (e.g. subsumption, also known as is-a relation or hyponymy),


the ontology can be viewed as a taxonomy. The geometrical visualization of an

ontology will be presented in more detail in chapter 3.

2.2 Ontology vs. Terminology

A terminology is a collection of term names that are associated with a given

domain. A term is a mapping of a concrete concept to natural language. This

term-to-concept mapping is usually not one-to-one, especially in the biomedical

domain where term variation and term ambiguities arise Ananiadou and McNaught

(2006). Term variation is a result of the richness of natural language and

refers to the existence of multiple terms for the description of the same concept.

For example, the terms Transmembrane 4 Superfamily Member 1, TM4SF1 and

L6 Antigen all point to the same protein. Term ambiguity occurs when a term is

mapped to more than one distinct concept. This is common when new abbreviations

are introduced Liu et al. (2002). As an example, some of the concepts that

the acronym CTX may map to are Cardiac Transplantation, Clinical Trial

Exemption and Conotoxin. Their disambiguation is a matter of context.

A terminology is not constrained to being a simple list of terms. In fact,

most terminologies feature some kind of structure, where terms that map to the

same concept are grouped together and semantic relationships between concepts

are explicitly or implicitly stated. Semantic relationships between terms include

synonymy and antonymy, while semantic relationships between concepts include

hyponymy, hypernymy, meronymy and holonymy Jurafsky and Martin (2000).

Synonymy exists when two terms are interchangeable, while antonymy denotes

that two terms have opposite meaning. Hyponymy introduces a parent-child, or

is-a relation between concepts. A concept is a hyponym of another concept,

if the former derives from the latter and it represents a more granular concept.

Hyponymy is transitive; if concept a is a child of concept b, and concept b is a

child of concept c, then a is also a hyponym (descendant) of c. Hypernymy is the reverse relation

of hyponymy. Meronymy exists when a concept represents a part of another



concept. Holonymy is the opposite relation, where a concept has some other

concept(s) as its parts.

The difference between a terminology and an ontology is not always clear, as

terminologies continue to improve their state of organization in a way that resem-

bles ontologies. The initial scope and aim of the two, though, is clearly different;

the purpose of a terminology was initially, as the name implies, an effort to collect

all terms associated with a specified domain. On the other hand, the target of

an ontology has, from the start, been to provide a machine-readable specification

of a shared conceptualization. Despite their many common characteristics, ter-

minologies are not necessarily ontologies. If treated as ontologies, they may lead

to inconsistencies or wrong inferencing mechanisms Ananiadou and McNaught

(2006). An illustrative example is the case of MedDRA, which will be discussed

in Section 2.3.4.

2.3 Notable Biomedical Ontologies and Terminologies

Hundreds of biomedical ontologies and terminologies have been published on-

line. According to BioPortal1 statistics, the top five most viewed ontologies or

terminologies are SNOMED Clinical terms, National Drug File, International

Classification of Diseases, MedDRA and NCI Thesaurus. This section gives a brief

introduction to these ontologies/terminologies.

2.3.1 SNOMED CT

The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a biomedical terminology which covers most areas within medicine, such as drugs, diseases, operations, medical devices and symptoms. It may be used for the coding, retrieval and processing of clinical data. SNOMED CT is written purely in formal logic-based syntax (available in the so-called Release Format 2, or RF2) and is organized into multiple independent hierarchies. It is the result of the merger of the UK National Health Service's (NHS) Read codes and the SNOMED Reference Terminology (SNOMED-RT), developed by the College of American Pathologists. The basic hierarchies, or axes, are Clinical Finding and Procedure. The latest version contains more than 400,000 concepts and over 1,000,000 relationships, rendering SNOMED CT the most complete terminology in the medical domain. Only a few definitions are present in the terminology. Each concept has a unique identifier and numerous synonymous terms that account for term variation. Also, each concept is part of at least one hierarchy and may have multiple is-a relationships with higher-level nodes. SNOMED CT is part of the Unified Medical Language System (UMLS), a biomedical ontology and terminology integration effort which comprises hundreds of resources.

1 BioPortal is a biomedical ontology/terminology repository which provides online ontology presentation and manipulation tools (http://bioportal.bioontology.org/).

2.3.2 NDF-RT

The National Drug File Reference Terminology (NDF-RT) was introduced by the

U.S. Department of Veterans Affairs (VA) as a formalized representation for a

medication terminology, written in description logic syntax VHA (2012). The

terminology is organized into concept hierarchies, where each concept is a node

comprising a list of term synonyms and a unique identifier. As expected, top-level

concepts are more general than lower-level ones. The central hierarchy is named

DRUG KIND and indicates the types of medications, the preparations used in

them and clinical VA drug products. Other hierarchies include

DISEASE KIND,

INGREDIENT KIND,

MECHANISM OF ACTION KIND,

PHARMACOKINETICS KIND,



PHYSIOLOGIC EFFECT KIND,

THERAPEUTIC CATEGORY KIND,

DOSE FORM and

DRUG INTERACTION KIND.

Roles exist between different concepts, and are specified only with existential

restrictions (i.e. the OWL equivalent of someValuesFrom). Mappings to other terminologies

are also available. Currently, NDF-RT contains more than 45000 concepts in

hierarchies of maximum depth 12.

2.3.3 ICD-10

The International Statistical Classification of Diseases and Related Health Prob-

lems (ICD) is a terminology which attempts to classify signs, symptoms and

causes of disease and morbidity WHO (1992). It appeared in the mid-19th century

and is now maintained by the World Health Organization (WHO). Currently

it is available in its 10th revision, although the 11th version is claimed to be at

the final stage before release. As a taxonomy, it has relatively small maximum

depth, equal to 6. Codes assigned to each concept tie it to a specific place in the

taxonomy, with each code having only a single parent. It is thus not a proper application of ontological principles (nor was it meant to be; its intent is classification), since, in reality, it is not unusual for a concept to have more than one subsumer, and this is not modeled. In addition, there exist categories such as "Not otherwise specified" or "Other", which are not needed in an ontology; the open world assumption already covers the fact that every ontology is incomplete, so stating it explicitly is redundant and may interfere with the evolution of the ontology, as new terms are not classified under their closest match.


Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.

2.3.4 MedDRA

The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology

that is concerned with biopharmaceutical regulatory processes. It contains terms

associated with all phases of the drug development cycle. MedDRA is organized

in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ

Classes (SOCs) represent the 26 predefined overlapping hierarchies to which terms

belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are

general term groupings, denoting disorders or complications. Preferred Terms

(PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs)

include terms of maximum specificity. LLTs may be connected with hyponymy,

meronymy or synonymy relationships to their PTs. This is the main problem in

trying to view MedDRA as an ontology. In a formal ontology, a concept cannot

be a child of itself. In MedDRA, this clearly happens, when a PT and its LLTs

share a synonymy relation.



2.3.5 NCI Thesaurus

The National Cancer Institute Thesaurus (NCIT) is a controlled terminology for cancer research. The thesaurus has been converted to formal OWL syntax

and is updated at fixed intervals. The conversion was not an easy one; many

inconsistencies and modeling dead-ends that were encountered in the conversion

procedure have been documented Ceusters et al. (2005), along with some clear

violations of ontological principles Schulz et al. (2010). The NCIT provides almost

100000 concepts, with approximately 65% containing a definition.


CHAPTER 3. SIMILARITY METRICS

4. $d(a, b) + d(b, c) \geq d(a, c)$ (triangle inequality).

On the other hand, the requirements for a similarity metric were formally intro-

duced not long ago Chen et al. (2009). The definition states that a similarity

metric s(a, b) must satisfy the following properties:

1. $s(a, a) \geq 0$,

2. $s(a, b) = s(b, a)$,

3. $s(a, a) \geq s(a, b)$,

4. $s(a, b) + s(b, c) \leq s(a, c) + s(b, b)$,

5. $s(a, a) = s(b, b) = s(a, b)$ if and only if $a = b$.

The counter-intuitive 4th property can be proven using set theory. More specifically, if $|a \cap b|$ denotes the cardinality of common characteristics between $a$ and $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:

\[ |a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1} \]

Then,

\[ |a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|, \tag{3.2} \]

since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |b|$. Deduction of similarity from distance is a common procedure that requires simple operations. Similarity is, intuitively, a decreasing function of distance. Conversion between the two can take many forms Chen et al. (2009). In this thesis, all formulas will

be presented as similarity measures.

3.2 Lexical Similarity

String-based methods that calculate lexical similarity can be divided into character-

based and word-based. In this section, some of the most popular metrics are

presented. For a more complete survey of lexical similarity measures see Navarro

(2001) and Gomaa and Fahmy (2013).


3.2.1 Character-based Similarity Measures

In character-based similarity, strings are viewed as character sequences and attempts are made to discover character relevance.

Longest Common Substring

The Longest Common Substring algorithm Gusfield (1997) tries to find the maximum

number of consecutive characters that two strings share. It may be implemented

using a suffix tree or dynamic programming.
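As a minimal sketch of the dynamic programming variant, the table cell for positions (i, j) can hold the length of the common run of characters ending at those positions. The function name and the example strings are chosen here for illustration only.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of consecutive characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # run lengths for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("conotoxin", "cytotoxin"))
```

Keeping only two rows of the table is a space optimisation; the running time remains proportional to the product of the two string lengths.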

Hamming Similarity

Hamming similarity is a metric that can be applied to strings of equal length. It is a simple metric that measures the fraction of positions at which the two strings hold the same character. Given strings $a$ and $b$, the formula for string similarity can be constructed as follows:

\[ \mathrm{sim}_{ham}(a, b) = \frac{\sum_i \mathbf{1}(a_i = b_i)}{|a|}, \tag{3.3} \]

where $\mathbf{1}(\cdot)$ is the indicator function and $|\cdot|$ denotes string length, measured in characters.
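Equation (3.3) translates almost directly into code. The sketch below assumes that unequal lengths are rejected and that two empty strings count as identical; both choices are assumptions made here.

```python
def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires strings of equal length")
    if not a:
        return 1.0  # assumption: two empty strings are treated as identical
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(hamming_similarity("karolin", "kathrin"))
```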

Levenshtein Similarity

Levenshtein distance counts the number of single-character edits (insertions, deletions or substitutions) that need to be made in order to transform one string into another Levenshtein (1966). This number is bounded by the length of the longer string, which is commonly used as a normalizing measure that restrains the value of distance to [0, 1]. Mathematically, the normalized Levenshtein distance of terms $a$ and $b$ is computed using the following formula:

\[ d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4} \]

where $|\cdot|$ denotes string length in number of characters,

\[ \mathrm{lev}_{a,b}(i, j) = \begin{cases} \max\{i, j\}, & \text{if } \min\{i, j\} = 0 \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j] \end{cases}, & \text{else} \end{cases} \tag{3.5} \]

and $\max\{\cdot\}$, $\min\{\cdot\}$ denote the maximum and minimum functions, respectively. The bracket $[a_i \neq b_j]$ equals 1 if the two characters differ and 0 otherwise. Converting normalized distance to similarity can be done as follows:

\[ \mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6} \]
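The recurrence of the Levenshtein distance is usually evaluated bottom-up rather than recursively. The following sketch keeps a single previous row of the table; the function names are assumptions, as is the convention that two empty strings have similarity 1.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """One minus the distance normalised by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"), levenshtein_similarity("kitten", "sitting"))
```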

Jaro Similarity

Jaro similarity Jaro (1989, 1995) takes into account both the number and sequence of common characters present in the two strings. Let us consider strings $a = a_1 \ldots a_K$ and $b = b_1 \ldots b_L$. A character $a_i$ is said to be common with $b$ if the character exists in $b$ within a window of $\frac{\min\{|a|,|b|\}}{2}$ from $b_i$. Let $a' = a'_1 \ldots a'_{K'}$ be those characters in $a$ that are common with $b$, and $b' = b'_1 \ldots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition for $a', b'$ is a position $i$ in the strings $a', b'$ in which $a'_i \neq b'_i$. The number of transpositions for $a', b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula for similarity is given by:

\[ \mathrm{sim}_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right). \tag{3.7} \]

It should be noted that Jaro similarity violates the symmetry property required of a similarity metric (Section 3.1), therefore it is not a true similarity metric according to that definition.

Jaro-Winkler Similarity

Jaro-Winkler similarity Winkler (1999) is a variation of Jaro similarity which promotes strings with long common prefixes. The length of the longest prefix common to both strings $a$ and $b$ is denoted as $P$. Then, if $P' = \min(P, 4)$, Jaro-Winkler similarity is given by:

\[ \mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10} \left( 1 - \mathrm{sim}_{jaro}(a, b) \right). \tag{3.8} \]
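The two measures above can be sketched as follows. The sketch assumes a greedy one-to-one matching of common characters (each character of b is matched at most once), which is a common implementation choice and not something the definitions above prescribe; it also assumes the prefix bonus is capped at four characters with a scaling factor of 1/10. Function and parameter names are hypothetical.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity under a greedy one-to-one matching of common characters."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    window = min(len(a), len(b)) // 2
    used = [False] * len(b)
    a_common = []
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ch:
                used[j] = True
                a_common.append(ch)
                break
    if not a_common:
        return 0.0
    b_common = [b[j] for j in range(len(b)) if used[j]]
    m = len(a_common)
    # Half the number of positions where the two common sequences disagree.
    t = sum(x != y for x, y in zip(a_common, b_common)) / 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


def jaro_winkler_similarity(a: str, b: str, max_prefix: int = 4, scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the shared prefix (capped)."""
    sim = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * scale * (1 - sim)


print(jaro_similarity("MARTHA", "MARHTA"), jaro_winkler_similarity("MARTHA", "MARHTA"))
```

For the pair MARTHA and MARHTA all six characters are common and one transposition occurs, so the Jaro score is 17/18; the shared prefix MAR then raises the Jaro-Winkler score.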

N-gram Similarity

A string can be split into n-grams, i.e. all possible consecutive character sequences of length $n$ in the string. As an example, the word "protein" can be split into the 3-grams "pro", "rot", "ote", "tei" and "ein". When comparing two strings, the number of common n-grams is computed and normalized by the maximum number of n-grams. More specifically, given strings $a$ and $b$, similarity is given by:

\[ \mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}}, \tag{3.9} \]

where $N_{com}$ denotes the number of common n-grams and $N_{max}$ denotes the maximum number of n-grams in either of the two strings.
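A minimal sketch of n-gram similarity is given below. It assumes that repeated n-grams are counted with multiplicity (a multiset intersection), which the definition above leaves open; the function names are chosen here for illustration.

```python
from collections import Counter
from typing import List


def ngrams(s: str, n: int = 3) -> List[str]:
    """All consecutive character sequences of length n in s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Common n-grams divided by the larger n-gram count of the two strings."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    n_max = max(sum(grams_a.values()), sum(grams_b.values()))
    if n_max == 0:
        return 0.0  # assumption: strings shorter than n share nothing
    n_com = sum((grams_a & grams_b).values())  # multiset intersection
    return n_com / n_max


print(ngrams("protein"))
print(ngram_similarity("protein", "proteins"))
```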

3.2.2 Word-based Similarity Measures

As the name implies, word-based measures view the string as a collection of words.

Similarity measures dictate how similar two terms are word-wise, and no weight

is given to character similarity.

Dice Similarity

Dice similarity considers input strings $a$ and $b$ as sets of words $A$ and $B$ respectively, and calculates similarity as follows:

\[ \mathrm{sim}_{dice}(a, b) = \frac{2|A \cap B|}{|A| + |B|}, \tag{3.10} \]

where $|\cdot|$ denotes set cardinality in number of words.



Jaccard Similarity

Jaccard similarity counts the number of common words of the compared strings and divides it by the number of distinct words in both strings, i.e.

\[ \mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}. \tag{3.11} \]
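Both set-based measures can be sketched in a few lines. Lower-casing and whitespace tokenisation are assumptions made here, as is the convention that two empty strings are fully similar.

```python
from typing import Set


def word_set(s: str) -> Set[str]:
    """Split a term into a set of lower-cased, whitespace-separated words."""
    return set(s.lower().split())


def dice_similarity(a: str, b: str) -> float:
    A, B = word_set(a), word_set(b)
    if not A and not B:
        return 1.0  # assumption: two empty terms are identical
    return 2 * len(A & B) / (len(A) + len(B))


def jaccard_similarity(a: str, b: str) -> float:
    A, B = word_set(a), word_set(b)
    if not A and not B:
        return 1.0  # assumption: two empty terms are identical
    return len(A & B) / len(A | B)


print(dice_similarity("heart attack", "acute heart attack"))
print(jaccard_similarity("heart attack", "acute heart attack"))
```

For the same pair of terms the Dice score is never lower than the Jaccard score, since the shared words are counted twice in its numerator.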

Cosine Similarity

In order to compute cosine similarity, the compared strings should be converted to

vectors. The dimension of the resulting vectors will be equal to the total number

of distinct words present in both. Therefore, each element in the vector represents

one word. The vector values for each string are computed as follows: A vector

contains unitary values in positions that correspond to words that are contained

in the respective string. Similarly, a vector contains zero values in all positions

that correspond to words that are not present in the respective string. Given

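strings $a$ and $b$, the respective vectors $\mathbf{a}$ and $\mathbf{b}$ are computed. Cosine similarity is then given by:

\[ \mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}, \tag{3.12} \]

where $\|\cdot\|$ denotes the Euclidean norm function.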

Manhattan Similarity

Taxicab geometry considers that distance between two points in a grid is given

by the sum of the absolute differences of their respective coordinates. The grid

resembles a uniform city road map, where diagonal movements are not permitted.

This is the reason why the distance metric in this space is often called Manhattan

distance or city block distance. Considering N-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$, Manhattan similarity can be computed as:

\[ \mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N}, \tag{3.13} \]

where $N$ is a normalizing constant that represents the dimension of $\mathbf{a}$ and $\mathbf{b}$.


Euclidean Similarity

Euclidean similarity also considers strings as vectors, and computes similarity as:

\[ \mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}. \tag{3.14} \]
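The three vector-based measures of this section can be sketched together, since they share the same binary word vectors. The sketch assumes whitespace tokenisation and, for the Euclidean case, that the normalised sum of squared differences is taken under a square root; all names are hypothetical.

```python
import math
from typing import List, Tuple


def to_vectors(a: str, b: str) -> Tuple[List[int], List[int]]:
    """Binary vectors over the distinct words of both strings (sorted vocabulary)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    vocab = sorted(words_a | words_b)
    return ([1 if w in words_a else 0 for w in vocab],
            [1 if w in words_b else 0 for w in vocab])


def cosine_similarity(u: List[int], v: List[int]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def manhattan_similarity(u: List[int], v: List[int]) -> float:
    return 1 - sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def euclidean_similarity(u: List[int], v: List[int]) -> float:
    # Assumption: square root of the mean squared difference.
    return 1 - math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)) / len(u))


u, v = to_vectors("heart attack", "acute heart attack")
print(u, v)
print(cosine_similarity(u, v), manhattan_similarity(u, v), euclidean_similarity(u, v))
```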

3.3 Ontological Semantic Similarity

An ontology is a collection of concepts and their inter-relationships. It may be

visualized as a graph, in which nodes represent concepts and edges represent the

relations between them. Usually, ontologies are viewed as taxonomies, where is-

a and part-of relations play the most important role. Viewing the ontology as a

taxonomy, one can apply semantic similarity metrics that exploit the hierarchical

structure. Probably the most famous object of semantic similarity tests is the

computational lexicon WordNet Miller (1995). In WordNet, closely related terms

are grouped together to form synsets. These synsets, in turn, form semantic rela-

tions with other synsets. WordNet is commonly referred to as a lexical ontology,

due to an obvious mapping of lexical hyponymy to ontological subsumption.

3.3.1 Intra-ontology Semantic Similarity

Intra-ontology semantic similarity metrics are meant to measure similarity be-

tween concepts that reside within the same ontology. These metrics can be

roughly divided into distance-based, information-based and feature-based.

Distance-based Metrics

Distance-based metrics take advantage of the ontological topology to compute

the similarity between concepts. This method requires viewing the ontology as

a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges

among them are restricted to hierarchical relationships, with the most usual type



being is-a relationships. At the top, there is a single concept, the root. The graph

is directed, starting from a low-level concept and directed towards its ancestors

through transitive relationships. The graph is also acyclic, since a finite path

from a source node to a destination node cannot return to the source node. In

other words, a node can never be a child of one of its children.

A simple look at an ontology from a geometric perspective may reveal im-

portant information about the similarity of concepts. As depth in the DAG

increases, concepts become increasingly specific, thus similarity is expected to

increase. Another important characteristic of the ontology DAG is that the path

between concepts is not always unique, therefore distance-based similarity will

depend on which path is chosen. Finally, the density of nodes is a good indicator

of similarity; as density increases, concepts approach each other and similarity

increases.

The accuracy of distance-based methods depends on the level of detail that

the ontology captures. A poorly structured ontology with many omissions might

yield misleading similarity results. Fortunately, a lot of effort has been made to

make biomedical ontologies as complete as possible, therefore network density in

biomedical ontologies is usually high.

The most straightforward way to measure the similarity of concept nodes is

given in Rada et al. (1989). In that work, all edges are assigned

a unitary weight and the distance between two concepts is equal to the number

of edges that are present in their shortest path. Let us consider two distinct

concepts $c_1$ and $c_2$ in the hierarchy. Each path $path_i$ that connects these two concept nodes may be represented as a set which includes all edges $e_k$ present in the path, i.e.

\[ path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\}, \tag{3.15} \]

with cardinality $|path_i(c_1, c_2)| = K$. The distance between concepts $c_1$ and $c_2$ is then equal to the length of the shortest path that connects them, i.e.,

\[ d_{rada}(c_1, c_2) = \min_i |path_i(c_1, c_2)|. \tag{3.16} \]



Note that in the literature there are cases (e.g. Al-Mubaid and Nguyen (2006)) where Rada's measure is used with node counting instead of edge counting. In those cases, each path is represented as a set of the nodes that compose it, including the end nodes. The minimum distance can be converted into a similarity metric, as in Resnik (1995):

\[ \mathrm{sim}_{rada}(c_1, c_2) = 2D - d(c_1, c_2), \tag{3.17} \]

where $D$ is the maximum depth of the taxonomy. This method fails to capture the intuition that concept nodes which reside at the lower part of the hierarchy and are separated by distance $d$ are more similar than higher-level nodes with the same distance separation $d$. Also, its success highly depends on the uniformity of edge distribution within the ontology. For these reasons, other approaches have been proposed in order to achieve a more representative score of similarity.
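With unit edge weights, the shortest path of Rada's measure can be found by a breadth-first search over the is-a edges, ignoring their direction. The sketch below uses a hypothetical toy taxonomy and assumes that the maximum depth D is the longest is-a chain measured in edges.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy taxonomy: concept -> list of direct parents.
PARENTS: Dict[str, List[str]] = {
    "disease": [],
    "cardiovascular disease": ["disease"],
    "infectious disease": ["disease"],
    "heart disease": ["cardiovascular disease"],
    "vascular disease": ["cardiovascular disease"],
    "myocardial infarction": ["heart disease"],
    "viral infection": ["infectious disease"],
}


def rada_distance(parents: Dict[str, List[str]], c1: str, c2: str) -> int:
    """Number of edges on the shortest path between c1 and c2."""
    adjacency = {c: set(ps) for c, ps in parents.items()}
    for child, ps in parents.items():
        for p in ps:
            adjacency.setdefault(p, set()).add(child)
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")


def max_depth(parents: Dict[str, List[str]]) -> int:
    """Longest chain of is-a edges from any concept up to a root."""
    memo: Dict[str, int] = {}

    def depth(c: str) -> int:
        if c not in memo:
            memo[c] = 0 if not parents[c] else 1 + max(depth(p) for p in parents[c])
        return memo[c]

    return max(depth(c) for c in parents)


def rada_similarity(parents: Dict[str, List[str]], c1: str, c2: str) -> int:
    return 2 * max_depth(parents) - rada_distance(parents, c1, c2)


print(rada_distance(PARENTS, "myocardial infarction", "vascular disease"))
print(rada_similarity(PARENTS, "myocardial infarction", "vascular disease"))
```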

In Wu and Palmer (1994), the relative depth of the compared concepts in the hierarchy is considered. In that work, Wu and Palmer introduce the Least Common Subsumer (LCS) of the compared concepts. The LCS is the hierarchically deepest common ancestor of the compared concepts. Similarity for concepts $c_1$ and $c_2$ is then given as:

\[ \mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h}, \tag{3.18} \]

where N1 is the number of nodes in the path between concept c1 and the LCS,

N2 is the number of nodes between concept c2 and the LCS, and h is the depth

of the LCS, measured again in number of nodes.

In Li et al. (2003), the authors followed various strategies in their attempt to calculate similarity as a function of the shortest path between the compared concepts, the depth of their LCS and the local density of the ontology. They found that the best performance was obtained with the following non-linear function:

\[ \mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha \, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}, \tag{3.19} \]

where $\alpha, \beta$ are non-negative parameters and $h = d_{rada}(LCS(c_1, c_2), root)$ denotes the minimum depth of the LCS. Distances are measured in number of edges.
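The measures of Wu and Palmer and of Li et al. can be sketched on a small single-parent taxonomy, where the shortest path between two concepts always passes through their LCS. Several choices below are assumptions made for illustration: the toy taxonomy, the function names, counting N1 and N2 without the LCS node itself (so that identical concepts score 1), and the parameter values alpha = 0.2 and beta = 0.6.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: concept -> single parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "disease": None,
    "cardiovascular disease": "disease",
    "infectious disease": "disease",
    "heart disease": "cardiovascular disease",
    "vascular disease": "cardiovascular disease",
    "myocardial infarction": "heart disease",
    "viral infection": "infectious disease",
}


def path_to_root(c: str) -> List[str]:
    """The concept followed by its ancestors, ending at the root."""
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1: str, c2: str) -> str:
    """Deepest concept that subsumes both c1 and c2."""
    ancestors1 = set(path_to_root(c1))
    return next(node for node in path_to_root(c2) if node in ancestors1)


def edges_up(c: str, ancestor: str) -> int:
    return path_to_root(c).index(ancestor)


def wu_palmer(c1: str, c2: str) -> float:
    common = lcs(c1, c2)
    n1, n2 = edges_up(c1, common), edges_up(c2, common)
    h = len(path_to_root(common))  # depth of the LCS in nodes (root = 1)
    return 2 * h / (n1 + n2 + 2 * h)


def li_similarity(c1: str, c2: str, alpha: float = 0.2, beta: float = 0.6) -> float:
    common = lcs(c1, c2)
    d = edges_up(c1, common) + edges_up(c2, common)  # shortest path in edges
    h = len(path_to_root(common)) - 1                # depth of the LCS in edges
    return math.exp(-alpha * d) * math.tanh(beta * h)


print(wu_palmer("myocardial infarction", "vascular disease"))
print(li_similarity("myocardial infarction", "vascular disease"))
```

The fraction in the Li et al. formula is the hyperbolic tangent of beta times h, which is why `math.tanh` is used directly.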



Al-Mubaid and Nguyen attempt to combine path length and node depth in one

measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition of clusters, with each cluster having as root a child of the ontology root. The usage of clusters aims to exploit local characteristics of different branches. Given concepts $c_1$ and $c_2$, they first compute their so-called common specificity:

\[ CSpec(c_1, c_2) = D_c - h, \tag{3.20} \]

where $D_c$ denotes the depth of the specific cluster and $h$ refers to the depth of the LCS in the ontology, with both quantities measured in number of nodes. Then similarity is computed as:

\[ \mathrm{sim}_{a\&n}(c_1, c_2) = \log\left( (Path - 1)^{\alpha} \times (CSpec)^{\beta} + k \right), \tag{3.21} \]

where $Path$ is a modified version of Rada's distance measure which is adapted according to the largest cluster, and $\alpha, \beta, k$ are constants, whose default values are unitary.

Information-Based Metrics

One of the first attempts to focus on nodes in the similarity formula is that

of Leacock and Chodorow Leacock and Chodorow (1998). This method uses

negative log likelihood in a way that resembles the formula of self-information

Cover and Thomas (2012), but does not really involve a valid probability. Instead,

a normalized form of the path length between the concepts is used:

\[ \mathrm{sim}_{l\&c}(c_1, c_2) = -\log\left( \frac{N_p}{2D} \right), \tag{3.22} \]

where Np is the number of nodes in the shortest path between concepts c1 and

c2. This variable also includes the end nodes.

Resnik, in Resnik (1995), continues down this path by replacing the normalized path length with a probability measure $P(\cdot)$ to calculate the information content (IC) of a concept. He considers all common subsumers $CS_i$ of concepts $c_1$ and $c_2$ and calculates similarity as:

\[ \mathrm{sim}_{resn}(c_1, c_2) = \max_i \left[ -\log(P(CS_i)) \right], \tag{3.23} \]

or, equivalently,

\[ \mathrm{sim}_{resn}(c_1, c_2) = -\log(P(LCS)). \tag{3.24} \]

Considering that the IC of a concept $c$ is defined as the negative logarithm of its probability, i.e. $IC(c) = -\log(P(c))$, equation (3.24) can also be written as:

\[ \mathrm{sim}_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)). \tag{3.25} \]

Probabilities are estimated with the help of a text corpus, i.e. a collection of

natural language excerpts, specifically chosen to provide a good representation of

actual term usage. When dealing with biomedical ontology concepts, collections

of Pubmed1 abstracts are commonly used as corpora to determine the probability

of each concept.

Given a corpus, the occurrence of a term which corresponds to concept c

essentially implies the occurrence of each and every concept that subsumes c

within the ontological structure. Conversely, the number of occurrences of a

concept $c$ depends not only on the number of appearances of $c$ itself in the corpus, but also on every occurrence of its descendants in the hierarchy. Thus, the number of occurrences of concept $c$ is given by:

\[ occ(c) = \sum_{n \in subsumed(c)} count(n), \tag{3.26} \]

where $subsumed(c)$ represents $c$ and all of its descendant concept nodes, and $count(\cdot)$ denotes the number of occurrences of the specific concept within the given corpus. Converting occurrences to probability can be done using:

\[ P(c) = \frac{occ(c)}{N}, \tag{3.27} \]

where $N$ is the total number of occurrences of ontology terms in the corpus.

This method results in higher probabilities for concepts residing at the top part of the hierarchy, with the root having unitary probability. Therefore, concepts whose LCS lies lower in the hierarchy are more similar, since their LCS has low probability (i.e., high IC).

1 http://www.ncbi.nlm.nih.gov/pubmed
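The corpus-based computation can be sketched on a toy single-parent taxonomy. The concept names and the corpus counts below are invented purely for illustration, and the helper names are assumptions.

```python
import math
from typing import Dict, List, Optional

# Hypothetical taxonomy (concept -> parent) and invented corpus counts.
PARENT: Dict[str, Optional[str]] = {
    "disease": None,
    "cardiovascular disease": "disease",
    "infectious disease": "disease",
    "heart disease": "cardiovascular disease",
    "vascular disease": "cardiovascular disease",
    "myocardial infarction": "heart disease",
}
COUNT: Dict[str, int] = {
    "disease": 2,
    "cardiovascular disease": 3,
    "infectious disease": 5,
    "heart disease": 4,
    "vascular disease": 2,
    "myocardial infarction": 4,
}


def path_to_root(c: str) -> List[str]:
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def occ(c: str) -> int:
    """Occurrences of c plus those of every concept it subsumes."""
    return sum(COUNT[n] for n in PARENT if c in path_to_root(n))


def information_content(c: str) -> float:
    total = sum(COUNT.values())  # N: all occurrences of ontology terms
    return -math.log(occ(c) / total)


def lcs(c1: str, c2: str) -> str:
    ancestors1 = set(path_to_root(c1))
    return next(node for node in path_to_root(c2) if node in ancestors1)


def resnik_similarity(c1: str, c2: str) -> float:
    return information_content(lcs(c1, c2))


print(occ("cardiovascular disease"))
print(resnik_similarity("myocardial infarction", "vascular disease"))
```

In this toy setting the root collects every occurrence, so its probability is one and its IC is zero; any pair whose only common subsumer is the root therefore has zero similarity.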

A possible drawback of this method is that probabilities are tied to the choice

of corpus. So far, in the biomedical domain, there is no widely accepted corpus

that covers the domain needs Al-Mubaid and Nguyen (2006). This is due to the

fact that thousands of new terms and abbreviations appear in the literature every

year, thus a stable corpus might not function well. Since extensions of the corpus

would need to be considered at fixed intervals, it might not serve as a useful

benchmark.

Alternatively, computation of IC can be performed without the use of a corpus,

by solely relying on the structure of the ontology DAG. Intrinsic computation of

IC involves approximating the occurrence probability of a concept as a function

of multiple variables, such as number of descendant nodes, number of subsumers

or number of descendant nodes which are leaves in the ontology. In Seco et al. (2004), the IC of a concept $c$ is given by:

\[ IC_{seco}(c) = 1 - \frac{\log(descendants(c) + 1)}{\log(allConcepts)}, \tag{3.28} \]

where $descendants(c)$ returns the number of nodes that concept $c$ subsumes, and $allConcepts$ denotes the number of all the available concepts in the ontology.

The IC function introduced by Seco et al. has the drawback that it assigns an IC equal to one to every leaf node in the ontology, and also that concepts with the same number of descendant nodes are given the same IC. An attempt to distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also including the depth of the node in the calculation, normalized by the maximum depth of the ontology. The proposed IC formula is given by:

\[ IC_{zhou}(c) = k \, IC_{seco}(c) + (1 - k) \frac{\log(depth(c) + 1)}{\log(maxDepth)}, \tag{3.29} \]

where $depth(c)$ represents the depth of the concept $c$ in the hierarchy, $maxDepth$ is the maximum depth of the ontology, measured in number of nodes, and $k$ is a weighting constant.
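Both intrinsic IC functions can be sketched on a toy single-parent taxonomy. The sketch assumes that depth(c) is counted in edges from the root while maxDepth is counted in nodes, so that the depth term ranges from 0 at the root to 1 at the deepest concept; the taxonomy, the names and the value k = 0.5 are assumptions made here.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: concept -> single parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "disease": None,
    "cardiovascular disease": "disease",
    "infectious disease": "disease",
    "heart disease": "cardiovascular disease",
    "vascular disease": "cardiovascular disease",
    "myocardial infarction": "heart disease",
    "viral infection": "infectious disease",
}


def path_to_root(c: str) -> List[str]:
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def descendants(c: str) -> int:
    """Number of concepts subsumed by c, excluding c itself."""
    return sum(1 for n in PARENT if n != c and c in path_to_root(n))


def ic_seco(c: str) -> float:
    return 1 - math.log(descendants(c) + 1) / math.log(len(PARENT))


def ic_zhou(c: str, k: float = 0.5) -> float:
    depth_edges = len(path_to_root(c)) - 1
    max_depth_nodes = max(len(path_to_root(n)) for n in PARENT)
    depth_term = math.log(depth_edges + 1) / math.log(max_depth_nodes)
    return k * ic_seco(c) + (1 - k) * depth_term


for leaf in ("myocardial infarction", "viral infection"):
    print(leaf, ic_seco(leaf), ic_zhou(leaf))
```

The two leaves receive the same value from the first function but different values from the second, because they lie at different depths.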



The authors in Sanchez et al. (2011) further improve the modeling of the IC function. In that work, the IC function can also distinguish concepts that contain the same number of descendants, due to the fact that the number of subsumers of a concept is also used. The IC is given as:

\[ IC_{san}(c) = -\log\left( \frac{\frac{leaves(c)}{ancestors(c)} + 1}{allLeaves} \right), \tag{3.30} \]

where $leaves(c)$ is the number of nodes that are descendants of $c$ and have no children, $ancestors(c)$ refers to the number of concepts which subsume $c$ and $allLeaves$ denotes the total number of leaf nodes in the ontology. The IC functions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to compute the similarity between two concepts without using a corpus.

Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer (1994). More specifically,

\[ \mathrm{sim}_{l\&p}(c_1, c_2) = \frac{2 \times \mathrm{sim}_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}. \tag{3.31} \]

This approach aims to include the individual characteristics of the compared nodes that Resnik's approach neglected. Indeed, in Resnik's measure, any two pairs of nodes that have the same LCS produce the same similarity.

Jiang and Conrath follow a similar approach to Wu and Palmer (1994), but avoid the scaling of similarity Jiang and Conrath (1997). Instead, they use a distance metric as follows:

\[ d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \times \mathrm{sim}_{resn}(c_1, c_2). \tag{3.32} \]

Various transformations have been applied to convert this distance to similarity. Among these, the authors in Seco et al. (2004) consider a linear transformation and present the following formula of similarity normalized in the interval [0,1]:

\[ \mathrm{sim}_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}. \tag{3.33} \]

Another example can be found in Zhu et al. (2009), in which an exponential function is used for the similarity formula, along with a constant $\lambda$ that accounts for curve steepness:

\[ \mathrm{sim}_{j\&c}(c_1, c_2) = e^{-d_{j\&c}(c_1, c_2)/\lambda}. \tag{3.34} \]
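Once IC values are available, the IC-ratio measure and the Jiang and Conrath distance with its linear conversion reduce to simple arithmetic. The sketch below takes the three IC values as plain numbers; the function names and the example values are assumptions, and the linear conversion stays within [0,1] only when the IC values are themselves normalised to [0,1].

```python
def lin_similarity(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Twice the IC of the LCS divided by the sum of the concepts' IC values."""
    total = ic_c1 + ic_c2
    return 2 * ic_lcs / total if total else 1.0  # assumption: zero IC on both sides means identical


def jiang_conrath_distance(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    return ic_c1 + ic_c2 - 2 * ic_lcs


def jiang_conrath_similarity(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Linear conversion of the distance; assumes IC values normalised to [0, 1]."""
    return 1 - jiang_conrath_distance(ic_c1, ic_c2, ic_lcs) / 2


# Hypothetical normalised IC values for two concepts and their LCS.
print(lin_similarity(1.0, 0.5, 0.25))
print(jiang_conrath_distance(1.0, 0.5, 0.25))
print(jiang_conrath_similarity(1.0, 0.5, 0.25))
```

Unlike a measure that depends on the LCS alone, both functions change when the IC of either compared concept changes, even if the LCS stays the same.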

Feature-Based Measures

Feature-based measures do not necessarily conform to the similarity metric rules of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based techniques, the two compared concepts are viewed as sets of features, in contrast to the geometric view presented in previous sections. To calculate similarity, not only the common features of the concepts are taken into account, but also the differences between them. That way, common features improve similarity, while different features penalize its value Tversky et al. (1977). Given concepts $c_1$ and $c_2$, let $C_1$ and $C_2$ denote the sets that contain their features. Then, similarity between the two can be given as:

\[ \mathrm{sim}_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \gamma |C_1 \setminus C_2| + (1 - \gamma) |C_2 \setminus C_1|}, \tag{3.35} \]

where $\gamma$ is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the parameter is computed as follows:

\[ \gamma = \begin{cases} \dfrac{d(c_1, LCS)}{d(c_1, c_2)}, & d(c_1, LCS) \leq d(c_2, LCS) \\ 1 - \dfrac{d(c_1, LCS)}{d(c_1, c_2)}, & \text{else} \end{cases} \tag{3.36} \]

This asymmetric function stems from Tversky's observation that similarity might not be symmetric. In one of Tversky's examples, North Korea was said to be more similar to Red China than the reverse.
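The asymmetry of the feature-based measure can be seen in a short sketch. The feature sets below are invented, and the weight is passed in directly as a hypothetical parameter rather than derived from the taxonomy.

```python
from typing import Set


def tversky_similarity(features1: Set[str], features2: Set[str], weight: float) -> float:
    """Common features over common features plus weighted distinctive features."""
    common = len(features1 & features2)
    only1 = len(features1 - features2)
    only2 = len(features2 - features1)
    denominator = common + weight * only1 + (1 - weight) * only2
    return common / denominator if denominator else 0.0  # no features to compare


# Hypothetical feature sets: the second concept has two extra features.
small = {"f1", "f2", "f3"}
large = {"f1", "f2", "f3", "f4", "f5"}

print(tversky_similarity(small, large, weight=0.8))
print(tversky_similarity(large, small, weight=0.8))
```

With a weight above 0.5 the distinctive features of the first argument are penalised more heavily, so comparing the richer concept to the poorer one gives a lower score than the reverse comparison.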

3.3.2 Inter-ontology Semantic Similarity

Inter-ontology semantic similarity measures try to quantify the similarity between

concepts that belong to different ontologies. Fairly little research has been doc-

umented in this area, due to the inherent difficulty of comparing heterogeneous

structures. A common approach is to combine the different ontologies into a



single ontology through detailed concept mappings Gangemi et al. (1998). It is

clear that this is very challenging and requires the help of a domain expert, as

well as plenty of time and effort. Furthermore, not all biomedical terminologies

are consistent and their lack of homogeneity is a major problem. Simpler ap-

proaches have been proposed in the literature. A usual first step is to merge the

different ontologies under a dummy root. This approach is found in Rodríguez

and Egenhofer (2003), where the authors use a weighted version of Tversky's

similarity which also takes into account geometrical features of the ontologies.

A similar route is followed by Petrakis et al. (2006), where the authors

substitute Tversky's similarity with a form of Jaccard similarity. The drawback of

these cross-similarity metrics is that they do not consider term overlap in both

ontologies. Other methods rely on extensions of single-ontology similarity

metrics. Examples of such work can be found in Al-Mubaid and Nguyen (2006) and

Sanchez et al. (2012).


Chapter 4

Search Interfaces

Search has risen to be one of the most commonly used tools for computer users.

It can be found everywhere, from stand-alone web-based search engines to em-

bedded search forms that appear in desktop applications and websites. To a large

extent, success of the search procedure depends on the users' ability to formulate

their information needs, transforming them into queries that are highly likely to

produce desired results. For this reason, a lot of effort has been spent on improv-

ing the search interfaces and providing tools that will enhance user experience.

In this chapter, the basic characteristics of successful search interface design are

presented, with main focus on web-search interfaces.

4.1 Information Seeking Models

Information seeking models attempt to recognize and describe the strategies fol-

lowed by humans from the moment they sense a search need until the moment

they acquire desired results. The search procedure may be viewed as a repetition

of actions. In Sutcliffe and Ennis (1998), the authors identify the following four

actions in what is considered the standard model of information seeking:

1. Problem Identification

2. Articulation of Need


3. Query Formulation

4. Evaluation of Results

The first step refers to conceptualization of the search need, while the second step

involves expressing this need in words. The third step requires the user to trans-

form the articulated need into a format that will be accepted by the underlying

search system. Finally, the fourth step refers to the procedure of judging the

results critically, exploiting any relevant domain knowledge and deciding whether

the need is satisfied. A search may be characterized as "ok", "failed" or

"unsatisfactory". An "ok" search ends the cycle successfully. An "unsatisfactory" search

may lead to reformulation of the query or re-articulation of the need, while a

completely failed search might require re-identification of the problem.

Sutcliffe and Ennis's model assumes that the need does not change, unless

results are disappointing. It does not capture the fact that users learn as they

search. This dynamic aspect of information seeking was captured in an earlier

work by Bates (1989). In that study, the user's needs are assumed to change

as the process advances. Furthermore, Bates claims that the success of the search

procedure does not only depend on the final list of results, but on the selections

made along the way. This model is referred to as the "berry-picking" model, to

denote that it does not result in a single set of results. A simple illustration of the

berry-picking model is a user who attempts a broad query such

as "String similarity algorithms" and refines the query to "Jaro similarity" after

viewing this result in the initial result list.

4.2 Query Specification

Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The

width of these forms varies in size, with studies showing that wider forms promote

formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003).

Figure 4.1: The Google search engine entry form.

Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user

queries.

It has been observed that around 88% of search queries are composed of 1 to 4

words, with mean length equal to 2.8 words per query Jansen et al. (2007). The

actual search is executed by pressing the return key or mouse-clicking a specified

button (e.g. the magnifying glass in Bing). In some cases, entry forms decorate their

background with descriptive text that provides guidance for the user. An example

is Facebook's search form, as seen in Fig. 4.2. The text disappears once the user

clicks inside the form. This usually helps to narrow down the search domain.

After query submission, processing of the query takes place before any attempt

to retrieve results. This process may include removal of stopwords (i.e. words

with high appearance probability such as "the", "a"), normalization of words (e.g.

plural to singular) and permutation of word order. Boolean logic may also be used

in the case of multiple words per query. Returning results that contain all query

words (i.e. Boolean AND operator) seems more intuitive, although this might

sometimes lead to overly specific queries that return no results. The actual types

of processing are often hidden from the users, in an attempt to avoid confusion

and promote transparency Muramatsu and Pratt (2001).
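The processing steps described above can be sketched as follows. This is a deliberately naive illustration and does not describe how any particular search engine works: the stopword list, the plural-stripping rule and all function names are assumptions made for the example.

```python
# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and"}


def normalize(word: str) -> str:
    """Naive plural-to-singular rule: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text: str) -> list:
    """Lowercase the text, split on whitespace, drop stopwords, normalize the rest."""
    return [normalize(token) for token in text.lower().split() if token not in STOPWORDS]


def matches_all(query: str, document: str) -> bool:
    """Boolean AND: every processed query term must occur in the document."""
    return set(preprocess(query)) <= set(preprocess(document))
```

For instance, `preprocess("The similarity metrics")` yields `["similarity", "metric"]`. Under the AND semantics a single query term that is missing from the document rejects it, which shows how overly specific queries can end up with no results.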

Figure 4.3: Bing's search interface features a powerful dynamic search suggestion, where

prefixes are highlighted with grayed-out font and the remaining text is in bold.

Most modern search interfaces are equipped with dynamic search suggestion,

also known as auto-completion (see Fig. 4.3). As the user starts typing, a list of

term suggestions appears under the entry form. The suggestions contained in the

list are usually queries whose prefix matches what has been typed so far, although

there are cases where interior matches are also included. The user can then mouse-

click the most relevant query or navigate through the list, using keyboard arrows.

Studies have shown that approximately one third of all search attempts in the

Yahoo Search Assist were performed through a dynamically suggested query

Anick and Kantamneni (2008). The dynamic search suggestion technique attempts

to minimize unneeded typing from the user side and can alleviate spelling errors

early. Most importantly, though, it reassures the user that results are available,

so there is no frustration from empty result pages.
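Prefix-based suggestion, the most common case described above, can be sketched with a sorted list of known queries and a binary search. This is a minimal illustration only: the function name, the `limit` parameter and the sample queries are hypothetical, and production systems typically rank candidates by popularity rather than alphabetically.

```python
import bisect


def suggest(prefix: str, sorted_queries: list, limit: int = 5) -> list:
    """Return up to `limit` queries that start with `prefix`.

    `sorted_queries` is assumed to be lowercase and sorted, so that all
    queries sharing a prefix form one contiguous run in the list.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_queries, prefix)  # first candidate position
    suggestions = []
    for query in sorted_queries[start:]:
        if not query.startswith(prefix):
            break  # left the contiguous run of matching queries
        suggestions.append(query)
        if len(suggestions) == limit:
            break
    return suggestions


known_queries = sorted([
    "jaro similarity",
    "jaro-winkler similarity",
    "jaccard similarity",
    "levenshtein distance",
])
```

Here `suggest("jar", known_queries)` returns the two queries beginning with "jar". Because every suggestion is drawn from queries already known to the system, the user is only offered completions for which results can be expected.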

Figure 4.4: The Safari browser's embedded search interface explicitly states which queries are

suggestions and which belong to the user's recent search history.

Figure 4.5: The Firefox browser's embedded search interface contains recent queries on top,

and separates them from suggestions using a solid line.

An important point to consider is that searchers often return to their

previously accessed information. In the empirical study undertaken by Tauscher

and Greenberg (1997), it was found that there is a 58%

chance that the next web page to be visited had been visited before. A more

recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010,

also finds page revisitation to be around the same levels, at 59.3%. Various tools

exist to help users find their intended pages, including Uniform Resource Locator

(URL) history, bookmarking of pages, basic navigation buttons (e.g. the "Back"

button for short-term page revisits) and change of URL font color if the page has already

been visited. Among other methods documented, users may save whole webpages

to their local disk or keep URLs in text documents, after enriching them with

comments Jones et al. (2002). Interestingly, a common approach to revisiting

documents is actually re-searching for them Obendorf et al. (2007). Users who

adopt this strategy attempt to re-create the conditions of their previous search, by

trying to formulate the exact same query. Another strategy requires past search

queries to appear as the user types, along with regular dynamic term

suggestion. Separation between suggested queries and previously generated ones varies

among interfaces, as can be seen in Figures 4.4 and 4.5.

Figure 4.6: Google's search results page is a typical scrollable vertical list of captions.

Metadata facets, which restrict results to a particular type of information, are also present in the

interface (e.g. the Images tab).

4.3 Presentation of Search Results

Search applications usually present results as a vertical list of captions, distributed

along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a

minimum requirement, comprises a title and an excerpt of the target document

Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms,

as highlighted text. In most cases, highlighting is performed using bold font or

colored term background. Many search applications tend to group similar results,

that originate from the same source, into the same caption. That way, result


pollution from few sources is avoided and diversity is promoted. The relevance

of search results is reflected in their order of appearance. Although relevance

scores were formerly used to grade the fit of the result to the query, they are

usually not present anymore in modern search applications. The reasons behind

their omission might be to avoid reverse-engineering of the ranking algorithms and

to reduce redundancy, since the ranking itself already reflects the importance of

results Hearst (2009).

It has been observed that users tend to click on the uppermost captions

Joachims et al. (2005). In the same study, it was found that the first caption

received more attention than its successors, even if its relevance was actually

lower. Furthermore, the majority of users often remain on the first page of

results. The authors in Jansen et al. (2007) observed that only 30% continued to

look for relevant results in the second page of the results, and only 15% looked

even further. Usually, the patience of a user is a function of his/her experience

in using the system. More experienced users tend to be more patient than users

who are not accustomed to the search procedure. Inexperienced users, on the

other hand, often prefer to refine their query or simply accept that what they

search for cannot be found by the search application Hearst (2009).

Apart from plain lists of results, further organization of captions may be per-

formed, using some form of faceted browsing. Facets attempt to refine search

results, a
