University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science
Christos Karaiskos
Enhanced Ontological Searching of Medical Scientific Information
Progress Report
Manchester, May 10, 2013
Supervisors: Prof. Andrew Brass
Dr. Jennifer Bradford (AstraZeneca)
ABSTRACT OF
MASTER’S THESIS
Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Date: May 10, 2013 Pages: 7+49+8
Pathway: Data and Knowledge Management
Supervisors: Prof. Andrew Brass
Dr. Jennifer Bradford (AstraZeneca)
An enormous amount of biomedical knowledge is encoded in narrative textual
format. In an attempt to discover new or hidden knowledge, extensive research
is being conducted to extract and exploit term relationships from plain text
with the aid of technology. A common approach for the identification of
biomedical entities in plain text involves the usage of ontologies, i.e.,
knowledge bases which provide formal machine-understandable representations
of domains of variable specificity. In addition to term extraction,
ontologies may also be used as controlled vocabularies or as a means for
automatic knowledge acquisition through their inherent inference
capabilities. Visualization of the content of ontologies is thus very
important for researchers in the biomedical domain. Unfortunately, many of
these researchers find it difficult to deal with formal logic and would
prefer that ontology search interfaces completely hide any structural or
functional references to ontologies, even ones as simple as parent/child
relationships. This thesis proposes a strategy for building an ontology
search application that exploits ontologies behind the scenes, transparently
to the end user, and presents relevant concept information in such a way
that searchers can successfully and quickly find what they are looking for.
The proposed search interface features various search tools for enhanced
ontological searching, such as term auto-completion, error correction,
clever results ranking and similar-concept suggestions based on semantic
similarity metrics.
Keywords: search interface design, ontology hiding, biomedical ontology,
semantic similarity, usability, data integration
#Words: 12754 (Abstract + Chapters 1-6 Isolated)
List of Abbreviations
AI Artificial Intelligence
API Application Programming Interface
DAG Directed Acyclic Graph
GUI Graphical User Interface
HLGT High Level Group Term
HLT High Level Term
IC Information Content
ICD International Classification of Diseases
LCS Least Common Subsumer
MedDRA Medical Dictionary for Regulatory Activities
NCIT National Cancer Institute Thesaurus
NDF-RT National Drug File Reference Terminology
NHS UK National Health Service
OWL Web Ontology Language
PT Preferred Term
RDF Resource Description Framework
RDF-S Resource Description Framework Schema
RF2 Release Format 2
SNOMED CT Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT Systematized Nomenclature of Medicine Reference Terminology
SOC System Organ Class
UMLS Unified Medical Language System
UX User Experience
VA U.S. Department of Veterans Affairs
WHO World Health Organization
XML Extensible Markup Language
Contents
1 Introduction 1
1.1 Problem Context . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Ontologies 6
2.1 Modern Ontology Definition . . . . . . . . . . . . . . . . . . . 6
2.2 Ontology vs. Terminology . . . . . . . . . . . . . . . . . . . . 8
2.3 Notable Biomedical Ontologies and Terminologies . . . . . . . 9
2.3.1 SNOMED CT . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 NDF-RT . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 ICD-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 MedDRA . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.5 NCI Thesaurus . . . . . . . . . . . . . . . . . . . . . . 12
3 Similarity Metrics 13
3.1 Similarity Metric vs. Distance Metric . . . . . . . . . . . . . . 13
3.2 Lexical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Character-based Similarity Measures . . . . . . . . . . 15
Longest Common Substring . . . . . . . . . . . . . . . 15
Hamming Similarity . . . . . . . . . . . . . . . . . . . 15
Levenshtein Similarity . . . . . . . . . . . . . . . . . . 15
Jaro Similarity . . . . . . . . . . . . . . . . . . . . . . 16
Jaro-Winkler Similarity . . . . . . . . . . . . . . . . . 16
N-gram Similarity . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Word-based Similarity Measures . . . . . . . . . . . . . 17
Dice Similarity . . . . . . . . . . . . . . . . . . . . . . 17
Jaccard Similarity . . . . . . . . . . . . . . . . . . . . . 17
Cosine Similarity . . . . . . . . . . . . . . . . . . . . . 18
Manhattan Similarity . . . . . . . . . . . . . . . . . . . 18
Euclidean Similarity . . . . . . . . . . . . . . . . . . . 18
3.3 Ontological Semantic Similarity . . . . . . . . . . . . . . . . . 19
3.3.1 Intra-ontology Semantic Similarity . . . . . . . . . . . 19
Distance-based Metrics . . . . . . . . . . . . . . . . . . 19
Information-Based Metrics . . . . . . . . . . . . . . . . 22
Feature-Based Measures . . . . . . . . . . . . . . . . . 25
3.3.2 Inter-ontology Semantic Similarity . . . . . . . . . . . 26
4 Search Interfaces 27
4.1 Information Seeking Models . . . . . . . . . . . . . . . . . . . 27
4.2 Query Specification . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Presentation of Search Results . . . . . . . . . . . . . . . . . . 32
4.4 Query Reformulation . . . . . . . . . . . . . . . . . . . . . . . 33
5 Design 36
5.1 AstraZeneca’s Search Application . . . . . . . . . . . . . . . . 36
5.2 Design Considerations . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Ontology Access . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Ontology Manipulation . . . . . . . . . . . . . . . . . . 39
5.2.3 Search Entry Form . . . . . . . . . . . . . . . . . . . . 40
5.2.4 Result Calculation . . . . . . . . . . . . . . . . . . . . 40
5.2.5 Error Correction . . . . . . . . . . . . . . . . . . . . . 41
5.2.6 Results Presentation . . . . . . . . . . . . . . . . . . . 41
5.2.7 Concept Presentation . . . . . . . . . . . . . . . . . . . 42
5.2.8 Navigation . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.9 History . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.10 Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Current Implementation State . . . . . . . . . . . . . . . . . . 44
6 Conclusions and Future Work 49
Chapter 1
Introduction
Ontologies are knowledge bases which provide formal machine-understandable
representations of domains of variable specificity. Given a domain of
discourse, concepts that belong to the domain are documented in formal
logic, along with their inter-relations. Ontologies, as representations,
cannot perfectly capture the part of the world that they attempt to describe
[Davis et al., 1993]. They are based on the open world assumption, which
states that if something is not represented in a knowledge base, it does not
mean that it does not exist in the real world [Hustadt et al., 1994]. As our
knowledge about a domain increases, ontologies are updated and become more
complex. This has become evident in the biomedical domain, where ontologies
have already attained a high degree of specificity, and has led to their
quick adoption for data integration and knowledge discovery purposes.
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate by promoting
consistent use of biomedical terms and concepts. The construction of an
ontology itself involves mediating across multiple views and requires that a
number of domain experts reach a consensus that reflects the diverse
viewpoints of the community. Ontologies are viewed as tools that provide
opportunities for new knowledge acquisition, due to the complex semantic
relations that they model. Inferences in a huge ontology may reveal
connections that the human eye would miss. This is especially important in
the pharmaceutical sector, where drug discovery has slowed down
significantly as a process, and in the biological sector, where attempts to
demystify genome patterns associated with disease are still at an initial
stage. Another common use for ontologies in the biomedical domain is as
controlled vocabularies that feed filtered terms into computer applications.
Finally, ontologies may be used to connect terms found in plain text to
their semantic representations. Term extraction with the help of ontologies
is a hot topic in biomedicine, due to the vast amounts of medical
information stored in plain text. Given the importance of ontologies,
researchers in the biomedical field commonly require access to their
content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search
form that enabled them to look for concepts in one or more biomedical
ontologies and select the most suitable from a list of search results. The
chosen concepts were, in turn, conveyed to a text mining application.
Understanding the results required the user to be familiar with the content
and structure of the ontology from which the terms were retrieved.
Unfortunately, most users did not feel comfortable with the idea of
ontologies and struggled, or even refused, to use the provided interfaces,
even though no logic-based content was there to confuse them.
In many cases, though, this was not solely the fault of the users. The
interface gave users the freedom to select the ontologies to be searched for
the specified query. Inexperienced users usually did not know, or care,
which ontology contains the desired query term. For example, a user wished
to search for 'Non-small cell lung carcinoma' by its abbreviation, 'NSCLC'.
Querying 'NSCLC' in the MedDRA terminology1 returned no results, since the
concept is not present in the terminology. Although this behavior is
correct, it seems wrong to the inexperienced user and may lead to loss of
trust in the system.
1The difference between terminology and ontology is described in Section 2.2.
But even if the term is present in the ontology, the user should not be
forced to know its exact spelling. For example, querying for 'NSCLC' in the
NCI Thesaurus also returned no results, despite the fact that the actual
concept exists in the ontology. The searcher needed to know that the
preferred term for the 'NSCLC' concept is 'Non-small cell lung carcinoma'.
Abbreviations and dissimilar synonyms are common in the biomedical field, so
expecting the user to know the preferred term for each concept is
problematic.
In addition to the above, the presentation of results was not always
straightforward. Terms with a strong semantic relation to each other were
presented as stand-alone terms in the search results, subtly misleading
users into deducing that the terms were independent. It was up to the user
to judge the relevance of results to the query. For example, the results for
'Non-small cell lung carcinoma' in NCIT included, among others, the terms
'Non-small cell lung carcinoma' and 'Stage I non-small cell lung carcinoma'
equally spaced, in a way that users could not infer the connections between
them. In fact, the latter term is a specialization of the former. In
practice, users chose all the terms, even though they were looking for the
broad term, because they became confused and did not want to take the risk
of selecting only one.
This collapse at the human-computer interface has motivated AstraZeneca
to try to build tools that take advantage of the ontology structure and, at the
same time, completely hide it from the user in order to facilitate the search
procedure.
1.3 Contribution
The outcome of this thesis will be the development of a user-friendly search
application that allows users to find information about concepts present in
a medical ontology, without requiring them to understand the underlying
structure of the ontology. Information about a concept may include its
accession code within the given ontology, the term for its preferred name,
its definition and all available synonym terms. In order to facilitate the
search procedure and enhance the User Experience (UX), the search application
will include features such as dynamic term suggestion, spelling correction,
recent query history and basic navigational functionality (e.g. back, forward
buttons).
The main challenge lies in the presentation of results; as stated in
Section 1.2, users are usually unsure which term(s) to choose when multiple
similarly-spelt terms appear. Ranking of terms will be performed with the
aid of both lexical and semantic similarity. The former will screen the
terms that best match the user query and rank them according to a string
relevance metric. These results will then be processed by the latter, so
that terms showing a strong semantic connection are grouped together.
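This two-stage idea can be illustrated with a minimal, hypothetical Python
sketch: the lexical screen below uses a simple character-overlap ratio as a
stand-in string relevance metric, and the grouping step folds each retained
term under its broadest retained ancestor. The function names, the
placeholder metric and the threshold are illustrative assumptions, not the
thesis's actual implementation.

```python
# Hypothetical sketch of the proposed two-stage ranking: a lexical screen
# followed by grouping of semantically connected terms.
from difflib import SequenceMatcher

def lexical_score(query, term):
    """Placeholder lexical relevance metric (ratio of matching characters)."""
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()

def rank_and_group(query, terms, parent_of, threshold=0.5):
    """Screen terms lexically, then group each retained term under its
    broadest retained ancestor so that related results appear together."""
    kept = sorted((t for t in terms if lexical_score(query, t) >= threshold),
                  key=lambda t: -lexical_score(query, t))
    groups = {}
    for term in kept:
        head = term
        # walk up the hierarchy while the ancestor also survived the screen
        while parent_of.get(head) in kept:
            head = parent_of[head]
        groups.setdefault(head, []).append(term)
    return groups
```

Grouping by the broadest surviving ancestor directly addresses the 'Stage I
non-small cell lung carcinoma' confusion described in Section 1.2: the
specialization is listed under its broad term rather than beside it.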
Ideally, the search application should bridge across terms from multiple
ontologies. Due to the diversity in the format and annotation of different
ontologies, this is not a straightforward generalization. Most importantly,
within the biomedical community, the term 'ontology' is often used
erroneously to describe plain terminologies that, in fact, violate basic
ontological principles.2 Therefore, ontology-specific difficulties are
expected to arise if semantic similarity measures are to be deployed.
In summary, the goals of this thesis are the following:
1. To develop user-friendly search tools that allow users to build search
queries based on the terms present in a medical ontology, without need
for the users to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order
to enhance the quality and presentation of results.
3. To bridge across ontologies and intermix their results appropriately.
1.4 Thesis Organization
The present report is organized in a total of six chapters. Chapter 2
includes an introduction to ontologies and a brief description of some
notable biomedical ontologies. Chapter 3 presents the background needed for
understanding the different measures of lexical and semantic similarity.
Chapter 4 discusses interface design principles for user-centered search
applications. Chapter 5 presents the design considerations taken into
account for the implementation of the ontological search application, along
with the current implementation state. Finally, conclusions are drawn in
Chapter 6, along with possible future directions.
2In MedDRA, the synonym of a term may be a child node of the term itself.
Chapter 2
Ontologies
The term 'ontology' is an uncountable noun coined in philosophy by the
ancient Greek philosophers [Guarino, 1998]. It denotes the study of the
nature of existence, at a fairly abstract level. In computer science, the
word 'ontology' refers to the encoding of human knowledge in a format that
allows for computational use. This chapter includes an introduction to the
modern definition of ontology, along with a brief description of some of the
most notable biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a
specification of a (shared) conceptualization [Gruber et al., 1995]. A
conceptualization refers to an individual's knowledge about a specific
domain, acquired through "experience, observation or introspection"
[Huang et al., 2010]. Ontologies are shared conceptualizations, meaning that
multiple participants, usually domain experts, contribute to their
construction, maintenance and expansion. Conflicts are certain to arise
among the different participants, so an important aspect of ontology design
is bridging multiple views of the desired domain into a single concrete
representation. A specification, on the other hand, is a transformation of
this shared conceptualization into a formal representation language.
The outcome of a formal representation of a domain is a collection of
entities, expressions and axioms. Entities include:
• concepts or classes, which are sets of individuals (e.g., ‘Country’, which
contains all countries),
• individuals, which are specific instances of classes (e.g., ‘Greece’ as an
instance of ‘Country’),
• data types (e.g. string, integer),
• literals, which are specific values of a given data type (e.g. 1,2,3, or
string values),
• properties (e.g. hasDisease, hasAge).
Expressions refer to descriptions of entities in a formal representation
language. The standardized family of languages for formal ontology
representation is the Web Ontology Language (OWL), which builds on the
Extensible Markup Language (XML), Resource Description Framework (RDF) and
RDF Schema (RDF-S) standards to provide a highly expressive means for
representing knowledge [McGuinness et al., 2004]. The underlying format of
the resulting OWL document can vary among several types, with the most
common being RDF/XML.
Finally, axioms relate entities and expressions. The connection can be made
class-to-class (e.g., SubClassOf), individual-to-class (e.g.,
ClassAssertion) or property-to-property (e.g., SubPropertyOf), among others.
These relations can be asserted explicitly or inferred by a reasoner, based
on the logical relations of concepts. As an example of a simple inference, a
concept's ancestors can be inferred automatically, once its parent concept
is specified.
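The ancestor inference just described amounts to a transitive closure over
asserted parent links. The sketch below illustrates the idea in Python; the
tiny hierarchy is a made-up example, not taken from any real ontology.

```python
def ancestors(concept, parents):
    """Infer all ancestors of a concept by following asserted "is-a"
    (parent) links transitively, as a reasoner would."""
    result = set()
    stack = list(parents.get(concept, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(parents.get(p, []))
    return result

# Hypothetical mini-hierarchy: only direct parents are asserted.
parents = {
    "Non-small cell lung carcinoma": ["Lung carcinoma"],
    "Lung carcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasm"],
}

print(ancestors("Non-small cell lung carcinoma", parents))
# all three ancestors are inferred from the single asserted parent link
```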
An ontology may be visualized as a graph, in which concepts are nodes
and relations are edges between nodes. Furthermore, if transitive hierarchical
relations are isolated (e.g. subsumption, also known as “is-a” relation or
hyponymy), the ontology can be viewed as a taxonomy. The geometrical
visualization of an ontology will be presented in more detail in chapter 3.
2.2 Ontology vs. Terminology
A terminology is a collection of term names associated with a given domain.
A term is a mapping of a concrete concept to natural language. This
term-to-concept mapping is usually not one-to-one, especially in the
biomedical domain, where term variation and term ambiguity arise
[Ananiadou and McNaught, 2006]. Term variation is a result of the richness
of natural language and refers to the existence of multiple terms for the
description of the same concept. For example, the terms 'Transmembrane 4
Superfamily Member 1', 'TM4SF1' and 'L6 Antigen' all point to the same
protein. Term ambiguity occurs when a term is mapped to more than one
distinct concept. This is common when new abbreviations are introduced
[Liu et al., 2002]. As an example, some of the concepts that the acronym
'CTX' may map to are 'Cardiac Transplantation', 'Clinical Trial Exemption'
and 'Conotoxin'. Their disambiguation is a matter of context.
A terminology is not constrained to being a simple list of terms. In fact,
most terminologies feature some kind of structure, where terms that map to
the same concept are grouped together and semantic relationships between
concepts are explicitly or implicitly stated. Semantic relationships between
terms include synonymy and antonymy, while semantic relationships between
concepts include hyponymy, hypernymy, meronymy and holonymy
[Jurafsky and Martin, 2000]. Synonymy exists when two terms are
interchangeable, while antonymy denotes that two terms have opposite
meanings. Hyponymy introduces a parent-child, or "is-a", relation between
concepts: a concept is a hyponym of another concept if the former derives
from the latter and represents a more granular concept. Hyponymy is
transitive; if concept a is a child of concept b, and concept b is a child
of concept c, then a is also a child of c. Hypernymy is the reverse relation
of hyponymy. Meronymy exists when a concept represents a part of another
concept; holonymy is the opposite relation, where a concept has some other
concept(s) as parts.
The difference between a terminology and an ontology is not always clear, as
terminologies continue to improve their state of organization in ways that
resemble ontologies. The initial scope and aim of the two, though, are
clearly different: the purpose of a terminology was initially, as the name
implies, to collect all terms associated with a specified domain, whereas
the target of an ontology has, from the start, been to provide a
machine-readable specification of a shared conceptualization. Despite their
many common characteristics, terminologies are not necessarily ontologies.
If treated as ontologies, they may lead to inconsistencies or wrong
inferences [Ananiadou and McNaught, 2006]. An illustrative example is the
case of MedDRA, which is discussed in Section 2.3.4.
2.3 Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published
online. According to Bioportal1 statistics, the five most viewed ontologies
or terminologies are SNOMED Clinical Terms, National Drug File,
International Classification of Diseases (ICD), MedDRA and the NCI
Thesaurus. In this section, a brief introduction to these
ontologies/terminologies is given.
2.3.1 SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a
biomedical terminology which covers most areas within medicine, such as
drugs, diseases, operations, medical devices and symptoms. It may be used
for the coding, retrieval and processing of clinical data. SNOMED CT is
written purely in a formal logic-based syntax, is distributed in the
so-called Release Format 2 (RF2), and is organized into multiple independent
hierarchies. It is the result of merging the UK National Health Service's
(NHS) Read Codes with the SNOMED Reference Terminology (SNOMED RT),
developed by the College of American Pathologists. The basic hierarchies, or
axes, are 'Clinical Finding' and 'Procedure'. The latest version contains
more than 400,000 concepts and over 1,000,000 relationships, rendering
SNOMED CT the most complete terminology in the medical domain. Only a few
definitions are present in the terminology. Each concept contains a unique
identifier and numerous synonymous terms that account for term variation.
Also, each concept is part of at least one hierarchy and may have multiple
"is-a" relationships with higher-level nodes. SNOMED CT is part of the
Unified Medical Language System (UMLS), a biomedical ontology and
terminology integration effort which comprises hundreds of resources.
1Bioportal is a biomedical ontology/terminology repository which provides
online ontology presentation and manipulation tools
(http://bioportal.bioontology.org/).
2.3.2 NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the
U.S. Department of Veterans Affairs (VA) as a formalized representation of a
medication terminology, written in description logic syntax [VHA, 2012]. The
terminology is organized into concept hierarchies, where each concept is a
node comprising a list of term synonyms and a unique identifier. As
expected, top-level concepts are more general than lower-level ones. The
central hierarchy is named DRUG KIND and indicates the types of medications,
the preparations used in them and clinical VA drug products.
Other hierarchies include
• DISEASE KIND,
• INGREDIENT KIND,
• MECHANISM OF ACTION KIND,
• PHARMACOKINETICS KIND,
• PHYSIOLOGIC EFFECT KIND,
• THERAPEUTIC CATEGORY KIND,
• DOSE FORM and
• DRUG INTERACTION KIND.
Roles exist between different concepts and are specified only with
existential restrictions (i.e., the OWL equivalent of someValuesFrom).
Mappings to other terminologies are also available. Currently, NDF-RT
contains more than 45,000 concepts in hierarchies of maximum depth 12.
2.3.3 ICD-10
The International Statistical Classification of Diseases and Related Health
Problems (ICD) is a terminology which attempts to classify signs, symptoms
and causes of disease and morbidity [WHO, 1992]. It appeared in the
mid-19th century and is now maintained by the World Health Organization
(WHO). It is currently available in its 10th revision, although the 11th
revision is claimed to be at the final stage before release. As a taxonomy,
it has a relatively small maximum depth, equal to 6. The code assigned to
each concept ties it to a specific place in the taxonomy, with each code
having only a single parent. It is thus not a proper application of
ontological principles2, since, in reality, it is not unusual for concepts
to belong to more than one subsumer, and this is not modeled. In addition,
there exist categories such as "Not otherwise specified" or "Other", which
are not needed in an ontology; the open world assumption already covers the
fact that every ontology is incomplete, so stating it explicitly is
redundant and may interfere with the evolution of the ontology, as new terms
are not classified under their closest match.
2.3.4 MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology
concerned with biopharmaceutical regulatory processes. It contains terms
associated with all phases of the drug development cycle. MedDRA is
organized in a hierarchical structure of fixed depth, as seen in Fig. 2.1.
System Organ Classes (SOCs) represent the 26 predefined overlapping
hierarchies to which terms belong. High Level Group Terms (HLGTs) and High
Level Terms (HLTs) are general term groupings, denoting disorders or
complications. Preferred Terms (PTs) denote the preferred name for a
concept, while Lowest Level Terms (LLTs) include terms of maximum
specificity. LLTs may be connected to their PTs by hyponymy, meronymy or
synonymy relationships. This is the main problem in trying to view MedDRA
as an ontology: in a formal ontology, a concept cannot be a child of itself,
yet in MedDRA this clearly happens when a PT and one of its LLTs share a
synonymy relation.
2Nor was it meant to be; its intent is classification.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth
hierarchy.
2.3.5 NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology
for cancer research. The thesaurus has been converted to formal OWL syntax
and is updated at fixed intervals. The conversion was not an easy one; many
inconsistencies and modeling dead-ends encountered in the conversion
procedure have been documented [Ceusters et al., 2005], along with some
clear violations of ontological principles [Schulz et al., 2010]. The NCIT
provides almost 100,000 concepts, with approximately 65% containing a
definition.
Chapter 3
Similarity Metrics
Similarity metrics aim at measuring the lexical or semantic similarity between
terms. Lexical similarity focuses on terms that contain similar character
or word sequences, while semantic similarity tries to determine how close
in meaning the terms are. Lexical similarity is simpler to calculate, since
string-based algorithms only require plain text to function. On the other
hand, semantic similarity requires extra information about the terms present
in plain text. This extra information is usually acquired with the help of
a knowledge base (e.g. ontology, terminology, etc.) or through statistical
analysis of corpora, i.e., large collections of text documents that resemble
real-world usage of words.
3.1 Similarity Metric vs. Distance Metric
It is common in the literature to come across the term 'semantic distance'
instead of 'semantic similarity'. A distance metric d(a, b), which compares
entities a and b, must satisfy the following properties:
1. d(a, b) = 0 if and only if a = b (zero property),
2. d(a, b) = d(b, a) (symmetry property),
3. d(a, b) ≥ 0 (non-negativity property),
4. d(a, c) ≤ d(a, b) + d(b, c) (triangle inequality).
On the other hand, the requirements for a similarity metric were formally
introduced not long ago [Chen et al., 2009]. The definition states that a
similarity metric s(a, b) must satisfy the following properties:
1. s(a, a) ≥ 0,
2. s(a, b) = s(b, a),
3. s(a, a) ≥ s(a, b),
4. s(a, b) + s(b, c) ≤ s(a, c) + s(b, b),
5. s(a, a) = s(b, b) = s(a, b) if and only if a = b.
The counter-intuitive 4th property can be proven using set theory. More
specifically, if \(|a \cap b|\) denotes the cardinality of the common
characteristics of a and b, and \(\bar{c}\) denotes the complement of c, the
following equality holds:

\[
|a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|. \tag{3.1}
\]

Then,

\[
|a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|
+ |a \cap b \cap c| + |\bar{a} \cap b \cap c|
\le |a \cap c| + |b|, \tag{3.2}
\]

since \(|a \cap b \cap c| \le |a \cap c|\) and
\(|a \cap b \cap c| + |a \cap b \cap \bar{c}| + |\bar{a} \cap b \cap c| \le |b|\).

Deduction of similarity from distance is a common procedure that requires
only simple operations. Similarity is, intuitively, a decreasing function of
distance. Conversion between the two can take many forms
[Chen et al., 2009]. In this thesis, all formulas will be presented as
similarity measures.
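For illustration, one common conversion of a non-negative distance into a
similarity in (0, 1] is shown below. The particular form 1/(1 + d) is an
assumption chosen for the example; it is only one of the many forms
mentioned above.

```python
def similarity_from_distance(d):
    """Convert a non-negative distance d into a similarity in (0, 1].
    A distance of 0 maps to similarity 1, and similarity decreases
    monotonically as distance grows."""
    if d < 0:
        raise ValueError("a distance metric is non-negative")
    return 1.0 / (1.0 + d)
```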
3.2 Lexical Similarity
String-based methods that calculate lexical similarity can be divided into
character-based and word-based. In this section, some of the most popular
metrics are presented. For a more complete survey of lexical similarity
measures, see [Gomaa and Fahmy, 2013] and [Navarro, 2001].
3.2.1 Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences,
and similarity is computed from the characters that the strings share.
Longest Common Substring
The Longest Common Substring algorithm [Gusfield, 1997] tries to find the
maximum number of consecutive characters that two strings share. It may
be implemented using a suffix tree or dynamic programming.
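A minimal dynamic-programming sketch of the algorithm, returning the length
of the longest shared run of consecutive characters:

```python
def longest_common_substring(a, b):
    """Length of the longest run of consecutive characters shared by a and
    b, via dynamic programming over character positions."""
    best = 0
    # prev[j] = length of the common suffix ending at a[i-1] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

The suffix-tree implementation mentioned above runs in linear time, whereas
this dynamic program is quadratic but much simpler to write.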
Hamming Similarity
Hamming similarity is a metric that can be applied to strings of equal
length. It is a simple metric that counts the positions at which the two
strings share the same character. Given strings a and b, the similarity
formula can be written as follows:

\[
\mathrm{sim}_{ham}(a, b) = \frac{\sum_{i} \mathbf{1}(a_i = b_i)}{|a|}, \tag{3.3}
\]

where \(\mathbf{1}(\cdot)\) is the indicator function and \(|\cdot|\)
denotes string length, measured in characters.
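Eq. 3.3 transcribes directly into Python:

```python
def hamming_similarity(a, b):
    """Fraction of positions at which equal-length strings a and b share
    the same character (Eq. 3.3)."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```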
Levenshtein Similarity
Levenshtein distance counts the number of character alterations that need to
be made in order to transform one string into another [Levenshtein, 1966].
This number is bounded by the length of the longer string, which is commonly
used as a normalizing factor that restrains the value of the distance to
[0, 1]. Mathematically, the normalized Levenshtein distance of terms a and b
is computed using the following formula:

\[
d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}}, \tag{3.4}
\]

where

\[
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max\{i, j\}, & \min\{i, j\} = 0 \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j]
\end{cases}, & \text{otherwise}
\end{cases}
\tag{3.5}
\]

and max{·}, min{·} denote the maximum and minimum functions, respectively.
Converting the normalized distance to similarity can be done as follows:

\[
\mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b). \tag{3.6}
\]
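The recurrence in Eq. 3.5 can be computed bottom-up with an iterative
dynamic program, avoiding the exponential cost of the naive recursion; the
sketch below combines Eqs. 3.4-3.6 into a single similarity function:

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity (Eqs. 3.4-3.6), computed with an
    iterative dynamic program over two rows of the edit-distance table."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # lev(0, j) = j, the base case
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)    # lev(i, 0) = i
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    distance = prev[len(b)] / max(len(a), len(b))
    return 1.0 - distance
```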
Jaro Similarity
Jaro similarity [Jaro, 1989, Jaro, 1995] takes into account both the number
and sequence of common characters present in the two strings. Let us
consider strings a = a_1 ... a_K and b = b_1 ... b_L. A character a_i is said to
be common with b if the same character occurs in b within a window of
min{|a|, |b|}/2 positions around position i. Let a′ = a′_1 ... a′_{K′} be those
characters in a that are common with b, and b′ = b′_1 ... b′_{L′} those characters
in b that are common with a. A transposition for a′, b′ is a position i at which
a′_i ≠ b′_i. The number of transpositions for a′, b′ divided by two is denoted
as T_{a′,b′}. Then, Jaro’s
formula for similarity is given by:
sim_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right).    (3.7)
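A sketch of the matching procedure in Python. Note that the matching-window convention varies across sources; the version below uses the widely used max(|a|, |b|)/2 − 1 window rather than the min-based one quoted above, so treat it as illustrative:

```python
def jaro(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    # Matching window; conventions differ across sources.
    window = max(0, max(len(a), len(b)) // 2 - 1)
    matched_b = [False] * len(b)
    a_common = []
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ch:
                matched_b[j] = True
                a_common.append(ch)
                break
    b_common = [b[j] for j in range(len(b)) if matched_b[j]]
    m = len(a_common)
    if m == 0:
        return 0.0
    # Half the number of positions where the common sequences disagree.
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) / 2
    return (m / len(a) + m / len(b) + (m - transpositions) / m) / 3
```

With this window, the classic pair “MARTHA”/“MARHTA” yields 17/18 ≈ 0.944.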
Jaro-Winkler Similarity
Jaro-Winkler similarity [Winkler, 1999] is a variation of Jaro similarity which
promotes strings with long common prefixes. The length of the longest prefix
common to both strings a and b is denoted as P. Then, if P′ = \min(P, 4),
Jaro-Winkler similarity is given by:
sim_{j\&w}(a, b) = sim_{jaro}(a, b) + \frac{P'}{10} \left(1 - sim_{jaro}(a, b)\right).    (3.8)
N-gram Similarity
A string can be split into n-grams, i.e. all possible consecutive character
sequences of length n in the string. As an example, the word “protein”
can be split into the 3-grams “pro”, “rot”, “ote”, “tei” and “ein”. When
comparing two strings, the number of common n-grams is computed and
normalized by the maximum number of n-grams. More specifically, given
strings a and b, similarity is given by:
sim_{ngram}(a, b) = \frac{N_{com}}{N_{max}},    (3.9)
where Ncom denotes the number of common n-grams and Nmax denotes the
maximum number of n-grams in either of the two strings.
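In Python, one reading of this measure treats each string as the set of its distinct n-grams (a multiset count is an equally valid reading of the formula); this sketch is ours:

```python
def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    # Distinct n-grams of each string; common count over the larger count.
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))
```

For the example above, “protein” and “proteins” share all five 3-grams of the shorter word out of six in the longer, giving 5/6.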
3.2.2 Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of
words. Similarity measures dictate how similar two terms are word-wise;
no weight is given to character-level similarity.
Dice Similarity
Dice similarity considers input strings a and b as sets of words A and B
respectively, and calculates similarity as follows:
sim_{dice}(a, b) = \frac{2|A \cap B|}{|A| + |B|},    (3.10)
where | · | denotes set cardinality in number of words.
Jaccard Similarity
Jaccard similarity counts the number of common words of the compared
strings and divides it by the number of distinct words in both strings, i.e.
sim_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}.    (3.11)
Cosine Similarity
In order to compute cosine similarity, the compared strings should be con-
verted to vectors. The dimension of the resulting vectors will be equal to
the total number of distinct words present in both. Therefore, each element
in the vector represents one word. The vector values for each string are
computed as follows: a vector holds the value one in every position whose
corresponding word appears in the respective string, and zero in every
position whose corresponding word does not. Given strings a and b, the
respective vectors a and b are computed. Cosine similarity is then given by:
sim_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\, \|\mathbf{b}\|},    (3.12)
where || · || denotes the Euclidean norm function.
Manhattan Similarity
Taxicab geometry considers that distance between two points in a grid is
given by the sum of the absolute differences of their respective coordinates.
The grid resembles a uniform city road map, where diagonal movements are
not permitted. This is the reason why the distance metric in this space
is often called Manhattan distance or city block distance. Considering N-
dimensional string vectors a and b, Manhattan similarity can be computed as:
sim_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N},    (3.13)
where N is a normalizing constant that represents the dimension of a and b.
Euclidean Similarity
Euclidean similarity also considers strings as vectors, and computes similarity
as:
sim_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}.    (3.14)
3.3 Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be
visualized as a graph, in which nodes represent concepts and edges represent
the relations between them. Usually, ontologies are viewed as taxonomies,
where “is-a” and “part-of” relations play the most important role. Viewing
the ontology as a taxonomy, one can apply semantic similarity metrics that
exploit the hierarchical structure. Probably the best-known testbed for se-
mantic similarity is the computational lexicon WordNet [Miller, 1995].
In WordNet, closely related terms are grouped together to form synsets.
These synsets, in turn, form semantic relations with other synsets. WordNet
is commonly referred to as a lexical ontology, due to an obvious mapping of
lexical hyponymy to ontological subsumption.
3.3.1 Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity
between concepts that reside within the same ontology. These metrics can be
roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics
Distance-based metrics take advantage of the ontological topology to com-
pute the similarity between concepts. This method requires viewing the
ontology as a rooted Directed Acyclic Graph (DAG), in which nodes are
concepts and edges among them are restricted to hierarchical relationships,
with the most usual type being “is-a” relationships. At the top, there is a
single concept, the root. The graph is directed, starting from a low-level con-
cept and directed towards its ancestors through transitive relationships. The
graph is also acyclic, since a finite path from a source node to a destination
node cannot return to the source node. In other words, a node can never be
a child of one of its children.
A simple look at an ontology from a geometric perspective may reveal
important information about the similarity of concepts. As depth in the DAG
increases, concepts become increasingly specific, thus similarity is expected
to increase. Another important characteristic of the ontology DAG is that
the path between concepts is not always unique, therefore distance-based
similarity will depend on which path is chosen. Finally, the density of nodes
is a good indicator of similarity; as density increases, concepts approach each
other and similarity increases.
The accuracy of distance-based methods depends on the level of detail
that the ontology captures. A poorly structured ontology with many omis-
sions might yield misleading similarity results. Fortunately, a lot of effort has
been made to make biomedical ontologies as complete as possible, therefore
network density in biomedical ontologies is usually high.
The most straightforward way to measure the similarity of concept nodes
is given in [Rada et al., 1989]. In that work by Rada et al., all edges are
assigned a unitary weight and the distance between two concepts is equal to
the number of edges that are present in their shortest path. Let us consider
two distinct concepts c1 and c2 in the hierarchy. Each path i that connects
these two concept nodes may be represented as a set which includes all edges
ek present in the path, i.e.
path_i(c_1, c_2) = \{e_1, e_2, \ldots, e_K\},    (3.15)
with cardinality |pathi(c1, c2)| = K. The distance between concepts c1 and
c2 is, then, equal to the shortest path that connects them, i.e.,
d_{rada}(c_1, c_2) = \min_{\forall i} |path_i(c_1, c_2)|.    (3.16)
Note that in the literature, there are cases (e.g. [Al-Mubaid and Nguyen, 2006])
where Rada’s measure is used with node counting instead of edge counting.
In those cases, each path is represented as a set of the nodes that compose
it, including the end nodes. The minimum distance can be converted into a
similarity metric, as in [Resnik, 1995]:
sim_{rada}(c_1, c_2) = 2D - d(c_1, c_2),    (3.17)
where D is the maximum depth of the taxonomy. This method fails to
capture the intuition that concept nodes, which reside at the lower part of
the hierarchy and are separated by distance d, are more similar than higher-
level nodes with the same distance separation d. Also, its success highly
depends on the uniformity of edge distribution within the ontology. For
these reasons, other approaches have been proposed in order to achieve a
more representative score of similarity.
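The shortest-path computation behind Rada's distance can be sketched as a breadth-first search over the undirected view of the hierarchy. The concept names and edges below are an assumed toy example, not a real ontology:

```python
from collections import deque

# Hypothetical is-a edges: child -> list of parents.
PARENTS = {
    "disease": [],
    "cancer": ["disease"],
    "infection": ["disease"],
    "carcinoma": ["cancer"],
    "viral infection": ["infection"],
}

def rada_distance(c1: str, c2: str) -> int:
    # Edge-counting shortest path, treating hierarchical edges as undirected
    # [Rada et al., 1989].
    neighbours = {c: set(ps) for c, ps in PARENTS.items()}
    for child, parents in PARENTS.items():
        for p in parents:
            neighbours[p].add(child)
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("concepts are not connected")
```

In this toy hierarchy, “carcinoma” and “viral infection” are four edges apart (via cancer, disease, infection).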
In [Wu and Palmer, 1994], the relative depth of the compared concepts
in the hierarchy is considered. In that work, Wu and Palmer introduce the
Least Common Subsumer (LCS) of the compared concepts. The LCS is the
deepest ancestor node common to both concepts, lying on the shortest
path between them. Similarity for concepts c1 and c2 is then given
as:
sim_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h},    (3.18)
where N_1 is the number of nodes in the path between concept c_1 and the
LCS¹, N_2 is the number of nodes between concept c_2 and the LCS, and h is
the minimum depth of the LCS towards the root, measured again in number
of nodes.
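Following the node-counting convention above, the measure can be sketched on a hypothetical tree-shaped hierarchy fragment (concept names are illustrative):

```python
# Hypothetical tree: concept -> its single parent (None for the root).
PARENT = {
    "disease": None,
    "cancer": "disease",
    "infection": "disease",
    "carcinoma": "cancer",
}

def path_to_root(c: str) -> list:
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path  # from c up to the root, endpoints included

def wu_palmer(c1: str, c2: str) -> float:
    p1, p2 = path_to_root(c1), path_to_root(c2)
    lcs = next(n for n in p1 if n in p2)   # deepest common ancestor
    h = len(path_to_root(lcs))             # depth of the LCS, in nodes
    n1 = p1.index(lcs) + 1                 # nodes from c1 to the LCS, inclusive
    n2 = p2.index(lcs) + 1
    return 2 * h / (n1 + n2 + 2 * h)
```

Note that under strict node counting the similarity of a concept with itself stays below one; edge-counting variants of the formula avoid this artifact.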
In [Li et al., 2003], the authors followed various strategies in their at-
tempt to calculate similarity as a function of the shortest path between the
compared concepts, the depth of their LCS and the local density of the on-
tology. They found that the best performance was obtained when they
used the following non-linear function:
sim_{li}(c_1, c_2) = e^{-\alpha\, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}},    (3.19)
where α, β are non-negative parameters and h = drada(LCS(c1, c2), root)
denotes the minimum depth of the LCS. Distances are measured in number
of edges.
Al-Mubaid and Nguyen attempt to combine path length and node depth
in one measure. In [Al-Mubaid and Nguyen, 2006], they view the DAG as
a composition of clusters, with each cluster having as root a child of the
ontology root. The usage of clusters aims to exploit local characteristics
of different branches. Given concepts c1 and c2, they first compute their
so-called common specificity:
CSpec(c_1, c_2) = D_c - h,    (3.20)
¹ Start and end nodes of the path are also included in the calculation.
where Dc denotes the depth of the specific cluster and h refers to the depth of
the LCS in the ontology, with both quantities measured in number of nodes.
Then similarity is computed as:
sim_{a\&n}(c_1, c_2) = \log\!\left((Path - 1)^{\alpha} \times (CSpec)^{\beta} + k\right),    (3.21)
where Path is a modified version of Rada’s distance measure which is adapted
according to the largest cluster, and α, β, k are constants, whose default
values are unitary.
Information-Based Metrics
One of the first attempts to focus on nodes in the similarity formula is that
of Leacock and Chodorow [Leacock and Chodorow, 1998]. This method uses
negative log likelihood in a way that resembles the formula of self-information
[Cover and Thomas, 2012], but does not really involve valid probability. In-
stead, a normalized form of the path length between the concepts is used:
sim_{l\&c}(c_1, c_2) = -\log(N_p / 2D),    (3.22)
where Np is the number of nodes in the shortest path between concepts c1
and c2. This variable also includes the end nodes.
Resnik, in [Resnik, 1995], continues down this path by replacing the nor-
malized path length with a probability measure P(·) to calculate the infor-
mation content (IC) of a concept. He considers all common subsumers CSi
of concepts c1 and c2 and calculates similarity as:
sim_{resn}(c_1, c_2) = \max_{\forall i} \left[-\log(P(CS_i))\right],    (3.23)
or, equivalently,
sim_{resn}(c_1, c_2) = -\log(P(LCS)).    (3.24)
Considering that the IC of a concept c is defined as the negative logarithm
of its probability, i.e. IC(c) = -\log(P(c)), equation (3.24) can also be written
as:
sim_{resn}(c_1, c_2) = IC(LCS(c_1, c_2)).    (3.25)
Probabilities are estimated with the help of a text corpus, i.e. a collection of
natural language excerpts, specifically chosen to provide a good representa-
tion of actual term usage. When dealing with biomedical ontology concepts,
collections of PubMed² abstracts are commonly used as corpora to determine
the probability of each concept.
Given a corpus, the occurrence of a term which corresponds to concept c
essentially implies the occurrence of each and every concept that subsumes
c within the ontological structure. Conversely, the number of occurrences
of a concept c depends not only on the number of appearances of c itself in
the corpus, but also on every occurrence of its descendants in the hierarchy.
Thus, the number of occurrences of concept c is given by:
occ(c) = \sum_{\forall n \in subsumed(c)} count(n),    (3.26)
where subsumed(c) represents c together with its descendant concept nodes,
and count(·) denotes the number of occurrences of the specific concept within
the given corpus. Converting occurrences to probability can be done using:
P(c) = \frac{occ(c)}{N},    (3.27)
where N is the total number of occurrences of ontology terms in the corpus.
This method results in higher probabilities for concepts residing at the top
part of the hierarchy, with the root having unitary probability. Therefore,
concepts whose LCS lies lower in the hierarchy are more similar, since their
LCS has low probability (i.e., high IC).
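The upward propagation of counts in eq. (3.26) and the conversion to IC can be sketched on an assumed toy hierarchy with invented corpus counts:

```python
import math

# Hypothetical hierarchy (child -> parent) and raw corpus counts.
PARENT = {"disease": None, "cancer": "disease", "carcinoma": "cancer"}
COUNTS = {"disease": 10, "cancer": 6, "carcinoma": 4}

def occ(c: str) -> int:
    # Occurrences propagate upwards: a concept counts every appearance
    # of itself and of all its descendants (eq. 3.26).
    return COUNTS[c] + sum(occ(child) for child, p in PARENT.items() if p == c)

def information_content(c: str) -> float:
    # The root subsumes everything, so P(root) = 1 and IC(root) = 0.
    total = occ("disease")
    return -math.log(occ(c) / total)
```

Here occ("disease") = 10 + 6 + 4 = 20, so the root has zero IC while "carcinoma" has IC = −log(4/20).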
A possible drawback of this method is that probabilities are tied to the
choice of corpus. So far, in the biomedical domain, there is no widely accepted
corpus that covers the domain needs [Al-Mubaid and Nguyen, 2006]. This
is due to the fact that thousands of new terms and abbreviations appear in
the literature every year, thus a stable corpus might not function well. Since
extensions of the corpus would need to be considered at fixed intervals, it
might not serve as a useful benchmark.
Alternatively, computation of IC can be performed without the use of
a corpus, by solely relying on the structure of the ontology DAG. Intrinsic
2http://www.ncbi.nlm.nih.gov/pubmed
computation of IC involves approximating the occurrence probability of a
concept as a function of multiple variables, such as number of descendant
nodes, number of subsumers or number of descendant nodes which are leaves
in the ontology. In [Seco et al., 2004], the IC of a concept c is given by:
IC_{seco}(c) = 1 - \frac{\log(descendants(c) + 1)}{\log(allConcepts)},    (3.28)
where descendants(c) returns the number of nodes that concept c subsumes,
and allConcepts denotes the number of all the available concepts in the
ontology.
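Eq. (3.28) needs only two structural counts per concept; a one-function Python sketch (ours):

```python
import math

def ic_seco(descendants: int, all_concepts: int) -> float:
    # Intrinsic information content from the ontology topology alone
    # [Seco et al., 2004]: leaves get IC = 1, the root approaches 0.
    return 1 - math.log(descendants + 1) / math.log(all_concepts)
```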
The IC function introduced by Seco et al. has the drawback that it assigns
an IC equal to one to every leaf node in the ontology, and that concepts
containing the same number of descendant nodes are always given the same
IC. An attempt to distinguish the IC between leaf concepts was made in
[Zhou et al., 2008], by also including the depth of the node in the calculation,
normalized by the maximum depth of the ontology. The proposed IC formula
is given by:
IC_{zhou}(c) = k\, IC_{seco}(c) + (1 - k) \frac{\log(depth(c) + 1)}{\log(maxDepth)},    (3.29)
where depth(c) represents the depth of concept c in the hierarchy, maxDepth
is the maximum depth of the ontology, measured in number of nodes, and k
is a weighting constant.
The authors in [Sanchez et al., 2011] further improve the modeling of the
IC function. In that work, the IC function can also distinguish concepts that
contain the same number of descendants, due to the fact that the number of
subsumers of a concept is also used. The IC is given as:
IC_{san}(c) = -\log\!\left( \frac{\frac{leaves(c)}{ancestors(c)} + 1}{allLeaves} \right),    (3.30)
where leaves(c) is the number of nodes that are descendants of c and have no
children, ancestors(c) refers to the number of concepts which subsume c and
allLeaves denotes the total number of leaf nodes in the ontology. The IC
functions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25)
to compute the similarity between two concepts without using a corpus.
Lin uses IC in an alteration of the similarity metric presented in
[Wu and Palmer, 1994]. More specifically,
sim_{lin}(c_1, c_2) = \frac{2\, sim_{resn}(c_1, c_2)}{IC(c_1) + IC(c_2)}.    (3.31)
This approach aims to include the individual characteristics of the compared
nodes that Resnik’s approach neglected. Indeed, in Resnik’s measure, any
two pairs of nodes that have the same LCS produce the same similarity.
Jiang and Conrath follow an approach similar to [Wu and Palmer, 1994],
but avoid the scaling of similarity [Jiang and Conrath, 1997]. Instead, they
use a distance metric as follows:
d_{j\&c}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\, sim_{resn}(c_1, c_2).    (3.32)
Various transformations have been applied to convert this distance to simi-
larity. Among these, the authors in [Seco et al., 2004] consider a linear trans-
formation and present the following formula of similarity normalized in the
interval [0,1]:
sim_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}.    (3.33)
Another example can be found in [Zhu et al., 2009], in which an exponential
function is used for the similarity formula, along with a constant λ that
accounts for curve steepness:
sim_{j\&c}(c_1, c_2) = e^{-d_{j\&c}(c_1, c_2)/\lambda}.    (3.34)
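The distance of eq. (3.32) and both conversions can be sketched in a few lines; the exponential transform is written here as a decaying function of distance, which is the reading consistent with similarity shrinking as d grows:

```python
import math

def jiang_conrath_distance(ic1: float, ic2: float, ic_lcs: float) -> float:
    # ic_lcs plays the role of sim_resn(c1, c2) in eq. (3.32).
    return ic1 + ic2 - 2 * ic_lcs

def sim_linear(d: float) -> float:
    # Linear transform of eq. (3.33); assumes IC values normalized so d <= 2.
    return 1 - d / 2

def sim_exponential(d: float, lam: float = 1.0) -> float:
    # Exponential transform; lam controls the curve's steepness.
    return math.exp(-d / lam)
```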
Feature-Based Measures
Feature-based measures do not necessarily conform to the similarity met-
ric rules of [Chen et al., 2009], as they allow for similarity asymmetry. In
feature-based techniques, the two compared concepts are viewed as sets of
features, in contrast to the geometric view presented in previous sections.
To calculate similarity, not only the common features of the concepts are
taken into account, but also the differences between them. That way, com-
mon features improve similarity, while different features penalize its value
[Tversky et al., 1977]. Given concepts c1 and c2, let C1 and C2 denote the
sets that contain their features. Then, similarity between the two can be
given as:
sim_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \mu|C_1 - C_2| + (1 - \mu)|C_2 - C_1|},    (3.35)
where µ is a weight which takes values in [0, 1]. In [Rodríguez et al., 1999],
the µ parameter is computed as follows:
\mu =
\begin{cases}
\frac{d(c_1, LCS)}{d(c_1, c_2)}, & d(c_1, LCS) \leq d(c_2, LCS) \\[4pt]
1 - \frac{d(c_1, LCS)}{d(c_1, c_2)}, & \text{else}
\end{cases}    (3.36)
This asymmetric function stems from Tversky’s observation that similarity
might not be symmetric. In one of Tversky’s examples, North Korea was
said to be more similar to Red China than the reverse.
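Eq. (3.35) translates directly into Python over feature sets; the asymmetry appears as soon as µ ≠ 1/2 (or, as below, µ = 1):

```python
def tversky_similarity(C1: set, C2: set, mu: float = 0.5) -> float:
    # Common features raise the score; features unique to either concept
    # lower it, weighted by mu and (1 - mu) respectively (eq. 3.35).
    common = len(C1 & C2)
    return common / (common + mu * len(C1 - C2) + (1 - mu) * len(C2 - C1))
```

With mu = 1, comparing a feature-rich concept to a sparse one gives a lower score than the reverse comparison, mirroring Tversky's North Korea / Red China observation.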
3.3.2 Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity
between concepts that belong to different ontologies. Fairly little research
has been documented in this area, due to the inherent difficulty of com-
paring heterogeneous structures. A common approach is to combine the
different ontologies into a single ontology through detailed concept mappings
[Gangemi et al., 1998]. It is clear that this is very challenging and requires
the help of a domain expert, as well as plenty of time and effort. Fur-
thermore, not all biomedical terminologies are consistent and their lack of
homogeneity is a major problem. Simpler approaches have been proposed in
the literature. A usual first step is to merge the different ontologies under
a dummy root. This approach is found in [Rodriguez and Egenhofer, 2003],
where the authors use a weighted version of Tversky’s similarity which also
takes into account geometrical features of the ontologies. A similar route
is followed by [Petrakis et al., 2006], where the authors substitute Tversky’s
similarity with a form of Jaccard similarity. The drawback of these cross-
similarity metrics is that they do not consider term overlap in both ontolo-
gies. Other methods rely on extensions of single ontology similarity metrics.
Examples of such work can be found in [Al-Mubaid and Nguyen, 2006] and
[Sanchez et al., 2012].
Chapter 4
Search Interfaces
Search has become one of the most commonly used tools for computer
users. It can be found everywhere, from stand-alone web-based search engines
to embedded search forms that appear in desktop applications and websites.
To a large extent, success of the search procedure depends on the users’
ability to formulate their information needs, transforming them into queries
that are highly likely to produce desired results. For this reason, a lot of
effort has been spent on improving the search interfaces and providing tools
that will enhance user experience. In this chapter, the basic characteristics
of successful search interface design are presented, with main focus on web-
search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies
followed by humans from the moment they sense a search need until the
moment they acquire desired results. The search procedure may be viewed as
a repetition of actions. In [Sutcliffe and Ennis, 1998], the authors identify the
following four actions in what is considered the standard model of information
seeking:
1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results
The first step refers to conceptualization of the search need, while the second
step involves expressing this need in words. The third step requires the user
to transform the articulated need into a format that will be accepted by the
underlying search system. Finally, the fourth step refers to the procedure
of judging the results critically, exploiting any relevant domain knowledge
and deciding whether the need is satisfied. A search may be characterized
as ‘ok’, ‘failed’ or ‘unsatisfactory’. An ‘ok’ search ends the cycle successfully.
An ‘unsatisfactory’ search may lead to reformulation of the query or re-
articulation of the need, while a completely ‘failed’ search might require
re-identification of the problem.
Sutcliffe and Ennis’s model assumes that the need does not change, unless
results are disappointing. It does not capture the fact that users learn as they
search. This dynamic aspect of information seeking was captured in an earlier
work by Bates [Bates, 1989]. In that study, the user’s needs are assumed to
change as the process advances. Furthermore, Bates claims that the success
of the search procedure does not only depend on the final list of results, but
on the selections made along the way. This model is referred to as the berry-
picking model, to denote that it does not result in a single set of results. A
simple example of the berry-picking model can be illustrated when a user
attempts a broad query such as “String similarity algorithms” and refines
the query to “Jaro similarity” after viewing this result in the initial result
list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1.
The width of these forms varies in size, with studies showing that wider
forms promote formulation of longer queries [Franzen and Karlgren, 2000,
Belkin et al., 2003]. It has been observed that around 88% of search queries
are composed of 1 to 4 words, with mean length equal to 2.8 words per query
[Jansen et al., 2007]. The actual search is executed by pressing the return
Figure 4.1: The Google search engine entry form.
Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user
queries.
key or mouse-clicking a specified button (e.g. magnifying glass in Bing). In
some cases, entry forms decorate their background with descriptive text that
provides guidance for the user. An example is Facebook’s search form, as
seen in Fig. 4.2. The text disappears once the user clicks inside the form.
This usually helps to narrow down the search domain.
After query submission, processing of the query takes place before any
attempt to retrieve results. This process may include removal of stopwords
(i.e. words with high appearance probability such as ‘the’, ‘a’), normalization
of words (e.g. plural to singular) and permutation of word order. Boolean
logic may also be used in the case of multiple words per query. Returning
results that contain all query words (i.e. Boolean AND operator) seems more
intuitive, although this might sometimes lead to overly specific queries that
return no results. The actual types of processing are often hidden from the
users, in an attempt to avoid confusion and promote transparency, while
hiding implementation details [Muramatsu and Pratt, 2001].
Most modern search interfaces are equipped with dynamic search sug-
gestion, also known as auto-completion (See Fig. 4.3). As the user starts
typing, a list of term suggestions appears under the entry form. The sugges-
tions contained in the list are usually queries whose prefix matches what has
Figure 4.3: Bing’s search interface features a powerful dynamic search suggestion, where
prefixes are highlighted with grayed-out font and the remaining text is in bold.
been typed so far, although there are cases where interior matches are also
included. The user can then mouse-click the most relevant query or navigate
through the list, using keyboard arrows. Studies have shown that approxi-
mately one third of all search attempts in the Yahoo Search Assist were per-
formed through a dynamically suggested query [Anick and Kantamneni, 2008].
The dynamic search suggestion technique attempts to minimize unneeded
typing from the user side and can alleviate spelling errors early. Most im-
portantly, though, it reassures the user that results are available, so there is
no frustration from empty result pages.
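The core of prefix-based suggestion can be sketched as a binary search over a sorted query log; the list contents below are illustrative only:

```python
import bisect

def prefix_suggestions(sorted_queries: list, typed: str, limit: int = 5) -> list:
    # Binary search delimits the contiguous run of entries that share the
    # typed prefix; "\uffff" acts as an upper sentinel for the range.
    lo = bisect.bisect_left(sorted_queries, typed)
    hi = bisect.bisect_right(sorted_queries, typed + "\uffff")
    return sorted_queries[lo:hi][:limit]
```

Real systems layer ranking, interior matching and personalization on top, but the prefix lookup itself stays this simple.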
An important point to consider is that searchers often return to their pre-
viously accessed information. In the empirical study undertaken by Tauscher
and Greenberg [Tauscher and Greenberg, 1997], it was found that there is a
58% chance that the next web page to be visited had been visited before.
A more recent study [Zhang and Zhao, 2011] about tabbed browsing, con-
ducted in 2010, also finds page revisitation to be around the same levels,
at 59.3%. Various tools exist to help users find their intended pages, in-
cluding URL history, bookmarking of pages, basic navigation buttons (e.g.
‘Back’ button for short term page revisit) and change of URL font color if
Figure 4.4: The Safari browser’s embedded search interface explicitly states which queries
are suggestions and which belong to the user’s recent search history.
Figure 4.5: The Firefox browser’s embedded search interface contains recent queries on
top, and separates them from suggestions using a solid line.
page has already been visited. Among other methods documented, users
may save whole webpages to their local disk or keep URLs in text docu-
ments, after enriching them with comments [Jones et al., 2002]. Interest-
ingly, a common approach to revisiting documents is actually re-searching
for them [Obendorf et al., 2007]. Users who adopt this strategy attempt to
re-create the conditions of their previous search, by trying to formulate the
exact same query. Another strategy requires past search queries to appear
Figure 4.6: Google’s search results page is a typical scrollable vertical list of captions.
Metadata facets, which restrict results to a particular type of information, are also present
in the interface (e.g. the ‘Images’ tab).
as the user types, along with regular dynamic term suggestion. Separation
between suggested queries and previously generated ones varies among inter-
faces, as can be seen in Figures 4.4 and 4.5.
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, dis-
tributed along multiple pages (see Fig. 4.6). Each caption is a clickable
entity which, as a minimum requirement, comprises a title and an excerpt of
the target document [Clarke et al., 2007]. Usually, the excerpt includes some
or all of the query terms, as highlighted text. In most cases, highlighting is
performed using bold font or colored term background. Many search applica-
tions tend to group similar results that originate from the same source into
the same caption. That way, result ‘pollution’ from a few sources is avoided
and diversity is promoted. The relevance of search results is reflected in
their order of appearance. Although relevance scores were formerly used to
grade the fit of the result to the query, they are usually not present anymore
in modern search applications. The reasons behind their omission might
be to avoid reverse-engineering of the ranking algorithms and to reduce re-
dundancy, since the ranking itself already reflects the importance of results
[Hearst, 2009].
It has been observed that users tend to click on the uppermost captions
[Joachims et al., 2005]. In the same study, it was found that the first cap-
tion received more attention than its successors, even if its relevance was
actually lower. Furthermore, the majority of users often remain on the first
page of results. The authors in [Jansen et al., 2007] observed that only 30%
continued to look for relevant results in the second page of the results, and
only 15% looked even further. Usually, the patience of a user is a function
of his/her experience in using the system. More experienced users tend to
be more patient than users who are not accustomed to the search procedure.
Inexperienced users, on the other hand, often prefer to refine their query
or simply accept that what they search for cannot be found by the search
application [Hearst, 2009].
Apart from plain lists of results, further organization of captions may be
performed, using some form of faceted browsing. Facets attempt to refine
search results, according to their characteristics. As an example, Amazon’s
search interface provides facets that correspond to the different departments
that might contain the desired item (see Fig. 4.7).
4.4 Query Reformulation
It is common that desired search results are not discovered with the first
try. Query reformulation is the procedure which attempts to transform the
original query to a format that will match the information retrieval sys-
tem’s vocabulary. Studies using query logs have shown that the number
of reformulated queries may reach up to 52% of all queries [Jansen et al., 2005].
It has been observed that, if no help for query reformulation is
given explicitly by the search application, users tend to provide simple alter-
ations of the initial query [Hertzum and Frøkjær, 1996]. This bias towards
initial queries is referred to as anchoring, a term coined by psychologists
[Tversky and Kahneman, 1975].
Figure 4.7: Amazon’s search interface provides facets as a left panel to the results page,
helping the user dynamically refine the initial search.
One of the most common sources of search failure is query mistyping
[Cucerzan and Brill, 2004]. A common approach, which aims to correct ty-
pographical errors, is using a dictionary and finding the most similar term
to the erroneous query [Kukich, 1992]. Among other techniques mentioned
in that work are heuristic rule-based corrections, probabilistic approaches
that determine how often specific sequences of characters are spelt wrong,
and neural network models that train the system to automatically identify
errors. The outcome of the reformulation procedure may be shown explicitly
on the interface as a suggested query (e.g. Google’s ‘Did you mean’), or be
implicitly shown in the results. The former approach is preferred, since it
gives users freedom to decide whether their intent is actually captured in
the proposed correction. More recently, distributional approaches that take
advantage of user query logs are preferred, especially by web-based search
engines [Li et al., 2006].
Another dimension of query reformulation is term expansion. Term ex-
pansion refers to the suggestion of queries that relate to the initial one in some
way. Choice of related queries might take the form of thesaurus-based term
substitution [Dennis et al., 1998] or attempt to extend the present query,
Figure 4.8: PubMed’s results page includes term expansion in two ways. On the right of
the screen, there is a ‘Related searches’ panel that preserves the initial query and adds a
new related term to it. Also, right below the entry form there is a ‘See also’ feature which
suggests complete or partial modifications of the initial query.
usually by adding single words (see Fig. 4.8). Query suggestion might also
be fetched from sessions of users who previously searched for the same infor-
mation. It has also been proposed that search applications ask the user to
provide relevance feedback [Ruthven and Lalmas, 2003]. Although theoreti-
cal studies approve of this feature, its appearance in commercial applications
is rare.
Chapter 5
Design
The main design requirement from AstraZeneca is a “Google-like” search ap-
plication that handles ontologies behind the scenes and provides visual
tools that help users choose the most appropriate term(s). In this
chapter, the previously used search application within AstraZeneca is briefly
described, along with examples and justifications of query failures. Further-
more, the methodology to be followed for improving the implementation is
analyzed.
5.1 AstraZeneca’s Search Application
The ontology search application used by AstraZeneca is integrated into a text
mining application. The user searches for terms which belong to ontologies
and the most relevant ones are used for searching medical documents. The
search application appears as a pop-up window once the user clicks on certain
fields of the text mining application. It includes a form, in which users
type their query, and a results page which lists result entries vertically. The
results page is organized as a two-column table. Each row entry includes
the preferred name for a concept (i.e., left column entry) and the list of
children or synonyms for the specific concept (i.e., right column entry). The
searcher can pick one or more of the table entries that correspond to ontology
concepts, and these terms are, in turn, fed back to the text mining application
for further processing.
Table 5.1: Documented failed queries and suggested reasons for their failure.

Query: Hepatotoxicity
Comments: Searcher did not find the term and decided to search online to find a synonym for it and reformulate the query as ‘Liver Disease’.
Suggested reason for failure: Wrong ontology choice by the user. The term is clearly in MedDRA. It is also a preferred name, so the application would have found it.

Query: NSCLC
Comments: The acronym refers to ‘Non-Small Cell Lung Carcinoma’, a concept which is listed in NCIT. Search returned no results.
Suggested reason for failure: Although the abbreviation ‘NSCLC’ is documented in NCIT, it is not a preferred name, so it was bypassed by the program.

Query: DIHS
Comments: Searcher expected the concept ‘Drug-induced hypersensitivity syndrome’ in MedDRA. No results were returned.
Suggested reason for failure: DIHS does not appear as an abbreviation in MedDRA, so this behavior was normal. The searcher needed to explicitly specify the preferred name, which is ‘Drug-induced hypersensitivity syndrome’.

Query: DRESS Syndrome
Comments: Refers to the same concept as DIHS. It was not found.
Suggested reason for failure: The term exists as an LLT in MedDRA. The application searched only for PTs in that ontology. The PT for ‘DRESS Syndrome’ is ‘Drug rash with eosinophilia and systemic symptoms’.

Query: VEGFR
Comments: Searcher came across multiple returned terms and did not know which one(s) to choose. Therefore, all were chosen.
Suggested reason for failure: The application does not help the user visualize possible relationships among results (e.g. hyponymy). If it did, the user would only choose the broader term and would not be confused.

Query: LHRH
Comments: The most relevant result was ‘Gonadotropin Releasing Hormone’. The searcher did not know that term, so again used internet search.
Suggested reason for failure: The preferred term for ‘LHRH’ is ‘Gonadotropin Releasing Hormone’. The searcher did not have the background knowledge to grade the relevance of the two.

Query: NMDA Antagonist
Comments: The searcher wanted to find a list of the different NMDA antagonists. What was returned was the definition of ‘NMDA Receptor Antagonist’.
Suggested reason for failure: This is an ontology organization characteristic. For example, the NMDA antagonist ‘Ketamine’ is listed in NCIT as a subclass of ‘Anesthetic Substance’, while ‘Aptiganel’ is listed as a subclass of ‘Neuroprotective Agent’.
Although no log file containing extensive lists of query failures is avail-
able for AstraZeneca’s search application, examples of failed queries have
been given. The reasons behind query failure are diverse; Table 5.1 lists
some of the most characteristic failed queries, along with given or deduced
justifications for the reason of failure. It is clear that failure of some queries
was due to the content of the ontologies, therefore inevitable. Changing the
interface would improve many cases of failure, though. Other causes of fail-
ure included wrong ontology chosen by the user, incomplete term coverage
by the search application, lack of help and guidance from the system (e.g.,
relevance feedback or result visualization).
5.2 Design Considerations
In this section the modeling choices for the search engine and its interface
are presented. The actual coding is performed in the Java programming
language, and Graphical User Interface (GUI) design is done entirely in Java Swing.
The main reasons behind this choice were the cross-platform nature of Java
and the wide availability of high-performance Application Programming In-
terfaces (APIs) (e.g. OWL Java API [1], Patricia Trie API [2], OntoCAT API [3]).
Furthermore, the goal of this thesis is not an enterprise-strength application,
but a proof of concept.
5.2.1 Ontology Access
There are two alternatives for accessing biomedical ontologies. The first
is to process each ontology’s file locally, while the second is to access
ontologies through the Bioportal [4] Representational State Transfer (REST)
services online.
The current state of implementation (see Section 5.5) uses a local copy
of the NCIT, in OWL format, and the Java OWL API is chosen to extract
useful information about its contents. The problem with this approach is
[1] http://owlapi.sourceforge.net/
[2] http://code.google.com/p/patricia-trie/
[3] http://www.ontocat.org/
[4] http://bioportal.bioontology.org/
that, when loading large ontologies with the Java OWL API, there is a long
time delay (approximately 30 seconds for the NCIT with 2 GB of RAM) and
the whole ontology remains in memory for the duration of the program.
Therefore, the extensibility of this approach to many large ontologies and the
usability of the application to a novice user are questioned. Various attempts
to provide a database backend for multiple ontologies have been documented
(e.g., see [Iordanov, 2010, Henß et al., 2009]), but loading ontologies into a
local database takes time, requires a private server and, if reasoning is to be
performed, whole ontologies must be brought back into main memory anyway.
Fortunately, we do not need to worry about providing a database backend,
since this work has been done already in Bioportal. Queries about ontologies
may be performed online, using the provided REST web API. Available
REST services include getting all parents, children or properties of a term,
getting all terms of an ontology and getting all paths between a term and
the root. Results are returned in XML format, which is easily parsable. A
Java API for accessing terms from Bioportal is also available, by the name
OntoCAT. The second phase of the implementation will consider using these
Bioportal services to access ontologies.
5.2.2 Ontology Manipulation
Manipulation of ontologies will be performed in multiple stages. Firstly, all
available terms of each ontology will be retrieved and all their word-wise per-
mutations will be saved in a Patricia trie structure, also known as a radix tree.
This type of structure allows for fast prefix-based retrieval. In
our implementation, it will be used for quick auto-completion. The Patricia
trie for a given ontology needs to be built only once and will be saved to
file for future reference. The second type of ontology manipulation is the
computation of semantic similarity for each pair of concepts. This procedure
will involve exploiting the ontology as a DAG and is expected to be com-
putationally intensive. Fortunately, it also needs to be performed once per
ontology and the results can be stored in one or more text files as a trian-
gular (i.e. due to symmetry) matrix. A final form of ontology manipulation
will be to access specific concepts. Once the user chooses a term from the
list of results, the contents of that term must be retrieved. In the case of
using a local OWL file, information about the class with the given IRI can
be fetched using the OWL API, while in the case of using Bioportal REST
services, information about the term can be retrieved by forming an HTTP
request which includes the term’s accession code.
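The permutation indexing described above can be sketched as follows. For illustration, a sorted set stands in for the Patricia trie (both support prefix retrieval; the trie is simply more compact), and all class and method names are assumptions rather than the thesis implementation. A real implementation would also need to bound the number of permutations, which grows factorially with term length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch of the permutation index: every word-wise permutation of each
// ontology term is stored, so that a query matches regardless of word order.
// A TreeSet stands in for the Patricia trie; both support prefix retrieval.
public class PermutationIndex {

    private final SortedSet<String> index = new TreeSet<>();

    // Insert all word-order permutations of a term into the index.
    public void addTerm(String term) {
        permute(new ArrayList<>(List.of(term.toLowerCase().split("\\s+"))),
                new ArrayList<>());
    }

    private void permute(List<String> remaining, List<String> chosen) {
        if (remaining.isEmpty()) {
            index.add(String.join(" ", chosen));
            return;
        }
        for (int i = 0; i < remaining.size(); i++) {
            List<String> rest = new ArrayList<>(remaining);
            String word = rest.remove(i);
            List<String> next = new ArrayList<>(chosen);
            next.add(word);
            permute(rest, next);
        }
    }

    // Return every indexed permutation that starts with the given prefix.
    public List<String> prefixMatches(String prefix) {
        String p = prefix.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String s : index.tailSet(p)) {
            if (!s.startsWith(p)) break; // sorted order: past all matches
            hits.add(s);
        }
        return hits;
    }

    public static void main(String[] args) {
        PermutationIndex idx = new PermutationIndex();
        idx.addTerm("Lung Carcinoma");
        System.out.println(idx.prefixMatches("carcinoma lu")); // prints [carcinoma lung]
    }
}
```

Because every permutation is present, the query ‘carcinoma lu’ reaches the term regardless of the word order the ontology uses.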
5.2.3 Search Entry Form
As mentioned in Section 4.2, queries are usually at most four words long.
That result reflects query specification in web-based search engines, where
users can search for any topic they wish. In the more granular biomed-
ical domain, users usually attempt more targeted searches. Furthermore, the
application to be deployed in this thesis is aimed at term searching, instead
of document searching. Thus, users are aware that they are searching for
short-length terms instead of multi-page documents, and it is likely that
queries are even shorter than the average 2.8 words. Indeed, the example
queries given by AstraZeneca are comprised of at most two words. Also, an
auto-completion feature will be provided, so lengthy terms will not need to
be typed, but simply chosen from a dynamic list. Despite the fact that short
queries are expected, a wide entry form is chosen, to resemble a “Google-like”
experience and provide better visibility for the auto-completion feature.
5.2.4 Result Calculation
Result calculation is performed at each key press. If a user presses
keys at a fast pace, results are calculated only for the most re-
cently submitted query and processing of any previous queries is immediately
terminated. The query is viewed as a bag of words and is compared to all
ontology terms, which are also viewed as bags of words.
Given a query of n words, an ontology term appears in the results if it
shares n−1 words with the query and the remaining word of the query prefix-
matches one word from the term. For example, given a query ‘carcinoma lu’
and the term ‘lung carcinoma’, the word ‘carcinoma’ is contained in both,
while ‘lu’ prefix-matches ‘lung’; therefore, ‘lung carcinoma’ is included in the
result set. In effect, the Boolean AND operator is applied. The above
procedure is equivalent to searching through all possible word permutations
of ontology terms to find the query as a prefix match. So, if we look for the
query ‘carcinoma lu’ as a prefix in ‘lung carcinoma’ and ‘carcinoma lung’, it
will indeed prefix-match the second permutation of the same term. Since all
permutations of terms are already saved in a Patricia trie (see Section 5.2.2),
determining the results is trivial and fast.
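The matching rule above can alternatively be expressed directly as a predicate over word bags, without the trie. The following is a minimal sketch with illustrative names, not the thesis implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the bag-of-words matching rule: a term matches a query of n
// words if n-1 query words appear in the term exactly and the remaining
// query word prefix-matches one of the term's words.
public class TermMatcher {

    public static boolean matches(String query, String term) {
        List<String> queryWords = Arrays.asList(query.toLowerCase().split("\\s+"));
        List<String> termWords = Arrays.asList(term.toLowerCase().split("\\s+"));

        // Try each query word in turn as the incomplete (prefix) word.
        for (int i = 0; i < queryWords.size(); i++) {
            List<String> pool = new ArrayList<>(termWords);
            boolean allExact = true;
            for (int j = 0; j < queryWords.size(); j++) {
                if (j == i) continue;
                if (!pool.remove(queryWords.get(j))) { // exact word match (AND)
                    allExact = false;
                    break;
                }
            }
            if (!allExact) continue;
            String prefix = queryWords.get(i);
            for (String w : pool) {
                if (w.startsWith(prefix)) return true; // prefix match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("carcinoma lu", "lung carcinoma")); // true
        System.out.println(matches("liver lu", "lung carcinoma"));     // false
    }
}
```

In practice the trie-based lookup is preferable, since it avoids scanning every ontology term on each key press.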
After the results list has been completed, it is cleaned of duplicate terms
that correspond to the same concept. Only one representative term is chosen
for each concept; this is not always the preferred term, but the term that
best matches the given query lexically. The results list is maintained until the
user presses the next key or interface button.
5.2.5 Error Correction
If no matches are found, a lexical similarity measure is applied and the closest
match is either directly returned or proposed as a correction and its adoption
is left to the user’s discretion. For cases where error correction proposals pro-
duce very low similarity scores (e.g. < 0.5 with maximum lexical similarity
equal to 1), the Boolean AND operator may be dropped and Boolean OR
may act as a softer replacement.
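The two-step fallback described in this section — propose the closest term when its similarity clears a threshold, otherwise relax the Boolean AND to OR — can be sketched as follows. The 0.5 threshold comes from the text above; the class and method names, and the use of a plain normalised Levenshtein similarity, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the no-match fallback: propose the lexically closest term as a
// correction if it is similar enough; otherwise relax the Boolean AND to OR.
public class FallbackSearch {

    // Classic dynamic-programming Levenshtein distance.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Normalised Levenshtein similarity in [0, 1].
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    public static List<String> handleNoMatch(String query, List<String> terms) {
        String best = null;
        double bestSim = -1.0;
        for (String t : terms) {
            double s = similarity(query.toLowerCase(), t.toLowerCase());
            if (s > bestSim) { bestSim = s; best = t; }
        }
        if (bestSim >= 0.5) {
            return List.of(best); // propose as a 'Did you mean' correction
        }
        // OR fallback: return any term sharing at least one word with the query.
        List<String> hits = new ArrayList<>();
        String[] queryWords = query.toLowerCase().split("\\s+");
        for (String t : terms) {
            List<String> termWords = Arrays.asList(t.toLowerCase().split("\\s+"));
            for (String w : queryWords) {
                if (termWords.contains(w)) { hits.add(t); break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("hepatotoxicity", "liver disease");
        System.out.println(handleNoMatch("hepatotoxcity", terms)); // [hepatotoxicity]
    }
}
```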
5.2.6 Results Presentation
If the user presses any key except ‘return’, the auto-completion function is
triggered as a pop-up window below the search entry form. To fill the auto-
completion window, a list of the most relevant terms is selected from the
results list and presented to the user, along with the relevance score against
the query. The list is fixed in size, holding a maximum of 10 terms. Ranking
is performed by comparing the query with each of the result terms lexically.
This comparison is the maximum of a character-based and a word-based
lexical similarity (e.g., maximum of Levenshtein and Jaccard). For lexical
similarity, the Java version of the Simmetrics [5] library is used.
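The ranking score can be sketched as follows. The thesis uses the Simmetrics library for the lexical measures; the hand-rolled normalised Levenshtein and word-level Jaccard below are illustrative stand-ins, and the class name is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the ranking score: the maximum of a character-based similarity
// (normalised Levenshtein) and a word-based one (Jaccard over word sets).
public class RelevanceScore {

    // Normalised Levenshtein similarity in [0, 1].
    public static double levenshteinSim(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int max = Math.max(n, m);
        return max == 0 ? 1.0 : 1.0 - (double) d[n][m] / max;
    }

    // Jaccard similarity over the two strings' word sets.
    public static double jaccardSim(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static double relevance(String query, String term) {
        return Math.max(levenshteinSim(query, term), jaccardSim(query, term));
    }

    public static void main(String[] args) {
        // 'lung carcinoma' vs 'carcinoma lung': identical word sets, so the
        // word-based Jaccard is 1.0 even though the character-based score is lower.
        System.out.println(relevance("lung carcinoma", "carcinoma lung")); // prints 1.0
    }
}
```

Taking the maximum lets a word-order change (caught by Jaccard) and a small typo (caught by Levenshtein) both score highly.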
If the user presses ‘return’ or clicks the ‘Search’ button of the interface,
[5] http://sourceforge.net/projects/simmetrics/
results are shown in a table. The subset of terms that were screened using
lexical similarity is grouped either by semantic similarity or simply by
determining ancestor-descendant relationships. Suppose that, after the initial
screening, the results list contains a term together with some of its
descendants. Those descendants will not hold their own positions in the
result table. Instead, they will be
listed as subsumed terms of their most distant ancestor present in the results.
This choice is expected to further separate the first result in the ranking from
the rest in terms of relevance, and to make choice easier and less ambiguous
for the user. For example, given the term ‘NSCLC’ and its child term ‘Stage 0
NSCLC’, the latter will be presented within the former, implying a hyponymy
relation.
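The grouping step can be sketched as follows, assuming a parent relation is available from the ontology. A single-parent simplification is used here for brevity, and all names are illustrative; the real NCIT allows multiple parents, which would require walking a DAG instead of a chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the grouping step: descendants whose ancestor also appears in
// the results are folded under that ancestor rather than listed separately.
public class ResultGrouper {

    // child -> parent (single-inheritance simplification for illustration)
    private final Map<String, String> parent = new HashMap<>();

    public void addIsA(String child, String par) {
        parent.put(child, par);
    }

    // Group each result under its most distant ancestor that is itself a result.
    public Map<String, List<String>> group(List<String> results) {
        Set<String> present = new HashSet<>(results);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String r : results) {
            String top = r;
            // Walk up the hierarchy, remembering the farthest ancestor present.
            for (String p = parent.get(r); p != null; p = parent.get(p)) {
                if (present.contains(p)) top = p;
            }
            groups.computeIfAbsent(top, k -> new ArrayList<>());
            if (!top.equals(r)) groups.get(top).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        ResultGrouper g = new ResultGrouper();
        g.addIsA("Stage 0 NSCLC", "NSCLC");
        System.out.println(g.group(List.of("NSCLC", "Stage 0 NSCLC")));
        // prints {NSCLC=[Stage 0 NSCLC]}
    }
}
```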
5.2.7 Concept Presentation
Information about a concept will, at a minimum, include its unique accession
code in the given ontology, its preferred name, any available definition(s),
a list of synonyms and a group of highly similar terms. The grouping of
similar terms will be computed using semantic similarity metrics.
5.2.8 Navigation
A basic form of navigation through pages will be provided through “Back” and
“Forward” buttons, permitting movement within a fixed-size window of
previously visited pages. Furthermore, typical keyboard shortcuts will be present
(e.g. traversing the auto-completion list with keyboard arrow keys).
5.2.9 History
In its final form, the auto-completion function will also host a history fea-
ture. Previously attempted queries will be presented right above the query
suggestions, with a line dividing the two. The history function will use an
independent Patricia trie, updated with every query and saved
to file for future reference.
5.2.10 Feedback
In this first phase, a percentage which reflects (lexical) relevance of results to
the query is also included. Responses from AstraZeneca have been positive
for including this numerical indicator. Once semantic grouping of result
terms is implemented, the utility of the percentage will be reassessed.
5.3 Related Work
The most relevant work is Bioportal, a public web-based repository of biomed-
ical ontologies and terminologies [Noy et al., 2009]. Bioportal features a pow-
erful web search application, which scans multiple ontologies at once. The
main differences between our implementation and Bioportal’s are the following:
• Bioportal’s main search form, which most users use, does not supply an
auto-completion function. Auto-completion is supported as a widget for
individual ontologies, but the feature is not immediately evident to a
novice user of the site.
• Bioportal’s auto-completion feature proposes only preferred names. Start-
ing to type ‘L6 Antigen’ in Bioportal’s NCIT widget will present ‘Trans-
membrane 4 Superfamily Member 1’ in the auto-completion list, a result
whose interpretation depends on the searcher’s ability to judge that
the two indeed refer to the same concept. Our approach will present
‘L6 Antigen’ in the auto-completion menu.
• After performing a search, traversal of concepts in Bioportal depends
on the browser’s ‘Back’ and ‘Forward’ buttons and the user can easily
get lost. Our approach has taken into account navigational aspects of
usability.
• Bioportal does not offer a search history in its search form. Our imple-
mentation will propose past queries.
• Bioportal provides a visual representation of the parents and children
of a concept, if the user selects visualization. Our application’s in-
terface will include a term suggestion visualization based not only on
parent/child relationships, but on general semantic similarity scores.
5.4 Evaluation
Evaluation of the search application will be performed by AstraZeneca.
The new system will be compared to the old one on the same queries,
and user satisfaction or dissatisfaction will be documented.
Evaluation may take the form of a simple questionnaire or a description
of positive and negative feedback provided by the users themselves.
5.5 Current Implementation State
The current implementation uses the OWL file representation of the NCIT,
which is stored locally. Using the OWL Java API, all classes are retrieved and
their annotations exploited. From all annotations, the following are used:
• preferred name,
• synonym list (if present),
• definition (if present).
All preferred names and their synonyms are inserted into a Patricia trie,
in every possible permutation of their word sequence. The main interface is
shown in Fig. 5.1. It includes a wide search entry form, ‘Back’ and ‘Forward’
buttons for navigation, a ‘Clear’ button which clears the interface and results,
and a ‘Search’ button in case the user prefers using the mouse. The auto-
completion feature is already fully functional, as can be seen in Fig. 5.2. To
compute the relevance score for each term in the results list, its Levenshtein
and Jaccard similarity to the query are first evaluated. The maximum of
these similarities is, then, chosen to be the relevance score for the term. An
initial form of error correction is already present, as shown in Fig. 5.3. It
is computed by taking the Levenshtein similarity of the query to each ontology
term and choosing the highest-scoring term. The presentation of results is
shown in Fig. 5.4. It is still at an initial stage, and only lexical similarity
metrics are used. Result grouping according to semantic similarity is not yet
implemented.
Figure 5.1: The main window of the search application.
Figure 5.2: The auto-completion function appears immediately after the user presses a key.
The function works independently of the query word order. The top 10 most relevant results
are shown, together with percentages that indicate lexical similarity to the query.
Figure 5.3: Basic error correction is shown as a proposal, in case the query produces no
results.
Figure 5.4: The results page is a table of entries. Each entry contains the matching term
name, a relevance score, the preferred name for the concept and the ontology source.
Figure 5.5: The term description page currently presents the preferred term name,
definitions and synonym terms. The term chosen by the user to reach this screen
is highlighted (i.e. ‘Liver Cancer’). Note also that the ‘Back’ button is no longer
disabled and can be used to return to the search results.
Chapter 6
Conclusions and Future Work
Ontologies are expected to play a major role in the discovery of new knowl-
edge within the biomedical sector. Providing user-friendly tools that help
researchers navigate efficiently through ontologies without requiring from
them to understand about ontological principles is more likely to help them
reach their final goals quickly, without confusion and frustration. In this
report, proposals were made for enhancing the user experience in ontological
search, through a simple search interface that features enhanced searching
tools such as auto-completion, semantic grouping of results, query reformula-
tion and similar concept suggestion. The current state of implementation will
be further improved to account for multiple ontology searching and semantic
grouping/ranking of results. The choice between local OWL files and web-based
REST services for ontology access will also be reconsidered. Further changes
will also be made on the visual aspects of the interface. Future extensions of
the final outcome may include transforming it into a web-based application
and providing tools that allow for its integration with other applications,
especially those involving text mining.
Bibliography
[Al-Mubaid and Nguyen, 2006] Al-Mubaid, H. and Nguyen, H. A. (2006). A
cluster-based approach for semantic similarity in the biomedical domain.
In Engineering in Medicine and Biology Society, 2006. EMBS’06. 28th
Annual International Conference of the IEEE, pages 2713–2717. IEEE.
[Ananiadou and McNaught, 2006] Ananiadou, S. and McNaught, J. (2006).
Text mining for biology and biomedicine. Artech House Boston, London.
[Anick and Kantamneni, 2008] Anick, P. and Kantamneni, R. G. (2008). A
longitudinal study of real-time search assistance adoption. In Proceedings
of the 31st annual international ACM SIGIR conference on Research and
development in information retrieval, pages 701–702. ACM.
[Bates, 1989] Bates, M. J. (1989). The design of browsing and berrypicking
techniques for the online search interface. Online Information Review,
13(5):407–424.
[Belkin et al., 2003] Belkin, N. J., Kelly, D., Kim, G., Kim, J.-Y., Lee, H.-
J., Muresan, G., Tang, M.-C., Yuan, X.-J., and Cool, C. (2003). Query
length in interactive information retrieval. In Proceedings of the 26th an-
nual international ACM SIGIR conference on Research and development
in information retrieval, pages 205–212. ACM.
[Ceusters et al., 2005] Ceusters, W., Smith, B., and Goldberg, L. (2005). A
terminological and ontological analysis of the NCI Thesaurus. Methods of
Information in Medicine, 44(4):498.
[Chen et al., 2009] Chen, S., Ma, B., and Zhang, K. (2009). On the sim-
ilarity metric and the distance metric. Theoretical Computer Science,
410(24):2365–2376.
[Clarke et al., 2007] Clarke, C. L., Agichtein, E., Dumais, S., and White,
R. W. (2007). The influence of caption features on clickthrough patterns in
web search. In Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, pages
135–142. ACM.
[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). Elements
of information theory. Wiley-Interscience.
[Cucerzan and Brill, 2004] Cucerzan, S. and Brill, E. (2004). Spelling cor-
rection as an iterative process that exploits the collective knowledge of web
users. In Proceedings of EMNLP, volume 4, pages 293–300.
[Davis et al., 1993] Davis, R., Shrobe, H., and Szolovits, P. (1993). What is
a knowledge representation? AI magazine, 14(1):17.
[Dennis et al., 1998] Dennis, S., Robert, M., and Bruza, P. (1998). Searching
the world wide web made easy? the cognitive load imposed by query refine-
ment mechanisms. In Proceedings of ADCS 98 Third Australian Document
Computing Symposium, page 65.
[Franzen and Karlgren, 2000] Franzen, K. and Karlgren, J. (2000). Verbosity
and interface design. SICS Research Report.
[Gangemi et al., 1998] Gangemi, A., Pisanelli, D., and Steve, G. (1998). On-
tology integration: Experiences with medical terminologies. In Formal
ontology in information systems, volume 46, pages 98–94. IOS Press, Amsterdam.
[Gomaa and Fahmy, 2013] Gomaa, W. H. and Fahmy, A. A. (2013). Article:
A survey of text similarity approaches. International Journal of Computer
Applications, 68(13):13–18. Published by Foundation of Computer Science,
New York, USA.
[Gruber et al., 1995] Gruber, T. R. et al. (1995). Toward principles for the
design of ontologies used for knowledge sharing. International journal of
human computer studies, 43(5):907–928.
[Guarino, 1998] Guarino, N. (1998). Formal Ontology in Information Sys-
tems: Proceedings of the 1st International Conference June 6-8, 1998,
Trento, Italy, volume 46. IOS Press, Inc.
[Gusfield, 1997] Gusfield, D. (1997). Algorithms on strings, trees and se-
quences: computer science and computational biology. Cambridge Univer-
sity Press.
[Hearst, 2009] Hearst, M. (2009). Search user interfaces. Cambridge Uni-
versity Press.
[Henß et al., 2009] Henß, J., Kleb, J., Grimm, S., and Bock, J. (2009). A
database backend for OWL. In OWL: Experiences and Directions (OWLED
2009), CEUR Workshop Proceedings. CEUR-WS. org.
[Hertzum and Frøkjær, 1996] Hertzum, M. and Frøkjær, E. (1996). Browsing
and querying in online documentation: a study of user interfaces and the
interaction process. ACM Transactions on Computer-Human Interaction
(TOCHI), 3(2):136–161.
[Huang et al., 2010] Huang, C.-r., Calzolari, N., Gangemi, A., Lenci, A.,
Oltramari, A., and Prevot, L. (2010). Ontology and the Lexicon: A Natural
Language Processing Perspective. Cambridge University Press Cambridge.
[Hustadt et al., 1994] Hustadt, U. et al. (1994). Do we need the closed-world
assumption in knowledge representation. Working Notes of the KI, 94:24–
26.
[Iordanov, 2010] Iordanov, B. (2010). HyperGraphDB: a generalized graph
database. In Web-Age Information Management, pages 25–36. Springer.
[Jansen et al., 2007] Jansen, B. J., Spink, A., and Koshman, S. (2007). Web
searcher interaction with the dogpile.com metasearch engine. Journal of
the American Society for Information Science and Technology, 58(5):744–
755.
[Jansen et al., 2005] Jansen, B. J., Spink, A., and Pedersen, J. (2005). A
temporal comparison of AltaVista web searching. Journal of the American
Society for Information Science and Technology, 56(6):559–570.
[Jaro, 1989] Jaro, M. A. (1989). Advances in record-linkage methodology
as applied to matching the 1985 census of Tampa, Florida. Journal of the
American Statistical Association, 84(406):414–420.
[Jaro, 1995] Jaro, M. A. (1995). Probabilistic linkage of large public health
data files. Statistics in medicine, 14(5-7):491–498.
[Jiang and Conrath, 1997] Jiang, J. and Conrath, D. (1997). Semantic sim-
ilarity based on corpus statistics and lexical taxonomy. In Proc. of the
Int’l. Conf. on Research in Computational Linguistics, pages 19–33.
[Joachims et al., 2005] Joachims, T., Granka, L., Pan, B., Hembrooke, H.,
and Gay, G. (2005). Accurately interpreting clickthrough data as implicit
feedback. In Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval, pages
154–161. ACM.
[Jones et al., 2002] Jones, W., Dumais, S., and Bruce, H. (2002). Once
found, what then? a study of keeping behaviors in the personal use of
web information. Proceedings of the American Society for Information
Science and Technology, 39(1):391–402.
[Jurafsky and Martin, 2000] Jurafsky, D. and Martin, J. H. (2000). Speech
& Language Processing. Pearson Education India.
[Kukich, 1992] Kukich, K. (1992). Techniques for automatically correcting
words in text. ACM Computing Surveys (CSUR), 24(4):377–439.
[Leacock and Chodorow, 1998] Leacock, C. and Chodorow, M. (1998). Com-
bining local context and WordNet similarity for word sense identification.
WordNet: An electronic lexical database, 49(2):265–283.
[Levenshtein, 1966] Levenshtein, V. I. (1966). Binary codes capable of cor-
recting deletions, insertions, and reversals. Technical Report 8.
[Li et al., 2006] Li, M., Zhang, Y., Zhu, M., and Zhou, M. (2006). Explor-
ing distributional similarity based models for query spelling correction. In
Proceedings of the 21st International Conference on Computational Lin-
guistics and the 44th annual meeting of the Association for Computational
Linguistics, pages 1025–1032. Association for Computational Linguistics.
[Li et al., 2003] Li, Y., Bandar, Z. A., and McLean, D. (2003). An approach
for measuring semantic similarity between words using multiple informa-
tion sources. Knowledge and Data Engineering, IEEE Transactions on,
15(4):871–882.
[Liu et al., 2002] Liu, H., Johnson, S. B., and Friedman, C. (2002). Auto-
matic resolution of ambiguous terms based on machine learning and con-
ceptual relations in the UMLS. Journal of the American Medical Informatics
Association, 9(6):621–636.
[McGuinness et al., 2004] McGuinness, D. L., Van Harmelen, F., et al.
(2004). OWL Web Ontology Language overview. W3C recommendation,
10(2004-03):10.
[Miller, 1995] Miller, G. A. (1995). WordNet: a lexical database for English.
Communications of the ACM, 38(11):39–41.
[Muramatsu and Pratt, 2001] Muramatsu, J. and Pratt, W. (2001). Trans-
parent queries: investigating users’ mental models of search engines. In
Proceedings of the 24th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 217–224. ACM.
[Navarro, 2001] Navarro, G. (2001). A guided tour to approximate string
matching. ACM computing surveys (CSUR), 33(1):31–88.
[Noy et al., 2009] Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M.,
Griffith, N., Jonquet, C., Rubin, D. L., Storey, M.-A., Chute, C. G., et al.
(2009). BioPortal: ontologies and integrated data resources at the click of
a mouse. Nucleic acids research, 37(suppl 2):W170–W173.
[Obendorf et al., 2007] Obendorf, H., Weinreich, H., Herder, E., and Mayer,
M. (2007). Web page revisitation revisited: implications of a long-term
click-stream study of browser usage. In Proceedings of the SIGCHI con-
ference on Human factors in computing systems, pages 597–606. ACM.
[Petrakis et al., 2006] Petrakis, E. G., Varelas, G., Hliaoutakis, A., and
Raftopoulou, P. (2006). X-similarity: computing semantic similarity be-
tween concepts from different ontologies. Journal of Digital Information
Management, 4(4):233.
[Rada et al., 1989] Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989).
Development and application of a metric on semantic nets. Systems, Man
and Cybernetics, IEEE Transactions on, 19(1):17–30.
[Resnik, 1995] Resnik, P. (1995). Using information content to evaluate se-
mantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007.
[Rodriguez and Egenhofer, 2003] Rodriguez, M. A. and Egenhofer, M. J.
(2003). Determining semantic similarity among entity classes from dif-
ferent ontologies. Knowledge and Data Engineering, IEEE Transactions
on, 15(2):442–456.
[Rodríguez et al., 1999] Rodríguez, M. A., Egenhofer, M. J., and Rugg, R. D.
(1999). Assessing semantic similarities among geospatial feature class defi-
nitions. In Interoperating Geographic Information Systems, pages 189–202.
Springer.
[Ruthven and Lalmas, 2003] Ruthven, I. and Lalmas, M. (2003). A survey on
the use of relevance feedback for information access systems. The Knowl-
edge Engineering Review, 18(02):95–145.
[Sanchez et al., 2011] Sanchez, D., Batet, M., and Isern, D. (2011).
Ontology-based information content computation. Knowledge-Based Sys-
tems, 24(2):297–303.
[Sanchez et al., 2012] Sanchez, D., Sole-Ribalta, A., Batet, M., and Ser-
ratosa, F. (2012). Enabling semantic similarity estimation across multiple
ontologies: An evaluation in the biomedical domain. Journal of Biomedical
Informatics, 45(1):141–155.
[Schulz et al., 2010] Schulz, S., Schober, D., Tudose, I., and Stenzhorn, H.
(2010). The pitfalls of thesaurus ontologization – the case of the NCI The-
saurus. In AMIA Annual Symposium Proceedings, volume 2010, page 727.
American Medical Informatics Association.
[Seco et al., 2004] Seco, N., Veale, T., and Hayes, J. (2004). An intrinsic
information content metric for semantic similarity in WordNet. In ECAI,
volume 16, page 1089. Citeseer.
[Sutcliffe and Ennis, 1998] Sutcliffe, A. and Ennis, M. (1998). Towards a
cognitive theory of information retrieval. Interacting with computers,
10(3):321–351.
[Tauscher and Greenberg, 1997] Tauscher, L. and Greenberg, S. (1997). How
people revisit web pages: Empirical findings and implications for the design
of history systems. International Journal of Human-Computer Studies,
47(1):97–137.
[Tversky et al., 1977] Tversky, A. et al. (1977). Features of similarity. Psy-
chological review, 84(4):327–352.
[Tversky and Kahneman, 1975] Tversky, A. and Kahneman, D. (1975).
Judgment under uncertainty: Heuristics and biases. Springer.
[VHA, 2012] Veterans Health Administration (VHA) (2012). National Drug
File Reference Terminology (NDF-RT) Documentation. U.S. Department of Veterans Affairs.
[WHO, 1992] World Health Organization (WHO) (1992). International Statistical Classifica-
tion of Diseases and Related Health Problems, Tenth Revision: Introduc-
tion; list of three-character categories; tabular list of inclusions and four-
character subcategories; morphology of neoplasms; special tabulation lists
for mortality and morbidity; definitions; regulations. World Health Orga-
nization.
[Winkler, 1999] Winkler, W. E. (1999). The state of record linkage and
current research problems. In Statistical Research Division, US Census
Bureau. Citeseer.
[Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994). Verb semantics and
lexical selection. In Proceedings of the 32nd annual meeting on Association
for Computational Linguistics, pages 133–138. Association for Computa-
tional Linguistics.
[Zhang and Zhao, 2011] Zhang, H. and Zhao, S. (2011). Measuring web page
revisitation in tabbed browsing. In Proceedings of the 2011 annual confer-
ence on Human factors in computing systems, pages 1831–1834. ACM.
[Zhou et al., 2008] Zhou, Z., Wang, Y., and Gu, J. (2008). A new model of
information content for semantic similarity in WordNet. In Future Genera-
tion Communication and Networking Symposia, 2008. FGCNS’08. Second
International Conference on, volume 3, pages 85–89. IEEE.
[Zhu et al., 2009] Zhu, S., Zeng, J., and Mamitsuka, H. (2009). Enhancing
MEDLINE document clustering by incorporating MeSH semantic similarity.
Bioinformatics, 25(15):1944–1951.