Cluster-based Query Expansion Using Language Modeling for Biomedical Literature Retrieval
A Thesis
Submitted to the Faculty
of
Drexel University
by
Xuheng Xu
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
May 2011
© Copyright 2011 Xuheng Xu. All Rights Reserved
To my parents, brother and sister, and my family.
Acknowledgments
“Plan prayerfully, prepare purposefully, proceed positively, and pursue
persistently.” - William A. Ward
My journey pursuing a PhD degree started seven years ago.
There are very few things in life that do not merely persist but keep increasing.
Knowledge, and the constant urge to do something different, is certainly one of them.
When I look back on this seven-year journey, I am overwhelmed by the constant
encouragement I have received. I am truly humbled to have had such a wonderful
experience in my life and to have worked with such an outstanding group, supervised
by my advisor, of "get in and get it happening" people with a "move forward" attitude.
Firstly, I am very thankful to my advisor, Dr. Xiaohua (Tony) Hu. He
deepened my understanding of the research issues in information studies. His
deep insights and profound knowledge and experience opened up my
mind and led me to immerse myself in, and internalize, the theoretical and technical
details of data mining and bioinformatics. His passion and commitment are truly
the characteristics of a role model I could follow. I have been privileged and
fortunate to have him as my advisor and friend. From the very basics, he guided me
in how to read papers, how to explore questions, how to develop efficient solutions,
how to collaborate with colleagues in the field, how to write and present papers,
and ultimately how to do research. Without his enduring support, lasting
encouragement, and persevering guidance in my research, it is hard to imagine
this thesis would ever have become a reality.
I would also like to thank Drs. Yuan An, Zekarias Berhane, Jiexun Jason Li,
and Il-Yeol Song for their valuable advice, time, and effort serving on my thesis
committee. Dr. Song, in particular, was the first to show me how to organize and write
research papers and give presentations. Dr. Li and Dr. An provided valuable
suggestions toward completing the thesis.
I would like to extend my gratitude to the College of Information Science and
Technology for providing me the opportunity to fulfill my dream of pursuing a Ph.D.
degree here at Drexel, and for funding that permitted me to attend several academic
conferences over the years, through which I met many wonderful people and
learned many valuable life lessons in addition to academic skills and know-how.
For that, I am truly blessed and have many people to thank.
The research related to this thesis was supported in part by NSF Career Grant
IIS 0448023, NSF CCF 0514679, Pennsylvania Department of Health Tobacco
Settlement Formula Grant (No. 240205 and No. 240196), Pennsylvania Department
of Health Grant (No. 239667), and NSF CCF 0905291 and NSFC 90920005 “Chinese
Language Semantic Knowledge Acquisition and Semantic Computational Model
Study”.
This thesis is dedicated to my family. I would not have reached today's
height if it were not for the sacrifice, full support, and hard work of my father and
mother. My father passed away in 2007, but I know he would be glad to see this
thesis in his next life. My deepest appreciation also goes to my brother Xuyan and
sister Xuya, who inspired me to do my best and provided the unity of happiness.
And lastly, Hong Xu, thanks for the trust you have conferred upon me and the
never-ending support over the years. Jason Xu, you are always the best in my eyes.
You show me that everything is possible.
Although I am not sure where this journey might lead me, I know
it has changed me forever.
Table of Contents
LIST OF TABLES ..................................................................................................................... x
LIST OF FIGURES ................................................................................................................. xi
ABSTRACT .......................................................................................................................... xiii
CHAPTER 1. INTRODUCTION ........................................................................................... 1
1.1 Query Expansion ................................................................................................... 3
1.2 Cluster-based Retrieval ....................................................................................... 5
1.3 Scope ..................................................................................................................... 7
1.4 Motivation .............................................................................................................. 8
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW ...................................... 11
2.1 Language Modeling Approach ...................................................................... 11
2.2 Query Expansion Methods ................................................................................ 13
2.2.1 Global Analysis ............................................................................................ 17
2.2.2 Local Analysis............................................................................................... 18
2.3 Document Clustering ......................................................................................... 22
2.3.1 Feature Extraction ....................................................................................... 24
2.3.2 Feature Selection ........................................................................................ 24
2.3.3 Document Representation ....................................................................... 25
2.3.4 Clustering Methods ..................................................................................... 26
2.3.5 Cluster Quality ............................................................................................. 30
CHAPTER 3. A COMPARISON OF LOCAL ANALYSIS, GLOBAL ANALYSIS AND ONTOLOGY-BASED QUERY EXPANSION STRATEGIES FOR BIOMEDICAL LITERATURE RETRIEVAL ................................................................................................. 33
3.1 System Architecture ........................................................................................... 34
3.2 Query Strategies ................................................................................................. 36
3.2.1 Local Analysis............................................................................................... 36
3.3.2 Global Analysis ............................................................................................ 40
3.3.3 Term Re-weighting ...................................................................................... 40
3.4 Experiment Design and Implementation ....................................................... 42
3.4.1 Building the Index ....................................................................................... 43
3.4.2 Preprocessing Query to Search Engines ................................................. 43
3.4.3 Evaluation .................................................................................................... 44
3.5 Experimental Results and Discussion ............................................................... 44
3.5.1 Results ............................................................................................................ 44
3.5.2 Discussion ..................................................................................................... 49
3.6 Conclusion ........................................................................................................... 52
CHAPTER 4. USING TWO-STAGE CONCEPT-BASED SINGULAR VALUE DECOMPOSITION TECHNIQUE AS A QUERY EXPANSION STRATEGY ............. 53
4.1 Introduction ......................................................................................................... 53
4.2 System Architecture ........................................................................................... 54
4.3 Query Strategies ................................................................................................. 56
4.3.1 Stage one - generate concepts and construct concept-based term-document matrices ................................................................................................... 56
4.3.2 Stage two – use SVD technique in LSI to extract the expanded query terms….. ....................................................................................................................... 61
4.4 Experimental Design and Implementation .................................................... 63
4.4.1 MeSH Terms .................................................................................................. 63
4.4.2 UMLS .............................................................................................................. 63
4.4.3 Experimental Setup ..................................................................................... 64
4.5 Preliminary Experimental Results ...................................................................... 66
4.5.1 Evaluation .................................................................................................... 66
4.5.2 Preliminary Results ....................................................................................... 67
CHAPTER 5. CLUSTER-BASED QUERY EXPANSION USING LANGUAGE MODELING IN THE BIOMEDICAL DOMAIN .............................................................. 74
5.1 Introduction ......................................................................................................... 74
5.2 Related Work ....................................................................................................... 76
5.3 Methods ............................................................................................................... 79
5.3.1 Pseudo-relevance-based query expansion (PRB-DOC-QE) ............... 79
5.3.2 Global-cluster-based query expansion (GCB-QE) ................................ 80
5.3.3 Global-cluster-based query expansion (GCB-DOC-QE) ..................... 80
5.3.4 Local-cluster-based query expansion (LCB-RR-QE) .............................. 81
5.3.5 Concept-based pseudo-relevance query expansion (CB-PR-DOC-QE)……. ....................................................................................................................... 81
5.3.6 Concept-based global-cluster query expansion (CB-GC-QE) ........... 81
5.3.7 Concept-based global-cluster query expansion (CB-GC-DOC-QE) 81
5.3.8 Concept-based local-cluster query expansion (CB-LC-RR-QE) ......... 81
5.4 System Architecture ........................................................................................... 82
5.5 Evaluation ............................................................................................................ 83
5.6 Results ................................................................................................................... 90
CHAPTER 6. CONCLUSIONS AND FUTURE WORK .................................................. 91
6.1 Contributions of the Thesis ................................................................................ 91
6.2 Future Work .......................................................................................................... 93
List of Reference .................................................................................................................... 95
VITA ..................................................................................................................................... 102
LIST OF TABLES
Table 3 - 1: Experiment Results from LEMUR Search Tool (OKAPI BM25 MODEL) .............. 44
Table 3 - 2: Experiment Results from LEMUR Search Tool (VECTOR SPACE MODEL) ........ 46
Table 3 - 3: Experiment Results from LUCENE Search Tool (VECTOR SPACE MODEL) ...... 48
Table 4 - 1: Experimental Results (OKAPI BM25 model) ................................................ 67
Table 4 - 2: Experimental Results (OKAPI BM25 model) ................................................ 68
Table 4 - 3: Experiment Results (VECTOR SPACE MODEL) ......................................... 69
Table 4 - 4: Experimental Results (VECTOR SPACE model) ......................................... 70
Table 5 - 1: Experimental Results (Term-based) ............................................................... 85
Table 5 - 2: Experimental Results (Concept-based) .......................................................... 87
LIST OF FIGURES
Figure 1 - 1: Model mismatch from language modeling perspective .............................. 2
Figure 2 - 1: Search stages .................................................................................................... 14
Figure 2 - 2: Query Expansion: Methods and Sources. The figure is copied from [Efthimiadis, 1996] ................................................................................................................ 15
Figure 2 - 3: Hierarchical agglomerative clustering ......................................................... 28
Figure 3 - 1: AQEE-1 System Architecture ........................................................................ 35
Figure 3 - 2: Local analysis approach query expansion ................................................... 37
Figure 3 - 3: Global analysis and term re-weighting query expansion ......................... 41
Figure 3 - 4: Average Precision of All Runs (OKAPI BM25 MODEL) ........................... 45
Figure 3 - 5: P@10 of All Runs (OKAPI BM25 MODEL) ................................................. 45
Figure 3 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 46
Figure 3 - 7: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 47
Figure 3 - 8: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 48
Figure 3 - 9: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 49
Figure 4 - 1: Two-stage concept-based query expansion (AQEE-2) .............................. 55
Figure 4 - 2: Stage one – generate concepts and construct concept-based term document matrices ............................................................................................................... 56
Figure 4 - 3: The algorithm for extracting one concept name and its candidate concept IDs [Zhou et al. 2006] ............................................................................................. 60
Figure 4 - 4: Mean Average Precision of All Runs (OKAPI BM25 model) ................... 68
Figure 4 - 5: P@10 of All Runs (OKAPI BM25 model) ..................................................... 69
Figure 4 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 70
Figure 4 - 7: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 71
Figure 5 - 1: Cluster-Based Query Expansion (AQEE-3) ................................................. 83
Figure 5 - 2: Mean Average Precision of All Runs ........................................................... 86
Figure 5 - 3: P@10 of All Runs ............................................................................................. 86
Figure 5 - 4: Mean Average Precision of All Runs (Concept-based) ............................. 88
Figure 5 - 5: P@10 of All Runs (Concept-based) ............................................................... 89
ABSTRACT
Cluster-based Query Expansion Using Language Modeling for Biomedical Literature Retrieval and Classification
Xuheng Xu
Advisor: Xiaohua Hu, Ph.D.
The enormous volume of biomedical literature, scientists' specific
information needs, long multi-word terms, and the fundamental problems of
synonymy and polysemy have been challenging issues facing biomedical
information retrieval researchers. Search engines have significantly
improved the efficiency and effectiveness of biomedical literature searching.
Search engines, however, are known to return many results that are irrelevant to the
intention of a user's query; in other words, they do not perform well in terms of
precision and recall. To further improve the precision and recall of biomedical
information retrieval, various query expansion strategies are widely used. In this
thesis, we concentrate on empirical comparison, experiments, and evaluation of
query expansion methods. We also use the findings as an empirical
justification for cluster-based query expansion. We have broadly investigated many
query expansion methods, such as local analysis, global analysis, and ontology-based
term re-weighting, across various search engines and obtained important insights.
Among the findings, a two-stage concept-based latent semantic analysis strategy and
cluster-based query expansion are presented, and the Singular Value
Decomposition (SVD) technique of Latent Semantic Indexing (LSI) is utilized in
the proposed method. In contrast to other query expansion methods, our strategy
selects terms that are most similar to the concepts in the query as well as in the
related documents, rather than selecting terms that are similar to the query terms only.
Furthermore, we propose a novel framework for cluster-based query expansion: we
have designed and implemented a novel and efficient computational approach to
cluster-based query expansion using language modeling. Through our experiments
on the TREC Genomics track ad hoc retrieval task, we demonstrate that clusters
created from the whole collection, or from the documents initially returned
for the original query, can be used to perform query expansion and ultimately
improve the overall effectiveness and performance of information retrieval systems
for biomedical literature retrieval. Lastly, we believe the principles of this strategy
may be extended and applied in other domains.
CHAPTER 1. INTRODUCTION
The volume of biomedical literature produced has increased dramatically
over the past several decades. For example, the PubMed database provided by the
United States National Library of Medicine contains over 17 million citations from
over 4,800 biomedical journals dating back to the 1950s, and the United States
National Institutes of Health clinical trials database contains information on over
135,000 clinical trials [PubMed]. To utilize such resources, users such as practicing
physicians and biomedical researchers have to tackle the incredibly time-consuming
and tedious tasks of finding, browsing, reading, and evaluating
relevant biomedical literature. Indeed, the explosion of data combined with
increasing publication rates greatly impairs a person's ability to locate the right
information in such enormous collections.
In order to improve the efficiency and effectiveness of biomedical literature
retrieval, various search engines based on different models have been used. However,
disappointing results, due to the inherent difficulties of using search engines,
continue to occur: for instance, the search results may contain a large number of
irrelevant documents (low precision), or only some portion
of the documents relevant to the user's need (low recall). It has not been easy for
search engines to meet users' expectations and information needs,
especially when the users do not have enough domain knowledge. The
fundamental issues behind these problems are synonymy and polysemy. Per the
Merriam-Webster Dictionary, synonymy refers to words with similar meanings,
while polysemy refers to a word or phrase having multiple meanings; for instance,
"apple" could mean the fruit or the computer company in different contexts. In
addition, it is well understood that people with different domain knowledge,
different skills, and diverse levels of language mastery use different terms to
describe the same concepts. Consequently, the terms employed in queries by users
often differ from those used by the authors of documents in a collection.
Looking at the problem from a language modeling perspective, one can conclude
that there is a mismatch between the user's query model and the document models
of the corpus. Figure 1-1 illustrates the situation.
Figure 1 - 1: Model mismatch from language modeling perspective
[Figure: the user induces a query model, P(Model|User), from which the query is generated, P(Query|Model); the author induces a document model, P(Model|Author), from which the document is generated, P(Doc|Model).]
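To make the P(Query|Model) side of this picture concrete, here is a minimal sketch of ranking documents by query likelihood under a unigram language model with Jelinek-Mercer smoothing. The toy collection, the smoothing weight, and whitespace tokenization are illustrative assumptions, not the thesis's actual setup.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc model): a unigram document model, linearly
    smoothed (Jelinek-Mercer) with the collection model so that
    query terms unseen in the document do not zero out the score."""
    d = Counter(doc.split())
    dlen = sum(d.values())
    c = Counter(t for text in collection for t in text.split())
    clen = sum(c.values())
    score = 0.0
    for t in query.split():
        p = lam * d[t] / dlen + (1 - lam) * c[t] / clen
        if p == 0.0:
            return float("-inf")  # term absent from the entire collection
        score += math.log(p)
    return score

# Rank a toy two-document collection for the query "gene expression".
collection = ["gene expression in yeast", "protein structure prediction"]
ranked = sorted(collection,
                key=lambda doc: query_likelihood("gene expression", doc, collection),
                reverse=True)
```

Under this view, the mismatch of Figure 1-1 shows up as a low query likelihood even for a relevant document whose author happened to choose different terms than the user.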
Since a typical search tends to use a one-word search term [Efthimiadis
1996], the problems mentioned above will certainly deteriorate the retrieval
performance of search engines for queries of one or a few terms. For example,
a one- or few-term query may not be adequate to express a concept accurately.
As fewer terms are included in a query, the chance of a term denoting the same
concept occurring in both the query and the relevant documents can be expected
to decline. In other words, the problem is expected to be worse for short (one- or
two-word) queries than for longer queries. To overcome these problems,
several promising techniques have been developed, such as query expansion and
cluster-based retrieval. The primary goal of this dissertation is to build a novel
approach to cluster-based query expansion that helps improve retrieval performance
in biomedical literature retrieval. As we eventually found, our approach to
cluster-based query expansion is closely related to knowledge structure
extraction, exploration, and discovery, all of which are essential to information science.
1.1 Query Expansion
Query expansion is not new, and many attempts have been made to improve the
state of the art. One rationale behind query expansion is that, due to the synonymy
and polysemy problems, adding more terms to the initial query may improve
precision and recall: because the expanded query contains more terms, the
probability of matching terms in the relevant documents, to some extent, increases.
Query expansion attaches new, additional terms to the initial query terms (the seed
query) provided by the user to improve precision and/or recall. Obviously,
generating these additional terms is the key. Researchers have focused on using
query expansion techniques, manually, interactively, or automatically, to help users
reformulate a better query and eventually achieve higher precision
and higher recall. Query expansion can be done interactively or automatically by
adding terms to the initial query, or manually by dynamically providing high-level
domain information about the collections, such as domain structure, thesauri, etc.
It can also be done by presenting higher-frequency terms from the global collection,
or from locally returned result sets, to the user with suggestions to refine
the initial query.
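As a hedged illustration of the automatic route, the following sketch expands a seed query with the most frequent new terms drawn from the top-ranked (pseudo-relevant) documents. The stopword list, raw-frequency weighting, and cutoff are deliberately simplified assumptions; real systems use more principled term-weighting schemes.

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "and", "a", "to", "is"}

def expand_query(seed_query, top_docs, n_terms=3):
    """Append the n_terms most frequent non-stopword terms found in the
    top-ranked documents but not already present in the seed query."""
    counts = Counter()
    for doc in top_docs:
        for term in doc.lower().split():
            if term not in STOPWORDS and term not in seed_query:
                counts[term] += 1
    return list(seed_query) + [t for t, _ in counts.most_common(n_terms)]

# Toy example: a two-term seed query and three pseudo-relevant titles.
seed = ["gene", "expression"]
top_docs = [
    "regulation of gene expression in cancer cells",
    "microarray analysis of gene expression profiles",
    "gene expression profiles in cancer",
]
expanded = expand_query(seed, top_docs)
```

The expanded query keeps the seed terms and adds terms such as "cancer" and "profiles" that co-occur with the seed in the pseudo-relevant set.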
However, general query expansion techniques have a potential problem.
In normal document-based retrieval (most search engines are document-based),
typical search engines match the query against documents in the collection and
often return to the user a long list of search results, ranked by their relevance to
the given query. It is also well known that a collection typically contains documents
on different subjects or topics. For example, PubMed research articles may be
classified into different subjects or topics, such as diseases. The terms added to
the initial query by various query expansion techniques may have entirely different
meanings across the distinct subjects of the whole collection. In that case, adding
the expanded terms fails to improve the performance of search engines and instead
decreases effectiveness, since the expanded terms introduce irrelevant terms (noise)
into the query. Clearly, in this worst case, word-based query expansion techniques
may not be sufficient to improve the effectiveness of search engines. Therefore,
when considering query expansion, we may have to contemplate clusters. We may
have to look at query expansion from a broader perspective, such as a cluster-based
document collection or clusters of the documents initially retrieved for the user's
original query. In other words, a paradigm shift may be needed when considering
query expansion. For clarity and simplicity, in this dissertation the term query
expansion refers to non-cluster-based expansion.
1.2 Cluster-based Retrieval
Cluster-based retrieval is based on the hypothesis that similar documents
tend to be relevant to the same information needs [Liu and Croft 2004]. In contrast
to document-based retrieval, cluster-based retrieval divides a collection of
documents into meaningful clusters, so that documents in the same cluster
describe the same topic, and then returns to the user a ranked list of documents
based on the clusters they come from.
Currently there are two approaches to cluster-based retrieval. The most
common is to retrieve, in their entirety, one or more clusters related to the user's
query. The difference between this method and normal document-based retrieval is
that it ranks by cluster-query similarity instead of document-query similarity. The
second approach is to use clusters as a form of document smoothing.
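A minimal sketch of the first approach: score each cluster by the similarity of its term centroid to the query, ranking clusters rather than individual documents. The cosine measure and raw term-frequency centroids are illustrative simplifications of what a real system would use.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_clusters(query, clusters):
    """Return cluster names in descending order of centroid-query cosine
    similarity: cluster-query rather than document-query ranking."""
    q = tf_vector(query)
    scored = []
    for name, docs in clusters.items():
        centroid = Counter()
        for d in docs:
            centroid.update(tf_vector(d))
        scored.append((cosine(q, centroid), name))
    return [name for _, name in sorted(scored, reverse=True)]

clusters = {
    "oncology": ["tumor growth in breast cancer", "cancer cell proliferation"],
    "cardiology": ["heart rate variability", "cardiac arrest outcomes"],
}
best = rank_clusters("breast cancer treatment", clusters)[0]
```

Once the best cluster is found, its documents can be returned in their entirety, which is exactly where cluster quality determines retrieval quality.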
Research shows that if a retrieval system is able to locate good clusters, the
performance of cluster-based retrieval is better than that of document-based
retrieval [Hearst and Pedersen 1996; Jardine and van Rijsbergen 1971]. Researchers'
investigations have largely focused on clustering algorithms, cluster retrieval, and
search methods. Of those, document clustering is most commonly performed
either on the entire collection, independent of the user's query, or on the set of
documents retrieved by a document-based retrieval for the user's query.
Recent developments in statistical language modeling for information
retrieval have reshaped the cluster-based query expansion process. Many studies
have confirmed that the language modeling approach has a solid theoretical
foundation and is usually effective for various applications [Ponte and Croft 1998;
Zhai and Lafferty 2001]. Liu and Croft [2004] demonstrated that cluster-based
retrieval can be more effective than document-based retrieval by proposing two new
models for cluster-based retrieval and evaluating the results on several TREC
collections. They also reported that cluster-based retrieval performs consistently
across collections of realistic size, and that significant improvements over document-
based retrieval can be obtained in a fully automatic manner without relevance
information provided by humans. These results guided us to reconsider query
expansion within this new cluster-based framework using the language modeling
approach. We aim to implement cluster-based retrieval as an approach to improving
query expansion effectiveness.
In this dissertation, various query expansion methods are explored;
in particular, a novel cluster-based query expansion using a semantic-based
language modeling approach is proposed. Different clustering algorithms, and
local analysis, global analysis, and ontology-based term re-weighting query
expansion strategies integrated with the UMLS (Unified Medical Language System),
are investigated to improve query effectiveness and precision. These methods are
applied to the ad hoc retrieval task of the TREC 2004/2005 Genomics track and to
other data collections similar to those of [Yoo and Hu 2006]. Meanwhile, larger
efforts are focused on combining ontology-based and semantic-based query
expansion. Selection of expanded query terms centers on Latent Semantic
Indexing (LSI) and Association Rules (AR); therefore, these are also described in
the dissertation. The number of terms added to the initial query that produces
promising precision is also evaluated.
It should be noted that although the research focuses on biomedical literature,
the resulting system is expected to be useful in other domains that have
domain-specific resources available.
1.3 Scope
This dissertation focuses on methodology: automatic query expansion
strategies. The aim is to provide an overall framework for cluster-based query
expansion using a language modeling approach.
1.4 Motivation
In general, this dissertation answers one core research question: can one
improve retrieval performance in biomedical literature retrieval by developing
cluster-based query expansion using a language modeling approach? The
dissertation concentrates on methodology, empirical comparison, experiments,
and evaluation in investigating cluster-based query expansion methods on various
issues, which together serve as empirical justification for cluster-based query
expansion. It is also meaningful to investigate a theoretical foundation of
cluster-based query expansion and to provide a reasonable path toward a
theoretical argument.
A set of key questions is associated with the above core research
question:
1. What is the effect of the clustering algorithm in cluster-based query expansion?
2. What is the difference in retrieval performance between existing query expansion methods and cluster-based query expansion?
3. Is global-based or local-based (query-specific) clustering more effective for query expansion?
4. How should documents be represented (vector space model, probabilistic model, ontology-based VSM)?
5. Which method (Latent Semantic Indexing or Association Rules) generates higher-quality additional query terms?
6. How should cluster-based query expansion be evaluated?
As mentioned previously, query expansion can generally improve retrieval
effectiveness. However, it may become ineffective when the data collection
contains multiple subjects or topics (clusters). Without document clustering,
query expansion does not have all the relevant documents with which to construct
an effective expanded term set for each cluster, which results in information loss
or distortion in the term associations.
This dissertation intends to improve retrieval effectiveness by developing a
novel cluster-based query expansion method using language modeling. The basic
idea of this method is that the whole document collection, or the documents
initially returned for the original query, from which the expanded query terms are
constructed, should first be grouped by a document clustering technique (various
clustering algorithms) into several clusters that approximate the subjects or topics.
After a set of document clusters is located, we can build a useful term set for each
cluster in the following manner: one way is to classify the whole document
collection into the established clusters and then use the documents of the new
clusters to build the VSM; the other way is to use only the documents of the
established clusters to build the VSM. For each user query, the local terms of each
cluster, generated by LSI or AR methods, are added to the original user query.
Next, the relevant documents are retrieved from the whole data collection for each
expanded query. Then, all retrieved documents are combined (removing duplicates)
and presented to the user as the overall result for the query. Experimentally, we
evaluate the retrieval effectiveness of this proposed cluster-based query expansion
method against that achieved by existing non-cluster-based query expansion
methods (particularly global analysis and local analysis query expansion).
The remainder of the dissertation is organized as follows. In addition to
Chapter 1, where Section 1.1 and this Section 1.3 introduce and highlight the
research motivation and objectives, Chapter 2 reviews literature related to this
dissertation, including the language modeling approach, existing query expansion
methods, techniques for generating expanded terms, and document clustering
techniques. Chapter 3 compares local analysis, global analysis, and ontology-based
query expansion strategies for biomedical literature retrieval. Chapter 4 applies a
two-stage concept-based singular value decomposition technique as a query expansion
strategy. The proposed cluster-based query expansion method and the system (AQEE-3)
design are presented in Chapter 5. Chapter 6 concludes and discusses future work.
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW
In this chapter, the literature related to this dissertation is reviewed:
the language modeling approach in Section 2.1, query expansion methods in
Section 2.2, and document clustering techniques in Section 2.3.
2.1 Language Modeling Approach
Language modeling has become a popular information retrieval model owing
to its sound theoretical foundation and good empirical success. The main idea of
language modeling in information retrieval is that each document in a collection is
viewed as a sample and a query as a generation process [Na et al., 2005]. Under this
view, the retrieved documents are ranked by the probability that the corresponding
language model of each document would produce the query. In other words, the
relevance of a document is calculated as the query likelihood under the document's
language model. Specifically, two steps are involved: first, build a
language model D for each document in the collection; second, rank the
documents by how likely the query Q could have been generated from each of
these document models, i.e., P(Q|D).
The probability P(Q|D) can be calculated in several ways. The most
common approach (usually referred to as the unigram language model) assumes that
the query can be treated as a set of independent terms, so the query probability
P(Q|D) can be represented as the product of the individual term probabilities [Liu
and Croft, 2004].
P(Q|D) = ∏_{i=1}^{|Q|} P(q_i|D)
where q_i is the i-th term in the query, and P(q_i|D) is specified by the
document language model.
P(q_i|D) = λ P_ML(q_i|D) + (1 − λ) P_ML(q_i|Coll)
where P_ML(w|D) is the maximum likelihood estimate of word w in the
document, P_ML(w|Coll) is the maximum likelihood estimate of word w in the
collection, and λ is a general symbol for smoothing.
For different smoothing methods, λ takes different forms.
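As a minimal, illustrative sketch (the toy corpus, query, and function names here are hypothetical, not the system's actual implementation), the smoothed unigram query-likelihood scoring described above can be written as:

```python
import math

def query_likelihood(query, doc, collection, lam=0.7):
    """log P(Q|D) under a unigram model with linear (Jelinek-Mercer-style)
    smoothing: P(qi|D) = lam * P_ML(qi|D) + (1 - lam) * P_ML(qi|Coll)."""
    score = 0.0
    for q in query:
        p_doc = doc.count(q) / len(doc)                  # ML estimate in document
        p_coll = collection.count(q) / len(collection)   # ML estimate in collection
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = [["gene", "expression", "cancer"],
        ["protein", "binding", "site"]]
collection = [w for d in docs for w in d]   # whole collection as one bag of words
scores = [query_likelihood(["gene", "cancer"], d, collection) for d in docs]
```

Documents are then ranked by `scores`; the collection model keeps the logarithm defined even for query terms absent from a document.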
Considering the cluster context, as previously mentioned, there are two language
modeling approaches. One of them is to rank clusters against the query: if we
replace D above with a cluster C, so that P(Q|D) becomes P(Q|C), we obtain ranked
clusters. But this has two limitations: first, it must operate over the entire
document collection; second, the formula above treats a whole cluster as one large
document, and it is often impractical for users to browse all the documents of the
relevant clusters.
If we define the language modeling approach from a broader perspective, namely as
a probabilistic mechanism for generating text, then it can also be used for text
clustering. In this case, each document of a collection is again viewed as a sample,
and the vocabulary of the corpus as the generating process; we can then compare the
distance between the language models of two documents. Therefore, the language
modeling approach can also serve as a clustering method.
Language modeling was initially used in speech recognition. Nowadays it
is widely applied to information retrieval and other applications owing to
its solid mathematical foundation and empirically proven effectiveness. In this
dissertation I explore its effectiveness in text document clustering and
relationship identification. Mainly, I utilize language modeling for the following:
1) Document clustering
2) Cluster ranking
3) Concept generation
2.2 Query Expansion Methods
Generally, there are five steps involved in the search process:
1. Resource selection
2. Seed query construction
3. Search results review
4. Query reformulation
5. Repeating the above steps
Given the scope of this dissertation, we focus on the stages from initial seed
query construction to query reformulation (Figure 2-1). For simplicity, these
stages are treated as independent of other factors such as the search engine used.
Figure 2 - 1: Search stages (seed query construction → search results review → query reformulation)
Normally, the query expansion process can take place in either seed query
construction or query reformulation.
As we have mentioned, query expansion is a promising way to improve
retrieval effectiveness or performance in information retrieval. The basic idea of
query expansion is to supplement a user query with terms related to the original
query. More precisely, query expansion (or term expansion) is the process of
supplementing the original query with additional terms [Efthimiadis, 1996], and it
can be considered a method for improving retrieval performance. Query expansion
is applicable in any situation irrespective of the retrieval technique(s) or search
engines used, and this generality is one of its major benefits.
The initial query (as provided by the user) may be an inadequate or
incomplete representation of the search concept or the user's information need,
either in itself or in relation to the representation of ideas or concepts in the
documents. Query expansion helps the user reformulate the query and ultimately meet
the user's information need. As depicted in Figure 2-2, query expansion can be
conducted manually, automatically, or interactively:
Figure 2 - 2: Query Expansion: Methods and Sources. The figure is copied from [Efthimiadis, 1996]
In this dissertation, I focus on automatic query expansion. Generally,
three key elements need to be considered when applying any form of query expansion:
(1) the source of candidate terms that will provide the terms for the query
expansion, (2) the strategy for using that source, and (3) the methods used to
select the terms for the expansion (e.g., a ranking algorithm for candidate terms).
Note that (2) and (3) are closely related.
Figure 2-2 also shows possible sources of candidate terms. One source is the
initial search results, which assumes a pseudo-relevance feedback process: the top
documents retrieved in an earlier iteration of the search, identified as relevant,
become the source of the query expansion terms. The other is some form of knowledge
structure that is independent of the search process; such a knowledge structure can
either depend on the collection, i.e., be corpus-based, or be independent of it. In
this dissertation, the term source will be either the initial search results or,
where possible, the whole collection.
The text retrieval community has researched and implemented many
query expansion strategies to improve precision and/or recall, and has built
various query expansion mechanisms into information retrieval systems. There are
two general strategies for using the candidate term source: ontology-based and
semantic-based. Expanding a query with synonyms or hyponyms is one example of
ontology-based query expansion. Shi et al. built an information retrieval system
integrating EntrezGene, HUGO, Eugenes, ARGH, GO, MeSH, UMLSKS, and WordNet into a
large reference database, used together with a conventional information retrieval
toolkit [Shi et al., 2005]. Some promising results have been reported for
ontology-based methods. However, ontology-based methods have only a limited effect
on biomedical information retrieval performance [Guo et al., 2004; Leroy and Chen,
2001]. Some researchers have reported hybrid approaches that combine different
weights with an ontology, but few have published details on how the weights were
used in conjunction with the ontology.
On the other hand, semantic-based methods focus on the collection of
documents, and they can be further classified mainly into two sub-categories: global
analysis and local analysis [Xu and Croft, 2000].
In the following discussion, we review two query expansion methods: global
analysis and local analysis.
2.2.1 Global Analysis
Global analysis extracts co-occurrences of related terms and builds a
similarity matrix by analyzing the whole collection of documents. Global analysis
techniques include term clustering, latent semantic indexing, phrase finding, and
similarity thesauri. Although it is relatively robust, global analysis has some
drawbacks. Corpus-wide statistical analysis consumes a considerable amount of
computing resources. Moreover, since it focuses only on the document side and does
not take the query side into account, global analysis cannot address the term
mismatch problem well. Lastly, global analysis requires measuring semantic
similarity and disambiguating terms.
Several studies have reported progress on variants of global analysis. To
improve the retrieval effectiveness of traditional global analysis methods, [Qiu
and Frei, 2003] proposed a "concept-based query expansion". They argue that a term
similar to the whole query is more relevant than one similar to only a single term
in the query. Their empirical evaluation shows that the concept-based query
expansion method significantly improves retrieval effectiveness compared with
traditional global analysis methods. [Matos et al., ] developed QuExT, a new
PubMed-based document retrieval and prioritization tool that can be used to find
the most relevant results in the biomedical literature. QuExT follows a
concept-oriented query expansion methodology to find documents containing concepts
related to the genes in the user input, such as protein and pathway names.
2.2.2 Local Analysis
In contrast to global analysis, local analysis extracts highly related terms
(correlated and ranked in order of their potential contribution to the query) from
the top documents initially retrieved by a query, or from data mining results,
without any assistance from the user. The generated terms are then appended to the
query and sometimes re-weighted as well. The idea of local analysis can be traced
back at least to a 1977 article by Attar and Fraenkel [Attar and Fraenkel, 1977],
in which the top-ranked documents for a query became sources of information for
building an automatic thesaurus. In addition, Croft and Harper used information
from the top-retrieved documents to re-estimate the probabilities that a term
would occur in the relevant set for a query [Croft and Harper, 1979].
In their research, [Xu and Croft, 1996] selected the 50 most frequent terms
and 10 phrases (i.e., pairs of adjacent non-stopwords) from the top-ranked
documents and added them to the original user query. The terms in the query were
then re-weighted using the Rocchio formula. They reported that, in their method,
the effectiveness of local analysis is highly influenced by the proportion of
relevant documents in the top ranks. Their findings also indicate that local
analysis performance is likely to be poorer if the initial seed query performs
poorly and retrieves few relevant documents, since most of the words added to the
query will then originate from non-relevant documents.
Local analysis also takes the query itself into account. For instance, Local
Context Analysis uses the top documents returned by an initial query but selects the
terms based on co-occurrence with query terms [Xu and Croft, 2000].
The benefits of the local analysis approach include saving computing
resources and requiring less human intervention. It assumes that the top documents
returned by the initial query are actually relevant ("pseudo-relevance feedback"),
and expansion terms are extracted from these top-ranked documents to formulate a
new query for a second retrieval cycle. However, in contrast to global analysis,
local analysis is not as robust, since it is nearly impossible for search tools or
mining methods to return only documents relevant to the user's information needs.
In our research, the local analysis method determines the terms added to the
original query based on "pseudo-relevance feedback", which assumes that the top
documents returned by a query are all relevant to the user's information needs. For
term selection in this dissertation, the Latent Semantic Indexing (LSI) or
Association Rule (AR) algorithm is used to find the top co-occurring terms of each
original term from the top N retrieved documents. A larger effort is devoted to
LSI, since preliminary results show positive gains in system performance.
The following is a brief introduction to Latent Semantic Indexing.
Latent Semantic Indexing (LSI) takes advantage of the association of
terms with documents ("semantic structure"). It starts with a matrix of terms by
documents [Deerwester et al., 1990; Xu and Croft, 2000]. This matrix is then
evaluated by the singular-value decomposition (SVD) to gain the latent semantic
structure [Zhu et al., 2006]. The SVD decomposes a term-document matrix into three
separate matrices, by which documents and terms are projected into the same
dimensional space.
For example, the SVD of a t × d matrix X of terms and documents
decomposes it into the product of three other matrices:
X = T0 S0 D0',
where T0 (a t × m matrix) and D0 (a d × m matrix) have orthogonal, unit-length
columns (T0'T0 = I, D0'D0 = I), and S0 (an m × m matrix) is diagonal. T0 and D0
are the matrices of left and right singular vectors, and S0 is the diagonal
matrix of singular values.
In general, the SVD can be reduced to a rank-k model with the best possible
least-squares fit to X [Deerwester et al., 1990]:
X ≈ X̂ = TSD'
The new matrix X̂ is the matrix of rank k that is closest to X in the least-squares
sense. D' is an orthonormal matrix [Deerwester et al., 1990; Xu and Croft, 2000].
The dimensions of matrices T, S, and D' are t × k, k × k, and k × d, respectively.
More importantly, the SVD can be used for information retrieval purposes to
derive a set of uncorrelated indexing variables or factors; each term and document
is represented by its vector of factor values. The dot product between two row
vectors of X̂ therefore reflects the extent to which the two terms have a similar
pattern of occurrence across the set of documents. The matrix X̂X̂' is defined by:
X̂X̂' = TS²T'
Here S² is diagonal [Deerwester et al., 1990].
In this dissertation, we also use this technique to compare the similarities
between the terms in the matrix extracted from the documents.
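A small numerical sketch of this use of the SVD (the toy matrix, the example term names in the comments, and the choice k = 2 are illustrative assumptions, not data from this dissertation):

```python
import numpy as np

# Toy terms-by-documents frequency matrix X (rows: terms, columns: documents).
X = np.array([[2.0, 1.0, 0.0],   # term 0, e.g. "gene"
              [1.0, 2.0, 0.0],   # term 1, e.g. "dna"
              [1.0, 1.0, 0.0],   # term 2, e.g. "cell"
              [0.0, 0.0, 3.0]])  # term 3, e.g. "car"

# Full SVD: X = T0 @ diag(S0) @ D0'
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation Xhat = T S D'
k = 2
T, S, Dt = T0[:, :k], np.diag(S0[:k]), D0t[:k, :]
Xhat = T @ S @ Dt

# Term-term similarities: Xhat @ Xhat' equals T S^2 T'
term_sim = Xhat @ Xhat.T
```

Here `term_sim[0, 1]` is large because terms 0 and 1 share a pattern of occurrence across documents, while `term_sim[0, 3]` is near zero.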
Association Rule (AR) mining finds frequent patterns, associations,
correlations, or causal structures among a set of data items. A terms-by-documents
matrix, similar to the one in the LSI-based algorithm, is produced and then
analyzed. For details of association rules, please refer to [Agrawal et al., 1993;
Agrawal et al., 1995; Savasere et al., 1995].
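As a toy sketch of the support and confidence statistics that underlie association rules (the documents and terms below are hypothetical, not drawn from the experimental data):

```python
# Each document is reduced to the set of terms it contains ("transactions").
doc_terms = [
    {"gene", "expression", "cancer"},
    {"gene", "expression", "therapy"},
    {"gene", "cancer"},
    {"protein", "binding"},
]

def support(itemset):
    """Fraction of documents that contain every term in the itemset."""
    return sum(itemset <= d for d in doc_terms) / len(doc_terms)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the documents."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"gene", "expression"})        # 2 of 4 documents
c = confidence({"gene"}, {"expression"})   # 0.5 / 0.75
```

A rule such as {gene} => {expression} is kept when both statistics exceed user-chosen thresholds.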
As previously mentioned, the sources of expanded terms and the selection of
candidate terms for query expansion are the key, and much work remains to be done
here. In this dissertation, TF-IDF-based and other vector space information
retrieval models are used to test local analysis. Relevance feedback for the
vector space model can be represented by the Rocchio formula [Rocchio, 1999]:
q_{i+1} = α q_i + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_n|) Σ_{d_j ∈ D_n} d_j
where
q_{i+1}: the expanded query
q_i: the initial query
D_r: set of relevant documents among the retrieved documents
D_n: set of non-relevant documents among the retrieved documents
α, β, γ: tuning constants
In this model, information in relevant documents is treated as more important
than information in non-relevant documents (γ << β). This study applies
pseudo-relevance feedback, in which the top N retrieved documents are assumed to
be all relevant, so γ equals 0.
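Under pseudo-relevance feedback the Rocchio update reduces to the original query plus a weighted centroid of the top-ranked documents. A minimal sketch (the vectors and constants below are illustrative, not the system's actual parameters):

```python
def rocchio_prf(query_vec, top_doc_vecs, alpha=1.0, beta=0.75):
    """Rocchio with pseudo-relevance feedback: all top-ranked documents are
    treated as relevant, so the non-relevant term (gamma) is dropped."""
    n = len(top_doc_vecs)
    centroid = [sum(d[i] for d in top_doc_vecs) / n
                for i in range(len(query_vec))]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]

q = [1.0, 0.0, 0.0]                             # original query over a 3-term vocabulary
top_docs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # pseudo-relevant documents
q_new = rocchio_prf(q, top_docs)                # expanded, re-weighted query
```

Terms absent from the original query acquire nonzero weight through the centroid, which is the expansion effect.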
2.3 Document Clustering
Due to known human limitations, it is very difficult for people (even expert)
to discover useful information by reading large quantities of unstructured text. This
difficulty has inspired the creation of a more specific technique for unsupervised
23
document organization, automatic topic extraction and fast information retrieval or
filtering to aid human beings’ information discovery.
Document clustering has been studied in the field of information retrieval for
several decades. Its purpose is to group similar documents into clusters on the
basis of their contents. The problem of document clustering can be defined as
follows: given a set DC of n documents, DC is partitioned into a user-defined
number k of document clusters DC1, DC2, ..., DCk (i.e., DC1 ∪ DC2 ∪ ... ∪ DCk = DC)
such that the documents within a cluster are similar to one another while documents
from different clusters are dissimilar [Yoo and Hu, 2006].
Document clustering faces a number of challenges: increasingly large volumes
of data, especially in the biomedical literature; the curse of dimensionality of
the term space; term sparsity; and complex semantics. These characteristics require
document clustering techniques to be scalable to large, high-dimensional data and
able to handle sparsity and semantics.
Document clustering was initially investigated for improving information
retrieval performance, because similar documents grouped by clustering tend to be
relevant to the same user queries. However, document clustering was not widely used
in information retrieval systems for a long time, because early clustering
algorithms were too slow or infeasible for very large document sets. With the
advent of new technology and more powerful computers, document clustering has
become very popular in information retrieval research.
Generally document clustering consists of several steps, including feature
extraction, feature selection, document representation, and clustering.
2.3.1 Feature Extraction
Feature extraction is a special form of dimensionality reduction. In document
clustering, feature extraction syntactically tags the terms in each document and
extracts the desired terms (usually nouns and noun phrases). If the extracted
features are carefully chosen, the feature set is expected to capture the relevant
information from the document corpus, so that the desired task can be performed on
this reduced representation instead of the full-size input.
2.3.2 Feature Selection
Feature selection is the technique of selecting a subset of relevant features
for building robust learning models. Appropriate feature selection can help people
better understand their data: which features are important and how they relate to
each other. In document clustering, feature selection reduces the size of the
extracted feature set in order to reduce the dimensionality of the term space and
handle sparsity. The most common criteria for feature selection include term
frequency (TF), term frequency-inverse document frequency (TF-IDF), and mixtures
of the two. The k features with the highest selection-metric scores are then
chosen to represent the documents.
2.3.3 Document Representation
Once features are selected, each document can be represented, for example, by
a feature vector, and the feature vectors of the documents can then be treated as
inputs for document clustering. Several representation models have been proposed:
the probabilistic model, the vector space model (VSM), and the ontology-based VSM.
Owing to its popularity, the ontology-based VSM will be used in this dissertation.
Its advantages include:
It is unsophisticated;
It is easy to calculate the similarity between two documents;
Algorithms can be applied directly to the data;
It considers the relationships between terms;
It introduces semantic concepts into the data model;
It combines an ontology with the traditional VSM;
Terms (dimensions) have semantic relationships rather than being independent.
A document collection is represented as a t × d matrix X = [x_ij], whose rows
correspond to terms and columns to documents, where x_ij is the value of term i in
document j.
Note that there are three common ways to assign the term value x_ij:
frequency representation; binary representation (x_ij = 1 indicates that term i
occurs in document j, otherwise x_ij = 0); and term frequency-inverse document
frequency (TF-IDF):
x_ij = tf(d_j, t_i) × log(|D| / df(t_i))
where tf(d_j, t_i) is the frequency of term t_i in document d_j, |D| is the total
number of documents, and df(t_i) is the number of documents in which t_i occurs.
The generated feature vectors of the documents can then be treated as inputs for
clustering.
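A minimal sketch of building the TF-IDF terms-by-documents matrix (the toy corpus is hypothetical; the natural logarithm is used for the idf factor as an illustrative choice):

```python
import math

docs = [["gene", "expression", "gene"],
        ["protein", "expression"],
        ["protein", "binding"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)

def tfidf(doc, term):
    tf = doc.count(term)               # tf(dj, ti): frequency of term in document
    df = sum(term in d for d in docs)  # df(ti): documents containing the term
    return tf * math.log(N / df)

# x[i][j] = TF-IDF weight of term vocab[i] in document docs[j]
x = [[tfidf(d, t) for d in docs] for t in vocab]
```

A term occurring often in one document but rarely in the collection (here "gene") receives a high weight; a term spread across many documents is down-weighted.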
2.3.4 Clustering Methods
With the generated feature vectors of the documents, a clustering technique is
applied. The objective of clustering algorithms is to efficiently and automatically
group documents with similar content into the same cluster. The core of a
clustering algorithm comprises its similarity measure and its clustering method.
There are many ways to measure how similar two documents are, or how similar a
document is to a query. The choice of similarity measure depends heavily on the
terms chosen to represent the text documents. For instance, one measure is the
Euclidean distance (L2 norm):
dist(d_i, d_j) = sqrt( Σ_k (w_{i,k} − w_{j,k})² )
Others are the L1 norm and the cosine similarity:
cos(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)
where d_i and d_j are document vectors and w_{i,k} is the weight of term k in
document d_i.
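The two measures behave differently with respect to document length; a sketch with toy vectors (illustrative data):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, but twice the length
d3 = [0.0, 0.0, 3.0]   # no terms in common with d1
```

Cosine similarity judges d1 and d2 identical because it ignores vector length, while their Euclidean distance is nonzero; d1 and d3 share no terms, so their cosine is 0. This length invariance is why cosine similarity is the usual choice for documents of varying size.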
Three common clustering approaches are hierarchical, partitioning-based,
and the self-organizing map.
The hierarchical approach starts with every point in a separate cluster and
successively merges the most similar pairs of data points or clusters, based on
pairwise distances, until only one big cluster remains or a termination condition
holds. It can be employed in either a bottom-up or a top-down manner. The
hierarchical agglomerative clustering (HAC) algorithm employs a bottom-up strategy
to build a binary clustering hierarchy: it initially treats every document as a
cluster and gradually merges the two most similar clusters on the basis of an
inter-cluster similarity measure, producing a binary tree or dendrogram. For
instance, single-link measures the similarity between the closest pair of objects,
while complete-link evaluates the similarity between the most distant pair of
objects. The HAC algorithm continues merging until a desired number of clusters is
obtained or the inter-cluster similarity falls below a pre-specified threshold.
Figure 2-3 illustrates the HAC algorithm. In contrast, the hierarchical divisive
clustering (HDC) algorithm employs a top-down strategy: it starts with all
documents in one cluster and then divides it into the two most distinct clusters,
continuing to divide each resulting cluster until a termination condition is
satisfied.
Figure 2 - 3: Hierarchical agglomerative clustering
The hierarchical approach also raises some concerns: the data may not actually
have a hierarchical structure, and the method usually involves specifying somewhat
arbitrary cutoff values.
The partitioning approach partitions the entire document collection into
several non-overlapping clusters, and it comprises the most widely used algorithms
in document clustering [Steinbach et al., 2000]. Take K-means as an example; its
steps are:
Choose a number of clusters k
Initialize cluster centers c1,… ck (randomly pick k data points or
randomly assign points to clusters and take means of clusters)
For each data point, compute the cluster center it is closest to and assign
the data point to this cluster
Re-compute cluster centers (mean of data points in cluster)
Stop when there are no new re-assignments
The following is the pseudocode for K-means:

K-means-cluster(in S : set of vectors; k : integer) {
    let C[1] ... C[k] be a random partition of S into k parts;
    repeat {
        for i := 1 to k {
            X[i] := centroid of C[i];
            C[i] := empty
        }
        for j := 1 to N {
            q := index of the centroid among X[1] ... X[k] closest to S[j];
            add S[j] to C[q]
        }
    } until the change to C (or the change to X) is small enough.
}
Like the hierarchical approach, partitioning also has some problems:
The number of clusters k must be guessed in advance.
It can converge to a local minimum. For example, with six points A-F and K = 2,
starting with centroids B and E may converge on the two clusters {A, B, C} and
{D, E, F}.
Data points are hard-assigned to exactly one cluster.
It makes implicit assumptions about the "shapes" of clusters.
Finally, the self-organizing map is an unsupervised, two-layer neural
network, based on Kohonen's work on learning and memory in the human brain. In
addition to specifying the number of clusters, one must also specify a topology: a
two-dimensional grid that gives the geometric relationships between the clusters
(i.e., which clusters should be near or distant from each other) while preserving
the similarity between data points as much as possible. The advantage of this
approach is that it learns a mapping from the high-dimensional space of the data
points onto the points of the two-dimensional grid (one grid point per cluster).
To use a self-organizing map, the network topology, the shape of the output
nodes (e.g., hexagonal, rectangular), and their number must be decided first, on
the basis of the chosen distance measures. For details, please refer to
[Kohonen, 1995].
2.3.5 Cluster Quality
The effectiveness of different clustering algorithms on a data set can be
evaluated with three types of measures: internal criteria, external criteria, and
relative criteria. Among them, the most frequently used are purity [Zhao and
Karypis, 2001], entropy [Steinbach et al., 2000], and normalized mutual
information (NMI) [Banerjee and Ghosh, 2002].
Purity is a simple and transparent evaluation measure. To compute purity,
each cluster is assigned to the class that is most frequent in the cluster, and
the accuracy of this assignment is measured by counting the number of correctly
assigned documents and dividing by the total number of documents.
Formally:
purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|
where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J}
is the set of classes; ω_k is the set of documents in cluster k, c_j is the set of
documents in class j, and N is the total number of documents.
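The purity computation can be sketched as follows (the cluster assignments and gold labels are illustrative toy data):

```python
from collections import Counter

def purity(clusters, labels):
    """(1/N) * sum over clusters of the size of its majority class.
    `clusters` maps doc id -> cluster id; `labels` maps doc id -> true class."""
    total = 0
    for c in set(clusters.values()):
        members = [labels[d] for d, cl in clusters.items() if cl == c]
        total += Counter(members).most_common(1)[0][1]   # majority-class count
    return total / len(labels)

clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
labels   = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y"}
p = purity(clusters, labels)   # (2 + 2) / 5
```

Note that purity trivially reaches 1 when every document is placed in its own cluster, which is why it is complemented by measures such as NMI.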
Entropy is a measure of uncertainty, defined as
H(X) = − Σ_{x ∈ X} P(x) log₂ P(x)
where X is the set of all possible outcomes that need to be encoded. Entropy is
largest when uncertainty is largest and 0 when there is absolute certainty.
Normalized mutual information (NMI) is defined as the mutual information
between the cluster assignments and a pre-existing labeling of the dataset,
normalized by the arithmetic mean of the maximum possible entropies of the
empirical marginals, i.e.,
NMI(Ω, C) = I(Ω; C) / [ (H(Ω) + H(C)) / 2 ]
where Ω is a random variable for cluster assignments and C is a random variable
for the pre-existing labels on the same data.
I(Ω; C) is the mutual information:
I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) log [ P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) ]
where P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of a document being
in cluster ω_k, in class c_j, and in their intersection, respectively.
H(Ω) is the entropy:
H(Ω) = − Σ_k P(ω_k) log P(ω_k)
NMI ranges from 0 to 1; the larger the NMI, the higher the quality of the
clustering. NMI improves on other common extrinsic measures such as purity and
entropy in that it does not necessarily increase as the number of clusters
increases.
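The NMI definition can be sketched directly (toy assignments; the natural logarithm is used here, which cancels in the ratio):

```python
import math
from collections import Counter

def entropy(assignment):
    n = len(assignment)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(assignment).values())

def mutual_info(clusters, labels):
    n = len(clusters)
    joint = Counter(zip(clusters, labels))          # joint cluster/class counts
    pc, pl = Counter(clusters), Counter(labels)     # marginal counts
    return sum((nij / n) * math.log(n * nij / (pc[ci] * pl[lj]))
               for (ci, lj), nij in joint.items())

def nmi(clusters, labels):
    return mutual_info(clusters, labels) / ((entropy(clusters) + entropy(labels)) / 2)

perfect = nmi(["A", "A", "B", "B"], ["x", "x", "y", "y"])      # clusters match classes
independent = nmi(["A", "A", "B", "B"], ["x", "y", "x", "y"])  # no agreement
```

Perfect agreement yields NMI = 1 and statistically independent assignments yield NMI = 0, matching the range stated above.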
CHAPTER 3. A COMPARISON OF LOCAL ANALYSIS, GLOBAL ANALYSIS AND ONTOLOGY-BASED QUERY EXPANSION STRATEGIES FOR BIOMEDICAL LITERATURE RETRIEVAL
Many research papers focus on query expansion strategies; three strategies
are used most: local analysis, global analysis, and ontology-based expansion. In
order to better understand the effectiveness of different automatic query expansion
strategies, and to further improve the design of future biomedical literature
retrieval systems, there is a need for a comparison of these three strategies for
biomedical literature retrieval.
It is worth noting that several factors may impact the final performance of a
biomedical literature retrieval system: 1) the system's indexing strategy; 2) the
system's retrieval model; 3) the data collection; and 4) the quality of the queries
submitted to the system. We will propose a new automatic query expansion approach
and evaluate it with respect to each of these four factors except the dataset (we
restrict ourselves to TREC).
In this chapter, I introduce the architecture of our automatic query expansion
experiment system (AQEE-1), discuss the three major query expansion strategies, and
describe the experimental design, environments, and implementation. For some
methods, preliminary studies are conducted and evaluation results are presented.
3.1 System Architecture
In order to evaluate 1) the system's indexing strategy, 2) the system's
retrieval model, and 3) the quality of the queries submitted to the system, our
system consists of two major parts, as shown in Figure 3-1. In the first part
(above the dashed line in Figure 3-1), we apply global analysis, local analysis,
and term re-weighting expansions; the details of our query expansion strategies
and their construction are discussed in Section 3.2. The second part (below the
dashed line) is a Lucene or Lemur search engine, used to demonstrate the generality
of our results: we would like to show that the quality of query expansion is
independent of the search engine.
Figure 3 - 1: AQEE-1 System Architecture
3.2 Query Strategies
3.2.1 Local Analysis
This method determines the terms added to the original query based on
"pseudo-relevance feedback", which assumes that the top documents returned by a
query are all relevant to the user's information needs. In this work, the Latent
Semantic Indexing (LSI) or Association Rule (AR) algorithm is used to find the top
co-occurring terms of each original term from the top N retrieved documents. A
larger effort is devoted to LSI, since in our experiments it shows positive gains
in system performance.
The processes of local analysis are depicted in Figure 3-2.
Figure 3 - 2: Local analysis approach query expansion
The Latent Semantic Indexing (LSI) takes advantage of the association of
terms with documents (“semantic structure”). It starts with a matrix of terms by
documents [Deerwester et al., 1990, Xu and Croft, 2000]. This matrix is then
evaluated by the singular-value decomposition (SVD) to gain the latent semantic
structure [Zhu et al., 2006]. The SVD decomposes a term-document matrix into three
separate matrices, by which documents and terms are projected into the same
dimensional space.
For example, the SVD of a t × d matrix X of terms and documents decomposes it
into the product of three other matrices:
X = T0 S0 D0',
where T0 (a t × m matrix) and D0 (a d × m matrix) have orthogonal, unit-length
columns (T0'T0 = I, D0'D0 = I), and S0 (an m × m matrix) is diagonal. T0 and D0 are
the matrices of left and right singular vectors, and S0 is the diagonal matrix of
singular values.
In general, the SVD can be reduced to a rank-k model with the best possible
least-squares fit to X [Deerwester et al., 1990]:
X ≈ X̂ = TSD'
The new matrix X̂ is the matrix of rank k that is closest to X in the least-squares
sense. D' is an orthonormal matrix [Deerwester et al., 1990; Xu and Croft, 2000].
The dimensions of matrices T, S, and D' are t × k, k × k, and k × d, respectively.
More importantly, the SVD can be used for information retrieval purposes to
derive a set of uncorrelated indexing variables, or factors; each term and document is
represented by its vector of factor values. Therefore, the dot product between two
row vectors of X̂ reflects the extent to which two terms have a similar pattern of
occurrence across the set of documents. The matrix X̂X̂′ of all such term-to-term
products is:

X̂X̂′ = T S² T′

so the (i, j) cell can be obtained as the dot product of the i-th and j-th rows of the
matrix TS, where S is diagonal [Deerwester et al., 1990].
In this research, we use this technique to compare the similarities between the
terms in the matrix extracted from the documents.
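As a concrete sketch of how the truncated SVD yields term-to-term similarities, the toy example below builds a hypothetical 4-term × 4-document matrix (the matrix, the term labels, and the use of simple power iteration are illustrative assumptions, not the thesis implementation) and computes rank-1 entries of TS²T′:

```python
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
X = [
    [1, 1, 0, 0],  # "gene"
    [1, 1, 1, 0],  # "expression"
    [0, 0, 1, 1],  # "anemia"
    [0, 0, 1, 1],  # "iron"
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# A = X X': its (i, j) entry is the dot product of term rows i and j.
A = [[dot(ri, rj) for rj in X] for ri in X]

def dominant_eig(A, iters=500):
    """Power iteration on a symmetric matrix: largest eigenvalue + unit eigenvector.
    For A = X X', the eigenvalue is sigma1^2 and the eigenvector is t1."""
    v = [1.0] * len(A)
    for _ in range(iters):
        w = [dot(row, v) for row in A]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    lam = dot(v, [dot(row, v) for row in A])  # Rayleigh quotient
    return lam, v

lam, t1 = dominant_eig(A)

def rank1_term_sim(i, j):
    """Entry (i, j) of T S^2 T' truncated to rank k = 1."""
    return lam * t1[i] * t1[j]
```

With k = 1, "anemia" and "iron" (identical occurrence patterns) score higher with each other than with "gene", mirroring how the reduced space groups terms by usage.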
Association Rule (AR) mining is a popular and well-researched method for
discovering frequent patterns, associations, correlations, or causal structures
among a set of data items, especially in large databases. [Wikipedia] provides one
example: the rule {onions, potatoes} => {burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together,
he is likely to buy a burger as well. In our test, "financial" co-occurs closely with the
query term "combating times alien smuggling". A term-by-document matrix, similar
to the one in the LSI-based algorithm, is produced and then analyzed. For details of
Association Rule mining, please refer to [Agrawal et al., 1993, Agrawal et al., 1995,
Savasere et al., 1995].
The steps of expanding the query terms based on the LSI or AR algorithm are
described below:
Step 1: Obtain a term matrix based on the top N (in our experiment, N = 300)
documents retrieved by the search engine for each initial query.
Step 2: Run LSI or AR algorithm to generate the related terms to each of the
original query terms.
Step 3: Expand the query terms by including the top-ranked M terms
acquired in Step 2. We seek the top M (M = 5/4/3/2/1; the best result is achieved
when M = 1) co-occurrence terms of the original terms from the top N (N = 300)
retrieved documents. We then evaluated the impact of the number of expanded
terms on precision.
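The three steps can be sketched as follows; the token lists are hypothetical, and a raw co-occurrence count stands in for the LSI/AR ranking of Step 2:

```python
from collections import Counter

def expand_query(query_terms, top_docs, m=1):
    """Steps 1-3 in miniature: from the top-N docs (token lists), add the
    top-M co-occurring terms for each original query term."""
    expanded = list(query_terms)
    for term in query_terms:
        cooc = Counter()
        for doc in top_docs:               # Step 1: top-N retrieved documents
            if term in doc:
                cooc.update(w for w in doc if w != term)
        for w, _ in cooc.most_common(m):   # Steps 2-3: keep the top-M terms
            if w not in expanded:
                expanded.append(w)
    return expanded

docs = [["transgenic", "mice", "knockout"],
        ["transgenic", "mice", "gene"],
        ["iron", "transport"]]
```

For the query term "mice", the sketch adds "transgenic", its most frequent co-occurring term in the toy collection.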
3.3.2 Global Analysis
In contrast to local analysis, in global analysis the terms used for expansion are
extracted from all the documents in the collection. UMLS provides co-concepts and
their co-occurrence frequencies for many medical terms that appeared in MEDLINE
during the past 10 years. Thus, we expand the initial query with the UMLS co-concepts
of the original key terms that share the same semantic types. Note that we use
LingPipe, an efficient biomedical name extraction tool, to extract the key terms. For
instance, "mice, knockout", a high-frequency co-concept in UMLS of the term
"transgenic mice", has the same semantic type and will be added to the initial
query [Zhu et al., 2006].
3.3.3 Term Re-weighting
This approach weights the original key terms and co-occurrence terms according to
their relative importance in queries. Our earlier studies [Hu and Xu, 2005, Zhu et al.,
2006] show that expanding queries with synonyms generated from UMLS and
assigning equal weights to both the initial query terms and the new terms is not
effective. Therefore, in this work, we boosted the weight of the original query terms.
Specifically, if an original term has a higher term frequency in the initial query and a
specific major UMLS semantic type, the term is given a higher weight (e.g., 8) in the
expanded query. Additionally, if a key query term or an expanded term has a major
UMLS semantic type, its preferred MeSH term synonym defined in UMLS is added
and given a higher weight [Zhu et al., 2006]. Note that the selection of key query
terms is again decided by the tagging of a named entity extraction tool, LingPipe.
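A minimal sketch of the re-weighting idea (the weight of 8 follows the example above; the `key_terms` set stands in for the LingPipe/UMLS semantic-type tagging, which the sketch does not implement):

```python
def weight_query(original_terms, expanded_terms, key_terms, boost=8.0):
    """Boost original key terms; expanded terms keep a default weight of 1."""
    weights = {}
    for t in original_terms:
        weights[t] = boost if t in key_terms else 1.0
    for t in expanded_terms:
        weights.setdefault(t, 1.0)
    return weights

q = weight_query(["transgenic", "mice"], ["mice, knockout"],
                 key_terms={"transgenic"})
```

The resulting weighted query keeps expanded terms from drowning out the terms the user actually typed.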
The processes of global analysis and term re-weighting query expansion are
displayed in Figure 3-3.
Figure 3 - 3: Global analysis and term re-weighting query expansion
3.4 Experiment Design and Implementation
The goal of the TREC 2004 Genomics Track is to create test collections for
evaluation of information retrieval (IR) and related tasks in the genomics domain
[Genomics]. It consists of the ad hoc retrieval task and the categorization task. In this
research, we focus on the ad hoc retrieval task.
The TREC 2004 Genomics ad hoc retrieval task aims at retrieving MEDLINE
records for the official 50 topics from the whole corpus. The task was a conventional
search task based on a 10-year MEDLINE subset (1994-2003), about 4.5 million
documents and 20 GB in size, in NLM XML format. The official 50 topics were
derived from information needs obtained via interviews with biomedical
researchers [Genomics].
Two search engines are used in this research: LEMUR (v.4.0) and LUCENE
(v.1.4.3). While LUCENE is a high-performance, scalable information retrieval tool
that provides indexing and searching capabilities, LEMUR is a tool set designed to
facilitate research in language modeling and information retrieval; it supports the
construction of basic text retrieval systems using language modeling methods, as
well as traditional methods such as those based on the vector space model and
statistical models (for instance, OKAPI BM25) [Zhu et al., 2006].
3.4.1 Building the Index
LEMUR provides several parsers to index TREC-format documents. The
InvIndex type of the BuildIndex module was used to transform the TREC 2004
Genomics XML data into the general TREC format, and the TREC-format data was
then inverted-indexed. For LUCENE, the original XML data set was converted into
HTML format; only the content of fields such as "TITLE", "ABSTRACT", "MESH",
and "CHEMICAL SUBSTANCE" was kept in the HTML files. The transformed
HTML documents were then indexed accordingly [Zhu et al., 2006].
3.4.2 Preprocessing Query to Search Engines
The TREC 2004 Genomics topics for the ad hoc retrieval task are developed
from the information needs of actual field experts. The topics consist of four parts
formatted in XML: ID, title, information need, and context. The titles are abbreviated
statements of the information needs and, being only a few words long, are most
similar to queries entered by end users. In this research, the title of each topic is
selected for the baseline runs.
In addition, a two-step approach was used to preprocess queries for the
LEMUR search engine: first, a query was tokenized and filtered against stop-word
lists; second, the query was stemmed before being fed to LEMUR for retrieval [Zhu
et al., 2006]. These two operations are very useful for improving computational
efficiency and reducing noise. We also note that the context field provides
background information that situates the information need and describes the
environment in which queries may occur, so it may be helpful to expand the
original queries, i.e., the titles, with context. In this research, we selected only those
key terms or phrases labeled by LingPipe in the context information of the topics to
expand the original queries.
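The two preprocessing steps can be sketched as below; the stop-word list is a small illustrative sample, and the crude suffix stripper merely stands in for the actual stemmer used with LEMUR:

```python
STOPWORDS = {"the", "of", "in", "a", "and", "for"}  # illustrative sample only

def crude_stem(token):
    """A toy stand-in for a real stemmer (e.g. Porter): strip common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(query):
    # Step 1: tokenize and filter against the stop-word list.
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    # Step 2: stem the remaining tokens.
    return [crude_stem(t) for t in tokens]
```

For example, `preprocess("Ferroportin in iron transport proteins")` drops the stop word and conflates "proteins" with "protein".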
3.4.3 Evaluation
The evaluation measure for the topics in the ad hoc retrieval task is mean
average precision (MAP). Precision and recall are calculated using Buckley's
trec_eval program. In addition, we also assess precision at 10 documents (P@10) in
this research.
3.5 Experimental Results and Discussion
3.5.1 Results
Table 3 - 1: Experiment Results from LEMUR Search Tool (OKAPI BM25 MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3010   0.516   --
UMLS_Global                      0.2562   0.464   -14.9%
LSI_Local                        0.3035   0.516   +0.8%
AR_Local                         0.2840   0.440   -5.6%
Baseline+ReWeighting             0.3131   0.506   +4.0%
Baseline+Context                 0.3107   0.554   +3.2%
Baseline+Context+Re-Weighting    0.3374   0.528   +12.1%
Baseline+Context+LSI_Local       0.3288   0.516   +9.2%
Figure 3 - 4: Average Precision of All Runs (OKAPI BM25 MODEL)
Figure 3 - 5: P@10 of All Runs (OKAPI BM25 MODEL)
Table 3 - 2: Experiment Results from LEMUR Search Tool (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3024   0.504   --
UMLS_Global                      0.3007   0.476   -0.6%
LSI_Local                        0.3039   0.504   +0.5%
AR_Local                         0.2842   0.462   -6.0%
Baseline+ReWeighting             0.3092   0.510   +2.2%
Baseline+Context                 0.3231   0.504   +6.8%
Baseline+Context+Re-Weighting    0.3252   0.498   +7.5%
Baseline+Context+LSI_Local       0.3251   0.504   +7.5%
Figure 3 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 3 - 7: P@10 of All Runs (VECTOR SPACE MODEL)
Table 3 - 3: Experiment Results from LUCENE Search Tool (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.2404   0.436   --
UMLS_Global                      0.1858   0.334   -22.7%
Baseline+ReWeighting             0.2512   0.442   +4.5%
Baseline+Context                 0.2488   0.482   +3.5%
Baseline+Context+Re-Weighting    0.2891   0.468   +20.3%
Figure 3 - 8: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 3 - 9: P@10 of All Runs (VECTOR SPACE MODEL)
3.5.2 Discussion
As shown in Table 3-1, there are eight types of runs in the experiments:
Baseline, UMLS_Global (title + co-concepts in UMLS), LSI_Local (title + local
co-occurrence terms), AR_Local (title + local co-occurrence terms),
Baseline+Re-weighting (title + term re-weighting approach), Baseline+Context (title
+ key terms or phrases in the context of the query), Baseline+Context+LSI_Local
(title + key terms or phrases in the context of the query + local co-occurrence terms),
and Baseline+Context+Re-weighting (title + key terms or phrases in the context of
the query + term re-weighting
approach). The ratio of change is relative to Baseline. We ran all eight types of
experiments with the LEMUR search tool and five with the LUCENE search tool.
We obtained the best results with the Baseline+Context+Re-weighting
method, which achieves mean average precision of 33.74% and 28.91% for LEMUR
(OKAPI BM25 Model) and LUCENE, respectively (improvements of 12.1% and
20.3% over the corresponding Baselines). Co-occurrence term expansion with LSI
local analysis is applied to four runs on LEMUR, in queries produced from Baseline
and Baseline+Context (Tables 3-1 and 3-2). These four runs are based on two
different information retrieval models, the OKAPI BM25 Model and the VECTOR
SPACE Model. The results suggest that, with context taken into account,
co-occurrence term expansion with LSI local analysis works better on the OKAPI
BM25 Model (increasing average precision by 9.2%) than on the VECTOR SPACE
Model (increasing it by 7.5%).
Apparently, the LSI-based local analysis approach improves the average
precision of the Baseline+Context run more than that of the Baseline run. This also
suggests that the context information of queries is critical in enhancing retrieval
performance. Without context, co-occurrence term expansion with the LSI local
analysis approach increases the average precision by only 0.8% on the OKAPI BM25
Model, and the enhancement on the VECTOR SPACE Model is about 0.5%.
If the Baseline+Context runs are treated as baselines, LSI local analysis
elevates the average precision by 5.8% on the OKAPI BM25 Model but by only 0.6%
on the VECTOR SPACE Model. We note that the best results we achieved with LSI
local analysis are the cases in which we added only one expanded term to each
original query term.
AR produces the worst results among all the methods we tested: it decreases
the average precision by 5.6% in Table 3-1 and by 6.0% in Table 3-2 (AR_Local).
Therefore, we did not run Baseline+Context+AR_Local. We note that the
co-concepts expanded from global analysis also make the results worse. The average
precision of the UMLS_Global run on LEMUR decreases by 14.9% (OKAPI BM25
Model) and by 0.6% (VECTOR SPACE Model) compared to the Baseline run. This
result is consistent when we used LUCENE: Table 3-3 shows that the average
precision of the UMLS_Global run on LUCENE also decreases, by 22.7% (VECTOR
SPACE Model).
We examined the impact of the term re-weighting strategy on LEMUR
against the Baseline and Baseline+Context runs, which do not use it. The
experimental results are shown in Tables 3-1 and 3-2: Table 3-1 shows the results
with the OKAPI BM25 Model, and Table 3-2 those with the VECTOR SPACE Model.
The Baseline+Context+Re-weighting method with the OKAPI BM25 Model
increases the average precision over the Baseline by 12.1%; with the VECTOR
SPACE Model, the increase is 7.5%. We note that Baseline+Re-weighting, without
context, increases the average precision by only 4.0% with the OKAPI BM25 Model,
and the enhancement on the VECTOR SPACE Model is only 2.2%.
We therefore contend that using context is important in improving precision.
We also examined the impact of the term re-weighting strategy on LUCENE
against the Baseline and Baseline+Context runs. The experimental results are shown
in Table 3-3: the term re-weighting strategy increases the average precision over
Baseline by 4.5% and 20.3%, respectively.
3.6 Conclusion
In this study we presented the design and evaluation of query expansion
strategies using three widely used approaches for the biomedical domain: local
analysis, global analysis, and term re-weighting.
The important insights we obtained through our study are summarized as
follows:

- Ontology-based term re-weighting provides the best results among the three
query expansion strategies.
- Expanding the initial query with more precise ontology-based terms enhances
LSI-based local analysis substantially.
- Including context in term re-weighting and LSI further improves precision.
- Adding only one expanded term to each original query term in LSI local
analysis increases precision over adding more than one term.
- The OKAPI BM25 Model achieves better precision than the VECTOR SPACE
Model on LEMUR.
CHAPTER 4. USING TWO-STAGE CONCEPT-BASED SINGULAR VALUE DECOMPOSITION TECHNIQUE AS A QUERY EXPANSION STRATEGY
As indicated in Chapter 3, we compared different query expansion strategies,
including global analysis, local analysis, and term re-weighting, and obtained some
important insights. However, a couple of flaws are worth noting. First, the strategies
have little theoretical support. Second, the context we employed (a kind of concept)
is background information that situates the information need and describes the
environment in which queries may occur. The results show that it is helpful for
expanding the original query and acquiring better performance. Nonetheless, not all
biomedical literature retrieval systems contain such information or make it available
to users. Further research is therefore needed to explore concept generation and to
utilize the language modeling approach, which has a strong mathematical
foundation.
4.1 Introduction
As mentioned previously, query expansion is the process of supplementing
the original query with additional terms. Its advantages are that it helps users
reformulate a better query to achieve higher precision and higher recall, eventually
improving retrieval performance, and that it is applicable in any situation
irrespective of the retrieval techniques or models used [Efthimiadis, 1996]. In this
work, via query expansion, we focus on the selection of additional query terms
based on concepts, to help alleviate the synonymy and polysemy problems.
Traditional keyword (term) matching does not work well for biomedical
information retrieval. One way to address the problem is sense-based IR through
word sense disambiguation (WSD); however, WSD is a challenging task in the field
of natural language processing (NLP). Some experiments reported promising effects
of sense-based IR models, but most experimental results failed to show any
performance improvement, partially due to the low accuracy of WSD in the general
domain [Sanderson, 1994] and/or very noisy word alignment. The other way to
solve this problem is through concept matching. A concept has a distinctive meaning
in the biomedical literature, and all synonymous terms share the same concept
identity, so concepts do not cause ambiguity. Besides, this approach to some extent
mimics searchers' process of selecting query terms [LSI]. Importantly, a
concept-based approach is quite easy to implement with UMLS and other tools.
4.2 System Architecture
Figure 4-1 illustrates the system components and data flow. The benefit of
our approach is three-fold: it saves computing resources and requires less human
intervention by taking advantage of the UMLS and pseudo-relevance feedback; it
handles the synonymy and polysemy problems well without causing ambiguity, by
using concepts embedded among the terms; and it is independent of the
information retrieval system. Moreover, since it assumes that the top N documents
returned by the initial query are actually relevant, expansion terms are extracted
from each cluster of the top-ranked documents to formulate a new query for a
second retrieval cycle. This takes advantage of the search engine (via the initial
query) and draws on a set of documents more relevant than the whole collection.
Figure 4 - 1: Two-stage concept-based query expansion (AQEE-2)
Specifically, a concept-based term-document matrix is generated from the
top retrieved documents (Stage one). The SVD technique of LSI is then utilized to
extract and rank the terms to be added to the initial query (Stage two).
Figure 4 - 2: Stage one – generate concepts and construct concept-based term
document matrices
4.3 Query Strategies
4.3.1 Stage one - generate concepts and construct concept-based term-document matrices
During Stage one, it is very important to cluster the documents retrieved for
the initial query and to extract the concepts from this subset of the data collection.
For details of the idea and the process, please refer to [Zhu et al., 2006].
4.3.1.1 Document Clustering
Both single linkage and average linkage suffer from a severe chaining
problem on all three testing datasets when the standard vector cosine is used as the
document similarity measure [Zhu et al., 2006]. For a fair comparison, we therefore
use complete linkage as the cluster distance measure, because it does not have the
chaining problem on the testing dataset when working with the baseline document
similarity measure. With the complete linkage criterion, the distance between two
clusters is defined as the maximum distance between a document in the first cluster
and a document in the second cluster. That is,

Δ(Cu, Cv) = max d(di, dj),  di ∈ Cu, dj ∈ Cv
After estimating a language model for each document in the corpus with
context-sensitive semantic smoothing, the Kullback-Leibler divergence between two
language models is used as the distance measure for the corresponding two
documents. Given two probabilistic document models p(w|d1) and p(w|d2), the
KL-divergence distance of p(w|d1) from p(w|d2) is defined as:

Δ(d1, d2) = Σ_{w∈V} p(w|d1) log( p(w|d1) / p(w|d2) )

where V is the vocabulary of the corpus. The KL-divergence distance is a
non-negative score; it is zero if and only if the two document models are exactly the
same. However, KL-divergence is not a symmetric metric. Thus, we define the
distance between two documents as the minimum of the two KL-divergence
distances. That is,

d(d1, d2) = min( Δ(d1, d2), Δ(d2, d1) )
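The distance computation can be sketched as follows, assuming two hypothetical smoothed unigram models over a shared vocabulary (smoothing guarantees no zero probabilities, which KL-divergence requires):

```python
import math

def kl_divergence(p, q):
    """Delta(d1, d2) = sum over the vocabulary of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def doc_distance(p, q):
    """Symmetrize: take the minimum of the two KL directions."""
    return min(kl_divergence(p, q), kl_divergence(q, p))

# Hypothetical smoothed document models over a 3-word vocabulary.
p1 = {"iron": 0.5, "anemia": 0.3, "gene": 0.2}
p2 = {"iron": 0.4, "anemia": 0.4, "gene": 0.2}
```

The distance is zero only when the two models coincide, and taking the minimum makes it symmetric as required by the clustering step.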
The partitional clustering is a generalized version of the standard k-means
[Zhong and Ghosh, 2005]. It assumes that there are k parameterized models, one for
each cluster. Basically, the algorithm iterates between a model re-estimation step
and a sample re-assignment step, as shown below.
Algorithm: Model-based K-Means
Input: a dataset of n documents and the desired number of clusters k.
Output: trained cluster models Λ = {λ1, …, λk} and the document assignment
Y = {y1, …, yn}, yi ∈ {1, …, k}.
Steps:
1. Initialize the document assignment Y.
2. Model re-estimation: λj = arg max_λ Σ_{i: yi = j} log p(di | λ)
3. Sample re-assignment: yi = arg max_j log p(di | λj)
4. Stop if Y does not change; otherwise go to step 2.
The model-based k-means algorithm is adapted from [Zhong and Ghosh, 2005].
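The four steps can be sketched as follows, with plain smoothed multinomial (unigram) cluster models standing in for the context-sensitively smoothed models used in the thesis:

```python
import math
from collections import Counter

def estimate_model(cluster_docs, vocab, alpha=1.0):
    """Step 2: maximum-likelihood unigram model with add-alpha smoothing."""
    counts = Counter(w for d in cluster_docs for w in d)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(doc, model):
    return sum(math.log(model[w]) for w in doc)

def model_kmeans(docs, k, iters=20):
    vocab = {w for d in docs for w in d}
    y = [i % k for i in range(len(docs))]              # Step 1: initialize Y
    for _ in range(iters):
        models = [estimate_model([d for d, yi in zip(docs, y) if yi == j], vocab)
                  for j in range(k)]                   # Step 2: re-estimate
        new_y = [max(range(k), key=lambda j: log_likelihood(d, models[j]))
                 for d in docs]                        # Step 3: re-assign
        if new_y == y:                                 # Step 4: convergence
            break
        y = new_y
    return y

docs = [["iron", "iron", "anemia"], ["iron", "anemia"],
        ["gene", "mice"], ["gene", "mice", "mice"]]
labels = model_kmeans(docs, k=2)
```

On the toy collection, the two "iron/anemia" documents end up in one cluster and the two "gene/mice" documents in the other.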
4.3.1.2 Concept Extraction
We used approximate matching to overcome the limitation of exact character
matching of terms. With it, concept (phrase) extraction and concept (phrase)
meaning disambiguation can be completed in one step. The basic idea of this
approach is to calculate an importance score for the tokens and capture the
important tokens (not all tokens) of a concept name, unlike [Liu et al., 2004], where a
specific window size was defined to identify a phrase in a document. The
importance score formula for a token is given below [Zhou et al., 2006]:
I(w) = max_{1≤j≤n} Sj(w),   where   Sj(w) = (1/N(w)) / Σ_i (1/N(wji))

where w is the token,
n is the number of concept names or variants,
Sj(w) stands for the importance of the token w to the j-th variant name,
N(w) stands for the number of concepts whose variant names contain the token w,
wji indicates the i-th token in the j-th variant name of the concept, and
I(w) denotes the importance of w to the concept.
With the importance score formula above, the detailed concept extraction
algorithm shown in Figure 4-3 can be employed. The candidate concept with the
maximum importance score is chosen in the end, unless only one candidate concept
remains.
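Under the formula above (the reconstruction of the formula is itself an assumption, as are the toy concept counts below), the score can be sketched as:

```python
def importance(token, variant_names, concept_count):
    """I(w) = max over variant names j of S_j(w), with
    S_j(w) = (1 / N(w)) / sum_i (1 / N(w_ji)).
    Here concept_count[t] plays the role of N(t)."""
    best = 0.0
    for name in variant_names:        # each variant name is a list of tokens
        if token not in name:
            continue
        denom = sum(1.0 / concept_count[t] for t in name)
        best = max(best, (1.0 / concept_count[token]) / denom)
    return best

# Hypothetical counts: "transgenic" appears in few concepts, "mice" in many.
counts = {"transgenic": 2, "mice": 10}
score = importance("transgenic", [["transgenic", "mice"]], counts)
```

The rarer token ("transgenic") receives the higher importance, which is the behavior the approximate-matching step relies on.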
Figure 4 - 3: The algorithm for extracting one concept name and its candidate concept IDs [Zhou et al. 2006]
One example is presented below: The regulation (C1327622) of iron transport
(C1159668) by cytokines (C0079189) is a key mechanism in the pathogenesis
(C0699748) of anemia of chronic disease (C0002873) and a promising target for
therapeutic intervention (C0808232).
Phrase: regulation, iron transport, cytokines, pathogenesis, anemia of chronic
disease, therapeutic intervention
Concept: C1327622, C1159668, C0079189, C0699748, C0002873, C0808232
We developed the program IndexAppConfig to complete the Stage one task.
Given a set of documents, IndexAppConfig extracts a set of concepts. Based on the
concepts generated, it then indexes the documents and generates several files,
including collection.stat, docterm.index, termdoc.index, termindex.list,
docindex.list, docterm.matrix, and termdoc.matrix. In particular, termdoc.index is a
binary random-access file for storing the matrix, and termdoc.matrix is used during
Stage two. Once these files are ready, the t-d matrix can easily be constructed, and
Stage two uses the LSI SVD technique to extract terms. Note that the t-d matrix is
concept-based, not merely a single-word-by-document matrix; for simplicity,
however, we still call it the t-d matrix.
4.3.2 Stage two – use SVD technique in LSI to extract the expanded query terms
The purpose of Latent Semantic Indexing (LSI) is to overcome the problems
of lexical matching by using statistically derived conceptual indices instead of
individual words for retrieval [Agrawal et al., 1993, Guo et al., 2004]. The idea
behind it is that in the latent semantic space, a query and a document can have high
cosine similarity even if they do not share any terms, as long as their terms are
semantically similar.
LSI is the application of a particular mathematical technique, Singular Value
Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a
least-squares method that takes advantage of the association of terms with
documents ("semantic structure"). Starting with a matrix of terms by documents,
this term-document matrix is evaluated by SVD to obtain the latent semantic
structure of word usage across documents; it is decomposed into three separate
matrices, by which documents and query terms are projected into the same
dimensional space.
Because of disadvantages of LSI, such as its large storage requirements, in
this research SVD is applied only to the top 200/50 or fewer documents retrieved for
the initial query on Lemur. This technique is used to compare the similarities
between the terms in the matrix extracted from the documents and to generate the
important terms to be added to the initial query.
We developed the program SVDExpansion to complete the Stage two task.
Given the files generated during Stage one, SVDExpansion automatically extracts a
number of terms per request, each with its concept ID and ranked score (1.0 is the
highest). The following is one example:
The following is one example:
Query term: ferroportin C0915115
********************************************
0 ferroportin C0915115 1.0
1 Cation Transport Proteins C0969710 0.9992486625308058
4.4 Experimental Design and Implementation
4.4.1 MeSH Terms
MeSH (Medical Subject Headings) is a type of controlled vocabulary. MeSH
consists of sets of descriptors in a hierarchical structure that allows searching at
various levels of specificity. The roots of the hierarchical structures in MeSH have
very broad headings such as "Anatomy" or "Mental Disorders." Lower levels of the
hierarchy include more specific terms, such as "Ankle" and "Conduct Disorder."
There are 22,997 descriptors in MeSH, as well as thousands of cross-references that
assist in finding the most appropriate MeSH heading, for example, from Vitamin C
to Ascorbic Acid.
4.4.2 UMLS
The Unified Medical Language System (UMLS) is a set of medical knowledge
sources and associated language tools developed by the National Library of
Medicine. The system includes the UMLS Metathesaurus, the UMLS Semantic
Network, and the SPECIALIST Lexical Tools [Umls]. The UMLS Metathesaurus
contains information about concepts and biomedical terminology in several
languages. It was developed automatically from 73 families of controlled
vocabularies and 117 classification systems; the 2004AA version contains about 1
million concepts and 4.3 million terms. The purpose of the UMLS Semantic Network
is to provide a consistent semantic categorization of all concepts represented in the
UMLS Metathesaurus and to provide a set of useful relationships between these
concepts. All information about specific concepts is
found in the Metathesaurus; the Network provides information about the set of basic
semantic types, or categories, which may be assigned to these concepts, and it
defines the set of relationships that may hold between the semantic types. The
2004AA release of the Semantic Network contains 135 semantic types and 54
relationships. The semantic types are the nodes in the Network, and the
relationships between them are the links. In total, UMLS provides symbolic
relationships for 5 million pairs of concepts and statistical relationships for 6.5
million pairs of co-occurring concepts. The symbolic relations mainly include
hierarchical and associative relationships; the associative relationships are
represented as relationships between the semantic types of concepts. Statistical
relations are computed by determining the frequency with which concepts in
specific vocabularies co-occur in records in a database. For instance, there are
co-occurrence relationships for the number of times concepts have co-occurred as
key topics within the same articles, as evidenced by the Medical Subject Headings
assigned to those articles in the MEDLINE database.
4.4.3 Experimental Setup
In order to compare the experimental results with those of Chapter 3, the
same data collection, TREC 2004/2005, is used. A brief introduction to the TREC
2004 and 2005 Genomics tracks follows.

The goal of the TREC 2004 and 2005 Genomics tracks is to create test
collections and to provide a forum for evaluation of information retrieval (IR) and
related tasks in the genomics domain [Genomics]. Both the TREC 2004 and TREC
2005 Genomics tracks consist of the ad hoc retrieval task and the categorization task.
Since the
document set of the TREC 2005 Genomics ad hoc task was the same corpus as used
in the TREC 2004 Genomics ad hoc task, our major effort in this research focused on
the 2004 Genomics ad hoc retrieval task. The corpus contains a 10-year subset (1994
to 2003) of MEDLINE, the bibliographic database of biomedical literature
maintained by the US National Library of Medicine. The corpus holds about 4.5
million MEDLINE records (which include title and abstract as well as other
bibliographic information) and is about 9 GB in size.
The TREC 2004 Genomics ad hoc retrieval task aims at retrieving MEDLINE
records for the official 50 topics from the whole corpus. The official 50 topics were
derived from information needs obtained via interviews with real biologists [Guo et
al., 2004; Genomics]. The task was a conventional search task over the whole
document corpus in NLM XML format.
The Lemur search tool is employed as the backend engine in this research.
Lemur is a tool set designed to facilitate research in language modeling and
information retrieval; it supports the construction of basic text retrieval systems
using language modeling methods, as well as traditional methods such as the vector
space model and statistical models (for instance, OKAPI BM25) [Zhu et al., 2006]. In
this research, it was first used for the initial query run to obtain the top N
documents as a basis for generating a set of concepts and extracting the terms to
reformulate the query. It was then invoked to run the reformulated query so that its
precision and recall could be compared with those of the baseline.
4.4.3.1 Indexing the Corpus
To index the TREC 2004 Genomics track, the InvIndex type of the BuildIndex
module provided by Lemur was used to transform the XML-format data into the
general TREC format, and the TREC-format data was then inverted-indexed. The
high-frequency stop words in the collection were removed, and the remaining terms
were stemmed accordingly.
4.4.3.2 Preprocessing Query to Lemur
The TREC 2004 Genomics topics for the ad hoc retrieval task are developed
from the information needs of actual field experts. The topics consist of four parts
formatted in XML: ID, title, information need, and context, of which the titles are
abbreviated statements of the information needs. Since the titles are only a few
words long and are most similar to the queries entered by end users, in this research
all 50 titles are selected as initial queries. These initial queries for the baseline run
were preprocessed before being fed to Lemur: they were tokenized, filtered against
stop-word lists, and stemmed.
4.5 Preliminary Experimental Results
4.5.1 Evaluation
Again, Buckley's trec_eval program was run to capture precision and recall
for each individual query. Since mean average precision (MAP) is a comprehensive
indicator of IR performance that captures both precision and recall, we examined
MAP for the topics in the ad hoc retrieval task, following TREC convention. In
addition, we also assessed the precision at the top 10 documents (P@10) in our
evaluation.
4.5.2 Preliminary Results
Table 4 - 1: Experimental Results (OKAPI BM25 model)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3010   0.516   --
UMLS_Global                      0.2562   0.464   -14.9%
LSI_Local                        0.3035   0.516   +0.8%
AR_Local                         0.2840   0.440   -5.6%
Baseline+ReWeighting             0.3131   0.506   +4.0%
Baseline+Context                 0.3107   0.554   +3.2%
Baseline+Context+Re-Weighting    0.3374   0.528   +12.1%
Table 4 - 2: Experimental Results (OKAPI BM25 model)

Run                                 MAP      P@10    MAP Ratio Change
Baseline                            0.3010   0.516   --
Concept-based SVD (200 documents)   0.3193   0.504   +6.0%
Concept-based SVD (50 documents)    0.3308   0.520   +9.9%
Figure 4 - 4: Mean Average Precision of All Runs (OKAPI BM25 model)
Figure 4 - 5: P@10 of All Runs (OKAPI BM25 model)
Table 4 - 3: Experiment Results (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3024   0.504   --
UMLS_Global                      0.3007   0.476   -0.6%
LSI_Local                        0.3039   0.504   +0.5%
AR_Local                         0.2842   0.462   -6.0%
Baseline+ReWeighting             0.3092   0.510   +2.2%
Baseline+Context                 0.3231   0.504   +6.8%
Baseline+Context+Re-Weighting    0.3252   0.498   +7.5%
Table 4 - 4: Experimental Results (VECTOR SPACE model)

Run                                 MAP      P@10    MAP Ratio Change
Baseline                            0.3024   0.504   --
Concept-based SVD (200 documents)   0.3183   0.516   +5.3%
Concept-based SVD (50 documents)    0.3281   0.518   +8.4%
Figure 4 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 4 - 7: P@10 of All Runs (VECTOR SPACE MODEL)
Seven types of runs were designed for the experiments reported in Tables 4-1
and 4-3: Baseline, UMLS_Global (title + co-concepts in UMLS), LSI_Local (title +
local co-terms), AR_Local (title + local co-terms), Baseline+Re-Weighting (title +
term re-weighting approach), Baseline+Context (title + key terms or phrases in the
context of the query), and Baseline+Context+Re-Weighting (title + key terms or
phrases in the context of the query + term re-weighting approach). The ratio of
change is computed relative to the Baseline.
The average precision of our baseline run is better than the results of the 2004
TREC Genomics ad hoc task, where the average precision over 47 runs submitted by
32 teams was 20.74%. Our best runs on Lemur reach 33.74% and 28.91%. Co-term
expansion with local analysis increases average precision by only 0.5% (LSI). AR
produces an even worse result, decreasing average precision by 6.0%. Here, using
the document as the transaction unit for association rule mining may not be
effective; association rules mined at the sentence level may produce more precise
term associations.
The term re-weighting strategy was also applied to four runs on Lemur, in
which the queries were formed from both Baseline and Baseline+Context. These
four runs are based on two different information retrieval models: the Vector Space
Model and the statistical Okapi BM25 model. The results suggest that the statistical
model, which increases average precision by 12.1% when context is taken into
account, empowers the term re-weighting approach more than the Vector Space
Model, which increases average precision by 7.5%. Without context, the term
re-weighting approach increases average precision by only 4.0% on the statistical
model and by about 2.2% on the Vector Space Model. If the Baseline+Context runs
are treated as baselines, term re-weighting elevates average precision by 8.6% on the
statistical model but by only 0.7% on the Vector Space Model.
Co-concepts expanded from global analysis make the results worse. The
average precision of the UMLS_Global run on Lemur decreases by 0.6% (Vector
Space Model) and 14.9% (Okapi BM25 model). Perhaps the top-ranking co-concepts
from UMLS co-occur frequently with the original terms in contexts entirely different
from those of the TREC Genomics topics.
Both Table 4-2 and Table 4-4 show that the average precision of our baseline
runs and of the concept-based SVD query expansion runs (Okapi BM25 model or
Vector Space model) is better than the results of the 2004 TREC Genomics ad hoc
task, where the average precision over 47 runs submitted by 32 teams was 20.74%.
Table 4-2 shows that the concept-based SVD approach with the Okapi BM25
model achieves at best 33.08% MAP and 52.0% P@10, while the baseline approach
(Okapi BM25 model) achieves 30.10% MAP and 51.6% P@10. Table 4-4 shows that
the concept-based SVD approach with the Vector Space model achieves at best
32.81% MAP and 51.8% P@10, while the baseline approach (Vector Space model)
achieves 30.24% MAP and 50.4% P@10. The concept-based SVD query expansion
approach clearly improves average precision over the baseline runs: by up to 9.9%
in Table 4-2 and 8.4% in Table 4-4. This suggests that concept-based SVD in LSI
significantly enhances retrieval performance on the TREC 2004 Genomics track.
We also compared the results of using the top 200 versus the top 50
documents retrieved for the initial queries as the basis for generating concepts and
extracting terms to add to the initial queries. If the concept-based SVD run with the
top 200 documents is treated as the baseline, the run with the top 50 documents
improves retrieval performance by a further 3.6% (Okapi BM25 model) and 3.1%
(Vector Space model). A possible explanation is that the top 200 documents add
more noise to the basis used to generate concepts and perform SVD, so the newly
extracted query terms do not help performance as much.
CHAPTER 5. CLUSTER-BASED QUERY EXPANSION USING LANGUAGE MODELING IN THE BIOMEDICAL DOMAIN
The approach introduced in Chapter 4 has several benefits: it saves computing
resources and requires less human intervention by taking advantage of the UMLS
and pseudo-relevance feedback; it handles the synonymy and polysemy problems
well without introducing ambiguity, by using concepts embedded among the terms;
and it is independent of the information retrieval system. Moreover, since it assumes
that the top N documents returned by the initial query are actually relevant,
expansion terms are extracted from the top-ranked documents to formulate a new
query for a second retrieval cycle. It thus leverages the search engine (via the initial
query) and draws on more relevant documents than the whole data collection
would provide.

However, a document collection typically contains documents on various
topics. The association between two terms can therefore differ across topics, which
makes it more appropriate to conduct query expansion within each topic.
5.1 Introduction
Query expansion is not a new method for improving information retrieval
precision and recall. Researchers have long used query expansion techniques,
whether manual, interactive, or automatic, to help users reformulate better queries
and ultimately achieve higher precision and higher recall [Efthimiadis, 1996].
However, previous research has been inconclusive as to whether query expansion
improves retrieval effectiveness and efficiency over document-based retrieval
[Buckley 2004]. In this study, we examine pseudo-relevance-based query expansion
methods in greater detail.
Since pseudo-relevance-based query expansion methods augment the query
with terms generated from documents in an initially retrieved list, they have both
advantages and disadvantages. One of the main benefits is that they leverage search
engines built on various models without requiring user interaction. On the other
hand, the inherent problems of search engines also arise: in normal document-based
retrieval, the initially retrieved list presented to the user is long, ranked by relevance
to the given query [Liu and Croft, 2004]. Another problem is that a document
collection typically contains documents on different subjects or topics. For instance,
PubMed research articles may be classified into different subjects or topics such as
diseases. Consequently, the initially retrieved list may mix results on different
subjects, and the terms that pseudo-relevance-based techniques add to the initial
query may have entirely different meanings across distinct subjects in the collection.
A more difficult and subtle issue is that not all query-related aspects may be evident
in the initially retrieved list [Buckley 2004]. Overall, the expanded query may exhibit
query drift [Mitra et al. 1998], that is, a shift of underlying intent between the
original query terms and the expanded terms. When query drift is severe, even
state-of-the-art pseudo-relevance-based query expansion methods can yield retrieval
performance substantially inferior to using the original query with no expansion at
all [Mitra et al. 1998, Lavrenko and Croft 2001].
In this study, we propose to perform pseudo-relevance-based query
expansion based on the initially retrieved list and/or on clusters of similar
documents (cluster-based query expansion) created from the whole document
collection or from the initial retrieval list of the original query. Clusters drawn from
the whole collection according to their query similarity can arguably be considered
better candidates for reflecting the collection-level context of the query than
individual documents [Genomics, Lee et al. 2008, Na et al., 2005], because such
clusters may contain relevant documents that do not themselves exhibit high query
similarity. Clusters built from the initial retrieval results do not bring new relevant
documents into consideration; however, re-ranking the initial retrieval results using
these clusters can reduce non-relevance noise.
The rest of this chapter is organized as follows: Section 5.2 reviews related
work on language modeling and cluster-based retrieval. Section 5.3 presents our
proposed methods, PRB-DOC-QE, GCB-QE, GCB-DOC-QE, and LCB-RR-QE.
Section 5.4 provides the system architecture and general process of our cluster-based
methods. In Section 5.5, we evaluate the performance of the proposed methods.
Finally, we conclude in Section 5.6.
5.2 Related Work
Language modeling was initially used in speech recognition. It has since been
widely applied to information retrieval and other applications, owing to its solid
mathematical foundation and empirically proven effectiveness [Zhai and Lafferty,
2001].
The main idea of language modeling in information retrieval is that each
document in a collection is viewed as a sample and the query as a generation
process [Na et al., 2005]. In contrast to conventional retrieval models, retrieved
documents are ranked by the probability of producing the query from the
corresponding document language models. In other words, the relevance of a
document is measured by the query likelihood under its language model.
Specifically, two steps are involved: first, build a language model D for each
document in the collection; second, rank the documents by how likely the query Q
could have been generated from each of these document models, i.e., P(Q|D).
The probability P(Q|D) can be calculated in several ways. The most common
approach (referred to as the unigram language model) assumes that the query can
be treated as a set of independent terms, so the query probability P(Q|D) can be
represented as a product of the individual term probabilities [Liu and Croft, 2004]:
P(Q|D) = \prod_{i=1}^{m} P(q_i|D)    (1)
where qi is the ith term in the query, and P(qi|D) is specified by the document
language model
P(w|D) = λ P_ML(w|D) + (1 − λ) P_ML(w|Coll)
where PML(w|D) is the maximum likelihood estimate of word w in the document,
PML(w|Coll) is the maximum likelihood estimate of word w in the collection, and λ is
a general symbol for smoothing.
For different smoothing methods, λ takes different forms [Kurland et al. 2005].
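As an illustration, the unigram query likelihood with Jelinek-Mercer smoothing can be sketched as follows (computed in log space for numerical stability; the names and in-memory count representation are ours, the default λ = 0.9 follows the setting used later in this chapter, and query terms are assumed to occur somewhere in the collection so that the collection probability is nonzero):

```python
import math
from collections import Counter

def jm_log_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.9):
    """log P(Q|D) under a unigram model with Jelinek-Mercer smoothing:
    P(w|D) = lam * P_ML(w|D) + (1 - lam) * P_ML(w|Coll)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0   # P_ML(w|D)
        p_coll = coll_counts[q] / coll_len                    # P_ML(w|Coll)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```

Documents containing the query terms score higher, while the collection component keeps documents missing a term from receiving probability zero.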
Within the language modeling framework, a relevance model is in effect a
query expansion approach. Suppose the query words Q = q_1, q_2, …, q_k and the
words in relevant documents are sampled identically and independently from a
distribution; the probability of a word w under this model is estimated as follows:

P(w, q_1, …, q_k) = Σ_M P(M) P(w|M) \prod_{i=1}^{k} P(q_i|M)

where the sum runs over M, the set of documents that are pseudo-relevant to the
query Q [Lavrenko and Croft 2001].
Based on this estimate, the most likely expansion terms according to
P(w|M) are chosen to augment the original query.
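A minimal sketch of this expansion-term selection, simplified by weighting each pseudo-relevant document uniformly rather than by its query likelihood as in the full Lavrenko-Croft model (names are illustrative):

```python
from collections import Counter

def relevance_model_terms(feedback_docs, query_terms, n_terms=5):
    """Estimate P(w|R) as the average of P_ML(w|M) over the
    pseudo-relevant documents (uniform P(M)), then return the
    top-weighted terms that are not already in the query."""
    p_w = Counter()
    for doc in feedback_docs:
        counts = Counter(doc)
        length = len(doc)
        for w, c in counts.items():
            p_w[w] += (c / length) / len(feedback_docs)
    for q in query_terms:          # keep only candidate expansion terms
        p_w.pop(q, None)
    return [w for w, _ in p_w.most_common(n_terms)]
```

Terms that occur consistently across the feedback documents receive the highest estimated probability and become the expansion candidates.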
There are two approaches to cluster-based retrieval. The first, and most
common, is to retrieve one or more clusters from the whole document collection in
response to a query [Zhai and Lafferty, 2001, Liu and Croft, 2004]. In this approach,
the task is first to match the query against clusters of documents produced by a
clustering algorithm, rather than against individual documents, and then to rank the
clusters by their similarity to the query. Unlike usual cluster search methods, which
use clusters mainly as a tool to identify a subset of likely relevant documents and
return only that subset at retrieval time, here any document from a higher-ranked
cluster is considered more relevant than any document from a lower-ranked cluster
[Rosenfele 2000].
The second approach to cluster-based retrieval is to use clusters as a form of
document smoothing. The benefit of this approach is that, by grouping documents
into clusters, differences between the representations of individual documents are
in effect smoothed out [Lee et al. 2008].
In this study, we use the first approach with a small modification: we build
clusters both from the whole document collection and from the initially retrieved
list only. Note that if in (1) we replace D with a cluster C, so that P(Q|D) becomes
P(Q|C), we obtain a ranking of clusters.
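This cluster ranking via P(Q|C) can be sketched by treating each cluster's concatenated member documents as one pseudo-document scored with the same smoothed unigram model (an illustrative simplification; the data layout and names are assumptions, not the thesis implementation):

```python
import math
from collections import Counter

def rank_clusters(query_terms, clusters, coll_counts, coll_len, lam=0.9):
    """Score each cluster as a pseudo-document: concatenate its member
    documents and compute log P(Q|C) with Jelinek-Mercer smoothing,
    then return (cluster_id, score) pairs sorted best-first."""
    scored = []
    for cid, docs in clusters.items():
        terms = [t for d in docs for t in d]   # concatenate the cluster
        counts, length = Counter(terms), len(terms)
        score = 0.0
        for q in query_terms:
            p_c = counts[q] / length if length else 0.0
            p_coll = coll_counts[q] / coll_len
            score += math.log(lam * p_c + (1 - lam) * p_coll)
        scored.append((cid, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

The top-ranked clusters are then the ones whose language models feed the expansion step in the methods below.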
5.3 Methods
The following are the concrete details of the query expansion methods in our
approach. Let q, d, D, and D_init represent a query, a document, a document
collection, and an initially retrieved document list, respectively.
5.3.1 Pseudo-relevance-based query expansion (PRB-DOC-QE)
In the language modeling framework, D_init is used to construct language
models and generate expanded query terms based on the query likelihood
\prod_i P(q_i|D). We use PRB-DOC-QE to denote this retrieval model, which uses
only documents to perform query expansion. To compare the performance of the
various methods, we also treat this document-only query expansion as the baseline.
5.3.2 Global-cluster-based query expansion (GCB-QE)
We assume that the whole document collection has been clustered into a set
of clusters C. Note that this clustering of the whole collection is query-independent.
This method uses the language models of clusters, rather than documents, to
construct the expanded query. Specifically, we use the language models of the k
clusters in C that yield the highest query likelihood \prod_i P(q_i|C); GCB-QE
denotes the resulting retrieval method.
5.3.3 Global-cluster-and-document-based query expansion (GCB-DOC-QE)
Since the clusters in C created in Section 5.3.2 are query-independent, the k
selected clusters may bring in "noise" documents that are not query-related but
have high inter-document similarity. To combine the advantages of the methods in
Sections 5.3.1 and 5.3.2, we consider the GCB-DOC-QE method, which uses both
documents and clusters similar to the query to perform the expansion. Specifically,
we consider only those documents yielding the highest query likelihood.
5.3.4 Local-cluster-based query expansion (LCB-RR-QE)
In this method, the clusters are created from the initially retrieved list (and
are therefore query-dependent). We take a similar approach to cluster-based
retrieval: we build language models for the clusters and then rank the clusters by
the likelihood of generating the query. The resulting LCB-RR-QE method re-ranks
the documents of the initial retrieval list before performing query expansion. Note
that after re-ranking, we used only between half and 80 percent of the initial
retrieval list to perform query expansion.
Note that the methods above are term-based approaches, using a term-based
matrix as the basis for generating expansion terms to add to the initial query seed.
Since the concept-generation approach of Chapter 4 improves performance, we also
tried a concept-based matrix in place of the term-based matrix for comparison,
giving the following methods:
5.3.5 Concept-based pseudo-relevance query expansion (CB-PR-DOC-QE)
5.3.6 Concept-based global-cluster query expansion (CB-GC-QE)
5.3.7 Concept-based global-cluster-and-document query expansion (CB-GC-DOC-QE)
5.3.8 Concept-based local-cluster query expansion (CB-LC-RR-QE)
5.4 System Architecture
Our automatic query expansion experiment system (AQEE-3) architecture is
shown in Figure 5-1.
The outline of the approach shown in Figure 5-1 is as follows:

Step 1: Starting with the initial query, the system retrieves a sample of
documents from the document collection based on the highest query likelihood. No
other information is required.

Step 2: Depending on the query expansion method, the retrieved document
set (top N documents) alone, or together with clusters of the whole document
collection or clusters of the initially retrieved list, is taken as a better candidate set
than the whole collection for generating expanded query terms (pre-query
expansion).

Step 3: Apply query expansion to generate important terms to add to the
initial query.

Step 4: Construct the expanded queries from the results of Step 3 and query
again to retrieve result sets improved over those of the initial query.
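The four steps can be sketched as a two-pass loop; `search` and `expand` are placeholders standing in for whatever backend (e.g. Lemur) supplies retrieval and term selection:

```python
def expanded_retrieval(query, search, expand, top_n=50):
    """Two-pass retrieval following Steps 1-4: retrieve, derive
    expansion terms from the top-N feedback set, and re-retrieve.
    `query` is a list of terms; `search(terms)` returns a ranked
    document list; `expand(terms, feedback)` returns new terms."""
    initial_hits = search(query)            # Step 1: initial run
    feedback = initial_hits[:top_n]         # Step 2: candidate set
    new_terms = expand(query, feedback)     # Step 3: expansion terms
    return search(query + new_terms)        # Step 4: second-pass run
```

The same skeleton serves all eight methods above; only the `expand` step (documents only, global clusters, local clusters, term- or concept-based) changes.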
Figure 5 - 1: Cluster-Based Query Expansion (AQEE-3)
[Diagram: the initial query goes to the search engine; the ranked document list
(Doc 1, Doc 2, Doc 3, ...) and the clusters feed the query expansion component,
which produces the expanded query.]
Note that the Query Expansion component may use any of PRB-DOC-QE,
GCB-QE, GCB-DOC-QE, LCB-RR-QE, CB-PR-DOC-QE, CB-GC-QE, CB-GC-DOC-QE,
or CB-LC-RR-QE, as described in Section 5.3.
5.5 Evaluation
In this research, experiments are conducted on the TREC 2004 Genomics
track. The TREC 2004 Genomics topics for the ad hoc retrieval task were developed
from the information needs of actual field experts. Each topic consists of four parts
formatted in XML (ID, title, information need, and context), of which the title is an
abbreviated statement of the information need [Genomics]. Since the titles are only a
few words long and most closely resemble common user queries, the titles were
selected as initial queries. The Lemur toolkit was used in our experiments. The
TREC XML-format data were converted into TREC format, and records and topics
were preprocessed with tokenization, Porter stemming, and removal of
high-frequency stop words.
Since mean average precision (MAP) is a comprehensive performance
indicator that captures both precision and recall, MAP for the ad hoc retrieval task
topics is used in this research, following TREC convention. In addition, we also
assessed precision at the top 10 documents (P@10) in our evaluation.
Five types of runs were designed for our experiments and are reported in
Table 5-1: LM (the plain LM query-likelihood model with no QE), PRB-DOC-QE,
GCB-QE, GCB-DOC-QE, and LCB-RR-QE. Four further run types are reported in
Table 5-2: CB-PR-DOC-QE, CB-GC-QE, CB-GC-DOC-QE, and CB-LC-RR-QE. The
ratio change measures the MAP change relative to the baseline.
In the experiments, the relevance model (RM3), a state-of-the-art
pseudo-relevance-based query expansion method, is used. The Jelinek-Mercer
smoothing parameter of the LMs is set to 0.9. The other parameters of RM3 are
available upon request.
We also choose the Kullback-Leibler divergence between two language
models as the distance measure between the corresponding documents, together
with the model-based k-means clustering algorithm (Chapter 4). For
LCB-RR-QE/CB-LC-RR-QE, to construct RM3 we take the 1000 documents in the
initially retrieved list that yield the highest query likelihood and cluster them into
50 query-specific clusters. We then use the documents of the query-specific clusters
that yield the highest query likelihood to construct RM3.
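The KL-divergence distance between two document language models can be sketched as follows (an illustrative simplification: the thesis uses the smoothed LMs described above, while this version uses simple additive smoothing so the divergence is always defined):

```python
import math
from collections import Counter

def kl_divergence(doc_a, doc_b, vocab, eps=1e-6):
    """KL(P_a || P_b) over a shared vocabulary, with additive (epsilon)
    smoothing so every term has nonzero probability in both models."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    za = len(doc_a) + eps * len(vocab)   # normalizers including smoothing mass
    zb = len(doc_b) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        pa = (ca[w] + eps) / za
        pb = (cb[w] + eps) / zb
        kl += pa * math.log(pa / pb)
    return kl
```

KL divergence is zero for identical models and grows as the two word distributions diverge, which is what the model-based k-means step needs; note it is asymmetric, so the direction of comparison must be fixed by convention.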
Table 5 - 1: Experimental Results (Term-based)

Run          MAP      P@10    MAP Ratio Change
LM (NO QE)   0.3010   0.516   --
PRB-DOC-QE   0.3035   0.516   +0.83%
GCB-QE       0.3126   0.518   +3.85%
GCB-DOC-QE   0.3209   0.523   +6.61%
LCB-RR-QE    0.3168   0.517   +5.25%
Figure 5 - 2: Mean Average Precision of All Runs
Figure 5 - 3: P@10 of All Runs
Table 5-1 shows that the MAPs of the PRB-DOC-QE, GCB-QE, GCB-DOC-QE,
and LCB-RR-QE runs are all better than that of LM (NO QE): every proposed query
expansion method outperforms the non-expansion LM (NO QE) approach. It also
shows that the GCB-DOC-QE approach achieves 32.09% MAP and 52.3% P@10,
while PRB-DOC-QE achieves 30.35% MAP and 51.6% P@10 and the baseline LM
(NO QE) reaches only 30.10% MAP and 51.6% P@10. In addition, GCB-DOC-QE
outperforms GCB-QE in both MAP and P@10.
Table 5 - 2: Experimental Results (Concept-based)

Run            MAP      P@10    MAP Ratio Change
LM (NO QE)     0.3010   0.516   --
CB-PR-DOC-QE   0.3045   0.516   +1.16%
CB-GC-QE       0.3196   0.518   +6.18%
CB-GC-DOC-QE   0.3387   0.521   +12.52%
CB-LC-RR-QE    0.3460   0.520   +14.95%
Figure 5 - 4: Mean Average Precision of All Runs (Concept-based)
Figure 5 - 5: P@10 of All Runs (Concept-based)
Table 5-2 shows that the MAPs of the CB-PR-DOC-QE, CB-GC-QE,
CB-GC-DOC-QE, and CB-LC-RR-QE runs are all better than that of LM (NO QE):
every proposed concept-based query expansion method likewise outperforms the
non-expansion LM (NO QE) approach. The CB-GC-DOC-QE approach achieves
33.87% MAP and 52.1% P@10, while CB-PR-DOC-QE achieves 30.45% MAP and
51.6% P@10 and the baseline LM (NO QE) reaches only 30.10% MAP and 51.6%
P@10. Among all the methods of Chapters 3, 4, and 5, CB-LC-RR-QE performs best.
A possible reason is that CB-LC-RR-QE effectively reduces "noise" documents when
generating expansion terms.
5.6 Results
In this study, we have presented the system architecture and evaluation of
cluster-based query expansion approaches. The experimental results with the Lemur
search tool on the TREC 2004 Genomics track ad hoc retrieval task are promising.
We conclude that the information contributed by clusters built from the whole
document collection or from the initial retrieval list is quite helpful for query
expansion, especially when integrated with the concept generation introduced in
Chapter 4 and with information induced from the initially retrieved individual
documents, as in CB-GC-DOC-QE and CB-LC-RR-QE.
This study suggests several interesting avenues for future investigation. We
will continue to test the generalization of cluster-based query expansion in other
biomedical domains or even in different domains. Also, since these experiments
were conducted with the Lemur toolkit, we may perform similar experiments with
different search tools, for instance the Dragon toolkit, for comparative evaluation.
Furthermore, more investigation is needed into the process of cluster formation,
where various techniques, such as personalization, could be applied.
CHAPTER 6. CONCLUSIONS AND FUTURE WORK
6.1 Contributions of the Thesis
To improve the efficiency and effectiveness of biomedical literature retrieval,
this thesis began with a comprehensive evaluation of query expansion strategies.
We proposed a novel framework for cluster-based query expansion. We
concentrated on empirical comparison, experiments, and evaluation in investigating
query expansion methods across various issues, and used the findings as empirical
justification for cluster-based query expansion. Meanwhile, we investigated a
theoretical foundation for cluster-based query expansion and provided a reasonable
path for developing the theoretical argument.
We designed and implemented the AQEE-3 system to conduct query
expansion evaluation. With AQEE-3, we compared the performance of various
query expansion strategies while varying: 1) the system indexing strategy; 2) the
system retrieval model; and 3) the quality of the queries submitted to the system.
We believe researchers can easily explore query expansion strategies using the
AQEE-3 system.
Without considering clusters, the important insights obtained through our
study are summarized as follows:

Ontology-based term re-weighting provides the best results among the three
query expansion strategies.

Expanding the initial query with more precise ontology-based terms
enhances LSI-based local analysis substantially.

Including context in term re-weighting and LSI further improves precision.

Adding only one expansion term per original query term in LSI local analysis
yields higher precision than adding more than one term.

Using the Okapi BM25 model achieves better precision than the Vector Space
model on Lemur.
Regarding the research question of whether one can improve retrieval
performance in the biomedical domain by developing cluster-based query
expansion using a language modeling approach, our results confirm that such an
approach (the AQEE-3 system) significantly improves effectiveness on the TREC
2004 Genomics track test data.

Our results also show that combining concept-based and local (query-specific)
clustering for query expansion is more effective than global clustering. The
evaluation results show that if the retrieval system is able to locate good clusters,
cluster-based query expansion performs better than query expansion with poor
clusters or with no clustering at all, consistent with other reported research findings.
Term selection is the key to query expansion. Our experimental results
demonstrate that Latent Semantic Indexing generates high-quality additional query
terms. By contrast, Association Rule mining played a negative role in our
experiments.
6.2 Future Work
This dissertation set out to improve retrieval effectiveness by developing a
novel cluster-based query expansion method using language modeling. We
designed and implemented the AQEE-3 system to apply the approach and conduct
query expansion evaluation. Based on the experimental results, we believe query
expansion works more effectively in the context of clusters.
Several aspects of this approach need additional work. First, re-weighting
may be incorporated into cluster-based query expansion, since it produced good
results in our experiments. In our approach, we generate concepts from the clusters
using our algorithm; in the future, we may also generate keywords from the
clusters. As query expansion is evaluated in terms of precision and recall, a new
evaluation metric could be defined to measure query expansion in a cluster context.
In the future we will also continue to test our query expansion strategies on
more comprehensive datasets, in other biomedical domains or even in different
domains. Since these experiments were conducted with the Lemur toolkit, we may
perform similar experiments with different search tools, for instance the Dragon
toolkit, for comparative evaluation. Furthermore, more investigation is needed into
the process of cluster formation, where various techniques, such as personalization,
could be applied.
List of References

[Agrawal et al., 1993] Agrawal, R., et al., Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases, December 1993.
[Agrawal et al., 1995] Agrawal, R., et al., Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, U. Fayyad. et al., Editors, 1995, AAAI/MIT Press.
[Aphinyanaphongs et al., 2001]Yindalon Aphinyanaphongs, Ioannis Tsamardinos, Alexander Statnikov, Douglas Hardin, Constantin Aliferis. Text Categorization Models for High-Quality Article Retrieval in Internal Medicine. Journal of the American Medical Informatics Association, 12(2), 207-216.
[Attar and Fraenkel, 1997] Attar, R. and Fraenkel, A.S., Local Feedback in Full-Text Retrieval Systems, Journal of the ACM, Vol. 24, No. 3, 1997, pp. 397-417.
[Banerjee and Ghosh, 2002] Banerjee, A. and Ghosh, J. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. Proc. IEEE Int. Joint Conference on Neural Networks, pp. 1590-1595.
[Buckley, 2004] Buckley, C. Why current IR engines fail. In Proceedings of SIGIR, pages 584–585, 2004. Poster.
[Cesarano et al., 2003] Cesarano, C., d’Acierno, A., and Picariello, A., An Intelligent Search Agent System for Semantic Information Retrieval on the Internet. Proceedings of the 5th ACM international workshop on Web information and data management, New Orleans, Louisiana, USA, 2003, 111-117.
[Croft and Harper, 1979] Croft, W.B. and Harper, D.J. Using Probabilistic Models of Document Retrieval Without Relevance Information, Journal of Documentation, Vol. 35, 1979, pp. 285-295.
[Deerwester et al., 1990] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. J., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6) 1990, 391-407.
[Dobrynin et al., 2005] Vladimir Dobrynin, David Patterson, Mykola Galushka, Niall Rooney. SOPHIA: An Interactive Cluster-based Retrieval System for the OHSUMED Collection. IEEE Transactions on Information Technology in Biomedicine 9(2): 256-265 (2005)
[Efthimiadis, 1996] Efthimiadis, E.N. Query Expansion, In: Williams, Martha E., ed. Annual Review of Information Systems and Technology, v31, pp. 121-187, 1996.
[Emanuelsson et al., 2000] Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol., 300(4):1005-1016, 2000.

[Entrez] Entrez batch retrieval service. http://www.ncbi.nih.gov/entrez/batchentrez.cgi?db=protein

[Frank et al., 2001] Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 668-673.

[Gaucher et al., 2004] Gaucher SP, Taylor SW, Fahy E, Zhang B, Warnock DE, Ghosh SS, Gibson BW. Expanded coverage of the human heart mitochondrial proteome using multidimensional liquid chromatography coupled with tandem mass spectrometry. J Proteome Res., 3(3):495-505, 2004.

[Genomics] http://ir.ohsu.edu/genomics/2004protocol.html
[Guda et al., 2004] Guda C, Fahy E, Subramaniam S. MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics, 20(11):1785-1794, 2004.

[Guda and Subramaniam, 2005] Guda C, Subramaniam S. pTARGET [corrected]: a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics, 21(21):3963-3969, 2005.

[Guo et al., 2004] Guo, Y., Harkema, H., and Gaizauskas, R., Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms, 13th Text Retrieval Conference (TREC 2004).
[Hearst and Pedersen, 1996] Hearst, M.A., and Pedersen, J.O. (1996). Re-examining the Cluster Hypothesis: Scatter/Gather on retrieval results. In SIGIR 1996, pp. 76-84.
[Hersh and Hickam, 1995] Hersh, W., and Hickam, D., Information Retrieval in Medicine-The Sapphire Experience, Journal of the American Society for Information Science, Dec., 1995, 46(10), 743-747.
[Hersh et al., 2000] Hersh, W., Price, S., and Donohoe, L., Assessing thesaurus-based query expansion using the UMLS Metathesaurus, Proc AMIA Symposium, 2000.
[Holmes et al., 1994] Holmes G, Donkin A, Witten I.H. Weka: A Machine Learning Workbench. Intelligent Information Systems, Proceedings of the 1994 Second Australian and New Zealand Conference, 357-361, 1994.

[Hu and Xu, 2005] Hu, X. and Xu, X., Mining Novel Connections from Online Bio-medical Text Databases Using Semantic Query Expansion and Semantic-relationship Pruning, International Journal of Web and Grid Services, Volume 1, Issue 2, 2005, 222-239.
[Jardine and Van, 1971] Jardine, N. and Van Rijsbergen, C.J. (1971). The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval, 7:217-240
[Kohonen, 1995] Kohonen, T., Self-Organizing Maps. New York: Springer, 1995.
[Kules et al., 2006] Bill Kules, Jack Kustanowitz, Ben Shneiderman. Categorizing Web Search Results into Meaningful and Stable Categories Using Fast-Feature Techniques. JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 210-219, 2006.
[Kumar and Raghava, 2006] Kumar M, Verma R, Raghava GP. Prediction of mitochondrial proteins using support vector machine and hidden Markov model. J Biol Chem., 281(9):5357-5363, 2006.
[Kurland et al., 2005] Kurland, O., Lee, L., and Domshlak, C. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR, pages 19-26, 2005.
[Lavrenko and Croft, 2001] Lavrenko, V. and Croft, W. B. Relevance-based language models. In Proceedings of SIGIR, pages 120-127, 2001.
[Lee et al., 2008] Lee, K.-S., Croft, W. B., and Allan, J. A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of SIGIR, pages 235-242, 2008.
[Lemur] www.cs.cmu.edu/~lemur/1.9/tfidf.ps
[Leroy and Chen, 2001] Leroy, G., and Chen, H., Meeting Medical Terminology Needs: The Ontology-Enhanced Medical Concept Mapper, IEEE Transactions on Information Technology in Biomedicine, 2001, 5(4), 261-270.
[Lindsay and Gordon, 1999] Lindsay, R.K., and Gordon, M.D., Literature-based discovery by lexical statistics, Journal of the American Society for Information Science, 1999, 50(7), 574-587.
[Liu and Croft, 2004] Liu, X., Croft, W.B.: Cluster-based Retrieval Using Language Models. In proceedings of ACM SIGIR’04 conference, (2004)276-284.
[Liu et al., 2004] Liu, S., Liu, F., Yu, C., and Meng, W. (2004) An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases, Proceedings of the 27th annual international Conference on Research and development in Information Retrieval: 266-272.
[Lucene] http://LUCENE.apache.org/java/docs/api/index.html
[Mitra et al., 1998] Mitra, M., Singhal, A., and Buckley, C. Improving automatic query expansion. In Proceedings of SIGIR, pages 206–214, 1998.
[Na et al., 2005] Na, S.H., Kang, I.S., Roh, J.E., and Lee, J.H. An Empirical Study of Query Expansion and Cluster-based Retrieval in Language Modeling Approach. In Proceedings of the Second Asia Information Retrieval Symposium (AIRS 2005), Jeju Island, Korea, October 2005.
[Ogilvie and Callan, 2002] Ogilvie, P. and Callan, J. (2002), Experiments Using the Lemur Toolkit. In: Proceedings of the Tenth Text Retrieval Conference, (TREC-10), 103-108.
[Ponte and Croft, 1998] Ponte, J.; Croft, W.B., A Language Modeling Approach to Information Retrieval. In proceedings of ACM SIGIR’98, (1998)275-281.
[Pratt et al., 2003] Pratt, W. and Yetisgen-Yildiz, M., LitLinker: Capturing connections across the biomedical literature, K-CAP’03, Sanibel Island, FL. Oct. 2003, 105-112.
[PubMed] Retrieved January 10, 2008 from http://www.ncbi.nlm.nih.gov/PubMed/
[Qiu and Frei, 1993] Qiu, Y. and Frei, H.P. Concept Based Query Expansion. In Proc. of the 16th Int. ACM SIGIR Conf., pages 160-169, ACM Press, June 1993.
[Radev et al., 2000] Dragomir Radev, Hongyan Jing, Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. in Proceedings of ANLP/NAACL 2000 Workshop on Automatic Summarization, Seattle, WA, pp.21-29
[Richardson and Smeaton, 1995] Richardson R., and Smeaton A.F, Using Wordnet in a knowledge-based approach to information retrieval, In Proceedings of the BCS-IRSG Colloquium, Crewe, 1995.
[Robertson and Walker, 2000] S. E. Robertson and S. Walker (2000), “Okapi/Keenbow at TREC-8,” in E. Voorhees and D. K. Harman (Editors), the Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246
[Rocchio, 1999] Rocchio, J., Relevance Feedback in Information Retrieval, In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (Chap. 14), Englewood Cliffs, NJ: Prentice Hall, 1971.
[Rosenfeld, 2000] Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? In Proceedings of the IEEE, 88(8), 2000.
[Savasere et al., 1995] Savasere, A., Omiecinski, E., Navathe, S. An Efficient Algorithm for Mining Association Rules in Large Databases. VLDB 1995: 432-444
[Sanderson, 1994] Sanderson, M. 1994, “Word sense disambiguation and information retrieval”, Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, p.142-151, July 03-06, 1994, Dublin, Ireland.
[Shi et al., 2005] http://www.cs.sfu.ca/~anoop/papers/pdf/trec2005_report.pdf
[Steinbach et al., 2000] Steinbach, M., Karypis, G., and Kumar, V. A Comparison of Document Clustering Techniques. Technical Report #00-034. Department of Computer Science and Engineering, University of Minnesota, 2000.
[Taylor et al., 2003] Taylor SW, Fahy E, Zhang B, Glenn GM, Warnock DE, Wiley S, Murphy AN, Gaucher SP, Capaldi RA, Gibson BW et al. Characterization of the human heart mitochondrial proteome. Nat Biotechnol., 21(3):281-286, 2003.
[Umls] http://www.nlm.nih.gov/research/umls/archive/2004AA/umlsdoc_2004aa.pdf
[Vogel et al., 2005] David Vogel, Steffen Bickel, Peter Haider, et al. Classifying Search Engine Queries Using the Web as Background Knowledge. SIGKDD Explorations, 7(2):117-122, December 2005.
[Wang and Zhai, 2006] Xuanhui Wang, Chengxiang Zhai. Learn from Web Search Logs to Organize Search Results. In Proceedings of SIGIR, 2007.
[Weeber et al., 2003] Weeber, M., Vos, R., Klein, H., de Jong-Van den Berg. L.T.W., Aronson, A. and Molema, G., Generating hypotheses by discovering implicit associations in the literature: A case report for new potential therapeutic uses for Thalidomide. Journal of the American Medical Informatics Association, 2003, 10(3):252-259.
[Wikipedia] http://en.wikipedia.org/wiki/Association_rule_learning
[Xu and Croft, 1996] Xu, J., and Croft, W., Query expansion using local and global document analysis, In Proceedings of ACM SIGIR, 1996.
[Xu and Croft, 2000] Xu, J. and Croft, W.B., Improving the Effectiveness of Information Retrieval with Local Context Analysis, ACM Transactions on Information Systems, 18(1), 79-112, 2000.
[Yoo and Hu, 2006] Illhoi Yoo, Xiaohua Hu. A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library Medline. JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 220-229, 2006.
[Zeng et al., 2004] Hua-Jun Zeng; Qi-Cai He; Zheng Chen; Wei-Ying Ma. Learning to Cluster Web Search Results. In ACM. SIGIR, pages 210–217, New York, NY, USA, 2004.
[Zhai and Lafferty, 2001] Zhai, C., Lafferty, J., A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In proceeding of SIGIR’01, (2001)334-342.
[Zhai and Lafferty, 2001] Zhai, C., Lafferty, J., Model-based Feedback in the Language Modeling Approach to Information Retrieval. In Proceedings of CIKM'01, (2001) 403-410.
[Zhao and Karypis, 2001] Zhao, Y. and Karypis, G. Criterion functions for document clustering: experiments and analysis, Technical Report, Department of Computer Science, University of Minnesota, 2001.
[Zhou et al., 2006] Zhou X., Zhang X., Hu X., Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR, in the 28th European Conference on Information Retrieval (ECIR 2006), April 10-12, 2006, London, UK, pp. 444-455.
[Zhu et al., 2006] Zhu, W., Xu, X., Hu, X., Song, I., and Allen, R., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, IEEE International Conference on Granular Computing, May 10-12, 2006.
[Zhu et al., 1999] Zhu, A., Gauch, S., Lutz, G., Kral, N., and Pretschner, A., Ontology-Based Web Site Mapping for Information Exploration, Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), 188-194.
VITA
Xuheng Xu
EDUCATION
2011.05 Ph.D. Information Studies, Drexel University, Philadelphia, PA
2004.09 M.S. Information Systems, Drexel University, Philadelphia, PA
1998.07 M.S. Computer Science, New Jersey Institute of Technology, Newark, NJ
1986.06 B.E. Mechanical Engineering, Zhejiang University, Zhejiang, China

RESEARCH INTERESTS
Data mining, semantic-based query optimization and intelligent searching, and bioinformatics.

SELECTED PUBLICATIONS
1. Xuheng Xu, Xiaohua Hu: Cluster-based Query Expansion Using Language Modeling in the Biomedical Domain. In Proceedings of the 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW 2010), 18 Dec. 2010, pp. 185-188.
2. Xuheng Xu, Xiaodan Zhang, Xiaohua Hu: Using Two-Stage Concept-Based Singular Value Decomposition Technique as a Query Expansion Strategy. In Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW '07), Volume 1, 21-23 May 2007, pp. 295-300.
3. Xuheng Xu, Weizhong Zhu, Xiaohua Hu, Xiaodan Zhang: Using Ontology-based and Semantic-based Query Expansion on Biomedical Literature Search. In Proceedings of the 2006 IEEE International Conference on Systems, Man and Cybernetics (SMC '06), Volume 4, 8-11 Oct. 2006, pp. 3441-3446.
4. Hu X., Zhang X., Zhou X., Wu D., Xu X., A Comprehensive Comparison of 7 Methods for Mining Hidden Links from Biomedical Literature, in the International Multi-Symposium on Computer and Computational Sciences, June 20-24, 2006, Hangzhou, China.
5. Hu X, Zhang X, Li G, Yoo Y, Zhou X, Wu D, Xu X: Mining Hidden Connections among Biomedical Concepts from Disjoint Biomedical Literature Sets through Semantic-based Association Rule, International Journal of Intelligent Systems, 2006.
6. Zhong Huang, Xuheng Xu and Xiaohua Hu. Machine Learning Approach for Human Mitochondrial Protein Prediction. Book: Computational Intelligence in Bioinformatics, IEEE CS Press/Wiley, 2008.
7. Zhu, W., Xu, X., Hu, X., Song, I., and Allen, R., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, IEEE International Conference on Granular Computing, May 10-12, 2006.
8. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Services, 1(2), 2005, pp. 222-239.