Cluster-based Query Expansion Using Language Modeling for Biomedical Literature Retrieval
A Thesis
Submitted to the Faculty
of
Drexel University
by
Xuheng Xu
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
May 2011
© Copyright 2011 Xuheng Xu. All Rights Reserved
To my parents, brother and sister, and my family.
Acknowledgments
“Plan prayerfully, prepare purposefully, proceed positively, and pursue
persistently.” - William A. Ward
My journey pursuing a PhD degree started seven years ago.
There are very few things in life that do not merely persist but keep increasing.
Knowledge, and the constant urge to do something different, is certainly one of them.
When I look back on this seven-year journey, I am overwhelmed by the constant
encouragement I have received. I am truly humbled to have had such a wonderful
experience in my life and to have worked with such an outstanding group, supervised
by my advisor, of "get in and get it happening" people with a "move forward" attitude.
Firstly, I am very thankful to my advisor, Dr. Xiaohua (Tony) Hu. He
deepened my understanding of the research issues in information studies. His
deep insights and profound knowledge and experience opened up my
mind and led me to immerse myself in, and internalize, the theoretical and technical
details of data mining and bioinformatics. His passion and commitment are truly
the characteristics of a role model I could follow. I have been privileged and
fortunate to have him as my advisor and friend. From the very basics, he guided me
in how to read papers, how to explore questions, how to develop efficient solutions,
how to collaborate with colleagues in the field, how to write and present papers,
and ultimately how to do research. Without his enduring support, lasting
encouragement, and persevering guidance in my research, it is hard to imagine
this thesis would ever have become a reality.
I would also like to thank Drs. Yuan An, Zekarias Berhane, Jiexun Jason Li,
and Il-Yeol Song for their valuable advice, time, and effort serving on my thesis
committee. Dr. Song, in particular, was the first to show me how to organize and write
research papers and give presentations. Dr. Li and Dr. An provided valuable
suggestions toward completing the thesis.
I would like to extend my gratitude to the College of Information Science and
Technology for providing me the opportunity to fulfill my dream of pursuing a Ph.D.
degree here at Drexel, and for funding that permitted me to attend several academic
conferences over the years, through which I met many wonderful people and
learned many valuable life lessons in addition to academic skills and know-how.
For that, I am truly blessed and have many people to thank.
The research related to this thesis was supported in part by NSF Career Grant
IIS 0448023, NSF CCF 0514679, Pennsylvania Department of Health Tobacco
Settlement Formula Grant (No. 240205 and No. 240196), Pennsylvania Department
of Health Grant (No. 239667), and NSF CCF 0905291 and NSFC 90920005 “Chinese
Language Semantic Knowledge Acquisition and Semantic Computational Model
Study”.
This thesis is dedicated to my family. I would not have reached today's
height if it were not for the sacrifice, full support, and hard work of my father and
mother. My father passed away in 2007, but I know he would be glad to see this
thesis in his next life. My deepest appreciation also goes to my brother Xuyan and
sister Xuya, who inspired me to do my best and provided the unity of happiness.
And lastly, Hong Xu, thanks for the trust you have conferred upon me and the
never-ending support over the years. Jason Xu, you are always the best in my eyes.
You show me that everything is possible.
Although I am not sure where this journey might lead me, I know
it has changed me forever.
Table of Contents
LIST OF TABLES ..................................................................................................................... x
LIST OF FIGURES ................................................................................................................. xi
ABSTRACT .......................................................................................................................... xiii
CHAPTER 1. INTRODUCTION ........................................................................................... 1
1.1 Query Expansion ................................................................................................... 3
1.2 Cluster-based Retrieval ....................................................................................... 5
1.3 Scope ..................................................................................................................... 7
1.4 Motivation .............................................................................................................. 8
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW ...................................... 11
2.1 Language Modeling Approach ...................................................................... 11
2.2 Query Expansion Methods ................................................................................ 13
2.2.1 Global Analysis ............................................................................................ 17
2.2.2 Local Analysis............................................................................................... 18
2.3 Document Clustering ......................................................................................... 22
2.3.1 Feature Extraction ....................................................................................... 24
2.3.2 Feature Selection ........................................................................................ 24
2.3.3 Document Representation ....................................................................... 25
2.3.4 Clustering Methods ..................................................................................... 26
2.3.5 Cluster Quality ............................................................................................. 30
CHAPTER 3. A COMPARISON OF LOCAL ANALYSIS, GLOBAL ANALYSIS AND ONTOLOGY-BASED QUERY EXPANSION STRATEGIES FOR BIOMEDICAL LITERATURE RETRIEVAL ................................................................................................. 33
3.1 System Architecture ........................................................................................... 34
3.2 Query Strategies ................................................................................................. 36
3.2.1 Local Analysis............................................................................................... 36
3.3.2 Global Analysis ............................................................................................ 40
3.3.3 Term Re-weighting ...................................................................................... 40
3.4 Experiment Design and Implementation ....................................................... 42
3.4.1 Building the Index ....................................................................................... 43
3.4.2 Preprocessing Query to Search Engines ................................................. 43
3.4.3 Evaluation .................................................................................................... 44
3.5 Experimental Results and Discussion ............................................................... 44
3.5.1 Results ............................................................................................................ 44
3.5.2 Discussion ..................................................................................................... 49
3.6 Conclusion ........................................................................................................... 52
CHAPTER 4. USING TWO-STAGE CONCEPT-BASED SINGULAR VALUE DECOMPOSITION TECHNIQUE AS A QUERY EXPANSION STRATEGY ............. 53
4.1 Introduction ......................................................................................................... 53
4.2 System Architecture ........................................................................................... 54
4.3 Query Strategies ................................................................................................. 56
4.3.1 Stage one - generate concepts and construct concept-based term-document matrices ................................................................................................... 56
4.3.2 Stage two – use SVD technique in LSI to extract the expanded query terms….. ....................................................................................................................... 61
4.4 Experimental Design and Implementation .................................................... 63
4.4.1 MeSH Terms .................................................................................................. 63
4.4.2 UMLS .............................................................................................................. 63
4.4.3 Experimental Setup ..................................................................................... 64
4.5 Preliminary Experimental Results ...................................................................... 66
4.5.1 Evaluation .................................................................................................... 66
4.5.2 Preliminary Results ....................................................................................... 67
CHAPTER 5. CLUSTER-BASED QUERY EXPANSION USING LANGUAGE MODELING IN THE BIOMEDICAL DOMAIN .............................................................. 74
5.1 Introduction ......................................................................................................... 74
5.2 Related Work ....................................................................................................... 76
5.3 Methods ............................................................................................................... 79
5.3.1 Pseudo-relevance-based query expansion (PRB-DOC-QE) ............... 79
5.3.2 Global-cluster-based query expansion (GCB-QE) ................................ 80
5.3.3 Global-cluster-based query expansion (GCB-DOC-QE) ..................... 80
5.3.4 Local-cluster-based query expansion (LCB-RR-QE) .............................. 81
5.3.5 Concept-based pseudo-relevance query expansion (CB-PR-DOC-QE)……. ....................................................................................................................... 81
5.3.6 Concept-based global-cluster query expansion (CB-GC-QE) ........... 81
5.3.7 Concept-based global-cluster query expansion (CB-GC-DOC-QE) 81
5.3.8 Concept-based local-cluster query expansion (CB-LC-RR-QE) ......... 81
5.4 System Architecture ........................................................................................... 82
5.5 Evaluation ............................................................................................................ 83
5.6 Results ................................................................................................................... 90
CHAPTER 6. CONCLUSIONS AND FUTURE WORK .................................................. 91
6.1 Contributions of the Thesis ................................................................................ 91
6.2 Future Work .......................................................................................................... 93
List of Reference .................................................................................................................... 95
VITA ..................................................................................................................................... 102
LIST OF TABLES
Table 3 - 1: Experiment Results from LEMUR Search Tool (OKAPI BM25 MODEL) .............. 44
Table 3 - 2: Experiment Results from LEMUR Search Tool (VECTOR SPACE MODEL) ........ 46
Table 3 - 3: Experiment Results from LUCENE Search Tool (VECTOR SPACE MODEL) ...... 48
Table 4 - 1: Experimental Results (OKAPI BM25 model) ................................................ 67
Table 4 - 2: Experimental Results (OKAPI BM25 model) ................................................ 68
Table 4 - 3: Experiment Results (VECTOR SPACE MODEL) ......................................... 69
Table 4 - 4: Experimental Results (VECTOR SPACE model) ......................................... 70
Table 5 - 1: Experimental Results (Term-based) ............................................................... 85
Table 5 - 2: Experimental Results (Concept-based) .......................................................... 87
LIST OF FIGURES
Figure 1 - 1: Model mismatch from language modeling perspective .............................. 2
Figure 2 - 1: Search stages .................................................................................................... 14
Figure 2 - 2: Query Expansion: Methods and Sources. The figure is copied from [Efthimiadis, 1996] ................................................................................................................ 15
Figure 2 - 3: Hierarchical agglomerative clustering ......................................................... 28
Figure 3 - 1: AQEE-1 System Architecture ........................................................................ 35
Figure 3 - 2: Local analysis approach query expansion ................................................... 37
Figure 3 - 3: Global analysis and term re-weighting query expansion ......................... 41
Figure 3 - 4: Average Precision of All Runs (OKAPI BM25 MODEL) ........................... 45
Figure 3 - 5: P@10 of All Runs (OKAPI BM25 MODEL) ................................................. 45
Figure 3 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 46
Figure 3 - 7: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 47
Figure 3 - 8: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 48
Figure 3 - 9: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 49
Figure 4 - 1: Two-stage concept-based query expansion (AQEE-2) .............................. 55
Figure 4 - 2: Stage one – generate concepts and construct concept-based term document matrices ............................................................................................................... 56
Figure 4 - 3: The algorithm for extracting one concept name and its candidate concept IDs [Zhou et al. 2006] ............................................................................................. 60
Figure 4 - 4: Mean Average Precision of All Runs (OKAPI BM25 model) ................... 68
Figure 4 - 5: P@10 of All Runs (OKAPI BM25 model) ..................................................... 69
Figure 4 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL) ......... 70
Figure 4 - 7: P@10 of All Runs (VECTOR SPACE MODEL) ........................................... 71
Figure 5 - 1: Cluster-Based Query Expansion (AQEE-3) ................................................. 83
Figure 5 - 2: Mean Average Precision of All Runs ........................................................... 86
Figure 5 - 3: P@10 of All Runs ............................................................................................. 86
Figure 5 - 4: Mean Average Precision of All Runs (Concept-based) ............................. 88
Figure 5 - 5: P@10 of All Runs (Concept-based) ............................................................... 89
ABSTRACT
Cluster-based Query Expansion Using Language Modeling for Biomedical Literature Retrieval and Classification
Xuheng Xu
Advisor: Xiaohua Hu, Ph.D.
The enormous volume of biomedical literature, scientists' specific
information needs, long multi-word terms, and the fundamental problems of
synonymy and polysemy have been challenging issues facing biomedical
information retrieval researchers. Search engines have significantly
improved the efficiency and effectiveness of biomedical literature searching.
Search engines, however, are known to return many results that are irrelevant to the
intention of a user's query; in other words, they do not perform well in terms of
precision and recall. To further improve the precision and recall of biomedical
information retrieval, various query expansion strategies are widely used. In this
thesis, we concentrate on empirical comparison, experiments, and evaluation of
query expansion methods. We also use the findings as an empirical
justification for cluster-based query expansion. We have broadly investigated many
query expansion methods, such as local analysis, global analysis, and ontology-based
term re-weighting, across various search engines and obtained important insights.
Among the findings, a two-stage concept-based latent semantic analysis strategy and
cluster-based query expansion are presented, and the Singular Value
Decomposition (SVD) technique of Latent Semantic Indexing (LSI) is utilized in
the proposed method. In contrast to other query expansion methods, our strategy
selects terms that are most similar to the concepts in the query as well as in the
related documents, rather than selecting terms that are similar to the query terms only.
Furthermore, we propose a novel framework for cluster-based query expansion: we
have designed and implemented a novel and efficient computational approach to
cluster-based query expansion using language modeling. Through our experiments
on the TREC Genomics track ad hoc retrieval task, we demonstrate that clusters
created from the whole collection, or from the documents initially returned
for the original query, can be used to perform query expansion and ultimately
improve the overall effectiveness and performance of information retrieval systems
for biomedical literature retrieval. Lastly, we believe the principles of this strategy
may be extended and applied in other domains.
CHAPTER 1. INTRODUCTION
The volume of biomedical literature produced has increased dramatically
over the past several decades. For example, the PubMed database provided by the
United States National Library of Medicine contains over 17 million citations from
over 4,800 biomedical journals dating back to the 1950s, and the United States
National Institutes of Health clinical trials database contains information on over
135,000 clinical trials [PubMed]. To utilize such resources, users such as practicing
physicians and biomedical researchers have to tackle the incredibly time-consuming
and tedious tasks of finding, browsing, reading, and evaluating
relevant biomedical literature. Indeed, the explosion of data combined with
increasing publication rates greatly impairs a person's ability to locate the right
information in such enormous collections.
In order to improve the efficiency and effectiveness of biomedical literature
retrieval, various search engines based on different models have been used. However,
disappointing results, due to the inherent difficulties of using search engines,
continue to occur: for instance, the search results may contain a large number of
irrelevant documents (low precision), or only some portion
of the documents relevant to the user's need (low recall). It has not been easy for
search engines to meet users' expectations and information needs,
especially when the users do not have enough domain knowledge. The
fundamental issues behind these problems are synonymy and polysemy. Per the
Merriam-Webster Dictionary, synonymy refers to words with similar meanings,
while polysemy refers to a word or phrase having multiple meanings; for instance,
"apple" could mean the fruit or the computer company in different contexts. In
addition, it is well understood that people with different domain knowledge,
different skills, and diverse levels of language mastery use different terms to
describe the same concepts. Consequently, the terms employed in queries by users
often differ from those used by the authors of documents in a collection.
Looking at the problem from a language modeling perspective, one can conclude
that there is a mismatch between the user's query model and the document models
of the corpus. Figure 1-1 illustrates the situation.
Figure 1 - 1: Model mismatch from language modeling perspective
[Figure: the user induces a query model, P(Model|User), from which the query is generated, P(Query|Model); the author induces a document model, P(Model|Author), from which the document is generated, P(Doc|Model).]
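To make the P(Query|Model) side of this picture concrete, here is a minimal sketch of ranking documents by query likelihood under a unigram language model with Jelinek-Mercer smoothing. The toy collection, the smoothing weight, and whitespace tokenization are illustrative assumptions, not the thesis's actual setup.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc model): a unigram document model, linearly
    smoothed (Jelinek-Mercer) with the collection model so that
    query terms unseen in the document do not zero out the score."""
    d = Counter(doc.split())
    dlen = sum(d.values())
    c = Counter(t for text in collection for t in text.split())
    clen = sum(c.values())
    score = 0.0
    for t in query.split():
        p = lam * d[t] / dlen + (1 - lam) * c[t] / clen
        if p == 0.0:
            return float("-inf")  # term absent from the entire collection
        score += math.log(p)
    return score

# Rank a toy two-document collection for the query "gene expression".
collection = ["gene expression in yeast", "protein structure prediction"]
ranked = sorted(collection,
                key=lambda doc: query_likelihood("gene expression", doc, collection),
                reverse=True)
```

Under this view, the mismatch of Figure 1-1 shows up as a low query likelihood even for a relevant document whose author happened to choose different terms than the user.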
Since a typical search tends to use a one-word search term [Efthimiadis
1996], the problems mentioned above will certainly deteriorate the retrieval
performance of search engines for queries of one or a few terms. For example,
a one- or few-term query may not be adequate to express a concept accurately.
As fewer terms are included in a query, the chance of a term denoting the same
concept occurring in both the query and the relevant documents can be expected
to decline. In other words, the problem is expected to be worse for short (one- or
two-word) queries than for longer queries. To overcome these problems,
several promising techniques have been developed, such as query expansion and
cluster-based retrieval. The primary goal of this dissertation is to build a novel
approach to cluster-based query expansion that helps improve retrieval performance
in biomedical literature retrieval. As we eventually found, our approach to
cluster-based query expansion is closely related to knowledge structure
extraction, exploration, and discovery, all of which are essential to information science.
1.1 Query Expansion
Query expansion is not new, and many attempts have been made to improve the
state of the art. One rationale behind query expansion is that, due to the synonymy
and polysemy problems, adding more terms to the initial query may improve
precision and recall: because the expanded query contains more terms, the
probability of matching terms in the relevant documents, to some extent, increases.
Query expansion attaches new, additional terms to the initial query terms (the seed
query) provided by the user to improve precision and/or recall. Obviously,
generating these additional terms is the key. Researchers have focused on using
query expansion techniques, manually, interactively, or automatically, to help users
reformulate a better query and eventually achieve higher precision
and higher recall. Query expansion can be done interactively or automatically by
adding terms to the initial query, or manually by dynamically providing high-level
domain information about the collections, such as domain structure, thesauri, etc.
It can also be done by presenting higher-frequency terms from the global collection,
or from locally returned result sets, to the user with suggestions to refine
the initial query.
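As a hedged illustration of the automatic route, the following sketch expands a seed query with the most frequent new terms drawn from the top-ranked (pseudo-relevant) documents. The stopword list, raw-frequency weighting, and cutoff are deliberately simplified assumptions; real systems use more principled term-weighting schemes.

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "and", "a", "to", "is"}

def expand_query(seed_query, top_docs, n_terms=3):
    """Append the n_terms most frequent non-stopword terms found in the
    top-ranked documents but not already present in the seed query."""
    counts = Counter()
    for doc in top_docs:
        for term in doc.lower().split():
            if term not in STOPWORDS and term not in seed_query:
                counts[term] += 1
    return list(seed_query) + [t for t, _ in counts.most_common(n_terms)]

# Toy example: a two-term seed query and three pseudo-relevant titles.
seed = ["gene", "expression"]
top_docs = [
    "regulation of gene expression in cancer cells",
    "microarray analysis of gene expression profiles",
    "gene expression profiles in cancer",
]
expanded = expand_query(seed, top_docs)
```

The expanded query keeps the seed terms and adds terms such as "cancer" and "profiles" that co-occur with the seed in the pseudo-relevant set.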
However, general query expansion techniques have a potential problem.
In normal document-based retrieval (most search engines are document-based),
typical search engines match the query against documents in the collection and
often return to the user a long list of search results, ranked by their relevance to
the given query. It is also well known that a collection typically contains documents
on different subjects or topics. For example, PubMed research articles may be
classified into different subjects or topics, such as diseases. The terms added to
the initial query by various query expansion techniques may have entirely different
meanings across the distinct subjects of the whole collection. In that case, adding
the expanded terms fails to improve the performance of search engines and instead
decreases effectiveness, since the expanded terms introduce irrelevant terms (noise)
into the query. Clearly, in this worst case, word-based query expansion techniques
may not be sufficient to improve the effectiveness of search engines. Therefore,
when considering query expansion, we may have to contemplate clusters. We may
have to look at query expansion from a broader perspective, such as a cluster-based
document collection or clusters of the documents initially retrieved for the user's
original query. In other words, a paradigm shift may be needed when considering
query expansion. For clarity and simplicity, in this dissertation the term query
expansion refers to non-cluster-based expansion.
1.2 Cluster-based Retrieval
Cluster-based retrieval is based on the hypothesis that similar documents
tend to be relevant to the same information needs [Liu and Croft 2004]. In contrast
to document-based retrieval, cluster-based retrieval divides a collection of
documents into meaningful clusters, so that documents in the same cluster
describe the same topic, and then returns to the user a ranked list of documents
based on the clusters they come from.
Currently there are two approaches to cluster-based retrieval. The most
common is to retrieve, in their entirety, one or more clusters related to the user's
query. The difference between this method and normal document-based retrieval is
that it ranks by cluster-query similarity instead of document-query similarity. The
second approach is to use clusters as a form of document smoothing.
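A minimal sketch of the first approach: score each cluster by the similarity of its term centroid to the query, ranking clusters rather than individual documents. The cosine measure and raw term-frequency centroids are illustrative simplifications of what a real system would use.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_clusters(query, clusters):
    """Return cluster names in descending order of centroid-query cosine
    similarity: cluster-query rather than document-query ranking."""
    q = tf_vector(query)
    scored = []
    for name, docs in clusters.items():
        centroid = Counter()
        for d in docs:
            centroid.update(tf_vector(d))
        scored.append((cosine(q, centroid), name))
    return [name for _, name in sorted(scored, reverse=True)]

clusters = {
    "oncology": ["tumor growth in breast cancer", "cancer cell proliferation"],
    "cardiology": ["heart rate variability", "cardiac arrest outcomes"],
}
best = rank_clusters("breast cancer treatment", clusters)[0]
```

Once the best cluster is found, its documents can be returned in their entirety, which is exactly where cluster quality determines retrieval quality.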
Research shows that if a retrieval system is able to locate good clusters, the
performance of cluster-based retrieval is better than that of document-based
retrieval [Hearst and Pedersen 1996; Jardine and van Rijsbergen 1971]. Researchers'
investigations have largely focused on clustering algorithms, cluster retrieval, and
search methods. Of those, document clustering is most commonly performed
either on the entire collection, independent of the user's query, or on the set of
documents retrieved by a document-based retrieval for the user's query.
Recent developments in statistical language modeling for information
retrieval have reshaped the cluster-based query expansion process. Many studies
have confirmed that the language modeling approach has a solid theoretical
foundation and is usually effective for various applications [Ponte and Croft 1998;
Zhai and Lafferty 2001]. Liu and Croft [2004] demonstrated that cluster-based
retrieval can be more effective than document-based retrieval by proposing two new
models for cluster-based retrieval and evaluating the results on several TREC
collections. They also reported that cluster-based retrieval performs consistently
across collections of realistic size, and that significant improvements over document-
based retrieval can be obtained in a fully automatic manner without relevance
information provided by humans. These results guided us to reconsider query
expansion within this new cluster-based framework using the language modeling
approach. We aim to implement cluster-based retrieval as an approach to improving
query expansion effectiveness.
In this dissertation, various query expansion methods are explored;
in particular, a novel cluster-based query expansion using a semantic-based
language modeling approach is proposed. Different clustering algorithms, and
local analysis, global analysis, and ontology-based term re-weighting query
expansion strategies integrated with the UMLS (Unified Medical Language System),
are investigated to improve query effectiveness and precision. These methods are
applied to the ad hoc retrieval task of the TREC 2004/2005 Genomics track and to
other data collections similar to those of [Yoo and Hu 2006]. Meanwhile, larger
efforts are focused on combining ontology-based and semantic-based query
expansion. Selection of expanded query terms centers on Latent Semantic
Indexing (LSI) and Association Rules (AR); therefore, these are also described in
the dissertation. The number of terms added to the initial query that produces
promising precision is also evaluated.
It should be noted that although the research focuses on biomedical literature,
the resulting system is expected to be useful in other domains that have
domain-specific resources available.
1.3 Scope
This dissertation focuses on methodology: automatic query expansion
strategies. The aim is to provide an overall framework for cluster-based query
expansion using a language modeling approach.
1.4 Motivation
In general, this dissertation answers one core research question: can one
improve retrieval performance in biomedical literature retrieval by developing
cluster-based query expansion using a language modeling approach? The
dissertation concentrates on methodology, empirical comparison, experiments,
and evaluation in investigating cluster-based query expansion methods on various
issues, which together serve as empirical justification for cluster-based query
expansion. It is also meaningful to investigate a theoretical foundation of
cluster-based query expansion and to provide a reasonable path toward a
theoretical argument.
A set of key questions is associated with the above core research
question:
1. What is the effect of the clustering algorithm in cluster-based query expansion?
2. What is the difference in retrieval performance between existing query expansion methods and cluster-based query expansion?
3. Is global-based or local-based (query-specific) clustering more effective for query expansion?
4. How should documents be represented (vector space model, probabilistic model, ontology-based VSM)?
5. Which method (Latent Semantic Indexing or Association Rules) generates higher-quality additional query terms?
6. How should cluster-based query expansion be evaluated?
As mentioned previously, query expansion can generally improve retrieval
effectiveness. However, it may become ineffective when the data collection
contains multiple subjects or topics (clusters). Without document clustering,
query expansion does not have all the relevant documents with which to construct
an effective expanded term set for each cluster, which results in information loss
or distortion in the term associations.
This dissertation intends to improve retrieval effectiveness by developing a
novel cluster-based query expansion method using language modeling. The basic
idea of this method is that the whole document collection, or the documents
initially returned for the original query, from which the expanded query terms are
constructed, should first be grouped by a document clustering technique (various
clustering algorithms) into several clusters that approximate the subjects or topics.
After a set of document clusters is located, we can build a useful term set for each
cluster in the following manner: one way is to classify the whole document
collection into the established clusters and then use the documents of the new
clusters to build the VSM; the other way is to use only the documents of the
established clusters to build the VSM. For each user query, the local terms of each
cluster, generated by LSI or AR methods, are added to the original user query.
Next, the relevant documents are retrieved from the whole data collection for each
expanded query. Then, all retrieved documents are combined (removing duplicates)
and presented to the user as the overall result for the query. Experimentally, we
evaluate the retrieval effectiveness of this proposed cluster-based query expansion
method against that achieved by existing non-cluster-based query expansion
methods (particularly global analysis and local analysis query expansion).
The remainder of the dissertation is organized as follows. In addition to
Chapter 1, where Section 1.1 and this Section 1.3 introduce and highlight the
research motivation and objectives, Chapter 2 reviews literature related to this
dissertation, including the language modeling approach, existing query expansion
methods, techniques for generating expanded terms, and document clustering
techniques. Chapter 3 compares local analysis, global analysis, and ontology-based
query expansion strategies for biomedical literature retrieval. Chapter 4 applies a
two-stage concept-based singular value decomposition technique as a query expansion
strategy. The proposed cluster-based query expansion method and the system (AQEE-3)
design are presented in Chapter 5. Chapter 6 concludes and discusses future work.
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW
In this chapter, the literature related to this dissertation is reviewed:
the language modeling approach in Section 2.1, query expansion methods in
Section 2.2, and document clustering techniques in Section 2.3.
2.1 Language Modeling Approach
Language modeling has become a popular information retrieval model owing
to its sound theoretical foundation and good empirical success. The main idea of
language modeling in information retrieval is that each document in a collection is
viewed as a sample and a query as a generation process [Na et al., 2005]. Under this
view, the retrieved documents are ranked by the probability that the corresponding
language model of each document would produce the query. In other words, the
relevance of a document is calculated as the query likelihood under the document's
language model. Specifically, two steps are involved: first, build a
language model D for each document in the collection; second, rank the
documents by how likely the query Q could have been generated from each of
these document models, i.e., P(Q|D).
The probability P(Q|D) can be calculated in several ways. The most
common approach (usually referred to as the unigram language model) assumes that
the query can be treated as a set of independent terms, so the query probability
P(Q|D) can be represented as the product of the individual term probabilities [Liu
and Croft, 2004].
P(Q|D) = ∏_{i=1}^{|Q|} P(q_i|D)
where q_i is the i-th term in the query, and P(q_i|D) is specified by the
document language model.
P(q_i|D) = λ P_ML(q_i|D) + (1 − λ) P_ML(q_i|Coll)
where P_ML(w|D) is the maximum likelihood estimate of word w in the
document, P_ML(w|Coll) is the maximum likelihood estimate of word w in the
collection, and λ is a general symbol for smoothing.
For different smoothing methods, λ takes different forms.
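As a minimal, illustrative sketch (the toy corpus, query, and function names here are hypothetical, not the system's actual implementation), the smoothed unigram query-likelihood scoring described above can be written as:

```python
import math

def query_likelihood(query, doc, collection, lam=0.7):
    """log P(Q|D) under a unigram model with linear (Jelinek-Mercer-style)
    smoothing: P(qi|D) = lam * P_ML(qi|D) + (1 - lam) * P_ML(qi|Coll)."""
    score = 0.0
    for q in query:
        p_doc = doc.count(q) / len(doc)                  # ML estimate in document
        p_coll = collection.count(q) / len(collection)   # ML estimate in collection
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = [["gene", "expression", "cancer"],
        ["protein", "binding", "site"]]
collection = [w for d in docs for w in d]   # whole collection as one bag of words
scores = [query_likelihood(["gene", "cancer"], d, collection) for d in docs]
```

Documents are then ranked by `scores`; the collection model keeps the logarithm defined even for query terms absent from a document.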
Considering the cluster context, as previously mentioned, there are two language
modeling approaches. One of them is to rank clusters against the query: if we
replace D above with a cluster C, so that P(Q|D) becomes P(Q|C), we obtain ranked
clusters. But this has two limitations: first, it must operate over the entire
document collection; second, the formula above treats a whole cluster as one large
document, and it is often impractical for users to browse all the documents of the
relevant clusters.
If we define the language modeling approach from a broader perspective, namely as
a probabilistic mechanism for generating text, then it can also be used for text
clustering. In this case, each document of a collection is again viewed as a sample,
and the vocabulary of the corpus as the generating process; we can then compare the
distance between the language models of two documents. Therefore, the language
modeling approach can also serve as a clustering method.
Language modeling was initially used in speech recognition. Nowadays it
is widely applied to information retrieval and other applications owing to
its solid mathematical foundation and empirically proven effectiveness. In this
dissertation I explore its effectiveness in text document clustering and
relationship identification. Mainly, I utilize language modeling for the following:
1) Document clustering
2) Cluster ranking
3) Concept generation
2.2 Query Expansion Methods
Generally, there are five steps involved in the search process:
1. Resource selection
2. Seed query construction
3. Search results review
4. Query reformulation
5. Repeating the above steps
Given the scope of this dissertation, we focus on the stages from initial seed
query construction to query reformulation (Figure 2-1). For simplicity, these
stages are treated as independent of other factors such as the search engine used.
Figure 2 - 1: Search stages (seed query construction → search results review → query reformulation)
Normally, the query expansion process can take place in either seed query
construction or query reformulation.
As we have mentioned, query expansion is a promising way to improve
retrieval effectiveness or performance in information retrieval. The basic idea of
query expansion is to supplement a user query with terms related to the original
query. More precisely, query expansion (or term expansion) is the process of
supplementing the original query with additional terms [Efthimiadis, 1996], and it
can be considered a method for improving retrieval performance. Query expansion
is applicable in any situation irrespective of the retrieval technique(s) or search
engines used, and this generality is one of its major benefits.
The initial query (as provided by the user) may be an inadequate or
incomplete representation of the search concept or the user's information need,
either in itself or in relation to the representation of ideas or concepts in the
documents. Query expansion helps the user reformulate the query and ultimately meet
the user's information need. As depicted in Figure 2-2, query expansion can be
conducted manually, automatically, or interactively:
Figure 2 - 2: Query Expansion: Methods and Sources. The figure is copied from [Efthimiadis, 1996]
In this dissertation, I focus on automatic query expansion. Generally,
three key elements need to be considered when applying any form of query expansion:
(1) the source of candidate terms that will provide the terms for the query
expansion, (2) the strategy for using that source, and (3) the methods used to
select the terms for the expansion (e.g., a ranking algorithm for candidate terms).
Note that (2) and (3) are closely related.
Figure 2-2 also shows possible sources of candidate terms. One source is the
initial search results, which assumes a pseudo-relevance feedback process: the top
documents retrieved in an earlier iteration of the search, identified as relevant,
become the source of the query expansion terms. The other is some form of knowledge
structure that is independent of the search process; such a knowledge structure can
either depend on the collection, i.e., be corpus-based, or be independent of it. In
this dissertation, the term source will be either the initial search results or,
where possible, the whole collection.
The text retrieval community has researched and implemented many
query expansion strategies to improve precision and/or recall, and has built
various query expansion mechanisms into information retrieval systems. There are
two general strategies for using the candidate term source: ontology-based and
semantic-based. Expanding a query with synonyms or hyponyms is one example of
ontology-based query expansion. Shi et al. built an information retrieval system
integrating EntrezGene, HUGO, Eugenes, ARGH, GO, MeSH, UMLSKS, and WordNet into a
large reference database, used together with a conventional information retrieval
toolkit [Shi et al., 2005]. Some promising results have been reported for
ontology-based methods. However, ontology-based methods have only a limited effect
on biomedical information retrieval performance [Guo et al., 2004; Leroy and Chen,
2001]. Some researchers have reported hybrid approaches that combine different
weights with an ontology, but few have published details on how the weights were
used in conjunction with the ontology.
On the other hand, semantic-based methods focus on the collection of
documents, and they can be further classified mainly into two sub-categories: global
analysis and local analysis [Xu and Croft, 2000].
In the following discussion, we review two query expansion methods: global
analysis and local analysis.
2.2.1 Global Analysis
Global analysis extracts co-occurrences of related terms and builds a
similarity matrix by analyzing the whole collection of documents. Global analysis
techniques include term clustering, latent semantic indexing, phrase finding, and
similarity thesauri. Although it is relatively robust, global analysis has some
drawbacks. Corpus-wide statistical analysis consumes a considerable amount of
computing resources. Moreover, since it focuses only on the document side and does
not take the query side into account, global analysis cannot address the term
mismatch problem well. Lastly, global analysis requires measuring semantic
similarity and disambiguating terms.
Several studies have reported progress on variants of global analysis. To
improve the retrieval effectiveness of traditional global analysis methods, [Qiu
and Frei, 2003] proposed a "concept-based query expansion". They argue that a term
similar to the whole query is more relevant than one similar to only a single term
in the query. Their empirical evaluation shows that the concept-based query
expansion method significantly improves retrieval effectiveness compared with
traditional global analysis methods. [Matos et al., ] developed QuExT, a new
PubMed-based document retrieval and prioritization tool that can be used to find
the most relevant results in the biomedical literature. QuExT follows a
concept-oriented query expansion methodology to find documents containing concepts
related to the genes in the user input, such as protein and pathway names.
2.2.2 Local Analysis
In contrast to global analysis, local analysis extracts highly related terms
(correlated and ranked in order of their potential contribution to the query) from
the top documents initially retrieved by a query, or from data mining results,
without any assistance from the user. The generated terms are then appended to the
query and sometimes re-weighted as well. The idea of local analysis can be traced
back at least to a 1977 article by Attar and Fraenkel [Attar and Fraenkel, 1977],
in which the top-ranked documents for a query became sources of information for
building an automatic thesaurus. In addition, Croft and Harper used information
from the top-retrieved documents to re-estimate the probabilities that a term
would occur in the relevant set for a query [Croft and Harper, 1979].
In their research, [Xu and Croft, 1996] selected the 50 most frequent terms
and 10 phrases (i.e., pairs of adjacent non-stopwords) from the top-ranked
documents and added them to the original user query. The terms in the query were
then re-weighted using the Rocchio formula. They reported that, in their method,
the effectiveness of local analysis is highly influenced by the proportion of
relevant documents in the top ranks. Their findings also indicate that local
analysis performance is likely to be poorer if the initial seed query performs
poorly and retrieves few relevant documents, since most of the words added to the
query will then originate from non-relevant documents.
Local analysis also takes the query itself into account. For instance, Local
Context Analysis uses the top documents returned by an initial query but selects the
terms based on co-occurrence with query terms [Xu and Croft, 2000].
The benefits of the local analysis approach include saving computing
resources and requiring less human intervention. It assumes that the top documents
returned by the initial query are actually relevant ("pseudo-relevance feedback"),
and expansion terms are extracted from these top-ranked documents to formulate a
new query for a second retrieval cycle. However, in contrast to global analysis,
local analysis is not as robust, since it is nearly impossible for search tools or
mining methods to return only documents relevant to the user's information needs.
In our research, the local analysis method determines the terms added to the
original query based on "pseudo-relevance feedback", which assumes that the top
documents returned by a query are all relevant to the user's information needs. For
term selection in this dissertation, the Latent Semantic Indexing (LSI) or
Association Rule (AR) algorithm is used to find the top co-occurring terms of each
original term from the top N retrieved documents. A larger effort is devoted to
LSI, since preliminary results show positive gains in system performance.
The following is a brief introduction to Latent Semantic Indexing.
Latent Semantic Indexing (LSI) takes advantage of the association of
terms with documents ("semantic structure"). It starts with a matrix of terms by
documents [Deerwester et al., 1990; Xu and Croft, 2000]. This matrix is then
evaluated by the singular-value decomposition (SVD) to gain the latent semantic
structure [Zhu et al., 2006]. The SVD decomposes a term-document matrix into three
separate matrices, by which documents and terms are projected into the same
dimensional space.
For example, the SVD of a t × d matrix X of terms and documents
decomposes it into the product of three other matrices:
X = T0 S0 D0',
where T0 (a t × m matrix) and D0 (a d × m matrix) have orthogonal, unit-length
columns (T0'T0 = I, D0'D0 = I), and S0 (an m × m matrix) is diagonal. T0 and D0
are the matrices of left and right singular vectors, and S0 is the diagonal
matrix of singular values.
In general, the SVD can be reduced to a rank-k model with the best possible
least-squares fit to X [Deerwester et al., 1990]:
X ≈ X̂ = TSD'
The new matrix X̂ is the matrix of rank k that is closest to X in the least-squares
sense. D' is an orthonormal matrix [Deerwester et al., 1990; Xu and Croft, 2000].
The dimensions of matrices T, S, and D' are t × k, k × k, and k × d, respectively.
More importantly, the SVD can be used for information retrieval purposes to
derive a set of uncorrelated indexing variables or factors; each term and document
is represented by its vector of factor values. The dot product between two row
vectors of X̂ therefore reflects the extent to which the two terms have a similar
pattern of occurrence across the set of documents. The matrix X̂X̂' is defined by:
X̂X̂' = TS²T'
Here S² is diagonal [Deerwester et al., 1990].
In this dissertation, we also use this technique to compare the similarities
between the terms in the matrix extracted from the documents.
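A small numerical sketch of this use of the SVD (the toy matrix, the example term names in the comments, and the choice k = 2 are illustrative assumptions, not data from this dissertation):

```python
import numpy as np

# Toy terms-by-documents frequency matrix X (rows: terms, columns: documents).
X = np.array([[2.0, 1.0, 0.0],   # term 0, e.g. "gene"
              [1.0, 2.0, 0.0],   # term 1, e.g. "dna"
              [1.0, 1.0, 0.0],   # term 2, e.g. "cell"
              [0.0, 0.0, 3.0]])  # term 3, e.g. "car"

# Full SVD: X = T0 @ diag(S0) @ D0'
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation Xhat = T S D'
k = 2
T, S, Dt = T0[:, :k], np.diag(S0[:k]), D0t[:k, :]
Xhat = T @ S @ Dt

# Term-term similarities: Xhat @ Xhat' equals T S^2 T'
term_sim = Xhat @ Xhat.T
```

Here `term_sim[0, 1]` is large because terms 0 and 1 share a pattern of occurrence across documents, while `term_sim[0, 3]` is near zero.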
Association Rule (AR) mining finds frequent patterns, associations,
correlations, or causal structures among a set of data items. A terms-by-documents
matrix, similar to the one in the LSI-based algorithm, is produced and then
analyzed. For details of association rules, please refer to [Agrawal et al., 1993;
Agrawal et al., 1995; Savasere et al., 1995].
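As a toy sketch of the support and confidence statistics that underlie association rules (the documents and terms below are hypothetical, not drawn from the experimental data):

```python
# Each document is reduced to the set of terms it contains ("transactions").
doc_terms = [
    {"gene", "expression", "cancer"},
    {"gene", "expression", "therapy"},
    {"gene", "cancer"},
    {"protein", "binding"},
]

def support(itemset):
    """Fraction of documents that contain every term in the itemset."""
    return sum(itemset <= d for d in doc_terms) / len(doc_terms)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the documents."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"gene", "expression"})        # 2 of 4 documents
c = confidence({"gene"}, {"expression"})   # 0.5 / 0.75
```

A rule such as {gene} => {expression} is kept when both statistics exceed user-chosen thresholds.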
As previously mentioned, the sources of expanded terms and the selection of
candidate terms for query expansion are the key, and much work remains to be done
here. In this dissertation, TF-IDF-based and other vector space information
retrieval models are used to test local analysis. Relevance feedback for the
vector space model can be represented by the Rocchio formula [Rocchio, 1999]:
q_{i+1} = α q_i + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_n|) Σ_{d_j ∈ D_n} d_j
where
q_{i+1}: the expanded query
q_i: the initial query
D_r: set of relevant documents among the retrieved documents
D_n: set of non-relevant documents among the retrieved documents
α, β, γ: tuning constants
In this model, information in relevant documents is treated as more important
than information in non-relevant documents (γ << β). This study applies
pseudo-relevance feedback, in which the top N retrieved documents are assumed to
be all relevant, so γ equals 0.
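Under pseudo-relevance feedback the Rocchio update reduces to the original query plus a weighted centroid of the top-ranked documents. A minimal sketch (the vectors and constants below are illustrative, not the system's actual parameters):

```python
def rocchio_prf(query_vec, top_doc_vecs, alpha=1.0, beta=0.75):
    """Rocchio with pseudo-relevance feedback: all top-ranked documents are
    treated as relevant, so the non-relevant term (gamma) is dropped."""
    n = len(top_doc_vecs)
    centroid = [sum(d[i] for d in top_doc_vecs) / n
                for i in range(len(query_vec))]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]

q = [1.0, 0.0, 0.0]                             # original query over a 3-term vocabulary
top_docs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # pseudo-relevant documents
q_new = rocchio_prf(q, top_docs)                # expanded, re-weighted query
```

Terms absent from the original query acquire nonzero weight through the centroid, which is the expansion effect.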
2.3 Document Clustering
Due to known human limitations, it is very difficult for people (even expert)
to discover useful information by reading large quantities of unstructured text. This
difficulty has inspired the creation of a more specific technique for unsupervised
23
document organization, automatic topic extraction and fast information retrieval or
filtering to aid human beings’ information discovery.
Document clustering has been studied in the field of information retrieval for
several decades. Its purpose is to group similar documents into clusters on the
basis of their contents. The problem of document clustering can be defined as
follows: given a set DC of n documents, DC is partitioned into a user-defined
number k of document clusters DC1, DC2, ..., DCk (i.e., DC1 ∪ DC2 ∪ ... ∪ DCk = DC)
such that the documents within a cluster are similar to one another while documents
from different clusters are dissimilar [Yoo and Hu, 2006].
Document clustering faces a number of challenges: increasingly large volumes
of data, especially in the biomedical literature; the curse of dimensionality of
the term space; term sparsity; and complex semantics. These characteristics require
document clustering techniques to be scalable to large, high-dimensional data and
able to handle sparsity and semantics.
Document clustering was initially investigated for improving information
retrieval performance, because similar documents grouped by clustering tend to be
relevant to the same user queries. However, document clustering was not widely used
in information retrieval systems for a long time, because early clustering
algorithms were too slow or infeasible for very large document sets. With the
advent of new technology and more powerful computers, document clustering has
become very popular in information retrieval research.
Generally document clustering consists of several steps, including feature
extraction, feature selection, document representation, and clustering.
2.3.1 Feature Extraction
Feature extraction is a special form of dimensionality reduction. In document
clustering, feature extraction syntactically tags the terms in each document and
extracts the desired terms (usually nouns and noun phrases). If the extracted
features are carefully chosen, the feature set is expected to capture the relevant
information from the document corpus, so that the desired task can be performed on
this reduced representation instead of the full-size input.
2.3.2 Feature Selection
Feature selection is the technique of selecting a subset of relevant features
for building robust learning models. Appropriate feature selection can help people
better understand their data: which features are important and how they relate to
each other. In document clustering, feature selection reduces the size of the
extracted feature set in order to reduce the dimensionality of the term space and
handle sparsity. The most common criteria for feature selection include term
frequency (TF), term frequency-inverse document frequency (TF-IDF), and mixtures
of the two. The k features with the highest selection-metric scores are then
chosen to represent the documents.
2.3.3 Document Representation
Once features are selected, each document can be represented, for example, by
a feature vector, and the feature vectors of the documents can then be treated as
inputs for document clustering. Several representation models have been proposed:
the probabilistic model, the vector space model (VSM), and the ontology-based VSM.
Owing to its popularity, the ontology-based VSM will be used in this dissertation.
Its advantages include:
It is unsophisticated;
It is easy to calculate the similarity between two documents;
Algorithms can be applied directly to the data;
It considers the relationships between terms;
It introduces semantic concepts into the data model;
It combines an ontology with the traditional VSM;
Terms (dimensions) have semantic relationships rather than being independent.
A document collection is represented as a t × d matrix X = [x_ij], whose rows
correspond to terms and columns to documents, where x_ij is the value of term i in
document j.
Note that there are three common ways to assign the term value x_ij:
frequency representation; binary representation (x_ij = 1 indicates that term i
occurs in document j, otherwise x_ij = 0); and term frequency-inverse document
frequency (TF-IDF):
x_ij = tf(d_j, t_i) × log(|D| / df(t_i))
where tf(d_j, t_i) is the frequency of term t_i in document d_j, |D| is the total
number of documents, and df(t_i) is the number of documents in which t_i occurs.
The generated feature vectors of the documents can then be treated as inputs for
clustering.
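A minimal sketch of building the TF-IDF terms-by-documents matrix (the toy corpus is hypothetical; the natural logarithm is used for the idf factor as an illustrative choice):

```python
import math

docs = [["gene", "expression", "gene"],
        ["protein", "expression"],
        ["protein", "binding"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)

def tfidf(doc, term):
    tf = doc.count(term)               # tf(dj, ti): frequency of term in document
    df = sum(term in d for d in docs)  # df(ti): documents containing the term
    return tf * math.log(N / df)

# x[i][j] = TF-IDF weight of term vocab[i] in document docs[j]
x = [[tfidf(d, t) for d in docs] for t in vocab]
```

A term occurring often in one document but rarely in the collection (here "gene") receives a high weight; a term spread across many documents is down-weighted.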
2.3.4 Clustering Methods
With the generated feature vectors of the documents, a clustering technique is
applied. The objective of clustering algorithms is to efficiently and automatically
group documents with similar content into the same cluster. The core of a
clustering algorithm comprises its similarity measure and its clustering method.
There are many ways to measure how similar two documents are, or how similar a
document is to a query. The choice of similarity measure depends heavily on the
terms chosen to represent the text documents. For instance, one measure is the
Euclidean distance (L2 norm):
dist(d_i, d_j) = sqrt( Σ_k (w_{i,k} − w_{j,k})² )
Others are the L1 norm and the cosine similarity:
cos(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)
where d_i and d_j are document vectors and w_{i,k} is the weight of term k in
document d_i.
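The two measures behave differently with respect to document length; a sketch with toy vectors (illustrative data):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, but twice the length
d3 = [0.0, 0.0, 3.0]   # no terms in common with d1
```

Cosine similarity judges d1 and d2 identical because it ignores vector length, while their Euclidean distance is nonzero; d1 and d3 share no terms, so their cosine is 0. This length invariance is why cosine similarity is the usual choice for documents of varying size.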
Three common clustering approaches are hierarchical, partitioning-based,
and the self-organizing map.
The hierarchical approach starts with every point in a separate cluster and
successively merges the most similar pairs of data points or clusters, based on
pairwise distances, until only one big cluster remains or a termination condition
holds. It can be employed in either a bottom-up or a top-down manner. The
hierarchical agglomerative clustering (HAC) algorithm employs a bottom-up strategy
to build a binary clustering hierarchy: it initially treats every document as a
cluster and gradually merges the two most similar clusters on the basis of an
inter-cluster similarity measure, producing a binary tree or dendrogram. For
instance, single-link measures the similarity between the closest pair of objects,
while complete-link evaluates the similarity between the most distant pair of
objects. The HAC algorithm continues merging until a desired number of clusters is
obtained or the inter-cluster similarity falls below a pre-specified threshold.
Figure 2-3 illustrates the HAC algorithm. In contrast, the hierarchical divisive
clustering (HDC) algorithm employs a top-down strategy: it starts with all
documents in one cluster and then divides it into the two most distinct clusters,
continuing to divide each resulting cluster until a termination condition is
satisfied.
Figure 2 - 3: Hierarchical agglomerative clustering
The hierarchical approach also raises some concerns: the data may not actually
have a hierarchical structure, and the method usually involves specifying somewhat
arbitrary cutoff values.
The partitioning approach partitions the entire document collection into
several non-overlapping clusters, and it comprises the most widely used algorithms
in document clustering [Steinbach et al., 2000]. Take K-means as an example; its
steps are:
Choose a number of clusters k
Initialize cluster centers c1,… ck (randomly pick k data points or
randomly assign points to clusters and take means of clusters)
For each data point, compute the cluster center it is closest to and assign
the data point to this cluster
Re-compute cluster centers (mean of data points in cluster)
Stop when there are no new re-assignments
The following is the pseudocode for K-means:

K-means-cluster(in S : set of vectors; k : integer) {
    let C[1] ... C[k] be a random partition of S into k parts;
    repeat {
        for i := 1 to k {
            X[i] := centroid of C[i];
            C[i] := empty
        }
        for j := 1 to N {
            q := index of the centroid among X[1] ... X[k] closest to S[j];
            add S[j] to C[q]
        }
    } until the change to C (or the change to X) is small enough.
}
Like the hierarchical approach, partitioning also has some problems:
The number of clusters k must be guessed in advance.
It can converge to a local minimum. For example, with six points A-F and K = 2,
starting with centroids B and E may converge on the two clusters {A, B, C} and
{D, E, F}.
Data points are hard-assigned to exactly one cluster.
It makes implicit assumptions about the "shapes" of clusters.
Finally, the self-organizing map is an unsupervised, two-layer neural
network, based on Kohonen's work on learning and memory in the human brain. In
addition to specifying the number of clusters, one must also specify a topology: a
two-dimensional grid that gives the geometric relationships between the clusters
(i.e., which clusters should be near or distant from each other) while preserving
the similarity between data points as much as possible. The advantage of this
approach is that it learns a mapping from the high-dimensional space of the data
points onto the points of the two-dimensional grid (one grid point per cluster).
To use a self-organizing map, the network topology, the shape of the output
nodes (e.g., hexagonal, rectangular), and their number must be decided first, on
the basis of the chosen distance measures. For details, please refer to
[Kohonen, 1995].
2.3.5 Cluster Quality
The effectiveness of different clustering algorithms on a data set can be
evaluated with three types of measures: internal criteria, external criteria, and
relative criteria. Among them, the most frequently used are purity [Zhao and
Karypis, 2001], entropy [Steinbach et al., 2000], and normalized mutual
information (NMI) [Banerjee and Ghosh, 2002].
Purity is a simple and transparent evaluation measure. To compute purity,
each cluster is assigned to the class that is most frequent in the cluster, and
the accuracy of this assignment is measured by counting the number of correctly
assigned documents and dividing by the total number of documents.
Formally:
purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|
where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters and C = {c_1, c_2, ..., c_J}
is the set of classes; ω_k is the set of documents in cluster k, c_j is the set of
documents in class j, and N is the total number of documents.
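The purity computation can be sketched as follows (the cluster assignments and gold labels are illustrative toy data):

```python
from collections import Counter

def purity(clusters, labels):
    """(1/N) * sum over clusters of the size of its majority class.
    `clusters` maps doc id -> cluster id; `labels` maps doc id -> true class."""
    total = 0
    for c in set(clusters.values()):
        members = [labels[d] for d, cl in clusters.items() if cl == c]
        total += Counter(members).most_common(1)[0][1]   # majority-class count
    return total / len(labels)

clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
labels   = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y"}
p = purity(clusters, labels)   # (2 + 2) / 5
```

Note that purity trivially reaches 1 when every document is placed in its own cluster, which is why it is complemented by measures such as NMI.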
Entropy is a measure of uncertainty, defined as
H(X) = − Σ_{x ∈ X} P(x) log₂ P(x)
where X is the set of all possible outcomes that need to be encoded. Entropy is
largest when uncertainty is largest and 0 when there is absolute certainty.
Normalized mutual information (NMI) is defined as the mutual information
between the cluster assignments and a pre-existing labeling of the dataset,
normalized by the arithmetic mean of the maximum possible entropies of the
empirical marginals, i.e.,
NMI(Ω, C) = I(Ω; C) / [ (H(Ω) + H(C)) / 2 ]
where Ω is a random variable for cluster assignments and C is a random variable
for the pre-existing labels on the same data.
I(Ω; C) is the mutual information:
I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) log [ P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) ]
where P(ω_k), P(c_j), and P(ω_k ∩ c_j) are the probabilities of a document being
in cluster ω_k, in class c_j, and in their intersection, respectively.
H(Ω) is the entropy:
H(Ω) = − Σ_k P(ω_k) log P(ω_k)
NMI ranges from 0 to 1; the larger the NMI, the higher the quality of the
clustering. NMI improves on other common extrinsic measures such as purity and
entropy in that it does not necessarily increase as the number of clusters
increases.
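The NMI definition can be sketched directly (toy assignments; the natural logarithm is used here, which cancels in the ratio):

```python
import math
from collections import Counter

def entropy(assignment):
    n = len(assignment)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(assignment).values())

def mutual_info(clusters, labels):
    n = len(clusters)
    joint = Counter(zip(clusters, labels))          # joint cluster/class counts
    pc, pl = Counter(clusters), Counter(labels)     # marginal counts
    return sum((nij / n) * math.log(n * nij / (pc[ci] * pl[lj]))
               for (ci, lj), nij in joint.items())

def nmi(clusters, labels):
    return mutual_info(clusters, labels) / ((entropy(clusters) + entropy(labels)) / 2)

perfect = nmi(["A", "A", "B", "B"], ["x", "x", "y", "y"])      # clusters match classes
independent = nmi(["A", "A", "B", "B"], ["x", "y", "x", "y"])  # no agreement
```

Perfect agreement yields NMI = 1 and statistically independent assignments yield NMI = 0, matching the range stated above.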
CHAPTER 3. A COMPARISON OF LOCAL ANALYSIS, GLOBAL ANALYSIS AND ONTOLOGY-BASED QUERY EXPANSION STRATEGIES FOR BIOMEDICAL LITERATURE RETRIEVAL
Many research papers focus on query expansion strategies; three strategies
are used most: local analysis, global analysis, and ontology-based expansion. In
order to better understand the effectiveness of different automatic query expansion
strategies, and to further improve the design of future biomedical literature
retrieval systems, there is a need for a comparison of these three strategies for
biomedical literature retrieval.
It is worth noting that several factors may impact the final performance of a
biomedical literature retrieval system: 1) the system's indexing strategy; 2) the
system's retrieval model; 3) the data collection; and 4) the quality of the queries
submitted to the system. We will propose a new automatic query expansion approach
and evaluate it with respect to each of these four factors except the dataset (we
restrict ourselves to TREC).
In this chapter, I introduce the architecture of our automatic query expansion
experiment system (AQEE-1), discuss the three major query expansion strategies, and
describe the experimental design, environments, and implementation. For some
methods, preliminary studies are conducted and evaluation results are presented.
3.1 System Architecture
In order to evaluate 1) the system's indexing strategy, 2) the system's
retrieval model, and 3) the quality of the queries submitted to the system, our
system consists of two major parts, as shown in Figure 3-1. In the first part
(above the dashed line in Figure 3-1), we apply global analysis, local analysis,
and term re-weighting expansions; the details of our query expansion strategies
and their construction are discussed in Section 3.2. The second part (below the
dashed line) is a Lucene or Lemur search engine, used to demonstrate the generality
of our results: we would like to show that the quality of query expansion is
independent of the search engine.
Figure 3 - 1: AQEE-1 System Architecture
3.2 Query Strategies
3.2.1 Local Analysis
This method determines the terms added to the original query based on
"pseudo-relevance feedback", which assumes that the top documents returned by a
query are all relevant to the user's information needs. In this work, the Latent
Semantic Indexing (LSI) or Association Rule (AR) algorithm is used to find the top
co-occurring terms of each original term from the top N retrieved documents. A
larger effort is devoted to LSI, since in our experiments it shows positive gains
in system performance.
The processes of local analysis are depicted in Figure 3-2.
Figure 3 - 2: Local analysis approach query expansion
The Latent Semantic Indexing (LSI) takes advantage of the association of
terms with documents (“semantic structure”). It starts with a matrix of terms by
documents [Deerwester et al., 1990, Xu and Croft, 2000]. This matrix is then
evaluated by the singular-value decomposition (SVD) to gain the latent semantic
structure [Zhu et al., 2006]. The SVD decomposes a term-document matrix into three
separate matrices, by which documents and terms are projected into the same
dimensional space.
For example, the SVD of a t × d matrix X of terms and documents decomposes it
into the product of three other matrices:
X = T0 S0 D0',
where T0 (a t × m matrix) and D0 (a d × m matrix) have orthogonal, unit-length
columns (T0'T0 = I, D0'D0 = I), and S0 (an m × m matrix) is diagonal. T0 and D0 are
the matrices of left and right singular vectors, and S0 is the diagonal matrix of
singular values.
In general, the SVD can be reduced to a rank-k model with the best possible
least-squares fit to X [Deerwester et al., 1990]:
X ≈ X̂ = TSD'
The new matrix X̂ is the matrix of rank k that is closest to X in the least-squares
sense. D' is an orthonormal matrix [Deerwester et al., 1990; Xu and Croft, 2000].
The dimensions of matrices T, S, and D' are t × k, k × k, and k × d, respectively.
More importantly, the SVD can be used for information retrieval purposes to
derive a set of uncorrelated indexing variables, or factors; each term and document is
represented by its vector of factor values. Therefore, the dot product between two
row vectors of X̂ reflects the extent to which two terms have a similar pattern of
occurrence across the set of documents. The matrix X̂X̂′ of all such term-to-term
products is:

X̂X̂′ = T S² T′

so the (i, j) cell can be obtained as the dot product of the i-th and j-th rows of the
matrix TS, where S is diagonal [Deerwester et al., 1990].
In this research, we use this technique to compare the similarities between the
terms in the matrix extracted from the documents.
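As a concrete sketch of how the truncated SVD yields term-to-term similarities, the toy example below builds a hypothetical 4-term × 4-document matrix (the matrix, the term labels, and the use of simple power iteration are illustrative assumptions, not the thesis implementation) and computes rank-1 entries of TS²T′:

```python
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
X = [
    [1, 1, 0, 0],  # "gene"
    [1, 1, 1, 0],  # "expression"
    [0, 0, 1, 1],  # "anemia"
    [0, 0, 1, 1],  # "iron"
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# A = X X': its (i, j) entry is the dot product of term rows i and j.
A = [[dot(ri, rj) for rj in X] for ri in X]

def dominant_eig(A, iters=500):
    """Power iteration on a symmetric matrix: largest eigenvalue + unit eigenvector.
    For A = X X', the eigenvalue is sigma1^2 and the eigenvector is t1."""
    v = [1.0] * len(A)
    for _ in range(iters):
        w = [dot(row, v) for row in A]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    lam = dot(v, [dot(row, v) for row in A])  # Rayleigh quotient
    return lam, v

lam, t1 = dominant_eig(A)

def rank1_term_sim(i, j):
    """Entry (i, j) of T S^2 T' truncated to rank k = 1."""
    return lam * t1[i] * t1[j]
```

With k = 1, "anemia" and "iron" (identical occurrence patterns) score higher with each other than with "gene", mirroring how the reduced space groups terms by usage.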
Association Rule (AR) mining is a popular and well-researched method for
discovering frequent patterns, associations, correlations, or causal structures
among a set of data items, especially in large databases. [Wikipedia] provides one
example: the rule {onions, potatoes} => {burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together,
he is likely to buy a burger as well. In our test, "financial" co-occurs closely with the
query term "combating times alien smuggling". A term-by-document matrix, similar
to the one in the LSI-based algorithm, is produced and then analyzed. For details of
Association Rule mining, please refer to [Agrawal et al., 1993, Agrawal et al., 1995,
Savasere et al., 1995].
The steps of expanding the query terms based on the LSI or AR algorithm are
described below:
Step 1: Obtain a term matrix based on the top N (in our experiment, N = 300)
documents retrieved by the search engine for each initial query.
Step 2: Run LSI or AR algorithm to generate the related terms to each of the
original query terms.
Step 3: Expand the query terms by including the top-ranked M terms
acquired in Step 2. We seek the top M (M = 5/4/3/2/1; the best result is achieved
when M = 1) co-occurrence terms of the original terms from the top N (N = 300)
retrieved documents. We then evaluated the impact of the number of expanded
terms on precision.
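The three steps can be sketched as follows; the token lists are hypothetical, and a raw co-occurrence count stands in for the LSI/AR ranking of Step 2:

```python
from collections import Counter

def expand_query(query_terms, top_docs, m=1):
    """Steps 1-3 in miniature: from the top-N docs (token lists), add the
    top-M co-occurring terms for each original query term."""
    expanded = list(query_terms)
    for term in query_terms:
        cooc = Counter()
        for doc in top_docs:               # Step 1: top-N retrieved documents
            if term in doc:
                cooc.update(w for w in doc if w != term)
        for w, _ in cooc.most_common(m):   # Steps 2-3: keep the top-M terms
            if w not in expanded:
                expanded.append(w)
    return expanded

docs = [["transgenic", "mice", "knockout"],
        ["transgenic", "mice", "gene"],
        ["iron", "transport"]]
```

For the query term "mice", the sketch adds "transgenic", its most frequent co-occurring term in the toy collection.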
3.3.2 Global Analysis
In contrast to local analysis, in global analysis the terms used for expansion are
extracted from all the documents in the collection. UMLS provides co-concepts and
their co-occurrence frequencies for many medical terms that appeared in MEDLINE
during the past 10 years. Thus, we expand the initial query with the UMLS co-concepts
of the original key terms that share the same semantic types. Note that we use
LingPipe, an efficient biomedical name extraction tool, to extract the key terms. For
instance, "mice, knockout", a high-frequency co-concept in UMLS of the term
"transgenic mice", has the same semantic type and will be added to the initial
query [Zhu et al., 2006].
3.3.3 Term Re-weighting
This approach weights the original key terms and co-occurrence terms according to
their relative importance in queries. Our earlier studies [Hu and Xu, 2005, Zhu et al.,
2006] show that expanding queries with synonyms generated from UMLS and
assigning equal weights to both the initial query terms and the new terms is not
effective. Therefore, in this work, we boosted the weight of the original query terms.
Specifically, if an original term has a higher term frequency in the initial query and a
specific major UMLS semantic type, the term is given a higher weight (e.g., 8) in the
expanded query. Additionally, if a key query term or an expanded term has a major
UMLS semantic type, its preferred MeSH term synonym defined in UMLS is added
and given a higher weight [Zhu et al., 2006]. Note that the selection of key query
terms is again decided by the tagging of a named entity extraction tool, LingPipe.
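A minimal sketch of the re-weighting idea (the weight of 8 follows the example above; the `key_terms` set stands in for the LingPipe/UMLS semantic-type tagging, which the sketch does not implement):

```python
def weight_query(original_terms, expanded_terms, key_terms, boost=8.0):
    """Boost original key terms; expanded terms keep a default weight of 1."""
    weights = {}
    for t in original_terms:
        weights[t] = boost if t in key_terms else 1.0
    for t in expanded_terms:
        weights.setdefault(t, 1.0)
    return weights

q = weight_query(["transgenic", "mice"], ["mice, knockout"],
                 key_terms={"transgenic"})
```

The resulting weighted query keeps expanded terms from drowning out the terms the user actually typed.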
The processes of global analysis and term re-weighting query expansion are
displayed in Figure 3-3.
Figure 3 - 3: Global analysis and term re-weighting query expansion
3.4 Experiment Design and Implementation
The goal of the TREC 2004 Genomics Track is to create test collections for
evaluation of information retrieval (IR) and related tasks in the genomics domain
[Genomics]. It consists of the ad hoc retrieval task and the categorization task. In this
research, we focus on the ad hoc retrieval task.
The TREC 2004 Genomics ad hoc retrieval task aims at retrieving MEDLINE
records for the official 50 topics from the whole corpus. The task was a conventional
search task based on a 10-year MEDLINE subset (1994-2003), about 4.5 million
documents and 20 GB in size, in NLM XML format. The official 50 topics were
derived from information needs obtained via interviews with biomedical
researchers [Genomics].
Two search engines are used in this research: LEMUR (v.4.0) and LUCENE
(v.1.4.3). While LUCENE is a high-performance, scalable information retrieval tool
that provides indexing and searching capabilities, LEMUR is a tool set designed to
facilitate research in language modeling and information retrieval; it supports the
construction of basic text retrieval systems using language modeling methods, as
well as traditional methods such as those based on the vector space model and
statistical models (for instance, OKAPI BM25) [Zhu et al., 2006].
3.4.1 Building the Index
LEMUR provides several parsers to index TREC-format documents. The
InvIndex type of the BuildIndex module was used to transform the TREC 2004
Genomics XML data into the general TREC format, and the TREC-format data was
then inverted-indexed. For LUCENE, the original XML data set was converted into
HTML format; only the content of fields such as "TITLE", "ABSTRACT", "MESH",
and "CHEMICAL SUBSTANCE" was kept in the HTML files. The transformed
HTML documents were then indexed accordingly [Zhu et al., 2006].
3.4.2 Preprocessing Query to Search Engines
The TREC 2004 Genomics topics for the ad hoc retrieval task are developed
from the information needs of actual field experts. The topics consist of four parts
formatted in XML: ID, title, information need, and context. The titles are abbreviated
statements of the information needs and, being only a few words long, are most
similar to queries entered by end users. In this research, the title of each topic is
selected for the baseline runs.
In addition, a two-step approach was used to preprocess queries for the
LEMUR search engine: first, a query was tokenized and filtered against stop-word
lists; second, the query was stemmed before being fed to LEMUR for retrieval [Zhu
et al., 2006]. These two operations are very useful for improving computational
efficiency and reducing noise. We also note that the context field provides
background information that situates the information need and describes the
environment in which queries may occur, so it may be helpful to expand the
original queries, i.e., the titles, with context. In this research, we selected only those
key terms or phrases labeled by LingPipe in the context information of the topics to
expand the original queries.
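The two preprocessing steps can be sketched as below; the stop-word list is a small illustrative sample, and the crude suffix stripper merely stands in for the actual stemmer used with LEMUR:

```python
STOPWORDS = {"the", "of", "in", "a", "and", "for"}  # illustrative sample only

def crude_stem(token):
    """A toy stand-in for a real stemmer (e.g. Porter): strip common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(query):
    # Step 1: tokenize and filter against the stop-word list.
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    # Step 2: stem the remaining tokens.
    return [crude_stem(t) for t in tokens]
```

For example, `preprocess("Ferroportin in iron transport proteins")` drops the stop word and conflates "proteins" with "protein".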
3.4.3 Evaluation
The evaluation measure for the topics in the ad hoc retrieval task is mean
average precision (MAP). Precision and recall are calculated using Buckley's
trec_eval program. In addition, we also assess precision at 10 documents (P@10) in
this research.
3.5 Experimental Results and Discussion
3.5.1 Results
Table 3 - 1: Experiment Results from LEMUR Search Tool (OKAPI BM25 MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3010   0.516   --
UMLS_Global                      0.2562   0.464   -14.9%
LSI_Local                        0.3035   0.516   +0.8%
AR_Local                         0.2840   0.440   -5.6%
Baseline+ReWeighting             0.3131   0.506   +4.0%
Baseline+Context                 0.3107   0.554   +3.2%
Baseline+Context+Re-Weighting    0.3374   0.528   +12.1%
Baseline+Context+LSI_Local       0.3288   0.516   +9.2%
Figure 3 - 4: Average Precision of All Runs (OKAPI BM25 MODEL)
Figure 3 - 5: P@10 of All Runs (OKAPI BM25 MODEL)
Table 3 - 2: Experiment Results from LEMUR Search Tool (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3024   0.504   --
UMLS_Global                      0.3007   0.476   -0.6%
LSI_Local                        0.3039   0.504   +0.5%
AR_Local                         0.2842   0.462   -6.0%
Baseline+ReWeighting             0.3092   0.510   +2.2%
Baseline+Context                 0.3231   0.504   +6.8%
Baseline+Context+Re-Weighting    0.3252   0.498   +7.5%
Baseline+Context+LSI_Local       0.3251   0.504   +7.5%
Figure 3 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 3 - 7: P@10 of All Runs (VECTOR SPACE MODEL)
Table 3 - 3: Experiment Results from LUCENE Search Tool (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.2404   0.436   --
UMLS_Global                      0.1858   0.334   -22.7%
Baseline+ReWeighting             0.2512   0.442   +4.5%
Baseline+Context                 0.2488   0.482   +3.5%
Baseline+Context+Re-Weighting    0.2891   0.468   +20.3%
Figure 3 - 8: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 3 - 9: P@10 of All Runs (VECTOR SPACE MODEL)
3.5.2 Discussion
As shown in Table 3-1, there are eight types of runs in the experiments:
Baseline, UMLS_Global (title + co-concepts in UMLS), LSI_Local (title + local
co-occurrence terms), AR_Local (title + local co-occurrence terms),
Baseline+Re-weighting (title + term re-weighting approach), Baseline+Context (title
+ key terms or phrases in the context of the query), Baseline+Context+LSI_Local
(title + key terms or phrases in the context of the query + local co-occurrence terms),
and Baseline+Context+Re-weighting (title + key terms or phrases in the context of
the query + term re-weighting
approach). The ratio of change is relative to Baseline. We ran all eight types of
experiments with the LEMUR search tool and five with the LUCENE search tool.
We obtained the best results with the Baseline+Context+Re-weighting
method, which achieves mean average precision of 33.74% and 28.91% for LEMUR
(OKAPI BM25 Model) and LUCENE, respectively (improvements of 12.1% and
20.3% over the corresponding Baselines). Co-occurrence term expansion with LSI
local analysis is applied to four runs on LEMUR, in queries produced from Baseline
and Baseline+Context (Tables 3-1 and 3-2). These four runs are based on two
different information retrieval models, the OKAPI BM25 Model and the VECTOR
SPACE Model. The results suggest that, with context taken into account,
co-occurrence term expansion with LSI local analysis works better on the OKAPI
BM25 Model (increasing average precision by 9.2%) than on the VECTOR SPACE
Model (increasing it by 7.5%).
Apparently, the LSI-based local analysis approach improves the average
precision of the Baseline+Context run more than that of the Baseline run. This also
suggests that the context information of queries is critical in enhancing retrieval
performance. Without context, co-occurrence term expansion with the LSI local
analysis approach increases the average precision by only 0.8% on the OKAPI BM25
Model, and the enhancement on the VECTOR SPACE Model is about 0.5%.
If the Baseline+Context runs are treated as baselines, LSI local analysis
elevates the average precision by 5.8% on the OKAPI BM25 Model but by only 0.6%
on the VECTOR SPACE Model. We note that the best results we achieved with LSI
local analysis are the cases in which we added only one expanded term to each
original query term.
AR produces the worst results among all the methods we tested: it decreases
the average precision by 5.6% in Table 3-1 and by 6.0% in Table 3-2 (AR_Local).
Therefore, we did not run Baseline+Context+AR_Local. We note that the
co-concepts expanded from global analysis also make the results worse. The average
precision of the UMLS_Global run on LEMUR decreases by 14.9% (OKAPI BM25
Model) and by 0.6% (VECTOR SPACE Model) compared to the Baseline run. This
result is consistent when we used LUCENE: Table 3-3 shows that the average
precision of the UMLS_Global run on LUCENE also decreases, by 22.7% (VECTOR
SPACE Model).
We examined the impact of the term re-weighting strategy on LEMUR
against the Baseline and Baseline+Context runs, which do not use it. The
experimental results are shown in Tables 3-1 and 3-2: Table 3-1 shows the results
with the OKAPI BM25 Model, and Table 3-2 those with the VECTOR SPACE Model.
The Baseline+Context+Re-weighting method with the OKAPI BM25 Model
increases the average precision over the Baseline by 12.1%; with the VECTOR
SPACE Model, the increase is 7.5%. We note that Baseline+Re-weighting, without
context, increases the average precision by only 4.0% with the OKAPI BM25 Model,
and the enhancement on the VECTOR SPACE Model is only 2.2%.
We therefore contend that using context is important in improving precision.
We also examined the impact of the term re-weighting strategy on LUCENE
against the Baseline and Baseline+Context runs. The experimental results are shown
in Table 3-3: the term re-weighting strategy increases the average precision over
Baseline by 4.5% and 20.3%, respectively.
3.6 Conclusion
In this study we presented the design and evaluation of query expansion
strategies using three widely used approaches for the biomedical domain: local
analysis, global analysis, and term re-weighting.
The important insights we obtained through our study are summarized as
follows:

- Ontology-based term re-weighting provides the best results among the three
query expansion strategies.
- Expanding the initial query with more precise ontology-based terms enhances
LSI-based local analysis substantially.
- Including context in term re-weighting and LSI further improves precision.
- Adding only one expanded term to each original query term in LSI local
analysis increases precision over adding more than one term.
- The OKAPI BM25 Model achieves better precision than the VECTOR SPACE
Model on LEMUR.
CHAPTER 4. USING TWO-STAGE CONCEPT-BASED SINGULAR VALUE DECOMPOSITION TECHNIQUE AS A QUERY EXPANSION STRATEGY
As indicated in Chapter 3, we compared different query expansion strategies,
including global analysis, local analysis, and term re-weighting, and obtained some
important insights. However, a couple of flaws are worth noting. First, the strategies
have little theoretical support. Second, the context we employed (a kind of concept)
is background information that situates the information need and describes the
environment in which queries may occur. The results show that it is helpful for
expanding the original query and acquiring better performance. Nonetheless, not all
biomedical literature retrieval systems contain such information or make it available
to users. Further research is therefore needed to explore concept generation and to
utilize the language modeling approach, which has a strong mathematical
foundation.
4.1 Introduction
As mentioned previously, query expansion is the process of supplementing
the original query with additional terms. Its advantages are that it helps users
reformulate a better query to achieve higher precision and higher recall, eventually
improving retrieval performance, and that it is applicable in any situation
irrespective of the retrieval techniques or models used [Efthimiadis, 1996]. In this
work, via query expansion, we focus on the selection of additional query terms
based on concepts, to help alleviate the synonymy and polysemy problems.
Traditional keyword (term) matching does not work well for biomedical
information retrieval. One way to address the problem is sense-based IR through
word sense disambiguation (WSD); however, WSD is a challenging task in the field
of natural language processing (NLP). Some experiments reported promising effects
of sense-based IR models, but most experimental results failed to show any
performance improvement, partially due to the low accuracy of WSD in the general
domain [Sanderson, 1994] and/or very noisy word alignment. The other way to
solve this problem is through concept matching. A concept has a distinctive meaning
in the biomedical literature, and all synonymous terms share the same concept
identity, so concepts do not cause ambiguity. Besides, this approach to some extent
mimics searchers' process of selecting query terms [LSI]. Importantly, a
concept-based approach is quite easy to implement with UMLS and other tools.
4.2 System Architecture
Figure 4-1 illustrates the system components and data flow. The benefit of
our approach is three-fold: it saves computing resources and requires less human
intervention by taking advantage of the UMLS and pseudo-relevance feedback; it
handles the synonymy and polysemy problems well without causing ambiguity, by
using concepts embedded among the terms; and it is independent of the
information retrieval system. Moreover, since it assumes that the top N documents
returned by the initial query are actually relevant, expansion terms are extracted
from each cluster of the top-ranked documents to formulate a new query for a
second retrieval cycle. This takes advantage of the search engine (via the initial
query) and draws on a set of documents more relevant than the whole collection.
Figure 4 - 1: Two-stage concept-based query expansion (AQEE-2)
Specifically, a concept-based term-document matrix is generated from the
top retrieved documents (Stage one). The SVD technique of LSI is then utilized to
extract and rank the terms to be added to the initial query (Stage two).
Figure 4 - 2: Stage one – generate concepts and construct concept-based term
document matrices
4.3 Query Strategies
4.3.1 Stage one - generate concepts and construct concept-based term-document matrices
During Stage one, it is very important to cluster the documents retrieved for
the initial query and to extract the concepts from this subset of the data collection.
For details of the idea and the process, please refer to [Zhu et al., 2006].
4.3.1.1 Document Clustering
Both single linkage and average linkage suffer from a severe chaining
problem on all three testing datasets when the standard vector cosine is used as the
document similarity measure [Zhu et al., 2006]. For a fair comparison, we therefore
use complete linkage as the cluster distance measure, because it does not have the
chaining problem on the testing dataset when working with the baseline document
similarity measure. With the complete linkage criterion, the distance between two
clusters is defined as the maximum distance between a document in the first cluster
and a document in the second cluster. That is,

Δ(Cu, Cv) = max d(di, dj),  di ∈ Cu, dj ∈ Cv
After estimating a language model for each document in the corpus with
context-sensitive semantic smoothing, the Kullback-Leibler divergence between two
language models is used as the distance measure for the corresponding two
documents. Given two probabilistic document models p(w|d1) and p(w|d2), the
KL-divergence distance of p(w|d1) from p(w|d2) is defined as:

Δ(d1, d2) = Σ_{w∈V} p(w|d1) log( p(w|d1) / p(w|d2) )

where V is the vocabulary of the corpus. The KL-divergence distance is a
non-negative score; it is zero if and only if the two document models are exactly the
same. However, KL-divergence is not a symmetric metric. Thus, we define the
distance between two documents as the minimum of the two KL-divergence
distances. That is,

d(d1, d2) = min( Δ(d1, d2), Δ(d2, d1) )
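The distance computation can be sketched as follows, assuming two hypothetical smoothed unigram models over a shared vocabulary (smoothing guarantees no zero probabilities, which KL-divergence requires):

```python
import math

def kl_divergence(p, q):
    """Delta(d1, d2) = sum over the vocabulary of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def doc_distance(p, q):
    """Symmetrize: take the minimum of the two KL directions."""
    return min(kl_divergence(p, q), kl_divergence(q, p))

# Hypothetical smoothed document models over a 3-word vocabulary.
p1 = {"iron": 0.5, "anemia": 0.3, "gene": 0.2}
p2 = {"iron": 0.4, "anemia": 0.4, "gene": 0.2}
```

The distance is zero only when the two models coincide, and taking the minimum makes it symmetric as required by the clustering step.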
The partitional clustering is a generalized version of the standard k-means
[Zhong and Ghosh, 2005]. It assumes that there are k parameterized models, one for
each cluster. Basically, the algorithm iterates between a model re-estimation step
and a sample re-assignment step, as shown below.
Algorithm: Model-based K-Means
Input: a dataset of n documents and the desired number of clusters k.
Output: trained cluster models Λ = {λ1, …, λk} and the document assignment
Y = {y1, …, yn}, yi ∈ {1, …, k}.
Steps:
1. Initialize the document assignment Y.
2. Model re-estimation: λj = arg max_λ Σ_{i: yi = j} log p(di | λ)
3. Sample re-assignment: yi = arg max_j log p(di | λj)
4. Stop if Y does not change; otherwise go to step 2.
The model-based k-means algorithm is adapted from [Zhong and Ghosh, 2005].
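The four steps can be sketched as follows, with plain smoothed multinomial (unigram) cluster models standing in for the context-sensitively smoothed models used in the thesis:

```python
import math
from collections import Counter

def estimate_model(cluster_docs, vocab, alpha=1.0):
    """Step 2: maximum-likelihood unigram model with add-alpha smoothing."""
    counts = Counter(w for d in cluster_docs for w in d)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(doc, model):
    return sum(math.log(model[w]) for w in doc)

def model_kmeans(docs, k, iters=20):
    vocab = {w for d in docs for w in d}
    y = [i % k for i in range(len(docs))]              # Step 1: initialize Y
    for _ in range(iters):
        models = [estimate_model([d for d, yi in zip(docs, y) if yi == j], vocab)
                  for j in range(k)]                   # Step 2: re-estimate
        new_y = [max(range(k), key=lambda j: log_likelihood(d, models[j]))
                 for d in docs]                        # Step 3: re-assign
        if new_y == y:                                 # Step 4: convergence
            break
        y = new_y
    return y

docs = [["iron", "iron", "anemia"], ["iron", "anemia"],
        ["gene", "mice"], ["gene", "mice", "mice"]]
labels = model_kmeans(docs, k=2)
```

On the toy collection, the two "iron/anemia" documents end up in one cluster and the two "gene/mice" documents in the other.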
4.3.1.2 Concept Extraction
We used approximate matching to overcome the limitation of exact character
matching of terms. With it, concept (phrase) extraction and concept (phrase)
meaning disambiguation can be completed in one step. The basic idea of this
approach is to calculate an importance score for the tokens and capture the
important tokens (not all tokens) of a concept name, unlike [Liu et al., 2004], where a
specific window size was defined to identify a phrase in a document. The
importance score formula for a token is given below [Zhou et al., 2006]:
I(w) = max_{1≤j≤n} Sj(w),   where   Sj(w) = (1/N(w)) / Σ_i (1/N(wji))

where w is the token,
n is the number of concept names or variants,
Sj(w) stands for the importance of the token w to the j-th variant name,
N(w) stands for the number of concepts whose variant names contain the token w,
wji indicates the i-th token in the j-th variant name of the concept, and
I(w) denotes the importance of w to the concept.
With the importance score formula above, the detailed concept extraction
algorithm shown in Figure 4-3 can be employed. The candidate concept with the
maximum importance score is chosen in the end, unless only one candidate concept
remains.
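Under the formula above (the reconstruction of the formula is itself an assumption, as are the toy concept counts below), the score can be sketched as:

```python
def importance(token, variant_names, concept_count):
    """I(w) = max over variant names j of S_j(w), with
    S_j(w) = (1 / N(w)) / sum_i (1 / N(w_ji)).
    Here concept_count[t] plays the role of N(t)."""
    best = 0.0
    for name in variant_names:        # each variant name is a list of tokens
        if token not in name:
            continue
        denom = sum(1.0 / concept_count[t] for t in name)
        best = max(best, (1.0 / concept_count[token]) / denom)
    return best

# Hypothetical counts: "transgenic" appears in few concepts, "mice" in many.
counts = {"transgenic": 2, "mice": 10}
score = importance("transgenic", [["transgenic", "mice"]], counts)
```

The rarer token ("transgenic") receives the higher importance, which is the behavior the approximate-matching step relies on.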
Figure 4 - 3: The algorithm for extracting one concept name and its candidate concept IDs [Zhou et al. 2006]
One example is presented below: The regulation (C1327622) of iron transport
(C1159668) by cytokines (C0079189) is a key mechanism in the pathogenesis
(C0699748) of anemia of chronic disease (C0002873) and a promising target for
therapeutic intervention (C0808232).
Phrase: regulation, iron transport, cytokines, pathogenesis, anemia of chronic
disease, therapeutic intervention
Concept: C1327622, C1159668, C0079189, C0699748, C0002873, C0808232
We developed the program IndexAppConfig to complete the Stage one task.
Given a set of documents, IndexAppConfig extracts a set of concepts. Based on the
concepts generated, it then indexes the documents and generates several files,
including collection.stat, docterm.index, termdoc.index, termindex.list,
docindex.list, docterm.matrix, and termdoc.matrix. In particular, termdoc.index is a
binary random-access file for storing the matrix, and termdoc.matrix is used during
Stage two. Once these files are ready, the t-d matrix can easily be constructed, and
Stage two uses the LSI SVD technique to extract terms. Note that the t-d matrix is
concept-based, not merely a single-word-by-document matrix; for simplicity,
however, we still call it the t-d matrix.
4.3.2 Stage two – use SVD technique in LSI to extract the expanded query terms
The purpose of Latent Semantic Indexing (LSI) is to overcome the problems
of lexical matching by using statistically derived conceptual indices instead of
individual words for retrieval [Agrawal et al., 1993, Guo et al., 2004]. The idea
behind it is that in the latent semantic space, a query and a document can have high
cosine similarity even if they do not share any terms, as long as their terms are
semantically similar.
LSI is the application of a particular mathematical technique, Singular Value
Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a
least-squares method that takes advantage of the association of terms with
documents ("semantic structure"). Starting with a matrix of terms by documents,
this term-document matrix is evaluated by SVD to obtain the latent semantic
structure of word usage across documents; it is decomposed into three separate
matrices, by which documents and query terms are projected into the same
dimensional space.
Because of disadvantages of LSI, such as its large storage requirements, in
this research SVD is applied only to the top 200/50 or fewer documents retrieved for
the initial query on Lemur. This technique is used to compare the similarities
between the terms in the matrix extracted from the documents and to generate the
important terms to be added to the initial query.
We developed the program SVDExpansion to complete the Stage two task.
Given the files generated during Stage one, SVDExpansion automatically extracts a
number of terms per request, each with its concept ID and ranked score (1.0 is the
highest). The following is one example:
The following is one example:
Query term: ferroportin C0915115
********************************************
0 ferroportin C0915115 1.0
1 Cation Transport Proteins C0969710 0.9992486625308058
4.4 Experimental Design and Implementation
4.4.1 MeSH Terms
MeSH (Medical Subject Headings) is a type of controlled vocabulary. MeSH
consists of sets of descriptors in a hierarchical structure that allows searching at
various levels of specificity. The roots of the hierarchical structures in MeSH have
very broad headings such as "Anatomy" or "Mental Disorders." Lower levels of the
hierarchy include more specific terms, such as "Ankle" and "Conduct Disorder."
There are 22,997 descriptors in MeSH, as well as thousands of cross-references that
assist in finding the most appropriate MeSH heading, for example, from Vitamin C
to Ascorbic Acid.
4.4.2 UMLS
The Unified Medical Language System (UMLS) is a set of medical knowledge
sources and associated language tools developed by the National Library of
Medicine. The system includes the UMLS Metathesaurus, the UMLS Semantic
Network, and the SPECIALIST Lexical Tools [Umls]. The UMLS Metathesaurus
contains information about concepts and biomedical terminology in several
languages. It was developed automatically from 73 families of controlled
vocabularies and 117 classification systems; the 2004AA version contains about 1
million concepts and 4.3 million terms. The purpose of the UMLS Semantic Network
is to provide a consistent semantic categorization of all concepts represented in the
UMLS Metathesaurus and to provide a set of useful relationships between these
concepts. All information about specific concepts is
found in the Metathesaurus; the Network provides information about the set of basic
semantic types, or categories, which may be assigned to these concepts, and it
defines the set of relationships that may hold between the semantic types. The
2004AA release of the Semantic Network contains 135 semantic types and 54
relationships. The semantic types are the nodes in the Network, and the
relationships between them are the links. In total, UMLS provides symbolic
relationships for 5 million pairs of concepts and statistical relationships for 6.5
million pairs of co-occurring concepts. The symbolic relations mainly include
hierarchical and associative relationships; the associative relationships are
represented as relationships between the semantic types of concepts. Statistical
relations are computed by determining the frequency with which concepts in
specific vocabularies co-occur in records in a database. For instance, there are
co-occurrence relationships for the number of times concepts have co-occurred as
key topics within the same articles, as evidenced by the Medical Subject Headings
assigned to those articles in the MEDLINE database.
4.4.3 Experimental Setup
In order to compare the experimental results with those of Chapter 3, the
same data collection, TREC 2004/2005, is used. A brief introduction to the TREC
2004 and 2005 Genomics tracks follows.

The goal of the TREC 2004 and 2005 Genomics tracks is to create test
collections and to provide a forum for evaluation of information retrieval (IR) and
related tasks in the genomics domain [Genomics]. Both the TREC 2004 and TREC
2005 Genomics tracks consist of the ad hoc retrieval task and the categorization task.
Since the
document set of the TREC 2005 Genomics ad hoc task was the same corpus as used
in the TREC 2004 Genomics ad hoc task, our major effort in this research focused on
the 2004 Genomics ad hoc retrieval task. The corpus contains a 10-year subset (1994
to 2003) of MEDLINE, the bibliographic database of biomedical literature
maintained by the US National Library of Medicine. The corpus holds about 4.5
million MEDLINE records (which include title and abstract as well as other
bibliographic information) and is about 9 GB in size.
The TREC 2004 Genomics ad hoc retrieval task aims at retrieving MEDLINE
records for the official 50 topics from the whole corpus. The official 50 topics were
derived from information needs obtained via interviews with real biologists [Guo et
al., 2004; Genomics]. The task was a conventional search task over the whole
document corpus in NLM XML format.
The Lemur search tool is employed as the backend engine in this research.
Lemur is a tool set designed to facilitate research in language modeling and
information retrieval; it supports the construction of basic text retrieval systems
using language modeling methods, as well as traditional methods such as the vector
space model and statistical models (for instance, OKAPI BM25) [Zhu et al., 2006]. In
this research, it was first used for the initial query run to obtain the top N
documents as a basis for generating a set of concepts and extracting the terms to
reformulate the query. It was then invoked to run the reformulated query so that its
precision and recall could be compared with those of the baseline.
4.4.3.1 Indexing the Corpus
To index the TREC 2004 Genomics track, the InvIndex type of the BuildIndex
module provided by Lemur was used to transform the XML-format data into the
general TREC format, and the TREC-format data was then inverted-indexed. The
high-frequency stop words in the collection were removed, and the remaining terms
were stemmed accordingly.
4.4.3.2 Preprocessing Query to Lemur
The TREC 2004 Genomics topics for the ad hoc retrieval task are developed
from the information needs of actual field experts. The topics consist of four parts
formatted in XML: ID, title, information need, and context, of which the titles are
abbreviated statements of the information needs. Since the titles are only a few
words long and are most similar to the queries entered by end users, in this research
all 50 titles are selected as initial queries. These initial queries for the baseline run
were preprocessed before being fed to Lemur: they were tokenized, filtered against
stop-word lists, and stemmed.
4.5 Preliminary Experimental Results
4.5.1 Evaluation
Again, Buckley's trec_eval program was run to capture precision and recall
for each individual query. Since mean average precision (MAP) is a comprehensive
indicator of IR performance that captures both precision and recall, we examined
MAP for the topics in the ad hoc retrieval task, following TREC convention. In
addition, we also assessed the precision at the top 10 documents (P@10) in our
evaluation.
4.5.2 Preliminary Results
Table 4 - 1: Experimental Results (OKAPI BM25 model)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3010   0.516   --
UMLS_Global                      0.2562   0.464   -14.9%
LSI_Local                        0.3035   0.516   +0.8%
AR_Local                         0.2840   0.440   -5.6%
Baseline+ReWeighting             0.3131   0.506   +4.0%
Baseline+Context                 0.3107   0.554   +3.2%
Baseline+Context+Re-Weighting    0.3374   0.528   +12.1%
Table 4 - 2: Experimental Results (OKAPI BM25 model)

Run                                 MAP      P@10    MAP Ratio Change
Baseline                            0.3010   0.516   --
Concept-based SVD (200 documents)   0.3193   0.504   +6.0%
Concept-based SVD (50 documents)    0.3308   0.520   +9.9%
Figure 4 - 4: Mean Average Precision of All Runs (OKAPI BM25 model)
Figure 4 - 5: P@10 of All Runs (OKAPI BM25 model)
Table 4 - 3: Experiment Results (VECTOR SPACE MODEL)

Run                              MAP      P@10    MAP Ratio Change
Baseline                         0.3024   0.504   --
UMLS_Global                      0.3007   0.476   -0.6%
LSI_Local                        0.3039   0.504   +0.5%
AR_Local                         0.2842   0.462   -6.0%
Baseline+ReWeighting             0.3092   0.510   +2.2%
Baseline+Context                 0.3231   0.504   +6.8%
Baseline+Context+Re-Weighting    0.3252   0.498   +7.5%
Table 4 - 4: Experimental Results (VECTOR SPACE model)

Run                                 MAP      P@10    MAP Ratio Change
Baseline                            0.3024   0.504   --
Concept-based SVD (200 documents)   0.3183   0.516   +5.3%
Concept-based SVD (50 documents)    0.3281   0.518   +8.4%
Figure 4 - 6: Mean Average Precision of All Runs (VECTOR SPACE MODEL)
Figure 4 - 7: P@10 of All Runs (VECTOR SPACE MODEL)
Seven types of runs were designed for the experiments reported in Tables 4-1
and 4-3: Baseline, UMLS_Global (title + co-concepts in UMLS), LSI_Local (title +
local co-terms), AR_Local (title + local co-terms), Baseline+Re-Weighting (title +
term re-weighting approach), Baseline+Context (title + key terms or phrases in the
context of the query), and Baseline+Context+Re-Weighting (title + key terms or
phrases in the context of the query + term re-weighting approach). The ratio of
change is computed relative to the Baseline.
The average precision of our baseline run is better than the results of the 2004
TREC Genomics ad hoc task, where the average precision over 47 runs submitted by
32 teams was 20.74%. Our best runs on Lemur reach 33.74% and 28.91%. Co-term
expansion with local analysis increases average precision by only 0.5% (LSI). AR
produces an even worse result, decreasing average precision by 6.0%. Here, using
the document as the transaction unit for association rule mining may not be
effective; association rules mined at the sentence level may produce more precise
term associations.
The term re-weighting strategy was also applied to four runs on Lemur, in
which the queries were formed from both Baseline and Baseline+Context. These
four runs are based on two different information retrieval models: the Vector Space
Model and the statistical Okapi BM25 model. The results suggest that the statistical
model, which increases average precision by 12.1% when context is taken into
account, empowers the term re-weighting approach more than the Vector Space
Model, which increases average precision by 7.5%. Without context, the term
re-weighting approach increases average precision by only 4.0% on the statistical
model and by about 2.2% on the Vector Space Model. If the Baseline+Context runs
are treated as baselines, term re-weighting elevates average precision by 8.6% on the
statistical model but by only 0.7% on the Vector Space Model.
Co-concepts expanded from global analysis make the results worse. The
average precision of the UMLS_Global run on Lemur decreases by 0.6% (Vector
Space Model) and 14.9% (Okapi BM25 model). Perhaps the top-ranking co-concepts
from UMLS co-occur frequently with the original terms in contexts entirely different
from those of the TREC Genomics topics.
Both Table 4-2 and Table 4-4 show that the average precision of our baseline
runs and of the concept-based SVD query expansion runs (Okapi BM25 model or
Vector Space model) is better than the results of the 2004 TREC Genomics ad hoc
task, where the average precision over 47 runs submitted by 32 teams was 20.74%.
Table 4-2 shows that the concept-based SVD approach with the Okapi BM25
model achieves at best 33.08% MAP and 52.0% P@10, while the baseline approach
(Okapi BM25 model) achieves 30.10% MAP and 51.6% P@10. Table 4-4 shows that
the concept-based SVD approach with the Vector Space model achieves at best
32.81% MAP and 51.8% P@10, while the baseline approach (Vector Space model)
achieves 30.24% MAP and 50.4% P@10. The concept-based SVD query expansion
approach clearly improves average precision over the baseline runs: by up to 9.9%
in Table 4-2 and 8.4% in Table 4-4. This suggests that concept-based SVD in LSI
significantly enhances retrieval performance on the TREC 2004 Genomics track.
We also compared the results of using the top 200 versus the top 50
documents retrieved for the initial queries as the basis for generating concepts and
extracting terms to add to the initial queries. If the concept-based SVD run with the
top 200 documents is treated as the baseline, the run with the top 50 documents
improves retrieval performance by a further 3.6% (Okapi BM25 model) and 3.1%
(Vector Space model). A possible explanation is that the top 200 documents add
more noise to the basis used to generate concepts and perform SVD, so the newly
extracted query terms do not help performance as much.
CHAPTER 5. CLUSTER-BASED QUERY EXPANSION USING LANGUAGE MODELING IN THE BIOMEDICAL DOMAIN
The approach introduced in Chapter 4 has several benefits: it saves computing
resources and requires less human intervention by taking advantage of the UMLS
and pseudo-relevance feedback; it handles the synonymy and polysemy problems
well without introducing ambiguity, by using concepts embedded among the terms;
and it is independent of the information retrieval system. Moreover, since it assumes
that the top N documents returned by the initial query are actually relevant,
expansion terms are extracted from the top-ranked documents to formulate a new
query for a second retrieval cycle. It thus leverages the search engine (via the initial
query) and draws on more relevant documents than the whole data collection
would provide.

However, a document collection typically contains documents on various
topics. The association between two terms can therefore differ across topics, which
makes it more appropriate to conduct query expansion within each topic.
5.1 Introduction
Query expansion is not a new method for improving information retrieval
precision and recall. Researchers have long used query expansion techniques,
whether manual, interactive, or automatic, to help users reformulate better queries
and ultimately achieve higher precision and higher recall [Efthimiadis, 1996].
However, previous research has been inconclusive as to whether query expansion
improves retrieval effectiveness and efficiency over document-based retrieval
[Buckley 2004]. In this study, we examine pseudo-relevance-based query expansion
methods in greater detail.
Since pseudo-relevance-based query expansion methods augment the query
with terms generated from documents in an initially retrieved list, they have both
advantages and disadvantages. One of the main benefits is that they leverage search
engines built on various models without requiring user interaction. On the other
hand, the inherent problems of search engines also arise: in normal document-based
retrieval, the initially retrieved list presented to the user is long, ranked by relevance
to the given query [Liu and Croft, 2004]. Another problem is that a document
collection typically contains documents on different subjects or topics. For instance,
PubMed research articles may be classified into different subjects or topics such as
diseases. Consequently, the initially retrieved list may mix results on different
subjects, and the terms that pseudo-relevance-based techniques add to the initial
query may have entirely different meanings across distinct subjects in the collection.
A more difficult and subtle issue is that not all query-related aspects may be evident
in the initially retrieved list [Buckley 2004]. Overall, the expanded query may exhibit
query drift [Mitra et al. 1998], that is, a shift of underlying intent between the
original query terms and the expanded terms. When query drift is severe, even
state-of-the-art pseudo-relevance-based query expansion methods can yield retrieval
performance substantially inferior to using the original query with no expansion at
all [Mitra et al. 1998, Lavrenko and Croft 2001].
In this study, we propose to perform pseudo-relevance-based query
expansion based on the initially retrieved list and/or on clusters of similar
documents (cluster-based query expansion) created from the whole document
collection or from the initial retrieval list of the original query. Clusters drawn from
the whole collection according to their query similarity can arguably be considered
better candidates for reflecting the collection-level context of the query than
individual documents [Genomics, Lee et al. 2008, Na et al., 2005], because such
clusters may contain relevant documents that do not themselves exhibit high query
similarity. Clusters built from the initial retrieval results do not bring new relevant
documents into consideration; however, re-ranking the initial retrieval results using
these clusters can reduce non-relevance noise.
The rest of this chapter is organized as follows: Section 5.2 reviews related
work on language modeling and cluster-based retrieval. Section 5.3 presents our
proposed methods, PRB-DOC-QE, GCB-QE, GCB-DOC-QE, and LCB-RR-QE.
Section 5.4 provides the system architecture and general process of our cluster-based
methods. In Section 5.5, we evaluate the performance of the proposed methods.
Finally, we conclude in Section 5.6.
5.2 Related Work
Language modeling was initially used in speech recognition. It has since been
widely applied to information retrieval and other applications, owing to its solid
mathematical foundation and empirically proven effectiveness [Zhai and Lafferty,
2001].
The main idea of language modeling in information retrieval is that each
document in a collection is viewed as a sample and the query as a generation
process [Na et al., 2005]. In contrast to conventional retrieval models, retrieved
documents are ranked by the probability of producing the query from the
corresponding document language models. In other words, the relevance of a
document is measured by the query likelihood under its language model.
Specifically, two steps are involved: first, build a language model D for each
document in the collection; second, rank the documents by how likely the query Q
could have been generated from each of these document models, i.e., P(Q|D).
The probability P(Q|D) can be calculated in several ways. The most common
approach (referred to as the unigram language model) assumes that the query can
be treated as a set of independent terms, so the query probability P(Q|D) can be
represented as a product of the individual term probabilities [Liu and Croft, 2004]:
P(Q|D) = \prod_{i=1}^{m} P(q_i|D)    (1)
where qi is the ith term in the query, and P(qi|D) is specified by the document
language model
P(w|D) = λ P_ML(w|D) + (1 − λ) P_ML(w|Coll)
where PML(w|D) is the maximum likelihood estimate of word w in the document,
PML(w|Coll) is the maximum likelihood estimate of word w in the collection, and λ is
a general symbol for smoothing.
For different smoothing methods, λ takes different forms [Kurland et al. 2005].
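As an illustration, the unigram query likelihood with Jelinek-Mercer smoothing can be sketched as follows (computed in log space for numerical stability; the names and in-memory count representation are ours, the default λ = 0.9 follows the setting used later in this chapter, and query terms are assumed to occur somewhere in the collection so that the collection probability is nonzero):

```python
import math
from collections import Counter

def jm_log_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.9):
    """log P(Q|D) under a unigram model with Jelinek-Mercer smoothing:
    P(w|D) = lam * P_ML(w|D) + (1 - lam) * P_ML(w|Coll)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0   # P_ML(w|D)
        p_coll = coll_counts[q] / coll_len                    # P_ML(w|Coll)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```

Documents containing the query terms score higher, while the collection component keeps documents missing a term from receiving probability zero.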
Within the language modeling framework, a relevance model is in effect a
query expansion approach. Suppose the query words Q = q_1, q_2, …, q_k and the
words in relevant documents are sampled identically and independently from a
distribution; the probability of a word w under this model is estimated as follows:

P(w, q_1, …, q_k) = Σ_M P(M) P(w|M) \prod_{i=1}^{k} P(q_i|M)

where the sum runs over M, the set of documents that are pseudo-relevant to the
query Q [Lavrenko and Croft 2001].
Based on this estimate, the most likely expansion terms according to
P(w|M) are chosen to augment the original query.
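A minimal sketch of this expansion-term selection, simplified by weighting each pseudo-relevant document uniformly rather than by its query likelihood as in the full Lavrenko-Croft model (names are illustrative):

```python
from collections import Counter

def relevance_model_terms(feedback_docs, query_terms, n_terms=5):
    """Estimate P(w|R) as the average of P_ML(w|M) over the
    pseudo-relevant documents (uniform P(M)), then return the
    top-weighted terms that are not already in the query."""
    p_w = Counter()
    for doc in feedback_docs:
        counts = Counter(doc)
        length = len(doc)
        for w, c in counts.items():
            p_w[w] += (c / length) / len(feedback_docs)
    for q in query_terms:          # keep only candidate expansion terms
        p_w.pop(q, None)
    return [w for w, _ in p_w.most_common(n_terms)]
```

Terms that occur consistently across the feedback documents receive the highest estimated probability and become the expansion candidates.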
There are two approaches to cluster-based retrieval. The first, and most
common, is to retrieve one or more clusters from the whole document collection in
response to a query [Zhai and Lafferty, 2001, Liu and Croft, 2004]. In this approach,
the task is first to match the query against clusters of documents produced by a
clustering algorithm, rather than against individual documents, and then to rank the
clusters by their similarity to the query. Unlike usual cluster search methods, which
use clusters mainly as a tool to identify a subset of likely relevant documents and
return only that subset at retrieval time, here any document from a higher-ranked
cluster is considered more relevant than any document from a lower-ranked cluster
[Rosenfele 2000].
The second approach to cluster-based retrieval is to use clusters as a form of
document smoothing. The benefit of this approach is that, by grouping documents
into clusters, differences between the representations of individual documents are
in effect smoothed out [Lee et al. 2008].
In this study, we use the first approach with a small modification: we build
clusters both from the whole document collection and from the initially retrieved
list only. Note that if in (1) we replace D with a cluster C, so that P(Q|D) becomes
P(Q|C), we obtain a ranking of clusters.
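This cluster ranking via P(Q|C) can be sketched by treating each cluster's concatenated member documents as one pseudo-document scored with the same smoothed unigram model (an illustrative simplification; the data layout and names are assumptions, not the thesis implementation):

```python
import math
from collections import Counter

def rank_clusters(query_terms, clusters, coll_counts, coll_len, lam=0.9):
    """Score each cluster as a pseudo-document: concatenate its member
    documents and compute log P(Q|C) with Jelinek-Mercer smoothing,
    then return (cluster_id, score) pairs sorted best-first."""
    scored = []
    for cid, docs in clusters.items():
        terms = [t for d in docs for t in d]   # concatenate the cluster
        counts, length = Counter(terms), len(terms)
        score = 0.0
        for q in query_terms:
            p_c = counts[q] / length if length else 0.0
            p_coll = coll_counts[q] / coll_len
            score += math.log(lam * p_c + (1 - lam) * p_coll)
        scored.append((cid, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

The top-ranked clusters are then the ones whose language models feed the expansion step in the methods below.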
5.3 Methods
The following are the concrete details of the query expansion methods in our
approach. Let q, d, D, and D_init represent a query, a document, a document
collection, and an initially retrieved document list, respectively.
5.3.1 Pseudo-relevance-based query expansion (PRB-DOC-QE)
In the language modeling framework, D_init is used to construct language
models and generate expanded query terms based on the query likelihood
\prod_i P(q_i|D). We use PRB-DOC-QE to denote this retrieval model, which uses
only documents to perform query expansion. To compare the performance of the
various methods, we also treat this document-only query expansion as the baseline.
5.3.2 Global-cluster-based query expansion (GCB-QE)
We assume that the whole document collection has been clustered into a set
of clusters C. Note that this clustering of the whole collection is query-independent.
This method uses the language models of clusters, rather than documents, to
construct the expanded query. Specifically, we use the language models of the k
clusters in C that yield the highest query likelihood \prod_i P(q_i|C); GCB-QE
denotes the resulting retrieval method.
5.3.3 Global-cluster-and-document-based query expansion (GCB-DOC-QE)
Since the clusters in C created in Section 5.3.2 are query-independent, the k
selected clusters may bring in "noise" documents that are not query-related but
have high inter-document similarity. To combine the advantages of the methods in
Sections 5.3.1 and 5.3.2, we consider the GCB-DOC-QE method, which uses both
documents and clusters similar to the query to perform the expansion. Specifically,
we consider only those documents yielding the highest query likelihood.
5.3.4 Local-cluster-based query expansion (LCB-RR-QE)
In this method, the clusters are created from the initially retrieved list (and
are therefore query-dependent). We take a similar approach to cluster-based
retrieval: we build language models for the clusters and then rank the clusters by
the likelihood of generating the query. The resulting LCB-RR-QE method re-ranks
the documents of the initial retrieval list before performing query expansion. Note
that after re-ranking, we used only between half and 80 percent of the initial
retrieval list to perform query expansion.
Note that the methods above are term-based approaches, using a term-based
matrix as the basis for generating expansion terms to add to the initial query seed.
Since the concept-generation approach of Chapter 4 improves performance, we also
tried a concept-based matrix in place of the term-based matrix for comparison,
giving the following methods:
5.3.5 Concept-based pseudo-relevance query expansion (CB-PR-DOC-QE)
5.3.6 Concept-based global-cluster query expansion (CB-GC-QE)
5.3.7 Concept-based global-cluster-and-document query expansion (CB-GC-DOC-QE)
5.3.8 Concept-based local-cluster query expansion (CB-LC-RR-QE)
5.4 System Architecture
Our automatic query expansion experiment system (AQEE-3) architecture is
shown in Figure 5-1.
The outline of the approach shown in Figure 5-1 is as follows:

Step 1: Starting with the initial query, the system retrieves a sample of
documents from the document collection based on the highest query likelihood. No
other information is required.

Step 2: Depending on the query expansion method, the retrieved document
set (top N documents) alone, or together with clusters of the whole document
collection or clusters of the initially retrieved list, is taken as a better candidate set
than the whole collection for generating expanded query terms (pre-query
expansion).

Step 3: Apply query expansion to generate important terms to add to the
initial query.

Step 4: Construct the expanded queries from the results of Step 3 and query
again to retrieve result sets improved over those of the initial query.
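The four steps can be sketched as a two-pass loop; `search` and `expand` are placeholders standing in for whatever backend (e.g. Lemur) supplies retrieval and term selection:

```python
def expanded_retrieval(query, search, expand, top_n=50):
    """Two-pass retrieval following Steps 1-4: retrieve, derive
    expansion terms from the top-N feedback set, and re-retrieve.
    `query` is a list of terms; `search(terms)` returns a ranked
    document list; `expand(terms, feedback)` returns new terms."""
    initial_hits = search(query)            # Step 1: initial run
    feedback = initial_hits[:top_n]         # Step 2: candidate set
    new_terms = expand(query, feedback)     # Step 3: expansion terms
    return search(query + new_terms)        # Step 4: second-pass run
```

The same skeleton serves all eight methods above; only the `expand` step (documents only, global clusters, local clusters, term- or concept-based) changes.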
Figure 5 - 1: Cluster-Based Query Expansion (AQEE-3)
[Diagram: the initial query goes to the search engine; the ranked document list
(Doc 1, Doc 2, Doc 3, ...) and the clusters feed the query expansion component,
which produces the expanded query.]
Note that the Query Expansion component may use any of PRB-DOC-QE,
GCB-QE, GCB-DOC-QE, LCB-RR-QE, CB-PR-DOC-QE, CB-GC-QE, CB-GC-DOC-QE,
or CB-LC-RR-QE, as described in Section 5.3.
5.5 Evaluation
In this research, experiments are conducted on the TREC 2004 Genomics
track. The TREC 2004 Genomics topics for the ad hoc retrieval task were developed
from the information needs of actual field experts. Each topic consists of four parts
formatted in XML (ID, title, information need, and context), of which the title is an
abbreviated statement of the information need [Genomics]. Since the titles are only a
few words long and most closely resemble common user queries, the titles were
selected as initial queries. The Lemur toolkit was used in our experiments. The
TREC XML-format data were converted into TREC format, and records and topics
were preprocessed with tokenization, Porter stemming, and removal of
high-frequency stop words.
Since mean average precision (MAP) is a comprehensive performance
indicator that captures both precision and recall, MAP for the ad hoc retrieval task
topics is used in this research, following TREC convention. In addition, we also
assessed precision at the top 10 documents (P@10) in our evaluation.
Five types of runs were designed for our experiments and are reported in
Table 5-1: LM (the plain LM query-likelihood model with no QE), PRB-DOC-QE,
GCB-QE, GCB-DOC-QE, and LCB-RR-QE. Four further run types are reported in
Table 5-2: CB-PR-DOC-QE, CB-GC-QE, CB-GC-DOC-QE, and CB-LC-RR-QE. The
ratio change measures the MAP change relative to the baseline.
In the experiments, the relevance model (RM3), a state-of-the-art
pseudo-relevance-based query expansion method, is used. The Jelinek-Mercer
smoothing parameter of the LMs is set to 0.9. The other parameters of RM3 are
available upon request.
We also choose the Kullback-Leibler divergence between two language
models as the distance measure between the corresponding documents, together
with the model-based k-means clustering algorithm (Chapter 4). For
LCB-RR-QE/CB-LC-RR-QE, to construct RM3 we take the 1000 documents in the
initially retrieved list that yield the highest query likelihood and cluster them into
50 query-specific clusters. We then use the documents of the query-specific clusters
that yield the highest query likelihood to construct RM3.
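The KL-divergence distance between two document language models can be sketched as follows (an illustrative simplification: the thesis uses the smoothed LMs described above, while this version uses simple additive smoothing so the divergence is always defined):

```python
import math
from collections import Counter

def kl_divergence(doc_a, doc_b, vocab, eps=1e-6):
    """KL(P_a || P_b) over a shared vocabulary, with additive (epsilon)
    smoothing so every term has nonzero probability in both models."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    za = len(doc_a) + eps * len(vocab)   # normalizers including smoothing mass
    zb = len(doc_b) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        pa = (ca[w] + eps) / za
        pb = (cb[w] + eps) / zb
        kl += pa * math.log(pa / pb)
    return kl
```

KL divergence is zero for identical models and grows as the two word distributions diverge, which is what the model-based k-means step needs; note it is asymmetric, so the direction of comparison must be fixed by convention.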
Table 5 - 1: Experimental Results (Term-based)

Run          MAP      P@10    MAP Ratio Change
LM (NO QE)   0.3010   0.516   --
PRB-DOC-QE   0.3035   0.516   +0.83%
GCB-QE       0.3126   0.518   +3.85%
GCB-DOC-QE   0.3209   0.523   +6.61%
LCB-RR-QE    0.3168   0.517   +5.25%
Figure 5 - 2: Mean Average Precision of All Runs
Figure 5 - 3: P@10 of All Runs
Table 5-1 shows that the MAPs of the PRB-DOC-QE, GCB-QE, GCB-DOC-QE,
and LCB-RR-QE runs are all better than that of LM (NO QE): every proposed query
expansion method outperforms the non-expansion LM (NO QE) approach. It also
shows that the GCB-DOC-QE approach achieves 32.09% MAP and 52.3% P@10,
while PRB-DOC-QE achieves 30.35% MAP and 51.6% P@10 and the baseline LM
(NO QE) reaches only 30.10% MAP and 51.6% P@10. In addition, GCB-DOC-QE
outperforms GCB-QE in both MAP and P@10.
Table 5 - 2: Experimental Results (Concept-based)

Run            MAP      P@10    MAP Ratio Change
LM (NO QE)     0.3010   0.516   --
CB-PR-DOC-QE   0.3045   0.516   +1.16%
CB-GC-QE       0.3196   0.518   +6.18%
CB-GC-DOC-QE   0.3387   0.521   +12.52%
CB-LC-RR-QE    0.3460   0.520   +14.95%
Figure 5 - 4: Mean Average Precision of All Runs (Concept-based)
Figure 5 - 5: P@10 of All Runs (Concept-based)
Table 5-2 shows that the MAPs of the CB-PR-DOC-QE, CB-GC-QE,
CB-GC-DOC-QE, and CB-LC-RR-QE runs are all better than that of LM (NO QE):
every proposed concept-based query expansion method likewise outperforms the
non-expansion LM (NO QE) approach. The CB-GC-DOC-QE approach achieves
33.87% MAP and 52.1% P@10, while CB-PR-DOC-QE achieves 30.45% MAP and
51.6% P@10 and the baseline LM (NO QE) reaches only 30.10% MAP and 51.6%
P@10. Among all the methods of Chapters 3, 4, and 5, CB-LC-RR-QE performs best.
A possible reason is that CB-LC-RR-QE effectively reduces "noise" documents when
generating expansion terms.
5.6 Results
In this study, we have presented the system architecture and evaluation of
cluster-based query expansion approaches. The experimental results with the Lemur
search tool on the TREC 2004 Genomics track ad hoc retrieval task are promising.
We conclude that the information contributed by clusters built from the whole
document collection or from the initial retrieval list is quite helpful for query
expansion, especially when integrated with the concept generation introduced in
Chapter 4 and with information induced from the initially retrieved individual
documents, as in CB-GC-DOC-QE and CB-LC-RR-QE.
This study suggests several interesting avenues for future investigation. We
will continue to test the generalization of cluster-based query expansion in other
biomedical domains or even in different domains. Also, since these experiments
were conducted with the Lemur toolkit, we may perform similar experiments with
different search tools, for instance the Dragon toolkit, for comparative evaluation.
Furthermore, more investigation is needed into the process of cluster formation,
where various techniques, such as personalization, could be applied.
CHAPTER 6. CONCLUSIONS AND FUTURE WORK
6.1 Contributions of the Thesis
To improve the efficiency and effectiveness of biomedical literature retrieval,
this thesis began with a comprehensive evaluation of query expansion strategies.
We proposed a novel framework for cluster-based query expansion. We
concentrated on empirical comparison, experiments, and evaluation in investigating
query expansion methods across various issues, and used the findings as empirical
justification for cluster-based query expansion. Meanwhile, we investigated a
theoretical foundation for cluster-based query expansion and provided a reasonable
path for developing the theoretical argument.
We designed and implemented the AQEE-3 system to conduct query
expansion evaluation. With AQEE-3, we compared the performance of various
query expansion strategies while varying: 1) the system indexing strategy; 2) the
system retrieval model; and 3) the quality of the queries submitted to the system.
We believe researchers can easily explore query expansion strategies using the
AQEE-3 system.
Without considering clusters, the important insights obtained through our
study are summarized as follows:

Ontology-based term re-weighting provides the best results among the three
query expansion strategies.

Expanding the initial query with more precise ontology-based terms
enhances LSI-based local analysis substantially.

Including context in term re-weighting and LSI further improves precision.

Adding only one expansion term per original query term in LSI local analysis
yields higher precision than adding more than one term.

Using the Okapi BM25 model achieves better precision than the Vector Space
model on Lemur.
Regarding the research question of whether one can improve retrieval
performance in the biomedical domain by developing cluster-based query
expansion using a language modeling approach, our results confirm that such an
approach (the AQEE-3 system) significantly improves effectiveness on the TREC
2004 Genomics track test data.

Our results also show that combining concept-based and local (query-specific)
clustering for query expansion is more effective than global clustering. The
evaluation results show that if the retrieval system is able to locate good clusters,
cluster-based query expansion performs better than query expansion with poor
clusters or with no clustering at all, consistent with other reported research findings.
Term selection is the key to query expansion. Our experimental results
demonstrate that Latent Semantic Indexing generates high-quality additional query
terms. By contrast, Association Rule mining played a negative role in our
experiments.
6.2 Future Work
This dissertation set out to improve retrieval effectiveness by developing a
novel cluster-based query expansion method using language modeling. We
designed and implemented the AQEE-3 system to apply the approach and conduct
query expansion evaluation. Based on the experimental results, we believe query
expansion works more effectively in the context of clusters.
Several aspects of this approach need additional work. First, re-weighting
may be incorporated into cluster-based query expansion, since it produced good
results in our experiments. In our approach, we generate concepts from the clusters
using our algorithm; in the future, we may also generate keywords from the
clusters. As query expansion is evaluated in terms of precision and recall, a new
evaluation metric could be defined to measure query expansion in a cluster context.
In the future we will also continue to test our query expansion strategies on
more comprehensive datasets, in other biomedical domains or even in different
domains. Since these experiments were conducted with the Lemur toolkit, we may
perform similar experiments with different search tools, for instance the Dragon
toolkit, for comparative evaluation. Furthermore, more investigation is needed into
the process of cluster formation, where various techniques, such as personalization,
could be applied.
List of References

[Agrawal et al., 1993] Agrawal, R., et al., Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases, December 1993.
[Agrawal et al., 1995] Agrawal, R., et al., Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, U. Fayyad. et al., Editors, 1995, AAAI/MIT Press.
[Aphinyanaphongs et al., 2001]Yindalon Aphinyanaphongs, Ioannis Tsamardinos, Alexander Statnikov, Douglas Hardin, Constantin Aliferis. Text Categorization Models for High-Quality Article Retrieval in Internal Medicine. Journal of the American Medical Informatics Association, 12(2), 207-216.
[Attar and Fraenkel, 1997] Attar, R. and Fraenkel, A.S., Local Feedback in Full-Text Retrieval Systems, Journal of the ACM, Vol. 24, No. 3, 1997, pp. 397-417.
[Banerjee and Ghosh, 2002] Banerjee, A. and Ghosh, J. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. Proc. IEEE Int. Joint Conference on Neural Networks, pp. 1590-1595.
[Buckley, 2004] Buckley, C. Why current IR engines fail. In Proceedings of SIGIR, pages 584–585, 2004. Poster.
[Cesarano et al., 2003] Cesarano, C., d’Acierno, A., and Picariello, A., An Intelligent Search Agent System for Semantic Information Retrieval on the Internet. Proceedings of the 5th ACM international workshop on Web information and data management, New Orleans, Louisiana, USA, 2003, 111-117.
[Croft and Harper, 1979] Croft, W.B. and Harper, D.J. Using Probabilistic Models of Document Retrieval Without Relevance Information, Journal of Documentation, Vol. 35, 1979, pp. 285-295.
[Deerwester et al., 1990] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. J., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6) 1990, 391-407.
[Dobrynin et al., 2005] Vladimir Dobrynin, David Patterson, Mykola Galushka, Niall Rooney. SOPHIA: An Interactive Cluster-based Retrieval System for the OHSUMED Collection. IEEE Transactions on Information Technology in Biomedicine 9(2): 256-265 (2005)
[Efthimiadis, 1996] Efthimiadis, E.N. Query Expansion, In: Williams, Martha E., ed. Annual Review of Information Systems and Technology, v31, pp. 121-187, 1996.
[Emanuelsson et al., 2000] Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol., 300(4):1005-1016, 2000.

[Entrez] Entrez batch retrieval service. http://www.ncbi.nih.gov/entrez/batchentrez.cgi?db=protein

[Frank et al., 2001] Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 668-673.

[Gaucher et al., 2004] Gaucher SP, Taylor SW, Fahy E, Zhang B, Warnock DE, Ghosh SS, Gibson BW. Expanded coverage of the human heart mitochondrial proteome using multidimensional liquid chromatography coupled with tandem mass spectrometry. J Proteome Res., 3(3):495-505, 2004.

[Genomics] http://ir.ohsu.edu/genomics/2004protocol.html
[Guda et al., 2004] Guda C, Fahy E, Subramaniam S. MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics, 20(11):1785-1794, 2004.

[Guda and Subramaniam, 2005] Guda C, Subramaniam S. pTARGET [corrected]: a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics, 21(21):3963-3969, 2005.

[Guo et al., 2004] Guo, Y., Harkema, H., and Gaizauskas, R., Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms, 13th Text Retrieval Conference (TREC 2004).
[Hearst and Pedersen, 1996] Hearst, M.A., and Pedersen, J.O. (1996). Re-examining the Cluster Hypothesis: Scatter/Gather on retrieval results. In SIGIR 1996, pp. 76-84.
[Hersh and Hickam, 1995] Hersh, W., and Hickam, D., Information Retrieval in Medicine-The Sapphire Experience, Journal of the American Society for Information Science, Dec., 1995, 46(10), 743-747.
[Hersh et al., 2000] Hersh, W., Price, S., and Donohoe, L., Assessing thesaurus-based query expansion using the UMLS Metathesaurus, Proc AMIA Symposium, 2000.
[Holmes et al., 1994] Holmes G, Donkin A, Witten I.H. Weka: A Machine Learning Workbench. Intelligent Information Systems, Proceedings of the 1994 Second Australian and New Zealand Conference, 357-361, 1994.

[Hu and Xu, 2005] Hu, X. and Xu, X., Mining Novel Connections from Online Bio-medical Text Databases Using Semantic Query Expansion and Semantic-relationship Pruning, International Journal of Web and Grid Services, Volume 1, Issue 2, 2005, 222-239.
[Jardine and Van, 1971] Jardine, N. and Van Rijsbergen, C.J. (1971). The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval, 7:217-240
[Kohonen, 1995] Kohonen, T., Self-Organizing Maps. New York: Springer, 1995.
[Kules et al., 2006] Bill Kules, Jack Kustanowitz, Ben Shneiderman. Categorizing Web Search Results into Meaningful and Stable Categories Using Fast-Feature Techniques. JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 210-219, 2006.
[Kumar and Raghava, 2006] Kumar M, Verma R, Raghava GP. Prediction of mitochondrial proteins using support vector machine and hidden Markov model. J Biol Chem., 281(9):5357-5363, 2006.
[Kurland et al., 2005] Kurland, O., Lee, L., and Domshlak, C. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR, pages 19-26, 2005.
[Lavrenko and Croft, 2001] Lavrenko, V. and Croft, W. B. Relevance-based language models. In Proceedings of SIGIR, pages 120-127, 2001.
[Lee et al., 2008] Lee, K.-S., Croft, W. B., and Allan, J. A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of SIGIR, pages 235-242, 2008.
[Lemur] www.cs.cmu.edu/~lemur/1.9/tfidf.ps
[Leroy and Chen, 2001] Leroy, G., and Chen, H., Meeting Medical Terminology Needs: The Ontology-Enhanced Medical Concept Mapper, IEEE Transactions on Information Technology in Biomedicine, 2001, 5(4), 261-270.
[Lindsay and Gordon, 1999] Lindsay, R.K., and Gordon, M.D., Literature-based discovery by lexical statistics, Journal of the American Society for Information Science, 1999, 50(7), 574-587.
[Liu and Croft, 2004] Liu, X., Croft, W.B.: Cluster-based Retrieval Using Language Models. In proceedings of ACM SIGIR’04 conference, (2004)276-284.
[Liu et al., 2004] Liu, S., Liu, F., Yu, C., and Meng, W. (2004) An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases, Proceedings of the 27th annual international Conference on Research and development in Information Retrieval: 266-272.
[Lucene] http://LUCENE.apache.org/java/docs/api/index.html
[Mitra et al., 1998] Mitra, M., Singhal, A., and Buckley, C. Improving automatic query expansion. In Proceedings of SIGIR, pages 206–214, 1998.
[Na et al., 2005] Na, S.H., Kang, I.S., Roh, J.E., and Lee, J.H. An Empirical Study of Query Expansion and Cluster-based Retrieval in Language Modeling Approach. In Proceedings of the Second Asia Information Retrieval Symposium (AIRS 2005), Jeju Island, Korea, October 2005.
[Ogilvie and Callan, 2002] Ogilvie, P. and Callan, J. (2002), Experiments Using the Lemur Toolkit. In: Proceedings of the Tenth Text Retrieval Conference, (TREC-10), 103-108.
[Ponte and Croft, 1998] Ponte, J.; Croft, W.B., A Language Modeling Approach to Information Retrieval. In proceedings of ACM SIGIR’98, (1998)275-281.
[Pratt et al., 2003] Pratt, W. and Yetisgen-Yildiz, M., LitLinker: Capturing connections across the biomedical literature, K-CAP’03, Sanibel Island, FL. Oct. 2003, 105-112.
[PubMed] Retrieved January 10, 2008 from http://www.ncbi.nlm.nih.gov/PubMed/
[Qiu and Frei, 1993] Qiu, Y. and Frei, H.P. Concept Based Query Expansion. In Proc. of the 16th Int. ACM SIGIR Conf., pages 160-169, ACM Press, June 1993.
[Radev et al., 2000] Dragomir Radev, Hongyan Jing, Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. in Proceedings of ANLP/NAACL 2000 Workshop on Automatic Summarization, Seattle, WA, pp.21-29
[Richardson and Smeaton, 1995] Richardson R., and Smeaton A.F, Using Wordnet in a knowledge-based approach to information retrieval, In Proceedings of the BCS-IRSG Colloquium, Crewe, 1995.
[Robertson and Walker, 2000] S. E. Robertson and S. Walker (2000), “Okapi/Keenbow at TREC-8,” in E. Voorhees and D. K. Harman (Editors), the Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246
[Rocchio, 1999] Rocchio, J., Relevance Feedback in Information Retrieval, In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (Chap. 14), Englewood Cliffs, NJ: Prentice Hall, 1971.
[Rosenfeld, 2000] Rosenfeld, R. (2000). Two decades of statistical language modeling: where do we go from here? In Proceedings of the IEEE, 88(8), 2000.
[Savasere et al., 1995] Savasere, A., Omiecinski, E., Navathe, S. An Efficient Algorithm for Mining Association Rules in Large Databases. VLDB 1995: 432-444
[Sanderson, 1994] Sanderson, M. 1994, “Word sense disambiguation and information retrieval”, Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, p.142-151, July 03-06, 1994, Dublin, Ireland.
[Shi et al., 2005] http://www.cs.sfu.ca/~anoop/papers/pdf/trec2005_report.pdf
[Steinbach et al., 2000] Steinbach, M., Karypis, G., and Kumar, V. A Comparison of Document Clustering Techniques. Technical Report #00-034. Department of Computer Science and Engineering, University of Minnesota, 2000.
[Taylor et al., 2003] Taylor SW, Fahy E, Zhang B, Glenn GM, Warnock DE, Wiley S, Murphy AN, Gaucher SP, Capaldi RA, Gibson BW et al. Characterization of the human heart mitochondrial proteome. Nat Biotechnol., 21(3):281-286, 2003.
[Umls] http://www.nlm.nih.gov/research/umls/archive/2004AA/umlsdoc_2004aa.pdf
[Vogel et al., 2005] David Vogel, Steffen Bickel, Peter Haider, et al. Classifying Search Engine Queries Using the Web as Background Knowledge. SIGKDD Explorations, 7(2):117-122, December 2005.
[Wang and Zhai, 2006] Xuanhui Wang, Chengxiang Zhai. Learn from Web Search Logs to Organize Search Results. In Proceedings of SIGIR, 2007.
[Weeber et al., 2003] Weeber, M., Vos, R., Klein, H., de Jong-Van den Berg. L.T.W., Aronson, A. and Molema, G., Generating hypotheses by discovering implicit associations in the literature: A case report for new potential therapeutic uses for Thalidomide. Journal of the American Medical Informatics Association, 2003, 10(3):252-259.
[Wikipedia] http://en.wikipedia.org/wiki/Association_rule_learning
[Xu and Croft, 1996] Xu, J., and Croft, W., Query expansion using local and global document analysis, In Proceedings of ACM SIGIR, 1996.
[Xu and Croft, 2000] Xu, J. and Croft, W.B., Improving the Effectiveness of Information Retrieval with Local Context Analysis, ACM Transactions on Information Systems, 18(1), 79-112, 2000.
[Yoo and Hu, 2006] Illhoi Yoo, Xiaohua Hu. A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library Medline. JCDL ’06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, 220-229, 2006.
[Zeng et al., 2004] Hua-Jun Zeng; Qi-Cai He; Zheng Chen; Wei-Ying Ma. Learning to Cluster Web Search Results. In ACM. SIGIR, pages 210–217, New York, NY, USA, 2004.
[Zhai and Lafferty, 2001] Zhai, C., Lafferty, J., A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In proceeding of SIGIR’01, (2001)334-342.
[Zhai and Lafferty, 2001] Zhai, C., Lafferty, J., Model-based Feedback in the Language Modeling Approach to Information Retrieval. In Proceedings of CIKM'01, (2001) 403-410.
[Zhao and Karypis, 2001] Zhao, Y. and Karypis, G. Criterion functions for document clustering: experiments and analysis, Technical Report, Department of Computer Science, University of Minnesota, 2001.
[Zhou et al., 2006] Zhou X., Zhang X., Hu X., Using Concept-based Indexing to Improve Language Modeling Approach to Genomic IR, in the 28th European Conference on Information Retrieval (ECIR 2006), April 10-12, 2006, London, UK, pp. 444-455.
[Zhu et al., 2006] Zhu, W., Xu, X., Hu, X., Song, I., and Allen, R., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, IEEE International Conference on Granular Computing, May 10-12, 2006.
[Zhu et al., 1999] Zhu, A., Gauch, S., Lutz, G., Kral, N., and Pretschner, A., Ontology-Based Web Site Mapping for Information Exploration, Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), 188-194.
VITA
Xuheng Xu
EDUCATION
2011.05 Ph.D. Information Studies, Drexel University, Philadelphia, PA
2004.09 M.S. Information Systems, Drexel University, Philadelphia, PA
1998.07 M.S. Computer Science, New Jersey Institute of Technology, Newark, NJ
1986.06 B.E. Mechanical Engineering, Zhejiang University, Zhejiang, China

RESEARCH INTERESTS
Data mining, semantic-based query optimization and intelligent searching, and bioinformatics.

SELECTED PUBLICATIONS
1. Xuheng Xu, Xiaohua Hu: Cluster-based Query Expansion Using Language Modeling in the Biomedical Domain. In Proceedings of the 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW 2010), 18 Dec. 2010, pp. 185-188.
2. Xuheng Xu, Xiaodan Zhang, Xiaohua Hu: Using Two-Stage Concept-Based Singular Value Decomposition Technique as a Query Expansion Strategy. In Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW '07), Volume 1, 21-23 May 2007, pp. 295-300.
3. Xuheng Xu, Weizhong Zhu, Xiaohua Hu, Xiaodan Zhang: Using Ontology-based and Semantic-based Query Expansion on Biomedical Literature Search. In Proceedings of the 2006 IEEE International Conference on Systems, Man and Cybernetics (SMC '06), Volume 4, 8-11 Oct. 2006, pp. 3441-3446.
4. Hu X., Zhang X., Zhou X., Wu D., Xu X., A Comprehensive Comparison of 7 Methods for Mining Hidden Links from Biomedical Literature, in the International Multi-Symposium on Computer and Computational Sciences, June 20-24, 2006, Hangzhou, China.
5. Hu X, Zhang X, Li G, Yoo Y, Zhou X, Wu D, Xu X: Mining Hidden Connections among Biomedical Concepts from Disjoint Biomedical Literature Sets through Semantic-based Association Rule, International Journal of Intelligent Systems, 2006.
6. Zhong Huang, Xuheng Xu and Xiaohua Hu. Machine Learning Approach for Human Mitochondrial Protein Prediction. Book: Computational Intelligence in Bioinformatics, IEEE CS Press/Wiley, 2008.
7. Zhu, W., Xu, X., Hu, X., Song, I., and Allen, R., Using UMLS-based Re-Weighting Terms as a Query Expansion Strategy, IEEE International Conference on Granular Computing, May 10-12, 2006.
8. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Services, 1(2), 2005, pp. 222-239.