AN INTELLIGENT METASEARCH ENGINE FOR THE WORLD WIDE WEB

Andrew Agno

A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Computer Science, University of Toronto

Copyright © 2000 by Andrew Agno
Abstract

An Intelligent Metasearch Engine for the World Wide Web

Andrew Agno
Master of Science
Graduate Department of Computer Science
University of Toronto
2000

Machine learning and information retrieval techniques are applied to metasearch on the World Wide Web as a means of providing user-specific relevant documents in response to user queries. A metasearch agent works in conjunction with a user to provide daily sets of relevant documents. Users provide relevance feedback, which is incorporated into future results by a choice of machine learning algorithms.

Using a fixed ranking method, the algorithms incorporating relevance feedback perform much better than those that do not. Furthermore, using heterogeneous information sources on the World Wide Web is shown to be effective in both short and long term usage.
Acknowledgements

I would be much less proud of my work if it were not for the help of a number of people. I would like to first thank Grigoris Karakoulas and John Mylopoulos, my supervisors, for their guidance and support of my work. Without them, I would still be fishing for a perfect topic. Thank you also for making me look at questions instead of answers. I would also like to thank the group for their questions at my presentation of this thesis, which helped me focus on what questions other people would be interested in upon seeing my work.

Some of the work in the implementation of my project used other people's software. In particular, I would like to thank Steven Brandt for the package com.stevesoft.pat, Doug Lea for util.concurrent, and Brian Chambers for his previous work in word stemming and document vectorization.

Last, but certainly not least, I would like to thank my wife Jobie, for coming to Toronto and staying with me these last two years.
3 Architecture
  3.1 Overall Architecture
  3.2 Global Data Structures
    3.2.1 Zipf's Law
    3.2.2 Stopword List
    3.2.3 Stemming
  3.3 Scalability
    3.3.1 Caching
  3.4 Term Weighting
  3.5 Topics

4 Experimental Results and Evaluation
  4.1 Description of Data Gathering Procedure
  4.2 Evaluation Framework
  4.3 TREC Measures
    4.3.1 The F3 Measure
    4.3.2 The T9U Measure
    4.3.3 The T9P Measure
  4.4 Precision of Learning Algorithms
    4.4.1 Continuous Learning vs Train/Test
  4.5 Daily Recall
  4.6 Spikes
    4.6.1 Data Gathering Gaps
    4.6.2 Flushing
  4.7 Individual Search Engine Recall

5 Conclusions and Future Directions
  5.1 Conclusions and Discussion
  5.2 Future Directions
    5.2.1 Implicit Ranking
    5.2.2 Ontologies
    5.2.3 Collaboration
    5.2.4 Alternate Document or Feature Space
    5.2.5 Thresholds
    5.2.6 Alternative Methods of Learning
    5.2.7 Miscellaneous Improvements and Directions

Bibliography
List of Figures

3.1 Architecture Diagram
4.1 Daily Document Counts
4.2 Precision for Plain Algorithm
4.3 Precision for Random Algorithm
4.4 F3 Measure
4.5 F3 Measure, Top 5
4.6 F3 Measure, Top 10
4.7 T9U Measure
4.8 T9U Measure, Top 5
4.9 T9U Measure, Top 10
4.10 T9P Measure
4.11 T9P Measure
4.12 T9P Measure
4.13 Running Average of Precision, All Topics
4.14 Precision Running Average, Various Topics, Rocchio & Grigoris
4.15 Running Average Precision, Student, Grigoris
4.16 Running Average Precision, All Topics, Continuous Training vs Train/Test
4.17 Running Average Precision, Various Topics, Continuous Training vs Train/Test
4.18 Daily Recall for Rocchio Algorithm
4.19 Daily Recall for Grigoris Algorithm
4.20 Daily Precision, All Topics, Top 10
4.21 Daily Precision, Various Topics, Top 10
4.22 Daily Precision on Students Topic, Top 30
4.23 Search Engine Recall, All Topics, Running Average
4.24 Search Engine Recall, All Topics, Running Average
4.25 Running Average of Recall of Yahoo on Palm Pilot
4.26 Running Average of Recall of Lycos on Student
4.27 Running Average Recall, MS/DOJ
Chapter 1
Introduction
1.1 Problem and Motivation
Finding information on the World Wide Web (WWW) can be difficult without some form of assistance. As estimated by Lawrence and Giles [LG99] in 1999, there were 800 million pages, an increase of 250% from their previous estimate in their 1998 study [LG98c]. Cyveillance [MM00] claims there are 2.1 billion unique and publicly available pages on the Internet. Given the size and the growth of the WWW, one can see that we need tools to help us find information. One would typically turn to a search engine, like Yahoo! [Yah] or Google [Goo]. Unfortunately, the most frequently used search engines [Sta00, Sul00a] do not always do an adequate job, due to their lack of coverage [LG99, LG98c] and their lack of ability to find the relevant documents in those that are covered.
One potential remedy is to enable searches with more "intelligence". Given a search engine, it may be imbued with "intelligence" in at least two ways: through the use of specialized, larger, or simply different information sources; or through the implementation of various machine learning or information filtering algorithms. The purpose of this work is to create an intelligent search engine by combining both of these approaches into a single search engine, drawing from machine learning and information retrieval techniques. The remainder of this chapter will deal with an overview of a particular technique for searching through heterogeneous information sources, called metasearch. The chapter also includes an explanation of various information retrieval and machine learning techniques that have been used in other work, the contributions of this work, as well as a layout of the remainder of this thesis.
1.2 Metasearch on the World Wide Web

One commonly used method for adding intelligence, even among search engines not normally thought of as such [Sul00c], is to use metasearch. For the purposes of this research, metasearch will refer to metasearch on the WWW.
1.2.1 A simple anatomy of a search engine

In the following discussion, it helps to have an idea of how search engines typically work. The portion of a search engine that the user sees is only one aspect of the entire system that makes up an engine. All the results given by a search engine come from some form of a database underlying the engine, which may list documents and information about those documents. This database is populated by certain software agents that visit WWW servers and index the documents that those servers contain. In the Harvest system [BDW95], these agents are called Gatherers, whereas in Google [BP98b], they are called crawlers. The purpose of these crawlers is to allow the collection of information about documents, also known as the indexing of documents, to proceed independently of any search that may be using the information. These crawlers must revisit documents periodically, to reindex them. This is because documents on the WWW have a tendency to change or even disappear over time. Reindexing must happen at the same time that new documents are being indexed. When a user gives a query to a search engine, the engine uses its database of rankings and documents to generate a new document that consists of a list of pointers, known as Universal Resource Locators (URLs), to documents that have been ranked as most likely to fulfil the user's information need. The exact ranking depends on the search engine, and the ranking scheme is typically proprietary. However, search engines developed in academia do publish their ranking methods. For instance, Google uses something called a PageRank, which is defined by Brin and Page [BP98b]. This description of a search engine may also be applied to metasearch engines, with some changes.
1.2.2 Metasearch and how it helps

The idea behind metasearch is to use multiple "helper" search engines to do the search, then to combine the results from these engines. Engines that use metasearch include Metacrawler, SavvySearch, MSN Search and Altavista, among others [Sul00c]. These helper engines form the metasearch engine's database. This approach differs from the individual search engines in that a metasearch engine does not need to crawl the WWW, although it may do so. A metasearch engine for the WWW may just verify that the documents returned by the search engines still exist. The problem of combining the results can be solved simply or by using a little cleverness. The simplistic solution has been used in search engines from Metacrawler to Dogpile. Metacrawler has changed in recent years and no longer appears to employ this simple method, but Dogpile continues to use it. The solution is to put results from separate engines under separate headings. This solution can still be seen in some engines that employ both a manually-maintained directory (in the style of Yahoo!) and a more traditional crawler-generated index of WWW documents (as in Inktomi). It is also seen in some of the less well known metasearch engines, such as SherlockHound [She] and AllOne [All]. This approach has a number of problems:
• The individual search engines are treated equally, despite the fact that their results might not be equally relevant to the search at hand. This implies two things.

  - The only indication of relative importance of the results is within each search engine, not across search engines. This means that the user cannot judge which documents will be most relevant of all the results returned.

  - The number of results is typically restricted, with each search engine given a quota of documents that it can fill. This quota is the same for each search engine. If search engine A could return m further documents that are more relevant than the results from search engine B, then the m documents from search engine A that are better than the ones from B will not be shown to the user. Instead, all the results from search engine B will be shown.

• If a user searches once, then attempts the same search again, the results do not change, even though the user may have given some implicit information about documents that might be relevant. The user may have selected some documents to be viewed, and by doing so, may have given some information about potential relevance.
The other, more sophisticated, solution is to rerank the documents that the helper search engines returned, either by downloading the documents and analyzing them, or by using some existing preprocessed version of the document, such as the summary that occasionally accompanies the results from search engines. This approach is used in SavvySearch [HD97, DH96, Sav], in search engines from NEC Research Institute [LG98a, LG98b, GLG+99], and presumably in the new version of Metacrawler. This alleviates the problem of treating the search engines equally, but the other problem remains.

As can be seen from the listing of alliances at Search Engine Watch [Sul00c], metasearch, even in its simple form, is a popular method of searching. The reasons for this include the user's desire to try multiple search engines, the user's lack of knowledge about existing search engines [GLG+99], and the user's unwillingness to visit multiple search engines. One other reason to use metasearch is that while most WWW users will try another search engine when they cannot find what they are looking for, 20% of them give up [Sul00d, Sul00b]. Using metasearch allows the user to view results from many search engines simultaneously, which may allow at least one search engine to give back relevant results before the user gives up.
Fortunately for those engines that do employ some form of metasearch, there is evidence for its efficacy. The reason that metasearch can be effective is that there are problems with plain search:

1. Individual search engines have poor indexing and reindexing times [LG99]. Reindexing time refers to the time between successive visits to a document by a search engine's crawler. This is a problem, because if a search engine's crawler does not reindex a frequently changing document, it may not be able to order URLs according to their actual data. Poor reindexing times also lead to the prevalence of dead links [LG99] in the results given by a search engine. These are pointers to documents that no longer exist, are no longer published, or are no longer accessible.

2. Ranking schemes may be poor or inconsistent [Kee92, LG98a]. In a talk by Christian Collberg [Col00], given at the University of Toronto, he claimed that one of the methods that Altavista used to rank documents was by age of the document. This was meant to reduce the propensity of artificially relevant documents (also known as "spam"). This leads to old documents being listed first, which is not usually the best heuristic for ranking relevant documents. The general problem of poor ranking schemes is exacerbated by users' tendencies to use short, one or two word queries [Sul00b, LG98a]. A vague query combined with a poor ranking scheme by a search engine can lead to frustration. In fact, the ranking scheme need not be poor overall, just poor on the topic that the user has in mind.

3. Individual search engines may have poor coverage due to specialization [DH96, HD97, BBC98, Leb97] or may just not have the resources to have wide coverage of the entire WWW. This means that no one search engine has a full index of the WWW. Also, since the WWW is growing exponentially [LG98], so do the computing and storage needs of search engines that maintain their own index.

4. Related to the previous point, the document crawlers that are listed for Inktomi-style search engines may not be able to search the entire WWW because of its connectedness properties: as many as half the documents are not reachable from the "core" web [But00], which is presumably where most crawlers spend their time.
Metasearch can alleviate some of these problems:

1. Using multiple search engines should give overlaps in terms of the index times, translating into a shorter mean time between reindexing [LG98].

2. A metasearch engine can use its own ranking scheme, independent of the rankings of the individual search engines [DH96, HD97, LG98a]. It can also use ranking schemes that analyze the individual pages, just as individual search engine crawlers do. The metasearch engine can do this faster, since it has a smaller set of documents to analyze. Combined with document caching [BCF+98, ZFJ97], this can result in a fast metasearch using a custom, document based ranking that analyzes documents to a greater degree than the individual search engines used. Note that a problem may still exist, since a metasearch engine must do the analysis in real time, instead of offline, as the crawlers do. Lawrence and Giles [LG98a] do show that it is possible to do this analysis in real time, without creating long waiting times for the user.

3. Coverage of a metasearch engine is greater because it uses multiple sources of information [LG98c, LG99], and a metasearch engine can use specialty search engines without having them adversely affect the results due to few results, or due to a slow network connection to the specialty engine. If the number of documents on the WWW continues to grow exponentially, then coverage may eventually be a problem even when using numerous search engines.

4. The last item cannot be fixed by using metasearch engines. Instead, individual search engines need to do random searching among IP addresses and ports. While the metasearch engine could do this itself, it removes one of its advantages from the perspective of resource usage.
Recent studies by Lawrence and Giles [LG98, LG98c] give more credence to the potential of metasearch. The papers show that individual search engine coverage is poor, covering no more than 16% of the WWW and as little as 2.2%. However, the overlap between search engines is also low in both the 1998 and 1999 studies, ranging from 2.2% to 38.3% in 1999. This bodes well for metasearch, despite the fact that the combined coverage appears to be decreasing over time. An estimate of the combined coverage in 1999, using 11 search engines, was only approximately 40% of the estimated size of the WWW being covered. In 1998, with only 6 search engines, the estimated coverage was 59%. Given this, metasearch ought to be effective, as long as the individual engines have access to relevant documents and can return them to the metasearch engine.
Metasearch is not without problems, however. For instance, metasearch can create increased network traffic, both on the global Internet and on the local network, especially for engines that perform their own ranking by downloading the actual documents that are found via the individual search engines [DH96, HD97]. The structure of the WWW [But00] is also a problem, because not all documents can be indexed by the engines used by a metasearch engine. The first problem can be alleviated by using a network bandwidth sensitive ranking scheme [DH96, HD97, MLY+99]. The second problem needs to be solved by having the WWW crawlers probe random IP addresses at port 80 (HTTP) and possibly other common or uncommon ports.
1.3 Intelligent Agents

Besides metasearch, a search engine can also turn to other algorithms for inspiration. The other source of intelligence can be obtained by looking at intelligent agents. These are software programs that undertake tasks on behalf of a user or users. Agents have been used to aid in sorting email [Mla99, Boo98], and for searching and "surfing" the WWW [Mla99, CS98, PB97, PB99, Jan97, BLG98]. The latter, the Web agents, can be categorized into ones that perform automatically, without user assistance, and those that are designed to work in conjunction with the user in order to enhance the user. Those that work without feedback from the user, such as CiteSeer [BLG98] and WebMate [CS98], among others [Jan97], can be used as a search agent, where the program may learn user habits and preferences merely through observation, and thus alter search queries, or offer documents. Alternatively, as in CiteSeer, the program may still rely on user generated queries, but the user is assumed to want a specific type of information and the agent uses a specially built database to answer queries.
The other agents, those that work to enhance a user, or otherwise require user intervention or feedback, are of particular interest to this thesis. There have been a number of these systems made for viewing websites [PB99, PB97, SH99, Pol99, BSY95, BS95]. They typically require that the user give some feedback about the WWW documents that have been recommended. The agents adjust their future responses to take this new information into account. This data is stored in a profile of the user. This profile typically consists of a set of features that have some weight associated with them. Features include some form of a list of keywords [Mla99], although other possibilities exist. For example, agents may form part of a collaborative system, where a profile may simply consist of a list of documents that the user has ranked and those rankings. As Glover et al [GLG+99] suggest, other features of a document may also be used to construct a profile. This could include preferences for length of documents, reading level, languages, age, images and URL:text ratio. The exact features used in the profile can be determined experimentally. Whatever the exact nature of the profile, it is used to give improved results to the user. However, none of the aforementioned agents has applied this learning from user feedback to the problem of metasearch. The individual systems mentioned above vary in their learning algorithms, term weightings, feature space and intended use. For instance, Pazzani's Syskill and Webert [PB97] uses a Bayesian classifier to predict the probability of a user liking a document, whereas Balabanovic [BSY95, BS95] and Somlo [SH99] use similarity measures with respect to a profile. One interesting variation on searching the WWW is Pazzani's web site agent [PB99]. There, the agent learns only about people visiting a site, as well as the patterns of site navigation of all users and the patterns of the hypertext linkage structure to some degree. The linkage structure is merely the way in which documents refer to each other through hyperlinks. The narrow scope of that agent's responsibilities is not the purpose of this work. However, it does provide an interesting contrast in terms of usage. The system of [Pol99] demonstrates a collaborative recommender system in which user profiles are simply the relevance rankings of various documents. These profiles are compared between users, and for any given user, the system recommends documents that a similar user found relevant. None of the agents give the user access to a large number of information sources from which to gather relevant documents, using only a single search engine, or a single WWW site in each case.
1.3.1 Relevance Feedback

Relevance feedback is a term used in information retrieval to describe a special type of feedback loop. This feedback loop requires that the user of a system make judgements about the relevance of documents returned by the system. The system, in this case, is one that is designed to return documents to a user based on queries that the user gives to the system. The feedback obtained in this manner is used to determine the profile used in the next iteration of document retrieval. The exact nature of this feedback varies with different systems [Roc71, BBC98, CS98, BSY95, BS95, DH96, SB90, BSA94, Joa97] and may even be implicit [BBC98, KMMC97, DH96]. Explicit feedback may be a simple boolean, representing relevant and irrelevant documents, or may be a range of values beyond the boolean 0 and 1. Ranges of feedback values may present the user with a psychological or perhaps a user interface difficulty. Specifically, if some document, D, appears multiple times, it may be given rankings that actually depend on the relevance value of the documents that were listed alongside D at the time of ranking [BSY95]. If this is the case, the range of values may not give any more information than the boolean ranking system, and may in fact hamper learning due to the inconsistency of user rankings.

With implicit feedback, feedback is assigned by the system based on whether a document was viewed by the user and possibly based on how long a user viewed a document. The advantage of this implicit feedback is that, from the point of view of the user, the interface is transparent. The system can learn a user's preferences without the user having to intervene. The problem with using implicit feedback is that the feedback is somewhat more difficult to interpret, as a visited document is not necessarily a relevant one. This causes potential inconsistencies in the rankings, which is why it is not implemented in the system described here.
One added complexity that may be introduced with relevance feedback is incremental feedback. In most of the previous work, feedback processing was done in a batch fashion. That is, the entire corpus, and its relevance rankings, were known before testing began, and the entire corpus (or some training subset) could be given to the feedback algorithm: the algorithm could have complete knowledge and could process the feedback in one pass. This is the context in which the original Rocchio algorithm for relevance feedback was created [Sal71]. Incremental feedback, on the other hand, requires only that a portion of the corpus be judged before applying changes to the profile. Additional portions may be judged and the profile changed as time progresses, with little or no knowledge of previous rankings. Fortunately, Rocchio and Ide [Ide71] style algorithms for relevance feedback (herein termed "traditional relevance feedback algorithms") may be applied in an incremental fashion [All96, Cal98, IJA92]. In fact, Allan found that "keeping a small number of terms can actually improve performance over full feedback ... almost any number of terms works well." Full feedback refers to feedback in which, at any time t, the system is given all ranked documents seen until time t. Harman [Har92] suggests the use of 20 terms in a full feedback environment. Allan shows that traditional relevance feedback style algorithms work well as long as some context, possibly a dynamically changing context, is maintained from previous iterations. This provides a method to perform online learning, instead of batch learning. This result is necessary in order to validate the use of relevance feedback as applied to metasearch.
Query Expansion

Typically, relevance feedback results in an altered query. This query is identical to the profile mentioned above (earlier in section 1.3). The purpose of the expanded query is to give a more precise query, based on previous user rankings, for the next iteration. This query would also be used to order the pages in terms of relevance, for the user to view. This approach works well in certain experiments. However, in the context of metasearch, this approach does not work at all. For instance, some search engines limit queries to only 10 terms, and some seem to ignore long queries. Those that do accept long queries do not return many results. Expanded queries typically use stemmed versions of words. These are words that have had their suffixes stripped from them. Any stemming in the query assumes that the search engines will understand the stemmed term or terms, which is not necessarily true. In fact, search engines like Google do not do stemming, and others allow stemming only as an advanced option; even with stemming as an option, however, search engines cannot be relied upon to understand stemmed terms given as part of the query. Even if the stemmed terms were expanded into the words that were originally seen to create the stemmed terms, the queries would be increased in length, contributing to the previous problem. This is different from most of the literature on searching a corpus of documents, but its omission here is supported by the profile generating intelligent agents (earlier in section 1.3).
1.3.2 Browsing and Searching

A user surfs the WWW. There are generally two different methods of using a search engine to do this surfing: browsing and searching. The difference between the two is identified by the user's interest. A user who wants detailed information on a specific topic is searching, whereas a user who wants an overview of a topic, or even multiple topics, is browsing. One may also distinguish the two by the user's conceptual model of a topic. If the user is looking for information on, for example, Palm Pilots¹, then this is probably browsing. A user interested in a specific subtopic, such as free productivity software for the Palm Pilot, is probably doing a search.

According to Search Engine Watch [Sul00d], 70% of people know specifically what they are looking for when they use a search engine. However, this does not mean that they can articulate this knowledge in the form of an appropriate query. Users typically have a specific meaning in mind when using a term in a query. This meaning is not necessarily the same one that search engines assign the word. For instance, the term "palm" might be found in a document on handheld computer organizers as well as a document on vacations. Furthermore, Butler [But00] says that most users only have a general query in mind. These two views, of a user who knows specifically what they are looking for and one who does not, are not easily reconcilable. The evidence suggests, however, that most users only enter a general query. This is supported by the fact that 30% of searches are done using only a single word in the query [Sul00b], according to Search Engine Watch. In another study done by Lawrence and Giles [LG98a], they found

¹All product names and company names mentioned in this document are the trademarks of their respective holders.

that almost half of user queries contained one term, and almost 80% of queries were one or two terms. While the query entered into search engines may be general, this author is inclined to believe that the user has a fairly clear idea of what they are looking for. They may not always be able to express this in the form of a query, but they can certainly identify documents that are and are not relevant. Any other view of a user means that they are entering random queries to the search engine.
The work in this thesis assumes that the user does have a fairly good idea what she is looking for, and is able to identify these documents. This work also assumes that the user will only enter a general query. From there, the system will learn a more specific query that corresponds to the user's internalized and unasked query. This corresponds to the specific search aspect of surfing the WWW. It should be possible to use the system described here to perform the browsing aspect of surfing, but this was not one of the items investigated.
Our Approach

The system presented in this thesis is meant to address some of the shortcomings of other WWW search engines, and in particular, other metasearch engines for the WWW.

• We will use the more sophisticated of the metasearch techniques by using a unified ranking scheme across all search engines.

• We will allow the search engine to adapt to a user's preferences over time.

• The feedback from users will be evaluated on a daily basis, in an incremental fashion. Thus, there will be no strict tuning period to generate a correct profile, as in Somlo's engine [SH99].

• The system may be used in either a server based mode, or may be used as a single client on the user's machine.

• The search engine will generate a specific profile from a user formulated general query.

• User profiles will be generated which represent a prototypical document. The user profile will be used to determine similarity with documents that are retrieved by the search engine.
Usage Scenarios

The scenario in which the search engine described in this thesis was designed to be used is the following. A user has a query that they wish to make over a period of days or even weeks. The user may be looking for information on a topic she knows little about, or she may be looking for a certain type of information, but does not know how to specify the information as a search query. As an example of the former, suppose the user is looking for student life at the University of Toronto. She is not sure what this might entail, but is certain of what it does not, and hopes that the search engine will help to determine the scope of the topic. The other type of search is where a user knows what to look for, but enters only a vague query. For example, a user might be interested in looking for free productivity software for the Palm Pilot™, and might do so for three to five days; this user might only enter "palm pilot" as a query. As another example, a user might want to retrieve information about the most recent Microsoft antitrust trial in the USA, from its beginning to the current events. In this case, the user might enter a query such as "microsoft doj". In both these queries, the topic is general enough to include documents pertaining to other, nondesirable events or topics. The metasearch engine should be able to detect user preferences and dynamically alter the manner in which it orders pages for user viewing.
1.3.4 Questions

There are several questions that testing of the engine will answer:

• Can learning a user profile increase the relevance of the results as returned by the metasearch engine?

• Are certain search engines better to use than others on certain topics?

• Is metasearch more effective than ordinary search, in the context of relevance feedback?
1.3.5 Contribution

The work presented here will provide evidence for the efficacy of an adaptive learner for the problem of metasearch on the WWW. It will provide evidence that search engines can be used for more than merely one time searches, but can also be used for the purposes of tracking a topic over time. We will confirm that metasearch is actually effective, and show that even combining search engines that use similar techniques is worthwhile. Furthermore, some reasons that metasearch is effective in the temporal context of this work will be shown.
1.4 The Rest of the Story

The remainder of this thesis will discuss the user's interaction with the system, delving into representations of the user's ideas and the methods by which one can alter these representations. Chapter 3 will examine the architecture and implementation details of the engine, including remarks about the data structures used and the scalability of the system as a whole. Certain shortcomings of the system will also be shown there, as well as some possible ways to overcome them. The results of a variety of experiments will be shown in Chapter 4, in an attempt to answer the questions posed in section 1.3.4. Chapter 4 also presents some analysis of the data. Chapter 5 summarizes the findings, answering the questions given above, addressing possible exceptions to the work done, and comments on extending this work.
Chapter 2

User Interaction and Query Processing

This chapter describes the user's interaction with the system, and the results of those interactions. It also describes representations of the user that the system maintains, and the methods through which these representations are maintained.
2.1 Profile and Document Representation

As in the other systems mentioned earlier in section 1.3, the system uses a profile to keep track of user preferences. In this thesis, a profile can be viewed as a prototype document, or perhaps a union of prototype documents. A document is represented by a set of weighted words and phrases (collectively called terms) coming from a document space [SM83]. This document space is a multidimensional space, having one axis per word or phrase that is accepted by the system. This space may either be known beforehand, or may grow over time, as new terms are encountered. Terms that are in the document space at time $t$ are called the dictionary at time $t$; this dictionary is static if the document space is already known. Given a document space $D$ consisting of terms, $D = (t_1, t_2, \ldots, t_{|D|})$, a document is a vector $V$ in this space with non-negative weights $W = (w_1, w_2, \ldots, w_{|D|})$, $w_i \geq 0$, $V = W \cdot D$. It is useful to have the sparse representation of the vector. That is, take $W' = (w'_1, w'_2, \ldots, w'_n)$, $w'_i > 0$, where $n$ is the number of positive weights, $D' = (t'_1, t'_2, \ldots, t'_n)$ the terms with positive weights, and $V' = W' \cdot D'$. For a document $d$, $V_d$ indicates the presence or absence of terms in the document, and may indicate the significance of those terms in the document. The manner in which weights are assigned can vary between implementations of any system employing such a representation [Mla99, HC93], and may range from a simple boolean only, to a real number representing the weight of the term. The exact method used in this thesis is described in Chapter 3.
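To make this representation concrete, the following is a minimal sketch of a sparse term vector in Java, the language the system was written in. The class and method names are illustrative assumptions rather than the actual implementation; only positive weights are stored, matching the sparse form $V' = W' \cdot D'$ above.

```java
import java.util.HashMap;
import java.util.Map;

/** A sparse document vector: only terms with positive weight are stored. */
public class TermVector {
    private final Map<String, Double> weights = new HashMap<>();

    /** Add weight to a term, creating the dimension on first use. */
    public void add(String term, double weight) {
        if (weight <= 0) return;                      // only positive weights are kept
        weights.merge(term, weight, Double::sum);
    }

    public double weightOf(String term) {
        return weights.getOrDefault(term, 0.0);
    }

    public Iterable<Map.Entry<String, Double>> entries() {
        return weights.entrySet();
    }

    /** Rescale so the weights sum to one, as done before comparing to the profile. */
    public void normalizeToUnitSum() {
        double sum = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        if (sum > 0) weights.replaceAll((term, w) -> w / sum);
    }
}
```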
2.2 The User Query

The query sent to the individual search engines is always the original query that the user gives. This is partly due to the fact that query expansion does not work in the metasearch context, as explained in section 1.3.1. Also, given the assumption of a general query, this allows the individual search engines to retrieve many documents. Having many documents for the search engines to retrieve is important because of the repetitive nature of the query: the user will input the same query over a number of days, but only wants to see new or changed documents. A large pool of documents ensures that there will be many new documents, even if only a few of them change over time. Also, the purpose of this metasearch engine is not to learn the optimal ranking for a finite set of documents for a specific user. Rather, it should continually adjust to the user's preferences, and learn the best ranking for future documents that have not been seen. This can only be tested if the metasearch engine can examine only a portion of a large pool of documents at one time.
2.2.1 User Query and Ontologies

The user's query is expected to fit into some ontology. This is similar to Glover et al's "information needs" [GLG+99], and DualNAVI's "categories" [MHT99]. The user manually selects an appropriate topic in which to place the query. These topics come from some existing ontology. Currently, this ontology exists only for the queries used in this work, and is the basis for the categorization of queries. The use of this ontology also allows the system to generalize or specialize to a new profile, given the old ones. Some additional thought would be required to determine exactly how this would be done, as further detailed in Chapter 5, which describes possible future work.
2.3 Ranking

Two different entities rank documents: the user, and the metasearch engine. The metasearch engine ranks documents it receives from the individual search engines. It does this so that it may present the user with a list of ranked document URLs. The ranking is done by making comparisons of documents to the current profile. Since the profile is merely a vector in the document space, the profile, $P$, and the document, $D$, may be compared by measuring the cosine of the angle between the two vectors:

$$\mathrm{sim}(P, D) = \frac{P \cdot D}{\|P\| \, \|D\|}$$

If the document vectors are normalized, relative document rankings for a specific profile are preserved with:

$$\mathrm{sim}(P, D) = P \cdot D$$

Other transformations may be made prior to the normalization. For instance, in the engine described here, all vector weights are first made to sum to one by dividing each vector by the sum of the weights. Thus, the actual similarity measure used is:

$$\mathrm{sim}(P, D) = \sum_{t} P'_t \, D'_t$$

where $P'_t$ and $D'_t$ are the weights for the term $t$ in $P$ and $D$ respectively.
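A small sketch of the resulting similarity computation follows, assuming the illustrative TermVector class from section 2.1 and that both vectors have already been made to sum to one; it simply accumulates the products of the weights of the terms the two vectors share.

```java
import java.util.Map;

public final class Similarity {
    /** Dot product over the shared terms of two unit-sum sparse vectors. */
    public static double of(TermVector profile, TermVector document) {
        double sim = 0.0;
        for (Map.Entry<String, Double> e : profile.entries()) {
            double dw = document.weightOf(e.getKey());
            if (dw > 0) sim += e.getValue() * dw;  // only shared terms contribute
        }
        return sim;
    }
}
```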
The highest ranked documents are then shown to the user as a list of URLs; the exact number returned depends on the algorithm used (see section 2.4). Once this is done, the user may give rankings. The user looks at the documents given by the metasearch engine and ranks them as either relevant or nonrelevant. The user's ranking is based on a set of criteria which depends on the topic. One important global rule was that documents that had been seen before must have changed in a relevant manner in order for the document to be marked relevant again. This criterion is important, because otherwise, a fixed set of documents would always be returned to the user. These documents would be those that were accessed from a database, or otherwise dynamically generated, documents that had their date changed on a daily basis but no other changes, and documents on servers that gave incorrect dates for the document's datestamp. This is because the page fetchers, as described in Chapter 3, fetch those documents whose datestamp or checksum has changed, if the document has been seen before. The exact number of documents ranked varies according to the learner that is in use and the threshold the learner uses for determining when to stop giving documents to the user. This number is a maximum of 30 documents.
2.4 Profile Adjustment

2.4.1 Relevance Ranking

The way in which profiles are altered depends on the system in use, but the methods are generally variations on Rocchio's algorithm [Roc71, SB90]: the profile at time $t+1$ can be obtained through a function $f$, $Q_{t+1} = f(Q_t, R_t, S_t)$, where $R_t$ and $S_t$ are the sets of relevant and nonrelevant documents, respectively. The actual algorithm used in Rocchio's original formulation [Roc71] was:

$$Q_{t+1} = Q_t + \frac{1}{|R_t|} \sum_{d \in R_t} d \; - \; \frac{1}{|S_t|} \sum_{d \in S_t} d$$

Generally, variations of Rocchio's algorithm are variations on this formula, using different weights for $Q_t$ and the two sums. For instance, Allan [All96] uses fixed constants instead of the inverse cardinalities of the sets as the weights for the two sums. Aalbersberg [IJA92] cites Salton and Buckley [SB90] and Ide [Ide71] for the general form:

$$Q_{t+1} = \alpha Q_t + \beta \sum_{d \in R_t} d \; - \; \gamma \sum_{d \in S_t} d$$

where $\alpha$, $\beta$, $\gamma$ are variables.
For each separate query that the user enters, a profile is created. Initially, the terms in the profile are just the terms in the user query. The weights of the terms in the profile always sum to one. The profile is limited to a maximum of 20 single word terms and 4 two word terms (i.e., phrases). Documents are represented in a similar fashion, but are limited to a maximum of 40 single word terms and 10 phrases. Depending on the learner, the profile gets updated in different ways. Two different learners were used:

Rocchio variant Here, the profile at time $t+1$, $P_{t+1}$, may be obtained from $P_t$ and the sets of relevant and nonrelevant documents as follows:

$$P_{t+1} = P_t + |S^t| \sum_{i} R^t_i \; - \; |R^t| \sum_{i} S^t_i$$

where $R^t$, $S^t$ are, respectively, the sets of relevant and nonrelevant documents as judged by the user at time $t$, $R^t_i$ is the VSM for relevant document number $i$, and similarly, $S^t_i$ is the VSM for nonrelevant document number $i$. This is similar to Rocchio's original variant [Roc71], modified in the weight assigned to the previous query and in the fact that the algorithm is applied in an incremental fashion. When using this algorithm, up to 30 documents are returned to the user for relevancy ranking. Documents whose system ranking is less than or equal to 0 are not shown to the user.

The algorithm is further modified as suggested in Rocchio's original formulation [Roc71], by accepting a term for inclusion in $P_{t+1}$ if and only if its weight is greater than 0, and the term was either in $P_t$, or was in more relevant vectors than nonrelevant vectors. Both these measures have the same effect: they eliminate terms that have little discriminatory power. The former technique also eliminates those terms that are only able to identify irrelevant documents.
Ide variant This method is a variation due to Karakoulas [KF98] and is a generalization of other variations [All96, BSA94, Ide71]. Here, the new profile $P_{t+1}$ is obtained as follows:

$$P_{t+1} = \alpha P_t + \beta \sum_{i} R^t_i \; - \; \gamma \sum_{i} S^t_i$$

where $R^t$, $S^t$ are as before and $\alpha$, $\beta$, $\gamma$, $\delta$ are predetermined constants, and where $\beta$ is replaced by $\delta$ for all terms $w_i \in R^t$ such that $w_i$ is not in the query at time $t$. The constants are determined through experimentation. In another variation, $\alpha$, $\beta$, $\gamma$, $\delta$ may actually be variables that are determined as time progresses. However, for the purposes of these experiments, they are constant. As before, when using this algorithm, up to 30 documents must be ranked by the user, using the same criteria as in the Rocchio style algorithm. Only positive, non-zero weighted terms are accepted in $P_{t+1}$. This method will be referred to by the name "Grigoris".
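The following sketch shows the general shape of such an incremental update in Java, using the Rocchio-variant weighting above. The class and method names are illustrative assumptions; the cap on profile size and the "appears in more relevant than nonrelevant vectors" test are omitted for brevity, and only the positive-weight pruning rule is shown.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ProfileUpdater {
    /**
     * One incremental Rocchio-style step:
     * P_{t+1} = P_t + |S_t| * sum(relevant vectors) - |R_t| * sum(nonrelevant vectors),
     * keeping only terms that end up with positive weight.
     */
    public static Map<String, Double> update(Map<String, Double> profile,
                                             List<Map<String, Double>> relevant,
                                             List<Map<String, Double>> nonrelevant) {
        Map<String, Double> next = new HashMap<>(profile);
        double relWeight = nonrelevant.size();   // |S_t|, applied to the relevant sum
        double nonWeight = relevant.size();      // |R_t|, applied to the nonrelevant sum
        for (Map<String, Double> doc : relevant)
            doc.forEach((term, w) -> next.merge(term, relWeight * w, Double::sum));
        for (Map<String, Double> doc : nonrelevant)
            doc.forEach((term, w) -> next.merge(term, -nonWeight * w, Double::sum));
        next.values().removeIf(w -> w <= 0);     // drop terms with little discriminatory power
        return next;
    }
}
```

The Ide-style "Grigoris" variant has the same shape, with the cardinality weights replaced by the constants $\alpha$, $\beta$ (or $\delta$), and $\gamma$.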
Chapter 3

Architecture

This chapter explains the architecture of the system implemented, comments on the scalability of the system, explains some algorithms used, and lists the topics used for the next chapter.
3.1 Overall Architecture

Adaptive information filtering and metasearch are both asynchronous tasks. The architecture reflects this. Figure 3.1 shows the architecture of the metasearch engine. An explanation of the symbols used follows.

Circle An object in the system, operating in the same machine as all other circles; overlapping circles represent multiple objects working concurrently in separate threads of execution.

Square An object in the system, operating on a specific machine. Overlapping squares mean that several object instances are working concurrently, possibly on different machines. If on the same machine, instances work in different threads in the same process.

Rectangle Rectangles that are not also squares represent queues or user interfaces. Rectangles with a horizontal orientation are queues, while a vertical orientation indicates the user interface.

Oval The oval represents the main client; this is the client that exists on the same machine as the user interface.

As can be seen from the diagram, one client communicates with multiple processes on multiple machines in order to collect documents for user ranking. Each one of those processes creates a series of queues and queue managers to handle various data transformations. The queue managers operate in parallel with each other, processing data as it becomes available in the queues. Each queue manager actually passes the data in the queue to a multithreaded pool of workers. One worker in the pool will perform transformations on the data before putting it into the next queue. What follows is an explanation of figure 3.1.
1. The user interface is responsible for accepting the initial query of the user.

2. The query is passed to the client program.

3. The client program passes the query to the search engine selector, which selects a set of search engines to use. This allows for dynamic selection of search engines, possibly in conjunction with a learning algorithm.

4. The set of search engines is passed to multiple page retrievers, possibly on different machines. The machines are selected in a random order, and given a random number of search engines to query. The maximum number of search engines on a single machine is an adjustable parameter. Each page retriever handles only one engine. Each of the following steps until step 10 occurs in each page retriever, and each page retriever has its own copy of the various objects and queues, except for global data structures, which are outlined in section 3.2.
Figure 3.1: Overall architecture; an explanation is given in section 3.1.
5. The search engine extractor works in tandem with the page fetchers. It requests URLs corresponding to the user's query from a search engine, and continues to do so until it finds 50 new or changed documents, or no more documents are available from the individual search engines. Changed documents are those that have been fetched before, but have changed since last fetched. This is determined using a combination of a datestamp and a checksum for the document. The page fetchers tell the search engine extractor the number of new or changed documents that have been found. The search engine extractor puts URLs into the URL queue.

6. The URL manager removes URLs from the URL queue and passes them to a worker in the page fetcher pool. Java's threading model requires that an asynchronous page fetcher use a helper object, so that page downloading may be halted. The page fetcher asks a helper to download the document given the URL, and passes information about new or changed documents to the search engine extractor. The document information is then entered into the page information queue.

7. The page information manager removes documents from the page information queue and passes them to one of the workers in the page analyzer pool. The analyzers extract and rearrange some of the data from each document, transforming the document into an SGML document that the backend instances can use. The newly formatted document is entered into the backend queue. At this stage, documents with non-English characters are stripped to include only English characters, if any exist. Furthermore, the document is split into a series of URLs, the URL text, the title, and the rest of the document.

8. The backend manager extracts the document data from the backend queue and passes it to one of the workers in the backend instances pool. Here, a backend instance transforms the document into a vector representation of the document, along with various tags assigned to the document by the page analyzers. The vectors are then entered into the analyzed VSM queue. The backend was originally created by Brian Chambers [Cha98], and modified to allow for multithreaded use and WWW documents (i.e., documents in HTML).

10. The page retriever collects the data in the analyzed VSM queue and sends the data to the client, based on a client request for the data.

11. The client then formats the data and orders it according to the learning algorithm in use, and presents it to the user for ranking.
All communication between machines is done via a message passing system built on top of Java's Remote Method Invocation (RMI). This message passing system allows synchronous and asynchronous communication, with mechanisms in place to allow agent communication languages such as KQML [LF97, FLM97, FFME]. However, such languages are not required for use, and a much simpler protocol was used here. The global data structures are shared through Java's RMI subsystem.
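The queue manager and worker pool pattern described above can be sketched as follows. The thesis implementation used Doug Lea's util.concurrent package; the sketch below uses the java.util.concurrent classes that package later became, and it omits the RMI transport, the time limits discussed in section 3.3, and the real class names, all of which are assumptions here.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Drains one queue, hands each item to a pool of workers, which feed the next queue. */
public final class QueueManager<I, O> implements Runnable {
    public interface Transformer<I, O> { O transform(I item) throws Exception; }

    private final BlockingQueue<I> in;
    private final BlockingQueue<O> out;
    private final Transformer<I, O> transformer;
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public QueueManager(BlockingQueue<I> in, BlockingQueue<O> out, Transformer<I, O> t) {
        this.in = in; this.out = out; this.transformer = t;
    }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                I item = in.take();                        // block until data is available
                workers.submit(() -> {
                    try { out.put(transformer.transform(item)); }
                    catch (Exception e) { /* drop the item; a real manager would log it */ }
                });
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            workers.shutdown();
        }
    }
}
```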
3.2 Global Data Structures

There are two global data structures that are used for all queries and across all machines. The first of these is the document frequency table and the second is the stopword list. The former is a list of all the terms that have been seen until the current time, $t_c$. It also has a mapping of terms to the frequency of the term in the documents seen until time $t_c$. This structure forms the dictionary that is used by the system, and is initially empty. If this dictionary represented the entire document space, and thus all words and two word phrases in the English language, the dimensionality of the document space would inhibit useful learning, in addition to causing scalability problems. This problem is handled first by the dynamic generation of the document frequency table, which ensures that only those words that have been seen before are included in the dictionary. There are three other methods that are used to handle dimensionality.
3.2.1 Zipf's Law

The first of these is an application of Zipf's law. This law, as cited by Sahami [Sah98], states that words that occur infrequently in a corpus of documents have little discriminating power between documents. Such words may help to identify individual documents, but do little else. Since the purpose of the document VSMs is to order documents relative to each other, these words may be discarded, as they provide no useful information in this context. As an example of this, Sahami estimates that words occurring only once in a corpus will provide half of the unique terms in the corpus. This is only an approximation, but aids in keeping the dictionary small. This application of Zipf's law may be done at intervals, when a large number of documents, such as 10000, have been seen. This ensures that such terms really do fall into the category of unique but nondiscriminatory. Terms that are in this category are placed into a dynamically generated stopword list.
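A sketch of this pruning step, run after a large batch of documents has been seen, is shown below. The method and parameter names, and the exact frequency threshold, are illustrative assumptions; the text above only specifies that rarely occurring terms are moved into the dynamically generated stopword list and out of the dictionary.

```java
import java.util.Map;
import java.util.Set;

public final class ZipfPruner {
    /**
     * Move terms whose document frequency is still below a small threshold
     * into the dynamically generated stopword list, removing them from the dictionary.
     */
    public static void prune(Map<String, Integer> documentFrequency,
                             Set<String> dynamicStopwords,
                             int minDocumentFrequency) {
        documentFrequency.entrySet().removeIf(e -> {
            if (e.getValue() < minDocumentFrequency) {
                dynamicStopwords.add(e.getKey());   // treat the rare term as a stopword
                return true;                        // and drop it from the dictionary
            }
            return false;
        });
    }
}
```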
3.2.2 Stopword List

The stopword list is used when creating VSMs, and is used by the system to remove words from a document's vector representation that appear in the stopword list. Complementing the dynamically generated portion of the stopword list, described in section 3.2.1, is a static stopword list. This list is created before the dictionary is built, and consists of a list of words to exclude from the dictionary and thus the document space. This list is meant to consist of commonly used words that are known to provide little or no discriminatory power. It includes articles like "the" and "a", and common conjuncts like "and" and "or".
3.2.3 Stemming
The last mechanism to prevent excess dimensionality is stemming. Here, words with similar roots are conflated in the document space. For instance, the words "accessory," "accessories," and "accessorized" are identified with the term "accessori" during the transformation of a document to a vector in the document space. Thus, the document space is reduced, since multiple variations of a word collapse onto a single index term. Terms that are two-word phrases have stemming applied to each individual word in the phrase. This is accomplished through a suffix stripping routine [Por80], where words are systematically reduced to a "root" word, which may or may not correspond to an actual word in the English language. The fact that terms in the profile consist of stemmed terms adds to the inability to use an expanded query in successive searches with the search engines (see section 1.3.1).
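As a rough illustration of how stopword removal and stemming might interact during vectorization (a sketch only; the stem() helper stands in for the Porter suffix-stripping routine [Por80], the class names are hypothetical, and treating a stopword as a phrase boundary is an assumption of this sketch):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of turning a tokenized document into term counts,
// applying stopword removal and stemming to single words and to each word
// of a two-word phrase, as described above.
public class DocumentVectorizer {

    private final Set<String> stopwords;   // static list merged with the dynamic list

    public DocumentVectorizer(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    // Placeholder for the Porter suffix-stripping routine.
    private String stem(String word) {
        return word; // a real implementation would strip suffixes here
    }

    public Map<String, Integer> vectorize(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        String previousStem = null;
        for (String raw : tokens) {
            String word = raw.toLowerCase();
            if (stopwords.contains(word)) {
                previousStem = null;     // assumption: stopwords break phrase formation
                continue;
            }
            String stemmed = stem(word);
            counts.merge(stemmed, 1, Integer::sum);                           // single-word term
            if (previousStem != null) {
                counts.merge(previousStem + " " + stemmed, 1, Integer::sum);  // two-word phrase
            }
            previousStem = stemmed;
        }
        return counts;
    }
}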
3.3 Scalability
The architecture is scalable, allowing multiple machines to cooperate in analyzing and downloading documents. In fact, working on multiple machines was necessary, as the initial system reported memory errors with just one machine with 128MB of RAM in use. The use of Java's RMI subsystem allows multiple machines to coordinate and work in an asynchronous manner. Unfortunately, this is not as completely scalable as it could be or appears to be. Since the engine was written in Java, it is subject to some of Java's faults. In this case, threads in Java cannot be interrupted immediately. This means that even though all the transformation-performing objects (section 3.1) are given a limited amount of time in which to complete a data transformation, those that do not finish and are told to interrupt themselves will not stop consuming resources until the task has been completed, or the object checks for interruption of the thread in which it is operating. If the object happens to be in the midst of a blocking call, such as when downloading a document, it cannot stop until the call returns.
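A minimal sketch of the cooperative interruption described here: a transformation task checks its thread's interrupt flag between units of work, but a blocking socket read in the middle of a step cannot be cut short this way. The class and method names are hypothetical, not those of the actual system.

// Hypothetical sketch of a time-limited transformation task that cooperates
// with interruption by checking the thread's interrupt status between steps.
public class TransformationTask implements Runnable {

    private final String[] workItems;

    public TransformationTask(String[] workItems) {
        this.workItems = workItems;
    }

    @Override
    public void run() {
        for (String item : workItems) {
            // A blocking call inside processOne() (e.g. a socket read while
            // downloading a page) will not notice the interrupt until it returns.
            if (Thread.currentThread().isInterrupted()) {
                return; // stop consuming resources once interruption is observed
            }
            processOne(item);
        }
    }

    private void processOne(String item) {
        // placeholder for downloading and transforming one document
    }
}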
During a run (that is, for steps 1 to 11 in figure 3.1), a page retriever will not have a complete document frequency table. Only those changes that are made locally are available. At the end of the run, changes to the local table are transferred to the master table, along with changes from all other page retrievers. This reduces much of the network traffic, except for the initial transfer of the table and stopword list. As a consequence of caching, the document frequency table is an estimate of the current knowledge about the frequency of terms in the documents seen. Thus, it will initially be inaccurate, and the estimate will get better as time goes on, until the caching effect becomes irrelevant. This is not a problem, as the table is inaccurate initially whether or not caching is used. Furthermore, if no caching were used, and the initial table were changing rapidly, the knowledge would only allow better vector modelling of documents that were viewed later in the initial runs. Thus, documents would not be treated equally in the rankings because they would be ranked based on different amounts of knowledge about the terms in the document space.
The problem of a poor initial table may be reduced by using a bootstrap document frequency table, if there is knowledge about the dictionary that will likely be created. Such knowledge may come from existing analyses of the English language or from other information retrieval studies, for example. For this study, bootstrap tables were used from Brian Chambers's work [Cha99]. The document corpus and topics used in Chambers's work were different from the ones used here. However, this means that shared words would most likely be fairly common words in the English language that still had some discriminatory power.
In spite of the caching done, true scalability can only be achieved by using some distributed database, or other distributed data structure, which includes the use of a more intelligent caching scheme. This database or caching scheme would apply to both the document frequency table and the stopword list.
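A sketch of the end-of-run merge described above, where each page retriever's locally accumulated document frequency deltas are folded into the master table. The names are hypothetical; in the actual architecture this call would go through a Java RMI remote interface.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: merging a page retriever's local document-frequency
// deltas into the shared master table at the end of a run.
public class MasterFrequencyTable {

    private final Map<String, Integer> masterCounts = new HashMap<>();

    // Called once per page retriever at the end of its run, so network traffic
    // is limited to one bulk transfer instead of per-document updates.
    public synchronized void mergeLocalDeltas(Map<String, Integer> localDeltas) {
        for (Map.Entry<String, Integer> e : localDeltas.entrySet()) {
            masterCounts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    public synchronized Map<String, Integer> snapshot() {
        // Retrievers download this snapshot (plus the stopword list) at startup.
        return new HashMap<>(masterCounts);
    }
}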
3.4 Term Weighting
In the creation of VSMs in step 8 in figure 3.1, a number of different approaches may be used for the creation of the VSMs. In this engine, a term t in document d is given a weight w as follows:

$$ w \;=\; f_t \cdot \log_2\!\left(\frac{N}{f_d(t)}\right) \cdot \frac{\#\ \text{unique words in } d}{\text{avg } \#\ \text{unique words per document}} $$

where f_t is the term frequency, log is the logarithm to the base 2, N is the total number of documents seen at the time that d is analyzed, and f_d(t) is the number of documents in which term t occurs at least once. Other term weighting systems may also be used, as in [SM83, Sal71, LG98a, Mla99]. These are not the weights that are used during learning, however. When actually used, vectors are transformed so that the sum of their weights is 1. This includes the vectors that represent the profiles.
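A small worked sketch of this weighting and the subsequent sum-to-one normalization. The names are hypothetical, and the formula is taken as reconstructed above; note that the document-length factor is constant within a document and therefore cancels under the normalization.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the term weighting described above, followed by the
// sum-to-one normalization applied before vectors are used.
public class TermWeighting {

    // fT: raw frequency of each term in document d
    // dfT: number of documents in which each term occurs (document frequency)
    // totalDocs: N, the number of documents seen when d is analyzed
    // uniqueWordsInDoc / avgUniqueWordsPerDoc: the document-length factor
    public static Map<String, Double> weightAndNormalize(Map<String, Integer> fT,
                                                         Map<String, Integer> dfT,
                                                         int totalDocs,
                                                         int uniqueWordsInDoc,
                                                         double avgUniqueWordsPerDoc) {
        double lengthFactor = uniqueWordsInDoc / avgUniqueWordsPerDoc;
        Map<String, Double> weights = new HashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : fT.entrySet()) {
            int df = dfT.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) totalDocs / df) / Math.log(2.0); // log base 2
            double w = e.getValue() * idf * lengthFactor;
            weights.put(e.getKey(), w);
            sum += w;
        }
        if (sum > 0.0) {
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / sum); // transform so the weights sum to 1
            }
        }
        return weights;
    }
}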
3.5 Topics
Four topics were chosen for queries:
Palm Pilot  The palm pilot topic was meant to obtain documents pertaining to accessories and free productivity software for the Palm handheld computer line by Palm, Inc. All documents had to be about either accessories or free productivity software to be deemed relevant (no demos or shareware), and no lists of URLs to other sites, among other criteria.
Robots  This topic concerned research into autonomous robots, particularly with respect to courses, but any research would do. Documents about robot competitions (unless these were also courses), remote controlled robots, and toys were excluded from this topic.
Microsoft DOJ  While Microsoft has had a number of cases with the Department of Justice (abbreviated DOJ), documents were relevant to this topic only if they were applicable to the case which took place in the years 1998-2000, and resulted in the judge determining that Microsoft should be split into two companies. This excludes the case regarding the consent decree, circa 1997, and all other antitrust cases, such as the one regarding the purchase of Intuit.
Students  The actual query used for this topic was "students university toronto." It was meant to obtain documents relating to student life at the University of Toronto: things such as clubs, organizations, student activities and student guides. It was meant to mimic a query by a potential undergraduate student to the University, who was interested in seeing what students did at the University, socially.
Chapter 4
Experimental Results and Evaluation
This chapter presents results of experiments run to answer the questions posed in the Introduction. The results are also evaluated for statistical significance, and evaluated with respect to the questions posed.
4.1 Description of Data Gathering Procedure
Data were gathered on as close to a daily basis as possible. Some days, data could not be obtained due to the fact that few documents were returned. This lack of data on a given day was due to changing conditions on the local and global Internet, and because of this, the data gathering procedures were either redone for those days or not done at all. As will be shown, a large gap between data gathering days did not appear to have an effect on the results. On those days when data were gathered, the queries were given in the order in which they were presented in section 3.5. For each topic, at least 50 documents were obtained on each day, with a mean of 180 documents gathered per topic, per day. The exact number of documents found each day may be seen in figure 4.1. An explanation of some of the larger spikes may be found in section 4.6.
Figure 4.1: Daily Document Counts ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students)
4.2 Evaluation Framework
The evaluation of information retrieval systems typically involves some measure of precision and recall. In this domain, the former is a measure of the number of documents retrieved that are labelled as relevant, and the latter is the percentage of relevant documents that were retrieved, out of the whole population of relevant documents available. The true total number of relevant documents for any given topic or query is unknown. Furthermore, a measure of recall should also take into account properties of the Internet. At any given moment, large portions of the Internet may be inaccessible or difficult to access due to failures of individual machines in the Internet, or unresolved congestion at some point in the Internet. Also, since users tend not to navigate beyond the first set of documents that a search engine displays [Nie00, Nie99b, Nie98, Sch00, Tog98] and since some search engines can give hundreds of thousands of documents, it is especially important to have more relevant documents in the top few URLs listed. Finally, as the WWW grows, recall becomes far less important than precision [Nie99a]. Many documents will be relevant, but the most relevant should be placed at the top of search lists. Some might argue that the need for better precision over recall already exists. However, another study [Su94] indicates that users may be more interested in absolute recall than precision; it is unclear whether this would still be true when using today's search engines on the vastness of the WWW, as Nielsen argues. Whatever the case may be, measures of recall or precision that go beyond the first 10 to 20 documents that the metasearch engine displays are relatively useless because the user will only rarely see documents listed beyond those first 10 or 20. However, a measure of relative recall is useful when comparing individual search engines that make up part of the metasearch engine. Here, a measure of the number of relevant documents an individual search engine obtained relative to the number of relevant documents the metasearch engine obtained can be used to compare individual search engine performance over time. Statistics of this nature can be found in section 4.7. It is possible to estimate the number of relevant documents found each day. These results are presented in section 4.5. For most of the other discussion, a measure of precision is used. This measure is given with respect to a certain number of documents: for example, the precision in the top 10 documents returned by the metasearch engine, or the precision in the top 1 document. These measures have been used before, in analyzing search engines [GLG+99, CS98]. The TREC-7 filtering task also suggests a measure to use [Hul99], the F3 measure:
$$ F3 = 4R_{+} - N_{+} $$

where R_{+} is the number of relevant documents retrieved and N_{+} is the number of nonrelevant documents retrieved. Results from this measure are presented in section 4.3.1. The TREC-9 filtering task [RH] suggests the use of the T9U and T9P measures, presented in sections 4.3.2 and 4.3.3, respectively:

$$ T9U = \max(2R_{+} - N_{+},\; MinU) $$

$$ T9P = \frac{R_{+}}{\max(MinDoc,\; R_{+} + N_{+})} $$

$$ MinU = -400 \ \text{for 4 years, or pro-rata}, \qquad MinDoc = 50 \ \text{for 4 years, or pro-rata} $$
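To make the bookkeeping concrete, here is a small sketch that computes the three measures from daily counts of relevant and nonrelevant documents returned. It follows the formulas as reconstructed above; the coefficients in F3 and T9U come from the TREC filtering track definitions rather than from the garbled original, so treat them as assumptions. The class name is hypothetical.

// Hypothetical sketch: TREC-style filtering measures computed from the number
// of relevant (rPlus) and nonrelevant (nPlus) documents returned to the user.
public final class FilteringMeasures {

    public static final double MIN_U = -400.0; // lower bound for T9U (4 years, or pro-rata)
    public static final int MIN_DOC = 50;      // minimum document count for T9P

    // F3 linear utility, as used by the TREC-7 filtering task (assumed form).
    public static double f3(int rPlus, int nPlus) {
        return 4.0 * rPlus - nPlus;
    }

    // T9U: a utility similar to F3 but bounded below by MIN_U (assumed form).
    public static double t9u(int rPlus, int nPlus) {
        return Math.max(2.0 * rPlus - nPlus, MIN_U);
    }

    // T9P: precision with a penalty for returning fewer than MIN_DOC documents.
    public static double t9p(int rPlus, int nPlus) {
        return rPlus / (double) Math.max(MIN_DOC, rPlus + nPlus);
    }
}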
Establishing a performance baseline would also be useful. One baseline that could be used is having an unchanging profile order the documents. In other words, the profile and the query stay the same through time. In the following discussion, this profile, query and associated learning algorithm will be referred to as the plain profile, query, or learning algorithm. This baseline turns out to have properties very similar to choosing a random set of documents and presenting them to the user.

Figure 4.2: Running average of precision across all topics for the Plain algorithm (top 1, 3, 5, 10 and 30)

One would expect that choosing a random set of documents would result in the running average of the precision in the top N being approximately the same throughout time, for any N. The running average precision in the top N is simply the precision in the top N over some number of days. This can be seen in figure 4.2. The differences in precision end up being no more than 4%. Also, the order in which the precisions appear on the graph seems random, with the precision in the top 30 being the best, while the precision in the top 3 is the worst. The difference between the precision in the top 30 and the next best precision suggests that the ordering may be poorer than random, and figure 4.3 confirms this. A Wilcoxon signed rank test [Gus97], performed due to the non-normality of the data, reveals the difference is significant with p < 0.001. This still presents a good baseline, however, based on the merits of the profile. All the other learning algorithms use this plain profile to start with. Any improvement over the plain algorithm is thus a result of learning. With this baseline measurement, 30 documents were ranked each day.
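A sketch of how precision in the top n and its running average might be computed from the daily rankings (hypothetical names; the relevance labels come from the user's feedback):

import java.util.List;

// Hypothetical sketch of precision-in-the-top-n and its running average over days.
public final class PrecisionStats {

    // relevanceByRank: the user's relevance judgments for one day's ranked list,
    // in system rank order (true = relevant). Fewer than n documents may be present.
    public static double precisionAtN(List<Boolean> relevanceByRank, int n) {
        int limit = Math.min(n, relevanceByRank.size());
        if (limit == 0) {
            return 0.0;
        }
        int relevant = 0;
        for (int i = 0; i < limit; i++) {
            if (relevanceByRank.get(i)) {
                relevant++;
            }
        }
        return relevant / (double) limit;
    }

    // Running average over the daily precision values seen so far.
    public static double[] runningAverage(double[] dailyPrecision) {
        double[] avg = new double[dailyPrecision.length];
        double sum = 0.0;
        for (int day = 0; day < dailyPrecision.length; day++) {
            sum += dailyPrecision[day];
            avg[day] = sum / (day + 1);
        }
        return avg;
    }
}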
Another baseline was also used. This will be referred to by the random name. The random algorithm chose 30 random documents each day, from the documents retrieved on each day.

Figure 4.3: Running average of precision across all topics for the Random algorithm (top 1, 3 and 5)

Here, all the running averages converge to approximately 10%, which is slightly higher than the precision in the top 30 for the plain profile. This convergence is expected from a ranking of a random set of documents. The data from the random algorithm will be used to estimate the proportion of relevant documents per day and thus, the number of relevant documents per day. As with the plain algorithm, 30 documents were ranked on each day.
In the following discussion, it is often instructive to examine only a portion of the results returned by the algorithms used. For example, without any other restrictions, every algorithm could return up to 30 documents. Thus, the measures given, without other restrictions, could be called the measures in the top (up to) 30 documents. We can restrict the number of documents we allow in the measure, and call it the measure in the top n documents. These measures would include only the top n system ranked documents, or fewer than n if fewer documents were ranked by the system. For ease of notation, these measures will be referred to as the measure in the top n. Keep in mind that fewer than n documents may be included in the measure.
4.3 TREC Measures
4.3.1 The F3 Measure
The running average of the F3 measure is shown in figure 4.4, with various topics. Figure 4.4(b) shows an example of the daily F3 statistics. The line labelled as "Test" in this figure follows the same path as the profile of the line "Grigoris" until day 62. After day 62, the "Test" profile is frozen, while the "Grigoris" profile is allowed to change through continued learning. This was done to examine any differences that might arise, and results of this are presented in section 4.4.1.
The difference between the Grigoris algorithm and the Rocchio algorithm is statistically significant when taking into account all topics, with p < 0.0001, using the Wilcoxon signed rank test with a 0.5 continuity correction. The difference here is also clearly significant in terms of real world performance.
The graphs seem to indicate that, with the exception of the plain algorithm, all algorithms on all topics perform well in the first ten to twenty days, after which performance seems to plateau or decline. The exception to this is the performance on the Palm Pilot topic, where the Rocchio and Grigoris algorithms increase their performance continually.
Almost without exception, this measure indicates an increase in performance at or around day 70. This will be explained in section 4.6. The almost monotonically decreasing line of the plain profile indicates that the number of relevant documents is almost always decreasing. This is to be expected, as one would think that the individual search engines are fairly good at ranking documents. Thus, as time goes on, the individual search engines give back worse and worse documents, which are less and less relevant to the general query given, much less the implicit topic that the user has in mind. In light of this, any line segment in the cumulative plots which has greater slope than the corresponding line segment on the plain profile line is indicative of better than plain performance. Likewise with respect to the line for the random algorithm.
Figure 4.4: F3 Measure on Various Topics, Running Average ((a) Palm Pilot, (b) Palm Pilot, Daily, (c) Robots, (d) MS/DOJ, (e) Students, (f) All Topics)
Figure 4.5: F3 Measure on Various Topics, Running Average, based on top 5 documents returned only ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All topics)
Figure 4.6: F3 Measure on Various Topics, Running Average, based on top 10 documents returned only ((d) Students, (e) All Topics)
The two traditional relevance feedback algorithms perform well on all but the students and robots topics. On the latter topic, the Rocchio algorithm generates approximately four times as many nonrelevant documents as relevant ones, whereas the Grigoris algorithm manages to get 50 more relevant documents than nonrelevant ones. Neither of the algorithms do well in the students topic, possibly due to the fact that few documents were relevant at all in that topic, as indicated by the line for the random measure; note that the minimum value of this measure is -1300 and the random algorithm received -1447, which translates to approximately 5.5% of the retrieved documents being relevant. This may indicate that the data or the generated profiles were noisy. In fact, the Rocchio algorithm had a recall of 24% (see section 4.5 below, on how recall is estimated in this work) but performed more poorly than the Grigoris algorithm because of the large number of irrelevant documents obtained. In comparison, the Grigoris algorithm had a 23% recall. It is also possible that the implicit topic behind the student query led to noisy results by virtue of the topic itself. For instance, suppose that the term "Association" were a good indicator of relevance, but only in conjunction with the terms "Toronto" and "Student." Furthermore, suppose that those words were poor indicators of relevance when found alone, or not near to the other words. The generated profiles do not take this into account, and thus cannot model these relationships accurately.
The F3 measure based only on the top 10 and top 5 documents ranked gives an interesting picture of the system's performance. These may be seen in figures 4.6 and 4.5. Judging from the shapes of the graphs¹, and in particular the shape of the graph showing the F3 measure across all topics, it would appear that the performance improves by just taking into account only those documents. This lends credence to the idea that the learning algorithms are working, because this type of improvement suggests that more relevant documents are concentrated in the top ranked documents. More discussion on
¹Unless we normalize the F3 measure, we cannot compare slopes or absolute numbers; this can be done, but this type of comparison is more clearly seen in section 4.4.
this is presented in section 4.4.
4.3.2 The T9U Measure
The T9U measure is meant to reduce the effect of selecting too many documents, particularly when those documents are irrelevant. It does this by introducing a lower bound on a measure similar to the F3 measure; the lower bound is called MinU. Results shown with this measure, in figure 4.7, should be a 'smoother' version of the F3 measure. This is borne out by examining the graphs. In fact, except for the palm pilot topic, the graphs for the learners exhibit a high degree of parallelism with the graphs for the plain and random algorithms. This means that the learning algorithms aid in the process of finding relevant documents only at specific points. This must be true, given the parallelism and the fact that the lines representing the learning algorithms are higher than those for the plain and random algorithms.
This behaviour is made more obvious by examining the graphs in figures 4.7(e), 4.8(e) and 4.9(e). It is easy to see that the 'bumpiness' of the graphs increases when taking into account fewer documents. The 'bumpiness' occurs where the learning algorithms actually perform better than the random and plain algorithms for a particular day. This behaviour cannot entirely be due to the performance on the palm pilot topic, as the graphs of the palm pilot are fairly consistent in terms of when performance is better than the plain and random algorithms. The consistency of the measure on the palm pilot topic and the fact that the measure was always rising indicates that, in fact, the learners on this topic did not select many more than 5 documents each day. While the other topics are also somewhat consistent, this same interpretation may not be given. The consistency there is a result of the performance being similar to the baseline measures for most of the run. While the performance on the other topics appears to be poor, the overall results still indicate that the learners were correctly ranking relevant documents higher than nonrelevant ones. If this were not the case, the graphs for the top 5 and 10 (figures 4.8 and 4.9) would look more like the graph in which no restrictions were made (figure 4.7).
Figure 4.7: T9U Measure on Various Topics, Running Averages ((a) Palm Pilot, (b) Robots, (d) Students, (e) All Topics)
Figure 4.8: T9U Measure, Top 5 only ((a) Palm Pilot, (b) Robots, (d) Student, (e) All Topics)
Figure 4.9: T9U Measure, Top 10 only ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Student, (e) All Topics)
Examining the overall results on this measure one final time, they indicate that either there were not too many relevant documents or the learning algorithms work best only for producing relevant documents in the top 5 or fewer documents returned. If this were not the case, one would expect the performance to be similar on all graphs for a particular topic.
The data seem to indicate that the Grigoris algorithm is better than the Rocchio one using this measure. However, a Wilcoxon signed rank test on the daily differences reveals the difference to be insignificant, with 0.1977 < p < 0.2005. Examining figure 4.7(e), the difference seems almost entirely due to the difference occurring at days 2, 3 and 4. While this difference, taken on a daily basis, might not be statistically significant, the end results are certainly significant in the real world. This difference ends up causing a difference of 4.19% in terms of the precision in the top 30. Perhaps more importantly, however, the difference is almost entirely due to a six document difference in the first four days, with a five document difference in the second day. This is important because while the usage scenario of this system occurs over some time, it could be the case that a time period of only a few days was important. Thus, any improvement in those few days would be critical. Also, since the behaviour of the individual search engines being put to use is such that relevant links are highly likely to be present in the first few pages of hits that the search engines return, and these first few pages are retrieved in the first few days, the importance of achieving a good profile in a short time increases.
4.3.3 The T9P Measure
The T9P measure is meant to 'stress precision' according to the Filtering Track Guidelines [RH]. It uses a lower bound on the minimum number of documents that must be selected. It thus penalizes systems which retrieve fewer than this minimum number of documents. These results are presented in figures 4.10, 4.11 and 4.12.
Figure 4.10: T9P Measure on various topics ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All Topics)

Figure 4.11: T9P Measure on various topics, including a maximum of 5 documents ((a) Palm Pilot, (b) Robots, (d) Students, (e) All Topics)

Figure 4.12: T9P Measure on various topics, including up to a maximum of 10 documents ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students, (e) All Topics)
The results are clearly in favour of the Grigoris algorithm. In fact, the differences between it and the Rocchio algorithm are statistically significant on the palm pilot and robots topics, as well as the overall results. Again, the Wilcoxon signed rank test with the 0.5 continuity correction was used to compare daily versions of the measure. The p values were p < 0.0044, p < 0.0207, 0.1151 < p < 0.1170, 0.1003 < p < 0.1020, p < 0.0013 for the graphs of figure 4.10, in order from left to right, top to bottom. In other words, the three graphs in which it looks like there may have been a statistically significant difference do in fact have one. This difference decreases as we include only those documents in the top 5 or 10, although the overall results maintain their statistical significance. This means that the algorithms tend to give the user the same number of relevant documents in the top few documents.
4.4 Precision of Learning Algorithms
The T9P measure in section 4.3.3 gives the precision in the top 30 for the plain and random algorithms, since those algorithms are always forced to return 30 documents. That measure does not completely represent the data, or accurately represent those individual algorithms. Figure 4.13 shows the running average of the precision across all topics, with precisions in the top 1, 5, 10 and 30 documents returned by the system. The figure shows that the Grigoris algorithm is better at discriminating between relevant and nonrelevant documents, since the line representing the results from the top n always ends up higher with the Grigoris algorithm than the Rocchio one.

Figure 4.13: Running average precision across all topics (Rocchio and Grigoris, top 1, 5, 10 and 30)

Figure 4.14: Running average precision on various topics, showing precisions with various numbers of documents included, Rocchio and Grigoris algorithms ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students)

The staggered positions of the lines corresponding to the different numbers of documents indicate that both traditional style algorithms are pushing relevant documents higher in the list of documents presented to the user. It may also imply that there are not enough relevant documents to have a high precision in the top 30 or the top 10. Given that the algorithms are working, which they appear to be, then if there were many relevant documents, one would expect that the lines for the different numbers of documents would be closer to each other, instead of having an eight percent difference between the top 1 and top 30 results. It is likely that the staggering is also a result of the algorithms' inability to both have relevant documents ranked highly, and to obtain all the relevant rankings from the set of documents available. However, it is difficult to estimate the recall of the algorithms, since no data exists about how many relevant articles exist in the entire WWW. Data about recall on a daily basis, with respect to the documents returned by the individual search engines, can be estimated by looking at the data from the random algorithm; this is presented in section 4.5.
Figure 4.14 shows the precision numbers for the individual topics. All the graphs exhibit stratification. That is, given a fixed number, m, of documents included in the measure, then including n documents, n < m, results in increased performance. The stratification is much less apparent on the robots and MS/DOJ topics than on the palm pilot topic. This is probably attributable to the profiles, and their ability to distinguish between documents. For instance, with the student topic, the term 'student life' is important, but only in conjunction with the term 'toronto'. However, the term 'toronto' by itself is actually a very poor indication of relevance. Similarly, the term 'student life' without the term 'toronto' is a poor indication of relevance. Thus, at certain times in the evaluations, the term 'student life' may be important in the profile, but it lacks the term 'toronto', or vice versa. This may be due to a fault of the feature space, which only consists of the terms in the VSMs and does not add the correlations between terms.
The large gap in the Grigoris algorithm on the students topic appears unusual. The stratification seems to be extreme when looking from the top 3 to the top 1 data. This is probably due to the high precision at the beginning of the run.
Figure 4.15: Running average precision of the Grigoris algorithm on the student topic

Figure 4.16: Running average of the precision for continuous training versus the test portion of train/test, across all topics, Grigoris algorithm used. Shown starting from the test cycle at day 70.

Figure 4.17: Running average precision on various topics using the Grigoris algorithm, comparing continuous training with the test portion of a train/test cycle. The testing cycle begins at day 70. ((a) Palm Pilot, (b) Robots, (d) Students)
4.4.1 Continuous Learning vs Train/Test
One other test that was performed was to examine the effects of continuous training versus having a training and testing period for the profile. This was done with the Grigoris algorithm, and the results may be seen in figures 4.16 and 4.17. With the exception of the results in the top 3, the differences in the daily data are statistically significant with p ∈ [0.0104, 0.0418]. The differences in the top 3 result in a p of 0.0643. These results are only with 14 days worth of data. Furthermore, these results may be due to the fact that the metasearch engine frequently returned pages that had been seen before, and which had only changed slightly since the last time they were seen (in the date, for example). Servers may also have returned incorrect dates for the date check, or pages may have been dynamically generated. All these factors result in a basically unchanged page being given to the user. Pages that had not changed in a relevant manner were marked irrelevant (see section 2.3). This puts any profile that has been frozen at some particular point in time at a disadvantage, because it cannot adjust to this method of ranking. At the same time, this method is necessary in order for the system to give new pages to the user, and for the system rankings not to be dominated by dynamically generated pages.
One other possible explanation of these results is that overtraining might have occurred. If this were the case, the test profile would only work well with the documents the system had already seen. This is not the case here. The spike in the performance at day 70 shows this. As explained in section 4.6, this spike is due to the system seeing relevant documents that had been seen before, but when the profile was still relatively poor.
Figure 4.18: Daily recall for the Rocchio algorithm ((a) Palm Pilot, (b) Robots, (d) Students, (e) All topics)

Figure 4.19: Daily recall for the Grigoris algorithm ((b) Robots, (d) Students, (e) All topics)
4.5 Daily Recall
The random algorithm allows estimation of the proportion of relevant documents available on each day. Note that this does not allow estimation of the proportion of relevant documents in the population of documents available on the WWW, but only the population consisting of those documents retrieved on each day. This still provides useful data. For instance, the 95% confidence interval of the number of relevant documents each day can be estimated, and from there, a range of recall values for each algorithm may be obtained. This confidence interval is obtained using a 0.5 continuity correction to approximate a binomial distribution with a normal one [Gus97]. Using these intervals, it is possible to compute a range for the number of relevant documents that are expected to be present on each day. Given this, it is easy to compute the estimated recall per day. These data are presented in figure 4.18 for the Rocchio algorithm, and figure 4.19 for the Grigoris algorithm. Note that the estimated proportions were altered to be nonnegative, that the recall estimates shown were altered to be no more than 100%, and that a recall of 0/0 was given 100% recall. The 'mean recall' refers to the recall obtained when using the middle of the confidence interval as the estimate of the number of documents found.
The daily data can be used to estimate the overall recall. The ranges are given in table 4.1. These ranges were obtained by summing the found number of relevant documents each day and dividing by the sum of the estimated number of relevant documents found each day. This sum varied, depending on exactly which number in the 95% confidence interval was used in the summation: either the maximum, the minimum or the mean of the number of estimated documents in the confidence interval. The expected number of relevant documents is the sum of the means of the number of relevant documents, and gives an indication of the weight each topic is given in the 'All topics' topic.
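A sketch of the estimation just described: the random algorithm's relevant/total counts give a binomial proportion, a normal approximation with a 0.5 continuity correction gives a 95% confidence interval on the number of relevant documents available that day, and the algorithm's found count divided by that interval gives a recall range. The names are hypothetical; this illustrates the method, not the thesis's code.

// Hypothetical sketch: estimating a daily recall range from the random
// algorithm's sample, using a normal approximation to the binomial with a
// 0.5 continuity correction (95% confidence, z = 1.96).
public final class DailyRecallEstimate {

    public final double minRecall;
    public final double meanRecall;
    public final double maxRecall;

    public DailyRecallEstimate(int relevantInRandomSample, int randomSampleSize,
                               int documentsRetrievedToday, int relevantFoundByAlgorithm) {
        double z = 1.96;
        double p = relevantInRandomSample / (double) randomSampleSize;
        double se = Math.sqrt(p * (1.0 - p) / randomSampleSize);
        double correction = 0.5 / randomSampleSize;          // continuity correction

        // Bounds on the proportion of relevant documents, altered to be nonnegative.
        double pLow = Math.max(0.0, p - z * se - correction);
        double pHigh = Math.min(1.0, p + z * se + correction);

        // Estimated number of relevant documents among today's retrieved documents.
        double estLow = pLow * documentsRetrievedToday;
        double estMean = p * documentsRetrievedToday;
        double estHigh = pHigh * documentsRetrievedToday;

        // Recall range, capped at 100%; a 0/0 case is treated as 100% recall.
        minRecall = recall(relevantFoundByAlgorithm, estHigh);
        meanRecall = recall(relevantFoundByAlgorithm, estMean);
        maxRecall = recall(relevantFoundByAlgorithm, estLow);
    }

    private static double recall(int found, double estimatedAvailable) {
        if (estimatedAvailable <= 0.0) {
            return 1.0; // 0/0 convention from the text
        }
        return Math.min(1.0, found / estimatedAvailable);
    }
}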
  topic           Rocchio (min, mean, max)     Grigoris (min, mean, max)    expected number of relevant documents
  Palm Pilot      0.2672  0.3496  --           0.2338  0.3059  0.4409
  All topics      0.1984  0.2645  0.3851       0.1935  0.2580  0.3757

Table 4.1: Overall estimated recall.

  topic           number of retrieved documents     expected % relevant
  Robots          6453
  Students        5365                              5.93
  All topics      31664                             10.2

Table 4.2: Total number of documents retrieved for each topic. This is not the number retrieved by the learning algorithms to be presented to the user, but the documents that the metasearch engine obtains from the individual search engines.
4.6 Spikes
There are two spikes, appearing at the beginning of the graphs, and on or around day 70 in the graphs, that are particularly striking, as they appear in all graphs, from the document counts graph of figure 4.1, to the TREC measures in figures 4.4, 4.7 and 4.10, to the precision measurements in figure 4.13. The initial spike may be explained by the fact that the individual search engines that were being used had their best results at the beginning of their document lists, which were viewed by the metasearch engine first. In the case of the document counts, there were more documents because the system did not have to go far in the lists of documents returned by the individual search engines. Thus, the documents tended to be fairly popular, which tends to imply good network connections. It could also have resulted in more popular pages, particularly with DirectHit, Hotbot and derivatives. One would expect more popular documents to be reachable across the Internet and, conversely, that less popular documents would be less reachable. Since unreachable documents cause resource consumption, the system might not be able to collect as many documents in the later days as in the early days.
The other spike occurred after a gap in data gathering of eight days, but also after the table which keeps track of the seen documents had been flushed. This table kept track of which documents the metasearch engine had seen and ranked, not which documents the user had seen and ranked. Flushing the first 10 days of data from it resulted in many completely new documents being presented to the user, and some previously seen ones. Ignoring the previously seen ones unless they had changed, the metasearch engine still gave a large number of relevant documents, especially to show a difference even after 70 days of data gathering. This could either have been due to the gap in data gathering, or the flushing.
Figure 4.20: Daily precision across all topics, Top 10
4.6.1 Data Gathering Gaps
Gaps occur in other places, such as the seven day gap starting at day 17, the 6 day gap starting at day 30, and the 15 day gap starting at day 73. Examining figure 4.1, it is not difficult to see where these gaps are. The largest gap, the 15 day one, results in a general decline across topics, while the other two gaps show a mix of shallow declines and ascensions, as can be seen on the various performance graphs, such as the F3, T9U and T9P measures. The document counts also do not appear to increase or decrease significantly, meaning that the number of documents that were found to be changed or new did not increase or decrease. The estimated number of relevant documents also does not appear to have any correlation with the data gathering gaps; the results of figure 4.18(e) show a mix of ascensions and declines associated with the gaps.
4.6.2 Flushing
Thus, the spikes were probably due to flushing. However, one would not expect that this flushing would work more than once, because the first flushing would allow the user to rank most of the pages that had been missed earlier due to a poor profile. If it did, one would expect that the performance would not increase even to the extent that it did on day 70.
Figure 4.21: Daily precision on various topics, Top 10 ((a) Palm Pilot, (b) Robots, (c) Microsoft/DOJ, (d) Students)

Figure 4.22: Daily precision in top 30 on the students topic for the Rocchio algorithm
An increase at least as large would mean that the individual search engines were giving many good results over time, but the profile was not adapting quickly enough. Flushing was done at days 70, 94, and 104, that is, 3 days, 7 days and 13 days before the last data point. Figures 4.20, 4.21, and 4.18 confirm the hypothesis that flushing would not work more than once: in one instance after the day 70 flushing, flushing results in higher precision, and in the other it results in lower precision. Also, the lower precision due to flushing occurs earlier in time than the higher precision due to flushing (if flushing was the cause of this at all, which it probably was not). Similar patterns are found in the daily recall statistics.
This data also has some bearing on the issue of continuous learning, and therefore, continuous adjustment of the profile. It shows that despite the necessity for continuous learning, as shown in section 4.4.1, the learned profile still performs well when presented with a large number of relevant documents, even 60 days after the previous peak performance (and thus, peak learning with the traditional style algorithms). The mainly negative reinforcement that was received after the first spike in performance does not seem to have a deleterious effect later on.
One peculiarity with the daily precision numbers is that the biggest spike in the students topic occurs at day 50 with the Rocchio algorithm. This doesn't correspond to any particularly special day. Only 200 documents were retrieved on that day, when the average for the students topic was 190. To put that in perspective, 427 documents were retrieved on day 70. The high performance seems to be due to the Snap and MSN search engines, each obtaining four relevant results in the top 10.
4.7 Individual Search Engine Recall
The individual search engines do perform differently on certain topics than on others. Figure 4.23 shows graphs of the running average of the search engine recall, for all topics with the Rocchio algorithm. Figure 4.24 shows a similar graph for the Grigoris algorithm. The recall numbers are given with respect to the relevant documents found by the metasearch engine (i.e., all search engines). Computing a line of regression for the above graphs, and using a percentage for the recall, produces the slopes given in table 4.3. This shows that the search engine recall is about the same over time across all topics. The graphs show that no individual search engine is able to obtain more than 25% of the relevant documents, over time. Examining results on the separate topics reveals that no individual search engine obtains more than 40% of the relevant documents found by the metasearch engine on any individual topic, over time.
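A sketch of the least-squares slope computation behind table 4.3 (hypothetical names; recall values are expressed as percentages and days are counted from the start of the run):

// Hypothetical sketch: slope of the least-squares regression line through
// (day, recall%) points, giving the per-day trend reported in table 4.3.
public final class RecallTrend {

    // days[i] is the day (from the start of the run) of the i-th observation,
    // recallPercent[i] is the engine's relative recall on that day, in percent.
    public static double regressionSlope(double[] days, double[] recallPercent) {
        int n = days.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += days[i];
            meanY += recallPercent[i];
        }
        meanX /= n;
        meanY /= n;

        double covXY = 0.0, varX = 0.0;
        for (int i = 0; i < n; i++) {
            covXY += (days[i] - meanX) * (recallPercent[i] - meanY);
            varX += (days[i] - meanX) * (days[i] - meanX);
        }
        return covXY / varX; // % recall per day
    }
}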
Search engine recall does change on a per topic basis. Examining figure 4.25, there is a clear upward trend. In fact, the slope of the line of regression is 0.1545%/day. Similar cases may also be found in the other topics, as with Infoseek on the robots topic, with a slope of the regression line of 0.1081%/day (see figure 4.26 and figure 4.27). The different search engines perform differently on the various topics. Table 4.4 shows the engine with the highest slope on the line of regression on the various topics for the two traditional style algorithms, and table 4.6 shows the lowest slopes. In each topic, a single search engine had the best or worst slope, independently of the learning algorithm used.
Figure 4.23: Running average of individual search engine recall, across all topics for the Rocchio algorithm. Recall is measured relative to the total number of relevant documents found by the metasearch engine using the Rocchio algorithm for learning.

Figure 4.24: Running average of individual search engine recall, across all topics for the Grigoris algorithm. Recall is measured relative to the total number of relevant documents found by the metasearch engine using the Grigoris algorithm for learning.
Table 4.3: Slopes of regression lines where the y axis is given as a percentage recall; thus the units are % recall/day from the start of the run. Results are across all topics, for the search engines AltaVista, DirectHit, Hotbot, Infoseek, Lycos, MSN, NationalDirectory, Snap, Thunderstone and Yahoo.
Figure 4.25: Running average of recall for the Yahoo search engine on the palm pilot topic using the Rocchio learning algorithm

Figure 4.26: Running average of recall for the Lycos search engine on the student topic using the Rocchio learning algorithm
The most and least improving engines do not necessarily match the best and worst performing search engines at the end of the tests. These are given in tables 4.5 and 4.7. The best engines for this job do not reflect the coverage each individual search engine has according to Lawrence and Giles [LG98]. It is also interesting to note that those search engines that share common structures, such as the use of DirectHit's search engine in the Hotbot and MSN search engines, have varying results, possibly due to different versions of the search engine, or different versions of databases, or different supplementary searches.

Figure 4.27: Running average of relative recall, Microsoft/DOJ topic, Rocchio algorithm. These graphs show a variety of patterns in the recall, for various search engines.
Table 4.4: Best improving individual engine recall per topic (Palm Pilot, Robots, Microsoft/DOJ, Students), for the two traditional style algorithms. Slopes are given as % recall/day. Engines appearing in the table include Infoseek, AltaVista, MSN and DirectHit.

Table 4.5: Best individual engine recall per topic, for the two traditional style algorithms. Engines appearing in the table include MSN, Snap, AltaVista and Lycos.

Table 4.6: Worst improvement in individual engine recall per topic, for the two traditional style algorithms. Slopes are given as % recall/day. Engines appearing in the table include DirectHit and Snap.

Table 4.7: Worst individual engine recall per topic, for the two traditional style algorithms. Engines appearing in the table include NationalDirectory, Thunderstone and Yahoo.
Chapter 5
Conclusions and Future Directions
5.1 Conclusions and Discussion
To answer the questions posed in the Introduction, the overall results of the previous chapter indicate that metasearch does seem to be effective, and learning a user profile also appears to increase the relevance of returned documents. The difference in precision between the baseline algorithms and the algorithms that use a changing user profile ranges from 15% to 30%. The T9P measure resulted in differences of 9 to 10 points, the T9U measure resulted in a difference of between 90 and 115, and the F3 measure resulted in differences between 3000 and 4000 points. Metasearch in a relevance feedback context, and with the implemented method for ranking, is much better than merely using any individual search engine, because certain search engines perform better than others on certain topics, and because of the increased coverage one obtains with metasearch. Using a user profile appears to help relevance, according to the various TREC measures and comparisons with the plain algorithm, with some caveats presented below. The Grigoris algorithm appears to perform at least as well as the Rocchio algorithm on the individual topics and performs significantly better than the Rocchio algorithm on the T9U, T9P, and F3 measures on all the topics taken together.
This study also obtained answers to questions that were not originally posed. For instance, the importance of achieving a fairly good profile in the first two to five runs was shown in the parameter tuning stage. There, a set of parameters that led a learner to obtain few results in the first few days led to poor learned profiles. This caused the learner to never see relevant results at all, because no relevant results were returned to the user. Thus, the user would be forced to indicate that all the documents were nonrelevant. Naturally, this led to the learner having no way to predict which future documents would be relevant, except through random selection of documents.
There are some objections that may be raised about the accuracy of the conclusions presented above. In particular, the use of the plain algorithm and plain profile as a baseline may be questionable. Similarly with the use of the random algorithm. Neither algorithm gives particularly good rankings. Furthermore, some existing search engines have their own method of 'one-step' relevance feedback. For instance, Google allows a searcher to find 'similar pages' to any that they find relevant. Other search engines have the potential to use implicit feedback, although this would probably occur in a batch fashion, unlike the work presented here. Search engines that fall into this category are those that present a list of links to other pages, but in which those links are actually links to a server operated by the same entity as the search engine itself. This other server can collect 'visited' statistics, and then redirect the user to the document they wish to view. It is difficult to make any comparisons with individual search engines simply because they are not designed to be used in the same scenario presented here.
As to the baseline measure, it would be difficult to come up with another benchmark. Other metasearch engines cannot be used for various reasons.
- They may not return enough results to make a repetitive query feasible, even with a general query such as "palm pilot."
- They may not combine the results of all the search engines, instead interleaving results in some manner.
- They may not allow the user to specify exactly which search engines to use, or if they do, they may not allow certain search engines that were used in this study.
The metasearch engine closest to being feasible for benchmarking purposes would be SavvySearch, which still fails for the first two reasons given. The main reason, however, is the first: no metasearch engine returns enough results to make a repetitive query worthwhile. There would be no new documents to view, and few, if any, changed ones. Thus, they would tend to return few or no documents for user ranking. Using the ranking scheme that the individual search engines use would form a good baseline, but these are not available for the obvious reason that they are proprietary, and are the basis on which people use a search engine. It would be possible to use SavvySearch (or some other metasearch engine) as the only 'helper' search engine used by the metasearch engine described here, but that would not be fair to SavvySearch's ranking algorithm.
Barring further objections, the results show that metasearch works well for a repetitive query, using forced relevance feedback to adjust the system rankings, and that metasearch works better than any individual search engine, in this context. A number of things may be done in the future to improve the relevancy of the results presented here and to explore other aspects of searching.
5.2 Future Directions
5.2.1 Implicit Ranking
One of the less savoury aspects of using this system from the user's point of view is that the user must provide relevance feedback, and in the case of the two traditional style algorithms, had to do so for 30 documents. This does not fit well with the scanning method that people use to view WWW documents [Nie97]. A better method would be implicit ranking: ranking that is done without the user having to press a button. For example, one could use the time between visits to the page of ranked documents, that is, the time between when a user follows a link to a ranked document to the time that a user next follows a link to another ranked document. Obviously, some maximum time would have to be instituted. The assumption here is that users visit relevant pages for longer periods of time than nonrelevant ones. Konstan et al. [KMM+97] show, with newsgroup articles, that there is a high correlation between the time spent reading and the explicit rating given to an article. This could even be used in addition to explicit measures of relevance. SavvySearch [DH96] used a visited/not visited measure of performance. This could easily be made into a boolean relevance ranking, and indicates the relevance of words presented in the text of the page that displays the ranked documents. This latter information could also be used as implicit relevance feedback.
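As a rough sketch of how such dwell-time feedback might be turned into a relevance signal (hypothetical: the threshold values and the mapping to a boolean judgment are illustrative assumptions, not something proposed in this thesis beyond the general idea):

// Hypothetical sketch: converting the time between visits to ranked documents
// into an implicit relevance judgment. The thresholds are illustrative only.
public final class ImplicitFeedback {

    private static final long MIN_DWELL_MILLIS = 20_000;  // below this: likely nonrelevant
    private static final long MAX_DWELL_MILLIS = 600_000; // above this: user probably left

    // clickTimeMillis: when the user followed the link to this ranked document.
    // nextClickTimeMillis: when the user next followed a link to another ranked document.
    // Returns null when no judgment should be inferred.
    public static Boolean inferRelevance(long clickTimeMillis, long nextClickTimeMillis) {
        long dwell = nextClickTimeMillis - clickTimeMillis;
        if (dwell <= 0 || dwell > MAX_DWELL_MILLIS) {
            return null; // outside the window: make no implicit judgment
        }
        return dwell >= MIN_DWELL_MILLIS; // long dwell taken as implicit relevance
    }
}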
5.2.2 Ontologies
As mentioned in Chapter 2, the user's query is expected to fit into some ontology. It might prove interesting to obtain some ontology, such as that from the Library of Congress or from an existing directory based search engine such as Open Directory or Yahoo!. One could also use a narrower field such as Computer Science, using some existing ontology (such as one from ResearchIndex, formerly CiteSeer [BLG98]). Using this ontology, one could create a more general profile, P_g, for a topic g by combining existing profiles which corresponded to topics below g in the ontology. Similarly, one could create more specific profiles by using existing general profiles. This would have to be done at the user's discretion, since specific profiles would not necessarily generalize, nor vice versa. The user could even specify the exact combination of profiles to use. The use of an ontology would be even more powerful through collaboration.
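A minimal sketch of one way such a combination could work, assuming profiles are the sum-normalized term-weight vectors used elsewhere in the thesis; the uniform averaging and the names are assumptions made for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: building a general profile P_g for an ontology node g
// by summing the profiles of the topics below g, then renormalizing so the
// weights sum to 1 (matching the vector convention used by the system).
public final class ProfileCombiner {

    public static Map<String, Double> combine(List<Map<String, Double>> childProfiles) {
        Map<String, Double> general = new HashMap<>();
        for (Map<String, Double> child : childProfiles) {
            for (Map.Entry<String, Double> e : child.entrySet()) {
                general.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        double total = general.values().stream().mapToDouble(Double::doubleValue).sum();
        if (total > 0.0) {
            general.replaceAll((term, w) -> w / total); // renormalize to sum to 1
        }
        return general;
    }
}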
5.2.3 Collaboration
Collaborative learning and recommendation has been used in a number of different systems [CGW99, BP98a, KSS97, BS97, KMM+97], and has shown good results. With ontologies, collaboration with respect to profiles might produce good results as well. The collaboration would involve having a common ontology used by all people using the metasearch engine. Different profiles from different people created under a topic in the ontology could be combined to produce a prototype profile, which could be of general use, particularly for new users, or for those users who do not wish to have their own, separate profile. People with similar profiles could also recommend other profiles to each other. Clustering techniques could be used on the profiles to generate a dynamic ontology to use, or merely be used to create a dynamic set of bootstrap profiles for new users.
5.2.4 Alternate Document or Feature Space
The current metasearch engine uses only the document space as the set of features to use when comparing documents to profiles. Other features could be used in the profiles and document representations, such as the linkage structure of retrieved documents. This is used in search engines such as Google [BP98b], Clever [KRRT99, KRRC99, CDK+99], and DirectHit [Dir], and is used to identify interesting documents, called authoritative and hub sites, based on how many documents link to them via URL references, and how many authoritative and hub sites the document itself links to. This creates a graph structure, which is analyzed to produce metrics that can be used as features.
One could also use features such as the grade level of a document, the number of words in the document, the number of links and images, the recency of the document, and indications of whether the paper is a research paper, among others [GLG+99, Mla99]. These would provide additional information and could provide additional insight into a user's criteria for relevance, which likely include things other than the text of a document.
5.2.5 Thresholds
The system described here uses a static threshold to determine when to stop giving documents to the user for ranking. Documents with a system ranking below this threshold are never shown to the user, even if fewer than 30 documents had been collected to be shown. This threshold could be dynamically generated. This could be done by monitoring current performance, such as one of the F3, T9P or T9U measures, and altering the threshold based on the value of, or the changes in, those measures.
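A sketch of what such a dynamically adjusted threshold might look like (hypothetical: the step size, bounds, adjustment direction, and the choice of which measure drives the adjustment are all assumptions made for illustration):

// Hypothetical sketch: nudging the score threshold that decides whether a
// ranked document is shown, based on whether a chosen performance measure
// (e.g. T9P) improved or declined since the previous day.
public class DynamicThreshold {

    private double threshold;
    private double previousMeasure = Double.NaN;
    private static final double STEP = 0.01;
    private static final double MIN_T = 0.0, MAX_T = 1.0;

    public DynamicThreshold(double initialThreshold) {
        this.threshold = initialThreshold;
    }

    // Called once per day with the latest value of the monitored measure.
    public void update(double todaysMeasure) {
        if (!Double.isNaN(previousMeasure)) {
            if (todaysMeasure < previousMeasure) {
                threshold += STEP;  // measure declined: be more selective
            } else if (todaysMeasure > previousMeasure) {
                threshold -= STEP;  // measure improved: allow more documents through
            }
            threshold = Math.min(MAX_T, Math.max(MIN_T, threshold));
        }
        previousMeasure = todaysMeasure;
    }

    public boolean shouldShow(double systemRanking) {
        return systemRanking >= threshold;
    }
}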
5.2.6 Alternative Methods of Learning
Other learners might prove to be more effective on this task. For instance, the palm pilot topic might be more easily learned by a system that used several learning agents, each of which would learn a specialized profile. One could be good at retrieving results on hardware accessories, while one could be good at retrieving results on free productivity software. Each agent would learn a local version of the more general profile. This could lead to a better ability to discriminate between relevant and irrelevant documents, because each agent would have a full sized profile representing a local version of the general one. This would also provide a kind of symmetry: the metasearch uses metasearch as the learning component. One candidate for this type of learning is SIGMA [KF96].
5.2.7 Miscellaneous Improvements and Directions
In the course of using the metasearch engine. and in analyzing the results. several points
of potential improwment have corne to light.
1. To increase the precision at the beginning. it might be usehl to implement a system
in ahich. having reached a peak (as detected bu the subsequent decline in precision).
the system woiild rerank those documents that hacl been seen before. but were
unranked bj- the user. This accounts for the second spike as net ailed in section 4.6.
2. .An alternative to the above would be to have a system that alii-ays rerankecl those
clociiments rhat had been seen before. btit were tinranked by rhe user.
3. There needs to be a better mechanism to detect changed documents such that a
document would be regarded as unchanged if it were changed in a trivial manner,
such as a date change, or a single number or word change. This would prevent some
nonrelevant documents from affecting the precision measures. One such mechanism
might be to only use a sample of the data in the checksum, such as the 30 bytes
of data surrounding the most common terms of the document. Alternatively, the
similarity measure between the VSM versions of a potentially changed document
could also be used. (A sketch of such a sampled checksum appears after this list.)
4. Aalbersberg [IJA92] obtained favorable results with incremental relevance feedback,
where the user only gave a relevance ranking for a single document at a time. This
might alleviate some of the strain mentioned in section 5.2.1. The context of the
problem presented in that paper was slightly different, however, so it might not readily
apply to the situation outlined here.
5. It is possible that using the random algorithm at the beginning of a run would
always produce better results than using the plain algorithm. Certainly, the graphs
in figures 4.3 and 4.2 suggest that using the random algorithm for at least the first
day would be better than using the initial query as the profile (i.e., using the plain
profile).
6. It should be possible for learners to escape from any local minima they encounter
when using a poor profile. This means that every learner needs to have the ability
to revert to old profiles or use old profiles in some way in order to explore the space
of possible profiles as a means of escaping the local minima.
7. Examination of the feature space used might prove fruitful. For instance, instead
of merely using two word phrases, one could determine the average distance, in
the document, between words that are in the profile. A low average distance could
indicate increased relevance; a sketch of this proximity feature also appears after
this list. Of course, as mentioned in the introduction, other systems have used other
features as well.
8. To better use the resources available, the interruption mechanism mentioned in
section 3.3 could be implemented for interrupting calls that blocked on socket
communications (i.e., network communications).
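The following sketch illustrates the sampled-checksum idea from item 3: only small windows of text around the most frequent terms contribute to the checksum, so trivial edits elsewhere in the document leave it unchanged. The class name, the window size, and the number of sampled terms are illustrative assumptions.

// Minimal sketch of a sampled checksum for change detection. Only windows of text
// around the most frequent terms are hashed, so a changed date or counter elsewhere
// in the document does not alter the checksum.
import java.util.*;

public class SampledChecksum {
    public static long checksum(String text, int topTerms, int window) {
        // Count term frequencies.
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+"))
            if (!token.isEmpty())
                freq.merge(token, 1, Integer::sum);

        // Pick the most frequent terms.
        List<String> terms = new ArrayList<>(freq.keySet());
        terms.sort((a, b) -> freq.get(b) - freq.get(a));
        terms = terms.subList(0, Math.min(topTerms, terms.size()));

        // Hash the window of characters around the first occurrence of each term.
        long hash = 0L;
        String lower = text.toLowerCase();
        for (String term : terms) {
            int pos = lower.indexOf(term);
            if (pos < 0) continue;
            int from = Math.max(0, pos - window);
            int to = Math.min(text.length(), pos + term.length() + window);
            hash = 31 * hash + text.substring(from, to).hashCode();
        }
        return hash;
    }
}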
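A second sketch illustrates the proximity feature suggested in item 7: the average distance, in token positions, between occurrences of profile terms in a document. The exact definition of distance used here, and the class and method names, are assumptions made for illustration.

// Minimal sketch of a proximity feature: the average gap between consecutive
// occurrences of profile terms in a document. Low values suggest higher relevance.
import java.util.*;

public class ProfileTermProximity {
    /** Returns the average gap between consecutive profile-term occurrences,
     *  or Double.MAX_VALUE if fewer than two profile terms occur. */
    public static double averageDistance(String text, Set<String> profileTerms) {
        String[] tokens = text.toLowerCase().split("\\W+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++)
            if (profileTerms.contains(tokens[i]))
                positions.add(i);
        if (positions.size() < 2) return Double.MAX_VALUE;
        double totalGap = 0.0;
        for (int i = 1; i < positions.size(); i++)
            totalGap += positions.get(i) - positions.get(i - 1);
        return totalGap / (positions.size() - 1);
    }
}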
Bibliography
[All96] J. Allan. Incremental relevance feedback for information filtering. In ACM SIGIR Conference, August 1996. Zurich, Switzerland.
[Bar94] Carol L. Barry. User-defined relevance criteria: An exploratory study. Journal of the American Society for Information Science, 45(3):149-159, 1994.
[BBC98] Ana B. Benitez, Mandis Beigi, and Shih-Fu Chang. Using relevance feedback in content-based image metasearch. IEEE Internet Computing, 2(4):59-69, July/August 1998.
[BCF+98] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and Zipf-like distributions: Evidence and implications. Technical Report 1371, Computer Sciences Department, University of Wisconsin-Madison, April 1998.
[BDM+95] Mic Bowman, Peter B. Danzig, Udi Manber, Michael F. Schwartz, Darren R. Hardy, and Duane P. Wessels. Harvest: A scalable, customizable discovery and access system. Technical report, University of Colorado-Boulder, 1995.
[BLG98] Kurt D. Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In Autonomous Agents '98. ACM, 1998.
[Boo98] Gary Boone. Concept features in Re:Agent, an intelligent email agent. In Proceedings of the Second International Conference on Autonomous Agents, pages 141-143, 1998.
[BP98a] D. Billsus and M. Pazzani. Learning collaborative information filters. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 46-54. Morgan Kaufmann, 1998.
[BP98b] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Seventh International World Wide Web Conference, Brisbane, Australia, 1998.
[BS95] M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, March 1995.
[BS97] M. Balabanovic and Y. Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3):66-72, March 1997.
[BSA94] C. Buckley, G. Salton, and J. Allan. The effect of adding relevance information in a relevance feedback environment. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, 1994.
[BSY95] M. Balabanovic, Y. Shoham, and Y. Yun. An adaptive agent for automated web browsing. Journal of Visual Communication and Image Representation, 6(4), 1995. http://www.diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1995-0023.
[But00] Declan Butler. Souped-up search engines. Nature, 405:112-115, May 2000.
[Cal98] J. Callan. Learning while filtering documents. In Proceedings of the ACM SIGIR Conference, 1998.
[CDK+99] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg. Mining the Web's link structure. IEEE Computer, 32(8):60-67, August 1999.
[CGM+99] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes, and Matthew Sartin. Combining content-based and collaborative filters in an online newspaper. ACM SIGIR Workshop on Recommender Systems, August 1999. Berkeley, CA.
[Cha99] Brian D. Chambers. Adaptive Bayesian information filtering. Master's thesis, University of Toronto, 1999.
[Col00] Christian Collberg, 2000. Colloquium at University of Toronto.
[CS98] Liren Chen and Katia Sycara. WebMate: A personal agent for browsing and searching. In Autonomous Agents '98, pages 132-139. ACM, 1998.
[DH96] Daniel Dreilinger and Adele E. Howe. An information gathering agent for querying web search engines. Technical Report CS-96-111, Computer Science Department, Colorado State University, 1996.
[Dir] Direct Hit. http://www.directhit.com.
[FFM92] Tim Finin, Rich Fritzson, and Don McKay. A language and protocol to support intelligent agent interoperability. In Proceedings of the CE & CALS Washington '92 Conference, June 1992.
[FLM97] Tim Finin, Yannis Labrou, and James Mayfield. KQML as an agent communication language. In Software Agents. MIT Press, Cambridge, 1997.
[GLG+99] Eric J. Glover, Steve Lawrence, Michael D. Gordon, William P. Birmingham, and C. Lee Giles. Recommending Web documents based on user preferences. In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation, 1999.
[Goo] Google. http://www.google.com.
[Gus97] Paul Gustafsen. November 1997. Lecture notes from Statistics 303, UBC.
[Har92] Donna Harman. Relevance feedback revisited. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1992.
[HC93] D. Haines and W. B. Croft. Relevance feedback and inference networks. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2-11, 1993.
[HD97] A. Howe and D. Dreilinger. SavvySearch: A metasearch engine that learns which search engines to query. AI Magazine, 18(3), 1997.
[Hul99] David A. Hull. The TREC-7 filtering track: Description and analysis. In E. M. Voorhees and D. Harman, editors, The Seventh Text REtrieval Conference (TREC-7), pages 33-56. Department of Commerce, National Institute of Standards and Technology, 1999.
[Ide71] E. Ide. New experiments in relevance feedback. In Salton [Sal71], pages 337-354.
[IJA92] IJsbrand Jan Aalbersberg. Incremental relevance feedback. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1992.
[Jan97] James Jansen. Using an intelligent agent to enhance search engine performance. First Monday, 2(3), March 1997. http://www.firstmonday.dk/issues/issue2_3/jansen/index.html.
[Joa97] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proc. of the 14th International Conference on Machine Learning (ICML 97), pages 143-151, 1997.
[Kee92] E. Keen. Term position ranking: Some new test results. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 66-76, 1992. Available at http://www.acm.org/pubs/contents/proceedings/ir/133160/.
[KF96] Grigoris J. Karakoulas and Innes A. Ferguson. A computational market model for multi-agent learning. In AAAI 96 Fall Symposium on Learning Complex Behaviors in Adaptive Intelligent Systems. AAAI Press, 1996.
[KF98] Grigoris J. Karakoulas and Innes A. Ferguson. Applying SIGMA to the TREC-7 filtering track. Unpublished paper obtained from Grigoris Karakoulas, 1998.
[KKR+99] J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models and methods. In Proceedings of the International Conference on Combinatorics and Computing, 1999.
[KMM+97] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, March 1997.
[KRR+99] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In Proceedings of the International Conference on Very Large Databases, Edinburgh, Scotland, 1999.
[KSS97] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3), March 1997.
[Leb97] Alexander Lebedev. Best search engines for finding scientific information in the web. Web authored, May 1997. Web address http://www.chem.msu.su/eng/comparison.html.
[LF97] Yannis Labrou and Tim Finin. A proposal for a new KQML specification. Technical Report TR CS-97-03, Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County, Baltimore, MD 21250, February 1997.
[LG98a] Steve Lawrence and C. Lee Giles. Context and page analysis for improved Web search. IEEE Internet Computing, 2(4), 1998.
[LG98b] Steve Lawrence and C. Lee Giles. Inquirus, the NECI meta search engine. In Seventh International World Wide Web Conference, pages 95-105, Brisbane, Australia, 1998. Elsevier Science.
[LG98c] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98, 1998.
[LG99] Steve Lawrence and C. Lee Giles. Accessibility of information on the web. Nature, 400(6740):107-109, 1999.
[LGB99] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999. Working system available at http://citeseer.nj.nec.com/cs.
[Mla99] Dunja Mladenic. Text-learning and related intelligent agents: A survey. IEEE Intelligent Systems, pages 44-54, July 1999.
[MLY+99] W. Meng, K. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the usefulness of search engines. In 15th International Conference on Data Engineering (ICDE '99), Sydney, Australia, March 1999.
[MM00] Alvin Moore and Brian H. Murray. Sizing the internet. July 2000. http://www.cyveillance.com/newsroom/pressr/000710.asp.
[Nie97] Jakob Nielsen. How users read on the web. October 1997. http://www.useit.com/alertbox/9710a.html.
[Nie98] Jakob Nielsen. Why Yahoo is good (but may get worse). November 1998. http://www.useit.com/alertbox/981101.html.
[Nie99a] Jakob Nielsen. July 1999. http://www.useit.com/hotlist/spotlight1999q234.html.
[Nie99b] Jakob Nielsen. 'Top ten mistakes' revisited three years later. May 1999. http://www.useit.com/alertbox/990502.html.
[Nie00] Jakob Nielsen. Is navigation useful? January 2000. http://www.useit.com/alertbox/20000109.html.
[NIH+99] Yoshiki Niwa, Makoto Iwayama, Toru Hisamitsu, Shingo Nishioka, Akihiko Takano, Hirofumi Sakurai, and Osami Imaichi. Interactive document search with DualNAVI. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, pages 123-130, August 1999. Tokyo, Japan.
[Ope] Open Directory. http://www.dmoz.org.
[Par94] Taemin Kim Park. Toward a theory of user-based relevance: A call for a new paradigm of inquiry. Journal of the American Society for Information Science, 45(3):135-141, 1994.
[PB97] M. Pazzani and D. Billsus. Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27:313-331, 1997.
[PB99] Michael J. Pazzani and Daniel Billsus. Evaluating adaptive web site agents. Workshop on Recommender Systems: Algorithms and Evaluation, 22nd International Conference on Research and Development in Information Retrieval, 1999.
[Pol99] Gabriela Polticova. Recommending HTML documents using feature guided automated collaborative filtering. In Johann Eder, Ivan Rozman, and Tatjana Welzer, editors, ADBIS Short Papers, pages 81-87. Institute of Informatics, Faculty of Electrical Engineering and Computer Science, Smetanova 17, SI-2000 Maribor, Slovenia, 1999.
[Por80] M. Porter. An algorithm for suffix stripping. Program: Automated Library and Information Systems, 14(3):130-137, 1980.
[RH00] Stephen Robertson and David A. Hull. Guidelines for the TREC-9 filtering track. http://www.soi.city.ac.uk/~ser/filterguide.htm.
[Roc71] Joseph J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[Sah98] Mehran Sahami. Using Machine Learning to Improve Information Access. PhD thesis, Stanford University, December 1998.
[Sal71] Gerard Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971.
[SB90] Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.
[Sch00] Mathew Schwartz. Sharper staples. June 2000. http://www.computerworld.com/cwi/story/0,1199,STO45787,00.html.
[SH99] Gabriel L. Somlo and Adele E. Howe. Agent-assisted internet browsing. In Proceedings of the Workshop on Intelligent Information Systems at the 16th National Conference on Artificial Intelligence (AAAI '99), 1999.
[SM83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, New York, 1983.
[Sta00] StatMarket. StatMarket search engine ratings. June 2000. http://www.searchenginewatch.com/reports/statmarket.html.
[Su94] Louise T. Su. The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45(3):207-217, 1994.
[Sul00a] Danny Sullivan. Media Metrix search engine ratings. March 2000.
[Sul00b] Danny Sullivan. NPD search & navigation study. June 2000.
[Sul00c] Danny Sullivan. Search engine alliances chart. June 2000. http://searchenginewatch.com/reports/.
[Sul00d] Danny Sullivan. Survey reveals search habits. June 2000. http://searchenginewatch.com/sereport/00/06-r.html.
[Tog98] Bruce Tognazzini. Scaling information access. August 1998. http://www.asktog.com/columns/008scaleinfo.html.
[ZFJ97] L. Zhang, S. Floyd, and V. Jacobson. Adaptive web caching. In NLANR Web Cache Workshop, June 1997. http://www-nrg.ee.lbl.gov/floyd.