
Information Retrieval, 8, 101–128, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

System Performance and Natural Language Expression of Information Needs

WALTER LIGGETT [email protected]
National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

CHRIS BUCKLEY
SabIR Research, Inc., Gaithersburg, MD 20878, USA

Received March 14, 2003; Revised September 25, 2003; Accepted October 6, 2003

Abstract. Consider information retrieval systems that respond to a query (a natural language statement of a topic, an information need) with an ordered list of 1000 documents from the document collection. From the responses to queries that all express the same topic, one can discern how the words associated with a topic result in particular system behavior. From what is discerned from different topics, one can hypothesize abstract topic factors that influence system performance. An example of such a factor is the specificity of the topic’s primary key word. This paper shows that statements about the effect of abstract topic factors on system performance can be supported empirically. A combination of statistical methods is applied to system responses from NIST’s Text REtrieval Conference. We analyze each topic using a measure of irrelevant-document exclusion computed for each response and a measure of dissimilarity between relevant-document return orders computed for each pair of responses. We formulate topic factors through graphical comparison of measurements for different topics. Finally, we propose for each topic a four-dimensional summarization that we use to select topic comparisons likely to depict topic factors clearly.

Keywords: multidimensional scaling, performance indicators, query expansion, research method, term weighting

1. Introduction

A prominent information retrieval (IR) task is search of a known, stable collection of documents for those relevant to a novel, unanticipated topic. Comparative evaluation of IR systems in their performance of this task is one of the purposes of the Text Retrieval Conference (TREC) (Voorhees and Harman 2000). The basis of this evaluation is a document collection, a group of topics (information needs), and associated expert judgements of the retrieved documents as relevant or not relevant. In the TREC Ad Hoc task, each system provides a response for each topic in the group. A response is a search result, i.e., a list of 1000 documents that the system orders by pertinence to the input request, the query. The query may be the topic itself or a restatement. Assessment of the document lists in terms of the relevance judgements provides a gauge for system evaluation.

Consider different queries that are all intended to be natural-language expressions of the same topic. As the definition of a topic, one might point to the statement of the topic that the TREC assessor prepared originally or to the documents in the collection that the assessor judged relevant. Of course, different queries are not all equally faithful to the topic. Nevertheless, a set of different queries for each topic can be used to obtain a set of document lists from a system and thus a more extensive output with which to characterize system behavior. The TREC evaluation in 2000 contains a task called the Query Track (Voorhees and Harman 2001) in which, for each topic, a set of queries is input to each system. Multiple queries help capture the variety of different ways users with the same information need may address a system. In this paper, we show how to analyze the document lists produced by multiple queries and illustrate our methods with results from the Query Track.

The simplest way to compare two document lists for the same topic is to compute for each list a single-list performance measure such as average precision and take the difference. Clearly, however, the differences between lists are much more involved. Not only are positions in each list occupied by documents marked as relevant or not, but also some documents are common to both lists, and these have nearly the same or quite different ranks in each list. In the case of multiple queries for a single topic, more involved ways of analyzing list-to-list differences reveal more about system differences. For example, from the Query Track results, one can see the relation between the words available to express a topic and the relative behavior of different systems. In this paper, we show how further insight can be obtained by viewing the system responses not only in terms of a single-list performance measure but also in terms of a list-to-list dissimilarity measure (Banks et al. 1999).

A more complete understanding of the relation between topic expression and system behavior may be interesting but does not completely satisfy the goal of system evaluation. The goal is insight into the relative system performance for topics that the systems might have to process in the future. For example, Liggett and Buckley (2001) portray system behavior for four topics. For one topic, they show that the systems fail to one degree or another to equate “weather-related deaths” with “storm-related deaths” as is appropriate in the context of the topic. The question is, “How can one generalize the system behavior that this example suggests?” In Section 2, we compare three topics and show that such generalization is possible. We hypothesize an abstract topic factor that is present at various levels in all topics and that governs, in part, relative system behavior. Such an abstract topic factor is a latent variable in statistical terminology (Bartholomew and Knott 1999).

In addition to the topic factor discussed in Section 2, other topic factors influence system behavior. In Section 3, we describe a method for finding additional factors. Such a method seems necessary in light of the effort that would be needed to compare all topics in the manner of Section 2. The method is based on the idea that topic factors can be most easily seen by contrasting topics that are very different. To gauge topic difference, we introduce and compute for each topic four indicators of the system responses to the query set for the topic. One of these is the average of a single-list performance measure. The others are based on a list-to-list dissimilarity measure.

Some of the topics indicated in Section 3 we analyze in Section 4. We identify additional topic factors, although we are not able to characterize them as well as in Section 2. We also find a few topics with corresponding system behavior that we cannot generalize.

In Section 5, we draw some conclusions about what progress we have made in understanding topic factors and how this understanding might be applied. We also comment on what might be gained from data sets collected in the future.


2. Topic factors

The purpose of this paper is to show that topic factors are useful in evaluation of IR systems. The term “topic factor” we borrow from engineering experiments, which are usually conceptualized in terms of an observed quantity that is a function of independent variables called factors. The observed quantity consists of indicators of system behavior such as those discussed in the next section. A topic factor is something measurable about each topic of interest. At this stage in their development, topic factors are not as concretely defined as factors in physical sciences experiments. Further work on topic-factor definitions is necessary as part of the use of topic factors in conceptualizing IR systems.

In this paper, we focus on topic factors useful in the comparison of two systems and carry along another four systems to provide context. Please see the disclaimer at the end of the paper. We use the Query Track designations for the systems (Buckley 2001). The systems we compare are “Sabe” and “ok9u.” The other four systems that appear in our figures are “Saba,” “humA,” “IN7a,” and “IN7e.” These systems are described in the TREC-9 proceedings (Allan et al. 2001, Buckley and Walz 2001, Robertson and Walker 2001, Tomlinson and Blackwell 2001). Note that because these systems were configured specially for the Query Track, one cannot infer from their performance the performance of a commercially available system.

We focus on “Sabe” and “ok9u” because their difference provides an interesting illustration of the methods discussed in this paper. One system, “Sabe,” includes query expansion, whereas the other, “ok9u,” does not. As implemented in a well-regarded class of information retrieval systems, the document list is computed by first deriving a set of terms (key words and phrases) with associated weights from the input query and then using this set with its weights to select and rank documents from the collection. The first step may involve a procedure based on examining documents judged particularly relevant in an initial search of the document collection. User selection of these documents is called relevance feedback (Berry and Browne 1999), and automatic selection of these documents, as in the case of “Sabe,” is called blind feedback. This procedure is intended to uncover additional terms pertinent to the need for information and to revise the weighting for the set of terms finally used. Selecting additional terms may be done by the user or, as in the case of “Sabe,” automatically by the system, in which case the procedure is called (automatic) query expansion.
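As background only, the following is a minimal sketch of generic blind feedback with automatic query expansion, not the actual “Sabe” implementation; the function and parameter names (search_fn, doc_terms_fn, n_feedback_docs, and so on) are hypothetical stand-ins introduced purely for illustration.

```python
# A generic, simplified sketch of blind feedback / automatic query expansion:
# run the weighted query, treat the top-ranked documents as pseudo-relevant,
# add their most frequent terms to the query with modest weights, then re-run.
from collections import Counter

def expand_query(query_terms, search_fn, doc_terms_fn,
                 n_feedback_docs=10, n_new_terms=20, new_term_weight=0.5):
    """query_terms: dict mapping term -> weight.
    search_fn(query_terms) -> ranked list of document ids (hypothetical).
    doc_terms_fn(doc_id) -> list of terms in that document (hypothetical)."""
    top_docs = search_fn(query_terms)[:n_feedback_docs]   # initial (blind) retrieval
    counts = Counter()
    for doc in top_docs:
        counts.update(doc_terms_fn(doc))                  # tally feedback-document terms
    expanded = dict(query_terms)
    for term, _ in counts.most_common():
        if len(expanded) >= len(query_terms) + n_new_terms:
            break
        if term not in expanded:
            expanded[term] = new_term_weight              # add an expansion term
    return expanded                                       # re-run search_fn with this query
```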

2.1. Topic 57

We begin our discussion of topic 57 with the topic statement, which is shown in Table 1. The topic statement, which an assessor creates as the first step in the Ad Hoc task, has three roles. First, as part of TREC-1, where the topics considered here were first introduced, the topic statement was distributed to the participants, who returned document lists from which were extracted the documents assessed for relevance. Second, the topic statement is the TREC-1 assessor’s understanding of what constitutes a relevant document. Third, as part of the Query Track in TREC-9, the topic statement was used to create a set of queries, alternative natural language expressions of the topic.


Table 1. Statement of topic 57.

Domain: U.S. Economics

Topic: MCI

Description:

Document will discuss how MCI has been doing since the Bell System breakup.

Summary:

Document will discuss how MCI (Multiport Communications Interface) has been doing since the Bell System breakup.

Narrative:

A relevant document will discuss the financial health of MCI Communications Corp. since the breakup of the Bell System (AT&T and the seven regional Baby Bells) in January 1984. The status indicated may not necessarily be a direct or indirect result of the breakup of the system and ensuing regulation and deregulation of Ma Bell or of the restrictions placed upon the seven Bells; it may result from any number of factors, such as advances in telecommunications technology, MCI initiative, etc. MCI’s financial health may be reported directly: a broad statement about its earnings or cash flow, or a report containing financial data such as a quarterly report; or it may be reflected by one or more of the following: credit ratings, share of customers, volume growth, cuts in capital spending, $$ figure net loss, pre-tax charge, analysts’ or MCI’s own forecast about how well they will be doing, or MCI’s response to price cuts that AT&T makes at its own initiative or under orders from the Federal Communications Commission (FCC), such as price reductions, layoffs of employees out of a perceived need to cut costs, etc. Daily OTC trading stock market and monthly short interest reports are NOT relevant; the inventory must be longer term, at least quarterly.

Concept(s):

1. MCI Communications Corp.
2. Bell System breakup
3. Federal Communications Commission, FCC
4. regulation, deregulation
5. profits, revenue, net income, net loss, write-downs
6. NOT daily OTC trading, NOT monthly short interest

Factor(s):

Time: after January 1984

Thought of as the truth against which systems are evaluated, the topic can be characterized by the topic statement or by the documents in the collection designated as relevant. In the collection used for the Query Track, there are 280 relevant documents for topic 57. As a caveat to results reported below, we note that there may be documents in the collection that should be designated as relevant but were missed by the systems used in TREC-1 and therefore not assessed for relevance. In this regard, note the absence in the description and summary for topic 57 (Table 1) of the word “financial,” which we mention below in our discussion of topic 57. Because some TREC-1 systems did not use the entire topic statement as input, this absence might have resulted in documents that should have been judged relevant but that were never presented to the assessors.

We begin our analysis of topic 57 with a single-list measure of how well irrelevant documents are excluded from the list. We compare this measure to average precision in the Appendix. Our measure is essentially the number of irrelevant documents in the list that are returned before 25 percent of the relevant documents are returned. We denote this number by N25. Our measure is a transformation of N25, namely −log10(N25 + 1). Note that our measure gives a higher value with better performance. It gives the value 0 if there are no irrelevant documents in the list before 25 percent of the relevant documents are returned, the value −1 if there are 9, the value −2 if there are 99, and the value −3 if there are 999. One might expect that the variation of N25 in absolute terms is less meaningful when N25 is larger. The logarithm compensates for this.

To define our measure of exclusion of irrelevant documents, as well as the indicators discussed in Section 3, we introduce some mathematical notation. For each topic, our measure is computed from the 1000-document lists and from the designations of documents as relevant or not relevant to the topic. Let the relevant documents for a topic be indexed by i, where i = 1, . . . , nR. Thus, the values of i correspond one-to-one with the identifiers of the relevant documents. For a particular list, either relevant document i is returned and its position in the list is given by ri, or it is not returned and we can give its position as ri = 1001. Let the number of relevant documents returned be n (n ≤ nR). As the first step, we reduce each 1000-document list to a sequence {r1, . . . , rnR}. For i ≤ n, let r(i) be the ith position in the 1000-document list occupied by a relevant document. Sorting {r1, . . . , rnR} in ascending order gives {r(1), . . . , r(nR)}, where r(1) < · · · < r(n) < r(n+1) = · · · = r(nR). This sorting is the basis for our irrelevant-exclusion measure.

There are some details that are important only in programming our irrelevant-exclusion measure. The number of irrelevant documents that occur before one quarter of the relevant documents is roughly r(q) − .25nR, where q is the integer part of .25nR, that is, the largest integer such that q ≤ .25nR. If Δ = .25nR − q is greater than 0 and if q < n, then instead of using r(q) we interpolate using (1 − Δ)r(q) + Δr(q+1). If q > n, then we compute the extrapolation .25nR r(n)/n, and if its value is greater than 1000, we use it. What we do in other cases is shown below. On the basis of these considerations, our count of irrelevant documents is

$$
N_{25} =
\begin{cases}
(1 - \Delta)\,r_{(q)} + \Delta\,r_{(q+1)} - .25\,n_R & \text{if } 0 < q < n \\[4pt]
\max\bigl[\,.25\,n_R\,r_{(n)}/n,\ (1 - \Delta)\,r_{(n)} + \Delta \cdot 1001\,\bigr] - .25\,n_R & \text{if } 0 < q = n \\[4pt]
\max\bigl[\,.25\,n_R\,r_{(n)}/n,\ 1001\,\bigr] - .25\,n_R & \text{if } 0 < n < q \\[4pt]
\infty & \text{otherwise}
\end{cases}
$$

In the data considered here, the fourth case, the case n = 0, does not actually occur.

Performance in terms of our irrelevant-exclusion measure is shown in figure 1 for twenty queries. We consider 20 of the 43 queries available for each topic, eliminating ones that are near duplicates of others and selecting, in order, the ones that give the best performance. To gauge query performance for this purpose, we compute for each query and system the fraction of relevant documents returned (n/nR), usually referred to as recall at 1000, and then average over the six systems. We use six systems so that a single system has less effect on whether or not a query is included.
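As a concrete illustration of the exclusion measure defined above, here is a minimal Python sketch (our own code, not the authors’); it assumes a response is given as a list of document identifiers in returned order and the relevance judgements as a set of identifiers.

```python
import math

def exclusion_measure(ranked_docs, relevant_ids, list_len=1000):
    """Irrelevant-exclusion measure -log10(N25 + 1) for one ranked list.

    ranked_docs: document identifiers in the order returned (at most list_len of them).
    relevant_ids: set of identifiers judged relevant to the topic (size nR).
    """
    nR = len(relevant_ids)
    # 1-based positions of the returned relevant documents, ascending:
    # r(1) < ... < r(n) in the paper's notation.
    r = sorted(pos for pos, doc in enumerate(ranked_docs, start=1)
               if doc in relevant_ids)
    n = len(r)
    q = int(0.25 * nR)              # integer part of .25 nR
    delta = 0.25 * nR - q

    if 0 < q < n:
        n25 = (1 - delta) * r[q - 1] + delta * r[q] - 0.25 * nR
    elif 0 < q == n:
        n25 = max(0.25 * nR * r[n - 1] / n,
                  (1 - delta) * r[n - 1] + delta * (list_len + 1)) - 0.25 * nR
    elif 0 < n < q:
        n25 = max(0.25 * nR * r[n - 1] / n, list_len + 1) - 0.25 * nR
    else:
        return float("-inf")        # the fourth case (n = 0): N25 is taken as infinite
    return -math.log10(n25 + 1)
```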

Figure 1. Irrelevant exclusion measure for topic 57, 20 queries and 6 systems. “Sabe”—diamond, “ok9u”—square, others—circles.

Reading the queries in figure 1 shows that “MCI” is the primary key word for this topic. Comparing systems, we see that the performance difference between “Sabe” and “ok9u” is positive for all queries shown. Looking at the relation between performance difference and query wording suggests that the words “Bell” and “financial” influence performance. The word “Bell” occurs four times, the word “financial” occurs ten times, and they occur together once. Generally speaking, “Bell” decreases the performance difference even when “financial” is present, and “financial” increases the difference largely by improving the performance of “Sabe.” That only a few words have predictable effects on the “Sabe”-“ok9u” performance difference seems to be a predominant characteristic of this topic and query set.

We now consider list-to-list differences from the perspective of the order in which the relevant documents are returned. Recall that each value of i corresponds to a particular relevant document and that its position in a list is given by ri, where ri = 1001 if document i is not returned. Let Ri be the position of relevant document i in the sub-list consisting of only the relevant documents. If relevant document i is not returned, let Ri = (nR + n + 1)/2. Thus, Ri is the rank of ri in the set of numbers r1, . . . , rnR. Note that replacing {r1, . . . , rnR} with ranks {R1, . . . , RnR} is quite different from sorting {r1, . . . , rnR} as is done in computing a single-list performance measure.

The return order of relevant documents {R1, . . . , RnR} can be used to compare system responses for the same topic through computation of dissimilarities. Our choice of dissimilarity is obtained from Spearman’s coefficient of rank correlation (Gibbons 1985).


This choice is effective for the Query Track data. Banks et al. (1999) mention other possible choices. Consider two 1000-document lists, with $n^{(p)}$ relevant documents returned in the first and $n^{(q)}$ in the second, and with relevant-document return orders $\{R_1^{(p)}, \ldots, R_{n_R}^{(p)}\}$ for the first and $\{R_1^{(q)}, \ldots, R_{n_R}^{(q)}\}$ for the second. Spearman’s coefficient of rank correlation adjusted for the relevant documents not returned is given by

$$
s_{pq} = \frac{n_R^3 - n_R - 6\sum_{i=1}^{n_R}\bigl(R_i^{(p)} - R_i^{(q)}\bigr)^2 - \bigl(U^{(p)} + U^{(q)}\bigr)/2}
{\sqrt{\bigl[n_R^3 - n_R - U^{(p)}\bigr]\bigl[n_R^3 - n_R - U^{(q)}\bigr]}},
$$

where

$$
U^{(p)} = \bigl(n_R - n^{(p)}\bigr)^3 - \bigl(n_R - n^{(p)}\bigr), \qquad
U^{(q)} = \bigl(n_R - n^{(q)}\bigr)^3 - \bigl(n_R - n^{(q)}\bigr).
$$

Converting spq, which is a similarity measure, to our dissimilarity measure, we obtain

$$
\delta_{pq} = \sqrt{1 - s_{pq}}.
$$

Note that in our choice of 20 queries from the 43, our purpose in eliminating duplicate queries is to avoid different queries with δpq = 0 for some system.

When all the relevant documents are returned (n = nR), Spearman’s coefficient of rank correlation is just the familiar correlation coefficient computed with the ranks $\{R_1^{(p)}, \ldots, R_{n_R}^{(p)}\}$ and $\{R_1^{(q)}, \ldots, R_{n_R}^{(q)}\}$:

$$
s_{pq} = \frac{\sum_{i=1}^{n_R}\bigl(R_i^{(p)} - \bar R^{(p)}\bigr)\bigl(R_i^{(q)} - \bar R^{(q)}\bigr)}
{\Bigl[\sum_{i=1}^{n_R}\bigl(R_i^{(p)} - \bar R^{(p)}\bigr)^2 \sum_{i=1}^{n_R}\bigl(R_i^{(q)} - \bar R^{(q)}\bigr)^2\Bigr]^{1/2}},
$$

where $\bar R^{(p)}$ and $\bar R^{(q)}$ are the means of the ranks. We take the ranks of the relevant documents not returned as all the same, that is, as tied. We include the necessary adjustment for tied ranks (Gibbons 1985).
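A sketch of this dissimilarity computation in Python (again our own illustration, not the authors’ code); it assumes each response is summarized by the positions ri of the nR relevant documents, with 1001 standing in for documents that are not returned.

```python
import numpy as np

def return_order_ranks(positions, nR):
    """Ranks R_1,...,R_nR of the relevant documents within the relevant-only sub-list.

    positions: length-nR array with r_i, the position of relevant document i in the
    1000-document list, or 1001 if document i is not returned.  Documents that are
    not returned all share the tied rank (nR + n + 1) / 2.
    """
    positions = np.asarray(positions, dtype=float)
    returned = positions < 1001
    n = int(returned.sum())
    ranks = np.full(nR, (nR + n + 1) / 2.0)
    order = np.argsort(positions[returned])            # returned documents, by list position
    ranks[np.flatnonzero(returned)[order]] = np.arange(1, n + 1)
    return ranks, n

def dissimilarity(pos_p, pos_q, nR):
    """delta_pq = sqrt(1 - s_pq), with s_pq adjusted for documents not returned."""
    Rp, n_p = return_order_ranks(pos_p, nR)
    Rq, n_q = return_order_ranks(pos_q, nR)
    Up = (nR - n_p) ** 3 - (nR - n_p)
    Uq = (nR - n_q) ** 3 - (nR - n_q)
    d2 = float(np.sum((Rp - Rq) ** 2))
    s_pq = (nR ** 3 - nR - 6.0 * d2 - (Up + Uq) / 2.0) / \
           np.sqrt((nR ** 3 - nR - Up) * (nR ** 3 - nR - Uq))
    return float(np.sqrt(max(0.0, 1.0 - s_pq)))        # guard tiny negatives from rounding
```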

Figure 2 displays the dissimilarities for topic 57 by use of multidimensional scaling (Cox and Cox 2001, Kruskal and Wish 1978, Rorvig 1999). An introduction to multidimensional scaling for the Query Track data is given by Liggett and Buckley (2001). The algorithm we use is Kruskal’s isotonic multidimensional scaling, which is named “isoMDS” by Venables and Ripley (1999). In our application, multidimensional scaling gives a plot with a point for each query-system combination. The Euclidean distances between points on the plot portray as faithfully as possible the corresponding dissimilarities.
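The authors use the R/S-PLUS function isoMDS; purely as an illustration, an analogous non-metric embedding could be computed with scikit-learn along the following lines (a hedged sketch; parameter names and defaults may differ across library versions).

```python
import numpy as np
from sklearn.manifold import MDS

def mds_coordinates(delta, seed=0):
    """Two-dimensional embedding of a symmetric dissimilarity matrix `delta`
    (one row/column per query-system response), so that Euclidean distances on
    the plot approximate the delta_pq values, in the spirit of Kruskal's
    non-metric (isotonic) multidimensional scaling."""
    model = MDS(n_components=2, metric=False,
                dissimilarity="precomputed", random_state=seed)
    return model.fit_transform(np.asarray(delta))
```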

Figure 2. Multidimensional scaling plot for topic 57, 20 queries and 6 systems. “Sabe”—arial font, “ok9u”—times font, others—small.

Figure 2 shows that the variation in the return order of relevant documents is much smaller for “Sabe” than “ok9u.” We see that “Bell,” one of the words that stands out in figure 1, influences the return order of both systems but more for “ok9u” than “Sabe.” For both systems, we see that “financial” has little influence on the return order. Apparently, other words influence the return order for “ok9u.”

Summarizing topic 57, we note that compared to “ok9u,” “Sabe” excludes irrelevant documents better and is less return-order sensitive to “Bell” and other query words (except “financial”). Apparently, the system “Sabe” achieves this by weighting “MCI” more heavily in formation of document lists. This observation is, of course, an oversimplified description of the way “Sabe” functions. Nevertheless, it seems that for topics like 57, “Sabe” has a greater ability to grasp the topic despite variations in query wording.

2.2. Topic 76

As the next step in discovery of a topic factor, we consider topic 76. The topic statement is shown in Table 2. For this topic, the number of relevant documents in the collection is 168.

Table 2. Statement of topic 76.

Domain: Law and Government

Topic: U.S. Constitution—Original Intent

Description:

Document will include a discussion of, or debate over, the issue of the original intent of the U.S. Constitution, or the meaning of a particular amendment to the Constitution, or will present an individual’s interpretation of an amendment.

Summary:

Document will include a discussion of, or debate over, the issue of the original intent of the U.S. Constitution, or the meaning of a particular amendment to the Constitution, or will present an individual’s interpretation of an amendment.

Narrative:

A relevant document will include a discussion of, or debate over, the issue of the original intent of the U.S. Constitution, or the meaning of a particular amendment to the Constitution, or will present an individual’s interpretation of an amendment. (A mere reference to the problem is not relevant.)

Concept(s):

1. U.S. Constitution, separation of powers, body of the Constitution, Bill of Rights, First Amendment, Fifth Amendment
2. original intent

Factor(s):

Nationality: U.S.

Figure 3, which is comparable to figure 1, shows the 20 best queries and our irrelevant-exclusion measure for each system and query. Reading the queries in figure 3 suggests that “Constitution” is the primary key word for this topic. Comparing systems, we see that the performance difference between “Sabe” and “ok9u” is positive for all but six of the queries shown. Looking at the relation between performance difference and query wording suggests that the word “court,” which refers to the Supreme Court, and the phrase “Constitutional amendment” degrade “Sabe” performance. The word “court” occurs two times and the phrase “Constitutional amendment” occurs three times. Apparently, “Sabe” treats “Constitutional amendment” as a phrase and not as two separate words. Perhaps better performance would be obtained if this phrase were treated as two separate words, as is apparently the case for query E. Figure 3 also suggests that the word “intent,” which occurs nine times, improves the performance of “Sabe” when “Constitutional amendment” is not present. The performance of “Sabe” for Query H is not as good as one might expect. That “court” and “Constitutional amendment” stand out from the other query words implies that Topic 76 is like Topic 57 in that only a few words govern the “Sabe”-“ok9u” performance difference.

Figure 4, which is comparable to figure 2, is the multidimensional scaling plot for Topic 76. Figure 4 shows that the variation in the return order of relevant documents is much smaller for “Sabe” than “ok9u.” We see that “court,” one of the words that stands out in figure 3, influences the return order for “Sabe.” This word also influences the return order for “ok9u” but so do many other query words. Looking at “Sabe” points B, C, and G, we see that “Constitutional amendment” also influences the return order for “Sabe.” That query H is close to these three points suggests that “Sabe” treats this query as though it contained the phrase “Constitutional amendment.”

Figure 3. Irrelevant exclusion measure for topic 76, 20 queries and 6 systems. “Sabe”—diamond, “ok9u”—square, others—circles. Truncated part of query C: Constitution amendments?

Summarizing topic 76, we note that compared to “ok9u,” “Sabe” excludes irrelevant documents better and is less return-order sensitive unless the query contains “court” or “Constitutional amendment.” Apparently, the system “Sabe” achieves this by weighting “Constitution” more heavily in formation of document lists. This observation suggests that, as in the case of topic 57, “Sabe” has a greater ability to grasp the topic despite variations in query wording.

2.3. Topic 85

As the third step in discovery of a topic factor, we consider topic 85. The topic statement is shown in Table 3. For this topic, the number of relevant documents in the collection is 581.

Table 3. Statement of topic 85.

Domain: Law and Government

Topic: Official Corruption

Description:

Document will discuss allegations, or measures being taken against, corrupt public officials of any governmental jurisdiction worldwide.

Summary:

Document will discuss allegations, or measures being taken against, corrupt public officials of any governmental jurisdiction worldwide.

Narrative:

A relevant document will discuss charges or actions being taken against corrupt public officials (be they elected, appointed, or career civil servant) anywhere in the world. The allegations or charges must be specific, e.g. bribes taken from a named group or individual with a given objective, rather than generalized allegations of endemic political corruption, or moves against corporate or private malfeasance (unless linked to an official corruption case).

Concept(s):

1. Official corruption, public corruption
2. Bribery, prostitution, cover-up, undercover investigation, influence-peddling, illegal payment, collusion

Figure 4. Multidimensional scaling plot for topic 76, 20 queries and 6 systems. “Sabe”—arial font, “ok9u”—times font, others—small.

Figure 5. Irrelevant exclusion measure for topic 85, 20 queries and 6 systems. “Sabe”—diamond, “ok9u”—square, others—circles. Truncated part of query M: governmental leaders or bankers or police officials in the world. Truncated part of query T: punishment have they received? Truncated part of query U: consequences for those involved.

Figure 5, which is comparable to figures 1 and 3, shows the 20 best queries and our irrelevant-exclusion measure for each system and query. Reading the queries in figure 5 suggests that “corruption” (and its variant “corrupt”) is the primary key word for this topic. Comparing systems, we see that the performance difference between “Sabe” and “ok9u” is positive for all but five of the queries shown. However, on average, the performance difference is smaller for topic 85 than for topics 57 and 76. Looking at the relation between performance difference and query wording suggests that the word “bribery,” which occurs three times if one includes the variant “bribe,” improves the performance of both systems. The phrases “public official” and “official corruption,” the separate words in these phrases, and the words “government” and “political” might influence performance, but the effects are not clear. That the effect of query words on the performance difference is hard to discern implies that topic 85 is unlike topics 57 and 76.

Figure 6, which is comparable to figures 2 and 4, is the multidimensional scaling plot for Topic 85. Figure 6 shows that the variation in the return order of relevant documents is about the same for “Sabe” as it is for “ok9u.” We see that “bribery,” one of the words that stands out in figure 5, influences the return order for both “Sabe” and “ok9u.” Looking at “Sabe” points C, R, S, and T, we see that these queries have in common the phrase “public official.” This suggests the influence of this phrase, but this influence is not clear in figure 5.

Summarizing topic 85, we note that the secondary words affect the behavior of both “Sabe” and “ok9u.” Apparently, in comparison with “ok9u,” the influence of these secondary words is not reduced very much by the “Sabe” query expansion. In other words, the way “Sabe” treats the key word “corruption” has little effect. In this way, topic 85 differs from topics 57 and 76.

Figure 6. Multidimensional scaling plot for topic 85, 20 queries and 6 systems. “Sabe”—arial font, “ok9u”—times font, others—small.

One might hypothesize a topic factor that governs the differences among topics 57, 76, and 85. When there is a highly specific primary key word, one that effectively, if not perfectly, separates relevant and irrelevant documents, “Sabe” takes advantage of this word to outperform “ok9u.” The primary key words are “MCI,” “Constitution,” and “corruption.” The topic factor that governs differences among these three topics might be called the key-word specificity. These three topics have different levels of this topic factor, highest for topic 57 and lowest for topic 85. The difference in performance between “Sabe” and “ok9u” is a function of the level of this topic factor. Although we do not provide the details, we note that topic 62, with primary key word “coup,” fits into this sequence with a level of key word specificity between that of topic 85 and that of topic 76. Apparently, for “Sabe” relative to “ok9u,” an ideal topic is one with a single key word that distinguishes relevant from not relevant, and a more difficult topic is one that must be delimited by secondary words. There are, of course, other topic factors that affect the performance difference.

3. Indicators for topic comparison

Because system behavior varies widely from topic to topic, one cannot be satisfied with results from three topics when there are fifty available. But extension of the analysis in Section 2 to fifty topics cannot entail a person viewing and comprehending one hundred figures. This difficulty leads us to propose an approach based on a low-dimensional summary computed for each of the fifty topics.


Table 4. Description of topic indicators.

Name                               Indicator large if relevant documents
Exclusion of irrelevant documents  Take precedence over irrelevant documents
Uniformity of return order         Are similarly ordered query to query
Centering of best query            Are ordered like query with best irrelevant rejection
Sphericity of return order         Have orderings with no overall dominant component

Our overview is based on four indicators of system behavior that can be computed for every topic. Plots of system-to-system differences in the values of these indicators provide the overview. One indicator is the average over the twenty best queries of our measure of irrelevant-document exclusion, which is defined in Section 2. The best twenty queries are chosen as in Section 2. The use of average precision in a similar way would suggest itself to those familiar with the information retrieval literature (see the Appendix). We add to this indicator three others that summarize what is observed in figures like figures 2, 4, and 6. These three figures can be thought of as depicting a point cloud for each system, where the point cloud is derived from a dissimilarity matrix. We can describe the three indicators in terms of these point clouds. The uniformity of return order corresponds to how tightly the points are clustered. The centering of best query corresponds to the difference between the middle of the point cloud and the point for the query with best irrelevant rejection. The sphericity of return order corresponds to how circular the point cloud is. As discussed next, we actually define our indicators not in terms of the two-dimensional view given by figures 2, 4, and 6 but in terms of the dissimilarity matrix. Our indicators are summarized in Table 4.

The system-to-system differences for these indicators provide a gauge of topic-to-topic differences. These topic-to-topic differences can be used for finding small groups of topics for which the variation in system behavior is interesting. Indicators that fulfill this purpose must be properly chosen. In particular, they should not reflect properties of the document collection but only properties of the topics, the information needs. An indicator that depends strongly on nR, the number of relevant documents, would not be satisfactory because nR is a property of the document collection, not of the topic. If one were to choose topics for comparison on the basis of an indicator that depended strongly on nR, one might miss topics that could lead to important insights.

3.1. Exclusion of irrelevant documents

Figures 1, 3, and 5 display, for each query and each system, the value of −log10(N25 + 1), where N25 is the number of irrelevant documents returned before 25 percent of the relevant documents. These figures show, for nQ = 20, the nQ best queries with duplicate queries excluded. Let j index the nQ best queries (1 ≤ j ≤ nQ), and let m index the systems (1 ≤ m ≤ 6). Let $N^{(m)}_{25,j}$ denote the value of N25 for query j and system m. Our exclusion indicator for system m is given by

$$
E_m = \frac{1}{n_Q}\sum_{j=1}^{n_Q} -\log_{10}\bigl(N^{(m)}_{25,j} + 1\bigr).
$$

For topics 57, 76, or 85, this indicator is just the average over the queries of what is shown for a system in figures 1, 3, or 5, respectively.

As discussed in the Appendix, we choose this indicator instead of one based on average precision to reduce topic-to-topic dependence on nR. Another dependence that we would like to avoid arises from the topic-to-topic variation in the query set available for a topic. Because the queries are natural language expressions, we can try to reduce this dependence but cannot hope to eliminate it. We choose nQ = 20 and choose n/nR averaged over the systems as a gauge for selecting the best queries. Other choices might reasonably be considered.
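Given per-query N25 counts, such as those produced by the sketch in Section 2.1, the indicator is a simple average; a minimal sketch with names of our own choosing:

```python
import math

def exclusion_indicator(n25_values):
    """E_m: the average of -log10(N25 + 1) over the nQ best queries for one system."""
    return sum(-math.log10(v + 1) for v in n25_values) / len(n25_values)
```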

3.2. Uniformity of return order

We now define three indicators related to the patterns observed in multidimensional scaling plots such as figures 2, 4, and 6. We define these indicators in terms of the dissimilarities δpq given in Section 2.1. We motivate their choice by considering the case in which the system returns every relevant document in response to every query. In other words, we consider the case in which the number of relevant documents returned, $n_j$, in response to query j equals nR for all queries. As in Section 2.1, let $R_{ji}$ denote the rank of relevant document i among those returned in response to query j. Let

$$
x_{ji} = \frac{R_{ji} - \bar R_i}{\sqrt{n(n^2 - 1)/12}},
$$

where

$$
\bar R_i = \frac{1}{n_Q}\sum_{j=1}^{n_Q} R_{ji}.
$$

For the case $n_j = n_k = n_R$, our dissimilarity measure for system m and queries j and k is given by

$$
\delta^2_{m,jk} = (x_j - x_k)^T (x_j - x_k),
$$

where

$$
x_j = \bigl(x_{j1}, \ldots, x_{j n_R}\bigr)^T.
$$


To define our indicators, we only need query-to-query dissimilarities for the same system. For the case $n_j = n_R$, there is a one-to-one correspondence between the points given by the vectors $x_j$ and the dissimilarities (Cox and Cox 2001). This correspondence is the basis for the type of multidimensional scaling called classical scaling. We consider the vectors $x_j$ in justifying our choices of indicators.

Our indicator of uniformity is given by

$$
U_m = -\log\Biggl(\frac{1}{2 n_Q^2}\sum_{j=1}^{n_Q}\sum_{k=1}^{n_Q}\delta^2_{m,jk}\Biggr).
$$

Essentially, this indicator is the average of the dissimilarities for the system, inverted so that the indicator is larger when the uniformity is greater. In terms of the vectors $x_j$ for the case $n_j = n_R$, we have

$$
\frac{1}{2 n_Q^2}\sum_{j=1}^{n_Q}\sum_{k=1}^{n_Q}\delta^2_{m,jk} = \frac{1}{n_Q}\sum_{j=1}^{n_Q} x_j^T x_j.
$$

This equation follows from

$$
\delta^2_{m,jk} = x_j^T x_j + x_k^T x_k - 2 x_j^T x_k
$$

and

$$
\sum_{j=1}^{n_Q} x_j = 0.
$$

The quantity $x_j^T x_j$ can be thought of as the squared distance from query j to the average of the queries. Thus, our uniformity indicator is the negative logarithm of the average of the squared distances of the queries from their center. Defining this indicator without the logarithm would not have a dramatic effect on our results. Using the expression giving the indicator in terms of the dissimilarities, we can compute values for the indicator even when not all the relevant documents are returned.
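A minimal sketch of this indicator (our own Python, not the authors’ code); it assumes the dissimilarities for one system are held in an nQ × nQ matrix.

```python
import numpy as np

def uniformity(delta_m):
    """U_m: negative log of the average squared query-to-query dissimilarity
    for one system.  delta_m is the nQ x nQ matrix of delta_{m,jk} values."""
    d2 = np.asarray(delta_m) ** 2
    nQ = d2.shape[0]
    return -np.log(d2.sum() / (2.0 * nQ ** 2))
```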

3.3. Centering of the best query

Another aspect of the patterns seen in figures 2, 4, and 6 is the location of the best query, the one that heads the list in our choice of the best nQ queries. We compare the square of the distance between this query and the center with the average for all the queries. Let the index for the best query be j = 1. In terms of the dissimilarities, our indicator of best-query centering is

$$
C_m = -\log\Biggl(\frac{1}{n_Q}\sum_{j=1}^{n_Q}\delta^2_{m,j1} - \frac{1}{2 n_Q^2}\sum_{j=1}^{n_Q}\sum_{k=1}^{n_Q}\delta^2_{m,jk}\Biggr) - U_m.
$$

In terms of the vectors $x_j$ for the case $n_j = n_R$, we have

$$
C_m = -\log\Biggl(\frac{x_1^T x_1}{\frac{1}{n_Q}\sum_{j=1}^{n_Q} x_j^T x_j}\Biggr).
$$
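A corresponding sketch for the centering indicator, under the same assumed matrix representation of the dissimilarities:

```python
import numpy as np

def centering(delta_m, best=0):
    """C_m: how close the best query (row/column index `best`) sits to the center
    of the query cloud, relative to the average spread, on the -log scale above."""
    d2 = np.asarray(delta_m) ** 2
    nQ = d2.shape[0]
    mean_sq = d2.sum() / (2.0 * nQ ** 2)       # average squared distance to the center
    best_sq = d2[:, best].mean() - mean_sq     # squared distance of the best query to the center
    u_m = -np.log(mean_sq)                     # U_m as defined in Section 3.2
    return -np.log(best_sq) - u_m
```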

3.4. Sphericity of return order

It is sometimes observed in multidimensional scaling results such as those shown in figures 2, 4, and 6 that the queries for a system are spread out much more in one direction than the other. Our fourth indicator is intended to reflect the opposite of this, that there is no dominant direction to the spread of the queries. Consider

$$
\sum_{j=1}^{n_Q} \alpha_j x_j
$$

as the response from a composite query defined by the coefficients αj. The distance from this composite query to the center of the queries is given by

$$
\sum_{j=1}^{n_Q}\sum_{k=1}^{n_Q} \alpha_j\, x_j^T x_k\, \alpha_k.
$$

The maximum of

$$
\frac{\sum_{j=1}^{n_Q}\sum_{k=1}^{n_Q} \alpha_j\, x_j^T x_k\, \alpha_k}{\sum_{j=1}^{n_Q} \alpha_j^2}
$$

over αj, j = 1, . . . , nQ, provides the basis for our sphericity indicator. We denote this maximum by λ1. We have

$$
x_j^T x_k = -\frac{1}{2}\Biggl(\delta^2_{m,jk} - \frac{1}{n_Q}\sum_{u=1}^{n_Q}\delta^2_{m,uk} - \frac{1}{n_Q}\sum_{v=1}^{n_Q}\delta^2_{m,jv} + \frac{1}{n_Q^2}\sum_{u=1}^{n_Q}\sum_{v=1}^{n_Q}\delta^2_{m,uv}\Biggr).
$$

The quantity λ1 is the largest eigenvalue of the matrix B = ($x_j^T x_k$). Our indicator is given by

$$
S_m = -\log(\lambda_1) - U_m.
$$

Note that subtracting Um amounts to dividing by the average squared distance before taking the logarithm. Thus, this indicator is normalized by overall query-to-query variation.
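A sketch of the sphericity indicator, recovering B by double centering the squared dissimilarities as in classical scaling (our own code, same assumptions as the sketches above):

```python
import numpy as np

def sphericity(delta_m):
    """S_m = -log(lambda_1) - U_m, where lambda_1 is the largest eigenvalue of
    the double-centered matrix B = (x_j^T x_k) recovered from the dissimilarities."""
    d2 = np.asarray(delta_m) ** 2
    nQ = d2.shape[0]
    J = np.eye(nQ) - np.ones((nQ, nQ)) / nQ    # centering matrix
    B = -0.5 * J @ d2 @ J                      # classical-scaling inner products x_j^T x_k
    lam1 = np.linalg.eigvalsh(B)[-1]           # largest eigenvalue, lambda_1
    u_m = -np.log(d2.sum() / (2.0 * nQ ** 2))  # U_m from Section 3.2
    return -np.log(lam1) - u_m
```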


Figure 7. For 43 topics, the indicator of irrelevant document exclusion versus the indicator of relevant-document return order uniformity.

3.5. Indicator overview

Figure 7 shows, for the exclusion indicator Em and the uniformity indicator Um, the difference between “Sabe” and “ok9u.” Only 43 of the 50 topics are shown because we judged the other 7 topics to be less informative. Topics 51, 70, and 78 are so easy in terms of the exclusion indicator that the “Sabe”-“ok9u” difference shows little of interest. Topics 74, 75, 84, and 96 are so difficult in terms of the exclusion indicator that they also show little of interest.

Figure 7 shows that in the progression from topic 57 to topic 76 to topic 85, the difference in exclusion of irrelevant documents becomes smaller and the difference in uniformity of relevant-document return order becomes smaller. This is what we observed from figures 1–6 in Section 2. Note that topic 62 fits between topics 76 and 85 in this progression. Figure 7 also shows that these three topics provide only a partial picture of the variation in system difference with topic.

Figure 8 shows, for the centering indicator Cm and the sphericity indicator Sm, the difference between “Sabe” and “ok9u.” This figure shows that for topics 57, 76, and 85, the system difference varies little in best-query centering and varies appreciably in sphericity.


Figure 8. For 43 topics, the indicator of best-query centering versus the indicator of relevant-document return order sphericity.

We note that topics 59, 86, and 100 are discussed by Liggett and Buckley (2001). For these three topics, the system difference in best-query centering varies appreciably. As with figure 7, figure 8 shows that the three topics discussed in Section 2 do not provide a complete picture.

4. Finding topic factors

The indicators introduced in Section 3 help us expand on the perceptions formed in Section 2. Through the analysis of topics 57, 76, and 85 in Section 2, we formulated the idea of a topic factor that reflects the specificity of a topic’s primary key word. More topic factors than this are needed in forming a useful model of the way relative system behavior depends on topic properties. Finding additional topic factors is the purpose of the indicators formulated in Section 3. These indicators provide a guide to further topics the analysis of which is likely to lead to additional topic factors.

The topics that seem likely to provide insight are those on the periphery of the four-dimensional scatter of indicator values shown by the two-dimensional projections in figures 7 and 8. Although more topics could be considered, we confine our attention to topics 63, 66, 77, 81, and 98. One reason for this choice is coverage of some parts of the space of indicator values by topics discussed in Section 2 or in Liggett and Buckley (2001). We further narrow our attention to topics 77 and 98. Topics 63, 66, and 81 have characteristics that make generalization difficult. For four queries of topic 63, the system “ok9u” has inexplicably poor performance. For topic 66, the assessor made a mistake in assigning relevance judgements to documents. For topic 81, system behavior is governed by the names of television evangelists. Study of system behavior when some queries involve the names of individuals would be interesting, but we are not going to pursue such a study in this paper.

Table 5. Statement of topic 77.

Domain: Environment

Topic: Poaching

Description:

Document will report a poaching method used against a certain type of wildlife.

Summary:

Document will report specific poaching activities including identification of the type of wildlife being poached, the poaching technique or method of killing the wildlife which is used by the poacher, and the reason for the poaching.

Narrative:

A relevant document will identify the type of wildlife being poached, the poaching technique or method of killing the wildlife which is used by the poacher, and the reason for the poaching (e.g. for a trophy, meat, or money). A report of poaching or someone caught for poaching without mention of the technique used and the reason is not relevant.

Concept(s):

1. poaching, illegal hunting, fishing, trapping, equipment
2. territorial waters, economic zone, game preserve, refuge, park
3. arrest, impound, fine
4. issue, agreement, license, treaty, legal limit

The topic statement for topic 77 is shown in Table 5. We note that the actions of the poacher rather than the impact of poaching on the larger community are of interest. The number of relevant documents in the collection is 75.

Query-by-query exclusion of irrelevant documents for topic 77 is shown in figure 9. The primary key word for this topic is “poaching,” which has variants “poach” and “poacher.” We see that although the difference between “Sabe” and “ok9u” is positive for 18 of the 20 queries, the difference is generally small. Moreover, there are no secondary key words with an effect that is obvious in this figure.

Figure 9. Irrelevant exclusion measure for topic 77, 20 queries and 6 systems. “Sabe”—diamond, “ok9u”—square, others—circles. Truncated part of query D: beasts.

The multidimensional scaling plot for topic 77 is shown in figure 10. The return order of relevant documents is more uniform for “Sabe” than “ok9u.” The effect of the secondary key word “wildlife” is evident in this figure, especially in the points for “ok9u.” Topic 77 stands out in figure 8 because of the response of “ok9u” to queries R and T. These two queries contain the word “poacher” rather than “poaching.” If one were to say that the behavior of “ok9u” should not change much when “poacher” is substituted for “poaching,” then one might say “ok9u” has a stemming problem that “Sabe” does not have. Stemming is removing affixes to reduce a word to its root form (Berry and Browne 1999). Overall, we see that the response of “Sabe” is less sensitive to the secondary words in the query but that this results in only a small improvement in system performance as gauged by exclusion of irrelevant documents.

Topic 77 suggests that failures in stemming or similar failures in equating equivalent words constitute a topic factor not observed in topics 57, 76, or 85. Interestingly, system behavior for topic 86 discussed by Liggett and Buckley (2001) shows a similar failure. In this case, the failure is in equating FDIC with Federal Deposit Insurance Corporation. Topic 86 is close to topic 77 in figure 8.

The topic statement for topic 98 is shown in Table 6. We note that this topic involves the production aspects of fiber optics, not the application aspects. The number of relevant documents is 363.

For topic 98, figure 11 shows the queries and irrelevant-document performance. The primary key word is “fiber optic.” We see that the difference in performance between “Sabe” and “ok9u” is positive for only 9 of the 20 queries and that the average of this difference is small. Secondary query words “cables” and “equipment” have some effect.


Table 6. Statement of topic 98.

Domain: Science and Technology

Topic: Fiber Optics Equipment Manufacturers

Description:

Document must identify individuals or organizations which produce fiber optics equipment.

Summary:

Document must identify individuals or organizations which produce fiber optics equipment or systems.

Narrative:

To be relevant, a document must identify by name either individuals or companies which manufacture equipment or materials used in fiber optics technology.

Concept(s):

1. Fiber optics, lasers
2. cables, connectors, fibers

Definition(s):

1. Fiber optics refers to the technology by which information is passed via laser light transmitted through glass or plastic fibers.

Figure 10. Multidimensional scaling plot for topic 77, 20 queries and 6 systems. “Sabe”—arial font, “ok9u”—times font, others—small.


Figure 11. Irrelevant exclusion measure for topic 98, 20 queries and 6 systems. “Sabe”—diamond, “ok9u”—square, others—circles. Truncated part of query I: producing?

For the three queries with “cables,” “ok9u” performs better, and for the six queries with “equipment,” “Sabe” performs better.

The multidimensional scaling plot for topic 98 is shown in figure 12. We see that, with the exception of queries A, C, and E, the return orders for “Sabe” are very tightly clustered. Queries A, C, and E are themselves tightly clustered. In comparison, the points for “ok9u” are spread out over the plot, with queries A, C, and E separated from the others as with “Sabe.” Queries A, C, and E contain the word “cables.” For “ok9u,” one can also see that queries B, D, F, G, J, and K, which contain the word “equipment,” are separated from the others. Apparently, “Sabe” achieves considerable return order uniformity but with no effect on its exclusion of irrelevant documents relative to “ok9u.” Seemingly, “Sabe” weights “fiber optic” heavily, but this is ineffective because the primary key word does not faithfully separate relevant documents from irrelevant ones.

An additional topic factor is suggested by the comparison of topics 85 and 98. For topic 85, “Sabe” does not heavily weight the primary key word “corruption,” and therefore its irrelevant-document performance is not much better than that of “ok9u.” For topic 98, “Sabe” heavily weights the primary key word “fiber optic,” but this word does not distinguish relevant and irrelevant documents well enough for this weighting to be effective in irrelevant-document performance relative to “ok9u.” Thus, topics 85 and 98 illustrate different ways that other topics can contrast with topic 57.


Figure 12. Multidimensional scaling plot for topic 98, 20 queries and 6 systems. “Sabe”—arial font, “ok9u”—times font, others—small.

5. Conclusions

This paper shows that distinguishing topics in terms of topic factor levels has an empirical basis. Thus, this paper supports using topic factors in thinking about the effect of topic-to-topic variation on relative system behavior. However, the empirical basis provided is limited. More data and more analysis might be required before topic factors can be applied in a particular situation.

Our primary example of a topic factor is key word specificity. As discussed, the key words “corruption,” “coup,” “Constitution,” and “MCI” are ordered by increasing specificity. Another of our examples is the topic factor stemming failure. Our analysis suggests further examples. It seems clear that there are other topic factors. These additional factors might appear if a similar comparison of “Sabe” and “ok9u” were performed on another set of topics. Even more likely, additional factors might appear if other pairs of systems were compared.

Consider IR system evaluation, that is, decisions about what system will serve the customer best. In IR system evaluation, the major statistical challenge is inference from a sample of topics. There are three reasons for this. The first reason is that the effect of topic is larger overall than the effects of other factors in the evaluation. The second reason is that cost limits the number of topics. Evaluation requires not only a clear description of each topic but also assessment of each document in the collection as relevant or irrelevant to each topic. Professionals experienced in organizing information develop the topic descriptions and assess the relevance of documents. Assessment is performed without looking at every document in the collection. Nevertheless, a substantial increase from the 50 topics for which assessments have been done each year in TREC would be prohibitively expensive. Thus, expanding the size of the topic sample is not an option. The third reason is that probability sampling in selection of topics seems impractical because the selection of topic descriptions is subject to constraints. The descriptions must be unambiguous regarding the relevance or irrelevance of documents. Documents appropriate to the topics must be present in the collection. Moreover, topics must be chosen in a way that leads to efficient use of the information professionals’ time. Because of these constraints, a practical approach to defining a topic frame and performing probability sampling from this frame seems difficult.

Typically, IR system evaluation is based on the system performance averaged over the available topics. Information on topic factors provides an alternative. Say that the topic population of interest to the user is not the same as a topic population for which the available topics can be considered as representative. In this case, adjustment can be based on topic factors if topic factors descriptive of the difference between the two populations are available. Adjustment would consist of placing increased weight on topics of more interest to the user.

What more could we learn if we had more topics run with multiple queries for each topic? Obtaining multi-query responses for more topics is feasible because there are more sets of 50 topics in the TREC collection. For these topics, all that is necessary is generating alternative queries and running the systems. It seems that the analysis in this paper is limited by the number of topics available and that more topics would lead to more clarity on topic factors.

Appendix: Average precision

In the notation of Section 2, the value of the average precision for a 1000-document list is given by

\[
\frac{1}{n_R} \sum_{i=1}^{n} \frac{i}{r(i)} .
\]

When one uses this measure to compare topics, one should be concerned with the dependence on n_R, which is a property of the document collection rather than of the content of the topic. The dependence on n_R can be shown by asking what would happen if half the relevant documents were selected at random and removed from the document collection. The precision, which is the ratio of the number of relevant documents retrieved to the total number of documents retrieved, would be approximately cut in half. This applies, then, to average precision as well. This dependence on n_R is important in comparing system outputs for different topics but the same document collection. This dependence may not be of overriding importance in other uses of average precision. A paper by Liggett (1999) can be usefully read as a comparison of average precision and the measure of irrelevant rejection defined in Section 2.
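As a concrete illustration, the following is a minimal sketch of the formula above, assuming (consistent with the discussion here) that r(i) denotes the rank at which the i-th relevant document is returned and n_R is the number of relevant documents for the topic. The document identifiers and relevance judgments are hypothetical.

def average_precision(ranked_docs, relevant, n_relevant):
    # ranked_docs: document ids in return order (up to 1000)
    # relevant: set of ids judged relevant
    # n_relevant: n_R, the number of relevant documents for the topic
    total, found = 0.0, 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            found += 1                # this count is i
            total += found / rank     # i / r(i)
    return total / n_relevant

# Hypothetical list: relevant documents returned at ranks 1, 3, and 10.
ranked = ["d1", "d5", "d2", "d7", "d9", "d4", "d8", "d6", "d0", "d3"]
relevant = {"d1", "d2", "d3"}
print(average_precision(ranked, relevant, n_relevant=3))  # (1 + 2/3 + 3/10) / 3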


We have followed the usual statistical practice in choosing a measure for which dependence on n_R can be largely ignored (Bartholomew 1996). In standard statistical practice, the choice would be the number of irrelevant documents returned before 50 percent of the relevant documents. We choose 25 percent instead because some document lists do not contain 50 percent of the relevant documents. Consider the dependence on n_R for our measure of exclusion of irrelevant documents. The number of irrelevant documents returned before 25 percent of the relevant documents is roughly r(H), where H ≈ .25 n_R, if we ignore the relevant documents included in r(H). Say that half the relevant documents are removed at random. The value of H becomes .25 n_R/2. The ranks of the remaining relevant documents are nearly the same, but their numbering corresponds roughly to i/2. Thus, the rank of the relevant document at which 25 percent of the relevant documents are returned does not change very much. In other words, the number of irrelevant documents returned before 25 percent of the relevant documents is largely independent of n_R.
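A companion sketch, again with hypothetical ranks and not the paper’s own code, computes the number of irrelevant documents returned before 25 percent of the relevant documents and then removes half the relevant documents at random. The computed value changes little, illustrating the argument above. The function name and the simplifying assumption that surviving documents keep their original ranks are ours.

import math
import random

def irrelevant_before_quarter(relevant_ranks, n_relevant):
    # relevant_ranks: sorted ranks r(1) < r(2) < ... of the relevant
    # documents that appear in the returned list; n_relevant: n_R.
    h = math.ceil(0.25 * n_relevant)          # H, roughly .25 n_R
    if h > len(relevant_ranks):
        return None                           # fewer than 25 percent returned
    return relevant_ranks[h - 1] - h          # irrelevant documents before r(H)

# Hypothetical topic: n_R = 40 relevant documents, returned at every 7th rank.
ranks = [7 * i for i in range(1, 41)]
print(irrelevant_before_quarter(ranks, 40))   # H = 10, r(10) = 70, so 60

# Remove half the relevant documents at random, keeping the survivors at
# (approximately) their original ranks, as in the argument above.
random.seed(0)
kept = sorted(random.sample(ranks, 20))
print(irrelevant_before_quarter(kept, 20))    # close to the value above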

In the comparison of “Sabe” and “ok9u” given in this paper, our substitution for average precision does not seem particularly consequential in light of the topic-to-topic variability. In a comparison of more closely matched systems, this substitution might be more important. We computed query means of the average precision for the same 20 queries we use to compute our indicator of irrelevant document exclusion. Figure 13 shows the average-precision means plotted versus our exclusion indicator. We see that average precision does not invalidate our choice of topics for detailed study. In the four-dimensional space of indicators shown in figures 7 and 8, topics 57, 76, 85, 77, and 98 would remain in the same neighborhoods were average precision used instead.

Figure 13. For 43 topics, the indicator of irrelevant document exclusion versus the query mean of average precision.

Disclaimer

Certain commercial entities, equipment, or materials may be identified in this paper in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

References

Allan J, Connell WB, Croft WB, Feng F-F, Fisher D and Li X (2001) INQUERY and TREC-9. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 551–562. (Available at http://trec.nist.gov)

Banks D, Over P and Zhang N (1999) Blind men and elephants: Six approaches to TREC data. Information Retrieval, 1:7–34.

Bartholomew DJ (1996) The Statistical Approach to Social Measurement. Academic Press, San Diego.

Bartholomew DJ and Knott M (1999) Latent Variable Models and Factor Analysis. Oxford University Press, New York.

Berry MW and Browne M (1999) Understanding Search Engines: Mathematical Modeling and Text Retrieval. Society for Industrial and Applied Mathematics, Philadelphia.

Buckley C (2001) The TREC-9 query track. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 81–86. (Available at http://trec.nist.gov)

Buckley C and Walz J (2001) Sabir research at TREC-9. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 475–478. (Available at http://trec.nist.gov)

Cox TF and Cox MAA (2001) Multidimensional Scaling, 2nd edition. Chapman & Hall, London.

Gibbons JD (1985) Nonparametric Statistical Inference. Marcel Dekker, New York, pp. 226–235.

Kruskal JB and Wish M (1978) Multidimensional Scaling. SAGE Publications, Newbury Park, CA.

Liggett W (1999) Topic by topic performance of information retrieval systems. In: Voorhees EM and Harman DK, Eds., The Seventh Text REtrieval Conference (TREC-7), NIST Special Publication 500-242. U.S. Government Printing Office, Washington, DC, pp. 105–114. (Available at http://trec.nist.gov)

Liggett W and Buckley C (2001) Query expansion seen through return order of relevant documents. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 51–70. (Available at http://trec.nist.gov)

Robertson SE and Walker S (2001) Microsoft Cambridge at TREC-9: Filtering track. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 361–368. (Available at http://trec.nist.gov)

Rorvig M (1999) Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of TREC topic-document sets. Journal of the American Society for Information Science, 50:639–651.

Tomlinson S and Blackwell T (2001) Hummingbird’s Fulcrum SearchServer at TREC-9. In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 209–222. (Available at http://trec.nist.gov)


Venables WN and Ripley BD (1999) Modern Applied Statistics with S-PLUS, 3rd edition. Springer-Verlag, New York.

Voorhees EM and Harman D (2000) Overview of the Eighth Text REtrieval Conference (TREC-8). In: Voorhees EM and Harman DK, Eds., The Eighth Text REtrieval Conference (TREC-8), NIST Special Publication 500-246. U.S. Government Printing Office, Washington, DC, pp. 1–24. (Available at http://trec.nist.gov)

Voorhees EM and Harman D (2001) Overview of the Ninth Text REtrieval Conference (TREC-9). In: Voorhees EM and Harman DK, Eds., The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249. U.S. Government Printing Office, Washington, DC, pp. 1–14. (Available at http://trec.nist.gov)