
Is amount of effort a better predictor of search success than use of specific search tactics?

Earl Bailey University of North Carolina

Chapel Hill, NC, USA [email protected]

Diane Kelly University of North Carolina

Chapel Hill, NC, USA [email protected]

ABSTRACT

While it is true that many factors can influence information seeking success, the amount of effort exerted by the user in completing a task can be a key factor. This study examined the log files of high performing and low performing users of an experimental system to find differences in their behaviors performing retrieval tasks. The study examined aspects of elaboration, redundancy, depth, and effort to determine which might be used to predict user success in a retrieval task. Results did not show significant differences between the groups with regard to elaboration, redundancy, and depth, but did reveal significant differences when comparing the following effort measures: number of documents opened, number of piles used, and number of documents placed into piles. Results indicate that the amount of effort is a good indicator of success in information seeking tasks and that high performers are more likely to use tools to help them organize and monitor their search results.

Keywords

Searching, Information Storage and Retrieval Systems, Evaluation/Information Storage and Retrieval Systems.

1. INTRODUCTION

Research in retrieval seeks to measure, describe and quantify success in information seeking, examining how the user encounters and creates meaning from the process (Case, 2002). Numerous models have been developed to represent this process, including anomalous states of knowledge (Belkin, 1980), psychological relevance (Sperber & Wilson, 1986), the information search process (Kuhlthau, 1993), Leckie's general model (Leckie, Pettigrew, & Sylvain, 1996), sense making (Dervin, 1999), and berry picking (Bates, 1989). Individual tactics have been examined in models as well, including successive fractions (Meadow & Cochrane, 1981), pearl growing (Markey & Atherton, 1978), interactive scanning (Hawkins & Wagers, 1982), and building blocks (Harter, 1986).

Few search behaviors are found in isolation; rather, they are combined in actual practice (Marchionini, 1995). This normative combining of behaviors is useful when examining and quantifying particular users, but is less useful when attempting to predict how a user might perform a given task in order to inform the design of a retrieval system. In fact, many of the findings of research in the field of information seeking have been largely disconnected from information retrieval system design. This disconnect reflects how difficult information seeking and retrieval are to study, rather than any deficit in researchers' intent or effort.

Determining the competence of a user by their behavior is a task that has been studied as a part of research into interactive retrieval systems and artificial intelligence systems, with the goal of using these determinations to tailor system output more towards the actual user. Studies comparing expert behaviors and novice behaviors also touch upon the differences in users interacting with a particular system. Unfortunately, few studies have focused on differences in how these two groups perform searches and whether these differences can be used to predict performance on retrieval-based tasks.

This study examines and compares two groups of subjects that used the same experimental information retrieval system, a high performing group and a low performing group. Performance was measured based upon the subjects’ calculated recall, or the ratio of the number of documents they selected as relevant to the task to the total number relevant in the collection. The study specifically reports on the retrieval methods of these groups as potential indicators of group membership. Our original research question was: “Is amount of effort a better predictor of search success than use of specific tactics?” which we split into:

• RQ1: Are there significant differences between the groups in the amount of effort?

• RQ2: Are there significant differences between the groups in depth of results viewed?

• RQ3: Are there significant differences between the groups in their use of search tactics?

ASIST 2011, October 9–13, 2011, New Orleans, LA, USA.

We hypothesized that the differences between the high performing and low performing groups would appear in effort (RQ1) and depth (RQ2) rather than in use of search tactics (RQ3). Because this study makes multiple comparisons, we set our significance threshold at 0.01 to reduce the risk of false positives. The hypotheses are stated formally below; since we expected the answer to RQ3 to be negative, H3 is stated as a null hypothesis.

• H1: There are significant differences between the groups in measures of effort.

• H2: There are significant differences between the groups in measures of depth.

• H3: There are no significant differences between the groups in their use of search tactics.
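For context, with a conventional family-wise level of 0.05 spread across the seven effort measures used for H1 (listed in the next paragraph), a Bonferroni-style correction would suggest a per-test threshold close to the 0.01 level adopted here. This arithmetic is offered only as a sanity check on the chosen threshold, not as the correction procedure actually applied:

```latex
\alpha_{\text{per test}} = \frac{\alpha_{\text{family}}}{m} = \frac{0.05}{7} \approx 0.0071
```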

Effort is defined by Merriam-Webster as the total work done to achieve a particular goal (http://www.merriam-webster.com). To measure retrieval effort, we chose the following variables: number of documents opened, number of documents placed into piles, number of piles used, number of search iterations, number of search terms, number of unique search terms, and search time. For H2, depth is measured by the number of search results displayed to the user during the course of the search, where a search result is an entry in the list of documents retrieved, not the full text of the document itself.
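To make this operationalization concrete, the per-task measures can be pictured as a simple record. A minimal sketch in Python follows; the field names are illustrative and are not taken from the study's actual analysis code:

```python
from dataclasses import dataclass

@dataclass
class SessionMeasures:
    """Measures derived from one subject's log for a single search task."""
    docs_opened: int        # full texts viewed (effort)
    docs_in_piles: int      # documents saved into piles (effort)
    piles_used: int         # piles created and used (effort)
    search_iterations: int  # queries issued (effort)
    search_terms: int       # total terms across all queries (effort)
    unique_terms: int       # distinct stemmed terms (effort)
    search_time: float      # minutes spent searching (effort)
    results_viewed: int     # result-list entries displayed (depth, for H2)
```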

To measure use of search tactics for RQ3, we chose the model used by Hembrooke, Granka, Gay, and Liddy (2005) to compare the search behaviors of domain experts and novices. This model organized search tactics into two broad groups, Elaboration and Redundancy. Elaboration is represented by the specific tactics of broadening and refining. Redundancy is represented by the specific tactics of backtracking, plural making and taking, topic term usage, poke-n-hope, and kitchen sink. These specific tactics are discussed in more detail in the literature review. Tactics involving Boolean structures were specifically omitted because the subjects in this experiment used an experimental retrieval system that did not support Boolean queries. We expected that we would find no significant differences between groups in the use of any of these tactics.

In short, we predicted that effort would be the key factor in determining successful search interactions rather than use of specific search tactics. Naturally, it is not surprising that effort plays a role in retrieval success since it clearly does so in so many other activities. But measurable differences in variables representing effort could serve to help quantify this facet of user behavior. If observed, these results could have significant impact upon the future design of search interfaces as well as upon the design of future retrieval research.

2. LITERATURE REVIEW

Much of the research analyzing how users might construct queries or navigate through retrieval results has been in the area of Interactive Information Retrieval (IIR) and information-seeking behavior. IIR systems study user search behaviors, both implicit and explicit, to create relevance feedback (Ruthven, Lalmas & van Rijsbergen, 2003). This feedback is often used by the system under study to model the user and to personalize the interface or results. The focus of this research is on the creation of interactive retrieval systems, but the findings are also useful for the examination of user behaviors. Here, however, generalities are difficult to come by. In fact, even such quantifiable measures as display time have been shown to have substantially different meanings across users (Kelly & Belkin, 2004). Additionally, many researchers agree that information seeking behaviors can be altered by such variables as task or time (Vakkari, 2000), explaining differences in user behaviors as examples of their influence. While helpful in research pertaining to system design, these explanations do not suffice for predictive user modeling.

2.1 Domain and Search Expertise

A great deal of research has been conducted on the acquisition and effect of expertise, both within a particular discipline, or domain, and related to a particular task or system. Expertise has been called "the ability, acquired by practice, to perform qualitatively well in a particular task domain" (Frensch & Sternberg, 1989). Experts differ from novices both qualitatively and quantitatively, not only having stored more information but also having "organized that information into more structurally and hierarchically meaningful patterns" (LaFrance, 1989). Researchers have examined this difference as it relates to information retrieval, studying its effects upon search term selection (Hembrooke, Granka, Gay, & Liddy, 2005), search tactics (Hsieh-Yee, 1993), search tactic formulation (Wildemuth, 2004), and knowledge acquisition (LaFrance, 1989). Hsieh-Yee (1993) theorized that expert searchers know how to cope with reduced domain knowledge, while novice searchers are poor enough at searching that even domain knowledge does not assist them in formulating tactics. Baloglu (2003) studied both high and low achievers in Education, concluding that high achievers "evaluate the information that they study more critically, organize it conceptually, and make comparisons and contrasts" more effectively than do the low achievers. Given that previous research suggests that information organization is a key strategy used by experts, we chose to analyze search logs from an experimental information retrieval system that allowed users to organize their search results into piles, to see if high performers made more use of this feature.

Domain expertise as it relates to retrieval can be defined as "knowledge of a subject area (i.e., domain) that is the focus or topic of the search," and is "conceptually distinct from knowledge of searching techniques" (Wildemuth, 2004). Additionally, domain experts use well-rehearsed strategies to enhance recall and process knowledge, giving them an "enhanced ability to recall the appropriate scripts or schema of their field of expertise" (Solomon, 1992). Vakkari (2002) notes that experts in a subject tend to generate better results than those with less experience, but that novices tend to show more gains when assisted by query expansion or term suggestion. Wildemuth (2004) examined the tactics used by searchers in an effort to understand how they formulate and reformulate search strategies. She also examined differences in those tactics based on domain knowledge. The results were structured into models designed to illustrate the cognitive processes represented by the specific search strategies. She found that her participants moved through the models at greater frequency when their domain knowledge was low, a result she attributed to the number of changes needed by students to retrieve appropriate records. She also noted that the sequence of moves changed as domain knowledge changed, although increased familiarity with the interface might have impacted those results.

Most of this research in expertise and retrieval has had a limited focus, primarily examining novice/expert differences with an eye towards helping novices perform more like experts or towards creating expert systems that accomplish the same goal. Examining how search behaviors evolve as expertise increases can shed light upon the search behaviors of all users, such that their behavior might be used to predict their success or even to offer them assistance. Much of the research, however, is inconclusive or contradictory when the findings are generalized beyond the scope of the experiments.

2.2 Search Strategies and Behaviors

Examination of search strategies or behaviors has produced limited findings, but the examination of search tactics may prove more promising. While search strategies represent the general roadmap or plan for information seeking, search tactics describe the individual actions taken or moves made when conducting a search (Bates, 1979). Tactical actions used in conducting searches have been studied and categorized in many different ways. Bates (1979) created a taxonomy of search tactics organized into four broad categories, along with a companion set of 17 idea tactics. She studied the literature on information searching and reference processes looking for tactics that would improve the effectiveness of a search. Tactics were grouped into the general categories of monitoring, file structure, search formulation, and term. Bates also classified models into four groups: those idealizing search, those representing searching, those teaching how to search, and those facilitating searching. She placed this work on tactics in the facilitating group, but it might well be seen as belonging to the idealizing group, since it provided no real method by which to incorporate particular tactics.

Much of the early research on search tactics is based upon the use of bibliographic databases, and the current use of full-text databases renders this research less useful (Marchionini, 1995). For example, Harter and Rogers-Peters (1985) surveyed the literature and gathered together the heuristics mentioned in then-current research. They classified 101 individual tactics into six broad classes: philosophical attitudes and overall approach; language of problem description; record and file structure; concept formulation and reformulation; recall and precision; and cost efficiency. Many of the heuristics and tactics were firmly based upon the database systems then in use, and so are of less use when studying full-text searching.

Hembrooke, Granka, Gay, and Liddy (2005) explored differences between experts and novices in generating search terms both with and without feedback. The study measured the presence or absence of search tactics, indicators of complexity, and time per trial. The results indicated that novices engaged in fewer effective strategic search behaviors and that experts used elaboration more frequently and employed more complex and unique terms. The study also developed a typology of search term query strategies and tactics, shown in Table 1.

Elaboration Tactics
  Broadening: Beginning with a specific query and expanding the scope.
  Refining: Beginning broadly and then narrowing the search.

Redundancy Tactics
  Backtracking: Reusing prior search terms over successive trials.
  Plural Making & Taking: Repeatedly incorporating similar nouns in a search attempt, with slight modifications like pluralizing.
  Topic Terms: Incorporating the terms from the description of the topic as the search terms.
  Poke-n-Hope: Changing only a single word while retaining the rest of the query.
  Kitchen Sink: Using search terms related to the subject, but not specific to the task.

Table 1. Search Strategies (Hembrooke, Granka, Gay, & Liddy, 2005).

The concept of elaboration represents the user's capacity to construct a query that reflects understanding of the search topic. This concept comes in part from Nordlie's (1999) study of user interactions with OPACs and reference intermediaries. Nordlie noted that beginning queries were often not sufficient to complete a specified task. These queries might be insufficient for a number of reasons, ranging from cognitive to linguistic, and the reasons may be opaque to the observer. What could be observed, however, were the steps taken after the initial query to elaborate upon it (Nordlie, 1999). In Hembrooke's typology, elaboration is represented by the tactics of broadening and refining. The concept of redundancy, represented by backtracking, plural making and taking, topic terms, poke-n-hope, and kitchen sink, reflects limited understanding of the topic. Subjects employing these tactics might simply reuse terms in multiple queries in an effort to discover new results, or might use terms that could be relevant but have no real connection to the task before them. What links these tactics together is the seemingly random nature of the terms selected from query to query. Elaboration tactics, on the other hand, follow observable broadening and refining from one query to the next.
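These tactics were coded by hand in this study (see Section 3.5.2), but the distinction can be illustrated by comparing successive queries as term sets. The following heuristic is a rough sketch only, assuming whitespace-delimited, already-stemmed queries; the rules are invented for illustration and are far cruder than manual coding:

```python
def classify_transition(prev_query: str, next_query: str) -> str:
    """Crudely label the move from one query to the next (illustrative only)."""
    prev_terms = set(prev_query.split())
    next_terms = set(next_query.split())
    if next_terms == prev_terms:
        return "backtracking"   # redundancy: reusing prior search terms
    if next_terms < prev_terms:
        return "broadening"     # elaboration: dropping terms widens scope
    if next_terms > prev_terms:
        return "refining"       # elaboration: adding terms narrows scope
    if len(prev_terms ^ next_terms) == 2 and prev_terms & next_terms:
        return "poke-n-hope"    # redundancy: one word swapped, rest retained
    return "other"              # anything else needs human judgment
```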


Comparing the search tactics of different groups of users is not a new concept, nor is it one in which the results are likely to be conclusive. In their examinations of the search behaviors of medical students, Wildemuth and Moore (1995) concluded that no generalizations can be made about the "relationship between search performance and the number of statements executed, terms used, citations retrieved, or types of moves used." Further, they noted a strong possibility that searches might contain syntactical or typographical errors and fail to use any terms from a controlled vocabulary. A recent examination of transitions between search tactics using social network analysis (Xie & Joo, 2010) indicated that user effort was a factor in the frequency of the tactics selected, and that reaching the point of actually using information frequently involved considerable effort on the user's part. Aula and Nordhausen (2006) found that the speed of query formulation and evaluation of results was also a factor in determining search success.

Fidel (1984) noted two kinds of searchers: operationalists, who use optimal strategies to enhance precision, and conceptualists, who use facets to enhance recall. Conceptual moves were defined by Fidel as moves that change the actual meaning of the query, such as broadening, narrowing, and related descriptors. Operational moves were defined as moves that use system features to modify a retrieval set, such as synonyms, variant spellings, controlled vocabulary, and date. Fenichel (1980), in contrast, surveyed the retrieval literature and concluded that there existed considerable variation in search styles and approaches to searching. Further, searchers also tend to use intuition when applying any rules or structures (Fidel, 1991). In fact, a searcher's background, experience, and current mental state all have a bearing on their information seeking behaviors (Harter, 1992). Other researchers, e.g., Vakkari (2000) and Kuhlthau (1993), have shown that users are also greatly influenced by where they are in the search process.

2.3 Search Effort

Zipf (1949) proposed that individuals are governed by a "Principle of Least Effort," taking the actions that incur the least effort on their part. Wilson (2006) asserts that "information providers must assume that information users will adopt very simple search strategies in seeking information." Studies examining end user searches seem to support this principle, showing that end users search differently than expert searchers, less systematically and with shorter queries (Markey, 2007). If users desire the least effort, obtain mixed results from search tactics, and see no effect of knowledge upon their searches, it is not at all surprising that differences between high achievers and low achievers might be better explained by amount of effort than by any of those prior concepts.

Satisficing behaviors based upon this "Principle of Least Effort" can be seen in recent research in retrieval systems which found that users can compensate for poorly performing systems through internal adaptation (Smith & Kantor, 2008), perhaps related to the use of fewer satisficing behaviors, since users were forced to interact with the system more to obtain similar levels of results. Smith (2008) also found that searchers changed their search behaviors between difficult and easy topics, another indication that satisficing may play a role in the effort expended by the user.

Aula et al. (2010) also examined difficult topics, finding that even skillful searchers sometimes struggle with difficult tasks, spending more time on examination of results lists for those tasks. Since the level of satisficing depends upon the user's perception of being satisfied enough with the results to decide not to continue, users expending more effort might have higher thresholds before they are satisficed. Interestingly, some research (White & Morris, 2007) has shown that advanced searchers click further down on the search results list than other searchers, perhaps again due to that higher threshold.

Other research into difficult search tasks has shown that the difficulty of a particular task may be predicted based upon searcher behaviors, provided task type is taken into account (Liu et al., 2010). Gwizdka (2009) also showed that the interaction between the searcher's cognitive abilities and the difficulty of the task influenced effort, measured in part by the number of queries made. This difference was noted in the participants with higher cognitive abilities, but not in the lower group, perhaps indicating that effort and satisficing might differ based upon these factors.

It is clear that incorporating all the factors that influence search success or failure is a monumental, perhaps even unrealistic, task. For this reason, many researchers have narrowed their focus to certain aspects, certain tactics, or certain environments. Effort has been quantified and measured in this way, but it has not been studied as something fundamentally different from other tactics, nor has it been compared to the use of those other tactics to determine its predictive usefulness in retrieval success.

3. METHOD

This study is a secondary analysis of data collected in another study designed to test how retrieval system ratings might be influenced by subjects' perceptions of their own performances within that system (Kelly et al., 2008). The original study included 60 subjects and gathered data in log files that were not affected by the manipulations in that study. The log files and actual performance information from that study formed the basis of the data for this study. To identify high and low performers for the current study, subjects from the previous study were ranked by their average recall and placed into three groups:

• Group 1: High Performers (9 subjects, recall from .31 to .52)

• Group 2: Average Performers (41 subjects, recall from .17 to .29)

• Group 3: Low Performers (10 subjects, recall from .13 to .16)


The demarcations between the groups were selected based upon the clustering of recall results. For the High Performers, there was a noticeable gap between recall averages of .29 and .31, with a cluster of three users on each side of the gap. Low Performers were also selected based upon clustering, with the added constraint of adhering to a similar fraction of the total population; the demarcation between .16 and .17 was therefore selected to ensure similar sample sizes in the high and low groups. Average performers were not included, since this study focuses primarily on high and low performers. Understanding the differences between these groups is especially critical for developing search assistance for low performers.
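A minimal sketch of this grouping step, assuming a mapping from subject ID to average recall and using the gap thresholds reported above:

```python
def split_by_recall(avg_recall: dict[str, float]) -> dict[str, list[str]]:
    """Partition subjects into low/average/high groups at the observed gaps."""
    groups = {"low": [], "average": [], "high": []}
    for subject, recall in avg_recall.items():
        if recall >= 0.31:          # above the gap between .29 and .31
            groups["high"].append(subject)
        elif recall <= 0.16:        # below the gap between .16 and .17
            groups["low"].append(subject)
        else:
            groups["average"].append(subject)
    return groups
```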

3.1 Subjects

The subjects consisted of nineteen undergraduates (high performers=9; low performers=10) from UNC ranging in age from 18 to 21, with an average age of 19.9 for low performers and 20.4 for high performers. Subjects reported their majors as Biology (4), Information Science (4), Business (2), Education (2), Journalism (1), Communications (1), History (1), Psychology/English (1), Linguistics (1), Economics (1) and Biomedical Engineering (1). The subjects were split by gender, 12 male and 7 female, with 8 males and 2 females in the low performing group and 4 males and 5 females in the high performing group. Fourteen subjects reported that they were Fairly Experienced with searching, with three subjects reporting Very Experienced. Average experience for low performers was 3, while for high performers average experience was 3.22. One subject in the low performing group reported their search familiarity as Fairly Inexperienced. Eighteen subjects reported that they searched the Web at least daily, with one subject reporting usage at least weekly.

Subjects were recruited using a solicitation email sent to all university undergraduates and used an online scheduling form to specify their availability. Subjects received $10 compensation for participating in the study.

3.2 Experimental Retrieval System

The study used the XRF interface developed by Harper and Kelly (2006), a specialized search tool designed to allow users to create customizable piles to organize retrieved documents and request contextual relevance feedback (Figure 1 shows a partial screen shot). The piles were color coded to assist identification, and documents could belong to multiple piles, added or removed as the user desired. The user could see at any time how many documents were in a pile and the name of the pile. Clicking on a pile retrieved the documents placed within the pile. Users could also click on a Similar button to retrieve documents similar to those already in a pile. Subjects were not required to create multiple piles; they were instructed that they could use as many or as few piles as they wished, but that they would have to maintain at least one pile to store saved documents. The bottom half of the interface displayed the search results list and the full text of the selected document.

Figure 1. XRF Experimental System
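The pile mechanics described above amount to a many-to-many mapping between piles and documents. The sketch below models that behavior in Python; it is an illustration of the described functionality, not the XRF implementation, whose internals are not reported here:

```python
from collections import defaultdict

class Piles:
    """Named piles of saved documents; a document may sit in several piles."""

    def __init__(self):
        self._piles = defaultdict(set)  # pile name -> set of document IDs

    def add(self, pile: str, doc_id: str) -> None:
        self._piles[pile].add(doc_id)   # adding to a new name creates the pile

    def remove(self, pile: str, doc_id: str) -> None:
        self._piles[pile].discard(doc_id)

    def contents(self, pile: str) -> set:
        return set(self._piles[pile])   # shown when the user clicks a pile

    def size(self, pile: str) -> int:
        return len(self._piles[pile])   # count displayed beside the pile name
```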

The system used a retrieval engine developed with the Lemur toolkit (http://www.lemurproject.org/) with ad-hoc capabilities provided using the Okapi BM25 model and default settings from Lemur. The relevance feedback (RF) searching used 25 feedback terms and the Lemur KL-divergence language modeling approach. Response times for the system varied from 2.9 seconds for ad-hoc searches to 6.2 seconds for RF searches. All user interactions with the system were logged and archived.
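For reference, the ad hoc runs were scored with Okapi BM25, whose textbook form is shown below with the usual k1 and b parameters (the study used Lemur's defaults; this is the standard formula rather than Lemur's exact implementation):

```latex
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{f_{t,d} \, (k_1 + 1)}{f_{t,d} + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}
```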

3.3 Collection

The TREC-8 Interactive Track collection (Hersh & Over, 1999) was used in the study. This collection consisted of a corpus of 210,158 articles from the Financial Times of London, 1991-1994, a set of aspectual search topics, and a set of relevance judgments. Topics were selected from those used by Harper and Kelly (2006), who modified topics within the TREC-8 system in order to allow use of the relevance judgments already in the corpus (Figure 2).

Tropical Storms

Imagine that you are enrolled in an environmental science course and you are interested in learning more about tropical storms (hurricanes and typhoons). It seems that tropical storms are becoming more destructive and you decide to investigate past storms. Your instructor asks you to prepare a short, 5-page paper investigating past tropical storms. Your instructor asks you to use historical newspaper articles as your sources of information and to collect comprehensive information on different tropical storms and impacts of each storm. Specifically, your goal is to identify different tropical storms that have caused property damage and/or loss of life, and to find as much information as possible about each storm.

You have up to 15 minutes to find and save documents related to this topic.

Figure 2. Sample Search Topic.

Topics were altered to create situations where users would find exhaustive and comprehensive information, but retained the original topicality to ensure that the original relevance judgments would remain valid. Text was also added to the topics to posit a particular information-seeking work task which made the age of the documents appropriate. Subjects were not given the original TREC instructions for the aspectual recall task (see Hersh & Over, 1999 for more information).

The relevance assessments come directly from the TREC-8 Interactive Track. The recall measure follows the traditional definition: the proportion of the relevant documents that were retrieved by the user. In this case, placing a document into a pile was taken to denote saving that document. The topics used in this study were tropical storms, declining birth rates, and robotic technology.
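A minimal sketch of the recall computation as defined here, treating "saved" as "placed in any pile" and assuming the TREC relevance judgments are available as a set of relevant document IDs per topic (the names are illustrative):

```python
def recall(saved_docs: set[str], relevant_docs: set[str]) -> float:
    """Proportion of a topic's relevant documents that the subject saved."""
    if not relevant_docs:
        return 0.0
    return len(saved_docs & relevant_docs) / len(relevant_docs)

# A subject who piled 4 of a topic's 20 relevant documents scores 4/20 = 0.2.
```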

3.4 Protocol

The original study used a between-subjects experimental design (Kelly et al., 2008). The experimental manipulation was related to the feedback subjects received about their performance before they completed the last evaluation questionnaire. The experimental system was the same across all four feedback conditions. Subjects were randomly assigned to feedback condition. Topic order was counterbalanced according to a Latin square design. Each session lasted approximately one hour.

Upon arrival, subjects were greeted by the researcher and completed a consent form. Subjects were asked to complete a demographic questionnaire that gathered background information about them such as age, major, and search experience. Subjects were asked to watch a short system tutorial on the computer using a sample topic. Subjects were then given the first topic and given control of the system. Subjects were instructed to search on the topic until they were satisfied or until time ran out. Subjects were notified that they had approximately 15 minutes to complete the task. Upon completion of the topic or the end of the time allocated, the subject was asked to complete a post-task questionnaire evaluating the system. This was repeated for two more tasks. After completing all three searches, subjects were given information about their performance (the experimental manipulation) and completed an exit questionnaire evaluating the system. The post-task questionnaire was not analyzed for this study because the focus was on log data; the exit questionnaire was also not used because it was administered after the experimental manipulation. In any case, measures from these questionnaires are not the focus of this secondary analysis of the data.

3.5 Instruments and Measures

3.5.1 Demographic Questionnaire

The Demographic Questionnaire consisted of six questions. Questions 1-4 were used to describe the study sample. Questions 5 and 6 were designed to differentiate inexperienced searchers from experienced searchers. Since the overwhelming majority of subjects reported values of Fairly Experienced or better, these questions were not useful as grouping variables for this study. Subjects' responses to the other demographic questions were reported earlier in Section 3.1.

3.5.2 XRF System Log Files

The XRF System log files recorded the actions taken by the subject during each task. The actions logged included the start and end times of each search, time stamps for all other activities, an ending time stamp, query strings, use of the review pile function, use of the similar to pile function, use of the more results function, and use of the add to pile function. The log files were analyzed manually to count occurrences of queries, results viewed (search results), documents viewed (full text), number of piles created, and number of documents placed into piles (saved). The searches were then examined for examples of Elaboration or Redundancy tactics. For RQ1, we noted the number of documents opened, number of piles created, number of documents placed in piles, number of search iterations (queries issued), number of search terms, and number of unique search terms. For unique search terms, we used stemming, since the XRF system stemmed query words. For RQ2, we noted values for results viewed. Results viewed differs from documents opened in that it reflects the number of search results (snippets) displayed, not the number of documents opened for full text viewing. Finally, for RQ3, we noted the use of the broadening and refining Elaboration tactics as well as the Redundancy tactics of backtracking, plural making and taking, topic term usage, poke-n-hope, and kitchen sink (see Table 1 for more information).
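Although the counting was done manually, the tallying itself is mechanical once events are extracted. The sketch below assumes one timestamped event per log line with an event name such as QUERY, VIEW_RESULTS, OPEN_DOC, or ADD_TO_PILE; this format is invented for illustration, as the paper does not specify the XRF log syntax:

```python
from collections import Counter

def count_events(log_lines: list[str]) -> Counter:
    """Tally event types from lines shaped like '12:03:41 OPEN_DOC FT911-1234'."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1]] += 1  # the second field is the event name
    return counts

# counts["QUERY"] -> search iterations; counts["OPEN_DOC"] -> documents opened
```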

4. RESULTS

Data were analyzed using SPSS v16.0 for Windows. Each subject in the study performed three searches, and the measures from those searches were averaged to obtain values for comparison across subjects. Similar results were obtained by comparing the individual searches.

4.1 H1: There are significant differences between the groups in measures of effort.

To test H1, we used the grouping variable Group, representing whether the user was a member of the high recall group or the low recall group, as the independent variable, with Docs Opened, Docs in Piles, Piles Used, Search Iterations, Search Terms, Unique Search Terms, and Search Time as the dependent measures. We first examined the means and standard deviations of the variables, seen in Table 2. There are clear differences in the means for Docs Opened, Docs in Piles, Piles Used, and Search Terms, but these differences are not consistent. High performers opened more documents, created more piles, and placed more documents in piles than low performers. However, low performers used more search terms even though the two groups issued roughly the same number of queries, 8.73 and 8.33 for the low and high groups, respectively. Thus, it seems that high performers create better queries and spend more time examining the documents retrieved for each query.


Variable           Group  N   Mean    Std. Dev.
Docs Opened        Low    10  32.000  10.521
                   High    9  59.444  10.581
Docs in Piles      Low    10  11.867   1.709
                   High    9  33.296  11.133
Piles Used         Low    10   6.799   1.604
                   High    9  14.371   3.813
Search Iterations  Low    10   8.734   4.447
                   High    9   8.333   3.235
Search Terms       Low    10  19.734  13.641
                   High    9  10.852   5.090
Unique Terms       Low    10   7.333   3.793
                   High    9   5.259   1.420
Search Time        Low    10  11.539   1.747
                   High    9  13.257   0.965

Table 2. H1 Variable Means.

The results also show that high performers save more documents than low performers (33.29 compared to 11.86). A closer examination shows a large discrepancy in the standard deviation scores for this measure (Docs in Piles); in particular, there was greater variation among high performers than among low performers, who saved a fairly similar number of documents. Low performers, however, had a much greater variation in number of search terms, indicating perhaps a wide variety in search tactics. Levene's tests showed several cases where equal variance cannot be assumed (essentially, that the differences in standard deviations are significant), including Docs in Piles (F=10.042, p=0.006), Piles Used (F=6.424, p=0.021), Search Terms (F=5.144, p=0.037), and Unique Terms (F=13.158, p=0.002). Thus, for these analyses we use t-test results that do not assume equal variance.

We examined the significance of the comparisons of the means using t-tests in SPSS that do not assume equal variance (Table 3). As noted above, we set our p value at .01 to guard against false positives, since we were conducting multiple comparisons.

Variable           t       df      Sig. (2-tailed)
Docs Opened        -5.660  16.769  0.000*
Docs in Piles      -5.715   8.340  0.000*
Piles Used         -5.534  10.515  0.000*
Search Iterations   0.226  16.340  0.824
Search Terms        1.916  11.686  0.080
Unique Terms        1.608  11.703  0.134
Search Time        -2.687  14.284  0.017

Table 3. H1 Variable t-tests, equal variance not assumed.

When comparing the variables representing effort, the variables Docs Opened, Docs in Piles, and Piles Used show significance at .01 between members of the Low scoring and High scoring groups, partially supporting H1. The variable Search Time (p=0.017) also indicates potential for future research, although it did not reach significance at this threshold.
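For readers reproducing this style of analysis outside SPSS, the same two-step procedure (Levene's test for equality of variances, then a t-test without the equal-variance assumption) is available in SciPy. The values below are placeholders, not the study's per-subject data:

```python
from scipy import stats

low = [30.0, 25.5, 41.0, 28.0, 33.5, 36.0, 27.0, 35.0, 31.0, 33.0]  # placeholder
high = [55.0, 62.5, 48.0, 71.0, 58.5, 60.0, 66.0, 52.0, 62.0]       # placeholder

lev_stat, lev_p = stats.levene(low, high)       # can equal variance be assumed?
t_stat, t_p = stats.ttest_ind(low, high, equal_var=(lev_p > 0.05))

print(f"Levene p={lev_p:.3f}; t={t_stat:.3f}, p={t_p:.3f} (threshold 0.01)")
```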

4.2 H2: There are significant differences between the groups in measures of depth.

In order to test H2 we used the independent variable Group with the dependent variable Search Results Viewed. We first examined the means and standard deviations of the variables, seen in Table 4. The high performing group shows higher values in both mean and standard deviation. Combined with the previous finding that the high and low performers issued similar numbers of queries, these results suggest that high performers evaluated search results at greater depths, essentially looking further down the search results list for relevant documents.

Variable               Group  N   Mean     Std. Dev.
Search Results Viewed  Low    10  196.000  43.794
                       High    9  254.667  90.372

Table 4. H2 Variable Means.

Levene’s Test indicated that equal variance could be assumed for this variable (F=1.490, p=0.239), so we used a standard t-test in SPSS using a p value of .01 to examine the differences between the groups. The variable Search Results Viewed, here representing Depth, did not show significance at .01 or better when comparing members of the Low scoring and High scoring groups, refuting our hypothesis H2 (t=-1.832, df=17, p=0.085).

4.3 H3: There are no significant differences between the groups in measures of search tactics.

In order to test H3 we used the independent variable Group with the dependent variables Refine and Broaden representing Elaboration and the dependent variables Backtrack, Plural M/T, TopicTerms, Poke-n-Hope, and Ksink representing Redundancy. We first examined the means and standard deviations of the variables (Table 5).

Variable     Group  N   Mean    Std. Dev.
Refine       Low    10   2.565  2.132
             High    9   1.111  1.055
Broaden      Low    10   0.934  0.541
             High    9   0.667  0.763
Backtrack    Low    10   1.200  0.849
             High    9   0.630  0.773
Plural M/T   Low    10   0.901  1.268
             High    9   0.482  0.413
TopicTerms   Low    10  14.200  7.969
             High    9   8.149  2.963
Poke-n-Hope  Low    10   3.299  2.755
             High    9   1.592  1.322
Ksink        Low    10   1.799  2.732
             High    9   0.779  0.800

Table 5. H3 Variable Means.


There are clear differences in the means for TopicTerms, and smaller differences in several of the other mean comparisons. In fact, low performers executed more search tactics than high performers. Closer examination also shows a difference in the standard deviation scores for TopicTerms; the high performers were much more similar to one another than the low performers. Levene's tests show that equal variances cannot be assumed for many of the variables in this group, including Refine (F=8.989, p=0.008), Plural M/T (F=6.796, p=0.018), TopicTerms (F=4.814, p=0.042), and Poke-n-Hope (F=12.339, p=0.003). In all cases, the variances for the low performers were greater than those for the high performers, indicating much more variability in the search behaviors of these searchers. Thus, for these analyses we use t-test results that do not assume equal variance.

We examined the significance of the differences using t-tests in SPSS without the assumption of equal variance, as noted above, again setting our p value at .01. Results are shown in Table 6. None of the variables representing Elaboration or Redundancy show significance at .01 or better when comparing members of the low scoring and high scoring groups, confirming our hypothesis H3. The p value for TopicTerms (p=0.046) is low enough, however, to indicate a potential for further study.

Variable     t      df      Sig. (2-tailed)
Refine       1.912  13.445  0.077
Broaden      0.873  14.275  0.397
Backtrack    1.532  16.995  0.144
Plural M/T   0.988  11.072  0.344
TopicTerms   2.236  11.668  0.046
Poke-n-Hope  1.748  13.222  0.104
Ksink        1.128  10.689  0.284

Table 6. H3 Variable t-tests, equal variance not assumed.

5. DISCUSSION

Examination of the findings for the Elaboration and Redundancy tactics shows no clear relationship between use of these tactics and membership in either the High or the Low results group, although less stringent criteria indicate that TopicTerms might be worth further study. It is clear that none of these variables can be used reliably to predict user performance, answering RQ3 negatively, as expected. The results from Hembrooke, Granka, Gay, and Liddy (2005) are not in contradiction to these results, as they compared domain experts to novices rather than groups with high and low recall performance.

The findings for the remaining questions are more mixed. The variable Search Results Viewed, representing depth, showed no significant difference between groups, despite a fairly substantial difference in means, with high performers looking more exhaustively at the search results lists. This result is in contrast to our prediction for H2. For this study, the hypothesis H2 that there would be differences in the groups in quantity of results viewed must be rejected.

Examination of the remaining variables shows support for H1. Three of the variables, Docs Opened, Docs in Piles, and Piles Used, show significant differences between the High and Low groups. This supports our hypothesis H1 that there would be differences in effort between the groups, although not all measures show the same significance. Even with this limitation, it is important to note that the most significant differences between the high and low performing groups are in how many documents they opened, how many documents they put into piles, and how many piles they used. Since subjects were only required to create one pile to store saved documents, this result could mean that the extra effort of using piles and placing documents into them somehow structures the process more successfully for the searcher. This finding is aligned with previous research on expertise, which emphasizes the importance of information organization (LaFrance, 1989; Baloglu, 2003).

Another interesting note is that subjects in the low performing group actually used more search terms than those in the high performing group. This could indicate more persistence by the high performers, less patience by the low performers, or more willingness by the high performers to continue examining results rather than reformulating queries. Although the difference was not significant, it is consistent with the hypothesis that higher performers expend more effort than lower performers, and it indicates the need for further study.

It must be noted that the experimental interface used for this study was not designed to support all user behaviors in information seeking. The most notable tactic unavailable to users was the use of Boolean operators in their searches. However, this tactic is not used often by typical web searchers using commercial search engines, so this may not be a significant limitation; other studies have shown that users do not use Boolean operators even when they are available (Borgman, 1996), so their absence does not negate the study results. The original study also limited the amount of time searchers could spend on each topic, which could have affected some searchers more than others. We also note that we set our significance threshold at 0.01 to reduce the risk of false positives, given the multiple comparisons in this study.

6. CONCLUSION

The study of information seeking covers many topics, including systems benchmarking, interaction modeling, domain expertise, search strategies, and search tactics, but little work has been done on the effects of effort upon search success. While it is true that system and interface design play important roles in success, the amount of effort exerted by the user in completing a task can be a key factor. This study examined the log files of high performing and low performing users to see if the high performers expended more effort during retrieval tasks. We also examined tactical user behaviors representing Elaboration and Redundancy to see their effect upon search success.


Discovering the effects of effort on information seeking success can help to redirect research into areas of more direct actual benefit to the users of these systems.

Future research into the effects of effort should broaden applicability by using a different search interface that allows for the capture of more types of user behavior. For instance, subjects could be split into groups and given different instruction sets to determine whether greater effort, and therefore greater success, can be induced through instruction. Subjects could also be selected from a more diverse population to test the hypotheses outside of the original sample of university students. It is expected that broadening these variables may make the results less definitive, but using these variables predictively requires a broad data set. The results also indicate strong significance in the use of piles when comparing the two groups; future research could explore whether the use of piles itself helps with search success or whether it is a by-product of some other factor in the user.

The principle of parsimony asks scientists to consider the least complex explanation for an observation. Perhaps this is also applicable in measuring overall retrieval success – that success primarily springs from the effort of the searcher rather than the specific actions taken by the searcher.

REFERENCES

Aula, A., & Nordhausen, K. (2006). Modeling successful performance in Web searching. JASIST, 57(12): 1678-1693.

Aula, A., Khan, R. & Guan, Z. (2010). How does search behavior change as search becomes more difficult? Proceedings of ACM CHI '10, Atlanta, GA, 35-44.

Baloglu, M. (2003). High and low achieving education students on processing, retaining, and retrieval of information. Retrieved March 22, 2008 from The Free Library. (http://www.thefreelibrary.com/).

Bates, M. (1979). Information Search Tactics. Journal of the American Society for Information Science, 30: 205-214.

Bates, M. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5): 407-424.

Belkin, N. (1980). Anomalous states of knowledge as a basis for information retrieval. Canadian Journal of Information Science, 5:133-143.

Borgman, C. (1996). Why are online catalogs still hard to use? JASIS, 47(7): 493–503.

Case, D. (2002). Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior, Academic Press.

Dervin, B. (1999). On studying information seeking methodologically: the implications of connecting metatheory to method. Information Processing & Management, 35: 727-750.

Fenichel, C. (1980). The process of searching online bibliographic databases: A review of research. Library Research. 2(2): 107-127.

Fidel, R. (1984). Online Searching Styles: A Case-Study Based Model of Searching Behavior. Journal of the American Society for Information Science. 35(4): 211-221.

Fidel, R. (1991). Searchers' Selection of Search Keys: I. The Selection Routine. Journal of the American Society for Information Science, 42(7): 490-500.

Frensch, P., & Sternberg, R. (1989). Expertise and Intelligent Thinking: When Is It Worse to Know Better? Advances in the Psychology of Human Intelligence, vol.5, edited by R. Sternberg, 157-88. Hillsdale, N.J.: Erlbaum.

Gwizdka, J. (2009). What a difference a tag cloud makes: Effects of tasks and cognitive abilities on search results interface use. Information Research, 14(4).

Harper, D. & Kelly, D. (2006). Contextual relevance feedback. Proceedings of the 1st Symposium on Information Interaction in Context (IIiX), Copenhagen, Denmark.

Harter, S., & Rogers-Peters, A. (1985). Heuristics for online information retrieval: A typology and preliminary listing. Online Review, 9(5): 407-424.

Harter, S. (1986). Online Information Retrieval: Concepts, Principles, and Techniques. San Diego, CA: Academic Press, Inc.

Harter, S. (1992). Psychological relevance and information science. Journal of the American Society for Information Science, 43(9):602-615.

Hawkins, D. & Wagers, R. (1982). Online bibliographic search strategy development. Online, 6(3): 12-19.

Hembrooke, H., Granka, L., Gay, G., & Liddy, E. (2005). The Effects of Expertise and Feedback on Search Term Selection and Subsequent Learning. Journal of the American Society for Information Science and Technology, 56(8). 861-871.

Hersh, W., & Over, P. (1999). TREC-8 interactive track report. In D. Harman and E. M. Voorhees (Eds.), The Eighth Text Retrieval Conference (TREC-8), 57-64.

Hsieh-Yee, I. (1993). Effects of Search Experience and Subject Knowledge on the Search Tactics of Novice and Experienced Searchers. Journal of the American Society for Information Science, 44(3). 161-174.

Kelly, D. and Belkin, N. (2004). Display time as implicit feedback: Understanding task effects. In Proceedings of the 27th Annual ACM International Conference on Research and Development in Information Retrieval (pp. 377–384). Sheffield.


Kelly, D., Shah, C., Sugimoto, C., Bailey, E., Clemens, R., Irvine, A., Johnson, N., Ke, W., Oh, S., Poljakova, A., Rodriguez, M., van Noord, M., & Zhang, Y. (2008). Method bias? The effects of performance feedback on users' evaluations of an interactive IR system. UNC SILS Technical Report #TR-2008-01. Available online at: http://sils.unc.edu/research/techreports.html

Kuhlthau, C. (1993). A principle of uncertainty for information seeking. Journal of Documentation, 49(4): 339-355.

LaFrance, M. (1989). The Quality of Expertise: Implications of Expert-Novice Differences for Knowledge Acquisition. SIGART Newsletter, 108. 6-14.

Leckie, G., Pettigrew, K. & Sylvain, C. (1996). Modeling the information seeking of professionals: A general model derived from research on engineers, health care professionals, and lawyers. Library Quarterly, 66 (2): 161-193.

Liu, J., et al. (2010). Search behaviors in different task types. Proceedings of the 10th annual joint conference on Digital libraries. Gold Coast, Queensland, Australia. 35-44.

Marchionini, G. (1995). Analytical Search strategies. Information Seeking in Electronic Environments. NY: Cambridge University Press.

Markey, K. (2007). Twenty-five years of end-user searching, Part 1: Research findings. JASIST, 58(8), 1071-1081.

Markey, K. & Atherton, P. (1978) Online Training and Practice Manual for ERIC Data Base Searchers, Syracuse, NY: ERIC Clearinghouse on Information Resources.

Meadow, C. & Cochrane, P. (1981). Basics of Online Searching. New York: Wiley.

Nordlie, R. (1999). "User revealment"-a comparison of initial queries and ensuing question development in online searching and in human reference interactions. Proceedings of 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 11-18.

Ruthven, I., Lalmas, M. & van Rijsbergen, C. (2003). Incorporating user search behavior into relevance feedback. Journal of the American Society for Information Science and Technology. 54(6): 528-548.

Smith, C. (2008). Searcher adaptation: A response to topic difficulty. Proceedings of the American Society for Information Science and Technology, 45(1).

Smith, C., & Kantor, P. (2008). User adaptation: Good results from poor systems. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore. 147-154.

Solomon, P. (1992). On the Dynamics of Information System Use: From Novice to ?. Proceedings of the ASIS Annual Meeting, 29. 162-70

Sperber, D., & Wilson, D. (1986). Relevance: Communication and Cognition. Cambridge, MA: Harvard University Press.

Vakkari, P. (2000). Cognition and Changes of Search Terms and Tactics During Task Performance: A Longitudinal Study. In Proceedings of RIAO 2000 (Recherche d'Information Assistée par Ordinateur), 894-907. Paris.

Vakkari, P. (2002). Subject Knowledge, Source of Terms, and Term Selection in Query Expansion: An Analytical Study. Lecture Notes in Computer Science, Vol. 2291: Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (Advances in Information Retrieval), 110-123.

Wildemuth, B. & Moore, M. (1995). End-user search behaviors and their relationship to search effectiveness. Bulletin of the Medical Library Association, 83: 294-304.

Wildemuth, B. (2004). The Effects of Domain Knowledge on Search Tactic Formulation. Journal of the American Society for Information Science and Technology, 55(3). 246-258.

White, R.W. and Morris, D. (2007) Investigating the querying and browsing behavior of advanced search engine users. Proc. SIGIR 2007, 255-262.

Wilson, T.D. (2006). Information-seeking behaviour and the digital information world. The Indexer, 25 (1): 28-31.

Xie, I. & Joo, S. (2010). Transitions in search tactics during the web-based search process. Journal of the American Society for Information Science and Technology. 61(11): 2188-2205.

Zipf, G. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.