Metrics for evaluating human information
interaction systems
Jean Scholtz *
National Institute of Standards and Technology, Information Technology Laboratory, MS 8940,
100 Bureau Drive, Gaithersburg 20899, MD, USA
Received 29 December 2003; received in revised form 17 August 2005; accepted 12 October 2005
Available online 27 January 2006
Abstract
Society today has a wealth of information available due to information technology. The challenge
facing researchers working in information access is how to help users easily locate the information
needed. Evaluation methodologies and metrics are important tools to assess progress in human
information interaction (HII). To properly evaluate these systems, evaluations need to consider the
performance of the various components, the usability of the system, and the impact of the system on
the end user. Current usability metrics are adequate for evaluating the efficiency, effectiveness, and
user satisfaction of such systems. Performance measures for new intelligent technologies will have to
be developed. Regardless of how well the systems are and how usable the systems are, it is critical
that impact measures are developed. For HII systems to be useful, we need to assess how well
information analysts work with the systems. This evaluation needs to go beyond technical
performance metrics and usability metrics. What are the metrics for evaluating utility? This paper
describes research efforts focused on developing metrics for the intelligence community that
measure the impact of new software to facilitate information interaction.
Published by Elsevier B.V.
Keywords: Human information interaction; Information retrieval; Evaluation; User-centered; Intelligence
analysis; Metrics for utility
Interacting with Computers 18 (2006) 507–527
www.elsevier.com/locate/intcom
0953-5438/$ - see front matter Published by Elsevier B.V.
doi:10.1016/j.intcom.2005.10.004
* Tel.: +1 301 975 2520; fax: +1 301 975 5287.
E-mail address: [email protected]
1. Introduction
Lucas (2000) suggests that it is necessary for us to move from a ‘computer-centric’
world to an ‘information-centric’ world. He suggests that we should interact with
information, not computers. Hence, the challenge today is to design information objects
and our interactions with these objects. The term ‘human information interaction’
(Gershon, 1995) denotes: ‘how human beings interact with, relate to, and process
information regardless of…the medium connecting the two.’ The problem was expressed
in the National Research Council report (1997):
“Today’s technology, built to meet obsolete constraints of the 1960s and 1970s,
focus users’ attention and work patterns on the tool instead of the information.”
We are aware that the term ‘information’ is overused and that there are often questions
about distinctions between ‘data’ and ‘information.’ In this paper, we use the term
‘information’ to denote that we are working with data that has been processed from a raw
form. That is, we are not concerned about analyzing streams of signals. We are concerned
with being able to look at maps, text reports, photographs, newspapers, books, web pages,
and so on. In our definition, the World Wide Web is a source of information.
In this paper we focus on one particular group of information analysts, professional
intelligence analysts. While there are many types of intelligence analysts, differing on the
sources of information they analyze, the consumer for whom they are producing
the intelligence, and the domain they are analyzing, they all face a world in which the
information they must deal with is growing at an ever increasing rate.
With the development of the World Wide Web in 1989, more and more information has
been put on-line. Lesk (1997) calculates that in 1997 there were approximately 3 petabytes
(3000 terabytes) of information available at the United States Library of Congress, with
20 terabytes of this in text. He suggests that the amount of text on the world-wide web
might increase to 800 terabytes. While the Library of Congress only houses published text,
the web is available for all of us to post information, thus accounting for the large
difference in amount of material. Craine (2000) notes that ‘The world is experiencing
exponential growth of digital information. More information has been produced in the last
thirty years than in the previous five thousand—the entire history of civilization.’
Some experts estimate that analytic tools today must handle 70 Terabytes of data daily
and that is growing at the rate of 60% per year. Moreover, the world is becoming much
smaller in the sense that crises in one country can have great impact on very distant
countries. Consider the impacts of diseases such as Asian bird flu or ‘mad cow’ disease.
Terrorist groups are able to carry out their plans in countries far from where they reside.
Transactions, travel, and communications have greatly increased, and with them the
records of such activities.
The challenge then is to provide information analysts, both intelligence analysts and
other information analysts, with tools to support interaction and exploration of information
in this enormously complex information space. These analysts need the functionality to go
beyond just finding information to finding surprises—information they did not know they
were looking for. We contend that current performance and usability metrics are
inadequate to assess all aspects of human information interaction. Performance measures
focus on measuring the accuracy and speed of software. For example, a software tool that
filters e-mail messages based on a user profile certainly needs to capture the correct
messages in a timely fashion. The process of setting up the profile, editing the profile, and
checking on which messages are captured needs to be usable. However, meeting
performance and usability criteria will not guarantee that information analysts will use the
tool. The tool also needs to positively impact the analyst’s process, product, or both.
To assess new research systems such as those sponsored by the Advanced Research and
Development Activity (ARDA, 2005), we have been conducting a series of evaluations with
researchers in these programs. The goal is to develop user-centered measures of impact for
human information interaction systems. The intelligence community represents ‘power users’
of information interaction systems. Their work is often time critical; it is extremely important
that they locate and digest as much relevant information as possible to understand the overall
situation. Additionally, they need the ability to explore the data to devise alternative scenarios.
2. Background
The evaluation of digital libraries is a reasonable place to look for evaluation work that
could be used as a starting point. Covey (2002) conducted a survey of 24 out of 26
members of the Digital Library Federation asking what types of assessments they had done
for digital libraries and whether they felt there was a good cost/benefit ratio. She
concluded that while many studies had been conducted, the community is still addressing
questions such as what the right measures and composite measures are to capture and
assess digital library use. The dynamic nature of the capabilities and processes of digital
libraries adds to the complexity of this challenge.
Measures for physical libraries include: circulation, collection size, growth rate of the
collection, number of visits, user satisfaction, and user visits (Marchionini, 2000). These
measures are not sufficient for evaluating digital libraries. Marchionini stressed the need for
looking at the community impact of digital libraries, including such things as recruitment
of personnel with new skills. The study of bibliometrics (White and McCain, 1989) looks
at the impact that individuals or communities have on research by looking at the citations
of research papers.
A workshop on usability of digital libraries (Blandford and Buchanan, 2002) noted that
usability studies should include the larger context in which the search is taking place. This
report also noted the issue with defining who the users of digital library systems are. An
evaluation study of the Alexandria Digital Library (Hill et al., 1997) focused on three types
of users: earth scientists, information specialists, and educators. In this evaluation of the
geolibrary, the earth scientists and educators were involved in both searching for
information and in analyzing the information. The information specialists were experts
used to help locate information but were not involved with the analysis of the information.
The evaluators found that the earth scientists wanted the library capabilities more fully
integrated into the environment used for analysis; the educators wanted the capability to
work in groups in the classroom.
There are both similarities and differences between intelligence analysts and users of
digital libraries. First of all, digital libraries need to worry about the collection of
information provided, not just the system access to it. In the intelligence world, there is
certainly the need to worry about the collection, organization, and management of the data
but evaluations of tools to provide access to this information are less concerned with the
management and acquisition processes. Intelligence analysts, although varied in focus and
skill levels, should be categorized as skilled both in the technical system and in the domain
(that is, more like the earth scientists in the Alexandria Digital Library assessment). While
analysts are not forced to use any tools provided to them, they must access online
information. Therefore, they are not discretionary users in the same sense as many digital
library users.
Like researchers in the digital library world, researchers in the intelligence analysis
world realize that evaluations must move beyond usability and performance. Both
communities agree that research is needed to develop the appropriate metrics. In the
intelligence community the job is somewhat easier as we have only one user group,
although somewhat diverse, to consider in the design and evaluations of our systems.
In the following sections, we present a discussion of intelligence analysis for those
unfamiliar with this profession, a discussion of the different types of evaluations we have
done, and the metrics that have resulted from these evaluations.
3. Intelligence analysis
To help readers who are not familiar with intelligence analysis, we present an
introductory section. Webster’s definition of intelligence reads:
Main entry: in·tel·li·gence
1 a (1): the ability to learn or understand or to deal with new or trying situations:
REASON; also: the skilled use of reason (2): the ability to apply knowledge to
manipulate one’s environment or to think abstractly as measured by objective
criteria (as tests) b Christian Science: the basic eternal quality of divine Mind c:
mental acuteness: SHREWDNESS
2 a: an intelligent entity; especially: ANGEL b: intelligent minds or mind ⟨cosmic
intelligence⟩
3: the act of understanding: COMPREHENSION
4 a: INFORMATION, NEWS b: information concerning an enemy or possible enemy
or an area; also: an agency engaged in obtaining such information
5: the ability to perform computer functions
A number of definitions of Intelligence exist within this community. The definition that
Warner (accessed May, 2004) arrives at is ‘intelligence is secret, state activity to
understand or influence foreign entities.’ However, in arguing for this definition, Warner
notes that intelligence is both a process and a product. He also makes the point that
information does not equal intelligence; intelligence analysts provide added value to
information during the intelligence process.
The United States Central Intelligence Agency provides a diagram of the traditional
intelligence analysis cycle. This is shown in Fig. 1.

Fig. 1. The intelligence cycle and its use of information (CIA home page, accessed May, 2004).

While the phases in this process are valid, the actual process is an iterative one, which the
figure does not adequately depict.
An analyst working in a particular domain is continually monitoring information and
assimilating this into her knowledge of the domain. At any point in time, the analyst is
able to respond to a short term request for information within this domain (Medina,
2002).
Currently the intelligence community is looking at new models for intelligence
analysis. However, there is no new agreed-upon model of the process. Medina (2002) notes
that the traditional model is focused on understanding an event or developments. Today,
analysts need to understand the needs of their customers and add value in an environment
where many have access to information. Medina suggests that analysts need to focus on
ideas, to take risks and to consider ‘adventurous analytic ideas.’ Research programs for
intelligence analysis are developing software along these lines. To determine whether this
software will help the analysts requires the development of new metrics to assess changes
in the analytic process and products.
4. Evaluating the utility of human information interaction
The evaluation of interactive information systems encompasses three types of
evaluation:
† Performance or algorithm evaluations
† Usability or interaction evaluation
† Overall impact or process evaluation
Performance metrics focus on processing time and accuracy of the algorithm. For
example, information retrieval evaluations commonly use precision and recall measures to
determine how well the algorithms are working (TREC conference, accessed Feb. 2005;
Voorhees and Buckland, 2003).
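As a concrete illustration of these measures (the code and document identifiers below are hypothetical, not drawn from any TREC collection), precision and recall can be computed from the sets of retrieved and relevant documents:

```python
# Illustrative sketch: precision and recall for a single query.
# Document IDs are made up for the example.

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant were found
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # 0.5 0.6666666666666666
```

A system tuned for high recall may sacrifice precision, which is one reason the two measures are usually reported together.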
The human–computer interaction (HCI) community conducts usability evaluations of
information systems. The metrics commonly used are effectiveness, efficiency, and user
satisfaction evaluated in the context of use (ISO 9241). That is, users use tools not just for
the sake of using tools but within a larger scenario. Measures to support usability metrics
should be collected within this larger scenario. Effectiveness measures the percentage of
tasks a user is able to complete, both with and without external help. Efficiency is the
measure of how much time users need to complete tasks in the scenario. User satisfaction
measures the user’s perception of the interaction process during the scenario.
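As a sketch of how these three metrics might be aggregated from scenario session logs (the record layout and the values are invented for illustration, not taken from the evaluations described here):

```python
# Hypothetical session records: one tuple per task attempt in the scenario.
from statistics import mean

sessions = [
    # (task completed?, needed external help?, seconds on task, satisfaction 1-7)
    (True, False, 340, 6),
    (True, True, 510, 4),
    (False, False, 900, 2),
]

# Effectiveness: share of tasks completed, with and without external help
effectiveness_overall = mean(done for done, _, _, _ in sessions)
effectiveness_unaided = mean(done and not helped for done, helped, _, _ in sessions)
# Efficiency: mean time needed to complete tasks in the scenario
efficiency_secs = mean(t for _, _, t, _ in sessions)
# User satisfaction: mean questionnaire rating
satisfaction = mean(s for _, _, _, s in sessions)

print(effectiveness_overall, effectiveness_unaided, efficiency_secs, satisfaction)
```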
To complete the evaluation of HII systems, we must move beyond performance and
usability and consider utility or impact measures. That is, how do human information
interaction systems change the work that users are doing? What effect do new information
interaction systems have on work products? Obviously these three types of evaluations are
closely tied. Systems need to have attained specified levels of accuracy and usability
before we can assess utility. However, good accuracy and usability do not guarantee good
measures of utility. These evaluations are also iterative. Assessments of utility may
produce new criteria for accuracy and usability.
The following sections present some methodologies, results, and lessons learned from
evaluations we conducted of interactive information systems in the past year. The systems
that we evaluated are all being developed for use in intelligence analysis. The success and
continuation of research programs for this community depends in part on the development
of metrics to assess the impact of information interaction systems.
We have conducted evaluations in three settings: laboratory studies, simulated
environments, and operational environments. We report on lessons learned from each
of these types of evaluations and how these are shaping our development of metrics for
information interaction systems.
4.1. Laboratory studies: question-answering evaluations
TREC (TREC conference, accessed Feb. 2004) sponsors a track on question-answering
that focuses on retrieving information, not documents. Systems are to answer a question by
finding information in one or more documents and then presenting the portion of the
document that answers the question. A subset of these question answering research
systems was interested in conducting dialogues with the users to determine the users’
information needs. These systems need the capability to answer questions more complex
than simple factoid questions. The typical ‘batch processing’ mode of evaluation was not
appropriate. Even though users were given the same scenario, they asked different
questions and had different styles of dialogue with the systems.
We conducted two studies with research systems to understand variations in dialogue
between users and the systems. Our objectives were to provide feedback to the researchers
developing the systems; to develop appropriate metrics for future evaluations; and to
provide measures of progress to the sponsors of the research. In the rest of this section we
discuss the two pilot studies we conducted.
4.1.1. Dialogue evaluation pilot #1
The research systems were in various stages of development but all were early, were
not necessarily robust, and lacked usable user interfaces. We adopted a modified Wizard of
Oz technique (Dahlback et al., 1993) in which a text chat served as the user interface for
all systems. The text chat was developed to allow researchers to use any one of three
modes for conversing with the user:
– Wizard: a researcher pretending to be the system responds to the user.
– System: the research system’s response to the user was passed along untouched.
– Modified: the researcher was allowed to modify some part of the system generated
response to make it more understandable to the user.
Testing took place over the internet using the text chat interface. Subjects entered their
queries in the large text area and submitted them. The systems received the query and
proceeded to process the request for information. When the system generated its response,
a log entry was made to note whether the response was generated by the system itself, by a
wizard or by a wizard modifying something that the system generated.
Subjects were allowed 15 min to work on each of the 10 problems. They were told to
continue until one of three conditions occurred:
1. they found what they needed (success)
2. they felt there was no chance of success (failure)
3. time expired
If time ran out, subjects were asked to decide whether they believed that they were
likely to be successful or not. Buttons in the interface were used to guide this
interaction.
A log file format was developed to track each exchange. The header of the log
contained information about the identity of the system and the user. Each activity record
was time-stamped and categorized as to its origin (analyst, system, wizard, or modified)
and the content of the message.
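A minimal sketch of such an origin-tagged, time-stamped activity record follows; the tab-separated layout and field names are assumptions for illustration, not the actual log format used in the study:

```python
# Sketch of one activity record in the dialogue log (layout is hypothetical).
from dataclasses import dataclass
from datetime import datetime

ORIGINS = {"analyst", "system", "wizard", "modified"}

@dataclass
class ActivityRecord:
    timestamp: datetime
    origin: str   # who produced this turn
    message: str  # content of the message

def parse_record(line):
    """Parse a tab-separated 'timestamp<TAB>origin<TAB>message' line."""
    ts, origin, message = line.rstrip("\n").split("\t", 2)
    if origin not in ORIGINS:
        raise ValueError(f"unknown origin: {origin!r}")
    return ActivityRecord(datetime.fromisoformat(ts), origin, message)

rec = parse_record("2004-03-01T10:15:30\tanalyst\tWhat do we know about topic X?")
print(rec.origin)  # analyst
```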
The participants in the first pilot study were assessors for the Text REtrieval
Conference (TREC) (Voorhees and Buckland, 2003). Many of these subjects are retired
intelligence analysts and all have expertise in information retrieval. The tasks in this study
were general rather than domain-specific so the match of subjects’ skills and task
requirements was satisfactory. Ten subjects were recruited for this study and each of them
worked with a single system to perform the 10 tasks. Each system was used by 2 subjects
and the tasks were performed in random order.
4.1.1.1. Data collection. In addition to the logs that were collected, an observer
noted issues that participants were having. These notes were also used to ask more
thorough questions once the participants had finished the experiment. Subjects were
asked to rate the systems on various aspects of the dialogue. They were asked to use
7-point scales (1, extremely dissatisfied; 7, completely satisfied) to address the following
areas:
† the final answer you obtained from the system
† the time that it took to carry out this task
† the dialogue that you carried on with the system
† the clarifications that the system requested
† the number of misunderstandings between you and the system
† the ease of understanding of the system messages
4.1.1.2. Results. We learned a great deal about facets of dialogue interactions that systems
would need to accommodate, including:
† Spelling errors
† Incomplete sentences
† Ability to remember context
† Qualified affirmative and negative responses
We also learned about the process as the participants went about finding answers to the
tasks we had assigned them:
† Analysts ask questions as they occur to them—not necessarily in a strict turn-taking
style.
† Analysts expect to be able to ask clarification questions about the content the system
delivers.
† Analysts have expectations about context. Some analysts initially set context for the
system. Dialogue sequences illustrated the expectation for the system to remember this
context as well as context of previous questions and responses.
† Analysts expressed concern about time:
– Analysts specified date ranges in a number of ways and expected to be able to use
time descriptions such as ‘recent’.
– Analysts wanted to see dates on information returned, and some wanted
information returned organized by time. On the other hand, offering analysts
information by selecting the year was not always viewed favorably. This was
especially true in situations when they were unfamiliar with the topic they were
researching.
† Clarifications to system questions are not always straightforward. Analysts gave
qualifications to yes/no questions (e.g. ‘Do you want to see more information?’,
‘Yes, but only if…’) and even ignored the questions.
We also assessed the number of system turns that were marked as system, modified
system, and wizard. In summary, we used the following metrics in the first pilot:
User satisfaction (questionnaire ratings)
System handling capability (log of how ‘system’ response was generated)
Efficiency (number of turns, %clarifications)
Effectiveness (whether answer was achieved)
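These metrics can, in principle, be derived mechanically from an origin-tagged turn log. The sketch below uses invented turns and assumes the clarification flags were hand-coded by the experimenters:

```python
# Sketch: deriving pilot-study metrics from an origin-tagged turn log.
turns = [
    {"origin": "analyst",  "clarification": False},
    {"origin": "system",   "clarification": False},
    {"origin": "analyst",  "clarification": True},
    {"origin": "modified", "clarification": False},
    {"origin": "wizard",   "clarification": False},
]

# System handling capability: share of responses the system produced unaided
responses = [t for t in turns if t["origin"] in ("system", "wizard", "modified")]
handling = sum(t["origin"] == "system" for t in responses) / len(responses)

# Efficiency: number of turns and the share that were clarifications
n_turns = len(turns)
pct_clarifications = 100 * sum(t["clarification"] for t in turns) / n_turns

print(handling, n_turns, pct_clarifications)
```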
4.1.2. Dialogue pilot #2
In a follow-up study, we modified the experiment from the first study slightly. First of
all, two systems used their own interface; an additional two systems used the NIST Text
Chat UI. We used two long scenarios, each lasting 1 h, instead of 10 shorter scenarios. Our
participants were Naval reservists who do analysis work as part of their reserve duty. In
addition to the data collected in the first study, we recorded the computer screen of each
participant during the evaluation.
We computed the same metrics as in the first pilot study: user satisfaction, system
handling capability, efficiency, and effectiveness.
High-level issues that we learned about in this evaluation included:
† Duplicating documents or summaries returned is a problem.
† Attributions for material are essential. Date information is also critical.
† Analysts test the system. They ask questions to which they already know the answers
to determine if the system returns the correct answer.
† Document summaries should accurately reflect the documents they stand for.
† Variant spellings and unwarranted equivalences (USSR is not the same as Russia) are
problems.
† If the system rephrases the analyst’s question, it should allow them to disagree.
† Systems should state how much information is available.
† Analysts need to know when the information they are looking for does not exist.
† Analysts need to be able to save information returned by the system.
† Analysts need to know what they have done during a session and would like to be able
to save these traces.
4.1.3. Summary of lessons learned
From these two pilot studies, we determined metrics that should be considered in
addition to typical performance and usability metrics:
Performance evaluation:
%duplicate information returned
Accuracy of summary of document
%of information returned without attribution

Usability evaluation:
percentage of actions user could not control

System evaluation:
Trust: # times user tests system and it passes
Users who worked with the user interfaces rather than the text chat interfaces spent
considerable time interacting with the user interface. We need to investigate how to
measure the impact of these interactions. The systems we were working with differed
considerably in the additional functionality they supplied. One system concentrated on
helping the user formulate new questions to ask, while the other provided interactive
visualizations of the information. Potential metrics to consider would be the percentage of
questions asked due to suggestions from or interactions with the system, and the percentage
of these that yield information for further exploration or that become part of the final work
product.
Fig. 2 shows a concept map of the overall metrics that might be used to evaluate an
interactive dialogue information system. We have used the term ‘thread’ in the concept
map. A ‘thread’ is a particular line of questioning that might represent a specific
hypothesis the analyst is investigating.
In the concept map, we have used time as a measure of performance. However, we need
to qualify what it means to save an intelligence analyst time. In our discussion with
analysts we found that time to produce a final product is dictated by the assignment. The
intelligence analyst simply must have the product completed within a specified time
frame. However, the analyst will use all of the time allotted. Because of the volume of
information to look through, analysts prioritize what they read. Given more time, they are
simply able to read and review more documents and hence increase the coverage of the
topic and their confidence in their analysis. If searching can be done faster and
summarization techniques allow them to review documents faster, then analysts will be
able to do more in the same amount of time.

Fig. 2. Concept map showing relationships between usability, performance, and utility metrics for dialogue
systems.
The concept map in this case has been specialized for dialogue systems. The
information/effort metric is composed of two pieces relating to information and effort. In
this case, we have implemented ‘effort’ as # turns/answer and # clarification turns/thread.
For systems that are not dialogue systems, this concept map can be modified to reflect
efficiency measures appropriate for that particular system.
Because these were laboratory studies, we gained no insights into process. We also did
not design the studies to give any information about product. Therefore, these studies
contributed to our understanding of metrics for performance and usability of information
interaction.
4.2. Measuring human information interaction in a simulated environment
We wanted to go beyond laboratory studies and evaluate software tools in an
environment resembling an operational setting. Moreover, we wanted to be able to
compare the performance of information seekers with and without a given set of tools to
determine user impact. This is not as straightforward as it seems. Intelligence analysts
have both short term and long term tasks. An analyst may continue to monitor the same
issue for months or even years. All tasks are not created equal—some tasks may have a
plethora of information; others may have little. Analysts may work on multiple tasks
simultaneously. We need to determine how these variables affect users and develop one or
more realistic baselines.
One project that we are currently involved in has developed an environment to capture
the work that analysts do, including the task the analyst is researching, queries posed to
search engines, documents viewed, documents saved, reports created, and all keyboard
interactions. A way is provided for analysts to make comments about the tasks they are
currently doing that can be used to understand the lower level data. Several analysts
currently work in this environment and generate data that provides a view of the current
process. However, this environment is somewhat removed from an actual analysis
environment. The tasks that are given to the analysts are selected by the project managers
and are chosen in such a way as to illustrate real situations that analysts must cope with,
such as, short reaction time tasks, multiple tasks, and long term tasks. The customers for
the analytic products are the researchers, not intelligence officials. Nonetheless, this
situation does simulate an actual analysis situation.
We are continually analyzing the data to determine how analysts work given their
current set of tools. In this case, they use search engines, word processing applications, and
when necessary, presentation tools. Eventually, we will move the research tools into the
simulated environment to determine their impact on the process and products of the
analysts.
Our evaluation efforts are currently focused on developing measures from this data that
can be used to assess impact of new information interaction tools. We have been working
with analysts, researchers, and program managers to arrive at metrics that will show
impact for analysts, will provide diagnostics for the researchers, and will show progress
for the program managers. The current set of metrics was developed in an iterative fashion.
We first used information from publications about the process of intelligence analysis to
bootstrap our thinking (Grabo, 2003; Heuer, 1999; Krizan, 1999). We had a number of
opportunities to speak with analysts and to observe them at work. We used these sessions
to start brainstorming with the researchers and analysts to produce a list of potential
metrics. We then condensed this set to eliminate duplicates and we grouped them into
categories. We conducted observations of two analysts (Scholtz et al., 2005), which helped
us see a complete process over a two day period. In addition, we have been analyzing data
captured in the simulated environment this past year to determine what measures we can
collect and which of these will support proposed metrics.
The metrics under consideration are shown in Fig. 3. These metrics are in addition to
the traditional usability metrics of efficiency, effectiveness, and user satisfaction that will
be used to evaluate the user interface. Although we said earlier that time is not a good
overall measure, we are interested in saving time spent on repetitive tasks and on
interaction overhead to give the information workers more time to think and understand
the information. The metric category labeled ‘analyst time’ looks at these aspects. This
said, there is a danger in over-automating: understanding of data is produced by
manipulating it. There is a fine balance between providing enough automation to save the
analyst time and providing so much automation that results are produced without building
understanding on the analyst’s part. We are currently conducting pilot evaluations on
various research components to determine the feasibility and utility of the various metrics.
While some of the measures can be collected in a straightforward fashion, others will
involve more indirect measures. In particular, the Data Coverage metric presents definite
challenges in data collection. Other measures such as perceived effort and insights gained
will be obtained from interviews and rating questionnaires.
We have begun analyzing data from our pilot studies and from data in the simulated
environment to determine what measures can be computed. We have been able to compute
the following measures:
† Time spent in the various software applications
† The queries posed and the number of documents read for each query
† Time to determine relevance of documents
† The growth of the analyst’s information collection document
† Gaps in activity indicating off-line activities
† Time spent in report generation and data collection activities
† The quality of the work product
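As a purely illustrative sketch of how measures like those above could be derived from an instrumented session, the following assumes a hypothetical timestamped event log; the field and action names are invented for illustration, not taken from the actual instrumentation:

```python
from collections import defaultdict

def summarize_session(events):
    """Derive simple session measures from a timestamped event log.

    `events` is a list of dicts with hypothetical fields:
    {"t": seconds since session start, "app": application name, "action": action name}.
    """
    time_per_app = defaultdict(float)
    queries = 0
    docs_read = 0
    prev = None
    for ev in events:
        if prev is not None:
            # Attribute the elapsed interval to the application previously in focus.
            time_per_app[prev["app"]] += ev["t"] - prev["t"]
        if ev["action"] == "query":
            queries += 1
        elif ev["action"] == "open_document":
            docs_read += 1
        prev = ev
    return {
        "time_per_app": dict(time_per_app),
        "queries": queries,
        "documents_read": docs_read,
    }
```

Measures such as gaps in activity or document growth would require additional event types; this sketch covers only the simplest counts.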
In a recent study we were able to present system-generated hypotheses to the user for
relevance ranking. Although this is an early evaluation, it constitutes part of our plan to
measure the quality of system recommendations as judged by the analyst. We also
measured the number of relevant documents returned in a sequence of queries to determine
if user modeling was successfully augmenting user queries.
Fig. 3. Metrics for assessing impact of tools on human information interaction.

We have included the metric of product quality. In a recent evaluation we attempted to
measure product quality using a method for rating coverage as well as having all the
subjects rank each other’s reports. We had an expert analyst generate a report on the
subject for us. In addition the expert analyst ranked the participants’ reports. Participants
in the evaluation were also given anonymous copies of each others’ reports and asked to
rank them according to quality. We found that the expert ranking was very similar to the
ranking we obtained when averaging the participants’ rankings. The coverage score was
somewhat less in agreement but the coverage scoring was done by one experimenter. We
think this is a promising measure, especially in cases where we have a report from an
experienced analyst to use as a basis of comparison. We intend to use this method again,
but we will also have the coverage scored by the participants in addition to having them
rank the reports.
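The text does not specify how the similarity between the expert ranking and the averaged participant ranking was quantified; one common option for comparing two rankings is Spearman's rank correlation, sketched here with hypothetical data:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same items.

    rank_a, rank_b: lists of ranks (1 = best) for the same items in the same
    order; assumes no ties. Returns a value in [-1, 1], where 1 means
    identical rankings.
    """
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Hypothetical example: an expert's ranking of five reports vs. the ranking
# obtained by averaging participants' rankings.
expert = [1, 2, 3, 4, 5]
participants = [2, 1, 3, 4, 5]
rho = spearman_rho(expert, participants)  # rho close to 0.9: high agreement
```

A high rho would correspond to the "very similar" rankings reported above.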
From the time spent in various software applications, we can clearly see which step of
analysis the analyst is performing: locating information or generating the report. The content of
the queries can be used to infer the different lines of reasoning that the analyst is
pursuing. Currently we can capture off-line activities only if the analyst chooses
to write a note about them. But we can see gaps in on-line activity and make some
inferences; for example, if a print command was issued just prior to such a gap, we can
reasonably assume that the analyst is reading the hard copy of a document.
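The kind of inference just described could be sketched as a simple heuristic; the action names and the gap threshold below are hypothetical assumptions, not part of the actual instrumentation:

```python
def label_gap(last_actions, gap_seconds, threshold=300):
    """Heuristically label a gap in on-line activity.

    last_actions: list of action names logged just before the gap
    (hypothetical names such as "print" or "note").
    gap_seconds: length of the silence in the log.
    Returns a best-guess label mirroring the inference described in the
    text; this is an illustrative classifier, not an implemented one.
    """
    if gap_seconds < threshold:
        return "normal pause"
    if "print" in last_actions:
        return "likely reading hard copy"
    if "note" in last_actions:
        return "off-line activity (noted by analyst)"
    return "unexplained off-line activity"
```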
As analysts work, they often assemble a document containing relevant information they
are finding. They use this document when producing their final report. We have computed
the growth rate of this document over time based on keystroke data from standard desktop
applications (Cowley et al., 2005). We can use this along with measures of the number of
searches done and the number of documents retrieved and read from these searches to
compute information/effort, a measure we introduced in Section 4.1. The graph in Fig. 4
shows a growth rate chart and details whether the growth came from keystrokes by the
analyst or from copy/paste operations.
Fig. 4. Growth rate of the analyst’s document over a one-day period.
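The paper introduces information/effort as a measure but does not give a formula; as a purely illustrative sketch, information could be approximated by the growth of the analyst's collection document and effort by a weighted count of search and reading actions. The weighting below is an assumption made for illustration only:

```python
def information_per_effort(chars_added, searches, docs_retrieved, docs_read):
    """Illustrative information/effort ratio.

    chars_added: growth of the analyst's collection document (characters),
    as computed from keystroke and copy/paste data.
    searches, docs_retrieved, docs_read: counts from the session log.
    The weights are arbitrary assumptions; retrieved-but-unread documents
    are counted as cheaper than documents actually read.
    """
    effort = searches + 0.1 * docs_retrieved + docs_read
    return chars_added / effort if effort else 0.0
```

Under this sketch, a tool that lets the analyst gather the same material with fewer searches and less reading would score higher.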
Analysis of this data has resulted in a number of observations about analysts and the
analytic process:
† Analysts are quite skilled at forming good searches. They view at least one document in
the majority of searches. In addition, they look beyond the first ten results in many
cases.
† When analysts are doing quick reaction tasks, they rely heavily on highlighted search
terms in the documents to guide them to the appropriate sections to read. Long
documents with no highlighted terms are not read in such situations even though they
may be highly relevant.
† When documents are of interest to analysts, they note the sections of interest. This may
occur by printing out the document and highlighting sections or making notes on the
paper, or the analyst may copy and paste text into a repository document.
† Provenance is extremely important to analysts. They always include the source of their
information. They also decide the credibility of information based on their knowledge
of the source. If an analyst is working in a new domain, she must look for ways to
assess the credibility of a source.
† Temporal information is extremely important as well. Analysts look for dates in on-line
web sites so they understand the time frame in which the document was written.
These observations are helpful in developing insights into metrics that should be
considered. For example, we could consider the number as well as the size of documents
that the analyst reviews. We are able to track the actual documents that are used for
copy/paste operations. We could also examine these according to the dates and provenance
and then determine if new systems help the analyst look further back and produce more
information with citations.
There are still limitations to doing evaluations in ‘simulated’ environments. While
the analysts follow a process, we lack an organizational process, and we lack the
actual pressures that occur in the operational environment. In Section 4.3 we discuss
results from a number of experiments in operational settings.
4.3. Evaluating human information interaction systems in an operational environment
We have also looked at metrics for evaluating information systems deployed in
operational settings. We had two environments available for experimentation and
evaluation: a data-protected platform for conducting controlled experiments and an
operational platform where systems could be tested using real data and in actual processes.
Technical performance was evaluated in the data protected platform using synthetic data.
When software had been successfully evaluated in this platform, it was moved into the
operational environment. Software performance was evaluated on operational data and
then evaluated within the intelligence process itself. Once in the operational world, the
evaluation switched from measures of software and individual user interaction to that of
process and organizational impact. Technical measures were also collected within the
operational environment, as we needed to evaluate how the interactions scale when
real-world data, processes, and products are factored in. In addition, cognitive metrics were
collected only in the operational environment as they only make sense when analysts are
engaged in real tasks using the software.
By looking at the various metrics in the different environments, we were able to roll up
the metrics to capture the entire picture: the performance of the technology components,
the scalability of those components to actual data, and the impact of these components on
the analytic process.
We developed a metrics model (Mack et al., 2004) that defined the metrics and the
conceptual measures at a functionality level and at a scenario level. As new
components were integrated into the environment, we implemented these measures by
determining what data from the components could be used to obtain a given metric.
This allowed us to customize the measures for different tools but to have the ability to
compare metrics across experiments. Fig. 5 shows the portion of the metrics model that
focuses on utility.
We use the following definitions in our metrics model. A metric is used for
distinguishing progress or differences between implementations of a capability. A
conceptual measure is an attribute or property of a metric. An implementation measure
realizes part or all of a conceptual measure; the set of implementation measures produces
the measurements that determine the value of the metric.
Metrics and conceptual measures are independent of the specific software being
evaluated and allow us to make comparisons across software components and across
experiments. Implementation measures are specific to the software being evaluated.
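The three-level model just defined could be sketched as a small data structure; the class and field names here are invented for illustration and do not appear in the metrics model itself:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ImplementationMeasure:
    """Tool-specific realization of (part of) a conceptual measure."""
    name: str
    collect: Callable[[], float]  # hypothetical data-collection hook

@dataclass
class ConceptualMeasure:
    """Tool-independent attribute or property of a metric."""
    name: str
    implementations: List[ImplementationMeasure] = field(default_factory=list)

@dataclass
class Metric:
    """Distinguishes progress or differences between implementations of a capability."""
    name: str
    conceptual_measures: List[ConceptualMeasure] = field(default_factory=list)
```

Because metrics and conceptual measures sit above the tool-specific layer, two tools can share the upper two levels while binding different implementation measures, which is what permits comparison across experiments.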
We have used the categories of effectiveness, efficiency, and user satisfaction to
classify our conceptual measures, although we have expanded the standard usability
definitions of these terms (ISO 9241). The measures in Fig. 5 are, in many
instances, repeated in the different platforms. However, the implementation of these
measures may be quite different. Measures that we might be able to collect through
logging software in a more experimental situation may be replaced with interviews
and questionnaires in an operational environment. We have also separated out
metrics measuring performance with real-world data from metrics associated with
processes involving real-world data. It is important to look at aspects of
performance on actual data before trying to assess impact within an actual process.
This allows us to compare the evaluations done using synthetic data to measures
collected using operational data.
We do not mean to suggest that there was a clean sequence of evaluations progressing
from the controlled experimental world to the operational environment. Often we found
anomalies in the operational environment that required further investigations in a
controlled environment. Other measures such as changes to process can only be assessed
within a process in an operational environment.
In Section 4.4, we discuss the agreed-upon metrics that will be used in an upcoming
evaluation in an operational environment.
Fig. 5. The portion of the metric model focusing on ‘utility’.
4.4. Testing the metrics
Over the next six months we will be inserting research software developed for the
Novel Intelligence from Massive Data program (NIMD home page, accessed June, 2005)
into an operational environment. We have formalized the metrics that will be used, based
on metrics developed mainly in our simulated environment studies. We are first collecting
baseline data from analysts. The overall metrics that will be used are:
† Increased analytic product
† Increased analyst confidence in the product
† Increased signal to noise ratio
Senior analysts will rate the quality of the products; the analysts themselves will be
asked to give their confidence level. The signal to noise ratio will be measured by looking
at the number of documents that analysts view and the percentage of those that are deemed
relevant (printed, read online, saved, copied into report, cited in report). While the quality
and confidence are qualitative measures, there are quantitative measures at the
implementation level that will feed into these metrics as well. Lower level metrics
(based on the research areas of the project) include:
† Reduced workload for the analyst
† Increased percentage of relevant documents returned to the analyst
† Increased throughput for the man and machine working together
† Increased number of paths of reasoning that the analyst is able to explore
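The signal-to-noise metric described above counts the fraction of viewed documents deemed relevant by one of the listed actions. A minimal sketch, with hypothetical action names standing in for the logged events:

```python
# Actions treated as evidence of relevance (hypothetical names mirroring
# "printed, read online, saved, copied into report, cited in report").
RELEVANCE_SIGNALS = {"printed", "read_online", "saved", "copied_into_report", "cited"}

def signal_to_noise(viewed_docs):
    """Fraction of viewed documents deemed relevant.

    viewed_docs: mapping of document id -> set of observed actions.
    A document counts as relevant if any relevance-indicating action
    occurred for it.
    """
    if not viewed_docs:
        return 0.0
    relevant = sum(1 for actions in viewed_docs.values()
                   if actions & RELEVANCE_SIGNALS)
    return relevant / len(viewed_docs)
```

A higher ratio after a tool is introduced would suggest the analyst is spending less time on documents that turn out to be irrelevant.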
In particular, one research area is human information interaction. The metrics we
propose for this area, which will feed into the overall impact metrics, are:
† Increased number of productive searches
† Increased number of relevant documents retrieved and reviewed
† Reduced overall workload for the analyst
We are excited about the opportunity to determine if these metrics can be used to
distinguish which software tools are useful to the analyst.
5. Discussion
We have explored metrics for assessing human information interaction in laboratory
studies, in simulated environments, and in operational settings. We have used the
traditional usability metrics of effectiveness, efficiency, and user satisfaction but we have
augmented these with measures such as the number of turns and clarifications needed. We
found that a potential utility metric is the amount of effort an analyst expends to obtain
information. In our simulated environments, we are able to measure more about the
process that analysts use in their analytic tasks. In the operational world, we are able to
determine how well new software tools work with the existing infrastructure and how well
they scale to real-world information streams. Table 1 summarizes our metrics work in
these three areas.
The ideal situation would be to capture a baseline first in the operational world and then
base research on those needs and characteristics. However, the operational world is
dynamic and we must continually assess what is occurring there. Analysts, their needs,
their missions, and their infrastructure continually change. As we move from the
laboratory to the simulated environment to the operational world, we gain a broader
Table 1
Summary of metrics information obtained in three areas of study

Laboratory studies: expanded definitions of effectiveness and efficiency; gained
insights into criteria for good answers; obtained detailed data about user interactions.

Simulated environment: developed metrics for process and product; looked at detailed
use of software over a period of time; obtained detailed data about the analytic process
and the construction of the product.

Operational environment: produced baseline metrics; assessed how a tool interacts with
other tools in the environment and with operational data; obtained statistical data and
interview data.
perspective but we lose control over experiments. It is difficult to obtain fine-grained data but
we can more easily get data about real-world use. The laboratory is the place to understand
interactions in detail, within a given scenario. To look at how a particular software system
impacts an individual user, a simulated environment should be used. To determine if that
software system is effective in the real world, we need to study it in an operational
environment. Interviews and statistics on usage can be collected in the operational
environment and can be used to devise more laboratory studies and simulated
environment investigations.
6. Conclusions
Evaluating information interaction is a complex process that involves metrics and
measures at several levels. Technology developers have typically used performance
measures for their systems. The HCI community has perfected metrics and methodologies
for assessing the usability of systems. We suggest it is time to move from measures of
usability and performance alone and to include measures of utility and impact. This will
require new evaluation methodologies and new metrics. A number of metrics have been
discussed here, including additional performance measures and expanded usability
definitions. While not all of the metrics presented here will turn out to be useful, we are
confident that a subset will be sufficient to adequately measure the utility of human
information interaction systems. These metrics have been developed through readings and
discussions with members of the intelligence community, with researchers developing
software to assist intelligence analysts, and by conducting studies and experiments in the
laboratory, in simulated environments, and in operational settings. The metrics proposed
are a good starting point for evaluating the impact of human information interaction in the
large. However, these metrics will definitely evolve as more research focuses on the
evaluation of information interaction.
In addition, there are a number of confounds that need to be considered: the expertise of
the analyst, the domain knowledge of the analyst, the time that can be allocated to the task
the analyst is given, the complexity of the task, and the amount, quality, and dynamic
nature of the available information. We have not addressed these in this discussion.
However, we are currently conducting some studies that will address these factors and
should help us to interpret our metrics within these constraints.
We have several tasks ahead of us. One is to determine which metrics distinguish
systems that will be used by analysts as opposed to those that will not. We have a study
underway that will help to validate metrics we have defined for one particular research
program in the intelligence community. We will use qualitative data collected from the
analysts in an operational environment to determine whether our impact metrics can
distinguish between tools that have utility and those that do not. The second task will be to
determine if these metrics extend to the evaluation of tools in different areas of human
information interaction, such as digital libraries.
Acknowledgements
The author would like to thank her team at the National Institute of Standards and
Technology for their work in conducting and analyzing numerous studies that provided
metrics used in this paper. The anonymous reviewers have made extremely useful
suggestions for improvements. The author would also like to thank Dianne Murray and
Gitte Lindgaard for their helpful suggestions and comments. This work was funded in part
by a number of programs at ARDA.
References
Advanced Research and Development Agency (ARDA) home page, http://www.ic-arda.org/, accessed May,
2004.
Blandford, A., Buchanan, G., 2002. Usability of digital libraries: a workshop at JCDL 2002—JCDL’02 workshop
on usability of digital libraries. Joint Conference on Digital Libraries, Portland Oregon.
Covey, D., 2002. Usage and Usability Assessment: Library Practices and Concerns. Digital Library Federation,
Council on Library and Information Resources, Washington—website http://www.clir.org/pubs/abstract/
pub105abst.html.
Cowley, P., Nowell, L., Scholtz, J., 2005. Glassbox: An Instrumented Infrastructure for Supporting Human
Interaction with Information. Hawaii International Conference on System Sciences, Jan. 2005.
Craine, K., 2000. Designing a Document Strategy. MC2 Publishing.
Dahlback, N., Jonsson, A., Ahrenberg, L., 1993. Wizard of Oz studies—why and how. Proceedings from the 1993
International Workshop on Intelligent User Interfaces, Orlando, FL, pp. 193–200.
Gershon, N. 1995. Human Information Interaction, WWW4 Conference, December.
Grabo, C.M., 2003. Anticipating Surprise: Analysis for Strategic Warning. Center for Strategic Intelligence
Research, Joint Military Intelligence College, Washington, DC; 2005 edition, University Press of America.
Heuer, R., 1999. Psychology of intelligence analysis. Center for the Study of Intelligence. CIA.
Hill, L., Dolin, R., Frew, J., Kemp, R.B., Larsgaard, M., Montello, D.R., Rae, M., Simpson, J. 1997. User
evaluation: summary of the methodologies and results for the Alexandria Digital Library, University of
California at Santa Barbara. In: ASIS97 Digital Collections: Implications for Users, Funders, Developers and
Maintainers—Proceedings of the 60th ASIS Annual Meeting, vol. 34, pp. 225–243.
International Organization for Standardization, ISO 9241. Ergonomics Requirements for Office Work with Visual
Display Terminals (VDTs), Part 11. Guidance on Usability Specification and Measures.
Krizan, L., 1999. Intelligence Essentials for Everyone. Joint Military Intelligence College, Occasional Paper
Number Six, Washington, DC.
Lesk, M., 1997. Unpublished Paper, accessed Nov. 16, 2005. www.lesk.com/mlesk/diglib.html.
Lucas, P., 2000. Passive information access and the rise of human-information interaction. Invited Talk,
CHI 2000 Extended Abstracts, p. 202.
Mack, G., Lonergan, K., Hale, C., Scholtz, J., Steves, M., 2004. A framework for metrics in large, complex
systems. Aerospace.
Marchionini G. 2000. Evaluating digital libraries: a longitudinal and multifaceted view. Library Trends. 49(2),
304–333. Online copy at http://www.ils.unc.edu/~march/perseus/lib-trends-final.pdf.
Medina, C., 2002. What to do when traditional models fail. Studies in Intelligence. 46(3), http://www.cia.gov/csi/
studies/vol46no3/article03.html, accessed May, 2004.
National Research Council, More than Screen Deep, 1997. National Academy Press, p. 32.
NIMD home page, http://www.ic-arda.org/Novel_Intelligence/, accessed June 12, 2005.
Scholtz, J., Morse, E., Hewett, T., 2005. In depth observational studies of professional intelligence analysts.
International Conference on Intelligence Analysis, May, MacLean, VA. Proceedings available at https://
analysis.mitre.org/ [accessed June 12, 2005].
TREC Conference [http://trec.nist.gov/data/qa.html, accessed Feb. 16, 2004].
Voorhees, E., Buckland, L. (Eds.). 2003. Proceedings of the Eleventh Text REtrieval Conference (TREC 2002).
Warner, M., 2004. Wanted: a definition of intelligence. Studies in Intelligence, 46(3), http://www.cia.gov/csi/
studies/vol46no3/article02.html, accessed May 2004.
White, H., McCain, K., 1989. Bibliometrics. In: Williams, M. (Ed.), Annual Review of Information Science and
Technology, vol. 24. Information Today, Inc., Medford, NJ, pp. 161–207.