Metrics for evaluating human information
interaction systems
Jean Scholtz *
National Institute of Standards and Technology, Information Technology Laboratory, MS 8940,
100 Bureau Drive, Gaithersburg 20899, MD, USA
Received 29 December 2003; received in revised form 17 August 2005; accepted 12 October 2005
Available online 27 January 2006
Abstract
Society today has a wealth of information available due to information technology. The challenge
facing researchers working in information access is how to help users easily locate the information
needed. Evaluation methodologies and metrics are important tools to assess progress in human
information interaction (HII). To properly evaluate these systems, evaluations need to consider the
performance of the various components, the usability of the system, and the impact of the system on
the end user. Current usability metrics are adequate for evaluating the efficiency, effectiveness, and
user satisfaction of such systems. Performance measures for new intelligent technologies will have to
be developed. Regardless of how well the systems are and how usable the systems are, it is critical
that impact measures are developed. For HII systems to be useful, we need to assess how well
information analysts work with the systems. This evaluation needs to go beyond technical
performance metrics and usability metrics. What are the metrics for evaluating utility? This paper
describes research efforts focused on developing metrics for the intelligence community that
measure the impact of new software to facilitate information interaction.
Published by Elsevier B.V.
Keywords: Human information interaction; Information retrieval; Evaluation; User-centered; Intelligence
analysis; Metrics for utility
Interacting with Computers 18 (2006) 507–527
www.elsevier.com/locate/intcom
0953-5438/$ - see front matter Published by Elsevier B.V.
doi:10.1016/j.intcom.2005.10.004
* Tel.: +1 301 975 2520; fax: +1 301 975 5287.
E-mail address: [email protected]
1. Introduction
Lucas (2000) suggests that it is necessary for us to move from a ‘computer-centric’
world to an ‘information-centric’ world. He suggests that we should interact with
information, not computers. Hence, the challenge today is to design information objects
and our interactions with these objects. The term ‘human information interaction’
(Gershon, 1995) denotes: ‘how human beings interact with, relate to, and process
information regardless of…the medium connecting the two.’ The problem was expressed
in the National Research Council report (1997):
“Today’s technology, built to meet obsolete constraints of the 1960s and 1970s,
focus users’ attention and work patterns on the tool instead of the information.”
We are aware that the term ‘information’ is overused and that there are often questions
about distinctions between ‘data’ and ‘information.’ In this paper, we use the term
‘information’ to denote that we are working with data that has been processed from a raw
form. That is, we are not concerned about analyzing streams of signals. We are concerned
with being able to look at maps, text reports, photographs, newspapers, books, web pages,
and so on. In our definition, the World Wide Web is a source of information.
In this paper we focus on one particular group of information analysts, professional
intelligence analysts. While there are many types of intelligence analysts, differing on the
sources of information they analyze, the consumer for whom they are producing
the intelligence, and the domain they are analyzing, they all face a world in which the
information they must deal with is growing at an ever increasing rate.
With the development of the World Wide Web in 1989, more and more information has
been put on-line. Lesk (1997) calculates that in 1997 there were approximately 3 petabytes
(3000 terabytes) of information available at the United States Library of Congress, with
20 terabytes of this in text. He suggests that the amount of text on the world-wide web
might increase to 800 terabytes. While the Library of Congress only houses published text,
the web is available for all of us to post information, thus accounting for the large
difference in amount of material. Craine (2000) notes that ‘The world is experiencing
exponential growth of digital information. More information has been produced in the last
thirty years than in the previous five thousand—the entire history of civilization.’
Some experts estimate that analytic tools today must handle 70 Terabytes of data daily
and that is growing at the rate of 60% per year. Moreover, the world is becoming much
smaller in the sense that crises in one country can have great impact on very distant
countries. Consider the impacts of diseases such as Asian bird flu or ‘mad cow’ disease.
Terrorist groups are able to carry out their plans in countries far from where they reside.
Transactions, travel, and communications have greatly increased, and with them the
records of such activities.
The challenge then is to provide information analysts, both intelligence analysts and
other information analysts, with tools to support interaction and exploration of information
in this enormously complex information space. These analysts need the functionality to go
beyond just finding information to finding surprises—information they did not know they
were looking for. We contend that current performance and usability metrics are
inadequate to assess all aspects of human information interaction. Performance measures
focus on measuring the accuracy and speed of software. For example, a software tool that
filters e-mail messages based on a user profile certainly needs to capture the correct
messages in a timely fashion. The process of setting up the profile, editing the profile, and
checking on which messages are captured needs to be usable. However, meeting
performance and usability criteria will not guarantee that information analysts will use the
tool. The tool also needs to positively impact the analyst’s process, product, or both.
To assess new research systems such as those sponsored by the Advanced Research and
Development Activity (ARDA, 2005), we have been conducting a series of evaluations with
researchers in these programs. The goal is to develop user-centered measures of impact for
human information interaction systems. The intelligence community represents ‘power users’
of information interaction systems. Their work is often time critical; it is extremely important
that they locate and digest as much relevant information as possible to understand the overall
situation. Additionally, they need the ability to explore the data to devise alternative scenarios.
2. Background
The evaluation of digital libraries is a reasonable place to look for evaluation work that
could be used as a starting point. Covey (2002) conducted a survey of 24 out of 26
members of the Digital Library Federation asking what types of assessments they had done
for digital libraries and whether they felt there was a good cost/benefit ratio. She
concluded that while many studies had been conducted, the community is still addressing
questions such as what the right measures and composite measures are to capture and
assess digital library use. The dynamic nature of the capabilities and processes of digital
libraries adds to the complexity of this challenge.
Measures for physical libraries include: circulation, collection size, growth rate of the
collection, number of visits, user satisfaction, and user visits (Marchionini, 2000). These
measures are not sufficient for evaluating digital libraries. Marchionini stressed the need for
looking at the community impact of digital libraries, including such things as recruitment
of personnel with new skills. The study of bibliometrics (White and McCain, 1989) looks
at the impact that individuals or communities have on research by looking at the citations
of research papers.
A workshop on usability of digital libraries (Blandford and Buchanan, 2002) noted that
usability studies should include the larger context in which the search is taking place. This
report also noted the issue with defining who the users of digital library systems are. An
evaluation study of the Alexandria Digital Library (Hill et al., 1997) focused on three types
of users: earth scientists, information specialists, and educators. In this evaluation of the
geolibrary, the earth scientists and educators were involved in both searching for
information and in analyzing the information. The information specialists were experts
used to help locate information but were not involved with the analysis of the information.
The evaluators found that the earth scientists wanted the library capabilities more fully
integrated into the environment used for analysis; the educators wanted the capability to
work in groups in the classroom.
There are both similarities and differences between intelligence analysts and users of
digital libraries. First of all, digital libraries need to worry about the collection of
information provided, not just the system access to it. In the intelligence world, there is
certainly the need to worry about the collection, organization, and management of the data
but evaluations of tools to provide access to this information are less concerned with the
management and acquisition processes. Intelligence analysts, although varied in focus and
skill levels, should be categorized as skilled both in the technical system and in the domain
(that is, more like the earth scientists in the Alexandria Digital Library assessment). While
analysts are not forced to use any tools provided to them, they must access online
information. Therefore, they are not discretionary users in the same sense as many digital
library users.
Like researchers in the digital library world, researchers in the intelligence analysis
world realize that evaluations must move beyond usability and performance. Both
communities agree that research is needed to develop the appropriate metrics. In the
intelligence community the job is somewhat easier as we have only one user group,
although somewhat diverse, to consider in the design and evaluations of our systems.
In the following sections, we present a discussion of intelligence analysis for those
unfamiliar with this profession, a discussion of the different types of evaluations we have
done, and the metrics that have resulted from these evaluations.
3. Intelligence analysis
To help readers who are not familiar with intelligence analysis, we present an
introductory section. Webster’s definition of intelligence reads:
Main entry: in·tel·li·gence
1 a (1): the ability to learn or understand or to deal with new or trying situations:
REASON; also: the skilled use of reason (2): the ability to apply knowledge to
manipulate one’s environment or to think abstractly as measured by objective
criteria (as tests) b Christian Science: the basic eternal quality of divine Mind c:
mental acuteness: SHREWDNESS
2 a: an intelligent entity; especially: ANGEL b: intelligent minds or mind ⟨cosmic
intelligence⟩
3: the act of understanding: COMPREHENSION
4 a: INFORMATION, NEWS b: information concerning an enemy or possible enemy
or an area; also: an agency engaged in obtaining such information
5: the ability to perform computer functions
A number of definitions of Intelligence exist within this community. The definition that
Warner (accessed May, 2004) arrives at is ‘intelligence is secret, state activity to
understand or influence foreign entities.’ However, in arguing for this definition, Warner
notes that intelligence is both a process and a product. He also makes the point that
information does not equal intelligence; intelligence analysts provide added value to
information during the intelligence process.
The United States Central Intelligence Agency provides a diagram of the traditional
intelligence analysis cycle. This is shown in Fig. 1.

Fig. 1. The intelligence cycle and its use of information (CIA home page, accessed May, 2004).

While the phases in this process are valid, the actual process is an iterative one, which the
figure does not adequately depict.
An analyst working in a particular domain is continually monitoring information and
assimilating this into her knowledge of the domain. At any point in time, the analyst is
able to respond to a short term request for information within this domain (Medina,
2002).
Currently the intelligence community is looking at new models for intelligence
analysis. However, there is no new agreed-upon model of the process. Medina (2002) notes
that the traditional model is focused on understanding an event or developments. Today,
analysts need to understand the needs of their customers and add value in an environment
where many have access to information. Medina suggests that analysts need to focus on
ideas, to take risks and to consider ‘adventurous analytic ideas.’ Research programs for
intelligence analysis are developing software along these lines. To determine whether this
software will help the analysts requires the development of new metrics to assess changes
in the analytic process and products.
4. Evaluating the utility of human information interaction
The evaluation of interactive information systems encompasses three types of
evaluation:
† Performance or algorithm evaluations
† Usability or interaction evaluation
† Overall impact or process evaluation
Performance metrics focus on processing time and accuracy of the algorithm. For
example, information retrieval evaluations commonly use precision and recall measures to
determine how well the algorithms are working (TREC conference, accessed Feb. 2005;
Voorhees and Buckland, 2003).
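As a concrete illustration of these measures (the code and document identifiers below are hypothetical, not drawn from any TREC collection), precision and recall can be computed from the sets of retrieved and relevant documents:

```python
# Illustrative sketch: precision and recall for a single query.
# Document IDs are made up for the example.

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved documents are relevant; 2 of the 3 relevant were found
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # 0.5 0.6666666666666666
```

A system tuned for high recall may sacrifice precision, which is one reason the two measures are usually reported together.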
The human–computer interaction (HCI) community conducts usability evaluations of
information systems. The metrics commonly used are effectiveness, efficiency, and user
satisfaction evaluated in the context of use (ISO 9241). That is, users use tools not just for
the sake of using tools but within a larger scenario. Measures to support usability metrics
should be collected within this larger scenario. Effectiveness measures the percentage of
tasks a user is able to complete, both with and without external help. Efficiency is the
measure of how much time users need to complete tasks in the scenario. User satisfaction
measures the user’s perception of the interaction process during the scenario.
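As a sketch of how these three metrics might be aggregated from scenario session logs (the record layout and the values are invented for illustration, not taken from the evaluations described here):

```python
# Hypothetical session records: one tuple per task attempt in the scenario.
from statistics import mean

sessions = [
    # (task completed?, needed external help?, seconds on task, satisfaction 1-7)
    (True, False, 340, 6),
    (True, True, 510, 4),
    (False, False, 900, 2),
]

# Effectiveness: share of tasks completed, with and without external help
effectiveness_overall = mean(done for done, _, _, _ in sessions)
effectiveness_unaided = mean(done and not helped for done, helped, _, _ in sessions)
# Efficiency: mean time needed to complete tasks in the scenario
efficiency_secs = mean(t for _, _, t, _ in sessions)
# User satisfaction: mean questionnaire rating
satisfaction = mean(s for _, _, _, s in sessions)

print(effectiveness_overall, effectiveness_unaided, efficiency_secs, satisfaction)
```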
To complete the evaluation of HII systems, we must move beyond performance and
usability and consider utility or impact measures. That is, how do human information
interaction systems change the work that users are doing? What effect do new information
interaction systems have on work products? Obviously these three types of evaluations are
closely tied. Systems need to have attained specified levels of accuracy and usability
before we can assess utility. However, good accuracy and usability do not guarantee good
measures of utility. These evaluations are also iterative. Assessments of utility may
produce new criteria for accuracy and usability.
The following sections present some methodologies, results, and lessons learned from
evaluations we conducted of interactive information systems in the past year. The systems
that we evaluated are all being developed for use in intelligence analysis. The success and
continuation of research programs for this community depends in part on the development
of metrics to assess the impact of information interaction systems.
We have conducted evaluations in three settings: laboratory studies, simulated
environments, and operational environments. We report on lessons learned from each
of these types of evaluations and how these are shaping our development of metrics for
information interaction systems.
4.1. Laboratory studies: question-answering evaluations
TREC (TREC conference, accessed Feb. 2004) sponsors a track on question-answering
that focuses on retrieving information, not documents. Systems are to answer a question by
finding information in one or more documents and then presenting the portion of the
document that answers the question. A subset of these question answering research
systems was interested in conducting dialogues with the users to determine the users’
information needs. These systems need the capability to answer questions more complex
than simple factoid questions. The typical ‘batch processing’ mode of evaluation was not
appropriate. Even though users were given the same scenario, they asked different
questions and had different styles of dialogue with the systems.
We conducted two studies with research systems to understand variations in dialogue
between users and the systems. Our objectives were to provide feedback to the researchers
developing the systems; to develop appropriate metrics for future evaluations; and to
provide measures of progress to the sponsors of the research. In the rest of this section we
discuss the two pilot studies we conducted.
4.1.1. Dialogue evaluation pilot #1
The research systems were in various stages of development but all were early, were
not necessarily robust, and lacked usable user interfaces. We adopted a modified Wizard of
Oz technique (Dahlback et al., 1993) in which a text chat served as the user interface for
all systems. The text chat was developed to allow researchers to use any one of three
modes for conversing with the user:
– Wizard: a researcher pretending to be the system responds to the user.
– System: the research system’s response to the user was passed along untouched.
– Modified: the researcher was allowed to modify some part of the system generated
response to make it more understandable to the user.
Testing took place over the internet using the text chat interface. Subjects entered their
queries in the large text area and submitted them. The systems received the query and
proceeded to process the request for information. When the system generated its response,
a log entry was made to note whether the response was generated by the system itself, by a
wizard or by a wizard modifying something that the system generated.
Subjects were allowed 15 min to work on each of the 10 problems. They were told to
continue until one of three conditions occurred:
1. they found what they needed (success)
2. they felt there was no chance of success (failure)
3. time expired
If time ran out, subjects were asked to decide whether they believed that they were
likely to be successful or not. Buttons in the interface were used to guide this
interaction.
A log file format was developed to track each exchange. The header of the log
contained information about the identity of the system and the user. Each activity record
was time-stamped and categorized as to its origin (analyst, system, wizard, or modified)
and the content of the message.
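A minimal sketch of such an origin-tagged, time-stamped activity record follows; the tab-separated layout and field names are assumptions for illustration, not the actual log format used in the study:

```python
# Sketch of one activity record in the dialogue log (layout is hypothetical).
from dataclasses import dataclass
from datetime import datetime

ORIGINS = {"analyst", "system", "wizard", "modified"}

@dataclass
class ActivityRecord:
    timestamp: datetime
    origin: str   # who produced this turn
    message: str  # content of the message

def parse_record(line):
    """Parse a tab-separated 'timestamp<TAB>origin<TAB>message' line."""
    ts, origin, message = line.rstrip("\n").split("\t", 2)
    if origin not in ORIGINS:
        raise ValueError(f"unknown origin: {origin!r}")
    return ActivityRecord(datetime.fromisoformat(ts), origin, message)

rec = parse_record("2004-03-01T10:15:30\tanalyst\tWhat do we know about topic X?")
print(rec.origin)  # analyst
```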
The participants in the first pilot study were assessors for the Text REtrieval
Conference (TREC) (Voorhees and Buckland, 2003). Many of these subjects are retired
intelligence analysts and all have expertise in information retrieval. The tasks in this study
were general rather than domain-specific so the match of subjects’ skills and task
requirements was satisfactory. Ten subjects were recruited for this study and each of them
worked with a single system to perform the 10 tasks. Each system was used by 2 subjects
and the tasks were performed in random order.
4.1.1.1. Data collection. In addition to the logs that were collected, an observer
noted issues that participants were having. These notes were also used to ask more
thorough questions once the participants had finished the experiment. Subjects were
asked to rate the systems on various aspects of the dialogue. They were asked to use
7-point scales (1, extremely dissatisfied; 7, completely satisfied) to address the following
areas:
† the final answer you obtained from the system
† the time that it took to carry out this task
† the dialogue that you carried on with the system
† the clarifications that the system requested
† the number of misunderstandings between you and the system
† the ease of understanding of the system messages
4.1.1.2. Results. We learned a great deal about facets of dialogue interactions that systems
would need to accommodate, including:
† Spelling errors
† Incomplete sentences
† Ability to remember context
† Qualified affirmative and negative responses
We also learned about the process as the participants went about finding answers to the
tasks we had assigned them:
† Analysts ask questions as they occur to them—not necessarily in a strict turn-taking
style.
† Analysts expect to be able to ask clarification questions about the content the system
delivers.
† Analysts have expectations about context. Some analysts initially set context for the
system. Dialogue sequences illustrated the expectation for the system to remember this
context as well as context of previous questions and responses.
† Analysts expressed concern about time:
– Analysts specified date ranges in a number of ways and expected to be able to use
time descriptions such as ‘recent’.
– Analysts wanted to see dates on information returned, and some wanted
information returned organized by time. On the other hand, offering analysts
information by selecting the year was not always viewed favorably. This was
especially true in situations when they were unfamiliar with the topic they were
researching.
† Clarifications to system questions are not always straightforward. Analysts gave
qualifications to yes/no questions (e.g. ‘Do you want to see more information?’,
‘Yes, but only if…’) and even ignored the questions.
We also assessed the number of system turns that were marked as system, modified
system, and wizard. In summary, we used the following metrics in the first pilot:
User satisfaction (questionnaire ratings)
System handling capability (log of how ‘system’ response was generated)
Efficiency (number of turns, %clarifications)
Effectiveness (whether answer was achieved)
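These metrics can, in principle, be derived mechanically from an origin-tagged turn log. The sketch below uses invented turns and assumes the clarification flags were hand-coded by the experimenters:

```python
# Sketch: deriving pilot-study metrics from an origin-tagged turn log.
turns = [
    {"origin": "analyst",  "clarification": False},
    {"origin": "system",   "clarification": False},
    {"origin": "analyst",  "clarification": True},
    {"origin": "modified", "clarification": False},
    {"origin": "wizard",   "clarification": False},
]

# System handling capability: share of responses the system produced unaided
responses = [t for t in turns if t["origin"] in ("system", "wizard", "modified")]
handling = sum(t["origin"] == "system" for t in responses) / len(responses)

# Efficiency: number of turns and the share that were clarifications
n_turns = len(turns)
pct_clarifications = 100 * sum(t["clarification"] for t in turns) / n_turns

print(handling, n_turns, pct_clarifications)
```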
4.1.2. Dialogue pilot #2
In a follow-up study, we modified the experiment from the first study slightly. First of
all, two systems used their own interface; an additional two systems used the NIST Text
Chat UI. We used two long scenarios, each lasting 1 h, instead of 10 shorter scenarios. Our
participants were Naval reservists who do analysis work as part of their reserve duty. In
addition to the data collected in the first study, we recorded the computer screen of each
participant during the evaluation.
We computed the same metrics as in the first pilot study: user satisfaction, system
handling capability, efficiency, and effectiveness.
High-level issues that we learned about in this evaluation included:
† Duplicating documents or summaries returned is a problem.
† Attributions for material are essential. Date information is also critical.
† Analysts test the system. They ask questions to which they already know the answers
to determine if the system returns the correct answer.
† Document summaries should accurately reflect the documents they stand for.
† Variant spellings and unwarranted equivalences (USSR is not the same as Russia) are
problems.
† If the system rephrases the analyst’s question, it should allow them to disagree.
† Systems should state how much information is available.
† Analysts need to know when the information they are looking for does not exist.
† Analysts need to be able to save information returned by the system.
† Analysts need to know what they have done during a session and would like to be able
to save these traces.
4.1.3. Summary of lessons learned
From these two pilot studies, we determined metrics that should be considered in
addition to typical performance and usability metrics:
Performance evaluation:
%duplicate information returned
Accuracy of summary of document
%of information returned without attribution

Usability evaluation:
percentage of actions user could not control

System evaluation:
Trust: # times user tests system and it passes
Users who worked with the user interfaces rather than the text chat interfaces spent
considerable time interacting with the user interface. We need to investigate how to
measure the impact of these interactions. The systems we were working with differed
considerably in the additional functionality they supplied. One system concentrated on
helping the user formulate new questions to ask, while the other provided interactive
visualizations of the information. Potential metrics to consider would be the percentage of
questions asked due to suggestions from or interactions with the system, and the percentage
of these that yield information for further exploration or that become part of the final work
product.
Fig. 2 shows a concept map of the overall metrics that might be used to evaluate an
interactive dialogue information system. We have used the term ‘thread’ in the concept
map. A ‘thread’ is a particular line of questioning that might represent a specific
hypothesis the analyst is investigating.
In the concept map, we have used time as a measure of performance. However, we need
to qualify what it means to save an intelligence analyst time. In our discussion with
analysts we found that time to produce a final product is dictated by the assignment. The
intelligence analyst simply must have the product completed within a specified time
frame. However, the analyst will use all of the time allotted. Because of the volume of
information to look through, analysts prioritize what they read. Given more time, they are
simply able to read and review more documents and hence increase the coverage of the
topic and their confidence in their analysis. If searching can be done faster and
summarization techniques allow them to review documents faster, then analysts will be
able to do more in the same amount of time.

Fig. 2. Concept map showing relationships between usability, performance, and utility metrics for dialogue
systems.
The concept map in this case has been specialized for dialogue systems. The
information/effort metric is composed of two pieces relating to information and effort. In
this case, we have implemented ‘effort’ as # turns/answer and # clarification turns/thread.
For systems that are not dialogue systems, this concept map can be modified to reflect
efficiency measures appropriate for that particular system.
Because these were laboratory studies, we gained no insights into process. We also did
not design the studies to give any information about product. Therefore, these studies
contributed to our understanding of metrics for performance and usability of information
interaction.
4.2. Measuring human information interaction in a simulated environment
We wanted to go beyond laboratory studies and evaluate software tools in an
environment resembling an operational setting. Moreover, we wanted to be able to
compare the performance of information seekers with and without a given set of tools to
determine user impact. This is not as straightforward as it seems. Intelligence analysts
have both short term and long term tasks. An analyst may continue to monitor the same
issue for months or even years. All tasks are not created equal—some tasks may have a
plethora of information; others may have little. Analysts may work on multiple tasks
simultaneously. We need to determine how these variables affect users and develop one or
more realistic baselines.
One project that we are currently involved in has developed an environment to capture
the work that analysts do, including the task the analyst is researching, queries posed to
search engines, documents viewed, documents saved, reports created, and all keyboard
interactions. A way is provided for analysts to make comments about the tasks they are
currently doing that can be used to understand the lower level data. Several analysts
currently work in this environment and generate data that provides a view of the current
process. However, this environment is somewhat removed from an actual analysis
environment. The tasks that are given to the analysts are selected by the project managers
and are chosen in such a way as to illustrate real situations that analysts must cope with,
such as, short reaction time tasks, multiple tasks, and long term tasks. The customers for
the analytic products are the researchers, not intelligence officials. Nonetheless, this
situation does simulate an actual analysis situation.
We are continually analyzing the data to determine how analysts work given their
current set of tools. In this case, they use search engines, word processing applications, and
when necessary, presentation tools. Eventually, we will move the research tools into the
simulated environment to determine their impact on the process and products of the
analysts.
Our evaluation efforts are currently focused on developing measures from this data that
can be used to assess impact of new information interaction tools. We have been working
with analysts, researchers, and program managers to arrive at metrics that will show
impact for analysts, will provide diagnostics for the researchers, and will show progress
for the program managers. The current set of metrics was developed in an iterative fashion.
We first used information from publications about the process of intelligence analysis to
bootstrap our thinking (Grabo, 2003; Heuer, 1999; Krizan, 1999). We had a number of
opportunities to speak with analysts and to observe them at work. We used these sessions
to start brainstorming with the researchers and analysts to produce a list of potential
metrics. We then condensed this set to eliminate duplicates and we grouped them into
categories. We conducted observations of two analysts (Scholtz et al., 2005), which helped
us see a complete process over a two day period. In addition, we have been analyzing data
captured in the simulated environment this past year to determine what measures we can
collect and which of these will support proposed metrics.
The metrics under consideration are shown in Fig. 3. These metrics are in addition to
the traditional usability metrics of efficiency, effectiveness, and user satisfaction that will
be used to evaluate the user interface. Although we said earlier that time is not a good
overall measure, we are interested in saving time spent on repetitive tasks and on
interaction overhead to give the information workers more time to think and understand
the information. The metric category labeled ‘analyst time’ looks at these aspects. This
said, there is a danger in over-automating: understanding of data is produced by
manipulating it. There is a fine balance between providing enough automation to save the
analyst time and providing so much automation that results are produced without building
understanding on the analyst’s part. We are currently conducting pilot evaluations on
various research components to determine the feasibility and utility of the various metrics.
While some of the measures can be collected in a straightforward fashion, others will
involve more indirect measures. In particular, the Data Coverage metric presents definite
challenges in data collection. Other measures such as perceived effort and insights gained
will be obtained from interviews and rating questionnaires.
We have begun analyzing data from our pilot studies and from data in the simulated
environment to determine what measures can be computed. We have been able to compute
the following measures:
† Time spent in the various software applications
† The queries posed and the number of documents read for each query
† Time to determine relevance of documents
† The growth of the analyst’s information collection document
† Gaps in activity indicating off-line activities
† Time spent in report generation and data collection activities
† The quality of the work product
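As a purely illustrative sketch of how measures like those above could be derived from an instrumented session, the following assumes a hypothetical timestamped event log; the field and action names are invented for illustration, not taken from the actual instrumentation:

```python
from collections import defaultdict

def summarize_session(events):
    """Derive simple session measures from a timestamped event log.

    `events` is a list of dicts with hypothetical fields:
    {"t": seconds since session start, "app": application name, "action": action name}.
    """
    time_per_app = defaultdict(float)
    queries = 0
    docs_read = 0
    prev = None
    for ev in events:
        if prev is not None:
            # Attribute the elapsed interval to the application previously in focus.
            time_per_app[prev["app"]] += ev["t"] - prev["t"]
        if ev["action"] == "query":
            queries += 1
        elif ev["action"] == "open_document":
            docs_read += 1
        prev = ev
    return {
        "time_per_app": dict(time_per_app),
        "queries": queries,
        "documents_read": docs_read,
    }
```

Measures such as gaps in activity or document growth would require additional event types; this sketch covers only the simplest counts.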
In a recent study we were able to present system-generated hypotheses to the user for
relevance ranking. Although this is an early evaluation, it constitutes part of our plan to
measure the quality of system recommendations as judged by the analyst. We also
measured the number of relevant documents returned in a sequence of queries to determine
if user modeling was successfully augmenting user queries.
Fig. 3. Metrics for assessing impact of tools on human information interaction.

We have included the metric of product quality. In a recent evaluation we attempted to
measure product quality using a method for rating coverage as well as having all the
subjects rank each other’s reports. We had an expert analyst generate a report on the
subject for us. In addition the expert analyst ranked the participants’ reports. Participants
in the evaluation were also given anonymous copies of each others’ reports and asked to
rank them according to quality. We found that the expert ranking was very similar to the
ranking we obtained when averaging the participants’ rankings. The coverage score was
somewhat less in agreement but the coverage scoring was done by one experimenter. We
think this is a promising measure, especially in cases where we have a report from an
experienced analyst to use as a basis of comparison. We intend to use this method again,
but we will also have the coverage scored by the participants in addition to having them
rank the reports.
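The text does not specify how the similarity between the expert ranking and the averaged participant ranking was quantified; one common option for comparing two rankings is Spearman's rank correlation, sketched here with hypothetical data:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same items.

    rank_a, rank_b: lists of ranks (1 = best) for the same items in the same
    order; assumes no ties. Returns a value in [-1, 1], where 1 means
    identical rankings.
    """
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Hypothetical example: an expert's ranking of five reports vs. the ranking
# obtained by averaging participants' rankings.
expert = [1, 2, 3, 4, 5]
participants = [2, 1, 3, 4, 5]
rho = spearman_rho(expert, participants)  # rho close to 0.9: high agreement
```

A high rho would correspond to the "very similar" rankings reported above.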
From the time spent in various software applications, we can clearly see which step of
analysis the analyst is performing: locating information or generating the report. The content of
the queries can be used to infer the different lines of reasoning that the analyst is
pursuing. Currently we can capture off-line activities only if the analyst chooses
to write a note about them. But we can see gaps in on-line activity and make some
inferences; for example, if a print command was issued just prior to such a gap, we can
reasonably assume that the analyst is reading the hard copy of a document.
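The kind of inference just described could be sketched as a simple heuristic; the action names and the gap threshold below are hypothetical assumptions, not part of the actual instrumentation:

```python
def label_gap(last_actions, gap_seconds, threshold=300):
    """Heuristically label a gap in on-line activity.

    last_actions: list of action names logged just before the gap
    (hypothetical names such as "print" or "note").
    gap_seconds: length of the silence in the log.
    Returns a best-guess label mirroring the inference described in the
    text; this is an illustrative classifier, not an implemented one.
    """
    if gap_seconds < threshold:
        return "normal pause"
    if "print" in last_actions:
        return "likely reading hard copy"
    if "note" in last_actions:
        return "off-line activity (noted by analyst)"
    return "unexplained off-line activity"
```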
As analysts work, they often assemble a document containing relevant information they
are finding. They use this document when producing their final report. We have computed
the growth rate of this document over time based on keystroke data from standard desktop
applications (Cowley et al., 2005). We can use this along with measures of the number of
searches done and the number of documents retrieved and read from these searches to
compute information/effort, a measure we introduced in Section 4.1. The graph in Fig. 4
shows a growth rate chart and details whether the growth came from keystrokes by the
analyst or from copy/paste operations.
Fig. 4. Growth rate of the analyst’s document over a one-day period.
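The paper introduces information/effort as a measure but does not give a formula; as a purely illustrative sketch, information could be approximated by the growth of the analyst's collection document and effort by a weighted count of search and reading actions. The weighting below is an assumption made for illustration only:

```python
def information_per_effort(chars_added, searches, docs_retrieved, docs_read):
    """Illustrative information/effort ratio.

    chars_added: growth of the analyst's collection document (characters),
    as computed from keystroke and copy/paste data.
    searches, docs_retrieved, docs_read: counts from the session log.
    The weights are arbitrary assumptions; retrieved-but-unread documents
    are counted as cheaper than documents actually read.
    """
    effort = searches + 0.1 * docs_retrieved + docs_read
    return chars_added / effort if effort else 0.0
```

Under this sketch, a tool that lets the analyst gather the same material with fewer searches and less reading would score higher.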
Analysis of this data has resulted in a number of observations about analysts and the
analytic process:
† Analysts are quite skilled at forming good searches. They view at least one document in
the majority of searches. In addition, they look beyond the first ten results in many
cases.
† When analysts are doing quick reaction tasks, they rely heavily on highlighted search
terms in the documents to guide them to the appropriate sections to read. Long
documents with no highlighted terms are not read in such situations even though they
may be highly relevant.
† When documents are of interest to analysts, they note the sections of interest. This may
occur by printing out the document and highlighting sections or making notes on the
paper, or the analyst may copy and paste text into a repository document.
† Provenance is extremely important to analysts. They always include the source of their
information. They also decide the credibility of information based on their knowledge
of the source. If an analyst is working in a new domain, she must look for ways to
assess the credibility of a source.
† Temporal information is extremely important as well. Analysts look for dates in on-line
web sites so they understand the time frame in which the document was written.
These observations are helpful in developing insights into metrics that should be
considered. For example, we could consider the number as well as the size of documents
that the analyst reviews. We are able to track the actual documents that are used for
copy/paste operations. We could also examine these according to the dates and provenance
and then determine if new systems help the analyst look further back and produce more
information with citations.
There are still limitations to doing evaluations in ‘simulated’ environments. While
the analysts follow a process, we lack an organizational process, and we lack the
actual pressures that occur in the operational environment. In Section 4.3 we discuss
results from a number of experiments in operational settings.
4.3. Evaluating human information interaction systems in an operational environment
We have also looked at metrics for evaluating information systems deployed in
operational settings. We had two environments available for experimentation and
evaluation: a data-protected platform for conducting controlled experiments and an
operational platform where systems could be tested using real data and in actual processes.
Technical performance was evaluated in the data protected platform using synthetic data.
When software had been successfully evaluated in this platform, it was moved into the
operational environment. Software performance was evaluated on operational data and
then evaluated within the intelligence process itself. Once in the operational world, the
evaluation switched from measures of software and individual user interaction to that of
process and organizational impact. Technical measures were also collected within the
operational environment, as we needed to evaluate how the interactions scale when
real-world data, processes, and products are factored in. In addition, cognitive metrics were
collected only in the operational environment as they only make sense when analysts are
engaged in real tasks using the software.
By looking at the various metrics in the different environments, we were able to roll up
the metrics to capture the entire picture: the performance of the technology components,
the scalability of those components to actual data, and the impact of these components on
the analytic process.
We developed a metrics model (Mack et al., 2004) that defined the metrics and the
conceptual measures at a functionality level and at a scenario level. As new
components were integrated into the environment, we implemented these measures by
determining what data from the components could be used to obtain a given metric.
This allowed us to customize the measures for different tools but to have the ability to
compare metrics across experiments. Fig. 5 shows the portion of the metrics model that
focuses on utility.
We use the following definitions in our metrics model. A metric is used for
distinguishing progress or differences between implementations of a capability. A
conceptual measure is an attribute or property of a metric. An implementation measure
realizes part or all of a conceptual measure; the set of implementation measures produces
the measurements that determine the value of the metric.
Metrics and conceptual measures are independent of the specific software being
evaluated and allow us to make comparisons across software components and across
experiments. Implementation measures are specific to the software being evaluated.
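The three-level model just defined could be sketched as a small data structure; the class and field names here are invented for illustration and do not appear in the metrics model itself:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ImplementationMeasure:
    """Tool-specific realization of (part of) a conceptual measure."""
    name: str
    collect: Callable[[], float]  # hypothetical data-collection hook

@dataclass
class ConceptualMeasure:
    """Tool-independent attribute or property of a metric."""
    name: str
    implementations: List[ImplementationMeasure] = field(default_factory=list)

@dataclass
class Metric:
    """Distinguishes progress or differences between implementations of a capability."""
    name: str
    conceptual_measures: List[ConceptualMeasure] = field(default_factory=list)
```

Because metrics and conceptual measures sit above the tool-specific layer, two tools can share the upper two levels while binding different implementation measures, which is what permits comparison across experiments.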
We have used the categories of effectiveness, efficiency, and user satisfaction to
classify our conceptual measures, although we have expanded the standard usability
definitions of these terms (ISO 9241). The measures in Fig. 5 are, in many
instances, repeated in the different platforms. However, the implementation of these
measures may be quite different. Measures that we might be able to collect through
logging software in a more experimental situation may be replaced with interviews
and questionnaires in an operational environment. We have also separated out
metrics measuring performance with real-world data from metrics associated with
processes involving real-world data. It is important to look at aspects of
performance on actual data before trying to assess impact within an actual process.
This allows us to compare the evaluations done using synthetic data to measures
collected using operational data.
We do not mean to suggest that there was a clean sequence of evaluations progressing
from the controlled experimental world to the operational environment. Often we found
anomalies in the operational environment that required further investigations in a
controlled environment. Other measures such as changes to process can only be assessed
within a process in an operational environment.
In Section 4.4, we discuss the agreed-upon metrics that will be used in an upcoming
evaluation in an operational environment.
Fig. 5. The portion of the metric model focusing on ‘utility’.
4.4. Testing the metrics
Over the next six months we will be inserting research software developed for the
Novel Intelligence from Massive Data program (NIMD home page, accessed June, 2005)
into an operational environment. We have formalized the metrics that will be used, based
on metrics developed mainly in our simulated environment studies. We are first collecting
baseline data from analysts. The overall metrics that will be used are:
† Increased analytic product
† Increased analyst confidence in the product
† Increased signal to noise ratio
Senior analysts will rate the quality of the products; the analysts themselves will be
asked to give their confidence level. The signal to noise ratio will be measured by looking
at the number of documents that analysts view and the percentage of those that are deemed
relevant (printed, read online, saved, copied into report, cited in report). While the quality
and confidence are qualitative measures, there are quantitative measures at the
implementation level that will feed into these metrics as well. Lower level metrics
(based on the research areas of the project) include:
† Reduced workload for the analyst
† Increased percentage of relevant documents returned to the analyst
† Increased throughput for the man and machine working together
† Increased number of paths of reasoning that the analyst is able to explore
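The signal-to-noise metric described above counts the fraction of viewed documents deemed relevant by one of the listed actions. A minimal sketch, with hypothetical action names standing in for the logged events:

```python
# Actions treated as evidence of relevance (hypothetical names mirroring
# "printed, read online, saved, copied into report, cited in report").
RELEVANCE_SIGNALS = {"printed", "read_online", "saved", "copied_into_report", "cited"}

def signal_to_noise(viewed_docs):
    """Fraction of viewed documents deemed relevant.

    viewed_docs: mapping of document id -> set of observed actions.
    A document counts as relevant if any relevance-indicating action
    occurred for it.
    """
    if not viewed_docs:
        return 0.0
    relevant = sum(1 for actions in viewed_docs.values()
                   if actions & RELEVANCE_SIGNALS)
    return relevant / len(viewed_docs)
```

A higher ratio after a tool is introduced would suggest the analyst is spending less time on documents that turn out to be irrelevant.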
In particular, one research area is human information interaction. The metrics we
propose for this area, which will feed into the overall impact metrics, are:
† Increased number of productive searches
† Increased number of relevant documents retrieved and reviewed
† Reduced overall workload for the analyst
We are excited about the opportunity to determine if these metrics can be used to
distinguish which software tools are useful to the analyst.
5. Discussion
We have explored metrics for assessing human information interaction in laboratory
studies, in simulated environments, and in operational settings. We have used the
traditional usability metrics of effectiveness, efficiency, and user satisfaction but we have
augmented these with measures such as the number of turns and clarifications needed. We
found that a potential utility metric is the amount of effort an analyst expends to obtain
information. In our simulated environments, we are able to measure more about the
process that analysts use in their analytic tasks. In the operational world, we are able to
determine how well new software tools work with the existing infrastructure and how well
they scale to real-world information streams. Table 1 summarizes our metrics work in
these three areas.
The ideal situation would be to capture a baseline first in the operational world and then
base research on those needs and characteristics. However, the operational world is
dynamic and we must continually assess what is occurring there. Analysts, their needs,
their missions, and their infrastructure continually change. As we move from the
laboratory to the simulated environment to the operational world, we gain a broader
Table 1
Summary of metrics information obtained in three areas of study

Laboratory studies: expanded definitions of effectiveness and efficiency; gained
insights into criteria for good answers; obtained detailed data about user interactions.

Simulated environment: developed metrics for process and product; looked at detailed
use of software over a period of time; obtained detailed data about the analytic process
and the construction of the product.

Operational environment: produced baseline metrics; assessed how a tool interacts with
other tools in the environment and with operational data; obtained statistical data and
interview data.
perspective but we lose control over experiments. It is difficult to obtain fine-grained data but
we can more easily get data about real-world use. The laboratory is the place to understand
interactions in detail, within a given scenario. To look at how a particular software system
impacts an individual user, a simulated environment should be used. To determine if that
software system is effective in the real world, we need to study it in an operational
environment. Interviews and statistics on usage can be collected in the operational
environment and can be used to devise more laboratory studies and simulated
environment investigations.
6. Conclusions
Evaluating information interaction is a complex process that involves metrics and
measures at several levels. Technology developers have typically used performance
measures for their systems. The HCI community has perfected metrics and methodologies
for assessing the usability of systems. We suggest it is time to move from measures of
usability and performance alone and to include measures of utility and impact. This will
require new evaluation methodologies and new metrics. A number of metrics have been
discussed here, including additional performance measures and expanded usability
definitions. While not all of the metrics presented here will turn out to be useful, we are
confident that a subset will be sufficient to adequately measure the utility of human
information interaction systems. These metrics have been developed through readings and
discussions with members of the intelligence community, with researchers developing
software to assist intelligence analysts, and by conducting studies and experiments in the
laboratory, in simulated environments, and in operational settings. The metrics proposed
are a good starting point for evaluating the impact of human information interaction in the
large. However, these metrics will definitely evolve as more research focuses on the
evaluation of information interaction.
In addition, there are a number of confounds that need to be considered: the expertise of
the analyst, the domain knowledge of the analyst, the time that can be allocated to the task
the analyst is given, the complexity of the task, and the amount, quality, and dynamic
nature of the available information. We have not addressed these in this discussion.
However, we are currently conducting some studies that will address these factors and
should help us to interpret our metrics within these constraints.
We have several tasks ahead of us. One is to determine which metrics distinguish
systems that will be used by analysts as opposed to those that will not. We have a study
underway that will help to validate metrics we have defined for one particular research
program in the intelligence community. We will use qualitative data collected from the
analysts in an operational environment to determine whether our impact metrics can
distinguish between tools that have utility and those that do not. The second task will be to
determine if these metrics extend to the evaluation of tools in different areas of human
information interaction, such as digital libraries.
Acknowledgements
The author would like to thank her team at the National Institute of Standards and
Technology for their work in conducting and analyzing numerous studies that provided
metrics used in this paper. The anonymous reviewers have made extremely useful
suggestions for improvements. The author would also like to thank Dianne Murray and
Gitte Lindgaard for their helpful suggestions and comments. This work was funded in part
by a number of programs at ARDA.
References
Advanced Research and Development Agency (ARDA) home page, http://www.ic-arda.org/, accessed May,
2004.
Blandford, A., Buchanan, G., 2002. Usability of digital libraries: a workshop at JCDL 2002—JCDL’02 workshop
on usability of digital libraries. Joint Conference on Digital Libraries, Portland Oregon.
Covey, D., 2002. Usage and Usability Assessment: Library Practices and Concerns. Digital Library Federation,
Council on Library and Information Resources, Washington—website http://www.clir.org/pubs/abstract/
pub105abst.html.
Cowley, P., Nowell, L., Scholtz, J., 2005. Glassbox: An Instrumented Infrastructure for Supporting Human
Interaction with Information. Hawaii International Conference on System Sciences, Jan. 2005.
Craine, K., 2000. Designing a Document Strategy. MC2 Publishing.
Dahlback, N., Jonsson, A., Ahrenberg, L., 1993. Wizard of Oz studies—why and how. Proceedings from the 1993
International Workshop on Intelligent User Interfaces, Orlando, FL, pp. 193–200.
Gershon, N. 1995. Human Information Interaction, WWW4 Conference, December.
Grabo, C.M., 2003. Anticipating Surprise: Analysis for Strategic Warning. Center for Strategic Intelligence
Research, Joint Military Intelligence College, Washington, DC; 2005 edition, University Press of America.
Heuer, R., 1999. Psychology of intelligence analysis. Center for the Study of Intelligence. CIA.
Hill, L., Dolin, R., Frew, J., Kemp, R.B., Larsgaard, M., Montello, D.R., Rae, M., Simpson, J. 1997. User
evaluation: summary of the methodologies and results for the Alexandria Digital Library, University of
California at Santa Barbara. In: ASIS97 Digital Collections: Implications for Users, Funders, Developers and
Maintainers—Proceedings of the 60th ASIS Annual Meeting, vol. 34, pp. 225–243.
International Organization for Standardization, ISO 9241. Ergonomics Requirements for Office Work with Visual
Display Terminals (VDTs), Part 11. Guidance on Usability Specification and Measures.
Krizan, L., 1999. Intelligence Essentials for Everyone. Joint Military Intelligence College, Occasional Paper
Number Six, Washington, DC.
Lesk, M., 1997. Unpublished Paper, accessed Nov. 16, 2005. www.lesk.com/mlesk/diglib.html.
Lucas, P., 2000. Passive information access and the rise of human-information interaction. Invited Talk,
CHI 2000 Extended Abstracts, p. 202.
Mack, G., Lonergan, K., Hale, C., Scholtz, J., Steves, M., 2004. A framework for metrics in large, complex
systems. Aerospace.
Marchionini G. 2000. Evaluating digital libraries: a longitudinal and multifaceted view. Library Trends. 49(2),
304–333. Online copy at http://www.ils.unc.edu/~march/perseus/lib-trends-final.pdf.
Medina, C., 2002. What to do when traditional models fail. Studies in Intelligence. 46(3), http://www.cia.gov/csi/
studies/vol46no3/article03.html, accessed May, 2004.
National Research Council, More than Screen Deep, 1997. National Academy Press, p. 32.
NIMD home page, http://www.ic-arda.org/Novel_Intelligence/, accessed June 12, 2005.
Scholtz, J., Morse, E., Hewett, T., 2005. In depth observational studies of professional intelligence analysts.
International Conference on Intelligence Analysis, May, MacLean, VA. Proceedings available at https://
analysis.mitre.org/ [accessed June 12, 2005].
TREC Conference [http://trec.nist.gov/data/qa.html, accessed Feb. 16, 2004].
Voorhees, E., Buckland, L. (Eds.). 2003. Proceedings of the Eleventh Text REtrieval Conference (TREC 2002).
Warner, M., 2004. Wanted: a definition of intelligence. Studies in Intelligence, 46(3), http://www.cia.gov/csi/
studies/vol46no3/article02.html, accessed May 2004.
White, H., McCain, K., 1989. Bibliometrics. In: Williams, M. (Ed.), Annual Review of Information Science and
Technology, vol. 24. Information Today, Inc., Medford, NJ, pp. 161–207.