
COMPUTING PRACTICES

Human-Computer Interface Development Tools: A Methodology for Their Evaluation

Deborah Hix and Robert S. Schulman

A comprehensive checklist-based methodology produces quantifiable criteria for evaluating and comparing human-computer interface development tools along two dimensions: functionality and usability. An empirical evaluation shows that the methodology, which is in use in several corporate interface development environments, produces reliable (consistent) results.

Motivation for a Tool Evaluation Methodology

Tools for human-computer interface development--often called user interface management systems (UIMSs) or user interface development environments--conjure up visions of icons and images, windows and words, mice and other media that comprise the human-computer interface of an interactive system. Human-computer interface development tools are themselves interactive systems that support production and execution of the human-computer interface. Despite their recent proliferation, there are no procedures for systematically evaluating or comparing such tools. The state of the art in their evaluation is based on subjective opinion--"warm fuzzy" feelings.

Goals of a Tool Evaluation Methodology

We have developed and empirically validated a methodology for evaluating and comparing human-computer interface development tools. Goals of such a methodology include the following:

• use of a standardized approach to produce reliable, quantifiable results;

• thoroughness, but not necessarily exhaustiveness, of a tool evaluation;

• objectivity, rather than judgement, on the part of the tool evaluator;

• ease-of-use, but not necessarily quick-to-use, for the tool evaluator;

• adaptability to the development environment in which it is used; and

• extensibility, to remain sensitive to advances in interface technology.

This evaluation methodology focuses on technical issues. Although economic issues cannot be ignored in a business environment, this methodology does not address such issues, nor does it make recommendations. Rather, it presents data that management can use in making decisions. Data produced by this methodology might be used, for example, to help determine the choice of a tool for a particular interface development environment. It is also important to remember that this methodology is used to evaluate human-computer interface development tools, and not their end-user interface products; the latter are the target application interfaces that these tools are used to produce.

The research presented in this article is groundbreaking, representing one of the first attempts to produce a structured, quantitative approach to the evaluation of interface development tools. Our methodology is in use at a number of corporate sites (see the section "Real World Use of the Methodology").

Related Work

While very little structured evaluation of human-computer interface development tools has been done, precedent has been established for using the type of approach on which our tool evaluation methodology is based. Roberts and Moran [6], for example, produced a methodology for evaluation of text editors. The basis of their approach was classification of potential editing tasks (to provide a common basis for comparison) and evaluation, both quantitative and qualitative, along several dimensions, including time to perform tasks, error costs, learning time, and functionality. A replication study of their work [2] produced recommendations for modifications. The evaluation procedure presented in this article is similar to the Roberts and Moran approach, but differs in context, content, style, and results.

Cohill, Gilfoil, and Pilitsis [3] developed a methodology that was used within AT&T for evaluating software packages, particularly commercial systems. This methodology was based on criteria such as performance, documentation, and support. Their approach, which has been used to compare at least five Unix-based word-processing packages, resulted in the selection of a system for an engineering group. Such approaches provided ideas that we used in developing our tool evaluation methodology.

Overview of This Article

This article will discuss some of the difficulties of developing an evaluation methodology for interface development tools. It will describe in detail the tool evaluation methodology we have produced and used. Results of an empirical validation study will show that the methodology produces reliable data. Use of the methodology in several real-world human-computer interface development environments will also be presented.

Difficulties in Developing a Tool Evaluation Methodology

A surprising number of problems arise when one attempts to develop a formal, structured evaluation methodology for interface development tools [5]. At the core of the difficulty of developing such a methodology is the fundamental problem of defining the human-computer interface itself. Behavioral scientists often view the interface as having cognitive and/or gestural aspects, while computer scientists typically think of it as just software and hardware. Closely connected to the interface definition is the problem of defining an interface development tool. For example, terms include user interface development tools or toolkits or environments, and probably the most prevalent, user interface management systems or UIMSs [4]. An interface development tool can be anything from a single interface routine to a complete interface development environment. A toolkit is a library of routines that can be called to implement low-level interface features, particularly to incorporate various interaction techniques--for example, menus, scroll bars, buttons, and their associated physical devices, such as mice, joysticks, and trackballs--into an interface. A toolkit is typically used by a programmer who writes source code to call the desired routines. Toolkits provide little or no support for a nonprogramming interface designer. A UIMS is an integrated interactive environment, often based on a broad set of tools, for designing, prototyping, executing, evaluating, modifying, and maintaining interfaces. In particular, a UIMS supports the development, through all phases of the life cycle, of a user interface by an interface developer who may not be a programmer. This is commonly done through the use of graphical representation techniques for designing and prototyping an interface, as well as run-time execution support so that an interface can be evaluated and iteratively refined. By relieving an interface developer of much of the tedium (i.e., coding) of producing an interface, a UIMS lets the developer concentrate on the design of the interface itself, rather than on its implementation. Many interactive software systems call themselves UIMSs when they are really no more than toolkits.

Other problems associated with producing a methodology for evaluation of interface development tools include the following:

• Determination and measurement of salient dimensions for evaluation of interface development tools: Possibilities for dimensions include tool functionality, usability, learnability, user (of the tool) performance, tool performance, and documentation. Measurement of selected dimensions is difficult, generally because an attempt is being made to quantify what is largely qualitative information.

• Variety of specification and implementation techniques used in tools: These techniques can greatly color a tool user's impression of the tool, including its usability and learnability.

• Availability of tool software and hardware for hands-on use by an evaluator: Data collection from sources such as marketing glossies, videotapes, documentation, and even live demonstrations is unsatisfactory and unverifiable. In particular, one system we examined looked good from glossies and a videotape, but turned out to have limited functionality in operation. Of course a "catch-22" can arise: a tool cannot be adequately evaluated without acquiring it, but one may not initially know which tool to acquire. Costs for tools vary widely and can be expensive--in the thousands of dollars.

• Selection of an appropriate tool evaluator: This person should be someone who is familiar with several such tools. An evaluator must exhibit--possibly through performance of baseline tasks using the tool(s)--a minimal measurable level of expertise with each tool.

• Cost of the evaluation process, involving an appropriately experienced person for a substantial time.

• Rapid technological advances in the human-computer interface area: These can render a tool evaluation methodology worthless even before it has been shown to be useful.

• Difficulty of evaluating an evaluation methodology itself: Virtually no research exists in formal evaluation of interface development tools. As methodologies are produced, they must be shown to produce reliable (consistent) and valid (measuring what we intend them to measure) results. Statistically determining reliability and validity can be a costly, complicated process involving longitudinal use of the methodology with a variety of tools and evaluators.

Description of the Tool Evaluation Methodology

Overview of the Evaluation Process

Figure 1 shows a high-level block diagram of the steps used in the tool evaluation methodology, which revolves around hands-on use of the tool being evaluated. After learning how the tool works, an evaluator completes a detailed 28-page "checklist" form that is a matrix organized around two dimensions:

1. Functionality of the tool, and
2. Usability of the tool.

Functionality indicates what the tool can do; that is, what interface styles, techniques, and features can be produced by the tool for a target application interface. Usability indicates how well the tool performs its possible functions, in terms of ease-of-use (a subjective, but quantitative, rating of how easy or difficult the tool is to use) and human performance (an objective rating of how efficiently the tool can be used to perform tasks).

After completing the detailed portion of the form, the evaluator calculates overall functionality and usability ratings (using a Microsoft Excel spreadsheet that replicates the layout of each page of the form) to produce numeric results for an executive summary. If desired, the evaluator may perform benchmark interface development tasks, completing the evaluation.

Functionality Dimension

The functionality dimension of a tool's evaluation is organized into three sections. Major categories and some subcategories from each section are shown in Figure 2. Each of these is further divided into appropriate items (e.g., 14 types of menus, 27 features of interfaces, 17 hardware input devices, 9 rapid prototyping subcategories, 7 evaluation subcategories, and so on).

Usability Dimension

The usability dimension of a tool's evaluation is measured by two methods:

• Subjective evaluation measures ease-of-use for each of the three sections of the functionality dimension. For each function the tool can produce (and only those), the evaluator indicates on the form whether the tool was difficult (indicated by a frowning face), adequate (bland face), or easy (smiling face) to use to produce that function. Individual evaluator interpretation was reduced by defining each face usability icon, as shown in Figure 3. These definitions came from extensive interviews with five expert tool users.

• Objective evaluation measures human performance using the tool to complete suggested benchmark interface tasks that can be customized to a specific working environment. (This is traditional benchmark testing, using representative tasks, and will not be addressed here.)

Figure 1. Steps in using the tool evaluation methodology

Specification/Implementation Techniques

For those functions of interfaces that the tool can produce (and only those), the evaluator also indicates the primary type of specification/implementation technique the tool uses to produce individual functions.

Figure 2. Major categories and some subcategories of the three sections of the functionality dimension

Section A. Types of Interfaces Produced with this Tool
  Interaction styles: menus, forms, typed string inputs, windows
  Features of interfaces: animation, adaptive dialogue, types of navigation, default input values, graphics
  Hardware devices: input devices, output devices

Section B. Types of Support
  Rapid prototyping
  Development methodology
  Constructional model of human-computer interaction
  Evaluation of target interface
  Database management system
  Interface libraries
  Help for using tool
  Documentation of tool itself
  Automated project management

Section C. General Characteristics
  Integration across tool
  Reusability of tool outputs
  Extensibility of tool
  Modifiability of tool


Tools incorporate very different techniques for specifying and implementing a user interface, ranging from programming paradigms to graphical metaphors. The type of specification/implementation technique can strongly influence an interface developer's use of a tool, and may be a key feature in selecting a tool for an interface development team, based on the skills of those on the team. A tool in which 80% of an interface is implemented by textual programming would, for example, differ markedly from a tool in which 80% of an interface is implemented using graphical, direct manipulation techniques. Techniques on the evaluation form are chosen from those shown in Figure 4.

All terms in Figures 2, 3, and 4, as well as all other terms in the evaluation form, are defined in a Glossary that accompanies the form.

Examples of Use of the Form

To give an idea of the appearance, use, and results of the form, some pages will be shown as examples, including a detailed page from the form (Figure 5) and a page from the executive summary (Figure 6). These are actual results from an empirical study of evaluating a tool called Bricklin's Demo™, described later in the section "Empirical Validation of the Evaluation Methodology." Figure 5 shows the completed page from the "types of interfaces" section, the page for evaluating forms. (No claims are made about whether the checks shown in this figure are right or wrong; they simply show the exact marks made by one evaluator in our study.) Note the matrix positioning of usability versus functionality.

To use the form, an evaluator first looks down the list of functionality items in the left column, and puts a check in the corresponding "Not possible with this tool" column for any item not handled by this tool. Next, for any item without a check under "Not possible with this tool," the evaluator checks the appropriate box (one of the three faces) to indicate the tool's ease-of-use for that item, as defined in Figure 3. Then, for all items for which a usability rating was just checked, the evaluator checks the appropriate box to indicate the specification/implementation technique(s) used by the tool. Note the spaces on the evaluation matrix labeled "Other, Specify:" (e.g., at the bottom of the "forms" page, and under specification/implementation techniques across the top of the page). Here the evaluator can add items that are appropriate to a tool but are not explicitly included on the form.
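Conceptually, the marks an evaluator records on one page reduce to a small amount of data per row. The sketch below is a minimal, hypothetical encoding of a completed page; the class name, field names, and the example check marks are ours for illustration, not part of the form or of the authors' spreadsheet.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormItem:
    """One functionality row of the evaluation form (hypothetical encoding)."""
    name: str
    ease_of_use: Optional[int]            # None = "Not possible with this tool";
                                          # 1 = frowning, 2 = bland, 3 = smiling face
    techniques: List[str] = field(default_factory=list)

# A fragment of the "Forms" page, with illustrative (not actual) marks.
forms_page = [
    FormItem("Typed (non-enumerated) input", 3, ["object manipulation"]),
    FormItem("Formatted fields", 2, ["tabular manipulation"]),
    FormItem("Toggled (enumerated) input", None),
    FormItem("Default field values", 3, ["form-filling"]),
    # ...the remaining rows of the page would follow the same pattern
]
```

From such a record, the functionality and usability ratings described next are simple aggregations.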

Calculating Results from the Evaluation Methodology

Numeric results computed for each tool evaluated by using this methodology include the following:

• Functionality rating--an indicator of the number of interface functions the tool supports; calculated as the percentage of the total number of functions on the evaluation form that are possible with this tool;

• Usability rating--an indicator of the ease with which functions can be produced with the tool; calculated by considering only those functions possible with this tool, this rating is the total rating earned, expressed as a percentage of the highest possible usability rating;

• Specification/implementation technique rating--an indicator of the degree to which a technique is used in the tool for producing possible functions; calculated as the percentage of use of a specific technique relative to all possible techniques.

The functionality rating is expressed as a percentage of the number of functions possible with this tool with respect to the total number of functions on the evaluation form. For example, a functionality rating of 80% for a category, such as "menus," indicates that a tool supports 80% of the total functions listed under "menus" on the form. The functionality rating for any category is equal to the number of functions supported by the tool (i.e., those items not checked as "Not possible with this tool") divided by the total number of functions in that category, expressed as a percentage (i.e., multiplied by 100). For the page in Figure 5, there are 12 (non-"other") items listed under "forms"; this evaluator checked 2 of them as "Not possible with this tool." Thus 10 of the 12 functions are supported by this tool, so the functionality rating for "forms" is 10 divided by 12, or 83%, as shown in Figure 6.
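As a minimal sketch of this arithmetic (assuming a category's rows have been reduced to a list of booleans marking which items are possible--a representation of ours, not the form's), the rating is a single ratio:

```python
def functionality_rating(possible_flags: list) -> float:
    """Percentage of a category's items that are possible with the tool.

    possible_flags holds one boolean per (non-"other") item in the category:
    True if the item is NOT checked "Not possible with this tool".
    """
    return 100.0 * sum(possible_flags) / len(possible_flags)

# The "forms" page of Figure 5: 12 items, 2 marked "Not possible with this tool".
forms_possible = [True] * 10 + [False] * 2
print(round(functionality_rating(forms_possible)))  # 10 / 12 -> 83
```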

The usability rating quantifies the evaluator's qualitative judgment of the ease of using the tool to accomplish particular tasks, such as producing a menu. The usability rating is expressed as the percentage of the ease-of-use rating of all functions possible with this tool with respect to the maximum ease-of-use rating for those same possible functions. For an individual function, a check under the frowning face is assigned an ease-of-use rating of 1, while checks under the bland and smiling faces are assigned ratings of 2 and 3, respectively. The denominator for the usability rating is the highest possible ease-of-use rating, determined by multiplying the number of rows for which any face icon is checked (i.e., all rows that do not have a check under "Not possible with this tool") by 3, the highest possible ease-of-use rating. In Figure 5, this rating is 30 (i.e., 10 rows times 3). The numerator is determined by multiplying the total number of checks in each face icon column by its corresponding ease-of-use rating. From the example, there are 1, 3, and 6 checks for the frowning, bland, and smiling faces, respectively, so the numerator is 1 times 1, plus 3 times 2, plus 6 times 3, or 25. Thus the usability rating for the "forms" category is 25 divided by 30, or 83%, as shown in Figure 6. (This rating coincidentally equals the "forms" functionality rating, also 83%.)
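The corresponding usability computation, again under our own assumed representation of the face checks as a list of 1/2/3 scores covering only the possible items:

```python
def usability_rating(face_scores: list) -> float:
    """Percentage of the maximum ease-of-use score for the possible items.

    face_scores holds one entry per item the tool can produce:
    1 = frowning face, 2 = bland face, 3 = smiling face.
    """
    return 100.0 * sum(face_scores) / (3 * len(face_scores))

# "Forms" page of Figure 5: 1 frowning, 3 bland, and 6 smiling checks.
forms_faces = [1] + [2] * 3 + [3] * 6
print(round(usability_rating(forms_faces)))  # 25 / 30 -> 83
```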


Figure 3. Definitions for ease-of-use ratings of the usability dimension

Figure 4. Types of specification/implementation techniques

Figure 5. A page from the evaluation form, as completed by an evaluator participant for Demo. [The page crosses the "Forms" functionality items--typed (non-enumerated) input, formatted fields, toggled (enumerated) input, default field values; navigation within form by arrow keys, picking with mouse, tab key, space bar, bidirectional, wrap-around; single-page forms, multi-page forms, and "Other (Specify:)"--with columns for ease of using the tool to produce each function and for the specification/implementation technique used (coded in textual language, direct manipulation, miscellaneous), plus a TOTALS row. The evaluator's individual check marks are not reproduced here.]


Figure 6. Actual results from the executive summary for Demo: functionality and usability ratings (%)

The specification/implementation technique rating may be useful in indicating the specification/implementation paradigm used by a tool, and may actually influence its usability rating. This rating is calculated by summing the checks for each column in the specification/implementation columns, and recording these numbers in the corresponding "TOTALS" row for each page. Next, all these totals are summed across pages, producing a row of grand totals. The percentage use of a technique (an individual column) is the grand total of that column divided by the sum of all grand totals. Thus from our example (most of the data are not given here), the sum of all grand totals is 46 (for object manipulation) plus 28 (for rule-based transitions), or 74. So the percentage use of object manipulation is 46 divided by 74, or 62%, and the percentage use of rule-based transitions is 28 divided by 74, or 38%.
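A sketch of this last calculation, assuming the per-technique grand totals have already been summed across pages into a dictionary (the variable and function names are ours, for illustration):

```python
def technique_ratings(grand_totals: dict) -> dict:
    """Percentage use of each specification/implementation technique,
    relative to the sum of all techniques' grand totals."""
    overall = sum(grand_totals.values())
    return {name: 100.0 * total / overall for name, total in grand_totals.items()}

# Grand totals from the worked example in the text.
totals = {"object manipulation": 46, "rule-based transitions": 28}
print(technique_ratings(totals))
# {'object manipulation': 62.16..., 'rule-based transitions': 37.83...}
```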

A complete evaluation report contains several parts:

• General description of the tool being evaluated;

• Information about sources used in preparing the evaluation;

• Executive summary of overall functionality and usability ratings for all three sections;

• Detailed evaluation of functionality and usability dimensions for all three sections; and

• Glossary defining every item in the form.

Interpretation of Results

Now that we know how to compute these numbers, let us look at their interpretation. For example, there are 14 kinds of menus on the evaluation form. According to this evaluator (Figure 6), Demo can produce 100% of them, and can produce 83% of the 12 types of forms. Thus Demo supports a large variety of types of menus and forms. As discussed previously, usability is evaluated only on functions that are possible (e.g., the 83% of types of forms). Usability ratings of 91% and 83% mean Demo is easy to use to produce menus and forms, respectively. A good variety of the features of interfaces (81%) are supported by Demo, and are fairly easy to produce (usability of 72%). Demo can produce interfaces with only a few input devices (functionality of 18%), while output devices are much better supported (71%). Both input and output devices are easy to incorporate into interfaces produced using Demo (usability of 100% and 80%, respectively). (It is important to remember that all usability ratings refer to ease of using the tool to produce an application interface, and are not in any way related to ease-of-use of that interface.)

Empirical Validation of the Evaluation Methodology

Participants, Materials, and Procedures

These evaluation ratings remain mere numbers unless we can show they are reliable. To determine reliability we must ask whether different evaluators will produce similar, consistent results for the same tool. To test the reliability (consistency) of this evaluation methodology across different evaluators, we conducted an experiment using six graduate students from Virginia Tech as research participants. These students were carefully chosen as representative subjects to be tool evaluators; all had a minimum of two years of system and interface development experience, either in industry or an academic setting, and all were very familiar with interface development tools, having used one or more such tools either to develop a real interface or just to learn more about them.

Three tools representing different application types were evaluated:

• Bricklin's Demo™ V1,
• Macintosh's HyperCard™ V1.2.1, and
• SmethersBarnes' Prototyper™ V1.

Demo runs on IBM hardware, while the other two run on the Macintosh. Demo produces sequential, nondirect-manipulation interfaces such as those typically found in IBM-based systems. Prototyper is used for developing Macintosh-style interfaces, and has a Mac-style interface. While HyperCard is not strictly an interface development tool, we included it in our evaluation because of its widespread use and because of its features that support interface development. Further, these three tools are widely available and known in the interface development world, so that intuitive assessment of evaluation results would be more feasible.


That is, we could determine whether the results look right, based on a priori knowledge of these tools.

These tools were specifically chosen because they are perceptibly different, and therefore would be expected to have different functionality and usability ratings. The goal of this experiment was to determine whether the evaluation procedure produces different ratings for obviously different tools. This is critical for an initial study; if validating the procedure for the first time by evaluating very similar tools produced no significant differences in ratings, then we would not know whether the apparent lack of differences was due to the evaluation procedure or whether there really were no differences among the evaluated tools.

Each participant evaluated two different tools (neither of which they had ever used before), with appropriate counterbalancing, so that each tool was evaluated by four different participants. While the results of evaluating the tools are inherently interesting, we were more concerned with evaluating the evaluation form and methodology--their usability, completeness, and understandability, for example.

The reliability experiment was conducted in three stages. During Stage 1, participants received software and manuals for the two tools they were asked to evaluate, and learned how to use each tool to a reasonable level of expertise, at their own pace, during a two-week period. During Stage 2, participants performed baseline tasks for the experimenter, to ensure that each participant attained at least a common baseline level of expertise with each tool, to reduce variance among participants in their knowledge of each tool. The tasks took about fifteen minutes to complete, and all participants satisfactorily performed the baseline tasks on their first attempt. During Stage 3, participants were given training on the use of the form and glossary by the experimenter. Participants then completed a form for each tool, at their own pace, using any documentation and the tool itself. Participants returned completed forms to the experimenter, who computed functionality ratings, usability ratings, and specification/implementation technique ratings using the electronic spreadsheet. Participants were also asked some questions about how they used the evaluation methodology.

Results from Using the Methodology to Evaluate the Tools

Two kinds of results are presented here: numbers from the use of the evaluation methodology for each tool (in this section), and the results of reliability testing of the methodology based on these numbers (in the following section).

Participants reported spending six to ten hours learning a tool, and an additional one to two hours completing each evaluation form. Tables 1, 2, 3, and 4 show mean summary percentages for each tool, averaged from the ratings of all four participants who evaluated each tool. (Note that the summary percentages shown in Figure 6 for the evaluation results of Demo were from one individual participant, and are therefore different from these mean percentages.)

Results of the evaluation of each of the three tools can be interpreted as was done previously for Demo. Table 1, showing mean summary functionality and usability percentages for each tool, indicates that, overall, HyperCard has the greatest functionality across all three sections of the form, followed by Demo and then Prototyper. HyperCard and Prototyper have very similar ratings of usability, and Demo has the lowest usability rating.

Table 2 gives a more detailed breakdown of line 1 of Table 1. Considering all columns in Table 2, some spot comparisons show that HyperCard (functionality of 90%) produces more types of forms than Demo (73%) or Prototyper (54%). Typed input strings are not well supported by Demo or Prototyper (both functionalities of 19%). HyperCard supports more features of interfaces (84%) than does Demo (46%) or Prototyper (37%). All three tools provide little support for input devices, while Demo supports more output devices (71%) than does HyperCard (64%) or Prototyper (53%). Usability ratings are generally high for all three tools. Results in Table 3, showing the types of support the tools provide, can be interpreted similarly.

Specification/implementation techniques used by each tool are shown in Table 4. According to these results, HyperCard primarily uses object manipulation (80%) with some tabular manipulation (11%). Prototyper also primarily uses object manipulation (78%) and some form-filling (14%). Demo uses a mix of object and tabular manipulation, rule-based transitions, and form-filling. None of these tools uses textual languages to any extent.

Table 1. Mean summary results (%) of functionality and usability ratings for all three sections of the form

                               Demo             HyperCard        Prototyper
                               Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Types of Interfaces            52      82       70      87       46      93
Types of Support               56      65       76      82       38      74
General Characteristics        28      61       50      83       31      75


A series of analyses of variance (ANOVAs) was run on all results to determine which are significantly different. Major comparisons were computed across the three sections of the form.

Results from Testing the Methodology for Reliability

Results were analyzed to determine whether the numbers produced by the methodology are reliable (consistent) across different evaluators. To produce statistical tests of reliability, we computed the probability that responses from the four evaluators for each tool would match by chance. The observed proportion of matches was then recorded for each category of items (menus, forms, and so on) and across the entire form, and compared with the chance probability using a binomial test. Because functionality and usability were measured on different scales, however, the chance probability of matching was not the same in the two cases.

Table 2. Types of interfaces the tools can produce: Mean summary results (%) of functionality and usability ratings

                            No. of   Demo             HyperCard        Prototyper
                            items    Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Interaction Styles:
  Menus                     14       82      78       84      82       66      93
  Forms                     12       73      76       90      97       54      93
  Typed input strings        4       19     100       69      68       19     100
  Windows                    3       59      92       75      81       83      91
Features of Interface       27       46      71       84      92       37      84
Hardware Devices:
  Input devices             17       18      89       20      99       12     100
  Output devices             7       71      78       64      85       53      92

Table 3. Types of support the tools provide: Mean summary results (%) of functionality and usability ratings

                                No. of   Demo             HyperCard        Prototyper
                                items    Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Rapid prototyping                9       75      72       86      83       58      89
Development methodology          6       58      76       84      87       50      75
Evaluation of target interface   7       39      74       61      60        0      n/a
Database management system       2        0      n/a      75      55        0      n/a
Interface libraries              1       50      50      100      84       50      50
Help for using tool              1      100      67      100     100       75      67
Documentation of tool itself     1      100      42      100      75      100      75
Context of definition            3       58      75       84      95       25      84
Automated project management     5        0      n/a      10     100        0      n/a


For functionality, successful matching for an item was defined either as all four of its evaluators agreeing that the item was possible with a given tool, or all four agreeing that the item was not possible with that tool. The probability of successful matching by chance is 1/8 (=.125). Even if there is no reliability, that is, if each evaluator decided on functionality by a coin flip, we would still expect about one out of every eight items on the form to produce a successful match. When the observed proportion of successful matches in a category substantially exceeds this rate, we conclude that evaluations are reliable. The level of confidence we have in this conclusion is indexed by the p-value of our result. The p-value is the probability of obtaining the observed number of successful matches by chance alone. The smaller the p-value, the more certain we are that evaluators agreed in their judgments of functionality. That is, we are more confident that functionality ratings are reliable (consistent). The p-values for functionality are shown in Table 5.

For usability, successful matching for an item was defined either as all four of its evaluators agreeing on the face selected, or three of the four evaluators agreeing, with the fourth evaluator disagreeing by only one face (e.g., frown, frown, frown, bland is a successful match, while frown, frown, frown, smile is not). This definition was chosen--rather than requiring all four to agree on the selected face--because it is inherently more difficult to get agreement among three choices (the three faces) than among the two choices (possible or not possible) available for functionality. The probability of successful matching by chance is 19/81 (=.2347). As with functionality, the observed proportion of successful matches was compared with this figure, and the probability computed of obtaining this many successful matches by chance alone. Table 6 shows these p-values for each category of items and across the entire form.
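The article does not give code for these computations; the sketch below is our own reconstruction. It enumerates the four evaluators' responses by brute force to confirm the 1/8 and 19/81 chance probabilities, then computes the one-sided binomial p-value for a hypothetical observed number of matches (the function names and the example count of 9 matches among 14 items are illustrative only).

```python
import math
from fractions import Fraction
from itertools import product

def chance_match_probability(levels: int, is_match) -> Fraction:
    """Probability that four independent, uniformly random responses
    (each one of `levels` choices) satisfy the given match criterion."""
    outcomes = list(product(range(levels), repeat=4))
    return Fraction(sum(bool(is_match(o)) for o in outcomes), len(outcomes))

def functionality_match(o) -> bool:
    # All four evaluators give the same possible / not-possible answer.
    return len(set(o)) == 1

def usability_match(o) -> bool:
    # All four faces agree, or three agree and the fourth is adjacent
    # (faces coded 0 = frown, 1 = bland, 2 = smile).
    values = set(o)
    if len(values) == 1:
        return True
    if len(values) == 2:
        a, b = values
        return sorted(o.count(v) for v in values) == [1, 3] and abs(a - b) == 1
    return False

print(chance_match_probability(2, functionality_match))  # 1/8
print(chance_match_probability(3, usability_match))      # 19/81

def binomial_p_value(observed: int, n: int, p: float) -> float:
    """P(at least `observed` matches among n items) under chance alone."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(observed, n + 1))

# Hypothetical example: 9 observed matches among 14 menu items.
print(round(binomial_p_value(9, 14, 1 / 8), 4))
```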

Usability was examined only for items that all four evaluators agreed were possible (i.e., perfect agreement--successful matching--on functionality). As a result, in most categories the number of items available for reliability testing of usability was much smaller than for functionality.

Table 4. Specification/implementation techniques used by the tools: Mean summary results (%)

                                   Demo    HyperCard   Prototyper
Coded in Textual Language:
  Programming language             --       8           1
  Specialized dialogue language     5       1          --
  BNF                              --      --          --
  Regular expressions              --      --          --
Direct Manipulation:
  Object manipulation              17      80          78
  Graphical programming language   --      --           3
Miscellaneous:
  Tabular manipulation             32      11           2
  Rule-based transitions           28      --          --
  Form-filling                     18      --          14
  Other (Specify: plug-in)         --      --           2

Table 5. p-values for functionality for each category of the form and across the entire form

                          No. of items   Demo     HyperCard   Prototyper
Interaction Styles:
  Menus                    14            .0007    .0007       .0230
  Forms                    12            .0113    .0000       .0528
  Typed input strings       4            .0789    1.0000      .0071
  Windows                   3            .0430    1.0000      1.0000
Features of Interface      27            .0042    .0000       .0042
Hardware Devices:
  Input devices            17            .0000    .0000       .0000
  Output devices            7            .0000    .0463       .0062
Types of Support           38            .0000    .0000       .0000
General Characteristics     5            .4871    .1207       .0161
Across Entire Form        127            .0000    .0000       .0000


Some categories dropped out entirely (indicated by a dash in Table 6), since they contained no items that all four evaluators agreed were possible. The figures in parentheses in Table 6 are the numbers of items on which each p-value is based.

Discussion of Results

Evaluating the Tools

At the highest level, the ANOVA showed that HyperCard has a significantly higher functionality for types of interfaces a tool can produce, and Demo has about the same functionality as Prototyper (line 1 of Table 1, and Table 2). Given our a priori knowledge and expectations, these results are not surprising, since HyperCard is a more general-purpose tool. The ANOVA showed that the only significant difference in usability ratings for the types of interfaces a tool can produce is between Prototyper and Demo (line 1 of Table 1, and Table 2). Prototyper is indeed more usable than Demo, but the difference in usability between HyperCard and either of the other two tools is not significant. This is again reasonable: Prototyper runs on a Macintosh and might therefore be expected to be easier to use. Although HyperCard also runs on a Macintosh, its greater functionality may actually interfere with its usability.

The ANOVA showed all three tools are indeed significantly different in their functionality for types of support. HyperCard, as we expected because of its generality, provided the most types of support (line 2 of Table 1, and Table 3), followed by Demo and then Prototyper. The ANOVA indicated that usability for these types of support is significantly different only between the highest usability (HyperCard) and the lowest (Demo). Prototyper, at the in-between level, is not significantly different in usability for types of support from either of the other two.

While results (Table 1, line 3) indicate that HyperCard has more general tool characteristics than Prototyper, which has more than Demo, the ANOVA showed that the three tools in fact do not differ significantly on either functionality or usability of general characteristics. Direct manipulation was the primary specification/implementation technique (shown in Table 4) for HyperCard and Prototyper, and was undoubtedly reflected in their overall high usability ratings. A more complete comparison of these tools, for example, to choose one for a development environment (see the following section "Example: Use of Results"), would necessitate a look into details of individual items on the evaluation form, from which these overall results were calculated.

Table 6. p-values, and the numbers of items on which they are based, for usability for each category of the form and across the entire form

                          Demo          HyperCard      Prototyper
Interaction Styles:
  Menus                   .7989 (6)     .0097 (7)      .0129 (3)
  Forms                   .2360 (4)     .0000 (9)      .0129 (3)
  Typed input strings     --    (0)     --    (0)      --    (0)
  Windows                 .2346 (1)     --    (0)      --    (0)
Features of Interface     .5515 (3)     .0000 (14)     .2346 (1)
Hardware Devices:
  Input devices           .1393 (3)     .0550 (2)      .2346 (1)
  Output devices          .0123 (5)     .0550 (2)      .0550 (2)
Types of Support          .5158 (7)     .0002 (12)     .0030 (4)
General Characteristics   --    (0)     .4141 (2)      .2346 (1)
Across Entire Form        .0030 (29)    .0000 (48)     .0000 (15)

Testing the Methodology for Reliability

Using the standard criterion of p < .05, most of the p-values for functionality shown in Table 5 are quite satisfactory, indicating that the form is generally reliable for determining functionality of a tool. Poor results (e.g., typed input strings, windows, and general characteristics) were frequently found in those categories that have only a few items (4, 3, and 5 items, respectively), suggesting a need to reassess the measurement of functionality and the grouping of items in those categories. Across the entire form, reliability is highly significant for functionality for all three tools.

For usability, the p-values shown in Table 6 are adequate in some categories, but unsatisfactory in others. These results suggest that usability ratings may not be as reliable as functionality ratings. This was expected for at least two reasons: First, usability is inherently more subjective than functionality, and therefore more likely to be inconsistent across evaluators. Second, because usability was examined only for those items that all evaluators agreed were possible, the number of items in most categories was smaller than the number of items examined for functionality. The smaller number of items in many categories adversely affects the power of statistical tests, and substantially accounts for the larger p-values. These results suggest a need to reexamine groupings of items into categories, avoiding categories with very few items. They may also reflect overly subjective definitions for the three usability faces, and the desirability of making these definitions less open to individual interpretation by evaluators. Nevertheless, by combining all items across the entire form, and thereby employing a large sample size in our test, usability reliability is highly significant for all three tools.

Qualitatively, participants said in post-evaluation interviews that they felt the results fairly represented tool capabilities. They felt that, rather than producing ad hoc evaluations, the form provides a structured, consistent instrument for evaluating and comparing tools, as well as for presenting the results of those evaluations. These comments indicated participants' positive feelings toward the methodology, despite the fairly lengthy time--at least 20 hours--each of them had volunteered.

Example: Use of Results

To show how results of this methodology might be used to aid in selection of an interface development tool for a particular environment, we will present a purely hypothetical, simple example that compares the results of the three tool evaluations obtained from the empirical study. Requirements for selecting a tool for our hypothetical environment are arbitrarily set as follows:

• Tool must be capable of producing a large variety of menus and forms;

• Tool must be easy to use for the types of interfaces it can produce; and

• Tool must mainly use object manipulation for producing interfaces.

First, these requirements must be quantified; in this example we will arbitrarily choose the following ratings to satisfy, respectively, each of the three requirements:

• Minimum functionality rating of 70% for both menus and forms;

• Minimum overall usability rating of 60%; and

• At least two-thirds of the tool's functions must be produced using object manipulation.

Comparing functionality ratings. From Table 2, Prototyper falls below the 70% functionality required for both menus and forms; HyperCard and Demo satisfy it. So Prototyper is eliminated.

Comparing usability ratings. From Table 2, HyperCard has higher usability ratings in most categories, but both HyperCard and Demo meet the 60% minimum usability requirement for all seven categories. Interestingly, note that Prototyper has higher usability ratings than either HyperCard or Demo for several categories. Thus if the tool selection requirements had been different (e.g., favoring usability over functionality), Prototyper might not have been eliminated so quickly.

Comparing specification/implementation techniques. Table 4 shows that Demo uses object manipulation only 17% of the time, while HyperCard uses it 80% of the time. Our selection requirement states that an object manipulation rating of at least 66% is the minimum acceptable, so Demo is eliminated. (Again, Prototyper would meet this requirement had we not already eliminated it.)

Final decision. HyperCard has higher functionality and usability ratings than Demo, and meets the acceptable limit of required object manipulation. Thus we would recommend HyperCard as the appropriate tool, given our hypothetical selection requirements.
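This screening is mechanical enough to script directly against the summary tables. The following sketch is a hypothetical illustration using the Table 2 and Table 4 figures quoted above; the data layout, the "min_usability" summary (the smallest of the seven types-of-interfaces usability ratings), and the threshold encoding are ours, not part of the methodology.

```python
# Hypothetical tool-selection screen over the published summary ratings.
# Each entry: menus/forms functionality, minimum usability across the seven
# "types of interfaces" categories, and % object manipulation (Tables 2 and 4).
tools = {
    "Demo":       {"menus": 82, "forms": 73, "min_usability": 71, "obj_manip": 17},
    "HyperCard":  {"menus": 84, "forms": 90, "min_usability": 68, "obj_manip": 80},
    "Prototyper": {"menus": 66, "forms": 54, "min_usability": 84, "obj_manip": 78},
}

def meets_requirements(r: dict) -> bool:
    return (r["menus"] >= 70 and r["forms"] >= 70      # functionality threshold
            and r["min_usability"] >= 60                # usability threshold
            and r["obj_manip"] >= 100 * 2 / 3)          # object-manipulation threshold

print([name for name, r in tools.items() if meets_requirements(r)])  # ['HyperCard']
```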

Conclusions and Future Research

Results across different participants (evaluators) in our experiment were substantially more similar than we expected. Since this was the first time the evaluation methodology had been formally used, we anticipated the possibility of widely varying results. The empirical testing indicates, however, that our methodology provides reliable (consistent) results across different evaluators; thus different evaluators should reach the same conclusions about tools evaluated and compared with this methodology.

Examination of details across different participants revealed several instances in which one participant marked a function (e.g., typed input strings) as "not possible" while another marked it otherwise. Investigation showed this was due, as expected, to two common reasons:

1. Differences in interpretation of glossary definitions, and

2. Differences in expertise level of different evaluators using the same tool.

Interestingly, most discrepancies were found either in rather obscure, poorly understood items or in items not explicitly supported by a tool; for example, producing "typed input strings" using Prototyper and Demo. When they were not explicitly possible, some participants used intuition and creativity to produce typed input strings with a tool. These two differences suggest that detail in the glossary could be clarified, and formal evaluator training should be explored.

In general, we believe that this methodology should be used only by an evaluator who has thoroughly learned the tool(s) to be evaluated and is familiar with the methodology and form. This is to be expected: given the complexity of the kinds of tools we are attempting to evaluate, a methodology for evaluation is itself lengthy and complex.

The methodology's focus on quantitative results can be misconstrued. Overall (summary) ratings give only macro results; an evaluator must go to the detailed sections of the form to determine specific information about a tool. The numeric ratings provide guidelines for evaluating and comparing tools and aid in decision making; they do not mandate a particular decision.

Finally, there are great advantages to this kind of checklist evaluation methodology. It encourages its users (the tool evaluators) to broaden their thinking about tool evaluation, by presenting them with a structured, wide variety of possible choices, many of which they might not otherwise think of. This very structure can be limiting, however, by actually narrowing the set of questions asked about a particular tool. As an initial response to this, we have made the form extensible by allowing evaluators to add items as appropriate (in Figure 2, note spaces for "Other"). We are also investigating adding an initial phase in which evaluators must determine, in some structured fashion, a desired, context-sensitive set of functions for a tool within their particular environment, before proceeding with evaluation of tools using our methodology.

Several other open issues are also being pursued. For categories with low reliability, we are investigating how to improve consistency. Determining validity (i.e., whether the methodology measures what we intend it to measure) of the methodology is also being investigated; this study addressed only reliability. Validity is even more difficult to establish because of a lack of expert comparisons for cross-validation purposes. Without a comparable technique for evaluating human-computer interface development tools, which currently does not exist, validity cannot be effectively assessed.

Pragmatic Use of the Tool Evaluation Methodology

Importance of the Methodology

Roberts and Moran [6] make an excellent distinction between importance and reliability within the context of quantitative scores such as those produced by this tool evaluation methodology. They state that any difference in size across scores can be made reliable by conducting enough extensive, expensive studies. The basic issue, however, is not so much reliability as importance. Importance represents a substantive, rather than a statistical, difference among ratings. Thus, in practical use, small differences that may be unreliable are not critical for producing useful ratings. The usefulness of a relatively inexpensive methodology, such as the one described in this article, is that it reveals potentially large, and therefore important, differences among tools. When this methodology identifies a potentially important difference across tool categories, for example, evaluators can then determine just how reliable that difference needs to be. We found that, as a rule of thumb, a difference of about 10 percentage points in ratings across tools was useful for distinguishing a rather different "look and feel" for both functionality and usability of each tool.

Real World Use of the Methodology

The experimental study described above gave us statistical evidence that this tool evaluation methodology is useful and reliable. Perhaps even more important than our laboratory results is the number of development groups within major organizations that have expressed interest in our methodology and who, in fact, have used or are considering using it. More than 25 organizations have requested the evaluation form, including GTE Data Systems [1], McDonnell Douglas [7], Battelle Pacific Northwest Labs, Bell Northern Research, Data General, Digital Equipment Corporation, Hewlett-Packard, IBM, Jet Propulsion Lab, NASA, and the Software Engineering Institute. The methodology is being used in various ways, and groups are finding it easy to adapt to their particular needs. For example, one group used it as a high-level checklist to compare more than a dozen interface development tools. The matrix provided a useful taxonomy for making initial qualitative comparisons across several tools. They performed the calculations when there was a desire to have a more detailed, quantitative comparison across several tools, as indicated by the qualitative, intuitive results. One group adapted a subset of the form to a specific domain of tools that they were evaluating (e.g., forms management systems). The response of these groups, along with the requests we get (on average, about once a week) for information on the methodology, encourages us by letting us know that there is a great need for such a methodology, and that our approach, while still young, shows real promise.¹

¹ Information about obtaining the evaluation form and instructions for its use is available by contacting Hix. The Excel spreadsheet can also be obtained at no cost by sending a 3-1/2 inch floppy disk to her. We welcome inquiries from anyone interested in using and critiquing this evaluation methodology.

And Did We Meet Our Goals?

Six goals for a tool evaluation methodology were set forth at the beginning of this article:

1. use of a standardized approach to produce reliable, quantifiable results;

2. thoroughness, but not necessarily exhaustiveness, of a tool evaluation;

3. objectivity, rather than judgement, on the part of the tool evaluator;

4. ease-of-use, but not necessarily quick-to-use, for the tool evaluator;

5. adaptability to the development environment in which it is used; and

6. extensibility, to remain sensitive to advances in interface technology.

The first goal was met, as the results of the empirical study indicate. The other goals were assessed by questioning participants in that study, as well as members of the organizations mentioned in the previous section. All agreed that this evaluation methodology is thorough, providing a comprehensive, structured framework within which to evaluate and compare interface development tools. All concurred that the evaluation, because of the highly structured matrix and the carefully defined usability icons, removes much of the evaluator subjectivity. Users of this methodology stated that they generally find it easy, and even fun, to use. Most users stated that the form itself does not take long to complete (one to two hours per tool); however, the methodology is not quick because the evaluator must learn each interface development tool being evaluated.

Adaptability to specific environments can be accomplished by subsetting the form. By removing uninteresting or inappropriate items across all evaluations, consistent ratings will be retained across all tools being evaluated. The form is easily extensible through use of the "Other" category provided throughout. Because each section of the form establishes a taxonomy, new items can be easily added within that framework, keeping the methodology sensitive to rapidly occurring advances in interface technology. As new items are added to the form, tools that have already been evaluated can be reevaluated with the new items. Thus a tool's ratings could increase or decrease over time, while always remaining relative to the most current taxonomy against which it has been evaluated.

Summary

This evaluation methodology is, we believe, the first attempt at developing and empirically validating a standardized technique for evaluating and comparing human-computer interface development tools. Such an evaluation methodology for interface development tools provides both theoretical and practical contributions to human-computer interaction research, including the following:

• Development of a method for systematically and consistently evaluating all aspects of an interface development tool;

• Instantiation of the concept of quantitative functionality and usability ratings for a tool;

• Development of a taxonomy of types of interfaces that can be produced with a tool; and

• Identification of specification/implementation techniques used by a tool.

This research provides a communication mechanism for tool researchers, tool practitioners, and tool users for making coherent critiques of their own and other tools. Our goal is that this work will result in a rigorous, trusted methodology for evaluating human-computer interface development tools.

Acknowledgments

The authors would especially like to thank graduate student Kay Tan, who conducted the reliability study, and the participants who volunteered their time for this study. Appreciation also goes to H. Rex Hartson, Antonio C. Siochi, Matt Humphrey, and Eric Smith, who helped in the early stages of developing the evaluation form, and to Bruce Koons, who gave valuable expert advice on the experimental study. This research was funded by the Software Productivity Consortium and the Virginia Center for Innovative Technology.

References

1. Arble, F. Private communication, 1988.

2. Borenstein, N.S. The evaluation of text editors: A critical review of the Roberts and Moran methodology based on new experiments. In Proceedings of CHI '85 Human Factors in Computing Systems (Apr., Boston). ACM, New York, 1985, pp. 99-105.

3. Cohill, A.M., Gilfoil, D.M., and Pilitsis, J.V. Measuring the utility of application software. In H.R. Hartson and D. Hix, Eds., Advances in Human-Computer Interaction, Vol. 2. Ablex Publishing Corp., Norwood, N.J., 1988.

4. Hartson, H.R., and Hix, D. Human-computer interface development: Concepts and systems for its management. ACM Comput. Surv. 21, 1 (Mar. 1989), 5-93.

5. Hix, D. Evaluation of human-computer interface development tools: Problems and promises. Presented at EFISS (Atlanta, Ga., Oct. 1988).

6. Roberts, T.L., and Moran, T.P. The evaluation of text editors: Methodology and empirical results. Commun. ACM 26, 4 (Apr. 1983), 265-283.

7. Totten, S. Private communication, 1989.

CR Categories and Subject Descriptors: D.2.2 [Software Engineering]: Tools and Techniques; H.1.2 [Models and Principles]: User/Machine Systems

General Terms: Experimentation, Management, Measurement

Additional Key Words and Phrases: Measurement techniques, methodology, user interfaces

Deborah Hix is a research computer scientist on the faculty at Virginia Polytechnic Institute and State University in Blacksburg, Virginia. She is a principal investigator on the Dialogue Management Project. This project is concerned with achieving quality human-computer interfaces, through development of concepts for human-computer interface management and through development of specialized methodologies, techniques, and tools for producing the human-computer interface.

Author's present address: Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061. Email for Deborah Hix: [email protected].

Robert S. Schulman is an associate professor of statistics at Virginia Polytechnic Institute and State University in Blacksburg, Virginia. He performs statistical consulting in a wide variety of fields. His specialty areas are psychometrics, test theory, and applications of statistics to the social sciences. Author's present address: Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061.

Unix is a trademark of AT&T Bell Laboratories. Demo is a product of Software Garden, Inc. HyperCard is a product of Apple Computer, Inc. Prototyper is a product of SmethersBarnes.

© ACM 0002-0782/91/0300-074 $1.50
