
COMPUTING PRACTICES

Human-Computer Interface Development Tools: A Methodology for Their Evaluation

Deborah Hix and Robert S. Schulman

A comprehensive checklist-based methodology produces quantifiable criteria for evaluating and comparing human-computer interface development tools along two dimensions: functionality and usability. An empirical evaluation shows that the methodology, which is in use in several corporate interface development environments, produces reliable (consistent) results.

Motivation for a Tool Evaluation Methodology

Tools for human-computer interface development--often called user interface management systems (UIMSs) or user interface development environments--conjure up visions of icons and images, windows and words, mice and other media that comprise the human-computer interface of an interactive system. Human-computer interface development tools are themselves interactive systems that support production and execution of the human-computer interface. Despite their recent proliferation, there are no procedures for systematically evaluating or comparing such tools. The state of the art in their evaluation is based on subjective opinion--"warm fuzzy" feelings.

Goals of a Tool Evaluation Methodology

We have developed and empirically validated a methodology for evaluating and comparing human-computer interface development tools. Goals of such a methodology include the following:

• use of a standardized approach to produce reliable, quantifiable results;

• thoroughness, but not necessarily exhaustiveness, of a tool evaluation;

• objectivity, rather than judgement, on the part of the tool evaluator;

• ease-of-use, but not necessarily quick-to-use, for the tool evaluator;

• adaptability to the development environment in which it is used; and

• extensibility, to remain sensitive to advances in interface technology.

This evaluation methodology focuses on technical issues. Although economic issues cannot be ignored in a business environment, this methodology does not address such issues, nor does it make recommendations. Rather, it presents data that management can use in making decisions. Data produced by this methodology might be used, for example, to help determine the choice of a tool for a particular interface development environment. It is also important to remember that this methodology is used to evaluate human-computer interface development tools, and not their end-user interface products; the latter are the target application interfaces that these tools are used to produce.

The research presented in this article is groundbreaking, representing one of the first attempts to produce a structured, quantitative approach to the evaluation of interface development tools. Our methodology is in use at a number of corporate sites (see the section "Real World Use of the Methodology").

Related Work

While very little structured evaluation of human-computer interface development tools has been done, precedent has been established for using the type of approach on which our tool evaluation methodology is based. Roberts and Moran [6], for example, produced a methodology for evaluation of text editors. The basis of their approach was classification of potential editing tasks (to provide a common basis for comparison) and evaluation, both quantitative and qualitative, along several dimensions, including time to perform tasks, error costs, learning time, and functionality. A replication study of their work [2] produced recommendations for modifications. The evaluation procedure presented in this article is similar to the Roberts and Moran approach, but differs in context, content, style, and results.

Cohill, Gilfoil, and Pilitsis [3] developed a methodology that was used within AT&T for evaluating software packages, particularly commercial systems. This methodology was based on criteria such as performance, documentation, and support. Their approach, which has been used to compare at least five Unix-based word-processing packages, resulted in the selection of a system for an engineering group. Such approaches provided ideas that we used in developing our tool evaluation methodology.

Overview of This Article

This article will discuss some of the difficulties of developing an evaluation methodology for interface development tools. It will describe in detail the tool evaluation methodology we have produced and used. Results of an empirical validation study will show that the methodology produces reliable data. Use of the methodology in several real-world human-computer interface development environments will also be presented.

Difficulties in Developing a Tool Evaluation Methodology

A surprising number of problems arise when one attempts to develop a formal, structured evaluation methodology for interface development tools [5]. At the core of the difficulty of developing such a methodology is the fundamental problem of defining the human-computer interface itself. Behavioral scientists often view the interface as having cognitive and/or gestural aspects, while computer scientists typically think of it as just software and hardware. Closely connected to the interface definition is the problem of defining an interface development tool. For example, terms include user interface development tools or toolkits or environments, and probably the most prevalent, user interface management systems or UIMSs [4]. An interface development tool can be anything from a single interface routine to a complete interface development environment. A toolkit is a library of routines that can be called to implement low-level interface features, particularly to incorporate various interaction techniques--for example, menus, scroll bars, buttons, and their associated physical devices, such as mice, joysticks, and trackballs--into an interface. A toolkit is typically used by a programmer who writes source code to call the desired routines. Toolkits provide little or no support for a nonprogramming interface designer. A UIMS is an integrated interactive environment, often based on a broad set of tools, for designing, prototyping, executing, evaluating, modifying, and maintaining interfaces. In particular, a UIMS supports the development, through all phases of the life cycle, of a user interface by an interface developer who may not be a programmer. This is commonly done through the use of graphical representation techniques for designing and prototyping an interface, as well as run-time execution support so that an interface can be evaluated and iteratively refined. By relieving an interface developer of much of the tedium (i.e., coding) of producing an interface, a UIMS lets the developer concentrate on the design of the interface itself, rather than on its implementation. Many interactive software systems call themselves UIMSs when they are really no more than toolkits.

Other problems associated with producing a methodology for evaluation of interface development tools include the following:

• Determination and measurement of salient dimensions for evaluation of interface development tools: Possibilities for dimensions include tool functionality, usability, learnability, user (of the tool) performance, tool performance, and documentation. Measurement of selected dimensions is difficult, generally because an attempt is being made to quantify what is largely qualitative information.

• Variety of specification and implementation techniques used in tools: These techniques can greatly color a tool user's impression of the tool, including its usability and learnability.

• Availability of tool software and hardware for hands-on use by an evaluator: Data collection from sources such as marketing glossies, videotapes, documentation, and even live demonstrations is unsatisfactory and unverifiable. In particular, one system we examined looked good from glossies and a videotape, but turned out to have limited functionality in operation. Of course a "catch-22" can arise: a tool cannot be adequately evaluated without acquiring it, but one may not initially know which tool to acquire. Costs for tools vary widely and can be expensive--in the thousands of dollars.

• Selection of an appropriate tool evaluator: This person should be someone who is familiar with several such tools. An evaluator must exhibit--possibly through performance of baseline tasks using the tool(s)--a minimal measurable level of expertise with each tool.

• Cost of the evaluation process, involving an appropriately experienced person for a substantial time.

• Rapid technological advances in the human-computer interface area: These can render a tool evaluation methodology worthless even before it has been shown to be useful.

• Difficulty of evaluating an evaluation methodology itself: Virtually no research exists in formal evaluation of interface development tools. As methodologies are produced, they must be shown to produce reliable (consistent) and valid (measuring what we intend them to measure) results. Statistically determining reliability and validity can be a costly, complicated process involving longitudinal use of the methodology with a variety of tools and evaluators.

Description of the Tool Evaluation Methodology

Overview of the Evaluation Process

Figure 1 shows a high-level block diagram of the steps used in the tool evaluation methodology, which revolves around hands-on use of the tool being evaluated. After learning how the tool works, an evaluator completes a detailed 28-page "checklist" form that is a matrix organized around two dimensions:

1. Functionality of the tool, and
2. Usability of the tool.

Functionality indicates what the tool can do; that is, what interface styles, techniques, and features can be produced by the tool for a target application interface. Usability indicates how well the tool performs its possible functions, in terms of ease-of-use (a subjective, but quantitative, rating of how easy or difficult the tool is to use) and human performance (an objective rating of how efficiently the tool can be used to perform tasks).

After completing the detailed portion of the form, the evaluator calculates overall functionality and usability ratings (using a Microsoft Excel spreadsheet that replicates the layout of each page of the form) to produce numeric results for an executive summary. If desired, the evaluator may perform benchmark interface development tasks, completing the evaluation.

Functionality Dimension

The functionality dimension of a tool's evaluation is organized into three sections. Major categories and some subcategories from each section are shown in Figure 2. Each of these is further divided into appropriate items (e.g., 14 types of menus, 27 features of interfaces, 17 hardware input devices, 9 rapid prototyping subcategories, 7 evaluation subcategories, and so on).

Usability Dimension

The usability dimension of a tool's evaluation is measured by two methods:

• Subjective evaluation measures ease-of-use for each of the three sections of the functionality dimension. For each function the tool can produce (and only those), the evaluator indicates on the form whether the tool was difficult (indicated by a frowning face), adequate (bland face), or easy (smiling face) to use to produce that function. Individual evaluator interpretation was reduced by defining each face usability icon, as shown in Figure 3. These definitions came from extensive interviews with five expert tool users.

• Objective evaluation measures human performance using the tool to complete suggested benchmark interface tasks that can be customized to a specific working environment. (This is traditional benchmark testing, using representative tasks, and will not be addressed here.)

Figure 1. Steps in using the tool evaluation methodology

Specification/Implementation Techniques

For those functions of interfaces that the tool can produce (and only those), the evaluator also indicates the primary type of specification/implementation technique the tool uses to produce individual functions.

Figure 2. Major categories and some subcategories of the three sections of the functionality dimension

Section A. Types of Interfaces Produced with this Tool
  Interaction styles: menus, forms, typed string inputs, windows
  Features of interfaces: animation, adaptive dialogue, types of navigation, default input values, graphics
  Hardware devices: input devices, output devices

Section B. Types of Support
  Rapid prototyping
  Development methodology
  Constructional model of human-computer interaction
  Evaluation of target interface
  Database management system
  Interface libraries
  Help for using tool
  Documentation of tool itself
  Automated project management

Section C. General Characteristics
  Integration across tool
  Reusability of tool outputs
  Extensibility of tool
  Modifiability of tool


Tools incorporate very different techniques for specifying and implementing a user interface, ranging from programming paradigms to graphical metaphors. The type of specification/implementation technique can strongly influence an interface developer's use of a tool, and may be a key feature in selecting a tool for an interface development team, based on the skills of those on the team. A tool in which 80% of an interface is implemented by textual programming would, for example, differ markedly from a tool in which 80% of an interface is implemented using graphical, direct manipulation techniques. Techniques on the evaluation form are chosen from those shown in Figure 4.

All terms in Figures 2, 3, and 4, as well as all other terms in the evaluation form, are defined in a Glossary that accompanies the form.

Examples of Use of the Form

To give an idea of the appearance, use, and results of the form, some pages will be shown as examples, including a detailed page from the form (Figure 5) and a page from the executive summary (Figure 6). These are actual results from an empirical study of evaluating a tool called Bricklin's Demo™, described later in the section "Empirical Validation of the Evaluation Methodology." Figure 5 shows the completed page from the "types of interfaces" section, the page for evaluating forms. (No claims are made about whether the checks shown in this figure are right or wrong; they simply show the exact marks made by one evaluator in our study.) Note the matrix positioning of usability versus functionality.

To use the form, an evaluator first looks down the list of functionality items in the left column, and puts a check in the corresponding "Not possible with this tool" column for any item not handled by this tool. Next, for any item without a check under "Not possible with this tool," the evaluator checks the appropriate box (one of the three faces) to indicate the tool's ease-of-use for that item, as defined in Figure 3. Then, for all items for which a usability rating was just checked, the evaluator checks the appropriate box to indicate the specification/implementation technique(s) used by the tool. Note the spaces on the evaluation matrix labeled "Other, Specify:" (e.g., at the bottom of the "forms" page, and under specification/implementation techniques across the top of the page). Here the evaluator can add items that are appropriate to a tool but are not explicitly included on the form.
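Conceptually, the marks an evaluator records on one page reduce to a small amount of data per row. The sketch below is a minimal, hypothetical encoding of a completed page; the class name, field names, and the example check marks are ours for illustration, not part of the form or of the authors' spreadsheet.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormItem:
    """One functionality row of the evaluation form (hypothetical encoding)."""
    name: str
    ease_of_use: Optional[int]            # None = "Not possible with this tool";
                                          # 1 = frowning, 2 = bland, 3 = smiling face
    techniques: List[str] = field(default_factory=list)

# A fragment of the "Forms" page, with illustrative (not actual) marks.
forms_page = [
    FormItem("Typed (non-enumerated) input", 3, ["object manipulation"]),
    FormItem("Formatted fields", 2, ["tabular manipulation"]),
    FormItem("Toggled (enumerated) input", None),
    FormItem("Default field values", 3, ["form-filling"]),
    # ...the remaining rows of the page would follow the same pattern
]
```

From such a record, the functionality and usability ratings described next are simple aggregations.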

Calculating Results from the Evaluation Methodology

Numeric results computed for each tool evaluated by using this methodology include the following:

• Functionality rating--an indicator of the number of interface functions the tool supports; calculated as the percentage of the total number of functions on the evaluation form that are possible with this tool;

• Usability rating--an indicator of the ease with which functions can be produced with the tool; calculated by considering only those functions possible with this tool, this rating is the total rating earned, expressed as a percentage of the highest possible usability rating;

• Specification/implementation technique rating--an indicator of the degree to which a technique is used in the tool for producing possible functions; calculated as the percentage of use of a specific technique relative to all possible techniques.

The functionality rating is expressed as a percentage of the number of functions possible with this tool with respect to the total number of functions on the evaluation form. For example, a functionality rating of 80% for a category, such as "menus," indicates that a tool supports 80% of the total functions listed under "menus" on the form. The functionality rating for any category is equal to the number of functions supported by the tool (i.e., those items not checked as "Not possible with this tool") divided by the total number of functions in that category, expressed as a percentage (i.e., multiplied by 100). For the page in Figure 5, there are 12 (non-"other") items listed under "forms"; this evaluator checked 2 of them as "Not possible with this tool." Thus 10 of the 12 functions are supported by this tool, so the functionality rating for "forms" is 10 divided by 12, or 83%, as shown in Figure 6.
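As a minimal sketch of this arithmetic (assuming a category's rows have been reduced to a list of booleans marking which items are possible--a representation of ours, not the form's), the rating is a single ratio:

```python
def functionality_rating(possible_flags: list) -> float:
    """Percentage of a category's items that are possible with the tool.

    possible_flags holds one boolean per (non-"other") item in the category:
    True if the item is NOT checked "Not possible with this tool".
    """
    return 100.0 * sum(possible_flags) / len(possible_flags)

# The "forms" page of Figure 5: 12 items, 2 marked "Not possible with this tool".
forms_possible = [True] * 10 + [False] * 2
print(round(functionality_rating(forms_possible)))  # 10 / 12 -> 83
```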

The usability rating quantifies the evaluator's qualitative judgment of the ease of using the tool to accomplish particular tasks, such as producing a menu. The usability rating is expressed as the percentage of the ease-of-use rating of all functions possible with this tool with respect to the maximum ease-of-use rating for those same possible functions. For an individual function, a check under the frowning face is assigned an ease-of-use rating of 1, while checks under the bland and smiling faces are assigned ratings of 2 and 3, respectively. The denominator for the usability rating is the highest possible ease-of-use rating, determined by multiplying the number of rows for which any face icon is checked (i.e., all rows that do not have a check under "Not possible with this tool") by 3, the highest possible ease-of-use rating. In Figure 5, this rating is 30 (i.e., 10 rows times 3). The numerator is determined by multiplying the total number of checks in each face icon column by its corresponding ease-of-use rating. From the example, there are 1, 3, and 6 checks for the frowning, bland, and smiling faces, respectively, so the numerator is 1 times 1, plus 3 times 2, plus 6 times 3, or 25. Thus the usability rating for the "forms" category is 25 divided by 30, or 83%, as shown in Figure 6. (This rating coincidentally equals the "forms" functionality rating, also 83%.)
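The corresponding usability computation, again under our own assumed representation of the face checks as a list of 1/2/3 scores covering only the possible items:

```python
def usability_rating(face_scores: list) -> float:
    """Percentage of the maximum ease-of-use score for the possible items.

    face_scores holds one entry per item the tool can produce:
    1 = frowning face, 2 = bland face, 3 = smiling face.
    """
    return 100.0 * sum(face_scores) / (3 * len(face_scores))

# "Forms" page of Figure 5: 1 frowning, 3 bland, and 6 smiling checks.
forms_faces = [1] + [2] * 3 + [3] * 6
print(round(usability_rating(forms_faces)))  # 25 / 30 -> 83
```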


Figure 3. Definitions for ease-of-use ratings of the usability dimension

Figure 4. Types of specification/implementation techniques

Figure 5. A page from the evaluation form, as completed by an evaluator participant for Demo. [The page crosses the "Forms" functionality items--typed (non-enumerated) input, formatted fields, toggled (enumerated) input, default field values; navigation within form by arrow keys, picking with mouse, tab key, space bar, bidirectional, wrap-around; single-page forms, multi-page forms, and "Other (Specify:)"--with columns for ease of using the tool to produce each function and for the specification/implementation technique used (coded in textual language, direct manipulation, miscellaneous), plus a TOTALS row. The evaluator's individual check marks are not reproduced here.]


Figure 6. Actual results from the executive summary for Demo: functionality and usability ratings (%)

The specification/implementation technique rating may be useful in indicating the specification/implementation paradigm used by a tool, and may actually influence its usability rating. This rating is calculated by summing the checks for each column in the specification/implementation columns, and recording these numbers in the corresponding "TOTALS" row for each page. Next, all these totals are summed across pages, producing a row of grand totals. The percentage use of a technique (an individual column) is the grand total of that column divided by the sum of all grand totals. Thus from our example (most of the data are not given here), the sum of all grand totals is 46 (for object manipulation) plus 28 (for rule-based transitions), or 74. So the percentage use of object manipulation is 46 divided by 74, or 62%, and the percentage use of rule-based transitions is 28 divided by 74, or 38%.
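A sketch of this last calculation, assuming the per-technique grand totals have already been summed across pages into a dictionary (the variable and function names are ours, for illustration):

```python
def technique_ratings(grand_totals: dict) -> dict:
    """Percentage use of each specification/implementation technique,
    relative to the sum of all techniques' grand totals."""
    overall = sum(grand_totals.values())
    return {name: 100.0 * total / overall for name, total in grand_totals.items()}

# Grand totals from the worked example in the text.
totals = {"object manipulation": 46, "rule-based transitions": 28}
print(technique_ratings(totals))
# {'object manipulation': 62.16..., 'rule-based transitions': 37.83...}
```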

A complete evaluation report contains several parts:

• General description of the tool being evaluated;

• Information about sources used in preparing the evaluation;

• Executive summary of overall functionality and usability ratings for all three sections;

• Detailed evaluation of functionality and usability dimensions for all three sections; and

• Glossary defining every item in the form.

Interpretation of Results

Now that we know how to compute these numbers, let us look at their interpretation. For example, there are 14 kinds of menus on the evaluation form. According to this evaluator (Figure 6), Demo can produce 100% of them, and can produce 83% of the 12 types of forms. Thus Demo supports a large variety of types of menus and forms. As discussed previously, usability is evaluated only on functions that are possible (e.g., the 83% of types of forms). Usability ratings of 91% and 83% mean Demo is easy to use to produce menus and forms, respectively. A good variety of the features of interfaces (81%) are supported by Demo, and are fairly easy to produce (usability of 72%). Demo can produce interfaces with only a few input devices (functionality of 18%), while output devices are much better supported (71%). Both input and output devices are easy to incorporate into interfaces produced using Demo (usability of 100% and 80%, respectively). (It is important to remember that all usability ratings refer to ease of using the tool to produce an application interface, and are not in any way related to ease-of-use of that interface.)

Empirical Validation of the Evaluation Methodology

Participants, Materials, and Procedures

These evaluation ratings remain mere numbers unless we can show they are reliable. To determine reliability we must ask whether different evaluators will produce similar, consistent results for the same tool. To test the reliability (consistency) of this evaluation methodology across different evaluators, we conducted an experiment using six graduate students from Virginia Tech as research participants. These students were carefully chosen as representative subjects to be tool evaluators; all had a minimum of two years of system and interface development experience, either in industry or an academic setting, and all were very familiar with interface development tools, having used one or more such tools either to develop a real interface or just to learn more about them.

Three tools representing different application types were evaluated:

• Bricklin's Demo™ V1,
• Macintosh's HyperCard™ V1.2.1, and
• SmethersBarnes' Prototyper™ V1.

Demo runs on IBM hardware, while the other two run on the Macintosh. Demo produces sequential, nondirect-manipulation interfaces such as those typically found in IBM-based systems. Prototyper is used for developing Macintosh-style interfaces, and has a Mac-style interface. While HyperCard is not strictly an interface development tool, we included it in our evaluation because of its widespread use and because of its features that support interface development. Further, these three tools are widely available and known in the interface development world, so that intuitive assessment of evaluation results would be more feasible.


That is, we could determine whether the results look right, based on a priori knowledge of these tools.

These tools were specifically chosen because they are perceptibly different, and therefore would be expected to have different functionality and usability ratings. The goal of this experiment was to determine whether the evaluation procedure produces different ratings for obviously different tools. This is critical for an initial study; if validating the procedure for the first time by evaluating very similar tools produced no significant differences in ratings, then we would not know whether the apparent lack of differences was due to the evaluation procedure or whether there really were no differences among the evaluated tools.

Each participant evaluated two different tools (neither of which they had ever used before), with appropriate counterbalancing, so that each tool was evaluated by four different participants. While the results of evaluating the tools are inherently interesting, we were more concerned with evaluating the evaluation form and methodology--their usability, completeness, and understandability, for example.

The reliability experiment was conducted in three stages. During Stage 1, participants received software and manuals for the two tools they were asked to evaluate, and learned how to use each tool to a reasonable level of expertise, at their own pace, during a two-week period. During Stage 2, participants performed baseline tasks for the experimenter, to ensure that each participant attained at least a common baseline level of expertise with each tool, to reduce variance among participants in their knowledge of each tool. The tasks took about fifteen minutes to complete, and all participants satisfactorily performed the baseline tasks on their first attempt. During Stage 3, participants were given training on the use of the form and glossary by the experimenter. Participants then completed a form for each tool, at their own pace, using any documentation and the tool itself. Participants returned completed forms to the experimenter, who computed functionality ratings, usability ratings, and specification/implementation technique ratings using the electronic spreadsheet. Participants were also asked some questions about how they used the evaluation methodology.

Results from Using the Methodology to Evaluate the Tools

Two kinds of results are presented here: numbers from the use of the evaluation methodology for each tool (in this section), and the results of reliability testing of the methodology based on these numbers (in the following section).

Participants reported spending six to ten hours learning a tool, and an additional one to two hours completing each evaluation form. Tables 1, 2, 3, and 4 show mean summary percentages for each tool, averaged from the ratings of all four participants who evaluated each tool. (Note that the summary percentages shown in Figure 6 for the evaluation results of Demo were from one individual participant, and are therefore different from these mean percentages.)

Results of the evaluation of each of the three tools can be interpreted as was done previously for Demo. Table 1, showing mean summary functionality and usability percentages for each tool, indicates that, overall, HyperCard has the greatest functionality across all three sections of the form, followed by Demo and then Prototyper. HyperCard and Prototyper have very similar ratings of usability, and Demo has the lowest usability rating.

Table 2 gives a more detailed breakdown of line 1 of Table 1. Considering all columns in Table 2, some spot comparisons show that HyperCard (functionality of 90%) produces more types of forms than Demo (73%) or Prototyper (54%). Typed input strings are not well supported by Demo or Prototyper (both functionalities of 19%). HyperCard supports more features of interfaces (84%) than does Demo (46%) or Prototyper (37%). All three tools provide little support for input devices, while Demo supports more output devices (71%) than does HyperCard (64%) or Prototyper (53%). Usability ratings are generally high for all three tools. Results in Table 3, showing the types of support the tools provide, can be interpreted similarly.

Specification/implementation techniques used by each tool are shown in Table 4. According to these results, HyperCard primarily uses object manipulation (80%) with some tabular manipulation (11%). Prototyper also primarily uses object manipulation (78%) and some form-filling (14%). Demo uses a mix of object and tabular manipulation, rule-based transitions, and form-filling. None of these tools uses textual languages to any extent.

Table 1. Mean summary results (%) of functionality and usability ratings for all three sections of the form

                               Demo             HyperCard        Prototyper
                               Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Types of Interfaces            52      82       70      87       46      93
Types of Support               56      65       76      82       38      74
General Characteristics        28      61       50      83       31      75


A series of analyses of variance (ANOVAs) was run on all results to determine which are significantly different. Major comparisons were computed across the three sections of the form.

Results from Testing the Methodology for Reliability

Results were analyzed to determine whether the numbers produced by the methodology are reliable (consistent) across different evaluators. To produce statistical tests of reliability, we computed the probability that responses from the four evaluators for each tool would match by chance. The observed proportion of matches was then recorded for each category of items (menus, forms, and so on) and across the entire form, and compared with the chance probability using a binomial test. Because functionality and usability were measured on different scales, however, the chance probability of matching was not the same in the two cases.

Table 2. Types of interfaces the tools can produce: Mean summary results (%) of functionality and usability ratings

                            No. of   Demo             HyperCard        Prototyper
                            items    Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Interaction Styles:
  Menus                     14       82      78       84      82       66      93
  Forms                     12       73      76       90      97       54      93
  Typed input strings        4       19     100       69      68       19     100
  Windows                    3       59      92       75      81       83      91
Features of Interface       27       46      71       84      92       37      84
Hardware Devices:
  Input devices             17       18      89       20      99       12     100
  Output devices             7       71      78       64      85       53      92

Table 3. Types of support the tools provide: Mean summary results (%) of functionality and usability ratings

                                No. of   Demo             HyperCard        Prototyper
                                items    Funct.  Usab.    Funct.  Usab.    Funct.  Usab.
Rapid prototyping                9       75      72       86      83       58      89
Development methodology          6       58      76       84      87       50      75
Evaluation of target interface   7       39      74       61      60        0      n/a
Database management system       2        0      n/a      75      55        0      n/a
Interface libraries              1       50      50      100      84       50      50
Help for using tool              1      100      67      100     100       75      67
Documentation of tool itself     1      100      42      100      75      100      75
Context of definition            3       58      75       84      95       25      84
Automated project management     5        0      n/a      10     100        0      n/a


For functionality, successful matching for an item was defined either as all four of its evaluators agreeing that the item was possible with a given tool, or all four agreeing that the item was not possible with that tool. The probability of successful matching by chance is 1/8 (=.125). Even if there is no reliability, that is, if each evaluator decided on functionality by a coin flip, we would still expect about one out of every eight items on the form to produce a successful match. When the observed proportion of successful matches in a category substantially exceeds this rate, we conclude that evaluations are reliable. The level of confidence we have in this conclusion is indexed by the p-value of our result. The p-value is the probability of obtaining the observed number of successful matches by chance alone. The smaller the p-value, the more certain we are that evaluators agreed in their judgments of functionality. That is, we are more confident that functionality ratings are reliable (consistent). The p-values for functionality are shown in Table 5.

For usability, successful matching for an item was defined either as all four of its evaluators agreeing on the face selected, or three of the four evaluators agreeing, with the fourth evaluator disagreeing by only one face (e.g., frown, frown, frown, bland is a successful match, while frown, frown, frown, smile is not). This definition was chosen--rather than requiring all four to agree on the selected face--because it is inherently more difficult to get agreement among three choices (the three faces) than among the two choices (possible or not possible) available for functionality. The probability of successful matching by chance is 19/81 (=.2347). As with functionality, the observed proportion of successful matches was compared with this figure, and the probability computed of obtaining this many successful matches by chance alone. Table 6 shows these p-values for each category of items and across the entire form.
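The article does not give code for these computations; the sketch below is our own reconstruction. It enumerates the four evaluators' responses by brute force to confirm the 1/8 and 19/81 chance probabilities, then computes the one-sided binomial p-value for a hypothetical observed number of matches (the function names and the example count of 9 matches among 14 items are illustrative only).

```python
import math
from fractions import Fraction
from itertools import product

def chance_match_probability(levels: int, is_match) -> Fraction:
    """Probability that four independent, uniformly random responses
    (each one of `levels` choices) satisfy the given match criterion."""
    outcomes = list(product(range(levels), repeat=4))
    return Fraction(sum(bool(is_match(o)) for o in outcomes), len(outcomes))

def functionality_match(o) -> bool:
    # All four evaluators give the same possible / not-possible answer.
    return len(set(o)) == 1

def usability_match(o) -> bool:
    # All four faces agree, or three agree and the fourth is adjacent
    # (faces coded 0 = frown, 1 = bland, 2 = smile).
    values = set(o)
    if len(values) == 1:
        return True
    if len(values) == 2:
        a, b = values
        return sorted(o.count(v) for v in values) == [1, 3] and abs(a - b) == 1
    return False

print(chance_match_probability(2, functionality_match))  # 1/8
print(chance_match_probability(3, usability_match))      # 19/81

def binomial_p_value(observed: int, n: int, p: float) -> float:
    """P(at least `observed` matches among n items) under chance alone."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(observed, n + 1))

# Hypothetical example: 9 observed matches among 14 menu items.
print(round(binomial_p_value(9, 14, 1 / 8), 4))
```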

Usability was examined only for items that all four evaluators agreed were possible (i.e., perfect agreement--successful matching--on functionality). As a result, in most categories the number of items available for reliability testing of usability was much smaller than for functionality.

Table 4. Specification/implementation techniques used by the tools: Mean summary results (%)

                                   Demo    HyperCard   Prototyper
Coded in Textual Language:
  Programming language             --       8           1
  Specialized dialogue language     5       1          --
  BNF                              --      --          --
  Regular expressions              --      --          --
Direct Manipulation:
  Object manipulation              17      80          78
  Graphical programming language   --      --           3
Miscellaneous:
  Tabular manipulation             32      11           2
  Rule-based transitions           28      --          --
  Form-filling                     18      --          14
  Other (Specify: plug-in)         --      --           2

Table 5. p-values for functionality for each category of the form and across the entire form

                          No. of items   Demo     HyperCard   Prototyper
Interaction Styles:
  Menus                    14            .0007    .0007       .0230
  Forms                    12            .0113    .0000       .0528
  Typed input strings       4            .0789    1.0000      .0071
  Windows                   3            .0430    1.0000      1.0000
Features of Interface      27            .0042    .0000       .0042
Hardware Devices:
  Input devices            17            .0000    .0000       .0000
  Output devices            7            .0000    .0463       .0062
Types of Support           38            .0000    .0000       .0000
General Characteristics     5            .4871    .1207       .0161
Across Entire Form        127            .0000    .0000       .0000


Some categories dropped out entirely (indicated by a dash in Table 6), since they contained no items that all four evaluators agreed were possible. The figures in parentheses in Table 6 are the numbers of items on which each p-value is based.

Discussion of Results

Evaluating the Tools

At the highest level, the ANOVA showed that HyperCard has a significantly higher functionality for types of interfaces a tool can produce, and Demo has about the same functionality as Prototyper (line 1 of Table 1, and Table 2). Given our a priori knowledge and expectations, these results are not surprising, since HyperCard is a more general-purpose tool. The ANOVA showed that the only significant difference in usability ratings for the types of interfaces a tool can produce is between Prototyper and Demo (line 1 of Table 1, and Table 2). Prototyper is indeed more usable than Demo, but the difference in usability between HyperCard and either of the other two tools is not significant. This is again reasonable: Prototyper runs on a Macintosh and might therefore be expected to be easier to use. Although HyperCard also runs on a Macintosh, its greater functionality may actually interfere with its usability.

The ANOVA showed all three tools are indeed significantly different in their functionality for types of support. HyperCard, as we expected because of its generality, provided the most types of support (line 2 of Table 1, and Table 3), followed by Demo and then Prototyper. The ANOVA indicated that usability for these types of support is significantly different only between the highest usability (HyperCard) and the lowest (Demo). Prototyper, at the in-between level, is not significantly different in usability for types of support from either of the other two.

While results (Table 1, line 3) indicate that HyperCard has more general tool characteristics than Prototyper, which has more than Demo, the ANOVA showed that the three tools in fact do not differ significantly on either functionality or usability of general characteristics. Direct manipulation was the primary specification/implementation technique (shown in Table 4) for HyperCard and Prototyper, and was undoubtedly reflected in their overall high usability ratings. A more complete comparison of these tools, for example, to choose one for a development environment (see the following section "Example: Use of Results"), would necessitate a look into details of individual items on the evaluation form, from which these overall results were calculated.

Table 6. p-values, and the numbers of items on which they are based, for usability for each category of the form and across the entire form

                          Demo          HyperCard      Prototyper
Interaction Styles:
  Menus                   .7989 (6)     .0097 (7)      .0129 (3)
  Forms                   .2360 (4)     .0000 (9)      .0129 (3)
  Typed input strings     --    (0)     --    (0)      --    (0)
  Windows                 .2346 (1)     --    (0)      --    (0)
Features of Interface     .5515 (3)     .0000 (14)     .2346 (1)
Hardware Devices:
  Input devices           .1393 (3)     .0550 (2)      .2346 (1)
  Output devices          .0123 (5)     .0550 (2)      .0550 (2)
Types of Support          .5158 (7)     .0002 (12)     .0030 (4)
General Characteristics   --    (0)     .4141 (2)      .2346 (1)
Across Entire Form        .0030 (29)    .0000 (48)     .0000 (15)

Testing the Methodology for Reliability

Using the standard criterion of p < .05, most of the p-values for functionality shown in Table 5 are quite satisfactory, indicating that the form is generally reliable for determining functionality of a tool. Poor results (e.g., typed input strings, windows, and general characteristics) were frequently found in those categories that have only a few items (4, 3, and 5 items, respectively), suggesting a need to reassess the measurement of functionality and the grouping of items in those categories. Across the entire form, reliability is highly significant for functionality for all three tools.

For usability, the p-values shown in Table 6 are adequate in some categories, but unsatisfactory in others. These results suggest that usability ratings may not be as reliable as functionality ratings. This was expected for at least two reasons: First, usability is inherently more subjective than functionality, and therefore more likely to be inconsistent across evaluators. Second, because usability was examined only for those items that all evaluators agreed were possible, the number of items in most categories was smaller than the number of items examined for functionality. The smaller number of items in many categories adversely affects the power of statistical tests, and substantially accounts for the larger p-values. These results suggest a need to reexamine groupings of items into categories, avoiding categories with very few items. They may also reflect overly subjective definitions for the three usability faces, and the desirability of making these definitions less open to individual interpretation by evaluators. Nevertheless, by combining all items across the entire form, and thereby employing a large sample size in our test, usability reliability is highly significant for all three tools.

Qualitatively, participants said in post-evaluation interviews that they felt the results fairly represented tool capabilities. They felt that, rather than producing ad hoc evaluations, the form provides a structured, consistent instrument for evaluating and comparing tools, as well as for presenting the results of those evaluations. These comments indicated participants' positive feelings toward the methodology, despite the fairly lengthy time--at least 20 hours--each of them had volunteered.

Example: Use of Results

To show how results of this methodology might be used to aid in selection of an interface development tool for a particular environment, we will present a purely hypothetical, simple example that compares the results of the three tool evaluations obtained from the empirical study. Requirements for selecting a tool for our hypothetical environment are arbitrarily set as follows:

• Tool must be capable of producing a large variety of menus and forms;

• Tool must be easy to use for the types of interfaces it can produce; and

• Tool must mainly use object manipulation for producing interfaces.

First, these requirements must be quantified; in this example we will arbitrarily choose the following ratings to satisfy, respectively, each of the three requirements:

• Minimum functionality rating of 70% for both menus and forms;

• Minimum overall usability rating of 60%; and

• At least two-thirds of the tool's functions must be produced using object manipulation.

Comparing functionality ratings. From Table 2, Prototyper falls below the 70% functionality required for both menus and forms; HyperCard and Demo satisfy it. So Prototyper is eliminated.

Comparing usability ratings. From Table 2, HyperCard has higher usability ratings in most categories, but both HyperCard and Demo meet the 60% minimum usability requirement for all seven categories. Interestingly, note that Prototyper has higher usability ratings than either HyperCard or Demo for several categories. Thus if the tool selection requirements had been different (e.g., favoring usability over functionality), Prototyper might not have been eliminated so quickly.

Comparing specification/implementation techniques. Table 4 shows that Demo uses object manipulation only 17% of the time, while HyperCard uses it 80% of the time. Our selection requirement states that an object manipulation rating of at least 66% is the minimum acceptable, so Demo is eliminated. (Again, Prototyper would meet this requirement had we not already eliminated it.)

Final decision. HyperCard has higher functionality and usability ratings than Demo, and meets the acceptable limit of required object manipulation. Thus we would recommend HyperCard as the appropriate tool, given our hypothetical selection requirements.
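This screening is mechanical enough to script directly against the summary tables. The following sketch is a hypothetical illustration using the Table 2 and Table 4 figures quoted above; the data layout, the "min_usability" summary (the smallest of the seven types-of-interfaces usability ratings), and the threshold encoding are ours, not part of the methodology.

```python
# Hypothetical tool-selection screen over the published summary ratings.
# Each entry: menus/forms functionality, minimum usability across the seven
# "types of interfaces" categories, and % object manipulation (Tables 2 and 4).
tools = {
    "Demo":       {"menus": 82, "forms": 73, "min_usability": 71, "obj_manip": 17},
    "HyperCard":  {"menus": 84, "forms": 90, "min_usability": 68, "obj_manip": 80},
    "Prototyper": {"menus": 66, "forms": 54, "min_usability": 84, "obj_manip": 78},
}

def meets_requirements(r: dict) -> bool:
    return (r["menus"] >= 70 and r["forms"] >= 70      # functionality threshold
            and r["min_usability"] >= 60                # usability threshold
            and r["obj_manip"] >= 100 * 2 / 3)          # object-manipulation threshold

print([name for name, r in tools.items() if meets_requirements(r)])  # ['HyperCard']
```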

Conclusions and Future Research

Results across different participants (evaluators) in our experiment were substantially more similar than we expected. Since this was the first time the evaluation methodology had been formally used, we anticipated the possibility of widely varying results. The empirical testing indicates, however, that our methodology provides reliable (consistent) results across different evaluators; thus different evaluators should reach the same conclusions about tools evaluated and compared with this methodology.

Examination of details across different participants revealed several instances in which one participant marked a function (e.g., typed input strings) as "not possible" while another marked it otherwise. Investigation showed this was due, as expected, to two common reasons:

1. Differences in interpretation of glossary definitions, and

2. Differences in expertise level of different evaluators using the same tool.

Interestingly, most discrepancies were found either in rather obscure, poorly understood items or in items not explicitly supported by a tool; for example, producing "typed input strings" using Prototyper and Demo. When they were not explicitly possible, some participants used intuition and creativity to produce typed input strings with a tool. These two differences suggest that detail in the glossary could be clarified, and formal evaluator training should be explored.

In general, we believe that this methodology should be used only by an evaluator who has thoroughly learned the tool(s) to be evaluated and is familiar with the methodology and form. This is to be expected: given the complexity of the kinds of tools we are attempting to evaluate, a methodology for evaluation is itself lengthy and complex.

The methodology's focus on quantitative results can be misconstrued. Overall (summary) ratings give only macro results; an evaluator must go to the detailed sections of the form to determine specific information about a tool. The numeric ratings provide guidelines for evaluating and comparing tools and aid in decision making; they do not mandate a particular decision.

Finally, there are great advantages to this kind of checklist evaluation methodology. It encourages its users (the tool evaluators) to broaden their thinking about tool evaluation, by presenting them with a structured, wide variety of possible choices, many of which they might not otherwise think of. This very structure can be limiting, however, by actually narrowing the set of questions asked about a particular tool. As an initial response to this, we have made the form extensible by allowing evaluators to add items as appropriate (in Figure 2, note spaces for "Other"). We are also investigating adding an initial phase in which evaluators must determine, in some structured fashion, a desired, context-sensitive set of functions for a tool within their particular environment, before proceeding with evaluation of tools using our methodology.

Several other open issues are also being pursued. For categories with low reliability, we are investigating how to improve consistency. Determining validity (i.e., whether the methodology measures what we intend it to measure) of the methodology is also being investigated; this study addressed only reliability. Validity is even more difficult to establish because of a lack of expert comparisons for cross-validation purposes. Without a comparable technique for evaluating human-computer interface development tools, which currently does not exist, validity cannot be effectively assessed.

Pragmatic Use of the Tool Evaluation Methodology

Importance of the Methodology

Roberts and Moran [6] make an excellent distinction between importance and reliability within the context of quantitative scores such as those produced by this tool evaluation methodology. They state that any difference in size across scores can be made reliable by conducting enough extensive, expensive studies. The basic issue, however, is not so much reliability as importance. Importance represents a substantive, rather than a statistical, difference among ratings. Thus, in practical use, small differences that may be unreliable are not critical for producing useful ratings. The usefulness of a relatively inexpensive methodology, such as the one described in this article, is that it reveals potentially large, and therefore important, differences among tools. When this methodology identifies a potentially important difference across tool categories, for example, evaluators can then determine just how reliable that difference needs to be. We found that, as a rule of thumb, a difference of about 10 percentage points in ratings across tools was useful for distinguishing a rather different "look and feel" for both functionality and usability of each tool.

Real World Use of the Methodology

The experimental study described above gave us statistical evidence that this tool evaluation methodology is useful and reliable. Perhaps even more important than our laboratory results is the number of development groups within major organizations that have expressed interest in our methodology and who, in fact, have used or are considering using it. More than 25 organizations have requested the evaluation form, including GTE Data Systems [1], McDonnell Douglas [7], Battelle Pacific Northwest Labs, Bell Northern Research, Data General, Digital Equipment Corporation, Hewlett-Packard, IBM, Jet Propulsion Lab, NASA, and the Software Engineering Institute. The methodology is being used in various ways, and groups are finding it easy to adapt to their particular needs. For example, one group used it as a high-level checklist to compare more than a dozen interface development tools. The matrix provided a useful taxonomy for making initial qualitative comparisons across several tools. They performed the calculations when there was a desire to have a more detailed, quantitative comparison across several tools, as indicated by the qualitative, intuitive results. One group adapted a subset of the form to a specific domain of tools that they were evaluating (e.g., forms management systems). The response of these groups, along with the requests we get (on average, about once a week) for information on the methodology, encourages us by letting us know that there is a great need for such a methodology, and that our approach, while still young, shows real promise.¹

¹ Information about obtaining the evaluation form and instructions for its use is available by contacting Hix. The Excel spreadsheet can also be obtained at no cost by sending a 3-1/2 inch floppy disk to her. We welcome inquiries from anyone interested in using and critiquing this evaluation methodology.

And Did We Meet Our Goals?

Six goals for a tool evaluation methodology were set forth at the beginning of this article:

1. use of a standardized approach to produce reliable, quantifiable results;

2. thoroughness, but not necessarily exhaustiveness, of a tool evaluation;

3. objectivity, rather than judgement, on the part of the tool evaluator;

4. ease-of-use, but not necessarily quick-to-use, for the tool evaluator;

5. adaptability to the development environment in which it is used; and

6. extensibility, to remain sensitive to advances in interface technology.

The first goal was met, as the results of the empirical study indicate. The other goals were assessed by questioning participants in that study, as well as members of the organizations mentioned in the previous section. All agreed that this evaluation methodology is thorough, providing a comprehensive, structured framework within which to evaluate and compare interface development tools. All concurred that the evaluation, because of the highly structured matrix and the carefully defined usability icons, removes much of the evaluator subjectivity. Users of this methodology stated that they generally find it easy, and even fun, to use. Most users stated that the form itself does not take long to complete (one to two hours per tool); however, the methodology is not quick because the evaluator must learn each interface development tool being evaluated.

Adaptability to specific environments can be accomplished by subsetting the form. By removing uninteresting or inappropriate items across all evaluations, consistent ratings will be retained across all tools being evaluated. The form is easily extensible through use of the "Other" category provided throughout. Because each section of the form establishes a taxonomy, new items can be easily added within that framework, keeping the methodology sensitive to rapidly occurring advances in interface technology. As new items are added to the form, tools that have already been evaluated can be reevaluated with the new items. Thus a tool's ratings could increase or decrease over time, while always remaining relative to the most current taxonomy against which it has been evaluated.

Summary

This evaluation methodology is, we believe, the first attempt at developing and empirically validating a standardized technique for evaluating and comparing human-computer interface development tools. Such an evaluation methodology for interface development tools provides both theoretical and practical contributions to human-computer interaction research, including the following:

• Development of a method for systematically and consistently evaluating all aspects of an interface development tool;

• Instantiation of the concept of quantitative functionality and usability ratings for a tool;

• Development of a taxonomy of types of interfaces that can be produced with a tool; and

• Identification of specification/implementation techniques used by a tool.

This research provides a communication mechanism for tool researchers, tool practitioners, and tool users for making coherent critiques of their own and other tools. Our goal is that this work will result in a rigorous, trusted methodology for evaluating human-computer interface development tools.

Acknowledgments

The authors would especially like to thank graduate student Kay Tan, who conducted the reliability study, and the participants who volunteered their time for this study. Appreciation also goes to H. Rex Hartson, Antonio C. Siochi, Matt Humphrey, and Eric Smith, who helped in the early stages of developing the evaluation form, and to Bruce Koons, who gave valuable expert advice on the experimental study. This research was funded by the Software Productivity Consortium and the Virginia Center for Innovative Technology.

References

1. Arble, F. Private communication, 1988.

2. Borenstein, N.S. The evaluation of text editors: A critical review of the Roberts and Moran methodology based on new experiments. In Proceedings of CHI '85 Human Factors in Computing Systems (Apr., Boston). ACM, New York, 1985, pp. 99-105.

3. Cohill, A.M., Gilfoil, D.M., and Pilitsis, J.V. Measuring the utility of application software. In H.R. Hartson and D. Hix, Eds., Advances in Human-Computer Interaction, Vol. 2. Ablex Publishing Corp., Norwood, N.J., 1988.

4. Hartson, H.R., and Hix, D. Human-computer interface development: Concepts and systems for its management. ACM Comput. Surv. 21, 1 (Mar. 1989), 5-93.

5. Hix, D. Evaluation of human-computer interface development tools: Problems and promises. Presented at EFISS (Atlanta, Ga., Oct. 1988).

6. Roberts, T.L., and Moran, T.P. The evaluation of text editors: Methodology and empirical results. Commun. ACM 26, 4 (Apr. 1983), 265-283.

7. Totten, S. Private communication, 1989.

CR Categories and Subject Descriptors: D.2.2 [Software Engineering]: Tools and Techniques; H.1.2 [Models and Principles]: User/Machine Systems

General Terms: Experimentation, Management, Measurement

Additional Key Words and Phrases: Measurement techniques, methodology, user interfaces

Deborah Hix is a research computer scientist on the faculty at Virginia Polytechnic Institute and State University in Blacksburg, Virginia. She is a principal investigator on the Dialogue Management Project. This project is concerned with achieving quality human-computer interfaces, through development of concepts for human-computer interface management and through development of specialized methodologies, techniques, and tools for producing the human-computer interface.

Author's present address: Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061. Email for Deborah Hix: [email protected].

Robert S. Schulman is an associate professor of statistics at Virginia Polytechnic Institute and State University in Blacksburg, Virginia. He performs statistical consulting in a wide variety of fields. His specialty areas are psychometrics, test theory, and applications of statistics to the social sciences. Author's present address: Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061.

Unix is a trademark of AT&T Bell Laboratories. Demo is a product of Software Garden, Inc. HyperCard is a product of Apple Computer, Inc. Prototyper is a product of SmethersBarnes.

© ACM 0002-0782/91/0300-074 $1.50
