
ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES 48, 252-271 (1991)

The Effect of Dimension Content on Observation and Ratings of Job Performance

TRACY MCDONALD

Department of Management, California State University, Chico

Recent theory and research in the performance appraisal area (e.g., Denisi, Cafferty, & Meglino, 1984; Feldman, 1986; Ilgen & Feldman, 1983; Williams, Denisi, & Blencoe, 1985) have suggested that providing information regarding the performance dimension to be rated will cause raters to select appropriate observational schemata and, as a result, produce higher quality ratings. A study was conducted to determine if giving raters dimension-relevant information prior to performance observation would affect their attention processes and rating quality. Prior to watching a videotape of an instructor giving a lecture, 156 subjects were given either: 1) correct information, 2) incorrect information, or 3) no information regarding dimensions of performance they would subsequently be asked to rate. The results indicated that giving prior information regarding dimension content affected subjects' attention processes. Further, raters receiving no information and those receiving misinformation prior to performance observation produced less accurate ratings compared to expert raters. Ratings produced by subjects receiving correct information did not differ significantly from experts' ratings. These results are discussed in terms of both their practical and theoretical implications. © 1991 Academic Press, Inc.

INTRODUCTION

Recent theory and research in the performance appraisal area (e.g., Denisi, Cafferty, & Meglino, 1984; Feldman, 1986; Ilgen & Feldman, 1983; Williams, Denisi, & Blencoe, 1985; Cohen & Ebbesen, 1979) have suggested that providing information regarding the performance dimension to be rated will cause raters to select appropriate observational schemata and, as a result, produce more valid ratings. The purpose of this study is to determine if giving such information does indeed affect attention processes and, consequently, the validity of ratings. Examined are the effects of three rating conditions on validity of ratings: 1) providing raters with correct information regarding the content of the dimension to be rated prior to making ratings, 2) providing raters with incorrect information regarding the content of the dimension to be rated prior to making ratings, and 3) providing raters with no information prior to making ratings.

I gratefully acknowledge the comments of two anonymous reviewers on an earlier draft of this article.

Requests for reprints should be sent to Tracy McDonald, Department of Management, California State University, Chico, Chico, CA 95929-0031.




These three conditions have obvious counterparts in performance appraisal contexts. In some organizations, performance appraisal administration is very formalized. The rater may receive explicit training regarding the use of the appraisal instrument and the dimensions of performance to be rated. In some cases, the rater may even have been involved in developing the instrument. In these situations, one would expect the rater to have correct information concerning the dimensions of performance to be rated and thus be more likely to attend to aspects of performance relevant to the dimensions and produce more valid ratings.

Further down the spectrum are organizations which make no formalized effort to provide training or information to raters regarding the nature of the ratings to be made. The first time the rater learns of the dimensions to be rated may be when he/she is given the appraisal instrument. In such cases, the rater may have few, if any, expectations regarding the dimensions to be rated. In a more extreme situation, the rater's expectations may be based on incorrect assumptions regarding which aspects of performance are to be rated. These assumptions, for example, could be based on experience appraising performance in other contexts or in other organizations, on incorrect expectations regarding appraisal purpose, or simply on what the rater personally perceives to be the most important components of performance on the particular job. In these situations, raters would be less likely to attend to aspects of performance relevant to the dimensions to be rated and thus would produce less valid ratings compared to raters receiving correct information regarding dimension content.

Several researchers have suggested schema theory (Bartlett, 1932; Norman, 1976; Neisser, 1976) as a useful heuristic for conceptualizing the performance appraisal process (e.g., Denisi et al., 1984; Feldman, 1981; Ilgen & Feldman, 1983). Recent research (Cohen & Ebbesen, 1979) suggests that when raters observe performance to make ratings on various performance dimensions, they activate appropriate schemata. A schema is a hypothetical cognitive structure which represents an individual's knowledge about the world and of the relationships among elements in the world (Cohen, 1981). When an individual encounters a stimulus configuration, a schema is matched against the configuration and organizes its elements. Individuals attend only to that information which is relevant to the currently activated schema (Neisser, 1976; Newtson, Enquist, & Bois, 1977). Further, that which is not attended to will not be stored and consequently will not be remembered (Neisser, 1976). Individuals select schemata for activation based upon: 1) their purpose for engaging in a particular activity (e.g., observing horses as a judge in a horse show), 2) motivation, and 3) expectations regarding a particular situation (Bobrow & Norman, 1975).



Several theoretical and empirical papers have recently alluded to the importance of giving raters information regarding the content of dimensions to be rated prior to performance observation. Denisi et al. (1984) contend that the performance appraisal instrument, if its content is known in advance, governs the selection of dimension-relevant schemata and that rater training must acknowledge that raters use schemata. Related to this, Ilgen and Feldman (1983) propose that rater training needs to move from its concern with the retrieval of performance information from memory to include the encoding and storage of such information. Finally, Feldman (1986) holds that in order to produce high-quality ratings, supervisors must possess valid, job-relevant schemata to direct their attention to relevant subordinate behaviors. Again, correct prior information regarding the instrument should affect the nature of the schemata raters possess.

An important study with implications for the current research was conducted by Cohen and Ebbesen (1979). Subjects were instructed either to form an impression of, or to learn the task performed by, an actress prior to observing a videotape of the actress engaging in both task and expressive behavior. The results indicated that subjects in the two instructional groups attended to distinctly different aspects of the actress' behavior. Moreover, impression-forming subjects were more influenced by their implicit personality theories than were task-learning subjects. Finally, subjects who observed the videotape under task-learning instructions scored significantly higher on a memory test than subjects in the impression-formation group. Cohen and Ebbesen interpret their results as supporting the idea that the two observational goals activated sets of schemata containing different features.

In order to measure subjects’ cognitive processing of behavior, Cohen and Ebbesen (1979) used a “unitization measure” developed by Newtson (1973; 1977). This dependent measure deserves brief discussion since it will also be used in the current study. To use the measure, subjects are provided with a button and are instructed to press it when they perceive one behavior unit to have ended and another to have begun.

Newtson contends that a perceived action is defined by a change in a stimulus array. Behavior perception is a feature-monitoring process whereby the perceiver monitors a particular set of features, segmenting the behavior into parts when one or more of the monitored features changes its state. Button presses mark the points of feature change. According to this formulation, then, it is possible for a button press to represent more than one feature change. Newtson further contends that expectancies or sets can affect behavior perception by altering the set of features monitored. Included here would be expectancies regarding the job dimension on which performance is to be assessed. Finally, Newtson (1976) has shown that the unitization measure exhibits both high interrater and test-retest reliability.



The results of Cohen and Ebbesen's (1979) study can be interpreted using Newtson's framework. The observational goals given to subjects activated different schemata for which certain behavior features were relevant. As a result, subjects in the two conditions were monitoring different sets of features. If button presses represent feature changes and if different features change at different rates, then differences in unitization due to observational goal would be the expected result. This is precisely the result Cohen and Ebbesen obtained. If these results generalize to the performance appraisal context, then raters observing job performance to rate different dimensions should attend to different aspects of performance and thus produce different unitization patterns.

Related to this, it would also be expected that the highest quality ratings would be produced when raters receive correct information, prior to performance observation, concerning the dimension they will be asked to rate. This is because providing raters with such information would cause them to activate dimension-relevant schemata and thus attend to those aspects of performance most relevant to the dimension in question. As a consequence, they would be better prepared to make quality ratings compared to raters who receive no preobservational information or who receive incorrect information.

Implications of Schema Theory for Performance Appraisal Contexts

It is proposed here that the dimension to be assessed determines the selection of dimension-relevant schemata which, in turn, influences the rater’s cognitive processing of performance. There are two hypotheses:

H1: Raters assessing different dimensions will attend to different features of performance.

H2: Ratings made by raters given correct prior information regarding the dimensions contained on the rating instrument will be more valid than those of raters given no prior information and those of raters given incorrect prior information.

METHOD

Overview

There were several phases involved in the experimental proceedings. At the onset, subjects were told that they would be viewing a videotape of a psychology instructor giving a lecture. Using a counterbalanced, repeated-measures design, subjects in the experimental group ("Correct Information Group") first viewed the videotape in order to rate either: 1) the instructor's delivery skills ("Delivery") or 2) the degree to which the instructor relates the lecture material to the students' own experiences ("Relevance"). When the tape was completed, subjects rated the instructor on the dimension for which they had viewed the tape. Ratings were made on a graphic scale ranging from one to seven, with seven being most positive. Attached to each scale was the dimension label and definition. After the first viewing of the tape, subjects viewed the tape a second time in order to rate the instructor on the dimension they had not previously assessed. In a comparison group ("No Prior Information Group"), subjects received no prior information concerning the dimensions they would be asked to rate, viewed the videotape, and rated the instructor on the two dimensions. While viewing the videotape, subjects in both groups unitized the instructor's performance. In the Correct Information Group, subjects unitized performance twice, once for each dimension. Subjects in the No Prior Information Group viewed and unitized performance only once. A third group of subjects ("Incorrect Information Group") was given preobservational instructions to either: 1) rate the instructor's delivery, but after observation was asked to rate relevance, or 2) rate relevance, but after observation was asked to rate delivery. This group of subjects did not unitize performance. After completing their final ratings, subjects in all three groups were debriefed and thanked for their participation.



Subjects

Subjects were 116 male and 131 female students enrolled in an introductory psychology course at a large Midwestern university. It is reasonable to use undergraduates as subjects in the current study because the task they performed is one with which they should all be familiar: students regularly observe and rate instructor performance. Of the subjects, 91 were involved in pilot testing and 156 participated in the experiment.

Creating the Videotape

Several steps were involved in creating the videotape to maximize the probability that subjects would perceive most instructor behaviors as belonging to one or the other of the two dimensions being assessed. As a first step, two dimensions were selected from the behavioral expectation scales developed by Zedeck, Jacobs, and Kafry (1976). The dimensions, delivery and relevance, were selected because: 1) they seemed conceptually independent of each other, 2) neither dimension requires interaction with students, an instructor behavior which could potentially confound the unitization measure, and 3) the dimensions are amenable to being displayed in a relatively short videotape.

DIMENSION CONTENT AND PERFORMANCE OBSERVATION 257

After selecting the dimensions, three judges (two industrial/organizational psychology graduate students and one senior) generated as many behavioral examples as possible for the two dimensions. Zedeck's own examples were not used because it was desired to create a larger set of behaviors than is contained in Zedeck's behavioral expectation scales. As a second step, 27 undergraduate subjects were given definitions of the two dimensions and were instructed to categorize each of the 44 behavioral examples (arranged in random order) as reflecting one of the dimensions. Subjects were told that they did not have to classify any behaviors which they felt were unclassifiable. Originally, it was decided that 80% or greater agreement in categorization of behaviors represented an acceptable level of interrater reliability. However, for several behaviors, there was between 78 and 80% agreement in categorization. Since this percentage was so close to the original 80%, behaviors on the two dimensions for which there was 78% or greater agreement in categorization were considered for possible inclusion in the videotape.

The following are definitions of the two dimensions selected. Some of the examples contained in the definitions actually occurred in the videotape. The same definitions were used in all phases of the study:

"DELIVERY refers to the degree to which the instructor is a skillful speaker in front of the class. This category of teaching behavior is what most people think of as public speaking skills. Examples of some behaviors relevant to delivery are loudness of voice when lecturing, eye contact, and use of lecture notes. Delivery refers only to the manner in which the lecture is presented, not to the content and organization of the material."

"RELEVANCE refers to the degree to which the instructor relates the subject matter of the lecture to the students' own experiences and knowledge to make the subject matter more important and meaningful to the students. This might be accomplished, for example, by using everyday experiences to illustrate theories or concepts or by applying psychological principles to explain real-world situations."

A 6'27" videotape was made of a psychology instructor giving a lecture. The instructor demonstrated performance relevant to both the delivery and relevance dimensions by exhibiting behaviors which had been categorized by subjects as representing the respective dimensions. Since Newtson's (1976) work has demonstrated that button presses occur when subjects perceive behavior changes, the videotape was constructed so that behavior on the two dimensions occurred at different rates. That is, changes in delivery style did not occur simultaneously with a shift in the lecture from theory to relating the theory to the students' experience. It was intended that button presses on the two dimensions would not occur simultaneously. This was done because even if subjects in the different conditions were attending to different features of behavior, the unitization measure would not reflect this if the features changed concurrently. Additionally, behavior on the delivery dimension changed at a faster rate (i.e., more behavior changes occurred) than did behavior on the relevance dimension.



In constructing the videotape, efforts were made so that the instructor would be perceived as "above average" in relevance and "below average" in delivery (e.g., the instructor related many aspects of the lecture to the students' experiences, but displayed little eye contact, spoke in a monotone voice, and clasped and unclasped her hands). This was done to enhance the interpretation of the results concerning whether subjects could accurately perceive the two separate dimensions of performance. For example, if the instructor performed "above average" on both dimensions and subjects gave above average ratings on the two dimensions, it would be unclear whether the subjects were inaccurately perceiving one global (above average) dimension of performance or accurately perceiving the two dimensions in question.

Finally, a second performance sequence was videotaped in which efforts were made so that the instructor would be perceived as "below average" in relevance and "above average" in delivery, in contrast to the previous tape. This was done to provide additional data concerning whether subjects in the population could accurately perceive the two dimensions (i.e., evidence that subjects can perceive both "above" and "below" average performance on both dimensions is stronger than evidence based on the first tape only). The second videotape was used for manipulation check purposes only.

After the two videotapes were made, data were gathered from two groups of subjects to determine the degree to which subjects could accurately perceive the dimensions in question. Using a counterbalanced, repeated-measures design, one group of 32 subjects received dimension definitions, viewed the first tape, and then rated either delivery or relevance. Subjects then received definitions of the dimension they had not previously assessed, viewed the tape a second time, and rated the instructor's performance on the second dimension. A second group of 32 subjects proceeded in the same manner except that they rated the teacher's performance on the two dimensions in the second videotape.

The two-between, one-within analysis of variance resulted in a Tape × Dimension Rated interaction, F(1,60) = 112.19, p < .0001, indicating that subjects rated the delivery and relevance dimensions significantly differently depending on which tape they observed (experimental tape: Delivery mean = 2.31, Relevance mean = 4.59; manipulation check tape: Delivery mean = 5.0, Relevance mean = 2.72). A post hoc test using Dunn's procedure indicated that the two delivery means were significantly different from each other, as were the two relevance means, p < .01 for both comparisons. It was concluded that subjects could accurately perceive the two dimensions exhibited in the tapes.



Obtaining True Scores on Performance Dimensions

Following a modification of Borman’s (1977; 1979) methodology, 16 expert raters were asked to rate the instructor’s performance on the two dimensions. Ten of the raters were faculty members of a management department and six of the raters were graduate students in business. All of the raters had extensive experience rating teaching performance, most had experience teaching, and all were knowledgeable of the performance appraisal area.

It was intended that the experts be very well prepared to make their performance ratings. Before making ratings, each rater was given the dimension labels, their definitions, and examples of behavior relevant to each definition. After becoming familiar with the definitions, raters were given a verbatim transcript of the verbal performance segment and a transcript of the nonverbal behaviors that occurred in the videotape. While reading through these materials, experts were encouraged to underline and take notes based on the information.

After the preparation session was over, raters observed the videotape and rated either delivery or relevance. Subsequently, they watched the videotape a second time and rated the dimension they had not previously rated, with the order of dimensions rated counterbalanced. Experts were encouraged to take notes while observing the performance sequence.

Experts' ratings were analyzed to determine whether there was a significant difference between ratings on the two dimensions. This type of analysis is appropriate since there were no intended true scores; the tape was constructed so that, in a general sense, performance on the relevance dimension would be perceived as more effective than performance on the delivery dimension.* A t-test indicated that experts rated delivery significantly differently from relevance (t = 8.72, p < .0001; Delivery mean = 2.3, Relevance mean = 4.9). Other estimates of rating validity, such as the intraclass correlation, could not be computed because experts made only one rating on each dimension; thus, ratings on a given dimension by a given rater had no variance, a condition necessary for computing intraclass correlations. Because of the significant difference between ratings on the two dimensions, because of the special preparation for rating received by experts, and because of their experience with performance appraisal and teaching, mean expert ratings were adopted as true scores to which subsequent ratings can be compared.

* Thanks to Walter Borman for suggesting this analysis.



Correct Information Group

Upon arrival at the laboratory, subjects in the Correct Information Group were seated and told that they would be watching a videotape of a psychology instructor giving a lecture. Using a counterbalanced, repeated-measures design, the 50 subjects were told that they were to view the videotape in order to rate either: 1) the instructor's delivery or 2) the instructor's relevance. Prior to viewing the videotape, subjects read a definition of the dimension they would be rating.

Subjects then viewed a short practice tape of another instructor giving a lecture after being told to press down the button in front of them whenever they perceived one behavior to have ended and another to have begun. Thus, subjects were to press the button when there was a change from one behavior to another. These instructions replicated those used by Cohen and Ebbesen (1979) and Newtson (1973; 1976; 1977). The buttons which subjects pressed were connected to a computer that recorded the time elapsed between button presses in 1-s units. Subjects were instructed to place the apparatus containing the button under the table so as not to influence each other in their button pressing.

After the practice tape was over, subjects viewed the experimental tape and unitized. When the tape was completed, subjects were given rating sheets on which to make their ratings. Subjects were then informed that they would also be assessing the dimension they had not previously assessed and proceeded in the same manner as before (i.e., read the dimension definition, watched and unitized the tape, made a rating). Order of dimension assessed was randomly determined at the beginning of each session. Subjects were run in groups of two to eight.

As a manipulation check, a one-between, one-within analysis of variance was computed on the subjects' ratings of the two dimensions, F(1,48) = 150.33, p < .00001. Delivery received a mean rating of 1.84 while relevance received a mean rating of 4.44. No significant effects emerged due to the order in which subjects received dimension definitions and assessed performance. It was concluded that subjects were accurately perceiving the two dimensions exhibited in the tape.

No Prior Information Group

Subjects in the No Prior Information Group proceeded in a similar manner except that they were given no prior instructions regarding the performance dimensions they would be rating. The 54 subjects in this group were told that they would be watching a tape of a psychology instructor giving a lecture. Subjects first viewed and unitized the practice tape, after which they viewed and unitized the experimental tape. When the experimental tape was completed, subjects read a definition of either delivery or relevance and rated that dimension. Subsequently, they read a definition of the dimension they had not previously rated and made a second rating. The order in which subjects made the ratings was counterbalanced and was randomly determined at the beginning of each session. Subjects were run in groups of two to eight.



Incorrect Information Group

Using a counterbalanced design, the 52 subjects in this group were told that they were to view a videotape of an instructor giving a lecture in order to rate either: 1) the instructor's delivery or 2) the instructor's relevance. Prior to observation, subjects read a definition of the dimension they had been asked to rate; subjects in this group had no reason to suspect they were being given incorrect information. They then viewed the videotape. When the tape was completed, delivery-instructed subjects were told that they would instead be rating relevance, and relevance-instructed subjects were told that they would instead be rating delivery. Subjects then read a definition of the dimension they would actually be rating and made their ratings. Subjects were run in groups of 12 to 15. Groups were larger in this condition than in the first two conditions due to variation in the number of undergraduate psychology students who choose to participate in experiments in a given semester. Since all experimental sessions were conducted in the same room under conditions as identical as possible, and since subjects were not allowed to interact with each other, it is doubtful that the number of subjects per session introduced any bias.

Summary

Subjects in the Correct Information Group viewed the videotape twice and unitized performance both times. Subjects in the No Prior Information Group unitized performance while subjects in the Incorrect Information Group did not. The latter two groups of subjects viewed the videotape only once.

RESULTS

Units of Perception

In analyzing the unitization data, one option would be to compare subjects' unitization patterns with the points of intended behavior change in the videotape, since such great care was taken in its construction. However, the tape was not used as a criterion against which to compare units defined by subjects because it was constructed based upon only one individual's (the author's) subjective perceptions as to where button presses would most likely occur. Thus, it was decided to instead use aggregate data and compare unitization patterns across conditions.



Unitization patterns of delivery- and relevance-instructed subjects. Several analyses were performed to determine whether subjects instructed to rate delivery were attending to different aspects of the instructor's performance compared to subjects instructed to rate relevance. If the two groups of subjects were attending to different features, subjects in the delivery condition should exhibit a greater number of units in segmenting performance, since the videotape was constructed so that behavior on the delivery dimension changed at a faster rate than behavior on the relevance dimension. To test this hypothesis, a one-between, one-within analysis of variance was performed on the number of button presses subjects generated while viewing the 6'27" videotape.

Table 1 contains the results of this analysis. A significant main effect due to dimension observed was obtained: subjects used a greater number of units when observing performance to rate delivery (M = 17.72) than when observing performance to rate relevance (M = 8.16). No significant effects due to instruction order emerged.
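For readers who want to reproduce this kind of design, the following is a minimal sketch of a one-between (instruction order), one-within (dimension rated) ANOVA on unit counts, written in Python. The paper does not report what software was used; the data below are simulated and all column names are hypothetical.

import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
n = 50  # subjects in the Correct Information Group

subject = np.arange(n)
order = np.where(subject < n // 2, "delivery_first", "relevance_first")

# Simulated button-press counts, roughly matching the reported means
# (delivery M = 17.72, relevance M = 8.16); purely illustrative.
units_delivery = rng.normal(17.7, 7.0, n).round()
units_relevance = rng.normal(8.2, 5.0, n).round()

df = pd.DataFrame({
    "subject": np.tile(subject, 2),
    "order": np.tile(order, 2),                         # between-subjects factor
    "dimension": ["delivery"] * n + ["relevance"] * n,  # within-subjects factor
    "units": np.concatenate([units_delivery, units_relevance]),
})

# Mixed (one-between, one-within) ANOVA; the "dimension" row is the analogue
# of the dimension-rated effect reported in Table 1.
aov = pg.mixed_anova(data=df, dv="units", within="dimension",
                     subject="subject", between="order")
print(aov.round(3))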

The analysis of the number of button presses subjects generated is not sufficient evidence that subjects in the two conditions were attending to different performance features. It is possible that subjects in the delivery condition were dividing into smaller subdivisions the same units they used in the relevance condition (i.e., units used in the delivery condition are subsets of units used in the relevance condition). If subjects were attending to the same features of performance, the units produced in the relevance condition should coincide with those produced in the delivery condition.

Since subjects viewed the videotape under both sets of rating instructions, it was possible to determine the proportion of unit boundaries which coincided for each subject.

TABLE 1
Analysis of Variance: Number of Units Used by Subjects Receiving Correct Prior Information

Source                     df        MS        F         p
Order                       1    153.76     1.36       .25
Error                      48    112.89        -         -
Dimension rated             1   2284.84    45.35    .00001
Dimension rated x Order     1      4.00      .08       .78
Error                      48     50.37        -         -

Note. N = 50.


Using Cohen and Ebbesen's (1979) method, a "hit rate" was computed for each subject, using the following formula: the number of unit boundaries (i.e., button presses) that coincided across the two conditions divided by the total number of boundaries used in the relevance condition.

Since most subjects produced a smaller number of unit boundaries in the relevance condition (except for four subjects), a hit rate of 0.0 would indicate that no unit boundaries lined up and 1.0 would indicate that all relevance unit boundaries coincided with unit boundaries in the delivery condition.

Five estimates of hit rate were calculated, each based on intervals of a different size. For each estimate, the 387-s videotape was divided into continuous intervals of either 1-, 2-, 3-, 4-, or 5-s duration. A hit occurred whenever a subject produced a button press in the relevance condition that fell in the same (1-, 2-, 3-, 4-, or 5-s) interval as a button press produced in the delivery condition.
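As a concrete illustration, here is a minimal sketch of this hit-rate computation for one subject. The press times are invented for illustration; only the 387-s tape length, the interval sizes, and the relevance-condition denominator come from the text.

import numpy as np

def hit_rate(delivery_presses, relevance_presses, interval, tape_len=387):
    # Divide the tape into fixed-width intervals and record which intervals
    # contain at least one delivery-condition press.
    edges = np.arange(0, tape_len + interval, interval)
    delivery_bins = set(np.digitize(delivery_presses, edges))
    relevance_bins = np.digitize(relevance_presses, edges)
    # A hit: a relevance-condition boundary falls in the same interval
    # as a delivery-condition boundary.
    hits = sum(b in delivery_bins for b in relevance_bins)
    return hits / len(relevance_presses)

# Hypothetical press times (seconds from tape start) for one subject.
delivery = np.array([4, 12, 15, 33, 58, 90, 121, 160, 200, 245, 301, 350])
relevance = np.array([14, 60, 150, 246, 355])
for w in (1, 2, 3, 4, 5):
    print(f"{w}-s intervals: hit rate = {hit_rate(delivery, relevance, w):.2f}")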

Table 2 contains the mean hit rates for each of the five interval sizes. Hit rate varied from .05, for 1-s intervals, to .23, for 5-s intervals. Table 2 also contains the 95% confidence intervals around each of the hit rates and the number of subjects exhibiting hit rates of 0.0 for the five interval sizes. The confidence intervals show that the observed hit rates are significantly different (p < .05) from both 0.0 and 1.0.

An important issue in interpreting the hit rate data concerns which interval size is appropriate for estimating the "true" hit rate. The 1-s interval may underestimate true hit rate due to subjects' reaction times and to random error. On the other hand, the 5-s interval is probably too lenient, since it is highly likely that behavior changes relevant to both of the dimensions occur within 5 s of each other; a hit would be counted even if subjects were attending to different performance features.

Because of the above issues, either the 2-, 3-, or 4-s interval size is probably most appropriate for estimating hit rate, with the sizes varying in terms of conservativeness of estimation.

TABLE 2
Observed Hit Rates, Confidence Intervals for Observed Hit Rates, and Number of Subjects Exhibiting Hit Rates of 0.0

Interval    Observed    95% confidence interval    Subjects exhibiting
size        hit rate    for observed hit rate      hit rates of 0.0
1-s            .05           .014-.132                 33 (66%)
2-s            .11           .05-.21                   24 (48%)
3-s            .13           .08-.24                   17 (34%)
4-s            .17           .09-.28                   14 (28%)
5-s            .23           .14-.34                   14 (28%)


Even using the confidence interval upper bound of the more lenient 4-s interval, almost three-fourths of subjects' unit boundaries do not coincide across the two conditions. Since such low levels of agreement on unit boundaries were observed, it is concluded that subjects exhibited qualitatively different unitization patterns. The hypothesis that subjects assessing different dimensions attend to different aspects of performance is supported.

No prior information group. Using Newtson's (1973) procedure, "breakpoints" were determined separately for the two groups given correct prior information (delivery-instructed and relevance-instructed) and for the group receiving no prior information. Unitization data for the delivery and relevance groups are not independent, since subjects given correct information unitized for both dimensions; only data generated in the subjects' first round of unitization were used. If subjects in each group were attending to different aspects of performance, then a different set of breakpoints should be identified for the three groups.

The following strategy (Newtson, 1973) was used for each of the three sets of data. The videotape was divided into 129 continuous 3-s intervals. The number of subjects producing a button press within each interval was determined, and a mean and standard deviation were computed across all 129 intervals. A breakpoint was defined as any interval containing a number of button presses greater than or equal to the mean plus one standard deviation for button presses for the particular group. This criterion was used because the joint probability of two frequencies one standard deviation above the mean of a normal distribution is .028.

Using this procedure, any 3-s interval containing twelve or more button presses in the delivery condition (M + 1 SD = 12.14), six or more button presses in the relevance condition (M + 1 SD = 5.92), or nine or more button presses in the No Prior Information Group (M + 1 SD = 8.63) was considered a breakpoint. There were 27, 25, and 23 breakpoints in the three conditions, respectively.
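A minimal sketch of this breakpoint procedure follows. The pooled press times are simulated; only the 129 3-s intervals and the mean-plus-one-SD criterion come from the text.

import numpy as np

def breakpoints(press_times, n_intervals=129, interval=3):
    # Count pooled button presses per 3-s interval across all subjects
    # in a condition, then flag intervals at or above M + 1 SD.
    edges = np.arange(0, (n_intervals + 1) * interval, interval)
    counts, _ = np.histogram(press_times, bins=edges)
    threshold = counts.mean() + counts.std()
    return np.flatnonzero(counts >= threshold), threshold

# Hypothetical pooled press times (seconds) for one condition.
rng = np.random.default_rng(1)
press_times = rng.uniform(0, 387, size=800)
idx, thr = breakpoints(press_times)
print(f"{len(idx)} breakpoints; threshold = {thr:.2f} presses per interval")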

Of the 48 breakpoints defined in the relevance and No Prior Information Groups combined (25 in the relevance group + 23 in the No Prior Information Group), only five were common. By chance alone, 4.46 common breakpoints would be expected; the chi-square statistic (χ² = .065) was not significant. Eleven breakpoints were common among the 50 breakpoints defined in the delivery and No Prior Information Groups combined. By chance, 4.8 common breakpoints would be expected, and the resulting chi-square statistic (χ² = 8.31) was significant at the .01 level. Thus, the hypothesis that subjects in the No Prior Information Group would exhibit unitization patterns different from subjects in the two correct information groups receives mixed support.
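The chance-expected overlaps reported above are consistent with treating the two conditions as flagging 3-s intervals independently (e.g., 25 × 23 / 129 ≈ 4.46). A sketch of that computation, with a one-degree-of-freedom goodness-of-fit statistic, is below; it reproduces the reported χ² of .065 for the relevance comparison exactly, but comes out near, rather than exactly at, the published 8.31 for the delivery comparison, which may reflect rounding or a correction in the original analysis.

def breakpoint_overlap(n_a, n_b, n_common, n_intervals=129):
    # Expected number of shared breakpoint intervals if the two
    # conditions flag intervals independently.
    expected = n_a * n_b / n_intervals
    # One-df goodness-of-fit statistic on the common-breakpoint count.
    chi_sq = (n_common - expected) ** 2 / expected
    return expected, chi_sq

print(breakpoint_overlap(25, 23, 5))   # relevance vs. no prior information
print(breakpoint_overlap(27, 23, 11))  # delivery vs. no prior information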

A final analysis examined breakpoints common to the delivery and relevance conditions. Using the same procedures outlined before, 21 breakpoints were defined in the delivery condition and 25 breakpoints were defined in the relevance condition. Only two common breakpoints existed, and the chi-square test (χ² = 1.08) was nonsignificant. This result supports the previous conclusion that subjects in the delivery and relevance conditions attended to different aspects of performance.



Behavior content associated with unitization patterns. To further interpret the unitization results, the verbal and behavioral content of breakpoints and nonbreakpoints (defined as any 3-s interval containing a number of button presses less than or equal to the mean minus one standard deviation for the particular group) was examined for the two correct information groups and the No Prior Information Group. For the relevance condition, breakpoints occurred when the instructor made a transition from discussing theory to a student-oriented example demonstrating the theory, made the opposite transition, or used specific examples to explain concepts of the theory. Breakpoints for the delivery condition occurred when the instructor, for example, exhibited changes in eye contact, looked at her watch, shuffled her notes, leaned on the podium, lost her place, or appeared to lose her train of thought. For the No Prior Information Group, breakpoints occurred when the instructor introduced the theory, made a transition, or summarized, and when behavior changes (of a similar nature to those produced by delivery-instructed subjects) occurred.

It is concluded that subjects were attending to aspects of the instructor's performance relevant to the dimension they were rating. These analyses further suggest that No Prior Information Group subjects were attending to the structure or organization of the lecture and, in addition, like subjects in the delivery condition, to behavioral changes. The latter observation is consistent with the previous conclusion that delivery and comparison group subjects were attending to some of the same aspects of performance. See McDonald (1983) for a complete discussion of the content analysis of breakpoints and nonbreakpoints.

Summary. The results of the analyses of units of perception support three major conclusions. First, subjects instructed to assess delivery attended to different aspects of performance compared to subjects instructed to assess relevance. Second, relevance-instructed subjects attended to a different set of performance features compared to subjects receiving no prior information. Finally, delivery and No Prior Information Group subjects attended to some common performance features. The content analyses of breakpoints provide additional support for these conclusions.

Dimension Ratings

To determine whether the different instructional conditions affected the validity of ratings, subjects' ratings in the Correct Information Group, the No Prior Information Group, and the Incorrect Information Group were compared to the experts' ratings. T-tests were computed, and .01 was selected as the significance level appropriate for rejecting the null hypothesis, since several comparisons were to be made and a more lenient significance level could result in capitalizing on chance.



No significant differences emerged between experts' ratings and ratings made by subjects in the Correct Information Group. In the No Prior Information Group, subjects produced ratings significantly less positive on both delivery (t = 3.82, p < .001) and relevance (t = 4.2, p < .001) when compared to the ratings produced by experts.

In the Incorrect Information Group, subjects' ratings were also significantly less positive than experts' ratings on both delivery (t = 4.1, p ≤ .001) and relevance (t = 5.75, p < .001). More specifically, subjects who were led to believe that they would be rating one dimension, but unexpectedly were asked to rate the other, produced significantly lower ratings than the experts, who had received correct information regarding the ratings to be made. Finally, two other analyses were conducted to gain further insight into the unitization results. First, no significant differences were detected between ratings made by subjects in the No Prior Information Group and those in the Incorrect Information Group on either the delivery or the relevance dimension. Additionally, subjects' ratings of delivery in the Correct Information Group did not differ significantly from those produced by subjects in the No Prior Information Group. Table 3 contains the means and variances for the four groups.
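The group-versus-expert comparisons can be approximated from the summary statistics in Table 3. Below is a minimal sketch assuming independent-samples t-tests (the paper does not state the exact variant used); because the tabled means and variances are rounded, the result approximates rather than exactly reproduces the published t values.

from scipy import stats

# Table 3 summary statistics: experts vs. No Prior Information Group, delivery.
t, p = stats.ttest_ind_from_stats(
    mean1=2.3, std1=0.36 ** 0.5, nobs1=16,   # experts
    mean2=1.65, std2=0.53 ** 0.5, nobs2=54,  # No Prior Information Group
)
print(f"delivery, experts vs. no prior information: t = {t:.2f}, p = {p:.4f}")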

DISCUSSION

Unitization Results

The current study supports the contention that giving raters preobservational information affects raters' attention processes. Schema theory provides a useful heuristic for conceptualizing this process.

TABLE 3
Dimension Rating Means and Variances for Experts' Ratings, Correct Information Group, No Prior Information Group, and Incorrect Information Group

             Experts          Correct           No Prior          Incorrect
             (n = 16)         Information       Information       Information
                              (n = 50)          (n = 54)          (n = 52)
             Mean  Variance   Mean  Variance    Mean  Variance    Mean  Variance
Delivery     2.3     .36      1.84    .88       1.65*   .53       1.48*   .&I
Relevance    4.94    .86      4.44   1.66       3.76*  1.61       3.1*    2.6

Note. * Significantly different from experts' ratings (p < .01).


Using this conceptualization, the results of the study are consistent with the idea that subjects in the various observational conditions activated different schemata which caused them to attend to different features of performance. Subjects who viewed the videotape in order to rate the instructor's delivery exhibited qualitatively different unitization patterns compared to subjects who viewed the videotape in order to rate relevance. The hit rate measure indicated that subjects were attending to different features of performance rather than segmenting the same performance into smaller units. Finally, the breakpoint data provide further evidence for attentional differences.

In contrast, subjects instructed to assess delivery exhibited unitization patterns somewhat similar to those of subjects in the No Prior Information Group. First, the number of breakpoints common to the two groups exceeded chance. Second, there were some similarities in breakpoint content for the two groups. When Newtson's (1976) original research is considered, these results can be tentatively explained. Like subjects in the No Prior Information condition, Newtson's subjects were instructed simply to press the button when they perceived one meaningful behavior to have ended and another to have begun. Upon examination of breakpoints, Newtson found that subjects were monitoring overt features of performance (i.e., breakpoints occurred when the actor exhibited physical changes).

Subjects instructed to assess delivery were also attending, in addition to other aspects of performance, to physical changes in the instructor's lecture (e.g., eye contact, use of notes, posture). If subjects receiving no prior information also attend to physical changes in the instructor's performance, some similarities in unitization patterns across the two groups should occur. Perhaps when raters are not knowledgeable of the dimensions on which they will rate performance, they attend only to obvious, overt, behavioral aspects of performance and not to more subtle behaviors. Future research is necessary to determine the nature of the schemata raters typically evoke when they observe performance without a particular dimension in mind. Other areas which deserve attention include the effect of appraisal purpose on schematic processing and improving the interpretation of unitization data.

Dimension Ratings

The results of this study support the notion that when raters do not evoke appropriate observational schemata, they produce less valid ratings. Subjects receiving incorrect or no information produced significantly lower ratings compared to experts' ratings and to ratings made by subjects receiving correct information. As in Cohen and Ebbesen's (1979) study, attentional differences can be tied to other outcomes, in this case differences in rating quality.


One result that is difficult to explain concerns the ratings and unitization patterns produced by subjects rating delivery in the Correct Information and No Prior Information Groups. The two groups' unitization patterns had significant similarities, and their ratings of delivery did not differ significantly. However, the No Prior Information Group's ratings of delivery were significantly lower than experts' ratings while the Correct Information Group's were not. Perhaps the two groups' unitization patterns are similar enough (11 common breakpoints out of the 50 defined by the two groups combined) that subjects in the two groups were attending to enough common performance features to produce similar ratings. However, since subjects in the two groups were also attending to different performance features (39 of the 50 breakpoints defined by the two groups were not common), perhaps subjects in the Correct Information Group were attending to some features of performance similar to those the experts attended to, since their ratings were not significantly different from experts' ratings. Conversely, subjects receiving no prior information may have been attending to performance features different from those attended to by the experts and thus produced ratings significantly different from experts' ratings. Future research must determine why raters who receive no prior information, or who receive incorrect information regarding the dimensions to be rated, produce less positive ratings compared to correctly informed raters. Perhaps in the absence of correct information, negative aspects of performance become more salient than positive ones, and ratings are affected as a consequence.

Implications

The results of the study have important implications for performance appraisal practices. The findings suggest that in order to maximize the probability of receiving valid ratings from raters, organizations should make sure that raters have correct information regarding the dimensions of performance they will be asked to rate. This suggestion is not new; since the emergence of Wherry's theory of rating (1952), researchers (e.g., Bernardin and Beatty, 1984; Borman, 1979; Landy and Farr, 1983; Latham, Wexley, and Pursell, 1975) have advocated forewarning raters concerning the nature of the performance dimensions they will be asked to rate. Research examining the rating process provides further support that informed raters produce higher quality ratings. For example, Gordon (1970) found that raters who were more experienced with a rating instrument produced more accurate ratings. Friedman and Cornelius (1976) found that participation in rating scale development improved rating accuracy. The current study is consistent with previous research underscoring the importance of forewarning raters concerning the ratings they will be asked to make.



The results of the study also provide stronger evidence for the validity of the unitization measure. First, in previous research using the measure (Newtson, 1973; Cohen and Ebbesen, 1979), no predictions were made regarding the features of behavior subjects would attend to or the meaning of different unitization patterns. In the current study, predictions were made and were supported by the results. Second, Newtson (1973) proposed that sets or expectancies can affect which features of performance are monitored. This proposition was supported here, since the dimension to be assessed affected what subjects attended to. Related to this, since the unitization measure was shown to behave in a systematic manner, new possibilities emerge for using it as a research tool. For example, it may be used in a diagnostic sense to determine whether raters are attending to relevant performance features. A second possible use of the measure is in assessing the success of rater training programs. Finally, the measure may be used to determine whether raters are using automatic or controlled processing of behavior, as described by Feldman (1981).

Limitations

Factors regarding the design of the study serve to limit its internal validity. First, the assumption is made that subjects in the Correct and Incorrect Prior Information Groups produced similar unitization patterns. Although there is no reason to conclude that subjects in the two groups produced different unitization patterns, since they received identical observational instructions, the assumption cannot be fully supported because subjects in the latter group did not unitize. This assumption is important because the study attempts to link rating quality with unitization patterns. Another issue relates to the fact that subjects in the Correct Information Group viewed the videotape twice while the other two groups viewed it only once. Although the study was designed in this manner so that correctly informed subjects' unitization patterns could be compared across the two instructional conditions, subjects' ratings and unitization patterns on the second round could have been affected.

Third, it should be noted that since subjects in the Incorrect Information Group did not unitize performance, while those in the No Prior Information Group did, ratings could have been affected by this design difference. However, ratings on both the delivery and relevance dimensions produced by these two groups were not significantly different. Thus, it appears that the unitization measure is not reactive.

Other limitations concern the issue of external validity and thus the generalizability of results. The task performed by subjects in the study is different, in many respects, from the task performed by raters in organizations. In organizations, raters have many opportunities to observe an employee's performance relevant to a large number of dimensions. Further, in organizations, there is typically a longer interval of time during which performance-relevant information can decay. Finally, the unitization measure introduces a degree of artificiality into the rating situation that is not present when raters in organizations observe behavior. All of these differences could serve to limit the generalizability of results. However, it is felt that at this stage of investigation, the control offered by the laboratory outweighs the shortcomings in terms of external validity.



REFERENCES

Bartlett, F. C. (1932). Remembering. Cambridge: Cambridge University Press.

Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston: Kent Publishing Company.

Bobrow, D. G., & Norman, D. A. (1975). Some principles of memory schemata. In D. G. Bobrow & A. Collins (Eds.), Representation and understanding: Studies in cognitive science. New York: Academic Press.

Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238-252.

Borman, W. C. (1979). Format and training effects on rating accuracy and rating errors. Journal of Applied Psychology, 64, 410-421.

Cohen, C. E. (1981). Person perception: Making sense from the stream of behavior. In N. Cantor & J. F. Kihlstrom (Eds.), Personality, cognition, and social interaction. Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, C. E., & Ebbesen, E. B. (1979). Observational goals and schema activation: A theoretical framework for behavior perception. Journal of Experimental Social Psychology, 15, 305-329.

Denisi, A. S., Cafferty, T. P., & Meglino, B. M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, 360-396.

Feldman, J. M. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 302-307.

Feldman, J. M. (1986). A note on the statistical correction of halo error. Journal of Applied Psychology, 71, 173-176.

Friedman, B. A., & Cornelius, E. T., III. (1976). Effect of rater participation in scale construction on the psychometric characteristics of two rating scale formats. Journal of Applied Psychology, 61, 210-216.

Gordon, M. E. (1970). The effect of the correctness of the behavior observed on the accuracy of ratings. Organizational Behavior and Human Performance, 5, 366-377.

Ilgen, D. R., & Feldman, J. M. (1983). Performance appraisal: A process focus. In B. M. Staw & L. Cummings (Eds.), Research in organizational behavior (Vol. 5). Greenwich, CT: JAI Press.

Landy, F. J., & Farr, J. L. (1983). The measurement of work performance. New York: Academic Press.

Latham, G. P., Wexley, K. N., & Pursell, E. D. (1975). Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60, 550-555.

McDonald, T. (1983). The effect of dimension content on observation of job performance and ratings. Unpublished doctoral dissertation, Ohio State University, Columbus.

Neisser, U. (1976). Cognition and reality: Principles and implications of cognitive psychology. San Francisco: Freeman.

Newtson, D. A. (1973). Attribution and the unit of perception of ongoing behavior. Journal of Personality and Social Psychology, 28, 28-38.

Newtson, D. A. (1976). Foundations of attribution: The perception of ongoing behavior. In J. H. Harvey, W. J. Ickes, & R. F. Kidd (Eds.), New directions in attribution research (Vol. 1). Hillsdale, NJ: Lawrence Erlbaum Associates.

Newtson, D., Enquist, G., & Bois, J. (1977). The objective basis of behavior units. Journal of Personality and Social Psychology, 35, 847-862.

Norman, D. (1976). Memory and attention. New York: Wiley.

Wherry, R. J. (1952, February). The control of bias in rating: A theory of rating. Washington, DC: Department of the Army, Personnel Research Section.

Williams, K. J., Denisi, A. S., & Blencoe, A. G. (1985). The role of appraisal purpose: Effects of purpose on information acquisition and utilization. Organizational Behavior and Human Decision Processes, 35, 314-339.

Zedeck, S., Jacobs, R., & Kafry, D. (1976). Behavior expectations: Development of parallel forms and analysis of scale assumptions. Journal of Applied Psychology, 61, 112-116.

RECEIVED: July 10, 1987