a case for professional judgment when combining marks
TRANSCRIPT
This article was downloaded by: [TIB & Universitaetsbibliothek] On: 28 October 2014, At: 05:39. Publisher: Routledge. Informa Ltd Registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Assessment & Evaluation in Higher Education. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/caeh20
A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS. Geoffrey Isaacs (a) & Bradford W. Imrie (b). (a) University of Queensland. (b) Victoria University of Wellington. Published online: 28 Jul 2006.
To cite this article: Geoffrey Isaacs & Bradford W. Imrie (1981) A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS, Assessment & Evaluation in Higher Education, 6:1, 3-25, DOI: 10.1080/0260293810060102
To link to this article: http://dx.doi.org/10.1080/0260293810060102
Assessment and Evaluation in Higher Education Vol.6 No.1 March 1981, pp.3-25
A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS
Geoffrey Isaacs, University of Queensland
Bradford W. Imrie, Victoria University of Wellington
ABSTRACT
Issues raised in a paper by Moss (1977) in this Journal are discussed
in this paper, which describes a strategy for comparing and combining marks
on the basis of rational analysis and professional judgment. Consideration
is given to the interpretation and implications of mean values, standard
deviations, correlations and shapes of mark distributions.
The data presented by Moss (1977) are analysed and procedures are discussed
which discriminate between nominal and effective weightings (Willmott
and Hall, 1975) with reference to the effects of standard deviation and
correlation. This approach may also be used for combining sets of marks
which are not equally weighted and is potentially more useful to the teacher
than standardising procedures. In some circumstances combination may be
judged inappropriate and association of marks (in a profile) might be the
preferred procedure. All such procedures demand the professional judgment
and expertise of teachers, to utilise fully the information available in the
raw marks of student performance.
INTRODUCTION
One of the most important professional responsibilities of academic staff
is that of allocating and reporting marks (or grades) as measures of student
performance. It is often the case that, after allocation of marks for the
different parts of a student's performance, those marks are combined to produce
the final mark (or grade) which is published. There is usually a predetermined
procedure for combining such marks using nominal weighting factors;
these may not be known to the students. (For example see the brief report by
Burnett and Cavaye (1) in the previous issue.)
Marks may be combined for different assessment purposes:
(a) Within an individual test or examination: for example, in a multiple choice test of 40 items, the test score would be the number of items correct, each with a weight of unity regardless of differences in difficulty (or facility); or, for a 3-hour paper requiring six questions to be attempted out of 10, students' marks would be reported as the total of scores on each question, with each question weighted equally regardless of differences in difficulty and selection.

(b) For a course, with student performance assessed at different points during the course and at the end of the course, the different component marks will be weighted and combined to give an overall mark. The mark may be used to decide entry to another course and, sometimes, eligibility for scholarships or bursaries.

(c) On completion of a degree programme, the marks from different courses may be combined (with due weighting) for decisions about degree classification, scholarships, prizes and, of course, employment opportunity.
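In each of purposes (a) to (c) the combination amounts to a weighted average of component marks. A minimal sketch of such a combination (ours, not the paper's; the function name is illustrative only):

```python
# Minimal sketch (not from the paper) of combining component marks
# with nominal weights; all names here are illustrative.

def combine(marks, weights):
    """Weighted overall mark from component percentages.

    marks   -- dict: component name -> percentage mark
    weights -- dict: component name -> nominal weight (fractions summing to 1)
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    total = sum(weights[k] * marks[k] for k in marks)
    return int(total + 0.5)   # round half up, as in Table 1 of the case study

# Student 1 of the Moss case study, with equal nominal weights:
overall = combine({"M": 75, "T": 74, "W": 47, "P": 44},
                  {"M": 0.25, "T": 0.25, "W": 0.25, "P": 0.25})
```

As the paper goes on to argue, these nominal weights need not equal the weights the components actually exert on the overall mark.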
For all such purposes combining marks requires the careful exercise of
professional judgment based on information which is readily obtainable to
assist in the interpretation of marks. For sets of marks which are to be
combined, professional judgment requires consideration of:
(a) descriptions of the sets of marks in terms of mean, standard deviation, correlation and shape of distribution;

(b) whether the nominal (or intended) weighting is consistent with the effective (or actual) weighting as determined by correlation and standard deviation (6);
(c) procedures for comparing and combining sets of marks.
To illustrate the significance of these considerations, this paper
develops a case study based on the data presented in a paper by Moss (2)
which may be referred to for comparison and contrast.
ASSESSMENT ISSUES
In his paper (2), published in this Journal, Moss describes the doubts
that staff (of a Zoology Department) had in accepting that students seemed
to be achieving better results in a redesigned course. The assessment sys-
tem had been changed from one "based mainly on terminal examinations" to one
in which,
regular tests were introduced on a weekly basis, most of these being multiple choice tests (M) but including fewer tutor marked tests (T). In addition, the students were expected to complete a three hour written exam (W) and a
practical exam (P). Each of the four components would count equally in the final assessment. (p.164)
Continuous Assessment (C) - 50 per cent:
    Multiple Choice Tests (M) - 25 per cent
    Tutor Marked Assignments (T) - 25 per cent

Terminal Assessment (E) - 50 per cent:
    Written Examination (W) - 25 per cent
    Practical Examination (P) - 25 per cent

Figure 1: Assessment structure
Details of the assessment are summarised in Figure 1. With the wisdom
of hindsight we will consider some issues underlying not only Moss's paper
but also the inevitable confrontations which teachers have, as examiners,
with marks dubious in their derivations and fallacious in their final com-
binations. Firstly the issue which seems to have been posed by Zoology
staff to Moss (2):
Issue 1
'We think that the new assessment system makes it easier for students to pass - the 'easiness' is due to the continuous assessment component of the final mark.'
Marks for each of the four assessment components are shown in Table 1.
Each mark is a percentage and the overall score, 0, is the average of the
four components, rounded to the nearest whole number.
As Moss (2) notes, "the traditional pass/fail mark is 40% ... on all
components". The numbers of students failing on this 'standard' (and also
50 per cent) for each component and for the course overall, are shown in
Table 2.
These results do seem to support the critics' claim that continuous
assessment did make it easier for students to pass. A summary of descrip-
tive statistics (Table 3) provides more information about the sets of marks
under consideration.
The mean marks for the two continuous assessment components (M and T)
are higher than the mean marks for the two terminal assessments (W and P).
This too seems to support the claim that the continuous assessment components
are easier than the terminal assessment components. Indeed, a student who
came last in all four assessments (none did so) would have scored 43 per cent,
comprising 27.5 from continuous assessment and 15.5 per cent from terminal
assessment.
Student No.    M     T     W     P     0
     1        75    74    47    44    60
     2        69    71    50    70    65
     3        76    66    43    52    59
     4        73    64    41    60    60
     5        54    57    51    47    52
     6        78    66    48    58    63
     7        70    67    35    58    58
     8        71    62    48    78    65
     9        59    63    39    55    54
    10        58    63    40    38    50
    11        64    63    50    60    59
    12        57    56    32    40    46
    13        67    62    42    50    55
    14        80    70    57    71    70
    15        64    64    57    65    63
    16        71    60    55    53    60
    17        77    71    52    73    68
    18        63    60    39    56    55
    19        75    69    52    74    68
    20        69    69    48    64    63
    21        72    58    30    55    54
    22        56    68    48    53    56
    23        64    69    47    51    58
    24        68    59    34    58    55
    25        71    60    50    37    55
    26        68    60    62    61    63
    27        64    72    34    32    51
    28        73    63    59    55    63
    29        80    71    43    69    66
    30        60    56    43    58    54

M = Multiple Choice; T = Tutor marked; W = Written exam; P = Practical exam;
0 = Overall score = (M + T + W + P)/4, rounded to the nearest whole number.

Table 1: Raw score data (percentages)
Assessment    Number of students failing    Percentage of group (N=30)
Component     <40 per cent   <50 per cent   <40 per cent   <50 per cent
    M               0              0              0              0
    T               0              0              0              0
    W               7             19             23             63
    P               3              6             10             20
    0               0              1              0              3

Table 2: Students scoring less than 40 per cent and less than 50 per cent for assessment components and overall.
Assessment
Component      Mean      Minimum   Maximum   Range    Standard Deviation (SD)
    M          68.20        54        80       27            7.25
    T          64.43        56        74       19            5.13
    W          45.87        30        62       33            8.25
    P          56.50        32        78       47           11.35
 (Mean)       (58.75)*                       (31.5)         (8.0)
    0          58.93*       46        70       25            5.87

*Note: the mean of the mean component marks is not precisely equal to the mean overall mark because overall marks were rounded to whole numbers before the mean was calculated.

Table 3: Summary of descriptive statistics for raw marks.
A measure of spread is provided approximately by the range and more
accurately by the standard deviation as a measure of the variation of each
set of marks about its mean. The variations in spread are evident by either
measure. Further, the mean standard deviation (~8.0) is greater than the
standard deviation overall of the combined marks (~5.9) thus indicating that
the component marks are not perfectly correlated (r < 1) - see Table 4.
Both spread and correlation need to be considered when comparing and combin-
ing sets of marks.
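The figures in Table 3 can be reproduced directly from the raw marks of Table 1. A short sketch (ours, not the authors'), which assumes the sample (n - 1) standard deviation since that reproduces the quoted values:

```python
# Recomputing Table 3 from the Table 1 raw marks (transcribed below).
from statistics import mean, stdev   # stdev uses the sample (n - 1) definition

M = [75, 69, 76, 73, 54, 78, 70, 71, 59, 58, 64, 57, 67, 80, 64,
     71, 77, 63, 75, 69, 72, 56, 64, 68, 71, 68, 64, 73, 80, 60]
T = [74, 71, 66, 64, 57, 66, 67, 62, 63, 63, 63, 56, 62, 70, 64,
     60, 71, 60, 69, 69, 58, 68, 69, 59, 60, 60, 72, 63, 71, 56]
W = [47, 50, 43, 41, 51, 48, 35, 48, 39, 40, 50, 32, 42, 57, 57,
     55, 52, 39, 52, 48, 30, 48, 47, 34, 50, 62, 34, 59, 43, 43]
P = [44, 70, 52, 60, 47, 58, 58, 78, 55, 38, 60, 40, 50, 71, 65,
     53, 73, 56, 74, 64, 55, 53, 51, 58, 37, 61, 32, 55, 69, 58]

# Overall score: average of the four components, rounded half up (Table 1).
O = [(m + t + w + p + 2) // 4 for m, t, w, p in zip(M, T, W, P)]

sds = [stdev(x) for x in (M, T, W, P)]
# The mean component SD (about 8.0) exceeds the SD of the combined marks
# (about 5.9), so the components cannot be perfectly correlated.
mean_sd, overall_sd = mean(sds), stdev(O)
```

The last comparison is the point made above: a combined SD below the mean component SD is itself evidence that r < 1 among the components.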
What does "easier for students to pass" mean? Let us accept for the
moment that 40 per cent (or, indeed, 50 per cent) is some kind of naturally
ordained discontinuity which is used as a pass level. Is it true that the
continuous assessment components enabled previously ordained failures to
pass? In this regard we find it puzzling that the only change discussed in
the paper (2) is a change in assessment method. No other curricular factors
seem to have been considered with reference to teaching and
learning. It seems evident that weekly testing will have influenced student
learning patterns with (presumably) regular feedback from the test experience.
Why is it that when there is failure we blame the students, but when they
pass we suspect the assessment methods? In this case we have new assessment
procedures which, because they are spread over the whole course, test what
the students know, using techniques which become part of the routine of a
student's learning experience. In contrast, we have the traditional three
hour, end-of-course examination (without feedback) which can sample only a
small part of the course content and objectives. As such it is more likely
to measure what students don't know. Should we be surprised if the probabi-
lity of higher scores increases with the introduction of continuous assess-
ment? Since the distribution of learning changed because of changes in the
assessment procedure, so too did the distribution of student performance.
In addition since continuous assessment involved more measurements of student
ability, the marks were likely to be more reliable than a single terminal
written or practical examination.
Elton and Laurillard (3) agree with Himmelweit (4) that examinations
act as a signal to the student, or exert a trigger effect, and suggest prag-
matically that "the quickest way to change student learning is to change the
assessment system". Probably this is what has happened with the Zoology
course. The issue should be 'Does continuous assessment make it easier for
students to learn?'
Issue 2
When various assessment components are intended to 'count equally', what are the implications?
Students might believe that this means that each of the four assessment
components is equally important in determining the overall mark (or grade)
and that each component requires a similar amount of effort to gain a similar
mark. Students do begin their courses with different amounts of skill,
ability and knowledge and it is not suggested that equal efforts will be
rewarded equally for all students. What is suggested is that if one student
puts equal effort into all components the marks obtained on each shall be
about the same, other things being equal, for each component, i.e.:
the quality of the examining (level of difficulty, reliability of marking, validity of tasks, emphasis of importance for weighting);

the quality of the teaching (in lectures and in the laboratory);

the course design such that similar skills and abilities are
included in each learning/assessment component (otherwise the student's entry attributes of skill, ability and knowledge will not be equal for each component).
It is a matter of professional judgment to assess these considerations
e.g. is it likely that students will put more effort (once) into preparing
for the final examination than (regularly) for the continuous assessment
tests? It cannot be assumed, of course, that each student did put equal
effort into each component. It is not unreasonable, however, to expect that
other things (once again) being equal, the various inequalities of effort
will cancel out over the class as a whole. That is, the distributions of
students over the possible marks should be similar. In particular, their
means and variances should be equal.
It is evident from Table 3 that this is far from true either for means
or variances (as measured by their square roots, the standard deviations).
Equating the means of each component to the overall mean would change each
student's overall mark by (-9.27 -5.5 +13.06 +2.43)/4 = 0.18. (The change
would be zero if the overall marks were exact averages of the four compon-
ents; in fact the overall marks are rounded to the nearest whole number
and 0.18 is the average round-off error.) Changing the mean of any or all
of the component sets of marks by addition or subtraction, has no effect on
the rank position of each student's overall mark.
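That rank invariance is easy to verify: adding a constant to one component set shifts every overall mark by the same amount and so cannot reorder students. A small sketch with hypothetical marks (ours):

```python
# Sketch (hypothetical data): shifting a component's mean leaves the
# rank order of overall marks unchanged.

def ranks(scores):
    """Rank positions, 1 = highest; ties broken by list position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

A = [75, 54, 64]                       # one component, three students
B = [47, 51, 50]                       # a second component
overall = [(a + b) / 2 for a, b in zip(A, B)]

shift = -7.5                           # any constant added to component A
shifted = [(a + shift + b) / 2 for a, b in zip(A, B)]
```

The same does not hold for changes of spread, which is why the standard deviations are discussed next.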
It is also evident from Table 3, with reference to Table 1, that the
spread of marks on each component has an effect on their influence or weight
in the overall mark. For example, a student who scored 15 marks above the
lowest mark on each component would rank:
equal 16th on the multiple choice test component (M),
equal 4th on tutor-marked assignments (T),
about 18th on the written examination (W),
and 25th on the practical examination (P).
Provided 'other things' are equal it would seem that we should equate
the standard deviations of the components in order to equate the weight
each has in the overall mark (0S). This is what Moss (2) attempts to do by
what he calls 'normalising' the scores. The procedure used is, in fact,
that of converting each component mark to a normalised standard T-score (or
Z-score) (5) so that each scaled set of marks has a mean of 50 and a stand-
ard deviation of 10. It should be noted that:
If suffix 's' is used to distinguish the standardised marks from the raw marks then, for each student (cf. Table 3),
4 x 0s = Ms + Ts + Ws + Ps

or 4 x 0s = 200 + 10 x [ (M - 68.20)/7.25 + (T - 64.43)/5.13 + (W - 45.87)/8.25 + (P - 56.50)/11.35 ]
The component means (50) and standard deviations (10) may not correspond to the considered judgment of staff concerned about comparing and combining these component sets of marks. Pass/fail decisions are obviously affected by the mean selected for standardisation. In fact the standardised overall set of marks has a mean 0s = 50.02 and a standard deviation SD = 7.09, not 10 as might have been expected or intended.

Again the point is made that since the standard deviation of the overall (combined) set of marks is not equal to the mean of the standard deviations of the component sets, the correlations among the components and between each component and the overall set will be less than unity, i.e. they are not perfectly correlated. Correlation coefficients are given in Table 4 and in Figure 3. Figure 2 summarises the nomenclature used for the various correlation coefficients.

As a consequence of these variations in correlation, although the T-score standardisation assigned equal standard deviations to each component set of scores, the effective weightings (6) of the standardised components compared with the initial raw sets of marks (calculated as illustrated in Table 5) are:

                  M          T          W          P
Standardised    27.4       23.6       22.7       26.3   per cent
Raw             23.2       12.2       23.8       40.8   per cent

The outcome is certainly closer to the intended equal weighting of the components in the overall set of marks. The remaining variation in the weighting of the standardised scores is due to the effect of correlation.
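Moss's 'normalising' step can be sketched as follows (our illustration; the helper name and sample data are ours, and the sample standard deviation is again assumed):

```python
# Sketch of T-score standardisation: rescale a set of marks to a chosen
# mean (50) and standard deviation (10).
from statistics import mean, stdev

def t_scores(raw, target_mean=50.0, target_sd=10.0):
    m, s = mean(raw), stdev(raw)
    return [target_mean + target_sd * (x - m) / s for x in raw]

marks = [47, 50, 43, 41, 51, 48, 35, 48, 39, 40]   # illustrative raw marks
scaled = t_scores(marks)
```

Note that while each scaled component set has SD 10 by construction, the overall set formed from them does not (7.09 in this case study), for the correlation reasons just given.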
Issue 3
What do the results of the new assessment system tell us about student achievement and its assessment in this course?
A correlation coefficient (such as Pearson's r) provides an index of
the relation of one set of scores or marks to another. In this case if
'similar' abilities (skills and knowledge) are being tested in each compon-
ent then correlation needs to be considered when comparing and combining
marks. There are three possibilities:
r = 1: perfect correlation among the sets of component marks and (therefore) between the components and the overall marks; for such a condition a student's rank in class would be determined by any one of the component sets and the others would be redundant (but perhaps educationally desirable, cf. Burnett
and Cavaye (1)) since each assessment component would measure exactly the same performance standard; weighting is related directly to standard deviation ratios.
0 < r < 1: positive correlation is the usual relationship and applies in this case; the size of r indicates how much one pair of components might have in common and can therefore assist in the interpretation of one set of marks compared with another; the effective weighting is now dependent on the product of the standard deviation of each component and the correlation coefficient of that component with the set of overall marks (6).
r ≤ 0: zero and negative or inverse correlation between two sets of marks obtained by the same group of students indicates that very different characteristics or attributes have been measured, with the same students scoring highly in one and low in the other, and vice versa; while it is not logical to combine two such sets of marks it is reasonable to associate the marks by reporting them together (e.g. in the form of a profile (7)).
r              Symbol used for the Pearson product-moment coefficient of correlation as a measure of the relation between pairs of measures for individuals in a group. Suffixes indicate particular measures.

rMT, rWP, etc.   Correlations between component sets of raw marks.

rM0, rT0, etc.   Correlations between a component and the set of raw overall marks.

rM0-, rT0-, etc. Correlations between a component and the set of raw overall marks with that component's contribution removed (thus rM0- is the correlation of M with the overall score less M's contribution, because 0 = (M + T + W + P)/4).

rM0s, rT0s, etc. Correlations between a component and the set of overall marks derived from standardised components.

rCE            Correlation between continuous assessment component (M+T) and end-of-course assessment component (W+P).

Figure 2: Correlation nomenclature.
In themselves, correlation coefficients provide only an indication of
relationship, the reality and meaning of which must be assessed using
professional judgment. Correlation coefficient nomenclature for the example
under consideration is given in Figure 2. Correlation coefficients are
summarised in Table 4 and presented as a correlation network in Figure 3.
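For reference, the Pearson product-moment coefficient underlying Table 4 can be computed directly from its definition; a sketch (ours, with deliberately simple illustrative data rather than the case-study marks):

```python
# Sketch: Pearson product-moment correlation between two sets of marks
# for the same students.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Perfectly related sets give r = 1; perfectly inverted sets give r = -1.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```

Applied to the Table 1 columns, this is the calculation that produces the coefficients summarised below.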
RAW MARKS - among components:
    rMT = 0.48
    rWP = 0.47
    rMW = 0.40
    rPT = 0.25
    rTW = 0.24
    rMP = 0.18
    rCE = 0.43*   (C = M + T; E = P + W)

RAW MARKS - between component and overall scores*:
    rM0 = 0.74    rT0 = 0.55    rW0 = 0.67    rP0 = 0.83

RAW COMPONENT MARKS AND RAW OVERALL SCORES LESS EACH COMPONENT
(those for standardised marks are almost identical to these):
    rM0- = 0.54   rT0- = (illegible in source)   rW0- = 0.39   rP0- = 0.53

STANDARDISED COMPONENT MARKS AND OVERALL SCORES DERIVED FROM THEM:
    rM0s = 0.78   rT0s = 0.67   rW0s = 0.64   rP0s = 0.74

* Unrounded scores were used here (see note to Table 3).

Table 4: Correlation coefficients.
The following points have significance for interpreting the measures
of achievement represented by the four component sets of marks and the
overall marks.
(a) None of the components has very much in common with any of the others (all correlations between components are small).

(b) The 'usefulness' of high correlations depends on whether the assessment objectives of the components are, and are intended to be, homogeneous or heterogeneous. Briefly, high correlations are desirable if the objectives are intended to be homogeneous, and low correlations are desirable if the objectives are intended to be heterogeneous. The other two possibilities indicate problems. Cronbach (8) provides an interesting discussion of this matter which needs to be considered when deciding whether components should be combined to give one representative mark, or associated perhaps in the form of a profile with minimum levels of competency identified.

(c) Do the new continuous components of assessment (M and T) measure the same kinds of things as the traditional components (W and P) or their sum? This is a question of validity. Are the new assessment components valid measures of the learning outcomes intended by the staff? Or (a different question) is the 'new' continuous assessment a valid measure of what the 'old' assessment seemed to measure? Ebel (5) identifies ten types of validity grouped into two categories of primary or direct validity, and secondary or derived validity. He offers a definition of direct validity which depends primarily on rational analysis and professional judgment, i.e.

"A test has direct primary validity to the extent that the tasks included in it represent faithfully and in due proportion the kinds of tasks that provide operational definitions of achievement." (p.381)

For example, it would seem that W and P are not directly valid measures of the skills (written or practical) tested by each other. The correlation coefficient may indicate that the written examination required knowledge derived from the laboratory experience or vice versa - a matter of professional judgment about derived validity. Likewise the continuous assessment components (C) do not seem to be a valid measure of performances tested by terminal assessment (E), and vice versa, since rCE = 0.43.

[Figure 3: Correlation networks - two network diagrams, 'Among components' and 'Between components and overall marks', displaying the correlation values of Table 4.]
Some explanations of the low correlations and the associated problems might include:

any or all of the assessment components were unreliable in their design (despite the title of the paper (2) no information is included about reliability);

the skills, abilities and knowledge measured by each test might not have much in common (including the skills of demonstrating achievement under the conditions of each test, obviously different between T and W);

variations due to the test conditions, including time constraints, coverage of syllabus sampled, and marking (e.g. differences between the objective marking of multiple choice tests and of assignments marked by different tutors).
These considerations relate basically to the quality of the 'signal'
which depends on the validity of assessment, and the quantity of 'noise'
which obscures meaning and makes interpretation difficult. Scaling, stan-
dardising or combining cannot remove 'noise' due to deficiencies in assess-
ment design, moderation and marking.
THE SHAPE OF A SET OF MARKS
Equating the means and standard deviations of the distribution of marks
for each component will not necessarily make these distributions identical.
As is shown below, such differences can be of practical importance. Thus,
if the shapes differ significantly the cause should be sought. Figure 4(a)
gives two contrasting examples with equal standard deviations:
Test A has a long 'tail' of marks below the mean;
Test B has a long 'tail' of marks above the mean.
In Figure 4(b) the means have also been equated. Note that the con-
trasting shapes have not been changed. Neither distribution is very good
for discriminating between students. In Test A, for example, a student
whose mark changed from 54 to 60 (only one half of a standard deviation)
would improve by eight places in a class of 30.
Moreover, careful analysis and judgment are required before decisions
are made about the combination of such sets of marks. Test A spreads low
scorers and Test B spreads high scorers. Regardless of the correlation
between the tests, information will probably be lost about differences
between students when the marks are combined. Returning to our original
example, from Table 1 we note that 22 out of 30 students have tied overall
[Figure 4: Mark distributions. (a) Two sets of marks, Test A and Test B, with differing means but identical standard deviations (about 12 marks); Test A has a long tail of marks below its mean, Test B a long tail above. (b) The same sets scaled to equal means (about 52.5 marks) and the same standard deviations; the contrasting shapes are unchanged.]
marks. Many of these ties will stem from the combination of components which
are largely unrelated (and thus may work to counteract each other). Some may
stem from variations in distribution shape. Unfortunately, no amount of
scaling of any kind can supply information which is not contained in the raw
mark set. In particular, with T (tutor marked component), reasons for the
small spread should be sought before scaling.
Professional judgment should be used to decide whether the 'compact'
grouping represents a real lack of information about differences among
students (e.g. because T is based on a few items) or whether the information
is there but compressed due to markers not using the whole range of possible
marks. In the former case rescaling is out of the question and re-examining
students or some other breach of the grading contract may be required.
Issue 4
How should the final set of marks be derived from sets of component marks?
As indicated in this paper the implications of shape of distribution,
mean, standard deviation and correlation, should be considered. The "all
things being equal" argument suggests that these parameters should be simi-
lar for the different sets of component marks.
Adjustments to the mean value of one or all of the components have no
effect on the ranking of the students' overall marks or for other norm-
referenced purposes. The actual mark level will be important if comparisons
are to be made with student performances in other courses or subjects and,
of course, for pass-fail decisions related to constants such as 40 per cent.
With a large class the overall mean (~59) of the sum of the four raw
component sets is a more stable measure of class standard than with a small
class or sample (such as reported here). In a small class moderate changes
in the marks of only a few students can make a noticeable change in the
mean; in a large class this is not so. To standardise the mean (to 50,
say) is to accept that its raw value may owe a fair part of itself to the
chance composition of the class ("Fred sat the exam, but Joe was sick",
for example).
Terwilliger (9) suggests that a more inclusive reference group could be
used to determine characteristic means (and standard deviations) against
which the standard of the present class might be assessed. Such a reference
group consists of all students for whom there has been comparable instruc-
tion and assessment, thereby reducing the emphasis on the possibly atypical
performance of a single class, or a sample of a class. "Comparability"
must be assessed, of course, and here once again professional expertise and
judgment are indispensable.
Terwilliger (9) discusses the formation of composite scores and iden-
tifies standard deviations and the relationship between scores as the impor-
tant factors to consider in establishing relative weights. Willmott and
Hall (6) distinguish between the intended or nominal weight (in this case
25 per cent for each component) and the effective weight:
Discrepancies between the nominal and effective weights of an examination can be attributed to two factors:
(a) the standard deviation of the component;
(b) the degree of association (correlation) between the component and other components in the examination. (p.33)
If the components are perfectly correlated the ratio of standard deviations
determines completely the effective weights of the sets of component marks.
Usually correlation is not perfect (0 < r < 1) and the effective weighting is
calculated using the product of the standard deviation of each set of marks
and its correlation with the overall mark. Willmott and Hall (6) refer to
the work of Fowles (10) for detailed procedures. Table 5 provides a summary
of nominal and effective weightings for the marks of this case study.
Assessment component: X =                             M        T        W        P
Nominal Weighting (%)                                 25       25       25       25

RAW MARKS
Standard deviation (SD)                               7.25     5.13     8.25     11.35
Weighting if it is assumed rXO = 1                    22.7     16.0     25.8     35.5
Correlation with overall marks (rXO)                  0.740    0.549    0.668    0.833
Effective weighting (∝ rXO x SD)*                     23.2     12.2     23.8     40.8

STANDARDISED MARKS
Standard deviation (SD)                               10       10       10       10
Correlation with standardised overall marks (rXOS)    0.776    0.667    0.644    0.744
Effective weighting (∝ rXOS x SD)                     27.4     23.6     22.7     26.3

* Effective weighting for component X ∝ rXO x SDX ÷ (rMO x SDM + rTO x SDT + rWO x SDW + rPO x SDP)

Table 5: Statistics used to calculate effective weighting.
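The Table 5 figures can be reproduced with a short calculation. The sketch below is ours, not the authors' (the variable names are our own choices); it applies the Willmott and Hall weighting rule to the raw-mark statistics:

```python
# Effective weighting of the four raw-mark components (Table 5).
# SDs and correlations with the overall mark are read from the table.
sd   = {"M": 7.25,  "T": 5.13,  "W": 8.25,  "P": 11.35}
r_xo = {"M": 0.740, "T": 0.549, "W": 0.668, "P": 0.833}

# Stage (a): if r = 1 were assumed, weights would be proportional to SD alone.
total_sd = sum(sd.values())
weight_if_r1 = {x: 100 * sd[x] / total_sd for x in sd}

# Stage (b): effective weight is proportional to the product r_XO * SD.
products = {x: r_xo[x] * sd[x] for x in sd}
total = sum(products.values())
effective = {x: 100 * products[x] / total for x in products}

for x in "MTWP":
    print(f"{x}: nominal 25.0, if r=1 {weight_if_r1[x]:.1f}, "
          f"effective {effective[x]:.1f}")
```

Stage (a) reproduces the 22.7/16.0/25.8/35.5 row of Table 5, and stage (b) the 23.2/12.2/23.8/40.8 row.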
Although the standardising procedure has assigned equal standard devia-
tions to the component sets of marks, the effective weightings are not equal
due to the variations in the correlation coefficients. These discrepancies
are not very large and Terwilliger (9) asserts that "there is no adjustment
which can be made to overcome the correlation between the measures combined
in a composite score" (p.165). This is misleading, since it is desirable
that correlations be as close as possible to one. However, the procedure
outlined by Willmott and Hall (6) allows for the effect of correlation.
Table 5 shows the use of the procedure, in two stages, to calculate the
effective weighting of the raw marks due to the combined effects of standard
deviation and of correlation between the component and composite or overall
scores:
(a) Assuming (incorrectly) r = 1 (perfect correlation), the effective
weighting depends only on the variations in standard deviation.
(b) The modification introduced by the product (r x SD) gives the
combined effective weighting.
For ranking purposes only, each component set of marks is multiplied by the
ratio of nominal to effective weighting producing a weighted set of overall
marks. This is the first step of an iteration procedure (10). Two itera-
tions have been carried out to illustrate the procedure. The effective
weights and relevant correlations are shown in Table 6.
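The iteration can be sketched in a few lines. The code below is our reading of the Fowles procedure, not the authors' own program, and the four five-student mark lists are invented for illustration; they are not the case-study data:

```python
# One pass per step: form the overall mark, find each component's effective
# weight, then scale that component by (nominal weight / effective weight).
import statistics

def corr(a, b):
    """Pearson correlation of two equal-length mark lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

def effective_weights(components, overall):
    """Effective weight of each component: proportional to r_XO * SD."""
    prods = [corr(c, overall) * statistics.pstdev(c) for c in components]
    return [100 * p / sum(prods) for p in prods]

def iterate(components, nominal, steps=2):
    """Two iterations, as in Table 6; returns the final effective weights."""
    for _ in range(steps):
        overall = [sum(col) for col in zip(*components)]
        eff = effective_weights(components, overall)
        # scale each component for ranking purposes only
        components = [[m * n / e for m in c]
                      for c, n, e in zip(components, nominal, eff)]
    overall = [sum(col) for col in zip(*components)]
    return effective_weights(components, overall)

marks = [
    [10, 14, 8, 12, 16],   # component M (invented)
    [11, 15, 9, 13, 14],   # component T (invented)
    [ 9, 16, 7, 12, 15],   # component W (invented)
    [12, 13, 8, 11, 17],   # component P (invented)
]
eff = iterate(marks, [25, 25, 25, 25])
```

After each pass the scaled marks are suitable for ranking only; as noted later in the paper, the level of the combined marks must be adjusted separately.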
In Table 7 we show the raw marks (O), the 'standardised' marks (OM)
and the correctly weighted marks (OW") for each student, together with each
student's rank in the class on each type of total mark. Note that all
marks have been rounded to the nearest whole number, since this procedure
was used by Moss to obtain the raw marks (O). Rounding Moss's 'standardised'
marks (OM) reintroduces ties in students' class ranks (the lack of
ties on the OM used by Moss was an artefact introduced by carrying one more
significant figure in OM than was used in O).
Note that there are 25 variations in position between this new set of
weighted marks and the raw marks. The average difference (including zero
differences) is 0.87 places. There are also 25 differences between ranks
on Moss's 'normalised' marks and ranks on the raw marks, but the average
difference is 1.2 places. Moss's 'normalised' marks and the more correctly
derived weighted marks yield different ranks for only 12 students, the
average difference (as always, including zeros) being 0.50 places. None of
Assessment component: X =                             M        T        W        P
Nominal weighting (%)                                 25       25       25       25

RAW MARKS (Overall mark = O)
Standard deviation (SD)                               7.25     5.13     8.25     11.35
Correlation with overall marks (rXO)                  0.740    0.549    0.668    0.833
Effective weighting (∝ rXO x SD)                      23.2     12.2     23.8     40.8

WEIGHTED MARKS
Iteration 1 (Overall mark = OW'):
Nominal weight/effective weight (dX')                 1.08     2.05     1.05     0.613
New standard deviation (SD1 = SD x dX')               7.8      10.5     8.7      7.0
Correlation with weighted overall marks (rXOW')       0.748    0.732    0.646    0.684
Effective weighting (∝ rXOW' x SD1)                   24.5     32.2     23.4     19.9

Iteration 2 (Overall mark = OW"):
Nominal weight/effective weight (dX")                 1.02     0.776    1.07     1.26
New standard deviation (SD2 = SD1 x dX")              8.0      8.1      9.3      8.8
Correlation with weighted overall marks (rXOW")       [not legible in source]
Effective weighting (∝ rXOW" x SD2)                   25.1     22.0     25.9     27.1

Table 6: The approach to correct effective weighting by iteration.
these average differences is large, nor are many of the individual variations
which give rise to the averages.
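The joint-placing ranks and the average positional differences quoted above can be computed mechanically. The sketch below is ours (the short five-student mark lists are invented for illustration, not the case-study data): ranks are assigned on descending marks, tied marks share the average of the positions they span, and two rankings are compared by the mean absolute difference in position, zeros included.

```python
def joint_ranks(marks):
    """Rank marks in descending order; ties get the average (joint) placing."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0.0] * len(marks)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and marks[order[j]] == marks[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2          # average of 1-based positions i+1 .. j
        for k in order[i:j]:
            ranks[k] = avg
        i = j
    return ranks

raw      = [60, 65, 59, 60, 52]        # invented
weighted = [72, 75, 69, 69, 61]        # invented
r1, r2 = joint_ranks(raw), joint_ranks(weighted)
mean_shift = sum(abs(a - b) for a, b in zip(r1, r2)) / len(r1)
```

With these inputs the two students on 60 share places 2 and 3 (joint placing 2.5), the two on 69 share places 3 and 4 (joint placing 3.5), and the mean shift between the two rankings is 0.4 places.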
The following points are worth noting:
The Willmott and Hall (6) procedure is not, of course, restricted
to equal weightings - any distribution of weightings among any
number of components can be adjusted.
When the differences between the nominal weightings and the
effective weightings are in the range 0-10 per cent, there are marked
differences in ranking only in a few cases.
Student No.   Raw Marks   Position   Standardised   Position   Weighted      Position
                 (O)                  Marks (OM)               Marks (OW")
 1               60       13  (J)        55          8  (J)       72           9.5(J)
 2               65        5.5(J)        58          5            75           5
 3               59       15.5(J)        52         13            69          14  (J)
 4               60       13  (J)        51         14.5(J)       69          14  (J)
 5               52       27             41         28            61          28
 6               63        9  (J)        55          8  (J)       72           9.5(J)
 7               58       17.5(J)        49         18            67          18
 8               65        5.5(J)        55          8  (J)       73           6
 9               54       25  (J)        44         23  (J)       63          23  (J)
10               50       29             40         29            60          29
11               59       15.5(J)        50         16.5(J)       68          16.5(J)
12               46       30             34         30            55          30
13               55       21.5(J)        46         20.5(J)       64          21
14               70        1             63          1            79           1
15               63        9  (J)        54         11.5(J)       72           9.5(J)
16               60       13  (J)        51         14.5(J)       69          14  (J)
17               68        2.5(J)        62          2            78           2
18               55       21.5(J)        44         23  (J)       63          23  (J)
19               68        2.5(J)        60          3            77           3
20               63        9  (J)        55          8  (J)       72           9.5(J)
21               54       25  (J)        43         26  (J)       62          26  (J)
22               56       19             47         19            66          19
23               58       17.5(J)        50         16.5(J)       68          16.5(J)
24               55       21.5(J)        44         23  (J)       63          23  (J)
25               55       21.5(J)        46         20.5(J)       65          20
26               63        9  (J)        54         11.5(J)       72           9.5(J)
27               51       28             43         26  (J)       62          26  (J)
28               63        9  (J)        55          8  (J)       72           9.5(J)
29               66        4             59          4            76           4
30               54       25  (J)        43         26  (J)       62          26  (J)
MEAN             58.93                   50.10                    68.20

(J = Joint Placing)   (Standardised Marks are called Normal Scores by Moss (2))
Table 7: Raw, standardised and weighted marks and rankings.
Terwilliger (8,9) suggests that if the total of the positive
differences between the effective weights and the nominated
weights is less than 20 per cent, adjustment is not necessary.
For the raw marks, the differences (positive only) add up to
15.8 per cent and the effect, as noted above, is small (see
Table 5 also).
The mean value of the weighted set of overall marks (after two
iterations) is 68.2 (SD = 5.97) which is higher than the mean
of the raw marks, 58.9. In this case adjustment of the level
of the overall marks is readily made by subtracting 68.2 - 58.9 =
9.3 from each of the weighted overall marks (or 68.2 - 50 = 18.2
if a mean of 50 is judged more appropriate).
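The level adjustment in the last point is a single subtraction. A minimal sketch (ours; the five marks are invented, not the thirty case-study marks):

```python
# Shift weighted overall marks so their mean lands on the target mean.
weighted = [72, 75, 69, 61, 63]          # invented weighted overall marks
target = 50                              # the mean judged appropriate
shift = sum(weighted) / len(weighted) - target
adjusted = [m - shift for m in weighted]
```

Because the same constant is subtracted from every mark, ranks and standard deviation are untouched; only the level moves.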
Issue 5: How can the quality of assessment (information) be improved?
The validity (5) of the assessment tasks and conditions should be
established, and appreciated by the students. The assessment tasks together
with model solutions and guidelines for marking should be moderated by
colleagues (assessment sub-committees). This process should also take into
consideration the implications of a prescribed pass mark and/or a prescribed
level of attainment as prerequisite for study of a subject at a higher
level. Clift and Imrie (11) discuss such matters with reference to a pro-
posed two-tier structure of assessment with tier-I at the minimum essen-
tials (recall, application) level and tier-II at the problem solving level.
Engel et al. (12) describe what promises to be a most significant innovation
in assessment which not only allows for validation and moderation
but also involves students in evaluation of the assessment tasks which are
aimed at the problem solving level. The procedure contributes to the
moderation of model answers and of 'minimum levels of competency'. It
takes place the day after the students sit the test and thus represents an
important (feedback) learning experience for the students at a time when
their interest is at a peak. And that should be a principal aim of assess-
ment: that it be part of a teaching strategy to help with the improvement
of student learning (quantity, quality and retention).
As with many other aspects of teaching, the professional judgment and
commonsense of the teacher must be depended upon to modify general recommen-
dations (9) as required by subject, student and institutional considerations.
If, for example, there is poor correlation (r < 0.4) between assessment
components, combination (as discussed in this paper) may be less important, less
valid and less useful than association. For illustration consider the
abilities required to be a 'good' driver. A 90 per cent minimum level of
competency may be required for satisfactory performance on a test of the
Highway Code. Then the would-be driver is required to demonstrate basic
mastery of the vehicle in 'normal' traffic conditions. There is likely to
be low correlation between the 'scores' of such tests; there would be little
point in combining them and little logic in using a combined score to predict
the 'good' driver. But the association of the two components separately
assessed is evidently commonsense.
The newly licensed driver will probably, it is predicted, be able to
solve most of the problems presented by changing traffic and road conditions.
With practice and continuing development of skills, attitudes and knowledge,
the driver will become 'good' at problem solving on the road.
With a course such as Zoology which involves the development of a wide
range of skills and knowledge, information is available in the marks of
assessment for rational analysis and professional judgment (5) to be used to
identify minimum levels of competency and the appropriate use of combination
and association. As with all educational endeavours there is an associated
professional responsibility to evaluate our performances as teachers and
assessors particularly when much effort has been expended on change and
innovation.
CONCLUDING COMMENTS
In this paper it has been our intention to discuss considerations of
assessment which seem likely to be overlooked when sets of marks are com-
bined. To utilize the potential of this Journal as a forum, we thought it
appropriate to consider and develop some of the issues raised in the paper
by Dennis Moss so that somewhat different points of view could be conven-
iently considered. As indicated in this paper, in this Journal (Vol.5(3)),
are documented innovations which seek to improve the experiences of assess-
ment which so greatly influence the attitudes of our students.
While we have used numbers and calculations to illustrate some of the
considerations of this paper, we have intended to emphasise that there is
no mechanical procedure for converting marks into decisions about an indi-
vidual's ability and potential. We wish to share our convictions that 'good'
examining requires high levels of competence. The efforts which staff and
students make in producing marks should invite reciprocal diligence in
analysis and in decision-making procedures which utilise not only relevant
statistical methods but, particularly, the informed judgment of teachers.
Evaluation of assessment in higher education has received little attention
in the literature and we acknowledge the interest of Zoology staff as
reported by Moss (2). We wished to contribute to such a debate by discuss-
ing the combination of marks not as an exercise in numbers but as a matter
of relationships between marks and the learning experiences of students.
We have presented a case for professional judgment and expertise when
teachers consider combining marks.
REFERENCES
(1) BURNETT, W. and CAVAYE, G. (1980), "Peer Assessment by Fifth Year
Students of Surgery", Assessment in Higher Education, 5, (3).
(2) MOSS, D. (1977), "How Reliable are Continuous Assessment Methods in
Measuring Student Performance?", Assessment in Higher Education, 2,
(3), 164-172.
(3) ELTON, L.R.B. and LAURILLARD, D.M. (1979), "Trends in Research on
Student Learning", Studies in Higher Education, 4, (1), 87-102.
(4) HIMMELWEIT, H.T. (1967), "Towards a Rationalization of Examination
Procedures", Universities Quarterly, 21, (3), 359-372.
(5) EBEL, R.L. (1965), Measuring Educational Achievement, Prentice-Hall,
New Jersey.
(6) WILLMOTT, A.S. and HALL, C.H.W. (1975), O-Level Examined: The Effect
of Question Choice, Schools Council Research Studies, Macmillan
Education, London.
(7) CRONBACH, L.J. (1971), "Test Validation", in THORNDIKE, R.L. (Ed.),
Educational Measurement (Second Edition), American Council on Education,
Washington.
(8) TERWILLIGER, J.S. (1977), "Assigning Grades - Philosophical Issues and
Practical Recommendations", Journal of Research and Development in
Education, 10, (3), 21-39.
(9) TERWILLIGER, J.S. (1971), Assigning Grades to Students, Scott, Foresman,
Illinois.
(10) FOWLES, D.E. (1974), CSE: Two Research Studies, (School Council Exami-
nations Bulletin 28), Evans/Methuen Educational.
(11) CLIFT, J.C. and IMRIE, B.W. (1980), Assessing Students, Appraising
Teaching, Croom Helm, London.
(12) ENGEL, C.E., FELETTI, G.I. and LEEDER, S.R. (1980), "Assessment of
Medical Students in a New Curriculum", Assessment in Higher Education,
5, (3).
First version of the paper received : March 1980
Final version of the paper received : September 1980
Copies of the paper and further details from:
Mr. Geoffrey Isaacs,
Tertiary Education Institute,
University of Queensland,
St. Lucia, Q. 4067,
Australia.
Mr. Bradford Imrie,
University Teaching and Research Centre,
Victoria University of Wellington,
Wellington,
New Zealand.
Geoff Isaacs is a lecturer in the Tertiary Education Institute (since
1974), and formerly taught mathematics at the University of New South
Wales.
Bradford W. Imrie is senior lecturer in the University Teaching and
Research Centre, and was formerly a lecturer in mechanical engineering at
the University of Leeds. This paper was written while he was on study
leave at the Tertiary Education Institute (September 1979 - June 1980).