a case for professional judgment when combining marks
TRANSCRIPT
This article was downloaded by: [TIB & Universitaetsbibliothek] On: 28 October 2014, At: 05:39. Publisher: Routledge. Informa Ltd Registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Assessment & Evaluation in Higher Education. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/caeh20
A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS. Geoffrey Isaacs (a) & Bradford W. Imrie (b). (a) University of Queensland. (b) Victoria University of Wellington. Published online: 28 Jul 2006.
To cite this article: Geoffrey Isaacs & Bradford W. Imrie (1981) A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS, Assessment & Evaluation in Higher Education, 6:1, 3-25, DOI: 10.1080/0260293810060102
To link to this article: http://dx.doi.org/10.1080/0260293810060102
Assessment and Evaluation in Higher Education Vol.6 No.1 March 1981, pp.3-25
A CASE FOR PROFESSIONAL JUDGMENT WHEN COMBINING MARKS
Geoffrey Isaacs, University of Queensland
Bradford W. Imrie, Victoria University of Wellington
ABSTRACT
Issues raised in a paper by Moss (1977) in this Journal are discussed
in this paper, which describes a strategy for comparing and combining marks
on the basis of rational analysis and professional judgment. Consideration
is given to the interpretation and implications of mean values, standard
deviations, correlations and shapes of mark distributions.
The data presented by Moss (1977) are analysed and procedures are discussed
which discriminate between nominal and effective weightings (Willmott
and Hall, 1975) with reference to the effects of standard deviation and
correlation. This approach may also be used for combining sets of marks
which are not equally weighted and is potentially more useful to the teacher
than standardising procedures. In some circumstances combination may be
judged inappropriate and association of marks (in a profile) might be the
preferred procedure. All such procedures demand the professional judgment
and expertise of teachers, to utilise fully the information available in the
raw marks of student performance.
INTRODUCTION
One of the most important professional responsibilities of academic staff
is that of allocating and reporting marks (or grades) as measures of student
performance. It is often the case that, after allocation of marks for the
different parts of a student's performance, those marks are combined to produce
the final mark (or grade) which is published. There is usually a predetermined
procedure for combining such marks using nominal weighting factors;
these may not be known to the students. (For example see the brief report by
Burnett and Cavaye (1) in the previous issue.)
Marks may be combined for different assessment purposes:
(a) Within an individual test or examination: for example, in a multiple choice test of 40 items, the test score would be the number of items correct, each with a weight of unity regardless of differences in difficulty (or facility); or, for a 3-hour paper requiring six questions to be attempted out of 10, students' marks would be reported as the total of scores on each question, with each question weighted equally regardless of differences in difficulty and selection.

(b) For a course, with student performance assessed at different points during the course and at the end of the course, the different component marks will be weighted and combined to give an overall mark. The mark may be used to decide entry to another course and, sometimes, eligibility for scholarships or bursaries.

(c) On completion of a degree programme, the marks from different courses may be combined (with due weighting) for decisions about degree classification, scholarships, prizes and, of course, employment opportunity.
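In each of purposes (a) to (c) the combination amounts to a weighted average of component marks. A minimal sketch of such a combination (ours, not the paper's; the function name is illustrative only):

```python
# Minimal sketch (not from the paper) of combining component marks
# with nominal weights; all names here are illustrative.

def combine(marks, weights):
    """Weighted overall mark from component percentages.

    marks   -- dict: component name -> percentage mark
    weights -- dict: component name -> nominal weight (fractions summing to 1)
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    total = sum(weights[k] * marks[k] for k in marks)
    return int(total + 0.5)   # round half up, as in Table 1 of the case study

# Student 1 of the Moss case study, with equal nominal weights:
overall = combine({"M": 75, "T": 74, "W": 47, "P": 44},
                  {"M": 0.25, "T": 0.25, "W": 0.25, "P": 0.25})
```

As the paper goes on to argue, these nominal weights need not equal the weights the components actually exert on the overall mark.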
For all such purposes combining marks requires the careful exercise of
professional judgment based on information which is readily obtainable to
assist in the interpretation of marks. For sets of marks which are to be
combined, professional judgment requires consideration of:
(a) descriptions of the sets of marks in terms of mean, standard deviation, correlation and shape of distribution;

(b) whether the nominal (or intended) weighting is consistent with the effective (or actual) weighting as determined by correlation and standard deviation (6);
(c) procedures for comparing and combining sets of marks.
To illustrate the significance of these considerations, this paper
develops a case study based on the data presented in a paper by Moss (2)
which may be referred to for comparison and contrast.
ASSESSMENT ISSUES
In his paper (2), published in this Journal, Moss describes the doubts
that staff (of a Zoology Department) had in accepting that students seemed
to be achieving better results in a redesigned course. The assessment sys-
tem had been changed from one "based mainly on terminal examinations" to one
in which,
regular tests were introduced on a weekly basis, most of these being multiple choice tests (M) but including fewer tutor marked tests (T). In addition, the students were expected to complete a three hour written exam (W) and a
practical exam (P). Each of the four components would count equally in the final assessment. (p.164)
Continuous Assessment (C) - 50 per cent:
    Multiple Choice Tests (M) - 25 per cent
    Tutor Marked Assignments (T) - 25 per cent

Terminal Assessment (E) - 50 per cent:
    Written Examination (W) - 25 per cent
    Practical Examination (P) - 25 per cent

Figure 1: Assessment structure
Details of the assessment are summarised in Figure 1. With the wisdom
of hindsight we will consider some issues underlying not only Moss's paper
but also the inevitable confrontations which teachers have, as examiners,
with marks dubious in their derivations and fallacious in their final com-
binations. Firstly the issue which seems to have been posed by Zoology
staff to Moss (2):
Issue 1
'We think that the new assessment system makes it easier for students to pass - the 'easiness' is due to the continuous assessment component of the final mark.'
Marks for each of the four assessment components are shown in Table 1.
Each mark is a percentage and the overall score, 0, is the average of the
four components, rounded to the nearest whole number.
As Moss (2) notes, "the traditional pass/fail mark is 40% ... on all
components". The numbers of students failing on this 'standard' (and also
50 per cent) for each component and for the course overall, are shown in
Table 2.
These results do seem to support the critics' claim that continuous
assessment did make it easier for students to pass. A summary of descrip-
tive statistics (Table 3) provides more information about the sets of marks
under consideration.
The mean marks for the two continuous assessment components (M and T)
are higher than the mean marks for the two terminal assessments (W and P).
This too seems to support the claim that the continuous assessment components
are easier than the terminal assessment components. Indeed, a student who
came last in all four assessments (none did so) would have scored 43 per cent,
comprising 27.5 from continuous assessment and 15.5 per cent from terminal
assessment.
Student No.    M     T     W     P     0
     1        75    74    47    44    60
     2        69    71    50    70    65
     3        76    66    43    52    59
     4        73    64    41    60    60
     5        54    57    51    47    52
     6        78    66    48    58    63
     7        70    67    35    58    58
     8        71    62    48    78    65
     9        59    63    39    55    54
    10        58    63    40    38    50
    11        64    63    50    60    59
    12        57    56    32    40    46
    13        67    62    42    50    55
    14        80    70    57    71    70
    15        64    64    57    65    63
    16        71    60    55    53    60
    17        77    71    52    73    68
    18        63    60    39    56    55
    19        75    69    52    74    68
    20        69    69    48    64    63
    21        72    58    30    55    54
    22        56    68    48    53    56
    23        64    69    47    51    58
    24        68    59    34    58    55
    25        71    60    50    37    55
    26        68    60    62    61    63
    27        64    72    34    32    51
    28        73    63    59    55    63
    29        80    71    43    69    66
    30        60    56    43    58    54

M = Multiple Choice; T = Tutor marked; W = Written exam; P = Practical exam;
0 = Overall score = (M + T + W + P)/4, rounded to the nearest whole number.

Table 1: Raw score data (percentages)
Assessment    Number of students failing    Percentage of group (N=30)
Component     <40 per cent   <50 per cent   <40 per cent   <50 per cent
    M               0              0              0              0
    T               0              0              0              0
    W               7             19             23             63
    P               3              6             10             20
    0               0              1              0              3

Table 2: Students scoring less than 40 per cent and less than 50 per cent for assessment components and overall.
Assessment
Component      Mean      Minimum   Maximum   Range    Standard Deviation (SD)
    M          68.20        54        80       27            7.25
    T          64.43        56        74       19            5.13
    W          45.87        30        62       33            8.25
    P          56.50        32        78       47           11.35
 (Mean)       (58.75)*                       (31.5)         (8.0)
    0          58.93*       46        70       25            5.87

*Note: the mean of the mean component marks is not precisely equal to the mean overall mark because overall marks were rounded to whole numbers before the mean was calculated.

Table 3: Summary of descriptive statistics for raw marks.
A measure of spread is provided approximately by the range and more
accurately by the standard deviation as a measure of the variation of each
set of marks about its mean. The variations in spread are evident by either
measure. Further, the mean standard deviation (~8.0) is greater than the
standard deviation overall of the combined marks (~5.9) thus indicating that
the component marks are not perfectly correlated (r < 1) - see Table 4.
Both spread and correlation need to be considered when comparing and combin-
ing sets of marks.
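The figures in Table 3 can be reproduced directly from the raw marks of Table 1. A short sketch (ours, not the authors'), which assumes the sample (n - 1) standard deviation since that reproduces the quoted values:

```python
# Recomputing Table 3 from the Table 1 raw marks (transcribed below).
from statistics import mean, stdev   # stdev uses the sample (n - 1) definition

M = [75, 69, 76, 73, 54, 78, 70, 71, 59, 58, 64, 57, 67, 80, 64,
     71, 77, 63, 75, 69, 72, 56, 64, 68, 71, 68, 64, 73, 80, 60]
T = [74, 71, 66, 64, 57, 66, 67, 62, 63, 63, 63, 56, 62, 70, 64,
     60, 71, 60, 69, 69, 58, 68, 69, 59, 60, 60, 72, 63, 71, 56]
W = [47, 50, 43, 41, 51, 48, 35, 48, 39, 40, 50, 32, 42, 57, 57,
     55, 52, 39, 52, 48, 30, 48, 47, 34, 50, 62, 34, 59, 43, 43]
P = [44, 70, 52, 60, 47, 58, 58, 78, 55, 38, 60, 40, 50, 71, 65,
     53, 73, 56, 74, 64, 55, 53, 51, 58, 37, 61, 32, 55, 69, 58]

# Overall score: average of the four components, rounded half up (Table 1).
O = [(m + t + w + p + 2) // 4 for m, t, w, p in zip(M, T, W, P)]

sds = [stdev(x) for x in (M, T, W, P)]
# The mean component SD (about 8.0) exceeds the SD of the combined marks
# (about 5.9), so the components cannot be perfectly correlated.
mean_sd, overall_sd = mean(sds), stdev(O)
```

The last comparison is the point made above: a combined SD below the mean component SD is itself evidence that r < 1 among the components.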
What does "easier for students to pass" mean? Let us accept for the
moment that 40 per cent (or, indeed, 50 per cent) is some kind of naturally
ordained discontinuity which is used as a pass level. Is it true that the
continuous assessment components enabled previously ordained failures to
pass? In this regard we find it puzzling that the only change discussed in
the paper (2) is a change in assessment method. No other curricular factors
seem to have been considered with reference to teaching and
learning. It seems evident that weekly testing will have influenced student
learning patterns with (presumably) regular feedback from the test experience.
Why is it that when there is failure we blame the students, but when they
pass we suspect the assessment methods? In this case we have new assessment
procedures which, because they are spread over the whole course, test what
the students know, using techniques which become part of the routine of a
student's learning experience. In contrast, we have the traditional three
hour, end-of-course examination (without feedback) which can sample only a
small part of the course content and objectives. As such it is more likely
to measure what students don't know. Should we be surprised if the probabi-
lity of higher scores increases with the introduction of continuous assess-
ment? Since the distribution of learning changed because of changes in the
assessment procedure, so too did the distribution of student performance.
In addition since continuous assessment involved more measurements of student
ability, the marks were likely to be more reliable than a single terminal
written or practical examination.
Elton and Laurillard (3) agree with Himmelweit (4) that examinations
act as a signal to the student, or exert a trigger effect, and suggest prag-
matically that "the quickest way to change student learning is to change the
assessment system". Probably this is what has happened with the Zoology
course. The issue should be 'Does continuous assessment make it easier for
students to learn?'
Issue 2
When various assessment components are intended to 'count equally', what are the implications?
Students might believe that this means that each of the four assessment
components is equally important in determining the overall mark (or grade)
and that each component requires a similar amount of effort to gain a similar
mark. Students do begin their courses with different amounts of skill,
ability and knowledge and it is not suggested that equal efforts will be
rewarded equally for all students. What is suggested is that if one student
puts equal effort into all components the marks obtained on each shall be
about the same, other things being equal, for each component, i.e.:
the quality of the examining (level of difficulty, reliability of marking, validity of tasks, emphasis of importance for weighting);

the quality of the teaching (in lectures and in the laboratory);

the course design such that similar skills and abilities are
included in each learning/assessment component (otherwise the student's entry attributes of skill, ability and knowledge will not be equal for each component).
It is a matter of professional judgment to assess these considerations
e.g. is it likely that students will put more effort (once) into preparing
for the final examination than (regularly) for the continuous assessment
tests? It cannot be assumed, of course, that each student did put equal
effort into each component. It is not unreasonable, however, to expect that
other things (once again) being equal, the various inequalities of effort
will cancel out over the class as a whole. That is, the distributions of
students over the possible marks should be similar. In particular, their
means and variances should be equal.
It is evident from Table 3 that this is far from true either for means
or variances (as measured by their square roots, the standard deviations).
Equating the means of each component to the overall mean would change each
student's overall mark by (-9.27 -5.5 +13.06 +2.43)/4 = 0.18. (The change
would be zero if the overall marks were exact averages of the four compon-
ents; in fact the overall marks are rounded to the nearest whole number
and 0.18 is the average round-off error.) Changing the mean of any or all
of the component sets of marks by addition or subtraction, has no effect on
the rank position of each student's overall mark.
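That rank invariance is easy to verify: adding a constant to one component set shifts every overall mark by the same amount and so cannot reorder students. A small sketch with hypothetical marks (ours):

```python
# Sketch (hypothetical data): shifting a component's mean leaves the
# rank order of overall marks unchanged.

def ranks(scores):
    """Rank positions, 1 = highest; ties broken by list position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

A = [75, 54, 64]                       # one component, three students
B = [47, 51, 50]                       # a second component
overall = [(a + b) / 2 for a, b in zip(A, B)]

shift = -7.5                           # any constant added to component A
shifted = [(a + shift + b) / 2 for a, b in zip(A, B)]
```

The same does not hold for changes of spread, which is why the standard deviations are discussed next.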
It is also evident from Table 3, with reference to Table 1, that the
spread of marks on each component has an effect on their influence or weight
in the overall mark. For example, a student who scored 15 marks above the
lowest mark on each component would rank:
equal 16th on the multiple choice test component (M),
equal 4th on tutor-marked assignments (T),
about 18th on the written examination (W),
and 25th on the practical examination (P).
Provided 'other things' are equal it would seem that we should equate
the standard deviations of the components in order to equate the weight
each has in the overall mark (0S). This is what Moss (2) attempts to do by
what he calls 'normalising' the scores. The procedure used is, in fact,
that of converting each component mark to a normalised standard T-score (or
Z-score) (5) so that each scaled set of marks has a mean of 50 and a stand-
ard deviation of 10. It should be noted that:
If suffix 's' is used to distinguish the standardised marks from the raw marks then, for each student (cf. Table 3),
4 x 0s = Ms + Ts + Ws + Ps

or 4 x 0s = 200 + 10 x [ (M - 68.20)/7.25 + (T - 64.43)/5.13 + (W - 45.87)/8.25 + (P - 56.50)/11.35 ]
The component means (50) and standard deviations (10) may not correspond to the considered judgment of staff concerned about comparing and combining these component sets of marks. Pass/fail decisions are obviously affected by the mean selected for standardisation. In fact the standardised overall set of marks has a mean 0s = 50.02 and a standard deviation SD = 7.09, not 10 as might have been expected or intended.

Again the point is made that since the standard deviation of the overall (combined) set of marks is not equal to the mean of the standard deviations of the component sets, the correlations among the components and between each component and the overall set will be less than unity, i.e. they are not perfectly correlated. Correlation coefficients are given in Table 4 and in Figure 3. Figure 2 summarises the nomenclature used for the various correlation coefficients.

As a consequence of these variations in correlation, although the T-score standardisation assigned equal standard deviations to each component set of scores, the effective weightings (6) of the standardised components compared with the initial raw sets of marks (calculated as illustrated in Table 5) are:

                  M          T          W          P
Standardised    27.4       23.6       22.7       26.3   per cent
Raw             23.2       12.2       23.8       40.8   per cent

The outcome is certainly closer to the intended equal weighting of the components in the overall set of marks. The remaining variation in the weighting of the standardised scores is due to the effect of correlation.
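Moss's 'normalising' step can be sketched as follows (our illustration; the helper name and sample data are ours, and the sample standard deviation is again assumed):

```python
# Sketch of T-score standardisation: rescale a set of marks to a chosen
# mean (50) and standard deviation (10).
from statistics import mean, stdev

def t_scores(raw, target_mean=50.0, target_sd=10.0):
    m, s = mean(raw), stdev(raw)
    return [target_mean + target_sd * (x - m) / s for x in raw]

marks = [47, 50, 43, 41, 51, 48, 35, 48, 39, 40]   # illustrative raw marks
scaled = t_scores(marks)
```

Note that while each scaled component set has SD 10 by construction, the overall set formed from them does not (7.09 in this case study), for the correlation reasons just given.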
Issue 3
What do the results of the new assessment system tell us about student achievement and its assessment in this course?
A correlation coefficient (such as Pearson's r) provides an index of
the relation of one set of scores or marks to another. In this case if
'similar' abilities (skills and knowledge) are being tested in each compon-
ent then correlation needs to be considered when comparing and combining
marks. There are three possibilities:
r = 1: perfect correlation among the sets of component marks and (therefore) between the components and the overall marks; for such a condition a student's rank in class would be determined by any one of the component sets and the others would be redundant (but perhaps educationally desirable, cf. Burnett
and Cavaye (1)) since each assessment component would measure exactly the same performance standard; weighting is related directly to standard deviation ratios.
0 < r < 1: positive correlation is the usual relationship and applies in this case; the size of r indicates how much one pair of components might have in common and can therefore assist in the interpretation of one set of marks compared with another; the effective weighting is now dependent on the product of the standard deviation of each component and the correlation coefficient of that component with the set of overall marks (6).
r ≤ 0: zero and negative or inverse correlation between two sets of marks obtained by the same group of students indicates that very different characteristics or attributes have been measured, with the same students scoring highly in one and low in the other, and vice versa; while it is not logical to combine two such sets of marks it is reasonable to associate the marks by reporting them together (e.g. in the form of a profile (7)).
r              Symbol used for the Pearson product-moment coefficient of correlation as a measure of the relation between pairs of measures for individuals in a group. Suffixes indicate particular measures.

rMT, rWP, etc.   Correlations between component sets of raw marks.

rM0, rT0, etc.   Correlations between a component and the set of raw overall marks.

rM0-, rT0-, etc. Correlations between a component and the set of raw overall marks with that component's contribution removed (thus rM0- is the correlation of M with the overall score less M's contribution, because 0 = (M + T + W + P)/4).

rM0s, rT0s, etc. Correlations between a component and the set of overall marks derived from standardised components.

rCE            Correlation between continuous assessment component (M+T) and end-of-course assessment component (W+P).

Figure 2: Correlation nomenclature.
In themselves, correlation coefficients provide only an indication of
relationship, the reality and meaning of which must be assessed using
professional judgment. Correlation coefficient nomenclature for the example
under consideration is given in Figure 2. Correlation coefficients are
summarised in Table 4 and presented as a correlation network in Figure 3.
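For reference, the Pearson product-moment coefficient underlying Table 4 can be computed directly from its definition; a sketch (ours, with deliberately simple illustrative data rather than the case-study marks):

```python
# Sketch: Pearson product-moment correlation between two sets of marks
# for the same students.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Perfectly related sets give r = 1; perfectly inverted sets give r = -1.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```

Applied to the Table 1 columns, this is the calculation that produces the coefficients summarised below.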
RAW MARKS - among components:
    rMT = 0.48
    rWP = 0.47
    rMW = 0.40
    rPT = 0.25
    rTW = 0.24
    rMP = 0.18
    rCE = 0.43*   (C = M + T; E = P + W)

RAW MARKS - between component and overall scores*:
    rM0 = 0.74    rT0 = 0.55    rW0 = 0.67    rP0 = 0.83

RAW COMPONENT MARKS AND RAW OVERALL SCORES LESS EACH COMPONENT
(those for standardised marks are almost identical to these):
    rM0- = 0.54   rT0- = (illegible in source)   rW0- = 0.39   rP0- = 0.53

STANDARDISED COMPONENT MARKS AND OVERALL SCORES DERIVED FROM THEM:
    rM0s = 0.78   rT0s = 0.67   rW0s = 0.64   rP0s = 0.74

* Unrounded scores were used here (see note to Table 3).

Table 4: Correlation coefficients.
The following points have significance for interpreting the measures
of achievement represented by the four component sets of marks and the
overall marks.
(a) None of the components has very much in common with any of the others (all correlations between components are small).

(b) The 'usefulness' of high correlations depends on whether the assessment objectives of the components are, and are intended to be, homogeneous or heterogeneous. Briefly, high correlations are desirable if the objectives are intended to be homogeneous, and low correlations are desirable if the objectives are intended to be heterogeneous. The other two possibilities indicate problems. Cronbach (8) provides an interesting discussion of this matter which needs to be considered when deciding whether components should be combined to give one representative mark, or associated perhaps in the form of a profile with minimum levels of competency identified.

(c) Do the new continuous components of assessment (M and T) measure the same kinds of things as the traditional components (W and P) or their sum? This is a question of validity. Are the new assessment components valid measures of the learning outcomes intended by the staff? Or (a different question) is the 'new' continuous assessment a valid measure of what the 'old' assessment seemed to measure? Ebel (5) identifies ten types of validity grouped into two categories of primary or direct validity, and secondary or derived validity. He offers a definition of direct validity which depends primarily on rational analysis and professional judgment, i.e.

"A test has direct primary validity to the extent that the tasks included in it represent faithfully and in due proportion the kinds of tasks that provide operational definitions of achievement." (p.381)

For example, it would seem that W and P are not directly valid measures of the skills (written or practical) tested by each other. The correlation coefficient may indicate that the written examination required knowledge derived from the laboratory experience or vice versa - a matter of professional judgment about derived validity. Likewise the continuous assessment components (C) do not seem to be a valid measure of performances tested by terminal assessment (E), and vice versa, since rCE = 0.43.

[Figure 3: Correlation networks - two network diagrams, 'Among components' and 'Between components and overall marks', displaying the correlation values of Table 4.]
Some explanations of the low correlations and the associated problems might include:

any or all of the assessment components were unreliable in their design (despite the title of the paper (2) no information is included about reliability);

the skills, abilities and knowledge measured by each test might not have much in common (including the skills of demonstrating achievement under the conditions of each test, obviously different between T and W);

variations due to the test conditions, including time constraints, coverage of syllabus sampled, and marking (e.g. differences between the objective marking of multiple choice tests and of assignments marked by different tutors).
These considerations relate basically to the quality of the 'signal'
which depends on the validity of assessment, and the quantity of 'noise'
which obscures meaning and makes interpretation difficult. Scaling, stan-
dardising or combining cannot remove 'noise' due to deficiencies in assess-
ment design, moderation and marking.
THE SHAPE OF A SET OF MARKS
Equating the means and standard deviations of the distribution of marks
for each component will not necessarily make these distributions identical.
As is shown below, such differences can be of practical importance. Thus,
if the shapes differ significantly the cause should be sought. Figure 4(a)
gives two contrasting examples with equal standard deviations:
Test A has a long 'tail' of marks below the mean;
Test B has a long 'tail' of marks above the mean.
In Figure 4(b) the means have also been equated. Note that the con-
trasting shapes have not been changed. Neither distribution is very good
for discriminating between students. In Test A, for example, a student
whose mark changed from 54 to 60 (only one half of a standard deviation)
would improve by eight places in a class of 30.
Moreover, careful analysis and judgment are required before decisions
are made about the combination of such sets of marks. Test A spreads low
scorers and Test B spreads high scorers. Regardless of the correlation
between the tests, information will probably be lost about differences
between students when the marks are combined. Returning to our original
example, from Table 1 we note that 22 out of 30 students have tied overall
[Figure 4: Mark distributions. (a) Two sets of marks, Test A and Test B, with differing means but identical standard deviations (about 12 marks); Test A has a long tail of marks below its mean, Test B a long tail above. (b) The same sets scaled to equal means (about 52.5 marks) and the same standard deviations; the contrasting shapes are unchanged.]
marks. Many of these ties will stem from the combination of components which
are largely unrelated (and thus may work to counteract each other). Some may
stem from variations in distribution shape. Unfortunately, no amount of
scaling of any kind can supply information which is not contained in the raw
mark set. In particular, with T (tutor marked component), reasons for the
small spread should be sought before scaling.
Professional judgment should be used to decide whether the 'compact'
grouping represents a real lack of information about differences among
students (e.g. because T is based on a few items) or whether the information
is there but compressed due to markers not using the whole range of possible
marks. In the former case rescaling is out of the question and re-examining
students or some other breach of the grading contract may be required.
Issue 4
How should the final set of marks be derived from sets of component marks?
As indicated in this paper the implications of shape of distribution,
mean, standard deviation and correlation, should be considered. The "all
things being equal" argument suggests that these parameters should be simi-
lar for the different sets of component marks.
Adjustments to the mean value of one or all of the components have no
effect on the ranking of the students' overall marks or for other norm-
referenced purposes. The actual mark level will be important if comparisons
are to be made with student performances in other courses or subjects and,
of course, for pass-fail decisions related to constants such as 40 per cent.
With a large class the overall mean (~59) of the sum of the four raw
component sets is a more stable measure of class standard than with a small
class or sample (such as reported here). In a small class moderate changes
in the marks of only a few students can make a noticeable change in the
mean; in a large class this is not so. To standardise the mean (to 50,
say) is to accept that its raw value may owe a fair part of itself to the
chance composition of the class ("Fred sat the exam, but Joe was sick",
for example).
Terwilliger (9) suggests that a more inclusive reference group could be
used to determine characteristic means (and standard deviations) against
which the standard of the present class might be assessed. Such a reference
group consists of all students for whom there has been comparable instruc-
tion and assessment, thereby reducing the emphasis on the possibly atypical
performance of a single class, or a sample of a class. "Comparability"
must be assessed, of course, and here once again professional expertise and
judgment are indispensable.
Terwilliger (9) discusses the formation of composite scores and iden-
tifies standard deviations and the relationship between scores as the impor-
tant factors to consider in establishing relative weights. Willmott and
Hall (6) distinguish between the intended or nominal weight (in this case
25 per cent for each component) and the effective weight:
Discrepancies between the nominal and effective weights of an examination can be attributed to two factors:
(a) the standard deviation of the component;
(b) the degree of association (correlation) between the component and other components in the examination. (p.33)
If the components are perfectly correlated the ratio of standard deviations
determines completely the effective weights of the sets of component marks.
Usually correlation is not perfect (0 < r < 1) and the effective weighting is
calculated using the product of the standard deviation of each set of marks
and its correlation with the overall mark. Willmott and Hall (6) refer to
the work of Fowles (10) for detailed procedures. Table 5 provides a summary
of nominal and effective weightings for the marks of this case study.
Assessment component: X =                             M        T        W        P
Nominal Weighting (%)                                 25       25       25       25

RAW MARKS
Standard deviation (SD)                               7.25     5.13     8.25     11.35
Weighting if it is assumed rXO = 1                    22.7     16.0     25.8     35.5
Correlation with overall marks (rXO)                  0.740    0.549    0.668    0.833
Effective weighting (∝ rXO x SD)*                     23.2     12.2     23.8     40.8

STANDARDISED MARKS
Standard deviation (SD)                               10       10       10       10
Correlation with standardised overall marks (rXOS)    0.776    0.667    0.644    0.744
Effective weighting (∝ rXOS x SD)                     27.4     23.6     22.7     26.3

* Effective weighting for component X ∝ rXO x SDX ÷ (rMO x SDM + rTO x SDT + rWO x SDW + rPO x SDP)

Table 5: Statistics used to calculate effective weighting.
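The Table 5 figures can be reproduced with a short calculation. The sketch below is ours, not the authors' (the variable names are our own choices); it applies the Willmott and Hall weighting rule to the raw-mark statistics:

```python
# Effective weighting of the four raw-mark components (Table 5).
# SDs and correlations with the overall mark are read from the table.
sd   = {"M": 7.25,  "T": 5.13,  "W": 8.25,  "P": 11.35}
r_xo = {"M": 0.740, "T": 0.549, "W": 0.668, "P": 0.833}

# Stage (a): if r = 1 were assumed, weights would be proportional to SD alone.
total_sd = sum(sd.values())
weight_if_r1 = {x: 100 * sd[x] / total_sd for x in sd}

# Stage (b): effective weight is proportional to the product r_XO * SD.
products = {x: r_xo[x] * sd[x] for x in sd}
total = sum(products.values())
effective = {x: 100 * products[x] / total for x in products}

for x in "MTWP":
    print(f"{x}: nominal 25.0, if r=1 {weight_if_r1[x]:.1f}, "
          f"effective {effective[x]:.1f}")
```

Stage (a) reproduces the 22.7/16.0/25.8/35.5 row of Table 5, and stage (b) the 23.2/12.2/23.8/40.8 row.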
Although the standardising procedure has assigned equal standard devia-
tions to the component sets of marks, the effective weightings are not equal
due to the variations in the correlation coefficients. These discrepancies
are not very large and Terwilliger (9) asserts that "there is no adjustment
which can be made to overcome the correlation between the measures combined
in a composite score" (p.165). This is misleading, since it is desirable
that correlations be as close as possible to one. However, the procedure
outlined by Willmott and Hall (6) allows for the effect of correlation.
Table 5 shows the use of the procedure, in two stages, to calculate the
effective weighting of the raw marks due to the combined effects of standard
deviation and of correlation between the component and composite or overall
scores:
(a) Assuming (incorrectly) r = 1 (perfect correlation), the effective
weighting depends only on the variations in standard deviation.
(b) The modification introduced by the product (r x SD) gives the
combined effective weighting.
For ranking purposes only, each component set of marks is multiplied by the
ratio of nominal to effective weighting producing a weighted set of overall
marks. This is the first step of an iteration procedure (10). Two itera-
tions have been carried out to illustrate the procedure. The effective
weights and relevant correlations are shown in Table 6.
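The iteration can be sketched in a few lines. The code below is our reading of the Fowles procedure, not the authors' own program, and the four five-student mark lists are invented for illustration; they are not the case-study data:

```python
# One pass per step: form the overall mark, find each component's effective
# weight, then scale that component by (nominal weight / effective weight).
import statistics

def corr(a, b):
    """Pearson correlation of two equal-length mark lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

def effective_weights(components, overall):
    """Effective weight of each component: proportional to r_XO * SD."""
    prods = [corr(c, overall) * statistics.pstdev(c) for c in components]
    return [100 * p / sum(prods) for p in prods]

def iterate(components, nominal, steps=2):
    """Two iterations, as in Table 6; returns the final effective weights."""
    for _ in range(steps):
        overall = [sum(col) for col in zip(*components)]
        eff = effective_weights(components, overall)
        # scale each component for ranking purposes only
        components = [[m * n / e for m in c]
                      for c, n, e in zip(components, nominal, eff)]
    overall = [sum(col) for col in zip(*components)]
    return effective_weights(components, overall)

marks = [
    [10, 14, 8, 12, 16],   # component M (invented)
    [11, 15, 9, 13, 14],   # component T (invented)
    [ 9, 16, 7, 12, 15],   # component W (invented)
    [12, 13, 8, 11, 17],   # component P (invented)
]
eff = iterate(marks, [25, 25, 25, 25])
```

After each pass the scaled marks are suitable for ranking only; as noted later in the paper, the level of the combined marks must be adjusted separately.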
In Table 7 we show the raw marks (O), the 'standardised' marks (OM)
and the correctly weighted marks (OW") for each student, together with each
student's rank in the class on each type of total mark. Note that all
marks have been rounded to the nearest whole number, since this procedure
was used by Moss to obtain the raw marks (O). Rounding Moss's 'standardised'
marks (OM) reintroduces ties in students' class ranks (the lack of
ties on the OM used by Moss was an artefact introduced by carrying one more
significant figure in OM than was used in O).
Note that there are 25 variations in position between this new set of
weighted marks and the raw marks. The average difference (including zero
differences) is 0.87 places. There are also 25 differences between ranks
on Moss's 'normalised' marks and ranks on the raw marks, but the average
difference is 1.2 places. Moss's 'normalised' marks and the more correctly
derived weighted marks yield different ranks for only 12 students, the
average difference (as always, including zeros) being 0.50 places. None of
Assessment component: X =                             M        T        W        P
Nominal weighting (%)                                 25       25       25       25

RAW MARKS (Overall mark = O)
Standard deviation (SD)                               7.25     5.13     8.25     11.35
Correlation with overall marks (rXO)                  0.740    0.549    0.668    0.833
Effective weighting (∝ rXO x SD)                      23.2     12.2     23.8     40.8

WEIGHTED MARKS
Iteration 1 (Overall mark = OW'):
Nominal weight/effective weight (dX')                 1.08     2.05     1.05     0.613
New standard deviation (SD1 = SD x dX')               7.8      10.5     8.7      7.0
Correlation with weighted overall marks (rXOW')       0.748    0.732    0.646    0.684
Effective weighting (∝ rXOW' x SD1)                   24.5     32.2     23.4     19.9

Iteration 2 (Overall mark = OW"):
Nominal weight/effective weight (dX")                 1.02     0.776    1.07     1.26
New standard deviation (SD2 = SD1 x dX")              8.0      8.1      9.3      8.8
Correlation with weighted overall marks (rXOW")       [not legible in source]
Effective weighting (∝ rXOW" x SD2)                   25.1     22.0     25.9     27.1

Table 6: The approach to correct effective weighting by iteration.
these average differences is large, nor are many of the individual variations
which give rise to the averages.
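The joint-placing ranks and the average positional differences quoted above can be computed mechanically. The sketch below is ours (the short five-student mark lists are invented for illustration, not the case-study data): ranks are assigned on descending marks, tied marks share the average of the positions they span, and two rankings are compared by the mean absolute difference in position, zeros included.

```python
def joint_ranks(marks):
    """Rank marks in descending order; ties get the average (joint) placing."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0.0] * len(marks)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and marks[order[j]] == marks[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2          # average of 1-based positions i+1 .. j
        for k in order[i:j]:
            ranks[k] = avg
        i = j
    return ranks

raw      = [60, 65, 59, 60, 52]        # invented
weighted = [72, 75, 69, 69, 61]        # invented
r1, r2 = joint_ranks(raw), joint_ranks(weighted)
mean_shift = sum(abs(a - b) for a, b in zip(r1, r2)) / len(r1)
```

With these inputs the two students on 60 share places 2 and 3 (joint placing 2.5), the two on 69 share places 3 and 4 (joint placing 3.5), and the mean shift between the two rankings is 0.4 places.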
The following points are worth noting:
The Willmott and Hall (6) procedure is not, of course, restricted
to equal weightings - any distribution of weightings among any
number of components can be adjusted.
When the differences between the nominal weightings and the
effective weightings are in the range 0-10 per cent, there are marked
differences in ranking only in a few cases.
Student No.   Raw Marks   Position   Standardised   Position   Weighted      Position
                 (O)                  Marks (OM)               Marks (OW")
 1               60       13  (J)        55          8  (J)       72           9.5(J)
 2               65        5.5(J)        58          5            75           5
 3               59       15.5(J)        52         13            69          14  (J)
 4               60       13  (J)        51         14.5(J)       69          14  (J)
 5               52       27             41         28            61          28
 6               63        9  (J)        55          8  (J)       72           9.5(J)
 7               58       17.5(J)        49         18            67          18
 8               65        5.5(J)        55          8  (J)       73           6
 9               54       25  (J)        44         23  (J)       63          23  (J)
10               50       29             40         29            60          29
11               59       15.5(J)        50         16.5(J)       68          16.5(J)
12               46       30             34         30            55          30
13               55       21.5(J)        46         20.5(J)       64          21
14               70        1             63          1            79           1
15               63        9  (J)        54         11.5(J)       72           9.5(J)
16               60       13  (J)        51         14.5(J)       69          14  (J)
17               68        2.5(J)        62          2            78           2
18               55       21.5(J)        44         23  (J)       63          23  (J)
19               68        2.5(J)        60          3            77           3
20               63        9  (J)        55          8  (J)       72           9.5(J)
21               54       25  (J)        43         26  (J)       62          26  (J)
22               56       19             47         19            66          19
23               58       17.5(J)        50         16.5(J)       68          16.5(J)
24               55       21.5(J)        44         23  (J)       63          23  (J)
25               55       21.5(J)        46         20.5(J)       65          20
26               63        9  (J)        54         11.5(J)       72           9.5(J)
27               51       28             43         26  (J)       62          26  (J)
28               63        9  (J)        55          8  (J)       72           9.5(J)
29               66        4             59          4            76           4
30               54       25  (J)        43         26  (J)       62          26  (J)
MEAN             58.93                   50.10                    68.20

(J = Joint Placing)   (Standardised Marks are called Normal Scores by Moss (2))
Table 7: Raw, standardised and weighted marks and rankings.
Terwilliger (8,9) suggests that if the total of the positive
differences between the effective weights and the nominated
weights is less than 20 per cent, adjustment is not necessary.
For the raw marks, the differences (positive only) add up to
15.8 per cent and the effect, as noted above, is small (see
Table 5 also).
The mean value of the weighted set of overall marks (after two
iterations) is 68.2 (SD = 5.97) which is higher than the mean
of the raw marks, 58.9. In this case adjustment of the level
of the overall marks is readily made by subtracting 68.2 - 58.9 =
9.3 from each of the weighted overall marks (or 68.2 - 50 = 18.2
if a mean of 50 is judged more appropriate).
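The level adjustment in the last point is a single subtraction. A minimal sketch (ours; the five marks are invented, not the thirty case-study marks):

```python
# Shift weighted overall marks so their mean lands on the target mean.
weighted = [72, 75, 69, 61, 63]          # invented weighted overall marks
target = 50                              # the mean judged appropriate
shift = sum(weighted) / len(weighted) - target
adjusted = [m - shift for m in weighted]
```

Because the same constant is subtracted from every mark, ranks and standard deviation are untouched; only the level moves.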
Issue 5: How can the quality of assessment (information) be improved?
The validity (5) of the assessment tasks and conditions should be
established, and appreciated by the students. The assessment tasks together
with model solutions and guidelines for marking should be moderated by
colleagues (assessment sub-committees). This process should also take into
consideration the implications of a prescribed pass mark and/or a prescribed
level of attainment as prerequisite for study of a subject at a higher
level. Clift and Imrie (11) discuss such matters with reference to a pro-
posed two-tier structure of assessment with tier-I at the minimum essen-
tials (recall, application) level and tier-II at the problem solving level.
Engel et al. (12) describe what promises to be a most significant innovation
in assessment which not only allows for validation and moderation
but also involves students in evaluation of the assessment tasks which are
aimed at the problem solving level. The procedure contributes to the
moderation of model answers and of 'minimum levels of competency'. It
takes place the day after the students sit the test and thus represents an
important (feedback) learning experience for the students at a time when
their interest is at a peak. And that should be a principal aim of assess-
ment: that it be part of a teaching strategy to help with the improvement
of student learning (quantity, quality and retention).
As with many other aspects of teaching, the professional judgment and
commonsense of the teacher must be depended upon to modify general recommen-
dations (9) as required by subject, student and institutional considerations.
If, for example, there is poor correlation (r < 0.4) between assessment
components, combination (as discussed in this paper) may be less important, less
valid and less useful than association. For illustration consider the
abilities required to be a 'good' driver. A 90 per cent minimum level of
competency may be required for satisfactory performance on a test of the
Highway Code. Then the would-be driver is required to demonstrate basic
mastery of the vehicle in 'normal' traffic conditions. There is likely to
be low correlation between the 'scores' of such tests; there would be little
point in combining them and little logic in using a combined score to predict
the 'good' driver. But the association of the two components separately
assessed is evidently commonsense.
The newly licensed driver will probably, it is predicted, be able to
solve most of the problems presented by changing traffic and road conditions.
With practice and continuing development of skills, attitudes and knowledge,
the driver will become 'good' at problem solving on the road.
With a course such as Zoology which involves the development of a wide
range of skills and knowledge, information is available in the marks of
assessment for rational analysis and professional judgment (5) to be used to
identify minimum levels of competency and the appropriate use of combination
and association. As with all educational endeavours there is an associated
professional responsibility to evaluate our performances as teachers and
assessors particularly when much effort has been expended on change and
innovation.
CONCLUDING COMMENTS
In this paper it has been our intention to discuss considerations of
assessment which seem likely to be overlooked when sets of marks are com-
bined. To utilize the potential of this Journal as a forum, we thought it
appropriate to consider and develop some of the issues raised in the paper
by Dennis Moss so that somewhat different points of view could be conven-
iently considered. As indicated in this paper, in this Journal (Vol.5(3)),
are documented innovations which seek to improve the experiences of assess-
ment which so greatly influence the attitudes of our students.
While we have used numbers and calculations to illustrate some of the
considerations of this paper, we have intended to emphasise that there is
no mechanical procedure for converting marks into decisions about an indi-
vidual's ability and potential. We wish to share our convictions that 'good'
examining requires high levels of competence. The efforts which staff and
students make in producing marks should invite reciprocal diligence in
analysis and in decision-making procedures which utilise not only relevant
statistical methods but, particularly, the informed judgment of teachers.
Evaluation of assessment in higher education has received little attention
in the literature and we acknowledge the interest of Zoology staff as
reported by Moss (2). We wished to contribute to such a debate by discuss-
ing the combination of marks not as an exercise in numbers but as a matter
of relationships between marks and the learning experiences of students.
We have presented a case for professional judgment and expertise when
teachers consider combining marks.
REFERENCES
(1) BURNETT, W. and CAVAYE, G. (1980), "Peer Assessment by Fifth Year
Students of Surgery", Assessment in Higher Education, 5, (3).
(2) MOSS, D. (1977), "How Reliable are Continuous Assessment Methods in
Measuring Student Performance?", Assessment in Higher Education, 2,
(3), 164-172.
(3) ELTON, L.R.B. and LAURILLARD, D.M. (1979), "Trends in Research on
Student Learning", Studies in Higher Education, 4, (1), 87-102.
(4) HIMMELWEIT, H.T. (1967), "Towards a Rationalization of Examination
Procedures", Universities Quarterly, 21, (3), 359-372.
(5) EBEL, R.L. (1965), Measuring Educational Achievement, Prentice-Hall,
New Jersey.
(6) WILLMOTT, A.S. and HALL, C.H.W. (1975), O-Level Examined: The Effect
of Question Choice, Schools Council Research Studies, Macmillan
Education, London.
(7) CRONBACH, L.J. (1971), "Test Validation", in THORNDIKE, R.L. (Ed.),
Educational Measurement (Second Edition), American Council on Education,
Washington.
(8) TERWILLIGER, J.S. (1977), "Assigning Grades - Philosophical Issues and
Practical Recommendations", Journal of Research and Development in
Education, 10, (3), 21-39.
(9) TERWILLIGER, J.S. (1971), Assigning Grades to Students, Scott, Foresman,
Illinois.
(10) FOWLES, D.E. (1974), CSE: Two Research Studies, (School Council Exami-
nations Bulletin 28), Evans/Methuen Educational.
(11) CLIFT, J.C. and IMRIE, B.W. (1980), Assessing Students, Appraising
Teaching, Croom Helm, London.
(12) ENGEL, C.E., FELETTI, G.I. and LEEDER, S.R. (1980), "Assessment of
Medical Students in a New Curriculum", Assessment in Higher Education,
5, (3).
First version of the paper received : March 1980
Final version of the paper received : September 1980
Copies of the paper and further details from:
Mr. Geoffrey Isaacs,
Tertiary Education Institute,
University of Queensland,
St. Lucia, Q. 4067,
Australia.
Mr. Bradford Imrie,
University Teaching and Research Centre,
Victoria University of Wellington,
Wellington,
New Zealand.
Geoff Isaacs is a lecturer in the Tertiary Education Institute (since
1974), and formerly taught mathematics at the University of New South
Wales.
Bradford W. Imrie is senior lecturer in the University Teaching and
Research Centre, and was formerly a lecturer in mechanical engineering at
the University of Leeds. This paper was written while he was on study
leave at the Tertiary Education Institute (September 1979 - June 1980).