Simple statistical approaches to reliability and item analysis

Data and non-SPSS programs mentioned are on my website.

Note that this account is not fully revised for 2012

1. Stats used to check on one’s instruments, not to analyse substantive results of a study

Reliability measurement (Rel) and item analysis (IA) are things you do 'on the side' in an investigation. They are not usually the main focus of the research (unless it is research on testing itself), but rather ways of checking on the quality of whatever means you use to measure the relevant variables for the research - whether it be reading proficiency, informants' social class, learners' mastery of the count/mass distinction or whatever. Along with Rel and IA also belongs what I call 'case analysis', though this is not a widespread term for it (it will be explained later). These are not the only things to check on. The other big one is validity (see separate handout): just to confuse things, though, some people nowadays think of reliability as one form of validity.

The things just mentioned are often checked as part of a pilot study, where means of measurement or 'instruments' are tried out small scale before being used for real in 'the main study'. They then form the basis for revising the test, scoring system, instructions to raters, or whatever ready for the main study.

Analysis of Rel and IA may also be done in a main study, and reported in a separate section from the substantive 'results'. However, if the reliability of the instruments is not good, at that stage there is little one can do about it, and the real results one is interested in will be less credible. If you are aiming to show that motivation affects people's success in language learning, say, you will not convince anyone if the rating scales, tests or whatever you use to measure 'motivation' and 'language learning' are shown not to give consistent results in the first place.

The basic statistics for examining reliability and doing item analysis are mostly just special uses of general stats used for other things as well. We have seen many of them already. However, here the focus is on comparing means of measurement with themselves, rather than comparing different groups of people, different conditions etc. (More complex reliability stats are described in Rietveld and van Hout, and Dunn.)

2. Introduction to reliability.

Some means of measuring or quantifying language is said to be, in the technical sense, 'reliable', if it gives virtually the same scores, rank positions or categorisations to the same cases when applied repeatedly in the same way in the same conditions by the same or a comparable measurer. Of course that says nothing about whether the test, questionnaire item, observation schedule or whatever quantifies what you want it to quantify. That is a matter of 'validity'. But if a measuring technique is reliable, you can at least be confident that when used on any occasion it is recording more or less the "true" score/categorisation of each individual measured for whatever variable it measures. So-called 'random errors' or 'measurement errors' are minimal - there is relatively little misclassification, or awarding of slightly higher or lower marks than there should be on any occasion.

It must be noted that what constitutes "the same score" etc. in the above characterisation differs depending on whether quantification is in the absolute or relative sense - see below. Indeed some prefer to use the term 'dependability' rather than 'reliability' when considering the present matters for data scored in some absolute way. This account attempts to cover both, but will stick to the term 'reliability' for all.

Details of possible causes of unreliability and how to eliminate them are not discussed here (see for example Scholfield, 1995: chs 19, 20). Rather we focus on the statistical measurement of the reliability of any measuring instrument or technique. This would of course be used as a guide to revise some aspect of the measurement method, test, classification procedure etc. to reduce unreliability in future, or at least taken into account when interpreting any scores obtained: these matters are not pursued here either.

The simple approaches to quantifying reliability reviewed here all require two or more parallel sets of scores/categorisations of the same cases, obtained typically in one of the following ways. I.e. these are the fundamental DESIGNS of reliability study. (Note: IA applies only in situation D; 'case analysis' can be done in all of A-D.)

(A) Two or more different scorers, researchers, raters etc. ('judges'), of the same general type, score/rank/categorise the same cases (people, sentences etc.). For instance the same set of English compositions obtained from a group of learners are double marked (blindly) by competent teachers, or ones trained by the researcher on the same scoring system. The coefficient resulting from analysis of the two or more sets of scores etc. is regarded as measuring 'interjudge/inter-rater reliability'.

(B) The same judge rescores/recategorises the same cases on two or more occasions, using the same primary data. For example he/she re-marks the same set of English compositions, after a suitable time gap so that he/she has forgotten specific marks given. The coefficient resulting from analysis of the two or more sets of scores etc. is regarded as measuring 'intrajudge reliability'.

(C) The same cases are measured/categorised on two or more successive occasions by the same measurer, using either the same or an equivalent test or other instrument repeatedly. For instance, the MLU of some children is measured from speech samples elicited in the same way on two successive days. Or learners take three equivalent 40-item vocab size tests one after the other instead of one long one. The coefficient resulting from analysis of the two or more sets of scores etc. is regarded as measuring 'test-retest reliability' or 'stability'.

(D) Scores are obtained on just one occasion from one group of cases by one measurer, using some instrument where an overall score is calculated for each case from a lot of items scored separately, like a multi-item test or an attitude inventory with a gamut of agree/disagree items. There is then internal analysis of the scores: typically two scores are obtained for each case by dividing the set of items into two and calculating a score on each half. (Note that this is not dividing the group of cases in half and comparing scores between halves in that sense!) The coefficient resulting from analysis of the two sets of scores is regarded as measuring 'internal consistency' reliability.

Each of these in effect focusses on a different aspect/cause of unreliability - it is up to the researcher to decide which aspect(s) are the ones most in need of checking in a particular study. For instance if the measurement is by a test with straightforward correct/wrong items, it may be assumed that reliability as per A and B may not be a problem, but D may be worth investigating. If the measurement involves coding of strategies in transcripts of think aloud protocols, then A and B may be more at issue, as even experts are very likely to disagree on what is an instance of this or that strategy.

In all these approaches the cases should be of the same sort, a homogeneous group, and probably at the very least 20 in number to give a safe estimate of reliability. They should be the same sort of people as the instrument is to be used on 'for real' later. All 'facets of observation' are in principle kept constant or are random alternatives of the same type. By that is meant that rescoring, retesting etc. is done in the same or equivalent conditions (room, time of day etc.) with the same instructions etc. Either the same test or whatever is used repeatedly, or 'equivalent' parallel forms. The aim is not to compare even slightly different means of quantification, and certainly not primarily to compare different types of people or different conditions: focus is on analysing the instrument itself. There are numerous problems with keeping things genuinely 'identical' when measuring the same people twice and so forth that cannot be pursued here (see for example Scholfield, 1995: chs 19, 20; Bachman, 1990: ch 6).

More complicated reliability studies are often combinations of the above. For example, the same cases are measured on several occasions and with several measurers. Further elaborations can involve varying substantive factors usually held constant. For instance one can remeasure more than one group of cases, to see if reliability is different when different types of case are involved. Similarly one can vary the type of measurer (e.g. teacher versus researcher), the precise conditions of testing etc. to see their effects on reliability. We stick with simple approaches here.

The simple approaches to quantifying reliability all yield two or more parallel sets of figures for the same cases. I.e. the data looks like that used in repeated measures and correlation studies we have met before. Sometimes, especially with interjudge reliability, there are three or more sets of scores/categorisations to examine. To obtain a coefficient of reliability, the sets have to be analysed statistically for closeness or agreement in some way. The reliability measures are then interpreted in part depending on how the repeated scores were obtained - from repeated judges or occasions or what - and in part on what statistical measure of agreement or the like was used (sketched here).

Some of the reliability measures used are ones we have met before (like the Pearson r correlation coefficient), others not. But either way, you need to think a bit differently in reliability studies from the way you do in analysis of general research results. In particular, significance tests and p values are not of very great interest in reliability work and are not usually quoted. This is because when you are comparing a measurement 'with itself' in the above sorts of ways (A-D) it would be remarkable to fail to get a significant amount of agreement. What is crucial is not just significant amounts of agreement or relationship or whatever, since 'significant' means no more than 'definitely better than zero'. To show a measurement is reliable you want high actual levels of agreement or relationship etc. - e.g. Pearson correlations close to +1. In short, in reliability work you mainly use just graphs and relevant descriptive statistics to show what is going on.

The choice of appropriate statistical measure of reliability depends on a number of things, especially:

(1) whether two or more than two sets of repeated scores or categorisations have to be considered (arising in designs A-C above), or an internal analysis of one set (from design D above)

(2) the scale type of the variable (interval, rank order, ordered categories, nominal categories)

(3) whether the scores/categorisations are considered as having relative or absolute value.

These three considerations are used to organise the overview of reliability coefficients below.

Scale type is essentially the matter of whether the figures obtained have been derived in such a way that they have to be regarded as 'interval' or 'ordinal' or labelling a 'nominal' classification. Typically scores from a test or MLUs or error counts and often rating figures will be regarded as interval, i.e. having the full properties we expect of numbers. If cases are given numbers indicating just who is better than who, but not by how much (rank position), then that is ordinal data. If one assigns cases to ordered categories, such as social classes A, B1 etc., that too is ordinal. If numbers just label categories with no order involved, as when one assigns errors to five different types (grammatical, lexical, spelling etc.), or classifies people as having either cleft palate or normal palate, then that is a nominal categorisation. (For a fuller account see Scholfield, 1995: chs 11-18.)

The relative/absolute distinction is particularly important since in everyday parlance when we talk about the "closeness" or "amount of agreement" of two sets of scores or categorisations we probably think first of some absolute correspondence. However, the most popular reliability coefficients cited in studies are based on treating scores as relative: essentially they are those we met in the two variable correlational design - measures of symmetric relationship of the highest type the scale allows - preferably linear.

In a nutshell, absolute measurement is done with an eye to some predecided standards of language ability or performance which decide what score a case gets, or what category they are placed in. Relative quantification on the other hand is done more with an eye to the ability or performance of other cases measured, relative to which an individual gets a score or is placed in a category. Thus an absolute, 'criterion-referenced', test of receptive knowledge of English phrasal verbs would aim to quantify how many phrasal verbs each case actually understands, as a proportion of all such verbs. A relative, 'norm-referenced', test of the same thing would be concerned rather with discriminating which cases know more phrasal verbs than which other cases (and not particularly with quantifying how many anyone actually knows).

In general, interval scored data may be either relative or absolute, as may data in ordered categories. One must think hard to see which such data is best thought of as trying to be, given the type of instrument used. On the other hand rank ordering can only really be relative and nominal categorisation only absolute. If unfamiliar, the difference will become clearer as its consequences are seen below. (For further elucidation see also Scholfield, 1995: ch 10.)

If we think just of interval scores, we can see the essential consequences for measuring reliability this way. With the absolute or criterion-referenced view of the scores, we would say measurement is reliable if on the two occasions or with two scorers or whatever each case gets almost exactly the same score. For instance, someone who scores 75% with one scorer scores 76% with the other. Statistically this is a matter of 'similarity' or 'difference' or 'proximity'.

With the relative/norm-referenced/psychometric view of scores we would say measurement is reliable if on the two occasions, or with two scorers or whatever, each case is placed in more or less the same position relative to the scores of the group. For instance, someone who is just above average, and in 14th place from the top in a group of 35 with one scorer, comes out almost the same amount above another scorer's average and in 15th place with them. Statistically this is (in simple approaches) a matter of 'correlation'.

The two views are not identical, because in the relative version the actual scores awarded to the same case by two scorers or whatever could be more different. Reliability might still be regarded as high provided the scores of the group had all shifted in parallel up or down between the two occasions or scorers.
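To make the contrast concrete, here is a minimal Python sketch (the scores are invented purely for illustration, and numpy and scipy are assumed to be available) showing that a constant shift between two scorers leaves the relative (correlation) measure perfect while absolute agreement is poor:

import numpy as np
from scipy.stats import pearsonr

# invented scores: the second scorer gives every case exactly 10 marks more
first = np.array([50, 60, 70, 80])
second = first + 10

r, _ = pearsonr(first, second)
print("Pearson r:", r)                                          # 1.0: perfect relative agreement
print("mean abs. difference:", np.abs(first - second).mean())   # 10.0: poor absolute agreement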

We now attempt an overview of simple statistical means of quantifying reliability.

RELIABILITY IN SITUATIONS A-C

3a. Two occasions or judges: data viewed as interval with relative value

Main stats: scatterplot, Pearson r.

The following simple sets of data will enable us to see what the most commonly cited measure of reliability, the 'Pearson r correlation coefficient', actually tells us.

Each of the following sets of figures is the imagined results for four subjects measured or marked twice on the same measure (reliability design C), or by two markers (A). The sets are deliberately ridiculously small both so as to be quick to enter in SPSS and so as to be especially easy to see by eye how similar the two sets of results are.

For each set

(a) enter the figures in the Data grid

(b) get the Pearson r correlation coefficient calculated

(c) get a scatterplot made, with suitable labels for the axes. In order to be able to see which case is which on the scatterplot, it is also useful to edit it to label the squares with case numbers: Edit the graph: Chart... options... Case labels... set them to On.

(d) answer the questions below

If not sure how to do a-c, look back esp. at LG475 SPSS tasks for step by step instructions.

Set 1
            First time   Second time
Subject 1   13           13
Subject 2   18           18
Subject 3   16           16
Subject 4   21           21

Does this look like a reliable measure? Are the columns of results 'agreeing' closely?

Does the value of Pearson r reflect your intuition?

If not, why not?

Does the scatterplot reflect your intuition too?

How does the scatterplot relate to the value of r?
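If you would like to check the toy sets outside SPSS as well, here is a minimal Python sketch (assuming scipy and matplotlib are installed) that computes Pearson r for Set 1 and draws the labelled scatterplot; swap in the figures of the other sets to check your answers:

from scipy.stats import pearsonr
import matplotlib.pyplot as plt

first = [13, 18, 16, 21]     # Set 1, first time
second = [13, 18, 16, 21]    # Set 1, second time

r, _ = pearsonr(first, second)
print("Pearson r =", round(r, 3))

fig, ax = plt.subplots()
ax.scatter(first, second)
for i, (x, y) in enumerate(zip(first, second), start=1):
    ax.annotate(f"S{i}", (x, y))    # label each point with its case number
ax.set_xlabel("First time")
ax.set_ylabel("Second time")
plt.show()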

Set 2
            First time   Second time
Subject 1   13           20
Subject 2   16           22
Subject 3   18           23
Subject 4   21           25

Does this look like a reliable measure? Are the columns of results 'agreeing' closely?

Does the value of Pearson r reflect your intuition?

If not, why not?

Does the scatterplot reflect your intuition either?

How does the scatterplot relate to the value of r?

Set 3
            First time   Second time
Subject 1   13           13
Subject 2   16           16
Subject 3   18           21
Subject 4   21           18

How does this set differ from sets 1 and 2?

Intuitively, is agreement better or worse than set 2, do you think?

What is r for this, and what does it reflect? Does it show this set has higher or lower reliability than set 2?

Examine and comment on the scatterplot.

Set 4
            First time   Second time
Subject 1   13           13
Subject 2   16           14
Subject 3   18           20
Subject 4   21           21

How does this set differ from the ones above?

Which set above is it most like?

What r do you expect to get?

Do you get it?

Look at the scatterplot and see why. r is obviously sensitive to .... what aspect of the scores?

Set 5
            First time   Second time
Subject 1   12           12
Subject 2   16           16
Subject 3   19           22
Subject 4   22           19

How does this set differ from the ones above (esp. set 3)?

Which of them is it most like?

What r do you expect to get?

Do you get it?

What might be causing the result?

From this you should get a feel for what sort of 'reliability' the Pearson r measures:

Perfect reliability: what does r = ?

Good reliability: r should be greater than what, do you think?

Is it true that r is greater if the corresponding scores in the two sets are close to each other in value?

Is it true that r is higher if scores are spread farther apart within the sets? (i.e. greater SD and variance)

Is it true that r is greater if the two sets of scores are in the same order?

Do you think reliability measured this way is satisfactory for most research? Why?

Is r as in 'Pearson r' short for 'reliability'?

3b. Two occasions or judges: data viewed as interval with absolute value

Main stats: histogram of absolute differences for cases, mean absolute difference.

You may feel that for some purposes the sort of similarity of relative order and distance that r reflects is not the sort of 'sameness' you want to measure as an indication of reliability. If people are being graded on a number scale from 'native speaker proficiency' = 20 to 'complete beginner' = 0, then you would want people to get the same actual number grade, whoever measured them, if the measure is reliable (not just be placed in the same order). Then you need a different measure of agreement.

How could you measure how close to being identical in some absolute sense two columns of scores are?

Could you just subtract each of one column of scores from the corresponding one in the other column and add the differences?

Try doing this by hand for the above sets of data and see if it works.

In fact to get a useful absolute agreement measure you have to add a proviso to how you calculate and add the differences. You need the 'mean absolute difference' measure of reliability. See if you can get SPSS to calculate it for you for one of the above sets of data. Set 4 is a good one to try.

To go about it you click the Transform menu and choose Compute. You are offered a screen where you can generate a new column of figures (Target Variable) by formula from ones you have already. In this instance let's call the Target Variable diff. Click your existing columns into the Numerical expression box top right and connect them using ABS from the Functions list. Use a formula such as ABS(col1 - col2).

Having got your new column, look at it in the data editor window
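Outside SPSS, the same calculation takes only a few lines of Python; a minimal sketch using the Set 4 figures above (numpy and matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt

first = np.array([13, 16, 18, 21])     # Set 4, first time
second = np.array([13, 14, 20, 21])    # Set 4, second time

diff = np.abs(first - second)          # same idea as SPSS ABS(col1 - col2)
print("absolute differences:", diff)
print("mean absolute difference:", diff.mean())

plt.hist(diff)                         # histogram of the absolute differences
plt.xlabel("Absolute difference")
plt.ylabel("Number of cases")
plt.show()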

How have the differences been calculated? Check a few to see they are right.

How does the 'absolute difference' between scores differ from just the 'difference'?

Get the histogram and descriptive stats for that new difference column.

A bigger mean absolute difference means ... greater or less reliability?

Does this mean, as a reliability measure, have a logical maximum and minimum value?

Can you see how the histogram would enable you to spot the cases that were most disagreed about?

What shape would you expect the histogram to have: not the 'Normal' distribution shape.... Why?

You may have had enough of this toy data by now. So look at a set of real data. I have given the data not as an SPSS data file, ending in .sav, but as a plain text file agjudg2.dat where the data is just wordprocessed in columns with spaces between.

Sidepoint. Opening a file composed on a wordprocessor or the like, and saved as plain text/DOS text, as a file with the ending .txt or .dat.

Simplest way is to use the menu option File... Read text data. Click on the arrow beside the Files of type space and browse for files ending in .txt or .dat.

When you choose the file, the Text import wizard should be activated, and you can go through it probably without altering any of its choices to get the data loaded satisfactorily.

This illustrates how you can get data into SPSS without retyping it even if you have it only originally in wordprocessed form, or scanned in from a book, or as output from some other software that saves data as plain text files...

Two markers, P and V, marked a set of 45 undergrad applied linguistics exam scripts without seeing each other's marks. What is the reliability of their marking?

Which reliability design is involved (A-D)?

Get the data from my disk - file agjudg2.dat - and calculate both the relative and absolute types of reliability coefficient as described above. Get the scatterplot, with cases labelled (when editing the graph, use Chart... options... case labels... On), and the histogram of absolute differences.

Interpret them

Is the reliability good on either the relative or absolute measure?

Remember the max and min values possible for Pearson r are ....what?

The max and min possible for the mean absolute difference depend on the length of the scale available.

If DLL in practice gives marks only effectively between 35 and 80, what are the max and min possible mean absolute difference to compare the obtained value with? Which end is 'good' - i.e. to show high absolute reliability do we want a mean abs difference that is small or large?

See how the scatterplot shows which students the markers disagreed on most.

This is what I call 'case analysis': looking at the individual people or whatever who seem to be off-line (outliers) causing disagreement and trying to explain why (and in research perhaps eliminating them from the sample if there is good reason).

Do the same cases come up as 'odd' on the histogram as on the scatterplot? If not, why not?

Would you say university exam marking is supposed to be on a relative or absolute scale?

So which reliability coefficient is more worth looking at in this instance?

Does it inspire confidence in the double marking system? Think carefully.

Can you get SPSS to tell you which marker was more 'generous'?

Absolute reliability is rarely reported by researchers, but I think that is because all the stats books are fixated on relative measures like Pearson r. The mean absolute difference does not have a maximum of +1 meaning 'perfect reliability', of course, but it can be turned into such a measure by putting it in the formula:

Absolute measure on scale 0 to 1 = 1 - (Mean abs. diff. / Length of interval scale)
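For instance (the figure of 4.5 is invented purely for illustration): if two markers' mean absolute difference came out at 4.5 on the 35-80 exam marking scale mentioned above, whose length is 45, the coefficient would be 1 - (4.5 / 45) = 0.9.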

Check that makes sense for the ones we have used above.

3c. Two occasions or judges: data rank ordered with relative value

Much as for 3a, using Spearman ρ (rho) coefficient instead of Pearson r.

4a. Three or more occasions or judges: interval data with relative value.

Main stats: scatterplots, Cronbach's alpha.

It is not common except in professional testers' reliability studies to obtain more than two sets of scores from supposedly identical quantification occasions to compare. However, quite commonly it is both easy and useful to use more than two judges of the same sort to score or rate the same cases. You can of course use the relevant previously described approaches, both visual and numerical, to look at the correlation between each pair of occasions or judges separately, but ideally you would like an overall measure of the amount of agreement as an indication of collective reliability. Hatch and Lazaraton (p533) give a method that relies on averaging the correlations between pairs of judges.

A standard solution is Cronbach's alpha (Greek letter α), which can be thought of as a bit like a Pearson correlation coefficient across more than two occasions or judges of the same things. Its maximum is +1, but it can achieve peculiar negative values (theoretically of any size) in some odd situations. It has similar properties to Pearson r in that it is not so much affected by the correspondence of absolute sizes of the scores, more by the agreement in order and spread of them between particular columns (= occasions or judges). SPSS also produces useful statistics confusingly called 'alpha if item deleted' (for 'item' read 'occasion or judge' in the present discussion). This tells you what the alpha would be if a particular judge was left out and alpha just calculated from the others. Obviously if the alpha goes up drastically when a judge is left out, that suggests that that judge was not rating very much in the same way as the others, but is somewhat of a maverick. One might not trust that judge in future (unless of course that one turned out to be you, the researcher!).

To try out an analysis with alpha, get hold of my data agjudg3.dat (again this is not an SPSS file). This consists of ratings by three judges of 14 learner compositions (design A again).

The three markers clearly did not use the same scale to mark on, so whether or not each was marking on an absolute scale, we can only treat their marks as relative scores for further analysis. The reliability question that alpha settles is, 'To what extent are the three markers' scores mutually correlating positively?'

First get the visual picture by making scatterplots for each pair of markers as before.

Three markers, so how many scatterplots needed to see all possible pairs?

By inspection of the scatterplots, which pair of markers agrees better? Which marker is the odd one out?

As usual look for cases that are off line - compositions most disagreed about.

Get the correlation calculated between each pair of judges. When you ask for this in SPSS just put all three variables in at once and practise spotting the informative bits of the 'correlation matrix' which appears in the output (remember SPSS gives you everything twice in this, plus other rubbish!).

Now go to Analyze... Scale... Reliability and under Statistics get Scale if item deleted. In the output remember that 'items' means 'judges' in this instance.

How high is alpha? Is it convincingly showing high agreement of judges and so interjudge reliability? Remember that reliability coefficients of the correlation type need to be high in absolute terms (over .8) to be impressive.

Which judge comes out as the rogue one? The same as we saw from the correlations?

4b. Three or more occasions or judges: rank order data with relative value.

Main stats: scatterplots, Kendall's W coefficient of concordance.

A very useful simple measure is 'Kendall's W', also called 'Kendall's coefficient of concordance'. This is actually a technique for quantifying the amount of agreement in several rank orderings of the same cases, though it can also be used for interval data like that in 4a above (with loss of some information). Often it is used where one doubts if 'interval' data really is equal interval. You could use this as a cautious choice where cases have been given several scores that might be regarded as interval, say by several teachers rating the same essays in a non-absolute way, but where you feel doubt as to whether they really observed equal difference between rating levels. In their marking, is an essay rated as worth 11 different from one rated 10 by the same amount of quality as one rated 4 is different from one rated 3, for example? Often one cannot be sure in instances when the figures come from people rather than from instruments like computers measuring response times. One opts to treat it as an interval scale or not on one's own judgment. With the W measure, the data is treated as having been rank ordered only by each judge, and some information is lost.

Kendall's W is like a Spearman rho correlation for three or more rank orderings, and comes out between 0 and 1, with 1 indicating perfect 'agreement' (see e.g. Cohen and Holliday, 1982 for calculation). Parallel to what has been said already, if three scorers differ in some way in terms of actual level of score, e.g. A always gives higher marks than B and B higher than C, this will not be reflected in such an interjudge reliability coefficient. That is quite in order for a norm-referenced technique but of course is not really satisfactory for a criterion-referenced one. For instance Fischer (1984) got ten untrained raters to use his communicative rating measure on 18 learner texts and reports a W of .725. This indicates a fair level of agreement in the order of merit the raters assigned to the texts, but says nothing about whether they agreed much in the actual level of communicativeness they attributed to each.
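SPSS will compute W for you (as described below), but as a sketch of what is going on, here is the standard formula in Python, ignoring the correction for tied ranks; the marks are invented purely for illustration and numpy/scipy are assumed to be available:

import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores):
    """scores: one row per judge, one column per case; no tie correction applied."""
    ranks = np.array([rankdata(row) for row in scores])   # turn each judge's marks into ranks
    m, n = ranks.shape                                     # m judges, n cases
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# invented marks by 3 judges for 5 compositions, purely for illustration
marks = [[12, 15,  9, 17, 11],
         [14, 16, 10, 18, 13],
         [11, 14, 10, 15, 12]]
print("Kendall's W =", round(kendalls_w(marks), 3))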

To try out an analysis with W, use my data agjudg3.dat again.

The reliability question that Kendall's W answers is 'To what extent are the three markers putting the compositions in the same order of merit?'

We have already analysed the scatterplots, so go direct to get Kendall's W calculated with SPSS to get the overall reliability of the three markers together. There is a slight problem here. One would normally have the data entered as in my file, with the 14 cases measured (the compositions) down the side forming the rows, and the three judges across the top, one per column, but this is not the way SPSS requires it. SPSS will calculate a result from this, but it will be the wrong one. To get it done correctly, you must first make the rows the columns and vice versa.

To do this is easy in SPSS. Click the Data menu and then choose Transpose. Complete the dialog box by getting the three variables into the right hand space and your data will get rearranged to be three rows and 14 columns, plus an extra column put in automatically labelling the three judges. The columns will be labelled with labels like var0001 in the usual default manner.

From now on SPSS will refer to the three judges as the cases and the 14 compositions as variables, which is confusing, but there you are....

Now go to the Analyze menu.... Nonparametric.... K related samples.... and click the box to activate Kendall's W. Click to deactivate Friedman's test. Make sure all 14 'variables' are included.

From the output you will get a value of W

How high is it on a scale 0 to 1? Similar or not to alpha?

Is it convincingly showing high agreement of judges and so interjudge reliability?

Remember that reliability coefficients on a scale 0-1 need to be quite high in absolute terms (over .8) to be really impressive.

You will also get a level of significance.

What is it?

Does it show the value of W is significant?

Why do you think that in reliability study people don't pay much attention to significance levels (p values)?

Look at the Mean Rank part of the display. Remember the statistical procedure has simply turned each judge's marks into a rank ordering of the 14 students' compositions. SPSS is displaying here the average ranking of each composition over the three judges. You can do some useful 'case analysis' (currently being called variables by SPSS, remember):

Which composition is being judged collectively best, which worst? (Does SPSS rank from 1 = best to 14 = worst, or the reverse?)

If you have them, look at the actual compositions and see if you are surprised.

Where are the judges having trouble agreeing on a clear difference of quality? Look for pairs of compositions where the rankings are tied.

4c. Three or more occasions or judges: data viewed as interval with absolute value

Main stats: Histogram of individual SDs of cases, Mean SD

For interval measurement with some absolute value, such as criterion-referenced tests and many counts and ratings, alpha or Kendall's W is not so suitable. Since the actual scores "mean" something, to be reliable you'd want any such measurement to yield the same actual marks, not just marks in the same order, when different testers/raters etc. are involved. You can of course use the method of 3b repeatedly. Or a simple approach to reliability here is to calculate a standard measure of spread, such as the 'standard deviation' (SD), to quantify the closeness of the three or more scores of each case. SD is described in any elementary statistics text. The average SD over the group of cases measured then serves as an absolute 'difference' measure of reliability. You can do this following the procedure for making a column of absolute differences, except you ask for the SD of the set of columns instead of ABS in the formula.

As for the abs. differences in 3b, we can construct a frequency histogram and see what is the commonest SD, and the largest SD - i.e. the most unreliable case - and so on. Mean SD comes out as zero if the three or more scores of each case are the same, albeit the scores are different for different cases. A higher SD figure reflects lower agreement between the repeated scores. The logical maximum average SD, signalling maximum difference, is half the length of the score scale (except for some rare situations where it can be a little greater). So it is 50 for scores in percent, 3 for a 1 to 7 rating scale (because (7 - 1) / 2 = 3). The possible maximum must always be borne in mind, since an average SD of, say, 2.5 will signal quite a different degree of unreliability, in this absolute sense, depending on it. As at the end of 3b, we could re-express a mean SD as a coefficient with maximum value 1.

You can do an absolute reliability analysis like the above quickly using my ABSREL2 program. You save your data as Fixed ASCII from SPSS, in a file with the suffix .dat. Get my program from me or the website. It is a clunky old executable file that runs in DOS. Run it to get the full instructions; follow them and look at its output file. It does a few things that it takes time to get SPSS to do. It identifies the rogue judges and cases automatically.

Note, sometimes the means of the three or more repeated tests or measurements are presented as evidence of this sort of reliability. However, as the examples below show, a test on three occasions can have the same mean score but still not be reliable in this absolute sense. Imagine three subjects have their vocab size measured by equivalent tests on three occasions. The sizes the test gives for their vocabularies are as follows.

Example 1

        Occasion 1   Occasion 2   Occasion 3   SD
S1      200          200          200          0
S2      400          400          400          0
S3      600          600          600          0
Mean    400          400          400          0

Here all subjects score the same on each occasion/test, so absolute reliability is perfect. This is reflected in the fact that the SDs are all zero, so mean SD is zero. The mean of each test is also identical.

Example 2

        Occasion 1   Occasion 2   Occasion 3   SD
S1      200          200          200          0
S2      500          600          700          100
S3      500          400          300          100
Mean    400          400          400          66.7

Here the subjects do not all score the same on each test, so absolute reliability is not perfect. This is reflected in the fact that the SDs are not all zero, so the mean SD is not zero. However, the means of the equivalent tests are still identical, so the means are not a good guide to criterion-referenced reliability here (the correlations between occasions are also high, showing good norm-referenced reliability).
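A minimal Python sketch (numpy assumed available) that reproduces the Example 2 figures; note that the SDs in the table are sample SDs (the n - 1 version), which is what SPSS computes:

import numpy as np

# the three occasions' scores from Example 2 above, one row per subject
scores = np.array([[200, 200, 200],
                   [500, 600, 700],
                   [500, 400, 300]])

case_sds = scores.std(axis=1, ddof=1)          # SD across the occasions for each subject
print("SDs per case:", case_sds)               # 0, 100, 100 as in the table
print("mean SD:", case_sds.mean())             # 66.7: absolute reliability imperfect
print("occasion means:", scores.mean(axis=0))  # 400 each time, despite that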

5. Two or more occasions or judges: data in ordered categories.

Skipped here

6. Two or more occasions or judges: ordered/unordered categories with some absolute value.

Main stats: proportion or percentage agreement, Cohen’s kappa, Jaccard coefficient

6b. Two judges (or occasions): category data with absolute value.

When cases are put into categories by two judges, percentage agreement is the simple measure of agreement widely used. That is, out of all the cases/people that the judges categorised, what percent did they both put in the same category? However, the argument is often produced that if the number of categories is small, two judges would produce some apparent agreement even if they were randomly putting cases in the categories, and one needs to adjust for that. Example: two examiners look at written essays by 20 people and make a judgment on each person either that the person shows adequate mastery of academic English for university study or not. If they each in fact made no rational judgment but just randomly placed each student in the mastery or non-mastery category, how much agreement would we get? Answer: there are 4 possible outcomes for each person judged: both judges say yes, both say no, the first says yes and the second no, the first says no and the second yes. Each of those is equally probable, so in principle there would be agreement on 10 of the students, i.e. half, on that basis.

Some experts therefore like Cohen’s kappa which claims to adjust for random agreement.

Personally I still like the % agreement, for three reasons.

a. When talking about reliability, we are concerned with judges (examiners, researchers, raters etc.) who are NOT likely to be actually placing people randomly into categories, but trying their best to do the job rationally. Hence do we need an adjustment assuming that they are doing this to some extent? If it was students, children or other people with no expertise in the categorising being done who were doing it, then we might take a different view. I.e. such people might be blindly guessing.

b. The more categories are involved, the smaller the chance agreement rate becomes (assuming equal distribution of judgments over all the possible combinations).

Number of categories   Chance agreement as % of cases judged
2                      50
3                      33.3
4                      25
5                      20
6                      16.7
7                      14.3
8                      12.5
9                      11.1
10                     10

In language research often the categorisation is, say, of chunks of interview transcript (the cases) into a qualitative coding system the researcher has developed, which may contain 30 or more categories. The amount of chance agreement here is negligible (3.3%).

c. There have been many criticisms made about Cohen's kappa from a technical point of view. For instance the amount of agreement it shows varies not just depending on how many agreements there are, but on the precise proportions of different types of agreements and disagreements. Do we really want that? Thus if 2 judges put 20 people into two categories A and B as follows, kappa comes out as 0 (as we might expect, because the agreements are not better than the chance rate):

                    First judge
                    A     B
Second judge   A    5     5
               B    5     5

But if the disagreements are spread as we see here, kappa = .2, even though the number of agreements on the diagonal is the same.

                    First judge
                    A     B
Second judge   A    5     0
               B    10    5

And if the judgments were distributed like this, kappa changes again to .083.

                    First judge
                    A     B
Second judge   A    9     0
               B    10    1
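To check figures like these yourself, here is a minimal Python sketch of the unweighted kappa calculation from a contingency table (numpy assumed available); it reproduces the three values just quoted:

import numpy as np

def cohen_kappa(table):
    """Unweighted kappa from a second-judge-by-first-judge contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_observed = np.trace(table) / n                                      # agreement actually found
    p_expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2   # chance agreement from the marginals
    return (p_observed - p_expected) / (1 - p_expected)

for table in ([[5, 5], [5, 5]], [[5, 0], [10, 5]], [[9, 0], [10, 1]]):
    print(table, "kappa =", round(cohen_kappa(table), 3))   # 0.0, 0.2, 0.083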

Also, the basic kappa usually calculated (as above) does not have a possible maximum value of 1 for most data, but less than 1, so it is hard to evaluate how great the agreement really is that is being reported: if kappa comes out at .36, but we don't know what the maximum is that it could be for our data, what do we make of it? (See further below on how to remedy that.)

However, since it is popular, here is a bit on the kappa coefficient (Cohen’s kappa).

Take this example. Two markers rated the same 19 students for speaking ability on the Common European Framework of Reference scale, which runs from A1, the lowest, to C2, the highest ability grade.

A Basic Speaker
  A1 Breakthrough or beginner
  A2 Waystage or elementary

B Independent Speaker
  B1 Threshold or intermediate
  B2 Vantage or upper intermediate

C Proficient Speaker
  C1 Effective Operational Proficiency or advanced
  C2 Mastery or proficiency

When listed as categories the data looks like this:

Judge 1   Judge 2
B2        B2
A2        A2
B2        B2
C2        C2
B1        B1
B2        B2
B1        B1
B1        B1
B1        B2
B2        B2
B2        C1
B2        B2
C1        C1
C2        C2
A2        B1
B1        B1
C1        C1
B2        B2
A2        A2

The percentage agreement we can calculate from that is (16/19) x 100 = 84.2.

Advocates of Cohen's kappa however argue that even if the judges/raters/scorers were randomly assigning grades to students there would still be some agreement, and that needs to be discounted from the agreement measure. That is what kappa claims to do.

When displayed as a contingency table (= SPSS Crosstabs), the data looks like this:

judge2 * judge1 Crosstabulation (Count)

                      judge1
              A2   B1   B2   C1   C2   Total
judge2   A2    2    0    0    0    0     2
         B1    1    4    0    0    0     5
         B2    0    1    6    0    0     7
         C1    0    0    1    2    0     3
         C2    0    0    0    0    2     2
Total          3    5    7    2    2    19

Kappa is done by SPSS under Analyze... Descriptive statistics... Crosstabs... Statistics. SPSS gives the value of unweighted kappa and the significance value but not certain other kappas which are more useful. For those use the website http://faculty.vassar.edu/lowry/kappa.html

Instructions...

1. Generate the Crosstabs table in SPSS (and get kappa and its sig if you like).

2. Go to the website above and Select number of categories and choose 5 (since this data involves placing cases into 5 categories A2 B1 B2 C1 C2).

3. In the yellow spaces in the grid just enter the raw numbers exactly as in the SPSS crosstabs table. Think of category 1 as A2, category 2 as B1 etc.

4. Click Calculate and look at the following output:

Unweighted kappa observed (that should be the same as what SPSS produced and in published work is what is usually quoted as kappa)

Unweighted kappa as proportion of maximum possible (that has the same p value as the preceding)

Linear weighted kappa observed

Linear weighted kappa as proportion of maximum possible

Unfortunately there are various pros and cons one could debate as to which of these four is the best figure to use as a measure of interjudge agreement. And none of them gets rid of some of the objections mentioned above. Whether you use kappa or not rather depends on whether you might have an examiner who would demand it! Here is my judgment:

The two 'observed kappas' do not usually have 1 as their potential max value, so they are rather difficult to interpret and compare. Hence I would not use them. In reliability work one wants to know how big the agreement is, and wants it to be high. One is less concerned with significance, since it would be remarkable if two markers did not agree enough to be significant, even if their agreement was far from perfect. So we look more at the descriptive measure of agreement (such as we did with Pearson r etc. earlier). Hence a descriptive statistic that does not have a known maximum possible value is not much use, as we cannot tell how high it really is. The 'kappa out of maximum possible' figures make more sense as they do have a max of 1, more like correlation coefficients. So I would use one of those, but make it clear in the writeup that that is what is being cited.

If your data involves cases placed into just two categories (e.g. students judged by more than one examiner as pass or fail for something), or more than two but unordered nominal ones (e.g. words judged by more than one expert to be general English versus academic English versus technical terms), then only look at the unweighted figures.

If the categories are logically ordered, as the CEFR grades are, then the weighted kappa is probably preferable. It differs from unweighted in that it gives some credit to scorers where they differ in only one scale category rather than two. I.e. if both judges say a case is B2, then perfect agreement is recorded; if one judge says a case is B1 and the other says B2 they get recorded as partially agreeing; while if one says B1 and the other says C1 they get a lower credit for agreement, and so on. Unweighted kappa only gives credit to perfect agreement.
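If you would rather not use the website, here is a minimal Python sketch (assuming scikit-learn is installed) that computes the percentage agreement plus the observed unweighted and linear weighted kappas from the 19 grade pairs listed above; the 'as proportion of maximum possible' versions would still need the Vassar page or a hand calculation:

from sklearn.metrics import cohen_kappa_score

# the 19 CEFR grade pairs from the example above
judge1 = ['B2', 'A2', 'B2', 'C2', 'B1', 'B2', 'B1', 'B1', 'B1', 'B2',
          'B2', 'B2', 'C1', 'C2', 'A2', 'B1', 'C1', 'B2', 'A2']
judge2 = ['B2', 'A2', 'B2', 'C2', 'B1', 'B2', 'B1', 'B1', 'B2', 'B2',
          'C1', 'B2', 'C1', 'C2', 'B1', 'B1', 'C1', 'B2', 'A2']
order = ['A2', 'B1', 'B2', 'C1', 'C2']   # the ordered CEFR categories

agree = sum(a == b for a, b in zip(judge1, judge2)) / len(judge1)
print("percentage agreement:", round(100 * agree, 1))   # 84.2, as calculated above
print("unweighted kappa:", cohen_kappa_score(judge1, judge2, labels=order))
print("linear weighted kappa:", cohen_kappa_score(judge1, judge2, labels=order, weights='linear'))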

6d. Two judges (or occasions): category data with absolute value where the categories themselves are not 'given' in advance.

When researchers categorise data from open response sources like think aloud protocols, interviews or observation, they first have to adopt someone's classification scheme or develop what they think is a workable and valid system of classification of their own. This revising and developing of the set of categories to use requires work with other experts (e.g. the PhD student gets the supervisor to do some independently and they discuss): this is more a matter of 'content validation' than reliability. Thus reading strategy researchers need to work out a suitable set of different strategies, with a definition of each, which they look for evidence of their subjects using: this could come from the lists of other researchers, or in part from their own data. The motivation researcher, having interviewed learners and asked their reasons for learning a FL, has to establish a set of distinct types of reason that they mention, again using the lists of other researchers (e.g. Gardner) and/or relying on analysis of what their own subjects actually suggested (maybe using qualitative analysis software to help). The observer of classes, looking for how teachers introduce new vocab, may have a checklist of types of ways that can be done prepared in advance to tick off, but may modify this if the teacher uses some technique not on the list.

Now once the data has been gathered and gone over in order to finalise the classification system itself, then the researcher applies the final version of the classification system to all the data. Then of course he/she can count up how many times each subject used this or that strategy, how many learners mentioned 'for my job' as a reason for language learning, how many teachers used translation to present the meaning of new vocab, etc. etc. At this point it is common good practice to get another person or people, trained in the classification system, to go over a sample of the same data and see if they classify it the same way as the researcher. This is the reliability checking part of the data analysis. It allows you to show that different analysts can use the same set of categories consistently, and mostly agree on which bits of data fall in which category. Hopefully there will not be too many instances where one analyst thinks what a reader said is evidence of them using the paraphrase strategy while another thinks it is an instance of the summarising strategy... Obviously you need a measure of agreement to report: some form of 'proportion or percentage of instances agreed on by both judges'. In principle this is, as a proportion:

   Number of cases placed in the same category by both judges / Total number of cases

(where cases are often not people but bits of data from 'within' people, e.g. instances of strategy use). For percentage agreement multiply by 100.

A problem that arises for measuring agreement here is that the set of things to be categorised is not determined in advance, but somewhat fluid and decided by each judge for him/herself. This may arise for instance in categorising strategies from 'oral protocols'. Suppose the same tape of someone doing a think aloud task while reading is given to two judges to identify and categorise all the reading strategies identifiable from it. Not only may they disagree on how to categorise particular chunks of what is said, but also on whether a particular bit of talk gives evidence of a strategy being used at all. So it is not so straightforwardly 'one set of cases classified twice' as in simple reliability situations.

Imaginary example. Six people/strategies or whatever classified as either A or B on two occasions/by two judges, with some missed by one or other judge.

          occ./judge 1   occ./judge 2
case 1    A              --
case 2    A              A
case 3    A              B
case 4    --             A
case 5    B              B
case 6    A              --

(This could be the same cases categorised on two occasions or the same ones categorised by two judges). There are several ways out:

a. You take one occasion/judge as the gold standard and disregard any cases that were categorised by the other one only. E.g. if you take occasion/judge 1 above as the standard, you omit case 4 when calculating percentage agreement etc., and just calculate it out of 5 cases. This however gives precedence to one occasion/judge, when neither is normally to be taken as prior to the other for reliability purposes. If you do it like this, usually the researcher is the judge taken as having priority. This is the solution I have seen most often used.

b. You only consider cases in common to both. OK, that is even, but it may leave a good deal out (cases 1, 4 and 6 above). Still this might seem best if the cases have left themselves out, as it were - e.g. people who did not turn up both times for a test-retest. SPSS would calculate agreement this way, leaving out anyone with missing data in any column. But SPSS is not very good at this sort of thing and it is best to just calculate these agreements by hand.

c. You include all, and have an extra category of 'missing' or 'not recognised' for the problem cases. That might seem fairest for the strategy classification example, as there is then some record of how far the judges failed even to agree on where there was a strategy at all, let alone how to classify it. This would yield a contingency table for the above like this:

                  Judge 1
                  A     B     --
Judge 2    A      1     0     1
           B      1     1     0
           --     2     0     ??!

However, a problem does arise for the cell marked with a ? above. In principle this cannot be completed as there is really no way of counting the instances where both judges agreed on not recognising the existence of a strategy at all in a segment of transcript. This would therefore be a good candidate for the use of the Jaccard coefficient, which is calculated like the proportion or percentage agreement, but using eight of the cells in the table, not all nine: it omits consideration of the bottom right cell, the number of cases agreed on as not being whatever it is.
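A minimal Python sketch of the three options, using the toy cases in the table above (None stands for 'not recognised'), which you can use to check your own hand calculations for the question below:

judge1 = ['A', 'A', 'A', None, 'B', 'A']
judge2 = [None, 'A', 'B', 'A', 'B', None]
pairs = list(zip(judge1, judge2))

# method a: judge 1 as the standard, so drop cases only judge 2 recognised
a_pairs = [(x, y) for x, y in pairs if x is not None]
print("method a agreement:", sum(x == y for x, y in a_pairs) / len(a_pairs))

# method b: only cases both judges categorised
b_pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
print("method b agreement:", sum(x == y for x, y in b_pairs) / len(b_pairs))

# method c / Jaccard: agreements over all cases recognised by at least one judge
# (the 'both not recognised' cell is left out, as described above)
c_pairs = [(x, y) for x, y in pairs if x is not None or y is not None]
print("Jaccard coefficient:", sum(x == y for x, y in c_pairs) / len(c_pairs))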

In this toy data, what is the % agreement for method a and b, and the Jaccard coefficient for c?

Which method makes the reliability look best? Which worst?

Yet another solution for c here is to calculate two reliability coefficients (using % agreement or kappa), rather than trying to treat the problem as a single one:

One would be for the agreement in recognising occurrence/not of a strategy at all (regardless of the classification of what kind), where n is all the units the protocols are divided up into (utterances, sentences or whatever). We see the extent to which the judges agree that each sentence does/doesn't show evidence of a strategy of some sort being used (2 categories).

The other would be calculated just within the set of utterances where both judges agreed there was a strategy of some sort (option b above). It would focus on how far they agreed in the specific classification of each strategy as 'prediction', 'planning', etc. (commonly such classifications contain over 20 strategies).

6e. Three or more occasions or judges: category data with absolute value.

For any sort of absolute categorisations of the same cases done separately by more than two judges, of course you can simply do the procedures already described for each pair of judges separately (or see Dunn, 1989: 7.4 or Rietveld and van Hout p221ff for a fuller procedure). The average of the agreements between all pairs of judges is a simple overall measure of reliability.

You can also calculate the overall proportion agreement among 3+ judges as:

   Number of cases placed in the same category by all judges / Total number of cases

Here, as usual, +1 would indicate total agreement/perfect reliability. However, this is somewhat crude: if we do this for four judges putting some learners in pass/fail categories for adequacy in apologising appropriately in English, it does not take into account, where all four judges do not agree, how often the split was three-one or two-two. Yet clearly we would feel this adds something to our impression of reliability. In addition to a high proportion of unanimous judgements, we would be more convinced of the inter-judge reliability if most of the others were three-one rather than two-two. More valuable therefore is to calculate and report a series of proportions (or percentages): the proportion of cases agreed on by all judges, by all but one, by all but two etc., as appropriate for the number of judges and categories involved.

Equally, for case analysis with four judges, the people/things that were disagreed about 2-2 are obviously more 'odd' than those that were disagreed about 3-1 or agreed about (4-0).

A table that can be useful in the multijudge situation is one with scale categories across the top instead of occasions/judges. By this I mean as follows.

Normally the data is in the form (from which SPSS can get contingency tables for each pair of judges):

    Judges →   1    2    3   etc.
    Cases ↓    Columns contain numbers representing the category
               each judge placed each case in.


But to see the agreement of many judges, the following form can be easier to interpret by eye, especially where the number of judges is greater than the number of categories:

    Categories →   1    2    3   etc. of the nominal scale
    Cases ↓        Columns contain figures representing the number of
                   judges who put each case in each category.

From whichever you use, you need to summarise, in a new column on the right, what the agreement is for each case (e.g. if there are three judges and three categories, then it can logically be either 3-0 or 2-1 or 1-1-1). Then you count the numbers of each type of agreement and express them as a % of all the cases. This is probably simplest done by hand from the columns of raw figures. I have not found a simple way for SPSS to do it.
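Since SPSS does not do this conveniently, one possible way of producing the categories-across-the-top table and the per-case agreement column outside SPSS is sketched below (Python/pandas, with invented judgements; the DataFrame and column names are mine, not part of the handout).

    import pandas as pd

    # Hypothetical data: three judges place five cases in categories 1-3 (judges across the top).
    judges = pd.DataFrame(
        {"judge1": [1, 2, 3, 1, 2],
         "judge2": [1, 2, 1, 1, 3],
         "judge3": [1, 3, 2, 1, 2]},
        index=["case1", "case2", "case3", "case4", "case5"],
    )

    # Categories across the top instead: number of judges putting each case in each category.
    counts = judges.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)

    # New column on the right: the agreement pattern for each case (3, 2-1 or 1-1-1).
    counts["agreement"] = counts.apply(
        lambda row: "-".join(str(c) for c in sorted(row[row > 0], reverse=True)), axis=1)

    # Percentage of cases showing each type of agreement.
    print(counts)
    print(counts["agreement"].value_counts(normalize=True) * 100)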

6f. The problems described in 6d can also arise where one gets three or more people (including oneself, perhaps) to categorise strategies from transcriptions of taped material.

Real example for consideration (design A). A student got a number of Arabic-speaking learners of English to write compositions both in English and Arabic, and report in 'think aloud' fashion on what they were doing as they wrote. The taped think-aloud protocols were then each gone through by the researcher, who identified distinct bits of writing behaviour/strategies and categorised them into the major and minor categories of Perl's system and counted up frequencies of use of the various behaviours. Perl's system has 23 'major/main categories', with labels such as 'general planning', 'rehearsing', 'scanning back over text so far', 'revising'. There is a rather larger number of more detailed categories called 'minor/subscript categories'.

To check on reliability, a sample of the tape transcripts (protocols) was given to two other judges briefed by the researcher to categorise in the same system. The reliability is reported as follows for one of these protocols (i.e. the think-aloud material associated with one piece of writing by one subject). Look at the information and answer these questions (Can't use SPSS here):

Is the researcher looking at the reliability overall or in pairs for the three judges? Are all pairs considered?

Which measure of the ones discussed in sec. 6 is being used as a reliability coefficient?

Is n the same for all judges? I.e. did they all identify the same number of strategies, and the issue is simply how far they categorised them the same way? Or not?

Which of the methods a-c in sec. 6d is being followed here? Do you think it is a reasonable way of proceeding?

If we try to construct a contingency table to show the details of agreement in categorisation between, say, the researcher and coder A, what problems do we find? What information is missing in this account?

Do you think the missing information could tell us anything useful?

Extract from L’s thesis draft:

RELIABILITY AND ITEM ANALYSIS IN SITUATION D

So far we have looked at simple ways of statistically assessing reliability in situations A-C outlined at the start - where the same cases are remeasured in the same way on different occasions or by different judges or by the same judge repeatedly. That enables one to see if the markers/scorers are being consistent, or if what seem like the same innocuous circumstances in which data is gathered actually have a varying effect on scores etc. (See my book...).

However, much reliability work focusses on the internal reliability of measuring instruments.

Wherever a multi-item test, attitude inventory etc. has been used to measure cases, we may well want to examine this internal reliability. Indeed we can only look at internal reliability where the measuring technique consists of a series of mini-measures added up to produce an overall score for something. A reliable test etc. of this sort is then one where all the individual items are supposed to be consistent and measure the same thing. But internal reliability is only relevant to a set of items which are all scored on the same scale and are supposed to be measuring 'the same thing' in the same conditions, not just any old set of items.

If you do a survey with a questionnaire asking questions about people's age, gender, level of language ability, preference for using a bilingual or monolingual dictionary, etc., is this a suitable instrument for internal reliability checking?

You do a psycholinguistic experiment with three sets of stimuli representing three different conditions. E.g. you present native speakers of French with verbs from three conjugations in French, mixed in with made-up words as distracters, and ask them to press a key as fast as possible if the word they see is a real word of French. There are ten words from each conjugation: can you usefully analyse the internal reliability of the response times in each set?

In your study in Pakistan you measure integrative motivation with a set of agree/disagree items which Gardner uses in his famous studies in Canada etc. Since this is a 'standard' set of items, is there any point in you assessing the internal reliability of that set in your study?

Note the instrument only has to be used once on a group of suitable cases to allow internal reliability to be checked. This alone makes checking this sort of reliability more popular among researchers! The repetition of measurement that was a feature of designs A-C is still there though. Each item in the test or inventory is conceived as being a remeasure of whatever the test as a whole is testing, in the same people in the same conditions.

Though most associated with pedagogical testing, internal reliability and item analysis apply to multi-item instruments in many areas of language research, esp. psycholinguistics and applied linguistics where tests are used. A restriction, however, is that the same items have to have been used with one set of subjects. In some repeated measures experimental designs in psycholinguistics, for example, there will have been a randomisation of items over people or conditions, or use of a Latin square to assign items to conditions and people. The consequence will be that, while all items appear an equal number of times for all subjects and in all conditions, within one condition (which is the domain within which one would assess reliability) there may be no set of items for all of which one has scores from the same set of people. For example, some subjects will have experienced items 1, 3 and 5 in that condition (and 2, 4, 6 in another condition), others items 2, 4, 6, others items 1, 2, 3 and so on… In such a situation one can examine facility (below) and distraction (if relevant), but not obtain classic alpha or Rasch reliability measures.

This sort of internal reliability analysis naturally combines with 'item analysis', which is a cover term for the analysis of the results for individual items in a test etc., usually as a step towards improving them for the next time the test etc. is used. However, item analysis can also be based on other considerations than reliability: see Validity.

These activities are often associated with professionals developing language tests, but are often also sorely needed to improve those little tests, inventories etc. PhD students use to get data for their research projects... If you want to elicit evidence of subjects' ability to interpret pronoun reference in a particular kind of relative clause question in English, say, a common approach is to make up a set of items, test the subjects, calculate total scores for each subject, then get on to the interesting bit such as differences between groups or levels in ability to interpret correctly, comparison with other relative clause types, etc. The more cautious, reliability-conscious researcher would additionally check if the individual items are in fact 'pulling together' (internal reliability). After all, you wouldn't be so convinced by the test total score as a measure of someone's 'relative clause interpretation competence' if people seem to be scoring quite differently on some items than on others... That might suggest there are some items in there measuring something else, probably irrelevant to what you want.

The choice of statistics involved in all this depends somewhat on three aspects:

Is the test etc. relative/norm-referenced or absolute/criterion referenced?

Are the individual items scored dichotomously (=binary right/wrong, yes/no etc.) or on a more elaborate interval scale (e.g. 0 for wrong, 1 for partly right, 2 for correct)?

Is the design of the test concerned with the grading of items (e.g. from easy to hard) or is this not relevant/difficulty does not vary much?

Not all the combinations of these can be looked at here.

Incidentally, re. the last of those considerations: there is some excellent information on three ways of constructing attitude and suchlike inventories/scales of items at the website http://trochim.human.cornell.edu/kb/scaling.htm. Two (Thurstone and Likert) do not involve graded items, the third (Guttman) does. However, the distinction also applies to sets of language test items.

7-9. One occasion analysed internally: data taken as interval or binary, with relative/NR value.

7. Classic norm-referenced internal reliability

Main stats: Cronbach's alpha (alias Kuder-Richardson)

The simplest way of checking if a set of items is reliable is to split them in half, calculate two scores for each person, one from each half, then use the Pearson r correlation coefficient as seen above to quantify overall how well the total score of each case for one subset of items correlates with his/her score for the other subset.

What would be a sensible way to split a set of items in half for this? The first ten items then the second ten in a 20-item test?

Since the correlation comes from two halves of the one set of observations, it initially constitutes a reliability coefficient for a test or whatever only half the length of the one you started with.


This has to be scaled up, using the Spearman-Brown formula, to give the coefficient for the full set of observations, twice as long:

    Reliability of full set of items  =   2 x split-half rel
                                         ---------------------
                                          1 + split-half rel

Further, if the average score and spread (variance) of scores are not the same in the two halves, there may be an underestimate of reliability. SPSS will take care of most of this for you.
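To make the two steps concrete (split-half correlation, then the Spearman-Brown step-up), here is a small sketch using an invented 0/1 score grid and an odd/even split of the items; it is only meant to show the arithmetic, and the data and names are not from the handout.

    import numpy as np

    # Invented 0/1 scores: rows = people, columns = ten items.
    scores = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
        [1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
        [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    ])

    # Split-half scores using alternate (odd/even) items rather than first half vs second half.
    half_a = scores[:, 0::2].sum(axis=1)
    half_b = scores[:, 1::2].sum(axis=1)

    r_half = np.corrcoef(half_a, half_b)[0, 1]      # Pearson r between the two half-test scores
    full_rel = (2 * r_half) / (1 + r_half)          # Spearman-Brown: reliability of the full-length test

    print(r_half, full_rel)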

An alternative approach is to use the 'Guttman split-half estimate' (see Bachman, 1990: p175ff). That does not need adjustment for length and is not affected by non-equivalence of halves.

However, since we have SPSS available, let's use the Rolls-Royce measure - Cronbach's alpha. This is standardly cited these days in internal reliability study of norm-referenced tests and the like with uniform sets of items. We already used it above in 4a for multiple judges/raters. This alpha reliability coefficient can be thought of as the average of the correlation coefficients you would get if you were to use the split-half method on all the possible divisions of the test items into two sets, not just the one based on taking alternate items. It works for items each either scored dichotomously (right/wrong, yes/no etc.) or rated (e.g. out of 4) or scored (e.g. response times). On dichotomously scored sets of items it is equivalent to the 'Kuder-Richardson 20' formula (KR20). For calculation of the latter see Allen and Davies (1977: p194ff), also Hatch and Farhady (1982: p247ff). If items are of equal difficulty, a slightly simpler form of the latter formula - 'Kuder-Richardson 21' - can be used, but alpha really subsumes all these variants.

It needs as input all cases' scores on all individual items. I.e. you need a data grid entered with a row for each case measured and a column for each item in a test or whatever (assumed to be measuring a unitary construct, i.e. 'one thing'). If the items are dichotomous, then the grid will contain 1s and 0s, for yes/correct/agree etc. versus no/wrong/disagree, for each person on each item. Otherwise the grid may contain other numbers.
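If you ever wanted to compute alpha outside SPSS, the standard formula (alpha = k/(k-1) x (1 - sum of item variances / variance of total scores)) can be applied directly to such a grid. The sketch below uses an invented dichotomous grid purely for illustration; the function and data are mine, not part of the handout.

    import numpy as np

    def cronbach_alpha(items):
        """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Invented dichotomous grid: rows = five learners, columns = five items (1 = correct).
    grid = np.array([
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 0, 1],
    ])
    print(cronbach_alpha(grid))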

As examples I have three (ludicrously) small sets of data for you to analyse (in text files, not .sav ones). They are from three little listening tests made up by a teacher and used with just five learners on an advanced English course in the Dept (see sheets attached at end). The files should be available on the C drive of the computer.

- In my file listtf are the results for five true/false items testing comprehension after hearing a passage on tape, each scored correct/not. The items are attached.

- In file listgap are the results for correct filling of ten gaps in a cloze version of the same passage previously heard (attached), each scored correct/not.

- In file listdict are the results for a free dictation. Cases had to write down as good a version as possible of what they heard (a different passage from the above), not necessarily word for word. They were scored out of 3 for grammatical etc. correctness of each of the five sentences of the passage.

In which of these tests are the items 'dichotomous' then?

Load the first of these (remember File…Read text data), then look at the Data sheet to see what the data looks like.

Does anything strike you by eye?

Maybe it is things we will see later from the statistics. In a small data set eyeballing is often just as good as stats, but on the larger scale it isn't possible.

To calculate alpha you then choose Analyze...Scale...Reliability analysis.... The default is alpha, though you can choose split half etc. if you want. Sticking with alpha, highlight all your items and transfer them in one go into the right-hand box for treatment.

Click the Statistics button to get some things we will need below for Item Analysis. What you will need is obtained by choosing Descriptives for Item, and Scale if Item Deleted. In SPSS it doesn't pay to ask for everything. You tend to end up unable to see the wood for the trees. Proceed.

You may get a message 'Warning zero variance items'.

What does that mean? What is zero variance?

What value of alpha do you get?

Alpha, like a correlation coefficient, has a maximum of 1 if reliability is perfect. Anything below .8 would generally be considered rather poor in reliability studies.

If your alpha is less than .8, why do you think that is?

Load the other files and do the same. But I suggest you save each one as an SPSS file (with suffix .sav) as you go along so you can easily get it back to look at it again. (Remember: File...Save Data...).

Which of the three listening tests is the most reliable?

Why do you think that is? Hint: remember alpha is a measure related to the Pearson r correlation coefficient - what things did we find earlier helped increase Pearson rs?


A problem which all the above approaches suffer from is that, to give an unbiassed coefficient of reliability, it has to be assumed that any person's response to any given item in a multi-item test or inventory is uninfluenced by their response to any other - i.e. that there is 'independence' of every item from every other one. This may not always be so, and is hard to check anyway. The consequence of violation of independence tends to be an overestimate of reliability. However, there is often little one can do about this, so it is usually glossed over.

Which of the above tests do you think is more likely to lack this sort of independence of response to different items?

How could you change the test to counter this?

In a psycholinguistic experiment where each condition is represented by a test-like set of items, this independence is more certain if the sets of items are not given separately, but all mixed up with each other in one list administered together.

8. Classic norm-referenced item analysis, no grading of items at issue


Main stats: mean score per item, item-total correlation, alpha if item deleted, chi-squared fit test.

Traditional NR item analysis looks mainly at two aspects of items - their difficulty/facility and their discrimination/discriminability.

8a. Facility. A norm-referenced measure is designed to show clearly which cases score higher than which others, not the absolute level of knowledge of anything that each case has. So which of the following items are 'good' in such a test, which 'bad'?

Items which most/all people get right?

Items which most/all people get wrong?

Items which about half the cases get right, half wrong?

If you are using a multi-item test or inventory as a research tool, e.g. to gather data on memory for content of passages on different topics, with a hypothesis that where the reader knows more about the topic in advance they remember more, would you want a relative or absolute measure of how much of the content is retained?

You are wanting to compare native-speaker and non-native-speaker teachers of English for their focus on correctness. You give a few of them numerous sentences with possible errors in them to mark, in your pilot of your elicitation instrument. Some items are marked wrong by all of them. Do you leave them out in the revised instrument?

How do the concepts of 'ceiling effect' and 'floor effect' relate to this matter?

Can you see how facility relates to NR reliability? An item which everyone gets wrong has a standard deviation of...what? If all items in a test were got mostly right or mostly wrong, the variation in total test scores would be...high or low? And do correlation coefficients of the Pearson r type come out higher or lower if there are higher degrees of variation among the scores being correlated?

8b. Discrimination. With the idea of internal reliability as described, a good item is obviously one that pulls together with the others and so 'fits' the overall pattern of responses. In short, a good item is one that the people who did best on the whole test got right and that the worst ones got wrong - it discriminates cases in the same way as the test as a whole. An item got right by a lot of people who were the worst overall is not discriminating helpfully. This can be measured by the correlation coefficient of each case's score for the individual item with their score for the whole test. (There are older by-hand methods for this involving comparing items' popularity with the best third and the bottom third of testees/cases, not gone over here).
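To make the three quantities concrete (item means for facility, item-total correlation for discrimination, alpha if item deleted), here is an illustrative sketch applied to the same invented 0/1 grid as in the alpha sketch above. It is only a way of seeing what the numbers are; the correlation computed here uses the total with the item removed (the 'corrected' version), and all names and data are mine, not SPSS output.

    import numpy as np

    def cronbach_alpha(items):
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                                / items.sum(axis=1).var(ddof=1))

    # The same invented 0/1 grid as in the alpha sketch above.
    grid = np.array([
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 0, 1],
    ])
    totals = grid.sum(axis=1)

    for i in range(grid.shape[1]):
        item = grid[:, i]
        facility = item.mean()                # item mean; for 0/1 items, the proportion correct
        rest = totals - item                  # total score with this item removed
        r = 0.0 if item.std() == 0 else np.corrcoef(item, rest)[0, 1]
        alpha_without = cronbach_alpha(np.delete(grid, i, axis=1))
        print(f"item {i + 1}: facility {facility:.2f}, "
              f"item-total r {r:.2f}, alpha if deleted {alpha_without:.2f}")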

The output of the SPSS procedure you have already generated gives you all the facility and discrimination information you need.

For facility look at the item means - i.e. the average score calculated for the item rather than a person taking the test.

If the items are scored 0 or 1, then 'good' means in reliability terms are near...what value?


What are undesirable means - indicating items that might usefully be left out?

If an item scored 0/1 has a mean of .7, what % of the testees got it right?

In the test with items scored out of three, what is the ideal mean for an item?

For discrimination, look at the item-total correlations. These are based on each person's score for the item and their score for the whole test. If the item is discriminating well, then its Pearson r correlation with the total test will be high.

Why do some of them come out as 0?

A third useful bit of information for each item is the alpha if item deleted column. This reflects both facility and discrimination and tells you simply what the reliability of the whole set of items would be if that item were left out.

To improve a set of items obviously you remove or replace items that would ?increase or decrease? alpha if left out - which?

Generally items with means near the extremes and with poor item-total correlations will increase alpha by being left out.

Can you see which items would best be replaced? Why?

Do all this for all three listening tests and pick some items that are unsatisfactory in each.

Can you see from the actual items why they are unsuitable?

8c. Distraction. Finally on item analysis, it is worth mentioning something further you can do exclusively where the items are multiple choice. A basic rule of multiple choice tests (but far from the only one) is that the 'distractors' should genuinely 'distract'. If there are four alternatives but two are so ludicrously wrong that nobody would choose them, a four-choice item in effect becomes a two-choice one - much easier to guess the answer from, with consequent unreliability.

To check on this, you simply have to have available information not just on who got each item right or wrong, but also, if they got it wrong, on which distractor they chose.

Below are some hand-counted figures for a few items related to prepositions in a large-scale piece of test piloting by Mukattash in Jordan some years ago.

Look at the descriptive statistics for the distractors (the frequencies) and answer these questions for each item:

Which is the least popular distractor?

Does it differ in different types of school (i.e. for different populations of testees)?


Can you tell why it is the least popular?

Of course distractors are never going to be chosen exactly equally, so you might like a way of deciding where they are being chosen so differently that it is worth revising them.

On your judgment, in which of these examples is the least popular distractor so unpopular that you would change it in a revised test?

A statistical means for deciding the issue would involve the chi-squared goodness of fit test (which is basically the one-variable version of chi squared, used where frequencies in three or more unordered categories need to be checked against 'expected' frequencies). This version of chi squared would test the hypothesis that there is no difference between the observed proportions of responses falling in the three distractor categories and the proportions you would expect if equal numbers fell in each (the 'expected' frequencies), beyond what you would expect from the vagaries of sampling.
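For anyone working outside SPSS, the same goodness-of-fit test can be run in a couple of lines; the sketch below uses invented distractor counts, not Mukattash's figures, and the variable names are my own.

    from scipy.stats import chisquare

    # Invented counts of how many testees chose each of the three distractors for one item.
    observed = [40, 35, 9]

    # Null hypothesis: all three distractors equally attractive, so expected = total / 3 each.
    result = chisquare(observed)              # equal expected frequencies by default
    expected = sum(observed) / len(observed)
    residuals = [o - expected for o in observed]

    print("expected per distractor:", expected)
    print("residuals (observed - expected):", residuals)
    print("chi-squared =", result.statistic, "p =", result.pvalue)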

If that hypothesis is rejected, then ... do you change the distractors or not?

In item 9, in the vocational school, how many responses would you expect each of the three distractors to attract if they were equally attractive?

To do the statistical test, enter the figures by the short method (cf LG475 tasks). You need a column to label the three categories (distractors) and another with the three actual frequencies. Then get SPSS to weight the data by the column with the frequencies. Then choose Analyze...Non-parametric tests...Chi Square. Choose the category column as the test variable. Keep everything else as given.


In the result, the Expected column tells you the frequencies under the null hypothesis: if all distractors were equally distracting. The Residual is the Observed frequency minus the Expected one.

The least successful distractor will have a high ? negative or positive? residual - which?

If chi squared is significant, that is a reflection of high positive and negative residuals, i.e. responses to distractors so uneven as to merit serious consideration of changing at least one of them. I.e. it is unlikely the observed proportions could have arisen by chance from a population where the division of choice was actually equal.

Is chi squared significant for distractors of Item 9, vocational school data?

Try this for some other items/schools.

For different schools would you come to different decisions about suitability of the same item?

9. Item analysis where grading of items is involved: binary/dichotomous data

Main stats: Guttman's scalability coefficient, Rasch t.

Classic norm-referenced reliability and IA techniques identify and help one develop multi-item instruments that distinguish between cases very consistently and finely where the cases are of a similar ability. They don't tell you how much of anything anyone knows in any absolute sense - you need a criterion-referenced measure for that. And they don't work well on a wide band of ability - you need instruments developed to have the properties that Rasch analysis focuses on for that. You can of course make a whole series of norm-referenced tests for different levels. However, there is no easy way to relate the score of a person on one such test to their score on another, because the tests would have to be compiled and refined separately for each population.

In testing circles there is a lot of interest nowadays in tests which contain items in a whole range of difficulties (though not necessarily presented to test-takers in order of increasing difficulty), so that, for example, one test will do for beginner and intermediate learners or for children of a wide span of ages or for dialect speakers ranging from very broad to practically standard. A related idea is that you can establish an 'item bank' from which you draw items to make up tests suited to particular levels of case. But these items are all related to each other in such a way that you can relate someone's score on one such test to someone else's score on another.

Compared with the classic approach above, items are chosen and checked not necessarily to be of middling difficulty for everyone - there is a good range of difficulty - but discriminability of a sort is retained. This is done by requiring items to form an 'implicational scale'. That means that a set of test items of this sort will range from hard to easy, but in such a way that people who get harder ones right also get right those items that are generally easier (i.e. which more people get right). That is a third reliability-type concept, alongside the classical NR 'high correlation' one and the absolute CR 'high absolute agreement' one.

Full story skipped on this occasion. See me if you need it!

10. One occasion analysed internally: data taken as interval or binary, and absolute/CR.

Classical reliability and IA is not criterion-referenced, and the Rasch approach, though it claims to be, is not straightforwardly so. There are various sophisticated ideas about how to improve the internal reliability of CR tests, not covered here (e.g. see Subkoviak and Baker (1977) and Bachman (1990: 212ff)). But one could use the approach of 4c above as a simple option. Calculate an SD for each person across their responses, and take the mean of those SDs as an indication of reliability. A low mean SD means more agreement. Identify bad items on the basis of seeing which items, if omitted, lead to the best reduction in mean SD over all subjects (analogous to 'alpha if item deleted'). And identify subjects that are especially giving rise to disagreement too.

My ABSREL2 programme does that for you.
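For anyone without the programme, the same logic can be sketched in a few lines (this is only an illustration with invented data and my own names, not ABSREL2 itself):

    import numpy as np

    # Invented criterion-referenced scores: rows = people, columns = items, all on a 0-3 scale.
    grid = np.array([
        [3, 3, 2, 3, 1],
        [2, 2, 2, 3, 0],
        [3, 2, 3, 3, 1],
        [1, 1, 1, 2, 0],
    ])

    person_sds = grid.std(axis=1, ddof=1)     # spread of each person's responses across the items
    print("SD per person:", person_sds)
    print("mean SD (lower = more internal agreement):", person_sds.mean())

    # Which item, if omitted, gives the best reduction in mean SD (cf. 'alpha if item deleted')?
    for i in range(grid.shape[1]):
        reduced = np.delete(grid, i, axis=1)
        print(f"mean SD without item {i + 1}: {reduced.std(axis=1, ddof=1).mean():.3f}")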

CONCLUSION ON RELIABILITY AND RELATED IA

11. A further concern may be as follows. Ideally one does all the reliability checking in a pilot study, and uses the information to revise everything for the main study. But suppose one has done the relevant things from the above in one's main study. There is no possibility of revising the instrument or procedure and going back and doing it again, so one has to just exploit the figures one has got to the best possible extent in the analysis of the real RESULTS of one's enquiry, with due acknowledgement of its possible failings. But what figures does one use?

Typical procedures used to 'massage' data:

In general.

Leave out cases that seem odd/don't fit. But make a good argument for them not belonging to your targeted population on grounds other than just that their responses weren't right! E.g. you could show that, prompted by their odd responses, you looked back at your notes from the data gathering and can see that that person was rather unwilling to participate, or was a learner who had learnt English under different conditions from the other subjects, etc..... Of course this option is not available in pedagogical uses of tests where one is interested in the scores of all particular students.

In design (D).

Leave out whole items that don't fit the pattern of the set they belong to. That is less controversial. You may also be able to identify particular responses of particular cases that are odd and omit just them: e.g. in response time tests it is common to leave out responses that are overlong or overshort.

What argument could you use to support that?

How could you define 'too long'?

For onward analysis of results you then have to calculate a score for each case/person that is their average over the scores that you have decided to retain, which may be a different number for different cases/people. This also arises in general if there are missing values/scores for other reasons. If everything is in one row per case format in SPSS you can do this by using Transform... Compute... and selecting MEAN(numexpr,numexpr,...) from the list of functions. This calculates an average over the non-missing values in a set of columns. So if your data was three cases on three items, with some missing responses thus:

              Item1   Item2   Item3
    Case1       1       0       0
    Case2       1       .       0
    Case3       .       .       1


Fill in the formula in SPSS as MEAN(Item1,Item2,Item3) and the new column will come out with averages using only the non-missing columns. Here that is .33, .5 and 1.
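Outside SPSS the same thing is a one-liner; the sketch below (Python/pandas, with the toy grid above and '.' entered as missing) is only meant to show the idea, and the DataFrame and column names are my own.

    import numpy as np
    import pandas as pd

    # The toy grid above, with '.' entered as missing (NaN).
    data = pd.DataFrame(
        {"Item1": [1, 1, np.nan],
         "Item2": [0, np.nan, np.nan],
         "Item3": [0, 0, 1]},
        index=["Case1", "Case2", "Case3"],
    )

    # Row means over the non-missing values only, like SPSS's MEAN(Item1,Item2,Item3).
    data["mean_score"] = data.mean(axis=1, skipna=True)
    print(data)      # .33, .5 and 1, as in the text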

In design (A) and (B) and (C).

The usual wisdom is that the mean of several separate measurements is always more reliable than one measurement of anything. But if two out of three judgements/occasions agree well with each other, and differ from the third, maybe leave out the scores from the 'odd' judgement that didn't agree well, and use the average of the other two. If different judges/scorers are involved, then you can leave out an unreliable scorer and use the scores averaged over the others. Or train up that scorer a bit more and get him/her to score the protocols again!

When combining figures from several judges/occasions etc. there is a special problem if the data is not interval scores or rank orders (which can be averaged with the mean). With categorisations you would typically use the category that the majority of judgements placed a case in (i.e. the 'modal' category): so it is useful to use uneven numbers of judges to make this easy! E.g. if two out of three times when you recategorised your think-aloud data you thought a certain strategy was 'Prediction', that is the one you regard it as being for your actual analysis of results.
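If the judgements are laid out in a cases-by-judges grid, the modal category can be picked out mechanically; the sketch below uses invented strategy labels and judge names purely for illustration. Note that with three judges a 1-1-1 split has no majority: .mode() then returns a tie and taking the first value is arbitrary, so such cases need resolving by hand.

    import pandas as pd

    # Invented categorisations of the same three strategies by three judges.
    judgements = pd.DataFrame(
        {"judge1": ["Prediction", "Planning", "Prediction"],
         "judge2": ["Prediction", "Planning", "Monitoring"],
         "judge3": ["Planning",   "Planning", "Prediction"]},
        index=["strategy1", "strategy2", "strategy3"],
    )

    # Modal (majority) category per case.
    modal = judgements.mode(axis=1)[0]
    print(modal)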

There is quite a bit about the potential causes of unreliability in my book. Here we have only looked at how to check on it statistically.

PJS revised a little Jan 2012