Item Analysis: Classical and Beyond

SCROLLA Symposium: Measurement Theory and Item Analysis

Heriot-Watt University, 12th February 2003

Why is item analysis relevant?

Item analysis provides a way of measuring the quality of questions - seeing how appropriate they were for the candidates and how well they measured their ability.

It also provides a way of re-using items over and over again in different tests with prior knowledge of how they are going to perform.

What kinds of item analysis are there?

Item Analysis
    Classical
    Latent Trait Models
        Rasch
        Item Response Theory
            IRT1, IRT2, IRT3, IRT4

Classical Analysis

Classical analysis is the easiest and most widely used form of analysis. The statistics can be computed by generic statistical packages (or at a push by hand) and need no specialist software.

Analysis is performed on the test as a whole rather than on the individual item. Although item statistics can be generated, they apply only to that group of students on that collection of items.

Classical Analysis Assumptions

Classical test analysis assumes that any test score is composed of a “true” value plus random error. Crucially, it assumes that this error is normally distributed, is uncorrelated with the true score, and has a mean of zero.

x_{obs} = x_{true} + G(0, err)

Classical Analysis Statistics

• Difficulty (item level statistic)

• Discrimination (item level statistic)

• Reliability (test level statistic)

Classical Analysis Difficulty

The difficulty of a (1 mark) question in classical analysis is simply the proportion of people who answered the question correctly. For multiple mark questions, it is the average mark expressed as a proportion of the maximum mark.

Given on a scale of 0-1, the higher the proportion the easier the question.
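As a rough illustration, the index can be computed as the mean mark divided by the maximum available mark, which for a 1-mark question is the proportion answering correctly. The function name and the data below are hypothetical and are not taken from the original slides.

```python
def item_difficulty(marks, max_mark=1):
    """Mean mark on one item as a proportion of the maximum available mark."""
    if not marks:
        raise ValueError("need at least one candidate")
    return sum(marks) / (len(marks) * max_mark)


# Hypothetical data: eight candidates on a 1-mark item (1 = correct).
one_mark = [1, 0, 1, 1, 0, 1, 1, 1]
# Hypothetical data: five candidates on a 4-mark item.
four_mark = [4, 2, 3, 1, 0]

p_one = item_difficulty(one_mark)               # 6 correct out of 8
p_four = item_difficulty(four_mark, max_mark=4)  # 10 marks out of 20

print(p_one, p_four)
```

Under this convention a value near 1 means most candidates gained the marks, and a value near 0 means few did.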

Classical Analysis Discrimination

The discrimination of an item is the (Pearson) correlation between candidates' marks on the item and their total test marks.

Being a correlation it can vary from –1 to +1 with higher values indicating (desirable) high discrimination.
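A minimal sketch of this statistic, assuming a small hypothetical response matrix (one row per candidate, one column per item). The item is left in the total here, which is an assumption; some analyses remove the item from the total before correlating.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def item_discrimination(responses, item):
    """Correlate each candidate's mark on one item with their total mark."""
    item_marks = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    return pearson(item_marks, totals)


# Hypothetical data: six candidates, four 1-mark items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

r_first = item_discrimination(responses, 0)
r_last = item_discrimination(responses, 3)
print(round(r_first, 3), round(r_last, 3))
```

In this made-up data the first item is answered correctly mainly by the higher-scoring candidates, so it discriminates better than the last item.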

Classical Analysis Reliability

Reliability is a measure of how well the test “holds together”. For practical reasons, internal consistency estimates are the easiest to obtain; these indicate the extent to which each item correlates with every other item.

This is measured on a scale of 0-1. The greater the number the higher the reliability.
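The slides do not name a particular internal consistency estimate. The sketch below assumes Cronbach's alpha, one commonly used choice, and uses a hypothetical response matrix.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix with one row per candidate."""
    k = len(responses[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item's marks across candidates.
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the candidates' total marks.
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: six candidates, four 1-mark items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

When every item ranks the candidates identically the statistic reaches 1; for real data it is normally lower.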

Classical Analysis vs Latent Trait Models

• Classical analysis has the test (not the item) as its basis. Although the statistics generated are often generalised to similar students taking a similar test, they only really apply to those students taking that test.

• Latent trait models aim to look beyond that at the underlying traits which are producing the test performance. They are measured at item level and provide sample-free measurement

Latent Trait Models

• Latent trait models have been around since the 1940s, but were not widely used until the 1960s. Although it is theoretically possible to use them without specialist software, it is not feasible in practice.

• They aim to measure the underlying ability (or trait) which is producing the test performance rather than measuring performance per se.

• This leads to them being sample-free. As the statistics are not dependent on the test situation which generated them, they can be used more flexibly.

Rasch vs Item Response Theory

Mathematically, Rasch is identical to the most basic IRT model (IRT1); however, there are some important differences which make it a more viable proposition for practical testing:

• In Rasch the model takes precedence over the data. Data which do not fit the model are discarded.

• Rasch does not permit parameters to be estimated for extreme items and persons.

• Rasch eschews the use of Bayesian priors to assist parameter setting.

IRT - the generalised model

Where a_g = gradient of the ICC at its point of inflection (item discrimination)

b_g = the ability level at which the gradient is maximised (item difficulty)

c_g = probability of very low ability candidates correctly answering question g
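For reference, the standard three-parameter logistic form that matches these parameter definitions can be written as below. This is an assumption about the model intended here; the symbol θ for candidate ability is introduced for illustration, and some presentations also multiply the exponent by a scaling constant D of about 1.7.

```latex
P_g(\theta) = c_g + (1 - c_g)\,\frac{1}{1 + e^{-a_g(\theta - b_g)}}
```

In this form the curve is steepest at θ = b_g, where its gradient is a_g(1 - c_g)/4 and the probability of a correct answer is (1 + c_g)/2. Setting c_g = 0 gives the two-parameter model, and additionally fixing a_g to a common value gives the one-parameter model.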

IRT - Item Characteristic Curves

• An ICC is a plot of the probability of candidates correctly answering the question against their ability. The higher the ability, the higher the chance that they will respond correctly.

[Figure: an item characteristic curve annotated with c (intercept), a (gradient) and b (ability at maximum gradient)]
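A curve of this shape can be sketched in a few lines, assuming the standard three-parameter logistic form. The function name `icc` and the parameter values are hypothetical.

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct answer at ability theta (3PL logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: discrimination 1.5, difficulty 1.0, guessing 0.25.
abilities = [-3, -2, -1, 0, 1, 2, 3]
curve = [icc(theta, a=1.5, b=1.0, c=0.25) for theta in abilities]

# Print a small text table of ability against probability.
for theta, p in zip(abilities, curve):
    print(f"{theta:+d}  {p:.3f}")
```

The probabilities rise with ability, start just above c for the weakest candidates, and approach 1 for the strongest.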

IRT - About the Parameters: Difficulty

• Although there is no “correct” difficulty for any one item, it is clearly desirable that the difficulty of the test is centred around the average ability of the candidates.

• The higher the “b” parameter the more difficult the question - note that this is inversely related to the probability of the question being answered correctly.

IRT - About the Parameters: Discrimination

• In IRT (unlike Rasch) maximal discrimination is sought. Thus the higher the “a” parameter the more desirable the question.

• Note however that differences in the discrimination of questions can lead to differences in the difficulties of questions across the ability range.
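The second point can be illustrated with two hypothetical items whose curves cross, assuming a two-parameter logistic form. Which item is harder then depends on where on the ability scale the comparison is made.

```python
import math


def icc2(theta, a, b):
    """Two-parameter logistic ICC (no guessing parameter)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items: A has low discrimination, B has high discrimination.
item_a = {"a": 0.5, "b": 0.0}
item_b = {"a": 2.0, "b": 0.5}

low, high = -2.0, 3.0
p_a_low, p_b_low = icc2(low, **item_a), icc2(low, **item_b)
p_a_high, p_b_high = icc2(high, **item_a), icc2(high, **item_b)

# For weak candidates item B is the harder one; for strong candidates it is item A.
print(f"ability {low}: A={p_a_low:.3f} B={p_b_low:.3f}")
print(f"ability {high}: A={p_a_high:.3f} B={p_b_high:.3f}")
```

With these made-up values the two curves cross at an ability of 2/3, so the ordering of the items by difficulty reverses either side of that point.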

IRT - About the Parameters: Guessing

• A high “c” parameter suggests that candidates with very little ability may choose the correct answer.

• This is rarely a valid parameter outwith multiple choice testing, and the value should not vary excessively from the reciprocal of the number of choices.

IRT - Parameter Estimation

• Before being used (in an item bank or for measurement) items must first be calibrated. That is, their parameters must be estimated.

• There are two main procedures - Joint Maximum Likelihood (JML) and Marginal Maximum Likelihood (MML). JML is most common for IRT1 and 2, while MML is used more frequently for IRT3.

• Bayesian estimation and estimated bounds may be imposed on the data to avoid one parameter degrading, or high discrimination items being overvalued.
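A full JML or MML calibration is beyond a short example. The sketch below shows only the kind of maximum likelihood step such procedures rely on: estimating the difficulty of a single one-parameter item by Newton-Raphson, under the simplifying assumption that candidate abilities are already known. All names and the simulated data are hypothetical.

```python
import math
import random


def p_correct(theta, b):
    """One-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_difficulty(thetas, responses, tol=1e-6, max_iter=50):
    """Maximum likelihood estimate of b for one item, abilities assumed known."""
    score = sum(responses)
    if score == 0 or score == len(responses):
        # Everyone right or everyone wrong: the likelihood has no finite maximum.
        raise ValueError("extreme item: no finite maximum likelihood estimate")
    b = 0.0
    for _ in range(max_iter):
        probs = [p_correct(t, b) for t in thetas]
        # Expected minus observed number correct (gradient of log-likelihood in b).
        gradient = sum(p - x for p, x in zip(probs, responses))
        # Test information at the current estimate (minus the second derivative).
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        b += step
        if abs(step) < tol:
            break
    return b


# Simulated data: 5000 candidates answering one item with true difficulty 0.7.
rng = random.Random(2003)
true_b = 0.7
thetas = [rng.gauss(0.0, 1.0) for _ in range(5000)]
responses = [1 if rng.random() < p_correct(t, true_b) else 0 for t in thetas]

b_hat = estimate_difficulty(thetas, responses)
print(round(b_hat, 2))
```

In a real calibration the abilities are not known: JML estimates them alongside the item parameters, while MML integrates over an assumed ability distribution.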

Resources - Classical Analysis

Software

• Standard statistical packages (Excel; SPSS; SAS)

• ITEMAN (available from www.assess.com)

Reading

• Matlock-Hetzel (1997) Basic Concepts in Item and Test Analysis - available at www.ericae.net/ft/tamu/Espy.htm

Resources - IRT

Software

• BILOG (available at www.assess.com)

• Xcalibre (available at www.assess.com)

Reading

• Lord (1980) Applications of Item Response Theory to Practical Testing Problems

• Baker, Frank (2001). The Basics of Item Response Theory - available at http://ericae.net/irt/baker/
