
ITC Conference, Winchester, 2002

Computer-based Testing

Usability of Psychometric Admeasurements

Dr. J. M. Müller

University of Tübingen, Germany

http://www.joergmmueller.de/default.htm

Overview

1. Introduction: Formal test descriptions in practice

2. Definition of usability in the context of test description

3. Illustrating problems: Reliability

4. Criteria of usability: foundation, scaling, general attributes

5. Two examples of enhanced usability: NDR and PDR

6. Summary

Introduction: Psychometric admeasurements in practice today and tomorrow

1. Test users often use poor-quality tests (e.g. Piotrowski et al.; Wade & Baker, 1977). Responses: psychometric knowledge (Moreland et al., 1995) and the competence approach (Bartram, 1995, 1996)

2. What should be described? For CBT: criteria for software usability (ISO 9241/10, 1991; Willumeit, Gediga & Hamborg, 1995) and further criteria: platform independence, the possibility of building one's own norm bank, and protection

3. How should it be described?

4. “Good practice” guidelines and standards are based on quality criteria (e.g. Standards for Educational and Psychological Testing, APA, 1999; International Guidelines for Test Use, ITC, 2000)

Quality supply vs. quality demand

Definition of Usability

Scope of usability: Usability in the context of psychological testing concerns all kinds of information test users need to describe a test for various purposes, and the ways to communicate them. This includes test manuals as well as formal test descriptions with the help of psychometric admeasurements.

Aim of usability: The product of good usability is that any test user finds all necessary information quickly and in a properly standardized form, ready to use for answering their questions, so that they can decide whether a test is an appropriate aid for the diagnostic question at hand.

Frame of usability: Quality assurance in the context of psychological testing refers to test construction, test translation, test description and the use of tests in practice. Methods to enhance quality control can include guidelines for test use, standards for test description, etc. Usability is a strategy to enhance quality on the level of formal description.

Consequences of usability concern the reengineering of formal test description.

Indices of measurement of error

[Figure: overview of indices, including Spearman correlation, % or SMC, phi coefficient, retest Pearson correlation, Yule's Y, Cronbach's alpha, Kuder-Richardson's Formula 20, Spearman-Brown prophecy formula, intraclass correlation, sensitivity TP/(TP+FN), specificity TN/(TN+FP), standard error of a score, kappa (reclassification), model-fit likelihoods, information function, kappa (interrater), standard error score.]
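Two of the indices above, sensitivity and specificity, follow directly from the cell counts of a 2×2 classification table; a minimal sketch (the counts below are hypothetical):

```python
def sensitivity(tp, fn):
    # Proportion of true cases the test detects: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of non-cases the test clears: TN / (TN + FP)
    return tn / (tn + fp)

# Hypothetical classification: 40 true positives, 10 false negatives,
# 45 true negatives, 5 false positives
print(sensitivity(40, 10))  # 0.8
print(specificity(45, 5))   # 0.9
```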

Measurement of error

[Figure: taxonomy of error measurement. Dimensional constructs: CTT, IRT, Generalizability Theory, leading to the standard error score and reliability. Categorical constructs: nonspecific vs. specific misclassification.]

Relationships between indices of error of measurement

[Figure: matrix relating the indices listed above (Y, kappa, phi, correlation) to each other and to information criteria.]

Top-down vs. bottom-up strategy to develop a coefficient

Scientist's point of view (top-down): test theory/statistic -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning)

Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula

Rescaling reliability: Number of distinctive results (NDR)

(Wright & Masters, 1982; Lehrl & Kinzel, 1973; Müller, 2001)

[Figure: test score distribution over range R, partitioned into intervals of one critical difference; two scores x1 and x2 marked.]

Formula: k = 1.96 · √2 · s_x · √(1 − r_tt)  (critical difference at α = .05)

NDR = R / k

R = test score range
k = critical difference
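The two formulas above translate into a few lines of code; a minimal sketch, with hypothetical test parameters:

```python
import math

def critical_difference(s_x, r_tt, z=1.96):
    # k = z * sqrt(2) * s_x * sqrt(1 - r_tt): the smallest difference
    # between two scores that is significant at alpha = .05 (two-sided)
    return z * math.sqrt(2) * s_x * math.sqrt(1 - r_tt)

def ndr(score_range, s_x, r_tt):
    # Number of distinctive results: NDR = R / k
    return score_range / critical_difference(s_x, r_tt)

# Hypothetical test: raw-score range R = 40, s_x = 8, reliability .90
print(round(ndr(40, 8, 0.90), 1))  # 5.7
```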

Criteria of usability for formal quality criteria

(modified from Müller, 2001, 2002a,b; Goodman & Kruskal, 1954)

Foundation
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independent of irrelevant factors

Scale definition
1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values
2. Comparable to the reference scale
3. Significant scale unit, which implies a minimum of observations (Nmin)

Global attributes in use
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation

NDR at work...

NDR = 2 corresponds to r = .50; NDR = 5 to r = .92; NDR = 10 to r = .98

Distribution of the reliability coefficient vs. distribution of the NDR coefficient: the former suggests many precise tests, the latter only some precise tests.

Probability of distinctive results (PDR)

Formula: PDR = D / (n(n − 1)/2), with

D = Σ_{i<j} s_ij, where s_ij = 1 if |x_i − x_j| ≥ k, and s_ij = 0 otherwise

Complete score comparison of all pairs.

A rectangular distribution shows an 80 % probability of distinguishing two test scores; a Gaussian distribution shows a 60 % probability.
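The pairwise definition of PDR maps directly onto code; a minimal sketch with hypothetical scores and critical difference:

```python
from itertools import combinations

def pdr(scores, k):
    # Probability of distinctive results: share of all score pairs whose
    # absolute difference reaches the critical difference k
    pairs = list(combinations(scores, 2))
    distinct = sum(1 for x, y in pairs if abs(x - y) >= k)
    return distinct / len(pairs)

# Hypothetical scores with a critical difference of 3
print(pdr([10, 12, 15, 19, 24], 3))  # 0.9
```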

PDR: Simulation study

[Figure: performance in separating test scores as a function of reliability and score distribution.]

PDR: Example

Subscale 'Resignation', Stress Coping Questionnaire (SVF-KJ; Hampel, Petermann & Dickow, 1999; N = 1123): r = 0.81, PDR = 41.6 %

Subscale 'Unsicherheit' (uncertainty), Symptom Checklist (Derogatis, 1977; German version Franke, 1995; N = 875): r = 0.81, PDR = 30.6 %

Reviewing NDR and PDR

1. NDR and PDR can be derived in any test theoretical model – there is progress in the application area.

2. NDR and PDR have an easy-to-understand operational meaning.

3. NDR and PDR are predictable for the test user in the nominal/actual value comparison.

NDR and PDR serve as examples of how to develop more usable formal test descriptions.

Summary

1. Usability is a possible strategy, with explicit and observable criteria, for improving formal test descriptions and, indirectly, for strengthening the role of guidelines and standards.

2. With NDR and PDR, two easy-to-understand coefficients have been proposed; their application in several test-theoretical models is in progress.

Thank you for your attention!

Medicine: Effect-size measures

Practitioner's coefficient: NNT = 1 / (RRR · CER)

Scientific coefficient (Cohen, 1988): w = √( Σ_{i=1}^{m} (P_{1i} − P_{0i})² / P_{0i} )

NNT [Number Needed to Treat]: the number of patients who need to be treated to prevent one adverse outcome. Taken from the EBM Glossary, Evidence-Based Medicine, Volume 125, Number 1.
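Assuming the usual definition NNT = 1/ARR, where the absolute risk reduction ARR = CER − EER (equivalently CER · RRR), a sketch with hypothetical event rates:

```python
def nnt(cer, eer):
    # Absolute risk reduction = control event rate - experimental event rate;
    # NNT is its reciprocal: patients to treat to prevent one adverse outcome
    arr = cer - eer
    return 1 / arr

# Hypothetical rates: 20 % adverse outcomes under control, 12 % under treatment
print(round(nnt(0.20, 0.12), 1))  # 12.5
```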

Measuring in technical fields: Solutions from engineering

There is a German norm, DIN 2257, on how to measure the physical length of an object and how to report the result. The norm allows as output only values with statistical evidence.

Criteria of usability for formal quality criteria, applied to NNT

(Foundation, scale definition and global attributes in use, as listed above.)

Criteria of software usability(from Willumeit, Gediga & Hamborg, 1995)

Questionnaire on the basis of ISO9241/10 (IsoMetrics) to evaluate the following dimensions:

1. Suitability for the task

2. Self-descriptiveness

3. Controllability

4. Conformity with user expectations

5. Error tolerance

6. Suitability for individualization

7. Suitability for learning

KR20 and Cronbach's alpha

Kuder-Richardson Formula 20 (from Cronbach, 1951):

r_tt = n/(n − 1) · (1 − Σ_i p_i q_i / s_t²)

where p_i is the relative frequency of 1s on item i and q_i the relative frequency of 0s.

Cronbach's alpha:

α = c/(c − 1) · (1 − Σ_i s_i² / s_tot²)

where c is the number of variables, s_i² the variance of variable i, and s_tot² the variance of the sum.
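Cronbach's alpha as defined above can be sketched in a few lines; the data set below is hypothetical, and for dichotomous items the same computation reproduces KR-20, since p_i · q_i is the variance of a 0/1 item:

```python
def variance(x):
    # Population variance of a list of numbers
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def cronbach_alpha(items):
    # items: one list of scores per item, all over the same persons
    c = len(items)                              # number of items (variables)
    totals = [sum(col) for col in zip(*items)]  # each person's sum score
    return c / (c - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical data: 3 dichotomous items answered by 4 persons
items = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
print(round(cronbach_alpha(items), 2))  # 0.63
```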

Formulas for the error of measurement in categorical constructs

2×2 table:      B1  B2
          A1     a   b
          A2     c   d

Cohen's kappa: κ = (p_0 − p_e)/(1 − p_e), with p_0 = (a + d)/N and p_e = [(a + b)(a + c) + (c + d)(b + d)]/N². A further 16 measures of concordance between two measurements for binary data are compared in Conger & Ward (1984).

Yule's fourfold interdependence measures:

Q coefficient: Q = (ad − bc)/(ad + bc)

Yule's Y: Y = (√(ad) − √(bc))/(√(ad) + √(bc))

Phi coefficient: φ = √(χ²/N), with χ² = Σ_i Σ_j (f_ij − e_ij)²/e_ij. Phi depends on the marginal distributions, and its significance test depends on N (Yates continuity correction, 1934).
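The four agreement indices can all be computed from the cell counts a, b, c, d of the 2×2 table; a minimal sketch with a hypothetical table:

```python
import math

def agreement_indices(a, b, c, d):
    # 2x2 table:      B1  B2
    #            A1    a   b
    #            A2    c   d
    n = a + b + c + d
    p0 = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    q = (a * d - b * c) / (a * d + b * c)                 # Yule's Q
    y = ((math.sqrt(a * d) - math.sqrt(b * c))
         / (math.sqrt(a * d) + math.sqrt(b * c)))         # Yule's Y
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return kappa, q, y, phi

# Hypothetical table: a = 40, b = 10, c = 5, d = 45
kappa, q, y, phi = agreement_indices(40, 10, 5, 45)
print(round(kappa, 2), round(y, 3))  # 0.7 0.714
```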

Formulas for the error of measurement in categorical constructs (continued)

Fricke's agreement coefficient: Ü = 1 − SS/SS_max, where SS is the sum of squares within a person and SS_max the maximum possible sum of squares within persons. Simple agreement: Ü = (a + d)/n.

Point-biserial correlation: r_pbis = (X̄_R − X̄)/s_x · √(p/q)

where X̄ is the arithmetic mean of all raw test scores, X̄_R the mean of the subjects with correct answers, s_x the standard deviation of the raw scores of all subjects, N the number of subjects, and N_R the number of subjects with correct answers.

Tetrachoric correlation: r_tet = cos(180° / (1 + √(ad/bc)))

[Example table: A B C / I 1 4 3 / II 0 4 2 / III 0 5 2]

_

Formulas for the error of measurement in CTT and IRT, plus the prophecy formula

Spearman-Brown formula (k = factor of test lengthening):

r_kk = k · r_tt / (1 + (k − 1) · r_tt)

Rasch model: Var(E_v) = Σ_{i=1}^{k} p_vi (1 − p_vi)

CTT standard error: s_e = s_x · √(1 − r_tt)
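The Spearman-Brown prophecy formula in code; the example (doubling a test) is hypothetical:

```python
def spearman_brown(r_tt, k):
    # Projected reliability after lengthening the test by factor k
    return k * r_tt / (1 + (k - 1) * r_tt)

# Doubling (k = 2) a test with reliability .60
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```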

Some formulas for the error of measurement in metric constructs

Reliability (Kelley, 1921): r_tt = (s_w² − s_e²)/s_w²

Pearson (1907) correlation, after Bravais (1846): r = Σ_i (x_i − x̄)(y_i − ȳ) / (N · s_x · s_y)

Spearman's rho (1904): ρ = 1 − 6 Σ_i d_i² / (N(N² − 1))

Kendall's tau (1942), with S = difference between the numbers of proversions and inversions: τ = S / (N(N − 1)/2)

Fisher's Z: Z = ½ ln((1 + r)/(1 − r))

[Example rank pairs: 1 2 3 4 5 vs. 3 2 3 5 4]
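Spearman's rho and Fisher's Z as given above; the untied rank pairs in the example are hypothetical:

```python
import math

def spearman_rho(rank_x, rank_y):
    # rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), d_i = rank difference
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n**2 - 1))

def fisher_z(r):
    # Fisher's Z transformation: Z = 0.5 * ln((1 + r) / (1 - r))
    return 0.5 * math.log((1 + r) / (1 - r))

# Hypothetical untied rank pairs for 5 persons
print(round(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 2))  # 0.8
```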

Non-linear relationship between reliability, NDR and the standard error score

[Figure: NDR and the standard error score plotted against reliability (0 to 1).]

Item Response Theory (Fischer & Molenaar, 1994)

1. Dichotomous Rasch model

2. Linear logistic test model

3. Linear logistic model for change

4. Dynamic generalization of the Rasch model

5. One-parameter logistic model

6. Linear logistic latent class analysis

7. Mixture-distribution Rasch models

8. Polytomous Rasch models

9. Extended rating scale and partial credit models

10. Polytomous mixed Rasch models

11. ...

...more IRT (van der Linden & Hambleton, 1997)

1. Nominal categories model

2. Response model for multiple choice

3. Graded response model

4. Partial credit model

5. Generalized partial credit model

6. Logistic model for time-limit tests

7. Hyperbolic cosine IRT model for unfolding direct responses

8. Single-item response model

9. Response model with manifest predictors

10. A linear multidimensional model

11. ...

Formulas of some IRT models

Rasch model: p(x_vi = 1) = exp(θ_v − σ_i) / (1 + exp(θ_v − σ_i))

Birnbaum model: p(x_vi = 1) = exp(α_i (θ_v − σ_i)) / (1 + exp(α_i (θ_v − σ_i)))

Binomial model and unfolding model: analogous logistic forms.

Latent class model: p(x) = Σ_{g=1}^{G} π_g · Π_{i=1}^{k} π_ig^{x_i} (1 − π_ig)^{1 − x_i}
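The Rasch and Birnbaum response probabilities translate directly from their formulas; a minimal sketch:

```python
import math

def rasch_p(theta, sigma):
    # P(item solved) under the dichotomous Rasch model:
    # person parameter theta, item difficulty sigma
    return math.exp(theta - sigma) / (1 + math.exp(theta - sigma))

def birnbaum_p(theta, sigma, alpha):
    # Birnbaum (2PL) model: adds an item discrimination parameter alpha
    return math.exp(alpha * (theta - sigma)) / (1 + math.exp(alpha * (theta - sigma)))

print(round(rasch_p(0.0, 0.0), 2))  # 0.5
print(round(rasch_p(1.0, 0.0), 2))  # 0.73
```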


Norm scales

SCL-90-R test score distribution

Simulation study on the relationship between measures of association

[Figure: scatter-plot matrix comparing Y/kappa/phi against correlation, Q, and SMC, under a dichotomized normal distribution with equal marginals and a skewed distribution with unequal marginals. 2×2 table: B1 B2 / A1 a b / A2 c d. Question: is the relationship linear?]

Efficiency in measuring

Content: efficiency. Concept: the less effort you need for the same amount of information, the more efficient the test is: efficiency = f(information, effort).

Index: E = amount of information / time. Estimates: information theory (Shannon & Weaver, 1949)

Amount of information of a signal: chess example

In the chess example you need at least six binary (50-50) questions, i.e. 6 bits.

Question 1: left or right? Question 2: top or bottom? Question 3 ... Question 6.

The scale unit 'bit' can be understood as the minimal or optimal number of questions needed to identify a signal out of a set of alternatives.
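The chessboard argument is just a base-2 logarithm; a minimal sketch:

```python
import math

def bits_needed(alternatives):
    # Minimal number of binary (50-50) questions to single out one
    # alternative: the base-2 logarithm of the number of alternatives
    return math.log2(alternatives)

# 64 squares on a chessboard -> six yes/no questions
print(bits_needed(64))  # 6.0
```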

[Figure: chess players A, B, C with winning odds 1:2.]

Rasch variances are a measure of the variability of persons within a dimension. The difference in winning probabilities serves as the unit of distinctness.

Translating the chess analogy to testing:

1. Winning probabilities -> solution probabilities

2. Opponent -> test item (item parameter)

3. Playing strength -> person parameter

4. The difference in winning probability is defined via the logit of the Rasch model

Interpretable Rasch variances

[Figure: probability of solving an item as a function of the person parameter; item i with σ = 0, item m with σ = 1; persons A, B, C marked.]

p(x_Ai = 1) = exp(θ_A − σ_i) / (1 + exp(θ_A − σ_i))

Difference to solve a question or task: θ_max − θ_min

Empirical evidence on the range of person parameters in Rasch units

AID (Kubinger & Wurst), standard form / parallel form:
Alltagswissen (everyday knowledge): 21.1 / 21.3
Realitätssicherheit (reality awareness): 13.3 / 13.1
Angewandtes Rechnen (applied arithmetic): 21.7 / 20.5

Trait / Author / Range:
Verbal intelligence test / Metzler & Schmidt / 11.4
Nonverbal intelligence / Formann & Piswanger / 8.2
Attitude toward sexual morality / Wakenhut / 8.1
Attitude toward penal law reform / Wakenhut / 7.2
Complaint list / Fahrenberg / 6.4
Spatial ability / Gittler / 5.9
Handling numbers in children / Rasch / 3.5

Usability criteria: explanations

• Relevant dependencies. Example: reliability and test length, stability, ...

• Irrelevant dependencies. Example: reliability and test score distribution

• Displaying numbers: integer, positive, predictable range

• Meaningful scale unit

• Familiarity: each new coefficient should be distinctly more usable than the traditional one

7. Linearity with respect to the unit in change

Explanation of 'linearity with respect to the unit in change':

- For measurement precision, this concerns the relation of reliability to the standard error of measurement.

- For agreement, this concerns the relation of Yule's Y to a change in the cell frequency a or d.

[Figure: correlation/reliability vs. standard error of measurement; Yule's Y vs. frequency of cell a.]

Evaluating the progress through enhanced usability

1. Formal test criteria are used more frequently for test selection

2. Tests in practice are of higher quality

Ergonomics in psychological test selection

Ergonomics: configuration of the environment (designing a tool to fit the hand); software conception (developing a program that can be used intuitively).

Psychological diagnostics: restricting a test description so that the relevant information is ready to use.

Integrating ergonomics into the formal test description

[Diagram: human interface techniques connect the test user with the psychometric admeasurements and the test; analysis of usage and usability criteria feed the evaluation.]

1. Formal test criteria are used more frequently for test selection. 2. Tests in practice are of higher quality.

Ergonomics and the development of criteria of usability

Requirement analysis (Mayhew, 1999): user profile, task analysis, platform capabilities/constraints

Mapped to testing: test user, test selection, test theory


CTT

r_tt = (s_w² − s_e²)/s_w²

r = Σ_i (x_i − x̄)(y_i − ȳ) / (N · s_x · s_y)

Operational meaning: none (association, P-R-E)

NDR = R / D = f(measurement error, score range, probability), with

D = 1.96 · √2 · s_x · √(1 − r_tt) and R ≈ 6 · s_x
