
ITC Conference, Winchester, 2002

Computer-based Testing

Usability of Psychometric Admeasurements

Dr. J. M. Müller

University of Tübingen, Germany

http://www.joergmmueller.de/default.htm

Overview

1. Introduction: Formal test descriptions in practice

2. Definition of usability in the context of test description

3. Illustrating problems: Reliability

4. Criteria of usability: foundation, scaling, general attributes

5. Two examples of enhanced usability: NDR and PDR

6. Summary

Introduction: Psychometric admeasurements in practice today and tomorrow

1. Test users often use poor-quality tests (e.g. Piotrowski et al.; Wade & Baker, 1977). Responses: psychometric knowledge (Moreland et al., 1995) and the competence approach (Bartram, 1995, 1996)

2. What should be described? For CBT: criteria for software usability (ISO 9241/10, 1991; Willumeit, Gediga & Hamborg, 1995) and further criteria: platform independence, the possibility of building one's own norm bank, and protection

3. How should it be described?

4. “Good practice” guidelines and standards are based on quality criteria (e.g. Standards for Educational and Psychological Testing, APA, 1999; International Guidelines for Test Use, ITC, 2000)

Quality supply vs. quality demand

Definition of Usability

Scope of usability: Usability in the context of psychological testing concerns all kinds of information test users need to describe a test for various purposes, and the ways to communicate them. This includes test manuals as well as formal test descriptions with the help of psychometric admeasurements.

Aim of usability: The product of good usability is that any test user finds all necessary information quickly and in a properly standardized form, ready to use for answering their questions, so that they can decide whether a test is an appropriate aid for the diagnostic question at hand.

Frame of usability: Quality assurance in the context of psychological testing refers to test construction, test translation, test description and the use of tests in practice. Methods to enhance quality control can include guidelines for test use, standards for test description, etc. Usability is a strategy to enhance quality on the level of formal description.

Consequences of usability concern the reengineering of formal test description.

Indices of measurement of error

[Figure: overview of indices, including Spearman correlation, % or SMC, phi coefficient, retest Pearson correlation, Yule's Y, Cronbach's alpha, Kuder-Richardson's Formula 20, Spearman-Brown prophecy formula, intraclass correlation, sensitivity TP/(TP+FN), specificity TN/(TN+FP), standard error of a score, kappa (reclassification), model-fit likelihoods, information function, kappa (interrater), standard error score.]
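Two of the indices above, sensitivity and specificity, follow directly from the cell counts of a 2×2 classification table; a minimal sketch (the counts below are hypothetical):

```python
def sensitivity(tp, fn):
    # Proportion of true cases the test detects: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of non-cases the test clears: TN / (TN + FP)
    return tn / (tn + fp)

# Hypothetical classification: 40 true positives, 10 false negatives,
# 45 true negatives, 5 false positives
print(sensitivity(40, 10))  # 0.8
print(specificity(45, 5))   # 0.9
```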

Measurement of error

[Figure: taxonomy of error measurement. Dimensional constructs: CTT, IRT, Generalizability Theory, leading to the standard error score and reliability. Categorical constructs: nonspecific vs. specific misclassification.]

Relationships between indices of error of measurement

[Figure: matrix relating the indices listed above (Y, kappa, phi, correlation) to each other and to information criteria.]

Top-down vs. bottom-up strategy to develop a coefficient

Scientist's point of view (top-down): test theory/statistic -> index: generic formula -> algorithm -> scale (correction) -> interpretation of the score (operational meaning)

Practitioner's point of view (bottom-up): defining the operational meaning -> scale definition -> specification within a test theory -> index: defining the influencing factors -> index: generic formula

Rescaling reliability: Number of distinctive results (NDR)

(Wright & Masters, 1982; Lehrl & Kinzel, 1973; Müller, 2001)

[Figure: test score distribution over range R, partitioned into intervals of one critical difference; two scores x1 and x2 marked.]

Formula: k = 1.96 · √2 · s_x · √(1 − r_tt)  (critical difference at α = .05)

NDR = R / k

R = test score range
k = critical difference
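The two formulas above translate into a few lines of code; a minimal sketch, with hypothetical test parameters:

```python
import math

def critical_difference(s_x, r_tt, z=1.96):
    # k = z * sqrt(2) * s_x * sqrt(1 - r_tt): the smallest difference
    # between two scores that is significant at alpha = .05 (two-sided)
    return z * math.sqrt(2) * s_x * math.sqrt(1 - r_tt)

def ndr(score_range, s_x, r_tt):
    # Number of distinctive results: NDR = R / k
    return score_range / critical_difference(s_x, r_tt)

# Hypothetical test: raw-score range R = 40, s_x = 8, reliability .90
print(round(ndr(40, 8, 0.90), 1))  # 5.7
```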

Criteria of usability for formal quality criteria

(modified from Müller, 2001, 2002a,b; Goodman & Kruskal, 1954)

Foundation
1. Unambiguous operational meaning
2. Unambiguous formal definition
3. Broad application area
4. Relevant dependencies
5. Independent of irrelevant factors

Scale definition
1. Meaningful scale unit, which implies: interval scale, positive values, defined range of values
2. Comparable to the reference scale
3. Significant scale unit, which implies a minimum of observations (Nmin)

Global attributes in use
1. Relevance
2. Informative (not redundant)
3. Predictable for the test user (nominal/actual value comparison)
4. Easy to learn
5. Easy to utilise
6. Fisher's (1925) criteria of estimation

NDR at work...

NDR = 2 corresponds to r = .50; NDR = 5 to r = .92; NDR = 10 to r = .98

Distribution of the reliability coefficient vs. distribution of the NDR coefficient: the former suggests many precise tests, the latter only some precise tests.

Probability of distinctive results (PDR)

Formula: PDR = D / (n(n − 1)/2), with

D = Σ_{i<j} s_ij, where s_ij = 1 if |x_i − x_j| ≥ k, and s_ij = 0 otherwise

Complete score comparison of all pairs.

A rectangular distribution shows an 80 % probability of distinguishing two test scores; a Gaussian distribution shows a 60 % probability.
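The pairwise definition of PDR maps directly onto code; a minimal sketch with hypothetical scores and critical difference:

```python
from itertools import combinations

def pdr(scores, k):
    # Probability of distinctive results: share of all score pairs whose
    # absolute difference reaches the critical difference k
    pairs = list(combinations(scores, 2))
    distinct = sum(1 for x, y in pairs if abs(x - y) >= k)
    return distinct / len(pairs)

# Hypothetical scores with a critical difference of 3
print(pdr([10, 12, 15, 19, 24], 3))  # 0.9
```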

PDR: Simulation study

[Figure: performance in separating test scores as a function of reliability and score distribution.]

PDR: Example

Subscale 'Resignation', Stress Coping Questionnaire (SVF-KJ; Hampel, Petermann & Dickow, 1999; N = 1123): r = 0.81, PDR = 41.6 %

Subscale 'Unsicherheit' (uncertainty), Symptom Checklist (Derogatis, 1977; German version Franke, 1995; N = 875): r = 0.81, PDR = 30.6 %

Reviewing NDR and PDR

1. NDR and PDR can be derived in any test theoretical model – there is progress in the application area.

2. NDR and PDR have an easy-to-understand operational meaning.

3. NDR and PDR are predictable for the test user in the nominal/actual value comparison.

NDR and PDR serve as examples of how to develop more usable formal test descriptions.

Summary

1. Usability is a possible strategy, with explicit and observable criteria, for improving formal test descriptions and, indirectly, for strengthening the role of guidelines and standards.

2. With NDR and PDR, two easy-to-understand coefficients have been proposed; their application in several test-theoretical models is in progress.

Thank you for your attention!

Medicine: Effect-size measures

Practitioner's coefficient: NNT = 1 / (RRR · CER)

Scientific coefficient (Cohen, 1988): w = √( Σ_{i=1}^{m} (P_{1i} − P_{0i})² / P_{0i} )

NNT [Number Needed to Treat]: the number of patients who need to be treated to prevent one adverse outcome. Taken from the EBM Glossary, Evidence-Based Medicine, Volume 125, Number 1.
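Assuming the usual definition NNT = 1/ARR, where the absolute risk reduction ARR = CER − EER (equivalently CER · RRR), a sketch with hypothetical event rates:

```python
def nnt(cer, eer):
    # Absolute risk reduction = control event rate - experimental event rate;
    # NNT is its reciprocal: patients to treat to prevent one adverse outcome
    arr = cer - eer
    return 1 / arr

# Hypothetical rates: 20 % adverse outcomes under control, 12 % under treatment
print(round(nnt(0.20, 0.12), 1))  # 12.5
```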

Measuring in technical fields: Solutions from engineering

There is a German norm, DIN 2257, on how to measure the physical length of an object and how to report the result. The norm allows as output only values with statistical evidence.

Criteria of usability for formal quality criteria, applied to NNT

(Foundation, scale definition and global attributes in use, as listed above.)

Criteria of software usability(from Willumeit, Gediga & Hamborg, 1995)

Questionnaire on the basis of ISO9241/10 (IsoMetrics) to evaluate the following dimensions:

1. Suitability for the task

2. Self-descriptiveness

3. Controllability

4. Conformity with user expectations

5. Error tolerance

6. Suitability for individualization

7. Suitability for learning

KR20 and Cronbach's alpha

Kuder-Richardson Formula 20 (from Cronbach, 1951):

r_tt = n/(n − 1) · (1 − Σ_i p_i q_i / s_t²)

where p_i is the relative frequency of 1s on item i and q_i the relative frequency of 0s.

Cronbach's alpha:

α = c/(c − 1) · (1 − Σ_i s_i² / s_tot²)

where c is the number of variables, s_i² the variance of variable i, and s_tot² the variance of the sum.
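Cronbach's alpha as defined above can be sketched in a few lines; the data set below is hypothetical, and for dichotomous items the same computation reproduces KR-20, since p_i · q_i is the variance of a 0/1 item:

```python
def variance(x):
    # Population variance of a list of numbers
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def cronbach_alpha(items):
    # items: one list of scores per item, all over the same persons
    c = len(items)                              # number of items (variables)
    totals = [sum(col) for col in zip(*items)]  # each person's sum score
    return c / (c - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical data: 3 dichotomous items answered by 4 persons
items = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
print(round(cronbach_alpha(items), 2))  # 0.63
```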

Formulas for the error of measurement in categorical constructs

2×2 table:      B1  B2
          A1     a   b
          A2     c   d

Cohen's kappa: κ = (p_0 − p_e)/(1 − p_e), with p_0 = (a + d)/N and p_e = [(a + b)(a + c) + (c + d)(b + d)]/N². A further 16 measures of concordance between two measurements for binary data are compared in Conger & Ward (1984).

Yule's fourfold interdependence measures:

Q coefficient: Q = (ad − bc)/(ad + bc)

Yule's Y: Y = (√(ad) − √(bc))/(√(ad) + √(bc))

Phi coefficient: φ = √(χ²/N), with χ² = Σ_i Σ_j (f_ij − e_ij)²/e_ij. Phi depends on the marginal distributions, and its significance test depends on N (Yates continuity correction, 1934).
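The four agreement indices can all be computed from the cell counts a, b, c, d of the 2×2 table; a minimal sketch with a hypothetical table:

```python
import math

def agreement_indices(a, b, c, d):
    # 2x2 table:      B1  B2
    #            A1    a   b
    #            A2    c   d
    n = a + b + c + d
    p0 = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    q = (a * d - b * c) / (a * d + b * c)                 # Yule's Q
    y = ((math.sqrt(a * d) - math.sqrt(b * c))
         / (math.sqrt(a * d) + math.sqrt(b * c)))         # Yule's Y
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return kappa, q, y, phi

# Hypothetical table: a = 40, b = 10, c = 5, d = 45
kappa, q, y, phi = agreement_indices(40, 10, 5, 45)
print(round(kappa, 2), round(y, 3))  # 0.7 0.714
```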

Formulas for the error of measurement in categorical constructs (continued)

Fricke's agreement coefficient: Ü = 1 − SS/SS_max, where SS is the sum of squares within a person and SS_max the maximum possible sum of squares within persons. Simple agreement: Ü = (a + d)/n.

Point-biserial correlation: r_pbis = (X̄_R − X̄)/s_x · √(p/q)

where X̄ is the arithmetic mean of all raw test scores, X̄_R the mean of the subjects with correct answers, s_x the standard deviation of the raw scores of all subjects, N the number of subjects, and N_R the number of subjects with correct answers.

Tetrachoric correlation: r_tet = cos(180° / (1 + √(ad/bc)))

[Example table: A B C / I 1 4 3 / II 0 4 2 / III 0 5 2]

_

Formulas for the error of measurement in CTT and IRT, plus the prophecy formula

Spearman-Brown formula (k = factor of test lengthening):

r_kk = k · r_tt / (1 + (k − 1) · r_tt)

Rasch model: Var(E_v) = Σ_{i=1}^{k} p_vi (1 − p_vi)

CTT standard error: s_e = s_x · √(1 − r_tt)
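The Spearman-Brown prophecy formula in code; the example (doubling a test) is hypothetical:

```python
def spearman_brown(r_tt, k):
    # Projected reliability after lengthening the test by factor k
    return k * r_tt / (1 + (k - 1) * r_tt)

# Doubling (k = 2) a test with reliability .60
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```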

Some formulas for the error of measurement in metric constructs

Reliability (Kelley, 1921): r_tt = (s_w² − s_e²)/s_w²

Pearson (1907) correlation, after Bravais (1846): r = Σ_i (x_i − x̄)(y_i − ȳ) / (N · s_x · s_y)

Spearman's rho (1904): ρ = 1 − 6 Σ_i d_i² / (N(N² − 1))

Kendall's tau (1942), with S = difference between the numbers of proversions and inversions: τ = S / (N(N − 1)/2)

Fisher's Z: Z = ½ ln((1 + r)/(1 − r))

[Example rank pairs: 1 2 3 4 5 vs. 3 2 3 5 4]
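Spearman's rho and Fisher's Z as given above; the untied rank pairs in the example are hypothetical:

```python
import math

def spearman_rho(rank_x, rank_y):
    # rho = 1 - 6 * sum(d_i^2) / (N * (N^2 - 1)), d_i = rank difference
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n**2 - 1))

def fisher_z(r):
    # Fisher's Z transformation: Z = 0.5 * ln((1 + r) / (1 - r))
    return 0.5 * math.log((1 + r) / (1 - r))

# Hypothetical untied rank pairs for 5 persons
print(round(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 2))  # 0.8
```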

Non-linear relationship between reliability, NDR and the standard error score

[Figure: NDR and the standard error score plotted against reliability (0 to 1).]

Item Response Theory (Fischer & Molenaar, 1994)

1. Dichotomous Rasch model

2. Linear logistic test model

3. Linear logistic model for change

4. Dynamic generalization of the Rasch model

5. One-parameter logistic model

6. Linear logistic latent class analysis

7. Mixture-distribution Rasch models

8. Polytomous Rasch models

9. Extended rating scale and partial credit models

10. Polytomous mixed Rasch models

11. ...

...more IRT (van der Linden & Hambleton, 1997)

1. Nominal categories model

2. Response model for multiple choice

3. Graded response model

4. Partial credit model

5. Generalized partial credit model

6. Logistic model for time-limit tests

7. Hyperbolic cosine IRT model for unfolding direct responses

8. Single-item response model

9. Response model with manifest predictors

10. A linear multidimensional model

11. ...

Formulas of some IRT models

Rasch model: p(x_vi = 1) = exp(θ_v − σ_i) / (1 + exp(θ_v − σ_i))

Birnbaum model: p(x_vi = 1) = exp(α_i (θ_v − σ_i)) / (1 + exp(α_i (θ_v − σ_i)))

Binomial model and unfolding model: analogous logistic forms.

Latent class model: p(x) = Σ_{g=1}^{G} π_g · Π_{i=1}^{k} π_ig^{x_i} (1 − π_ig)^{1 − x_i}
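The Rasch and Birnbaum response probabilities translate directly from their formulas; a minimal sketch:

```python
import math

def rasch_p(theta, sigma):
    # P(item solved) under the dichotomous Rasch model:
    # person parameter theta, item difficulty sigma
    return math.exp(theta - sigma) / (1 + math.exp(theta - sigma))

def birnbaum_p(theta, sigma, alpha):
    # Birnbaum (2PL) model: adds an item discrimination parameter alpha
    return math.exp(alpha * (theta - sigma)) / (1 + math.exp(alpha * (theta - sigma)))

print(round(rasch_p(0.0, 0.0), 2))  # 0.5
print(round(rasch_p(1.0, 0.0), 2))  # 0.73
```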


Norm scales

SCL-90-R test score distribution

Simulation study on the relationship between measures of association

[Figure: scatter-plot matrix comparing Y/kappa/phi against correlation, Q, and SMC, under a dichotomized normal distribution with equal marginals and a skewed distribution with unequal marginals. 2×2 table: B1 B2 / A1 a b / A2 c d. Question: is the relationship linear?]

Efficiency in measuring

Content: efficiency. Concept: the less effort you need for the same amount of information, the more efficient the test is: efficiency = f(information, effort).

Index: E = amount of information / time. Estimates: information theory (Shannon & Weaver, 1949)

Amount of information of a signal: chess example

In the chess example you need at least six binary (50-50) questions, i.e. 6 bits.

Question 1: left or right? Question 2: top or bottom? Question 3 ... Question 6.

The scale unit 'bit' can be understood as the minimal or optimal number of questions needed to identify a signal out of a set of alternatives.
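The chessboard argument is just a base-2 logarithm; a minimal sketch:

```python
import math

def bits_needed(alternatives):
    # Minimal number of binary (50-50) questions to single out one
    # alternative: the base-2 logarithm of the number of alternatives
    return math.log2(alternatives)

# 64 squares on a chessboard -> six yes/no questions
print(bits_needed(64))  # 6.0
```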

[Figure: chess players A, B, C with winning odds 1:2.]

Rasch variances are a measure of the variability of persons within a dimension. The difference in winning probabilities serves as the unit of distinctness.

Translating the chess analogy to testing:

1. Winning probabilities -> solution probabilities

2. Opponent -> test item (item parameter)

3. Playing strength -> person parameter

4. The difference in winning probability is defined via the logit of the Rasch model

Interpretable Rasch variances

[Figure: probability of solving an item as a function of the person parameter; item i with σ = 0, item m with σ = 1; persons A, B, C marked.]

p(x_Ai = 1) = exp(θ_A − σ_i) / (1 + exp(θ_A − σ_i))

Difference to solve a question or task: θ_max − θ_min

Empirical evidence on the range of person parameters in Rasch units

AID (Kubinger & Wurst), standard form / parallel form:
Alltagswissen (everyday knowledge): 21.1 / 21.3
Realitätssicherheit (reality awareness): 13.3 / 13.1
Angewandtes Rechnen (applied arithmetic): 21.7 / 20.5

Trait / Author / Range:
Verbal intelligence test / Metzler & Schmidt / 11.4
Nonverbal intelligence / Formann & Piswanger / 8.2
Attitude toward sexual morality / Wakenhut / 8.1
Attitude toward penal law reform / Wakenhut / 7.2
Complaint list / Fahrenberg / 6.4
Spatial ability / Gittler / 5.9
Handling numbers in children / Rasch / 3.5

Usability criteria: explanations

• Relevant dependencies. Example: reliability and test length, stability, ...

• Irrelevant dependencies. Example: reliability and test score distribution

• Displaying numbers: integer, positive, predictable range

• Meaningful scale unit

• Familiarity: each new coefficient should be distinctly more usable than the traditional one

7. Linearity with respect to the unit in change

Explanation of 'linearity with respect to the unit in change':

- For measurement precision, this concerns the relation of reliability to the standard error of measurement.

- For agreement, this concerns the relation of Yule's Y to a change in the cell frequency a or d.

[Figure: correlation/reliability vs. standard error of measurement; Yule's Y vs. frequency of cell a.]

Evaluating the progress through enhanced usability

1. Formal test criteria are used more frequently for test selection

2. Tests in practice are of higher quality

Ergonomics in psychological test selection

Ergonomics: configuration of the environment (designing a tool to fit the hand); software conception (developing a program that can be used intuitively).

Psychological diagnostics: restricting a test description so that the relevant information is ready to use.

Integrating ergonomics into the formal test description

[Diagram: human interface techniques connect the test user with the psychometric admeasurements and the test; analysis of usage and usability criteria feed the evaluation.]

1. Formal test criteria are used more frequently for test selection. 2. Tests in practice are of higher quality.

Ergonomics and the development of criteria of usability

Requirement analysis (Mayhew, 1999): user profile, task analysis, platform capabilities/constraints

Mapped to testing: test user, test selection, test theory


CTT

r_tt = (s_w² − s_e²)/s_w²

r = Σ_i (x_i − x̄)(y_i − ȳ) / (N · s_x · s_y)

Operational meaning: none (association, P-R-E)

NDR = R / D = f(measurement error, score range, probability), with

D = 1.96 · √2 · s_x · √(1 − r_tt) and R ≈ 6 · s_x
