Psychometrics in Neuropsychological Assessment




Psychometrics in Neuropsychological Assessment
with Daniel J. Slick

OVERVIEW

The practice of neuropsychological assessment depends to a large extent on the reliability and validity of neuropsychological tests. Unfortunately, not all neuropsychological tests are created equal, and, like any other product, published tests vary in terms of their "quality," as defined in psychometric terms such as reliability, measurement error, temporal stability, sensitivity, specificity, predictive validity, and with respect to the care with which test items are derived and normative data are obtained. In addition to commercial measures, numerous tests developed primarily for research purposes have found their way into wide clinical usage; these vary considerably with regard to psychometric properties. With few exceptions, when tests originate from clinical research contexts, there is often validity data but little else, which makes estimating measurement precision and stability of test scores a challenge.

Regardless of the origins of neuropsychological tests, their competent use in clinical practice demands a good working knowledge of test standards and of the specific psychometric characteristics of each test used. This includes familiarity with the Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999) and a working knowledge of basic psychometrics. Texts such as those by Nunnally and Bernstein (1994) and Anastasi and Urbina (1997) outline some of the fundamental psychometric prerequisites for competent selection of tests and interpretation of obtained scores. Other, neuropsychologically focused texts such as Mitrushina et al. (2005), Lezak et al. (2004), Baron (2004), Franklin (2003a), and Franzen (2000) also provide guidance. The following is intended to provide a broad overview of important psychometric concepts in neuropsychological assessment and coverage of important issues to consider when critically evaluating tests for clinical usage. Much of the information provided also serves as a conceptual framework for the test reviews in this volume.


THE NORMAL CURVE

The frequency distributions of many physical, biological, and psychological attributes, as they occur across individuals in nature, tend to conform, to a greater or lesser degree, to a bell-shaped curve (see Figure 1-1). This normal curve or normal distribution, so named by Karl Pearson, is also known as the Gaussian or Laplace-Gauss distribution, after the 18th-century mathematicians who first defined it. The normal curve is the basis of many commonly used statistical and psychometric models (e.g., classical test theory) and is the assumed distribution for many psychological variables.

Definition and Characteristics

The normal curve has a number of specific properties. It is unimodal, perfectly symmetrical, and asymptotic at the tails. With respect to scores from measures that are normally distributed, the ordinate, or height of the curve at any point along the x (test score) axis, is the proportion of persons within the sample who obtained a given score. The ordinates for a range of scores (i.e., between two points on the x axis) may also be summed to give the proportion of persons that obtained a score within the specified range. If a specified normal curve accurately reflects a population distribution, then ordinate values are also equivalent to the probability of observing a given score or range of scores when randomly sampling from the population. Thus, the normal curve may also be referred to as a probability distribution.

Figure 1-1 The normal curve.


4 A Compendium of Neuropsychological Tests

The normal curve is mathematically defined as follows:

f(x) = (1 / √(2πσ²)) e^(−(x − μ)² / 2σ²)   [1]

Where:

x = measurement value (test score)
μ = the mean of the test score distribution
σ = the standard deviation of the test score distribution
π = the constant pi (3.14 . . .)
e = the base of natural logarithms (2.71 . . .)
f(x) = the height (ordinate) of the curve for any given test score

Relevance for Assessment

As noted previously, because it is a frequency distribution, the area under any given segment of the normal curve indicates the frequency of observations or cases within that interval. From a practical standpoint, this provides psychologists with an estimate of the "normality" or "abnormality" of any given test score or range of scores (i.e., whether it falls in the center of the bell shape, where the majority of scores lie, or instead, at either of the tail ends, where few scores can be found). The way in which the degree of "normality" or "abnormality" of test scores is quantified varies, but perhaps the most useful and inherently understandable metric is the percentile.

Z Scores and Percentiles

A percentile indicates the percentage of scores that fall at or below a given test score. As an example, we will assume that a given test score is plotted on a normal curve. When all of the ordinate values at and below this test score are summed, the resulting value is the percentile associated with that test score (e.g., a score at the 75th percentile indicates that 75% of the reference sample obtained equal or lower scores).

To convert scores to percentiles, raw scores may be linearly transformed or "standardized" in several ways. The simplest and perhaps most commonly calculated standard score is the z score, which is obtained by subtracting the sample mean score from an obtained score and dividing the result by the sample SD, as shown below:

z = (x − X̄) / SD   [2]

Where:

x = measurement value (test score)
X̄ = the mean of the test score distribution
SD = the standard deviation of the test score distribution

The resulting distribution of z scores has a mean of 0 and an SD of 1, regardless of the metric of raw scores from which they were derived. For example, given a mean of 25 and an SD of 5, a raw score of 20 translates into a z score of −1. The percentile corresponding to any resulting z score can then be easily looked up in tables available in most statistical texts. Z score conversions to percentiles are also shown in Table 1-1.

Linear Transformation of Z Scores: T Scores and Other Standard Scores

In addition to the z score, linear transformation can be used to produce other standardized scores that have the same properties with regard to easy conversion via table look-up (see Table 1-1). The most common of these are T scores (M = 50, SD = 10), scaled scores, and standard scores such as those used in most IQ tests (M = 10, SD = 3, and M = 100, SD = 15, respectively). It must be remembered that z scores, T scores, standard scores, and percentile equivalents are derived from samples; although these are often treated as population values, any limitations of generalizability due to reference sample composition or testing circumstances must be taken into consideration when standardized scores are interpreted.

Interpretation of Percentiles

An important property of the normal curve is that the relationship between raw or z scores (which for purposes of this discussion are equivalent, since they are linear transformations of each other) and percentiles is not linear. That is, a constant difference between raw or z scores will be associated with a variable difference in percentile scores, as a function of the distance of the two scores from the mean. This is due to the fact that there are proportionally more observations (scores) near the mean than there are farther from the mean; otherwise, the distribution would be rectangular, or non-normal. This can readily be seen in Figure 1-2, which shows the normal distribution with demarcation of z scores and corresponding percentile ranges.

The nonlinear relation between z scores and percentiles has important interpretive implications. For example, a one-point difference between two z scores may be interpreted differently, depending on where the two scores fall on the normal curve. As can be seen, the difference between a z score of 0 and a z score of +1 is 34 percentile points, because 34% of scores fall between these two z scores (i.e., the scores being compared are at the 50th and 84th percentiles). However, the difference between a z score of +2 and a z score of +3 is less than 3 percentile points, because only 2.5% of the distribution falls between these two points (i.e., the scores being compared are at the 98th and 99.9th percentiles). On the other hand, interpretation of percentile-score differences is also not straightforward, in that an equivalent "difference" between two percentile rankings may entail different clinical implications if the scores occur at the tail end of the curve than if they occur near the middle of the distribution. For example, a 30-point difference between scores at the 1st percentile versus the 31st percentile may be more clinically meaningful than the same difference between scores at the 35th percentile versus the 65th percentile.
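The conversions described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the chapter: the function names are ours, and the percentile is computed from the standard normal CDF rather than looked up in a table.

```python
from math import erf, sqrt

def z_score(x, mean, sd):
    """Equation [2]: z = (x - X) / SD."""
    return (x - mean) / sd

def normal_percentile(z):
    """Percentile rank implied by a z score under a normal curve,
    via the standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def to_standard(z, new_mean, new_sd):
    """Linear transformation of z to another standard-score metric,
    e.g., T scores (M = 50, SD = 10) or IQ-style scores (M = 100, SD = 15)."""
    return new_mean + z * new_sd

# The chapter's example: mean 25, SD 5, raw score 20
z = z_score(20, 25, 5)        # -1.0
t = to_standard(z, 50, 10)    # T score of 40
pct = normal_percentile(z)    # roughly the 16th percentile

# The z-to-percentile mapping is nonlinear: one z unit near the mean
# spans far more percentile points than one z unit in the tail.
near_mean = normal_percentile(1) - normal_percentile(0)  # about 34 points
in_tail = normal_percentile(3) - normal_percentile(2)    # about 2 points
```

Note that this sketch assumes an approximately normal score distribution; as discussed later in the chapter, CDF-based percentiles can be badly wrong for non-normal tests.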


Table 1-1 Score Conversion Table

IQa     T      SSb   Percentile   -z/+z        Percentile   SSb   T      IQa
≤55     ≤20    1     ≤0.1         3.00+        ≥99.9        19    ≥80    ≥145
56–60   21–23  2     <1           2.67–2.99    >99          18    77–79  140–144
61–67   24–27  3     1            2.20–2.66    99           17    73–76  133–139
68–70   28–30  4     2            1.96–2.19    98           16    70–72  130–132
⋮
85      40     7     16           .98–1.01     84           13    60     115
⋮
100     50     10    50           .00–.01      50           10    50     100

aM = 100, SD = 15. bM = 10, SD = 3. Note: SS = Scaled Score.


Figure 1-2 The normal curve demarcated by z scores.

As well as facilitating translation of raw scores to estimated population ranks, standardization of test scores, by virtue of conversion to a common metric, facilitates comparison of scores across measures. However, this is only advisable when the raw score distributions for tests that are being compared are approximately normal in the population. In addition, if standardized scores are to be compared, they should be derived from similar samples, or more ideally, from the same sample. A score at the 50th percentile on a test normed on a population of university students does not have the same meaning as an "equivalent" score on a test normed on a population of elderly individuals. When comparing test scores, one must also take into consideration both the reliability of the two measures and their intercorrelation before determining if a significant difference exists (see Crawford & Garthwaite, 2002). In some cases, relatively large disparities between standard scores may not actually reflect reliable differences, and therefore may not be clinically meaningful. Furthermore, statistically significant or reliable differences between test scores may be common in a reference sample; therefore, the base rate of differences must also be considered, depending on the level of the scores (an IQ of 90 versus 110 as compared to 110 versus 130). One should also keep in mind that when test scores are not normally distributed, standardized scores may not accurately reflect actual population rank. In these circumstances, differences between standard scores may be misleading.

Note also that comparability across tests does not imply equality in meaning and relative importance of scores. For example, one may compare standard scores on measures of pitch discrimination and intelligence, but it will rarely be the case that these scores are of equal clinical or practical meaning or significance.
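As a rough illustration of why reliability matters when comparing two scores, the classical standard error of measurement can be combined into a standard error of the difference. This is a simplified textbook sketch, not the method the chapter cites: Crawford and Garthwaite's procedures are more refined, and base rates of differences still require separate consideration.

```python
from math import sqrt

def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * sqrt(1 - reliability)

def reliable_difference(score1, score2, sd, r1, r2, z_crit=1.96):
    """Is the gap between two standard scores (same metric) large
    relative to the standard error of the difference?"""
    se_diff = sqrt(sem(sd, r1) ** 2 + sem(sd, r2) ** 2)
    return abs(score1 - score2) > z_crit * se_diff, se_diff

# Two index scores on an IQ metric (M = 100, SD = 15), with
# hypothetical reliabilities of .90 and .85:
significant, se_diff = reliable_difference(110, 95, 15, 0.90, 0.85)
# se_diff is 7.5, so the 15-point gap just exceeds 1.96 * 7.5 = 14.7
```

Even a "reliable" difference by this criterion may be common in the reference population, which is the base-rate caveat raised above.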

The Meaning of Standardized Test Scores: Score Interpretation

In clinical practice, one may encounter standard scores that are either extremely low or extremely high. The meaning and comparability of such scores will depend critically on the characteristics of the normative sample from which they derive.

Interpreting Extreme Scores

A final critical issue with respect to the meaning of standardized scores (e.g., z scores) has to do with extreme observations.

For example, consider a hypothetical case in which an examinee obtains a raw score that is below the range of scores found in a normal sample. Suppose further that the SD in the normal sample is very small and thus the examinee's raw score translates to a z score of −5, indicating that the probability of encountering this score in the normal population would be 3 in 10 million (i.e., a percentile ranking of .00003). This represents a considerable extrapolation from the actual normative data, as (1) the normative sample did not include 10 million individuals and (2) not a single individual in the normative sample obtained a score anywhere close to the examinee's score. The percentile value is therefore an extrapolation and confers a false sense of precision. While one may be confident that it indicates impairment, there may be no basis to assume that it represents a meaningfully "worse" performance than a z score of −3, or of −4.

The estimated prevalence value of an obtained z score (or T score, etc.) can be calculated to determine whether interpretation of extreme scores may be appropriate. This is simply accomplished by inverting the percentile score corresponding to the z score (i.e., dividing 1 by the percentile score). For example, a z score of −4 is associated with an estimated frequency of occurrence or prevalence of approximately 0.00003. Dividing 1 by this value gives a rounded result of 31,560. Thus, the estimated prevalence value of this score in the population is 1 in 31,560. If the normative sample from which a z score is derived is considerably smaller than the denominator of the estimated prevalence value (i.e., 31,560 in the example), then some caution may be warranted in interpreting the percentile. In addition, whenever such extreme scores are being interpreted, examiners should also verify that the examinee's raw score falls within the range of raw scores in the normative sample. If the normative sample size is substantially smaller than the estimated prevalence sample size and the examinee's score falls outside the sample range, then considerable caution may be indicated in interpreting the percentile associated with the standardized score. Regardless of the z score value, it must also be kept in mind that interpretation of the associated percentile value may not be justifiable if the normative sample has a significantly non-normal distribution (see later for further discussion of non-normality). In sum, the clinical interpretation of extreme scores depends to a large extent on the properties of the normal samples involved; one can have more confidence that the percentile is reasonably accurate if the normative sample is large and well constructed and the shape of the normative sample distribution is approximately normal, particularly in tail regions where extreme scores are found.
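The estimated prevalence computation is easy to automate. In this sketch the tail proportion comes from the normal CDF rather than a printed table, so the result differs slightly from the chapter's hand-rounded 31,560.

```python
from math import erf, sqrt

def estimated_prevalence(z):
    """1-in-N frequency implied by a (negative) z score under a true
    normal distribution: invert the proportion at or below z."""
    proportion = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return 1 / proportion

# z = -4: about 1 case in ~31,600. The figure is only interpretable if
# the normative sample is anywhere near that large.
n = estimated_prevalence(-4)
```

If the normative sample has a few hundred cases, a 1-in-30,000 percentile is an extrapolation, which is precisely the caution urged above.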

The Normal Curve and Test Construction

Although the normal curve is from many standpoints an ideal or even expected distribution for psychological data, test score


Figure 1-3 Skewed distributions (negative skew; positive skew).

Figure 1-4 A non-normal test score distribution (raw score M = 50, SD = 10).

samples do not always conform to a normal distribution. When a new test is constructed, non-normality can be "corrected" by examining the distribution of scores on the prototype test, adjusting test properties, and resampling until a normal distribution is reached. For example, when a test is first administered during a try-out phase and a positively skewed distribution is obtained (i.e., with most scores clustering at the low end of the distribution), the test likely has too high a floor, causing most examinees to obtain low scores. Easy items can then be added so that the majority of scores fall in the middle of the distribution rather than at the lower end (Anastasi & Urbina, 1997). When this is successful, the greatest numbers of individuals obtain about 50% of items correct. This level of difficulty usually provides the best differentiation between individuals at all ability levels (Anastasi & Urbina, 1997).

It must be noted that a test with a normal distribution in the general population may show extreme skew or other divergence from normality when administered to a population that differs considerably from the average individual. For example, a vocabulary test that produces normally distributed scores in a general sample of individuals may display a negatively skewed distribution due to a low ceiling when administered to doctoral students in literature, and a positively skewed distribution due to a high floor when administered to preschoolers from recently immigrated, Spanish-speaking families (see Figure 1-3 for examples of positive and negative skew). In this case, the test would be incapable of effectively discriminating between individuals within either group because of ceiling effects and floor effects, respectively, even though it is of considerable utility in the general population. Thus, a test's distribution, including floors and ceilings, must always be considered when assessing individuals who differ from the normative sample in terms of characteristics that affect test scores (e.g., in this example, degree of exposure to English words). In addition, whether a test produces a normal distribution (i.e., without positive or negative skew) is also an important aspect of evaluating tests for bias across different populations (see Chapter 2 for more discussion of bias).

Depending on the characteristics of the construct being measured and the purpose for which a test is being designed, a normal distribution of scores may not be obtainable or even desirable. For example, the population distribution of the construct being measured may not be normally distributed. Alternatively, one may want only to identify and/or discriminate between persons at only one end of a continuum of abilities (e.g., a creativity test for gifted students). In this case, the characteristics of only one side of the sample score distribution (i.e., the upper end) are critical, while the characteristics on the other side of the distribution are of no particular concern. The measure may even be deliberately designed to have floor or ceiling effects. For example, if one is not interested in one tail (or even one-half) of the distribution, items that would provide discrimination in that region may be omitted to save administration time. In this case, a test with a high floor or low ceiling in the general population (and with positive or negative skew) may be more desirable than a test with a normal distribution. In most applications, however, a more normal-looking curve within the targeted subpopulation is usually desirable.

Non-Normality

Although the normal curve is an excellent model for psychological data and many sample distributions of natural processes are approximately normal, it is not unusual for test score distributions to be markedly non-normal, even when samples are large (Micceri, 1989). For example, neuropsychological tests such as the Boston Naming Test (BNT) and Wisconsin Card Sorting Test (WCST) do not have normal distributions when raw scores are examined, and, even when demographic correction methods are applied, some tests continue to show a non-normal, multimodal distribution in some populations (Fastenau, 1998). (An example of a non-normal distribution is shown in Figure 1-4.)

The degree to which a given distribution approximates the underlying population distribution increases as the number of observations (N) increases and becomes less accurate as N decreases. This has important implications for norms comprised of small samples. Thus, a larger sample will produce a more normal distribution, but only if the underlying population distribution from which the sample is obtained is normal. In other words, a large N does not "correct" for non-normality of an underlying population distribution. However,


small samples may yield non-normal distributions due to random sampling effects, even though the population from which the sample is drawn has a normal distribution. That is, one may not automatically assume, given a non-normal distribution in a small sample, that the population distribution is in fact non-normal (note that the converse may also be true).

Several factors may lead to non-normal test score distributions: (a) the existence of discrete subpopulations within the general population with differing abilities, (b) ceiling or floor effects, and (c) treatment effects that change the location of means, medians, and modes and affect variability and distribution shape (Micceri, 1989).

Skew

As with the normal curve, some varieties of non-normality may be characterized mathematically. Skew is a formal measure of asymmetry in a frequency distribution that can be calculated using a specific formula (see Nunnally & Bernstein, 1994). It is also known as the third moment of a distribution (the mean and variance are the first and second moments, respectively). A true normal distribution is perfectly symmetrical about the mean and has a skew of zero. A non-normal but symmetric distribution will have a skew value that is near zero. Negative skew values indicate that the left tail of the distribution is heavier (and often more elongated) than the right tail, which may be truncated, while positive skew values indicate that the opposite pattern is present (see Figure 1-3). When distributions are skewed, the mean and median are not identical because the mean will not be at the midpoint in rank, and z scores will not accurately translate into sample percentile rank values. The error in mapping of z scores to sample percentile ranks increases as skew increases.
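The third-moment statistic can be computed directly. This is a minimal sketch using the population form of the formula; texts such as Nunnally and Bernstein (1994) discuss small-sample corrections, and the example data are invented.

```python
def skewness(scores):
    """Standardized third central moment: mean((x - m)^3) / sd^3.
    Zero for a symmetric sample; positive when the right tail is heavier."""
    n = len(scores)
    m = sum(scores) / n
    sd = (sum((x - m) ** 2 for x in scores) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in scores) / n

symmetric = [1, 2, 3, 4, 5]           # skew = 0
right_tailed = [1, 1, 1, 2, 2, 3, 9]  # long right tail, so skew > 0
```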

Truncated Distributions

Significant skew often indicates the presence of a truncated distribution. This may occur when the range of scores is restricted on one side but not the other, as is the case, for example, with reaction time measures, which cannot be lower than several hundred milliseconds but can reach very high positive values in some individuals. In fact, distributions of scores from reaction time measures, whether aggregated across trials on an individual level or across individuals, are often characterized by positive skew and positive outliers. Mean values may therefore be positively biased with respect to the "central tendency" of the distribution as defined by other indices, such as the median. Truncated distributions are also commonly seen on error scores. A good example of this is Failure to Maintain Set (FMS) scores on the WCST (see review in this volume). In the normative sample of 30- to 39-year-old persons, observed raw scores range from 0 to 21, but the majority of persons (84%) obtain scores of 0 or 1, and less than 1% obtain scores greater than 3.
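A quick numeric illustration of the mean-versus-median point, using hypothetical reaction-time trials (the values are made up for the sketch): the lower bound and a few slow outliers produce positive skew, and the mean is pulled toward the long right tail.

```python
from statistics import mean, median

# Simulated reaction times (ms): bounded below by physiology, with a
# long right tail from a few very slow trials
rts = [420, 450, 460, 470, 480, 495, 510, 530, 900, 1400]

rt_mean = mean(rts)      # pulled upward by the outliers
rt_median = median(rts)  # stays with the bulk of the trials
```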

Floor/Ceiling Effects

Floor and ceiling effects may be defined as the presence of truncated tails in the context of limitations in range of item difficulty. For example, a test may be said to have a high floor when a large proportion of the examinees obtain raw scores at or near the lowest possible score. This may indicate that the test lacks a sufficient number and range of easier items. Conversely, a test may be said to have a low ceiling when the opposite pattern is present (i.e., when a high number of examinees obtain raw scores at or near the highest possible score). Floor and ceiling effects may significantly limit the usefulness of a measure. For example, a measure with a high floor may not be suitable for use with low-functioning examinees, particularly if one wishes to delineate level of impairment.
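Floor and ceiling effects can be screened for by tabulating how much of a sample piles up at the ends of the obtainable score range. This is a sketch with hypothetical frequencies that loosely echo the WCST FMS pattern described above; the "within one point" criterion is our illustrative choice, not a standard.

```python
def floor_ceiling_rates(scores, min_score, max_score):
    """Proportion of examinees scoring at (or within one point of) the
    lowest and highest obtainable raw scores."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s <= min_score + 1) / n
    at_ceiling = sum(1 for s in scores if s >= max_score - 1) / n
    return at_floor, at_ceiling

# Hypothetical error-score sample on a 0-21 scale: most examinees at 0 or 1
errors = [0] * 60 + [1] * 24 + [2] * 10 + [3] * 4 + [8] + [21]
floor_rate, ceiling_rate = floor_ceiling_rates(errors, 0, 21)
# floor_rate of 0.84 flags a high floor: little room to grade mild cases
```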

Multimodality and Other Types of Non-Normality

Multimodality is the presence of more than one "peak" in a frequency distribution (see the histogram in Figure 1-4 for an example). Another form of significant non-normality is the uniform or near-uniform distribution (a distribution with no or minimal peak and relatively equal frequency across scores). When such distributions are present, linearly transformed scores (z scores, T scores, and other deviation scores) may be totally inaccurate with respect to actual sample/population percentile rank and should not be interpreted in that framework. In these cases, sample-derived rank percentile scores may be more clinically useful.

Non-Normality and Percentile Derivations

Non-normality is not trivial; it has major implications for derivation and interpretation of standard scores and comparison of such scores across tests: standardized scores derived by linear transformation (e.g., z scores) will not correspond to sample percentiles, and the degree of divergence may be quite large.

Consider the histogram in Figure 1-4, which shows the distribution of scores obtained for a hypothetical test. This test, with a sample size of 1000, has a mean raw score of 50 and a standard deviation of 10; therefore (and very conveniently), no linear transformation is required to obtain T scores. An expected normal distribution based on the observed mean and standard deviation has been overlaid on the observed histogram for purposes of comparison.

The histogram in Figure 1-4 shows that the distribution of scores for the hypothetical test is grossly non-normal, with a truncated lower tail and significant positive skew, indicating floor effects and the existence of two distinct subpopulations. If the distribution were normal (i.e., if we follow the normal curve superimposed on the histogram in Figure 1-4, instead of the histogram itself), a raw score of 40 would correspond to a T score of 40, a score that is 1 SD or 10 points from the


mean, and translate to the 16th percentile (percentile not shown in the graph). However, when we calculate a percentile for the actual score distribution (i.e., the histogram), a score of 40 is actually below the 1st percentile with respect to the observed sample distribution (percentile = 0.8). Clearly, the difference in percentiles in this example is not trivial and has significant implications for score interpretation.
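This kind of divergence can be demonstrated by comparing a normal-theory percentile with the percentile actually observed in a non-normal sample. The sample below is invented for the sketch (it is not the actual Figure 1-4 data), but it reproduces the same contrast: roughly the 16th percentile under normality versus below the 1st percentile empirically.

```python
from math import erf, sqrt

def normal_theory_percentile(x, mean, sd):
    """Percentile implied by assuming the scores are normally distributed."""
    z = (x - mean) / sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def empirical_percentile(x, sample):
    """Percent of observed scores at or below x."""
    return 100 * sum(1 for s in sample if s <= x) / len(sample)

# An illustrative non-normal sample of 1000 scores: truncated lower
# tail plus a small, high-scoring second subgroup
sample = [40] * 5 + [45] * 250 + [50] * 400 + [55] * 250 + [75] * 95

theory = normal_theory_percentile(40, 50, 10)  # ~16th percentile if normal
observed = empirical_percentile(40, sample)    # 0.5, i.e., below the 1st
```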

Normalizing Test Scores

\Vhen confronted "vilh problematic score distributions, mall}"lest dcve10pers emplo}" "normalizing" Ir,lllsformalions in analtempl to correct depiHtures from normalit}" (cxamplcs ofthis can be fouod thwugholll this volume, in lhe NormruíwJJalll sCClíoo for tests reviewed). Allhough hc1pful, these pro-cedurcs are b}"no means a panace<l, as lhe}" often inlroduceprobkms of Iheir own with respecl lo inlcrpre\<llion. i\ddi-lionalll', tTlanl' lesl manuais contain only a cursor}" discussionof nnrmalizalion (jf lesl scorcs. i\naslasi and Urbin,l (1997)statc that scores should onl)' bc normalized if: (I) Ihel' comefrom a largc and represcnlalive samplc, or (2) any devialionfrom normalitl' arises from ddecls in lhe lesl rather thancharactcrislies of lhe sample. Fllrthermore, as we have nOledabove, it is prderable lo adjusI score distributions prior 10normalizalion by ll10difying tesl conlent (e.g., by ad(ling orll1odifl'ing ilems) ralher than slalislical1y transforming non-normal scores inlo a normal dislribution. i\lthough a detai1cddiscllssion of normali/.ation procedures is beyond lhe scopt.'of this chapler (interested readcrs arc refcrred lo Anaslasi &Urbina, 1997), ideall}', test makers should dcscribc in delailthe nalure of any significant samplc Ilon-norm<llity ,md lheprocedures useJ lo correcl it for derivalion of standardizedscores. The reasons for correction should ,liso be justified, anddirecl percentile conversions uased on thc uncorrecte(l samplcdislribution should be provided as im 0plion for users, Dc-spile the limitalions inherenl in correcting for non-normalily,Anaslasi and Urbina (1997) note th,l[ most tesl developcrswill probably continue lO do so beca use of lhe necd to usc Icslscorcs in statistical analyses Ihal <lssume normality (lf dislri-butions. From a prattlcal poinl of view, test users should bcaware of lhe Illathclllalical compulalions <lnd Iransforma-lions involved in deriving scorcs for Iheir inslruments. 
When all other things are equal, test users should choose tests that provide information on score distributions and any procedures that were undertaken to correct non-normality, over those that provide partial or no information.
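As a concrete illustration of what a normalizing transformation does, the sketch below applies a simple rank-based inverse-normal transform. This is one of several possible approaches; actual publishers' procedures vary and are often more involved:

```python
from statistics import NormalDist

def rank_normalize(raw_scores):
    """Map each raw score's midrank proportion to a standard-normal z score."""
    n = len(raw_scores)
    z_for = {}
    for x in set(raw_scores):
        below = sum(s < x for s in raw_scores)
        tied = sum(s == x for s in raw_scores)
        p = (below + 0.5 * tied) / n   # midrank proportion, never exactly 0 or 1
        z_for[x] = NormalDist().inv_cdf(p)
    return [z_for[x] for x in raw_scores]

# A positively skewed raw-score sample and its normalized z scores.
raw = [1, 2, 2, 3, 3, 3, 4, 10]
z = rank_normalize(raw)
```

Note that the transform preserves rank order but discards the shape of the raw distribution, which is precisely why direct percentile conversions from the uncorrected sample should also be reported.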

Extrapolation/Interpolation

Despite all the best efforts, there are times when norms fall short in terms of range or cell size. This includes missing data in some cells, inconsistent age coverage, or inadequate demographic composition of some cells compared to the population. In these cases, data are often extrapolated or interpolated using the existing score distribution and techniques such as


multiple regression. For example, Heaton and colleagues have published sets of norms that use multiple regression to correct for demographic characteristics and compensate for few subjects in some cells (Heaton et al., 2003). Although multiple regression is robust to slight violations of assumptions, estimation errors may occur when using normative data that violate the assumptions of homoscedasticity (uniform variance across the range of scores) and normal distribution of scores necessary for multiple regression (Fastenau & Adams, 1996; Heaton et al., 1996). Age extrapolations beyond the bounds of the actual ages of

the individuals in the samples are also sometimes seen in normative data sets, based on projected developmental curves. These norms should be used with caution due to the lack of actual data points in these age ranges. Extrapolation methods, such as those that employ regression techniques, depend on the shape of the distribution of scores. Including only a subset of the distribution of age scores in the regression (e.g., by omitting very young or very old individuals) may change the projected developmental curve of certain tests dramatically. Tests that appear to have linear relationships, when considered only in adulthood, may actually have highly positively skewed binomial functions when the entire age range is considered. One example is vocabulary, which tends to increase exponentially during the preschool years, shows a slower rate of progress during early adulthood, remains relatively stable with continued gradual increase, and then shows a minor decrease with advancing age. If only a subset of the age range (e.g., adults) is used to estimate performance at the tail ends of the distribution (e.g., preschoolers and elderly), the estimation will not fit the shape of the actual distribution. Thus, normalization may introduce error when the relationship between a test and a demographic variable is nonlinear. In this case, linear correction using multiple regression distorts the true relationship between variables (Fastenau, 1998).
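The hazard of extrapolating beyond the sampled ages can be sketched numerically. Below, a straight line fitted only to the adult portion of a hypothetical, logarithmically decelerating growth curve badly overestimates performance at preschool ages (the curve and all numbers are invented for illustration):

```python
import math

# Hypothetical developmental curve: rapid early growth, then deceleration.
ages = range(3, 81)
curve = {a: 40 * math.log(a) for a in ages}

# Least-squares line fitted to the adult range only (ages 20-60).
adult = [(a, curve[a]) for a in ages if 20 <= a <= 60]
n = len(adult)
mean_a = sum(a for a, _ in adult) / n
mean_s = sum(s for _, s in adult) / n
slope = (sum((a - mean_a) * (s - mean_s) for a, s in adult)
         / sum((a - mean_a) ** 2 for a, _ in adult))
intercept = mean_s - slope * mean_a

# The line fits well inside the adult range, but extrapolating it
# down to age 4 badly overshoots the actual curve value.
predicted_age4 = intercept + slope * 4
actual_age4 = curve[4]
```

The fit is accurate where data were sampled and wrong by a wide margin outside that range, which is the general failure mode described above.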

MEASUREMENT PRECISION: RELIABILITY AND STANDARD ERROR

Like all forms of measurement, psychological tests are not perfectly precise; rather, test scores must be seen as estimates of abilities or functions, each associated with some degree of measurement error. Each test differs in the precision of the scores that it produces. Of critical importance is the fact that no test has only one specific level of precision. Rather, precision always varies to some degree, and potentially substantially, across different populations and test-use settings. Therefore, estimates of measurement error relevant to specific testing circumstances are a prerequisite for correct interpretation. For example, even the most precise test may produce highly imprecise results if administered in a nonstandard fashion, in a nonoptimal environment, or to an uncooperative examinee. Aside from these obvious caveats, a few basic


A Compendium of Neuropsychological Tests

The Special Case of Speed Tests

Table 1-2 Sources of Error Variance in Relation to Reliability Coefficients

Tests involving speed, where the score exclusively depends on the number of items completed within a time limit rather than the number correct, will cause spuriously high internal reliability estimates if internal reliability indices such as split-half reliability are used. For example, dividing the items into two halves to calculate a split-half reliability coefficient will yield two half-tests with 100% item completion rates, whether the individual obtained a score of 4 (i.e., yielding two half-tests totaling 2 points each, or perfect agreement) or 44 (i.e., yielding two half-tests both totaling 22 points, also yielding perfect agreement). The result in both cases is a split-half reliability of 1.00 (Anastasi & Urbina, 1997). Some alternatives are to use test-retest reliability or alternate form reliability, ideally with the alternate forms administered in immediate succession to avoid time sampling error. Reliabilities can also

Type of Reliability Coefficient      Error Variance
Split-half                           Content sampling
Kuder-Richardson                     Content sampling
Coefficient alpha                    Content sampling
Test-retest                          Time sampling
Alternate-form (immediate)           Content sampling
Alternate-form (delayed)             Content sampling and time sampling
Interrater                           Interscorer differences

or the correlation between test scores and true scores. This is why it is used for estimating true scores and associated standard errors (Nunnally & Bernstein, 1994). All things being equal, longer tests will generally yield higher reliability estimates (Sattler, 2001). Internal reliability is usually assessed with some measure of the average correlation among items within a test (Nunnally & Bernstein, 1994). These include the split-half or Spearman-Brown reliability coefficient (obtained by correlating two halves of items from the same test) and coefficient alpha, which provides a general estimate of reliability based on all the possible ways of splitting test items. Alpha is essentially based on the average intercorrelation between test items and any other set of items, and is used for tests with items that yield more than two response types (i.e., possible scores of 0, 1, or 2). For additional useful references concerning alpha, see Cronbach (2004) and Streiner (2003a, 2003b). The Kuder-Richardson reliability coefficient is used for items with yes/no answers or heterogeneous tests where split-half methods must be used (i.e., the mean of all the different split-half coefficients if the test were split in all possible ways). Generally, Kuder-Richardson coefficients will be lower than split-half coefficients when tests are heterogeneous in terms of content (Anastasi & Urbina, 1997).
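These internal consistency indices can be computed directly. The sketch below implements an odd-even split-half coefficient with the Spearman-Brown step-up, plus coefficient alpha, for a small invented item-response matrix (rows are examinees, columns are items scored 0 to 2):

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def split_half_sb(data):
    """Odd-even split-half r, stepped up with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in data]
    even = [sum(row[1::2]) for row in data]
    n = len(data)
    mo, me = sum(odd) / n, sum(even) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(odd, even)) / n
    r_half = cov / (variance(odd) * variance(even)) ** 0.5
    return 2 * r_half / (1 + r_half)   # full-length reliability estimate

def coefficient_alpha(data):
    """Cronbach's alpha from item variances and total-score variance."""
    k = len(data[0])
    item_vars = sum(variance([row[i] for row in data]) for i in range(k))
    total_var = variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Invented responses: 6 examinees x 4 items, each scored 0, 1, or 2.
data = [
    [2, 2, 2, 1],
    [2, 1, 2, 2],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [2, 2, 1, 2],
    [0, 0, 1, 0],
]
alpha = coefficient_alpha(data)   # roughly .86
sb = split_half_sb(data)          # roughly .79
```

The two estimates differ because a single odd-even split samples only one of the many possible halvings that alpha averages over.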

Internal reliability reflects the extent to which items within a test measure the same cognitive domain or construct. It is a core index in classical test theory. A measure of the intercorrelation of items, internal reliability is an estimate of the correlation between randomly parallel test forms, and by extension,

Reliability refers to the consistency of measurement of a given test and can be defined in several ways, including consistency within itself (internal consistency reliability), consistency over time (test-retest reliability), consistency across alternate forms (alternate form reliability), and consistency across raters (interrater reliability). Indices of reliability indicate the degree to which a test is free from measurement error (or the proportion of variance in observed scores attributable to variance in true scores). The interpretation of such indices is often not so straightforward.

It is important to note that the term "error" in this context does not actually refer to "incorrect" or "wrong" information. Rather, "error" consists of the multiple sources of variability that affect test scores. What may be termed error variance in one application may be considered part of the true score in another, depending on the construct being measured (state or trait), the nature of the test employed, and whether it is deemed relevant or irrelevant to the purpose of the testing (Anastasi & Urbina, 1997). An example relevant to neuropsychology is that internal reliability coefficients tend to be smaller at either end of the age continuum. This finding has been attributed to both limitations of tests (e.g., measurement error) and increased intrinsic performance variability among very young and very old examinees.

Definition of Reliability

principles help in determining whether a test generally provides precise measurements in most situations where it will be used. We begin with an overview of the related concepts of reliability, true scores, obtained scores, the various estimates of measurement error, and the notion of confidence intervals. These are reviewed below.

Internal Reliability

Factors Affecting Reliability

Reliability coefficients are influenced by (a) test characteristics (e.g., length, item type, item homogeneity, and influence of guessing) and (b) sample characteristics (e.g., sample size, range, and variability). The extent of a test's "clarity" is intimately related to its reliability: reliable measures typically have (a) clearly written items, (b) easily understood test instructions, (c) standardized administration conditions, (d) explicit scoring rules that minimize subjectivity, and (e) a process for training raters to a performance criterion (Nunnally & Bernstein, 1994). For a list of commonly used reliability coefficients and their associated sources of error variance, see Table 1-2.



Source: from Lineweaver & Chelune, 2003. Reprinted with permission from Elsevier.

Table 1-3 Common Sources of Bias and Error in Test-Retest Situations

may or may not be considered sources of measurement error. Apart from these variables, one must consider, and possibly parse out, effects of prior exposure, which are often conceptualized as involving implicit or explicit learning. Hence the term practice effects is often used. However, prior exposure to a test does not necessarily lead to increased performance at retest. Note also that the actual nature of the test may sometimes change with exposure. For instance, tests that rely on a "novelty effect" and/or require deduction of a strategy or problem solving (e.g., WCST, Tower of London) may not be conducted in the same way once the examinee has prior familiarity with the testing paradigm.

Like some measures of problem-solving abilities, measures of learning and memory are also highly susceptible to practice effects, though these are less likely to reflect a fundamental change in how examinees approach tasks. In either case, practice effects may lead to low test-retest correlations by effectively lowering the ceiling at retest, resulting in a restriction of range (i.e., many examinees obtain scores at or near the maximum possible at retest). Nevertheless, restriction of range should not be assumed when test-retest correlations are low until this has been verified by inspection of data.

The relationship between prior exposure and test stability coefficients is complex, and although test-retest coefficients may be affected by practice or prior exposure, the coefficient does not indicate the magnitude of such effects. That is, retest correlations will be very high when individual retest scores all change by a similar amount, whether the practice effect is nil or very large. When stability coefficients are low, then there may be (1) no systematic effects of prior exposure, (2) the relation

be calculated for any test that can be divided into specific time intervals; scores per interval can then be compared in a procedure akin to the split-half method, as long as items are of relatively equivalent difficulty (Anastasi & Urbina, 1997). For most of the speed tests reviewed in this volume, reliability is estimated by using the test-retest reliability coefficient, or else by a generalizability coefficient (see below).
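The split-half artifact for pure speed tests is easy to reproduce: if every attempted item is correct, odd- and even-item half-scores are nearly identical for every examinee no matter how much examinees differ from one another. A minimal sketch with invented completion counts:

```python
def half_scores(items_completed):
    """Odd- and even-item half scores when all attempted items are correct."""
    return (items_completed + 1) // 2, items_completed // 2

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

completed = [4, 44, 30, 12, 58, 21]   # hypothetical completion counts
odd, even = zip(*(half_scores(c) for c in completed))
r_split = pearson_r(odd, even)        # ~1.0 regardless of true precision
```

The near-perfect coefficient says nothing about measurement precision here; it is a byproduct of how the halves are formed.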

Test-Retest Reliability

Test-retest reliability, also known as temporal stability, provides an estimate of the correlation between two test scores from the same test administered at two different points in time. A test with good temporal stability should show little change over time, providing that the trait being measured is stable and there are no differential effects of prior exposure. It is important to note that tests measuring dynamic (i.e., changeable) abilities will by definition produce lower test-retest reliabilities than tests measuring domains that are more trait-like and stable (Nunnally & Bernstein, 1994). See Table 1-3 for common sources of bias and error in test-retest situations.

A test has an infinite number of possible test-retest reliabilities, depending on the length of the time interval between testings. In some cases, reliability estimates are inversely related to the time interval between baseline and retest (Anastasi & Urbina, 1997). In other words, the shorter the time interval between test and retest, the higher the reliability coefficient will be. However, the extent to which the time interval affects the test-retest coefficient will depend on the type of ability evaluated (i.e., stable versus more variable). Reliability may also depend on the type of individual being assessed, as some groups are intrinsically more variable over time than others. For example, the extent to which scores fluctuate over time may depend on subject characteristics, including age (e.g., normal preschoolers will show more variability than adults) and neurological status (e.g., TBI examinees' scores may vary more in the acute state than in the post-acute state). Ideally, reliability estimates should be provided for both normal individuals and the clinical populations in which the test is intended to be used, and the specific demographic characteristics of the samples should be fully specified. Test stability coefficients presented in published test manuals are usually derived from relatively small normal samples tested over much shorter intervals than are typical for retesting in clinical practice and should therefore be considered with due caution when drawing inferences regarding clinical cases. However, there is some evidence that duration of interval has less of an impact on test-retest scores than subject characteristics (Dikmen et al., 1999).

Prior Exposure and Practice Effects

Variability in scores on the same test over time may be related to situational variables such as examinee state, examiner state, examiner identity (same versus different examiner at retest), or environmental conditions that are often unsystematic and

Bias
  Intervening variables: events of interest (e.g., surgery, medical intervention, rehabilitation); extraneous events
  Practice effects: memory for content; procedural learning; other factors, including (a) familiarity with testing context and examiner and (b) performance anxiety
  Demographic considerations: age (maturational effects and aging); education; gender; ethnicity; baseline ability

Error
  Statistical errors: measurement error (SEM); regression to the mean (SEe)
  Random or uncontrolled events



of prior exposure may be nonlinear, or (3) ceiling effects/restriction of range related to prior exposure may be attenuating the coefficient. For example, certain subgroups may benefit more from prior exposure to test material than others (e.g., high-IQ individuals; Rapport et al., 1998), or some subgroups may demonstrate more stable scores or consistent practice effects than do others. This causes the score distribution to change at retest (effectively "shuffling" the individuals' rankings in the distribution), which will attenuate the correlation. In these cases, the test-retest correlation may vary significantly across subgroups, and the correlation for the entire sample will not be the best estimate of reliability for any of the subgroups, overestimating reliability for some and underestimating reliability for others. In some cases, practice effects, as long as they are relatively systematic and accurately assessed, will not render a test unusable from a reliability perspective, though they should always be taken into account when retest scores are interpreted. In addition, individual factors must always be considered. For example, while improved performance may usually be expected with a particular measure, an individual examinee may approach tests that he or she had difficulty with previously with heightened anxiety that leads to decreased performance. Lastly, it must be kept in mind that factors other than prior exposure (e.g., changes in environment or examinee state) may affect test-retest reliability.
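The "shuffling" idea can be shown in a few lines: a uniform practice gain leaves the retest correlation at 1.0, while differential gains that reorder examinees attenuate it (all scores invented for illustration):

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

baseline = [80, 90, 100, 110, 120, 130]

# Everyone gains exactly 10 points: rankings preserved, r stays 1.0.
uniform = [s + 10 for s in baseline]

# Gains differ across examinees and reorder some of them: r is attenuated,
# even though every individual score improved or held steady.
differential = [105, 108, 100, 130, 122, 130]

r_uniform = pearson_r(baseline, uniform)
r_differential = pearson_r(baseline, differential)
```

This is also why a high stability coefficient can coexist with large mean practice gains: the coefficient tracks rank order, not score change.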

Alternate Forms Reliability

Some investigators advocate the use of alternate forms to eliminate the confounding effects of practice when a test must be administered more than once (e.g., Anastasi & Urbina, 1997). However, this practice introduces a second form of error variance into the mix (i.e., content sampling error), in addition to the time sampling error inherent in test-retest paradigms (see Table 1-3; see also Lineweaver & Chelune, 2003). Thus, tests with alternate forms must have extremely high correlations between forms, in addition to high test-retest reliability, to confer any advantage over using the same form administered twice. Moreover, they must demonstrate equivalence in terms of mean scores from test to retest, as well as consistency in score classification within individuals from test to retest. Furthermore, alternate forms do not necessarily eliminate effects of prior exposure, as exposure to stimuli and procedures can confer some positive carry-over effect (e.g., procedural learning) despite the use of a different set of items. These effects may be minimal across some types of well-constructed parallel forms, such as those assessing acquired knowledge. For measures such as the WCST, where specific learning and problem solving are involved, it may be difficult or impossible to produce an equivalent alternate form that will be free of effects of prior exposure to the original form. While it is possible to attain this degree of psychometric sophistication through careful item analysis, reliability studies, and administration to a representative normative group, it is rare for alternate forms to be constructed with the same psychometric rigor as were the original forms from which they

were derived. Even well-constructed alternate forms often lack crucial validation evidence, such as correlations to criterion measures similar to those of the original test form. This is especially true for older neuropsychological tests, particularly those with original forms that were never subjected to any item analysis or reliability studies whatsoever (e.g., BVRT). Inadequate test construction and psychometric properties are also found for alternate forms in more general published tests in common usage (e.g., WRAT-3). Because so few alternate forms are available, and few of those that are meet these psychometric standards, our tendency is to use reliable change indices or standardized regression-based scores for estimating change from test to retest.

Interrater Reliability

Most test manuals provide specific and detailed instructions on how to administer and score tests according to standard procedures, to minimize error variance due to different examiners and scorers. However, some degree of examiner variance remains in individually administered tests, particularly when scores involve a degree of judgment (e.g., multiple-response verbal tests such as the Wechsler Vocabulary scales, which require the rater to assign a score from 0 to 2). In this case, an estimate of the reliability of administration and scoring across examiners is needed. Interrater reliability can be evaluated using percentage

agreement, kappa, the product-moment correlation, and the intraclass correlation coefficient (Sattler, 2001). For any given test, Pearson correlations will provide an upper limit for the intraclass correlations, but intraclass correlations are preferred because, unlike Pearson's r, they distinguish paired assessments made by the same set of examiners from those made by different examiners. Thus, the intraclass correlation distinguishes sets of scores ranked in the same order with complete agreement from those that are ranked in the same order but have only low or moderate agreement with each other, and corrects for interexaminer or test-retest agreement expected by chance alone (Cicchetti & Sparrow, 1981). However, advantages of the Pearson correlation are that it is familiar, is readily interpretable, and can be easily compared using standard statistical techniques; it is best for evaluating consistency in ranking rather than agreement per se (Fastenau et al., 1996).
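The distinction matters in practice: two raters who differ by a constant correlate perfectly yet show poor absolute agreement. The sketch below computes Pearson's r and a two-way, single-rater, absolute-agreement intraclass correlation (the ICC(A,1) of McGraw and Wong, 1996) for invented ratings; it is a simplified two-rater implementation, not a general-purpose routine:

```python
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def icc_agreement(r1, r2):
    """Two-way, absolute-agreement, single-rater ICC for two raters."""
    n, k = len(r1), 2
    grand = (sum(r1) + sum(r2)) / (n * k)
    subj = [(a + b) / 2 for a, b in zip(r1, r2)]
    raters = [sum(r1) / n, sum(r2) / n]
    ms_rows = k * sum((m - grand) ** 2 for m in subj) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in raters) / (k - 1)
    ss_total = sum((x - grand) ** 2 for x in r1 + r2)
    ms_err = (ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rater_a = [1, 2, 3, 4, 5]
rater_b = [6, 7, 8, 9, 10]   # identical ranking, constant 5-point disagreement

r = pearson_r(rater_a, rater_b)        # 1.0: the ranking is identical
icc = icc_agreement(rater_a, rater_b)  # low: absolute agreement is poor
```

Pearson's r is blind to the constant offset, while the agreement-type ICC penalizes it, which is exactly the contrast described above.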

Generalizability Coefficients

One reliability coefficient type not covered in this list is the generalizability coefficient, which is starting to appear more frequently in test manuals, particularly in the larger test batteries (e.g., Wechsler scales and NEPSY). In generalizability theory, or G theory, reliability is evaluated by decomposing test score variance using the general linear model (e.g., variance components analysis). This is a variant of the mathematical methods used to apportion variance in general linear model analyses such as ANOVA. In the case of G theory, the between-groups variance is considered an estimate of a true

Page 11: Psycometrics in neuropsychological assesment

score variance and within-groups variance is considered an estimate of error variance. The generalizability coefficient is the ratio of estimated true variance to the sum of the estimated true variance and estimated error variance. A discussion of this flexible and powerful model is beyond the scope of this chapter, but detailed discussions can be found in Nunnally and Bernstein (1994) and Shavelson et al. (1989). Nunnally and Bernstein (1994) also discuss related issues pertinent to estimating reliabilities of variables reflecting sums, such as composite scores, and the fact that reliabilities of difference scores based on correlated measures can be very low.
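In the simplest one-facet design (examinees crossed with measurement occasions), that ratio can be computed from an ANOVA-style decomposition. A sketch with invented data follows; it mirrors the logic described above, not any particular manual's exact procedure:

```python
def g_coefficient(table):
    """table[person][occasion] -> G = true variance / (true + error variance)."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    means = [sum(row) / k for row in table]
    # Between-person mean square estimates true-score differences.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    # Within-person mean square estimates error across occasions.
    ms_within = sum((x - m) ** 2
                    for row, m in zip(table, means) for x in row) / (n * (k - 1))
    true_var = max((ms_between - ms_within) / k, 0.0)
    return true_var / (true_var + ms_within)

# Large between-person differences, small within-person noise: G near 1.
consistent = [[10, 11], [20, 21], [30, 29], [40, 41]]

# Within-person noise swamps between-person differences: G near 0.
noisy = [[10, 30], [20, 5], [15, 25], [22, 8]]
```

The same decomposition extends to more facets (raters, forms, occasions), which is what makes the framework attractive for large batteries.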

Evaluating a Test's Reliability

A test cannot be said to have a single or overall level of reliability. Rather, tests can be said to exhibit different kinds of reliability, the relative importance of which will vary depending on how the test is to be used. Moreover, each kind of reliability may vary across different populations. For instance, a test may be highly reliable in normally functioning adults, but be highly unreliable in young children or in individuals with neurological illness. It is important to note that while high reliability is a prerequisite for high validity, the latter does not follow automatically from the former. For example, height can be measured with great reliability, but it is not a valid index of intelligence. It is usually preferable to choose a test of slightly lesser reliability if it can be demonstrated that the test is associated with a meaningfully higher level of validity (Nunnally & Bernstein, 1994).

Some have argued that internal reliability is more important than other forms of reliability; thus, if alpha is low but test-retest reliability is high, a test should not be considered reliable (Nunnally, 1978, as cited by Cicchetti, 1989). Note that it is possible to have low alpha values and high test-retest reliability (if a measure is made up of heterogeneous items but yields the same responses at retest), or low alpha values but high interrater reliability (if the test is heterogeneous in item content but yields highly consistent scores across trained experts; an example would be a mental status examination). Internal consistency is therefore not necessarily the primary index of reliability, but should be evaluated within the broader context of test-retest and interrater reliability (Cicchetti, 1989).

Some argue that test-retest reliability is not as important as other forms of reliability if the test will only be used once and is not likely to be administered again in future. However, depending on the nature of the test and retest sampling procedures (as discussed previously), stability coefficients may provide valuable insight into the replicability of test results, particularly as these coefficients are a gauge of "real-world" reliability rather than accuracy of measurement of true scores or hypothetical reliability across infinite randomly parallel forms (as is internal reliability). In addition, as was stated previously, clinical decision making will almost always be based on the obtained score. Therefore, it is critically important to

Psychometrics in Neuropsychological Assessment 13

know the degree to which scores are replicable at retesting, whether or not the test may be used again in future.

It is our belief that test users should take an informed and pragmatic, rather than dogmatic, approach to evaluating the reliability of tests used to inform diagnosis or other clinical decisions. If a test has been designed to measure a single, one-dimensional construct, then high internal consistency reliability should be considered an essential property. High test-retest reliability should also be considered an essential property, unless the test is designed to measure state variables that are expected to fluctuate, or if systematic factors such as practice effects attenuate stability coefficients.

What Is an Adequate Reliability Coefficient?

The reliability coefficient can be interpreted directly in terms of the percentage of score variance attributed to different sources (i.e., unlike the correlation coefficient, which must be squared). Thus, with a reliability of .85, 85% of the variance can be attributed to the trait being measured, and 15% can be attributed to error variance (Anastasi & Urbina, 1997). When all sources of variance are known for the same group (i.e., when one knows the reliability coefficients for internal, test-retest, alternate form, and interrater reliability on the same sample), it is possible to calculate the true score variance (for an example, see Anastasi & Urbina, 1997, pp. 101-102). As noted above, although a detailed discussion of this topic is beyond the scope of this volume, the partitioning of total score variance into components is the crux of the generalizability theory of reliability, which forms the basis for reliability estimates for many well-known speed tests (e.g., Wechsler scale subtests such as Digit Symbol).

Sattler (2001) notes that reliabilities of .80 or higher are needed for tests used in individual assessment. Tests used for decision making should have reliabilities of .90 or above. Nunnally and Bernstein (1994) note that a reliability of .90 is a "bare minimum" for tests used to make important decisions about individuals (e.g., IQ tests), and .95 should be the optimal standard. When important decisions will be based on test scores (e.g., placement into special education), small score differences can make a great difference to outcome, and precision is paramount. They note that even with a reliability of .90, the SEM is almost one-third as large as the overall SD of test scores.
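The "one-third as large" figure follows from the standard formula SEM = SD * sqrt(1 - r), shown below for IQ-style standard scores (mean 100, SD 15):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Even at r = .90, the SEM is almost one-third of the score SD:
sem_90 = sem(15, 0.90)   # ~4.74 standard-score points
sem_95 = sem(15, 0.95)   # ~3.35 points
ratio = sem(1, 0.90)     # ~0.32 of an SD
```

Raising reliability from .90 to .95 cuts the SEM by only about 1.4 standard-score points, which is why the marginal gains from very high reliability still matter for high-stakes decisions.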

Given these issues, what is a clinically acceptable level of reliability? According to Sattler (2001), tests with reliabilities below .60 are unreliable; those above .60 are marginally reliable, and those above .70 are relatively reliable. Of note, tests with reliabilities of .70 may be sufficient in the early stages of validation research to determine whether the test correlates with other validation evidence; if so, additional effort can be expended to increase reliabilities to more acceptable levels (e.g., .80) by reducing measurement error (Nunnally & Bernstein, 1994). In outcome studies using psychological tests, internal consistencies of .80 to .90 and test-retest reliabilities of .70 are considered a minimum acceptable standard (Andrews et al., 1994; Burlingame et al., 1995).



Table 1-4 Magnitude of Reliability Coefficients

Magnitude of Coefficient
  Very high (.90+)
  High (.80-.89)
  Adequate (.70-.79)
  Marginal (.60-.69)
  Low (.59 or below)

In terms of internal reliability of neuropsychological tests, Cicchetti et al. (1990) have proposed that internal consistency estimates of less than .70 are unacceptable, reliabilities between .70 and .79 are fair, reliabilities between .80 and .89 are good, and reliabilities above .90 are excellent.

For interrater reliabilities, Cicchetti and Sparrow (1981) report that clinical significance is poor for reliability coefficients below .40, fair between .40 and .59, good between .60 and .74, and excellent between .75 and 1.00. Fastenau et al. (1996), in summarizing guidelines on the interpretation of intraclass correlations and kappa coefficients for interrater reliability, consider coefficients larger than .60 as substantial and of .75 or .80 as almost perfect.

These are the general guidelines that we have used throughout the text to evaluate the reliability of neuropsychological tests (see Table 1-4), so that the text can be used as a reference when selecting tests with the highest reliability. Users should note that there is a great deal of variability with regard to the acceptability of reliability coefficients for neuropsychological tests, as perusal of this volume will indicate. In general, for tests involving multiple subtests and multiple scores (e.g., Wechsler scales, NEPSY, D-KEFS), including those derived from qualitative observations of performance (e.g., error analyses), the farther away a score gets from the composite score itself and the more difficult the score is to quantify, the lower the reliability. A quick review of the reliability data presented in this volume also indicates that verbal tests, with few exceptions, tend to have consistently higher reliability than tests measuring other cognitive domains.

Lastly, as previously discussed, reliability coefficients do not provide complete information on the reproducibility of individual test scores. Thus, with regard to test-retest reliability, it is possible for a test to have high reliability (r = .80) but have retest means that are 10 points higher than baseline scores. Reliability coefficients also do not provide information on whether individuals retain their relative place in the distribution from baseline to retest. Procedures such as the Bland-Altman method (Altman & Bland, 1983; Bland & Altman, 1986) are one way to determine the limits of agreement between two assessments for individuals in a group.
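A minimal sketch of the Bland-Altman approach for two testing occasions (all scores invented): the bias is the mean retest-minus-baseline difference, and the limits of agreement are the bias plus or minus 1.96 SDs of the differences.

```python
import statistics

def bland_altman(first, second):
    """Return (lower limit, bias, upper limit) for paired measurements."""
    diffs = [b - a for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias - 1.96 * sd, bias, bias + 1.96 * sd

baseline = [95, 102, 88, 110, 100, 97]
retest = [101, 109, 93, 118, 104, 105]   # systematic gain at retest

lower, bias, upper = bland_altman(baseline, retest)
# bias of roughly 6.3 points, with limits of agreement near 3.1 and 9.5
```

Here the entire interval lies above zero, flagging a systematic retest gain that a high test-retest correlation alone would never reveal.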

MEASUREMENT ERROR

A good working understanding of conceptual issues and methods of quantifying measurement error is essential for competent clinical practice. We start our discussion of this topic with concepts arising from classical test theory.

True Scores

A central element of classical test theory is the concept of a true score, or the score an examinee would obtain on a measure in the absence of any measurement error (Lord & Novick, 1968). True scores can never be known. Instead, they are estimated, and are conceptually defined as the mean score an examinee would obtain across an infinite number of randomly parallel forms of a test, assuming that the examinee's scores were not systematically affected by test exposure/practice or other time-related factors such as maturation (Lord & Novick, 1968). In contrast to true scores, obtained scores are the actual scores yielded by tests. Obtained scores include any measurement error associated with a given test. That is, they are the sum of true scores and error.

In the classical model, the relation between obtained and true scores is expressed in the following formula, where error (e) is random and all variables are assumed to be normal in distribution:

X = t + e    [3]

Where:

X = obtained score
t = true score
e = error

When test reliability is less than perfect, as is always the case, the net effect of measurement error across examinees is to bias obtained scores outward from the population mean. That is, scores above the mean are most likely to be higher than true scores, while those below the mean are most likely to be lower than true scores (Lord & Novick, 1968). Estimated true scores correct this bias by regressing obtained scores toward the normative mean, with the amount of regression depending on test reliability and the deviation of the obtained score from the mean. The formula for estimated true scores (t') is:

t' = M + r(X - M)    [4]

Where:

M = normative mean
r = test reliability coefficient
X = obtained score

Limits of Reliability

Although it is possible to have a reliable test that is not valid for some purposes, the converse is not the case (see later). Further, it is also conceivable that there are some neuropsychological domains that simply cannot be measured reliably. Thus, even though there is the assumption that questionable reliability is always a function of the test, reliability may depend on the nature of the psychological process measured or on the nature of the population evaluated. For example, many of the executive functioning tests reviewed in this volume have relatively modest reliabilities, suggesting that this ability is difficult to assess reliably. Additionally, tests used in populations with high response variability, such as preschoolers, elderly individuals, or individuals with brain disorders, may invariably yield low reliability coefficients despite the best efforts of test developers.


Psychometrics in Neuropsychological Assessment 15

t′ = X̄ + rxx(X − X̄)    [4]

Where:

X̄ = mean test score
rxx = test reliability (internal consistency reliability in classical test theory)
X = obtained score

If working with z scores, the formula is simpler:

t′ = rxx z    [5]

Formula 4 shows that an examinee's estimated true score is the sum of the mean score of the group to which he or she belongs (i.e., the normative sample) and the deviation of his or her obtained score from the normative mean weighted by test reliability (as derived from the same normative sample). Further, as test reliability approaches unity (i.e., r = 1.0), estimated true scores approach obtained scores (i.e., there is little measurement error, so estimated true scores and obtained scores are nearly equivalent). Conversely, as test reliability approaches zero (i.e., when a test is extremely unreliable and subject to excessive measurement error), estimated true scores approach the mean test score. That is, when a test is highly reliable, greater weight is given to obtained scores than to the normative mean score, but when a test is very unreliable, greater weight is given to the normative mean score than to obtained scores. Practically speaking, estimated true scores will always be closer to the mean than obtained scores are (except, of course, where the obtained score is at the mean).

The Use of True Scores in Clinical Practice

Although the true score model is abstract, it has practical utility and important implications for test score interpretation. For example, what may not be immediately obvious from formulas 4 and 5 is readily apparent in Table 1-5: estimated true scores translate test reliability (or lack thereof) into the same metric as actual test scores.

As can be seen in Table 1-5, the degree of regression to the mean of true scores is inversely related to test reliability and directly related to degree of deviation from the reference mean. This means that the more reliable a test is, the closer obtained scores are to true scores, and that the further away the obtained score is from the sample mean, the greater the discrepancy between true and obtained scores. For a highly reliable measure such as Test 1 (r = .95), true score regression is minimal, even when an obtained score lies a considerable distance from the sample mean; in this example, a standard score of 130, or two SDs above the mean, is associated with an estimated true score of 129. In contrast, for a test with low reliability such as Test 3 (r = .65), true score regression is quite substantial. For this test, an obtained score of 130 is associated with an estimated true score of 120; in this case, fully one-third of the observed deviation is "lost" to regression when the estimated true score is calculated.

Such information may have important implications with respect to interpretation of test results. For example, as shown in Table 1-5, as a result of differences in reliability, obtained scores of 120 on Test 1 and 130 on Test 3 are associated with essentially equivalent estimated true scores (i.e., 119 and 120, respectively). If only obtained scores are considered, one might interpret scores from Test 1 and Test 3 as significantly different, even though these "differences" actually disappear when measurement precision is taken into account. It should also be noted that such differences may not be limited to comparisons of scores across different tests within the same individual, but may also apply to comparisons between scores from the same test across different individuals when the individuals come from different groups and the test in question has variable reliability across those groups.

Regression to the mean may also manifest as pronounced asymmetry of confidence intervals centered on true scores, relative to obtained scores, as discussed in more detail later. Although calculation of true scores is encouraged as a means of gauging the limitations of reliability, it is important to consider that any significant difference between characteristics of an examinee and the sample from which a mean sample score and reliability estimate were derived may invalidate the process. For example, in some cases it makes little sense to estimate true scores for severely brain-injured individuals on tests of cognition using test parameters from healthy normative samples, as mean scores within the brain-injured population are likely to be substantially different from those seen in healthy normative samples; reliabilities may differ substantially as well. Instead, one may be justified in deriving estimated true scores using data from a comparable clinical sample if this is available. Overall, these issues underline the complexities inherent in comparing scores from different tests in different populations.
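The regression of estimated true scores toward the mean can be checked numerically. The sketch below applies formula 4 to the three hypothetical tests of Table 1-5 (M = 100); half-up rounding is an assumption made here to match the printed integer values.

```python
# Estimated true scores via formula 4: t' = mean + r_xx * (X - mean),
# reproducing the Table 1-5 values for three levels of reliability.
import math

def estimated_true_score(x, mean=100.0, r_xx=0.80):
    """Regress an obtained score toward the normative mean (formula 4)."""
    return mean + r_xx * (x - mean)

def half_up(v):
    """Round half up (Python's round() uses banker's rounding instead)."""
    return math.floor(v + 0.5)

table = {r: [half_up(estimated_true_score(x, r_xx=r)) for x in (110, 120, 130)]
         for r in (0.95, 0.80, 0.65)}
# table[0.95] -> [110, 119, 129]
# table[0.80] -> [108, 116, 124]
# table[0.65] -> [107, 113, 120]
```

Note how the obtained score of 130 at r = .65 and 120 at r = .95 converge on nearly the same estimated true score, which is the interpretive point made above.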

Table 1-5 Estimated True Score Values for Three Observed Scores at Three Levels of Reliability

                         Observed Scores (M = 100, SD = 15)
             Reliability     110     120     130
Test 1          .95          110     119     129
Test 2          .80          108     116     124
Test 3          .65          107     113     120

The Standard Error of Measurement

16 A Compendium of Neuropsychological Tests

Examiners may wish to quantify the margin of error associated with using obtained scores as estimates of true scores. When the sample SD and the reliability of obtained scores are known, an estimate of the SD of obtained scores about true scores may be calculated. This value is known as the standard error of measurement, or SEM (Lord & Novick, 1968). More simply, the SEM provides an estimate of the amount of error in a person's observed score. It is a function of the reliability of the test, and of the variability of scores within the sample. The SEM is inversely related to the reliability of the test. Thus, the greater the reliability of the test is, the smaller the SEM is, and the more confidence the examiner can have in the precision of the score.

The SEM is defined by the following formula:

SEM = SD√(1 − rxx)    [6]

Where:

SD = the standard deviation of the test, as derived from an appropriate normative sample
rxx = the reliability coefficient of the test (usually internal reliability)
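Formula 6 is straightforward to compute; the sketch below uses the hypothetical Table 1-5 parameters (SD = 15) to show how the SEM grows as reliability drops.

```python
# SEM = SD * sqrt(1 - r_xx), formula 6, for the three tests of Table 1-5.
import math

def sem(sd, r_xx):
    """Standard error of measurement for a test with SD and reliability r_xx."""
    return sd * math.sqrt(1.0 - r_xx)

# Higher reliability -> smaller SEM -> more precise scores
sems = {r: round(sem(15.0, r), 2) for r in (0.95, 0.80, 0.65)}
# sems -> {0.95: 3.35, 0.8: 6.71, 0.65: 8.87}
```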

Confidence Intervals

While the SEM can be considered on its own as an index of test precision, it is not necessarily intuitively interpretable, and there is often a tendency to focus excessively on test scores as point estimates at the expense of consideration of associated estimation error ranges. Such a tendency to disregard imprecision is particularly inappropriate when interpreting scores from tests of lower reliability. Clinically, it may therefore be very important to report, in a concrete and easily understandable manner, the degree of precision associated with specific test scores. One method of doing this is to use confidence intervals.

The SEM is used to form a confidence interval (or range of scores) around estimated true scores, within which obtained scores are most likely to fall. The distribution of obtained scores about the true score (the error distribution) is assumed to be normal, with a mean of zero and an SD equal to the SEM; therefore, the bounds of confidence intervals can be set to include any desired range of probabilities by multiplying by the appropriate z value. Thus, if an individual were to take a large number of randomly parallel versions of a test, the resulting obtained scores would fall within an interval of ±1 SEM of the estimated true score 68% of the time, and within ±1.96 SEM 95% of the time (see Table 1-1).

Obviously, confidence intervals for unreliable tests (i.e., with a large SEM) will be larger than those for highly reliable tests. For example, we may again use data from Table 1-5. For a highly reliable test such as Test 1, a 95% confidence interval for an obtained score of 110 ranges from 103 to 116. In contrast, the confidence interval for Test 3, a less reliable test, is larger, ranging from 89 to 124.

It is important to bear in mind that confidence intervals for obtained scores that are based on the SEM are centered on estimated true scores. Such confidence intervals will be symmetric around obtained scores only when obtained scores are at the test mean or when reliability is perfect. Confidence intervals will be asymmetric about obtained scores to the same degree that true scores diverge from obtained scores. Therefore, when a test is highly reliable, the degree of asymmetry will often be trivial, particularly for obtained scores within one SD of the mean. For tests of lesser reliability, the asymmetry may be marked. For example, in Table 1-5, consider the obtained score of 130 on Test 2. The estimated true score in this case is 124 (see equations 4 and 5). Using equation 6 and a z-multiplier of 1.96, we find that a 95% confidence interval for the obtained scores spans ±13 points, or from 111 to 137. This confidence interval is substantially asymmetric about the obtained score.
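The Test 2 example above can be reproduced directly; the sketch combines formulas 4 and 6, with the Table 1-5 parameters (M = 100, SD = 15, rxx = .80) supplied as defaults.

```python
# 95% confidence interval for an obtained score, centered on the
# estimated true score rather than on the obtained score itself.
import math

def sem_confidence_interval(x, mean=100.0, sd=15.0, r_xx=0.80, z=1.96):
    """Return (estimated true score, lower bound, upper bound)."""
    t_hat = mean + r_xx * (x - mean)             # formula 4
    half_width = z * sd * math.sqrt(1.0 - r_xx)  # z * SEM (formula 6)
    return t_hat, t_hat - half_width, t_hat + half_width

# Test 2 of Table 1-5: obtained 130 -> estimated true score 124,
# interval roughly 111 to 137, clearly asymmetric about the obtained 130
t_hat, lo, hi = sem_confidence_interval(130)
```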

It is also important to note that SEM-based confidence intervals should not be used for estimating the likelihood of obtaining a given score at retesting with the same measure, as effects of prior exposure are not accounted for. In addition, Nunnally and Bernstein (1994) point out that use of SEM-based confidence intervals assumes that error distributions are normally distributed and homoscedastic (i.e., equal in spread) across the range of scores obtainable for a given test. However, this assumption may often be violated. A number of alternate error models do not require these assumptions and may thus be more appropriate in some circumstances (see Nunnally and Bernstein, 1994, for a detailed discussion).

Lastly, as with the derivation of estimated true scores, when an examinee is known to belong to a group that markedly differs from the normative sample, it may not be appropriate to derive SEMs and associated confidence intervals using normative sample parameters (i.e., SD and rxx), as these would likely differ significantly from parameters derived from an applicable clinical sample.

The Standard Error of Estimation

In addition to estimating confidence intervals for obtained scores, one may also be interested in estimating confidence intervals for estimated true scores (i.e., the likely range of true scores about the estimated true score). For this purpose, one may construct confidence intervals using the standard error of estimation (SEE; Lord & Novick, 1968). The formula for this is:

SEE = SD√(rxx(1 − rxx))    [7]

Where:

SD = the standard deviation of the variable being estimated
rxx = the test reliability coefficient

The SEE, like the SEM, is an indication of test precision. As with the SEM, confidence intervals are formed around estimated true scores by multiplying the SEE by a desired z value. That is, one would expect that over a large number of randomly parallel versions of a test, an individual's true score would fall within an interval of ±1 SEE of the estimated true score 68% of the time, and fall within 1.96 SEE 95% of the time. As with confidence intervals based on the SEM, those based on the SEE will usually not be symmetric around obtained scores. All of the other caveats detailed previously regarding SEM-based confidence intervals also apply.
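Formula 7 can be sketched alongside the SEM for comparison. Since the formula implies SEE = SEM × √rxx, the SEE is always the smaller of the two for any reliability below 1.0; the values below use the hypothetical SD = 15 and rxx = .80 from Table 1-5.

```python
# SEM (formula 6) versus SEE (formula 7) for the same test parameters.
import math

def sem(sd, r_xx):
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - r_xx)

def see(sd, r_xx):
    """Standard error of estimation: SD * sqrt(r_xx * (1 - r_xx))."""
    return sd * math.sqrt(r_xx * (1.0 - r_xx))

sd, r = 15.0, 0.80
# see(15, .80) = 15 * sqrt(.16) = 6.0, versus sem(15, .80) ~ 6.71
```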

The choice of constructing confidence intervals based on the SEM versus the SEE will depend on whether one is more interested in true scores or obtained scores. That is, while the SEM is a gauge of test accuracy in that it is used to determine the expected range of obtained scores about true scores over parallel assessments (the range of error in measurement of the true score), the SEE is a gauge of estimation accuracy in that it is used to determine the likely range within which true scores fall (the range of error of estimation of the true score). Regardless, both SEM-based and SEE-based confidence intervals are symmetric with respect to estimated true scores rather than the obtained scores, and the boundaries of both will be similar for any given level of confidence interval when a test is highly reliable.

The Standard Error of Prediction

When the standard deviation of obtained scores for an alternate form is known, one may calculate the likely range of obtained scores expected on retesting with an alternate form. For this purpose, the standard error of prediction (SEP; Lord & Novick, 1968) may be used to construct confidence intervals. The formula for this is:

SEP = SDy√(1 − rxx²)    [8]

Where:

SDy = the standard deviation of the parallel form administered at retest
rxx = the reliability of the form used at initial testing

In this case, confidence intervals are formed around estimated true scores (derived from initial obtained scores) by multiplying the SEP by a desired z value. That is, one would expect that when retested over a large number of randomly parallel versions of a test, an individual's obtained score would fall within an interval of ±1 SEP of the estimated true score 68% of the time, and fall within 1.96 SEP 95% of the time. As with confidence intervals based on the SEM, those based on the SEP will generally not be symmetric around obtained scores. All of the other caveats detailed previously regarding SEM-based confidence intervals also apply. In addition, while it may be tempting to use SEP-based confidence intervals for evaluating significance of change at retesting with the same measure, this practice violates the assumptions that a parallel form is used at retest and, particularly, that no prior exposure effects apply.
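Formula 8 can be sketched in the same style; SDy = 15 for a hypothetical parallel form and rxx = .80 for the initial form are illustrative values, not data from any real test.

```python
# SEP = SD_y * sqrt(1 - r_xx**2), formula 8, for a hypothetical parallel form.
import math

def sep(sd_y, r_xx):
    """Standard error of prediction for retesting with a parallel form."""
    return sd_y * math.sqrt(1.0 - r_xx ** 2)

# sep(15, .80) = 15 * sqrt(1 - .64) = 15 * .6 = 9.0, larger than both the
# SEM (~6.71) and the SEE (6.0) for the same parameters, since prediction
# of a future obtained score carries error from both testing occasions
value = sep(15.0, 0.80)
```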

SEMs and True Scores: Practical Issues

Nunnally and Bernstein (1994) note that most test manuals do "an exceptionally poor job of reporting estimated true scores and confidence intervals for expected obtained scores on alternative forms. For example, intervals are often erroneously centered about obtained scores rather than estimated true scores. Often the topic is not even discussed" (p. 260). Sattler (2001) also notes that test manuals often base confidence intervals on the overall SEM for the entire standardization sample, rather than on SEMs for each age band. Using the average SEM across age is not always appropriate, given that some age groups are inherently more variable than others (e.g., preschoolers versus adults). In general, confidence intervals based on age-specific SEMs are preferable to those based on the overall SEM (particularly at the extremes of the age distribution, where there is the most variability) and can often be constructed using age-based SEMs found in most manuals.

It is important to acknowledge that while estimated true scores and associated confidence intervals have merit, there are practical reasons to focus on obtained scores instead. For example, essentially all validity studies and actuarial prediction methods for most tests are based on obtained scores. Therefore, obtained scores must usually be employed for diagnostic and other purposes to maintain consistency with prior research and test usage. For more discussion regarding the calculation and uses of the SEM, SEE, SEP, and alternative error models, see Dudek (1979), Lord and Novick (1968), and Nunnally and Bernstein (1994).

VALIDITY

Models of validity are not abstract conceptual frameworks that are only minimally related to neuropsychological practice. The Standards for Educational and Psychological Testing (AERA et al., 1999) state that validation is the joint responsibility of the test developer and the test user. Thus, a working knowledge of validity models and the validity characteristics of specific tests is a central requirement for responsible and competent test use. From a practical perspective, a working knowledge of validity allows users to determine which tests are appropriate for use and which fall below standards for clinical practice or research utility. Thus, neuropsychologists who use tests to detect and diagnose neurocognitive difficulties should be thoroughly familiar with commonly used validity models and how these can be used to evaluate neuropsychological tools. Assuming that a test is valid because it was purchased from a reputable test publisher, appears to have a large normative sample, or came with a large user's manual can be a serious error, as some well-known and commonly used neuropsychological tests are lacking with regard to crucial aspects of validity.

Definition of Validity

Cronbach and Meehl (1955) were some of the first theorists to discuss the concept of construct validity. Since then, the basic definition of validity evolved as testing needs changed over the years. Although construct validity was first introduced as a separate type of validity (e.g., Anastasi & Urbina, 1997), it has moved, in some models, to encompass all types of validity (e.g., Messick, 1993). In other models, the term "construct validity" has been deemed redundant and has simply been replaced by "validity," since all types of validity ultimately inform as to the construct measured by the test. Accordingly, the term "construct validity" has not been used in the Standards for Educational and Psychological Testing since 1974 (AERA et al., 1999). However, whether it is deemed "construct validity" or simply "validity," the concept is central to evaluating the utility of a test in the clinical or research arena.

Test validity may be defined at the most basic level as the degree to which a test actually measures what it is intended to measure, or in the words of Nunnally and Bernstein (1994), "how well it measures what it purports to measure in the context in which it is to be applied" (p. 112). As with reliability, an important point to be made here is that a test cannot be said to have one single level of validity. Rather, it can be said to exhibit various types and levels of validity across a spectrum of usage and populations. That is, validity is not a property of a test; rather, validity is a property of the meaning attached to a test score, and it can only arise and be defined in the specific context of test usage. Therefore, while it is certainly necessary to understand the validity of tests in particular contexts, ultimate decisions regarding the validity of test score interpretation must take into account any unique factors pertaining to validity at the level of individual assessment, such as deviations from standard administration, unusual testing environments, examinee cooperation, and the like.

In the past, assessment of validity was generally test-centric. That is, test validity was largely indexed by comparison with other tests, especially "standards" in the field. Since Cronbach (1971), there has been a move away from test-based or "measure-centered validity" (Zimiles, 1996) toward the interpretation and external utility of tests. Messick (1989, 1993) expanded the definition of validity to encompass an overall judgment of the extent to which empirical evidence and theoretical rationales support the adequacy and effectiveness of interpretations and actions resulting from test scores. Subsequently, Messick (1995) proposed a comprehensive model of construct validity wherein six different, distinguishable types of evidence contribute to construct validity. These are (1) content related, (2) substantive, (3) structural, (4) generalizability, (5) external, and (6) consequential evidence sources (see Table 1-6), and they form the "evidential basis for score interpretation" (Messick, 1995, p. 743). Likewise, the Standards for Educational and Psychological Testing (AERA et al., 1999) follows a model very much like Messick's, where different kinds of evidence are used to bolster test validity based on each of the following sources: (1) evidence based on test content, (2) response processes, (3) internal structure, (4) relations to other variables, and (5) consequences of testing. The most controversial aspect of these models is the requirement for consequential evidence to support validity. Some argue that judging validity according to whether use of a test results in positive or negative social consequences is too far-reaching and may lead to abuses of scientific inquiry, as when a test result does not agree with the overriding social climate of the time (Lees-Haley, 1996). Social and ethical consequences, although crucial, may therefore need to be treated separately from validity (Anastasi & Urbina, 1997).

Table 1-6 Messick's Model of Construct Validity

Type of Evidence     Definition

Content              Relevance, representativeness, and technical quality of test content
Substantive          Theoretical rationales for the test and test responses
Structural           Fidelity of scoring structure to the structure of the construct measured by the test
Generalizability     Scores and interpretations generalize across groups, settings, and tasks
External             Convergent and divergent validity, criterion relevance, and applied utility
Consequential        Actual and potential consequences of test use, relating to sources of invalidity related to bias, fairness, and distributive justice (a)

(a) See Lees-Haley (1996) for limitations of this construct.

Validity Models

Since Cronbach and Meehl, various models of validity have been proposed. The most frequently encountered is the tripartite model whereby validity is divided into three components: content validity, criterion-related validity, and construct validity (see Anastasi & Urbina, 1997; Mitrushina et al., 2005; Nunnally & Bernstein, 1994; Sattler, 2001). Other validity subtypes, including convergent, divergent, predictive, treatment, clinical, and face validity, are subsumed within these three domains. For example, convergent and divergent validity are most often treated as subsets of construct validity (Sattler, 2001) and concurrent and predictive validity as subsets of criterion validity (e.g., Mitrushina et al., 2005). Concurrent and predictive validity only differ in terms of a temporal gradient; concurrent validity is relevant for tests used to identify existing diagnoses or conditions, whereas predictive validity applies when determining whether a test predicts future outcomes (Anastasi & Urbina, 1997). Although face validity appears to have fallen out of favor as a type of validity, the extent to which examinees believe a test measures what it appears to measure can affect motivation, self-disclosure, and effort. Consequently, face validity can be seen as a moderator variable affecting concurrent and predictive validity that can be operationalized and measured (Bornstein, 1996; Nevo, 1985). Again, all these labels for distinct categories of validity are ways of providing different types of evidence for validity and are not, in and of themselves, different types of validity, as older sources might claim (AERA et al., 1999; Yun & Ulrich, 2002). Lastly, validity is a matter of degree rather than an all-or-none property; validity is therefore never actually "finalized," since tests must be continually reevaluated as populations and testing contexts change over time (Nunnally & Bernstein, 1994).

How to Evaluate the Validity of a Test

Pragmatically speaking, all the theoretical models in the world will be of no utility to the practicing clinician unless they can be translated into specific, step-by-step procedures for evaluating a test's validity. Table 1-7 presents a comprehensive (but not exhaustive) list of specific features users can look for when evaluating a test and reviewing test manuals. Each is organized according to the type of validity evidence provided. For example, construct validity can be assessed via correlations with other tests, factor analysis, internal consistency (e.g., subtest intercorrelations), convergent and discriminant validation (e.g., multitrait-multimethod matrix), experimental interventions (e.g., sensitivity to treatment), structural equation modeling, and response processes (e.g., task decomposition, protocol analysis; Anastasi & Urbina, 1997). Most importantly, users should also remember that even if all other conditions are met, a test cannot be considered valid if it is not reliable (see previous discussion).

It is important to note that not all tests will have sufficient evidence to satisfy all aspects of validity, but test users should have a sufficiently broad knowledge of neuropsychological tools to be able to select one test over another, based on the quality of the validation evidence available. In essence, we have used this model to critically evaluate all the tests reviewed in this volume.

Note that there is a certain degree of overlap between categories in Table 1-7. For example, correlations between a specific test and another test measuring IQ can simultaneously provide criterion-related evidence and construct-related evidence of validity. Regardless of the terminology, it is important to understand how specific techniques such as factor analysis serve to inform the validity of test interpretation across the range of settings in which neuropsychologists work.

What Is an Adequate Validity Coefficient?

Some investigators have proposed criteria for evaluating evidence related to criterion validity in outcome assessments. For instance, Andrews et al. (1994) and Burlingame et al. (1995) recommend that a minimum level of acceptability for correlations involving criterion validity is .50. However, Nunnally and Bernstein (1994) note that validity coefficients rarely exceed .30 or .40 in most circumstances involving psychological tests, given the complexities involved in measuring and predicting human behavior. There are no hard and fast rules when evaluating evidence supportive of validity, and interpretation should consider how the test results will be used. Thus, tests with even quite modest predictive validities (r = .50) may be of considerable utility, depending on the circumstances in which they will be used (Anastasi & Urbina, 1997; Nunnally & Bernstein, 1994), particularly if they serve to significantly increase the test's "hit rate" over chance. It is also important to note that in some circumstances, criterion validity may be measured in a categorical rather than continuous fashion, such as when test scores are used to inform binary diagnoses (e.g., demented versus not demented). In these cases, one would likely be more interested in indices such as predictive power than other measures of criterion validity (see below for a discussion of classification accuracy statistics).

Table 1-7 Sources of Evidence and Techniques for Critically Evaluating the Validity of Neuropsychological Tests

Type of Evidence     Required Evidence

Content-related      Refers to themes, wording, format, tasks, or questions on a test, and administration and scoring
                     Description of theoretical model on which test is based
                     Review of literature with supporting evidence
                     Definition of domain of interest (e.g., literature review, theoretical reasoning)
                     Operationalization of definition through thorough and systematic review of test domain from which items are to be sampled, with listing of sources (e.g., word frequency sources for vocabulary tests)
                     Collection of sample of items large enough to be representative of domain and with sufficient range of difficulty for target population
                     Selection of panel of judges for expert review, based on specific selection criteria (e.g., academic and practical backgrounds or expertise within specific subdomains)
                     Evaluation of items by expert panel based on specific criteria concerning accuracy and relevance
                     Resolution of judgment conflicts within panel for items lacking cross-panel agreement (e.g., empirical means such as Index of Item Congruence; Hambleton, 1980)

Construct-related    Formal definition of construct
                     Formulation of hypotheses to measure construct
                     Gathering empirical evidence of construct validation
                     Evaluating psychometric properties of instrument (i.e., reliability)
                     Demonstration of test's sensitivity to developmental changes, correlation with other tests, group differences studies, factor analysis, internal consistency (e.g., correlations between subtests, or to composites within the same test), convergent and divergent validation (e.g., multitrait-multimethod matrix), sensitivity to experimental manipulation (e.g., treatment sensitivity), structural equation modeling, and analysis of process variables underlying test performance

Criterion-related    Identification of appropriate criterion
                     Identification of relevant sample group reflecting the entire population of interest; if only a subgroup is examined, then generalization must remain within subgroup definition (e.g., keeping in mind potential sources of error such as restriction of range)
                     Analysis of test-criterion relationships through empirical means such as contrasting groups, correlations with previously available tests, classification accuracy statistics (e.g., positive predictive power), outcome studies, and meta-analysis

Response processes   Determining whether performance on the test actually relates to the domain being measured
                     Analysis of individual responses to determine the processes underlying performance (e.g., questioning test takers about strategy, analysis of test performance with regard to other variables, determining whether the test measures the same construct in different populations, such as age)

Source: Adapted from Anastasi & Urbina, 1997; American Educational Research Association et al., 1999; Messick, 1995; and Yun and Ulrich, 2002.

USE OF TESTS IN THE CONTEXT OF SCREENING AND DIAGNOSIS: CLASSIFICATION ACCURACY STATISTICS

In some cases, clinicians use tests to measure how much of an attribute (e.g., intelligence) an examinee has, while in other cases, tests are used to help determine whether or not an examinee has a specific attribute, condition, or illness that may be either present or absent (e.g., Alzheimer's disease). In the latter case, a special distinction in test use may be made. Screening tests are those which are broadly or routinely used to detect a specific attribute, often referred to as a condition of interest, or COI, among persons who are not "symptomatic" but who may nonetheless have the COI (Streiner, 2003e). Diagnostic tests are used to assist in ruling in or out a specific condition in persons who present with "symptoms" that suggest the diagnosis in question. Another related use of tests is for purposes of prediction of outcome. As with screening and diagnostic tests, the outcome of interest may be defined in binary terms: it will either occur or not occur (e.g., return to the same type and level of employment). Thus, in all three cases, clinicians will be interested in the relation of the measure's distribution of scores to an attribute or outcome that is defined in binary terms.

Typically, data concerning screening or diagnostic accuracy are obtained by administering a test to a sample of persons who are also classified, with respect to the COI, by a so-called gold standard. Those who have the condition according to the gold standard are labeled COI+, while those who do not have the condition are labeled COI-. In medicine, the gold standard is often a highly accurate diagnostic test that is more expensive and/or has a higher level of associated risk of morbidity than some new diagnostic method that is being evaluated for use as a screening measure or as a possible replacement for the existing gold standard. In neuropsychology, the situation is often more complex, as the COI may be a psychological construct (e.g., malingering) for which consensus with respect to fundamental definitions is lacking, or diagnostic gold standards may not exist. These issues may be less problematic when tests are used to predict outcome (e.g., return to work), though other problems that may afflict outcome data, such as intervening variables and sample attrition, may complicate interpretation of predictive accuracy.

The simplest way to relate test results to binary diagnoses or outcomes is to utilize a cutoff score. This is a single point along the continuum of possible scores for a given test. Scores at or above the cutoff classify examinees as belonging to one of two groups; scores below the cutoff classify examinees as belonging to the other group. Those who have the COI according to the test are labeled Test Positive (Test+), while those who do not have the COI are labeled Test Negative (Test-).

Table 1-8 shows the relation between examinee classifications based on test results versus classifications based on a gold standard measure. By convention, test classification is denoted by row membership and gold standard classification is denoted by column membership. Cell values represent the total number of persons from the sample falling into each of four possible outcomes with respect to agreement between a test and the respective gold standard. By convention, agreements between gold standard and test classifications are referred to as True Positive and True Negative cases, while disagreements are referred to as False Positive and False Negative cases, with positive and negative referring to the presence or absence of a COI as per classification by the gold standard. When considering outcome data, observed outcome is substituted for the gold standard. It is important to keep in mind while reading the following section that while gold standard measures are often implicitly treated as 100% accurate, this may not always be the case. Any limitations in accuracy or applicability of a gold standard or outcome measure need to be accounted for when interpreting classification accuracy statistics.

Table 1-8 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome

                                 Gold Standard
Test Result      COI+                  COI-                  Row Total
Test+            A (True Positive)     B (False Positive)    A + B
Test-            C (False Negative)    D (True Negative)     C + D
Column Total     A + C                 B + D                 N = A + B + C + D
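The four cells of Table 1-8 can be tallied directly from paired test and gold-standard classifications. A minimal sketch in Python; the example labels below are invented, not from the text:

```python
# Sketch: tallying the four cells of Table 1-8 from paired classifications.

def crosstab(test_pos, gold_pos):
    """Return (A, B, C, D): true pos., false pos., false neg., true neg."""
    a = b = c = d = 0
    for t, g in zip(test_pos, gold_pos):
        if t and g:
            a += 1          # True Positive
        elif t and not g:
            b += 1          # False Positive
        elif not t and g:
            c += 1          # False Negative
        else:
            d += 1          # True Negative
    return a, b, c, d

# Hypothetical examinees: (test result, gold-standard status)
test = [True, True, False, False, True, False]
gold = [True, False, True, False, True, False]
print(crosstab(test, gold))  # (2, 1, 1, 2)
```

Row and column sums of the returned counts reproduce the marginal totals shown in Table 1-8.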


Test Accuracy and Efficiency

The general accuracy of a test with respect to a specific COI is reflected by the data in the columns of a classification accuracy table (Streiner, 2003c). The column-based indices include Sensitivity, Specificity, and the Positive and Negative Likelihood Ratios (LR+ and LR-). The formulas for calculating the column-based classification accuracy statistics from the data in Table 1-8 are given below:

Sensitivity = A/(A + C)  [9]

Specificity = D/(D + B)  [10]

LR+ = Sensitivity/(1 - Specificity)  [11]

LR- = Specificity/(1 - Sensitivity)  [12]

Sensitivity is defined as the proportion of COI+ examinees who are correctly classified as such by a test. Specificity is defined as the proportion of COI- examinees who are correctly classified as such by a test. The Positive Likelihood Ratio (LR+) combines sensitivity and specificity into a single index of overall test accuracy indicating the odds (likelihood) that a positive test result has come from a COI+ examinee. For example, a likelihood ratio of 3.0 may be interpreted as indicating that a positive test result is three times as likely to have come from a COI+ examinee as from a COI- one. The LR- is interpreted conversely to the LR+. As the LR approaches 1, test classification approximates random assignment of examinees. That is, a person who is Test+ is equally likely to be COI+ or COI-. Given that they are derived from an adequate normative sample, Sensitivity, Specificity, and LR+/- are assumed to reflect fixed (i.e., constant) properties of a test that are applicable whenever the test is used within the normative population. For purposes of working examples, Table 1-9 presents hypothetical test and gold standard data.

Table 1-9 Classification/Prediction Accuracy of a Test in Relation to a "Gold Standard" or Actual Outcome: Hypothetical Data

                 Gold Standard
Test Result      COI+    COI-    Row Total
Test+            30      2       32
Test-            10      38      48
Column Total     40      40      N = 80

Using Equations 9 to 12 above, the hypothetical test demonstrates moderate Sensitivity (.75) and high Specificity (.95), with an LR+ of 15 and an LR- of 3.8. Thus, for the hypothetical measure, a positive result is 15 times more likely to be obtained by an examinee who has the COI than by one who does not, while a negative result is 3.8 times more likely to be obtained by an examinee who does not have the COI than by one who does.

Note that Sensitivity, Specificity, and LR+/- are parameter estimates that have associated errors of estimation that can be quantified. The magnitude of estimation error is inversely related to sample size and can be quite large when sample size is small. The formulae for calculating standard errors for Sensitivity, Specificity, and the LRs are complex and will not be presented here (see McKenzie et al., 1997). Fortunately, these values may also be easily calculated using a number of readily available computer programs. Using one of these (by Mackinnon, 2000) with data from Table 1-9, the 95% confidence interval for Sensitivity was found to be .59 to .87, while that for Specificity was .83 to .99; LR+ was 3.8 to 58.6, and LR- was 2.2 to 6.5. Clearly, the range of measurement error is not trivial for this hypothetical study. In addition to appreciating issues relating to estimation error, it is also important to understand that while column-based indices provide useful information about test validity and utility, a test may nevertheless have high sensitivity and specificity but still be of limited clinical value in some situations, as will be detailed below.

Predictive Power

As opposed to being concerned with test accuracy at the group level, clinicians are typically more concerned with test accuracy in the context of diagnosis and other decision making at the level of individual examinees. That is, clinicians wish to determine whether or not an individual examinee has a given COI. In this scenario, clinicians must consider indices derived from the data in the rows of a classification accuracy table (Streiner, 2003e). These row-based indices are Positive Predictive Power (PPP) and Negative Predictive Power (NPP).9 The formulas for calculating these from the data in Table 1-8 are given below:

PPP = A/(A + B)  [13]

NPP = D/(C + D)  [14]

Positive Predictive Power is defined as the probability that an individual with a positive test result has the COI. Conversely, Negative Predictive Power is defined as the probability that an individual with a negative test result does not have the COI. For example, predictive power estimates derived from the data presented in Table 1-9 indicate that PPP = .94 and NPP = .79; thus, in the hypothetical data set, 94% of persons who obtain a positive test result actually have the COI, while 79% of people who obtain a negative test result do not in fact have the COI. When predictive power is close to .50, examinees are approximately equally likely to be COI+ as COI-, regardless of whether they are Test+ or Test-. When predictive power is less than .50, test-based classifications or diagnoses will be incorrect more often than not.10

As with Sensitivity and Specificity, PPP and NPP are parameter estimates that should always be considered in the context of estimation error. Unfortunately, standard errors or confidence intervals for estimates of predictive power are rarely listed when these values are reported; clinicians are thus left to their own devices to calculate them. Fortunately, these values may be easily calculated using a number of computer programs. Using one of these (by Mackinnon, 2000) with data from Table 1-9, the 95% confidence intervals for PPP and NPP were found to be .84 to .99 and .65 to .90, respectively.


A Compendium of Neuropsychological Tests

Clearly, the CI range is not trivial for this small data set. Of critical importance to clinical interpretation of test scores, and unlike the column-based indices, PPP and NPP are not fixed properties of a test; they vary with the baserate or prevalence of a COI.

Sample vs. Actual Baserates and Relation to Predictive Power

The prevalence of a COI is defined with respect to Table 1-8 as:

Prevalence = (A + C)/N  [15]

As should be readily apparent from inspection of Table 1-9, the prevalence of the COI in the sample is 50 percent. Formulas for deriving predictive power for any level of sensitivity and specificity and a specified prevalence are given below:

PPP = (Prevalence x Sensitivity)/[(Prevalence x Sensitivity) + (1 - Prevalence) x (1 - Specificity)]  [16]

NPP = [(1 - Prevalence) x Specificity]/{[(1 - Prevalence) x Specificity] + [Prevalence x (1 - Sensitivity)]}  [17]

From inspection of these formulas, it should be apparent that regardless of sensitivity and specificity, predictive power will vary between 0 and 1 as a function of prevalence. Application of Formulas 16 and 17 to the data presented in Table 1-9 across the range of possible baserates provides the range of possible PPP and NPP values depicted in Figure 1-5.

Figure 1-5 Relation of predictive power to prevalence: hypothetical data (Sensitivity = .75, Specificity = .95; PPP and NPP plotted against base rates from 0 to 1).

As can be seen in Figure 1-5, the relation between predictive power and prevalence is curvilinear and asymptotic, with endpoints at 0 and 1. For any given test cutoff score, PPP will always increase with baserate, while NPP will simultaneously decrease. For the hypothetical test being considered, one can see that both PPP and NPP are moderately high (at or above .80) when the COI baserate ranges from 20% to 50%. The tradeoff between PPP and NPP at high and low baserate levels is also readily apparent; as the baserate increases above 50%, PPP exceeds .95, while NPP declines, falling below .50 as the baserate exceeds 80%. Conversely, as the baserate falls below 30%, NPP exceeds .95 while PPP rapidly drops off, falling below 50% as the baserate falls below 7%.

From the foregoing, it is apparent that the predictive power values derived from the data presented in Table 1-9 would not be applicable in settings where baserates vary from the 50% value in the hypothetical data set. This is important because, in practice, clinicians may often be presented with PPP values based on data where "prevalence" values are near 50%. This is due to the fact that, regardless of the prevalence of a COI in the population, diagnostic validity studies typically employ equal-sized samples of COI+ and COI- individuals to facilitate statistical analyses. In contrast, the actual prevalence of COIs in the population is rarely 50%. The actual prevalence of a COI and the PPP in some clinical settings may be substantially lower than that reported in validity studies, particularly when a test is used for screening purposes.

For example, suppose that the data from Table 1-9 were from a validity trial of a neuropsychological measure designed for administration to young children for purposes of predicting later development of schizophrenia. The question then arises: should the measure be used for broad screening given a lifetime schizophrenia prevalence of .008? Using Formula 16, one can determine that for this purpose the measure's PPP is only .11, and thus the "positive" test results would be incorrect 89% of the time.11 Conversely, the prevalence of a COI may in some settings be substantially higher than 50%. As an example of the other extreme, the baserate of head injuries among persons referred to a head-injury rehabilitation service based on documented evidence of a blow to the head leading to loss of consciousness is essentially 100%, in which case the use of neuropsychological tests to determine whether or not examinees had sustained a "head injury" would not only be redundant, but would very likely lead to false negative errors (such tests could of course be legitimately used for other purposes, such as grading injury severity). Clearly, clinicians need to carefully consider published data concerning sensitivity, specificity, and predictive power in light of intended test use and, if necessary, calculate PPP and NPP values and COI baserate estimates applicable to the specific groups of examinees seen in their own practices.

Difficulties with Estimating and Applying Baserates

Prevalence estimates for some COIs may be based on large-scale epidemiological studies that provide very accurate prevalence estimates for the general population or within specific subpopulations (e.g., the prevalence rates of various psychiatric disorders in inpatient psychiatric settings).
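Formulas 16 and 17 lend themselves to a direct sketch: given sensitivity, specificity, and a local baserate estimate, predictive power can be recomputed for a particular setting. The sensitivity and specificity below are the hypothetical test's (.75 and .95):

```python
# Formulas 16 and 17: predictive power as a function of prevalence.

def ppp(prev, sens, spec):
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

def npp(prev, sens, spec):
    return ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

# At the 50% baserate of the validity sample, PPP matches Formula 13:
print(round(ppp(0.50, 0.75, 0.95), 2))   # 0.94

# At a lifetime schizophrenia prevalence of .008 (the screening example),
# most positive results would be false positives:
print(round(ppp(0.008, 0.75, 0.95), 2))  # 0.11
print(round(npp(0.008, 0.75, 0.95), 3))  # 0.998
```

At very low baserates nearly everyone is COI-, so NPP is high almost by default while PPP collapses, which is the asymmetry the text describes.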


However, in some cases, no prevalence data may be available, or reported prevalence data may not be applicable to specific settings or subpopulations. In these cases, clinicians who wish to determine predictive power must develop their own baserate estimates. Ideally, these can be derived from data collected within the same setting in which the test will be employed, though this is typically time consuming and many methodological challenges may be faced, including limitations associated with small sample sizes. Methods for estimating baserates in such contexts are beyond the scope of this chapter; interested readers are directed to Mossman (2003), Pepe (2003), and Rorer and Dawes (1982).

Why Are Classification Accuracy Statistics Not Ubiquitous in Neuropsychological Research and Clinical Practice?

Of note, the mathematical relations between sensitivity, specificity, prevalence, and predictive power were first elucidated by Thomas Bayes and published in 1763; methods for deriving predictive power and other related indices of confidence in decision making are thus often referred to as Bayesian statistics.12 Needless to say, Bayes's work predated the first diagnostic applications of psychological tests as we know them today. However, although neuropsychological tests are routinely used for diagnostic decision making, information on the predictive power of most tests is often absent from both test manuals and the applicable research literature. This is so despite the fact that the importance and relevance of Bayesian approaches to the practice of clinical psychology was well described 50 years ago by Meehl and Rosen (1955), and has been periodically addressed since then (Willis, 1984; Elwood, 1993; Ivnik et al., 2001). Bayesian statistics are finally making major inroads into the mainstream of neuropsychology, particularly in the research literature concerning symptom validity measures, in which estimates of predictive power have become de rigueur, although these are still typically presented without associated standard errors, thus greatly reducing the utility of the data.

Determining the Optimum Cutoff Score: ROC Analyses and Other Methods

The foregoing discussion has focused on the diagnostic accuracy of tests using specific cutoff points, presumably ones that are optimum cutoffs for given tasks such as diagnosing dementia. A number of methods for determining an optimum cutoff point are available, and although they may lead to similar results, the differences between them are not trivial. Many of these methods are mathematically complex and/or computationally demanding, thus requiring computer applications.

The determination of an optimum cutoff score for detection or diagnosis of a COI is often based on simultaneous evaluation of sensitivity and specificity or predictive power across a range of scores. In some cases, this information, in tabular or graphical form, is simply inspected and a score is chosen based on a researcher's or clinician's comfort with a particular error rate. For example, in malingering research, cutoffs that minimize false-positive errors or hold them below a low threshold are often implicitly or explicitly chosen, even when such cutoffs are associated with relatively large false-negative error rates.

A more formal, rigorous, and often very useful set of tools for choosing cutoff points, for evaluating and comparing test utility for diagnosis and decision making, and for determining optimum cutoff scores falls under the rubric of Receiver Operating Characteristics (ROC) analyses. Clinicians who use tests for diagnostic or other decision-making purposes should be familiar with ROC procedures. The statistical procedures utilized in ROC analyses are closely related to and substantially overlap those of Bayesian analyses. The central graphic element of ROC analyses is the ROC graph, which is a plot of the true positive proportion (y axis) against the false positive proportion (x axis) associated with each specific score in a range of test scores. Figure 1-6 shows an example ROC graph. The area under the curve is equivalent to the overall accuracy of the test (proportion of the entire sample correctly classified), while the slope of the curve at any point is equivalent to the LR+ associated with a specific test score.

Figure 1-6 An ROC graph (true-positive probability plotted against false-positive probability).

A number of ROC methods have been developed for determining cutoff points that consider not only accuracy, but also allow for factoring in quantifiable or quasi-quantifiable costs and benefits, and the relative importance of specific costs and benefits associated with any given cutoff score. ROC methods may also be used to compare the diagnostic utility of two or more measures, which may be very useful for purposes of test selection. Although ROC methods can be very useful clinically, they have not yet made great inroads into the clinical neuropsychological literature. A detailed discussion of ROC methods is beyond the scope of this chapter; interested readers are referred to Mossman and Somoza (1992), Pepe (2003), Somoza and Mossman (1992), and Swets, Dawes, and Monahan (2000).
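A minimal sketch of how the points of an ROC graph and the area under the curve can be computed by sweeping a cutoff across observed scores; the scores and group labels below are invented for illustration:

```python
# Sketch of an ROC analysis: sweep cutoffs over hypothetical test scores to
# get plot-ready (FPR, TPR) pairs, then apply the trapezoidal rule for AUC.

def roc_points(scores, is_coi):
    cutoffs = sorted(set(scores), reverse=True)
    pos = sum(is_coi)
    neg = len(is_coi) - pos
    points = [(0.0, 0.0)]
    for cut in cutoffs:
        tp = sum(1 for s, g in zip(scores, is_coi) if s >= cut and g)
        fp = sum(1 for s, g in zip(scores, is_coi) if s >= cut and not g)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoidal rule over the (FPR, TPR) curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented scores; True = COI+, False = COI-
scores = [9, 8, 7, 6, 5, 4, 3, 2]
is_coi = [True, True, True, False, True, False, False, False]
pts = roc_points(scores, is_coi)
print(round(auc(pts), 2))  # 0.94
```

An AUC of .50 corresponds to the chance-level diagonal, while 1.0 corresponds to perfect separation of COI+ and COI- examinees.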

Evaluation of Predictive Power Across a Range of Cutoff Scores and Baserates

As noted above, it is important to recognize that positive and negative predictive power are not properties of tests, but rather are properties of specific test scores in specific contexts. The foregoing sections describing the calculation and interpretation of predictive power have focused on methods for evaluating the value of a single cutoff point for a given test for purposes of classifying examinees as COI+ or COI-. However, by focusing exclusively on single cutoff points, clinicians are essentially transforming continuous test scores into binary scores, thus discarding much potentially useful information, particularly when scores are considerably above or below a cutoff. Lindeboom (1989) proposed an alternative approach in which predictive power across a range of test scores and baserates can be displayed in a single Bayesian probability table. In this approach, test scores define the rows and baserates define the columns of a table; individual table cells contain the associated PPP and NPP for a specific score at a specific baserate. Such tables have rarely been constructed for standardized measures, but examples can be found in some test manuals (e.g., the Victoria Symptom Validity Test; Slick et al., 1997). The advantage of this approach is that it allows clinicians to consider the diagnostic confidence associated with an examinee's specific score, leading to more accurate assessments. A limiting factor for use of Bayesian probability tables is that they can only be constructed when sensitivity and specificity values for an entire range of scores are available, which is rarely the case for most tests. In addition, predictive power values in such tables are subject to any validity limitations of the underlying data, and should include associated standard errors or confidence intervals.
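A table of the kind Lindeboom proposed can be sketched by evaluating predictive power over a grid of scores and baserates. The per-score sensitivity/specificity pairs below are invented for illustration, since a real table requires such values across an entire score range:

```python
# Sketch of a Bayesian probability table: rows are cutoff scores, columns
# are baserates, cells are PPP (Formula 16). All values are hypothetical.

def ppp(prev, sens, spec):
    return (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

# score: (sensitivity, specificity) at that cutoff -- invented values
score_ops = {10: (0.95, 0.70), 12: (0.80, 0.90), 14: (0.60, 0.98)}
baserates = [0.10, 0.25, 0.50]

for score, (sens, spec) in sorted(score_ops.items()):
    row = [round(ppp(p, sens, spec), 2) for p in baserates]
    print(score, row)
```

Reading down a column shows how more extreme cutoffs raise diagnostic confidence in a positive result; reading across a row shows the strong dependence of that confidence on the baserate.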

Evaluating Predictive Power in the Context of Multiple Tests

Often more than one test that provides data relevant to a specific diagnosis is administered. In these cases, clinicians may wish to integrate predictive power estimates across measures. There may be a temptation to use the PPP associated with a score on one measure as the "baserate" when the PPP for a score from a second measure is calculated. For example, suppose that the baserate of a COI is 15%. When a test designed to detect the COI is administered, an examinee's score translates to a PPP of 65%. The examiner then administers a second test designed to detect the COI, but when PPP for the examinee's score on the second test is calculated, a "baserate" of 65% is used rather than 15%, as the former is now the assumed prior probability that the examinee has the COI, given their score on the first test administered. The resulting PPP for the examinee's score on the second measure is now 99%, and the examiner concludes that the examinee has the COI. While this procedure may seem logical, it will produce an inflated PPP estimate for the second test score whenever the two measures are correlated, which will almost always be the case when both measures are designed to screen for or diagnose the same COI. At present, there is no simple mathematical model that can be used to correct for the degree of correlation between measures so that they can be used in such an iterative manner; therefore this practice should be avoided.

A preferred psychometric method for integrating scores from multiple measures, which can only be used when normative data are available, is to construct optimum group membership (i.e., COI+ vs. COI-) prediction equations or classification rules using logistic regression or multiway frequency analyses, which can then be cross-validated and ideally distributed in an easy-to-use format such as software. More details on methods for combining information across measures may be found in Franklin (2003b) and Pepe (2003).
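As a rough, hypothetical illustration of such a prediction rule (not the procedure of any published study), the sketch below fits a two-predictor logistic model by stochastic gradient descent; in practice one would use established statistical software and cross-validate the resulting rule:

```python
# Toy sketch: combining two correlated measures into a single COI+ vs COI-
# classification rule via logistic regression. Data are invented.
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    w = [0.0, 0.0, 0.0]                    # intercept, weight 1, weight 2
    for _ in range(epochs):
        for (x1, x2), y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
            err = y - p                    # gradient of the log-likelihood
            w[0] += lr * err
            w[1] += lr * err * x1
            w[2] += lr * err * x2
    return w

def predict(w, x1, x2):
    return 1 / (1 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2))) >= 0.5

# Two (correlated) test scores per examinee; 1 = COI+, 0 = COI-
xs = [(0.2, 0.1), (0.4, 0.3), (0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.7, 0.9)]
ys = [0, 0, 1, 1, 0, 1]
w = fit_logistic(xs, ys)
print([predict(w, x1, x2) for x1, x2 in xs])
```

Because the model weights the two scores jointly, it sidesteps the inflated estimates produced by iteratively chaining PPP values from correlated tests.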

ASSESSING CHANGE OVER TIME

Neuropsychologists are often interested in and/or confronted with issues of change in function over time. In these contexts, three interrelated questions arise:

1. To what degree do changes in examinee test scores reflect "real" changes in function as opposed to measurement error?
2. To what degree do real changes in examinee test scores reflect clinically significant changes in function as opposed to clinically trivial changes?
3. To what degree do changes in examinee test scores conform to expectations, given the application of treatments or the occurrence of other events or processes occurring between test and retest, such as head injury, dementia, or brain surgery?

A number of statistical/psychometric methods have been developed for assessing changes observed over repeated administrations of neuropsychological tests; these differ considerably with respect to mathematical models and assumptions regarding the nature of test data. As with most areas of psychometrics, the problems and processes involved in decomposing observed scores (i.e., change scores) into measurement error and "true" scores are often complex. Clinicians are certainly not aided by the lack of agreement about which methods to use for analyzing test-retest data, the limited retest data for many tests, and the limited coverage and direction concerning retest procedures in most test manuals. Only a relatively brief discussion of this important area of psychometrics is presented here. Interested readers are referred to other sources (e.g., Chelune, 2003) for a more in-depth review.


Reference Group Change Score Distributions

If a reference or normative sample is administered a test twice, the distribution of observed change scores ("change score" = retest score minus baseline score) can be quantified. When such information is available, individual examinee change scores can be transformed into standardized change scores (e.g., percentiles), thus providing information on the degree of unusualness of any observed change in score. Unfortunately, it is rarely possible to use this method of evaluating change due to major limitations in most data available in test manuals. Retest samples tend to be relatively small for many tests, thus limiting generalizability. This is particularly important when change scores vary with demographic variables (e.g., age and level of education) and/or initial test score level (e.g., normal vs. abnormal), because retest samples typically are restricted with respect to both. Second, retest samples are often obtained within a short period of time after initial testing, typically less than two months, whereas in clinical practice typical test-retest intervals are often much longer. Thus any effects of extended test-retest intervals on change score distributions are not reflected in most change-score data presented in test manuals. Lastly, change score information is typically presented in the form of summary statistics (e.g., mean and SD) that have limited utility if change scores are not normally distributed (in which case percentile tables would be much preferable). As a result of these limitations, clinicians often must turn to other methods for analyzing change.

The Reliable Change Index (RCI)

Jacobson and Truax (1991; see also Jacobson et al., 1999) proposed a psychometric method for determining if changes in test scores over time are reliable (i.e., not an artefact of imperfect test reliability). This method involves calculation of a Reliable Change Index (RCI). The RCI is an indicator of the probability that an observed difference between two scores from the same examinee on the same test can be attributed to measurement error (i.e., to imperfect reliability). When there is a low probability that the observed change is due to measurement error, one may infer that it reflects other factors, such as progression of illness, treatment effects, and/or prior exposure to the test.

The RCI is calculated using the Standard Error of the Difference (SED), an index of measurement error derived from classical test theory. It is the standard deviation of expected test-retest difference scores about a mean of 0, given an assumption that no actual change has occurred.13 The formula for the SED is:

SED = √(2 × SEM²)  [18]

where SEM is the Standard Error of Measurement, as previously defined in Formula 6. Inspection of Formula 18 reveals that tests with a large SEM will have a large SED. The RCI for a specific score is calculated by dividing the observed amount of change by the SED, transforming observed change scores into SED units. The formula is given below:

RCI = (S2 - S1)/SED  [19]

Where:
S1 = an examinee's initial test score
S2 = an examinee's score at retest on the same measure

The resulting RCI scores can be either negative or positive and can be thought of as a type of z score that can be interpreted with reference to upper or lower tails of a normal probability distribution. Therefore, RCI scores falling outside a range of -1.96 to 1.96 would be expected to occur less than 5% of the time as a result of measurement error alone, assuming that an examinee's true retest score had not changed since the first test. The assumption that an examinee's true score has not changed can therefore be rejected at p < .05 (two-tailed) when his or her RCI score is above 1.96 or below -1.96.

The RCI is directly derived from classical test theory. Thus, internal-consistency reliability (Cronbach's α) is used to estimate measurement error rather than test-retest reliability, as the latter reflects not just test-intrinsic measurement error, but also any additional variation over time arising from real changes in function and the effects of other intervening variables. Thus, use of test-retest reliability introduces additional complexity into the meaning of the RCI.

The RCI is often calculated using SD (to calculate SEM) and reliability estimates obtained from test normative samples. However, as these values may not be applicable to the clinical group to which an examinee belongs, care must be taken in interpretation of the RCI in such circumstances. It may be preferable to use SD and reliability estimates from samples similar to an examinee, if these are available. Because the SED value is constant for any given combination of test and reference sample, it can be used to construct RCI confidence intervals applicable to any initial test score obtained from a person similar to the reference sample, using the formula below:

RCI CI = S1 ± (z × SED)  [20]

Where:
S1 = initial test score
z = z score associated with a given confidence range (e.g., 1.64 for a 90% C.I.)

Retest scores falling outside the desired confidence interval about initial scores can be considered evidence of a significant change. Note that while a "significant" RCI value may be considered a prerequisite, it is not by itself sufficient evidence that clinically significant change has occurred. Consider RCIs in the context of highly reliable tests; relatively small score changes at retest can produce significant RCIs, but both the initial test score and retest score may remain within the same classification range (e.g., normal), so that the clinical implications of the observed change may be minimal. In addition, use of


the RCI implicitly assumes that no practice effects pertain. When practice effects are present, significant RCI values may partially or wholly reflect effects of prior test exposure rather than a change in underlying functional level.

To allow RCIs to be used with tests that have practice effects, Chelune et al. (1993) suggest a modification to calculation of the RCI in which the mean change score for a reference group is subtracted from the observed change score of an individual examinee and the result is used as an Adjusted Change Score for purposes of calculating an Adjusted RCI. Alternatively, an RCI confidence interval calculated using Formula 20 could have its endpoints adjusted by addition of the mean change score:

Adj. RCI CI = (S1 + Mc) ± (z × SED)  [21]

Where:
S1 = initial test score
Mc = mean change score (retest - test)
z = z score associated with a given confidence range (e.g., 1.64 for a 90% C.I.)

This approach appears to offer some advantages over the traditional RCI, particularly for tests where large practice effects are expected. However, adjusting for practice in this way is problematic in a number of ways, first and foremost of which is the use of a constant term for the practice effect, which will not reflect any systematic variability in practice effects across individuals. Secondly, neither standard nor adjusted RCIs account for regression toward the mean, because the associated estimated measurement error is not adjusted proportionally for the extremity of the observed change.

Standardized Regression-Based Change Scores

The RCI may provide useful information regarding the likelihood of a meaningful change in the function being measured by a test, but as noted above, it may have limited validity in some circumstances. Many quantifiable factors not accounted for by the RCI may influence or predict retest scores, including test-retest interval, baseline ability level (Time 1 score), scores from other tests, and examinee characteristics such as gender, education, age, acculturation, and neurological or medical conditions. In addition, while RCI scores factor in test reliability, error is operationalized as a constant that does not account for regression to the mean (i.e., the increase in measurement error associated with more extreme scores). One method for evaluating change that does allow clinicians to account for additional predictors and also controls for regression to the mean is the use of linear regression models (Crawford & Howell, 1998; Hermann et al., 1991).14

With linear regression models, predicted retest scores are derived and then compared with observed retest scores for purposes of determining if deviations are "significant." In the preferred method, this is accomplished by dividing the difference between obtained retest scores and regression-predicted retest scores by the Standard Error for Individual Predicted Scores (SEp). Because score differences are divided by a standard error, the resulting value is considered to be standardized. The resulting standardized score is in fact a t statistic that can be translated into a probability value using an appropriate program or table. Small probability values indicate that the observed retest score differs significantly from the predicted value. The SEp is used because, unlike the Standard Error of the Regression, it is not constant across cases, but increases as individual values of independent variables deviate from the mean, thus accounting for regression to the mean on a case-by-case basis (Crawford & Howell, 1998). Thus, persons who are outliers with respect to their scores on predictor variables will have larger margins of error associated with their predicted scores, and thus larger changes in raw scores will be required to reach significance for these individuals.
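The logic of this method can be sketched for the simple one-predictor case (retest score regressed on baseline score). The reference data below are invented for illustration; Crawford and Howell (1998) describe the general procedure:

```python
import math

# Sketch of a standardized regression-based (SRB) change score with a
# single predictor (baseline score). The reference sample is hypothetical.

baseline = [85, 90, 95, 100, 105, 110, 115]
retest   = [88, 91, 97, 103, 106, 112, 118]
n = len(baseline)

# Ordinary least-squares fit of retest on baseline.
mx = sum(baseline) / n
my = sum(retest) / n
ssx = sum((x - mx) ** 2 for x in baseline)
b = sum((x - mx) * (y - my) for x, y in zip(baseline, retest)) / ssx
a = my - b * mx

# Standard error of estimate (n - 2 degrees of freedom).
resid_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(baseline, retest))
s_yx = math.sqrt(resid_ss / (n - 2))

def srb(obs_baseline, obs_retest):
    """t statistic for an individual's deviation from the predicted
    retest score, using the SE for an individual predicted score."""
    predicted = a + b * obs_baseline
    # The SE grows as the baseline departs from the reference mean,
    # which is what controls regression to the mean case by case.
    se_pred = s_yx * math.sqrt(1 + 1 / n + (obs_baseline - mx) ** 2 / ssx)
    return (obs_retest - predicted) / se_pred

print(round(srb(100, 95), 2))  # negative: retest lower than predicted
```

Note how the error term penalizes examinees whose baseline scores are outliers relative to the reference sample: the same raw deviation yields a smaller t statistic at extreme baselines than at the mean.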

As with other standardized scores (e.g., z scores), standardized regression-based change scores (SRB scores) from different measures can be directly compared, regardless of the original test score metric. However, a number of inferential limitations of such comparisons, described in the section on standardized scores earlier in this chapter, still apply. Regression models can also be used when one wishes to consider change scores from multiple tests simultaneously; these are more complex and will not be covered here (see McCleary et al., 1996).

As an example of the application of SRB scores, consider data on IQ in children with epilepsy reported by Sherman et al. (2003). They found that in samples of children with intractable epilepsy who were not treated surgically, FSIQ scores at retest could be predicted by baseline FSIQ and the number of anti-epileptic medications (AEDs) that the children were taking at baseline. The resulting Multiple R value was large (.92), indicating that the equation had acceptable predictive value. The resulting regression equation is given below:

FSIQ_retest = (0.965 × FSIQ_baseline) + (−4.519 × AEDs_baseline) + 7.358

It can be seen from inspection of this equation that predicted retest FSIQ values were positively related to baseline IQ and inversely related to the number of AEDs being taken at baseline. Therefore, for children who were not taking any AEDs at baseline, a modest increase in FSIQ at retest was expected, while for those taking one or more AEDs (a marker of epilepsy severity), IQs tended to decline over time. Given a baseline FSIQ of 100, the predicted FSIQs at retest for children taking 0, 1, 2, and 3 AEDs at baseline were 104, 99, 95, and 90, respectively. Using a program developed by Crawford and Howell (1998), Sherman et al. (2003) were able to determine which children in the surgery sample demonstrated unusual change, relative to expectations for children who did not receive surgery. For example, a child in the sample was taking 2 AEDs and had a FSIQ of 53 at baseline. The predicted retest IQ was thus 49, but the actual retest IQ following right anterior temporal lobectomy was 63. The observed change was 14 points higher than the predicted change; the associated p value was .039, and thus the child was classified as obtaining a significantly higher than predicted retest score. The inference in this case was that the better than expected FSIQ outcome was a positive effect of epilepsy surgery. Other examples of regression equations developed for specific neuropsychological tests are presented throughout this volume.
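The Sherman et al. (2003) equation quoted above can be applied directly; the script below reproduces the worked examples from the text:

```python
# The Sherman et al. (2003) regression equation for predicted retest
# FSIQ, applied to the worked examples discussed above.

def predicted_retest_fsiq(baseline_fsiq, n_aeds):
    """Predicted retest FSIQ from baseline FSIQ and the number of
    anti-epileptic drugs (AEDs) taken at baseline."""
    return 0.965 * baseline_fsiq + (-4.519) * n_aeds + 7.358

# Baseline FSIQ of 100 with 0-3 AEDs: 104, 99, 95, and 90, as in the text.
for aeds in range(4):
    print(aeds, round(predicted_retest_fsiq(100, aeds)))

# The child described above (baseline FSIQ 53, 2 AEDs): predicted 49,
# observed 63 after surgery, a 14-point positive deviation.
print(round(predicted_retest_fsiq(53, 2)))
```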

Limitations of Regression-Based Change Scores

It is important to understand the limitations of regression methods. Regression equations based on smaller sample sizes will lead to large error terms, so that meaningful predicted-obtained differences may be missed. Equations from large-scale studies or from cross-validation efforts are therefore preferred. In order to maximize utility, sample characteristics should match populations seen clinically, and predictor variables should be carefully chosen to match data that will likely be available to clinicians. Test users should generally avoid extrapolation; that is, they should avoid applying a regression equation to an examinee's data (predictor variables and test-retest scores) when the data values fall outside the ranges for corresponding variables comprising the regression equation. For example, if a regression equation is developed for predicting IQ at retest from a sample with initial IQ scores ranging from 85 to 125, it should not be applied to an examinee whose initial IQ is 65. Finally, SRB scores should only be derived and used when necessary assumptions concerning residuals are met (see Pedhazur, 1997, pp. 33-34).

It is critical to understand that SRB scores do not necessarily indicate whether a clinically significant change from baseline level has occurred, for which use of RCIs may be more appropriate. Instead, SRB scores are an index of the degree to which observed change conforms to established trends in a reference population. These trends may consist of increases or decreases in performance over time in association with combinations of influential predictor variables, such as type and severity of illness, treatment type, baseline cognitive level, gender, age, and test-retest interval. Expected trends may involve increased scores at retest for healthy individuals, but decreased scores for individuals with progressive neurological disease. The following two examples will illustrate this point.

In the first example, consider a hypothetical scenario of a treatment for depression that is associated with improved post-treatment scores on a depression inventory, such that in a clinical reference sample, the test-retest correlation is high and the average improvement in scores at retest exceeds the threshold for clinical significance as established by RCI. In the simplest case (i.e., using only scores from Time 1), regression-predicted retest scores would be equivalent to the mean score change observed in the clinical reference sample. In this case, an examinee who at retest obtained a depression score at or near the post-treatment mean would obtain a non-significant SRB score but a significant RCI score, indicating that they demonstrated the typically seen clinically significant improvement in response to treatment. Conversely, an examinee who obtained an unchanged depression score following treatment would obtain a significant SRB score but a non-significant RCI score, indicating that they did not show the typically seen significant improvement in response to treatment.

In the second example, consider a hypothetical scenario of a memory test that has significant prior-exposure (i.e., learning) effects, such that in the normative sample the test-retest correlation is high and the average improvement in scores at retest exceeds the threshold for clinical significance as established by RCI. As with the depression score example, in the simplest case (i.e., using only scores from Time 1), regression-predicted retest scores would be equivalent to the mean score change observed in the reference sample. In this case, an examinee who at retest obtained a memory score at or near the retest mean would obtain a non-significant SRB score but a significant RCI score, indicating that they demonstrated the typically seen prior exposure/learning effect (note the difference in interpretation from the previous example: the improvement in score is assumed to reflect treatment effects in the first case and to be artifactual in the second case). Conversely, an examinee who obtained an unchanged memory score at retest would obtain a significant SRB score but a non-significant RCI score, indicating that they did not show the typically seen prior exposure/learning effect. Conceivably, in the context of a clinical referral, the latter finding might be interpreted as reflective of memory problems (see Sawrie et al., 1996, and Temkin et al., 1999, for excellent examples of studies comparing the utility of RCIs and SRB scores in clinical samples).

Clinically Significant Change

Once a clinician has determined that an observed test score change is reliable, he or she will usually need to determine whether the amount of change is clinically meaningful. Jacobson and Truax (1991) proposed that clinically significant change occurs, in the context of treatment, when an examinee's score (e.g., on the Beck Depression Inventory) moves from within the clinical "depressed" range into the normal population range. However, this definition of clinically significant change is not always relevant to neuropsychological assessment. There are at present no widely accepted criteria for clinically significant change within the context of neuropsychological assessment. Rather, the determination of the clinical significance of any observed change that is reliable will depend greatly on the specific context of the assessment.
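A minimal sketch of the Jacobson and Truax (1991) logic described above, combining a reliability check with a range-crossing check. The cutoff, SED, and scoring direction (lower scores = less depressed) are hypothetical placeholders, not values from any actual inventory:

```python
# Hedged sketch: a change is treated as clinically significant when it
# is reliable (|RCI| exceeds the critical z) AND the retest score
# crosses from the clinical range into the normal range. All numeric
# inputs below are invented for illustration.

def clinically_significant_improvement(baseline, retest, cutoff, sed, z_crit=1.96):
    """True if the change is both reliable and crosses the cutoff
    separating the clinical from the normal range (lower = better
    on this hypothetical depression inventory)."""
    rci = (retest - baseline) / sed
    reliable = abs(rci) >= z_crit
    crossed = baseline > cutoff and retest <= cutoff
    return reliable and crossed

# Reliable improvement that also crosses into the normal range:
print(clinically_significant_improvement(baseline=30, retest=12, cutoff=14, sed=4.0))
```

A reliable improvement that leaves the examinee within the clinical range (e.g., retest of 16 against the same cutoff) would fail the second check, which is exactly the distinction the Jacobson-Truax criterion draws.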

NORMAL VARIATION

Ingraham and Aiken (1996) have noted that when clinicians attempt to interpret examinee profiles of scores across multiple tests they "confront the problem of determining how many deviant scores are necessary to diagnose a patient as abnormal or whether the configuration of scores is significantly different from an expected pattern" (p. 120). They further note that the likelihood that a profile of test scores will exceed criteria for abnormality increases as: (1) the number of tests in a battery increases; (2) the z score cutoff used to classify a test score as abnormal decreases; and (3) the number of abnormal test scores required to reach criteria decreases. Ingraham and Aiken (1996) developed a mathematical model that may be used for determining the likelihood of obtaining an abnormal test result from a given number of tests. Implicit in this model is an assumption that some "abnormal" test scores are spurious. As Ingraham and Aiken (1996) note, the problem of determining whether a profile of test scores meets criteria for abnormality is considerably complicated by the fact that most neuropsychological measures are intercorrelated and therefore the probabilities of obtaining abnormal results from each test are not independent. However, they provide some suggested guidelines for adapting their model or using other methods to provide useful approximations.
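The independence-based reasoning behind such models can be sketched with the binomial distribution. Because real batteries are intercorrelated, treating tests as independent is only an approximation, as Ingraham and Aiken note:

```python
from math import comb

# Sketch: with k independent tests and per-test probability p of an
# "abnormal" score in a healthy person, the chance of observing at
# least m abnormal scores follows the binomial distribution. Real
# batteries are intercorrelated, so this is an approximation only.

def p_at_least_m_abnormal(k, p, m):
    """P(at least m of k independent tests fall in the abnormal range)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(m, k + 1))

# With 20 tests and a z <= -1.645 cutoff (p of about .05 per test),
# the chance a healthy examinee shows at least one "abnormal" score:
print(round(p_at_least_m_abnormal(20, 0.05, 1), 3))
```

This illustrates points (1) through (3) above: the probability rises with the number of tests, with a more lenient cutoff (larger p), and with a smaller required count of abnormal scores.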

In a related vein of research, Schretlen et al. (2003), noting that little is known about what constitutes the normal range of intraindividual variation across cognitive domains and, by extension, associated test scores, evaluated normal variation in a sample of 197 healthy adults who were participants in a study on normal aging. For each individual, the Maximum Discrepancy, or MD (the absolute difference between standard scores from two measures expressed in units of standard deviation), across scores from 15 commonly used neuropsychological measures was calculated. The smallest MD value observed was 1.6 SD, while the largest was 6.1 SD. Two thirds of the sample obtained MD scores in excess of 3 SD, and when these were recalculated with highest and lowest scores omitted, 27% of the sample still obtained MD scores exceeding 3 SD. Schretlen et al. (2003) concluded from these data that "marked intraindividual variability is very common in normal adults, and underscores the need to base diagnostic inferences on clinically recognizable patterns rather than psychometric variability alone" (p. 864). While the number of "impaired" scores obtained by each healthy participant was not reported, 44% of the sample were found to have at least one test score more than 2 SD below their estimated IQ score. Similarly, Palmer et al. (1998) and Taylor and Heaton (2001) have reported that it is not uncommon for healthy people to show an isolated weakness in one test or area. These data are certainly provocative and strongly suggest that additional large-scale studies of normal variability and the prevalence of impaired-range scores among healthy persons are clearly warranted. Clinicians should always consider available data on normal variability (e.g., Index score discrepancy base rates for Wechsler Scales) when interpreting test scores. When these data are not available, mathematical models and research data suggest that a conservative approach to interpretation is warranted when considering a small number of score discrepancies or abnormal scores from a large test battery.
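The MD index itself is simple to compute for a profile of scores already expressed in standard-deviation units; the profile below is hypothetical:

```python
# Maximum Discrepancy (MD) as described by Schretlen et al. (2003):
# the largest absolute difference between any two of an individual's
# standard scores, in SD units. The z-score profile below is invented.

def maximum_discrepancy(z_scores):
    """Largest pairwise absolute difference among a profile of scores
    already expressed in standard-deviation units."""
    return max(z_scores) - min(z_scores)

profile = [1.2, 0.3, -0.8, 0.9, -1.4, 0.0]
print(round(maximum_discrepancy(profile), 1))  # prints 2.6
```

Even this modest hypothetical profile yields an MD of 2.6 SD, consistent with the finding that large intraindividual spreads are common in healthy examinees.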

A Final Word on the Imprecision of Psychological Tests

Though progress has been made, much work remains to be done in developing more psychometrically sound and clinically efficient and useful measures. At times, the technical limitations of many neuropsychological tests currently available with regard to measurement error, reliability, validity, diagnostic accuracy, and other important psychometric characteristics may lead to questions regarding their worth in clinical practice. Indeed, informed consideration may, quite appropriately, lead neuropsychologists to limit or completely curtail their use of some measures. The extreme argument would be to completely exclude any tests that entail measurement error, effectively eliminating all forms of objective measurement of human characteristics. However, it is important to keep in mind the limited and unreliable nature of human judgment, even expert judgment, when left to its own devices.

Dahlstrom (1993) provides the following historical example. While in the midst of their groundbreaking work on human intelligence and prior to the use of standardized tests to diagnose conditions affecting cognition, Binet and Simon (1907) carried out a study on the reliability of diagnoses assigned to children with mental retardation by staff psychiatrists in three Paris hospitals. The specific categories included "l'idiotie," "l'imbécilité," and "la débilité mentale" (corresponding to the unfortunate diagnostic categories of idiot, imbecile, and moron, respectively). Binet and Simon reported the following:

We have made a methodical comparison between the admission certificates filled out for the same children within only a few days' interval by the doctors of Sainte-Anne, Bicêtre, the Salpêtrière, and Vaucluse. We have compared several hundreds of these certificates, and we think we may say without exaggeration that they looked as if they had been drawn by chance out of a sack. (p. 76)

Dahlstrom (1993) goes on to state that "this fallibility in the judgments made by humans about fellow humans is one of the primary reasons that psychological tests have been developed and applied in ever-increasing numbers over the past century" (p. 393). In this context, neuropsychological tests need not be perfect, or even psychometrically exceptional; they need only meaningfully improve clinical decision making and significantly reduce errors of judgment (those errors stemming from prejudice, personal bias, halo effects, ignorance, and stereotyping) made by people when judging other people (Dahlstrom, 1993; see also Meehl, 1973). The judicious selection, appropriate administration, and well-informed interpretation of standardized tests will usually achieve this result.

GRAPHICAL REPRESENTATIONS OF TEST DATA

It is often useful to have a visual representation of test performance in order to facilitate interpretation and cross-test comparison. For this purpose, we include example profile forms (Figures 1-7 and 1-8 on pp. 32-43) which we use to graphically or numerically represent neuropsychological performance in the individual patient. We suggest that these forms be used to draw in confidence intervals for each test rather than point estimates. We also include a sample form that we use for the evaluation of children involving repeat assessment, such as epilepsy surgical candidates.


NOTES

1. It should be noted that Pearson later stated that he regretted his choice of "normal" as a descriptor for the normal curve; this "[had] the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'. That belief is, of course, not justifiable" (Pearson, 1920, p. 25).

2. Micceri analyzed 400 datasets, including 30 from national tests and 131 from regional tests, of 89 different populations administered various psychological and educational tests, and found that extremes of asymmetry and "lumpiness" (i.e., the appearance of distinct subpopulations in the distribution) were the norm rather than the exception. General ability measures tended to fare better than other types of tests such as achievement tests, but the results suggested that the vast majority of groups tested in the real world consist of subgroups that produce non-normal distributions, leading Micceri to state that despite "widespread belief [...] in the naive assumption of normality," there is a "startling" lack of evidence to this effect for achievement tests and psychometric measures (p. 156).

3. Ironically, measurement error cannot be known precisely and must also be estimated.

4. Note that this model focuses on test characteristics and does not explicitly address measurement error arising from particular characteristics of individual examinees or testing circumstances.

5. Users should note that although the SEM is usually provided in the same metric as test scores (i.e., standard score units), some test publishers report SEMs in raw score units, which further impedes interpretation.

6. When interpreting confidence intervals based on the SEM, it is important to bear in mind that while these provide useful information about the expected range of scores, such confidence intervals are based on a model that assumes expected performance across a large number of randomly parallel forms. Ideally, test users would therefore have an understanding of the nature and limitations of classical test models and their applicability to specific tests in order to use estimates such as the SEM appropriately.

7. There are quite a number of alternate methods for estimating error intervals and adjusting obtained scores for regression to the mean and other sources of measurement error (Glutting, McDermott, & Stanley, 1987), and there is no universally agreed-upon method. Indeed, the most appropriate methods may vary across different types of tests and interpretive uses, though the majority of methods will produce roughly similar results in many cases. A review of alternate methods for estimating and correcting for measurement error is beyond the scope of this book; the methods presented were chosen because they continue to be widely used and accepted and they are relatively easy to grasp conceptually and mathematically. Regardless, in most cases, the choice of which specific method is used for estimating and correcting for measurement error is far less important than the issue of whether any such estimates and corrections are calculated and incorporated into test score interpretation. That is, test scores should not be interpreted in the absence of consideration of measurement error.

8. COIs and outcomes of interest may also be defined along a continuum from binary (present-absent) to multiple discrete categories (mild, moderate, severe) to fully continuous (percent impairment). This chapter will only consider the binary case.

9. In the medical literature, these may be referred to as the Predictive Value of a Positive Test (PV+) or Positive Predictive Value (PPV) and the Predictive Value of a Negative Test (PV−) or Negative Predictive Value (NPV).


10. Predictive power values at or below .50 should not be automatically interpreted as indicating that a COI is not present or that a test has no utility. For example, if the population prevalence of a COI is .05 and the PPP based on test results is .45, a clinician can rightly conclude that an examinee is much more likely to have the COI than members of the general population, which may be clinically relevant.

11. Recalculating the PPP for this scenario using low and high values of Sensitivity and Specificity as defined by 95% confidence limits derived earlier from the data in Table 1-9 gives a worst-case to best-case PPP range of .03 to .41.

12. In Bayesian terminology, the prevalence of a COI is known as the prior probability, while PPP and NPP are known as posterior probabilities. Conceptually, the difference between the prior and posterior probabilities associated with the information added by a test score is an index of the diagnostic utility of a test. There is an entire literature concerning Bayesian methods for statistical analysis of test utility. These will not be covered here, and interested readers are referred to Pepe (2003).

13. Compare this approach with the use and limitations of the SEp, as described earlier in this chapter.

14. The basics of linear regression will not be covered here; see Pedhazur (1997).

REFERENCES

Altman, D. C., & BI<lnd,J. .\1. (19R3). i\.1casuremcnt in medicine: Theanalpi.> of mcthod comparison. Slminiâall, 32, 307-317.

Amerkan Educalional Research Associalion, Amerkan i'sychologiealAssocialion, & National Council on Measurelllenl in Education.(1999). St<lllii<lnl,for edl/úuiO/IIlI mui p,ydlO/axiwlle,tillg. \Vash.inglon, DC: Amnkan Psychological Associalion.

Anastasi, A., & Urhina, S. (1997). P,ydwlogiCtlI rnfíng (71h t'J.). Up-p<:r5add!e Riv<:r,Nl: Prenlicc Hall.

Andr<:\\"s,C., Pelns, L., & Teessou, 1'>1.(l9'H). Tllc meU51lrelllwt afCOIlSl/ma oll/come, ÚI tIle'I!'1111<'<lltl1.Canberra, Aus\ralia: Aus.tralian Govcrnmenl Publishing Services.

Axelrod, 11. N., & Goldlllan, lto S. (1996). Use of delllographiccorrections in nló"urop.sychological inurprclalion: Ilow Man-dard are stanJard ,cores? The Clilliw/ ,'\'ellrop,yciwl"gi,t, HJ(2),159-162.

Baron, I. S. (2004). l"lnlrop,ycilOlogiw/ emlrwlioll "filIe e!liM. NewYork: OxforJ Univer.>ilyPress,

nineI, A., & $imon, T. (L907). [.1:, .,,,jim!s m",mllll'X. P;lri.s:ArmolldColin.

Bland, I. M., & Altman, D. G. (19R6). $talislical lllethods for assess-ing agreemenl between Iwo method.s of dinical measurement.Ltmçet, i, 307-310.

Bornstein, R. F.(1996). Face •..alidity in psychological assessmenl: Im-plkations for a unified model of validif)'- Amaimll Psychologisl,51(9),9113-9114.

Burlingame, G. '\-1.,Lambcrl, M. I., Reisinger, C. \V., Neff, \V. M., &!\Iosier, J. (1995). Pragmatics of tracking mental health oulmmló"sin a rnanaged care selling. JOllrtl1l1ofMental HClllth Admilú>fmtian,22,226-136.

Canadian I'sychnlogical Assnciafion. (19117). Grútlelille, for "dum-tiorral mui p,ydlOlogiw/ te5fÍlIg.Ollawa, Canada: Canadian Psy-chological Associalion.

Chdune, G. J. (2003). Assessing reliable neur0l'.>ychological change.In R. D. franklin (Ed.), Pmiietioll i'l Forfll'iÍe <lmiNeuw1'5yr!lOlogy:Sound Stllri.l!imi Prl/etius (1'1'. 65-1111),)o,1,lhw,lh,NJ: l.awrenceErlbaum A5sociales.

Page 28: Psycometrics in neuropsychological assesment

30 ACOlllpcndiulll of!':cuwpsycnulogical Tcsls

Chronbach, L (1971). Test validalion. In I{.Thorndike (Ed.), F:duC<l-tíonal ml'l/sureml'nl (2nd ed., pp. 443-51l7). Washington, DC:American COllncil on Educ;llion.

Chronbach, L., & Meehl, P, E. (1955), ConSlrUCl validilY in psycho-Jogical tests.I',ydlOlogiral Hllilctill, 52, Ifl7-lfH;.

Cicchetti, I). V. (1994). Guidclines, uileria, and rules of Inumh forevaluating normoo and stall<lJrdized as,essmenl instrumenls inpsycnology. P;ycho/ogiwl /1ssf5sment, 6(4I, 284-290.

Ckcnetti, D.v., & 5parrow, 5. S. (1981 l. Devclopíng crileria for ('stab-lisning interrater reliability of specilic it('ms: Appli(atioll> tu as-sessrnent of adaptive beha •.íor. Amerioln fourIIl/l of AkntalVeficiem:y, 86,127-137.

Cíc(heUí, D. v., Volkmar, F., Sparrow, S. 5., Cohen, D., FermMtíMI, I.,& l{ourkc, B. P.(1992), Assessing reliability of dinical scales whcndat,l hilve bOlh nominal and ordín;\1 fe;ll11res;I'roposed guiddinesfor neurop,ycholngkal assessments. for;rlml ofClinical <Ind Exp"'-imellllll Ncurop,yâlOlogy, 14(5), 673-686.

Crawford, J. R, & Garthwaite, P.H. (2002). Invesligation of the single(ase in neuropsydlOlogy: ConfiJence limíts on lhe abnorrnalítyof test scores anJ t('SI score diffcren(es. N"lIropsyâwlogill, 40,1196-1208.

Crawford, I, l{.•& Ilowdl, D. C. (1998), Regression equatiolls in clin-icai nellrol'sychology: An cvaluatioll of slatislical methods forcomparing I'redicted and obtaincd scores, fmmwl oI C1inimlol1llExperimental Nmropsydwlogy, 20(5), 755-762.

Dahlstom, \V. G. (I993). Srnall samp!e." Jarge conseqllcnccs, Amai-«ln PsyrilOlogist, 411(4), 393-399.

Dikrn('n S. 5., Healon R. K., Grant 1., & Ternkin N. R. (1999). Test-retest reliahililY ,lIld pra(tice dfects of expanrled Halstead-ReitanNeuropsychological Test Hatkry. fOllrm,{ of Ilu' IlItcmllfiom!ll'.'eu-ropsych%gi«ll Society, 5(4):346-5f1.

Dudek, F.J. (I979). The continuing mísinterpretalion of the standarderror of rneasurement. P,ydlOlogi«l/ Bljllelin, 116(2),335-337,

flwood, It w. (1993). ClinicaI discrirninatíollS and neuropsydlOlogi-cal test,: An appeal 10 Bares' theorern. Tlw Cliniml Neuropsycholo-gist, 7, 224-233.

Fast('nau, P. S. (1998). Validity of regression-based norms: An empír-icaltest of the Comprehensive Norms with older adull.,. 'oumlll01 Cliniwl 11'1'1Experimental NeuropsydlO/ogy, 20(6), 906-91 fi.

Fastenall, 1'. 5., & AJams, K, M, (1996). He;Hon, Grant, andIvlJlthews' Comprehensive Nnrms: An over/ealous auempl. four-mil 01 Cliniwllllul ExperimCllfll1,•..•'ellrop"ycholog); 111(3),444-448.

Fastenau, P. 5., BenneU, J. M., & Denbllrg, N.l. (1'/%). Al'l'licationof psychornetric stanrlards to scoring system evaluation: Is "new"lIecess"rily uimproved"? foumlll 01 Clinjwl IIl11i Exprrimt"llllllNellropsychology, JII(3), 462-472,

Ferguson, G. A. (I 981). St<ltisticalllllllly"is in p"ychology Il/lli I'Illlw-tion (5th eJ.). NewYork: McGraw-Hill.

Fr;lIlklin, It D. (Ed.). (2003a). Predidion in Forwlic ,111<1Neurop'y-dlOlogy: S01ll1l1 5t<l/i"ti(1I1 Pf(letiu". Mahwah, NJ: Lawrence Er!-baum A,sociates.

Franklin, R. D., & Krucger, I. (2003h). Bayesian InferelKe and Belieinetworks. In R. D. Franklin (Id.), l'redietion in Forensic <ltll!Neu-ropsychology: 501/11<1Sf'ltistiw/ Pf(letiw; (pp. 65-811). !\lahwah, NJ;Lawrence Erlhaulll Associates.

Franzen,I\1. D. (2(X)(I),Relio/li/ity Ima mlir/ily irl lIt'Uropsyâwlogiwl <15-sessmellf (2nd ('d.). New York: Kluwn Academic{l'lcnum I'ublishcrs.

G'ls(juoine, P. G. (1999). Variablcs moderating cultural anu ethnicdifferences in neul'Opsychological assessment: The case of His-panic Americans. Tire CIi,úw/ NellropsycilOlogist, 13(3),376-3113.

Gllluing, J. J., McDerrnot1, P. A., & Stanley, J. C. (19117). Resolvinguifferences nmung methods of eslablishing confiuence Jimits for

test scores. Edllwfiomli (11111PsydwlU!iiw/ J\,1e<l.'f/remellt, 47(3),607-614.

Goufredson, L. 5. (1994). The scíence anu pulilícs of race-nOl'ming.AllwriWIl l'syâlOlogist, 49( 11),955-963.

Greenslaw, P. 5., & )cnscn, S. S. (J996). H.ace-norming and the CivilRights Act {lf 1991. Pllblic PcrsO'lIIe! MIUliJg('rJlClII,25( I), 13-24.

J-Jambelt{ln, R. K. (l980). Test score valitlíty and st;lIld;lnl-seningrncthods. In R. A. Ikrk (EtI.), CrjteriorJ-refercm:ed IIIC11surcmcm:TIl" slille of fhe art (pp. 80-123). Baltimore, ,'vIl): Johns IlopkimUniwrsity Pr(',s.

Hnrris, J.G., & Tulsky, l), S. (2U03). Assessmenl of lhc non-na!ive En-glish speaker: Assimilating history and l'cse<lrchflndings 10 gnideclinicai praclice. In D. S. Tulsky, l). 11. Saklofske, G. I, Chelune,It K, Ileaton, R. Ivnik, R. Bornstdn, A. I'rilitel'3, & M, F,Ledbclter(Etls.), Cli'úwl inlerprer,uioll 0/ the \VAIS.III IIlItI W,1,,15.111(I'P. 343-390). New York: Academic I'rcss.

lleaton, R. K., Chclune, G. J., Tnlley, J. L., Kay, G. G., & Curtiss, G.(1993). \visçOlJ5in Card Sortitlg Te" Mlllllwl. Odessa, FL: PAR.

Heaton, l{. K., Taylor, •......1. J., & ;\Ianly, J. (20tH). Dernogral'hic cffectsand use {lf demographically mrrectc-J norrns with lhe \VAI5-[[]and WM5-111. In D. S. Tnlsky, D. H. 5aklofske, G. I. Chclune,H..K. Healon, R.lvnik, R. Borns1ein,A. Prifilna, & Iv!.F.Ledbdter(Eds.), Clilliml intcrpreflHiolJ of the \VAIS-/Il IIIl<i WMS-//f(pp. 181-210). New York:AG,dernic Pn:ss.

Hermann, B. P., Wyler, A. R., VanderZwagg, R., LeBailly, R. K., Whitman, S., Somes, G., & Ward, J. (1991). Predictors of neuropsychological change following anterior temporal lobectomy: Role of regression toward the mean. Journal of Epilepsy, 4, 139-143.

Ingraham, L. J., & Aiken, C. B. (1996). An empirical approach to determining criteria for abnormality in test batteries with multiple measures. Neuropsychology, 10(1), 120-124.

Ivnik, R. J., Smith, G. E., & Cerhan, J. H. (2001). Understanding the diagnostic capabilities of cognitive tests. The Clinical Neuropsychologist, 15(1), 114-124.

Jacobson, N. S., Roberts, L. J., Berns, S. B., & McGlinchey, J. B. (1999). Methods for defining and determining the clinical significance of treatment effects: Description, application, and alternatives. Journal of Consulting and Clinical Psychology, 67(3), 300-307.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19.

Kalechstein, A. D., van Gorp, W. G., & Rapport, L. J. (1998). Variability in clinical classification of raw test scores across normative data sets. The Clinical Neuropsychologist, 12(3), 339-347.

Lees-Haley, P. R. (1996). Alice in validityland, or the dangerous consequences of consequential validity. American Psychologist, 51(9), 981-983.

Lezak, M. D., Howieson, D. B., & Loring, D. W. (2004). Neuropsychological assessment (4th ed.). New York: Oxford University Press.

Lindeboom, J. (1989). Who needs cutting points? Journal of Clinical Psychology, 45(4), 679-683.

Lineweaver, T. T., & Chelune, G. J. (2003). Use of the WAIS-III and WMS-III in the context of serial assessments: Interpreting reliable and meaningful change. In D. S. Tulsky, D. H. Saklofske, G. J. Chelune, R. K. Heaton, R. Ivnik, R. Bornstein, A. Prifitera, & M. F. Ledbetter (Eds.), Clinical interpretation of the WAIS-III and WMS-III (pp. 303-337). New York: Academic Press.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mackinnon, A. (2000). A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 30, 127-134.

Page 29: Psychometrics in neuropsychological assessment

McCleary, R., Dick, M. B., Buckwalter, G., Henderson, V., & Shankle, W. R. (1996). Full-information models for multiple psychometric tests: Annualized rates of change in normal aging and dementia. Alzheimer Disease and Associated Disorders, 10(1), 216-223.

McFadden, T. U. (1996). Creating language impairments in typically achieving children: The pitfalls of "normal" normative sampling. Language, Speech, and Hearing Services in Schools, 27, 3-9.

McKenzie, D., Vida, S., Mackinnon, A. J., Onghena, P., & Clarke, D. (1997). Accurate confidence intervals for measures of test performance. Psychiatry Research, 69, 207-209.

Meehl, P. E. (1973). Why I do not attend case conferences. In P. E. Meehl (Ed.), Psychodiagnosis: Selected papers (pp. 225-302). Minneapolis: University of Minnesota Press.

Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194-216.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1993). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Phoenix, AZ: The Oryx Press.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.

Mitrushina, M. N., Boone, K. B., & D'Elia, L. F. (1999). Handbook of normative data for neuropsychological assessment. New York: Oxford University Press.

Mitrushina, M. N., Boone, K. B., Razani, J., & D'Elia, L. F. (2005). Handbook of normative data for neuropsychological assessment (2nd ed.). New York: Oxford University Press.

Mossman, D. (2003). Daubert, cognitive malingering, and test accuracy. Law and Human Behavior, 27(3), 229-249.

Mossman, D., & Somoza, E. (1992). Balancing risks and benefits: Another approach to optimizing diagnostic tests. Journal of Neuropsychiatry and Clinical Neurosciences, 4(3), 331-335.

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22, 287-293.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Palmer, B. W., Boone, K. B., Lesser, I. M., & Wohl, M. A. (1998). Base rates of "impaired" neuropsychological test performance among healthy older adults. Archives of Clinical Neuropsychology, 13, 503-511.

Pavel, D., Sanchez, T., & Machamer, A. (1994). Ethnic fraud, native peoples and higher education. Thought and Action, 10, 91-100.

Pearson, K. (1920). Notes on the history of correlation. Biometrika, 13, 25-45.

Pedhazur, E. (1997). Multiple regression in behavioral research. New York: Harcourt Brace.

Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press.

Puente, A. E., Mora, M. S., & Munoz-Cespedes, J. M. (1997). Neuropsychological assessment of Spanish-speaking children and youth. In C. R. Reynolds & E. Fletcher-Janzen (Eds.), Handbook of clinical child neuropsychology (2nd ed., pp. 517-533). New York: Plenum Press.

Rapport, L. J., Brines, D. B., & Axelrod, B. N. (1997). Full Scale IQ as mediator of practice effects: The rich get richer. The Clinical Neuropsychologist, 11(4), 375-380.

Rey, G. J., Feldman, E., & Rivas-Vazquez, R. (1999). Neuropsychological test development and normative data on Hispanics. Archives of Clinical Neuropsychology, 14(7), 593-601.

Rorer, L. G., & Dawes, R. M. (1982). A base-rate bootstrap. Journal of Consulting and Clinical Psychology, 50(3), 419-425.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49(11), 929-954.

Sattler, J. M. (2001). Assessment of children: Cognitive applications (4th ed.). San Diego: Jerome M. Sattler Publisher, Inc.

Sawrie, S. M., Chelune, G. J., Naugle, R. I., & Luders, H. O. (1996). Empirical methods for assessing meaningful neuropsychological change following epilepsy surgery. Journal of the International Neuropsychological Society, 2, 556-564.

Schretlen, D. J., Munro, C. A., Anthony, J. C., & Pearlson, G. D. (2003). Examining the range of normal intraindividual variability in neuropsychological test performance. Journal of the International Neuropsychological Society, 9(6), 864-870.

Shavelson, R. J., Webb, N. M., & Rowley, G. (1989). Generalizability theory. American Psychologist, 44, 922-932.

Sherman, E. M. S., Slick, D. J., Connolly, M. B., Steinbok, P., Martin, R., Strauss, E., Chelune, G. J., & Farrell, K. (2003). Re-examining the effects of epilepsy surgery on IQ in children: An empirically derived method for measuring change. Journal of the International Neuropsychological Society, 9, 879-886.

Slick, D. J., Hopp, G., Strauss, E., & Thompson, G. (1997). The Victoria Symptom Validity Test. Odessa, FL: Psychological Assessment Resources.

Sokal, R. R., & Rohlf, F. J. (1995). Biometry. San Francisco, CA: W. H. Freeman.

Somoza, E., & Mossman, D. (1992). Comparing diagnostic tests using information theory: The INFO-ROC technique. Journal of Neuropsychiatry and Clinical Neurosciences, 4(2), 214-219.

Streiner, D. L. (2003a). Being inconsistent about consistency: When coefficient alpha does and doesn't matter. Journal of Personality Assessment, 80(3), 217-222.

Streiner, D. L. (2003b). Starting at the beginning: An introduction to coefficient alpha and internal consistency. Journal of Personality Assessment, 80(1), 99-103.

Streiner, D. L. (2003c). Diagnosing tests: Using and misusing diagnostic and screening tests. Journal of Personality Assessment, 81(3), 209-219.

Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1(1), 1-26.

Taylor, M. J., & Heaton, R. K. (2001). Sensitivity and specificity of WAIS-III/WMS-III demographically corrected factor scores in neuropsychological assessment. Journal of the International Neuropsychological Society, 7, 867-874.

Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four models. Journal of the International Neuropsychological Society, 5, 357-369.

Tulsky, D. S., Saklofske, D. H., & Zhu, J. (2003). Revising a standard: An evaluation of the origin and development of the WAIS-III. In D. S. Tulsky, D. H. Saklofske, G. J. Chelune, R. K. Heaton, R. Ivnik, R. Bornstein, A. Prifitera, & M. F. Ledbetter (Eds.), Clinical interpretation of the WAIS-III and WMS-III (pp. 43-92). New York: Academic Press.

Willis, W. G. (1984). Reanalysis of an actuarial approach to neuropsychological diagnosis in consideration of base rates. Journal of Consulting and Clinical Psychology, 52(4), 567-569.

32 A Compcndium ofNcurops}'chological Tcsts

Woods, S. P., Weinborn, M., & Lovejoy, D. W. (2003). Are classification accuracy statistics underused in neuropsychological research? Journal of Clinical and Experimental Neuropsychology, 25(3), 431-439.

Yun, J., & Ulrich, D. A. (2002). Estimating measurement validity: A tutorial. Adapted Physical Activity Quarterly, 19, 32-47.

Zimiles, H. (1996). Rethinking the validity of psychological assessment. American Psychologist, 51(9), 980-981.

Figure 1-7 Profile form - Adult. [Score-recording form with fields for demographics (name, date of birth, age, sex, education, handedness, examiner, test dates, previous testing) and for cognitive scores (WAIS-III subtest, IQ, and index scores; NAART-estimated VIQ/PIQ/FSIQ; Raven's Matrices; other measures), each recorded as raw score, age-scaled score, and percentile; continued.]