the effects of test length and sample size on item ...lord’s (1968) study is the firstone of its...

Received: April 14, 2016Revision received: October 2, 2016Accepted: December 1, 2016OnlineFirst: December 25, 2016

Copyright © 2017 EDAMwww.estp.com.tr

DOI 10.12738/estp.2017.1.0270 February 2017 17(1) 321–335

Research Article

KURAM VE UYGULAMADA EĞİTİM BİLİMLERİ EDUCATIONAL SCIENCES: THEORY & PRACTICE

* Thispaperisapartofthefirstauthor’sPhDdissertationatthedepartmentofEducationalMeasurementandEvaluation,InstituteofSocialSciences,HacettepeUniversity,Ankara,Turkey. Anearlierversionofthispaperwaspresentedatthe77thAnnualmeetingofNationalCouncilonMeasurementinEducation,Chicago,April15-19,2015.

1Correspondence to:AlperŞahin(PhD),ModernLanguagesProgram,SchoolofForeignLanguages,MiddleEastTechnicalUniversityNorthernCyprusCampus,Kalkanlı,Güzelyurt,Mersin99738Turkey.Email:[email protected]

2Department of Educational Measurement and Evaluation, Faculty of Education, Hacettepe University, Ankara Turkey. Email:[email protected]

Citation:Şahin,A.,&Anıl,D.(2017).Theeffectsoftestlengthandsamplesizeonitemparametersinitemresponsetheory.Educational Sciences: Theory & Practice,17,321–335.http://dx.doi.org/10.12738/estp.2017.1.0270

Abstract

This study investigates the effects of sample size and test length on item-parameter estimation in test

development utilizing three unidimensional dichotomous models of item response theory (IRT). For this

purpose, a real language test comprised of 50 items was administered to 6,288 students. Data from this test

was used to obtain data sets of three test lengths (10, 20, and 30 items) and nine different sample sizes (150,

250, 350, 500, 750, 1,000, 2,000, 3,000 and 5,000 examinees). These data sets were then used to create

various research conditions in which test length, sample size, and IRT model variables were manipulated to

investigate item parameter estimation accuracy under different conditions. The results suggest that rather

than sample size or test length, the combination of these two variables is important and samples of 150, 250,

350, 500, and 750 examinees can be used to estimate item parameters accurately in three unidimensional

dichotomous IRT models, depending on test length and model employed.

Keywords

Small samples in IRT • Item parameter estimation • Item response theory •

Language testing • Short tests

Alper Şahin1

Middle East Technical University Northern Cyprus Campus

Duygu Anıl2

Hacettepe University

The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory*

322

EDUCATIONAL SCIENCES: THEORY & PRACTICE

Developedfollowingcontroversiesovertheintelligencetestingmovement(Baker,1992),classicaltesttheoryhassuccessfullyserveditspractitionersintermsoftestdevelopmentandinterpretationoftestresultsfordecades(Embretson&Reise,2000;Hambleton, Swaminathan,&Rogers, 1991).However, the need for a theory thatcanpredictexaminees’responsestoanyitem,evenifitemshaveneverbeenseenbytheexamineesbefore(Lord,1980),resultedinthedevelopmentofanewtesttheorywhosebasic conceptsweredevelopedover seven and ahalf years (Baker, 1992).Afterspendingalongtimesearchingforanagreed-uponname,itemresponsetheory(IRT)becamethecontemporarytheoreticalfoundationformeasurement(Embretson&Reise, 2000). It is nowwidely used by test publishers, aswell as educational,military,andindustrialinstitutions,intheirresearchontestdevelopment,testequating,and detecting differentially functioning items (Hambleton et al., 1991).However,despite theadvantages IRToffers, amajorobstacle to itsuse in testdevelopmentis itsdemanding requirement for calibration sample size.More specifically,whiledevelopingatest,“thedifficultyliesintheneedtoassessthepropertiesofitemsbytryingthemoutonasampleofsubjects”(Woods&Baker,1985,p.117).IRTrequiresthissamplesizetobelarge(around1,000)inordertoobtainaccurateitem-parameterestimates(Hambleton,1989)thatresultsinaccurateestimatesofability,uponwhichsomehigh-stakesdecisionsaremade.

Thereisalargevolumeofpublishedstudiesintheliteratureontheeffectsofsamplesize and test length on item-parameter estimation in IRT-based test development.Lord’s(1968)study isthefirstoneofitskindinwhichhe investigatedthesamplesizesrequiredforestimatingitemparameters(a-itemdiscrimination,b-itemdifficulty,andc-pseudochance)accuratelyinthethree-parameterlogisticmodel(3PLM)usingdatafromtheScholasticAptitudeTest.Asaresult,Lordconcludedthataminimumof50itemsand1,000examineeswererequiredtoestimateaparameterswithhighaccuracy.Withthesupportofsubsequentstudies(Patsula&Gessaroli,1995;Tang,Way, & Carey, 1993;Yen, 1987;Yoes, 1995), 1,000 was taken as theminimumsamplesizerequiredforaccurateitem-parameterestimationinIRT.

Studies have also been conducted that investigated the effects of using samplesizes of less than a 1,000 on item parameter estimates. However, their findingshave tremendousdiscrepancies.To illustrate, a limitednumberof studiesonone-parameterlogisticmodel(1PLM)havesuggestedthat250(Goldman&Raju,1986),300(Guyer&Thompson,2011),or500(Thissen&Wainer,1982)aretheminimumfeasiblesamplesizesforthisdichotomousunidimensionalmodel.Asthecomplexityof themodel increases, thediscrepancy infindingsalso increases.Forexample,asampleof250with15 items(Harwell&Janosky,1991),of500(Stone,1992)or750(Lim&Drasgow,1990)with20items,of200(Weiss&Minden,2012)or250(Harwell&Janosky,1991)with25items,of500with30items(Hulin,Lissak,&Drasgow,1982),orof300with75items(Yoes,1995)havebeensuggestedfortwo-

323

Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory

parameterlogisticmodel(2PLM).Thesituationgetsmoreconfusingfor3PLM.Asampleof1,000with20items(Patsula&Gessaroli,1995;Swaminathan&Gifford,1979;Yen,1987);of200(Weiss&Minden,2012)with25items;of500(Akour&Al-Omari,2013)with30items;of300(Chuah,Drasgow,&Luecht,2006)with50items;of1,000with40(Patsula&Gessaroli,1995;Tangetal.,1993;Yen,1987),50(Lord,1968),60(Hulinetal.,1982)and75(Yoes,1995)items;andof500(Ree&Jensen,1980)with80itemshavebeensuggestedin3PLM.

Whilenumerousstudieshavebeenconductedontheeffectsofsamplesizeandtest lengthonitem-parameteraccuracyinIRT-basedtestdevelopment, thepresentstudydiffersfromthemintwoimportantaspects.Firstly,thedatausedinpreviousresearchhasbeenmostlysimulateddata,anditisnotknownwhethersimulateddatareflects thequalitiesofa real test (Sireci,1991). In thismanner, theresultsof thepresentstudyareexpectedtobeclosertorealtestdevelopmentthanmostpreviousresearch.Moreover,limitednumbersofsamplesizes,mostly4or5,weretestedinthepreviousresearch.However,alargenumber(9)ofsmallandlargesamplesizesaretestedatthesametimeinthepresentstudy,thusmakingitpossibletocompareandcontrastresultsfrombothsides.Thatmakesthisstudyuniqueintheliterature.Withthehelpofthesetwoaspects,thisstudysetsouttoprovideallIRTpractitionerswithablueprintofhowlargeorsmallasamplesizecanbetoestimateitemparametersaccuratelyinIRT-basedtestdevelopmentbyansweringthequestion“Howdosamplesizeandtestlengthaffectitem-parameterestimationinIRT-basedtestdevelopment?”

Method

Participants and Data Collection InstrumentTheparticipantsinthisfoundationalstudyare6,288freshmanstudentsenrolledin

anEnglishlanguagecourseatalargeuniversity.Thedatacollectioninstrumentwasa50-itemEnglishlanguagetest.Inordertofulfillthecontentvalidityconcerns,itemsinthetestwerewritteninparallelwiththecourseobjectivesbytheEnglishlanguageinstructorswhoteachthecourseandaredeemedassubjectmatterexperts.Moreover,beforeadministeringthetest,expertopinionswerereceivedfromadoctoralcandidateandlanguagetestingspecialistsworkingforthetestingofficeattheuniversitywherethe data was collected. In this way, the itemswere revisedmultiple times baseduponexpertopinionsuntilreachingtheirfinalform.Afterwards,theinstrumentwasadministeredsimultaneouslyto6,288testtakersinonesession.

Item Selection for Shorter Tests Itemsfromthefulldatasetwereselectedtoformsub-testsofdifferentlengths.

Whiledecidingthenumberofitemsforshortertests, previousstudies(Baker,1998;

324


Gao&Chen, 2005; Patsula&Gessaroli, 1995;Yen, 1987) in the literaturewereconsidered. So, from the 50-item full data set (6,288 x 50), sub-tests of n = 30, n=20,andn=10itemswereselected.Whileformingsub-tests,unidimensionalityassumptionofIRTwassatisfied.Unidimensionalityreferstohavingasinglecommonfactorunderlyingexaminees’correctresponsestoitems.Inordertoselecttheitemthatcollectivelysatisfyunidimensionality,exploratoryfactoranalysisonthetetrachoricinter-itemcorrelationmatrixofthefulldatawasused(Edelen&Reeve,2007).Itemsthat had factor loadings less than0.30 for thefirst factorwere removed from thedataas they failed to loadon thefirst factoradequately (Fidelman,2012).Outofthe50items,30itemsthathadthehighestfactor loadingsonthefirstfactorwerechosenastheitemsforthe30-itemtest.Thetetrachoriccorrelationmatrixofthese30itemswascomputedandanotherexploratoryfactoranalysiswasdoneonthismatrix.Then,20itemswithhighestfactorloadingsonthefirstfactorwereselectedforthe20-itemtest.Thesameprocedurewasfollowedtoselecttheitemsfor10-itemtest. AsummaryoftheresultsrelatedtotheunidimensionalityassumptionandthefactorloadingscanbefoundinTable1.

Table1Summary of Exploratory Factor Analysis in Item SelectionTestLength Variance λ1/λ2 RangeofFactorLoadings30-ItemTest 37.50% 8.50 0.45to0.7520-ItemTest 43.78% 8.15 0.59to0.7510-ItemTest 51.29% 6.10 0.68to0.77

Note. Variance=Varianceaccountedforbythefirstfactor,λ1/λ2=ratioofthefirsttothesecondeigenvalue.

Lord(1980)suggeststhatifthefirsteigenvalueislargecomparedtothesecondone,andifthesecondoneisnotmuchlargerthananyoftheothereigenvalues,thiscanbetakenasaproofofunidimensionality.Thesewereobservedinthedatasetsofthecurrentstudy.Apartfromthese,moderatetohighfactorloadingsforthefirstfactorandmoderatetohighvarianceaccountedforbythefirstfactorwerealsotakenasproofofunidimensionality.Uponcompletingtheitemselectionforthesub-tests,theKR-20internalconsistencycoefficientwascalculatedforeachtest,whichwerefoundtobe0.76(10-itemtest),0.85(20-itemtest),and0.88(30-itemtest).

Sampling of ExamineesAfterselectingtheitemsthatwouldbeusedforeachsub-test(10-item,20-item,

and30-item),ninedifferentsamplesizes(N=150,250,350,500,750,1,000,2,000,3,000,and5,000)weredrawnfromthefulldatasetforeachtestbystratifiedrandomsampling using the complex samplesmodule of SPSS 20 (InternationalBusinessMachinesCorporation,2011).Thiswasdoneinordertosimulatedifferentresearchconditions(e.g.,250examinees’responsesto20items).Whiledecidingsamplesizes,thesamplesizesmostfrequentlyusedinpreviousIRTresearchonsamplesizewere

325


reviewed.Asaresult,samplesizesof150,250,500,1,000,2,000,3,000,and5,000werefoundtobeusedthemostinsimilarstudies(Akour&AlOmari,2013;Baker,1998;Goldman&Raju, 1986;Hulin et al., 1982;Lord, 1968;Tang et al., 1993; Thissen&Wainer,1982;Yen,1987). Samplesof350,whichhadneverbeentested,andof750,whichhadonlybeenpreviouslytestedonce(Lim&Drasgow,1990),werealsoaddedtothestudyasalternativesto500and1,000.Moreover,whilesamplingtheexaminees,theircollegeswereusedasstratatoreflectexamineediversityinthefulldataacrosssamples.Thisprocessresultedin30(SeeTable2)differentdatasets,includingthefulldataset(markedwithan*).

Table2Data Sets of the Study

SampleSizeTestLength 150 250 350 500 750 1,000 2,000 3,000 5,000 6,288* Total

10 1 1 1 1 1 1 1 1 1 1 1020 1 1 1 1 1 1 1 1 1 1 1030 1 1 1 1 1 1 1 1 1 1 10Total 3 3 3 3 3 3 3 3 3 3 30

Item Parameter EstimationAfter sampling,itemparameters(itemdifficulty(b)in1PLM;banddiscrimination(a)

inthe2PLM;anda,b,andpseudo-chance(c)parametersinthe3PLM)wereestimatedunder90differentresearchconditions(3IRTmodelsx30datasets)usingXcalibre4.1(Guyer&Thompson,2011)withmarginalmaximumlikelihoodestimation(MMLE)andthedefaultoptionsreadilyavailableintheprogram.Theitemparametersestimatedfromthefulldataset(N =6,288,n=50)weretakenasthe“true”(baseline)parametersagainstwhichtheaccuracyoftheitemparametersestimatedfromtheothersamplesinthestudycouldbecomparedandcontrasted.Asthedatahadaconsiderablylargenumberofexaminees,itwaspossibletoestimateitemparameterswithareasonableamount of accuracy and thus could be considered as baseline parameter estimates(Swaminathan,Hambleton,Sireci,Xing,&Rizavi,2003).

Estimation Evaluation Criteria Accuracyoftheparameterestimatesunderdifferentresearchconditionswasevaluated

throughproduct-momentcorrelations(Gao&Chen,2005;Harwell&Janosky,1991;Hulin et al., 1982; Swaminathan et al., 2003;Tang et al., 1993;Yen, 1987) and theroot-mean-squareddifference(RMSD;Gao&Chen,2005;Harwell&Janosky,1991;Swaminathanetal.,2003;Yen,1987)betweenthebaselineandestimatedparameters,ashasbeendoneinsimilarstudies.RMSDisdefinedasinEquation1below.

RMSD= (1)

326


p�i representsoneof theestimated itemparameters(a,b,orc) for item i,andpi representsthecorrespondingbaselineparameter,takenasthetrueitemparameterforitemi.Underallresearchconditions,correlationsofr≥0.70,whichisconsideredacceptable (Yoes,1995)andsometimesashigh (Field,2013),andRMSD≤0.33,which corresponds to the classical reliability value of 0.90 (Rudner, 1998), weretakenasthecriteriaforminimumfeasiblesamplesizeforthatparticulartestlengthandIRTmodel.

Model Data FitAnitem-by-itemfitanalysiswasnotconducted;rather,anoverallmodel-datafit

analysiswasusedonalldata setsbecause itmightbepossible tohave items thatfitoneitemresponsemodelinadatasetwhilefailingtofitthemodelinotherdatasets.Thiswouldmakeitdifficulttocomparethesameitems’performancesinthreeIRTmodelswithninesamplesizesandthreetestlengthsusingrealtestdatafromalimitednumberofitems.Astheindicatorofoverallmodel-datafit,themodelchi-squarereportedbyXcalibre4.1(Guyer&Thompson,2011)wasused.Aschi-squareis sensitive to large sample sizes, chi-square divided by the degrees of freedom (x2/df)wasusedasitisnotsensitivetolargesamplesizes(Kline,2005).Asummaryofthemodel-datafitanalysisbasedonx2/dfisshowninTable3.

Table3Summary of the Model Data Fit Analysis Based on x2/df

N

n=10 n=20 n=301PLM 2PLM 3PLM 1PLM 2PLM 3PLM 1PLM 2PLM 3PLMx2/df x2/df x2/df x2/df x2/df x2/df x2/df x2/df x2/df

150 0.4 0.4 0.6 0.5 0.5 0.5 0.5 0.4 0.4250 0.5 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.4350 0.6 0.8 0.9 0.5 0.4 0.5 0.6 0.4 0.4500 0.6 1.0 1.4 0.5 0.4 0.4 0.5 0.4 0.4750 1.2 0.9 1.5 0.7 0.5 0.5 0.8 0.4 0.41,000 0.9 0.9 2.1 0.6 0.5 0.6 0.9 0.5 0.52,000 2.1 1.7 2.7 0.9 0.7 0.8 1.3 0.5 0.63,000 2.7 2.5 4.7 1.0 0.7 0.9 1.9 0.6 0.75,000 4.0 3.3 8.0 1.4 1.4 1.3 2.9 0.8 0.86,288 5.0 3.9 9.4 1.7 1.6 1.5 3.4 0.9 0.9

Valuesof3orlessaresuggestedasanindicatorofmodelfit(Chernyshenko,Stark,Chan,Drasgow,&Williams,2001).Table3indicatesmisfitoffewlargedatatothemodel.Thesewereignored,astheywerethoughttoresultfromtheverylargesizeofthesedatasets.

327


FindingsFigures 1, 2, and 3 show correlations and RMSD values obtained from the

parameterestimates.AsFigure1shows,correlationspertainingtothebparametersestimatedin1PLMrangedbetween0.939(N=150,n=10)and1.00(N=5,000; n =10,20,and30).Inaddition,RMSDvaluespertainingtothebparameterestimatesrangedbetween0.33(N=150, n=10)and0.01(N=5,000;n=10).Ascanbeseenfromthefigures,asampleof150metbothcriteriaforbeingdeemedacceptablein1PLMforalltestlengths.

Figure 1.Correlations(left)andRMSD(right)valuesobtainedforthebparameterin1PLM.

2a. Correlationsfortheaparameterin2PLM 2b. RMSDfortheaparameterin2PLM

2c. Correlationsforthebparameterin2PLM 2d. RMSDforthebparameterin2PLMFigure 2.CorrelationsandRMSDvaluesobtainedfortheaandbparametersin2PLM.

328


3a.Correlationsfortheaparameterin3PLM 3b.RMSDfortheaparameterin3PLM

3c.Correlationsforthebparameterin3PLM 3d.RMSDforthebparameterin3PLM

3e.Correlationsforthecparameterin3PLM 3f.RMSDforthecparameterin3PLMFigure 3.CorrelationsandRMSDvaluesobtainedforthea, b,andcparametersin3PLM.

Thecorrelationspertainingtotheaparameter(Figure2a)in2PLMrangedbetween-0.311(N=250,n=10)and1.00(N=5,000;n=20,30).TheRMSDvalues(Figure2b)obtainedfromtheaparametersin2PLMvariedfrom0.16(N=150,n=20)to0.00(N=5,000;n=10,20).Asforthebparameters,thecorrelations(Figure2c)rangedbetween0.940(N=150,n=10)and1.00(N=2,000,3,000,5,000;n=10,20,30)whiletheRMSDvaluesforthebparameter(Figure2d)rangedbetween0.34

329


(N=150,n=10)and0.02(N=5,000;n=10,20).Inlinewiththese,theminimumsamplesizethatmetbothevaluationcriteriaforbothparameters(aandb)estimatedin2PLMwas750(raâ =0.706,RMSD=0.09;rbb=0.999,RMSD=0.05)for the 10-item test, 500 (raâ =0.795,RMSD=0.08; rbb =0.985,RMSD=0.13) for the20-itemtest,and250(raâ =0.830,RMSD=0.10;rbb=0.978,RMSD=0.17)forthe30-itemtestin2PLM.

The correlations pertaining to the a parameter estimates in 3PLM (Figure 3a)rangedbetween0.207(N=250,n=10)and0.995(N=5,000;n=20,30),whereastheRMSDvalues obtained from thea parameter estimates in 3PLM (Figure 3b)ranged between 0.23 (N = 250, n = 10) and 0.02 (N = 5,000; n = 10, 20). Thecorrelationspertainingtothebparameterestimates(Figure3c)rangedbetween0.944(N=150,n=10)and1.00(N=2,000,3,000,5,000;n=10,20,30)whiletheRMSDvaluesobtainedfromthebparameterestimates(Figure3d)rangedbetween0.37(N =150,n=10)and0.02(N=5,000;n=10).Asforthecparameter,thecorrelations (Figure3e)rangedbetween0.415(N=350,n=10)and0.996(N=5,000,n=20)whereastheRMSDvaluespertainingtothecparameterestimates(Figure3f)rangedbetween0.05(N=150,n=30;N=250,n=10,30)and0.01(N =1,000,2,000,3,000,5,000,n =10;N =5,000,n =20;N =3,000,5,000,n =30).Asaresult,thesmallestsamplesizesthatmetthecriteriaforallparameters(a,b,andc)estimatedin3PLMwere750(raâ =0.901,RMSD=0.10;rbb =0.998,RMSD=0.10;rcc=0.992,RMSD=0.03)whenn=10;750(raâ =0.840,RMSD=0.11;rbb =0.995,RMSD=0.11;rcc=0.829,RMSD=0.03)whenn=20;and350(raâ =0.749,RMSD=0.16;rbb

=0.987,RMSD=0.13;rcc=0.734,RMSD=0.03)whenn=30.

DiscussionThisstudyaimedtoinvestigatetheeffectsofsamplesizeandtestlengthonitem-

parameterestimationaccuracyinIRT-basedtestdevelopment.Theresultsindicateanassociationbetweentestlengthandsamplesizethatamplifiesasthenumberofitemsinthetestandthenumberofparametersinthemodelincrease.Asmentionedearlier,therearetwocriteriasetforestimationevaluation.Oneisthatr≥0.70;theotheristhat RMSD≤0.33betweenthebaselineandestimateditemparameters.AsummaryoftheminimumsamplesizesthatmetbothofthesecriteriainallIRTmodelscanbefoundinTable4.

Table4Summary of the Findings

TestLength 1PLM 2PLM 3PLM10 150 750 75020 150 500 75030 150 250 350

330


AsTable4illustrates,thefindingsofthepresentstudysuggestthatasampleofaslowas150examineescouldbeusedin1PLMwithtestsof10,20,or30itemstoaccuratelyestimatethebparameter.Sucharesultwaspartiallyexpectedastherehadalreadybeensomesignsregardingtheviabilityofusing150examineesfor1PLMin thecurrent literature.Morespecifically, someauthorshavealreadysuggestedasamplesizeofaround200asappropriatefor1PLM(Demars,2010;Wright&Stone,1979)withnospecificreferencetotestlength.Thepresentfindingisimportantasitconfirmstheseauthors’findingswithaspecificreferencetotestlength.Moreover,somepreviousresearchhadsimilarfindings,likeN=250(Goldman&Raju,1986)or N =300(Guyer&Thompson,2011)with78and50items,respectively.Thesesamplesizesandtestlengthswerethesmallestsamplesizesandshortesttestlengthstestedinthesestudies.Thus,thepresentfinding,whichsuggestsasamplesizeof150intestsof10,20,or30items,mayconstituteaviableupdatetothecurrentfindingsin the literature.However, although the correlations andRMSDvalues similar tosmallersamplesizesconfirmedthatthefulldata,aswellasN =5,000for10-,20-,and30-itemtestsandN = 3,000in30-itemtests,fitthe1PLMasparameterinvariancewasreachedamongthesamples,thefindingspertainingto1PLMmaystillbebiased(asx2/dfvaluesindicatedmisfit)andneedtobeinterpretedaccordingly.

AsseeninTable4,theminimumsamplesizesuggestedfora10-itemtestin2PLMwas750.Thiswasaninterestingfindingfortworeasons.First,atestof10itemsin2PLMwasnotpopularinpreviousstudies.Onlythreestudiestested10-itemtestsin2PLM (Baker, 1998;Stone, 1992;Yen, 1987).However, theywereunable toyieldacceptableaccuracywithit.Thesamplesizestestedinthesestudieswere1,000(Yen,1987);30,50,60,120,and500(Baker,1998);and250,500,and1,000(Stone,1992).ThepresentfindingdifferedfromYen’s(1987)andStone’s(1992)studiesastheydidnotreport1,000asaviablesamplesizefor10items.Onthecontrary,inthepresentstudy,theitemparametersof10itemswith1,000examineeresponseswereestimatedwithhighaccuracy(raâ =0.864,RMSD=0.03;rbb =0.964,RMSD=0.14).Second,as statedearlier, a samplesizeof750wasnotapopularalternative to1,000 in thepreviousresearch.Onlyonestudy(Lim&Drasgow,1990)testedtheperformanceof750with20itemsandreportedthattheaandbparameterestimateswereclosetotheirtruevalues.Thisisclearlyinparallelwiththepresentfindingwithashortertestlength.

The present findings indicate that accurate estimates of item parameters can beobtainedwhenN=500andn=20in2PLM.Therehavebeenstudiesthatsuggesteda sample of 500 in 2PLMwith 30- (Hulin et al., 1982) and 78-item (Goldman&Raju,1986)tests.However,itshouldbenotedthatHulinetal.(1982)onlytested15-,30-,and60-itemtestsandusedjointmaximumlikelihoodestimation(JMLE)astheparameterestimationmethod.Thiscouldaccordinglycause lessaccurateestimationofitemparametersinsmallersamplesizesandshortertestlengthsasJMLEismore

331


efficientwithsamplesizesofover1,000andtestswithmorethan60items(Baker&Kim,2004).Moreover,itshouldalsobenotedthat78wastheonlytestlengthtestedbyGoldmanandRaju(1986).Forthesereasons,thecomparabilityofthesetwostudieswiththepresentstudyispoor.However,itisencouragingtocomparethepresentfindingwiththatofStone’s(1992),whoreportedanRMSDvalueofapproximately0.30whenN=500andn=20,indicatingaparallelfindingwiththepresentstudy.

Theresultsofthepresentstudyalsosuggestthatusingasampleof250with30itemsin2PLMisviable.ThisisconsistentwiththefindingsofHarwellandJanosky(1991),whoobtainedaccurateaandbparameterestimateswitha25-itemtestwhenN=250.Apartfromthis,Hulinetal.(1982)suggestedasamplesizeof500with30itemsin2PLM.Asmentionedearlier,duetotheestimationmethod(JMLE)theyhademployedtoestimate itemparameters, itwouldnotbewrongtoassumethat theycouldhaveobtainedasmallersamplesizewith30itemsiftheyhadusedtheMMLEestimationmethodusedinthepresentstudy.Moreover,GuyerandThompson(2011) obtainedaccurateaandbparametersin2PLMwhenN=300andn=50.Oneshouldnotethatn=50wastheshortesttestlengthandN=300wasthesmallestsamplesizetheytested,whichmeansthepresentstudyobtainedsimilarresultstestingasmallersamplesizeandshortertestlength.

AsshowninTable4,acceptableaccuracyisreachedwhenN=750andn=10in3PLM.Althoughtestswith10itemshadbeenpreviouslytestedintwostudies(Gao&Chen,2005;Yen,1987)in3PLM,therewasnosuggestionforthistestlengthinthecurrentliteraturefor3PLMaspreviousresearchdidnotyieldaccurateparameterestimates.Tobeginwith,whenGaoandChen’s(2005)studywasanalyzedintermsoftestlengthandsamplesize(N=100,500,2,000;n=10,30,60),onecanseethattheclosestsamplesizeto750was500.Theycouldnotyieldaccurateestimatesoftheaparameterwithasampleof500.Thepresentstudyconfirmedthat500examineeswith10itemsin3PLMyieldsverypoorestimatesfortheaparameter(raâ=0.345,RMSD=0.19).GaoandChen(2005)didnottest750oreven1,000intheirstudyasanalternativeto2,000.However,750wastestedinthepresentstudy,andaccurateestimatesoftheaparameterhavebeenobtained.Secondly,Yen(1987)obtainedpooraccuracywithasampleof1,000and10itemsin3PLM.Onthecontrary,thepresentfindingsindicatedthatwhenN=750andn=10(raâ =0.901,RMSD=0.10;rbb = 1.00,RMSD=0.10;rcc =0.922,RMSD=0.03),theitemparameterswereestimatedasaccuratelyastheywerewhenN=1,000(raâ =0.929,RMSD=0.06;rbb =0.999,RMSD=0.14;rcc =0.938,RMSD=0.01)withthesametestlength.

Whenthetestlengthincreasedfrom10to20itemsin3PLM,thesamplesizewithwhichacceptableaccuracywasreachedwasalso750.PatsulaandGessaroli(1995) andYen(1987)hadpreviouslystudiedthe20-itemtestconditionandbothsuggestedsamplesof1,000with20items.Yen(1987)studiedonlythesampleof1,000and

332


PatsulaandGessaroli(1995)didnothaveN =750intheirresearchdesign.Theyonlyhadsamplesof1,000and500astheclosestalternativesto750.Theyfoundthat500didnotyieldaccurateenoughestimates,whichwasalsothecaseinthepresentstudyunderthesameresearchcondition(raâ =0.614,RMSD=0.15;rbb =0.972,RMSD=0.19;rcc =0.820,RMSD=0.03).Assuch, itshouldnotbesurprisingthat theysuggested1,000astheminimumsamplesizefortheydidnothavetheopportunitytotest750asanalternative.Thepresentfindingsalsoindicatethat750isahighlyviablealternativeto1,000asitem-parameterestimatesinN=750(raâ =0.840,RMSD=0.11;rbb =0.995,RMSD=0.11;rcc =0.829,RMSD=0.03)areasaccurateaswhenN=1,000(raâ =0.874,RMSD=0.09;rbb =0.989,RMSD=0.12;rcc =0.915,RMSD=0.03)andn=20in3PLM.

As seen in Table 4, the sample size suggested for estimating item parametersaccurately in 3PLMwith 30 itemswas 350. Previously, two studies had tested asampleof500in3PLMwith30items(Akour&Al-Omari,2013;Gao&Chen,2005).Whenthesestudieswerefurtheranalyzed,itcanbeseenthatthesamplesizesof100and200werealsotestedinthesestudiesrespectively.Thus,N=350naturallyisn’tfoundinthecurrentliteratureasitwasnevertestedagainst500.Manyresearchers(Goldman&Raju,1986;Harwell&Janosky,1991;Patsula&Gessaroli,1995;Ree&Jensen,1980;Yoes,1995)insteadpreferredthesamplesizeof250asasmalleralternativeto500.Thatthepresentfindingsuggeststheuseofasamplesizeof350asviablewith30-itemtestsin3PLMthusshouldnotbeintriguing.

Asonecaninferfromthediscussioninthepreviousparagraphs,thefindingsofthepresentstudynotonlysupportthefindingsofpreviousresearchinthefield,butalsofurtheritsfindingsbyprovidingresearcherswithmoredataintermsofviablesamplesizesforIRT-basedtestdevelopment.Moreimportantly,Table4providesIRTpractitionerswithadetailedblueprintonwhatsamplesizetousewithwhichmodelandtestlength.Inaddition,thefindingspresentedinTable4suggestatremendousdecreaseintheminimumsamplesizerequiredforIRT-basedtestdevelopment.Thismeans that IRT practitioners will need to reach fewer examinees to pretest theiritems,andthiswillmakethedatacollectionprocesseasierandmorebudget-friendly.Fromthispointofview,thefindingsofthisstudywillnotonlybebeneficialforIRTpractitionerswhocanonlyreachalimitednumberofexaminees,butitwillalsohelptodecreasethedatacollection,itemanalysis,testproduction,andpersonnelcostsforcompaniesandinstitutionsthatneedtorecruitmorestafftomeettherequirementsofIRT-basedtestdevelopment.

In order to develop similar tests with similar qualities, the key reliability andvalidityaspectstakenundercontrolinthisstudymayneedtobefollowed.Firstofall, theKR-20,which isavariantof thealpha internalconsistencycoefficient forbinaryitems,washighfortheteststhatweredrawnfromthefulltestinthepresent

333


study.Moreover, thesub-testsformedinthestudywerehighlyunidimensional.Insubsequentstudies,testswithhighinternalconsistencyandhighunidimensionalitymaybeneededinordertoobtainsimilarresults.Aspartofcontentvalidityconcerns,theitemsinthedevelopedtestwerewrittenbysubjectmatterexpertsinaccordancewith course curriculum, andmultiple expert opinionswere taken.Afterwards, thetestwasadministeredasarealtesttorealexaminees.Asimilarapproachmayalsobenecessaryinordertoobtainsimilarresults.

Present findings should not be deemed as definitive.However, they should bedeemedasacallforfurtherresearchonminimumsamplesizenecessarytoestimateitemparametersaccurately.Oneshouldnotethatthepresentfindingsmayonlybelimited to language-test development and short test lengths up to 30 items withqualitiessimilartothoseusedinthepresentstudy.Moreover,theparameterestimateswereobtainedusingMMLE,sothepresentfindingsmaybelimitedtoitemparameterestimationwithMMLE.

Thisstudyhasthrownupsomequestionsinneedoffurtherinvestigation.Anaturalprogressionofthisworkwouldbetoconductstudiesonthefollowingtopicsbyusingrealandsimulatedtestdata.Alanguagetestdatawasusedinthisstudy.Replicatingthis studywithdata from testsother than language testswouldbeofvalue to thepractitioners in the field. It would also be a valuable contribution to the field toinvestigatefeasibilityofusingsmallsamplesizesinitem-parameterestimationfortestswithmorethan30items.

ReferencesAkour,M.,&AlOmari,H.(2013).EmpiricalinvestigationofthestabilityofIRTitem-parameters

estimation. International Online Journal of Educational Sciences, 5(2), 291–301. Retrievedfromhttp://www.iojes.net//userfiles/Article/IOJES_980.pdf

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. NewYork, NY:MarcelDekker.

Baker, F. B. (1998). An investigation of the item parameter recovery of a Gibbs samplingprocedure. Applied Psychological Measurement, 22(2), 153–169. http://dx.doi.org/10.1177/01466216980222005

Baker,F.B.,&Kim,S.H.(2004).Item response theory: Parameter estimation techniques (2nded.).NewYork,NY:MarcelDekker.

Chernyshenko,O.S.,Stark,S.,Chan,K.,Drasgow,F.,&Williams,B.(2001).Fittingitemresponsetheory models to two personality inventories: Issues and insights.Multivariate Behavioral Research,36(4),523–562.http://dx.doi.org/10.1207/S15327906MBR3604_03

Chuah,S.C.,DrasgowF.,&Luecht,R.(2006). Howbigisbigenough?Samplesizerequirementsforcastitemparameterestimation. Applied Measurement in Education, 19(3), 241–255.http://dx.doi.org/10.1207/s15324818ame1903_5

334


DeMars,C.(2010).Item response theory: Understanding statistics measurement.NewYork,NY:OxfordUniversityPress.

Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling toquestionnairedevelopment,evaluation,andrefinement.Quality of Life Research,16(1),5–18.http://dx.doi.org/10.1007/s11136-007-9198-0

Embretson, S. E.,&Reise, S. P. (2000). Item response theory for psychologists.Mahwah,NJ:LawrenceErlbaumAssociates.

Fidelman,C.(2012,January).A retrospective linking of ECLS-K and ECLS-B reading scores.PaperpresentedatFederalCommitteeonStatisticalMethodologyPolicySeminar,Washington,DC.

Field,A.P.(2013).Discovering statistics using IBM SPSS Statistics: And sex and drugs and rock ‘n’ roll (4thed.).London,UK:Sage.

Gao,F.,&Chen,L. (2005).Bayesianor non-Bayesian:A comparison studyof itemparameterestimation in the three-parameter logisticmodel.Applied Measurement in Education,18(4),351–380.http://dx.doi.org/10.1207/s15324818ame1804_2

Goldman,S.H.,&Raju,N.S.(1986).Recoveryofone-andtwo-parameterlogisticitemparameters:Anempiricalstudy.Educational and Psychological Measurement,46(1),11–21.http://dx.doi.org/10.1177/0013164486461002

Guyer,R.,&Thompson,N.A.(2011).User’s manual for Xcalibre 4.1.St.Paul,MN:AssessmentSystemsCorporation.

Hambleton,R.K.,SwaminathanH.,&Rogers,H.J.(1991).Fundamentals of item response theory.NewburyPark,CA:Sage.

Hambleton,R.K.(1989).Principlesandselectedapplicationsofitemresponsetheory.InR.L.Linn(Ed.),Educational measurement (3rd ed.,pp.147–200).NewYork,NY:Macmillan.

Harwell, M. R., & Janosky, J. E. (1991).An empirical study of the effects of small datasetsandvaryingprior varianceson itemparameter estimation inBILOG.Applied Psychological Measurement, 15(3),279–291.http://dx.doi.org/10.1177/014662169101500308

Hulin,C.L.,Lissak,R.I.,&Drasgow,F.(1982).Recoveryof twoandthree-parameter logisticitem characteristic curves:AMonteCarlo study.Applied Psychological Measurement,6(3),249–260.http://dx.doi.org/10.1177/014662168200600301

InternationalBusinessMachinesCorporation.(2011).IBM SPSS Statistics for Windows, Version 20.0.Armonk,NY:Author.

Kline,R.B.(2005).Principles and practice of structural equation modeling(2nded.).NewYork,NY:GuilfordPress.

Lim,R.G.,&Drasgow,F.(1990).Evaluationoftwomethodsforestimatingitemresponsetheoryparameterswhenassessingdifferentialitemfunctioning.Journal of Applied Psychology,75(2),164–174.http://dx.doi.org/10.1037/0021-9010.75.2.164

Lord, F.M. (1968).An analysis of the verbal scholastic aptitude test using Birnbaum’s three-parameter logistic model. Educational and Psychological Measurement, 28(4), 989–1020.http://dx.doi.org/10.1177/001316446802800401

Lord,F.M.(1980).Applications of item response theory to practical testing problems.Hillsdale,NJ:LawrenceErlbaum.

335


Patsula,L.N.,&GessaroliM.E. (1995,April).A comparison of item parameter estimates and ICCs produced with TESTGRAF and BILOG under different test lengths and sample sizes.Paperpresentedat theannualmeetingof theNationalCouncilonMeasurement inEducation,SanFrancisco,CA.

Ree,M.J.,&Jensen,H.E.(1980).Effectsofsamplesizeonlinearequatingofitemcharacteristiccurveparameters.InD.J.Weiss(Ed.), Proceedings of the 1979 Computerized Adaptive Testing Conference.Minneapolis,MN:UniversityofMinnesota.

Rudner,L.M.(1998).An on-line, interactive, computer adaptive testing tutorial.Retrievedfromhttp://echo.edres.org:8080/scripts/cat/catdemo.htm

Sireci,S.G.(1991,June).Sample independent item parameters? An investigation of the stability of IRT item parameters estimated from small data sets.PaperpresentedattheannualConferenceofNortheasternEducationalResearchAssociation,NewYork,NY.

Stone,C.A. (1992).Recoveryofmarginalmaximumlikelihoodestimates in the two-parameterlogisticresponsemodel:AnevaluationofMultilog.Applied Psychological Measurement, 16(1),1–16.http://dx.doi.org/10.1177/014662169201600101

Swaminathan,H.,&Gifford,J.A.(1979).Estimation of parameters in the three-parameter latent trait model(ReportNo.90).Amherst,MA:UniversityofMassachusetts,SchoolofEducation,LaboratoryofPsychometricandEvaluationResearch.

Swaminathan, H., Hambleton, R. K., Sireci, S. G., Xing, D., & Rizavi, S. M. (2003). Smallsampleestimationindichotomousitemresponsemodels:Effectofpriorsbasedonjudgmentalinformationontheaccuracyofitemparameterestimates.Applied Psychological Measurement,27(1),27–51.http://dx.doi.org/10.1177/0146621602239475

Tang,K.L.,Way,W.D.,&Carey,P.A.(1993).The effect of small calibration sample sizes on TEOFL IRT-based equating (TOEFL technical report TR-7).Princeton,NJ:EducationalTestingService.

Thissen,D.,&Wainer,H.(1982).Somestandarderrorsinitemresponsetheory.Psychometrika,47(4),397–412.http://dx.doi.org/10.1007/BF02293705

Weiss,D.J.,&Minden,S.V.(2012).A comparison of item parameter estimates from Xcalibre 4.1 and Bilog-MG. St.Paul,MN:AssessmentSystemsCorporation.

Woods,A.,&Baker,R. (1985). Item response theory.Language Testing,2(2), 117–140.http://dx.doi.org/10.1177/026553228500200202

Wright,B.D.,&Stone,M.H.(1979).Best test design.Chicago,IL:MesaPress.

Yen,W.M.(1987).AcomparisonoftheefficiencyandaccuracyofBilogandLogist.Psychometrika,52(2),275–291.http://dx.doi.org/10.1007/BF02294241

Yoes,M. (1995).An updated comparison of micro-computer based item parameter estimation procedures used with the 3-parameter IRT model. Saint Paul, MN: Assessment SystemsCorporation.

the effects of test length and sample size on item ...lord’s (1968) study is the firstone of its...

Documents