the effects of test length and sample size on item ...lord’s (1968) study is the firstone of its...
TRANSCRIPT
Received: April 14, 2016Revision received: October 2, 2016Accepted: December 1, 2016OnlineFirst: December 25, 2016
Copyright © 2017 EDAMwww.estp.com.tr
DOI 10.12738/estp.2017.1.0270 February 2017 17(1) 321–335
Research Article
KURAM VE UYGULAMADA EĞİTİM BİLİMLERİ EDUCATIONAL SCIENCES: THEORY & PRACTICE
* Thispaperisapartofthefirstauthor’sPhDdissertationatthedepartmentofEducationalMeasurementandEvaluation,InstituteofSocialSciences,HacettepeUniversity,Ankara,Turkey. Anearlierversionofthispaperwaspresentedatthe77thAnnualmeetingofNationalCouncilonMeasurementinEducation,Chicago,April15-19,2015.
1Correspondence to:AlperŞahin(PhD),ModernLanguagesProgram,SchoolofForeignLanguages,MiddleEastTechnicalUniversityNorthernCyprusCampus,Kalkanlı,Güzelyurt,Mersin99738Turkey.Email:[email protected]
2Department of Educational Measurement and Evaluation, Faculty of Education, Hacettepe University, Ankara Turkey. Email:[email protected]
Citation:Şahin,A.,&Anıl,D.(2017).Theeffectsoftestlengthandsamplesizeonitemparametersinitemresponsetheory.Educational Sciences: Theory & Practice,17,321–335.http://dx.doi.org/10.12738/estp.2017.1.0270
Abstract
This study investigates the effects of sample size and test length on item-parameter estimation in test
development utilizing three unidimensional dichotomous models of item response theory (IRT). For this
purpose, a real language test comprised of 50 items was administered to 6,288 students. Data from this test
was used to obtain data sets of three test lengths (10, 20, and 30 items) and nine different sample sizes (150,
250, 350, 500, 750, 1,000, 2,000, 3,000 and 5,000 examinees). These data sets were then used to create
various research conditions in which test length, sample size, and IRT model variables were manipulated to
investigate item parameter estimation accuracy under different conditions. The results suggest that rather
than sample size or test length, the combination of these two variables is important and samples of 150, 250,
350, 500, and 750 examinees can be used to estimate item parameters accurately in three unidimensional
dichotomous IRT models, depending on test length and model employed.
Keywords
Small samples in IRT • Item parameter estimation • Item response theory •
Language testing • Short tests
Alper Şahin1
Middle East Technical University Northern Cyprus Campus
Duygu Anıl2
Hacettepe University
The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory*
322
EDUCATIONAL SCIENCES: THEORY & PRACTICE
Developedfollowingcontroversiesovertheintelligencetestingmovement(Baker,1992),classicaltesttheoryhassuccessfullyserveditspractitionersintermsoftestdevelopmentandinterpretationoftestresultsfordecades(Embretson&Reise,2000;Hambleton, Swaminathan,&Rogers, 1991).However, the need for a theory thatcanpredictexaminees’responsestoanyitem,evenifitemshaveneverbeenseenbytheexamineesbefore(Lord,1980),resultedinthedevelopmentofanewtesttheorywhosebasic conceptsweredevelopedover seven and ahalf years (Baker, 1992).Afterspendingalongtimesearchingforanagreed-uponname,itemresponsetheory(IRT)becamethecontemporarytheoreticalfoundationformeasurement(Embretson&Reise, 2000). It is nowwidely used by test publishers, aswell as educational,military,andindustrialinstitutions,intheirresearchontestdevelopment,testequating,and detecting differentially functioning items (Hambleton et al., 1991).However,despite theadvantages IRToffers, amajorobstacle to itsuse in testdevelopmentis itsdemanding requirement for calibration sample size.More specifically,whiledevelopingatest,“thedifficultyliesintheneedtoassessthepropertiesofitemsbytryingthemoutonasampleofsubjects”(Woods&Baker,1985,p.117).IRTrequiresthissamplesizetobelarge(around1,000)inordertoobtainaccurateitem-parameterestimates(Hambleton,1989)thatresultsinaccurateestimatesofability,uponwhichsomehigh-stakesdecisionsaremade.
Thereisalargevolumeofpublishedstudiesintheliteratureontheeffectsofsamplesize and test length on item-parameter estimation in IRT-based test development.Lord’s(1968)study isthefirstoneofitskindinwhichhe investigatedthesamplesizesrequiredforestimatingitemparameters(a-itemdiscrimination,b-itemdifficulty,andc-pseudochance)accuratelyinthethree-parameterlogisticmodel(3PLM)usingdatafromtheScholasticAptitudeTest.Asaresult,Lordconcludedthataminimumof50itemsand1,000examineeswererequiredtoestimateaparameterswithhighaccuracy.Withthesupportofsubsequentstudies(Patsula&Gessaroli,1995;Tang,Way, & Carey, 1993;Yen, 1987;Yoes, 1995), 1,000 was taken as theminimumsamplesizerequiredforaccurateitem-parameterestimationinIRT.
Studies have also been conducted that investigated the effects of using samplesizes of less than a 1,000 on item parameter estimates. However, their findingshave tremendousdiscrepancies.To illustrate, a limitednumberof studiesonone-parameterlogisticmodel(1PLM)havesuggestedthat250(Goldman&Raju,1986),300(Guyer&Thompson,2011),or500(Thissen&Wainer,1982)aretheminimumfeasiblesamplesizesforthisdichotomousunidimensionalmodel.Asthecomplexityof themodel increases, thediscrepancy infindingsalso increases.Forexample,asampleof250with15 items(Harwell&Janosky,1991),of500(Stone,1992)or750(Lim&Drasgow,1990)with20items,of200(Weiss&Minden,2012)or250(Harwell&Janosky,1991)with25items,of500with30items(Hulin,Lissak,&Drasgow,1982),orof300with75items(Yoes,1995)havebeensuggestedfortwo-
323
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
parameterlogisticmodel(2PLM).Thesituationgetsmoreconfusingfor3PLM.Asampleof1,000with20items(Patsula&Gessaroli,1995;Swaminathan&Gifford,1979;Yen,1987);of200(Weiss&Minden,2012)with25items;of500(Akour&Al-Omari,2013)with30items;of300(Chuah,Drasgow,&Luecht,2006)with50items;of1,000with40(Patsula&Gessaroli,1995;Tangetal.,1993;Yen,1987),50(Lord,1968),60(Hulinetal.,1982)and75(Yoes,1995)items;andof500(Ree&Jensen,1980)with80itemshavebeensuggestedin3PLM.
Whilenumerousstudieshavebeenconductedontheeffectsofsamplesizeandtest lengthonitem-parameteraccuracyinIRT-basedtestdevelopment, thepresentstudydiffersfromthemintwoimportantaspects.Firstly,thedatausedinpreviousresearchhasbeenmostlysimulateddata,anditisnotknownwhethersimulateddatareflects thequalitiesofa real test (Sireci,1991). In thismanner, theresultsof thepresentstudyareexpectedtobeclosertorealtestdevelopmentthanmostpreviousresearch.Moreover,limitednumbersofsamplesizes,mostly4or5,weretestedinthepreviousresearch.However,alargenumber(9)ofsmallandlargesamplesizesaretestedatthesametimeinthepresentstudy,thusmakingitpossibletocompareandcontrastresultsfrombothsides.Thatmakesthisstudyuniqueintheliterature.Withthehelpofthesetwoaspects,thisstudysetsouttoprovideallIRTpractitionerswithablueprintofhowlargeorsmallasamplesizecanbetoestimateitemparametersaccuratelyinIRT-basedtestdevelopmentbyansweringthequestion“Howdosamplesizeandtestlengthaffectitem-parameterestimationinIRT-basedtestdevelopment?”
Method
Participants and Data Collection InstrumentTheparticipantsinthisfoundationalstudyare6,288freshmanstudentsenrolledin
anEnglishlanguagecourseatalargeuniversity.Thedatacollectioninstrumentwasa50-itemEnglishlanguagetest.Inordertofulfillthecontentvalidityconcerns,itemsinthetestwerewritteninparallelwiththecourseobjectivesbytheEnglishlanguageinstructorswhoteachthecourseandaredeemedassubjectmatterexperts.Moreover,beforeadministeringthetest,expertopinionswerereceivedfromadoctoralcandidateandlanguagetestingspecialistsworkingforthetestingofficeattheuniversitywherethe data was collected. In this way, the itemswere revisedmultiple times baseduponexpertopinionsuntilreachingtheirfinalform.Afterwards,theinstrumentwasadministeredsimultaneouslyto6,288testtakersinonesession.
Item Selection for Shorter Tests Itemsfromthefulldatasetwereselectedtoformsub-testsofdifferentlengths.
Whiledecidingthenumberofitemsforshortertests, previousstudies(Baker,1998;
324
EDUCATIONAL SCIENCES: THEORY & PRACTICE
Gao&Chen, 2005; Patsula&Gessaroli, 1995;Yen, 1987) in the literaturewereconsidered. So, from the 50-item full data set (6,288 x 50), sub-tests of n = 30, n=20,andn=10itemswereselected.Whileformingsub-tests,unidimensionalityassumptionofIRTwassatisfied.Unidimensionalityreferstohavingasinglecommonfactorunderlyingexaminees’correctresponsestoitems.Inordertoselecttheitemthatcollectivelysatisfyunidimensionality,exploratoryfactoranalysisonthetetrachoricinter-itemcorrelationmatrixofthefulldatawasused(Edelen&Reeve,2007).Itemsthat had factor loadings less than0.30 for thefirst factorwere removed from thedataas they failed to loadon thefirst factoradequately (Fidelman,2012).Outofthe50items,30itemsthathadthehighestfactor loadingsonthefirstfactorwerechosenastheitemsforthe30-itemtest.Thetetrachoriccorrelationmatrixofthese30itemswascomputedandanotherexploratoryfactoranalysiswasdoneonthismatrix.Then,20itemswithhighestfactorloadingsonthefirstfactorwereselectedforthe20-itemtest.Thesameprocedurewasfollowedtoselecttheitemsfor10-itemtest. AsummaryoftheresultsrelatedtotheunidimensionalityassumptionandthefactorloadingscanbefoundinTable1.
Table1Summary of Exploratory Factor Analysis in Item SelectionTestLength Variance λ1/λ2 RangeofFactorLoadings30-ItemTest 37.50% 8.50 0.45to0.7520-ItemTest 43.78% 8.15 0.59to0.7510-ItemTest 51.29% 6.10 0.68to0.77
Note. Variance=Varianceaccountedforbythefirstfactor,λ1/λ2=ratioofthefirsttothesecondeigenvalue.
Lord(1980)suggeststhatifthefirsteigenvalueislargecomparedtothesecondone,andifthesecondoneisnotmuchlargerthananyoftheothereigenvalues,thiscanbetakenasaproofofunidimensionality.Thesewereobservedinthedatasetsofthecurrentstudy.Apartfromthese,moderatetohighfactorloadingsforthefirstfactorandmoderatetohighvarianceaccountedforbythefirstfactorwerealsotakenasproofofunidimensionality.Uponcompletingtheitemselectionforthesub-tests,theKR-20internalconsistencycoefficientwascalculatedforeachtest,whichwerefoundtobe0.76(10-itemtest),0.85(20-itemtest),and0.88(30-itemtest).
Sampling of ExamineesAfterselectingtheitemsthatwouldbeusedforeachsub-test(10-item,20-item,
and30-item),ninedifferentsamplesizes(N=150,250,350,500,750,1,000,2,000,3,000,and5,000)weredrawnfromthefulldatasetforeachtestbystratifiedrandomsampling using the complex samplesmodule of SPSS 20 (InternationalBusinessMachinesCorporation,2011).Thiswasdoneinordertosimulatedifferentresearchconditions(e.g.,250examinees’responsesto20items).Whiledecidingsamplesizes,thesamplesizesmostfrequentlyusedinpreviousIRTresearchonsamplesizewere
325
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
reviewed.Asaresult,samplesizesof150,250,500,1,000,2,000,3,000,and5,000werefoundtobeusedthemostinsimilarstudies(Akour&AlOmari,2013;Baker,1998;Goldman&Raju, 1986;Hulin et al., 1982;Lord, 1968;Tang et al., 1993; Thissen&Wainer,1982;Yen,1987). Samplesof350,whichhadneverbeentested,andof750,whichhadonlybeenpreviouslytestedonce(Lim&Drasgow,1990),werealsoaddedtothestudyasalternativesto500and1,000.Moreover,whilesamplingtheexaminees,theircollegeswereusedasstratatoreflectexamineediversityinthefulldataacrosssamples.Thisprocessresultedin30(SeeTable2)differentdatasets,includingthefulldataset(markedwithan*).
Table2Data Sets of the Study
SampleSizeTestLength 150 250 350 500 750 1,000 2,000 3,000 5,000 6,288* Total
10 1 1 1 1 1 1 1 1 1 1 1020 1 1 1 1 1 1 1 1 1 1 1030 1 1 1 1 1 1 1 1 1 1 10Total 3 3 3 3 3 3 3 3 3 3 30
Item Parameter EstimationAfter sampling,itemparameters(itemdifficulty(b)in1PLM;banddiscrimination(a)
inthe2PLM;anda,b,andpseudo-chance(c)parametersinthe3PLM)wereestimatedunder90differentresearchconditions(3IRTmodelsx30datasets)usingXcalibre4.1(Guyer&Thompson,2011)withmarginalmaximumlikelihoodestimation(MMLE)andthedefaultoptionsreadilyavailableintheprogram.Theitemparametersestimatedfromthefulldataset(N =6,288,n=50)weretakenasthe“true”(baseline)parametersagainstwhichtheaccuracyoftheitemparametersestimatedfromtheothersamplesinthestudycouldbecomparedandcontrasted.Asthedatahadaconsiderablylargenumberofexaminees,itwaspossibletoestimateitemparameterswithareasonableamount of accuracy and thus could be considered as baseline parameter estimates(Swaminathan,Hambleton,Sireci,Xing,&Rizavi,2003).
Estimation Evaluation Criteria Accuracyoftheparameterestimatesunderdifferentresearchconditionswasevaluated
throughproduct-momentcorrelations(Gao&Chen,2005;Harwell&Janosky,1991;Hulin et al., 1982; Swaminathan et al., 2003;Tang et al., 1993;Yen, 1987) and theroot-mean-squareddifference(RMSD;Gao&Chen,2005;Harwell&Janosky,1991;Swaminathanetal.,2003;Yen,1987)betweenthebaselineandestimatedparameters,ashasbeendoneinsimilarstudies.RMSDisdefinedasinEquation1below.
RMSD= (1)
326
EDUCATIONAL SCIENCES: THEORY & PRACTICE
p�i representsoneof theestimated itemparameters(a,b,orc) for item i,andpi representsthecorrespondingbaselineparameter,takenasthetrueitemparameterforitemi.Underallresearchconditions,correlationsofr≥0.70,whichisconsideredacceptable (Yoes,1995)andsometimesashigh (Field,2013),andRMSD≤0.33,which corresponds to the classical reliability value of 0.90 (Rudner, 1998), weretakenasthecriteriaforminimumfeasiblesamplesizeforthatparticulartestlengthandIRTmodel.
Model Data FitAnitem-by-itemfitanalysiswasnotconducted;rather,anoverallmodel-datafit
analysiswasusedonalldata setsbecause itmightbepossible tohave items thatfitoneitemresponsemodelinadatasetwhilefailingtofitthemodelinotherdatasets.Thiswouldmakeitdifficulttocomparethesameitems’performancesinthreeIRTmodelswithninesamplesizesandthreetestlengthsusingrealtestdatafromalimitednumberofitems.Astheindicatorofoverallmodel-datafit,themodelchi-squarereportedbyXcalibre4.1(Guyer&Thompson,2011)wasused.Aschi-squareis sensitive to large sample sizes, chi-square divided by the degrees of freedom (x2/df)wasusedasitisnotsensitivetolargesamplesizes(Kline,2005).Asummaryofthemodel-datafitanalysisbasedonx2/dfisshowninTable3.
Table3Summary of the Model Data Fit Analysis Based on x2/df
N
n=10 n=20 n=301PLM 2PLM 3PLM 1PLM 2PLM 3PLM 1PLM 2PLM 3PLMx2/df x2/df x2/df x2/df x2/df x2/df x2/df x2/df x2/df
150 0.4 0.4 0.6 0.5 0.5 0.5 0.5 0.4 0.4250 0.5 0.4 0.5 0.5 0.5 0.5 0.5 0.4 0.4350 0.6 0.8 0.9 0.5 0.4 0.5 0.6 0.4 0.4500 0.6 1.0 1.4 0.5 0.4 0.4 0.5 0.4 0.4750 1.2 0.9 1.5 0.7 0.5 0.5 0.8 0.4 0.41,000 0.9 0.9 2.1 0.6 0.5 0.6 0.9 0.5 0.52,000 2.1 1.7 2.7 0.9 0.7 0.8 1.3 0.5 0.63,000 2.7 2.5 4.7 1.0 0.7 0.9 1.9 0.6 0.75,000 4.0 3.3 8.0 1.4 1.4 1.3 2.9 0.8 0.86,288 5.0 3.9 9.4 1.7 1.6 1.5 3.4 0.9 0.9
Valuesof3orlessaresuggestedasanindicatorofmodelfit(Chernyshenko,Stark,Chan,Drasgow,&Williams,2001).Table3indicatesmisfitoffewlargedatatothemodel.Thesewereignored,astheywerethoughttoresultfromtheverylargesizeofthesedatasets.
327
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
FindingsFigures 1, 2, and 3 show correlations and RMSD values obtained from the
parameterestimates.AsFigure1shows,correlationspertainingtothebparametersestimatedin1PLMrangedbetween0.939(N=150,n=10)and1.00(N=5,000; n =10,20,and30).Inaddition,RMSDvaluespertainingtothebparameterestimatesrangedbetween0.33(N=150, n=10)and0.01(N=5,000;n=10).Ascanbeseenfromthefigures,asampleof150metbothcriteriaforbeingdeemedacceptablein1PLMforalltestlengths.
Figure 1.Correlations(left)andRMSD(right)valuesobtainedforthebparameterin1PLM.
2a. Correlationsfortheaparameterin2PLM 2b. RMSDfortheaparameterin2PLM
2c. Correlationsforthebparameterin2PLM 2d. RMSDforthebparameterin2PLMFigure 2.CorrelationsandRMSDvaluesobtainedfortheaandbparametersin2PLM.
328
EDUCATIONAL SCIENCES: THEORY & PRACTICE
3a.Correlationsfortheaparameterin3PLM 3b.RMSDfortheaparameterin3PLM
3c.Correlationsforthebparameterin3PLM 3d.RMSDforthebparameterin3PLM
3e.Correlationsforthecparameterin3PLM 3f.RMSDforthecparameterin3PLMFigure 3.CorrelationsandRMSDvaluesobtainedforthea, b,andcparametersin3PLM.
Thecorrelationspertainingtotheaparameter(Figure2a)in2PLMrangedbetween-0.311(N=250,n=10)and1.00(N=5,000;n=20,30).TheRMSDvalues(Figure2b)obtainedfromtheaparametersin2PLMvariedfrom0.16(N=150,n=20)to0.00(N=5,000;n=10,20).Asforthebparameters,thecorrelations(Figure2c)rangedbetween0.940(N=150,n=10)and1.00(N=2,000,3,000,5,000;n=10,20,30)whiletheRMSDvaluesforthebparameter(Figure2d)rangedbetween0.34
329
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
(N=150,n=10)and0.02(N=5,000;n=10,20).Inlinewiththese,theminimumsamplesizethatmetbothevaluationcriteriaforbothparameters(aandb)estimatedin2PLMwas750(raâ =0.706,RMSD=0.09;rbb=0.999,RMSD=0.05)for the 10-item test, 500 (raâ =0.795,RMSD=0.08; rbb =0.985,RMSD=0.13) for the20-itemtest,and250(raâ =0.830,RMSD=0.10;rbb=0.978,RMSD=0.17)forthe30-itemtestin2PLM.
The correlations pertaining to the a parameter estimates in 3PLM (Figure 3a)rangedbetween0.207(N=250,n=10)and0.995(N=5,000;n=20,30),whereastheRMSDvalues obtained from thea parameter estimates in 3PLM (Figure 3b)ranged between 0.23 (N = 250, n = 10) and 0.02 (N = 5,000; n = 10, 20). Thecorrelationspertainingtothebparameterestimates(Figure3c)rangedbetween0.944(N=150,n=10)and1.00(N=2,000,3,000,5,000;n=10,20,30)whiletheRMSDvaluesobtainedfromthebparameterestimates(Figure3d)rangedbetween0.37(N =150,n=10)and0.02(N=5,000;n=10).Asforthecparameter,thecorrelations (Figure3e)rangedbetween0.415(N=350,n=10)and0.996(N=5,000,n=20)whereastheRMSDvaluespertainingtothecparameterestimates(Figure3f)rangedbetween0.05(N=150,n=30;N=250,n=10,30)and0.01(N =1,000,2,000,3,000,5,000,n =10;N =5,000,n =20;N =3,000,5,000,n =30).Asaresult,thesmallestsamplesizesthatmetthecriteriaforallparameters(a,b,andc)estimatedin3PLMwere750(raâ =0.901,RMSD=0.10;rbb =0.998,RMSD=0.10;rcc=0.992,RMSD=0.03)whenn=10;750(raâ =0.840,RMSD=0.11;rbb =0.995,RMSD=0.11;rcc=0.829,RMSD=0.03)whenn=20;and350(raâ =0.749,RMSD=0.16;rbb
=0.987,RMSD=0.13;rcc=0.734,RMSD=0.03)whenn=30.
DiscussionThisstudyaimedtoinvestigatetheeffectsofsamplesizeandtestlengthonitem-
parameterestimationaccuracyinIRT-basedtestdevelopment.Theresultsindicateanassociationbetweentestlengthandsamplesizethatamplifiesasthenumberofitemsinthetestandthenumberofparametersinthemodelincrease.Asmentionedearlier,therearetwocriteriasetforestimationevaluation.Oneisthatr≥0.70;theotheristhat RMSD≤0.33betweenthebaselineandestimateditemparameters.AsummaryoftheminimumsamplesizesthatmetbothofthesecriteriainallIRTmodelscanbefoundinTable4.
Table4Summary of the Findings
TestLength 1PLM 2PLM 3PLM10 150 750 75020 150 500 75030 150 250 350
330
EDUCATIONAL SCIENCES: THEORY & PRACTICE
AsTable4illustrates,thefindingsofthepresentstudysuggestthatasampleofaslowas150examineescouldbeusedin1PLMwithtestsof10,20,or30itemstoaccuratelyestimatethebparameter.Sucharesultwaspartiallyexpectedastherehadalreadybeensomesignsregardingtheviabilityofusing150examineesfor1PLMin thecurrent literature.Morespecifically, someauthorshavealreadysuggestedasamplesizeofaround200asappropriatefor1PLM(Demars,2010;Wright&Stone,1979)withnospecificreferencetotestlength.Thepresentfindingisimportantasitconfirmstheseauthors’findingswithaspecificreferencetotestlength.Moreover,somepreviousresearchhadsimilarfindings,likeN=250(Goldman&Raju,1986)or N =300(Guyer&Thompson,2011)with78and50items,respectively.Thesesamplesizesandtestlengthswerethesmallestsamplesizesandshortesttestlengthstestedinthesestudies.Thus,thepresentfinding,whichsuggestsasamplesizeof150intestsof10,20,or30items,mayconstituteaviableupdatetothecurrentfindingsin the literature.However, although the correlations andRMSDvalues similar tosmallersamplesizesconfirmedthatthefulldata,aswellasN =5,000for10-,20-,and30-itemtestsandN = 3,000in30-itemtests,fitthe1PLMasparameterinvariancewasreachedamongthesamples,thefindingspertainingto1PLMmaystillbebiased(asx2/dfvaluesindicatedmisfit)andneedtobeinterpretedaccordingly.
AsseeninTable4,theminimumsamplesizesuggestedfora10-itemtestin2PLMwas750.Thiswasaninterestingfindingfortworeasons.First,atestof10itemsin2PLMwasnotpopularinpreviousstudies.Onlythreestudiestested10-itemtestsin2PLM (Baker, 1998;Stone, 1992;Yen, 1987).However, theywereunable toyieldacceptableaccuracywithit.Thesamplesizestestedinthesestudieswere1,000(Yen,1987);30,50,60,120,and500(Baker,1998);and250,500,and1,000(Stone,1992).ThepresentfindingdifferedfromYen’s(1987)andStone’s(1992)studiesastheydidnotreport1,000asaviablesamplesizefor10items.Onthecontrary,inthepresentstudy,theitemparametersof10itemswith1,000examineeresponseswereestimatedwithhighaccuracy(raâ =0.864,RMSD=0.03;rbb =0.964,RMSD=0.14).Second,as statedearlier, a samplesizeof750wasnotapopularalternative to1,000 in thepreviousresearch.Onlyonestudy(Lim&Drasgow,1990)testedtheperformanceof750with20itemsandreportedthattheaandbparameterestimateswereclosetotheirtruevalues.Thisisclearlyinparallelwiththepresentfindingwithashortertestlength.
The present findings indicate that accurate estimates of item parameters can beobtainedwhenN=500andn=20in2PLM.Therehavebeenstudiesthatsuggesteda sample of 500 in 2PLMwith 30- (Hulin et al., 1982) and 78-item (Goldman&Raju,1986)tests.However,itshouldbenotedthatHulinetal.(1982)onlytested15-,30-,and60-itemtestsandusedjointmaximumlikelihoodestimation(JMLE)astheparameterestimationmethod.Thiscouldaccordinglycause lessaccurateestimationofitemparametersinsmallersamplesizesandshortertestlengthsasJMLEismore
331
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
efficientwithsamplesizesofover1,000andtestswithmorethan60items(Baker&Kim,2004).Moreover,itshouldalsobenotedthat78wastheonlytestlengthtestedbyGoldmanandRaju(1986).Forthesereasons,thecomparabilityofthesetwostudieswiththepresentstudyispoor.However,itisencouragingtocomparethepresentfindingwiththatofStone’s(1992),whoreportedanRMSDvalueofapproximately0.30whenN=500andn=20,indicatingaparallelfindingwiththepresentstudy.
Theresultsofthepresentstudyalsosuggestthatusingasampleof250with30itemsin2PLMisviable.ThisisconsistentwiththefindingsofHarwellandJanosky(1991),whoobtainedaccurateaandbparameterestimateswitha25-itemtestwhenN=250.Apartfromthis,Hulinetal.(1982)suggestedasamplesizeof500with30itemsin2PLM.Asmentionedearlier,duetotheestimationmethod(JMLE)theyhademployedtoestimate itemparameters, itwouldnotbewrongtoassumethat theycouldhaveobtainedasmallersamplesizewith30itemsiftheyhadusedtheMMLEestimationmethodusedinthepresentstudy.Moreover,GuyerandThompson(2011) obtainedaccurateaandbparametersin2PLMwhenN=300andn=50.Oneshouldnotethatn=50wastheshortesttestlengthandN=300wasthesmallestsamplesizetheytested,whichmeansthepresentstudyobtainedsimilarresultstestingasmallersamplesizeandshortertestlength.
AsshowninTable4,acceptableaccuracyisreachedwhenN=750andn=10in3PLM.Althoughtestswith10itemshadbeenpreviouslytestedintwostudies(Gao&Chen,2005;Yen,1987)in3PLM,therewasnosuggestionforthistestlengthinthecurrentliteraturefor3PLMaspreviousresearchdidnotyieldaccurateparameterestimates.Tobeginwith,whenGaoandChen’s(2005)studywasanalyzedintermsoftestlengthandsamplesize(N=100,500,2,000;n=10,30,60),onecanseethattheclosestsamplesizeto750was500.Theycouldnotyieldaccurateestimatesoftheaparameterwithasampleof500.Thepresentstudyconfirmedthat500examineeswith10itemsin3PLMyieldsverypoorestimatesfortheaparameter(raâ=0.345,RMSD=0.19).GaoandChen(2005)didnottest750oreven1,000intheirstudyasanalternativeto2,000.However,750wastestedinthepresentstudy,andaccurateestimatesoftheaparameterhavebeenobtained.Secondly,Yen(1987)obtainedpooraccuracywithasampleof1,000and10itemsin3PLM.Onthecontrary,thepresentfindingsindicatedthatwhenN=750andn=10(raâ =0.901,RMSD=0.10;rbb = 1.00,RMSD=0.10;rcc =0.922,RMSD=0.03),theitemparameterswereestimatedasaccuratelyastheywerewhenN=1,000(raâ =0.929,RMSD=0.06;rbb =0.999,RMSD=0.14;rcc =0.938,RMSD=0.01)withthesametestlength.
Whenthetestlengthincreasedfrom10to20itemsin3PLM,thesamplesizewithwhichacceptableaccuracywasreachedwasalso750.PatsulaandGessaroli(1995) andYen(1987)hadpreviouslystudiedthe20-itemtestconditionandbothsuggestedsamplesof1,000with20items.Yen(1987)studiedonlythesampleof1,000and
332
EDUCATIONAL SCIENCES: THEORY & PRACTICE
PatsulaandGessaroli(1995)didnothaveN =750intheirresearchdesign.Theyonlyhadsamplesof1,000and500astheclosestalternativesto750.Theyfoundthat500didnotyieldaccurateenoughestimates,whichwasalsothecaseinthepresentstudyunderthesameresearchcondition(raâ =0.614,RMSD=0.15;rbb =0.972,RMSD=0.19;rcc =0.820,RMSD=0.03).Assuch, itshouldnotbesurprisingthat theysuggested1,000astheminimumsamplesizefortheydidnothavetheopportunitytotest750asanalternative.Thepresentfindingsalsoindicatethat750isahighlyviablealternativeto1,000asitem-parameterestimatesinN=750(raâ =0.840,RMSD=0.11;rbb =0.995,RMSD=0.11;rcc =0.829,RMSD=0.03)areasaccurateaswhenN=1,000(raâ =0.874,RMSD=0.09;rbb =0.989,RMSD=0.12;rcc =0.915,RMSD=0.03)andn=20in3PLM.
As seen in Table 4, the sample size suggested for estimating item parametersaccurately in 3PLMwith 30 itemswas 350. Previously, two studies had tested asampleof500in3PLMwith30items(Akour&Al-Omari,2013;Gao&Chen,2005).Whenthesestudieswerefurtheranalyzed,itcanbeseenthatthesamplesizesof100and200werealsotestedinthesestudiesrespectively.Thus,N=350naturallyisn’tfoundinthecurrentliteratureasitwasnevertestedagainst500.Manyresearchers(Goldman&Raju,1986;Harwell&Janosky,1991;Patsula&Gessaroli,1995;Ree&Jensen,1980;Yoes,1995)insteadpreferredthesamplesizeof250asasmalleralternativeto500.Thatthepresentfindingsuggeststheuseofasamplesizeof350asviablewith30-itemtestsin3PLMthusshouldnotbeintriguing.
Asonecaninferfromthediscussioninthepreviousparagraphs,thefindingsofthepresentstudynotonlysupportthefindingsofpreviousresearchinthefield,butalsofurtheritsfindingsbyprovidingresearcherswithmoredataintermsofviablesamplesizesforIRT-basedtestdevelopment.Moreimportantly,Table4providesIRTpractitionerswithadetailedblueprintonwhatsamplesizetousewithwhichmodelandtestlength.Inaddition,thefindingspresentedinTable4suggestatremendousdecreaseintheminimumsamplesizerequiredforIRT-basedtestdevelopment.Thismeans that IRT practitioners will need to reach fewer examinees to pretest theiritems,andthiswillmakethedatacollectionprocesseasierandmorebudget-friendly.Fromthispointofview,thefindingsofthisstudywillnotonlybebeneficialforIRTpractitionerswhocanonlyreachalimitednumberofexaminees,butitwillalsohelptodecreasethedatacollection,itemanalysis,testproduction,andpersonnelcostsforcompaniesandinstitutionsthatneedtorecruitmorestafftomeettherequirementsofIRT-basedtestdevelopment.
In order to develop similar tests with similar qualities, the key reliability andvalidityaspectstakenundercontrolinthisstudymayneedtobefollowed.Firstofall, theKR-20,which isavariantof thealpha internalconsistencycoefficient forbinaryitems,washighfortheteststhatweredrawnfromthefulltestinthepresent
333
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
study.Moreover, thesub-testsformedinthestudywerehighlyunidimensional.Insubsequentstudies,testswithhighinternalconsistencyandhighunidimensionalitymaybeneededinordertoobtainsimilarresults.Aspartofcontentvalidityconcerns,theitemsinthedevelopedtestwerewrittenbysubjectmatterexpertsinaccordancewith course curriculum, andmultiple expert opinionswere taken.Afterwards, thetestwasadministeredasarealtesttorealexaminees.Asimilarapproachmayalsobenecessaryinordertoobtainsimilarresults.
Present findings should not be deemed as definitive.However, they should bedeemedasacallforfurtherresearchonminimumsamplesizenecessarytoestimateitemparametersaccurately.Oneshouldnotethatthepresentfindingsmayonlybelimited to language-test development and short test lengths up to 30 items withqualitiessimilartothoseusedinthepresentstudy.Moreover,theparameterestimateswereobtainedusingMMLE,sothepresentfindingsmaybelimitedtoitemparameterestimationwithMMLE.
Thisstudyhasthrownupsomequestionsinneedoffurtherinvestigation.Anaturalprogressionofthisworkwouldbetoconductstudiesonthefollowingtopicsbyusingrealandsimulatedtestdata.Alanguagetestdatawasusedinthisstudy.Replicatingthis studywithdata from testsother than language testswouldbeofvalue to thepractitioners in the field. It would also be a valuable contribution to the field toinvestigatefeasibilityofusingsmallsamplesizesinitem-parameterestimationfortestswithmorethan30items.
ReferencesAkour,M.,&AlOmari,H.(2013).EmpiricalinvestigationofthestabilityofIRTitem-parameters
estimation. International Online Journal of Educational Sciences, 5(2), 291–301. Retrievedfromhttp://www.iojes.net//userfiles/Article/IOJES_980.pdf
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. NewYork, NY:MarcelDekker.
Baker, F. B. (1998). An investigation of the item parameter recovery of a Gibbs samplingprocedure. Applied Psychological Measurement, 22(2), 153–169. http://dx.doi.org/10.1177/01466216980222005
Baker,F.B.,&Kim,S.H.(2004).Item response theory: Parameter estimation techniques (2nded.).NewYork,NY:MarcelDekker.
Chernyshenko,O.S.,Stark,S.,Chan,K.,Drasgow,F.,&Williams,B.(2001).Fittingitemresponsetheory models to two personality inventories: Issues and insights.Multivariate Behavioral Research,36(4),523–562.http://dx.doi.org/10.1207/S15327906MBR3604_03
Chuah,S.C.,DrasgowF.,&Luecht,R.(2006). Howbigisbigenough?Samplesizerequirementsforcastitemparameterestimation. Applied Measurement in Education, 19(3), 241–255.http://dx.doi.org/10.1207/s15324818ame1903_5
334
EDUCATIONAL SCIENCES: THEORY & PRACTICE
DeMars,C.(2010).Item response theory: Understanding statistics measurement.NewYork,NY:OxfordUniversityPress.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling toquestionnairedevelopment,evaluation,andrefinement.Quality of Life Research,16(1),5–18.http://dx.doi.org/10.1007/s11136-007-9198-0
Embretson, S. E.,&Reise, S. P. (2000). Item response theory for psychologists.Mahwah,NJ:LawrenceErlbaumAssociates.
Fidelman,C.(2012,January).A retrospective linking of ECLS-K and ECLS-B reading scores.PaperpresentedatFederalCommitteeonStatisticalMethodologyPolicySeminar,Washington,DC.
Field,A.P.(2013).Discovering statistics using IBM SPSS Statistics: And sex and drugs and rock ‘n’ roll (4thed.).London,UK:Sage.
Gao,F.,&Chen,L. (2005).Bayesianor non-Bayesian:A comparison studyof itemparameterestimation in the three-parameter logisticmodel.Applied Measurement in Education,18(4),351–380.http://dx.doi.org/10.1207/s15324818ame1804_2
Goldman,S.H.,&Raju,N.S.(1986).Recoveryofone-andtwo-parameterlogisticitemparameters:Anempiricalstudy.Educational and Psychological Measurement,46(1),11–21.http://dx.doi.org/10.1177/0013164486461002
Guyer,R.,&Thompson,N.A.(2011).User’s manual for Xcalibre 4.1.St.Paul,MN:AssessmentSystemsCorporation.
Hambleton,R.K.,SwaminathanH.,&Rogers,H.J.(1991).Fundamentals of item response theory.NewburyPark,CA:Sage.
Hambleton,R.K.(1989).Principlesandselectedapplicationsofitemresponsetheory.InR.L.Linn(Ed.),Educational measurement (3rd ed.,pp.147–200).NewYork,NY:Macmillan.
Harwell, M. R., & Janosky, J. E. (1991).An empirical study of the effects of small datasetsandvaryingprior varianceson itemparameter estimation inBILOG.Applied Psychological Measurement, 15(3),279–291.http://dx.doi.org/10.1177/014662169101500308
Hulin,C.L.,Lissak,R.I.,&Drasgow,F.(1982).Recoveryof twoandthree-parameter logisticitem characteristic curves:AMonteCarlo study.Applied Psychological Measurement,6(3),249–260.http://dx.doi.org/10.1177/014662168200600301
InternationalBusinessMachinesCorporation.(2011).IBM SPSS Statistics for Windows, Version 20.0.Armonk,NY:Author.
Kline,R.B.(2005).Principles and practice of structural equation modeling(2nded.).NewYork,NY:GuilfordPress.
Lim,R.G.,&Drasgow,F.(1990).Evaluationoftwomethodsforestimatingitemresponsetheoryparameterswhenassessingdifferentialitemfunctioning.Journal of Applied Psychology,75(2),164–174.http://dx.doi.org/10.1037/0021-9010.75.2.164
Lord, F.M. (1968).An analysis of the verbal scholastic aptitude test using Birnbaum’s three-parameter logistic model. Educational and Psychological Measurement, 28(4), 989–1020.http://dx.doi.org/10.1177/001316446802800401
Lord,F.M.(1980).Applications of item response theory to practical testing problems.Hillsdale,NJ:LawrenceErlbaum.
335
Şahin, Anıl / The Effects of Test Length and Sample Size on Item Parameters in Item Response Theory
Patsula,L.N.,&GessaroliM.E. (1995,April).A comparison of item parameter estimates and ICCs produced with TESTGRAF and BILOG under different test lengths and sample sizes.Paperpresentedat theannualmeetingof theNationalCouncilonMeasurement inEducation,SanFrancisco,CA.
Ree,M.J.,&Jensen,H.E.(1980).Effectsofsamplesizeonlinearequatingofitemcharacteristiccurveparameters.InD.J.Weiss(Ed.), Proceedings of the 1979 Computerized Adaptive Testing Conference.Minneapolis,MN:UniversityofMinnesota.
Rudner,L.M.(1998).An on-line, interactive, computer adaptive testing tutorial.Retrievedfromhttp://echo.edres.org:8080/scripts/cat/catdemo.htm
Sireci,S.G.(1991,June).Sample independent item parameters? An investigation of the stability of IRT item parameters estimated from small data sets.PaperpresentedattheannualConferenceofNortheasternEducationalResearchAssociation,NewYork,NY.
Stone,C.A. (1992).Recoveryofmarginalmaximumlikelihoodestimates in the two-parameterlogisticresponsemodel:AnevaluationofMultilog.Applied Psychological Measurement, 16(1),1–16.http://dx.doi.org/10.1177/014662169201600101
Swaminathan,H.,&Gifford,J.A.(1979).Estimation of parameters in the three-parameter latent trait model(ReportNo.90).Amherst,MA:UniversityofMassachusetts,SchoolofEducation,LaboratoryofPsychometricandEvaluationResearch.
Swaminathan, H., Hambleton, R. K., Sireci, S. G., Xing, D., & Rizavi, S. M. (2003). Smallsampleestimationindichotomousitemresponsemodels:Effectofpriorsbasedonjudgmentalinformationontheaccuracyofitemparameterestimates.Applied Psychological Measurement,27(1),27–51.http://dx.doi.org/10.1177/0146621602239475
Tang,K.L.,Way,W.D.,&Carey,P.A.(1993).The effect of small calibration sample sizes on TEOFL IRT-based equating (TOEFL technical report TR-7).Princeton,NJ:EducationalTestingService.
Thissen,D.,&Wainer,H.(1982).Somestandarderrorsinitemresponsetheory.Psychometrika,47(4),397–412.http://dx.doi.org/10.1007/BF02293705
Weiss,D.J.,&Minden,S.V.(2012).A comparison of item parameter estimates from Xcalibre 4.1 and Bilog-MG. St.Paul,MN:AssessmentSystemsCorporation.
Woods,A.,&Baker,R. (1985). Item response theory.Language Testing,2(2), 117–140.http://dx.doi.org/10.1177/026553228500200202
Wright,B.D.,&Stone,M.H.(1979).Best test design.Chicago,IL:MesaPress.
Yen,W.M.(1987).AcomparisonoftheefficiencyandaccuracyofBilogandLogist.Psychometrika,52(2),275–291.http://dx.doi.org/10.1007/BF02294241
Yoes,M. (1995).An updated comparison of micro-computer based item parameter estimation procedures used with the 3-parameter IRT model. Saint Paul, MN: Assessment SystemsCorporation.