validity evaluation of the lawson classroom test of

19
Validity evaluation of the Lawson classroom test of scientific reasoning Lei Bao, 1,* Yang Xiao, 1,2 Kathleen Koenig, 3 and Jing Han 1 1 The Ohio State University, Columbus, Ohio 43210, USA 2 South China Normal University, Guangzhou, Guangdong 510631, China 3 University of Cincinnati, Cincinnati, Ohio 45221, USA (Received 15 June 2018; published 17 August 2018) In science, technology, engineering, and mathematics education there has been increased emphasis on teaching goals that include not only the learning of content knowledge but also the development of scientific reasoning skills. The Lawson classroom test of scientific reasoning (LCTSR) is a popular assessment instrument for scientific reasoning. Through large scale applications, however, several issues have been observed regarding the validity of the LCTSR. This paper will review the literature on the assessment of scientific reasoning and provide a detailed analysis of the current version of LCTSR in regards to its validity and measurement features. The results suggest that the LCTSR is a practical tool for assessing a unidimensional scale of scientific reasoning. The instrument has a good overall reliability for the whole test. However, inspections of individual question pairs reveal a variety of validity concerns for five of the total twelve question pairs. These five question pairs have a considerable amount of inconsistent response patterns. Qualitative analysis also indicates wide ranging question design issues. Therefore, assessment of subskills involved with the five question pairs may have less power due to their questionable validity. DOI: 10.1103/PhysRevPhysEducRes.14.020106 I. INTRODUCTION In science, technology, engineering, and mathematics (STEM) education, widely accepted teaching goals include not only the learning of broad content knowledge, but also the development of general scientific reasoning abilities. These abilities are considered necessary for contributing to the workforce and global economy of the 21st century [1]. Scientific reasoning, which is closely related to formal operational reasoning[2] and critical thinking[3], represents the thinking and reasoning skills involved during inquiry, experimentation, evaluation of evidence, inference, and argument that support the for- mation and modification of concepts and theories about the natural and social world [46]. Early research suggests that scientific reasoning abilities can be developed through training and can be transferred to support learning in other areas [7,8]. Training in scientific reasoning abilities can also have a long-term impact on studentsacademic achievement, making their value broadly appealing rather than subject limited [7]. More recently, there has been increased interest in researching student abilities in scientific reasoning. For example, Bao et al. [4] reported that the traditional style of STEM education has little impact on the development of studentsscientific reasoning abilities. Meanwhile, correla- tional studies have found that scientific reasoning abilities positively correlate with course achievement [9,10], per- formances on concept tests [11,12], and success on transfer reasoning questions [13,14]. Such results prompt the need for more widespread and proactive approaches in improv- ing educational environments to more actively target scientific reasoning. To move forward, educators and researchers need good assessment tools that can be easily applied in large scale settings and produce valid results for evaluating scientific reasoning abilities. This is important for gauging the abilities of students such that appropriate skill level teaching methods can be implemented, as well as for evaluating studentslearning gains in scientific reason- ing under different educational settings. Historically, the Piagetian clinical interview was one of the early methods used to assess studentsformal reasoning abilities. Such a method is cost and time intensive, however, making it difficult for classroom practices [15,16]. Guided by Piagetian tasks, a number of researchers developed their own instruments in assessing student scientific reasoning abilities, such as the group assessment of logical thinking test (GALT) [17], the test of logical thinking (TOLT) [18], and Lawsons classroom test of * Corresponding author. [email protected] Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published articles title, journal citation, and DOI. PHYSICAL REVIEW PHYSICS EDUCATION RESEARCH 14, 020106 (2018) 2469-9896=18=14(2)=020106(19) 020106-1 Published by the American Physical Society

Upload: others

Post on 17-Apr-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Validity evaluation of the Lawson classroom test of

Validity evaluation of the Lawson classroom test of scientific reasoning

Lei Bao1 Yang Xiao12 Kathleen Koenig3 and Jing Han11The Ohio State University Columbus Ohio 43210 USA

2South China Normal University Guangzhou Guangdong 510631 China3University of Cincinnati Cincinnati Ohio 45221 USA

(Received 15 June 2018 published 17 August 2018)

In science technology engineering and mathematics education there has been increased emphasis onteaching goals that include not only the learning of content knowledge but also the development ofscientific reasoning skills The Lawson classroom test of scientific reasoning (LCTSR) is a popularassessment instrument for scientific reasoning Through large scale applications however several issueshave been observed regarding the validity of the LCTSR This paper will review the literature on theassessment of scientific reasoning and provide a detailed analysis of the current version of LCTSR inregards to its validity and measurement features The results suggest that the LCTSR is a practical toolfor assessing a unidimensional scale of scientific reasoning The instrument has a good overall reliabilityfor the whole test However inspections of individual question pairs reveal a variety of validity concerns forfive of the total twelve question pairs These five question pairs have a considerable amount of inconsistentresponse patterns Qualitative analysis also indicates wide ranging question design issues Thereforeassessment of subskills involved with the five question pairs may have less power due to their questionablevalidity

DOI 101103PhysRevPhysEducRes14020106

I INTRODUCTION

In science technology engineering and mathematics(STEM) education widely accepted teaching goalsinclude not only the learning of broad content knowledgebut also the development of general scientific reasoningabilities These abilities are considered necessary forcontributing to the workforce and global economy ofthe 21st century [1] Scientific reasoning which is closelyrelated to ldquoformal operational reasoningrdquo [2] and ldquocriticalthinkingrdquo [3] represents the thinking and reasoning skillsinvolved during inquiry experimentation evaluation ofevidence inference and argument that support the for-mation and modification of concepts and theories aboutthe natural and social world [4ndash6] Early research suggeststhat scientific reasoning abilities can be developedthrough training and can be transferred to support learningin other areas [78] Training in scientific reasoningabilities can also have a long-term impact on studentsrsquoacademic achievement making their value broadlyappealing rather than subject limited [7]

More recently there has been increased interest inresearching student abilities in scientific reasoning Forexample Bao et al [4] reported that the traditional style ofSTEM education has little impact on the development ofstudentsrsquo scientific reasoning abilities Meanwhile correla-tional studies have found that scientific reasoning abilitiespositively correlate with course achievement [910] per-formances on concept tests [1112] and success on transferreasoning questions [1314] Such results prompt the needfor more widespread and proactive approaches in improv-ing educational environments to more actively targetscientific reasoning To move forward educators andresearchers need good assessment tools that can be easilyapplied in large scale settings and produce valid results forevaluating scientific reasoning abilities This is importantfor gauging the abilities of students such that appropriateskill level teaching methods can be implemented as well asfor evaluating studentsrsquo learning gains in scientific reason-ing under different educational settingsHistorically the Piagetian clinical interview was one of

the early methods used to assess studentsrsquo formal reasoningabilities Such a method is cost and time intensivehowever making it difficult for classroom practices[1516] Guided by Piagetian tasks a number of researchersdeveloped their own instruments in assessing studentscientific reasoning abilities such as the group assessmentof logical thinking test (GALT) [17] the test of logicalthinking (TOLT) [18] and Lawsonrsquos classroom test of

Corresponding authorbao15osuedu

Published by the American Physical Society under the terms ofthe Creative Commons Attribution 40 International licenseFurther distribution of this work must maintain attribution tothe author(s) and the published articlersquos title journal citationand DOI

PHYSICAL REVIEW PHYSICS EDUCATION RESEARCH 14 020106 (2018)

2469-9896=18=14(2)=020106(19) 020106-1 Published by the American Physical Society

formal reasoning (CTFR-78) [16] The initial open-response version of the CTFR-78 was revised in the year2000 to become a 24-item multiple choice test [19] Thismost recent version is referred to as the Lawson classroomtest of scientific reasoning (LCTSR) and it has gained widepopularity in the STEM education community For exam-ple physics education researchers used the LCTSR to studythe relation between student scientific reasoning abilitiesand physics content learning In one study Coletta andPhillips [11] reported correlations (r asymp 05) between pre-post normalized gains on the Force Concept Inventory [20]and student reasoning abilitiesAlthough several studies [2122] investigated the validity

of the CTFR-78 research on the validity of the LCTSR islimited Nevertheless the instrument has become a standardassessment tool in education research Through our own useof the LCTSR however several issues have been observedconcerning question design and data interpretation Thisstudy evaluates the validity of the LCTSR based on large-scale assessment data and follow-up interviews The find-ings will further establish the validity of the LCTSR andreveal measurement features of the current design

II LITERATURE REVIEW ON ASSESSMENTOF SCIENTIFIC REASONING

A The development and validation of the CTFR-78

During the 1960s and 1970s research on the assessmentof formal reasoning led to the development of severalinstruments which employed paper-and-pencil methodsthat could be more conveniently administered and scoredthan clinical interviews [23ndash26] A few of the instrumentsrequired students to interact with equipment during theassessment but responses to the questions were in the formof written response [27] However the use of this type ofinstrument was limited due to the equipment needs andstudies that depended on this instrument tended to havesmaller sample sizes Building on this idea other instru-ments were developed that simplified the process and aninstructor was used to conduct a demonstration for the entireclass to minimize time and equipment requirements [28]This method sought balance between the power of inter-views and a format that could be more readily implementedOne of the more popular assessments during this period

was the 1978 version of Lawsonrsquos classroom test of formalreasoning (CTFR-78) [16] The CTFR-78 was developedbased on Piagetrsquos developmental theory [2] and utilizedquestions that others had created in different contexts to testformal reasoning for a variety of operations [171826] TheCTFR-78 involves fifteen items that each consist of ademonstration conducted by an instructor followed by twoquestions that students answer in individual test bookletsThe first question is multiple choice and asks for outcomesrelated to the demonstration The second question is awritten free-response question in which students explain

the reasoning for their answer to the first question An itemis scored as correct only if answers to both questions aredeemed satisfactory The test covers five skill dimensionsincluding conservation proportion control of variablescombination and probability (see Table I)In a paper that introduced the CTFR-78 Lawson [16]

conducted a series of studies to establish the initial validityof the CTFR-78 in measuring concrete and formal reason-ing In the study six experts in Piagetian research assessedthe quality of the questions all six agreed that the questionsassessed concrete and formal reasoning Subsequently theinstrument was administered to 513 students from eighththrough tenth grade and 72 of these students wererandomly selected to participate in a series of clinicalinterviews with Piagetian tasks Two statistical methodsparametric statistics and principal components analysiswere employed for comparing the interview and test dataThe former method found an overall correlation of 076(p lt 0001) The principal components analysis identifiedthree major principal factors among the five skill dimen-sions that accounted for 66 of the variance in scoresBecause of the small number of students this result wasconsidered tentative but satisfactory Overall the analysisindicated that the CTFR-78 was able to measure formalreasoning and that it correlated reasonably well withclinical methods [16]Lawson also created a scoring system to help instructors

interpret test results regarding studentsrsquo levels on a Piagetianreasoning scale Based on the test scores of 513 studentsLawson defined three levels of reasoning concrete (scoreof 0ndash5) transitional (score of 6ndash11) and formal (scoreof 12ndash15) [16] Roughly one-third (353) of studentsexamined were classified at the concrete level about half(495) transitional and the remaining (152) at theformal level When compared to interview assessmentsmore than half of the 72 interviewees fit into the same threereasoning levels determined by the CTFR-78 Of those thatdid not fit the data indicated that the CTFR-78 may haveunderestimated the reasoning abilities of the studentsOther researchers also analyzed the CTFR-78 to deter-

mine its ability to measure student reasoning Stefanichet al [22] measured the correlation between outcomesof clinical interviews with the CTFR-78 and observed aweaker correlation (r frac14 050) than that reported byLawson Interestingly when compared to the clinicalinterviews the researchers found that the test overestimatedreasoning abilities rather than underestimated but due to asmall sample size (N frac14 27) with no reported estimate ofstatistical significance the study did little to challenge theearlier findings of LawsonAnother study by Pratt and Hacker [21] involved

administering the test to 150 students to uncover whetherthe CTFR-78 measured one or several factors Theresearchers found that the Lawson test measurement wasmultifactorial rather than singular which they took to be a

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-2

weakness of the test This examination was later repeatedby Hacker [29] who again found the test to be multi-factorial Other researchers [30] did not find a multifactorialexamination to be problematic especially given that formalreasoning is multifacetedAnd last another early study investigated how strongly

the CTFR-78 test scores correlated to student success inscience and mathematics [31] It was found that students atthe ldquoformal operational reasoningrdquo level outperformedstudents at the ldquotransitionalrdquo and ldquoconcrete operationalreasoningrdquo levels in sciences and mathematics though thelatter two levels of ability were indistinguishableConcerning general performance in STEM only one ofthe items (probabilistic reasoning) was found to be pre-dictive but overall the test was a good indicator of successin learning biology but is less effective in predictingsuccess in other STEM areas

B The development and validation of the LCTSR

Two decades later the LCTSR was published which is acompletelymultiple-choice assessment and does not involvedemonstrations This newer version uses a two-tier design

that has a total of 24 questions in 12 pairs and contains atotal of six skill dimensions Items on combinationalreasoning in the CTFR-78 were not included in the newLCTSR while new items were added for correlationalreasoning and hypothetical-deductive reasoning A com-parison of CTFR-78 and LCTSR is provided in Table I Inthe discussion that follows the terms ldquoitemrdquo and ldquoquestionrdquowill be used interchangeably based on the styles of therelevant literatureThe first 10 question pairs of the LCTSR maintained the

traditional two-tier format in which the first question in apair asks for the results of a given task and the secondquestion asks for the explanation and reasoning behind theanswer to the first question The last two pairs (21-22 and23-24) on hypothetical-deductive reasoning have a slightlychanged flavor of paired questions Question 21 asksstudents to determine the most appropriate experimentaldesign that can be used to test a given hypothesis for aprovided phenomenon Question 22 asks students to pickthe experimental outcome that would disprove the hypoth-esis proposed in question 21 Questions 23-24 are similarlystructured and ask students to select the experimentaloutcomes that will disprove two given hypotheses

TABLE I The comparison of CTFR-78 and LCTSR

Scheme testedItems

(CTFR-78)Question pair(LCTSR) Task details

Conservation of weight 1 1 2 Varying the shapes of two identical balls of clay placedon opposite ends of a balance

Conservation of volume 2 3 4 Examining the displacement volumes of two cylindersof different densities

Proportional reasoning 3 4 5 6 7 8 Pouring water between wide and narrow cylinders andpredicting levels

Proportional reasoning 5 6 Moving weights on a beam balance and predictingequilibrium positions

Control of variables 7 9 10 Designing experiments to test the influence of length ofstring on the period of a pendulum

Control of variables 8 Designing experiments to test the influence of weight ofbob on the period of a pendulum

Control of variables 9 10 Using a ramp and three metal spheres to examine theinfluence of weight and release position on collisions

Control of variables 11 12 13 14 Using fruit flies in tubes to examine the influence of redblue light and gravity on fliesrsquo responses

Combinational reasoning 11 Computing combinations of four switches that will turnon a light

Combinational reasoning 12 Listing all possible linear arrangements of four objectsrepresenting stores in a shopping center

Probability 13 14 15 15 16 17 18 Predicting chances for withdrawing certain coloredwooden blocks from a sack

Correlation reasoning 19 20 Predicting whether correlation exits between the size ofthe mice and the color of their tails through presenteddata

Hypothetical-deductive reasoning 21 22 Designing experiments to determine why the waterrushed up into the glass after the lit candle went out

Hypothetical-deductive reasoning 23 24 Designing experiments to determine why red blood cellsbecome smaller after adding a few drops of salt water

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-3

The scoring of the LCTSR has evolved into two formsone is the pair scoring method which assigns one pointwhen both questions in a pair are answered correctly andthe other is the individual scoring method which simplygrades each question independently Both methods havebeen frequently used by researchers [4111632ndash34]The utility of the LCTSR is that it can be quickly and

objectively scored and thus it has become widely usedHowever the validity of this new version has not beenthoroughly evaluated but rather rests on its predecessorRegardless many researchers have used the LCTSR andcompared the results with other assessments For examplesmall but statistically significant correlations have beendemonstrated between LCTSR test scores and changes ofpre-post scores on both the Force and Concept Inventory(FCI) (r frac14 036 p lt 0001) and Conceptual Survey ofElectricity and Magnetism (CSEM) (r frac14 037 p lt 0001)among community college students including those whohave and have not taken a calculus course [35] Similarfindings have been reported by others [113236] indicatingthat the LCTSR is a useful measure of formal reasoningAlthough widely used a complete and formal validation

including construct content and criterion validity of theLCTSR has not yet been conducted [37] Only recentlyhave a few studies started to examine its construct validity[38ndash41] For example using Rasch analysis and principlecomponent analysis the assumption of a unidimensionalconstruct for a general scientific reasoning ability hasbeen shown to be acceptable [3839] In addition themultidimensionality of LCTSR has been partially estab-lished using confirmatory factor analysis [40] As a two-tierinstrument the LCTSR is assumed to measure studentsrsquoknowing in the first tier and their reasoning processes inthe second tier [42] and this has been confirmed in a recentempirical study [41]

C The two-tier multiple-choice design

A two-tier multiple-choice (TTMC) design is constructedto measure studentsrsquo content knowledge in tier 1 andreasoning in tier 2 [4243] Since its formal introductionthree decades ago [42] TTMC designs have been widelyresearched and applied in education assessment [44ndash52]The LCTSR is a well-known example of a two-tier testdesigned to assess scientific reasoning [1116414244]Two assessment goals are often assumed when using

TTMC tests The first is to measure studentsrsquo knowing andreasoning at the same time within a single question settingUnderstanding the relation between knowing and reasoningcan provide important cognitive and education insights intohow teaching and learning can be improved Recent studieshave started to show a possible progression from knowingto reasoning which suggests that reasoning is harder thanknowing [344153ndash56] However the actual difficultyreflected within a given test can also be affected by anumber of factors including test design content and

population [3441] and therefore needs to be evaluatedwith all the possible factors considered and or controlledThe second and often more primary goal of using the

TTMC design is to suppress ldquofalse positivesrdquowhen the pairscoring method is used It is possible for students to answera question correctly without the correct understandingthrough guessing or other means From a signal processingperspective this type of ldquopositiverdquo signal is generated withunfavorable processes andwas defined as a false positive byHestenes and Halloun [57] For example in the case of theForce Concept Inventory (FCI) [20] which has five answerchoices for each question the chance of a false positive dueto guessing can be estimated as 20 [58] Using TTMC orother sequenced question design a false positive can besuppressed in pair or sequence scoring [51] For examplethe false positive due to guessing drops to 4 in pair scoringfor the five-choice question pair (1=5 times 1=5 frac14 1=25)In contrast it is also possible for a student to answer a

question incorrectly but with correct understanding orreasoning which was defined as a ldquofalse negativerdquo byHestenes and Halloun [57] It has been suggested that afalse negative is less common than a false positive in awell-designed instrument such as the FCI [20] Howeverwhen the test designs are subject to inspection both falsepositives and false negatves should be carefully consideredand examined For a well-designed TTMC instrumentgood consistency between answer and explanation isgenerally expected If a high level of inconsistency isobserved such as a correct answer for a wrong reason orvice versa it could be an indication of content validityissues in question design

D Research question

The construct validity of the LCTSR has been partiallyestablished through quantitative analysis but there hasbeen little research on fully establishing its content validityBased on large scale implementations of the LCTSR anumber of concerns regarding question designs have beenraised prompting the need for a thorough inspection of theassessment features of the LCTSR and to fully evaluate itsvalidity This will allow researchers and educators to moreaccurately interpret the results of this assessment instru-ment Therefore this paper aims to address a gap in theliterature through a detailed study of the possible contentvalidity issues of LCTSR such that the results will helpformally establish the validity of this popular instrumentThis study focuses on one core research question to whatextent is LCTSR valid and reliable to measure scientificreasoning

III METHODOLOGY

A Data collection

This research uses mixed methods [59] to integratedifferent sources of data to identify and study features of

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-4

test design and validity of the LCTSR Three formsof quantitative and qualitative data were collectedfrom freshman in three midwestern public universitiesincluding the following (i) large-scale quantitative data(N frac14 1576) using the LCTSR as written (ii) studentresponses (N frac14 181) to LCTSR questions along with shortanswer free-response explanations for each question and(iii) think-aloud interviews (N frac14 66) with a subgrouprandomly selected from the same students tested in part(ii) During the interviews students were asked to go overthe test again and verbally explain their reasoning for howthey answered the questionsAmong the students in the large-scale quantitative data

988 were enrolled in calculus-based introductory physicscourses with the rest enrolled in algebra-based introduc-tory physics courses The universities selected were thosewith medium ranking in order to obtain a representativepool of the population For the college students the LCTSRwas administered in the beginning of the fall semester priorto any college level instructionIn addition to the student data an expert review panel

consisting of 7 science faculty (3 in physics 2 in biology 1in chemistry and 1 in mathematics) was formed to provideexpert evaluation of the questions for content validityThese faculty are researchers in science education withestablished experience in assessment design and validation

B Analysis method

1 The dependence within and between question pairs

In general test design features can result in dependenceswithin and between question pairs For example in large-scale testing programs such as TOEFL PISA or NAEPthere are frequently observed correlations among questionssharing common context elements such as graphs or textsThese related question sets are often called testlets [60]or question bundles [61] In this way a typical two-tierquestion pair can be considered as a specific type of testletor question bundle [62] In addition the LCTSR wasdesigned with six subskill dimensions that each can containmultiple question pairs built around similar contextualscenarios As a result dependences between question pairswithin a single skill dimension are not surprising Theanalysis of correlations among related questions canprovide supporting evidence for identifying and validatingfactors in question design that might influence studentsrsquoperformances one way or the otherIn classical score-based analysis dependences among

questions are analyzed using correlations among questionscores This type of correlation contains contributions fromall factors including studentsrsquo abilities skill dimensionsquestion difficulties question design features etc If thedata are analyzed with Rasch or other item response theory(IRT) models one can remove the factors due to studentsrsquoabilities and item difficulties and focus on question designfeatures and other factors Therefore the fit measure

Q3 [63] which gives the correlation among questionresiduals [Eqs (1) and (2)] is used to analyze the depend-ences among questions in order to examine the influenceson studentsrsquo performance variances from question designfeatures

dik frac14 Xik minus PiethθkTHORN eth1THORN

Here dik is the difference (residual) between Xik theobserved score of the kth student on the ith question andthe model predicted score PiethθkTHORN The Pearson correlationof these question residuals is computed as

Q3ij frac14 rdidj eth2THORN

where di and dj are question residuals for questions i and jrespectively In using Q3 to screen questions for localdependence Chen and Thissen [64] suggested an absolutethreshold value of 02 above which a moderate to strongdependence can be derived For a test consisting of nquestions the mean value of all Q3 correlations amongdifferent questions is expected to be minus1=ethn minus 1THORN [65] TheLCTSR contains 24 questions therefore the expectedmean Q3 is approximately minus0043The LCTSR has a TTMC design with six subskills By

design there are at least two categories of dimensionsincluding the two-tier structure and content subskillsAdditional observed dimensions beyond the two categoriescan be considered as a result of contextual factors inquestion design In data analysis a multidimensional Raschmodel is used to account for the dependences due to thesubskill dimensions Then the variances of the residuals[Eq (1)] would primarily be the results of TTMC structureand other question design features For an ideal TTMC testit is often expected that dependences only exist withinbut not between question pairs This is equivalent to thecondition of local independence among question pairsWith the subskill dimensions removed by the multidimen-sional Rasch modeling significant residual dependencesamong different question pairs can be indications ofpossible construct and content validity issues in the ques-tion design

2 Pattern analysis of the LCTSR

As discussed earlier the first 10 LCTSR question pairs(questions 1 through 20) used the traditional two-tierdesign while the last two pairs (questions 21-22 and23-24) have a different flavor of paired questionsIdeally traditional two-tier questions would expect goodconsistency between tier-1 and tier-2 questions (iebetween answer and explanation) If a high level ofinconsistency is observed such as a correct answer for awrong reason or vice versa it may suggest issues inquestion design that can lead to false positive or negativeassessment outcomes Therefore analysis of question-pair

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-5

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 2: Validity evaluation of the Lawson classroom test of

formal reasoning (CTFR-78) [16] The initial open-response version of the CTFR-78 was revised in the year2000 to become a 24-item multiple choice test [19] Thismost recent version is referred to as the Lawson classroomtest of scientific reasoning (LCTSR) and it has gained widepopularity in the STEM education community For exam-ple physics education researchers used the LCTSR to studythe relation between student scientific reasoning abilitiesand physics content learning In one study Coletta andPhillips [11] reported correlations (r asymp 05) between pre-post normalized gains on the Force Concept Inventory [20]and student reasoning abilitiesAlthough several studies [2122] investigated the validity

of the CTFR-78 research on the validity of the LCTSR islimited Nevertheless the instrument has become a standardassessment tool in education research Through our own useof the LCTSR however several issues have been observedconcerning question design and data interpretation Thisstudy evaluates the validity of the LCTSR based on large-scale assessment data and follow-up interviews The find-ings will further establish the validity of the LCTSR andreveal measurement features of the current design

II LITERATURE REVIEW ON ASSESSMENTOF SCIENTIFIC REASONING

A The development and validation of the CTFR-78

During the 1960s and 1970s research on the assessmentof formal reasoning led to the development of severalinstruments which employed paper-and-pencil methodsthat could be more conveniently administered and scoredthan clinical interviews [23ndash26] A few of the instrumentsrequired students to interact with equipment during theassessment but responses to the questions were in the formof written response [27] However the use of this type ofinstrument was limited due to the equipment needs andstudies that depended on this instrument tended to havesmaller sample sizes Building on this idea other instru-ments were developed that simplified the process and aninstructor was used to conduct a demonstration for the entireclass to minimize time and equipment requirements [28]This method sought balance between the power of inter-views and a format that could be more readily implementedOne of the more popular assessments during this period

was the 1978 version of Lawsonrsquos classroom test of formalreasoning (CTFR-78) [16] The CTFR-78 was developedbased on Piagetrsquos developmental theory [2] and utilizedquestions that others had created in different contexts to testformal reasoning for a variety of operations [171826] TheCTFR-78 involves fifteen items that each consist of ademonstration conducted by an instructor followed by twoquestions that students answer in individual test bookletsThe first question is multiple choice and asks for outcomesrelated to the demonstration The second question is awritten free-response question in which students explain

the reasoning for their answer to the first question An itemis scored as correct only if answers to both questions aredeemed satisfactory The test covers five skill dimensionsincluding conservation proportion control of variablescombination and probability (see Table I)In a paper that introduced the CTFR-78 Lawson [16]

conducted a series of studies to establish the initial validityof the CTFR-78 in measuring concrete and formal reason-ing In the study six experts in Piagetian research assessedthe quality of the questions all six agreed that the questionsassessed concrete and formal reasoning Subsequently theinstrument was administered to 513 students from eighththrough tenth grade and 72 of these students wererandomly selected to participate in a series of clinicalinterviews with Piagetian tasks Two statistical methodsparametric statistics and principal components analysiswere employed for comparing the interview and test dataThe former method found an overall correlation of 076(p lt 0001) The principal components analysis identifiedthree major principal factors among the five skill dimen-sions that accounted for 66 of the variance in scoresBecause of the small number of students this result wasconsidered tentative but satisfactory Overall the analysisindicated that the CTFR-78 was able to measure formalreasoning and that it correlated reasonably well withclinical methods [16]Lawson also created a scoring system to help instructors

interpret test results regarding studentsrsquo levels on a Piagetianreasoning scale Based on the test scores of 513 studentsLawson defined three levels of reasoning concrete (scoreof 0ndash5) transitional (score of 6ndash11) and formal (scoreof 12ndash15) [16] Roughly one-third (353) of studentsexamined were classified at the concrete level about half(495) transitional and the remaining (152) at theformal level When compared to interview assessmentsmore than half of the 72 interviewees fit into the same threereasoning levels determined by the CTFR-78 Of those thatdid not fit the data indicated that the CTFR-78 may haveunderestimated the reasoning abilities of the studentsOther researchers also analyzed the CTFR-78 to deter-

mine its ability to measure student reasoning Stefanichet al [22] measured the correlation between outcomesof clinical interviews with the CTFR-78 and observed aweaker correlation (r frac14 050) than that reported byLawson Interestingly when compared to the clinicalinterviews the researchers found that the test overestimatedreasoning abilities rather than underestimated but due to asmall sample size (N frac14 27) with no reported estimate ofstatistical significance the study did little to challenge theearlier findings of LawsonAnother study by Pratt and Hacker [21] involved

administering the test to 150 students to uncover whetherthe CTFR-78 measured one or several factors Theresearchers found that the Lawson test measurement wasmultifactorial rather than singular which they took to be a

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-2

weakness of the test This examination was later repeatedby Hacker [29] who again found the test to be multi-factorial Other researchers [30] did not find a multifactorialexamination to be problematic especially given that formalreasoning is multifacetedAnd last another early study investigated how strongly

the CTFR-78 test scores correlated to student success inscience and mathematics [31] It was found that students atthe ldquoformal operational reasoningrdquo level outperformedstudents at the ldquotransitionalrdquo and ldquoconcrete operationalreasoningrdquo levels in sciences and mathematics though thelatter two levels of ability were indistinguishableConcerning general performance in STEM only one ofthe items (probabilistic reasoning) was found to be pre-dictive but overall the test was a good indicator of successin learning biology but is less effective in predictingsuccess in other STEM areas

B The development and validation of the LCTSR

Two decades later the LCTSR was published which is acompletelymultiple-choice assessment and does not involvedemonstrations This newer version uses a two-tier design

that has a total of 24 questions in 12 pairs and contains atotal of six skill dimensions Items on combinationalreasoning in the CTFR-78 were not included in the newLCTSR while new items were added for correlationalreasoning and hypothetical-deductive reasoning A com-parison of CTFR-78 and LCTSR is provided in Table I Inthe discussion that follows the terms ldquoitemrdquo and ldquoquestionrdquowill be used interchangeably based on the styles of therelevant literatureThe first 10 question pairs of the LCTSR maintained the

traditional two-tier format in which the first question in apair asks for the results of a given task and the secondquestion asks for the explanation and reasoning behind theanswer to the first question The last two pairs (21-22 and23-24) on hypothetical-deductive reasoning have a slightlychanged flavor of paired questions Question 21 asksstudents to determine the most appropriate experimentaldesign that can be used to test a given hypothesis for aprovided phenomenon Question 22 asks students to pickthe experimental outcome that would disprove the hypoth-esis proposed in question 21 Questions 23-24 are similarlystructured and ask students to select the experimentaloutcomes that will disprove two given hypotheses

TABLE I The comparison of CTFR-78 and LCTSR

Scheme testedItems

(CTFR-78)Question pair(LCTSR) Task details

Conservation of weight 1 1 2 Varying the shapes of two identical balls of clay placedon opposite ends of a balance

Conservation of volume 2 3 4 Examining the displacement volumes of two cylindersof different densities

Proportional reasoning 3 4 5 6 7 8 Pouring water between wide and narrow cylinders andpredicting levels

Proportional reasoning 5 6 Moving weights on a beam balance and predictingequilibrium positions

Control of variables 7 9 10 Designing experiments to test the influence of length ofstring on the period of a pendulum

Control of variables 8 Designing experiments to test the influence of weight ofbob on the period of a pendulum

Control of variables 9 10 Using a ramp and three metal spheres to examine theinfluence of weight and release position on collisions

Control of variables 11 12 13 14 Using fruit flies in tubes to examine the influence of redblue light and gravity on fliesrsquo responses

Combinational reasoning 11 Computing combinations of four switches that will turnon a light

Combinational reasoning 12 Listing all possible linear arrangements of four objectsrepresenting stores in a shopping center

Probability 13 14 15 15 16 17 18 Predicting chances for withdrawing certain coloredwooden blocks from a sack

Correlation reasoning 19 20 Predicting whether correlation exits between the size ofthe mice and the color of their tails through presenteddata

Hypothetical-deductive reasoning 21 22 Designing experiments to determine why the waterrushed up into the glass after the lit candle went out

Hypothetical-deductive reasoning 23 24 Designing experiments to determine why red blood cellsbecome smaller after adding a few drops of salt water

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-3

The scoring of the LCTSR has evolved into two formsone is the pair scoring method which assigns one pointwhen both questions in a pair are answered correctly andthe other is the individual scoring method which simplygrades each question independently Both methods havebeen frequently used by researchers [4111632ndash34]The utility of the LCTSR is that it can be quickly and

objectively scored and thus it has become widely usedHowever the validity of this new version has not beenthoroughly evaluated but rather rests on its predecessorRegardless many researchers have used the LCTSR andcompared the results with other assessments For examplesmall but statistically significant correlations have beendemonstrated between LCTSR test scores and changes ofpre-post scores on both the Force and Concept Inventory(FCI) (r frac14 036 p lt 0001) and Conceptual Survey ofElectricity and Magnetism (CSEM) (r frac14 037 p lt 0001)among community college students including those whohave and have not taken a calculus course [35] Similarfindings have been reported by others [113236] indicatingthat the LCTSR is a useful measure of formal reasoningAlthough widely used a complete and formal validation

including construct content and criterion validity of theLCTSR has not yet been conducted [37] Only recentlyhave a few studies started to examine its construct validity[38ndash41] For example using Rasch analysis and principlecomponent analysis the assumption of a unidimensionalconstruct for a general scientific reasoning ability hasbeen shown to be acceptable [3839] In addition themultidimensionality of LCTSR has been partially estab-lished using confirmatory factor analysis [40] As a two-tierinstrument the LCTSR is assumed to measure studentsrsquoknowing in the first tier and their reasoning processes inthe second tier [42] and this has been confirmed in a recentempirical study [41]

C The two-tier multiple-choice design

A two-tier multiple-choice (TTMC) design is constructedto measure studentsrsquo content knowledge in tier 1 andreasoning in tier 2 [4243] Since its formal introductionthree decades ago [42] TTMC designs have been widelyresearched and applied in education assessment [44ndash52]The LCTSR is a well-known example of a two-tier testdesigned to assess scientific reasoning [1116414244]Two assessment goals are often assumed when using

TTMC tests The first is to measure studentsrsquo knowing andreasoning at the same time within a single question settingUnderstanding the relation between knowing and reasoningcan provide important cognitive and education insights intohow teaching and learning can be improved Recent studieshave started to show a possible progression from knowingto reasoning which suggests that reasoning is harder thanknowing [344153ndash56] However the actual difficultyreflected within a given test can also be affected by anumber of factors including test design content and

population [3441] and therefore needs to be evaluatedwith all the possible factors considered and or controlledThe second and often more primary goal of using the

TTMC design is to suppress ldquofalse positivesrdquowhen the pairscoring method is used It is possible for students to answera question correctly without the correct understandingthrough guessing or other means From a signal processingperspective this type of ldquopositiverdquo signal is generated withunfavorable processes andwas defined as a false positive byHestenes and Halloun [57] For example in the case of theForce Concept Inventory (FCI) [20] which has five answerchoices for each question the chance of a false positive dueto guessing can be estimated as 20 [58] Using TTMC orother sequenced question design a false positive can besuppressed in pair or sequence scoring [51] For examplethe false positive due to guessing drops to 4 in pair scoringfor the five-choice question pair (1=5 times 1=5 frac14 1=25)In contrast it is also possible for a student to answer a

question incorrectly but with correct understanding orreasoning which was defined as a ldquofalse negativerdquo byHestenes and Halloun [57] It has been suggested that afalse negative is less common than a false positive in awell-designed instrument such as the FCI [20] Howeverwhen the test designs are subject to inspection both falsepositives and false negatves should be carefully consideredand examined For a well-designed TTMC instrumentgood consistency between answer and explanation isgenerally expected If a high level of inconsistency isobserved such as a correct answer for a wrong reason orvice versa it could be an indication of content validityissues in question design

D Research question

The construct validity of the LCTSR has been partiallyestablished through quantitative analysis but there hasbeen little research on fully establishing its content validityBased on large scale implementations of the LCTSR anumber of concerns regarding question designs have beenraised prompting the need for a thorough inspection of theassessment features of the LCTSR and to fully evaluate itsvalidity This will allow researchers and educators to moreaccurately interpret the results of this assessment instru-ment Therefore this paper aims to address a gap in theliterature through a detailed study of the possible contentvalidity issues of LCTSR such that the results will helpformally establish the validity of this popular instrumentThis study focuses on one core research question to whatextent is LCTSR valid and reliable to measure scientificreasoning

III METHODOLOGY

A Data collection

This research uses mixed methods [59] to integratedifferent sources of data to identify and study features of

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-4

test design and validity of the LCTSR Three formsof quantitative and qualitative data were collectedfrom freshman in three midwestern public universitiesincluding the following (i) large-scale quantitative data(N frac14 1576) using the LCTSR as written (ii) studentresponses (N frac14 181) to LCTSR questions along with shortanswer free-response explanations for each question and(iii) think-aloud interviews (N frac14 66) with a subgrouprandomly selected from the same students tested in part(ii) During the interviews students were asked to go overthe test again and verbally explain their reasoning for howthey answered the questionsAmong the students in the large-scale quantitative data

988 were enrolled in calculus-based introductory physicscourses with the rest enrolled in algebra-based introduc-tory physics courses The universities selected were thosewith medium ranking in order to obtain a representativepool of the population For the college students the LCTSRwas administered in the beginning of the fall semester priorto any college level instructionIn addition to the student data an expert review panel

consisting of 7 science faculty (3 in physics 2 in biology 1in chemistry and 1 in mathematics) was formed to provideexpert evaluation of the questions for content validityThese faculty are researchers in science education withestablished experience in assessment design and validation

B Analysis method

1 The dependence within and between question pairs

In general test design features can result in dependenceswithin and between question pairs For example in large-scale testing programs such as TOEFL PISA or NAEPthere are frequently observed correlations among questionssharing common context elements such as graphs or textsThese related question sets are often called testlets [60]or question bundles [61] In this way a typical two-tierquestion pair can be considered as a specific type of testletor question bundle [62] In addition the LCTSR wasdesigned with six subskill dimensions that each can containmultiple question pairs built around similar contextualscenarios As a result dependences between question pairswithin a single skill dimension are not surprising Theanalysis of correlations among related questions canprovide supporting evidence for identifying and validatingfactors in question design that might influence studentsrsquoperformances one way or the otherIn classical score-based analysis dependences among

questions are analyzed using correlations among questionscores This type of correlation contains contributions fromall factors including studentsrsquo abilities skill dimensionsquestion difficulties question design features etc If thedata are analyzed with Rasch or other item response theory(IRT) models one can remove the factors due to studentsrsquoabilities and item difficulties and focus on question designfeatures and other factors Therefore the fit measure

Q3 [63] which gives the correlation among questionresiduals [Eqs (1) and (2)] is used to analyze the depend-ences among questions in order to examine the influenceson studentsrsquo performance variances from question designfeatures

dik frac14 Xik minus PiethθkTHORN eth1THORN

Here dik is the difference (residual) between Xik theobserved score of the kth student on the ith question andthe model predicted score PiethθkTHORN The Pearson correlationof these question residuals is computed as

Q3ij frac14 rdidj eth2THORN

where di and dj are question residuals for questions i and jrespectively In using Q3 to screen questions for localdependence Chen and Thissen [64] suggested an absolutethreshold value of 02 above which a moderate to strongdependence can be derived For a test consisting of nquestions the mean value of all Q3 correlations amongdifferent questions is expected to be minus1=ethn minus 1THORN [65] TheLCTSR contains 24 questions therefore the expectedmean Q3 is approximately minus0043The LCTSR has a TTMC design with six subskills By

design there are at least two categories of dimensionsincluding the two-tier structure and content subskillsAdditional observed dimensions beyond the two categoriescan be considered as a result of contextual factors inquestion design In data analysis a multidimensional Raschmodel is used to account for the dependences due to thesubskill dimensions Then the variances of the residuals[Eq (1)] would primarily be the results of TTMC structureand other question design features For an ideal TTMC testit is often expected that dependences only exist withinbut not between question pairs This is equivalent to thecondition of local independence among question pairsWith the subskill dimensions removed by the multidimen-sional Rasch modeling significant residual dependencesamong different question pairs can be indications ofpossible construct and content validity issues in the ques-tion design

2 Pattern analysis of the LCTSR

As discussed earlier the first 10 LCTSR question pairs(questions 1 through 20) used the traditional two-tierdesign while the last two pairs (questions 21-22 and23-24) have a different flavor of paired questionsIdeally traditional two-tier questions would expect goodconsistency between tier-1 and tier-2 questions (iebetween answer and explanation) If a high level ofinconsistency is observed such as a correct answer for awrong reason or vice versa it may suggest issues inquestion design that can lead to false positive or negativeassessment outcomes Therefore analysis of question-pair

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-5

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 3: Validity evaluation of the Lawson classroom test of

weakness of the test This examination was later repeatedby Hacker [29] who again found the test to be multi-factorial Other researchers [30] did not find a multifactorialexamination to be problematic especially given that formalreasoning is multifacetedAnd last another early study investigated how strongly

the CTFR-78 test scores correlated to student success inscience and mathematics [31] It was found that students atthe ldquoformal operational reasoningrdquo level outperformedstudents at the ldquotransitionalrdquo and ldquoconcrete operationalreasoningrdquo levels in sciences and mathematics though thelatter two levels of ability were indistinguishableConcerning general performance in STEM only one ofthe items (probabilistic reasoning) was found to be pre-dictive but overall the test was a good indicator of successin learning biology but is less effective in predictingsuccess in other STEM areas

B The development and validation of the LCTSR

Two decades later the LCTSR was published which is acompletelymultiple-choice assessment and does not involvedemonstrations This newer version uses a two-tier design

that has a total of 24 questions in 12 pairs and contains atotal of six skill dimensions Items on combinationalreasoning in the CTFR-78 were not included in the newLCTSR while new items were added for correlationalreasoning and hypothetical-deductive reasoning A com-parison of CTFR-78 and LCTSR is provided in Table I Inthe discussion that follows the terms ldquoitemrdquo and ldquoquestionrdquowill be used interchangeably based on the styles of therelevant literatureThe first 10 question pairs of the LCTSR maintained the

traditional two-tier format in which the first question in apair asks for the results of a given task and the secondquestion asks for the explanation and reasoning behind theanswer to the first question The last two pairs (21-22 and23-24) on hypothetical-deductive reasoning have a slightlychanged flavor of paired questions Question 21 asksstudents to determine the most appropriate experimentaldesign that can be used to test a given hypothesis for aprovided phenomenon Question 22 asks students to pickthe experimental outcome that would disprove the hypoth-esis proposed in question 21 Questions 23-24 are similarlystructured and ask students to select the experimentaloutcomes that will disprove two given hypotheses

TABLE I The comparison of CTFR-78 and LCTSR

Scheme testedItems

(CTFR-78)Question pair(LCTSR) Task details

Conservation of weight 1 1 2 Varying the shapes of two identical balls of clay placedon opposite ends of a balance

Conservation of volume 2 3 4 Examining the displacement volumes of two cylindersof different densities

Proportional reasoning 3 4 5 6 7 8 Pouring water between wide and narrow cylinders andpredicting levels

Proportional reasoning 5 6 Moving weights on a beam balance and predictingequilibrium positions

Control of variables 7 9 10 Designing experiments to test the influence of length ofstring on the period of a pendulum

Control of variables 8 Designing experiments to test the influence of weight ofbob on the period of a pendulum

Control of variables 9 10 Using a ramp and three metal spheres to examine theinfluence of weight and release position on collisions

Control of variables 11 12 13 14 Using fruit flies in tubes to examine the influence of redblue light and gravity on fliesrsquo responses

Combinational reasoning 11 Computing combinations of four switches that will turnon a light

Combinational reasoning 12 Listing all possible linear arrangements of four objectsrepresenting stores in a shopping center

Probability 13 14 15 15 16 17 18 Predicting chances for withdrawing certain coloredwooden blocks from a sack

Correlation reasoning 19 20 Predicting whether correlation exits between the size ofthe mice and the color of their tails through presenteddata

Hypothetical-deductive reasoning 21 22 Designing experiments to determine why the waterrushed up into the glass after the lit candle went out

Hypothetical-deductive reasoning 23 24 Designing experiments to determine why red blood cellsbecome smaller after adding a few drops of salt water

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-3

The scoring of the LCTSR has evolved into two formsone is the pair scoring method which assigns one pointwhen both questions in a pair are answered correctly andthe other is the individual scoring method which simplygrades each question independently Both methods havebeen frequently used by researchers [4111632ndash34]The utility of the LCTSR is that it can be quickly and

objectively scored and thus it has become widely usedHowever the validity of this new version has not beenthoroughly evaluated but rather rests on its predecessorRegardless many researchers have used the LCTSR andcompared the results with other assessments For examplesmall but statistically significant correlations have beendemonstrated between LCTSR test scores and changes ofpre-post scores on both the Force and Concept Inventory(FCI) (r frac14 036 p lt 0001) and Conceptual Survey ofElectricity and Magnetism (CSEM) (r frac14 037 p lt 0001)among community college students including those whohave and have not taken a calculus course [35] Similarfindings have been reported by others [113236] indicatingthat the LCTSR is a useful measure of formal reasoningAlthough widely used a complete and formal validation

including construct content and criterion validity of theLCTSR has not yet been conducted [37] Only recentlyhave a few studies started to examine its construct validity[38ndash41] For example using Rasch analysis and principlecomponent analysis the assumption of a unidimensionalconstruct for a general scientific reasoning ability hasbeen shown to be acceptable [3839] In addition themultidimensionality of LCTSR has been partially estab-lished using confirmatory factor analysis [40] As a two-tierinstrument the LCTSR is assumed to measure studentsrsquoknowing in the first tier and their reasoning processes inthe second tier [42] and this has been confirmed in a recentempirical study [41]

C The two-tier multiple-choice design

A two-tier multiple-choice (TTMC) design is constructedto measure studentsrsquo content knowledge in tier 1 andreasoning in tier 2 [4243] Since its formal introductionthree decades ago [42] TTMC designs have been widelyresearched and applied in education assessment [44ndash52]The LCTSR is a well-known example of a two-tier testdesigned to assess scientific reasoning [1116414244]Two assessment goals are often assumed when using

TTMC tests The first is to measure studentsrsquo knowing andreasoning at the same time within a single question settingUnderstanding the relation between knowing and reasoningcan provide important cognitive and education insights intohow teaching and learning can be improved Recent studieshave started to show a possible progression from knowingto reasoning which suggests that reasoning is harder thanknowing [344153ndash56] However the actual difficultyreflected within a given test can also be affected by anumber of factors including test design content and

population [3441] and therefore needs to be evaluatedwith all the possible factors considered and or controlledThe second and often more primary goal of using the

TTMC design is to suppress ldquofalse positivesrdquowhen the pairscoring method is used It is possible for students to answera question correctly without the correct understandingthrough guessing or other means From a signal processingperspective this type of ldquopositiverdquo signal is generated withunfavorable processes andwas defined as a false positive byHestenes and Halloun [57] For example in the case of theForce Concept Inventory (FCI) [20] which has five answerchoices for each question the chance of a false positive dueto guessing can be estimated as 20 [58] Using TTMC orother sequenced question design a false positive can besuppressed in pair or sequence scoring [51] For examplethe false positive due to guessing drops to 4 in pair scoringfor the five-choice question pair (1=5 times 1=5 frac14 1=25)In contrast it is also possible for a student to answer a

question incorrectly but with correct understanding orreasoning which was defined as a ldquofalse negativerdquo byHestenes and Halloun [57] It has been suggested that afalse negative is less common than a false positive in awell-designed instrument such as the FCI [20] Howeverwhen the test designs are subject to inspection both falsepositives and false negatves should be carefully consideredand examined For a well-designed TTMC instrumentgood consistency between answer and explanation isgenerally expected If a high level of inconsistency isobserved such as a correct answer for a wrong reason orvice versa it could be an indication of content validityissues in question design

D Research question

The construct validity of the LCTSR has been partiallyestablished through quantitative analysis but there hasbeen little research on fully establishing its content validityBased on large scale implementations of the LCTSR anumber of concerns regarding question designs have beenraised prompting the need for a thorough inspection of theassessment features of the LCTSR and to fully evaluate itsvalidity This will allow researchers and educators to moreaccurately interpret the results of this assessment instru-ment Therefore this paper aims to address a gap in theliterature through a detailed study of the possible contentvalidity issues of LCTSR such that the results will helpformally establish the validity of this popular instrumentThis study focuses on one core research question to whatextent is LCTSR valid and reliable to measure scientificreasoning

III METHODOLOGY

A Data collection

This research uses mixed methods [59] to integratedifferent sources of data to identify and study features of

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-4

test design and validity of the LCTSR Three formsof quantitative and qualitative data were collectedfrom freshman in three midwestern public universitiesincluding the following (i) large-scale quantitative data(N frac14 1576) using the LCTSR as written (ii) studentresponses (N frac14 181) to LCTSR questions along with shortanswer free-response explanations for each question and(iii) think-aloud interviews (N frac14 66) with a subgrouprandomly selected from the same students tested in part(ii) During the interviews students were asked to go overthe test again and verbally explain their reasoning for howthey answered the questionsAmong the students in the large-scale quantitative data

988 were enrolled in calculus-based introductory physicscourses with the rest enrolled in algebra-based introduc-tory physics courses The universities selected were thosewith medium ranking in order to obtain a representativepool of the population For the college students the LCTSRwas administered in the beginning of the fall semester priorto any college level instructionIn addition to the student data an expert review panel

consisting of 7 science faculty (3 in physics 2 in biology 1in chemistry and 1 in mathematics) was formed to provideexpert evaluation of the questions for content validityThese faculty are researchers in science education withestablished experience in assessment design and validation

B Analysis method

1 The dependence within and between question pairs

In general test design features can result in dependenceswithin and between question pairs For example in large-scale testing programs such as TOEFL PISA or NAEPthere are frequently observed correlations among questionssharing common context elements such as graphs or textsThese related question sets are often called testlets [60]or question bundles [61] In this way a typical two-tierquestion pair can be considered as a specific type of testletor question bundle [62] In addition the LCTSR wasdesigned with six subskill dimensions that each can containmultiple question pairs built around similar contextualscenarios As a result dependences between question pairswithin a single skill dimension are not surprising Theanalysis of correlations among related questions canprovide supporting evidence for identifying and validatingfactors in question design that might influence studentsrsquoperformances one way or the otherIn classical score-based analysis dependences among

questions are analyzed using correlations among questionscores This type of correlation contains contributions fromall factors including studentsrsquo abilities skill dimensionsquestion difficulties question design features etc If thedata are analyzed with Rasch or other item response theory(IRT) models one can remove the factors due to studentsrsquoabilities and item difficulties and focus on question designfeatures and other factors Therefore the fit measure

Q3 [63] which gives the correlation among questionresiduals [Eqs (1) and (2)] is used to analyze the depend-ences among questions in order to examine the influenceson studentsrsquo performance variances from question designfeatures

dik frac14 Xik minus PiethθkTHORN eth1THORN

Here dik is the difference (residual) between Xik theobserved score of the kth student on the ith question andthe model predicted score PiethθkTHORN The Pearson correlationof these question residuals is computed as

Q3ij frac14 rdidj eth2THORN

where di and dj are question residuals for questions i and jrespectively In using Q3 to screen questions for localdependence Chen and Thissen [64] suggested an absolutethreshold value of 02 above which a moderate to strongdependence can be derived For a test consisting of nquestions the mean value of all Q3 correlations amongdifferent questions is expected to be minus1=ethn minus 1THORN [65] TheLCTSR contains 24 questions therefore the expectedmean Q3 is approximately minus0043The LCTSR has a TTMC design with six subskills By

design there are at least two categories of dimensionsincluding the two-tier structure and content subskillsAdditional observed dimensions beyond the two categoriescan be considered as a result of contextual factors inquestion design In data analysis a multidimensional Raschmodel is used to account for the dependences due to thesubskill dimensions Then the variances of the residuals[Eq (1)] would primarily be the results of TTMC structureand other question design features For an ideal TTMC testit is often expected that dependences only exist withinbut not between question pairs This is equivalent to thecondition of local independence among question pairsWith the subskill dimensions removed by the multidimen-sional Rasch modeling significant residual dependencesamong different question pairs can be indications ofpossible construct and content validity issues in the ques-tion design

2 Pattern analysis of the LCTSR

As discussed earlier the first 10 LCTSR question pairs(questions 1 through 20) used the traditional two-tierdesign while the last two pairs (questions 21-22 and23-24) have a different flavor of paired questionsIdeally traditional two-tier questions would expect goodconsistency between tier-1 and tier-2 questions (iebetween answer and explanation) If a high level ofinconsistency is observed such as a correct answer for awrong reason or vice versa it may suggest issues inquestion design that can lead to false positive or negativeassessment outcomes Therefore analysis of question-pair

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-5

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 4: Validity evaluation of the Lawson classroom test of

The scoring of the LCTSR has evolved into two formsone is the pair scoring method which assigns one pointwhen both questions in a pair are answered correctly andthe other is the individual scoring method which simplygrades each question independently Both methods havebeen frequently used by researchers [4111632ndash34]The utility of the LCTSR is that it can be quickly and

objectively scored and thus it has become widely usedHowever the validity of this new version has not beenthoroughly evaluated but rather rests on its predecessorRegardless many researchers have used the LCTSR andcompared the results with other assessments For examplesmall but statistically significant correlations have beendemonstrated between LCTSR test scores and changes ofpre-post scores on both the Force and Concept Inventory(FCI) (r frac14 036 p lt 0001) and Conceptual Survey ofElectricity and Magnetism (CSEM) (r frac14 037 p lt 0001)among community college students including those whohave and have not taken a calculus course [35] Similarfindings have been reported by others [113236] indicatingthat the LCTSR is a useful measure of formal reasoningAlthough widely used a complete and formal validation

including construct content and criterion validity of theLCTSR has not yet been conducted [37] Only recentlyhave a few studies started to examine its construct validity[38ndash41] For example using Rasch analysis and principlecomponent analysis the assumption of a unidimensionalconstruct for a general scientific reasoning ability hasbeen shown to be acceptable [3839] In addition themultidimensionality of LCTSR has been partially estab-lished using confirmatory factor analysis [40] As a two-tierinstrument the LCTSR is assumed to measure studentsrsquoknowing in the first tier and their reasoning processes inthe second tier [42] and this has been confirmed in a recentempirical study [41]

C The two-tier multiple-choice design

A two-tier multiple-choice (TTMC) design is constructedto measure studentsrsquo content knowledge in tier 1 andreasoning in tier 2 [4243] Since its formal introductionthree decades ago [42] TTMC designs have been widelyresearched and applied in education assessment [44ndash52]The LCTSR is a well-known example of a two-tier testdesigned to assess scientific reasoning [1116414244]Two assessment goals are often assumed when using

TTMC tests The first is to measure studentsrsquo knowing andreasoning at the same time within a single question settingUnderstanding the relation between knowing and reasoningcan provide important cognitive and education insights intohow teaching and learning can be improved Recent studieshave started to show a possible progression from knowingto reasoning which suggests that reasoning is harder thanknowing [344153ndash56] However the actual difficultyreflected within a given test can also be affected by anumber of factors including test design content and

population [3441] and therefore needs to be evaluatedwith all the possible factors considered and or controlledThe second and often more primary goal of using the

TTMC design is to suppress ldquofalse positivesrdquowhen the pairscoring method is used It is possible for students to answera question correctly without the correct understandingthrough guessing or other means From a signal processingperspective this type of ldquopositiverdquo signal is generated withunfavorable processes andwas defined as a false positive byHestenes and Halloun [57] For example in the case of theForce Concept Inventory (FCI) [20] which has five answerchoices for each question the chance of a false positive dueto guessing can be estimated as 20 [58] Using TTMC orother sequenced question design a false positive can besuppressed in pair or sequence scoring [51] For examplethe false positive due to guessing drops to 4 in pair scoringfor the five-choice question pair (1=5 times 1=5 frac14 1=25)In contrast it is also possible for a student to answer a

question incorrectly but with correct understanding orreasoning which was defined as a ldquofalse negativerdquo byHestenes and Halloun [57] It has been suggested that afalse negative is less common than a false positive in awell-designed instrument such as the FCI [20] Howeverwhen the test designs are subject to inspection both falsepositives and false negatves should be carefully consideredand examined For a well-designed TTMC instrumentgood consistency between answer and explanation isgenerally expected If a high level of inconsistency isobserved such as a correct answer for a wrong reason orvice versa it could be an indication of content validityissues in question design

D Research question

The construct validity of the LCTSR has been partiallyestablished through quantitative analysis but there hasbeen little research on fully establishing its content validityBased on large scale implementations of the LCTSR anumber of concerns regarding question designs have beenraised prompting the need for a thorough inspection of theassessment features of the LCTSR and to fully evaluate itsvalidity This will allow researchers and educators to moreaccurately interpret the results of this assessment instru-ment Therefore this paper aims to address a gap in theliterature through a detailed study of the possible contentvalidity issues of LCTSR such that the results will helpformally establish the validity of this popular instrumentThis study focuses on one core research question to whatextent is LCTSR valid and reliable to measure scientificreasoning

III METHODOLOGY

A Data collection

This research uses mixed methods [59] to integratedifferent sources of data to identify and study features of

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-4

test design and validity of the LCTSR Three formsof quantitative and qualitative data were collectedfrom freshman in three midwestern public universitiesincluding the following (i) large-scale quantitative data(N frac14 1576) using the LCTSR as written (ii) studentresponses (N frac14 181) to LCTSR questions along with shortanswer free-response explanations for each question and(iii) think-aloud interviews (N frac14 66) with a subgrouprandomly selected from the same students tested in part(ii) During the interviews students were asked to go overthe test again and verbally explain their reasoning for howthey answered the questionsAmong the students in the large-scale quantitative data

988 were enrolled in calculus-based introductory physicscourses with the rest enrolled in algebra-based introduc-tory physics courses The universities selected were thosewith medium ranking in order to obtain a representativepool of the population For the college students the LCTSRwas administered in the beginning of the fall semester priorto any college level instructionIn addition to the student data an expert review panel

consisting of 7 science faculty (3 in physics 2 in biology 1in chemistry and 1 in mathematics) was formed to provideexpert evaluation of the questions for content validityThese faculty are researchers in science education withestablished experience in assessment design and validation

B Analysis method

1 The dependence within and between question pairs

In general test design features can result in dependenceswithin and between question pairs For example in large-scale testing programs such as TOEFL PISA or NAEPthere are frequently observed correlations among questionssharing common context elements such as graphs or textsThese related question sets are often called testlets [60]or question bundles [61] In this way a typical two-tierquestion pair can be considered as a specific type of testletor question bundle [62] In addition the LCTSR wasdesigned with six subskill dimensions that each can containmultiple question pairs built around similar contextualscenarios As a result dependences between question pairswithin a single skill dimension are not surprising Theanalysis of correlations among related questions canprovide supporting evidence for identifying and validatingfactors in question design that might influence studentsrsquoperformances one way or the otherIn classical score-based analysis dependences among

questions are analyzed using correlations among questionscores This type of correlation contains contributions fromall factors including studentsrsquo abilities skill dimensionsquestion difficulties question design features etc If thedata are analyzed with Rasch or other item response theory(IRT) models one can remove the factors due to studentsrsquoabilities and item difficulties and focus on question designfeatures and other factors Therefore the fit measure

Q3 [63] which gives the correlation among questionresiduals [Eqs (1) and (2)] is used to analyze the depend-ences among questions in order to examine the influenceson studentsrsquo performance variances from question designfeatures

dik frac14 Xik minus PiethθkTHORN eth1THORN

Here dik is the difference (residual) between Xik theobserved score of the kth student on the ith question andthe model predicted score PiethθkTHORN The Pearson correlationof these question residuals is computed as

Q3ij frac14 rdidj eth2THORN

where di and dj are question residuals for questions i and jrespectively In using Q3 to screen questions for localdependence Chen and Thissen [64] suggested an absolutethreshold value of 02 above which a moderate to strongdependence can be derived For a test consisting of nquestions the mean value of all Q3 correlations amongdifferent questions is expected to be minus1=ethn minus 1THORN [65] TheLCTSR contains 24 questions therefore the expectedmean Q3 is approximately minus0043The LCTSR has a TTMC design with six subskills By

design there are at least two categories of dimensionsincluding the two-tier structure and content subskillsAdditional observed dimensions beyond the two categoriescan be considered as a result of contextual factors inquestion design In data analysis a multidimensional Raschmodel is used to account for the dependences due to thesubskill dimensions Then the variances of the residuals[Eq (1)] would primarily be the results of TTMC structureand other question design features For an ideal TTMC testit is often expected that dependences only exist withinbut not between question pairs This is equivalent to thecondition of local independence among question pairsWith the subskill dimensions removed by the multidimen-sional Rasch modeling significant residual dependencesamong different question pairs can be indications ofpossible construct and content validity issues in the ques-tion design

2 Pattern analysis of the LCTSR

As discussed earlier the first 10 LCTSR question pairs(questions 1 through 20) used the traditional two-tierdesign while the last two pairs (questions 21-22 and23-24) have a different flavor of paired questionsIdeally traditional two-tier questions would expect goodconsistency between tier-1 and tier-2 questions (iebetween answer and explanation) If a high level ofinconsistency is observed such as a correct answer for awrong reason or vice versa it may suggest issues inquestion design that can lead to false positive or negativeassessment outcomes Therefore analysis of question-pair

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-5

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 5: Validity evaluation of the Lawson classroom test of

test design and validity of the LCTSR Three formsof quantitative and qualitative data were collectedfrom freshman in three midwestern public universitiesincluding the following (i) large-scale quantitative data(N frac14 1576) using the LCTSR as written (ii) studentresponses (N frac14 181) to LCTSR questions along with shortanswer free-response explanations for each question and(iii) think-aloud interviews (N frac14 66) with a subgrouprandomly selected from the same students tested in part(ii) During the interviews students were asked to go overthe test again and verbally explain their reasoning for howthey answered the questionsAmong the students in the large-scale quantitative data

988 were enrolled in calculus-based introductory physicscourses with the rest enrolled in algebra-based introduc-tory physics courses The universities selected were thosewith medium ranking in order to obtain a representativepool of the population For the college students the LCTSRwas administered in the beginning of the fall semester priorto any college level instructionIn addition to the student data an expert review panel

consisting of 7 science faculty (3 in physics 2 in biology 1in chemistry and 1 in mathematics) was formed to provideexpert evaluation of the questions for content validityThese faculty are researchers in science education withestablished experience in assessment design and validation

B Analysis method

1 The dependence within and between question pairs

In general test design features can result in dependenceswithin and between question pairs For example in large-scale testing programs such as TOEFL PISA or NAEPthere are frequently observed correlations among questionssharing common context elements such as graphs or textsThese related question sets are often called testlets [60]or question bundles [61] In this way a typical two-tierquestion pair can be considered as a specific type of testletor question bundle [62] In addition the LCTSR wasdesigned with six subskill dimensions that each can containmultiple question pairs built around similar contextualscenarios As a result dependences between question pairswithin a single skill dimension are not surprising Theanalysis of correlations among related questions canprovide supporting evidence for identifying and validatingfactors in question design that might influence studentsrsquoperformances one way or the otherIn classical score-based analysis dependences among

questions are analyzed using correlations among questionscores This type of correlation contains contributions fromall factors including studentsrsquo abilities skill dimensionsquestion difficulties question design features etc If thedata are analyzed with Rasch or other item response theory(IRT) models one can remove the factors due to studentsrsquoabilities and item difficulties and focus on question designfeatures and other factors Therefore the fit measure

Q3 [63] which gives the correlation among questionresiduals [Eqs (1) and (2)] is used to analyze the depend-ences among questions in order to examine the influenceson studentsrsquo performance variances from question designfeatures

dik frac14 Xik minus PiethθkTHORN eth1THORN

Here dik is the difference (residual) between Xik theobserved score of the kth student on the ith question andthe model predicted score PiethθkTHORN The Pearson correlationof these question residuals is computed as

Q3ij frac14 rdidj eth2THORN

where di and dj are question residuals for questions i and jrespectively In using Q3 to screen questions for localdependence Chen and Thissen [64] suggested an absolutethreshold value of 02 above which a moderate to strongdependence can be derived For a test consisting of nquestions the mean value of all Q3 correlations amongdifferent questions is expected to be minus1=ethn minus 1THORN [65] TheLCTSR contains 24 questions therefore the expectedmean Q3 is approximately minus0043The LCTSR has a TTMC design with six subskills By

design there are at least two categories of dimensionsincluding the two-tier structure and content subskillsAdditional observed dimensions beyond the two categoriescan be considered as a result of contextual factors inquestion design In data analysis a multidimensional Raschmodel is used to account for the dependences due to thesubskill dimensions Then the variances of the residuals[Eq (1)] would primarily be the results of TTMC structureand other question design features For an ideal TTMC testit is often expected that dependences only exist withinbut not between question pairs This is equivalent to thecondition of local independence among question pairsWith the subskill dimensions removed by the multidimen-sional Rasch modeling significant residual dependencesamong different question pairs can be indications ofpossible construct and content validity issues in the ques-tion design

2 Pattern analysis of the LCTSR

As discussed earlier the first 10 LCTSR question pairs(questions 1 through 20) used the traditional two-tierdesign while the last two pairs (questions 21-22 and23-24) have a different flavor of paired questionsIdeally traditional two-tier questions would expect goodconsistency between tier-1 and tier-2 questions (iebetween answer and explanation) If a high level ofinconsistency is observed such as a correct answer for awrong reason or vice versa it may suggest issues inquestion design that can lead to false positive or negativeassessment outcomes Therefore analysis of question-pair

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-5

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 6: Validity evaluation of the Lawson classroom test of

response patterns can provide useful information to identifypotential content validity issues For a TTMC question pairthere are four score-based question pair response patterns(11 00 10 01) which are listed in Table II Patterns 00 and11 represent consistent answers while patterns 10 and 01are inconsistent answersIn this study the inconsistent patterns were analyzed in

detail to help identify possible content validity issues ofcertain questions For each question pair the frequencies ofthe different response patterns from the entire populationwere calculated and compared with other question pairsregarding the popularity of inconsistent patterns Questionpairs with a significantly higher level of inconsistentpatterns were identified for further inspection on theirquestion designs for validity evaluationIn addition for a specific question pair studentsrsquo

responses should vary with their performance levelsStudents with high test scores are often expected to beless likely to respond with inconsistent patterns [6667]When the measurement results on certain question pairsdepart from such expectations indications of contentvalidity issues can be expected To study this type oftrending relations distributions of inconsistent patternsfrom low to high score levels were calculated and com-pared For the ith question pair probability PiethsTHORN wasdefined to represent the normalized frequency of theinconsistent patterns for the subgroup of students withscore s Using individual scoring studentsrsquo test scores onthe LCTSR were binned into 25 levels of measured scores(s frac14 0 1 2hellip 24) For each binned score level one cancalculate PiethsTHORN with

PiethsTHORN frac14NiethsTHORNNethsTHORN eth3THORN

Here NethsTHORN is the total number of students with score swhileNiethsTHORN is the number of students out of NethsTHORNwho haveinconsistent responses on the ith question pair The overallPi and its distribution of PiethsTHORN over s were analyzed foreach question pair Question pairs with large Pi andorabnormal distributions (eg large Pi at high s) wereidentified for further inspection on possible validity issuesFurthermore using the individual scoring method the

scores gained from inconsistent responses were comparedto the total scores to evaluate the impact of the inconsistent

responses on score-based assessment Two measures wereused one is denoted as R12ethsTHORN to represent the fraction ofscores from inconsistent responses out of all 12 questionpairs for students with score s and the other is R5ethsTHORN whichgives the fraction of scores from inconsistent responses outof 5 question pairs identified as having significant validityissues (to be discussed in later sections) The two fractionswere calculated using Eqs (4) and (5)

R12ethsTHORN frac14P

12ifrac141 PiethsTHORNs

eth4THORN

R5ethsTHORN frac14P

5 PiethsTHORNs

eth5THORN

3 Qualitative analysis

The analysis was conducted on three types of qualitativedata including expert evaluations student surveys withopen-ended explanations and student interviews Theresults were synthesized with the quantitative analysis toidentify concrete evidence for question design issues ofLCTSR and their effects on assessment results

IV RESULTS OF QUANTITATIVE ANALYSIS TOIDENTIFY POSSIBLE VALIDITY ISSUES

A Basic statistics

The distributions of studentsrsquo scores using both pair andindividual scoring methods were calculated and plotted inFig 1 For this college population the distributions appearto be quite similar and both are skewed to the higher end ofthe scale On average the mean test score from the pairscoring method (mean frac14 5847 SD frac14 2303) is signifi-cantly lower than the score from the individual scoringmethod (meanfrac146803 SDfrac142033) (plt0001 Cohenrsquosd frac14 1682) The difference indicates that there exists asignificant number of inconsistent responses (see Table II)which will be examined in the next sectionThe reliability of LCTSR was also computed using

scores from both scoring methods The results show thatthe reliability of the pair scoring method (α frac14 076n of question paris frac14 12) is slightly lower than that of theindividual scoring method (α frac14 085 n of questions frac14 24)due to the smaller number of scoring units with pair scoringTo counter for the difference in test length reliability of the

TABLE II Question-pair score response patterns A correct answer is assigned with 1 and a wrong answer is assigned with 0

Scoring

Response patterns Description Pair Individual

11 Consistent correct correct answer with correct reasoning 1 200 Consistent wrong wrong answer with wrong reasoning 0 010 Inconsistent correct answer with wrong reasoning 0 101 Inconsistent wrong answer with correct reasoning 0 1

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-6

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 7: Validity evaluation of the Lawson classroom test of

pair scoring method was adjusted by the Spearman-Brownprophecy formula which gives a 24-item equivalent α at086 almost identical to that of the individual scoringmethod The results suggest that the LCTSR producesreliable assessment scores with both scoring methods[68] The reliabilities of the 6 subskills were also computedusing the individual scoring method The results show thatthe reliabilities of the first 5 subskills are acceptable(α frac14 071 081 065 082 and 083 respectively) [68]For the last subskill (HD) the reliability is lower than theacceptable level (α frac14 052 lt 065)Using the individual scoring method the item mean

discrimination and point biserial coefficient for eachquestion were computed and listed in Table III The resultsshow that most item means except for question 1 2 and16 fall into the suggested range of difficulty [03 09] [69]These three questions appear to be too easy for the collegestudents tested which also result in poor discrimination(lt03) [69] For all questions the point biserial correlationcoefficients are larger than the criterion value of 02 whichsuggests that all questions hold consistent measures withthe overall LCTSR test [69]The mean score for each question was also plotted in

Fig 2 The error bars represent the standard error of the

mean which are very small due to the large sample sizeWhen the responses to a question pair contain a significantnumber of inconsistent patterns the scores of the twoindividual questions are less correlated The mean scores ofthe two questions in the pair may also be substantiallydifferent but are dependent on the relative numbers of the10 and 01 patterns To compare among the question pairsthe correlation and effect size (Cohenrsquos d) between questionscores within each question pair were calculated andplotted in Fig 2 The results show that among the 12question pairs seven (1-2 3-4 5-6 9-10 15-16 17-18 and19-20) have high correlations (r gt 05) and relatively smalldifferences in score The remaining five pairs (7-8 11-1213-14 21-22 and 23-24) have considerably smaller corre-lations (r lt 04) and larger differences in score indicatinga more prominent presence of inconsistent response pat-terns As supporting evidence the Cronbach α wascalculated for each question pair and listed in Table IIIFor the five pairs having large number of inconsistentresponses (7-8 11-12 13-14 21-22 and 23-24) thereliability coefficients are all smaller than the ldquoacceptablerdquolevel (065) which again suggest a high level of incon-sistency between the measures of the two questions withineach pair

FIG 1 Score distribution (a) pair scoring method and (b) individual scoring method

TABLE III Basic assessment parameters of LCTSR questions (pairs) with college students (N frac14 1576) regarding item mean (or itemdifficulty) discrimination and point biserial coefficient (rpb) calculated based on the classical test theory The Cronbach α of eachquestion pair was also calculated where values below the acceptable level of 065 are marked with [68]

Question 1 2 3 4 5 6 7 8 9 10 11 12

Mean 096 095 080 079 064 064 050 070 079 078 051 030Discrimination 009 011 045 047 072 070 068 054 045 046 062 036rpb 022 029 046 049 059 056 048 046 047 047 044 024Cronbach α 082 097 089 054 092 046

Question 13 14 15 16 17 18 19 20 21 22 23 24Mean 060 051 082 091 083 082 074 068 042 051 044 070Discrimination 053 049 043 024 037 042 038 050 055 041 042 040rpb 037 032 048 042 045 051 031 040 038 026 026 032Cronbach α 038 069 086 083 056 033

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-7

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 8: Validity evaluation of the Lawson classroom test of

B Paired score pattern analysis

As discussed earlier results of inconsistent responses onquestions in two-tier pairs can be used to evaluate thevalidity of question designs The percentages of the fourresponses for each question pair are summarized inTable IV along with the correlations between the questionscores of each pair The results show that the averagepercentage of the inconsistent patterns for LCTSR isapproximately 20 (see the row of ldquoSum of 01 and 10rdquoin Table IV) Based on the level of inconsistent patternstwo groups of question pairs can be identified (p lt 0001Cohenrsquos d frac14 64) Seven pairs (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) appear to have good consistency(r gt 05 ravg frac14 076) as their inconsistent patterns rangefrom 17 to 137with an average of 74 (SD frac14 46)The remaining five pairs (7-8 11-12 13-14 21-22 23-24)have a much higher level of inconsistent responses with anaverage of 364 (SD frac14 44) The correlation for these isalso much smaller (r lt 04 ravg frac14 031)

Interestingly four of the five pairs with high incon-sistency (11-12 13-14 21-22 23-24) are from the pool ofnew questions added to the LCTSR Because they were notpart of the CTFR-78 these four pairs did not go through theoriginal validity analysis of the 1978 test version Questionpair 7-8 was in the 1978 version of the test but wasmodified for the multiple-choice format of the LCTSRThis might have introduced design issues specific to itsmultiple-choice formatIn addition to question design issues the inconsistent

responses can also be the result of guessing and other test-taking strategies which are more common among lowperforming students Therefore analyzing inconsistentresponses against performance levels can help establishthe origin of the inconsistent responses The probabilitydistribution PiethsTHORN of inconsistent patterns across differentscore levels s for all 12 question pairs is plotted in Fig 4The error bars are Clopper-Pearson 95 confidence inter-vals [7172] which are inversely proportional to the square

FIG 2 The average scores of individual questions and the correlation r and Cohenrsquos d between the two individual question scores ineach question pair The values of Cohenrsquos d larger than 02 and the values of r less than 04 are marked with an [70]

TABLE IV LCTSR paired score patterns (in percentage) of US college freshmen students (N frac14 1576) For the mean values listed inthe table the maximum standard error is 112 Therefore a difference larger than 3 can be considered statistically significant(p lt 005) The distribution of responses patterns for each question pair is significantly different from random guessing (p lt 0001)For example the results of chi square tests for question pairs 13-14 and 21-22 are χeth3THORN frac14 240097 p lt 0001 and χeth3THORN frac14 55838p lt 0001 respectively

Score patterns Average 1-2 3-4 5-6 7-8 9-10 11-12 13-14 15-16 17-18 19-20 21-22 23-24

00 227 33 213 319 239 192 429 253 79 137 226 386 21511 578 938 769 588 417 758 232 357 797 793 637 295 35801 110 12 04 49 277 20 71 158 110 29 34 206 34510 85 17 13 44 67 29 269 233 14 40 103 113 82

Sum of 01 and 10 195 29 17 93 344 50 339 391 124 69 137 319 427

Within pair correlation 057 069 094 080 037 085 034 023 054 076 071 039 020

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-8

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 9: Validity evaluation of the Lawson classroom test of

FIG 3 The distribution of inconsistent patterns as a function of score

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-9

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 10: Validity evaluation of the Lawson classroom test of

root of the sample size At low score levels (s lt 30) thesample sizes are less than 30 [see Fig 1(b)] which result inlarge error barsThe results in Fig 3 show clear distinctions between the

two groups of question pairs with high and low consistencyFor the consistent pair group (1-2 3-4 5-6 9-10 15-1617-18 and 19-20) the inconsistent responses come mostlyfrom low performing students that is PiethsTHORN quicklyapproaches zero when scores increase to the upper 30This result suggests that the inconsistent responses for thesequestion pairs are largely due to students guessing whichdecreases among high performing students In contrastfor the inconsistent pair group (7-8 11-12 13-14 21-2223-24) the inconsistent responses maintain at high levelsthroughout most of the score scale and even for studentswho scored above 90 Since students with higher scoresare expected to answer more consistently than lowerperforming students [66] the sustained inconsistency atthe high score levels is more likely the result of possiblequestion design issues of these five question pairs ratherthan due to students guessingIn order to more clearly compare the effects on scores

from inconsistent responses at different performance levelsthe fractions of score contributions from all 12 questionspairs (R12) and from the 5 inconsistent pairs (R5) are plottedas functions of score in Fig 4 The results show that atlower score levels inconsistent responses are common forall questions (R12 asymp 2R5) and are likely the result ofguessing At higher score levels inconsistent responsescome mostly from the 5 inconsistent question pairs(R12 asymp R5) and therefore are more likely caused by issuesin question design than by students guessing

C The dependence within question pairand between question pairs

As discussed in the methodology section dependenceswithin and between question pairs were analyzed to study

the effects of question designs The dependences aremodeledwith theQ3 fit statisticswhich give the correlationsof the residualsrsquo variances [Eqs (1) and (2)] In this study amultidimensional Rasch model was used to model thevariances from the six subskill dimensions so that theresidualsrsquo variances would primarily contain the contribu-tions from the two-tier structure and other question designfeaturesFor the 24 questions of the LCTSR there are a total of

276 Q3 fit statistics which have a mean value of minus0034(SD frac14 0130) for this data set This mean is within theexpected value for a 24-question test [minus1=eth24 minus 1THORN frac14minus0043] and has sufficient local independence at thewhole test level to carry out a Rasch analysis In this studythe main goal of using the Q3 was to analyze the residualcorrelations among questions so that the dependences due toquestion designs could be examined To do this the absolutevalues of Q3 calculated between every two questions of thetest were plotted with a heat map in Fig 5The heat map shows six distinctive groups of questions

with moderate to high correlations within each group(gt02) while the correlations between questions fromdifferent subskills are much smaller (lt02) This resultis the outcome of the unique design structure of theLCTSR which groups questions in order based on thesix subskills (see Table I) and the questions within eachsubskill group also share similar context and designfeatures For an ideal TTMC test one would expect tohave strong correlations between questions within individ-ual two-tier pairs and much smaller correlations betweenquestions from different two-tier pairs As shown in Fig 3many question pairs show strong correlations betweenquestions within the pairs however there are also questionpairs such as Q7-8 in which a question (Q8) is morestrongly correlated with questions in other pairs (Q5 andQ6) than with the question in its own two-tier pair (Q7)In addition there are six question pairs (11-12 13-14

15-16 21-22 23-24) for which the within-pair residualcorrelations are small (lt02) This suggests that theexpected effects of the two-tier structure have been sig-nificantly interfered possibly by other unknown questiondesign features Therefore these question pairs should befurther examined on their design validities Among the sixquestion pairs five are identified as having high incon-sistent responses (Table IV) The remaining pair (Q15-16)has much less inconsistency which also diminished rapidlyat high scores (see Fig 3) and therefore will be excusedfrom further inspectionSynthesizing the quantitative analysis there appears to

be strong indication for design issues for the five questionpairs with high inconsistency Further analysis of thesequestions using qualitative data will be discussed in thefollowing section to inspect their content validitiesTo facilitate the qualitative analysis the actual answer

patterns of the 5 inconsistent question pairs are provided inTable V Since the total number of possible combinations

FIG 4 The fractions of score contributions from all 12question pairs (R12) and from the 5 inconsistent pairs (R5)plotted as functions of score

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-10

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 11: Validity evaluation of the Lawson classroom test of

TABLE V The most common answer patterns for questions with high inconsistency The population is US college freshmen students(N frac14 1576) The numbers reflect the percentage of the corresponding patterns The patterns are ordered based on their popularity Thecorrect answer patterns are italicized which happen to be the most popular answers for all question pairs

7-8 11-12 13-14 21-22 23-24

Pattern Pattern Pattern Pattern Pattern

da 417 ba 232 cd 357 aa 295 ab 358bd 128 bb 205 ad 125 ab 102 bb 271aa 107 cd 109 ce 123 eb 75 ba 61ba 100 ad 100 ae 67 ea 74 cb 61de 54 ab 44 cc 46 bb 69 ac 56ea 36 de 36 bc 41 ba 67 bc 55ca 31 bd 35 ac 38 ca 53 ca 36Others 128 240 203 268 103

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

0 02 1

FIG 5 A heat map of absolute values ofQ3 which gives the residual correlation The outlined blocks are the six subskills of scientificreasoning (see Table I) The heat maprsquos color transition is nonuniform with the median set at 02 which is the recommendedQ3 value forshowing a correlation as suggested by Chen and Thissen [64]

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-11

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 12: Validity evaluation of the Lawson classroom test of

can reach 25 only the most common choices are shown inTable V For this data set the most often selected choices(first row of data) happen to be the correct answers Theremaining patterns show a wide variety of combinations ofcorrect and incorrect choices some of which are used assupporting evidence for student reasoning implied from thequalitative analysis in Sec IV

V RESULTS OF QUALITATIVEVALIDITY EVALUATION

The quantitative analysis suggests that further inspec-tion is needed regarding the question design validity ofthe five question pairs in LCTSR To find more concreteevidence regarding the potential question design issuesqualitative studies were conducted which includes expertevaluations student written explanations to LCTSRquestions and student interviews In this section thequalitative data on the five question pairs will bediscussed in detail to shed light on the possible questiondesign issuesAs mentioned previously a panel of 7 faculty with

science and science education backgrounds in physicschemistry biology and mathematics provided expert eval-uations of the questionsrsquo construct validities The facultymembers were all familiar with the LCTSR test and hadused the test in their own teaching and researchThe qualitative data on studentsrsquo reasoning contains two

parts Part 1 includes open-ended written explanationscollected from 181 college freshmen science and engineer-ing majors in a US university These students were fromthe same college population whose data are shown inTable II For this part of the study students were asked towrite down explanations of their reasoning to each of thequestions while doing the test Part 2 of the data includesinterviews with a subgroup of the 181 students (N frac14 66)These students were asked to go over the test again andexplain their reasoning on how they answered each of thequestions

A Qualitative analysis of Q7-8

Q7-8 is the second pair of a four-question sequence thatmeasures studentsrsquo skills on proportional reasoningwhich can be accessed from the publicly availableLCTSR [19] The Q7-8 question pair is the second pairin the cluster The correct answer to Q7 is ldquodrdquo and thecorrect reasoning in Q8 is ldquoardquo From the results of studentscores and response patterns on pair 7-8 as shown inTables IVand V almost one-third of students (277) whogave an incorrect answer to Q7 selected the correct reasonin Q8 Conversely 67 answered Q7 correctly butprovided an incorrect reason in Q8 These results suggestthat most students who know the correct outcome alsohave the correct reasoning but a substantial number ofstudents who select an incorrect outcome also select the

correct reasoning choice Therefore the design of thechoices might have features contributing to this effectThe expertsrsquo evaluation of LCTSR reported two con-

cerns which might explain these patterns First for Q7choice ldquodrdquo is the correct answer However choice ldquoardquo has avalue very close to choice ldquodrdquo As reported by teachersstudents are accustomed to rounding numerical answers inpre-college and college assessments Some students maychoose ldquoardquo without reading the other choices thinking thatit is close enough for a correct answer The results inTable V show that 107 of the students picked the answerpattern ldquoaardquo which gives the correct reasoning in Q8 with awrong but nearly correct answer of ldquo7frac12rdquo in Q7The second concern came from teachers who encoun-

tered studentsrsquo questions during the test One commonquestion from students was for clarification about thedifference between choices ldquoardquo and ldquoerdquo in Q8 Studentsoften find these to be fundamentally equivalent Choice ldquoardquois a general statement that the ratio is the same whilechoice ldquoerdquo maintains an identical ratio specifically definedwith the actual value In addition the wording used inchoice ldquoerdquo of Q8 is equivalent to choice ldquocrdquo of Q6 which isthe correct answer Since these questions are in a linked setstudents may look for consistency among the answerchoices which may lead some students to pick ldquoerdquo inQ8 instead of ldquoardquo From Table V 54 of the students didgive the correct answer in Q7 (choice ldquodrdquo) but picked ldquoerdquo inQ8 which is the most popular ldquoincorrectrdquo explanation forstudents who answered Q7 correctlyIn order to further explore why students might have

selected a correct answer for a wrong reason and viceversa studentsrsquo written explanations and interviewresponses were analyzed Figure 6 shows the calculationsand written explanations from two students Student 1answered Q7 correctly but selected the wrong reasoning

FIG 6 Examples of two studentsrsquo written explanations toQ7-8 Student 1 had the correct ratio of 2∶3 from the previousquestion pair Q5-6 and applied the proportional reasoningcorrectly in Q7 However the student selected ldquoerdquo in Q8 asthe reasoning which is incorrect Student 2 knew to use the ratiobut failed to apply it to obtain the correct result

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-12

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 13: Validity evaluation of the Lawson classroom test of

in Q8 The student appears to have a good understanding ofhow to use the mathematics involved in proportions andratios but picked ldquoerdquo as the answer for the reasoning partAmong the 66 interviewed 5 students had answers andexplanations similar to that of student 1 shown in Fig 6These students could correctly solve the problem usingknowledge of ratios and mathematics for Q7 but pickedldquoerdquo for their reasoning Further analysis of studentsrsquoexplanations in the interviews revealed that they all thoughtldquoerdquo was a more detailed description of the ratio which isspecific to the question whereas ldquoardquo was just a generallytrue statement Therefore these students chose the moredescriptive answer out of the two choices that bothappeared to them to be correctIn the example of student 2 shown in Fig 6 the student

failed to set up and solve the proportional relations in thequestion but picked the correct reason The explanationwas rather simplistic and simply indicated that the questioninvolved a ratio Apparently the student was not able tocorrectly apply the ratio to solve this problem but knew toldquouse ratiosrdquo which led to correctly selecting ldquoardquo in Q8These results indicate that the design of choices ldquoardquo and

ldquodrdquo in Q8 is less than optimal Students may treat the twochoices as equivalent with choice ldquoardquo being a generally truestatement about ratio and choice ldquoerdquo being a more sub-stantive form of the same ratio This design can causeconfusion to those who are competent in handling pro-portions but is more forgiving to those who are lesscompetent leading to both false positives and false neg-atives [57] In addition although it was not observed in ourinterviews the numerical value of the incorrect choice ldquoardquo(7 1=2) was very close to the correct answer (choice ldquodrdquo7 1=3) which might cause false negatives due to roundingas suggested by our experts Therefore it would bebeneficial to modify the numerical value of choice ldquoardquosuch that it would not be in the range of typical rounding ofthe correct answer

B Qualitative analysis of Q11-12 and Q13-14

These two question pairs were designed to measurestudent reasoning on control of variables (COV) using acontext that involves experiments with fruit flies Thecorrect answers to the four individual questions are ldquobrdquoldquoardquo ldquocrdquo and ldquodrdquo From Table II the assessment results ofQ11-12 and Q13-14 show a high level of inconsistentanswer patterns (01 and 10)For these two pairs of questions the expert review

reported four areas of concerns First the graphicalrepresentation of black paper wrapped on the test tube ismisleading Several faculty pointed out that the black dotson the paper were initially perceived as flies rather thanblack paper and they had to go back and forth to readthe text and figure in order to make sense of the scenarioThis may not be a critical validity issue but does impactinformation processing regarding time and accuracy and

may lead to confusion and frustration among studentsAnother related issue is the presentation of the numbers offlies in the different areas of the tube For tubes without theblack paper the numbers are straightforward For tubeswith black paper the numbers of flies in the covered partof the tube are not shown This may not be an issue foradvanced students but can pose some processing overheadfor some students particularly those in lower grades Forconsistency of representation it is suggested that thenumber of flies in the covered part of the tube also beshown on the figureSecond the orientation of the tubes relative to the

provided red and blue light may also cause confusionThere is no indication of a table or other means ofsupporting the tubes and the light is shown as comingfrom both top-down and bottom-up directions on the pageIn a typical lab setting light coming from bottom-up will beblocked by a table top (usually made of non-transparentmaterials) The current configuration is not common ineveryday settings and may mislead students to consider thatall tubes are lying flat (horizontally) on a table surface(overhead view) with the light beams projected horizontallyfrom opposite directions In this interpretation it will bedifficult to understand how gravity plays a role in theexperimentThe third concern is the design of choices for the

reasoning questions (Q12 and Q14) The typical scientificstrategies for analyzing data from a controlled experimentis to make explicit comparisons between the results ofdifferent conditions to emphasize the covariation relationsestablished with controlled and varied variables In this setof questions the correct comparisons should include tubesIII and IV to determine the effect of gravity and tubes II andIV to determine the effect of light Tube IV is the base forcomparison with which both the gravity and light are set ina noneffect state However none of the choices in Q12 andQ14 gives a straightforward statement of such comparisonsand tube IV is not included in the correct answers (choiceldquoardquo for Q12 and choice ldquodrdquo for Q14) The language used inthe choices simply describes certain factual features of thesetting but doesnrsquot provide clear comparisonsIn addition tube I is a confounded (uncontrolled)

situation in which both the gravity and light are variedwith respect to tube IV Correct COV reasoning shouldexclude such confounded cases but tube I is included in thecorrect answer of Q14 (choice ldquodrdquo) which raises concernsabout the validity of this choice being correctFinally it is worth noting that the conditions in the

questions do not represent a complete set For example theconfigurations for the absence of light or gravity are notincluded in the settings Therefore the outcomes should notbe generalized into those situationsResults from studentsrsquo written explanations and inter-

views support the expertsrsquo concerns Many students com-plained about the images in Q11-Q14 Some students were

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-13

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 14: Validity evaluation of the Lawson classroom test of

confused because ldquothe black paper looks lumpy and lookslike a mass of fliesrdquo Others asked if the black paperblocked out the light entirely as the light may come in fromthe ends as illustrated by the rays in the figureStudents were also confused on how the tubes were

arranged Some thought that all the tubes were lying on atable top and the figures were a birdrsquos eye view of the setupIt was only after the interviewer sketched a table perspec-tive overlaid with the tubes that many realized how thetubes were meant to be positioned In addition the arrowsof the light seemed to cause misinterpretation of themeaning of ldquorespond to lightrdquo Some students thought thatthe light beams were only going in the two directionsillustrated by the arrows ldquothe flies fly towards the light(against the light beam) in tube I and IIIrdquo This can causecomplications in understanding the effect of gravity and theoutcomes of the horizontal tubesThere are also students who seemed to have a different

interpretation of what ldquorespond to gravityrdquo means Forexample for students who selected ldquoardquo (red light notgravity) in Q11 their typical reasons included ldquoif the fliesresponded to gravity they should go to the bottom of tube Iand IIIrdquo or if ldquothe flies ignore gravity they fly high intubesrdquo This result suggests a possible complication inreasoning with gravity Since gravity in real life makesthings fall downward being able to defy gravity (flying up)seemed to have been made equivalent to the noneffect ofgravity by these studentsIn general students did not like the choices for the

reasoning provided in Q12 and Q14 It appeared thatstudents were looking for choices that made direct com-parisons of multiple tubes which is the correct strategy inanalyzing data of COV experiments Unsatisfied with thechoices some students also suggested their own answersfor example ldquoThe files are even in IV but are concentratedin the upper end of III and nearly even in IIrdquo andldquoComparison of III and IV shows response to gravitywhile comparison of II and IV shows response to lightrdquoA few students also expressed concern about the inter-

action between light and gravity ldquoThe question cannot tellhow the flies react to light and gravity separately and onemay need BOTH light and gravity to get a reaction becausethe light is never turned offrdquoThe response patterns in Table III show that for Q11-12

the major inconsistent pattern is ldquobbrdquo (205) which iscorrect for Q11 but incorrect for Q12 A typical studentanswer (student 3) is shown in Fig 7 The studentrsquosreasoning was mostly based on the fliesrsquo behaviors inresponding to gravity and was implicit regarding theeffect of red light Students might implicitly know thatthe red light was not a factor and then primarily focusedon the influential factor (the gravity) in their explanationsThis also agrees with many two-tier assessment studieswhich have demonstrated that knowing is easier thanreasoning [3441]

For Q13-14 many students seemed to struggle betweenchoices ldquodrdquo and ldquoerdquo A significant number of students whoanswered correctly for Q13 picked the incorrect choice ldquoerdquofor Q14 A representative student response (student 5) inFig 7 shows that the student used correct reasoning incomparing the tubes but was not satisfied with the answersin Q14 and picked the wrong choice This indicates thatthe design of the choices of Q14 can benefit from furtherrefinement In particular the correct answer (ldquodrdquo) shouldnot include the confounded condition (tube I)From student interviews and response patterns there is

also evidence that suggests that studentsrsquo understandingof statistical significance of a measured outcome is animportant factor in their decision making regarding thecovariation effect of an experiment For example approx-imately 15 of the students considered red light as aninfluential factor in Q11-12 Their reasoning indicates anissue in understanding statistical significance ldquoit is hard todecide because in tube II 11 flies are in the lighted partand 9 in the dark part a 11∶9 ration which is not a verybig difference to make judgementrdquo It appears that thesestudents were not sure if 11 means more flies or just arandom uncertainty of an equal distribution To reduce suchconfusion it might be helpful to increase the total numberof flies so that students are better able to make a judgementbetween random and meaningful outcomesThe results suggest that the pictorial presentations of

the questions did have some negative impact on studentrsquosunderstanding of the actual conditions of the experimentsIn addition the design of the choices in the reasoningquestions are less than optimal In COV conditionsstudents have the natural tendency to compare tubesrather than look at a single tube to test a hypothesisTherefore the choices should reflect such comparisonsand the confounded cases should not be included in thecorrect answer

FIG 7 Examples of studentsrsquo written explanations to Q11-14

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-14

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 15: Validity evaluation of the Lawson classroom test of

C Qualitative analysis of Q21-22 and Q23-24

The last two pairs of the LCTSR were designed tomeasure student ability in hypothetical deductive reasoning[7374] As discussed earlier these two pairs of questionshave a different style than the traditional two-tier designQ21-22 asks for the experimental design to test a givenhypothesis in tier 1 and the experimental outcome that willdisprove the hypothesis in tier 2 Meanwhile in Q23-24 aphenomenon is introduced along with two possible hypoth-eses and an experimental design that can test the hypothesesStudents are asked to select the experimental outcomes thatwill disprove the two hypotheses in tier 1 and tier 2respectivelyThe expert reviews reported a number of concerns with

these two question pairs Both involve complex experi-mental settings with a large number of detailed variablesand possible extensions Subsequently they require a highlevel of reading and processing that can be overly demand-ing for some students In terms of the content andreasoning the experts cited a number of plausibility andunclear variable issues For Q21-22 the scenario involvesadditional variables and relations including dry ice aballoon and suction which are not clearly defined ordiscussed for their roles and uses in the experimentStudents may interpret some of these relations in a waytangential to what the questions were designed for whichwould render their reasoning irrelevantMore specifically the concept of suction and the use

of a balloon may cause additional confusion Suction is aphysical scenario in which the air enclosed in a sealedvolume is mechanically removed to create a lower pressureregion relative to the exterior environment causing liquidto be pushed into the lower pressure region by the higher airpressure in the environment For Q21-22 although thedissolution of carbon dioxide will remove a certain amountof gas in the upside-down glass and decrease the pressurewithin the process is mechanically different from a real-world suction event in which the gas is removed throughmechanical means Such differences may confuse studentsand cause unintended interpretations Even when themeasurement of pressure is granted in this experimentfrom the practical standpoint using a balloon should beavoided Based on life experience a typical balloon is madeof rubber material and will melt or burn if placed on top of acandle flame This setting is not plausible to function asproposed by the questionsFor Q23-24 the experts again identified plausibility

issues Most plastic bags are air and water tight and donot work like osmosis membranes Furthermore liquid isnearly incompressible and solutions of positive or negativeions will not produce nearly enough pressure neededto compress liquid (if at all) Therefore the proposedscenario of compressing liquid-filled plastic bags is physi-cally implausible and the proposed experiment may leavestudents with a sense of unreality In this case the

experimental setting mixes real and imaginary parts ofcommon sense knowledge and reasoning In addition itasks students to remove certain known features of realobjects and retain other features to form a causal relationThis is an awkward design which requires one to partiallyshut down common sense logic while pretending to use thesame set of common sense logic to operate normally andignore certain common-sense outcomes This is highlyarguable to be an appropriate context for students to performhypothetical reasoning which is already complex itselfIn addition the two pairs of questions ask for proving a

hypothesis wrong Typically students have more experi-ence in supporting claims as correct rather than as wrongTherefore students may answer incorrectly due to simpleoversight and or misreading the question Bolding orunderlining key words may alleviate these issuesStudentsrsquo responses to these questions resonate with the

expertsrsquo concerns From interviews and written explana-tions many students complained that these questionswere difficult to understand and needed to be read multipletimes Many continued to misinterpret the questions whichimpacted their responses For example for the task whichasks students to prove the hypothesis wrong some studentsthought it meant ldquowhat would happen if the experiment wentwrongrdquoOthers simply went on to prove the hypothesis wasright This is evident from the large number of ldquoabrdquo patterns(102) on Q21-22 and ldquobbrdquo patterns (271) on Q23-24In addition since questions Q23-24 ask for proving twohypotheses this cued some students to compare between thetwo and left them struggling with the relations ldquoIf explan-ation I is wrong does it mean that explanation II has to berightrdquo and ldquoI misread the question the 1st time I didnrsquot seethat they were 2 separate explanations thought one wouldbe right and the other would be wrongrdquo This misunder-standing of the question caused some students to prove onehypothesis wrong and one correctIn addition for Q21-22 students overwhelmingly dis-

liked the inclusion of the concept of suction and a balloonin the scenario ldquosuction seems unrelated to the possibleexplanationrdquo ldquosuction has nothing to do with what ishappeningrdquo ldquochoice d was unreasonablerdquo ldquothe questionmade no mention of a balloon being available for usemdashoption d was wrong The starting part of option D waswrong because the use of the word lsquosuctionrsquo was wrongbecause pressure was what caused the riserdquoThere were also students not comfortable with the use of

dry ice ldquodid not understand the process described in thequestionmdashthought dry ice was to frozen the waterrdquo ldquoforthis question it seems like you need to know chemistryrdquoldquoconcerned about answer A because is there a guaranteethat the carbon dioxide isnrsquot going to escape the waterHow long will it reside in the waterrdquoBased on studentsrsquo responses and expert reviews the

design of these two question pairs appears to be less thanoptimal From the expert point of view there are significant

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-15

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 16: Validity evaluation of the Lawson classroom test of

plausibility concerns to some of the involved conditionsAlthough such concerns were not observed from studentsrsquoresponses these should be addressed in order to make thescience accurate From the studentsrsquo perspective thesequestions require substantial content knowledge in chem-istry and physics in order to fully understand the intentionsof the experimental designs and processes A lack ofcontent understanding such as that involved with dryice solubility of carbon dioxide osmosis processes etccould interfere with studentsrsquo ability to reason out solutionsto the questions On the other hand students fluent in suchcontent knowledge may simply skip the intended reasoningprocess and use memorized scientific facts to answer thequestions Therefore it would be helpful to address theidentified content knowledge issues so that studentsrsquohypothetical deductive reasoning abilities can be moreaccurately assessed And last the questions asked studentsto prove a provided hypothesis wrong which led tosignificant misreading among students This type of designcan also benefit from further revisions

D Ceiling effect

In general the Lawson test is known to be a simple testthat measures learnersrsquo basic reasoning skills and theceiling effect is evident when it is used with students atthe college level In order to understand the usability of theLawsonrsquos test for different age groups Bao et al [4]reported on the general developmental trend of Lawson testscores for Chinese and US students They showed that theLCTSR scores approach ceiling for age groups around thesenior high school to early college yearsOne concern of the demonstrated ceiling effect is that it

is significantly below the 100 mark When includingsenior college students the maximum average is approx-imately 80 One would expect the ceiling to be close to100 for advanced students Based on the analysis ofstudentsrsquo qualitative responses this low ceiling level islikely due to the question design issues discussed in thispaper That is it was evident that some of the senior collegestudents were able to apply appropriate reasoning to obtaincorrect answers in tier 1 but were in disagreement with thechoices of reasoning in tier-2 such as those discussedfor Q8 Q12 and Q14 The ceiling effect also limits thediscrimination of LCTSR among students beyond highschool Therefore in order to assess the reasoning abilitiesof college students additional questions that involve moreadvanced reasoning skills are needed

VI CONCLUSION AND DISCUSSION

In general the LCTSR is a practical tool for assessinglarge numbers of students The instrument has good overallreliabilities with both the individual and pair scoringmethods (Cronbach α gt 08) when the test length iscontrolled The test is on the easy side for college students

with typical mean scores at the 60 level for pair scoringand 70 level for individual scoring The residual corre-lation analysis using Q3 statistics shows acceptable localindependence for the overall testWhen individual question pairs are examined how-

ever seven out of the twelve pairs perform as expectedfor the TTMC design while the remaining five pairssuffer from a variety of design issues These five questionpairs have a considerable amount of inconsistent responsepatterns Qualitative analysis also indicates wide rangingquestion design issues among the content and contextscenarios pictorial and narrative presentations andanswer choices All of these potentially contribute tothe high level of inconsistent responses and the less thanoptimal ceilingThe identified question design issues are likely an

underlying cause for the substantial uncertainties in stu-dentsrsquo responses which need to be carefully consideredwhen interpreting the assessment results While the assess-ment of the overall scientific reasoning ability of largeclasses is reliable as the uncertainties from the questiondesign issues can be ldquoaveraged outrdquo interpretations of theaffected subskills including proportional reasoning con-trol of variables and hypothetic-deductive reasoning mayhave less power due to the significant amount of uncer-tainty involved Direct comparisons among the affectedsubskills without addressing the question design issues isquestionable For example a recent study has shown thatthe progression trends of COV and HD measured withsubskill scores are different from that of the other subskills[38] Their results were used to conclude a distinctive stageof scientific reasoning which should be further inspectedagainst the question design issues uncovered in this studyTherefore based on the results of this study it is suggestedthat the LCTSR is generally valid for assessing a unidimen-sional construct of scientific reasoning however its val-idity in assessing subskills is limitedIn response to the limitations of the LCTSR observed in

practice several studies have attempted to design newquestions for assessing scientific reasoning [7576] Forexample Zhou et al [76] developed questions specific forCOV These development efforts can be further informedby the results from this study which provides a compre-hensive and concrete evaluation of the possible questiondesign issues of the LCTSRThe method used in this study which emphasizes the

inconsistent responses of question pairs may also contrib-ute to the methodology in developing and validating two-tier tests While the two-tier test was originally developedas an effective alternative of the traditional multiple-choicetest to suppress the possible false positive [344257] it hasbeen unclear as to how to quantify the likelihood of falsepositives in TTMC tests The method of analyzing incon-sistent response patterns described in this paper provides apractical means to quantitatively evaluate the probabilities

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-16

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 17: Validity evaluation of the Lawson classroom test of

of possible false positives and negatives which can be usedas evidence for establishing the validity of two-tier testsFinally this study conducted validity research on the

LCTSR using only the data from US college freshmanstudents The validity evaluation is population dependentand additional validity issues may exist among otherpopulations Nevertheless the claim regarding poor contentvalidity for the five question pairs in LCTSR is sufficientlywarranted with the evidence identified

ACKNOWLEDGMENTS

The research is supported in part by the NationalScience Foundation Grants No DUE-1044724 DUE-1431908 DRL-1417983 and DUE-1712238 Any opin-ions findings and conclusions or recommendationsexpressed in this paper are those of the author(s) anddo not necessarily reflect the views of the NationalScience Foundation

[1] National Research Council Education for Life and WorkDeveloping Transferable Knowledge and Skills in the 21stCentury (National Academies Press Washington DC2012)

[2] J Piaget The stages of the intellectual development of thechild in Readings in Child Development and Personalityedited by P H Mussen J J Conger and J Kagan (Harperand Row New York 1970) pp 98ndash106

[3] J Hawkins and R D Pea Tools for bridging the cultures ofeveryday and scientific thinking J Res Sci Teach 24 291(1987)

[4] L Bao T Cai K Koenig K Fang J Han J Wang QLiu L Ding L Cui Y Luo et al Learning and scientificreasoning Science 323 586 (2009)

[5] C Zimmerman The development of scientific reasoningWhat psychologists contribute to an understanding ofelementary science learning Paper commissioned by theNational Academies of Science (National Research Coun-cilrsquos Board of Science Education Consensus Study onLearning Science Kindergarten through Eighth Grade)(2005)

[6] C Zimmerman The development of scientific thinking skillsin elementary and middle school Dev Rev 27 172 (2007)

[7] P Adey and M Shayer Really Raising Standards(Routledge London 1994)

[8] Z Chen and D Klahr All other things being equalAcquisition and transfer of the control of variables strategyChild Development 70 1098 (1999)

[9] A M Cavallo M Rozman J Blickenstaff and N WalkerLearning reasoning motivation and epistemological be-liefs J Coll Sci Teach 33 18 (2003)

[10] M A Johnson and A E Lawson What are the relativeeffects of reasoning ability and prior knowledge on biologyachievement in expository and inquiry classes J Res SciTeach 35 89 (1998)

[11] V P Coletta and J A Phillips Interpreting FCI scoresNormalized gain preinstruction scores and scientificreasoning ability Am J Phys 73 1172 (2005)

[12] H-C She and Y-W Liao Bridging scientific reasoningand conceptual change through adaptive web-based learn-ing J Res Sci Teach 47 91 (2010)

[13] S Ates and E Cataloglu The effects of studentsrsquoreasoning abilities on conceptual understandings and

problem-solving skills in introductory mechanics Eur JPhys 28 1161 (2007)

[14] J L Jensen and A Lawson Effects of collaborative groupcomposition and inquiry instruction on reasoning gainsand achievement in undergraduate biology CBE-Life SciEduc 10 64 (2011)

[15] B Inhelder and J Piaget The Growth of Logical Thinkingfrom Childhood to Adolescence An Essay on the Con-struction of Formal Operational Structures (Basic BooksNew York 1958)

[16] A E Lawson The development and validation of aclassroom test of formal reasoning J Res Sci Teach15 11 (1978)

[17] V Roadrangka R H Yeany and M J Padilla GALTGroup Test of Logical Thinking (University of GeorgiaAthens GA 1982)

[18] K Tobin and W Capie The development and validation ofa group test of logical thinking Educ Psychol Meas 41413 (1981)

[19] A E Lawson Lawson classroom test of scientificreasoning Retrieved from httpwwwpublicasuedu~anton1AssessArticlesAssessmentsMathematics20AssessmentsScientific20Reasoning20Testpdf(2000)

[20] D Hestenes M Wells and G Swackhamer Force conceptinventory Phys Teach 30 141 (1992)

[21] C Pratt and R G Hacker Is Lawsonrsquos classroom test offormal reasoning valid Educ Psychol Meas 44 441(1984)

[22] G P Stefanich R D Unruh B Perry and G PhillipsConvergent validity of group tests of cognitive develop-ment J Res Sci Teach 20 557 (1983)

[23] G M Burney The construction and validation ofan objective formal reasoning instrument PhD Disserta-tion University of Northern Colorado 1974

[24] F Longeot Analyse statistique de trois tests genetiquescollectifs Bull Inst Natl Etude 20 219 (1965)

[25] R P Tisher and L G Dale Understanding in Science Test(Australian Council for Educational Research Victoria1975)

[26] K Tobin and W Capie The test of logical thinkingDevelopment and applications Proceedings of the AnnualMeeting of the National Association for Research inScience Teaching Boston MA (1980)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-17

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 18: Validity evaluation of the Lawson classroom test of

[27] J A Rowell and P J Hoffmann Group tests for distin-guishing formal from concrete thinkers J Res Sci Teach12 157 (1975)

[28] M Shayer and D Wharry Piaget in the Classroom Part 1Testing a Whole Class at the Same Time (Chelsa CollegeUniversity of London 1975)

[29] R G Hacker The construct validities of some group testsof intellectual development Educ Psychol Meas 49 269(1989)

[30] M D Reckase T A Ackerman and J E Carlson Build-ing a unidimensional test using multidimensional itemsJ Educ Measure 25 193 (1988)

[31] A Hofstein and V N Lunetta The laboratory in scienceeducation Foundations for the twenty-first century SciEduc 88 28 (2004)

[32] V P Coletta J A Phillips and J J Steinert Interpretingforce concept inventory scores Normalized gain and SATscores Phys Rev ST Phys Educ Res 3 010106 (2007)

[33] C-Q Lee and H-C She Facilitating studentsrsquo conceptualchange and scientific reasoning involving the unit ofcombustion Res Sci Educ 40 479 (2010)

[34] Y Xiao J Han K Koenig J Xiong and L BaoMultilevel Rasch modeling of two-tier multiple choicetest A case study using Lawsonrsquos classroom test ofscientific reasoning Phys Rev Phys Educ Res 14020104 (2018)

[35] K Diff and N Tache From FCI to CSEM to Lawson testA report on data collected at a community college AIPConf Proc 951 85 (2007)

[36] V P Coletta Reaching more students through thinking inphysics Phys Teach 55 100 (2017)

[37] S Isaac and W B Michael Handbook in Research andEvaluation A Collection of Principles Methods andStrategies Useful in the Planning Design and Evaluationof Studies in Education and the Behavioral Sciences (Editspublishers San Diego 1995)

[38] L Ding Progression trend of scientific reasoning fromelementary school to university a large-scale cross-gradesurvey among chinese students Int J Sci Math Educ 1(2017)

[39] L Ding X Wei and X Liu Variations in universitystudentsrsquo scientific reasoning skills across majors yearsand types of institutions Res Sci Educ 46 613 (2016)

[40] L Ding X Wei and K Mollohan Does higher educationimprove student scientific reasoning skills Int J SciMath Educ 14 619 (2016)

[41] G W Fulmer H-E Chu D F Treagust and K NeumannIs it harder to know or to reason Analyzing two-tierscience assessment items using the Rasch measurementmodel Asia-Pac Sci Educ 1 1 (2015)

[42] D F Treagust Development and use of diagnostic tests toevaluate studentsrsquo misconceptions in science Int J SciEduc 10 159 (1988)

[43] C-C Tsai and C Chou Diagnosing studentsrsquo alternativeconceptions in science J Comput Assist Learn 18 157(2002)

[44] A L Chandrasegaran D F Treagust and M MocerinoThe development of a two-tier multiple-choice diagnosticinstrument for evaluating secondary school studentsrsquo abil-ity to describe and explain chemical reactions using

multiple levels of representation Chem Educ Res Pract8 293 (2007)

[45] H-E Chu D F Treagust and A L Chandrasegaran Astratified study of studentsrsquo understanding of basic opticsconcepts in different contexts using two-tier multiple-choice items Res Sci Technol Educ 27 253 (2009)

[46] A Hilton G Hilton S Dole and M Goos Developmentand application of a two-tier diagnostic instrument toassess middle-years studentsrsquo proportional reasoningMath Educ Res J 25 523 (2013)

[47] H-S Lee O L Liu and M C Linn Validating meas-urement of knowledge integration in science using multi-ple-choice and explanation items Appl Meas Educ 24115 (2011)

[48] K S Taber and K C D Tan The insidious nature of lsquohard-corersquoalternative conceptions Implications for the construc-tivist research programme of patterns in high schoolstudentsrsquo and pre-service teachersrsquo thinking about ionisa-tion energy Int J Sci Educ 33 259 (2011)

[49] D K-C Tan and D F Treagust Evaluating studentsrsquounderstanding of chemical bonding Sch Sci Rev 81 75(1999) httpsrepositorynieedusghandle1049714150

[50] D F Treagust and A L Chandrasegaran The Taiwannational science concept learning study in an internationalperspective Int J Sci Educ 29 391 (2007)

[51] C-Y Tsui and D Treagust Evaluating secondary studentsrsquoscientific reasoning in genetics using a two-tier diagnosticinstrument Int J Sci Educ 32 1073 (2010)

[52] J-R Wang Evaluating secondary studentsrsquo scientificreasoning in genetics using a two-tier diagnostic instru-ment Int J Sci Math Educ 2 131 (2004)

[53] O L Liu H-S Lee and M C Linn An investigation ofexplanation multiple-choice items in science assessmentEduc Assess 16 164 (2011)

[54] N B Songer B Kelcey and AW Gotwals How andwhen does complex reasoning occur Empirically drivendevelopment of a learning progression focused on complexreasoning about biodiversity J Res Sci Teach 46 610(2009)

[55] J Yao and Y Guo Validity evidence for a learningprogression of scientific explanation J Res Sci Teach55 299 (2017)

[56] J Yao Y Guo and K Neumann Towards a hypotheticallearning progression of scientific explanation Asia-PacSci Educ 2 4 (2016)

[57] D Hestenes and I Halloun Interpreting the force conceptinventory A response to Huffman and Heller Phys Teach33 502 (1995)

[58] J Wang and L Bao Analyzing force concept inventorywith item response theory Am J Phys 78 1064 (2010)

[59] J W Creswell and J D Creswell Research DesignQualitative Quantitative and Mixed Methods Approaches(Sage Publications Thousand Oaks CA 2011)

[60] H Wainer and G L Kiely Item clusters and computerizedadaptive testing A case for testlets J Educ Measure 24185 (1987)

[61] P R Rosenbaum Items bundles Psychometrika 53 349(1988)

[62] H P Tam M Wu D C H Lau and MM C Mok Usinguser-defined fit statistic to analyze two-tier items in

BAO XIAO KOENIG and HAN PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-18

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19

Page 19: Validity evaluation of the Lawson classroom test of

mathematics in Self-directed Learning Oriented Assess-ments in the Asia-Pacific (Springer New York 2012)pp 223ndash233

[63] W M Yen Effects of local item dependence on the fit andequating performance of the three-parameter logisticmodel Appl Psychol Meas 8 125 (1984)

[64] W-H Chen and D Thissen Local dependence indexes foritem pairs using item response theory J Educ Behav Stat22 265 (1997)

[65] W M Yen Scaling performance assessments Strategiesfor managing local item dependence Educ Meas 30 187(1993)

[66] A W Gotwals and N B Songer Reasoning up and down afood chain Using an assessment framework to investigatestudentsrsquo middle knowledge Sci Educ 94 259 (2010)

[67] J Yasuda N Mae M M Hull and M TaniguchiAnalyzing false positives of four questions in the ForceConcept Inventory Phys Rev Phys Educ Res 14010112 (2018)

[68] R F DeVellis Scale Development Theory and Applica-tions (Sage Publications Thousand Oaks CA 2012)

[69] L Ding R Chabay B Sherwood and R BeichnerEvaluating an electricity and magnetism assessment toolBrief electricity and magnetism assessment Phys Rev STPhys Educ Res 2 010105 (2006)

[70] J Cohen Statistical Power Analysis for the BehavioralSciences (Lawrence Earlbaum Hillsdale NJ 1988)p 20

[71] C J Clopper and E S Pearson The use of confidence orfiducial limits illustrated in the case of the binomialBiometrika 26 404 (1934)

[72] J T Willse Classical test theory functions (R packageversion 232) Retrieved from httpsCRANR-projectorgpackage=CTT (2018)

[73] A E Lawson The generality of hypothetic-deductivereasoning Making scientific thinking explicit Am BiolTeach 62 482 (2000)

[74] A E Lawson B Clark E Cramer-Meldrum K AFalconer J M Sequist and Y-J Kwon Developmentof scientific reasoning in college biology Do two levels ofgeneral hypothesis-testing skills exist J Res Sci Teach37 81 (2000)

[75] J Han Scientific reasoning Research developmentand assessment PhD thesis The Ohio State University2013

[76] S Zhou J Han K Koenig A Raplinger Y Pi D LiH Xiao Z Fu and L Bao Assessment of scientificreasoning The effects of task context data and design onstudent reasoning in control of variables Think Ski Creat19 175 (2016)

VALIDITY EVALUATION OF THE LAWSON hellip PHYS REV PHYS EDUC RES 14 020106 (2018)

020106-19