
Educational Research

Slide 1: Technical Issues
- Two concerns:
  - Validity
  - Reliability

Let's turn our attention to the technical issues related to measurement. Two very important concerns are the validity and reliability of the instruments being used.

Slide 2: Data Collection – Quiz 1
- Answer the five questions on Quiz 1.

Before we begin our discussion, I'd like to take a few moments to work on an exercise. Use the link on this slide to access Quiz 1. Take this quiz assuming I'm going to use your score as a grade in the class. You can pause the slide show if needed. [PAUSE]

Slide 3: Data Collection – Quiz 1 Answers
- Score your paper using the following key: A, B, A, B, B.

The answers to Questions 1-5 are A, B, A, B, and B. Score your paper and remember the number of items answered correctly.

Slide 4: Data Collection – Quiz 1
- How well did you do?
- Should I use this score as a part of your grade?
- Does this score indicate your level as a graduate student?
- What we have here is a serious lack of communication!
- Most students object strongly to using this score as a part of their grade because it isn't fair.
- Most students object strongly to being labeled bright or challenged on the basis of their grade.
- Their reasoning: the test isn't fair; it doesn't cover material relevant to this course.
- Welcome to the technical world of instrumentation.

How well did you do? If you did well, would you mind if I used your score as a part of your grade for EDF 800? If you didn't do well, and most students do not, would you mind if I used your score as a part of your grade for EDF 800?

If you did well, can I conclude you are exceptionally bright? If you didn't do well, can I conclude you are quite challenged academically?

Almost everyone objects to using their score from this quiz, good or bad, for any purpose because it isn't fair. The test simply doesn't cover material appropriate to an introductory educational research course. We've studied absolutely none of the content on this quiz; expecting you to know it just isn't right.

Welcome to the technical world of instrumentation.

Slide 5: Technical Issues
- Validity: the extent to which interpretations made from a test score are appropriate
- Characteristics:
  - The most important technical characteristic
  - Situation specific
  - Does not refer to the instrument but to the interpretations of scores on the instrument
  - Best thought of in terms of degree

The formal definition of validity is written on this slide. If you think for a moment, the definition makes a lot of sense. When you give a test to the students in your class, you use the scores to make some decisions about each student. If one student has a very high score, you usually infer that this is a good student. If another student has a very low score, you infer that this student is having serious difficulties mastering the material. The question ultimately comes down to whether the inferences or decisions you make are appropriate, meaningful, or useful. The answer depends on two characteristics of the test.

Slide 6: Technical Issues
- Validity (continued)
- Four types
- Content: to what extent does the test measure what it is supposed to measure?
  - Item validity
  - Sampling validity
  - Determined by expert judgment

If your test covered appropriate content for the instruction provided to students, then the extent to which your inferences are appropriate, meaningful, or useful is high. If, like the quiz I gave you, the content is not relevant to the instruction, your inferences are not appropriate, meaningful, or useful to anyone. This is known as content validity, and it is a fundamental characteristic of any test. Please note that whether a test has evidence of content validity or not, nothing stops someone from using the scores to make decisions. Has anyone ever taken an exam where the professor wrote items that had nothing to do with what was being taught? Did he or she still use your scores in your grades? Was that fair? Appropriate? Meaningful? Useful?

I need to caution you about the situation-specific nature of validity evidence. The quiz you took earlier was not content valid for this course, but it was taken from a History of Education exam where every question was appropriate to the instruction. In our case the test was not content valid; in the case of the other course it was 100% content valid.

Slide 7: Technical Issues
- Validity (continued)
- Construct: the extent to which a test measures the construct it represents
  - Underlying difficulty defining constructs
  - Estimated in many ways
- Criterion-related:
  - Predictive: to what extent does the test predict a future performance?
  - Concurrent: to what extent does the test predict a performance measured at the same time?
  - Estimated by correlations between two tests

Sometimes the purpose of a test is not to measure specific, concrete content like that we are studying. Often what is being measured is quite nebulous or abstract in nature. How would you measure my intelligence? Probably with an intelligence test, but would the test be developed around Binet's conception of intelligence as verbal and mathematical reasoning, or around Gardner's eight or nine (I forget the number) multiple intelligences? Obviously the tests would look very different depending on how the researcher interprets the construct of intelligence. While construct validity is closely related to content validity, in that we worry about whether the test measures what it is supposed to measure, it is difficult to estimate. If a test has sufficient evidence to suggest it measures intelligence, my score on that test, and your use of it, is reasonable. If not, any decision you make on the basis of that score is not appropriate, meaningful, or useful.

Many times we find ourselves using test scores to predict a student's performance on some later task. The ACT, SAT, GRE, and MCAT are good examples of such tests. Scores on the ACT or SAT are supposed to predict a student's performance in the freshman year of college. Do they do so well? If so, we can make some decisions about whether or not to admit a student to a university; if not, such a decision is not appropriate, meaningful, or useful.
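To make that concrete, here is a minimal sketch of how a predictive validity coefficient is typically estimated: correlate the predictor scores with the later criterion measure. The scores below are invented purely for illustration, not real admissions data.

    import numpy as np

    # Hypothetical data: admission-test scores and freshman GPA
    # for the same ten students (values invented for illustration).
    test_scores = np.array([21, 28, 33, 19, 25, 30, 24, 27, 35, 22])
    freshman_gpa = np.array([2.4, 3.1, 3.6, 2.2, 2.9, 3.3, 2.8, 3.0, 3.8, 2.5])

    # Predictive validity is commonly reported as the Pearson
    # correlation between the predictor and the later criterion.
    r = np.corrcoef(test_scores, freshman_gpa)[0, 1]
    print(f"Predictive validity estimate: r = {r:.2f}")

A concurrent validity estimate is computed the same way; the only difference is that the criterion measure is collected at the same time as the test rather than later.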

Slide 8: Technical Issues
- Validity (continued)
- Consequential: to what extent are the consequences that occur from the test harmful?
  - Estimated by empirical and expert judgment
- Factors affecting validity:
  - Unclear test directions
  - Confusing and ambiguous test items
  - Vocabulary that is too difficult for test takers

Consequential validity is a relatively new way to think about validity evidence. As the definition implies, we are interested in consequences of testing that might prove particularly disconcerting for some students. For example, the Louisiana Department of Education mandates that all special needs students take the LEAP test that corresponds to the grade in which they are enrolled. Often this means a student is taking an exam that is well beyond their ability to read, much less understand the content. Is this fair to that student? What about non-English-speaking students? Is it fair to give them grade-level tests that are completely dependent on the ability to read English? Welcome to the concerns related to consequential validity.

There are many factors that can affect the validity of a test. Can you see how each of the three factors on this slide would have a negative effect on validity?

Slide 9: Technical Issues
- Factors affecting validity (continued):
  - Overly difficult and complex sentence structure
  - Inconsistent and subjective scoring
  - Untaught items
  - Failure to follow standardized administration procedures
  - Cheating by the participants or someone teaching to the test items

How about these factors?

Slide 10: Technical Issues
- Reliability: the degree to which a test consistently measures whatever it is measuring
- Characteristics:
  - Expressed as a coefficient ranging from 0 to 1
  - A necessary but not sufficient characteristic of a test

Reliability is the second technical characteristic important to measurement. Reliability is basically the consistency with which we measure. If you took Exam 1 a first time and made a 40, a second time and made a 45, and a third time and made a 43, which score should I use to provide a reliable estimate of your knowledge of the material? There are three perspectives from which reliability is viewed: test reliability, score reliability, and agreement.

Slide 11: Technical Issues
- Test reliability
- Stability: consistency over time with the same instrument
  - Test-retest
  - Estimated by a correlation between the two administrations of the same test
- Equivalence: consistency between two parallel tests administered at the same time
  - Parallel forms
  - Estimated by a correlation between the parallel tests

When speaking of test reliability, we estimate the extent to which the results of a test are likely to be the same. An estimate could be calculated using two administrations of the same test. This is known as stability, or test-retest reliability. Coefficients close to 1 suggest a test that produces very consistent scores; those close to 0 suggest a lack of consistency for the test. Sometimes we don't want to give one test twice (what a pain for the students!). Besides, there is often a high chance that you'll correct something from the first to the second administration of the test.

When we develop two tests that examine the same material with different items, we create an opportunity to estimate reliability through equivalence, or parallel forms. Comparing the scores from Form 1 of a test to those from Form 2 yields a coefficient that ranges from 0 to 1. Again, the closer to 1, the more consistent the test.
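As a sketch of how both estimates reduce to the same calculation (the scores below are invented for demonstration), stability correlates two administrations of one test, while equivalence correlates two parallel forms:

    import numpy as np

    # Invented scores for eight students, used only to show
    # how the two coefficients are computed.
    administration_1 = np.array([40, 45, 43, 50, 38, 47, 42, 44])  # same test, time 1
    administration_2 = np.array([42, 44, 45, 49, 39, 46, 41, 45])  # same test, time 2
    form_b           = np.array([41, 46, 42, 51, 37, 45, 43, 43])  # parallel form

    # Stability (test-retest): correlate two administrations of the same test.
    stability = np.corrcoef(administration_1, administration_2)[0, 1]

    # Equivalence (parallel forms): correlate scores on the two forms.
    equivalence = np.corrcoef(administration_1, form_b)[0, 1]

    print(f"Stability (test-retest):      {stability:.2f}")
    print(f"Equivalence (parallel forms): {equivalence:.2f}")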

Slide 12: Technical Issues
- Test reliability (continued)
- Internal consistency: artificially splitting the test into halves
  - Several coefficients: split halves, KR-20, KR-21, Cronbach's alpha
  - All coefficients provide estimates ranging from 0 to 1

If one test is hard to develop, think about two! Think also about giving a second form of the test to your students; I'm sure they'd be delighted to help you out! Because of this limitation, researchers have developed an estimate of test reliability called internal consistency. In essence, we think of one test of, say, 100 items as two tests of 50 items each; we split the test into halves. The two most common estimates of internal consistency are the KR-20 and Cronbach's alpha. The former is used when the items on a test are scored as right or wrong; the latter when the answers can fall on a continuous scale. An example of this is a Likert scale, where a student responds on a five-point scale ranging from strongly disagree to strongly agree. Regardless of which estimate is used, the coefficients always range from 0 to 1, with 1 representing greater reliability.
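Here is a minimal sketch of how Cronbach's alpha is computed from a matrix of item scores; the responses below are invented for illustration. For right/wrong (1/0) items like these, Cronbach's alpha and the KR-20 produce the same value.

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_students x n_items) score matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of students' total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Invented right/wrong (1/0) responses: 6 students x 5 items.
    responses = np.array([
        [1, 1, 1, 0, 1],
        [1, 0, 1, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1],
    ])
    print(f"Internal consistency (alpha / KR-20): {cronbach_alpha(responses):.2f}")

For Likert-type items, the same function applies directly, since alpha accepts continuous item scores; that is exactly the case described above where alpha is preferred over the KR-20.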