Making Sense of Test Reliability and Validity


TRANSCRIPT


The overall problem with psychological tests concerns their ability to measure what they are supposed to measure.

The accuracy, or usefulness, of a test is known as its validity. For example, suppose you wanted to develop a test to determine which of several job applicants would work well in a bank. Would an arithmetic test be a valid test of job success? Well, not if the job required other skills, such as manual dexterity or social skills.


Face validity simply means the validity of the test at face value (how it appears without looking at its internal structure).

To check for face validity, test/survey items are sent to professionals to obtain suggestions for modification. No analyses are conducted; rather, the professionals are just providing opinions or assumptions regarding the look and perceived content of the items. Because of the vagueness and subjectivity of face value, face validity is not considered a valid assessment of a test's overall validity. This is because the appearance of an item or subtest may not concur with the internal structure of the test (thus, the data analyses and research may prove something different).

The criteria of validity in research should go beyond "face," "appearance," and "common sense."


Content Validity refers to the ability of a test to sample adequately the broad range of elements that compose a particular construct.

Content validity is concerned with sample-population representativeness, i.e., the knowledge and skills covered by the test items should be representative of the larger domain of knowledge and skills.

Content validity is usually established by content experts. Take computer literacy as an example. A test of computer literacy should be written or reviewed by computer science professors, because it is assumed that computer scientists know what is important in their discipline. At first glance, this approach looks similar to the validation process of face validity, yet there is a difference. In content validity, evidence is obtained by looking for agreement in judgments by judges. In short, face validity can be established by one person, but content validity should be checked by a panel.


Criterion Validity refers to the ability of a test to predict someone's performance on some construct. For example, before actually using a test to predict whether someone will be successful at a particular job, you would first want to determine whether the person is already doing well at that job (the criterion measure) and whether they also tend to score high on your proposed test. If so, then you know that the test scores are related to the criterion.

Criterion validity draws inferences from test scores to performance.
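For illustration, here is a minimal sketch (in Python, with made-up numbers) of how a criterion validity coefficient is typically computed: the correlation between scores on the proposed test and an existing criterion measure, such as supervisor ratings of current job performance.

```python
# A minimal sketch of a criterion validity check: correlate scores on the
# proposed selection test with an existing criterion measure. All data below
# are hypothetical.
import numpy as np

test_scores = np.array([72, 85, 64, 90, 78, 69, 88, 75])          # proposed test
job_ratings = np.array([3.1, 4.2, 2.8, 4.5, 3.6, 3.0, 4.4, 3.4])  # criterion measure

# The Pearson correlation between test and criterion is the validity coefficient.
validity_coefficient = np.corrcoef(test_scores, job_ratings)[0, 1]
print(f"criterion validity coefficient: {validity_coefficient:.2f}")
```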


Construct Validity refers to the ability of a test to measure the psychological construct, such as depression, that it was designed to measure. One way this can be assessed is through the test's convergent or divergent validity, which refers to whether a test gives results similar to other tests of the same construct and different from tests of different constructs. Thus, how does my depression test fare in comparison to, let's say, the Beck Depression Inventory? Does my depression test take into account various criteria outlined in the DSM-IV? Does my depression test correlate well with other depression inventories, or does it seem to measure anxiety or hostility instead?

Construct validity draws inferences from test scores to a psychological construct.
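As a hypothetical sketch, convergent and divergent validity can be checked by correlating the new test with an established measure of the same construct and with a measure of a different construct; all scores below are invented for illustration.

```python
# A hypothetical convergent/divergent validity check: the new depression test
# should correlate strongly with an established depression measure (convergent)
# and weakly with a measure of a different construct such as anxiety (divergent).
import numpy as np

new_depression = np.array([12, 25, 8, 30, 18, 22, 5, 27])
established_depression = np.array([14, 28, 10, 33, 20, 25, 7, 29])  # e.g., an existing inventory
anxiety = np.array([15, 18, 12, 20, 25, 10, 16, 14])

convergent = np.corrcoef(new_depression, established_depression)[0, 1]
divergent = np.corrcoef(new_depression, anxiety)[0, 1]
print(f"convergent r = {convergent:.2f} (should be high)")
print(f"divergent  r = {divergent:.2f} (should be low)")
```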


The ability of a test to give consistent results is known as its reliability. For example, a mathematics test that asks you to solve problems of progressive difficulty might be very reliable, because if you couldn't do calculus yesterday you probably won't be able to do it tomorrow or the next day. But a personality test that asks ambiguous questions, which you answer according to how you feel in the moment, may say one thing about you today and another thing about you next month.


Internal consistency reliability refers to how well all the test items relate to one another.

Internal consistency is measured by a coefficient computed from the scores obtained on a single administration of a test or survey. For instance, let's say respondents are asked to rate statements in an attitude survey about computer anxiety. One statement is "I feel very negative about computers in general." Another statement is "I enjoy using computers." People who strongly agree with the first statement should also strongly disagree with the second statement, and vice versa. If a respondent rates both statements high, or both low, the responses are said to be inconsistent. The same principle can be applied to a test: when no pattern is found in the students' responses, the test is probably too difficult and the students just guessed at the answers randomly.
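One widely used internal consistency coefficient is Cronbach's alpha. Below is a minimal Python sketch with hypothetical 5-point ratings; note that the negatively worded item is reverse-scored first so that all items point in the same direction.

```python
# A sketch of an internal consistency check on a small attitude survey.
# Rows are respondents, columns are items; item 2 is negatively worded.
import numpy as np

ratings = np.array([
    [5, 1, 5],
    [4, 2, 4],
    [2, 4, 1],
    [5, 1, 4],
    [1, 5, 2],
    [4, 2, 5],
])
ratings[:, 1] = 6 - ratings[:, 1]  # reverse-score item 2 on the 1-5 scale

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```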


Test-retest reliability refers to how well results from one administration of the test relate to results from another administration of the same test at a later time.

Test-retest reliability measures temporal stability: the same form of a test is administered on two or more separate occasions to the same group of examinees.

On many occasions this approach may not be practical, because repeated measurements are likely to change the examinees' responses. For example, the examinees will adapt to the test format and thus tend to score higher on later tests. Hence, careful implementation of the test-retest approach is strongly recommended.

Note: for IQ and achievement tests, one year is the recommended time frame between administrations.
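As a minimal sketch, the test-retest coefficient is simply the correlation between the two administrations for the same examinees; the scores below are hypothetical.

```python
# A minimal test-retest sketch: the same examinees take the same test twice,
# and the correlation between the two administrations estimates temporal
# stability.
import numpy as np

time_1 = np.array([98, 110, 87, 120, 105, 93, 115])
time_2 = np.array([101, 108, 90, 118, 103, 96, 117])  # same examinees, later occasion

retest_reliability = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest reliability: {retest_reliability:.2f}")
```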


Alternate/parallel form reliability measures form equivalence: two different forms of a test, based on the same content, are administered on one occasion to the same examinees.

Alternate forms are very common in high-stakes examinations such as licensing exams, the SAT, the GRE, etc. Thus, an examinee who took Form A earlier could not share the test items with another student who might take Form B later, as the two forms have different items that have been shown to be equivalent.
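A hypothetical sketch of a form-equivalence check: correlate Form A and Form B scores for the same examinees, and also compare the two forms' means and spreads.

```python
# A sketch of an alternate-form check with hypothetical scores: the same
# examinees take Form A and Form B, and we look at the correlation between
# forms as well as whether the forms have similar means and spreads.
import numpy as np

form_a = np.array([71, 84, 66, 92, 78, 70, 88])
form_b = np.array([73, 82, 68, 90, 80, 69, 87])  # same examinees, other form

print(f"alternate-form reliability: {np.corrcoef(form_a, form_b)[0, 1]:.2f}")
print(f"Form A mean {form_a.mean():.1f} (SD {form_a.std(ddof=1):.1f}); "
      f"Form B mean {form_b.mean():.1f} (SD {form_b.std(ddof=1):.1f})")
```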


When constructing/developing an IQ test, such as the Wechsler scales, it is important to conduct extensive literature reviews and marketing research, as well as to consult with experts in the field to review the proposed structure of the new or revised test and its specific subtests and items (content validity). Focus groups and surveys of experts and examiners are also conducted throughout the development.

After the items and subtests are developed and revised, you would want to determine which subtests feed into the various factor constructs or scales that you are proposing to measure (e.g., whether the subtests measure verbal comprehension, working memory, and/or spatial orientation) and whether the specific items you have developed for a subtest accurately measure what the subtest is trying to measure (construct validity). Note that IQ tests generally measure several constructs, such as Verbal Comprehension, Perceptual Reasoning, Working Memory, Processing Speed, etc. Thus, when developing a test, it is important to determine whether the subtest and its items feed into a particular construct. If they do not, then one cannot say that the subtest is a good and valid measure of, let's say, the working memory construct.
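One common way to examine which subtests feed into which constructs is exploratory factor analysis. The sketch below uses scikit-learn on simulated subtest scores; the subtest names, the two-factor structure, and all numbers are assumptions for illustration only.

```python
# A hedged sketch of checking factor structure with exploratory factor
# analysis: rows are examinees, columns are subtest scores, and the factor
# loadings suggest which subtests feed into which proposed constructs.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
verbal = rng.normal(size=n)   # simulated latent verbal comprehension
memory = rng.normal(size=n)   # simulated latent working memory
subtests = np.column_stack([
    verbal + rng.normal(scale=0.5, size=n),   # "Vocabulary"
    verbal + rng.normal(scale=0.5, size=n),   # "Similarities"
    memory + rng.normal(scale=0.5, size=n),   # "Digit Span"
    memory + rng.normal(scale=0.5, size=n),   # "Arithmetic"
])

fa = FactorAnalysis(n_components=2).fit(subtests)
print(np.round(fa.components_, 2))  # each row is a factor; columns are subtest loadings
```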


Once the test is in development, it will go through various stages whereby the items within each subtest will likely change based on feedback from experts, data collected through administrations, analyses conducted on the data, etc.

To test for test-retest reliability, a sample of, let's say, 300 people from the normative group, comprising various ages, demographic groups, etc., is selected to take the test at various time intervals. For example, an individual may take the test the first time today and then be re-administered the same test three months or more later. How closely the individual scored across his or her administrations would measure test-retest reliability.


To look at internal consistency reliability, we would want to examine how well the items within a subtest relate to one another. Thus, for example, if the test is constructed well, all items on the Vocabulary subtest should measure Verbal Comprehension. In reality, answers to different questions will likely vary for each particular respondent, although the items are intended to measure the same aspect or quantity. The smaller this variability (or the stronger the correlation), the greater the internal consistency reliability of this subtest and test instrument.

The split-half method is often used to assess internal consistency. The idea here is to split the items (questions) within the subtest (which are related to the same construct to be measured, such as verbal comprehension) and to compare the results obtained from the two resulting subsets of items. The closer the results, i.e., the scores on the construct being measured (e.g., verbal comprehension), the greater the internal consistency reliability of this test instrument.
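A minimal Python sketch of the split-half method with the Spearman-Brown correction, using hypothetical right/wrong item responses:

```python
# A sketch of the split-half method: split the subtest's items into two
# halves (odd- vs. even-numbered items here), correlate the half scores, and
# step the correlation up with the Spearman-Brown formula to estimate the
# full-length reliability. Responses are hypothetical (1 = correct, 0 = wrong).
import numpy as np

responses = np.array([            # respondents x items
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])
half_1 = responses[:, 0::2].sum(axis=1)  # odd-numbered items
half_2 = responses[:, 1::2].sum(axis=1)  # even-numbered items

r_half = np.corrcoef(half_1, half_2)[0, 1]
r_full = 2 * r_half / (1 + r_half)       # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, estimated full-test reliability = {r_full:.2f}")
```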