Determining the Validity and Reliability of Key Assessments
Determining the Validity and Reliability of Key Assessments
Julia M. Lee – Presenting the work of the faculty of the Dewar College of Education at Valdosta State University
GaPSC Assessment Workshop, May 14, 2012
Tips for Developing Key Assessments
- “Begin with the end in mind”
- Have as many Education faculty members as possible involved in the development
- Make sure you have P-12 and A&S partners involved
- Look at the “big picture” – ultimate outcome(s) rather than isolated knowledge and skills
- Explicit, explicit, explicit
Tips for Meeting Standard Two
- Involve as many faculty as possible in the assessment system: development, implementation, evaluation, analysis, revision, implementation . . .
- Have faculty complete a “self-study” or “self-evaluation” of each program and its assessment components
- Provide training to faculty, candidates, and other “users” on the instruments developed
- Develop users’ guides, data calendars, and data collection documents (information to be collected, timeline, source, responsibility, etc.)
[Slide graphic: “Validity and Reliability, Aligned with . . .” word diagram linking the two concepts to related qualities – objectivity, purpose, stability, fairness, impartiality, consistency, legitimacy, comprehensiveness.]
Assessment System and Unit Evaluation (2a)
“4. The professional education unit has taken effective steps to eliminate bias in assessments and is working to establish the fairness, accuracy, and consistency of its assessment procedures and professional education unit operations.”
Impact of 2a4
“2b1. The professional education unit maintains an assessment system that provides regular and comprehensive information on applicant qualifications, candidate proficiencies, competence of graduates, professional education unit operations, and preparation program quality.”
“2b3. Candidate assessment data are regularly and systematically collected, compiled, aggregated, summarized, and analyzed to improve candidate performance, preparation program quality, and professional education unit operations.”
Impact of 2a4
“2c1. The professional education unit regularly and systematically uses data, including candidate and graduate performance information, to evaluate the efficacy of its courses, preparation programs, and clinical experiences.”
“2c2. The professional education unit analyzes preparation programs’ evaluation and performance assessment data to initiate changes in preparation programs and professional education unit operations.”
Establishing fairness, accuracy, and consistency of assessment procedures and instruments: Steps taken by the COE
- Use of multiple assessments (multiple sources)
- Primary use of analytic rather than holistic rubrics
- Use of multiple raters
- Provision of training on assessment instruments
- Completion of inter-rater reliability studies and/or consensus agreement
Processes Used to Determine Reliability and Validity of Two Key Assessments
- College of Education Observation Instrument
- College of Education Disposition Survey
College of Education Observation Instrument
- Part of “determining” reliability and validity involves building instruments and supporting documents in such a way that these issues are considered from the very beginning
- How the COE OI was developed
- Development and implementation of an instructional manual and training sessions
- Completion of inter-rater reliability studies
Development of the COE Observation Instrument
- Aligned to professional education standards (Danielson, INTASC, Georgia Framework)
- Georgia Framework indicators that were observable formed the foundation of the instrument
- P-12 teachers, P-12 administrators, and University faculty participated in the development
Development and Implementation of an Instruction Guide and Training Sessions for the COE OI
- Training manual and training developed by a group of P-12 educators and University faculty members
- Training manual provides explicit guidance for decision-making regarding the rubric
- Training sessions are provided for first-time users (2-hour session) as well as for ongoing users (1-hour “refresher” training)
Completion of Inter-rater Reliability Studies
- Provided training on the Instrument to 17 triads (student teachers, their P-12 mentors, and their university supervisors)
- These triads independently rated one teaching episode for each candidate
- Computed inter-rater agreement between P-12 mentors and university supervisors: (Agreements / (Agreements + Disagreements)) × 100, for both adjacent values and standard met / not met

Inter-rater Agreement Results (% of agreement)
[Results table not captured in the transcript; agreement was reported separately for adjacent values and for standard met / not met.]
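The agreement formula above is straightforward to script. Below is a minimal Python sketch with made-up ratings; the 1-4 scale, the “within one point” rule for adjacent agreement, and the cut score of 3 for “standard met” are illustrative assumptions, not the COE’s actual data or decision rules.

```python
# Inter-rater percent agreement, computed two ways (hypothetical data).
# Each list holds one rater's scores (assumed 1-4 scale) for the same episodes.
mentor_ratings     = [3, 4, 2, 3, 4, 3, 2, 4]
supervisor_ratings = [3, 3, 2, 4, 4, 3, 3, 4]

def percent_agreement(a, b, match):
    """(Agreements / (Agreements + Disagreements)) * 100 under a match rule."""
    agreements = sum(1 for x, y in zip(a, b) if match(x, y))
    return 100 * agreements / len(a)

# Adjacent values: ratings count as agreeing if within one point of each other.
adjacent = percent_agreement(mentor_ratings, supervisor_ratings,
                             lambda x, y: abs(x - y) <= 1)

# Standard met / not met: assume a rating of 3 or above means "met".
met = percent_agreement(mentor_ratings, supervisor_ratings,
                        lambda x, y: (x >= 3) == (y >= 3))

print(f"Adjacent-values agreement: {adjacent:.1f}%")
print(f"Met / not-met agreement:   {met:.1f}%")
```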
What did these data tell us?
A. All items on this instrument were reliable and valid.
B. There was a high level of agreement on all items on this instrument.
C. The independent raters did not agree with each other regarding whether or not candidates met the standard for most items.
D. In general, with the exception of one item, the independent raters had similar ratings for both types of reliability evaluated.
Decisions Made
- Required all faculty who supervise to complete the training session
- Provided training to mentors who frequently supervise student teachers
- Provided training to several cohorts of Ed.S. students, many of whom served as public school mentors
- Asked the COE Assessment Committee to review data and make recommendations for changes based on reliability data
- Modified this item on the instrument
Modification of the Item
Learning Environments – Original Item III-G: Communication
- Rating of 1-2: Errors in spoken/written language; ineffective nonverbal communication; unclear directions; does not use effective questioning skills
- Rating of 3-4: Error-free spoken/written language; effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies

Modification of the Item, continued
Learning Environments – New Item III-Ga: Communication
- Rating of 1-2: Errors in spoken/written language
- Rating of 3-4: Error-free spoken/written language
New Item III-Gb: Communication
- Rating of 1-2: Ineffective nonverbal communication; unclear directions; does not use effective questioning skills
- Rating of 3-4: Effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies
College of Education Disposition Survey
- Again, at the initial development stage, there was a focus on reliability and validity issues
- How the Unit-adopted dispositions were chosen
- How the COE Disposition Survey was developed
Adoption of Dispositions
- Looked at all the disposition statements in the INTASC standards
- Data collected from P-12 educators and candidates regarding the importance of specific dispositions
- Surveyed unit faculty regarding the relative importance of each disposition statement
- Conceptual framework committee reviewed results and provided input into selection
- Three primary dispositions emerged from this process
Development of the COE Disposition Survey and Advanced Disposition Survey
- Initially designed and field-tested in the summer of 2005
- Original survey consisted of 12 items
- Of those 12 items, four were targeted to specifically address two of the unit-adopted dispositions (fairness and the belief that all students can learn)
- Alternate forms of survey questions were written to address reliability (see the sketch after this list)
- Candidates were asked to identify, using a Likert scale, whether they “strongly agree,” “agree,” “n/a or neutral,” “disagree,” or “strongly disagree” with statements addressing these dispositions
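One way to examine alternate-forms reliability for paired items like these is to reverse-score the negatively worded statement and correlate it with its positively worded partner. The Python sketch below uses hypothetical responses coded 1 (strongly disagree) to 5 (strongly agree) and treats the Likert codes as interval data for a Pearson correlation, a common simplification; it does not reproduce the COE’s actual analysis.

```python
# Alternate-forms consistency check for a paired Likert item (hypothetical data).
# Statement 2 is negatively worded, so it is reverse-scored before comparison
# with its positively worded pair, Statement 3.
statement_2 = [2, 1, 4, 2, 1, 3, 2, 1]   # negatively worded ("fairness")
statement_3 = [4, 5, 4, 4, 5, 4, 5, 5]   # positively worded ("fairness")

def reverse_score(responses, scale_max=5):
    """Flip a 1..scale_max Likert coding so both forms point the same way."""
    return [scale_max + 1 - r for r in responses]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A high correlation suggests the two forms tap the same disposition.
r = pearson(reverse_score(statement_2), statement_3)
print(f"Alternate-forms correlation (fairness pair): r = {r:.2f}")
```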
Original Statements on Surveys
- Statement 2: I believe that schools today need to get back to basics – teachers should present lessons for everyone in the same structured way for students to learn the content.
- Statement 3: I believe that it is important to adapt instruction to students' different learning styles, and help students achieve in ways they find easy to learn.
- Statement 11: The impact of my performance as a teacher is primarily dependent upon the students' family backgrounds and the students' personal motivation.
- Statement 12: I believe all students can learn.
Early Data Gathered (values are percentages)

Semester Transition Point   | Statement 2 (“Fairness”) | Statement 3 (“Fairness”) | Statement 11 (“Belief all Ss can learn”) | Statement 12 (“Belief all Ss can learn”)
Fall Admission to Program   | 73.68 | 97.37 | 44.74 | 94.74
Spring Exit from Program    | 100   | 100   | 55.6  | 100
Summer Admission to Program | 85.71 | 100   | 21.43 | 100
Summer Exit from Program    | 94.12 | 100   | 35.29 | 94.12
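The transcript does not preserve how each percentage was computed. A plausible reading is that each value is the percentage of candidates responding in the expected direction for that statement (e.g., agreeing with the positively worded Statement 12). A minimal Python sketch under that assumption, with hypothetical responses:

```python
# Percentage of "expected-direction" responses for one statement at one
# transition point. The responses and the notion of "desirable" here are
# assumptions for illustration, not the COE's actual coding rules.
responses = ["strongly agree", "agree", "agree", "neutral",
             "strongly agree", "agree", "disagree", "strongly agree"]

# For a positively worded item, agreement is the expected direction.
desirable = {"agree", "strongly agree"}

pct = 100 * sum(r in desirable for r in responses) / len(responses)
print(f"Statement 12, one transition point: {pct:.2f}% expected-direction responses")
```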
What did these data tell us?
A. The four items on this instrument appeared to be reliable and valid.
B. Candidates’ responses appeared to be fairly consistent in terms of these items.
C. There appeared to be little if any consistency in candidates’ responses on these items.
D. In general, while candidates’ responses to the two items addressing “fairness” appear to be consistent, this does not appear to be the case with the two items addressing “the belief that all students can learn.”
Decisions Made
- Looked more in-depth at these items (at the individual candidate level) to determine agreement for the two items addressing the belief that all students can learn (see the sketch after this list)
- Asked the COE Assessment Committee to review data and make recommendations for changes based on these (and other) data
- The Assessment Committee recommended re-wording the item and including separate statements rather than a combined statement
- Faculty across the unit had multiple conversations about the role of the teacher in influencing student achievement as well as motivation
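The individual-candidate-level look described in the first bullet could take the form of a cross-tabulation of each candidate’s responses to the two paired statements. A sketch with hypothetical data follows; since Statement 11 is negatively worded, a consistent candidate should disagree with it while agreeing with Statement 12. The data and the consistency rule are assumptions, not the COE’s actual analysis.

```python
# Cross-tabulate paired responses at the individual candidate level
# (hypothetical data; lists are aligned by candidate).
from collections import Counter

stmt_11 = ["agree", "disagree", "neutral", "disagree", "agree", "disagree"]
stmt_12 = ["agree", "agree", "agree", "agree", "agree", "strongly agree"]

crosstab = Counter(zip(stmt_11, stmt_12))
for (r11, r12), n in sorted(crosstab.items()):
    print(f"Stmt 11: {r11:<10} Stmt 12: {r12:<16} n = {n}")

# Consistent = disagrees with the negatively worded item AND agrees
# with the positively worded one.
consistent = sum(n for (r11, r12), n in crosstab.items()
                 if r11 in {"disagree", "strongly disagree"}
                 and r12 in {"agree", "strongly agree"})
print(f"Consistent pairs: {100 * consistent / len(stmt_11):.0f}%")
```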
Some Common Errors Found In Key Assessments
- Items included that are not appropriately aligned to the standard(s) OR to what is supposed to be measured
- Not adequately measuring the standard (only certain aspects)
- Not setting clear performance expectations – e.g., what is “passing” or “acceptable”? OR setting inappropriate performance expectations
- Not matching the type of rubric to the assessment need (e.g., use of holistic vs. analytic rubrics)
- Performance descriptors on rubrics that are not sufficiently differentiated across levels
- Use of non-specific terms in performance descriptors (“some,” “effectively,” “adequately”) without explicit guidance for how those terms are to be defined
- Use of broad terms – outcomes not well defined
- Lack of appropriate balance of “brevity and detail” – either not efficient or not effective
- Lack of well-defined criteria to guide ratings – may lead to biased ratings (e.g., leniency bias)
- Not using multiple measures to assess outcomes
Candidate will integrate research findings in his/her practice: Research proposal
Components (each rated Target / Acceptable / Unacceptable):
- Abstract
- Literature Review
- Research Design
- Methodology
- Conclusion
- Use of APA style guide
- Effective communication
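As a rough illustration of why an analytic rubric suits this kind of key assessment, the rubric above can be represented as per-component ratings that roll up to a score, rather than a single holistic judgment. The component names come from the slide; the point values and scoring function below are assumptions for illustration only.

```python
# An analytic rubric as a data structure: every component gets its own rating.
# Point values per level are hypothetical.
LEVELS = {"Target": 2, "Acceptable": 1, "Unacceptable": 0}

COMPONENTS = ["Abstract", "Literature Review", "Research Design",
              "Methodology", "Conclusion", "Use of APA style guide",
              "Effective communication"]

def score_proposal(ratings):
    """ratings maps each component to one of the three performance levels."""
    missing = set(COMPONENTS) - set(ratings)
    if missing:
        raise ValueError(f"Unrated components: {missing}")
    return sum(LEVELS[ratings[c]] for c in COMPONENTS)

# Example: one candidate's ratings (hypothetical).
ratings = {c: "Acceptable" for c in COMPONENTS}
ratings["Literature Review"] = "Target"
print(f"Total: {score_proposal(ratings)} / {2 * len(COMPONENTS)}")
```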