Determining the Validity and Reliability of Key Assessments
Determining the Validity and Reliability of Key Assessments
Julia M. Lee – Presenting the work of the faculty of the Dewar College of Education at Valdosta State University
GaPSC Assessment Workshop, May 14, 2012
Tips for Developing Key Assessments
- “Begin with the end in mind”
- Have as many Education faculty members as possible involved in the development
- Make sure you have P-12 and A&S partners involved
- Look at the “big picture” – ultimate outcome(s) rather than isolated knowledge and skills
- Explicit, explicit, explicit
Tips for Meeting Standard Two
- Involve as many faculty as possible in the assessment system: development, implementation, evaluation, analysis, revision, implementation . . .
- Have faculty complete a “self-study” or “self-evaluation” of each program and its assessment components
- Provide training to faculty, candidates, and other “users” on the instruments developed
- Develop users’ guides, data calendars, and data collection documents (information to be collected, timeline, source, responsibility, etc.)
[Slide graphic: “Validity and Reliability, Aligned with . . .” word diagram linking the two concepts to related qualities – objectivity, purpose, stability, fairness, impartiality, consistency, legitimacy, comprehensiveness.]
Assessment System and Unit Evaluation (2a)
“4. The professional education unit has taken effective steps to eliminate bias in assessments and is working to establish the fairness, accuracy, and consistency of its assessment procedures and professional education unit operations.”
Impact of 2a4
“2b1. The professional education unit maintains an assessment system that provides regular and comprehensive information on applicant qualifications, candidate proficiencies, competence of graduates, professional education unit operations, and preparation program quality.”
“2b3. Candidate assessment data are regularly and systematically collected, compiled, aggregated, summarized, and analyzed to improve candidate performance, preparation program quality, and professional education unit operations.”
Impact of 2a4
“2c1. The professional education unit regularly and systematically uses data, including candidate and graduate performance information, to evaluate the efficacy of its courses, preparation programs, and clinical experiences.”
“2c2. The professional education unit analyzes preparation programs’ evaluation and performance assessment data to initiate changes in preparation programs and professional education unit operations.”
Establishing fairness, accuracy, and consistency of assessment procedures and instruments: Steps taken by the COE
- Use of multiple assessments (multiple sources)
- Primary use of analytic rather than holistic rubrics
- Use of multiple raters
- Provision of training on assessment instruments
- Completion of inter-rater reliability studies and/or consensus agreement
Processes Used to Determine Reliability and Validity of Two Key Assessments
- College of Education Observation Instrument
- College of Education Disposition Survey
College of Education Observation Instrument
- Part of “determining” reliability and validity involves building instruments and supporting documents in such a way that these issues are considered from the very beginning
- How the COE OI was developed
- Development and implementation of an instructional manual and training sessions
- Completion of inter-rater reliability studies
Development of the COE Observation Instrument
- Aligned to professional education standards (Danielson, INTASC, Georgia Framework)
- Georgia Framework indicators that were observable formed the foundation of the instrument
- P-12 teachers, P-12 administrators, and University faculty participated in the development
Development and Implementation of an Instruction Guide and Training Sessions for the COE OI
- Training manual and training developed by a group of P-12 educators and University faculty members
- Training manual provides explicit guidance for decision-making regarding the rubric
- Training sessions are provided for first-time users (2-hour session) as well as for ongoing users (1-hour “refresher” training)
Completion of Inter-rater Reliability Studies
- Provided training on the Instrument to 17 triads (student teachers, their P-12 mentors, and their university supervisors)
- These triads independently rated one teaching episode for each candidate
- Computed inter-rater agreement between P-12 mentors and university supervisors: (Agreements / (Agreements + Disagreements)) × 100, for both adjacent values and standard met / not met

Inter-rater Agreement Results (% of agreement)
[Results table not captured in the transcript; agreement was reported separately for adjacent values and for standard met / not met.]
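The agreement formula above is straightforward to script. Below is a minimal Python sketch with made-up ratings; the 1-4 scale, the “within one point” rule for adjacent agreement, and the cut score of 3 for “standard met” are illustrative assumptions, not the COE’s actual data or decision rules.

```python
# Inter-rater percent agreement, computed two ways (hypothetical data).
# Each list holds one rater's scores (assumed 1-4 scale) for the same episodes.
mentor_ratings     = [3, 4, 2, 3, 4, 3, 2, 4]
supervisor_ratings = [3, 3, 2, 4, 4, 3, 3, 4]

def percent_agreement(a, b, match):
    """(Agreements / (Agreements + Disagreements)) * 100 under a match rule."""
    agreements = sum(1 for x, y in zip(a, b) if match(x, y))
    return 100 * agreements / len(a)

# Adjacent values: ratings count as agreeing if within one point of each other.
adjacent = percent_agreement(mentor_ratings, supervisor_ratings,
                             lambda x, y: abs(x - y) <= 1)

# Standard met / not met: assume a rating of 3 or above means "met".
met = percent_agreement(mentor_ratings, supervisor_ratings,
                        lambda x, y: (x >= 3) == (y >= 3))

print(f"Adjacent-values agreement: {adjacent:.1f}%")
print(f"Met / not-met agreement:   {met:.1f}%")
```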
What did these data tell us?
A. All items on this instrument were reliable and valid.
B. There was a high level of agreement on all items on this instrument.
C. The independent raters did not agree with each other regarding whether or not candidates met the standard for most items.
D. In general, with the exception of one item, the independent raters had similar ratings for both types of reliability evaluated.
Decisions Made
- Required all faculty who supervise to complete the training session
- Provided training to mentors who frequently supervise student teachers
- Provided training to several cohorts of Ed.S. students, many of whom served as public school mentors
- Asked the COE Assessment Committee to review data and make recommendations for changes based on reliability data
- Modified this item on the instrument
Modification of the Item
Learning Environments – Original Item III-G: Communication
- Rating of 1-2: Errors in spoken/written language; ineffective nonverbal communication; unclear directions; does not use effective questioning skills
- Rating of 3-4: Error-free spoken/written language; effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies

Modification of the Item, continued
Learning Environments – New Item III-Ga: Communication
- Rating of 1-2: Errors in spoken/written language
- Rating of 3-4: Error-free spoken/written language
New Item III-Gb: Communication
- Rating of 1-2: Ineffective nonverbal communication; unclear directions; does not use effective questioning skills
- Rating of 3-4: Effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies
College of Education Disposition Survey
- Again, at the initial development stage, there was a focus on reliability and validity issues
- How the Unit-adopted dispositions were chosen
- How the COE Disposition Survey was developed
Adoption of Dispositions
- Looked at all the disposition statements in the INTASC standards
- Data collected from P-12 educators and candidates regarding the importance of specific dispositions
- Surveyed unit faculty regarding the relative importance of each disposition statement
- Conceptual framework committee reviewed results and provided input into selection
- Three primary dispositions emerged from this process
Development of the COE Disposition Survey and Advanced Disposition Survey
- Initially designed and field-tested in the summer of 2005
- Original survey consisted of 12 items
- Of those 12 items, four were targeted to specifically address two of the unit-adopted dispositions (fairness and the belief that all students can learn)
- Alternate forms of survey questions were written to address reliability (see the sketch after this list)
- Candidates were asked to identify, using a Likert scale, whether they “strongly agree,” “agree,” “n/a or neutral,” “disagree,” or “strongly disagree” with statements addressing these dispositions
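One way to examine alternate-forms reliability for paired items like these is to reverse-score the negatively worded statement and correlate it with its positively worded partner. The Python sketch below uses hypothetical responses coded 1 (strongly disagree) to 5 (strongly agree) and treats the Likert codes as interval data for a Pearson correlation, a common simplification; it does not reproduce the COE’s actual analysis.

```python
# Alternate-forms consistency check for a paired Likert item (hypothetical data).
# Statement 2 is negatively worded, so it is reverse-scored before comparison
# with its positively worded pair, Statement 3.
statement_2 = [2, 1, 4, 2, 1, 3, 2, 1]   # negatively worded ("fairness")
statement_3 = [4, 5, 4, 4, 5, 4, 5, 5]   # positively worded ("fairness")

def reverse_score(responses, scale_max=5):
    """Flip a 1..scale_max Likert coding so both forms point the same way."""
    return [scale_max + 1 - r for r in responses]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A high correlation suggests the two forms tap the same disposition.
r = pearson(reverse_score(statement_2), statement_3)
print(f"Alternate-forms correlation (fairness pair): r = {r:.2f}")
```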
Original Statements on Surveys
- Statement 2: I believe that schools today need to get back to basics – teachers should present lessons for everyone in the same structured way for students to learn the content.
- Statement 3: I believe that it is important to adapt instruction to students' different learning styles, and help students achieve in ways they find easy to learn.
- Statement 11: The impact of my performance as a teacher is primarily dependent upon the students' family backgrounds and the students' personal motivation.
- Statement 12: I believe all students can learn.
Early Data Gathered (values are percentages)

Semester Transition Point   | Statement 2 (“Fairness”) | Statement 3 (“Fairness”) | Statement 11 (“Belief all Ss can learn”) | Statement 12 (“Belief all Ss can learn”)
Fall Admission to Program   | 73.68 | 97.37 | 44.74 | 94.74
Spring Exit from Program    | 100   | 100   | 55.6  | 100
Summer Admission to Program | 85.71 | 100   | 21.43 | 100
Summer Exit from Program    | 94.12 | 100   | 35.29 | 94.12
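The transcript does not preserve how each percentage was computed. A plausible reading is that each value is the percentage of candidates responding in the expected direction for that statement (e.g., agreeing with the positively worded Statement 12). A minimal Python sketch under that assumption, with hypothetical responses:

```python
# Percentage of "expected-direction" responses for one statement at one
# transition point. The responses and the notion of "desirable" here are
# assumptions for illustration, not the COE's actual coding rules.
responses = ["strongly agree", "agree", "agree", "neutral",
             "strongly agree", "agree", "disagree", "strongly agree"]

# For a positively worded item, agreement is the expected direction.
desirable = {"agree", "strongly agree"}

pct = 100 * sum(r in desirable for r in responses) / len(responses)
print(f"Statement 12, one transition point: {pct:.2f}% expected-direction responses")
```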
What did these data tell us?
A. The four items on this instrument appeared to be reliable and valid.
B. Candidates’ responses appeared to be fairly consistent in terms of these items.
C. There appeared to be little if any consistency in candidates’ responses on these items.
D. In general, while candidates’ responses to the two items addressing “fairness” appear to be consistent, this does not appear to be the case with the two items addressing “the belief that all students can learn.”
Decisions Made
- Looked more in-depth at these items (at the individual candidate level) to determine agreement for the two items addressing the belief that all students can learn (see the sketch after this list)
- Asked the COE Assessment Committee to review data and make recommendations for changes based on these (and other) data
- The Assessment Committee recommended re-wording the item and including separate statements rather than a combined statement
- Faculty across the unit had multiple conversations about the role of the teacher in influencing student achievement as well as motivation
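The individual-candidate-level look described in the first bullet could take the form of a cross-tabulation of each candidate’s responses to the two paired statements. A sketch with hypothetical data follows; since Statement 11 is negatively worded, a consistent candidate should disagree with it while agreeing with Statement 12. The data and the consistency rule are assumptions, not the COE’s actual analysis.

```python
# Cross-tabulate paired responses at the individual candidate level
# (hypothetical data; lists are aligned by candidate).
from collections import Counter

stmt_11 = ["agree", "disagree", "neutral", "disagree", "agree", "disagree"]
stmt_12 = ["agree", "agree", "agree", "agree", "agree", "strongly agree"]

crosstab = Counter(zip(stmt_11, stmt_12))
for (r11, r12), n in sorted(crosstab.items()):
    print(f"Stmt 11: {r11:<10} Stmt 12: {r12:<16} n = {n}")

# Consistent = disagrees with the negatively worded item AND agrees
# with the positively worded one.
consistent = sum(n for (r11, r12), n in crosstab.items()
                 if r11 in {"disagree", "strongly disagree"}
                 and r12 in {"agree", "strongly agree"})
print(f"Consistent pairs: {100 * consistent / len(stmt_11):.0f}%")
```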
Some Common Errors Found In Key Assessments
- Items included that are not appropriately aligned to the standard(s) OR to what is supposed to be measured
- Not adequately measuring the standard (only certain aspects)
- Not setting clear performance expectations – e.g., what is “passing” or “acceptable”? OR setting inappropriate performance expectations
- Not matching the type of rubric to the assessment need (e.g., use of holistic vs. analytic rubrics)
- Performance descriptors on rubrics that are not sufficiently differentiated across levels
- Use of non-specific terms in performance descriptors (“some,” “effectively,” “adequately”) without explicit guidance for how those terms are to be defined
- Use of broad terms – outcomes not well defined
- Lack of appropriate balance of “brevity and detail” – either not efficient or not effective
- Lack of well-defined criteria to guide ratings – may lead to biased ratings (e.g., leniency bias)
- Not using multiple measures to assess outcomes
Candidate will integrate research findings in his/her practice: Research proposal
Components (each rated Target / Acceptable / Unacceptable):
- Abstract
- Literature Review
- Research Design
- Methodology
- Conclusion
- Use of APA style guide
- Effective communication
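As a rough illustration of why an analytic rubric suits this kind of key assessment, the rubric above can be represented as per-component ratings that roll up to a score, rather than a single holistic judgment. The component names come from the slide; the point values and scoring function below are assumptions for illustration only.

```python
# An analytic rubric as a data structure: every component gets its own rating.
# Point values per level are hypothetical.
LEVELS = {"Target": 2, "Acceptable": 1, "Unacceptable": 0}

COMPONENTS = ["Abstract", "Literature Review", "Research Design",
              "Methodology", "Conclusion", "Use of APA style guide",
              "Effective communication"]

def score_proposal(ratings):
    """ratings maps each component to one of the three performance levels."""
    missing = set(COMPONENTS) - set(ratings)
    if missing:
        raise ValueError(f"Unrated components: {missing}")
    return sum(LEVELS[ratings[c]] for c in COMPONENTS)

# Example: one candidate's ratings (hypothetical).
ratings = {c: "Acceptable" for c in COMPONENTS}
ratings["Literature Review"] = "Target"
print(f"Total: {score_proposal(ratings)} / {2 * len(COMPONENTS)}")
```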