

Inter-rater Reliability and Concurrent Validity Study of the Washington Kindergarten Inventory of Developing Skills (WaKIDS)

Principal Investigator: Gail E. Joseph, Ph.D.
Co-Principal Investigator: Deborah McCutchen, Ph.D.

Report prepared by:

Janet S. Soderberg, Sara Stull, Kevin Cummings, Elspeth Nolen, Deborah McCutchen and Gail Joseph


Table of Contents

EXECUTIVE SUMMARY
INTRODUCTION
    About Teaching Strategies GOLD
    The Present Study
RELIABILITY STUDY
    Methods
    Results
    Summary and Limitations
VALIDITY STUDY
    Methods
    Results
    Summary and Limitations
CONCLUSIONS
REFERENCES
APPENDICES
    Appendix A – GOLD Objectives and Dimensions (WaKIDS)
    Appendix B – Scoring Guide

 


Executive Summary

The Washington Kindergarten Inventory of Developing Skills (WaKIDS) is an initiative currently being implemented in all of Washington's state-funded full-day kindergarten classrooms that seeks to ensure every child is adequately prepared for kindergarten. A key component of this initiative is the statewide implementation of a customized version of the Teaching Strategies GOLD assessment (hereafter referred to as the WaKIDS assessment), which provides information about a child's readiness to be successful in kindergarten. This report presents data concerning the inter-rater reliability and concurrent validity of the WaKIDS assessment, specifically with regard to the diverse population of students attending kindergarten across the State of Washington.

To determine the concurrent validity of the WaKIDS assessment results against established standardized assessments, a psychometric design was implemented in which children's scores (n = 333) on the WaKIDS assessment (administered by their teachers in the first month of the 2012-2013 school year) were compared to scores from individually administered, norm-referenced assessments, which were administered by researchers in the fall of the same year. Results suggest that the WaKIDS assessment accurately predicts performance on the norm-referenced achievement battery in most (but not all) of the six learning domains in question.

The inter-rater reliability study recruited 54 teachers to view video portfolios of four different children working and interacting in various kindergarten classrooms. Teachers were asked to complete the 19 objectives from the WaKIDS assessment for each of the four children in the portfolios, and teachers' ratings of the children's skill levels were compared with those of raters trained to expertise in scoring the GOLD. Results suggest that the WaKIDS assessment has moderate inter-rater reliability among the sample of teachers recruited for this study. Furthermore, data are provided on the concurrent validity of the WaKIDS assessment as it pertains to the range of abilities and linguistic/ethnic/cultural diversity represented by kindergarteners in the State of Washington, with the literacy domain showing the most robust concurrent validity. The discussion addresses issues raised by the study's design and by the implementation of the GOLD assessment in general.

Finally, we provide recommendations related to the continued implementation of WaKIDS: (1) require participating kindergarten teachers to complete both training and Teaching Strategies Interrater Reliability Certification before using the assessment; (2) provide follow-up (refresher) training and establish a system and schedule of periodic reliability checks in order to maintain accuracy of results over time; (3) provide additional training in using the WaKIDS assessment with children with special needs and children who are English Language Learners; and (4) qualify interpretation of WaKIDS results with the caveat that some variance in the data can be attributed to teacher and classroom contexts.


Introduction

The Washington Kindergarten Inventory of Developing Skills (WaKIDS) is a kindergarten assessment process directed by the Office of Superintendent of Public Instruction (OSPI). WaKIDS provides information about children's skills across six domains as they enter kindergarten. It strives to inform the K-12 system, early care and education providers, and families about children's learning strengths and needs. An essential goal is to identify a common method to compare children across the state. There are three main components of WaKIDS: 1) family connection, in which teachers connect with new students and their families at the beginning of the school year; 2) early learning collaboration, in which practices are encouraged to bridge between the child's early learning program (e.g., child care, ECEAP, Head Start) and kindergarten; and 3) the whole child assessment. The whole child assessment measures 15 performance goals across four domains of the Washington State Early Learning and Development Benchmarks: Physical Well-Being, Health and Motor Development; Social and Emotional Development; Cognition and General Knowledge; and Language, Communication and Literacy. It should be noted that the Benchmarks have been replaced by the Washington State Early Learning and Development Guidelines, which have since been re-aligned with the WaKIDS assessment.

This report presents two studies conducted to determine the inter-rater reliability and the concurrent validity of the WaKIDS whole child assessment. The WaKIDS assessment is a tailored version of the Teaching Strategies GOLD assessment.

About Teaching Strategies GOLD

Teaching Strategies GOLD is an authentic, observation-based assessment system that aims to capture a complete picture of a child's learning and development. The GOLD assessment, as developed by Teaching Strategies, covers a total of ten domains of development and learning, which are further composed of 38 objectives. The WaKIDS assessment is a tailored version of GOLD requiring teachers to focus on only the following six assessment domains:

1. Social-emotional
2. Physical
3. Language
4. Literacy
5. Cognitive
6. Mathematics


These WaKIDS assessment domains comprise 19 specific objectives. The 19 objectives are a subset of the total number of objectives from the full Teaching Strategies GOLD assessment (see Appendix A).

Training: All teachers participating in WaKIDS were required to attend an OSPI-sponsored training prior to administering the assessment. Such training is a vital step toward an accurate scoring process given the inherent subjectivity of observational measures like the GOLD assessment (Meisels, Dorfman & Steele, 1995). During these two-day training sessions, teachers were given an overview of the instrument's purpose, background, and research basis; an overview of the assessment kit contents and materials; video administration of the assessment; practice sessions; and time for questions. Training in 2012 was typically held during the months of July and August, with a few make-up sessions in September for teachers who were unable to attend. Trainings were held regionally and conducted by 30 Washington trainers from Educational Service Districts (ESDs) and larger districts from across the state. Outside of the summer training sessions, ESD regional WaKIDS coordinators and trainers, as well as large district trainers, OSPI WaKIDS staff, and Teaching Strategies staff, provided assistance to teachers. Further, teachers had the option of pursuing Teaching Strategies Interrater Reliability Certification.

Observation and data entry: The assessment is designed so that a teacher can observe a child's skill level in each of the domains during normal classroom routines and activities. However, some items may need to be assessed using specialized activities. According to the technical summary, selected items in the literacy and mathematics domains are suited for evaluation during performance-based tasks. For example, a teacher may choose to evaluate GOLD items 16a and 16b (alphabet knowledge) by using an alphabet recognition game with an individual or with small groups of students. In 2012, WaKIDS required that data entry be completed by the end of the seventh complete week of the school year. After data are entered and finalized, online reports become available to the teacher, principal, and other school district personnel, including a class "snapshot" of the developmental levels of all the children in the class and individualized child reports for parents.

Once-a-year administration: WaKIDS requires only one assessment administration, at the beginning of the school year. Teachers may choose to administer the assessment up to two additional times throughout the school year to document student progress.

The WaKIDS assessment relies on teacher observation and report of students' skills. Because teacher report can introduce bias and drift, compromising the reliability and validity of the assessment results, a two-pronged study was developed to determine whether the teacher-reported information was reliable (that is, yields scores consistent with a standard) and valid (that is, accurately identifies student skill level when compared to results of more standardized and objective measures).


The Present Study

In collaboration with the State of Washington's Office of Superintendent of Public Instruction (OSPI), researchers at the University of Washington were contracted to conduct a two-pronged study to establish the inter-rater reliability and concurrent validity of the WaKIDS assessment. First, we conducted a reliability study to examine whether comparable information could be obtained from the tool across different raters and situations. Second, we sought to determine whether the WaKIDS assessment accurately captures what it is intended to capture in terms of the six developmental domains specified by this assessment. To this end, we conducted a validity study to compare scores from the GOLD assessment with scores on well-established measures administered by trained assessors. The methods employed and subsequent results of these two studies are discussed separately in the sections that follow.

Reliability Study

Methods

Participants

A total of 54 teachers from 42 schools across 26 school districts participating in WaKIDS volunteered and were included in this study. All of the teachers administered the WaKIDS assessment during the fall of 2012. We consulted with OSPI in an attempt to secure a sample of schools that was representative of the state in terms of size and student diversity.

The goal of the sampling procedure was to obtain a sample of schools that was evenly distributed across five regions of Washington State. These five regions, previously established in the WaKIDS pilot study (2011), were defined by the state's distribution of SES level (using the percent of students qualifying for free or reduced-price lunch as a proxy) and representation of ethnic minority groups: Black/African American, Native American, Asian American, and Latino/Hispanic. In the present study, teachers were recruited within each of the five identified regions of the state.

Obtaining a representative sample of all five regions was hindered by a low response rate to recruitment efforts. Of the approximately 1,000 teachers who were contacted, 54 volunteered to participate in the study. Although the response rate was low, the sample comprised participants from each of the five regions: the King and Pierce region encompassed 55.3% of the sample, the Northwest region 16.1%, the Northeast region 10.7%, the Southwest region 7.1%, and the Southeast region 10.7%. Full-day kindergarten teachers made up 79.9% of the sample, and 20.1% taught part-day. All but one of the 54 participating teachers reported attending regional WaKIDS GOLD summer trainings in 2012. About half of the participating teachers (28) indicated that they had completed the Teaching Strategies Interrater Reliability Certification. Teachers also reported having an average of one year of experience


with the WaKIDS program. Teaching experience varied, with an average of seven years in the kindergarten classroom and 12 years overall (see Table 1).

Table 1. Teacher Descriptives (n = 54)

Characteristic                              %       Mean    S.D.    Range
Training
  Summer 2012 WaKIDS                        98.1%
  TS Interrater Reliability Certification   51.9%
  Experience with TS-GOLD                    5.6%
  K transition experience                   11.1%
Certifications
  K-6 certification                         88.9%
  K-12 certification                        14.8%
  Early childhood certification             24.1%
  Special education certification           13%
  Secondary certification                    6%
Masters degree                              57%
Female                                      98%
Years in WaKIDS                                     1.07    0.67
Teaching experience
  Overall                                           12.07   9.06    0 to 35
  Kindergarten                                       7.08   6.08    0 to 26
  Pre-school                                32%

Participating teachers were given $300, and entered into a drawing for an iPad as compensation for their time. Teachers varied with respect to years teaching, years of experience with WaKIDS, and professional credentials. Fifty-three of the 54 teachers were female. The identities of participating schools and teachers were kept confidential and will not be shared beyond the research team.

Teachers were recruited to view and score four student portfolios (two males and two females) of children working and interacting in a kindergarten classroom. Each portfolio consisted primarily of videos, but also included written vignettes and photos of work samples. We added short descriptions to the videos in an effort to elaborate on, or clarify any ambiguities of, a given video segment (a strategy also employed by Teaching Strategies for their Interrater Reliability Certification for the GOLD assessment). The four students were purposely selected based on gender, ethnicity, skill/developmental level, and language proficiency in order to ensure that our


results reflect teachers' reliability in using GOLD across multiple types of learners. Development of the four portfolios was guided by input from each child's classroom teacher. Table 2 provides descriptive information on each portfolio.

Table 2. Student Portfolio Profiles

Student   Gender   Ethnicity          Developmental level     Language
A         Male     African American   Typically functioning   English-speaking
B         Female   Caucasian          Typically functioning   English-speaking
C         Female   Latino             Typically functioning   Dual language learner
D         Male     Caucasian          Low functioning         English-speaking

Measures

GOLD. Teaching Strategies GOLD is an authentic, observation-based assessment system for children from birth through kindergarten. It may be used with any developmentally appropriate curriculum. The primary purpose of Teaching Strategies GOLD is to document children's development and learning over time, to inform instruction, and to facilitate communication with families and other stakeholders. Teaching Strategies GOLD can be used to assess all children, including English-language learners, children with disabilities, and children who demonstrate competencies beyond typical developmental expectations (Kim, Lambert & Burts, 2013; Teaching Strategies, 2010a). During training, each teacher was briefed on a manual entitled Objectives for Development and Learning. The manual contains an overview of each area of development and learning and explains the research about why each area is important. In order to use the assessment, teachers were provided with a customized booklet of the progressions included in WaKIDS. They were encouraged to use this resource while completing the online survey. The objectives included for each area are listed in a shaded box. The manual outlines the progressions of development and learning and includes indicators and examples tied to chronological ages. The progressions are based on standard developmental and learning expectations, and the rating scale is used to assign a value to the child's level on a particular progression. The "in-between" boxes allow for more steps in the progression, so teachers can indicate that a child's skills are emerging in this area but


not yet solid. These in-between ratings also enable the teacher to indicate that a child needs adult support (verbal, physical, or visual) to accomplish the indicator (Teaching Strategies, 2010b). Colors for each year of life and kindergarten are used to show the age ranges for these expectations.

- Red = Birth to 1 year
- Orange = 1 to 2 years
- Yellow = 2 to 3 years
- Green = 3 to 4 years
- Blue = 4 to 5 years
- Purple = kindergarten

Some colored bands of a progression are longer or shorter than others, and while there is a typical progression for each objective, it is not rigid. That is, development and learning are uneven, overlapping, and interrelated. Finally, these color bands are underscored by a nine-point rating scale, which indicates a child's score for a given item (Joseph, Cevasco, Lee, & Stull, 2011). Teaching Strategies GOLD recommends a readiness model from which suggestions are made to define the skills and behaviors expected from typically developing children. They define a child's readiness as "consistently demonstrating skills within the blue band of widely held expectations for each progression…the child's skills are at the level just before the purple band begins" (Teaching Strategies, 2013). According to the developers, studies have been conducted to determine reliability and validity for the complete version of this measure. An inter-rater reliability study examined the correlations between the ratings of a Teaching Strategies GOLD master teacher/trainer and the ratings of teachers new to the system. Resulting correlations were strong, with all but one above .90 and the lowest correlation at .80. Additionally, Teaching Strategies GOLD reports strong internal consistency reliability estimates, with a mean of .97 (Teaching Strategies, 2011). However, these findings concern the reliability and validity of the measure with all 38 items included and cannot necessarily be generalized to the abbreviated WaKIDS measure.

Also according to the developers, researchers examined a six-factor model that corresponded to the design of the instrument. These six factors (social-emotional, physical, language, cognitive, literacy, and mathematics) were used as an evaluative basis for each test item. The resulting analysis yielded strong evidence for the six-factor design, with a Comparative Fit Index (CFI) = .931, a Root Mean Square Error of Approximation (RMSEA) = .066, and a Standardized Root Mean Square Residual (SRMR) = .033. Overall, the model demonstrated statistically significant results at p < .001. This supports the validity of Teaching Strategies GOLD as a measure of these six factors of child development.


Portfolio development

In order to measure inter-rater reliability for the WaKIDS assessment within the Washington State sample, we filmed each of the four students as they displayed abilities that aligned with the 19 GOLD objectives selected for WaKIDS. All of this footage was organized item-by-item into short video clips (approximately one to three minutes) and formatted into an online survey that teachers were asked to complete. The survey was designed to mimic the Teaching Strategies GOLD online data entry and score submission procedures. It should be noted that our decision to use vignettes and video profiles to evaluate rater agreement was based on the approach of the Interrater Reliability Certification created by Teaching Strategies (Teaching Strategies, 2010). See Figures 1 and 2 for examples from the survey.

Figure 1.


Figure 2.

Procedures

The 54 teachers recruited for the study were asked to complete the online survey on their own time. The survey was designed so that teachers could stop and start at their leisure, as well as navigate back and forth between portfolios. The first section consisted of demographic questions and specific directions for teachers to use while rating each portfolio. Teachers were instructed to utilize a paper version of the scoring guide as they observed vignettes, work samples, and videos to assist them in rating the skills of students (Appendix B). It was estimated that the survey would take anywhere from 3 to 5 hours to complete. Teachers were given approximately three months to complete the survey at their own convenience. A "master rater" who was trained to


expertise in scoring the GOLD assessment scored each portfolio. Teachers' ratings of the students' skill levels were compared to the scores provided by the master rater.

Data Analysis

To examine inter-rater reliability, the researchers explored the degree to which teachers were in exact agreement, in adjacent agreement, or sufficiently discrepant from the master code to result in a different readiness rating in terms of the cut point score. Exact agreement reflects exact matches between the master code and a teacher's rating. Adjacent agreement identifies instances in which the teacher's rating was no more than one point away from the master code (e.g., the teacher rated the item either a 5 or a 7 and the master code was a 6). The agreement level used in most of the following analyses was exact plus adjacent agreement, calculated by summing the items with exact or adjacent agreement and dividing this total by the number of cases receiving ratings. Inter-rater agreement percentages for each portfolio varied by teacher across items and domains.

We also examined various trends in the inter-rater agreements across domains and portfolios. Our intent was to explore trends or patterns in relation to each of the six developmental domains, the diverse skill sets of children, and tendencies of teachers to rate higher or lower than the master code. This type of evaluation is necessary when teacher judgment is involved, to get a sense of whether the student would receive the same score from another independent rater. It is possible to have a high agreement level while also having important differences in rater leniency (Miller, Linn & Gronlund, 2009). Therefore, teacher ratings were also examined by domain to determine the level of agreement with the master rater's identification of meeting the readiness cut point, as defined by Teaching Strategies GOLD, which we refer to below as the "cut point score analysis." The readiness cut points by domain can be found in Table 3.
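To make the agreement calculation concrete, the short sketch below (our illustration in Python, not the study's analysis code) computes exact, exact-plus-adjacent, and discrepant percentages for a single teacher against the master code; the item ratings shown are hypothetical.

def agreement_summary(teacher_ratings, master_code):
    """Percent exact, exact-plus-adjacent, and discrepant agreement
    between one teacher's ratings and the master code."""
    assert len(teacher_ratings) == len(master_code)
    n = len(master_code)
    exact = sum(t == m for t, m in zip(teacher_ratings, master_code))
    within_one = sum(abs(t - m) <= 1 for t, m in zip(teacher_ratings, master_code))
    return {
        "exact_pct": 100.0 * exact / n,
        "exact_plus_adjacent_pct": 100.0 * within_one / n,
        "discrepant_pct": 100.0 * (n - within_one) / n,
    }

# Hypothetical ratings on five items (GOLD items are scored on a 0-9 scale).
teacher = [7, 5, 8, 6, 3]
master = [7, 6, 8, 4, 3]
print(agreement_summary(teacher, master))
# {'exact_pct': 60.0, 'exact_plus_adjacent_pct': 80.0, 'discrepant_pct': 20.0}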

Results

Research Question 1: What is the degree to which teachers are in agreement with the master code?

The results point to various trends in the agreement of ratings across developmental domains and across the diverse skill sets observed in the four portfolios. Each of the four portfolios had 36 items rated by each of the 54 teachers and the master rater, a fully crossed design. Tables 4.1-4.4 illustrate the mean, standard deviation, and range of the teacher ratings by item and domain. The percentage of exact plus adjacent agreement by domain ranged from a low of 68.4% to a high of 84.4%. Overall, teacher ratings were more frequently in agreement with the master code for the social-emotional, physical, and language domains. Percent exact plus adjacent agreement ranged from a low of 67% to a high of 87% by portfolio. Teacher ratings were more frequently in agreement with the master code for the typically developing, native-English-speaking male and female students.


Across all portfolios, a few items in particular showed low levels of agreement between teacher ratings and the master code. These included 11c (cognitive: solves problems), 13 (cognitive: uses classification skills), 18a (literacy: interacts during read-alouds and book conversations), 18c (literacy: retells stories), and 20b (math: quantifies). Inter-rater agreement (exact, exact plus adjacent, and discrepant) by item, across domains and portfolios, can be found in Tables 5.1-5.5.

Table 3. Cut Points

Area (items)           Raw score (actual) - meeting   Raw score (converted) - meeting   Scale score - meeting
Social-Emotional (4)   21                             47                                595
Physical (5)           30                             30                                592
Language (6)           33                             44                                588
Cognitive (5)          24                             48                                603
Literacy (11)          36                             39                                591
Mathematics (5)        25                             35                                641

Table 4.1
Teacher ratings by item, Portfolio A

Domain/Item   Mean   SD     Range    Master code
Social
  1b          6.92   0.70   6 to 8   7
  1c          7.24   0.99   4 to 9   7
  2c          6.04   1.10   4 to 8   4
  2d          7.44   1.08   3 to 8   8
Physical
  4           6.28   1.00   4 to 8   7
  5           6.44   1.42   0 to 8   8
  6           7.48   0.82   4 to 9   8
  7a          7.00   0.85   6 to 9   7
  7b          7.89   0.37   7 to 9   8
Language
  9a          7.42   0.75   6 to 8   8
  9b          7.48   0.61   6 to 8   8
  9c          7.06   0.79   5 to 8   8
  9d          6.52   1.18   4 to 8   6
  10a         7.07   0.80   5 to 8   7
  10b         6.30   0.60   5 to 8   7
Cognitive
  11c         6.41   0.98   4 to 8   7
  11d         6.48   0.80   5 to 8   7
  11e         6.11   1.11   3 to 8   7
  12a         6.24   0.58   5 to 8   7
  13          6.07   1.36   4 to 8   5
Literacy
  15a         6.30   0.77   4 to 8   6
  15b         7.96   0.47   5 to 9   8
  15c         7.52   0.86   4 to 9   8
  16a         7.96   0.43   7 to 9   8
  16b         7.65   0.83   5 to 9   8
  17b         7.56   1.19   3 to 9   8
  18a         7.02   0.92   5 to 8   8
  18b         7.67   0.55   6 to 8   8
  18c         6.65   1.25   2 to 8   7
  19a         6.07   0.33   5 to 7   6
  19b         6.02   0.60   4 to 7   6
Math
  20a         7.63   0.88   3 to 9   8
  20b         7.48   0.91   6 to 9   7
  20c         8.07   0.93   7 to 9   8
  22          6.56   0.86   5 to 9   7
  23          6.28   0.94   5 to 8   6

Table 4.2
Teacher ratings by item, Portfolio B

Domain/Item   Mean   SD     Range    Master code
Social
  1b          6.87   0.95   5 to 9   7
  1c          7.51   0.70   6 to 8   8
  2c          6.19   0.99   4 to 8   7
  2d          7.11   0.97   4 to 8   8
Physical
  4           7.89   0.37   6 to 8   8
  5           7.96   0.33   7 to 9   8
  6           7.19   0.93   5 to 9   7
  7a          7.63   0.62   6 to 8   8
  7b          6.54   1.26   4 to 8   6
Language
  9a          6.56   0.88   5 to 9   7
  9b          6.19   0.70   5 to 8   5
  9c          6.81   1.05   4 to 8   8
  9d          6.54   1.18   4 to 8   7
  10a         6.57   0.90   5 to 8   7
  10b         6.85   1.11   4 to 8   7
Cognitive
  11c         6.39   0.90   4 to 8   7
  11d         6.15   0.96   4 to 9   7
  11e         5.63   0.94   3 to 8   6
  12a         5.65   0.97   3 to 8   5
  13          6.02   1.02   3 to 8   6
Literacy
  15a         6.07   1.20   3 to 8   7
  15b         7.09   1.09   4 to 8   8
  15c         6.76   1.18   2 to 8   7
  16a         7.93   0.47   6 to 9   8
  16b         7.74   0.71   4 to 9   8
  17b         7.39   0.90   4 to 9   8
  18a         5.41   1.33   3 to 8   6
  18b         7.69   0.58   6 to 8   7
  18c         4.69   1.78   2 to 8   2
  19a         6.09   0.40   5 to 7   6
  19b         6.04   0.47   5 to 7   6
Math
  20a         7.67   0.64   5 to 9   7
  20b         6.50   1.67   0 to 9   7
  20c         7.78   0.50   6 to 9   8
  22          7.31   0.84   6 to 9   8
  23          6.04   1.45   3 to 9   6

Table 4.3
Teacher ratings by item, Portfolio C

Domain/Item   Mean   SD     Range    Master code
Social
  1b          7.48   0.93   4 to 9   8
  1c          7.33   0.87   6 to 9   8
  2c          6.39   0.94   4 to 8   6
  2d          7.52   0.77   5 to 8   8
Physical
  4           7.31   0.80   5 to 8   8
  5           7.19   0.78   5 to 8   7
  6           5.87   1.21   3 to 8   7
  7a          7.80   0.56   6 to 9   8
  7b          6.20   1.20   4 to 8   5
Language
  9a          5.76   0.93   4 to 8   5
  9b          6.37   0.73   5 to 8   7
  9c          5.54   1.02   4 to 8   5
  9d          5.23   1.15   3 to 8   5
  10a         6.41   0.96   4 to 8   7
  10b         7.13   0.93   5 to 8   8
Cognitive
  11c         6.48   0.93   4 to 8   8
  11d         6.39   0.71   5 to 8   7
  11e         6.28   0.76   5 to 8   8
  12a         6.57   0.77   5 to 8   5
  13          6.13   1.12   4 to 8   6
Literacy
  15a         6.50   1.02   4 to 8   7
  15b         4.76   1.68   0 to 7   1
  15c         5.33   1.55   0 to 8   3
  16a         6.65   0.56   5 to 8   6
  16b         5.43   1.73   2 to 8   2
  17b         5.20   1.46   2 to 8   5
  18a         3.61   2.01   0 to 8   2
  18b         5.33   1.47   2 to 8   5
  18c         3.59   1.71   2 to 8   2
  19a         6.06   0.53   5 to 7   6
  19b         5.15   0.45   4 to 6   5
Math
  20a         5.98   0.77   5 to 8   5
  20b         5.39   1.00   3 to 8   3
  20c         6.22   0.57   5 to 8   6
  22          6.52   0.99   4 to 8   7
  23          6.96   0.82   6 to 9   7

Table 4.4
Teacher ratings by item, Portfolio D

Domain/Item   Mean   SD     Range    Master code
Social
  1b          5.13   0.83   4 to 7   6
  1c          6.56   0.86   5 to 8   7
  2c          5.26   1.62   2 to 8   6
  2d          5.72   1.31   4 to 8   5
Physical
  4           6.04   0.96   5 to 8   5
  5           7.35   0.73   5 to 8   7
  6           6.37   1.10   4 to 8   5
  7a          5.44   0.90   3 to 7   5
  7b          4.41   0.76   3 to 7   4
Language
  9a          5.98   0.64   5 to 7   7
  9b          5.75   0.73   4 to 8   6
  9c          6.31   0.84   4 to 8   7
  9d          6.00   1.17   4 to 8   7
  10a         6.81   0.93   5 to 8   8
  10b         6.59   1.02   3 to 8   8
Cognitive
  11c         5.07   1.18   3 to 8   3
  11d         5.83   1.34   4 to 8   4
  11e         5.80   0.90   4 to 8   5
  12a         4.74   0.96   4 to 8   4
  13          4.31   0.99   3 to 7   1
Literacy
  15a         3.48   1.99   0 to 8   5
  15b         4.30   1.53   0 to 8   4
  15c         3.67   2.08   0 to 8   5
  16a         6.35   0.59   5 to 7   6
  16b         4.57   0.92   4 to 7   4
  17b         3.30   1.46   0 to 7   3
  18a         4.53   1.17   1 to 7   3
  18b         3.61   1.85   0 to 7   2
  18c         3.83   1.75   2 to 8   2
  19a         4.98   0.46   3 to 6   5
  19b         3.15   0.49   2 to 5   3
Math
  20a         4.81   0.96   2 to 7   4
  20b         4.89   1.06   3 to 8   3
  20c         7.52   0.75   4 to 8   8
  22          4.06   1.93   0 to 7   2
  23          3.94   1.94   0 to 7   2

Table 5.1
Interrater agreement (exact plus adjacent, exact, and discrepant) on WaKIDS assessment items, Portfolio A

Domain/Item   % +/- 1 agreement   % Exact agreement   % Discrepant
Social
  1b          100                 51                  0
  1c           95                 24                  5
  2c           35                  6                  65
  2d           89                 68                  11
Physical
  4            86                 41                  14
  5            53                 24                  47
  6            91                 57                  9
  7a           98                 35                  2
  7b          100                 85                  0
Language
  9a           85                 57                  15
  9b           95                 54                  5
  9c           76                 32                  24
  9d           67                 37                  33
  10a          98                 43                  2
  10b          96                 26                  4
Cognitive
  11c          84                 29                  16
  11d          95                 28                  5
  11e          76                 22                  24
  12a          95                 26                  5
  13           64                 17                  36
Literacy
  15a          86                 78                  14
  15b          98                 83                  2
  15c          87                 65                  13
  16a         100                 82                  0
  16b          87                 69                  13
  17b          86                 55                  14
  18a          64                 41                  36
  18b          96                 70                  4
  18c          87                 32                  13
  19a         100                 89                  0
  19b          96                 76                  4
Math
  20a          96                 61                  4
  20b          93                 19                  7
  20c         100                 89                  0
  22           93                 30                  7
  23           87                 44                  13
Total          87.1%              48.5%               12.9%

% +/- 1 agreement = percent exact plus adjacent to master code.

Table 5.2
Interrater agreement (exact plus adjacent, exact, and discrepant) on WaKIDS assessment items, Portfolio B

Domain/Item   % +/- 1 agreement   % Exact agreement   % Discrepant
Social
  1b           94                 26                  6
  1c           89                 26                  11
  2c           83                 15                  17
  2d           83                 39                  17
Physical
  4            98                 91                  2
  5           100                 89                  0
  6            94                 28                  6
  7a           93                 70                  7
  7b           67                 17                  33
Language
  9a           94                 20                  6
  9b           76                 11                  24
  9c           72                 26                  28
  9d           85                 24                  15
  10a          87                 41                  13
  10b          85                 32                  15
Cognitive
  11c          87                 32                  13
  11d          82                 11                  18
  11e          87                 48                  13
  12a          85                 46                  15
  13           82                 63                  18
Literacy
  15a          67                 24                  33
  15b          70                 50                  30
  15c          89                 30                  11
  16a          96                 89                  4
  16b          96                 78                  4
  17b          85                 59                  15
  18a          59                 33                  41
  18b         100                 20                  0
  18c          24                 17                  76
  19a         100                 83                  0
  19b         100                 78                  0
Math
  20a          94                 31                  6
  20b          83                 28                  17
  20c          98                 76                  2
  22           78                 50                  22
  23           63                 22                  37
Total          84.0%              42.3%               16.0%

% +/- 1 agreement = percent exact plus adjacent to master rater.

Table 5.3
Interrater agreement (exact plus adjacent, exact, and discrepant) on WaKIDS assessment items, Portfolio C

Domain/Item   % +/- 1 agreement   % Exact agreement   % Discrepant
Social
  1b           82                 67                  18
  1c           76                 54                  24
  2c           85                 43                  15
  2d           87                 67                  13
Physical
  4            87                 48                  13
  5            98                 43                  2
  6            64                 24                  36
  7a           94                 81                  6
  7b           61                 32                  39
Language
  9a           85                 39                  15
  9b           96                 18                  4
  9c           83                 28                  17
  9d           85                 28                  15
  10a          85                 28                  15
  10b          78                 43                  22
Cognitive
  11c          44                 15                  56
  11d          98                 19                  2
  11e          28                  9                  72
  12a          56                  2                  44
  13           76                 48                  24
Literacy
  15a          85                 54                  15
  15b           6                  0                  94
  15c          20                  2                  80
  16a          98                 33                  2
  16b          17                 13                  83
  17b          70                 32                  30
  18a          46                 35                  54
  18b          67                 11                  33
  18c          52                 44                  48
  19a         100                 72                  0
  19b         100                 78                  0
Math
  20a          76                 28                  24
  20b          11                  6                  89
  20c          94                 80                  6
  22           87                 39                  13
  23           98                 39                  2
Total          71.5%              36.2%               28.5%

% +/- 1 agreement = percent exact plus adjacent to master rater.

Table 5.4
Interrater agreement (exact plus adjacent, exact, and discrepant) on WaKIDS assessment items, Portfolio D

Domain/Item   % +/- 1 agreement   % Exact agreement   % Discrepant
Social
  1b           76                 30                  24
  1c           89                 41                  11
  2c           70                 32                  30
  2d           69                 21                  31
Physical
  4            70                 34                  30
  5            98                 41                  2
  6            54                  6                  46
  7a           91                 28                  9
  7b           91                 69                  9
Language
  9a           78                 19                  22
  9b           96                 53                  4
  9c           89                 26                  11
  9d           65                 19                  35
  10a          61                 28                  39
  10b          54                 18                  46
Cognitive
  11c          28                  7                  72
  11d          34                 25                  66
  11e          87                 24                  13
  12a          82                 52                  18
  13            0                  0                  100
Literacy
  15a          54                 19                  46
  15b          78                 46                  22
  15c          56                 30                  44
  16a         100                 54                  0
  16b          85                 65                  15
  17b          72                 55                  28
  18a          54                  7                  46
  18b          19                 11                  81
  18c          50                 33                  50
  19a          98                 85                  2
  19b          98                 80                  2
Math
  20a          83                 32                  17
  20b          43                  4                  57
  20c          94                 61                  6
  22           24                 15                  76
  23           15                  0                  85
Total          66.8%              32.5%               33.2%

% +/- 1 agreement = percent exact plus adjacent to master rater.

Table 5.5
Overall interrater agreement (% exact plus adjacent) on WaKIDS assessment items by portfolio

Domain/Item   Portfolio A   Portfolio B   Portfolio C   Portfolio D   Item total   Domain total
Social                                                                             81.3
  1b          100           94.4          81.5          75.9          88.0
  1c          94.5          88.7          75.9          88.9          87.0
  2c          34.5          83.3          85.2          70.4          68.4
  2d          89.1          83.3          87            68.5          82.0
Physical                                                                           84.4
  4           85.5          98.1          87            70.4          85.3
  5           52.7          100           98.1          98.1          87.2
  6           90.9          94.4          64.2          53.7          75.8
  7a          98.2          92.6          94.4          90.7          94.0
  7b          100           66.7          61.1          90.7          79.6
Language                                                                           82.2
  9a          85.2          94.4          85.2          77.8          85.7
  9b          94.5          75.9          96.3          96.2          90.7
  9c          76.4          72.2          83.3          88.9          80.2
  9d          67.3          85.2          84.9          64.8          75.6
  10a         98.2          87            85.2          61.1          82.9
  10b         96.4          85.2          77.8          53.7          78.3
Cognitive                                                                          68.4
  11c         83.6          87            44.4          27.8          60.7
  11d         94.5          81.5          98.1          34            77.0
  11e         76.4          87            27.8          87            69.6
  12a         94.5          85.2          55.6          81.5          79.2
  13          63.6          81.5          75.9          0             55.3
Literacy                                                                           75
  15a         85.5          66.7          85.2          53.7          72.8
  15b         98.2          70.4          5.6           77.8          63.0
  15c         87.3          88.9          20.4          55.6          63.1
  16a         100           96.3          98.1          100           98.6
  16b         87.3          96.3          16.7          85.2          71.4
  17b         85.5          85.2          70.4          72.2          78.3
  18a         63.6          59.3          46.3          53.7          55.7
  18b         96.4          100           66.7          18.5          70.4
  18c         87.3          24.1          51.9          50            53.3
  19a         100           100           100           98.1          99.5
  19b         96.4          100           100           98.1          98.6
Math                                                                               75.6
  20a         96.4          94.4          75.9          83            87.4
  20b         92.7          83.3          11.1          42.6          57.4
  20c         100           98.1          94.4          94.4          96.7
  22          92.7          77.8          87            24.1          70.4
  23          87.3          63            98.1          14.8          65.8
Portfolio     87.0%         84.1%         71.6%         66.7%
total

% +/- 1 agree = percent exact plus adjacent to master rater.

Research Question 2: To what degree are ratings discrepant enough to fall outside of the cut point, as defined by the master code?

We conducted an additional "cut point score" analysis to explore the degree to which ratings were discrepant enough from the master code to fall on the other side of the developmental cut point. That is, given each child's readiness rating according to the master code (i.e., "ready" or "not ready"), how often were teacher ratings discrepant enough from the master code to receive an entirely different readiness rating?
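The sketch below illustrates the logic of this comparison under one simplifying assumption of ours: that a child "meets" readiness in a domain when the summed item ratings reach the domain's raw-score cut point from Table 3. It is an illustration only, using hypothetical ratings, and is not the authors' analysis code.

# Raw-score cut points for "meeting" readiness, per Table 3.
CUT_POINTS = {
    "Social-Emotional": 21, "Physical": 30, "Language": 33,
    "Cognitive": 24, "Literacy": 36, "Mathematics": 25,
}

def meets_cut(domain, item_ratings):
    # Assumption: "meeting" = domain raw score (sum of item ratings)
    # at or above the cut point.
    return sum(item_ratings) >= CUT_POINTS[domain]

def readiness_discrepant(domain, teacher_items, master_items):
    """True when a teacher's ratings imply a different readiness
    classification than the master code's ratings for the same child."""
    return meets_cut(domain, teacher_items) != meets_cut(domain, master_items)

# Hypothetical Cognitive ratings (five items, cut point 24):
teacher = [5, 6, 6, 5, 4]   # raw score 26 -> "ready"
master = [3, 4, 5, 4, 1]    # raw score 17 -> "not ready"
print(readiness_discrepant("Cognitive", teacher, master))  # True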

Teacher discrepancy from the master code followed certain trends with regard to the skills being assessed. Overall, teacher ratings within the physical, cognitive, and math domains (across all portfolios) tended to be more discrepant from the master code than ratings in the social-emotional, language, and literacy domains, and were generally more discrepant for Portfolio D, the lower-performing student. The cognitive domain had the highest rate of discrepancy. As Table 6 indicates, teachers agreed with the master code on each child's readiness rating perfectly for two students and almost perfectly for a third; for the fourth student (Portfolio D, the lower-performing student), 71.2% of the teachers gave a different readiness rating in the cognitive domain.

For the student in Portfolio D, discrepant teacher ratings tended to be inflated in the physical, cognitive, and math domains, resulting in different readiness ratings 56.6%, 71.2% and 56.6% of the time, respectively. Inter-rater discrepancies across domains and portfolios can be found in Table 6. Interestingly, the only instance in which another student was misclassified more frequently than Portfolio D was the dual language speaker in the language domain, where she was classified as not meeting the readiness cut-off by 13% of the teachers.


Table 6.
Percent discrepant enough from master code to fall outside the WaKIDS cut point

Domain      Portfolio A   Portfolio B   Portfolio C   Portfolio D
Social      0%            0%            0%            29.6%
Physical    3.7%          0%            5.7%          56.6%*
Language    0%            5.6%          13.2%         5.8%
Cognitive   0%            1.9%          0%            71.2%*
Literacy    0%            0%            0%            15.1%
Math        0%            0%            0%            56.6%*

* Ratings were above the cut point.

Generally, simple correlations (Pearson’s r) between the teachers and master code ranged from .24 to .65. Correlations in the low range, below .30, were found for 1.9% of the teachers. Correlations for the remaining teachers fell in the moderate range; 14.8% of teachers fell in the .30-.39 range, 42.6% fell within the .40-.49 range, 31.5% in the .50-.59 range, and 9.3% in the .60-.69 range.

Research Question 3: What teacher factors predict agreement of ratings across domains and the diverse skill sets of children?

The first section of the teacher survey asked demographic questions, three of which became predictor variables in a regression analysis examining factors that predict teacher agreement with the master code: 1) Teaching Strategies Interrater Reliability Certification, 2) overall teaching experience, and 3) kindergarten teaching experience. WaKIDS summer training was not included as a predictor due to near 100% teacher participation in this activity (see Table 1). The criterion, agreement level, was calculated for each teacher by summing exact and adjacent agreements across all items. Agreement levels ranged from 85 to 125 points out of a possible 144 points; a score of 85 reflects agreement with the master code on 59.0% of the items, and 125 reflects agreement on 86.8%. Table 7 contains the distribution of agreement levels. The first predictor, Teaching Strategies Interrater Reliability Certification, was dummy coded (0 = did not participate and 1 = participated), with 51.9% of the teachers participating. The second predictor, years teaching kindergarten, had a range of 0-26 years, and the final predictor, years teaching overall, had a range of 0-35 years (Table 1).
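For readers who want to see the model form, a minimal sketch of this kind of regression is shown below, using made-up teacher records rather than the study's data (which are not reproduced here); the variable names follow Table 8.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical teacher-level records: agreement score (out of 144 items),
# certification dummy (0/1), years teaching kindergarten, years teaching overall.
df = pd.DataFrame({
    "OverallAgree":    [106, 118, 95, 122, 110, 101, 114, 99],
    "TrainTSGOLD":     [0, 1, 0, 1, 1, 0, 1, 0],
    "KinderTeachExp":  [2, 10, 1, 15, 6, 3, 8, 0],
    "OverallTeachExp": [12, 14, 20, 18, 7, 25, 9, 4],
})

# Ordinary least squares with the three predictors used in the study.
model = smf.ols(
    "OverallAgree ~ TrainTSGOLD + KinderTeachExp + OverallTeachExp", data=df
).fit()
print(model.summary())

# Coefficients are in raw agreement points; dividing by 144 items converts them
# to percentage points of agreement (e.g., the reported certification effect of
# 5.38 points is 5.38 / 144, or about 3.7 percentage points).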


The predictors (Teaching Strategies Interrater Reliability Certification, kindergarten teaching experience, and overall teaching experience) accounted for a significant amount of variance (23%) in agreement level with the master code, R² = .23, adjusted R² = .18, F(3, 49) = 4.89, p < .01.

The model estimate of the intercept showed that teachers with no previous teaching experience and who did not participate in the Teaching Strategies Interrater Reliability Certification were predicted to average 106.72 points (SE= 2.10) on agreement level, and that this value is significantly different from zero, t(49) = 50.84, p < .001 (Table 8). That is, they agreed with the master code 74.1% of the time.

Table 7.
Teacher agreement levels

Number (out of 144) and percent agreement   Number of teachers
 85 (59.0%)                                 1
 95 (66.0%)                                 2
 98 (68.1%)                                 2
 99 (68.8%)                                 1
100 (69.4%)                                 1
101 (70.1%)                                 1
102 (70.8%)                                 3
103 (71.5%)                                 1
105 (72.9%)                                 1
106 (73.6%)                                 1
107 (74.3%)                                 3
108 (75.0%)                                 1
110 (76.4%)                                 5
111 (77.1%)                                 2
112 (77.8%)                                 2
113 (78.5%)                                 3
114 (79.2%)                                 4
115 (79.9%)                                 2
116 (80.6%)                                 1
117 (81.3%)                                 2
118 (81.9%)                                 5
120 (83.3%)                                 2
122 (84.7%)                                 4
123 (85.4%)                                 2
124 (86.1%)                                 1
125 (86.8%)                                 1

Table 8.
Agreement level: standard regression results

Outcome: OverallAgree. R²total = .23, R²Adj = .18, Ftotal(3, 49) = 4.89**

Predictor          b (SE)           t           β
Intercept          106.72 (2.10)    50.84***
TrainTSGOLD          5.38 (2.31)     2.33*      0.31
KinderTeachExp       0.92 (0.27)     3.44**     0.63
OverallTeachExp     -0.39 (0.18)    -2.18*     -0.41

N = 54. OverallAgree = agreement level; TrainTSGOLD = TS Interrater Reliability Certification; KinderTeachExp = years teaching kindergarten; OverallTeachExp = total years teaching.
* p < .05, ** p < .01, *** p < .001.

Teaching Strategies Interrater Reliability Certification had a unique positive effect on level of agreement (b= 5.38, SE= 2.31), t(49) = 2.33, p < .05, Sr2 = .09. Specifically, there is an estimated mean increase of 5.38 points on agreement level for teachers who completed Teaching Strategies Interrater Reliability Certification, holding all other values constant. That represents an increase from 74.1% agreement to 77.8% agreement with the master code.

Total number of years teaching kindergarten had a statistically significant but small unique positive effect on level of agreement (b = .92, SE = .27), t(49) = 3.44, p < .01, Sr2 = .19. Specifically, there is an estimated mean increase of .92 points on agreement level for each one-year increase in kindergarten teaching experience, holding all other values constant, increasing agreement from 74.1% to 74.8%.

Years teaching overall had a unique negative effect on level of agreement, but again rather small (b = -.39, SE = .18), t(49) = -2.18, p < .05, Sr2 = .07. Specifically, there is an estimated mean decrease of .39 points on agreement level for each one-year increase in overall teaching experience, holding all other values constant, decreasing agreement from 74.1% to 73.9%.

These findings indicate that teacher ratings tended to agree with the master code in the social-emotional, physical and language domains, and for the typically developing, native English speaking portfolios. Additionally, these ratings were at times discrepant enough from the master code to result in the children’s readiness level being misidentified, and more so for the lower-performing student. Finally, certification and teaching experience, in both kindergarten and overall, contributed to levels of agreement with the master code.

Summary and Limitations

The present study represents an extension of previous efforts to evaluate inter-rater reliability of the WaKIDS assessment (Teaching Strategies, 2011) by (1) drawing on a sample that is


exclusively representative of Washington State kindergarten teachers, (2) evaluating teacher ratings of a diverse selection of student skill sets, and (3) focusing on a subset of items from the Teaching Strategies GOLD assessment. To this end, we assessed various trends in the inter-rater agreements within the four student portfolios (two males and two females, varying in skill level, ethnicity, and language proficiency) developed for this study.

These findings provide information about the specific developmental domains and individual student skill sets in which teachers agreed with or were discrepant from the master code, and the implications these discrepancies might have for children. Previous research on inter-rater reliability with observational measures has focused on similar associations, but this study provides information about the WaKIDS assessment specifically, a unique adaptation of Teaching Strategies GOLD. We found that the cognitive domain had overall lower levels of agreement with the master code. In addition, discrepancies between teachers and the master code were extreme enough that the lower-performing student in our video portfolio was misidentified as to his readiness rating in every domain, dramatically so in the social, physical, cognitive, and math domains. Two domains (math and cognitive) have fewer items (five each) in comparison to other domains, such as literacy (which has 11 items). Interestingly, while the physical domain had a high level of agreement, the discrepancies led to three of the four students being misidentified as to their readiness rating at nontrivial rates. The literacy domain had low agreement levels with the master code overall, though fortunately the discrepancies from the master code in this case resulted in only one student being misidentified, albeit 15% of the time.

There were specific items with noticeably low levels of agreement, all of which fell within the low-agreement domains. These included 11c (cognitive: solves problems), 13 (cognitive: uses classification skills), 18a (literacy: interacts during read-alouds and book conversations), 18c (literacy: retells stories), and 20b (math: quantifies). Further, item 13 (uses classification skills) is also identified in the Teaching Strategies GOLD technical report as having a high degree of difficulty. Additional trainings may be warranted to ensure teachers' understanding and use of the rubrics in rating students in these areas.

Finally, participation in Teaching Strategies Interrater Reliability Certification was positively related to teachers’ overall agreement levels with the master code. Additionally, the number of years teaching kindergarten had a small but positive effect on teacher level of agreement with the master code, and overall years of teaching experience had a small but negative effect on agreement levels. Of these three factors, practical significance should probably be attributed only to the effect of the Teaching Strategies Interrater Reliability Certification.

These results have provided us with information about the accuracy of teacher judgments on the WaKIDS assessment and are relatively consistent with the level of agreement to be expected from most observational or authentic assessment tools (see Waterman et al., 2011). That being


said, these findings should be interpreted with respect to the various limitations that may exist in this study.

First, our analysis draws its conclusions from a convenience sample of just 54 teachers. Thus, one of the primary limitations to consider is the threat to external validity that may result from such a biased sample of teachers. For this reason, these results cannot necessarily be generalized to kindergarten teachers beyond those who were recruited for this study. That is, the level of agreement across raters for any particular item should only be seen as a possible indicator of a more prevailing trend among kindergarten teachers in the State of Washington.

Second, we must recognize the possible threats to internal validity that may have arisen from variations in a third variable (i.e., the quality of the student portfolios) that is related to the manipulated variable (student performance on a particular skill). In other words, we acknowledge that teachers' scores for a given survey item may reflect not only the child's performance on that particular item, but also the videographer's ability to accurately capture the skill in short video segments. This may also contribute to the observed differences in the reliability of ratings for the two typically developing students compared to the lower-functioning student and the student with multiple home languages. Further, the use of a single master rater prevented the reconciliation of multiple master codes.

We might also think of this limitation as a threat to content validity, as teachers were being asked to rate computer-based profiles of abilities and behaviors that were intended to be evaluated in a classroom context, over a period of several weeks.

Finally, the timing by which teachers were able to complete the survey should be taken into account. Because teachers had approximately three months to complete the survey, and were told to do so at their own convenience, we could not control for the possible effects of staggered start and finish times or overall time of completion. For example, some teachers spread the task out over several months while others finished it in just a few days. Moreover, some teachers finished the survey in as little as three hours, while others took upwards of 10 hours to complete the task. We did not attempt to factor these differences into our analysis, partly because we had no accurate way to track them.

Although it would be desirable to address these limitations in future projects, we believe that the present study provides valuable information about the degree of inter-rater reliability among the sample of teachers recruited for this study. Based on these findings, and because most current screening and readiness instruments are known to have limitations (Wenner, 1995), the WaKIDS assessment seems to be a sufficient tool for evaluating the school readiness of Washington State kindergarteners.


Validity Study

Methods

Participants

A total of 333 children from 42 schools across 26 school districts participating in WaKIDS were included in this portion of the study. These children were randomly selected from the classrooms of all 54 teachers from the reliability study, as well as two additional teachers who opted not to participate in the reliability study (M = 5.66 students per classroom). Thus, the sample was representative of the population sizes of each region: 53.2% of children in the King and Pierce Region, 17.1% in the Northwest Region, 8.7% in the Northeast Region, 7.5% in the Southwest Region, and 13.5% in the Southeast Region. The sample also represented a diversity of racial backgrounds (39% White, 31.5% Hispanic, 7.5% Asian, 7.2% two or more races, 4.5% Black or African American, 1.8% Native Hawaiian, 0.6% Native American, and 7.8% unidentified). The majority of the children in the sample spoke English as their primary language (64.9%); 18% of the children's primary language was Spanish, 2.1% Vietnamese, 1.5% Chinese, 0.6% Korean, 0.6% Somali, 0.3% Tagalog, 3.3% an "other" language, and 8.7% unknown. The children ranged in age from 5.0 to 6.83 years at the time of their participation (M = 5.66 years). Just over half of the students were enrolled in the free or reduced-price lunch program (53.8%). Gender was distributed evenly (males = 51.1%), and 6.9% of students qualified for special education services. Figures 3-7 present this information.

Figure 3. Participating Students per Region (n = 333): King & Pierce 53%, Northwest 17%, Northeast 9%, Southwest 7%, Southeast 14%.

Figure 4. Racial Backgrounds of Participating Students (n = 333): White 39%, Hispanic 32%, Asian 7%, two or more races 7%, Black or African American 4%, Native Hawaiian 2%, Native American 1%, missing 8%.

Figure 5. Student Primary Language (n = 333): counts of students by primary language (English 216, Spanish 60, with smaller counts for the remaining language categories).

Figure 6. Student Age (years:months): distribution of ages from 5:0 to 6:10.

Figure 7. Student Demographic Information: gender and assessment battery (n = 333); free or reduced-price lunch and special education services (n = 307).

Participant assessors

Student assessment data were obtained from two types of assessors: teacher assessors, who collected WaKIDS data on their students during the first two months of the school year, and researcher assessors, who collected individually administered assessment data on the smaller sample of students recruited for the study. Researcher assessors (three graduate students and 13 undergraduate students) were trained on at least one of two assessment batteries. Teacher assessors had all been trained on the WaKIDS assessment prior to the start of the school year (although the reliability results summarized previously should be considered in the interpretation of validity analyses).

Measures

In addition to the WaKIDS assessment, various assessments of the skills measured by WaKIDS were selected to determine the validity of WaKIDS scores in each domain. In other words, one or two frequently-used, direct assessments were identified as reliable and valid measures of each skill represented in each WaKIDS domain and were administered to the student participants to compare to their teacher-rated WaKIDS scores.
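As one illustration of how such a comparison can be expressed, the sketch below correlates hypothetical WaKIDS domain scores with scores on a corresponding direct assessment; it is our generic example of a concurrent-validity check, not a reproduction of the analyses reported later in this report.

from scipy.stats import pearsonr

# Hypothetical scores for eight students: teacher-rated WaKIDS literacy raw
# scores and composite scores from a direct literacy assessment.
wakids_literacy = [36, 41, 28, 45, 33, 39, 30, 44]
direct_assessment = [22, 27, 15, 31, 20, 25, 17, 30]

r, p = pearsonr(wakids_literacy, direct_assessment)
print(f"r = {r:.2f}, p = {p:.3f}")
# A strong positive correlation would support concurrent validity for the domain.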

Language

Two assessments were selected to align with the WaKIDS Language Domain, one for each of the two batteries administered: the PPVT-4 (Battery A) and the OWLS-II Oral Expression Scale (Battery B).

The Peabody Picture Vocabulary Test-Fourth Edition (PPVT-4). The PPVT-4 is a norm-referenced instrument for measuring the receptive (hearing) vocabulary of children and adults in standard American English. The test content covers a broad range of receptive vocabulary levels, from preschool through adult. The items broadly sample words that represent 20 content areas (e.g., actions, vegetables, tools) and parts of speech (nouns, verbs, or attributes) across all levels of difficulty (Dunn & Dunn, 2007). The test took approximately 10-15 minutes to complete with each student (those receiving Battery A). Psychometric examinations of the PPVT-4 have yielded strong results for several types of reliability. Split-half reliability (a form of internal consistency reliability) has been shown to be very high, ranging from .94 to .95. Test-retest reliability is also strong, with an average correlation of .93 (ranging from .92 to .96). Four correlational studies compare PPVT-4 scores with scores obtained on instruments measuring expressive vocabulary, language ability, and reading achievement. These studies provide strong convergent validity results (Dunn & Dunn, 2007).

The Oral and Written Language Scales-Second Edition (OWLS-II) Oral Expression Scale. The OWLS-II Oral Expression scale measures the expressive language of children. This instrument took approximately 10-20 minutes to complete with each student (Battery B), during which the student responded verbally to questions about pictures presented to them. According to the OWLS-II technical manual, test-retest reliability for this instrument is satisfactory for tests of developing abilities. Composite coefficients are moderate to high, ranging from .85 to .95 with a median of .92. Additionally, high levels of agreement between raters were found for the Oral Expression subtest; inter-rater correlations were .96 for Form A and .93 for Form B. Finally, there is sufficient evidence to support the construct validity of the OWLS-II (Carrow-Woolfolk, 2011).

Page 34: University of Washington Inter-rater Reliability and Concurrent

  34  

Mathematics

Woodcock-Johnson III Tests of Achievement (WJ III) Applied Problems. The Applied Problems subtest of the WJ III Tests of Achievement measures the ability to analyze and solve math problems and was selected to align with the WaKIDS Math domain. To solve the problems, the child is required to listen to the problem, recognize the procedure to be followed, and then perform relatively simple calculations (McGrew, Schrank & Woodcock, 2007). The test took approximately 5-10 minutes to complete with each student. For the Applied Problems subtest, test-retest correlations for ages 2-7 are strong: .90 for follow-up intervals of less than one year and .85 for a two-year follow-up. The validity of the WJ III is sufficiently supported with evidence from at least three separate measures for each of the five achievement ability subtests (McGrew, Schrank & Woodcock, 2007).

Literacy

Two assessments were selected to align with the WaKIDS Literacy Domain, one for each of the two batteries administered: the TOPA-2+ (Battery A) and the TERA-3 (Battery B).

Test of Phonological Awareness PLUS (TOPA-2+). The TOPA-2+ is a measure of phonological awareness that assesses young children's ability to (a) isolate individual phonemes in spoken words and (b) understand the relationships between letters and phonemes in English. The Kindergarten version of this test uses two subtests: the Phonological Awareness subtest (20 items) and the Letter Sounds subtest (15 items) (Torgesen & Bryant, 2004). Both subtests were administered to students, and the two resulting subtest scores were summed to give each student one composite score. While this test is formatted for group administration, it was adapted for this study to be administered individually (specifically, by having students point to their answer choice rather than marking it in the booklet) and took approximately 10 minutes to complete with each student. Evidence of internal consistency reliability, test-retest reliability, and interscorer reliability is reported for the TOPA-2+, with coefficients that meet or exceed .80 across all ages. Evidence is also provided for content-descriptive validity, criterion-prediction validity, and construct-identification validity (Torgesen & Bryant, 2004).

The Test of Early Reading Ability-Third Edition (TERA-3). The TERA-3 assesses mastery of early developing reading skills in children ages 3:6 through 8:6. It has three subtests: (1) Alphabet (measuring knowledge of the alphabet and its uses); (2) Conventions (measuring knowledge of the conventions of print); and (3) Meaning (measuring the construction of meaning from print), all of which were administered to the students. An overall Reading Quotient can be calculated from the standard scores on all three subtests, and this composite score was used to align with the WaKIDS Literacy score (Reid, Hresko & Hammill, 2001). The test took approximately 15-20 minutes to complete with each student. Studies provide substantial evidence for the interscorer reliability of the TERA-3, with all resulting coefficients rounding to .99.


In a study that compared TERA-3 scores with those of similar tests of school achievement, the results supported the construct validity of the assessment (Reid, Hresko & Hammill, 2001).

Physical

The Early Screening Inventory-Revised (ESI-R). The Early Screening Inventory-Revised (ESI-R™), 2008 Edition, is a developmental screening instrument designed to address developmental, sensory, and behavioral concerns in the following areas: Visual Motor/Adaptive, Language and Cognition, and Gross Motor Skills (Meisels, Marsden, Wiske, & Henderson, 2003). Two of the subtests, which aligned with the objectives of the WaKIDS Physical domain, were administered to students: Fine-Motor (8 items) and Gross-Motor (6 items). The resulting subtest scores were summed to give each student one composite score. The ESI-R can be administered to children ages 3:0-5:11 and took approximately 5-10 minutes to complete with each student. Because most screening tests draw conclusions from specific scoring levels (or cutoff points), reliability is often established using a “conditional” reliability statistic. For the ESI-R, conditional reliability was high, averaging .88 for children scoring near the ESI-R cutoff points. Moreover, split-half internal-consistency reliability was moderate at .78 and .77 for age ranges 5:0-5:5 and 5:6-5:11, respectively. Finally, research has shown the validity of the ESI-R to be “highly reassuring” (Meisels, Marsden, Wiske, & Henderson, 2003).

Cognitive

The Learning Motivation Task. The Learning Motivation task is a measure of children’s positive approaches to learning and was selected to align with the WaKIDS Cognitive domain. The task requires students to complete two puzzles: one designed to be challenging and one designed for easy completion within the allotted two-minute time limit. After the two trials are completed, the child is asked to rate their feelings about each of the puzzles and to indicate which of the two puzzles they would like to do again (adapted from Smiley & Dweck, 1994). The task took approximately 10 minutes to complete with each student. It is not clear whether the instrument has been normed on a broad range of participants; consequently, validity and reliability information cannot be provided for the Learning Motivation Task. In the end, this task proved to be an inadequate measure and will not be discussed further.

Social Emotional

Social Skills Improvement System (SSIS). The SSIS Rating Scales Parent Form measures three areas: (1) Social Skills (Communication, Cooperation, Assertion, Responsibility, Empathy, Engagement, and Self-Control); (2) Competing Problem Behaviors (Externalizing, Bullying, Hyperactivity/Inattention, Internalizing, and Autism Spectrum); and (3) Academic Competence (Reading Achievement, Math Achievement, and Motivation to Learn).


To align with the WaKIDS Social Emotional Domain, only the first section of scores (the Social Skills subscale) was used. This questionnaire is estimated to take approximately 10-25 minutes to complete (Gresham & Elliott, 2008). Based on a sample of 110 individuals between the ages of 3 and 18 who were rated by two parents/caregivers, the adjusted inter-rater reliability coefficient for the Social Skills subscale was .62. This outcome is similar to the inter-rater reliability reported for the Teacher Form. Concurrent aspects of validity were also examined for the Parent Form of the SSIS Social Skills scale; results showed adjusted scale correlations of .75 and .73 for ages 3-5 and 5-12, respectively (Gresham & Elliott, 2008).

Individual Observation Form. The Individual Observation form was developed by the research team as a way to capture a more nuanced description of students’ test-taking approach. At the end of each assessment, the assessor rated the child’s behavior and approach to the tasks as observed during the session. This form consisted of Likert-type scale questions covering (1) willingness to accompany the assessor, (2) response to difficult questions, (3) level of cooperation, and (4) ability to follow directions. The item ratings were summed to create an overall composite score out of 20 possible points.

Procedures

The present study examined the relationship between the WaKIDS assessment and other key indicators of student achievement. At the beginning of the 2012-13 school year, teachers rated each child’s performance on six domains of developing skills using the WaKIDS assessment. Teachers were given a total of seven weeks from the start of their school year to collect and submit these data to WA State. In the fall of 2012, within approximately three weeks of the WA State deadline for teachers to finalize their WaKIDS scores, participating students were assessed by researchers with one of the two assessment batteries, each requiring approximately 30-45 minutes to administer (an amount of testing time we considered appropriate for children of this age).

The goal was to assess six students from each participating teacher’s class: three boys and three girls. To do this, parent consent forms were first sent home with all students in a participating teacher’s class to inform students’ parents/guardians of the nature of the study and to ask permission for researchers to assess their child using selected standardized assessments. Once teachers had received most of the forms back from parents granting permission, they sent the names and genders of the students with permission to participate to the researchers. From the students available in each class, we randomly selected three boys and three girls. Sometimes fewer than six students had returned consent forms, or a greater number of males or females had returned them; therefore, for some classes, fewer than six students were assessed, and a balance of males and females was not always possible. For a few other classes, particularly those farther from Seattle or in regions with lower participation rates, more than six students were assessed, as time and the number of available assessors allowed, to help increase the sample size.
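
To illustrate the selection step described above, the sketch below shows one way such a random draw of three boys and three girls per classroom could be implemented. It is an illustrative sketch only; the data structure and gender codes are assumptions, not artifacts of the actual study.

import random

def select_students(consented, boys_needed=3, girls_needed=3, seed=None):
    """Randomly select up to three boys and three girls from one classroom's
    consented students. `consented` is a list of (student_id, gender) tuples
    with gender coded "M" or "F"; this structure is assumed for illustration."""
    rng = random.Random(seed)
    boys = [sid for sid, g in consented if g == "M"]
    girls = [sid for sid, g in consented if g == "F"]
    # When fewer consent forms came back than needed, take everyone available.
    return (rng.sample(boys, min(boys_needed, len(boys)))
            + rng.sample(girls, min(girls_needed, len(girls))))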

Across all classrooms, 49.5% of the students received Battery A, and 50.5% received Battery B. These groups were balanced in terms of student gender, which was the only grouping factor used to determine battery assignment.


Researchers took students individually out of the classroom to conduct the assessments on their assigned battery in a quiet, distraction-free area of the building. At the conclusion of each assessment session, the assessor rated the student on the Individual Observation form. Additionally, teachers sent SSIS forms home with these students, to be completed by their parents, along with a parent demographic and background information survey, a stamped and addressed return envelope, and $2.00 as an incentive for parents to complete the forms and mail them back to the researchers.

Data Analysis

The concurrent validity of the GOLD assessment was examined by computing zero-order correlations between student scores on each of the six WaKIDS domains and scores on the corresponding direct assessments. The strength of each correlation indicates how accurately WaKIDS domain scores measured student skill level in that domain (as measured by the direct assessments). For example, we examined the correlation between the WaKIDS Math domain score and the WJ Applied Problems direct assessment results. Additionally, Hierarchical Linear Modeling was used to examine the student data more closely by accounting for the nesting of students within classrooms and related issues of reliability in teacher ratings.
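
As a rough sketch of the correlational step (not the research team's actual code), the snippet below computes one such zero-order Pearson correlation in Python; the file and column names (student_scores.csv, wakids_math, wj_applied) are hypothetical placeholders.

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical student-level file: one row per student, containing the
# teacher-rated WaKIDS domain scores and the direct-assessment scores.
df = pd.read_csv("student_scores.csv")

# Example pairing: WaKIDS Math domain score vs. WJ-III Applied Problems.
pair = df[["wakids_math", "wj_applied"]].dropna()  # keep students with both scores
r, p = pearsonr(pair["wakids_math"], pair["wj_applied"])
print(f"r = {r:.2f}, p = {p:.4f}, n = {len(pair)}")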

Results

Research Question 1: Does the WaKIDS assessment give a valid picture of children’s skills, compared with other assessments? Table 9 provides the means and standard deviations that resulted from administration of all the assessments across Batteries A and B. The number of students with scores on each assessment varies because of missing data. Several students were missing data for one or two of the assessments within the battery administered but were kept in the analyses. Four students were missing WaKIDS assessment data across all domains and were removed from further analyses because of this substantial lack of data. Of the 333 students recruited for the study, 134 SSIS forms were completed and returned, a 40.2% response rate (these were evenly split between Battery A and Battery B students).

Table 9. Concurrent Validity Descriptives

Assessments                                                             N      M        SD
Social Emotional
  Social Skills Improvement System (SSIS)                               134    99.81    12.35
  Individual Observation Form*                                          322    18.44     2.26
Physical
  The Early Screening Inventory-Revised (ESI-R)*                        158    12.16     2.18
Language
  The Peabody Picture Vocabulary Test-Fourth Edition (PPVT-4)           163   100.64    19.06
  The Oral and Written Language Scales-Second Edition (OWLS-II)         163    92.37    17.38
Cognitive
  The Learning Motivation Task*                                         ---      ---      ---
Literacy
  Test of Phonological Awareness PLUS (TOPA-2+)*                        163    23.08     7.85
  The Test of Early Reading Ability-Third Edition (TERA-3)              161    89.61    15.62
Math
  Woodcock-Johnson III Tests of Achievement (WJ III) Applied Problems   165   103.66    13.94

*Note: Means and standard deviations calculated from raw scores.

All of the zero-order correlations revealed statistically significant, positive relationships between scores from the direct assessments and their corresponding WaKIDS domain scores (Table 10). The pattern of correlations suggests that the TOPA-2+ and the WJ were most strongly correlated with their corresponding WaKIDS domain scores (r = .64 and r = .61, respectively). On the other hand, the SSIS and ESI-R correlated less strongly with their corresponding WaKIDS scores (r = .26 for both). Both language assessments, the PPVT-4 and OWLS-II, along with the TERA-3, fell in between with moderate correlations (r = .50, .50, and .51, respectively). These findings support an interpretation that the domains within the WaKIDS adaptation of the Teaching Strategies GOLD assessment tap the intended constructs, although some more robustly than others.


Table 10. Zero-order correlations

Direct assessment (domain)     Corresponding GOLD domain     r
SSIS (social emotional)        GOLD Social Emotional         0.26**
ESI-R (physical)               GOLD Physical                 0.26***
PPVT-4 (language)              GOLD Language                 0.50***
OWLS-II (language)             GOLD Language                 0.50***
LMT (cognitive)                GOLD Cognitive                ---
TOPA-2+ (literacy)             GOLD Literacy                 0.64***
TERA-3 (literacy)              GOLD Literacy                 0.51***
WJ-III (math)                  GOLD Math                     0.61***

** p < .01, *** p < .001

This question was further addressed with a multilevel design in which multiple children are nested within classrooms. The direct measure performances (WJ-III Applied Problems, OWLS-II, PPVT-4, TOPA-2+, TERA-3) are subject to two sources of variance: (1) variability for children within the same classroom and (2) variability in the mean ratings between classrooms. Because of the nesting of children within classrooms, it is likely that children within the same classroom will be more similar to each other than those across classrooms, due in part to instructional experiences in the classrooms and in part to variability across teachers in their scoring on the WaKIDS assessment.


Therefore, the data were analyzed using Hierarchical Linear Modeling (Raudenbush & Bryk, 2002) in order to take the nested nature of the study design into account.
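
As an illustration of this approach (not the study’s actual code), an unconditional random-intercept model for one outcome could be fit in Python with statsmodels as sketched below; the file and column names (student_scores.csv, wj_applied, classroom_id) are assumptions for illustration, and the default REML estimates will not necessarily reproduce the HLM software output exactly.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file: one row per student with a classroom
# identifier and the WJ-III Applied Problems score.
df = pd.read_csv("student_scores.csv")

# Unconditional (null) model: random intercept for classroom, no predictors.
null_fit = smf.mixedlm("wj_applied ~ 1",
                       data=df.dropna(subset=["wj_applied"]),
                       groups="classroom_id").fit()

tau00 = null_fit.cov_re.iloc[0, 0]   # between-classroom (intercept) variance
sigma2 = null_fit.scale              # within-classroom (residual) variance
print(f"between-classroom variance = {tau00:.2f}, within-classroom = {sigma2:.2f}")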

We first examined the unconditional models in each of the three domains to partition the variance into two sources: within classrooms and between classrooms.

Math:      Yij (WJ-Applied Problems) = γ00 + µ0j + rij

Language:  Yij (OWLS-II) = γ00 + µ0j + rij

           Yij (PPVT-4) = γ00 + µ0j + rij

Literacy:  Yij (TERA-3) = γ00 + µ0j + rij

           Yij (TOPA-2+) = γ00 + µ0j + rij

After calculating the intraclass correlations, we found that 14% of the total variance in the WJ-Applied Problems subtest was attributable to classroom effects, with the rest attributable to individual factors. Additionally, 33% of the total variance in the OWLS-II was attributable to classroom effects and the remainder to individual factors, while 17% of the variance in the PPVT-4 was due to classroom effects. Lastly, 33% of the total variance in the TERA-3 was attributable to classroom effects, while 18% of the total variance in the TOPA-2+ was at the classroom level. These levels of classroom effects indicate that clustering must be accounted for in each model, and HLM does so, ensuring appropriate precision and guarding against inflation of Type I error.
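
For reference, the intraclass correlation in a two-level model is the between-classroom variance expressed as a proportion of the total variance. Using the unconditional variance estimates reported for the WJ-Applied Problems model in Table 11, for example:

ICC = τ00 / (τ00 + σ²) = 27.70 / (27.70 + 169.57) ≈ .14

which corresponds to the 14% classroom-level variance noted above.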

Five separate models were used to further address the research question. The first model examined the association between teacher ratings on the WaKIDS assessment math domain and scores on the WJ-Applied Problems subtest.

Yij (WJ-Applied Problems) = γ00 + γ01(WaKIDS math)j + µ0j + rij

Next, we examined the association between teacher ratings on the WaKIDS assessment language domain and scores on both the OWLS-II and the PPVT-4.

Yij (OWLS-II) = γ00 + γ01(WaKIDS Language)j + µ0j + rij

Yij (PPVT-4) = γ00 + γ01(WaKIDS Language)j + µ0j + rij

Lastly, we examined the association between teacher ratings on the WaKIDS assessment literacy domain and scores on both the TERA-3 and the TOPA-2+.

Yij (TERA-3) = γ00 + γ01(WaKIDS Literacy)j + µ0j + rij

Yij (TOPA-2+) = γ00 + γ01(WaKIDS Literacy)j + µ0j + rij
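
Continuing the illustrative statsmodels sketch above, the first of these conditional models, with the student-level WaKIDS math score group-mean centered within classrooms (as noted beneath the tables that follow), might be fit as shown below; the column names remain assumptions rather than the study’s actual variable names.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_scores.csv")  # hypothetical student-level file

# Group-mean center the WaKIDS math score within each classroom.
df["wakids_math_gmc"] = (df["wakids_math"]
                         - df.groupby("classroom_id")["wakids_math"].transform("mean"))

# Conditional model: WJ-Applied Problems predicted by the centered WaKIDS
# math score, with a random intercept for classroom.
full_fit = smf.mixedlm("wj_applied ~ wakids_math_gmc",
                       data=df.dropna(subset=["wj_applied", "wakids_math_gmc"]),
                       groups="classroom_id").fit()
print(full_fit.summary())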


Descriptive statistics are reported in Table 9. Means and standard deviations for each of the direct measures, as well as the observational ratings provided on the WaKIDS assessment, indicate scores in the average range overall.

The first model explored the association between the WaKIDS assessment math domain scale score and the WJ-III Applied Problems subtest (see Table 11). The model estimate of the intercept showed that the mean estimate of the WJ-Applied Problems subtest was 103.96 points (SE = 1.24), which is significantly greater than zero, t(54) = 83.78, p < .001. The WaKIDS assessment math domain scale score was a positive predictor of WJ-Applied Problems subtest scores; specifically, there is an estimated increase of .15 points on WJ-Applied Problems scores for students who were one unit higher on the WaKIDS math domain. The between-classroom variance is significantly different from zero; that is, the intercept varies significantly across classrooms. These results indicate that scores on the WaKIDS math domain were significantly related to WJ-Applied Problems performance. WaKIDS math domain performance explains 30% of the variance in WJ-Applied Problems scores.

Table 11. Two-level model of WJ-III Applied Problems

Unconditional model
  Fixed effect:   Intercept: Coeff = 103.68, SE = 1.24, t = 83.45, df = 54, p < .001 ***
  Random effects: Intercept: Var = 27.70, χ² = 80.31, df = 54, p < .05 *;  Level 1: Var = 169.57

Full model
  Fixed effects:  Intercept: Coeff = 103.96, SE = 1.24, t = 83.78, df = 54, p < .001 ***
                  TS-GOLD Math: Coeff = 0.15, SE = 0.03, t = 5.57, df = 159, p < .001 ***
  Random effects: Intercept: Var = 42.83, χ² = 111.85, df = 54, p < .001 ***;  Level 1: Var = 118.64

Note. N = 165 students from 56 classrooms; the predictor was entered into the model group-mean centered.
* p < .05, ** p < .01, *** p < .001.
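
The 30% figure reported above is consistent with a common summary of explained variance in two-level models, the proportional reduction in the level-1 (within-classroom) residual variance when the predictor is added (Raudenbush & Bryk, 2002). As one illustrative check against the Table 11 estimates:

(169.57 - 118.64) / 169.57 ≈ .30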

The next two models examined the association between the WaKIDS assessment language domain scale score and performance on the OWLS-II and the PPVT-4 (see Tables 12-13). The model estimate of the intercept showed that the mean estimate of the OWLS-II scores was 93.16 points (SE = 1.72), which is significantly greater than zero, t(52) = 54.25, p < .001. The WaKIDS assessment language domain scale score was a positive predictor of OWLS-II scores; specifically, there is an estimated increase of .12 points on OWLS-II scores for students who were one unit higher on the WaKIDS language domain. The between-classroom variance is significantly different from zero; that is, the intercept varies significantly across classrooms. These results indicate that scores on the WaKIDS language domain were significantly related to OWLS-II performance. WaKIDS language domain performance explains 26% of the variance in OWLS-II scores.


Table 12. Two-level model of OWLS-II

Unconditional model
  Fixed effect:   Intercept: Coeff = 92.49, SE = 1.78, t = 52.07, df = 53, p < .001 ***
  Random effects: Intercept: Var = 101.21, χ² = 128.20, df = 53, p < .001 ***;  Level 1: Var = 204.91

Full model
  Fixed effects:  Intercept: Coeff = 93.16, SE = 1.72, t = 54.25, df = 52, p < .001 ***
                  TS-GOLD Language: Coeff = 0.12, SE = 0.01, t = 5.57, df = 150, p < .001 ***
  Random effects: Intercept: Var = 103.94, χ² = 150.88, df = 52, p < .001 ***;  Level 1: Var = 150.83

Note. N = 163 students from 55 classrooms; the predictor was entered into the model group-mean centered.
* p < .05, ** p < .01, *** p < .001.

The model estimate of the intercept showed that the mean estimate of the PPVT-4 scores was 101.59 points (SE = 1.76), which is significantly greater than zero, t(53) = 57.70, p < .001. The WaKIDS assessment language domain scale score was a positive predictor of PPVT-4 scores; specifically, there is an estimated increase of .13 points on PPVT-4 scores for students who were one unit higher on the WaKIDS language domain. The between-classroom variance is significantly different from zero; that is, the intercept varies significantly across classrooms. These results indicate that scores on the WaKIDS language domain were significantly related to PPVT-4 performance. WaKIDS language domain performance explains 26% of the variance in PPVT-4 scores.

Table 13. Two-level model of PPVT-4

Unconditional model
  Fixed effect:   Intercept: Coeff = 100.95, SE = 1.72, t = 58.54, df = 54, p < .001 ***
  Random effects: Intercept: Var = 60.80, χ² = 87.56, df = 54, p < .01 **;  Level 1: Var = 300.78

Full model
  Fixed effects:  Intercept: Coeff = 101.59, SE = 1.76, t = 57.71, df = 53, p < .001 ***
                  TS-GOLD Language: Coeff = 0.13, SE = 0.02, t = 5.74, df = 154, p < .001 ***
  Random effects: Intercept: Var = 88.81, χ² = 114.37, df = 53, p < .001 ***;  Level 1: Var = 225.24

Note. N = 163 students from 56 classrooms; the predictor was entered into the model group-mean centered.
* p < .05, ** p < .01, *** p < .001.


The final two models tested the relationship between the WaKIDS assessment literacy domain scale score and performance on the TERA-3 and TOPA-2+ (see Tables 14-15). The model estimate of the intercept showed that the mean estimate of the TERA-3 scores was 89.75 points (SE = 1.63), which is significantly greater than zero, t(51) = 55.04, p < .001. The WaKIDS assessment literacy domain scale score was a positive predictor of TERA-3 scores; specifically, there is an estimated increase of .15 points on TERA-3 scores for students who were one unit higher on the WaKIDS literacy domain. The between-classroom variance is significantly different from zero; that is, the intercept varies significantly across classrooms. These results indicate that scores on the WaKIDS literacy domain were significantly related to TERA-3 performance. WaKIDS literacy domain performance explains 34% of the variance in TERA-3 scores.

Table 14. Two-level model of TERA-3

Unconditional model
  Fixed effect:   Intercept: Coeff = 89.24, SE = 1.62, t = 55.08, df = 52, p < .001 ***
  Random effects: Intercept: Var = 82.62, χ² = 126.20, df = 52, p < .001 ***;  Level 1: Var = 168.30

Full model
  Fixed effects:  Intercept: Coeff = 89.75, SE = 1.63, t = 55.04, df = 51, p < .001 ***
                  TS-GOLD Literacy: Coeff = 0.15, SE = 0.02, t = 6.97, df = 148, p < .001 ***
  Random effects: Intercept: Var = 100.37, χ² = 180.63, df = 51, p < .001 ***;  Level 1: Var = 117.86

Note. N = 161 students from 55 classrooms; the predictor was entered into the model group-mean centered.
* p < .05, ** p < .01, *** p < .001.

The model estimate of the intercept showed that the mean estimate of the TOPA-2+ scores was 23.14 points (SE = .73), which is significantly greater than zero, t(52) = 31.74, p < .001. The WaKIDS assessment literacy domain scale score was a positive predictor of TOPA-2+ scores; specifically, there is an estimated increase of .09 points on TOPA-2+ scores for students who were one unit higher on the WaKIDS literacy domain. The between-classroom variance is significantly different from zero; that is, the intercept varies significantly across classrooms. These results indicate that scores on the WaKIDS literacy domain were significantly related to TOPA-2+ performance. WaKIDS literacy domain performance explains 51% of the variance in TOPA-2+ scores.


Table 15. Two-level model of TOPA-2+

Unconditional model
  Fixed effect:   Intercept: Coeff = 23.07, SE = 0.72, t = 31.83, df = 53, p < .001 ***
  Random effects: Intercept: Var = 11.26, χ² = 87.05, df = 53, p < .01 **;  Level 1: Var = 51.06

Full model
  Fixed effects:  Intercept: Coeff = 23.14, SE = 0.73, t = 31.74, df = 52, p < .001 ***
                  TS-GOLD Literacy: Coeff = 0.09, SE = 0.01, t = 9.17, df = 153, p < .001 ***
  Random effects: Intercept: Var = 19.66, χ² = 170.06, df = 52, p < .001 ***;  Level 1: Var = 25.13

Note. N = 163 students from 55 classrooms; the predictor was entered into the model group-mean centered.
* p < .05, ** p < .01, *** p < .001.

A final analysis was conducted for each of the five models, adding each teacher’s overall agreement with the master code as a level-2 predictor. This teacher-level predictor was not statistically significant in any of the models (p > .05).
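
In the same illustrative statsmodels framework sketched earlier, such a classroom-level covariate would simply be added to the fixed-effects formula alongside the student-level predictor; the teacher_agreement column name is, again, an assumption for illustration.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_scores.csv")  # hypothetical student-level file

# Group-mean center the student-level WaKIDS predictor, as in the earlier sketch.
df["wakids_math_gmc"] = (df["wakids_math"]
                         - df.groupby("classroom_id")["wakids_math"].transform("mean"))

# Hypothetical level-2 covariate: each teacher's overall agreement with the
# master code, repeated on every row for students in that teacher's classroom.
fit_l2 = smf.mixedlm("wj_applied ~ wakids_math_gmc + teacher_agreement",
                     data=df.dropna(subset=["wj_applied", "wakids_math_gmc",
                                            "teacher_agreement"]),
                     groups="classroom_id").fit()
print(fit_l2.summary())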

Summary and Limitations

The validity portion of this study examined the concurrent validity of the WaKIDS assessment, a unique adaptation of Teaching Strategies GOLD, by computing zero-order correlations between WaKIDS scores and the individually administered assessments. Student scores on the WaKIDS assessment were found to be positively related to scores on the direct measures of the same domain. Low correlations were obtained for the social emotional and physical domains, and moderate correlations were obtained for the language, math, and literacy domains, with the TOPA-2+ and WJ Applied Problems having the highest correlations with their corresponding domains.

An HLM analysis was also conducted to account for the nested nature of the study design and to confirm the relationships between the WaKIDS assessment domains and standardized measures of similar constructs revealed in the zero-order correlations. The dependencies among students in the same classrooms are not taken into account in the simple correlations; however, even when those dependencies are taken into account in the multilevel analyses, statistically significant relationships remained between children’s WaKIDS ratings and their performance on standardized measures of mathematics, language, and literacy. The WaKIDS math domain accounted for 30% of the variance in WJ Applied Problems scores. The language domain explained 26% of the variance in OWLS-II scores and 26% of the variance in PPVT-4 scores.


WaKIDS literacy domain performance explained 34% of the variance in TERA-3 scores and 51% of the variance in TOPA-2+ scores. Typically, shared variance of 50% or more is the target in such multilevel models, and that target was reached only with the TOPA-2+. Still, WaKIDS scores accounted for non-trivial levels of the variance in scores on the other standardized measures.

It is important to qualify these results with information about possible limitations inherent in the research design used for this study. While every effort was made to select statistical methods that might account for these uncertainties, certain variables were outside of the research team’s control.

First, we must recognize the possible threat to internal validity that may have resulted from the differences in testing conditions between the GOLD assessment and the individually administered assessments. In other words, we should expect some degree of disagreement between scores simply because the GOLD was administered by students’ teachers in the context of everyday classroom activities, while the direct assessments were administered individually, outside of the classroom, by researchers unfamiliar to the students. This threat to validity is inherent in many evaluative designs addressing the validity of observational, curriculum-based assessments (Meisels, Dorfman, & Steele, 1995; Waterman et al., 2012) and must be kept in mind when interpreting the results of these sorts of investigations.

Additionally, it should be noted that the assessment instruments used to validate the GOLD assessment vary in their relatedness to the corresponding learning domains, as defined by GOLD. For example, each literacy item on the GOLD assessment was specifically addressed by researchers using items from the TERA or TOPA assessments. Conversely, only two GOLD items from the cognitive domain were directly addressed by researchers. This is because there are no readily available, individually administered standardized assessments that address more abstract learning goals such as “Shows motivation and curiosity.”

Conclusions

Overall, the reliability analysis revealed variability in teacher ratings in relation to the master code, although less so for the two portfolios that featured typically developing students. Despite such variability, there was considerable reliability in where students were rated in relation to the developmental cut points of the WaKIDS assessment scales, with the notable exception of the portfolio featuring the student with lower skill levels. This second finding may be reassuring, given that a stated purpose of the WaKIDS assessment is to identify students needing additional instructional attention upon entrance into kindergarten. Variability in ratings, however, suggests caution in using the data for other purposes (Waterman et al., 2012). In the present study, the validity results need to be considered in light of the reliability results, since teachers’ WaKIDS scores were used to predict the standardized measures.


Correlations between assessments may have been affected by variations in WaKIDS scoring across teachers. Such concerns are partially addressed by the HLM analyses, which took teacher-level variation into account and still documented that WaKIDS scores accounted for non-trivial amounts of the variance in standardized measures. These analyses suggest that, at least for the domains of math, language, and literacy, the WaKIDS assessments tap the intended constructs.

Despite a recent research report (Lambert, Kim, & Burts, 2013) that concluded the Teaching Strategies GOLD assessment was equally valid for students with and without disabilities and for students whose home language is not English, the present results paint a somewhat different picture of the WaKIDS assessment, especially for students with lower skill levels. It may be that Washington teachers would benefit from additional training and support in using the WaKIDS assessment to rate children who struggle and, to a lesser extent, children who have home languages other than English, as well as training with the specific items identified here as low in reliability.

Finally, we have several recommendations related to the continued implementation of WaKIDS: (1) require participating kindergarten teachers to complete both training and the Teaching Strategies Interrater Reliability Certification before using the assessment; (2) provide follow-up (refresher) training and establish a system and schedule of periodic reliability checks in order to maintain accuracy of results over time; (3) provide additional training in using the WaKIDS assessment with children with special needs and children who are English Language Learners; and (4) qualify interpretation of WaKIDS results with the caveat that some variance in the data can be attributed to teacher and classroom contexts.


References

Carrow-Woolfolk, E. (2011). Oral and written language scales (2nd ed.). Western Psychological Services.

Dunn, L. M., & Dunn, L. M. (2007). Technical Manual. Peabody Picture Vocabulary Test (4th ed.). Bloomington, MN: Pearson.

Gresham, F. M., & Elliott, S. N. (2008). Social Skills Improvement System Rating Scales. Bloomington, MN: Pearson Assessments.

Joseph, G.E., Cevasco, M., McGrew, K.S., Schrank, F.A., & Woodcock, R.W. (2007). Technical Manual. Woodcock-Johnson III Normative Update. Rolling Meadows, IL: Riverside Publishing.

Lambert, R. G., Kim, D. H., & Burts, D. C. (2013). Using teacher ratings to track the growth and development of young children using the Teaching Strategies GOLD® assessment system. Journal of Psychoeducational Assessment.

McGrew, K.S., Schrank, F.A., & Woodcock, R.W. (2007). Technical Manual. Woodcock-Johnson III Normative Update. Rolling Meadows, IL: Riverside Publishing.

Meisels, S. J., Dorfman, A., & Steele, D. (1995). Equity and excellence in group-administered and performance-based assessments. In Equity and excellence in educational testing and assessment (pp. 243-261). Springer Netherlands.

Meisels, S.J., Marsden, D.B., Wiske, M.S., & Henderson, L.W. (2003). Examiner’s Manual. Early Screening Inventory Revised. Bloomington, MN: Pearson Assessments.

Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching. Upper Saddle River, NJ: Merrill/Pearson.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). SAGE Publications, Incorporated.

Reid, K. D., Hresko, W. P., & Hammill, D. D. (2001). Test of Early Reading Ability–Third Edition (TERA-3). Austin, TX: Pro-Ed.

Smiley, P.A., & Dweck, C.S. (1994). Individual differences in achievement goals among young children. Child Development, 65, 1723-1743.

Teaching Strategies. (2010a). Research foundation: Teaching strategies GOLD assessment system. Retrieved April 1, 2012, from: http://www.toddlertownevanston.com/pdfs/Explanation_on_Teaching_Stratigies_GOLD.pdf


Teaching Strategies. (2010b). Teaching strategies GOLD online. Retrieved from: http://www.teachingstrategies.com/page/GOLD-assessment-online.cfm

Teaching Strategies. (2011). GOLD assessment system: Technical summary. Retrieved from: http://www.teachingstrategies.com/content/pageDocs/GOLD-Tech-Summary-8-18-2011.pdf

Teaching Strategies. (2013). A kindergarten readiness model. (Unpublished report).

Torgesen, J. K., & Bryant, B. R. (2004). Test of Phonological Awareness–Second Edition: PLUS (TOPA-2+). Austin, TX: Pro-Ed.

Waterman, C., McDermott, P. A., Fantuzzo, J. W., & Gadsden, V. L. (2012). The matter of assessor variance in early childhood education—Or whose score is it anyway?. Early Childhood Research Quarterly, 27(1), 46-54.

Wenner, G. (1995). Kindergarten screens as tools for the early identification of children at risk for remediation or grade retention. Psychology in the Schools, 32(4), 249-254.


Appendices

Appendix A – GOLD Objectives and Dimensions (WaKIDS)

Social–Emotional
1. Regulates own emotions and behaviors
   b. Follows limits and expectations
   c. Takes care of own needs appropriately
2. Establishes and sustains positive relationships
   c. Interacts with peers
   d. Makes friends

Physical
4. Demonstrates traveling skills
5. Demonstrates balancing skills
6. Demonstrates gross-motor manipulative skills
7. Demonstrates fine-motor strength and coordination
   a. Uses fingers and hands
   b. Uses writing and drawing tools

Language
9. Uses language to express thoughts and needs
   a. Uses an expanding expressive vocabulary
   b. Speaks clearly
   c. Uses conventional grammar
   d. Tells about another time or place
10. Uses appropriate conversational and other communication skills
   a. Engages in conversations
   b. Uses social rules of language

Cognitive
11. Demonstrates positive approaches to learning
   c. Solves problems
   d. Shows curiosity and motivation
   e. Shows flexibility and inventiveness in thinking
12. Remembers and connects experiences
   a. Recognizes and recalls
13. Uses classification skills


Literacy
15. Demonstrates phonological awareness
   a. Notices and discriminates rhyme
   b. Notices and discriminates alliteration
   c. Notices and discriminates smaller and smaller units of sound
16. Demonstrates knowledge of the alphabet
   a. Identifies and names letters
   b. Uses letter–sound knowledge
17. Demonstrates knowledge of print and its uses
   b. Uses print concepts
18. Comprehends and responds to books and other texts
   a. Interacts during read-alouds and book conversations
   b. Uses emergent reading skills
   c. Retells stories
19. Demonstrates emergent writing skills
   a. Writes name
   b. Writes to convey meaning

Mathematics
20. Uses number concepts and operations
   a. Counts
   b. Quantifies
   c. Connects numerals with their quantities
22. Compares and measures
23. Demonstrates knowledge of patterns


Appendix B – Scoring Guide

 
