

SKILLS, PRODUCTIVITY AND THE EVALUATION OF TEACHER PERFORMANCE*

by

Douglas N. Harris, Department of Economics, Tulane University, [email protected]
Tim R. Sass, Department of Economics, Georgia State University, [email protected]

March 4, 2014

Abstract

We examine the relationships between observational ratings of teacher performance, principals’ evaluations of teachers’ cognitive and non-cognitive skills and test-score based measures of teachers’ productivity. We find that principals can distinguish between high and low performing teachers, but the overall correlation between principal ratings of teachers and teachers’ value-added contribution to student achievement is modest. The variation across metrics occurs in part because they are capturing different traits. While past teacher value-added predicts future value-added, principals’ subjective ratings can provide additional information, particularly when prior value-added measures are based on a single year of teacher performance.

*This study is funded under grant R305M040121 from the U.S. Department of Education. We wish to thank Stacey Rutledge, William Ingle, Peter Hinrichs and participants in the NBER summer education workshop for their comments. We are also grateful to Brian Jacob for sharing his computer code to determine conditional probabilities of teacher performance. Abhir Kulkarni, Julia Manzella, John Gibson, William Ingle, Micah Sanders, Cynthia Thompson and Lisa Verdon provided valuable research assistance. Previous versions of the paper circulated under the title “What Makes for a Good Teacher and Who Can Tell?” All remaining errors are the responsibility of the authors.


I. Introduction

Research consistently finds that teacher productivity is the most important component of a school’s effect on student learning and that there is considerable heterogeneity in teacher productivity within and across schools.1 The paramount role of teachers has led policymakers to focus on personnel policies governing selection, retention and compensation of teachers as a mechanism for enhancing educational quality.

At the heart of all teacher personnel policy decisions is the issue of how to evaluate teacher performance. Traditionally, teacher hiring, retention and salary decisions have been based on teacher credentials such as certification status, educational attainment and experience. However, except for the first few years of experience, research has failed to find a strong and consistent link between these measures and student outcomes.2 Spurred on by the federal Teacher Incentive Fund (TIF) and Race to the Top (RTTT) initiatives, many states and districts are beginning to incorporate both observations of teacher behavior and measures of student achievement in teacher evaluations.3 They are also de-emphasizing, or in some cases eliminating, the use of traditional measures like attainment of a master’s degree and seniority in retention and compensation systems.

Despite the recent policy shift, little is known about the relative merits of observational measures and “value-added” ratings of teachers based on student test scores. Recent work by Chetty, Friedman and Rockoff (2012) finds that students taught by high value-added teachers are more likely to have desirable long-run outcomes, including greater educational attainment, higher earnings and a reduced probability of teenage pregnancy. While no extant research links observational measures of teacher performance to student long-term outcomes, there is mounting evidence that observational measures of teacher quality are not strongly correlated with teacher value-added (Jacob and Lefgren (2008), Rockoff, et al. (2010, 2012), Mihaly, et al. (2013)).4 Thus observational measures are not simply duplicative of value-added metrics and the divergence between the two suggests that observational measures could be capturing a different set of teacher skills, which could influence long-term student outcomes in ways that are not captured by value-added.5

1 See, for example, Rockoff (2004), Hanushek, et al. (2005), Rivkin, Hanushek and Kain (2005), Kane, Rockoff and Staiger (2008), Aaronson, Barrow and Sander (2007).
2 See Rockoff (2004), Hanushek, et al. (2005), Jepsen (2005), Rivkin, Hanushek and Kain (2005), Boyd, et al. (2006), Clotfelter, Ladd and Vigdor (2006, 2007, 2010), Kane, Rockoff and Staiger (2008) and Harris and Sass (2011). Harris and Sass (2011) find returns to experience beyond the first few years, particularly in middle school, and Wiswall (forthcoming) uncovers high returns to later career experience at the elementary level.
3 A summary of the evaluation systems proposed by RTTT grantees is provided in Appendix Table A1.

Intertwined with the issue of how best to evaluate teacher performance is the relationship between teacher skills and teacher productivity. Recent work in labor economics suggests that both cognitive ability and non-cognitive personality traits, such as conscientiousness, play an important role in determining worker productivity (Borghans, ter Weel, and Weinberg (2008), Cunha, Heckman, Lochner, and Masterov (2006); Heckman, Stixrud, and Urzua (2006)). Borghans, ter Weel, and Weinberg (2008) theorize that different types of jobs require different combinations of personality traits and provide evidence that some of these traits are correlated with productivity. They find that “caring” is more important in teaching than in any other occupation, except nursing.

4 The relationship between subjective ratings and objective performance measures is also relatively weak in other occupations (Bommer, et al. (1995), Heneman (1986)).
5 It is also possible that the two metrics measure the same underlying traits, but diverge because one or both are biased. Indeed, value-added measures are frequently criticized for potential selection bias due to non-random assignment of students to teachers (Rothstein (2010)). However, observational ratings of teachers could be subject to the same sort of bias if unobserved student characteristics affect the perceived performance of teachers. For example, if students with behavioral problems are more likely to be assigned to inexperienced teachers, raters could incorrectly perceive that less experienced teachers have poorer classroom management skills. In addition, observational ratings of workers could be subject to biases of observers (Varma and Stroh (2001)).


Personality traits are difficult to measure objectively (Borghans, et al. (2008)) and perhaps are more easily captured through direct observation. Thus, if teacher observational rubrics measure non-cognitive traits that are not captured by teachers’ contributions to student test scores, observational measures of teacher performance could serve as a valuable complement to value-added when evaluating teachers.

If observational measures are to be used in teacher evaluation systems, there is also the issue of who can provide the most cost-effective evaluation of teacher performance. While some districts, like Washington DC, are utilizing trained observers to evaluate teachers, the vast majority of teacher evaluation systems being implemented across the country rely on the observations of principals to assess teacher performance. Although they may lack specific training in observational evaluation, principals may be lower-cost evaluators. Principals are typically required to observe teachers as part of their job and they collect a lot of information informally, and inexpensively, in the natural course of being in the school, interacting with teachers and talking to parents. Principals may also define performance somewhat differently to include contributions to output made through group interaction, e.g., mentoring of other teachers (Harris, Ingle and Rutledge (2013)). Despite the widespread use and possible cost advantages of using principals to conduct observational evaluations of teachers, there is currently little evidence on whether some principals are better than others at evaluating teachers and whether the ability to evaluate teacher performance varies across different types of teachers or school environments.

In this paper we seek to enhance understanding of the relative merits of observational and value-added measures of teacher performance and shed light on the role that cognitive and non-cognitive skills play in determining teacher productivity. Specifically, we employ data on principals’ evaluations of their teachers to address the following five questions:

1) How well do principal evaluations correlate with value-added measures of teacher productivity?

2) How does the ability of principals to measure teacher performance vary with the characteristics of principals and teachers?

3) What teacher traits are associated with their ability to promote student achievement?

4) Beyond the ability to raise achievement in the short-run, what traits do principals consider when evaluating teachers?

5) How well do principal evaluations and prior measures of teacher value-added predict future teacher productivity?

In the next section we describe the small existing literature on subjective evaluations of teachers and their relationship with value-added. This is followed by a discussion of the data used for our analysis, including how the interviews with principals were conducted and our method for estimating teacher value-added. In the concluding section we discuss our empirical results and possible policy implications.

II. Literature Review

There is a limited literature that specifically addresses the relationship between subjective and objective assessments of school teachers. Three older studies have examined the relationship between student test scores and principals’ subjective assessments using longitudinal student achievement data to measure student learning growth (Murnane (1975), Armor, et al. (1976), and Medley and Coker (1987)). However, as noted by Jacob and Lefgren (2008), these studies do not account for measurement error in the objective test-based measure and therefore understate the relationship between subjective and objective measures.
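The attenuation problem can be illustrated with a short simulation. The snippet below is a minimal sketch with made-up data, not any study's actual correction: noise in a test-based measure pulls its correlation with a subjective rating below the true value, and dividing by the square root of the measure's reliability recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# True teacher productivity and a subjective rating that tracks it imperfectly.
true_va = rng.normal(size=n)
rating = 0.5 * true_va + rng.normal(scale=np.sqrt(0.75), size=n)

# Test-based estimate = true productivity + estimation noise.
noise_var = 1.0
measured_va = true_va + rng.normal(scale=np.sqrt(noise_var), size=n)

raw_corr = np.corrcoef(rating, measured_va)[0, 1]

# Reliability of the noisy measure: share of its variance that is signal.
reliability = 1.0 / (1.0 + noise_var)

# Classical errors-in-variables correction: divide by sqrt(reliability).
adjusted_corr = raw_corr / np.sqrt(reliability)

print(raw_corr, adjusted_corr)
```

Here the reliability is known by construction; in applied work it must be estimated, for example from the sampling variance of the teacher-effect estimates.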

In their work, Jacob and Lefgren address both the selection bias and measurement error problems within the context of a value-added model for measuring teacher productivity that is linked to principals’ subjective assessments. They obtain student achievement data and combine it with data on principals’ ratings of 201 teachers in a mid-sized school district in a Western state.6 Jacob and Lefgren find that previous teacher value-added is a better predictor of current student achievement than are current principal ratings. In particular, teacher value-added calculated from test scores in 1998-2002 was a significantly better predictor of 2003 test scores (conditional on student and peer characteristics) than were 2003 principal ratings made just prior to the 2003 student exam. The current principal ratings were also significantly correlated with current test scores, conditional on prior value-added. While this latter finding suggests contemporaneous principal ratings add information, the reason is not clear. Since past value-added is subject to transient shocks to student test scores, the principal ratings might provide more precise indicators of previous teacher productivity (especially when there is little prior test score information, as is often the case). Alternatively, the principal ratings may simply reflect new current-school-year (2002/03) performance information not included in past value-added (based on test scores through 2001/02). In order to sort out these effects, in our analysis we compare the ability of current value-added and current principal ratings to predict future teacher value-added.
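The design just described can be sketched with simulated data. Everything below is hypothetical and illustrative, not the paper's estimates: future value-added is regressed on prior value-added alone, on the principal rating alone, and on both together, and the fit of each specification is compared.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Persistent teacher quality plus independent noise in each measure (made-up).
quality = rng.normal(size=n)
prior_va = quality + rng.normal(scale=0.8, size=n)   # noisy past estimate
rating = quality + rng.normal(scale=0.8, size=n)     # noisy principal rating
future_va = quality + rng.normal(scale=0.8, size=n)  # outcome to predict

def r2(predictors, y):
    """R-squared from OLS of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(r2([prior_va], future_va))          # past value-added alone
print(r2([rating], future_va))            # principal rating alone
print(r2([prior_va, rating], future_va))  # both together
```

When both measures carry independent signal about the same underlying quality, the combined regression fits better than either predictor alone, which is the pattern the comparison is designed to detect.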

The only prior study to consider principals’ assessments of specific teacher characteristics, as opposed to the overall rating, is an unpublished working paper by Jacob and Lefgren (2005). While they find a positive and significant relationship between teacher value-added and teachers’ relationship with the school administration, this is the only teacher characteristic they consider.7

6 As in the present study, the district studied by Jacob and Lefgren chose to remain anonymous.
7 In a recent working paper, Bastian (2013) analyzes the relationship between evaluations of applicants to the highly selective “Teach for America” program and their subsequent impact on student test scores when they become teachers. He finds that both cognitive skills (as measured by prior academic achievement) and organizational and motivational abilities are associated with higher teacher value-added.


Rockoff, et al. (2010, 2012) study an experiment in which elementary and middle school principals in New York City were randomly assigned to receive teacher value-added information. They found that principals change their evaluations of teachers when they receive new information about the impact of teachers on student test scores. The extent of updating is positively related to the precision of value-added information they receive and negatively related to the quality of their own prior information on teachers. The acquisition of new information also appears to have significant effects on personnel decisions and student outcomes. Rockoff, et al. find that teachers with low value-added scores were more likely to exit their schools after the principal received value-added information, which in turn led to a small increase in student test scores. While not the focus of their analysis, Rockoff, et al. also estimate pre-experiment correlations between various value-added measures and principals’ evaluations of their teachers. They find positive correlations, similar in magnitude to those obtained by Jacob and Lefgren. The correlations tend to increase with the precision of the value-added estimates and with the number of years the principal has known a teacher.

Most recently, Mihaly, et al. (2013) report results from the Measures of Effective Teaching (MET) project, sponsored by the Gates Foundation. The project involved six large school districts from throughout the U.S. and measured teacher productivity in a variety of ways, including through student evaluations, observations of classroom practice by trained evaluators, and student performance on two different achievement tests.8 The MET project does not include principal evaluations, however.9

Mihaly, et al. compare the ability of teacher observations, student surveys and value-added in one classroom to predict teacher performance in another course section taught by the same teacher.10 For elementary school teachers they compare self-contained classrooms taught by the same teacher in two consecutive years and for middle school teachers they compare separate sections of the same subject in the same year. They find that the best predictor of performance is generally the same metric from the comparison classroom setting. For example, the best predictor of value-added on state assessments from one section is value-added from another section, the best predictor of ratings on a particular observational rubric is ratings from the same observational measure in another classroom, and so forth. Correspondingly, the optimal combination of teacher quality measures is heavily weighted toward the indicator used to define the outcome. Combining measures, such as value-added and observational ratings, to predict value-added in the future or in a different classroom has relatively small advantages over using value-added alone. Similarly, if the goal is to predict a teacher’s score on an observational scale, combining the observational rating from another classroom with value-added yields only a modest improvement over using the observational rating in isolation.

In addition to replicating some analyses of prior studies, we build on extant research in four ways. First, unlike the MET study, we focus on principal evaluations, which may be a more cost-effective alternative to other observation-based methods. Second, unlike prior work, which is based on a limited grade range (grades 2-6 or 4-8), our analysis spans grades 2-10 and includes elementary, middle and high schools. Third, we investigate why principal ratings differ from value-added measures by employing information on teachers’ personality traits and skills. Fourth, in addition to comparing contemporaneous measures of teacher performance, we test how well prior value-added scores and prior principal evaluations of teachers predict future teacher value-added.

8 A number of other studies have examined the relationship between the achievement levels of teachers’ students and subjective teacher ratings that are based on formal standards and extensive classroom observation (Gallagher (2004), Kimball, et al. (2004), Milanowski (2004)). For example, in Milanowski (2004), the subjective evaluations are based on an extensive standards-framework that required principals and assistant principals to observe each teacher six times in total and, in each case, to rate the teacher on 22 separate dimensions. All of these studies find a positive and significant relationship, despite differences in the way they measure teacher value-added and in the degree to which the observations are used for high-stakes personnel decisions. While these studies have the advantage of more structured subjective evaluations, the reliance on achievement levels with no controls for lagged achievement or prior educational inputs makes it difficult to estimate teacher value-added.
9 Ho and Kane (2013) report on a supplemental study that involved 67 teachers in one of the MET study districts, Hillsborough County. Videos of their lessons were scored by both 76 peer raters and 53 school administrators (principals and assistant principals). Ho and Kane find that administrators tended to differentiate the quality of teachers more than peer raters. Administrators tended to score their own teachers higher than teachers from other schools, but their ratings of teachers were highly correlated with the scores assigned to their teachers by administrators from other schools.
10 They also evaluate the relative performance of the measures to “predict” past performance by the same teacher.

III. Data and Methods

We begin by describing the general characteristics of the school district and sample of principals, teachers and students. We then discuss in more detail the two main components of the data: (a) administrative data that are used to estimate teacher value-added; and (b) principal interview data that provide information about principals’ overall assessments of teachers as well as ratings of specific teacher characteristics.

A. General Sample Description

The analysis is based on interviews with 30 principals from an anonymous mid-sized Florida school district. The district includes a heterogeneous population of students. For example, among the sampled schools, the school-average proportion of students eligible for free/reduced price lunches varies from less than 10 percent to more than 90 percent. Similarly, there is considerable heterogeneity among schools in the racial/ethnic distribution of their students. We interviewed principals from 17 elementary (or K-8) schools, six middle schools, four high schools, and three special population schools, representing more than half of the principals in the district. The racial distribution of interviewed principals is comparable to the national average of all principals (sample district: 78 percent White; national: 82 percent White), as is the percentage with at least a master’s degree (sample district: 100 percent; national: 90.7 percent).11 However, the percentage female is somewhat larger (sample district: 63 percent; national: 44 percent).

The advantage of studying a school district in Florida is that the state has a long tradition of strong test-based accountability (Harris, Herrington and Albee, 2007) that has now come to pass in other states as a result of the federal No Child Left Behind policy. As part of its accountability system, the state has long graded schools on an A-F scale. The number of schools receiving the highest grade has risen over time; in our sample 20 schools received the highest grade (A) during the 2005-06 school year; the lowest performing school in the district received a grade of D. It is reasonable to expect that accountability policies, such as the assignment of school grades, influence the objectives that principals see for their schools and therefore their subjective evaluations of teachers. For example, we might expect a closer relationship between value-added and subjective assessments in high accountability contexts where principals are not only more aware of test scores in general, but where principals are increasingly likely to know the test scores, and perhaps test score gains, made by students of individual teachers. We discuss the potential influence of this phenomenon later in the analysis, but emphasize here that, by studying a Florida school district, the results of our analysis are more applicable to the current policy environment where high-stakes achievement-focused accountability is federal policy.

11 The national data on principals comes from the 2003-2004 Schools and Staffing Survey (SASS) as reported in the Digest of Education Statistics (National Center for Education Statistics, 2006). Part of the reason that this sample of principals has higher levels of educational attainment is that Florida law makes it difficult to become a principal without a master’s degree.


B. Student Achievement Data and Modeling

Throughout Florida there is annual testing in grades 3-10 for both math and reading. Until recently, two tests were administered: a high-stakes, criterion-referenced exam based on the state curriculum standards, known as the FCAT-Sunshine State Standards (SSS) exam, and a low-stakes, norm-referenced test (NRT), which is the Stanford Achievement Test. We mainly employ the low-stakes NRT in the present analysis for two reasons.12 First, it is a vertically scaled test, meaning that unit changes in the achievement score should have the same meaning at all points along the scale. Second, and most importantly, the district under study also administers the NRT in grades 1 and 2, allowing us to compute achievement gains for students in grades 2-10. Achievement data on the NRT are available for each of the school years 1999/00 through 2007/08.13 The SSS exam was instituted a year later and thus scores on the high-stakes test are only available for the 2000/01-2007/08 school years. Using the low-stakes test we are able to estimate the determinants of achievement gains for five years prior to the principal interviews, 2000/01-2005/06, and for two years after the interviews, 2006/07-2007/08. In order to account for any differences in test content and scaling across grades and over time, we normalize test scores by grade and year. Characteristics of the sample used in the value-added analysis are described in Table 1.

In order to compute value-added scores for teachers we estimate a model of student achievement, A_it, of the following form:

12 In Appendix Table A5 we present correlations between value-added and principal evaluations for the sample of teachers covered by both the FCAT-SSS and FCAT-NRT Stanford Achievement exams. While the correlations between the principal’s evaluation of a teacher’s ability to raise achievement and the teacher’s value added tend to be somewhat higher, the differences are not large. Using a common sample, the adjusted correlation in math is 0.29 for the FCAT-SSS and 0.25 for the FCAT-NRT. For reading, the adjusted correlations are 0.47 for the FCAT-SSS and 0.38 for the FCAT-NRT.
13 Prior to 2004/05 version 9 of the Stanford Achievement Test (SAT-9) was administered. Beginning in 2004/05 the SAT-10 was given. All SAT-10 scores have been converted to SAT-9 equivalent scores based on the conversion tables in Harcourt (2002).

    A_it = λA_it-1 + X_it β1 + P_-ijmt β2 + τ_k + δ_m + γ_ig + φ_gt + ε_it        (1)

The effects of prior educational inputs are captured by the lagged test score, A_it-1, and are assumed to diminish geometrically over time at a rate (1-λ). The vector X_it includes time-varying student characteristics such as student mobility, free/reduced-price lunch eligibility and limited English proficiency status as well as time-constant student attributes like race/ethnicity and gender. The vector of peer characteristics, P_-ijmt (where the subscript -i denotes students other than individual i in the classroom), includes both exogenous peer characteristics and the number of peers or class size. In addition, a teacher fixed effect (τ_k), a school fixed effect (δ_m) and sets of grade-repeater-by-grade (γ_ig) and grade-by-year indicators (φ_gt) are also included. The teacher fixed effect captures both the time-invariant characteristics of teachers as well as the average value of time-varying characteristics like experience and possession of an advanced degree. Since school fixed effects are included, the estimated teacher effects represent the “value-added” of an individual teacher relative to the average teacher at the school. The final term, ε_it, is a mean zero random error.
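A stripped-down version of this estimation can be sketched as follows. This is a minimal illustration on simulated data, not the authors' code: teacher effects are recovered as coefficients on teacher dummies in an OLS regression of the (normalized) score on the lagged score, and the school effects, peer controls, and grade-repeater and grade-by-year indicators of the full model are omitted for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical panel: 40 teachers with 25 students each (made-up data).
n_teachers, per_teacher = 40, 25
true_effect = rng.normal(scale=0.2, size=n_teachers)
teacher = np.repeat(np.arange(n_teachers), per_teacher)
lag_score = rng.normal(size=teacher.size)
score = (0.7 * lag_score + true_effect[teacher]
         + rng.normal(scale=0.5, size=teacher.size))

df = pd.DataFrame({"teacher": teacher, "lag_score": lag_score, "score": score})

# In real data, scores are normalized within each grade-year cell; with a
# single simulated cell we simply standardize once.
df["score_std"] = (df["score"] - df["score"].mean()) / df["score"].std()

# OLS with the lagged score and one dummy per teacher (no intercept, so each
# teacher gets its own level, identified relative to the overall mean).
X = pd.get_dummies(df["teacher"], dtype=float)
X["lag_score"] = df["lag_score"]
beta, *_ = np.linalg.lstsq(X.to_numpy(), df["score_std"].to_numpy(), rcond=None)

# The first n_teachers coefficients are the estimated teacher effects.
est_effect = beta[:n_teachers]
print(np.corrcoef(est_effect, true_effect)[0, 1])
```

With modest classroom-level noise the estimated effects track the true ones closely; shrinking class size raises the noise in each teacher's estimate, which is the measurement-error problem discussed below.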

The achievement model depicted in equation (1) is but one of many commonly estimated value-added models. We utilize it as our primary model since recent experimental and simulation-based evidence suggests it is likely to produce relatively unbiased estimates of teacher effects under a range of conditions (Kane and Staiger (2008), Guarino, Reckase and Wooldridge (2011)). However, we have also estimated alternative value-added models, ones that assume complete persistence in prior inputs or control for student heterogeneity with student fixed effects.14 Parameter estimates from our primary model as well as from the alternative models appear in Appendix Table A2. In Appendix Table A3 we show that the relationship between value-added estimates of teacher productivity and principal evaluations of teacher productivity is similar across value-added model specifications.

Recently, Rothstein (2010) has argued that value-added models may produce biased estimates of teacher productivity due to the non-random assignment of students to teachers within schools. For example, if students who experience an unusually high achievement gain in one year are assigned to particular teachers the following year and there is mean reversion in student test scores, the estimated value-added for the teachers with high prior-year gains will be biased downward. Rothstein proposes falsification tests based on the idea that future teachers cannot have causal effects on current achievement gains. We conduct falsification tests of this sort, using the methodology employed by Koedel and Betts (2011). For elementary schools, which account for more than half of our sample, we fail to reject the null of strict exogeneity (p-value equals 0.14 in math and 0.68 in reading). We do, however, reject the null in middle school and in high school.15
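The logic of the falsification test can be sketched as follows. This is a simplified illustration on simulated data, not the Koedel-Betts implementation: current gains are regressed on dummies for next year's teacher, and an F-test checks their joint significance. Since future teachers cannot cause current gains, joint significance signals non-random sorting rather than a causal effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_future_teachers = 600, 20

# Current achievement gains and randomly assigned future teachers (made-up).
gain = rng.normal(size=n)
future_teacher = rng.integers(n_future_teachers, size=n)

# Dummy matrix for future teachers; drop one category against the intercept.
D = np.zeros((n, n_future_teachers))
D[np.arange(n), future_teacher] = 1.0
X = np.column_stack([np.ones(n), D[:, 1:]])

beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
resid = gain - X @ beta

# F-test of the joint hypothesis that all future-teacher coefficients are zero.
rss_full = resid @ resid
rss_null = ((gain - gain.mean()) ** 2).sum()
q, dof = n_future_teachers - 1, n - X.shape[1]
F = ((rss_null - rss_full) / q) / (rss_full / dof)
p = 1 - stats.f.cdf(F, q, dof)
print(F, p)
```

Under random assignment, as simulated here, the dummies should be jointly insignificant in most draws; systematic sorting of students on prior gains would instead push the p-value toward zero.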

As noted by Jacob and Lefgren, another concern is measurement error in the estimated

teacher effects. Given the variability in student test scores, value-added estimates will yield

14 For a thorough discussion of various value-added models and the assumptions that underlie them, see Todd and Wolpin (2003) and Sass, Semykina and Harris (2013). 15 It is possible that the rejection of strict exogeneity in middle and high school could reflect tracking in the upper grades, which could induce a degree of bias. To test this, we added indicators for basic/remedial and advanced/honors courses and re-ran the Rothstein tests. The F-statistics for testing the null of no future teacher “effects” were reduced somewhat, but we still reject the null for both the middle and high school samples. Goldhaber and Chaplin (2012) show that the Rothstein test may reject the null of strict exogeneity even when there is no bias, however. Therefore we also examined the results when adding track effects to the model and found that this had minimal influence on our findings. This is most likely because the effects of tracks are important when individual teachers have a high concentration of courses in lower- or upper-tracks (Harris and Anderson, 2012).

“noisy” measures of teacher productivity, particularly for teachers with relatively few students

(McCaffrey et al. (2009)). We employ four strategies to alleviate the measurement error

problem. First, we limit our sample to teachers who taught at least five students with

achievement gain data. Second, we employ the measurement-error correction procedure adopted

by Jacob and Lefgren when evaluating the strength of correlations between value-added and

subjective evaluations by principals.16 Third, in regression analyses where value-added is the

dependent variable we use a feasible generalized least squares (FGLS) estimation procedure

which accounts for estimation error in the dependent variable.17 Finally, when value-added is

used as an explanatory variable in a regression we employ empirical Bayes estimates of teacher

value-added. The empirical Bayes method “shrinks” teacher effect estimates toward the

population mean, with the degree of shrinkage proportional to the standard error of the teacher

effect estimate (see Morris (1983)).
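The shrinkage step can be sketched as follows; this is a minimal illustration with made-up numbers, where `prior_var` stands in for the estimated variance of true teacher effects:

```python
import numpy as np

def eb_shrink(estimates, std_errors, prior_mean, prior_var):
    """Shrink noisy teacher-effect estimates toward the population mean.

    The weight on the raw estimate is prior_var / (prior_var + se^2):
    precisely estimated effects move little, noisy ones move a lot.
    """
    estimates = np.asarray(estimates, dtype=float)
    se2 = np.asarray(std_errors, dtype=float) ** 2
    weight = prior_var / (prior_var + se2)
    return prior_mean + weight * (estimates - prior_mean)

# Two teachers with the same raw estimate but different precision.
shrunk = eb_shrink([0.40, 0.40], [0.05, 0.50], prior_mean=0.0, prior_var=0.04)
# The imprecise second estimate is pulled much closer to the mean of zero.
```

The design choice this illustrates: teachers with few students (large standard errors) contribute less extreme estimates, which is exactly the protection against noise described above.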

As noted by Mihaly et al. (2010), standard fixed-effects software routines compute fixed

effects relative to some arbitrary hold-out unit (e.g. an omitted teacher), which can produce

wildly incorrect standard errors and thus inappropriate corrections for measurement error in the

estimated teacher effects. Therefore, to estimate the teacher effects and their standard errors we

employ the Stata routine felsdvregdm, developed by Mihaly et al. (2010), which imposes a sum-

to-zero constraint on the estimated teacher effects within a school and produces the

appropriate standard errors for making measurement error adjustments.18
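The sum-to-zero normalization amounts to re-centering effects within each school. The sketch below shows that step alone; it does not reproduce felsdvregdm, which also computes the appropriate standard errors, and the data structure is hypothetical:

```python
from collections import defaultdict

def demean_within_school(effects):
    """Re-center raw teacher-effect estimates so that they sum to zero
    within each school.

    effects: dict mapping (school, teacher) -> raw estimated effect.
    """
    by_school = defaultdict(list)
    for (school, _), eff in effects.items():
        by_school[school].append(eff)
    school_mean = {s: sum(v) / len(v) for s, v in by_school.items()}
    return {key: eff - school_mean[key[0]] for key, eff in effects.items()}

raw = {("A", 1): 0.2, ("A", 2): 0.4, ("B", 1): -0.1, ("B", 2): 0.3}
centered = demean_within_school(raw)
# Within each school the re-centered effects now sum to zero.
```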

16 See Jacob and Lefgren (2008), p.113. 17 Specifically, we employ the method developed by Lewis and Linzer (2005) and embodied in the Stata routine edvreg. 18 All standard errors on the estimated teacher effects are corrected for clustering at the classroom level using the method suggested by Moulton (1990).

C. Principal Interview Data

Interviews were conducted in the summer of 2006. Each principal was asked to rate up

to ten teachers in grades and subjects covered by annual student achievement testing. Per

the requirements of the district, the interviews were “single-blind” so that the principal knew the

names of the teachers but the interviewer knew only a randomly assigned number associated

with the names.

From the administrative data described above, we identified teachers in tested grades and

subjects in the 30 schools who had taught at least one course with 10 or more tested students and

who were still in the school in the 2004/05 school year (the last year for which complete

administrative data were available prior to arranging the principal interviews). In some cases,

there were fewer than ten teachers who met these requirements. Even in schools that had ten

teachers on the list, there were cases where some teachers were not actually working in the

respective schools at the time of the interview. If the principal was familiar with a departed

teacher and felt comfortable making an assessment, then the teacher and the corresponding

subjective assessment were included in the analysis. If the principal was not sufficiently familiar with the

departed teacher, then the teacher was dropped. Many schools had more than ten teachers. In

these cases, we attempted to create an even mix of five reading and five math teachers. If there

were more than five teachers in a specific subject, we chose a random sample of five to be

included in the list.

In the interviews, principals were first asked to mark on a sheet of paper their

overall assessment of each teacher, using a 1-9 scale.19 The interviewer then handed the

19 The specific question was: “First, I would like you to rate each of the ten teachers relative to the other teachers on the list. Please rate each teacher on a scale from 1-9 with 1 being not effective to 9 being exceptional. Place an X in the box to indicate your choice. Also please circle the number of any teachers whose students are primarily special populations.”

principal another sheet of paper so that he/she could rate each teacher on each of 11

characteristics: caring, communication skills, enthusiasm, intelligence, knowledge of subject,

strong teaching skills, motivation, works well with grade team/department, works well with me

(the principal), contributes to school activities beyond the classroom, and contributes to overall

school community. The first seven characteristics in this list were found by Harris, Rutledge,

Ingle, and Thompson (2010) to be among the most important characteristics that principals look

for when hiring teachers.20 It is important to emphasize that these assessments are coming from

the principals, and not from validated, objective measures of these characteristics.

The interview questions were designed so that principals would evaluate teachers relative

to others in the school.21 One reason for doing so is that even an “absolute” evaluation would be

necessarily based on each principal’s own experiences. This implies that ratings on individual

characteristics across principals may not be based on a common reference point or a common

scale. Therefore, like Jacob and Lefgren, we normalize the ratings of each teacher characteristic

to have a mean of zero and standard deviation of one over all teachers rated by a given

principal.22 Given that our teacher fixed-effects estimates are within-school measures,

normalizing the ratings allows us to compare within-school ratings to within-school teacher value-added.
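Concretely, the normalization is a within-principal z-score. A minimal sketch with hypothetical ratings:

```python
import statistics

def normalize_within_principal(ratings):
    """Convert one principal's 1-9 ratings to z-scores (mean 0, sd 1),
    making ratings comparable across principals with different scales."""
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)  # population sd over the rated teachers
    return [(r - mean) / sd for r in ratings]

z = normalize_within_principal([3, 5, 7, 9])
# z now has mean zero and standard deviation one by construction.
```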

20 As described in Harris, Rutledge, Ingle and Thompson (2010), the data in this study came from the second in a series of interviews carried out by the researchers. During the summer of 2005, interviews were conducted regarding the hiring process and principals’ preferred characteristics of teachers. The first set of interviews was important because it helped validate the types of teacher characteristics we consider. Principals were asked an open-ended question in the first interview about the teacher characteristics they prefer. Two-thirds of these responses could be placed in one of 12 categories identified from previous studies on teacher quality. The list here takes those ranked highest by principals in the first interview and then adds some of those included by Jacob and Lefgren. 21 In contrast, in the Rockoff, et al. (2012) study, principals were asked to compare each teacher to all “teachers [they] have known who taught the same grade/subject,” not just teachers at their own school. 22 Normalizing the ratings within a school avoids the problem of different principals having different scales. However, even within a school, we are assuming that the normalized scales can be given a cardinal interpretation, that is, moving from one standard deviation below the mean rating to the mean is equivalent to moving from the mean to one standard deviation above the mean on a given characteristic. Table A4 in the appendix provides the

The final activity of the interview involved asking the principals to rate each teacher

according to the following additional “outcome” measures: raises FCAT math achievement,

raises FCAT reading achievement, and raises FCAT writing achievement. These last measures

help us test whether the differences between the value-added measures and the principals’

overall assessments are due to philosophical differences regarding the importance of student

achievement as an educational outcome or to difficulty in identifying teachers who increase

student test scores. In this respect, the fact that the assessments are coming from the principal is

useful because we are trying to draw conclusions about principal evaluations, as imperfect as

they may be.

To lessen potential multicollinearity problems and reduce the number of teacher

characteristics to analyze, we conduct a factor analysis of the 11 individual teacher

characteristics rated by principals. As indicated in Appendix Table A6, the individual

characteristics can be summarized into four factors: interpersonal skills, motivation/enthusiasm,

ability to work with others, and knowledge/teaching skills/intelligence.
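As a rough illustration of how a set of correlated ratings collapses onto a small number of common dimensions, the sketch below runs a principal-components decomposition on simulated ratings. This is a common stand-in for, not a reproduction of, the factor analysis reported in Table A6, and the simulated data are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulate four ratings all driven by one latent trait plus noise.
latent = rng.normal(size=n)
ratings = np.column_stack([latent + 0.5 * rng.normal(size=n) for _ in range(4)])

corr = np.corrcoef(ratings, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)     # eigenvalues in ascending order
share = eigvals[::-1] / eigvals.sum()  # variance share per component
print(f"First component explains {share[0]:.0%} of the rating variance")
```

When one latent trait drives several ratings, the leading component absorbs most of the variance, which is the sense in which 11 rated characteristics can be summarized by four factors.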

Finally, as part of the interview, we discovered that principals have access to a district-

purchased software program, Snapshot™, that allows them to create various cross-tabulations of

developmental scale scores (DSS) derived from the high-stakes SSS exam, including simple

student learning gains and mean learning gains by teacher. Although we have no data about the

actual usage of this software, subsequent informal conversations with two principals suggest

that at least some principals use the program to look at the achievement gains made by students

of each teacher. While this may have provided principals with some information about

distribution of principal ratings prior to normalization for both their overall evaluation of a teacher and for the “ability to raise test scores” criterion in each subject.

unconditional student average achievement gains, that is of course not the same thing as the

teacher value-added scores, which are conditional on student and peer characteristics. Further,

since the FCAT-SSS is only administered in grades 3-10, principals would have no raw

achievement gain information for teachers in grades 2 and 3. Nevertheless, we calculate the

correlation between mean current and lagged achievement gains on the DSS scale with principal

evaluations of their teachers. In Appendix Table A7 we show that the correlation between

principals’ overall ratings of teachers and once-lagged average DSS gains for math teachers is

relatively weak (correlation = 0.19), and that overall ratings are not significantly correlated with contemporaneous

average student gains. Further, overall rankings of reading teachers are not significantly

correlated with either current or lagged average student achievement gains. Paired with the

consistency in correlations of principal ratings with value-added across subjects demonstrated

below, these results suggest that our estimates are not tainted by principals’ access to average

achievement gain information.

Principal evaluations may also be affected by principals’ knowledge of the ways in which

teachers are assigned to students and classrooms. Since principals often make these assignments

themselves, they are often well aware of them. For example, principals might know they assign

certain teachers to more challenging students based on student factors not typically measured or

included in value-added models. They might then adjust their ratings upwards for these teachers

in ways that the value-added measures do not. Based on our other analyses of these principals,

this seems unlikely,23 but even if it did occur, this is not a source of bias in the principal

evaluation, but something to consider in explaining the patterns of results that follow.

23 In analyses of interviews with these same principals, in which they were asked to discuss each teacher at length, it appears that principals are apt to judge teachers based on observable effort rather than effectiveness per se or even behaviors we would expect to be related to effectiveness in raising student achievement (Harris, Ingle, & Rutledge,

IV. Results

A. The Relationship of Principal Evaluations to Value-Added

Simple pairwise correlations between estimated teacher fixed effects and principals’

evaluations of teacher performance are presented in Tables 2A (for math) and 2B (for reading).

The correlations are broken down by rating type (“overall” and “ability to raise test scores”) and

grade level. For comparison purposes we also present corresponding estimates from other recent

studies. At the elementary level, we find generally stronger (though still moderate) correlations

between principals’ evaluations of teacher performance and value-added than do other studies.24

For the “ability to raise achievement” measure we obtain unadjusted correlations of 0.37 in math

and 0.35 in reading. In contrast, Jacob and Lefgren report unadjusted correlations of 0.25 in

math and 0.18 in reading while the correlations in Rockoff, et al. are 0.23 in math and 0.25 in

reading. Our correlations adjusted for measurement error in value-added are correspondingly

higher than those from other studies as well.25
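The measurement-error adjustment is the classic disattenuation formula: divide the observed correlation by the square root of the value-added measure's reliability. A sketch with illustrative numbers, not the paper's estimates:

```python
import math

def disattenuate(r_observed, reliability):
    """Correct a correlation for sampling error in one variable.

    reliability = signal variance / (signal + error variance) of the
    value-added estimates; correcting both variables would divide by
    the square root of the product of the two reliabilities.
    """
    return r_observed / math.sqrt(reliability)

# An observed correlation of 0.37 with an assumed reliability of 0.6
# implies a correlation of roughly 0.48 with true value-added.
adjusted = round(disattenuate(0.37, 0.6), 2)
```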

2014). Also, in principals’ longer discussions of each teacher, principals never referred to the types of students teachers had. 24 The stronger correlations at the elementary level may be partly driven by the grade span of analysis. Both we and Jacob and Lefgren possess test scores for grades one and above and thus compute value-added for grades two and higher. In contrast, Rockoff, et al.’s analysis is based on value-added for grades 4 and higher. If we drop grades 2 and 3 from our elementary school analysis, the estimated correlations between value-added and principal ratings are comparable to those of Rockoff, et al. in reading, but are still higher in math. In math, the unadjusted correlation of value-added and the overall rating falls from .31 to .26. The unadjusted correlation between value-added and ability to raise test scores drops from .37 to .32. In reading the unadjusted correlations drop from 0.30 to .23 (overall rating) and from 0.35 to 0.22 (rating of ability to raise test scores). 25 As noted above, when principals conduct teacher evaluations they may implicitly account for variation in student behavior and skill among teachers that is not captured by value-added measures. If this is the case, their knowledge of such sorting could partly account for the divergence between the principal’s rating of effectiveness in raising achievement and estimated value-added. An implication of this hypothesis is that for schools in which there appears to be greater sorting across classrooms there should be a lower correlation between principal evaluations and value-added. As suggested by a reviewer, we attempted to test this proposition with our data, but could find few instances of significant sorting of students across classrooms. Similar to the method employed by Clotfelter, et al.
(2006), we limited our sample to elementary schools with more than one classroom in a grade and constructed six binary student characteristics (white/nonwhite, male/female, free lunch/no free lunch, lagged (normed by grade and year) test score above/below statewide mean, zero/non-zero disciplinary incidents in prior year, whether or not a student

At the middle school level, we do not find significant correlations between principal

evaluations and value-added, though this is likely due to the relatively small number of middle

school teachers in our sample (40 teachers). In contrast, Rockoff et al. obtain unadjusted

correlations of both the “overall” and “raise achievement” ratings with value-added of 0.4 in math and

0.2 in reading (Jacob and Lefgren’s sample is limited to elementary schools). When we combine

elementary and middle school teachers, however, our estimated correlations are strikingly similar

to those of Rockoff et al. For the “overall” assessment we find correlations with value-added of

0.27 in math and 0.28 in reading while Rockoff et al.’s estimated correlations are 0.23 for both

math and reading. Likewise for the “raise achievement” measure, we obtain estimated

correlations with value-added of 0.33 and 0.28 for math and reading, respectively; Rockoff et

al.’s corresponding estimates are 0.26 and 0.26.

As with middle school, our high-school-only sample is small (30 math teachers and 15

language-arts teachers) and yields imprecise estimates of the correlation between principal

evaluations and value-added estimates of teacher performance. Estimates from our full sample,

which combines elementary, middle and high schools, are quite similar to the full-sample

estimates of Jacob and Lefgren (elementary only) and Rockoff et al. (elementary and middle

combined). All three studies find unadjusted correlations between the principal’s assessment of

a teacher’s ability to raise achievement and the teacher’s value-added of 0.25 to 0.34 in math and

changed schools from the previous year). We then conducted chi-squared tests of the null of uniform assignment across classrooms within a school for each of the indicators. Using the same criterion of a 10 percent significance level employed by Clotfelter, et al., the proportion of grade/school/year combinations where we reject “apparent random assignment” is relatively small, ranging from two percent (gender) to 14 percent (test scores). Further, there was no clear pattern over time in terms of schools or grades within a school that did not pass the “apparent random assignment” test.

0.18 to 0.26 in reading. Correlations adjusted for measurement error are higher, 0.32 to 0.37 in

math and 0.29 to 0.33 in reading.26

Estimates of the relationship between the assessment of teachers by trained observers and

value-added, presented in Mihaly et al., are quite similar to those between principals’

assessments and value-added noted above. The adjusted correlations range from 0.27 to

0.41 in math and 0.17 to 0.32 in reading. The correlations between student evaluations of

teachers and value-added are generally similar, though more variable: 0.32 to 0.44 in math and

0.11 to 0.50 in reading.

B. Variation in the Relationship of Principal Evaluations to Value-Added Across the Teacher Quality Distribution

The estimates presented above suggest that, on average, there is a moderate association

between assessments of teachers by principals and value-added. However, the ability of

principals to evaluate a teacher’s impact on achievement could vary across the distribution of

value-added. For example, it may be easier for principals to identify the teachers who are best at

improving student test scores than to distinguish between teachers in the middle of the value-

added distribution. In Table 3 we explore this issue by comparing the probability a teacher is

ranked in a given range of the principal’s distribution of assessments relative to their ranking in

the value-added distribution. For the sake of comparison we employ the same methodology as

Jacob and Lefgren and present their results alongside ours.27
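The comparison in Table 3 can be sketched as follows: for each school, check whether the top-rated teacher also has the top value-added, then compare that hit rate with the one expected under random ratings. The simulation below uses toy data of our own construction; the paper's random-ratings baselines account for ties in ratings and differing school sizes, so they differ from the simple 1/k used here:

```python
import random

def top_match_rate(schools):
    """Share of schools where the top-rated teacher also has the
    highest value-added. Each school is a (ratings, value_added) pair."""
    hits = 0
    for ratings, va in schools:
        top_rated = max(range(len(ratings)), key=ratings.__getitem__)
        top_va = max(range(len(va)), key=va.__getitem__)
        hits += top_rated == top_va
    return hits / len(schools)

random.seed(0)
# Toy data: six teachers per school; ratings equal value-added plus noise.
schools = []
for _ in range(1000):
    va = [random.gauss(0, 1) for _ in range(6)]
    ratings = [v + random.gauss(0, 1) for v in va]
    schools.append((ratings, va))

actual = top_match_rate(schools)
baseline = 1 / 6  # expected hit rate under random ratings with no ties
print(f"actual {actual:.2f} vs random baseline {baseline:.2f}")
```

The gap between the actual hit rate and the random baseline, rather than the hit rate alone, is the quantity of interest, just as in the discussion below.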

We, like Jacob and Lefgren, find principals can identify teachers who have the greatest

impact on student achievement. In our sample, math teachers who received the highest rating on

26 See Jackson (forthcoming) and Harris and Anderson (2012) for discussion of how the properties of value-added may differ across grade levels, which could influence the correlations with principal evaluations. 27 Details of the methodology are contained in the appendix to Jacob and Lefgren (2008).

“ability to raise test scores” also had the highest value-added among teachers in their school 72

percent of the time. In Jacob and Lefgren, where principals rated teachers on “raising student

achievement” the corresponding number is 70 percent. Had principals randomly rated teachers,

the expected probabilities would have been 38 percent in our sample and 26 percent in Jacob and

Lefgren’s sample. For reading, the corresponding probabilities of the top-rated teacher being the

one with the highest value-added are 67 percent in our sample (versus 38 percent under random

ratings) and 55 percent (versus 14 percent if principals randomly rated teachers) in

the Jacob and Lefgren analysis. It is the difference between actual and random probabilities that

is relevant to determining principals’ ability to identify high-performing teachers and those

differences are substantially smaller in our sample compared to those found by Jacob and

Lefgren. We find a 34 percentage point difference for math teachers and a 29 percentage point

difference for reading teachers whereas Jacob and Lefgren uncover a 44 percentage point

difference in math and a 41 percentage point differential in reading. Neither of the differences

between our estimates and those of Jacob and Lefgren is statistically significant, however.

At the low end of the teacher quality distribution our results are quite close to those of

Jacob and Lefgren. We find the probability a teacher with the worst rating has the lowest value-

added is 60 percent for math teachers and 52 percent for reading teachers. The corresponding

probabilities in Jacob and Lefgren’s analysis are 61 percent and 38 percent. The differences

from random ratings by principals are 35 percentage points for math teachers and 29 percentage

points for reading teachers in our sample; Jacob and Lefgren uncover differentials of 38 and 29

percentage points, respectively.

Our estimates of the ability of principals to distinguish between math teachers in the

middle of the pack are also similar to those of Jacob and Lefgren. In math, we find that teachers

who are rated above the median by principals have a 65 percent chance of being in the top 50

percent of the value-added distribution, which is 33 percentage points greater than would be

expected if principal ratings were random. For Jacob and Lefgren the numbers are 59 percent

with a 35 percentage point differential. For reading teachers, we find a 42 percentage point

differential (75 percent versus 33 percent) whereas Jacob and Lefgren uncover only a 29

percentage point difference (62 percent versus 33 percent). Neither of these cross-study

differences in the actual-versus-random-ratings differentials is statistically significant.

C. Variation in the Relationship of Principal Evaluations to Value-Added Across Teachers and Principals

Principals undoubtedly vary in their ability to assess teacher performance.28 It is

expected that the longer a principal has known a teacher, the more accurate the

principal’s evaluation of that teacher’s performance will be. Further, principals may gain general human capital

in personnel evaluation as their experience as a supervisor increases. In addition, more

experienced teachers are likely to exhibit less variation in their performance from day to day than

early-career teachers who are experimenting with different instructional strategies and learning

on the job. This implies it would be easier for principals to evaluate the ability of veteran

teachers from occasional classroom visits than it would be for less experienced teachers.

To test these ideas we estimate the correlation between value-added and principals’

ratings of their teachers’ ability to raise student achievement for various subsamples of teachers

and principals. To facilitate comparison with the findings of Jacob and Lefgren, we employ the

same breakdowns as they use. Results are reported in Table 4. Unlike Jacob and Lefgren, we

28 In our sample, of 24 principals with at least five teachers who have both observational ratings and value-added scores, the correlations between value-added and rating of “ability to raise achievement” run from -0.17 to 0.94 (0.64 to 0.94 for principals with correlations significantly different from zero). In reading the principal-level correlations range from -0.20 to 0.83 (0.58 to 0.83 for correlations which are statistically significant).

find significant correlations between principal evaluations and value-added for all sub-samples

and substantial heterogeneity in the association between principal ratings of their teachers and

estimates of teacher value-added.

As expected, we find that principals are better at distinguishing teachers’ contributions to

student achievement among teachers with 11 or more years of experience than among teachers

with 10 or fewer years in the classroom. Adjusted correlations between principal ratings and

value-added for the more experienced group are 0.40 in math and 0.39 in reading and much

lower for the less experienced group at 0.30 in math and 0.26 in reading. Jacob and Lefgren find

similar correlations for the more experienced teachers but their estimated correlations for

relatively inexperienced teachers are not significantly different from zero in either subject.

Similar results are found when we break down the sample by the length of interaction

between principals and teachers.29 When teachers and principals have been employed at the

same school for four or more years the adjusted correlation between principal ratings and teacher

value-added is 0.50 for math teachers and 0.37 for reading teachers. If principals and teachers

have interacted for less than four years the correlations are smaller (but still significantly

different from zero), at 0.29 in math and 0.31 in reading. In contrast, Jacob and Lefgren find

smaller correlations for the sample of principals and teachers who have known one another for

four or more years (0.35 in math and 0.28 in reading). Their estimates of the adjusted

correlations for the shorter interaction group are not statistically significant.

If we compare sub-samples which vary in the principal’s tenure at their current school,

but do not control for the tenure of their teachers, we find very little variation in the correlation

29 Middle and high school principals may have less day-to-day interaction with individual faculty than do elementary school principals who oversee fewer teachers. Consistent with expectations, we find the correlations between principal ratings and value-added are lower in middle and high schools than in elementary schools.

of principal ratings with value-added, though all estimated correlations are statistically

significant. Jacob and Lefgren obtain somewhat higher point estimates of the corrected

correlations for the low-tenure-length principals, though the differences from our results are not

statistically significant. Their estimated correlations for principals who have been at their current

school for four or more years are significantly different from zero. We also find that experienced

principals are somewhat better at evaluating a teacher’s contribution to student achievement, at

least among math teachers. The correlation between principal ratings and value-added of math

teachers for principals with 11 or more years of administrative experience is 0.43, versus 0.34

for principals with 10 or fewer years of experience. Differences for principal evaluations of

reading teachers are much smaller, with adjusted correlations of 0.34 for the more experienced

principals and 0.30 for less experienced principals.30

D. Teacher Traits, Value-added and Principal Ratings

In order for observational evaluations of teachers to be a useful complement to value-

added ratings, they must measure valuable teacher characteristics not captured by value-added.

To determine whether this is the case we begin by analyzing the relationship between teacher

traits and teacher value-added. Estimates from regressing teacher value-added on the four

teacher characteristic factors are presented in Table 5. Results in column [1] indicate that for

math, teacher value-added is positively and significantly associated with knowledge/teaching

30 Teacher experience, principal experience, and the length of overlapping tenure at a school are naturally correlated with one another (e.g., a teacher with two years of experience cannot have more than two years of experience with a given principal, and vice versa). We attempted to disentangle the independent influences of teacher and principal experience by regressing teacher effects on principal ratings and the interaction of principal ratings with each of the three experience indicators (teacher experience greater than or equal to 11 years, principal experience greater than or equal to 11 years and overlapping experience at a school for 4 or more years). The coefficients on each of the experience interaction terms were generally insignificant, however, reflecting multi-collinearity. Thus we cannot say whether teacher experience or principal experience affects the ability of principals to evaluate teachers independent of the overlap in teacher-principal experience.

skills/intelligence only. None of the other coefficients in column [1] are significant. Estimates

in column [2] indicate that the magnitude of the effect on math achievement of

knowledge/teaching skills/intelligence is nearly identical in elementary and middle/high school,

though the precision is much lower in middle/high school. The overall explanatory power of the

four factors is quite low, however, with R-squared values of 0.14.31 For reading, the only factor

which is statistically significant is teacher motivation/enthusiasm. Once again, the effect is equal

in magnitude across grades, but only statistically significant for elementary school teachers. The

relative importance of subject matter knowledge in math teacher performance is consistent with

recent findings that “Teach for America” teachers, who possess exceptionally strong academic

credentials, tend to outperform traditionally prepared teachers in teaching math, but are on par

with traditionally prepared teachers in reading instruction.32
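The factor regression just described can be sketched in a few lines of code. The simulation below is purely illustrative: the factor structure, effect sizes, and sample size are assumed, not taken from the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # hypothetical number of teachers

# Four positively correlated skill-factor ratings, built from a shared component.
common = rng.normal(size=n)
factors = np.column_stack(
    [0.7 * common + 0.7 * rng.normal(size=n) for _ in range(4)]
)

# Simulated value-added in which only the first factor truly matters,
# mirroring the math result; heavy noise keeps R-squared low by construction.
va = 0.3 * factors[:, 0] + rng.normal(size=n)

# OLS of value-added on the four factors (with an intercept).
X = np.column_stack([np.ones(n), factors])
beta, *_ = np.linalg.lstsq(X, va, rcond=None)
resid = va - X @ beta
r2 = 1.0 - resid.var() / va.var()
print("factor coefficients:", np.round(beta[1:], 2))
print("R-squared:", round(r2, 2))
```

In a construction like this the fitted R-squared stays low even though one factor is a genuine determinant, which is the pattern reported in Table 5.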

To determine if principals’ ratings are driven solely by their perceptions of teacher value-

added, we first regress the principal’s overall rating of a teacher on their assessment of the

teacher’s ability to raise test scores. If only the ability to raise short-run test scores matters to

principals, we would expect the coefficient on “ability to raise test scores” to be near unity and a

large proportion of the variation in overall ratings to be explained by variation in the ability to

raise test scores. As indicated in the first column of Table 6, a one standard deviation increase in

“ability to raise test scores” raises the overall assessment by about three-fourths of a standard

deviation and variation in teachers’ ability to promote student achievement only explains a bit

over half of the variation in overall assessments of teacher quality. These results suggest that

principals do place considerable weight on a teacher’s productivity in raising student

31 Some of the insignificant effects may be due to multicollinearity. As demonstrated in Table A6, the four factors are all positively correlated. When each factor is regressed on estimated teacher effects separately, all are significant except “works well with others” in predicting the value-added of reading teachers. 32 See Boyd, et al. (2006), Kane, Rockoff and Staiger (2008) and Xu, Hannaway and Taylor (2011).


achievement, but they also consider other factors when assessing teacher performance; these

additional factors could include the impact of teachers on other types of student outcomes.33
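The logic of this test is easiest to see with standardized variables, where the OLS slope equals the correlation and R-squared is its square. The following sketch uses simulated ratings with loadings chosen (as an assumption, not an estimate) so the slope lands near the reported three-fourths of a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300  # hypothetical number of rated teachers

# Overall rating loads on perceived ability to raise scores plus other
# considerations; the loadings are assumed for illustration.
ability = rng.normal(size=n)
other = rng.normal(size=n)
overall = 0.75 * ability + 0.66 * other

# Standardize both variables: the OLS slope is then the correlation,
# and R-squared is the slope squared.
def z(x):
    return (x - x.mean()) / x.std()

slope = (z(ability) * z(overall)).mean()
r2 = slope ** 2
print(f"slope: {slope:.2f}  R-squared: {r2:.2f}")
```

A slope near 0.75 thus implies an R-squared a bit over one-half, so roughly half the variation in overall ratings is left for other considerations.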

While we have no direct measures of principals’ assessments of teacher productivity in

promoting other sorts of outcomes for students, including later-life outcomes, we do have

information on principals’ ratings of specific teacher traits/skills.34 If the production function

which relates teacher skills to non-achievement outcomes differs from that which governs the

production of test-score gains, then some traits should be significantly related to the overall

assessment of a teacher holding constant the teacher’s “ability to raise test scores.”

Estimates from regressing the principal’s overall assessment on their assessment of a

teacher’s ability to raise test scores and the four skill factors are presented in column [2] of Table

6.35 For math, increases in knowledge/teaching skills/intelligence have the greatest impact on

the overall assessment of a teacher. This is followed by increases in “works well with others.”

Motivation/enthusiasm also has a significant effect, though the magnitude is much smaller. For

reading, increases in motivation/enthusiasm have the largest effect on the overall assessment of a

33 As discussed above in the context of value-added and principal evaluations, measurement error in principal evaluations would tend to reduce the estimated correlation. However, since in this case both measures (“ability to raise test scores” and the overall rating) are coming from the same source, we could expect a positive correlation between the random errors, which would raise the maximum correlation. This type of positive correlation in the errors between value-added and principal evaluations could also arise if prior information about value-added affects principals’ own subjective views about teachers, but this is very unlikely in our sample because principals did not have access to value-added measures. 34 Prior to designing the principal survey, we conducted an extensive review of theory and prior evidence that led us to the list of teacher skills we asked principals to rate. The skills included in the survey of principals have been widely reported in meta-analyses of the characteristics associated with effective teaching (Harris and Rutledge, 2010) and are considered important by principals (Harris, Rutledge, Ingle and Thompson, 2010). 35 As documented in Appendix Table A8, principals’ ratings of teachers’ skills are all positively and strongly correlated with one another in both subjects; correlations are in the range of 0.61 to 0.76. It is not obvious that this should be the case, e.g., that teachers who are more knowledgeable would also tend to have better interpersonal skills. There might be a “halo effect” whereby teachers who are rated highly by the principal overall are automatically given high marks on all of the individual characteristics. This would create multicollinearity amongst the skill factor ratings which would tend to inflate the estimated standard errors and depress the significance of the coefficients on the individual skill ratings.


teacher, holding constant the teacher’s ability to raise test scores. Knowledge/teaching

skills/intelligence has the second largest effect with interpersonal skill having a smaller, but

statistically significant effect. Across both subjects the four skill factors explain more than one-

fourth of the variation in overall ratings (holding constant the rated ability to raise test scores).

This is consistent with the notion that principals consider more than immediate impacts on

student achievement when assessing teacher performance and that specific skills can have

varying effects on different types of student outcomes.36
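The multicollinearity concern raised in footnote 35 can be made concrete with variance inflation factors (VIFs). The sketch below simulates four ratings sharing a common "halo" component so that pairwise correlations sit near the reported 0.61-0.76 range; the construction is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # hypothetical number of teachers

# Four skill ratings sharing a "halo" component, giving pairwise
# correlations of about 0.7 -- close to the range reported in the text.
halo = rng.normal(size=n)
ratings = np.column_stack(
    [np.sqrt(0.7) * halo + np.sqrt(0.3) * rng.normal(size=n) for _ in range(4)]
)

def vif(X, j):
    # Variance inflation factor: 1 / (1 - R^2) from regressing rating j
    # on the other ratings; values well above 1 inflate standard errors.
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1.0 - ((y - Z @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(ratings, j) for j in range(4)]
print("VIFs:", [round(v, 2) for v in vifs])
```

With correlations in this range each VIF is well above one, so the standard errors on the individual skill coefficients are substantially inflated, as the footnote anticipates.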

E. The Relative Ability of Prior Value-Added and Principal Ratings to Predict Teacher Contributions to Student Achievement

To this point we have been comparing principal evaluations of teachers with value-added

measures constructed from all available prior student test scores (i.e. principal ratings from

summer 2006 with value-added based on achievement data up through the 2005/06 school year).

Such contemporaneous estimates of teacher productivity are relevant to decisions about the role

of principal evaluations in measuring and rewarding past performance. However,

contemporaneous measures of teacher performance are not particularly relevant for retention and

tenure decisions, where the decision should (optimally) be based on predictions about future

performance.

Likely due to data constraints, previous studies have examined the relationship between

prior ratings/value-added and student achievement, rather than future value-added. As we

discuss below, the focus on student achievement can lead to a distorted view of the relationship

between past evaluations and future productivity of teachers. Nonetheless, in order to compare

our estimates with those of other recent studies, we first estimate the relationship between

36 In additional analysis of principals’ open-ended discussions of these teachers, there is evidence that teacher effort is another factor that affects principal ratings (Harris, Ingle, and Rutledge, 2014).


current student achievement and prior measures of teacher performance. The model is

essentially the same as the value-added model in equation (1), except that the teacher fixed effect

is replaced with a measure of the teacher’s prior value-added. For the value-added measure of

past performance we construct four different estimates, varying the amount of information used

to estimate the past teacher value-added from the maximum available (6 years of student

achievement) to a single year.37 Estimates of the relationship between prior value-added and

current student achievement are presented in the first panel of Tables 7A (for math) and 7B (for

reading).

Consistent with the findings in McCaffrey, et al. (2009), our results indicate that value-

added measures based on two years of data are better predictors of future performance than are

single-year estimates, but the gain from incorporating additional years in the computation of

prior value-added diminishes quickly thereafter.  Using all available information, a one-unit

increase in past value-added is associated with a 0.85 increase in current student achievement in

math whereas with past value-added computed from a single year of data a one-unit increase in

prior value-added is associated with only a 0.48 increase in current student achievement. The

relationship is essentially the same whether two, three or six years of data are used to compute

prior value-added. The estimates are also much more precise when two or more years of data are

used than with prior value-added derived from a single year of data.

37 We utilize a common sample to ensure comparability of the value-added estimates. The sample of teachers with value-added data for 2005/06 is much smaller than for other years. This is because we selected teachers to participate in the study in the late spring of 2006, based on whether they had student achievement data for the most recent year available at that time (the 2004/05 school year). A teacher with achievement data for 2004/05 who subsequently left the school (or who switched to a non-tested grade and subject) would therefore be excluded from the 2005/06 value-added sample. To distinguish between sample-size effects and differences due to the number of years of student achievement used to estimate teacher value-added, we report results using single-year value-added estimates for 2004/05, which utilize the full sample of teachers. A comparison of estimates using data from the smaller 2005/06 sample is provided in Appendix Table A9.
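The pattern of coefficients rising toward one as more years enter the prior value-added estimate is what classical measurement error predicts: the slope is attenuated by the ratio of signal variance to total variance, and averaging more years of data shrinks the noise component. A sketch, with all variances assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400                              # hypothetical number of teachers
sigma_true, sigma_year = 0.15, 0.20  # assumed SDs of true effects and yearly noise

true_va = rng.normal(scale=sigma_true, size=n)
# Current achievement gains load on true value-added with coefficient one.
current = true_va + rng.normal(scale=0.3, size=n)

def estimated_va(years):
    # Averaging over more years shrinks the sampling error of the estimate.
    return true_va + rng.normal(scale=sigma_year / np.sqrt(years), size=n)

slopes = {}
for years in (1, 2, 6):
    est = estimated_va(years)
    # Attenuated OLS slope: cov(estimate, outcome) / var(estimate).
    slopes[years] = np.cov(est, current)[0, 1] / est.var(ddof=1)
    print(f"{years} year(s) of data: slope = {slopes[years]:.2f}")
```

Under these assumed variances the slope climbs steeply from one to two years of data and then flattens, the same qualitative pattern as the 0.48-to-0.85 range reported above.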


Estimates of the relationship between prior value-added and current student achievement

in reading are more variable, but generally tell a similar story. Estimates of the marginal effect

of prior value-added on current reading achievement are greater when two or more years of data

are used to compute value-added rather than when a single year of information is employed. The

t-statistics are also much higher when more than one year of data are used to compute prior

value-added. As is commonly the case, the proportion of variation in reading achievement that

can be explained is lower than for math, with R-squared values around 0.5 to 0.6 across studies.

Our estimates of the relationship between prior principal ratings and current student

achievement are qualitatively similar to those of Jacob and Lefgren in math, but differ

substantially for reading teachers.  In math, both we and Jacob and Lefgren find that an

increase in prior principal ratings of teachers’ ability to increase test scores is

associated with higher current student achievement, though the magnitudes differ somewhat (0.07

in our analysis and 0.10 in Jacob and Lefgren). In reading, however, we do not find a significant

relationship between prior principal evaluations of teachers and current student achievement

whereas Jacob and Lefgren report a statistically significant marginal effect of 0.05. Rockoff and

Speroni find a significant relationship between mentor ratings of first-year teachers and student

achievement in math and in reading, but insignificant effects in both subjects for ratings of

teaching fellow applicants in New York City.

When prior value-added and subjective evaluations are used in combination to predict

student performance in math, both metrics are statistically significant in nearly all cases, both in

our analysis and in prior studies; the one exception being pre-service evaluations of New York

City teaching fellows. Combining measures does little to enhance the ability to predict student


achievement. Whether prior value-added, observational measures, or the combination of the two

are employed to predict student achievement, the R-squared values are virtually identical.

Predicting future student achievement rather than value-added presents a distorted picture

of the relationship between measures of prior teacher performance and future teacher

productivity, however. Much of the variation in student achievement is due to differences in

student characteristics. Thus the relatively high R-squared values observed (0.39-0.67) reflect

the explanatory power of student characteristics, rather than the explanatory power of prior

teacher performance. Correspondingly, the choice of observational versus test-score metrics

appears to be immaterial because each is contributing only a small amount to the prediction of

student achievement.

To get a clearer picture of the relationship between measures of past performance and

future productivity, we measure future teacher productivity by calculating teachers’ future value-

added. To derive estimates of future value-added we re-estimate equation (1), using data on

student achievement gains from the 2006/07 and 2007/08 school years (including test scores

from 2005/06 as the initial lagged value).  As demonstrated by McCaffrey et al. (2009), basing

teacher value-added on two years of performance leads to much more precise estimates than

relying on a single year of test scores.  We then regress our estimate of future value-added on

either the principal’s overall rating of the teacher from the summer of 2006 or the estimated

teacher fixed effect from a student achievement model covering the years 1999/00-2005/06.38

Using all available information, past value-added outperforms principal ratings,

explaining over five times as much of the variation in future value-added among math teachers

38 To minimize attenuation bias we employ empirical Bayes “shrunken” estimates of prior value-added but use the non-shrunken estimate of future value-added as the dependent variable. Results similar to those reported in Tables 7A and 7B are obtained if we use empirical Bayes estimates in place of the non-shrunken teacher fixed effects as the dependent variable.


and about 19 times as much of the variation in future value-added among reading teachers.39

The edge in explanatory power (as measured by R-squared) holds up when only three, two or

even a single year of data is used to compute past value-added though the differential generally

falls as value-added is computed from fewer years of data. Thus our findings are consistent with

those of Mihaly et al. (2013); if only a single measure is employed, past value-added is superior

to principal evaluations in predicting future teacher value-added.
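The empirical Bayes shrinkage described in footnote 38 amounts to pulling each noisy teacher effect toward the grand mean in proportion to its reliability. The sketch below uses assumed variances; it is not the study's estimator, only the standard shrinkage formula.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300  # hypothetical number of teachers

tau2 = 0.04                        # assumed variance of true teacher effects
se2 = rng.uniform(0.01, 0.09, n)   # assumed per-teacher sampling variances

true_effect = rng.normal(scale=np.sqrt(tau2), size=n)
raw = true_effect + rng.normal(scale=np.sqrt(se2))

# Empirical Bayes shrinkage: pull each noisy estimate toward the grand
# mean by its reliability tau^2 / (tau^2 + se^2); noisier estimates
# (larger se^2) are shrunk harder.
reliability = tau2 / (tau2 + se2)
shrunk = raw.mean() + reliability * (raw - raw.mean())

print("SD of raw estimates:     ", round(raw.std(), 3))
print("SD of shrunken estimates:", round(shrunk.std(), 3))
```

Because every reliability weight is below one, the shrunken estimates are less dispersed than the raw fixed effects, which is why shrinkage reduces attenuation bias when the estimates are used as a regressor.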

When prior value-added and principal ratings are combined to predict future teacher

performance, the contribution of principal ratings to the predictive power of the model also

depends on the precision of the past value-added measure. When past value-added is based on all

six years of achievement gain data before Summer 2006, principal ratings add virtually nothing

to the predictive power of past value-added in math or reading. The same is true when three or

two years of student achievement data are used to compute prior value-added. The results are

mixed when past value-added is based on a single year of data. If data from 2004/05 (and a constant

sample of teachers) are used, combining prior value-added with principal evaluations increases

the proportion of variation in future value-added that is explained from 7.9 percent to 10.2

percent in math and from 2.2 percent to 2.5 percent in reading, though it is not possible to reject

the null that principal ratings are uncorrelated with future value-added (conditional on past

value-added).
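The incremental contribution of principal ratings can be read off nested regressions: fit future value-added on prior value-added alone, on ratings alone, and on both, then compare R-squared values. The data-generating process below is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 250  # hypothetical number of teachers

# A persistent component drives future value-added; prior value-added
# measures it well, principal ratings only weakly (assumed loadings).
persistent = rng.normal(size=n)
prior_va = persistent + rng.normal(scale=0.5, size=n)
rating = 0.3 * persistent + rng.normal(size=n)
future_va = persistent + rng.normal(scale=1.5, size=n)

def r_squared(X, y):
    # OLS R-squared with an intercept.
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r_prior = r_squared(prior_va[:, None], future_va)
r_rating = r_squared(rating[:, None], future_va)
r_both = r_squared(np.column_stack([prior_va, rating]), future_va)
print(f"prior VA only: {r_prior:.3f}  rating only: {r_rating:.3f}  both: {r_both:.3f}")
```

When the prior value-added measure already captures most of the persistent component, adding the rating moves R-squared only slightly, which is the pattern the text describes for multi-year value-added.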

With all these analyses, as well as those by Mihaly et al. (2013), it is important to

emphasize that the only future measure of productivity we have is future value-added. In general,

the best predictor of any future measure will be the lag of the same measure, so it is not

39 For math the R-squared with only prior value-added in the model is 0.268 vs. 0.048 when only prior principal ratings are used to predict future value-added. For reading, the R-squared values are 0.133 and 0.007, respectively. Similar results are obtained when we exclude achievement data from the 05/06 school year from the value-added calculation and use only five years of test-score data (as in Jacob and Lefgren (2008)).


surprising that prior value-added is a better predictor of future value-added than is the principal

evaluation. Thus, this evidence cannot be used to draw conclusions about future productivity

broadly defined.

V. Summary and Conclusions

Current policy reforms designed to improve teacher quality all hinge on having good

measures of teacher productivity. There is ongoing debate over how best to measure teacher

performance, however. While teachers’ estimated effects on student test scores (value-added)

have been shown to be associated with positive effects on later life outcomes of their students,

value-added may not provide a complete picture of teacher productivity. Prior economic

research suggests that non-cognitive factors may be particularly important, and often overlooked,

determinants of productivity in occupations like teaching. Controlling for perceived ability to

raise test scores, we find that principals’ ratings of teachers are correlated with traditional human

capital measures like teacher intelligence, subject knowledge and teaching skills in both math

and reading, but they are also associated with non-cognitive personality traits like motivation and

enthusiasm, the ability to work well with others (math) and interpersonal skills (reading).

Coupled with the fact that the correlation between observational measures of teacher

performance and value-added is relatively low, this suggests that principal evaluations of teacher

performance may be a valuable complement to teacher ratings based on student test scores.

In addition to their ability to capture non-cognitive skills, evaluations by principals have a

potential cost advantage over other alternatives, like classroom observation by external

evaluators (as in the MET project). Principals collect most of their information in the natural


course of the job (e.g., informal conversations with parents, students, and other teachers), which

makes the marginal cost low.

The ability of principals to distinguish differences in teacher productivity does appear to

vary across principals and teachers. Principals are better able to distinguish differences between

veteran teachers than between early-career teachers. Likewise, the correlation between principal

evaluations and value-added is greater the longer a principal has known a teacher. While

principals vary in their ability to identify differences in teacher productivity, we could not

identify strong and consistent relationships between principals’ evaluation ability and either their

experience as an administrator or their tenure in a school. Future research in this area, with

larger samples, could explore the ways in which principal characteristics as well as

organizational forms and hierarchies influence principals’ ability to identify effective teachers.

Our analysis of the predictive power of principal ratings and past value-added also

informs the current policy debate over the use of test scores and subjective evaluations to

evaluate current teachers. When value-added measures are constructed from multiple years of

test score data, past value-added does a much better job at predicting future value-added than do

principal evaluations. However, if one only uses a single year of information to estimate prior

teacher value-added, principal evaluations add some information, though the gains are modest.

Thus, if the goal is to predict future value-added, subjective measures are likely to be of greatest

value when making retention and tenure decisions, especially for early-career teachers, for whom

there may be only a year or two of student test-score information.

While our analysis is informative regarding the various ways that teachers could be

assessed, it is important to be cautious in drawing broad policy conclusions from these results.

For example, while we have shown that prior value-added is the best predictor of future value-


added, future value-added is not necessarily an accurate indicator of overall future teacher

productivity. Value-added is a noisy measure of a teacher’s impact on current student

achievement and may not capture other valuable contributions a teacher makes to a student’s

long-run success. Also, the fact that principals’ assessments are positively related to future

value-added, and sometimes add information beyond prior value-added, does not mean that

evaluating teachers based on principals’ assessments would necessarily be a wise policy for

high-stakes personnel decisions. The assessments that principals offered in our study involved

no financial or employment implications for teachers and principals’ stated judgments could well

differ in a high-stakes context. Principals are likely to receive additional training in evaluation

techniques if called upon to make high-stakes evaluations of teachers and thus could make more

accurate assessments of teacher productivity. On the other hand, they may be reluctant to make

sharp distinctions between teachers if they know their evaluations will have significant

consequences. Even if principals would give the same assessments in high-stakes settings, doing

so could influence the working relationships between principals and teachers in unproductive

ways.

While caution is warranted, the practical reality is that many school systems around the

country are already making radical changes to the way in which teachers are evaluated and

compensated. Our results suggest principal evaluations can be a useful component of these new

teacher assessment systems. In systems where value-added metrics are used, including principal

evaluations will be most informative for early-career teachers (where value-added information is

less reliable). Further, because principal evaluations take into account a broader set of teacher

traits than those that directly affect student test scores, and provide more concrete feedback to


teachers to facilitate improvement, evaluations of teachers by principals are likely to be a useful

component of teacher assessment when outcomes beyond student achievement are valued.


References

Armor, David, Patricia Conry-Oseguera, Millicent Cox, Nicelma King, Lorraine McDonnell, Anthony Pascal, Edward Pauly and Gail Zellman (1976). Analysis of the School Preferred Reading Program in Selected Los Angeles Minority Schools. Report #R-2007-LAUSD. Santa Monica, CA: RAND Corporation.

Aaronson, Daniel, Lisa Barrow and William Sander (2007). “Teachers and Student Achievement in the Chicago Public High Schools.” Journal of Labor Economics 25(1): 95-135.

Bastian, Kevin C. (2013). “Do Teachers’ Non-Cognitive Skills and Traits Predict Effectiveness and Evaluation Ratings?” Unpublished manuscript.

Bommer, William H., Jonathan L. Johnson, Gregory A. Rich, Philip M. Podsakoff, and Scott B. MacKenzie (1995). “On the Interchangeability of Objective and Subjective Measures of Employee Performance: a Meta-analysis.” Personnel Psychology 48(3): 587-605.

Borghans, Lex, Angela Lee Duckworth, James J. Heckman, and Bas ter Weel (2008). “The Economics and Psychology of Personality Traits.” Journal of Human Resources 43(4): 972–1059.

Borghans, Lex, Bas ter Weel, and Bruce A. Weinberg (2008). “Interpersonal Styles and Labor Market Outcomes.” Journal of Human Resources 43(4): 815–858.

Boyd, Donald, Pamela Grossman, Hamilton Lankford, Susanna Loeb and James Wyckoff (2006). “How Changes in Entry Requirements Alter the Teacher Workforce and Affect Student Achievement.” Education Finance and Policy 1(2): 176-216.

Chetty, Raj, John N. Friedman and Jonah E. Rockoff (2011). “The Long-term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood.” Working paper no. 17699. Cambridge, MA: National Bureau of Economic Research.

Clotfelter, Charles T., Helen F. Ladd and Jacob L. Vigdor (2006). "Teacher-Student Matching and the Assessment of Teacher Effectiveness." Journal of Human Resources 41: 778-820.

Clotfelter, Charles T., Helen F. Ladd and Jacob L. Vigdor (2007). “Credentials and Student Achievement: Longitudinal Analysis with Student Fixed Effects.” Economics of Education Review 26: 673-682.

Clotfelter, Charles T., Helen F. Ladd and Jacob L. Vigdor (2010). “Teacher Credentials and Student Achievement in High School: A Cross-Subject Analysis with Student Fixed Effects.” Journal of Human Resources 45(3): 655-681.

Cunha, Flavio, James Heckman, Lance Lochner and Dimitry Masterov (2006). “Interpreting the Evidence on Life Cycle Skill Formation.” In Handbook of the Economics of Education, eds. Eric A. Hanushek and Finis Welch, 697-812. Amsterdam: North-Holland.


Gallagher, H. Alix (2004). “Vaughan Elementary's Innovative Teacher Evaluation System: Are Teacher Evaluation Scores Related to Growth in Student Achievement?” Peabody Journal of Education 79(4): 79-107.

Goldhaber, Dan and Duncan Chaplin (2012). “Assessing the ‘Rothstein Falsification Test’: Does it Really Show Teacher Value-added Models are Biased?” CEDR Working Paper 2012-1.3. Seattle, WA: University of Washington.

Guarino, Cassandra M., Mark Reckase and Jeffrey Wooldridge (2011). “Can Value-Added Measures of Teacher Performance be Trusted?” Unpublished manuscript.

Hanushek, Eric A., John F. Kain, Daniel M. O’Brien, and Steven G. Rivkin (2005). “The Market for Teacher Quality.” NBER Working Paper #11154.

Harcourt Assessment (2002). “SAT-10 to SAT-9 Scaled Score to Scaled Score Conversion Tables.”

Harris, Douglas N. and Andrew Anderson (2012). “Bias of Public Sector Worker Performance Monitoring: Theory and Empirical Evidence from Secondary School Teachers.” Paper presented at the annual meeting of the Association for Public Policy and Management.

Harris, Douglas N., Carolyn Herrington and A. Albee (2007). “The Future of Vouchers: Lessons from the Adoption, Design, and Court Challenges of Florida’s Three Voucher Programs.” Educational Policy.

Harris, Douglas N., William K. Ingle and Stacey A. Rutledge (2014). “How Teacher Evaluation Methods Matter for Accountability: A Comparative Analysis of Teacher Ratings by Principals and Teacher Value-Added Measures.” American Educational Research Journal. 51: 73-112.

Harris, Douglas N. and Stacey A. Rutledge (2010). “Models and Predictors of Teacher Effectiveness: A review of the Evidence with Lessons from (and for) Other Occupations.” Teachers College Record 112(3): 914–960.

Harris, Douglas N., Stacey Rutledge, William Ingle, and Cynthia Thompson (2010). “Mix and match: What principals really look for when hiring teachers.” Education Finance and Policy 5(2): 228-246.

Harris, Douglas N. and Tim R. Sass (2011). “Teacher Training, Teacher Quality and Student Achievement.” Journal of Public Economics 95: 798-812.

Heckman, James J., Jora Stixrud and Sergio Urzua (2006). “The Effects of Cognitive and Noncognitive Abilities on Labor Market Outcomes and Social Behavior.” Journal of Labor Economics 24(3): 411-482.

Heneman, Robert L. (1986). “The Relationship Between Supervisory Ratings and Results-Oriented Measures of Performance: a Meta-analysis.” Personnel Psychology 39: 811-826.


Ho, Andrew D. and Thomas J. Kane (2013). “The Reliability of Classroom Observations by School Personnel.” Seattle, WA: Bill and Melinda Gates Foundation.

Jackson, C. Kirabo (forthcoming). “Teacher Quality at the High-School Level: The Importance of Accounting for Tracks.” Journal of Labor Economics.

Jacob, Brian A., and Lars Lefgren (2005). “Principals as Agents: Subjective Performance Measurement in Education.” Working Paper #11463. Cambridge, MA: National Bureau of Economic Research.

Jacob, Brian A. and Lars Lefgren (2008). “Can Principals Identify Effective Teachers? Evidence on Subjective Performance Evaluation in Education.” Journal of Labor Economics, 26(1): 101-136.

Jepsen, Christopher (2005). “Teacher Characteristics and Student Achievement: Evidence from Teacher Surveys.” Journal of Urban Economics 57(2):302-19.

Kane, Thomas J., Jonah E. Rockoff and Douglas O. Staiger (2008). “What Does Certification Tell Us About Teacher Effectiveness? Evidence from New York City.” Economics of Education Review 27: 615-631.

Kane, Thomas J. and Douglas O. Staiger (2008). “Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation.” NBER Working Paper #14607. Cambridge, MA: National Bureau of Economic Research.

Kimball, Steven M., Brad White, Anthony Milanowski and Geoffrey Borman (2004). “Examining the Relationship Between Teacher Evaluation and Student Assessment Results in Washoe County.” Peabody Journal of Education 79(4): 54-78.

Koedel, Cory and Julian R. Betts (2011). “Does Student Sorting Invalidate Value-Added Models of Teacher Effectiveness? An Extended Analysis of the Rothstein Critique.” Education Finance and Policy 6(1): 18-42.

Lewis, Jeffrey B. and Drew A. Linzer (2005). “Estimating Regression Models in Which the Dependent Variable Is Based on Estimates.” Political Analysis, 13(4): 345–364.

McCaffrey, Daniel F., Tim R. Sass, J.R. Lockwood and Kata Mihaly (2009). “The Inter-temporal Variability of Teacher Effect Estimates.” Education Finance and Policy 4(4): 572-606.

Medley, Donald M. and Homer Coker (1987). “The Accuracy of Principals’ Judgments of Teacher Performance.” Journal of Educational Research 80(4): 242-247.

Mihaly, Kata, Daniel F. McCaffrey, J.R. Lockwood and Tim R. Sass (2010). “Centering and Reference Groups for Estimates of Fixed Effects: Modifications to felsdvreg.” Stata Journal 10(1): 82-103.

Mihaly, Kata, Daniel F. McCaffrey, Douglas O. Staiger and J.R. Lockwood (2013). “A Composite Estimator of Effective Teaching.” Unpublished manuscript.

Milanowski, Anthony (2004). “The Relationship Between Teacher Performance Evaluation Scores and Student Assessment: Evidence from Cincinnati.” Peabody Journal of Education 79(4): 33-53.

Morris, Carl N. (1983). “Practical Empirical Bayes Inference: Theory and Applications.” Journal of the American Statistical Association 78(381): 47-55.

Moulton, Brent R. (1990). “An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units.” Review of Economics and Statistics 72(2):334–38.

Murnane, Richard J. (1975). The Impact of School Resources on the Learning of Inner City Children. Cambridge, MA: Ballinger.

National Center for Education Statistics. (2006). Characteristics of Schools, Districts, Teachers, Principals, and School Libraries in the United States: 2003-04 Schools and Staffing Survey. Washington, DC.

Rivkin, Steven G., Eric A. Hanushek and John F. Kain (2005). “Teachers, Schools and Academic Achievement.” Econometrica 73(2): 417-58.

Rockoff, Jonah E. (2004). “The Impact of Individual Teachers on Student Achievement: Evidence from Panel Data.” American Economic Review 94(2): 247-52.

Rockoff, Jonah E., Douglas O. Staiger, Thomas J. Kane and Eric S. Taylor (2010). “Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools.” Working Paper #16240. Cambridge, MA: National Bureau of Economic Research.

Rockoff, Jonah E. and Cecilia Speroni (2011). “Subjective and Objective Evaluations of Teacher Effectiveness: Evidence from New York City.” Labour Economics 18: 687-696.

Rockoff, Jonah E., Douglas O. Staiger, Thomas J. Kane and Eric S. Taylor (2012). “Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools.” American Economic Review 102(7): 3184-3213.

Rothstein, Jesse (2010). “Teacher Quality in Educational Production: Tracking, Decay and Student Achievement.” Quarterly Journal of Economics 125(1): 175-214.

Sass, Tim R., Anastasia Semykina and Douglas N. Harris (2013). “Value-Added Models and the Measurement of Teacher Productivity.” Unpublished. Atlanta, GA: Georgia State University.

Todd, Petra E. and Kenneth I. Wolpin (2003). “On the Specification and Estimation of the Production Function for Cognitive Achievement.” Economic Journal 113(485):F3-F33.

Varma, Arup and Linda K. Stroh (2001). “The Impact of Same-sex LMX Dyads on Performance Evaluations.” Human Resource Management 40(4): 309-320.

Wiswall, Matthew (forthcoming). “The Dynamics of Teacher Quality.” Journal of Public Economics.

Xu, Zeyu, Jane Hannaway and Colin Taylor (2011). “Making a Difference? The Effects of Teach for America in High School.” Journal of Policy Analysis and Management 30(3):447-469.

Table 1
Sample Student and Teacher Characteristics

___________________________________________________________________________
                                                   Math Sample       Reading Sample
                                                 No. of Obs.  Mean  No. of Obs.  Mean
___________________________________________________________________________
Students
 Black                                              31645   0.367     30794   0.360
 Hispanic                                           31645   0.025     30794   0.024
 Free/Reduced Price Lunch                           31645   0.335     30794   0.329
 Achievement Gain                                   31645  20.729     30794  18.581
Teachers
 Male                                                1023   0.115      1024   0.079
 White                                               1023   0.695      1024   0.724
 Hold Advanced Degree                                1004   0.332      1008   0.350
 Fully Certified                                     1015   0.950      1019   0.955
 Taught Primarily Elementary School                  1023   0.727      1024   0.729
 Taught Primarily Middle School                      1023   0.149      1024   0.141
 Taught Primarily High School                        1023   0.124      1024   0.130
 Principal’s Overall Rating                           237   7.084       231   7.134
 Rating of Ability to Raise Test Scores               210   7.200       201   7.184
 Rating on “Caring”                                   237   7.384       231   7.463
 Rating on “Enthusiastic”                             237   7.249       231   7.372
 Rating on “Motivated”                                237   7.414       231   7.481
 Rating on “Strong Teaching Skills”                   237   7.544       231   7.636
 Rating on “Knows Subject”                            237   7.848       231   7.918
 Rating on “Communication Skills”                     237   7.612       231   7.758
 Rating on “Intelligence”                             237   7.911       231   7.970
 Rating on “Positive Relationship with Parents”       236   7.483       230   7.600
 Rating on “Positive Relationship with Students”      236   7.636       230   7.739
___________________________________________________________________________
Note: Includes only students and teachers for which a fixed effect could be computed for the teacher.

Table 2A
Pairwise Correlation of Estimated Teacher Fixed Effect and Observational Rating of Teacher - Math

Column groups and sources, left to right: Elementary (This Study, Jacob/Lefgren, RSKT, MMSL); Middle (This Study, RSKT, MMSL); Elem/Middle (This Study, RSKT); Middle/High (This Study); High (This Study); Elem/Middle/High (This Study). Correlations are listed in this order; cells with no reported estimate are omitted.

Principals
 VA and Principal “Overall”
  Unadjusted: 0.31*  0.19*  0.09  0.39*  0.27*  0.23*  0.28*  0.25  0.29*
  Adjusted: 0.34*  0.12  0.30*  0.33*  0.28  0.32*
 VA and Principal “Raise Scores/Achievement”
  Unadjusted: 0.37*  0.25*  0.23*  0.22  0.38*  0.33*  0.26*  0.31*  0.18  0.34*
  Adjusted: 0.41*  0.32*  0.28  0.36*  0.36*  0.20  0.37*

Trained Observers
 VA and CLASS Observational Rating (Adjusted): 0.28*  0.35*
 VA and Danielson FFT Observational Rating (Adjusted): 0.27*  0.41*

Students
 VA and Composite of Student Survey (Adjusted): 0.33*  0.44*
 VA and “Happy in Class” (Adjusted): 0.32*  0.37*

*Significant at 5%. Adjusted correlations take into account measurement error in the test score or measurement error in the observational rating. VA is based on state assessments. RSKT = Rockoff, Staiger, Kane and Taylor (2010); MMSL = Mihaly, McCaffrey, Staiger and Lockwood (2013). Elementary and middle school subsamples for RSKT are based on whether a teacher taught both math and English (and is therefore assumed to be an elementary teacher) or taught only math (assumed to be middle school).
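The “adjusted” correlations in the table account for measurement error in the test scores or observational ratings. The paper’s exact correction is not reproduced here, but the classical Spearman disattenuation formula illustrates the idea as a sketch: divide the observed correlation by the square root of the product of the two measures’ reliabilities. The reliability values below are hypothetical, not taken from the paper.

```python
import math

def disattenuated_corr(r_obs, rel_x, rel_y):
    """Spearman disattenuation: correct an observed correlation for
    classical measurement error in both variables.

    rel_x, rel_y are reliability ratios (signal variance / total
    variance) of the two measures; hypothetical values here.
    """
    return r_obs / math.sqrt(rel_x * rel_y)

# An observed correlation of 0.31 with reliabilities of 0.80 and 0.90
# adjusts upward to roughly 0.37.
print(round(disattenuated_corr(0.31, 0.80, 0.90), 2))
```

Because reliabilities are at most 1, the adjusted correlation is always at least as large as the observed one, consistent with the adjusted rows in the table generally exceeding the unadjusted rows.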

Table 2B
Pairwise Correlation of Estimated Teacher Fixed Effect and Observational Rating of Teacher - Reading

Column groups and sources, left to right: Elementary (This Study, Jacob/Lefgren, RSKT, MMSL); Middle (This Study, RSKT, MMSL); Elem/Middle (This Study, RSKT); Middle/High (This Study); High (This Study); Elem/Middle/High (This Study). Correlations are listed in this order; cells with no reported estimate are omitted.

Principals
 VA and Principal “Overall”
  Unadjusted: 0.30*  0.22*  0.17  0.16*  0.28*  0.23*  0.21  0.45  0.28*
  Adjusted: 0.38*  0.24  0.37*  0.30  0.54  0.38*
 VA and Principal “Raise Scores/Achievement”
  Unadjusted: 0.35*  0.18*  0.25*  -0.03  0.22*  0.28*  0.26*  -0.04  0.02  0.25*
  Adjusted: 0.44*  0.29*  -0.04  0.37*  -0.05  0.02  0.33*

Trained Observers
 VA and CLASS Observational Rating (Adjusted): 0.28*  0.32*
 VA and Danielson FFT Observational Rating (Adjusted): 0.28*  0.17

Students
 VA and Composite of Student Survey (Adjusted): 0.50*  0.29*
 VA and “Happy in Class” (Adjusted): 0.34*  0.11

*Significant at 5%. Adjusted correlations take into account measurement error in the test score or measurement error in the observational rating. VA is based on state assessments. RSKT = Rockoff, Staiger, Kane and Taylor (2010); MMSL = Mihaly, McCaffrey, Staiger and Lockwood (2013). Elementary and middle school subsamples for RSKT are based on whether a teacher taught both math and English (and is therefore assumed to be an elementary teacher) or taught only English (assumed to be middle school).

Table 3
Relationship of Estimated Teacher Fixed Effect to Principal’s Rating of Teacher’s Ability to Raise Achievement

Entries are conditional probabilities with standard errors in parentheses. The null is the probability expected if principals randomly assigned teacher ratings. “Difference” compares the observed-minus-null gaps of This Study and Jacob/Lefgren; Z-scores appear in brackets and p-values in parentheses.

Conditional probability that a teacher who received the top rating from the principal was the top teacher according to the value-added measure:
 Probability (SE): Math, This Study 0.72 (0.08); Math, Jacob/Lefgren 0.70 (0.13); Reading, This Study 0.67 (0.08); Reading, Jacob/Lefgren 0.55 (0.18)
 Null: Math 0.38, 0.26; Reading 0.38, 0.14
 Observed minus null [Z-score] (p-value): Math 0.34 [4.43] (0.00) and 0.44 [3.34] (0.00), difference -0.10 (0.51); Reading 0.29 [3.44] (0.00) and 0.41 [2.29] (0.02), difference -0.12 (0.54)

Conditional probability that a teacher who received a rating above the median from the principal was above the median according to the value-added measure:
 Probability (SE): Math, This Study 0.65 (0.08); Math, Jacob/Lefgren 0.59 (0.14); Reading, This Study 0.75 (0.09); Reading, Jacob/Lefgren 0.62 (0.12)
 Null: Math 0.32, 0.24; Reading 0.33, 0.33
 Observed minus null [Z-score] (p-value): Math 0.33 [3.98] (0.00) and 0.35 [2.41] (0.02), difference -0.02 (0.90); Reading 0.42 [4.76] (0.00) and 0.29 [2.49] (0.01), difference 0.13 (0.37)

Conditional probability that a teacher who received a rating below the median from the principal was below the median according to the value-added measure:
 Probability (SE): Math, This Study 0.51 (0.10); Math, Jacob/Lefgren 0.53 (0.13); Reading, This Study 0.51 (0.11); Reading, Jacob/Lefgren 0.51 (0.11)
 Null: Math 0.24, 0.26; Reading 0.25, 0.35
 Observed minus null [Z-score] (p-value): Math 0.27 [2.57] (0.01) and 0.27 [2.19] (0.03), difference 0.00 (1.00); Reading 0.26 [2.37] (0.02) and 0.16 [1.40] (0.16), difference 0.10 (0.53)

Conditional probability that a teacher who received the bottom rating from the principal was the bottom teacher according to the value-added measure:
 Probability (SE): Math, This Study 0.60 (0.10); Math, Jacob/Lefgren 0.61 (0.14); Reading, This Study 0.52 (0.13); Reading, Jacob/Lefgren 0.38 (0.22)
 Null: Math 0.25, 0.23; Reading 0.23, 0.09
 Observed minus null [Z-score] (p-value): Math 0.35 [3.47] (0.00) and 0.38 [2.76] (0.01), difference -0.03 (0.86); Reading 0.29 [2.32] (0.02) and 0.29 [1.30] (0.19), difference 0.00 (0.62)
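Table 3 pairs each observed-minus-null difference with a Z-score and a two-sided p-value. Assuming a standard normal reference distribution (which the reported pairs are consistent with), the p-values follow directly from the Z-scores; this sketch reproduces two of the reading-column entries:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a Z statistic under the standard normal:
    p = 2 * (1 - Phi(|z|)), computed via the error function."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# Two entries from the reading columns of Table 3:
print(round(two_sided_p(2.37), 2))  # 0.02
print(round(two_sided_p(1.40), 2))  # 0.16
```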

Table 4
Pairwise Correlation of Estimated Teacher Fixed Effect and Principal’s Rating of Teacher’s Ability to Raise Achievement, by Teacher and Principal Characteristics

______________________________________________________________________________
                                            Math                  Reading
                                     This     Jacob/       This     Jacob/
                                     Study    Lefgren      Study    Lefgren
______________________________________________________________________________
Baseline                             0.34**   0.25**       0.25**   0.18**
                                     [0.37]   [0.32]       [0.33]   [0.29]
______________________________________________________________________________
Experienced Teachers (≥ 11 years)    0.36**   0.34**       0.30**   0.29**
                                     [0.40]   [0.39]       [0.39]   [0.35]
Inexperienced Teachers (< 11 years)  0.27**   0.13         0.20**   0.07
                                     [0.30]   [0.20]       [0.26]   [0.38]
______________________________________________________________________________
Principal Known Teacher ≥ 4 years    0.45**   0.29**       0.29**   0.22**
                                     [0.50]   [0.35]       [0.37]   [0.28]
Principal Known Teacher < 4 years    0.26**   0.21         0.24**   0.13
                                     [0.29]   [0.29]       [0.31]   [0.49]
______________________________________________________________________________
Principal at Current School ≥ 4 yrs  0.35**   0.12         0.25**   0.08
                                     [0.40]   [0.16]       [0.33]   [0.12]
Principal at Current School < 4 yrs  0.31**   0.35**       0.26**   0.29**
                                     [0.35]   [0.44]       [0.33]   [0.49]
______________________________________________________________________________
Principal’s Admin. Exp. ≥ 11 years   0.39**   --           0.27**   --
                                     [0.43]   --           [0.34]   --
Principal’s Admin. Exp. < 11 years   0.30**   --           0.25**   --
                                     [0.34]   --           [0.32]   --
______________________________________________________________________________
Note: ** indicates significance at the .05 level. Correlations adjusted for estimation error in estimated teacher fixed effects are in brackets. The This Study sample includes elementary, middle and high school teachers; the Jacob/Lefgren sample includes only elementary school teachers.

Table 5
FGLS Estimates of the Relationship Between Teacher Fixed Effects and Teacher Characteristic Factors
(Grades 2 – 10, 1999/2000 – 2005/06)

Coefficients (standard errors) for models [1] and [2], by subject:

Model [1]:
 Interpersonal Skill: Math -0.004 (0.023); Reading -0.010 (0.015)
 Knowledge/Teaching Skills/Intelligence: Math 0.058*** (0.019); Reading -0.009 (0.014)
 Motivation/Enthusiasm: Math 0.008 (0.022); Reading 0.042*** (0.013)
 Works Well With Others: Math -0.001 (0.023); Reading 0.014 (0.014)

Model [2]:
 Interpersonal Skill, Elementary: Math -0.009 (0.025); Reading -0.006 (0.017)
 Interpersonal Skill, Middle/High: Math 0.018 (0.058); Reading -0.023 (0.037)
 Knowledge/Teaching Skills/Intelligence, Elementary: Math 0.057*** (0.022); Reading -0.013 (0.016)
 Knowledge/Teaching Skills/Intelligence, Middle/High: Math 0.054 (0.044); Reading 0.007 (0.035)
 Motivation/Enthusiasm, Elementary: Math 0.007 (0.025); Reading 0.043*** (0.014)
 Motivation/Enthusiasm, Middle/High: Math 0.027 (0.063); Reading 0.044 (0.037)
 Works Well With Others, Elementary: Math 0.008 (0.026); Reading 0.017 (0.016)
 Works Well With Others, Middle/High: Math -0.046 (0.059); Reading 0.003 (0.026)

R-squared: Math [1] 0.137, [2] 0.140; Reading [1] 0.145, [2] 0.148
No. of Observations: Math 207, 207; Reading 203, 203

Note: Standard errors appear in parentheses. * indicates statistical significance at the .10 level, ** indicates significance at the .05 level and *** indicates significance at the .01 level in a two-tailed test. All models include controls for teacher experience, attainment of an advanced degree and a constant term.
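The dependent variable in Table 5 is an estimated teacher fixed effect, so its sampling error inflates the residual variance and makes observations unequally precise. The paper’s exact FGLS procedure is not printed here, but a common two-step approach to this estimated-dependent-variable problem, in the spirit of the cited Lewis and Linzer (2005), is sketched below; the function name and simulated data are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def edv_fgls(y, x, se_y):
    """Two-step FGLS for a regression whose dependent variable y is
    itself an estimate with known sampling standard errors se_y
    (a sketch in the spirit of Lewis and Linzer, 2005).

    Total residual variance = structural variance (common across
    observations) + observation-specific sampling variance se_y**2.
    """
    X = np.column_stack([np.ones_like(y), x])
    # Step 1: OLS to obtain residuals.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # Step 2: estimate the structural variance component (floored at 0).
    sigma2 = max(resid.var(ddof=X.shape[1]) - np.mean(se_y ** 2), 0.0)
    # Step 3: weighted least squares with weights 1 / (sigma2 + se_i^2).
    w = np.sqrt(1.0 / (sigma2 + se_y ** 2))
    beta_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta_fgls  # [intercept, slope]
```

With homoskedastic sampling errors this collapses to OLS; the weighting matters when fixed-effect estimates differ in precision, for example across teachers observed for different numbers of years.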

Table 6
Ordinary Least Squares Estimates of the Relationship Between Principal Overall Ratings of Teachers, Principal Ratings of Ability to Raise Test Scores and Teacher Characteristic Factors
(Grades 2 – 10, 1999/2000 – 2005/06)

Coefficients (standard errors) for models [1]–[3], by subject:

Ability to Raise Test Scores: Math [1] 0.730*** (0.048), [2] 0.139*** (0.046), [3] 0.143** (0.057); Reading [1] 0.736*** (0.047), [2] 0.179*** (0.052), [3] 0.201*** (0.066)
Ability to Raise Test Scores, Middle/High: Math [3] -0.025 (0.104); Reading [3] -0.119 (0.110)
Interpersonal Skill: Math [2] 0.042 (0.046), [3] 0.037 (0.052); Reading [2] 0.099* (0.059), [3] 0.040 (0.066)
Interpersonal Skill, Middle/High: Math [3] -0.003 (0.135); Reading [3] 0.394** (0.164)
Knowledge/Teaching Skills/Intelligence: Math [2] 0.539*** (0.048), [3] 0.526*** (0.057); Reading [2] 0.162*** (0.056), [3] 0.199*** (0.061)
Knowledge/Teaching Skills/Intelligence, Middle/High: Math [3] 0.067 (0.121); Reading [3] -0.237 (0.154)
Motivation/Enthusiasm: Math [2] 0.088* (0.046), [3] 0.093* (0.050); Reading [2] 0.490*** (0.057), [3] 0.479*** (0.066)
Motivation/Enthusiasm, Middle/High: Math [3] 0.049 (0.151); Reading [3] -0.035 (0.158)
Works Well With Others: Math [2] 0.212*** (0.048), [3] 0.240*** (0.054); Reading [2] 0.070 (0.052), [3] 0.104* (0.061)
Works Well With Others, Middle/High: Math [3] -0.145 (0.144); Reading [3] -0.130 (0.122)

R-squared: Math 0.537, 0.859, 0.861; Reading 0.549, 0.793, 0.802
No. of Observations: Math 202, 173, 173; Reading 201, 175, 175

Note: Standard errors appear in parentheses. * indicates statistical significance at the .10 level, ** indicates significance at the .05 level and *** indicates significance at the .01 level in a two-tailed test. All models include controls for teacher experience, attainment of an advanced degree and a constant term.

Table 7A
Relationship Between Prior Evaluations of Teachers and Future Student Achievement or Future Teacher Value-added - Math

Entries are coefficients (standard errors) and R2 values. Outcome columns: future student achievement (This Study, Jacob/Lefgren, Rockoff/Speroni) and future teacher value-added (This Study).

Prior Value-added Only
 Using up to 5 or 6 years of prior performance:
  Achievement (This Study): 0.854** (0.103), R2 = 0.67
  Achievement (Jacob/Lefgren): 0.207** (0.022), R2 = 0.41
  Value-added (This Study): 0.677** (0.089), R2 = 0.27
 Using up to 3 years of prior performance:
  Achievement (This Study): 0.861** (0.096), R2 = 0.67
  Value-added (This Study): 0.677** (0.085), R2 = 0.29
 Using up to 2 years of prior performance:
  Achievement (This Study): 0.868** (0.104), R2 = 0.67
  Achievement (Rockoff/Speroni): 0.088** (0.006), R2 = 0.67
  Value-added (This Study): 0.746** (0.090), R2 = 0.30
 Using 1 year of prior performance:
  Achievement (This Study): 0.481** (0.077), R2 = 0.67
  Value-added (This Study): 0.250** (0.068), R2 = 0.08

Prior Rating by Principal Only
 Overall Rating:
  Achievement (This Study): 0.062** (0.015), R2 = 0.66
  Achievement (Jacob/Lefgren): 0.141** (0.023), R2 = 0.40
  Value-added (This Study): 0.053** (0.018), R2 = 0.05
 Ability to Raise Test Scores/Achievement:
  Achievement (This Study): 0.068** (0.016), R2 = 0.67
  Achievement (Jacob/Lefgren): 0.100** (0.029), R2 = 0.39
  Value-added (This Study): 0.045** (0.019), R2 = 0.04

Prior Rating by Observers Only
 Teaching Fellow:
  Achievement (Rockoff/Speroni): 0.008 (0.012), R2 = 0.67
 Mentor:
  Achievement (Rockoff/Speroni): 0.054** (0.009), R2 = 0.67

Prior Value-added and Principal/Observer Rating
 Prior value-added (5/6 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.806** (0.106), rating 0.028* (0.015), R2 = 0.67
  Achievement (Jacob/Lefgren): value-added 0.176** (0.023), rating 0.077** (0.025), R2 = 0.42
  Value-added (This Study): value-added 0.644** (0.092), rating 0.022 (0.017), R2 = 0.28
 Prior value-added (up to 3 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.816** (0.102), rating 0.026* (0.015), R2 = 0.67
  Value-added (This Study): value-added 0.655** (0.090), rating 0.013 (0.017), R2 = 0.29
 Prior value-added (up to 2 years of prior performance) with teaching fellow’s rating:
  Achievement (Rockoff/Speroni): value-added 0.094** (0.010), rating 0.005 (0.010), R2 = 0.67
 Prior value-added (up to 2 years of prior performance) with mentor’s rating:
  Achievement (Rockoff/Speroni): value-added 0.085** (0.007), rating 0.031** (0.008), R2 = 0.67
 Prior value-added (up to 2 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.816** (0.112), rating 0.028* (0.015), R2 = 0.67
  Value-added (This Study): value-added 0.713** (0.093), rating 0.021 (0.016), R2 = 0.31
 Prior value-added (one year of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.430** (0.076), rating 0.043** (0.151), R2 = 0.67
  Value-added (This Study): value-added 0.213** (0.070), rating 0.038** (0.019), R2 = 0.10

Table 7B
Relationship Between Prior Evaluations of Teachers and Future Student Achievement or Future Teacher Value-added - Reading

Entries are coefficients (standard errors) and R2 values. Outcome columns: future student achievement (This Study, Jacob/Lefgren, Rockoff/Speroni) and future teacher value-added (This Study).

Prior Value-added Only
 Using up to 5 or 6 years of prior performance:
  Achievement (This Study): 0.873** (0.199), R2 = 0.55
  Achievement (Jacob/Lefgren): 0.106** (0.015), R2 = 0.48
  Value-added (This Study): 0.958** (0.207), R2 = 0.13
 Using up to 3 years of prior performance:
  Achievement (This Study): 0.311** (0.125), R2 = 0.55
  Value-added (This Study): 0.496** (0.113), R2 = 0.12
 Using up to 2 years of prior performance:
  Achievement (This Study): 1.116** (0.260), R2 = 0.55
  Achievement (Rockoff/Speroni): 0.018** (0.004), R2 = 0.61
  Value-added (This Study): 1.219** (0.247), R2 = 0.15
 Using 1 year of prior performance:
  Achievement (This Study): 0.249 (0.153), R2 = 0.55
  Value-added (This Study): 0.227* (0.130), R2 = 0.02

Prior Rating by Principal Only
 Overall Rating:
  Achievement (This Study): 0.029 (0.021), R2 = 0.55
  Achievement (Jacob/Lefgren): 0.070** (0.020), R2 = 0.48
  Value-added (This Study): 0.020 (0.021), R2 = 0.01
 Ability to Raise Test Scores/Achievement:
  Achievement (This Study): -0.031 (0.019), R2 = 0.55
  Achievement (Jacob/Lefgren): 0.051** (0.019), R2 = 0.48
  Value-added (This Study): 0.017 (0.021), R2 = 0.01

Prior Rating by Observers Only
 Teaching Fellow:
  Achievement (Rockoff/Speroni): 0.003 (0.009), R2 = 0.61
 Mentor:
  Achievement (Rockoff/Speroni): 0.023** (0.006), R2 = 0.61

Prior Value-added and Principal/Observer Rating
 Prior value-added (5/6 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.867** (0.207), rating 0.002 (0.002), R2 = 0.55
  Achievement (Jacob/Lefgren): value-added 0.096** (0.017), rating 0.045** (0.020), R2 = 0.49
  Value-added (This Study): value-added 0.999** (0.221), rating -0.011 (0.021), R2 = 0.13
 Prior value-added (up to 3 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.291** (0.123), rating 0.022 (0.021), R2 = 0.55
  Value-added (This Study): value-added 0.502** (0.119), rating -0.004 (0.021), R2 = 0.12
 Prior value-added with teaching fellow’s rating:
  Achievement (Rockoff/Speroni): value-added 0.015* (0.009), rating 0.001 (0.009), R2 = 0.61
 Prior value-added with mentor’s rating:
  Achievement (Rockoff/Speroni): value-added 0.018** (0.004), rating 0.020** (0.006), R2 = 0.61
 Prior value-added (up to 2 years of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 1.092** (0.264), rating 0.009 (0.020), R2 = 0.55
  Value-added (This Study): value-added 1.240** (0.258), rating -0.006 (0.020), R2 = 0.15
 Prior value-added (one year of prior performance) with principal’s overall rating:
  Achievement (This Study): value-added 0.203 (0.157), rating 0.019 (0.022), R2 = 0.55
  Value-added (This Study): value-added 0.212 (0.132), rating 0.015 (0.021), R2 = 0.02