Bayesian Inference for Some Value-Added Productivity Indicators
DESCRIPTION
Bayesian Inference for Some Value-Added Productivity Indicators. Yeow Meng Thum. Measurement & Quantitative Methods, Counseling, Educational Psychology, & Special Education, College of Education, Michigan State University. Conference on Longitudinal Modeling of Student Achievement.
TRANSCRIPT
Y. M. Thum / MSU. Nov 2005, Bayesian Inference for Some Value-Added Indicators.
Bayesian Inference for Some Value-Added Productivity Indicators
Yeow Meng Thum
Measurement & Quantitative Methods, Counseling, Educational Psychology, & Special Education
College of Education, Michigan State University
Conference on Longitudinal Modeling of Student Achievement, University of Maryland, November 2005
Overview and Conclusions
Recent thinking has raised doubts about the validity of the so-called “teacher effects” or “school effects” captured in applications.
We may have the models, but we do not have suitable data to support claims of causal agency (a question of design).
This places doubt on whether the empirical evidence for “accountability,” on the basis of which a teacher, a program, or a school may be identified as responsible for improvement or for failure, is so directly accessible.
Purpose: Suggest that descriptive measures of productivity and improvement for accounting units (teachers or schools) are still valid given the accountability data. This is where we begin, so:
Focus: Measurement, leaving aside structural relationships (until we have better data).
- Employ a well-defined database (evidence base).
- Build productivity indicators that address value-added hypotheses about growth and change.
- Design procedures for their inference: Bayesian.
Start by Defining & Measuring Value-Added Performance (Thum, 2003a)
1. Make the accountability data block explicit.
2. The value-added notion is keyed on our ability to measure change. Begin with a model for the learning change in the student: multivariate multi-cohort growth modeling (Thum, 2003b).
3. To measure change, estimate gains.
4. Multiple outcomes help.
5. Employ the standard error of measurement (SEM) of the score.
6. Metric matters for measuring change.
7. Require model-based aggregation & inference.
8. Keep the “black box” open.
Longitudinal Student Data Is the Key Evidence Base
[Figure: longitudinal student cohorts arrayed in a data block by Grade (3-8) and Year (2001-2008). A quasi-longitudinal design is longitudinal at the school-grade level.]
The definition of the data block must be integral to any accountability criteria/system.
Point? A “constant ballast”: standardize the evidence base to stabilize comparisons.
Multivariate Multi-Cohort Mixed-Effects Model
Within each school j (example):

y_tic^k = π_0ic^k + π_1ic^k · TIME_tic + Σ_s β_sic^k · X_stic + ε_tic^k
π_0ic^k = β_00c^k + r_0ic^k,   π_1ic^k = β_10c^k + r_1ic^k

where k is the test subject, t is time, i is the student, c is the cohort, and X_s are student covariates, with

ε_ic ~ N(0, Σ_c ⊗ I_{n_ic}),   r_ic ~ N(0, T_c).

Between schools (a Bayesian multivariate meta-analysis):

γ̂_cj | γ_cj ~ N(γ_cj, V_cj),   γ_cj | ζ_c ~ N(W_j ζ_c, Δ_c),

where W is a set of school-level covariates.
Why Focus on Gains?
Thum (2003) offered a summary of some reasons. The gain score
- is not inherently unreliable (Rogosa, among others);
- is not always predictable from knowledge of initial status;
- is conceptually congruent with, and is an unbiased estimator of, the true gain score;
- does not sum to zero by construction for the group;
- places the pre-test AND the post-test on equal footing as outcomes, and thus generalizes directly to growth modeling.
In contrast, the residual gain score
- ranks only “relative progress,” allowing for “adjusted comparisons,” but is by no means “corrected” for anything in particular;
- makes an individual’s gain dependent on who else is included in, or excluded from, the regression, and as such leaves gain measurement open to manipulation;
- sums to zero for a group, which severely limits its utility for representing overall change;
- violates the regression requirement that the pre-test be error-free;
- does not generalize easily to longer time series.
Additionally, expanding on the conceptual congruence of the gain score with true gain, note that the gain score is ALSO the ideal for supporting causal claims under the widely considered Rubin-Holland counterfactual framework: with the gain score, we do not need to guess the result in the unobserved “counterfactual” condition!
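The contrast above between gain scores and residual gain scores can be seen numerically. This is a minimal sketch with simulated pre/post scores (all numbers hypothetical): gain scores carry the group's overall change, while residual gains, being ordinary least-squares residuals, sum to zero by construction and can only rank relative progress.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre- and post-test scores for one classroom of 50 students.
pre = rng.normal(600, 25, size=50)
post = pre + rng.normal(15, 8, size=50)   # true average growth of about 15 points

# Simple gain score: post minus pre.
gain = post - pre

# Residual gain score: residual from regressing post on pre.
slope, intercept = np.polyfit(pre, post, 1)
residual_gain = post - (intercept + slope * pre)

# Gains preserve the group's overall change; residual gains sum to zero
# by construction, so they cannot represent overall change at all.
print(round(float(gain.mean()), 2))            # near 15
print(round(float(residual_gain.sum()), 6))    # 0.0 up to rounding error
```

This is why the slide notes that the residual gain score "sums to zero for a group": the zero-sum property is an artifact of the regression, not a fact about the students.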
Overall Strategy: Obtain a good-fitting (measurement) model for each school (a surface for math, say), then construct and evaluate the relevant value-added hypotheses for the school.
[Figure: predicted math outcome ŷ_tic^math plotted as a surface over Grade and Year.]
Note: The surface need not be flat.
So, there are ONLY value-added hypotheses, NOT value-added models!
It is up to us to define: what progress are we talking about?
How-to: Get the best data available, smooth it for irregularities with the most reasonable model, and construct from the “signal” statistics that address your hypotheses directly.
Basic “Progress Hypotheses”
[Figure: the data block by Grade (1-5) and Year (1998-2002), highlighting three readings: (Q1) cohorts, (Q2) grade-level means, and (Q3) grade-level PACs.]
Criterion AND Norm Referencing: dual reporting formats for two questions about your achievement
[Figure: analysis/reporting for Grade 3 maps criterion-referenced scale scores, with cut-scores C1 and C2 dividing levels L1 (Basic), L2 (Proficient), and L3 (Advanced), onto norm-referenced NRT (NCE) reporting with corresponding cuts C'1 and C'2.]
Some Value-Added Hypotheses
Examples with the school as the accounting unit, based on:

Q1 (estimated cohort growth rates):
- Change over time in cohort growth rates
- Inter-cohort contrasts of cohort growth rates
- Comparison to a proposed external or system standard*

Q2 (predicted grade-level means):
- Estimated grade-level growth over time
- Inter-grade contrasts of grade-level growth rates
- Value-added over projected status
- Total output over time: combining initial status & growth rate**
- Comparison to a proposed external or system standard

Q3 (predicted grade-level PACs):
- Change over time in grade-level PACs
- Inter-grade, between-year contrasts of grade-level PACs
- Comparison to a proposed external or system standard

* An example is the 100% Proficient standard under NCLB. Other examples may compare schools with each other, or with “similar” schools determined by ranking on a selected covariate set (à la California), etc.
** Thum & Chinen (in preparation)
Standards of Progress
To judge progress fully, we rely on standards, or benchmarks, absolute and contextualized, whenever these are available.
Within EACH school (over time), we might consider the progress of
- different subjects, or their composites;
- different grades, or their aggregates (lower primary, etc.);
- different student cohorts, or their comparisons;
- different sub-groups.
All of the above may be compared, individually or in groups of school-grades, with
- the district average, schools-like-mine, etc.;
- fixed district goals.
Comparing Cohort Slopes: Improvement (Q1)
[Figure: two panels of Score vs. Year (1998-2002), one illustrating decreasing productivity and one increasing productivity, as read from the cohort slopes.]
Is School 201 Improving? Cohort Regressions (Q1)
[Figure: School 201 cohort regression lines (Cohorts 3 through 7) over Years 0-6; Math on a 510-630 score scale, Reading on a 540-640 score scale.]
Is School 201 Getting More Effective? (Q1)

School 201                            Math           Read
Latest cohort growth rate             30.85 (2.44)   23.95 (6.28)
Change in growth rate                 3.33 (0.97)    1.83 (0.96)
Is change in growth rate positive?    100%           97%

This compares present with past performance. We can also compare School 201’s latest growth rate with the district average, or with the average of schools “similar” to School 201.
What is Adequate Yearly Progress? Example via an Empirical Definition (Q2)
AYP must take into account
- where you start, and
- where you should end up (mandated),
between the present time, t, and the mandated time frame to reach proficiency (T = 12).
Thus, AYP may be defined as the growth rate that will place you on the target given where you are presently, such as

(Y_T − Y_t) / (T − t), or some more refined version,

where Y_T is the cut-score for “proficiency” and Y_t is the present score.
This DOES NOT mean the analysis needs to be performed on categorical data.
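The empirical definition above is a one-line computation. As a sketch, the helper below (the function name is mine) reproduces the required-rate arithmetic using the Grade 3 mathematics figures that appear later in this deck: a status of 598.30 in 2002 and a cut-score of 618 to be reached by 2014.

```python
def required_growth_rate(y_t: float, cut_score: float, t: int, T: int = 12) -> float:
    """Minimum annual growth rate, (Y_T - Y_t) / (T - t), needed to reach
    the proficiency cut-score Y_T by mandated time T, starting from score
    y_t at the present time t."""
    if t >= T:
        raise ValueError("present time t must precede the mandated time T")
    return (cut_score - y_t) / (T - t)

# Grade 3 math example from this deck: 598.30 in 2002, cut-score 618 by 2014.
rate = required_growth_rate(598.3, 618.0, t=2002, T=2014)
print(round(rate, 2))  # 1.64 points per year
```

This 1.64 is exactly the lower-bound AYP rate reported in the worked example later in the deck; the point is that AYP here is a rate defined from the school's own current status, not a category.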
Predicted Grade-Year Means (Q2)
[Figure: predicted scores (500-800) over Years 1-5 and Grades 1-6, asking: Is the Grade 1 predicted average increasing? Is the Grade 4 predicted average increasing?]
Based on a model for the information contained in the data block …
Improvement by Grade (Q2)
[Figure: School 202 predicted means for Grades 1-5 over Years 1-5, Math and Reading, on a 550-630 score scale.]
Assessing AYP-NCLB
[Figure: mean outcome Y vs. time X; observed means Y_1, …, Y_4 and a fitted line projected to the mandated time T, against cut-scores C_L and C_U separating levels L1 (Basic), L2 (Proficient), and L3 (Advanced).]
Object of inference: Prob( β̂_Lower^(4) ≤ β̂^(4) ≤ β̂_Upper^(4) | Y ), with fitted means Ŷ_j = X_j β̂^(4), j = 1, 2, 3, 4, and

β̂_Lower^(4) = (C_L − Ŷ_4) / (12 − 4), the lower bound of the school’s AYP for Time = 4;
β̂_Upper^(4) = (C_U − Ŷ_4) / (12 − 4), the upper bound of the school’s AYP for Time = 4.
Defining AYP-NCLB
Question: Given where you are at this point in time, are you improving at a pace that will put you on the specified target in the remaining time frame?
Answer: If you are growing at β̂^(t) at time t, your minimum growth rate to reach the target is β̂_L^(t), and so you make AYP-NCLB if β̂^(t) ≥ β̂_L^(t), with probability Prob( β̂^(t) ≥ β̂_L^(t) | Y ).
Implication: AYP depends on the performance of the school, so it changes over time. Classification errors are directly assessed.
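With MCMC output, the probability Prob(β̂ ≥ β̂_L | Y) above is just the share of joint posterior draws in which the growth rate clears the required rate. This sketch uses independent normal draws as stand-ins for a real posterior, with means and standard errors matching the deck's Grade 3 mathematics example (growth rate 4.49, s.e. 1.79; required rate 1.64, s.e. 0.32).

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior draws (a real analysis would take these from the MCMC fit).
beta_draws = rng.normal(4.49, 1.79, size=20_000)    # school's growth rate
beta_L_draws = rng.normal(1.64, 0.32, size=20_000)  # required AYP rate

# Prob(beta >= beta_L | Y): the fraction of draws meeting the target pace.
prob_ayp = float(np.mean(beta_draws >= beta_L_draws))
print(round(prob_ayp, 2))  # close to the 0.94 reported in the deck
```

The same draws also yield the classification-error reading directly: 1 minus this probability is the chance the school is wrongly credited with making AYP under this rule.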
Making AYP-NCLB (Q2)
[Figure: School 202, Grade 3 Mathematics and Grade 3 Reading means over Years 1996-2014, each with its proficiency cut-score drawn as a horizontal line.]

Grade 3 Mathematics:
Ŷ_t^(5) = 598.30 + 4.49 (YEAR_t − 2002), so β̂^(5) = 4.49 (1.79).
β̂_L^(5) = (618 − 598.30) / (2014 − 2002) = 1.64 (0.32).
Prob( β̂^(5) ≥ β̂_L^(5) | Y ) = 0.94.

Grade 3 Reading:
Ŷ_t^(5) = 601.40 + 1.60 (YEAR_t − 2002), so β̂^(5) = 1.60 (1.67).
β̂_L^(5) = (624 − 601.40) / (2014 − 2002) = 1.88 (0.34).
Prob( β̂^(5) ≥ β̂_L^(5) | Y ) = 0.44.
Trend in Percent Proficient (Q2)
[Figure: score Y vs. time X; observed means Y_1, …, Y_4 projected to time T against the cut-score C_L separating Not Proficient from Proficient.]
Object of inference:

P̂_t = (100 / n_t) Σ_i Prob( Ŷ_it ≥ C_Lit | Y )

SAFE HARBOR: A school makes AYP if its percent proficient increased by 10%.
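The P̂_t formula above averages, over the n_t students, each student's posterior probability of clearing the cut. A sketch with simulated posterior predictive draws (all means, spreads, and the cut-score below are hypothetical stand-ins for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(5)
n_students, n_draws = 200, 4000
cut = 600.0  # hypothetical proficiency cut-score C_L

# Stand-in posterior predictive draws of each student's score this year:
# rows are MCMC draws, columns are students.
student_means = rng.normal(605, 20, size=n_students)
draws = rng.normal(student_means, 15.0, size=(n_draws, n_students))

# Prob(Y_it >= C_L | Y), one posterior probability per student.
prob_proficient = np.mean(draws >= cut, axis=0)

# P_t = (100 / n_t) * sum_i Prob(Y_it >= C_L | Y)
P_t = 100.0 * float(prob_proficient.mean())
print(round(P_t, 1))
```

Because each student contributes a probability rather than a hard 0/1 classification, the indicator carries its measurement uncertainty with it, which is what makes the categorical NCLB target analyzable without categorical data.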
Value-Added over Projected Status (Q2)
[Figure: mean Y vs. time X; means Y_1, …, Y_4 fit a trend, Ŷ_j = X_j β̂^(4), j = 1, 2, 3, 4, projected to time 5 and compared with the observed Y_5, against cut-scores C_L and C_U for levels L1 (Basic), L2 (Proficient), and L3 (Advanced).]
Object of inference: Prob( v̂_5^(4) ≥ Standard | Y ), where v_5^(4) = Y_5 − E[Y_5^(4)] is the Year-5 outcome in excess of its projection from the first four years.
Some standards: for the school, the district, schools-like-mine.
Total Output: School Excellence as a Value-Added Hypothesis (Q2)
Object of inference:

Prob( ∫ from x = 1 to T of f̂_j(x) dx ≥ Standard | Y ), j = A, B, C, D,

the areas under the predicted curves f̂_A(x), f̂_B(x), f̂_C(x), f̂_D(x)!
[Figure: predicted growth curves for four schools, j = A, B, C, D, plotted over time X from 1 to T.]
Comparing 4th-grade growth for schools A, B, C, and D in a way that combines growth AND final status!
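The total-output idea above can be sketched numerically: evaluate each posterior draw of a school's growth curve on a time grid, take the area under it by the trapezoid rule, and compare areas draw by draw. The linear curves and the two hypothetical schools below are stand-ins for a real MCMC fit.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5
x = np.linspace(1, T, 101)  # time grid from 1 to T

def total_output(intercepts, slopes):
    """Trapezoid-rule area under f(x) = intercept + slope * x on [1, T],
    computed separately for each posterior draw."""
    curves = intercepts[:, None] + slopes[:, None] * x[None, :]
    return np.sum((curves[:, 1:] + curves[:, :-1]) * 0.5 * np.diff(x), axis=1)

# Two hypothetical schools: A starts higher, B grows faster.
area_A = total_output(rng.normal(600, 5, 5000), rng.normal(2, 1, 5000))
area_B = total_output(rng.normal(585, 5, 5000), rng.normal(10, 1, 5000))

# Prob(area_B >= area_A | Y): does B's faster growth outweigh A's head start
# once initial status AND growth are combined into one total-output number?
prob_B_beats_A = float(np.mean(area_B >= area_A))
print(round(prob_B_beats_A, 2))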
Why Bayesian Inference?
Basic components: See O’Hagan (1994) for a summary. Basically formulated here as an enhancement of likelihood inference.
Highlighted advantages:
- Conceptual: credibility intervals, as the likely range of the true parameters, are more natural vis-à-vis Neyman-Pearson confidence intervals.
- Analytically less demanding, using statistics to do statistics via Markov chain Monte Carlo (MCMC); inference for ratios is straightforward.
Disadvantages:
- Where do we get our priors? Not a problem (for long, anyway) with longitudinal data.
- Computationally intensive, and in the normally large accountability applications we need to proceed carefully.
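The "inference for ratios is straightforward" point deserves a concrete illustration: with MCMC output you transform each joint posterior draw and summarize, with no delta method or asymptotic approximation. The normal draws below are hypothetical stand-ins for posterior draws of a school's growth rate and the district average rate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in posterior draws (a real analysis would reuse the MCMC chains).
school_rate = rng.normal(4.5, 1.2, size=20_000)
district_rate = rng.normal(3.0, 0.4, size=20_000)

# Posterior of the ratio: divide draw by draw, then read off quantiles
# for a credibility interval on "times the district rate".
ratio = school_rate / district_rate
lo, hi = np.percentile(ratio, [2.5, 97.5])
print(round(float(np.median(ratio)), 2), round(float(lo), 2), round(float(hi), 2))
```

The same one-line transformation works for any derived indicator (a difference, an area, an exceedance probability), which is what makes the productivity profiles on the following slides cheap to compute once the model is fit.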
Ratios & Productivity Profiles (Thum, 2003b)
[Figure: posterior distributions of a value-added indicator, θ, for three schools.]
Productivity profiles: the probability that θ̂ ≥ l × Standard, traced over l.
Result: a measure of how much was achieved (a percent) and at what level of precision (a probability), so the comparison is (relatively) scale-free.
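A productivity profile as described above can be traced directly from posterior draws: for each l on a grid, compute the posterior probability that the indicator meets l times the standard. The draws, the standard, and the 70% confidence level used for the read-off below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in posterior draws of a gain-type indicator (e.g. 0.05 = a 5% gain).
theta = rng.normal(0.05, 0.02, size=20_000)
standard = 1.0  # the standard against which l is a proportion

# Profile: Prob(theta >= l * Standard | Y) for l from 0% to 12%.
l_grid = np.arange(0, 0.13, 0.01)
profile = [(float(l), float(np.mean(theta >= l * standard))) for l in l_grid]

# Read-off: the largest proportion of the standard met with >= 70% confidence.
best_l = max(l for l, p in profile if p >= 0.70)
print(round(best_l, 2))
```

Each point on the curve answers "how much was achieved" (l, a percent) at "what level of precision" (the posterior probability), which is the scale-free pairing the slide describes.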
Sample Teacher Productivity Profiles I
[Figure: productivity profiles for Grade 6 teachers, plotting confidence in meeting l (0.0-1.0) against l, the proportion of the standard (0-12%).]
We are only confident (at the 70% level) that 3 teachers reached 4%.
Sample Teacher Productivity Profiles 2
[Figure: four panels (Teacher 75, Teacher 18, Teacher 1, Teacher 8) plotting p(π_10j ≥ l·S_j) against % gain l (0-12) under Model 0, Model 1, and Model 4, with 70% and 80% confidence levels marked.]
The models differ in their adjustments for different classroom characteristics.
Sample Teacher Productivity Profiles 3
Different models produce different conclusions (Thum, 2003b).
[Figure: counts (0-40) of teachers meeting the standard under Model 0, Model 1, and Model 4, each read at the 70% and 80% confidence levels.]
Standing Issues re Inputs: Validity & Quality of Outcome Measures
1. We assume that we have an outcome of student learning that the user believes to be a valid/useful measure of the intended construct.
2. The outcome measure possesses the necessary psychometric (scale) properties supporting its use.
3. To the degree that the construct validity of the measure, or its scale type (interval), or both, hold only approximately in practice, we submit that the validity of interpretations using this outcome must be tempered accordingly.
4. Faced with this complex of nearly unsolvable issues, I find myself resting some of my choices on the “satisficing principle” (Simon, 1956).
Selected References
Thum, Y. M. (2002). Measuring Student and School Progress with the California API. CSE Technical Report 578. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, UCLA.
Thum, Y. M. (2003a). No Child Left Behind: Methodological Challenges and Recommendations for Measuring Adequate Yearly Progress. CSE Technical Report 590. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, UCLA.
Thum, Y. M. (2003b). Measuring Progress towards a Goal: Estimating Teacher Productivity using a Multivariate Multilevel Model for Value-Added Analysis. Sociological Methods & Research, 32(2), 153-207.

Acknowledgements
The analyses presented here are drawn from a larger comparative analysis study organized and supported by the New American Schools. Additional illustrations concerning the API draw support from CRESST and the Los Angeles Unified School District. Many of the ideas were first tested in an evaluation sponsored by the Milken Family Foundation. Portions of this presentation were part of an invited presentation at AERA 2005, Montreal.

Y. M. Thum ([email protected])
“Too much trouble,” “too expensive,” or “who will know the difference” are death knells to good food.
Julia Child (1961)

Final Caveat: In this work, the procedures are complex only to the degree that they meet the demands of the task at hand, nothing more, nothing less. We have clearly come a long way from naively comparing cross-sectional means.