
Posted on 21-May-2020

TRANSCRIPT

Page 1: Let’s start by talking about what is an outcome. These are ... (olms.cte.jhu.edu/olms2/data/ck/sites/3105/files/...)

In this session we’re going to talk about outcomes and the evaluation of those outcomes, the outcomes of our programs, of our interventions, and I wanted to start off with a discussion of why we should be interested in outcome evaluations. Even if we’re not going to be doing them as part of our work, I would argue that it’s critical that we always keep outcomes in mind when we’re developing interventions, even if we’re not at a point where we’re going to do rigorous evaluations of those outcomes and ask questions about impact. It’s vital that we understand the nature of the outcomes that we expect; that’s why we spend time thinking about the theory of treatment and what we should expect to happen to folks after they participate in our programs. But we should also understand how we can go about evaluating our programs even if we’re not going to do that now, and that should help us to better understand the interventions themselves. So for those of you that are more in the process-evaluation frame, this is still valuable for you to do and think about, and so I want to push you and encourage you to really ask, how could I evaluate the outcomes of my program, and to work that through. It may not be a hundred percent in alignment with what you’re doing for your POP and your dissertation, but you need to do the work and go through the thought process of how you would design that outcome evaluation. It’s important for really understanding your interventions and their effects.

Page 2

Let’s start by talking about what is an outcome. These are characteristics of your target population, or in some cases social conditions, but what’s important is that they’re not characteristics of the program itself, so there’s no direct reference to the program, and this is a key distinction between an outcome and an output. All of our outcomes are those things that we can observe naturally among individuals whether or not they’re participating in the program. Academic achievement of students, for example, is not directly tied to any program; we can observe academic achievement among kids not participating in the program at all. So that’s a key distinction in thinking about what an outcome is: it’s not a product of the intervention per se, so it’s not an attribute of the program.

Page 3

Let’s talk about a couple of different types of outcome and how we can think about them; this will lead directly into a discussion of various designs. The first type that we want to talk about is called a Level. A level is a one-time measure. As you can see on the chart there on the right, we have, let’s say, an academic achievement outcome that we’ve measured on a group of students, and on average that achievement at that one point in time is one hundred – one hundred points on some test, let’s say. So what does this tell us? Think about that; what does this one data point basically tell us? It simply tells us that at that point in time the average achievement of that group of kids was a hundred. It says nothing about change; it doesn’t really inform us on how they got to that level, and it doesn’t tell us whether that’s high or low.

Page 4

The next type we’ll call Change, and as you can see here, different from the Level, we actually have two levels, and they’re on the same group; let’s just for the sake of argument call them a pre- and a post-test of a program. At the pre-test let’s say we have an average achievement level of fifty; at the post-test we have an average achievement level of a hundred. So there, with the bracket, we see what we can call change – a fifty point change over the course of the program. Now what I want you to do is think about what that means. What is this telling us? What is actually represented here? I think with a bit of reflection, especially referencing the work of the previous session, we have to recognize that if this is all we have – a pre-post difference – we’re open to a whole host of confounding factors or threats to validity. We can say program participants on average changed fifty points over the course of the program, but we’re not quite able to say that it was due only to the action of the program. There’s a host of other confounding factors, threats to validity, that could be behind this fifty point increase. It could be a normal maturation process of this group of students in the absence of intervention; students may naturally move up fifty points on this test over that given time period. So what have we gained here? We’ve at least been able to see change over the course of the program, but we can’t quite pin it directly and only to the program action itself, so we’re open to these threats to validity. We’ve gotten further down the road, but we’re not quite able to say whether it was the program that caused this change.
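The ambiguity in a pre-post design can be made concrete with a little arithmetic. This sketch uses the hypothetical numbers from the slide (pre-test 50, post-test 100) to show that very different mixes of program effect and confounds all produce exactly the same observed change:

```python
# Hypothetical slide numbers: group averages of 50 at pre-test, 100 at post-test.
pre, post = 50, 100
observed_change = post - pre  # 50 points

# A pre-post design cannot distinguish these scenarios: each splits the
# same 50 points differently between program effect and a confound
# such as normal maturation.
scenarios = [
    (50, 0),   # all program effect
    (40, 10),  # mostly program, some maturation
    (0, 50),   # pure maturation, no program effect at all
]
for program_effect, maturation in scenarios:
    # Each scenario yields the same observed pre-post difference.
    assert program_effect + maturation == observed_change
```

All three scenarios are observationally identical from the pre-post data alone, which is why a comparison group becomes necessary to talk about impact.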

Page 5

So our third type is what we might call Impact, and this is the portion of observed change directly due to the program. This, I would argue, is of most interest for program implementers, and this is what we’re looking for – what was the effect of the program, what was the impact of the program on participants. As you can see on the chart to the right, how do we get to a statement about impact? Well, we have the same treatment group, the blue line, with the fifty point change over the course of the intervention or program, but here we’re contrasting that with some comparison group – or control group, if we’re in a randomized experiment, an experimental design. And you can see here the control group went from fifty on the pre-test to sixty at the post-test, so there was a ten point increase in that group. And so we might conclude that the impact is forty instead of the fifty of the change: in the absence of treatment we might have expected students to progress ten points instead of the fifty of the program, and the difference between those two is forty, which we might call impact. And what’s important here – and we’ll unpack this more as we go along in the session – is who is in that comparison group. That’s vital: how is that comparison group formed? To talk ahead a bit, in randomized experiments we rule out a lot of our confounding factors; in other cases we might use an existing group, recognizing that may open us up to other kinds of confounding factors or threats to validity. But in general this is what we’re talking about when we’re talking about an impact: it’s in reference to some other group.
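Using the hypothetical group means from this slide, the impact calculation is just the treatment group’s change minus the comparison group’s change – a simple difference-in-differences sketch:

```python
# Hypothetical slide numbers: treatment goes 50 -> 100, comparison 50 -> 60.
pre_treat, post_treat = 50, 100   # treatment group means
pre_comp, post_comp = 50, 60      # comparison group means

change_treat = post_treat - pre_treat   # 50: observed change with the program
change_comp = post_comp - pre_comp      # 10: expected change without the program
impact = change_treat - change_comp     # 40: portion attributable to the program

print(impact)  # 40
```

Whether that forty points really is attributable to the program depends entirely on how the comparison group was formed, which is the point the rest of the session keeps returning to.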

Page 6

When we’re talking about outcomes we need to have the ability to determine the outcomes that are appropriate and relevant to our intervention. We can start that process by thinking about the goals of the program: what are the actual goals? What are we hoping to move or change in people? And we need to take the stakeholder perspective: what is important to them, what do they hope the intervention will do? And again it comes back to program theory. What does our theory of treatment say the outcome is? How tightly is it aligned to that treatment theory? We also need to think about time scale; often our programs operate on different time scales. I think I’ve given the example before of a book distribution program that I’ve evaluated. Part of the goal was to move proximal outcomes, namely summer learning loss. The hope of the program was that it would change or abate summer slide in kids, but the program didn’t operate on that time scale. However, we also looked at much more distal outcomes on the state achievement test, and it’s good that we did, because we found some effect of the program at a longer time scale than the summer. That’s important to consider, and it links back to program theory: did we think the treatment was powerful enough to move the needle in the very short time span of a summer, or could it have delayed effects that would take longer to manifest themselves in kids’ outcomes? We also need to recognize that most of the outcomes we’re interested in are multidimensional. Just think about reading; reading is a high-level construct, and there are multiple dimensions to what it means to be able to read. So we need to think about our outcomes in terms of that multidimensionality; do we have multiple outcomes that represent the same construct or the different

Page 7

sub-constructs? So that’s comprehensiveness – are we covering all the bases, so to speak – and are they relevant? We need to think those through, because sometimes we might inadvertently include irrelevant outcomes where program theory would say, hmm, we’re not going to move the needle on that, so perhaps we don’t need to gather that data.

Page 8

All of our work leading up to an outcome evaluation should also clue us in to potential unintended outcomes. We should always be aware of the potential for unintended outcomes, think them through, and consider how we might measure them. One obvious source is implementation; perhaps there’s poor implementation and that leads to an outcome we weren’t expecting. Maybe the program conceptually and theoretically is misaligned with the outcomes we’re interested in. We need to think about the program itself: is it actually, quote, unquote, solving the problem, or is it displacing the problem to some other place? By intervening we may have shifted the incidence of the problem away from the students, let’s say, and it’s now manifesting itself in teachers, perhaps. We need to think about the complex systems in which we and our programs are operating, recognizing that we may solve one problem but cause the emergence of a new one, and we need to think of ways we might capture that. And all of this goes through a thorough planning process; this is why we do the logic models, that’s why we do the theory of treatment. We’re not just looking at what we hope and expect to happen, but we should also go through the exercise of thinking about what the unintended consequences might be and what those might look like.

Page 9

One way that we can begin to talk about the effectiveness of our programs is through outcome monitoring, which is continual measurement of indicators of outcomes and of the social conditions in which our programs are embedded. So for example we might take attendance of students as our outcome of interest. Attendance is something that happens every day – kids are either there or not – and we can gather it in a timely fashion at low cost; it’s easy to report and it’s ongoing. So by looking across time it potentially provides feedback to a program about improvement on that indicator. But – and I say this over and over – we’re not talking about impact of the program per se; we’re talking about improvement in the indicator that may be related to the program. Fundamental, though, is thinking about indicators of the problem that we have measures of pre-program. We need to think about exposure to confounds and other relevant factors that could influence change on that continually measured outcome. These can be valuable indicators of program effectiveness, but not impact.
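As a sketch of what outcome monitoring might look like with an attendance indicator, the fragment below tracks hypothetical monthly attendance rates before and after a program begins; all of the numbers are invented for illustration:

```python
# Hypothetical monthly attendance rates (proportion of students present),
# six months before and six months after the program starts.
before = [0.88, 0.87, 0.89, 0.86, 0.88, 0.87]
after = [0.90, 0.91, 0.89, 0.92, 0.91, 0.93]

baseline = sum(before) / len(before)   # pre-program average
current = sum(after) / len(after)      # post-start average
improvement = current - baseline

# This signals improvement in the indicator, not program impact:
# concurrent changes (policy, weather, cohort mix) are uncontrolled.
print(round(improvement, 3))
```

The pre-program baseline is what makes the comparison meaningful; without measures of the indicator before the program, there is nothing to monitor against.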

Page 10

In thinking about potential indicators that could be used for outcome monitoring, we need to consider how responsive they are to any program effects that might exist. Because we’re not doing an impact assessment – which often takes many years to complete, at a much higher cost – we want to identify indicators that are highly valid and reliable for the task. More responsive indicators have less noise in them, so we can hopefully make more reasonable inferences about effectiveness than with an indicator that is influenced by a lot of other confounding factors. We want to identify, as much as we can, indicators that only the program can affect; that may be a very high bar, but we can get as close to it as we can. We do have to recognize, though, some limitations of outcome monitoring that could be severe. By focusing attention on a single outcome measure we can potentially introduce bias into that measure. A clear example is standards and testing: sometimes by focusing a lot of attention on the outcomes of a standardized test we push folks to influence that outcome measure in unintended ways – teaching test-taking skills versus teaching the content of the test. That speaks to the corruptibility of those indicators: to what extent can people game the system to move the indicators, versus doing the actual work that influences the indicators? And we need to be careful about interpretation; because we’re not talking about impacts, we need to be very careful and nuanced with our words. That’s sometimes difficult to do, or it gets misinterpreted, and folks may take what we’re saying as impact rather than effectiveness, so we need to be really careful about that one.

Page 11

So regardless of the designs that we may implement to look at outcomes in our evaluations, there are some key things that we need to think about and collect. One is contextual data: we need to know about the clients themselves, so we need to have as rich information about them as possible, and we also need to know about changes in clients that may have occurred – changes in the mix of clients who are participants, in their numbers. We also need to think about things happening out in the world at the same time as our interventions: social or environmental changes. We can think about changes in the broader economy that might impact our outcomes, demographic trends, and so on; if we’re talking about schools, changes in principals, changes in teachers, environmental changes, social changes. We need to really think about those and come up with a plan for collecting and observing them. For an outcome evaluation we also really need to attend to process information – fidelity of implementation – and have that data as well to better understand the outcomes. We also need to have a framework by which we can judge, quote, unquote, better or worse outcomes. This is key; just because we find movement in an outcome, we need a way of understanding the magnitude and direction of that movement: is this better or worse than what would have happened otherwise? So you really need to think about what the yardstick is, what the standard for success or failure is. Is it performance above a certain mark? Is it comparison to some other group? We need to think that through – what determines success – and often we should see that in our logic models as well.

Page 12

Hopefully this point will become abundantly clear as you do the readings for this session: we need to be very thoughtful in our presentation of outcome data and in the interpretation of that data. We’ve already touched on this through the presentation, but we need to be very careful about simple pre- and post-comparisons – just looking at change from the beginning to the end of our programs. That opens up pathways for concurrent trends and confounding threats to validity in how we interpret that change, and that’s why I said the yardsticks are important: what constitutes better or worse change. We need to think about who composes the group that we’re looking at. If all we can do is a pre-post comparison, we need to do a lot of deep work in understanding that group of individuals. Are they at the extremes? Were they selected because of low prior performance? If that’s the case, we open the door to the threat of regression to the mean – the tendency of extreme scorers to move back toward the average on remeasurement, and so to appear to improve regardless of whether or not they’re in the program – and vice versa if we choose them for high performance. So we need to be very thoughtful and explicit about what composes our comparisons and how we interpret our outcome data. We’ll hit that point several times as we discuss designs.
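Regression to the mean is easy to demonstrate with a small simulation. The sketch below (all numbers invented) gives every student a fixed true ability, takes two equally noisy test measurements, selects the bottom quartile on the first test – as if admitting students to a program for low prior performance – and shows that the selected group appears to improve on the second test with no program at all:

```python
import random

random.seed(0)

N = 10_000
# Fixed underlying ability for each student.
true_ability = [random.gauss(100, 10) for _ in range(N)]
# Two independent, equally noisy measurements of that same ability.
test1 = [a + random.gauss(0, 10) for a in true_ability]
test2 = [a + random.gauss(0, 10) for a in true_ability]

# Select students for "the program" because of low prior performance.
cutoff = sorted(test1)[N // 4]
selected = [i for i in range(N) if test1[i] <= cutoff]

mean1 = sum(test1[i] for i in selected) / len(selected)
mean2 = sum(test2[i] for i in selected) / len(selected)

# With no intervention whatsoever, the low scorers move back toward
# the mean on retest, so mean2 comes out noticeably higher than mean1.
print(round(mean2 - mean1, 1))
```

Selecting on high prior performance produces the mirror image: an apparent decline on retest, again with no program effect at all.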
