The logic of Counterfactual Impact Evaluation
To understand counterfactuals, it is necessary to understand impacts
Impacts differ in one fundamental way from outputs and results
Outputs and results are observable quantities
Can we observe an impact?
No, we can’t
As output indicators measure outputs, result indicators
measure results, so impact indicators measure impacts
Sorry, they don’t
Almost everything about programmes can be observed (at least in principle):
outputs (beneficiaries served, activities done, training courses offered, km of roads built, sewers cleaned)
outcomes/results (income levels, inequality, well-being of the population, pollution, congestion, inflation, unemployment, birth rate)
What is needed for M&E of outputs and results is BITs (baselines, indicators, and targets)
Unlike outputs and results, to define, detect, understand,
and measure impacts
one needs to deal with
causality
“Causality is in the mind”
J.J. Heckman
Why this focus on causality?
Because, unless we can attribute changes (or differences) to policies, we do not know whether the intervention “works”, “for whom” it works, and even less “why” it works (or does not)
Causal questions represent a bigger challenge than non-causal questions (descriptive, normative, exploratory)
The social science community defines impact/effect as
“the difference between a situation observed after a stimulus has been applied and the situation that would have occurred without such stimulus”
A very intuitive example of the role of causality in producing credible evidence for
policy decisions
Does playing chess
have an impact on math learning?
Policy-relevant question:
Should we make chess part of the regular curriculum in elementary schools, to improve
mathematics achievement?
Which kind of evidence do we need to make this decision in an informed way?
We can think of three types of evidence, from the most naive to the most credible
1. The naive evidence: pre-post difference
• Take a sample of pupils in fourth grade
• Measure their achievement in math at the beginning of the year
• Teach them to play chess during the year
• Test them again at the end of the year
Results for the pre-post difference
Pupils at the beginning of the year: average score = 40 points
The same pupils at the end of the year: average score = 52 points
Difference = 12 points = +30%
Question: what are the implications for making chess compulsory in schools? Have we proven anything?
Can we attribute the increase in test score to playing chess?
OBVIOUSLY NOT
The data tell us that the effect is between zero and 12 points
• There is no doubt that many more factors are at play
• So we must dismiss the increase of 12 points as unable to tell us anything about impact.
The pre-post great temptation
• Pre-post comparisons have a great advantage: they seem obvious (the “pop” definition of impact coincides with the pre-post difference)
• The temptation is particularly strong when the intervention is big and the theory suggests that the outcomes should be affected
• This is not the case here, but in general we should be careful about drawing causal inferences from pre-post comparisons
The risky alternative: with-without difference
Impact = difference between treated and not treated?
Not really
Compare math test scores for kids who have learned chess by themselves and kids who have not
Average score of pupils who already play chess on their own (25% of the total) = 66 points
Average score of pupils who DO NOT play chess on their own (75% of the total) = 45 points
Difference = 21 points = +47%
This difference is OBJECTIVE, but what does it mean, really? Does it have any implication for policy?
This evidence tells us almost nothing about making chess
compulsory for all students
The data tell us that the effect of playing chess is between zero and 21 points.
Why?
The observed difference could entirely be due to differences in mathematical ability that exist before the courses, between the two groups
[Diagram: innate math ability affects both who plays chess (SELECTION PROCESS) and math test scores (DIRECT INFLUENCE). Does playing chess have an impact on math test scores?]
Ignoring math ability could severely bias the results, if we intend to interpret them as a causal effect
66 – 45: real effect or the fruit of sorting?
Counterfeit Counterfactual
Both the raw difference between self-selected participants and non-participants, and the raw change between pre and post, are a caricature of the counterfactual logic
In the case of raw differences, the problem is selection bias (predetermined differences)
In the case of raw changes, the problem is maturation bias (a.k.a. natural dynamics)
The modern way to understand causality is to think in terms of
POTENTIAL OUTCOMES
Let us imagine we know the score that kids would get if they played chess and the score they would get if they did not
Let’s say there are three levels of ability
Kids in the top quartile (top 25%) learn to play chess on their own
Kids in the two middle quartiles learn if they are taught in school
Kids in the bottom quartile (last 25%) never learn to play chess
High math ability (25%): play chess by themselves
Mid math ability (50%): do not play chess unless taught in school
Low math ability (25%): never learn to play
Potential outcomes
                    If they do play chess   If they do NOT play chess   Impact = gain from playing chess
High math ability   66                      56                          10
Mid math ability    54                      48                          6
Low math ability    40                      40                          0
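The potential outcomes table can be written down directly. The following is a minimal Python sketch, assuming (as the slides do) that every kid within an ability group shares the same pair of potential outcomes; the variable names are illustrative and not part of the original material.

```python
# Potential outcomes from the chess example:
# (score if the kid plays chess, score if the kid does not play chess)
potential_outcomes = {
    "high": (66, 56),
    "mid": (54, 48),
    "low": (40, 40),
}

# Impact for each ability group = gain from playing chess
impact = {
    group: score_if_play - score_if_not
    for group, (score_if_play, score_if_not) in potential_outcomes.items()
}

for group, gain in impact.items():
    print(f"{group} math ability: impact = {gain}")
```

In real data only one of the two scores is ever observed for each kid, which is why the impact column cannot be computed this way outside a thought experiment.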
Observed outcomes
                                For those who play chess   For those who do not play chess
High math ability               66
Mid math ability                                           48
Low math ability                                           40
Mid/Low math ability combined                              45
The difference of 21 points is NOT an IMPACT, it is just an OBSERVED difference
The problem: we do not observe the counterfactual(s)
• For the treated, the counterfactual is 56, but we do not see it
• The true impact is 10, but we do not see it
• Still, we cannot use 45, which is the observed outcome of the untreated
We can think of decomposing the 66 – 45 difference as the sum of the true impact on the treated and the effect of sorting
Decomposing the observed difference
                       If play chess   If do not play chess
High math ability      66              56
Low/mid math ability                   45
Impact for players = 66 – 56 = 10
Preexisting differences = 56 – 45 = 11
Observed difference = 66 – 45 = 21
21 = 10 + 11
Observed differences =
Impact +
Preexisting differences(selection bias)
The heart of impact evaluation is getting rid of selection bias, by using
experiments or by using some non-experimental methods
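The decomposition can be checked numerically. This is a sketch in Python under the assumptions of the chess example (group shares of 25%, 50%, 25%, and only high-ability kids playing on their own); the helper `weighted_mean` is hypothetical. The exact average for non-players is about 45.3, which the slides round to 45, so the exact figures are roughly 20.7 = 10 + 10.7 rather than 21 = 10 + 11.

```python
# Potential outcomes by ability group: (score if playing, score if not playing)
potential = {"high": (66, 56), "mid": (54, 48), "low": (40, 40)}
# Population share of each ability group, as stated in the example
share = {"high": 0.25, "mid": 0.50, "low": 0.25}

# Without any policy, only high-ability kids play (they learn on their own)
players = ["high"]
non_players = ["mid", "low"]


def weighted_mean(groups, which):
    """Share-weighted mean of one potential outcome (0 = plays, 1 = does not play)."""
    total = sum(share[g] for g in groups)
    return sum(share[g] * potential[g][which] for g in groups) / total


observed_players = weighted_mean(players, 0)          # 66
observed_non_players = weighted_mean(non_players, 1)  # about 45.3 (45 in the slides)
counterfactual_players = weighted_mean(players, 1)    # 56, never observed in practice

observed_difference = observed_players - observed_non_players
impact_on_players = observed_players - counterfactual_players
selection_bias = counterfactual_players - observed_non_players

print(f"observed difference = {observed_difference:.1f}")
print(f"impact on players   = {impact_on_players:.1f}")
print(f"selection bias      = {selection_bias:.1f}")
```

The identity observed difference = impact + selection bias holds exactly because the unobserved counterfactual (56) is added and subtracted.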
Experimental evidence to the rescue
Schools get a free instructor to teach chess to one class, if they agree to select
the class at random among the fourth grade classes
Now we have the following situation
Results of the randomized experiment
Pupils in the selected classes: average score of randomized chess players = 60 points
Pupils in the excluded classes: average score of NON chess players = 52 points
Difference = 8 points
Question: what does this difference tell us?
The results tell us that teaching chess truly improves math performance (by 8 points, about 15%)
Thus we are able to isolate the effect of chess from other factors (but some problems remain)
               If they do play chess   If they do NOT play chess   Composition of population
High ability   66                      56                          25%
Mid ability    54                      48                          50%
Low ability    40                      40                          25%
Averages       54                      48                          100%
Impact = 54 – 48 = 6
Average Treatment Effect (ATE)
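The averages in the table are population-weighted means of the potential outcomes. A minimal Python sketch, using only the numbers of the chess example (the variable names are illustrative): the exact values are 53.5 and 48, so the exact ATE is 5.5, which the slides round to 54 – 48 = 6.

```python
# Potential outcomes by ability group: (score if playing, score if not playing)
potential = {"high": (66, 56), "mid": (54, 48), "low": (40, 40)}
# Composition of the population
share = {"high": 0.25, "mid": 0.50, "low": 0.25}

# Average score if every pupil played chess, and if no pupil did
mean_if_all_play = sum(share[g] * potential[g][0] for g in potential)
mean_if_none_play = sum(share[g] * potential[g][1] for g in potential)

# Average Treatment Effect: effect of treating every member of the population
ate = mean_if_all_play - mean_if_none_play

print(mean_if_all_play, mean_if_none_play, ate)
```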
[Diagram: playing chess and math ability each have a DIRECT INFLUENCE on math test scores; there is no selection arrow from math ability to playing chess]
Note that the experiment does not solve all the cognitive problems related to policy design: for example, it does not identify impact heterogeneity (“for whom it works”)
The ATE is the average effect if every member of the population is treated
Generally there is more policy interest in the Average Treatment Effect on the Treated (ATT)
ATT = 10 in the chess example, while ATE = 6
(we ran an experiment and got an impact of 8. Can you think why this happens?)
               Schools that volunteered   Schools that DID NOT volunteer   True impact
High ability   50%                                                         10
Mid ability    50%                        50%                              6
Low ability                               50%                              0
EXPERIMENTAL mean of 66 and 54 = 60
CONTROL mean of 56 and 48 = 52
Impact = 60 – 52 = 8
Internal validity
Little external validity
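The gap between the experimental estimate (8), the ATE and the ATT can be reproduced with a short Python sketch. It assumes, as the slide appears to show, that volunteering schools are made up of high-ability and mid-ability pupils in equal parts; the variable names are illustrative.

```python
# Potential outcomes by ability group: (score if playing, score if not playing)
potential = {"high": (66, 56), "mid": (54, 48), "low": (40, 40)}

# Assumption read off the slide: volunteering schools are 50% high, 50% mid ability
volunteer_share = {"high": 0.5, "mid": 0.5}
# Composition of the whole pupil population
population_share = {"high": 0.25, "mid": 0.50, "low": 0.25}

# Randomization inside volunteering schools: both arms have the same composition
experimental_mean = sum(w * potential[g][0] for g, w in volunteer_share.items())
control_mean = sum(w * potential[g][1] for g, w in volunteer_share.items())
experimental_impact = experimental_mean - control_mean

# Effect of treating everybody (exactly 5.5; the slides round it to 6)
ate = sum(w * (potential[g][0] - potential[g][1]) for g, w in population_share.items())
# Effect on those who play on their own (the high-ability group)
att = potential["high"][0] - potential["high"][1]

print(experimental_impact, ate, att)
```

The experimental estimate is internally valid for the volunteering schools, but it averages the impacts of a different mix of pupils than the whole population, hence the limited external validity.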
Lessons learned
Impacts are differences, but not all differences are impacts
Differences (and changes) have many causes, but we do not need to understand all the causes
We are especially interested in one cause, the policy, and we would like to eliminate all the confounding causes of the difference (or change)
Internal vs. External validity
An example of a real ERDF policy
Grants to small enterprises to invest in R&D
To design an impact evaluation, one needs to answer three important questions
1. Impact of what?
2. Impact for whom?
3. Impact on what?
R&D EXPENDITURES AMONG THE FIRMS RECEIVING GRANTS
                  AVERAGE   N
PRE               65.000    2400
POST              75.000    2400
OBSERVED CHANGE   10.000
Is 10.000 the true average impact of the grant?
The fundamental challenge to the assumption behind the pre-post comparison is the well-known fact that things change over time by “natural dynamics”
How do we disentangle the change due to the policy from the myriad changes that would have occurred anyway?
                    AVERAGE   N
T=0 (non treated)   60.000    2600
T=1 (treated)       75.000    2400
DIFFERENCE TREATED - NON TREATED   +15.000
IS 15.000 THE TRUE IMPACT OF THE POLICY?
WITH-WITHOUT (I.A.: NO PRE-INTERVENTION DIFFERENCES)
DECOMPOSITION OF WITH-WITHOUT DIFFERENCES
We cannot use experiments with firms, for obvious (?) political reasons
The good news is that there are lots of non-experimental counterfactual
methods
The difference-in-differences (DID) is a combination of the first two
strategies
And it is a good way to understand the logic of (non-experimental)
counterfactual evaluation
POST DIFFERENCE (15.000) - PRE DIFFERENCE (10.000) = Impact = 5.000
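The difference-in-differences arithmetic can be sketched in a few lines of Python. The treated means (65,000 and 75,000) and the non-treated post mean (60,000) come from the tables in the slides; the non-treated pre mean of 55,000 is an assumption implied by the stated pre difference of 10,000. Variable names are illustrative.

```python
# Average R&D expenditure, treated firms (grant recipients)
treated_pre, treated_post = 65_000, 75_000
# Average R&D expenditure, non-treated firms
non_treated_post = 60_000
non_treated_pre = 55_000  # assumption: implied by the 10,000 pre difference

# Form 1: post difference minus pre difference (with-without, corrected for
# the gap that already existed before the intervention)
pre_difference = treated_pre - non_treated_pre
post_difference = treated_post - non_treated_post
did_impact = post_difference - pre_difference

# Form 2: change among the treated minus change among the non-treated
# (pre-post, corrected for natural dynamics)
change_treated = treated_post - treated_pre
change_non_treated = non_treated_post - non_treated_pre

print(pre_difference, post_difference, did_impact)
print(change_treated - change_non_treated)
```

Both forms give the same number, which is why DID can be read as a combination of the pre-post and the with-without strategies.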
CAN WE TEST THE PARALLELISM ASSUMPTION?
With four observed means, we cannot
The parallelism becomes testable if we have two additional data points from before the intervention (PRE-PRE)
WHEN TO USE DIFF-IN-DIFF?
When we have longitudinal data and have reasons to believe that most of what drives selection is
individual unobserved characteristics
Second, the path taken by the controls must be a plausible approximation of what would
happen to the treated
The following is an example in which it would be better
NOT to use DID
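              PRE-PRE   PRE      CHANGE
Treated       58.000    65.000   7.000
Non treated   57.000    55.000   -2.000
Diff-in-diff (PRE-PRE to PRE): 9.000
              PRE       POST     CHANGE
Treated       65.000    75.000   10.000
Non treated   55.000    67.000   12.000
Diff-in-diff (PRE to POST): -2.000
Diff-in-diff-in-diff: -11.000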
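The figures in this example can be reproduced with a short Python sketch. The labelling of rows (treated, non-treated) and periods (pre-pre, pre, post) is our reading of the numbers, and the helper `did` is hypothetical.

```python
# Average R&D expenditure by group and period (labels are an assumed reading)
means = {
    "treated": {"pre_pre": 58_000, "pre": 65_000, "post": 75_000},
    "non_treated": {"pre_pre": 57_000, "pre": 55_000, "post": 67_000},
}


def did(first, second):
    """Change among the treated minus change among the non-treated."""
    change_treated = means["treated"][second] - means["treated"][first]
    change_non_treated = means["non_treated"][second] - means["non_treated"][first]
    return change_treated - change_non_treated


# DID computed before the intervention: it should be near zero if the two
# groups were on parallel paths
pre_period_did = did("pre_pre", "pre")
# The usual DID around the intervention
post_period_did = did("pre", "post")
# Difference-in-difference-in-differences
ddd = post_period_did - pre_period_did

print(pre_period_did, post_period_did, ddd)
```

A pre-period DID of 9,000 shows the groups were already diverging before the grant, so the parallelism assumption is not credible here.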
Treated: PRE-PRE 58.000, PRE 65.000, change 7.000
Linearly projected POST = 65.000 + 7.000 = 72.000; observed POST = 75.000
Linearly projected impact 3.000
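The linear projection uses only the treated firms: their pre-intervention trend is extended one period and compared with what was observed. A minimal Python sketch, assuming the three figures are the pre-pre, pre and post averages of the treated (variable names are illustrative):

```python
# Average R&D expenditure of treated firms in three periods (assumed labelling)
pre_pre, pre, post = 58_000, 65_000, 75_000

# Trend observed before the intervention
pre_trend = pre - pre_pre
# Counterfactual: where the treated would be if the same trend had continued
projected_post = pre + pre_trend
# Impact = observed outcome minus projected counterfactual
projected_impact = post - projected_post

print(pre_trend, projected_post, projected_impact)
```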