The logic of Counterfactual Impact Evaluation 1

To understand counterfactuals, it is necessary to understand impacts

Impacts differ in one fundamental way from outputs and results

Outputs and results are observable quantities

Can we observe an impact?

No, we can’t

As output indicators measure outputs and result indicators measure results, so impact indicators measure impacts

Sorry, they don’t

Almost everything about programmes can be observed (at least in principle):

• outputs (beneficiaries served, activities done, training courses offered, km of roads built, sewers cleaned)

• outcomes/results (income levels, inequality, well-being of the population, pollution, congestion, inflation, unemployment, birth rate)

What is needed for M&E of outputs and results are BITs (baselines, indicators, and targets)

Unlike outputs and results, to define, detect, understand, and measure impacts one needs to deal with causality

“Causality is in the mind”

J.J. Heckman

Why this focus on causality?

Because, unless we can attribute changes (or differences) to policies, we do not know whether the intervention “works”, “for whom” it works, and even less “why” it works (or does not)

Causal questions represent a bigger challenge than non-causal questions (descriptive, normative, exploratory)

The social science community defines impact/effect as

“the difference between a situation observed after a stimulus has been applied and the situation that would have occurred without such stimulus”

A very intuitive example of the role of causality in producing credible evidence for policy decisions

Does playing chess have an impact on math learning?

Policy-relevant question:

Should we make chess part of the regular curriculum in elementary schools, to improve mathematics achievement?

Which kind of evidence do we need to make this decision in an informed way?

We can think of three types of evidence, from the most naive to the most credible

1. The naive evidence: pre-post difference

• Take a sample of pupils in fourth grade

• Measure their achievement in math at the beginning of the year

• Teach them to play chess during the year

• Test them again at the end of the year

Results for the pre-post difference

Pupils at the beginning of the year: average score = 40 points

The same pupils at the end of the year: average score = 52 points

Difference = 12 points = +30%

Question: what are the implications for making chess compulsory in schools? Have we proven anything?

Can we attribute the increase in test score to playing chess?

OBVIOUSLY NOT

The data tell us that the effect is between zero and 12 points

• There is no doubt that many more factors are at play

• So we must dismiss the increase of 12 points as unable to tell us anything about impact.
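The pre-post arithmetic can be sketched as follows. This is only an illustration using the two averages from the slide; the function name `pre_post_change` is hypothetical.

```python
# Average math scores from the slide (same pupils, start vs. end of year).
score_pre = 40.0
score_post = 52.0


def pre_post_change(pre: float, post: float) -> tuple[float, float]:
    """Return the raw change and the change as a share of the starting level."""
    change = post - pre
    return change, change / pre


change, relative = pre_post_change(score_pre, score_post)
print(f"Observed change: {change:.0f} points ({relative:.0%})")
# The change mixes any effect of chess with everything else that happened
# during the year (ordinary schooling, growing older), so it is not an impact.
```

The number is easy to compute, which is exactly why it is tempting to read it as an impact.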

The pre-post great temptation

• Pre-post comparisons have a great advantage: they seem obvious (the “pop” definition of impact coincides with the pre-post difference)

• This is particularly true when the intervention is big, and the theory suggests that the outcomes should be affected

• This is not the case here, but in general we should be careful about making causal inferences based on pre-post comparisons

The risky alternative: with-without difference

Impact = difference between treated and not treated?

Compare math test scores for kids who have learned chess by themselves and kids who have not

Not really

Average score of pupils who already play chess on their own (25% of the total) = 66 points

Average score of pupils who DO NOT play chess on their own (75% of the total) = 45 points

Difference = 21 points = +47%

This difference is OBJECTIVE, but what does it mean, really? Does it have any implication for policy?

This evidence tells us almost nothing about making chess compulsory for all students

The data tell us that the effect of playing chess is between zero and 21 points.

Why?

The observed difference could be due entirely to differences in mathematical ability between the two groups that existed beforehand

[Diagram: nodes “Play chess”, “Math innate ability”, and “Math test scores”; arrows labelled “SELECTION PROCESS”, “DIRECT INFLUENCE”, and “Does it have an impact on?”]

Ignoring math ability could severely bias the results, if we intend to interpret them as a causal effect

66 – 45: real effect or the fruit of sorting?

Counterfeit Counterfactual

Both the raw difference between self-selected participants and non-participants, and the raw change between pre and post, are caricatures of the counterfactual logic

In the case of raw differences, the problem is selection bias (predetermined differences)

In the case of raw changes, the problem is maturation bias (a.k.a. natural dynamics)

The modern way to understand causality is to think in terms of POTENTIAL OUTCOMES

Let us imagine we know the score that kids would get if they played chess and the score they would get if they did not

Let’s say there are three levels of ability

Kids in the top quartile (top 25%) learn to play chess on their own

Kids in the two middle quartiles learn if they are taught in school

Kids in the bottom quartile (last 25%) never learn to play chess

High math ability (25%): play chess by themselves

Mid math ability (50%): do not play chess unless taught in school

Low math ability (25%): never learn to play

Potential outcomes

Ability             If they do play chess   If they do NOT play chess   Impact = gain from playing chess
High math ability   66                      56                          10
Mid math ability    54                      48                          6
Low math ability    40                      40                          0
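A minimal sketch of the potential-outcomes table, assuming the three ability groups and the scores shown in the slides. The class name `AbilityGroup` and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbilityGroup:
    name: str
    share: float    # share of the pupil population
    y_play: int     # potential score if the pupils play chess
    y_no_play: int  # potential score if they do not play chess

    @property
    def impact(self) -> int:
        """Individual-level impact: the gap between the two potential outcomes."""
        return self.y_play - self.y_no_play


groups = [
    AbilityGroup("high", 0.25, 66, 56),
    AbilityGroup("mid", 0.50, 54, 48),
    AbilityGroup("low", 0.25, 40, 40),
]

impacts = {g.name: g.impact for g in groups}
print(impacts)
```

In real data only one of the two potential scores is ever seen for each pupil, so this table can be written down only in a thought experiment.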

Observed outcomes

Ability                         For those who play chess   For those who do not play chess
High math ability               66                         -
Mid math ability                -                          48
Low math ability                -                          40
Mid/Low math ability combined   -                          45

The difference of 21 points is NOT an IMPACT, it is just an OBSERVED difference
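The observed averages can be reproduced with a short sketch, assuming that without a school programme only the high-ability pupils play chess. The names `potential`, `plays`, and `observed_mean` are hypothetical. The exact weighted mean for non-players comes out at about 45.3, which the slides appear to round to 45 (and the gap to 21).

```python
# (population share, score if playing, score if not playing), from the slides
potential = {
    "high": (0.25, 66, 56),
    "mid": (0.50, 54, 48),
    "low": (0.25, 40, 40),
}
# Assumption from the slides: only high-ability pupils play on their own.
plays = {"high": True, "mid": False, "low": False}


def observed_mean(treated: bool) -> float:
    """Share-weighted mean of the outcome actually observed in one group."""
    rows = [
        (share, y_play if treated else y_no_play)
        for name, (share, y_play, y_no_play) in potential.items()
        if plays[name] == treated
    ]
    total_share = sum(share for share, _ in rows)
    return sum(share * score for share, score in rows) / total_share


mean_players = observed_mean(True)
mean_non_players = observed_mean(False)
naive_difference = mean_players - mean_non_players
print(round(mean_players, 2), round(mean_non_players, 2), round(naive_difference, 2))
```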

The problem: we do not observe the counterfactual(s)

• For the treated, the counterfactual is 56, but we do not see it

• The true impact is 10, but we do not see it

• Still, we cannot use 45, which is the observed outcome of the untreated

We can think of decomposing the 66 – 45 difference as the sum of the true impact on the treated and the effect of sorting

Decomposing the observed difference

Ability                If play chess   If do not play chess
High math ability      66              56
Low/mid math ability   -               45

66 – 56 = 10: impact for players

56 – 45 = 11: preexisting differences

66 – 45 = 21: observed difference

21 = 10 + 11

Observed differences = Impact + Preexisting differences (selection bias)

The heart of impact evaluation is getting rid of selection bias, by using experiments or by using some non-experimental methods
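The decomposition can be checked with a few lines of arithmetic. This sketch uses the slides' rounded value of 45 for the non-players; the variable names are hypothetical.

```python
# Scores from the slides.
y1_players = 66.0      # observed: players, playing chess
y0_players = 56.0      # counterfactual: the same players, had they not played
y0_non_players = 45.0  # observed: non-players (the slides' rounded value)

att = y1_players - y0_players                 # true impact on the players
selection_bias = y0_players - y0_non_players  # preexisting difference
observed_difference = y1_players - y0_non_players

# The observed gap splits exactly into impact plus selection bias.
assert observed_difference == att + selection_bias
print(f"{observed_difference:.0f} = {att:.0f} + {selection_bias:.0f}")
```

The middle term, 56, is the one we never observe, which is why the split cannot be read off the raw data.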

Experimental evidence to the rescue

Schools get a free instructor to teach chess to one class, if they agree to select the class at random among the fourth-grade classes

Now we have the following situation

Results of the randomized experiment

Pupils in the selected classes: average score of randomized chess players = 60 points

Pupils in the excluded classes: average score of NON chess players = 52 points

Difference = 8 points

Question: what does this difference tell us?

The results tell us that teaching chess truly improves math performance (by 8 points, about 15%)

Thus we are able to isolate the effect of chess from other factors (but some problems remain)

Ability        If they do play chess   If they do NOT play chess   Composition of population
High ability   66                      56                          25%
Mid ability    54                      48                          50%
Low ability    40                      40                          25%
Averages       54                      48                          100%

Impact = 54 – 48 = 6

Average Treatment Effect (ATE)
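A sketch of the ATE as a share-weighted average of the potential outcomes, using the population composition from the slides. With exact arithmetic the average score if everyone plays is 53.5 and the ATE is 5.5; the slides appear to round these to 54 and 6. The variable names are hypothetical.

```python
# (ability group, population share, score if playing, score if not playing)
groups = [
    ("high", 0.25, 66, 56),
    ("mid", 0.50, 54, 48),
    ("low", 0.25, 40, 40),
]

# Population averages of each potential outcome, weighted by group share.
mean_if_all_play = sum(share * y_play for _, share, y_play, _ in groups)
mean_if_none_play = sum(share * y_no_play for _, share, _, y_no_play in groups)

ate = mean_if_all_play - mean_if_none_play
print(mean_if_all_play, mean_if_none_play, ate)
```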

[Diagram: nodes “Play chess”, “Math ability”, and “Math test scores”]

Note that the experiment does not solve all the cognitive problems related to policy design: for example, it does not identify impact heterogeneity (“for whom it works”)

The ATE is the average effect if every member of the population is treated

Generally there is more policy interest in the Average Treatment Effect on the Treated (ATT)

ATT = 10 in the chess example, while ATE = 6

(We ran an experiment and got an impact of 8. Can you think why this happens?)

Ability        True impact   Schools that volunteered   Schools that DID NOT volunteer
High ability   10            50%                        -
Mid ability    6             50%                        50%
Low ability    0             -                          50%

EXPERIMENTAL: mean of 66 and 54 = 60

CONTROL: mean of 56 and 48 = 52

Impact = 60 – 52 = 8

Internal validity

Little external validity
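Why the experiment returns 8 rather than the population ATE can be sketched as follows, assuming (as the slide suggests) that the volunteering schools are half high-ability and half mid-ability pupils, and that random assignment gives both arms that same mix. The helper `weighted` and the dictionary names are hypothetical.

```python
y_play = {"high": 66, "mid": 54, "low": 40}
y_no_play = {"high": 56, "mid": 48, "low": 40}

volunteer_mix = {"high": 0.5, "mid": 0.5}                 # schools that volunteered
population_mix = {"high": 0.25, "mid": 0.5, "low": 0.25}  # all pupils


def weighted(values: dict, mix: dict) -> float:
    """Average of `values` over the ability groups, weighted by `mix`."""
    return sum(share * values[group] for group, share in mix.items())


# Randomization within volunteering schools: both arms have the volunteer mix.
experimental_mean = weighted(y_play, volunteer_mix)
control_mean = weighted(y_no_play, volunteer_mix)
experimental_impact = experimental_mean - control_mean

# Exact (unrounded) effect for the whole pupil population.
population_ate = weighted(y_play, population_mix) - weighted(y_no_play, population_mix)
print(experimental_impact, population_ate)
```

The estimate is internally valid for the volunteering schools, but it does not carry over to a population that also includes low-ability pupils.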

Lessons learned

Impacts are differences, but not all differences are impacts

Differences (and changes) have many causes, but we do not need to understand all the causes

We are especially interested in one cause, the policy, and we would like to eliminate all the confounding causes of the difference (or change)

Internal vs. External validity

An example of a real ERDF policy

Grants to small enterprises to invest in R&D


To design an impact evaluation, one needs to answer three important questions

1. Impact of what?

2. Impact for whom?

3. Impact on what?

R&D EXPENDITURES AMONG THE FIRMS RECEIVING GRANTS

                  AVERAGE   N
PRE               65.000    2400
POST              75.000    2400
OBSERVED CHANGE   10.000

Is 10.000 the true average impact of the grant?

The fundamental challenge to this assumption is the well-known fact that things change over time by “natural dynamics”

How do we disentangle the change due to the policy from the myriad changes that would have occurred anyway?

                                   AVERAGE   N
T=0 (non treated)                  60.000    2600
T=1 (treated)                      75.000    2400
DIFFERENCE TREATED - NON TREATED   +15.000

IS 15.000 THE TRUE IMPACT OF THE POLICY?

WITH-WITHOUT (I.A.: NO PRE-INTERVENTION DIFFERENCES)

DECOMPOSITION OF WITH-WITHOUT DIFFERENCES

We cannot use experiments with firms, for obvious (?) political reasons

The good news is that there are lots of non-experimental counterfactual methods

The difference-in-differences (DID) is a combination of the first two strategies

And it is a good way to understand the logic of (non-experimental) counterfactual evaluation

[Figure: PRE DIFFERENCE and POST DIFFERENCE between treated and non-treated firms]

POST DIFFERENCE (15.000) – PRE DIFFERENCE (10.000) = Impact (5.000)
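A minimal sketch of the difference-in-differences calculation on four group means. The treated means (65,000 and 75,000) and the non-treated post mean (60,000) come from the slides; the non-treated pre mean of 55,000 is not shown directly and is an assumption implied by the pre difference of 10,000. The function name `diff_in_diff` is hypothetical.

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """Post-period gap between the groups minus the pre-period gap."""
    post_difference = treated_post - control_post
    pre_difference = treated_pre - control_pre
    return post_difference - pre_difference


impact = diff_in_diff(
    treated_pre=65_000,
    treated_post=75_000,
    control_pre=55_000,   # assumption: implied by the slide's pre difference
    control_post=60_000,
)
print(impact)

# Equivalent reading: the treated change minus the non-treated change.
assert impact == (75_000 - 65_000) - (60_000 - 55_000)
```

The result is an impact only if, without the grant, the treated firms would have followed the same path as the non-treated ones (the parallelism assumption).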

CAN WE TEST THE PARALLELISM ASSUMPTION?

With four observed means, we cannot

The parallelism becomes testable if we have two additional data points pre-intervention (PRE-PRE)

WHEN TO USE DIFF-IN-DIFF?

When we have longitudinal data and have reasons to believe that most of what drives selection is individual unobserved characteristics

Second, the path taken by the controls must be a plausible approximation of what would have happened to the treated in the absence of the intervention

The following is an example in which it would be better NOT to use DID

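              PRE-PRE   PRE      CHANGE
Treated       58.000    65.000   7.000
Non treated   57.000    55.000   -2.000
Diff-in-diff (before the intervention): 9.000

              PRE       POST     CHANGE
Treated       65.000    75.000   10.000
Non treated   55.000    67.000   12.000
Diff-in-diff (after the intervention): -2.000

Diff-in-diff-in-diff -11.000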
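The numbers above can be reproduced with a short sketch, reading them as means for treated and non-treated firms in three periods (PRE-PRE, PRE, POST). That reading, and the function name `did`, are assumptions made for illustration.

```python
def did(treated_start: float, treated_end: float,
        control_start: float, control_end: float) -> float:
    """Change among the treated minus change among the non-treated."""
    return (treated_end - treated_start) - (control_end - control_start)


# Before the intervention (PRE-PRE to PRE): a "placebo" diff-in-diff.
# Under parallel paths this should be close to zero.
pre_period_did = did(58_000, 65_000, 57_000, 55_000)

# Around the intervention (PRE to POST): the usual diff-in-diff.
post_period_did = did(65_000, 75_000, 55_000, 67_000)

# Diff-in-diff-in-diff: the usual estimate net of the pre-existing divergence.
triple_difference = post_period_did - pre_period_did
print(pre_period_did, post_period_did, triple_difference)
```

A pre-period value as large as 9,000 signals that the two groups were already on different paths, so the plain DID would be hard to defend here.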

Treated: PRE-PRE 58.000, PRE 65.000, change 7.000

Treated: PRE 65.000, linearly projected POST 72.000, observed POST 75.000

Linearly projected impact 3.000
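The linear projection can be sketched as follows, assuming the counterfactual for the treated is their own pre-intervention trend carried forward one period. The function name `linearly_projected_impact` is hypothetical.

```python
def linearly_projected_impact(pre_pre: float, pre: float, post: float) -> float:
    """Observed post value minus the value projected from the earlier trend."""
    trend = pre - pre_pre         # change observed before the intervention
    projected_post = pre + trend  # counterfactual: the same change continues
    return post - projected_post


impact = linearly_projected_impact(pre_pre=58_000, pre=65_000, post=75_000)
print(impact)
```

This uses only the treated firms, so it stands or falls with the assumption that their earlier trend would have continued unchanged.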
