Econometrics using STATA: Part 2

Benjamin Monnery, EconomiX, Univ Paris Nanterre

M1 Economie du Droit, 2017-2018

FINDING DATA · COVARIATE-ADJUSTMENT · MATCHING

CONTENT OF PART 2

When an RCT is not an option, the only alternative is to use observational (real-life) data

1. How to retrieve data?

• public sources (data.gouv), data repositories, journal archives
• how to clean/manipulate data sets in Stata?

2. How to fix selection bias?

• when there is only selection on observables (part 2)
  i.e. easy problems where you know all the determinants of assignment correlated with Y
  Methods: stratification, covariate-adjustment and matching

• when there is also selection on unobservables (part 3)
  Methods: IV, panel, DID, RDD...

EXAMPLE OF SELECTION ON OBSERVABLES

What’s the effect of lawyers on judicial outcomes ? e.g. Pr(conviction)

Among defendants, having a lawyer is “as good as random” conditional on:

• strength of the case (evidence)
• wealth, ...

⇒ Among these determinants of treatment, strength of the case surely correlates with Pr(conviction)

What about wealth? (depends on the judicial system)

Assumption: there is selection on observables (only) if

$E[Y_i^1 \mid T = 1, X] = E[Y_i^1 \mid T = 0, X]$

$E[Y_i^0 \mid T = 1, X] = E[Y_i^0 \mid T = 0, X]$

Potential outcomes are the same on average for treated and untreated individuals with the same X

Finding Data

Access to data is necessary to answer questions
> know the key sources, be able to manipulate their data

Access to novel data is (almost) necessary to publish in top scientific journals

• good data + good method + interesting topic = top science
• “competition” for data among researchers
• difficult to teach
> be curious, follow the news, learn to code

DATA GOUV

also look at INSEE, ministries’ websites...

HARVARD DATAVERSE AND JOURNAL ARCHIVES

Many top scientific journals now require online publication of datasets (like AER)

https://www.aeaweb.org/articles?id=10.1257/aer.20161503

CIVIL SOCIETY INITIATIVES

We will use some of their data later in the course (Diff-in-Diff)

Covariate-adjustment

INTUITION

We want to estimate a causal treatment effect by comparing the observed outcomes of treated and untreated people

If we think we know all the determinants X of treatment assignment T that also relate to Y (selection on observables), we can simply compare treated and untreated outcomes conditional on X

How to “condition on X”?

1. statistically control for X in a regression model (covariate adjustment): estimate $Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_i + \varepsilon_i$
2. use matching (e.g. propensity score matching)
3. use stratification (subclassification): compute differences within small groups (strata/cells) of X

⇒ Covariate-adjustment is the regression analog to stratification

In a problem of selection on observables, we want to compare treated and untreated within subgroups with similar potential outcomes

Ex: what’s the effect of lawyers on defendants’ probability of conviction?

⇒ True answer? Probably a reduction of Pr(conviction)

⇒ Problem (selection bias): the propensity to hire a lawyer and the probability of conviction are both related to the strength of evidence against the defendant

• if the court has strong evidence against the defendant, he is more likely to hire a lawyer to help him
• however, he is also more likely to be convicted eventually

⇒ hence a risk of selection bias due to differences in strength of evidence

If you can measure strength of evidence, selection bias can be “easily” eliminated by stratification, covariate-adjustment or matching

STRATIFICATION

(X = strength of evidence, T = has a lawyer)

Tab 1. Sample of Defendants          Tab 2. Numbers Convicted

X / T     Yes    No   All            X / T     Yes    No   All
Strong     40    10    70            Strong     30    10    40
Weak       10    20    30            Weak        5    15    20
All        50    50   100            All        35    25    60

Stata: tab X T (Tab 1), then tab X T if Convicted==1 (Tab 2)

• Naive estimator: compare rates of conviction between Yes & No
  Treated: 35/50 = 70%   Untreated: 25/50 = 50%

• Naive answer: a detrimental “effect” of lawyers of +20 percentage points!

⇒ But strength of evidence is related to both Lawyers and Convictions: selection bias

Better estimator: stratify by (condition on) strength of evidence

• Among Strong cases
  Treated: 30/40 = 75%   Untreated: 10/10 = 100%
  Treatment effect: -25pp

• Among Weak cases
  Treated: 5/10 = 50%   Untreated: 15/20 = 75%
  Treatment effect: -25pp

⇒ Hence the stratified estimator gives a treatment effect of -25 pp
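The stratum-by-stratum comparison can be reproduced in Stata. The sketch below rebuilds a small data set from the four cell counts used in the stratified calculation (40 and 10 defendants with and without a lawyer among strong cases, 10 and 20 among weak cases) and the corresponding numbers convicted. The variable names `strong`, `lawyer`, `convicted` and `n` are illustrative assumptions, not names from the course data.

```stata
* Rebuild the example from its cell counts (variable names are illustrative)
clear
input byte strong byte lawyer byte convicted int n
1 1 1 30
1 1 0 10
1 0 1 10
0 1 1  5
0 1 0  5
0 0 1 15
0 0 0  5
end
expand n          // one row per defendant
drop n

* Stratification: conviction rate in each (strength, lawyer) cell
tabulate strong lawyer, summarize(convicted) nostandard nofreq

* Covariate adjustment: the same comparison in a single regression
regress convicted lawyer strong
display "Adjusted effect of a lawyer: " _b[lawyer]
```

Because the effect is the same (-25pp) in both strata, the coefficient on `lawyer` in the additive regression equals -0.25, which illustrates why covariate-adjustment is the regression analog to stratification.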

STRATIFICATION VERSUS REGRESSIONS

Stratification solves problems of selection on observables

However, in practice it is only appropriate in the simplest situations:

• with few variables affecting T and Y
• which are all categorical
• e.g. 1 dummy (strong/weak), 2 dummies (+ rich/poor), ...

In real life, assignment often depends on a large number of non-binary variables, i.e. we need to stratify the sample into a lot of different groups (cells/strata)
⇒ problem known as the curse of dimensionality

Problem 1 with stratification: the curse of dimensionality

Assume we want to condition on (stratify by) k dummy variables: the number of different groups will be $2^k$

With k = 10, we have $2^{10} = 1024$ group-specific treatment effects to compare and average ($2^{11} = 2048$, $3^{10} = 59049$)

• computation can become long
• many cells will be empty or only contain treated or untreated observations: can’t compute a group-specific effect
> makes the estimated effect less general (i.e. local) as some observations are left out
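A small simulation makes the problem concrete. The sketch below is an illustration under assumed settings (1,000 simulated individuals, 10 independent dummies, treatment assigned by a coin flip): it counts how many of the 1024 possible cells are populated, and how many of those contain both treated and untreated observations.

```stata
clear
set seed 12345
set obs 1000

* 10 simulated dummy covariates and a simulated treatment
forvalues k = 1/10 {
    generate byte x`k' = runiform() < 0.5
}
generate byte t = runiform() < 0.5

* One stratum per observed combination of the 10 dummies
egen cell = group(x1-x10)
bysort cell: egen n_treated = total(t)
bysort cell: generate n_cell = _N

* A cell is usable only if it mixes treated and untreated observations
generate byte usable = n_treated > 0 & n_treated < n_cell
egen byte first = tag(cell)

display "Possible cells: " 2^10
count if first                // non-empty cells
count if first & usable       // cells where a within-cell effect can be computed
count if !usable              // observations left out of the stratified estimate
```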

Problem 2 with stratification : continuous variables

In real-life, many variables are not categorical but continuous

• strong/weak and rich/poor are statistical constructions to ease computation
• the true underlying variables are continuous in nature

⇒ stratification makes assumptions of homogeneity within groups

Regressions can easily solve both problems: many X, and a mix of categorical and continuous variables

COVARIATE ADJUSTMENT

Goal : conditional on X , treatment should be “as random”

Key: control appropriately for the effects of wealth and case strength

• Flexible specification:
  - only a linear effect: $Y_i = \beta_0 + \beta_1 \text{Lawyer}_i + \beta_2 \text{Wealth}_i + \varepsilon_i$
  - or a more flexible form: logarithmic, polynomial ($\text{Wealth}^2$, $\text{Wealth}^3$, ...), by categories/bins, linear + bins...

• Relevant data/variables:
  - use data on the “best” variables explaining treatment assignment, instead of long-shot proxy variables
    annual pre-tax income, disposable income, net wealth, gross wealth? Family wealth (to account for possible family support)?
  > a (linear?) combination of several variables, or some index?

Recall: do not condition on potential mediators (e.g. length of trial) as they will capture part of the true causal effect of T on Y
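The alternative functional forms listed above might be written in Stata as follows. This is only a sketch on simulated data: the variables `wealth`, `strength`, `lawyer` and `convicted`, and the coefficients used to generate them, are assumptions made for illustration.

```stata
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical data-generating process)
generate wealth   = exp(rnormal(10, 1))       // right-skewed wealth
generate strength = runiform()                // strength-of-evidence index in [0,1]
generate byte lawyer    = runiform() < invlogit(-6 + 0.5*ln(wealth) + 2*strength)
generate byte convicted = runiform() < invlogit(-1 + 3*strength - 0.1*ln(wealth) - 0.5*lawyer)

* (a) linear controls only
regress convicted i.lawyer c.wealth c.strength, vce(robust)

* (b) logarithm of wealth
generate ln_wealth = ln(wealth)
regress convicted i.lawyer c.ln_wealth c.strength, vce(robust)

* (c) cubic polynomial in (rescaled) wealth
generate w = wealth/10000
regress convicted i.lawyer c.w##c.w##c.w c.strength, vce(robust)

* (d) wealth quintile bins
xtile wealth_q = wealth, nquantiles(5)
regress convicted i.lawyer i.wealth_q c.strength, vce(robust)
```

Comparing the coefficient on `1.lawyer` across (a) to (d) gives a first impression of how much the estimate depends on the functional form chosen for the controls.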

ASSUMPTIONS

The key underlying assumptions :

• Conditional independence assumption (CIA, or unconfoundedness)

  $(Y_i^1, Y_i^0) \perp T \mid X$

  CIA is not directly testable (you need to argue why it’s credible)

• Common support (or overlap)

  $\Pr(T = 1 \mid X) \in (0, 1)$

  common support is easily testable

+ SUTVA

Then stratification, covariate-adjustment and matching will work

REGRESSION ANATOMY

Under those assumptions, why exactly does covariate-adjustment work, i.e. give a causal effect of T on Y?

⇒ what do multiple regressions do?

We know that a simple regression with OLS: $Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i$

... gives $\hat{\beta}_1 = \dfrac{Cov(Y, X_1)}{Var(X_1)}$

And a multiple regression with OLS: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

... gives $\hat{\beta} = (X'X)^{-1}X'Y$ ... ?

To understand what it means, let’s turn to the regression anatomy theorem

REGRESSION ANATOMY
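The regression anatomy theorem states that, in a multiple regression, the coefficient on a regressor $X_k$ equals $Cov(Y, \tilde{X}_k) / Var(\tilde{X}_k)$, where $\tilde{X}_k$ is the residual from regressing $X_k$ on all the other regressors. In other words, each coefficient is a simple-regression slope computed after "partialling out" the other covariates. The sketch below checks this numerically on Stata's bundled `auto` example data (not the course data).

```stata
sysuse auto, clear

* Multiple regression: coefficient on mpg, controlling for weight
regress price mpg weight
scalar b_multi = _b[mpg]

* Step 1: partial weight out of mpg and keep the residual
regress mpg weight
predict double mpg_res, residuals

* Step 2: simple regression of price on the residualized mpg
regress price mpg_res
scalar b_anatomy = _b[mpg_res]

display "multiple regression: " b_multi "   regression anatomy: " b_anatomy
```

The two coefficients coincide (the standard errors of the second regression do not, since it ignores the first step).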

SENSITIVITY TO CIA

We can estimate how sensitive the results are to potential confounders

Simulation approach:

• Simulate a “fake” variable F that is correlated with both T and Y

• Look at the effect of including this new covariate F on $\hat{\beta}_T$

• By comparing the $\hat{\beta}_T$'s under different constructions of F (variance-covariance), document the sensitivity of your findings with respect to a violation of CIA

⇒ If $\hat{\beta}_T$ only disappears under “unrealistic” assumptions (very large correlations (F, T) and (F, Y)), then the effect is robust to potential selection on unobservables
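One crude way to run such a simulation is sketched below. Everything here is an assumption made for illustration: the simulated data, the true effect of -0.25, and the particular construction of F as a noisy linear combination of T and Y whose strength is increased step by step.

```stata
clear
set seed 2018
set obs 5000

* Simulated data (hypothetical): one observed confounder x
generate x = rnormal()
generate byte t = (x + rnormal()) > 0
generate y = 1 - 0.25*t + 0.5*x + rnormal()

* Baseline estimate, controlling for x only
regress y t x
display "baseline beta_T = " %6.3f _b[t]

* Add a fake covariate F, increasingly correlated with both T and Y
foreach s of numlist 0 0.2 0.5 1 {
    capture drop F
    generate F = `s'*t + `s'*y + rnormal()
    quietly regress y t x F
    display "strength = `s'   beta_T = " %6.3f _b[t]
}
```

With strength 0, F is pure noise and the estimate stays close to the baseline; the question is how strong F has to become before the estimate changes materially.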

Matching

MATCHING

Another popular method to deal with selection on observables is matching

Matching (French: appariement)

Idea: make many pairs of similar individuals (i, j), one treated & one non-treated, and look at their average differences in outcomes

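$\widehat{ATT} = \dfrac{1}{N_1} \sum_{i: T_i = 1} \left( Y_i - Y_{j(i)} \right)$

where $Y_{j(i)}$ is the outcome of j, the non-treated individual closest to the treated i (i.e. the match for i), and $N_1$ is the number of treated individuals

Note that we can also recover ATU and ATE with matching:

$\widehat{ATU} = \dfrac{1}{N_0} \sum_{i: T_i = 0} \left( Y_{j(i)} - Y_i \right)$

where, for a non-treated i, j(i) is the closest treated individual and $N_0$ is the number of non-treated individuals

$\widehat{ATE} = \dfrac{N_1}{N} \widehat{ATT} + \dfrac{N_0}{N} \widehat{ATU}$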
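As a sketch of these estimators, the example below uses Stata's built-in `teffects nnmatch` on simulated data (the data-generating process and variable names are assumptions; the true effect is set to 2 for everyone). `teffects` reports the ATT and the ATE; the ATU is then backed out from the weighted-average identity ATE = (N1/N) ATT + (N0/N) ATU.

```stata
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical): selection on one observable x
generate x = rnormal()
generate byte t = (x + rnormal()) > 0
generate y = 1 + 2*t + x + rnormal()

* ATT: each treated unit is paired with its nearest untreated neighbour on x
teffects nnmatch (y x) (t), atet nneighbor(1)
matrix b = e(b)
scalar att = b[1,1]

* ATE: every unit is paired with its nearest neighbour in the other group
teffects nnmatch (y x) (t), ate nneighbor(1)
matrix b = e(b)
scalar ate = b[1,1]

* ATU recovered from ATE = (N1/N)*ATT + (N0/N)*ATU
quietly count if t == 1
scalar N1 = r(N)
scalar N0 = _N - N1
scalar atu = (_N*ate - N1*att)/N0

display "ATT = " %6.3f att "   ATU = " %6.3f atu "   ATE = " %6.3f ate
```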

SIMPLE EXTENSIONS

Note that we can

• match on many dimensions, many X
  that’s preferable to make CIA hold

• use several matches for a given i
  that’s preferred to reduce variance

$\widehat{ATT} = \dfrac{1}{N_1} \sum_{i: T_i = 1} \left( Y_i - \dfrac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right)$

For now, the simplest case: 1x1 matching on one X

1X1 MATCHING ON ONE X

ANOTHER EXAMPLE : 1X1 MATCHING ON ONE X

The estimated ATT after matching is 16426 − 13982 = 2444

whereas before matching: 16426 − 20724 = −4298

SEVERAL X

In practice, we usually need to match on many observable variables

⇒ difficult to find perfectly similar i and j on all X (exact matching)

Other methods:

• coarsened exact matching (“exact” matching within bins/ranges)

• distance-based matching
  - Euclidean distance: $\lVert x_i - x_j \rVert = \sqrt{(x_i - x_j)'(x_i - x_j)} = \sqrt{\sum_{k=1}^{K} (x_{ki} - x_{kj})^2}$
  - Normalized Euclidean distance, Mahalanobis distance

• propensity score matching

Distance-based and propensity score matching are most often used
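In Stata, the distance used by nearest-neighbor matching can be chosen with the `metric()` option of `teffects nnmatch`. The sketch below compares three metrics on simulated data (the data and variable names are assumptions for illustration; the true effect is 2).

```stata
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical): selection on two observables
generate x1 = rnormal()
generate x2 = rnormal()
generate byte t = (0.5*x1 - 0.5*x2 + rnormal()) > 0
generate y = 1 + 2*t + x1 + x2 + rnormal()

* Plain Euclidean distance
teffects nnmatch (y x1 x2) (t), atet metric(euclidean)

* Euclidean distance normalized by each covariate's variance
teffects nnmatch (y x1 x2) (t), atet metric(ivariance)

* Mahalanobis distance (the default metric)
teffects nnmatch (y x1 x2) (t), atet metric(mahalanobis)
```

When covariates are measured on very different scales, the plain Euclidean distance is dominated by the covariate with the largest variance, which is the usual reason for preferring a normalized metric.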

SEVERAL MATCHES

In practice, we often want to increase precision by using several matches for each i

• Single nearest neighbor matching

• k-nearest neighbors matching (e.g. k=5 or 10)

• Caliper (or radius) matching (maximal distance between i and j)

• Kernel matching (different weights by distance)

• etc.

Asymptotically, they are all similar; but in practice, this choice can matter
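These variants correspond to options of the user-written `psmatch2` command (installed once with `ssc install psmatch2`). The sketch below runs them on simulated data; the data, the caliper of 0.01 and the bandwidth of 0.06 are assumptions chosen for illustration only.

```stata
* ssc install psmatch2      // user-written package, install once
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate byte t = (0.5*x1 - 0.5*x2 + rnormal()) > 0
generate y = 1 + 2*t + x1 + x2 + rnormal()

* Single nearest neighbor
psmatch2 t x1 x2, outcome(y) neighbor(1)

* 5 nearest neighbors
psmatch2 t x1 x2, outcome(y) neighbor(5)

* Radius matching: all controls within a caliper of 0.01
psmatch2 t x1 x2, outcome(y) radius caliper(0.01)

* Kernel matching: controls weighted by their distance to i
psmatch2 t x1 x2, outcome(y) kernel kerneltype(epan) bwidth(0.06)
```

Comparing the reported ATT and its standard error across the four calls shows the trade-off mentioned above: more matches per treated unit tend to reduce variance, at the cost of using less similar controls.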

PROPENSITY SCORE MATCHING

Like with distance-based matching, we want to aggregate all differences in X in only one index, the propensity score p(x)

p(x) measures the probability that individuals are treated (T = 1) based on their observables

• Among treated, some were very likely to be treated, some less so
• Among non-treated, some were very likely not to be treated, some less so

⇒ common support in p(x) between the two groups

Propensity score matching matches individuals with similar p(x) (but different actual treatment status)

⇒ need to estimate p(x)

PROPENSITY SCORE MATCHING

To estimate p(x) for each individual (and then match neighbors), we usually use a probit (or logit) model:

$\Pr(T = 1 \mid X) = \Pr(T^* > 0)$
$= \Pr(X'\beta + \varepsilon > 0)$
$= \Pr(\varepsilon > -X'\beta)$
$= 1 - \Phi(-X'\beta)$
$= \Phi(X'\beta)$

where $\Phi$ is the standard normal CDF

⇒ $\hat{p}_i(x_i)$ ranges from 0 to 1 (if probit or logit is used)

X are pre-determined variables (and interactions, polynomials, etc.) likely to explain T

and then predict the scores: $\hat{p}_i(x_i) = \Phi(X_i'\hat{\beta})$

⇒ Hopefully with common support and balance of x between the two groups
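The two steps (estimate the probit, then predict the score) look as follows in Stata. This is a minimal sketch on simulated data; the variable names and the data-generating process are assumptions.

```stata
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate byte t = (0.5*x1 - 0.5*x2 + rnormal()) > 0

* Step 1: probit of treatment on pre-determined covariates
probit t x1 x2

* Step 2: predicted probability of treatment = estimated propensity score
predict double pscore, pr

* Distribution of the score in each group
summarize pscore if t == 1
summarize pscore if t == 0
```

Replacing `probit` with `logit` gives the logit version; in both cases the predicted score lies strictly between 0 and 1.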

MAIN PRACTICAL ISSUES

Check common support : compare the two distributions of p(x)

Check balance of covariates: use simple t-tests, tests of proportions, or the standardized bias (if std bias > 20%, the difference is still “large”)
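For a covariate x, the standardized bias is commonly defined as the difference in group means scaled by the average of the two group standard deviations. The exact definition intended on this slide is an assumption here; the version below is the usual one, with $\bar{x}_1, s_1^2$ the mean and variance of x among the treated and $\bar{x}_0, s_0^2$ among the non-treated:

$$
SB(x) = 100 \times \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\left( s_1^2 + s_0^2 \right) / 2}}
$$

It is computed before and after matching; a successful matching should bring it close to zero for every covariate.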

Be careful about inference: propensity score matching is a two-step process, so you need to adjust your standard errors (using bootstrap)

Many other choices to make: type of matching (1-1, 1-5, caliper, kernel, etc.), replacement or not...
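A possible sequence of checks is sketched below on simulated data (all names and the data-generating process are assumptions). It compares the distribution of the estimated score across groups, tests balance on one covariate before matching, and bootstraps the ATT from the user-written `psmatch2` command so that both steps are re-estimated in every replication.

```stata
* ssc install psmatch2      // user-written package, install once
clear
set seed 2018
set obs 2000

* Simulated data (hypothetical)
generate x1 = rnormal()
generate x2 = rnormal()
generate byte t = (0.5*x1 - 0.5*x2 + rnormal()) > 0
generate y = 1 + 2*t + x1 + x2 + rnormal()

* Common support: compare the distribution of p(x) in the two groups
probit t x1 x2
predict double pscore, pr
tabstat pscore, by(t) statistics(min p5 p50 p95 max)

* Balance before matching: simple t-test on one covariate
ttest x1, by(t)

* Inference: bootstrap the whole two-step procedure
bootstrap r(att), reps(100) seed(2018): psmatch2 t x1 x2, outcome(y) neighbor(1)
```

The number of replications (100) is kept small here only to make the example quick to run.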

BONUS 3 : PRISON-BASED EDUCATION AND RECIDIVISM

Goal: write a 1-page critical review of the paper/chapter

• brief summary of the paper (topic, method, main points, results)
• discuss method, experimental design, interpretations, conclusions
• relate it to the class
• criticisms, shortcomings?

Send PDF by email before next Monday (noon) at bmonnery@parisnanterre.fr

EXAMPLE : PRISON-BASED EDUCATION AND RECIDIVISM

Data on 31,000 prisoners released in New York State between 2005 and 2008

They follow recidivism within 3 years (rearrest)

Only 347 of them received a college degree in prison

Challenge: make those 347 graduates as comparable as possible to other prisoners not getting a college degree

Method: match prisoners based on their propensity to get a degree, predicted from 47 covariates
⇒ 1-1 nearest neighbor matching with a caliper of 0.01

APPLICATION IN STATA

Let’s imagine we want to estimate the effect of halfway houses (semi-liberté) instead of prison on recidivism in a sample of offenders sentenced to prison in France

• allows convicts to work, train, follow classes (probably good for reentry)

• requires them to return to “custody” every night (probably ok to monitor offenders)

• often perceived as less punitive (possibly bad for future deterrence)

⇒ what’s the net causal effect on recidivism, after accounting for selection?

Main assumption: the Conditional Independence Assumption holds after matching on the propensity score

In Stata, we can simply use psmatch2
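A minimal sketch of such an analysis is given below. The data are simulated and every variable name (`halfway`, `recid`, `age`, `priors`, `employed`) and coefficient is a hypothetical stand-in for the real sample; `psmatch2`, `pstest` and `psgraph` come from the user-written `psmatch2` package.

```stata
* ssc install psmatch2      // user-written package, install once
clear
set seed 2018
set obs 3000

* Simulated offenders (hypothetical covariates and data-generating process)
generate age = 18 + int(40*runiform())
generate priors = rpoisson(2)
generate byte employed = runiform() < 0.4
generate byte halfway = runiform() < invlogit(-2 + 0.02*age - 0.3*priors + 1.2*employed)
generate byte recid = runiform() < invlogit(-0.5 - 0.02*age + 0.3*priors - 0.4*employed - 0.3*halfway)

* Random sort order, so that ties in the score are broken at random
generate double u = runiform()
sort u

* 1-1 nearest neighbor matching on the propensity score,
* within a caliper of 0.01 and on the common support only
psmatch2 halfway age priors employed, outcome(recid) neighbor(1) caliper(0.01) common

* Covariate balance before and after matching
pstest age priors employed, both

* Distribution of the propensity score by treatment status
psgraph
```

`psmatch2` stores the estimated score in `_pscore` and flags observations on the common support in `_support`, which can be used for further checks.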
