
Lecture 7: Machine Learning in Economics

Thomas Lemieux, Econ 594

May 2019


Introduction

There is an enormous and fast-growing literature on machine learning.

Not so widely used in economics yet, but this is changing rapidly.

We can only scratch the surface in this lecture, but will nonetheless try to cover the following three issues:

Terminology and basic principles
Estimation methods: LASSO and regression trees/random forests
A few economic applications

Useful references listed in the syllabus are Mullainathan and Spiess (JEP 2017) and Athey and Imbens (2019 working paper).


Basics: What does "machine learning" mean?

Start with a simple example: identifying whether a picture shows a cat or a dog.

Humans learn to distinguish between cats and dogs by being told at a very early age "this is a cat" or "this is a dog" (e.g. in a book showing pictures of animals). We become really good at it after seeing a limited number of pictures.

The "machine" will be presented with pictures where the outcome variable y is a dummy indicating if this is a cat or a dog.

The x variables are a vector with perhaps 1,000,000 elements indicating colour (black or white in the simplest case) for each pixel.

We can now try to predict whether the image represents a cat or a dog by running a big logit model of y on a lot of x's.

The process of figuring out which model best fits the data is the "learning" part of ML.
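To make the setup concrete, here is a purely illustrative Python sketch of "a big logit of y on a lot of x's". The pixel matrix and cat/dog labels are randomly generated (so nothing meaningful is learned), and scikit-learn is used only for convenience; it is not the software discussed at the end of these notes.

```python
# Illustrative sketch only: random "pixels" stand in for real cat/dog images,
# so nothing meaningful is learned -- the point is the shape of the problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images, n_pixels = 500, 1000          # real applications have ~1,000,000 pixels
X = rng.integers(0, 2, size=(n_images, n_pixels))   # black/white dummy for each pixel
y = rng.integers(0, 2, size=n_images)                # 1 = cat, 0 = dog (made-up labels)

# A "big logit" of y on all the pixel dummies
# (scikit-learn adds an L2 penalty by default, a first taste of regularization)
big_logit = LogisticRegression(max_iter=1000).fit(X, y)
print(big_logit.predict(X[:5]))          # predicted labels for the first five images
```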


Basics: the overfitting problem

Finding a good predictive model for cats and dogs is a hard problem because of the large dimension of the x vector, and the need to use rich interactions among the x's (e.g. common colour patterns for nearby pixels) to fit the data well.

Stunning progress in this area over the last 10 years, but the basics of the ML approach are the same for much simpler problems like predicting labour force participation.

One important difference relative to standard econometrics is that we want a model that predicts well out-of-sample. By contrast, we tend to focus on in-sample fit in economics.

Throwing in more x's always improves the in-sample fit (e.g. the R²), but may make things worse out-of-sample because of the overfitting problem.

For example, even if the true value of the regression coefficient $\beta_k$ on $x_k$ is zero, the OLS estimate $\hat{\beta}_k$ will be different from zero, and the t-test will falsely tell us 5% of the time that $\beta_k$ is not zero.
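A small illustrative Python sketch of this point (synthetic data, scikit-learn for convenience): every $x_k$ is pure noise, yet the in-sample R² keeps rising as regressors are added while the out-of-sample R² deteriorates.

```python
# Overfitting sketch: y is unrelated to every x_k, so the true R2 is zero,
# yet the in-sample R2 rises mechanically with the number of regressors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 200, 80
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                         # no relationship with the x's
X_tr, y_tr, X_ho, y_ho = X[:100], y[:100], X[100:], y[100:]

for k in (5, 20, 40, 80):
    ols = LinearRegression().fit(X_tr[:, :k], y_tr)
    print(k,
          round(ols.score(X_tr[:, :k], y_tr), 2),   # in-sample R2: keeps rising
          round(ols.score(X_ho[:, :k], y_ho), 2))   # out-of-sample R2: falls (even below 0)
```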


Basics: Overfitting and regularization

But using $\beta_k = \hat{\beta}_k$ instead of $\beta_k = 0$ to do prediction in another sample will make the out-of-sample prediction worse.

Because standard methods like OLS overfit within sample, we need alternative methods that are constrained not to overfit by penalizing models that are overly flexible.

This is what is called regularization in ML jargon.

For example, the adjusted R² is a very simple way of doing regularization. Adding a few more x's always increases the R² but may not increase the adjusted R² because of the degrees of freedom correction:

$$R^2 = 1 - \frac{SSR}{TSS}, \qquad \bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{TSS/(n-1)}$$

We will get fewer variables in the model and less overfitting if we pick the model with the highest adjusted R² instead of the one with the highest R².
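As a quick numeric illustration of the formula (again a sketch on synthetic data), the snippet below computes R² and the adjusted R² by hand for a small model and for the same model padded with 20 irrelevant regressors.

```python
# Computing R2 and adjusted R2 from SSR and TSS, with the degrees-of-freedom correction
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
noise_x = rng.normal(size=(n, 20))               # irrelevant extra regressors

def r2_and_adjusted(X, y):
    X = np.column_stack([np.ones(len(y)), X])    # add an intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    k = X.shape[1] - 1                           # number of regressors (excl. intercept)
    r2 = 1 - ssr / tss
    adj = 1 - (ssr / (n - k - 1)) / (tss / (n - 1))
    return round(r2, 3), round(adj, 3)

print(r2_and_adjusted(x1.reshape(-1, 1), y))                  # small model
print(r2_and_adjusted(np.column_stack([x1, noise_x]), y))     # + 20 noise regressors
```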


Basics: Training and tuning

The specific penalty term in the adjusted R² is very arbitrary, and almost surely not the best way of doing regularization in practice. Furthermore, searching over all possible specifications (sets of x's) is infeasible in complex problems.

ML uses a very practical solution to the problem of overfitting/regularization.

We first select a subsample called a training sample to estimate the model of choice (LASSO, regression tree, neural network) with a given regularization (i.e. penalty) parameter.

We then look at how well the prediction model fits the data in a hold-out sample, and pick the regularization parameter that best fits the hold-out data.

This is called the tuning process.

Once we are done, the machine has "learned" how to do the best possible job in terms of prediction.
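A minimal sketch of the split itself (synthetic data; scikit-learn's train_test_split is just one convenient way to do it). The LASSO and regression-tree examples later reuse this structure to pick their tuning parameters.

```python
# Splitting the data into a training sample and a hold-out sample for tuning
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)
# 1. estimate the model on (X_train, y_train) for each value of the penalty parameter
# 2. compute the prediction error on (X_hold, y_hold)
# 3. keep the penalty value that fits the hold-out data best
print(X_train.shape, X_hold.shape)
```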


Basics: Supervised vs. unsupervised learning

Up to now we have been discussing the case of supervised machine learning.

We "supervise" the machine by telling it whether there is a cat or a dog in the picture (i.e. what y is), and then let it figure out how best to predict which animal is in the picture.

With unsupervised learning, there is no y and we just let the machine classify observations (pictures here) into groups that share common patterns.

In the case of human learning, we would show pictures of two animals that have never been seen before, and ask a person to separate the pictures into two groups.

We will focus on supervised learning from here on, though unsupervised ML has been used in some applications in economics.
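For contrast, here is a tiny illustrative sketch of unsupervised learning (synthetic data, k-means via scikit-learn): no y is provided, and the algorithm groups the observations on its own.

```python
# Unsupervised sketch: the machine groups observations without being told any labels
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Two unlabeled "types" of observations that differ only through their x's
X = np.vstack([rng.normal(0, 1, size=(50, 5)),
               rng.normal(4, 1, size=(50, 5))])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(groups[:5], groups[-5:])     # the two blocks land in different clusters
```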


Basics: ML in economics

ML (aka AI) has had spectacular recent successes in image recognition and natural language processing that are now widely used in practice.

No such �home runs" in economics yet, perhaps because mostproblems we look at involve smooth relationships between y and x .

But there are nonetheless three sets of applications in which ML is getting quite popular:

Prediction when this is the main goal of the exercise (e.g. forecasting with a large number of x's)
Creating new variables based on "big data" that can then be used in standard estimation. Examples include text as data (e.g. sentiment analysis based on tweets) and constructing risk factors for a demand for insurance model.
Treatment effect heterogeneity with a large set of x's.

We will look at an example of the third application later on.


LASSO

LASSO stands for Least Absolute Shrinkage and Selection Operator.

LASSO looks like OLS, except that it also includes a penalty term that shrinks the estimated β towards zero as a way of dealing with the overfitting problem.

Formally, the LASSO estimator is defined as follows:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - x_i \beta)^2 + \lambda\, ||\beta|| \qquad (1)$$

where $||\beta|| = \sum_{k=1}^{K} |\beta_k|$ is the sum of the absolute values of the regression coefficients.

LASSO is just OLS in the special case where λ = 0.

The higher the penalty term λ, the more "shrinkage" there is towards zero.
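An illustrative Python sketch of shrinkage (synthetic data). Note one wrinkle: scikit-learn's Lasso minimizes (1/2N)·SSR + α·||β||, so its α plays the role of λ only up to a scaling factor, and the Stata commands mentioned at the end of the lecture parameterize the penalty differently again.

```python
# LASSO sketch: as the penalty grows, coefficients shrink and some become exactly zero
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))   # only the first two x's matter
y = X @ beta_true + rng.normal(size=n)

for alpha in (0.01, 0.1, 0.5, 2.0):                    # sklearn's alpha ~ lambda (up to scaling)
    b = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print(alpha, np.round(b, 2), "nonzero coefficients:", int(np.sum(b != 0)))
```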


LASSO: a few comments

While LASSO tends to shrink all the $\beta_k$'s towards zero, it will also typically set a subset of the coefficients exactly to zero.

So LASSO can be viewed in part as a model selection procedure that removes some of the explanatory variables from the regression.

We have oversimplified the notation by weighting all the $\beta_k$'s equally in $||\beta|| = \sum_{k=1}^{K} |\beta_k|$.

In practice we need to account for differences in the scale of the different $x_k$ variables, which result in different scales for the $\beta_k$'s.

A simple way to do so is to normalize the $x_k$'s by subtracting means and dividing by standard deviations.
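A short sketch of the normalization step (synthetic data with deliberately mismatched scales): standardizing the $x_k$'s before LASSO puts all coefficients on a comparable footing for the penalty.

```python
# Standardize the x's (subtract means, divide by standard deviations) before LASSO
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 0.01])  # very different scales
y = X[:, 0] + 0.05 * X[:, 2] + rng.normal(size=200)

lasso_std = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000)).fit(X, y)
print(np.round(lasso_std.named_steps["lasso"].coef_, 2))   # coefficients on standardized x's
```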


LASSO: choosing the tuning parameter

To choose the optimal value of the tuning parameter λ, we look at the fit (e.g. the sum of squared residuals, SSR) in the hold-out sample:

$$SSR_H(\lambda) = \sum_{i=N_T+1}^{N_T+N_H} (y_i - x_i \hat{\beta}(\lambda))^2$$

where $N_H$ is the number of observations in the hold-out sample, while $N_T$ is the number of observations in the training sample used to estimate $\hat{\beta}(\lambda)$.

We can then pick the value of λ that yields the smallest value of $SSR_H(\lambda)$.
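A sketch of this tuning step on synthetic data: estimate the LASSO coefficients on the training half for a grid of penalty values, compute the hold-out SSR for each, and keep the value with the smallest hold-out SSR.

```python
# Pick the penalty by minimizing the sum of squared residuals in the hold-out sample
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 30))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=400)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)

grid = [0.01, 0.05, 0.1, 0.5, 1.0]                     # candidate penalty values
ssr_hold = [np.sum((y_ho - Lasso(alpha=a, max_iter=10000).fit(X_tr, y_tr).predict(X_ho)) ** 2)
            for a in grid]
best = grid[int(np.argmin(ssr_hold))]
print(dict(zip(grid, np.round(ssr_hold, 1))), "-> chosen penalty:", best)
```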


Regression trees

One drawback of OLS and LASSO is that it is not clear how to move away from the simple linear specification $y_i = x_i\beta + \varepsilon_i$ when we think the model is non-linear and/or interactions among the x's are important.

We could add polynomials and products (among combinations of the x's), but it is unclear how to do so. It could also yield an unmanageable number of right-hand-side variables when the dimension of the x vector is large.

Regression trees provide a convenient and efficient way of estimating regression models that are highly flexible functions of the x's.

The rough idea is to keep splitting the data into two groups based (at each split) on a given $x_k$. After 2 splits we have 4 groups of observations based on the x's. After 3 splits we have 8 groups, etc.


Regression trees

Each split results in two "branches". When we are done splitting, observations are grouped into "leaves". For instance, with 10 splits we end up with $2^{10} = 1024$ leaves.

The number of splits is also called the "depth" of the tree.

Once we have finished splitting the data, the predicted value of $y_i$ given $x_i$ is the average value of y among all observations in the same leaf.

We regularize regression tree models by choosing the depth of the tree. More depth yields a better in-sample fit but can worsen the out-of-sample fit due to the overfitting problem.

So the optimal depth can be chosen by looking at the fit (SSR again) in the hold-out sample.
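An illustrative sketch (synthetic data): fit regression trees of increasing depth on the training half and compare the in-sample fit with the hold-out SSR; beyond some depth the hold-out fit stops improving.

```python
# Regularize a regression tree through its depth, chosen on the hold-out sample
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, size=(1000, 5))
y = np.sin(3 * X[:, 0]) + 0.5 * (X[:, 1] > 0) + rng.normal(scale=0.3, size=1000)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (1, 3, 5, 10, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    ssr_hold = np.sum((y_ho - tree.predict(X_ho)) ** 2)
    print(depth,
          round(tree.score(X_tr, y_tr), 2),    # in-sample R2: always improves with depth
          round(ssr_hold, 1))                  # hold-out SSR: eventually gets worse
```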


Regression tree estimation

We start splitting the data by trying out each $x_k$, and go for the one that yields the smallest SSR.

Actually it is a bit more complicated, since we also need to choose a threshold c to split $x_k$ into two groups.

Splitting the data into observations with $x_k \le c$ and $x_k > c$ yields the following SSR:

$$SSR(k, c) = \sum_{i: x_{ik} \le c} (y_i - \bar{y}_{k,c,l})^2 + \sum_{i: x_{ik} > c} (y_i - \bar{y}_{k,c,r})^2 \qquad (2)$$

where $\bar{y}_{k,c,l}$ is the average value of y to the left of c and $\bar{y}_{k,c,r}$ is the average value of y to the right of c:

$$\bar{y}_{k,c,l} = \frac{1}{N_l} \sum_{i: x_{ik} \le c} y_i \qquad (3)$$

$$\bar{y}_{k,c,r} = \frac{1}{N_r} \sum_{i: x_{ik} > c} y_i \qquad (4)$$
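A minimal sketch of one splitting step for a single regressor (synthetic data): search over candidate thresholds c and keep the one minimizing SSR(k, c) from equation (2), with the left/right means as in (3) and (4). A full tree repeats this search over every regressor and then recursively within each branch.

```python
# Find the threshold c that minimizes SSR(k, c) for one regressor x_k
import numpy as np

def best_split(xk, y):
    best_c, best_ssr = None, np.inf
    for c in np.unique(xk)[:-1]:                      # candidate thresholds
        left, right = y[xk <= c], y[xk > c]           # observations left/right of c
        ssr = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if ssr < best_ssr:
            best_c, best_ssr = c, ssr
    return best_c, best_ssr

rng = np.random.default_rng(9)
xk = rng.uniform(0, 1, size=200)
y = 2.0 * (xk > 0.6) + rng.normal(scale=0.2, size=200)   # true jump at c = 0.6
print(best_split(xk, y))                                  # recovers a threshold near 0.6
```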


Regression tree estimation

We use the same procedure to look for the next split.

For instance, for the group to the left ($x_{ik} \le c$) we go through all the regressors (and thresholds) to find out which split again yields the largest reduction in the SSR.

Interestingly, this could involve further splitting on the basis of the same regressor $x_k$.

We do the same thing for the group to the right.

After the second level of splitting we now have 4 groups, which we further split into 8 groups, etc.


Random forests

One disadvantage of the regression tree method is that we get discontinuous jumps in the predicted values.

In the very first split discussed above, we have $\hat{y}_i = \bar{y}_{k,c,l}$ when $x_{ik} \le c$ and $\hat{y}_i = \bar{y}_{k,c,r}$ when $x_{ik} > c$. So when $x_{ik}$ is close to the threshold c, a small increase in $x_k$ could yield a large jump in $\hat{y}$.

The jumps get smaller as we keep splitting the data, but $\hat{y}$ still remains "jumpy".

Random forest methods provide a way of smoothing the relationship between $\hat{y}$ and x by estimating lots of different trees and averaging the predicted values $\hat{y}_i$ obtained using each tree.

This is typically done by i) drawing bootstrap samples and ii) only using a random subset of x's for each split.

This yields a separate (and different) estimated tree for each bootstrap replication.
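An illustrative sketch (synthetic data): a single shallow tree produces a step function, while a random forest built from bootstrapped trees that consider only a random subset of the x's at each split (max_features here) yields much smoother predictions.

```python
# Random forest sketch: averaging many bootstrapped, decorrelated trees smooths the fit
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
X = rng.uniform(-2, 2, size=(1000, 5))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=1000)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=500, max_features=2,   # random subset of x's per split
                               bootstrap=True, random_state=0).fit(X, y)

x_grid = np.column_stack([np.linspace(-2, 2, 9), np.zeros((9, 4))])
print(np.round(tree.predict(x_grid), 2))     # step function with discrete jumps
print(np.round(forest.predict(x_grid), 2))   # averaged over trees: much smoother
```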


Comparison of the methods for a house pricing model

Let's look at the example in Mullainathan and Spiess, where they use 51,808 observations from the American Housing Survey to estimate the different models (for the log house price).

The data is divided into a training sample of 10,000 observations and a hold-out sample of 41,808 observations.

There are 150 variables, such as the total number of rooms, the number of bathrooms, total square footage, etc.

They compare in-sample (training sample) and out-of-sample (hold-out sample) fit using the R² (i.e. the SSR).


Applications: treatment effect heterogeneity

As mentioned earlier, applications of ML in economics include prediction problems, the creation of new variables using big data, and uncovering the extent of treatment effect heterogeneity.

Let's consider an application where ML is used to look at treatment effect heterogeneity.

"The Mortality and Medical Costs of Air Pollution: Evidence from Changes in Wind Direction" by Deryugina, Heutel, Miller, Molitor, and Reif (March 2019 working paper).

The paper uses wind direction as an instrumental variable for air pollution.

For example, if there is a dirty coal plant 20 km east of where you live, air quality will be much worse when the wind comes from the East than from the West.

Wind direction is quasi-random, so it should be a good instrumental variable.


Air pollution paper

They have Medicare data from the (almost) entire 65+ population of the United States.

Variables include demographics, county of residence, date of death, health care use, chronic conditions, etc.

Pollution is measured using PM 2.5 concentration. We want to estimate the effect of PM 2.5 exposure on mortality (3-day mortality per million beneficiaries) and on life-years lost.

All estimated models include detailed geographical controls (county fixed effects), so variation in PM 2.5 is of the DiD type (changes in PM 2.5 at the daily level in one county vs. others).

But DiD may be invalid if, for instance, PM 2.5 goes up when there is a positive economic shock that may also have an impact on health, hospitalization, etc. Measurement error in PM 2.5 could also be a source of bias that will be solved by using IV.

IV estimates are indeed substantially larger than OLS (Table 2), suggesting these potential biases are quite important here.


Footnote: how to interpret PM 2.5 effects?

PM 2.5 is a standard measure of the concentration of small particles (less than 2.5 microns in diameter) in the air, expressed in micrograms per cubic meter.

The "healthy" target is a PM 2.5 of less than 25.

The annual average is about 5 in Vancouver, but it has been over 100 when the wind brought smoke from wildfires in the BC Interior over the last two summers.

As a comparison, annual averages were 73 in Beijing and 143 in Delhi in 2016.

By most accounts, Los Angeles has the worst pollution problem in North America. Its PM 2.5 annual average used to be 26 in 1999 and is now 12.


Heterogeneous treatment effects

To simplify the problem, let's summarize wind direction using a dummy for whether it goes in a "bad" (increases pollution) or "good" direction.

Individuals get the "pollution treatment" when the wind direction is bad.

We can then estimate a separate ML prediction model for 3-day mortality among the treatment and control groups.

The estimated models (LASSO and random forest in the paper) are then used to predict 3-day mortality for individuals with characteristics $x_{it}$ under treatment ($\hat{y}_T(x_{it})$) and under control ($\hat{y}_C(x_{it})$).

The (heterogeneous) treatment effect for an individual with characteristics $x_{it}$ is:

$$\hat{S}(x_{it}) = \hat{y}_T(x_{it}) - \hat{y}_C(x_{it}) \qquad (5)$$
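A sketch of this construction on synthetic data (a simple "two-model" approach with random forests; it only mimics the spirit of the paper's procedure, not its exact implementation): fit separate prediction models for treated and control observations, then take the difference in predicted outcomes for each individual, as in equation (5).

```python
# Heterogeneous treatment effects: S_hat(x) = y_hat_T(x) - y_hat_C(x), equation (5)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(11)
n = 4000
X = rng.normal(size=(n, 6))                     # individual characteristics x_it
d = rng.integers(0, 2, size=n)                  # 1 = "bad wind" day (treated), 0 = control
# Synthetic outcome: the treatment only hurts individuals with a high value of x_0
y = 0.5 * X[:, 1] + d * np.maximum(X[:, 0], 0.0) + rng.normal(size=n)

model_T = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[d == 1], y[d == 1])
model_C = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[d == 0], y[d == 0])

S_hat = model_T.predict(X) - model_C.predict(X)            # heterogeneous effect for each x_it
print(np.round(np.quantile(S_hat, [0.10, 0.50, 0.90, 0.99]), 2))   # concentrated in the top
```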


Heterogeneous treatment effects

The main finding is that while $\hat{S}(x_{it})$ is small for most individuals, it is quite large for a subset of individuals (presumably) in bad health.

Figure 6 groups $\hat{S}(x_{it})$ into bins and finds that most of the treatment effect is concentrated among the "top 5%" of the distribution.

The paper also presents estimates where 3-day mortality is replaced by life-years lost, which is the product of mortality and predicted life expectancy.

Predicted life expectancy is itself computed using ML techniques, and varies considerably depending on age and health conditions.

Table 6 shows that the effect of pollution on mortality is MUCH larger for individuals with a low life expectancy relative to those with a higher life expectancy.

The effect goes from 18.51 when life expectancy is < 1 year to 0.52 when life expectancy is between 5 and 10 years. It is only 0.06 and not significant when life expectancy is > 10 years.


Stata commands

Stata is not the software program of choice if you want to estimate models using ML techniques.

R is definitely the place to go. See, for instance, Paul Schrimpf's Econ 628 notes (http://faculty.arts.ubc.ca/pschrimpf/628/mlExamplePKH.html).

But Stata is catching up, and you can now do LASSO and random forest estimation.

For LASSO, the main command (part of the "lassopack" package) is lasso2.

The command looks like regress, but you also need to specify the lambda factor, or let the program pick a value for you.

Likewise, the main command for random forest estimation (part of the "randomforest" package) is randomforest. Only available in Stata 15, though...

To install these packages, type "ssc install lassopack" or "ssc install randomforest".
