kaplan: introductory econometrics

Introductory Econometrics: Description, Prediction, and Causality

Second edition

David M. Kaplan

Copyright © 2021 David M. Kaplan

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (the "License"); you may not use this file or its source files except in compliance with the License. You may obtain a copy of the License at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode, with a more readable summary at https://creativecommons.org/licenses/by-nc-sa/4.0.

First edition, January 2019; second edition, June 2020; updated January 5, 2021.

To my past, present, and future students, including NLK and OAK. (DMK)

The chief difficulty Alice found at first was in managing her flamingo: she succeeded in getting its body tucked away, comfortably enough, under her arm, with its legs hanging down, but generally, just as she had got its neck nicely straightened out, and was going to give the hedgehog a blow with its head, it would twist itself round and look up in her face, with such a puzzled expression that she could not help bursting out laughing: and when she had got its head down, and was going to begin again, it was very provoking to find that the hedgehog had unrolled itself, and was in the act of crawling away ... Alice soon came to the conclusion that it was a very difficult game indeed.

Lewis Carroll, Alice's Adventures in Wonderland
(An allegory for econometrics)

Brief Contents

Contents
List of Discussion Questions
Preface
Textbook Learning Objectives
Notation

1 Getting Started with R (or Stata)

I Analysis of One Variable
Introduction
2 One Variable: Population
3 One Variable: Sample
4 One Variable: Two Populations
5 Midterm Exam 1

II Regression
Introduction
6 Comparing Two Distributions by Regression
7 Simple Linear Regression
8 Nonlinear and Nonparametric Regression
9 Regression with Two Binary Regressors
10 Regression with Multiple Regressors
12 Internal and External Validity

III Time Series
Introduction
13 Time Series: One Variable
14 First-Order Autoregression
15 AR(p) and ADL Models
16 Final Exam

Bibliography
Index

Contents

Contents
List of Discussion Questions
Preface
Textbook Learning Objectives
Notation


1 Getting Started with R (or Stata)
  1.1 Comparison of R and Stata
  1.2 R
    1.2.1 Accessing the Software
    1.2.2 Installing Packages
  1.3 Stata
    1.3.1 Accessing the Software
    1.3.2 Installing Additional Commands
  1.4 Optional Resources
    1.4.1 R Tutorials
    1.4.2 R Quick References
    1.4.3 Running Code in This Textbook
    1.4.4 Stata Resources
  Empirical Exercises


I Analysis of One Variable
Introduction

2 One Variable: Population
  2.1 The World is Random
    2.1.1 Before and After: Two Perspectives
    2.1.2 Before and After Sampling
    2.1.3 Outcomes and Mechanisms
  2.2 Population Types
    2.2.1 Finite Population
    2.2.2 Infinite Population
    2.2.3 Superpopulation
    2.2.4 Which Population is Most Appropriate?
  2.3 Description of a Population
    2.3.1 Overview of Distributions and Their Features
    2.3.2 Binary Variable
    2.3.3 Discrete Variable
    2.3.4 Categorical or Ordinal Variable
    2.3.5 Continuous Variable
  2.4 Prelude to Prediction: Precipitation
    2.4.1 Easy: "Predict" Current Weather
    2.4.2 Minimizing Mean Loss
    2.4.3 Different Probability
    2.4.4 Different Loss Function
  2.5 Prediction with a Known Distribution
    2.5.1 Common Loss Functions
    2.5.2 Optimal Prediction: Generic Examples
    2.5.3 Optimal Prediction: Specific Examples
    2.5.4 Mean and Mode as Optimal Predictions
    2.5.5 Interval Prediction


3 One Variable: Sample
  3.1 Bayesian and Frequentist Perspectives
    3.1.1 Very Brief Overview: Bayesian Approach
    3.1.2 Very Brief Overview: Frequentist Approach
    3.1.3 Bayesian and Frequentist Differences
  3.2 Types of Sampling
    3.2.1 Independent
    3.2.2 Identically Distributed
    3.2.3 Examples
  3.3 The Empirical Distribution
  3.4 Estimation of the Population Mean
    3.4.1 "Description": Sample Mean
    3.4.2 "Prediction": Least Squares
    3.4.3 Non-iid Sampling: Weights
  3.5 Sampling Distribution of an Estimator
    3.5.1 Some Mathematical Calculations
    3.5.2 Graphs: Binary Population
    3.5.3 Graphs: Continuous Population
    3.5.4 Table: Values in Repeated Samples
  3.6 Sampling Distribution Approximation
    3.6.1 Non-iid Sampling
  3.7 Quantifying Accuracy of an Estimator
    3.7.1 Bias
    3.7.2 Mean Squared Error
    3.7.3 Consistency
    3.7.4 Asymptotic MSE
  3.8 Quantifying Uncertainty: Frequentist Approaches
    3.8.1 Standard Errors
    3.8.2 Confidence Intervals
    3.8.3 p-values
    3.8.4 Statistical Significance
    3.8.5 Hypothesis Testing
    3.8.6 Mental Math for Statistical Uncertainty
  3.9 Quantifying Uncertainty: Misinterpretation and Misuse
    3.9.1 Perils of Ignoring Non-iid Sampling
    3.9.2 Not a Bayesian Belief
    3.9.3 Unlikely Events Happen (or Use Common Sense)
    3.9.4 Example of Ignoring Outside Knowledge
    3.9.5 Multiple Testing (Multiple Comparisons)
    3.9.6 Publication Bias and Science
    3.9.7 Ignoring Point Estimates (Economic Significance)
    3.9.8 Other Sources of Uncertainty
  3.10 Statistical Decision Theory
  Empirical Exercises


4 One Variable: Two Populations
  4.1 Description
    4.1.1 Population Mean Difference
    4.1.2 Estimation
    4.1.3 Quantifying Uncertainty
  4.2 Prediction
  4.3 Causality: Overview
    4.3.1 Correlation Does Not Imply Causation
    4.3.2 Structural and Reduced Form Approaches
    4.3.3 General Equilibrium and Partial Equilibrium
  4.4 Potential Outcomes Framework
    4.4.1 Potential Outcomes
    4.4.2 Treatment Effects
    4.4.3 SUTVA
  4.5 Average Treatment Effect
    4.5.1 Definition and Interpretation
    4.5.2 ATE: Examples
    4.5.3 Limitation of ATE
  4.6 ATE: Identification
    4.6.1 Setup and Identification Question
    4.6.2 Randomization
    4.6.3 Reasons for Identification Failure
  4.7 ATE: Estimation and Inference
  Empirical Exercises


II Regression
Introduction

6 Comparing Two Distributions by Regression
  6.1 Logic
    6.1.1 Terminology
    6.1.2 Theorems
    6.1.3 Comparing Assumptions
  6.2 Preliminaries
    6.2.1 Population Mean Model in Error Form
    6.2.2 Joint and Marginal Distributions
    6.2.3 Conditional Distributions
    6.2.4 Conditional Mean
    6.2.5 Comparison of Joint, Marginal, and Conditional Distributions
    6.2.6 Independence and Dependence
  6.3 Population Model: Conditional Expectation Function
    6.3.1 Conditional Expectation Function
    6.3.2 CEF Error Term
    6.3.3 CEF Model in Error Form
    6.3.4 Linear CEF Model
    6.3.5 Interpretation: Description and Prediction
    6.3.6 Interpretation with Values Besides 0 and 1
  6.4 Population Model: Potential Outcomes
  6.5 Population Model: Structural
    6.5.1 Linear Structural Model
    6.5.2 General Structural Model and ASE
  6.6 Identification
    6.6.1 Linear Structural Model
    6.6.2 Average Treatment Effect
    6.6.3 Average Structural Effect
  6.7 Estimation: OLS
    6.7.1 Code
  6.8 Quantifying Uncertainty
    6.8.1 Heteroskedasticity
    6.8.2 Code
  Empirical Exercises


7 Simple Linear Regression
  7.1 Misspecification
  7.2 Coping with Misspecification
    7.2.1 Model of Three Values
    7.2.2 More Than Three Values
  7.3 Linear Projection
    7.3.1 Geometric Intuition
    7.3.2 Probabilistic Projection
    7.3.3 Formulas and Interpretation
    7.3.4 Linear Projection Model in Error Form
  7.4 "Best" Linear Approximation
    7.4.1 Definition and Interpretation
    7.4.2 Limitations
  7.5 "Best" Linear Predictor
  7.6 Causality Under Misspecification
  7.7 OLS: Estimation and Inference
    7.7.1 OLS Estimator Insights
    7.7.2 Statistical Properties
    7.7.3 Code
  7.8 Simple Linear Regression
  Empirical Exercises


8 Nonlinear and Nonparametric Regression
  8.1 Log Transformation
    8.1.1 Properties of the Natural Log Function
    8.1.2 The Log-Linear Model
    8.1.3 The Linear-Log Model
    8.1.4 The Log-Log Model
    8.1.5 Warning: Model-Driven Results
    8.1.6 Code
  8.2 Nonlinear-in-Variables Regression
    8.2.1 Linearity
    8.2.2 Nonlinearity
    8.2.3 Estimation and Inference
    8.2.4 Parameter Interpretation
    8.2.5 Description, Prediction, and Causality
    8.2.6 Code
  8.3 Nonparametric Regression
  Empirical Exercises


9 Regression with Two Binary Regressors
  9.1 Omitted Variable Bias
    9.1.1 An Allegory
    9.1.2 Formal Conditions
    9.1.3 Consequences
    9.1.4 OVB in Linear Projection
  9.2 Linear-in-Variables Model
  9.3 Fully Saturated Model
  9.4 Structural Identification by Exogeneity
  9.5 Identification by Conditional Independence
  9.6 Collider Bias
  9.7 Causal Identification: Difference-in-Differences
    9.7.1 Bad Approaches
    9.7.2 Counterfactuals and Parallel Trends
    9.7.3 Identification
    9.7.4 Extensions
  9.8 Estimation and Inference
  Empirical Exercises


10 Regression with Multiple Regressors
  10.1 Omitted Variable Bias
  10.2 Linear-in-Variables Model
    10.2.1 Model and Coefficient Interpretation
    10.2.2 Limitations
    10.2.3 Code
  10.3 Interaction Terms
    10.3.1 Limitation of Linear-in-Variables Model
    10.3.2 Interpretation of Interaction Term
    10.3.3 Non-Binary Interaction
    10.3.4 Code
  10.4 Other Examples
  10.5 Assumptions for Linear Projection
    10.5.1 Multicollinearity (Two Types)
    10.5.2 Formal Assumptions and Results
  10.6 Structural Identification
    10.6.1 Linear Structural Model
    10.6.2 General Structural Model
    10.6.3 Conditional ATE
  Empirical Exercises



12 Internal and External Validity
  12.1 Terminology
  12.2 Threats to External Validity
    12.2.1 Different Place
    12.2.2 Different Time
    12.2.3 Different Population
  12.3 Threats to Internal Validity
    12.3.1 Functional Form Misspecification
    12.3.2 Measurement Error in the Outcome Variable
    12.3.3 Measurement Error in the Regressors
    12.3.4 Non-iid Sampling and Survey Weights
    12.3.5 Missing Data
    12.3.6 Sample Selection
    12.3.7 Omitted Variable Bias and Collider Bias
    12.3.8 Simultaneity and Reverse Causality
  Empirical Exercises

III Time Series
Introduction


13 Time Series: One Variable
  13.1 Terms and Notation
  13.2 Populations, Randomness, and Sampling
  13.3 Stationarity
  13.4 Autocovariance and Autocorrelation
  13.5 Estimation
    13.5.1 Mean
    13.5.2 Autocovariances and Autocorrelations
  13.6 Nonstationarity
    13.6.1 Trends
    13.6.2 Seasonality
    13.6.3 Cycles
    13.6.4 Structural Breaks
  13.7 Decomposition
  13.8 Transformations
  Empirical Exercises


14 First-Order Autoregression
  14.1 Model
  14.2 Description
  14.3 Prediction (Forecasting) Optimality
  14.4 Estimation
    14.4.1 Code
  14.5 Parameter Stability
  14.6 Multi-Step Forecast
    14.6.1 Intuition: Mean Zero (Special Case)
    14.6.2 General Results with AR Parameters
    14.6.3 Direct Approach
    14.6.4 Code
  14.7 Interval Forecasts
    14.7.1 Goal and Sources of Uncertainty
    14.7.2 Intervals Assuming Normality
    14.7.3 Code
  14.8 More R Examples
    14.8.1 AR(1) Multi-Step Forecast Intervals
    14.8.2 General R Forecast Allowing Seasonality and Trend
  Empirical Exercises


15 AR(p) and ADL Models
  15.1 The AR(p) Model
  15.2 Model Selection: How Many Lags?
    15.2.1 Difficulties and Intuition
    15.2.2 AIC and BIC Formulas
    15.2.3 Comparison of AIC and BIC
    15.2.4 Code
  15.3 Autoregressive Distributed Lag Regression
  Empirical Exercises

16 Final Exam

Bibliography
Index

List of Discussion Questions

2.1 web traffic
2.2 student data
2.3 banana loss function
2.4 optimal banana prediction

3.1 probability of positive mean
3.2 equal p-values, equal belief
3.3 jellybean solution
3.4 nova
3.5 Ebola drug

4.1 DPC with two populations
4.2 description, prediction, causality
4.3 cash transfer spillovers
4.4 breakfast effect

6.1 logic with feathers
6.2 joint distribution and causality
6.3 ES habits and final scores
6.4 marriage and salary

7.1 Facebook
7.2 BLP
7.3 student-teacher ratio simple regression

8.1 pollution and house price
8.2 nonlinear OVB
8.3 nonlinear wage model interpretation
8.4 model evaluation

9.1 assessing OVB
9.2 OVB: ES habits
9.3 bad panel approach 1 for Mariel boatlift
9.4 bad panel approach 2 for Mariel boatlift
9.5 bad panel approach 1 for fracking
9.6 bad panel approach 2 for fracking
9.7 parallel trends skepticism

10.1 OVB with multiple regressors
10.2 sleep and interactions
10.3 wage and interactions
10.4 linear-in-variables

12.1 external validity: minimum wage
12.2 exercise error
12.3 measurement error: scrap rate
12.4 program attrition
12.5 missing salary data
12.6 health and medical expenditure

13.1 autocorrelation
13.2 nonstationarity

14.1 forecast and reality
14.2 recession-affected coefficient
14.3 long-horizon AR forecast
14.4 forecast sanity check

15.1 lag choice for forecasting

Preface

This text was prepared for the 15-week semester Introductory Econometrics course at the University of Missouri. The class focuses on statistical description, prediction, and "causality," including both structural parameters and treatment effects. Description and prediction (forecasting) with time series are also covered. Students learn to think probabilistically, understand prediction and causality, judge whether various assumptions hold true in real-world examples, and apply econometric methods in R.

As usual, this text may be used to teach different types of classes. In full, the text provides a 15-week semester class that assumes a previous class in probability and statistics. That prerequisite could be skipped if more time is spent on the "review" material in the first few chapters. Calculus is avoided but could be added in the usual places. A shorter class could omit the time series material. Of course, any material may be expanded, condensed, or skipped, as the instructor desires.

Some complementary, complimentary texts and courses deserve mention. Econometrics professor Matt Masten has a "Causal Inference Bootcamp" video series[1] as well as some "Causal Inference with R" free courses on DataCamp.[2] Relevant videos are linked at the beginning of each chapter in this textbook. Stanford statistics professors Trevor Hastie and Rob Tibshirani created a free introductory machine learning (statistical learning) course, focusing more on prediction and estimation.[3] Their course uses their free textbook (James, Witten, Hastie, and Tibshirani, 2013) that includes R examples.[4] Hastie, Tibshirani, and Friedman (2009) also provide their more advanced statistical learning text for free.[5] For econometrics texts focused on prediction and time series, see Diebold (2018a,b,c).[6] The forecasting book by Hyndman and Athanasopoulos (2019) is at https://otexts.com/fpp2 and uses R. Finally, Hanck, Arnold, Gerber, and Schmelzer (2018) mirror the structure of the (expensive) textbook of Stock and Watson (2015), providing many R examples to illustrate the concepts they explain.[7]

[1] https://mattmasten.github.io/bootcamp
[2] https://www.datacamp.com/community/open-courses
[3] https://www.edx.org/course/statistical-learning
[4] https://statlearning.com
[5] https://web.stanford.edu/~hastie/ElemStatLearn
[6] http://www.ssc.upenn.edu/~fdiebold/Textbooks.html
[7] https://www.econometrics-with-r.org

One distinguishing feature of this text is the development of the ideas of (and distinctions among) statistical description, prediction, causal inference, and structural estimation in the simplest possible settings. Other texts combine these with all the complications of regression from the beginning, often confusing students (like my past self).

A second distinguishing feature is that this text's source files are freely available. Instructors may modify them as desired, or copy and paste LaTeX code into their own lecture notes, subject to the Creative Commons license linked on the copyright page. I wrote the text in Overleaf, an online (free) LaTeX environment that includes knitr support, so most of the R code and output is in the same .Rtex files alongside the LaTeX code. Graphs are either generated from code in the .Rtex files or else from a single .R file also provided in the source material. You may see, copy, and download the entire project from Overleaf[8] or from my website.[9]

Third, I provide learning objectives for the overall book and for each chapter. This follows current best practices for course design. Upon request, I can provide a library of multiple choice questions labeled by learning objective. (Empirical exercises are already at the end of each textbook chapter.)

Fourth, in-class (or online) discussion questions are included along the way. When I teach in person (30–40 students), I prefer to punctuate lectures with such questions every 20–30 minutes, where students first discuss them for a couple minutes in small groups of 2–3 students, and then volunteer to share their group's ideas with the whole class for another couple minutes. This provides an active learning opportunity, a time for students to realize they don't understand the lecture material (so they can ask questions), practice discussing econometrics with peers, and (if nothing else) a few minutes' rest.

Thanks to everyone for their help and support: my past econometrics instructors, my colleagues and collaborators, my students (who have not only inspired me but alerted me to typos and other deficiencies in earlier drafts), and my family.

David M. Kaplan
Summer 2018 (edited Summer 2020)
Columbia, Missouri, USA

[8] https://www.overleaf.com/read/fszrgmwzftrk
[9] http://faculty.missouri.edu/kaplandm/teach.html

Textbook Learning Objectives

For good reason, it has become standard practice to list learning objectives for a course as well as each unit within the course. Below are the learning objectives corresponding to this text overall. Each chapter lists more specific learning objectives that map to one or more of these overall objectives. The accompanying exercises are also classified by learning objectives. I hope you find these helpful guidance, whether you are a solo learner, a class instructor, or a class student.

The textbook learning objectives (TLOs) are the following.

1. Define terms from probability, statistics, and econometrics, both mathematically and intuitively.

2. Describe various econometric methods, both mathematically and intuitively, including their objects of interest and assumptions, and the logical relationship between the assumptions and corresponding theorems and properties.

3. Interpret the values that could be estimated with infinite data, in terms of description, prediction, and causality (or economic meaning).

4. Explain the frequentist/classical statistical and asymptotic frameworks, including their benefits and limitations.

5. Provide multiple possible (causal) explanations for any statistical result, distinguishing between statistical and causal relationships.

6. For a given economic question, dataset, and econometric method, judge whether the method is appropriate, and judge the economic significance and statistical significance of the results.

7. Using R (or Stata), manipulate and analyze data, interpreting results both economically and statistically.


Notation

Much of the notation below will not make sense until you get to the corresponding point in the text. The following is primarily for your reference later.

Variables

Usually, uppercase denotes a random variable, whereas lowercase denotes a non-random (fixed, constant) value. The primary exception is for certain counting variables, where uppercase indicates the maximum value and lowercase indicates a general value; e.g., time period $t$ can be $1, 2, 3, \ldots, T$, or regressor $k$ out of $K$ total regressors. Scalar, vector, and matrix variables are typeset differently. For example, an $n$-by-$k$ random matrix with scalar (random variable) entries $X_{ij}$ (row $i$, column $j$) is written

\[
\underline{\mathbf{X}} =
\begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{pmatrix}
\]

and a $k$-dimensional non-random vector is

\[
\mathbf{z} =
\begin{pmatrix}
z_1 \\ z_2 \\ \vdots \\ z_k
\end{pmatrix}.
\]

Unless otherwise specified, vectors are column vectors (like above).

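Both vectors and matrices can be transposed. The transpose of a column vector is a row vector. For example, the transpose of the $\mathbf{z}$ defined above is

\[
\mathbf{z}' = (z_1, z_2, \ldots, z_k)
\]

and the transpose of the $\underline{\mathbf{X}}$ defined above is

\[
\underline{\mathbf{X}}' =
\begin{pmatrix}
X_{11} & X_{21} & \cdots & X_{n1} \\
\vdots & \vdots & \ddots & \vdots \\
X_{1k} & X_{2k} & \cdots & X_{nk}
\end{pmatrix}
\]

where the row $i$, column $j$ entry in $\underline{\mathbf{X}}'$ is the row $j$, column $i$ entry in $\underline{\mathbf{X}}$.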
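The transpose rule can be checked numerically. The following is a minimal R sketch with made-up numbers (not from the text); the object names `X`, `Xt`, and `z` are only illustrative.

```r
# A 2-by-3 matrix; R fills matrices column by column by default
X <- matrix(1:6, nrow = 2, ncol = 3)
Xt <- t(X)                 # t() computes the transpose, here 3-by-2

dim(X)                     # dimensions of the original matrix
dim(Xt)                    # dimensions of the transpose

# The row i, column j entry of the transpose is the
# row j, column i entry of the original matrix
Xt[3, 2] == X[2, 3]

# A column vector stored as a 3-by-1 matrix; its transpose is a 1-by-3 row
z <- matrix(c(10, 20, 30), ncol = 1)
dim(t(z))
```

Storing `z` as a one-column matrix is a choice made here so that its dimensions are explicit; a plain R vector has no row or column orientation until it is used in matrix algebra.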

Greek letters like $\beta$ and $\theta$ generally denote non-random (fixed) population parameters.

Estimators usually have a "hat" on them. Since estimators are computed from data, they are random from the frequentist perspective. Thus, even if $\theta$ is a non-random population parameter, $\hat{\theta}$ is a random variable.

I try to put "hats" or bars on other quantities computed from the sample, too. For example, a t-statistic would be $\hat{t}$ (a random variable computed from the sample) instead of just $t$ (which looks like a non-random scalar). The sample average of $Y_1, \ldots, Y_n$ is $\bar{Y}$.

Estimators and other statistics (i.e., things computed from data) may sometimes have a subscript with the sample size $n$ to remind us that their sampling distribution depends on $n$. For example, $\hat{\theta}_n$, $\hat{t}_n$, and $\bar{Y}_n$.
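To make the "estimators are random" idea concrete, here is a small simulation sketch in R. Everything in it is an assumption chosen for illustration: the population is taken to be normal with mean `theta = 5` and standard deviation 2, and `draw_mean` is a hypothetical helper, not a function from the text.

```r
set.seed(123)              # arbitrary seed, only for reproducibility
theta <- 5                 # population mean: fixed, non-random

# Hypothetical helper: draw a sample of size n and return its sample mean
draw_mean <- function(n) mean(rnorm(n, mean = theta, sd = 2))

theta_hat_a <- draw_mean(20)   # estimate from one sample
theta_hat_b <- draw_mean(20)   # estimate from a second sample
c(theta_hat_a, theta_hat_b)    # same theta, two different estimates

# The sampling distribution depends on n: compare the spread of the
# sample mean across 1000 repeated samples for a small and a large n
sd_n20   <- sd(replicate(1000, draw_mean(20)))
sd_n2000 <- sd(replicate(1000, draw_mean(2000)))
c(sd_n20, sd_n2000)
```

The two estimates differ even though `theta` never changes, and the spread across repeated samples shrinks as `n` grows, which is why the subscript $n$ is sometimes attached to an estimator.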

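The following is a summary.

$y$    scalar fixed (non-random) value
$Y$    scalar random variable
$\theta$    scalar non-random value
$\hat{\theta}$    scalar random variable

$\mathbf{x}$    non-random column vector
$\mathbf{x}'$    transpose of $\mathbf{x}$
$\mathbf{X}$    random column vector
$\boldsymbol{\beta}$    non-random column vector
$\hat{\boldsymbol{\beta}}$    random column vector

$\underline{\mathbf{w}}$    non-random matrix
$\underline{\mathbf{w}}'$    transpose of $\underline{\mathbf{w}}$
$\underline{\mathbf{W}}$    random matrix
$\underline{\boldsymbol{\Omega}}$    non-random matrix
$\hat{\underline{\boldsymbol{\Omega}}}$    random matrix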

Symbols

In addition to the following symbols, vocabulary words and abbreviations (like "regression" or "OLS") can be looked up in the Index in the very back of the text.

$\Longrightarrow$    implies; see Section 6.1
$\Longleftarrow$    is implied by; see Section 6.1
$\iff$    if and only if; see Section 6.1
$\lim_{n\to\infty}$    limit (like in pre-calculus)
$\operatorname{plim}_{n\to\infty}$    probability limit; see Section 3.7.3
$\to$    converges to (like in pre-calculus)
$\overset{p}{\to}$    converges in probability to; see Section 3.7.3
$\equiv$    is defined as
$\approx$    approximately equals
$\sim$    is distributed as
$\overset{a}{\sim}$    is distributed approximately (or asymptotically) as; see (3.16)
$X \perp\!\!\!\perp Y$    $X$ and $Y$ are statistically independent; see Section 6.2.6
$\mathrm{N}(\mu, \sigma^2)$    normal distribution with mean $\mu$ and variance $\sigma^2$
$\mathrm{N}(0, 1)$    standard normal distribution
$F_Y(\cdot)$    cumulative distribution function (CDF) of $Y$; see Section 2.3
$f_Y(\cdot)$    PMF of $Y$ (if $Y$ is discrete); see Section 2.3
$f_Y(\cdot)$    PDF of $Y$ (if $Y$ is continuous); see Figure 2.3
$\mathbb{1}\{\cdot\}$    indicator function; see (2.3)
$\mathrm{P}(A)$    probability of event $A$
$\mathrm{P}(A \mid B)$    conditional probability of $A$ given $B$; see Section 6.2.3
$\mathrm{E}(Y)$    expectation (mean) of $Y$; see Section 2.3
$\mathrm{E}(Y \mid X = x)$    CEF (a function of $x$); see Section 6.3
$\mathrm{E}(Y \mid X)$    conditional expectation of $Y$ given $X$; this is a random variable
$\sum_{i=1}^{n}$    summation from $i = 1$ to $i = n$
$\mathrm{Var}(Y)$    variance of $Y$ (square of standard deviation); see (2.10)
$\mathrm{Var}(Y \mid X = x)$    conditional variance (a non-random value); see Section 6.8.1
$\mathrm{Var}(Y \mid X)$    conditional variance (a random variable)
$\mathrm{Cov}(Y, X)$    covariance
$\mathrm{Corr}(Y, X)$    correlation
$\{a, b, \ldots\}$    a set (containing elements $a$, $b$, etc.)
$i = 1, \ldots, n$    same as $i \in \{1, \ldots, n\}$ (integers from 1 to $n$)
$j = 1, \ldots, J$    same as $j \in \{1, \ldots, J\}$ (integers from 1 to $J$)
$s \in S$    element $s$ is in set $S$
$\hat{\mathrm{E}}(Y)$    expectation for sample distribution; see Section 3.4.1
$\bar{Y}_n$    $\frac{1}{n} \sum_{i=1}^{n} Y_i$, same as $\hat{\mathrm{E}}(Y)$; see Section 3.4.1
$\hat{\theta}$    estimator of population parameter $\theta$; see Section 3.4
$\mathrm{SE}(\hat{\theta})$    standard error of estimator $\hat{\theta}$; see Section 3.8.1
$\arg\min_{g} f(g)$    the value of $g$ that minimizes $f(g)$
$\arg\max_{g} f(g)$    the value of $g$ that maximizes $f(g)$
$\underline{\mathbf{v}}'$, $\mathbf{x}'$    transposes of matrix $\underline{\mathbf{v}}$ and vector $\mathbf{x}$, respectively
$\underline{\mathbf{v}}^{-1}$    inverse of matrix $\underline{\mathbf{v}}$
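Two of the symbols above, the sample average $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $\arg\min_g f(g)$, can be illustrated together with a short R sketch. The data values are made up, and the choice of mean squared loss for $f$ is an assumption made here for illustration (it is the loss for which the minimizer equals the sample average).

```r
y <- c(1, 4, 7, 10)                 # made-up sample values Y_1, ..., Y_n
n <- length(y)

# Sample average written out as (1/n) * sum of Y_i
y_bar <- sum(y) / n

# f(g): mean squared loss from guessing g for every observation
f <- function(g) mean((y - g)^2)

# arg min over g, found numerically on the range of the data;
# optimize() is in base R (stats package)
opt <- optimize(f, interval = range(y))

c(y_bar, mean(y), opt$minimum)      # all three should agree (up to numerical tolerance)
```

Here `opt$minimum` is the arg min (the minimizing value of `g`), while `opt$objective` would be the minimized value of `f` itself.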

Chapter 1

Getting Started with R (or Stata)

⇒ Kaplan video: Course Introduction

Depends on no other chapters

Unit learning objectives for this chapter:

1.1 Run statistical software (R/RStudio or Stata) [TLO 7]

1.2 Write code to do basic data manipulation, description, and display [TLO 7]

You will use R (or Stata) for the empirical exercises in this textbook. The code examples in the textbook are all in R.

No previous experience with any statistical software is assumed. Consequently, the primary goal of the empirical exercises is to develop your confidence and experience with statistical software, applying the text's methods and ideas to real datasets. Toward this goal, there are lots of explicit hints about the code you need to write.

If you actually do have previous experience (or above-average interest), then the empirical exercises may feel too boring. You could try figuring out alternative ways to code the solution, or coding alternative analyses, etc. You can also explore other online resources, like one of the free DataCamp courses.[1]

Due to the many excellent resources online (see Section 1.4), there are many people who can write R code, but most do not understand how to properly interpret econometric results or judge which method is most appropriate. So, overall, this class/textbook focuses more on understanding econometrics than coding.

1.1 Comparison of R and Stata

I like both R and Stata statistical software, and I have used both professionally. They excel in different ways, mentioned below.

[1] https://www.datacamp.com/community/open-courses

For this textbook/class, I focus on R for the following reasons.

1. It's widely used in the private sector, government, and academia alike, in many fields (including economics).
2. It's free to download/use, and can even be used through a web browser.
3. It has many econometric/statistical functions available, and creators of new econometric/statistical methods often provide code in R.
4. There are many online resources for learning R and getting help.

In comparison, Stata:

1. is widely used in economics and certain social sciences, but less so in fields like data science and statistics;
2. is not free and can't be used in a browser, but is free to use in many campus computer labs;
3. is easier to use for standard econometric methods, and has some new econometric methods (while others take a few years to be implemented);
4. also has good help files (documentation) and online support.

1.2 R

1.2.1 Accessing the Software

There are three ways you could run R: downloaded onto your own computer, through a web browser (in the cloud), or on another computer like in a campus computer lab.

Other computers or web browser versions may have the core R software but lack certain packages needed for the empirical exercises. In some cases, you can simply install the necessary packages with a single command (e.g., in Mizzou computer labs). In other cases, you may be prohibited from installing packages, in which case you won't be able to complete the exercises, so make sure to check this first.

Through a Web Browser

There are many free options for using R through a web browser, and they evolve quickly. This means both new and improved options becoming available, as well as existing options disappearing, even from major companies (e.g., Microsoft Azure Notebooks was "retired").

Currently, I suggest you use RStudio Cloud. It's free, reliable, and the same RStudio interface as if you downloaded RStudio, so you can learn from the latter half of my RStudio video. To get started:

1. Go to https://rstudio.cloud in any web browser.
2. Click the GET STARTED FOR FREE button (or else "Log In" if you already have an account).
3. Click the "Sign Up" button (the free "Cloud Free" account is selected by default).
4. Enter your email, (new) password, and name, and click "Sign Up" (or else "Sign up with Google" if you prefer).
5. Start using RStudio like it were on your own computer.
6. Install necessary packages like usual; see Section 1.2.2.
7. After you log out and later log in, click "Untitled Project" (feel free to rename) to get back to where you were.

At http://mybinder.org/v2/gh/binder-examples/r/master?urlpath=rstudio you can also use the RStudio interface through a web browser without even making an account, but 1) it does not run the most current version of R, 2) it cannot save your files from one session to the next, and 3) you have to install the packages every time (which takes many minutes to run). But these are not critical problems for this class: older R versions are fine, you can save your code/output in a text file on your own computer, and you can make some tea while the packages install.

Currently, the best other options use Jupyter Notebooks. In order of my preference (for this class):

• CoCalc: no account required; all needed packages already installed; go to https://cocalc.com and click "Run CoCalc Now" and wait for it to load, then click "R (system-wide)" under "Suggested kernels" and you can start typing R code.

• Google CoLab: requires Google account; go to https://colab.research.google.com/drive/1BYnnbqeyZAlYnxR9IHC8tpW07EpDeyKR and then in the Edit menu click "Select all cells" and (also in the Edit menu) "Delete selected cells" to get a blank notebook; then under Insert click "Code cell" and start typing code; can install all needed packages as in Section 1.2.2 (takes a few minutes to run).

• Gradient by Paperspace: I haven't tried it, but looked promising; requires free account at https://gradient.paperspace.com

In a Mizzou Computer Lab

You can check which Mizzou computing sites/labs have your favorite software on the Computing Sites Software web page.[2] Scroll down to RStudio to see where you can use R with RStudio. However, sometimes there are classes or other events in computer labs; you can check the weekly schedule posted near the door to find a free time, or you can check online.[3]

After you log into the computer in the computer lab, open RStudio from the Start menu. (RStudio calls R itself in the background; you don't have to open R directly.) Then just start typing commands and hit Enter to run them.

The computer labs don't currently have the necessary packages pre-installed, but you can easily install them. Note that you'll have to do this every time you log in (since any files you download/save get deleted when you log out), but you can just run the same line of code when you start RStudio each time.

Also, make sure to email yourself your code (or otherwise save it, if you haven't finished and uploaded to Canvas) before you log out, since your files get deleted when you log out.

[2] https://doit.missouri.edu/services/computing-sites/sites-software
[3] https://doit.missouri.edu/services/computing-sites and click the lab name

Downloading Software

⇒ Kaplan video: Getting Started with R/RStudio

You'll download two pieces of software: R itself, and RStudio. Both are free. R has all the functions you need. RStudio makes the interface nicer and makes things easier for you.

On Windows:
• Download the .exe installer file for R: Google "R Windows" or try https://cran.r-project.org/bin/windows/base and click the "Download" link near the top.
• Open the downloaded .exe installer and follow the instructions.
• Download the .exe installer file for RStudio Desktop (free version): Google "RStudio download" or try https://www.rstudio.com/products/rstudio/download/#download
• Open the downloaded .exe installer and follow the instructions.

On Mac:
• Download the .pkg file for R: Google "R Mac" or try https://cran.r-project.org/bin/macosx
• Open the file and follow the usual Mac installation procedure.
• Download the .dmg file for RStudio Desktop (free version): Google "RStudio download" or try https://www.rstudio.com/products/rstudio/download/#download
• Open the file and follow the usual Mac installation procedure.

On Linux, etc.: if you can figure out how to run something besides Windows or Mac, you can probably figure out how to download a couple files by yourself, but please let me know if not.

Regardless of OS, after both are installed, you only ever need to open RStudio, never R. Once you open RStudio, just type a command and hit Enter to run it.
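If you want something to type first, the following is a minimal sketch of a few commands to try in the RStudio console, one line at a time. The numbers and the name `x` are made up for illustration; all functions are part of base R, so nothing needs to be installed.

```r
2 + 3                    # R works as a calculator

x <- c(2, 4, 6, 8)       # store four numbers in a vector named x
x                        # typing a name prints the stored values

length(x)                # how many values are in x
mean(x)                  # the average of the values: 5
summary(x)               # minimum, quartiles, mean, and maximum
```

Anything after `#` on a line is a comment that R ignores, so you can keep notes to yourself next to the commands.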

1.2.2 Installing Packages

You may need to install certain packages to do the empirical exercises. This can be done with a single command in R. You should double-check the package names required for each exercise, but it would be something like

    install.packages(c("wooldridge", "lmtest", "sandwich", "forecast", "survey"))

With R on your own computer, you only need to run this once (not every time you use your computer), but with a web interface or computer lab, you may need to run this code every time you start a session in R. You can check which packages are already installed with installed.packages().
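If you do need to run the install command at the start of every session, one option is to install only what is missing. The following is a sketch under that assumption; `missing_pkgs` is a hypothetical helper written for this illustration (not part of R or of the text), and the package list is the one from the command above.

```r
needed <- c("wooldridge", "lmtest", "sandwich", "forecast", "survey")

# Hypothetical helper: which of the requested packages are not yet installed?
# installed.packages() returns a matrix whose row names are package names.
missing_pkgs <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_pkgs(needed)
to_install               # character(0) means everything is already installed

# Uncomment the next line to actually install (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the sketch only reports what is missing; whether you are allowed to install packages depends on the computer you are using.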

A bit about the packages:

• wooldridge (Shea, 2018) has datasets originally collected by Wooldridge (2020) from various sources.

• lmtest and sandwich (Zeileis, 2004; Zeileis and Hothorn, 2002) help construct confidence intervals (and other things) appropriate for economic data.

• survey (Lumley, 2004, 2019) has functions for dealing with complex survey sampling.

• forecast (Hyndman, Athanasopoulos, Bergmeir, Caceres, Chhay, O'Hara-Wild, Petropoulos, Razbash, Wang, and Yasmeen, 2020; Hyndman and Khandakar, 2008) has methods for forecasting.

13 Stata


There are three ways you could run Stata: in a campus computer lab, through Mizzou’s Software Anywhere, or (if you purchase your own copy) downloaded onto your own computer.

Empirical exercises only require built-in commands. Stata has additional commands available for download, but none besides bcuse (Section 1.3.2) are needed for the exercises, so any (internet-connected) computer with Stata is sufficient.

You can check which Mizzou computing sites/labs have your favorite software on the Computing Sites Software web page.⁴ Scroll down to Stata to see where it’s available. However, sometimes there are classes or other events in computer labs; you can check the weekly schedule posted near the door to find a free time, or you can check online.⁵

After you log into the computer in the computer lab, open Stata from the Start menu (the actual name is somewhat longer, like “StataSE 15 (64-bit)”). Ideally, you should open the do-file editor and save a .do file, but for this class you could just type commands into the short horizontal space at the bottom labeled “Command.” You type a command and hit Enter to run it.


Purchasing and Downloading Software

Student pricing is shown on the Stata website.⁶ Currently (Spring 2020), the cheapest option is the 6-month Stata/IC license. Other, more expensive licenses are fine, too.

The software is delivered via download. Follow instructions for installation, and contact Stata if you have any technical difficulties.

Software Anywhere (Mizzou)

From the Software Anywhere web page,⁷ click the “Getting Started” tab and follow the instructions. Once logged in, it’s the same as if you were sitting at a computer in a Mizzou computer lab (see above).

Technical assistance: MU Division of IT, techsupport@missouri.edu

1.3.2 Installing Additional Commands

Like in R, there are additional Stata commands that can be easily downloaded and installed. Commonly, this can be done with the command ssc install followed by the name of the command.

For the exercise sets, the only additional command you’ll need is bcuse. You can install this with the command ssc install bcuse. If you’re in a computer lab, you may need to run this command every time you start Stata; if you have it on your computer, just once is sufficient. This command makes it easy to load the datasets from Wooldridge (2020).⁸

1.4 Optional Resources

If you only want to learn enough R (or Stata) to do well in this class, then you may skip this section. If you’d like to learn more on your own, these resources might help you get started in the right direction.

1.4.1 R Tutorials

Eventually, you will be able to simply Google questions you have about R. There are lots of people on the internet really excited about helping you figure stuff out in R, which is great.

⁶ https://www.stata.com/order/new/edu/gradplans/student-pricing
⁷ https://doit.missouri.edu/services/software/software-anywhere
⁸ Descriptions: http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html


However, when you are first getting started, it may help to go through a basic tutorial. You are welcome to Google “R basic tutorial” yourself, or you could try one of the following.

1. Section 2.3 (“Lab: Introduction to R”) in James, Witten, Hastie, and Tibshirani (2013)

2. Section 1.1 in Hanck et al. (2018)

3. Sections 1.1–1.3 in Heiss (2016)

4. Sections 2.1–2.5 in Kleiber and Zeileis (2008) [Chapter 2 is free on their website]

5. Chapter 2 in Kaplan (2020)

6. No longer free after first chapter: datacamp.com courses like Introduction to R⁹

1.4.2 R Quick References

At first, it may help to have some quick reference “cheat sheets.”¹⁰ ¹¹

1.4.3 Running Code in This Textbook

If you’d like, you should be able to copy code directly from the textbook .pdf file and paste it into R. Sometimes you need to install a certain package first. This can be done either manually or with the R function install.packages(). For example, to install package mgcv, run the command install.packages('mgcv') within R.

1.4.4 Stata Resources

For Stata, helpful cheat sheets (quick references) are available for free,¹² as well as various tutorials.¹³

⁹ https://www.datacamp.com/courses/free-introduction-to-r
¹⁰ https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf
¹¹ https://cran.r-project.org/doc/contrib/Short-refcard.pdf
¹² https://www.stata.com/bookstore/statacheatsheets.pdf
¹³ https://www.stata.com/links/resources-for-learning-stata


Empirical Exercises

Empirical Exercise EE1.1. In either R or Stata, create a script (a sequence of commands, with one command per line) to do the following. The data are from a New York Times article on December 28, 1994.

a. R: load (and install if necessary) package wooldridge:
if (!require(wooldridge)) {install.packages('wooldridge'); library(wooldridge)}

Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse wine, clear

b. View basic dataset info with R command ?wine or Stata command describe

c. View the first few rows of the dataset with R command head(wine) or Stata command list if _n<=5

d. Rename the alcohol column, which measures liters of alcohol from wine (consumed per capita per year).

R: names(wine)[2] <- 'wine'

Stata: rename alcohol wine

e. Add a column named id whose value is just 1, 2, 3, 4, 5, etc.

R: wine$id <- 1:nrow(wine)

Stata: generate id = _n

f. Display the countries with fewer than 100 heart disease deaths per 100,000 people.

R: wine$country[wine$heart<100]

Stata: list country if heart<100

g. Display the rows for the countries with the 5 lowest death rates, sorted by death rate.

R: wine[order(wine$deaths)[1:5], ]

Stata: sort deaths followed by list if _n<=5

h. Add a column with the sum of heart and liver disease deaths per 100,000.

R: wine$heartplusliver <- wine$heart + wine$liver

Stata: generate heart_plus_liver = heart + liver

i. Generate a variable with the squared death rate.

R: wine$deathssq <- wine$deaths^2

Stata: generate deaths_sq = deaths^2

j. Display the sorted death rates.

R: print(sort(wine$deaths))

Stata: sort deaths followed by list deaths


k. R: create a vector with the proportion of total deaths (per 100,000) caused by heart disease with command heartprop <- wine$heart/wine$deaths and then name the entries by country with names(heartprop) <- wine$country and print the named vector of heart disease death proportions, rounded to three decimal places, with print(round(heartprop, digits=3))

Stata: add a column with the proportion of heart deaths to total deaths with command generate heart_prop = heart / deaths

l. Create a histogram of liver deaths.

R: hist(wine$liver)

Stata: histogram liver

m. Create a scatterplot of liver death rates (vertical axis) against wine consumption (horizontal axis).

R: plot(x=wine$wine, y=wine$liver)

Stata: scatter liver wine

n. R only: make the same plot but with axes starting at zero, adding the arguments xlim=c(0,max(wine$wine)) and ylim=c(0,max(wine$liver)) to the previous plot() command.
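If the wooldridge package is not available, several of the R steps above can be tried on a small stand-in data frame. The sketch below is only an illustration: the data frame toy and all of its numbers are made up (they are not the real wine dataset), and it assumes the same column order (country, alcohol, deaths, heart, liver).

```r
# Made-up numbers for illustration only; NOT the real wine dataset.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  alcohol = c(2.5, 3.9, 0.7, 9.1),
  deaths  = c(785, 863, 761, 773),
  heart   = c(211, 167, 220, 71),
  liver   = c(15.3, 20.7, 10.3, 37.9),
  stringsAsFactors = FALSE
)

names(toy)[2] <- "wine"                    # part d: rename the second column
toy$id <- 1:nrow(toy)                      # part e: add an id column
low_heart <- toy$country[toy$heart < 100]  # part f: countries with heart < 100
lowest2 <- toy[order(toy$deaths)[1:2], ]   # part g, with 2 rows instead of 5
heartprop <- toy$heart / toy$deaths        # part k: heart deaths / total deaths
names(heartprop) <- toy$country

print(low_heart)
print(lowest2)
print(round(heartprop, digits = 3))
```

Note the comma inside the square brackets in part g: it selects whole rows of the data frame rather than columns.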


Part I

Analysis of One Variable


Introduction

This text explores methods to answer three types of economic questions, each detailed in Part I:

1. Description (how things are/were: statistical properties and relationships)

2. Prediction (guessing an unknown value, without interfering)

3. Causality (how changing one variable would affect another, all else equal)

For example, imagine you are interested in income. Depending on your job, you may want to answer a different type of question, like:

1. Description: how many adults in the US have an income below $20,000/yr? What’s the mean income among US adults? What’s the difference in mean income between two socioeconomic or demographic groups (like those with and without a college degree)?

2. Prediction: for advertising purposes, what’s the best guess of the income of an unknown person visiting your company’s website? What’s the best prediction if you also know their zip code (where they live)?

3. Causality: for a given individual, how much higher would her income be if she had a college degree than if she didn’t, keeping everything else about her (parents, height, social skills, etc.) identical? How much higher would her income be if she were a man, all else equal? If she were white?

Description helps us see. It summarizes an incomprehensible mass of numbers into specific, economically important features we can understand. By analogy, knowing the color of each of 40,000 pixels in a photograph is not as valuable as knowing it’s a cat.

Prediction aids decisions dependent on unknowns. The example questions above consider the purpose of advertising, where correctly guessing a person’s income helps decide which ad is most effective. In other private sector jobs, you may need to predict future demand to know how many self-driving cars to start producing, or predict future oil prices to aid a freight company’s decisions. In government or non-profit work, optimal policy may depend on predicting next year’s unemployment rate. In each case, as detailed in Section 2.5, the “best” prediction depends on the consequences of the related decision.

Causality also aids decisions. The example question about the causal effect on income of a college degree matters for government policies to subsidize college (or not) as well as individual decisions to attend college. With business decisions like changes to advertising or website layout, the causal effect on consumer behavior is what matters: does the change itself actually cause consumers to buy more? Among the three types, questions of causality are the most difficult to answer. Learning about causality from data has been a primary focus of the field of econometrics.

Of course, not all important questions concern description, prediction, and causality. Policy questions usually involve tradeoffs that ultimately require value judgments. For example, how much future wellbeing is worth sacrificing to be better off right now? How much GDP is worth sacrificing to decrease inequality? Should a school have honors classes that help the best students at the expense of the other students? Each of these policy questions requires a subjective value judgment that cannot be answered objectively from data.

That said, each policy question also depends on objectively quantified description, prediction, and causality. For example, the policy question about decreasing inequality depends on the current levels of GDP and inequality (description) as well as the causal effect of the policy (e.g., tax) change on GDP and inequality (causality). The future/present wellbeing tradeoff depends on the current level of wellbeing (description) as well as future levels (prediction). The honors class tradeoff depends on the causal effect of honors classes on different types of students (causality) as well as the current mix of student types (description) and future mix (prediction).

Chapter 2

One Variable Population

⇒ Kaplan video: Chapter Introduction

2.1 Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]

2.2 Describe and distinguish among different types of populations, including which is most appropriate for answering a certain question [TLO 3]

2.3 Describe distributions in different ways, including units of measure [TLO 3]

2.4 Assess the most appropriate loss function and prediction in a real-world situation [TLO 6]

2.5 Compute mean loss and the optimal prediction in simple mathematical examples [TLO 2]

Optional resources for this chapter

• Basic probability: the Khan Academy AP Statistics unit includes instructional material and practice questions

• Mean (expected value) (Lambert video)

• Probability distribution basics on Wikipedia (more than you need to know for this class)

• Optimal prediction: Hastie, Tibshirani, and Friedman (2009, §2.4)

• Section 2.1 (“Random Variables and Probability Distributions”) in Hanck et al. (2018)

Chapter 2 studies a single variable by itself. This setting’s simplicity helps us focus on the complexity of fundamental concepts in probability, description, and prediction. This fundamental understanding will help you tackle more complex models later in this class and beyond.

If you’ve previously had a probability or statistics class, then most of this chapter may be review for you, although the optimal prediction material is probably new. If you haven’t, then now is your opportunity to catch up.

2.1 The World is Random

⇒ Kaplan video: “Before” and “After” Perspectives of Data

2.1.1 Before and After: Two Perspectives

Consider a coin flip. The two possible outcomes are heads (h) and tails (t). After the flip, we observe the outcome (h or t). Before the flip, either h or t is possible, with different probabilities.

Let variable W represent the outcome. After the flip, the outcome is known: either W = h or W = t. Before the flip, both W = h and W = t are possible. If the coin is “fair,” then possible outcome W = h has probability 1/2, as does W = t. (Recall it is equivalent to write 1/2, 0.5, or 50%.)

The “after” view sees W as a realized value (or realization). It is either heads or tails. Even if the actual “value” (heads or tails) is unknown to us, there is just a single value. For example, in physics, the variable c represents the speed of light in a vacuum; you may not know the value, but c represents a single value.

Instead, the “before” view sees W as a random variable. That is, instead of representing a single (maybe unknown) value like in algebra, W represents a set of possible values, each associated with a probability. In the coin flip example, the possible outcomes are h and t, and the associated probabilities are 0.5 and 0.5.
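The two perspectives can be mimicked in R as a rough sketch. Here the “before” view is the pair of vectors outcomes and probs, and the “after” view is a single value returned by sample(). The variable names and the seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so that the example is reproducible

# "Before" view: possible values and their probabilities (fair coin)
outcomes <- c("h", "t")
probs <- c(0.5, 0.5)

# "After" view: one realized value w, either "h" or "t"
w <- sample(outcomes, size = 1, prob = probs)
print(w)

# Many independent flips of the same coin: the proportion of heads
# should be close to (but usually not exactly) the probability 0.5
flips <- sample(outcomes, size = 10000, replace = TRUE, prob = probs)
prop_heads <- mean(flips == "h")
print(prop_heads)
```

Changing the seed changes the realized value w, but not the “before” description stored in outcomes and probs.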

Other terms for W include a random draw (or just draw), or more specifically a random draw (or “randomly drawn”) from a particular probability distribution. Seeing the population as a probability distribution (see Section 2.2), we could say W is randomly sampled from its population distribution, or if there are multiple random variables W₁, W₂, … (e.g., multiple flips of the same coin), we could say they are randomly sampled from the population, or that they collectively form a random sample; see Section 3.2 for more about sampling.

Notationally, in this text, random variables are usually written uppercase (like W or Y), whereas realized values are usually written lowercase (like w or y). This notation is not unique to this textbook, but beware that other books use different notation. (For more on notation, see the Notation section in the front matter before Chapter 1.)

2.1.2 Before and After Sampling

Extending Section 2.1.1 are the before sampling and after sampling perspectives, or “before observation” and “after observation.” Similar to Section 2.1.1, “before” corresponds to random variables, whereas “after” corresponds to realized values.

For example, imagine you plan to record the age of one person living in your city. You take a blank piece of paper on which you’ll write the age. As in Section 2.1.1, after you choose a person and write their age (“after sampling”), that number can be seen as a realized value, like w. Before sampling, there are many possible numbers that could end up on your paper. It’s not that your city’s citizens’ ages are undetermined; they each know their own age. But before you “sample” somebody, it’s undetermined whose age will end up on your paper. It could be your neighbor DeMarcus, age 88. It could be your kid’s friend Lucia, age 7. It could be your colleague Xiaohong, age 35. The random variable W is like your blank paper: it has many possible values, each with some probability of occurring, like P(W = 88) or P(W = 7).

Discussion Question 2.1 (web traffic). Let Y = 1 if you’re logged into the course website, and Y = 0 if not.

a) From what perspective is Y a non-random value?
b) From what perspective is Y a random variable?

There is always a “before” view from which data samples (like ages) can be seen as random variables, although sometimes it requires some additional peculiar thought experiments, like imagining we first “sample” one universe out of many, like with the superpopulation in Section 2.2.

In Sum: Before & After
Before: multiple possible values ⇒ random variable
After: single observed value ⇒ realized value (non-random)

2.1.3 Outcomes and Mechanisms

Knowing everything about a coin does not fully determine the outcome of a single coin flip. For example, even if we flip two identical coins (i.e., same probability of heads) at the same time, one may get heads while the other gets tails. Mathematically, with two coins represented by W and Z, even if they are “identical” in that P(W = h) = P(Z = h) and P(W = t) = P(Z = t), we could still sometimes observe W = h and Z = t. More abstractly, knowing everything about random variable W does not fully determine any particular realization w. Even if random variables W and Z have the same properties, specific realizations W = w and Z = z may differ.

Conversely, a single coin flip’s outcome does not tell us everything about the coin itself. For example, consider a “fair” coin W with P(W = h) = P(W = t) = 1/2 (50% chance of either heads or tails) and biased coin Z with P(Z = h) = 0.99 (99% heads). By chance, we may flip both and observe W = h and Z = h. But the fact that they both came up heads once does not imply that the coins themselves are identical. More abstractly, observing a single realization W = w does not tell us all the properties of random variable W.
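A small simulation can illustrate the fair and biased coins above. This is only a sketch: heads is coded as 1 and tails as 0, the two coins are simulated as independent draws with rbinom(), and the seed and number of flips are arbitrary.

```r
set.seed(2)  # arbitrary seed for reproducibility
n <- 10000   # number of paired flips (arbitrary)

w_flips <- rbinom(n, size = 1, prob = 0.5)   # fair coin W: 1 = heads
z_flips <- rbinom(n, size = 1, prob = 0.99)  # biased coin Z: 1 = heads

# How often do both coins show heads on the same paired flip?
both_heads <- mean(w_flips == 1 & z_flips == 1)
print(both_heads)

# Over many flips, the coins' different properties become visible
print(mean(w_flips))  # close to 0.5
print(mean(z_flips))  # close to 0.99
```

Seeing both coins land heads once is common in this simulation, so a single pair of outcomes cannot distinguish the coins, whereas the long-run proportions can.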

We usually want to learn about the underlying mechanisms, like the coin itself. The “before” view in Section 2.1.1 lets us describe the underlying properties that we want to learn, like a coin’s probability of heads, P(W = h).

The coin flip is a metaphor for more complex mechanisms. In economics, instead of learning how coin flip outcomes are determined, we care about the underlying mechanisms that determine a wide variety of outcomes like unemployment, wages, inflation, trade volume, fertility, and education. The underlying mechanism is often called the data-generating process (DGP).

2.2 Population Types

⇒ Kaplan video: Population Types

This section describes different population types and how to determine which is most appropriate for a particular economic question, which in turn helps determine which econometric method is most appropriate.

In this textbook, the population is modeled mathematically as a probability distribution. This is appropriate for the infinite population or superpopulation below, but not the finite population. Consequently, it is most important to distinguish between the finite population and the other two types.

The finite population cares more about the “after” view: which outcomes actually occurred. The other two population types care more about the “before” view: describing properties of the underlying mechanisms that generated the outcomes (the DGP).

2.2.1 Finite Population

In English, “population” means all the people living in some area, like everybody living in Missouri. In econometrics, this type of population is called a finite population. Other examples of finite populations are all employees at a particular firm, all firms in a particular industry, all students in a particular school, or all hospitals of a certain size.

The finite population is appropriate when we only care about the outcomes of the population members, not the mechanisms that determine such outcomes. For example, if we want to know how many individuals in Missouri are currently unemployed, then our interest is in a finite population. That is, we don’t care why they’re unemployed, and we don’t care about the probability that they’re unemployed; we only care about whether or not they are currently unemployed.

2.2.2 Infinite Population

Sometimes a finite population is so large compared to the sample size (i.e., the number of population members we observe) that an infinite population is a reasonable approximation. For example, if we observe only 600 individuals out of the 6+ million in Missouri, econometric results based on finite and infinite populations are practically identical.

Although “infinite” sounds more complex than “finite,” it is actually simpler mathematically. Instead of needing to track every single member of a finite population, an infinite population is succinctly described by a probability distribution or random variable. For example, a finite population would need to consider the employment status of all 6+ million Missourians, because sampling somebody unemployed then reduces the number of unemployed individuals remaining in the population who could be sampled next. In contrast, an infinite population considers realizations of a random variable W with some probability of having value “unemployed.” There is no effect of removing one individual from an infinite population, since 1/∞ = 0.
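The effect of removing one sampled individual can be checked with simple arithmetic. The sketch below assumes a hypothetical 5% unemployment rate (not a real statistic) and a hypothetical helper function p_second_given_first: if K of N population members are unemployed and the first person sampled is unemployed, then K − 1 of the remaining N − 1 are unemployed.

```r
# Hypothetical helper: probability the second person sampled is unemployed,
# given the first person sampled (without replacement) was unemployed.
# N = population size, K = number unemployed in the population.
p_second_given_first <- function(N, K) (K - 1) / (N - 1)

# Hypothetical 5% unemployment rate in a tiny population and a huge one
p_small <- p_second_given_first(N = 20, K = 1)      # tiny population
p_large <- p_second_given_first(N = 6e6, K = 3e5)   # roughly Missouri-sized

print(p_small)  # removing one person changes the probability a lot
print(p_large)  # removing one person barely changes the probability
```

With the huge population, the probability is practically still 5%, which is why treating the population as infinite is a reasonable approximation there.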

Besides this convenience, sometimes there is no finite population (however large) that answers your question. For example, imagine there’s a new manufacturing process for carbon monoxide monitors that should sound an alarm above 50 ppm. Most work properly, but some are faulty and never alarm. Specifically, this manufacturing process corresponds to some probability of producing a faulty monitor. This is similar to the probability of the coin flipping process producing a “heads.” Mathematically, the manufacturing process can be modeled as random variable W with some probability of the value “faulty.” If you want to learn this probability (i.e., this property of the manufacturing process), then there is no finite number of monitors that can exactly answer your question: no finite number of realizations exactly determines P(W = faulty). This is an infinite population question.

2.2.3 Superpopulation

One variation of the infinite population is the superpopulation (coined by Deming and Stephan 1941). This imagines (infinitely) many possible universes; our actual universe is just one out of infinity. Thus, even if it appears we have a finite population, we could imagine that our universe’s finite population is actually a single sample from an infinite number of universes’ finite populations. The term “superpopulation” essentially means “population of populations.” Our universe’s finite population “is only one of the many possible populations that might have resulted from the same underlying system of social and economic causes” (Deming and Stephan 1941, p. 45).

For example, imagine we want to learn the relationship between US state-level unemployment rates and state minimum wage levels. It may appear we are stuck with a finite population, because there are only 50 states, each of which has an observable unemployment rate and minimum wage. However, observing all 50 states still doesn’t fully answer our question about the underlying mechanism that relates unemployment and minimum wage, so a finite population seems inappropriate. But we can’t just manufacture new states like we can manufacture new carbon monoxide monitors, so an infinite population also seems inappropriate. The superpopulation imagines manufacturing new entire universes, each with 50 states and the same economic and legal systems. Given these underlying systems and mechanisms, the states’ unemployment rates can be seen as random variables, with various probabilities of the possible values. To answer our economic question, we need to learn about the properties of these random variables, not merely the actual unemployment in our actual 50 states.

2.2.4 Which Population is Most Appropriate?

Practically, you need to decide which econometric method to use to answer a particular question. This decision depends partly on which population type is most appropriate. Specifically, finite-population methods differ from other methods that are appropriate for either superpopulations or infinite populations. Because they are less commonly used in econometrics, finite-population methods are not covered in this textbook.

Consequently, it is most important to judge whether or not a finite population is more appropriate than the other types. Which is most appropriate depends on your question (i.e., what you want to learn).

The finite population is most appropriate if you could fully answer your question by observing every member of a finite population. If not, then a superpopulation or infinite population is more appropriate.

The distinction is described by Deming and Stephan (1941, p. 45). They say the finite population perspective is more appropriate for “administrative purposes” or “inventory purposes,” whereas the superpopulation perspective is more appropriate for “scientific generalizations and decisions for action [policy],” as well as “prediction” (assuming you want to predict values outside the finite population, like in the future).

In Sum: Population Type
Hypothetically, could a finite number of observations fully answer your question?

No ⇒ superpopulation or infinite population, modeled as probability distribution (as in this textbook)

Yes ⇒ finite population (use different methods, unless sample is much smaller than population)

Example: Coin Flips

Imagine the president flips a coin 20 times and then randomly selects 10 observations to report to you: which population type is most appropriate? It depends on your question.

The finite population is most appropriate if you only care about the outcomes of those 20 flips. For example, this may be true if the president was flipping the coin to make a major military decision that you care about (like “invade if at least 10/20 heads”). Then, knowing the 20 flip outcomes is enough to learn the decision. Further, the sample size is a fairly large proportion (10/20), so approximating 20 as infinity seems inappropriate.

The infinite population is more appropriate if you care about the properties of the coin. For example, even with a fair coin (p = 1/2), maybe only 5 of 20 flips came up heads. You don’t care that the finite-population proportion of heads was 1/4; you care about the p = 1/2 property of the coin itself. You still have uncertainty about p even after observing all 20 outcomes.
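The coin flip example can be simulated as a rough sketch, with heads coded as 1 and tails as 0. The variable names and seed are arbitrary. The code distinguishes three different numbers: the coin’s property p, the finite-population proportion among the 20 flips, and the proportion among the 10 reported flips.

```r
set.seed(3)  # arbitrary seed for reproducibility

p <- 0.5                                   # property of the coin itself
flips20 <- rbinom(20, size = 1, prob = p)  # the 20 flips: 1 = heads, 0 = tails

# Finite-population proportion of heads (known exactly once all 20 are seen)
finite_prop <- mean(flips20)

# The 10 observations reported to you, sampled without replacement
reported <- sample(flips20, size = 10)
sample_prop <- mean(reported)

print(c(p = p, finite_prop = finite_prop, sample_prop = sample_prop))
```

In general the three numbers differ: observing all 20 flips pins down finite_prop exactly but still leaves uncertainty about p.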

Other Examples

Consider the employment status of individuals in Missouri. A finite population is more appropriate if you want to document the actual percentage of Missouri individuals unemployed last week. A superpopulation is more appropriate if you want to learn about the underlying mechanism that relates education and unemployment. That is, knowing each individual’s employment status fully answers the first question, but not the second question.

Consider the productivity of employees at your company (you’re the CEO). If you want to know each employee’s productivity over the past fiscal quarter, then a finite population is more appropriate. If you want to learn how a particular company policy affects productivity, then a superpopulation is more appropriate. That is, knowing each employee’s productivity fully answers the first question, but not the second question.

Discussion Question 2.2 (student data). Imagine you’re a high school principal. You have data on every student, including their standardized test scores from last spring.

a) Describe a specific question for which the finite population is most appropriate, and explain why.

b) Describe a specific question for which an infinite population or superpopulation is most appropriate, and explain why.

2.3 Description of a Population

Like most econometrics textbooks, this textbook models the population as a probability distribution. Section 2.2 helps you distinguish when this is appropriate.

Description of a population is thus description of a probability distribution. Some distributions are completely described by a single number, like a coin’s probability of heads. Others are very complicated, so they are summarized by particular features like the mean and standard deviation.

Later, with regression, we’ll think about the relationship between the value of variable X and the value of a summary feature of the Y probability distribution. There are some caveats to that statement, but the point is that you must understand the probability distribution of Y by itself before understanding how its features depend on X.

Remember: there is no data yet. In practice (and starting in Section 3.4), you use data to learn about the population, to answer questions about description, prediction, or causality. Here, we consider what could possibly be learned, specifically for description.

The following subsections describe probability distributions for different types of variables, as well as appropriate summary features. First, a brief overview is given.

2.3.1 Overview of Distributions and Their Features

Complete Description

To completely describe a distribution requires a probability mass function or cumulative distribution function, depending on the type of variable (details below).

When appropriate, the probability mass function (PMF) gives the probability that random variable Y is equal to any one of its possible values. Notationally, the PMF is usually a lowercase f, sometimes with the variable as a subscript, like f_Y(·). The (·) indicates that f_Y(·) is an entire function, not a scalar variable nor a function evaluated at a particular point. Mathematically,

$$ f_Y(y) \equiv \mathrm{P}(Y = y). \tag{2.1} $$

If y is not a possible value of Y, then P(Y = y) = 0.

In other cases, more appropriate is the cumulative distribution function (CDF) that gives the probability of all values less than or equal to y. (Less commonly: “distribution function” or DF.) Notationally, the CDF is usually an uppercase F, sometimes with the variable as a subscript, like F_Y(·). Again, the (·) indicates that F_Y(·) is an entire function. Mathematically,

$$ F_Y(y) \equiv \mathrm{P}(Y \le y). \tag{2.2} $$

Either the PMF or CDF provides a full description of the distribution of Y. For some variable types, both are appropriate, in which case the PMF and CDF contain the same information (just represented differently). For other variable types, only the PMF is appropriate, or only the CDF.
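To see how a PMF and CDF can carry the same information, consider a hypothetical fair six-sided die (an example chosen here for illustration, not taken from the text). The sketch below defines functions f_Y and F_Y following (2.1) and (2.2), and recovers each from the other using cumsum() and diff().

```r
# Hypothetical example: fair six-sided die
values <- 1:6
pmf <- rep(1/6, 6)  # P(Y = y) for y = 1, ..., 6

# PMF as a function: zero for any y that is not a possible value
f_Y <- function(y) if (y %in% values) pmf[match(y, values)] else 0

# CDF as a function: add up the PMF over all possible values <= y
F_Y <- function(y) sum(pmf[values <= y])

print(f_Y(3))    # probability of exactly 3
print(f_Y(7))    # 7 is not a possible value
print(F_Y(3))    # probability of 3 or less
print(F_Y(2.5))  # defined for any number, not only possible values

# Same information, represented differently:
cdf <- cumsum(pmf)              # CDF at the possible values, from the PMF
pmf_again <- diff(c(0, cdf))    # PMF recovered from the CDF
print(all.equal(pmf_again, pmf))
```

The last line illustrates why, for a variable like this, either function is a complete description of the distribution.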

If you are studying a single variable, graphing the PMF or CDF is helpful, and it shows all the available information about that variable’s distribution. Even if the PMF or CDF is a complicated function, humans are good at processing visual data. In practice, however, often you study many variables together, in which case even graphing becomes intractable (e.g., you can’t make a five-dimensional graph easily understood).

Summary Features

Distributions’ features like the mean are convenient summaries but lose information. There is a tradeoff. For some purposes, you may need all the information of the probability distribution. For other purposes, a few summary features may suffice and be easier to understand, compare, and communicate.

The mean is the main summary feature considered for numeric variables. It provides a general idea of how high or low values are, weighted by probability. It provides some sense of the “center” or “location” of the distribution. The mean is used extensively in later chapters.

The standard deviation captures how spread out a distribution is, for numeric variables. If both very low and very high values have enough probability, then the standard deviation is high. Conversely, if possible values are all concentrated in a small range (with high probability), then the standard deviation is low.
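As a numerical sketch of these two features, the code below computes the mean and standard deviation of two made-up discrete distributions that share the same probabilities and the same mean but differ in spread. It uses the usual probability-weighted formulas for a discrete distribution; the helper names pmf_mean and pmf_sd and the example values are hypothetical.

```r
# Hypothetical helpers for a discrete distribution with given values and probabilities
pmf_mean <- function(values, probs) sum(values * probs)
pmf_sd <- function(values, probs) {
  mu <- pmf_mean(values, probs)
  sqrt(sum((values - mu)^2 * probs))  # square root of probability-weighted squared deviations
}

probs <- c(0.25, 0.50, 0.25)   # same probabilities for both distributions
narrow_vals <- c(9, 10, 11)    # values concentrated near 10
wide_vals <- c(0, 10, 20)      # very low and very high values are possible

print(pmf_mean(narrow_vals, probs))
print(pmf_mean(wide_vals, probs))
print(pmf_sd(narrow_vals, probs))
print(pmf_sd(wide_vals, probs))
```

Both distributions have the same mean, so the mean alone cannot tell them apart; the standard deviation is what distinguishes the concentrated distribution from the spread-out one.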

Other summary features are mentioned briefly but not studied in future chapters. The median (a particular percentile) provides another way to define the “center” of a distribution. The mode is the single most likely value. The mode applies even to non-numeric variables, and the median also applies as long as the non-numeric values have a low-to-high order. An alternative spread measure is the interquartile range.

In Sum: Random Variable Types & Features
Binary (Section 2.3.2): P(Y = 1) = E(Y) is complete description
Discrete (Section 2.3.3): mean E(Y) captures high/low; standard deviation σ_Y captures how spread out; PMF f_Y(y) = P(Y = y) shows probability of each possible value; CDF F_Y(y) ≡ P(Y ≤ y)
Nominal categorical (Section 2.3.4): PMF says probability of each category; mode is most likely category; no CDF, mean, standard deviation
Ordinal (Section 2.3.4): similar to nominal, but has CDF; use median instead of mean
Continuous (Section 2.3.5): similar to discrete, but PDF instead of PMF

2.3.2 Binary Variable

A binary variable has two possible values. Other terms for a binary variable are dummy variable, indicator variable, and Bernoulli random variable. In economics, "dummy" and "binary" are most common.

Unless otherwise specified, a binary variable's two possible values are 0 and 1. For writing mathematical models, these values are usually more convenient than values like "heads" and "tails." Mathematically, this can be indicated by $Y \in \{0, 1\}$: the value of $Y$ must be in the set that includes only the numbers 0 and 1. (The set $\{0, 1\}$ is different from the interval $[0, 1]$, which also contains all real numbers between 0 and 1.)

Many important variables are binary. Examples include:
• whether the economy is in a recession (1) or not (0)
• whether somebody has a college degree (1) or not (0)
• whether a pharmaceutical drug is branded (1) or generic (0)
• whether somebody is employed (1) or not (0)
• whether a retailer is a franchise (1) or not (0)

Mathematically, binary variables are often defined using the indicator function. The indicator function $\mathbb{1}\{\cdot\}$ equals 1 if the argument is true and 0 if false:

\[
\mathbb{1}\{A\} =
\begin{cases}
1 & \text{if } A \text{ is true} \\
0 & \text{if } A \text{ is false}
\end{cases}
\tag{2.3}
\]

For example, consider defining a binary random variable $Y$ based on the coin flip random variable $W$. Recall that the possible values of the flip are $W = h$ (heads) and $W = t$ (tails). We now want $Y = 1$ to indicate heads and $Y = 0$ tails. Mathematically,

\[
Y = \mathbb{1}\{\text{heads}\} = \mathbb{1}\{W = h\} =
\begin{cases}
1 & \text{if } W = h \text{ (heads)} \\
0 & \text{if } W = t \text{ (tails)}
\end{cases}
\tag{2.4}
\]

Other examples can also be written with an indicator function. For example, $Y = \mathbb{1}\{\text{recession}\}$, $Y = \mathbb{1}\{\text{branded}\}$, or $Y = \mathbb{1}\{\text{franchise}\}$.

Probability Mass Function

A binary random variable's PMF is like in (2.1). Specifically, $f_Y(0) = P(Y = 0)$ and $f_Y(1) = P(Y = 1)$.

For example, consider the employment dummy $Y = \mathbb{1}\{\text{employed}\}$. That is, $Y = 1$ if the individual is employed; otherwise, $Y = 0$. The PMF $f_Y(\cdot)$ is

\[
f_Y(1) \equiv P(Y = 1) = P(\text{employed}), \qquad f_Y(0) \equiv P(Y = 0) = P(\text{not employed}). \tag{2.5}
\]

If in the population 80% of individuals are employed (and thus 20% not), then $f_Y(1) = 80\% = 0.8$ and $f_Y(0) = 20\% = 0.2$.

The binary PMF can actually be written in terms of one single parameter, $p \equiv P(Y = 1)$. Since $Y \in \{0, 1\}$, $P(Y = 0) + P(Y = 1) = 1 = 100\%$. By algebra, $P(Y = 0) = 1 - P(Y = 1) = 1 - p$. Thus, the PMF is

\[
f_Y(1) \equiv P(Y = 1) = p, \qquad f_Y(0) \equiv P(Y = 0) = 1 - p. \tag{2.6}
\]

The probability distribution corresponding to (2.6) is called a Bernoulli distribution. That is, if random variable $Y$ has the PMF in (2.6), then we say $Y$ follows a Bernoulli distribution with parameter $p$. Mathematically,

\[
Y \sim \mathrm{Bernoulli}(p). \tag{2.7}
\]


Cumulative Distribution Function

Although there is no practical benefit of a binary cumulative distribution function (over a PMF), it may help you develop intuition.

The CDF of binary $Y$ has a particular structure. If $r < 0$, then $P(Y \le r) = 0$. If $r = 0$, then $P(Y \le r) = P(Y = 0)$: the CDF jumps up (discontinuously) from 0 to $P(Y = 0)$ at $r = 0$. If $0 < r < 1$, then $P(Y \le r) = P(Y = 0)$, too: the CDF is flat. If $r = 1$, then $P(Y \le r) = P(Y \le 1) = 1$: the CDF again jumps, now from $P(Y = 0)$ to 1. If $r > 1$, then $P(Y \le r) = 1$, too: the CDF remains flat. Altogether, letting $p \equiv P(Y = 1)$ and $1 - p = P(Y = 0)$, the CDF of $Y$ is

\[
F_Y(r) = (1 - p)\,\mathbb{1}\{r \ge 0\} + p\,\mathbb{1}\{r \ge 1\} =
\begin{cases}
0 & \text{if } r < 0 \\
1 - p & \text{if } 0 \le r < 1 \\
1 & \text{if } r \ge 1
\end{cases}
\tag{2.8}
\]
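To make the Bernoulli PMF and the stair-step CDF concrete, here is a minimal Python sketch. The function names `bernoulli_pmf` and `bernoulli_cdf` are made up for illustration, and the value p = 0.8 is borrowed from the employment example.

```python
def bernoulli_pmf(y, p):
    """PMF f_Y(y) of Y ~ Bernoulli(p), following (2.6)."""
    if y == 1:
        return p
    if y == 0:
        return 1 - p
    return 0.0  # any value other than 0 or 1 has zero probability


def bernoulli_cdf(r, p):
    """CDF F_Y(r) = P(Y <= r), written with indicators as in (2.8)."""
    # In Python, True and False behave like 1 and 0 in arithmetic,
    # so (r >= 0) plays the role of the indicator function.
    return (1 - p) * (r >= 0) + p * (r >= 1)


p = 0.8  # employment example: P(Y = 1) = 0.8
for r in [-0.5, 0, 0.5, 1, 2]:
    print(r, bernoulli_cdf(r, p))
```

Evaluating the CDF at points below 0, between 0 and 1, and above 1 traces out the flat segments and the two jumps described above.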

Summary Feature: Mean

Since a Bernoulli distribution is fully described by $p \equiv P(Y = 1)$, there is no need to summarize it further. However, the following result is helpful for interpretation of regressions with binary $Y$, and it helps develop intuition.

A random variable's mean is a probability-weighted average of its possible values. With binary $Y$, the possible values are 0 and 1, with respective weights $P(Y = 0)$ and $P(Y = 1)$. The mean $E(Y)$ is thus

\[
E(Y) = \sum_{j=0}^{1} (j)\,P(Y = j) = \overbrace{(0)\,P(Y = 0)}^{j=0} + \overbrace{(1)\,P(Y = 1)}^{j=1} = P(Y = 1). \tag{2.9}
\]

So, for any binary $Y$, $E(Y) = P(Y = 1) = f_Y(1)$.

For terminology, the mean $E(Y)$ is also called the expected value or expectation. These names explain the letter E in the mathematical notation.

However, the terms "expectation" and "expected value" cause much confusion. They are technical terms whose meaning differs greatly from the colloquial English meaning. For example, if you say in plain English, "I expect the value will be 0.5," it means you think there's a good chance (high probability) that the value will exactly equal 0.5. This is not what $E(Y) = 0.5$ means. In fact, with a binary $Y$, it is impossible to have $Y = 0.5$. We may expect (colloquially) $Y = 1$ if $P(Y = 1)$ is high, or we may expect (colloquially) $Y = 0$ if $P(Y = 0)$ is high, but it is impossible to have $Y = E(Y)$ (unless $p = 1$ or $p = 0$), which is very confusing. I suggest you think "mean" every time you see $E(Y)$ or read "expected value" or "expectation."

Summary Feature: Standard Deviation

Again, since a Bernoulli distribution is fully described by $p \equiv P(Y = 1) = E(Y)$, there is no need to summarize it further. However, the following result can help develop intuition.

The standard deviation is one measure of how "spread out" or "dispersed" a distribution is. The standard deviation is defined as the square root of the variance. Most commonly, lowercase sigma is used for notation: $\sigma_Y^2$ is the variance, and $\sigma_Y$ is the standard deviation. With this notation,

\[
\sigma_Y^2 = \mathrm{Var}(Y) \equiv E[(Y - E(Y))^2], \qquad \sigma_Y \equiv \sqrt{\sigma_Y^2}. \tag{2.10}
\]

For $Y \sim \mathrm{Bernoulli}(p)$, the variance and standard deviation are $\sigma_Y^2 = p(1 - p)$ and $\sigma_Y = \sqrt{p(1 - p)}$. The derivation from (2.10) is not important (unless you want a PhD).

The formula $\sigma_Y = \sqrt{p(1 - p)}$ has some intuition. If $p = P(Y = 1) = 1$, then $Y = 1$ always (never $Y = 0$), so the distribution is not at all spread out; it is concentrated entirely on one single value. This is reflected by $\sigma_Y = \sqrt{1(1 - 1)} = \sqrt{0} = 0$. Similarly, if $p = P(Y = 1) = 0$, then $Y = 0$ always (never $Y = 1$): again, not at all spread out. This is reflected by $\sigma_Y = \sqrt{0(1 - 0)} = \sqrt{0} = 0$. In contrast, if $p = P(Y = 1) = 1/2$, then $Y = 0$ and $Y = 1$ are equally likely. This is as spread out as possible for a binary distribution. Then, $\sigma_Y = \sqrt{(1/2)(1 - 1/2)} = \sqrt{1/4} = 1/2$. You can graph $\sqrt{p(1 - p)}$ over $0 \le p \le 1$ to see that indeed $\sigma_Y$ is highest at $p = 1/2$.

That said, for binary $Y$, it is redundant to report both $E(Y)$ and $\sigma_Y$, since $\sigma_Y = \sqrt{E(Y)[1 - E(Y)]}$. Once you know $p = E(Y)$, or equivalently $p = P(Y = 1)$, there is no new information in $\sigma_Y$.
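As a numerical check on the Bernoulli mean and standard deviation, the following Python sketch computes both directly from the definitions (the probability-weighted average and the square root of the variance) and prints them next to the closed-form $\sqrt{p(1-p)}$. The helper name `bernoulli_mean_sd` and the grid of p values are chosen only for illustration.

```python
import math


def bernoulli_mean_sd(p):
    """Mean and standard deviation of Y ~ Bernoulli(p), computed from the definitions."""
    pmf = {0: 1 - p, 1: p}
    # Mean: probability-weighted average of the possible values, as in (2.9).
    mean = sum(y * prob for y, prob in pmf.items())
    # Variance: probability-weighted average of squared deviations, as in (2.10).
    var = sum((y - mean) ** 2 * prob for y, prob in pmf.items())
    return mean, math.sqrt(var)


for p in [0.0, 0.2, 0.5, 0.8, 1.0]:
    mean, sd = bernoulli_mean_sd(p)
    # The last column is the closed-form formula sqrt(p(1 - p)).
    print(p, mean, sd, math.sqrt(p * (1 - p)))
```

The standard deviation column should be zero at p = 0 and p = 1 and largest at p = 0.5, matching the intuition above.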

2.3.3 Discrete Variable

A binary variable is a special case of a discrete variable, which has any (countable) number of possible values. That is, all binary variables are discrete variables, but not all discrete variables are binary. Discrete variable examples include:

• an individual's years of education
• number of children in a household
• the number of times a stock has split since its IPO
• the number of trading partners a country has
• number of students in a classroom

The units of measure are important for interpreting a discrete variable and its distribution. For most discrete variables, like number of children, the units are obvious. Sometimes it is not immediately obvious: number of students per room, or per grade? Number of bills passed in one month, or one year, or one term?


A discrete PMF is similar to a binary PMF. It is again usually written like $f_Y(\cdot)$ for the PMF of discrete random variable $Y$. The PMF's input is again a possible value, and its output is the corresponding probability:

\[
f_Y(y) \equiv P(Y = y). \tag{2.11}
\]

Recall uppercase $Y$ is a random variable, whereas $y$ stands for one particular non-random value (like 41). If $y$ is not one of the possible values of $Y$, then $f_Y(y) = P(Y = y) = 0$.

The main difference is that a general discrete PMF cannot be fully described by a single parameter $p$ like in (2.6). This additional complexity is a reason people look at summary features like the mean, standard deviation, and percentiles.

One dimension of added complexity is the possible values. There can be more than two possible values. Further, they are not always just $0, 1, 2, \ldots$. For example, if recessions are determined on a monthly basis, then the fraction of a year spent in recession could be $0, 1/12, 2/12, \ldots, 11/12, 1$. Consequently, the possible values are often written as $y_1, y_2, \ldots, y_J$, where $J$ is the number of different values. Equivalently, the possible values are $y_j$ for $j = 1, 2, \ldots, J$. (Notation: this has the same meaning as $j \in \{1, 2, \ldots, J\}$, but the convention is to write $j = 1, 2, \ldots, J$ or simply $j = 1, \ldots, J$.)

Having more possible values also means more probabilities to keep track of. Specifically, if there are $J$ possible values, then there are $J$ probabilities: $P(Y = y_j)$ for $j = 1, \ldots, J$. Mathematically, the PMF can be written as

\[
f_Y(y) = \sum_{j=1}^{J} \mathbb{1}\{y = y_j\}\,P(Y = y_j). \tag{2.12}
\]

The last $P(Y = y_J)$ can be solved for using the fact that all $J$ probabilities sum to 1, but that still leaves $J - 1$ probabilities.

Figure 2.1 shows two common ways of graphing the PMF $f_Y(y) = 1/3$ for $y = 1, 2, 3$, i.e., $f_Y(1) = f_Y(2) = f_Y(3) = 1/3$.

[Figure 2.1: Example PMF, plotted two ways. Both panels: horizontal axis $y$ (values 1, 2, 3); vertical axis PMF $P(Y = y)$ (from 0 to 1).]


A discrete CDF is defined as in (2.2). It has a similar pattern as (2.8), where it is flat but then jumps up discontinuously at each possible value $y_j$. Mathematically, it can be written similarly to (2.12):

\[
F_Y(y) \equiv P(Y \le y) = \sum_{j=1}^{J} \mathbb{1}\{y_j \le y\}\,P(Y = y_j). \tag{2.13}
\]

The CDF corresponding to the PMF in Figure 2.1 could be written two equivalent ways:

\[
F_Y(y) = (1/3) \sum_{j=1}^{3} \mathbb{1}\{j \le y\}, \tag{2.14}
\]

\[
F_Y(y) =
\begin{cases}
0 & \text{if } y < 1 \\
1/3 & \text{if } 1 \le y < 2 \\
2/3 & \text{if } 2 \le y < 3 \\
1 & \text{if } 3 \le y
\end{cases}
\tag{2.15}
\]

Figure 2.2 plots this CDF.

[Figure 2.2: Example discrete CDF from (2.14). Horizontal axis $y$ (values 1, 2, 3); vertical axis CDF $P(Y \le y)$ (from 0 to 1).]
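The discrete PMF and CDF formulas can be sketched in a few lines of Python. This is only an illustration: the dictionary `pmf` encodes the example with probability 1/3 on each of 1, 2, and 3, the helper names `pmf_at` and `cdf_at` are made up, and exact fractions are used so the output shows 1/3 and 2/3 rather than rounded decimals.

```python
from fractions import Fraction

# Example PMF: f_Y(1) = f_Y(2) = f_Y(3) = 1/3
pmf = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}


def pmf_at(y, pmf):
    """f_Y(y) = P(Y = y); zero if y is not one of the possible values."""
    return pmf.get(y, Fraction(0))


def cdf_at(y, pmf):
    """F_Y(y) = P(Y <= y): add up P(Y = y_j) over every possible value y_j <= y."""
    return sum((prob for yj, prob in pmf.items() if yj <= y), Fraction(0))


for y in [0.5, 1, 1.5, 2, 2.9, 3, 10]:
    print(y, cdf_at(y, pmf))
```

The printed values stay flat between possible values and jump by 1/3 at each of 1, 2, and 3, which is the stair-step pattern of the piecewise formula.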


Generalizing the binary mean in (2.9), the mean of discrete $Y$ can be written in terms of the $J$ possible values $y_j$ ($j = 1, \ldots, J$) and their probabilities:

\[
E(Y) = \sum_{j=1}^{J} y_j\,P(Y = y_j) = y_1\,P(Y = y_1) + \cdots + y_J\,P(Y = y_J), \tag{2.16}
\]

which could also be written in terms of the PMF because $f_Y(y_j) = P(Y = y_j)$. If $Y$ is binary, then $J = 2$, $y_1 = 0$, and $y_2 = 1$, in which case (2.16) simplifies to (2.9).

The mean gives a rough sense of whether the distribution has high or low values, weighted by their probability. For example, consider random variables $W$ and $Z$ with $P(W = 0) = P(W = 2) = 1/2$ and $P(Z = 2) = P(Z = 4) = 1/2$. Then,

\[
E(W) = (0)(1/2) + (2)(1/2) = 1, \qquad E(Z) = (2)(1/2) + (4)(1/2) = 3, \tag{2.17}
\]

reflecting that $Z$ has higher values. As another example, imagine $W$ and $Z$ both have possible values $j = 1, 2, 3, 4$, but $P(W = j) = j/10$ whereas $P(Z = j) = (5 - j)/10$. Although the possible values are identical, $W$ has higher weight for the higher values, which is reflected by its larger mean:

\[
\begin{aligned}
E(W) &= \sum_{j=1}^{4} (j)(j/10) = (1)(1/10) + (2)(2/10) + (3)(3/10) + (4)(4/10) = 3, \\
E(Z) &= \sum_{j=1}^{4} (j)(5 - j)/10 = (1)(4/10) + (2)(3/10) + (3)(2/10) + (4)(1/10) = 2.
\end{aligned}
\]

However, the mean is sensitive to very large values, so it does not necessarily reflect the value of the "average member of the population." For example, let $Y$ denote hourly wage (\$/hr) for a population with three equally likely types of individuals. The possible values are $y_1 = 10$, $y_2 = 20$, and $y_3 = 270$. The probabilities are $P(Y = y_j) = 1/3$ for $j = 1, 2, 3$. The "average person" is the middle type, who gets paid \$20/hr. (This is the median.) But the mean is, in \$/hr,

\[
E(Y) = (10)(1/3) + (20)(1/3) + (270)(1/3) = 300/3 = 100. \tag{2.18}
\]

This \$100/hr mean wage is way higher than what the average person earns. The reason is that the extremely high value \$270/hr brings the mean way up. A similar but more extreme example has $P(Y = 10) = 0.99$ and $P(Y = 3010) = 0.01$. The "average person" is one of the 99% who make \$10/hr, but the mean is four times larger, \$40/hr:

\[
E(Y) = (10)(0.99) + (3010)(0.01) = 9.9 + 30.1 = 40. \tag{2.19}
\]

The mean helps capture the aggregate earnings rate of the population as a whole, but it does not capture the typical wage of the average population member.

For practice with another example, consider years of education. That is, $Y = 11$ means 11 years of education, $Y = 12$ means 12 years of education (through high school), etc. For simplicity, imagine the only possible values are $Y \in \{11, 12, 16, 18\}$. In the notation of (2.16), $J = 4$ (four possible values), with $y_1 = 11$, $y_2 = 12$, $y_3 = 16$, and $y_4 = 18$. Let $P(Y = 11) = 0.2$, $P(Y = 12) = 0.3$, $P(Y = 16) = 0.4$, and $P(Y = 18) = 0.1$. Applying (2.16),

\[
\begin{aligned}
E(Y) &= (11)\,P(Y = 11) + (12)\,P(Y = 12) + (16)\,P(Y = 16) + (18)\,P(Y = 18) \\
&= (11)(0.2) + (12)(0.3) + (16)(0.4) + (18)(0.1) = 14.
\end{aligned}
\tag{2.20}
\]
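The probability-weighted average is a one-line computation. The sketch below stores each PMF as a Python dictionary mapping possible values to probabilities, and reproduces the wage and education examples; the variable names are illustrative only.

```python
def mean(pmf):
    """E(Y): sum of each possible value times its probability."""
    return sum(y * prob for y, prob in pmf.items())


wage = {10: 1/3, 20: 1/3, 270: 1/3}          # three equally likely wage types
wage_extreme = {10: 0.99, 3010: 0.01}        # 99% earn 10, 1% earn 3010
educ = {11: 0.2, 12: 0.3, 16: 0.4, 18: 0.1}  # years of education

# Results may differ from 100, 40, and 14 by tiny floating-point rounding error.
print(mean(wage))
print(mean(wage_extreme))
print(mean(educ))
```

Changing the single large value in `wage` (say, from 270 to 2700) and re-running is a quick way to see how strongly one extreme value pulls the mean.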

The expectation operator $E(\cdot)$ has a useful property called linearity. Formally, the mean of a linear combination of random variables equals the linear combination of the random variables' means. For example, given two random variables $Y$ and $Z$ (of any type) and two non-random constants $a$ and $b$,

\[
E(aY + bZ) = aE(Y) + bE(Z). \tag{2.21}
\]

Here, $aY + bZ$ is a linear combination of random variables $Y$ and $Z$. Thus, the mean of the linear combination of $Y$ and $Z$ equals the linear combination of the means $E(Y)$ and $E(Z)$.

Equation (2.21) implies other identities. For example, in the special case when $b = 0$, it implies $E(aY) = aE(Y)$. As another example, if $a = 1$ and $Y = cW + dX$, then

\[
\begin{aligned}
E(cW + dX + bZ) &= E(aY + bZ) = aE(Y) + bE(Z) = (1)\,E(cW + dX) + bE(Z) \\
&= cE(W) + dE(X) + bE(Z).
\end{aligned}
\]

Extending this further, if we have random variables $Y_i$ for $i = 1, \ldots, n$ and corresponding constants $c_i$, then

\[
E\left( \sum_{i=1}^{n} c_i Y_i \right) = \sum_{i=1}^{n} c_i\,E(Y_i). \tag{2.22}
\]
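Linearity can be checked numerically. The sketch below uses a hypothetical joint PMF for two random variables (the numbers are made up, not from the text) in which Y and Z are deliberately dependent, since linearity does not require independence. It computes both sides of the linearity equation separately.

```python
# Hypothetical joint PMF of (Y, Z): keys are (y, z) pairs, values are probabilities.
# The values are made up for illustration; Y and Z are dependent here.
joint = {(0, 1): 0.1, (0, 5): 0.3, (2, 1): 0.4, (2, 5): 0.2}


def expect(func, joint):
    """Mean of func(Y, Z): probability-weighted average over all (y, z) pairs."""
    return sum(func(y, z) * prob for (y, z), prob in joint.items())


a, b = 3.0, -2.0  # arbitrary non-random constants

# Left side: the mean of the linear combination aY + bZ.
lhs = expect(lambda y, z: a * y + b * z, joint)
# Right side: the same linear combination of the two separate means.
rhs = a * expect(lambda y, z: y, joint) + b * expect(lambda y, z: z, joint)

print(lhs, rhs)  # the two agree up to floating-point rounding
```

Here E(Y) = 1.2 and E(Z) = 3, so both sides equal (3)(1.2) + (−2)(3) = −2.4.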


The standard deviation has the same definition and interpretation as before. It measures how "spread out" or "dispersed" a distribution is, with the same units of measure as the variable itself, and it is formally defined in (2.10).

Without worrying about calculating these by hand, consider the following examples for intuition. Imagine $W$ and $X$ both have possible values $v_1 = -1$, $v_2 = 0$, and $v_3 = 1$, but $P(X = v_j) = 1/3$ for each $v_j$, whereas $P(W = 0) = 1$. Clearly $X$ is more spread out, as reflected by the standard deviations: $\sigma_X = \sqrt{2/3}$, $\sigma_W = 0$. If we spread out the values $v_1$ and $v_3$ farther from zero, then the standard deviation should increase. Let $Y$ have $P(Y = -2) = P(Y = 0) = P(Y = 2) = 1/3$. Then $\sigma_Y = \sqrt{8/3}$, twice as big as $\sigma_X = \sqrt{2/3}$. Alternatively, we could "spread out" $X$ by defining $Z$ to have the same $v_j$ values as $X$ but with even more probability on the extreme values. Specifically, $P(Z = -1) = P(Z = 1) = 1/2$. Then, $\sigma_Z = 1$, bigger than $\sigma_X = \sqrt{2/3}$.

Like the mean, the standard deviation is sensitive to very large values. For example, let $P(Y = 0) = 0.98$, $P(Y = -100) = P(Y = 100) = 0.01$. Then, even though 98% of the population has $Y = 0$ (very concentrated, not spread out), $\sigma_Y = \sqrt{200} \approx 14.1$.
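These standard deviations can be verified with a short Python sketch that applies the definition (square root of the probability-weighted average squared deviation from the mean). The dictionaries restate the distributions from the examples above; `outlier` is an illustrative name for the last one.

```python
import math


def sd(pmf):
    """Standard deviation: square root of E[(Y - E(Y))^2]."""
    mean = sum(v * p for v, p in pmf.items())
    var = sum((v - mean) ** 2 * p for v, p in pmf.items())
    return math.sqrt(var)


X = {-1: 1/3, 0: 1/3, 1: 1/3}
W = {0: 1.0}
Y = {-2: 1/3, 0: 1/3, 2: 1/3}
Z = {-1: 0.5, 1: 0.5}
outlier = {0: 0.98, -100: 0.01, 100: 0.01}

print(sd(X), math.sqrt(2 / 3))  # both about 0.816
print(sd(W))                    # 0.0
print(sd(Y) / sd(X))            # about 2: Y is twice as spread out as X
print(sd(Z))                    # 1.0
print(sd(outlier))              # about 14.1
```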

Other Summary Features

Although beyond the scope of this text, I must briefly mention that percentiles (or quantiles) are also helpful summary features. They can capture aspects of a probability distribution that the mean and standard deviation do not. For example, the median captures the value of the "average member of the population" discussed above. High and low percentiles help capture the "tails" of a distribution, like people with the very highest (or lowest) income. Measures like the interquartile range capture the "spread" in a way that is not sensitive to very large outliers, complementing the standard deviation. Quantile regression extends percentiles to regression, parallel to what we'll study with the mean.

2.3.4 Categorical or Ordinal Variable

A binary variable is usually a special case of a categorical variable, whose possible values are "categories," not numbers. This was true of most of the examples in Section 2.3.2, like whether a retailer is a franchise or not, or whether a pharmaceutical drug is branded or generic. Such values can be coded as 0 or 1 for convenience, but they lack any numeric meaning.

Categorical variables can have more than two possible values. For example, non-franchise retailers could be categorized further as national chain, regional chain, or independent. Other categorical variable examples include:

• geographic region (north, south, east, west)
• mode of transportation (car, bike, train, etc.)
• industry (like NAICS)
• college major (economics, English, ecology, electrical engineering, etc.)

The previous examples' categories have no particular order to them, so they constitute nominal variables (or nominal categorical variables). Sometimes these are simply called categorical variables.

In contrast, there could be an ordinal variable (or ordinal categorical variable). An ordinal variable's possible values have a natural order, usually from "low" to "high." For example:

• bond rating (e.g., D, C, ..., AA+, AAA)
• self-reported health status (poor, fair, good, excellent)
• teaching evaluation responses (disagree, neutral, agree)
• letter grades (F, C, B, A; although often A and C are 4.0 and 2.0, there is nothing intrinsic in the letter grade system that suggests A is exactly twice as good as C)

Some categorical variables are not clearly nominal or ordinal. For example, consider educational degree. Some degrees are higher than others (e.g., you need a bachelor's degree before you get a master's), but others are not (e.g., PhD and MD). As another example, consider sex. Neither male nor female is "higher" than the other, but arguably intersex is in between. As another example, some occupations are ordered (e.g., within the same consulting firm, junior analyst is lower than senior analyst, which is lower than associate, then manager, etc.), but others are not (e.g., painter, chef, carpenter). Such variables require careful thought but are beyond our scope.

In any case, categorical variables are often represented by dummy variables (binary variables). For example, consider the teaching evaluation response, whose possible values are disagree, neutral, and agree. Using the indicator function from (2.3), we can define $W = \mathbb{1}\{\text{disagree}\}$, $X = \mathbb{1}\{\text{neutral}\}$, $Y = \mathbb{1}\{\text{agree}\}$. Then, $P(W = 1)$ is the probability of "disagree," $P(X = 1)$ is the probability of "neutral," and $P(Y = 1)$ is the probability of "agree." Equivalently, from (2.9), these can be written using $P(W = 1) = E(W)$, $P(X = 1) = E(X)$, and $P(Y = 1) = E(Y)$. In a way, $Y$ is redundant since $Y = \mathbb{1}\{W = X = 0\}$, completely determined by $W$ and $X$.


A categorical or ordinal PMF is essentially the same as a discrete PMF. As before, it is usually written like $f_Y(\cdot)$ for the PMF of $Y$. The PMF's input is again a possible value, and its output is the corresponding probability. The only difference is that the possible values are categories instead of numbers.

For example, consider the coin flip. Random variable $W$ has two possible values, $h$ (heads) or $t$ (tails); mathematically, $W \in \{h, t\}$. Its PMF is $f_W(w) \equiv P(W = w)$. That is, $f_W(h)$ is the probability of heads, and $f_W(t)$ is the probability of tails.


A nominal categorical variable does not have a CDF. Recall from (2.2) that the CDF of $Y$ evaluated at value $v$ is $F_Y(v) \equiv P(Y \le v)$. (It does not matter which lowercase letter we use, whether $y$ or $r$ or $v$; it simply represents a non-random value.) If $Y$ is nominal, then the inequality relationship $\le$ has no meaning. For example, we cannot evaluate if the value "painter" is less than or equal to "chef." The PMF is still well-defined because it relies only on equality, and we can evaluate if "painter" and "chef" are equal. But the "cumulative" part of CDF has no meaning for nominal categorical variables.

In contrast, an ordinal variable can have a CDF. The values are ordered, so we can evaluate if one value is $\le$ another. (If there is a clear order but no clear "low" and "high" ends, then one end can arbitrarily be picked as "low," and the CDF shows the cumulative probability from the "low" end through a given category.)

For example, consider self-reported health status $Y$. Any two values can be compared with $\le$: poor $\le$ good, good $\le$ excellent, etc. The CDF evaluated at "good" is the probability of health that is good or worse. There are three such possible values: poor, fair, and good. Thus, the CDF evaluated at good is the probability of poor, fair, or good health. Mathematically,

\[
\begin{aligned}
F_Y(\text{good}) \equiv P(Y \le \text{good}) &= P(Y = \text{poor}) + P(Y = \text{fair}) + P(Y = \text{good}) \\
&= f_Y(\text{poor}) + f_Y(\text{fair}) + f_Y(\text{good}).
\end{aligned}
\tag{2.23}
\]

Summary Features: Mean and Standard Deviation

Categories cannot be summed or averaged, so categorical variables do not have a mean. You could arbitrarily assign numeric values to each category and then use the discrete variable formula in (2.16), but if I assigned different numeric values, then I could get a very different result than you. Fundamentally, we cannot average "painter" and "chef," or average "poor" and "good." (E.g., what is poor plus good? What is 0.17 times poor?)

For the same reason, categorical variables do not have a standard deviation.


The mode is often useful for summarizing categorical variables. The mode is the single most likely value. Mathematically,

\[
\mathrm{mode}(Y) \equiv \arg\max_{y} P(Y = y), \tag{2.24}
\]

which is read as "the value of $y$ that maximizes $P(Y = y)$."

Nominal categorical variables do not have anything like a mean that can give a sense of whether values are generally high or low, because there is no sense of high or low for nominal variables. Similarly, there is no sense of "close" or "far," so it is meaningless to ask how "spread out" a nominal distribution is. Only features of the PMF can be summarized, without accounting for the values themselves. So, we could measure whether the probability is spread more evenly across categories rather than very high in some categories (but not others), but not much else.
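Finding the mode only requires comparing probabilities, never the categories themselves. Here is a minimal Python sketch; the probabilities for mode of transportation are made up purely for illustration.

```python
def mode(pmf):
    """Return the value y that maximizes P(Y = y)."""
    # max(...) compares categories by their probability, pmf.get(category).
    # If two categories tie for the highest probability, the first one listed is returned.
    return max(pmf, key=pmf.get)


# Hypothetical PMF for mode of transportation (numbers are not from the text).
transport = {"car": 0.6, "bike": 0.1, "train": 0.25, "other": 0.05}

print(mode(transport))  # car
```

Notice that the code never asks whether "car" is less than or greater than "train," which is why the mode remains meaningful for nominal variables.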

Ordinal variables do have a sense of high or low, so it is possible to summarize ordinal distributions analogous to a mean (overall high or low) or standard deviation (how spread out).

The median (and other percentiles) can summarize how generally high or low an ordinal distribution is. The median is the category for the "average member" of the population. That is, at least half the population has the same or lower value, and at least half the population has the same or higher value. Mathematically, the median of ordinal random variable $Y$ is the value $m$ such that $P(Y \le m) \ge 0.5$ and $P(Y \ge m) \ge 0.5$.

For example, consider self-reported health status $Y$. The five possible values are poor, fair, good, great, and excellent. Imagine there is 1/5 probability of each value, i.e., $f_Y(y) = 1/5$ for any of the five possible $y$ values. Because each category has the same probability, naturally the median is the middle category, "good." Mathematically, to verify:

\[
\begin{aligned}
P(Y \le \text{good}) &= \overbrace{f_Y(\text{poor})}^{=1/5} + \overbrace{f_Y(\text{fair})}^{=1/5} + \overbrace{f_Y(\text{good})}^{=1/5} = 3/5 > 1/2, \\
P(Y \ge \text{good}) &= \overbrace{f_Y(\text{good})}^{=1/5} + \overbrace{f_Y(\text{great})}^{=1/5} + \overbrace{f_Y(\text{excellent})}^{=1/5} = 3/5 > 1/2.
\end{aligned}
\tag{2.25}
\]

Imagine a different population represented by random variable $W$ that is generally healthier. To make it obvious, imagine nobody has poor or fair health, and the other three categories have 1/3 probability each. Now "great" is the median, reflecting that $W$ represents a generally healthier population than $Y$, whose median was only "good." Mathematically, to verify:

\[
\begin{aligned}
P(W \le \text{great}) &= \overbrace{f_W(\text{poor})}^{=0} + \overbrace{f_W(\text{fair})}^{=0} + \overbrace{f_W(\text{good})}^{=1/3} + \overbrace{f_W(\text{great})}^{=1/3} = 2/3 > 1/2, \\
P(W \ge \text{great}) &= \overbrace{f_W(\text{great})}^{=1/3} + \overbrace{f_W(\text{excellent})}^{=1/3} = 2/3 > 1/2.
\end{aligned}
\tag{2.26}
\]
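The ordinal CDF and the median condition can be sketched in Python by storing the categories in their low-to-high order. The names `ORDER`, `ordinal_cdf`, and `ordinal_medians` are made up for this illustration, and the function returns a list because, in general, more than one category could satisfy the median condition.

```python
from fractions import Fraction

ORDER = ["poor", "fair", "good", "great", "excellent"]  # low to high


def ordinal_cdf(category, pmf, order=ORDER):
    """P(Y <= category): sum the PMF over categories up to and including it."""
    cutoff = order.index(category)
    return sum((pmf.get(c, Fraction(0)) for c in order[: cutoff + 1]), Fraction(0))


def ordinal_medians(pmf, order=ORDER):
    """All categories m with P(Y <= m) >= 1/2 and P(Y >= m) >= 1/2."""
    half = Fraction(1, 2)
    medians = []
    for i, m in enumerate(order):
        at_or_below = sum((pmf.get(c, Fraction(0)) for c in order[: i + 1]), Fraction(0))
        at_or_above = sum((pmf.get(c, Fraction(0)) for c in order[i:]), Fraction(0))
        if at_or_below >= half and at_or_above >= half:
            medians.append(m)
    return medians


# Y: probability 1/5 on each category. W: nobody has poor or fair health.
Y = {c: Fraction(1, 5) for c in ORDER}
W = {"good": Fraction(1, 3), "great": Fraction(1, 3), "excellent": Fraction(1, 3)}

print(ordinal_cdf("good", Y))  # 3/5
print(ordinal_medians(Y))      # ['good']
print(ordinal_medians(W))      # ['great']
```

The only information the code uses beyond the PMF is the ordering of the categories, which is exactly what a nominal variable lacks.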

There are other ways to compare whether one ordinal distribution is "higher" or "more spread out" than another (e.g., Kaplan and Zhuo, 2019), but they are beyond our scope.

2.3.5 Continuous Variable

A continuous variable differs from a discrete variable in some strange technical ways, but the intuition is the same. This textbook often uses discrete variables to build intuition since the math is simpler. You could imagine a continuous variable like a discrete variable with a very large number of possible values packed very tightly together. Indeed, many variables typically called "continuous" are actually discrete, like monetary values (like annual income or sales) that come in discrete units (like \$0.01). Practically, the difference is negligible. Examples of other variables modeled as continuous are:

• market concentration measures (like market share of largest firm, or HHI)
• a country's per capita annual meat consumption
• percentage growth of GDP (or sales, or stock price, etc.)
• crime rates (e.g., a city's number of property crimes per year per 10,000 people)

Always specify units of measure. For example, if $Y$ is the distance from an individual's residence to their workplace, it is meaningless to say $Y = 15$ because 15 is just a number, not a measure of distance. It could be 15 km, but it could also be 15 mi, which is 24 km, or it could even be measured in meters or feet (or parsecs, though unlikely). The mean, standard deviation, median (and other percentiles), and interquartile range all share the same units as the variable itself, whereas the variance has squared units (which is harder to interpret, e.g., squared dollars). Units always matter greatly, whether for description, prediction, or causality.

Probability Density Function

A truly continuous random variable does not have a PMF. It has an "uncountably infinite" number of possible values, implying each has zero probability. That is, if $Y$ is the random variable, then for any possible value $y$, $P(Y = y) = 0$. There is a difference between "possible" and "non-zero probability." Every observed realization $y$ is clearly possible, yet had zero probability of occurring.

Although individual values have zero probability, ranges of values have non-zero probability. For example, even if $P(Y = y) = 0$ for each individual $0 \le y \le 1$, it's possible that $P(0 \le Y \le 1) = 0.34$.

The probability density function (PDF) helps us see such probabilities for different ranges of values. The "bell curve" is an example of a PDF. Generally, the PDF is higher around more probable values.

If the CDF is also differentiable, the derivative is called a probability density function (PDF). PDFs are commonly denoted with lowercase $f$, sometimes with a subscript, like $f_Y(\cdot)$ for the PDF of random variable $Y$. Similar to a histogram, a PDF shows the probability of a random variable taking a value in a certain interval as the area under the PDF. Since $P(-\infty < Y < \infty) = 1$ for any random variable, the total area under any PDF must equal one.

Figure 2.3 shows an example PDF. No matter how big or small we draw it, the total area between the horizontal axis and the PDF is defined to be 1. The shaded area under the PDF between $y = 0$ and $y = 1$ shows $P(0 \le Y \le 1)$. That is, $P(0 \le Y \le 1)$ is the proportion of the total area under the PDF that is shaded. The fact that $P(Y = y) = 0$ for any $y$ can also be seen. For example, $P(Y = 0)$ is the "area" under the PDF between $y = 0$ and $y = 0$, but this is just a line, which has zero area, hence zero probability.

[Figure 2.3: Example of reading a probability from a PDF. Horizontal axis $y$ (from −3 to 3); vertical axis PDF (density) (from 0.0 to 0.4); the shaded region is labeled "Area = 0.34".]
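The "probability equals area" idea can be sketched numerically in Python. An important assumption: the text does not say which distribution the example PDF comes from, so this sketch uses the standard normal PDF, which is consistent with the axis range and the shaded area of about 0.34 but may not be what was actually plotted. The area is approximated by adding up many thin rectangles (a midpoint Riemann sum), so no calculus is needed to follow it.

```python
import math


def std_normal_pdf(y):
    """Standard normal density (assumed here; the text does not name the distribution)."""
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)


def area_under(pdf, lo, hi, n=100_000):
    """Approximate the area under pdf between lo and hi using n thin rectangles."""
    width = (hi - lo) / n
    # Each rectangle's height is the PDF evaluated at the midpoint of its base.
    return sum(pdf(lo + (i + 0.5) * width) for i in range(n)) * width


print(area_under(std_normal_pdf, 0, 1))     # about 0.34, i.e., P(0 <= Y <= 1)
print(area_under(std_normal_pdf, -10, 10))  # about 1: essentially the total area
print(area_under(std_normal_pdf, 0, 0))     # 0.0: a single point has zero area
```

The last line mirrors the point that a single value has zero probability even though it is possible: an interval of zero width has zero area no matter how high the PDF is there.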


The CDF of a continuous random variable has the same definition as for a discrete or ordinal variable: $F_Y(y) \equiv P(Y \le y)$.

Unlike the jumpy, stair-step CDF of a discrete random variable, a continuous random variable's CDF is a continuous function. (And if you know calculus: when it exists, the PDF is the first derivative of the CDF.)



The intuition for the mean and the linearity of the expectation operator apply equally to continuous random variables.

Computing the mean of a continuous random variable by hand requires calculus, so it is not covered in this textbook. (If you happen to know calculus and are curious: it's extending the idea of probability-weighted average by replacing the sum with an integral, e.g., if the PDF $f_Y(\cdot)$ exists, $E(Y) = \int_{\mathbb{R}} y f_Y(y)\,dy$, analogous to $\sum_{j=1}^{J} y_j f_Y(y_j)$.)


The intuition for the standard deviation is also the same for continuous distributions, but again requires calculus (integration) to compute.


The median and other percentiles complement the mean and standard deviation in summarizing continuous distributions. For example, if you've ever taken a standardized test before (like SAT, ACT, or GRE), your score report probably included your percentile. Your score percentile is the proportion of the population you scored better than, e.g., if you were in the 90th percentile, you scored better than 90% of other students. Despite their utility, percentiles (quantiles) are beyond our scope, but I hope you study econometrics further to learn more about them.

The Normal (Gaussian) Distribution

One particular distribution appears frequently in statistics and econometrics: the normal distribution (or Gaussian distribution). Without getting too detailed, some comments may help, especially when you read other books. (If you want, you can find plenty of details on Wikipedia.)

Although some variables are indeed approximately normally distributed, this is not "normal" in the common English sense: most variables are not normal. To start, any discrete or categorical random variable cannot be normal (Gaussian). Most continuous variables are also not normal.

None of the methods in this textbook require variables to be Gaussian. Historically, sometimes normality was assumed for certain results, but it is not necessary. You don't need to worry about testing whether or not variables are normal. It doesn't matter.

Normal distributions are convenient for educational examples because the mean and standard deviation uniquely characterize a normal distribution. Notationally, the normal distribution is written $N(\mu, \sigma^2)$ for a normal distribution with mean $\mu$ and variance $\sigma^2$, or sometimes equivalently $N(\mu, \sigma)$ with standard deviation $\sigma = \sqrt{\sigma^2}$. That is, if $Y \sim N(\mu, \sigma^2)$, then $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$, so its standard deviation is $\sigma$. This is convenient for illustrative examples because, as we've seen, the mean gives a general sense of values being high or low, and the standard deviation describes how spread out the distribution is. With $\mu = 0$ and $\sigma = 1$, $N(0, 1)$ is called the standard normal distribution.

However, this is not true of other distributions, so in general the mean and standard deviation do not fully summarize a distribution. For example, there could be two random variables with mean 0 and standard deviation 1 that are very different. One such random variable is $W$ with $P(W = -1) = P(W = 1) = 1/2$. Another is the standard normal, $Z \sim N(0, 1)$. If you only know the mean and standard deviation, then you cannot tell the difference between $W$ and $Z$. These are very different, and there are many other random variables with the same mean and standard deviation.
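To see how different two distributions with the same mean and standard deviation can be, the Python sketch below compares the probability of landing within 0.5 of zero. The interval [−0.5, 0.5] is an arbitrary choice for illustration, and the standard normal CDF is computed with the error function from Python's `math` module, using the standard identity Φ(z) = (1 + erf(z/√2))/2.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal N(0, 1), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Discrete W with P(W = -1) = P(W = 1) = 1/2
W = {-1: 0.5, 1: 0.5}
mean_w = sum(w * prob for w, prob in W.items())
sd_w = math.sqrt(sum((w - mean_w) ** 2 * prob for w, prob in W.items()))
print(mean_w, sd_w)  # 0.0 1.0, the same mean and standard deviation as N(0, 1)

# Probability of a value between -0.5 and 0.5 under each distribution
p_w = sum(prob for w, prob in W.items() if -0.5 <= w <= 0.5)
p_z = std_normal_cdf(0.5) - std_normal_cdf(-0.5)
print(p_w)  # 0: W never takes a value in this range
print(p_z)  # about 0.38
```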

2.4 Prelude to Prediction: Precipitation

This section introduces prediction concepts through a simple example, to develop intuition. The two main goals are 1) to show that there is no single best prediction because "best" depends on the ultimate purpose of the prediction, and 2) to begin translating intuition into formal mathematics. Further mathematical formalization and more complex examples are in Section 2.5.

Notationally (details below), $g$ is your non-random guess of the realized value $y$ of random variable $Y$, and $L(y, g)$ quantifies how bad it is to have guessed $g$ when the realized value is $y$. (Hypothetically, you could randomize your guess, but this is never optimal, so it is not considered.)

Throughout this section, imagine you want to predict whether or not it will rain tomorrow. Mathematically, random variable $Y$ represents tomorrow's weather: $Y = 1$ if it rains tomorrow, and $Y = 0$ if not. Assume you actually know the probability distribution of $Y$ (you do not need to estimate it from data). Since $Y$ is binary, $Y \sim \mathrm{Bernoulli}(p)$, so knowing the distribution is equivalent to knowing $p = P(Y = 1)$, the probability that it rains. (This is usually what weather forecasts report, so you could actually do the following examples in real life.) Thus, equivalently: given the probability of rain, you want to predict the realized value (yes or no).

The distribution of $Y$ alone is not enough to make a good prediction: you also need to know the consequences of correct and incorrect predictions in each case. Intuitively, if one outcome is really, really bad, then you should prefer to avoid even a small risk of it, compared to a larger risk of a not-so-bad outcome.

Mathematically, consequences are formalized as a loss function. The loss function $L(y, g)$ specifies how bad it is to have guessed $g$ when the realized value is $y$. (Other sources may switch the order of $y$ and $g$, so be careful.) In the rain example, $L(0, 1)$ represents how bad it is to guess rain when it does not rain. This may be different than $L(1, 0)$, how bad it is to guess no rain when in fact it rains. For example, $L(0, 1) = 20$ and $L(1, 0) = 100$ means it is much worse to be wrong when it rains ($y = 1$) than when it doesn't ($y = 0$).

It can be confusing to define "loss" when you're correct ($g = y$). If something actually good happens, it can be represented by negative loss, like $L(0, 0) = -10$. If $L(1, 1) = -30$ (even more negative than $-10$), then it's even better to guess right when it rains.

Even if there are good outcomes, loss values can be normalized to be non-negative without changing the best prediction. In the rain example, imagine the most negative loss possible is $L(1, 1) = -30$. Then, simply adding 30 to each loss makes them all non-negative without changing their relative values. This essentially sets $L(1, 1) = 0$ as the reference point, and the loss function's interpretation is, "How much worse is this situation than $(y, g) = (1, 1)$?" It's also possible to normalize $L(v, v) = 0$ for any $v$, meaning that guessing $g = y$ is not bad at all, by subtracting the original $L(v, v)$ from the original $L(v, g)$ for all $g$. Understanding the detailed reasoning is beyond our scope, but just be aware that $L(y, g)$ is not necessarily an absolute "how bad is $(y, g)$," but can also be "how bad relative to $(y, y)$" or "how bad relative to the best possible $(v, v)$."

If you are more familiar with utility functions from economicsyou can think of the loss function as essentially a negative utilityfunction Apparently economists are optimistic modeling how goodthings are (utility) whereas statisticians are pessimistic modelinghow bad things are (loss) If you had a utility function u(y g) thatsays how good it is to have guessed g when the truth is y then youcan just define L(y g) = minusu(y g)

Throughout this section the consequences are the results of abet with your friend If you guess wrong then you lose some money(positive loss) If you guess right then you win some money (negativeloss)

2.4.1 Easy: “Predict” Current Weather

Let’s start easy: you’re standing outside, and you want to predict whether or not it’s currently raining. Since you can observe this directly, this is like the “after” view of Section 2.1.1: instead of multiple possible values of random variable Y, you see the realized value y, with no uncertainty.

You make a simple $1 bet: if you guess right (g = y = 0 or g = y = 1), then you win $1, but if you guess wrong (g ≠ y), then you lose $1. Recall that negative loss means winning. Formally, L(y, g) is

\[
L(0,0) = L(1,1) = -1, \quad L(0,1) = L(1,0) = 1 \implies L(y,g) = 2 \times \mathbb{1}\{y \ne g\} - 1 \tag{2.27}
\]

If you remember your microeconomics classes, you may realize that the loss function in (2.27) implicitly assumes a linear utility function, u(x) = x, for simplicity. That is, if you currently have $x and you win $1, then your utility increases by u(x + 1) − u(x). If you lose $1, then your utility decreases from u(x) to u(x − 1). If you are risk-averse, u(·) is concave, so even though the dollar amount is the same, the potential utility increase is smaller than the potential utility decrease: u(x + 1) − u(x) < u(x) − u(x − 1). Generally, your loss function should be L(0, 0) = L(1, 1) = u(x) − u(x + 1) and L(0, 1) = L(1, 0) = u(x) − u(x − 1). For simplicity, plugging in u(x) = x yields u(x) − u(x + 1) = x − (x + 1) = −1 and u(x) − u(x − 1) = x − (x − 1) = 1, as in (2.27).
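To make the risk-aversion point concrete, here is a minimal Python sketch. The log utility function and the starting wealth of $10 are illustrative assumptions, not values from the text, and the helper name `bet_losses` is hypothetical.

```python
import math

def bet_losses(u, x, stake=1.0):
    """Return (loss if right, loss if wrong) for a bet of `stake` dollars.

    u is a utility function of wealth and x is current wealth (both assumed).
    """
    loss_if_right = u(x) - u(x + stake)  # negative: utility goes up
    loss_if_wrong = u(x) - u(x - stake)  # positive: utility goes down
    return loss_if_right, loss_if_wrong

# Linear utility u(x) = x reproduces the losses -1 and 1 from (2.27).
linear = bet_losses(lambda x: x, x=10.0)

# Concave (risk-averse) utility: log is just one illustrative choice.
concave = bet_losses(math.log, x=10.0)

print("linear: ", linear)
print("concave:", concave)
```

Under the concave utility, the loss from guessing wrong is larger in magnitude than the gain from guessing right, even though both involve the same $1.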

Obviously, you guess g = y. You are correct. You win $1.

Mathematically, how can this intuition be formalized? If you know y, then you can compute both L(y, 0) and L(y, 1). If y = 1, so L(y, 0) = 1 and L(y, 1) = −1, then “guessing” g = 1 minimizes loss, since −1 < 1. If y = 0, so L(y, 0) = −1 and L(y, 1) = 1, then “guessing” g = 0 minimizes loss, since −1 < 1. Thus, the best “guess” is indeed g = y.

2.4.2 Minimizing Mean Loss

With the same loss function from (2.27), consider predicting tomorrow’s weather if P(Y = 1) = 0.4. That is, there’s a 40% probability of rain tomorrow (and 60% chance of no rain); for the purpose of your bet, should you predict rain?

Now we need some way to deal with uncertainty. Regardless of guessing g = 0 or g = 1, there is some chance of being right and some chance of being wrong.

In microeconomics, the typical approach is to choose the action that maximizes mean utility. The same could be done here. Equivalently, since the loss function is essentially a negative utility function, we could minimize mean loss. This doesn’t guarantee you’ll win the bet every time, but over the long run (if you bet many times), it leads to the lowest total loss.

Mean loss is more commonly called expected loss, but this can be confusing. Again, “expected” is technical jargon that is unrelated to what “expected” means in colloquial English. Below, it is in fact impossible to actually receive the “expected” loss, since it is a decimal value (whereas you can only win or lose $1).

Mean loss is also sometimes called risk. Again, this has a precise technical meaning, but it is probably not how you would define “risk” colloquially.

Given the distribution of Y and the loss function in (2.27), how can we pick g to minimize mean loss? There are two values of g to consider: g = 0 or g = 1. (Other values are allowed but would always be wrong, so you’d always lose; e.g., with g = 0.4, g ≠ y for both y = 0 and y = 1.) Given a particular guess like g = 0, the loss still depends on y, which has multiple possible values. Thus, the loss has multiple possible values. That is, given g = 0, the loss is a random variable. We can derive its distribution from the distribution of Y. Then, we can compute the mean of the loss distribution. In this way, we can compute mean loss for each possible g. Finally, the best guess is the g with the smallest mean loss. Details on these steps are given below.


To work through these steps mathematically, consider the loss when g = 0 and separately the loss when g = 1. Mathematically, define random variables

\[
L_0 \equiv L(Y, 0), \quad L_1 \equiv L(Y, 1) \tag{2.28}
\]

respectively representing the (distribution of) loss when g = 0 and when g = 1. When g = 0, the loss L(Y, 0) is either L(0, 0) = −1 if Y = 0, or else L(1, 0) = 1 if Y = 1. That is, L_0 = −1 when Y = 0, and L_0 = 1 when Y = 1. We know P(Y = 1) = 0.4, so

\[
\mathrm{P}(L_0 = 1) = \mathrm{P}(Y = 1) = 0.4, \quad \mathrm{P}(L_0 = -1) = \mathrm{P}(Y = 0) = 1 - \mathrm{P}(Y = 1) = 0.6 \tag{2.29}
\]

Similarly, if you guess g = 1, then the loss L(Y, 1) is L(1, 1) = −1 when Y = 1, or else L(0, 1) = 1 when Y = 0, so

\[
\mathrm{P}(L_1 = 1) = \mathrm{P}(Y = 0) = 0.6, \quad \mathrm{P}(L_1 = -1) = \mathrm{P}(Y = 1) = 0.4 \tag{2.30}
\]

Given the loss distributions in (2.29) and (2.30), mean loss can be computed for each possible g. If you guess g = 0, then using (2.16) and (2.29), mean loss (in $) is

\[
\mathrm{E}(L_0) = (0.4)(1) + (0.6)(-1) = -0.2 \tag{2.31}
\]

If you instead guess g = 1, then using (2.16) and (2.30), mean loss (in $) is

\[
\mathrm{E}(L_1) = (0.6)(1) + (0.4)(-1) = 0.2 \tag{2.32}
\]

The best prediction for your bet is the g that minimizes mean loss. Mathematically, E(L_0) < E(L_1); equivalently, using (2.28), E[L(Y, 0)] < E[L(Y, 1)]. That is, g = 0 generates the smallest mean loss, so g = 0 is the best prediction to make for your bet. Even though it’s the best prediction, you’ll still be wrong and lose $1 if it rains tomorrow, which has a 40% probability of happening. But in the long run, you’ll win money if you always predict g = 0, whereas you’ll lose money if you always predict g = 1.
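The steps above can be sketched in a few lines of Python. This is only an illustration under the stated setup (the $1 bet loss and P(Y = 1) = 0.4); the names `mean_loss`, `bet_loss`, and `best_g` are hypothetical and not from the text.

```python
def mean_loss(loss, p_rain, g):
    """E[L(Y, g)] for binary Y with P(Y = 1) = p_rain, where loss[(y, g)] = L(y, g)."""
    return (1 - p_rain) * loss[(0, g)] + p_rain * loss[(1, g)]

# The $1 bet: win $1 (loss -1) if right, lose $1 (loss 1) if wrong.
bet_loss = {(0, 0): -1, (1, 1): -1, (0, 1): 1, (1, 0): 1}

p = 0.4  # probability of rain tomorrow
mean_losses = {g: mean_loss(bet_loss, p, g) for g in (0, 1)}
best_g = min(mean_losses, key=mean_losses.get)  # guess with the smallest mean loss

print(mean_losses)
print("best guess:", best_g)
```

Up to floating-point rounding, the two mean losses are −0.2 (for g = 0) and 0.2 (for g = 1), so the sketch selects g = 0.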

2.4.3 Different Probability

Imagine the same setup as in Section 2.4.2 but with a different distribution of Y: P(Y = 1) = 0.7. Intuitively, rain being more likely might change the optimal prediction from g = 0 to g = 1. Mathematically, this intuition proves correct.

Following the same steps from Section 2.4.2, first compute the distribution of loss separately for g = 0 and g = 1. Define L_0 and L_1 as in (2.28). Parallel to (2.29) and (2.30), but with P(Y = 1) = 0.7,

\[
\begin{aligned}
\mathrm{P}(L_0 = 1) &= \mathrm{P}(Y = 1) = 0.7, & \mathrm{P}(L_0 = -1) &= \mathrm{P}(Y = 0) = 1 - \mathrm{P}(Y = 1) = 0.3 \\
\mathrm{P}(L_1 = 1) &= \mathrm{P}(Y = 0) = 0.3, & \mathrm{P}(L_1 = -1) &= \mathrm{P}(Y = 1) = 0.7
\end{aligned}
\tag{2.33}
\]

Parallel to (2.31) and (2.32), using (2.33),

\[
\begin{aligned}
\mathrm{E}[L(Y, 0)] &= \mathrm{E}(L_0) = (0.7)(1) + (0.3)(-1) = 0.4 \\
\mathrm{E}[L(Y, 1)] &= \mathrm{E}(L_1) = (0.3)(1) + (0.7)(-1) = -0.4
\end{aligned}
\tag{2.34}
\]

Since E[L(Y, 1)] < E[L(Y, 0)], g = 1 minimizes mean loss, so g = 1 is the best prediction for your bet.
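The two numerical cases so far (rain probability 0.4 and 0.7) are instances of one general pattern. As a supplementary derivation not in the original text, write p for P(Y = 1) and keep the $1 bet loss function; the same steps then give:

```latex
\begin{aligned}
\mathrm{E}(L_0) &= p(1) + (1-p)(-1) = 2p - 1, \\
\mathrm{E}(L_1) &= (1-p)(1) + p(-1) = 1 - 2p, \\
\mathrm{E}(L_1) < \mathrm{E}(L_0) &\iff 1 - 2p < 2p - 1 \iff p > 1/2 .
\end{aligned}
```

So with this particular loss function, predicting rain is better exactly when rain is more likely than not. Plugging in p = 0.4 gives mean losses −0.2 and 0.2, and p = 0.7 gives 0.4 and −0.4, matching the values computed above.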

2.4.4 Different Loss Function

Now imagine the original setup in Section 2.4.2 but with a different loss function. Specifically, if you correctly predict rain, you win $10 (i.e., −10 loss), but otherwise L(y, g) is the same as in (2.27):

\[
L(0,0) = -1, \quad L(1,1) = -10, \quad L(0,1) = L(1,0) = 1 \tag{2.35}
\]

Intuitively, even though rain is less probable than no rain, the much larger payoff for correctly predicting rain might make us bet on rain. Mathematically, this intuition proves correct.

Following the same steps from Section 2.4.2, define L_0 and L_1 as in (2.28). Parallel to (2.29) and (2.30), but with L(1, 1) = −10,

\[
\begin{aligned}
\mathrm{P}(L_0 = 1) &= \mathrm{P}(Y = 1) = 0.4, & \mathrm{P}(L_0 = -1) &= \mathrm{P}(Y = 0) = 0.6 \\
\mathrm{P}(L_1 = 1) &= \mathrm{P}(Y = 0) = 0.6, & \mathrm{P}(L_1 = -10) &= \mathrm{P}(Y = 1) = 0.4
\end{aligned}
\tag{2.36}
\]

Parallel to (2.31) and (2.32), using (2.36),

\[
\begin{aligned}
\mathrm{E}[L(Y, 0)] &= \mathrm{E}(L_0) = (0.4)(1) + (0.6)(-1) = -0.2 \\
\mathrm{E}[L(Y, 1)] &= \mathrm{E}(L_1) = (0.6)(1) + (0.4)(-10) = -3.4
\end{aligned}
\tag{2.37}
\]

Since E[L(Y, 1)] < E[L(Y, 0)], g = 1 minimizes mean loss, so g = 1 is the best prediction for your bet. (As noted earlier, ideally you’d also allow for a utility function with some risk aversion, in which case the best prediction may additionally depend on the degree of risk aversion.)
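To see how the best prediction depends on the payoff, the following Python sketch recomputes the mean losses for several payoffs for correctly predicting rain. Only the $10 payoff comes from the text; the other payoffs, and the names `P_RAIN`, `mean_losses`, and `best_guess`, are illustrative assumptions. Exact fractions are used to avoid floating-point rounding in the tie case.

```python
from fractions import Fraction

P_RAIN = Fraction(4, 10)  # P(Y = 1), as in Section 2.4.2

def mean_losses(win_if_rain_right):
    """Return {g: E[L(Y, g)]} when correctly predicting rain wins the given amount."""
    # loss[(y, g)] = L(y, g); only L(1, 1) differs from the $1 bet.
    loss = {(0, 0): -1, (1, 1): -win_if_rain_right, (0, 1): 1, (1, 0): 1}
    return {g: (1 - P_RAIN) * loss[(0, g)] + P_RAIN * loss[(1, g)] for g in (0, 1)}

def best_guess(win_if_rain_right):
    means = mean_losses(win_if_rain_right)
    return min(means, key=means.get)  # ties go to g = 0

print(mean_losses(10))
for payoff in (1, 2, 3, 10):
    print("payoff", payoff, "-> best guess", best_guess(payoff))
```

With a 40% rain probability, the mean loss of guessing rain is 0.6 minus 0.4 times the payoff, so in this sketch the best guess switches from g = 0 to g = 1 once the payoff exceeds $2.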

2.5 Prediction with a Known Distribution

⇒ Kaplan video: Optimal Prediction

What does prediction mean? It may seem surprising to discuss prediction without any data, and with a completely known distribution. In English, usually prediction means using what you know now to “predict” what will happen in the future (e.g., “Beware the Ides of March”). In econometrics and statistics, prediction shares the qualities of guessing something unknown using something known, but the details differ. (Predicting the future is a special case of prediction called forecasting; see Part III.)

Here, as in Section 2.4, the goal is to predict the value of a random draw from a known distribution. The distribution summarizes “what you know”: different possible values and their probabilities. As in Section 2.1, the random draw need not occur in the future; indeed, it may have already happened, but we haven’t observed it yet. So, besides applications like predicting ridesharing demand tomorrow, “prediction” also includes guessing the income of a customer standing right in front of you (who hasn’t told you their income yet).

This section extends Section 2.4, including more complex examples. Further, close connections with description (Section 2.3) are shown.

Understanding the role of the loss function is particularly crucial. Even if you do some fancy machine learning prediction with cross-validation, you need to consider the loss function carefully. Just using whatever you find online may be inappropriate. I have seen PhD students puzzled by their results because they did not use an appropriate loss function. Loss functions are also central to Bayesian prediction, although that is beyond our scope.

2.5.1 Common Loss Functions

Ideally, a loss function reflects the real-world consequences of guessing g when the realized value is y, like in Section 2.4. In practice, this may be infeasible. For example, maybe the consequences are not easily quantified. Or, maybe you have to make a single prediction that will be used for multiple decisions with different consequences, or for multiple people whose utility functions differ. Or, maybe you have a deadline and simply don’t have time to carefully construct a loss function.

There are infinitely many possible loss functions, but those below are used more commonly than others.

0–1 Loss

Define 0–1 loss as

\[
L_0(y, g) \equiv \mathbb{1}\{y \ne g\} \tag{2.38}
\]

This equals 0 (which is good) if you guess correctly, and 1 (bad) if not. This reflects a case where it doesn’t matter “how wrong” you are; it only matters whether you’re right or wrong.

This 0–1 loss is often used when Y is a nominal categorical variable (Section 2.3.4) or binary. For example, if you need to predict occupation, and the true y is “painter,” it’s probably no worse to have incorrectly guessed “zookeeper” than “bookkeeper”; they’re just both wrong. This is not true of ordinal categorical variables; e.g., if somebody’s health is “excellent,” it’s probably worse (higher loss) to have predicted “poor” than “great.”

Although not obvious, 0–1 loss determines the same optimal prediction as the rain bet loss function in (2.27). (Although coded as binary, the rain Y could be considered nominal categorical, with possible values “rain” and “no rain.”) The optimal prediction does not change if you add 1 to all losses. It also does not change if you divide all losses by 2. Thus, given L(y, g) in (2.27), it is equivalent to use the loss function [L(y, g) + 1]/2 = 1{y ≠ g}, which is 0–1 loss as in (2.38).
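The equivalence between the rain bet loss and 0–1 loss can be checked numerically. The sketch below compares the best guess under both loss functions over a grid of rain probabilities; the grid and the function names are illustrative assumptions, and exact fractions are used so that the tie at probability 1/2 is handled consistently.

```python
from fractions import Fraction

def bet_loss(y, g):
    """Rain bet loss: -1 if the guess is right, 1 if it is wrong."""
    return 2 * int(y != g) - 1

def zero_one_loss(y, g):
    """0-1 loss: 0 if the guess is right, 1 if it is wrong."""
    return int(y != g)

def best_guess(loss, p_rain):
    """Guess in {0, 1} with the smallest mean loss when P(Y = 1) = p_rain."""
    mean = {g: (1 - p_rain) * loss(0, g) + p_rain * loss(1, g) for g in (0, 1)}
    return min(mean, key=mean.get)  # ties go to g = 0

grid = [Fraction(k, 10) for k in range(11)]  # p = 0, 0.1, ..., 1
same_everywhere = all(
    best_guess(bet_loss, p) == best_guess(zero_one_loss, p) for p in grid
)
print(same_everywhere)
```

The check passes because adding a constant to every loss, or dividing every loss by a positive constant, shifts or rescales all mean losses equally and so cannot change which guess has the smallest one.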

Quadratic Loss

Define quadratic loss (or squared loss, or squared error loss, or L2 loss) as

\[
L_2(y, g) = (y - g)^2 \tag{2.39}
\]

This is zero when the guess is perfect (g = y), and larger when g is farther from y, in either direction (higher or lower). Thus, unlike 0–1 loss, quadratic loss can differentiate between a slightly-wrong guess and a really-wrong guess.

For example, let the true y = 100. The guess g = y = 100 is best, since L_2(100, 100) = 0 is the smallest possible loss. (Squaring produces non-negative numbers.) The guess g = 99 is worse: loss is L_2(100, 99) = (100 − 99)^2 = 1. The guess 90 is even worse, since L_2(100, 90) = (100 − 90)^2 = 100. In fact, even though 90 is only 10 times farther from y than 99 is, the loss is 100 times as big: much, much worse. The guess 110 is just as bad as 90 since they are both wrong by 10 (higher or lower doesn’t matter): L_2(100, 110) = (100 − 110)^2 = 100.
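The worked numbers above can be reproduced with a short Python sketch that also shows 0–1 loss for contrast. The function and variable names are illustrative, not from the text.

```python
def quadratic_loss(y, g):
    """Quadratic (squared error) loss."""
    return (y - g) ** 2

def zero_one_loss(y, g):
    """0-1 loss, for comparison: it ignores how wrong the guess is."""
    return int(y != g)

y = 100
guesses = (100, 99, 90, 110)
quadratic = {g: quadratic_loss(y, g) for g in guesses}
zero_one = {g: zero_one_loss(y, g) for g in guesses}

print("quadratic:", quadratic)
print("0-1:      ", zero_one)
```

Quadratic loss separates the guesses 99, 90, and 110 by how far off they are, whereas 0–1 loss assigns all three the same loss of 1.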

Quadratic loss is often used for discrete or continuous variables (Sections 2.3.3 and 2.3.5). Although ordinal loss functions should also differentiate between slightly-wrong and really-wrong, quadratic loss cannot be used because you cannot subtract two category values or square them as required by (2.39); e.g., you cannot square the difference between “excellent” and “great.”

Despite its common use, quadratic loss is not always sensible. For example, sometimes it may be much worse to over-predict (g > y) than under-predict (g < y), or vice-versa. Quadratic loss does not differentiate between over-prediction and under-prediction because (y − g)^2 = (g − y)^2: only the absolute error |y − g| matters, not whether it’s positive or negative. For example, if y = 100, it may be much worse to predict g = 110 than g = 90, but L_2(100, 110) = (100 − 110)^2 = 100 is the same as L_2(100, 90) = (100 − 90)^2 = 100. As another example, sometimes it may be twice as bad to over-predict by 20 units (g − y = 20) than 10 units (g − y = 10), but quadratic loss treats it as four times worse, since 10^2 = 100 but 20^2 = 400.

Nonetheless, quadratic loss is usually not crazy, especially if we need to make a prediction but don’t know how it will be used in decision-making.

Other Loss Functions

Although beyond our scope, if you’re curious, there are other interesting, common loss functions. One is absolute loss (or L1 loss), L_1(y, g) = |y − g|. Variations of absolute loss include asymmetric versions for which over-prediction is worse than under-prediction (or vice-versa), as well as absolute percentage loss, |(y − g)/y|. Another common loss function is weighted 0–1 loss, which generalizes 0–1 loss by allowing L(0, 1) ≠ L(1, 0). (This is related to hypothesis testing, where there are different consequences for rejecting a true hypothesis than not rejecting a false hypothesis.)
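For readers who find code easier to parse than formulas, the loss functions just mentioned might be written as follows. This is only a sketch: the asymmetry weights (2 for over-prediction, 1 for under-prediction) are hypothetical, and the weighted 0–1 costs of 20 and 100 reuse the illustrative rain numbers from Section 2.4.

```python
def absolute_loss(y, g):
    """Absolute (L1) loss."""
    return abs(y - g)

def asymmetric_absolute_loss(y, g, over_weight=2.0, under_weight=1.0):
    """Absolute loss with different (hypothetical) weights for over- and under-prediction."""
    if g > y:
        return over_weight * (g - y)   # over-prediction
    return under_weight * (y - g)      # under-prediction (or exactly right)

def absolute_percentage_loss(y, g):
    """Absolute percentage loss; undefined when y = 0."""
    return abs((y - g) / y)

def weighted_zero_one_loss(y, g, cost_false_rain=20, cost_missed_rain=100):
    """Weighted 0-1 loss for binary y: the two kinds of mistakes cost different amounts."""
    if y == g:
        return 0
    return cost_false_rain if y == 0 else cost_missed_rain

print(absolute_loss(100, 90), asymmetric_absolute_loss(100, 110),
      absolute_percentage_loss(100, 90), weighted_zero_one_loss(1, 0))
```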

2.5.2 Optimal Prediction: Generic Examples

In Sum: Optimal Prediction
1. Choose appropriate loss function: L(y, g) quantifies how bad it is to guess g when the true value is y.
2. Optimal prediction: the value of g with smallest mean loss, E[L(Y, g)].

The following generic examples show the procedure to find the prediction g that is “optimal” in the sense of minimizing mean loss. The procedure does not allow continuous Y since that would require calculus, but it works for any other variable type from Section 2.3.

Two Possible Values

This is the same procedure used in the rain example in Section 2.4.

Step 1 is to write out the loss function values for all combinations of (y, g), where g is our guess and y is the true realized value. Assume two possible values: y = a or y = b. (Possibly a = 0 and b = 1, but they could have any value, including non-numeric values like “cat” and “dog.”) Assume you have to guess either a or b: g = a or g = b. Thus, there are four possible combinations of (y, g): (a, a), (b, b), (a, b), and (b, a). The four corresponding loss function values can be arranged in a matrix, where each column has the same y value, and each row has the same g value:

\[
\begin{pmatrix} L(a,a) & L(b,a) \\ L(a,b) & L(b,b) \end{pmatrix}
=
\begin{pmatrix} L(y=a, g=a) & L(y=b, g=a) \\ L(y=a, g=b) & L(y=b, g=b) \end{pmatrix}
\tag{2.40}
\]

Step 2 is to compute the mean loss for each possible guess g. In (2.40), each row corresponds to a different g. To compute mean loss if g = a, look at the first row. The guess g = a is non-random. The randomness of L(Y, a) is from the randomness of Y. The probabilities P(Y = a) and P(Y = b) do not depend on our guess g. If we guess g = a, then we’ll get loss L(a, a) with probability P(Y = a), and we’ll get loss L(b, a) with probability P(Y = b). If we guess g = b, then we’ll get loss L(a, b) with probability P(Y = a), and we’ll get loss L(b, b) with probability P(Y = b). Thus,

\[
\begin{aligned}
\mathrm{E}[L(Y, a)] &= \mathrm{P}(Y = a) L(a, a) + \mathrm{P}(Y = b) L(b, a) \\
\mathrm{E}[L(Y, b)] &= \mathrm{P}(Y = a) L(a, b) + \mathrm{P}(Y = b) L(b, b)
\end{aligned}
\tag{2.41}
\]

(Although we won’t use it, if you’re handy with matrix algebra, you’ll see that the vector of mean losses can be written as the product of the matrix in (2.40) with the column vector of probabilities P(Y = a) and P(Y = b); i.e., each mean loss is the dot product of a row in the loss matrix with the probability vector.)

Step 2 could also be interpreted in terms of two new random variables, like L_0 and L_1 in Section 2.4. Let L_a ≡ L(Y, a) be a random variable representing loss when g = a. The distribution of L_a is P(L_a = L(a, a)) = P(Y = a) and P(L_a = L(b, a)) = P(Y = b). Similarly, let L_b ≡ L(Y, b) be a random variable representing loss when g = b, with P(L_b = L(a, b)) = P(Y = a) and P(L_b = L(b, b)) = P(Y = b). Thus, yielding the same results as (2.41),

\[
\begin{aligned}
\mathrm{E}[L(Y, a)] &= \mathrm{E}(L_a) = \mathrm{P}(Y = a) L(a, a) + \mathrm{P}(Y = b) L(b, a) \\
\mathrm{E}[L(Y, b)] &= \mathrm{E}(L_b) = \mathrm{P}(Y = a) L(a, b) + \mathrm{P}(Y = b) L(b, b)
\end{aligned}
\tag{2.42}
\]

Step 3 is to find the g that minimizes E[L(Y, g)]. The g that minimizes mean loss is the optimal predictor. That is: if E[L(Y, a)] < E[L(Y, b)], then g = a is the optimal predictor; if E[L(Y, b)] < E[L(Y, a)], then g = b is the optimal predictor; or if E[L(Y, a)] = E[L(Y, b)], then g = a and g = b are equally good (or equally bad).

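For example, let

\[
\begin{pmatrix} L(a,a) & L(b,a) \\ L(a,b) & L(b,b) \end{pmatrix}
=
\begin{pmatrix} 0 & 7 \\ 5 & 0 \end{pmatrix},
\qquad
\begin{aligned}
\mathrm{P}(Y = a) &= 0.7 \\
\mathrm{P}(Y = b) &= 0.3
\end{aligned}
\tag{2.43}
\]

Using (2.41),

\[
\mathrm{E}[L(Y, a)] = \mathrm{P}(Y = a) L(a, a) + \mathrm{P}(Y = b) L(b, a) = (0.7)(0) + (0.3)(7) = 2.1 \tag{2.44}
\]

\[
\mathrm{E}[L(Y, b)] = \mathrm{P}(Y = a) L(a, b) + \mathrm{P}(Y = b) L(b, b) = (0.7)(5) + (0.3)(0) = 3.5 \tag{2.45}
\]

Since E[L(Y, a)] < E[L(Y, b)], the predictor g = a is better than g = b, according to mean loss with this particular loss function.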
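The three steps for two possible values translate directly into code. The Python sketch below uses the numbers from this example (losses 0, 7, 5, 0 and probabilities 0.7 and 0.3); the dictionary layout and names are illustrative choices, and exact fractions avoid rounding.

```python
from fractions import Fraction

# Step 1: loss[(y, g)] = L(y, g), with y the realized value and g the guess.
loss = {("a", "a"): 0, ("b", "a"): 7,
        ("a", "b"): 5, ("b", "b"): 0}
prob = {"a": Fraction(7, 10), "b": Fraction(3, 10)}  # P(Y = a), P(Y = b)

# Step 2: mean loss for each possible guess g.
def mean_loss(g):
    return sum(prob[y] * loss[(y, g)] for y in prob)

mean_losses = {g: mean_loss(g) for g in prob}

# Step 3: the guess with the smallest mean loss.
best = min(mean_losses, key=mean_losses.get)

print({g: float(m) for g, m in mean_losses.items()}, "best:", best)
```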

Many Possible Values

Now let Y take J different possible values. Label these v_1, v_2, ..., v_J. For example, if J = 3, we could have v_1 = 10, v_2 = 20, and v_3 = 22, or v_1 could be “cat,” v_2 “dog,” and v_3 “echidna.” Like before, assume g must take one of these same values.

The three steps are the same as with J = 2, just with more values to handle.

First, write out the loss function values in a matrix, with the row m, column k entry equal to L(Y = v_k, g = v_m):

\[
\begin{pmatrix}
L(v_1, v_1) & L(v_2, v_1) & \cdots & L(v_J, v_1) \\
L(v_1, v_2) & L(v_2, v_2) & \cdots & L(v_J, v_2) \\
\vdots & \vdots & \ddots & \vdots \\
L(v_1, v_J) & L(v_2, v_J) & \cdots & L(v_J, v_J)
\end{pmatrix}
\tag{2.46}
\]

Second, compute all the mean losses:

\[
\begin{aligned}
\mathrm{E}[L(Y, v_1)] &= \mathrm{P}(Y = v_1) L(v_1, v_1) + \mathrm{P}(Y = v_2) L(v_2, v_1) + \cdots = \sum_{j=1}^{J} \mathrm{P}(Y = v_j) L(v_j, v_1) \\
\mathrm{E}[L(Y, v_2)] &= \mathrm{P}(Y = v_1) L(v_1, v_2) + \mathrm{P}(Y = v_2) L(v_2, v_2) + \cdots = \sum_{j=1}^{J} \mathrm{P}(Y = v_j) L(v_j, v_2) \\
&\;\;\vdots \\
\mathrm{E}[L(Y, v_J)] &= \mathrm{P}(Y = v_1) L(v_1, v_J) + \mathrm{P}(Y = v_2) L(v_2, v_J) + \cdots = \sum_{j=1}^{J} \mathrm{P}(Y = v_j) L(v_j, v_J)
\end{aligned}
\tag{2.47}
\]

Third, find the g that minimizes E[L(Y, g)]. That is, find the smallest of the J values computed in (2.47); the corresponding g is the optimal predictor.

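For example, with J = 3, let v_1 = −1, v_2 = 0, v_3 = 1. Let

\[
\begin{pmatrix}
L(v_1, v_1) & L(v_2, v_1) & L(v_3, v_1) \\
L(v_1, v_2) & L(v_2, v_2) & L(v_3, v_2) \\
L(v_1, v_3) & L(v_2, v_3) & L(v_3, v_3)
\end{pmatrix}
=
\begin{pmatrix}
L(-1, -1) & L(0, -1) & L(1, -1) \\
L(-1, 0) & L(0, 0) & L(1, 0) \\
L(-1, 1) & L(0, 1) & L(1, 1)
\end{pmatrix}
=
\begin{pmatrix}
0 & 2 & 8 \\
1 & 0 & 2 \\
4 & 1 & 0
\end{pmatrix}
\tag{2.48}
\]

Let

\[
\begin{aligned}
\mathrm{P}(Y = v_1) &= \mathrm{P}(Y = -1) = 0.2 \\
\mathrm{P}(Y = v_2) &= \mathrm{P}(Y = 0) = 0.3 \\
\mathrm{P}(Y = v_3) &= \mathrm{P}(Y = 1) = 0.5 = 1 - \mathrm{P}(Y = -1) - \mathrm{P}(Y = 0)
\end{aligned}
\tag{2.49}
\]

Then, the mean losses are

\[
\begin{aligned}
\mathrm{E}[L(Y, v_1)] &= \mathrm{P}(Y = v_1) L(v_1, v_1) + \mathrm{P}(Y = v_2) L(v_2, v_1) + \mathrm{P}(Y = v_3) L(v_3, v_1) \\
&= (0.2)(0) + (0.3)(2) + (0.5)(8) = 4.6 \\
\mathrm{E}[L(Y, v_2)] &= \mathrm{P}(Y = v_1) L(v_1, v_2) + \mathrm{P}(Y = v_2) L(v_2, v_2) + \mathrm{P}(Y = v_3) L(v_3, v_2) \\
&= (0.2)(1) + (0.3)(0) + (0.5)(2) = 1.2 \\
\mathrm{E}[L(Y, v_3)] &= \mathrm{P}(Y = v_1) L(v_1, v_3) + \mathrm{P}(Y = v_2) L(v_2, v_3) + \mathrm{P}(Y = v_3) L(v_3, v_3) \\
&= (0.2)(4) + (0.3)(1) + (0.5)(0) = 1.1
\end{aligned}
\]

The smallest of these three values is E[L(Y, v_3)] = E[L(Y, 1)] = 1.1, so the optimal predictor is g = v_3 = 1.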
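With more than two values, the same calculation is a loop over the rows of the loss matrix. The following Python sketch uses the J = 3 numbers from this example; the variable names and the list-of-rows layout (row m is the guess, column k is the realized value) are illustrative choices.

```python
from fractions import Fraction

values = [-1, 0, 1]                                          # v_1, v_2, v_3
probs = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]  # P(Y = v_k)

# Row m holds the losses for guess g = v_m; column k is the realized value v_k.
loss_matrix = [
    [0, 2, 8],
    [1, 0, 2],
    [4, 1, 0],
]

# Each mean loss is the dot product of a row with the probability vector.
mean_losses = [sum(p * loss for p, loss in zip(probs, row)) for row in loss_matrix]

# The optimal predictor is the value whose row has the smallest mean loss.
best_value = values[mean_losses.index(min(mean_losses))]

print([float(m) for m in mean_losses], "best:", best_value)
```

This is the matrix-times-vector view mentioned earlier: multiplying the loss matrix by the column vector of probabilities gives the vector of mean losses.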

2.5.3 Optimal Prediction: Specific Examples

Example: Carnival Age Game

Imagine predicting a person’s age, Y. Imagine when you were younger, you worked at a carnival in the summer, where people paid five tickets to see if you could guess their age. If you guessed correctly, they won nothing; if incorrect, they won the plush animal or fruit of their choice (which of course was still worth much less than five tickets).

Since they pay five tickets regardless of what you guess, that need not enter the loss function. The fact that it only matters if you guess correctly or not implies 0–1 loss, L_0(y, g) = 1{y ≠ g}. (For added challenge, come back to this example and see what changes if they only win when you are more than three years off.)

For simplicity, let P(Y = 20) = 0.6 and P(Y = 25) = 0.4. The procedure in Section 2.5.2 can be used. Mean losses with 0–1 loss are

\[
\begin{aligned}
\mathrm{E}(L_0(Y, 20)) &= \mathrm{E}(\mathbb{1}\{Y \ne 20\}) = (0.6)(0) + (0.4)(1) = 0.4 \\
\mathrm{E}(L_0(Y, 25)) &= \mathrm{E}(\mathbb{1}\{Y \ne 25\}) = (0.6)(1) + (0.4)(0) = 0.6
\end{aligned}
\tag{2.50}
\]

so g*_0 = 20 is the optimal prediction. (See also (2.56) for a different path to the same conclusion.)

Quadratic loss could lead to the wrong guess, depending how blindly we apply it. Comparing only g = 20 to g = 25,

\[
\begin{aligned}
\mathrm{E}[L_2(Y, 20)] &= \mathrm{E}[(Y - 20)^2] \\
&= \mathrm{P}(Y = 20)(20 - 20)^2 + \mathrm{P}(Y = 25)(25 - 20)^2 \\
&= (0.6)(0) + (0.4)5^2 = (0.4)(25) = 10 \\
\mathrm{E}[L_2(Y, 25)] &= \mathrm{E}[(Y - 25)^2] \\
&= \mathrm{P}(Y = 20)(20 - 25)^2 + \mathrm{P}(Y = 25)(25 - 25)^2 \\
&= (0.6)(-5)^2 + (0.4)(0) = (0.6)(25) = 15
\end{aligned}
\]

Like with 0–1 loss, it is better to guess the more likely value 20 than the less likely 25. However, it is even better to guess something in between:

\[
\begin{aligned}
\mathrm{E}[L_2(Y, 22)] &= \mathrm{E}[(Y - 22)^2] \\
&= \mathrm{P}(Y = 20)(20 - 22)^2 + \mathrm{P}(Y = 25)(25 - 22)^2 \\
&= (0.6)(-2)^2 + (0.4)(3)^2 = (0.6)(4) + (0.4)(9) = 6
\end{aligned}
\]

Some calculus shows g = 22 is actually optimal for L2 loss. However, according to the rules of the carnival game, if we guess g = 22 when everyone has either Y = 20 or Y = 25, then we’ll lose every single time. This is the worst possible guess. This is one example where quadratic loss is not appropriate.
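The contrast between the two loss functions in the carnival game can be checked by brute force. The Python sketch below evaluates mean 0–1 loss and mean quadratic loss for every integer guess from 20 to 25; restricting to integer guesses, and all names used, are illustrative assumptions.

```python
from fractions import Fraction

prob = {20: Fraction(3, 5), 25: Fraction(2, 5)}  # P(Y = 20) = 0.6, P(Y = 25) = 0.4

def zero_one_loss(y, g):
    return int(y != g)

def quadratic_loss(y, g):
    return (y - g) ** 2

def mean_loss(loss, g):
    """E[L(Y, g)] under the carnival age distribution."""
    return sum(p * loss(y, g) for y, p in prob.items())

guesses = range(20, 26)  # integer guesses only, an assumption for this sketch
mean_zero_one = {g: mean_loss(zero_one_loss, g) for g in guesses}
mean_quadratic = {g: mean_loss(quadratic_loss, g) for g in guesses}

best_zero_one = min(mean_zero_one, key=mean_zero_one.get)
best_quadratic = min(mean_quadratic, key=mean_quadratic.get)

print("0-1 best:", best_zero_one, "| quadratic best:", best_quadratic)
```

In this sketch the quadratic-loss winner, g = 22, has mean 0–1 loss equal to 1: it is wrong every time, as noted above.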

Example: Advertising

Later in life, well past your carnival days, you work in advertising. Coincidentally, your job is still to guess a person’s age, but with different consequences. If you guess somebody is 40 years old, then your client’s website shows an ad specifically designed for 40-year-olds. The loss function should capture how much worse it is to show an ad targeting the guessed age than to show the optimal ad for the individual’s true age.

Unlike at the carnival, some incorrect guesses are much worse than others. For example, it doesn’t matter much if you guess a person is 40 years old but really they’re 41. The optimal ad for the 40-year-old is almost equally effective on the 41-year-old, so there is very little loss from guessing 40 instead of 41. However, guessing that the 41-year-old is 20 years old is much worse than guessing 40. Similarly, guessing that the 41-year-old is 60 is bad, but still better than guessing 80. Consequently, 0–1 loss is inappropriate because it treats guesses of 20, 40, 60, and 80 as equally bad when y = 41.

In this case, quadratic loss seems more appropriate than 0–1 loss, if not perfect.

Imagine the age distribution of Y is the same as in the carnival game. We saw that quadratic and 0–1 loss lead to different “optimal” predictions. That result depends only on the mathematical distribution of Y, not on the real-world interpretation (carnival, advertising). So, the predictions are still different, but now we may prefer g = 22 over g = 20. That is, we might “lose” a lot by showing 25-year-olds an ad targeting 20-year-olds, but maybe both 20-year-olds and 25-year-olds respond to the ad targeting 22-year-olds.

Example: More Ages

Consider more complex versions of the carnival and advertising examples, with the following distribution of Y. Now any value Y ∈ {20, 21, 22, 23, 24, 25} is possible. Imagine young people are more likely; specifically,

\[
\mathrm{P}(Y = j) = (26 - j)/21, \quad j = 20, 21, \ldots, 25 \tag{2.51}
\]

With 0–1 loss (for the carnival), mean loss when predicting g can be computed as follows. From (2.38), the loss function is L_0(y, g) = 1{y ≠ g}, which equals 0 if y = g but equals 1 otherwise. This can be rewritten as 1 − 1{y = g}. Also, 1{y = g}P(Y = y) equals P(Y = g) when y = g, but otherwise equals 0 when y ≠ g. Putting these pieces together, mean loss is

\[
\begin{aligned}
\mathrm{E}(L_0(Y, g)) &= \sum_{y=20}^{25} \mathbb{1}\{y \ne g\} \mathrm{P}(Y = y) = \sum_{y=20}^{25} [1 - \mathbb{1}\{y = g\}] \mathrm{P}(Y = y) \\
&= \overbrace{\sum_{y=20}^{25} \mathrm{P}(Y = y)}^{=1} - \sum_{y=20}^{25} \mathbb{1}\{y = g\} \mathrm{P}(Y = y) = 1 - \mathrm{P}(Y = g)
\end{aligned}
\]

Thus, the smallest possible mean loss is achieved by the largest possible P(Y = g):

\[
\arg\min_g \mathrm{E}(L_0(Y, g)) = \arg\min_g [1 - \mathrm{P}(Y = g)] = \arg\max_g \mathrm{P}(Y = g) = 20 \tag{2.52}
\]

so the best prediction is g*_0 = 20. (As defined in the Notation section before Chapter 1, arg min_g f(g) means “the value of g that minimizes f(g),” and arg max_g f(g) means “the value of g that maximizes f(g).”)

This result is intuitive because 0–1 loss only cares whether a prediction is right or wrong. Guessing g*_0 = 20 gives you a 6/21 probability of being correct, the largest possible. Equivalently, it is the smallest possible probability of being wrong. More generally, g*_0 is always the single most likely value (the mode); see (2.56).

Like before, quadratic loss yields a different optimal prediction. If we guessed g = 20, then

\[
\mathrm{E}[L_2(Y, 20)] = \sum_{y=20}^{25} \mathrm{P}(Y = y)(y - 20)^2 = \sum_{y=20}^{25} [(26 - y)/21](y - 20)^2 = 5 \tag{2.53}
\]

But we can do better by guessing a value more toward the “middle” of the distribution. Although g = 20 is exactly correct sometimes, it’s bad when Y = 25. If we try g = 21,

\[
\mathrm{E}[L_2(Y, 21)] = \sum_{y=20}^{25} \mathrm{P}(Y = y)(y - 21)^2 = \sum_{y=20}^{25} [(26 - y)/21](y - 21)^2 = 8/3 \approx 2.67 \tag{2.54}
\]

Since 2.67 < 5, guessing g = 21 is better than g = 20, according to mean quadratic loss. (Can you do even better than g = 21?)
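The two sums above are tedious by hand but short in code. This Python sketch computes mean quadratic loss for the age distribution with probabilities (26 − j)/21; the function name `mean_quadratic_loss` is illustrative, and exact fractions are used so the results match the text exactly.

```python
from fractions import Fraction

# P(Y = y) = (26 - y)/21 for y = 20, ..., 25
prob = {y: Fraction(26 - y, 21) for y in range(20, 26)}

def mean_quadratic_loss(g):
    """E[(Y - g)^2] under the age distribution above."""
    return sum(p * (y - g) ** 2 for y, p in prob.items())

loss_20 = mean_quadratic_loss(20)
loss_21 = mean_quadratic_loss(21)

print(loss_20, loss_21, float(loss_21))
```

Calling `mean_quadratic_loss` with other values of g is one way to explore the question of whether an even better guess exists.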

Discussion Question 2.3 (banana loss function). Imagine you run a small banana shop. You buy bananas wholesale for 2 cents each ($0.02) and sell each for 40 cents ($0.40). The wholesaler delivers every Monday. Any bananas not sold by the next Monday spoil; you cannot sell them (they just go in the compost). Let y be the actual number of bananas that customers want to buy in some week. Let g be your guess, i.e., how many you bought wholesale on Monday.

a) Why isn’t 0–1 loss appropriate?
b) Why isn’t quadratic loss appropriate?
c) What might the loss function look like, if you only care about maximizing profit? Try to be as specific and mathematical as you can. In particular, consider the different consequences of over-buying (g > y) versus under-buying (g < y).

2.5.4 Mean and Mode as Optimal Predictions

Under quadratic loss, the mean is the optimal predictor that minimizes mean loss. Although the details are beyond our scope, calculus can be used to take the derivative of mean loss with respect to g and set it to zero (the “first-order condition”), yielding

\[
g_2^* \equiv \arg\min_g \mathrm{E}[(Y - g)^2] = \mathrm{E}(Y) \tag{2.55}
\]

This says the mean of a distribution has two interpretations. For description, the mean helps summarize the “center” of the distribution. For prediction, the mean is the prediction of an unknown value of Y that minimizes mean quadratic loss.
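For readers who want to see the calculus, here is a brief sketch of the first-order condition. It is a supplementary derivation, not part of the original text, and it assumes that the mean of Y squared is finite so that all terms are well-defined.

```latex
\begin{aligned}
\mathrm{E}[(Y-g)^2] &= \mathrm{E}(Y^2) - 2g\,\mathrm{E}(Y) + g^2, \\
\frac{d}{dg}\,\mathrm{E}[(Y-g)^2] &= -2\,\mathrm{E}(Y) + 2g = 0 \iff g = \mathrm{E}(Y), \\
\frac{d^2}{dg^2}\,\mathrm{E}[(Y-g)^2] &= 2 > 0 .
\end{aligned}
```

The first line expands the square and uses linearity of the mean (g is non-random). The positive second derivative confirms that the mean of Y gives a minimum rather than a maximum.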


Under 0–1 loss, the optimal prediction is

\[
\begin{aligned}
g_0^* &\equiv \arg\min_g \mathrm{E}[\mathbb{1}\{Y \ne g\}] = \arg\min_g \mathrm{P}(Y \ne g) \\
&= \arg\min_g [1 - \mathrm{P}(Y = g)] = \arg\min_g -\mathrm{P}(Y = g) \\
&= \arg\max_g \mathrm{P}(Y = g)
\end{aligned}
\tag{2.56}
\]

That is, the mode (the single most likely value of Y) is the optimal predictor. (This formula does not make sense if Y is continuous, because then P(Y = g) = 0 for any g.) Thus, like the mean, the mode has two interpretations: one for description, and one for prediction.
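As a numerical check that the mode minimizes mean 0–1 loss, the Python sketch below reuses the “more ages” distribution with probabilities (26 − j)/21 and also computes the mean for comparison. The variable names are illustrative, not from the text.

```python
from fractions import Fraction

# P(Y = y) = (26 - y)/21 for y = 20, ..., 25
prob = {y: Fraction(26 - y, 21) for y in range(20, 26)}

# Mode: the single most likely value.
mode = max(prob, key=prob.get)

# Mean 0-1 loss for each guess g, i.e., P(Y != g).
mean_zero_one = {g: sum(p * int(y != g) for y, p in prob.items()) for g in prob}
best_zero_one = min(mean_zero_one, key=mean_zero_one.get)

# Mean of Y, the optimal predictor under quadratic loss.
mean = sum(y * p for y, p in prob.items())

print("mode:", mode, "| 0-1 optimum:", best_zero_one, "| mean:", float(mean))
```

In this distribution the mode is 20 while the mean is 65/3 (about 21.67), so the two loss functions recommend different predictions.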

Discussion Question 2.4 (optimal banana prediction). Consider the same setup as in DQ 2.3, and again assume you want to maximize (mean) profit. Imagine you know the distribution of Y (banana quantity demanded in one week).

a) Do you think the mean E(Y) is a good “predicted” number of bananas to buy wholesale? Explain why or why not; if not, also explain why you think E(Y) is too high or too low.

b) What if the retail price were $99 per banana and the wholesale cost is still $0.02 per banana: would E(Y) be good, or too high, or too low, and why?

c) What if the retail price were equal to the wholesale price?

2.5.5 Interval Prediction

Only point prediction has been discussed so far, i.e., the single number that provides the best guess of the unknown value. Alternatively, interval prediction lets the guess be a range of numbers, called a prediction interval.

The disadvantage of point predictions is that they are usually wrong for discrete variables, and always wrong for continuous variables. For example, if Y ∼ N(0, 1), then the best point prediction under quadratic loss is E(Y) = 0. But this guess will be wrong 100% of the time, since P(Y = 0) = 0.

By guessing a range of numbers, a prediction interval can actually contain the true value with large probability. The length of the interval captures the level of uncertainty: with lots of uncertainty, the interval must be very long to have a high probability of containing the true value.

For example, let P(Y = j) = 1/100 for j = 1, 2, ..., 100. The mean E(Y) = 50.5 is the best prediction under quadratic loss, but it never actually happens: P(Y = 50.5) = 0. Even P(Y = 50) = 0.01, still very small. Alternatively, the prediction interval [26, 75] has P(26 ≤ Y ≤ 75) = 50/100: there is roughly a 50% probability that a randomly sampled Y value is inside the prediction interval. Or, P(6 ≤ Y ≤ 95) = 90/100, so the prediction interval [6, 95] has around 90% probability of containing a randomly drawn Y.

The interval length reflects amount of uncertainty. Above, the 90% prediction interval was [6, 95]. If instead P(Y = j) = 1/10 for j = 1, 2, ..., 10, then the much shorter interval [2, 10] has 90% probability of containing Y. The values of Y are concentrated more closely together, so the prediction interval can be smaller but still have the same 90% probability.

Even for the same Y distribution and same interval probability (like 90%), there may be multiple possible prediction intervals. In the original example, [26, 75] was a 50% prediction interval, but so is [25, 74], or [1, 50], or [51, 100]. Other properties can be used to distinguish among these intervals, but such is beyond our scope.
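The interval probabilities in this section can be verified with a small helper that sums probabilities over an interval. The Python sketch below is illustrative only; the names `coverage`, `uniform_100`, and `uniform_10` are hypothetical.

```python
from fractions import Fraction

def coverage(prob, lo, hi):
    """P(lo <= Y <= hi) for a discrete distribution given as {value: probability}."""
    return sum(p for y, p in prob.items() if lo <= y <= hi)

uniform_100 = {j: Fraction(1, 100) for j in range(1, 101)}  # P(Y = j) = 1/100
uniform_10 = {j: Fraction(1, 10) for j in range(1, 11)}     # P(Y = j) = 1/10

# Several different intervals share the same 50% probability.
for interval in [(26, 75), (25, 74), (1, 50), (51, 100)]:
    print(interval, float(coverage(uniform_100, *interval)))

# Same 90% probability, but very different lengths.
print((6, 95), float(coverage(uniform_100, 6, 95)))
print((2, 10), float(coverage(uniform_10, 2, 10)))
```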


Chapter 3

One Variable: Sample


Depends on Chapter 2



3.2 Describe and distinguish Bayesian and frequentist perspectives [TLO 4]

3.3 Identify and interpret properties of a sampling procedure or estimator [TLO 4]

3.4 Judge which estimator is better based on its properties [TLO 6]

3.5 Interpret different measures of statistical uncertainty [TLOs 6 and 7]

3.6 Assess the statistical significance and economic significance of empirical results [TLO 6]

3.7 In R (or Stata): compute estimates of a population mean, along with measures of uncertainty [TLO 7]


• Basic statistics: the Khan Academy AP Statistics unit includes instructional material and practice questions

• Quantifying uncertainty and statistical significance (Masten video)

• Estimator properties (Lambert video)

• Unbiasedness and consistency (Lambert video, 1 of 2)

• iid sampling (Lambert video)

• Bayesian vs. frequentist cookie inference example (StackExchange)

• Section 2.8 (“Exploratory Data Analysis with R”) in Kleiber and Zeileis (2008) [Chapter 2 is available free on their website]

• Section 2.2 (“Random Sampling and the Distribution of Sample Averages”) and Chapter 3 (“A Review of Statistics Using R”) in Hanck et al. (2018)

• Sections 1.5.4 (“Fundamental Statistics”) and 1.9.3 (“Simulation of Confidence Intervals and t Tests”) in Heiss (2016)

• R package boot (Canty and Ripley, 2019; Davison and Hinkley, 1997)

Sections 2.3 and 2.5 considered only the population distribution, whereas Chapter 3 considers data sampled from that distribution. The words data, dataset, sample values, and sample all refer to the same thing: the set of values that the researcher actually sees. But, as in Chapter 2, this could be seen either from the “before” perspective as random variables or from the “after” perspective as non-random realized values. Section 2.1 gave the general idea of seeing observations as random variables (the “before” view); here, specific details are provided on estimation and uncertainty.

Although long, this chapter is mostly review of material you should have seen already in an introductory statistics class.

3.1 Bayesian and Frequentist Perspectives

⇒ Kaplan video: Bayesian and Frequentist Perspectives

Two frameworks constitute econometrics and statistics: Bayesian and frequentist (or classical). These are cynically deemed “sects” by some, but outside the vocal extremes (and amusing webcomics: xkcd.com/1132), most econometricians appreciate and respect both frameworks (and the people who use them), sometimes working with both in turn.

This text uses the frequentist framework. Why? Mostly, that’s just how I wrote it. I’ll spare you post hoc rationalization.

There is little disagreement about the population and what we want to learn. Generally, both Bayesian and frequentist perspectives agree on everything in Chapter 2 about the population and how data are generated.

The disagreements are about how to use the sampled data to learn about the population. Frequentist and Bayesian approaches have different advantages appropriate for different settings.

The goal of the remainder of Section 3.1 is to give you a very basic overview and comparison of Bayesian and frequentist approaches. At minimum, I hope you get a sense of their different ways of quantifying uncertainty and the different types of questions they can (and cannot) answer.

3.1.1 Very Brief Overview: Bayesian Approach

The Bayesian approach models your beliefs about an unknown population value θ, like the mean θ = E(Y). Your prior (or prior belief) is what you believe about θ before seeing the data. Your posterior (or posterior belief) is what you believe about θ after seeing the data. The Bayesian approach describes how to update your prior using the observed data to get your posterior.

Mathematically, “belief” is a probability distribution. For example, let random variable B represent your belief about the population mean. If you think there’s a 50% chance the mean is negative, then P(B < 0) = 50%. If you think there’s a 1/4 probability that B is below −1, then P(B < −1) = 1/4. There are formal procedures for “prior elicitation,” i.e., quantifying your beliefs as a distribution.

Notationally, though it is confusing, usually the belief is represented by the (usually Greek) letter for the parameter, like θ, rather than a separate variable (like B above). This does not mean “there is no population mean”; it is purely a notational convention for representing beliefs.

For a concrete example, imagine you find an archaeological site in Missouri with many artifacts, but you are unsure of which people group had lived in that site. Based on its location, it was either the Missouria, Illini, or Osage tribe, which can be represented as θ = M, θ = I, or θ = O, respectively. That is, θ is the unknown parameter of interest, with possible values M, I, or O. Before looking at the artifacts (the data), you believed there was equal chance of each tribe, i.e., your prior belief was P(θ = j) = 1/3 for each j ∈ {M, I, O}. After looking at the artifacts (the data) more closely, they look most similar to Missouria artifacts, but you are unsure. Quantitatively, you believe there’s a 50% chance they were Missouria, 40% chance Osage, and 10% chance Illini. This is your posterior distribution, i.e., your beliefs about θ after seeing the data. Mathematically, your posterior is P(θ = M) = 0.5, P(θ = O) = 0.4, P(θ = I) = 0.1.

The posterior distribution is the Bayesian way of quantifying uncertainty. It is arguably more intuitive: it is more similar to how people talk about uncertainty in daily life. The posterior distribution is often summarized by a credible interval, i.e., a range of values that you’re pretty sure contains the true θ, like P(a ≤ θ ≤ b) = 90%. Or, in the above example with categorical θ, the credible set {M, O} has 90% posterior belief: you’d say, “I’m pretty sure it’s Missouria or Osage, although I think there’s a 10% chance I’m wrong.”
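To connect the prior and posterior in the archaeological example, here is a minimal sketch of one way the update could work, using Bayes’ rule (posterior proportional to prior times likelihood). The likelihood values, meaning the probability of seeing such artifacts given each tribe, are hypothetical numbers chosen only so that the result matches the posterior stated above; they do not come from any real data.

```r
# Equal prior beliefs over the three tribes
prior <- c(M = 1/3, I = 1/3, O = 1/3)

# HYPOTHETICAL likelihoods P(artifacts | tribe), invented for illustration
likelihood <- c(M = 0.05, I = 0.01, O = 0.04)

# Bayes' rule: posterior is proportional to prior times likelihood
unnormalized <- prior * likelihood
posterior <- unnormalized / sum(unnormalized)

# Posterior belief in the credible set {M, O}
credible_set_prob <- sum(posterior[c("M", "O")])

print(posterior)
print(credible_set_prob)
```

With equal priors, the posterior is just the likelihoods rescaled to sum to one, which is why the Missouria hypothesis ends up most believable here.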

3.1.2 Very Brief Overview: Frequentist Approach

Other sections in Chapter 3 flesh out details, but the core of the frequentist approach is the “before” perspective, which can also be described in terms of repeated sampling. Instead of the belief probabilities of a Bayesian posterior, frequentist probabilities are from the “before” view of what dataset (and thus value of estimator and such) could be randomly sampled. Equivalently, as a thought experiment, we can imagine many different random samples drawn from the same population; the “before” probabilities are then how often certain values occur in these many random datasets.

For intuition, imagine you could randomly sample 100 datasets from the same population. Then, the frequentist probability of an event says approximately how many times that event occurs among the 100 samples. (To replace “approximately” with “exactly,” replace 100 with ∞.) For example, we could compute the sample mean Ȳ in all 100 samples; since the datasets are all different, the sample means Ȳ are also all different. If Ȳ ≤ 0 in 50 of the 100 hypothetical samples, then P(Ȳn ≤ 0) ≈ 50/100 = 50%. Or, if Ȳ is in the interval [−0.4, 0.4] in 70 of 100 samples, then P(−0.4 ≤ Ȳ ≤ 0.4) = P(Ȳ ∈ [−0.4, 0.4]) ≈ 70%. A similar example is in Table 3.1.
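The repeated-sampling thought experiment can be mimicked on a computer. The sketch below assumes a standard normal population and n = 7 observations per dataset; both choices are mine, picked only so the frequencies land in the neighborhood of the 50% and 70% used above.

```r
set.seed(1)
n_datasets <- 100   # number of hypothetical datasets
n <- 7              # ASSUMPTION: observations per dataset

# ASSUMPTION: the population is N(0, 1); compute the sample mean in each dataset
ybar <- replicate(n_datasets, mean(rnorm(n)))

# Share of the 100 datasets in which each event occurs
freq_nonpos <- mean(ybar <= 0)
freq_interval <- mean(ybar >= -0.4 & ybar <= 0.4)

print(c(freq_nonpos = freq_nonpos, freq_interval = freq_interval))
```

Raising n_datasets makes these shares settle closer to the underlying probabilities, which is the sense in which 100 is replaced by ∞.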

3.1.3 Bayesian and Frequentist Differences

The following makes explicit some of the differences between the Bayesian and frequentist approaches described above.

First, the frameworks treat different objects as random or non-random. The frequentist framework treats the population mean and other population features as non-random values, whereas it treats the data as random. For example, the population mean µ = E(Y) is a non-random value, whereas an observation Y is a random variable. In contrast, the Bayesian framework treats population features as random (to reflect your beliefs), whereas it treats the data as non-random values (the “after” view).

Second, due to this different treatment, the frameworks answer different types of questions, especially when quantifying uncertainty. For example, the Bayesian framework is designed to answer questions like, “Given the observed data, what do I believe is the probability that the population mean is above 12?” Mathematically, if y is the “observed data,” this is usually written P(µ > 12 | y), noting the confusing notation where µ represents beliefs. This question makes no sense from the frequentist perspective: either µ > 12 or not; it cannot be “maybe” with some probability. In contrast, the frequentist framework answers questions like, “Given the value of µ, what’s the probability that the sample mean is above 12?” Mathematically, this is usually written P(Ȳ > 12), or Pµ(Ȳ > 12) to be explicit about the dependence on µ. The sample mean Ȳ is a function of data, so it is treated as a random variable. This question makes no sense from the Bayesian perspective: we can see the data, so we can see either Ȳ > 12 or not; it cannot be “maybe” with some probability.

Interestingly, both frameworks can answer questions like P(Ȳ < µ), but with different interpretations. The Bayesian answer interprets Ȳ as a number (that we see in the data) and µ as a random variable representing our beliefs. The frequentist answer interprets Ȳ as the random variable (from the “before” view) and µ as the non-random population value.

Third, frequentist methods use only the data, whereas Bayesian methods can formally incorporate additional knowledge. In practice, though, even frequentist results should be interpreted in light of other knowledge. The difference is that this process is not formally within the frequentist methodology itself. Unfortunately, many people do not combine frequentist results with other knowledge, instead interpreting frequentist results as if one single dataset contains the full, absolute truth of the universe; please do not do this.

In Sum: Bayesian & Frequentist
Frequentist: “before” view of data (random variables); assess methods’ performance across repeated random samples from same population
Bayesian: “after” view of data (non-random); model beliefs (about population features) as random variables

3.2 Types of Sampling

⇒ Kaplan video: Types of Sampling

In practice, judging which econometric method is most appropriate requires understanding different types of sampling procedures and sampling properties. Such judgment is mostly left to another textbook, but this section hopes to help your understanding.

Notationally, we observe the values from n units, which could be individuals, firms, countries, etc. Let i = 1 refer to the first unit, i = 2 to the second, etc., up to i = n, where n is the sample size. The corresponding values are Y1, Y2, ..., Yn, with Yi more generally denoting the observation for unit i. A particular dataset may have specific values like Y1 = 5, Y2 = 8, etc., but to analyze statistical properties, each Yi is seen as a random variable, as in Section 2.1.

In this section, two important sampling properties are considered: “independent” and “identically distributed.” If both hold, then the Yi are called independent and identically distributed (iid) random variables (or “sampled iid”), and “sampling is iid.” Sometimes the vague phrase random sample refers to iid sampling. This iid sampling is mathematically simplest but not always realistic. Although iid sampling is the focus here (like other introductory textbooks), weights are briefly mentioned, and Part III considers dependent (i.e., not independent) data.

Notationally, iid sampling is indicated by ∼ with “iid” written above it. If FY(·) is the population CDF,

$$Y_i \overset{iid}{\sim} F_Y, \quad i = 1, \ldots, n. \tag{3.1}$$

If there is only a population PMF fY(·) and not a CDF (like with nominal categorical variables), then FY is replaced by fY in (3.1). If the Yi follow a known distribution like N(0, 1), then FY is replaced by N(0, 1), for example.

There are other sampling properties not considered in this section, like sampling bias. This is about whether we observe a “representative sample” of the population we want to learn about (the population of interest). Sometimes sampling bias is our fault (for using the wrong dataset for our economic question), but sometimes we try to get the right data and people refuse to answer our survey, or we can’t get access to certain confidential data, etc. This is discussed more in Chapter 12 in terms of “missing data” and “sample selection.”

After introducing “independent” and “identically distributed” sampling, examples are discussed in Section 3.2.3.

3.2.1 Independent

Qualitatively, in the context of sampling, independence (or independent sampling) means that from the “before” view, any two observations are unrelated. For example, the value of Y2 is unrelated to Y1: we are not any more likely to see a high Y2 if we see a high Y1 in the sample.

Mathematically, independence means

$$Y_i \perp\!\!\!\perp Y_k \text{ for any } i \neq k, \tag{3.2}$$

where ⊥⊥ denotes statistical independence. That is, Y1 ⊥⊥ Y2, Y1 ⊥⊥ Y8, Y6 ⊥⊥ Y4, etc. For any i ≠ k, independent sampling implies

$$\mathrm{Cov}(Y_i, Y_k) = 0, \quad \mathrm{Var}(Y_i + Y_k) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_k), \quad E(Y_i \mid Y_k) = E(Y_i), \tag{3.3}$$

among other things.
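The first two implications in (3.3) can be checked approximately by simulation. This sketch assumes, purely for illustration, that both observations are drawn independently from a N(1, 4) population; with many repeated draws, the sample covariance should be near zero and the variance of the sum near the sum of the variances.

```r
set.seed(2)
reps <- 100000   # number of hypothetical (Y1, Y2) pairs

# ASSUMPTION: population is normal with mean 1 and standard deviation 2
Y1 <- rnorm(reps, mean = 1, sd = 2)
Y2 <- rnorm(reps, mean = 1, sd = 2)   # drawn separately, so independent of Y1

cov_12 <- cov(Y1, Y2)            # should be near 0
var_sum <- var(Y1 + Y2)          # should be near 4 + 4 = 8
sum_var <- var(Y1) + var(Y2)     # also near 8

print(c(cov_12 = cov_12, var_sum = var_sum, sum_var = sum_var))
```

Replacing the second line of draws with something like Y2 <- Y1 + rnorm(reps) would create dependence, and both checks would then fail visibly.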

3.2.2 Identically Distributed

The identically distributed property means that from the “before” view, the distribution of Yi is the same for any i. Qualitatively, all units are sampled from the same population. Mathematically, if FY(·) is the population CDF, Yi ∼ FY for all i = 1, ..., n. (Or, Yi ∼ fY for PMF fY(·).) Note the plain ∼ rather than the iid version: it only claims the distribution of Yi in isolation, not independence from the other observations.

Mathematically, identically distributed Yi means that for any i and k, Yi and Yk have the same distribution. Thus, any feature of their distributions is also identical. For example, E(Yi) = E(Yk) and Var(Yi) = Var(Yk).

Practice 3.1 (iid sampling). You are planning to sample two values, Y1 and Y2, but you have not sampled them yet. The following four statements correspond to the four sampling properties (or their implications): 1) independent, 2) not independent (i.e., dependent), 3) identically distributed, 4) not identically distributed. Which is which?

a) You are just as likely to get Y1 = 3 as Y2 = 3.

b) If you get a negative Y1, then you’ll probably get a negative Y2, but if you get a positive Y1, then you’ll probably get a positive Y2.

c) Separately and simultaneously, you will randomly sample Y1 and Y2 using the same exact procedure from the same population.

d) For Y1, you are going to get the salary of somebody with an economics degree, and Y2 will be the salary of somebody with an art history degree.

3.2.3 Examples

Consider the following sampling procedures and their properties. Each example has 4 observations of Mizzou students. You can imagine 4 buckets (or pieces of paper), initially empty, that will eventually contain information from 4 observations. The sampling procedure does not determine the specific numeric values that end up in the buckets, but it determines how the buckets get filled.

Random Student ID (iid)

Imagine randomly picking a Mizzou student ID number, then randomly picking a 2nd, then 3rd, then 4th. These Yi are both independent and identically distributed (iid). They are independent because each ID number is randomly drawn without any consideration of how the other numbers are drawn and without any consideration of the other observed Yi values. They are identically distributed because each ID number is drawn from the same population (anyone who has a Mizzou student ID).

Random In-State and Non-Resident Students (Stratified)

In sampling, stratification means dividing the population into subgroups called strata, such that each population member belongs to exactly one stratum. Further, stratified sampling means the overall data sample has a pre-determined number of observations from each stratum. (There are other variations whose details are beyond our scope.)

For example, each Mizzou student is classified as either a resident of Missouri (“in-state”) or not (“non-resident”). Imagine buckets 1 and 2 say “in-state” while buckets 3 and 4 say “non-resident”: observations Y1 and Y2 are from in-state students, while Y3 and Y4 are from non-resident students. This is stratified sampling: assigning buckets to different strata before sampling.

Often stratified sampling is independent but not identically distributed (inid). In our Mizzou example, the students are sampled independently, but the in-state distribution may not be identical to the non-resident distribution.

There are many details and variations, but for now the goal is simply to be aware that stratified sampling is usually not iid, so methods assuming iid sampling may not be valid.


Students in Same Class (Clustered)

Imagine randomly picking a class (like Intro Econometrics) at Mizzou, then filling the first two buckets (Y1 and Y2) with two random students from that class, and then randomly picking another class and another two students for the other buckets (Y3 and Y4). This is an example of clustered sampling, where each class is a “cluster.” (This is different from “clustering” in the sense of cluster analysis.)

Observations are identically distributed (because each Yi has the same probability of getting any particular student) but often not independent. For example, dependence may come from students in the same class being similarly affected by their shared experience. Here, buckets 1 and 2 are correlated, and 3 and 4 are correlated, but not 1 and 3, nor 2 and 4, etc.

Without getting into details, the goal for now is to recognize when there is clustered sampling that may not be iid, in which case iid-based methods may not be valid. Usually, estimates are still valid, but measures of uncertainty are not (there is really more uncertainty than reported).

Other common examples of clustered sampling include taking all individuals within randomly selected households, students within randomly selected schools or classrooms, and multiple observations over time for randomly selected “individuals.”
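To see why clustered sampling carries more uncertainty than iid-based methods report, here is a rough simulation sketch. It assumes a made-up model in which each observation equals a shared cluster effect plus individual noise (both standard normal), and compares the spread of the sample mean across many hypothetical datasets with that under iid sampling from a population with the same variance. The function names draw_clustered and draw_iid are invented for this illustration.

```r
set.seed(3)
reps <- 5000        # number of hypothetical datasets
n_clusters <- 20    # ASSUMPTION: clusters (e.g., classes) per dataset
per_cluster <- 5    # ASSUMPTION: observations per cluster

# ASSUMPTION: Y = shared cluster effect + individual noise, each N(0, 1)
draw_clustered <- function() {
  cluster_effect <- rep(rnorm(n_clusters), each = per_cluster)
  cluster_effect + rnorm(n_clusters * per_cluster)
}

# iid benchmark with the same variance for each Yi (1 + 1 = 2)
draw_iid <- function() rnorm(n_clusters * per_cluster, sd = sqrt(2))

# Standard deviation of the sample mean across repeated datasets
sd_mean_clustered <- sd(replicate(reps, mean(draw_clustered())))
sd_mean_iid <- sd(replicate(reps, mean(draw_iid())))

print(c(clustered = sd_mean_clustered, iid = sd_mean_iid))
```

In both designs the sample mean is centered on the true mean of zero, but under clustering it varies more from dataset to dataset, so an uncertainty measure computed as if the data were iid would be too small.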

Two Students, Two Semesters (Clustered)

Imagine randomly picking 2 students (iid, like with random ID numbers), then observing them this semester and next semester. This is another type of clustered sampling that usually violates independence. For example, imagine the variable is semester GPA. Bucket 1 contains the first student’s GPA this semester, bucket 2 contains the same student’s GPA next semester, and buckets 3 and 4 contain the other student’s GPAs from this semester and next semester. Buckets 1 and 2 (Y1 and Y2) are probably both high or both low, rather than one high and one low, and similarly for buckets 3 and 4 (Y3 and Y4). That is, buckets 1 and 2 are correlated, and 3 and 4 are correlated. Further, observations may not even be identically distributed if fall GPA and spring GPA do not have the same distribution.

From a different perspective, sampling is actually iid. The students themselves are sampled iid, so “observations” are iid if we see (Y1, Y2) as a single observation and see (Y3, Y4) as a second observation. That is, (Y1, Y2) is randomly sampled from the same population as (Y3, Y4).

One Student, Four Semesters (Time Series)

Similar to above, if you randomly pick one student but then observe the same student over four consecutive semesters, there is probably dependence, and possibly not identical distributions (e.g., if GPA tends to increase over time). This is time series data; see Part III.


Practice 3.2 (rural household sampling). You want to learn about household consumption in rural Indonesia. In an area with 100 villages, you either i) pick 5 villages at random, then survey every household in each of the 5 villages; or ii) make a list of all households in all 100 villages, then randomly pick 5 of them. Explain why each approach is or isn’t iid.

3.3 The Empirical Distribution

⇒ Kaplan video: The Empirical Distribution

The empirical distribution is a probability distribution that reflects the sample data. It can be confusing at first, but it unifies many approaches in this class and beyond, helping them seem less ad hoc and mysterious. Qualitatively, the empirical distribution treats the sample as if it were the population.

Mathematically, first consider a binary variable. The population is represented by binary random variable Y with some P(Y = 1) = p. The sample of size n can be represented by binary random variable S with

$$P(S = 1) = \hat{p} = \frac{\text{how many } Y_i = 1}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Y_i = 1\}, \tag{3.4}$$

the sample proportion of observations with Yi = 1. The distribution of S is the empirical distribution.

The plug-in principle or analogy principle suggests we compute whatever features of S we want to learn about Y. For example, if we want to learn E(Y), then compute E(S). With enough data, S is usually very similar to Y, so features of S should usually be very similar to those of Y.

Mathematically, consider now a categorical or discrete variable. The population is represented by random variable Y with PMF fY(·). Imagine there are J categories, with values (v1, ..., vJ), so Y is fully described by fY(vj) for each j = 1, ..., J. The sample is represented by random variable S with

$$f_S(v_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Y_i = v_j\}, \quad j = 1, \ldots, J. \tag{3.5}$$

That is, fS(vj) is the sample proportion of observations with Yi = vj.

Mathematically, consider finally a continuous variable. The population is represented by random variable Y with continuous CDF FY(·). However, even with an infinite number of possible values for Y in the population, there are only n possible values of Yi observed. With a continuous random variable, each observed Yi value is unique, so there are exactly n different observed Yi values. The sample is thus represented by random variable S with PMF

$$f_S(Y_i) = 1/n, \quad i = 1, \ldots, n. \tag{3.6}$$


Even though Y is continuous, S is discrete.

Notationally, instead of fS(·), more common is f̂Y(·) if Y has a PMF. If Y has a CDF, then more often the empirical CDF (ECDF) is used. It is simply the CDF of S. The ECDF (or just EDF) is usually written F̂Y(·) or just F̂(·), and defined as

$$\hat{F}_Y(y) = F_S(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Y_i \le y\}, \tag{3.7}$$

i.e., the sample proportion of observations less than or equal to y, the point of evaluation.
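The ECDF is easy to compute in R, either by hand as a sample proportion or with the built-in ecdf() function. The five data values below are made up for illustration.

```r
Y <- c(2.3, 0.7, 1.5, 3.1, 0.2)   # made-up sample with n = 5
y <- 1.5                          # point of evaluation

# By hand: sample proportion of observations with Yi <= y
by_hand <- mean(Y <= y)

# Built-in: ecdf() returns a function that can be evaluated at any point
F_hat <- ecdf(Y)
via_ecdf <- F_hat(y)

# Mean of the empirical distribution: each observed value has probability 1/n
mean_of_S <- sum(Y * (1 / length(Y)))

print(c(by_hand = by_hand, via_ecdf = via_ecdf, mean_of_S = mean_of_S))
```

Both approaches give the same value because the ECDF is just the CDF of the empirical distribution, and the last line shows that the mean of that distribution coincides with mean(Y).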

Notationally, a hat (circumflex) often denotes a sample analog, i.e., a feature of S analogous to a population feature of Y. Above in (3.7), for the population FY(y), the sample analog is F̂Y(y) = FS(y). As another example, for the population P(Y = y), the sample analog is P̂(Y = y) = P(S = y). For the population E(Y), the sample analog is Ê(Y) = E(S). The “hat” may indicate another value computed from the sample data (i.e., a statistic), usually an estimator, even if it is not a sample analog.

3.4 Estimation of the Population Mean

Sections 2.3 and 2.5 helped us think about which features of the population are useful for description and prediction. Such a population feature is called the estimand or object of interest. In practice, it must be estimated using data.

This section specifically considers estimating the population mean. This is most directly useful in later chapters. For other population features, the same concepts apply, though details differ.

Recall from Section 2.3.1 that the mean is directly useful for description and prediction with discrete and continuous variables, and it can have a useful probability interpretation for binary and even categorical variables. For binary Y, E(Y) = P(Y = 1). Further, categorical variables can be turned into binary variables like Z = 1{Y = v2}. For example, if Y is a 2-digit NAICS industry code whose possible values are “utilities,” “construction,” or “information,” then defining Z = 1{Y = information} (Z = 1 for “information,” Z = 0 otherwise) implies E(Z) = P(Y = information).

The focus of this section is point estimation, as opposed to interval estimation. A point estimate is a single number representing our best guess of the unknown population value. In contrast, an interval estimate is a range of numbers, most commonly a confidence interval; see Section 3.8.

3.4.1 “Description”: Sample Mean

As alluded to in Section 3.3, at least with identically distributed sampling, the population mean can be estimated by its sample analog, the mean of the empirical distribution. This is called the sample mean. It is also called the sample average because it averages the sample Yi values. The sample average is usually denoted Ȳ (or Ȳn). Mathematically, for continuous or discrete (including binary) Y, using the notation of Section 3.3,

$$\bar{Y} = \hat{E}(Y) = E(S) = \frac{1}{n} \sum_{i=1}^{n} Y_i. \tag{3.8}$$

These expressions are equivalent, just emphasizing different interpretations.

3.4.2 “Prediction”: Least Squares

Section 2.5.4 showed that the population mean E(Y) also solves an optimal population prediction problem. Specifically, (2.55) shows

$$E(Y) = g^*_2 \equiv \arg\min_g E[(Y - g)^2].$$

From Section 3.3, the analogy principle suggests estimating E(Y) by solving the same optimal prediction problem for the empirical distribution, i.e., replacing Y with S. Mathematically, let

$$\hat{g}^*_2 \equiv \arg\min_g E[(S - g)^2] = \arg\min_g \frac{1}{n} \sum_{i=1}^{n} (Y_i - g)^2. \tag{3.9}$$

The hope is that the sample analog ĝ*2 is close to the population g*2, which in turn equals E(Y).

Skipping the derivation, it turns out

$$\hat{g}^*_2 = \bar{Y}. \tag{3.10}$$

The prediction-motivated estimator equals the description-motivated estimator. This makes sense because in the population represented by Y, the mean equals the optimal prediction (under quadratic loss), and S is simply another random variable like Y.
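For readers curious about the skipped step: setting the derivative of the sum of squared differences with respect to g, which is −2 times the sum of (Yi − g), equal to zero gives g = Ȳ. The same result can be checked numerically. The sketch below uses made-up data and R’s general-purpose optimize() function to minimize the criterion; the function name ssr is my own label.

```r
set.seed(4)
Y <- rnorm(50, mean = 3, sd = 2)   # ASSUMPTION: made-up data for illustration

# Sum of squared differences between the data and a candidate guess g
ssr <- function(g) sum((Y - g)^2)

# Numerically search for the minimizing g between the smallest and largest Yi
fit <- optimize(ssr, interval = range(Y))
g_hat <- fit$minimum

print(c(numerical_minimizer = g_hat, sample_mean = mean(Y)))
```

The two printed numbers agree up to the numerical tolerance of the search, which illustrates (but of course does not prove) the result.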

Rewriting (3.9) allows the introduction of some terms and concepts used in later chapters. In (3.9), the 1/n has no effect on the minimization problem because it is unaffected by g. Consequently, it is equivalent to write

$$\hat{g}^*_2 = \arg\min_g \sum_{i=1}^{n} (Y_i - g)^2. \tag{3.11}$$

To dissect the right-hand side of (3.11), imagine any estimate ĝ. Since ĝ can be seen as trying to predict Y, sometimes ĝ is called the predicted value of Yi, which in this simple setting is the same for all i. However, the observed value of Yi is used to compute ĝ, so it seems misleading to say Yi was “predicted”; usually we assume the true value is not known when we discuss prediction. Instead, calling ĝ the fitted value is more appropriate. Either way, the difference Ûi = Yi − ĝ is called the residual for observation i, i.e., the difference between the observed value Yi and the fitted value ĝ. The squared residuals are then Ûi² = (Yi − ĝ)². The sum of squared residuals (SSR) is then

$$\sum_{i=1}^{n} \hat{U}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{g})^2. \tag{3.12}$$

Consequently, (3.10)–(3.12) together say that Ȳ minimizes the SSR. For this reason, Ȳ is a least squares estimator, “least” referring to minimization and “squares” referring to the second S in SSR.

3.4.3 Non-iid Sampling: Weights

If your dataset has weights, then you should use them. Using weights adjusts the sample to be more representative of the population. Conversely, ignoring weights often produces misleading results because the sample is not representative.

There are multiple types of weights, although the distinction is beyond our scope. Generally, any type of weight is treated the same for estimation but different for “inference” (confidence intervals, etc.). Types like survey weights (also called sampling weights) indicate non-iid sampling, whereas other types like frequency weights may simply allow more compact storage of iid sampled data.

Example

Skipping the theory, an example is shown in the following code. There are two subpopulations (two types of individuals) in the population, one with mean 0, one with mean 1. Each subpopulation forms half the overall population, so the overall population mean is (1/2)(0) + (1/2)(1) = 1/2. However, the second subpopulation forms much more than half of the sample because it is oversampled: each observation has a 2/3 probability (instead of 1/2) of coming from the second subpopulation. Thus, it’s like sampling from a different population whose mean is (1/3)(0) + (2/3)(1) = 2/3. Without weighting, the unweighted sample mean estimates this 2/3 value, not 1/2.

Sampling weights can be used to adjust the sample to represent the population. Specifically, the weights are the inverse of the sampling probabilities: 1/(2/3) = 3/2 = 1.5 for observations from the second subpopulation, and 1/(1/3) = 3 for individuals from the first subpopulation. This counteracts the fact that there are more individuals from the second subpopulation by weighting them less. Alternatively, instead of the inverse sampling probabilities, the inverse sample proportions of each type could be used. Another option is to use function svymean() in the R package survey (Lumley, 2004, 2019).

set.seed(112358)
n <- 567
itype <- sample(x=0:1, size=n, replace=TRUE, prob=1:2/3)
Y <- itype + rnorm(n)
mean(Y) # without weights: near 2/3=0.67, not representative

## [1] 0.708

# with weights: should be closer to true 0.50
weighted.mean(x=Y, w=1/((itype+1)/3))

## [1] 0.551

weighted.mean(x=Y, w=ifelse(itype, n/sum(itype), n/(n-sum(itype))))

## [1] 0.54

3.5 Sampling Distribution of an Estimator

⇒ Kaplan video: Sampling Distribution of an Estimator

There are two goals of this section. The primary goal is to understand what it means for an estimator to have a probability distribution. The secondary goal is to observe some patterns that are formalized in Section 3.6.

At a high level, the sampling distribution is the probability distribution of an estimator, treated as a random variable in the “before” view. Equivalently, from the repeated sampling perspective, the sampling distribution imagines computing the estimator in a large number of randomly sampled datasets from the same population, and seeing which values occur with what probability.

To develop intuition, the remainder of this section contains more specific examples using the sample mean. The examples include mathematical calculations, graphs, and tables. The details are not themselves important, but rather how they manifest the deeper concepts; i.e., there is no value in memorizing the examples, but rather using them to assess and develop your understanding of the fundamental ideas.

3.5.1 Some Mathematical Calculations

To be concrete, consider the sample mean as an estimator of the population mean, with iid sampling. Here, the n subscript is added to Ȳn because the sampling distribution depends on n. For example, the sampling distribution of Ȳ1 = Y1 differs from that of Ȳ2 = (Y1 + Y2)/2.

From the “before” view, the sample mean Ȳn is a random variable. The Yi are all random variables, so their average is also a random variable. That is, the Yi have multiple possible values, so the sample mean also has multiple possible values.

To develop intuition, consider some simple examples. The simplest example is n = 1, so Ȳ1 = Y1. The distribution of Ȳ1 is the same as the population distribution that Y1 follows.

Simpler still, imagine the population is binary with mean p. Then, Y1 ∼ Bernoulli(p), meaning P(Y1 = 1) = p and P(Y1 = 0) = 1 − p. Since Ȳ1 = Y1, P(Ȳ1 = 1) = p and P(Ȳ1 = 0) = 1 − p. If you imagine 100 randomly sampled datasets (i.e., 100 randomly sampled values of Y1), then approximately 100p datasets would have Ȳ1 = 1, while approximately 100(1 − p) would have Ȳ1 = 0. For example, if p = 0.7, then around 70 datasets would have Ȳ1 = 1, and around 30 datasets would have Ȳ1 = 0. With infinite datasets, the words “approximately” and “around” are no longer needed, but at least for me it is easier to develop intuition by imagining 100 datasets than ∞.

Consider n = 2 with binary Y. Despite the simplicity, it takes some work to derive the sampling distribution of Ȳ2. Let

$$Y_i \overset{iid}{\sim} \mathrm{Bernoulli}(p) \implies Y_1 \perp\!\!\!\perp Y_2, \quad P(Y_1 = 1) = P(Y_2 = 1) = p. \tag{3.13}$$

There are four possible values of (Y1, Y2): (0, 0), (0, 1), (1, 0), (1, 1). This makes three possible values of Ȳn = (Y1 + Y2)/2: 0, 1/2, or 1. The corresponding probabilities can be calculated using (3.13), since Y1 ⊥⊥ Y2 implies P((Y1, Y2) = (a, b)) = P(Y1 = a) P(Y2 = b). (That is, due to independence, the probability that both Y1 = a and Y2 = b equals the product of the individual probabilities.) Thus,

$$\begin{aligned}
P(\bar{Y}_n = 0) &= \overbrace{P(Y_1 = 0 \text{ and } Y_2 = 0)}^{\text{use } Y_1 \perp\!\!\!\perp Y_2} = \overbrace{P(Y_1 = 0)}^{=1-p} \, \overbrace{P(Y_2 = 0)}^{=1-p} = (1-p)^2, \\
P(\bar{Y}_n = 1) &= \overbrace{P(Y_1 = 1)}^{=p} \, \overbrace{P(Y_2 = 1)}^{=p} = p^2, \\
P(\bar{Y}_n = 1/2) &= 1 - P(\bar{Y}_n = 0) - P(\bar{Y}_n = 1) = 1 - (1-p)^2 - p^2 = 2p(1-p).
\end{aligned} \tag{3.14}$$

To make (3.14) more concrete, imagine again p = 0.7 and 100 randomly sampled datasets, i.e., 100 randomly sampled pairs of (Y1, Y2). With p = 0.7, (1 − p)² = 0.09, p² = 0.49, and 2p(1 − p) = 0.42. If we compute Ȳ2 = (Y1 + Y2)/2 for each of the 100 datasets (pairs), then there are approximately 9 with Ȳ2 = 0, 49 with Ȳ2 = 1, and 42 with Ȳ2 = 1/2. That is, the sampling distribution of Ȳn shows us how frequently each possible value occurs in the long run (when repeatedly sampling many datasets from the same population).
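The long-run frequencies in (3.14) can be approximated by simulating many datasets. The sketch below uses 100,000 simulated pairs rather than 100 (my choice, to reduce simulation noise) and compares the simulated shares with the exact probabilities.

```r
set.seed(5)
p <- 0.7
reps <- 100000   # number of simulated datasets, each a pair (Y1, Y2)

# Independent Bernoulli(p) draws for the first and second observation
Y1 <- rbinom(reps, size = 1, prob = p)
Y2 <- rbinom(reps, size = 1, prob = p)
ybar2 <- (Y1 + Y2) / 2   # sample mean in each simulated dataset

# Share of datasets with each possible value: 0, 1/2, 1
simulated <- c(mean(ybar2 == 0), mean(ybar2 == 0.5), mean(ybar2 == 1))
exact <- c((1 - p)^2, 2 * p * (1 - p), p^2)

print(rbind(simulated, exact))
```

Setting reps to 100 instead gives counts that bounce around 9, 42, and 49, which matches the “approximately” in the text.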

With larger n andor non-binary Y andor estimators more com-plicated than the sample mean the calculations become much morecomplex Further they depend on knowing the true population dis-tribution of the Yi which in practice is unknown Consequently thesampling distribution is often approximated as in Section 36

Discussion Question 3.1 (probability of positive mean). After seeing the data, you want to know the probability that the true mean is strictly positive, $E(Y) > 0$. Does the frequentist sampling distribution help? If yes, explain how; if no, explain why not. Hint: recall Section 3.1.

3.5.2 Graphs: Binary Population

Consider a binary population with $p = 0.7$ and datasets with $n$ observations sampled iid.

Figure 3.1 graphs the sampling distribution (PMF) of $\bar{Y}_n$ for various $n$. That is, for a given $n$, $\bar{Y}_n$ is a random variable whose PMF is shown. The horizontal axis shows possible values of $\bar{Y}_n$. The vertical axis can be interpreted in two ways. First, it shows the probability of each possible value as a percentage, e.g., if the bar at horizontal value 0.5 has height 42, then $P(\bar{Y}_n = 0.5) = 42\%$. Second, you could imagine randomly sampling 100 datasets, and the vertical axis shows the number of datasets in which a particular value of $\bar{Y}_n$ occurs.

[Figure 3.1 panels: $n = 1, 2, 4, 8, 16, 32$. Horizontal axis: value of $\bar{Y}_n$ (0 to 1). Vertical axis: PMF (%), 0 to 60.]

Figure 3.1: Sampling distribution (PMF) of $\bar{Y}_n$ with binary population.

Figure 3.1 first shows the sampling distribution of $\bar{Y}_1 = Y_1$ (the $n = 1$ graph) and then the sampling distribution of $\bar{Y}_2$ from (3.14). Figure 3.1 then shows the sampling distribution of $\bar{Y}_n$ for larger $n$ values that would be very tedious to compute by hand.
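Although tedious by hand, these PMFs are easy to compute in R. The sketch below relies on a standard fact not derived in the text: the sum of $n$ iid Bernoulli($p$) variables is Binomial($n$, $p$), so $P(\bar{Y}_n = k/n)$ is given by `dbinom()`. The helper name `ybar_pmf` is a hypothetical choice for illustration.

```r
# PMF of the sample mean of n iid Bernoulli(p) draws.
# The sum Y1 + ... + Yn is Binomial(n, p), so P(Ybar = k/n) = dbinom(k, n, p).
ybar_pmf <- function(n, p) {
  k <- 0:n
  data.frame(value = k / n, prob = dbinom(k, size = n, prob = p))
}

pmf2 <- ybar_pmf(n = 2, p = 0.7)    # should match (3.14) with p = 0.7
print(pmf2)
pmf32 <- ybar_pmf(n = 32, p = 0.7)
# Optional plot, similar in spirit to one panel of Figure 3.1:
# plot(pmf32$value, 100 * pmf32$prob, type = "h",
#      xlab = "Value of sample mean", ylab = "PMF (%)")
```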

Figure 3.1 helps show how $\bar{Y}_n$ can be seen as a random variable with a probability distribution, i.e., its sampling distribution. For different $n$, different values are possible. For different $n$, each value has a different probability of occurring. That is, if we randomly sample many datasets from the same population with the same sample size $n$, some datasets have one value of $\bar{Y}_n$ while other datasets have another, or another, and each possible value of $\bar{Y}_n$ occurs with the probability shown in the graphs.

Figure 3.1 also shows two interesting patterns when comparing small $n$ and larger $n$. First, with larger $n$, the sampling distribution of $\bar{Y}_n$ is more closely concentrated around the mean $E(Y) = 0.7$. In the extreme, when $n = 1$, there can only be $\bar{Y}_1 = 0$ or $\bar{Y}_1 = 1$, neither of which is very close to the mean 0.7. With larger $n$, even if $\bar{Y}_n = 0.7$ exactly is unlikely (or impossible), the probability of being close to 0.7 is higher. For example, the probability of $0.6 \le \bar{Y}_n \le 0.8$ is zero when $n = 1$, but it is relatively high for the largest $n$. Second, the shape of the sampling distribution differs for large and small $n$. With $n = 1$, $\bar{Y}_n$ has a Bernoulli (binary) sampling distribution. With the largest $n$ shown, although it is still discrete, the shape looks like the "bell curve" shape of a normal distribution's PDF. Both these observations are formalized in Section 3.6.

3.5.3 Graphs: Continuous Population

Now consider a continuous variable whose population distribution is uniformly distributed over all real (decimal) numbers between 0 and 1. There is again iid sampling, so $Y_i \overset{iid}{\sim} \mathrm{Unif}(0, 1)$.

Figure 3.2 again shows the sampling distribution of $\bar{Y}_n$, but now it is a PDF instead of PMF since $\bar{Y}_n$ is a continuous random variable (see Section 2.3.5). The horizontal axis again shows possible values of $\bar{Y}_n$. The vertical axis shows the probability density. The area under the PDF over any range of horizontal values shows the corresponding probability of that range, as in Figure 2.3.

[Figure 3.2 panels: $n = 1, 2, 4, 8, 16, 32$. Horizontal axis: value of $\bar{Y}_n$ (0 to 1). Vertical axis: PDF, 0 to 8.]

Figure 3.2: Sampling distribution (PDF) of $\bar{Y}_n$ with continuous population.

Figure 3.2 shows that for any $n$, different values of $\bar{Y}_n$ are possible given different datasets. Some datasets are more likely than others, so some ranges of $\bar{Y}_n$ values are more likely than others.

Figure 3.2 shows two patterns when comparing the graphs with small $n$ and larger $n$. First, the distribution is more spread out with small $n$ and more concentrated around the mean $E(Y) = 0.5$ for larger $n$. For example, consider $P(0.4 \le \bar{Y}_n \le 0.6)$, the area under the PDF between horizontal values 0.4 and 0.6. This probability is positive even with $n = 1$, but relatively small (it is only 20%, i.e., the area between 0.4 and 0.6 is only 20% of the total area under the PDF in the $n = 1$ graph). With the largest $n$, it is relatively high. Second, the shape differs by $n$. With $n = 1$, the PDF is flat, reflecting values uniformly spread between 0 and 1. With $n = 2$, there is a single peak at the population mean $E(Y) = 0.5$, but the PDF has straight lines and a sharp corner. With larger $n$, the PDF looks like the "bell curve" shape of a normal distribution's PDF. See Section 3.6 for more.
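The contrast between $n = 1$ and the largest $n$ can be checked numerically. This is a rough sketch under stated assumptions: the seed, the number of simulated datasets, and the helper name `sim_means` are illustrative. The $n = 1$ probability is exact from the uniform CDF, while the $n = 32$ probability is only approximated by simulation.

```r
set.seed(2)                          # arbitrary seed
n_datasets <- 10000                  # many datasets to approximate the sampling distribution

# Simulated sample means of n iid Unif(0,1) draws, one per dataset
sim_means <- function(n) replicate(n_datasets, mean(runif(n)))

# n = 1: the exact probability is available from the uniform CDF
p_exact_n1 <- punif(0.6) - punif(0.4)
# n = 32: approximate the probability by the fraction of simulated datasets
means32 <- sim_means(32)
p_sim_n32 <- mean(means32 >= 0.4 & means32 <= 0.6)
print(c(n1 = p_exact_n1, n32 = p_sim_n32))
# hist(means32, freq = FALSE)        # optional: roughly bell-shaped, centered near 0.5
```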

3.5.4 Table: Values in Repeated Samples

Table 3.1 records values and events across 100 datasets randomly sampled from the same population. The population is discrete with $P(Y = j) = 1/5$ for $j = -2, -1, 0, 1, 2$. Sampling is iid, so each $Y_i$ has the same distribution as the population $Y$, and all $Y_i$ are mutually independent. Let $n = 10$. The population mean is $E(Y) = 0$.

Table 3.1: Example estimates and event probabilities

Sample     $\bar{Y}_n$    $1\{\bar{Y}_n \le 0\}$    $1\{\bar{Y}_n - 0.4 \le 0 \le \bar{Y}_n + 0.4\}$
1           0.50          0                         0
2           0.20          0                         1
3           0.00          1                         1
4          -0.10          1                         1
5          -0.50          1                         0
...
100         0.30          0                         1
Average     0.01          52/100                    67/100

Note: $P(Y = j) = 0.2$ for $j = -2, -1, 0, 1, 2$; iid; $n = 10$.

Table 3.1 shows the value of $\bar{Y}_n$ computed from each sample (dataset). It shows that $\bar{Y}_n = 0.5$ in the first sample, $\bar{Y}_n = 0.2$ in the second sample, etc. This reflects the sampling distribution. A histogram of these values (not shown) would produce a graph with similar interpretation to Figures 3.1 and 3.2.

Table 3.1 shows for each sample whether or not the sample mean $\bar{Y}_n$ is less than or equal to the population mean $E(Y) = 0$, in the column labeled with $1\{\bar{Y}_n \le 0\}$. That is, 1 indicates that it does, 0 indicates that it doesn't. For example, in Sample 1, $\bar{Y}_n = 0.5$, which is not negative, so $1\{\bar{Y}_n \le 0\} = 0$. In Sample 4, $\bar{Y}_n = -0.1$, which is negative, so $1\{\bar{Y}_n \le 0\} = 1$. From the frequentist view, the event $\bar{Y}_n \le E(Y)$ is "random" in that it could occur or not occur, with some probability for each possibility. The $E(Y)$ is non-random, but $\bar{Y}_n$ is random, hence the event is random. The event's probability is the probability of randomly sampling a dataset in which the event occurs. The bottom row of the table says the event occurred 52 times out of 100 samples (52% of the time). Since there are only 100 samples and not $\infty$, this is not the exact probability, but it reflects that the event occurs slightly more than half the time.

Table 3.1 also shows for each sample whether or not the random interval $[\bar{Y}_n - 0.4, \bar{Y}_n + 0.4]$ contains the population mean $E(Y) = 0$, i.e., whether or not $\bar{Y}_n - 0.4 \le E(Y) \le \bar{Y}_n + 0.4$. The interval is "random" in the frequentist sense that it has different possible values in different datasets (since it depends on $\bar{Y}_n$). In Sample 1, the interval does not contain $E(Y)$: $\bar{Y}_n = 0.5$, so the interval is $[0.5 - 0.4, 0.5 + 0.4] = [0.1, 0.9]$, which does not contain $E(Y) = 0$. In Sample 2, the interval does contain $E(Y)$: $\bar{Y}_n = 0.2$, so the interval is $[-0.2, 0.6]$, which contains $E(Y) = 0$. The bottom row of the table says this event occurred 67 times out of 100 samples (67% of the time). This is the "coverage probability" of a "confidence interval" described in Section 3.8.2.
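A table like Table 3.1 can be generated by simulation. The sketch below is not the code behind the table: it assumes the population values are $-2, -1, 0, 1, 2$ (as in the table note), and the seed and column names are illustrative, so its numbers will differ from the table's.

```r
set.seed(3)                                   # arbitrary seed
support <- c(-2, -1, 0, 1, 2)                 # each value has probability 0.2
n <- 10
n_datasets <- 100

# One sample mean per dataset
Ybar <- replicate(n_datasets, mean(sample(support, size = n, replace = TRUE)))

tab <- data.frame(
  sample  = 1:n_datasets,
  Ybar    = Ybar,
  below   = as.integer(Ybar <= 0),                          # 1{Ybar <= 0}
  covered = as.integer(Ybar - 0.4 <= 0 & 0 <= Ybar + 0.4)   # 1{interval contains 0}
)
print(head(tab, 5))
print(colMeans(tab[, c("Ybar", "below", "covered")]))       # analogous to the "Average" row
```

Rerunning with a different seed gives different rows and somewhat different averages, which is exactly the sampling variation the table is meant to illustrate.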

3.6 Sampling Distribution Approximation

Because of the difficulties mentioned in Section 3.5, usually an estimator's sampling distribution is approximated.

Most common is a particular type of approximation called an asymptotic approximation. All else equal, this type of approximation is better when $n$ is larger. For example, comparing Figures 3.1 and 3.2 with (3.15) below, the approximation is very bad with $n = 1$ but very good for the largest $n$ in each figure. Unfortunately, there is no general magic threshold for $n$ because the approximation's accuracy also depends on certain (unknown) population features. In practice, people usually just hope $n$ is large enough that the approximation is reasonable.

With iid sampling, the approximate distribution of $\bar{Y}_n$ is

$$\bar{Y}_n \overset{a}{\sim} N(\mu_Y, \sigma_Y^2/n), \tag{3.15}$$

where the $a$ over $\sim$ stands for "approximately" (or "asymptotically"), $\mu_Y = E(Y)$ is the population mean, and $\sigma_Y^2 \equiv \mathrm{Var}(Y)$ is the population variance. This reflects the three patterns seen in the examples in Sections 3.5.2 and 3.5.3. First, the standard deviation is $\sigma_Y/\sqrt{n}$, which (given the same $\sigma_Y$) is smaller for larger $n$. Second, the mean of the distribution is the population mean. Third, the shape is normal (Gaussian). A normal approximation of a sample mean's sampling distribution is often called a central limit theorem (CLT).

For technical mathematical reasons, you may often see a variation of (3.15):

$$\sqrt{n}(\bar{Y}_n - \mu_Y) \overset{a}{\sim} N(0, \sigma_Y^2). \tag{3.16}$$

Practically, this says the same thing as (3.15); it just moves the $\mu_Y$ and $n$ to the left-hand side from the right-hand side.

In practice, $\sigma_Y^2$ is unknown but can be estimated from data (i.e., the sample variance).


[Figure 3.3 panels: $n = 1, 2, 4, 8, 16, 32$. Horizontal axis: value of $\sqrt{n}(\bar{Y}_n - \mu_Y)$ (-1.0 to 1.0). Vertical axis: PDF, 0.0 to 1.2.]

Figure 3.3: Sampling distribution (solid PDF) of $\sqrt{n}(\bar{Y}_n - \mu_Y)$, with normal approximation (dashed).

Figure 3.3 shows the same sampling distribution (PDF) from Figure 3.2, along with the normal approximation. The horizontal axis has been rescaled to see the shape more easily; the horizontal values are like the left-hand side of (3.16). The approximation is bad with $n = 1$ but very good for the largest $n$.
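One rough way to check the normal approximation numerically, rather than visually, is to compare simulated quantiles of $\sqrt{n}(\bar{Y}_n - \mu_Y)$ with the quantiles of $N(0, \sigma_Y^2)$. The sketch below assumes the Unif(0, 1) population from Section 3.5.3, for which $\mu_Y = 1/2$ and $\sigma_Y^2 = 1/12$ (standard facts about the uniform distribution); the seed and simulation size are arbitrary choices.

```r
set.seed(4)                                  # arbitrary seed
n <- 32
n_datasets <- 10000
mu_Y <- 0.5                                  # mean of Unif(0,1)
sigma_Y <- sqrt(1 / 12)                      # standard deviation of Unif(0,1)

# Left-hand side of (3.16), computed in each simulated dataset
z <- replicate(n_datasets, sqrt(n) * (mean(runif(n)) - mu_Y))

# Compare simulated quantiles with those of N(0, sigma_Y^2)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
comparison <- rbind(simulated = quantile(z, probs),
                    normal    = qnorm(probs, mean = 0, sd = sigma_Y))
print(round(comparison, 3))
```

Setting `n <- 1` instead should show a noticeably worse match, in line with the $n = 1$ panel.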

3.6.1 Non-iid Sampling

With non-iid sampling (see Section 3.2), an estimator's sampling distribution is different, so the approximation also differs. Usually the sampling distribution is still approximately normal, but with a different standard deviation. In practice, ideally you should understand the type of sampling enough to know which R functions can provide an appropriate approximation. For now, the goal is only to be able to assess the type of sampling and understand that non-iid sampling leads to a different sampling distribution.

3.7 Quantifying Accuracy of an Estimator

From the frequentist perspective, an estimator's accuracy can be quantified by comparing features of its sampling distribution to the true population value. Bias is an important, commonly mentioned property, but it is not sufficient. Mean squared error better quantifies accuracy. Bias and mean squared error are finite-sample properties that derive from the estimator's sampling distribution for a finite sample size $n$. Approximate ("large-sample" or "asymptotic") versions of these properties are also discussed below.

Throughout, let $\theta$ be the population parameter estimated by $\hat{\theta}_n$. This includes $\theta = E(Y)$ and $\hat{\theta}_n = \bar{Y}_n$, but is more general.

3.7.1 Bias

Recall from Section 3.5 the frequentist perspective that an estimator is a random variable, whose probability distribution is called its sampling distribution. The sampling distribution differs with $n$, hence the subscript $n$ on the estimator $\hat{\theta}_n$.

Definitions

The bias of $\hat{\theta}_n$ compares the mean of its sampling distribution to the true population $\theta$. Mathematically,

$$\mathrm{Bias}(\hat{\theta}_n) \equiv E(\hat{\theta}_n) - \theta. \tag{3.17}$$

The bias captures whether the estimator systematically differs from $\theta$ in a particular direction, i.e., how wrong the average $\hat{\theta}_n$ is.

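There are four types of bias:

upward bias (positive bias): $E(\hat{\theta}_n) > \theta$

downward bias (negative bias): $E(\hat{\theta}_n) < \theta$

attenuation bias (bias toward zero): $0 < E(\hat{\theta}_n)/\theta < 1$, so $|E(\hat{\theta}_n)| < |\theta|$

bias away from zero: $E(\hat{\theta}_n)/\theta > 1$, so $|E(\hat{\theta}_n)| > |\theta|$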

An estimator is unbiased if its bias is zero. Using (3.17),

$$\mathrm{Bias}(\hat{\theta}) = 0 \iff E(\hat{\theta}) = \theta, \tag{3.18}$$

where symbol $\iff$ can be read as "is equivalent to" (see Section 6.1).

For example, with iid sampling, the sample mean is unbiased. With $n = 1$, $\bar{Y}_1 = Y_1$, so $E(\bar{Y}_1) = E(Y_1) = \mu_Y$. With $n = 2$,

$$E[\bar{Y}_2] = E[(1/2)Y_1 + (1/2)Y_2] = \overbrace{(1/2)E(Y_1)}^{\mu_Y/2} + \overbrace{(1/2)E(Y_2)}^{\mu_Y/2} = \mu_Y, \tag{3.19}$$

using the linearity property of $E(\cdot)$ from (2.21). Similar derivations hold for any $n$, so $E(\bar{Y}_n) = \mu_Y$; thus, the bias is zero given (3.18).
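As a numerical check of (3.19), the mean of the sampling distribution can be computed directly in the binary example with $n = 2$ and $p = 0.7$, where $\mu_Y = p$. This is a small illustrative sketch; the variable names are assumptions.

```r
p <- 0.7
values <- c(0, 0.5, 1)                        # possible values of the sample mean, n = 2
probs  <- c((1 - p)^2, 2 * p * (1 - p), p^2)  # their probabilities from (3.14)

mean_of_Ybar2 <- sum(values * probs)          # E(Ybar_2)
bias <- mean_of_Ybar2 - p                     # Bias = E(Ybar_2) - mu_Y, with mu_Y = p
print(c(mean = mean_of_Ybar2, bias = bias))
```

The computed bias is zero up to floating-point rounding, matching the algebra.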

Insufficiency of Bias to Quantify Accuracy

Bias alone does not fully quantify accuracy. That is, if you only consider bias when choosing between two possible estimators, then you may be fooled into choosing the worse estimator.

Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two estimators of unknown parameter $\theta$. Here, the subscripts 1 and 2 do not indicate $n$, but just that the estimators are different. For simplicity, let $\theta = 0$. The first estimator's distribution is

$$P(\hat{\theta}_1 = -100) = P(\hat{\theta}_1 = 100) = 1/2. \tag{3.20}$$

The second estimator's distribution is

$$P(\hat{\theta}_2 = 1) = 1. \tag{3.21}$$

The first estimator has smaller bias. The mean of each estimator is

$$E(\hat{\theta}_1) = (1/2)(-100) + (1/2)(100) = 0, \quad E(\hat{\theta}_2) = (1)(1) = 1. \tag{3.22}$$

Thus, recalling $\theta = 0$, the bias of each estimator is

$$\mathrm{Bias}(\hat{\theta}_1) = E(\hat{\theta}_1) - \theta = 0 - 0 = 0, \quad \mathrm{Bias}(\hat{\theta}_2) = E(\hat{\theta}_2) - \theta = 1 - 0 = 1. \tag{3.23}$$

Estimator $\hat{\theta}_1$ is unbiased, while $\hat{\theta}_2$ has upward bias.

But, intuitively, $\hat{\theta}_2$ is much better. It is always wrong by 1, but $\hat{\theta}_1$ is always wrong by 100, which is much worse. Regardless of the dataset, $\hat{\theta}_2$ is 100 times closer than $\hat{\theta}_1$ to the true $\theta = 0$. Bias alone does not properly quantify our preferences: it tells us to prefer $\hat{\theta}_1$ when in fact we strongly prefer $\hat{\theta}_2$.

3.7.2 Mean Squared Error

$\Rightarrow$ Kaplan video: MSE Examples

The mean squared error (MSE) is a more complete measure of "how bad" an estimator is. The idea is analogous to using quadratic loss for prediction (e.g., Sections 2.5.1 and 2.5.4). Among other possible loss functions, this is most common and is generally reasonable. MSE is mean quadratic loss:

$$\mathrm{MSE}(\hat{\theta}) \equiv E[L_2(\hat{\theta}, \theta)] = E[(\hat{\theta} - \theta)^2]. \tag{3.24}$$

Continuing the example, our intuitive preference for $\hat{\theta}_2$ over $\hat{\theta}_1$ is supported by MSE. Since MSE measures "how bad" an estimator is, $\hat{\theta}_2$ being "better" means it has lower MSE. Specifically,

$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}_1) &= E[(\hat{\theta}_1 - \theta)^2] = (1/2)(-100 - 0)^2 + (1/2)(100 - 0)^2 = 10000, \\
\mathrm{MSE}(\hat{\theta}_2) &= E[(\hat{\theta}_2 - \theta)^2] = (1)(1 - 0)^2 = 1.
\end{aligned} \tag{3.25}$$

This matches our intuition: $\hat{\theta}_2$ is much better than $\hat{\theta}_1$ because it has much lower MSE.

MSE can also be decomposed into variance plus squared bias. The variance can also be seen as the squared "standard error" (see Section 3.8.1). Skipping the algebra,

$$E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + [\mathrm{Bias}(\hat{\theta})]^2. \tag{3.26}$$

All else equal, larger bias is bad, but it's also bad to have very high and very low estimates across datasets (large variance and "standard error"), even if they happen to average to $\theta$.
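The calculations in (3.22), (3.23), (3.25), and (3.26) can be reproduced for any discrete sampling distribution. In the sketch below, the helper `estimator_summary` is a hypothetical name introduced only for illustration; it returns the bias, variance, and MSE, along with variance plus squared bias so the decomposition can be checked.

```r
# Summaries of a discrete sampling distribution, given the true theta
estimator_summary <- function(values, probs, theta) {
  m    <- sum(probs * values)                 # mean of the sampling distribution
  bias <- m - theta
  v    <- sum(probs * (values - m)^2)         # variance
  mse  <- sum(probs * (values - theta)^2)     # mean squared error
  c(bias = bias, variance = v, mse = mse, var_plus_bias2 = v + bias^2)
}

theta <- 0
s1 <- estimator_summary(values = c(-100, 100), probs = c(0.5, 0.5), theta = theta)
s2 <- estimator_summary(values = 1, probs = 1, theta = theta)
print(rbind(theta1 = s1, theta2 = s2))
```

In both rows, the `mse` and `var_plus_bias2` columns agree, as (3.26) says they should.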


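Other MSE Examples

More generally, instead of assuming $\theta = 0$, let

$$P(\hat{\theta}_1 = \theta - 100) = P(\hat{\theta}_1 = \theta + 100) = 1/2, \quad P(\hat{\theta}_2 = \theta + 1) = 1. \tag{3.27}$$

The MSEs are the same as before since the $\theta$ cancels out:

$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}_1) &= E[(\hat{\theta}_1 - \theta)^2] = (1/2)(\theta - 100 - \theta)^2 + (1/2)(\theta + 100 - \theta)^2 = 10000, \\
\mathrm{MSE}(\hat{\theta}_2) &= E[(\hat{\theta}_2 - \theta)^2] = (1)(\theta + 1 - \theta)^2 = 1.
\end{aligned} \tag{3.28}$$

As another example, imagine we know the bias and variance of two estimators, but not the full sampling distributions. This is still sufficient to compute MSE, using (3.26). For example, let

$$\mathrm{Bias}(\hat{\beta}_1) = 1, \ \mathrm{Var}(\hat{\beta}_1) = 16, \quad \mathrm{Bias}(\hat{\beta}_2) = 10, \ \mathrm{Var}(\hat{\beta}_2) = 9. \tag{3.29}$$

Plugging these into (3.26),

$$\mathrm{MSE}(\hat{\beta}_1) = 1^2 + 16 = 17, \quad \mathrm{MSE}(\hat{\beta}_2) = 10^2 + 9 = 109. \tag{3.30}$$

According to MSE, $\hat{\beta}_1$ is better because it has lower MSE ("less bad") than $\hat{\beta}_2$. In this case, although $\hat{\beta}_1$ has larger variance, its bias is enough smaller that its overall MSE is also smaller.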
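The same plug-in calculation is a one-liner in R. The function name `mse_from_bias_var` below is a hypothetical helper for illustration, not something from a package.

```r
# MSE from the decomposition in (3.26): variance plus squared bias
mse_from_bias_var <- function(bias, variance) variance + bias^2

mse_beta1 <- mse_from_bias_var(bias = 1,  variance = 16)
mse_beta2 <- mse_from_bias_var(bias = 10, variance = 9)
print(c(beta1 = mse_beta1, beta2 = mse_beta2))
```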

Practice 3.3 (estimator MSE). Consider three estimators of the population mean $\mu = E(Y)$, and their three sampling distributions: $\hat{\mu}_1 \sim N(\mu, 25)$, $\hat{\mu}_2 \sim N(\mu + 3, 16)$, and $\hat{\mu}_3 \sim N(\mu + 2, 9)$, i.e., all normal distributions, with respective means $\mu$, $\mu + 3$, and $\mu + 2$, and respective variances 25, 16, and 9. (Hint: for MSE, does it matter that the distributions are normal?)

a) Compute the MSE of each estimator.
b) Rank the three estimators from best to worst in terms of MSE.

3.7.3 Consistency

Analogous to how bias compares the sampling distribution's mean to the true population $\theta$, consistency compares the approximate (asymptotic) sampling distribution's mean to $\theta$. However, in addition to the approximate distribution having mean $\theta$, consistency also requires that the approximate distribution's standard deviation is smaller for larger $n$.

For example, consider the approximate distribution of the sample mean $\bar{Y}_n$ in (3.15). The mean is $\mu_Y$. Further, the standard deviation is proportional to $1/\sqrt{n}$ (since the variance is proportional to $1/n$), i.e., smaller for larger $n$. Thus, $\bar{Y}_n$ is consistent.

Visually, the consistency of $\bar{Y}_n$ was seen in Figures 3.1 and 3.2. In each case, with larger $n$, the sampling distribution concentrated probability around $\mu_Y$.

Intuitively, consistency means that in "large" samples (large $n$), there is a "high" probability of the estimator being "close" to the true value. This is similar to the idea of "probably approximately correct" in computer science: estimator $\hat{\theta}_n$ is "consistent" if with large $n$ it is "probably approximately correct." Unfortunately, there are usually no precise quantitative definitions of "large," "high," and "close." Still, the qualitative idea is that the sampling distribution of $\hat{\theta}_n$ is converging to the true $\theta$ if we imagine larger and larger $n$, which is represented notationally by

$$\hat{\theta}_n \overset{p}{\to} \theta \text{ as } n \to \infty, \quad \text{or} \quad \operatorname*{plim}_{n \to \infty} \hat{\theta}_n = \theta. \tag{3.31}$$

If $\hat{\theta}_n$ is not consistent, its asymptotic bias can be defined as

$$\mathrm{AsyBias}(\hat{\theta}_n) \equiv \operatorname*{plim}_{n \to \infty} \hat{\theta}_n - \theta. \tag{3.32}$$

(Other definitions have the same practical meaning, though the technical details differ.) Similar to "unbiasedness" being "zero bias," here "consistency" is "zero asymptotic bias." There are the same four types of asymptotic bias as bias (upward, downward, attenuation, away from zero).
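To make "high probability of being close" concrete, the sketch below computes the exact probability that the sample mean of $n$ iid Bernoulli(0.7) draws is within 0.1 of the population mean, for several $n$. It relies on the standard fact that the sum of the draws is Binomial($n$, $p$); the function name `prob_close` and the choice of 0.1 for "close" are illustrative assumptions.

```r
# Exact P(|Ybar_n - p| <= eps) for n iid Bernoulli(p) draws,
# using that the sum of the draws is Binomial(n, p)
prob_close <- function(n, p = 0.7, eps = 0.1) {
  k <- 0:n
  close <- abs(k / n - p) <= eps + 1e-12   # small tolerance for floating-point rounding
  sum(dbinom(k[close], size = n, prob = p))
}

ns <- c(1, 2, 4, 8, 16, 32, 1000)
probs <- sapply(ns, prob_close)
print(round(setNames(probs, ns), 4))
```

The probability is zero at $n = 1$ and very close to one at $n = 1000$, although it need not increase at every step because the set of possible values of the sample mean changes with $n$.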

3.7.4 Asymptotic MSE

It is possible to compare approximate (asymptotic) MSE, but details are omitted.

3.8 Quantifying Uncertainty: Conventional Frequentist Approaches

The point estimates in Section 3.4 provide our best guesses about unknown population values, but they offer no sense of our uncertainty. Here, we consider only statistical uncertainty (or sampling uncertainty), i.e., the uncertainty due to observing only a random sample of data instead of knowing the true population distribution. Although the term is ambiguous, inference often refers to the types of methods in this section (i.e., statistical methods other than point estimation).

This section concerns only the conventional frequentist approaches to quantifying uncertainty. Section 3.9 provides warnings about misinterpretation and misuse.

The general consensus among econometricians and statisticians is that confidence intervals are more informative and easier to interpret than p-values and hypothesis tests. You should focus on confidence intervals when producing your own empirical analysis, but you may need to understand p-values and hypothesis tests to understand others.

3.8.1 Standard Errors

Empirical economics results almost always report standard errors alongside estimates. Standard errors are commonly used to compute confidence intervals, as well as p-values and hypothesis tests.

Definition and Terminology

The standard error (SE) of estimator $\hat{\theta}$ is the standard deviation of its sampling distribution:

$$\mathrm{SE}(\hat{\theta}) \equiv \sqrt{\mathrm{Var}(\hat{\theta})}. \tag{3.33}$$

Recall from Section 2.3 that the standard deviation has the same units as the variable itself, so the SE has the same units as $\hat{\theta}$.

Unfortunately, people may say "SE" to refer to either (3.33) or an estimate of (3.33), so its meaning is ambiguous. In this textbook at least, "estimated SE" and notation $\widehat{\mathrm{SE}}(\hat{\theta})$ refer to an estimate of (3.33). Causing yet more confusion, $\widehat{\mathrm{SE}}(\bar{Y}_n)$ is often called the standard error of the mean, whereas personally I would call it the estimated standard error of the sample mean.

Interpretation

The SE helps quantify uncertainty due to random sampling, or "statistical uncertainty." Larger SE means more uncertainty.

For example, consider estimators $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}_3$ that all estimate $\theta$. If $\hat{\theta}_1 = \theta$, then $\mathrm{SE}(\hat{\theta}_1) = 0$, reflecting zero uncertainty. If $P(\hat{\theta}_2 = \theta + 1) = P(\hat{\theta}_2 = \theta - 1) = 1/2$, then (skipping the math) $\mathrm{SE}(\hat{\theta}_2) = 1$. If $P(\hat{\theta}_3 = \theta + 10) = P(\hat{\theta}_3 = \theta - 10) = 1/2$, then $\mathrm{SE}(\hat{\theta}_3) = 10$.

Unfortunately, it's possible to be very certain about the wrong value. Consider the very bad estimator $\hat{\theta} = 4$. Since it's a constant, $\mathrm{SE}(\hat{\theta}) = 0$. This may appear to be great (no uncertainty), but we should feel very uncertain about the methodology of "just guess 4." Our uncertainty about appropriate methodology is not captured by the SE.
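The skipped math for these examples is just the standard deviation of a discrete distribution, which can be checked in R. The helper `se_discrete` and the value `theta <- 5` are assumptions made only for illustration; the resulting SEs do not depend on the value chosen for `theta`.

```r
# Standard deviation of a discrete sampling distribution
se_discrete <- function(values, probs) {
  m <- sum(probs * values)
  sqrt(sum(probs * (values - m)^2))
}

theta <- 5   # arbitrary illustrative value; the SEs below do not depend on it
se1 <- se_discrete(values = theta, probs = 1)
se2 <- se_discrete(values = c(theta - 1, theta + 1), probs = c(0.5, 0.5))
se3 <- se_discrete(values = c(theta - 10, theta + 10), probs = c(0.5, 0.5))
se_const <- se_discrete(values = 4, probs = 1)   # the "just guess 4" estimator
print(c(se1 = se1, se2 = se2, se3 = se3, se_const = se_const))
```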

3.8.2 Confidence Intervals

A confidence interval (CI) helps quantify statistical uncertainty, with a longer CI indicating more statistical uncertainty. A CI does not capture any other source of uncertainty, so a short CI can be misleading if there is still uncertainty about certain assumptions or methodological choices.

Essentially, a CI is a range of values that tries to include the true population value with high probability, like 90% or 95%. Again, "probability" means frequentist probability, from the "before" view, or equivalently over many repeated samples from the same population, like in Table 3.1.

For example, recall the last column in Table 3.1. In each of 100 random samples, it showed whether or not the true mean $E(Y) = 0$ was inside the interval $[\bar{Y}_n - 0.4, \bar{Y}_n + 0.4]$. This CI contained the true population mean in 67 of the 100 datasets. From the "before" view, the probability of randomly sampling a dataset in which the CI contains the true value is around 67%.


A 90% CI does not mean "I believe there's a 90% chance that the true value is in this range." That is the interpretation of a Bayesian credible interval; see Section 3.1.

The actual probability that a CI contains the true value often differs from the desired probability. In practice, when you ask R to compute a CI, you specify your desired probability (like 90% or 95%), called the confidence level or nominal coverage probability (or "nominal level" or other variations). The actual probability is the coverage probability. There are three possibilities.

1. Ideally, a CI's coverage probability is close to the nominal level.
2. Sometimes a CI is too long and has coverage probability above what you requested. This is bad because it does not help you narrow down the possible values of the population parameter well (since the CI is longer than necessary).
3. Sometimes a CI is too short and has coverage probability below what you requested, as low as 80%, 50%, or even close to 0%. This is bad because you think the true value is inside the CI, but actually in many datasets (more than you realized), the CI does not contain the true value.

The levels 90% and 95% are most common, but sometimes you may desire 99% or even higher if it is particularly important that the true value be in the interval (or if you have a very large sample with very short CIs).

Formally, coverage probability is defined as follows. Consider a two-sided confidence interval of the form $[\hat{L}, \hat{U}]$, where the hats remind us that the interval endpoints are computed from the data. For example, Table 3.1 had $\hat{L} = \bar{Y}_n - 0.4$ and $\hat{U} = \bar{Y}_n + 0.4$. A one-sided confidence interval would set $\hat{L} = -\infty$ or $\hat{U} = \infty$. The coverage probability of this CI for the parameter $\theta$ is

$$P(\text{CI contains true value}) = P(\theta \in [\hat{L}, \hat{U}]) = P(\hat{L} \le \theta \le \hat{U}). \tag{3.34}$$
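Coverage probability as defined in (3.34) can be approximated by simulation when the population is known. The sketch below is one possible setup, not a general recipe: it assumes iid N(0, 1) data with $n = 50$ and uses the CI reported by `t.test()`; the seed and number of simulated datasets are arbitrary.

```r
set.seed(5)                        # arbitrary seed
n <- 50
n_datasets <- 2000
theta <- 0                         # true population mean of N(0,1)

covers <- replicate(n_datasets, {
  Y  <- rnorm(n, mean = theta, sd = 1)
  ci <- t.test(Y, conf.level = 0.95)$conf.int   # ci[1] = lower, ci[2] = upper endpoint
  ci[1] <= theta && theta <= ci[2]
})
coverage <- mean(covers)           # fraction of datasets whose CI contains theta
print(coverage)
```

In this favorable setting the simulated coverage should be near the nominal 95%; with other populations or smaller $n$ it may differ.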

Sometimes a CI is written in terms of a critical value. The critical value depends on the confidence level and comes from the standard normal distribution, $N(0, 1)$. It is used when an estimator's approximate sampling distribution is normal (Gaussian). For example, if $c$ is the critical value for a two-sided 95% CI, and $\widehat{\mathrm{SE}}$ is the estimated standard error of estimator $\hat{\theta}$, then the conventional CI is $[\hat{\theta} - c\,\widehat{\mathrm{SE}}, \hat{\theta} + c\,\widehat{\mathrm{SE}}]$.
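For the sample mean, this conventional CI can be sketched in a few lines of R. The data here are simulated purely for illustration, and the estimated SE is taken to be the sample standard deviation divided by $\sqrt{n}$, which is the usual choice under iid sampling.

```r
set.seed(6)                                   # arbitrary seed
Y <- rnorm(n = 50, mean = 0, sd = 1)          # simulated data, for illustration only

conf_level <- 0.95
crit <- qnorm(1 - (1 - conf_level) / 2)       # two-sided critical value from N(0,1)
theta_hat <- mean(Y)                          # point estimate
se_hat <- sd(Y) / sqrt(length(Y))             # estimated SE of the sample mean
ci <- c(lower = theta_hat - crit * se_hat, upper = theta_hat + crit * se_hat)
print(round(c(crit = crit, ci), 3))
```

This differs slightly from the `t.test()` interval, which uses a critical value from a t distribution rather than $N(0, 1)$.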

Practice 3.4 (CI interpretation). Imagine you have a CI with 95% nominal coverage probability for the true $\theta$: $[14, 29]$.

a) Explain why this does not mean "I think there's a 95% chance that $14 \le \theta \le 29$."
b) Explain why it's still possible that the true value is $\theta = 0$.
c) Explain why, if the true coverage probability is also 95%, and you had 99 other randomly sampled datasets, then around 95 of the 100 total datasets would have a CI containing the true $\theta$.
d) Despite the 95% confidence level, imagine the actual coverage probability of your CI is 75%. Would a CI with actual 95% coverage probability be longer or shorter than yours? Explain.

Example in R

The following R example constructs two-sided 95% confidence intervals for the mean, from simulated iid standard normal data (so the true population mean is zero). One CI uses t.test(), a standard t-test; the other CIs use nonparametric bootstrap methodology from the boot package, though details are beyond our scope.

    library(boot)
    set.seed(112358) # for replicability
    Y <- rnorm(n=50, mean=0, sd=1) # iid N(0,1)
    # CI from the standard t-test
    CIttest <- t.test(x=Y, conf.level=0.95,
                      alternative="two.sided")$conf.int
    # Nonparametric bootstrap of the sample mean
    ret <- boot(data=Y, statistic=function(x,i) mean(x[i]), R=100)
    tmp <- boot.ci(boot.out=ret, conf=0.95, type=c("basic","bca"))
    # Elements 4 and 5 hold the lower and upper endpoints
    outtable <- rbind(CIttest, tmp$basic[4:5], tmp$bca[4:5])
    rownames(outtable) <- c("Normality", "Bootbasic", "BootBCa")
    colnames(outtable) <- c("Lower", "Upper")
    print(round(outtable, digits=3))

               Lower  Upper
    Normality  -0.213 0.370
    Bootbasic  -0.234 0.387
    BootBCa    -0.233 0.388

3.8.3 p-values

Frequentist p-values are precisely defined, strange, common, and commonly misunderstood.

A p-value measures how unlikely the observed data would be if a certain hypothesis were true. Notationally, a p-value is conventionally just denoted as $p$, but since it is computed from data, I usually write $\hat{p}$ (with a hat) for clarity. The range is $0 \le \hat{p} \le 1$. Small values closer to $\hat{p} = 0$ indicate such a dataset would be unlikely if the hypothesis were true. In that case, either the hypothesis is true and we just happened by chance to observe an unlikely dataset, or else the hypothesis is false.

For example, consider the p-value for the hypothesis $H_0\colon \mu_Y = 0$. Values near $\hat{p} = 0$ indicate that the observed dataset would be unlikely if actually $\mu_Y = 0$. As usual, $\bar{Y}_n$ is a random variable. Here, the observed sample mean $\bar{Y}_o$ is treated as non-random. The p-value is then

$$\hat{p} = P(|\bar{Y}_n| \ge |\bar{Y}_o| \mid \mu_Y = 0). \tag{3.35}$$

That is, the p-value is the probability of observing a sample mean at least as far away from zero as the observed sample mean, given a population with $\mu_Y = 0$.

More generally,

$$\hat{p} = P(\text{estimate magnitude at least as big as observed} \mid H_0 \text{ is true}). \tag{3.36}$$
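One common way to approximate (3.35) in practice is to plug the estimated standard error into the normal approximation from (3.15). The sketch below does this for simulated data and compares the result with `t.test()`. The function name `p_value_mean` is a hypothetical helper, and using the normal rather than a t distribution is a simplifying assumption.

```r
# Two-sided p-value for H0: mu_Y = 0 using the normal approximation (3.15)
p_value_mean <- function(y) {
  se_hat <- sd(y) / sqrt(length(y))      # estimated standard error of the sample mean
  t_stat <- mean(y) / se_hat             # observed sample mean in SE units
  2 * (1 - pnorm(abs(t_stat)))           # probability of a value at least this extreme
}

set.seed(7)                              # arbitrary seed
Y <- rnorm(n = 50, mean = 0, sd = 1)     # simulated data, for illustration only
p_normal <- p_value_mean(Y)
p_ttest  <- t.test(Y, mu = 0)$p.value    # uses a t distribution instead of the normal
print(c(normal = p_normal, t.test = p_ttest))
```

With $n = 50$, the two p-values are close because the t distribution is already similar to the standard normal.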

3.8.4 Statistical Significance

Results with low p-values are often called statistically significant, or having statistical significance. These terms are usually used when trying to estimate an effect (or difference), where the null hypothesis is zero effect. A statistically significant effect estimate means the p-value is low, meaning an estimate that large would be unlikely if the true effect were zero, which provides some evidence of a non-zero effect.

Conceptually, statistical significance is not a yes/no property but a continuum, i.e., not "if" but "how much." Results can be somewhat statistically significant, or extremely statistically significant, or lacking statistical significance, etc. Confusingly, a lower p-value means greater statistical significance.

In practice, often people say a result is statistically significant at a particular level. For example, if the p-value is below 0.05, then the result is "statistically significant at a 5% level"; if below 0.01, then it is "statistically significant at a 1% level," etc. Generally, there is statistical significance at a $100c\%$ level if the p-value is below $c$.

Why 5%? Indeed, 5% is arbitrary. Its origin seems to be from Ronald Fisher, who wrote in 1926, "We shall not often be astray if we draw a conventional line at 0.05." Recently, 72 prominent researchers from many fields (including statistics, econometrics, and economics) wrote a piece simply titled "Redefine statistical significance" (Benjamin, Berger, Johannesson, Nosek, Wagenmakers, Berk, Bollen, Brembs, Brown, Camerer, Cesarini, Chambers, Clyde, Cook, De Boeck, Dienes, Dreber, Easwaran, Efferson, Fehr, Fidler, Field, Forster, George, Gonzalez, Goodman, Green, Green, Greenwald, Hadfield, Hedges, Held, Ho, Hoijtink, Hruschka, Imai, Imbens, Ioannidis, Jeon, Jones, Kirchler, Laibson, List, Little, Lupia, Machery, Maxwell, McCarthy, Moore, Morgan, Munafò, Nakagawa, Nyhan, Parker, Pericchi, Perugini, Rouder, Rousseau, Savalei, Schönbrodt, Sellke, Sinclair, Tingley, Van Zandt, Vazire, Watts, Winship, Wolpert, Xie, Young, Zinman, and Johnson, 2018). The suggestion was to reduce the conventional level for statistical significance from 5% to 0.5%. Indeed, it is already (much) lower in some fields like genetics and high-energy physics. However, they also agree that there may be very important empirical results with $p = 0.05$ or even larger. They simply advocate calling such results "suggestive evidence" rather than treating them as conclusive. They also note that it may be better to focus on confidence intervals than statistical significance.


3.8.5 Hypothesis Testing

In the scientific method, theories imply certain hypotheses that can be tested empirically (with data). A theory is maintained until it is disproved; then a new theory replaces it, and is tested, etc.

In economics, more often hypothesis testing is used like the p-value, to provide evidence against a statement like "the true effect is zero."

Notation and Terminology

Notationally, $H_0$ denotes the null hypothesis,[1] while $H_1$ (sometimes $H_a$) denotes the alternative hypothesis. A specific null hypothesis is a statement about a parameter, written after a colon, like $H_0\colon \mu_Y = 0$. The alternative hypothesis is usually just that the null is false, like $H_1\colon \mu_Y \ne 0$, so they are mutually exclusive (cannot both be true).

Ostensibly, the goal of hypothesis testing is to use data to decide whether $H_0$ or $H_1$ is true, but many caveats apply.

Table 3.2: Terms for hypothesis test outcomes

               don't reject $H_0$    reject $H_0$
$H_0$ true     correct               type I error (false positive)
$H_0$ false    type II error         correct

Much jargon accompanies hypothesis testing. A hypothesis test has two possible results: either reject $H_0$, or do not reject $H_0$. ("Fail to reject" means the same as "do not reject"; it does not reflect a "failure" in the colloquial English sense.) Sometimes "do not reject" is replaced by "accept," but "do not reject" emphasizes the asymmetry between $H_0$ and $H_1$. That is, although rejection of $H_0$ indicates evidence against it, lack of rejection only indicates a lack of evidence against $H_0$, not necessarily strong evidence supporting it. The test's result is either correct or an error, with terms shown in Table 3.2. Related probabilities are:

rejection probability: P(reject $H_0$)

type I error rate: P(reject true $H_0$)

type II error rate: P(don't reject false $H_0$)

power: P(reject false $H_0$)

There is yet more jargon. When $H_0$ includes multiple values of a parameter $\theta$, the largest possible type I error rate is called the size of the test. When $H_0$ is just a single value, like $H_0\colon \theta = 0$, then the size is just the type I error rate. A test's level is what it claims to be the maximum type I error rate (size). Like a confidence interval's nominal coverage probability, the level could be above, below, or equal to the true type I error rate. Usually it is close for large $n$, but it may be very different with small $n$, and there is (as usual) no quantitative threshold for "large."

[1] Pedagogical criticism duly noted: xkcd.com/892
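The type I error rate, type II error rate, and power can be approximated by simulation when the population is known. The sketch below is one illustrative setup: it assumes iid normal data with $n = 50$, a test of $H_0\colon \mu_Y = 0$ based on the p-value from `t.test()`, and an arbitrary alternative with true mean 0.5; the helper name `rejection_rate` is hypothetical.

```r
set.seed(8)                         # arbitrary seed
n <- 50
n_datasets <- 2000
alpha <- 0.05                       # level of the test

# Fraction of simulated datasets in which the level-alpha test of H0: mu_Y = 0 rejects
rejection_rate <- function(true_mean) {
  mean(replicate(n_datasets,
                 t.test(rnorm(n, mean = true_mean), mu = 0)$p.value <= alpha))
}

type1_rate <- rejection_rate(true_mean = 0)     # H0 true: rejections are type I errors
power      <- rejection_rate(true_mean = 0.5)   # H0 false: rejections are correct
type2_rate <- 1 - power                         # H0 false: non-rejections are type II errors
print(c(type1 = type1_rate, power = power, type2 = type2_rate))
```

Here the simulated type I error rate should be near the 5% level; the power depends heavily on how far the true mean is from zero and on $n$.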


Practice 3.5 (testing terms). State the technical term(s) associated with each of these.

a) Your hypothesis test did not reject the permanent income hypothesis, even though it's false.
b) For a type of lab experiment where people do not behave according to expected utility maximization, when repeatedly running the experiment on different randomly sampled groups of people, your hypothesis test rejects the null hypothesis of expected utility maximization 80% of the time.
c) You want to see if there is enough empirical evidence to reject the efficient market hypothesis.
d) Your hypothesis test rejected the permanent income hypothesis, even though it's true.
e) In your test of $H_0\colon \theta \le 0$, the type I error rate is very low when $\theta$ is very negative, but the type I error rate can be as high as 7% when $\theta = 0$.

Computation

Computationally, the hypothesis test for H0: E(Y) = 0 can be computed using the p-value. In this sense, the test is strictly less informative than the p-value: the p-value takes any number between 0 and 1, whereas the test can only reject or not. Specifically, the level α test rejects when p ≤ α, so the test essentially just reports whether 0 ≤ p ≤ α or p > α. In fact, the function t.test() in R does not even report "reject" or "do not reject"; it instead reports a p-value.

Alternatively, a hypothesis test compares a test statistic to a critical value. This is the same critical value from Section 3.8.2 that depends on the level α and the standard normal distribution N(0, 1), for use when the estimator is approximately normal. For example, if c is the two-sided level 5% critical value and t is the t-statistic, then the test rejects H0 when |t| > c. In R, for a two-sided level ALPHA test, the critical value is qnorm(1-ALPHA/2), but usually you do not need to compute it manually.
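The two descriptions of the test (compare p to α, or compare |t| to the critical value) can be checked against each other with a small sketch. The point estimate and standard error below are hypothetical numbers chosen only for illustration, and the normal approximation is used throughout (t.test() itself uses a t distribution, so its p-value would differ slightly).

```r
# Hypothetical inputs, for illustration only
est   <- 1.3    # point estimate of E(Y)
se    <- 0.5    # standard error of the estimate
ALPHA <- 0.05   # level of the test

tstat <- (est - 0) / se          # t-statistic for H0: E(Y) = 0
crit  <- qnorm(1 - ALPHA/2)      # two-sided critical value (about 1.96)
pval  <- 2 * pnorm(-abs(tstat))  # two-sided p-value under N(0,1)

reject.by.crit <- abs(tstat) > crit   # critical-value version of the test
reject.by.pval <- pval <= ALPHA       # p-value version of the test
c(tstat=tstat, crit=crit, pval=pval)
c(reject.by.crit=reject.by.crit, reject.by.pval=reject.by.pval)
```

Both versions give the same decision for any inputs, since |t| exceeds the critical value exactly when the normal-based p-value falls below α.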

3.8.6 Mental Math for Statistical Uncertainty

For a quick approximation, you can use a critical value of 2. Specifically, if a point estimate θ̂ is more than two SEs away from zero, then it is statistically significantly different from zero at a 5% level (actually 4.55%). That is, it is statistically significant at 5% when |θ̂| ≥ 2 SE(θ̂). You can also take θ̂ and add and subtract two SEs to get an approximately 95% CI (actually 95.45%): the CI is [θ̂ − 2 SE(θ̂), θ̂ + 2 SE(θ̂)].

This is useful for a few reasons. First, it's often easy enough to do in your head. Second, 5% is arbitrary anyway, so 4.55% is equally good (or equally bad). Third, confidence intervals and p-values are already based on (asymptotic) approximations whose approximation error is almost always bigger than the difference between 5% and 4.55%.²

Consider an example. Imagine you estimate θ̂ = −3.2 and SE(θ̂) = 1.5. Then |θ̂| = 3.2 and 2 SE(θ̂) = 3.0. Since 3.2 > 3.0, the result is statistically significant at a 5% level (p < 0.05). A 95% CI is [−3.2 − 3.0, −3.2 + 3.0] = [−6.2, −0.2].
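The arithmetic of this example can be sketched in R, alongside the version that uses the exact normal critical value instead of 2. The variable names are illustrative only.

```r
theta.hat <- -3.2   # point estimate from the example
se        <- 1.5    # standard error from the example

# Two-SE rule of thumb
signif.2se <- abs(theta.hat) >= 2*se     # is |estimate| at least 2 SEs from 0?
ci.2se     <- theta.hat + c(-2, 2)*se    # approximate 95% CI
level.2se  <- 2*pnorm(-2)                # actual level implied by critical value 2

# Same calculation with the exact two-sided 5% normal critical value
crit.exact <- qnorm(0.975)
ci.exact   <- theta.hat + c(-1, 1)*crit.exact*se

signif.2se
ci.2se
level.2se
ci.exact
```

The two intervals differ only slightly, which is the point of the shortcut.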

Practice 3.6 (mental math). Let θ̂ = −2.8, SE(θ̂) = 1.9; in your head, assess statistical significance and compute an approximate 95% CI.

3.9 Quantifying Uncertainty: Misinterpretation and Misuse

This section addresses misinterpretations and misuse of frequentist inference. Some of the most common problems are discussed below, as well as on the (pretty good) Wikipedia page devoted to the topic.³

Practice 3.7 (frequentist or Bayesian). For each of the following, say whether it is a frequentist question, Bayesian question, neither, or both; if both, explain the two possible interpretations. Hint: use Section 3.1 as well as Section 3.8.

a) What's the probability that the current natural unemployment rate in the U.S. is between 4.5% and 7.5%?

b) Can we create a diagnostic tool for our company's daily website traffic data to identify whether it's normal or has been hacked, limiting the rate of falsely reporting "hacked" on normal days to only 1% of normal days?

c) What is the probability that the true unemployment rate is within 1 percentage point of the estimated unemployment rate?

d) Is the positive estimate θ̂ > 0 primarily due to the income effect or substitution effect?

3.9.1 Perils of Ignoring Non-iid Sampling

A CI justified by iid sampling may perform poorly when sampling is not iid. (Similar problems befall p-values and hypothesis tests.) Stratified sampling that is independent but not identically distributed (inid) is not too problematic, since it tends to make coverage probability higher than requested: CIs are "too long" but still have high probability of containing the true population value. In contrast, dependent sampling can make the actual coverage probability much lower than the requested confidence level. For example, maybe you ask for a 90% CI, but the CI's coverage probability is actually only 75%. Unfortunately, you cannot simply ask the computer for the true coverage probability, so you must carefully consider whether you have (in)dependent sampling.

² For example, a famous econometrician said, "I tell my students, if you can get a 5% test that controls the actual type I error rate below 10%, that's pretty good" (Jerry Hausman, April 6, 2019, keynote talk at Chinese Economists Society conference in Lawrence, KS).

³ https://en.wikipedia.org/wiki/Misunderstandings_of_p-values

Example

Consider the following example with dependent sampling, with results simulated in R below. Imagine a "time use survey" that includes a question about watching television (TV). In hours per day, P(Y = j) = (4 − j)/10 for j = 0, 1, 2, 3. Imagine each household contains two adults who only watch TV together. That is, if individuals i and k live together, then Y_i = Y_k.

For comparison, imagine two samples are collected: one iid, one with clustered sampling. Sample A is collected iid, randomly sampling individuals from the population. Sample B is collected by randomly visiting households and surveying both individuals within the household. Sample A contains observations Y_i for i = 1, ..., nA = 10. Sample B contains observations Y_i for i = 1, ..., nB = 20, where Y_k and Y_{k+10} live together (k = 1, ..., 10). Since Y_k = Y_{k+10} in Sample B, there are really only 10 observations; the other 10 are literally duplicates.

Samples A and B contain the same amount of information, so they should lead to the same amount of uncertainty (same CI). However, if (incorrectly) both are assumed iid, the larger sample size nB > nA is incorrectly interpreted as greater certainty and leads to incorrectly smaller CIs.

This TV example is shown in the following simulation. Many datasets are simulated. In each, a 90% CI is constructed for

\[ \mu_Y = E(Y) = (4/10)(0) + (3/10)(1) + (2/10)(2) + (1/10)(3) = 10/10 = 1. \]

Finally, the code reports the proportion of simulated datasets in which the CI contained the true value, i.e., the simulated coverage probability.

set.seed(112358)  # for replicability
vY <- 0:3  # possible values of Y
pY <- (4:1)/10  # P(Y=0), P(Y=1), P(Y=2), P(Y=3)
muY <- sum(pY*vY)  # E(Y)
nA <- 10  # sample size for A; B's is twice this
CL <- 0.90  # 90% CI
NREP <- 1000  # number of simulated datasets
tmp <- data.frame(lo=rep(NA,NREP), hi=NA)
CIs <- list(A=tmp, B=tmp)  # store both CIs for each dataset
for (irep in 1:NREP) {
  sampleA <- sample(x=vY, size=nA, replace=TRUE, prob=pY)
  sampleB <- rep(x=sample(x=vY, size=nA, replace=TRUE, prob=pY), times=2)
  CIs[['A']][irep,] <-
    t.test(x=sampleA, conf.level=CL, alternative='two.sided')$conf.int
  CIs[['B']][irep,] <-
    t.test(x=sampleB, conf.level=CL, alternative='two.sided')$conf.int
}
CPA <- mean(CIs$A[,1]<=muY & muY<=CIs$A[,2])  # simulated coverage, Sample A
CPB <- mean(CIs$B[,1]<=muY & muY<=CIs$B[,2])  # simulated coverage, Sample B
data.frame(conflevel=CL, CPA=CPA, CPB=CPB)

  conflevel   CPA   CPB
1       0.9 0.895 0.749

The Sample A CI is much better than the Sample B CI that incorrectly assumes iid sampling. "Better" means coverage probability is closer to the desired confidence level of 90%. Specifically, the simulated coverage probability (CP) of the Sample A CI is 89.5%, whereas the CP of the Sample B CI is 74.9%.
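Since the second half of Sample B literally duplicates the first half, one simple way to see the problem is to compare the naive CI (treating all 20 observations as iid) with a CI that keeps one observation per household. The following is only a sketch under the example's setup, not a general method for clustered data; the helper covers() and the variable names are hypothetical, and the guard for a constant sample is added only because t.test() stops with an error in that case.

```r
set.seed(112358)               # for replicability
vY  <- 0:3                     # possible values of Y
pY  <- (4:1)/10                # P(Y=0), ..., P(Y=3)
muY <- sum(pY*vY)              # E(Y) = 1
nHH <- 10                      # number of households in Sample B
CL  <- 0.90                    # 90% CI
NREP <- 1000                   # number of simulated datasets

# TRUE if the t-based CI computed from sample y contains muY
covers <- function(y) {
  if (sd(y) == 0) return(y[1] == muY)  # constant sample: CI collapses to a point
  ci <- t.test(x=y, conf.level=CL)$conf.int
  ci[1] <= muY && muY <= ci[2]
}

cover.naive <- cover.dedup <- rep(NA, NREP)
for (irep in 1:NREP) {
  hh <- sample(x=vY, size=nHH, replace=TRUE, prob=pY)  # one value per household
  sampleB <- rep(x=hh, times=2)                        # both adults report it
  cover.naive[irep] <- covers(sampleB)                 # treats 20 obs as iid
  cover.dedup[irep] <- covers(sampleB[1:nHH])          # one obs per household
}
c(CP.naive=mean(cover.naive), CP.dedup=mean(cover.dedup))
```

In each simulated dataset the one-per-household CI has the same center as the naive CI but is wider, so it covers the true mean whenever the naive CI does.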

3.9.2 Not a Bayesian Belief

The p-value is often interpreted as the probability that the hypothesis H0 is true, but this is wrong. While intuitive, such an interpretation could only be possible in a Bayesian framework, not frequentist. For example, in the frequentist framework, either µ_Y ≤ 0 or not, whereas the Bayesian posterior describes our belief about the probability that µ_Y ≤ 0.

For example, a p-value of 0.08 does not mean "I believe there's an 8% chance that H0 is correct."

The example in Section 3.9.4 shows how your belief about H0 depends on more than just a frequentist hypothesis test or p-value, which only account for what happens if H0 is true.

3.9.3 Unlikely Events Happen (or: Use Common Sense)

As pointed out in an insightful webcomic (xkcd.com/1132), common sense and outside knowledge should be used when interpreting p-values and hypothesis tests. A small p-value alone (e.g., rejecting H0 at a 5% level) does not mean H0 is definitely false. It does not even mean that H0 is probably false. A p-value below 0.05 is observed 5% of the time even if H0 is true. A 5% chance is somewhat unlikely, but far from rare, especially considering the thousands and thousands of p-values being computed every day. A small p-value just means the data are unlikely to occur if H0 is true, but common sense sometimes suggests that it is even less likely that H0 is false, as in the comic.

Discussion Question 3.2 (equal p-values, equal belief?). Consider three examples from Berger (1985, p. 2), which he attributes to L. J. Savage. First, a person claims to be able to tell whether milk was poured into a cup of tea or tea was poured into milk; in ten trials, the person guessed correctly each time. Second, a music expert claims to be able to tell whether a page of sheet music was written by Mozart or Haydn; in ten trials, the expert guessed correctly each time. Third, your drunk friend claims to be able to predict the outcome of a coin flip (heads or tails) from a fair coin (50% probability of each outcome); in ten trials, your friend is correct each time. Note that each "experiment" has a p-value of 2^(−10) ≈ 0.001, since guessing randomly could only get all correct with that probability (around 1/1000 = 0.1%). After seeing all this data, do you have the same belief about whether each claim is true (i.e., do you think there's the same chance that each claim is true)? Why/why not?

3.9.4 Example of Ignoring Outside Knowledge

The following example illustrates ideas from Sections 3.9.2 and 3.9.3. Table 3.3 shows the setup for a classic example in which frequentist hypothesis testing is misleading. It shows the disease status and test results for 1,000,000 people randomly sampled from the population. Dividing everything by 1,000,000 would yield a joint probability distribution table, but intuition is easier with people instead of probabilities. The table shows that the disease is uncommon, since only 1,000 people have it, i.e., 1/1000 or 0.1%.

Table 3.3: Disease status and test result for 1,000,000 random people.

                           (Don't reject)   (Reject)
                           Test −           Test +      Total
(H0 true)   No disease     949,050          49,950      999,000
(H0 false)  Disease        0                1,000       1,000
            Total          949,050          50,950      1,000,000

Table 3.3 can be interpreted in hypothesis testing terms. The null hypothesis H0 is that somebody does not have the disease. A positive (+) test result means rejecting H0. Table 3.3 shows the type II error rate is zero: when H0 is false, the test is always correct (+). The type I error rate is 5%: when H0 is true, the test is wrong (+) with rate 49,950/999,000 = 5%.

The frequentist properties make the test sound very reliable: the type I error rate is controlled at the conventional 5% level, and the type II error rate is zero.

However, if you're a random person in the population, a positive test result should not make you think you have the disease. Given that you tested positive, you could be in one of two boxes in the second column of the table. That is, either you're one of the 49,950 people who tested positive but didn't have the disease, or one of the 1,000 people who tested positive and did have the disease. Clearly 49,950 is much larger than 1,000, so it's much more likely that you don't have the disease, even though you tested positive. That is, conditional on having tested positive, the probability of actually having the disease is still only 1,000/50,950 = 1.96%, a very low probability.

Put in more general terms: even though our test rejected H0 at a 5% level, it is still much (much) more likely that H0 is true than false.


On the other hand, you may have gotten tested because you had all 17 characteristic symptoms of the disease. In that case, you're not exactly "a random person in the population." Your conclusion would be very different.

The conclusion is not "don't believe positive results" or "don't believe negative results," but rather "think about what else you know, to critically think about results."
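The three rates discussed in this example can be recomputed directly from the counts in Table 3.3. This is a minimal sketch; the object and dimension names are made up for illustration.

```r
# Counts from Table 3.3 (rows: true status; columns: test result)
tab <- matrix(c(949050, 49950,
                     0,  1000),
              nrow=2, byrow=TRUE,
              dimnames=list(truth=c('no disease', 'disease'),
                            test=c('negative', 'positive')))

# Type I error rate: P(test + | no disease), i.e., P(reject | H0 true)
type1 <- tab['no disease', 'positive'] / sum(tab['no disease', ])
# Type II error rate: P(test - | disease), i.e., P(don't reject | H0 false)
type2 <- tab['disease', 'negative'] / sum(tab['disease', ])
# What a tested person cares about: P(disease | test +)
p.disease.given.pos <- tab['disease', 'positive'] / sum(tab[, 'positive'])

c(type1=type1, type2=type2, p.disease.given.pos=p.disease.given.pos)
```

The first two quantities condition on the true disease status (rows), whereas the last conditions on the test result (column), which is why they can tell such different stories.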

3.9.5 Multiple Testing (Multiple Comparisons)

⟹ Kaplan video: Multiple Testing

Another insightful comic (xkcd.com/882) illustrates the multiple testing problem (or multiple comparisons problem). Essentially, the scientists keep testing whether a different color jelly bean (a candy) causes acne (a skin condition) until they finally find p < 0.05 and reject the null hypothesis of "no effect" at a 5% level. Since jelly beans do not actually cause acne (H0 is actually true), this "false positive" should happen roughly 5% of the time, or 1 in 20; knowing this, the comic shows them testing 20 different colors. The multiple testing problem is essentially that if you keep testing and testing, eventually you'll get a false positive, with p < 0.05 and rejecting H0. As an analogy, even though there's a full moon less than 5% of all nights, as long as you keep looking up at the sky every night, eventually you'll see one.
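To get a rough sense of magnitude, the following sketch computes the probability of at least one false positive among several level-5% tests when every H0 is true. It assumes the tests are independent, which is a simplifying assumption made here for illustration and not something the comic or the text specifies.

```r
ALPHA  <- 0.05                 # level of each individual test
ntests <- c(1, 5, 20, 100)     # number of tests run, all with true H0

# Assuming independent tests: P(at least one rejection) = 1 - P(no rejections)
p.any <- 1 - (1 - ALPHA)^ntests
# Expected number of false positives (this part does not need independence)
n.false <- ALPHA * ntests

data.frame(ntests=ntests, p.any=round(p.any, 3), expected.false=n.false)
```

With 20 tests, the chance of at least one false positive under these assumptions is well above one half, even though each individual test errs only 5% of the time.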

Practice 3.8 (research assistants). Imagine you're a powerful professor with a cadre of 100 research assistants (post-docs, grad students, undergrads, your neighbor's precocious high-schooler, etc.). You assign each research assistant (RA) one of 100 variables characterizing different counties in the U.S.: number of tennis courts, average temperature, per capita income, etc. Each RA collects a dataset with their particular variable and computes the correlation with county-level May 2020 COVID-19 rates. Each RA then computes a p-value for the null hypothesis that the correlation is zero. Of the 100 RAs, 5 report statistical significance at a 5% level (p < 0.05), including a significantly positive correlation with tennis courts and negative correlation with temperature. In light of multiple testing, how do you interpret these results?

Discussion Question 3.3 (jelly bean solution). Consider the jelly bean comic from xkcd.com/882. In the comic, they essentially do hypothesis tests with a 5% level, rejecting H0 (no effect) if the p-value is below 0.05. Since they do 20 hypothesis tests, even if each individual test is unlikely to make an error, it's likely that at least one of the 20 tests makes an error.

a) Assuming jelly beans have zero effect, what type of error (I or II) is made by the green jelly bean test?

b) Would it help (i.e., make such an error less likely) to use 1% level tests instead of 5%? Explain why, why not, or how much it might help.

c) Would it be even better to use 0% level hypothesis tests? Explain why or why not.

Discussion Question 3.4 (nova). Consider again the comic from xkcd.com/1132 about the machine that detects if the sun has gone nova (exploded). The null hypothesis H0 is that the sun has not exploded.

a) Does the frequentist statistician correctly compute the p-value and correctly reject H0 at a 5% level? Why/why not?

b) What type of error (I or II) does the Bayesian statistician bet has been made?

c) In this example, explain why it's almost certainly incorrect whenever the machine reports that the sun has exploded.

3.9.6 Publication Bias and Science

The jelly bean comic's final panel illustrates publication bias: the newspaper only reports the exciting positive result, omitting the 19 negative results for the other 19 colors. Not only popular media but even academic journals are more likely to publish positive results, so reading only published results gives a biased perspective.

The jelly bean experiments also illustrate the importance of remembering what "science" means. The result of a single study (even a good one) by itself is not science. The scientific method is a process of replication and repeated testing of hypotheses. If you ever hear, "There was this one new study that found [crazy result]," you can ignore it and wait till it gets replicated at least a few times.⁴

3.9.7 Ignoring Point Estimates (Economic Significance)

Sometimes there is too much emphasis on p-values while the point estimates are ignored. This problem is mostly solved by simply looking at confidence intervals instead of only p-values and hypothesis tests.

While statistical significance (Section 3.8.4) assesses if the effect is statistically distinguishable from zero, economic significance assesses if the effect is "economically" distinguishable from zero. "Economically" just means "for real-world purposes," like whether it is important to consider for policy purposes. One way to think about this is: would you personally care about the difference? For example, imagine θ̂ estimates the effect on your final exam score of studying an additional hour per week. Would you (yes, you) care about having a final exam score that's θ̂ percentage points higher? If θ̂ = 0.01, then no; if θ̂ = 50, then yes. Some other examples: would you care if you had two additional years of education? Would you care if your annual salary were increased by five dollars?

Conceptually, like statistical significance, economic significance is not a binary yes/no, but a continuum of "how much" economic significance. A result could have low economic significance, or extremely high economic significance, or moderate economic significance, etc.

⁴ This is related to the "replication crisis": https://en.wikipedia.org/wiki/Replication_crisis

In practice, unlike with statistical significance, there is no conventional level (like 5%) to mindlessly apply, so you are forced to think critically. (This is good.)

It is important to consider units of measure. For example, imagine the estimated effect on income is θ̂ = 10; is that economically significant? If the units are dollars per hour, then yes; if it's dollars per year, then no; if it's thousands of dollars per month, then yes; etc.

It is also important to consider realistic policy changes. For example, imagine your estimated θ̂ is the effect of a one-unit increase in the proportion of the state budget allocated to higher education. If the current proportion is 0.08 (meaning 8%), then a realistic policy change would be something like 0.02 units. A one-unit increase would mean changing from 0% to 100% of the budget spent on higher education. Possibly θ̂ looks economically significant but 0.02θ̂ does not.

Examples

Simplifying "how much" significance to just low/high, there are four general possibilities: 1) both statistical and economic significance, 2) just statistical, 3) just economic, or 4) neither.

The following are examples of each possibility, in the example where θ is an effect on annual income, measured in dollars per year.

• Both: θ̂ = 20,000, p = 0.001
• Only statistical: θ̂ = 10, p = 0.001
• Only economic: θ̂ = 20,000, p = 0.43
• Neither: θ̂ = 10, p = 0.43

Practice 3.9 (significance: distance and education). You observe a sample of married couples; for each, you observe the difference in their years of education divided by the difference in the distance between their childhood homes and the nearest college or university. That is, if E1 and E2 are the years of education and D1 and D2 are the distances, you observe Y = (E2 − E1)/(D2 − D1). Distance is measured in kilometers (1 km ≈ 0.6 mi). Your estimate of E(Y) is −0.03. The p-value for testing H0: E(Y) = 0 is p = 0.03.

a) Is this estimate economically significant? Hint: consider the units of −0.03.

b) Is this statistically significant? Be precise.

3.9.8 Other Sources of Uncertainty

In practice, there are many sources of uncertainty, only one of which is the "statistical uncertainty" due to having a random sample. For example, there may be uncertainty about assumptions (in later chapters) required to interpret the population parameter in a certain way. There may be uncertainty about how reliable the data is. There may be uncertainty about whether sampling is iid.

The confidence intervals and other inference methods in Section 3.8 only quantify the statistical uncertainty from random sampling, not any other type of uncertainty. Thus, even if you have many other types of uncertainty, a CI could be very short, incorrectly suggesting you should feel very confident in the estimates.

For example, imagine you want to learn the mean household income in Kansas, but you only have data from Missouri. You decide to assume that Kansas has the same mean as Missouri. The Missouri dataset has very large n, so SE(Ȳ_n) ≈ 0. This makes it seem like we have zero uncertainty, but we may be very uncertain about the assumption that Missouri and Kansas are identical. Indeed, if the Kansas mean is very different, then our CI could have near 0% actual coverage probability.

Practice 3.10 (uncertainty in Kansas). Imagine again θ is the mean household income in Kansas, but this time you have data from Kansas. Which of the following could make the actual coverage probability much lower than the nominal level, i.e., make the CI shorter than it would be to fully capture all your uncertainty? Why?

a) You have data from state tax returns, but only 70% of households filed a state tax return, and you worry these 70% may not be fully representative of the population.

b) Your sample size is only n = 10,000, so a different random sample could have resulted in different observed income values.

c) You have survey data from a representative sample, but you doubt people accurately report their household income.

d) You use a CI based on iid sampling, but honestly you just found the data online somewhere and didn't really understand if maybe it was clustered sampling.

e) You have data on individual adult income, which you then multiply by the average number of adults per household, but you only have the average number of adults per household from the year 2013 and think it might have changed.

3.10 Statistical Decision Theory

If you want to incorporate data more formally into your decisions, then you should learn more about statistical decision theory, which is beyond our scope. It is related to the optimal prediction material in Sections 2.4 and 2.5. There is both frequentist and Bayesian statistical decision theory. Either way, hypothesis testing is basically never the best way to make a decision using data. For example, see Berger (1985).

Discussion Question 3.5 (Ebola drug). Imagine you have data for a new drug that tries to cure Ebola, a disease with a high mortality rate. Assume that there are no other treatments available, and that without the drug, an infected individual will die 100% of the time. With the drug, there is a possible side effect of occasional sneezing, and it possibly cures the disease (so the person does not die from it). You have a sample of 10 individuals infected with Ebola and randomly picked 5 to take the experimental new drug and 5 to have no treatment. Of course, of the 5 without the drug, all 5 die. Of the 5 treated, 2 live and 3 die. You input your data into R and run a t-test with command t.test(x=c(0,0,0,0,0), y=c(1,1,0,0,0)), where 0 means dead and 1 means alive. R says the two-sided p-value for testing the null hypothesis that the drug has zero effect on mortality is p = 0.1778. H0 is not rejected even at a 10% level (let alone 5% or 1%), i.e., the result is not statistically significant at a 10% (or 5% or 1%) level.

a) If you then discovered that you had Ebola, would you take the drug? Why/why not? Hint: did R compute the right p-value? What's the probability of 2 people living if the drug actually has zero effect on mortality?

b) What if not everybody died without the drug: the untreated group had 1 person live (among 5), and the treated group had 3 live (among 5), yielding a p-value of 0.2429. Would you take the drug if you were infected? Why/why not? Hint: what does your loss function look like?


Empirical Exercises

Empirical Exercise EE3.1. The data are originally from Card (1995), with individual-level observations of wages, years of education, and other variables.

a. R only: run install.packages(c('wooldridge','survey')) to download and install those packages (if you have not already).

b. Load the card dataset.

R: load package wooldridge with command library(wooldridge), and a data frame variable named card becomes available; the command ?card then shows you details about the dataset.

Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse card, clear

c. Compute the sample average of variable wage.

R: mean(card$wage)

Stata: mean wage (which also computes a 95% confidence interval)

d. Estimate the population mean, accounting for the sampling weights.

R: weighted.mean(x=card$wage, w=card$weight)

Stata: mean wage [pweight=weight] (also computes a 95% CI)

e. R only (since Stata reported this already): compute a two-sided 95% CI for the mean, ignoring weights, with t.test(x=card$wage, conf.level=0.95)

f. R only (since Stata reported this already): compute a two-sided 95% confidence interval for the mean, accounting for weights, first loading the survey package with library(survey) and then with commands

carddes <- svydesign(data=card, weights = ~weight, id = ~1)
svyret <- svymean(x = ~wage, design=carddes)
c(wmean=coef(svyret), SE=SE(svyret), CI=confint(svyret, level=0.95))

g. Compute a weighted 90% confidence interval for wage.

R: replace level=0.95 with level=0.90

Stata: add "option" level(90) to get mean wage [pweight=weight], level(90)

h. Optional: repeat computation of a point estimate and 95% confidence interval (without and with weights) for the mean of a different variable in the dataset.

R: part (c) computes the unweighted point estimate, part (d) computes the weighted point estimate, part (e) computes the unweighted CI, and part (f) computes the weighted CI.

Stata: part (c) computes both the unweighted point estimate and unweighted CI, and part (d) computes both the weighted point estimate and weighted CI.

Chapter 4

One Variable, Two Populations

Depends on: Chapters 2 and 3

4.2 Describe and distinguish among descriptive, predictive, and causal questions, and among different approaches to learning about causality from data in economics [TLOs 3, 5, and 6]

4.3 Describe and interpret the elements of the primary statistical framework for understanding causality [TLO 3]

4.4 Assess whether a mean difference can be interpreted with causal meaning in a real-world example [TLO 6]

4.5 In R (or Stata): compute estimates of mean differences, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]


• Structural and reduced form approaches: Lewbel (2019)
• Potential outcomes and SUTVA (Wikipedia)
• Causal inference intro (Masten video)
• Correlation vs. causation (Masten video)
• ATE (Masten video)
• Individual causal effects (Masten video)
• Potential outcomes example (Masten video)
• Counterfactuals (Masten video)
• Randomized experiments (Masten video)
• SUTVA and spillovers (Masten video)
• Empirical example: property rights effect (Masten video)
• Structural modeling advantages (Masten video)
• Potential outcomes and confounding (Lambert video)

With two populations, we can discuss not only description and prediction, but also causality. Foundational ideas introduced here are extended to regression in Part II.

Discussion Question 4.1 (DPC with two populations). Let Y denote the hourly wage of an individual in the U.S. Let Y^A be the wage of an individual without a college degree in the U.S., and Y^B the wage of an individual with a college degree.

a) How are means E(Y^A) and E(Y^B) more helpful for description than only E(Y)?

b) How could E(Y^A) and E(Y^B) be used to make better predictions than only E(Y)?

c) Why can't we interpret E(Y^B) − E(Y^A) as the causal effect of a college degree on wage? Hint: what other factors might make E(Y^B) − E(Y^A) large even if the effect of a college degree itself is small?

4.1 Description

4.1.1 Population Mean Difference

Let Y^A and Y^B be random variables representing Y (e.g., income) for two populations (labeled A and B). For example, if Y is income, A is the population of individuals without a high-school degree, and B is the population of individuals with a high-school degree, then Y^A is income for individuals who do not have a high-school degree, and Y^B is income for those who do.

The difference of means is E(Y^B) − E(Y^A). It describes how much higher (or lower, if negative) the mean is in population B than in population A.

For example, let Y ∈ {0, 1, 2} be the number of kids per family. Let the distributions in populations A and B be, respectively,

\[ P(Y^A = 0) = 0.8, \quad P(Y^A = 1) = 0.2, \quad P(Y^A = 2) = 0, \]
\[ P(Y^B = 0) = P(Y^B = 1) = P(Y^B = 2) = 1/3, \tag{4.1} \]

where Y^A represents the number of kids per family in population A, and Y^B represents the number of kids per family in population B.

Then

\[
\begin{aligned}
E(Y^B) - E(Y^A)
  &= \sum_{y=0}^{2} y \, P(Y^B = y) - \sum_{y=0}^{2} y \, P(Y^A = y) \\
  &= [(0)(1/3) + (1)(1/3) + (2)(1/3)] - [(0)(0.8) + (1)(0.2) + (2)(0)] \\
  &= [(1/3) + (2/3)] - 0.2 = 0.8 .
\end{aligned}
\]

Always clarify whether you are subtracting the mean of population A from that of population B, or B from A. If you say, "The difference in mean number of children between the populations is 0.8," it is unclear which population has a larger mean. Instead, say, "The mean number of children in population B is 0.8 higher than the mean in population A."

The difference of means is also the mean of the differences. "Mean difference" could mean either; they're equal anyway. Because of the linearity of the expectation operator, as in (2.21),

\[ E(Y^B - Y^A) = E(Y^B) - E(Y^A). \tag{4.2} \]

Despite mathematical equality, the interpretation differs. For example, the expression Y^B − Y^A is the number of children difference between a family from population B and a family from population A. Seeing Y^B and Y^A as random variables, the difference Y^B − Y^A is itself a random variable. Thus, E(Y^B − Y^A) is the population mean of the child number difference Y^B − Y^A, whereas E(Y^B) − E(Y^A) is the difference between the mean number of children in B and the mean number of children in A. Generally, due to (4.2), either interpretation of the mean difference is correct: the same population value has two interpretations. It's like if one person says, "The glass is half full of water," and a second person says, "The glass is half empty": both are correct interpretations of the same thing.
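The kids-per-family example can be checked numerically both ways: as a difference of means and as a mean of differences. To compute the latter, the sketch below assumes Y^A and Y^B are independent so that a joint distribution can be written down; that independence is an assumption made here purely for illustration, since linearity of expectation holds for any joint distribution.

```r
y  <- 0:2               # possible numbers of kids
pA <- c(0.8, 0.2, 0)    # P(Y^A = 0), P(Y^A = 1), P(Y^A = 2)
pB <- rep(1/3, 3)       # P(Y^B = 0), P(Y^B = 1), P(Y^B = 2)

# Difference of means: E(Y^B) - E(Y^A)
diff.of.means <- sum(y*pB) - sum(y*pA)

# Mean of differences: E(Y^B - Y^A), using an assumed independent joint pmf
joint <- outer(pB, pA)       # joint[i,j] = P(Y^B = y[i]) * P(Y^A = y[j])
diffs <- outer(y, y, '-')    # diffs[i,j] = y[i] - y[j]
mean.of.diffs <- sum(diffs * joint)

c(diff.of.means=diff.of.means, mean.of.diffs=mean.of.diffs)
```

Both computations return the same value, the 0.8 derived above.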

4.1.2 Estimation

Simply estimate the means separately (see Section 3.4) and take the difference, like Ȳ^B − Ȳ^A with iid data. If each individual estimator is consistent, then this is a consistent estimator of E(Y^B) − E(Y^A), and thus a consistent estimator of E(Y^B − Y^A) due to (4.2).

The following code estimates E(Y^B) − E(Y^A) from simulated iid samples, suggesting population B has a lower mean.

set.seed(112358)  # for replicability
# Sample A: 40 draws, uniform on 10,...,20
YA <- 0+sample(x=10:20, size=40, replace=TRUE, prob=rep(1/11,11))
# Sample B: 30 draws on 2,...,27, with smaller values more likely
YB <- 2+sample(x= 0:25, size=30, replace=TRUE, prob=26:1/sum(1:26))
mean(YB) - mean(YA)  # estimated mean difference, B minus A

[1] -2.99


4.1.3 Quantifying Uncertainty

The same approaches (and warnings) from Sections 3.8 and 3.9 apply to θ = E(Y^B) − E(Y^A).

The following R code shows 95% confidence intervals for the mean difference with iid data. Note that with x=YA and y=YB, t.test() reports the CI for the mean of A minus the mean of B, so the signs are flipped relative to the estimate in Section 4.1.2.

set.seed(112358)  # for replicability
YA <- 0+sample(x=10:20, size=40, replace=TRUE, prob=rep(1/11,11))
YB <- 2+sample(x= 0:25, size=30, replace=TRUE, prob=26:1/sum(1:26))
# 95% CI for mean diff
round(t.test(x=YA, y=YB, alternative='two.sided',
             mu=0, conf.level=0.95)$conf.int[1:2], digits=2)

[1] 0.44 5.55

4.2 Prediction

Prediction is essentially the same as with one population. Given a loss function, an optimal predictor can be defined to minimize mean loss in the population, and this optimal predictor can be estimated from data. For example, mean quadratic loss is minimized by the population mean, and the means E(Y^A) and E(Y^B) can be estimated by (weighted) sample means.

Prediction accuracy improves by distinguishing between individuals (or firms, etc.) from population A and those from population B. For example, at your carnival job, imagine you now guess people's height instead of age. In Chapter 2, you made the same guess for everybody. Now, we consider two populations, like child and adult (assuming this is observable). Now we can make a different prediction for each population, like 165 cm for adults and 105 cm for children. This should perform better than guessing 135 cm for every individual.
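To see how much the group-specific guesses could help under quadratic loss, here is a small sketch. The group means come from the example above; the equal population shares (consistent with the pooled guess of 135 cm) and the within-group standard deviation of 10 cm are hypothetical values assumed only for illustration.

```r
mu    <- c(child=105, adult=165)   # group means (cm), from the example
share <- c(child=0.5, adult=0.5)   # ASSUMED population shares
sd.within <- 10                    # ASSUMED within-group SD (cm), same in both

pooled <- sum(share * mu)          # single best guess for everybody

# Mean quadratic loss when predicting each group's own mean:
# only the within-group variance remains
loss.group <- sd.within^2
# Mean quadratic loss when predicting the pooled mean for everybody:
# within-group variance plus between-group variance
loss.pooled <- sd.within^2 + sum(share * (mu - pooled)^2)

c(pooled=pooled, loss.group=loss.group, loss.pooled=loss.pooled)
```

Under these assumed values, most of the pooled guess's loss comes from ignoring the difference between the two group means.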

Part II extends this idea, exploring how regression models can incorporate additional information to improve prediction accuracy.

4.3 Causality: Overview

The concepts in the remainder of this chapter appear often in later chapters.

First, when is causality important, rather than description or prediction? We each have an innate sense of cause and effect. Trying to articulate it in language sometimes creates more confusion than understanding.¹ For example, start reading the Wikipedia page on causality and see how you feel in 10 minutes. Unlike description and prediction, causality is about "why." A "cause" is the "because" of the effect. Description helps us see which variables tend to have high or low values together. Prediction helps us guess one variable's value based on other information. But only causality concerns why. Why do these two variables tend to have similar values? Only causality (not description or prediction) helps inform policy decisions: we want to know how a policy change itself influences other variables, causing them to change.

¹ Some of my failed attempts include "causality is about what will happen if a policy changes" (but isn't "what will happen" prediction?) and "description is seeing how things are" (but aren't causal relationships also "how things are"?).

Discussion Question 4.2 (description, prediction, causality). Which type of question (description, prediction, causality) is each of the following? Explain why. Hint: there's one of each.

a) If you only know whether an individual is from Canada or the US, what is your best guess of their income?

b) You are currently working in the US but considering moving to Canada. How will your income change if you do?

c) Which country's population has higher income: Canada or the US?

4.3.1 Correlation Does Not Imply Causation

⟹ Kaplan video: Correlation Does Not Imply Causation

Generally, imagine \(E(Y^B) > E(Y^A)\). This shows a clear descriptive relationship: population B has a higher mean. The implication for prediction is clear: under quadratic loss, the optimal prediction is higher for population B than A. In contrast, the implication for causality is not clear. It's possible that being in population B has a positive causal effect on the outcome variable. But it's also possible that people with large \(Y\) choose to join population B. Or maybe there is something else altogether that separately causes people to join population B and have high \(Y\). Or maybe all of these. The causal interpretation of \(E(Y^B) > E(Y^A)\) is ambiguous.

For example, consider rainfall and umbrellas. Let \(Y^A\) denote rainfall when nobody is carrying an umbrella, and \(Y^B\) rainfall when everybody is carrying an umbrella. For description: it rains more on days when everyone carries an umbrella than on days when nobody does, e.g., \(E(Y^B) > E(Y^A)\). For prediction: it's better to predict a higher rainfall value if you see everyone carrying an umbrella than if you see no umbrellas; e.g., under quadratic loss, the optimal predictions are \(E(Y^B)\) and \(E(Y^A)\). For causality: if there's a drought and we want rain, should we all walk around with umbrellas to cause it to rain? No: rain causes umbrella-carrying, not vice-versa.

Consider another example with a different type of causal relationship. Let \(Y^A\) be my commute time when nobody is carrying umbrellas, and let \(Y^B\) be my commute time when everyone is carrying umbrellas. Descriptively, \(E(Y^B) > E(Y^A)\), and you should predict a longer commute time if you see everybody has an umbrella. But causally, this doesn't mean that you can make me late for class by opening lots of umbrellas. Rain is a confounder that has a causal effect on both umbrella-carrying and commute time, as depicted in Figure 4.1.

[Figure 4.1: Causal relationships among rain, umbrella-carrying, and commute time. Diagram: Rain has a positive (+) causal effect on Umbrellas and a positive (+) causal effect on Commute Time.]

These examples illustrate the famous saying, "correlation does not imply causation."² The saying is a bit imprecise: correlation does indeed imply some sort of causal relationship, just not any one particular type of causal relationship. In the first example, "correlation does not imply causation" means that "higher rainfall when people carry umbrellas" (rain is correlated with umbrellas) does not imply "carrying umbrellas causes rain." But the correlation is ultimately driven by a causal relationship: rain causes umbrella-carrying. In the second example, "correlation does not imply causation" means that "longer commute when people carry umbrellas" (commute time is correlated with umbrellas) does not imply "carrying umbrellas causes longer commutes." But the correlation is ultimately driven by causal relationships: rain causes both umbrella-carrying and longer commutes.
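The confounding story in the commute example can be mimicked with simulated data. In this R sketch, rain raises both the probability of carrying an umbrella and the commute time, while umbrellas have no effect on commute time by construction. The variable names and all numbers are hypothetical, chosen only for illustration.

```r
# Hypothetical simulation: rain is a confounder (all numbers are assumptions)
set.seed(2)
n <- 5000
rain <- rbinom(n, size=1, prob=0.3)                       # 1 = rainy day
# Rain causes umbrella-carrying: 90% on rainy days, 5% on dry days
umbrella <- rbinom(n, size=1, prob=ifelse(rain==1, 0.90, 0.05))
# Rain (not umbrellas) lengthens the commute by 10 minutes on average
commute <- 20 + 10*rain + rnorm(n, mean=0, sd=3)

# Descriptive mean difference: umbrella days vs. no-umbrella days
diff.all <- mean(commute[umbrella==1]) - mean(commute[umbrella==0])
# Same comparison within dry days only, and within rainy days only
diff.dry <- mean(commute[umbrella==1 & rain==0]) -
            mean(commute[umbrella==0 & rain==0])
diff.wet <- mean(commute[umbrella==1 & rain==1]) -
            mean(commute[umbrella==0 & rain==1])
round(c(all=diff.all, dry=diff.dry, wet=diff.wet), 2)
```

By construction, the overall mean difference is large, while the differences within dry days and within rainy days are near zero (up to sampling error), since umbrellas have no causal effect on commute time here.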

Although common sense helps us see that umbrellas cannot cause longer commutes, similar arguments are often made. For example, in the 2018 August election in Missouri, a "right-to-work" proposition appeared on the ballot. Very roughly speaking, such laws restrict the power of unions to collect certain fees from certain employees, but the following discussion about causality does not depend on the details.³ Before the election, one mailer ad opposing right-to-work said something like, "Do you want $8000 less in your pocket each year?" The implication is that were the law to pass, the causal effect would be a decrease in annual income of $8000/yr. According to the ad's footnote, this $8000/yr was computed as the difference in workers' mean annual income between states that had a right-to-work law and those that did not, i.e., an estimate of \(E(Y^B) - E(Y^A)\). Recall that \(E(Y^B) - E(Y^A) \ne 0\) in the example with umbrellas and commute time, too, but we did not conclude that umbrellas have a causal effect on commute time.⁴ For example, maybe having lower income causes states to pass such laws, i.e., causality is in the opposite direction (reverse causality). Or maybe there is a third, unobserved characteristic that causes states to pass such laws and causes lower income, i.e., a confounder, like rain in the commute example. Of course, it is also possible that $8000/yr really is the causal effect. The point is not that the number is right or wrong (or that the law is good or bad), but that the econometric argument is incomplete. Additional assumptions are required to interpret a mean difference as a causal effect. Such assumptions are discussed more in Section 4.6.

² https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

³ But if you're curious: https://en.wikipedia.org/wiki/Right-to-work_law

⁴ Just to clarify: this is not a discussion of whether the law itself is good or bad, or whether the groups supporting or opposing the law are good or bad; ads endorsing right-to-work also made errors, just not illustrative econometric errors.

4.3.2 Structural and Reduced Form Approaches

There are two econometric approaches to learning about causality: the reduced form approach and the structural approach. Confusingly, the reduced form approach is sometimes called causal inference, even though the structural approach also aims to learn about causality. Also confusingly, the reduced form approach is commonly associated with program evaluation, like assessing the effects of a job training program or welfare program, but the structural approach could also be used.

Both approaches consider counterfactual analysis, but in different ways. Broadly, a counterfactual is a universe that's different than our actual universe. Usually, the counterfactual universe is nearly identical to our actual universe, except for one particular policy whose effect we want to learn. The reduced form approach often considers the counterfactual in which a real policy change never happened, e.g., what would the unemployment rate have been if the minimum wage had not increased by $2/hr, which in reality it actually did? The structural approach often tries to learn about underlying causal economic mechanisms, to be able to analyze policies beyond what we have seen historically. For example, maybe the sales tax has never been above 10%, but we want to learn about what might happen if we raise sales tax to 12%.

The reduced form approach tries to isolate causal effects by using comparisons that are either randomized or "as good as randomized." In our current context of populations A and B, randomized would mean that units (e.g., individuals, firms, hospitals) are randomly assigned to a population, without regard to the units' characteristics. The "treated" population would receive some special treatment that the "untreated" ("control") population does not. "As good as randomized" would mean that although we did not explicitly randomly assign units to each population, the actual assignment mechanism did not depend on units' characteristics anyway. These situations are often called natural experiments; they often arise from unexpected weather or disease outbreaks (like COVID-19) or capricious political decisions. Sometimes other variables are used to help find these "as good as randomized" comparisons or to reduce statistical uncertainty, but the core methodology remains the comparison of treated and untreated units to estimate the causal effect of a particular "treatment" (a variable or policy). The actual underlying causal mechanisms that produce the effect are not modeled, like a black box.

In contrast, the structural approach tries to explicitly model the inner workings of causal systems. Some "structural" models are not particularly detailed, but some involve models of decision-making (like expected utility maximization), models of market equilibria, models from game theory, and other economic theory. Consequently, the structural approach tries to estimate structural parameters (or "deep parameters") that govern economic behavior and equilibria, like elasticities, discount factors, risk aversion, and demand curves. The hope is that all this "structure" would not change under the set of policies being considered, i.e., that the policies may change the values of certain variables but not change these underlying economic relationships.

The structural and reduced form approaches have complementary advantages, and often both are helpful; e.g., see the survey by Lewbel (2019). Both have contributed to our understanding of economics. Often structural models require stronger (less realistic) assumptions, but in return they can analyze a wider variety of possible policies. Conversely, the reduced form approach often has the advantage in analyzing the effects of existing policies, but it is more difficult to extrapolate to hypothetical policies.

For example, consider the relationship between a woman's education and fertility (number of children born). A reduced form approach might try to find a group of women with college degrees and a group without, where it seemed "as good as random" who had a degree. This is difficult because usually going to college is a carefully considered decision and not one that we can force others to make in a randomized fashion, but there may be peculiar situations like a need-based college scholarship that had to be randomized because too many people applied, or the Cultural Revolution that suddenly shut down universities in 1966. A structural approach might try to model the different factors affecting fertility choice and the different channels through which education could have an effect. This is difficult because it is a very complex decision with many variables involved. The benefit is the ability to consider more possible policies with more nuance; for example, the effect on fertility of more women attending college may depend on whether the increased education is due to a reduced price of college (e.g., due to government subsidy), or greater incentive due to higher salaries for jobs requiring college education, or some other reason.

In Sum: Structural & Reduced Form Approaches
Reduced form: randomized or "as good as randomized" comparisons to isolate causality
Structural: more explicit economic models of causal relationships


4.3.3 General Equilibrium and Partial Equilibrium

Besides structural vs. reduced form, another dichotomy is between general equilibrium (GE) and partial equilibrium (PE) analysis. GE more ambitiously tries to model entire markets, sometimes multiple markets, whereas PE takes current market equilibria as given. Similar to the tradeoff between the structural and reduced form approaches, the tradeoff is that the GE framework can analyze policies that change equilibria (i.e., that have general equilibrium effects), but it requires stronger assumptions to do so.

For example, imagine you were analyzing the impact of free public childcare on mothers' employment. A PE analysis would consider how mothers might respond to different childcare policies, given the current prices of private childcare, current wages, etc. A GE analysis might further model the childcare and labor markets, to allow for the possible general equilibrium effects of public childcare policy on the prices in those markets. If there is a big expansion of free public childcare, then private childcare providers may indeed change their prices. If the expansion allows many mothers to enter the workforce, then the labor supply curve shifts out, which could lower wages. However, if the proposed changes to childcare policy are relatively small, then such GE effects may be negligible, and PE analysis may suffice.

The famous Lucas critique (Lucas, 1976) argues in part that macroeconomic policy analysis requires structural GE models. Lucas writes (p. 41), "Given that the structure of an econometric model consists of optimal decision rules of economic agents, and that optimal decision rules vary systematically with changes in the structure of series relevant to the decision maker, it follows that any change in policy will systematically alter the structure of econometric models." If we want to guess how people and firms will behave in the future under new macroeconomic policies, we have to account for GE effects, which requires deeper structural understanding and modeling of economic behavior.

In Sum: General & Partial Equilibrium Models
Partial equilibrium models treat prices and other market equilibria as fixed, whereas general equilibrium models allow markets to change.

4.4 Potential Outcomes Framework

⟹ Kaplan video: Potential Outcomes and the ATE

The reduced form approach uses the potential outcomes framework, also called the Neyman–Rubin causal model after its two earliest contributors (although sometimes Neyman's name is dropped). It is popular not only in economics but statistics, medicine, political science, and other fields.


The terms treatment and treatment effect just refer to any variable and its causal effect on another variable. In English, usually "treatment" makes us think narrowly about medicine (or lumber, and facials), but it can be anything. For example, the "treatment" could be a job training program, and the "treatment effect" is the causal effect of the program on a person's wage. Or, a treatment could be going to a charter school (instead of public school). Another treatment could be a policy or law, like a higher sales tax or a certain labor law.

This section says "individual" to be more concrete, but you can also imagine a firm, county, school, etc.

In Sum: Causality in Potential Outcomes Framework
Treatment effect: the difference in outcomes between parallel universes identical except for treatment

4.4.1 Potential Outcomes

Imagine two parallel universes. The universes are identical except for one difference: whether or not an individual is treated. The individual's outcome in the universe without treatment is their untreated potential outcome, and the individual's outcome in the universe with treatment is their treated potential outcome.

Notationally, in this chapter, \(Y^T\) represents the treated potential outcome, and \(Y^U\) the untreated potential outcome. Elsewhere, often \(Y_1\) and \(Y_0\) represent the treated and untreated potential outcomes, or \(Y(1)\) and \(Y(0)\).

For example, consider parallel universes identical except for whether a particular student takes introductory econometrics (this class) or Intro Stat II (STAT 3500). Literally everything else in each universe is identical: the student's parents, her other classes, her height, her DNA, the weather on October 14, etc. (For now, some difficulties with "everything" are glossed over; e.g., what if econometrics is required for her degree?) The "treatment" is taking econometrics (instead of statistics). The outcome variable is the student's annual income five years after graduation, in thousands of US dollars per year (e.g., \(Y = 70\) is $70000/yr). Let \(Y^U\) denote her outcome in the universe without treatment (STAT 3500), and \(Y^T\) her outcome in the universe with treatment (econometrics). That is, \(Y^T\) is her treated potential outcome, and \(Y^U\) is her untreated potential outcome.

Unlike in Section 4.1, potential outcomes \(Y^U\) and \(Y^T\) are not always observable. In the above example, if a student takes STAT 3500, then we can observe her untreated potential outcome \(Y^U\) but not \(Y^T\); conversely, if she takes econometrics, then her treated potential outcome \(Y^T\) is observable, but not \(Y^U\). This partial observability makes causal inference more difficult than description or prediction.
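The partial observability can be seen in a tiny made-up data set. In this R sketch, the five students, their potential outcomes, and the treatment indicator `D` are all hypothetical; in real data we would never see both `YU` and `YT` for the same student.

```r
# Hypothetical potential outcomes for five students (income in $1000/yr)
YU <- c(60, 70, 55, 80, 65)   # untreated potential outcome (STAT 3500)
YT <- c(65, 70, 60, 78, 72)   # treated potential outcome (econometrics)
D  <- c(1, 0, 0, 1, 1)        # 1 = actually took econometrics, 0 = STAT 3500

# Observed outcome: YT if actually treated, YU otherwise
Y <- ifelse(D==1, YT, YU)

# What a real data set would contain: the other potential outcome is missing (NA)
obs <- data.frame(D=D, Y=Y,
                  YU.observed=ifelse(D==0, YU, NA),
                  YT.observed=ifelse(D==1, YT, NA))
obs
```

Each row has exactly one missing potential outcome, so the individual treatment effect `YT - YU` cannot be computed from the observed columns alone.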

Consider some other potential outcomes examples. In the right-to-work example, \(Y^T\) is an individual's income in the universe where the individual's state has a right-to-work law, and \(Y^U\) is their income in the universe that's identical except there is no such law. In our universe, either the individual's state does or does not currently have such a law; it cannot be both, so we cannot observe both potential outcomes. (Perhaps the state did not have the law last year and does this year, but the universe "last year" was different in many ways than the universe "this year"; much more than one single law has changed.)

As another example, imagine universe B is where a student wins the lottery to enter a popular charter school, and universe A is where the student remains in the conventional public school. Potential outcomes \(Y^T\) and \(Y^U\) are dummy (binary) variables for whether or not the student eventually graduated from college in each universe. Again, in our universe, we can observe \(Y^T\) if the student wins the lottery and \(Y^U\) if not, but we cannot observe both.

4.4.2 Treatment Effects

The difference \(Y^T - Y^U\) between an individual's two potential outcomes is their treatment effect. Just as different individuals can have different \((Y^U, Y^T)\), individuals can have different treatment effects \(Y^T - Y^U\), i.e., individuals can be affected differently by the same treatment.

In the intro econometrics example, the student's treatment effect \(Y^T - Y^U\) has the following interpretation. Recall \(Y^T\) is their income after taking econometrics, and \(Y^U\) after instead taking STAT 3500. Thus, that particular student's treatment effect is how much higher (or lower, if negative) their income is in the parallel universe that is identical other than taking econometrics instead of STAT 3500.

In the right-to-work example, \(Y^T - Y^U\) is the treatment effect of the law on an individual's income. The interpretation now is the difference between their income in the universe with the law and the universe without the law, with everything else held constant. The treatment effect can be big or small, positive or negative (or zero). A numerical example is shown later in Table 4.2.

In the charter school example, \(Y^T - Y^U\) is the treatment effect of the charter school on college graduation. That is, it is the difference between the college graduation outcomes in the charter school universe and the public school universe. Since the outcome is binary (1 if graduate college, 0 if don't), there are only four possible values of \((Y^U, Y^T)\) (student types), and only three possible treatment effect values: \(Y^T - Y^U = 1\) if the student graduates in the charter school universe (\(Y^T = 1\)) but not the public school universe (\(Y^U = 0\)); \(Y^T - Y^U = -1\) if they only graduate in the public school universe (\(Y^U = 1\)) but not the charter school universe (\(Y^T = 0\)); and \(Y^T - Y^U = 0\) if they graduate either in both universes (\(Y^T = Y^U = 1\)) or neither (\(Y^T = Y^U = 0\)). This is seen in the later example of Table 4.1.

In all examples, the potential outcomes and treatment effects may be different for different individuals. For example, econometrics may be much better for some students but only a little better for others; right-to-work may help certain workers but hurt others; the charter school may make the difference for some students to graduate college, but others would have graduated either way. The fancy term for people being different is heterogeneity; more specifically here, "treatment effect heterogeneity."

In economics, where many systems are interrelated, sometimes it's difficult just to specify what "effect" we care about. For example, consider racial differences in salary. In the parallel universe that's "identical" except for the individual's race, does "identical" include having the same job at the same firm? Or does it allow for an effect of race on hiring? Does it allow for an effect on educational opportunities, or an effect on family background (parents' education, wealth, etc.)? There is no "right" or "wrong" specification, but each answers a different question.

4.4.3 SUTVA

SUTVA Definition

The potential outcomes definition of causality relies critically on the stable unit treatment value assumption (SUTVA), which has two parts.

The first part of SUTVA is that every treated individual receives the same treatment. This seems true in the right-to-work example: the same law applies (or doesn't) to everybody equally. This also seems true in the charter school example, but with more nuance. Two students may go to the same school but have very different experiences, like different teachers, different classmates, different electives, and different extra-curricular activities. Even if we say these two students nominally have the "same treatment," we should expect a lot of heterogeneity, and we should expect treatment effects to change every year as the school adds or removes (or changes) its teachers, its students, its elective class offerings, and its extra-curricular activities. As another ambiguous example, if there's a one-on-one mentoring program to help teen parents, but of course there are many different mentors, is every teen parent receiving the "same treatment"?

The second part of SUTVA is the no interference assumption. This assumes that one person's treatment (or non-treatment) does not affect the potential outcomes of any other person. This often makes sense for medical treatments (e.g., doing surgery on me doesn't affect your health), but it requires careful thought in economics, where often individuals are interacting, either personally or through markets. In the charter school example, if a student's success depends on being surrounded by other highly motivated students (or not), then SUTVA (specifically no interference) is violated. That is, one student's outcome depends on whether the other motivated students are in the same school (whether charter or not), i.e., depends on the other students' "treatment."

The "same treatment" ambiguity also relates to the structural and reduced form differences in Section 4.3.2. In the charter school example, the structural critique would be that learning "the effect" of "going to charter school B" last year is not particularly helpful for guiding educational policy if we can't confidently extrapolate from charter school B last year to charter school B next year, let alone extrapolate to other charter schools, let alone understand why the effects are positive or negative (e.g., is it because of teachers, or because of better classmates, or more electives and activities?). The reduced form rebuttal would be that at least they can (sometimes) be pretty confident about their assessment of a particular school in a particular year, whereas trying to explicitly model the effects of teachers and classmates and classes and activities will result in models nobody believes anyway. Hopefully, we could learn more by trying both approaches than giving up and trying neither.

SUTVA Violations

SUTVA can be violated in many ways, especially in economics. This is not about sampling or randomization or data; it is about the potential outcomes framework itself. Even if SUTVA is satisfied and treatment effects are well-defined, it is possible to have problems with randomization that make it impossible to actually learn about treatment effects. Conversely, even if there is a perfectly designed randomized experiment, SUTVA could be violated, in which case it may be unclear what "treatment effect" even means.

One violation of SUTVA is from spillover effects. For example, if the treatment provides helpful information (e.g., about financial planning, or social services, or risk probabilities), treated individuals may share such information with their untreated friends. That is, the benefit of the treatment "spills over" into untreated individuals. This could be true even if the treatment (information or otherwise) isn't directly shared. For example, if the provided information leads to less binge drinking among treated individuals, this may reduce social pressure, which results in less binge drinking among untreated individuals, even if they did not receive the "treatment" information. Or, if some treatment helps half the students in a classroom, their improvement itself may benefit their untreated classmates.

Another violation of SUTVA is from general equilibrium effects (Section 4.3.3). For example, maybe the treatment is a new agricultural technology hoping to increase cacao farmers' earnings. If only one farmer gets this treatment (technology), then she benefits from increased production, selling more cacao at the current global price. But if all farmers in the world get the technology, then the global cacao supply curve shifts, and the price drops. Thus, each farmer's untreated and treated potential outcomes (earnings) are affected by all other farmers' treatment, which affects the market equilibrium price. Other general equilibrium effects could come through other markets. For example, a treatment affecting workers might affect the labor market as a whole (and thus wages). Or, subsidies for housing or education could affect supply and demand (and thus prices) in those markets.

There can be yet other ways SUTVA is violated, either from not having the same treatment or from interactions that violate "no interference."

Discussion Question 4.3 (cash transfer spillovers). Consider the effect of income on food consumption (\(Y\)) in a rural village. Consider an "unconditional cash transfer" program (like GiveDirectly) that (potentially) gives the equivalent of $1000 to a treated individual. Describe different possible spillover effects that would violate SUTVA.

Sometimes one perspective of a treatment leads to SUTVA violations, but another does not. In the classroom example, SUTVA was violated by spillover effects from treated students to untreated students in the same class. Alternatively, if each classroom is treated or untreated (i.e., all students treated, or all not), then there is less possibility of spillover. In principle, even entire schools could be assigned as treated or untreated, further reducing spillovers.

In other cases, you may actually want to learn about spillover effects as part of the overall effect of a policy. For example, if the student-level treatment is specifically for students with certain special needs, then we probably care about its effect on both the treated and untreated students. (Further, it would be impossible to treat everyone in a school since the treatment is only appropriate for certain types of student.)

In deciding which perspective is best, it is helpful to think about the actual policy question: what is the potential policy that could actually be adopted in reality?

4.5 Average Treatment Effect

⟹ Kaplan video: Potential Outcomes and the ATE (again)

Although the full distribution of potential outcomes \((Y^U, Y^T)\) contains the most information, usually only certain summary features are studied. Although summary features like standard deviations and percentiles are interesting, we'll focus on means.

4.5.1 Definition and Interpretation

The average treatment effect (ATE) is \(E(Y^T - Y^U)\). "Average" refers to the population mean, while "treatment effect" refers to \(Y^T - Y^U\). Thus, the ATE may be interpreted as the probability-weighted average (mean) of all possible individual treatment effects in the population. Another name for the ATE is the average causal effect (ACE), but I use ATE to emphasize that this concept is from the potential outcomes framework.

The ATE has another interpretation. Using linearity as in (2.21),

\[ E(Y^T - Y^U) = E(Y^T) - E(Y^U). \tag{4.3} \]


Here, \(E(Y^T)\) is the mean treated potential outcome, and \(E(Y^U)\) is the mean untreated potential outcome. Similar to Section 4.1, \(E(Y^T) - E(Y^U)\) is a mean difference, here between the treated and untreated potential outcome distributions. This could be rephrased as "the treatment effect on the mean outcome": treatment causes the mean outcome to change from \(E(Y^U)\) to \(E(Y^T)\).

4.5.2 ATE Examples

Table 4.1 shows a numerical version of the charter school example. The four student "types" refer to the four possible values of \((Y^U, Y^T)\), and each type has its own probability. Given the probabilities, the mean untreated outcome \(E(Y^U)\), mean treated outcome \(E(Y^T)\), and ATE \(E(Y^T - Y^U)\) are computed using (2.16):

\[ E(Y^U) = (0.3)(0) + (0.3)(0) + (0.1)(1) + (0.3)(1) = 0.4, \tag{4.4} \]

\[ E(Y^T) = (0.3)(0) + (0.3)(1) + (0.1)(0) + (0.3)(1) = 0.6, \tag{4.5} \]

\[ E(Y^T - Y^U) = (0.3)(0) + (0.3)(1) + (0.1)(-1) + (0.3)(0) = 0.2. \tag{4.6} \]

To verify (4.3),

\[ E(Y^T - Y^U) = 0.2 = 0.6 - 0.4 = E(Y^T) - E(Y^U). \tag{4.7} \]
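The same arithmetic can be checked in R. This sketch simply types in the probabilities and potential outcomes from Table 4.1 and computes the probability-weighted means; the variable names are arbitrary choices for illustration.

```r
# Charter school example (Table 4.1): four student types
p  <- c(0.3, 0.3, 0.1, 0.3)   # probability of each type
YU <- c(0, 0, 1, 1)           # untreated potential outcome (public school)
YT <- c(0, 1, 0, 1)           # treated potential outcome (charter school)

# Probability-weighted means
EYU <- sum(p * YU)            # mean untreated potential outcome
EYT <- sum(p * YT)            # mean treated potential outcome
ATE <- sum(p * (YT - YU))     # mean of the individual treatment effects

# The ATE equals the difference of the two means, as in (4.3)
c(EYU=EYU, EYT=EYT, ATE=ATE, EYT.minus.EYU=EYT-EYU)
```

Replacing `p`, `YU`, and `YT` with the values from Table 4.2 checks the right-to-work example in the same way.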

Table 4.1: Charter school example: population of potential outcomes and ATE.

Student type   Probability   Y^U   Y^T   Y^T - Y^U
1              0.3           0     0      0
2              0.3           0     1      1
3              0.1           1     0     -1
4              0.3           1     1      0
Mean                         0.4   0.6    0.2

Table 4.2: Right-to-work example: population of potential outcomes and ATE.

Worker type   Probability   Y^U ($/yr)   Y^T ($/yr)   Y^T - Y^U ($/yr)
1             0.5           40000        41000         1000
2             0.2           40000        38000        -2000
3             0.2           50000        51000         1000
4             0.1           50000        47000        -3000
Mean                        43000        43000            0

Table 4.2 shows a numerical version of the right-to-work example. Each worker "type" corresponds to a different value of \((Y^U, Y^T)\), each type with its own probability. Given the probabilities, the mean untreated outcome \(E(Y^U)\), mean treated outcome \(E(Y^T)\), and ATE \(E(Y^T - Y^U)\) are, in dollars per year,

\[ E(Y^U) = (0.5)(40000) + (0.2)(40000) + (0.2)(50000) + (0.1)(50000) = 43000, \tag{4.8} \]

\[ E(Y^T) = (0.5)(41000) + (0.2)(38000) + (0.2)(51000) + (0.1)(47000) = 43000, \tag{4.9} \]

\[ E(Y^T - Y^U) = (0.5)(1000) + (0.2)(-2000) + (0.2)(1000) + (0.1)(-3000) = 0. \tag{4.10} \]

Again, to verify (4.3),

\[ E(Y^T - Y^U) = \$0/\text{yr} = \$43000/\text{yr} - \$43000/\text{yr} = E(Y^T) - E(Y^U). \tag{4.11} \]

4.5.3 Limitation of ATE

[Figure 4.2: Three distributions with the same mean. Plot of three PDFs; vertical axis: density (0.0 to 0.4); horizontal axis: hourly wage ($12/hr to $18/hr).]

Figure 4.2 shows the ATE does not fully capture the effect of treatment on the distribution. The figure plots PDFs of three hourly wage distributions with identical means. Picking any two distributions to represent potential outcomes \(Y^U\) and \(Y^T\), \(E(Y^U) = E(Y^T)\), so the ATE is \(E(Y^T) - E(Y^U)\) = $0/hr. However, zero ATE does not mean zero effect: the distributions are all different. For example, their standard deviations differ, and one distribution is right-skewed, with a lower median wage. We may disagree about which is "better" or "worse," but we can agree the differences are important.
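A small simulation illustrates the same point. The three wage distributions in this R sketch are hypothetical (they are not the ones plotted in Figure 4.2); they are chosen only to share a population mean of $15/hr while differing in spread and skewness.

```r
# Three hypothetical hourly wage distributions, each with population mean 15
set.seed(3)
n <- 100000
w1 <- rnorm(n, mean=15, sd=1)    # symmetric, small spread
w2 <- rnorm(n, mean=15, sd=2)    # symmetric, larger spread
# Log-normal is right-skewed; its mean is exp(meanlog + sdlog^2/2),
# so this choice of meanlog makes the mean equal 15
w3 <- rlnorm(n, meanlog=log(15) - 0.3^2/2, sdlog=0.3)

# Same mean (up to simulation error), but different medians and spreads
wages <- list(w1=w1, w2=w2, w3=w3)
summ <- sapply(wages, function(w) c(mean=mean(w), median=median(w), sd=sd(w)))
round(summ, 2)
```

Taking any two of these as \(Y^U\) and \(Y^T\) gives an ATE of approximately zero, even though the treatment would change the standard deviation and (for the skewed distribution) the median.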

This idea is also memorable in joke form as retold by Hansen(2020 p 29)

An economist was standing with one foot in a bucket ofboiling water and the other foot in a bucket of ice Whenasked how he felt he replied ldquoOn average I feel just finerdquo

To address the limitations of the ATE one approach is to exam-ine effects on percentiles (ldquoquantile treatment effectsrdquo) but these arebeyond our scope
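For a rough sense of what comparing percentiles looks like, here is a small R sketch. The two normal distributions are hypothetical assumptions chosen only for illustration (they are not the distributions in the figure): both have mean $15/hr, but the treated distribution is more spread out.

```r
# Hypothetical potential outcome distributions (assumptions for illustration):
# Y^U ~ Normal(mean 15, sd 1) and Y^T ~ Normal(mean 15, sd 2), in $/hr.
p  <- c(0.10, 0.50, 0.90)             # percentiles to compare (10th, 50th, 90th)
qU <- qnorm(p, mean = 15, sd = 1)     # percentiles of Y^U
qT <- qnorm(p, mean = 15, sd = 2)     # percentiles of Y^T

ate <- 15 - 15                        # E(Y^T) - E(Y^U): the means are equal
qte <- qT - qU                        # difference at each percentile

print(ate)
print(round(setNames(qte, paste0("p", 100 * p)), 2))
```

Under these assumed distributions the mean difference is zero, while the low percentile falls and the high percentile rises; the mean alone hides both changes.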


Practice 4.1 (unrepresentative ATE). Describe a population in which the ATE is zero but every individual is affected by the treatment (i.e., all treatment effects are non-zero). For simplicity, assume there are only two types of individual. For each type, state the probability, potential outcomes Y^U and Y^T, and causal effect Y^T − Y^U, which must be non-zero. Then compute the ATE to verify it's zero.

4.6 ATE Identification

⟹ Kaplan video: ATE Identification

Generally, identification is a concept central to econometrics that appears throughout this textbook. A parameter is identified if it equals a summary feature of the population distribution of observable variables. Identification requires certain conditions, known as identifying assumptions.

Specifically, the ATE is identified when it equals a mean difference. (It may equal another summary feature in more complex settings not discussed here.) The required identifying assumptions are discussed later in this section. In practice, if the identifying assumptions are true, then the mean difference can be interpreted as the ATE, but if they are false, then it cannot.

4.6.1 Setup and Identification Question

For each individual, a single value is observed. If the individual was actually treated (in our universe), then treated potential outcome Y^T is observed; otherwise, Y^U is observed.

Consider actually treated individuals to be population B, and consider actually untreated individuals to be population A. The two populations are represented by random variables Y^B and Y^A, respectively. For a population B individual, Y^B is always observable, with Y^B = Y^T. Similarly, for a population A individual, Y^A is always observable, with Y^A = Y^U. A random sample of Y^B can be taken from actually treated individuals, and a random sample of Y^A can be taken from actually untreated individuals. For example, Y^B could be the graduation outcome for a student who actually attended the charter school (in our universe), while Y^A is the outcome for a student who did not.

For the ATE, the question of identification is whether the ATE equals the mean difference between the actually treated and actually untreated populations. Mathematically, using the E(Y^T) − E(Y^U) form of the ATE from (4.3), the identification question is whether or not

\[ E(Y^T) - E(Y^U) = E(Y^B) - E(Y^A) \tag{4.12} \]

We know how to learn about the descriptive mean difference E(Y^B) − E(Y^A) from data, as in Section 4.1. If (4.12) holds, then this is equivalent to learning about the ATE.

Consider (4.12) in the charter school and right-to-work examples. For the charter school example, if the ATE is identified, then it equals the college graduation probability of charter school students minus the college graduation probability of conventional public school students, E(Y^B) − E(Y^A). (Recall from (2.9) that for binary Y, E(Y) = P(Y = 1).) In that case, the ATE is estimated simply by comparing college graduation rates between the charter school and public school students. For the right-to-work example, if the ATE is identified, then it equals mean income in right-to-work states minus mean income in other states, E(Y^B) − E(Y^A). In that case, the ATE is estimated simply by comparing average income between right-to-work states and other states. However, if the ATE is not identified and (4.12) is false, then these comparisons do not estimate the ATE; i.e., the mean differences do not have a causal interpretation.

4.6.2 Randomization

Randomized experiments are often used to estimate the ATE. Ideally, in a randomized experiment, also called a randomized controlled trial (RCT), the experimenter can control who is treated and who is not (but see comments below). Mathematically, the experimenter gets to decide whether to observe Y^U or Y^T for each individual. "Randomized" means this decision is made without regard to the individual's characteristics.

In practice, there are many complications; see Section 4.6.3 for some examples.

For intuition, consider the following experimental strategy. First, imagine we only want to estimate E(Y^T). We could take a random sample of individuals from the population and treat each one, allowing us to observe their Y^T. That is, we have a random sample from the population distribution of Y^T. As in Chapter 3, we can estimate E(Y^T) by the sample mean. Here, our Y^B is Y^T, so E(Y^B) = E(Y^T). Second, we can repeat the process for a second random sample, but force everyone to be untreated. The key is the ability to force anyone to be either treated or untreated; this allows us to take random samples of Y^T and Y^U. Although treatment may not seem "random," it is assigned without consideration of any individual's characteristics.

Section 6.6.2 contains more formal arguments for why randomization can help identify the ATE.
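The intuition can be checked with a small simulation. The sketch below uses the charter school population from Table 4.1; the sample size, the seed, and the coin-flip assignment rule are assumptions made only for illustration.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
n <- 100000                  # hypothetical (large) sample size

# Draw student types with the probabilities from Table 4.1
type <- sample(1:4, size = n, replace = TRUE, prob = c(0.3, 0.3, 0.1, 0.3))
yU <- c(0, 0, 1, 1)[type]    # untreated potential outcome of each student
yT <- c(0, 1, 0, 1)[type]    # treated potential outcome of each student

# Randomized assignment: a fair coin flip that ignores the student's type
treated <- rbinom(n, size = 1, prob = 0.5)

# Only one potential outcome is observed for each student
y <- ifelse(treated == 1, yT, yU)

# Descriptive mean difference: actually treated minus actually untreated
mean_diff <- mean(y[treated == 1]) - mean(y[treated == 0])
print(mean_diff)             # compare with the ATE of 0.2 in Table 4.1
```

Because assignment ignores type, the treated and untreated groups each resemble the full population, so the sample mean difference should land near the ATE, up to sampling error.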

4.6.3 Reasons for Identification Failure

Generally (beyond only experiments), ATE identification fails when SUTVA fails (Section 4.4.3) or when treatment is not random.

Outside of experiments, random or "as good as random" treatment is rare. For example, in the right-to-work example, treatment is probably not random. Hopefully, state legislatures do indeed consider the characteristics of individuals when deciding whether or not to pass a right-to-work law; i.e., laws are not passed randomly. Specifically, legislatures may consider the distribution of Y^U when deciding whether or not to pass the law (which would switch everyone's income from their Y^U to their Y^T). Further, just looking at a map, it is notable that (as of 2019) zero US states in the Northeast census region have right-to-work laws, whereas almost all states in the South census region have right-to-work laws (the exceptions being Delaware and Maryland, which are not really "Southern" culturally or politically). Thus, it seems likely that the treatment decisions were related to other policy decisions that would in turn affect the income distribution.

Even with randomized treatment assignment, treatment itself may not be random. For example, imagine you randomly assign individuals to attend a job training program, but some assigned individuals never attend. This is called non-compliance, i.e., not complying with the treatment assignment. This is a type of self-selection, meaning individuals decide which group to join. People who skip the program may also skip work regularly, which results in lower income. Thus, many low-income individuals who should have been in the treatment group (if we could force them) are now in the control group; i.e., they should have been Y^B but are now Y^A. This decreases the control group's average income and raises the treatment group's average income, which falsely makes the treatment seem more effective than it is. Even if the training program has zero ATE, so E(Y^T) = E(Y^U), this non-compliance makes it look like the treatment has a positive effect because E(Y^A) < E(Y^U) and E(Y^B) > E(Y^T).

One way to avoid this incorrect conclusion is to change perspective: compare groups based on treatment assignment rather than actual treatment. In the above example, the "treatment" is defined as being assigned to attend the job training program, rather than actually attending the program. The resulting ATE is called the intention-to-treat effect because it measures the mean change in Y corresponding to the intention to treat (i.e., assignment to treatment). Sometimes this is more directly relevant for policy anyway, if the actual policy would not force people to be treated.
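A simulation can make the non-compliance story concrete. Everything in this sketch is a hypothetical assumption for illustration: a binary "reliable" trait that raises income and also determines attendance, the income numbers (in thousands of dollars per year), and a training program with zero effect on everyone.

```r
set.seed(2)                  # arbitrary seed, only for reproducibility
n <- 100000                  # hypothetical sample size

# Hypothetical trait: reliable people earn more AND attend when assigned
reliable <- rbinom(n, size = 1, prob = 0.7)

# Income is the same whether treated or not, so the true effect is zero
income <- 20 + 10 * reliable + rnorm(n, sd = 5)

assigned <- rbinom(n, size = 1, prob = 0.5)   # randomized assignment
attended <- assigned * reliable               # unreliable assignees skip

# Comparison by actual treatment (distorted by self-selection)
by_treatment <- mean(income[attended == 1]) - mean(income[attended == 0])

# Comparison by assignment (intention-to-treat)
by_assignment <- mean(income[assigned == 1]) - mean(income[assigned == 0])

print(c(by_treatment = by_treatment, by_assignment = by_assignment))
```

In this setup the comparison by actual attendance is clearly positive even though training does nothing, while the comparison by assignment stays near zero, matching the zero intention-to-treat effect.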

Attrition is another problem that can arise even if SUTVA and random treatment are satisfied. Attrition refers to individuals dropping out of the study after it starts. For example, maybe everyone comes to the first job training, but then some people move to a different state and disappear from your data. People leaving randomly is fine, but non-random attrition is problematic. For example, maybe the training program is so good that people get higher-paying jobs in other states. You only see data for individuals who didn't move, who generally have lower-paying jobs. Then, even though the training program worked really well, it doesn't look like it in the data because you don't see all the highest-earning treated individuals who moved.
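The attrition example can also be sketched by simulation. The numbers here are hypothetical assumptions: training raises everyone's income by 5 (thousand dollars per year), and treated individuals earning above 38 move away and drop out of the data.

```r
set.seed(3)                  # arbitrary seed, only for reproducibility
n <- 100000                  # hypothetical sample size

treated <- rbinom(n, size = 1, prob = 0.5)   # randomized treatment
yU <- rnorm(n, mean = 30, sd = 5)            # untreated potential outcome
y  <- yU + 5 * treated                       # hypothetical true effect: +5

# Non-random attrition: the highest-earning treated individuals move away
moved <- treated == 1 & y > 38
stay  <- !moved

# Mean difference with everyone (not feasible in practice after attrition)
full_diff <- mean(y[treated == 1]) - mean(y[treated == 0])

# Mean difference using only individuals who remain in the data
obs_diff <- mean(y[stay & treated == 1]) - mean(y[stay & treated == 0])

print(c(full_diff = full_diff, obs_diff = obs_diff))
```

With everyone included, the mean difference is close to the assumed effect of 5, but among those who stay it is noticeably smaller, because the missing treated individuals are exactly the high earners.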

Other concerns are introduced later, especially in Section 12.3.

Discussion Question 4.4 (breakfast effect). Schools with a high enough percentage of low-income students are eligible for a federally-funded free breakfast program for all students. Although the program is not mandatory, all eligible schools choose to have it. You compute a 95% CI for the mean math test score of the "breakfast" schools minus the mean of the other schools, and it is [−3.2, −1.7] points. (The test is out of 100 points; most scores are in the 60 to 100 range.) How do you interpret this result? Think about ATE identification, statistical uncertainty, and frequentist vs. Bayesian perspectives.

4.7 ATE Estimation and Inference

There is nothing new for estimation and inference. It is identical to Sections 4.1.2 and 4.1.3. Generally, the point of causal "identification" is not to propose a new statistical object, but rather to imbue an existing descriptive statistical object with causal meaning. Here, (4.12) gives the descriptive mean difference a causal interpretation (ATE). The interpretation does not affect how we estimate or quantify statistical uncertainty about the mean difference.

However, recall from Section 3.9.8 that conventional methods for quantifying uncertainty (Section 3.8) only quantify statistical uncertainty, not uncertainty about identification. For example, if identification fails, a 95% CI for the mean difference may only contain the ATE with 80% probability, or even 50%, or near 0%. There are some proposals for quantifying the sensitivity of results to violations of identification in various settings, but these are beyond our scope.


Empirical Exercises

Empirical Exercise EE4.1. You will analyze the effects of being assigned to a job training program, where assignment was randomized. The specific program was the National Supported Work Demonstration in the 1970s in the US. Data are originally from LaLonde (1986), via Wooldridge (2020). You will look at effects on earnings (re78) and unemployment (unem78), both overall and for different subgroups (e.g., married or not). The train variable indicates (randomized) assignment to job training if it equals 1, and 0 otherwise. For now, we focus on computing various estimates; in later chapters, we'll think more critically about what could go wrong even with randomized assignment.

a. R only: run install.packages("wooldridge") to download and install that package (if you have not already).

b. Load the jtrain2 dataset.

R: load package wooldridge with command library(wooldridge), and a data frame variable named jtrain2 becomes available; the command ?jtrain2 then shows you details about the dataset.

Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse jtrain2, clear

c. R only: separate the data into "treatment" and "control" groups (depending on the value of train, the job training variable) with
    trt <- jtrain2[jtrain2$train==1, ]
    ctl <- jtrain2[jtrain2$train==0, ]

d. Estimate the mean 1978 earnings (in thousands of dollars) for the treatment group minus that of the control group, along with a 95% CI for the mean difference.

R:
    mean(trt$re78) - mean(ctl$re78)
    t.test(x=trt$re78, y=ctl$re78)

Stata: ttest re78, by(train) unequal (also estimates the mean difference)

e. R only: separate out the data for treated married individuals and untreated married individuals with
    trtmar1 <- trt[trt$married==1, ]
    ctlmar1 <- ctl[ctl$married==1, ]

f. Compute the mean difference estimate and 95% CI for the 1978 earnings outcome variable, comparing treated and untreated married individuals.

R:
    mean(trtmar1$re78) - mean(ctlmar1$re78)
    t.test(x=trtmar1$re78, y=ctlmar1$re78)

Stata: ttest re78 if married==1, by(train) unequal or alternatively bysort married: ttest re78, by(train) unequal

g. Repeat your above analysis in parts (c)–(f), but first create a variable where earnings are in dollars (instead of thousands of dollars).

R: jtrain2$re78USD <- 1000*jtrain2$re78

Stata: generate re78USD = re78*1000

h. Optional: repeat your analysis in parts (e) and (f) for unmarried (instead of married) individuals.

i. Optional: repeat your analysis in parts (d)–(f) but for unemployment (unem78) instead of earnings. For interpretation, note that unem78 equals 1 if unemployed all of 1978 and equals 0 otherwise, so the population mean is the probability of being unemployed all year (a value between 0 = 0% and 1 = 100%), and the sample average is the fraction of the sample thus unemployed. So a value like 0.14 means 14%, and a difference of 0.14 − 0.11 = 0.03 is a difference of 3 percentage points, etc.

Empirical Exercise EE4.2. You will analyze data from an "audit study" that attempts to measure the effect of race on receiving a job offer. The Urban Institute found pairs of seemingly equally qualified individuals (one black, one white) and had them interview for a variety of entry-level jobs in Washington, DC, in 1988. See Siegelman and Heckman (1993) for details and critique, and the raw data in their Table 5.1 (p. 195). In the data, each row (observation) corresponds to one job to which one pair applied. Value w=1 indicates that the white applicant in the pair got a job offer, while b=1 if the black applicant got an offer.

b. Load the audit dataset.

R: load package wooldridge with command library(wooldridge), and a data frame variable named audit becomes available; the command ?audit then shows you details about the dataset.

Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse audit, clear

c. Compute the difference (white minus black) in the sample fraction of job offers.

R: mean(audit$w) - mean(audit$b)

Stata: ttest w==b (which also computes a 95% CI)

d. Compute the sample mean of all the pairs' white-minus-black difference. Note that w-b equals 1 if the white individual got a job offer but the black individual did not, equals −1 if the black but not white individual got an offer, and equals 0 if both or neither of the pair got an offer.

R: mean(audit$w - audit$b)

Stata: generate wminusb = w-b then ttest wminusb==0 (also computes 95% CI; see row labeled diff for both)

e. R only (since Stata already reported this in the row labeled diff): compute a 95% CI for the population mean difference, with either t.test(x=audit$w, y=audit$b, paired=TRUE) or t.test(x=audit$w-audit$b)


Chapter 5

Midterm Exam 1


When I teach this class, the first midterm exam is this week. This "chapter" makes the chapter numbers match the week of the semester. The midterm covers Chapters 2–4, i.e., everything up till now except R/Stata coding.


Part II

Regression


Introduction

Part II concerns regression. Regression is the workhorse of empirical economics (and many other fields), for description, prediction, and causality alike.

Part II extends the concepts and methods of Part I to the regression setting. In the population, the concepts of description, prediction, and causality from Part I are extended to regression models. In the data, estimation and inference methods extend those of Part I.

More flexible regression is also considered, including different models, interpretation, and a glimpse of nonparametric regression and machine learning.


Chapter 6

Comparing Two Distributions by Regression


Depends on Chapter 4 (which depends on Chapters 2 and 3)



6.2 Describe different ways of thinking about two distributions, both mathematically and intuitively [TLO 3]

6.3 Describe, interpret, identify, and distinguish among different population models and their parameters and estimators [TLO 3]

6.4 Judge which interpretation of a regression slope is most appropriate in a real-world example [TLO 6]

6.5 Interpret logical relationships and form appropriate logical conclusions [TLO 2]

6.6 In R (or Stata): estimate the parameters in a simple regression model, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]


• Conditional probability (Khan Academy)

• Basic joint, marginal, and conditional distributions (Khan Academy)

• James et al. (2013, §3.1)

• Covariance and correlation (Lambert video)

• Overlap assumption (Masten video)

• Assumptions for randomized experiment validity (Masten video)

• Structural vs. causal/reduced form approach (Masten video)

• OLS computation (Masten video)

• Sections 2.1 ("Simple OLS Regression") and 2.2 ("Coefficients, Fitted Values, and Residuals") in Heiss (2016)

• Section 5.3 ("Regression When X is a Binary Variable") in Hanck et al. (2018)

• R packages lmtest and sandwich (Zeileis, 2004; Zeileis and Hothorn, 2002)

Chapter 6 revisits Chapter 4 from the perspective of regression. The concepts of description, prediction, and causality are translated into regression language and regression models in the population. Estimation and quantifying uncertainty are also discussed.

The term regression has different meanings in different contexts (and by different people). In the population, it usually refers to how the mean of a random variable Y depends on the value of another random variable(s), as in Section 6.3. In the sample, as in Section 6.7, it usually refers to a particular estimation technique. But beware of other (or ambiguous) uses of the word "regression," especially in online resources.

6.1 Logic

⟹ Kaplan video: Logic Terms Example

Some basic logic is useful for understanding certain parts of econometrics. Theoretically, logic helps you understand the relationships among different conditions, like assumptions for theorems. Practically, logic helps you interpret results.

The following may not be fully technically correct from a philosopher's perspective (e.g., perhaps I conflate logical implication with the material conditional), but it suffices for econometrics.

6.1.1 Terminology

Many words and notations can refer to the same logical relationship. Let A and B be two statements that can be either true or false. For example, maybe A is "Y ≥ 10" and B is "Y ≥ 0." Or A is "this animal is a cat" and B is "this animal is a mammal." The following ways of describing the logical relationship between A and B all have the same meaning.

1. If A is true, then B is true (often shortened: "if A, then B").
2. A ⟹ B
3. A implies B.
4. B ⟸ A
5. B is implied by A.
6. B is true if A is true.
7. A is true only if B is true.
8. A is a sufficient condition for B (shorter: "A is sufficient for B").
9. B is a necessary condition for A (shorter: "B is necessary for A").
10. A is stronger than B.
11. B is weaker than A.
12. It is impossible for B to be false when A is true (but it is fine if both are true, or both are false, or A is false and B is true).
13. The truth table (T=true, F=false):

    A   B   A ⟹ B
    T   T   T
    T   F   F
    F   T   T
    F   F   T

14. The diagram (everything in A is also in B):

    [Diagram: region A drawn entirely inside region B.]
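The truth table for A ⟹ B can be reproduced in R. This sketch uses the standard encoding of "if A then B" as "(not A) or B"; the variable names are illustrative.

```r
# All four combinations of truth values, in the same order as the truth table
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# "A implies B" is false only when A is true and B is false,
# which is the same as "(not A) or B"
A_implies_B <- !A | B

print(data.frame(A, B, A_implies_B))
```

The only FALSE in the last column is the row with A true and B false, matching item 12 in the list.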

To state equivalence of A and B, opposite statements can be combined. Specifically, any of the following have the same meaning.

1. A ⟺ B (meaning both A ⟹ B and A ⟸ B)
2. A is true if and only if B is true (meaning A is true if B is true, and A is true only if B is true).
3. B is true if and only if A is true.
4. A is necessary and sufficient for B.
5. B is necessary and sufficient for A.
6. A and B are equivalent.
7. It is impossible for A to be false when B is true, and impossible for A to be true when B is false.
8. The truth table (T=true, F=false):

    A   B   A ⟺ B
    T   T   T
    T   F   F
    F   T   F
    F   F   T

Variations of A ⟹ B have the following names. Read ¬A as "not A": ¬A is false when A is true, and ¬A is true when A is false.

• ¬A ⟹ ¬B is the inverse of A ⟹ B.

• B ⟹ A is the converse of A ⟹ B.

• ¬B ⟹ ¬A is the contrapositive of A ⟹ B.

The statement A ⟹ B is logically equivalent to its contrapositive. That is, statements "A ⟹ B" and "¬B ⟹ ¬A" can be both true or both false, but it's impossible for one to be true and the other false.

The statement A ⟹ B is not logically equivalent to either its inverse or converse. (The inverse and converse are equivalent to each other because the inverse is the contrapositive of the converse.)

For example, let A be "X ≤ 0" and let B be "X ≤ 10."

• A ⟹ B: any number below 0 is also below 10.

• The contrapositive is X > 10 ⟹ X > 0, which is also true: any number above 10 is also above 0.

• The inverse is X > 0 ⟹ X > 10, which is false: e.g., if X = 5, then X > 0 but not X > 10.

• The converse is X ≤ 10 ⟹ X ≤ 0, also false: again, if X = 5, then X ≤ 10 but not X ≤ 0.
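The example with X ≤ 0 and X ≤ 10 can be checked numerically over a grid of values. This is a sketch: the grid from −20 to 20 and the helper function implies are illustrative choices, and checking a finite grid illustrates the logic rather than proving it.

```r
x <- seq(-20, 20, by = 0.5)     # grid of candidate values for X (illustrative)
A <- x <= 0                     # statement A: X <= 0
B <- x <= 10                    # statement B: X <= 10

# Helper defined here for illustration: "p implies q" as "(not p) or q"
implies <- function(p, q) !p | q

# A statement about X holds only if it holds for every value on the grid
statement      <- all(implies(A, B))      # A => B
contrapositive <- all(implies(!B, !A))    # not B => not A
inverse        <- all(implies(!A, !B))    # not A => not B
converse       <- all(implies(B, A))      # B => A

print(c(statement = statement, contrapositive = contrapositive,
        inverse = inverse, converse = converse))

# Grid values where the converse fails (X <= 10 but not X <= 0)
print(range(x[!implies(B, A)]))
```

The statement and its contrapositive hold at every grid value, while the inverse and converse fail at the same values, such as X = 5.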

6.1.2 Theorems

Theorems all have the same logical structure: if assumption A is true, then conclusion B is true. Sometimes A and B have multiple parts, like A is really "A1 and A2." The theorem's practical use is: if we can verify that A is true, then we know B is also true.

What if we think A is false? Then B could be false, or it could be true. This may be seen most readily from the picture version of the A and B relationship in Section 6.1.1: we could be somewhere inside B but outside A (i.e., B true, A false), or we could be outside both (both false). That is, as in Section 6.1.1, the theorem A ⟹ B is not equivalent to its inverse.

Also from Section 6.1.1, a theorem is equivalent to its contrapositive. That is, if the theorem's conclusion is false, then we know its assumption is false. (If the assumption has multiple parts, like "both A1 and A2 are true," then being false means either A1 is false, or A2 is false, or both are false.)

6.1.3 Comparing Assumptions

To compare assumptions, the terms "stronger" and "weaker" are most commonly used. Let A1 and A2 denote different assumptions. Per Section 6.1.1, "A1 is stronger than A2" is equivalent to A1 ⟹ A2, which is also equivalent to "A2 is weaker than A1."

All else equal, it is more useful to have a theorem with weaker assumptions because it applies to more settings. That is, if A1 ⟹ A2, then we prefer a theorem based on A2, the weaker assumption. A theorem based on A1 can only be used when A1 is true. In contrast, a theorem based on A2 can be used not only when A1 is true (because A1 ⟹ A2), but also sometimes when A1 is false (but A2 is still true).

For example, let assumption A1 be "a city is in Missouri," and let assumption A2 be "a city is in the United States." Consider the theorems A1 ⟹ B and A2 ⟹ B. (The conclusion is irrelevant here, but to be concrete you could imagine B is "the city is in the northern hemisphere.") Since Missouri is part of the United States, A1 ⟹ A2; i.e., A1 is the stronger assumption, and A2 is the weaker assumption. We prefer the theorem based on the weaker assumption because it applies to more cities. For example, only the theorem A2 ⟹ B applies to Houston: A1 is false, but A2 is true. (And recall that when A1 is false, the theorem A1 ⟹ B does not conclude that B is false; it just says, "I don't know if B is true or false," i.e., it is useless.)

Practice 6.1 (median theorem logic). Consider the theorem, "If sampling is iid, then the sample median consistently estimates the population median." Hint: draw a picture and/or write it as A ⟹ B.

a) What does this tell us about consistency of the sample median when sampling is not iid?

b) What does this tell us about sampling when the sample median is not consistent?

Practice 6.2 (mean theorem logic). Consider the theorem, "If sampling is iid and the population mean is well-defined, then the sample mean consistently estimates the population mean." Hint: there may be multiple possible pictures that show this relationship among A1 (iid), A2 (well-defined), and B (consistency).

a) What does this tell us about consistency of the sample mean when sampling is not iid?

b) What does this tell us about sampling when the sample mean is not consistent?

Discussion Question 6.1 (logic with feathers). Consider two theorems. Theorem 1 says, "If X is an adult eagle, then it has feathers." Theorem 2 says, "If X is an adult bird, then it has feathers."

a) Describe each theorem logically: what's the assumption (A), what's the conclusion (B), what's the relationship?

b) State Theorem 1's contrapositive; is it true?

c) Compare: does Theorem 1 or Theorem 2 have a stronger assumption? Why?

d) Compare: which theorem is more useful? (Which applies to more situations?)

6.2 Preliminaries

⟹ Kaplan video: Joint, Marginal, and Conditional Distributions

Before getting to regression, some simpler material may provide intuition. (If it is not familiar to you from a previous statistics class, then you may want to consult additional resources for a deeper understanding; or you may not.) In Section 6.2, there is no data; only the population is considered.


6.2.1 Population Mean Model in Error Form

To help understand the conditional mean model, we start with an unconditional mean model. That is, interest is in μ_Y ≡ E(Y) for a single random variable Y, as in Chapter 2.

There are two ways to write the unconditional mean "model." Both look silly and over-complicated, but they help bridge Chapter 4 and Chapter 6. First, the mean can be written directly:

\[ E(Y) = \mu_Y \tag{6.1} \]

Second, in terms of an error term U, the error form of this model is

\[ Y = \mu_Y + U, \quad E(U) = 0 \tag{6.2} \]

Models (6.1) and (6.2) are equivalent. Taking the mean of both sides of (6.2), using the linearity property from (2.21),

\[ E(Y) = E(\mu_Y + U) = E(\mu_Y) + E(U) = \mu_Y + 0 = \mu_Y \tag{6.3} \]

The error term U has a precise statistical definition and meaning, but no causal or economic meaning. Defining the mean error term as

\[ U \equiv Y - E(Y) \tag{6.4} \]

always implies

\[ E(U) = E[Y - E(Y)] = E(Y) - E(Y) = 0 \tag{6.5} \]

Thus, the property E(U) = 0 in (6.2) is true essentially by definition, not an assumption that can be false. By analogy, this is like defining U to be an equilateral triangle, in which case the property "all angles are equal" is always true, not an additional assumption. However, U has no causal or economic meaning; it is simply the difference between an individual's Y and the population mean E(Y).

The error form often facilitates theoretical analysis of estimators, but in practice the more direct model may be easier to interpret.
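A tiny numerical check may help show that E(U) = 0 holds by construction. The possible values and probabilities below are hypothetical assumptions; any discrete distribution would work the same way.

```r
# Hypothetical population distribution of Y (values and probabilities assumed)
y <- c(0, 1, 2, 5)             # possible values of Y
p <- c(0.4, 0.3, 0.2, 0.1)     # probability of each value (sums to 1)

mu_Y <- sum(p * y)             # population mean E(Y)
u <- y - mu_Y                  # error term U = Y - E(Y), as in (6.4)
mean_U <- sum(p * u)           # E(U), as in (6.5)

print(mu_Y)
print(mean_U)                  # zero, up to floating-point rounding
```

Changing y or p changes μ_Y, but the mean of U stays at zero because U is defined by subtracting μ_Y.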

6.2.2 Joint and Marginal Distributions

To understand regression, you must understand conditional distributions. To understand conditional distributions, it helps to understand joint distributions and marginal distributions.

The joint distribution is the distribution of values of (X, Y) together, which can be any combination of the variable types in Section 2.3. For example, X could be categorical and Y continuous, or X and Y both discrete, or X continuous and Y ordinal, etc. Since there are so many combinations, they are not all enumerated here. Further, eventually we'll focus on conditional distributions, in which case the variable type of X does not matter as much for interpretation. For regression, the focus is on numeric (discrete or continuous) X and Y. Implicitly, this also applies to categorical variables that have been turned into dummy variables with the indicator function, like X = 1{cat} or Y = 1{employed}.

For (X, Y) with non-continuous variable types, the joint distribution can be described by a PMF. Like before, the PMF states the probability of each possible value. The difference is that "possible values" are now pairs of values (x, y), instead of single values. For example, a possible value could be (−5, 7), or (cat, dog). (It is more difficult to gain intuition with continuous variable types, but the idea of a PDF can be extended to multiple variables.)

Each joint probability in the PMF can be written multiple equivalent ways:

\[ P((X, Y) = (x, y)) = P(X = x, Y = y) = P(X = x \text{ and } Y = y) \tag{6.6} \]

For example, consider the joint distribution of dummy variables for employment and marital status. Let Y = 1 if somebody is employed and Y = 0 if not. Let X = 1 if somebody is married and X = 0 if not. The joint distribution of employment and marital status describes the probabilities of each possible value of the vector (X, Y), i.e., the PMF of the vector (X, Y). There are four possible values: unmarried and unemployed (0, 0), unmarried and employed (0, 1), married and unemployed (1, 0), and married and employed (1, 1). Since these categories are mutually exclusive and exhaustive, the four probabilities must sum to 1 (i.e., 100%). Table 6.1 shows an example.

Table 6.1: Joint distribution of marital status (X) and employment status (Y).

                              Y = 0   Y = 1   Marginal for X (row sum)
X = 0                         0.10    0.10    0.20
X = 1                         0.20    0.60    0.80
Marginal for Y (column sum)   0.30    0.70    1.00

Table 6.1 shows both joint and marginal probabilities. Here, the joint probability values can be written as p_xy ≡ P(X = x, Y = y), or equivalently P((X, Y) = (x, y)). This is analogous to the scalar (one variable) PMF that described P(Y = y) for different values y, but replacing Y with (X, Y) and y with (x, y). The joint probabilities shown inside the box are p_00 = 0.10 (i.e., 10%), p_01 = 0.10, p_10 = 0.20, and p_11 = 0.60. These sum to 1.

A marginal probability (or unconditional probability) considers just one of the random variables, ignoring the other. Specifically, the outer values in Table 6.1 show the marginal probabilities to be P(X = 0) = 0.20 (at the right end of the X = 0 row), P(X = 1) = 0.80, P(Y = 0) = 0.30 (at the bottom of the Y = 0 column), and P(Y = 1) = 0.70. These probabilities describe the marginal distribution of each random variable, or more specifically the marginal PMFs. That is, X by itself is a random variable with P(X = 0) = 0.20 and P(X = 1) = 0.80; the population probability of an individual being married is 0.8 (80%). Similarly, by itself Y is a random variable with P(Y = 0) = 0.30 and P(Y = 1) = 0.70.
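The row and column sums in Table 6.1 can be reproduced in R by storing the joint probabilities in a matrix. This is a sketch: the object names joint, marg_X, and marg_Y are illustrative.

```r
# Joint PMF from Table 6.1: rows are X (marital status), columns are Y (employment)
joint <- matrix(c(0.10, 0.10,
                  0.20, 0.60),
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("0", "1"), Y = c("0", "1")))

marg_X <- rowSums(joint)   # marginal PMF of X: sum over the values of Y
marg_Y <- colSums(joint)   # marginal PMF of Y: sum over the values of X

print(joint)
print(marg_X)
print(marg_Y)
print(sum(joint))          # the four joint probabilities sum to 1
```

Summing across a row ignores Y and leaves the marginal probability for that value of X; summing down a column does the same for Y.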

Discussion Question 6.2 (joint distribution and causality). Consider two binary random variables X and Y whose joint distribution is described by the probabilities P(X = 0, Y = 0) = 0.4, P(X = 0, Y = 1) = 0.1, P(X = 1, Y = 0) = 0.1, and P(X = 1, Y = 1) = 0.4. Note P(X = Y) = 0.8 > 0.2 = P(X ≠ Y). Hint: think about some concrete examples of X and Y (marital status, employment, rain, long commute time, etc.); to prove something is "possible" only requires a single example where it is true.

a) Explain why this joint distribution suggests some type of relationship between X and Y.

b) Given the joint distribution, is it possible that X has a causal effect on Y? Why/not?

c) Given the joint distribution, is it possible that X does not have a causal effect on Y? Why/not?

623 Conditional Distributions

For non-continuous variables the conditional distribution consistsof the conditional probabilities of all different possible values Theconditional distribution of Y given X = x consists of the conditionalPMF ie the conditional probability of each possible value y givenX = x

The conditional probability of one event (like Y = 1) givenanother event (like X = 1) considers only the times when the condi-tioning event (like X = 1) occurs and then takes the proportion ofthose times that the first event (like Y = 1) occurs Mathematicallythe conditional probability P(Y = 1 | X = 1) can be read as ldquotheprobability that Y equals one conditional on X equal to onerdquo or ldquotheprobability of Y being one given X equals onerdquo or other variationsMore generally P(Y = y | X = x) is ldquothe probability that Y equals yconditional on X equal to xrdquo

For non-continuous variables a conditional probability can bewritten in terms of joint and marginal probabilities Specifically

P(Y = y | X = x) =P(Y = yX = x)

P(X = x) (67)

(This doesnrsquot apply to continuous X since the denominator would beP(X = x) = 0)

There is nothing mathematically special about the labels X andY here However conventional regression notation corresponds toconditioning on the variable named X To examine X conditional onY we could just switch the labels and then examine Y conditionalon X

For example, consider the probability of employment (Y = 1) conditional on being married (X = 1). Applying (6.6),

\[ P(Y = 1 \mid X = 1) = \frac{P(Y = 1, X = 1)}{P(X = 1)} \tag{6.8} \]

The denominator is the proportion of the population that’s married. The numerator is the proportion of the population that’s both married and employed.

Examples

For intuition, you can imagine the population is actually 100 people rather than abstract probabilities. You may have unknowingly computed (sample) conditional probabilities in grade school, if you ever answered questions like “What proportion of the boys in our class are wearing glasses?” or “What proportion of students with black hair are wearing a sweater?”

In Table 6.1, multiplying values by 100 gives the number of people in each of the four cells: 10 people are unmarried and unemployed, with (X = 0, Y = 0); another 10 are unmarried but employed, with (X = 0, Y = 1); 20 people are married but not employed, with (X = 1, Y = 0); and 60 people are married and employed, with (X = 1, Y = 1). Probabilities are proportions, e.g., P(X = 1, Y = 1) = 0.60, so the proportion of married and employed individuals in the 100-person population is 60/100 = 0.60 = 60%.

Table 6.2 shows the number of people with different values, parallel to Table 6.1.

Table 6.2: Counts of individuals by marital status (X) and employment status (Y).

                  not employed   employed   Marginal (sum)
not married            10           10            20
married                20           60            80
Marginal (sum)         30           70           100

Using Table 6.2 to continue with the 100-person population, the conditional probability P(Y = 1 | X = 1) asks: within the group of married individuals (X = 1), what proportion of them are employed? There are 60 married and employed individuals and 20 married who are not employed, so 80 total. This 80 is 100 times the marginal probability P(X = 1) = 0.80. Out of those 80, 60 are employed. Thus, the proportion of married individuals who are employed is 60/80 = 0.75 = 75%. That is, to compute the conditional probability, we take the “joint” number of individuals who are both married and employed (both X = 1 and Y = 1) and divide by the “marginal” number of married individuals (X = 1). Similarly, the proportion of married individuals who are not employed is 20/80 = 0.25 = 25%. For the unmarried group (20 individuals total), the proportion who are employed is 10/20 = 0.5 = 50%, which is also the proportion who are not employed.
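
The same joint-count-over-marginal-count arithmetic can be sketched in a few lines of Python. This is only an illustration using the Table 6.2 counts; the names `counts` and `conditional_prob` are my own choices for this sketch, not notation from the text.

```python
from fractions import Fraction

# Counts from Table 6.2, keyed by (x, y):
# x = 1 if married, y = 1 if employed.
counts = {(0, 0): 10, (0, 1): 10, (1, 0): 20, (1, 1): 60}

def conditional_prob(y, x):
    """P(Y = y | X = x): the joint count divided by the marginal count of X = x."""
    marginal_x = sum(n for (xi, _), n in counts.items() if xi == x)
    return Fraction(counts[(x, y)], marginal_x)

for x in (0, 1):
    for y in (0, 1):
        print(f"P(Y={y} | X={x}) = {float(conditional_prob(y, x)):.2f}")
```

Using exact fractions avoids rounding, so the results can be compared directly with 60/80 and 20/80 above.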


6.2.4 Conditional Mean

The conditional mean is just the mean of a conditional distribution. Conditional on a particular value X = x, like X = 1, there is a conditional distribution of Y. The mean of that conditional distribution is written

\[ E(Y \mid X = x) \tag{6.9} \]

To read (6.9) aloud, you could say “the conditional mean of Y given X = x” or “the mean of Y conditional on X = x.”

Examples

From Table 6.1, we can compute a conditional mean. We already computed the conditional distribution of employment status (Y) conditional on being married (X = 1): P(Y = 1 | X = 1) = 0.75 and P(Y = 0 | X = 1) = 0.25. The mean of that conditional distribution is written E(Y | X = 1). We can use the usual expected value formula, plugging in conditional probabilities. For comparison, the unconditional and conditional (on X = 1) means of Y are, respectively,

\[ E(Y) = (0) P(Y = 0) + (1) P(Y = 1) = (0)(0.3) + (1)(0.7) = 0 + 0.7 = 0.7 \tag{6.10} \]

\[
\begin{aligned}
E(Y \mid X = 1) &= (0) P(Y = 0 \mid X = 1) + (1) P(Y = 1 \mid X = 1) \\
&= (0)(0.25) + (1)(0.75) = 0 + 0.75 = 0.75
\end{aligned}
\tag{6.11}
\]

Since Y is binary (0 or 1), the (conditional) mean is the (conditional) probability of Y = 1: E(Y) = P(Y = 1) = 0.7 and E(Y | X = 1) = P(Y = 1 | X = 1) = 0.75.

Conditional means can be computed similarly for non-binary Y and X. For example, imagine Y is hours worked per week, which is either 0, 20, or 40, and X is years of education, which is either 11, 12, or 16. The conditional mean is

\[
\begin{aligned}
E(Y \mid X = x) &= \sum_{j \in \{0, 20, 40\}} (j) P(Y = j \mid X = x) \\
&= (0) P(Y = 0 \mid X = x) + (20) P(Y = 20 \mid X = x) + (40) P(Y = 40 \mid X = x)
\end{aligned}
\tag{6.12}
\]

Table 6.3: Joint distribution of education (X) and weekly hours worked (Y).

          Y = 0   Y = 20   Y = 40
X = 11    0.10     0.05     0.05
X = 12    0.05     0.10     0.15
X = 16    0.10     0.10     0.30

Table 6.3 shows an example joint distribution of such an X and Y, from which conditional means can be computed. The values in the table are joint probabilities, e.g., the entry in the last column of the second row shows P(X = 12, Y = 40) = 0.15. Consider the conditional mean E(Y | X = 16). To apply (6.12) requires the conditional probabilities, which can be computed using (6.7). First, the marginal probability sums all entries in the row:

\[ P(X = 16) = 0.10 + 0.10 + 0.30 = 0.5 \tag{6.13} \]

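Second, plugging this into (6.7),

\[
\begin{aligned}
P(Y = 20 \mid X = 16) &= \frac{P(Y = 20, X = 16)}{P(X = 16)} = \frac{0.10}{0.50} = 0.2 \\
P(Y = 40 \mid X = 16) &= \frac{P(Y = 40, X = 16)}{P(X = 16)} = \frac{0.30}{0.50} = 0.6
\end{aligned}
\tag{6.14}
\]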

Third, plugging these into (6.12),

\[ E(Y \mid X = 16) = 0 + (20)(0.2) + (40)(0.6) = 4 + 24 = 28 \tag{6.15} \]

As a sanity check, note that given X = 16 the probability of Y = 40 (0.6) is higher than that of Y = 0 (0.2), so it makes sense that the conditional mean is above 20. The specific value E(Y | X = 16) = 28 says that within the part of the population with X = 16 years of education, the mean weekly hours worked is 28.
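
The three steps above (marginal, conditional probabilities, weighted sum) can be sketched in Python for every row of Table 6.3 at once. The nested dictionary `joint` and the function name `conditional_mean` are illustrative choices for this sketch only.

```python
from fractions import Fraction as F

# Joint probabilities from Table 6.3: joint[x][y] = P(X = x, Y = y).
joint = {
    11: {0: F(10, 100), 20: F(5, 100), 40: F(5, 100)},
    12: {0: F(5, 100), 20: F(10, 100), 40: F(15, 100)},
    16: {0: F(10, 100), 20: F(10, 100), 40: F(30, 100)},
}

def conditional_mean(x):
    """E(Y | X = x), combining (6.7) and (6.12)."""
    p_x = sum(joint[x].values())  # marginal P(X = x): sum across the row
    # Each term is y * P(Y = y | X = x) = y * P(X = x, Y = y) / P(X = x).
    return sum(y * p_xy / p_x for y, p_xy in joint[x].items())

for x in joint:
    print(f"E(Y | X = {x}) = {float(conditional_mean(x)):.2f}")
```

For X = 16 this reproduces the value 28 in (6.15); the other two rows follow from the same formula.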

6.2.5 Comparison of Joint, Marginal, and Conditional Distributions

The joint distribution has all the possible information about the distribution of random variables (X, Y). Rearranging (6.7), each joint probability can be reconstructed by multiplying the appropriate conditional and marginal probabilities:

\[ P(X = x, Y = y) = P(Y = y \mid X = x) P(X = x) \tag{6.16} \]

Thus, knowing the joint distribution has the same information as knowing both the conditional (of Y given X = x) and marginal (of X) distributions. However, the conditional distributions alone (without the marginals) contain less information than the joint distribution, i.e., there are multiple possible joint distributions that would be consistent with a single set of conditional distributions. Similarly, the marginal distributions of X and Y alone (without the conditionals) contain less information than the joint distribution.
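
To make the “multiple possible joint distributions” point concrete, here is a small Python sketch. It takes the conditional distributions of Y given X implied by Table 6.1 and applies (6.16) with two different marginals of X. The second marginal (one half each) is a made-up assumption purely for contrast.

```python
from fractions import Fraction as F

# Conditional distributions implied by Table 6.1:
# cond[x][y] = P(Y = y | X = x).
cond = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1, 4), 1: F(3, 4)}}

def build_joint(marginal_x):
    """Apply (6.16): P(X = x, Y = y) = P(Y = y | X = x) * P(X = x)."""
    return {(x, y): cond[x][y] * marginal_x[x] for x in cond for y in cond[x]}

# The marginal of X from Table 6.1 recovers the Table 6.1 joint distribution.
joint_a = build_joint({0: F(2, 10), 1: F(8, 10)})

# A hypothetical alternative marginal gives a different joint distribution,
# even though the conditional distributions are identical.
joint_b = build_joint({0: F(1, 2), 1: F(1, 2)})

print(joint_a)
print(joint_b)
```

Both results are valid joint distributions (each sums to one), but only the first matches Table 6.1, so the conditionals alone cannot pin down the joint.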

Going into regression, we’ll focus on the conditional means E(Y | X = x), which are summary features of conditional distributions, which in turn are summaries of the full joint distribution. That is, conditional means (and regression) only learn one particular feature about the population joint distribution of (X, Y). As discussed in Section 2.3.1, there is a tradeoff between learning more information and having a summary that is more easily understood and communicated.


6.2.6 Independence and Dependence

If random variables X and Y are independent, then they are completely unrelated, statistically speaking. Notationally, independence is usually written as X ⊥⊥ Y, which is equivalent to Y ⊥⊥ X.

Independence implies equality of marginal and conditional distributions. Mathematically, the marginal (unconditional) distribution of Y is the same as the conditional distribution of Y given X = x, for any x. Intuitively, if X is unrelated to Y, then knowing the value of X has no information about the value of Y.

This characterization of independence can be written in terms of a PMF or CDF. If Y is not continuous and thus has a PMF, then independence implies the marginal PMF equals the conditional PMF, for any possible y and x values:

\[ Y \perp\!\!\!\perp X \implies P(Y = y) = P(Y = y \mid X = x) \tag{6.17} \]

If Y is not a nominal categorical variable and thus has a CDF, then for any possible y and x,

\[ Y \perp\!\!\!\perp X \implies P(Y \le y) = P(Y \le y \mid X = x) \tag{6.18} \]

Consequently, independence implies equality of marginal and conditional means, known as mean independence. That is, for any possible x value,

\[ Y \perp\!\!\!\perp X \implies E(Y) = E(Y \mid X = x) \tag{6.19} \]

Independence implies many other properties, too, like Cov(X, Y) = Corr(X, Y) = 0 and P(X = x, Y = y) = P(X = x) P(Y = y).

The opposite of independence is dependence. If any condition implied by independence does not hold, then the variables are dependent, written X ⊥̸⊥ Y. For example, if Corr(X, Y) ≠ 0, then X ⊥̸⊥ Y. Or, if E(Y | X = 1) ≠ E(Y | X = 0), then X and Y are neither independent nor mean independent.
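
One of these implications, P(X = x, Y = y) = P(X = x) P(Y = y), is easy to check cell by cell. The Python sketch below does so for the Table 6.1 joint distribution and for a second, hypothetical joint distribution that I constructed to have the same marginals but product-form cells. The function name `is_independent` is illustrative.

```python
from fractions import Fraction as F

def is_independent(joint):
    """Check P(X = x, Y = y) == P(X = x) * P(Y = y) for every cell of a joint PMF."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(joint[(x, y)] == p_x[x] * p_y[y] for x in xs for y in ys)

# Joint distribution from Table 6.1 (marital status X, employment Y).
table_6_1 = {(0, 0): F(1, 10), (0, 1): F(1, 10),
             (1, 0): F(2, 10), (1, 1): F(6, 10)}

# Hypothetical joint distribution with the same marginals as Table 6.1,
# built so that every cell equals the product of its marginals.
product_form = {(0, 0): F(6, 100), (0, 1): F(14, 100),
                (1, 0): F(24, 100), (1, 1): F(56, 100)}

print(is_independent(table_6_1))     # dependent: one violated cell is enough
print(is_independent(product_form))  # independent by construction
```

A single violated cell is enough to conclude dependence, which mirrors the statement that dependence holds if any condition implied by independence fails.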

6.3 Population Model: Conditional Expectation Function

⟹ Kaplan video: CEF (Binary X)

This and the following sections consider what we want to learn about the population, and how we can write it mathematically. There is no data, no estimation, no uncertainty.

A model describes the relationship between two (or more) variables, like education and income. If it describes how income changes with education, then income is usually written as Y and called the outcome variable, regressand, dependent variable, left-hand side variable, or response variable, while education is written as X and called the regressor, independent variable, right-hand side variable, predictor, covariate, or conditioning variable.

Like before, these variables are treated mathematically as random variables. The “population” is a joint probability distribution of the observable variables.

There are different models for different types of relationships between two variables. Section 6.3 models a statistical relationship, with interpretations for description or prediction, whereas Sections 6.4 and 6.5 model causal relationships. Sometimes the descriptive and causal models coincide, but generally they differ.

This section combines Sections 6.2.1 and 6.2.3 to get a conditional mean regression model.

In Sum: Conditional Expectation Function
Description: shows mean Y for each subpopulation with same X, E(Y | X = x)
Prediction: with quadratic loss, optimal prediction of Y given X = x is E(Y | X = x)
Causality: CEF difference sometimes has causal interpretation (Section 6.6)

6.3.1 Conditional Expectation Function

Using (6.9), let m(·) be the conditional expectation function (CEF) of Y given X:

\[ m(x) \equiv E(Y \mid X = x) \tag{6.20} \]

That is, the CEF m(·) takes a value of x as input, like x = 1, and tells us the corresponding conditional mean of Y, like E(Y | X = 1) = 7.

It helps to remember what’s random and what’s non-random. The CEF m(·) is a non-random function, just as E(Y) is non-random. For any X = x, Y has a conditional distribution whose mean is m(x), a non-random value. You can draw a graph of a CEF just like you graphed any other (non-random) function in high school. In contrast, m(X) is a random variable. That is, there are multiple possible values of m(X) because there are multiple possible values of X.

If X is binary, as in this chapter, then there are two conditional means of interest:

\[ m(0) = E(Y \mid X = 0), \qquad m(1) = E(Y \mid X = 1) \tag{6.21} \]

There are two possible approaches. First, these two conditional means could be studied directly, similar to Chapter 4. That is, Y^A has the distribution of Y given X = 0, and Y^B has the distribution of Y given X = 1. Second, the conditional means can be captured in a CEF regression model.

For example, consider Table 6.1. From (6.11), m(1) ≡ E(Y | X = 1) = 0.75. Also,

\[
\begin{aligned}
m(0) \equiv E(Y \mid X = 0) &= (0) P(Y = 0 \mid X = 0) + (1) P(Y = 1 \mid X = 0) \\
&= P(Y = 1 \mid X = 0) = \frac{P(Y = 1, X = 0)}{P(X = 0)} = \frac{0.1}{0.2} = 0.5
\end{aligned}
\]

Thus, the CEF is m(0) = 0.5, m(1) = 0.75. Also from Table 6.1, the marginal distribution of X is P(X = 0) = 0.2, P(X = 1) = 0.8. Thus, m(X) is a random variable with

\[ P(m(X) = 0.5) = P(X = 0) = 0.2, \qquad P(m(X) = 0.75) = P(X = 1) = 0.8 \tag{6.22} \]

6.3.2 CEF Error Term

Extending (6.4), the CEF error term is defined as

\[ V \equiv Y - m(X) \tag{6.23} \]

i.e., the difference between an individual’s actual outcome Y and the CEF evaluated at her X value, m(X). As always, other letters could be used besides V, like U or W; in other textbooks you may see u or e or ε. Since Y and X are random variables, so is V; e.g., P(V = 0) = P(Y − m(X) = 0).

For example, let X = 1 indicate a college degree (and X = 0 otherwise), and let Y be income. Then, m(0) is the mean income among the no-college population, and m(1) is mean income among college degree holders. If you are a successful tech company CEO who went to college (X = 1), then your Y is high above m(1), so your CEF error in (6.23) is very large and positive. Or, if you didn’t go to college (X = 0) and make exactly the mean income for that group, your Y equals m(0), so your CEF error is V = 0. Or, if you went to a fancy college but decided to live off your parents’ wealth and earn no income, then your Y = 0, so your V = Y − m(1) = 0 − m(1), which is very negative.

The CEF error has conditional mean zero. Extending (6.5), for any X = x,

\[
\begin{aligned}
E(V \mid X = x) &= E(Y - m(X) \mid X = x) = E(Y \mid X = x) - E(m(X) \mid X = x) \\
&= m(x) - m(x) = 0
\end{aligned}
\]

Equivalently,

\[ E(V \mid X) = 0 \tag{6.24} \]

That is, E(V | X) is a random variable depending on X, but it equals zero for every possible value of X; or, just imagine “E(V | X = x) = 0 for all x” every time you see “E(V | X) = 0.”

As in Section 6.2.1, this conditional mean zero property is not an assumption; it is true by definition for any CEF error defined as in (6.23). By analogy, if we define V as a square, then it always has the property of having four equal sides and four equal angles; such properties are not additional assumptions.
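
The “true by definition” claim can be checked numerically. The Python sketch below uses the Table 6.1 joint distribution, computes m(x), and then averages V = Y − m(X) within each X group. The helper names (`p_x`, `m`, `cond_mean_v`) are my own for this sketch.

```python
from fractions import Fraction as F

# Joint probabilities from Table 6.1, keyed by (x, y).
joint = {(0, 0): F(1, 10), (0, 1): F(1, 10), (1, 0): F(2, 10), (1, 1): F(6, 10)}

def p_x(x):
    """Marginal probability P(X = x)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def m(x):
    """CEF: m(x) = E(Y | X = x)."""
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / p_x(x)

def cond_mean_v(x):
    """E(V | X = x), where V = Y - m(X) is the CEF error from (6.23)."""
    return sum((y - m(x)) * p for (xi, y), p in joint.items() if xi == x) / p_x(x)

for x in (0, 1):
    print(f"m({x}) = {m(x)}, E(V | X = {x}) = {cond_mean_v(x)}")
```

Individual values of V are not zero here (for example, V = 1 − 0.75 = 0.25 for a married, employed person), yet they average to zero within each X group.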

6.3.3 CEF Model in Error Form

Given (6.23), extending (6.2), the CEF model in error form is

\[ Y = m(X) + V, \qquad E(V \mid X) = 0 \tag{6.25} \]

The statement E(V | X) = 0 is equivalent to saying m(x) = E(Y | X = x), i.e., that m(·) is the CEF. Again, it is not an assumption about V; it is just stating what type of model this is. Equation (6.25) can apply to non-binary X too, as in later chapters.

6.3.4 Linear CEF Model

With binary X, the model in (6.25) is equivalent to

\begin{align}
Y &= m(0) \mathbf{1}\{X = 0\} + m(1) \mathbf{1}\{X = 1\} + V \tag{6.26} \\
&= m(0)(1 - X) + m(1)(X) + V \notag \\
&= m(0) + [m(1) - m(0)] X + V \tag{6.27}
\end{align}

To double-check: when X = 0, then [m(1) − m(0)]X = 0, so Y = m(0) + V, as in the original (6.25). When X = 1, then m(0) + [m(1) − m(0)]X = m(0) + m(1) − m(0) = m(1), so Y = m(1) + V, also as in (6.25). Thus, (6.25) and (6.27) are equivalent for binary X.

The CEF model in (6.27) can be rewritten yet again to yield a more familiar, conventional structure. Following conventional notation, let β0 ≡ m(0) and β1 ≡ m(1) − m(0). Plugging these definitions into (6.27),

\[ Y = \beta_0 + \beta_1 X + V, \qquad E(V \mid X) = 0 \tag{6.28} \]

In (6.28), β0 and β1 are called the parameters. Greek letters like β are commonly used to denote unknown parameters in a population model. In the frequentist framework, these are seen as unknown but fixed (non-random) values, whereas Y, X, and V are random variables. In (6.28) specifically, β0 is the intercept and β1 is the slope. Sometimes regression parameters are called coefficients; β1 is the slope coefficient, or the coefficient on X.

Model (6.28) is a linear CEF model. It is a “CEF” model because E(Y | X = x) = β0 + β1x, or E(Y | X) = β0 + β1X. The “linear” part is explained in Section 8.2.1; for now, it suffices to recall that a graph of β0 + β1x is a straight line.

Since X is binary, no assumptions were required to write (6.28) given (6.25). However, when X has more than two possible values, it is more complicated, as discussed in Chapter 7. For now, with binary X, the CEF model can always be written as in (6.28).
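
As a small illustration of why no assumptions are needed with binary X, the Python sketch below takes the two conditional means from the Table 6.1 example, defines β0 and β1 as above, and confirms the line passes through both points. The names `beta0`, `beta1`, and `linear_cef` are just labels for this sketch.

```python
from fractions import Fraction as F

# Conditional means from the Table 6.1 example: m(0) = 0.5 and m(1) = 0.75.
m = {0: F(1, 2), 1: F(3, 4)}

beta0 = m[0]         # intercept: beta0 = m(0)
beta1 = m[1] - m[0]  # slope: beta1 = m(1) - m(0)

def linear_cef(x):
    """The CEF implied by (6.28): beta0 + beta1 * x."""
    return beta0 + beta1 * x

for x in (0, 1):
    print(f"beta0 + beta1*{x} = {linear_cef(x)}, m({x}) = {m[x]}")
```

Two points always determine a line, which is why the linear form fits exactly here; with three or more X values, the same construction would generally not match every conditional mean.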

6.3.5 Interpretation: Description and Prediction

Practice 6.3 (regression parameter units). Let Y be salary measured in $/yr, and let X be the number of college degrees an individual has, either X = 0 or X = 1. In (6.28), what are the units of measure for β0 and β1, respectively?

To interpret (6.28), first consider the units of measure. The left-hand side is just Y, the outcome variable. Since they are equal, the right-hand side must have the same units. Thus, each of the three right-hand side terms must have the same units as Y:

1. β0 has the same units as Y.

2. β1X has the same units as Y, so the units of β1 are the units of Y divided by the units of X.

3. V has the same units as Y.

For example, if Y is measured in $/yr and X is the number of college degrees, then the units of β0 are $/yr and the units of β1 are ($/yr)/(degrees).

For description, the model in (6.28) is the CEF, a summary of the conditional distribution. As seen earlier, β0 = m(0) = E(Y | X = 0), the mean outcome among all individuals with X = 0. Also,

\[ \beta_1 = m(1) - m(0) = E(Y \mid X = 1) - E(Y \mid X = 0) \tag{6.29} \]

is the difference between the mean outcome for the X = 1 subpopulation and the mean outcome for the X = 0 subpopulation.

A common phrase to describe such statistical (but maybe not causal) differences is “associated with.” For example, if individuals with a college degree have a mean annual income that is $20,000/yr higher than the mean annual income of non-college individuals, then β1 = $20,000/yr, and you could say, “On average, having a college degree is associated with having a $20,000/yr higher annual income.” This does not claim that going to college has such a causal effect on income, only a statistical association.

For prediction, the model in (6.28) is also helpful. Section 2.5.4 says the mean is the best predictor if the loss function is quadratic. This continues to be true conditional on X: the conditional mean of Y given X = x is the best predictor given quadratic loss. Formally, letting g(·) denote any possible guess (of Y as a function of X),

\[ m(\cdot) = \arg\min_{g(\cdot)} E[(Y - g(X))^2] \tag{6.30} \]

In terms of the model parameters in (6.28), the best predictor of Y given X = 0 is β0, and the best predictor of Y given X = 1 is β0 + β1. Combining these, the best predictor of Y given X is β0 + β1X.
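
Equation (6.30) can be checked by brute force in the Table 6.1 example. The Python sketch below is not a proof: it only searches a grid of candidate guesses (0.00, 0.01, ..., 1.00 for each of g(0) and g(1), a grid I chose for illustration) and reports which pair has the smallest expected quadratic loss.

```python
from fractions import Fraction as F

# Joint probabilities from Table 6.1, keyed by (x, y).
joint = {(0, 0): F(1, 10), (0, 1): F(1, 10), (1, 0): F(2, 10), (1, 1): F(6, 10)}

def expected_loss(g0, g1):
    """E[(Y - g(X))^2] for the guess g(0) = g0, g(1) = g1."""
    g = {0: g0, 1: g1}
    return sum(p * (y - g[x]) ** 2 for (x, y), p in joint.items())

# Candidate guesses 0.00, 0.01, ..., 1.00 for each of g(0) and g(1).
grid = [F(i, 100) for i in range(101)]
best = min(((g0, g1) for g0 in grid for g1 in grid),
           key=lambda g: expected_loss(*g))

print(best)                  # the minimizing pair (g(0), g(1)) on the grid
print(expected_loss(*best))  # the minimized expected loss
```

On this grid the minimizer is g(0) = 1/2 and g(1) = 3/4, which are exactly m(0) = β0 and m(1) = β0 + β1 from the Table 6.1 example.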

6.3.6 Interpretation with Values Besides 0 and 1

What if X has only two possible values, but they aren’t 0 and 1? (Technically such X is still “binary,” but usually people mean 0 and 1 when they say “binary.”)

For example, let Y again be income, and let X be education, but now measured in years. Instead of comparing individuals with a college degree to those without, imagine comparing individuals with X = 12 years of education to X = 13. By convention, “years of education” is measured starting in grade 1. In the U.S., the last year of high school is grade 12, so completing high school means X = 12; taking a year of college classes leads to X = 13.

The fundamental conditional means are m(12) = E(Y | X = 12) and m(13) = E(Y | X = 13), but β0 and β1 in (6.28) have different meanings than before. Even the units of β1 are different because the units of X have changed. Recall that the units of β1 are units of Y divided by units of X. In both cases, Y is measured in $/yr. Before, X was number of college degrees, so the units of β1 were ($/yr)/(degrees). Now, X is measured in years (of education), so the units of β1 are ($/yr)/(yr). By this alone, β1 must now have a different interpretation; it turns out β0 does too.

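The parameters β0 and β1 from the CEF model in (6.28) can now be written in terms of m(12) and m(13). Writing the CEF as m(X) = β0 + β1X as in (6.28),

\[
\begin{aligned}
m(13\,\text{yr}) &= \beta_0 + (13\,\text{yr})\beta_1, \qquad m(12\,\text{yr}) = \beta_0 + (12\,\text{yr})\beta_1 \\
m(13\,\text{yr}) - m(12\,\text{yr}) &= (\beta_0 - \beta_0) + (13\,\text{yr} - 12\,\text{yr})\beta_1 = (1\,\text{yr})\beta_1 \\
\beta_1 &= \frac{m(13\,\text{yr}) - m(12\,\text{yr})}{13\,\text{yr} - 12\,\text{yr}} = [m(13\,\text{yr}) - m(12\,\text{yr})]/\text{yr} \\
\beta_0 &= m(12\,\text{yr}) - (12\,\text{yr})\beta_1 = m(12\,\text{yr}) - 12[m(13\,\text{yr}) - m(12\,\text{yr})]
\end{aligned}
\]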

The meaning of β1 is qualitatively similar to before: it is the difference in mean income between the high and low education subpopulations. However, instead of “per degree,” the units are now “per year” (one year of college education).

In contrast, the interpretation of β0 is very different, and not very meaningful. Before, β0 = m(0) was the mean of the low education subpopulation. Here, β0 takes the low education mean m(12) and then subtracts 12 times the mean difference, 12[m(13) − m(12)]. That is, β0 is trying to extrapolate from the means for X = 13 and X = 12 all the way down to X = 0. This is not the true mean income for individuals with zero years of education, m(0); only the means for X = 12 and X = 13 are known. Rather, it is just a guess of m(0) based on m(12) and m(13), and probably a very poor guess. Further, individuals with zero years of education may be very rare or even nonexistent in the larger population, in which case it is not even interesting to guess. So, while the slope β1 continues to have meaning, the intercept β0 may not, unless X = 0 is a possible and interesting value.

Adding another twist, what if instead of X = 12 and X = 13 (years of education), we compare X = 12 and X = 16? That is, instead of comparing high school (X = 12) to one year of college (X = 13), it is compared to a typical four-year college degree (X = 12 + 4 = 16), more similar to our initial inquiry. With m(X) = β0 + β1X as in (6.28),

\[
\begin{aligned}
m(16\,\text{yr}) &= \beta_0 + (16\,\text{yr})\beta_1, \qquad m(12\,\text{yr}) = \beta_0 + (12\,\text{yr})\beta_1 \\
m(16\,\text{yr}) - m(12\,\text{yr}) &= (\beta_0 - \beta_0) + (16\,\text{yr} - 12\,\text{yr})\beta_1 = (4\,\text{yr})\beta_1 \\
\beta_1 &= \frac{m(16\,\text{yr}) - m(12\,\text{yr})}{4\,\text{yr}} = (1/4)[m(16\,\text{yr}) - m(12\,\text{yr})]/\text{yr} \\
\beta_0 &= m(12\,\text{yr}) - (12\,\text{yr})\beta_1 = m(12\,\text{yr}) - 3[m(16\,\text{yr}) - m(12\,\text{yr})]
\end{aligned}
\]

Again, β0 tries to extrapolate from m(16 yr) and m(12 yr) to guess m(0 yr). Again, such a guess is both inaccurate and irrelevant.

The slope parameter β1 is different than before, but still meaningful. It takes the mean income difference m(16 yr) − m(12 yr), and then divides by 4 yr. This computes a per-year average difference. This idea is similar to describing a 400 km car trip that took 5 hr by saying the average speed was (400 km)/(5 hr) = 80 km/hr. This may be easier to interpret since we more commonly think of km/hr, but it doesn’t mean that the speed was constant during the whole trip; e.g., it may have been slower in the first half due to traffic. Similarly, β1 does not mean each of the four college years is associated with the same increase in mean income. For example, the fourth year may have the biggest increase if the college degree (not the education itself) matters most.
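
To see the extrapolation problem with numbers, here is a Python sketch of the slope and intercept formulas above. The two conditional means ($40,000/yr at X = 12 and $60,000/yr at X = 16) are invented for illustration and do not come from the text; `line_through` is a hypothetical helper name.

```python
def line_through(x_low, m_low, x_high, m_high):
    """Return (beta0, beta1) such that m(x) = beta0 + beta1 * x at both X values."""
    beta1 = (m_high - m_low) / (x_high - x_low)  # per-year mean difference
    beta0 = m_low - x_low * beta1                # extrapolation down to x = 0
    return beta0, beta1

# Hypothetical conditional means of income in $/yr (assumed values).
m12, m16 = 40_000, 60_000

beta0, beta1 = line_through(12, m12, 16, m16)
print(f"beta1 = {beta1} ($/yr)/yr")
print(f"beta0 = {beta0} $/yr")
```

With these assumed numbers the slope is a sensible $5,000/yr per year of education, but the intercept is negative (−$20,000/yr), which is clearly not a believable mean income for anyone. It is only the line's extrapolation to X = 0.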

6.4 Population Model: Potential Outcomes

In contrast to a purely “statistical” model like the CEF model, we could imagine a causal model that shows causal relationships between variables. One way to do this is with potential outcomes, as introduced in Section 4.4. Again let Y^U and Y^T denote the untreated and treated potential outcomes, respectively.

Here, the two observable variables are the observed outcome Y and the treatment indicator (or treatment dummy) X. That is, X = 1 if an individual was “treated,” and X = 0 if not. As before, “treatment” is interpreted very broadly, including things like going to a charter school, a right-to-work law, a tax policy, or even personal characteristics.

The observed outcome is

\[ Y = (Y^U)(1 - X) + (Y^T)(X) \tag{6.31} \]

Plugging in X = 0 yields Y = Y^U, whereas plugging in X = 1 yields Y = Y^T. So, we observe Y = Y^U if the individual is untreated, and Y = Y^T if treated.

Notationally, potential outcomes notation writes Y as a function of X. This is what (6.31) shows: the treated potential outcome is Y(1) = Y^T, and the untreated potential outcome is Y(0) = Y^U.

Equation (6.31) can be rearranged to look more like a linear model in error form. First,

\[ Y = Y^U + X(Y^T - Y^U) \tag{6.32} \]

This is a random coefficient model, with intercept Y^U and slope Y^T − Y^U. The intercept and slope coefficients are “random” in that they can have different possible values for different individuals.

Second, the random coefficients can be turned into non-random coefficients by adding an error term. Define β0 ≡ E(Y^U) and β1 ≡ E(Y^T − Y^U), i.e., the ATE. Then, (6.32) becomes

\begin{align}
Y &= [Y^U + \beta_0 - \beta_0] + X[Y^T - Y^U + \beta_1 - \beta_1] \notag \\
&= \beta_0 + X\beta_1 + \overbrace{Y^U - \beta_0 + X[Y^T - Y^U - \beta_1]}^{\equiv U} \tag{6.33}
\end{align}

Equation (6.33) has the same structure as (6.28), but very different meaning. The parameter β1 is the ATE of X on Y; it has a causal meaning, not just statistical meaning. However, we don’t know if E(U | X) = 0. The error term U might have such statistical properties, but the model itself does not define U in terms of statistical properties, but rather in terms of potential outcomes.
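
The point that E(U | X) need not be zero can be illustrated with a tiny made-up population. In the Python sketch below, the four individuals and their potential outcomes are entirely hypothetical, chosen so that people with higher potential outcomes are the ones who get treated. The helper names are mine.

```python
from fractions import Fraction as F

# Hypothetical population of four equally likely individuals: (Y^U, Y^T, X).
# Those with higher potential outcomes happen to be the treated ones.
population = [(1, 2, 0), (2, 3, 0), (5, 7, 1), (6, 8, 1)]

def mean(values):
    values = list(values)
    return F(sum(values), len(values))

beta0 = mean(yu for yu, _, _ in population)        # E(Y^U)
beta1 = mean(yt - yu for yu, yt, _ in population)  # ATE = E(Y^T - Y^U)

def observed_y(yu, yt, x):
    return yu * (1 - x) + yt * x                   # equation (6.31)

def error_u(yu, yt, x):
    return yu - beta0 + x * (yt - yu - beta1)      # U as defined in (6.33)

def cond_mean(f, x):
    """Mean of f(Y^U, Y^T, X) among individuals with X = x."""
    return mean(f(*person) for person in population if person[2] == x)

cef_slope = cond_mean(observed_y, 1) - cond_mean(observed_y, 0)

print("ATE (beta1):", beta1)
print("CEF slope:  ", cef_slope)
print("E(U | X=0): ", cond_mean(error_u, 0))
print("E(U | X=1): ", cond_mean(error_u, 1))
```

In this constructed example the decomposition (6.33) holds exactly for every individual, yet the error U has a different mean in the two X groups, so the CEF slope differs from the ATE.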

6.5 Population Model: Structural

A structural model also captures causal relationships. The assumption is that the model itself does not change even when variable values and policies change. (“Policy” has a broad meaning here: policies of countries, firms, schools, etc., or even just personal decisions.) More specifically, if we want to assess the causal effect of a certain policy, then the structural model should be invariant to that particular policy. That is, the policy may change the population distribution of variables, but it cannot change the structural model itself; otherwise, the model is not useful.

6.5.1 Linear Structural Model

Consider the linear structural model

\[ Y = \beta_0 + \beta_1 X + U \tag{6.34} \]

Unlike in a CEF, the structural model’s β1 and U have economic and/or causal meaning by definition. In (6.34), β1 is called a structural parameter (as is β0). It has some economic or causal interpretation, like an elasticity or demand curve slope. Similarly, U is called the structural error term. This U can be interpreted as the aggregation of all other variables that causally determine Y. We can think about U economically, not just statistically. It’s possible E(U | X) = 0, but usually not. Conversely, the CEF error Y − m(X) usually does not have causal or economic meaning.

Consider a structural model like (6.34) for the example where Y is income and X is a college degree dummy. Then, U contains everything else that helps determine a person’s income: their occupation, their different skill levels (human capital), where they live (city/country), etc.

Warning on notation: I used V in the CEF model in (6.25) and U here to help you avoid confusion, but they are simply letters. I could have used U in both models, or V in both, or ε, or anything else. (So, don’t think V always means CEF and U means not CEF.)

With only a single binary X, (6.33) and (6.34) seem very similar. Still, as seen in Section 6.6, the reduced form approach to identifying the ATE involves assumptions about potential outcomes, whereas the structural approach involves assumptions about X and U.

Superficially, it appears (6.34) claims the causal effect of X on Y is the constant β1, the same for everybody, but see Section 6.5.2.

Warning: if you see a model Y = β0 + β1X + U, make sure you know whether it’s a CEF model or a structural model or yet another type of model (like in Chapter 7). The equation by itself only shows a linear relationship; it does not tell us the meaning of the parameters or the error term U. This is something to be very wary of when you look at econometric resources online or in other books: they may have models that look identical but are interpreted very differently.

Practice 6.4. Let X = 1 if an individual’s body mass index (BMI) is 30 or greater (the technical definition of obesity) and X = 0 otherwise, and let Y denote hourly wage. Consider the model Y = δ0 + δ1X + W, where δ0 and δ1 are unknown, non-random parameters, and W is the unobserved error term. What is the interpretation of δ1 and W? Explain. (Hint: yes, this is a “trick” question with a very short answer.)

6.5.2 General Structural Model and ASE

The linear model is better for building intuition, but to expand your mind, consider the structural model

\[ Y = h(X, U) \tag{6.35} \]

where Y is the outcome, X is a binary regressor, U = (U1, U2, ...) is a vector containing all causal determinants of Y besides X, and h(·) could be any (non-random) function.

The linear structural model Y = β0 + β1X + U is a special case of (6.35). If h(a, b) = β0 + β1a + g(b), then (6.35) is

\[ Y = h(X, U) = \beta_0 + \beta_1 X + \overbrace{g(U)}^{\text{scalar error } U} \]

That is, the scalar structural error of the linear model is g(·) applied to the vector U of (6.35).

For a single individual, the structural effect on Y of changing X = 0 to X = 1 is

\[ s(U) \equiv h(1, U) - h(0, U) \tag{6.36} \]

This is the causal effect on Y when X increases from 0 to 1, all else equal (ceteris paribus), i.e., holding the value of U fixed. Different individuals have different unobserved U, so they may have different structural effects of X on Y. For example, getting a college degree may increase income a lot for some individuals but not others.

It’s very difficult to learn about s(U) since it depends on things we can’t observe. Instead, we can try to learn about its mean. As usual, there is a tradeoff: the mean effect is not as informative and helpful for policy, but it is easier to learn about.

The average structural effect (ASE) is a weighted average of individual structural effects. The weights depend on the population distribution of U. Mathematically, this weighted average is the population mean:

\[ \mathrm{ASE} \equiv E[s(U)] = E[h(1, U) - h(0, U)] = E[h(1, U)] - E[h(0, U)] \tag{6.37} \]

In this special case with binary X, the final expression looks like the ATE, where Y^U = h(0, U) and Y^T = h(1, U). That is, we can imagine two parallel universes where everything besides X (specifically, U) is the same. But U may contain things that would violate SUTVA, like other people’s X.
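
As a concrete (and entirely hypothetical) instance of (6.36) and (6.37), the Python sketch below assumes a scalar U with three possible values and a made-up structural function h in which the effect of X grows with U. Neither the distribution nor the function comes from the text.

```python
from fractions import Fraction as F

# Hypothetical discrete distribution of a scalar U: value -> probability.
u_dist = {0: F(1, 2), 1: F(3, 10), 2: F(1, 5)}

def h(x, u):
    """Hypothetical structural function: the effect of x grows with u."""
    return 10 + 2 * u + x * (1 + 3 * u)

def s(u):
    """Individual structural effect (6.36): h(1, u) - h(0, u)."""
    return h(1, u) - h(0, u)

# ASE (6.37): probability-weighted average of the individual effects.
ase = sum(p * s(u) for u, p in u_dist.items())

# Equivalent form: E[h(1, U)] - E[h(0, U)].
ase_alt = (sum(p * h(1, u) for u, p in u_dist.items())
           - sum(p * h(0, u) for u, p in u_dist.items()))

print({u: s(u) for u in u_dist})  # individual effects differ across u
print(ase, ase_alt)
```

Here the individual effects are 1, 4, and 7 depending on U, while the ASE is the single summary number 31/10. This is the tradeoff described above: the mean is easier to learn about but hides the heterogeneity.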

The ASE is sometimes called the average causal effect (e.g., Hansen, 2020, Def. 2.7), but so is the ATE, so I avoid “average causal effect” to avoid confusion.

Discussion Question 6.3 (ES habits and final scores). Let Y be a student’s final semester score in this class, 0 ≤ Y ≤ 100, and X = 1 if the student starts each exercise set well ahead of the due date (and X = 0 if not). Consider the structural model Y = a + bX + U and the CEF model Y = c + dX + V.

a) What does U represent? Give some specific examples of what U includes here. (Hint: imagine two students with the same X but different Y; what causes them to have different Y?)

b) Do you think E(U | X = 0) = E(U | X = 1)? Why or why not?

c) Do you think b = d, b < d, or b > d? Why?

Practice 6.5 (ES habits: parameters). In DQ 6.3, what would you guess are reasonable possible values of the parameters a, b, c, and d? Explain.

6.6 Identification

⟹ Kaplan video: Identification of College Effect on Earnings

This section focuses on identification of parameters with causal meaning. In particular, when does the slope parameter β1 in the CEF model also have a causal interpretation? With binary X, β1 is a (conditional) mean difference, so intuition is similar to Section 4.6, but many more details and formal results are provided here. As in Section 4.6, the conditions required for identification are called identifying assumptions.

In Sum: When Does a CEF Difference Have a Causal Interpretation?
If X “exogenous” (unrelated to other determinants of Y; Sections 6.6.1 and 6.6.3)
If X (as good as) randomized (unrelated to potential outcomes; Section 6.6.2)

Under certain conditions (identifying assumptions), the structural slope β1 in (6.34) is identified, equal to the CEF slope γ1 in

\[ E(Y \mid X = x) = \gamma_0 + \gamma_1 x \tag{6.38} \]

This is equivalent to (6.28), just with γ to avoid confusion with (6.34).


Identifying Assumptions and Formal Results

Qualitatively, the structural slope is identified if X and U are “unrelated.” That is, the regressor X must be unrelated to the unobserved determinants of Y (that comprise U); U cannot be systematically higher or lower for certain X values. If true, then X is called exogenous. If not, then X is called endogenous. The precise mathematical condition for a regressor’s exogeneity (or endogeneity) depends on the model.

Quantitatively, there are a few ways to describe “exogeneity” of X in (6.34). Specifically, each of Assumptions A6.1–A6.3 is a sufficient condition for β1 = γ1.

Assumption A6.1 (independent error). U is independent of X: X ⊥⊥ U.

Assumption A6.2 (mean independent error). U is mean independent of X: E(U | X) = E(U). For binary X, equivalently E(U | X = 0) = E(U | X = 1).

Assumption A6.3 (uncorrelated error). U is uncorrelated with X: Corr(U, X) = 0, or equivalently Cov(U, X) = 0.

Some of Assumptions A6.1–A6.3 are stronger than others. Independence is stronger than mean independence:

\[ \overbrace{X \perp\!\!\!\perp U}^{\text{A6.1}} \implies \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \tag{6.39} \]

In general, mean independence is stronger than zero correlation (which is equivalent to zero covariance):

\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \implies \overbrace{\mathrm{Cov}(U, X) = 0}^{\text{A6.3}} \tag{6.40} \]

If X has only two possible values, then mean independence is actually equivalent to (⟺) zero correlation.
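
The equivalence for two-valued X can be seen from a short calculation. As a sketch for the special case X ∈ {0, 1}, write p ≡ P(X = 1) with 0 < p < 1 (this p is notation introduced only here), and use E(UX) = p E(U | X = 1) together with E(U) = p E(U | X = 1) + (1 − p) E(U | X = 0):

```latex
\begin{align*}
\mathrm{Cov}(U, X) &= E(UX) - E(U)\,E(X) \\
&= p\,E(U \mid X = 1) - p\,\bigl[\,p\,E(U \mid X = 1) + (1 - p)\,E(U \mid X = 0)\,\bigr] \\
&= p\,(1 - p)\,\bigl[\,E(U \mid X = 1) - E(U \mid X = 0)\,\bigr]
\end{align*}
```

Since p(1 − p) > 0, the covariance is zero exactly when E(U | X = 1) = E(U | X = 0), which is A6.2 for binary X. For two general values x1 and x2, the same argument applies after rescaling X, which multiplies the covariance by the nonzero constant x2 − x1.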

Theorem 6.1 formally states the identification theorem. You do not need to write (or even fully understand) proofs for this class, but the proof may help deepen understanding and appreciation for some students.

Theorem 6.1 (linear structural identification). Consider the linear structural and CEF models in (6.34) and (6.38), respectively. Assume X has only two possible values, x1 and x2; as a special case, x1 = 0 and x2 = 1. If any one of Assumptions A6.1–A6.3 is true, then the structural slope is identified and equal to the CEF slope, i.e., β1 = γ1. If additionally E(U) = 0, then the structural intercept is also identified, with β0 = γ0.

Proof. It is sufficient to prove the result for A6.2 because it is weaker than A6.1 and equivalent to A6.3 (given only two possible X values). Starting from the structural model,

\begin{align*}
Y &= \beta_0 + \beta_1 X + U \\
&= \beta_0 + \beta_1 X + U + \overbrace{E(U) - E(U)}^{=0} \\
&= \overbrace{\beta_0 + E(U)}^{\gamma_0} + \overbrace{\beta_1}^{\gamma_1} X + \overbrace{U - E(U)}^{V}
\end{align*}

As labeled, the CEF intercept is γ0 = β0 + E(U), and the CEF slope is γ1 = β1, because V ≡ U − E(U) is a CEF error:

\[ E[U - E(U) \mid X] = \overbrace{E[U \mid X]}^{=E(U) \text{ by A6.2}} - E[E(U) \mid X] = E(U) - E(U) = 0 \tag{6.41} \]

Thus, mean independence implies γ1 = β1, and E(U) = 0 further implies γ0 = β0.

In Practice

In practice, Assumptions A6.1–A6.3 are usually difficult to justify. Recall the example where Y is income and X is having a college degree. Imagine U includes something called “ability” that includes all skills not gained directly from college (e.g., skills learned from a parent). To have β1 = γ1, U would have to satisfy E(U | X = 0) = E(U | X = 1), i.e., the non-college and college subpopulations have the same mean ability. Of course there are many types of ability, but it seems likely that in general college graduates are higher ability. In fact, the most famous of Michael Spence’s Nobel Prize-winning work¹ provides a more formal economic model of why the college graduates should have higher ability, i.e., why E(U | X = 1) > E(U | X = 0).

Returning to an even simpler example, let Y be commute time, and X = 1 if people are carrying umbrellas, with X = 0 otherwise. Since the umbrellas themselves have no effect on Y, the structural β1 = 0. Since rain affects Y, rain is part of U, although U may also include traffic conditions and such. When X = 0, there is probably no rain, whereas when X = 1, there probably is rain; thus, E(U | X = 1) > E(U | X = 0). Again, the structural error U is clearly not a CEF error. Consequently, the CEF slope has only statistical meaning, not causal meaning. Indeed, here the CEF slope is the difference in mean commute time between days when people carry umbrellas and days they don’t, which should be a substantial positive difference (longer commutes on days people carry umbrellas because those are rainy days). However, the causal effect of umbrellas is zero, so the CEF’s slope is bigger than the structural β1.

If we could also observe weather conditions, then it might be plausible that the remaining parts of U are unrelated to X. This identification approach is considered in Chapters 9 and 10.
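The umbrella example can be mimicked with simulated data. The following is a rough sketch under made-up assumptions (the rain probability, the umbrella-carrying probabilities, and the 10-minute rain effect are all hypothetical, not from the text): the structural slope is set to zero, yet the estimated CEF slope is clearly positive because rain is part of U and is related to X.

```r
set.seed(1)
n    <- 10000
rain <- rbinom(n, size = 1, prob = 0.3)          # unobserved; part of U
# Assumed behavior: umbrellas are usually (not always) carried when it rains.
X    <- rbinom(n, size = 1, prob = ifelse(rain == 1, 0.9, 0.1))

beta0 <- 30
beta1 <- 0                                       # umbrellas have no causal effect
U     <- 10 * rain + rnorm(n, mean = 0, sd = 3)  # rain lengthens commutes
Y     <- beta0 + beta1 * X + U

mean_U_by_X <- tapply(U, X, mean)                # estimates E(U | X = x)
cef_slope   <- unname(coef(lm(Y ~ X))["X"])      # estimates the CEF slope gamma1

print(mean_U_by_X)
print(c(structural_slope = beta1, estimated_CEF_slope = cef_slope))
```

In this simulated setup the mean of U is visibly higher when X = 1, so A6.2 fails and the regression slope describes the data without measuring any causal effect of umbrellas.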

¹ See Spence (1973) or the very brief overview of his signaling model at https://en.wikipedia.org/wiki/Michael_Spence#Career


Discussion Question 6.4 (marriage and salary). Let X = 1 if married and otherwise X = 0. Let Y be annual salary. Consider the structural model Y = β0 + β1X + U.

a) Explain why probably E(U | X = 1) ≠ E(U | X = 0), and say which you think is higher. (Hint: first think about what else is in U, i.e., what determines someone’s salary; or think about variables that differ on average between married and unmarried individuals, and whether any of those help determine salary.)

b) Does the average salary difference between married and unmarried individuals have a structural meaning? Why/not?


Identification of the average treatment effect (ATE) was initially discussed in Section 4.3. Here, identifying assumptions and results are formally stated in regression notation.

Updating (4.12) to this chapter’s notation, the ATE is identified when

$$ E(Y^T) - E(Y^U) = E(Y \mid X = 1) - E(Y \mid X = 0), \tag{6.42} $$

where $Y^T$ and $Y^U$ are (still) the potential treated and untreated outcomes, respectively. The important feature of (6.42) is that the right-hand side contains only observable variables Y and X. Usually, with enough data, we can learn the population joint probability distribution of (Y, X), which in turn determines the conditional means E(Y | X = 1) and E(Y | X = 0). If (6.42) is true, then learning about these conditional means (on the right-hand side) helps us learn the ATE (left-hand side).


Assumption A6.4 is SUTVA, as discussed in Section 4.4.

Assumption A6.5 is related to the discussion of randomized treatment in Section 4.3. Mathematically, the key is that randomization satisfies statistical independence between the treatment assignment and the individual’s pair of potential outcomes: X ⊥⊥ ($Y^U$, $Y^T$).

Assumption A6.6 was not discussed before, but it is intuitive: if everybody (or nobody) is treated, then it’s impossible to compare treated and untreated outcomes. For example, if P(X = 1) = 0, then nobody is treated, so we only observe Y = $Y^U$ for everybody. We can learn about E($Y^U$), but it’s impossible to learn about E($Y^T$) since $Y^T$ is literally never observed. Although obvious here, a more general overlap assumption may not always hold in more complex models.

The following identifying assumptions combined together are sufficient but not necessary. That is, if they are all true, then the ATE is identified, but there may be other ways to identify the ATE even if they are violated.

The assumptions have various names. Assumption A6.4 is usually just called SUTVA, but the main part of it is often called no interference (or non-interference). Assumption A6.5 has many names: independence, ignorability, or unconfoundedness. The combination of A6.5 and A6.6 is called strong ignorability. For more detail, history, and discussion, see Imbens and Wooldridge (2007).

Assumption A6.4 (SUTVA). Everyone with X = 1 receives the same treatment, and one individual’s treatment does not affect any other individual’s potential outcomes.

Assumption A6.5 (unconfoundedness). Treatment is independent of the potential outcomes: X ⊥⊥ ($Y^U$, $Y^T$).

Assumption A6.6 (overlap). There is strictly positive probability of both treatment and non-treatment: 0 < P(X = 1) < 1.

Formally, identification of the ATE is shown as follows. The key is that A6.5 allows us to observe representative samples of both $Y^U$ and $Y^T$. Mathematically, this independence assumption implies that the means of the potential outcomes do not statistically depend on the treatment X:

$$ E(Y^T) = E(Y^T \mid X = 1), \qquad E(Y^U) = E(Y^U \mid X = 0). \tag{6.43} $$

From (6.31), Y = $Y^T$ when X = 1 and Y = $Y^U$ when X = 0, so

$$ E(Y^T \mid X = 1) = E(Y \mid X = 1), \qquad E(Y^U \mid X = 0) = E(Y \mid X = 0). \tag{6.44} $$

Combining (6.43) and (6.44), this says that the population mean of the treated potential outcome, E($Y^T$), equals the mean of the observed outcome in the treated population, E(Y | X = 1), which in Section 4.6 was E($Y^B$). Since E(Y | X = 1) is a feature of the joint distribution of (Y, X), it is identified. Since E($Y^T$) = E(Y | X = 1), it is also identified. Similarly, E($Y^U$) = E(Y | X = 0) is identified, so E($Y^T$) − E($Y^U$) is identified.

Theorem 6.2 (ATE identification). Under A6.4–A6.6, the ATE is identified:

$$ E(Y^T - Y^U) = E(Y^T) - E(Y^U) = E(Y \mid X = 1) - E(Y \mid X = 0), $$

which is the slope β1 in the linear CEF model in (6.28).

Proof. The constructive proof of ATE identification links the unobservable ATE with the observable mean difference. Specifically,

$$
\begin{aligned}
E(Y^T - Y^U) &= E(Y^T) - E(Y^U) && \text{(by linearity)} \\
  &= E(Y^T \mid X = 1) - E(Y^U \mid X = 0) && \text{(by (6.43))} \\
  &= E(Y \mid X = 1) - E(Y \mid X = 0). && \text{(by (6.44))}
\end{aligned}
$$

This equals the linear CEF slope coefficient, as shown in (6.29).
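The identification argument can be illustrated with simulated potential outcomes. This is a minimal sketch with invented numbers (the outcome scale and the average effect of −5 are assumptions, not from the text); it generates both potential outcomes, which is only possible in a simulation, and then randomizes treatment so that A6.5 and A6.6 hold by construction.

```r
set.seed(2)
n  <- 20000
YU <- rnorm(n, mean = 50, sd = 10)          # untreated potential outcome
YT <- YU - 5 + rnorm(n, mean = 0, sd = 2)   # treated potential outcome; effects vary
ate <- mean(YT - YU)                        # infeasible in real data: needs both outcomes

X <- rbinom(n, size = 1, prob = 0.5)        # randomized, so X is independent of (YU, YT)
Y <- ifelse(X == 1, YT, YU)                 # only one potential outcome is observed

mean_diff <- mean(Y[X == 1]) - mean(Y[X == 0])
ols_slope <- unname(coef(lm(Y ~ X))["X"])

print(c(ATE = ate, mean_difference = mean_diff, OLS_slope = ols_slope))
```

The mean difference and the OLS slope coincide (they are the same calculation with a binary regressor), and both are close to the average of the individual effects, differing only by sampling error.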


In Practice

Imagine a knee surgery treatment (X) to help arthritis, where Y is knee-specific pain (between 0 and 100). For each individual, we can imagine two parallel universes, identical except for whether the individual gets the treatment (surgery) or not. It is the same surgery for everybody, and naturally one person’s surgery cannot affect another person’s pain, so SUTVA is satisfied. Half of patients are randomly assigned the treatment, so X ⊥⊥ ($Y^U$, $Y^T$) and 0 < P(X = 1) < 1. Thus, Assumptions A6.4–A6.6 are all satisfied, and Theorem 6.2 says the ATE equals the CEF slope, which we can estimate by OLS.

Surgery seems like a very straightforward example, but there can still be problems. For example, maybe the people who volunteer to participate in the randomized experiment are not representative of the general population; e.g., maybe they are feeling very desperate because they have particularly severe arthritis. Another issue is that some people may be hurt by the treatment even if the overall ATE seems helpful. Perhaps the biggest issue in real life was that “surgery” was treated as a black box, without understanding the particular mechanism that reduced pain. It turned out that placebo (fake) surgery was equally effective.²

Consider Theorem 6.2 when X is rain and Y is commute time. In Columbia, MO, there is much less traffic in the “summer” (mid-May to mid-August) when most students are gone, meaning both $Y^T$ and $Y^U$ are lower. There is also more rain (X = 1). That is, X and ($Y^U$, $Y^T$) are related, violating Assumption A6.5. Intuitively, the problem is we’d see more short, rainy commutes in the summer, and long, dry commutes during the academic year, which makes it seem like rain causes short commutes, but correlation does not imply causation.

Practice 6.6. Discuss the right-to-work example from Sections 4.3–4.5 in terms of Assumptions A6.4–A6.6.

6.6.3 Average Structural Effect

The ASE in (6.37) is identified if it equals the CEF slope. A sufficient condition for this is Assumption A6.7, which is qualitatively similar to Assumption A6.5.

Assumption A6.7 (independence). The unobservable determinants of Y are independent of X; e.g., in the notation of (6.35), U ⊥⊥ X.

Theorem 6.3 (ASE identification). Consider the general structural model in (6.35) and the ASE defined in (6.37). If Assumption A6.7 holds, then the ASE is identified and equal to the slope of the linear CEF in (6.28).

² https://doi.org/10.1056/NEJMoa013259


Proof. Using Y = h(X, U),

$$
\begin{aligned}
E(Y \mid X = 1) - E(Y \mid X = 0) &= E[h(X, U) \mid X = 1] - E[h(X, U) \mid X = 0] \\
  &= E[h(1, U) \mid X = 1] - E[h(0, U) \mid X = 0] \\
  &= E[h(1, U)] - E[h(0, U)], && \text{(by A6.7)}
\end{aligned}
$$

which equals the ASE as in (6.37).

6.7 Estimation: OLS

⟹ Kaplan video: OLS in R

This section considers estimation of the CEF model (6.28) when X has only two possible values. The interpretation (description, prediction, causality) does not matter for estimation.

One approach is to define $Y^A$ as the X = 0 subpopulation and $Y^B$ as the X = 1 subpopulation. Then, β0 = E($Y^A$) and β1 = E($Y^B$) − E($Y^A$), so Sections 4.1.2 and 4.1.3 can be used to estimate E($Y^A$) and E($Y^B$).

Though simple, that approach does not generalize as well as ordinary least squares (OLS). The intuition behind the least squares approach comes from the characterization of the conditional mean as the best predictor of Y given X = x, with quadratic loss. The idea extends (3.11) for estimating the unconditional mean of Y. In the population, if E(Y | X = x) = β0 + β1x, then

$$ (\beta_0, \beta_1) = \arg\min_{b_0, b_1} E[L_2(Y, b_0 + b_1 X)] = \arg\min_{b_0, b_1} E[(Y - b_0 - b_1 X)^2], \tag{6.45} $$

where $L_2(y, g)$ is the quadratic loss function from (2.39). This shows that the CEF provides the best (with quadratic loss) predictor of Y given X. In the sample, replacing the population mean (E) with the sample mean ($\frac{1}{n}\sum_{i=1}^{n}$), the minimization problem analogous to (6.45) is

$$ \text{OLS:} \quad (\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2. \tag{6.46} $$

The estimated CEF is thus

$$ \hat m(x) = \hat\beta_0 + \hat\beta_1 x. \tag{6.47} $$

Extending Sections 3.3 and 3.4.2, the OLS regression estimator can be explained in terms of the empirical distribution. Instead of a single S representing Y, now we have ($S_Y$, $S_X$) representing (Y, X). If all the sample values ($Y_i$, $X_i$) are unique, then the empirical distribution has P($S_Y = Y_i$, $S_X = X_i$) = 1/n for all i = 1, ..., n. Thus, replacing the population mean in (6.45) with the sample average in (6.46) can be seen as using the empirical distribution. That is,

$$ (\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} E[(S_Y - b_0 - b_1 S_X)^2] = \arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2. \tag{6.48} $$


Notationally, as usual, the “hats” on $\hat\beta_0$, $\hat\beta_1$, and $\hat m(x)$ indicate that they are computed from the sample, whereas the true population values β0, β1, and m(x) lack hats.

The form of (6.46) explains the L and S in OLS. “Least” (L) refers to minimization, and “squares” (S) refers to squaring $Y_i - b_0 - b_1 X_i$. (Explaining the O is a special treat reserved for econ PhD students.)

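Equation (6.46) can be described with the terms introduced around (3.12). Given any estimates ($\hat\beta_0$, $\hat\beta_1$), the fitted values are

$$ \hat Y_i \equiv \hat\beta_0 + \hat\beta_1 X_i = \hat m(X_i). \tag{6.49} $$

Given $\hat Y_i$, the residual is defined as

$$ \hat U_i \equiv Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i. \tag{6.50} $$

Consequently, (6.46) can be interpreted as saying that the OLS estimates ($\hat\beta_0$, $\hat\beta_1$) make the sum of squared residuals $\sum_{i=1}^{n} \hat U_i^2$ as small as possible.

The OLS estimator is consistent under fairly general conditions; see Section 7.7.2. In that case,

$$ \hat\beta_0 \xrightarrow{p} \beta_0, \quad \hat\beta_1 \xrightarrow{p} \beta_1, \quad \hat m(0) \xrightarrow{p} m(0), \quad \hat m(1) \xrightarrow{p} m(1). \tag{6.51} $$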

6.7.1 Code

The following code runs OLS. It also computes the sample means of $Y_i$ for the group with $X_i$ = 0 and the group with $X_i$ = 1 separately. This verifies that with a single binary regressor, $\hat\beta_0$ is the sample average $Y_i$ for the $X_i$ = 0 group, while $\hat\beta_0 + \hat\beta_1$ is the sample average $Y_i$ for the $X_i$ = 1 group, so $\hat\beta_1$ is the difference.

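    # First five Y values belong to the X=0 group, last five to the X=1 group
    df <- data.frame(Y=c(1,4,0,2,3, 8,7,6,9,5), X=c(0,0,0,0,0, 1,1,1,1,1))
    ret <- lm(formula=Y~X, data=df)
    coef(ret)
    ## (Intercept)           X
    ##           2           5
    # Sample mean of Y within each group
    c( mean(df$Y[df$X==0]), mean(df$Y[df$X==1]) )
    ## [1] 2 7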
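To connect the lm() output back to the minimization problem in (6.46), the same estimates can be recovered by handing the average squared residual to a generic numerical optimizer. The sketch below reuses the ten data points above; the function name avg_sq_resid and the starting values are arbitrary choices made here for illustration.

```r
df <- data.frame(Y = c(1,4,0,2,3, 8,7,6,9,5), X = c(0,0,0,0,0, 1,1,1,1,1))

# Objective from (6.46): average squared residual for candidate values b = (b0, b1).
avg_sq_resid <- function(b) mean((df$Y - b[1] - b[2] * df$X)^2)

# Generic numerical minimization, starting from (0, 0).
num_min <- optim(par = c(0, 0), fn = avg_sq_resid, method = "BFGS")
ols     <- unname(coef(lm(Y ~ X, data = df)))

print(rbind(numerical = num_min$par, lm = ols))

# Residuals from the OLS fit, as in (6.50), and their sum of squares.
resid_ols <- df$Y - ols[1] - ols[2] * df$X
ssr_ols   <- sum(resid_ols^2)
print(ssr_ols)

# Any other candidate line has a larger sum of squared residuals, for example:
print(sum((df$Y - 2.5 - 4 * df$X)^2))
```

The numerical minimizer agrees with lm() up to the optimizer's tolerance; lm() itself uses an exact linear-algebra solution rather than iterative search.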


6.8 Quantifying Uncertainty

⟹ Kaplan video: OLS in R (again)

The ways to quantify uncertainty in Section 3.8 also apply to β0 and β1 in the linear CEF model (6.28). The same interpretations and misinterpretations apply. In particular, these methods do not reflect uncertainty about identifying assumptions. For example, a CI that contains the CEF slope with 95% probability does not contain the structural slope with 95% probability if it is not identified; it could be only 80%, or 50%, or near 0%.

One new consideration is discussed in Section 6.8.1, followed by sample code in Section 6.8.2.

6.8.1 Heteroskedasticity

Different methods for quantifying uncertainty make different assumptions about the conditional variance. Whereas the conditional mean E(Y | X = x) is the mean of the conditional distribution of Y given X = x, the conditional variance

$$ \sigma_Y^2(x) \equiv \mathrm{Var}(Y \mid X = x) \tag{6.52} $$

is the variance of the conditional distribution of Y given X = x. The term homoskedasticity means $\sigma_Y^2(x) = \sigma_Y^2$, a constant not depending on x, whereas heteroskedasticity means $\sigma_Y^2(x)$ is not constant. Equivalently, we could write Y = β0 + β1X + U and consider the conditional variance of U, since Var(Y | X) = Var(U | X), so often homoskedasticity and heteroskedasticity are thought of as properties of the error term.
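With a binary regressor, the conditional variance in (6.52) has a simple sample analog: the sample variance of Y within each X group. The sketch below uses a made-up data-generating process (standard deviation 1 when X = 0 and 2 when X = 1, chosen here only for illustration) to show heteroskedasticity appearing in the data.

```r
set.seed(3)
n <- 4000
X <- rep(c(0, 1), each = n / 2)
# Assumed DGP: the spread of Y is larger in the X = 1 group.
Y <- ifelse(X == 1, rnorm(n, mean = 0.2, sd = 2), rnorm(n, mean = 0, sd = 1))

# Sample analogs of Var(Y | X = 0) and Var(Y | X = 1).
cond_var <- tapply(Y, X, var)
print(cond_var)

# The OLS residuals have the same within-group variances, because within
# each group the fitted value is a constant (the group's sample mean).
res <- resid(lm(Y ~ X))
cond_var_resid <- tapply(res, X, var)
print(cond_var_resid)
```

Here the two group variances differ by roughly a factor of four, so a method that assumes one common variance would misstate the uncertainty.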

Always use methods that are robust to heteroskedasticity (or heteroskedasticity-robust). This means they’re valid with homoskedasticity or heteroskedasticity, whereas other methods only work with homoskedasticity. Logically, the heteroskedasticity-robust methods have weaker assumptions, so they work more often. Besides, heteroskedasticity is very common in real economic data.

The term “robust” by itself is ambiguous. You should always ask: robust to what? Methods can be robust to heteroskedasticity, robust to clustered sampling, robust to measurement error, robust to infinite variance, etc.

Practice 6.7 (heteroskedasticity). Let Y = 1 if employed (and Y = 0 if not), and let X = 1 if female (and X = 0 if not). Explain why there is probably heteroskedasticity. (Hint: if p = P(Y = 1), then Var(Y) = p(1 − p). If $p_x$ = P(Y = 1 | X = x), then what’s Var(Y | X = x)?)

6.8.2 Code

Unfortunately, the default in R is to use homoskedasticity-based standard errors, so you have to make an extra effort to get heteroskedasticity-robust results. The below code does this. Since X is binary, the same results can be obtained with a two-sample unpaired t-test with “unequal variances,” as shown.

The below code quantifies uncertainty about the CEF slope in a regression with a single binary regressor. Using a variety of methods, the code computes a standard error (SE), 95% confidence interval, and t-statistic and two-sided p-value for testing the null hypothesis H0: β1 = 0.

In the table of output at the very end, the first two rows assume homoskedasticity, whereas the remaining four rows do not. The first row is a two-sample t-test assuming equal variances; the second row is the default results based on lm() output. The third row is a two-sample t-test allowing for unequal variances. The remaining rows use more general regression-based methodology allowing for heteroskedasticity, based on the lmtest and sandwich packages in R (Zeileis, 2004; Zeileis and Hothorn, 2002). The first two rows are identical, and the following four rows are very similar to each other, but there is a big difference between the first two rows and the next four rows. This shows the (potentially) big difference between assuming homoskedasticity (as in the first two rows) and allowing for heteroskedasticity (as in the last four). There are multiple ways to allow for heteroskedasticity, like the HC0, HC1, and HC3 shown in the table. The differences are beyond our scope, but as the table suggests, the differences are often very small in practical terms.

In practice, you should use coeftest and coefci to allow for heteroskedasticity, like below.

    library(lmtest); library(sandwich)
    set.seed(112358)
    n <- 1000
    df <- data.frame(Y=c(rnorm(n=n/4, mean=0, sd=1),
                         rnorm(n=3*n/4, mean=0.2, sd=2)),
                     X=c(rep(0,n/4), rep(1,3*n/4)))
    ret <- lm(formula=Y~X, data=df)
    # Store results for slope in slout
    rn <- c('ttesteq','Homosk','ttestuneq','HC0','HC1','HC3')
    slout <- data.frame(row.names=rn, SE=rep(NA,6), CIlower=NA,
                        CIupper=NA, tstat=NA, pvalue=NA)
    # HC0: original from Hal White (1980)
    retVC0 <- vcovHC(ret, type='HC0')
    slout['HC0','SE'] <- sqrt(retVC0[2,2])
    # HC1: matches Stata default and two-sample t.test below
    retVC1 <- vcovHC(ret, type='HC1')
    # HC3: recommended/default (and larger SE than HC0, HC1)
    retVC3 <- vcovHC(ret, type='HC3')
    # Default: homoskedastic results
    slout['Homosk',c('SE','tstat','pvalue')] <-
      summary(ret)$coefficients['X',2:4]
    # Heteroskedasticity-robust tests/p-values
    slout['HC0',c(1,4,5)] <- coeftest(ret, vcov.=retVC0)['X',2:4]
    slout['HC1',c(1,4,5)] <- coeftest(ret, vcov.=retVC1)['X',2:4]
    slout['HC3',c(1,4,5)] <- coeftest(ret, vcov.=retVC3)['X',2:4]
    # Heteroskedasticity-robust CIs (shortest to longest)
    slout['HC0',2:3] <- coefci(ret, vcov.=retVC0)['X',]
    slout['HC1',2:3] <- coefci(ret, vcov.=retVC1)['X',]
    slout['HC3',2:3] <- coefci(ret, vcov.=retVC3)['X',]
    slout['Homosk',2:3] <- confint(ret, level=0.95)['X',]
    # For comparison: t.test() results for slope
    tsl <- t.test(x=df$Y[df$X==1], y=df$Y[df$X==0], mu=0, conf.level=0.95,
                  alternative='two.sided', paired=FALSE, var.equal=FALSE)
    slout['ttestuneq',-1] <-
      c(tsl$conf.int, tsl$statistic, tsl$p.value)
    # For comparison: var.equal=TRUE
    t2 <- t.test(x=df$Y[df$X==1], y=df$Y[df$X==0], mu=0, conf.level=0.95,
                 alternative='two.sided', paired=FALSE, var.equal=TRUE)
    slout['ttesteq',-1] <- c(t2$conf.int, t2$statistic, t2$p.value)
    # Back out each t-test SE from its CI: (upper - lower) / (2 * critical value)
    slout['ttestuneq',1] <-
      (tsl$conf.int %*% c(-1,1)) / (2*qt(p=1-0.05/2, df=tsl$parameter))
    slout['ttesteq',1] <-
      (t2$conf.int %*% c(-1,1)) / (2*qt(p=1-0.05/2, df=t2$parameter))
    print(round(slout, digits=3))

    ##              SE CIlower CIupper tstat pvalue
    ## ttesteq   0.128  -0.026   0.476  1.76  0.079
    ## Homosk    0.128  -0.026   0.476  1.76  0.079
    ## ttestuneq 0.095   0.038   0.412  2.36  0.018
    ## HC0       0.095   0.039   0.412  2.37  0.018
    ## HC1       0.095   0.038   0.412  2.37  0.018
    ## HC3       0.095   0.038   0.412  2.36  0.018
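For readers curious what the HC0 and HC1 labels compute, the following base-R sketch builds the “sandwich” variance by hand from the standard textbook formulas, without the sandwich package. It is an illustration only: the seed and variable names are choices made here, and in practice vcovHC() as used above is the better tool.

```r
set.seed(4)
n <- 1000
X <- c(rep(0, n / 4), rep(1, 3 * n / 4))
Y <- c(rnorm(n / 4, mean = 0, sd = 1), rnorm(3 * n / 4, mean = 0.2, sd = 2))

Xmat <- cbind(1, X)                              # regressor matrix with intercept
bhat <- solve(t(Xmat) %*% Xmat, t(Xmat) %*% Y)   # OLS coefficients
uhat <- as.vector(Y - Xmat %*% bhat)             # OLS residuals
k    <- ncol(Xmat)

bread <- solve(t(Xmat) %*% Xmat)
meat  <- t(Xmat) %*% (Xmat * uhat^2)             # sum over i of uhat_i^2 * x_i x_i'
V_HC0 <- bread %*% meat %*% bread                # White (1980) estimator
V_HC1 <- V_HC0 * n / (n - k)                     # degrees-of-freedom rescaling
V_hom <- bread * sum(uhat^2) / (n - k)           # assumes homoskedasticity

se <- c(homosk = sqrt(V_hom[2, 2]),
        HC0    = sqrt(V_HC0[2, 2]),
        HC1    = sqrt(V_HC1[2, 2]))
print(round(se, 3))
```

HC1 is just HC0 scaled up by n/(n − k), which is why the two are nearly identical at n = 1000, whereas the homoskedasticity-based SE differs noticeably because the two groups have different variances and sizes.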

Practice 6.8 (regression significance). Consider the setup of the “audit study” from Bertrand and Mullainathan (2004). Resumes were fabricated that were identical except for the name: Emily (suggesting a white female), Greg (white male), Lakisha (black female), or Jamal (black male). The resumes were then submitted to job openings, and it was recorded whether or not an in-person interview for the job was then offered. Here, let Y = 1 if an interview was offered and Y = 0 if not; let X = 1 if the name is “black” and X = 0 if not. Note that E(Y | X = x) = P(Y = 1 | X = x), i.e., the conditional probability of an interview. A regression of Y on X (including an intercept, as always) is run and heteroskedasticity-robust standard errors are computed. Consider both economic significance and statistical significance in the following possible results. (Economic and statistical significance were introduced in Section 3.9.7.) Hint: to quickly assess statistical significance, $|\hat\beta_1 / \mathrm{SE}| \ge 2$ means statistical significance at a 5% level, with higher values being more statistically significant.

a) Slope estimate $\hat\beta_1$ = 0.00001, SE = 0.000001
b) $\hat\beta_1$ = −0.1, SE = 0.1
c) $\hat\beta_1$ = −0.2, SE = 0.02
d) $\hat\beta_1$ = −0.01, SE = 0.01


Empirical Exercises

Empirical Exercise EE6.1. You will essentially replicate EE4.1, but with regression commands.

a. R only: load the needed packages and look at a description of the dataset:
    library(wooldridge); library(sandwich); library(lmtest)
    ?jtrain2

b. Stata only: run ssc install bcuse if necessary, then load the data with
    bcuse jtrain2 , nodesc clear

c. Run a regression of 1978 earnings (re78) on the job training assignment indicator (train).
    R: ret <- lm(re78~train, data=jtrain2)
    Stata: regress re78 train , vce(robust)
   in which vce(robust) requests heteroskedasticity-robust standard errors.

d. R only (since already reported in Stata): output the estimates along with heteroskedasticity-robust standard errors and two-sided 95% confidence intervals, with the code
    coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
    coefci(ret, vcov.=vcovHC(ret, type='HC1'))
   where argument type='HC1' refers to one specific type (among multiple) of heteroskedasticity-robust standard error estimator (HC stands for “heteroskedasticity-consistent”).

e. R only: create a subset of the data including only married individuals, with code
    jt2mar1 <- jtrain2[jtrain2$married==1, ]

f. Run your previous analysis for the subset of married individuals.
    R: replace data=jtrain2 with data=jt2mar1
    Stata: regress re78 train if married==1 , vce(robust)

g. Repeat your analysis but for unmarried individuals.

h. Repeat your analysis on the full sample of individuals, but for the outcome variable unem78 (1978 unemployment indicator) instead of re78 (and remember unemployment is bad, so negative coefficient is good).

Chapter 7

Simple Linear Regression


Depends on: Chapter 6 (which depends on Chapters 2–4)



7.2 Interpret what a linear regression estimates, in multiple ways, mathematically and intuitively [TLOs 2 and 3]

7.3 Assess whether certain assumptions for linear regression seem true or not in real-world examples [TLOs 2 and 6]

7.4 In R (or Stata): estimate a simple linear regression along with measures of statistical uncertainty, and judge economic and statistical significance [TLO 7]


• Regression as description (Masten video)

• Sections 4.1–4.2 (“Simple Linear Regression” and “Estimating the Coefficients of the Linear Regression Model”) in Hanck et al. (2018)

• Sections 2.1 (“Simple OLS Regression”) and 2.2 (“Coefficients, Fitted Values, and Residuals”) in Heiss (2016) [repeated from Chapter 6]

Surprisingly, many critical issues arise with three (instead of two) possible X values. With two, the regression modeled conditional means, useful for description, prediction, and (sometimes) causality. However, with three (or more) X values, we may fail to model the conditional means. In simple cases, this can be solved with a more flexible model; in other cases, we need to reinterpret what OLS actually estimates in practice.


Generally, OLS estimates something called a linear projection. This can also be interpreted as a “best” linear approximation of the CEF (for description) or a “best” linear predictor of Y given X (for prediction). These interpretations are discussed, along with statistical properties of OLS as an estimator of the linear projection (not CEF).

7.1 Misspecification

⟹ Kaplan video: Misspecification of Linear CEF

Consider the linear population model

$$ Y = \beta_0 + \beta_1 X + U, \tag{7.1} $$

where supposedly E(U | X) = 0, and this time X has three possible values: 0, 1, and 2.

Intuitively, you should worry already: there are now three conditional means, but still only two parameters. That is, we want to learn the three values

$$ m(0) \equiv E(Y \mid X = 0), \quad m(1) \equiv E(Y \mid X = 1), \quad m(2) \equiv E(Y \mid X = 2), $$

but (7.1) has only two parameters, β0 and β1. That’s like trying to serve three dinners on only two plates. That does not sound like enough flexibility.

Mathematically, the question is whether

$$ m(x) = \beta_0 + \beta_1 x, \quad x = 0, 1, 2, $$

i.e., whether m(x) is a straight line.

For example, let Y be income and let X be number of siblings. Maybe there is a big income gap between only children (X = 0) and individuals with one sibling (X = 1), but having a second sibling (X = 2) does not change much. To simplify this, let m(0) > m(1) = m(2). From m(1) and m(2) alone, the CEF appears flat (zero slope), in which case β0 = m(1) and β1 = 0 fits these two points. But from m(0) and m(1), the slope appears negative, β1 < 0, and the intercept is β0 = m(0). There is no (β0, β1) that can make β0 + β1x go through all three points m(0), m(1), and m(2) if m(0) > m(1) = m(2).

Figure 7.1 shows the impossibility of a linear CEF in this previous example. In the figure, m(0) = 60 (in thousands of $/yr) and m(1) = m(2) = 40. The line with β0 = m(0) and β1 = m(1) − m(0) < 0 fits the first two CEF values but not the third. The line with β0 = m(1) and β1 = 0 fits the second two CEF values but not the first. It is impossible to draw a straight line (β0 + β1x) through all three points on this CEF, as Euclid could tell us.

[Figure 7.1: Misspecification of linear CEF. Horizontal axis: X (values 0, 1, 2); vertical axis: m(X) (from 0 to 60).]

A wrong model is euphemistically termed misspecified. That is, the model assumes something that is not actually true. For the siblings and income example, the linear CEF model in (7.1) is misspecified. The model incorrectly assumes that the conditional mean of Y is linear in X (i.e., an affine function of X). Equivalently, it assumed m(1) − m(0) = m(2) − m(1), which is not true in the example. More specifically, this type of misspecification is called functional form misspecification, since it is the linear functional form that is wrong. That is, even though any values of (β0, β1) are allowed, β0 + β1x is always a straight-line function of x, so it has a linear functional form (the general “shape” of the function).
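The two candidate lines from the siblings example can be tabulated directly. This short sketch uses the CEF values from the example (60, 40, 40); the helper names line_a and line_b are made up here for illustration.

```r
x <- c(0, 1, 2)
m <- c(60, 40, 40)      # CEF values m(0), m(1), m(2) from the siblings example

# Line through the first two CEF points, and line through the last two.
line_a <- function(x) m[1] + (m[2] - m[1]) * x   # beta0 = m(0), beta1 = m(1) - m(0)
line_b <- function(x) m[2] + 0 * x               # beta0 = m(1), beta1 = 0

print(rbind(x = x, CEF = m, line_a = line_a(x), line_b = line_b(x)))

# A linear CEF requires equal increments: m(1) - m(0) equal to m(2) - m(1).
increments <- diff(m)
is_linear  <- isTRUE(all.equal(increments[1], increments[2]))
print(increments)
print(is_linear)
```

Each line matches the CEF at two of the three X values and misses the third, and the unequal increments confirm that no single straight line can match all three.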

Practice 7.1 (misspecification). Investigate whether the problem with the sibling example was that X = 0 was a possible value (so that the intercept had to be β0 = m(0)), as follows. Consider the same example but with X = 1, 2, 3 instead of X = 0, 1, 2, so m(1) = 60, m(2) = m(3) = 40. Is it possible to write m(x) = β0 + β1x now? Why or why not?

It’s technically possible that a CEF is linear, though extremely unlikely in practice. Continuing the example, if m(2) = 20 exactly, then m(x) = 60 − 20x, i.e., β0 = 60 and β1 = −20. In that case, the linear CEF is properly specified (or correctly specified). If instead m(2) = 20.001, the linear CEF model is misspecified. However, with such a small amount of misspecification, a linear model is a very good approximation.

Arguably, we must learn how best to cope with misspecification, since we cannot truly avoid it. As Box (1979, p. 2) famously wrote, “All models are wrong, but some are useful.”¹ With reference to Box’s quote, Section 8.3 essentially tries to maximize a model’s usefulness by choosing the optimal amount of “how wrong” it is (misspecification).

7.2 Coping with Misspecification

There are two ways to cope with misspecification: change the model, or reinterpret it. The first way is now discussed for (7.1), while reinterpretation is detailed in Sections 7.3–7.5.

¹ See https://en.wikipedia.org/wiki/All_models_are_wrong for additional discussion, including the analogous quote about art from Pablo Picasso, “We all know that art is not truth. Art is a lie that makes us realize truth, at least the truth that is given us to understand. The artist must know the manner whereby to convince others of the truthfulness of his lies.”

7.2.1 Model of Three Values

To fix the misspecification, the model needs to be more flexible. Continuing with X = 0, 1, 2 for simplicity, there are three conditional means, so the model should have three parameters to be flexible enough to avoid misspecification.

One way to add another parameter is to use a dummy variable for each possible value of X. (See Section 2.3.2 to review dummy variables.) Recall the indicator function from (2.3). Here,

$$ \mathbb{1}\{X = j\} = \begin{cases} 1 & \text{if } X = j \\ 0 & \text{otherwise} \end{cases}, \quad j = 0, 1, 2. \tag{7.2} $$

Since only three values of X are possible, $\mathbb{1}\{X = 0\} = 1 - \mathbb{1}\{X = 1\} - \mathbb{1}\{X = 2\}$. Thus, extending (6.26),

$$
\begin{aligned}
m(x) &= m(0)\,\mathbb{1}\{x = 0\} + m(1)\,\mathbb{1}\{x = 1\} + m(2)\,\mathbb{1}\{x = 2\} && (7.3) \\
  &= m(0)[1 - \mathbb{1}\{x = 1\} - \mathbb{1}\{x = 2\}] + m(1)\,\mathbb{1}\{x = 1\} + m(2)\,\mathbb{1}\{x = 2\} \\
  &= m(0) + [m(1) - m(0)]\,\mathbb{1}\{x = 1\} + [m(2) - m(0)]\,\mathbb{1}\{x = 2\} \\
  &= \beta_0 + \beta_1\,\mathbb{1}\{x = 1\} + \beta_2\,\mathbb{1}\{x = 2\}, \\
\beta_0 &\equiv m(0), \quad \beta_1 \equiv m(1) - m(0), \quad \beta_2 \equiv m(2) - m(0). && (7.4)
\end{aligned}
$$

Although the structure of (7.3) is easier to interpret, the structure of (7.4) is more common and can be interpreted as follows. The parameter β0 = m(0) is the conditional mean for some base category, X = 0. The other parameters show how other conditional means differ from this base category. Specifically, β1 = m(1) − m(0) is the conditional mean difference between the X = 1 and X = 0 subpopulations, and β2 = m(2) − m(0) is the conditional mean difference between the X = 2 and X = 0 subpopulations.

This interpretation can be applied to the income and siblings example. The parameter β0 is the population mean income among individuals with zero siblings. Zero siblings is the base category. Then, β1 is the difference in mean income between the 1-sibling and 0-sibling subpopulations. Earlier, m(0) = 60 (thousands of $/yr) and m(1) = 40, so β1 = m(1) − m(0) = −20. Finally, β2 is the mean income difference between the 2-sibling and 0-sibling (not 1-sibling) subpopulations: m(2) − m(0) = 40 − 60 = −20.
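The dummy-variable parameterization in (7.4) can be tried on a tiny dataset. The six observations below are invented for illustration so that the group averages are 60, 40, and 40, matching the siblings example; the column names D1 and D2 are arbitrary.

```r
# Made-up data whose group averages are 60, 40, and 40.
df <- data.frame(X = c(0, 0, 1, 1, 2, 2),
                 Y = c(55, 65, 35, 45, 30, 50))

# Dummy variables for X = 1 and X = 2; X = 0 is the base category.
df$D1 <- as.numeric(df$X == 1)
df$D2 <- as.numeric(df$X == 2)

fit_dummies <- lm(Y ~ D1 + D2, data = df)   # three parameters, as in (7.4)
fit_linear  <- lm(Y ~ X, data = df)         # two parameters, as in (7.1)
group_means <- tapply(df$Y, df$X, mean)

print(coef(fit_dummies))                    # base-category mean, then two differences
print(group_means)
print(rbind(dummies = unique(fitted(fit_dummies)),
            linear  = unique(round(fitted(fit_linear), 3))))
```

The dummy-variable fit reproduces each group average exactly, while the straight-line fit cannot match all three.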

Discussion Question 7.1 (Facebook). Let X = 0, 1, 2 be the number of Facebook accounts somebody has, and Y is hours of social media consumption per week.

a) Explain what it means for a CEF model E(Y | X = x) = β0 + β1x to be misspecified.

b) Describe a specific real-world reason to suspect misspecification in this example.


7.2.2 More Than Three Values

More generally, even if X has more than three possible values, dummy variables could be used similarly to avoid CEF misspecification. Extending (7.3), there can be a dummy variable for each possible value of X, and a corresponding parameter for each. Any such model allowing an arbitrarily different conditional mean of Y for each possible value of X is called fully saturated. A fully saturated CEF model cannot be misspecified. (But, it may not have any causal meaning, and may be practically impossible to estimate.)

In more complex settings, it is impossible to fix misspecification completely. For example, if X could be any real number between 0 and 1, then an infinite number of parameters is required to model the conditional expectations for the infinite number of X values; this is impossible in practice.

In such settings where misspecification is unavoidable, how can we interpret the model and its parameters? There are three interpretations of a more general linear model that includes the linear CEF model as a special case. These are discussed next.

In Sum: Interpretations of What OLS Estimates

Linear projection: gets β0 + β1X “closest to” Y, probabilistically (Section 7.3)

“Best” linear approximation (BLA) of CEF: “best” (smallest mean quadratic loss) approximation of E(Y | X) with linear form β0 + β1X (Section 7.4)

“Best” linear predictor (BLP): “best” (smallest mean quadratic loss) prediction of Y given X, with linear form β0 + β1X (Section 7.5)

7.3 Linear Projection

⟹ Kaplan video: Linear Projection and “Best” vs. “Good”

The linear projection model is important because it is what OLS actually estimates. Two additional interpretations of the linear projection are described in Sections 7.4 and 7.5.

7.3.1 Geometric Intuition

You may have seen orthogonal projection in geometry or linear algebra. There is some shape (or vector space), and there is a point outside it. Projecting the point onto the shape consists of finding the point within the shape that is closest to the outside point.

Figure 7.2 illustrates projection. There is a large gray circle shape, and two points outside of it (small triangle, dot). The small triangle on the border of the large circle is the “closest” point to the outside small triangle, as measured by Euclidean distance. That is, the dashed line connecting the small triangles is just barely long enough to reach the gray circle from the outside triangle point; if it were any shorter, it could not reach any point in the gray circle. Similarly, the dot on the border of the gray shape is the projection of the outside dot onto the shape: of all the points in the gray space, it is closest to the outside dot (by Euclidean distance).

[Figure 7.2: Orthogonal projection]

This idea can be written mathematically Let dE(w z) denotethe Euclidean distance between points w and z Let S denote ashape which is a set of points Let y denote the outside point and pthe projection In Figure 72 the gray circle is S the outside smalltriangle (or dot) is y and the small triangle (or dot) on the circlersquosborder is p The projection of point y onto shape S is the pointinside S thatrsquos closest to y ie that minimizes the distance to yMathematically

p = arg minsisinS

dE(y s) (75)
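The minimization in (7.5) can be checked numerically. The following is only a sketch under assumed values: the shape $S$ is taken to be the unit disk and the outside point is $y = (3, 4)$, neither of which comes from Figure 7.2. For a point outside a disk, the closest point lies on the border, so the brute-force search only looks there.

```r
# Hypothetical example: project the outside point y onto the unit disk S.
y <- c(3, 4)                     # outside point (its distance from the origin is 5)
p_closed <- y / sqrt(sum(y^2))   # known closed-form projection onto the unit disk

# Brute-force version of (7.5): evaluate d_E(y, s) on a fine grid of border points s
theta <- seq(0, 2 * pi, length.out = 100001)
S <- cbind(cos(theta), sin(theta))
d <- sqrt((S[, 1] - y[1])^2 + (S[, 2] - y[2])^2)
p_grid <- S[which.min(d), ]      # arg min over the grid

print(p_closed)   # 0.6 0.8
print(p_grid)     # approximately the same point
print(min(d))     # approximately 4, the distance from y to the disk
```

The grid search is deliberately naive; it only illustrates that "projection" means "arg min of distance."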

7.3.2 Probabilistic Projection

Linear projection with random variables is the same idea, but with a different definition of distance and a different "shape" to search over.

Notationally, let $\mathrm{LP}(Y \mid 1, X)$ denote the linear projection (LP) of $Y$ onto $(1, X)$. The $(1, X)$ specifies the "shape" that we search over: random variables that can be written as $a + bX$ for constants $a$ and $b$, i.e., linear combinations of $(1, X)$. (Linear combinations and linearity are detailed in Section 8.2.1.) Without the 1, $\mathrm{LP}(Y \mid X)$ would only consider $bX$, with no intercept.

The closest "point" inside the "shape" is usually written $\beta_0 + \beta_1 X$. Mathematically, parallel to (7.5),

\[ \mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \arg\min_{a+bX} d(Y, a+bX) = \arg\min_{a+bX} \sqrt{E[(Y - a - bX)^2]}, \tag{7.6} \]

where Euclidean distance $d_E(\cdot, \cdot)$ has been replaced by a probabilistic "distance" measure

\[ d(A, B) \equiv \sqrt{E[(A - B)^2]}. \tag{7.7} \]

That is, linear projection gets $\beta_0 + \beta_1 X$ as "close" to $Y$ as possible, in a probabilistic sense.


7.3.3 Formulas and Interpretation

Some calculus (omitted) yields a formula for each linear projection coefficient (LPC), $\beta_0$ and $\beta_1$. In this special case with a single regressor $X$ and an intercept,

\[ \beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)}, \qquad \beta_0 = E(Y) - \beta_1 E(X). \tag{7.8} \]

Writing $\sigma_Y^2 = \mathrm{Var}(Y)$ and $\sigma_X^2 = \mathrm{Var}(X)$, $\beta_1$ can be rewritten in terms of correlation:

\[ \beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X^2} \cdot \frac{\sigma_Y}{\sigma_Y} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X \sigma_Y} \cdot \frac{\sigma_Y}{\sigma_X} = \mathrm{Corr}(Y, X) \frac{\sigma_Y}{\sigma_X}. \tag{7.9} \]

Either version of the formula shows how the linear projection slope $\beta_1$ is related to the linear dependence (covariance or correlation) between $Y$ and $X$. Once the slope is determined, the intercept $\beta_0$ simply moves the linear projection line up or down so that $E(Y) = \beta_0 + \beta_1 E(X)$. That is, the linear projection always goes exactly through the point $(x, y) = (E(X), E(Y))$.
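To make (7.8) concrete, here is a small sketch that computes population LPCs exactly for a discrete $X \in \{0, 1, 2\}$. It borrows the nonlinear CEF values $m(0) = 60$, $m(1) = m(2) = 40$ and the two distributions of $X$ described later in Section 7.7.3; the helper name `lpc()` is made up for this illustration.

```r
# Population LPCs from (7.8) for a discrete regressor (lpc() is a hypothetical helper).
lpc <- function(x, px, m) {
  # x: possible values of X; px: P(X = x); m: CEF values m(x) = E(Y | X = x)
  EX   <- sum(px * x)
  EY   <- sum(px * m)                  # E(Y) = E[m(X)]
  CovYX <- sum(px * x * m) - EX * EY   # E(XY) = E[X m(X)]
  VarX <- sum(px * x^2) - EX^2
  beta1 <- CovYX / VarX
  c(beta0 = EY - beta1 * EX, beta1 = beta1)
}

x <- 0:2
m <- c(60, 40, 40)                    # nonlinear CEF
lp_a <- lpc(x, px = (3:1) / 6, m)     # P(X = j) = (3 - j)/6
lp_b <- lpc(x, px = (1:3) / 6, m)     # P(X = j) = (j + 1)/6
print(rbind(lp_a, lp_b))
# lp_a: beta0 = 58, beta1 = -12
# lp_b: beta0 = 54, beta1 = -8
```

The same CEF yields different slopes under the two distributions, which is the point made about nonlinear CEFs in Section 7.4.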

People often interpret the linear projection coefficients less precisely. For the slope, a common phrase is, "A one-unit increase in $X$ is associated with a $\beta_1$ change in $Y$." The intercept is often not mentioned since $\beta_0 = E(Y) - \beta_1 E(X)$ is not easy to interpret, except when the regressor has been demeaned so that $E(X) = 0$, in which case $\beta_0 = E(Y)$. In this case, $\beta_0$ is called the "centercept" instead of intercept, but despite the better interpretation, it is rarely seen in economics.

For description, (7.8) shows that the LPCs summarize the joint probability distribution of $(Y, X)$. The joint distribution of $(Y, X)$ determines $E(Y)$, $E(X)$, $\mathrm{Cov}(Y, X)$, and $\mathrm{Var}(X)$, which then determine $\beta_0$ and $\beta_1$. Although a two-number summary of a complicated joint distribution is very convenient, clearly much information is lost in such a summary. Just as percentiles (quantiles) complement the mean in describing $Y$, quantile regression complements the LPCs in describing $(Y, X)$, though it is beyond our scope.

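Although you don't need to know it for this class, the LPCs can be written in matrix form. This generalizes more easily. Define vector

\[ \mathbf{X} \equiv (1, X)' = \begin{bmatrix} 1 \\ X \end{bmatrix}, \tag{7.10} \]

where $(1, X)'$ indicates the transpose of the row vector $(1, X)$. Then,

\[ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = [E(\mathbf{X}\mathbf{X}')]^{-1} E(\mathbf{X} Y) = \begin{bmatrix} 1 & E(X) \\ E(X) & E(X^2) \end{bmatrix}^{-1} \begin{bmatrix} E(Y) \\ E(XY) \end{bmatrix}. \tag{7.11} \]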
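As a quick numerical check of the matrix form (7.11), the sketch below solves the 2-by-2 system with base R's `solve()`. The discrete distribution is an assumed example (the values $m(0) = 60$, $m(1) = m(2) = 40$ and $P(X = j) = (3 - j)/6$ are the ones used in Section 7.7.3).

```r
# Matrix form of the LPCs, (7.11), for an assumed discrete distribution.
x  <- 0:2
px <- (3:1) / 6          # P(X = x)
m  <- c(60, 40, 40)      # m(x) = E(Y | X = x)

EXX <- matrix(c(1,           sum(px * x),
                sum(px * x), sum(px * x^2)), nrow = 2, byrow = TRUE)  # E(XX')
EXY <- c(sum(px * m), sum(px * x * m))                                 # E(XY) vector

beta <- solve(EXX, EXY)  # solves E(XX') beta = E(XY) without forming the inverse
print(beta)              # 58 -12
```

Using `solve(A, b)` rather than `solve(A) %*% b` is a common design choice: it gives the same answer here but avoids explicitly inverting the matrix.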

7.3.4 Linear Projection Model in Error Form

Analogous to (6.25) for the CEF, the linear projection model can be written in error form. Analogous to defining the CEF error as $Y - m(X)$, the linear projection error is defined as

\[ U \equiv Y - \mathrm{LP}(Y \mid 1, X) = Y - (\beta_0 + \beta_1 X). \tag{7.12} \]

Notationally, as usual, there is nothing special about the letter $U$ (or $\beta$, or even $X$ or $Y$); e.g., it is mathematically equivalent to define $V \equiv W - \mathrm{LP}(W \mid 1, Z)$ and use $(\gamma_0, \gamma_1)$ for the LPCs. Given the definition of $U$ in (7.12), it is always true that $E(U) = \mathrm{Cov}(X, U) = 0$. Thus, the model

\[ Y = \beta_0 + \beta_1 X + U, \qquad E(U) = \mathrm{Cov}(X, U) = 0 \tag{7.13} \]

is equivalent to $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$.

As with the CEF, the meaning of $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$ is more clearly explicit, but sometimes the error form (7.13) is more convenient mathematically, as in Section 12.3.2.
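The properties in (7.13) can be verified exactly in a small assumed example. The sketch below uses a discrete $X \in \{0, 1, 2\}$ with the nonlinear CEF $m(0) = 60$, $m(1) = m(2) = 40$ (values from Section 7.7.3) and shows that $E(U) = \mathrm{Cov}(X, U) = 0$ even though the conditional mean of $U$ is not zero at any $x$.

```r
# Linear projection error U = Y - (beta0 + beta1 X) in an assumed discrete example.
x  <- 0:2
px <- (3:1) / 6
m  <- c(60, 40, 40)

# LPCs from (7.8)
EX <- sum(px * x); EY <- sum(px * m)
beta1 <- (sum(px * x * m) - EX * EY) / (sum(px * x^2) - EX^2)
beta0 <- EY - beta1 * EX

# E(U | X = x) = m(x) - (beta0 + beta1 x), since Y = m(X) + (CEF error)
u_given_x <- m - (beta0 + beta1 * x)
EU    <- sum(px * u_given_x)
CovXU <- sum(px * x * u_given_x) - EX * EU

print(u_given_x)      # 2 -6 6: not zero, so U is not a CEF error here
print(c(EU, CovXU))   # both (numerically) zero, as (7.13) requires
```

This is why the LP error property $\mathrm{Cov}(X, U) = 0$ is weaker than the CEF error property $E(U \mid X) = 0$.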

7.4 "Best" Linear Approximation

⟹ Kaplan video: "Best" Linear Approximation

⟹ Kaplan video: Linear Projection and "Best" vs. "Good" (again)

For description, the linear projection can be interpreted as the best linear approximation (BLA) of the true CEF. "Best" here assumes quadratic loss, similar to how the mean $E(Y)$ is the "best" predictor of $Y$ with quadratic loss. "Linear" refers to a function of the form $a + bX$ (see Section 8.2.1). Mathematically,

\[ \mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \overbrace{\arg\min_{a+bX} E\{[m(X) - (a + bX)]^2\}}^{\text{BLA}}, \qquad m(X) \equiv E(Y \mid X). \tag{7.14} \]

That is, among all possible $a + bX$, the linear projection $\beta_0 + \beta_1 X$ is the function of $X$ that best approximates $E(Y \mid X)$.

This implies that if the CEF is linear in $X$, then the linear projection equals the CEF. That is, if $m(X) = \beta_0 + \beta_1 X$, then $m(X) - (\beta_0 + \beta_1 X) = 0$. Since this term is squared in (7.14), zero is the smallest possible value, so $\beta_0 + \beta_1 X$ is the BLA.

Otherwise, the BLA treats more probable $X$ as more important when trying to get the linear approximation "close" to the true CEF. The mean $E\{\cdot\}$ in (7.14) is a weighted average with more weight on more probable $X$, so it is more important to make $m(X) - (a + bX)$ close to zero for such $X$ values.

7.4.2 Limitations

Unfortunately, "best" does not always mean "good." Sometimes the CEF is so highly nonlinear that even the best linear approximation is still a very poor approximation. By analogy, "Among all cities in Missouri, St. Louis is closest to Kuwait" does not mean "St. Louis is close to Kuwait." Here, Kuwait is the true CEF, Missouri is the set of all functions linear in $X$, and St. Louis is the BLA. Sometimes the best (closest) is still not good (not close).

The following example of a "bad" BLA is from Hansen (2020, §2.28). Let $Y = X + X^2$, with no error term, so $m(x) = x + x^2$, too. If $X \sim N(0, 1)$, then the BLA/LP turns out to be $\mathrm{LP}(Y \mid 1, X = x) = 1 + x$. The function $1 + x$ is a bad approximation of $x + x^2$ (try graphing it).

Further, the distribution of $X$ can greatly affect the BLA of a nonlinear CEF. For example, Figure 7.1 shows two possible BLA lines for the same nonlinear CEF. One line is the BLA when the distribution of $X$ satisfies $P(X = 2) = 0$. The other line is the BLA when $P(X = 0) = 0$. The two lines are very different.

However, the BLA interpretation does at least assure us that when the CEF is approximately linear, the linear projection approximates the CEF well.
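The claim that the BLA is $1 + x$ can be checked by simulation. This is only a rough sketch: the sample size and seed are arbitrary choices, and OLS on a large simulated sample is used as a stand-in for the population linear projection.

```r
# Simulation check of the "bad BLA" example: Y = X + X^2 with X ~ N(0,1).
set.seed(2020)                 # arbitrary seed
n <- 100000                    # large n so OLS is close to the population LP
X <- rnorm(n)
Y <- X + X^2                   # no error term, so m(x) = x + x^2
blp <- coef(lm(Y ~ X))
print(round(blp, 2))           # both coefficients should be near 1

# Compare the CEF with the line 1 + x at a few x values
xs <- c(-2, -1, 0, 1, 2)
print(cbind(x = xs, CEF = xs + xs^2, BLA = 1 + xs))
```

At $x = -2$ and $x = 2$ the line $1 + x$ misses the CEF by 3 in each case, which shows how poor the "best" approximation can be.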

7.5 "Best" Linear Predictor

For prediction, the linear projection can be interpreted as the best linear predictor (BLP) of $Y$ given $X$. As with the BLA, "best" assumes quadratic loss. "Linear" again refers to the form $a + bX$. As in (2.55), the optimal predictor minimizes mean quadratic loss. Mathematically,

\[ \mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \overbrace{\arg\min_{a+bX} E[L_2(Y, a + bX)]}^{\text{BLP}} = \arg\min_{a+bX} E\{[Y - (a + bX)]^2\}. \tag{7.15} \]

That is, among all possible $a + bX$, the linear projection $\beta_0 + \beta_1 X$ is precisely the function of $X$ that "best" predicts $Y$ given knowledge of $X$.

Mathematically, (7.15) is the same as (7.6) but without the $\sqrt{\cdot}$. Although phrased differently, the linear projection goal of getting $\beta_0 + \beta_1 X$ "closest" to $Y$ is essentially the same as prediction: we want a predictor $\beta_0 + \beta_1 X$ that is "closest" to $Y$.

Unfortunately, as with BLA, "best" does not mean "good." However, as with BLA, this means the CEF does not need to be exactly linear in order for the linear projection to make good predictions.

As in Section 2.5, "prediction" here is defined entirely within the population. It does not refer to using data to guess the future; there is no data here. Instead, the BLP is an ideal predictor: it is the (linear) predictor we would use if we fully knew everything about the population. The BLP is something we wish to learn. Fortunately, the BLP (and BLA and LP) is precisely what OLS estimates.

Discussion Question 7.2 (BLP). Let $Y$ be income (thousands of dollars per year) and $X$ be number of siblings. When $X = 0$, the mean $Y$ is 60, and $50 \le Y \le 70$. When $X = 1$, the mean $Y$ is 40, and $30 \le Y \le 50$. When $X = 2$, it's the same as when $X = 1$: the mean $Y$ is 40, and $30 \le Y \le 50$. In a population with mostly $X = 1$ and $X = 2$, the BLP is $\mathrm{LP}(Y \mid 1, X) = 43 - 2X$.

a) What $Y$ does the BLP predict when $X = 0$?
b) Is the prediction from (a) good? Why/why not?

7.6 Causality Under Misspecification

Some things can be said about causality under misspecification, but none as pleasing as the BLP for prediction or BLA for description. For example, if the structural error $U$ satisfies the CEF error property $E(U \mid X) = 0$, then the structural function is the CEF, so the linear projection is also the best linear approximation of the structural function. Alternatively, if the structural model is linear, $Y = \beta_0 + \beta_1 X + U$, and if $\mathrm{Cov}(X, U) = 0$, then $\beta_1$ equals the linear projection slope coefficient. However, the linear structural model may be misspecified, too. This is one motivation for "nonparametric" CEF estimation (Section 8.3).

7.7 OLS Estimation and Inference

OLS estimation was initially discussed in Section 6.7, along with important terms like fitted values and residuals. Here, additional insights, statistical properties, and code are provided.

7.7.1 OLS Estimator Insights

For the OLS estimator, it is most important to know the statistical properties and R functions, but I can't resist a couple comments on the nature of the estimator itself.

First, the "least squares" formulation of the OLS estimator from (6.46) mirrors the BLP definition in (7.15). That is, following the analogy principle (Section 3.3), replacing the population mean ($E$) in (7.15) with the sample mean ($\frac{1}{n} \sum_{i=1}^{n}$) yields (6.46). This reinforces that OLS fundamentally estimates the BLP (or equivalently the LP or BLA), not the CEF. The CEF equals the LP only in the very special case of a linear CEF.

Second, the OLS estimator can be written parallel to the population linear projection coefficients in (7.8). Again replacing population mean with sample mean, and replacing population variance and covariance with sample variance and covariance, (7.8) turns into

\[ \hat{\beta}_1 = \frac{\widehat{\mathrm{Cov}}(Y, X)}{\widehat{\mathrm{Var}}(X)} = \frac{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i, \quad \bar{X} \equiv \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{7.16} \]

This matches the formulas from solving the minimization problem in (6.46) directly. This may seem surprising at first, but recall that the population formula in (7.8) came from solving the population minimization problem. For this reason, $\hat{\beta}_1$ may be called the sample analog of $\beta_1$, and similarly for $\hat{\beta}_0$ and $\beta_0$, just as the sample mean $\hat{E}(Y) = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is the sample analog of the population mean $E(Y)$ (Sections 3.3 and 3.4).

Third, the OLS estimator essentially performs orthogonal projection in the linear algebra sense. The actual math is beyond our scope, but to get the fitted values $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, the vector of $Y_i$ values is projected onto a certain subspace defined by the $X_i$ values.
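The sample-analog formulas in (7.16) can be compared directly with `lm()`. The simulated data below are an arbitrary assumption chosen only for illustration; the last lines also check two consequences of the orthogonal projection view (residuals average to zero and are orthogonal to the regressor).

```r
# Compare the sample-analog formulas (7.16) with lm() on simulated data.
set.seed(1)                                   # arbitrary seed
n <- 200
X <- sample(0:2, size = n, replace = TRUE)
Y <- 60 - 20 * X + rnorm(n)                   # assumed data-generating process

b1 <- mean((Y - mean(Y)) * (X - mean(X))) / mean((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)

fit <- lm(Y ~ X)
print(rbind(formula = c(b0, b1), lm = unname(coef(fit))))  # rows should agree

# Orthogonal projection: residuals have mean zero and zero sample covariance with X
res <- residuals(fit)
print(c(mean(res), sum(res * X)))             # both numerically zero
```

Using $\frac{1}{n}$ or $\frac{1}{n-1}$ in both the numerator and denominator gives the same slope because the factor cancels.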

7.7.2 Statistical Properties

The following statistical properties consider OLS as an estimator of the linear projection coefficients (LPCs). These properties hold true under very general assumptions. If the CEF is linear, then it equals the linear projection, so these properties would equally apply to CEF estimation. If the CEF is linear and additional assumptions hold such that the CEF slope identifies the ASE of $X$ on $Y$ (Section 6.5.2), then the following properties apply to ASE estimation.

However, as before, the measures of statistical uncertainty (like confidence intervals) say nothing about uncertainty in the identifying assumptions. The statistical uncertainty only captures uncertainty about the LPCs.

Assumptions

The following assumptions combined are sufficient for Theorems 7.1–7.3, but not necessary (using logical terms from Section 6.1).

Assumption A7.1 (iid sampling). Sampling of $(Y_i, X_i)$ is iid.

Assumption A7.2 (non-constant regressor). The regressor $X$ is not a constant; i.e., there is no single value $x$ such that $P(X = x) = 1$.

Assumption A7.3 (finite variances). The variances of $Y$ and $X$ are finite: $\mathrm{Var}(Y) < \infty$, $\mathrm{Var}(X) < \infty$. Or equivalently, the expected values of $Y^2$ and $X^2$ (i.e., second moments) are finite: $E(Y^2) < \infty$, $E(X^2) < \infty$.

Assumption A7.4 (finite fourth moments). The expected values of $Y^4$ and $X^4$ (i.e., fourth moments) are finite: $E(Y^4) < \infty$, $E(X^4) < \infty$.

Assumption A7.1 was discussed in Section 3.2 for $Y_i$ by itself. If we let vector $W_i \equiv (Y_i, X_i)$ be what's observed about individual $i$, and vector $W_k \equiv (Y_k, X_k)$ be the observation for individual $k$, then the iid assumption is essentially the same as before: $W_i \perp\!\!\!\perp W_k$ for $i \ne k$ ("independent"), and $W_i$ and $W_k$ have the same distribution ("identically distributed"). More specifically, "independent" means $(Y_i, X_i) \perp\!\!\!\perp (Y_k, X_k)$ for $i \ne k$, which implies $Y_i \perp\!\!\!\perp Y_k$, $X_i \perp\!\!\!\perp X_k$, $Y_i \perp\!\!\!\perp X_k$, and $X_i \perp\!\!\!\perp Y_k$, but implies nothing about (in)dependence between $X_i$ and $Y_i$ (or $X_k$ and $Y_k$). "Identically distributed" says $(Y_i, X_i)$ and $(Y_k, X_k)$ have the same joint distribution, which implies the conditional and marginal distributions (and their features) are also identical. For example, $E(Y_i) = E(Y_k)$, $\mathrm{Var}(X_i) = \mathrm{Var}(X_k)$, $E(Y_i \mid X_i = x) = E(Y_k \mid X_k = x)$, $P(Y_i \le 0 \mid X_i = x) = P(Y_k \le 0 \mid X_k = x)$, etc. All this readily generalizes to multiple regressors, just redefining $W_i \equiv (Y_i, X_{1i}, X_{2i}, \ldots)$.

There can be dependence among the elements of $W_i$. Specifically, the outcome $Y_i$ and regressor $X_i$ may be correlated or otherwise dependent. The iid assumption does not restrict the relationship between $Y_i$ and $X_i$ at all.

Assumptions A7.3 and A7.4 are similar, but A7.4 is stronger. That is, A7.4 $\implies$ A7.3; i.e., $E(Y^4) < \infty \implies E(Y^2) < \infty$, and similarly for $X$.

Assumptions A7.3 and A7.4 are usually true with economic data, but there are some exceptions. They are true for any variable whose absolute value is bounded, like $|Y| \le b < \infty$, because then $E(Y^4) \le E(b^4) = b^4 < \infty$. For example, if $X$ is age or education, then $|X| \le 200$, so $E(X^4) \le (200)^4 < \infty$.

Nonetheless, some economic variables may violate A7.4 or even A7.3. (Or, there are variables best modeled by distributions that violate these assumptions.) One example is stock returns or other asset returns. Whether to model such financial returns with finite or infinite variance is a matter of ongoing debate; e.g., see Grabchak and Samorodnitsky (2010) and references therein.

Assumption A7.2 is qualitatively similar to the overlap assumption (A6.6). They both say we must see different values of $X$ in order to learn about a relationship involving $X$. They both seem obvious with a single $X$.

Conveniently, if A7.2 seems false in the data, then your statistical software will report an error or warning. So, don't worry about A7.2 unless you get such a warning.

Theoretical Results

Theorem 7.1 (OLS consistency, 1 regressor). If A7.1–A7.3 are true, then the OLS intercept and slope estimators are consistent for the population linear projection intercept and slope.

Theorem 7.1 says that with enough data, the OLS coefficient estimators should be close to the true linear projection coefficients with high probability. Whether the linear projection is a CEF, or whether the slope has a causal interpretation, are questions of identification, not estimation. OLS estimates the linear projection and leaves further interpretation up to us.

Logically, Theorem 7.1 does not say that OLS is a bad estimator if sampling is not iid (the "inverse"), as discussed in Section 6.1. In fact, the iid assumption can be relaxed in certain ways: even with some "dependence" (instead of independence) or survey weights, OLS can still consistently estimate the population LPCs.

Theorem 7.2 (OLS approximate normality, 1 regressor). If A7.1, A7.2, and A7.4 are true, then the OLS intercept and slope estimators are asymptotically normal; i.e., with large $n$, approximately $\hat{\beta}_0 \sim N(\beta_0, SE_0^2)$ and $\hat{\beta}_1 \sim N(\beta_1, SE_1^2)$, where the true standard errors $SE_0$ and $SE_1$ are unknown but can be estimated, and are proportional to $1/\sqrt{n}$.

Theorem 7.2 is practically useful for constructing confidence intervals, whose properties are in Theorem 7.3.

Theorem 7.3 (coverage probability, 1 regressor). If A7.1, A7.2, and A7.4 are true, then the heteroskedasticity-robust confidence intervals in Section 7.7.3 are asymptotically correct. That is, with large enough $n$, the coverage probability is approximately equal to the desired confidence level.
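Consistency can be illustrated (not proved) by simulation. The sketch below assumes a nonlinear CEF with $m(0) = 60$, $m(1) = m(2) = 40$ and $P(X = j) = (3 - j)/6$, for which the population linear projection slope from (7.8) works out to $-12$; the helper `sim_slope()`, the seed, and the sample sizes are arbitrary choices.

```r
# Illustration of consistency: the OLS slope approaches the population LP slope (-12)
# as n grows, even though the CEF is nonlinear. sim_slope() is a hypothetical helper.
set.seed(42)
sim_slope <- function(n) {
  X <- sample(0:2, size = n, replace = TRUE, prob = (3:1) / 6)
  Y <- c(60, 40, 40)[1 + X] + rnorm(n)   # CEF value plus a mean-zero error
  unname(coef(lm(Y ~ X))[2])
}

ns  <- c(50, 500, 5000, 50000)
est <- sapply(ns, sim_slope)
print(cbind(n = ns, slope = round(est, 3), error = round(est - (-12), 3)))
```

The estimation error typically shrinks as $n$ increases, although with a single simulated dataset per $n$ it need not shrink monotonically.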

7.7.3 Code

The following code is based on the example from Section 7.1. Each row in the final output shows the estimate $\hat{\beta}_1$ (in the column titled Estimate), along with the heteroskedasticity-robust standard error (column Std. Error), t-statistic for $H_0 \colon \beta_1 = 0$ (column t value), and p-value for $H_0 \colon \beta_1 = 0$ (column Pr(>|t|)), and 95% CI for $\beta_1$ (lower endpoint in column 2.5 %, upper endpoint in column 97.5 %).

There are three randomly simulated datasets for which these quantities are estimated. All have $X \in \{0, 1, 2\}$. The first dataset comes from a linear CEF with $m(x) = 60 - 20x$, where $P(X = j) = 1/3$ for $j = 0, 1, 2$. The next two datasets have nonlinear CEF $m(0) = 60$, $m(1) = m(2) = 40$, but different distributions of $X$. The first distribution has $P(X = j) = (3 - j)/6$, while the second has $P(X = j) = (j + 1)/6$, for $j = 0, 1, 2$. As seen, the distribution of $X$ affects the linear projection slope when the CEF is nonlinear, as discussed in Section 7.4.

Finally, dummy variables are used to estimate a properly specified nonlinear CEF, as in (7.4). Only the estimated coefficients are displayed below, using the coefficients() function. Specifically, the number under (Intercept) is the estimated intercept, the number under D1 is the estimated coefficient on D1, and the number under D2 is the estimated coefficient on D2.

library(lmtest); library(sandwich)
set.seed(112358)
n <- 500 # sample size
m012 <- c(60,40,20) # m(0),m(1),m(2) (linear CEF)
df <- data.frame(X=sample(x=0:2, size=n, prob=c(1,1,1)/3, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1') # heteroskedasticity-robust variance
CEF <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
# Now nonlinear CEF: LPC depends on X dist.
set.seed(112358)
n <- 500; m012 <- c(60,40,40)
df <- data.frame(X=sample(x=0:2, size=n, prob=3:1/6, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1')
LP1 <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
set.seed(112358)
n <- 500; m012 <- c(60,40,40)
df <- data.frame(X=sample(x=0:2, size=n, prob=1:3/6, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1')
LP2 <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
tmp <- rbind(CEF, LP1, LP2)
round(x=tmp, digits=3)

##     Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
## CEF    -19.8      0.077  -257.1        0 -19.98 -19.67
## LP1    -12.3      0.310   -39.6        0 -12.91 -11.69
## LP2     -7.7      0.310   -24.8        0  -8.31  -7.09

# Use dummies to estimate nonlinear CEF
df$D0 <- (df$X==0) # not used
df$D1 <- as.integer(df$X==1) # D1=1 iff X=1
df$D2 <- as.integer(df$X==2) # D2=1 iff X=2
ret <- lm(formula=Y~D1+D2, data=df)
coefficients(ret)

## (Intercept)          D1          D2
##        59.8       -19.8       -19.8
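For readers curious about what a heteroskedasticity-robust standard error involves, the following base-R sketch computes one common version (often labeled HC1) by hand, using the usual "sandwich" formula with an $n/(n-k)$ scaling. The simulated data are an assumption for illustration, and the confidence interval uses a normal critical value, so it may differ slightly from what coefci() reports if that function uses a t critical value.

```r
# Hand-computed HC1 ("sandwich") variance estimate; a sketch with simulated data.
set.seed(7)                                  # arbitrary seed
n <- 500
X <- sample(0:2, size = n, replace = TRUE)
Y <- c(60, 40, 40)[1 + X] + rnorm(n)         # assumed nonlinear CEF plus error
fit <- lm(Y ~ X)

Xmat <- model.matrix(fit)                    # n-by-2 matrix with columns (1, X_i)
e    <- residuals(fit)
k    <- ncol(Xmat)

bread <- solve(crossprod(Xmat))              # (X'X)^{-1}
meat  <- crossprod(Xmat * e)                 # sum over i of e_i^2 x_i x_i'
VC1   <- (n / (n - k)) * bread %*% meat %*% bread

se <- sqrt(diag(VC1))                        # robust standard errors
ci <- coef(fit)[2] + c(-1, 1) * qnorm(0.975) * se[2]  # approximate 95% CI for the slope
print(se)
print(ci)
```

In practice it is simpler and safer to rely on a tested package function, as the code above this sketch does; the hand computation is only meant to show the structure of the estimator.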

7.8 Simple Linear Regression

The prior results are essentially the same when $X$ has more than three possible values, too. There could even be an infinite number of possible values, e.g., if there is no upper bound for $X$, or if $X$ could be any real (decimal) number between 0 and 1. Misspecification is likely. The linear projection, best linear approximation, and best linear predictor interpretations all still apply. OLS estimation and heteroskedasticity-robust standard errors and confidence intervals are computed the same way.

The main difference is that it is harder to use dummy variables to properly model a nonlinear CEF. If $X$ has only four values, then it is not too difficult. But if $X$ has hundreds or thousands of values, or an infinite number, then the dummy variable approach may fail. Chapter 8 addresses alternative ways to model a CEF that is not linear in $X$.

Practice 7.2 (linear fit). For each scatterplot in Figure 7.3, guess what the OLS estimated regression line looks like, i.e., the line $\hat{\beta}_0 + \hat{\beta}_1 X$. (Hint: remember OLS minimizes the sum of the squares of the vertical distances from each point to the fit line.) You can also make your own puzzles in R: first make a scatterplot like

Y <- c(1,2,3,4,13); X <- c(1,2,3,4,5); plot(X,Y)

and then (after guessing) plot the OLS fit with abline(lm(Y~X)).

Figure 7.3: Scatterplots for Practice 7.2.

Practice 7.3 (regression units). Consider a regression of wage $Y$ (\$/hr) on "distance to nearest university" $X$. Let $\hat{\gamma}_1$ be the estimated slope when $X$ is measured in miles, and let $\hat{\delta}_1$ be the estimated slope when $X$ is measured in kilometers, where 1 mi = 1.6 km.

a) What are the units of $\hat{\gamma}_1$? Of $\hat{\delta}_1$?
b) Do you think $\hat{\gamma}_1 = \hat{\delta}_1$, $\hat{\gamma}_1 > \hat{\delta}_1$, or $\hat{\gamma}_1 < \hat{\delta}_1$?
c) Can you come up with a formula relating $\hat{\gamma}_1$ and $\hat{\delta}_1$? (Hint: what change in $Y$ is associated with a 1.6 km increase in $X$, in terms of $\hat{\gamma}_1$? In terms of $\hat{\delta}_1$?)

Discussion Question 7.3 (student-teacher ratio, simple regression). Let $Y$ be the average math standardized test score (in units of points) for a school's 5th-grade students. Let $X$ be the 5th-grade student-teacher ratio (total number of 5th-grade students divided by total number of 5th-grade teachers, like the average class size), generally around $15 \le X \le 25$. For schools $i = 1, \ldots, n$, the values $(Y_i, X_i)$ are recorded. A linear regression is run to estimate $\beta_0$ and $\beta_1$ in the CEF model $Y = \beta_0 + \beta_1 X + V$, $E(V \mid X) = 0$. Respond to any three of the following (for example, parts a, c, and e; or b, c, f; or d, e, f; etc.).

a) What are the units of $\beta_0$ and $\beta_1$?
b) What's the interpretation of $\beta_0$? What is it useful for?
c) Consider the estimate $\hat{\beta}_1 = -2.28$. What does this imply about the average score difference between 15-student classes and 25-student classes? Is it economically significant (Section 3.9.7)? (Hint: make additional assumptions about the scoring system/scale if you need to.)
d) Consider further that $\hat{\beta}_1$ has heteroskedasticity-robust standard error 0.8, so the p-value for $H_0 \colon \beta_1 = 0$ is 0.004. Discuss the statistical significance (Section 3.8.4) of $\hat{\beta}_1$.
e) Describe one reason you doubt $\beta_1$ has a causal interpretation.
f) Describe one reason you think the linear CEF model is misspecified.


Empirical Exercises

Empirical Exercise EE7.1. You will analyze data on colleges' athletic success and number of applications. The data were collected by Patrick Tulloch for an economics term project, from various college and sports data records. As the R description says, "The 'athletic success' variables are for the year prior to the enrollment and academic data."

a. Load the data (assuming you've already installed the R package or Stata command).

R: library(wooldridge)

Stata: bcuse athlet1 , nodesc clear

b. Keep only data from 1993.

R: dat <- athlet1[athlet1$year==1993 , ]

Stata: keep if year==1993

c. Create a new variable equal to the sum of bowl (football bowl game) and finfour (men's basketball Final Four).

R: dat$bowl4 <- dat$bowl + dat$finfour

Stata: generate bowl4 = bowl + finfour

d. Display the number of observations with each possible value of bowl4 (0, 1, or 2).

R: table(dat$bowl4)

Stata: tabulate bowl4

e. Regress the number of applications (for admission) on the prior year's athletic success.

R: ret <- lm(apps~bowl4, data=dat)

Stata: regress apps bowl4 , vce(robust)

f. R only: save the fitted OLS values of Y for the three possible values of X (bowl4) with fit012 <- predict(ret, newdata=data.frame(bowl4=0:2)) and optionally add helpful labels with names(fit012) <- c('X=0','X=1','X=2')

g. Estimate and store the three CEF values.

R: mean(dat$apps[dat$bowl4==0]) to estimate m(0), and replace 0 with 1 to estimate m(1), and with 2 to estimate m(2); store these into a vector named m012 with m012 <- c(m0, m1, m2), where m0 is your code for estimating m(0), and similarly for m1 and m2.

Stata: bysort bowl4 : egen CEF = mean(apps) to compute the sample mean of apps within each group of observations with the same value of bowl4, storing it into a new variable named CEF.

h. Plot the fitted OLS line against the estimated CEF points.

R: plot(x=0:2, y=m012) (to plot estimated CEF points) followed by abline(ret) (to plot the OLS fit line)

Stata: twoway scatter CEF bowl4 || lfit apps bowl4

i. Make the same plot, but adjust the line color and style, the title, the axis labels, and whatever else you'd like to adjust.

R: inside the plot() command, add argument main='...' to set the title, and similarly for xlab='...' and ylab='...' to set the x-axis and y-axis labels (where you replace all the ... with whatever names you want); inside the abline() function, add arguments col=2 to change the line's color, lty=2 to change the line style, and lwd=3 to change the line width; again you can set whatever values you like.

Stata: twoway scatter CEF bowl4 || lfit apps bowl4 , XXX but replace the XXX with options to change the graph's appearance (all separated by spaces, not any more commas), like title("...") xtitle("...") ytitle("...") for the title and axis labels, and lcolor(red) lpattern(dash) for the line color and style; use whatever values you'd like.

j. Display the numerical values of the OLS fit and the estimated CEF.

R: rbind(m012, fit012)

Stata: collapse (mean) meanapps=apps , by(bowl4) followed by predict OLSfit , xb and list

Chapter 8

Nonlinear and Nonparametric Regression

Depends on: Chapter 7 (which depends on Chapters 2–4 and 6)

8.2 Interpret the coefficients in various nonlinear regression models [TLOs 3 and 5]

8.3 Judge which model seems most appropriate, using both economic reasoning and statistical insights [TLO 6]

8.4 In R (or Stata): estimate nonlinear and nonparametric regression models along with measures of uncertainty, and judge economic and statistical significance [TLO 7]

• Functional form misspecification (Lambert video)

• Log-log example (Lambert video)

• Overfitting (Lambert video)

• Sections 2.4 ("Nonlinearities," including log models), 6.1.3 ("Logarithms"), and 6.1.4 ("Quadratics and Polynomials") in Heiss (2016)

• Section 8.2 ("Nonlinear Functions of a Single Independent Variable") in Hanck et al. (2018)

• Nonparametric regression: Chapter 7 ("Moving Beyond Linearity") in James et al. (2013), including §7.5 ("Smoothing Splines"), and Chapter 5 ("Basis Expansions and Regularization") in Hastie, Tibshirani, and Friedman (2009), including §5.4 ("Smoothing Splines")

• Model selection: Chapter 7 ("Model Assessment and Selection") in Hastie, Tibshirani, and Friedman (2009)

• Bias–variance tradeoff: James et al. (2013, §2.2.2), Hastie, Tibshirani, and Friedman (2009, §§2.9, 5.5.2, 7.2, 7.3)

• Part V ("Nonparametric Regression") in Kaplan (2020)

• R package splines

Having mastered regression with a linear functional form, we now consider nonlinear functions. First, nonlinear functions of $X$ are allowed, and then nonparametric estimation and machine learning are introduced.

8.1 Log Transformation

Sometimes a simple regression model improves greatly by transforming $Y$ or $X$ or both. The most common transformation in economics is the natural logarithm function, which economists just call "log."

Three different log models are discussed below. A model with the familiar form $Y = \beta_0 + \beta_1 X + U$ could be called a "linear-linear" model (although it's just called a linear model), meaning both $Y$ and $X$ are in their original units, i.e., in levels. If $Y$ is replaced by its log, $\ln(Y)$, it's called a log-linear model; if instead we have $Y$ and $\ln(X)$, then it's linear-log; and if both are in logs, then log-log.

Here in Section 8.1, the distinction among causal, CEF, and linear projection models is unimportant. The interpretation of $U$ is left ambiguous intentionally. Instead, emphasis is on the interpretation of $\beta_1$ in terms of units of measure.

8.1.1 Properties of the Natural Log Function

Basic Shape and Properties

The natural log function is peculiar, especially if you haven't taken calculus. It is written $\ln(\cdot)$, although often people will simply say "log" (without "natural") and write $\log(\cdot)$ since the natural log is the only one commonly used in economics; in R, the function is log().

The log function is the inverse of the exponential function: $\ln(\exp(x)) = x$, where $\exp(x)$ is the same as $e^x$. Consequently, if $e^x = M$, then $\ln(M) = \ln(e^x) = x$.

Figure 8.1 shows the log function, giving a general idea of its shape. However, two important features are unclear. First, as $x$ gets closer and closer to 0, $\ln(x)$ decreases toward $-\infty$. Second, $\ln(x)$ keeps increasing to $\infty$ as $x$ increases to $\infty$.

The log function has many properties, including the following.

1. $\ln(x)$ is only defined for $x > 0$.
2. $\ln(x)$ is strictly increasing: for any $x_2 > x_1 > 0$, $\ln(x_2) > \ln(x_1)$.
3. $\ln(x)$ increases more slowly with larger $x$: it is very steep for $x$ near zero, but less and less steep (i.e., flatter) as $x$ increases.
4. For any $x > 0$ and any $b$, $\ln(x^b) = b \ln(x)$.
5. For any $x_1 > 0$ and $x_2 > 0$, $\ln(x_1 / x_2) = \ln(x_1) - \ln(x_2)$ and $\ln(x_1 x_2) = \ln(x_1) + \ln(x_2)$.
6. $\lim_{x \downarrow 0} \ln(x) = -\infty$ and $\lim_{x \to \infty} \ln(x) = \infty$.

Figure 8.1: The (natural) log function, $\ln(\cdot)$ (horizontal axis: $x$; vertical axis: $\ln(x)$).
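The listed properties are easy to spot-check in R, where log() is the natural log. The particular numbers below are arbitrary choices for illustration.

```r
# Spot-check properties of the natural log with arbitrary positive numbers.
x1 <- 2.5; x2 <- 7; b <- 3

inverse_ok  <- isTRUE(all.equal(log(exp(1.7)), 1.7))            # ln(exp(x)) = x
power_ok    <- isTRUE(all.equal(log(x1^b), b * log(x1)))        # ln(x^b) = b ln(x)
quotient_ok <- isTRUE(all.equal(log(x1 / x2), log(x1) - log(x2)))
product_ok  <- isTRUE(all.equal(log(x1 * x2), log(x1) + log(x2)))
increasing  <- log(x2) > log(x1)                                 # since x2 > x1 > 0

print(c(inverse_ok, power_ok, quotient_ok, product_ok, increasing))  # all TRUE
print(log(0))                                                    # -Inf (limit as x decreases to 0)
```

all.equal() is used instead of == because floating-point arithmetic can introduce tiny rounding differences.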

Percentage Approximation

Near $x = 1$, $\ln(x)$ is approximately the same as the linear function $f(x) = x - 1$, i.e., $\ln(x) \approx x - 1$. Equivalently, letting $w \equiv x - 1$, if $w$ is near zero, then $\ln(1 + w) \approx w$. For example, with $w = 0.01$, $\ln(1 + 0.01) = 0.00995$. Negative $w < 0$ is fine, too: $\ln(1 - 0.01) = -0.01005$. Even with $w = 0.1$, $\ln(1.1) = 0.0953$, not far from 0.1. The approximation is perfect at $w = 0$ since $\ln(1) = 0$ exactly, and it gets worse as $w$ increases: $\ln(1.5) = 0.405$, not good.

The log function can approximate small percent changes. Consider $v_2 > v_1$: how much bigger is $v_2$? In percentage (of $v_1$) terms, $v_2$ is

\[ 100 \left( \frac{v_2 - v_1}{v_1} \right)\% = 100 \left( \frac{v_2}{v_1} - 1 \right)\% \]

larger than $v_1$. For example, if $v_1 = 100$ and $v_2 = 102$, then $v_2 / v_1 - 1 = 1.02 - 1 = 0.02$, so we'd say $v_2$ is $100(0.02)\% = 2\%$ larger than $v_1$. In other words, the increase (in level) of $v_2 - v_1 = 2$ is 2% of $v_1$. This 2% can be approximated by the log increase $\ln(v_2) - \ln(v_1)$. Let $p \equiv v_2 / v_1 - 1$, like $p = 0.02$ in the example, so $v_2 = v_1 (1 + p)$. Combining two properties above, if $p$ is near zero, then

\[ \ln(v_2) = \ln(v_1 (1 + p)) = \ln(v_1) + \ln(1 + p) \approx \ln(v_1) + p \implies p \approx \ln(v_2) - \ln(v_1). \tag{8.1} \]

Put differently, a log difference of $p = \ln(v_2) - \ln(v_1)$ is approximately a $100p\%$ change in level. In fact, the above math is identical when $v_2 < v_1$, so $p$ can be positive or negative (increase or decrease). However, as before, the approximation is poor if $p$ is larger, like 0.5.
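The quality of the approximation $\ln(1 + w) \approx w$ can be tabulated directly. The sketch below uses the same $w$ values as in the text.

```r
# How good is ln(1 + w) ~ w? Tabulate for the w values discussed in the text.
w <- c(-0.01, 0.01, 0.1, 0.5)
approx_table <- data.frame(w      = w,
                           log1pw = round(log(1 + w), 5),
                           error  = round(log(1 + w) - w, 5))
print(approx_table)
# log1pw is -0.01005, 0.00995, 0.09531, 0.40547: the gap from w grows as |w| grows
```

The error is always negative (or zero at $w = 0$) because the log function lies below its tangent line at $x = 1$.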
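The approximation (8.1) can be compared with the exact percentage change using two tiny helper functions; the names `pct_change()` and `log_approx()` are made up for this illustration.

```r
# Exact percentage change versus the log-difference approximation in (8.1).
pct_change <- function(v1, v2) 100 * (v2 / v1 - 1)          # exact, in percent
log_approx <- function(v1, v2) 100 * (log(v2) - log(v1))    # approximation, in percent

small <- c(exact = pct_change(100, 102), approx = log_approx(100, 102))
large <- c(exact = pct_change(100, 150), approx = log_approx(100, 150))
down  <- c(exact = pct_change(100, 98),  approx = log_approx(100, 98))

print(round(rbind(small, large, down), 2))
# small: 2 vs about 1.98; large: 50 vs about 40.55; down: -2 vs about -2.02
```

The approximation works well for the 2% changes in either direction but is off by almost ten percentage points for the 50% change.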


8.1.2 The Log-Linear Model

Interpretation

A log-linear model specifies

\[ \ln(Y) = \beta_0 + \beta_1 X + U. \tag{8.2} \]

Since $X$ is in levels, the coefficient $\beta_1$ tells us about a one unit increase in $X$. Specifically, a one unit increase in $X$ is associated with a $\beta_1$ change in $\ln(Y)$ (increase if $\beta_1 > 0$, decrease if $\beta_1 < 0$). Sometimes people call this a $\beta_1$ change in $Y$ in log units.

If $\beta_1$ is close to zero, then (8.1) offers another interpretation: a one unit increase in $X$ is associated with an approximate $100\beta_1\%$ change in $Y$. For example, if $\beta_1 = 0.02$, then a one unit increase in $X$ is associated with approximately a 2% increase in $Y$.

Recall the difference between a percentage change and a percentage point change. For example, a 1% increase in $Y$ means increasing to $1.01Y$. "Percentage point" only applies when the units are already percentages, e.g., a 1 percentage point increase is changing from 10% to 11%, or from 67% to 68%.

However, even if $\beta_1$ is near zero, the approximation in (8.1) may be poor if we consider large changes in $X$. For example, if again $\beta_1 = 0.02$ but we consider a 50-unit increase in $X$, the increase in $Y$ is poorly approximated by $100(\beta_1)(50)\% = 100\%$. A 100% increase would be from value $v_1$ to value $v_2 = 2 v_1$. But if $\ln(v_2) - \ln(v_1) = 1$ (a change of 1 "log unit"), then $\ln(v_2 / v_1) = 1$, meaning $v_2 / v_1 = e \approx 2.72$, not $v_2 / v_1 = 2$.
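When the approximation is doubtful, the percentage change implied by a log-linear model can be computed exactly: a change of $\beta_1 \Delta x$ in $\ln(Y)$ multiplies $Y$ by $e^{\beta_1 \Delta x}$. The sketch below uses the text's value $\beta_1 = 0.02$; the function names are made up for illustration.

```r
# Exact vs approximate percentage change in Y for a log-linear model.
exact_pct  <- function(b1, dx) 100 * (exp(b1 * dx) - 1)   # from v2/v1 = exp(b1 * dx)
approx_pct <- function(b1, dx) 100 * b1 * dx              # from approximation (8.1)

b1 <- 0.02
one_unit   <- c(exact = exact_pct(b1, 1),  approx = approx_pct(b1, 1))
fifty_unit <- c(exact = exact_pct(b1, 50), approx = approx_pct(b1, 50))

print(round(rbind(one_unit, fifty_unit), 2))
# one_unit: about 2.02 vs 2; fifty_unit: about 171.83 vs 100
```

The 50-unit case matches the text: one log unit corresponds to $Y$ being multiplied by $e \approx 2.72$, i.e., roughly a 172% increase rather than 100%.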

When to Use It

When does a log-linear model make sense? Sometimes scatterplots of the raw $Y$ and $X$ data suggest it. For example, maybe the relationship between $Y$ and $X$ looks nonlinear, but the relationship between $\ln(Y)$ and $X$ looks approximately linear.

Sometimes, even before looking at data, the log-linear model makes more sense economically or intuitively. For example, with $Y$ variables like income, it may seem more natural to model effects as (approximate) percentage changes in $Y$, like a 1% higher income instead of a $500/yr higher income. Further, the log-linear form derives from economic models of human capital where there is a multiplicative effect on wage. The most famous of these is the "Mincer equation" for earnings as a function of education (schooling) and experience, named after the log-linear model in Mincer (1974, Ch. 5, p. 84).

Issue with Prediction

Unfortunately, the log-linear model is not optimal for predicting $Y$, even if $E(U \mid X) = 0$. From (8.2), the CEF is

\[ E(Y \mid X = x) = e^{\beta_0 + \beta_1 x} \, E(e^U \mid X = x). \]

It is easy to plug in $\hat\beta_0$ and $\hat\beta_1$ but difficult to estimate $E(e^U \mid X = x)$. We could simply ignore the difficult term, but $e^{\hat\beta_0 + \hat\beta_1 x}$ is generally not the best predictor of $Y$ given $X = x$. There are alternatives, but they are beyond our scope.
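A small simulation can make the issue concrete. The sketch below assumes a specific data-generating process chosen only for illustration (standard normal $U$, uniform $X$, and made-up coefficients), so that $E(e^U \mid X) = e^{0.5} > 1$; the naive predictor that ignores this term should then be too small on average.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
n  <- 100000
X  <- runif(n, min = 0, max = 2)
U  <- rnorm(n)                     # E(U | X) = 0, but E(exp(U) | X) = exp(0.5) > 1
Y  <- exp(1 + 0.5 * X + U)         # log-linear model with made-up beta0 = 1, beta1 = 0.5
ret   <- lm(log(Y) ~ X)
naive <- exp(fitted(ret))          # ignores the E(exp(U) | X = x) term
print(c(mean.Y = mean(Y), mean.naive = mean(naive),
        ratio = mean(Y) / mean(naive)))
```

In this particular design the ratio should be near $e^{0.5}$, but in general the missing factor is unknown and may depend on $x$.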

8.1.3 The Linear-Log Model

Interpretation

A linear-log model specifies

\[ Y = \beta_0 + \beta_1 \ln(X) + U. \tag{8.3} \]

When $X$ increases by one log unit, the corresponding change in $Y$ is $\beta_1$, but one log unit is a very big change (more than doubling). To use the percentage approximation, a smaller change in $X$ must be used. Specifically, an increase of $X$ by 1% is associated with a change in $Y$ of $\beta_1/100$ units. A 1% increase is a change from $X$ to $1.01X$, which is different than a 1 percentage point change in $X$.

Mathematically, the interpretation of $\beta_1$ can be seen in two steps. First, let $Z = \ln(X)$, and imagine a linear model with $Y$ and $Z$: $Y = \beta_0 + \beta_1 Z + U$. Then, an increase in $Z$ by 0.01 units corresponds to a change in $Y$ of $(\beta_1)(0.01) = \beta_1/100$ units of $Y$. Second, from (8.1), an increase in $Z = \ln(X)$ by 0.01 is approximately a $(100)(0.01)\% = 1\%$ increase in $X$.

For larger changes, instead of using the percentage approximation, just plug in two values of $X$. For example, consider increasing $X = 40$ to $X = 60$ (a 50% increase). The associated change in $Y$ is

\[ \overbrace{\beta_0 + \beta_1 \ln(60)}^{X=60} - \overbrace{[\beta_0 + \beta_1 \ln(40)]}^{X=40} = (\beta_0 - \beta_0) + \beta_1[\ln(60) - \ln(40)] = \beta_1 \ln(60/40) = \beta_1 \ln(1.5) \approx 0.41\beta_1. \]

The same $0.41\beta_1$ change results for any 50% increase in $X$, regardless of starting value, because $\ln(1.5X) - \ln(X) = \ln(1.5X/X) = \ln(1.5) \approx 0.41$ log units. More generally, a change from $X = x_1$ to $X = x_2$ is associated with a change in $Y$ of $\beta_1 \ln(x_2/x_1)$.
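The general formula $\beta_1 \ln(x_2/x_1)$ is easy to wrap in a small helper. The function name and the slope value below are hypothetical, chosen only to illustrate that the change depends on the ratio $x_2/x_1$ and not on the starting value.

```r
# Change in Y associated with moving X from x1 to x2 in a linear-log model.
change.linlog <- function(b1, x1, x2) b1 * log(x2 / x1)

b1 <- 2                                   # hypothetical slope, for illustration only
print(change.linlog(b1, x1 = 40, x2 = 60))    # 50% increase in X
print(change.linlog(b1, x1 = 10, x2 = 15))    # another 50% increase: same change in Y
print(change.linlog(b1, x1 = 100, x2 = 101))  # 1% increase: close to b1/100
print(b1 / 100)
```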

When to Use It

When does a linear-log model make sense? Sometimes the scatterplot of $Y$ and $X$ reveals a shape that looks like a log function: increasing steeply at first, then getting less and less steep, but without ever decreasing. (Or, switch "increasing" and "decreasing" if $\beta_1 < 0$.) That is, the relationship between $Y$ and $X$ looks nonlinear, but maybe plotting $Y$ against $\ln(X)$ looks closer to linear. The log function's shape also helps model diminishing marginal benefits: the first unit of $X$ helps increase $Y$ a lot, but each additional unit of $X$ helps less and less.


8.1.4 The Log-Log Model

Interpretation

A log-log model specifies

\[ \ln(Y) = \beta_0 + \beta_1 \ln(X) + U. \tag{8.4} \]

A 1% increase in $X$ is associated with an approximate $\beta_1\%$ change in $Y$. This percentage interpretation is particularly nice: $\beta_1$ represents an elasticity of $Y$ with respect to $X$. But if the percentages are too large, then the approximation is poor.

When to Use It

When does a log-log model make sense? First, it's a simple way to get an elasticity interpretation. Second, a scatterplot of $\ln(Y)$ against $\ln(X)$ may look roughly linear. Third, if you suspect a power law type of relationship between $Y$ and $X$: exponentiating both sides of (8.4) yields

\[ \exp\{\ln(Y)\} = \exp\{\beta_0 + \beta_1 \ln(X) + U\} \implies Y = e^{\beta_0} \exp\{\ln(X^{\beta_1})\} e^U = e^{\beta_0} X^{\beta_1} e^U. \]

As with the log-linear model, $e^{\beta_0} X^{\beta_1}$ is generally not the CEF because $E(e^U \mid X) = 1$ is not implied by $E(U \mid X) = 0$. Consequently, predicting $Y$ as $e^{\hat\beta_0} X^{\hat\beta_1}$ is generally not optimal.
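As a sketch of the elasticity interpretation, the simulation below generates data from a power law with made-up parameter values (scale 3, elasticity 0.7, and lognormal multiplicative error; all are illustrative assumptions) and then estimates the log-log model by OLS.

```r
set.seed(2)                          # arbitrary seed
n <- 500
X <- runif(n, min = 1, max = 10)
U <- rnorm(n, sd = 0.2)
Y <- 3 * X^0.7 * exp(U)              # power law: exp(beta0) = 3, beta1 = 0.7 (made up)
ret <- lm(log(Y) ~ log(X))
print(coef(ret))                     # intercept near log(3), slope near 0.7
b1 <- unname(coef(ret)[2])
print(100 * (1.01^b1 - 1))           # exact % change in X^b1 for a 1% increase in X
print(100 * (2^b1 - 1))              # for a 100% increase in X; compare with 100 * b1
```

For the 1% increase, the exact percentage change is very close to the estimated slope itself; for the 100% increase, the approximation is visibly off.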

In Sum: Regression Models with Log Transformations
Log-linear: 1-unit ↑ $X$ associated with approximate $100\beta_1\%$ change in $Y$
Linear-log: 1% ↑ $X$ associated with approximate $\beta_1/100$-unit change in $Y$; more precisely, change from $x_1$ to $x_2$ associated with $\beta_1 \ln(x_2/x_1)$-unit change in $Y$
Log-log: 1% ↑ $X$ associated with approximate $\beta_1\%$ change in $Y$ (elasticity)

Discussion Question 8.1 (pollution and house price). Consider the relationship between the price of a house and the concentration of air pollution. Explain which type of model (linear, log-linear, linear-log, or log-log) you think would best fit, and why. (Hint: think especially about changes in levels vs. in logs.)

8.1.5 Warning: Model-Driven Results

⟹ Kaplan video: Warnings About Model-Driven Results

When choosing a model, beware self-fulfilling prophecy. Empirical results are driven by data, but also by your model's structure. For example, the function $\beta_0 + \beta_1 X$ specifies a constant ($\beta_1$) change for every unit increase in $X$: different datasets can lead to different estimated slopes ($\hat\beta_1$), but the slope will always be constant, regardless of the data. The linear-log model may seem more flexible than a linear model, but it is not: it still only has two parameters. It is just different, not more flexible. Consequently, the fitted linear-log model always shows a diminishing effect of $X$ on $Y$ as $X$ increases. This pattern does not come from the data, but from the model itself, regardless of the data.

Figure 8.2, based on the comic at https://xkcd.com/2048, illustrates such self-fulfilling prophecy. Each graph shows the same scatterplot from the same data (the dots), but with a very different fitted model in each (the line). Clearly the differences do not come from the data, since it's the exact same data. All differences are entirely due to the model. The top-left shows the linear model, which by construction imposes a constant slope $\hat\beta_1$. Below that is a log-linear model: the constant percentage increase of $Y$ with each unit of $X$ leads to exponential growth (hence the "exponential" label in the comic). The top-right shows the "tapering off" of the linear-log model. Although mostly beyond our scope, some comments on "model selection" are in Sections 8.3 and 15.2.

8.1.6 Code

Figure 8.2: Same data, different models. [Four panels of the same scatterplot with fitted curves: Linear, Log-Linear, Linear-Log, Log-Log.]

Figure 8.2 is generated by the following code that compares linear, log-linear, linear-log, and log-log estimation given the same dataset. The four fitted functions are plotted on four copies of the same scatterplot in Figure 8.2, in homage to https://xkcd.com/2048. The results illustrate the concerns of Section 8.1.5.

    # 2-by-2 grid of plots
    par(family="serif", mar=c(3,3,1,1), mgp=c(2.1,0.8,0), mfrow=c(2,2))
    set.seed(112358)
    n <- 31
    X <- sort(runif(n=n, min=1, max=9))
    Y <- 1 + pnorm(q=X, mean=5, sd=1.5) +
      2*( rbeta(n=n, shape1=10-X, shape2=X) - (10-X)/10 )
    df <- data.frame(X=X, Y=Y)
    # Same data, four different models
    retlinlin <- lm(Y~X, data=df)
    retloglin <- lm(log(Y)~X, data=df)
    retlinlog <- lm(Y~log(X), data=df)
    retloglog <- lm(log(Y)~log(X), data=df)
    XL <- YL <- ""
    plot(x=df$X, y=df$Y, type="p", pch=16, main="", xlab=XL, ylab=YL)
    lines(predict(retlinlin)~df$X, col=2)
    title("Linear", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, main="", xlab=XL, ylab=YL)
    lines(predict(retlinlog)~df$X, col=2)
    title("Linear-Log", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, main="", xlab=XL, ylab=YL)
    # exp() converts fitted ln(Y) back to the units of Y
    lines(exp(predict(retloglin))~df$X, col=2)
    title("Log-Linear", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, main="", xlab=XL, ylab=YL)
    lines(exp(predict(retloglog))~df$X, col=2)
    title("Log-Log", line=-1, adj=0.1)

8.2 Nonlinear-in-Variables Regression

Discussion Question 8.2 (nonlinear OVB). Imagine a structural model $Y = \beta_0 + \beta_1 X + \beta_2 X^2$ with no error term: $X$ completely determines $Y$. To be more concrete, imagine $Y = 1 + X^2$ (i.e., $\beta_0 = 1$, $\beta_1 = 0$, $\beta_2 = 1$), with $0 \le X \le 5$. You run a linear-in-variables regression: OLS estimates the function $\hat\gamma_0 + \hat\gamma_1 X$.

a) Approximately what value would you expect $\hat\gamma_1$ to be? (Hint: recall Sections 7.3–7.5.)

b) What does $\hat\gamma_0 + \hat\gamma_1 X$ suggest about the relationship between $X$ and $Y$? What features are similar or different compared to the true $1 + X^2$? (Hint: draw a picture.)

Beyond replacing $X$ with a single transformation of $X$ like $\ln(X)$, we can replace $X$ with a more complicated nonlinear function involving multiple terms and multiple parameters. OLS can still be used for estimation, as long as the function is "linear-in-parameters" (Section 8.2.1). Again, the distinctions among causal, CEF, and linear projection models are not emphasized here.

There are two types of (non)linearity. They are often confused. Further, people often say "linear model" or "nonlinear model" without clarifying which type they mean.

8.2.1 Linearity

The root of "linearity" is linear combination. A linear combination is like a weighted sum. For example, a linear combination of $A$ and $B$ is anything with the form

\[ w_1 A + w_2 B, \tag{8.5} \]

where $w_1$ and $w_2$ are weights that may take any value, including zero or even negative numbers. Linear combinations may involve more than two terms, like $w_1 A + w_2 B + w_3 C + w_4 D$. In some cases, instead of $A$, $B$, $C$, and $D$, we have something like $Y_1$, $Y_2$, $Y_3$, and $Y_4$, in which case the linear combination may be written in summation notation:

\[ w_1 Y_1 + w_2 Y_2 + w_3 Y_3 + w_4 Y_4 = \sum_{i=1}^{4} w_i Y_i. \tag{8.6} \]

For example, the expected value formula for discrete random variables in (2.16) is a special case of a linear combination, where the linear combination weights are the probabilities of the different possible values. Also, the sample mean is a linear combination of observed $Y_i$ values, with weights $w_i = 1/n$.

A function is linear-in-parameters if it is a linear combination of the parameters. For example, $\beta_0 + \beta_1 X$ is linear-in-parameters because it is a linear combination of the parameters $\beta_0$ and $\beta_1$, with weights $w_1 = 1$ and $w_2 = X$:

\[ w_1 \beta_0 + w_2 \beta_1 = (1)(\beta_0) + (X)(\beta_1) = \beta_0 + \beta_1 X. \]

The function $\beta_0 + \beta_1 X$ is also linear-in-variables. This is trickier to see since it is not actually a linear combination of $X$ alone. (The function $\beta_0 + \beta_1 X$ is called an affine function of $X$, which means a linear function of $X$ plus a constant.) Secretly, we have actually had a second regressor all along: $X_0 = 1$. Since this second regressor $X_0$ is just always 1, it has not been treated like a true regressor, but mathematically it is. Seen this way, the linear combination of $X_0$ and $X$ has weights $w_1 = \beta_0$ and $w_2 = \beta_1$:

\[ (w_1)(X_0) + (w_2)(X) = (\beta_0)(1) + (\beta_1)(X) = \beta_0 + \beta_1 X. \]

For this reason, in economics people often call $\beta_0 + \beta_1 X$ "linear in $X$" even though technically it is "affine in $X$" and "linear in $X_0$ and $X$."
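The "secret" regressor $X_0 = 1$ can be made explicit in R. In the sketch below (simulated data with made-up coefficients), the usual regression with an automatic intercept is compared with a regression that suppresses the intercept and instead includes a column of ones; the two sets of coefficient estimates should coincide.

```r
set.seed(3)                           # arbitrary seed
n  <- 50
X  <- rnorm(n)
Y  <- 1 + 2 * X + rnorm(n)            # made-up coefficients
X0 <- rep(1, n)                       # the "secret" regressor, always equal to 1
ret.usual    <- lm(Y ~ X)             # intercept added automatically
ret.explicit <- lm(Y ~ 0 + X0 + X)    # automatic intercept removed; X0 used instead
print(rbind(usual = coef(ret.usual), explicit = coef(ret.explicit)))
```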

These two types of linearity can apply specifically to CEFs or linear projections. For example, if the CEF is $E(Y \mid X = x) = \beta_0 + \beta_1 x$, then the CEF is linear-in-parameters and linear-in-variables. Regardless of the CEF, the linear projection of $Y$ onto $(1, X)$ is $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$, which is always linear-in-parameters and linear-in-variables by definition.

Confusingly, even if the models are written in error form, people still refer to them as "linear." For example, consider the CEF model $Y = \beta_0 + \beta_1 X + U$ with $E(U \mid X) = 0$. Despite the $+U$ at the end, sometimes people say this is linear-in-variables and linear-in-parameters, presumably because $E(Y \mid X) = \beta_0 + \beta_1 X$ indeed satisfies both types of linearity. Similarly, consider the linear projection model in error form, $Y = \beta_0 + \beta_1 X + U$ with $E(U) = E(XU) = 0$. Again, despite the $+U$ at the end, sometimes people say this is linear-in-variables and linear-in-parameters, presumably because $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$ indeed satisfies both types of linearity.

Even more confusingly, sometimes even a structural model of the form $Y = \beta_0 + \beta_1 X + U$ is called linear-in-parameters and linear-in-variables. In that case, there is no CEF or LP that is implicitly the linear function. Regardless, it is helpful to be aware of conventional terminology even if it's not the best, so you can understand others when they mention a "linear structural model."

8.2.2 Nonlinearity

Often a quadratic term is added to a model to increase flexibility. Specifically,

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + U \tag{8.7} \]

is called a quadratic model since the right-hand side is a quadratic function of $X$ (plus an error term). This is now nonlinear-in-variables because of the $X^2$ term. That is, $\beta_0 + \beta_1 X + \beta_2 X^2$ cannot be written as a linear combination of $X_0 = 1$ and $X$, so it is not linear-in-variables. However, (8.7) is still linear-in-parameters, with linear combination weights 1, $X$, and $X^2$:

\[ (1)(\beta_0) + (X)(\beta_1) + (X^2)(\beta_2) = \beta_0 + \beta_1 X + \beta_2 X^2. \]

There are (infinitely) many other examples of functions that are linear-in-parameters but nonlinear-in-variables. For example:

\[ \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4, \]
\[ \beta_0 + \beta_1 \sin(X) + \beta_2 \cos(X), \]
\[ \beta_0 + \beta_1 \ln(X) + \beta_2 \sqrt{X} + \beta_3 X^{1/3}. \]

Each can be written in terms of functions $f_j(\cdot)$ in the form

\[ \sum_{j=0}^{J} \beta_j f_j(X). \tag{8.8} \]

For example, the polynomial example has $f_j(X) = X^j$ and $J = 4$.


A nonlinear-in-parameters model cannot be written as a linear combination of the parameters. For example, in the power law model

\[ Y = \beta_0 X^{\beta_1} + U, \tag{8.9} \]

the term $\beta_0 X^{\beta_1}$ cannot be written as a linear combination of $\beta_0$ and $\beta_1$. Nonlinear-in-parameters models are not discussed further.

8.2.3 Estimation and Inference

OLS can estimate nonlinear-in-variables models as long as they are linear-in-parameters. As always, the OLS estimates are the parameter values that minimize the sum of squared residuals, solving the empirical analog of the optimal prediction problem (minimizing mean quadratic loss).

Inference on parameters is also the same. For example, the same R code to compute a confidence interval for $\beta_1$ earlier still works, and a confidence interval for $\beta_2$ can be computed the same way. The underlying code/math is very similar, too. However, confidence intervals for predicted values or predicted differences now involve multiple coefficients.
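As a minimal sketch (with simulated data and made-up coefficients), a quadratic model can be estimated with lm() by wrapping the squared term in I(). The call to confint() below is one assumed way to get confidence intervals for all three parameters; it relies on homoskedasticity-based standard errors and is not necessarily the code used earlier in the text, where heteroskedasticity-robust intervals may be preferred.

```r
set.seed(4)                              # arbitrary seed
n  <- 200
X  <- runif(n, min = 0, max = 5)
Y  <- 1 + 5 * X - X^2 + rnorm(n)         # made-up quadratic CEF plus noise
df <- data.frame(X = X, Y = Y)
ret <- lm(Y ~ X + I(X^2), data = df)     # I() makes ^ mean "square" inside a formula
print(coef(ret))                         # estimates of beta0, beta1, beta2
ci <- confint(ret, level = 0.95)         # one row per parameter
print(ci)
```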

8.2.4 Parameter Interpretation

Unlike estimation and inference, which remain similar, interpretation of parameters changes greatly with nonlinear-in-variables models.

Insufficiency of Linear Coefficient

In (8.7), $\beta_1$ is no longer the change in $Y$ associated with a unit increase in $X$. That is, when $X$ increases, so does $X^2$, so both $\beta_1$ and $\beta_2$ are needed. Not only does the change in $Y$ associated with a unit increase in $X$ now depend on the initial value of $X$, but even the sign of the change (i.e., increase or decrease) may depend on $X$.

For example, consider the function $5X - X^2$, i.e., $\beta_0 = 0$, $\beta_1 = 5$, and $\beta_2 = -1$. Going from $X = 0$ to $X = 1$, the change is

\[ [(5)(1) - 1^2] - [(5)(0) - 0^2] = 4 - 0 = 4. \]

Going from $X = 1$ to $X = 2$, the change is

\[ [(5)(2) - 2^2] - [(5)(1) - 1^2] = 6 - 4 = 2, \]

still positive but smaller. From $X = 2$ to $X = 3$,

\[ [(5)(3) - 3^2] - [(5)(2) - 2^2] = 6 - 6 = 0, \]

no change at all. And from $X = 3$ to $X = 4$,

\[ [(5)(4) - 4^2] - [(5)(3) - 3^2] = 4 - 6 = -2, \]

a negative change, i.e., a decrease. Even though $\beta_1 = 5$ is positive, sometimes an increase in $X$ is associated with a decrease in $Y$. Not even the sign of $\beta_1$ (positive, negative) tells us anything by itself.
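The arithmetic for the function $5X - X^2$ can be reproduced in a few lines of R; diff() computes the change between consecutive values.

```r
f <- function(x) 5 * x - x^2          # beta0 = 0, beta1 = 5, beta2 = -1
x <- 0:4
changes <- diff(f(x))                 # f(1)-f(0), f(2)-f(1), f(3)-f(2), f(4)-f(3)
print(data.frame(from = x[-length(x)], to = x[-1], change = changes))
```

The sign of the change flips from positive to negative even though the coefficient on $X$ is positive throughout.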


Summarizing Nonlinear Functions

With only one $X$, the best summary is to plot the function (along with a scatterplot of data), like in Figure 8.2. As the saying goes, "A picture is worth a thousand words [or numbers]." However, if there are many different regressors (as in later chapters), pictures get confusing (trying to show slices of many-dimensional manifolds...).

Another approach is to plug in changes of $X$ that are relevant to policy or a particular economic question. For example, if $Y$ is income, $X$ is education, and we want to understand the value of the 12th year of education, then comparing $X = 12$ to $X = 11$ is relevant. With a quadratic model, the associated change in $Y$ is

\[ [\beta_0 + \beta_1(12) + \beta_2(12)^2] - [\beta_0 + \beta_1(11) + \beta_2(11)^2] = \beta_1(12 - 11) + \beta_2(12^2 - 11^2) = \beta_1 + 23\beta_2. \]

Generally, we could write

\[ Y = f(X) + U, \tag{8.10} \]

where in the quadratic model $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2$. Plugging in OLS parameter estimates yields $\hat f(X)$, like $\hat f(X) = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ for the quadratic. We can graph this estimated function by evaluating it at many $X$ values and drawing a line through them. Similarly, for a change from $X = x_1$ to $X = x_2$, the (estimated) associated change in $Y$ is

\[ \hat f(x_2) - \hat f(x_1). \]

In Sum: Interpreting and Summarizing Nonlinear Models
The $\beta_1 X$ term alone has no meaning.
Given (8.10), a change from $X = x_1$ to $X = x_2$ is associated with a change in $Y$ of $f(x_2) - f(x_1)$, estimated by $\hat f(x_2) - \hat f(x_1)$.
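The estimated change $\hat f(x_2) - \hat f(x_1)$ can be computed with predict(), as in the sketch below. The data are simulated with made-up coefficients, and the standard error shown uses the default (homoskedasticity-based) vcov() only to illustrate that the predicted difference is a linear combination of several coefficients; a robust variance matrix could be substituted.

```r
set.seed(5)                                    # arbitrary seed
n  <- 300
X  <- runif(n, min = 8, max = 20)              # e.g., years of education (simulated)
Y  <- 10 + 0.2 * X + 0.05 * X^2 + rnorm(n, sd = 2)   # made-up quadratic CEF
df <- data.frame(X = X, Y = Y)
ret <- lm(Y ~ X + I(X^2), data = df)
x1 <- 11; x2 <- 12
fhat <- predict(ret, newdata = data.frame(X = c(x1, x2)))
est.change <- unname(fhat[2] - fhat[1])        # fhat(x2) - fhat(x1)
# Same number from the coefficients: b1*(x2 - x1) + b2*(x2^2 - x1^2)
w <- c(0, x2 - x1, x2^2 - x1^2)
by.hand <- sum(w * coef(ret))
se <- sqrt(drop(t(w) %*% vcov(ret) %*% w))     # SE of the linear combination
print(c(est.change = est.change, by.hand = by.hand, se = se))
```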

Practice 8.1 (quadratic example). You regress $Y$ on $X$ and $X^2$ and get the fitted function $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ with $\hat\beta_0 = 2$, $\hat\beta_1 = 4$, and $\hat\beta_2 = -2$.

a) What's the predicted value of $Y$ when $X = 0$? $X = 1$? $X = 2$?

b) What's the predicted change in $Y$ when $X$ changes from 0 to 1? From 1 to 2?

Discussion Question 8.3 (nonlinear wage model interpretation). Let $Y$ be wage ($/hr) and $X$ years of education. Given a sample of data, you estimate $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ with $\hat\beta_0 = 14.4$, $\hat\beta_1 = -1.6$, and $\hat\beta_2 = 0.1$.

a) Does $\hat\beta_1 < 0$ mean that more education is associated with lower wage? Why/not?

b) What does this estimated function suggest about the (descriptive) relationship between wage and education? (Hint: try plugging in salient values like $X = 12$ [high school] or $X = 16$ [college], or graph the whole function.)


8.2.5 Description, Prediction, and Causality

The interpretation of a nonlinear-in-variables model as causal, CEF, or linear projection is similar to linear-in-variables models. The main difference is that we may wish to clarify the word "linear" in linear projection, best linear approximation, and best linear predictor.

Description and Prediction

Consider the quadratic model from (8.7) when the true CEF is not quadratic. Then, the "linear" projection of $Y$ onto $X_0 = 1$, $X$, and $X^2$ is defined the same way as in (7.6) before:

\[ \begin{aligned} \mathrm{LP}(Y \mid 1, X, X^2) &= \beta_0 + \beta_1 X + \beta_2 X^2 \\ &= \operatorname*{arg\,min}_{a,b,c} d(Y, a + bX + cX^2) \\ &= \operatorname*{arg\,min}_{a,b,c} \sqrt{E[(Y - a - bX - cX^2)^2]}. \end{aligned} \tag{8.11} \]

These linear projection coefficients are what OLS estimates. This same function of $X$ is again a "best" CEF approximation and "best" predictor of $Y$. Specifically, mirroring (7.14) and (7.15),

\[ \begin{aligned} \mathrm{LP}(Y \mid 1, X, X^2) &= \beta_0 + \beta_1 X + \beta_2 X^2 \\ &= \overbrace{\operatorname*{arg\,min}_{a,b,c} E\{[E(Y \mid X) - (a + bX + cX^2)]^2\}}^{\text{BLA}} \\ &= \overbrace{\operatorname*{arg\,min}_{a,b,c} E\{[Y - (a + bX + cX^2)]^2\}}^{\text{BLP}}. \end{aligned} \tag{8.12} \]

As before, if the true CEF actually is quadratic, then these all equal the true CEF.

Structural Identification

The ASE is identified if the CEF is properly specified and independence (Assumption A6.7) holds. Then, the ASE on $Y$ of changing $X$ from $x_1$ to $x_2$ equals the difference in the CEF at those two points:

\[ \mathrm{ASE}(x_1 \to x_2) = E(Y \mid X = x_2) - E(Y \mid X = x_1) \equiv m(x_2) - m(x_1). \tag{8.13} \]

If the CEF is correctly specified, then OLS can consistently estimate this CEF difference.

For example, if the true CEF is actually quadratic,

\[ m(x) = \beta_0 + \beta_1 x + \beta_2 x^2, \]

then regressing (with OLS) $Y$ on 1, $X$, and $X^2$ yields consistent estimators of $\beta_0$, $\beta_1$, and $\beta_2$ under certain finite-moment and sampling assumptions (e.g., iid sampling and finite fourth moments of $Y$ and $X$). Then, $\hat m(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2$, so a consistent ASE estimator is

\[ \begin{aligned} \widehat{\mathrm{ASE}}(x_1 \to x_2) &= \hat m(x_2) - \hat m(x_1) \\ &= \hat\beta_0 + \hat\beta_1 x_2 + \hat\beta_2 x_2^2 - (\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_1^2) \\ &= \hat\beta_1(x_2 - x_1) + \hat\beta_2(x_2^2 - x_1^2). \end{aligned} \tag{8.14} \]

Alternatively, if the true structural model is $Y = h(X) + U$ and the structural error $U$ satisfies $E(U \mid X) = 0$, then the structural function $h(\cdot)$ is also the CEF $m(\cdot)$. Thus, if $h(\cdot)$ is linear-in-parameters, then it can be estimated by OLS.

8.2.6 Code

[Figure 8.3: four panels of the same scatterplot with different fitted models: Linear, Quadratic, Cubic, Trigonometric.]


Figure 8.3 is generated by the following code that fits the same data with four models: linear, quadratic, and cubic polynomials, and a trigonometric model with a sine and cosine term. Figure 8.3 shows four identical scatterplots with the four different fitted models. Note how the four fitted lines have very different qualitative features even though they use the same data. This illustrates the same concerns about model-driven results and "self-fulfilling prophecy" as in Section 8.1.5 and Figure 8.2.

    # 2-by-2 grid of plots
    par(family="serif", mar=c(3,3,1,1), mgp=c(2.1, 0.8, 0), mfrow=c(2,2))
    set.seed(112358)
    n <- 31
    X <- sort(3*rbeta(n=n, shape1=1, shape2=1))
    df <- data.frame(X=X, Y=1+10*(X/2-0.5)^2*(X/2-0.5-1) + rnorm(n=n))
    # Same data, four different linear-in-parameters models
    retpoly1 <- lm(Y~X, data=df)
    retpoly2 <- lm(Y~X+I(X^2), data=df)
    retpoly3 <- lm(Y~X+I(X^2)+I(X^3), data=df)
    rettrig <- lm(Y~I(cos(2*pi*(X-0)/3))+I(sin(2*pi*(X-0)/3)), data=df)
    XL <- YL <- ""
    plot(x=df$X, y=df$Y, type="p", pch=16, xlab=XL, ylab=YL,
         main="", xlim=c(0,3))
    lines(predict(retpoly1)~df$X, col=2)
    title("Linear", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, xlab=XL, ylab=YL,
         main="", xlim=c(0,3))
    lines(predict(retpoly2)~df$X, col=2)
    title("Quadratic", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, xlab=XL, ylab=YL,
         main="", xlim=c(0,3))
    lines(predict(retpoly3)~df$X, col=2)
    title("Cubic", line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type="p", pch=16, xlab=XL, ylab=YL,
         main="", xlim=c(0,3))
    lines(predict(rettrig)~df$X, col=2)
    title("Trigonometric", line=-1, adj=0.1)

8.3 Nonparametric Regression

⟹ Kaplan video: Model Flexibility in Nonparametric Regression

In nonparametric regression, the functional form of the CEF $m(\cdot)$ is unknown. This is more general than nonlinear-in-variables regression, where $m(\cdot)$ is nonlinear but has a known functional form, like a cubic polynomial or log-linear model, in which only the coefficient values are unknown.

In principle, this allows a very flexible model for $m(\cdot)$, although in practice the (hopefully) optimal level of flexibility must be chosen somehow. There is no universal quantitative definition of "flexible," but the qualitative meaning is the same as the physical flexibility of a hose or cable: can it bend around sharply in many places to take whatever shape you wish (flexible), or can it only take on particular shapes? The number of parameters (terms) in a model is a general guide to how flexible the model is. For example, a model with 20 parameters is more flexible than a model with only 2 parameters.

Many machine learning methods are nonparametric CEF estimators. In machine learning, often prediction is emphasized over description and causality, but recall that the CEF is the best predictor of $Y$ given $X$ (under quadratic loss).

One view of nonparametric regression is that it is like nonlinear regression but choosing the model with a formal statistical procedure instead of guessing. The steps are basically:

1. Choose a group of possible regression models.

2. Choose a way to evaluate models.

3. Evaluate the quality of each model, given the data.

4. Select the best (least bad) model.

5. Use the estimates from the selected model.

Steps 1–4 describe model selection, i.e., choosing which model to use for estimation. This is unavoidable. Sometimes model selection is informal, e.g., somebody just feels like using a quadratic model today. With nonparametric regression, usually Step 1 is done informally (but thoughtfully). For Step 2, there are many formal statistical evaluation procedures to choose from; this choice (of procedure) is also done informally but thoughtfully. Steps 3 and 4 are done by the chosen statistical procedure, using the data.

In R, usually Steps 1 and 2 require you to pick a particular R function (and certain arguments), and then the function computes Steps 3 and 4 (and Step 5) for you. Depending on the chosen model, Step 5 may be identical to Section 8.2.
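To make the five steps concrete, here is a minimal hand-written sketch. It assumes (purely for illustration) that the candidate models are polynomials of degree 1 through 8 and that models are evaluated by leave-one-out cross-validation (LOOCV), computed with the standard OLS shortcut that divides each residual by one minus its leverage. The simulated data and the helper name loocv are hypothetical; this is not the procedure used inside any particular R function.

```r
set.seed(6)                                      # arbitrary seed
n  <- 100
X  <- runif(n)
Y  <- 1 + pnorm(12 * (X - 0.5)) + rnorm(n, sd = 0.3)   # made-up nonlinear CEF
df <- data.frame(X = X, Y = Y)

degrees <- 1:8                                   # Step 1: candidate models
loocv <- function(ret) {                         # Step 2: evaluation criterion
  mean((residuals(ret) / (1 - hatvalues(ret)))^2)
}
cv <- sapply(degrees, function(d)                # Step 3: evaluate each model
  loocv(lm(Y ~ poly(X, degree = d), data = df)))
best <- degrees[which.min(cv)]                   # Step 4: select the least bad model
ret.best <- lm(Y ~ poly(X, degree = best), data = df)  # Step 5: use its estimates
print(round(cv, 4))
print(best)
```

Different evaluation criteria in Step 2 can select different models from the same candidates and data.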

Some intuitive ways to evaluate models are really bad. First, maximizing $R^2$ is bad. Whenever you add a term to your model, $R^2$ never decreases (and almost always increases), even if the model is worse (i.e., yields worse CEF estimates and predictions). Adjusted $R^2$ is better, but still not designed for optimal model selection. Second, hypothesis testing is bad. Different significance levels yield different chosen models, and the answer to "which model is best?" never starts with, "I controlled the type I error rate..."
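The behavior of $R^2$ can be checked in a small simulation. In the sketch below, the true CEF is linear by construction (an illustrative assumption), yet adding higher-order polynomial terms does not lower $R^2$; adjusted $R^2$ is shown alongside for comparison.

```r
set.seed(7)                               # arbitrary seed
n  <- 40
X  <- runif(n)
Y  <- 1 + X + rnorm(n)                    # true CEF is linear (made-up coefficients)
df <- data.frame(X = X, Y = Y)
fits <- lapply(1:6, function(d) summary(lm(Y ~ poly(X, degree = d), data = df)))
r2     <- sapply(fits, function(s) s$r.squared)
adj.r2 <- sapply(fits, function(s) s$adj.r.squared)
print(round(rbind(degree = 1:6, r2 = r2, adj.r2 = adj.r2), 3))
```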

The first difficulty in selecting a good CEF model is that $m(\cdot)$ could be very nonlinear. Imagine $Y = m(X)$ exactly. Even without any error term, we could get a bad estimate if we specify $m(x) = \beta_0 + \beta_1 x$ when really $m(\cdot)$ is not linear-in-variables. So, our model must be flexible enough to approximate the true $m(\cdot)$ well.

The second difficulty is distinguishing $m(X_i)$ from the CEF error $V_i \equiv Y_i - m(X_i)$ in the data. If we knew $Y_i = m(X_i)$, then we could learn $m(x)$ perfectly for all $x = X_i$. But in reality, we observe $Y_i = m(X_i) + V_i$. If $Y_i$ is big, we don't know if $m(X_i)$ is big or $V_i$ is big. You can think of $m(X_i)$ as the "signal" and $V_i$ as the "noise": we want to distinguish the signal from the noise. If our model is too flexible, we risk overfitting: mistaking noise for signal. For example, perhaps the true $m(\cdot)$ is linear, but we estimate a very nonlinear function.

In practice, the key is balancing the two difficulties described above. If the model is too simple, it may fail to approximate the true CEF. If the model is too complex, it may lead to overfitting. The CEF estimate is bad in either case.

In more complex models, optimal model selection for prediction may not be optimal for causality. Historically, model selection has focused on prediction. Model selection for causal estimation is a cutting-edge area of econometrics research.

The following code shows a particular example of nonparametric regression. Specifically, it uses something called a smoothing spline estimator, implemented in function smooth.spline() in R. The different estimates shown (thick red lines) correspond to different levels of flexibility of the model. The plots labeled "GCV" and "LOOCV" refer to formal model selection procedures provided through the smooth.spline() function automatically. The others show intentionally bad fits: one model is "Too flexible" (undersmoothed), the other is "Not flexible enough" (oversmoothed). Note that the same data is used for each estimate, as seen in the scatterplots. The thin black line is the true CEF.

Figure 8.4: Smoothing spline estimates: same data, different amounts of flexibility. [Four panels: GCV, LOOCV, Undersmoothed, Oversmoothed.]

Figure 8.4 shows the results from the following code.

    par(family="serif", mar=c(3,3,1,1), mgp=c(2.1,0.8,0), mfrow=c(2,2))
    set.seed(112358)
    n <- 48
    CEF <- function(x) 1 + pnorm(12*(x-1/2))   # true CEF
    df <- data.frame(X=sort(runif(n)))
    df$Y <- CEF(df$X) + rbeta(n=n, shape1=2, shape2=2)*2-1   # mean-zero noise
    rets <- list()
    titles <- c("GCV", "LOOCV", "Too flexible", "Not flexible enough")
    rets[[1]] <- smooth.spline(x=df$X, y=df$Y, cv=FALSE) # GCV
    rets[[2]] <- smooth.spline(x=df$X, y=df$Y, cv=TRUE) # LOOCV
    rets[[3]] <- smooth.spline(x=df$X, y=df$Y, df=n)
    rets[[4]] <- smooth.spline(x=df$X, y=df$Y, df=2)
    xx <- seq(from=0, to=1, by=0.005)
    for (ifig in 1:4) {
      plot(x=df$X, y=df$Y, type="p", pch=16, xlab="", ylab="",
           main="", xlim=c(0,1), ylim=c(0,3.04))
      lines(x=xx, y=CEF(xx), col=1)                 # true CEF
      lines(predict(rets[[ifig]], x=xx), col=2)     # smoothing spline estimate
      title(main=titles[ifig], line=-1, adj=0.1)
    }

Discussion Question 8.4 (model evaluation). In practice, why don't we just make graphs like in Figure 8.4 and see which fitted function looks best? (Hint: can we make such graphs in practice? If so, how can we agree on which "looks best"? What does "best" mean?)


Empirical Exercises

Empirical Exercise EE8.1. You will analyze data on law schools and their student outcomes, originally collected by Kelly Barnett for an economics term project. The idea is to compare median starting salaries of graduates from each law school with the school's cost. Of course, these are not causal estimates: does a Harvard Law graduate make a lot of money because Harvard is expensive, or because she's very skilled (enough to get into Harvard)? Since school cost is essentially a continuous variable, you will explore possible nonlinearity in the (statistical) relationship between cost and salary.

a. Load the data (assuming you've already installed that R package or Stata command).

    R: library(wooldridge)

    Stata: bcuse lawsch85, nodesc clear

b. Stata only: make a graph with a local linear nonparametric CEF estimate (of salary given cost), a linear fit, and a quadratic fit, with command
    lpoly salary cost, degree(1) n(100) addplot(lfit salary cost || qfit salary cost)
where n(100) simply specifies the number of CEF values to estimate and plot, lfit and qfit stand for linear fit and quadratic fit, and model selection is done with a "rule-of-thumb" formula that attempts to optimally balance variance and squared bias.

c. R only: make a data frame named df with only salary and cost variables, and only when both are observed, with
    df <- data.frame(Y=lawsch85$salary, X=lawsch85$cost)
    df <- df[!(is.na(df$Y) | is.na(df$X)), ]
where is.na() is TRUE if the entry is missing and FALSE if not.

d. R only: compute and store linear and quadratic (in variables) regressions with
    retlm <- lm(Y~X, data=df)
    retnl <- lm(Y~X+I(X^2), data=df)

e. R only: compute and store a nonparametric smoothing spline CEF estimate, with GCV model selection, with command
    retss <- smooth.spline(x=df$X, y=df$Y, cv=FALSE)

f. R only: specify a sequence of X values and compute CEF estimates at each value from each of the three models (linear, quadratic, nonparametric). Store the sequence as xx with
    xx <- seq(from=min(df$X), to=max(df$X), length.out=100)
and then compute the estimates as
    fitlm <- predict(retlm, newdata=data.frame(X=xx))
    fitnl <- predict(retnl, newdata=data.frame(X=xx))
    fitss <- predict(retss, x=xx)
(predict() for a smoothing spline takes the evaluation points through argument x and returns a list with components x and y.)

g. R only: make a scatterplot of raw data with
    plot(x=df$X, y=df$Y, xlab="Cost", ylab="Starting Salary")

h. R only: plot the three estimated CEFs as lines over the scatterplot with
    lines(x=xx, y=fitlm, col=1, lty=1)
    lines(x=xx, y=fitnl, col=2, lty=5)
    lines(fitss, col=4, lty=3)

i. Optional: repeat your analysis but with the school's rank (variable rank) instead of cost.

j. Optional: repeat again but with log salary and log rank. Log salary is already in the dataset as variable lsalary (that's a lowercase L before salary).

    R: df <- data.frame(Y=lawsch85$lsalary, X=log(lawsch85$rank))

    Stata: generate lrank = log(rank), then use lrank and lsalary

Empirical Exercise EE8.2. You will analyze data on sleep and wages, originally from Biddle and Hamermesh (1990). Specifically, you'll estimate the CEF of daily hours of sleep conditional on hourly wage. For now, just drop missing values without worry, and focus on the linear, quadratic, and nonparametric estimation.



a. Load the data.

    R: library(wooldridge)

    Stata: bcuse sleep75, nodesc clear

b. R only: follow the same steps (identical code) as in EE8.1 through part (h), after setting up the data frame named df. Specifically, replace EE8.1(c) with
    df <- data.frame(Y=sleep75$slpnaps/7/60, X=sleep75$hrwage)
    df <- df[!(is.na(df$Y) | is.na(df$X)), ]
and then use the same code for all subsequent steps.

c. Stata only: generate a new variable that translates the total weekly minutes of sleep into average daily hours of sleep, with
    generate sleephrsdaily = slpnaps/7/60

d. Stata only: graph linear, quadratic, and nonparametric (local linear) CEF estimates similar to EE8.1(b), with command
    lpoly sleephrsdaily hrwage, degree(1) n(100) addplot(lfit sleephrsdaily hrwage || qfit sleephrsdaily hrwage)

e. Optional: repeat your analysis, but instead of hrwage use totwrk as the conditioning variable (regressor); this is total minutes of work per week. (You could also adjust it to be average daily hours of work, to make it more comparable to the sleep variable you use.)


Empirical Exercise EE8.3. You will analyze data from the 1994–1995 men's college basketball season: scores and Las Vegas betting "spreads," originally collected by Scott Resnick. Before each game, people can bet on whether the score difference will be "over" or "under" the spread set by bookmakers in Las Vegas. (In the data, the "difference" is the favored team's score minus the other team's score, so the variable spread is always positive, but the actual score difference scrdiff can be negative if the favored team loses.) Basically, the bookmaker adjusts the spread so that half the bets are "over" and half "under," so regardless of the actual score outcome, half win and half lose (and the bookmaker always profits): the losers pay the winners, and the bookmaker keeps the transaction fees. (It's a little complicated since bets can be placed at different times and the spread can change over time, but we can imagine a simplified version where everyone bets at once and the spread is set so that half bet "over" and half "under.") See the Wikipedia entry at https://en.wikipedia.org/wiki/Spread_betting for more on spread betting. Consequently, the spread does not reflect the bookmaker's belief, but rather the aggregate beliefs of everybody betting on the game. The accuracy of such aggregate wisdom has spurred the creation of "prediction markets" for events beyond sports, like presidential elections, although there have been notable failures (e.g., 2016 U.S. presidential election).¹ You will check whether the Las Vegas spread is indeed a good predictor of the actual score difference.

Technically, the above arguments suggest that given the spread, the median score difference should equal the spread, not the mean. But such an investigation would require "median regression" (a type of "quantile regression"), which is beyond our scope. Instead, you will investigate whether the spread is still a good predictor of the actual score difference with quadratic loss. Specifically, you can check if the OLS fit has intercept close to 0 and slope close to 1 (and whether those values are in the respective confidence intervals).

a. R only: load the needed packages (and install them before that if necessary) and look at a description of the dataset:
library(wooldridge); library(sandwich); library(lmtest)
?pntsprd

b. Stata only: load the data with bcuse pntsprd , nodesc clear (assuming bcuse is already installed).

c. For each observation (each game), compute whether the actual score difference was over, under, or equal to the spread. In math and in the code below, the "sign" function (not to be confused with "sine") equals +1 for strictly positive values, −1 for strictly negative values, and 0 for zero.

R: overunder <- sign(pntsprd$scrdiff - pntsprd$spread)

Stata: generate overunder = sign(scrdiff - spread)

1 See https://en.wikipedia.org/wiki/Prediction_market for more.


d. Display the frequency of over, under, and equal.

R: table(overunder, useNA='ifany')

Stata: tabulate overunder , missing

e. Regress the score difference on the spread.

R: ret <- lm(scrdiff~spread, data=pntsprd)

Stata: regress scrdiff spread , vce(robust)

f. R only (since already reported by Stata): display the point estimates and heteroskedasticity-robust 95% confidence intervals for the intercept and slope with
cbind(coeftest(ret, vcov.=vcovHC(ret, type='HC1'))[,1:2], coefci(ret, vcov.=vcovHC(ret, type='HC1')))

g. Plot nonparametric CEF fitted values against the line Y = X (intercept zero, slope one).

R: plot(smooth.spline(x=pntsprd$spread, y=pntsprd$scrdiff)) then abline(a=0, b=1, col=2)

Stata: lpoly scrdiff spread , degree(1) addplot(function y=x , range(spread)) noscatter

h. Optional: repeat your analysis in parts (e)–(g) but with the reverse regression: regress the spread on the score difference. (Is the slope still close to 1? Are you surprised? Consider games with the biggest possible score difference: should the spread be even bigger half the time?)

Chapter 9

Regression with Two Binary Regressors


Depends on: Chapters 6 and 7 (which depend on Chapters 2–4)



9.2 Assess whether there is bias from omitting a variable in a real-world example, including the direction of bias [TLOs 5 and 6]

9.3 Interpret (appropriately) the coefficients of a regression with two binary variables, mathematically and intuitively, for description, prediction, and causality [TLO 3]

9.4 Assess whether comparing changes in two groups over time can be interpreted causally, and interpret such differences appropriately [TLOs 2, 3, and 6]

9.5 In R (or Stata): estimate regression models with two binary variables, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]


• ATT (Masten video)

• Potential outcomes and CATE (Masten video)

• OVB/confounders (Masten video)

• Conditional independence/unconfoundedness (Masten video)

• ATE/conditional independence example (Masten video)

• Difference-in-differences (Masten video)


• Parallel trends (Masten video)

• Diff-in-diff example: immigration and unemployment (Masten videos)

• Parallel trends example: immigration and unemployment (Masten videos)

• Diff-in-diff example: minimum wage (Masten video)

• Diff-in-diff example: posting calorie counts (Masten video)

• OVB example: test score and class size (Lambert video)

• OVB example: wages and education (Lambert video)

• Sections 3.3 ("Ceteris Paribus Interpretation and Omitted Variable Bias") and 6.1.5 ("Interaction Terms") in Heiss (2016)

• Section 13.2 ("Difference-in-Differences") in Heiss (2016)

• Collider bias examples: https://doi.org/10.1093/ije/dyp334

• Collider bias review (very detailed): https://doi.org/10.1146/annurev-soc-071913-043455

• Sections 6.1 ("Omitted Variable Bias") and 8.3 ("Interactions Between Independent Variables") in Hanck et al. (2018)

Perhaps surprisingly, there is a lot to think about with even just two binary regressors. Topics include (mis)specification of a CEF model, interaction between regressors as a type of nonlinearity, interpretation of regression coefficients, causality, estimation, and more.

9.1 Omitted Variable Bias

⇒ Kaplan video: Omitted Variable Bias

For causality, omitted variable bias (OVB) is a common problem in economics. More broadly, it is a common problem in any field that uses observational (non-experimental) data and has many variables that interact in complex ways. Generally, OVB arises because a variable outside our model is moving with X and causing Y to change, but our model assumes these changes are entirely from X.

9.1.1 An Allegory

Imagine a ghost (Q) that often accompanies a child (X); i.e., the ghost and child are often in the same place at the same time. The ghost always makes a huge mess (Y): spilling flour, knocking over chairs, drawing on walls, etc. The child's parents only observe the child and the mess; they do not observe the ghost. The parents note that when the child is in the kitchen, then there is often a mess in the kitchen, and when the child is in the bathroom, then there is often a mess in the bathroom, etc. Thus, they infer that the child (X) causes the mess (Y). However, we know that it only appears that way because

GHOST1. the ghost (Q) often accompanies the child (X), and

GHOST2. the ghost (Q) causes a mess (Y).

The child is the regressor. The ghost is the omitted variable. The parents are economists who over-estimate how much mess the child causes. This phenomenon is OVB.

9.1.2 Formal Conditions

The ghost of OVB can be formalized as follows. Consider the structural model

\[ Y = \beta_0 + \beta_1 X + \beta_2 Q + V, \tag{9.1} \]

where Cov(X, V) = 0. If we don't observe Q, then instead we have the structural model

\[ Y = \beta_0 + \beta_1 X + U, \qquad U \equiv \beta_2 Q + V. \tag{9.2} \]

Here, X is sometimes called the included regressor (included in the model, not omitted). If X is binary, then for OLS to estimate β1 requires E(U | X = 0) = E(U | X = 1): the average effect of the structural error term U must be the same for both X groups. For simplicity, imagine Q is also binary.

Condition GHOST1, "the ghost follows the child," means that we usually see Q = 1 when X = 1, and Q = 0 when X = 0. More generally, it means Q is correlated with X. This correlation does not need to have a causal interpretation. It does not matter why the ghost follows the child: maybe the ghost likes the child's company (or vice-versa), or maybe they just get hungry at the same time. It only matters that they tend to be in the same place: Q and X tend to have the same value. OVB can also occur if there is a negative correlation, e.g., if usually Q = 1 when X = 0, and Q = 0 when X = 1.

Condition GHOST2, "the ghost causes a mess," means that Q is a causal determinant of Y. In (9.1), this means β2 ≠ 0. Although in the example β2 > 0 (more mess), OVB can occur with β2 < 0, too. For example, maybe the child is really messy, but the ghost cleans everything up; then, the parents would incorrectly think the child is not messy.

To summarize: for variable Q that is not included as a regressor (it is omitted from the model), it will cause OVB if both of the following conditions hold.

OVB1. Corr(Q, X) ≠ 0: the omitted variable is correlated with the included regressor.

OVB2. The omitted variable Q is a causal determinant of Y (not only through X).

The variable Q may be called an omitted variable or a confounder.


Assessing OVB Conditions Empirically

If Q is observed in the data, then you can compare β̂1 (the estimated coefficient on X) when Q is included as a regressor to β̂1 when Q is omitted. If the estimates are meaningfully different (economically), then it may be best to include Q to reduce OVB. However, there are other types of variables that would also lead to a different β̂1 but are actually worse to include, so careful thought is required; see Section 9.6.

If Q is not observed in the data, then even Corr(Q, X) in OVB1 cannot be assessed empirically (i.e., using data).
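
When Q is observed, the comparison described above can be sketched in R. This is only an illustration on simulated data: the sample size, the probabilities, and the coefficient values (β0 = 1, β1 = 2, β2 = 3) are assumptions chosen for the example, not values from the text.

```r
# Simulated (hypothetical) data: binary X and Q that tend to take the same value
set.seed(1)
n <- 10000
Q <- rbinom(n, size = 1, prob = 0.5)
# OVB1: X = 1 is more likely when Q = 1, so Corr(Q, X) > 0
X <- rbinom(n, size = 1, prob = ifelse(Q == 1, 0.8, 0.2))
# OVB2: Q is a causal determinant of Y; assumed beta0 = 1, beta1 = 2, beta2 = 3
Y <- 1 + 2 * X + 3 * Q + rnorm(n)

b_short <- unname(coef(lm(Y ~ X))["X"])      # Q omitted
b_long  <- unname(coef(lm(Y ~ X + Q))["X"])  # Q included
print(c(Q_omitted = b_short, Q_included = b_long))
```

In this design, Cov(X, Q) = 0.15 and Var(X) = 0.25, so with Q omitted the estimate should settle near 2 + 3(0.15/0.25) = 3.8 rather than near the assumed β1 = 2. With real data, the same two lm() calls apply, but a difference between them does not by itself show that Q belongs in the model (see Section 9.6).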

Beware of "omitted variable" tests that are not concerned with this type of OVB. For example, Stata's ovtest implements the Ramsey test (RESET). Although the ov in ovtest indeed stands for "omitted variables," the Ramsey test only looks for (certain types of) nonlinearity, to see whether a polynomial model might be better than a linear model. That is, it is about nonlinearity in X (Section 8.2), not about a separate Q variable. Besides, as you learned in Section 8.3, hypothesis testing is a bad way to do model selection.

Example

For example, imagine we want to learn the effect of kindergarten classroom size on earnings as an adult. (This is inspired by Chetty, Friedman, Hilger, Saez, Schanzenbach, and Yagan (2011), who actually have randomized experimental data to answer this question.) Let Y denote the annual earnings of the individual at age 30. Let X = 1 if (as a child) the individual was in a kindergarten classroom with more than 24 students, and X = 0 otherwise. Imagine X is not randomized. We are curious whether we can just regress Y on X, or if there is OVB. Consider the following possible omitted variables.

First, consider Q to be somebody's first grade class size. (First grade is the year after kindergarten in the US.) As with X, Q = 1 if it is above 24 students and Q = 0 otherwise. Since it seems like kindergarten class size has an effect on adult earnings (Y) according to Chetty et al. (2011), probably first grade class size does, too, satisfying OVB2. If all students in the population are completely randomly assigned to classes each year, then Corr(X, Q) = 0, so OVB1 does not hold, and this Q would not cause OVB. However, students tend to stay in the same school, and some schools tend to have smaller class sizes than others, so OVB1 probably does hold. Since both OVB1 and OVB2 are true, there is OVB.

Second, consider Q as the number of cubbies (places to put clothes, backpacks, etc.) in somebody's kindergarten classroom. Presumably larger classes (X = 1) require more cubbies since there are more students, so Corr(Q, X) > 0, satisfying OVB1. However, I'd guess the number of cubbies does not have a causal effect on future earnings Y. That is, if we simply went into classrooms and added a few cubbies (without adding students), I don't think it would affect students' future earnings. Thus, OVB2 does not hold, and this Q does not cause OVB.

Third, consider Q = 1 if the kindergarten is in a high-income area, and Q = 0 otherwise. Areas with higher income are more likely to be able to afford more teachers to keep class sizes small. That is, it's more likely to see Q = 1 and X = 0, or Q = 0 and X = 1, so Corr(Q, X) < 0, satisfying OVB1. Also, Chetty, Hendren, and Katz (2016) provide evidence that growing up in a higher-income area has a positive causal effect on earnings as an adult (not only because of smaller kindergarten classes), meaning Q is a causal determinant of Y, satisfying OVB2. Thus, omitting this Q causes OVB.

In Sum: Possible Omitted Variables (Q) in Kindergarten Example

First grade class size: affects earnings (OVB2) and probably correlated with kindergarten class size (OVB1) if population includes multiple schools ⇒ OVB

Cubbies: more if more students (OVB1), but no causal effect on earnings (no OVB2) ⇒ no OVB

Neighborhood income: smaller classes if higher income (OVB1), and affects earnings (OVB2) ⇒ OVB

Discussion Question 9.1 (assessing OVB). Among public elementary schools (students mostly 5–11 years old) in California, let Y be the average standardized math test score among a school's 5th-graders, and let X be the school's student-teacher ratio for 5th-graders (like average number of students per class). Consider a simple regression of Y on X. For any two of the following variables, assess each OVB condition separately, and then decide whether you think it's a source of OVB.

a) School's parking lot area per student. (Remember: 5–11-year-olds don't have cars to park.)

b) Time of day of the test.

c) School's total spending per student (including books, facilities, etc.).

d) Percentage of English learners (non-native speakers) among a school's 5th-grade students.

9.1.3 Consequences

The practical problem of OVB is that we systematically over-estimate or under-estimate the true structural parameter. This consequence is quantified below.

Formulas

The following results are much more general than OVB with binary regressors. Beyond OVB, they quantify the consequences of any source of endogeneity that causes correlation between the regressor X and structural error term U. Other sources of endogeneity are discussed in Section 12.3. The results also apply to any discrete or continuous X.

Given structural model Y = β0 + β1X + U, the OLS estimator of β1 has the property

\[ \hat\beta_1 \xrightarrow{p} \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)}, \tag{9.3} \]

or equivalently (in different notation),

\[ \operatorname*{plim}_{n\to\infty} \hat\beta_1 = \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)}. \tag{9.4} \]

That is, for large samples (large n), the estimator β̂1 is close to the right-hand side expression in most randomly sampled datasets. (To review →p and consistency, see Section 3.7.3.)

Equations (9.3) and (9.4) show OVB is not solved by having lots of data. Unless Corr(X, U) = 0, the OLS estimator is not consistent for the structural β1.
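
To see numerically that more data does not help, here is a small R simulation sketch. Everything in it is an assumption made for illustration: a continuous X, a hypothetical common component Z that makes X endogenous, and structural values β0 = 1 and β1 = 2. By construction Cov(X, U) = 0.5 and Var(X) = 2, so the right-hand side of (9.3) is 2 + 0.5/2 = 2.25.

```r
set.seed(2)
beta0 <- 1; beta1 <- 2          # assumed structural parameters

sim_slope <- function(n) {
  Z <- rnorm(n)                 # hypothetical common component
  X <- Z + rnorm(n)             # Var(X) = 2
  U <- 0.5 * Z + rnorm(n)       # Cov(X, U) = 0.5, so X is endogenous
  Y <- beta0 + beta1 * X + U
  unname(coef(lm(Y ~ X))["X"])  # OLS slope estimate
}

sample_sizes <- c(100, 10000, 1000000)
slopes <- sapply(sample_sizes, sim_slope)
plim_formula <- beta1 + 0.5 / 2 # right-hand side of (9.3) in this design
print(data.frame(n = sample_sizes, slope = slopes))
print(plim_formula)
```

As n grows, the estimated slope should settle near 2.25, not near the structural value 2: the sampling noise shrinks, but the Cov(X, U)/Var(X) term does not.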

Rearranging (9.4), the asymptotic bias (as in (3.32)) is

\[ \operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1 = \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} = \text{slope coefficient in } \operatorname{LP}(U \mid 1, X). \tag{9.5} \]

The characterization as a linear projection slope coefficient comes from replacing Y with U in (7.8). This can't be computed from data since U is unobserved, but it is helpful for thinking about the direction and magnitude of asymptotic bias.

Although technically this is "asymptotic bias" rather than "bias" (Section 3.7.1), the practical implication is the same. Although the two are very different mathematically, we won't worry about such technicalities.

Direction of Asymptotic Bias

(Recall terms and definitions from Sections 3.7.1 and 3.7.3.)

The direction (+ or −) of the asymptotic bias in (9.5) depends on the sign (+ or −) of the slope in LP(U | 1, X). The sign of this slope is the same as the sign of Corr(X, U).

If Corr(X, U) > 0, then plim_{n→∞} β̂1 − β1 > 0. This is positive (upward) asymptotic bias, meaning we systematically estimate a value "above" the true β1. "Above" does not mean "bigger in magnitude": it could be that β1 = −9 and positive asymptotic bias causes plim_{n→∞} β̂1 = 0. This is "positive" since 0 − (−9) > 0 (positive), but we might also say that we're estimating a "smaller" effect (in fact, zero effect) in the sense that |0| < |−9|. This can be confusing.

If Corr(X, U) < 0, then plim_{n→∞} β̂1 − β1 < 0, meaning negative (downward) asymptotic bias. Again confusing: negative asymptotic bias can actually make effects look bigger, e.g., if β1 = 0 and plim_{n→∞} β̂1 = −9, the true effect is zero, but the negative asymptotic bias makes it appear like there is an effect.


Results in Terms of Q

For OVB specifically, the general results in terms of U can be translated to Q. As in (9.2), let U = β2Q + V with Cov(X, V) = 0. Then, using a linearity property of covariance,

\[ \operatorname{Cov}(X,U) = \operatorname{Cov}(X, \beta_2 Q + V) = \beta_2 \operatorname{Cov}(X,Q) + \overbrace{\operatorname{Cov}(X,V)}^{=0}. \tag{9.6} \]

Plugging this into the first expression in (9.3),

\[ \hat\beta_1 \xrightarrow{p} \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} = \beta_1 + \beta_2 \frac{\operatorname{Cov}(X,Q)}{\operatorname{Var}(X)} = \beta_1 + \beta_2 \operatorname{Corr}(X,Q) \sqrt{\frac{\operatorname{Var}(Q)}{\operatorname{Var}(X)}}. \tag{9.7} \]

Interestingly, similar to (9.5), Cov(X, Q)/Var(X) is the slope of the population linear projection of Q onto X (and an intercept), LP(Q | 1, X). So, the asymptotic bias is the product β2γ1, where β2 is the structural slope coefficient on Q in (9.1), and γ1 is the linear projection slope coefficient in LP(Q | 1, X) = γ0 + γ1X.

Equation (9.7) shows why both Conditions OVB1 and OVB2 are required for OVB. Condition OVB1 says Corr(X, Q) ≠ 0, while OVB2 says β2 ≠ 0. If either β2 = 0 or Corr(X, Q) = 0 in (9.7), then β2 Corr(X, Q) = 0 and the asymptotic bias disappears: β̂1 →p β1.

The direction of asymptotic bias can also be interpreted in terms of Q. Using (9.7), the sign of the asymptotic bias is the sign of β2 Corr(X, Q). That is, if β2 Corr(X, Q) > 0, then there is positive (upward) asymptotic bias; if β2 Corr(X, Q) < 0, then there is negative (downward) asymptotic bias.
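
The product form β2γ1 has an exact sample counterpart, which the following R sketch checks on simulated data: the slope from regressing Y on X alone equals the slope on X from the regression that includes Q, plus the estimated coefficient on Q times the slope from regressing Q on X. The data-generating values (β1 = −1, β2 = 2, and the probabilities) are assumptions for illustration, chosen so that Corr(X, Q) < 0 and β2 > 0.

```r
set.seed(3)
n <- 500
Q <- rbinom(n, 1, 0.5)
X <- rbinom(n, 1, ifelse(Q == 1, 0.3, 0.7))  # X = 1 is less likely when Q = 1
Y <- 5 - 1 * X + 2 * Q + rnorm(n)            # assumed beta1 = -1, beta2 = 2

long   <- coef(lm(Y ~ X + Q))                # Q included
short  <- coef(lm(Y ~ X))                    # Q omitted
gamma1 <- unname(coef(lm(Q ~ X))["X"])       # sample analog of Cov(X,Q)/Var(X)

bias_term <- unname(long["Q"]) * gamma1      # sample analog of beta2 * gamma1
print(c(short_slope    = unname(short["X"]),
        long_plus_bias = unname(long["X"]) + bias_term,
        bias_term      = bias_term))
```

The first two printed numbers agree up to rounding error, and the bias term is negative here because the estimated coefficient on Q is positive while the slope of Q on X is negative.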

Example

Consider the asymptotic bias direction in the example where X = 1 if the kindergarten class size is large and Q = 1 if the neighborhood income is high. Earlier, we thought probably Corr(X, Q) < 0 and β2 > 0. Thus, there is negative OVB, since β2 Corr(X, Q) < 0. That is, if the true effect of class size on earnings is β1, then we systematically estimate something below β1.

Does this make the effect size (absolute value) appear bigger or smaller? Since smaller classes are better, average earnings (Y) are higher when X = 0 than when X = 1. This means a negative slope, β1 < 0. That is, the effect of changing from a smaller class (X = 0) to a larger class (X = 1) is lower future earnings (β1 < 0). Negative asymptotic bias means we estimate something even more negative: plim_{n→∞} β̂1 < β1 < 0. This makes the size of the effect appear larger than it really is: we estimate something farther away from zero.

Intuitively, this OVB direction makes sense. Individuals who had a small kindergarten class tend to have grown up in wealthier areas, with lots of other advantages that also cause higher earnings. If we ascribe the entire mean earnings difference to kindergarten, then it falsely appears that kindergarten alone causes the big difference, when in reality many different forces were all working together in the same direction.

In Sum: OVB Assessment

1. Think of a specific variable Q
2. Assess OVB1: correlated with X?
3. Assess OVB2: causal effect on Y (separate from X effect)?
4. If both OVB1 and OVB2 ⇒ OVB
5. OVB direction: positive bias if Corr(X, Q) and effect of Q on Y are either both + or both −; otherwise, negative bias
6. OVB magnitude: all else equal, larger (in absolute value) if i) larger effect of Q on Y; ii) larger Corr(X, Q); iii) larger Var(Q)/Var(X)

Practice 9.1 (OVB: kindergarten). Consider the OVB example with earnings as an adult (Y), kindergarten classroom size (X), and childhood neighborhood income (Q). But reverse the definition of X: let X = 1 for smaller classrooms (24 or fewer students), and X = 0 for larger classrooms. Say whether you think each of the following is positive or negative, and explain why: a) β1; b) Corr(X, Q); c) β2; and d) OVB. Also discuss: e) will our estimated effect β̂1 tend to be larger or smaller than the true effect β1, and why?

Discussion Question 9.2 (OVB: ES habits). Recall from DQ 6.3 the example with Y as a student's final semester score (0 ≤ Y ≤ 100) and X = 1 if a student starts the exercise sets well ahead of the deadline (and X = 0 otherwise).

a) What's one variable that might cause OVB? Explain why you think both OVB conditions are satisfied.

b) Which direction of asymptotic bias would your omitted variable cause? Explain.

9.1.4 OVB in Linear Projection

For linear projection (without causal interpretation), the OVB formula is actually the same as (9.7), just with β1 and β2 interpreted as linear projection coefficients rather than structural coefficients. Similar results for larger linear projection models are in Hansen (2020, §2.24), for example.

However, if we are interested in prediction, we don't care whether our β̂1 estimates a particular linear projection coefficient; we only care whether we can predict Y well. Of course, we don't want to omit Q if it's helpful for prediction, but we don't care about OVB itself. That is, OVB is only a problem for causality, not prediction.


9.2 Linear-in-Variables Model

The simplest CEF model with two binary variables is linear-in-variables (Section 8.2.1):

\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2. \tag{9.8} \]

Misspecification

Unfortunately, (9.8) may be misspecified. Recall from Section 7.1 that misspecification arose when X had three values but the CEF model β0 + β1X had only two parameters. The case here is similar: (9.8) has only 3 parameters, but there are 4 possible values of (X1, X2). Specifically, (X1, X2) could equal (0, 0), (0, 1), (1, 0), or (1, 1). Consequently, there are four CEF values:

\[
\begin{aligned}
m(0,0) &= E(Y \mid X_1 = 0, X_2 = 0), & m(0,1) &= E(Y \mid X_1 = 0, X_2 = 1), \\
m(1,0) &= E(Y \mid X_1 = 1, X_2 = 0), & m(1,1) &= E(Y \mid X_1 = 1, X_2 = 1).
\end{aligned}
\tag{9.9}
\]

To see the possible misspecification, we can write the βj regression coefficients in terms of the CEF values m(x1, x2). If (9.8) were true, then

\begin{align}
m(0,0) &= \beta_0 + (\beta_1)(0) + (\beta_2)(0) = \beta_0, \tag{9.10} \\
m(0,1) &= \beta_0 + (\beta_1)(0) + (\beta_2)(1) = \beta_0 + \beta_2, \tag{9.11} \\
m(1,0) &= \beta_0 + (\beta_1)(1) + (\beta_2)(0) = \beta_0 + \beta_1, \tag{9.12} \\
m(1,1) &= \beta_0 + (\beta_1)(1) + (\beta_2)(1) = \beta_0 + \beta_1 + \beta_2. \tag{9.13}
\end{align}

Consequently, β1 has two interpretations. It equals either (9.13) minus (9.11), or (9.12) minus (9.10):

\[
\begin{aligned}
m(1,1) - m(0,1) &= (\beta_0 + \beta_1 + \beta_2) - (\beta_0 + \beta_2) = \beta_1, \\
m(1,0) - m(0,0) &= (\beta_0 + \beta_1) - \beta_0 = \beta_1.
\end{aligned}
\]

Thus, the model implicitly assumes m(1, 1) − m(0, 1) = m(1, 0) − m(0, 0), which may not be true of the real CEF. For example,

\[
m(0,0) = 0,\; m(1,0) = 1,\; m(0,1) = 2,\; m(1,1) = 4
\implies m(1,1) - m(0,1) = 2,\; m(1,0) - m(0,0) = 1.
\]

Because m(1, 1) − m(0, 1) ≠ m(1, 0) − m(0, 0), the CEF model in (9.8) is misspecified (wrong). That is, there are no possible (β0, β1, β2) such that m(x1, x2) = β0 + β1x1 + β2x2.
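
The claim that no (β0, β1, β2) can match all four CEF values can be checked numerically. The R sketch below takes the four example values above and finds the best-fitting linear-in-variables coefficients by least squares, weighting each of the four cells equally (the equal weighting is an assumption made only for this illustration). If some coefficient vector fit exactly, all four residuals would be zero.

```r
# The four (x1, x2) cells with the example CEF values m(0,0)=0, m(1,0)=1, m(0,1)=2, m(1,1)=4
cells <- data.frame(x1 = c(0, 1, 0, 1),
                    x2 = c(0, 0, 1, 1),
                    m  = c(0, 1, 2, 4))

# Least-squares fit of m on (1, x1, x2), one observation per cell
fit_lin <- lm(m ~ x1 + x2, data = cells)
print(coef(fit_lin))
print(cbind(cells, fitted = fitted(fit_lin), residual = residuals(fit_lin)))
```

Every residual is nonzero (each is 0.25 in absolute value), so even the best linear-in-variables fit misses each of the four CEF values.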

As discussed in Chapter 7, if the CEF model is wrong, then OLS estimates the linear projection. Here, OLS estimates LP(Y | 1, X1, X2). However, this is not useful for causality, and the misspecification is easily fixed.


More Consideration

Before we fix the misspecification, consider more carefully why (9.8) is usually misspecified. To be concrete, imagine Y is wage, X1 = 1 if an individual has a college degree (and X1 = 0 if not), and X2 = 1 if an individual has at least 10 years of work experience (and X2 = 0 if not). For simplicity, we'll call X1 "education" and X2 "experience." The quantity m(1, 1) − m(0, 1) compares the mean wage in the high-education, high-experience group (subpopulation) with the mean wage in the low-education, high-experience group. That is, within the high-experience subpopulation, it compares the mean wage of the high-education and low-education sub-sub-populations. The quantity m(1, 0) − m(0, 0) also compares mean wages across high and low education, but within the low-experience subpopulation. Thus, assuming m(1, 1) − m(0, 1) = m(1, 0) − m(0, 0) can be interpreted as assuming that the mean wage difference between high-education and low-education groups is identical within the high-experience subpopulation and within the low-experience subpopulation. This is a strong assumption that is probably not true in this example (or in most examples).

9.3 Fully Saturated Model

⇒ Kaplan video: Fully Saturated Model Interpretation

Misspecification is avoided by adding the interaction term X1X2:

\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2. \tag{9.14} \]

Mathematically, interaction terms often involve the product of two regressors, like X1X2 here. Economically, the interaction term allows the mean Y difference associated with X1 to depend on the value of X2. Similarly, it allows the mean Y difference associated with X2 to depend on the value of X1. For example, the mean wage difference associated with education can depend on the value of experience. More generally, interaction terms allow the change in Y associated with a unit increase in one regressor to depend on the value of another regressor.

The CEF model in (9.14) is also called fully saturated (Section 7.2.2) since it is flexible enough to allow a different CEF value for each value of (X1, X2). Logically, having the same number (four) of possible values of (X1, X2) as βj parameters is necessary but not sufficient for the model to be fully saturated.

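Interpretation of the coefficients requires writing them in terms of different CEF values. First, similar to (9.10)–(9.13), each CEF value can be written in terms of the βj:

\[ m(x_1, x_2) = \beta_0 + (\beta_1)(x_1) + (\beta_2)(x_2) + (\beta_3)(x_1)(x_2), \]

\begin{align}
m(0,0) &= \beta_0 + (\beta_1)(0) + (\beta_2)(0) + (\beta_3)(0)(0) = \beta_0, \tag{9.15} \\
m(0,1) &= \beta_0 + (\beta_1)(0) + (\beta_2)(1) + (\beta_3)(0)(1) = \beta_0 + \beta_2, \tag{9.16} \\
m(1,0) &= \beta_0 + (\beta_1)(1) + (\beta_2)(0) + (\beta_3)(1)(0) = \beta_0 + \beta_1, \tag{9.17} \\
m(1,1) &= \beta_0 + (\beta_1)(1) + (\beta_2)(1) + (\beta_3)(1)(1) = \beta_0 + \beta_1 + \beta_2 + \beta_3. \tag{9.18}
\end{align}

From (9.15)–(9.18) and their differences,

\begin{align}
\beta_0 &= \overbrace{\beta_0}^{(9.15)} = m(0,0), \tag{9.19} \\
\beta_1 &= \overbrace{(\beta_0 + \beta_1) - \beta_0}^{(9.17) - (9.15)} = m(1,0) - m(0,0), \tag{9.20} \\
\beta_2 &= \overbrace{(\beta_0 + \beta_2) - \beta_0}^{(9.16) - (9.15)} = m(0,1) - m(0,0), \tag{9.21} \\
\beta_3 &= [\beta_2 + \beta_3] - [\beta_2]
  = \overbrace{[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_1)]}^{(9.18) - (9.17)}
  - \overbrace{[(\beta_0 + \beta_2) - (\beta_0)]}^{(9.16) - (9.15)} \notag \\
&= \overbrace{\overbrace{[m(1,1) - m(1,0)]}^{\text{difference}} - \overbrace{[m(0,1) - m(0,0)]}^{\text{difference}}}^{\text{difference-in-differences}} \tag{9.22} \\
&= [m(1,1) - m(0,1)] - [m(1,0) - m(0,0)] \tag{9.23} \\
&= \overbrace{[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2)]}^{(9.18) - (9.16)}
  - \overbrace{[(\beta_0 + \beta_1) - (\beta_0)]}^{(9.17) - (9.15)}. \notag
\end{align}

Because of the difference-in-differences structure seen in (9.22) and (9.23), this model is sometimes called a difference-in-differences model, particularly when X2 represents time and X1 represents a "treatment" (see Section 9.7).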

Using (9.19)–(9.23), the four βj in (9.14) have the following interpretations, both in terms of the wage example (Y wage, X1 education, X2 experience) and more generally.

• β0 = m(0, 0) is the mean wage among low-education, low-experience individuals.

More generally: β0 is the mean Y in the subpopulation with X1 = 0 and X2 = 0.

Caution: generally β0 ≠ E(Y).

• β1 = m(1, 0) − m(0, 0) is the mean wage difference between high-education and low-education individuals within the low-experience subpopulation.

More generally: β1 is the mean Y difference between X1 = 1 and X1 = 0 individuals within the X2 = 0 subpopulation.

Caution: generally β1 ≠ E(Y | X1 = 1) − E(Y | X1 = 0); it additionally conditions on X2 = 0.

• β2 = m(0, 1) − m(0, 0) is the mean wage difference between high-experience and low-experience individuals within the low-education subpopulation.

• β3 = [m(1, 1) − m(1, 0)] − [m(0, 1) − m(0, 0)] is the mean wage difference associated with experience in the high-education subpopulation, minus the mean wage difference associated with experience in the low-education subpopulation.

More generally: β3 is the mean Y difference associated with X2 in the X1 = 1 subpopulation, minus the mean Y difference associated with X2 in the X1 = 0 subpopulation.

• β3 = [m(1, 1) − m(0, 1)] − [m(1, 0) − m(0, 0)] is also the mean wage difference associated with education in the high-experience subpopulation, minus the mean wage difference associated with education in the low-experience subpopulation.

The βj interpretations can also be seen by considering the regression of Y on X1 when X2 = 0, and separately when X2 = 1. That is, plugging in x2 = 0 first and then x2 = 1 second,

\begin{align}
m(x_1, 0) &= \beta_0 + \beta_1 x_1 + (\beta_2)(0) + (\beta_3)(x_1)(0) = \beta_0 + \beta_1 x_1, \tag{9.24} \\
m(x_1, 1) &= \beta_0 + \beta_1 x_1 + (\beta_2)(1) + (\beta_3)(x_1)(1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1. \tag{9.25}
\end{align}

That is, when changing from X2 = 0 to X2 = 1, the intercept changes by β2, and the slope changes by β3. These changes could be positive or negative or zero. The interaction coefficient β3 describes how the slope with respect to X1 differs when X2 = 1 versus X2 = 0.

Equivalently, we could switch all the X1 and X2, and interpret β3 as the difference between the slope with respect to X2 when X1 = 1 versus when X1 = 0:

\begin{align}
m(0, x_2) &= \beta_0 + (\beta_1)(0) + \beta_2 x_2 + (\beta_3)(0)(x_2) = \beta_0 + \beta_2 x_2, \tag{9.26} \\
m(1, x_2) &= \beta_0 + (\beta_1)(1) + \beta_2 x_2 + (\beta_3)(1)(x_2) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) x_2. \tag{9.27}
\end{align}
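
The mapping between the βj and the four CEF values also holds exactly for OLS estimates and sample cell means, which the following R sketch illustrates. The data are simulated under assumed CEF values m(0,0) = 0, m(1,0) = 1, m(0,1) = 2, m(1,1) = 4, so the assumed coefficients are β0 = 0, β1 = 1, β2 = 2, β3 = 1; the variable names and sample size are hypothetical.

```r
set.seed(4)
n  <- 400
X1 <- rbinom(n, 1, 0.5)   # hypothetical binary regressor
X2 <- rbinom(n, 1, 0.5)   # hypothetical binary regressor
Y  <- 0 + 1 * X1 + 2 * X2 + 1 * X1 * X2 + rnorm(n)

# Fully saturated model: Y ~ X1 * X2 expands to X1 + X2 + X1:X2
b <- coef(lm(Y ~ X1 * X2))

# Sample cell means; m["1", "0"] is the mean of Y when X1 = 1 and X2 = 0
m <- tapply(Y, list(X1, X2), mean)

by_hand <- c(m["0", "0"],                                             # as in (9.19)
             m["1", "0"] - m["0", "0"],                               # as in (9.20)
             m["0", "1"] - m["0", "0"],                               # as in (9.21)
             (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"]))  # as in (9.22)
print(cbind(ols = unname(b), from_cell_means = unname(by_hand)))
```

The two columns agree up to rounding error: with a fully saturated model, OLS reproduces the four sample cell means, so each estimated coefficient is the corresponding difference (or difference-in-differences) of cell means.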

Practice 9.2 (binary interaction). Let Y be wage ($/hr), D1 = 1 if an individual has a college degree (D1 = 0 if not), and D2 = 1 if an individual has more than 15 years of experience (and D2 = 0 if not). You have a sample of data and run OLS on the fully saturated model, yielding Ŷ = 10 + 5D1 + D2 + 2D1D2.

a) For the college-educated subpopulation, what is the estimated change in mean wage associated with changing from low to high experience?

b) Within the low-experience subpopulation, what's the estimated difference in mean wage between the college and no-college subpopulations?

c) How do you interpret the 2 (the coefficient on D1D2)?

9.4 Structural Identification by Exogeneity

Imagine Y is determined by the structural model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + U. \tag{9.28} \]

The qualitative condition for identification is the same as in Section 6.6.1. Specifically, if U (which contains other causal determinants of Y) is unrelated to the regressors, then the structural parameters are identified. Recall that a regressor unrelated to U is called exogenous; otherwise, it's endogenous.

Mathematically, one sufficient definition of "unrelated" here is "uncorrelated." If

\[ \operatorname{Cov}(U, X_1) = \operatorname{Cov}(U, X_2) = \operatorname{Cov}(U, X_1 X_2) = 0, \tag{9.29} \]

then β1, β2, and β3 are the linear projection slope coefficients from LP(Y | 1, X1, X2, X1X2). Other mathematical definitions of "unrelated" imply (9.29) and are thus sufficient for identification. For example, U ⊥⊥ (X1, X2) logically implies (9.29), as does mean independence, E(U | X1, X2) = E(U).

If the structural β1, β2, and β3 are also linear projection coefficients, then they can be estimated by OLS. That is, we can interpret the OLS-estimated slope coefficients as the structural parameters in (9.28).

95 Identification by Conditional Independence

By extending the independence assumption (A61 and A65) variantsof the ASE and ATE can be identified (Note more details andexamples are in the Spring 2020 edition)

Consider the subpopulation with X2 = 1 and whether the meandifference E(Y | X1 = 1 X2 = 1) minus E(Y | X1 = 0 X2 = 1) has acausal interpretation This is equivalent to redefining the populationas everybody with X2 = 1 and asking if the mean difference E(Y |X1 = 1) minus E(Y | X1 = 0) has a causal interpretation This questionwas studied in Section 66 for both structural and potential outcomesmodels

The key identifying assumption from Section 66 was indepen-dence In the structural model this meant independence between theregressor X1 and the unobserved determinants of Y In the potential


outcomes model this meant independence between the treatment andthe pair of potential outcomes

Extending independence is conditional independence whichessentially assumes independence within each subpopulation (X2 = 1and X2 = 0) The conditional independence assumption (CIA) hasother names like unconfoundedness selection on observablesand ignorability see Imbens and Wooldridge (2007 p 6) and ref-erences therein Mathematically both structural and potential out-comes versions of conditional independence are stated in Assump-tion A91

Assumption A91 (conditional independence assumption CIA)Let Y = h(X1 X2U) be the structural model Conditional on thecontrol variable X2 the regressor of interest X1 is independent ofthe vector of unobserved causal determinants U U perpperp X1 | X2 Al-ternatively in potential outcomes notation where Y T and Y C arethe treated and untreated potential outcomes binary treatment X1

is independent of the potential outcomes conditional on the controlvariable X2 (Y 0 Y 1) perpperp X1 | X2 More generally in either case X2

may be replaced by multiple control variables X2 X3 X4 andX1 can be discrete (including binary) or continuous in the structuralmodel

Consequently the ASE or ATE within subpopulation X2 = 1 isidentified and equal to the conditional mean difference E(Y | X1 =1 X2 = 1) minus E(Y | X1 = 0 X2 = 1) Similarly the ASE or ATEwithin subpopulationX2 = 0 is identified and equal to the conditionalmean difference E(Y | X1 = 1 X2 = 0) minus E(Y | X1 = 0 X2 =0) Because these are causal effects within a subpopulation (not fullpopulation) ie conditional on X2 the ATE is sometimes called aconditional ATE or the ASE a conditional ASE

The (unconditional) ATE can be computed from the conditional ATEs. Specifically, the ATE is the mean conditional ATE. Writing the conditional ATE for $X_2 = 1$ as $\mathrm{CATE}(1)$, and similarly $\mathrm{CATE}(0)$ for $X_2 = 0$, $\mathrm{CATE}(X_2)$ is a random variable. Specifically, $P(\mathrm{CATE}(X_2) = \mathrm{CATE}(1)) = P(X_2 = 1)$ and $P(\mathrm{CATE}(X_2) = \mathrm{CATE}(0)) = P(X_2 = 0)$. Thus, $\mathrm{CATE}(X_2)$ has a mean. Ultimately,

\[
\begin{aligned}
\mathrm{ATE} = E[Y^T - Y^C] &= E[E(Y^T - Y^C \mid X_2)] = E[\mathrm{CATE}(X_2)] \\
&= P(X_2 = 1)\,\mathrm{CATE}(1) + P(X_2 = 0)\,\mathrm{CATE}(0).
\end{aligned}
\]

Similar arguments apply to the conditional ASE.
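As a small numerical illustration of the weighted-average formula above, the following R sketch computes the ATE from two conditional ATEs. The probability and the two CATE values are hypothetical numbers chosen only for illustration; they do not come from any dataset in this text.

```r
# Hypothetical population values (assumptions for illustration only)
p1     <- 0.3   # P(X2 = 1)
cate_1 <- 2.0   # CATE(1): conditional ATE in subpopulation X2 = 1
cate_0 <- 5.0   # CATE(0): conditional ATE in subpopulation X2 = 0

# ATE = P(X2 = 1) * CATE(1) + P(X2 = 0) * CATE(0)
ate <- p1 * cate_1 + (1 - p1) * cate_0
print(ate)
```

With these assumed values, the ATE lies between the two conditional ATEs and is closer to CATE(0), because the subpopulation with $X_2 = 0$ is larger.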

9.6 Collider Bias

Although OVB shows the risk of omitting certain types of variables (confounders), other types of variables actually should be omitted; otherwise, there is a different type of (asymptotic) bias.

A collider or common outcome is a variable on which both $X$ and $Y$ have a causal effect. (Whereas a confounder has a causal effect on both $X$ and $Y$.) For example, imagine you want to learn the effect of a firm's ownership structure (say, $X = 1$ for family-owned, $X = 0$ otherwise) on its research and development expenditure $Y$. Both $X$ and $Y$ affect the firm's performance $Z$, so $Z$ is a collider.

Including a collider as a regressor causes collider bias when estimating a causal relationship. This is not as intuitive as OVB, but consider the following example.¹

Imagine you're interested in the causal effect of eating falafel or salad on having the flu (which is zero effect), and you have a sample of 200 individuals. You randomly assigned 100 people to eat falafel for lunch and 100 salad; a few hours later, you test each for flu (assume there is no testing error). Let $Y = 1$ if somebody has the flu (otherwise $Y = 0$), and $X = 1$ if somebody ate falafel for lunch ($X = 0$ if salad). Let $Z = 1$ if the individual has a fever (otherwise $Z = 0$). Sadly, the salad had some romaine contaminated with E. coli, so 40% of those who ate salad got a fever from the E. coli, unrelated to whether or not they had the flu. Among individuals with flu, 90% have a fever, but 10% don't.

Table 9.1: Counts in falafel/salad/flu example.

              Overall           Fever           No fever
            Flu  No flu      Flu  No flu      Flu  No flu
Falafel      50    50         45     0          5    50
Salad        50    50         47    20          3    30

Table 9.1 shows the number of individuals in different categories. Overall, there is no relationship between lunch and flu, so the flu rate is the same in the falafel and salad groups. To make the numbers easier, the overall flu rate is 50% (100/200 overall; 50/100 in each group). Since nobody who ate falafel got E. coli, the only reason for fever is the flu, which has a 90% fever rate. Thus, among the 50 with flu who ate falafel, (50)(0.9) = 45 have a fever, and 5 do not. This entirely explains the Falafel row. In the salad row, given the statistical independence of flu (probability 0.5) and E. coli (probability 0.4), the probability of having neither is

\[
P(\text{not flu and not E. coli}) = P(\text{not flu})\,P(\text{not E. coli}) = [1 - \overbrace{P(\text{flu})}^{0.5}]\,[1 - \overbrace{P(\text{E. coli})}^{0.4}] = (0.5)(0.6) = 0.3,
\]

hence (100)(0.3) = 30 salad-eaters who have neither flu nor E. coli, and thus no fever. This explains the "No fever, No flu" entry of 30 in the Salad row. Similarly,

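\[
\begin{aligned}
P(\text{flu, not E. coli}) &= (0.5)(0.6) = 0.3 \quad (\text{30 people}), \\
P(\text{flu, E. coli}) &= (0.5)(0.4) = 0.2 \quad (\text{20 people}), \\
P(\text{not flu, E. coli}) &= (0.5)(0.4) = 0.2 \quad (\text{20 people}).
\end{aligned}
\]

¹ Modified from https://doi.org/10.1093/ije/dyp334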


The "not flu, E. coli" are the 20 individuals who have a fever (from the E. coli) but not flu. The 20 with both flu and E. coli all have a fever, due to E. coli. Among the 30 with flu but not E. coli, 90% have a fever, i.e., (30)(0.9) = 27 have a fever, so 3 do not. This 3 is the "No fever, Flu" entry in the Salad row. The 27 combine with the 20 who had both illnesses to make 47 who have both flu and a fever in the Salad row.

If we regress $Y$ (flu) on $X$ (food), then we correctly estimate zero effect, but if we also use $Z$ (fever), then we incorrectly estimate a non-zero effect. If we only look at the "no fever" group, then there is (appropriately) zero difference: the flu rate for the falafel eaters is 5/55 = 1/11, identical to the 3/33 = 1/11 for the salad eaters. Mathematically, these "rates" are estimates of the conditional mean of the binary $Y$ flu variable, e.g., $5/55 = \hat{E}(Y \mid \text{falafel}, \text{no fever})$, recalling $E(Y) = P(Y = 1)$ for binary $Y$. However, if we also look at the "fever" group, the flu rate is much higher in the falafel group. In fact, the falafel group's flu rate is 45/45 = 100%, whereas the salad group's flu rate is only 47/(47+20) = 70%, substantially lower. Mathematically,

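\[
\begin{aligned}
\overbrace{\hat{E}(Y \mid X = 1, Z = 0)}^{5/55} - \overbrace{\hat{E}(Y \mid X = 0, Z = 0)}^{3/33} &= 0, \\
\overbrace{\hat{E}(Y \mid X = 1, Z = 1)}^{45/45} - \overbrace{\hat{E}(Y \mid X = 0, Z = 1)}^{47/67} &= 0.30.
\end{aligned}
\tag{9.30}
\]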

This suggests eating falafel causes flu, but this incorrect conclusion is entirely collider bias.
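The falafel/salad calculations can be reproduced with a short R sketch. It expands the counts from Table 9.1 into one row per individual, computes the flu rates by group, and compares the regression of flu on lunch with and without the fever collider. The object names (counts, df, rates, and so on) are choices made here for illustration.

```r
# Counts from Table 9.1: one row per (lunch, fever, flu) cell
counts <- data.frame(
  X = c(1, 1, 1, 1, 0, 0, 0, 0),        # 1 = falafel, 0 = salad
  Z = c(1, 1, 0, 0, 1, 1, 0, 0),        # 1 = fever
  Y = c(1, 0, 1, 0, 1, 0, 1, 0),        # 1 = flu
  n = c(45, 0, 5, 50, 47, 20, 3, 30))
# Expand to one row per individual (200 rows)
df <- counts[rep(seq_len(nrow(counts)), times = counts$n), c("X", "Z", "Y")]

# Flu rate by lunch (rows) and fever status (columns)
rates <- tapply(df$Y, list(X = df$X, Z = df$Z), mean)
print(round(rates, 3))

# Falafel-minus-salad differences in flu rate within each fever group
diff_no_fever <- rates["1", "0"] - rates["0", "0"]
diff_fever    <- rates["1", "1"] - rates["0", "1"]

# Regression without the collider, then with the collider included
b_no_z   <- unname(coef(lm(Y ~ X, data = df))["X"])
b_with_z <- unname(coef(lm(Y ~ X + Z, data = df))["X"])
print(c(no_collider = b_no_z, with_collider = b_with_z))
```

Without $Z$, the coefficient on $X$ is zero (up to rounding error), matching the true zero effect. Once the fever collider $Z$ is added as a regressor, the coefficient on $X$ becomes positive even though lunch has no effect on flu.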

9.7 Causal Identification: Difference-in-Differences

⇒ Kaplan video: Diff-in-Diff Intuition

If $X_1$ is a treatment indicator and $X_2$ is a time period indicator, then the fully saturated model with two binary regressors is called a difference-in-differences (diff-in-diff) model. This is a special case of (9.14), whose coefficients were interpreted in Section 9.3.

Below, the parameter $\beta_3$ from (9.14) is shown to have a certain causal interpretation under certain conditions.

The general setup is that some individuals (or firms, or cities, etc.) were exposed to some "treatment" like a training program, or law, or other policy. The treatment wasn't randomized, but there's a group of untreated individuals whose outcomes can be used to form a counterfactual: what's the mean outcome of treated individuals in the parallel universe where they weren't treated?

Such setups are sometimes called natural experiments or quasi-experiments (see also Section 4.3.2). Since they weren't fully randomized experiments, it's invalid to simply compare treated and untreated outcomes, as seen in Section 9.7.1. However, there is enough randomness that a valid comparison can be found with some additional work (like diff-in-diff).


For example, maybe $Y$ is annual labor income, and we are interested in the effect of minimum wage. Imagine our city recently implemented a large minimum wage increase. The goal is to learn the effect of this particular minimum wage increase on $Y$ (income) for individuals in our city. Notationally, $X_1 = 1$ if the individual lives in our city (and $X_1 = 0$ otherwise), and $X_2 = 1$ if the observation is from the year after the minimum wage increase (and $X_2 = 0$ if before the increase).

In other words, $X_1 = 1$ is the "treated group" and $X_1 = 0$ the "untreated group"; $X_2 = 0$ is the time period "before" treatment, and $X_2 = 1$ is "after."

9.7.1 Bad Approaches

Discussion Question 9.3 (bad panel approach 1 for Mariel boatlift). Consider the basic setup from Card (1990). Due to a seemingly random/exogenous political decision, Cubans were temporarily permitted to immigrate to the U.S. for a few months in 1980. About half settled in Miami, FL, while the other half went to live in other cities around the U.S. We could compare wages of native-born workers in Miami in 1979 (before boatlift) and 1981 (after). Explain why this change in average wage would not be a good estimate of the average treatment effect of the Mariel boatlift on native worker wage. (Hint: are 1979 Miami and 1981 Miami the same except for how many Cubans live there, or might something else have changed?)

Discussion Question 9.4 (bad panel approach 2 for Mariel boatlift). Consider the same setup as in DQ 9.3. But now compare 1981 wages of native workers in Miami and Houston, TX, a city that did not receive a large influx of Cuban immigrants in 1980. Explain why this difference (Miami minus Houston) in average wage would not be a good estimate of the average treatment effect of the Mariel boatlift on native worker wage. (Hint: are 1981 Miami and Houston the same except for how many Cubans live there, or might there be other differences between the cities that might cause omitted variable bias?)

Discussion Question 9.5 (bad panel approach 1 for fracking). Discussion Questions 9.5 and 9.6 are based loosely on the setting of Street (2018), who uses much better approaches. For counties in North Dakota, let $Y$ denote crime rate. Consider the average crime rate in counties that started fracking activity, before and after the fracking started. (Fracking was a new technology that allowed extraction of certain underground oil and natural gas reserves that were previously infeasible or unprofitable to extract.) Explain why this change in average crime rate would not be a good estimate of the average treatment effect of the fracking activity on crime rate.

Discussion Question 9.6 (bad panel approach 2 for fracking). Consider the same setup as in DQ 9.5, but now compare the "after" crime rates in North Dakota counties with fracking to those without fracking. Explain why this difference (fracking minus non-fracking) in average crime rate would not be a good estimate of the average treatment effect of fracking on crime rate.

Continuing the minimum wage example, one bad approach is to use only data from our city, before and after the minimum wage increase. That is, we could try to estimate $E(Y \mid X_2 = 1, X_1 = 1) - E(Y \mid X_2 = 0, X_1 = 1)$. However, coincidentally, there may have been a national (or global) recession right after the minimum wage law was passed. This may make everybody's income lower in the year after. It would look like the minimum wage hurt incomes, but really it was the recession. Alternatively, there may have been great national (macroeconomic) conditions that made incomes go up, which would make us incorrectly conclude that the law increased incomes greatly. There is almost always OVB with such before vs. after comparisons, which invalidates causal interpretation.

Another bad approach is to compare incomes in our city and another city in the year after our law passed. By using the other city as a sort of control group, we avoid the problem of misinterpreting macroeconomic changes as treatment effects. However, it's hard to know which other city to pick. We could pick one that has the same population, for example, but our city may still have much higher (or lower) income for reasons other than our minimum wage. For example, San Francisco and Columbus, OH, have very similar populations, but they have (and have for a while had) very different incomes.

9.7.2 Counterfactuals and Parallel Trends

The difference-in-differences idea is to combine the before vs. after comparison with the treated vs. untreated comparison.

Conceptually, the goal is to construct a counterfactual, like what our city's mean income would have been if there were not a minimum wage increase. Thinking of the potential outcomes framework, the counterfactual is the parallel universe where the treatment never happened.

The key identifying assumption is called parallel trends. Conceptually, in the running example, parallel trends says that without the minimum wage law, our city's mean income would have increased by exactly the same amount as the other city's mean income. Mathematically, with $m(x_1, x_2) \equiv E(Y \mid X_1 = x_1, X_2 = x_2)$, the other city's mean income increase (i.e., "after" minus "before") is

\[
m(0, 1) - m(0, 0) = E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0). \tag{9.31}
\]

Parallel trends assumes that adding this increase to the "before" mean income in our city, $m(1, 0) = E(Y \mid X_1 = 1, X_2 = 0)$, gives us the counterfactual income for our city in the "after" time period.


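Given parallel trends, we can learn about causality by comparing

\[
\overbrace{E(Y \mid X_1 = 1, X_2 = 1)}^{\text{actual (our city, after)}}
\quad \text{vs.} \quad
\overbrace{\underbrace{E(Y \mid X_1 = 1, X_2 = 0)}_{\text{our city, before}} + \underbrace{E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0)}_{\text{increase in other city over time}}}^{\text{counterfactual}},
\tag{9.32}
\]

\[
\overbrace{m(1, 1)}^{\text{actual}} - \overbrace{\{m(1, 0) + [m(0, 1) - m(0, 0)]\}}^{\text{counterfactual}}
= \overbrace{[m(1, 1) - m(1, 0)] - [m(0, 1) - m(0, 0)]}^{\beta_3 \text{ in (9.14)}}.
\]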

Figure 9.1 visualizes this effect. We can think of constructing the counterfactual outcome and then subtracting it from the actual outcome $m(1, 1)$, or we can think of taking the before/after difference for our city, $m(1, 1) - m(1, 0)$, and subtracting off the before/after difference in the other city, $m(0, 1) - m(0, 0)$.

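[Figure 9.1: Difference-in-differences. Horizontal axis: before, after. The "other city" line runs from $m(0,0)$ to $m(0,1)$, an increase of $m(0,1) - m(0,0)$. The "treated" city line runs from $m(1,0)$ to the actual $m(1,1)$; the counterfactual is $m(1,0) + [m(0,1) - m(0,0)]$. Diff-in-diff $= m(1,1) - \{m(1,0) + [m(0,1) - m(0,0)]\} = [m(1,1) - m(1,0)] - [m(0,1) - m(0,0)]$.]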
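The two equivalent ways of computing the diff-in-diff can be sketched in a few lines of R. The four conditional means below are purely illustrative (they match the values used for this chapter's simulated data, not any real city), and the variable names are choices made for this sketch.

```r
# Illustrative conditional means m(x1, x2); assumptions, not real data
m00 <- 10   # other city, before
m01 <- 16   # other city, after
m10 <- 15   # our city, before
m11 <- 25   # our city, after (actual)

# Counterfactual for our city "after": our "before" mean plus the other city's increase
counterfactual <- m10 + (m01 - m00)

# Way 1: actual minus counterfactual
did_a <- m11 - counterfactual
# Way 2: our city's before/after change minus the other city's before/after change
did_b <- (m11 - m10) - (m01 - m00)

print(c(counterfactual = counterfactual, did_a = did_a, did_b = did_b))
```

Both calculations give the same number because they are algebraic rearrangements of the same four means.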

Discussion Question 9.7 (parallel trends skepticism). Consider U.S. state traffic fatality (i.e., car accident death) rates ($Y$), where the year 1980 is "before" ($X_2 = 0$) and 1990 is "after" ($X_2 = 1$). Consider states that adopt a 0.08 blood alcohol content (BAC) limit law sometime between 1980 and 1990 ($X_1 = 1$), and states that never have such a law ($X_1 = 0$). Explain why you might doubt the parallel trends assumption. Hint 1: is a BAC law the only way states try to reduce fatal accidents? Hint 2: this is more difficult than simply thinking of an omitted variable that would cause OVB in a cross-sectional regression, because parallel trends allows certain types of such omitted variables.

9.7.3 Identification

Population Object of Interest: ATT

Most fundamentally, the difference-in-differences approach only learns the average treatment effect for the group that was actually treated (in our universe). This is called the average treatment effect on the treated (ATT) (or sometimes ATTE or ATET). Mathematically, ATE meant $E(Y^1 - Y^0)$, where $Y^1$ and $Y^0$ are the treated and untreated potential outcomes, respectively (previously $Y^T$ and $Y^C$). ATT is the same, but for the subpopulation who was actually treated in our universe. Since $X_1 = 1$ if somebody is actually treated, the ATT is

\[
\mathrm{ATT} \equiv E(Y^1 - Y^0 \mid X_1 = 1). \tag{9.33}
\]

It's possible but uncommon that ATT = ATE. For example, maybe there are different demographics in our city than the comparison city, or different levels of unionization, or different other labor laws, or different industry mix, so the minimum wage effect is different in our city ($X_1 = 1$) than elsewhere. (This is essentially a question of external validity; see Chapter 12.)

Identification of ATT

Parallel trends is sufficient to identify the counterfactual. In potential outcomes notation, "parallel trends" is

\[
\begin{aligned}
& E(Y^0 \mid X_1 = 1, X_2 = 1) - E(Y^0 \mid X_1 = 1, X_2 = 0) \\
&\quad = E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0).
\end{aligned}
\tag{9.34}
\]

That is, the mean untreated potential outcome changes over time ($X_2 = 0$ to $X_2 = 1$) by the same amount in the treated ($X_1 = 1$) and untreated ($X_1 = 0$) groups. The term $E(Y^0 \mid X_1 = 1, X_2 = 1)$ is the counterfactual, like our city's mean wage in the "after" period in the parallel universe where minimum wage never increased. In the other three terms, $Y^0 = Y$, i.e., the untreated $Y^0$ is the observed $Y$. Only when $X_1 = X_2 = 1$ is the treated $Y^1$ observed: $Y = Y^1$. Thus, the counterfactual can be written uniquely in terms of the joint distribution of $(Y, X_1, X_2)$:

\[
\begin{aligned}
E(Y^0 \mid X_1 = 1, X_2 = 1)
&= E(Y^0 \mid X_1 = 1, X_2 = 0) + [E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0)] \\
&= E(Y \mid X_1 = 1, X_2 = 0) + [E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0)] \\
&= m(1, 0) + [m(0, 1) - m(0, 0)].
\end{aligned}
\tag{9.35}
\]

Because the counterfactual is identified, so is the ATT. Specifically, the ATT equals $\beta_3$ in the fully saturated CEF model (9.14):

\[
\begin{aligned}
\mathrm{ATT} &= E(Y^1 - Y^0 \mid X_1 = 1, X_2 = 1) \\
&= \overbrace{E(Y^1 \mid X_1 = 1, X_2 = 1)}^{Y^1 = Y \text{ since } X_1 = 1,\ X_2 = 1} - \overbrace{E(Y^0 \mid X_1 = 1, X_2 = 1)}^{\text{use counterfactual from (9.35)}} \\
&= E(Y \mid X_1 = 1, X_2 = 1) - \overbrace{\{m(1, 0) + [m(0, 1) - m(0, 0)]\}}^{\text{counterfactual}} \\
&= m(1, 1) - \{m(1, 0) + [m(0, 1) - m(0, 0)]\} \\
&= [m(1, 1) - m(1, 0)] - [m(0, 1) - m(0, 0)] \\
&= \beta_3.
\end{aligned}
\]
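The identification argument can be checked in a simulation where both potential outcomes are known, which is never possible with real data. The following R sketch is only an illustration under assumed values: the data-generating process, the heterogeneous effect tau, and all variable names are hypothetical. Parallel trends holds by construction because the untreated outcome has an additive group gap and a common time change.

```r
set.seed(1)
n  <- 40000
X1 <- rep(c(0, 1), each = n / 2)     # 1 = treated group
X2 <- rep(c(0, 1), times = n / 2)    # 1 = "after" period

# Untreated potential outcome: group gap (5) plus common time change (6),
# so the untreated trend is the same in both groups (parallel trends)
Y0  <- 10 + 5 * X1 + 6 * X2 + rnorm(n)
tau <- 4 + rnorm(n, sd = 0.5)        # heterogeneous individual treatment effects
Y1  <- Y0 + tau                      # treated potential outcome

# Observed outcome: treated only for the treated group in the "after" period
Y <- ifelse(X1 == 1 & X2 == 1, Y1, Y0)

# True ATT in the simulated data (uses the unobservable Y0 of the treated)
att_true <- mean((Y1 - Y0)[X1 == 1 & X2 == 1])
# Diff-in-diff estimate: interaction coefficient in the saturated regression
b3 <- unname(coef(lm(Y ~ X1 * X2))["X1:X2"])
print(c(att_true = att_true, did_estimate = b3))
```

The two printed numbers should be close, differing only by sampling error. Changing the Y0 line so that the time change differs by group (for example, adding an X1 * X2 term to Y0) breaks parallel trends, and the estimate then no longer tracks the ATT.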


Skepticism About Parallel Trends

In practice, the parallel trends condition may not hold for various reasons. For example, maybe our city was experiencing fast wage growth, whereas the comparison city was declining (maybe due to reliance on different industries). Maybe our city passed the minimum wage law partly because everybody's wages were increasing anyway. In that case, we can't tell whether our city's wages grew more than the other city's wages because of the minimum wage, or because of other factors (our industries were growing, theirs were declining, etc.).

Parallel trends is also a bit fragile, since nonlinear functions of $Y$ change whether it's true or not. For example, if there are parallel trends when $Y$ is wage, then there are not parallel trends for log-wage $\ln(Y)$. Similarly, if there are parallel log-wage trends, then the wage trends cannot be parallel.
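A tiny numerical example makes the level-versus-log point concrete. The R sketch below uses hypothetical untreated wages and assumes, purely as a simplification, that everyone within a group and period earns the same wage, so the mean of log wage equals the log of the mean wage.

```r
# Hypothetical untreated wages (assumed identical within each group-period cell)
y0_control <- c(before = 10, after = 12)
y0_treated <- c(before = 20, after = 22)

# Trends in levels: both groups change by the same amount (parallel)
trend_level <- c(control = unname(diff(y0_control)),
                 treated = unname(diff(y0_treated)))
# Trends in logs: the same data are not parallel
trend_log <- c(control = unname(diff(log(y0_control))),
               treated = unname(diff(log(y0_treated))))

print(trend_level)
print(round(trend_log, 3))
```

Both groups gain 2 in levels, but the same gain is a larger proportional increase from the lower starting wage, so the log trends differ. (If both level trends were exactly zero, the trends would be parallel in both levels and logs, so this fragility concerns nonzero trends from different starting levels.)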

In the data, you can try to see if parallel trends seems plausible, but it is not directly testable. Specifically, "pre-trend analysis" compares trends for a few periods before the treatment takes place. But even if the trends were parallel before, it does not mean for sure that the trends would have remained parallel after the treatment year. We can never know, because the "trend" refers to the treated group's untreated potential outcomes, which by definition are not observed. So, there is no empirical test that can replace careful, critical thought.

9.7.4 Extensions

There are many interesting extensions of the basic diff-in-diff idea, although all are beyond our scope. For example, there are related models that allow additional regressors, or more time periods, or quantile treatment effects.


9.8 Estimation and Inference

⇒ Kaplan video: Difference-in-Differences Example

Since (9.14) is just a special case of a regression model, standard regression techniques and R functions can be used. For estimation, OLS consistently estimates each $\beta_j$ under fairly general conditions; remember to use sample/survey weights if they are available in the data. The same heteroskedasticity-robust methods from earlier (like Section 7.7.3) can be used to compute confidence intervals if sampling is iid.

The following code shows different R syntax to get the same coefficient estimates, with simulated data. The notation X1:X2 is the interaction term (or in the output, its coefficient). Heteroskedasticity-robust CIs are also reported.

library(sandwich); library(lmtest)
n <- 48
set.seed(112358)
# Conditional means m(x1,x2) used to simulate the data
m00 <- 10; m10 <- 15; m01 <- 16; m11 <- 25
df <- data.frame(X1=c(rep(0,n/2),rep(1,n/2)),
                 X2=rep(rep(0:1,each=n/4),times=2))
df$Y <- c(rep(m00,n/4),rep(m01,n/4),
          rep(m10,n/4),rep(m11,n/4) ) + rnorm(n)
# Three equivalent estimates
ret1 <- lm(Y~X1*X2, data=df)
ret2 <- lm(Y~X1+X2+X1:X2, data=df)
df$Xint <- df$X1*df$X2
ret3 <- lm(Y~X1+X2+Xint, data=df)
TrueBetas <- c(m00,m10-m00,m01-m00,(m11-m01)-(m10-m00))
retmat <- rbind(coef(ret1),coef(ret2),coef(ret3),TrueBetas)
rownames(retmat) <- c('est1','est2','est3','true')
print(round(retmat, digits=2))

     (Intercept)   X1   X2 X1:X2
est1          10 5.17 6.12  3.73
est2          10 5.17 6.12  3.73
est3          10 5.17 6.12  3.73
true          10 5.00 6.00  4.00

# Heteroskedasticity-robust 95% confidence intervals
round(coefci(ret1, vcov.=vcovHC(ret1,type='HC1')),digits=2)

            2.5 % 97.5 %
(Intercept)  9.30  10.77
X1           4.31   6.04
X2           4.97   7.27
X1:X2        2.20   5.27


Empirical Exercises

Empirical Exercise EE9.1. You will analyze data on driving laws and fatal accident rates, originally from Freeman (2007). In particular, you'll compare weekend driving fatality (death) rates for states that adopted a 0.08 blood alcohol content (BAC) law and states that didn't, comparing rates before and after the law adoption. Standard errors can be smaller if the full dataset is used, but such methods are beyond our scope. Either way, the difference-in-differences approach is probably not identifying a treatment effect: states that adopted such laws probably also adopted other ways to discourage drunk driving, whether official laws or just changing cultural norms. This violates the parallel trends assumption.


driving

b. Stata only: load the data with
use http://faculty.missouri.edu/kaplandm/intro_text/driving, clear

c. Keep only years 1980 and 1990.

R: df <- driving[driving$year==1980 | driving$year==1990, ]

Stata: keep if year==1980 | year==1990

d. Create a dummy variable for the "after" period (year 1990).

R: df$after <- (df$year==1990)

Stata: generate after = (year==1990)

e. Create variable bac equal to 1 (or TRUE) if there's any BAC law that year.

R: df$bac <- (df$bac08+df$bac10>=1)

Stata: generate bac = (bac08 + bac10 >= 1)

f. Drop states that already had a BAC law in the "before" period (1980), leaving only states that never had the law or adopted it between 1980 and 1990.

R: dropst <- unique(df$state[!df$after & df$bac]) to get a list of the states to drop, and then remove them with df <- df[!(df$state %in% dropst), ]

Stata:
generate dropflag = (!after & bac)
bysort state : egen dropst = max(dropflag)
drop if dropst


g. Create a treatment dummy equal to 1 for states that adopted a BAC law by 1990.

R: treatst <- unique(df$state[df$bac]) followed by df$treat <- (df$state %in% treatst)

Stata: bysort state : egen treat = max(bac)

h. Run a difference-in-differences regression with the intercept, "after" dummy, treatment dummy, and interaction term. Below, the * in R and the ## in Stata automatically generate the desired interaction term.

R:
ret <- lm(wkndfatrte~treat*after, data=df)
coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
coefci( ret, vcov.=vcovHC(ret, type='HC1'))

Stata: regress wkndfatrte treat##after, vce(robust)

i. To see how the OLS coefficient estimates relate to the conditional means (CEF estimates), compute the sample mean weekend driving fatality rate within each of the four groups defined by the time period and "treatment" status.

R: (agg <- aggregate(wkndfatrte~treat+after, data=df, FUN=mean))

Stata: tabulate treat after, summarize(wkndfatrte) means missing

j. Display the CEF-based replication of the OLS estimates.

R: c(agg[1,3], agg[2,3]-agg[1,3], agg[3,3]-agg[1,3]) for the first three coefficient estimates, and c((agg[4,3]-agg[3,3])-(agg[2,3]-agg[1,3]), (agg[4,3]-agg[2,3])-(agg[3,3]-agg[1,3])) to show both (equivalent) ways to compute the interaction coefficient estimate.

Stata:
collapse (mean) wkndfatrte, by(treat after)
display wkndfatrte[1]
display wkndfatrte[3]-wkndfatrte[1]
display wkndfatrte[2]-wkndfatrte[1]
display (wkndfatrte[4]-wkndfatrte[3])-(wkndfatrte[2]-wkndfatrte[1])

k. Optional: repeat part (h) but with a different outcome variable to replace wkndfatrte, like the weekend fatalities per 100 million miles driven (instead of population), or the total fatality rate (not just weekends), etc.

l. Optional: repeat parts (e)–(h) but replacing your bac treatment variable created in part (e) with a treatment dummy equal to 1 if perse (a different driving law) equals 1 (and equal to 0 otherwise).

Empirical Exercise EE9.2. You will analyze wage data for different types of individuals from the 1976 Current Population Survey (conducted by the U.S. Census Bureau). Specifically, you'll look at dummy variables for nonwhite (race) and female, as well as their interaction. The results are clearly not causal, but the interaction term shows (descriptively) the difference in the white/nonwhite wage gap for females compared to non-females, or (equivalently) the difference in the female/non-female wage gap for nonwhites compared to whites.


wage1

b. Stata only: load the data with bcuse wage1, nodesc clear (assuming bcuse is already installed).

c. Display the group mean wage for the four groups defined by the nonwhite and female dummy variables.

R: (agg <- aggregate(wage~nonwhite+female, data=wage1, FUN=mean))

Stata: tabulate female nonwhite, summarize(wage) means missing

d. Run a "difference-in-differences" type of regression with the intercept, non-white dummy, female dummy, and interaction term.

R:
ret <- lm(wage~nonwhite*female, data=wage1)
coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
coefci( ret, vcov.=vcovHC(ret, type='HC1'))

Stata: regress wage female##nonwhite, vce(robust)

e. Compute the OLS coefficient estimates manually from the four conditional means.

R: store the conditional means with m00 <- agg$wage[1]; m10 <- agg$wage[2]; m01 <- agg$wage[3]; m11 <- agg$wage[4] and show that you can replicate the OLS estimates with rbind(coef(ret), c(m00, m10-m00, m01-m00, (m11-m01)-(m10-m00)) ), and also note that c( (m11-m01) - (m10-m00), (m11-m10) - (m01-m00) ) shows the equivalence of the two interpretations of the interaction term coefficient.

Stata: collapse the dataset to just the four conditional means with collapse (mean) wage, by(female nonwhite) and then display the manually calculated coefficient estimates with
display wage[1]
display wage[3]-wage[1]
display wage[2]-wage[1]
display (wage[4]-wage[3])-(wage[2]-wage[1])
display (wage[4]-wage[2])-(wage[3]-wage[1])

f. Optional: repeat part (d) but using south instead of female.

g. Optional: repeat part (d) again, with any two dummy variables of your choice; you may use one from a previous analysis as long as it is combined with a different dummy. The dataset comes with many dummy variables already, like nonwhite, female, south (and other regions), servocc (and other occupational fields and industries), and married, or you can create your own. For example, you can generate a "more than high school education" dummy with R code wage1$gtHS <- (wage1$educ>12) or Stata command generate gtHS = (educ>12)

Chapter 10

Regression with Multiple Regressors

Depends on: Chapters 8 and 9 (which depend on Chapters 2–4, 6, and 7)



10.2 Assess, in a real-world example, whether there is bias from omitted variables and whether a linear model seems realistic [TLOs 2 and 6]

10.3 Describe and interpret models with multiple regressors, including those in which two variables interact [TLO 3]

10.4 Judge which assumptions seem true and which interpretation seems most appropriate for real-world regressions [TLOs 2 and 6]

10.5 In R (or Stata): estimate a regression with multiple variables, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]



• Hastie, Tibshirani, and Friedman (2009, §§2.3.1, 2.4, 3.1–3.2)

• Linear projection (theory): Hansen (2020, §7)

• Average structural effects and their identification: Hansen (2020, §2.30)

• Regression example (Masten video)

• Perfect multicollinearity (Lambert video)

• Imperfect multicollinearity example (Lambert video)

• Dummy coefficients (Lambert video)

• Dummy interactions (Lambert video)

• Continuous interactions (Lambert video)

• Sections 3.1 ("Multiple Regression in Practice") and 6.1.5 ("Interaction Terms") in Heiss (2016)

• Section 4.4 ("Reporting Regression Results") in Heiss (2016)

• Section 8.3 ("Interactions Between Independent Variables") in Hanck et al. (2018)

Allowing multiple regressors opens a multitude of combinations, especially when combined with nonlinear functions like in Chapter 8. Most of Chapter 10 focuses on the different functional forms themselves, with the different types of flexibility they do (and don't) allow. These discussions apply equally to descriptive, predictive, and causal models.


One motivation for this chapter is that omitted variable bias (OVB, Section 9.1) can still be a problem even if we include two regressors. We may need to include three, or even 10 or 100 regressors to avoid OVB. But even with 100 regressors, OVB can still be a big problem.

Consider OVB with the linear structural model

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + V. \tag{10.1}
\]

For OLS to consistently estimate $\beta_j$ for $j = 1, 2, 3$ (the slope coefficients) requires $\mathrm{Cov}(X_j, V) = 0$ for $j = 1, 2, 3$. Imagine this is true, but $X_3$ is omitted, so

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U, \quad U \equiv \beta_3 X_3 + V. \tag{10.2}
\]

In (10.2), OLS consistency for $\beta_1$ and $\beta_2$ requires $\mathrm{Cov}(X_1, U) = \mathrm{Cov}(X_2, U) = 0$. Since

\[
\mathrm{Cov}(X_j, U) = \beta_3 \mathrm{Cov}(X_j, X_3) + \mathrm{Cov}(X_j, V), \tag{10.3}
\]

this requires either $\beta_3 = 0$ (i.e., $X_3$ is not a causal determinant of $Y$) or else $\mathrm{Cov}(X_1, X_3) = \mathrm{Cov}(X_2, X_3) = 0$.

There are other mathematical formulations, but they all make the point that even including 100 regressors is not sufficient to avoid OVB if there is still an important omitted variable. That is, even if (10.2) becomes

\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \cdots + U, \quad U \equiv \gamma Q + V, \tag{10.4}
\]

then we still have OVB if $\gamma \ne 0$ and any $\mathrm{Cov}(X_j, Q) \ne 0$. That is, there is OVB if both of the following conditions hold.


OVB1′ The omitted variable is correlated with an included regressor in (10.4): $\mathrm{Corr}(X_j, Q) \ne 0$ for some $j$.

OVB2′ The omitted variable $Q$ is a causal determinant of $Y$: in (10.4), $\gamma \ne 0$.
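The two conditions can be illustrated with a simulation in which both hold. In the R sketch below, all coefficient values, the sample size, and the variable names are assumptions chosen for illustration: the omitted variable Q is correlated with X1 (Condition OVB1′) and has coefficient 3 in the structural model (Condition OVB2′).

```r
set.seed(42)
n  <- 100000
Q  <- rnorm(n)                 # omitted variable
X1 <- 0.5 * Q + rnorm(n)       # OVB1': X1 is correlated with Q
V  <- rnorm(n)                 # remaining unobservables, unrelated to X1 and Q
Y  <- 1 + 2 * X1 + 3 * Q + V   # OVB2': Q affects Y (gamma = 3); true beta1 = 2

b_short <- unname(coef(lm(Y ~ X1))["X1"])      # Q omitted
b_long  <- unname(coef(lm(Y ~ X1 + Q))["X1"])  # Q included

# Large-sample value of the short-regression slope:
# beta1 + gamma * Cov(X1, Q) / Var(X1) = 2 + 3 * 0.5 / 1.25
b_short_limit <- 2 + 3 * 0.5 / (0.5^2 + 1)

print(round(c(short = b_short, long = b_long, short_limit = b_short_limit), 2))
```

The short regression's slope is far from the structural value of 2 and close to its large-sample limit, while including Q recovers a slope near 2. Setting either the 0.5 or the 3 to zero in the simulation removes the bias, matching the two conditions.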

Discussion Question 10.1 (OVB with multiple regressors). Consider the example of California schools, where $Y$ is a school's average standardized math test score for 5th-graders, $X_1$ is the 5th-grade student-teacher ratio, and $X_2$ is the percentage of 5th-graders who are English learners (non-native speakers). Judge whether a school's total expenditures per student satisfies each of Conditions OVB1′ and OVB2′ for OVB.


⇒ Kaplan video: Wage Regression Example

10.2.1 Model and Coefficient Interpretation

The linear-in-variables model and discussion from Section 9.2 naturally generalize to non-binary and/or more than two regressors. With $J$ regressors $X_1, X_2, \ldots, X_J$,

\[
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_J X_J + U = \beta_0 + \sum_{j=1}^{J} \beta_j X_j + U \equiv g(X_1, \ldots, X_J) + U. \tag{10.5}
\]

If $U$ is a CEF error, then $g(\cdot)$ represents the CEF. However, the following discussion is essentially the same if $U$ is a linear projection error and $g(\cdot)$ is the linear projection, or if the $\beta_j$ have a causal interpretation.

Regardless of interpretation, the coefficient $\beta_j$ shows how the function $g(\cdot)$ changes when $X_j$ increases by one unit. This is true whether $X_j$ is binary, discrete, or continuous. For example, $X_1$ only appears in the $\beta_1 X_1$ term, so if we change from $X_1 = x_1$ to $X_1 = x_1 + 1$ (unit increase), that term changes from $\beta_1 x_1$ to $\beta_1 (x_1 + 1) = \beta_1 x_1 + \beta_1$, a change of $\beta_1$. That is, for any starting values $X_1 = x_1$, $X_2 = x_2$, etc., a unit increase in $X_1$ changes the function by

\[
\begin{aligned}
& g(x_1 + 1, x_2, \ldots, x_J) - g(x_1, x_2, \ldots, x_J) \\
&\quad = \Big[\beta_0 + \beta_1 (x_1 + 1) + \sum_{j=2}^{J} \beta_j x_j\Big] - \Big[\beta_0 + \beta_1 x_1 + \sum_{j=2}^{J} \beta_j x_j\Big] = \beta_1 (x_1 + 1 - x_1) = \beta_1.
\end{aligned}
\tag{10.6}
\]

For example, if $Y$ is wage in \$/hr and $X_1$ is years of education, and $\beta_1 = (\$5/\text{hr})/\text{yr}$, then each additional year of education is associated with a \$5/hr change in wage, regardless of the initial education level or other variables like experience.


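More generally, if $X_1$ changes by $\Delta_1$ units, then the function's value changes by $\beta_1 \Delta_1$. Regardless of the starting values, if $X_1$ changes from $x_1$ to $x_1 + \Delta_1$, then similar to (10.6),

\[
\begin{aligned}
& g(x_1 + \Delta_1, x_2, \ldots, x_J) - g(x_1, x_2, \ldots, x_J) \\
&\quad = \Big[\beta_0 + \beta_1 (x_1 + \Delta_1) + \sum_{j=2}^{J} \beta_j x_j\Big] - \Big[\beta_0 + \beta_1 x_1 + \sum_{j=2}^{J} \beta_j x_j\Big] = \beta_1 (x_1 + \Delta_1 - x_1) = \beta_1 \Delta_1.
\end{aligned}
\tag{10.7}
\]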

10.2.2 Limitations

While pleasingly simple, these formulas may not be realistic. That is, the change in $Y$ may depend on not only $\Delta_1$ but the starting value $x_1$ or other $x_j$.

For example, let $Y$ be wage, $X_1$ years of experience, and $X_2$ years of education. Due to diminishing marginal benefits, perhaps the first years of experience are associated with bigger increases in mean wage than later years of experience. The wage increase associated with the change from $X_1 = 0$ to $X_1 = 1$ is probably larger than the increase from $X_1 = 40$ to $X_1 = 41$, even though $\Delta_1 = 1$ in both cases. Further, the change from $X_1 = 0$ to $X_1 = 1$ may be associated with a larger wage increase for highly educated individuals (large $X_2$) than for less-educated individuals. Mathematically, the change depending on the starting value of $X_1$ implies some nonlinearity in $X_1$, and the dependence on the value of $X_2$ implies some sort of interaction term(s).

Nonlinear and nonparametric functions of a single variable are discussed in Sections 8.2 and 8.3; interactions are discussed in Sections 9.3 and 10.3. Nonparametric models with multiple regressors are beyond our scope.

10.2.3 Code

The following code shows a simple linear-in-variables regression with simulated data. In the output, the row labeled X1 shows results for the corresponding slope coefficient $\beta_1$. Specifically, the output shows the OLS estimate $\hat\beta_1$ (Estimate), the heteroskedasticity-robust standard error estimate (Std. Error), and a 95% confidence interval (lower endpoint under 2.5 %, upper endpoint under 97.5 %). Similarly for the other regressors and results.

library(sandwich); library(lmtest)
set.seed(112358)
n <- 50
# True CEF used to simulate the data
CEF <- function(x1,x2,x3) 1*x1+2*x2+3*x3
df <- data.frame(X1=runif(n), X2=runif(n), X3=runif(n))
df$Y <- CEF(df$X1, df$X2, df$X3) + rnorm(n)
ret <- lm(Y~X1+X2+X3, data=df)
# Heteroskedasticity-robust variance estimate
retVC1 <- vcovHC(ret, type='HC1')
round(cbind(coeftest(ret, vcov. = retVC1)[,1:2],
            coefci(ret, vcov. = retVC1)), digits=2)

            Estimate Std. Error 2.5 % 97.5 %
(Intercept)    -0.44       0.47 -1.39   0.52
X1              0.75       0.40 -0.06   1.57
X2              1.77       0.49  0.78   2.76
X3              4.04       0.47  3.10   4.99

10.3 Interaction Terms

⇒ Kaplan video: Interaction Model

⇒ Kaplan video: Wage Regression Example (again)

To start, imagine there are two regressors, one of which is binary. To help us remember which is which, let $D$ (for "dummy") be the binary regressor ($D = 1$ or $D = 0$), and $X$ the other regressor. Assume $X$ is the regressor of interest.

10.3.1 Limitation of Linear-in-Variables Model

With a linear-in-variables model,

\[
Y = g(X, D) + U, \quad g(X, D) = \beta_0 + \beta_1 X + \beta_2 D. \tag{10.8}
\]

A unit increase in $X$ always changes the function $g(X, D)$ by $\beta_1$ units, regardless of the starting value of $X$ or the value of $D$. As discussed in Section 10.2, this is often unrealistic.

Since $D$ has only two possible values, we can plug them each into $g(X, D)$:

\begin{align}
g(X, 0) &= \beta_0 + \beta_1 X, \tag{10.9} \\
g(X, 1) &= \beta_0 + \beta_1 X + (\beta_2)(1) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \beta_1 X. \tag{10.10}
\end{align}

These are two functions of $X$: one when $D = 0$, one when $D = 1$. They have the same slope ($\beta_1$) but different intercepts ($\beta_0$ and $\beta_0 + \beta_2$).

10.3.2 Interpretation of Interaction Term

To allow both the intercept and slope to differ between $g(X, 0)$ and $g(X, 1)$, an interaction term can be used, specifically the product $DX$. Mathematically, adding this term to (10.8),

$$g(X, D) = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 DX. \tag{10.11}$$

The function in (10.11) is more general because setting $\beta_3 = 0$ yields (10.8). Given (10.11), instead of (10.9) and (10.10),

$$g(X, 0) = \beta_0 + \beta_1 X + \overbrace{(\beta_2)(0) + (\beta_3)(0)(X)}^{=0} = \beta_0 + \beta_1 X, \tag{10.12}$$

$$g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) + (\beta_3)(1)(X) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + \beta_3)}^{\text{slope}} X. \tag{10.13}$$

Now the slope differs (by $\beta_3$), too. Just as $\beta_2 > 0$, $\beta_2 < 0$, and $\beta_2 = 0$ are all possible, so are $\beta_3 > 0$, $\beta_3 < 0$, and $\beta_3 = 0$.
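One way to check the algebra in (10.12) and (10.13) is to compare the interaction model's OLS estimates with two separate simple regressions, one within each group. The sketch below uses simulated data; the coefficient values, sample size, and variable names are arbitrary assumptions for illustration, not values from the text.

```r
# Hypothetical simulated data (all numbers are illustrative assumptions)
set.seed(1)
n <- 200
df <- data.frame(D = rbinom(n, size = 1, prob = 0.5), X = runif(n))
df$Y <- 1 + 2 * df$X + 3 * df$D + 4 * df$D * df$X + rnorm(n)

# Interaction model as in (10.11); coefficient names are (Intercept), D, X, D:X
full <- coef(lm(Y ~ D * X, data = df))

# Separate simple regressions within each group
fit0 <- coef(lm(Y ~ X, data = df, subset = D == 0))
fit1 <- coef(lm(Y ~ X, data = df, subset = D == 1))

# Group D=0: intercept beta0 and slope beta1, as in (10.12)
line0 <- unname(c(full["(Intercept)"], full["X"]))
# Group D=1: intercept beta0+beta2 and slope beta1+beta3, as in (10.13)
line1 <- unname(c(full["(Intercept)"] + full["D"], full["X"] + full["D:X"]))

print(rbind(interaction_model = line0, separate_fit = unname(fit0)))
print(rbind(interaction_model = line1, separate_fit = unname(fit1)))
```

The two rows in each printed matrix should agree up to numerical rounding, because the interaction model lets each group have its own intercept and slope.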

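[Figure 10.1: Visualization of $\beta_0 + \beta_1 X + \beta_2 D + \beta_3 DX$. Two lines are plotted against $X$: the $D = 0$ line has intercept $\beta_0$ and slope $\beta_1$; the $D = 1$ line has intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$.]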

Figure 10.1 illustrates the interpretation of the function from (10.11). In the figure's example, $\beta_2 > 0$ and $\beta_3 > 0$. Omitting the interaction term is equivalent to assuming $\beta_3 = 0$, in which case the two lines would be parallel (same slope).

If you're interested in $D$, don't only look at $\beta_2$. Rearranging (10.11),

$$g(X, D) = (\beta_0 + \beta_1 X) + D(\beta_2 + \beta_3 X), \tag{10.14}$$

so the slope coefficient on $D$ is $\beta_2 + \beta_3 X$. For example, even if $\beta_2 = -2$, the slope $\beta_2 + \beta_3 X$ is positive if $\beta_3 X > 2$. The opposite is also possible: e.g., if $\beta_2 = 5$, $\beta_3 = -1$, and $X > 5$, then $\beta_2 > 0$ but the slope is negative, $\beta_2 + \beta_3 X < 0$.
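To make the second example concrete, the following sketch evaluates $g(X, 1) - g(X, 0) = \beta_2 + \beta_3 X$ at a few values of $X$, using $\beta_2 = 5$ and $\beta_3 = -1$ from the text. The values of $\beta_0$ and $\beta_1$ and the grid of $X$ values are arbitrary assumptions (they cancel in the difference anyway).

```r
# beta2 and beta3 are from the text's example; beta0 and beta1 are arbitrary
b0 <- 1; b1 <- 2; b2 <- 5; b3 <- -1
g <- function(x, d) b0 + b1 * x + b2 * d + b3 * d * x

x <- c(0, 2, 5, 8)
# Difference between D=1 and D=0 at each X, which equals beta2 + beta3 * X
effect_D <- g(x, 1) - g(x, 0)
print(data.frame(X = x, effect_of_D = effect_D))
```

The difference is positive for $X < 5$, zero at $X = 5$, and negative for $X > 5$, even though $\beta_2 > 0$ throughout.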

Discussion Question 10.2 (sleep and interactions). Let $Y$ be a person's hours of sleep per night, $X$ the person's age, and $D = 1$ if the person lives in the same house as children under 8 years old (and $D = 0$ if not). Consider the model from (10.11).

a) What would you guess for the signs (+, −, or zero) of $\beta_2$ and $\beta_3$? Explain why.

b) Given the same set of regressors $(X, D)$, describe another nonlinear term (i.e., another function of $X$ and/or $D$, besides $X$, $D$, and $DX$) that would improve the CEF estimate, and why you think that term would help.

Discussion Question 10.3 (wage and interactions). Repeat DQ 10.2 but let $Y$ be wage.


10.3.3 Non-Binary Interaction

Discussion Question 10.4 (linear-in-variables). Let $Y$ be log wage, $X_1$ years of education, and $X_2$ years of experience. Consider a possible linear-in-variables CEF model, $\mathrm{E}(Y \mid X_1 = x_1, X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

a) Explain one reason you think this CEF model is misspecified (wrong).

b) How do you think the true CEF (not the misspecified linear-in-variables CEF) slope with respect to experience might differ for different values of education? (Hint: draw a graph with different lines like $\mathrm{E}(Y \mid X_1 = 12, X_2 = x_2)$, where you fix the $X_1$ value and then graph the CEF as a function of only $x_2$.)

Even if neither regressor were binary, an interaction term allows the slopes to depend on the other regressor's value. Replacing $X$ with $X_1$ and $D$ with $X_2$ in (10.11),

$$g(X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2. \tag{10.15}$$

Consider the slope of $g(X_1, X_2)$ with respect to $X_1$ at different values of $X_2$. Generally, rearranging (10.15) as a function of $X_1$,

$$g(X_1, X_2) = \overbrace{(\beta_0 + \beta_2 X_2)}^{\text{intercept}} + \overbrace{(\beta_1 + X_2 \beta_3)}^{\text{slope}} X_1. \tag{10.16}$$

Plugging values $X_2 = a$ and $X_2 = b$ into (10.16), similar to (10.12) and (10.13),

$$g(X_1, a) = \overbrace{(\beta_0 + a\beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + a\beta_3)}^{\text{slope}} X_1, \tag{10.17}$$

$$g(X_1, b) = \overbrace{(\beta_0 + b\beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + b\beta_3)}^{\text{slope}} X_1. \tag{10.18}$$

Changing $X_2$ from $a$ to $b$ changes the intercept from $\beta_0 + a\beta_2$ to $\beta_0 + b\beta_2$, and it changes the slope from $\beta_1 + a\beta_3$ to $\beta_1 + b\beta_3$. Alternatively, we could plug in $X_1 = a$ and $X_1 = b$ and consider $g(a, X_2)$ and $g(b, X_2)$ as functions of $X_2$, where again both the intercept and slope may change.

Don't get fooled by looking at $\beta_1$ alone (a common mistake). For example, imagine $X_1$ is experience, $X_2$ is years of education, and $Y$ is wage ($/hr). Imagine

$$Y = g(X_1, X_2) = 5 - 15X_1 + 2X_2 + 2X_1X_2, \tag{10.19}$$

i.e., $\beta_0 = 5$, $\beta_1 = -15$, $\beta_2 = 2$, and $\beta_3 = 2$. Superficially, $\beta_1 = -15$ seems like a negative relationship between experience ($X_1$) and wage: it looks like more experience is associated with much lower wage. However, the interaction term affects the slope with respect to $X_1$. Using (10.16), that slope is $\beta_1 + \beta_3 X_2 = 2X_2 - 15$. If everyone in the data has at least 10 years of education, then $X_2 \ge 10$, so $2X_2 - 15 \ge (2)(10) - 15 = 5$: the slope with respect to $X_1$ is always positive. Even though $\beta_1 < 0$, $g(X_1, X_2)$ is always increasing in $X_1$ for any possible $X_2 \ge 10$.
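The slope calculation for (10.19) can be tabulated directly. This is a small sketch using the coefficients from (10.19); the range of education values (10 to 20 years) is an assumption for illustration.

```r
# Slope of g(X1, X2) with respect to X1 from (10.16): beta1 + beta3 * X2
slope_X1 <- function(x2, b1 = -15, b3 = 2) b1 + b3 * x2

x2 <- 10:20            # assumed range of years of education in the data
slopes <- slope_X1(x2)
print(rbind(X2 = x2, slope_wrt_X1 = slopes))
all(slopes > 0)        # TRUE means the slope is positive at every X2 in the range
```

The smallest slope occurs at $X_2 = 10$, matching the bound of 5 derived above.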

This interaction model is more general than the linear-in-variables model, but not fully general. For example, imagine $Y$ is wage, $X_1$ is education, and $X_2$ is experience. Maybe the slope with respect to $X_2$ should be increasing a lot with $X_1$ when $X_1$ is around 12 or 16, but less so around $X_1 = 20$ (or maybe more so). This type of nonlinearity in the interaction is not allowed by simply including $X_1 X_2$. There are nonlinear and nonparametric models to address such situations, but details are beyond our scope.

10.3.4 Code

The following code illustrates estimation, heteroskedasticity-robust inference, and prediction with a model including an interaction term. In R formula syntax, the term D:X is the same as including the interaction term $DX$ like in (10.11). Alternatively, D*X includes both linear and interaction terms, i.e., it is equivalent to D+X+D:X. So both estimation models below are identical to (10.11).

library(sandwich); library(lmtest)
set.seed(112358)
n <- 50
# True CEF used to simulate the data
CEF <- function(d,x) 2+3*d+4*x+5*x*d
df <- data.frame(X=runif(n), D=sample(x=0:1,size=n,replace=TRUE))
df$Y <- CEF(df$D, df$X) + rnorm(n)
# Equivalent estimates:
ret <- lm(Y~D*X, data=df)
ret2 <- lm(Y~D+X+D:X, data=df)
# Heteroskedasticity-robust (HC1) variance-covariance matrix
retVC <- vcovHC(ret, type='HC1')
round(cbind(coeftest(ret, vcov. = retVC)[,1:2],
            coefci(ret, vcov. = retVC)), digits=2)

            Estimate Std. Error 2.5 % 97.5 %
(Intercept)     2.00       0.39  1.21   2.80
D               2.69       0.51  1.67   3.71
X               3.86       0.65  2.54   5.17
D:X             5.66       0.85  3.96   7.36

predict(ret newdata=dataframe(X=c(128)D=10))

1 2 1189 328

10.4 Other Examples

⇒ Kaplan video: Wage Regression Example (again, again)

Models can get very complex with multiple regressors. We could have more than 2 regressors, we could have many nonlinear functions of each regressor by itself, and we could have many interactions. For example, even if we only have 5 regressors, there are 10 pairs of regressors (like $X_1$ and $X_4$, $X_2$ and $X_3$, etc.), and each pair may have multiple interaction terms (i.e., not just $X_1 X_4$ but also $X_1 X_4^2$ or something). With each regressor by itself, we may have multiple nonlinear terms. There could be 40 or 50 terms in our regression, just from 5 original regressors. Even if all 5 are binary, the fully saturated model requires $2^5 = 32$ parameters.
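The two counts in this paragraph can be checked in R. The sketch below assumes five binary regressors with hypothetical names X1 through X5, and uses `model.matrix()` only to count the columns of the fully saturated design (intercept, main effects, and all interactions).

```r
# Number of distinct pairs among 5 regressors
n_pairs <- choose(5, 2)

# All combinations of five binary regressors (hypothetical names X1..X5)
grid <- expand.grid(X1 = 0:1, X2 = 0:1, X3 = 0:1, X4 = 0:1, X5 = 0:1)

# Fully saturated model: intercept, main effects, and every interaction
mm <- model.matrix(~ X1 * X2 * X3 * X4 * X5, data = grid)
n_params <- ncol(mm)

c(pairs = n_pairs, saturated_parameters = n_params)
```

The saturated count is 1 + 5 + 10 + 10 + 5 + 1 = 32: one intercept, then all products of one, two, three, four, and five regressors.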

With such complicated models, it is better to look at predicted changes using the full model instead of looking at individual coefficients. This is done in R with the predict() function.
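As a sketch of this advice, the code below fits a model with a squared term and an interaction, then uses predict() to get the predicted change in $Y$ when $X_1$ rises from 2 to 3 with $X_2$ held at 5. The data-generating process, variable names, and evaluation points are all assumptions for illustration.

```r
# Hypothetical simulated data (coefficients chosen arbitrarily)
set.seed(2)
n <- 300
df <- data.frame(X1 = runif(n, 0, 10), X2 = runif(n, 0, 10))
df$Y <- 1 + 0.5 * df$X1 - 0.03 * df$X1^2 + 0.2 * df$X2 +
  0.05 * df$X1 * df$X2 + rnorm(n)

# I() is needed so that ^ means "squared" rather than formula crossing
fit <- lm(Y ~ X1 + I(X1^2) + X2 + X1:X2, data = df)

# Predicted change when X1 goes from 2 to 3, holding X2 = 5
new <- data.frame(X1 = c(2, 3), X2 = c(5, 5))
pred <- predict(fit, newdata = new)
change <- unname(pred[2] - pred[1])

# The same change assembled by hand needs three different coefficients
b <- coef(fit)
by_hand <- unname(b["X1"] * (3 - 2) + b["I(X1^2)"] * (3^2 - 2^2) +
                    b["X1:X2"] * (3 - 2) * 5)
c(predict = change, by_hand = by_hand)
```

Both numbers agree, but the predict() version does not require remembering which coefficients involve $X_1$, which is the practical point when there are dozens of terms.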

10.5 Assumptions for Linear Projection

Below are formal assumptions sufficient for consistency and asymptotic normality of the OLS estimator of the linear projection coefficients. Asymptotic normality in turn justifies confidence intervals (and p-values), which should be approximately correct in large samples (large $n$). These are relatively weak assumptions. However, stronger assumptions are required to interpret the linear projection as a CEF or structural model.

The assumptions are basically the same as in Section 7.7.2, with one exception (perfect multicollinearity). Like before, iid sampling is sufficient but not necessary: OLS consistently estimates the linear projection coefficients with various types of dependent data and complex sampling designs, and the estimators remain asymptotically normal, although the standard errors are different.

10.5.1 Multicollinearity (Two Types)

The one new assumption is that there cannot be perfect multicollinearity. This essentially says redundant regressors are not allowed. For example, if $X_3 = X_1 + X_2$, then $X_3$ is a linear function of other regressors, so we cannot include all of $X_1$, $X_2$, and $X_3$. Remember that the intercept term can be seen as the coefficient on regressor $X_0 = 1$. So if we had $X_1 = 1$ for females and $X_2 = 1$ for non-females, then $X_1 + X_2 = 1 = X_0$, which means perfect multicollinearity.

Something nice about perfect multicollinearity is that computers can check it for us. If you try to run a regression with perfect multicollinearity, R will simply report NA for coefficients of the "redundant" regressors (without warning or error). Other statistical packages may give you a warning or error.
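Both examples of perfect multicollinearity above can be tried in R. This is a minimal sketch with simulated data; the variable names (`female`, `nonfemale`) and coefficient values are hypothetical.

```r
# Hypothetical simulated data
set.seed(3)
n <- 100
df <- data.frame(X1 = runif(n), X2 = runif(n))
df$X3 <- df$X1 + df$X2          # redundant: exact linear function of X1 and X2
df$Y <- 1 + 2 * df$X1 + 3 * df$X2 + rnorm(n)

fit_a <- lm(Y ~ X1 + X2 + X3, data = df)
print(coef(fit_a))              # coefficient on X3 is reported as NA

# Dummy version: female + nonfemale = 1, the same as the intercept's regressor
df$female <- rbinom(n, 1, 0.5)
df$nonfemale <- 1 - df$female
fit_b <- lm(Y ~ female + nonfemale, data = df)
print(coef(fit_b))              # coefficient on nonfemale is reported as NA
```

R drops the last redundant regressor it encounters, so which coefficient shows NA depends on the order of terms in the formula.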

For prediction, redundant variables don't help, so dropping them is fine. For causality, we are unable to distinguish the separate effects among redundant variables. But if they are merely "control variables," then we do not care.


A related concept is imperfect multicollinearity. This refers to regressors being strongly correlated, but not perfectly correlated (i.e., not completely redundant).

This makes it more difficult to learn about the slope coefficients on the highly correlated regressors, but it does not invalidate any results on identification, estimation, or inference. "More difficult" means confidence intervals can be large. This makes sense: if regressors $X_1$ and $X_2$ are highly correlated, and we observe that $Y$ is high when $X_1$ and $X_2$ are high, it's unclear whether $Y$ is high because $X_1$ is high or because $X_2$ is high. Since they are highly correlated, there are few observations where only $X_1$ or $X_2$ (not both) is high, to help distinguish the effect of $X_1$ from that of $X_2$. This is similar to the logic behind omitted variable bias, except we can see the ghost. With prediction, it may be best to include only $X_1$ or $X_2$ (not both), but standard model selection procedures can handle this without any special consideration. (But if you have a job interview and sense that your interviewer thinks imperfect multicollinearity is really important for some reason, just go with it.)
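A small simulation can show how imperfect multicollinearity widens confidence intervals. In the sketch below, the helper `se_x1()` is hypothetical, the correlation values 0 and 0.99 are arbitrary, and for simplicity it uses the default (homoskedasticity-based) standard error from summary(), which is appropriate here only because the simulated errors are homoskedastic.

```r
set.seed(4)
n <- 200

# Standard error of the X1 coefficient when Corr(X1, X2) is about rho
se_x1 <- function(rho) {
  X1 <- rnorm(n)
  X2 <- rho * X1 + sqrt(1 - rho^2) * rnorm(n)  # population correlation rho
  Y <- 1 + X1 + X2 + rnorm(n)
  summary(lm(Y ~ X1 + X2))$coefficients["X1", "Std. Error"]
}

se_low <- se_x1(0)       # uncorrelated regressors
se_high <- se_x1(0.99)   # highly (but not perfectly) correlated regressors
c(rho_0 = se_low, rho_0.99 = se_high)
```

The standard error is much larger with highly correlated regressors, so the confidence interval for the $X_1$ slope is much wider, even though the estimator is still consistent.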

10.5.2 Formal Assumptions and Results

The assumptions and results refer to the linear projection model

$$\mathrm{LP}(Y \mid X_1, \ldots, X_J) = \beta_0 + \beta_1 X_1 + \cdots + \beta_J X_J. \tag{10.20}$$

The $X_j$ may include nonlinear functions of an original set of regressors. For example, if $X$ and $D$ are observed regressors, then the model could have $X_1 = D$, $X_2 = X$, and $X_3 = DX$. It could also include $X_4 = X^2$, etc.

Assumption A10.1. Sampling of $(Y_i, X_{1i}, \ldots, X_{Ji})$ is iid from the population joint distribution of $(Y, X_1, \ldots, X_J)$.

Assumption A10.2. There is no perfect multicollinearity. That is, no $X_j$ is a linear combination of other regressors (including the intercept). Equivalently, the only constants $c_j$ ($j = 0, \ldots, J$) that make $0 = c_0 + \sum_{j=1}^{J} c_j X_j$ are $c_0 = c_1 = \cdots = c_J = 0$.

Assumption A10.3. The variances of $Y$ and all $X_j$ are finite: $\mathrm{Var}(Y) < \infty$, $\mathrm{Var}(X_j) < \infty$ for $j = 1, \ldots, J$. Or equivalently, the expected values of $Y^2$ and $X_j^2$ (i.e., second moments) are finite: $\mathrm{E}(Y^2) < \infty$, $\mathrm{E}(X_j^2) < \infty$ for $j = 1, \ldots, J$.

Assumption A10.4. The expected values of $Y^4$ and $X_j^4$ (i.e., fourth moments) are finite: $\mathrm{E}(Y^4) < \infty$, $\mathrm{E}(X_j^4) < \infty$ for $j = 1, \ldots, J$.

Theorem 10.1 (OLS consistency). Given the linear projection model in (10.20), if A10.1–A10.3 are true, then $\hat\beta_j \xrightarrow{p} \beta_j$ for $j = 0, 1, \ldots, J$.

Theorem 10.2 (OLS approximate normality). Given the linear projection model in (10.20), if A10.1, A10.2, and A10.4 are true, then the OLS coefficient estimators are asymptotically normal, i.e., with large $n$, approximately $\hat\beta_j \sim \mathrm{N}(\beta_j, \mathrm{SE}_j^2)$, where the true standard error $\mathrm{SE}_j$ can be estimated and is proportional to $1/\sqrt{n}$.

In more formal mathematical econometrics, Theorem 10.2 is written in terms of the distribution of $\sqrt{n}(\hat\beta_j - \beta_j)$ instead of $\hat\beta_j$. This can be confusing. (It is helpful mathematically, but for reasons beyond our scope.)

Theorem 10.3 (coverage probability, multiple regressors). If A10.1, A10.2, and A10.4 are true, then heteroskedasticity-robust confidence intervals are asymptotically correct. That is, with large enough $n$, the coverage probability is approximately equal to the desired confidence level.
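The coverage claim can be explored by simulation. The sketch below is only an illustration under assumed settings: the data-generating process, $n = 200$, and 2000 replications are arbitrary choices, and `hc1_vcov()` is a hand-written base-R helper for the HC1 formula (standing in for `vcovHC(..., type='HC1')` from the sandwich package). It also uses a normal critical value rather than the t critical value.

```r
set.seed(5)

# HC1 robust variance matrix computed by hand for an lm fit (base R only)
hc1_vcov <- function(fit) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  n <- nrow(X); k <- ncol(X)
  bread <- solve(crossprod(X))        # (X'X)^{-1}
  meat <- crossprod(X * e)            # sum over i of e_i^2 * x_i x_i'
  (n / (n - k)) * bread %*% meat %*% bread
}

# One simulated dataset: does the 95% robust CI for the X1 slope cover the truth?
one_rep <- function(n, beta1 = 2) {
  X1 <- runif(n); X2 <- runif(n)
  U <- rnorm(n, sd = 0.5 + X1)        # heteroskedastic error
  Y <- 1 + beta1 * X1 - 1 * X2 + U
  fit <- lm(Y ~ X1 + X2)
  se <- sqrt(hc1_vcov(fit)["X1", "X1"])
  ci <- unname(coef(fit)["X1"]) + c(-1, 1) * qnorm(0.975) * se
  ci[1] <= beta1 && beta1 <= ci[2]
}

coverage <- mean(replicate(2000, one_rep(n = 200)))
coverage   # proportion of simulated CIs that contain the true slope
```

If the theorem's approximation is good at this sample size, the printed proportion should be close to the nominal 0.95; trying a much smaller `n` is one way to see the approximation get worse.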

10.6 Structural Identification

There are many identification results in which there is a causal interpretation for something OLS can estimate. Here are a few.


Imagine the structural model is

$$Y = \beta_0 + \sum_{j=1}^{J} \beta_j X_j + U. \tag{10.21}$$

It is possible that some $X_j$ are nonlinear functions of regressors, including interaction terms. If $\mathrm{Cov}(U, X_j) = 0$ for all $j = 1, \ldots, J$, then the structural $\beta_j$ are also linear projection coefficients, which OLS can estimate. That is, if all terms $X_j$ in the regression are "exogenous" in the sense of uncorrelated with $U$, then OLS can consistently estimate all the structural slope coefficients.

10.6.2 General Structural Model

As alluded to in Section 9.5, a conditional average structural effect can be identified under conditional independence (A9.1). Consider the structural model

$$Y = h(X_1, X_2, \ldots, X_J, U), \tag{10.22}$$

where $U = (U_1, U_2, \ldots)$ is a vector containing all the unobserved determinants of $Y$ besides the $X_j$. The structural effect of a one-unit increase in $X_1$ (hence the 1 subscript on $s_1$) may depend on the initial value of $X_1$ as well as values of other variables and $U$:

$$s_1(X_1, X_2, \ldots, X_J, U) \equiv h(X_1 + 1, X_2, \ldots, X_J, U) - h(X_1, X_2, \ldots, X_J, U). \tag{10.23}$$

Fixing the $X_j$ but averaging over $U$, the conditional average structural effect (CASE) of $X_1$ (hence the 1 subscript on $\mathrm{CASE}_1$) is

$$\mathrm{CASE}_1(x_1, \ldots, x_J) \equiv \mathrm{E}[s_1(X_1, \ldots, X_J, U) \mid X_1 = x_1, \ldots, X_J = x_J]. \tag{10.24}$$

Under the conditional independence assumption (CIA, Assumption A9.1), this structural object is identified:

$$\mathrm{CASE}_1(x_1, \ldots, x_J) = \mathrm{E}[Y \mid X_1 = x_1 + 1, X_2 = x_2, \ldots, X_J = x_J] - \mathrm{E}[Y \mid X_1 = x_1, \ldots, X_J = x_J]. \tag{10.25}$$

Thus, if we can guess the correct CEF functional form, then we can estimate CASEs by OLS. First, OLS can consistently estimate the LP. Second, given the correct CEF specification, we can interpret the LP as the CEF. Third, we can estimate CEF differences using the estimated LP coefficients; e.g., we can use the predict() function in R. Fourth, given the identification results, we can interpret these CEF differences as causal effects (CASEs).

However, if we do not guess the CEF's functional form correctly (i.e., our model is misspecified), then the second and third steps fail. That is, even if the CASEs are identified, we may fail to estimate them if we don't estimate the CEF correctly. One response to this disappointing fact is to use nonparametric CEF models, but nonparametric regression with multiple regressors is beyond our scope.

10.6.3 Conditional ATE

As alluded to in Section 9.5, a conditional average treatment effect (CATE) can be identified under conditional independence (A9.1). Here, $X_1$ is the binary treatment variable. Given Assumption A9.1 (and SUTVA and overlap), the CATE equals a CEF difference:

$$\mathrm{E}[Y^T - Y^U \mid X_2 = x_2, \ldots, X_J = x_J] = \mathrm{E}[Y \mid X_1 = 1, X_2 = x_2, \ldots, X_J = x_J] - \mathrm{E}[Y \mid X_1 = 0, X_2 = x_2, \ldots, X_J = x_J]. \tag{10.26}$$

Intuitively, conditional independence says that within the subpopulation defined by $(X_2, \ldots, X_J)$, treatment is "as good as random," so comparing mean treated and untreated observed outcomes (within the subpopulation) has a causal interpretation.
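The intuition behind (10.26) can be illustrated with a simulation in which treatment is random only after conditioning on a single binary control. Everything in this sketch is an assumption for illustration: one control variable `X2`, treatment probabilities of 0.8 and 0.2, and a true treatment effect of 2 for everyone.

```r
set.seed(6)
n <- 20000
X2 <- rbinom(n, 1, 0.5)                        # observed binary control
# Treatment is more likely when X2 = 1, but is random given X2
X1 <- rbinom(n, 1, ifelse(X2 == 1, 0.8, 0.2))
Y <- 1 + 2 * X1 + 3 * X2 + rnorm(n)            # assumed true effect: 2 for everyone

# Unconditional comparison of treated and untreated means (ignores X2)
naive <- mean(Y[X1 == 1]) - mean(Y[X1 == 0])

# CEF difference within each subpopulation defined by X2, as in (10.26)
cate <- sapply(0:1, function(x2) {
  mean(Y[X1 == 1 & X2 == x2]) - mean(Y[X1 == 0 & X2 == x2])
})
round(c(naive = naive, CATE_X2_0 = cate[1], CATE_X2_1 = cate[2]), 2)
```

The within-subpopulation differences should be near the assumed effect of 2, while the naive difference is much larger because treated individuals more often have $X_2 = 1$.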

For estimation, as with the CASE, we either need to know the CEF's functional form (and use OLS) or use nonparametric estimation techniques.


Empirical Exercises

Empirical Exercise EE10.1. You will analyze data collected from Botswana's 1988 Demographic and Health Survey by James Heakins for an economics term project. In particular, you'll see how the number of living children a woman (in Botswana) has relates to various other variables, with particular interest in the woman's years of education. You'll start with a simple regression of children on educ that shows an economically significant negative coefficient. Then, you'll see how this coefficient changes (generally moving toward zero) as you add other regressors as control variables, like the husband's education (heduc) and the woman's age (age). These changes in the estimated coefficient suggest omitted variable bias in the original simple regression. But even with a large number of control variable regressors, there is probably still omitted variable bias.


fertil2

b. Stata only: load the data with bcuse fertil2 , nodesc clear (assuming bcuse is already installed)

c. Run a simple regression of children on educ.

R: ret1 <- lm(children~educ, data=fertil2)

Stata: regress children educ , vce(robust)

d. Repeat but adding heduc as a control variable regressor.

R: ret2 <- lm(children~educ+heduc, data=fertil2)

Stata: regress children educ heduc , vce(robust)

e. Repeat but adding yet another regressor (woman's age).

R: ret3 <- lm(children~educ+heduc+age, data=fertil2)

Stata: regress children educ heduc age , vce(robust)

f. Repeat but add even more regressors (in addition to educ, heduc, and age): agesq, knowmeth, usemeth, electric, urban, and catholic, as well as interactions between age and knowmeth, and between age and usemeth.

R: store the result as ret4, and you can simply write knowmeth:age and usemeth:age in the regression formula to generate the interactions.

Stata: first create the two interaction variables, like with generate know_age = knowmeth*age, and then add those new variables in your list of regressors.

g. R only (since already displayed by Stata): output the four sets of estimated regression coefficients with

coef(ret1); coef(ret2); coef(ret3); coef(ret4)

h. Optional: repeat one more time with whichever regressors (in addition to educ) you think appropriate; feel free to create additional interaction terms and/or nonlinear terms (like age^3, etc.).

Empirical Exercise EE10.2. You will analyze data originally from Harrison and Rubinfeld (1978), including housing prices and pollution measures. The data are not for individual houses, but instead small areas (census tracts, I'd guess) within which the median housing price is computed, along with other characteristics that may affect housing prices, including pollution. You'll start with a simple regression of log price on log nox (the pollution measure). The coefficient is around −1, meaning a 1% increase in pollution is associated with (approximately) a 1% decrease in price. Then, you'll add other regressors to try to reduce omitted variable bias. By adding just a couple variables, the pollution coefficient estimate's magnitude is cut in half, suggesting that there was indeed much OVB. However, even with a large number of regressors, serious OVB may remain.


hprice2

b. Stata only: load the data with bcuse hprice2 , nodesc clear (assuming bcuse is already installed)

c. Run a simple log-log regression of price on nox.

R: ret1 <- lm(log(price)~log(nox), data=hprice2)

Stata: regress lprice lnox , vce(robust)

d. Repeat but adding rooms as a control variable regressor.

R: ret2 <- lm(log(price)~log(nox)+rooms, data=hprice2)

Stata: regress lprice lnox rooms , vce(robust)

e. Repeat but adding yet another regressor (crime rate per capita).

R: ret3 <- lm(log(price)~log(nox)+rooms+crime, data=hprice2)

Stata: regress lprice lnox rooms crime , vce(robust)

f. Repeat but add even more regressors: dist, radial, stratio, and lowstat. Store the result as ret4 in R.




h. Optional: repeat one more time with whichever regressors you think appropriate; try to use interaction terms and/or nonlinear terms (like rooms^2, etc.).


Chapter 11

Midterm Exam 2


When I teach this class, the second midterm exam is this week. This "chapter" makes the chapter numbers match the week of the semester. This midterm covers all chapters between the first midterm and now. It does not explicitly include questions about the material before the first midterm exam, but of course that material was foundational for the material covered on the new exam, so it may (or may not) still help to review it.



Chapter 12

Internal and External Validity


Depends on: Chapter 7 (which depends on Chapters 2–4 and 6); for deeper understanding, also Chapters 8 and 10



12.2 Assess possible problems with regression results and their application to real-world questions (of description, prediction, and causality), and the likely direction of bias [TLOs 5 and 6]

12.3 In R (or Stata): check datasets for possible issues like missing data [TLO 7]


• Sample selection from survey non-response (Masten video)

• External validity (Masten video)

• Missing data approaches (Masten video)

• Reverse causality and simultaneity (Masten video)

• Reverse causality example: violence (Lambert video)

• Reverse causality example: HDI (Lambert video)

• Greater external validity for "structural" results (Masten video)

• Sections 9.2 ("Measurement Error") and 9.3 ("Missing Data and Nonrandom Samples") in Heiss (2016)

• Chapter 22 ("Missing Data") in Kaplan (2020)

• Chapter 9 ("Assessing Studies Based on Multiple Regression") and Section 13.2 ("Threats to Validity of Experiments") in Hanck et al. (2018)

This chapter discusses many reasons to worry about the validity of econometric results and their application to decisions. Like statistical and economic significance, "validity" is better thought of as a continuum rather than a yes/no property. To tweak Box's aphorism: "All results are invalid, but some are useful."

12.1 Terminology

An econometric study has internal validity if the methods are appropriate for the study's setting and sample, i.e., if all the identifying assumptions and other assumptions hold.

An econometric study has external validity for a different setting if the results can be used to learn something about the new setting. This does not mean the values or overall effects are identical, but rather that the prediction model or structural model is the same. Ideally, the estimated model can be applied successfully to new variable values and distributions.

The population studied refers to the population from which the data was sampled, whereas the population of interest is the one that you (as the researcher, policy maker, or decision maker) want to learn about.

For example, you may see an econometric study of the causal effect of a minimum wage change from $4.25/hr to $5.05/hr in New Jersey in 1992, but your job is to advise Missouri about a possible minimum wage increase next year. The study is internally valid if it properly estimates the causal effect of the New Jersey minimum wage increase on people in New Jersey in 1992, i.e., for the population studied. It is externally valid if the estimates can be used to learn about the (potential) policy effects in Missouri, your population of interest. Again, this doesn't mean the effect next year in Missouri must be identical to the effect in 1992 New Jersey, but that the model estimated with the 1992 New Jersey data can be applied to current Missouri data to learn the potential policy effect.

This chapter briefly discusses many threats to validity, i.e., reasons an analysis may not be internally or externally valid.

12.2 Threats to External Validity

Threats to external validity are generally more obvious than threats to internal validity, but they harm evidence-based decisions just as much. For example, consider the descriptive task of estimating the median house price in Missouri. Obviously, a sample of house prices from California (which is much more expensive) does not help. Even with the price of every house in California, we learn little about Missouri. We can try to learn about relationships between price and house features (size, land area, etc.) in California, but probably even such relationships themselves differ in Missouri. This problem is a lack of external validity.

The Lucas critique (Lucas 1976; also Section 4.3.3) can also be interpreted in terms of external validity. When macroeconomic policy changes, that fundamentally changes the setting. Even if our estimates from historical data have internal validity, they might not be accurate in the new setting under the new policy, i.e., they might not have external validity.

A few common threats to external validity are now discussed. Although these don't automatically imply lack of validity, they are reasons for skepticism.

12.2.1 Different Place

Different places have different legal, political, cultural, and economic settings. The house price example highlights just one of many important differences between California and Missouri. Ideally, you can always find an empirical study from the same place you're interested in. If not, you have to decide whether you think the other place is similar enough to still help you make a good decision.

For example, imagine you need to quantify costs and benefits of expanding public bus systems in Missouri. The neighboring states of Oklahoma and Illinois recently collected data during their (hypothetical) bus system expansions, with different results. Although both states are very close geographically to Missouri, other characteristics matter, too. Illinois has almost double the state gas tax of Missouri, and its urban population share is over 15 percentage points greater than Missouri's; both of these may be important for both people's decision to ride the bus (versus drive) and the cost of bus operation. In contrast, Oklahoma's gas tax and urban population share are very similar to Missouri's, so there is probably greater external validity. Still, there may be other important differences between Missouri and Oklahoma, some of which may be difficult to measure accurately or quantify, like cultural attitudes.

12.2.2 Different Time

Even in the same place, time changes the legal, political, cultural, and economic setting.

For example, consider again the median house price in Missouri, which is also the (unconditional) best prediction under absolute loss. Having learned not to use California data, we get Missouri data, but from the year 1975. This is also bad, since house prices were much lower in 1975 than today. Adjusting for inflation would help some, but the housing market supply and demand have both changed substantially, even basics like how many people live in Missouri and houses' size, age, and quality. We could try to use all these variables in a model, but some may not have data available, and the model itself may have changed since 1975.

What if we had Missouri data from two years ago: is that close enough? One year? One month? It depends how quickly things are changing, and on the decision you need to make. In normal conditions, the median house price does not change more than a few thousand dollars each month. That difference may not matter much for some decisions, like trying to find another state that's similar to Missouri. But that difference may be too big for other decisions, like a high-frequency investment strategy. Or, if instead of "normal conditions" there was just a financial crisis or a new law passed, the month-old data may be off by more than just a few thousand dollars.

Alternatively, sometimes we can model how variables change over time, in order to predict how they have changed since the data sample was collected; see Chapters 14 and 15.

12.2.3 Different Population

Even in the same place at the same time, the population studied may differ from your population of interest.

For example, to guide tax incentives for first-time homebuyers in Missouri this year, you want to estimate the median first-year mortgage payment for first-time homebuyers. If you find a study estimating the median mortgage payment among all home owners in Missouri this year, then your number will be much too big because the studied population (all owners) differs greatly from the population of interest (first-time owners), even though they're in the same place (Missouri) at the same time (this year).

As another example, imagine you're estimating the benefits of expanding government subsidies for college in the U.S. You (amazingly) find an internally valid estimate of the mean wage increase from college, in the same place (U.S.), from just a few months ago. However, the estimate is for the whole U.S. population (the population studied), including individuals who already got college degrees even without the additional subsidy. Instead, your population of interest is individuals who currently do not (or cannot) choose to graduate from college, but who would with the additional subsidy. Such individuals may not have the same causal effect of college on their wages.

Discussion Question 12.1 (external validity: minimum wage). You're deciding whether to vote for a minimum wage increase in your state or country (yes, you; wherever you live or vote right now), from $10/hr to $15/hr (or an equivalent increase in your country's currency). You find a study (Card and Krueger 1994) of effects of a minimum wage increase from $4.25/hr to $5.05/hr in New Jersey in 1992. Explain your specific concerns about external validity. (Note: this is only a question about external validity; arguments about whether minimum wage should be lower or higher are completely irrelevant.)

12.3 Threats to Internal Validity

For description and prediction, see Items 1–5 in the list below. For causality, the following common threats to internal validity are described below.

1. Functional form misspecification (Section 12.3.1)

2. Measurement error (Sections 12.3.2 and 12.3.3)

3. Non-iid sampling and weights (Section 12.3.4)

4. Missing data (Section 12.3.5)

5. Sample selection (Section 12.3.6)

6. Omitted variables (Section 12.3.7)

7. Simultaneity and reverse causality (Section 12.3.8)

Additionally, violation of SUTVA (as discussed earlier) is another threat to internal validity for treatment effect analysis.

12.3.1 Functional Form Misspecification

Misspecifying the functional form leads to inconsistent estimates of the CEF. This is bad for description, prediction, and causality alike. Details are in Chapters 7–10, including reasons for misspecification, ways to address it, and interpretations of what OLS estimates when it's not a CEF.

12.3.2 Measurement Error in the Outcome Variable

⇒ Kaplan video: Measurement Error

Without good data, it's hard to get valid econometric results. As they say, "Garbage in, garbage out." But it is not as simple as "good" and "bad" data. Certain data problems can safely be ignored; others can't be ignored, but they can be fixed; and yet other problems cannot be fixed by any amount of econometrics magic.

Sometimes the true value of a variable is not the value seen in the data. This is especially true in survey data, where individuals (or firms, schools, etc.) report their own information ("self-reported"), and with macroeconomic variables that are difficult to measure accurately. With survey data, people may simply forget the exact value, or they may intentionally lie in some cases.

To define some notation and terms, consider the example of exercise. In a survey, people are asked how many minutes of exercise they did last week, and their responses $Y$ are recorded in the data. Let $Y^*$ be how much exercise somebody truly did last week. This $Y^*$ is the latent (unobserved) true value. In contrast, the observed value is $Y = Y^* + M$, where $M$ is the measurement error. That is, $M = Y - Y^*$ is the difference between the observed and true values.

All of $(Y^*, Y, M)$ are uppercase to show they're random variables. For example, one individual could have true $Y^* = 98.52$ and report $Y = 100$, so $M = 100 - 98.52 = 1.48$, whereas another individual could have $Y^* = 271$, report $Y = 250$, and have $M = 250 - 271 = -21$, where all values are in units of "exercise minutes per week." There are different values of $(Y^*, Y, M)$ for different individuals; the population distribution describes the probabilities of these different possible values.

Discussion Question 12.2 (exercise error). Consider the example where $Y^*$ is true exercise minutes last week, and $Y$ is the value somebody reports. Explain one reason (each) why an individual could have a) $M = 0$, b) $M < 0$, or c) $M > 0$. Overall, would you guess $\mathrm{E}(M) = 0$, $\mathrm{E}(M) < 0$, or $\mathrm{E}(M) > 0$? Why? (Hint: you'll probably need to make additional assumptions and definitions, e.g., what does "exercise" mean?)

Regression

Imagine the true linear projection in error form is

$$Y^* = \beta_0 + \beta_1 X + V, \qquad \mathrm{E}(V) = \mathrm{Cov}(X, V) = 0. \tag{12.1}$$

We want to learn $\beta_1$. Substituting in $Y^* = Y - M$,

$$Y - M = \beta_0 + \beta_1 X + V,$$

$$Y = \beta_0 + \beta_1 X + \overbrace{(V + M)}^{U} = \beta_0 + \beta_1 X + U. \tag{12.2}$$

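The OLS estimator $\hat\beta_1$ may be asymptotically biased if $X$ and $M$ are related, because then $X$ and $U$ are related. From (9.5), the asymptotic bias is

$$\operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1 = \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(X, V + M)}{\mathrm{Var}(X)} = \frac{\overbrace{\mathrm{Cov}(X, V)}^{=0} + \mathrm{Cov}(X, M)}{\mathrm{Var}(X)}. \tag{12.3}$$

This is the slope coefficient in $\mathrm{LP}(M \mid 1, X)$, the linear projection of the measurement error onto the regressor (and an intercept). That is, writing $\mathrm{LP}(M \mid 1, X) = \gamma_0 + \gamma_1 X$, the asymptotic bias is $\operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1 = \gamma_1$.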

With binary $X$, $\gamma_1$ is a mean difference as in (6.29), so (12.3) is equivalent to

$$\operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1 = \mathrm{E}(M \mid X = 1) - \mathrm{E}(M \mid X = 0). \tag{12.4}$$

This helps us think about both the sign (direction) and magnitude of asymptotic bias. For example, if there tends to be more positive measurement error when $X = 1$ than when $X = 0$, then $\gamma_1 > 0$, so the OLS estimator $\hat\beta_1$ has positive asymptotic bias: $\operatorname*{plim}_{n\to\infty} \hat\beta_1 > \beta_1$.


In Sum: Measurement Error in the Outcome

$Y$: observed; $Y^*$: latent/true; $M$: measurement error

$Y = Y^* + M \iff M = Y - Y^*$

Binary $X$: $\hat\beta_1$ asymptotic bias is $\mathrm{E}(M \mid X = 1) - \mathrm{E}(M \mid X = 0)$; see (12.4)

General $X$: $\hat\beta_1$ asymptotic bias is $\gamma_1$ in $\mathrm{LP}(M \mid 1, X) = \gamma_0 + \gamma_1 X$; see (12.3)
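As a rough check on the binary-regressor bias formula, the following R sketch simulates survey responses with made-up numbers. The sample size, the 40% share with $X = 1$, the coefficients, and the 20-minute average overreport when $X = 1$ are all illustrative assumptions, not values from the text.

```r
set.seed(1)
n <- 10000
X <- rbinom(n, 1, 0.4)                      # hypothetical binary regressor
Ystar <- 100 + 50 * X + rnorm(n, sd = 30)   # latent true outcome Y*
M <- rnorm(n, mean = 20 * X, sd = 10)       # measurement error: positive on average only if X = 1
Y <- Ystar + M                              # observed (reported) outcome

b_true <- coef(lm(Ystar ~ X))[["X"]]        # slope we would get if Y* were observed
b_obs  <- coef(lm(Y ~ X))[["X"]]            # slope using the observed Y
g1     <- coef(lm(M ~ X))[["X"]]            # slope from projecting M on X
mean_diff <- mean(M[X == 1]) - mean(M[X == 0])

print(c(b_true = b_true, b_obs = b_obs, gap = b_obs - b_true,
        g1 = g1, mean_diff = mean_diff))
```

In this sketch the gap between the two estimated slopes equals the slope from regressing $M$ on $X$, which in turn equals the difference in the group means of $M$. This mirrors the population result, here in sample form.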

Example

Continuing with $Y^*$ as weekly exercise, let $X = 1$ if somebody has a gym membership and $X = 0$ otherwise. The goal is to learn $\beta_1$ in $\mathrm{LP}(Y^* \mid 1, X) = \beta_0 + \beta_1 X$. Since $X$ is binary, $\beta_1$ is also the mean difference $\mathrm{E}(Y^* \mid X = 1) - \mathrm{E}(Y^* \mid X = 0)$. Also due to binary $X$, the slope in $\mathrm{LP}(M \mid 1, X) = \gamma_0 + \gamma_1 X$ is the same as the mean difference $\gamma_1 = \mathrm{E}(M \mid X = 1) - \mathrm{E}(M \mid X = 0)$.

There's no asymptotic bias in a few cases. Obviously, if $M = 0$ for everybody, then $Y = Y^*$, so regressing $Y$ on $X$ is identical to regressing $Y^*$ on $X$. Even if everyone overreports ($\mathrm{E}(M) > 0$) or underreports ($\mathrm{E}(M) < 0$), as long as it's the same for both gym members and non-members, then $\gamma_1 = 0$, so there is no asymptotic bias. It's also fine if $\mathrm{E}(M \mid X = 0) = \mathrm{E}(M \mid X = 1) = 0$ but $\mathrm{Var}(M \mid X = 1) < \mathrm{Var}(M \mid X = 0)$, i.e., the gym members report more accurately (smaller variance of $M$; in the extreme, even $M = 0$), but both groups are accurate on average.

However, there is asymptotic bias if there's systematic overreporting by only gym members. Maybe gym members are more likely to feel guilty about not exercising and not using their membership, which may cause them to report going to the gym and exercising more than they actually do. Or conversely, perhaps individuals who think they exercise more than they do (and thus have large $M$) are more likely to become gym members because they think it'll be worth it. Either way, more positive $M$ (overreporting) is associated with $X = 1$ compared to $X = 0$, i.e., $\gamma_1 > 0$. This leads to positive (upward) asymptotic bias of $\hat\beta_1$.

Figure 12.1 illustrates the upward bias of $\hat\beta_1$ in the gym/exercise example. The $X = 0$ group does not report perfectly, but there is no systematic reporting bias. The $X = 1$ group systematically overreports exercise. Consequently, the red line's slope (using observed $Y$) is much larger than the black line's slope (using true but unobserved $Y^*$). That is, if we could observe $Y^*$, we would estimate the black line, but we can't, and using the observed $Y$ yields a very different (biased) estimate of the slope $\beta_1$.

Alternatively, maybe non-gym members tend to have larger $M$. Maybe gym members only report gym time, whereas non-members include walking the dog, lifting groceries, etc. In that case, $\mathrm{E}(M \mid X = 0) > \mathrm{E}(M \mid X = 1)$, so $\gamma_1 < 0$ and there's negative asymptotic bias.

[Figure 12.1: Bias from measurement error in Y. Horizontal axis: X (gym membership), 0 or 1; vertical axis: exercise; one fitted line uses observed $Y$ and the other uses true $Y^*$.]

Discussion Question 12.3 (measurement error: scrap rate). Imagine the government wants to help increase the efficiency of chalk manufacturing firms. Specifically, $Y^*$ is a firm's "scrap rate": what proportion of their output has to be "scrapped" (trashed/not sold) due to manufacturing defects. For example, $Y^* = 0.04$ means a 4% scrap rate. The government randomly assigns firms to a control group and a treatment group to run an experiment. On January 1, the treated firms receive grant money, which they are supposed to use to improve efficiency. All firms self-report their scrap rates on December 31; this is $Y$.

a) Describe a reason why treated firms might systematically overreport ($M > 0$) or underreport ($M < 0$) their scrap rates.

b) In that case, and assuming untreated firms report accurately ($M = 0$), would we overestimate or underestimate the treatment effect of a grant? Why?

c) If the government uses these incorrect estimates to decide whether or not to continue the program, what incorrect decision might they make? Why?

Methods to Address Measurement Error

In some cases, there are methods to reduce or eliminate the bias from measurement error. However, such methods often have additional requirements, like a second measurement of the same variable, and they are beyond our scope.

12.3.3 Measurement Error in the Regressors

There are similarities between measurement error in $X$ and measurement error in $Y$. Much of the math is similar. The causes of measurement error are the same, since a variable may be the $Y$ variable in one model but the $X$ variable in another.

To see how measurement error might cause asymptotic bias, equations like (12.1) and (12.2) can be derived. The true LP with latent $X^*$ is

$$Y = \beta_0 + \beta_1 X^* + R, \qquad \mathrm{E}(R) = \mathrm{Cov}(X^*, R) = 0. \tag{12.5}$$

Since the observed $X$ is $X = X^* + M$, substituting in $X^* = X - M$,

$$Y = \beta_0 + \beta_1 (X - M) + R = \beta_0 + \beta_1 X + (R - \beta_1 M). \tag{12.6}$$

Like (12.3), the asymptotic bias is

$$\operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1 = \frac{\mathrm{Cov}(X, R - \beta_1 M)}{\mathrm{Var}(X)},$$

so the asymptotic bias is zero if and only if $\mathrm{Cov}(X, R - \beta_1 M) = 0$, i.e., if the observed $X$ is uncorrelated with the unobserved "error term" $R - \beta_1 M$. Using (12.5) and linearity,

$$\begin{aligned}
\mathrm{Cov}(X, R - \beta_1 M) &= \mathrm{Cov}(X, R) - \mathrm{Cov}(X, \beta_1 M) \\
&= \mathrm{Cov}(X^* + M, R) - \beta_1 \mathrm{Cov}(X, M) \\
&= \overbrace{\mathrm{Cov}(X^*, R)}^{=0} + \mathrm{Cov}(M, R) - \beta_1 \mathrm{Cov}(X, M).
\end{aligned}$$

If $M$ is uncorrelated with the LP error $R = Y - \beta_0 - \beta_1 X^*$, and if $\beta_1 = 0$ (which means $Y$ and the true $X^*$ are not correlated), then this is zero. Otherwise, there is almost certainly asymptotic bias, in particular when $\mathrm{Cov}(X, M) \neq 0$.

Attenuation Bias Assumptions and Result

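Unfortunately, $\mathrm{Cov}(X, M) = 0$ is very unlikely. Consider what seems to be the best-case scenario: $M$ is just random noise unrelated to the true value $X^*$, so $\mathrm{Cov}(X^*, M) = 0$. Unfortunately, using $\mathrm{Cov}(X^*, M) = 0$,

$$\mathrm{Cov}(X, M) = \mathrm{Cov}(X^* + M, M) = \overbrace{\mathrm{Cov}(X^*, M)}^{=0} + \overbrace{\mathrm{Cov}(M, M)}^{=\mathrm{Var}(M)} = \mathrm{Var}(M). \tag{12.7}$$

Assuming not everybody has $M = 0$, then $\mathrm{Var}(M) > 0$, so $\mathrm{Cov}(X, M) > 0$. Thus, even if $\mathrm{Cov}(M, R) = 0$, the asymptotic bias is not zero (provided $\beta_1 \neq 0$): $-\beta_1 \mathrm{Cov}(X, M) \neq 0$.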

In this case with $\mathrm{Cov}(X, M) > 0$ and $\mathrm{Cov}(M, R) = 0$, the resulting bias is called attenuation bias. This means that the estimates $\hat\beta_1$ tend to be in between 0 and $\beta_1$: $0 < \operatorname{plim} \hat\beta_1 / \beta_1 < 1$, implying $|\operatorname{plim} \hat\beta_1| < |\beta_1|$. That is, the estimates are systematically pushed closer to zero by the measurement error. This is different from positive (upward) bias, which tends to make $\hat\beta_1 > \beta_1$, or negative (downward) bias, which tends to make $\hat\beta_1 < \beta_1$. With attenuation bias, if $\beta_1 > 0$, then generally $0 < \hat\beta_1 < \beta_1$, whereas if $\beta_1 < 0$, then generally $0 > \hat\beta_1 > \beta_1$.
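The claim that the probability limit lies between 0 and $\beta_1$ follows from one more step of algebra. The following is a sketch under the assumptions of this case, $\mathrm{Cov}(X^*, M) = 0$ and $\mathrm{Cov}(M, R) = 0$, plus the additional assumptions $\mathrm{Var}(X^*) > 0$ and $\beta_1 \neq 0$. It uses $\mathrm{Var}(X) = \mathrm{Var}(X^* + M) = \mathrm{Var}(X^*) + \mathrm{Var}(M)$.

```latex
\begin{align*}
\operatorname*{plim}_{n\to\infty} \hat\beta_1 - \beta_1
  &= \frac{\mathrm{Cov}(M, R) - \beta_1 \mathrm{Cov}(X, M)}{\mathrm{Var}(X)}
   = -\beta_1 \frac{\mathrm{Var}(M)}{\mathrm{Var}(X^*) + \mathrm{Var}(M)}, \\
\operatorname*{plim}_{n\to\infty} \hat\beta_1
  &= \beta_1 \left( 1 - \frac{\mathrm{Var}(M)}{\mathrm{Var}(X^*) + \mathrm{Var}(M)} \right)
   = \beta_1 \frac{\mathrm{Var}(X^*)}{\mathrm{Var}(X^*) + \mathrm{Var}(M)} .
\end{align*}
```

The multiplying fraction is strictly between 0 and 1 whenever $\mathrm{Var}(M) > 0$, so the more noise there is relative to the variation in $X^*$, the closer to zero the estimates tend to be.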

Even if we cannot fix the attenuation bias, it is helpful to know the direction of the bias. For example, if we estimated $\hat\beta_1 = 7$ and we suspect attenuation bias, then we may think $\beta_1$ might be even larger, but probably not smaller.


[Figure 12.2: Bias from measurement error in X. Panel title: "Attenuation Bias Example"; horizontal axis: $X$ or $X^*$ (0.0 to 3.0); vertical axis: $Y$; one fitted line uses $X^*$ and the other uses $X$.]

Figure 12.2 illustrates attenuation bias. It shows a simple example where $\mathrm{P}(X^* = 1) = \mathrm{P}(X^* = 2) = 0.5$ and $Y = X^*$ (no error term). The linear projection is just the line through $(X^*, Y) = (1, 1)$ and $(2, 2)$, which has $\beta_0 = 0$ and $\beta_1 = 1$ (intercept zero, slope one). Then, imagine adding error: $\mathrm{P}(M = -1) = \mathrm{P}(M = 1) = 0.5$, regardless of $X^*$ or $Y$. Then, the $X^* = 1$ values become $X = X^* + M$: either $X = 1 - 1 = 0$ or $X = 1 + 1 = 2$. Similarly, the $X^* = 2$ values become either $X = 2 - 1 = 1$ or $X = 2 + 1 = 3$. Now we have four possible values of $(X, Y)$, each with equal 0.25 probability: $(0, 1)$, $(2, 1)$, $(1, 2)$, and $(3, 2)$, forming a parallelogram. Since $\mathrm{Cov}(X, Y) = 0.25$ and $\mathrm{Var}(X) = 1.25$, the result is $\mathrm{LP}(Y \mid 1, X) = 1.2 + 0.2X$ (slope is $1/5$), very different from $\mathrm{LP}(Y \mid 1, X^*) = X^*$ (slope is 1). That is, when we add horizontal noise (errors in $X$), the slope of the linear projection $\mathrm{LP}(Y \mid 1, X)$ is flatter (closer to zero) than the slope of $\mathrm{LP}(Y \mid 1, X^*)$.
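The linear projection in this four-point example can be computed directly. Because the four $(X, Y)$ points are equally likely, OLS on those four points gives the population linear projection exactly. The sketch below also recomputes the slope as $\beta_1 \mathrm{Var}(X^*) / [\mathrm{Var}(X^*) + \mathrm{Var}(M)]$. The helper `pop_var` is a name introduced here for illustration only.

```r
# Four equally likely (X, Y) points from the example: X = X* + M and Y = X*
pts <- data.frame(X = c(0, 2, 1, 3), Y = c(1, 1, 2, 2))

# With equal probabilities, OLS on these points equals the population LP(Y | 1, X)
lp_obs <- coef(lm(Y ~ X, data = pts))

# Same slope from the variances, using population (divide-by-N) variances
pop_var <- function(v) mean((v - mean(v))^2)
var_xstar <- pop_var(c(1, 2))     # Var(X*) = 0.25
var_m     <- pop_var(c(-1, 1))    # Var(M)  = 1
slope_formula <- 1 * var_xstar / (var_xstar + var_m)   # true slope beta1 = 1

print(lp_obs)          # intercept 1.2, slope 0.2
print(slope_formula)   # 0.2
```

Both calculations give a slope of 0.2 with an intercept of 1.2, compared with the slope of 1 when the true $X^*$ is used.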

General Bias

Unfortunately, outside this very special case, the type of bias may differ. It is not necessarily attenuation bias.

In particular, if $\mathrm{Cov}(M, R) \neq 0$ and $|\mathrm{Cov}(M, R)| > |\beta_1 \mathrm{Cov}(X, M)|$, then the sign of the bias is the sign of $\mathrm{Cov}(M, R)$, i.e., positive bias if $\mathrm{Cov}(M, R) > 0$ or negative bias if $\mathrm{Cov}(M, R) < 0$. So generally, any type of asymptotic bias is possible, depending how the measurement error is related to other variables.

There are methods that address measurement error in $X$, but these are beyond our scope.

12.3.4 Non-iid Sampling and Survey Weights

As advised in Section 3.4.3, if your dataset has survey weights (a.k.a. sampling weights), then you should use them. Most statistical estimation functions in R allow such weights. It's true that in some cases you don't actually need to use weights, but it's safer to just always use them.

Sampling may be non-iid for reasons other than weights. Clustered and/or stratified sampling can cause non-iid sampling, as discussed in Section 3.5. Time series data also usually lack iid sampling; see Part III.

Generally, these types of non-iid sampling do not affect consistency of estimators, but they often cause incorrect standard errors and confidence intervals. That is, constructing an asymptotic 95% CI based on iid sampling may produce an interval with only 90% coverage probability, or even 80% or 50% or lower. Similarly, p-values may tend to be too small, or hypothesis test type I error rates may be much larger than desired.

Thankfully, valid (consistent) standard error estimators exist in almost all these cases. However, it can get complicated. For now, just be aware of when sampling is non-iid.
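To make the standard error problem concrete, here is a small base-R sketch with simulated clustered data. The setup is entirely hypothetical: 50 equal-sized clusters of 20 observations, each sharing a common cluster-level component. It compares the usual iid standard error of the sample mean with one that treats the clusters, rather than the individuals, as the independent units. The latter is a simple special case that works here because the clusters are the same size.

```r
set.seed(2021)
G <- 50                                   # hypothetical number of clusters
m <- 20                                   # observations per cluster
cluster <- rep(seq_len(G), each = m)
alpha <- rnorm(G)                         # component shared by everyone in a cluster
Y <- alpha[cluster] + rnorm(G * m)        # values within a cluster are correlated

se_iid <- sd(Y) / sqrt(length(Y))         # appropriate only under iid sampling
cluster_means <- tapply(Y, cluster, mean)
se_cluster <- sd(cluster_means) / sqrt(G) # treats the G clusters as independent draws

print(c(se_iid = se_iid, se_cluster = se_cluster))
```

In this design the iid formula gives a standard error several times too small, so a confidence interval built from it would be far too narrow.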

The following code shows unweighted and weighted results using simulated data. Without worrying about the details, some patterns are clear. First, the weighted and unweighted estimates are noticeably different; it is important to use weights in estimation. Second, the unweighted and weighted SEs also differ noticeably. Third, although the weighted estimates are identical, the three weighted standard errors are different. Judging which is most appropriate requires understanding different types of weights and is beyond our scope.

    library(survey); library(sandwich); library(lmtest)
    set.seed(112358)
    n <- 20
    # simulated data with made-up survey weights
    dat <- data.frame(X=rnorm(n))
    dat$Y <- 1 + 2*dat$X + 3*dat$X^2 + rnorm(n)
    dat$wgt <- 100*runif(n)+10
    dsgn <- svydesign(ids=~1, weights=~wgt, data=dat)
    # unweighted mean, survey-weighted mean, weighted mean "by hand"
    c(mean(dat$Y), svymean(dat$Y, dsgn), sum(dat$Y*dat$wgt)/sum(dat$wgt))

    [1] 4.25 4.28 4.28

    retun <- lm(Y~X, data=dat, weights=NULL)   # unweighted
    retw <- lm(Y~X, data=dat, weights=wgt)     # weighted
    m1 <- c('lm.unwgtd', 'lm.wgtd', 'svyglm.wgtd', 'lm.unwgtd', 'lm.wgtd')
    m2 <- c('lm.un', 'lm.w', 'svyglm.w', 'coeftest.un', 'coeftest.w')
    out <- data.frame(est.method=m1, SE.method=m2, est=NA, SE=NA)
    # row 2 of the coefficient table is X; columns 1:2 are estimate and SE
    out[1,3:4] <- summary(retun)$coefficients[2,1:2]
    out[2,3:4] <- summary(retw)$coefficients[2,1:2]
    out[3,3:4] <- summary(svyglm(Y~X, dsgn))$coefficients[2,1:2]
    # heteroskedasticity-robust (HC1) variance estimates
    vcu1 <- vcovHC(retun, type='HC1')
    vcw1 <- vcovHC(retw, type='HC1')
    out[4,3:4] <- coeftest(retun, vcov.=vcu1)['X',1:2]
    out[5,3:4] <- coeftest(retw, vcov.=vcw1)['X',1:2]
    print(out)

       est.method   SE.method  est    SE
    1   lm.unwgtd       lm.un 2.07 0.776
    2     lm.wgtd        lm.w 1.55 0.824
    3 svyglm.wgtd    svyglm.w 1.55 1.426
    4   lm.unwgtd coeftest.un 2.07 1.259
    5     lm.wgtd  coeftest.w 1.55 1.465

12.3.5 Missing Data

⇒ Kaplan video: Bias from Non-Ignorable Missing Data

Like with measurement error in $Y$ (Section 12.3.2), the reason why there is missing data determines whether or not it's a problem. As we saw with measurement error in $Y$, if the error is completely random (independent of $X$), then it will not bias linear projection slope estimates. Similarly, if data is missing completely at random (like a cat walked across your computer keyboard or something), then it's fine to just drop observations with missing data and proceed as usual. This is called complete case analysis, where a complete case is an observation in which no values are missing (i.e., all values are observed). For example, if the dataset is $(Y_i, X_i)$ for $i = 1, \ldots, n$, then the complete cases are the $i$ for which both $Y_i$ and $X_i$ are observed (not missing).

In other cases, we can't ignore the missing data problem, but there are methods that can fix the problem and avoid asymptotic bias.

In yet other cases, it is very difficult to address the missing data problem. In particular, when the value of $Y$ affects whether or not data are missing, it is very difficult. For example, if $Y$ is income, and people with high (or low) income tend not to report their income on a survey, then regression estimates will be biased.

Figure 12.3 shows an example of missingness related to $Y$. Here, $Y$ is income, and $X = 1$ if an individual has a college degree, $X = 0$ if not. In the example, the highest-income individuals do not report $Y_i$, but everyone else does. This mostly affects $X_i = 1$ individuals, but also the very highest $Y_i$ in the no-college group. If we just run OLS on observations with both $Y_i$ and $X_i$ observed, then both the OLS slope and sample mean are biased downward. The OLS intercept is very slightly downward biased, too, since the top $Y_i$ when $X_i = 0$ are missing.
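The pattern described for this figure can be reproduced with simulated data. The R sketch below uses made-up numbers throughout: income in thousands of dollars, a 30-unit college premium, and non-reporting of any income above 80. It compares the full-data results (which would not be available in practice) with complete case analysis.

```r
set.seed(123)
n <- 5000
X <- rbinom(n, 1, 0.5)                    # 1 = college degree (hypothetical)
Y <- 40 + 30 * X + rnorm(n, sd = 15)      # income in $1000s (hypothetical)
observed <- Y <= 80                       # highest incomes are not reported

full <- coef(lm(Y ~ X))                   # infeasible: uses everyone's true Y
cc   <- coef(lm(Y ~ X, subset = observed))  # complete case analysis

print(rbind(full          = c(full, mean = mean(Y)),
            complete_case = c(cc,   mean = mean(Y[observed]))))
```

With these assumed numbers, the complete-case slope and sample mean are both clearly below their full-data counterparts, while the intercept is only slightly lower because few $X = 0$ individuals exceed the cutoff.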

Discussion Question 12.4 (program attrition). Consider a job training program like the federally funded Job Training Partnership Act (JTPA) of 1982. Each eligible individual was randomly assigned to either take the job training or not. You want to estimate the average treatment effect on annual income ($Y$) of being assigned to the training (the "intention-to-treat" effect from Section 4.6.3). However, some individuals' data is missing because they moved to a different state to take a high-paying job. Explain why this could be a threat to internal validity, and in which direction you think the resulting bias might be.


[Figure 12.3: Non-ignorable missing data: bias of both OLS and sample mean. Left panel: "By College Degree" (0 or 1); right panel: "Combined Data Mean"; vertical axis: earnings; legend: all data versus observed data.]

Discussion Question 12.5 (missing salary data). You get data on a sample of professors from research universities in the U.S., which is the population of interest. However, you only find salary data for public universities, not private.

a) How does this bias your estimate of the population mean salary? Why?

b) How does this bias your regression of salary on a dummy for being a professor in a STEM field? Why? (Hint: consider the intercept and slope separately.)

12.3.6 Sample Selection

⇒ Kaplan video: Sample Selection Bias

Whereas missing data means some values are missing in the dataset, sample selection means entire individuals (observations) are missing. Whereas missing values are indicated by NA in R, there may be no indication that entire individuals are missing. The number of missing individuals may be unknown.

As with missing data, the reason behind the sample selection is crucial for whether it results in sample selection bias. For example, if individuals are "selected" into the sample at random (unrelated to their $Y_i$ or $X_i$), then it's just like we're taking a random sample of a random sample, so we can just proceed as normal. However, if individuals are selected into the sample based on their $Y_i$, then OLS (and other estimators) can be very biased.

For example, similar to Figure 12.3, imagine $Y$ is wage, and individuals with high wage are less likely to take a survey at all. If our dataset only shows individuals who did take the survey, then sample selection bias is likely. The picture is basically the same as Figure 12.3, just that the "missing" data points are now entirely unobserved (those $i$ are not even in our sample). This particular example describes non-response bias, a common problem for surveys: people who actually answer the survey are not representative of the population of interest, differing in important ways from people who do not answer the survey.

An important economic example of sample selection is that wages are only observed for currently employed individuals. We may want to learn what determines the wage an individual is offered by a firm. However, if the wage a firm is willing to pay is below the individual's reservation wage, or a legal minimum wage, then the individual won't or can't take the offer. But if they don't work, then we can't observe the hypothetical wage. This was the motivation for the famous approach to correct for sample selection due to Heckman (1979).

Methods to address sample selection bias are beyond our scope, but you can at least try to think critically about whether sample selection bias might be an issue in real-world examples.

12.3.7 Omitted Variable Bias and Collider Bias

Omitted variable bias is discussed in Sections 9.1 and 10.1. It is very common with observational economic data: many variables are (cor)related in economics, and many important ones are difficult to measure (human capital, technology, marginal cost, etc.). If they are actually observed in the data, then they can just be included, although recall that including colliders actually makes bias worse (Section 9.6). If not, then other methods can be used under certain specific conditions. For example, difference-in-differences (Section 9.7) allows certain types of omitted variables. Other estimators with panel data (observations for the same unit $i$ over multiple time periods) also allow certain types of omitted variables, like those that do not change over time. However, these and yet other estimators that address omitted variable bias are beyond our scope.

12.3.8 Simultaneity and Reverse Causality

When we regress $Y$ on $X$, we often (perhaps subconsciously) assume that $X$ may have a causal effect on $Y$, but that $Y$ does not have an effect on $X$. However, sometimes in reality $Y$ affects $X$, too. This is called reverse causality or simultaneous causality.

The issue of simultaneity is basically the same (and often synonymous) but emphasizes that it is not necessarily a direct causal effect of $Y$ on $X$, just that $X$ and $Y$ are determined by the same system at the same time (simultaneously). Economic systems are often complex, where conditions "determine" the values of multiple variables at the same time. For example, supply and demand curves simultaneously determine the equilibrium market price and quantity. Rather than trying to say price affects quantity and quantity affects price (simultaneous causality), it's more precise to say that price and quantity are determined simultaneously by the same economic system (simultaneity).

Because economists often study systems with complex interactions among many variables, and with observational data, simultaneity and reverse causality are common.

For example, one question economists have studied is the effect of police officers per capita $X$ on crime rate $Y$ in a city. (Note: as with other examples like minimum wage and right-to-work laws, this has nothing to do with "good" or "bad," but only how simplistic econometric analysis can fail to have a causal interpretation.) Of course, it is possible that the density of police has a causal effect on crime rate. But it is also possible that crime rate $Y$ has a causal effect on $X$ through policy decisions. That is, imagine you are in charge of the city's decision of how many police officers to have. Aside from budget constraints, one of the biggest factors in your decision is probably the city's crime rate. If the city has a very low crime rate, then you would probably not decide to spend more to hire more police officers; in fact, you may decide to have fewer and spend the savings on other city needs. However, if the city's crime rate is very high, then you would seriously consider hiring more police officers. That is, your decision about $X$ is determined partly by $Y$.

With simultaneity or reverse causality, OLS regression of $Y$ on $X$ does not consistently estimate structural or treatment effects. In the police example, even if there were zero effect of $X$ on $Y$, the response of $X$ to $Y$ would cause positive correlation between $X$ and $Y$ (cities with more crime would have more police), i.e., OLS estimates a positive slope that falsely suggests a positive effect.
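The police example can be mimicked with a tiny simulation. In the R sketch below, all numbers are invented for illustration: crime is generated without any dependence on police, while the number of police responds to crime. The variable names `crime` and `police` are placeholders for $Y$ and $X$.

```r
set.seed(7)
n <- 5000                                         # hypothetical number of cities
crime  <- rnorm(n, mean = 10, sd = 1)             # Y: generated with NO effect of police
police <- 2 + 0.5 * crime + rnorm(n, sd = 0.5)    # X: cities hire more police when crime is high

slope <- coef(lm(crime ~ police))[["police"]]     # OLS slope from regressing Y on X
print(slope)
```

The estimated slope is clearly positive even though, by construction, police have zero effect on crime in this simulated world.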

There are methods like instrumental variables that can (sometimes) solve the problem of simultaneity or reverse causality, but they are beyond our scope. For now, you can just try to think critically about whether or not simultaneity or reverse causality is a problem in real-world examples.

Discussion Question 12.6 (health and medical expenditure). You want to learn the causal effect of how much an individual spends on medical insurance and care ($X$, dollars per year) on health ($Y$; higher value means healthier).

a) Explain why a regression of $Y$ on $X$ would not estimate this causal effect.

b) Would the regression slope be higher or lower than the causal effect? Why?


Empirical Exercises

Empirical Exercise EE12.1. You will analyze data from Rouse (1998) on a "school voucher" program in Milwaukee, Wisconsin. As Rouse (1998) explains, "In 1990, Wisconsin began providing vouchers to a small number of low-income students to attend nonsectarian private schools." Wooldridge notes that many observations with missing data have already been dropped, so there is sample selection. He also notes you can use variable mnce90 to try to control for this, but mnce90 is missing for 2/3 of students, so then there's a missing data problem, too. If everything were perfect, the estimated ATE of eligibility (binary variable select) shouldn't depend too much on the control variables or the subsample of individuals, but clearly it does.

a. Load and see a description of the data.

R: library(wooldridge) and ?voucher

Stata: use http://faculty.missouri.edu/kaplandm/intro_text/voucher , clear and describe

b. R only: copy the dataset into data frame df with df <- voucher

c. Display the total number of observations (rows) in the dataset.

R: nrow(df)

Stata: count

d. Display summary statistics of mnce90 and mnce, including the number of missing observations.

R: summary(df[,c('mnce','mnce90')])

Stata: count if missing(mnce90) and summarize mnce mnce90

e. Run a simple regression of mnce (the 1994 math test score) on select (the dummy variable for whether a child was ever allowed to use a voucher).

R: (ret1 <- lm(mnce~select, data=df))

Stata: regress mnce select , vce(robust)

f. Repeat, but adding the 1990 math test score mnce90 as a regressor. Also compare the number of observations used in the regression to the total number of observations in the dataset.

R: (ret2 <- lm(mnce~select+mnce90, data=df)) and then length(ret2$residuals) or summary(ret2) to see the number of observations actually used

Stata: regress mnce select mnce90 , vce(robust) noting that observations with missing mnce90 are automatically (and silently) omitted from the regression, but the output shows the number of observations actually used, which you can compare to the number in the full dataset.

g. To try to see how much of the estimate's change is due to controlling for mnce90 versus sample selection bias, re-run your first simple regression but with only the observations used in the second regression, i.e., only observations with non-missing mnce90.

R: (ret2b <- lm(mnce~select, data=df[!is.na(df$mnce90),]))

Stata: regress mnce select if !missing(mnce90) , vce(robust)

h. Optional: repeat the above three regressions but with selectyrs (number of years eligible for voucher program) instead of the binary select.

i. Optional: repeat the first three regressions but with additional regressors like female, to see if they further change the coefficient on select.

Empirical Exercise EE12.2. You will analyze data from Card (1995), first seen in EE3.1, with individual-level observations of wages, years of education, and other variables. You'll focus on the relationship between wage and education. The variable IQ seems like a helpful control variable, but it is not observed for all individuals, which may cause bias depending on why it is missing. You'll estimate the coefficient on education with different sets of regressors and different subsets of data. You'll also look at the difference it makes using the sampling weights (as you should).

a. Load the data.

R: library(wooldridge) and ?card

Stata: bcuse card , clear

b. R only: copy the dataset into data frame df with df <- card

c. Display the total number of observations in the dataset.

R: nrow(df)

Stata: count

d. Show how many observations are missing IQ.

R: table(is.na(df$IQ))

Stata: count if missing(IQ)

e. Run a simple regression of log wage on years of education.

R: (ret1u <- lm(log(wage)~educ, data=df))

Stata: regress lwage educ , vce(robust)

f. Run the same regression but with the provided weights.

R: (ret1w <- lm(log(wage)~educ, data=df, weights=weight))

Stata: regress lwage educ [pweight=weight] , vce(robust)

g. Run the same simple weighted regression but with the subset of observations for which IQ is observed.

R: replace df with df[!is.na(df$IQ),]

Stata: add if !missing(IQ) after educ (with a space on either side)

h. Regress log wage on education and IQ (which automatically uses only observations where IQ is non-missing).

R: (ret2w <- lm(log(wage)~educ+IQ, data=df, weights=weight))

Stata: regress lwage educ IQ [pweight=weight] , vce(robust)

i. Optional: repeat parts (f)–(h) but with additional regressors of your choice.

Part III

Time Series


Introduction

Time series data and models are considered in Part III. The focus is on forecasting, i.e., prediction of future values or events. Foundational concepts like stationarity, autocorrelation, and appropriately adjusted standard errors are introduced.

Related (free) material is from Diebold (2018b) and Hanck et al. (2018, Ch. 14). Chapter 1 in the DataCamp intro time series course is also free.


Chapter 13

Time Series: One Variable





13.2 Identify and describe different components and properties of a time series [TLOs 2 and 3]

13.3 Interpret transformed and decomposed time series [TLOs 2 and 3]

13.4 In R (or Stata): estimate basic descriptions of a time series [TLO 7]

13.5 In R (or Stata): decompose a time series into different components [TLO 7]


• Deterministic and stochastic trends (Lambert video)

• Chapter 14 ("Time Series") in Hansen (2020)

• Transformations: Section 3.2 ("Transformations and adjustments") in Hyndman and Athanasopoulos (2019)

• Seasonality and holidays: Section 5.4 ("Some useful predictors") in Hyndman and Athanasopoulos (2019)

• Trends, seasonality, and/or decomposition: Sections 8.1 ("Random Walks ...") and 8.2 ("Stochastic vs. Deterministic Trend") in Diebold (2018c); Section 9.4 ("Stochastic and deterministic trends") in Hyndman and Athanasopoulos (2019); Section 14.7 ("Nonstationarity I: Trends") in Hanck et al. (2018); Chapter 5 ("Trend and Seasonality") in Diebold (2018b); Chapter 12 ("Trend and Seasonality") in Diebold (2018a); Chapter 6 ("Time series decomposition") in Hyndman and Athanasopoulos (2019); Section 3.6 ("Classical decomposition") in Holmes, Scheuerell, and Ward (2019); Sections 10.3.3–10.3.4 ("Trends" and "Seasonality") in Heiss (2016)

• Stationarity and random walk: Section 8.1 ("Stationarity and differencing") in Hyndman and Athanasopoulos (2019); Section 11.2 ("The Nature of Highly Persistent Time Series," i.e., random walks) in Heiss (2016)

• Estimation of mean and autocovariances: Section 13.3 ("Estimation and Inference for the Mean, Autocorrelation, and Partial Autocorrelation Functions") in Diebold (2018a)

• HAC standard errors: Section 15.4 ("HAC Standard Errors") in Hanck et al. (2018)

• Section 10.2 ("Time Series Data Types in R") in Heiss (2016)

Chapter 13 extends Chapter 2 to the time series setting. New concepts like stationarity and autocorrelation are introduced. There are even new complications just with estimating a variable's mean and computing a standard error.

13.1 Terms and Notation

A time series of a single variable is written as $Y_t$ for time periods $t = 1, \ldots, T$. For example, $Y$ could be annual GDP of the U.S., with $t = 1$ indicating the year 2001 and $T = 10$ indicating a total of ten years of data (here 2001, 2002, ..., 2010). Or, $Y$ could be quarterly GDP from 2001Q1 (year 2001, quarter 1) through 2010Q4, a total of $T = 40$ periods, where $t = 1$ is 2001Q1, $t = 2$ is 2001Q2, $t = 9$ is 2003Q1, etc. Or, $Y$ could be the weekly return on a certain stock, observed over a single calendar year, $t = 1, \ldots, 52$.

In practice, there are many possible complications with timing and measurement, although details are beyond our scope. First, instead of "discrete time" periods $t = 1, \ldots, T$, "continuous time" models let $t$ be any real (decimal) number, not just integers. Second, even with discrete time, the periods may be of different lengths. Third, even with equal discrete periods, it is important to know precisely when and how the "time $t$" observation is measured. For example, imagine annual data where $t$ represents an entire year. Is $Y_t$ measured on January 1 of year $t$? Or December 31? Or is $Y_t$ the average value across the entire year? Such timing is particularly important when analyzing multiple time series. For example, if $Y_t$ is measured on January 1 of year $t$, but $X_t$ is measured on December 31, then $X_t$ is measured 364 days after $Y_t$ but only 1 day before $Y_{t+1}$.

Similar to physics, the sampling frequency is the inverse of the length of each time period. For example, if each period is one year, then there is one observation per year, so the sampling frequency is yearly (or "annual"). If each period is one quarter, then the sampling frequency is quarterly. Similarly, time series can be monthly, weekly, daily, or even hourly or higher frequency (like for stock prices, website traffic, energy use, etc.).

The following terms describe relationships among observations. Relative to $Y_t$, the first lag (or first lagged value) is $Y_{t-1}$, i.e., the value from the immediately prior period. Similarly, the second lag is $Y_{t-2}$, and generally the $j$th lag is $Y_{t-j}$. The first difference is

$$\Delta Y_t \equiv Y_t - Y_{t-1}. \tag{13.1}$$

(But "second difference" does not refer to $Y_t - Y_{t-2}$.) Looking to the future, $Y_{t+1}$ is the first lead (of $Y_t$), and $Y_{t+j}$ is the $j$th lead. In many cases, modeling the relationship between $Y_{t+1}$ and $Y_t$ is equivalent to modeling $Y_t$ and $Y_{t-1}$, for example. If we use observations $Y_1, \ldots, Y_T$ for estimation, then anything in the period $t = 1, \ldots, T$ is called in-sample, as opposed to $t = T+1, T+2, \ldots$, which is out-of-sample. Sometimes fewer than $T$ observations are used for estimation, and the definitions are adjusted accordingly (Section 15.2).
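These definitions are easy to compute in base R. The sketch below uses a made-up five-period series. The variable names are illustrative, and the "second difference" shown is the usual one, the difference of the first differences, which is what `diff(y, differences = 2)` computes.

```r
y <- c(10, 12, 15, 14, 18)               # hypothetical values Y_1, ..., Y_5

lag1  <- c(NA, head(y, -1))              # first lag Y_{t-1} (undefined at t = 1)
lead1 <- c(tail(y, -1), NA)              # first lead Y_{t+1} (undefined at t = 5)
d1    <- y - lag1                        # first difference Y_t - Y_{t-1}

d2 <- diff(y, differences = 2)           # second difference: difference of first differences
two_period_change <- diff(y, lag = 2)    # Y_t - Y_{t-2}, which is NOT the second difference

print(data.frame(t = seq_along(y), y, lag1, lead1, d1))
print(d2)                 # 1 -4 5
print(two_period_change)  # 5 2 3
```

The built-in `diff(y)` returns the same first differences as `d1`, just without the leading `NA`.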

13.2 Populations, Randomness, and Sampling

⇒ Kaplan video: Time Series "Populations"

We continue the perspective of $Y_t$ as a random variable, just as $Y_i$ was earlier (Sections 2.1 and 2.3). Earlier, $Y_i$ was "random" since we could have sampled a different value from the population. But what is the "population" for a time series?

One view is like the superpopulation from Section 2.2. That is, we can imagine many (infinite) possible universes. In each, there are the same mechanisms underlying how the time series values are generated, but the actual numerical values differ across universes. Like before, $\mathrm{E}(Y_t)$ is the average of the $Y_t$ values across all the different universes. Similarly, $\mathrm{Var}(Y_t)$ is the variance across universes. Measures like $\mathrm{Corr}(Y_t, Y_{t+1})$ show whether $Y_t$ and $Y_{t+1}$ tend to both be high (or low), or opposite, or unrelated. For example, maybe GDP growth is high in both 2018 and 2019 in many universes, and low in both in other universes, but very few universes have high growth in 2018 and low in 2019, or low and then high. Then, in the (super)population, $\mathrm{Corr}(Y_{2018}, Y_{2019}) > 0$.

Another view is that we observe a sequence of $T$ values within an infinitely long sequence of $Y_t$. We could think about, e.g., what the sample average would be if we had a very long sequence, or other "asymptotic" properties.

All that said, there are only a few chapters left, so in order to focus on practical descriptions and predictions (forecasts), these "deeper" issues are not explored further.


13.3 Stationarity

⟹ Kaplan video: Stationarity

Will the future be like the past? This question arose in Section 12.2 on external validity. Here, “be like” is formalized in terms of probability distributions.

A time series $Y_t$ is stationary if its future is like its past, probabilistically. A necessary (but not sufficient) aspect of this is $\mathrm{E}(Y_t) = \mathrm{E}(Y_s)$ for any time periods $t$ and $s$: the mean never changes. Likewise, the median never changes, nor the standard deviation: the entire (marginal) distribution of $Y_t$ is identical to that of $Y_s$. Further, the relationship between this time period and next period must be stable over time, i.e., the joint distribution of $(Y_t, Y_{t+1})$ is identical for all $t$. Similarly, the joint distribution of the previous, current, and next periods' values $(Y_{t-1}, Y_t, Y_{t+1})$ never changes. In full, stationarity is defined as the joint distribution of $(Y_{t-J}, \ldots, Y_t, Y_{t+1})$ not depending on $t$, for any $J$.

The foregoing describes strict stationarity (also called strong stationarity); a “weaker” concept called covariance stationarity (also called wide-sense stationarity or weak-sense stationarity) requires only the means and autocovariances (Section 13.4) to be the same at all $t$, not the full joint distributions. Technically, it is not “weaker” in the logical sense (Section 6.1.1) because of weird distributions whose mean is undefined (e.g., Cauchy), but if you assume $Y_t$ has finite variance, then strict (strong) stationarity implies covariance (weak) stationarity. That is, given finite variance, all strictly stationary series are also covariance stationary, but some covariance stationary series are not strictly stationary.

With either type of stationarity, an estimate of $\mathrm{E}(Y_t)$ from historical data can be interpreted as an estimate of the future $\mathrm{E}(Y_{T+1})$, which is the (unconditional) best prediction of $Y_{T+1}$ under quadratic loss. Stationarity essentially assumes external validity over time, allowing us to extrapolate the past into the future. In Chapters 14 and 15, we'll improve upon the unconditional forecast by incorporating other information, but stationarity (and its variations) remain important considerations for external validity.

In practice, you should not blindly assume stationarity, but examine it empirically and economically. That is, you can look at the data to see if it appears stationary, and you can also think about what is happening in the world now that may change the future behavior. A previously stationary time series may no longer be stationary if there is a sudden law change or other event with permanent effect.

Section 13.6 contains more on data that's nonstationary, i.e., not stationary.
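One informal way to “look at the data” is to compare summary statistics across subperiods. The following is a rough sketch, not a formal test: both series are simulated, and the helper half_means() is a hypothetical name introduced only for this illustration.

```r
set.seed(2)
n <- 400
stationary <- rnorm(n)                  # iid draws, hence stationary
trending   <- 0.02 * (1:n) + rnorm(n)   # mean rises with t, hence nonstationary

# Hypothetical helper: sample mean in the first and second half of the sample
half_means <- function(y) {
  h <- length(y) %/% 2
  c(first = mean(y[1:h]), second = mean(y[(h + 1):length(y)]))
}

hm_stationary <- half_means(stationary)
hm_trending   <- half_means(trending)

print(round(hm_stationary, 2))   # the two halves look alike
print(round(hm_trending, 2))     # the second half has a clearly higher mean
```

Similar subperiod means do not prove stationarity (the variance or autocorrelations could still change), but clearly different means are a warning sign.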

13.4 Autocovariance and Autocorrelation

⟹ Kaplan video: Autocorrelation


An important feature of a time series is the correlation between this period's value and last period's value, i.e., between $Y_t$ and $Y_{t-1}$. This correlation is called the first autocorrelation or serial correlation.

The first autocorrelation can be positive, negative, or zero. For example, if today's price change is not systematically related to yesterday's price change, then the time series of price changes has zero autocorrelation. If high quarterly GDP growth follows high growth, and low follows low, rather than jumping around randomly each quarter, then GDP growth has a positive autocorrelation. Conversely, negative first autocorrelation implies high values are followed by low values, and low by high, more often than high following high or low following low. In economics, positive autocorrelation is most common.

The sampling frequency affects the first autocorrelation. Generally, first autocorrelations are closer to positive one with high frequency, and closer to zero with low frequency. For example, today's US unemployment rate will be extremely close to yesterday's rate, so the first autocorrelation is near one with daily data. However, with yearly data (lower frequency), the first autocorrelation is lower. If each period is one decade (even lower frequency), then the first autocorrelation may be near zero.

Generally, for a stationary series, the $j$th autocorrelation (or $j$th autocorrelation coefficient) $\rho_j$ describes the relationship between $Y_t$ and $Y_{t-j}$, as does the related $j$th autocovariance $\gamma_j$. Stationarity implies these values do not vary with $t$, only $j$ (the lag). Consequently, it is the same (statistically) if we look $j$ periods in the past or $j$ periods in the future, since period $t-j$ is $j$ periods before $t$ just as $t$ is $j$ periods before $t+j$, and $\mathrm{Cov}(W, Z) = \mathrm{Cov}(Z, W)$. Mathematically,

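$$\gamma_j \equiv \mathrm{Cov}(Y_t, Y_{t-j}) = \mathrm{Cov}(Y_{t+j}, Y_t) = \gamma_{-j} \tag{13.2}$$

$$\rho_j \equiv \mathrm{Corr}(Y_t, Y_{t-j}) = \mathrm{Corr}(Y_{t+j}, Y_t) = \rho_{-j} \tag{13.3}$$

$$\gamma_0 \equiv \mathrm{Cov}(Y_t, Y_t) = \mathrm{Var}(Y_t) = \sigma_Y^2, \qquad \rho_0 = \mathrm{Corr}(Y_t, Y_t) = 1 \tag{13.4}$$

$$\rho_j \equiv \mathrm{Corr}(Y_t, Y_{t-j}) = \frac{\mathrm{Cov}(Y_t, Y_{t-j})}{\sqrt{\mathrm{Var}(Y_t)\,\mathrm{Var}(Y_{t-j})}} = \frac{\gamma_j}{\sigma_Y^2} = \frac{\gamma_j}{\gamma_0} \tag{13.5}$$

In (13.5), the denominator simplifies because stationarity implies $\mathrm{Var}(Y_{t-j}) = \mathrm{Var}(Y_t) = \sigma_Y^2$, and $\sigma_Y^2 = \gamma_0$ from (13.4):

$$\sqrt{\mathrm{Var}(Y_t)\,\mathrm{Var}(Y_{t-j})} = \sqrt{\sigma_Y^2 \sigma_Y^2} = \sigma_Y^2 = \gamma_0$$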

Although sometimes autocovariances are more convenient mathematically, autocorrelations are easier to interpret. The units of autocovariance are the square of the units of $Y_t$ (like “squared dollars”), which is difficult to interpret. The autocorrelation does not depend on the units of $Y_t$ and has the same interpretation as a correlation, where possible values are between $-1$ (perfect negative linear correlation) and $+1$ (perfect positive linear correlation). The usual caveats about interpreting correlation (nonlinearity, causality, magnitude of change, etc.) apply equally to autocorrelation.¹

¹ E.g., https://en.wikipedia.org/wiki/Correlation_and_dependence
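The point about units can be checked numerically. As a small sketch using R's built-in AirPassengers series (measured in thousands of passengers), rescaling to single passengers multiplies every sample autocovariance by 1000 squared but leaves the sample autocorrelations unchanged. The variable names below are chosen only for this illustration.

```r
y_thousands <- AirPassengers          # units: thousands of passengers
y_units     <- AirPassengers * 1000   # units: single passengers

# Sample autocovariances (lags 0 to 3) in each set of units
acov_th <- c(acf(y_thousands, lag.max = 3, type = "covariance", plot = FALSE)$acf)
acov_un <- c(acf(y_units,     lag.max = 3, type = "covariance", plot = FALSE)$acf)

# Sample autocorrelations (lags 0 to 3) in each set of units
acor_th <- c(acf(y_thousands, lag.max = 3, type = "correlation", plot = FALSE)$acf)
acor_un <- c(acf(y_units,     lag.max = 3, type = "correlation", plot = FALSE)$acf)

print(acov_un / acov_th)           # every ratio equals 1000^2
print(all.equal(acor_th, acor_un)) # autocorrelations do not depend on units
```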


Discussion Question 13.1 (autocorrelation). For each of the following, explain why you think $\rho_1 > 0$, $\rho_1 \approx 0$, or $\rho_1 < 0$.

a) An individual's employment status ($Y_t = 1$ if employed at time $t$, otherwise $Y_t = 0$), observed weekly

b) GDP growth, annual

c) GDP growth, quarterly

d) Seasonally-adjusted GDP growth, quarterly

13.5 Estimation

13.5.1 Mean

With stationarity, the mean is the same $\mu = \mathrm{E}(Y_t)$ for all $t$, so intuition suggests the sample mean may still be a good estimator. Indeed it often is, although consistency requires additional technical conditions. For example, if the joint distributions are Gaussian and the autocorrelations are zero at very distant lags (or $\rho_j \to 0$ as $j \to \infty$), then the sample mean is consistent:

$$\frac{1}{T}\sum_{t=1}^{T} Y_t \xrightarrow{p} \mu \text{ as } T \to \infty \tag{13.6}$$

See DasGupta (2008, p. 40) for more, and Proposition 7.5 in Hamilton (1994) for slightly different sufficient conditions.
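Consistency of the sample mean can be illustrated by simulation. The sketch below assumes one particular stationary Gaussian process whose autocorrelations shrink to zero at distant lags (autoregressive coefficient 0.5, simulated with arima.sim() from the built-in stats package); the true mean of 10 and the sample sizes are arbitrary choices for illustration.

```r
set.seed(3)
mu <- 10                          # assumed true mean (arbitrary)
sample_sizes <- c(100, 10000, 100000)

# For each T, simulate a stationary autocorrelated series around mu
# and compute its sample mean
sample_means <- sapply(sample_sizes, function(n) {
  y <- mu + arima.sim(model = list(ar = 0.5), n = n)
  mean(y)
})

print(data.frame(T = sample_sizes, sample_mean = round(sample_means, 3)))
```

With larger T, the sample mean should typically land closer to 10, although any single simulated sample can deviate.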

13.5.2 Autocovariances and Autocorrelations

With stationarity, similarly, autocovariances and autocorrelations do not depend on $t$, so the sample autocovariances and sample autocorrelations are reasonable estimators.

To estimate the $j$th autocovariance $\mathrm{Cov}(Y_t, Y_{t-j})$, the sample $j$th autocovariance is

$$\hat{\gamma}_j = \hat{\mathrm{E}}[(Y_t - \hat{\mathrm{E}}(Y))(Y_{t-j} - \hat{\mathrm{E}}(Y))] = \frac{1}{T}\sum_{t=1+j}^{T} (Y_t - \bar{Y})(Y_{t-j} - \bar{Y}) \tag{13.7}$$

This estimator is often consistent, meaning $\hat{\gamma}_j \xrightarrow{p} \gamma_j$ for a given $j$ as $T \to \infty$. For example, see equation [7.2.15] in Hamilton (1994). Using $j = 0$ estimates the variance, $\gamma_0 = \sigma_Y^2$.

You can check that R's acf() function uses the formula in (13.7). The data are $Y_1 = -1$, $Y_2 = -1$, $Y_3 = 1$, and $Y_4 = 1$, so $T = 4$. The sample average is zero: $\bar{Y} = ((-1) + (-1) + 1 + 1)/4 = 0$. Thus, (13.7) simplifies to

$$\hat{\gamma}_j = (1/4)\sum_{t=1+j}^{4} Y_t Y_{t-j}$$

$$\hat{\gamma}_0 = (1/4)(Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2) = (1/4)(4) = 1$$

$$\hat{\gamma}_1 = (1/4)(Y_2 Y_1 + Y_3 Y_2 + Y_4 Y_3) = (1/4)(1 + (-1) + 1) = 1/4 = 0.25$$

$$\hat{\gamma}_2 = (1/4)(Y_3 Y_1 + Y_4 Y_2) = (1/4)((-1) + (-1)) = -2/4 = -0.5$$

$$\hat{\gamma}_3 = (1/4)(Y_4 Y_1) = -1/4 = -0.25$$


These match the following output.

c(acf(x=c(-1,-1,1,1), type='covariance', plot=FALSE)$acf)

[1]  1.00  0.25 -0.50 -0.25

Since $\rho_j = \gamma_j/\gamma_0$ as in (13.5), the estimated autocorrelations are

$$\hat{\rho}_j = \hat{\gamma}_j/\hat{\gamma}_0 \tag{13.8}$$
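The sample autocovariance formula can also be coded by hand and compared with acf(). This is a minimal sketch; the function name gamma_hat() is hypothetical and introduced only for this illustration.

```r
# Hypothetical helper: sample jth autocovariance, dividing by T (not T - j)
gamma_hat <- function(y, j) {
  n <- length(y)
  ybar <- mean(y)
  sum((y[(1 + j):n] - ybar) * (y[1:(n - j)] - ybar)) / n
}

y <- c(-1, -1, 1, 1)

g_manual <- sapply(0:3, function(j) gamma_hat(y, j))   # autocovariances, lags 0..3
r_manual <- g_manual / g_manual[1]                     # autocorrelations, lags 0..3

g_acf <- c(acf(y, type = "covariance",  plot = FALSE)$acf)
r_acf <- c(acf(y, type = "correlation", plot = FALSE)$acf)

print(g_manual)                     # 1.00 0.25 -0.50 -0.25
print(all.equal(g_manual, g_acf))   # TRUE
print(all.equal(r_manual, r_acf))   # TRUE
```

In this example the autocorrelations equal the autocovariances because the lag-0 autocovariance happens to be exactly 1.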

Difficulties

There are limits to what we can learn from data. In (13.7), there are only $T - j$ terms being averaged. In the extreme, if $j = T - 1$, then there is only a single term in the average, like in the above example of $\hat{\gamma}_3$ with $T = 4$. Intuitively, an average of a single number is a bad estimator; arguments about consistency of averages assume a large number of values being averaged. If $T$ is large, then we can learn about $\gamma_1$ very well, but we cannot learn about $\gamma_j$ for large $j$ (near $T$).

Even in more complex models, estimation is difficult if the $t = 1$ and $t = T$ variables are strongly correlated.

Code

In R, acf() estimates autocovariances and autocorrelations, as seen in the following code using monthly international airline passenger data. The result $\hat{\rho}_{12} > \hat{\rho}_6$ seems surprising at first: $Y_t$ is more strongly correlated with $Y_{t-12}$ than $Y_{t-6}$, even though $Y_{t-6}$ is closer in time. However, these are monthly data with strong seasonality (Section 13.6.2), so the fact that $t-12$ is the same calendar month as $t$ causes stronger correlation than with $t-6$, which is a very different season (e.g., $t-6$ is summer if $t$ is winter). To see a graph, run the code yourself with plot=TRUE instead of FALSE.

retcorr <- acf(AirPassengers, lag.max=12, type='correlation', plot=FALSE, ci.type='ma')

retcov <- acf(AirPassengers, lag.max=12, type='covariance', plot=FALSE, ci.type='ma')

print(data.frame(lagmonth=0:12, rhoj=round(retcorr$acf, digits=2), gammaj=round(retcov$acf, digits=0)), row.names=F)

 lagmonth rhoj gammaj
        0 1.00  14292
        1 0.95  13549
        2 0.88  12514
        3 0.81  11529
        4 0.75  10757
        5 0.71  10201
        6 0.68   9743
        7 0.66   9474
        8 0.66   9370
        9 0.67   9589
       10 0.70  10043
       11 0.74  10622
       12 0.76  10868

13.6 Nonstationarity

⟹ Kaplan video: Nonstationarity

This section describes the most common reasons a time series is nonstationary, i.e., not stationary.

In Sum: Reasons for Nonstationarity
Stochastic trend (unit root, e.g., random walk): variance increasing over time
Deterministic trend: mean changing over time
Seasonality: mean changing over time (repeating up-and-down pattern)
Cycles: up-and-down patterns without fixed frequency
Breaks: permanent changes

13.6.1 Trends

Stochastic Trends

A random walk, as in (13.9), generates nonstationary $Y_t$. This is a special case of a more general unit root process; unit root processes all share qualitatively similar properties (including nonstationarity). It is also sometimes called a stochastic trend. Let $Y_0$ be the initial value. Let

$$Y_t = Y_{t-1} + \varepsilon_t \tag{13.9}$$

where the increments $\varepsilon_t$ are iid, mean zero, and independent of all past values $Y_s$ for $s \le t - 1$, i.e., the $\varepsilon_t$ are independent white noise (Section 14.1).

One way to see the nonstationarity is that

$$\mathrm{Var}(Y_t) = \mathrm{Var}(Y_{t-1} + \varepsilon_t) = \mathrm{Var}(Y_{t-1}) + \overbrace{\mathrm{Var}(\varepsilon_t)}^{>0} + 2\overbrace{\mathrm{Cov}(Y_{t-1}, \varepsilon_t)}^{=0 \text{ since } Y_{t-1} \perp\!\!\!\perp \varepsilon_t} > \mathrm{Var}(Y_{t-1}) \tag{13.10}$$

violating the property of stationarity that the variance is the same at all $t$. Logically, stationarity implies the same variance at all $t$, so by the contrapositive, different variance at different $t$ implies nonstationarity.
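The growing variance of a random walk can be seen by simulating many independent “universes” and computing the variance across them at each time period. This is only a sketch under assumed settings: standard normal increments, initial value zero, and arbitrary numbers of universes and periods.

```r
set.seed(4)
n_universes <- 5000
n_periods   <- 100

# One row per universe, one column per time period; iid N(0,1) increments
eps <- matrix(rnorm(n_universes * n_periods), nrow = n_universes)

# Random walk with Y_0 = 0: Y_t is the cumulative sum of increments up to t
y <- t(apply(eps, 1, cumsum))

# Variance across universes, separately for each t
var_by_t <- apply(y, 2, var)

print(round(var_by_t[c(1, 10, 50, 100)], 1))   # grows roughly in proportion to t
```

With unit-variance increments, the variance at period t is approximately t, so it keeps increasing rather than settling at a constant.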

For prediction, given (13.9), the “best” guess (under quadratic loss) of next period's $Y_{t+1}$ is the current period's $Y_t$. Because $\varepsilon_{t+1}$ is mean zero and independent of $Y_t$,

$$\mathrm{E}(Y_{t+1} \mid Y_t = y_t) = \mathrm{E}(Y_t + \varepsilon_{t+1} \mid Y_t = y_t) = \mathrm{E}(Y_t \mid Y_t = y_t) + \overbrace{\mathrm{E}(\varepsilon_{t+1} \mid Y_t = y_t)}^{=\mathrm{E}(\varepsilon_{t+1}) \text{ by } \perp\!\!\!\perp} = y_t + \mathrm{E}(\varepsilon_{t+1}) = y_t + 0 = y_t \tag{13.11}$$

From (6.30), the conditional mean is the best predictor under quadratic loss, so the current value $y_t$ is the best predictor of next period's value $Y_{t+1}$. This remains the best predictor even if instead of only $y_t$ we know the entire history $y_t, y_{t-1}, y_{t-2}, \ldots$ (mathematically, if we condition on these realizations).

Another interpretation of (13.9) is that $Y_t$ contains all the relevant historical information about the future $Y_{t+1}$. Additionally knowing $Y_{t-1}$ or other past values does not help. Thus, the random walk has the Markov property (or is a Markov chain): all “information” about future values is contained in the current value, and additionally knowing past values adds no new information.

Although nonstationary, the random walk can be transformed into a stationary process by taking a first difference (Section 13.1). Subtracting $Y_{t-1}$ from both sides of (13.9),

$$Y_t - Y_{t-1} = Y_{t-1} + \varepsilon_t - Y_{t-1} = \varepsilon_t \tag{13.12}$$

and $\varepsilon_t$ is iid, which is a special case of stationarity. Generally, when a first difference of a time series produces a stationary series, the original time series is called difference stationary.
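The effect of first-differencing a random walk can be checked on simulated data. The sketch below assumes standard normal increments and initial value zero; the sample size and variable names are arbitrary.

```r
set.seed(5)
eps <- rnorm(200)            # iid mean-zero increments
y0  <- 0                     # assumed initial value Y_0
y   <- y0 + cumsum(eps)      # random walk: Y_t = Y_{t-1} + eps_t

# First difference Y_t - Y_{t-1} for t = 1, ..., 200 (prepending Y_0)
dy <- diff(c(y0, y))
print(all.equal(dy, eps))    # differencing recovers the increments

# First sample autocorrelation of the level versus the first difference
rho1_level <- acf(y,  lag.max = 1, plot = FALSE)$acf[2]
rho1_diff  <- acf(dy, lag.max = 1, plot = FALSE)$acf[2]
print(round(c(level = rho1_level, difference = rho1_diff), 2))
```

The level series is typically highly persistent, whereas the differenced series behaves like the iid increments.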

Deterministic Trends

With a deterministic trend, the time series goes up (or down, or up and down) in a non-random pattern. For example, imagine $Y_t = t + \varepsilon_t$, where $\varepsilon_t$ are mean-zero iid variables. Then,

$$\mathrm{E}(Y_t) = \mathrm{E}(t + \varepsilon_t) = t + \mathrm{E}(\varepsilon_t) = t + 0 = t$$

which changes with $t$, violating stationarity. Analogous to difference stationarity, a time series is trend stationary if removing its deterministic trend produces a stationary series. In the above example, $Y_t$ is trend stationary because $Y_t - t = \varepsilon_t$, which is stationary.
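In the example above the trend is known exactly, but in practice it must be estimated. The following sketch simulates the same linear-trend example and removes an estimated trend; using lm() with a linear time trend is an assumption made here for illustration, not a method prescribed by the text.

```r
set.seed(6)
n  <- 200
tt <- 1:n                  # time index t
y  <- tt + rnorm(n)        # Y_t = t + eps_t with iid N(0,1) eps_t

# Removing the known trend recovers the stationary remainder exactly
eps_known <- y - tt

# If the trend were unknown, one option is to estimate a linear trend by OLS
fit <- lm(y ~ tt)
slope_hat <- unname(coef(fit)["tt"])
detrended <- resid(fit)    # estimated remainder after removing the fitted trend

print(round(coef(fit), 3))
print(round(c(mean_known = mean(eps_known), mean_detrended = mean(detrended)), 3))
```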

Distinguishing Trend Types

Despite their seeming so different, in practice it can be difficult to distinguish a stochastic trend from a deterministic trend. For example, in climate econometrics,² there is ongoing debate about whether the earth's temperature currently has a stochastic trend or a deterministic trend that changed at some point in the past; e.g., see Kaufmann, Kauppi, and Stock (2010), Chang, Kaufmann, Kim, Miller, Park, and Park (2020), and references therein.

² Although not exactly climate econometrics, half the 2018 Nobel Prize was awarded to William Nordhaus “for integrating climate change into long-run macroeconomic analysis”; see https://www.nobelprize.org/prizes/economic-sciences/2018/press-release

However difficult, it is important to distinguish stochastic and deterministic trends because they affect forecasts. Roughly, a trend stationary time series is expected to return to its deterministic trend line relatively quickly, whereas the stochastic trend makes deviations more persistent. For example, if we knew $Y_t = t + \varepsilon_t$ with mean-zero iid $\varepsilon_t$, and we observed $Y_t$ values 1.09, 1.98, 3.05, 4.6 for $t = 1, 2, 3, 4$, the best forecast of $Y_5$ is $\hat{Y}_5 = 5$, even though $Y_4 = 4 + 0.6$ was well above the trend line. In contrast, with a stochastic trend, we'd expect the effect of $\varepsilon_4 = 0.6$ to persist in $t = 5$, so our forecast would be higher than 5. Specifically, if $Y_t$ is difference stationary with $Y_t - Y_{t-1} = 1 + \varepsilon_t$ (with mean-zero iid $\varepsilon_t$), then

$$\mathrm{E}(Y_5 \mid Y_4 = 4.6) = \mathrm{E}(Y_4 + 1 + \varepsilon_5 \mid Y_4 = 4.6) = 4.6 + 1 = 5.6 \tag{13.13}$$

Diebold (2018c, §8.1) shows a similar example with US gross national product.
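The two forecasts in this example are simple enough to compute directly. The sketch below repeats the arithmetic and, as an extra illustration not in the text, also shows the two-step forecasts of the sixth value implied by the same two models.

```r
y <- c(1.09, 1.98, 3.05, 4.6)    # observed values for t = 1, 2, 3, 4

# Trend stationary model Y_t = t + eps_t: the forecast is the trend line itself
f5_trend <- 5
f6_trend <- 6

# Difference stationary model Y_t - Y_{t-1} = 1 + eps_t:
# the forecast starts from the last observed value and adds 1 per period
f5_diff <- y[4] + 1              # 5.6
f6_diff <- y[4] + 2              # 6.6

print(c(trend = f5_trend, difference = f5_diff))
print(c(trend = f6_trend, difference = f6_diff))
```

Under the difference stationary model the gap of 0.6 above the trend line never closes in the forecast, whereas the trend stationary forecast ignores it entirely.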

13.6.2 Seasonality

A time series with seasonality tends to have higher values in some seasons of the year than in others. For example, retail sales are highest near the Christmas holiday season, and some agricultural crops are only harvested in one season. Residential energy use is also seasonal, with the most heating in the winter and most cooling in the summer, and lowest energy use in the fall and spring. The pattern of seasonality may vary by location, too: energy use may be highest in winter in colder places like Montana, but highest in summer in warmer places like Louisiana. Many other variables show seasonality, too, either due to human-imposed seasons (holidays, school schedules, elections, etc.) or natural seasons (weather, crops, sunlight, etc.).

Seasonality is more general than seasons within a calendar year. For example, restaurant dinner sales are higher on Friday and Saturday than other days of the week. Crime rates fluctuate with the day of the week and even the hour of the day, as do things like electricity usage. There can also be “seasons” like Congressional elections that occur only every two years (or longer).

The presence of seasonality depends on the length of time period, too. For example, if $Y_t$ is retail sales in year $t$, then seasonality won't matter because all seasons are lumped into a single $t$. However, if $t$ is quarterly, then seasonality appears, e.g., $Y_t$ always jumps up during the fourth quarter (October, November, December). If $t$ is divided into even shorter periods, then seasonality is still seen: with monthly data, $Y_t$ jumps up in December, or with weekly data, $Y_t$ jumps up in the weeks leading up to Christmas.

Some “seasons” are not actually seasons with a fixed frequency, so they must be handled differently. For example, the calendar date of Easter differs from year to year. For forecasting regression models, you can add dummy variables for such events. For Easter specifically, the function easter() in the forecast package is helpful.

Figure 13.1 illustrates how seasonality can be seen in plots of $Y_t$ over $t$ that show an up-and-down pattern that repeats every year (or other period). The left graph is from plot(AirPassengers) and shows monthly numbers of international airline passengers (in thousands). There is a clear up-and-down seasonal pattern that repeats every year. You can also try using seasonplot(AirPassengers), a function in the forecast package (Hyndman et al., 2020; Hyndman and Khandakar, 2008).

[Figure 13.1: Seasonality in international air travel. Left panel: AirPassengers against Time; right panel: log(AirPassengers) against Time.]

The right graph of Figure 13.1 is from plot(log(AirPassengers)) and shows $\ln(Y_t)$ against $t$. Although both show seasonality, the peak-to-trough magnitude (height) of the seasonal variation is more constant every year for $\ln(Y_t)$; see Section 13.7.
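The claim about peak-to-trough magnitude can be checked numerically with a rough sketch: compute the within-year range (maximum minus minimum) for the raw and the logged series. The within-year range is only a crude stand-in for seasonal amplitude (it also picks up trend growth within the year), and the helper name yearly_range() is hypothetical.

```r
# AirPassengers is a built-in monthly series: January 1949 to December 1960
stopifnot(length(AirPassengers) == 144)
year <- rep(1949:1960, each = 12)

# Hypothetical helper: max minus min within each calendar year
yearly_range <- function(v) {
  tapply(as.numeric(v), year, function(z) max(z) - min(z))
}

range_raw <- yearly_range(AirPassengers)
range_log <- yearly_range(log(AirPassengers))

# How much larger is the last year's range than the first year's?
ratio_raw <- range_raw[["1960"]] / range_raw[["1949"]]
ratio_log <- range_log[["1960"]] / range_log[["1949"]]

print(round(c(raw = ratio_raw, log = ratio_log), 2))
```

The raw range grows several-fold over the sample, whereas the range of the logged series changes far less.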

13.6.3 Cycles

What about up-and-down patterns caused by macroeconomic business cycles, or El Niño–Southern Oscillation cycles? Cycles are often important but more difficult to understand. One added difficulty is the unknown and changing length of cycles; e.g., El Niño does not come precisely every five years, nor is there a recession every five years. Here, like in Hyndman and Athanasopoulos (2019, §6), the “trend” is actually a trend–cycle component that includes cycles, too. Though beyond our scope, it can be helpful to explicitly split out cycles; for more on cycles, see for example Diebold (2018b, §§6–7).

13.6.4 Structural Breaks

Sometimes there are big, permanent changes in the world, and the properties of a time series also change permanently. This is often called a structural break. For example, in the US, many macroeconomic time series look very different before and after 1985; in particular, the reduction in volatility led to the term “Great Moderation.”³

Dealing with breaks is beyond our scope, but they are important to be aware of; see also Section 14.5.

13.7 Decomposition

⟹ Kaplan video: Decomposition

The observed time series $Y_t$ can be written in terms of unobserved components of “trend” (really trend–cycle), seasonality, and a remainder (Diebold, 2018b, §2.10). The remainder, also called the random or irregular or residual or noise component, is what remains of $Y_t$ after removing the trend and seasonality.

Notationally, following Hyndman and Athanasopoulos (2019, §6), let $T_t$ denote trend, $S_t$ seasonality, and $R_t$ remainder. Then,

$$R_t \equiv Y_t - T_t - S_t \implies Y_t = T_t + S_t + R_t \tag{13.14}$$

This is an additive decomposition: $Y_t$ is “decomposed” into additive trend, seasonality, and remainder components, which all have the same units as $Y_t$.

Alternatively, a multiplicative decomposition is

$$Y_t = T_t \times S_t \times R_t \tag{13.15}$$

Now $T_t$ still has the same units as $Y_t$, but $S_t$ and $R_t$ represent percentage deviations from the trend. For example, $S_t = 1.05$ means 5% higher, or $R_t = 0.85$ means 15% lower. (Often a percentage seasonal component makes more sense.) Actually, taking the log of both sides of (13.15) yields an additive model:

$$\ln(Y_t) = \ln(T_t) + \ln(S_t) + \ln(R_t) \tag{13.16}$$

Finally, sometimes the decomposition is a mix: $Y_t = T_t \times S_t + R_t$.

There are R functions to decompose time series into trend, seasonal, and remainder components. To choose the right method, you must decide whether the seasonality is additive or multiplicative. For example, compared to sales on July 1, are sales on December 1 usually higher by \$500 (additive) or by 30% (multiplicative)? In other words, is (13.14) or (13.15) more sensible?
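The link between the multiplicative form and its logged (additive) form can be verified with R's built-in decompose() function and the AirPassengers data. This sketch only checks the algebraic identities; it does not say which decomposition is more appropriate for a given series.

```r
dec <- decompose(AirPassengers, type = "multiplicative")

# The moving-average trend is undefined (NA) at the start and end of the sample
ok <- !is.na(dec$random)
y_obs <- as.numeric(AirPassengers)[ok]

# Multiplicative form: observed = trend * seasonal * remainder
product <- as.numeric(dec$trend * dec$seasonal * dec$random)[ok]
print(all.equal(y_obs, product))

# Taking logs turns the product into a sum
log_sum <- as.numeric(log(dec$trend) + log(dec$seasonal) + log(dec$random))[ok]
print(all.equal(log(y_obs), log_sum))

# Seasonal factors are proportional: e.g., 1.2 means 20% above trend
print(round(range(dec$seasonal), 2))
```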

³ See https://en.wikipedia.org/wiki/Great_Moderation


[Figure 13.2: Additive decomposition, monthly atmospheric CO2 (ppm). Panels: observed, trend, seasonal, random, each plotted against time.]

For intuition, the following roughly describes a classical additive decomposition (Hyndman and Athanasopoulos, 2019, §6.3). First, the trend is estimated, usually by some nonparametric smoother, yielding the estimated trend $\hat{T}_t$. Second, the “seasonal” averages of $Y_t - \hat{T}_t$ (the detrended data) are computed. For example, with monthly data, all January values of $Y_t - \hat{T}_t$ are averaged to estimate $\hat{S}_t$ when $t$ is in January, and then all February values are averaged to get $\hat{S}_t$ for February $t$, etc. Third, $\hat{R}_t = Y_t - \hat{T}_t - \hat{S}_t$. There are many variations, with different estimators of $T_t$, or allowing $S_t$ to change over time. For multiplicative decomposition, either apply the above to $\ln(Y_t)$, or replace subtraction with division: use $Y_t/\hat{T}_t$ in the second step and $Y_t/(\hat{T}_t \hat{S}_t)$ in the third step.
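The three steps just described can be sketched by hand in R and compared with the built-in decompose() function, here using the built-in monthly co2 series. Two choices below are assumptions rather than part of the description above: the trend smoother is a centered 12-month moving average with half weights on the two end months, and the seasonal averages are recentered to average zero. These choices mirror what decompose() does for monthly data.

```r
y <- co2                                  # built-in monthly series starting January 1959
f <- frequency(y)                         # 12 periods per year
month <- as.integer(cycle(y))             # 1 = January, ..., 12 = December

# Step 1: estimate the trend with a centered moving average (assumed smoother)
w <- c(0.5, rep(1, f - 1), 0.5) / f
trend_hat <- as.numeric(stats::filter(y, w, sides = 2))

# Step 2: average the detrended values within each calendar month,
# then recenter so the seasonal effects average to zero (assumed normalization)
detrended <- as.numeric(y) - trend_hat
seas_avg  <- as.numeric(tapply(detrended, month, mean, na.rm = TRUE))
seas_avg  <- seas_avg - mean(seas_avg)
seasonal_hat <- seas_avg[month]

# Step 3: remainder = observed - trend - seasonal
remainder_hat <- as.numeric(y) - trend_hat - seasonal_hat

# Compare with the built-in classical decomposition
dec <- decompose(co2, type = "additive")
print(all.equal(trend_hat, as.numeric(dec$trend)))
print(all.equal(seas_avg, dec$figure))
print(all.equal(remainder_hat, as.numeric(dec$random)))
```

The trend and remainder are NA for the first and last six months because the centered moving average needs six observations on each side.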

Figure 13.2 shows an additive decomposition produced by the following R code that uses decompose() (in the built-in stats package).

par(family='serif', mgp=c(2.1,0.8,0))
ret <- decompose(co2, type='additive')
plot(ret)

Figure 13.3 shows a multiplicative decomposition generated by the following R code. When seasonality is multiplicative instead of additive, specify type='multiplicative' as below.

[Figure 13.3: Multiplicative decomposition, monthly airline passengers (1000s). Panels: observed, trend, seasonal, random, each plotted against time.]

par(family='serif', mgp=c(2.1,0.8,0))
ret <- decompose(AirPassengers, type='multiplicative')
plot(ret)

Other R decomposition functions to try (or Google) include stl(), HoltWinters(), and the forecast package's mstl() (multiple seasonal).

Discussion Question 13.2 (nonstationarity). For each of the following time series, explain specifically why you doubt its strict stationarity: a) GDP, annual; b) stock market index, annual; c) world population, annual; and d) US residential water usage, monthly (hint for non-US students: it's much hotter in summer, and many houses have yards/gardens that require watering).

13.8 Transformations

To improve interpretation or statistical properties, it may help to transform a time series before analyzing it. Three common transformations are now briefly discussed.


First, the first difference looks at changes in $Y_t$, defined in (13.1) as $\Delta Y_t \equiv Y_t - Y_{t-1}$. One motivation is Section 13.6: some nonstationary $Y_t$ are difference stationary, so $\Delta Y_t$ is stationary. For example, if $Y_t = Y_{t-1} + U_t$, where $U_t$ is iid, then $Y_t$ is a random walk and thus nonstationary. However, $\Delta Y_t = U_t$ is iid, which is stationary. Methods that only work with stationary data could be applied to $\Delta Y_t$ but not $Y_t$.

Second, log transformations sometimes help, like in (13.16), where a multiplicative model becomes additive. That is, instead of $Y_t$, we analyze $Z_t = \ln(Y_t)$.

Third, taking a log difference $\ln(Y_t) - \ln(Y_{t-1})$ yields the compound growth rate. This is the first difference of the log-transformed series: letting $Z_t = \ln(Y_t)$, then $\Delta Z_t = Z_t - Z_{t-1} = \ln(Y_t) - \ln(Y_{t-1}) = \ln(Y_t/Y_{t-1})$. For example, the formula for the final level $A$ after continuously compounded growth at effective annual rate $r$ for $t$ years, starting at initial level $P$, is $A = Pe^{rt}$, the “Pert” formula you may have learned in high school for computing compound interest rates. For a single year ($t = 1$ in the formula), the rate $r$ is then solved by $A = Pe^{r}$, implying $e^{r} = A/P$ and thus $r = \ln(A/P) = \ln(A) - \ln(P)$, using a log property (from Section 8.1.1) for the last equality. Thus, with annual data, the log difference $\ln(Y_t) - \ln(Y_{t-1})$ represents the effective annual rate.
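A short numerical sketch of the log difference follows. The levels 100 and 105 and the three-value annual series are made up for illustration.

```r
P <- 100                       # hypothetical initial level
A <- 105                       # hypothetical level one year later

r <- log(A) - log(P)           # log difference, equal to log(A / P)
simple_growth <- (A - P) / P   # ordinary percentage change, for comparison

print(round(c(log_difference = r, simple_growth = simple_growth), 4))
print(all.equal(P * exp(r), A))   # the "Pert" formula with t = 1 recovers A

# For a whole (hypothetical) annual series, diff(log(y)) gives every log difference
y <- c(100, 105, 112)
log_growth <- diff(log(y))
print(round(log_growth, 4))
```

For small changes the log difference is close to the ordinary percentage change (here about 0.049 versus 0.05), and log differences add up across years.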


Empirical Exercises

Empirical Exercise EE13.1. You will analyze monthly US unemployment data. You'll notice that the unemployment rate is not very seasonal (by month), but it is very persistent (positively autocorrelated). Note that urate is in percent units, so 5.2 means 5.2%, etc.


R: library(wooldridge) and beveridge

Stata: bcuse beveridge, clear

b. Tell your software that you have monthly time series data.

R: tsdat <- ts(data=beveridge$urate, frequency=12, start=c(2000,12)) creates a time series variable named tsdat that's a time series (ts) with the unemployment rate data (urate) starting in year 2000, month 12 (the first value of beveridge$month). Argument frequency=12 says there are 12 “seasons” before getting back to the first one; in this case, 12 different months per year. (Daily data could use frequency=7 to allow day-of-week “seasonality.”)

Stata: tsset ym, monthly

c. R only: decompose (additively) the unemployment rate time series into trend, seasonal, and remainder components, with tsdec <- decompose(tsdat) to compute and plot(tsdec) to plot.

You can also see that the magnitude of the seasonal component is relatively small with max(abs(tsdec$seasonal))

d. Stata only: to additively decompose the time series, first estimate the trend component with a nonparametric “moving average smoother” with command
tssmooth ma furate=urate, weights(1 2 2 2 2 2 <2> 2 2 2 2 2 1)
and plot this smoothed trend against the raw time series with
tsline urate furate, name(furate) ylabel(#3)

e. Stata only: compute the seasonal effects by averaging the difference between the data and the trend within each month (e.g., average among all January values, then separately among all February values, etc.). Generate the month variable with generate month = month(dofm(ym)) and compute the within-month averages with bysort month: egen seasadd = mean(urate-furate)

f. Stata only: normalize the seasonal effects to average to zero. Compute the average of the raw seasonal effects, and then subtract that value from the seasonal effects (to make them average to zero), with commands (note: the scalar normadd command should all be on the same line of code)

sort ym
scalar normadd = (seasadd[1]+seasadd[2]+seasadd[3]+seasadd[4]+seasadd[5]+seasadd[6]+seasadd[7]+seasadd[8]+seasadd[9]+seasadd[10]+seasadd[11]+seasadd[12])/12
replace seasadd = seasadd - normadd

g. Stata only: see how big (or small) the seasonal effects are with commands
list month seasadd if year(dofm(ym))==year(dofm(ym[1]))+1
summarize seas, detail

h. Stata only: generate the remainder term as the raw data minus trend minus seasonality, with command generate remadd = urate - furate - seasadd

i. Stata only: plot the seasonal and remainder series, and then make a combined graph with everything (similar to what R shows):
tsline seasadd, name(seasadd) ylabel(#3)
tsline remadd, name(remadd) ylabel(#3)
graph combine furate seasadd remadd, cols(1) name(decompurateadd)

j. Plot the autocorrelation function (ACF) up to 48 months lag.

R: acf(tsdat, lag.max=48, ci=0)

Stata: ac urate, level(95) lags(48)

k. Display the autocorrelation values up to 24 months.

R: acf(tsdat, lag.max=24, type='correlation', plot=FALSE)

Stata: corrgram urate, lags(24) noplot

l. Optional: repeat the decomposition plot and ACF plot for the vacancy rate variable vrate

Empirical Exercise EE13.2. You will analyze monthly data on industrial cement production, from Shea (1993). If you're curious, you can view and download more recent cement data from the Federal Reserve Bank of St. Louis.⁴ You'll notice that seasonality is very important. You'll also notice that the autocorrelations of the raw data reflect the up-and-down seasonality, whereas the autocorrelations of the seasonally-adjusted data show more consistently positive autocorrelation (up to two years lag or so).


R: library(wooldridge) and cement

Stata: bcuse cement, clear

⁴ https://fred.stlouisfed.org/series/IPG3273N



R: use
tsdat <- ts(data=cement$ipcem, frequency=12, start=c(cement$year[1],cement$month[1]))
to create a time series variable named tsdat that's a time series (ts) with the industrial cement production index data (ipcem)

Stata:
generate yrmo = ym(year, month)
format yrmo %tm
tsset yrmo

c. R only: compute, store, and plot a multiplicative decomposition, to see how important seasonality is for industrial cement production:
tsdec <- decompose(tsdat, type='mult')
plot(tsdec)
window(tsdec$seasonal, start=c(1964,1), end=c(1964,12))

The last line above prints the numerical values for the seasonality plot (which are the same for each year, e.g., 1964 could be replaced by 1971).

d. Stata only: estimate the trend and plot it against the raw data:
tssmooth ma fipcem1=ipcem, weights(1 2 2 2 2 2 <2> 2 2 2 2 2 1)
tsline ipcem fipcem1, name(fipcem1) ylabel(#3)

e. Stata only: compute multiplicative seasonal effects with
bysort month: egen seasmult = mean(ipcem/fipcem1)
(but don't worry about normalizing these to average to 1, like is sometimes done)

f. Stata only: compute the multiplicative remainder as the observed value divided by the trend value, divided yet again by the seasonal effect:
generate rem1mult = ipcem/fipcem1/seasmult

g. Stata only: plot the seasonal and remainder series, and then all series together (similar to the R plot):
tsline seasmult, name(seasmult) ylabel(#3)
tsline rem1mult, name(rem1mult) ylabel(#3)
graph combine fipcem1 seasmult rem1mult, cols(1) name(decompmult)

h. Plot the autocorrelation function (ACF) of the raw data, up to 48 months lag.

R: acf(tsdat, lag.max=48, ci=0, na.action=na.omit)

Stata: ac ipcem, level(95) lags(48)

i. Plot the ACF of the seasonally-adjusted data.

R: acf(tsdat/tsdec$seasonal, lag.max=48, ci=0, na.action=na.omit)

Stata:
generate saipcem = ipcem / seasmult
ac saipcem, level(95) lags(48)

j. Optional: repeat the decomposition plot and ACF plots for a different variable in the dataset


Chapter 14

First-Order Autoregression





14.2 Describe the first-order autoregressive model and its features, including interpretation for description and prediction [TLOs 2 and 3]

14.3 Interpret and evaluate forecasts, including multi-step and interval forecasts [TLOs 2 and 3]

14.4 In R (or Stata): estimate the parameters of a first-order autoregression [TLO 7]

14.5 In R (or Stata): generate interval and multi-step forecasts [TLO 7]


• AR(1) (Lambert video)

• AR(1) series with different autocorrelations (Lambert video)

• Chapter 12 (“Serial Correlation”) in Diebold (2018a)

• Parameter stability: Hanck et al. (2018, §14.8), Diebold (2018a, §12.4–5)

• AR(1) model and properties: Hamilton (1994, §3.4)

• AR(1): Hyndman and Athanasopoulos (2019, §8.3), Hanck et al. (2018, §14.3)

• Asymptotic theory: Hamilton (1994, §§8.2, 17.4)

• Forecast/prediction interval: Hyndman and Athanasopoulos (2019, §3.5), Diebold (2018b, §§7.3.3, 7.4.3)

• Multi-step forecasting: Diebold (2018b, §6.7.3)

• R package forecast: Hyndman and Athanasopoulos (2019, 3.6), Hyndman et al. (2020), Hyndman and Khandakar (2008)

Decent forecasts are often achieved by simply regressing $Y_t$ on $Y_{t-1}$. Chapter 14 explores this model, which is also useful for description (if not causal inference). Some extensions are discussed, with additional extensions in Chapter 15.

Although you don't need to be able to reproduce (or even understand) the mathematical derivations in this chapter, they are provided in case you're interested.

14.1 Model

The first-order autoregressive model, or AR(1) model, is essentially a simple linear regression in which the regressor is the first lag of the outcome variable:

$$Y_t = \phi_0 + \phi_1 Y_{t-1} + \varepsilon_t \tag{14.1}$$

where $\phi_0$ and $\phi_1$ are constant coefficients, with $\phi_1$ called the autoregressive parameter (or autoregressive coefficient), and $\varepsilon_t$ is something called white noise. A special case, called independent white noise, is if the $\varepsilon_t$ are iid with mean zero and finite variance, and independent of all past $Y_s$ values for $s < t$:

$$\varepsilon_t \sim \text{iid}, \quad \mathrm{E}(\varepsilon_t) = 0, \quad \sigma_\varepsilon^2 \equiv \mathrm{Var}(\varepsilon_t) < \infty, \quad \varepsilon_t \perp\!\!\!\perp (Y_{t-1}, Y_{t-2}, \ldots) \text{ for all } t \tag{14.2}$$

Diebold (2018a, §13.6) and Diebold (2018b, §6.2) have many more details on white noise that are beyond our scope.

Given (14.1) and (14.2), stationarity (either type) of $Y_t$ depends on the parameter values. Specifically,

$$Y_t \text{ is stationary} \iff |\phi_1| < 1 \tag{14.3}$$

If instead $|\phi_1| = 1$, then $Y_t$ has a unit root (Section 13.6.1). For example, with $\phi_0 = 0$ and $\phi_1 = 1$, (14.1) becomes the random walk in (13.9). “Explosive processes” with $|\phi_1| > 1$ are sometimes considered to model stock market bubbles (or other bubbles), but are beyond our scope.
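To get a feel for the model, an AR(1) series can be simulated directly from its defining recursion. This is a minimal sketch: the function name simulate_ar1(), its arguments, and the parameter values are all hypothetical choices for illustration, and the increments are assumed to be iid normal.

```r
# Hypothetical helper: simulate Y_1, ..., Y_n from Y_t = phi0 + phi1 * Y_{t-1} + eps_t
simulate_ar1 <- function(n, phi0, phi1, y0 = 0, sd_eps = 1) {
  eps  <- rnorm(n, sd = sd_eps)   # assumed iid normal white noise
  y    <- numeric(n)
  prev <- y0                      # initial value Y_0
  for (t in 1:n) {
    y[t] <- phi0 + phi1 * prev + eps[t]
    prev <- y[t]
  }
  y
}

# Stationary case: |phi1| < 1
set.seed(7)
y_stationary <- simulate_ar1(500, phi0 = 1, phi1 = 0.5)

# Unit root case: phi0 = 0 and phi1 = 1 gives a random walk (same draws via same seed)
set.seed(7)
y_random_walk <- simulate_ar1(500, phi0 = 0, phi1 = 1)

print(round(range(y_stationary), 2))    # stays in a band around its mean
print(round(range(y_random_walk), 2))   # typically wanders much farther
```

Plotting both series with plot(y_stationary, type = "l") and plot(y_random_walk, type = "l") makes the difference in behavior easier to see.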

Assuming stationarity (either type), the mean of $Y_t$ can be solved for in terms of parameters $\phi_0$ and $\phi_1$. Let $\mu \equiv E(Y_t)$, which is the same for all $t$ if $Y_t$ is stationary. Using (14.1),

$$\mu = E(Y_t) = \overbrace{E(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t)}^{\text{use linearity of } E(\cdot)} = \phi_0 + \phi_1 \overbrace{E(Y_{t-1})}^{=\mu} + \overbrace{E(\varepsilon_t)}^{=0} = \phi_0 + \phi_1 \mu. \tag{14.4}$$

Solving $\mu = \phi_0 + \phi_1 \mu$ for $\phi_0$ and then $\mu$,

$$\phi_0 = \mu(1 - \phi_1), \qquad \mu = \frac{\phi_0}{1 - \phi_1}. \tag{14.5}$$

The AR(1) model in (14.1) can be written equivalently in terms of demeaned values. Generally, a demeaned random variable has had its mean subtracted (like a "deboned" fish has had its bones removed), so it has mean zero, like the population mean model's error term defined in (6.4). Here, $Y_t - \mu$ is demeaned since $E(Y_t) = \mu$, so

$$E(Y_t - \mu) = E(Y_t) - \mu = 0.$$

Similarly, since $\mu = E(Y_{t-1})$, $E(Y_{t-1} - \mu) = E(Y_{t-1}) - \mu = 0$, and similarly $E(Y_{t-j} - \mu) = 0$ for all $j$ because all $E(Y_{t-j}) = \mu$ due to stationarity.

The demeaned AR(1) model is

$$Y_t - \mu = \phi_1 (Y_{t-1} - \mu) + \varepsilon_t. \tag{14.6}$$

This is equivalent to (14.1). After adding $\mu$ to both sides of (14.6),

$$Y_t = \mu + \phi_1 (Y_{t-1} - \mu) + \varepsilon_t = \mu + \phi_1 Y_{t-1} - \phi_1 \mu + \varepsilon_t = \overbrace{\mu(1 - \phi_1)}^{=\phi_0 \text{ by (14.5)}} + \phi_1 Y_{t-1} + \varepsilon_t. \tag{14.7}$$

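14.2 Description

Certain properties of $Y_t$ are implied by (14.1) and (14.2). Here, we look at the mean, variance, autocovariances, and autocorrelations of $Y_t$ in terms of the model parameters.

The mean is $\mu = \phi_0 / (1 - \phi_1)$ given covariance stationarity, as shown in (14.5).

The variance is derived by taking the variance of each side of (14.1). Assuming covariance stationarity, let $\sigma_Y^2 \equiv \mathrm{Var}(Y_t) = \mathrm{Var}(Y_{t-1})$. Using variance identities and $\mathrm{Cov}(Y_{t-1}, \varepsilon_t) = 0$ (since $\varepsilon_t \perp\!\!\!\perp Y_{t-1}$),

$$\begin{aligned}
\overbrace{\mathrm{Var}(Y_t)}^{=\sigma_Y^2}
&= \overbrace{\mathrm{Var}(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t)}^{\text{can remove constant } \phi_0}
 = \overbrace{\mathrm{Var}(\phi_1 Y_{t-1} + \varepsilon_t)}^{\text{use } \mathrm{Var}(V+W) = \mathrm{Var}(V) + \mathrm{Var}(W) + 2\mathrm{Cov}(V,W)} \\
&= \overbrace{\mathrm{Var}(\phi_1 Y_{t-1})}^{\text{use } \mathrm{Var}(aW) = a^2 \mathrm{Var}(W)}
 + \overbrace{\mathrm{Var}(\varepsilon_t)}^{\sigma_\varepsilon^2 \text{ in (14.2)}}
 + 2\overbrace{\mathrm{Cov}(\phi_1 Y_{t-1}, \varepsilon_t)}^{\text{use linearity}} \\
&= \phi_1^2 \overbrace{\mathrm{Var}(Y_{t-1})}^{\sigma_Y^2} + \sigma_\varepsilon^2
 + 2\phi_1 \overbrace{\mathrm{Cov}(Y_{t-1}, \varepsilon_t)}^{=0 \text{ since } \varepsilon_t \perp\!\!\!\perp Y_{t-1}} \\
&= \phi_1^2 \sigma_Y^2 + \sigma_\varepsilon^2.
\end{aligned}$$

Rearranging to solve for $\sigma_Y^2$,

$$\sigma_Y^2 = \phi_1^2 \sigma_Y^2 + \sigma_\varepsilon^2 \implies \sigma_Y^2 (1 - \phi_1^2) = \sigma_\varepsilon^2 \implies \sigma_Y^2 = \frac{\sigma_\varepsilon^2}{1 - \phi_1^2}. \tag{14.8}$$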


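The autocovariances can also be calculated given covariance stationarity. Substituting for $Y_t$ using (14.1) and using the same properties from above,

$$\begin{aligned}
\gamma_1 \equiv \mathrm{Cov}(Y_t, Y_{t-1}) &= \mathrm{Cov}(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t, Y_{t-1}) \\
&= \overbrace{\mathrm{Cov}(\phi_0, Y_{t-1})}^{=0 \text{ since } \phi_0 = \text{const}} + \mathrm{Cov}(\phi_1 Y_{t-1}, Y_{t-1}) + \overbrace{\mathrm{Cov}(\varepsilon_t, Y_{t-1})}^{=0 \text{ by } \varepsilon_t \perp\!\!\!\perp Y_{t-1}}
 = \phi_1 \overbrace{\mathrm{Cov}(Y_{t-1}, Y_{t-1})}^{=\mathrm{Var}(Y_{t-1})} \\
&= \phi_1 \sigma_Y^2.
\end{aligned} \tag{14.9}$$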

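Using (14.9) recursively,

$$\begin{aligned}
\gamma_2 \equiv \mathrm{Cov}(Y_t, Y_{t-2}) &= \mathrm{Cov}(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t, Y_{t-2}) \\
&= \overbrace{\mathrm{Cov}(\phi_0, Y_{t-2})}^{=0 \text{ since } \phi_0 = \text{const}} + \phi_1 \mathrm{Cov}(Y_{t-1}, Y_{t-2}) + \overbrace{\mathrm{Cov}(\varepsilon_t, Y_{t-2})}^{=0 \text{ by } \varepsilon_t \perp\!\!\!\perp Y_{t-2}} \\
&= \phi_1 \gamma_1 = \phi_1 \overbrace{\phi_1 \sigma_Y^2}^{\text{(14.9)}} \\
&= \phi_1^2 \sigma_Y^2.
\end{aligned} \tag{14.10}$$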

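More generally, by induction, if $\gamma_{j-1} = \phi_1^{j-1} \sigma_Y^2$, then

$$\begin{aligned}
\gamma_j \equiv \mathrm{Cov}(Y_t, Y_{t-j}) &= \mathrm{Cov}(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t, Y_{t-j}) \\
&= \overbrace{\mathrm{Cov}(\phi_0, Y_{t-j})}^{=0} + \phi_1 \overbrace{\mathrm{Cov}(Y_{t-1}, Y_{t-j})}^{=\gamma_{j-1}} + \overbrace{\mathrm{Cov}(\varepsilon_t, Y_{t-j})}^{=0} \\
&= \phi_1 \phi_1^{j-1} \sigma_Y^2 \\
&= \phi_1^j \sigma_Y^2,
\end{aligned} \tag{14.11}$$

which holds for all $j \ge 0$.

The autocorrelations combine (14.11) with (13.5):

$$\rho_j \equiv \mathrm{Corr}(Y_t, Y_{t-j}) = \gamma_j / \sigma_Y^2 = (\phi_1^j \sigma_Y^2) / \sigma_Y^2 = \phi_1^j. \tag{14.12}$$

With $j = 1$, the first autocorrelation is $\rho_1 = \phi_1$, the autoregressive coefficient in (14.1).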

In Sum: AR(1) for Description
Given (14.1) and (14.2) with $|\phi_1| < 1$ ($\implies$ stationary):

mean: $\mu = E(Y_t) = \phi_0 / (1 - \phi_1)$
variance: $\sigma_Y^2 = \mathrm{Var}(Y_t) = \sigma_\varepsilon^2 / (1 - \phi_1^2)$
$j$th autocovariance: $\gamma_j = \phi_1^j \sigma_Y^2$
$j$th autocorrelation: $\rho_j = \phi_1^j$
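
As a rough numerical check of these formulas, the sketch below compares them with sample moments from one long simulated AR(1) series. The parameter values (phi0 = 1, phi1 = 0.6, sigma_eps = 2) and the variable names are illustrative assumptions, not values from the text; the sample moments should be close to, but not exactly equal to, the formula values.

```r
# Sketch: compare the AR(1) formulas with sample moments from one long simulation.
# The parameter values below are illustrative assumptions, not from the text.
set.seed(1)
phi0 <- 1; phi1 <- 0.6; sigma_eps <- 2

mu    <- phi0 / (1 - phi1)             # mean, as in (14.5)
var_y <- sigma_eps^2 / (1 - phi1^2)    # variance, as in (14.8)
rho   <- phi1^(1:3)                    # autocorrelations at lags 1-3, as in (14.12)

# arima.sim() simulates a mean-zero AR(1); adding mu shifts the mean
Y <- as.numeric(mu + arima.sim(n = 1e5, model = list(ar = phi1), sd = sigma_eps))
samp_rho <- acf(Y, lag.max = 3, plot = FALSE)$acf[2:4, 1, 1]  # drop lag 0

out <- rbind(formula = c(mu, var_y, rho),
             sample  = c(mean(Y), var(Y), samp_rho))
colnames(out) <- c("mean", "variance", "rho1", "rho2", "rho3")
round(out, 2)
```

A long series (here 100,000 observations) is used only so that sampling error is small enough to make the comparison visible.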

14.3 Prediction (Forecasting): Optimality

For time series, "prediction" usually means forecasting future values of $Y_t$ given the current and past values. As in Section 2.5, given a loss function, the optimal forecast (prediction) minimizes mean loss. This optimal forecast is defined in the population (without data) and can be estimated with data. In practice, in many cases, given the observed $Y_t$ for $t = 1, \ldots, T$, the goal is to forecast $Y_{T+1}$.

As in Part II, the focus here is on the CEF, the best forecast given quadratic loss. The white noise property of $\varepsilon_t$ implies (14.1) is a CEF model:

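$$\begin{aligned}
E(Y_t \mid Y_{t-1}) &= \overbrace{E(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t \mid Y_{t-1})}^{\text{split apart using linearity of } E(\cdot)} \\
&= \overbrace{E(\phi_0 \mid Y_{t-1})}^{=\phi_0 \text{ (constant)}} + \overbrace{E(\phi_1 Y_{t-1} \mid Y_{t-1})}^{\text{use linearity}} + \overbrace{E(\varepsilon_t \mid Y_{t-1})}^{\text{use } \varepsilon_t \perp\!\!\!\perp Y_{t-1}} \\
&= \phi_0 + \phi_1 E(Y_{t-1} \mid Y_{t-1}) + \overbrace{E(\varepsilon_t)}^{=0} \\
&= \phi_0 + \phi_1 Y_{t-1}.
\end{aligned} \tag{14.13}$$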

Consequently, given sample $Y_1, \ldots, Y_T$ and corresponding OLS estimates $\hat\phi_0$ and $\hat\phi_1$, a reasonable forecast of $Y_{T+1}$ is

$$\hat{Y}_{T+1} = \hat\phi_0 + \hat\phi_1 Y_T. \tag{14.14}$$

Even if the AR(1) model is wrong, if the time series is covariance stationary, then OLS estimates the best linear predictor (Section 7.5) of $Y_t$ given $Y_{t-1}$. But "best" does not mean "good" (Section 7.4.2): forecast accuracy may be improved by using additional lags, other variables, and/or nonlinearity (Chapter 15).
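
To make (14.14) concrete, here is a minimal sketch that runs the OLS regression of $Y_t$ on $Y_{t-1}$ directly with lm() and then plugs in the last observed value. The simulated data, the parameter values, and the names n_obs, y_now, y_lag, and Yhat_next are assumptions made for illustration only.

```r
# Sketch: OLS regression of Y_t on Y_{t-1}, then the one-step forecast (14.14).
# Simulated data and all parameter values are illustrative assumptions.
set.seed(2)
n_obs <- 200
Y <- as.numeric(2 + arima.sim(n = n_obs, model = list(ar = 0.5), sd = 1))

y_now <- Y[2:n_obs]         # Y_t     for t = 2, ..., T
y_lag <- Y[1:(n_obs - 1)]   # Y_{t-1} for t = 2, ..., T

fit <- lm(y_now ~ y_lag)
phi_hat <- coef(fit)        # first element: intercept; second: slope

# Point forecast of Y_{T+1}: intercept + slope * (last observed value)
Yhat_next <- unname(phi_hat[1] + phi_hat[2] * Y[n_obs])
Yhat_next
```

Building the lagged regressor by hand drops the first observation, since $Y_0$ is not observed; predict(fit, newdata = data.frame(y_lag = Y[n_obs])) should return the same number.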

Under absolute loss or asymmetric loss, the CEF is not optimal; instead, the conditional median (or another conditional percentile) is. In that case, a quantile autoregression model and estimator could be used, but details are beyond our scope.

Discussion Question 14.1 (forecast and reality). Given sample $Y_1, \ldots, Y_T$, you construct the forecast $\hat{Y}_{T+1} = \hat\phi_0 + \hat\phi_1 Y_T$. Then you wait one period and observe the actual $Y_{T+1}$.

a) Will you be surprised if $Y_{T+1} > \hat{Y}_{T+1}$? Or if $Y_{T+1} < \hat{Y}_{T+1}$? Why/not?

b) How often do you expect to see $Y_{T+1} = \hat{Y}_{T+1}$? Why?

c) Is it usually true that $Y_{T+1} = \phi_0 + \phi_1 Y_T$? Why/not? Hint: for any random variable $W$, how often does $W = E(W)$, i.e., what's $P(W = E(W))$?

14.4 Estimation

Before discussing AR(1) estimation by OLS, note that the results of Section 13.5 can be interpreted as consistency of OLS in an intercept-only regression. Section 13.5 gave conditions in which a good estimator of the population mean is the sample mean. As in (3.10) and (3.11), the sample mean is the OLS estimator for the intercept-only model

$$Y_t = \mu + \varepsilon_t.$$

In the AR(1), skipping technical details, the OLS estimators $\hat\phi_0$ and $\hat\phi_1$ are consistent in many cases. If $|\phi_1| < 1$, then OLS is consistent, meaning $\hat\phi_0 \xrightarrow{p} \phi_0$ and $\hat\phi_1 \xrightarrow{p} \phi_1$. Technical details for a more general version of this result may be found in Case 4 on pages 215–217 of Hamilton (1994, §8.2). In fact, $\hat\phi_1 \xrightarrow{p} \phi_1$ even if $\phi_1 = 1$ (Hamilton, 1994, §17.4).

There are other consistent estimators, too, and some research has tried to compare the small-sample properties of these, but such comparison is beyond our scope.

Using the estimated coefficients, a point forecast (our single best guess) $\hat{Y}_{T+1}$ is computed as in (14.14). For the demeaned model in (14.6),

$$\hat{Y}_{T+1} = \hat\mu + \hat\phi_1 (Y_T - \hat\mu). \tag{14.15}$$

14.4.1 Code

The following code shows an example. The data Y are simulated from an AR(1) model with $\phi_0 = 0$ (so $\mu = 0$) and $\phi_1 = 0.25$, using arima.sim(). The argument n.ahead tells predict() how many time periods past the end of the sample to make predictions for. In R, ar() by default estimates the demeaned model. In the code, $\hat\mu$ is ret$x.mean, $\hat\phi_1$ is ret$ar, and $\hat\phi_0$ is ret$x.mean*(1-ret$ar). The predicted value in pr$pred[1] is shown to be equivalent to (14.14) and (14.15). (Alternatively, with argument method="ols", you can estimate $\phi_0$ and $\phi_1$ directly by OLS, as in the final commented-out lines of code below.) Finally, note that the "standard errors" in pr$se below are for the predicted values, not just AR coefficient uncertainty, which is shown separately with sqrt(ret$asy.var.coef).

    set.seed(112358)
    RHO <- 0.25;  n <- 100
    Y <- arima.sim(n=n, model=list(ar=RHO), sd=1)
    ret <- ar(x=Y, aic=FALSE, order.max=1)
    cat(sprintf("PhiHat0=%5.3f, PhiHat1=%5.3f\n",
                ret$x.mean*(1-ret$ar), ret$ar))

    ## PhiHat0=0.024, PhiHat1=0.143

    cat(sprintf("SE(PhiHat1)=%5.3f\n", sqrt(ret$asy.var.coef)))

    ## SE(PhiHat1)=0.100

    pr <- predict(ret, n.ahead=1)
    # output: point forecast and prediction SE
    c(round(pr$pred, digits=3), round(as.numeric(pr$se), digits=3))

    ## [1] 0.196 0.966

    # check prediction against formulas (14.15) and (14.14)
    c(pr$pred[1], ret$x.mean + ret$ar*(Y[n]-ret$x.mean),
      ret$x.mean*(1-ret$ar) + Y[n]*ret$ar)

    ## [1] 0.196 0.196 0.196

    # Not run: OLS to estimate phi0 directly (Note: phi1 estimate differs slightly)
    # retols <- ar(x=Y, aic=FALSE, order.max=1, method="ols",
    #              demean=FALSE, intercept=TRUE)
    # c(retols$x.intercept, retols$ar)  # OLS est'd phi0, phi1
    # retols$asy.se.coef                # SE for phi0, phi1

14.5 Parameter Stability

Parameter stability pertains to external validity, as in Section 12.2: is $\phi_1$ truly a constant, or has it changed over time, and might it change in the future? This is also related to structural breaks (Section 13.6.4). With enough data, we could form multiple historical datasets and see if the estimates $\hat\phi_1$ change much over time. But either way, this does not tell us what will happen in the future. Historical data cannot predict a future black swan: something new, not seen in the past. As usual, purely statistical analysis may fall short; a combination of your statistical and economic expertise (and critical thinking) yields better results.
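
The idea of forming multiple historical datasets can be sketched as follows: split one series into two halves and estimate the AR(1) slope separately on each. The data here are simulated with a constant coefficient, so any difference between the two estimates is sampling error only; the helper name ar1_slope and all numeric values are hypothetical choices for illustration.

```r
# Sketch: estimate the AR(1) slope on two sub-samples and compare.
# Data are simulated with a constant phi1 = 0.5 (an illustrative assumption),
# so the two estimates differ only because of sampling error.
set.seed(3)
n <- 400
Y <- as.numeric(arima.sim(n = n, model = list(ar = 0.5), sd = 1))

# Hypothetical helper: OLS slope from regressing y_t on y_{t-1}
ar1_slope <- function(y) {
  m <- length(y)
  unname(coef(lm(y[2:m] ~ y[1:(m - 1)]))[2])
}

first_half  <- ar1_slope(Y[1:(n / 2)])
second_half <- ar1_slope(Y[(n / 2 + 1):n])
c(first_half = first_half, second_half = second_half)
```

With real data, a large gap between such estimates (relative to their standard errors) would be a reason to look more carefully, but as noted above, a small gap says nothing about future stability.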

This issue of parameter stability is related to the Lucas critique (Lucas, 1976). For example, a partial equilibrium model's parameters may change when there is a new macroeconomic policy with general equilibrium effects (Section 4.3.3). The policy's effects might change optimal forecasts and time series descriptions.

AR(1) models (and more complex models) allowing time-varying coefficients have been developed but are beyond our scope. There are also methods for identifying when in time a certain parameter changed, also beyond our scope.

Discussion Question 14.2 (recession-affected coefficient). Name a variable you think might have different $\phi_1$ during an extended recession (than not during a recession), not including the switch from non-recession to recession. For example, if there is a recession from $t = 11$ to $t = 20$, then consider the $\phi_1$ for $Y_t$ for $t = 1, \ldots, 10$ compared to the $\phi_1$ for $t = 11, \ldots, 20$. As usual, most importantly, explain why you think so. Hint: this is not simply asking which variables are higher or lower in a recession, because that's not what $\phi_1$ describes; e.g., the time series $Z_t = Y_t + 10$ would have the exact same $\phi_1$ as $Y_t$, just a different $\phi_0$ or $\mu$.

14.6 Multi-Step Forecast

Instead of forecasting $Y_{t+1}$ given $Y_t$, you may need to forecast $Y_{t+h}$ given $Y_t$ for a particular $h > 1$. This is called the h-step-ahead forecast.

For example, if you must make a decision that affects your business or government policy for the next year, and you have monthly data, you might like to predict $Y_{t+12}$ given $Y_t$, i.e., predict the value 12 months in the future. In fact, you may want to predict $Y_{t+h}$ for all $h = 1, \ldots, 12$, i.e., predict each of the next 12 months. (Ideally, instead of always using quadratic loss, your forecast would use a loss function more appropriate for your decision, but often practice is not ideal.)

14.6.1 Intuition: Mean Zero (Special Case)

For intuition, imagine the AR(1) $Z_t = \phi_1 Z_{t-1} + \varepsilon_t$, so $E(Z_t) = 0$ and

$$E(Z_{T+1} \mid Z_T) = E(\phi_1 Z_T + \varepsilon_{T+1} \mid Z_T) = \phi_1 E(Z_T \mid Z_T) + \overbrace{E(\varepsilon_{T+1} \mid Z_T)}^{=0} = \phi_1 Z_T. \tag{14.16}$$

Iterating once, again using linearity of $E(\cdot)$,

$$E(Z_{T+2} \mid Z_T) = E(\phi_1 Z_{T+1} + \varepsilon_{T+2} \mid Z_T) = \phi_1 \overbrace{E(Z_{T+1} \mid Z_T)}^{=\phi_1 Z_T \text{ by (14.16)}} + \overbrace{E(\varepsilon_{T+2} \mid Z_T)}^{=0} = \phi_1^2 Z_T.$$

This pattern continues: given quadratic loss, the optimal forecast of $Z_{T+h}$ given $Z_T$ is

$$E(Z_{T+h} \mid Z_T) = \phi_1^h Z_T. \tag{14.17}$$

With data, the forecast is

$$\hat{Z}_{T+h} = \hat\phi_1^h Z_T. \tag{14.18}$$

With stationary $Z_t$, $|\phi_1| < 1$, so $\phi_1^h$ gets closer to zero as $h$ gets bigger. For example, with $\phi_1 = 0.5$ and $h = 10$, $(0.5)^{10} \approx 0.001$. Thus, the farther into the future we forecast, the closer our (optimal) forecast gets to $0 = E(Z_t) = E(Z_{T+h})$.

If instead $\phi_1 = 1$, then $\phi_1^h = 1$ for any $h$, so the optimal forecast is $E(Z_{T+h} \mid Z_T) = Z_T$. Even with very large $h$, the optimal forecast is $Z_T$, not $0 = E(Z_t) = E(Z_{T+h})$. However, even if the true $\phi_1 = 1$, estimating $|\hat\phi_1| < 1$ leads to $\hat{Z}_{T+h} \approx 0$ for large $h$. This is one reason there are many methods to try to distinguish unit root (here $\phi_1 = 1$) from stationary (here $|\phi_1| < 1$) processes, though they are beyond our scope.

14.6.2 General Results with AR Parameters

To extend the results, let $Z_t \equiv Y_t - \mu$ with $\mu \equiv E(Y_t)$, continuing to assume stationarity. Then, (14.17) and (14.18) become

$$E(Y_{T+h} - \mu \mid Y_T) = \phi_1^h (Y_T - \mu) \implies E(Y_{T+h} \mid Y_T) = \mu + \phi_1^h (Y_T - \mu), \tag{14.19}$$

$$\hat{Y}_{T+h} = \hat\mu + \hat\phi_1^h (Y_T - \hat\mu) = \hat\mu (1 - \hat\phi_1^h) + \hat\phi_1^h Y_T = \frac{\hat\phi_0}{1 - \hat\phi_1} (1 - \hat\phi_1^h) + \hat\phi_1^h Y_T. \tag{14.20}$$

Given $Y_T$, $\hat\phi_0$, and $\hat\phi_1$, this can forecast $Y_{T+h}$ by plugging in $h$. With $h = 1$, matching (14.14),

$$\hat{Y}_{T+1} = \frac{\hat\phi_0}{1 - \hat\phi_1} (1 - \hat\phi_1) + \hat\phi_1 Y_T = \hat\phi_0 + \hat\phi_1 Y_T.$$
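
The plug-in formula (14.20) is easy to code directly. The sketch below uses a hypothetical helper ar1_forecast() and made-up coefficient values (treated as if they were estimates) to show how the forecast moves from the last observed value toward the mean as $h$ grows.

```r
# Sketch: h-step-ahead AR(1) point forecast from (14.20).
# ar1_forecast() is a hypothetical helper; the numbers are illustrative.
ar1_forecast <- function(y_T, phi0, phi1, h) {
  mu <- phi0 / (1 - phi1)      # implied mean, as in (14.5)
  mu + phi1^h * (y_T - mu)     # vectorized over h
}

# Implied mean is 2.5 / (1 - 0.5) = 5; last observed value is 8
fc <- ar1_forecast(y_T = 8, phi0 = 2.5, phi1 = 0.5, h = 1:5)
fc
# 6.5 5.75 5.375 5.1875 5.09375
```

With $h = 1$ the result, 6.5, equals $2.5 + 0.5 \times 8$, matching (14.14); each additional step halves the remaining gap to the mean of 5.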

14.6.3 Direct Approach

Alternatively, to forecast $Y_{T+h}$ given $Y_T$, simply regress $Y_{t+h}$ on $Y_t$ (and an intercept). This makes sense: such a regression estimates the best linear predictor of $Y_{t+h}$ given $Y_t$. This regression estimates the parameters in

$$Y_{t+h} = \phi_0 + \phi_1 Y_t + \varepsilon_{t+h}. \tag{14.21}$$

The forecast of $Y_{T+h}$ is then

$$\hat{Y}_{T+h} = \hat\phi_0 + \hat\phi_1 Y_T. \tag{14.22}$$

For example, if $Y_t$ is quarterly GDP growth and we want to predict GDP growth four quarters (i.e., one year) in the future, then $h = 4$. We regress $Y_{t+4}$ on an intercept and $Y_t$ in our quarterly data, predicting $\hat{Y}_{T+4} = \hat\phi_0 + \hat\phi_1 Y_T$. The forecast in (14.14) showed the special case with $h = 1$.
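
A minimal sketch of the direct approach with $h = 4$ follows. The data are simulated (not actual GDP growth), and the names y_future, y_now, and direct_fc are assumptions for illustration.

```r
# Sketch: direct h-step forecast by regressing Y_{t+h} on Y_t, as in (14.21)-(14.22).
# Simulated data stand in for a quarterly series; all values are illustrative.
set.seed(4)
n <- 300; h <- 4
Y <- as.numeric(1 + arima.sim(n = n, model = list(ar = 0.7), sd = 1))

y_future <- Y[(1 + h):n]   # Y_{t+h} for t = 1, ..., n-h
y_now    <- Y[1:(n - h)]   # Y_t     for t = 1, ..., n-h

fit <- lm(y_future ~ y_now)

# Forecast of Y_{T+h}: plug the last observed value into the fitted line
direct_fc <- unname(coef(fit)[1] + coef(fit)[2] * Y[n])
direct_fc
```

Note that the regression uses only $n - h$ observation pairs, and that the coefficients in (14.21) generally differ from the one-step coefficients in (14.1) even though the same symbols are used.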

14.6.4 Code

There are functions in R that do multi-step forecasts automatically. In particular, the forecast() function in the forecast package (Hyndman et al., 2020; Hyndman and Khandakar, 2008) does this, and it also does multi-step interval forecasts; see Section 14.8.

Discussion Question 14.3 (long-horizon AR forecast). Let $\hat\phi_1 = 0.5$.

a) What's $\hat\phi_1^h$ when $h = 10$? $h = 20$?

b) Given (14.20), what does this imply about the forecast $\hat{Y}_{T+h}$ when $h$ is large?

c) Name one variable for which such a long-term forecast doesn't make sense, and explain why not.

d) Does $\hat{Y}_{T+20} \approx \hat{Y}_{T+21} \approx \cdots$ imply we'd be surprised if the actual $Y_{T+20}, Y_{T+21}, \ldots$ go up and down (versus being all roughly the same value)? Why/not?

14.7 Interval Forecasts

The idea of a prediction interval from Section 2.5.5 extends to time series forecasting. Like a confidence interval, an interval forecast (or forecast interval) incorporates uncertainty. Instead of giving a single-number point forecast like (14.14), an interval of numbers can help show how much uncertainty there is around the point forecast.

14.7.1 Goal and Sources of Uncertainty

The goal is to construct intervals that contain the true future value with some specified probability, like 95%. This is similar to the goal of a confidence interval: to contain the true parameter value 95% of the time. For intuition, imagine your job is to create 95% interval forecasts, and you make one every day for 1000 days. That is, on each day $t$, you make an interval forecast for the next day's value, $Y_{t+1}$. Thus, the next day, you can check whether or not the true value was inside your interval. If you're doing your job well, then you should find that (approximately) 950 days out of 1000, your interval contained the true value, and the other 50 days it didn't.

There are two sources of uncertainty in forecasting. The first source of uncertainty is the same as in a confidence interval: parameter uncertainty. That is, we only have estimated parameter values $\hat\phi_0$ and $\hat\phi_1$; we do not know the true population parameters $\phi_0$ and $\phi_1$. As before, the standard error helps quantify this type of uncertainty.

The second (and usually larger) source of uncertainty is the error term. Knowing the true parameters alone is not sufficient to perfectly forecast $Y_{T+1}$ with complete certainty. For example, consider a simple case where $\phi_0 = \phi_1 = 0$, so $Y_{T+1} = \varepsilon_{T+1}$. Even if somehow we knew $\phi_0 = \phi_1 = 0$ and thus knew that $Y_{T+1} = \varepsilon_{T+1}$, we still wouldn't know $\varepsilon_{T+1}$ at time $T$, when we are trying to predict $Y_{T+1}$. That is, in addition to uncertainty about parameters, there is uncertainty about $\varepsilon_{T+1}$.

14.7.2 Intervals Assuming Normality

Continuing the example with $\phi_0 = \phi_1 = 0$, what would a 95% interval forecast be? Since $Y_{T+1} = \varepsilon_{T+1}$, the interval should contain $\varepsilon_{T+1}$ with 95% probability. So, the interval depends on the distribution of the random variable $\varepsilon_{T+1}$. In the extremely unlikely case that $\varepsilon_{T+1} \sim N(0, 1)$, then we know the interval $[-1.96, 1.96]$ would work, since $P(-1.96 \le \varepsilon_{T+1} \le 1.96) = 95\%$ is a property of the standard normal distribution. If instead $\varepsilon_{T+1} \sim N(0, \sigma_\varepsilon^2)$, then the interval $[-1.96\sigma_\varepsilon, 1.96\sigma_\varepsilon]$ would work; we'd probably have to estimate $\sigma_\varepsilon$ from the data and use the estimated $\hat\sigma_\varepsilon$ instead. This is the foundation for many interval forecasts; e.g., see Diebold (2018b, §7.3.3).

Continuing to assume normally distributed $\varepsilon_{T+1}$, the 95% interval forecast can be stated more generally. This covers the AR(1) as well as any other forecast. The 95% forecast interval becomes $\hat{Y}_{T+1} \pm 1.96\hat\sigma_\varepsilon$, i.e.,

$$[\hat{Y}_{T+1} - 1.96\hat\sigma_\varepsilon, \; \hat{Y}_{T+1} + 1.96\hat\sigma_\varepsilon], \tag{14.23}$$

where $\hat{Y}_{T+1}$ is the point forecast. This ignores parameter uncertainty, which is usually much smaller than the uncertainty from $\varepsilon_{T+1}$. To get a 90% interval, simply replace 1.96 with 1.64, or for a $100(1 - 2\alpha)\%$ interval, use the $100(1 - \alpha)$th percentile of the standard normal distribution. (Most built-in statistical functions only ask you for the desired percentage and compute the critical value automatically.)
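
As a small illustration of (14.23), the sketch below wraps the formula in a hypothetical helper, normal_interval(), and lets qnorm() supply the critical value. The point forecast and error standard deviation are made-up inputs, and, like (14.23), the helper ignores parameter uncertainty.

```r
# Sketch: forecast interval under normality, as in (14.23).
# normal_interval() is a hypothetical helper; inputs are illustrative values.
normal_interval <- function(point, sigma_eps, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # e.g., about 1.96 for level = 0.95
  c(lower = point - z * sigma_eps, upper = point + z * sigma_eps)
}

int95 <- normal_interval(point = 2, sigma_eps = 1.5)
int90 <- normal_interval(point = 2, sigma_eps = 1.5, level = 0.90)
rbind(int95, int90)
```

The 90% interval is narrower than the 95% interval because its critical value (about 1.64) is smaller.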


But what if $\varepsilon_{T+1}$ isn't normally distributed? Then, the forecast interval in (14.23) is not valid. It may be "too wide," containing the true value with more than 95% probability, or it may be "too short," containing the true value with less than 95% probability. For example, if $\varepsilon_{T+1} \sim \mathrm{Unif}(-\sqrt{3}, \sqrt{3})$, uniformly distributed over all real (decimal) numbers between $-\sqrt{3}$ and $\sqrt{3}$, then $\sigma_\varepsilon = 1$. But $\sqrt{3} \approx 1.73$, so $\varepsilon_{T+1}$ is always between $-1.73$ and $1.73$, which means it is always in the interval $[-1.96, 1.96]$, 100% of the time (not 95%). The correct 95% interval is shorter: $P(-1.65 \le \varepsilon_{T+1} \le 1.65) = 95\%$. Further, if the distribution of $\varepsilon_{T+1}$ is not symmetric, then the best forecast interval may not be symmetric, either; i.e., the point forecast may not be exactly in the middle of the interval.
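
The numbers in the uniform example can be checked with R's built-in uniform distribution functions; this is only a verification sketch, and the variable names are arbitrary.

```r
# Sketch: verify the Unif(-sqrt(3), sqrt(3)) example with built-in functions.
a <- sqrt(3)

# Standard deviation of Unif(-a, a) is sqrt((2a)^2 / 12), which equals 1 here
sd_unif <- sqrt((2 * a)^2 / 12)

# Probability that the error lies in the normal-based interval [-1.96, 1.96]
cover_196 <- punif(1.96, min = -a, max = a) - punif(-1.96, min = -a, max = a)

# Half-width of the central 95% interval: the 97.5th percentile
half_width <- qunif(0.975, min = -a, max = a)

c(sd = sd_unif, coverage = cover_196, half_width = half_width)
```

The coverage is exactly 1 because punif() returns 1 beyond the upper bound and 0 below the lower bound, and the half-width equals $0.95\sqrt{3} \approx 1.65$.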

It's possible to estimate the distribution of $\varepsilon_{T+1}$ (assuming it's strictly stationary) and use the estimated percentiles to construct forecast intervals, but that approach is beyond our scope.

14.7.3 Code

The following code shows basic interval forecasts using the forecast package (which also shows the point forecasts). The argument h=12 specifies forecasting values for the next 12 time periods, so the results include multi-step interval forecasts. Argument level=c(80,95) specifies both 80% and 95% prediction intervals. Although the code is easy to run, an AR(1) is not always appropriate, so critical thought is required; e.g., see DQ 14.4.

    library(forecast)
    ret <- ar(AirPassengers, aic=FALSE, order.max=1)
    forecast(ret, h=12)

    ##   Period  Point Forecast  Lo 80  Hi 80  Lo 95  Hi 95
    ## Jan-1961             424    375    473    349    499
    ## Feb-1961             417    349    484    313    520
    ## Mar-1961             410    329    490    286    533
    ## Apr-1961             403    312    494    264    542
    ## May-1961             396    297    496    245    548
    ## Jun-1961             390    284    496    228    553
    ## Jul-1961             385    273    497    214    556
    ## Aug-1961             379    262    496    200    558
    ## Sep-1961             374    253    495    189    560
    ## Oct-1961             369    244    494    178    560
    ## Nov-1961             365    236    493    168    561
    ## Dec-1961             360    229    491    160    561

Discussion Question 14.4 (forecast sanity check). Do the point forecasts shown above pass a sanity check? That is, they show steadily decreasing values from January to December 1961; does this seem reasonable given Figure 13.1? Why/not?


14.8 More R Examples

14.8.1 AR(1) Multi-Step Forecast Intervals

The following code simulates data from an AR(1) model and then computes (and outputs and plots) various estimates and forecasts. Note that $T = 100$ (n <- 100), $\phi_1 = 0.8$ (RHO), $\mu = E(Y_t) = 5$, and $\sigma_\varepsilon = 1$ (from the sd=1 option). The estimated $\hat\phi_1$ is not particularly good, although the true value is within two standard errors (there is just a lot of uncertainty).

[Figure 14.1: Point and interval forecasts. Horizontal axis: $t$ (0 to 100); vertical axis: $Y_t$ or $\hat{Y}_t$ (1 to 7).]

    set.seed(112358)
    RHO <- 0.80;  n <- 100
    Y <- 5 + arima.sim(n=n, model=list(ar=RHO), sd=1)
    ret <- ar(x=Y, aic=FALSE, order.max=1)
    cat(sprintf("PhiHat1=%5.3f, SE(PhiHat1)=%5.3f\n",
                ret$ar, sqrt(ret$asy.var.coef)))

    ## PhiHat1=0.685, SE(PhiHat1)=0.074

    (fc <- forecast(ret, h=15, level=c(80,95)))
    plot(fc)

    ## Period  Point Forecast  Lo 80  Hi 80  Lo 95  Hi 95
    ##    101            3.10   1.83   4.37   1.15   5.04
    ##    102            3.59   2.05   5.13   1.23   5.95
    ##    103            3.93   2.27   5.58   1.40   6.46
    ##    104            4.16   2.45   5.86   1.55   6.76
    ##    105            4.32   2.59   6.04   1.68   6.96
    ##    106            4.42   2.69   6.16   1.77   7.08
    ##    107            4.50   2.76   6.24   1.83   7.16
    ##    108            4.55   2.81   6.29   1.88   7.22
    ##    109            4.58   2.84   6.33   1.92   7.25
    ##    110            4.61   2.86   6.35   1.94   7.28
    ##    111            4.63   2.88   6.37   1.95   7.30
    ##    112            4.64   2.89   6.38   1.97   7.31
    ##    113            4.64   2.90   6.39   1.97   7.32
    ##    114            4.65   2.90   6.40   1.98   7.32
    ##    115            4.65   2.91   6.40   1.98   7.32

[Figure 14.2: Air travel data and forecast (in logs), using seasonality and trend. Two panels, each with horizontal axis from 1950 to 1965 and vertical axis from 4.5 to 6.5.]

Figure 14.1 was generated by the above code and shows some patterns. The graph essentially plots the table of results (point and interval forecasts) after plotting the original time series. First, in the data itself, we can see some persistence (high values tend to be followed by high values, and low by low), but the values never get too far from the mean $E(Y_t) = 5$. Second, the point forecasts $\hat{Y}_{T+h}$ get closer and closer to the sample average $\bar{Y} = \frac{1}{T} \sum_{t=1}^{T} Y_t$ as $h$ increases. This is because we chose an AR(1) forecasting model; even if the data were not generated by an AR(1), the forecasts would show the same pattern. Third, the forecast intervals get wider and wider as $h$ increases. This makes intuitive sense (although we skipped the math): the farther in the future, the less certainty we have.
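
For readers who want the skipped math, the following is a sketch of the standard calculation, not a derivation taken from this text. It assumes (14.1) and (14.2) hold with $|\phi_1| < 1$, treats the parameters as known (ignoring parameter uncertainty), and uses the demeaned form (14.6) iterated $h$ times.

```latex
% Assumes (14.1)-(14.2), |\phi_1| < 1, and known parameters.
\begin{aligned}
Y_{T+h} &= \mu + \phi_1^h (Y_T - \mu) + \sum_{i=0}^{h-1} \phi_1^i \varepsilon_{T+h-i}, \\
Y_{T+h} - E(Y_{T+h} \mid Y_T) &= \sum_{i=0}^{h-1} \phi_1^i \varepsilon_{T+h-i}, \\
\mathrm{Var}\bigl( Y_{T+h} - E(Y_{T+h} \mid Y_T) \bigr)
  &= \sigma_\varepsilon^2 \sum_{i=0}^{h-1} \phi_1^{2i}
   = \sigma_\varepsilon^2 \, \frac{1 - \phi_1^{2h}}{1 - \phi_1^2}
   \;\longrightarrow\; \frac{\sigma_\varepsilon^2}{1 - \phi_1^2} = \sigma_Y^2
   \quad \text{as } h \to \infty.
\end{aligned}
```

Under these assumptions the forecast error variance grows with $h$ but levels off at $\sigma_Y^2$ from (14.8), which is consistent with the table above, where the interval widths increase quickly at first and then barely change by $h = 15$.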

14.8.2 General R Forecast Allowing Seasonality and Trend

⟹ Kaplan video: Forecasting in R

Figure 14.2 uses the stlf() and auto.arima() functions from the forecast package. They do much better than the earlier forecast in Section 14.7.3 that ignored seasonality and trend. This general application of stlf() or auto.arima() can sometimes be improved by more carefully considering the type of trend, the properties of the remainder, the type of seasonality, etc., but clearly for series where the trend and/or seasonality is important, it is much better to use these functions that incorporate trend and seasonality than a model that does not allow for trend and seasonality, like the basic AR(1). But AR models are still very useful: they (or more general ARIMA models) are used by stlf() and auto.arima() to fit the detrended, seasonally-adjusted data.

Either way, it is always good to "sanity check" your forecasts visually. In this case, the point forecasts in Figure 14.2 look reasonable, unlike the earlier basic AR(1). It is also reasonable that the interval forecasts get longer (taller) farther into the future, appropriately reflecting greater uncertainty. However, the stlf() interval forecasts seem too narrow: even multiple years in the future, the interval is relatively short. Reading the stlf help file suggests one reason why; it says, "Note that the prediction intervals ignore the uncertainty associated with the seasonal component." That is, it assumes the estimated seasonality is actually the true seasonality, with no uncertainty. Even the auto.arima() intervals may be "too short" since (as usual) they do not account for uncertainty about the true model itself changing over time (i.e., structural breaks), only uncertainty about the parameter values.

Figure 14.2 was generated by the following code.

    library(forecast)
    par(family="serif", mar=c(1.8,1.8,0.3,0.6), mgp=c(2.1,0.8,0))
    ret1 <- stlf(y=log(AirPassengers), h=48)
    plot(ret1)
    ret2 <- auto.arima(y=log(AirPassengers))
    plot(forecast(ret2, h=48))


Empirical Exercises

Empirical Exercise EE14.1. You will analyze the New York Stock Exchange (NYSE) value-weighted price index, specifically the weekly close prices every Wednesday. (Unfortunately, the dataset does not note the dates or data source.) You'll consider forecasting price as well as the price change, using an AR(1) model, with both point and interval forecasts. In practice, if you could reliably predict the price change, then you could make a lot of money, so you should be (very) skeptical that you can forecast the price change. (This is related to the "efficient market hypothesis.") Related, if stock prices are a random walk, then the optimal forecast should just be the most recently observed value; you can see if this matches your code's forecasts.

Mathematically, assume the price change $U_t = Y_t - Y_{t-1}$ is indeed unrelated to $Y_t$ and $Y_{t-1}$ (and other past values), and let $\phi_0 = E(U_t)$ and $V_t = U_t - E(U_t)$, so $E(V_t) = 0$. Then, $Y_t = Y_{t-1} + U_t = \phi_0 + Y_{t-1} + V_t$ is an AR(1) with $\phi_1 = 1$, in which case $Y_T + \phi_0$ is the best forecast of $Y_{T+1}$. You will check if $\hat\phi_1 \approx 1$ and estimate the value of $\phi_0$, among other computations.

a. R only: load the needed packages (and install them before that if necessary) and look at a description of the dataset:
    library(wooldridge); library(forecast)
    ?nyse

b. Stata only: load the data with bcuse nyse , nodesc clear (assuming bcuse is already installed).

c. Tell your software that you have weekly time series data.

R: tsdat <- ts(data=nyse$price, frequency=52.18)

Stata: tsset t , weekly

d. Define a variable holdout for how many time periods at the end of the sample to "hold out" when fitting your model.

R: holdout <- 20

Stata: scalar holdout = 20

e. R only: using holdout, define the time period at the end of the "training" data (just before the "testing" data) as midpt <- length(tsdat)-holdout and use it to define the training and testing data, respectively:
    tsdattrain <- subset(tsdat, start=1, end=midpt)
    tsdattest <- subset(tsdat, start=midpt+1, end=length(tsdat))

f. Estimate an AR(1) model to produce "dynamic" forecasts, i.e., what would be forecast if we were living at the end of the training data.

R: ret <- ar(x=tsdattrain, aic=FALSE, order.max=1, method="yw")

Stata: arima price if _n<=_N-holdout , arima(1,0,0)


g. Pretend you travel back in time to the very end of the training data, and produce dynamic forecasts for the next 20 periods (weeks).

R: (fc <- forecast(ret, h=holdout, level=c(80,95)))

Stata: predict fmulti , y dyn(t[_N-holdout+1]) where fmulti is the name for a newly created variable and dyn tells it to make dynamic forecasts.

h. Plot the forecasts against the actual historical data.

R: plot(fc) and lines(window(tsdattest), col=1)

Stata: twoway tsline price || tsline fmulti if _n>_N-holdout , lcolor(red)

i. Optional: repeat your analysis but with an AR(1) model of the first-differenced price ($\Delta Y_t = Y_t - Y_{t-1}$), which is already in the dataset as the variable cprice ("c" for "change").

R: when you create tsdat, use data=nyse$cprice[-1] to exclude the first value of cprice (which is missing); otherwise, the code should be the same; you may also like to draw a line with lines(x=c(-1,1)*999, y=c(0,0)) at the very end, for reference.

Stata: just use cprice, and make sure to name a different new variable in your predict command, which you'll reference in your graphing command. Note also that instead of arima cprice, you could use OLS estimation with regress cprice cprice_1 or equivalently regress D.price LD.price where D.price means "take the first difference of the variable price" and LD.price means "lag of first difference of price."

Chapter 15

Higher-Order Autoregression and Autoregressive Distributed Lag Regression

Depends on: Chapter 14 (which depends on Chapters 2–4, 6–8, and 13)



15.2 Explain the problem of choosing the best model, both mathematically and intuitively, along with possible solutions [TLO 2]

15.3 Implement and compare different ways to select the best forecasting model [TLOs 2 and 6]

15.4 In R (or Stata): estimate more general time series regression models for the purpose of forecasting future values [TLO 7]


• AIC and BIC: Hanck et al. (2018, §14.6)

• Forecast model evaluation and selection: Hyndman and Athanasopoulos (2019, §§3.4, 5.5) and function forecast::CV()

• Autoregression: Hyndman and Athanasopoulos (2019, §8.3)

• Lagged predictors: Hyndman and Athanasopoulos (2019, §9.6)

• Example data: fpp2 package in R (Hyndman, 2018)


Sometimes accuracy improves by forecasting $Y_{t+1}$ using not only $Y_t$ but also $Y_{t-1}$. And why stop at $Y_{t-1}$? Maybe $Y_{t-2}$ contains additional information not found in $Y_t$ and $Y_{t-1}$, or maybe $Y_{t-3}$ does, or even longer lags of $Y_t$. Additionally, other variables and possibly their lags may further improve forecasting accuracy.

15.1 The AR(p) Model

The AR(p) model generalizes the AR(1) model in (14.1):

$$Y_t = \phi_0 + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t = \phi_0 + \sum_{j=1}^{p} \phi_j Y_{t-j} + \varepsilon_t. \tag{15.1}$$

Again, $\varepsilon_t$ is white noise, with properties as in (14.2). Coefficient $\phi_j$ is called the $j$th partial autocorrelation, for $j = 1, \ldots, p$. (This can be confusing since $\phi_j \ne \mathrm{Corr}(Y_t, Y_{t-j})$, the $j$th autocorrelation.)

Theoretical details and properties are mostly omitted here, but there are concepts similar to the AR(1). For example, there is the concept of a unit root, which generates nonstationary $Y_t$, but its mathematical characterization is more complicated than just $\phi_1 = 1$. The autocovariances and autocorrelations can be derived from the coefficients and properties of $\varepsilon_t$, but the derivations and formulas are again more complicated.

Instead, the next sections focus on good forecasts.

15.2 Model Selection: How Many Lags?

⟹ Kaplan video: Model Selection for Forecasting

In practice, which $p$ should we use? This is a question of model selection (see also Section 8.3). Choosing $p$ is equivalent to setting $\hat\phi_j = 0$ for $j > p$ instead of estimating those $\phi_j$ from data.

15.2.1 Difficulties and Intuition

Recall the intuition from Section 8.3. If $p$ is too small, then the model is not flexible enough; implicitly, this sets $\hat\phi_j = 0$ for some important $\phi_j \ne 0$. Even if the $\phi_j$ are estimated perfectly for $j = 0, 1, \ldots, p$, the estimated model may not forecast very well because $\hat\phi_j = 0 \ne \phi_j$ for some $j > p$. However, if $p$ is too big, then the model can be too flexible, overfitting the data. This also causes poor forecasts. We want the "just right" $p$ that balances these two sources of error.

Only looking at in-sample fit leads to overfitting (Section 8.3). For example, minimizing the sum of squared residuals (SSR), or equivalently maximizing the $R^2$, always picks the largest possible $p$, regardless of the dataset and which model is actually best. The "adjusted $R^2$" is better, but still not designed for picking the best forecasting model. Similarly, hypothesis testing is not designed to pick the best forecasting model.


With time series, large $p$ additionally limits the amount of usable data. For example, if we observe $Y_t$ for $t = 1, \ldots, T$, and we regress $Y_t$ on lags up to $Y_{t-50}$ ($p = 50$), then we can only use $t$ for which both $Y_t$ and $Y_{t-50}$ are observed. If $t > T$, then $Y_t$ isn't observed; if $t \le p$, then $Y_{t-p}$ isn't observed. If $T = 51$, then there is only one usable data point: regressing $Y_{51}$ on $Y_{50}, Y_{49}, \ldots, Y_1$. Since it's impossible to estimate 51 parameters from 1 data point, $p$ must be (much) smaller. Even with $p = 25$, there are $p + 1 = 26$ parameters and $T - p = 26$ usable data points; estimates could be computed, but certainly suffer from overfitting. With $T$ total observations, you can only estimate an AR(p) with $p < T/2$, and $p$ must be even smaller for reliable estimation.
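
The count of usable data points can be seen with base R's embed(), which stacks a series next to its lags and keeps only complete rows. The helper name usable_rows() and the placeholder series are assumptions for illustration.

```r
# Sketch: how many complete rows remain when regressing Y_t on p lags?
# embed(y, p + 1) returns a matrix whose first column is Y_t and whose
# remaining p columns are Y_{t-1}, ..., Y_{t-p}; it has length(y) - p rows.
n_obs <- 51
Y <- seq_len(n_obs)   # placeholder values; only the length matters here

# Hypothetical helper: number of usable observations for an AR(p) regression
usable_rows <- function(y, p) nrow(embed(y, p + 1))

c(p50 = usable_rows(Y, 50), p25 = usable_rows(Y, 25), p1 = usable_rows(Y, 1))
# p50 p25  p1
#   1  26  50
```

With $T = 51$ this reproduces the counts in the text: one usable row for $p = 50$ and 26 usable rows for $p = 25$.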

The most common model selection methods for AR(p) models use information criteria. Basically, an information criterion tries to quantify how bad a model is for prediction, so lower values are better (less bad). The two most common are the Akaike information criterion (AIC) proposed by Akaike (1974), and the Bayesian information criterion (BIC) (or sometimes SIC, SBC, or SBIC) of Schwarz (1978). There is also a "corrected" AICc; e.g., see Hyndman and Athanasopoulos (2019, §8.6). As seen below, both AIC and BIC try to avoid overfitting by adding a penalty to the in-sample fit. The penalty is larger when the model is larger (more flexible). AIC and BIC can also be used for model selection with other types of models beyond autoregression.

Instead of picking a single "best" forecasting model, averaging multiple forecasts ("forecast averaging" or more generally "model averaging") often performs even better, but is beyond our scope.

In Sum: Model Selection for Forecasting
After you think critically about which variables and lags might help forecast future values, AIC (and AICc) and BIC can help you pick which model produces the best forecasts.

1522 AIC and BIC Formulas

There are many different but equivalent formulas for AIC and BICThis is because the selected model is the one whose value is lowerthan any other modelrsquos value so only the relative values matter notthe numeric values themselves Thus we could add 5 to all values ormultiply by T or take the log etc because this would not changewhich value of p (number of lags) minimizes the AIC or BIC

The AIC can be written in terms of the sum of squared residuals (SSR) and a penalty based on $p$. Specifically,

$$\mathrm{AIC}(p) = \overbrace{T \ln(\mathrm{SSR})}^{\text{in-sample fit}} + \overbrace{2(p+1)}^{\text{penalty}} \tag{15.2}$$

Intuitively, we'd like our models to fit the data well (small SSR), but given the same fit, we prefer less flexible models (small penalty). The penalty prevents overfitting, where a model fits the data sample "too well" because it fits all the noise, which in turn makes its out-of-sample forecasts poor.

The BIC also involves the SSR and a penalty. Specifically,

$$\mathrm{BIC}(p) = \overbrace{T \ln(\mathrm{SSR})}^{\text{in-sample fit}} + \overbrace{(p+1) \ln(T)}^{\text{penalty}} \tag{15.3}$$

When comparing models with different lag lengths, due to the different number of usable data points, some care is required to ensure a fair comparison. For now, you can try to use built-in functions and hope that they were implemented carefully; e.g., in the forecast package, auto.arima() does automatic model selection using the AICc (which you can change to AIC or BIC with the ic argument).
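One way to make the comparison fair is sketched below. This is an illustration under stated assumptions, not the method used by auto.arima(). Every AR($p$) is fit by OLS on the same rows $t = p_{\max}+1, \ldots, T$, and the criteria are computed as SSR-based fit plus penalty, with the number of usable rows playing the role of $T$. The function name ar_ic and the simulated AR(1) data are hypothetical.

```r
# Sketch: SSR-based AIC(p) and BIC(p) for AR(p), p = 0..pmax, on a common sample.
ar_ic <- function(y, pmax) {
  y <- as.numeric(y)
  E <- embed(y, pmax + 1)    # column 1 is Y_t; column j + 1 is Y_{t-j}
  Tn <- nrow(E)              # usable observations, the same for every p
  out <- data.frame(p = 0:pmax, AIC = NA_real_, BIC = NA_real_)
  for (p in 0:pmax) {
    fit <- if (p == 0) {
      lm(E[, 1] ~ 1)                                # intercept only
    } else {
      lm(E[, 1] ~ E[, 2:(p + 1), drop = FALSE])     # intercept plus p lags
    }
    ssr <- sum(residuals(fit)^2)
    out$AIC[p + 1] <- Tn * log(ssr) + 2 * (p + 1)        # fit + AIC penalty
    out$BIC[p + 1] <- Tn * log(ssr) + (p + 1) * log(Tn)  # fit + BIC penalty
  }
  out
}

set.seed(1)
y <- arima.sim(n = 200, model = list(ar = 0.5))  # simulated AR(1), assumed DGP
ic <- ar_ic(y, pmax = 4)
print(ic)
ic$p[which.min(ic$AIC)]   # lag length chosen by AIC
ic$p[which.min(ic$BIC)]   # lag length chosen by BIC
```

Because every model uses the same rows, differences in the fit term reflect only the extra lags, not a change in sample.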

15.2.3 Comparison of AIC and BIC

Compared to the AIC, the BIC has a larger penalty for large models, since $\ln(T) > 2$ if $T > 7$. (And if $T \le 7$, you should collect more data.) That is, the BIC is more likely to pick smaller $p$, i.e., shorter lag lengths (smaller models).

Related to this difference, the BIC is better than AIC if the true model is small, but worse if the true model is large (Shao, 1997, p. 235). For example, if the true model is an AR(1) and you're selecting among AR($p$) models for $p = 0, 1, \ldots, 24$, then BIC is more likely to pick the true model than AIC. However, if the true model is AR(100) and $T = 50$ (in which case picking the true model is impossible), then AIC is more likely than BIC to pick the best feasible model. Similarly, AIC is generally better for generating accurate forecasts if you only consider lag length up to $p$ but the true lag length is even larger. Thus, whether AIC or BIC is best depends on what you think about the true model.
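The small-true-model case can be explored by simulation. The sketch below is an illustration, not a result from the text. It assumes an AR(1) with coefficient 0.5, $T = 100$, and 200 replications (all arbitrary choices). It converts the AIC differences reported by ar() into BIC differences by adding $(\ln T - 2)$ per parameter, then tabulates the lag length each criterion selects.

```r
set.seed(2021)
nrep <- 200; n <- 100; pmax <- 6      # assumed simulation settings
chosen <- t(replicate(nrep, {
  y <- arima.sim(n = n, model = list(ar = 0.5))       # true model: AR(1)
  ret <- ar(y, aic = TRUE, order.max = pmax)          # AIC-based selection
  # ret$aic holds AIC differences for p = 0..pmax; swap in the BIC penalty
  bic <- ret$aic + (log(ret$n.used) - 2) * (0:pmax)
  c(AIC = ret$order, BIC = unname(which.min(bic)) - 1)
}))
colMeans(chosen == 1)   # share of samples in which each criterion picks p = 1
colMeans(chosen)        # average selected lag length for each criterion
```

Since the BIC penalty per lag is larger here, the BIC-selected lag length can never exceed the AIC-selected one within a given sample.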

For example, imagine choosing from either one or two lags. The AR(2) model always fits the data better (lower SSR) than the AR(1) model. To be concrete, imagine $T \ln(\mathrm{SSR}) = 11$ with $p = 1$ and $T \ln(\mathrm{SSR}) = 8$ with $p = 2$. With AIC, the penalty term equals 4 for $p = 1$ and equals 6 for $p = 2$; the AIC penalty depends only on $p$, not the data or even $T$. For BIC, the penalty terms for $p = 1$ and $p = 2$ are $2 \ln(T)$ and $3 \ln(T)$, respectively; e.g., if $T = 50$, then these are approximately 7.8 and 11.7. Thus, plugging these values into (15.2) and (15.3),

$$\begin{aligned}
\mathrm{AIC}(1) &= \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{2(p+1)}^{4} = 15, &
\mathrm{BIC}(1) &= \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{(p+1) \ln(T)}^{7.8} = 18.8, \\
\mathrm{AIC}(2) &= \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{2(p+1)}^{6} = 14, &
\mathrm{BIC}(2) &= \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{(p+1) \ln(T)}^{11.7} = 19.7.
\end{aligned}$$

Since $\mathrm{AIC}(2) < \mathrm{AIC}(1)$, $p = 2$ is better according to AIC. However, $\mathrm{BIC}(1) < \mathrm{BIC}(2)$, so $p = 1$ is better according to BIC. If we use AIC, we then fit an AR(2) model and use its estimates to forecast $Y_{T+1}$. If instead we had used BIC for model selection, we'd estimate an AR(1) model and use it to forecast $Y_{T+1}$.
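The arithmetic in this example can be reproduced in a few lines of R. The variable names are hypothetical; the inputs are the example's own numbers.

```r
fit_term <- c(11, 8)    # T * ln(SSR) for p = 1 and p = 2, as in the example
p <- c(1, 2)
T_obs <- 50

aic <- fit_term + 2 * (p + 1)            # 15 and 14
bic <- fit_term + (p + 1) * log(T_obs)   # about 18.8 and 19.7
rbind(p = p, AIC = aic, BIC = round(bic, 1))

p[which.min(aic)]   # 2: AIC prefers the AR(2)
p[which.min(bic)]   # 1: BIC prefers the AR(1)
```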

Discussion Question 15.1 (lag choice for forecasting). Imagine $Y_t = 50 + 0.5 Y_{t-1} + 0.00001 Y_{t-2} + \epsilon_t$, where the $\epsilon_t$ are independent of past values $Y_{t-1}, Y_{t-2}, \ldots$ and are iid and mean-zero. Do you think an estimated AR(0), AR(1), AR(2), or AR(3) would produce the best forecasts? Explain why you think your estimated model would produce better forecasts than each of the other three estimated models. Hint 1: if you need to make assumptions about things like the value of $T$, please feel free, as long as you say so explicitly. Hint 2: thinking about extreme situations is sometimes helpful; e.g., what if $\epsilon_t = 0$ for all $t$, or what if $T = 8$, etc.? Hint 3: yes, this is a very difficult question.

15.2.4 Code

The following code uses the AIC to choose $p$, then makes a forecast of $Y_{T+1}$ using an AR($p$) model. The AIC-chosen $p$ is shown along with the $p$ used to generate the data. Finally, the BIC is computed for the AIC-chosen $p$ and that $p - 1$; the BIC is lower for the latter value, so it prefers a smaller model (smaller $p$) than AIC in this case.

    set.seed(112358)
    MAXP <- 15  # max lag length for AR(p)
    ARCOEFFS <- c(0.6, -0.4, 0.4, 0.1)
    TRUEP <- length(ARCOEFFS)  # p in true AR(p) DGP
    # simulate data
    Y <- arima.sim(n=60, model=list(ar=ARCOEFFS), sd=1)
    # fit AR(p), using AIC to choose best p
    ret <- ar(x=Y, aic=TRUE, order.max=MAXP)
    # output optimal p
    cat(sprintf("true p=%d, AIC-chosen p=%d\n", TRUEP, ret$order))

    ## true p=4, AIC-chosen p=7

    pr <- predict(ret, n.ahead=1)  # compute point forecast
    c(round(pr$pred, digits=3))    # output

    ## [1] -0.434

    # check BIC for AIC-chosen p and one smaller;
    # probably BIC prefers smaller (ret2)
    ret1 <- arima(Y, order=c(ret$order,0,0))
    ret2 <- arima(Y, order=c(ret$order-1,0,0))
    c(BIC(ret1), BIC(ret2))

    ## [1] 186 185


15.3 Autoregressive Distributed Lag Regression

The autoregressive distributed lag (ADL) model (or "dynamic distributed lag" model) adds other variables and their lags to the AR($p$) model. That is, instead of forecasting $Y_{t+1}$ using only $Y_t$, $Y_{t-1}$, and other lags of $Y$, we could also use $X_t$, $X_{t-1}$, etc. Since $X_{t+1}$ is not available at time $t$, it should not be included as an explanatory variable if we are interested in forecasting. Equivalently, if we regress $Y_t$ on $Y_{t-1}$, $Y_{t-2}$, and other lags, we could add $X_{t-1}$, $X_{t-2}$, etc., but not $X_t$. If the goal is not forecasting, but rather understanding the economic relationship between $Y_t$ and $X_t$, then this comment does not apply.
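As a minimal sketch of this idea (simulated data with an assumed data-generating process; all names are hypothetical), an ADL model with one lag of each variable can be fit by OLS. Only values known at time $t$ appear on the right-hand side, so the forecast of the next period uses the last observed $Y$ and $X$.

```r
set.seed(42)
n <- 120
x <- as.numeric(arima.sim(n = n, model = list(ar = 0.7)))  # assumed X process
y <- numeric(n)
for (i in 2:n) {
  y[i] <- 0.5 * y[i - 1] + 0.3 * x[i - 1] + rnorm(1)       # assumed Y process
}

# Regress Y_t on Y_{t-1} and X_{t-1} (but not X_t, unknown when forecasting)
d <- data.frame(Y = y[2:n], L1Y = y[1:(n - 1)], L1X = x[1:(n - 1)])
adl <- lm(Y ~ L1Y + L1X, data = d)
coef(adl)

# One-step-ahead forecast of Y_{n+1} from the last observed Y_n and X_n
fc <- predict(adl, newdata = data.frame(L1Y = y[n], L1X = x[n]))
fc
```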

The same ideas from before apply to the ADL model. For example, it could be used for multi-step forecasting by replacing $Y_{t+1}$ with $Y_{t+h}$, or used for interval forecasts, and forecasts may be evaluated and compared as in Section 15.2.

To handle seasonality, decomposition or seasonal dummies can be used. The first option is to "seasonally adjust" your data by removing the seasonal component, and then fit the ADL model (and add back the seasonality into the forecast $Y_{T+1}$). The second option is to use the raw data but replace the intercept term with dummies for each possible season. For example, with quarterly data, let $D_{1t} = 1$ if time period $t$ is in quarter 1 of some year (and $D_{1t} = 0$ otherwise), and similarly $D_{2t} = 1$ if $t$ is in quarter 2, $D_{3t} = 1$ for quarter 3, and $D_{4t} = 1$ for quarter 4. All four dummies can be included as regressors if the intercept is removed; alternatively, you can keep the intercept and just add $D_{2t}$, $D_{3t}$, and $D_{4t}$ as regressors. Or for monthly data, you can include the intercept along with $D_{2t}, \ldots, D_{12t}$, where $D_{2t}$ is the dummy for February, $D_{3t}$ for March, up to $D_{12t}$ for December; or else remove the intercept and include all $D_{1t}, \ldots, D_{12t}$.
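The two dummy specifications can be sketched in base R as follows. The quarterly series is simulated and the variable names are hypothetical. Including all four dummies without an intercept, or an intercept plus three dummies, gives the same fitted values; only the interpretation of the coefficients differs.

```r
set.seed(7)
y <- ts(rnorm(24), frequency = 4, start = c(2000, 1))  # hypothetical quarterly series
quarter <- factor(cycle(y))            # quarter of each observation: 1, 2, 3, 4, 1, ...

D_all <- model.matrix(~ 0 + quarter)   # D1, D2, D3, D4 (no intercept)
D_int <- model.matrix(~ quarter)       # intercept plus D2, D3, D4
head(D_all, 4)

fit_all <- lm(as.numeric(y) ~ 0 + quarter)
fit_int <- lm(as.numeric(y) ~ quarter)
all.equal(unname(fitted(fit_all)), unname(fitted(fit_int)))  # TRUE: same fit
```

In an ADL regression, the same factor (or dummy columns) would simply be added alongside the lagged regressors.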

The following code uses ADL models to forecast quarterly GDP growth. First, quarterly GDP $G_t$ is transformed to $Y_t = \ln(G_t) - \ln(G_{t-1})$ and stored in variable GDPgr ("gr" for "growth"). Second, lags of T-bill rates are generated. Third, various ADL models are fit and their AIC (actually AICc) calculated. Fourth, the best ADL model is used to forecast $Y_{T+1}$; the output at the end shows the point forecast along with forecast intervals. Note that auto.arima() automatically chooses the best lag length for $Y_t$, but the best T-bill lag is determined "manually" by calling auto.arima() once for each possible T-bill lag.

    library(AER); library(forecast); data(USMacroSWQ)
    GDPgr <- diff(x=log(USMacroSWQ[,"gdp"]))  # GDP growth
    Tblags <- cbind(Tblag1=lag(USMacroSWQ[,"tbill"],-1),
                    Tblag2=lag(USMacroSWQ[,"tbill"],-2),
                    Tblag3=lag(USMacroSWQ[,"tbill"],-3),
                    Tblag4=lag(USMacroSWQ[,"tbill"],-4))
    Tblags <- subset(Tblags, end=NROW(GDPgr))
    fit1 <- auto.arima(y=subset(GDPgr,start=4),
                       xreg=subset(Tblags[,1:1],start=4))
    # fit2, fit3, fit4 follow the same pattern with more T-bill lags
    fit2 <- auto.arima(y=subset(GDPgr,start=4),
                       xreg=subset(Tblags[,1:2],start=4))
    fit3 <- auto.arima(y=subset(GDPgr,start=4),
                       xreg=subset(Tblags[,1:3],start=4))
    fit4 <- auto.arima(y=subset(GDPgr,start=4),
                       xreg=subset(Tblags[,1:4],start=4))
    AICcs <- c(fit1[["aicc"]], fit2[["aicc"]],
               fit3[["aicc"]], fit4[["aicc"]])
    best <- which.min(AICcs)
    # fit3 has lowest AIC/AICc; fit1 lowest BIC
    # Now fit w/ all available data
    fit <- auto.arima(y=GDPgr, xreg=Tblags[,1:best])
    tnow <- NROW(USMacroSWQ)
    xr <- cbind(Tblag1=USMacroSWQ[tnow-0,"tbill"],
                Tblag2=USMacroSWQ[tnow-1,"tbill"],
                Tblag3=USMacroSWQ[tnow-2,"tbill"],
                Tblag4=USMacroSWQ[tnow-3,"tbill"])
    xr <- matrix(xr[1,1:best], nrow=1)
    (fc <- forecast(fit, h=1, xreg=xr))

    ##         Point Forecast   Lo 80  Hi 80   Lo 95  Hi 95
    ## 2005 Q1         0.0103 -0.0014 0.0219 -0.0076 0.0281


Empirical Exercises

Empirical Exercise EE15.1. You will analyze annual U.S. unemployment and inflation data from the 2004 Economic Report of the President, Tables B-42 and B-64. The goal is to forecast the unemployment rate. We'll use the first $T - 1$ observations to build a forecast, then compare our forecast to the actual observation in time $T$.

a. R only: load the needed packages (and install them before that, if necessary) and look at a description of the dataset:

    library(wooldridge); library(forecast)
    ?phillips

b. Stata only: load the data with bcuse phillips, nodesc clear (assuming bcuse is already installed).

c. R only: define thisyr <- 1995, since the Stata dataset only has through year 1996, so that we can get comparable results. Also define yr1 <- min(phillips$year)

d. Tell your software that you have annual (yearly) time series data.

   R: tsdat <- ts(phillips[phillips$year<=thisyr, ], frequency=1, start=yr1)

   Stata: tsset year, yearly

e. Stata only: define scalar holdout = 1 and scalar endyr = year[_N]

f. Plot the unemployment and inflation time series.

   R: plot(tsdat[,c("unem","inf")])

   Stata: tsline unem inf

g. Considering AR($p$) models with $p = 0, 1, 2, 3, 4$, use the AIC to choose the best model, and estimate such a model.

   R: ret <- ar(tsdat[,"unem"], aic=TRUE, order.max=4)

   Stata: varsoc unem, maxlag(4) and then arima unem if year<=endyr-holdout, arima(p,0,0) but replacing the p in arima(p,0,0) with whatever lag length the previous varsoc command said is optimal. (It's possible to do this programmatically, but it gets complicated.)

h. R only (since Stata displayed this already): compute the BIC values for $p = 0, 1, 2, 3, 4$ with ret$aic+(log(ret$n.used)-2)*1:length(ret$aic), which adjusts the AIC values to reflect the BIC's different penalty.

i. Using the estimates based on data years up to 1995, compute (dynamic) forecasts for the next ten years, 1996–2005, and plot them.

   R: (fcARp <- forecast(ret, h=10)) and plot(forecast(ret, h=10))

   Stata:

    tsappend, add(9)
    predict fcARp if year>endyr-holdout, y
    order year unem fcARp
    list year unem fcARp if year>=endyr-holdout
    twoway tsline unem || tsline fcARp

j. Stata only: delete the previously added rows with drop if year>endyr

k. Optional: now consider autoregressive distributed lag (ADL) models with up to 2 lags of unemployment and up to 2 lags of inflation. Compute all the AIC values.

   R:

    unem <- ts(phillips[,"unem"], frequency=1, start=yr1)
    inf <- ts(phillips[,"inf"], frequency=1, start=yr1)
    dat <- cbind(Y=unem, L1Y=lag(unem,-1), L2Y=lag(unem,-2),
                 L1X=lag(inf,-1), L2X=lag(inf,-2))
    dat1 <- window(dat, start=yr1+2, end=thisyr)
    r00 <- lm(Y~1, data=dat1)
    r01 <- lm(Y~L1X, data=dat1)
    r02 <- lm(Y~L1X+L2X, data=dat1)
    r10 <- lm(Y~L1Y, data=dat1)
    r11 <- lm(Y~L1Y+L1X, data=dat1)
    r12 <- lm(Y~L1Y+L1X+L2X, data=dat1)
    r20 <- lm(Y~L1Y+L2Y, data=dat1)
    r21 <- lm(Y~L1Y+L2Y+L1X, data=dat1)
    r22 <- lm(Y~L1Y+L2Y+L1X+L2X, data=dat1)
    AICs <- data.frame(L0inf=c(AIC(r00),AIC(r10),AIC(r20)),
                       L1inf=c(AIC(r01),AIC(r11),AIC(r21)),
                       L2inf=c(AIC(r02),AIC(r12),AIC(r22)) )
    rownames(AICs) <- c("L0unem","L1unem","L2unem")
    print(AICs, digits=4)

   Stata:

    varsoc unem, maxlag(2) exog()
    varsoc unem, maxlag(2) exog(L.inf)
    varsoc unem, maxlag(2) exog(L.inf L2.inf)

l. Optional: estimate the ADL model with the smallest AIC. For example, if the AIC is smallest with one lag of each variable, then use R command (ret <- lm(Y~L1Y+L1X, data=window(dat,end=thisyr))) or Stata command arima unem L.inf if year<=endyr-holdout, arima(1,0,0)

m. Optional: compute the ADL forecast for unemployment rate in 1996, and compare it with the AR($p$) forecast and actual 1996 value.

   R:

    newdat <- window(dat, start=thisyr+1, end=thisyr+1)
    fcADL <- predict(ret, newdata=newdat)
    res <- rbind(fcARp$mean[1], fcADL,
                 window(unem,start=thisyr+1,end=thisyr+1))
    rownames(res) <- c("AR(p)","ADL","Actual")
    colnames(res) <- thisyr+1
    print(res)

   Stata:

    predict fcADL if year>endyr-holdout, y
    order year unem fcARp fcADL
    list year unem fcARp fcADL if year>=endyr-holdout

Chapter 16

Final Exam

⇒ Kaplan video: Good Luck

When I teach this class, Week 16 is final exams week. There is no new material this week (since there are no classes). My final exam is cumulative: questions may be about any material from any time during the semester. The exception is that there are no questions about coding in R, although there may be some questions showing statistical results in R.


Bibliography

Akaike, Hirotugu. 1974. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19 (6):716–723. URL https://doi.org/10.1109/TAC.1974.1100705.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher D. Chambers, Merlise Clyde, Thomas D. Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, Ernst Fehr, Fiona Fidler, Andy P. Field, Malcolm Forster, Edward I. George, Richard Gonzalez, Steven Goodman, Edwin Green, Donald P. Green, Anthony Greenwald, Jarrod D. Hadfield, Larry V. Hedges, Leonhard Held, Teck Hua Ho, Herbert Hoijtink, Daniel J. Hruschka, Kosuke Imai, Guido Imbens, John P. A. Ioannidis, Minjeong Jeon, James Holland Jones, Michael Kirchler, David Laibson, John List, Roderick Little, Arthur Lupia, Edouard Machery, Scott E. Maxwell, Michael McCarthy, Don Moore, Stephen L. Morgan, Marcus Munafó, Shinichi Nakagawa, Brendan Nyhan, Timothy H. Parker, Luis Pericchi, Marco Perugini, Jeff Rouder, Judith Rousseau, Victoria Savalei, Felix D. Schönbrodt, Thomas Sellke, Betsy Sinclair, Dustin Tingley, Trisha Van Zandt, Simine Vazire, Duncan J. Watts, Christopher Winship, Robert L. Wolpert, Yu Xie, Cristobal Young, Jonathan Zinman, and Valen E. Johnson. 2018. "Redefine statistical significance." Nature Human Behaviour 2 (1):6–10. URL https://doi.org/10.1038/s41562-017-0189-z.

Berger, James O. 1985. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer, 2nd ed. URL https://doi.org/10.1007/978-1-4757-4286-2.

Bertrand, Marianne and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94 (4):991–1013. URL https://www.jstor.org/stable/3592802.

Biddle, Jeff E. and Daniel S. Hamermesh. 1990. "Sleep and the Allocation of Time." Journal of Political Economy 98 (5, Part 1):922–943. URL https://doi.org/10.1086/261713.

Box, G. E. P. 1979. "Robustness in the Strategy of Scientific Model Building." Tech. Rep. 1954, Mathematics Research Center, University of Wisconsin–Madison. URL http://www.dtic.mil/docs/citations/ADA070213.

Canty, Angelo and B. D. Ripley. 2019. boot: Bootstrap R (S-Plus) Functions. URL https://cran.r-project.org/web/packages/boot. R package version 1.3-23.

Card, David. 1990. "The impact of the Mariel boatlift on the Miami labor market." Industrial & Labor Relations Review 43 (2):245–257.

Card, David. 1995. "Using Geographic Variation in College Proximity to Estimate the Return to Schooling." In Aspects of Labour Market Behavior: Essays in Honour of John Vanderkamp, edited by Louis N. Christophides, E. Kenneth Grant, and Robert Swidinsky. University of Toronto Press, 201–222.

Card, David and Alan B. Krueger. 1994. "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." American Economic Review 84 (4):772–793. URL https://www.jstor.org/stable/2118030.

Chang, Yoosoon, Robert K. Kaufmann, Chang Sik Kim, J. Isaac Miller, Joon Y. Park, and Sungkeun Park. 2020. "Evaluating trends in time series of distributions: A spatial fingerprint of human effects on climate." Journal of Econometrics 214 (1):274–294. URL https://doi.org/10.1016/j.jeconom.2019.05.014.

Chetty, Raj, John N. Friedman, Nathaniel Hilger, Emmanuel Saez, Diane Whitmore Schanzenbach, and Danny Yagan. 2011. "How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR." Quarterly Journal of Economics 126 (4):1593–1660. URL https://doi.org/10.1093/qje/qjr041.

Chetty, Raj, Nathaniel Hendren, and Lawrence F. Katz. 2016. "The Effects of Exposure to Better Neighborhoods on Children: New Evidence from the Moving to Opportunity Experiment." American Economic Review 106 (4):855–902. URL https://doi.org/10.1257/aer.20150572.

DasGupta, Anirban. 2008. Asymptotic Theory of Statistics and Probability. New York: Springer. URL https://doi.org/10.1007/978-0-387-75971-5.

Davison, A. C. and D. V. Hinkley. 1997. Bootstrap Methods and their Applications. Cambridge: Cambridge University Press. URL http://statwww.epfl.ch/davison/BMA.

Deming, W. Edwards and Frederick F. Stephan. 1941. "On the Interpretation of Censuses as Samples." Journal of the American Statistical Association 36 (213):45–49. URL https://www.jstor.org/stable/2278811.

Diebold, Francis X. 2018a. "Econometric Data Science." Department of Economics, University of Pennsylvania. http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

Diebold, Francis X. 2018b. "Forecasting." Department of Economics, University of Pennsylvania. http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

Diebold, Francis X. 2018c. "Time Series Econometrics." Department of Economics, University of Pennsylvania. http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

Freeman, Donald G. 2007. "Drunk Driving Legislation and Traffic Fatalities: New Evidence on BAC 08 Laws." Contemporary Economic Policy 25 (3):293–308. URL https://doi.org/10.1111/j.1465-7287.2007.00039.x.

Grabchak, Michael and Gennady Samorodnitsky. 2010. "Do financial returns have finite or infinite variance? A paradox and an explanation." Quantitative Finance 10 (8):883–893. URL https://doi.org/10.1080/14697680903540381.

Hamilton, James D. 1994. Time Series Analysis. Princeton, NJ: Princeton University Press.

Hanck, Christoph, Martin Arnold, Alexander Gerber, and Martin Schmelzer. 2018. "Introduction to Econometrics in R." URL https://www.econometrics-with-r.org. Department of Business Administration and Economics, University of Duisburg-Essen.

Hansen, Bruce E. 2020. "Econometrics." URL https://www.ssc.wisc.edu/~bhansen/econometrics. Textbook draft.

Harrison, David, Jr. and Daniel L. Rubinfeld. 1978. "Hedonic housing prices and the demand for clean air." Journal of Environmental Economics and Management 5 (1):81–102. URL https://doi.org/10.1016/0095-0696(78)90006-2.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd ed. URL https://web.stanford.edu/~hastie/ElemStatLearn. Corrected 12th printing, January 13, 2017.

Heckman, James J. 1979. "Sample Selection Bias as a Specification Error." Econometrica 47 (1):153–161. URL https://www.jstor.org/stable/1912352.

Heiss, Florian. 2016. Using R for Introductory Econometrics. CreateSpace. URL http://www.urfie.net/read.html.

Holmes, E. E., M. D. Scheuerell, and E. J. Ward. 2019. "Applied Time Series Analysis for Fisheries and Environmental Data." URL https://nwfsc-timeseries.github.io/atsa-labs. NOAA Fisheries, Northwest Fisheries Science Center, 2725 Montlake Blvd E., Seattle, WA 98112.

Hyndman, Rob. 2018. fpp2: Data for "Forecasting: Principles and Practice" (2nd Edition). URL https://CRAN.R-project.org/package=fpp2. R package version 2.3.

Hyndman, Rob, George Athanasopoulos, Christoph Bergmeir, Gabriel Caceres, Leanne Chhay, Mitchell O'Hara-Wild, Fotios Petropoulos, Slava Razbash, Earo Wang, and Farah Yasmeen. 2020. forecast: Forecasting functions for time series and linear models. URL http://pkg.robjhyndman.com/forecast. R package version 8.11.

Hyndman, Rob J. and George Athanasopoulos. 2019. Forecasting: Principles and Practice. OTexts. URL https://otexts.com/fpp2.

Hyndman, Rob J. and Yeasmin Khandakar. 2008. "Automatic time series forecasting: the forecast package for R." Journal of Statistical Software 26 (3):1–22. URL http://www.jstatsoft.org/article/view/v027i03.

Imbens, Guido and Jeffrey M. Wooldridge. 2007. "What's New in Econometrics? Estimation of Average Treatment Effects Under Unconfoundedness." NBER summer lecture notes, available at http://www.nber.org/WNE/lect_1_match_fig.pdf.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer Texts in Statistics. Springer, 1st ed. URL http://faculty.marshall.usc.edu/gareth-james/ISL. Corrected 8th printing, 2017.

Kaplan, David M. 2020. "Distributional and Nonparametric Econometrics." URL http://faculty.missouri.edu/kaplandm/teach.html. Textbook draft.

Kaplan, David M. and Longhao Zhuo. 2019. "Comparing latent inequality with ordinal data." Working paper, available at http://faculty.missouri.edu/kaplandm.

Kaufmann, Robert K., Heikki Kauppi, and James H. Stock. 2010. "Does temperature contain a stochastic trend? Evaluating conflicting statistical results." Climatic Change 101 (3–4):395–405. URL https://doi.org/10.1007/s10584-009-9711-2.

Kleiber, Christian and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer. URL https://eeecon.uibk.ac.at/~zeileis/teaching/AER.

LaLonde, Robert J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review 76 (4):604–620. URL https://www.jstor.org/stable/1806062.

Lewbel, Arthur. 2019. "The Identification Zoo: Meanings of Identification in Econometrics." Journal of Economic Literature 57 (4):835–903. URL https://doi.org/10.1257/jel.20181361.

Lucas, Robert E., Jr. 1976. "Econometric policy evaluation: A critique." In Carnegie–Rochester Conference Series on Public Policy, vol. 1. North-Holland, 19–46.

Lumley, Thomas. 2004. "Analysis of Complex Survey Samples." Journal of Statistical Software 9 (1):1–19. R package version 2.2.

Lumley, Thomas. 2019. "survey: analysis of complex survey samples." R package version 3.35-1.

Mincer, Jacob. 1974. Schooling, Experience, and Earnings. National Bureau of Economic Research. URL http://www.nber.org/books/minc74-1.

Rouse, Cecilia Elena. 1998. "Private School Vouchers and Student Achievement: An Evaluation of the Milwaukee Parental Choice Program." Quarterly Journal of Economics 113 (2):553–602. URL https://doi.org/10.1162/003355398555685.

Schwarz, Gideon. 1978. "Estimating the Dimension of a Model." Annals of Statistics 6 (2):461–464. URL https://projecteuclid.org/euclid.aos/1176344136.

Shao, Jun. 1997. "An Asymptotic Theory for Linear Model Selection." Statistica Sinica 7 (2):221–242. URL https://www.jstor.org/stable/24306073.

Shea, John. 1993. "The Input-Output Approach to Instrument Selection." Journal of Business & Economic Statistics 11 (2):145–155. URL https://doi.org/10.1080/07350015.1993.10509943.

Shea, Justin M. 2018. wooldridge: 111 Data Sets from "Introductory Econometrics: A Modern Approach, 6e" by Jeffrey M. Wooldridge. URL https://CRAN.R-project.org/package=wooldridge. R package version 1.3.1.

Siegelman, Peter and James J. Heckman. 1993. "The Urban Institute Audit Studies: Their Methods and Findings." In Clear and Convincing Evidence: Measurement of Discrimination in America, edited by Michael E. Fix and Raymond J. Struyk. Washington, DC: Urban Institute Press, 187–258. URL http://webarchive.urban.org/publications/105136.html.

Spence, Michael. 1973. "Job Market Signaling." Quarterly Journal of Economics 87 (3):355–374. URL https://www.jstor.org/stable/1882010.

Stock, James H. and Mark W. Watson. 2015. Introduction to Econometrics. Pearson, 3rd updated ed. URL https://www.pearson.com/us/higher-education/product/Stock-Introduction-to-Econometrics-Update-3rd-Edition/9780133486872.html.

Street, Brittany. 2018. "The Impact of Economic Opportunity on Criminal Behavior: Evidence from the Fracking Boom." Working paper, available at https://sites.google.com/site/brittanyrstreet/research.

Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. Cengage, 7th ed.

Zeileis, Achim. 2004. "Econometric Computing with HC and HAC Covariance Matrix Estimators." Journal of Statistical Software 11 (10):1–17.

Zeileis, Achim and Torsten Hothorn. 2002. "Diagnostic Checking in Regression Relationships." R News 2 (3):7–10. URL https://CRAN.R-project.org/doc/Rnews.


Contents

Preface

Textbook Learning Objectives

Notation

Getting Started with R (or Stata)

Comparison of R and Stata

R

Accessing the Software

Installing Packages

Stata


Installing Additional Commands

Optional Resources

R Tutorials

R Quick References

Running Code in This Textbook

Stata Resources

Empirical Exercises

I Analysis of One Variable

Introduction


The World is Random

Before and After Two Perspectives

Before and After Sampling

Outcomes and Mechanisms

Population Types

Finite Population

Infinite Population

Superpopulation

Which Population is Most Appropriate

Description of a Population

Overview of Distributions and Their Features

Binary Variable

Discrete Variable

Categorical or Ordinal Variable

Continuous Variable

Prelude to Prediction Precipitation

Easy: "Predict" Current Weather

Minimizing Mean Loss

Different Probability

Different Loss Function

Prediction with a Known Distribution

Common Loss Functions

Optimal Prediction Generic Examples

Optimal Prediction Specific Examples

Mean and Mode as Optimal Predictions

Interval Prediction

One Variable Sample

Bayesian and Frequentist Perspectives

Very Brief Overview Bayesian Approach

Very Brief Overview Frequentist Approach

Bayesian and Frequentist Differences

Types of Sampling

Independent

Identically Distributed

Examples

The Empirical Distribution

Estimation of the Population Mean

"Description": Sample Mean

"Prediction": Least Squares

Non-iid Sampling Weights

Sampling Distribution of an Estimator

Some Mathematical Calculations

Graphs Binary Population

Graphs Continuous Population

Table Values in Repeated Samples

Sampling Distribution Approximation

Non-iid Sampling

Quantifying Accuracy of an Estimator

Bias

Mean Squared Error

Consistency

Asymptotic MSE

Quantifying Uncertainty Frequentist Approaches

Standard Errors

Confidence Intervals

p-values

Statistical Significance

Hypothesis Testing

Mental Math for Statistical Uncertainty

Quantifying Uncertainty Misinterpretation and Misuse

Perils of Ignoring Non-iid Sampling

Not a Bayesian Belief

Unlikely Events Happen (or Use Common Sense)

Example of Ignoring Outside Knowledge

Multiple Testing (Multiple Comparisons)

Publication Bias and Science

Ignoring Point Estimates (Economic Significance)

Other Sources of Uncertainty

Statistical Decision Theory

Empirical Exercises

One Variable Two Populations

Description

Population Mean Difference

Estimation

Quantifying Uncertainty

Prediction

Causality Overview

Correlation Does Not Imply Causation

Structural and Reduced Form Approaches

General Equilibrium and Partial Equilibrium

Potential Outcomes Framework

Potential Outcomes

Treatment Effects

SUTVA

Average Treatment Effect

Definition and Interpretation

ATE Examples

Limitation of ATE

ATE Identification

Setup and Identification Question

Randomization

Reasons for Identification Failure

ATE Estimation and Inference

Empirical Exercises

Midterm Exam 1

II Regression

Introduction

Comparing Two Distributions by Regression

Logic

Terminology

Theorems

Comparing Assumptions

Preliminaries

Population Mean Model in Error Form

Joint and Marginal Distributions

Conditional Distributions

Conditional Mean

Comparison of Joint Marginal and Conditional Distributions

Independence and Dependence

Population Model Conditional Expectation Function

Conditional Expectation Function

CEF Error Term

CEF Model in Error Form

Linear CEF Model

Interpretation Description and Prediction

Interpretation with Values Besides 0 and 1

Population Model Potential Outcomes

Population Model Structural

Linear Structural Model

General Structural Model and ASE

Identification



Average Structural Effect

Estimation OLS

Code


Heteroskedasticity

Code

Empirical Exercises


Misspecification

Coping with Misspecification

Model of Three Values

More Than Three Values

Linear Projection

Geometric Intuition

Probabilistic Projection

Formulas and Interpretation

Linear Projection Model in Error Form

"Best" Linear Approximation


Limitations

"Best" Linear Predictor

Causality Under Misspecification

OLS Estimation and Inference

OLS Estimator Insights

Statistical Properties

Code


Empirical Exercises

Nonlinear and Nonparametric Regression

Log Transformation

Properties of the Natural Log Function

The Log-Linear Model

The Linear-Log Model

The Log-Log Model

Warning Model-Driven Results

Code

Nonlinear-in-Variables Regression

Linearity

Nonlinearity

Estimation and Inference

Parameter Interpretation

Description Prediction and Causality

Code

Nonparametric Regression

Empirical Exercises

Regression with Two Binary Regressors

Omitted Variable Bias

An Allegory

Formal Conditions

Consequences

OVB in Linear Projection

Linear-in-Variables Model

Fully Saturated Model

Structural Identification by Exogeneity

Identification by Conditional Independence

Collider Bias

Causal Identification Difference-in-Differences

Bad Approaches

Counterfactuals and Parallel Trends

Identification

Extensions


Empirical Exercises

Regression with Multiple Regressors



Model and Coefficient Interpretation

Limitations

Code

Interaction Terms

Limitation of Linear-in-Variables Model

Interpretation of Interaction Term

Non-Binary Interaction

Code

Other Examples

Assumptions for Linear Projection

Multicollinearity (Two Types)

Formal Assumptions and Results



General Structural Model

Conditional ATE

Empirical Exercises

Midterm Exam 2

Internal and External Validity

Terminology

Threats to External Validity

Different Place

Different Time

Different Population

Threats to Internal Validity

Functional Form Misspecification

Measurement Error in the Outcome Variable

Measurement Error in the Regressors

Non-iid Sampling and Survey Weights

Missing Data

Sample Selection

Omitted Variable Bias and Collider Bias

Simultaneity and Reverse Causality

Empirical Exercises

III Time Series

Introduction


Terms and Notation

Populations Randomness and Sampling

Stationarity

Autocovariance and Autocorrelation

Estimation

Mean

Autocovariances and Autocorrelations

Nonstationarity

Trends

Seasonality

Cycles

Structural Breaks

Decomposition

Transformations

Empirical Exercises


Model

Description

Prediction (Forecasting) Optimality

Estimation

Code

Parameter Stability

Multi-Step Forecast

Intuition Mean Zero (Special Case)

General Results with AR Parameters

Direct Approach

Code

Interval Forecasts

Goal and Sources of Uncertainty

Intervals Assuming Normality

Code

More R Examples

AR(1) Multi-Step Forecast Intervals

General R Forecast Allowing Seasonality and Trend

Empirical Exercises

AR(p) and ADL Models

The AR(p) Model

Model Selection How Many Lags

Difficulties and Intuition

AIC and BIC Formulas

Comparison of AIC and BIC

Code

Autoregressive Distributed Lag Regression

Empirical Exercises

Final Exam

Bibliography

Index














Preface


























Notation


Variables


X =





\[ z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix}, \qquad z' = (z_1, z_2, \ldots, z_k) \]


Xprime =













Symbols











1n




arg maxg




Chapter 1





















In comparison Stata





1.2 R



















































1.3 Stata























1.4.1 R Tutorials



listhtml













1.4.4 Stata Resources





Empirical Exercises
































R hist(wine$liver)







Part I



Introduction
















Chapter 2













































2.2 Population Types













2.2.3 Superpopulation













Example Coin Flips






Other Examples
























Summary Features






2.3.2 Binary Variable







1A =



Y = 1heads = 1W = h =

















0 if r < 0    (2.8)




\[ E(Y) = \sum_{j=0}^{1} (j)\,P(Y = j) = \overbrace{(0)\,P(Y = 0)}^{j=0} + \overbrace{(1)\,P(Y = 1)}^{j=1} = P(Y = 1) \]















radic1(1minus 1) =



radic0(1minus 0) =





radicp(1minus p)
















\[ f_Y(y) = \sum_{j=1}^{J} 1\{y = y_j\}\,P(Y = y_j) \tag{2.12} \]



[Figure: two PMF panels plotting P(Y = y) against y = 1, 2, 3; vertical axis from 0 to 1]







\[ F_Y(y) = \sum_{j=1}^{J} 1\{y_j \le y\}\,P(Y = y_j) \tag{2.13} \]

\[ F_Y(y) = (1/3) \sum_{j=1}^{3} 1\{j \le y\} \tag{2.14} \]

FY (y) =


(215)


[Figure: CDF plot of P(Y ≤ y) against y = 1, 2, 3; vertical axis from 0 to 1]




E(Y ) =

Jsumj=1




\[ E(W) = (0)(1/2) + (2)(1/2) = 1, \qquad E(Z) = (2)(1/2) + (4)(1/2) = 3 \tag{2.17} \]

\[ E(W) = \sum_{j=1}^{4} (j)(j/10) = (1)(1/10) + (2)(2/10) + (3)(3/10) + (4)(4/10) = 3 \]

\[ E(Z) = \sum_{j=1}^{4} (j)(5 - j)/10 = (1)(4/10) + (2)(3/10) + (3)(2/10) + (4)(1/10) = 2 \]

\[ E(Y) = (10)(1/3) + (20)(1/3) + (270)(1/3) = 300/3 = 100 \tag{2.18} \]

\[ E(Y) = (10)(0.99) + (3010)(0.01) = 9.9 + 30.1 = 40 \tag{2.19} \]

\[ \begin{aligned} E(Y) &= (11)\,P(Y = 11) + (12)\,P(Y = 12) + (16)\,P(Y = 16) + (18)\,P(Y = 18) \\ &= (11)(0.2) + (12)(0.3) + (16)(0.4) + (18)(0.1) = 14 \end{aligned} \tag{2.20} \]
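The expected values in these examples are probability-weighted sums, so they are easy to check numerically. The following is a minimal R sketch, not code from the textbook: the helper name `expected_value` is hypothetical, and the inputs are the four-value example and the rare-large-outcome example from the calculations above.

```r
# Hypothetical helper: expected value of a discrete random variable,
# computed as the probability-weighted sum of its possible values.
expected_value <- function(values, probs) {
  stopifnot(length(values) == length(probs),
            all(probs >= 0),
            isTRUE(all.equal(sum(probs), 1)))
  sum(values * probs)
}

# Four possible values with probabilities 0.2, 0.3, 0.4, 0.1
ev_four <- expected_value(c(11, 12, 16, 18), c(0.2, 0.3, 0.4, 0.1))

# Rare but large outcome: 10 with probability 0.99, 3010 with probability 0.01
ev_rare <- expected_value(c(10, 3010), c(0.99, 0.01))

print(c(ev_four, ev_rare))
```

The second call illustrates how a low-probability value can dominate the mean even though almost all realizations equal 10.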


\[ E(aY + bZ) = a\,E(Y) + b\,E(Z) \tag{2.21} \]





= cE(W ) + dE(X) + bE(Z)


\[ E\left( \sum_{i=1}^{n} c_i Y_i \right) = \sum_{i=1}^{n} c_i\,E(Y_i) \tag{2.22} \]











\( \sqrt{200} \approx 14.1 \)









ing etc)


























P(Y = y) (224)






\[ P(Y \le \text{good}) = \overbrace{f_Y(\text{poor})}^{=1/5} + \overbrace{f_Y(\text{fair})}^{=1/5} + \overbrace{f_Y(\text{good})}^{=1/5} = 3/5 > 1/2 \]

P(Y ge good) =

=15︷︸︸︷fY (good) +

=15︷︸︸︷fY (great) +


(225)




P(W le great) =

=0︷︸︸︷fY (poor) +

=0︷︸︸︷fY (fair) +

=13︷︸︸︷fY (good) +

=13︷︸︸︷fY (great) = 23 gt 12

P(W ge great) =

=13︷︸︸︷fY (great) +


(226)






per 10000 people)











[Figure: PDF (density) of Y plotted against y, with a shaded region labeled Area = 0.34]










sumJj=1 yjfY (yj))










radicσ2 That is if







































\[ P(L_1 = 1) = P(Y = 0) = 0.6, \qquad P(L_1 = -1) = P(Y = 1) = 0.4 \tag{2.30} \]

\[ E(L_0) = (0.4)(1) + (0.6)(-1) = -0.2 \tag{2.31} \]

\[ E(L_1) = (0.6)(1) + (0.4)(-1) = 0.2 \tag{2.32} \]







\[ P(L_1 = 1) = P(Y = 0) = 0.3, \qquad P(L_1 = -1) = P(Y = 1) = 0.7 \tag{2.33} \]

\[ E[L(Y, 0)] = E(L_0) = (0.7)(1) + (0.3)(-1) = 0.4 \]





\[ L(0, 0) = -1, \qquad L(1, 1) = -10, \qquad L(0, 1) = L(1, 0) = 1 \tag{2.35} \]

\[ \begin{aligned} P(L_0 = 1) &= P(Y = 1) = 0.4, & P(L_0 = -1) &= P(Y = 0) = 0.6 \\ P(L_1 = 1) &= P(Y = 0) = 0.6, & P(L_1 = -10) &= P(Y = 1) = 0.4 \end{aligned} \tag{2.36} \]
















0–1 Loss







Quadratic Loss


\[ L_2(y, g) = (y - g)^2 \tag{2.39} \]













E[L(Y g)]


Two Possible Values




)=


) (240)












)=

(0 75 0

)

P(Y = a) = 07P(Y = b) = 03

(243)

Using (241)


(244)









(246)




P(Y = vj)L(vj v1)


P(Y = vj)L(vj v2)


P(Y = vj)L(vj vJ)

(247)



=


=

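\[ \begin{pmatrix} 0 & 2 & 8 \\ 1 & 0 & 2 \\ 4 & 1 & 0 \end{pmatrix} \tag{2.48} \]

Let

\[ \begin{aligned} P(Y = v_1) &= P(Y = -1) = 0.2 \\ P(Y = v_2) &= P(Y = 0) = 0.3 \\ P(Y = v_3) &= 0.5 \end{aligned} \tag{2.49} \]

\[ \sum_{j=1}^{3} P(Y = v_j)\,L(v_j, v_1) = (0.2)(0) + (0.3)(2) + (0.5)(8) = 4.6 \]

\[ \sum_{j=1}^{3} P(Y = v_j)\,L(v_j, v_2) = (0.2)(1) + (0.3)(0) + (0.5)(2) = 1.2 \]

\[ \sum_{j=1}^{3} P(Y = v_j)\,L(v_j, v_3) = (0.2)(4) + (0.3)(1) + (0.5)(0) = 1.1 \]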
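The three expected losses above amount to one matrix-vector product: each row of the loss matrix (one guess) is weighted by the probabilities of the possible values of Y. The sketch below is illustrative only. It assumes that rows index guesses and columns index realized values, and it takes the third probability to be 0.5, as in the displayed sums.

```r
# Loss matrix: row g = guess, column j = realized value of Y.
# Entries come from the example; the row/column layout is an assumption.
loss <- matrix(c(0, 2, 8,
                 1, 0, 2,
                 4, 1, 0),
               nrow = 3, byrow = TRUE)
probs <- c(0.2, 0.3, 0.5)  # P(Y = v1), P(Y = v2), P(Y = v3)

# Expected loss of each guess: sum over j of P(Y = v_j) * L(v_j, guess)
expected_loss <- as.vector(loss %*% probs)

# The optimal prediction is the guess with the smallest expected loss
best_guess <- which.min(expected_loss)

print(expected_loss)
print(best_guess)
```

Here the third guess has the smallest mean loss, although the second is close; changing either the probabilities or the loss entries can change which guess is optimal.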








\[ \begin{aligned} E(L_0(Y, 20)) &= E(1\{Y \ne 20\}) = (0.6)(0) + (0.4)(1) = 0.4 \\ E(L_0(Y, 25)) &= E(1\{Y \ne 25\}) = (0.6)(1) + (0.4)(0) = 0.6 \end{aligned} \tag{2.50} \]

\[ \begin{aligned} E[L_2(Y, 20)] &= E[(Y - 20)^2] \\ &= P(Y = 20)(20 - 20)^2 + P(Y = 25)(25 - 20)^2 \\ &= (0.6)(0) + (0.4)5^2 = (0.4)(25) = 10 \\ E[L_2(Y, 25)] &= E[(Y - 25)^2] \\ &= P(Y = 20)(20 - 25)^2 + P(Y = 25)(25 - 25)^2 \\ &= (0.6)(-5)^2 + (0.4)(0) = (0.6)(25) = 15 \end{aligned} \]

\[ \begin{aligned} E[L_2(Y, 22)] &= E[(Y - 22)^2] \\ &= P(Y = 20)(20 - 22)^2 + P(Y = 25)(25 - 22)^2 \\ &= (0.6)(-2)^2 + (0.4)(3)^2 = (0.6)(4) + (0.4)(9) = 6 \end{aligned} \]
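The quadratic-loss calculations above can be reproduced in a few lines. This is a minimal sketch under the example's distribution (Y = 20 with probability 0.6, Y = 25 with probability 0.4); the function name `expected_quadratic_loss` is hypothetical.

```r
# Distribution from the example
ages  <- c(20, 25)
probs <- c(0.6, 0.4)

# Expected quadratic loss E[(Y - g)^2] of a guess g (hypothetical helper name)
expected_quadratic_loss <- function(g) sum(probs * (ages - g)^2)

# Evaluate the three guesses considered in the example
losses <- sapply(c(20, 25, 22), expected_quadratic_loss)
names(losses) <- c("g=20", "g=25", "g=22")
print(losses)

# The mean of Y, which is the guess with the smallest expected quadratic loss
mean_y <- sum(probs * ages)
print(mean_y)
```

The guess 22 is not a possible value of Y, yet it beats both possible values under quadratic loss because it is the mean.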


Example Advertising







Example More Ages


\[ P(Y = j) = (26 - j)/21, \qquad j = 20, 21, \ldots, 25 \tag{2.51} \]


E(L0(Y g)) =25sumy=20

1y 6= gP(Y = y) =25sumy=20


=

=1︷︸︸︷25sumy=20




arg ming



P(Y = g) = 20






E[L2(Y g)] =25sumy=20




E[L2(Y g)] =25sumy=20










\[ \arg\min_{g} E[(Y - g)^2] = E(Y) \tag{2.55} \]







\[ \arg\min_{g} P(Y \ne g) = \arg\min_{g}\,[1 - P(Y = g)] = \arg\max_{g} P(Y = g) \tag{2.56} \]















Chapter 3

One Variable Sample




































































3.2.1 Independent






among other things










3.2.3 Examples



























n=

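\[ \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i = 1\} \tag{3.4} \]

\[ f_S(v_j) = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i = v_j\}, \qquad j = 1, \ldots, J \tag{3.5} \]

\[ f_S(Y_i) = 1/n, \qquad i = 1, \ldots, n \tag{3.6} \]

\[ \hat{F}_Y(y) = F_S(y) = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i \le y\} \tag{3.7} \]

\[ \bar{Y} = \hat{E}(Y) = E(S) = \frac{1}{n} \sum_{i=1}^{n} Y_i \tag{3.8} \]

\[ E[(Y - g)^2] \]

\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.9} \]

\[ \hat{g}^{*}_{2} = \bar{Y} \tag{3.10} \]

\[ \hat{g}^{*}_{2} = \arg\min_{g} \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.11} \]

\[ \sum_{i=1}^{n} \hat{U}_i^2 = \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.12} \]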
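The claim that the sample mean minimizes the sum of squared residuals can be checked numerically. The sketch below uses made-up sample values (an assumption, not data from the textbook) and compares a numerical minimizer from base R's `optimize` with `mean`.

```r
# Illustrative sample (made-up numbers, not from the textbook)
y <- c(2, 4, 4, 7, 8)

# Sum of squared residuals for a candidate guess g
ssr <- function(g) sum((y - g)^2)

# Numerical minimizer over an interval that contains all sample values
g_numeric <- optimize(ssr, interval = range(y))$minimum

# Analytical minimizer: the sample mean
g_mean <- mean(y)

print(c(numeric = g_numeric, sample_mean = g_mean))
```

The two numbers agree up to the tolerance of the numerical search, which is the sense in which "least squares" and "sample mean" give the same estimate here.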





Example





[1] 0708


[1] 0551


[1] 054
















P(Yn = 0) =


=1minusp︷︸︸︷P(Y1 = 0)


\[ P(\bar{Y}_n = 1) = \overbrace{P(Y_1 = 1)}^{=p}\;\overbrace{P(Y_2 = 1)}^{=p} = p^2 \]









[Figure: PMF of the sample mean (binary population), six panels for n = 1, 2, 4, 8, 16, 32; horizontal axis: value of the sample mean from 0 to 1]











[Figure: PDF of the sample mean (continuous population), six panels for n = 1, 2, 4, 8, 16, 32; horizontal axis: value of the sample mean]









Sample Yn 1Yn le 0


1 050 0 02 020 0 13 000 1 14 minus010 1 15 minus050 1 0

100 030 0 1

Average 001 52100 67100




Yn le 0



Yn le 0












\[ \bar{Y}_n \overset{a}{\sim} N(\mu_Y, \sigma_Y^2 / n) \tag{3.15} \]


radicn







[Figure: PDF of the sample mean, six panels for n = 1, 2, 4, 8, 16, 32; horizontal axis from −1.0 to 1.0]











3.7.1 Bias


Definitions















E[Y2] = E[(12)Y1 + (12)Y2] =

microY 2︷︸︸︷(12) E(Y1) +






\[ P(\hat{\theta}_1 = -100) = P(\hat{\theta}_1 = 100) = 1/2 \tag{3.20} \]

\[ P(\hat{\theta}_2 = 1) = 1 \tag{3.21} \]

\[ E(\hat{\theta}_1) = (1/2)(-100) + (1/2)(100) = 0, \qquad E(\hat{\theta}_2) = (1)(1) = 1 \tag{3.22} \]












(325)






Other MSE Examples


\[ P(\hat{\theta}_1 = \theta - 100) = P(\hat{\theta}_1 = \theta + 100) = 1/2, \qquad P(\hat{\theta}_2 = \theta + 1) = 1 \tag{3.27} \]




(328)




\[ \mathrm{MSE}(\hat{\beta}_1) = 1^2 + 16 = 17, \qquad \mathrm{MSE}(\hat{\beta}_2) = 10^2 + 9 = 109 \tag{3.30} \]
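Reading the two terms in each sum as squared bias and variance (the usual decomposition of mean squared error), the comparison can be written as a one-line function. This is a sketch only: the helper name `mse` is hypothetical, and the bias and variance values are read off the example above.

```r
# MSE decomposition: squared bias plus variance (hypothetical helper name)
mse <- function(bias, variance) bias^2 + variance

mse_beta1 <- mse(bias = 1,  variance = 16)  # small bias, large variance
mse_beta2 <- mse(bias = 10, variance = 9)   # large bias, small variance

print(c(mse_beta1, mse_beta2))
```

The estimator with the smaller variance ends up with the larger MSE here because its bias enters squared.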




3.7.3 Consistency













\[ \hat{\theta}_n - \theta \tag{3.32} \]


3.7.4 Asymptotic MSE






3.8.1 Standard Errors





\[ \mathrm{SE}(\hat{\theta}) \equiv \sqrt{\mathrm{Var}(\hat{\theta})} \tag{3.33} \]



Interpretation
























Example in R





3.8.3 p-values




\[ p = P(|\bar{Y}_n| \ge |\bar{Y}_o| \mid \mu_Y = 0) \tag{3.35} \]


































Computation







4552

















Example





\[ \mu_Y = E(Y) = (4/10)(0) + (3/10)(1) + (2/10)(2) + (1/10)(3) = 10/10 = 1 \]



CIs[[B]][irep] <-





















Total 949050 50950 1000000


































Examples



























Empirical Exercises







R mean(card$wage)















Chapter 4































4.1 Description






\[ \begin{aligned} &P(Y^A = 0) = 0.8, \quad P(Y^A = 1) = 0.2, \quad P(Y^A = 2) = 0 \\ &P(Y^B = 0) = P(Y^B = 1) = P(Y^B = 2) = 1/3 \end{aligned} \tag{4.1} \]

Then

\[ \begin{aligned} &\sum_{y=0}^{2} y\,P(Y^B = y) - \sum_{y=0}^{2} y\,P(Y^A = y) \\ &= [(0)(1/3) + (1)(1/3) + (2)(1/3)] - [(0)(0.8) + (1)(0.2) + (2)(0)] \\ &= [(1/3) + (2/3)] - 0.2 = 0.8 \end{aligned} \]
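The mean difference above is the difference of two probability-weighted sums. A minimal R sketch, using the two distributions from the example (variable names are illustrative):

```r
values  <- 0:2
probs_a <- c(0.8, 0.2, 0)   # P(Y^A = 0), P(Y^A = 1), P(Y^A = 2)
probs_b <- rep(1 / 3, 3)    # P(Y^B = y) = 1/3 for y = 0, 1, 2

# Population means of the two distributions
mean_a <- sum(values * probs_a)
mean_b <- sum(values * probs_b)

# Population mean difference
mean_difference <- mean_b - mean_a
print(mean_difference)
```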





4.1.2 Estimation




[1] -299







[1] 044 555

4.2 Prediction




















Figure 41

Rain


++




















tionships
































4.4.3 SUTVA

SUTVA Definition







SUTVA Violations




















4.5.2 ATE Examples


\[ E(Y^U) = (0.3)(0) + (0.3)(0) + (0.1)(1) + (0.3)(1) = 0.4 \tag{4.4} \]

\[ E(Y^T) = (0.3)(0) + (0.3)(1) + (0.1)(0) + (0.3)(1) = 0.6 \tag{4.5} \]

To verify (4.3)

Type   Probability   Y^U   Y^T   Y^T − Y^U
1      0.3           0     0      0
2      0.3           0     1      1
3      0.1           1     0     −1
4      0.3           1     1      0
Mean                 0.4   0.6    0.2

Type   Probability   Y^U ($/yr)   Y^T ($/yr)   Y^T − Y^U ($/yr)
1      0.5           40000        41000         1000
2      0.2           40000        38000        −2000
3      0.2           50000        51000         1000
4      0.1           50000        47000        −3000
Mean                 43000        43000            0

\[ E(Y^U) = (0.5)(40000) + (0.2)(40000) + (0.2)(50000) + (0.1)(50000) = 43000 \tag{4.8} \]

\[ E(Y^T) = (0.5)(41000) + (0.2)(38000) + (0.2)(51000) + (0.1)(47000) = 43000 \tag{4.9} \]
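Both ATE examples follow the same recipe: average the treated and untreated potential outcomes over the types, then difference. The sketch below is illustrative; the helper name `ate` is hypothetical, and the inputs are the type probabilities and potential outcomes from the two examples above.

```r
# Hypothetical helper: average treatment effect from a table of unit types
ate <- function(prob, y_untreated, y_treated) {
  sum(prob * y_treated) - sum(prob * y_untreated)
}

# Binary-outcome example: four types with probabilities 0.3, 0.3, 0.1, 0.3
ate_binary <- ate(prob = c(0.3, 0.3, 0.1, 0.3),
                  y_untreated = c(0, 0, 1, 1),
                  y_treated   = c(0, 1, 0, 1))

# Earnings example ($/yr): individual effects are nonzero but average to zero
ate_earnings <- ate(prob = c(0.5, 0.2, 0.2, 0.1),
                    y_untreated = c(40000, 40000, 50000, 50000),
                    y_treated   = c(41000, 38000, 51000, 47000))

print(c(ate_binary, ate_earnings))
```

The earnings case shows the limitation of the ATE as a summary: every type has a nonzero treatment effect, yet the average is zero.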





[Figure: density plot; horizontal axis marked at $12/hr, $15/hr, $18/hr]





















4.6.2 Randomization





















Empirical Exercises












R























Chapter 5

Midterm Exam 1



Part II

Regression

Introduction

Chapter 6




























6.1 Logic




6.1.1 Terminology








A    B    A ⇒ B



AB





A    B    A ⇔ B












6.1.2 Theorems
















is not consistent





more situations)

6.2 Preliminaries







\[ E(Y) = \mu_Y \tag{6.1} \]

\[ Y = \mu_Y + U, \qquad E(U) = 0 \tag{6.2} \]





always implies




























\[ P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)} \tag{6.7} \]

\[ P(Y = 1 \mid X = 1) = \frac{P(Y = 1, X = 1)}{P(X = 1)} \tag{6.8} \]



Examples











\[ E(Y \mid X = x) \tag{6.9} \]


Examples


\[ E(Y) = (0)\,P(Y = 0) + (1)\,P(Y = 1) = (0)(0.3) + (1)(0.7) = 0 + 0.7 = 0.7 \tag{6.10} \]

\[ \begin{aligned} E(Y \mid X = 1) &= (0)\,P(Y = 0 \mid X = 1) + (1)\,P(Y = 1 \mid X = 1) \\ &= (0)(0.25) + (1)(0.75) = 0 + 0.75 = 0.75 \end{aligned} \tag{6.11} \]

\[ \begin{aligned} E(Y \mid X = x) &= \sum_{j \in \{0, 20, 40\}} (j)\,P(Y = j \mid X = x) \\ &= (0)\,P(Y = 0 \mid X = x) + (20)\,P(Y = 20 \mid X = x) + (40)\,P(Y = 40 \mid X = x) \end{aligned} \tag{6.12} \]

          Y = 0   Y = 20   Y = 40
X = 11    0.10    0.05     0.05
X = 12    0.05    0.10     0.15
X = 16    0.10    0.10     0.30

\[ P(X = 16) = 0.10 + 0.10 + 0.30 = 0.5 \tag{6.13} \]

\[ \begin{aligned} P(Y = 20 \mid X = 16) &= \frac{P(Y = 20, X = 16)}{P(X = 16)} = \frac{0.10}{0.50} = 0.2 \\ P(Y = 40 \mid X = 16) &= \frac{P(Y = 40, X = 16)}{P(X = 16)} = \frac{0.30}{0.50} = 0.6 \end{aligned} \tag{6.14} \]

\[ E(Y \mid X = 16) = 0 + (20)(0.2) + (40)(0.6) = 4 + 24 = 28 \tag{6.15} \]
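The steps from the joint table to the conditional mean (marginal, then conditional probabilities, then weighted sum) can be traced in R. This is a minimal sketch using the joint probabilities from the table above; object names are illustrative.

```r
# Joint probabilities P(X = x, Y = y): rows are X, columns are Y
joint <- matrix(c(0.10, 0.05, 0.05,
                  0.05, 0.10, 0.15,
                  0.10, 0.10, 0.30),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("11", "12", "16"),
                                Y = c("0", "20", "40")))
y_values <- c(0, 20, 40)

# Marginal P(X = x): sum each row of the joint table
p_x <- rowSums(joint)

# Conditional P(Y = y | X = x): divide each row by its marginal probability
cond_y_given_x <- joint / p_x

# Conditional mean E(Y | X = x) for every x
cond_mean <- as.vector(cond_y_given_x %*% y_values)
names(cond_mean) <- rownames(joint)
print(cond_mean)
```

The entry for X = 16 reproduces the worked value; the other two entries come from applying the same steps to the first two rows of the table.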




\[ P(X = x, Y = y) = P(Y = y \mid X = x)\,P(X = x) \tag{6.16} \]


























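\[ m(x) \equiv E(Y \mid X = x) \tag{6.20} \]

\[ m(0) = E(Y \mid X = 0), \qquad m(1) = E(Y \mid X = 1) \tag{6.21} \]

\[ m(0) = P(Y = 1 \mid X = 0) = \frac{P(Y = 1, X = 0)}{P(X = 0)} = \frac{0.1}{0.2} = 0.5 \]

\[ P(m(X) = 0.5) = P(X = 0) = 0.2, \qquad P(m(X) = 0.75) = P(X = 1) = 0.8 \tag{6.22} \]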

6.3.2 CEF Error Term







\[ = m(x) - m(x) = 0 \]

\[ Y = m(X) + V, \qquad E(V \mid X) = 0 \tag{6.25} \]

\[ = m(0) + [m(1) - m(0)]X + V \tag{6.27} \]

\[ Y = \beta_0 + \beta_1 X + V, \qquad E(V \mid X) = 0 \tag{6.28} \]

















\[ E[(Y - g(X))^2] \tag{6.30} \]









m(13 yr) = β0 + (13 yr)β1 m(12 yr) = β0 + (12 yr)β1







m(16 yr) = β0 + (16 yr)β1 m(12 yr) = β0 + (12 yr)β1












\[ Y = (Y^U)(1 - X) + (Y^T)(X) \tag{6.31} \]








= β0 +Xβ1 +









\[ Y = \beta_0 + \beta_1 X + U \tag{6.34} \]

\[ Y = h(X, U) \tag{6.35} \]



Y = h(X U) = β0 + β1X +

U︷︸︸︷g(U)














6.6 Identification






\[ E(Y \mid X = x) = \gamma_0 + \gamma_1 x \tag{6.38} \]











\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \tag{6.39} \]

\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \implies \overbrace{\mathrm{Cov}(U, X) = 0}^{\text{A6.3}} \tag{6.40} \]







Y = β0 + β1X + U

= β0 + β1X + U +


=

γ0︷︸︸︷β0 + E(U) +

γ1︷︸︸︷β1 X +






In Practice




























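\[ E(Y^T) = E(Y^T \mid X = 1), \qquad E(Y^U) = E(Y^U \mid X = 0) \tag{6.43} \]

\[ E(Y^T \mid X = 1) = E(Y \mid X = 1), \qquad E(Y^U \mid X = 0) = E(Y \mid X = 0) \tag{6.44} \]

\[ \begin{aligned} &= \overbrace{E(Y^T \mid X = 1)}^{\text{use (6.44)}} - \overbrace{E(Y^U \mid X = 0)}^{\text{use (6.44)}} \\ &= E(Y \mid X = 1) - E(Y \mid X = 0) \end{aligned} \]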



In Practice














=


use A67︷︸︸︷E[h(0U) | X = 0]



6.7 Estimation: OLS










is


1

n

nsumi=1



m(x) = β0 + β1x (647)




1

n

nsumi=1


(648)









sumni=1 U

2i as small



β0prarr β0 β1


prarr m(1) (651)

6.7.1 Code



(Intercept)           X
          2           5

[1] 2 7














6.8.2 Code






















Empirical Exercises



jtrain2













Chapter 7


















7.1 Misspecification

\[ Y = \beta_0 + \beta_1 X + U \tag{7.1} \]






\[ m(x) = \beta_0 + \beta_1 x, \qquad x = 0, 1, 2 \]






[Figure: m(X) plotted against X = 0, 1, 2]














\[ 1\{X = j\} = \begin{cases} 1 & \text{if } X = j \\ 0 & \text{otherwise} \end{cases}, \qquad j = 0, 1, 2 \tag{7.2} \]




























\[ p = \arg\min_{s \in S} d_E(y, s) \tag{7.5} \]

\[ d(A, B) \equiv \sqrt{E[(A - B)^2]} \tag{7.7} \]





β1 =Cov(YX)



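\[ \beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X^2}\,\frac{\sigma_Y}{\sigma_Y} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X \sigma_Y}\,\frac{\sigma_Y}{\sigma_X} = \mathrm{Corr}(Y, X)\,\frac{\sigma_Y}{\sigma_X} \]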
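The same identity holds for the sample analogs, which gives a quick numerical check of the algebra. The sketch below uses made-up data (an assumption, not a textbook dataset) and compares the covariance form, the correlation form, and the slope reported by base R's `lm`.

```r
# Made-up illustrative data (not from the textbook)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope as covariance over variance
slope_cov <- cov(y, x) / var(x)

# Slope as correlation times the ratio of standard deviations
slope_cor <- cor(y, x) * sd(y) / sd(x)

# Slope from OLS via lm()
slope_ols <- unname(coef(lm(y ~ x))["x"])

print(c(slope_cov, slope_cor, slope_ols))
```

All three numbers coincide, which is why the slope can be read as a rescaled correlation: it has the sign of the correlation and is scaled by the ratio of the two standard deviations.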






[1X

] (710)



\[ \begin{pmatrix} 1 & E(X) \\ E(X) & E(X^2) \end{pmatrix}^{-1} \begin{pmatrix} E(Y) \\ E(XY) \end{pmatrix} \tag{7.11} \]







\[ Y = \beta_0 + \beta_1 X + U, \qquad E(U) = \mathrm{Cov}(X, U) = 0 \tag{7.13} \]








LP(Y | 1 X) = β0+β1X =






7.4.2 Limitations









LP(Y | 1 X) = β0+β1X =



E[Yminus(a+bX)]2





















β1 =Cov(YX)

Var(X)=

1n




\[ \bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{X} \equiv \frac{1}{n} \sum_{i=1}^{n} X_i \tag{7.16} \]



n







Assumptions















Theoretical Results











7.7.3 Code















(Intercept) D1 D2 598 -198 -198




















ified


Empirical Exercises











R table(dat$bowl4)








for m1 and m2











Chapter 8



































[Figure: graph of ln(x) against x, for x from 0 to 7]







\[ 100\left( \frac{v_2 - v_1}{v_1} \right) = 100\left( \frac{v_2}{v_1} - 1 \right) \]






Interpretation


\[ \ln(Y) = \beta_0 + \beta_1 X + U \tag{8.2} \]





When to Use It





\[ E(Y \mid X = x) = e^{\beta_0 + \beta_1 x}\,E(e^{U} \mid X = x) \]




Interpretation


\[ Y = \beta_0 + \beta_1 \ln(X) + U \tag{8.3} \]




\[ \overbrace{\beta_0 + \beta_1 \ln(60)}^{X = 60} - \overbrace{[\beta_0 + \beta_1 \ln(40)]}^{X = 40} = \beta_1 \ln(1.5) = 0.41\beta_1 \]


When to Use It




Interpretation


\[ \ln(Y) = \beta_0 + \beta_1 \ln(X) + U \tag{8.4} \]


When to Use It
















8.1.6 Code

[Figure: four panels of fitted curves, titled Linear, Log-Linear, Linear-Log, and Log-Log]
















8.2.1 Linearity


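\[ w_1 A + w_2 B \tag{8.5} \]

\[ w_1 Y_1 + w_2 Y_2 + w_3 Y_3 + w_4 Y_4 = \sum_{i=1}^{4} w_i Y_i \tag{8.6} \]

\[ w_1 \beta_0 + w_2 \beta_1 = (1)(\beta_0) + (X)(\beta_1) = \beta_0 + \beta_1 X \]

\[ (w_1)(X^0) + (w_2)(X) = (\beta_0)(1) + (\beta_1)(X) = \beta_0 + \beta_1 X \]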







8.2.2 Nonlinearity


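\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + U \tag{8.7} \]

\[ (1)(\beta_0) + (X)(\beta_1) + (X^2)(\beta_2) = \beta_0 + \beta_1 X + \beta_2 X^2 \]

\[ \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 \]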



13


\[ \sum_{j=0}^{J} \beta_j f_j(X) \tag{8.8} \]

\[ Y = \beta_0 X^{\beta_1} + U \tag{8.9} \]

























\[ Y = f(X) + U \tag{8.10} \]




f(x2)minus f(x1)





1 from 1 to 2










LP(Y | 1 XX2) = β0 + β1X + β2X2

= arg minabc

d(Y a+ bX + cX2)

= arg minabc



LP(Y | 1 XX2) = β0 + β1X + β2X2

=



=


E[Y minus (a+ bX + cX2)]2 (812)







\[ m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \]







21)



8.2.6 Code

[Figure: four panels of fitted curves, titled Linear, Quadratic, Cubic, and Trigonometric; horizontal axis from 0.0 to 3.0]
































[Figure: four panels of nonparametric fits, titled GCV, LOOCV, Undersmoothed, and Oversmoothed; horizontal axis from 0.0 to 1.0]









Empirical Exercises
































pntsprd

















Chapter 9



































9.1.1 An Allegory









\[ Y = \beta_0 + \beta_1 X + \beta_2 Q + V \tag{9.1} \]

\[ Y = \beta_0 + \beta_1 X + U, \qquad U \equiv \beta_2 Q + V \tag{9.2} \]













Example
















9.1.3 Consequences


Formulas





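\[ \hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} \tag{9.3} \]

\[ \operatorname*{plim}_{n \to \infty} \hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} \tag{9.4} \]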




plimnrarrinfin














=0︷︸︸︷Cov(XV ) (96)


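\[ \hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} = \beta_1 + \beta_2 \frac{\mathrm{Cov}(X, Q)}{\mathrm{Var}(X)} = \beta_1 + \beta_2\,\mathrm{Corr}(X, Q) \sqrt{\frac{\mathrm{Var}(Q)}{\mathrm{Var}(X)}} \]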





Example



















\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{9.8} \]

Misspecification


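\[ \begin{aligned} m(0, 0) &= E(Y \mid X_1 = 0, X_2 = 0), & m(0, 1) &= E(Y \mid X_1 = 0, X_2 = 1) \\ m(1, 0) &= E(Y \mid X_1 = 1, X_2 = 0), & m(1, 1) &= E(Y \mid X_1 = 1, X_2 = 1) \end{aligned} \tag{9.9} \]

\[ m(0, 0) = 0, \quad m(1, 0) = 1, \quad m(0, 1) = 2, \quad m(1, 1) = 4 \]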





More Consideration





\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{9.14} \]

\[ m(x_1, x_2) = \beta_0 + (\beta_1)(x_1) + (\beta_2)(x_2) + (\beta_3)(x_1)(x_2) \]
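With two binary regressors, the four coefficients of the model with an interaction term are determined by the four conditional means. The sketch below illustrates this with the values m(0,0) = 0, m(1,0) = 1, m(0,1) = 2, m(1,1) = 4 from the misspecification example; the coefficient names `b0` to `b3` are illustrative stand-ins for β0 to β3.

```r
# Conditional means m(x1, x2) from the example: rows x1 = 0, 1; columns x2 = 0, 1
m <- matrix(c(0, 2,
              1, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(x1 = c("0", "1"), x2 = c("0", "1")))

# Coefficients of m(x1, x2) = b0 + b1*x1 + b2*x2 + b3*x1*x2
b0 <- m["0", "0"]
b1 <- m["1", "0"] - m["0", "0"]
b2 <- m["0", "1"] - m["0", "0"]
b3 <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])
coefs <- c(b0 = b0, b1 = b1, b2 = b2, b3 = b3)
print(coefs)

# Check: the four coefficients reproduce all four conditional means
fitted <- outer(0:1, 0:1, function(x1, x2) b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2)
print(all.equal(unname(fitted), unname(m), check.attributes = FALSE))
```

Because the interaction coefficient is nonzero for these values, no model without the interaction term can match all four conditional means.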


(918)


(915)︷︸︸︷β0 = m(0 0) (919)

β1 =


β2 =


β3 = [β2 + β3]minus [β2] =


(916) minus (915)︷︸︸︷[(β0 + β2)minus (β0)]

=




=


(917) minus (915)︷︸︸︷[(β0 + β1)minus (β0)]



















(925)




(927)








\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + U \tag{9.28} \]



















= P(X2 = 1) CATE(1) + P(X2 = 0) CATE(0)


9.6 Collider Bias








Fever No fever


Falafel 50 50 45 0 5 50Salad 50 50 47 20 3 30




P(E coli)]= (05)(06) = 03







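\[ \begin{aligned} \overbrace{E(Y \mid X = 1, Z = 0)}^{5/55} - \overbrace{E(Y \mid X = 0, Z = 0)}^{3/33} &= 0 \\ \overbrace{E(Y \mid X = 1, Z = 1)}^{45/45} - \overbrace{E(Y \mid X = 0, Z = 1)}^{47/67} &= 0.30 \end{aligned} \tag{9.30} \]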











9.7.1 Bad Approaches



















our city before






[Figure: difference-in-differences diagram, before and after: m(0,0) and m(0,1) for the other city, m(1,0) and actual m(1,1) for the "treated" city, with the change m(0,1) − m(0,0) marked]



9.7.3 Identification









\[ \begin{aligned} &E(Y^0 \mid X_1 = 1, X_2 = 1) - E(Y^0 \mid X_1 = 1, X_2 = 0) \\ &= E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0) \end{aligned} \tag{9.34} \]

\[ E(Y^0 \mid X_1 = 1, X_2 = 1) = m(1, 0) + [m(0, 1) - m(0, 0)] \tag{9.35} \]


ATT = E(Y 1 minus Y 0 | X1 = 1 X2 = 1)

=





= β3






9.7.4 Extensions













25 975 (Intercept) 930 1077 X1 431 604 X2 497 727 X1X2 220 527


Empirical Exercises



driving



















R








Stata







wage1















Chapter 10






























\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + V \tag{10.1} \]



















Jsumj=1





= [β0 + β1(x1 + 1) +

Jsumj=2


Jsumj=2


(106)






= [β0 + β1(x1 + ∆1) +Jsumj=2


βjxj ] = β1(x1 + ∆1 minus x1) = β1∆1

10.2.2 Limitations




10.2.3 Code












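\[ Y = g(X, D) + U, \qquad g(X, D) = \beta_0 + \beta_1 X + \beta_2 D \tag{10.8} \]

\[ g(X, 0) = \beta_0 + \beta_1 X \tag{10.9} \]

\[ g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \beta_1 X \tag{10.10} \]

\[ g(X, D) = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 D X \tag{10.11} \]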



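\[ g(X, 0) = \beta_0 + \beta_1 X + \overbrace{(\beta_2)(0) + (\beta_3)(0)(X)}^{=0} = \beta_0 + \beta_1 X \tag{10.12} \]

\[ g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) + (\beta_3)(1)(X) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + \beta_3)}^{\text{slope}} X \tag{10.13} \]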


[Figure: two lines plotted against X: D = 0 with intercept β0 and slope β1; D = 1 with intercept β0 + β2 and slope β1 + β3]




\[ g(X, D) = (\beta_0 + \beta_1 X) + D(\beta_2 + \beta_3 X) \tag{10.14} \]












and D = X2 in (10.11)

\[ g(X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{10.15} \]


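\[ g(X_1, X_2) = \overbrace{(\beta_0 + \beta_2 X_2)}^{\text{intercept}} + \overbrace{(\beta_1 + X_2 \beta_3)}^{\text{slope}} X_1 \tag{10.16} \]

\[ g(X_1, a) = \overbrace{(\beta_0 + \beta_2 a)}^{\text{intercept}} + \overbrace{(\beta_1 + a\beta_3)}^{\text{slope}} X_1 \tag{10.17} \]

\[ g(X_1, b) = \overbrace{(\beta_0 + \beta_2 b)}^{\text{intercept}} + \overbrace{(\beta_1 + b\beta_3)}^{\text{slope}} X_1 \tag{10.18} \]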



Y = g(X1 X2) = 5minus 15X1 + 2X2 + 2X1X2 (1019)






10.3.4 Code






1 2 1189 328

10.4 Other Examples




24






























radicn










\[ Y = \beta_0 + \sum_{j=1}^{J} \beta_j X_j + U \tag{10.21} \]

\[ Y = h(X_1, X_2, \ldots, X_J, U) \tag{10.22} \]
















Empirical Exercises



fertil2




















hprice2

















Chapter 11

Midterm Exam 2





Chapter 12






















12.1 Terminology















12.2.2 Different Time















are described below





















Regression





\[ Y = \beta_0 + \beta_1 X + \overbrace{(V + M)}^{U} = \beta_0 + \beta_1 X + U \tag{12.2} \]


plimnrarrinfin


Var(X)=

Cov(XV +M)

Var(X)=


Var(X)



plimnrarrinfin







γ1X see (123)

Example







[Figure: exercise (Y) plotted against gym membership (X = 0, 1), with the bias marked]











Xlowast is





plimnrarrinfin


Var(X)




=













00 05 10 15 20 25 30

00

10

20

30

X or X

Y

With XWith X



General Bias













[1] 425 428 428




12.3.5 Missing Data








By College Degree

0 1

Ear

ning

sAllObserved

Combined

Data Mean





























Empirical Exercises





describe



R nrow(df)

Stata count






















R nrow(df)

Stata count


















Part III

Time Series

Introduction

Chapter 13








































13.3 Stationarity



















\[ \frac{\gamma_j}{\sigma_Y^2} = \frac{\gamma_j}{\gamma_0} \tag{13.5} \]




2Y = σ2Y = γ0







13.5 Estimation

13.5.1 Mean


1

T

Tsumt=1







T

Tsumt=1+j





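\[ \hat{\gamma}_j = (1/4) \sum_{t=1+j}^{4} Y_t Y_{t-j} \]

\[ \hat{\gamma}_0 = (1/4)(Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2) = (1/4)(4) = 1 \]

\[ \hat{\gamma}_1 = (1/4)(Y_2 Y_1 + Y_3 Y_2 + Y_4 Y_3) = (1/4)(1 + (-1) + 1) = 1/4 = 0.25 \]

\[ \hat{\gamma}_3 = (1/4)(Y_4 Y_1) = -1/4 = -0.25 \]

[1]  1.00  0.25 -0.50 -0.25

\[ \hat{\rho}_j = \hat{\gamma}_j / \hat{\gamma}_0 \tag{13.8} \]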
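The worked autocovariances above can be reproduced directly. The series used in the sketch below is an assumption: the values (1, 1, −1, −1) are chosen only because they reproduce the products shown in the example (Y2Y1 = 1, Y3Y2 = −1, Y4Y3 = 1, Y4Y1 = −1). The helper name `autocov` is hypothetical.

```r
# Assumed series: chosen only because it reproduces the products in the
# worked example (Y2*Y1 = 1, Y3*Y2 = -1, Y4*Y3 = 1, Y4*Y1 = -1)
y <- c(1, 1, -1, -1)
n_obs <- length(y)

# Sample autocovariance at lag j, treating the mean as zero and dividing by T
autocov <- function(j) {
  sum(y[(1 + j):n_obs] * y[1:(n_obs - j)]) / n_obs
}

gamma_hat <- sapply(0:3, autocov)

# Autocorrelations: each autocovariance divided by the lag-0 autocovariance
rho_hat <- gamma_hat / gamma_hat[1]

print(gamma_hat)
print(rho_hat)
```

Because the lag-0 autocovariance equals 1 for this series, the autocorrelations coincide with the autocovariances. Note that higher lags are averaged over fewer products but still divided by T, which is the convention in the formula above.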

Difficulties



Code







8 066 9370 9 067 9589 10 070 10043 11 074 10622 12 076 10868

13.6 Nonstationarity




13.6.1 Trends

Stochastic Trends














\[ = y_t + E(\varepsilon_{t+1}) = y_t + 0 = y_t \tag{13.11} \]

\[ E(Y_t) = E(t + \varepsilon_t) = t + E(\varepsilon_t) = t + 0 = t \]

\[ E(Y_5 \mid Y_4 = 4.6) = E(Y_4 + 1 + \varepsilon_5 \mid Y_4 = 4.6) = 4.6 + 1 = 5.6 \tag{13.13} \]


13.6.2 Seasonality








[Figure: two time series plots over the 1950s: AirPassengers and log(AirPassengers) against time]



13.6.3 Cycles







13.7 Decomposition














[Figure: time series decomposition with panels observed, trend, seasonal, and random; time axis from 1960 to 1990]


(ppm)






[Figure: time series decomposition with panels observed, trend, seasonal, and random; time axis from 1950 to 1960]






13.8 Transformations







Empirical Exercises











2 2 2 2 1)














































Chapter 14

























14.1 Model










micro = E(Yt) =




(144)




1minus φ1 (145)









(147)

14.2 Description




=σ2Y︷︸︸︷

Var(Yt) =



=


σ2ε in (142)︷︸︸︷Var(εt) +2


= φ21

σ2Y︷︸︸︷



\[ = \phi_1^2 \sigma_Y^2 + \sigma_\varepsilon^2 \]

\[ \frac{\sigma_\varepsilon^2}{1 - \phi_1^2} \tag{14.8} \]




=




\[ = \phi_1 \sigma_Y^2 \tag{14.9} \]



=



\[ = \phi_1 \gamma_1 = \phi_1 \overbrace{\phi_1 \sigma_Y^2}^{(14.9)} = \phi_1^2 \sigma_Y^2 \tag{14.10} \]



=





\[ = \phi_1^j \sigma_Y^2 \tag{14.11} \]

\[ (\phi_1^j \sigma_Y^2) / \sigma_Y^2 = \phi_1^j \tag{14.12} \]




2Y







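\[ \begin{aligned} E(Y_t \mid Y_{t-1}) &= E(\phi_0 + \phi_1 Y_{t-1} + \varepsilon_t \mid Y_{t-1}) \\ &= \phi_0 + \phi_1 Y_{t-1} + \overbrace{E(\varepsilon_t)}^{=0} \\ &= \phi_0 + \phi_1 Y_{t-1} \end{aligned} \tag{14.13} \]

\[ \hat{Y}_{T+1} = \hat{\phi}_0 + \hat{\phi}_1 Y_T \tag{14.14} \]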







14.4 Estimation

\[ Y_t = \mu + \varepsilon_t \]









14.4.1 Code






SE(PhiHat1)=0100


[1] 0196 0966



[1] 0196 0196 0196














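\[ E(Z_{T+1} \mid Z_T) = E(\phi_1 Z_T + \varepsilon_{T+1} \mid Z_T) = \phi_1 Z_T + \overbrace{E(\varepsilon_{T+1} \mid Z_T)}^{=0} = \phi_1 Z_T \tag{14.16} \]

\[ E(Z_{T+2} \mid Z_T) = E(\phi_1 Z_{T+1} + \varepsilon_{T+2} \mid Z_T) = \phi_1 \overbrace{E(Z_{T+1} \mid Z_T)}^{=\phi_1 Z_T \text{ by (14.16)}} + \overbrace{E(\varepsilon_{T+2} \mid Z_T)}^{=0} = \phi_1^2 Z_T \]

\[ E(Z_{T+h} \mid Z_T) = \phi_1^h Z_T \tag{14.17} \]

\[ \hat{Z}_{T+h} = \hat{\phi}_1^h Z_T \tag{14.18} \]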





(1419)



(1420)



YT+1 =φ0




\[ Y_{t+h} = \phi_0 + \phi_1 Y_t + \varepsilon_{t+h} \tag{14.21} \]

\[ \hat{Y}_{T+h} = \hat{\phi}_0 + \hat{\phi}_1 Y_T \tag{14.22} \]

14.6.4 Code


















\[ [\hat{Y}_{T+1} - 1.96\hat{\sigma}_\varepsilon,\; \hat{Y}_{T+1} + 1.96\hat{\sigma}_\varepsilon] \tag{14.23} \]
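The interval above is the point forecast plus and minus 1.96 estimated error standard deviations. A minimal sketch follows; the helper name `interval_forecast` and the numbers are made up for illustration, and the interval relies on the normality assumption stated for this section.

```r
# Hypothetical helper: 95% interval forecast assuming normal errors,
# point forecast plus/minus 1.96 estimated error standard deviations
interval_forecast <- function(point_forecast, sigma_eps) {
  c(lower = point_forecast - 1.96 * sigma_eps,
    upper = point_forecast + 1.96 * sigma_eps)
}

# Made-up numbers: point forecast 3.0 and estimated sigma 0.5
interval_95 <- interval_forecast(point_forecast = 3.0, sigma_eps = 0.5)
print(interval_95)
```

This sketch accounts only for the uncertainty from the future error term; it ignores estimation error in the coefficients.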




radic3radic


radic3 and




14.7.3 Code






14.8 More R Examples



t

Yt

or Y

t

0 20 40 60 80 100

12

34

56

7








1950 1955 1960 1965

45

55

65

1950 1955 1960 1965

45

55

65


111 463 288 637 195 730 112 464 289 638 197 731 113 464 290 639 197 732 114 465 290 640 198 732 115 465 291 640 198 732


T











Empirical Exercises









R holdout <- 20

















Chapter 15


















15.1 The AR(p) Model



psumj=1



















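\[ \mathrm{AIC}(p) = T \ln(\mathrm{SSR}) + \overbrace{2(p + 1)}^{\text{penalty}} \tag{15.2} \]

\[ \mathrm{BIC}(p) = T \ln(\mathrm{SSR}) + \overbrace{(p + 1) \ln(T)}^{\text{penalty}} \tag{15.3} \]

\[ \begin{aligned} \mathrm{AIC}(1) &= \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{2(p + 1)}^{4} = 15, & \mathrm{BIC}(1) &= \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{(p + 1) \ln(T)}^{7.8} = 18.8 \\ \mathrm{AIC}(2) &= \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{2(p + 1)}^{6} = 14, & \mathrm{BIC}(2) &= \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{(p + 1) \ln(T)}^{11.7} = 19.7 \end{aligned} \]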
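The example above shows the two criteria disagreeing: each adds a penalty to the same fit term, but the BIC penalty grows with ln(T). The sketch below reproduces the arithmetic. The fit terms 11 and 8 come from the example; the sample size T = 50 is an assumption, inferred only because (p + 1) ln(50) is about 7.8 for p = 1.

```r
# Fit terms T*ln(SSR) as reported in the example for p = 1 and p = 2
fit_term <- c(p1 = 11, p2 = 8)
n_lags <- c(1, 2)

# Assumption: T = 50, inferred because (p + 1) * ln(50) is about 7.8 for p = 1
n_periods <- 50

# Each criterion is the fit term plus its penalty
aic <- fit_term + 2 * (n_lags + 1)
bic <- fit_term + (n_lags + 1) * log(n_periods)

print(rbind(AIC = aic, BIC = round(bic, 1)))

# Each criterion selects the lag length with the smallest value
print(c(AIC_choice = n_lags[which.min(aic)], BIC_choice = n_lags[which.min(bic)]))
```

AIC selects two lags and BIC selects one, which reflects the heavier BIC penalty on each additional parameter whenever ln(T) exceeds 2.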





15.2.4 Code





[1] -0434


[1] 186 185




















Empirical Exercises



















Stata




R








Stata









Chapter 16

Final Exam


























Preface


























Notation


Variables


X =





\[ z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix} \]

\[ z' = (z_1, z_2, \ldots, z_k) \]


Xprime =













Symbols











1n




arg maxg




Chapter 1














1







In comparison Stata





1.2 R



















































1.3 Stata























1.4.1 R Tutorials



listhtml













1.4.4 Stata Resources





Empirical Exercises
































R hist(wine$liver)







Part I


11

Introduction











13

14





Chapter 2

















15




























2.2 Population Types













2.2.3 Superpopulation













Example Coin Flips






Other Examples
























Summary Features






2.3.2 Binary Variable







1A =



Y = 1heads = 1W = h =

















0 if r < 0


(2.8)




\[ E(Y) = \sum_{j=0}^{1} (j)\,P(Y = j) = \overbrace{(0)\,P(Y = 0)}^{j=0} + \overbrace{(1)\,P(Y = 1)}^{j=1} = P(Y = 1) \]















$\sqrt{1(1-1)} =$



$\sqrt{0(1-0)} =$





$\sqrt{p(1-p)}$
















\[ f_Y(y) = \sum_{j=1}^{J} \mathbb{1}\{y = y_j\}\,P(Y = y_j) \tag{2.12} \]



[Figure: two PMF plots, each showing P(Y = y) on the vertical axis (0 to 1) against y = 1, 2, 3 on the horizontal axis.]







\[ F_Y(y) = \sum_{j=1}^{J} \mathbb{1}\{y_j \le y\}\,P(Y = y_j) \tag{2.13} \]


\[ F_Y(y) = (1/3) \sum_{j=1}^{3} \mathbb{1}\{j \le y\} \tag{2.14} \]

FY (y) =


(215)


[Figure: CDF plot showing P(Y ≤ y) on the vertical axis (0 to 1) against y = 1, 2, 3 on the horizontal axis.]




E(Y ) =

Jsumj=1




\[ E(W) = (0)(1/2) + (2)(1/2) = 1, \quad E(Z) = (2)(1/2) + (4)(1/2) = 3 \tag{2.17} \]



\[ E(W) = \sum_{j=1}^{4} (j)(j/10) = (1)(1/10) + (2)(2/10) + (3)(3/10) + (4)(4/10) = 3 \]

\[ E(Z) = \sum_{j=1}^{4} (j)(5-j)/10 = (1)(4/10) + (2)(3/10) + (3)(2/10) + (4)(1/10) = 2 \]


\[ E(Y) = (10)(1/3) + (20)(1/3) + (270)(1/3) = 300/3 = 100 \tag{2.18} \]


\[ E(Y) = (10)(0.99) + (3010)(0.01) = 9.9 + 30.1 = 40 \tag{2.19} \]



\[ \begin{aligned} E(Y) &= (11)\,P(Y = 11) + (12)\,P(Y = 12) + (16)\,P(Y = 16) + (18)\,P(Y = 18) \\ &= (11)(0.2) + (12)(0.3) + (16)(0.4) + (18)(0.1) = 14 \end{aligned} \tag{2.20} \]


\[ E(aY + bZ) = a\,E(Y) + b\,E(Z) \tag{2.21} \]





$= c\,E(W) + d\,E(X) + b\,E(Z)$


\[ E\left( \sum_{i=1}^{n} c_i Y_i \right) = \sum_{i=1}^{n} c_i\,E(Y_i) \tag{2.22} \]











$\sqrt{200} \approx 14.1$









ing etc)


























P(Y = y) (224)






P(Y le good) =

=15︷︸︸︷fY (poor) +

=15︷︸︸︷fY (fair) +

=15︷︸︸︷fY (good) = 35 gt 12

P(Y ge good) =

=15︷︸︸︷fY (good) +

=15︷︸︸︷fY (great) +


(225)




P(W le great) =

=0︷︸︸︷fY (poor) +

=0︷︸︸︷fY (fair) +

=13︷︸︸︷fY (good) +

=13︷︸︸︷fY (great) = 23 gt 12

P(W ge great) =

=13︷︸︸︷fY (great) +


(226)






per 10000 people)











[Figure: PDF (density) plotted against y, with a shaded region labeled Area = 0.34.]










$\sum_{j=1}^{J} y_j f_Y(y_j)$)










$\sqrt{\sigma^2}$. That is, if







































\[ P(L_1 = 1) = P(Y = 0) = 0.6, \quad P(L_1 = -1) = P(Y = 1) = 0.4 \tag{2.30} \]


\[ E(L_0) = (0.4)(1) + (0.6)(-1) = -0.2 \tag{2.31} \]


\[ E(L_1) = (0.6)(1) + (0.4)(-1) = 0.2 \tag{2.32} \]







\[ P(L_1 = 1) = P(Y = 0) = 0.3, \quad P(L_1 = -1) = P(Y = 1) = 0.7 \tag{2.33} \]



\[ E[L(Y, 0)] = E(L_0) = (0.7)(1) + (0.3)(-1) = 0.4 \]





\[ L(0, 0) = -1, \quad L(1, 1) = -10, \quad L(0, 1) = L(1, 0) = 1 \tag{2.35} \]




\[ P(L_0 = 1) = P(Y = 1) = 0.4, \quad P(L_0 = -1) = P(Y = 0) = 0.6 \]

\[ P(L_1 = 1) = P(Y = 0) = 0.6, \quad P(L_1 = -10) = P(Y = 1) = 0.4 \tag{2.36} \]
















0–1 Loss







Quadratic Loss


\[ L_2(y, g) = (y - g)^2 \tag{2.39} \]













E[L(Y g)]


Two Possible Values




)=


) (240)












)=

(0 75 0

)

\[ P(Y = a) = 0.7, \quad P(Y = b) = 0.3 \tag{2.43} \]

Using (2.41),


(244)









(246)




P(Y = vj)L(vj v1)


P(Y = vj)L(vj v2)


P(Y = vj)L(vj vJ)

(247)



=


=

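\[ \begin{pmatrix} 0 & 2 & 8 \\ 1 & 0 & 2 \\ 4 & 1 & 0 \end{pmatrix} \tag{2.48} \]

Let

\[ P(Y = v_1) = P(Y = -1) = 0.2 \]

\[ P(Y = v_2) = P(Y = 0) = 0.3 \]


(2.49)



\[ = (0.2)(0) + (0.3)(2) + (0.5)(8) = 4.6 \]


\[ = (0.2)(1) + (0.3)(0) + (0.5)(2) = 1.2 \]


\[ = (0.2)(4) + (0.3)(1) + (0.5)(0) = 1.1 \]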








\[ E(L_0(Y, 20)) = E(\mathbb{1}\{Y \ne 20\}) = (0.6)(0) + (0.4)(1) = 0.4 \]

\[ E(L_0(Y, 25)) = E(\mathbb{1}\{Y \ne 25\}) = (0.6)(1) + (0.4)(0) = 0.6 \tag{2.50} \]



\[ \begin{aligned} E[L_2(Y, 20)] &= E[(Y - 20)^2] \\ &= P(Y = 20)(20 - 20)^2 + P(Y = 25)(25 - 20)^2 \\ &= (0.6)(0) + (0.4)5^2 = (0.4)(25) = 10 \end{aligned} \]

\[ \begin{aligned} E[L_2(Y, 25)] &= E[(Y - 25)^2] \\ &= P(Y = 20)(20 - 25)^2 + P(Y = 25)(25 - 25)^2 \\ &= (0.6)(-5)^2 + (0.4)(0) = (0.6)(25) = 15 \end{aligned} \]


\[ \begin{aligned} E[L_2(Y, 22)] &= E[(Y - 22)^2] \\ &= P(Y = 20)(20 - 22)^2 + P(Y = 25)(25 - 22)^2 \\ &= (0.6)(-2)^2 + (0.4)(3)^2 = (0.6)(4) + (0.4)(9) = 6 \end{aligned} \]
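The expected-loss arithmetic above can be checked numerically. The following is a minimal R sketch, assuming the two-point distribution P(Y = 20) = 0.6 and P(Y = 25) = 0.4 used in the calculations; the helper name `expected_loss` is hypothetical and introduced only for illustration.

```r
# Hypothetical helper: expected loss of guess g when Y is discrete
expected_loss <- function(values, probs, g, loss) {
  sum(probs * loss(values, g))
}

values <- c(20, 25)    # possible values of Y
probs  <- c(0.6, 0.4)  # P(Y = 20), P(Y = 25)

loss01   <- function(y, g) as.numeric(y != g)  # 0-1 loss
lossquad <- function(y, g) (y - g)^2           # quadratic loss

# Expected 0-1 loss for guesses 20 and 25
el01 <- sapply(c(20, 25), function(g) expected_loss(values, probs, g, loss01))
# Expected quadratic loss for guesses 20, 22, and 25
elq <- sapply(c(20, 22, 25), function(g) expected_loss(values, probs, g, lossquad))

print(el01)
print(elq)
print(sum(values * probs))  # the mean of Y, which is 22
```

Under 0–1 loss the smaller expected loss belongs to the more probable value (20), while under quadratic loss the guess 22, which equals the mean, beats both 20 and 25.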


Example Advertising







Example More Ages


\[ P(Y = j) = (26 - j)/21, \quad j = 20, 21, \ldots, 25 \tag{2.51} \]


E(L0(Y g)) =25sumy=20

1y 6= gP(Y = y) =25sumy=20


=

=1︷︸︸︷25sumy=20




arg ming



P(Y = g) = 20






E[L2(Y g)] =25sumy=20




E[L2(Y g)] =25sumy=20










\[ \arg\min_g E[(Y - g)^2] = E(Y) \tag{2.55} \]







\[ \arg\min_g P(Y \ne g) = \arg\min_g\,[1 - P(Y = g)] = \arg\max_g P(Y = g) \tag{2.56} \]















Chapter 3

One Variable Sample


















53


















































3.2.1 Independent






among other things










3.2.3 Examples



























n=

1

n

nsumi=1

1Yi = 1 (34)




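\[ f_S(v_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{Y_i = v_j\}, \quad j = 1, \ldots, J \tag{3.5} \]



\[ f_S(Y_i) = 1/n, \quad i = 1, \ldots, n \tag{3.6} \]




\[ \hat{F}_Y(y) = F_S(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{Y_i \le y\} \tag{3.7} \]




\[ \bar{Y} = \hat{E}(Y) = E(S) = \frac{1}{n} \sum_{i=1}^{n} Y_i \tag{3.8} \]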





$E[(Y - g)^2]$




\[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.9} \]



\[ \hat{g}_2^* = \bar{Y} \tag{3.10} \]



\[ \hat{g}_2^* = \arg\min_g \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.11} \]




\[ \sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} (Y_i - g)^2 \tag{3.12} \]
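The claim that the sample mean minimizes the sum of squared residuals can be illustrated numerically. This is a minimal R sketch on a hypothetical four-observation sample (the values are invented for illustration); it uses base R's `optimize()` to search for the minimizing guess.

```r
y <- c(2, 5, 7, 10)  # hypothetical sample, for illustration only

# Sum of squared residuals when every observation is "predicted" by g
ssr <- function(g) sum((y - g)^2)

# Numerically minimize over the range of the data
opt <- optimize(ssr, interval = range(y))

# The numerical minimizer should agree with the sample mean (up to tolerance)
print(c(minimizer = opt$minimum, sample_mean = mean(y)))
```

The numerical search is only a check; the sample mean itself is the exact minimizer.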





Example





[1] 0708


[1] 0551


[1] 054
















P(Yn = 0) =


=1minusp︷︸︸︷P(Y1 = 0)


\[ P(\bar{Y}_n = 1) = \overbrace{P(Y_1 = 1)}^{=p}\,\overbrace{P(Y_2 = 1)}^{=p} = p^2 \]









[Figure: six panels showing the PMF (%) of $\bar{Y}_n$ against the value of $\bar{Y}_n$ (0 to 1), for n = 1, 2, 4, 8, 16, 32.]











[Figure: six panels showing the PDF of $\bar{Y}_n$ against the value of $\bar{Y}_n$ (0 to 0.8), for n = 1, 2, 4, 8, 16, 32.]









Sample Yn 1Yn le 0


1 050 0 02 020 0 13 000 1 14 minus010 1 15 minus050 1 0

100 030 0 1

Average 001 52100 67100




Yn le 0



Yn le 0












\[ \bar{Y}_n \overset{a}{\sim} N(\mu_Y, \sigma_Y^2 / n) \tag{3.15} \]


radicn







[Figure: six panels showing a PDF over the horizontal range −1.0 to 1.0 (vertical axis 0.0 to 1.2), for n = 1, 2, 4, 8, 16, 32.]











3.7.1 Bias


Definitions















E[Y2] = E[(12)Y1 + (12)Y2] =

microY 2︷︸︸︷(12) E(Y1) +






\[ P(\hat{\theta}_1 = -100) = P(\hat{\theta}_1 = 100) = 1/2 \tag{3.20} \]



\[ P(\hat{\theta}_2 = 1) = 1 \tag{3.21} \]


\[ E(\hat{\theta}_1) = (1/2)(-100) + (1/2)(100) = 0, \quad E(\hat{\theta}_2) = (1)(1) = 1 \tag{3.22} \]












(325)






Other MSE Examples


\[ P(\hat{\theta}_1 = \theta - 100) = P(\hat{\theta}_1 = \theta + 100) = 1/2, \quad P(\hat{\theta}_2 = \theta + 1) = 1 \tag{3.27} \]




(328)




\[ \mathrm{MSE}(\hat{\beta}_1) = 1^2 + 16 = 17, \quad \mathrm{MSE}(\hat{\beta}_2) = 10^2 + 9 = 109 \tag{3.30} \]




3.7.3 Consistency













θn minus θ (332)


3.7.4 Asymptotic MSE






3.8.1 Standard Errors





\[ \mathrm{SE}(\hat{\theta}) \equiv \sqrt{\mathrm{Var}(\hat{\theta})} \tag{3.33} \]



Interpretation
























Example in R





3.8.3 p-values




p = P(|Yn| ge |Yo| | microY = 0) (335)


































Computation







4552

















Example





\[ \mu_Y = E(Y) = (4/10)(0) + (3/10)(1) + (2/10)(2) + (1/10)(3) = 10/10 = 1 \]



CIs[[B]][irep] <-





















Total 949050 50950 1000000


































Examples



























Empirical Exercises







R mean(card$wage)















Chapter 4


















93













4.1 Description






\[ P(Y^A = 0) = 0.8, \quad P(Y^A = 1) = 0.2, \quad P(Y^A = 2) = 0 \]

\[ P(Y^B = 0) = P(Y^B = 1) = P(Y^B = 2) = 1/3 \tag{4.1} \]

Then

\[ \begin{aligned} E(Y^B) - E(Y^A) &= \sum_{y=0}^{2} y\,P(Y^B = y) - \sum_{y=0}^{2} y\,P(Y^A = y) \\ &= [(0)(1/3) + (1)(1/3) + (2)(1/3)] - [(0)(0.8) + (1)(0.2) + (2)(0)] \\ &= [(1/3) + (2/3)] - 0.2 = 0.8 \end{aligned} \]
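The mean difference above can be reproduced in a few lines of R. This is a minimal sketch using the two PMFs from (4.1); the variable names are chosen here for illustration.

```r
y  <- 0:2               # possible values in both populations
pA <- c(0.8, 0.2, 0)    # P(Y^A = 0), P(Y^A = 1), P(Y^A = 2)
pB <- rep(1/3, 3)       # P(Y^B = y) = 1/3 for each y

mean_A <- sum(y * pA)   # population mean of Y^A
mean_B <- sum(y * pB)   # population mean of Y^B

print(c(mean_A = mean_A, mean_B = mean_B, difference = mean_B - mean_A))
```

The difference printed is the 0.8 obtained by hand: the mean of population B is 1 and the mean of population A is 0.2.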





4.1.2 Estimation




[1] -299







[1] 044 555

4.2 Prediction




















Figure 4.1

Rain


++




















tionships
































4.4.3 SUTVA

SUTVA Definition







SUTVA Violations




















4.5.2 ATE Examples


\[ E(Y^U) = (0.3)(0) + (0.3)(0) + (0.1)(1) + (0.3)(1) = 0.4 \tag{4.4} \]

\[ E(Y^T) = (0.3)(0) + (0.3)(1) + (0.1)(0) + (0.3)(1) = 0.6 \tag{4.5} \]


To verify (4.3)


Type   Probability   Y^U   Y^T   Y^T − Y^U
1      0.3           0     0      0
2      0.3           0     1      1
3      0.1           1     0     −1
4      0.3           1     1      0
Mean                 0.4   0.6    0.2


Type   Probability   Y^U ($/yr)   Y^T ($/yr)   Y^T − Y^U ($/yr)
1      0.5           40000        41000         1000
2      0.2           40000        38000        −2000
3      0.2           50000        51000         1000
4      0.1           50000        47000        −3000
Mean                 43000        43000            0


\[ E(Y^U) = (0.5)(40000) + (0.2)(40000) + (0.2)(50000) + (0.1)(50000) = 43000 \tag{4.8} \]

\[ E(Y^T) = (0.5)(41000) + (0.2)(38000) + (0.2)(51000) + (0.1)(47000) = 43000 \tag{4.9} \]
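The average treatment effect in the earnings example can be computed either as the mean of the individual effects or as the difference of the two potential-outcome means. The following R sketch assumes the four types, probabilities, and potential outcomes given in (4.8) and (4.9); the variable names are chosen here for illustration.

```r
prob <- c(0.5, 0.2, 0.2, 0.1)             # probability of each type
yU   <- c(40000, 40000, 50000, 50000)     # untreated potential outcome ($/yr)
yT   <- c(41000, 38000, 51000, 47000)     # treated potential outcome ($/yr)

effect <- yT - yU                         # individual treatment effects
ate_from_effects <- sum(prob * effect)    # mean of individual effects
ate_from_means   <- sum(prob * yT) - sum(prob * yU)  # difference of means

print(effect)
print(c(EYU = sum(prob * yU), EYT = sum(prob * yT),
        ATE = ate_from_effects, ATE_check = ate_from_means))
```

Both routes give an ATE of zero even though no individual effect is zero, which is the kind of heterogeneity that a single average cannot reveal.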





[Figure: density plot (vertical axis 0.0 to 0.4) with the horizontal axis marked at $12/hr, $15/hr, and $18/hr.]





















4.6.2 Randomization





















Empirical Exercises












R























Chapter 5

Midterm Exam 1



117


Part II

Regression

119

Introduction




121

122

Chapter 6


















123










6.1 Logic




6.1.1 Terminology








A    B    A ⟹ B



AB





A    B    A ⟺ B












6.1.2 Theorems
















is not consistent





more situations)

6.2 Preliminaries







\[ E(Y) = \mu_Y \tag{6.1} \]


\[ Y = \mu_Y + U, \quad E(U) = 0 \tag{6.2} \]





always implies




























\[ P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)} \tag{6.7} \]




\[ P(Y = 1 \mid X = 1) = \frac{P(Y = 1, X = 1)}{P(X = 1)} \tag{6.8} \]



Examples











E(Y | X = x) (69)


Examples


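\[ E(Y) = (0)\,P(Y = 0) + (1)\,P(Y = 1) = (0)(0.3) + (1)(0.7) = 0 + 0.7 = 0.7 \tag{6.10} \]

\[ \begin{aligned} E(Y \mid X = 1) &= (0)\,P(Y = 0 \mid X = 1) + (1)\,P(Y = 1 \mid X = 1) \\ &= (0)(0.25) + (1)(0.75) = 0 + 0.75 = 0.75 \end{aligned} \tag{6.11} \]



\[ \begin{aligned} E(Y \mid X = x) &= \sum_{j \in \{0, 20, 40\}} (j)\,P(Y = j \mid X = x) \\ &= (0)\,P(Y = 0 \mid X = x) + (20)\,P(Y = 20 \mid X = x) \\ &\quad + (40)\,P(Y = 40 \mid X = x) \end{aligned} \tag{6.12} \]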


          Y = 0   Y = 20   Y = 40
X = 11    0.10    0.05     0.05
X = 12    0.05    0.10     0.15
X = 16    0.10    0.10     0.30




\[ P(X = 16) = 0.10 + 0.10 + 0.30 = 0.5 \tag{6.13} \]


\[ P(Y = 20 \mid X = 16) = \frac{P(Y = 20, X = 16)}{P(X = 16)} = \frac{0.10}{0.50} = 0.2 \]

\[ P(Y = 40 \mid X = 16) = \frac{P(Y = 40, X = 16)}{P(X = 16)} = \frac{0.30}{0.50} = 0.6 \tag{6.14} \]


\[ E(Y \mid X = 16) = 0 + (20)(0.2) + (40)(0.6) = 4 + 24 = 28 \tag{6.15} \]




\[ P(X = x, Y = y) = P(Y = y \mid X = x)\,P(X = x) \tag{6.16} \]
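The steps from a joint distribution to a conditional mean (marginal probability, conditional probabilities, then the weighted sum) can be sketched in R. This minimal example assumes the joint probability table for X in {11, 12, 16} and Y in {0, 20, 40} shown above; the object names are chosen here for illustration.

```r
# Joint probabilities P(X = x, Y = y): rows are X values, columns are Y values
joint <- matrix(c(0.10, 0.05, 0.05,
                  0.05, 0.10, 0.15,
                  0.10, 0.10, 0.30),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = c("11", "12", "16"),
                                Y = c("0", "20", "40")))
yvals <- c(0, 20, 40)

pX   <- rowSums(joint)   # marginal P(X = x)
cond <- joint / pX       # row i divided by P(X = x_i): P(Y = y | X = x)

# Conditional mean E(Y | X = x) for each x
cef <- as.vector(cond %*% yvals)
names(cef) <- rownames(joint)

print(pX)
print(cond)
print(cef)
```

The entry of `cef` for X = 16 reproduces the value 28 in (6.15), and each row of `cond` sums to one, as conditional probabilities must.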


























\[ m(x) \equiv E(Y \mid X = x) \tag{6.20} \]




\[ m(0) = E(Y \mid X = 0), \quad m(1) = E(Y \mid X = 1) \tag{6.21} \]





\[ = P(Y = 1 \mid X = 0) = \frac{P(Y = 1, X = 0)}{P(X = 0)} = \frac{0.1}{0.2} = 0.5 \]



\[ P(m(X) = 0.5) = P(X = 0) = 0.2, \quad P(m(X) = 0.75) = P(X = 1) = 0.8 \tag{6.22} \]

6.3.2 CEF Error Term







\[ = m(x) - m(x) = 0 \]






\[ Y = m(X) + V, \quad E(V \mid X) = 0 \tag{6.25} \]






\[ = m(0) + [m(1) - m(0)]X + V \tag{6.27} \]



\[ Y = \beta_0 + \beta_1 X + V, \quad E(V \mid X) = 0 \tag{6.28} \]

















\[ E[(Y - g(X))^2] \tag{6.30} \]









m(13 yr) = β0 + (13 yr)β1 m(12 yr) = β0 + (12 yr)β1







m(16 yr) = β0 + (16 yr)β1 m(12 yr) = β0 + (12 yr)β1












\[ Y = (Y^U)(1 - X) + (Y^T)(X) \tag{6.31} \]








= β0 +Xβ1 +









\[ Y = \beta_0 + \beta_1 X + U \tag{6.34} \]












\[ Y = h(X, U) \tag{6.35} \]



Y = h(X U) = β0 + β1X +

U︷︸︸︷g(U)














6.6 Identification






\[ E(Y \mid X = x) = \gamma_0 + \gamma_1 x \tag{6.38} \]











\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \tag{6.39} \]


\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \implies \overbrace{\mathrm{Cov}(U, X) = 0}^{\text{A6.3}} \tag{6.40} \]







Y = β0 + β1X + U

= β0 + β1X + U +


=

γ0︷︸︸︷β0 + E(U) +

γ1︷︸︸︷β1 X +






In Practice




























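\[ E(Y^T) = E(Y^T \mid X = 1), \quad E(Y^U) = E(Y^U \mid X = 0) \tag{6.43} \]


\[ E(Y^T \mid X = 1) = E(Y \mid X = 1), \quad E(Y^U \mid X = 0) = E(Y \mid X = 0) \tag{6.44} \]








\[ \begin{aligned} &= \overbrace{E(Y^T \mid X = 1)}^{\text{use (6.44)}} - \overbrace{E(Y^U \mid X = 0)}^{\text{use (6.44)}} \\ &= E(Y \mid X = 1) - E(Y \mid X = 0) \end{aligned} \]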



In Practice














=


use A67︷︸︸︷E[h(0U) | X = 0]



6.7 Estimation: OLS










is


1

n

nsumi=1



m(x) = β0 + β1x (647)




1

n

nsumi=1


(648)









$\sum_{i=1}^{n} \hat{U}_i^2$ as small



β0prarr β0 β1


prarr m(1) (651)

6.7.1 Code



(Intercept)           X 
          2           5 


[1] 2 7
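With a binary regressor, the OLS intercept equals the sample mean of Y in the X = 0 group and the slope equals the difference between the two group means. The R sketch below uses a small hypothetical data set (invented here, not the data behind the output above) chosen so that the group means are 2 and 7.

```r
# Hypothetical data: two observations with X = 0 and two with X = 1
df <- data.frame(X = c(0, 0, 1, 1),
                 Y = c(1, 3, 6, 8))

fit <- lm(Y ~ X, data = df)                      # OLS regression of Y on X
group_means <- tapply(df$Y, df$X, mean)          # sample mean of Y by group

print(coef(fit))        # intercept and slope
print(group_means)      # means for X = 0 and X = 1
print(predict(fit, newdata = data.frame(X = c(0, 1))))  # fitted values
```

Here the intercept is the X = 0 mean, the slope is the difference in means, and the fitted values at X = 0 and X = 1 are the two group means.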














6.8.2 Code






















Empirical Exercises



jtrain2













Chapter 7















155



7.1 Misspecification



\[ Y = \beta_0 + \beta_1 X + U \tag{7.1} \]






\[ m(x) = \beta_0 + \beta_1 x, \quad x = 0, 1, 2 \]






[Figure: plot of m(X) (vertical axis, 0 to 60) against X = 0, 1, 2.]














\[ \mathbb{1}\{X = j\} = \begin{cases} 1 & \text{if } X = j \\ 0 & \text{otherwise} \end{cases}, \quad j = 0, 1, 2 \tag{7.2} \]




























\[ p = \arg\min_{s \in S} d_E(y, s) \tag{7.5} \]









\[ d(A, B) \equiv \sqrt{E[(A - B)^2]} \tag{7.7} \]





β1 =Cov(YX)



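\[ \beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X^2} \frac{\sigma_Y}{\sigma_Y} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X \sigma_Y} \frac{\sigma_Y}{\sigma_X} = \mathrm{Corr}(Y, X) \frac{\sigma_Y}{\sigma_X} \]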
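The same identity holds for the sample analogs, which gives a quick numerical check. The following R sketch simulates hypothetical data (the data-generating values 1 and 2 and the seed are arbitrary choices made here) and compares three ways of computing the slope.

```r
set.seed(1)                       # arbitrary seed for reproducibility
x <- rnorm(200)                   # hypothetical regressor
y <- 1 + 2 * x + rnorm(200)       # hypothetical outcome

b_cov  <- cov(y, x) / var(x)               # sample Cov(Y, X) / Var(X)
b_corr <- cor(y, x) * sd(y) / sd(x)        # sample Corr(Y, X) * sd(Y) / sd(X)
b_lm   <- unname(coef(lm(y ~ x))["x"])     # OLS slope from lm()

print(c(cov_over_var = b_cov, corr_times_ratio = b_corr, ols = b_lm))
```

All three numbers coincide up to floating-point rounding, because each is the same algebraic expression written differently.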






[1X

] (710)



\[ \begin{bmatrix} 1 & E(X) \\ E(X) & E(X^2) \end{bmatrix}^{-1} \begin{bmatrix} E(Y) \\ E(XY) \end{bmatrix} \tag{7.11} \]







\[ Y = \beta_0 + \beta_1 X + U, \quad E(U) = \mathrm{Cov}(X, U) = 0 \tag{7.13} \]








LP(Y | 1 X) = β0+β1X =






7.4.2 Limitations









LP(Y | 1 X) = β0+β1X =



$E[Y - (a + bX)]^2$





















β1 =Cov(YX)

Var(X)=

1n




n

nsumi=1

Yi X equiv 1

n

nsumi=1

Xi

(716)



n







Assumptions















Theoretical Results











7.7.3 Code















(Intercept) D1 D2 598 -198 -198




















ified


Empirical Exercises











R table(dat$bowl4)








for m1 and m2











Chapter 8
















173



















[Figure: plot of ln(x) (vertical axis, −2 to 2) against x (horizontal axis, 0 to 7).]







\[ 100 \left( \frac{v_2 - v_1}{v_1} \right) = 100 \left( \frac{v_2}{v_1} - 1 \right) \]






Interpretation


ln(Y ) = β0 + β1X + U (82)





When to Use It





\[ E(Y \mid X = x) = e^{\beta_0 + \beta_1 x}\,E(e^U \mid X = x) \]




Interpretation


Y = β0 + β1 ln(X) + U (83)




\[ \overbrace{\beta_0 + \beta_1 \ln(60)}^{X = 60} - \overbrace{[\beta_0 + \beta_1 \ln(40)]}^{X = 40} = \beta_1 \ln(1.5) = 0.41 \beta_1 \]


When to Use It




Interpretation


ln(Y ) = β0 + β1 ln(X) + U (84)


When to Use It
















8.1.6 Code

[Figure: four panels titled Linear, Log-Linear, Linear-Log, and Log-Log, each with horizontal axis from 2 to 8 and vertical axis from 0.5 to 2.0.]
















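8.2.1 Linearity


\[ w_1 A + w_2 B \tag{8.5} \]


\[ w_1 Y_1 + w_2 Y_2 + w_3 Y_3 + w_4 Y_4 = \sum_{i=1}^{4} w_i Y_i \tag{8.6} \]



\[ w_1 \beta_0 + w_2 \beta_1 = (1)(\beta_0) + (X)(\beta_1) = \beta_0 + \beta_1 X \]


\[ (w_1)(X^0) + (w_2)(X) = (\beta_0)(1) + (\beta_1)(X) = \beta_0 + \beta_1 X \]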







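8.2.2 Nonlinearity


\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + U \tag{8.7} \]



\[ (1)(\beta_0) + (X)(\beta_1) + (X^2)(\beta_2) = \beta_0 + \beta_1 X + \beta_2 X^2 \]


\[ \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 \]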



13


\[ \sum_{j=0}^{J} \beta_j f_j(X) \tag{8.8} \]




\[ Y = \beta_0 X^{\beta_1} + U \tag{8.9} \]

























Y = f(X) + U (810)




f(x2)minus f(x1)





1 from 1 to 2










LP(Y | 1 XX2) = β0 + β1X + β2X2

= arg minabc

d(Y a+ bX + cX2)

= arg minabc



LP(Y | 1 XX2) = β0 + β1X + β2X2

=



=


E[Y minus (a+ bX + cX2)]2 (812)







\[ m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \]







21)



8.2.6 Code

[Figure: four panels titled Linear, Quadratic, Cubic, and Trigonometric, each with horizontal axis from 0.0 to 3.0 and vertical axis from −2 to 4.]
































[Figure: four panels titled GCV, LOOCV, Undersmoothed, and Oversmoothed, each with horizontal axis from 0.0 to 1.0 and vertical axis from 0.0 to 3.0.]









Empirical Exercises
































pntsprd

















Chapter 9

















195


















9.1.1 An Allegory









\[ Y = \beta_0 + \beta_1 X + \beta_2 Q + V \tag{9.1} \]


\[ Y = \beta_0 + \beta_1 X + U, \quad U \equiv \beta_2 Q + V \tag{9.2} \]













Example
















9.1.3 Consequences


Formulas





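\[ \hat{\beta}_1 \overset{p}{\to} \beta_1 + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} \tag{9.3} \]


\[ \operatorname*{plim}_{n \to \infty} \hat{\beta}_1 = \beta_1 + \frac{\mathrm{Cov}(X, U)}{\mathrm{Var}(X)} \tag{9.4} \]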
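The probability limit in (9.4) can be illustrated by simulation. The R sketch below generates hypothetical data from a model like (9.1), with all parameter values, the seed, and the sample size chosen here purely for illustration, then regresses Y on X alone and compares the estimated slope to the value the formula predicts.

```r
set.seed(42)                    # arbitrary seed for reproducibility
n <- 100000                     # large n so the estimate is near its plim

q <- rnorm(n)                   # omitted variable Q
x <- 0.5 * q + rnorm(n)         # X is correlated with Q
v <- rnorm(n)                   # V is unrelated to X and Q
b1 <- 1; b2 <- 2                # hypothetical structural coefficients
y <- 3 + b1 * x + b2 * q + v    # Y = b0 + b1*X + b2*Q + V

# "Short" regression that omits Q, so U = b2*Q + V
short <- unname(coef(lm(y ~ x))["x"])

# Formula value using sample moments: b1 + b2 * Cov(X, Q) / Var(X)
predicted <- b1 + b2 * cov(x, q) / var(x)

print(c(short_regression_slope = short, formula_value = predicted))
```

In this design the population value is 1 + 2(0.5)/1.25 = 1.8, so the short regression slope settles near 1.8 rather than near the structural coefficient 1.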




plimnrarrinfin














=0︷︸︸︷Cov(XV ) (96)


β1prarr β1+

Cov(XU)

Var(X)= β1+β2

Cov(XQ)


radicVar(Q)

Var(X)





Example



















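\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{9.8} \]

Misspecification


\[ m(0, 0) = E(Y \mid X_1 = 0, X_2 = 0), \quad m(0, 1) = E(Y \mid X_1 = 0, X_2 = 1) \]

\[ m(1, 0) = E(Y \mid X_1 = 1, X_2 = 0), \quad m(1, 1) = E(Y \mid X_1 = 1, X_2 = 1) \tag{9.9} \]







\[ m(0, 0) = 0, \quad m(1, 0) = 1, \quad m(0, 1) = 2, \quad m(1, 1) = 4 \]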





More Consideration





\[ E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{9.14} \]






\[ m(x_1, x_2) = \beta_0 + (\beta_1)(x_1) + (\beta_2)(x_2) + (\beta_3)(x_1)(x_2) \]


(918)


(915)︷︸︸︷β0 = m(0 0) (919)

β1 =


β2 =


β3 = [β2 + β3]minus [β2] =


(916) minus (915)︷︸︸︷[(β0 + β2)minus (β0)]

=




=


(917) minus (915)︷︸︸︷[(β0 + β1)minus (β0)]



















(925)




(927)








\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + U \tag{9.28} \]



















= P(X2 = 1) CATE(1) + P(X2 = 0) CATE(0)


9.6 Collider Bias








Fever No fever


Falafel 50 50 45 0 5 50Salad 50 50 47 20 3 30




P(E coli)]= (05)(06) = 03







\[ \overbrace{E(Y \mid X = 1, Z = 0)}^{5/5} - \overbrace{E(Y \mid X = 0, Z = 0)}^{3/3} = 0 \]

\[ \overbrace{E(Y \mid X = 1, Z = 1)}^{45/45} - \overbrace{E(Y \mid X = 0, Z = 1)}^{47/67} = 0.30 \tag{9.30} \]











9.7.1 Bad Approaches



















our city before






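[Figure: difference-in-differences diagram with "before" and "after" on the horizontal axis, showing m(0,0) and m(0,1) for the other city, m(1,0) and the actual m(1,1) for the "treated" city, and the change m(0,1) − m(0,0).]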



9.7.3 Identification









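\[ \begin{aligned} &E(Y^0 \mid X_1 = 1, X_2 = 1) - E(Y^0 \mid X_1 = 1, X_2 = 0) \\ &\quad = E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0) \end{aligned} \tag{9.34} \]


$E(Y^0 \mid X_1 = 1, X_2 = 1)$



\[ = m(1, 0) + [m(0, 1) - m(0, 0)] \tag{9.35} \]


$\mathrm{ATT} = E(Y^1 - Y^0 \mid X_1 = 1, X_2 = 1)$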

=





= β3
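The link between the difference-in-differences comparison and the interaction coefficient can be checked numerically. The R sketch below builds a hypothetical data set (invented here) whose four cell means are 0, 1, 2, and 4, fits the regression with an interaction term, and compares the interaction coefficient to the double difference of cell means.

```r
# Four (X1, X2) cells; expand.grid varies X1 fastest:
# (0,0), (1,0), (0,1), (1,1)
cells <- expand.grid(X1 = 0:1, X2 = 0:1)
cells$m <- c(0, 1, 2, 4)   # hypothetical cell means m(X1, X2)

# Two observations per cell, placed symmetrically around each cell mean
df <- rbind(transform(cells, Y = m - 0.5),
            transform(cells, Y = m + 0.5))

fit <- lm(Y ~ X1 * X2, data = df)          # includes X1, X2, and X1:X2
b3  <- unname(coef(fit)["X1:X2"])          # interaction coefficient

# Double difference of sample cell means
m <- tapply(df$Y, list(df$X1, df$X2), mean)
did <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

print(c(interaction_coefficient = b3, double_difference = did))
```

Because the model with an interaction fits all four cell means exactly, the two numbers agree; whether that number has a causal interpretation depends on the parallel trends condition in (9.34), which the code does not test.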






9.7.4 Extensions













25 975 (Intercept) 930 1077 X1 431 604 X2 497 727 X1X2 220 527


Empirical Exercises



driving



















R








Stata







wage1















Chapter 10

















221













\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + V \tag{10.1} \]



















Jsumj=1





= [β0 + β1(x1 + 1) +

Jsumj=2


Jsumj=2


(106)






= [β0 + β1(x1 + ∆1) +Jsumj=2


βjxj ] = β1(x1 + ∆1 minus x1) = β1∆1

10.2.2 Limitations




10.2.3 Code












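\[ Y = g(X, D) + U, \quad g(X, D) = \beta_0 + \beta_1 X + \beta_2 D \tag{10.8} \]



\[ g(X, 0) = \beta_0 + \beta_1 X \tag{10.9} \]

\[ g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \beta_1 X \tag{10.10} \]




\[ g(X, D) = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 D X \tag{10.11} \]



\[ g(X, 0) = \beta_0 + \beta_1 X + \overbrace{(\beta_2)(0) + (\beta_3)(0)(X)}^{=0} = \beta_0 + \beta_1 X \tag{10.12} \]

\[ g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) + (\beta_3)(1)(X) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + \beta_3)}^{\text{slope}} X \tag{10.13} \]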


[Figure: two lines plotted against X, one for D = 0 (intercept β0, slope β1) and one for D = 1 (intercept β0 + β2, slope β1 + β3).]




\[ g(X, D) = (\beta_0 + \beta_1 X) + D(\beta_2 + \beta_3 X) \tag{10.14} \]












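and $D = X_2$ in (10.11)

\[ g(X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{10.15} \]


\[ g(X_1, X_2) = \overbrace{(\beta_0 + X_2 \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + X_2 \beta_3)}^{\text{slope}} X_1 \tag{10.16} \]


\[ g(X_1, a) = \overbrace{(\beta_0 + a \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + a \beta_3)}^{\text{slope}} X_1 \tag{10.17} \]

\[ g(X_1, b) = \overbrace{(\beta_0 + b \beta_2)}^{\text{intercept}} + \overbrace{(\beta_1 + b \beta_3)}^{\text{slope}} X_1 \tag{10.18} \]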



Y = g(X1 X2) = 5minus 15X1 + 2X2 + 2X1X2 (1019)






10.3.4 Code






1 2 1189 328

10.4 Other Examples




24






























radicn










\[ Y = \beta_0 + \sum_{j=1}^{J} \beta_j X_j + U \tag{10.21} \]




\[ Y = h(X_1, X_2, \ldots, X_J, U) \tag{10.22} \]
















Empirical Exercises



fertil2




















hprice2

















Chapter 11

Midterm Exam 2



237


Chapter 12


















239




12.1 Terminology















12.2.2 Different Time















are described below





















Regression





\[ Y = \beta_0 + \beta_1 X + \overbrace{(V + M)}^{U} = \beta_0 + \beta_1 X + U \tag{12.2} \]


plimnrarrinfin


Var(X)=

Cov(XV +M)

Var(X)=


Var(X)



plimnrarrinfin







γ1X see (123)

Example







[Figure: exercise (Y, vertical axis) plotted against gym membership (X = 0 or 1, horizontal axis), with the bias marked.]











Xlowast is





plimnrarrinfin


Var(X)




=













00 05 10 15 20 25 30

00

10

20

30

X or X

Y

With XWith X



General Bias













[1] 425 428 428




12.3.5 Missing Data








[Figure: earnings by college degree (0 or 1), with legend entries All, Observed, and Combined Data Mean.]





























Empirical Exercises





describe



R nrow(df)

Stata count






















R nrow(df)

Stata count


















Part III

Time Series

257

Introduction



259

260

Chapter 13
















261
























13.3 Stationarity



















\[ \frac{\gamma_j}{\sigma_Y^2} = \frac{\gamma_j}{\gamma_0} \tag{13.5} \]




2Y = σ2Y = γ0







13.5 Estimation

13.5.1 Mean


1

T

Tsumt=1







T

Tsumt=1+j





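\[ \hat{\gamma}_j = (1/4) \sum_{t=1+j}^{4} Y_t Y_{t-j} \]

\[ \hat{\gamma}_0 = (1/4)(Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2) = (1/4)(4) = 1 \]

\[ \hat{\gamma}_1 = (1/4)(Y_2 Y_1 + Y_3 Y_2 + Y_4 Y_3) = (1/4)(1 + (-1) + 1) = 1/4 = 0.25 \]

\[ \hat{\gamma}_2 = (1/4)(Y_3 Y_1 + Y_4 Y_2) = -2/4 = -0.50 \]

\[ \hat{\gamma}_3 = (1/4)(Y_4 Y_1) = -1/4 = -0.25 \]



[1]  1.00  0.25 -0.50 -0.25


\[ \hat{\rho}_j = \hat{\gamma}_j / \hat{\gamma}_0 \tag{13.8} \]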
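The autocovariance arithmetic above can be reproduced in R. This minimal sketch assumes the four observations are (1, 1, −1, −1), which is one series consistent with the products shown (that choice is an assumption made here), and applies the formula for $\hat{\gamma}_j$ directly, treating the mean as known to be zero as the formula does.

```r
y <- c(1, 1, -1, -1)   # assumed series, consistent with the products above
T_len <- length(y)     # T = 4

# gamma_hat_j = (1/T) * sum over t = 1+j, ..., T of Y_t * Y_{t-j}
gamma_hat <- sapply(0:3, function(j) {
  sum(y[(1 + j):T_len] * y[1:(T_len - j)]) / T_len
})

rho_hat <- gamma_hat / gamma_hat[1]   # autocorrelations: gamma_j / gamma_0

print(gamma_hat)
print(rho_hat)
```

Because $\hat{\gamma}_0 = 1$ here, the estimated autocorrelations equal the estimated autocovariances. The built-in `acf()` function subtracts the sample mean first; for this particular series the sample mean is zero, so that difference does not matter.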

Difficulties



Code







8 066 9370 9 067 9589 10 070 10043 11 074 10622 12 076 10868

13.6 Nonstationarity




13.6.1 Trends

Stochastic Trends














\[ = y_t + E(\varepsilon_{t+1}) = y_t + 0 = y_t \tag{13.11} \]








\[ E(Y_t) = E(t + \varepsilon_t) = t + E(\varepsilon_t) = t + 0 = t \]








\[ E(Y_5 \mid Y_4 = 4.6) = E(Y_4 + 1 + \varepsilon_5 \mid Y_4 = 4.6) = 4.6 + 1 = 5.6 \tag{13.13} \]


13.6.2 Seasonality








[Figure: two time series plots against Time (1950 to 1958): AirPassengers (vertical axis 100 to 500) and log(AirPassengers) (vertical axis 5.0 to 6.5).]



13.6.3 Cycles







13.7 Decomposition














[Figure: time series decomposition with four panels (observed, trend, seasonal, random) plotted against time, 1960 to 1990.]


(ppm)






[Figure: time series decomposition with four panels (observed, trend, seasonal, random) plotted against time, 1950 to 1960.]






13.8 Transformations







Empirical Exercises











2 2 2 2 1)














































Chapter 14


















281







14.1 Model










micro = E(Yt) =




(144)




1minus φ1 (145)









(147)

14.2 Description




=σ2Y︷︸︸︷

Var(Yt) =



=


σ2ε in (142)︷︸︸︷Var(εt) +2


= φ21

σ2Y︷︸︸︷



\[ = \phi_1^2 \sigma_Y^2 + \sigma_\varepsilon^2 \]



\[ \sigma_Y^2 = \frac{\sigma_\varepsilon^2}{1 - \phi_1^2} \tag{14.8} \]




=




\[ = \phi_1 \sigma_Y^2 \tag{14.9} \]



=



\[ = \phi_1 \gamma_1 = \phi_1 \overbrace{\phi_1 \sigma_Y^2}^{(14.9)} = \phi_1^2 \sigma_Y^2 \tag{14.10} \]



=





\[ = \phi_1^j \sigma_Y^2 \tag{14.11} \]



\[ (\phi_1^j \sigma_Y^2) / \sigma_Y^2 = \phi_1^j \tag{14.12} \]




2Y







E(Yt | Ytminus1) =


=





$\overbrace{E(\varepsilon_t)}^{=0}$

\[ = \phi_0 + \phi_1 Y_{t-1} \tag{14.13} \]


\[ \hat{Y}_{T+1} = \hat{\phi}_0 + \hat{\phi}_1 Y_T \tag{14.14} \]







14.4 Estimation


$Y_t = \mu + \varepsilon_t$









14.4.1 Code






SE(PhiHat1)=0100


[1] 0196 0966



[1] 0196 0196 0196














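$\overbrace{E(\varepsilon_{T+1} \mid Z_T)}^{=0}$

\[ = \phi_1 Z_T \tag{14.16} \]


\[ \begin{aligned} E(Z_{T+2} \mid Z_T) &= E(\phi_1 Z_{T+1} + \varepsilon_{T+2} \mid Z_T) = \phi_1 \overbrace{E(Z_{T+1} \mid Z_T)}^{= \phi_1 Z_T \text{ by (14.16)}} + \overbrace{E(\varepsilon_{T+2} \mid Z_T)}^{=0} \\ &= \phi_1^2 Z_T \end{aligned} \]


\[ E(Z_{T+h} \mid Z_T) = \phi_1^h Z_T \tag{14.17} \]


\[ \hat{Z}_{T+h} = \hat{\phi}_1^h Z_T \tag{14.18} \]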
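The multi-step result can be checked by iterating the one-step forecast. The R sketch below uses hypothetical values for the AR(1) coefficient, the last observation, and the horizon (all chosen here for illustration) and compares the iterated forecast with the closed form $\phi_1^h Z_T$.

```r
phi1 <- 0.8   # hypothetical AR(1) coefficient
z_T  <- 2     # hypothetical last observed (demeaned) value
h    <- 5     # forecast horizon

# Iterate the one-step rule: each forecast is phi1 times the previous one
iter <- z_T
path <- numeric(h)
for (s in 1:h) {
  iter <- phi1 * iter
  path[s] <- iter
}

closed <- phi1^h * z_T   # closed-form h-step forecast

print(path)                                   # forecasts for horizons 1, ..., h
print(c(iterated = iter, closed_form = closed))
```

With $|\phi_1| < 1$ the forecasts shrink toward zero (the mean of the demeaned series) as the horizon grows.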






(1419)



(1420)



YT+1 =φ0




\[ Y_{t+h} = \phi_0 + \phi_1 Y_t + \varepsilon_{t+h} \tag{14.21} \]


\[ \hat{Y}_{T+h} = \hat{\phi}_0 + \hat{\phi}_1 Y_T \tag{14.22} \]


14.6.4 Code


















\[ [\hat{Y}_{T+1} - 1.96 \hat{\sigma}_\varepsilon, \; \hat{Y}_{T+1} + 1.96 \hat{\sigma}_\varepsilon] \tag{14.23} \]




radic3radic


radic3 and




14.7.3 Code






14.8 More R Examples



t

Yt

or Y

t

0 20 40 60 80 100

12

34

56

7








1950 1955 1960 1965

45

55

65

1950 1955 1960 1965

45

55

65


111 463 288 637 195 730 112 464 289 638 197 731 113 464 290 639 197 732 114 465 290 640 198 732 115 465 291 640 198 732


T











Empirical Exercises









R holdout <- 20

















Chapter 15















297



15.1 The AR(p) Model



psumj=1



















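\[ \mathrm{AIC}(p) = T \ln(\mathrm{SSR}) + \overbrace{2(p + 1)}^{\text{penalty}} \tag{15.2} \]





\[ \mathrm{BIC}(p) = T \ln(\mathrm{SSR}) + \overbrace{(p + 1) \ln(T)}^{\text{penalty}} \tag{15.3} \]






\[ \mathrm{AIC}(1) = \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{2(p + 1)}^{4} = 15, \quad \mathrm{BIC}(1) = \overbrace{T \ln(\mathrm{SSR})}^{11} + \overbrace{(p + 1) \ln(T)}^{7.8} = 18.8 \]

\[ \mathrm{AIC}(2) = \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{2(p + 1)}^{6} = 14, \quad \mathrm{BIC}(2) = \overbrace{T \ln(\mathrm{SSR})}^{8} + \overbrace{(p + 1) \ln(T)}^{11.7} = 19.7 \]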
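The comparison above can be reproduced in R. This sketch takes the fit terms 11 and 8 from the example as given and assumes ln(T) = 3.9, the value implied by the penalties 7.8 = 2 × 3.9 and 11.7 = 3 × 3.9; the variable names are chosen here for illustration.

```r
fit_term <- c(p1 = 11, p2 = 8)   # T * ln(SSR) for p = 1 and p = 2 (from the example)
p        <- c(1, 2)              # lag orders being compared
log_T    <- 3.9                  # ln(T) implied by the example's penalty terms

aic <- fit_term + 2 * (p + 1)        # AIC: penalty 2(p + 1)
bic <- fit_term + (p + 1) * log_T    # BIC: penalty (p + 1) ln(T)

print(rbind(AIC = aic, BIC = bic))
print(c(AIC_choice = names(which.min(aic)), BIC_choice = names(which.min(bic))))
```

AIC is smaller for p = 2 while BIC is smaller for p = 1, which shows how BIC's heavier penalty can favor the smaller model.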





15.2.4 Code





[1] -0434


[1] 186 185




















Empirical Exercises



















Stata




R








Stata









Chapter 16

Final Exam



