kaplan: introductory econometrics
TRANSCRIPT
Introductory Econometrics: Description, Prediction, and Causality
Second edition
David M. Kaplan
Copyright © 2021 David M. Kaplan
Licensed under the Creative Commons Attribution–NonCommercial–ShareAlike 4.0 International License (the "License"); you may not use this file or its source files except in compliance with the License. You may obtain a copy of the License at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode, with a more readable summary at https://creativecommons.org/licenses/by-nc-sa/4.0.
First edition, January 2019; second edition, June 2020; updated January 5, 2021.
To my past, present, and future students, including NLK and OAK -- DMK
The chief difficulty Alice found at first was in managing her flamingo: she succeeded in getting its body tucked away, comfortably enough, under her arm, with its legs hanging down, but generally, just as she had got its neck nicely straightened out, and was going to give the hedgehog a blow with its head, it would twist itself round and look up in her face, with such a puzzled expression that she could not help bursting out laughing: and when she had got its head down, and was going to begin again, it was very provoking to find that the hedgehog had unrolled itself, and was in the act of crawling away ... Alice soon came to the conclusion that it was a very difficult game indeed.
Lewis Carroll, Alice's Adventures in Wonderland (An allegory for econometrics)
Brief Contents
Contents
List of Discussion Questions
Preface
Textbook Learning Objectives
Notation
1 Getting Started with R (or Stata)
I Analysis of One Variable
Introduction
2 One Variable, Population
3 One Variable, Sample
4 One Variable, Two Populations
5 Midterm Exam 1
II Regression
Introduction
6 Comparing Two Distributions by Regression
7 Simple Linear Regression
8 Nonlinear and Nonparametric Regression
9 Regression with Two Binary Regressors
10 Regression with Multiple Regressors
11 Midterm Exam 2
12 Internal and External Validity
III Time Series
Introduction
13 Time Series: One Variable
14 First-Order Autoregression
15 AR(p) and ADL Models
16 Final Exam
Bibliography
Index
Contents
Contents
List of Discussion Questions
Preface
Textbook Learning Objectives
Notation
1 Getting Started with R (or Stata)
  1.1 Comparison of R and Stata
  1.2 R
    1.2.1 Accessing the Software
    1.2.2 Installing Packages
  1.3 Stata
    1.3.1 Accessing the Software
    1.3.2 Installing Additional Commands
  1.4 Optional Resources
    1.4.1 R Tutorials
    1.4.2 R Quick References
    1.4.3 Running Code in This Textbook
    1.4.4 Stata Resources
  Empirical Exercises
I Analysis of One Variable
Introduction
2 One Variable, Population
  2.1 The World is Random
    2.1.1 Before and After: Two Perspectives
    2.1.2 Before and After: Sampling
    2.1.3 Outcomes and Mechanisms
  2.2 Population Types
    2.2.1 Finite Population
    2.2.2 Infinite Population
    2.2.3 Superpopulation
    2.2.4 Which Population is Most Appropriate?
  2.3 Description of a Population
    2.3.1 Overview of Distributions and Their Features
    2.3.2 Binary Variable
    2.3.3 Discrete Variable
    2.3.4 Categorical or Ordinal Variable
    2.3.5 Continuous Variable
  2.4 Prelude to Prediction: Precipitation
    2.4.1 Easy: "Predict" Current Weather
    2.4.2 Minimizing Mean Loss
    2.4.3 Different Probability
    2.4.4 Different Loss Function
  2.5 Prediction with a Known Distribution
    2.5.1 Common Loss Functions
    2.5.2 Optimal Prediction: Generic Examples
    2.5.3 Optimal Prediction: Specific Examples
    2.5.4 Mean and Mode as Optimal Predictions
    2.5.5 Interval Prediction
3 One Variable, Sample
  3.1 Bayesian and Frequentist Perspectives
    3.1.1 Very Brief Overview: Bayesian Approach
    3.1.2 Very Brief Overview: Frequentist Approach
    3.1.3 Bayesian and Frequentist Differences
  3.2 Types of Sampling
    3.2.1 Independent
    3.2.2 Identically Distributed
    3.2.3 Examples
  3.3 The Empirical Distribution
  3.4 Estimation of the Population Mean
    3.4.1 "Description": Sample Mean
    3.4.2 "Prediction": Least Squares
    3.4.3 Non-iid Sampling: Weights
  3.5 Sampling Distribution of an Estimator
    3.5.1 Some Mathematical Calculations
    3.5.2 Graphs: Binary Population
    3.5.3 Graphs: Continuous Population
    3.5.4 Table: Values in Repeated Samples
  3.6 Sampling Distribution Approximation
    3.6.1 Non-iid Sampling
  3.7 Quantifying Accuracy of an Estimator
    3.7.1 Bias
    3.7.2 Mean Squared Error
    3.7.3 Consistency
    3.7.4 Asymptotic MSE
  3.8 Quantifying Uncertainty: Frequentist Approaches
    3.8.1 Standard Errors
    3.8.2 Confidence Intervals
    3.8.3 p-values
    3.8.4 Statistical Significance
    3.8.5 Hypothesis Testing
    3.8.6 Mental Math for Statistical Uncertainty
  3.9 Quantifying Uncertainty: Misinterpretation and Misuse
    3.9.1 Perils of Ignoring Non-iid Sampling
    3.9.2 Not a Bayesian Belief
    3.9.3 Unlikely Events Happen (or Use Common Sense)
    3.9.4 Example of Ignoring Outside Knowledge
    3.9.5 Multiple Testing (Multiple Comparisons)
    3.9.6 Publication Bias and Science
    3.9.7 Ignoring Point Estimates (Economic Significance)
    3.9.8 Other Sources of Uncertainty
  3.10 Statistical Decision Theory
  Empirical Exercises
4 One Variable, Two Populations
  4.1 Description
    4.1.1 Population Mean Difference
    4.1.2 Estimation
    4.1.3 Quantifying Uncertainty
  4.2 Prediction
  4.3 Causality: Overview
    4.3.1 Correlation Does Not Imply Causation
    4.3.2 Structural and Reduced Form Approaches
    4.3.3 General Equilibrium and Partial Equilibrium
  4.4 Potential Outcomes Framework
    4.4.1 Potential Outcomes
    4.4.2 Treatment Effects
    4.4.3 SUTVA
  4.5 Average Treatment Effect
    4.5.1 Definition and Interpretation
    4.5.2 ATE Examples
    4.5.3 Limitation of ATE
  4.6 ATE Identification
    4.6.1 Setup and Identification Question
    4.6.2 Randomization
    4.6.3 Reasons for Identification Failure
  4.7 ATE Estimation and Inference
  Empirical Exercises
5 Midterm Exam 1
II Regression
Introduction
6 Comparing Two Distributions by Regression
  6.1 Logic
    6.1.1 Terminology
    6.1.2 Theorems
    6.1.3 Comparing Assumptions
  6.2 Preliminaries
    6.2.1 Population Mean Model in Error Form
    6.2.2 Joint and Marginal Distributions
    6.2.3 Conditional Distributions
    6.2.4 Conditional Mean
    6.2.5 Comparison of Joint, Marginal, and Conditional Distributions
    6.2.6 Independence and Dependence
  6.3 Population Model: Conditional Expectation Function
    6.3.1 Conditional Expectation Function
    6.3.2 CEF Error Term
    6.3.3 CEF Model in Error Form
    6.3.4 Linear CEF Model
    6.3.5 Interpretation: Description and Prediction
    6.3.6 Interpretation with Values Besides 0 and 1
  6.4 Population Model: Potential Outcomes
  6.5 Population Model: Structural
    6.5.1 Linear Structural Model
    6.5.2 General Structural Model and ASE
  6.6 Identification
    6.6.1 Linear Structural Model
    6.6.2 Average Treatment Effect
    6.6.3 Average Structural Effect
  6.7 Estimation: OLS
    6.7.1 Code
  6.8 Quantifying Uncertainty
    6.8.1 Heteroskedasticity
    6.8.2 Code
  Empirical Exercises
7 Simple Linear Regression
  7.1 Misspecification
  7.2 Coping with Misspecification
    7.2.1 Model of Three Values
    7.2.2 More Than Three Values
  7.3 Linear Projection
    7.3.1 Geometric Intuition
    7.3.2 Probabilistic Projection
    7.3.3 Formulas and Interpretation
    7.3.4 Linear Projection Model in Error Form
  7.4 "Best" Linear Approximation
    7.4.1 Definition and Interpretation
    7.4.2 Limitations
  7.5 "Best" Linear Predictor
  7.6 Causality Under Misspecification
  7.7 OLS Estimation and Inference
    7.7.1 OLS Estimator Insights
    7.7.2 Statistical Properties
    7.7.3 Code
  7.8 Simple Linear Regression
  Empirical Exercises
8 Nonlinear and Nonparametric Regression
  8.1 Log Transformation
    8.1.1 Properties of the Natural Log Function
    8.1.2 The Log-Linear Model
    8.1.3 The Linear-Log Model
    8.1.4 The Log-Log Model
    8.1.5 Warning: Model-Driven Results
    8.1.6 Code
  8.2 Nonlinear-in-Variables Regression
    8.2.1 Linearity
    8.2.2 Nonlinearity
    8.2.3 Estimation and Inference
    8.2.4 Parameter Interpretation
    8.2.5 Description, Prediction, and Causality
    8.2.6 Code
  8.3 Nonparametric Regression
  Empirical Exercises
9 Regression with Two Binary Regressors
  9.1 Omitted Variable Bias
    9.1.1 An Allegory
    9.1.2 Formal Conditions
    9.1.3 Consequences
    9.1.4 OVB in Linear Projection
  9.2 Linear-in-Variables Model
  9.3 Fully Saturated Model
  9.4 Structural Identification by Exogeneity
  9.5 Identification by Conditional Independence
  9.6 Collider Bias
  9.7 Causal Identification: Difference-in-Differences
    9.7.1 Bad Approaches
    9.7.2 Counterfactuals and Parallel Trends
    9.7.3 Identification
    9.7.4 Extensions
  9.8 Estimation and Inference
  Empirical Exercises
10 Regression with Multiple Regressors
  10.1 Omitted Variable Bias
  10.2 Linear-in-Variables Model
    10.2.1 Model and Coefficient Interpretation
    10.2.2 Limitations
    10.2.3 Code
  10.3 Interaction Terms
    10.3.1 Limitation of Linear-in-Variables Model
    10.3.2 Interpretation of Interaction Term
    10.3.3 Non-Binary Interaction
    10.3.4 Code
  10.4 Other Examples
  10.5 Assumptions for Linear Projection
    10.5.1 Multicollinearity (Two Types)
    10.5.2 Formal Assumptions and Results
  10.6 Structural Identification
    10.6.1 Linear Structural Model
    10.6.2 General Structural Model
    10.6.3 Conditional ATE
  Empirical Exercises
11 Midterm Exam 2
12 Internal and External Validity
  12.1 Terminology
  12.2 Threats to External Validity
    12.2.1 Different Place
    12.2.2 Different Time
    12.2.3 Different Population
  12.3 Threats to Internal Validity
    12.3.1 Functional Form Misspecification
    12.3.2 Measurement Error in the Outcome Variable
    12.3.3 Measurement Error in the Regressors
    12.3.4 Non-iid Sampling and Survey Weights
    12.3.5 Missing Data
    12.3.6 Sample Selection
    12.3.7 Omitted Variable Bias and Collider Bias
    12.3.8 Simultaneity and Reverse Causality
  Empirical Exercises
III Time Series
Introduction
13 Time Series: One Variable
  13.1 Terms and Notation
  13.2 Populations, Randomness, and Sampling
  13.3 Stationarity
  13.4 Autocovariance and Autocorrelation
  13.5 Estimation
    13.5.1 Mean
    13.5.2 Autocovariances and Autocorrelations
  13.6 Nonstationarity
    13.6.1 Trends
    13.6.2 Seasonality
    13.6.3 Cycles
    13.6.4 Structural Breaks
  13.7 Decomposition
  13.8 Transformations
  Empirical Exercises
14 First-Order Autoregression
  14.1 Model
  14.2 Description
  14.3 Prediction (Forecasting) Optimality
  14.4 Estimation
    14.4.1 Code
  14.5 Parameter Stability
  14.6 Multi-Step Forecast
    14.6.1 Intuition: Mean Zero (Special Case)
    14.6.2 General Results with AR Parameters
    14.6.3 Direct Approach
    14.6.4 Code
  14.7 Interval Forecasts
    14.7.1 Goal and Sources of Uncertainty
    14.7.2 Intervals Assuming Normality
    14.7.3 Code
  14.8 More R Examples
    14.8.1 AR(1) Multi-Step Forecast Intervals
    14.8.2 General R Forecast Allowing Seasonality and Trend
  Empirical Exercises
15 AR(p) and ADL Models
  15.1 The AR(p) Model
  15.2 Model Selection: How Many Lags?
    15.2.1 Difficulties and Intuition
    15.2.2 AIC and BIC Formulas
    15.2.3 Comparison of AIC and BIC
    15.2.4 Code
  15.3 Autoregressive Distributed Lag Regression
  Empirical Exercises
16 Final Exam
Bibliography
Index
List of Discussion Questions
2.1 web traffic
2.2 student data
2.3 banana loss function
2.4 optimal banana prediction
3.1 probability of positive mean
3.2 equal p-values, equal belief
3.3 jellybean solution
3.4 nova
3.5 Ebola drug
4.1 DPC with two populations
4.2 description, prediction, causality
4.3 cash transfer spillovers
4.4 breakfast effect
6.1 logic with feathers
6.2 joint distribution and causality
6.3 ES habits and final scores
6.4 marriage and salary
7.1 Facebook
7.2 BLP
7.3 student-teacher ratio simple regression
8.1 pollution and house price
8.2 nonlinear OVB
8.3 nonlinear wage model interpretation
8.4 model evaluation
9.1 assessing OVB
9.2 OVB: ES habits
9.3 bad panel approach 1 for Mariel boatlift
9.4 bad panel approach 2 for Mariel boatlift
9.5 bad panel approach 1 for fracking
9.6 bad panel approach 2 for fracking
9.7 parallel trends skepticism
10.1 OVB with multiple regressors
10.2 sleep and interactions
10.3 wage and interactions
10.4 linear-in-variables
12.1 external validity: minimum wage
12.2 exercise error
12.3 measurement error: scrap rate
12.4 program attrition
12.5 missing salary data
12.6 health and medical expenditure
13.1 autocorrelation
13.2 nonstationarity
14.1 forecast and reality
14.2 recession-affected coefficient
14.3 long-horizon AR forecast
14.4 forecast sanity check
15.1 lag choice for forecasting
Preface
This text was prepared for the 15-week semester Introductory Econometrics course at the University of Missouri. The class focuses on statistical description, prediction, and "causality," including both structural parameters and treatment effects. Description and prediction (forecasting) with time series are also covered. Students learn to think probabilistically, understand prediction and causality, judge whether various assumptions hold true in real-world examples, and apply econometric methods in R.
As usual, this text may be used to teach different types of classes. In full, the text provides a 15-week semester class that assumes a previous class in probability and statistics. That prerequisite could be skipped if more time is spent on the "review" material in the first few chapters. Calculus is avoided but could be added in the usual places. A shorter class could omit the time series material. Of course, any material may be expanded, condensed, or skipped as the instructor desires.
Some complementary, complimentary texts and courses deserve mention. Econometrics professor Matt Masten has a "Causal Inference Bootcamp" video series,[1] as well as some "Causal Inference with R" free courses on DataCamp.[2] Relevant videos are linked at the beginning of each chapter in this textbook. Stanford statistics professors Trevor Hastie and Rob Tibshirani created a free introductory machine learning (statistical learning) course, focusing more on prediction and estimation.[3] Their course uses their free textbook (James, Witten, Hastie, and Tibshirani, 2013) that includes R examples.[4] Hastie, Tibshirani, and Friedman (2009) also provide their more advanced statistical learning text for free.[5] For econometrics texts focused on prediction and time series, see Diebold (2018a,b,c).[6] The forecasting book by Hyndman and Athanasopoulos (2019) is at https://otexts.com/fpp2 and uses R. Finally, Hanck, Arnold, Gerber, and Schmelzer (2018) mirror the structure of the (expensive) textbook of Stock and Watson (2015), providing many R examples to illustrate the concepts they explain.[7]
[1] https://mattmasten.github.io/bootcamp
[2] https://www.datacamp.com/community/open-courses
[3] https://www.edx.org/course/statistical-learning
[4] https://statlearning.com
[5] https://web.stanford.edu/~hastie/ElemStatLearn
[6] http://www.ssc.upenn.edu/~fdiebold/Textbooks.html
[7] https://www.econometrics-with-r.org
One distinguishing feature of this text is the development of the ideas of (and distinctions among) statistical description, prediction, causal inference, and structural estimation in the simplest possible settings. Other texts combine these with all the complications of regression from the beginning, often confusing students (like my past self).
A second distinguishing feature is that this text's source files are freely available. Instructors may modify them as desired, or copy and paste LaTeX code into their own lecture notes, subject to the Creative Commons license linked on the copyright page. I wrote the text in Overleaf, an online (free) LaTeX environment that includes knitr support, so most of the R code and output is in the same .Rtex files alongside the LaTeX code. Graphs are either generated from code in the .Rtex files or else from a single .R file also provided in the source material. You may see, copy, and download the entire project from Overleaf[8] or from my website.[9]
Third, I provide learning objectives for the overall book and for each chapter. This follows current best practices for course design. Upon request, I can provide a library of multiple choice questions labeled by learning objective. (Empirical exercises are already at the end of each textbook chapter.)
Fourth, in-class (or online) discussion questions are included along the way. When I teach in person (30–40 students), I prefer to punctuate lectures with such questions every 20–30 minutes, where students first discuss them for a couple minutes in small groups of 2–3 students and then volunteer to share their group's ideas with the whole class for another couple minutes. This provides an active learning opportunity, a time for students to realize they don't understand the lecture material (so they can ask questions), practice discussing econometrics with peers, and (if nothing else) a few minutes' rest.
Thanks to everyone for their help and support: my past econometrics instructors, my colleagues and collaborators, my students (who have not only inspired me but alerted me to typos and other deficiencies in earlier drafts), and my family.
David M. Kaplan
Summer 2018 (edited Summer 2020)
Columbia, Missouri, USA
[8] https://www.overleaf.com/read/fszrgmwzftrk
[9] http://faculty.missouri.edu/kaplandm/teach.html
Textbook Learning Objectives
For good reason, it has become standard practice to list learning objectives for a course as well as each unit within the course. Below are the learning objectives corresponding to this text overall. Each chapter lists more specific learning objectives that map to one or more of these overall objectives. The accompanying exercises are also classified by learning objectives. I hope you find these helpful guidance, whether you are a solo learner, a class instructor, or a class student.
The textbook learning objectives (TLOs) are the following.
1. Define terms from probability, statistics, and econometrics, both mathematically and intuitively.
2. Describe various econometric methods, both mathematically and intuitively, including their objects of interest and assumptions, and the logical relationship between the assumptions and corresponding theorems and properties.
3. Interpret the values that could be estimated with infinite data, in terms of description, prediction, and causality (or economic meaning).
4. Explain the frequentist/classical statistical and asymptotic frameworks, including their benefits and limitations.
5. Provide multiple possible (causal) explanations for any statistical result, distinguishing between statistical and causal relationships.
6. For a given economic question, dataset, and econometric method, judge whether the method is appropriate, and judge the economic significance and statistical significance of the results.
7. Using R (or Stata), manipulate and analyze data, interpreting results both economically and statistically.
Notation
Much of the notation below will not make sense until you get to the corresponding point in the text. The following is primarily for your reference later.
Variables
Usually, uppercase denotes a random variable, whereas lowercase denotes a non-random (fixed, constant) value. The primary exception is for certain counting variables, where uppercase indicates the maximum value and lowercase indicates a general value; e.g., time period $t$ can be $1, 2, 3, \ldots, T$, or regressor $k$ out of $K$ total regressors. Scalar, vector, and matrix variables are typeset differently. For example, an $n$-by-$k$ random matrix with scalar (random variable) entries $X_{ij}$ (row $i$, column $j$) is written
$$
X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{pmatrix}
$$
and a $k$-dimensional non-random vector is
$$
z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix}
$$
Unless otherwise specified, vectors are column vectors (like above).
Both vectors and matrices can be transposed. The transpose of a column vector is a row vector. For example, the transpose of the $z$ defined above is
$$
z' = (z_1, z_2, \ldots, z_k)
$$
and the transpose of the $X$ defined above is
$$
X' = \begin{pmatrix}
X_{11} & X_{21} & \cdots & X_{n1} \\
X_{12} & X_{22} & \cdots & X_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
X_{1k} & X_{2k} & \cdots & X_{nk}
\end{pmatrix}
$$
where the row $i$, column $j$ entry in $X'$ is the row $j$, column $i$ entry in $X$.
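As a small illustration of the matrix and transpose notation above, the following R sketch builds a non-random 2-by-3 matrix and checks the transpose rule. The object names (`x_mat`, `z_vec`, and so on) and the numbers are made up for illustration; they are not from the text.

```r
# Illustrative (made-up) 2-by-3 matrix; matrix() fills column by column
x_mat <- matrix(1:6, nrow = 2, ncol = 3)

# t() returns the transpose, which is 3-by-2
x_t <- t(x_mat)
print(dim(x_t))

# The row i, column j entry of the transpose equals
# the row j, column i entry of the original matrix
i <- 3
j <- 1
print(x_t[i, j] == x_mat[j, i])

# A plain R vector has no row/column orientation;
# as.matrix() turns it into a column vector (k-by-1),
# and t() of that column vector is a row vector (1-by-k)
z_vec <- c(10, 20, 30)
z_col <- as.matrix(z_vec)
z_row <- t(z_col)
print(dim(z_col))
print(dim(z_row))
```

Working with explicit k-by-1 and 1-by-k matrices, rather than plain vectors, keeps the code closer to the column-vector convention used in the notation.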
Greek letters like $\beta$ and $\theta$ generally denote non-random (fixed) population parameters.
Estimators usually have a "hat" on them. Since estimators are computed from data, they are random from the frequentist perspective. Thus, even if $\theta$ is a non-random population parameter, $\hat{\theta}$ is a random variable.
I try to put "hats" or bars on other quantities computed from the sample, too. For example, a t-statistic would be $\hat{t}$ (a random variable computed from the sample) instead of just $t$ (which looks like a non-random scalar). The sample average of $Y_1, \ldots, Y_n$ is $\bar{Y}$.
Estimators and other statistics (i.e., things computed from data) may sometimes have a subscript with the sample size $n$ to remind us that their sampling distribution depends on $n$. For example: $\hat{\theta}_n$, $\hat{t}_n$, and $\bar{Y}_n$.
The following is a summary.
$y$  scalar fixed (non-random) value
$Y$  scalar random variable
$\theta$  scalar non-random value
$\hat{\theta}$  scalar random variable
$x$  non-random column vector
$x'$  transpose of $x$
$X$  random column vector
$\beta$  non-random column vector
$\hat{\beta}$  random column vector
$w$  non-random matrix
$w'$  transpose of $w$
$W$  random matrix
$\Omega$  non-random matrix
$\hat{\Omega}$  random matrix
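The statement above that an estimator is random from the frequentist perspective can be sketched by simulation. The population below (normal with mean 3 and standard deviation 1), the sample size, and the variable names are assumptions chosen only for illustration.

```r
set.seed(123)  # makes the simulated samples reproducible

n <- 20          # sample size (illustrative)
pop_mean <- 3    # non-random population parameter (illustrative value)

# Draw 5 separate samples of size n and compute the sample average of each
ybar_n <- replicate(5, mean(rnorm(n, mean = pop_mean, sd = 1)))

# The population mean stays fixed, but the sample average
# takes a different value in each sample
print(pop_mean)
print(ybar_n)
```

Changing `n` and rerunning gives a rough feel for why a subscript like the one in $\bar{Y}_n$ is sometimes used: how much the sample averages vary depends on the sample size.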
Symbols
In addition to the following symbols, vocabulary words and abbreviations (like "regression" or "OLS") can be looked up in the Index in the very back of the text.
$\implies$  implies; see Section 6.1
$\impliedby$  is implied by; see Section 6.1
$\iff$  if and only if; see Section 6.1
$\lim_{n\to\infty}$  limit (like in pre-calculus)
$\operatorname{plim}_{n\to\infty}$  probability limit; see Section 3.7.3
$\to$  converges to (like in pre-calculus)
$\overset{p}{\to}$  converges in probability to; see Section 3.7.3
$\equiv$  is defined as
$\approx$  approximately equals
$\sim$  is distributed as
$\overset{a}{\sim}$  is distributed approximately (or asymptotically) as; see (3.16)
$X \perp\!\!\!\perp Y$  $X$ and $Y$ are statistically independent; see Section 6.2.6
$\mathrm{N}(\mu, \sigma^2)$  normal distribution with mean $\mu$ and variance $\sigma^2$
$\mathrm{N}(0, 1)$  standard normal distribution
$F_Y(\cdot)$  cumulative distribution function (CDF) of $Y$; see Section 2.3
$f_Y(\cdot)$  PMF of $Y$ (if $Y$ is discrete); see Section 2.3
$f_Y(\cdot)$  PDF of $Y$ (if $Y$ is continuous); see Figure 2.3
$1\{\cdot\}$  indicator function; see (2.3)
$\mathrm{P}(A)$  probability of event $A$
$\mathrm{P}(A \mid B)$  conditional probability of $A$ given $B$; see Section 6.2.3
$\mathrm{E}(Y)$  expectation (mean) of $Y$; see Section 2.3
$\mathrm{E}(Y \mid X = x)$  CEF (a function of $x$); see Section 6.3
$\mathrm{E}(Y \mid X)$  conditional expectation of $Y$ given $X$; this is a random variable
$\sum_{i=1}^{n}$  summation from $i = 1$ to $i = n$
$\mathrm{Var}(Y)$  variance of $Y$ (square of standard deviation); see (2.10)
$\mathrm{Var}(Y \mid X = x)$  conditional variance (a non-random value); see Section 6.8.1
$\mathrm{Var}(Y \mid X)$  conditional variance (a random variable)
$\mathrm{Cov}(Y, X)$  covariance
$\mathrm{Corr}(Y, X)$  correlation
$\{a, b, \ldots\}$  a set (containing elements $a$, $b$, etc.)
$i = 1, \ldots, n$  same as $i \in \{1, \ldots, n\}$ (integers from 1 to $n$)
$j = 1, \ldots, J$  same as $j \in \{1, \ldots, J\}$ (integers from 1 to $J$)
$s \in S$  element $s$ is in set $S$
$\hat{\mathrm{E}}(Y)$  expectation for sample distribution; see Section 3.4.1
$\bar{Y}_n$  $\frac{1}{n}\sum_{i=1}^{n} Y_i$; same as $\hat{\mathrm{E}}(Y)$; see Section 3.4.1
$\hat{\theta}$  estimator of population parameter $\theta$; see Section 3.4
$\mathrm{SE}(\hat{\theta})$  standard error of estimator $\hat{\theta}$; see Section 3.8.1
$\arg\min_g f(g)$  the value of $g$ that minimizes $f(g)$
$\arg\max_g f(g)$  the value of $g$ that maximizes $f(g)$
$v'$, $x'$  transposes of matrix $v$ and vector $x$, respectively
$v^{-1}$  inverse of matrix $v$
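A few of the symbols above have direct counterparts in R. The sketch below uses only base R; the vector `y`, the cutoff 3, and the function name `f` are made up for illustration and are not from the text.

```r
y <- c(1, 4, 2, 7, 6)   # made-up sample values
n <- length(y)

# Summation and sample average: (1/n) times the sum of the Y_i,
# which is what mean() computes
ybar <- sum(y) / n
print(ybar == mean(y))

# Indicator function: (y > 3) is TRUE/FALSE, treated as 1/0 when averaged,
# so this is the sample proportion of values above 3
prop_above_3 <- mean(y > 3)
print(prop_above_3)

# arg min: the value of g that minimizes f(g);
# here f(g) is the sum of squared deviations of y from g
f <- function(g) sum((y - g)^2)
g_star <- optimize(f, interval = range(y))$minimum
print(g_star)   # numerically close to the sample average
```

Here `optimize()` searches numerically, so `g_star` matches the sample average only up to a small tolerance.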
Chapter 1
Getting Started with R (or Stata)
⟹ Kaplan video: Course Introduction
Depends on no other chapters
Unit learning objectives for this chapter
1.1 Run statistical software (R/RStudio or Stata) [TLO 7]
1.2 Write code to do basic data manipulation, description, and display [TLO 7]
You will use R (or Stata) for the empirical exercises in this textbook. The code examples in the textbook are all in R.
No previous experience with any statistical software is assumed. Consequently, the primary goal of the empirical exercises is to develop your confidence and experience with statistical software, applying the text's methods and ideas to real datasets. Toward this goal, there are lots of explicit hints about the code you need to write.
If you actually do have previous experience (or above-average interest), then the empirical exercises may feel too boring. You could try figuring out alternative ways to code the solution, or coding alternative analyses, etc. You can also explore other online resources, like one of the free DataCamp courses.[1]
[1] https://www.datacamp.com/community/open-courses
Due to the many excellent resources online (see Section 1.4), there are many people who can write R code, but most do not understand how to properly interpret econometric results or judge which method is most appropriate. So, overall, this class/textbook focuses more on understanding econometrics than coding.
1.1 Comparison of R and Stata
I like both R and Stata statistical software, and I have used both professionally. They excel in different ways, mentioned below.
For this textbook/class, I focus on R for the following reasons.
1. It's widely used in the private sector, government, and academia alike, in many fields (including economics).
2. It's free to download/use, and can even be used through a web browser.
3. It has many econometric/statistical functions available, and creators of new econometric/statistical methods often provide code in R.
4. There are many online resources for learning R and getting help.
In comparison, Stata:
1. is widely used in economics and certain social sciences, but less so in fields like data science and statistics;
2. is not free and can't be used in a browser, but is free to use in many campus computer labs;
3. is easier to use for standard econometric methods, and has some new econometric methods (while others take a few years to be implemented);
4. also has good help files (documentation) and online support.
1.2 R
1.2.1 Accessing the Software
There are three ways you could run R: downloaded onto your own computer, through a web browser (in the cloud), or on another computer like in a campus computer lab.
Other computers or web browser versions may have the core R software but lack certain packages needed for the empirical exercises. In some cases, you can simply install the necessary packages with a single command (e.g., in Mizzou computer labs). In other cases, you may be prohibited from installing packages, in which case you won't be able to complete the exercises, so make sure to check this first.
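One way to check this is to ask R whether each package can be loaded. The following is a minimal sketch using base R's `requireNamespace()`; the package names are the ones mentioned in Section 1.2.2 and may differ by exercise, and the variable names are made up for illustration.

```r
# Packages mentioned in Section 1.2.2
# (double-check the names required for each exercise)
needed <- c("wooldridge", "lmtest", "sandwich", "forecast", "survey")

# requireNamespace() returns TRUE if the package is installed and loadable,
# and FALSE otherwise
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Names of any packages that still need to be installed
print(needed[!available])
```

If the last line prints any package names, those are the ones to install (or to ask about, if installation is prohibited on that computer).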
Through a Web Browser
There are many free options for using R through a web browser, and they evolve quickly. This means both new and improved options becoming available, as well as existing options disappearing, even from major companies (e.g., Microsoft Azure Notebooks was "retired").
Currently, I suggest you use RStudio Cloud. It's free, reliable, and the same RStudio interface as if you downloaded RStudio, so you can learn from the latter half of my RStudio video. To get started:
1. Go to https://rstudio.cloud in any web browser.
2. Click the GET STARTED FOR FREE button (or else "Log In" if you already have an account).
3. Click the "Sign Up" button (the free "Cloud Free" account is selected by default).
4. Enter your email, (new) password, and name, and click "Sign Up" (or else "Sign up with Google" if you prefer).
5. Start using RStudio like it were on your own computer.
6. Install necessary packages like usual; see Section 1.2.2.
7. After you log out and later log in, click "Untitled Project" (feel free to rename) to get back to where you were.
At http://mybinder.org/v2/gh/binder-examples/r/master?urlpath=rstudio you can also use the RStudio interface through a web browser without even making an account, but 1) it does not run the most current version of R, 2) it cannot save your files from one session to the next, and 3) you have to install the packages every time (which takes many minutes to run). But these are not critical problems for this class: older R versions are fine, you can save your code/output in a text file on your own computer, and you can make some tea while the packages install.
Currently, the best other options use Jupyter Notebooks. In order of my preference (for this class):
• CoCalc: no account required; all needed packages already installed; go to https://cocalc.com and click "Run CoCalc Now" and wait for it to load, then click "R (system-wide)" under "Suggested kernels" and you can start typing R code.
• Google CoLab: requires Google account; go to https://colab.research.google.com/drive/1BYnnbqeyZAlYnxR9IHC8tpW07EpDeyKR and then in the Edit menu click "Select all cells" and (also in the Edit menu) "Delete selected cells" to get a blank notebook, then under Insert click "Code cell" and start typing code; can install all needed packages as in Section 1.2.2 (takes a few minutes to run).
• Gradient by Paperspace: I haven't tried it, but looked promising; requires free account at https://gradient.paperspace.com.
In a Mizzou Computer Lab
You can check which Mizzou computing sites/labs have your favorite software on the Computing Sites Software web page.[2] Scroll down to RStudio to see where you can use R with RStudio. However, sometimes there are classes or other events in computer labs; you can check the weekly schedule posted near the door to find a free time, or you can check online.[3]
After you log into the computer in the computer lab, open RStudio from the Start menu (RStudio calls R itself in the background; you don't have to open R directly). Then just start typing commands and hit Enter to run them.
[2] https://doit.missouri.edu/services/computing-sites/sites-software
[3] https://doit.missouri.edu/services/computing-sites and click the lab name
The computer labs don't currently have the necessary packages pre-installed, but you can easily install them. Note that you'll have to do this every time you log in (since any files you download/save get deleted when you log out), but you can just run the same line of code when you start RStudio each time.
Also, make sure to email yourself your code (or otherwise save it, if you haven't finished and uploaded to Canvas) before you log out, since your files get deleted when you log out.
Downloading Software
⟹ Kaplan video: Getting Started with R/RStudio
You'll download two pieces of software: R itself, and RStudio. Both are free. R has all the functions you need; RStudio makes the interface nicer and makes things easier for you.
On Windows:
• Download the .exe installer file for R: Google "R Windows" or try https://cran.r-project.org/bin/windows/base and click the "Download" link near the top.
• Open the downloaded .exe installer and follow the instructions.
• Download the .exe installer file for RStudio Desktop (free version): Google "RStudio download" or try https://www.rstudio.com/products/rstudio/download/#download
• Open the downloaded .exe installer and follow the instructions.
On Mac:
• Download the .pkg file for R: Google "R Mac" or try https://cran.r-project.org/bin/macosx
• Open the file and follow the usual Mac installation procedure.
• Download the .dmg file for RStudio Desktop (free version): Google "RStudio download" or try https://www.rstudio.com/products/rstudio/download/#download
• Open the file and follow the usual Mac installation procedure.
On Linux, etc.: if you can figure out how to run something besides Windows or Mac, you can probably figure out how to download a couple files by yourself, but please let me know if not.
Regardless of OS, after both are installed, you only ever need to open RStudio, never R. Once you open RStudio, just type a command and hit Enter to run it.
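For instance, the following few lines could serve as a first session to confirm that everything works. They use only base R, and the numbers and variable names are made up for illustration.

```r
# Which version of R is RStudio running?
print(R.version.string)

# Store some made-up values in a vector named x
x <- c(2, 4, 6, 8)

# Basic description
print(mean(x))      # average of the values
print(summary(x))   # minimum, quartiles, mean, maximum

# Basic manipulation: keep only the values above 3
x_big <- x[x > 3]
print(x_big)
```

Each line can be typed into the RStudio console one at a time, pressing Enter after each.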
1.2.2 Installing Packages
You may need to install certain packages to do the empirical exercises. This can be done with a single command in R. You should double-check the package names required for each exercise, but it would be something like
install.packages(c("wooldridge", "lmtest", "sandwich", "forecast", "survey"))
With R on your own computer, you only need to run this once (not every time you use your computer), but with a web interface or computer lab, you may need to run this code every time you start a session in R. You can check which packages are already installed with installed.packages().
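Because the install command may need to be rerun in every session on a web interface or lab computer, it can help to install only what is missing. The following is a sketch rather than code from the text; the helper name `missing_packages()` is hypothetical, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: which of the requested packages are not yet installed?
# installed.packages() returns a matrix whose row names are package names.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("wooldridge", "lmtest", "sandwich", "forecast", "survey")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

On your own computer, `to_install` should be empty after the first successful installation.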
A bit about the packages:
• wooldridge (Shea, 2018) has datasets originally collected by Wooldridge (2020) from various sources.
• lmtest and sandwich (Zeileis, 2004; Zeileis and Hothorn, 2002) help construct confidence intervals (and other things) appropriate for economic data.
• survey (Lumley, 2004, 2019) has functions for dealing with complex survey sampling.
• forecast (Hyndman, Athanasopoulos, Bergmeir, Caceres, Chhay, O'Hara-Wild, Petropoulos, Razbash, Wang, and Yasmeen, 2020; Hyndman and Khandakar, 2008) has methods for forecasting.
1.3 Stata
1.3.1 Accessing the Software
There are three ways you could run Stata: in a campus computer lab, through Mizzou's Software Anywhere, or (if you purchase your own copy) downloaded onto your own computer.
Empirical exercises only require built-in commands. Stata has additional commands available for download, but none are needed for the exercises, so any (internet-connected) computer with Stata is sufficient.
In a Mizzou Computer Lab
You can check which Mizzou computing sites/labs have your favorite software on the Computing Sites Software web page.⁴ Scroll down to Stata to see where it's available. However, sometimes there are classes or other events in computer labs; you can check the weekly schedule posted near the door to find a free time, or you can check online.⁵
After you log into the computer in the computer lab, open Stata from the Start menu (the actual name is somewhat longer, like "StataSE 15 (64-bit)"). Ideally, you should open the do-file editor and save a .do file, but for this class you could just type commands into the short horizontal space at the bottom labeled "Command." You type a command and hit Enter to run it.
⁴ https://doit.missouri.edu/services/computing-sites/sites-software
⁵ https://doit.missouri.edu/services/computing-sites and click the lab name
Also, make sure to email yourself your code (or otherwise save it, if you haven't finished and uploaded to Canvas) before you log out, since your files get deleted when you log out.
Purchasing and Downloading Software
Student pricing is shown on the Stata website.⁶ Currently (Spring 2020), the cheapest option is the 6-month Stata/IC license. Other, more expensive licenses are fine, too.
The software is delivered via download. Follow instructions for installation, and contact Stata if you have any technical difficulties.
Software Anywhere (Mizzou)
From the Software Anywhere web page,⁷ click the "Getting Started" tab and follow the instructions. Once logged in, it's the same as if you were sitting at a computer in a Mizzou computer lab (see above).
Technical assistance: MU Division of IT, techsupport@missouri.edu
1.3.2 Installing Additional Commands
Like in R, there are additional Stata commands that can be easily downloaded and installed. Commonly, this can be done with the command ssc install followed by the name of the command.
For the exercise sets, the only additional command you'll need is bcuse. You can install this with the command ssc install bcuse. If you're in a computer lab, you may need to run this command every time you start Stata; if you have it on your computer, just once is sufficient. This command makes it easy to load the datasets from Wooldridge (2020).⁸
1.4 Optional Resources
If you only want to learn enough R (or Stata) to do well in this class, then you may skip this section. If you'd like to learn more on your own, these resources might help you get started in the right direction.
1.4.1 R Tutorials
Eventually, you will be able to simply Google questions you have about R. There are lots of people on the internet really excited about helping you figure stuff out in R, which is great.
⁶ https://www.stata.com/order/new/edu/gradplans/student-pricing
⁷ https://doit.missouri.edu/services/software/software-anywhere
⁸ Descriptions: http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html
However, when you are first getting started, it may help to go through a basic tutorial. You are welcome to Google "R basic tutorial" yourself, or you could try one of the following.
1. Section 2.3 ("Lab: Introduction to R") in James, Witten, Hastie, and Tibshirani (2013)
2. Section 1.1 in Hanck et al. (2018)
3. Sections 1.1–1.3 in Heiss (2016)
4. Sections 2.1–2.5 in Kleiber and Zeileis (2008) [Chapter 2 is free on their website]
5. Chapter 2 in Kaplan (2020)
6. No longer free after first chapter: datacamp.com courses like Introduction to R⁹
1.4.2 R Quick References
At first, it may help to have some quick reference "cheat sheets."¹⁰ ¹¹
1.4.3 Running Code in This Textbook
If you'd like, you should be able to copy code directly from the textbook .pdf file and paste it into R. Sometimes you need to install a certain package first. This can be done either manually or with the R function install.packages(). For example, to install package mgcv, run the command install.packages("mgcv") within R.
1.4.4 Stata Resources
For Stata, helpful cheat sheets (quick references) are available for free,¹² as well as various tutorials.¹³
⁹ https://www.datacamp.com/courses/free-introduction-to-r
¹⁰ https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf
¹¹ https://cran.r-project.org/doc/contrib/Short-refcard.pdf
¹² https://www.stata.com/bookstore/statacheatsheets.pdf
¹³ https://www.stata.com/links/resources-for-learning-stata
Empirical Exercises
Empirical Exercise EE1.1. In either R or Stata, create a script (a sequence of commands, with one command per line) to do the following. The data are from a New York Times article on December 28, 1994.
a. R: load (and install if necessary) package wooldridge:
   if (!require(wooldridge)) {install.packages("wooldridge"); library(wooldridge)}
   Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse wine, clear
b. View basic dataset info with R command ?wine or Stata command describe
c. View the first few rows of the dataset with R command head(wine) or Stata command list if _n<=5
d. Rename the alcohol column, which measures liters of alcohol from wine (consumed per capita per year).
   R: names(wine)[2] <- "wine"
   Stata: rename alcohol wine
e. Add a column named id whose value is just 1, 2, 3, 4, 5, etc.
   R: wine$id <- 1:nrow(wine)
   Stata: generate id = _n
f. Display the countries with fewer than 100 heart disease deaths per 100,000 people.
   R: wine$country[wine$heart<100]
   Stata: list country if heart<100
g. Display the rows for the countries with the 5 lowest death rates, sorted by death rate.
   R: wine[order(wine$deaths)[1:5], ]
   Stata: sort deaths followed by list if _n<=5
h. Add a column with the sum of heart and liver disease deaths per 100,000.
   R: wine$heartplusliver <- wine$heart + wine$liver
   Stata: generate heart_plus_liver = heart + liver
i. Generate a variable with the squared death rate.
   R: wine$deathssq <- wine$deaths^2
   Stata: generate deaths_sq = deaths^2
j. Display the sorted death rates.
   R: print(sort(wine$deaths))
   Stata: sort deaths followed by list deaths
k. R: create a vector with the proportion of total deaths (per 100,000) caused by heart disease with command heartprop <- wine$heart/wine$deaths and then name the entries by country with names(heartprop) <- wine$country and print the named vector of heart disease death proportions, rounded to three decimal places, with print(round(heartprop, digits=3))
   Stata: add a column with the proportion of heart deaths to total deaths with command generate heart_prop = heart / deaths
l. Create a histogram of liver deaths.
   R: hist(wine$liver)
   Stata: histogram liver
m. Create a scatterplot of liver death rates (vertical axis) against wine consumption (horizontal axis).
   R: plot(x=wine$wine, y=wine$liver)
   Stata: scatter liver wine
n. R only: make the same plot but with axes starting at zero, adding the arguments xlim=c(0,max(wine$wine)) and ylim=c(0,max(wine$liver)) to the previous plot() command
Part I
Analysis of One Variable
Introduction
This text explores methods to answer three types of economic questions, each detailed in Part I:
1. Description (how things are/were: statistical properties and relationships)
2. Prediction (guessing an unknown value, without interfering)
3. Causality (how changing one variable would affect another, all else equal)
For example, imagine you are interested in income. Depending on your job, you may want to answer a different type of question, like:
1. Description: how many adults in the U.S. have an income below $20,000/yr? What's the mean income among U.S. adults? What's the difference in mean income between two socioeconomic or demographic groups (like those with and without a college degree)?
2. Prediction: for advertising purposes, what's the best guess of the income of an unknown person visiting your company's website? What's the best prediction if you also know their zip code (where they live)?
3. Causality: for a given individual, how much higher would her income be if she had a college degree than if she didn't, keeping everything else about her (parents, height, social skills, etc.) identical? How much higher would her income be if she were a man, all else equal? If she were white?
Description helps us see. It summarizes an incomprehensible mass of numbers into specific, economically important features we can understand. By analogy, knowing the color of each of 40,000 pixels in a photograph is not as valuable as knowing it's a cat.
Prediction aids decisions dependent on unknowns. The example questions above consider the purpose of advertising, where correctly guessing a person's income helps decide which ad is most effective. In other private sector jobs, you may need to predict future demand to know how many self-driving cars to start producing, or predict future oil prices to aid a freight company's decisions. In government or non-profit work, optimal policy may depend on predicting next year's unemployment rate. In each case, as detailed in Section 2.5, the "best" prediction depends on the consequences of the related decision.
Causality also aids decisions. The example question about the causal effect on income of a college degree matters for government policies to subsidize college (or not) as well as individual decisions to attend college. With business decisions like changes to advertising or website layout, the causal effect on consumer behavior is what matters: does the change itself actually cause consumers to buy more? Among the three types, questions of causality are the most difficult to answer. Learning about causality from data has been a primary focus of the field of econometrics.
Of course, not all important questions concern description, prediction, and causality. Policy questions usually involve tradeoffs that ultimately require value judgments. For example, how much future wellbeing is worth sacrificing to be better off right now? How much GDP is worth sacrificing to decrease inequality? Should a school have honors classes that help the best students at the expense of the other students? Each of these policy questions requires a subjective value judgment that cannot be answered objectively from data.
That said, each policy question also depends on objectively quantified description, prediction, and causality. For example, the policy question about decreasing inequality depends on the current levels of GDP and inequality (description) as well as the causal effect of the policy (e.g., tax) change on GDP and inequality (causality). The future/present wellbeing tradeoff depends on the current level of wellbeing (description) as well as future levels (prediction). The honors class tradeoff depends on the causal effect of honors classes on different types of students (causality), as well as the current mix of student types (description) and future mix (prediction).
Chapter 2
One Variable: Population
⇒ Kaplan video: Chapter Introduction
Depends on no other chapters
Unit learning objectives for this chapter:
2.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
2.2. Describe and distinguish among different types of populations, including which is most appropriate for answering a certain question [TLO 3]
2.3. Describe distributions in different ways, including units of measure [TLO 3]
2.4. Assess the most appropriate loss function and prediction in a real-world situation [TLO 6]
2.5. Compute mean loss and the optimal prediction in simple mathematical examples [TLO 2]
Optional resources for this chapter:
• Basic probability: the Khan Academy AP Statistics unit includes instructional material and practice questions
• Mean (expected value) (Lambert video)
• Probability distribution basics on Wikipedia (more than you need to know for this class)
• Optimal prediction: Hastie, Tibshirani, and Friedman (2009, §2.4)
• Section 2.1 ("Random Variables and Probability Distributions") in Hanck et al. (2018)
Chapter 2 studies a single variable by itself. This setting's simplicity helps us focus on the complexity of fundamental concepts in probability, description, and prediction. This fundamental understanding will help you tackle more complex models later in this class and beyond.
If you've previously had a probability or statistics class, then most of this chapter may be review for you, although the optimal prediction material is probably new. If you haven't, then now is your opportunity to catch up.
2.1 The World is Random
⇒ Kaplan video: "Before" and "After" Perspectives of Data
2.1.1 Before and After: Two Perspectives
Consider a coin flip. The two possible outcomes are heads (h) and tails (t). After the flip, we observe the outcome (h or t). Before the flip, either h or t is possible, with different probabilities.
Let variable W represent the outcome. After the flip, the outcome is known: either W = h or W = t. Before the flip, both W = h and W = t are possible. If the coin is "fair," then possible outcome W = h has probability 1/2, as does W = t. (Recall it is equivalent to write 1/2, 0.5, or 50%.)
The "after" view sees W as a realized value (or realization). It is either heads or tails. Even if the actual "value" (heads or tails) is unknown to us, there is just a single value. For example, in physics, the variable c represents the speed of light in a vacuum; you may not know the value, but c represents a single value.
Instead, the "before" view sees W as a random variable. That is, instead of representing a single (maybe unknown) value like in algebra, W represents a set of possible values, each associated with a probability. In the coin flip example, the possible outcomes are h and t, and the associated probabilities are 0.5 and 0.5.
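The two perspectives can be mimicked in R. This is only an illustrative sketch: it assumes a fair coin, the variable names are arbitrary, and the seed is set only so that the result can be reproduced.

```r
# "Before" view: the random variable W is described by its possible values
# and their probabilities (a fair coin is assumed here).
outcomes <- c("h", "t")
probs <- c(0.5, 0.5)

# "After" view: one flip yields a single realized value w.
set.seed(1)  # arbitrary seed, only for reproducibility
w <- sample(outcomes, size = 1, prob = probs)
print(w)     # a single value: either "h" or "t"
```

The objects outcomes and probs together play the role of the random variable W, whereas w is one realization.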
Other terms for W include a random draw (or just draw), or more specifically a random draw (or "randomly drawn") from a particular probability distribution. Seeing the population as a probability distribution (see Section 2.2), we could say W is randomly sampled from its population distribution, or if there are multiple random variables W_1, W_2, ... (e.g., multiple flips of the same coin), we could say they are randomly sampled from the population, or that they collectively form a random sample; see Section 3.2 for more about sampling.
Notationally, in this text, random variables are usually written uppercase (like W or Y), whereas realized values are usually written lowercase (like w or y). This notation is not unique to this textbook, but beware that other books use different notation. (For more on notation, see the Notation section in the front matter before Chapter 1.)
2.1.2 Before and After Sampling
Extending Section 2.1.1 are the before sampling and after sampling perspectives, or "before observation" and "after observation." Similar to Section 2.1.1, "before" corresponds to random variables, whereas "after" corresponds to realized values.
For example, imagine you plan to record the age of one person living in your city. You take a blank piece of paper, on which you'll write the age. As in Section 2.1.1, after you choose a person and write their age ("after sampling"), that number can be seen as a realized value, like w. Before sampling, there are many possible numbers that could end up on your paper. It's not that your city's citizens' ages are undetermined; they each know their own age. But before you "sample" somebody, it's undetermined whose age will end up on your paper. It could be your neighbor DeMarcus, age 88. It could be your kid's friend Lucia, age 7. It could be your colleague Xiaohong, age 35. The random variable W is like your blank paper: it has many possible values, each with some probability of occurring, like P(W = 88) or P(W = 7).
Discussion Question 2.1 (web traffic). Let Y = 1 if you're logged into the course website, and Y = 0 if not.
a) From what perspective is Y a non-random value?
b) From what perspective is Y a random variable?
There is always a "before" view from which data samples (like ages) can be seen as random variables, although sometimes it requires some additional peculiar thought experiments, like imagining we first "sample" one universe out of many, like with the superpopulation in Section 2.2.
In Sum: Before & After
Before: multiple possible values ⇒ random variable
After: single observed value ⇒ realized value (non-random)
2.1.3 Outcomes and Mechanisms
Knowing everything about a coin does not fully determine the outcome of a single coin flip. For example, even if we flip two identical coins (i.e., same probability of heads) at the same time, one may get heads while the other gets tails. Mathematically, with two coins represented by W and Z, even if they are "identical" in that P(W = h) = P(Z = h) and P(W = t) = P(Z = t), we could still sometimes observe W = h and Z = t. More abstractly, knowing everything about random variable W does not fully determine any particular realization w. Even if random variables W and Z have the same properties, specific realizations W = w and Z = z may differ.
Conversely, a single coin flip's outcome does not tell us everything about the coin itself. For example, consider a "fair" coin W with P(W = h) = P(W = t) = 1/2 (50% chance of either heads or tails) and biased coin Z with P(Z = h) = 0.99 (99% heads). By chance, we may flip both and observe W = h and Z = h. But the fact that they both came up heads once does not imply that the coins themselves are identical. More abstractly, observing a single realization W = w does not tell us all the properties of random variable W.
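The fair and biased coins above can be simulated to see how little a single flip reveals. This is a rough sketch, not part of the original exercises; the function name flip, the seed, and the number of repeated flips are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Flip a coin n times, where p_heads is the coin's probability of heads.
flip <- function(n, p_heads) {
  sample(c("h", "t"), size = n, replace = TRUE, prob = c(p_heads, 1 - p_heads))
}

# One flip of each coin: both may show heads even though the coins differ.
w1 <- flip(1, 0.5)   # fair coin W
z1 <- flip(1, 0.99)  # biased coin Z
print(c(W = w1, Z = z1))

# Many flips of each coin: the proportions of heads reflect the
# underlying probabilities much more clearly than one flip does.
n <- 10000
prop_w <- mean(flip(n, 0.5) == "h")
prop_z <- mean(flip(n, 0.99) == "h")
print(c(fair = prop_w, biased = prop_z))
```

The proportions from many flips should be close to (though not exactly equal to) 0.5 and 0.99, whereas a single realization from each coin cannot distinguish them reliably.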
We usually want to learn about the underlying mechanisms, like the coin itself. The "before" view in Section 2.1.1 lets us describe the underlying properties that we want to learn, like a coin's probability of heads, P(W = h).
The coin flip is a metaphor for more complex mechanisms. In economics, instead of learning how coin flip outcomes are determined, we care about the underlying mechanisms that determine a wide variety of outcomes like unemployment, wages, inflation, trade volume, fertility, and education. The underlying mechanism is often called the data-generating process (DGP).
2.2 Population Types
⇒ Kaplan video: Population Types
This section describes different population types and how to determine which is most appropriate for a particular economic question, which in turn helps determine which econometric method is most appropriate.
In this textbook, the population is modeled mathematically as a probability distribution. This is appropriate for the infinite population or superpopulation below, but not the finite population. Consequently, it is most important to distinguish between the finite population and the other two types.
The finite population cares more about the "after" view: which outcomes actually occurred. The other two population types care more about the "before" view: describing properties of the underlying mechanisms that generated the outcomes (the DGP).
2.2.1 Finite Population
In English, "population" means all the people living in some area, like everybody living in Missouri. In econometrics, this type of population is called a finite population. Other examples of finite populations are all employees at a particular firm, all firms in a particular industry, all students in a particular school, or all hospitals of a certain size.
The finite population is appropriate when we only care about the outcomes of the population members, not the mechanisms that determine such outcomes. For example, if we want to know how many individuals in Missouri are currently unemployed, then our interest is in a finite population. That is, we don't care why they're unemployed, and we don't care about the probability that they're unemployed; we only care about whether or not they are currently unemployed.
2.2.2 Infinite Population
Sometimes a finite population is so large compared to the sample size (i.e., the number of population members we observe) that an infinite population is a reasonable approximation. For example, if we observe only 600 individuals out of the 6+ million in Missouri, econometric results based on finite and infinite populations are practically identical.
Although "infinite" sounds more complex than "finite," it is actually simpler mathematically. Instead of needing to track every single member of a finite population, an infinite population is succinctly described by a probability distribution or random variable. For example, a finite population would need to consider the employment status of all 6+ million Missourians, because sampling somebody unemployed then reduces the number of unemployed individuals remaining in the population who could be sampled next. In contrast, an infinite population considers realizations of a random variable W with some probability of having value "unemployed." There is no effect of removing one individual from an infinite population, since 1/∞ = 0.
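The size of the "removal" effect can be worked out with simple arithmetic. The sketch below assumes a population of 6 million with a hypothetical 5% unemployed (the 5% figure is made up purely for illustration) and compares the probability that a second sampled person is unemployed, given the first was unemployed, under the finite and infinite population views.

```r
N <- 6e6  # population size, roughly Missouri as in the text
K <- 3e5  # hypothetical number unemployed (5%); an assumption for illustration

p_first <- K / N  # probability the first person sampled is unemployed

# Finite population (sampling without replacement): after one unemployed
# person is sampled, K - 1 unemployed remain among N - 1 people.
p_second_finite <- (K - 1) / (N - 1)

# Infinite population: removing one individual changes nothing.
p_second_infinite <- p_first

diff_large <- p_second_infinite - p_second_finite
print(diff_large)  # tiny: the infinite approximation is practically exact

# By contrast, in a small finite population with the same 5% rate
# (1 unemployed out of 20), removing that one person matters a lot.
N_small <- 20
K_small <- 1
p_second_small <- (K_small - 1) / (N_small - 1)
print(c(first = K_small / N_small, second = p_second_small))
```

With millions of population members the two probabilities differ by far less than one in a million, which is why the infinite population is a reasonable approximation there but not for a population of 20.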
Besides this convenience, sometimes there is no finite population (however large) that answers your question. For example, imagine there's a new manufacturing process for carbon monoxide monitors that should sound an alarm above 50 ppm. Most work properly, but some are faulty and never alarm. Specifically, this manufacturing process corresponds to some probability of producing a faulty monitor. This is similar to the probability of the coin flipping process producing a "heads." Mathematically, the manufacturing process can be modeled as random variable W with some probability of the value "faulty." If you want to learn this probability (i.e., this property of the manufacturing process), then there is no finite number of monitors that can exactly answer your question: no finite number of realizations exactly determines P(W = faulty). This is an infinite population question.
2.2.3 Superpopulation
One variation of the infinite population is the superpopulation (coined by Deming and Stephan, 1941). This imagines (infinitely) many possible universes; our actual universe is just one out of infinity. Thus, even if it appears we have a finite population, we could imagine that our universe's finite population is actually a single sample from an infinite number of universes' finite populations. The term "superpopulation" essentially means "population of populations." Our universe's finite population "is only one of the many possible populations that might have resulted from the same underlying system of social and economic causes" (Deming and Stephan, 1941, p. 45).
For example, imagine we want to learn the relationship between U.S. state-level unemployment rates and state minimum wage levels. It may appear we are stuck with a finite population because there are only 50 states, each of which has an observable unemployment rate and minimum wage. However, observing all 50 states still doesn't fully answer our question about the underlying mechanism that relates unemployment and minimum wage, so a finite population seems inappropriate. But we can't just manufacture new states like we can manufacture new carbon monoxide monitors, so an infinite population also seems inappropriate. The superpopulation imagines manufacturing new entire universes, each with 50 states and the same economic and legal systems. Given these underlying systems and mechanisms, the states' unemployment rates can be seen as random variables, with various probabilities of the possible values. To answer our economic question, we need to learn about the properties of these random variables, not merely the actual unemployment in our actual 50 states.
2.2.4 Which Population is Most Appropriate?
Practically, you need to decide which econometric method to use to answer a particular question. This decision depends partly on which population type is most appropriate. Specifically, finite-population methods differ from other methods that are appropriate for either superpopulations or infinite populations. Because they are less commonly used in econometrics, finite-population methods are not covered in this textbook.
Consequently, it is most important to judge whether or not a finite population is more appropriate than the other types. Which is most appropriate depends on your question (i.e., what you want to learn).
The finite population is most appropriate if you could fully answer your question by observing every member of a finite population. If not, then a superpopulation or infinite population is more appropriate.
The distinction is described by Deming and Stephan (1941, p. 45). They say the finite population perspective is more appropriate for "administrative purposes" or "inventory purposes," whereas the superpopulation perspective is more appropriate for "scientific generalizations and decisions for action [policy]," as well as "prediction" (assuming you want to predict values outside the finite population, like in the future).
In Sum: Population Type
Hypothetically, could a finite number of observations fully answer your question?
No ⇒ superpopulation or infinite population, modeled as probability distribution (as in this textbook)
Yes ⇒ finite population (use different methods, unless sample is much smaller than population)
Example Coin Flips
Imagine the president flips a coin 20 times and then randomly selects 10 observations to report to you. Which population type is most appropriate? It depends on your question.
The finite population is most appropriate if you only care about the outcomes of those 20 flips. For example, this may be true if the president was flipping the coin to make a major military decision that you care about (like "invade if at least 10/20 heads"). Then, knowing the 20 flip outcomes is enough to learn the decision. Further, the sample size is a fairly large proportion (10/20), so approximating 20 as infinity seems inappropriate.
The infinite population is more appropriate if you care about the properties of the coin. For example, even with a fair coin (p = 1/2), maybe only 5 of 20 flips came up heads. You don't care that the finite-population proportion of heads was 1/4; you care about the p = 1/2 property of the coin itself. You still have uncertainty about p even after observing all 20 outcomes.
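The three quantities in this example (the coin's p, the finite-population proportion among the 20 flips, and the proportion among the 10 reported flips) can be kept apart in a small simulation. This is only a sketch under the assumption of a fair coin; the seed and variable names are arbitrary.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

p <- 0.5                                 # property of the coin (infinite population view)
flips <- rbinom(20, size = 1, prob = p)  # the 20 actual flips; 1 = heads, 0 = tails

finite_prop <- mean(flips)               # finite-population proportion of heads
invade <- sum(flips) >= 10               # the decision depends only on these 20 outcomes

reported <- sample(flips, size = 10)     # 10 flips reported to you (without replacement)
sample_prop <- mean(reported)            # what you can compute from the report

print(c(p = p, finite_prop = finite_prop, sample_prop = sample_prop))
print(invade)
```

Generally the three numbers differ: sample_prop is a guess about finite_prop if you care about the decision, or about p if you care about the coin, and even finite_prop itself does not reveal p exactly.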
Other Examples
Consider the employment status of individuals in Missouri. A finite population is more appropriate if you want to document the actual percentage of Missouri individuals unemployed last week. A superpopulation is more appropriate if you want to learn about the underlying mechanism that relates education and unemployment. That is, knowing each individual's employment status fully answers the first question, but not the second question.
Consider the productivity of employees at your company (you're the CEO). If you want to know each employee's productivity over the past fiscal quarter, then a finite population is more appropriate. If you want to learn how a particular company policy affects productivity, then a superpopulation is more appropriate. That is, knowing each employee's productivity fully answers the first question, but not the second question.
Discussion Question 2.2 (student data). Imagine you're a high school principal. You have data on every student, including their standardized test scores from last spring.
a) Describe a specific question for which the finite population is most appropriate, and explain why.
b) Describe a specific question for which an infinite population or superpopulation is most appropriate, and explain why.
2.3 Description of a Population
Like most econometrics textbooks, this textbook models the population as a probability distribution. Section 2.2 helps you distinguish when this is appropriate.
Description of a population is thus description of a probability distribution. Some distributions are completely described by a single number, like a coin's probability of heads. Others are very complicated, so they are summarized by particular features like the mean and standard deviation.
Later, with regression, we'll think about the relationship between the value of variable X and the value of a summary feature of the Y probability distribution. There are some caveats to that statement, but the point is that you must understand the probability distribution of Y by itself before understanding how its features depend on X.
Remember: there is no data yet. In practice (and starting in Section 3.4), you use data to learn about the population, to answer questions about description, prediction, or causality. Here, we consider what could possibly be learned, specifically for description.
The following subsections describe probability distributions for different types of variables, as well as appropriate summary features. First, a brief overview is given.
2.3.1 Overview of Distributions and Their Features
Complete Description
To completely describe a distribution requires a probability mass function or cumulative distribution function, depending on the type of variable (details below).
When appropriate, the probability mass function (PMF) gives the probability that random variable Y is equal to any one of its possible values. Notationally, the PMF is usually a lowercase f, sometimes with the variable as a subscript, like $f_Y(\cdot)$. The $(\cdot)$ indicates that $f_Y(\cdot)$ is an entire function, not a scalar variable nor a function evaluated at a particular point. Mathematically,
$$f_Y(y) \equiv \mathrm{P}(Y = y). \tag{2.1}$$
If y is not a possible value of Y, then P(Y = y) = 0.
In other cases, more appropriate is the cumulative distribution function (CDF) that gives the probability of all values less than or equal to y. (Less commonly: "distribution function" or DF.) Notationally, the CDF is usually an uppercase F, sometimes with the variable as a subscript, like $F_Y(\cdot)$. Again, the $(\cdot)$ indicates that $F_Y(\cdot)$ is an entire function. Mathematically,
$$F_Y(y) \equiv \mathrm{P}(Y \le y). \tag{2.2}$$
Either the PMF or CDF provides a full description of the distribution of Y. For some variable types, both are appropriate, in which case the PMF and CDF contain the same information (just represented differently). For other variable types, only the PMF is appropriate, or only the CDF.
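To see that a PMF and CDF can carry the same information, consider a made-up discrete example: Y is the number of heads in two flips of a fair coin, so the possible values are 0, 1, and 2. The sketch below (with arbitrary variable names) builds the CDF from the PMF and then recovers the PMF from the CDF.

```r
# Hypothetical example: Y = number of heads in two fair coin flips.
y_values <- c(0, 1, 2)
pmf <- c(0.25, 0.50, 0.25)  # P(Y = y) for y = 0, 1, 2

# CDF: P(Y <= y) is the running sum of the PMF over values up to y.
cdf <- cumsum(pmf)
print(data.frame(y = y_values, pmf = pmf, cdf = cdf))

# Going back: each PMF value is the jump in the CDF at that value of y.
pmf_from_cdf <- diff(c(0, cdf))
print(pmf_from_cdf)
```

Because each function can be computed from the other, neither contains more information; they are just different representations of the same distribution.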
If you are studying a single variable, graphing the PMF or CDF is helpful, and it shows all the available information about that variable's distribution. Even if the PMF or CDF is a complicated function, humans are good at processing visual data. In practice, however, often you study many variables together, in which case even graphing becomes intractable (e.g., you can't make a five-dimensional graph easily understood).
Summary Features
Distributions' features like the mean are convenient summaries, but lose information. There is a tradeoff. For some purposes, you may need all the information of the probability distribution. For other purposes, a few summary features may suffice and be easier to understand, compare, and communicate.
The mean is the main summary feature considered for numeric variables. It provides a general idea of how high or low values are, weighted by probability. It provides some sense of the "center" or "location" of the distribution. The mean is used extensively in later chapters.
The standard deviation captures how spread out a distribution is, for numeric variables. If both very low and very high values have enough probability, then the standard deviation is high. Conversely, if possible values are all concentrated in a small range (with high probability), then the standard deviation is low.
Other summary features are mentioned briefly, but not studied in future chapters. The median (a particular percentile) provides another way to define the "center" of a distribution. The mode is the single most likely value. The mode applies even to non-numeric variables, and the median also applies as long as the non-numeric values have a low-to-high order. An alternative spread measure is the interquartile range.
In Sum Random Variable Types amp FeaturesBinary (Section 232) P(Y = 1) = E(Y ) is complete descriptionDiscrete (Section 233) mean E(Y ) captures highlow standarddeviation σY captures how spread out PMF fY (y) = P(Y = y)shows probability of each possible value CDF FY (y) equiv P(Y le y)Nominal categorical (Section 234) PMF says probability of eachcategory mode is most likely category no CDF mean standarddeviationOrdinal (Section 234) similar to nominal but has CDF usemedian instead of meanContinuous (Section 235) similar to discrete but PDF insteadof PMF
2.3.2 Binary Variable
A binary variable has two possible values. Other terms for a binary variable are dummy variable, indicator variable, and Bernoulli random variable. In economics, "dummy" and "binary" are most common.
Unless otherwise specified, a binary variable's two possible values are 0 and 1. For writing mathematical models, these values are usually more convenient than values like "heads" and "tails." Mathematically, this can be indicated by $Y \in \{0, 1\}$: the value of $Y$ must be in the set that includes only the numbers 0 and 1. (The set $\{0, 1\}$ is different than the interval $[0, 1]$ that also contains all real decimal numbers between 0 and 1.)
Many important variables are binary. Examples include:
• whether the economy is in a recession (1) or not (0)
• whether somebody has a college degree (1) or not (0)
• whether a pharmaceutical drug is branded (1) or generic (0)
• whether somebody is employed (1) or not (0)
• whether a retailer is a franchise (1) or not (0)
Mathematically, binary variables are often defined using the indicator function. The indicator function $\mathbf{1}\{\cdot\}$ equals 1 if the argument is true and 0 if false:
$$\mathbf{1}\{A\} = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{if } A \text{ is false} \end{cases} \tag{2.3}$$
For example, consider defining a binary random variable $Y$ based on the coin flip random variable $W$. Recall that the possible values of the flip are $W = h$ (heads) and $W = t$ (tails). We now want $Y = 1$ to indicate heads and $Y = 0$ tails. Mathematically,
$$Y = \mathbf{1}\{\text{heads}\} = \mathbf{1}\{W = h\} = \begin{cases} 1 & \text{if } W = h \text{ (heads)} \\ 0 & \text{if } W = t \text{ (tails)} \end{cases} \tag{2.4}$$
Other examples can also be written with an indicator function. For example, $Y = \mathbf{1}\{\text{recession}\}$, $Y = \mathbf{1}\{\text{branded}\}$, or $Y = \mathbf{1}\{\text{franchise}\}$.
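To make the indicator function concrete, here is a minimal Python sketch of the coin-flip example. The function name `indicator` and the string codes `"h"` and `"t"` are illustrative choices, not notation from this text.

```python
def indicator(condition: bool) -> int:
    """Return 1 if the condition is true and 0 if it is false."""
    return 1 if condition else 0


# Coin flip W with possible values "h" (heads) and "t" (tails).
# Y = 1{W = h} recodes the flip as a 0/1 binary variable.
for w in ["h", "t"]:
    y = indicator(w == "h")
    print(f"W = {w}  ->  Y = {y}")
```

The same function recodes any yes/no condition, such as whether a retailer is a franchise, into a 0/1 value.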
Probability Mass Function
A binary random variable's PMF is like in (2.1). Specifically, $f_Y(0) = P(Y = 0)$ and $f_Y(1) = P(Y = 1)$.
For example, consider the employment dummy $Y = \mathbf{1}\{\text{employed}\}$. That is, $Y = 1$ if the individual is employed; otherwise, $Y = 0$. The PMF $f_Y(\cdot)$ is
$$f_Y(1) \equiv P(Y = 1) = P(\text{employed}), \quad f_Y(0) \equiv P(Y = 0) = P(\text{not employed}) \tag{2.5}$$
If in the population 80% of individuals are employed (and thus 20% not), then $f_Y(1) = 80\% = 0.8$ and $f_Y(0) = 20\% = 0.2$.
The binary PMF can actually be written in terms of one single parameter, $p \equiv P(Y = 1)$. Since $Y \in \{0, 1\}$, $P(Y = 0) + P(Y = 1) = 1 = 100\%$. By algebra, $P(Y = 0) = 1 - P(Y = 1) = 1 - p$. Thus, the PMF is
$$f_Y(1) \equiv P(Y = 1) = p, \quad f_Y(0) \equiv P(Y = 0) = 1 - p \tag{2.6}$$
The probability distribution corresponding to (2.6) is called a Bernoulli distribution. That is, if random variable $Y$ has the PMF in (2.6), then we say $Y$ follows a Bernoulli distribution with parameter $p$. Mathematically,
$$Y \sim \text{Bernoulli}(p) \tag{2.7}$$
Cumulative Distribution Function
Although there is no practical benefit of a binary cumulative distribution function (over a PMF), it may help you develop intuition.
The CDF of binary $Y$ has a particular structure. If $r < 0$, then $P(Y \le r) = 0$. If $r = 0$, then $P(Y \le r) = P(Y = 0)$: the CDF jumps up (discontinuously) from 0 to $P(Y = 0)$ at $r = 0$. If $0 < r < 1$, then $P(Y \le r) = P(Y = 0)$, too: the CDF is flat. If $r = 1$, then $P(Y \le r) = P(Y \le 1) = 1$: the CDF again jumps, now from $P(Y = 0)$ to 1. If $r > 1$, then $P(Y \le r) = 1$, too: the CDF remains flat. Altogether, letting $p \equiv P(Y = 1)$ and $1 - p = P(Y = 0)$, the CDF of $Y$ is
$$F_Y(r) = (1 - p)\,\mathbf{1}\{r \ge 0\} + p\,\mathbf{1}\{r \ge 1\} = \begin{cases} 0 & \text{if } r < 0 \\ 1 - p & \text{if } 0 \le r < 1 \\ 1 & \text{if } r \ge 1 \end{cases} \tag{2.8}$$
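The Bernoulli PMF and the step-shaped CDF can be sketched in a few lines of Python. The function names are illustrative, and $p = 0.8$ simply reuses the employment example; treat this as a sketch for building intuition rather than a reference implementation.

```python
def bernoulli_pmf(y, p):
    """PMF of Bernoulli(p): probability that Y equals y."""
    if y == 1:
        return p
    if y == 0:
        return 1 - p
    return 0.0  # any other y is not a possible value


def bernoulli_cdf(r, p):
    """CDF of Bernoulli(p), written with indicators: (1-p)*1{r>=0} + p*1{r>=1}."""
    return (1 - p) * (r >= 0) + p * (r >= 1)


p = 0.8  # P(employed) from the employment example
for r in [-0.5, 0, 0.5, 1, 1.5]:
    print(f"F_Y({r}) = {bernoulli_cdf(r, p):.2f}")
```

Evaluating the CDF at points below 0, between 0 and 1, and above 1 shows the flat segments, with jumps only at $r = 0$ and $r = 1$.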
Summary Feature: Mean
Since a Bernoulli distribution is fully described by $p \equiv P(Y = 1)$, there is no need to summarize it further. However, the following result is helpful for interpretation of regressions with binary $Y$, and it helps develop intuition.
A random variable's mean is a probability-weighted average of its possible values. With binary $Y$, the possible values are 0 and 1, with respective weights $P(Y = 0)$ and $P(Y = 1)$. The mean $E(Y)$ is thus
$$E(Y) = \sum_{j=0}^{1} (j)\,P(Y = j) = \overbrace{(0)\,P(Y = 0)}^{j=0} + \overbrace{(1)\,P(Y = 1)}^{j=1} = P(Y = 1) \tag{2.9}$$
So, for any binary $Y$, $E(Y) = P(Y = 1) = f_Y(1)$.
For terminology, the mean $E(Y)$ is also called the expected value or expectation. These names explain the letter E in the mathematical notation.
However, the terms "expectation" and "expected value" cause much confusion. They are technical terms whose meaning differs greatly from the colloquial English meaning. For example, if you say in plain English, "I expect the value will be 0.5," it means you think there's a good chance (high probability) that the value will exactly equal 0.5. This is not what $E(Y) = 0.5$ means. In fact, with a binary $Y$, it is impossible to have $Y = 0.5$. We may expect (colloquially) $Y = 1$ if $P(Y = 1)$ is high, or we may expect (colloquially) $Y = 0$ if $P(Y = 0)$ is high, but it is impossible to have $Y = E(Y)$ (unless $p = 1$ or $p = 0$), which is very confusing. I suggest you think "mean" every time you see $E(Y)$ or read "expected value" or "expectation."
Summary Feature: Standard Deviation
Again, since a Bernoulli distribution is fully described by $p \equiv P(Y = 1) = E(Y)$, there is no need to summarize it further. However, the following result can help develop intuition.
The standard deviation is one measure of how "spread out" or "dispersed" a distribution is. The standard deviation is defined as the square root of the variance. Most commonly, lowercase sigma is used for notation: $\sigma_Y^2$ is the variance, and $\sigma_Y$ is the standard deviation. With this notation,
$$\sigma_Y^2 = \mathrm{Var}(Y) \equiv E[(Y - E(Y))^2], \quad \sigma_Y \equiv \sqrt{\sigma_Y^2} \tag{2.10}$$
For $Y \sim \text{Bernoulli}(p)$, the variance and standard deviation are $\sigma_Y^2 = p(1 - p)$ and $\sigma_Y = \sqrt{p(1 - p)}$. The derivation from (2.10) is not important (unless you want a PhD).
The formula $\sigma_Y = \sqrt{p(1 - p)}$ has some intuition. If $p = P(Y = 1) = 1$, then $Y = 1$ always (never $Y = 0$), so the distribution is not at all spread out; it is very concentrated on one single value. This is reflected by $\sigma_Y = \sqrt{1(1 - 1)} = \sqrt{0} = 0$. Similarly, if $p = P(Y = 1) = 0$, then $Y = 0$ always (never $Y = 1$): again, not at all spread out. This is reflected by $\sigma_Y = \sqrt{0(1 - 0)} = \sqrt{0} = 0$. In contrast, if $p = P(Y = 1) = 1/2$, then $Y = 0$ and $Y = 1$ are equally likely. This is as spread out as possible for a binary distribution. Then, $\sigma_Y = \sqrt{(1/2)(1 - 1/2)} = \sqrt{1/4} = 1/2$. You can graph $\sqrt{p(1 - p)}$ over $0 \le p \le 1$ to see that indeed $\sigma_Y$ is highest at $p = 1/2$.
That said, for binary $Y$ it is redundant to report both $E(Y)$ and $\sigma_Y$, since $\sigma_Y = \sqrt{E(Y)[1 - E(Y)]}$. Once you know $p = E(Y)$, or equivalently $p = P(Y = 1)$, there is no new information in $\sigma_Y$.
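Instead of graphing, you can check the claims about $\sqrt{p(1 - p)}$ numerically. The following Python sketch evaluates the formula on a grid of $p$ values; the grid spacing of 0.01 and the function names are arbitrary choices made for illustration.

```python
import math


def bernoulli_mean(p):
    """Mean of Bernoulli(p): the probability-weighted average of 0 and 1."""
    return 0 * (1 - p) + 1 * p


def bernoulli_sd(p):
    """Standard deviation of Bernoulli(p), sqrt(p(1-p))."""
    return math.sqrt(p * (1 - p))


# Evaluate the standard deviation on a grid over 0 <= p <= 1.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=bernoulli_sd)

print("sd at p=0:  ", bernoulli_sd(0.0))
print("sd at p=1:  ", bernoulli_sd(1.0))
print("sd at p=0.5:", bernoulli_sd(0.5))
print("grid value of p with the largest sd:", p_max)
```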
2.3.3 Discrete Variable
A binary variable is a special case of a discrete variable, which has any (countable) number of possible values. That is, all binary variables are discrete variables, but not all discrete variables are binary. Discrete variable examples include:
• an individual's years of education
• number of children in a household
• the number of times a stock has split since its IPO
• the number of trading partners a country has
• number of students in a classroom
The units of measure are important for interpreting a discrete variable and its distribution. For most discrete variables, like number of children, the units are obvious. Sometimes it is not immediately obvious: number of students per room, or per grade? Number of bills passed in one month, or one year, or one term?
Probability Mass Function
A discrete PMF is similar to a binary PMF. It is again usually written like $f_Y(\cdot)$ for the PMF of discrete random variable $Y$. The PMF's input is again a possible value, and its output is the corresponding probability:
$$f_Y(y) \equiv P(Y = y) \tag{2.11}$$
Recall uppercase $Y$ is a random variable, whereas $y$ stands for one particular non-random value (like 41). If $y$ is not one of the possible values of $Y$, then $f_Y(y) = P(Y = y) = 0$.
The main difference is that a general discrete PMF cannot be fully described by a single parameter $p$ like in (2.6). This additional complexity is a reason people look at summary features like the mean, standard deviation, and percentiles.
One dimension of added complexity is the possible values. There can be more than two possible values. Further, they are not always just $0, 1, 2, \ldots$. For example, if recessions are determined on a monthly basis, then the fraction of a year spent in recession could be $0, 1/12, 2/12, \ldots, 11/12, 1$. Consequently, the possible values are often written as $y_1, y_2, \ldots, y_J$, where $J$ is the number of different values. Equivalently, the possible values are $y_j$ for $j = 1, 2, \ldots, J$. (Notation: this has the same meaning as $j \in \{1, 2, \ldots, J\}$, but the convention is to write $j = 1, 2, \ldots, J$ or simply $j = 1, \ldots, J$.)
Having more possible values also means more probabilities to keep track of. Specifically, if there are $J$ possible values, then there are $J$ probabilities: $P(Y = y_j)$ for $j = 1, \ldots, J$. Mathematically, the PMF can be written as
$$f_Y(y) = \sum_{j=1}^{J} \mathbf{1}\{y = y_j\}\,P(Y = y_j) \tag{2.12}$$
The last $P(Y = y_J)$ can be solved for using the fact that all $J$ probabilities sum to 1, but that still leaves $J - 1$ probabilities.
Figure 2.1 shows two common ways of graphing the PMF $f_Y(y) = 1/3$ for $y = 1, 2, 3$, i.e., $f_Y(1) = f_Y(2) = f_Y(3) = 1/3$.
[Figure: two panels, each plotting the PMF $P(Y = y)$ (vertical axis, 0 to 1) against $y = 1, 2, 3$ (horizontal axis).]
Figure 2.1: Example PMF plotted two ways.
Cumulative Distribution Function
A discrete CDF is defined as in (2.2). It has a similar pattern as (2.8), where it is flat but then jumps up discontinuously at each possible value $y_j$. Mathematically, it can be written similarly to (2.12):
$$F_Y(y) \equiv P(Y \le y) = \sum_{j=1}^{J} \mathbf{1}\{y_j \le y\}\,P(Y = y_j) \tag{2.13}$$
The CDF corresponding to the PMF in Figure 2.1 could be written two equivalent ways:
$$F_Y(y) = (1/3) \sum_{j=1}^{3} \mathbf{1}\{j \le y\} \tag{2.14}$$
$$F_Y(y) = \begin{cases} 0 & \text{if } y < 1 \\ 1/3 & \text{if } 1 \le y < 2 \\ 2/3 & \text{if } 2 \le y < 3 \\ 1 & \text{if } 3 \le y \end{cases} \tag{2.15}$$
Figure 2.2 plots this CDF.
[Figure: the CDF $P(Y \le y)$ (vertical axis, 0 to 1) plotted against $y$ (horizontal axis, ticks at 1, 2, 3).]
Figure 2.2: Example discrete CDF from (2.14).
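The relationship between a discrete PMF and its CDF can also be sketched in code. Below, the PMF with probability 1/3 on each of the values 1, 2, 3 is stored as a Python dictionary (an illustrative representation, not something this text prescribes), and the CDF adds up the probabilities of all possible values at or below $y$.

```python
from fractions import Fraction

# PMF with probability 1/3 on each of the possible values 1, 2, 3.
pmf = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}


def pmf_at(y, pmf):
    """f_Y(y): the probability of y, or 0 if y is not a possible value."""
    return pmf.get(y, Fraction(0))


def cdf_at(y, pmf):
    """F_Y(y): sum of P(Y = y_j) over all possible values y_j <= y."""
    return sum((prob for value, prob in pmf.items() if value <= y), Fraction(0))


for y in [0.5, 1, 1.5, 2, 2.5, 3, 3.5]:
    print(f"f_Y({y}) = {pmf_at(y, pmf)},  F_Y({y}) = {cdf_at(y, pmf)}")
```

Exact fractions are used only so that the printed values are 1/3 and 2/3 rather than rounded decimals.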
Summary Feature: Mean
Generalizing the binary mean in (2.9), the mean of discrete $Y$ can be written in terms of the $J$ possible values $y_j$ ($j = 1, \ldots, J$) and their probabilities:
$$E(Y) = \sum_{j=1}^{J} y_j\,P(Y = y_j) = y_1\,P(Y = y_1) + \cdots + y_J\,P(Y = y_J) \tag{2.16}$$
which could also be written in terms of the PMF because $f_Y(y_j) = P(Y = y_j)$. If $Y$ is binary, then $J = 2$, $y_1 = 0$, and $y_2 = 1$, in which case (2.16) simplifies to (2.9).
The mean gives a rough sense of whether the distribution has high or low values, weighted by their probability. For example, consider random variables $W$ and $Z$ with $P(W = 0) = P(W = 2) = 1/2$ and $P(Z = 2) = P(Z = 4) = 1/2$. Then
$$E(W) = (0)(1/2) + (2)(1/2) = 1, \quad E(Z) = (2)(1/2) + (4)(1/2) = 3 \tag{2.17}$$
reflecting that $Z$ has higher values. As another example, imagine $W$ and $Z$ both have possible values $j = 1, 2, 3, 4$, but $P(W = j) = j/10$ whereas $P(Z = j) = (5 - j)/10$. Although the possible values are identical, $W$ has higher weight for the higher values, which is reflected by its larger mean:
$$E(W) = \sum_{j=1}^{4} (j)(j/10) = (1)(1/10) + (2)(2/10) + (3)(3/10) + (4)(4/10) = 3$$
$$E(Z) = \sum_{j=1}^{4} (j)(5 - j)/10 = (1)(4/10) + (2)(3/10) + (3)(2/10) + (4)(1/10) = 2$$
However, the mean is sensitive to very large values, so it does not reflect the value of the "average member of the population." For example, let $Y$ denote hourly wage (\$/hr) for a population with three equally-likely types of individuals. The possible values are $y_1 = 10$, $y_2 = 20$, and $y_3 = 270$. The probabilities are $P(Y = y_j) = 1/3$ for $j = 1, 2, 3$. The "average person" is the middle type, who gets paid \$20/hr. (This is the median.) But the mean is, in \$/hr,
$$E(Y) = (10)(1/3) + (20)(1/3) + (270)(1/3) = 300/3 = 100 \tag{2.18}$$
This \$100/hr mean wage is way higher than what the average person earns. The reason is that the extremely high value \$270/hr brings the mean way up. A similar but more extreme example has $P(Y = 10) = 0.99$ and $P(Y = 3010) = 0.01$. The "average person" is one of the 99% who make \$10/hr, but the mean is four times larger, \$40/hr:
$$E(Y) = (10)(0.99) + (3010)(0.01) = 9.9 + 30.1 = 40 \tag{2.19}$$
The mean helps capture the aggregate earnings rate of the population as a whole, but it does not capture the typical wage of the average population member.
For practice with another example, consider years of education. That is, $Y = 11$ means 11 years of education, $Y = 12$ means 12 years of education (through high school), etc. For simplicity, imagine the only possible values are $Y \in \{11, 12, 16, 18\}$. In the notation of (2.16), $J = 4$ (four possible values), with $y_1 = 11$, $y_2 = 12$, $y_3 = 16$, and $y_4 = 18$. Let $P(Y = 11) = 0.2$, $P(Y = 12) = 0.3$, $P(Y = 16) = 0.4$, and $P(Y = 18) = 0.1$. Applying (2.16),
$$E(Y) = (11)\,P(Y = 11) + (12)\,P(Y = 12) + (16)\,P(Y = 16) + (18)\,P(Y = 18) = (11)(0.2) + (12)(0.3) + (16)(0.4) + (18)(0.1) = 14 \tag{2.20}$$
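The probability-weighted average is easy to compute once a PMF is written down. This Python sketch recomputes the wage and education means from the examples above; the dictionary representation and the function name `mean` are illustrative choices.

```python
from fractions import Fraction


def mean(pmf):
    """Probability-weighted average of the possible values in a discrete PMF."""
    return sum(value * prob for value, prob in pmf.items())


# Hourly wage with three equally likely types.
wage = {10: Fraction(1, 3), 20: Fraction(1, 3), 270: Fraction(1, 3)}

# More extreme wage example: 99% earn 10, 1% earn 3010.
wage_extreme = {10: Fraction(99, 100), 3010: Fraction(1, 100)}

# Years of education.
education = {11: Fraction(2, 10), 12: Fraction(3, 10),
             16: Fraction(4, 10), 18: Fraction(1, 10)}

print("E(wage)         =", mean(wage))
print("E(wage_extreme) =", mean(wage_extreme))
print("E(education)    =", mean(education))
```

Fractions avoid floating-point rounding, so the results match the hand calculations exactly.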
The expectation operator $E(\cdot)$ has a useful property called linearity. Formally, the mean of a linear combination of random variables equals the linear combination of the random variables' means. For example, given two random variables $Y$ and $Z$ (of any type) and two non-random constants $a$ and $b$,
$$E(aY + bZ) = a\,E(Y) + b\,E(Z) \tag{2.21}$$
Here, $aY + bZ$ is a linear combination of random variables $Y$ and $Z$. Thus, the mean of the linear combination of $Y$ and $Z$ equals the linear combination of the means $E(Y)$ and $E(Z)$.
Equation (2.21) implies other identities. For example, in the special case when $b = 0$, it implies $E(aY) = a\,E(Y)$. As another example, if $a = 1$ and $Y = cW + dX$, then
$$E(cW + dX + bZ) = E(aY + bZ) = a\,E(Y) + b\,E(Z) = (1)\,E(cW + dX) + b\,E(Z) = c\,E(W) + d\,E(X) + b\,E(Z)$$
Extending this further, if we have random variables $Y_i$ for $i = 1, \ldots, n$ and corresponding constants $c_i$, then
$$E\left(\sum_{i=1}^{n} c_i Y_i\right) = \sum_{i=1}^{n} c_i\,E(Y_i) \tag{2.22}$$
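Linearity can be checked numerically. In the Python sketch below, the joint distribution of $(Y, Z)$ and the constants $a$ and $b$ are made-up values chosen only for illustration; note that $Y$ and $Z$ are deliberately not independent, since linearity does not require independence.

```python
from fractions import Fraction

# Hypothetical joint PMF: keys are (y, z) pairs, values are probabilities.
joint = {
    (0, 2): Fraction(1, 2),
    (2, 4): Fraction(1, 4),
    (2, 2): Fraction(1, 4),
}
a, b = 3, -2  # arbitrary non-random constants

# Left side: mean of the linear combination aY + bZ.
lhs = sum((a * y + b * z) * prob for (y, z), prob in joint.items())

# Right side: linear combination of the separate means.
mean_y = sum(y * prob for (y, z), prob in joint.items())
mean_z = sum(z * prob for (y, z), prob in joint.items())
rhs = a * mean_y + b * mean_z

print("E(aY + bZ)      =", lhs)
print("a E(Y) + b E(Z) =", rhs)
```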
Summary Feature: Standard Deviation
The standard deviation has the same definition and interpretation as before. It measures how "spread out" or "dispersed" a distribution is, with the same units of measure as the variable itself, and it is formally defined in (2.10).
Without worrying about calculating these by hand, consider the following examples for intuition. Imagine $W$ and $X$ both have possible values $v_1 = -1$, $v_2 = 0$, and $v_3 = 1$, but $P(X = v_j) = 1/3$ for each $v_j$, whereas $P(W = 0) = 1$. Clearly $X$ is more spread out, as reflected by the standard deviations: $\sigma_X = \sqrt{2/3}$, $\sigma_W = 0$. If we spread out the values $v_1$ and $v_3$ farther from zero, then the standard deviation should increase. Let $Y$ have $P(Y = -2) = P(Y = 0) = P(Y = 2) = 1/3$. Then $\sigma_Y = \sqrt{8/3}$, twice as big as $\sigma_X = \sqrt{2/3}$. Alternatively, we could "spread out" $X$ by defining $Z$ to have the same $v_j$ values as $X$ but with even more probability on the extreme values. Specifically, $P(Z = -1) = P(Z = 1) = 1/2$. Then $\sigma_Z = 1$, bigger than $\sigma_X = \sqrt{2/3}$.
Like the mean, the standard deviation is sensitive to very large values. For example, let $P(Y = 0) = 0.98$, $P(Y = -100) = P(Y = 100) = 0.01$. Then, even though 98% of the population has $Y = 0$ (very concentrated, not spread out), $\sigma_Y = \sqrt{200} \approx 14.1$.
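If you would like to verify these standard deviations without hand calculation, the definition in (2.10) translates directly into code. This is a Python sketch with illustrative function and variable names; each PMF is entered as a dictionary from possible value to probability.

```python
import math
from fractions import Fraction


def mean(pmf):
    """Probability-weighted average of the possible values."""
    return sum(value * prob for value, prob in pmf.items())


def variance(pmf):
    """E[(Y - E(Y))^2], the probability-weighted average squared deviation."""
    m = mean(pmf)
    return sum((value - m) ** 2 * prob for value, prob in pmf.items())


def sd(pmf):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(pmf))


third, half = Fraction(1, 3), Fraction(1, 2)
pmf_w = {0: Fraction(1)}
pmf_x = {-1: third, 0: third, 1: third}
pmf_y = {-2: third, 0: third, 2: third}
pmf_z = {-1: half, 1: half}
pmf_outlier = {0: Fraction(98, 100), -100: Fraction(1, 100), 100: Fraction(1, 100)}

for name, pmf in [("W", pmf_w), ("X", pmf_x), ("Y", pmf_y),
                  ("Z", pmf_z), ("outlier example", pmf_outlier)]:
    print(f"{name}: variance = {variance(pmf)}, sd = {sd(pmf):.3f}")
```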
Other Summary Features
Although beyond the scope of this text, I must briefly mention that percentiles (or quantiles) are also helpful summary features. They can capture aspects of a probability distribution that the mean and standard deviation do not. For example, the median captures the value of the "average member of the population" discussed above. High and low percentiles help capture the "tails" of a distribution, like people with the very highest (or lowest) income. Measures like the interquartile range capture the "spread" in a way that is not sensitive to very large outliers, complementing the standard deviation. Quantile regression extends percentiles to regression, parallel to what we'll study with the mean.
2.3.4 Categorical or Ordinal Variable
A binary variable is usually a special case of a categorical variable, whose possible values are "categories," not numbers. This was true of most of the examples in Section 2.3.2, like whether a retailer is a franchise or not, or whether a pharmaceutical drug is branded or generic. Such values can be coded as 0 or 1 for convenience, but they lack any numeric meaning.
Categorical variables can have more than two possible values. For example, non-franchise retailers could be categorized further as national chain, regional chain, or independent. Other categorical variable examples include:
• geographic region (north, south, east, west)
• mode of transportation (car, bike, train, etc.)
• industry (like NAICS)
• college major (economics, English, ecology, electrical engineering, etc.)
The previous examples' categories have no particular order to them, so they constitute nominal variables (or nominal categorical variables). Sometimes these are simply called categorical variables.
In contrast, there could be an ordinal variable (or ordinal categorical variable). An ordinal variable's possible values have a natural order, usually from "low" to "high." For example:
• bond rating (e.g., D, C, ..., AA+, AAA)
• self-reported health status (poor, fair, good, excellent)
• teaching evaluation responses (disagree, neutral, agree)
• letter grades (F, C, B, A; although often A and C are 4.0 and 2.0, there is nothing intrinsic in the letter grade system that suggests A is exactly twice as good as C)
Some categorical variables are not clearly nominal or ordinal. For example, consider educational degree. Some degrees are higher than others (e.g., you need a bachelor's degree before you get a master's), but others are not (e.g., PhD and MD). As another example, consider sex. Neither male nor female is "higher" than the other, but arguably intersex is in between. As another example, some occupations are ordered (e.g., within the same consulting firm, junior analyst is lower than senior analyst, which is lower than associate, then manager, etc.), but others are not (e.g., painter, chef, carpenter). Such variables require careful thought but are beyond our scope.
In any case, categorical variables are often represented by dummy variables (binary variables). For example, consider the teaching evaluation response whose possible values are disagree, neutral, and agree. Using the indicator function from (2.3), we can define $W = \mathbf{1}\{\text{disagree}\}$, $X = \mathbf{1}\{\text{neutral}\}$, $Y = \mathbf{1}\{\text{agree}\}$. Then, $P(W = 1)$ is the probability of "disagree," $P(X = 1)$ is the probability of "neutral," and $P(Y = 1)$ is the probability of "agree." Equivalently, from (2.9), these can be written using $P(W = 1) = E(W)$, $P(X = 1) = E(X)$, and $P(Y = 1) = E(Y)$. In a way, $Y$ is redundant since $Y = \mathbf{1}\{W = X = 0\}$, completely determined by $W$ and $X$.
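The dummy-variable representation can be illustrated with a short Python sketch. The five responses below are made up for illustration, and treating them as if they were the whole population is a simplifying assumption so that shares play the role of probabilities.

```python
def indicator(condition):
    """Return 1 if the condition is true and 0 if it is false."""
    return 1 if condition else 0


# Hypothetical teaching-evaluation responses (treated here as the whole population).
responses = ["agree", "neutral", "disagree", "agree", "agree"]

W = [indicator(r == "disagree") for r in responses]
X = [indicator(r == "neutral") for r in responses]
Y = [indicator(r == "agree") for r in responses]

# Y is redundant: it equals 1{W = X = 0} for every individual.
redundant = all(y == indicator(w == 0 and x == 0) for w, x, y in zip(W, X, Y))

# The mean of a dummy equals the share (probability) of that category.
share_agree = sum(Y) / len(Y)

print("W =", W)
print("X =", X)
print("Y =", Y)
print("Y determined by W and X:", redundant)
print("share answering agree:", share_agree)
```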
Probability Mass Function
A categorical or ordinal PMF is essentially the same as a discrete PMF. As before, it is usually written like $f_Y(\cdot)$ for the PMF of $Y$. The PMF's input is again a possible value, and its output is the corresponding probability. The only difference is that the possible values are categories instead of numbers.
For example, consider the coin flip. Random variable $W$ has two possible values, h (heads) or t (tails); mathematically, $W \in \{h, t\}$. Its PMF is $f_W(w) \equiv P(W = w)$. That is, $f_W(h)$ is the probability of heads, and $f_W(t)$ is the probability of tails.
Cumulative Distribution Function
A nominal categorical variable does not have a CDF. Recall from (2.2) that the CDF of $Y$ evaluated at value $v$ is $F_Y(v) \equiv P(Y \le v)$. (It does not matter which lowercase letter we use, whether $y$ or $r$ or $v$; it simply represents a non-random value.) If $Y$ is nominal, then the inequality relationship $\le$ has no meaning. For example, we cannot evaluate if the value "painter" is less than or equal to "chef." The PMF is still well-defined because it relies only on equality, and we can evaluate if "painter" and "chef" are equal. But the "cumulative" part of CDF has no meaning for nominal categorical variables.
In contrast, an ordinal variable can have a CDF. The values are ordered, so we can evaluate if one value is $\le$ another. (If there is a clear order but no clear "low" and "high" ends, then one end can arbitrarily be picked as "low," and the CDF shows the cumulative probability from the "low" end through a given category.)
For example, consider self-reported health status $Y$. Any two values can be compared with $\le$: poor $\le$ good, good $\le$ excellent, etc. The CDF evaluated at "good" is the probability of health that is good or worse. There are three such possible values: poor, fair, and good. Thus, the CDF evaluated at good is the probability of poor, fair, or good health. Mathematically,
$$F_Y(\text{good}) \equiv P(Y \le \text{good}) = P(Y = \text{poor}) + P(Y = \text{fair}) + P(Y = \text{good}) = f_Y(\text{poor}) + f_Y(\text{fair}) + f_Y(\text{good}) \tag{2.23}$$
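An ordinal CDF only needs the ordering of the categories plus the PMF. In this Python sketch, the ordering is stored as a list from "low" to "high," and the probabilities are hypothetical values chosen for illustration (the text does not specify any).

```python
from fractions import Fraction

# Categories listed from "low" to "high".
levels = ["poor", "fair", "good", "excellent"]

# Hypothetical PMF (illustrative numbers only).
pmf = {
    "poor": Fraction(1, 10),
    "fair": Fraction(2, 10),
    "good": Fraction(4, 10),
    "excellent": Fraction(3, 10),
}


def ordinal_cdf(value, levels, pmf):
    """P(Y <= value): add the PMF over all categories up to and including value."""
    cutoff = levels.index(value)
    return sum(pmf[level] for level in levels[: cutoff + 1])


for level in levels:
    print(f"F_Y({level}) = {ordinal_cdf(level, levels, pmf)}")
```

For a nominal variable there is no meaningful `levels` ordering to supply, which is exactly why the CDF is undefined in that case.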
Summary Features: Mean and Standard Deviation
Categories cannot be summed or averaged, so categorical variables do not have a mean. You could arbitrarily assign numeric values to each category and then use the discrete variable formula in (2.16), but if I assigned different numeric values, then I could get a very different result than you. Fundamentally, we cannot average "painter" and "chef," or average "poor" and "good." (E.g., what is poor plus good? What is 0.17 times poor?)
For the same reason, categorical variables do not have a standard deviation.
Other Summary Features
The mode is often useful for summarizing categorical variables. The mode is the single most likely value. Mathematically,
$$\mathrm{mode}(Y) \equiv \arg\max_{y} P(Y = y) \tag{2.24}$$
which is read as "the value of $y$ that maximizes $P(Y = y)$."
Nominal categorical variables do not have anything like a mean that can give a sense of whether values are generally high or low, because there is no sense of high or low for nominal variables. Similarly, there is no sense of "close" or "far," so it is meaningless to ask how "spread out" a nominal distribution is. Only features of the PMF can be summarized, without accounting for the values themselves. So, we could measure whether the probability is spread more evenly across categories rather than very high in some categories (but not others), but not much else.
Ordinal variables do have a sense of high or low, so it is possible to summarize ordinal distributions analogous to a mean (overall high or low) or standard deviation (how spread out).
The median (and other percentiles) can summarize how generally high or low an ordinal distribution is. The median is the category for the "average member" of the population. That is, at least half the population has the same or lower value, and at least half the population has the same or higher value. Mathematically, the median of ordinal random variable $Y$ is the value $m$ such that $P(Y \le m) \ge 0.5$ and $P(Y \ge m) \ge 0.5$.
For example, consider self-reported health status $Y$. The five possible values are poor, fair, good, great, and excellent. Imagine there is 1/5 probability of each value, i.e., $f_Y(y) = 1/5$ for any of the five possible $y$ values. Because each category has the same probability, naturally the median is the middle category, "good." Mathematically, to verify:
$$P(Y \le \text{good}) = \overbrace{f_Y(\text{poor})}^{=1/5} + \overbrace{f_Y(\text{fair})}^{=1/5} + \overbrace{f_Y(\text{good})}^{=1/5} = 3/5 > 1/2$$
$$P(Y \ge \text{good}) = \overbrace{f_Y(\text{good})}^{=1/5} + \overbrace{f_Y(\text{great})}^{=1/5} + \overbrace{f_Y(\text{excellent})}^{=1/5} = 3/5 > 1/2 \tag{2.25}$$
Imagine a different population, represented by random variable $W$, that is generally healthier. To make it obvious, imagine nobody has poor or fair health, and the other three categories have 1/3 probability each. Now "great" is the median, reflecting that $W$ represents a generally healthier population than $Y$, whose median was only "good." Mathematically, to verify:
$$P(W \le \text{great}) = \overbrace{f_W(\text{poor})}^{=0} + \overbrace{f_W(\text{fair})}^{=0} + \overbrace{f_W(\text{good})}^{=1/3} + \overbrace{f_W(\text{great})}^{=1/3} = 2/3 > 1/2$$
$$P(W \ge \text{great}) = \overbrace{f_W(\text{great})}^{=1/3} + \overbrace{f_W(\text{excellent})}^{=1/3} = 2/3 > 1/2 \tag{2.26}$$
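The mode and the ordinal median can both be computed directly from their definitions. The Python sketch below checks the two health-status examples; the transportation PMF used for the mode is a hypothetical example, and the function names are illustrative. If more than one category satisfies the median condition, this sketch returns the lowest one.

```python
from fractions import Fraction

levels = ["poor", "fair", "good", "great", "excellent"]  # low to high


def mode(pmf):
    """The single most likely value: the argmax of the PMF."""
    return max(pmf, key=pmf.get)


def ordinal_median(levels, pmf):
    """Lowest category m with P(Y <= m) >= 1/2 and P(Y >= m) >= 1/2."""
    half = Fraction(1, 2)
    for i, level in enumerate(levels):
        at_or_below = sum(pmf[l] for l in levels[: i + 1])
        at_or_above = sum(pmf[l] for l in levels[i:])
        if at_or_below >= half and at_or_above >= half:
            return level


# Population Y: each category has probability 1/5.
pmf_y = {level: Fraction(1, 5) for level in levels}

# Population W: nobody poor or fair; the other three have probability 1/3.
pmf_w = {"poor": Fraction(0), "fair": Fraction(0), "good": Fraction(1, 3),
         "great": Fraction(1, 3), "excellent": Fraction(1, 3)}

# Hypothetical nominal PMF for mode of transportation.
pmf_transport = {"car": Fraction(6, 10), "bike": Fraction(1, 10), "train": Fraction(3, 10)}

print("median of Y:", ordinal_median(levels, pmf_y))
print("median of W:", ordinal_median(levels, pmf_w))
print("mode of transportation:", mode(pmf_transport))
```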
There are other ways to compare whether one ordinal distribution is "higher" or "more spread out" than another (e.g., Kaplan and Zhuo, 2019), but they are beyond our scope.
2.3.5 Continuous Variable
A continuous variable differs from a discrete variable in some strange technical ways, but the intuition is the same. This textbook often uses discrete variables to build intuition since the math is simpler. You could imagine a continuous variable like a discrete variable with a very large number of possible values, packed very tightly together. Indeed, many variables typically called "continuous" are actually discrete, like monetary values (like annual income or sales) that come in discrete units (like \$0.01). Practically, the difference is negligible. Examples of other variables modeled as continuous are:
• market concentration measures (like market share of largest firm, or HHI)
• a country's per capita annual meat consumption
• percentage growth of GDP (or sales, or stock price, etc.)
• crime rates (e.g., a city's number of property crimes per year per 10,000 people)
Always specify units of measure. For example, if $Y$ is the distance from an individual's residence to their workplace, it is meaningless to say $Y = 15$ because 15 is just a number, not a measure of distance. It could be 15 km, but it could also be 15 mi, which is 24 km, or it could even be measured in meters or feet (or parsecs, though unlikely). The mean, standard deviation, median (and other percentiles), and interquartile range all share the same units as the variable itself, whereas the variance has squared units (which is harder to interpret, e.g., squared dollars). Units always matter greatly, whether for description, prediction, or causality.
Probability Density Function
A truly continuous random variable does not have a PMF. It has an "uncountably infinite" number of possible values, implying each has zero probability. That is, if $Y$ is the random variable, then for any possible value $y$, $P(Y = y) = 0$. There is a difference between "possible" and "non-zero probability." Every observed realization $y$ is clearly possible, yet had zero probability of occurring.
Although individual values have zero probability, ranges of values have non-zero probability. For example, even if $P(Y = y) = 0$ for each individual $0 \le y \le 1$, it's possible that $P(0 \le Y \le 1) = 0.34$.
The probability density function (PDF) helps us see such probabilities for different ranges of values. The "bell curve" is an example of a PDF. Generally, the PDF is higher around more probable values.
If the CDF is also differentiable, the derivative is called a probability density function (PDF). PDFs are commonly denoted with lowercase $f$, sometimes with a subscript, like $f_Y(\cdot)$ for the PDF of random variable $Y$. Similar to a histogram, a PDF shows the probability of a random variable taking a value in a certain interval as the area under the PDF. Since $P(-\infty < Y < \infty) = 1$ for any random variable, the total area under any PDF must equal one.
Figure 2.3 shows an example PDF. No matter how big or small we draw it, the total area between the horizontal axis and the PDF is defined to be 1. The shaded area under the PDF between $y = 0$ and $y = 1$ shows $P(0 \le Y \le 1)$. That is, $P(0 \le Y \le 1)$ is the proportion of the total area under the PDF that is shaded. The fact that $P(Y = y) = 0$ for any $y$ can also be seen. For example, $P(Y = 0)$ is the "area" under the PDF between $y = 0$ and $y = 0$, but this is just a line, which has zero area, hence zero probability.
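[Figure: a PDF (vertical axis: density, 0.0 to 0.4) plotted over $y$ from $-3$ to 3, with the area under the curve between $y = 0$ and $y = 1$ shaded and labeled "Area = 0.34."]
Figure 2.3: Example of reading a probability from a PDF.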
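The "probability equals area" idea can be checked numerically. The Python sketch below assumes the PDF is the standard normal bell curve; that is an assumption made here for illustration, since the text does not state which PDF is drawn in Figure 2.3. The area is approximated by summing many thin rectangles (a midpoint Riemann sum) and compared with the difference of the CDF at the two endpoints.

```python
import math


def std_normal_pdf(y):
    """PDF of the standard normal distribution (assumed for this illustration)."""
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)


def std_normal_cdf(y):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))


def area_under_pdf(pdf, lower, upper, n=10_000):
    """Approximate the area under pdf between lower and upper with n thin rectangles."""
    width = (upper - lower) / n
    return sum(pdf(lower + (i + 0.5) * width) for i in range(n)) * width


area_0_to_1 = area_under_pdf(std_normal_pdf, 0, 1)
area_at_point = area_under_pdf(std_normal_pdf, 0, 0)   # "area" of a single line
total_area = area_under_pdf(std_normal_pdf, -10, 10)   # essentially the whole curve

print("P(0 <= Y <= 1) by area:", round(area_0_to_1, 4))
print("P(0 <= Y <= 1) by CDF: ", round(std_normal_cdf(1) - std_normal_cdf(0), 4))
print("P(Y = 0) by area:      ", area_at_point)
print("total area:            ", round(total_area, 4))
```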
Cumulative Distribution Function
The CDF of a continuous random variable has the same definition as for a discrete or ordinal variable: $F_Y(y) \equiv P(Y \le y)$.
Unlike the jumpy, stair-step CDF of a discrete random variable, a continuous random variable's CDF is a continuous function. (And if you know calculus: when it exists, the PDF is the first derivative of the CDF.)
Summary Feature: Mean
The intuition for the mean and the linearity of the expectation operator apply equally to continuous random variables.
Computing the mean of a continuous random variable by hand requires calculus, so it is not covered in this textbook. (If you happen to know calculus and are curious: it's extending the idea of probability-weighted average by replacing the sum with an integral, e.g., if the PDF $f_Y(\cdot)$ exists, $E(Y) = \int_{\mathbb{R}} y f_Y(y)\,dy$, analogous to $\sum_{j=1}^{J} y_j f_Y(y_j)$.)
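For the curious, the integral can be approximated without calculus by treating the continuous variable like a discrete one with many tightly packed values. The Python sketch below does this for a hypothetical PDF that is flat on the interval from 0 to 2 (a continuous uniform distribution, chosen only because its mean is easy to guess); the names and the number of slices are illustrative.

```python
def uniform_pdf(y, low=0.0, high=2.0):
    """PDF that is flat on [low, high] and zero elsewhere (hypothetical example)."""
    return 1.0 / (high - low) if low <= y <= high else 0.0


def mean_by_integration(pdf, lower, upper, n=100_000):
    """Approximate the integral of y * pdf(y) by a sum over n thin slices."""
    width = (upper - lower) / n
    total = 0.0
    for i in range(n):
        y = lower + (i + 0.5) * width   # midpoint of slice i
        total += y * pdf(y) * width     # value times (approximate) probability
    return total


mean_y = mean_by_integration(uniform_pdf, 0.0, 2.0)
print("approximate E(Y):", round(mean_y, 6))
```

Each term `y * pdf(y) * width` plays the role of $y_j\,P(Y = y_j)$ in the discrete formula, with `pdf(y) * width` approximating the probability of one thin slice.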
Summary Feature: Standard Deviation
The intuition for the standard deviation is also the same for continuous distributions, but again requires calculus (integration) to compute.
Other Summary Features
The median and other percentiles complement the mean and standard deviation in summarizing continuous distributions. For example, if you've ever taken a standardized test before (like SAT, ACT, or GRE), your score report probably included your percentile. Your score percentile is the proportion of the population you scored better than, e.g., if you were in the 90th percentile, you scored better than 90% of other students. Despite their utility, percentiles (quantiles) are beyond our scope, but I hope you study econometrics further to learn more about them.
The Normal (Gaussian) Distribution
One particular distribution appears frequently in statistics and econometrics: the normal distribution (or Gaussian distribution). Without getting too detailed, some comments may help, especially when you read other books. (If you want, you can find plenty of details on Wikipedia.)
Although some variables are indeed approximately normally distributed, this is not "normal" in the common English sense: most variables are not normal. To start, any discrete or categorical random variable cannot be normal (Gaussian). Most continuous variables are also not normal.
None of the methods in this textbook require variables to be Gaussian. Historically, sometimes normality was assumed for certain results, but it is not necessary. You don't need to worry about testing whether or not variables are normal. It doesn't matter.
Normal distributions are convenient for educational examples because the mean and standard deviation uniquely characterize a normal distribution. Notationally, the normal distribution is written $N(\mu, \sigma^2)$ for a normal distribution with mean $\mu$ and variance $\sigma^2$, or sometimes equivalently $N(\mu, \sigma)$ with standard deviation $\sigma = \sqrt{\sigma^2}$. That is, if $Y \sim N(\mu, \sigma^2)$, then $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$, so its standard deviation is $\sigma$. This is convenient for illustrative examples because, as we've seen, the mean gives a general sense of values being high or low, and the standard deviation describes how spread out the distribution is. With $\mu = 0$ and $\sigma = 1$, $N(0, 1)$ is called the standard normal distribution.
However, this is not true of other distributions, so in general the mean and standard deviation do not fully summarize a distribution. For example, there could be two random variables with mean 0 and standard deviation 1 that are very different. One such random variable is $W$ with $P(W = -1) = P(W = 1) = 1/2$. Another is the standard normal, $Z \sim N(0, 1)$. If you only know the mean and standard deviation, then you cannot tell the difference between $W$ and $Z$. These are very different, and there are many other random variables with the same mean and standard deviation.
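To see how different $W$ and $Z$ are despite sharing a mean and standard deviation, compare the probability each assigns to the interval from $-0.5$ to $0.5$. The interval is an arbitrary choice for illustration, and the Python sketch below uses the error function from the standard library for the normal CDF.

```python
import math

# W takes the values -1 and 1 with probability 1/2 each.
pmf_w = {-1: 0.5, 1: 0.5}
mean_w = sum(value * prob for value, prob in pmf_w.items())
sd_w = math.sqrt(sum((value - mean_w) ** 2 * prob for value, prob in pmf_w.items()))


def std_normal_cdf(z):
    """CDF of Z ~ N(0, 1), which also has mean 0 and standard deviation 1."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Probability of landing within 0.5 of the mean, for each distribution.
prob_w = sum(prob for value, prob in pmf_w.items() if -0.5 <= value <= 0.5)
prob_z = std_normal_cdf(0.5) - std_normal_cdf(-0.5)

print("W: mean =", mean_w, " sd =", sd_w)
print("P(-0.5 <= W <= 0.5) =", prob_w)
print("P(-0.5 <= Z <= 0.5) =", round(prob_z, 3))
```

$W$ never lands near its mean, whereas $Z$ does so fairly often, even though the two summary features are identical.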
2.4 Prelude to Prediction: Precipitation
This section introduces prediction concepts through a simple example to develop intuition. The two main goals are 1) to show that there is no single best prediction, because "best" depends on the ultimate purpose of the prediction, and 2) to begin translating intuition into formal mathematics. Further mathematical formalization and more complex examples are in Section 2.5.
Notationally (details below), g is your non-random guess of the realized value y of random variable Y, and L(y, g) quantifies how bad it is to have guessed g when the realized value is y. (Hypothetically, you could randomize your guess, but this is never optimal, so it is not considered.)
Throughout this section, imagine you want to predict whether or not it will rain tomorrow. Mathematically, random variable Y represents tomorrow's weather: Y = 1 if it rains tomorrow, and Y = 0 if not. Assume you actually know the probability distribution of Y (you do not need to estimate it from data). Since Y is binary, Y ∼ Bernoulli(p), so knowing the distribution is equivalent to knowing p = P(Y = 1), the probability that it rains. (This is usually what weather forecasts report, so you could actually do the following examples in real life.) Thus, equivalently: given the probability of rain, you want to predict the realized value (yes or no).
The distribution of Y alone is not enough to make a good prediction: you also need to know the consequences of correct and incorrect predictions in each case. Intuitively, if one outcome is really, really bad, then you should prefer to avoid even a small risk of it, compared to a larger risk of a not-so-bad outcome.
Mathematically, consequences are formalized as a loss function. The loss function L(y, g) specifies how bad it is to have guessed g when the realized value is y. (Other sources may switch the order of y and g, so be careful.) In the rain example, L(0, 1) represents how bad it is to guess rain when it does not rain. This may be different than L(1, 0): how bad it is to guess no rain when in fact it rains. For example, L(0, 1) = 20 and L(1, 0) = 100 means it is much worse to be wrong when it rains (y = 1) than when it doesn't (y = 0).
It can be confusing to define "loss" when you're correct (g = y). If something actually good happens, it can be represented by negative loss, like L(0, 0) = −10. If L(1, 1) = −30 (even more negative than −10), then it's even better to guess right when it rains.
Even if there are good outcomes, loss values can be normalized to be non-negative without changing the best prediction. In the rain example, imagine the most negative loss possible is L(1, 1) = −30. Then simply adding 30 to each loss makes them all non-negative without changing their relative values. This essentially sets L(1, 1) = 0 as the reference point, and the loss function's interpretation is, "How much worse is this situation than (y, g) = (1, 1)?" It's also possible to normalize L(v, v) = 0 for any v, meaning that guessing g = y is not bad at all, by subtracting the original L(v, v) from the original L(v, g) for all g. Understanding the detailed reasoning is beyond our scope, but just be aware that L(y, g) is not necessarily an absolute "how bad is (y, g)?" but can also be "how bad relative to (y, y)?" or "how bad relative to the best possible (v, v)?"
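To see why these normalizations leave the best prediction unchanged, here is a minimal sketch. It combines the illustrative loss values mentioned above into one hypothetical loss table (the text introduces them separately), and it compares the guess with smallest mean loss before and after each normalization over a grid of rain probabilities. The function names and the probability grid are assumptions for illustration.

```python
# Hypothetical loss table keyed by (y, g), assembled from the illustrative
# values above: L(0,0) = -10, L(1,1) = -30, L(0,1) = 20, L(1,0) = 100.
loss = {(0, 0): -10, (1, 1): -30, (0, 1): 20, (1, 0): 100}


def mean_loss(table, g, p_rain):
    """Mean loss of guess g when P(Y = 1) = p_rain."""
    return (1 - p_rain) * table[(0, g)] + p_rain * table[(1, g)]


def best_guess(table, p_rain):
    """Guess (0 or 1) with the smallest mean loss."""
    return min((0, 1), key=lambda g: mean_loss(table, g, p_rain))


# Normalization 1: add 30 to every loss, so the most negative loss becomes 0.
shifted = {k: v + 30 for k, v in loss.items()}

# Normalization 2: subtract L(y, y) from L(y, g), so correct guesses have loss 0.
relative = {(y, g): v - loss[(y, y)] for (y, g), v in loss.items()}

grid = [i / 10 for i in range(11)]  # illustrative values of P(Y = 1)
unchanged = all(
    best_guess(loss, p) == best_guess(shifted, p) == best_guess(relative, p)
    for p in grid
)
print(unchanged)  # True
```

Both normalizations change every mean loss by an amount that does not depend on g, which is why the ranking of guesses is preserved.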
If you are more familiar with utility functions from economics, you can think of the loss function as essentially a negative utility function. Apparently economists are optimistic, modeling how good things are (utility), whereas statisticians are pessimistic, modeling how bad things are (loss). If you had a utility function u(y, g) that says how good it is to have guessed g when the truth is y, then you can just define L(y, g) = −u(y, g).
Throughout this section, the consequences are the results of a bet with your friend. If you guess wrong, then you lose some money (positive loss). If you guess right, then you win some money (negative loss).
2.4.1 Easy: "Predict" Current Weather
Let's start easy: you're standing outside, and you want to predict whether or not it's currently raining. Since you can observe this directly, this is like the "after" view of Section 2.1.1: instead of multiple possible values of random variable Y, you see the realized value y, with no uncertainty.
You make a simple $1 bet: if you guess right (g = y = 0 or g = y = 1), then you win $1, but if you guess wrong (g ≠ y), then you lose $1. Recall that negative loss means winning. Formally, L(y, g) is
\[ L(0,0) = L(1,1) = -1, \quad L(0,1) = L(1,0) = 1 \implies L(y,g) = 2 \times \mathbb{1}\{y \ne g\} - 1. \tag{2.27} \]
If you remember your microeconomics classes, you may realize that the loss function in (2.27) implicitly assumes a linear utility function, u(x) = x, for simplicity. That is, if you currently have $x and you win $1, then your utility increases by u(x + 1) − u(x). If you lose $1, then your utility decreases from u(x) to u(x − 1). If you are risk-averse, u(·) is concave, so even though the dollar amount is the same, the potential utility increase is smaller than the potential utility decrease: u(x + 1) − u(x) < u(x) − u(x − 1). Generally, your loss function should be L(0, 0) = L(1, 1) = u(x) − u(x + 1) and L(0, 1) = L(1, 0) = u(x) − u(x − 1). For simplicity, plugging in u(x) = x yields u(x) − u(x + 1) = x − (x + 1) = −1 and u(x) − u(x − 1) = x − (x − 1) = 1, as in (2.27).
Obviously, you guess g = y. You are correct. You win $1. Mathematically, how can this intuition be formalized? If you know y, then you can compute both L(y, 0) and L(y, 1). If y = 1, so L(y, 0) = 1 and L(y, 1) = −1, then "guessing" g = 1 minimizes loss, since −1 < 1. If y = 0, so L(y, 0) = −1 and L(y, 1) = 1, then "guessing" g = 0 minimizes loss, since −1 < 1. Thus, the best "guess" is indeed g = y.
2.4.2 Minimizing Mean Loss
With the same loss function from (2.27), consider predicting tomorrow's weather if P(Y = 1) = 0.4. That is, there's a 40% probability of rain tomorrow (and 60% chance of no rain); for the purpose of your bet, should you predict rain?
Now we need some way to deal with uncertainty. Regardless of guessing g = 0 or g = 1, there is some chance of being right and some chance of being wrong.
In microeconomics, the typical approach is to choose the action that maximizes mean utility. The same could be done here. Equivalently, since the loss function is essentially a negative utility function, we could minimize mean loss. This doesn't guarantee you'll win the bet every time, but over the long run (if you bet many times), it leads to the lowest total loss.
Mean loss is more commonly called expected loss, but this can be confusing. Again, "expected" is technical jargon that is unrelated to what "expected" means in colloquial English. Below, it is in fact impossible to actually receive the "expected" loss, since it is a decimal value (whereas you can only win or lose $1).
Mean loss is also sometimes called risk. Again, this has a precise technical meaning, but it is probably not how you would define "risk" colloquially.
Given the distribution of Y and the loss function in (2.27), how can we pick g to minimize mean loss? There are two values of g to consider: g = 0 or g = 1. (Other values are allowed but would always be wrong, so you'd always lose; e.g., with g = 0.4, g ≠ y for both y = 0 and y = 1.) Given a particular guess like g = 0, the loss still depends on y, which has multiple possible values. Thus, the loss has multiple possible values. That is, given g = 0, the loss is a random variable. We can derive its distribution from the distribution of Y. Then, we can compute the mean of the loss distribution. In this way, we can compute mean loss for each possible g. Finally, the best guess is the g with the smallest mean loss. Details on these steps are given below.
To work through these steps mathematically, consider the loss when g = 0 and, separately, the loss when g = 1. Mathematically, define random variables
\[ L_0 \equiv L(Y, 0), \quad L_1 \equiv L(Y, 1), \tag{2.28} \]
respectively representing the (distribution of) loss when g = 0 and when g = 1. When g = 0, the loss L(Y, 0) is either L(0, 0) = −1 if Y = 0, or else L(1, 0) = 1 if Y = 1. That is, L_0 = −1 when Y = 0, and L_0 = 1 when Y = 1. We know P(Y = 1) = 0.4, so
\[ P(L_0 = 1) = P(Y = 1) = 0.4, \quad P(L_0 = -1) = P(Y = 0) = 1 - P(Y = 1) = 0.6. \tag{2.29} \]
Similarly, if you guess g = 1, then the loss L(Y, 1) is L(1, 1) = −1 when Y = 1, or else L(0, 1) = 1 when Y = 0, so
\[ P(L_1 = 1) = P(Y = 0) = 0.6, \quad P(L_1 = -1) = P(Y = 1) = 0.4. \tag{2.30} \]
Given the loss distributions in (2.29) and (2.30), mean loss can be computed for each possible g. If you guess g = 0, then using (2.16) and (2.29), mean loss (in $) is
\[ E(L_0) = (0.4)(1) + (0.6)(-1) = -0.2. \tag{2.31} \]
If you instead guess g = 1, then using (2.16) and (2.30), mean loss (in $) is
\[ E(L_1) = (0.6)(1) + (0.4)(-1) = 0.2. \tag{2.32} \]
The best prediction for your bet is the g that minimizes mean loss. Mathematically, E(L_0) < E(L_1); equivalently, using (2.28), E[L(Y, 0)] < E[L(Y, 1)]. That is, g = 0 generates the smallest mean loss, so g = 0 is the best prediction to make for your bet. Even though it's the best prediction, you'll still be wrong and lose $1 if it rains tomorrow, which has a 40% probability of happening. But in the long run, you'll win money if you always predict g = 0, whereas you'll lose money if you always predict g = 1.
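The steps of this subsection can be sketched in a few lines of code: evaluate the loss for each possible y, weight by the probability of y, and pick the guess with the smallest mean loss. The function and variable names are illustrative choices, not notation from the text.

```python
def bet_loss(y, g):
    """Loss (2.27): -1 (win $1) for a correct guess, +1 (lose $1) otherwise."""
    return 2 * int(y != g) - 1


def mean_loss(g, p_rain):
    """E[L(Y, g)] for Y ~ Bernoulli(p_rain), summing over y = 0 and y = 1."""
    return (1 - p_rain) * bet_loss(0, g) + p_rain * bet_loss(1, g)


p_rain = 0.4  # P(Y = 1) from this subsection

# Mean loss for each possible guess, as in (2.31) and (2.32).
mean_losses = {g: mean_loss(g, p_rain) for g in (0, 1)}

# The best prediction is the guess with the smallest mean loss.
best_guess = min(mean_losses, key=mean_losses.get)

print(mean_losses)  # about -0.2 for g = 0 and 0.2 for g = 1 (floating-point rounding aside)
print(best_guess)   # 0
```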
2.4.3 Different Probability
Imagine the same setup as in Section 2.4.2 but with a different distribution of Y: P(Y = 1) = 0.7. Intuitively, rain being more likely might change the optimal prediction from g = 0 to g = 1. Mathematically, this intuition proves correct.
Following the same steps from Section 2.4.2, first compute the distribution of loss separately for g = 0 and g = 1. Define L_0 and L_1 as in (2.28). Parallel to (2.29) and (2.30) but with P(Y = 1) = 0.7,
\[
\begin{aligned}
P(L_0 = 1) &= P(Y = 1) = 0.7, & P(L_0 = -1) &= P(Y = 0) = 1 - P(Y = 1) = 0.3, \\
P(L_1 = 1) &= P(Y = 0) = 0.3, & P(L_1 = -1) &= P(Y = 1) = 0.7.
\end{aligned}
\tag{2.33}
\]
Parallel to (2.31) and (2.32), using (2.33),
\[
\begin{aligned}
E[L(Y, 0)] &= E(L_0) = (0.7)(1) + (0.3)(-1) = 0.4, \\
E[L(Y, 1)] &= E(L_1) = (0.3)(1) + (0.7)(-1) = -0.4.
\end{aligned}
\tag{2.34}
\]
Since E[L(Y, 1)] < E[L(Y, 0)], g = 1 minimizes mean loss, so g = 1 is the best prediction for your bet.
2.4.4 Different Loss Function
Now imagine the original setup in Section 2.4.2 but with a different loss function. Specifically, if you correctly predict rain, you win $10 (i.e., −10 loss), but otherwise L(y, g) is the same as in (2.27):
\[ L(0,0) = -1, \quad L(1,1) = -10, \quad L(0,1) = L(1,0) = 1. \tag{2.35} \]
Intuitively, even though rain is less probable than no rain, the much larger payoff for correctly predicting rain might make us bet on rain. Mathematically, this intuition proves correct.
Following the same steps from Section 2.4.2, first compute the distribution of loss separately for g = 0 and g = 1. Define L_0 and L_1 as in (2.28). Parallel to (2.29) and (2.30) but with L(1, 1) = −10,
\[
\begin{aligned}
P(L_0 = 1) &= P(Y = 1) = 0.4, & P(L_0 = -1) &= P(Y = 0) = 0.6, \\
P(L_1 = 1) &= P(Y = 0) = 0.6, & P(L_1 = -10) &= P(Y = 1) = 0.4.
\end{aligned}
\tag{2.36}
\]
Parallel to (2.31) and (2.32), using (2.36),
\[
\begin{aligned}
E[L(Y, 0)] &= E(L_0) = (0.4)(1) + (0.6)(-1) = -0.2, \\
E[L(Y, 1)] &= E(L_1) = (0.6)(1) + (0.4)(-10) = -3.4.
\end{aligned}
\tag{2.37}
\]
Since E[L(Y, 1)] < E[L(Y, 0)], g = 1 minimizes mean loss, so g = 1 is the best prediction for your bet. (As noted earlier, ideally you'd also allow for a utility function with some risk aversion, in which case the best prediction may additionally depend on the degree of risk aversion.)
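Sections 2.4.3 and 2.4.4 change one ingredient at a time: first the probability, then the loss function. A small sketch that takes both as inputs makes this explicit. The function name and the dictionary layout (keys are (y, g) pairs) are illustrative assumptions.

```python
def best_prediction(loss, p_rain):
    """Return (best guess, mean losses) for binary Y with P(Y = 1) = p_rain.

    `loss` maps (y, g) pairs to loss values.
    """
    mean_losses = {
        g: (1 - p_rain) * loss[(0, g)] + p_rain * loss[(1, g)] for g in (0, 1)
    }
    return min(mean_losses, key=mean_losses.get), mean_losses


loss_227 = {(0, 0): -1, (1, 1): -1, (0, 1): 1, (1, 0): 1}    # loss (2.27)
loss_235 = {(0, 0): -1, (1, 1): -10, (0, 1): 1, (1, 0): 1}   # loss (2.35)

# Section 2.4.3: same loss as (2.27), but P(Y = 1) = 0.7.
g_243, ml_243 = best_prediction(loss_227, p_rain=0.7)

# Section 2.4.4: P(Y = 1) = 0.4 again, but with loss (2.35).
g_244, ml_244 = best_prediction(loss_235, p_rain=0.4)

print(g_243, ml_243)  # best guess 1; mean losses match (2.34) up to rounding
print(g_244, ml_244)  # best guess 1; mean losses match (2.37) up to rounding
```

Either change alone is enough to flip the best prediction from g = 0 to g = 1, which is the main point of these two subsections.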
2.5 Prediction with a Known Distribution
⟹ Kaplan video: Optimal Prediction
What does prediction mean? It may seem surprising to discuss prediction without any data and with a completely known distribution. In English, usually prediction means using what you know now to "predict" what will happen in the future (e.g., "Beware the Ides of March"). In econometrics and statistics, prediction shares the qualities of guessing something unknown using something known, but the details differ. (Predicting the future is a special case of prediction called forecasting; see Part III.)
Here, as in Section 2.4, the goal is to predict the value of a random draw from a known distribution. The distribution summarizes "what you know": different possible values and their probabilities. As in Section 2.1, the random draw need not occur in the future; indeed, it may have already happened, but we haven't observed it yet. So, besides applications like predicting ridesharing demand tomorrow, "prediction" also includes guessing the income of a customer standing right in front of you (who hasn't told you their income yet).
This section extends Section 2.4, including more complex examples. Further, close connections with description (Section 2.3) are shown.
Understanding the role of the loss function is particularly crucial. Even if you do some fancy machine learning prediction with cross-validation, you need to consider the loss function carefully. Just using whatever you find online may be inappropriate. I have seen PhD students puzzled by their results because they did not use an appropriate loss function. Loss functions are also central to Bayesian prediction, although that is beyond our scope.
2.5.1 Common Loss Functions
Ideally, a loss function reflects the real-world consequences of guessing g when the realized value is y, like in Section 2.4. In practice, this may be infeasible. For example, maybe the consequences are not easily quantified. Or maybe you have to make a single prediction that will be used for multiple decisions with different consequences, or for multiple people whose utility functions differ. Or maybe you have a deadline and simply don't have time to carefully construct a loss function.
There are infinitely many possible loss functions, but those below are used more commonly than others.
0–1 Loss
Define 0–1 loss as
\[ L_0(y, g) \equiv \mathbb{1}\{y \ne g\}. \tag{2.38} \]
This equals 0 (which is good) if you guess correctly and 1 (bad) if not. This reflects a case where it doesn't matter "how wrong" you are; it only matters whether you're right or wrong.
This 0–1 loss is often used when Y is a nominal categorical variable (Section 2.3.4) or binary. For example, if you need to predict occupation and the true y is "painter," it's probably no worse to have incorrectly guessed "zookeeper" than "bookkeeper": they're just both wrong. This is not true of ordinal categorical variables; e.g., if somebody's health is "excellent," it's probably worse (higher loss) to have predicted "poor" than "great."
Although not obvious, 0–1 loss determines the same optimal prediction as the rain bet loss function in (2.27). (Although coded as binary, the rain Y could be considered nominal categorical, with possible values "rain" and "no rain.") The optimal prediction does not change if you add 1 to all losses. It also does not change if you divide all losses by 2. Thus, given L(y, g) in (2.27), it is equivalent to use the loss function [L(y, g) + 1]/2 = 1{y ≠ g}, which is 0–1 loss as in (2.38).
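The equivalence between the rain bet loss and 0–1 loss can be checked directly. The sketch below verifies that [L(y, g) + 1]/2 equals 1{y ≠ g} for every (y, g) pair and that both loss functions pick the same guess over a few rain probabilities. The probability grid and helper names are assumptions for illustration.

```python
def bet_loss(y, g):
    """Rain bet loss (2.27)."""
    return 2 * int(y != g) - 1


def zero_one_loss(y, g):
    """0-1 loss (2.38)."""
    return int(y != g)


def best_guess(loss, p_rain):
    """Guess (0 or 1) minimizing mean loss when P(Y = 1) = p_rain."""
    return min(
        (0, 1),
        key=lambda g: (1 - p_rain) * loss(0, g) + p_rain * loss(1, g),
    )


pairs = [(y, g) for y in (0, 1) for g in (0, 1)]

# Adding 1 and dividing by 2 turns the bet loss into 0-1 loss.
same_values = all(
    (bet_loss(y, g) + 1) / 2 == zero_one_loss(y, g) for y, g in pairs
)

# The two loss functions agree on the best guess for each probability tried.
probabilities = [0.1, 0.25, 0.4, 0.6, 0.75, 0.9]  # illustrative grid
same_choice = all(
    best_guess(bet_loss, p) == best_guess(zero_one_loss, p)
    for p in probabilities
)

print(same_values, same_choice)  # True True
```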
Quadratic Loss
Define quadratic loss (or squared loss, or squared error loss, or L2 loss) as
\[ L_2(y, g) = (y - g)^2. \tag{2.39} \]
This is zero when the guess is perfect (g = y) and larger when g is farther from y in either direction (higher or lower). Thus, unlike 0–1 loss, quadratic loss can differentiate between a slightly-wrong guess and a really-wrong guess.
For example, let the true y = 100. The guess g = y = 100 is best, since L_2(100, 100) = 0 is the smallest possible loss. (Squaring produces non-negative numbers.) The guess g = 99 is worse: loss is L_2(100, 99) = (100 − 99)² = 1. The guess 90 is even worse, since L_2(100, 90) = (100 − 90)² = 100. In fact, even though 90 is only 10 times farther from y than 99 is, the loss is 100 times as big: much, much worse. The guess 110 is just as bad as 90 since they are both wrong by 10 (higher or lower doesn't matter): L_2(100, 110) = (100 − 110)² = 100.
Quadratic loss is often used for discrete or continuous variables (Sections 2.3.3 and 2.3.5). Although ordinal loss functions should also differentiate between slightly-wrong and really-wrong, quadratic loss cannot be used because you cannot subtract two category values or square them as required by (2.39); e.g., you cannot square the difference between "excellent" and "great."
Despite its common use, quadratic loss is not always sensible. For example, sometimes it may be much worse to over-predict (g > y) than under-predict (g < y), or vice-versa. Quadratic loss does not differentiate between over-prediction and under-prediction because (y − g)² = (g − y)²: only the absolute error |y − g| matters, not whether it's positive or negative. For example, if y = 100, it may be much worse to predict g = 110 than g = 90, but L_2(100, 110) = (100 − 110)² = 100 is the same as L_2(100, 90) = (100 − 90)² = 100. As another example, sometimes it may be twice as bad to over-predict by 20 units (g − y = 20) than 10 units (g − y = 10), but quadratic loss treats it as four times worse, since 10² = 100 but 20² = 400.
Nonetheless, quadratic loss is usually not crazy, especially if we need to make a prediction but don't know how it will be used in decision-making.
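The numerical examples in this subsection are easy to reproduce. This short sketch evaluates quadratic loss at the guesses discussed above; the function and variable names are illustrative.

```python
def quadratic_loss(y, g):
    """Quadratic loss (2.39): squared distance between truth y and guess g."""
    return (y - g) ** 2


y_true = 100
losses = {g: quadratic_loss(y_true, g) for g in (100, 99, 90, 110)}
print(losses)  # {100: 0, 99: 1, 90: 100, 110: 100}

# Over-predicting by 20 is treated as four times as bad as over-predicting by 10.
ratio = quadratic_loss(y_true, y_true + 20) / quadratic_loss(y_true, y_true + 10)
print(ratio)  # 4.0
```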
Other Loss Functions
Although beyond our scope, if you're curious, there are other interesting, common loss functions. One is absolute loss (or L1 loss), L_1(y, g) = |y − g|. Variations of absolute loss include asymmetric versions for which over-prediction is worse than under-prediction (or vice-versa), as well as absolute percentage loss, |(y − g)/y|. Another common loss function is weighted 0–1 loss, which generalizes 0–1 loss by allowing L(0, 1) ≠ L(1, 0). (This is related to hypothesis testing, where there are different consequences for rejecting a true hypothesis than not rejecting a false hypothesis.)
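For the curious, these loss functions might be written as follows. This is only a sketch: the text does not specify the asymmetric weights or the weighted 0–1 costs, so the default values below (2 and 1; 20 and 100) are hypothetical, and the particular piecewise-linear form of the asymmetric loss is one common choice among several.

```python
def absolute_loss(y, g):
    """Absolute (L1) loss |y - g|."""
    return abs(y - g)


def asymmetric_absolute_loss(y, g, over_weight=2.0, under_weight=1.0):
    """Absolute loss with different (hypothetical) weights for
    over-prediction (g > y) and under-prediction (g < y)."""
    error = g - y
    return over_weight * error if error > 0 else under_weight * (-error)


def absolute_percentage_loss(y, g):
    """Absolute percentage loss |(y - g) / y|; requires y != 0."""
    return abs((y - g) / y)


def weighted_zero_one_loss(y, g, cost_01=20, cost_10=100):
    """Weighted 0-1 loss for binary y: zero if correct, otherwise
    L(0, 1) = cost_01 and L(1, 0) = cost_10 (hypothetical values)."""
    if y == g:
        return 0
    return cost_01 if (y, g) == (0, 1) else cost_10


print(absolute_loss(100, 90))                 # 10
print(asymmetric_absolute_loss(100, 110))     # 20.0 (over-prediction, weight 2)
print(asymmetric_absolute_loss(100, 90))      # 10.0 (under-prediction, weight 1)
print(absolute_percentage_loss(100, 90))      # 0.1
print(weighted_zero_one_loss(0, 1), weighted_zero_one_loss(1, 0))  # 20 100
```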
2.5.2 Optimal Prediction: Generic Examples
In Sum: Optimal Prediction
1. Choose appropriate loss function: L(y, g) quantifies how bad it is to guess g when the true value is y.
2. Optimal prediction: the value of g with smallest mean loss, E[L(Y, g)].
The following generic examples show the procedure to find the prediction g that is "optimal" in the sense of minimizing mean loss. The procedure does not allow continuous Y, since that would require calculus, but it works for any other variable type from Section 2.3.
Two Possible Values
This is the same procedure used in the rain example in Section 2.4.
Step 1 is to write out the loss function values for all combinations of (y, g), where g is our guess and y is the true realized value. Assume two possible values: y = a or y = b. (Possibly a = 0 and b = 1, but they could have any value, including non-numeric values like "cat" and "dog.") Assume you have to guess either a or b: g = a or g = b. Thus, there are four possible combinations of (y, g): (a, a), (b, b), (a, b), and (b, a). The four corresponding loss function values can be arranged in a matrix, where each column has the same y value and each row has the same g value:
\[
\begin{pmatrix} L(a,a) & L(b,a) \\ L(a,b) & L(b,b) \end{pmatrix}
=
\begin{pmatrix} L(y=a, g=a) & L(y=b, g=a) \\ L(y=a, g=b) & L(y=b, g=b) \end{pmatrix}.
\tag{2.40}
\]
Step 2 is to compute the mean loss for each possible guess g. In (2.40), each row corresponds to a different g. To compute mean loss if g = a, look at the first row. The guess g = a is non-random. The randomness of L(Y, a) is from the randomness of Y. The probabilities P(Y = a) and P(Y = b) do not depend on our guess g. If we guess g = a, then we'll get loss L(a, a) with probability P(Y = a), and we'll get loss L(b, a) with probability P(Y = b). If we guess g = b, then we'll get loss L(a, b) with probability P(Y = a), and we'll get loss L(b, b) with probability P(Y = b). Thus,
\[
\begin{aligned}
E[L(Y, a)] &= P(Y = a) L(a, a) + P(Y = b) L(b, a), \\
E[L(Y, b)] &= P(Y = a) L(a, b) + P(Y = b) L(b, b).
\end{aligned}
\tag{2.41}
\]
(Although we won't use it, if you're handy with matrix algebra, you'll see that the vector of mean losses can be written as the product of the matrix in (2.40) with the column vector of probabilities P(Y = a) and P(Y = b); i.e., each mean loss is the dot product of a row in the loss matrix with the probability vector.)
Step 2 could also be interpreted in terms of two new random variables, like L_0 and L_1 in Section 2.4. Let L_a ≡ L(Y, a) be a random variable representing loss when g = a. The distribution of L_a is P(L_a = L(a, a)) = P(Y = a) and P(L_a = L(b, a)) = P(Y = b). Similarly, let L_b ≡ L(Y, b) be a random variable representing loss when g = b, with P(L_b = L(a, b)) = P(Y = a) and P(L_b = L(b, b)) = P(Y = b). Thus, yielding the same results as (2.41),
\[
\begin{aligned}
E[L(Y, a)] &= E(L_a) = P(Y = a) L(a, a) + P(Y = b) L(b, a), \\
E[L(Y, b)] &= E(L_b) = P(Y = a) L(a, b) + P(Y = b) L(b, b).
\end{aligned}
\tag{2.42}
\]
Step 3 is to find the g that minimizes E[L(Y, g)]. The g that minimizes mean loss is the optimal predictor. That is, if E[L(Y, a)] < E[L(Y, b)], then g = a is the optimal predictor; if E[L(Y, b)] < E[L(Y, a)], then g = b is the optimal predictor; or if E[L(Y, a)] = E[L(Y, b)], then g = a and g = b are equally good (or equally bad).
For example, let
\[
\begin{pmatrix} L(a,a) & L(b,a) \\ L(a,b) & L(b,b) \end{pmatrix}
=
\begin{pmatrix} 0 & 7 \\ 5 & 0 \end{pmatrix},
\qquad
\begin{aligned} P(Y = a) &= 0.7, \\ P(Y = b) &= 0.3. \end{aligned}
\tag{2.43}
\]
Using (2.41),
\[ E[L(Y, a)] = P(Y = a) L(a, a) + P(Y = b) L(b, a) = (0.7)(0) + (0.3)(7) = 2.1, \tag{2.44} \]
\[ E[L(Y, b)] = P(Y = a) L(a, b) + P(Y = b) L(b, b) = (0.7)(5) + (0.3)(0) = 3.5. \tag{2.45} \]
Since E[L(Y, a)] < E[L(Y, b)], the predictor g = a is better than g = b, according to mean loss with this particular loss function.
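The three steps can be sketched with the numbers from (2.43). The values a and b are kept as text labels here to emphasize that they need not be numeric; the dictionary layout and names are illustrative assumptions.

```python
# Step 1: loss values from (2.43), keyed by (true value y, guess g).
loss = {("a", "a"): 0, ("b", "a"): 7, ("a", "b"): 5, ("b", "b"): 0}

# Distribution of Y from (2.43).
prob = {"a": 0.7, "b": 0.3}


def mean_loss(g):
    """Step 2: E[L(Y, g)] as in (2.41), summing over possible values y."""
    return sum(prob[y] * loss[(y, g)] for y in prob)


mean_losses = {g: mean_loss(g) for g in prob}

# Step 3: the optimal predictor has the smallest mean loss.
best = min(mean_losses, key=mean_losses.get)

print(mean_losses)  # about 2.1 for g = a and 3.5 for g = b, as in (2.44) and (2.45)
print(best)         # a
```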
Many Possible Values
Now let Y take J different possible values. Label these v_1, v_2, ..., v_J. For example, if J = 3, we could have v_1 = 10, v_2 = 20, and v_3 = 22, or v_1 could be "cat," v_2 "dog," and v_3 "echidna." Like before, assume g must take one of these same values.
The three steps are the same as with J = 2, just with more values to handle.
First, write out the loss function values in a matrix, with the row m, column k entry equal to L(Y = v_k, g = v_m):
\[
\begin{pmatrix}
L(v_1, v_1) & L(v_2, v_1) & \cdots & L(v_J, v_1) \\
L(v_1, v_2) & L(v_2, v_2) & \cdots & L(v_J, v_2) \\
\vdots & \vdots & \ddots & \vdots \\
L(v_1, v_J) & L(v_2, v_J) & \cdots & L(v_J, v_J)
\end{pmatrix}.
\tag{2.46}
\]
Second, compute all the mean losses:
\[
\begin{aligned}
E[L(Y, v_1)] &= P(Y = v_1) L(v_1, v_1) + P(Y = v_2) L(v_2, v_1) + \cdots = \sum_{j=1}^{J} P(Y = v_j) L(v_j, v_1), \\
E[L(Y, v_2)] &= P(Y = v_1) L(v_1, v_2) + P(Y = v_2) L(v_2, v_2) + \cdots = \sum_{j=1}^{J} P(Y = v_j) L(v_j, v_2), \\
&\;\;\vdots \\
E[L(Y, v_J)] &= P(Y = v_1) L(v_1, v_J) + P(Y = v_2) L(v_2, v_J) + \cdots = \sum_{j=1}^{J} P(Y = v_j) L(v_j, v_J).
\end{aligned}
\tag{2.47}
\]
Third, find the g that minimizes E[L(Y, g)]. That is, find the smallest of the J values computed in (2.47); the corresponding g is the optimal predictor.
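For example, with J = 3, let v_1 = −1, v_2 = 0, v_3 = 1. Let
\[
\begin{pmatrix}
L(v_1, v_1) & L(v_2, v_1) & L(v_3, v_1) \\
L(v_1, v_2) & L(v_2, v_2) & L(v_3, v_2) \\
L(v_1, v_3) & L(v_2, v_3) & L(v_3, v_3)
\end{pmatrix}
=
\begin{pmatrix}
L(-1, -1) & L(0, -1) & L(1, -1) \\
L(-1, 0) & L(0, 0) & L(1, 0) \\
L(-1, 1) & L(0, 1) & L(1, 1)
\end{pmatrix}
=
\begin{pmatrix}
0 & 2 & 8 \\
1 & 0 & 2 \\
4 & 1 & 0
\end{pmatrix}.
\tag{2.48}
\]
Let
\[
\begin{aligned}
P(Y = v_1) &= P(Y = -1) = 0.2, \\
P(Y = v_2) &= P(Y = 0) = 0.3, \\
P(Y = v_3) &= P(Y = 1) = 0.5 = 1 - P(Y = -1) - P(Y = 0).
\end{aligned}
\tag{2.49}
\]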
Then the mean losses are
\[
\begin{aligned}
E[L(Y, v_1)] &= P(Y = v_1) L(v_1, v_1) + P(Y = v_2) L(v_2, v_1) + P(Y = v_3) L(v_3, v_1) \\
&= (0.2)(0) + (0.3)(2) + (0.5)(8) = 4.6, \\
E[L(Y, v_2)] &= P(Y = v_1) L(v_1, v_2) + P(Y = v_2) L(v_2, v_2) + P(Y = v_3) L(v_3, v_2) \\
&= (0.2)(1) + (0.3)(0) + (0.5)(2) = 1.2, \\
E[L(Y, v_3)] &= P(Y = v_1) L(v_1, v_3) + P(Y = v_2) L(v_2, v_3) + P(Y = v_3) L(v_3, v_3) \\
&= (0.2)(4) + (0.3)(1) + (0.5)(0) = 1.1.
\end{aligned}
\]
The smallest of these three values is E[L(Y, v_3)] = E[L(Y, 1)] = 1.1, so the optimal predictor is g = v_3 = 1.
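With many possible values, it helps to let the computer do the row-by-row dot products. The sketch below uses the loss matrix (2.48) and probabilities (2.49), with exact fractions to avoid floating-point rounding. The list-of-lists layout (rows are guesses, columns are true values) follows (2.46); the variable names are illustrative.

```python
from fractions import Fraction

values = [-1, 0, 1]  # v_1, v_2, v_3

# Loss matrix (2.48): row m is guess v_m, column k is true value v_k.
loss_matrix = [
    [0, 2, 8],
    [1, 0, 2],
    [4, 1, 0],
]

# Probabilities (2.49), written as exact fractions.
probs = [Fraction(2, 10), Fraction(3, 10), Fraction(5, 10)]

# Each mean loss is the dot product of one row with the probability vector.
mean_losses = [sum(p * l for p, l in zip(probs, row)) for row in loss_matrix]

# The optimal predictor is the value whose row has the smallest mean loss.
best_index = min(range(len(values)), key=lambda m: mean_losses[m])
best_value = values[best_index]

print([float(ml) for ml in mean_losses])  # [4.6, 1.2, 1.1]
print(best_value)                         # 1
```

The same code works for any J once `values`, `loss_matrix`, and `probs` are replaced, including non-numeric values.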
2.5.3 Optimal Prediction: Specific Examples
Example: Carnival Age Game
Imagine predicting a person's age, Y. Imagine when you were younger, you worked at a carnival in the summer, where people paid five tickets to see if you could guess their age. If you guessed correctly, they won nothing; if incorrect, they won the plush animal or fruit of their choice (which of course was still worth much less than five tickets).
Since they pay five tickets regardless of what you guess, that need not enter the loss function. The fact that it only matters if you guess correctly or not implies 0–1 loss, L_0(y, g) = 1{y ≠ g}. (For added challenge, come back to this example and see what changes if they only win when you are more than three years off.)
For simplicity, let P(Y = 20) = 0.6 and P(Y = 25) = 0.4. The procedure in Section 2.5.2 can be used. Mean losses with 0–1 loss are
\[
\begin{aligned}
E(L_0(Y, 20)) &= E(\mathbb{1}\{Y \ne 20\}) = (0.6)(0) + (0.4)(1) = 0.4, \\
E(L_0(Y, 25)) &= E(\mathbb{1}\{Y \ne 25\}) = (0.6)(1) + (0.4)(0) = 0.6,
\end{aligned}
\tag{2.50}
\]
so g_0^* = 20 is the optimal prediction. (See also (2.56) for a different path to the same conclusion.)
Quadratic loss could lead to the wrong guess, depending on how blindly we apply it. Comparing only g = 20 to g = 25:
\[
\begin{aligned}
E[L_2(Y, 20)] &= E[(Y - 20)^2] \\
&= P(Y = 20)(20 - 20)^2 + P(Y = 25)(25 - 20)^2 \\
&= (0.6)(0) + (0.4)5^2 = (0.4)(25) = 10, \\
E[L_2(Y, 25)] &= E[(Y - 25)^2] \\
&= P(Y = 20)(20 - 25)^2 + P(Y = 25)(25 - 25)^2 \\
&= (0.6)(-5)^2 + (0.4)(0) = (0.6)(25) = 15.
\end{aligned}
\]
Like with 0–1 loss, it is better to guess the more likely value 20 than the less likely 25. However, it is even better to guess something in between:
\[
\begin{aligned}
E[L_2(Y, 22)] &= E[(Y - 22)^2] \\
&= P(Y = 20)(20 - 22)^2 + P(Y = 25)(25 - 22)^2 \\
&= (0.6)(-2)^2 + (0.4)(3)^2 = (0.6)(4) + (0.4)(9) = 6.
\end{aligned}
\]
Some calculus shows g = 22 is actually optimal for L2 loss. However, according to the rules of the carnival game, if we guess g = 22 when everyone has either Y = 20 or Y = 25, then we'll lose every single time. This is the worst possible guess. This is one example where quadratic loss is not appropriate.
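To see the contrast between the two loss functions side by side, the sketch below computes both mean losses for every whole-number guess from 20 to 25. Restricting to whole numbers is an assumption made only to keep the table short; exact fractions are used to avoid rounding, and the names are illustrative.

```python
from fractions import Fraction

# Age distribution from the carnival example: P(Y=20) = 0.6, P(Y=25) = 0.4.
age_dist = {20: Fraction(3, 5), 25: Fraction(2, 5)}


def mean_zero_one_loss(g):
    """E[1{Y != g}] = P(Y != g)."""
    return sum(p for y, p in age_dist.items() if y != g)


def mean_quadratic_loss(g):
    """E[(Y - g)^2]."""
    return sum(p * (y - g) ** 2 for y, p in age_dist.items())


guesses = range(20, 26)
for g in guesses:
    print(g, float(mean_zero_one_loss(g)), float(mean_quadratic_loss(g)))

best_zero_one = min(guesses, key=mean_zero_one_loss)
best_quadratic = min(guesses, key=mean_quadratic_loss)
print(best_zero_one, best_quadratic)  # 20 22
```

The guess g = 22 has the smallest mean quadratic loss but a mean 0–1 loss of 1, meaning it loses the carnival game every time.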
Example: Advertising
Later in life, well past your carnival days, you work in advertising. Coincidentally, your job is still to guess a person's age, but with different consequences. If you guess somebody is 40 years old, then your client's website shows an ad specifically designed for 40-year-olds. The loss function should capture how much worse it is to show an ad targeting the guessed age than to show the optimal ad for the individual's true age.
Unlike at the carnival, some incorrect guesses are much worse than others. For example, it doesn't matter much if you guess a person is 40 years old but really they're 41. The optimal ad for the 40-year-old is almost equally effective on the 41-year-old, so there is very little loss from guessing 40 instead of 41. However, guessing that the 41-year-old is 20 years old is much worse than guessing 40. Similarly, guessing that the 41-year-old is 60 is bad, but still better than guessing 80. Consequently, 0–1 loss is inappropriate because it treats guesses of 20, 40, 60, and 80 as equally bad when y = 41.
In this case, quadratic loss seems more appropriate than 0–1 loss, if not perfect.
Imagine the age distribution of Y is the same as in the carnival game. We saw that quadratic and 0–1 loss lead to different "optimal" predictions. That result depends only on the mathematical distribution of Y, not on the real-world interpretation (carnival, advertising). So, the predictions are still different, but now we may prefer g = 22 over g = 20. That is, we might "lose" a lot by showing 25-year-olds an ad targeting 20-year-olds, but maybe both 20-year-olds and 25-year-olds respond to the ad targeting 22-year-olds.
Example: More Ages
Consider more complex versions of the carnival and advertising examples, with the following distribution of Y. Now, any value Y ∈ {20, 21, 22, 23, 24, 25} is possible. Imagine young people are more likely; specifically,
\[ P(Y = j) = (26 - j)/21, \quad j = 20, 21, \ldots, 25. \tag{2.51} \]
With 0–1 loss (for the carnival), mean loss when predicting g can be computed as follows. From (2.38), the loss function is L_0(y, g) = 1{y ≠ g}, which equals 0 if y = g but equals 1 otherwise. This can be rewritten as 1 − 1{y = g}. Also, 1{y = g} P(Y = y) equals P(Y = g) when y = g, but otherwise equals 0 when y ≠ g. Putting these pieces together, mean loss is
\[
\begin{aligned}
E(L_0(Y, g)) &= \sum_{y=20}^{25} \mathbb{1}\{y \ne g\} P(Y = y) = \sum_{y=20}^{25} [1 - \mathbb{1}\{y = g\}] P(Y = y) \\
&= \overbrace{\sum_{y=20}^{25} P(Y = y)}^{=1} - \sum_{y=20}^{25} \mathbb{1}\{y = g\} P(Y = y) = 1 - P(Y = g).
\end{aligned}
\]
Thus, the smallest possible mean loss is achieved by the largest possible P(Y = g):
\[ \arg\min_g E(L_0(Y, g)) = \arg\min_g [1 - P(Y = g)] = \arg\max_g P(Y = g) = 20, \tag{2.52} \]
so the best prediction is g_0^* = 20. (As defined in the Notation section before Chapter 1, arg min_g f(g) means "the value of g that minimizes f(g)," and arg max_g f(g) means "the value of g that maximizes f(g).")
This result is intuitive because 0–1 loss only cares whether a prediction is right or wrong. Guessing g_0^* = 20 gives you a 6/21 probability of being correct, the largest possible. Equivalently, it is the smallest possible probability of being wrong. More generally, g_0^* is always the single most likely value (the mode); see (2.56).
Like before, quadratic loss yields a different optimal prediction. If we guessed g = 20, then
\[ E[L_2(Y, g)] = \sum_{y=20}^{25} P(Y = y)(y - 20)^2 = \sum_{y=20}^{25} [(26 - y)/21](y - 20)^2 = 5. \tag{2.53} \]
But we can do better by guessing a value more toward the "middle" of the distribution. Although g = 20 is exactly correct sometimes, it's bad when Y = 25. If we try g = 21,
\[ E[L_2(Y, g)] = \sum_{y=20}^{25} P(Y = y)(y - 21)^2 = \sum_{y=20}^{25} [(26 - y)/21](y - 21)^2 = 8/3 \approx 2.67. \tag{2.54} \]
Since 2.67 < 5, guessing g = 21 is better than g = 20, according to mean quadratic loss. (Can you do even better than g = 21?)
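The sums in (2.53) and (2.54) are tedious by hand but short in code. The sketch below builds the distribution (2.51) with exact fractions and evaluates both mean losses; the function names are illustrative. Only g = 20 and g = 21 are evaluated here, so the question above is left for you to explore by calling the same function with other guesses.

```python
from fractions import Fraction

# Distribution (2.51): P(Y = j) = (26 - j)/21 for j = 20, ..., 25.
prob = {y: Fraction(26 - y, 21) for y in range(20, 26)}


def mean_zero_one_loss(g):
    """E[1{Y != g}] = 1 - P(Y = g), as derived above."""
    return 1 - prob.get(g, 0)


def mean_quadratic_loss(g):
    """E[(Y - g)^2], summing over all possible ages."""
    return sum(p * (y - g) ** 2 for y, p in prob.items())


print(sum(prob.values()))       # 1, so (2.51) is a valid distribution
print(mean_zero_one_loss(20))   # 5/7, i.e., 1 - 6/21
print(mean_quadratic_loss(20))  # 5, matching (2.53)
print(mean_quadratic_loss(21))  # 8/3, matching (2.54)
```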
Discussion Question 2.3 (banana loss function). Imagine you run a small banana shop. You buy bananas wholesale for 2 cents each ($0.02) and sell each for 40 cents ($0.40). The wholesaler delivers every Monday. Any bananas not sold by the next Monday spoil; you cannot sell them (they just go in the compost). Let y be the actual number of bananas that customers want to buy in some week. Let g be your guess, i.e., how many you bought wholesale on Monday.
a) Why isn't 0–1 loss appropriate?
b) Why isn't quadratic loss appropriate?
c) What might the loss function look like, if you only care about maximizing profit? Try to be as specific and mathematical as you can. In particular, consider the different consequences of over-buying (g > y) versus under-buying (g < y).
2.5.4 Mean and Mode as Optimal Predictions
Under quadratic loss, the mean is the optimal predictor that minimizes mean loss. Although the details are beyond our scope, calculus can be used to take the derivative of mean loss with respect to g and set it to zero (the "first-order condition"), yielding
\[ g_2^* \equiv \arg\min_g E[(Y - g)^2] = E(Y). \tag{2.55} \]
This says the mean of a distribution has two interpretations. For description, the mean helps summarize the "center" of the distribution. For prediction, the mean is the prediction of an unknown value of Y that minimizes mean quadratic loss.
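For readers who know some calculus, one way to fill in the step behind (2.55) is sketched below. It assumes only that E(Y²) is finite and uses linearity of the mean; the decomposition in the last line is a standard identity rather than something stated in the text.

```latex
\begin{align*}
E[(Y-g)^2] &= E(Y^2) - 2g\,E(Y) + g^2, \\
\frac{d}{dg} E[(Y-g)^2] &= -2E(Y) + 2g = 0 \iff g = E(Y), \\
\frac{d^2}{dg^2} E[(Y-g)^2] &= 2 > 0 \quad \text{(so this is a minimum)}, \\
E[(Y-g)^2] &= \mathrm{Var}(Y) + \bigl(E(Y) - g\bigr)^2 .
\end{align*}
```

As a check against the carnival example, where E(Y) = 22 and Var(Y) = 6, guessing g = 20 gives 6 + (22 − 20)² = 10 and guessing g = 25 gives 6 + (22 − 25)² = 15, matching the earlier calculations.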
Under 0–1 loss, the optimal prediction is
\[
\begin{aligned}
g_0^* \equiv \arg\min_g E(L_0(Y, g)) &= \arg\min_g E[\mathbb{1}\{Y \ne g\}] = \arg\min_g P(Y \ne g) \\
&= \arg\min_g [1 - P(Y = g)] = \arg\min_g -P(Y = g) \\
&= \arg\max_g P(Y = g).
\end{aligned}
\tag{2.56}
\]
That is, the mode (the single most likely value of Y) is the optimal predictor. (This formula does not make sense if Y is continuous, because then P(Y = g) = 0 for any g.) Thus, like the mean, the mode has two interpretations, one for description and one for prediction.
Discussion Question 2.4 (optimal banana prediction). Consider the same setup as in DQ 2.3, and again assume you want to maximize (mean) profit. Imagine you know the distribution of Y (banana quantity demanded in one week).
a) Do you think the mean E(Y) is a good "predicted" number of bananas to buy wholesale? Explain why or why not; if not, also explain why you think E(Y) is too high or too low.
b) What if the retail price were $99 per banana, and the wholesale cost is still $0.02 per banana: would E(Y) be good, or too high, or too low, and why?
c) What if the retail price were equal to the wholesale price?
2.5.5 Interval Prediction
Only point prediction has been discussed so far, i.e., the single number that provides the best guess of the unknown value. Alternatively, interval prediction lets the guess be a range of numbers, called a prediction interval.
The disadvantage of point predictions is that they are usually wrong for discrete variables and always wrong for continuous variables. For example, if Y ∼ N(0, 1), then the best point prediction under quadratic loss is E(Y) = 0. But this guess will be wrong 100% of the time, since P(Y = 0) = 0.
By guessing a range of numbers, a prediction interval can actually contain the true value with large probability. The length of the interval captures the level of uncertainty: with lots of uncertainty, the interval must be very long to have a high probability of containing the true value.
For example let P(Y = j) = 1100 for j = 1 2 100 Themean E(Y ) = 505 is the best prediction under quadratic loss butit never actually happens P(Y = 505) = 0 Even P(Y = 50) =001 still very small Alternatively the prediction interval [26 75]has P(26 le Y le 75) = 50100 there is roughly a 50 probabilitythat a randomly sampled Y value is inside the prediction intervalOr P(6 le Y le 95) = 90100 so the prediction interval [6 95] hasaround 90 probability of containing a randomly drawn Y
25 PREDICTION WITH A KNOWN DISTRIBUTION 51
The interval length reflects amount of uncertainty Above the90 prediction interval was [6 95] If instead P(Y = j) = 110for j = 1 2 10 then the much shorter interval [2 10] has 90probability of containing Y The values of Y are concentrated moreclosely together so the prediction interval can be smaller but stillhave the same 90 probability
Even for the same Y distribution and same interval probability(like 90) there may be multiple possible prediction intervals Inthe original example [26 75] was a 50 prediction interval but sois [25 74] or [1 50] or [51 100] Other properties can be used todistinguish among these intervals but such is beyond our scope
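The interval probabilities in this example can be reproduced with a few lines of R. This is only a sketch: the helper `coverage()` is a hypothetical name introduced here, and it simply adds up the PMF over the integers inside a closed interval.

```r
# Population from the example: P(Y = j) = 1/100 for j = 1, ..., 100
y_vals <- 1:100
probs  <- rep(1 / 100, 100)

# Probability that Y falls in the closed interval [lo, hi]
# (coverage is a hypothetical helper, not from the text)
coverage <- function(lo, hi) sum(probs[y_vals >= lo & y_vals <= hi])

cov_50 <- coverage(26, 75)  # the 50% prediction interval from the text
cov_90 <- coverage(6, 95)   # the 90% prediction interval from the text

# Other intervals from the text with the same probability as [26, 75]
others <- sapply(list(c(25, 74), c(1, 50), c(51, 100)),
                 function(iv) coverage(iv[1], iv[2]))

cov_50
cov_90
others
```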
Chapter 3
One Variable Sample
⇒ Kaplan video: Chapter Introduction
Depends on: Chapter 2
Unit learning objectives for this chapter
3.1 Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
3.2 Describe and distinguish Bayesian and frequentist perspectives [TLO 4]
3.3 Identify and interpret properties of a sampling procedure or estimator [TLO 4]
3.4 Judge which estimator is better based on its properties [TLO 6]
3.5 Interpret different measures of statistical uncertainty [TLOs 6 and 7]
3.6 Assess the statistical significance and economic significance of empirical results [TLO 6]
3.7 In R (or Stata): compute estimates of a population mean along with measures of uncertainty [TLO 7]
Optional resources for this chapter
• Basic statistics: the Khan Academy AP Statistics unit includes instructional material and practice questions
• Quantifying uncertainty and statistical significance (Masten video)
• Estimator properties (Lambert video)
• Unbiasedness and consistency (Lambert video 1 of 2)
• Unbiasedness and consistency (Lambert video 2 of 2)
• iid sampling (Lambert video)
• Bayesian vs. frequentist: cookie inference example (StackExchange)
• Section 2.8 (“Exploratory Data Analysis with R”) in Kleiber and Zeileis (2008) [Chapter 2 is available free on their website]
• Section 2.2 (“Random Sampling and the Distribution of Sample Averages”) and Chapter 3 (“A Review of Statistics Using R”) in Hanck et al. (2018)
• Sections 1.5.4 (“Fundamental Statistics”) and 1.9.3 (“Simulation of Confidence Intervals and t Tests”) in Heiss (2016)
• R package boot (Canty and Ripley, 2019; Davison and Hinkley, 1997)
Sections 2.3 and 2.5 considered only the population distribution, whereas Chapter 3 considers data sampled from that distribution. The words data, dataset, sample values, and sample all refer to the same thing: the set of values that the researcher actually sees. But, as in Chapter 2, this could be seen either from the “before” perspective as random variables, or from the “after” perspective as non-random realized values. Section 2.1 gave the general idea of seeing observations as random variables (the “before” view); here, specific details are provided on estimation and uncertainty.
Although long, this chapter is mostly review of material you should have seen already in an introductory statistics class.
3.1 Bayesian and Frequentist Perspectives
⇒ Kaplan video: Bayesian and Frequentist Perspectives
Econometrics and statistics have two main frameworks: Bayesian and frequentist (or classical). These are cynically deemed “sects” by some, but outside the vocal extremes (and amusing webcomics: xkcd.com/1132), most econometricians appreciate and respect both frameworks (and the people who use them), sometimes working with both in turn.
This text uses the frequentist framework. Why? Mostly, that’s just how I wrote it. I’ll spare you post hoc rationalization.
There is little disagreement about the population and what we want to learn. Generally, both Bayesian and frequentist perspectives agree on everything in Chapter 2 about the population and how data are generated.
The disagreements are about how to use the sampled data to learn about the population. Frequentist and Bayesian approaches have different advantages appropriate for different settings.
The goal of the remainder of Section 3.1 is to give you a very basic overview and comparison of Bayesian and frequentist approaches. At minimum, I hope you get a sense of their different ways of quantifying uncertainty and the different types of questions they can (and cannot) answer.
3.1.1 Very Brief Overview: Bayesian Approach
The Bayesian approach models your beliefs about an unknown population value $\theta$, like the mean $\theta = E(Y)$. Your prior (or prior belief) is what you believe about $\theta$ before seeing the data. Your posterior (or posterior belief) is what you believe about $\theta$ after seeing the data. The Bayesian approach describes how to update your prior using the observed data, to get your posterior.
Mathematically, “belief” is a probability distribution. For example, let random variable $B$ represent your belief about the population mean. If you think there’s a 50% chance the mean is negative, then $P(B < 0) = 50\%$. If you think there’s a 1/4 probability that $B$ is below $-1$, then $P(B < -1) = 1/4$. There are formal procedures for “prior elicitation,” i.e., quantifying your beliefs as a distribution.
Notationally, though it is confusing, usually the belief is represented by the (usually Greek) letter for the parameter, like $\theta$, rather than a separate variable (like $B$ above). This does not mean “there is no population mean”; it is purely a notational convention for representing beliefs.
For a concrete example, imagine you find an archaeological site in Missouri with many artifacts, but you are unsure of which people group had lived in that site. Based on its location, it was either the Missouria, Illini, or Osage tribe, which can be represented as $\theta = M$, $\theta = I$, or $\theta = O$, respectively. That is, $\theta$ is the unknown parameter of interest, with possible values $M$, $I$, or $O$. Before looking at the artifacts (the data), you believed there was equal chance of each tribe, i.e., your prior belief was $P(\theta = j) = 1/3$ for each $j \in \{M, I, O\}$. After you look at the artifacts (the data) more closely, they look most similar to Missouria artifacts, but you are unsure. Quantitatively, you believe there’s a 50% chance they were Missouria, 40% chance Osage, and 10% chance Illini. This is your posterior distribution, i.e., your beliefs about $\theta$ after seeing the data. Mathematically, your posterior is $P(\theta = M) = 0.5$, $P(\theta = O) = 0.4$, $P(\theta = I) = 0.1$.
The posterior distribution is the Bayesian way of quantifying uncertainty. It is arguably more intuitive: it is more similar to how people talk about uncertainty in daily life. The posterior distribution is often summarized by a credible interval, i.e., a range of values that you’re pretty sure contains the true $\theta$, like $P(a \le \theta \le b) = 90\%$. Or in the above example with categorical $\theta$, the credible set $\{M, O\}$ has 90% posterior belief: you’d say, “I’m pretty sure it’s Missouria or Osage, although I think there’s a 10% chance I’m wrong.”
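To make the credible set concrete, this R sketch starts from the posterior probabilities in the example and collects the most believed values until their total posterior probability reaches 90%. The "most believed first" rule and the variable names are assumptions for illustration; the text does not specify how a credible set should be chosen.

```r
# Posterior from the example: beliefs about theta after seeing the artifacts
posterior <- c(M = 0.5, O = 0.4, I = 0.1)

# Assumed rule: add values from most to least believed until reaching the target
target   <- 0.90
sorted   <- sort(posterior, decreasing = TRUE)
cum_prob <- cumsum(sorted)
n_keep   <- which(cum_prob >= target - 1e-12)[1]  # tolerance for floating point
credible_set <- names(sorted)[seq_len(n_keep)]

credible_set                  # values in the credible set
sum(posterior[credible_set])  # posterior probability of the set
```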
3.1.2 Very Brief Overview: Frequentist Approach
Other sections in Chapter 3 flesh out details, but the core of the frequentist approach is the “before” perspective, which can also be described in terms of repeated sampling. Instead of the belief probabilities of a Bayesian posterior, frequentist probabilities are from the “before” view of what dataset (and thus value of estimator and such) could be randomly sampled. Equivalently, as a thought experiment, we can imagine many different random samples drawn from the same population; the “before” probabilities are then how often certain values occur in these many random datasets.
For intuition, imagine you could randomly sample 100 datasets from the same population. Then, the frequentist probability of an event says approximately how many times that event occurs among the 100 samples. (To replace “approximately” with “exactly,” replace 100 with $\infty$.) For example, we could compute the sample mean $\bar{Y}$ in all 100 samples; since the datasets are all different, the sample means $\bar{Y}$ are also all different. If $\bar{Y} \le 0$ in 50 of the 100 hypothetical samples, then $P(\bar{Y} \le 0) \approx 50/100 = 50\%$. Or, if $\bar{Y}$ is in the interval $[-0.4, 0.4]$ in 70 of 100 samples, then $P(-0.4 \le \bar{Y} \le 0.4) = P(\bar{Y} \in [-0.4, 0.4]) \approx 70\%$. A similar example is in Table 3.1.
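The repeated-sampling thought experiment can be mimicked on a computer. The R sketch below draws 100 datasets and computes the sample mean in each; the $N(0, 1)$ population, the sample size of 7, and the seed are all assumptions made here for illustration, so the resulting fractions need not match the 50 and 70 used in the text.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
n_datasets <- 100  # number of hypothetical datasets, as in the text
n <- 7             # observations per dataset (assumed)

# Each replication: sample one dataset from N(0,1), then compute its sample mean
ybar <- replicate(n_datasets, mean(rnorm(n)))

frac_neg <- mean(ybar <= 0)                   # fraction of datasets with mean <= 0
frac_in  <- mean(ybar >= -0.4 & ybar <= 0.4)  # fraction with mean in [-0.4, 0.4]

c(frac_neg = frac_neg, frac_in = frac_in)
```

Increasing `n_datasets` should make these fractions settle down near the corresponding frequentist probabilities.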
3.1.3 Bayesian and Frequentist Differences
The following makes explicit some of the differences between the Bayesian and frequentist approaches described above.
First, the frameworks treat different objects as random or non-random. The frequentist framework treats the population mean and other population features as non-random values, whereas it treats the data as random. For example, the population mean $\mu = E(Y)$ is a non-random value, whereas an observation $Y$ is a random variable. In contrast, the Bayesian framework treats population features as random (to reflect your beliefs), whereas it treats the data as non-random values (the “after” view).
Second, due to this different treatment, the frameworks answer different types of questions, especially when quantifying uncertainty. For example, the Bayesian framework is designed to answer questions like, “Given the observed data, what do I believe is the probability that the population mean is above 12?” Mathematically, if $y$ is the “observed data,” this is usually written $P(\mu > 12 \mid y)$, noting the confusing notation where $\mu$ represents beliefs. This question makes no sense from the frequentist perspective: either $\mu > 12$ or not; it cannot be “maybe” with some probability. In contrast, the frequentist framework answers questions like, “Given the value of $\mu$, what’s the probability that the sample mean is above 12?” Mathematically, this is usually written $P(\bar{Y} > 12)$, or $P_\mu(\bar{Y} > 12)$ to be explicit about the dependence on $\mu$. The sample mean $\bar{Y}$ is a function of data, so it is treated as a random variable. This question makes no sense from the Bayesian perspective: we can see the data, so we can see either $\bar{Y} > 12$ or not; it cannot be “maybe” with some probability.
Interestingly, both frameworks can answer questions like $P(\bar{Y} < \mu)$, but with different interpretations. The Bayesian answer interprets $\bar{Y}$ as a number (that we see in the data) and $\mu$ as a random variable representing our beliefs. The frequentist answer interprets $\bar{Y}$ as the random variable (from the “before” view) and $\mu$ as the non-random population value.
Third, frequentist methods use only the data, whereas Bayesian methods can formally incorporate additional knowledge. In practice, though, even frequentist results should be interpreted in light of other knowledge. The difference is that this process is not formally within the frequentist methodology itself. Unfortunately, many people do not combine frequentist results with other knowledge, instead interpreting frequentist results as if one single dataset contains the full, absolute truth of the universe; please do not do this.
In Sum: Bayesian & Frequentist
Frequentist: “before” view of data (random variables); assess methods’ performance across repeated random samples from the same population
Bayesian: “after” view of data (non-random); model beliefs (about population features) as random variables
3.2 Types of Sampling
⇒ Kaplan video: Types of Sampling
In practice, judging which econometric method is most appropriate requires understanding different types of sampling procedures and sampling properties. Such judgment is mostly left to another textbook, but this section aims to help your understanding.
Notationally, we observe the values from $n$ units, which could be individuals, firms, countries, etc. Let $i = 1$ refer to the first unit, $i = 2$ to the second, etc., up to $i = n$, where $n$ is the sample size. The corresponding values are $Y_1, Y_2, \ldots, Y_n$, with $Y_i$ more generally denoting the observation for unit $i$. A particular dataset may have specific values like $Y_1 = 5$, $Y_2 = 8$, etc., but to analyze statistical properties, each $Y_i$ is seen as a random variable, as in Section 2.1.
In this section, two important sampling properties are considered: “independent” and “identically distributed.” If both hold, then the $Y_i$ are called independent and identically distributed (iid) random variables (or “sampled iid”), and “sampling is iid.” Sometimes the vague phrase random sample refers to iid sampling. This iid sampling is mathematically simplest but not always realistic. Although iid sampling is the focus here (like other introductory textbooks), weights are briefly mentioned, and Part III considers dependent (i.e., not independent) data.
Notationally, iid sampling is indicated by $\overset{iid}{\sim}$. If $F_Y(\cdot)$ is the population CDF,
\[ Y_i \overset{iid}{\sim} F_Y, \quad i = 1, \ldots, n. \tag{3.1} \]
If there is only a population PMF $f_Y(\cdot)$ and not a CDF (like with nominal categorical variables), then $F_Y$ is replaced by $f_Y$ in (3.1). If the $Y_i$ follow a known distribution like $N(0, 1)$, then $F_Y$ is replaced by $N(0, 1)$, for example.
There are other sampling properties not considered in this section, like sampling bias. This is about whether we observe a “representative sample” of the population we want to learn about (the population of interest). Sometimes sampling bias is our fault (for using the wrong dataset for our economic question), but sometimes we try to get the right data and people refuse to answer our survey, or we can’t get access to certain confidential data, etc. This is discussed more in Chapter 12 in terms of “missing data” and “sample selection.”
After introducing “independent” and “identically distributed” sampling, examples are discussed in Section 3.2.3.
3.2.1 Independent
Qualitatively, in the context of sampling, independence (or independent sampling) means that, from the “before” view, any two observations are unrelated. For example, the value of $Y_2$ is unrelated to $Y_1$: we are not any more likely to see a high $Y_2$ if we see a high $Y_1$ in the sample.
Mathematically, independence means
\[ Y_i \perp\!\!\!\perp Y_k \text{ for any } i \ne k, \tag{3.2} \]
where $\perp\!\!\!\perp$ denotes statistical independence. That is, $Y_1 \perp\!\!\!\perp Y_2$, $Y_1 \perp\!\!\!\perp Y_8$, $Y_6 \perp\!\!\!\perp Y_4$, etc. For any $i \ne k$, independent sampling implies
\[ \mathrm{Cov}(Y_i, Y_k) = 0, \quad \mathrm{Var}(Y_i + Y_k) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_k), \quad E(Y_i \mid Y_k) = E(Y_i), \tag{3.3} \]
among other things.
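The first two implications in (3.3) can be checked by simulation. In this R sketch, many independent $(Y_1, Y_2)$ pairs are drawn from a normal distribution; the distribution, its parameters, and the number of replications are assumptions chosen only for illustration.

```r
set.seed(2)    # arbitrary seed, only for reproducibility
n_rep <- 1e5   # number of simulated (Y1, Y2) pairs (assumed)

# Y1 and Y2 are drawn separately, so they are independent by construction
Y1 <- rnorm(n_rep, mean = 1, sd = 2)
Y2 <- rnorm(n_rep, mean = 1, sd = 2)

cov_12     <- cov(Y1, Y2)         # should be close to 0
var_sum    <- var(Y1 + Y2)        # should be close to the sum of variances
sum_of_var <- var(Y1) + var(Y2)

c(cov_12 = cov_12, var_sum = var_sum, sum_of_var = sum_of_var)
```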
3.2.2 Identically Distributed
The identically distributed property means that, from the “before” view, the distribution of $Y_i$ is the same for any $i$. Qualitatively, all units are sampled from the same population. Mathematically, if $F_Y(\cdot)$ is the population CDF, $Y_i \sim F_Y$ for all $i = 1, \ldots, n$. (Or $Y_i \sim f_Y$ for PMF $f_Y(\cdot)$.) Note $\sim$ and not $\overset{iid}{\sim}$: it only claims the distribution of $Y_i$ in isolation, not independence from the other observations.
Mathematically, identically distributed $Y_i$ means that for any $i$ and $k$, $Y_i$ and $Y_k$ have the same distribution. Thus, any feature of their distributions is also identical. For example, $E(Y_i) = E(Y_k)$ and $\mathrm{Var}(Y_i) = \mathrm{Var}(Y_k)$.
Practice 3.1 (iid sampling). You are planning to sample two values, $Y_1$ and $Y_2$, but you have not sampled them yet. The following four statements correspond to the four sampling properties (or their implications): 1) independent, 2) not independent (i.e., dependent), 3) identically distributed, 4) not identically distributed. Which is which?
a) You are just as likely to get $Y_1 = 3$ as $Y_2 = 3$.
b) If you get a negative $Y_1$, then you’ll probably get a negative $Y_2$, but if you get a positive $Y_1$, then you’ll probably get a positive $Y_2$.
c) Separately and simultaneously, you will randomly sample $Y_1$ and $Y_2$ using the same exact procedure from the same population.
d) For $Y_1$, you are going to get the salary of somebody with an economics degree, and $Y_2$ will be the salary of somebody with an art history degree.
3.2.3 Examples
Consider the following sampling procedures and their properties. Each example has 4 observations of Mizzou students. You can imagine 4 buckets (or pieces of paper), initially empty, that will eventually contain information from 4 observations. The sampling procedure does not determine the specific numeric values that end up in the buckets, but it determines how the buckets get filled.
Random Student ID (iid)
Imagine randomly picking a Mizzou student ID number, then randomly picking a 2nd, then 3rd, then 4th. These $Y_i$ are both independent and identically distributed (iid). They are independent because each ID number is randomly drawn without any consideration of how the other numbers are drawn, and without any consideration of the other observed $Y_i$ values. They are identically distributed because each ID number is drawn from the same population (anyone who has a Mizzou student ID).
Random In-State and Non-Resident Students (Stratified)
In sampling, stratification means dividing the population into subgroups called strata, such that each population member belongs to exactly one stratum. Further, stratified sampling means the overall data sample has a pre-determined number of observations from each stratum. (There are other variations whose details are beyond our scope.)
For example, each Mizzou student is classified as either a resident of Missouri (“in-state”) or not (“non-resident”). Imagine buckets 1 and 2 say “in-state” while buckets 3 and 4 say “non-resident”: observations $Y_1$ and $Y_2$ are from in-state students, while $Y_3$ and $Y_4$ are from non-resident students. This is stratified sampling: assigning buckets to different strata before sampling.
Often, stratified sampling is independent but not identically distributed (inid). In our Mizzou example, the students are sampled independently, but the in-state distribution may not be identical to the non-resident distribution.
There are many details and variations, but for now the goal is simply to be aware that stratified sampling is usually not iid, so methods assuming iid sampling may not be valid.
Students in Same Class (Clustered)
Imagine randomly picking a class (like Intro Econometrics) at Mizzou, then filling the first two buckets ($Y_1$ and $Y_2$) with two random students from that class, and then randomly picking another class and another two students for the other buckets ($Y_3$ and $Y_4$). This is an example of clustered sampling, where each class is a “cluster.” (This is different from “clustering” in the sense of cluster analysis.)
Observations are identically distributed (because each $Y_i$ has the same probability of getting any particular student), but often not independent. For example, dependence may come from students in the same class being similarly affected by their shared experience. Here, buckets 1 and 2 are correlated, and 3 and 4 are correlated, but not 1 and 3, nor 2 and 4, etc.
Without getting into details, the goal for now is to recognize when there is clustered sampling that may not be iid, in which case iid-based methods may not be valid. Usually, estimates are still valid, but measures of uncertainty are not (there is really more uncertainty than reported).
Other common examples of clustered sampling include taking all individuals within randomly selected households, students within randomly selected schools or classrooms, and multiple observations over time for randomly selected “individuals.”
Two Students Two Semesters (Clustered)
Imagine randomly picking 2 students (iid, like with random ID numbers), then observing them this semester and next semester. This is another type of clustered sampling that usually violates independence. For example, imagine the variable is semester GPA. Bucket 1 contains the first student’s GPA this semester, bucket 2 contains the same student’s GPA next semester, and buckets 3 and 4 contain the other student’s GPAs from this semester and next semester. Buckets 1 and 2 ($Y_1$ and $Y_2$) are probably both high or both low, rather than one high and one low, and similarly for buckets 3 and 4 ($Y_3$ and $Y_4$). That is, buckets 1 and 2 are correlated, and 3 and 4 are correlated. Further, observations may not even be identically distributed if fall GPA and spring GPA do not have the same distribution.
From a different perspective, sampling is actually iid. The students themselves are sampled iid, so “observations” are iid if we see $(Y_1, Y_2)$ as a single observation, and see $(Y_3, Y_4)$ as a second observation. That is, $(Y_1, Y_2)$ is randomly sampled from the same population as $(Y_3, Y_4)$.
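A small simulation can illustrate why two semesters from the same student are correlated while different students are not. In this R sketch, each student's GPA is a persistent student-level component plus semester-specific noise; this data-generating process and all its numbers are assumptions for illustration only (for simplicity, simulated GPAs are not capped at 4).

```r
set.seed(3)         # arbitrary seed, only for reproducibility
n_students <- 5000  # number of simulated students (assumed)

# Assumed model: GPA = persistent student component + semester-specific noise
ability  <- rnorm(n_students, mean = 3, sd = 0.4)
gpa_this <- ability + rnorm(n_students, sd = 0.3)  # this semester
gpa_next <- ability + rnorm(n_students, sd = 0.3)  # next semester

# Same student across semesters: correlated through the shared component
cor_same <- cor(gpa_this, gpa_next)
# Shuffling pairs each student with a random other student: roughly uncorrelated
cor_diff <- cor(gpa_this, sample(gpa_next))

c(cor_same = cor_same, cor_diff = cor_diff)
```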
One Student Four Semesters (Time Series)
Similar to above, if you randomly pick one student but then observe the same student over four consecutive semesters, there is probably dependence, and possibly not identical distributions (e.g., if GPA tends to increase over time). This is time series data; see Part III.
Practice 3.2 (rural household sampling). You want to learn about household consumption in rural Indonesia. In an area with 100 villages, you either i) pick 5 villages at random, then survey every household in each of the 5 villages; or ii) make a list of all households in all 100 villages, then randomly pick 5% of them. Explain why each approach is or isn’t iid.
3.3 The Empirical Distribution
⇒ Kaplan video: The Empirical Distribution
The empirical distribution is a probability distribution that reflects the sample data. It can be confusing at first, but it unifies many approaches in this class and beyond, helping them seem less ad hoc and mysterious. Qualitatively, the empirical distribution treats the sample as if it were the population.
Mathematically, first consider a binary variable. The population is represented by binary random variable $Y$, with some $P(Y = 1) = p$. The sample of size $n$ can be represented by binary random variable $S$ with
\[ P(S = 1) = \hat{p} = \frac{\text{how many } Y_i = 1}{n} = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i = 1\}, \tag{3.4} \]
the sample proportion of observations with $Y_i = 1$. The distribution of $S$ is the empirical distribution.
The plug-in principle or analogy principle suggests we compute whatever features of $S$ we want to learn about $Y$. For example, if we want to learn $E(Y)$, then compute $E(S)$. With enough data, $S$ is usually very similar to $Y$, so features of $S$ should usually be very similar to those of $Y$.
Mathematically, consider now a categorical or discrete variable. The population is represented by random variable $Y$ with PMF $f_Y(\cdot)$. Imagine there are $J$ categories with values $(v_1, \ldots, v_J)$, so $Y$ is fully described by $f_Y(v_j)$ for each $j = 1, \ldots, J$. The sample is represented by random variable $S$ with
\[ f_S(v_j) = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i = v_j\}, \quad j = 1, \ldots, J. \tag{3.5} \]
That is, $f_S(v_j)$ is the sample proportion of observations with $Y_i = v_j$.
Mathematically, consider finally a continuous variable. The population is represented by random variable $Y$ with continuous CDF $F_Y(\cdot)$. However, even with an infinite number of possible values for $Y$ in the population, there are only $n$ values of $Y_i$ observed. With a continuous random variable, each observed $Y_i$ value is unique, so there are exactly $n$ different observed $Y_i$ values. The sample is thus represented by random variable $S$ with PMF
\[ f_S(Y_i) = 1/n, \quad i = 1, \ldots, n. \tag{3.6} \]
Even though $Y$ is continuous, $S$ is discrete.
Notationally, instead of $f_S(\cdot)$, more common is $\hat{f}_Y(\cdot)$ if $Y$ has a PMF. If $Y$ has a CDF, then more often the empirical CDF (ECDF) is used. It is simply the CDF of $S$. The ECDF (or just EDF) is usually written $\hat{F}_Y(\cdot)$ or just $\hat{F}(\cdot)$, and defined as
\[ \hat{F}_Y(y) = F_S(y) = \frac{1}{n} \sum_{i=1}^{n} 1\{Y_i \le y\}, \tag{3.7} \]
i.e., the sample proportion of observations less than or equal to $y$, the point of evaluation.
Notationally, a hat (circumflex) often denotes a sample analog, i.e., a feature of $S$ analogous to a population feature of $Y$. Above in (3.7), for the population $F_Y(y)$, the sample analog is $\hat{F}_Y(y) = F_S(y)$. As another example, for the population $P(Y = y)$, the sample analog is $\hat{P}(Y = y) = P(S = y)$. For the population $E(Y)$, the sample analog is $\hat{E}(Y) = E(S)$. The “hat” may also indicate another value computed from the sample data (i.e., a statistic), usually an estimator, even if it is not a sample analog.
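In R, the empirical distribution is easy to compute directly. The sketch below uses a tiny made-up sample (the five values are an assumption) to compute the sample PMF, the ECDF with base R's `ecdf()`, and the plug-in estimate of the mean as the mean of the empirical distribution.

```r
Y <- c(5, 8, 2, 8, 3)  # hypothetical sample of n = 5 observations

# Sample PMF: proportion of observations at each observed value
pmf <- prop.table(table(Y))

# ECDF: proportion of observations less than or equal to y
Fhat <- ecdf(Y)
F_at_5      <- Fhat(5)
F_at_5_byhand <- mean(Y <= 5)  # same quantity, computed directly

# Plug-in principle: mean of the empirical distribution equals the sample mean
E_S <- sum(as.numeric(names(pmf)) * as.vector(pmf))

pmf
c(F_at_5 = F_at_5, F_at_5_byhand = F_at_5_byhand, E_S = E_S, ybar = mean(Y))
```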
3.4 Estimation of the Population Mean
Sections 2.3 and 2.5 helped us think about which features of the population are useful for description and prediction. Such a population feature is called the estimand or object of interest. In practice, it must be estimated using data.
This section specifically considers estimating the population mean. This is most directly useful in later chapters. For other population features, the same concepts apply, though details differ.
Recall from Section 2.3.1 that the mean is directly useful for description and prediction with discrete and continuous variables, and it can have a useful probability interpretation for binary and even categorical variables. For binary $Y$, $E(Y) = P(Y = 1)$. Further, categorical variables can be turned into binary variables like $Z = 1\{Y = v_2\}$. For example, if $Y$ is a 2-digit NAICS industry code whose possible values are “utilities,” “construction,” or “information,” then defining $Z = 1\{Y = \text{information}\}$ ($Z = 1$ for “information,” $Z = 0$ otherwise) implies $E(Z) = P(Y = \text{information})$.
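The same indicator trick works in a sample: the mean of the 0/1 variable is the sample proportion. In this R sketch, the five industry labels are made up for illustration.

```r
# Hypothetical sample of industry categories (values are assumptions)
Y <- c("utilities", "construction", "information", "information", "utilities")

# Indicator variable: Z = 1 if Y is "information", Z = 0 otherwise
Z <- as.numeric(Y == "information")

# The sample mean of Z is the sample proportion with Y = "information"
prop_info <- mean(Z)
prop_info
```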
The focus of this section is point estimation, as opposed to interval estimation. A point estimate is a single number representing our best guess of the unknown population value. In contrast, an interval estimate is a range of numbers, most commonly a confidence interval; see Section 3.8.
3.4.1 “Description”: Sample Mean
As alluded to in Section 3.3, at least with identically distributed sampling, the population mean can be estimated by its sample analog, the mean of the empirical distribution. This is called the sample mean. It is also called the sample average because it averages the sample $Y_i$ values. The sample average is usually denoted $\bar{Y}$ (or $\bar{Y}_n$). Mathematically, for continuous or discrete (including binary) $Y$, using the notation of Section 3.3,
\[ \bar{Y} = \hat{E}(Y) = E(S) = \frac{1}{n} \sum_{i=1}^{n} Y_i. \tag{3.8} \]
These expressions are equivalent, just emphasizing different interpretations.
3.4.2 “Prediction”: Least Squares
Section 2.5.4 showed that the population mean $E(Y)$ also solves an optimal population prediction problem. Specifically, (2.55) shows
\[ E(Y) = g^*_2 \equiv \arg\min_g E[(Y - g)^2]. \]
From Section 3.3, the analogy principle suggests estimating $E(Y)$ by solving the same optimal prediction problem for the empirical distribution, i.e., replacing $Y$ with $S$. Mathematically, let
\[ \hat{g}^*_2 \equiv \arg\min_g E[(S - g)^2] = \arg\min_g \frac{1}{n} \sum_{i=1}^{n} (Y_i - g)^2. \tag{3.9} \]
The hope is that the sample analog $\hat{g}^*_2$ is close to the population $g^*_2$, which in turn equals $E(Y)$.
Skipping the derivation, it turns out
\[ \hat{g}^*_2 = \bar{Y}. \tag{3.10} \]
The prediction-motivated estimator equals the description-motivated estimator. This makes sense because in the population represented by $Y$, the mean equals the optimal prediction (under quadratic loss), and $S$ is simply another random variable like $Y$.
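For readers curious about the skipped step, here is a brief sketch of why the sample minimizer equals the sample mean. It assumes only basic calculus and uses the same objective function as (3.9): differentiate with respect to $g$, set the derivative to zero, and solve.

```latex
\[
\frac{d}{dg} \left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - g)^2 \right]
  = -\frac{2}{n} \sum_{i=1}^{n} (Y_i - g)
  = -2 \left( \frac{1}{n} \sum_{i=1}^{n} Y_i - g \right)
  = 0
\quad \Longrightarrow \quad
g = \frac{1}{n} \sum_{i=1}^{n} Y_i .
\]
```

The second derivative with respect to $g$ is $2 > 0$, so this solution is a minimum rather than a maximum.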
Rewriting (3.9) allows the introduction of some terms and concepts used in later chapters. In (3.9), the $1/n$ has no effect on the minimization problem because it is unaffected by $g$. Consequently, it is equivalent to write
\[ \hat{g}^*_2 = \arg\min_g \sum_{i=1}^{n} (Y_i - g)^2. \tag{3.11} \]
To dissect the right-hand side of (3.11), imagine any estimate $\hat{g}$. Since $\hat{g}$ can be seen as trying to predict $Y$, sometimes $\hat{g}$ is called the predicted value of $Y_i$, which in this simple setting is the same for all $i$. However, the observed value of $Y_i$ is used to compute $\hat{g}$, so it seems misleading to say $Y_i$ was “predicted”; usually we assume the true value is not known when we discuss prediction. Instead, calling $\hat{g}$ the fitted value is more appropriate. Either way, the difference $\hat{U}_i = Y_i - \hat{g}$ is called the residual for observation $i$, i.e., the difference between the observed value $Y_i$ and the fitted value $\hat{g}$. The squared residuals are then $\hat{U}_i^2 = (Y_i - \hat{g})^2$. The sum of squared residuals (SSR) is then
\[ \sum_{i=1}^{n} \hat{U}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{g})^2. \tag{3.12} \]
Consequently, (3.10)–(3.12) together say that $\bar{Y}$ minimizes the SSR. For this reason, $\bar{Y}$ is a least squares estimator, “least” referring to minimization and “squares” referring to the second S in SSR.
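The claim that the sample mean minimizes the SSR can also be checked numerically. This R sketch uses simulated data (the normal distribution, sample size, and seed are assumptions) and compares three numbers: the minimizer found by `optimize()`, the sample mean, and the coefficient from an intercept-only `lm()` fit, which is itself a least squares fit.

```r
set.seed(4)                # arbitrary seed, only for reproducibility
Y <- rnorm(50, mean = 2)   # hypothetical sample (assumed distribution)

# Sum of squared residuals for a candidate fitted value g
ssr <- function(g) sum((Y - g)^2)

g_hat  <- optimize(ssr, interval = range(Y))$minimum  # numerical minimizer
ybar   <- mean(Y)                                     # sample mean
lm_fit <- unname(coef(lm(Y ~ 1)))                     # intercept-only regression

c(g_hat = g_hat, ybar = ybar, lm_fit = lm_fit)
```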
3.4.3 Non-iid Sampling: Weights
If your dataset has weights, then you should use them. Using weights adjusts the sample to be more representative of the population. Conversely, ignoring weights often produces misleading results because the sample is not representative.
There are multiple types of weights, although the distinction is beyond our scope. Generally, any type of weight is treated the same for estimation, but different for “inference” (confidence intervals, etc.). Types like survey weights (also called sampling weights) indicate non-iid sampling, whereas other types like frequency weights may simply allow more compact storage of iid sampled data.
Example
Skipping the theory, an example is shown in the following code. There are two subpopulations (two types of individuals) in the population, one with mean 0, one with mean 1. Each subpopulation forms half the overall population, so the overall population mean is $(1/2)(0) + (1/2)(1) = 1/2$. However, the second subpopulation forms much more than half of the sample because it is oversampled: each observation has a 2/3 probability (instead of 1/2) of coming from the second subpopulation. Thus, it’s like sampling from a different population whose mean is $(1/3)(0) + (2/3)(1) = 2/3$. Without weighting, the unweighted sample mean estimates this 2/3 value, not 1/2.
Sampling weights can be used to adjust the sample to represent the population. Specifically, the weights are the inverse of the sampling probabilities: $1/(2/3) = 3/2 = 1.5$ for observations from the second subpopulation, and $1/(1/3) = 3$ for individuals from the first subpopulation. This counteracts the fact that there are more individuals from the second subpopulation by weighting them less. Alternatively, instead of the inverse sampling probabilities, the inverse sample proportions of each type could be used. Another option is to use function svymean() in the R package survey (Lumley, 2004, 2019).
set.seed(112358)
n <- 567
itype <- sample(x=0:1, size=n, replace=TRUE, prob=1:2/3)
Y <- itype + rnorm(n)
mean(Y) # without weights: near 2/3=0.67, not representative
## [1] 0.708
# with weights: should be closer to true 0.50
weighted.mean(x=Y, w=1/((itype+1)/3))
## [1] 0.551
weighted.mean(x=Y, w=ifelse(itype, n/sum(itype), n/(n-sum(itype))))
## [1] 0.54
3.5 Sampling Distribution of an Estimator
⇒ Kaplan video: Sampling Distribution of an Estimator
There are two goals of this section. The primary goal is to understand what it means for an estimator to have a probability distribution. The secondary goal is to observe some patterns that are formalized in Section 3.6.
At a high level, the sampling distribution is the probability distribution of an estimator, treated as a random variable in the “before” view. Equivalently, from the repeated sampling perspective, the sampling distribution imagines computing the estimator in a large number of randomly sampled datasets from the same population, and seeing which values occur with what probability.
To develop intuition, the remainder of this section contains more specific examples using the sample mean. The examples include mathematical calculations, graphs, and tables. The details are not themselves important, but rather how they manifest the deeper concepts; i.e., there is no value in memorizing the examples, but rather in using them to assess and develop your understanding of the fundamental ideas.
3.5.1 Some Mathematical Calculations
To be concrete, consider the sample mean as an estimator of the population mean, with iid sampling. Here, the $n$ subscript is added to $\bar{Y}_n$ because the sampling distribution depends on $n$. For example, the sampling distribution of $\bar{Y}_1 = Y_1$ differs from that of $\bar{Y}_2 = (Y_1 + Y_2)/2$.
From the “before” view, the sample mean $\bar{Y}_n$ is a random variable. The $Y_i$ are all random variables, so their average is also a random variable. That is, the $Y_i$ have multiple possible values, so the sample mean also has multiple possible values.
To develop intuition, consider some simple examples. The simplest example is $n = 1$, so $\bar{Y}_1 = Y_1$. The distribution of $\bar{Y}_1$ is the same as the population distribution that $Y_1$ follows.
Simpler still, imagine the population is binary with mean p. Then $Y_1 \sim \mathrm{Bernoulli}(p)$, meaning $P(Y_1 = 1) = p$ and $P(Y_1 = 0) = 1 - p$. Since $\bar{Y}_1 = Y_1$, $P(\bar{Y}_1 = 1) = p$ and $P(\bar{Y}_1 = 0) = 1 - p$. If you imagine 100 randomly sampled datasets (i.e., 100 randomly sampled values of $Y_1$), then approximately 100p datasets would have $\bar{Y}_1 = 1$, while approximately 100(1 - p) would have $\bar{Y}_1 = 0$. For example, if p = 0.7, then around 70 datasets would have $\bar{Y}_1 = 1$, and around 30 datasets would have $\bar{Y}_1 = 0$. With infinite datasets, the words "approximately" and "around" are no longer needed, but (at least for me) it is easier to develop intuition by imagining 100 datasets than ∞.
Consider n = 2 with binary Y. Despite the simplicity, it takes some work to derive the sampling distribution of $\bar{Y}_2$. Let
$$Y_i \overset{iid}{\sim} \mathrm{Bernoulli}(p) \implies Y_1 \perp\!\!\!\perp Y_2, \quad P(Y_1 = 1) = P(Y_2 = 1) = p. \tag{3.13}$$
There are four possible values of $(Y_1, Y_2)$: (0, 0), (0, 1), (1, 0), (1, 1). This makes three possible values of $\bar{Y}_n = (Y_1 + Y_2)/2$: 0, 1/2, or 1. The corresponding probabilities can be calculated using (3.13), since $Y_1 \perp\!\!\!\perp Y_2$ implies $P((Y_1, Y_2) = (a, b)) = P(Y_1 = a) P(Y_2 = b)$. (That is, due to independence, the probability that both $Y_1 = a$ and $Y_2 = b$ equals the product of the individual probabilities.) Thus,
$$\begin{aligned}
P(\bar{Y}_n = 0) &= \overbrace{P(Y_1 = 0 \text{ and } Y_2 = 0)}^{\text{use } Y_1 \perp\!\!\!\perp Y_2} = \overbrace{P(Y_1 = 0)}^{=1-p} \, \overbrace{P(Y_2 = 0)}^{=1-p} = (1 - p)^2, \\
P(\bar{Y}_n = 1) &= \overbrace{P(Y_1 = 1 \text{ and } Y_2 = 1)}^{\text{use } Y_1 \perp\!\!\!\perp Y_2} = \overbrace{P(Y_1 = 1)}^{=p} \, \overbrace{P(Y_2 = 1)}^{=p} = p^2, \\
P(\bar{Y}_n = 1/2) &= 1 - P(\bar{Y}_n = 0) - P(\bar{Y}_n = 1) = 1 - (1 - p)^2 - p^2 = 2p(1 - p).
\end{aligned} \tag{3.14}$$
To make (3.14) more concrete, imagine again p = 0.7 and 100 randomly sampled datasets, i.e., 100 randomly sampled pairs of $(Y_1, Y_2)$. With p = 0.7, $(1 - p)^2 = 0.09$, $p^2 = 0.49$, and $2p(1 - p) = 0.42$. If we compute $\bar{Y}_2 = (Y_1 + Y_2)/2$ for each of the 100 datasets (pairs), then there are approximately 9 with $\bar{Y}_2 = 0$, 49 with $\bar{Y}_2 = 1$, and 42 with $\bar{Y}_2 = 1/2$. That is, the sampling distribution of $\bar{Y}_n$ shows us how frequently each possible value occurs in the long run (when repeatedly sampling many datasets from the same population).
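The repeated-sampling interpretation of (3.14) can be checked by simulation. The following is only a sketch under my own assumptions: the seed and the number of simulated datasets are arbitrary, and the variable names are illustrative. It draws many independent pairs from a Bernoulli population with p = 0.7 and compares the observed frequency of each sample mean value with the exact probabilities.

```r
# Sketch: simulated vs exact sampling distribution of the sample mean, n = 2
set.seed(1)                      # arbitrary seed (assumption), for replicability
p <- 0.7                         # population mean, as in the text
n.datasets <- 100000             # number of simulated datasets (assumption)
Y1 <- rbinom(n.datasets, size=1, prob=p)
Y2 <- rbinom(n.datasets, size=1, prob=p)   # drawn independently of Y1
Ybar2 <- (Y1 + Y2) / 2           # one sample mean per simulated dataset
values <- c(0, 0.5, 1)           # the three possible values of the sample mean
sim.freq <- sapply(values, function(v) mean(Ybar2 == v))
exact <- c((1 - p)^2, 2 * p * (1 - p), p^2)   # probabilities from (3.14)
out <- rbind(simulated=sim.freq, exact=exact)
colnames(out) <- values
print(round(out, 3))
```

With this many simulated datasets, the two rows should agree closely, though not exactly, since the simulated row is itself random.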
With larger n and/or non-binary Y and/or estimators more complicated than the sample mean, the calculations become much more complex. Further, they depend on knowing the true population distribution of the $Y_i$, which in practice is unknown. Consequently, the sampling distribution is often approximated, as in Section 3.6.
Discussion Question 3.1 (probability of positive mean). After seeing the data, you want to know the probability that the true mean is strictly positive, E(Y) > 0. Does the frequentist sampling distribution help? If yes, explain how; if no, explain why not. Hint: recall Section 3.1.
3.5.2 Graphs: Binary Population
Consider a binary population with p = 0.7 and datasets with n observations sampled iid.
Figure 3.1 graphs the sampling distribution (PMF) of $\bar{Y}_n$ for various n. That is, for a given n, $\bar{Y}_n$ is a random variable whose PMF is shown. The horizontal axis shows possible values of $\bar{Y}_n$. The vertical axis can be interpreted in two ways. First, it shows the probability of each possible value as a percentage; e.g., if the bar at horizontal value 0.5 has height 42, then $P(\bar{Y}_n = 0.5) = 42\%$. Second, you could imagine randomly sampling 100 datasets, and the vertical axis shows the number of datasets in which a particular value of $\bar{Y}_n$ occurs.
[Figure 3.1 image: six panels (n = 1, 2, 4, 8, 16, 32); horizontal axis: value of $\bar{Y}_n$ (0 to 1); vertical axis: PMF (%), 0 to 60.]
Figure 3.1: Sampling distribution (PMF) of $\bar{Y}_n$ with binary population.
Figure 3.1 first shows the sampling distribution of $\bar{Y}_1 = Y_1$ (the n = 1 graph), and then the sampling distribution of $\bar{Y}_2$ from (3.14). Figure 3.1 then shows the sampling distribution of $\bar{Y}_n$ for larger n values that would be very tedious to compute by hand.
Figure 3.1 helps show how $\bar{Y}_n$ can be seen as a random variable with a probability distribution, i.e., its sampling distribution. For different n, different values are possible. For different n, each value has a different probability of occurring. That is, if we randomly sample many datasets from the same population, with the same sample size n, some datasets have one value of $\bar{Y}_n$ while other datasets have another, or another, and each possible value of $\bar{Y}_n$ occurs with the probability shown in the graphs.
Figure 3.1 also shows two interesting patterns when comparing small n and larger n. First, with larger n, the sampling distribution of $\bar{Y}_n$ is more closely concentrated around the mean E(Y) = 0.7. In the extreme, when n = 1, there can only be $\bar{Y}_1 = 0$ or $\bar{Y}_1 = 1$, neither of which is very close to the mean 0.7. With larger n, even if $\bar{Y}_n = 0.7$ exactly is unlikely (or impossible), the probability of being close to 0.7 is higher. For example, the probability of $0.6 \le \bar{Y}_n \le 0.8$ is zero when n = 1, but it is relatively high for the largest n. Second, the shape of the sampling distribution differs for large and small n. With n = 1, $\bar{Y}_n$ has a Bernoulli (binary) sampling distribution. With the largest n shown, although it is still discrete, the shape looks like the "bell curve" shape of a normal distribution's PDF. Both these observations are formalized in Section 3.6.
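The concentration pattern can also be computed exactly rather than read off a graph. The sketch below relies on a standard fact not spelled out in the text: with iid Bernoulli(p) sampling, the number of ones in the sample is Binomial(n, p), so $P(\bar{Y}_n = k/n)$ is given by R's dbinom(). The helper name prob.between is my own.

```r
# Sketch: exact P(0.6 <= Ybar_n <= 0.8) for a binary population with p = 0.7
p <- 0.7
prob.between <- function(n, lo=0.6, hi=0.8) {
  k <- 0:n                              # possible numbers of ones in the sample
  ybar <- k / n                         # corresponding sample mean values
  pmf <- dbinom(k, size=n, prob=p)      # exact P(Ybar_n = k/n)
  sum(pmf[ybar >= lo & ybar <= hi])     # add up the PMF over the range
}
ns <- c(1, 2, 4, 8, 16, 32)             # same n as the six panels
probs <- sapply(ns, prob.between)
print(round(setNames(probs, paste0('n=', ns)), 3))
```

For n = 1 and n = 2 the probability is exactly zero because no possible value of the sample mean lies in the range; it does not rise at every single step (it dips slightly from n = 8 to n = 16 because of discreteness), but it is much higher at n = 32 than at small n.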
3.5.3 Graphs: Continuous Population
Now consider a continuous variable whose population distribution is uniformly distributed over all real (decimal) numbers between 0 and 1. There is again iid sampling, so $Y_i \overset{iid}{\sim} \mathrm{Unif}(0, 1)$.
Figure 3.2 again shows the sampling distribution of $\bar{Y}_n$, but now it is a PDF instead of PMF since $\bar{Y}_n$ is a continuous random variable (see Section 2.3.5). The horizontal axis again shows possible values of $\bar{Y}_n$. The vertical axis shows the probability density. The area under the PDF over any range of horizontal values shows the corresponding probability of that range, as in Figure 2.3.
[Figure 3.2 image: six panels (n = 1, 2, 4, 8, 16, 32); horizontal axis: value of $\bar{Y}_n$ (0 to 1); vertical axis: PDF, 0 to 8.]
Figure 3.2: Sampling distribution (PDF) of $\bar{Y}_n$ with continuous population.
Figure 3.2 shows that for any n, different values of $\bar{Y}_n$ are possible, given different datasets. Some datasets are more likely than others, so some ranges of $\bar{Y}_n$ values are more likely than others.
Figure 3.2 shows two patterns when comparing the graphs with small n and larger n. First, the distribution is more spread out with small n, and more concentrated around the mean E(Y) = 0.5 for larger n. For example, consider $P(0.4 \le \bar{Y}_n \le 0.6)$, the area under the PDF between horizontal values 0.4 and 0.6. This probability is positive even with n = 1, but relatively small (it is only 20%, i.e., the area between 0.4 and 0.6 is only 20% of the total area under the PDF in the n = 1 graph). With the largest n, it is relatively high. Second, the shape differs by n. With n = 1, the PDF is flat, reflecting values uniformly spread between 0 and 1. With n = 2, there is a single peak at the population mean E(Y) = 0.5, but the PDF has straight lines and a sharp corner. With larger n, the PDF looks like the "bell curve" shape of a normal distribution's PDF. See Section 3.6 for more.
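As a rough check of the first pattern, the probability of a sample mean between 0.4 and 0.6 can be approximated by simulating many datasets from the Unif(0, 1) population. This is a sketch under my own assumptions (arbitrary seed, 20,000 simulated datasets per sample size, illustrative helper name prob.close), not the method used to draw the figure.

```r
# Sketch: simulated P(0.4 <= Ybar_n <= 0.6) with a Unif(0,1) population
set.seed(2)                       # arbitrary seed (assumption)
n.datasets <- 20000               # simulated datasets per sample size (assumption)
prob.close <- function(n) {
  Y <- matrix(runif(n.datasets * n), nrow=n.datasets)  # each row is one dataset
  ybar <- rowMeans(Y)                                  # one sample mean per dataset
  mean(ybar >= 0.4 & ybar <= 0.6)                      # fraction of datasets in range
}
ns <- c(1, 2, 32)
probs <- sapply(ns, prob.close)
print(round(setNames(probs, paste0('n=', ns)), 3))
```

The n = 1 value should be near the 20% area described in the text, and the n = 32 value should be much closer to one.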
3.5.4 Table: Values in Repeated Samples
Table 3.1 records values and events across 100 datasets randomly sampled from the same population. The population is discrete, with P(Y = j) = 1/5 for j = −2, −1, 0, 1, 2. Sampling is iid, so each $Y_i$ has the same distribution as the population Y, and all $Y_i$ are mutually independent. Let n = 10. The population mean is E(Y) = 0.
Table 3.1: Example estimates and event probabilities
Sample     $\bar{Y}_n$    $\mathbf{1}\{\bar{Y}_n \le 0\}$    $\mathbf{1}\{\bar{Y}_n - 0.4 \le 0 \le \bar{Y}_n + 0.4\}$
1           0.50      0         0
2           0.20      0         1
3           0.00      1         1
4          −0.10      1         1
5          −0.50      1         0
...
100         0.30      0         1
Average     0.01      52/100    67/100
Note: P(Y = j) = 0.2 for j = −2, −1, 0, 1, 2; iid, n = 10.
Table 3.1 shows the value of $\bar{Y}_n$ computed from each sample (dataset). It shows that $\bar{Y}_n = 0.5$ in the first sample, $\bar{Y}_n = 0.2$ in the second sample, etc. This reflects the sampling distribution. A histogram of these values (not shown) would produce a graph with similar interpretation to Figures 3.1 and 3.2.
Table 3.1 shows, for each sample, whether or not the sample mean $\bar{Y}_n$ is less than or equal to the population mean E(Y) = 0, in the column labeled with $\mathbf{1}\{\bar{Y}_n \le 0\}$. That is, 1 indicates that it does; 0 indicates that it doesn't. For example, in Sample 1, $\bar{Y}_n = 0.5$, which is not negative, so $\mathbf{1}\{\bar{Y}_n \le 0\} = 0$. In Sample 4, $\bar{Y}_n = -0.1$, which is negative, so $\mathbf{1}\{\bar{Y}_n \le 0\} = 1$. From the frequentist view, the event $\bar{Y}_n \le E(Y)$ is "random" in that it could occur or not occur, with some probability for each possibility. The E(Y) is non-random, but $\bar{Y}_n$ is random, hence the event is random. The event's probability is the probability of randomly sampling a dataset in which the event occurs. The bottom row of the table says the event occurred 52 times out of 100 samples (52% of the time). Since there are only 100 samples and not ∞, this is not the exact probability, but it reflects that the event occurs slightly more than half the time.
Table 3.1 also shows, for each sample, whether or not the random interval $[\bar{Y}_n - 0.4, \bar{Y}_n + 0.4]$ contains the population mean E(Y) = 0, i.e., whether or not $\bar{Y}_n - 0.4 \le E(Y) \le \bar{Y}_n + 0.4$. The interval is "random" in the frequentist sense that it has different possible values in different datasets (since it depends on $\bar{Y}_n$). In Sample 1, the interval does not contain E(Y): $\bar{Y}_n = 0.5$, so the interval is [0.5 − 0.4, 0.5 + 0.4] = [0.1, 0.9], which does not contain E(Y) = 0. In Sample 2, the interval does contain E(Y): $\bar{Y}_n = 0.2$, so the interval is [−0.2, 0.6], which contains E(Y) = 0. The bottom row of the table says this event occurred 67 times out of 100 samples (67% of the time). This is the "coverage probability" of a "confidence interval" described in Section 3.8.2.
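A table of this kind can be generated by simulation. The sketch below is my own construction, assuming the population in the table note (values −2, −1, 0, 1, 2, each with probability 1/5), n = 10, and 100 datasets; the seed is arbitrary, so the specific numbers will differ from the table's.

```r
# Sketch: values and events across 100 simulated datasets, as in the table
set.seed(3)                       # arbitrary seed (assumption)
support <- -2:2                   # population values, each with probability 1/5
n <- 10
n.datasets <- 100
Y <- matrix(sample(support, size=n.datasets * n, replace=TRUE),
            nrow=n.datasets)      # each row is one iid dataset
Ybar <- rowMeans(Y)
below <- as.numeric(Ybar <= 0)                          # 1 if Ybar_n <= E(Y) = 0
covers <- as.numeric(Ybar - 0.4 <= 0 & 0 <= Ybar + 0.4) # 1 if interval contains 0
tab <- data.frame(Sample=1:n.datasets, Ybar=Ybar, below=below, covers=covers)
print(head(tab, 5))               # first five rows
print(colMeans(tab[, -1]))        # bottom row: averages over the 100 datasets
```

Rerunning with a different seed gives different rows and a somewhat different bottom row, which is the sense in which 52/100 and 67/100 only approximate the underlying probabilities.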
3.6 Sampling Distribution Approximation
Because of the difficulties mentioned in Section 3.5, usually an estimator's sampling distribution is approximated.
Most common is a particular type of approximation called an asymptotic approximation. All else equal, this type of approximation is better when n is larger. For example, comparing Figures 3.1 and 3.2 with (3.15) below, the approximation is very bad with n = 1, but very good for the largest n in each figure. Unfortunately, there is no general, magic threshold for n because the approximation's accuracy also depends on certain (unknown) population features. In practice, people usually just hope n is large enough that the approximation is reasonable.
With iid sampling, the approximate distribution of $\bar{Y}_n$ is
$$\bar{Y}_n \overset{a}{\sim} N(\mu_Y, \sigma_Y^2 / n), \tag{3.15}$$
where the a over ∼ stands for "approximately" (or "asymptotically"), $\mu_Y = E(Y)$ is the population mean, and $\sigma_Y^2 \equiv \mathrm{Var}(Y)$ is the population variance. This reflects the three patterns seen in the examples in Sections 3.5.2 and 3.5.3. First, the standard deviation is $\sigma_Y / \sqrt{n}$, which (given the same $\sigma_Y$) is smaller for larger n. Second, the mean of the distribution is the population mean. Third, the shape is normal (Gaussian). A normal approximation of a sample mean's sampling distribution is often called a central limit theorem (CLT).
For technical mathematical reasons, you may often see a variation of (3.15):
$$\sqrt{n}(\bar{Y}_n - \mu_Y) \overset{a}{\sim} N(0, \sigma_Y^2). \tag{3.16}$$
Practically, this says the same thing as (3.15); it just moves the $\mu_Y$ and n to the left-hand side from the right-hand side.
In practice, $\sigma_Y^2$ is unknown, but it can be estimated from data (e.g., by the sample variance).
[Figure 3.3 image: six panels (n = 1, 2, 4, 8, 16, 32); horizontal axis: value of $\sqrt{n}(\bar{Y}_n - \mu_Y)$ (−1.0 to 1.0); vertical axis: PDF, 0.0 to 1.2.]
Figure 3.3: Sampling distribution (solid PDF) of $\sqrt{n}(\bar{Y}_n - \mu_Y)$, with normal approximation (dashed).
Figure 3.3 shows the same sampling distribution (PDF) from Figure 3.2, along with the normal approximation. The horizontal axis has been rescaled to see the shape more easily; the horizontal values are like the left-hand side of (3.16). The approximation is bad with n = 1, but very good for the largest n.
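The quality of the normal approximation can also be checked numerically in a case where the exact sampling distribution is available. The sketch below uses the binary population from Section 3.5.2 (p = 0.7) rather than the uniform population in the figure, because the exact CDF of the sample mean then follows from R's pbinom(). The cutoff 0.75, the sample sizes, and the helper name compare are my own illustrative choices.

```r
# Sketch: exact vs normal-approximation P(Ybar_n <= 0.75), binary population
p <- 0.7
sigma <- sqrt(p * (1 - p))        # population standard deviation of a Bernoulli(p)
compare <- function(n, y=0.75) {
  # exact: number of ones is Binomial(n, p); small offset guards against rounding
  exact <- pbinom(floor(n * y + 1e-9), size=n, prob=p)
  # approximation: N(mu, sigma^2 / n), as in the asymptotic approximation
  approx <- pnorm(y, mean=p, sd=sigma / sqrt(n))
  c(n=n, exact=exact, approx=approx, abs.error=abs(exact - approx))
}
out <- t(sapply(c(1, 4, 32, 512), compare))
print(round(out, 3))
```

The absolute error is large for n = 1 and small for n = 512, consistent with the approximation being better when n is larger.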
3.6.1 Non-iid Sampling
With non-iid sampling (see Section 3.2), an estimator's sampling distribution is different, so the approximation also differs. Usually the sampling distribution is still normal, but with a different standard deviation. In practice, ideally you should understand the type of sampling enough to know which R functions can provide an appropriate approximation. For now, the goal is only to be able to assess the type of sampling and understand that non-iid sampling leads to a different sampling distribution.
3.7 Quantifying Accuracy of an Estimator
From the frequentist perspective, an estimator's accuracy can be quantified by comparing features of its sampling distribution to the true population value. Bias is an important, commonly mentioned property, but it is not sufficient. Mean squared error better quantifies accuracy. Bias and mean squared error are finite-sample properties that derive from the estimator's sampling distribution for a finite sample size n. Approximate ("large-sample" or "asymptotic") versions of these properties are also discussed below.
Throughout, let θ be the population parameter, estimated by $\hat{\theta}_n$. This includes θ = E(Y) and $\hat{\theta}_n = \bar{Y}_n$, but is more general.
3.7.1 Bias
Recall from Section 3.5 the frequentist perspective that an estimator is a random variable whose probability distribution is called its sampling distribution. The sampling distribution differs with n, hence the subscript n on the estimator $\hat{\theta}_n$.
Definitions
The bias of $\hat{\theta}_n$ compares the mean of its sampling distribution to the true population θ. Mathematically,
$$\mathrm{Bias}(\hat{\theta}_n) \equiv E(\hat{\theta}_n) - \theta. \tag{3.17}$$
The bias captures whether the estimator systematically differs from θ in a particular direction, i.e., how wrong the average $\hat{\theta}_n$ is.
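There are four types of bias:
upward bias (positive bias): $E(\hat{\theta}_n) > \theta$
downward bias (negative bias): $E(\hat{\theta}_n) < \theta$
attenuation bias (bias toward zero): $0 < E(\hat{\theta}_n)/\theta < 1$, so $|E(\hat{\theta}_n)| < |\theta|$
bias away from zero: $E(\hat{\theta}_n)/\theta > 1$, so $|E(\hat{\theta}_n)| > |\theta|$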
An estimator is unbiased if its bias is zero. Using (3.17),
$$\mathrm{Bias}(\hat{\theta}) = 0 \iff E(\hat{\theta}) = \theta, \tag{3.18}$$
where symbol $\iff$ can be read as "is equivalent to" (see Section 6.1).
For example, with iid sampling, the sample mean is unbiased. With n = 1, $\bar{Y}_1 = Y_1$, so $E(\bar{Y}_1) = E(Y_1) = \mu_Y$. With n = 2,
$$E[\bar{Y}_2] = E[(1/2)Y_1 + (1/2)Y_2] = \overbrace{(1/2)E(Y_1)}^{\mu_Y/2} + \overbrace{(1/2)E(Y_2)}^{\mu_Y/2} = \mu_Y, \tag{3.19}$$
using the linearity property of E(·) from (2.21). Similar derivations hold for any n, so $E(\bar{Y}_n) = \mu_Y$; thus the bias is zero, given (3.18).
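For readers who want the general-n step written out, here is a sketch of the same argument. It uses only linearity of E(·) and the fact that each $Y_i$ has mean $\mu_Y$ under iid sampling; the summation notation is my own shorthand for the sample mean.

```latex
E(\bar{Y}_n)
  = E\Bigl(\frac{1}{n}\sum_{i=1}^{n} Y_i\Bigr)
  = \frac{1}{n}\sum_{i=1}^{n} E(Y_i)
  = \frac{1}{n}\,(n\,\mu_Y)
  = \mu_Y
```

Note that independence is not needed for this step; only the common mean of the $Y_i$ is used.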
Insufficiency of Bias to Quantify Accuracy
Bias alone does not fully quantify accuracy. That is, if you only consider bias when choosing between two possible estimators, then you may be fooled into choosing the worse estimator.
Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two estimators of unknown parameter θ. Here, the subscripts 1 and 2 do not indicate n, but just that the estimators are different. For simplicity, let θ = 0. The first estimator's distribution is
$$P(\hat{\theta}_1 = -100) = P(\hat{\theta}_1 = 100) = 1/2. \tag{3.20}$$
The second estimator's distribution is
$$P(\hat{\theta}_2 = 1) = 1. \tag{3.21}$$
The first estimator has smaller bias. The mean of each estimator is
$$E(\hat{\theta}_1) = (1/2)(-100) + (1/2)(100) = 0, \quad E(\hat{\theta}_2) = (1)(1) = 1. \tag{3.22}$$
Thus, recalling θ = 0, the bias of each estimator is
$$\mathrm{Bias}(\hat{\theta}_1) = E(\hat{\theta}_1) - \theta = 0 - 0 = 0, \quad \mathrm{Bias}(\hat{\theta}_2) = E(\hat{\theta}_2) - \theta = 1 - 0 = 1. \tag{3.23}$$
Estimator $\hat{\theta}_1$ is unbiased, while $\hat{\theta}_2$ has upward bias.
But, intuitively, $\hat{\theta}_2$ is much better. It is always wrong by 1, but $\hat{\theta}_1$ is always wrong by 100, which is much worse. Regardless of the dataset, $\hat{\theta}_2$ is 100 times closer than $\hat{\theta}_1$ to the true θ = 0. Bias alone does not properly quantify our preferences: it tells us to prefer $\hat{\theta}_1$, when in fact we strongly prefer $\hat{\theta}_2$.
3.7.2 Mean Squared Error
⟹ Kaplan video: MSE Examples
The mean squared error (MSE) is a more complete measure of "how bad" an estimator is. The idea is analogous to using quadratic loss for prediction (e.g., Sections 2.5.1 and 2.5.4). Among other possible loss functions, this is most common and is generally reasonable. MSE is mean quadratic loss:
$$\mathrm{MSE}(\hat{\theta}) \equiv E[L_2(\hat{\theta}, \theta)] = E[(\hat{\theta} - \theta)^2]. \tag{3.24}$$
Continuing the example, our intuitive preference for $\hat{\theta}_2$ over $\hat{\theta}_1$ is supported by MSE. Since MSE measures "how bad" an estimator is, $\hat{\theta}_2$ being "better" means it has lower MSE. Specifically,
$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}_1) &= E[(\hat{\theta}_1 - \theta)^2] = (1/2)(-100 - 0)^2 + (1/2)(100 - 0)^2 = 10000, \\
\mathrm{MSE}(\hat{\theta}_2) &= E[(\hat{\theta}_2 - \theta)^2] = (1)(1 - 0)^2 = 1.
\end{aligned} \tag{3.25}$$
This matches our intuition: $\hat{\theta}_2$ is much better than $\hat{\theta}_1$ because it has much lower MSE.
MSE can also be decomposed into variance plus squared bias. The variance can also be seen as the squared "standard error" (see Section 3.8.1). Skipping the algebra,
$$E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + [\mathrm{Bias}(\hat{\theta})]^2. \tag{3.26}$$
All else equal, larger bias is bad, but it's also bad to have very high and very low estimates across datasets (large variance and "standard error"), even if they happen to average to θ.
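The numbers in the example, and the decomposition into variance plus squared bias, can be verified directly for any estimator with a known discrete sampling distribution. The helper mse.parts below is a hypothetical name of my own, written as a minimal sketch rather than a general-purpose tool.

```r
# Sketch: bias, variance, and MSE of an estimator with a discrete distribution
mse.parts <- function(values, probs, theta) {
  m <- sum(probs * values)                 # mean of the sampling distribution
  v <- sum(probs * (values - m)^2)         # variance of the sampling distribution
  b <- m - theta                           # bias: mean minus true value
  mse <- sum(probs * (values - theta)^2)   # mean squared error, computed directly
  c(bias=b, variance=v, mse=mse, var.plus.bias2=v + b^2)
}
est1 <- mse.parts(values=c(-100, 100), probs=c(0.5, 0.5), theta=0)
est2 <- mse.parts(values=1, probs=1, theta=0)
print(rbind(est1, est2))
```

In both rows the last two columns agree, illustrating that the directly computed MSE equals variance plus squared bias.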
Other MSE Examples
More generally, instead of assuming θ = 0, let
$$P(\hat{\theta}_1 = \theta - 100) = P(\hat{\theta}_1 = \theta + 100) = 1/2, \quad P(\hat{\theta}_2 = \theta + 1) = 1. \tag{3.27}$$
The MSEs are the same as before since the θ cancels out:
$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}_1) &= E[(\hat{\theta}_1 - \theta)^2] = (1/2)(\theta - 100 - \theta)^2 + (1/2)(\theta + 100 - \theta)^2 = 10000, \\
\mathrm{MSE}(\hat{\theta}_2) &= E[(\hat{\theta}_2 - \theta)^2] = (1)(\theta + 1 - \theta)^2 = 1.
\end{aligned} \tag{3.28}$$
As another example, imagine we know the bias and variance of two estimators, but not the full sampling distributions. This is still sufficient to compute MSE, using (3.26). For example, let
$$\mathrm{Bias}(\hat{\beta}_1) = 1, \; \mathrm{Var}(\hat{\beta}_1) = 16, \quad \mathrm{Bias}(\hat{\beta}_2) = 10, \; \mathrm{Var}(\hat{\beta}_2) = 9. \tag{3.29}$$
Plugging these into (3.26),
$$\mathrm{MSE}(\hat{\beta}_1) = 1^2 + 16 = 17, \quad \mathrm{MSE}(\hat{\beta}_2) = 10^2 + 9 = 109. \tag{3.30}$$
According to MSE, $\hat{\beta}_1$ is better because it has lower MSE ("less bad") than $\hat{\beta}_2$. In this case, although $\hat{\beta}_1$ has larger variance, its bias is enough smaller that its overall MSE is also smaller.
Practice 3.3 (estimator MSE). Consider three estimators of the population mean µ = E(Y), and their three sampling distributions: $\hat{\mu}_1 \sim N(\mu, 25)$, $\hat{\mu}_2 \sim N(\mu + 3, 16)$, and $\hat{\mu}_3 \sim N(\mu + 2, 9)$, i.e., all normal distributions, with respective means µ, µ + 3, and µ + 2, and respective variances 25, 16, and 9. (Hint: for MSE, does it matter that the distributions are normal?)
a) Compute the MSE of each estimator.
b) Rank the three estimators from best to worst in terms of MSE.
3.7.3 Consistency
Analogous to how bias compares the sampling distribution's mean to the true population θ, consistency compares the approximate (asymptotic) sampling distribution's mean to θ. However, in addition to the approximate distribution having mean θ, consistency also requires that the approximate distribution's standard deviation is smaller for larger n.
For example, consider the approximate distribution of the sample mean $\bar{Y}_n$ in (3.15). The mean is $\mu_Y$. Further, the standard deviation is proportional to $1/\sqrt{n}$ (since the variance is proportional to 1/n), i.e., smaller for larger n. Thus, $\bar{Y}_n$ is consistent.
Visually, the consistency of $\bar{Y}_n$ was seen in Figures 3.1 and 3.2. In each case, with larger n, the sampling distribution concentrated probability around $\mu_Y$.
Intuitively, consistency means that in "large" samples (large n), there is a "high" probability of the estimator being "close" to the true value. This is similar to the idea of "probably approximately correct" in computer science: estimator $\hat{\theta}_n$ is "consistent" if with large n it is "probably approximately correct." Unfortunately, there are usually no precise quantitative definitions of "large," "high," and "close." Still, the qualitative idea is that the sampling distribution of $\hat{\theta}_n$ is converging to the true θ if we imagine larger and larger n, which is represented notationally by
$$\hat{\theta}_n \overset{p}{\to} \theta \text{ as } n \to \infty, \quad \text{or} \quad \operatorname*{plim}_{n \to \infty} \hat{\theta}_n = \theta. \tag{3.31}$$
If $\hat{\theta}_n$ is not consistent, its asymptotic bias can be defined as
$$\mathrm{AsyBias}(\hat{\theta}_n) \equiv \operatorname*{plim}_{n \to \infty} \hat{\theta}_n - \theta. \tag{3.32}$$
(Other definitions have the same practical meaning, though the technical details differ.) Similar to "unbiasedness" being "zero bias," here "consistency" is "zero asymptotic bias." There are the same four types of asymptotic bias as bias (upward, downward, attenuation, away from zero).
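The "probably approximately correct" intuition can be illustrated by simulation: fix a notion of "close" and see how the probability of being close changes with n. The sketch below assumes a Unif(0, 1) population (so the true mean is 0.5), defines "close" as within 0.05, and uses an arbitrary seed and 2,000 simulated datasets per n; all of these are my own illustrative choices.

```r
# Sketch: P(|Ybar_n - 0.5| <= 0.05) for increasing n, Unif(0,1) population
set.seed(4)                       # arbitrary seed (assumption)
n.datasets <- 2000                # simulated datasets per sample size (assumption)
prob.close <- function(n, eps=0.05) {
  Y <- matrix(runif(n.datasets * n), nrow=n.datasets)  # each row is one dataset
  mean(abs(rowMeans(Y) - 0.5) <= eps)  # fraction of datasets with a "close" estimate
}
ns <- c(10, 100, 1000)
probs <- sapply(ns, prob.close)
print(round(setNames(probs, paste0('n=', ns)), 3))
```

The probability of being within 0.05 of the true mean rises toward one as n grows, which is the qualitative content of consistency.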
3.7.4 Asymptotic MSE
It is possible to compare approximate (asymptotic) MSE, but details are omitted.
3.8 Quantifying Uncertainty: Conventional Frequentist Approaches
The point estimates in Section 3.4 provide our best guesses about unknown population values, but they offer no sense of our uncertainty. Here, we consider only statistical uncertainty (or sampling uncertainty), i.e., the uncertainty due to observing only a random sample of data instead of knowing the true population distribution. Although the term is ambiguous, inference often refers to the types of methods in this section (i.e., statistical methods other than point estimation).
This section concerns only the conventional frequentist approaches to quantifying uncertainty. Section 3.9 provides warnings about misinterpretation and misuse.
The general consensus among econometricians and statisticians is that confidence intervals are more informative and easier to interpret than p-values and hypothesis tests. You should focus on confidence intervals when producing your own empirical analysis, but you may need to understand p-values and hypothesis tests to understand others.
3.8.1 Standard Errors
Empirical economics results almost always report standard errors alongside estimates. Standard errors are commonly used to compute confidence intervals, as well as p-values and hypothesis tests.
Definition and Terminology
The standard error (SE) of estimator $\hat{\theta}$ is the standard deviation of its sampling distribution:
$$\mathrm{SE}(\hat{\theta}) \equiv \sqrt{\mathrm{Var}(\hat{\theta})}. \tag{3.33}$$
Recall from Section 2.3 that the standard deviation has the same units as the variable itself, so the SE has the same units as $\hat{\theta}$.
Unfortunately, people may say "SE" to refer to either (3.33) or an estimate of (3.33), so its meaning is ambiguous. In this textbook at least, "estimated SE" and notation $\widehat{\mathrm{SE}}(\hat{\theta})$ refer to an estimate of (3.33). Causing yet more confusion, $\widehat{\mathrm{SE}}(\bar{Y}_n)$ is often called the standard error of the mean, whereas personally I would call it the estimated standard error of the sample mean.
Interpretation
The SE helps quantify uncertainty due to random sampling, or "statistical uncertainty." Larger SE means more uncertainty.
For example, consider estimators $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}_3$ that all estimate θ. If $\hat{\theta}_1 = \theta$, then $\mathrm{SE}(\hat{\theta}_1) = 0$, reflecting zero uncertainty. If $P(\hat{\theta}_2 = \theta + 1) = P(\hat{\theta}_2 = \theta - 1) = 1/2$, then (skipping the math) $\mathrm{SE}(\hat{\theta}_2) = 1$. If $P(\hat{\theta}_3 = \theta + 10) = P(\hat{\theta}_3 = \theta - 10) = 1/2$, then $\mathrm{SE}(\hat{\theta}_3) = 10$.
Unfortunately, it's possible to be very certain about the wrong value. Consider the very bad estimator $\hat{\theta} = 4$. Since it's a constant, $\mathrm{SE}(\hat{\theta}) = 0$. This may appear to be great (no uncertainty), but we should feel very uncertain about the methodology of "just guess 4." Our uncertainty about appropriate methodology is not captured by the SE.
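The distinction between the SE and an estimated SE can be made concrete for the sample mean. The sketch below assumes iid N(0, 1) data with n = 50, an arbitrary seed, and the usual plug-in estimate (sample standard deviation over the square root of n); it compares the true SE, the standard deviation of many simulated sample means, and the estimate from one dataset.

```r
# Sketch: true SE, simulated SE, and estimated SE of the sample mean
set.seed(5)                                   # arbitrary seed (assumption)
n <- 50
true.se <- 1 / sqrt(n)                        # sigma / sqrt(n) with sigma = 1
Ybars <- replicate(5000, mean(rnorm(n)))      # sample means from 5000 datasets
sim.se <- sd(Ybars)                           # sd of the simulated sampling distribution
Y <- rnorm(n)                                 # one observed dataset
est.se <- sd(Y) / sqrt(n)                     # estimated SE from that single dataset
print(round(c(true=true.se, simulated=sim.se, estimated=est.se), 3))
```

Only the last number is available in practice, since it alone is computed from a single dataset without knowing the population.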
3.8.2 Confidence Intervals
A confidence interval (CI) helps quantify statistical uncertainty, with a longer CI indicating more statistical uncertainty. A CI does not capture any other source of uncertainty, so a short CI can be misleading if there is still uncertainty about certain assumptions or methodological choices.
Essentially, a CI is a range of values that tries to include the true population value with high probability, like 90% or 95%. Again, "probability" means frequentist probability, from the "before" view, or equivalently over many repeated samples from the same population, like in Table 3.1.
For example, recall the last column in Table 3.1. In each of 100 random samples, it showed whether or not the true mean E(Y) = 0 was inside the interval $[\bar{Y}_n - 0.4, \bar{Y}_n + 0.4]$. This CI contained the true population mean in 67 of the 100 datasets. From the "before" view, the probability of randomly sampling a dataset in which the CI contains the true value is around 67%.
A 90% CI does not mean "I believe there's a 90% chance that the true value is in this range." That is the interpretation of a Bayesian credible interval; see Section 3.1.
The actual probability that a CI contains the true value often differs from the desired probability. In practice, when you ask R to compute a CI, you specify your desired probability (like 90% or 95%), called the confidence level or nominal coverage probability (or "nominal level" or other variations). The actual probability is the coverage probability. There are three possibilities.
1. Ideally, a CI's coverage probability is close to the nominal level.
2. Sometimes a CI is too long and has coverage probability above what you requested. This is bad because it does not help you narrow down the possible values of the population parameter well (since the CI is longer than necessary).
3. Sometimes a CI is too short and has coverage probability below what you requested: as low as 80%, 50%, or even close to 0%. This is bad because you think the true value is inside the CI, but actually in many datasets (more than you realized) the CI does not contain the true value.
The levels 90% and 95% are most common, but sometimes you may desire 99% or even higher if it is particularly important that the true value be in the interval (or if you have a very large sample with very short CIs).
Formally, coverage probability is defined as follows. Consider a two-sided confidence interval of the form $[\hat{L}, \hat{U}]$, where the hats remind us that the interval endpoints are computed from the data. For example, Table 3.1 had $\hat{L} = \bar{Y}_n - 0.4$ and $\hat{U} = \bar{Y}_n + 0.4$. A one-sided confidence interval would set $\hat{L} = -\infty$ or $\hat{U} = \infty$. The coverage probability of this CI for the parameter θ is
$$P(\text{CI contains true value}) = P(\theta \in [\hat{L}, \hat{U}]) = P(\hat{L} \le \theta \le \hat{U}). \tag{3.34}$$
Sometimes a CI is written in terms of a critical value. The critical value depends on the confidence level and comes from the standard normal distribution, N(0, 1). It is used when an estimator's approximate sampling distribution is normal (Gaussian). For example, if c is the critical value for a two-sided 95% CI, and $\widehat{\mathrm{SE}}$ is the estimated standard error of estimator $\hat{\theta}$, then the conventional CI is $[\hat{\theta} - c\,\widehat{\mathrm{SE}}, \hat{\theta} + c\,\widehat{\mathrm{SE}}]$.
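To connect the critical-value form of a CI with the definition of coverage probability, the sketch below builds the conventional two-sided CI for a mean and then estimates its coverage probability by simulation. The assumptions are mine: iid N(0, 1) data with n = 50, the estimated SE of the sample mean as the sample standard deviation over the square root of n, an arbitrary seed, and a hypothetical helper named conventional.ci.

```r
# Sketch: conventional CI from a normal critical value, and its simulated coverage
set.seed(6)                                   # arbitrary seed (assumption)
conventional.ci <- function(Y, level=0.95) {
  crit <- qnorm(1 - (1 - level) / 2)          # critical value; about 1.96 for 95%
  se.hat <- sd(Y) / sqrt(length(Y))           # estimated SE of the sample mean
  c(lower=mean(Y) - crit * se.hat, upper=mean(Y) + crit * se.hat)
}
covers <- replicate(2000, {
  ci <- conventional.ci(rnorm(n=50, mean=0, sd=1))  # true mean is 0
  ci[['lower']] <= 0 && 0 <= ci[['upper']]          # does the CI contain the truth?
})
coverage <- mean(covers)                      # fraction of datasets with coverage
print(round(coverage, 3))
```

In this setting the simulated coverage should be near, though typically slightly below, the nominal 95%, because the normal critical value ignores the extra uncertainty from estimating the standard deviation.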
Practice 3.4 (CI interpretation). Imagine you have a CI with 95% nominal coverage probability for the true θ: [14, 29].
a) Explain why this does not mean "I think there's a 95% chance that 14 ≤ θ ≤ 29."
b) Explain why it's still possible that the true value is θ = 0.
c) Explain why, if the true coverage probability is also 95% and you had 99 other randomly sampled datasets, then around 95 of the 100 total datasets would have a CI containing the true θ.
d) Despite the 95% confidence level, imagine the actual coverage probability of your CI is 75%. Would a CI with actual 95% coverage probability be longer or shorter than yours? Explain.
Example in R
The following R example constructs two-sided 95% confidence intervals for the mean, from simulated iid standard normal data (so the true population mean is zero). One CI uses t.test(), a standard t-test; the other CIs use nonparametric bootstrap methodology from the boot package, though details are beyond our scope.
    library(boot)
    set.seed(112358) # for replicability
    Y <- rnorm(n=50, mean=0, sd=1) # iid N(0,1)
    CIttest <- t.test(x=Y, conf.level=0.95,
                      alternative='two.sided')$conf.int
    ret <- boot(data=Y, statistic=function(x,i) mean(x[i]), R=100)
    tmp <- boot.ci(boot.out=ret, conf=0.95, type=c('basic','bca'))
    outtable <- rbind(CIttest, tmp$basic[4:5], tmp$bca[4:5])
    rownames(outtable) <- c('Normality','Bootbasic','BootBCa')
    colnames(outtable) <- c('Lower','Upper')
    print(round(outtable, digits=3))
               Lower Upper
    Normality -0.213 0.370
    Bootbasic -0.234 0.387
    BootBCa   -0.233 0.388
3.8.3 p-values
Frequentist p-values are precisely defined, strange, common, and commonly misunderstood.
A p-value measures how unlikely the observed data would be if a certain hypothesis were true. Notationally, a p-value is conventionally just denoted as p, but since it is computed from data, I usually write $\hat{p}$ (with a hat) for clarity. The range is $0 \le \hat{p} \le 1$. Small values closer to $\hat{p} = 0$ indicate such a dataset would be unlikely if the hypothesis were true. In that case, either the hypothesis is true and we just happened by chance to observe an unlikely dataset, or else the hypothesis is false.
For example, consider the p-value for the hypothesis $H_0: \mu_Y = 0$. Values near $\hat{p} = 0$ indicate that the observed dataset would be unlikely if actually $\mu_Y = 0$. As usual, $\bar{Y}_n$ is a random variable. Here, the observed sample mean $\bar{Y}_o$ is treated as non-random. The p-value is then
$$\hat{p} = P(|\bar{Y}_n| \ge |\bar{Y}_o| \mid \mu_Y = 0). \tag{3.35}$$
That is, the p-value is the probability of observing a sample mean at least as far from zero as the observed sample mean, given a population with $\mu_Y = 0$.
More generally,
$$\hat{p} = P(\text{estimate magnitude at least as big as observed} \mid H_0 \text{ is true}). \tag{3.36}$$
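One common way to approximate the p-value for a hypothesis that the mean is zero is to use the normal approximation of the sample mean's sampling distribution with an estimated standard error. The sketch below is an illustration under my own assumptions (simulated data with true mean 0.3, n = 50, arbitrary seed); it also reports the p-value from R's t.test() for comparison, which uses a t distribution instead of the normal.

```r
# Sketch: two-sided p-value for H0: mean = 0 via the normal approximation
set.seed(7)                                   # arbitrary seed (assumption)
Y <- rnorm(n=50, mean=0.3, sd=1)              # simulated data; true mean is not 0
se.hat <- sd(Y) / sqrt(length(Y))             # estimated SE of the sample mean
z <- mean(Y) / se.hat                         # observed mean in SE units, under H0
p.hat <- 2 * pnorm(-abs(z))                   # P(|Z| >= |z|) for Z ~ N(0,1)
p.ttest <- t.test(Y, mu=0)$p.value            # same idea with a t distribution
print(round(c(normal.approx=p.hat, t.test=p.ttest), 4))
```

The two values are close with n = 50; the t-based value is never smaller, since the t distribution has heavier tails than the normal.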
384 Statistical Significance
Results with low p-values are often called statistically significantor having statistical significance These terms are usually usedwhen trying to estimate an effect (or difference) where the null hy-pothesis is zero effect A statistically significant effect estimate meansthe p-value is low meaning an estimate that large would be unlikely ifthe true effect were zero which provides some evidence of a non-zeroeffect
Conceptually statistical significance is not a yesno property buta continuum ie not ldquoifrdquo but ldquohow muchrdquo Results can be somewhatstatistically significant or extremely statistically significant or lack-ing statistical significance etc Confusingly a lower p-value meansgreater statistical significance
In practice often people say a result is statistically significant ata particular level For example if the p-value is below 005 then theresult is ldquostatistically significant at a 5 levelrdquo if below 001 thenit is ldquostatistically significant at a 1 levelrdquo etc Generally there isstatistical significance at a 100c level if the p-value is below c
Why 5 Indeed 5 is arbitrary Its origin seems to be fromRonald Fisher who wrote in 1926 ldquoWe shall not often be astray ifwe draw a conventional line at 005rdquo Recently 72 prominent re-searchers from many fields (including statistics econometrics andeconomics) wrote a piece simply titled ldquoRedefine statistical signifi-cancerdquo (Benjamin Berger Johannesson Nosek Wagenmakers BerkBollen Brembs Brown Camerer Cesarini Chambers Clyde CookDe Boeck Dienes Dreber Easwaran Efferson Fehr Fidler FieldForster George Gonzalez Goodman Green Green Greenwald Had-field Hedges Held Ho Hoijtink Hruschka Imai Imbens IoannidisJeon Jones Kirchler Laibson List Little Lupia Machery MaxwellMcCarthy Moore Morgan Munafoacute Nakagawa Nyhan Parker Per-icchi Perugini Rouder Rousseau Savalei Schoumlnbrodt Sellke Sin-clair Tingley Van Zandt Vazire Watts Winship Wolpert XieYoung Zinman and Johnson 2018) The suggestion was to reducethe conventional level for statistical significance from 5 to 05Indeed it is already (much) lower in some fields like genetics andhigh-energy physics However they also agree that there may be veryimportant empirical results with p = 005 or even larger They simplyadvocate calling such results ldquosuggestive evidencerdquo rather than treat-ing them as conclusive They also note that it may be better to focuson confidence intervals than statistical significance
3.8.5 Hypothesis Testing
In the scientific method, theories imply certain hypotheses that can be tested empirically (with data). A theory is maintained until it is disproved; then a new theory replaces it and is tested, etc.
In economics, more often hypothesis testing is used like the p-value, to provide evidence against a statement like "the true effect is zero."
Notation and Terminology
Notationally, H0 denotes the null hypothesis,^1 while H1 (sometimes Ha) denotes the alternative hypothesis. A specific null hypothesis is a statement about a parameter, written after a colon, like H0: µ_Y = 0. The alternative hypothesis is usually just that the null is false, like H1: µ_Y ≠ 0, so they are mutually exclusive (cannot both be true).
Ostensibly, the goal of hypothesis testing is to use data to decide whether H0 or H1 is true, but many caveats apply.
Table 3.2: Terms for hypothesis test outcomes
           | don't reject H0 | reject H0
H0 true    | correct         | type I error (false positive)
H0 false   | type II error   | correct
Much jargon accompanies hypothesis testing. A hypothesis test has two possible results: either reject H0, or do not reject H0. ("Fail to reject" means the same as "do not reject"; it does not reflect a "failure" in the colloquial English sense.) Sometimes "do not reject" is replaced by "accept," but "do not reject" emphasizes the asymmetry between H0 and H1. That is, although rejection of H0 indicates evidence against it, lack of rejection only indicates a lack of evidence against H0, not necessarily strong evidence supporting it. The test's result is either correct or an error, with terms shown in Table 3.2. Related probabilities are:
rejection probability: P(reject H0)
type I error rate: P(reject H0 | H0 is true)
type II error rate: P(don't reject H0 | H0 is false)
power: P(reject H0 | H0 is false)
There is yet more jargon. When H0 includes multiple values of a parameter θ, the largest possible type I error rate is called the size of the test. When H0 is just a single value like H0: θ = 0, then the size is just the type I error rate. A test's level is its claimed maximum type I error rate (its claimed size). Like a confidence interval's nominal coverage probability, the level could be above, below, or equal to the true size. Usually it is close for large n, but it may be very different with small n, and there is (as usual) no quantitative threshold for "large."
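The error rates above can be approximated by simulation. The following is only a minimal sketch under assumptions chosen for illustration (none are from the text): samples of size n = 50 drawn from a normal distribution with standard deviation 1, and an alternative mean of 0.5 when H0 is false. It simulates many datasets, runs a two-sided level 5% t-test of H0: E(Y) = 0 on each, and reports the proportion of rejections when H0 is true (type I error rate) and when H0 is false (power).

```r
set.seed(112358)   # for replicability
n     <- 50        # sample size (assumed for illustration)
NREP  <- 2000      # number of simulated datasets
ALPHA <- 0.05      # level of the test
# TRUE if the two-sided t-test of H0: E(Y)=0 rejects at level ALPHA,
# for one simulated dataset whose true mean is mu
rejects <- function(mu) {
  y <- rnorm(n=n, mean=mu, sd=1)
  t.test(x=y, mu=0)$p.value <= ALPHA
}
type1.rate <- mean(replicate(NREP, rejects(mu=0)))   # H0 is true
power      <- mean(replicate(NREP, rejects(mu=0.5))) # H0 is false
type2.rate <- 1 - power                              # P(don't reject | H0 false)
c(type1.rate=type1.rate, power=power, type2.rate=type2.rate)
```

In this design the simulated type I error rate should land near the 5% level, while power depends heavily on the assumed alternative mean and sample size.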
^1 Pedagogical criticism duly noted: xkcd.com/892
Practice 3.5 (testing terms). State the technical term(s) associated with each of these.
a) Your hypothesis test did not reject the permanent income hypothesis, even though it's false.
b) For a type of lab experiment where people do not behave according to expected utility maximization, when repeatedly running the experiment on different randomly sampled groups of people, your hypothesis test rejects the null hypothesis of expected utility maximization 80% of the time.
c) You want to see if there is enough empirical evidence to reject the efficient market hypothesis.
d) Your hypothesis test rejected the permanent income hypothesis, even though it's true.
e) In your test of H0: θ ≤ 0, the type I error rate is very low when θ is very negative, but the type I error rate can be as high as 7% when θ = 0.
Computation
Computationally, the hypothesis test for H0: E(Y) = 0 can be computed using the p-value. In this sense, the test is strictly less informative than the p-value: the p-value takes any number between 0 and 1, whereas the test can only reject or not. Specifically, the level α test rejects when p ≤ α, so the test essentially just reports whether 0 ≤ p ≤ α or p > α. In fact, the function t.test() in R does not even report "reject" or "do not reject"; it instead reports a p-value.
Alternatively, a hypothesis test compares a test statistic to a critical value. This is the same critical value from Section 3.8.2 that depends on the level α and the standard normal distribution N(0, 1), for use when the estimator is approximately normal. For example, if c is the two-sided level 5% critical value and t is the t-statistic, then the test rejects H0 when |t| > c. In R, for a two-sided level ALPHA test, the critical value is qnorm(1-ALPHA/2), but usually you do not need to compute it manually.
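To illustrate that the p-value approach and the critical value approach give the same decision, here is a small sketch on a hypothetical simulated sample (the sample size, mean, and standard deviation are assumptions for illustration only). The t-statistic and normal-based p-value are computed by hand; note that t.test() instead uses Student's t distribution, so its p-value is slightly larger than the normal-based one.

```r
set.seed(112358)                   # for replicability
ALPHA <- 0.05                      # level of the test
y <- rnorm(n=100, mean=0.3, sd=1)  # hypothetical sample (assumption)
n <- length(y)
tstat  <- mean(y) / (sd(y)/sqrt(n)) # t-statistic for H0: E(Y)=0
crit   <- qnorm(1 - ALPHA/2)        # two-sided critical value from N(0,1)
p.norm <- 2*pnorm(-abs(tstat))      # two-sided p-value based on N(0,1)
reject.by.p    <- (p.norm <= ALPHA)     # decision from the p-value
reject.by.crit <- (abs(tstat) > crit)   # decision from the critical value
p.ttest <- t.test(x=y, mu=0)$p.value    # p-value from Student's t distribution
c(crit=crit, tstat=tstat, p.norm=p.norm, p.ttest=p.ttest)
reject.by.p == reject.by.crit           # the two approaches agree
```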
3.8.6 Mental Math for Statistical Uncertainty
For a quick approximation, you can use a critical value of 2. Specifically, if the absolute value of a point estimate, |θ̂|, is more than two SEs away from zero, then the estimate is statistically significantly different from zero at a 5% level (actually 4.55%). That is, it is statistically significant at 5% when |θ̂| ≥ 2 SE(θ̂). You can also take θ̂ and add and subtract two SEs to get an approximately 95% CI (actually 95.45%): the CI is [θ̂ − 2 SE(θ̂), θ̂ + 2 SE(θ̂)].
This is useful for a few reasons. First, it's often easy enough to do in your head. Second, 5% is arbitrary anyway, so 4.55% is equally good (or equally bad). Third, confidence intervals and p-values are already based on (asymptotic) approximations whose approximation error is almost always bigger than the difference between 5% and 4.55%.^2
Consider an example. Imagine you estimate θ̂ = −3.2 and SE(θ̂) = 1.5. Then |θ̂| = 3.2 and 2 SE(θ̂) = 3.0. Since 3.2 > 3.0, the result is statistically significant at a 5% level (p < 0.05). A 95% CI is [−3.2 − 3.0, −3.2 + 3.0] = [−6.2, −0.2].
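As a check on the mental math, the same example can be computed in R. This is only a sketch: the variable names are made up for illustration, and it assumes the normal approximation used throughout this section. It compares the "two SEs" interval with the interval using the exact normal critical value qnorm(0.975), and computes the level implied by a critical value of 2.

```r
theta.hat <- -3.2   # point estimate from the example
SE        <- 1.5    # standard error from the example
level.2SE <- 2*pnorm(-2)                           # level implied by critical value 2
p.value   <- 2*pnorm(-abs(theta.hat/SE))           # two-sided p-value, N(0,1) approximation
CI.approx <- theta.hat + c(-2, 2)*SE               # mental-math CI
CI.exact  <- theta.hat + c(-1, 1)*qnorm(0.975)*SE  # CI with exact 95% critical value
round(c(level.2SE=level.2SE, p.value=p.value), 4)
rbind(CI.approx, CI.exact)
```

The two intervals differ only slightly, and both exclude zero, matching the conclusion of statistical significance at a 5% level.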
Practice 3.6 (mental math). Let θ̂ = −2.8, SE(θ̂) = 1.9; in your head, assess statistical significance and compute an approximate 95% CI.
3.9 Quantifying Uncertainty: Misinterpretation and Misuse
This section addresses misinterpretations and misuse of frequentist inference. Some of the most common problems are discussed below, as well as on the (pretty good) Wikipedia page devoted to the topic.^3
Practice 3.7 (frequentist or Bayesian). For each of the following, say whether it is a frequentist question, Bayesian question, neither, or both; if both, explain the two possible interpretations. Hint: use Section 3.1 as well as Section 3.8.
a) What's the probability that the current natural unemployment rate in the U.S. is between 4.5% and 7.5%?
b) Can we create a diagnostic tool for our company's daily website traffic data to identify whether it's normal or has been hacked, limiting the rate of falsely reporting "hacked" on normal days to only 1% of normal days?
c) What is the probability that the true unemployment rate is within 1 percentage point of the estimated unemployment rate?
d) Is the positive estimate θ̂ > 0 primarily due to the income effect or substitution effect?
3.9.1 Perils of Ignoring Non-iid Sampling
A CI justified by iid sampling may perform poorly when sampling is not iid. (Similar problems befall p-values and hypothesis tests.) Stratified sampling that is independent but not identically distributed (inid) is not too problematic, since it tends to make coverage probability higher than requested: CIs are "too long" but still have high probability of containing the true population value. In contrast, dependent sampling can make the actual coverage probability much lower than the requested confidence level. For example, maybe you ask for a 90% CI, but the CI's coverage probability is actually only 75%. Unfortunately, you cannot simply ask the computer for the true coverage probability, so you must carefully consider whether you have (in)dependent sampling.
^2 For example, a famous econometrician said, "I tell my students: if you can get a 5% test that controls the actual type I error rate below 10%, that's pretty good" (Jerry Hausman, April 6, 2019, keynote talk at Chinese Economists Society conference in Lawrence, KS).
^3 https://en.wikipedia.org/wiki/Misunderstandings_of_p-values
Example
Consider the following example with dependent sampling, with results simulated in R below. Imagine a "time use survey" that includes a question about watching television (TV). In hours per day, P(Y = j) = (4 − j)/10 for j = 0, 1, 2, 3. Imagine each household contains two adults who only watch TV together. That is, if individuals i and k live together, then Y_i = Y_k.
For comparison, imagine two samples are collected: one iid, one with clustered sampling. Sample A is collected iid, randomly sampling individuals from the population. Sample B is collected by randomly visiting households and surveying both individuals within the household. Sample A contains observations Y_i for i = 1, ..., n_A = 10. Sample B contains observations Y_i for i = 1, ..., n_B = 20, where individuals k and k + 10 live together (k = 1, ..., 10). Since Y_k = Y_{k+10} in Sample B, there are really only 10 observations; the other 10 are literally duplicates.
Samples A and B contain the same amount of information, so they should lead to the same amount of uncertainty (same CI). However, if (incorrectly) both are assumed iid, the larger sample size n_B > n_A is incorrectly interpreted as greater certainty and leads to incorrectly smaller CIs.
This TV example is shown in the following simulation. Many datasets are simulated. In each, a 90% CI is constructed for
\[ \mu_Y = \mathrm{E}(Y) = (4/10)(0) + (3/10)(1) + (2/10)(2) + (1/10)(3) = 10/10 = 1 . \]
Finally, the code reports the proportion of simulated datasets in which the CI contained the true value, i.e., the simulated coverage probability.
set.seed(112358)  # for replicability
vY <- 0:3         # possible values of Y
pY <- (4:1)/10    # P(Y=0), P(Y=1), P(Y=2), P(Y=3)
muY <- sum(pY*vY) # E(Y)
nA <- 10          # sample size for A; B's is twice this
CL <- 0.90        # 90% CI
NREP <- 1000      # number of simulated datasets
tmp <- data.frame(lo=rep(NA,NREP), hi=NA)
CIs <- list(A=tmp, B=tmp) # store both CIs for each dataset
for (irep in 1:NREP) {
  sampleA <- sample(x=vY, size=nA, replace=TRUE, prob=pY)
  sampleB <- rep(x=sample(x=vY, size=nA, replace=TRUE, prob=pY), times=2)
  CIs[['A']][irep,] <-
    t.test(x=sampleA, conf.level=CL, alternative='two.sided')$conf.int
  CIs[['B']][irep,] <-
    t.test(x=sampleB, conf.level=CL, alternative='two.sided')$conf.int
}
CP.A <- mean(CIs$A[,1]<=muY & muY<=CIs$A[,2])
CP.B <- mean(CIs$B[,1]<=muY & muY<=CIs$B[,2])
data.frame(conf.level=CL, CP.A=CP.A, CP.B=CP.B)
  conf.level  CP.A  CP.B
1        0.9 0.895 0.749
The Sample A CI is much better than the Sample B CI that incorrectly assumes iid sampling. "Better" means coverage probability is closer to the desired confidence level of 90%. Specifically, the simulated coverage probability (CP) of the Sample A CI is 89.5%, whereas the CP of the Sample B CI is 74.9%.
3.9.2 Not a Bayesian Belief
The p-value is often interpreted as the probability that the hypothesis H0 is true, but this is wrong. While intuitive, such an interpretation could only be possible in a Bayesian framework, not frequentist. For example, in the frequentist framework, either µ_Y ≤ 0 or not, whereas the Bayesian posterior describes our belief about the probability that µ_Y ≤ 0.
For example, a p-value of 0.08 does not mean "I believe there's an 8% chance that H0 is correct."
The example in Section 3.9.4 shows how your belief about H0 depends on more than just a frequentist hypothesis test or p-value, which only account for what happens if H0 is true.
3.9.3 Unlikely Events Happen (or: Use Common Sense)
As pointed out in an insightful webcomic (xkcd.com/1132), common sense and outside knowledge should be used when interpreting p-values and hypothesis tests. A small p-value alone (e.g., rejecting H0 at a 5% level) does not mean H0 is definitely false. It does not even mean that H0 is probably false. A p-value below 0.05 is observed 5% of the time even if H0 is true. A 5% chance is somewhat unlikely but far from rare, especially considering the thousands and thousands of p-values being computed every day. A small p-value just means it is unlikely to occur if H0 is true, but common sense sometimes suggests that it is even less likely that H0 is false, as in the comic.
Discussion Question 3.2 (equal p-values, equal belief?). Consider three examples from Berger (1985, p. 2), which he attributes to L. J. Savage. First, a person claims to be able to tell whether milk was poured into a cup of tea, or tea was poured into milk; in ten trials, the person guessed correctly each time. Second, a music expert claims to be able to tell whether a page of sheet music was written by Mozart or Haydn; in ten trials, the expert guessed correctly each time. Third, your drunk friend claims to be able to predict the outcome of a coin flip (heads or tails) from a fair coin (50% probability of each outcome); in ten trials, your friend is correct each time. Note that each "experiment" has a p-value of 2^{−10} ≈ 0.001, since guessing randomly could only get all correct with that probability (around 1/1000, 0.1%). After seeing all this data, do you have the same belief about whether each claim is true (i.e., do you think there's the same chance that each claim is true)? Why not?
3.9.4 Example of Ignoring Outside Knowledge
The following example illustrates ideas from Sections 3.9.2 and 3.9.3.
Table 3.3 shows the setup for a classic example in which frequentist hypothesis testing is misleading. It shows the disease status and test results for 1,000,000 people randomly sampled from the population. Dividing everything by 1,000,000 would yield a joint probability distribution table, but intuition is easier with people instead of probabilities. The table shows that the disease is uncommon, since only 1000 people have it, i.e., 1/1000 or 0.1%.
Table 3.3: Disease status and test result for 1,000,000 random people
                       | (Don't reject) Test − | (Reject) Test + | Total
(H0 true)  No disease  | 949,050               | 49,950          | 999,000
(H0 false) Disease     | 0                     | 1,000           | 1,000
Total                  | 949,050               | 50,950          | 1,000,000
Table 3.3 can be interpreted in hypothesis testing terms. The null hypothesis H0 is that somebody does not have the disease. A positive (+) test result means rejecting H0. Table 3.3 shows the type II error rate is zero: when H0 is false, the test is always correct (+). The type I error rate is 5%: when H0 is true, the test is wrong (+) with rate 49,950/999,000 = 5%.
The frequentist properties make the test sound very reliable: the type I error rate is controlled at the conventional 5% level, and the type II error rate is zero.
However, if you're a random person in the population, a positive test result should not make you think you have the disease. Given that you tested positive, you could be in one of two boxes in the second column of the table. That is, either you're one of the 49,950 people who tested positive but didn't have the disease, or one of the 1000 people who tested positive and did have the disease. Clearly 49,950 is much larger than 1000, so it's much more likely that you don't have the disease, even though you tested positive. That is, conditional on having tested positive, the probability of actually having the disease is still only 1000/50,950 = 1.96%, a very low probability.
Put in more general terms: even though our test rejected H0 at a 5% level, it is still much (much) more likely that H0 is true than false.
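The arithmetic in this example can be reproduced directly from the counts in Table 3.3. The following is a minimal sketch; the object and label names are made up for illustration. It computes the two frequentist error rates, the unconditional probability of disease, and the probability of disease conditional on a positive test.

```r
# Counts from Table 3.3 (rows: true status; columns: test result)
tab <- matrix(c(949050, 49950,
                     0,  1000),
              nrow=2, byrow=TRUE,
              dimnames=list(status=c('no.disease','disease'),
                            test=c('negative','positive')))
type1 <- tab['no.disease','positive'] / sum(tab['no.disease',]) # P(+ | no disease)
type2 <- tab['disease','negative'] / sum(tab['disease',])       # P(- | disease)
prior <- sum(tab['disease',]) / sum(tab)                        # P(disease)
posterior <- tab['disease','positive'] / sum(tab[,'positive'])  # P(disease | +)
round(c(type1=type1, type2=type2, prior=prior, posterior=posterior), 4)
```

The conditional probability is much larger than the unconditional 0.1%, so the positive test is informative, but it remains far below 50% because the disease is so uncommon.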
On the other hand, you may have gotten tested because you had all 17 characteristic symptoms of the disease. In that case, you're not exactly "a random person in the population." Your conclusion would be very different.
The conclusion is not "don't believe positive results" or "don't believe negative results," but rather "think about what else you know, to critically think about results."
3.9.5 Multiple Testing (Multiple Comparisons)
⇒ Kaplan video: Multiple Testing
Another insightful comic (xkcd.com/882) illustrates the multiple testing problem (or multiple comparisons problem). Essentially, the scientists keep testing whether a different color jelly bean (a candy) causes acne (a skin condition) until they finally find p < 0.05 and reject the null hypothesis of "no effect" at a 5% level. Since jelly beans do not actually cause acne (H0 is actually true), this "false positive" should happen roughly 5% of the time, or 1 in 20; knowing this, the comic shows them testing 20 different colors. The multiple testing problem is essentially that if you keep testing and testing, eventually you'll get a false positive, with p < 0.05 leading you to reject H0. As an analogy: even though there's a full moon less than 5% of all nights, as long as you keep looking up at the sky every night, eventually you'll see one.
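To get a rough sense of the magnitude, the probability of at least one false positive among 20 level 5% tests can be computed and simulated. This sketch relies on two assumptions not stated in the text: the 20 tests are independent, and each p-value is uniformly distributed between 0 and 1 when H0 is true (as with a continuous test statistic).

```r
set.seed(112358)  # for replicability
ALPHA  <- 0.05    # level of each individual test
NTESTS <- 20      # number of jelly bean colors tested
NREP   <- 5000    # number of simulated sets of 20 tests
# Exact probability of at least one rejection, assuming independent tests
p.any.exact <- 1 - (1 - ALPHA)^NTESTS
# Simulation: under H0, draw 20 uniform p-values and check for any p < ALPHA
any.reject <- replicate(NREP, any(runif(NTESTS) < ALPHA))
p.any.sim <- mean(any.reject)
c(exact=p.any.exact, simulated=p.any.sim)
```

Under these assumptions, at least one false positive is more likely than not, even though each individual test errs only 5% of the time.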
Practice 3.8 (research assistants). Imagine you're a powerful professor with a cadre of 100 research assistants (post-docs, grad students, undergrads, your neighbor's precocious high-schooler, etc.). You assign each research assistant (RA) one of 100 variables characterizing different counties in the U.S.: number of tennis courts, average temperature, per capita income, etc. Each RA collects a dataset with their particular variable and computes the correlation with county-level May 2020 COVID-19 rates. Each RA then computes a p-value for the null hypothesis that the correlation is zero. Of the 100 RAs, 5 report statistical significance at a 5% level (p < 0.05), including a significantly positive correlation with tennis courts and negative correlation with temperature. In light of multiple testing, how do you interpret these results?
Discussion Question 3.3 (jellybean solution). Consider the jelly bean comic from xkcd.com/882. In the comic, they essentially do hypothesis tests with a 5% level, rejecting H0 (no effect) if the p-value is below 0.05. Since they do 20 hypothesis tests, even if each individual test is unlikely to make an error, it's likely that at least one of the 20 tests makes an error.
a) Assuming jelly beans have zero effect, what type of error (I or II) is made by the green jelly bean test?
b) Would it help (i.e., make such an error less likely) to use 1% level tests instead of 5%? Explain why, why not, or how much it might help.
c) Would it be even better to use 0% level hypothesis tests? Explain why or why not.
Discussion Question 3.4 (nova). Consider again the comic from xkcd.com/1132 about the machine that detects if the sun has gone nova (exploded). The null hypothesis H0 is that the sun has not exploded.
a) Does the frequentist statistician correctly compute the p-value and correctly reject H0 at a 5% level? Why/not?
b) What type of error (I or II) does the Bayesian statistician bet has been made?
c) In this example, explain why it's almost certainly incorrect whenever the machine reports that the sun has exploded.
3.9.6 Publication Bias and Science
The jelly bean comic's final panel illustrates publication bias: the newspaper only reports the exciting positive result, omitting the 19 negative results for the other 19 colors. Not only popular media but even academic journals are more likely to publish positive results, so reading only published results gives a biased perspective.
The jelly bean experiments also illustrate the importance of remembering what "science" means. The result of a single study (even a good one) by itself is not science. The scientific method is a process of replication and repeated testing of hypotheses. If you ever hear, "There was this one new study that found [crazy result]," you can ignore it and wait till it gets replicated at least a few times.^4
3.9.7 Ignoring Point Estimates (Economic Significance)
Sometimes there is too much emphasis on p-values while the point estimates are ignored. This problem is mostly solved by simply looking at confidence intervals instead of only p-values and hypothesis tests.
While statistical significance (Section 3.8.4) assesses if the effect is statistically distinguishable from zero, economic significance assesses if the effect is "economically" distinguishable from zero. "Economically" just means "for real-world purposes," like whether it is important to consider for policy purposes. One way to think about this is: would you personally care about the difference? For example, imagine θ̂ estimates the effect on your final exam score of studying an additional hour per week. Would you (yes, you) care about having a final exam score that's θ̂ percentage points higher? If θ̂ = 0.01, then no; if θ̂ = 50, then yes. Some other examples: would you care if you had two additional years of education? Would you care if your annual salary were increased by five dollars?
Conceptually, like statistical significance, economic significance is not a binary yes/no but a continuum of "how much" economic significance. A result could have low economic significance, or extremely high economic significance, or moderate economic significance, etc.
^4 This is related to the "replication crisis": https://en.wikipedia.org/wiki/Replication_crisis
In practice, unlike with statistical significance, there is no conventional level (like 5%) to mindlessly apply, so you are forced to think critically. (This is good.)
It is important to consider units of measure. For example, imagine the estimated effect on income is θ̂ = 10; is that economically significant? If the units are dollars per hour, then yes; if it's dollars per year, then no; if it's thousands of dollars per month, then yes; etc.
It is also important to consider realistic policy changes. For example, imagine your estimated θ̂ is the effect of a one-unit increase in the proportion of the state budget allocated to higher education. If the current proportion is 0.08 (meaning 8%), then a realistic policy change would be something like 0.02 units. A one-unit increase would mean changing from 0% to 100% of the budget spent on higher education. Possibly θ̂ looks economically significant but 0.02θ̂ does not.
Examples
Simplifying "how much" significance to just low/high, there are four general possibilities: 1) both statistical and economic significance, 2) just statistical, 3) just economic, or 4) neither.
The following are examples of each possibility, in the example where θ̂ is an effect on annual income, measured in dollars per year.
• Both: θ̂ = 20,000, p = 0.001
• Only statistical: θ̂ = 10, p = 0.001
• Only economic: θ̂ = 20,000, p = 0.43
• Neither: θ̂ = 10, p = 0.43
Practice 3.9 (significance: distance and education). You observe a sample of married couples; for each, you observe the difference in their years of education divided by the difference in the distance between their childhood homes and the nearest college or university. That is, if E1 and E2 are the years of education, and D1 and D2 are the distances, you observe Y = (E2 − E1)/(D2 − D1). Distance is measured in kilometers (1 km ≈ 0.62 mi). You estimate the mean of Y to be −0.03. The p-value for testing H0: E(Y) = 0 is p = 0.03.
a) Is this estimate economically significant? Hint: consider the units of −0.03.
b) Is this statistically significant? Be precise.
3.9.8 Other Sources of Uncertainty
In practice, there are many sources of uncertainty, only one of which is the "statistical uncertainty" due to having a random sample. For example, there may be uncertainty about assumptions (in later chapters) required to interpret the population parameter in a certain way. There may be uncertainty about how reliable the data is. There may be uncertainty about whether sampling is iid.
The confidence intervals and other inference methods in Section 3.8 only quantify the statistical uncertainty from random sampling, not any other type of uncertainty. Thus, even if you have many other types of uncertainty, a CI could be very short, incorrectly suggesting you should feel very confident in the estimates.
For example, imagine you want to learn the mean household income in Kansas, but you only have data from Missouri. You decide to assume that Kansas has the same mean as Missouri. The Missouri dataset has very large n, so SE(Ȳ_n) ≈ 0. This makes it seem like we have zero uncertainty, but we may be very uncertain about the assumption that Missouri and Kansas are identical. Indeed, if the Kansas mean is very different, then our CI could have near 0% actual coverage probability.
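The Kansas and Missouri example can be sketched with a small simulation. Everything numerical here is hypothetical and chosen only for illustration (means of 60 and 62 in thousands of dollars, standard deviation 30, normally distributed incomes, n = 10,000); none of it comes from real data. A 95% CI is computed from each simulated "Missouri" sample, and the code reports how often the CI contains the Missouri mean and how often it contains the (different) Kansas mean.

```r
set.seed(112358)  # for replicability
mu.MO <- 60       # hypothetical Missouri mean income ($1000s)
mu.KS <- 62       # hypothetical Kansas mean income ($1000s)
n     <- 10000    # large sample, so the SE is small
NREP  <- 500      # number of simulated datasets
covers <- replicate(NREP, {
  y  <- rnorm(n=n, mean=mu.MO, sd=30)     # data are sampled from Missouri
  ci <- t.test(x=y, conf.level=0.95)$conf.int
  c(MO=(ci[1] <= mu.MO & mu.MO <= ci[2]), # does the CI contain the Missouri mean?
    KS=(ci[1] <= mu.KS & mu.KS <= ci[2])) # does the CI contain the Kansas mean?
})
CP <- rowMeans(covers)  # simulated coverage probabilities
CP
```

In this hypothetical setup, the CI is short enough that it essentially never contains the Kansas mean, even though its coverage of the Missouri mean is close to the nominal 95%.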
Practice 3.10 (uncertainty in Kansas). Imagine again θ is the mean household income in Kansas, but this time you have data from Kansas. Which of the following could make the actual coverage probability much lower than the nominal level, i.e., make the CI shorter than it would be to fully capture all your uncertainty? Why?
a) You have data from state tax returns, but only 70% of households filed a state tax return, and you worry these 70% may not be fully representative of the population.
b) Your sample size is only n = 10,000, so a different random sample could have resulted in different observed income values.
c) You have survey data from a representative sample, but you doubt people accurately report their household income.
d) You use a CI based on iid sampling, but honestly you just found the data online somewhere and didn't really understand if maybe it was clustered sampling.
e) You have data on individual adult income, which you then multiply by the average number of adults per household, but you only have the average number of adults per household from the year 2013 and think it might have changed.
3.10 Statistical Decision Theory
If you want to incorporate data more formally into your decisions, then you should learn more about statistical decision theory, which is beyond our scope. It is related to the optimal prediction material in Sections 2.4 and 2.5. There is both frequentist and Bayesian statistical decision theory. Either way, hypothesis testing is basically never the best way to make a decision using data. For example, see Berger (1985).
Discussion Question 3.5 (Ebola drug). Imagine you have data for a new drug that tries to cure Ebola, a disease with a high mortality rate. Assume that there are no other treatments available, and that without the drug an infected individual will die 100% of the time. With the drug, there is a possible side effect of occasional sneezing, and it possibly cures the disease (so the person does not die from it). You have a sample of 10 individuals infected with Ebola, and randomly picked 5 to take the experimental new drug and 5 to have no treatment. Of course, of the 5 without the drug, all 5 die. Of the 5 treated, 2 live and 3 die. You input your data into R and run a t-test with command t.test(x=c(0,0,0,0,0), y=c(1,1,0,0,0)), where 0 means dead and 1 means alive. R says the two-sided p-value for testing the null hypothesis that the drug has zero effect on mortality is p = 0.1778. H0 is not rejected even at a 10% level (let alone 5% or 1%), i.e., the result is not statistically significant at a 10% (or 5% or 1%) level.
a) If you then discovered that you had Ebola, would you take the drug? Why/not? Hint: did R compute the right p-value? What's the probability of 2 people living if the drug actually has zero effect on mortality?
b) What if not everybody died without the drug: the untreated group had 1 person live (among 5), and the treated group had 3 live (among 5), yielding a p-value of 0.2429. Would you take the drug if you were infected? Why/not? Hint: what does your loss function look like?
Empirical Exercises
Empirical Exercise EE3.1. The data are originally from Card (1995), with individual-level observations of wages, years of education, and other variables.
a. R only: run install.packages(c('wooldridge','survey')) to download and install those packages (if you have not already).
b. Load the card dataset.
R: load package wooldridge with command library(wooldridge) and a data.frame variable named card becomes available; the command ?card then shows you details about the dataset.
Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse card, clear
c. Compute the sample average of variable wage.
R: mean(card$wage)
Stata: mean wage (which also computes a 95% confidence interval)
d. Estimate the population mean, accounting for the sampling weights.
R: weighted.mean(x=card$wage, w=card$weight)
Stata: mean wage [pweight=weight] (also computes a 95% CI)
e. R only (since Stata reported this already): compute a two-sided 95% CI for the mean (ignoring weights) with t.test(x=card$wage, conf.level=0.95)
f. R only (since Stata reported this already): compute a two-sided 95% confidence interval for the mean (accounting for weights), first loading the survey package with library(survey) and then with commands
carddes <- svydesign(data=card, weights = ~weight, id = ~1)
svyret <- svymean(x = ~wage, design=carddes)
c(wmean=coef(svyret), SE=SE(svyret), CI=confint(svyret, level=0.95))
g. Compute a weighted 90% confidence interval for wage.
R: replace level=0.95 with level=0.90
Stata: add "option" level(90) to get mean wage [pweight=weight], level(90)
h. Optional: repeat computation of a point estimate and 95% confidence interval (without and with weights) for the mean of a different variable in the dataset.
R: part (c) computes the unweighted point estimate, part (d) computes the weighted point estimate, part (e) computes the unweighted CI, and part (f) computes the weighted CI.
Stata: part (c) computes both the unweighted point estimate and unweighted CI, and part (d) computes both the weighted point estimate and weighted CI.
Chapter 4
One Variable, Two Populations
⇒ Kaplan video: Chapter Introduction
Depends on: Chapters 2 and 3
Unit learning objectives for this chapter:
4.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
4.2. Describe and distinguish among descriptive, predictive, and causal questions, and among different approaches to learning about causality from data in economics [TLOs 3, 5, and 6]
4.3. Describe and interpret the elements of the primary statistical framework for understanding causality [TLO 3]
4.4. Assess whether a mean difference can be interpreted with causal meaning in a real-world example [TLO 6]
4.5. In R (or Stata): compute estimates of mean differences, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter:
• Structural and reduced form approaches: Lewbel (2019)
• Potential outcomes and SUTVA (Wikipedia)
• Causal inference intro (Masten video)
• Correlation vs. causation (Masten video)
• ATE (Masten video)
• Individual causal effects (Masten video)
• Potential outcomes example (Masten video)
• Counterfactuals (Masten video)
• Randomized experiments (Masten video)
• SUTVA and spillovers (Masten video)
• Empirical example: property rights effect (Masten video)
• Structural modeling advantages (Masten video)
• Potential outcomes and confounding (Lambert video)
With two populations, we can discuss not only description and prediction, but also causality. Foundational ideas introduced here are extended to regression in Part II.
Discussion Question 4.1 (DPC with two populations). Let Y denote the hourly wage of an individual in the U.S. Let Y^A be the wage of an individual without a college degree in the U.S., and Y^B the wage of an individual with a college degree.
a) How are means E(Y^A) and E(Y^B) more helpful for description than only E(Y)?
b) How could E(Y^A) and E(Y^B) be used to make better predictions than only E(Y)?
c) Why can't we interpret E(Y^B) − E(Y^A) as the causal effect of a college degree on wage? Hint: what other factors might make E(Y^B) − E(Y^A) large even if the effect of a college degree itself is small?
4.1 Description
4.1.1 Population Mean Difference
Let Y^A and Y^B be random variables representing Y (e.g., income) for two populations (labeled A and B). For example, if Y is income, A is the population of individuals without a high-school degree, and B is the population of individuals with a high-school degree, then Y^A is income for individuals who do not have a high-school degree, and Y^B is income for those who do.
The difference of means is E(Y^B) − E(Y^A). It describes how much higher (or lower, if negative) the mean is in population B than in population A.
For example, let Y ∈ {0, 1, 2} be the number of kids per family. Let the distributions in populations A and B be, respectively,
\[
\begin{aligned}
&\mathrm{P}(Y^A = 0) = 0.8, \quad \mathrm{P}(Y^A = 1) = 0.2, \quad \mathrm{P}(Y^A = 2) = 0, \\
&\mathrm{P}(Y^B = 0) = \mathrm{P}(Y^B = 1) = \mathrm{P}(Y^B = 2) = 1/3,
\end{aligned}
\tag{4.1}
\]
where Y^A represents the number of kids per family in population A, and Y^B represents the number of kids per family in population B.
Then,
E(Y^B) − E(Y^A) = \sum_{y=0}^{2} y P(Y^B = y) − \sum_{y=0}^{2} y P(Y^A = y)
= [(0)(1/3) + (1)(1/3) + (2)(1/3)] − [(0)(0.8) + (1)(0.2) + (2)(0)]
= [(1/3) + (2/3)] − 0.2 = 0.8.
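The arithmetic above can be checked with a few lines of R. This is only a minimal sketch: the variable names (`y`, `pA`, `pB`, `EA`, `EB`, `mean_diff`) are illustrative choices, not notation from the text.

```r
# Support and probability mass functions from (4.1)
y  <- 0:2
pA <- c(0.8, 0.2, 0)   # P(Y^A = 0), P(Y^A = 1), P(Y^A = 2)
pB <- rep(1/3, 3)      # P(Y^B = 0), P(Y^B = 1), P(Y^B = 2)

# Each population mean is the probability-weighted sum of possible values
EA <- sum(y * pA)
EB <- sum(y * pB)

# Difference of means: population B minus population A
mean_diff <- EB - EA
print(c(EA = EA, EB = EB, diff = mean_diff))
```

Writing the subtraction as `EB - EA` keeps the direction of the comparison explicit in the code itself.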
Always clarify whether you are subtracting the mean of population A from that of population B, or B from A. If you say, “The difference in mean number of children between the populations is 0.8,” it is unclear which population has a larger mean. Instead, say, “The mean number of children in population B is 0.8 higher than the mean in population A.”
The difference of means is also the mean of the differences. “Mean difference” could mean either; they’re equal anyway. Because of the linearity of the expectation operator, as in (2.21),
E(Y^B − Y^A) = E(Y^B) − E(Y^A). (4.2)
Despite mathematical equality, the interpretation differs. For example, the expression Y^B − Y^A is the difference in the number of children between a family from population B and a family from population A. Seeing Y^B and Y^A as random variables, the difference Y^B − Y^A is itself a random variable. Thus, E(Y^B − Y^A) is the population mean of the child number difference Y^B − Y^A, whereas E(Y^B) − E(Y^A) is the difference between the mean number of children in B and the mean number of children in A. Generally, due to (4.2), either interpretation of the mean difference is correct: the same population value has two interpretations. It’s like if one person says, “The glass is half full of water,” and a second person says, “The glass is half empty”: both are correct interpretations of the same thing.
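As a rough numerical illustration of (4.2), the sketch below draws one family from each population many times and compares the mean of the differences to the difference of the means. The seed, the sample size `n`, and the pairing of draws are arbitrary assumptions made only for this illustration.

```r
set.seed(1)   # arbitrary seed, for replicability only
n <- 1000     # hypothetical number of (A family, B family) pairs

# Draw from the two distributions in the kids-per-family example
YA <- sample(0:2, size = n, replace = TRUE, prob = c(0.8, 0.2, 0))
YB <- sample(0:2, size = n, replace = TRUE, prob = rep(1/3, 3))

lhs <- mean(YB - YA)          # sample analog of E(Y^B - Y^A)
rhs <- mean(YB) - mean(YA)    # sample analog of E(Y^B) - E(Y^A)
print(c(mean_of_differences = lhs, difference_of_means = rhs))
```

With equal sample sizes, the two numbers agree because the sample mean is linear too, just like the expectation operator.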
4.1.2 Estimation
Simply estimate the means separately (see Section 3.4) and take the difference, like \bar{Y}^B − \bar{Y}^A with iid data. If each individual estimator is consistent, then this is a consistent estimator of E(Y^B) − E(Y^A), and thus a consistent estimator of E(Y^B − Y^A) due to (4.2).
The following code estimates E(Y^B) − E(Y^A) from simulated iid samples, suggesting population B has a lower mean.
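set.seed(112358)
# simulated iid sample from population A: uniform on 10, ..., 20
YA <- 0+sample(x=10:20, size=40, replace=TRUE, prob=rep(1/11,11))
# simulated iid sample from population B: 2 plus a draw from 0, ..., 25 with decreasing probabilities
YB <- 2+sample(x=0:25, size=30, replace=TRUE, prob=26:1/sum(1:26))
# difference of sample means, B minus A
mean(YB) - mean(YA)
[1] -2.99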
4.1.3 Quantifying Uncertainty
The same approaches (and warnings) from Sections 3.8 and 3.9 apply to θ = E(Y^B) − E(Y^A).
The following R code shows 95% confidence intervals for the mean difference with iid data.
set.seed(112358)
YA <- 0+sample(x=10:20, size=40, replace=TRUE, prob=rep(1/11,11))
YB <- 2+sample(x=0:25, size=30, replace=TRUE, prob=26:1/sum(1:26))
# 95% CI for mean diff
# (with x=YA and y=YB, t.test reports the interval for mean of A minus mean of B)
round(t.test(x=YA, y=YB, alternative='two.sided',
  mu=0, conf.level=0.95)$conf.int[1:2], digits=2)
[1] 0.44 5.55
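To see what `t.test` is doing, the sketch below recomputes the interval by hand, this time oriented as B minus A to match θ = E(Y^B) − E(Y^A). It assumes the default unequal-variance (Welch) version of the two-sample t interval. The simulated data follow the same recipe as above, and names like `ci_manual` are illustrative.

```r
set.seed(112358)
YA <- 0 + sample(x = 10:20, size = 40, replace = TRUE, prob = rep(1/11, 11))
YB <- 2 + sample(x = 0:25, size = 30, replace = TRUE, prob = 26:1 / sum(1:26))

nA <- length(YA); nB <- length(YB)
vA <- var(YA) / nA            # estimated variance of the sample mean of A
vB <- var(YB) / nB            # estimated variance of the sample mean of B

est <- mean(YB) - mean(YA)    # point estimate of E(Y^B) - E(Y^A)
se  <- sqrt(vA + vB)          # standard error of the difference
# Welch-Satterthwaite approximate degrees of freedom
df  <- (vA + vB)^2 / (vA^2 / (nA - 1) + vB^2 / (nB - 1))
ci_manual <- est + c(-1, 1) * qt(0.975, df) * se

# Same interval from t.test, with x = YB so the difference is B minus A
ci_ttest <- as.numeric(t.test(x = YB, y = YA, conf.level = 0.95)$conf.int)
print(rbind(manual = ci_manual, t.test = ci_ttest))
```

Swapping the `x` and `y` arguments flips the sign of both endpoints, which is another reason to state clearly which mean is subtracted from which.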
4.2 Prediction
Prediction is essentially the same as with one population. Given a loss function, an optimal predictor can be defined to minimize mean loss in the population, and this optimal predictor can be estimated from data. For example, mean quadratic loss is minimized by the population mean, and the means E(Y^A) and E(Y^B) can be estimated by (weighted) sample means.
Prediction accuracy improves by distinguishing between individuals (or firms, etc.) from population A and those from population B. For example, at your carnival job, imagine you now guess people’s height instead of age. In Chapter 2, you made the same guess for everybody. Now, we consider two populations, like child and adult (assuming this is observable). Now we can make a different prediction for each population, like 165 cm for adults and 105 cm for children. This should perform better than guessing 135 cm for every individual.
Part II extends this idea, exploring how regression models can incorporate additional information to improve prediction accuracy.
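The carnival height example can be sketched in a small simulation. Everything numerical here other than the 165 cm and 105 cm group means is an assumption for illustration: the 50/50 split of adults and children, the 10 cm standard deviation, and the sample size.

```r
set.seed(42)   # arbitrary seed
n <- 10000     # hypothetical number of carnival visitors

# Assumed: half adults, half children; heights normal with sd 10 cm
adult  <- rbinom(n, size = 1, prob = 0.5)
height <- ifelse(adult == 1,
                 rnorm(n, mean = 165, sd = 10),
                 rnorm(n, mean = 105, sd = 10))

# One guess for everybody: the overall sample mean (near 135 cm here)
pooled_guess <- mean(height)

# A separate guess for each population: the group sample means
group_guess <- ifelse(adult == 1,
                      mean(height[adult == 1]),
                      mean(height[adult == 0]))

# Mean quadratic loss (mean squared error) of each prediction rule
mse_pooled <- mean((height - pooled_guess)^2)
mse_group  <- mean((height - group_guess)^2)
print(c(pooled = mse_pooled, by_group = mse_group))
```

Under these assumptions the group-specific guesses have much lower mean quadratic loss, because the pooled guess is far from most people in both groups.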
4.3 Causality: Overview
The concepts in the remainder of this chapter appear often in later chapters.
First, when is causality important, rather than description or prediction? We each have an innate sense of cause and effect. Trying to articulate it in language sometimes creates more confusion than understanding.¹ For example, start reading the Wikipedia page on causality and see how you feel in 10 minutes. Unlike description and prediction, causality is about “why.” A “cause” is the “because” of the effect. Description helps us see which variables tend to have high or low values together. Prediction helps us guess one variable’s value based on other information. But only causality concerns why: why do these two variables tend to have similar values? Only causality (not description or prediction) helps inform policy decisions: we want to know how a policy change itself influences other variables, causing them to change.
¹ Some of my failed attempts include “causality is about what will happen if a policy changes” (but isn’t “what will happen” prediction?) and “description is seeing how things are” (but aren’t causal relationships also “how things are”?).
Discussion Question 4.2 (description, prediction, causality). Which type of question (description, prediction, causality) is each of the following? Explain why. Hint: there’s one of each.
a) If you only know whether an individual is from Canada or the U.S., what is your best guess of their income?
b) You are currently working in the U.S. but considering moving to Canada. How will your income change if you do?
c) Which country’s population has higher income: Canada or the U.S.?
4.3.1 Correlation Does Not Imply Causation
⇒ Kaplan video: Correlation Does Not Imply Causation
Generally, imagine E(Y^B) > E(Y^A). This shows a clear descriptive relationship: population B has a higher mean. The implication for prediction is clear: under quadratic loss, the optimal prediction is higher for population B than A. In contrast, the implication for causality is not clear. It’s possible that being in population B has a positive causal effect on the outcome variable. But it’s also possible that people with large Y choose to join population B. Or maybe there is something else altogether that separately causes people to join population B and have high Y. Or maybe all of these. The causal interpretation of E(Y^B) > E(Y^A) is ambiguous.
For example, consider rainfall and umbrellas. Let Y^A denote rainfall when nobody is carrying an umbrella, and Y^B rainfall when everybody is carrying an umbrella. For description: it rains more on days when everyone carries an umbrella than on days when nobody does, e.g., E(Y^B) > E(Y^A). For prediction: it’s better to predict a higher rainfall value if you see everyone carrying an umbrella than if you see no umbrellas; e.g., under quadratic loss, the optimal predictions are E(Y^B) and E(Y^A). For causality: if there’s a drought and we want rain, should we all walk around with umbrellas to cause it to rain? No: rain causes umbrella-carrying, not vice-versa.
Consider another example with a different type of causal relationship. Let Y^A be my commute time when nobody is carrying umbrellas, and let Y^B be my commute time when everyone is carrying umbrellas. Descriptively, E(Y^B) > E(Y^A), and you should predict a longer commute time if you see everybody has an umbrella. But causally, this doesn’t mean that you can make me late for class by opening lots of umbrellas. Rain is a confounder that has a causal effect on both umbrella-carrying and commute time, as depicted in Figure 4.1.
[Figure 4.1: Causal relationships among rain, umbrella-carrying, and commute time. Arrows run from Rain to Umbrellas (+) and from Rain to Commute Time (+).]
These examples illustrate the famous saying, “correlation does not imply causation.”² The saying is a bit imprecise: correlation does indeed imply some sort of causal relationship, just not any one particular type of causal relationship. In the first example, “correlation does not imply causation” means that “higher rainfall when people carry umbrellas” (rain is correlated with umbrellas) does not imply “carrying umbrellas causes rain.” But the correlation is ultimately driven by a causal relationship: rain causes umbrella-carrying. In the second example, “correlation does not imply causation” means that “longer commute when people carry umbrellas” (commute time is correlated with umbrellas) does not imply “carrying umbrellas causes longer commutes.” But the correlation is ultimately driven by causal relationships: rain causes both umbrella-carrying and longer commutes.
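The commute example can be mimicked in a simulation where rain is the only cause of both umbrella-carrying and longer commutes. All numbers below (the 30% chance of rain, the umbrella probabilities, the 30-minute baseline, the 10-minute rain effect, the noise) are made-up assumptions; umbrellas have no effect on commute time by construction.

```r
set.seed(2021)   # arbitrary seed
n <- 5000        # hypothetical number of days

rain <- rbinom(n, 1, 0.3)   # assumed: rain on 30% of days

# Commute time depends on rain only (never on umbrellas)
commute <- function(rain) 30 + 10 * rain + rnorm(length(rain), sd = 3)

# Observed world: rain causes umbrella-carrying
umbrella_obs <- rbinom(n, 1, ifelse(rain == 1, 0.9, 0.05))
time_obs <- commute(rain)
diff_obs <- mean(time_obs[umbrella_obs == 1]) - mean(time_obs[umbrella_obs == 0])

# Hypothetical intervention: umbrellas assigned by coin flip, regardless of rain
umbrella_rand <- rbinom(n, 1, 0.5)
time_rand <- commute(rain)
diff_rand <- mean(time_rand[umbrella_rand == 1]) - mean(time_rand[umbrella_rand == 0])

print(c(observed = diff_obs, coin_flip = diff_rand))
```

In the observed data the umbrella days have clearly longer commutes, yet when umbrellas are assigned independently of rain the difference is near zero, which is what “umbrellas do not cause longer commutes” means here.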
Although common sense helps us see that umbrellas cannot cause longer commutes, similar arguments are often made. For example, in the 2018 August election in Missouri, a “right-to-work” proposition appeared on the ballot. Very roughly speaking, such laws restrict the power of unions to collect certain fees from certain employees, but the following discussion about causality does not depend on the details.³ Before the election, one mailer ad opposing right-to-work said something like, “Do you want $8000 less in your pocket each year?” The implication is that, were the law to pass, the causal effect would be a decrease in annual income of $8000/yr. According to the ad’s footnote, this $8000/yr was computed as the difference in workers’ mean annual income between states that had a right-to-work law and those that did not, i.e., an estimate of E(Y^B) − E(Y^A). Recall that E(Y^B) − E(Y^A) ≠ 0 in the example with umbrellas and commute time, too, but we did not conclude that umbrellas have a causal effect on commute time.⁴ For example, maybe having lower income causes states to pass such laws, i.e., causality is in the opposite direction (reverse causality). Or, maybe there is a third, unobserved characteristic that causes states to pass such laws and causes lower income, i.e., a confounder like rain in the commute example. Of course, it is also possible that $8000/yr really is the causal effect. The point is not that the number is right or wrong (or that the law is good or bad), but that the econometric argument is incomplete. Additional assumptions are required to interpret a mean difference as a causal effect. Such assumptions are discussed more in Section 4.6.
² https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
³ But if you’re curious: https://en.wikipedia.org/wiki/Right-to-work_law
⁴ Just to clarify: this is not a discussion of whether the law itself is good or bad, or whether the groups supporting or opposing the law are good or bad; ads endorsing right-to-work also made errors, just not illustrative econometric errors.
4.3.2 Structural and Reduced Form Approaches
There are two econometric approaches to learning about causality: the reduced form approach and the structural approach. Confusingly, the reduced form approach is sometimes called causal inference, even though the structural approach also aims to learn about causality. Also confusingly, the reduced form approach is commonly associated with program evaluation, like assessing the effects of a job training program or welfare program, but the structural approach could also be used.
Both approaches consider counterfactual analysis, but in different ways. Broadly, a counterfactual is a universe that’s different than our actual universe. Usually, the counterfactual universe is nearly identical to our actual universe, except for one particular policy whose effect we want to learn. The reduced form approach often considers the counterfactual in which a real policy change never happened, e.g., what would the unemployment rate have been if the minimum wage had not increased by $2/hr (which, in reality, it actually did)? The structural approach often tries to learn about underlying causal economic mechanisms, to be able to analyze policies beyond what we have seen historically. For example, maybe the sales tax has never been above 10%, but we want to learn about what might happen if we raise sales tax to 12%.
The reduced form approach tries to isolate causal effects by using comparisons that are either randomized or “as good as randomized.” In our current context of populations A and B, randomized would mean that units (e.g., individuals, firms, hospitals) are randomly assigned to a population, without regard to the units’ characteristics. The “treated” population would receive some special treatment that the “untreated” (“control”) population does not. “As good as randomized” would mean that although we did not explicitly randomly assign units to each population, the actual assignment mechanism did not depend on units’ characteristics anyway. These situations are often called natural experiments; they often arise from unexpected weather or disease outbreaks (like COVID-19) or capricious political decisions. Sometimes other variables are used to help find these “as good as randomized” comparisons, or to reduce statistical uncertainty, but the core methodology remains the comparison of treated and untreated units to estimate the causal effect of a particular “treatment” (a variable or policy). The actual underlying causal mechanisms that produce the effect are not modeled, like a black box.
In contrast, the structural approach tries to explicitly model the inner workings of causal systems. Some “structural” models are not particularly detailed, but some involve models of decision-making (like expected utility maximization), models of market equilibria, models from game theory, and other economic theory. Consequently, the structural approach tries to estimate structural parameters (or “deep parameters”) that govern economic behavior and equilibria, like elasticities, discount factors, risk aversion, and demand curves. The hope is that all this “structure” would not change under the set of policies being considered, i.e., that the policies may change the values of certain variables but not change these underlying economic relationships.
The structural and reduced form approaches have complementary advantages, and often both are helpful; e.g., see the survey by Lewbel (2019). Both have contributed to our understanding of economics. Often structural models require stronger (less realistic) assumptions, but in return they can analyze a wider variety of possible policies. Conversely, the reduced form approach often has the advantage in analyzing the effects of existing policies, but it is more difficult to extrapolate to hypothetical policies.
For example, consider the relationship between a woman’s education and fertility (number of children born). A reduced form approach might try to find a group of women with college degrees and a group without, where it seemed “as good as random” who had a degree. This is difficult because usually going to college is a carefully considered decision and not one that we can force others to make in a randomized fashion, but there may be peculiar situations, like a need-based college scholarship that had to be randomized because too many people applied, or the Cultural Revolution that suddenly shut down universities in 1966. A structural approach might try to model the different factors affecting fertility choice and the different channels through which education could have an effect. This is difficult because it is a very complex decision with many variables involved. The benefit is the ability to consider more possible policies with more nuance; for example, the effect on fertility of more women attending college may depend on whether the increased education is due to a reduced price of college (e.g., due to government subsidy), or greater incentive due to higher salaries for jobs requiring college education, or some other reason.
In Sum: Structural & Reduced Form Approaches
Reduced form: randomized or “as good as randomized” comparisons to isolate causality
Structural: more explicit economic models of causal relationships
4.3.3 General Equilibrium and Partial Equilibrium
Besides structural vs. reduced form, another dichotomy is between general equilibrium (GE) and partial equilibrium (PE) analysis. GE more ambitiously tries to model entire markets, sometimes multiple markets, whereas PE takes current market equilibria as given. Similar to the tradeoff between the structural and reduced form approaches, the tradeoff is that the GE framework can analyze policies that change equilibria (i.e., that have general equilibrium effects), but it requires stronger assumptions to do so.
For example, imagine you were analyzing the impact of free public childcare on mothers’ employment. A PE analysis would consider how mothers might respond to different childcare policies given the current prices of private childcare, current wages, etc. A GE analysis might further model the childcare and labor markets, to allow for the possible general equilibrium effects of public childcare policy on the prices in those markets. If there is a big expansion of free public childcare, then private childcare providers may indeed change their prices. If the expansion allows many mothers to enter the workforce, then the labor supply curve shifts out, which could lower wages. However, if the proposed changes to childcare policy are relatively small, then such GE effects may be negligible, and PE analysis may suffice.
The famous Lucas critique (Lucas, 1976) argues in part that macroeconomic policy analysis requires structural GE models. Lucas writes (p. 41), “Given that the structure of an econometric model consists of optimal decision rules of economic agents, and that optimal decision rules vary systematically with changes in the structure of series relevant to the decision maker, it follows that any change in policy will systematically alter the structure of econometric models.” If we want to guess how people and firms will behave in the future under new macroeconomic policies, we have to account for GE effects, which requires deeper structural understanding and modeling of economic behavior.
In Sum: General & Partial Equilibrium Models
Partial equilibrium models treat prices and other market equilibria as fixed, whereas general equilibrium models allow markets to change.
4.4 Potential Outcomes Framework
⇒ Kaplan video: Potential Outcomes and the ATE
The reduced form approach uses the potential outcomes framework, also called the Neyman–Rubin causal model after its two earliest contributors (although sometimes Neyman’s name is dropped). It is popular not only in economics but statistics, medicine, political science, and other fields.
The terms treatment and treatment effect just refer to any variable and its causal effect on another variable. In English, usually “treatment” makes us think narrowly about medicine (or lumber, and facials), but it can be anything. For example, the “treatment” could be a job training program, and the “treatment effect” is the causal effect of the program on a person’s wage. Or, a treatment could be going to a charter school (instead of public school). Another treatment could be a policy or law, like a higher sales tax or a certain labor law.
This section says “individual” to be more concrete, but you can also imagine a firm, county, school, etc.
In Sum: Causality in Potential Outcomes Framework
Treatment effect: the difference in outcomes between parallel universes identical except for treatment
4.4.1 Potential Outcomes
Imagine two parallel universes. The universes are identical except for one difference: whether or not an individual is treated. The individual’s outcome in the universe without treatment is their untreated potential outcome, and the individual’s outcome in the universe with treatment is their treated potential outcome.
Notationally, in this chapter, Y^T represents the treated potential outcome and Y^U the untreated potential outcome. Elsewhere, often Y_1 and Y_0 represent the treated and untreated potential outcomes, or Y(1) and Y(0).
For example, consider parallel universes identical except for whether a particular student takes introductory econometrics (this class) or Intro Stat II (STAT 3500). Literally everything else in each universe is identical: the student’s parents, her other classes, her height, her DNA, the weather on October 14, etc. (For now, some difficulties with “everything” are glossed over; e.g., what if econometrics is required for her degree?) The “treatment” is taking econometrics (instead of statistics). The outcome variable is the student’s annual income five years after graduation, in thousands of U.S. dollars per year (e.g., Y = 70 is $70,000/yr). Let Y^U denote her outcome in the universe without treatment (STAT 3500), and Y^T her outcome in the universe with treatment (econometrics). That is, Y^T is her treated potential outcome, and Y^U is her untreated potential outcome.
Unlike in Section 4.1, potential outcomes Y^U and Y^T are not always observable. In the above example, if a student takes STAT 3500, then we can observe her untreated potential outcome Y^U but not Y^T; conversely, if she takes econometrics, then her treated potential outcome Y^T is observable but not Y^U. This partial observability makes causal inference more difficult than description or prediction.
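Partial observability can be made concrete with simulated data, where (unlike in reality) both potential outcomes are known to us. The sketch below is only an illustration under made-up assumptions: the income distributions, the distribution of individual treatment effects, and both assignment rules (a coin flip, and self-selection in which students with higher Y^U are more likely to take econometrics) are hypothetical.

```r
set.seed(7)   # arbitrary seed
n  <- 100000  # hypothetical number of students

# Both potential outcomes (thousands of $/yr); only the simulation sees both
YU <- rnorm(n, mean = 60, sd = 10)       # outcome if taking STAT 3500
YT <- YU + rnorm(n, mean = 5, sd = 5)    # outcome if taking econometrics
ate <- mean(YT - YU)                     # mean treatment effect in this sample

# Coin-flip assignment: we observe Y^T if treated (D = 1), else Y^U
D_rand <- rbinom(n, 1, 0.5)
Y_rand <- ifelse(D_rand == 1, YT, YU)
diff_rand <- mean(Y_rand[D_rand == 1]) - mean(Y_rand[D_rand == 0])

# Self-selection: higher Y^U makes taking econometrics more likely
D_sel <- rbinom(n, 1, plogis((YU - 60) / 10))
Y_sel <- ifelse(D_sel == 1, YT, YU)
diff_sel <- mean(Y_sel[D_sel == 1]) - mean(Y_sel[D_sel == 0])

print(c(ATE = ate, coin_flip = diff_rand, self_selected = diff_sel))
```

In both scenarios only one potential outcome per student ends up in the observed variable. Under these assumptions the observed mean difference is close to the mean treatment effect with coin-flip assignment, but overstates it with self-selection.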
Consider some other potential outcomes examples. In the right-to-work example, Y^T is an individual’s income in the universe where the individual’s state has a right-to-work law, and Y^U is their income in the universe that’s identical except there is no such law. In our universe, either the individual’s state does or does not currently have such a law; it cannot be both, so we cannot observe both potential outcomes. (Perhaps the state did not have the law last year and does this year, but the universe “last year” was different in many ways than the universe “this year”; much more than one single law has changed.)
As another example, imagine universe B is where a student wins the lottery to enter a popular charter school, and universe A is where the student remains in the conventional public school. Potential outcomes Y^T and Y^U are dummy (binary) variables for whether or not the student eventually graduated from college in each universe. Again, in our universe, we can observe Y^T if the student wins the lottery, and Y^U if not, but we cannot observe both.
4.4.2 Treatment Effects
The difference Y^T − Y^U between an individual’s two potential outcomes is their treatment effect. Just as different individuals can have different (Y^U, Y^T), individuals can have different treatment effects Y^T − Y^U, i.e., individuals can be affected differently by the same treatment.
In the intro econometrics example, the student’s treatment effect Y^T − Y^U has the following interpretation. Recall Y^T is their income after taking econometrics, and Y^U after instead taking STAT 3500. Thus, that particular student’s treatment effect is how much higher (or lower, if negative) their income is in the parallel universe that is identical other than taking econometrics instead of STAT 3500.
In the right-to-work example, Y^T − Y^U is the treatment effect of the law on an individual’s income. The interpretation now is the difference between their income in the universe with the law and the universe without the law, with everything else held constant. The treatment effect can be big or small, positive or negative (or zero). A numerical example is shown later in Table 4.2.
In the charter school example, Y^T − Y^U is the treatment effect of the charter school on college graduation. That is, it is the difference between the college graduation outcomes in the charter school universe and the public school universe. Since the outcome is binary (1 if graduate college, 0 if don’t), there are only four possible values of (Y^U, Y^T) (student types), and only three possible treatment effect values: Y^T − Y^U = 1 if the student graduates in the charter school universe (Y^T = 1) but not the public school universe (Y^U = 0); Y^T − Y^U = −1 if they only graduate in the public school universe (Y^U = 1) but not the charter school universe (Y^T = 0); and Y^T − Y^U = 0 if they graduate either in both universes (Y^T = Y^U = 1) or neither (Y^T = Y^U = 0). This is seen in the later example of Table 4.1.
In all examples, the potential outcomes and treatment effects may be different for different individuals. For example, econometrics may be much better for some students but only a little better for others; right-to-work may help certain workers but hurt others; the charter school may make the difference for some students to graduate college, but others would have graduated either way. The fancy term for people being different is heterogeneity; more specifically here, “treatment effect heterogeneity.”
In economics, where many systems are interrelated, sometimes it’s difficult just to specify what “effect” we care about. For example, consider racial differences in salary. In the parallel universe that’s “identical” except for the individual’s race, does “identical” include having the same job at the same firm? Or does it allow for an effect of race on hiring? Does it allow for an effect on educational opportunities, or an effect on family background (parents’ education, wealth, etc.)? There is no “right” or “wrong” specification, but each answers a different question.
4.4.3 SUTVA
SUTVA: Definition
The potential outcomes definition of causality relies critically on the stable unit treatment value assumption (SUTVA), which has two parts.
The first part of SUTVA is that every treated individual receives the same treatment. This seems true in the right-to-work example: the same law applies (or doesn’t) to everybody equally. This also seems true in the charter school example, but with more nuance. Two students may go to the same school but have very different experiences, like different teachers, different classmates, different electives, and different extra-curricular activities. Even if we say these two students nominally have the “same treatment,” we should expect a lot of heterogeneity, and we should expect treatment effects to change every year as the school adds or removes (or changes) its teachers, its students, its elective class offerings, and its extra-curricular activities. As another ambiguous example, if there’s a one-on-one mentoring program to help teen parents, but of course there are many different mentors, is every teen parent receiving the “same treatment”?
The second part of SUTVA is the no interference assumption. This assumes that one person’s treatment (or non-treatment) does not affect the potential outcomes of any other person. This often makes sense for medical treatments (e.g., doing surgery on me doesn’t affect your health), but it requires careful thought in economics, where often individuals are interacting, either personally or through markets. In the charter school example, if a student’s success depends on being surrounded by other highly motivated students (or not), then SUTVA (specifically no interference) is violated. That is, one student’s outcome depends on whether the other motivated students are in the same school (whether charter or not), i.e., depends on the other students’ “treatment.”
The “same treatment” ambiguity also relates to the structural and reduced form differences in Section 4.3.2. In the charter school example, the structural critique would be that learning “the effect” of “going to charter school B” last year is not particularly helpful for guiding educational policy if we can’t confidently extrapolate from charter school B last year to charter school B next year, let alone extrapolate to other charter schools, let alone understand why the effects are positive or negative (e.g., is it because of teachers, or because of better classmates, or more electives and activities?). The reduced form rebuttal would be that at least they can (sometimes) be pretty confident about their assessment of a particular school in a particular year, whereas trying to explicitly model the effects of teachers and classmates and classes and activities will result in models nobody believes anyway. Hopefully, we could learn more by trying both approaches than giving up and trying neither.
SUTVA: Violations
SUTVA can be violated in many ways, especially in economics. This is not about sampling or randomization or data; it is about the potential outcomes framework itself. Even if SUTVA is satisfied and treatment effects are well-defined, it is possible to have problems with randomization that make it impossible to actually learn about treatment effects. Conversely, even if there is a perfectly designed randomized experiment, SUTVA could be violated, in which case it may be unclear what “treatment effect” even means.
One violation of SUTVA is from spillover effects. For example, if the treatment provides helpful information (e.g., about financial planning, or social services, or risk probabilities), treated individuals may share such information with their untreated friends. That is, the benefit of the treatment “spills over” into untreated individuals. This could be true even if the treatment (information or otherwise) isn’t directly shared. For example, if the provided information leads to less binge drinking among treated individuals, this may reduce social pressure, which results in less binge drinking among untreated individuals even if they did not receive the “treatment” information. Or, if some treatment helps half the students in a classroom, their improvement itself may benefit their untreated classmates.
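The classroom spillover story can be sketched with a toy simulation. The numbers are purely hypothetical assumptions: a direct benefit of 4 points for treated students, and a spillover benefit for untreated students equal to 2 points times the share of students treated.

```r
set.seed(99)   # arbitrary seed
n <- 10000     # hypothetical number of students

direct <- 4    # assumed direct effect of treatment on a treated student
spill  <- 2    # assumed spillover to an untreated student if everyone else were treated

D    <- rbinom(n, 1, 0.5)          # half the students treated, by coin flip
base <- rnorm(n, mean = 50, sd = 5)  # outcome if nobody at all were treated

# Untreated students benefit in proportion to the share treated,
# so their outcome depends on OTHER students' treatment (interference)
Y <- base + direct * D + spill * mean(D) * (1 - D)

# Treated-minus-untreated comparison, even with randomization
naive <- mean(Y[D == 1]) - mean(Y[D == 0])
print(c(naive_difference = naive, direct_effect = direct))
```

Under these assumptions the treated-versus-untreated comparison is about 3 rather than 4, because the untreated students are not a picture of the “nobody treated” universe. This is one sense in which “treatment effect” becomes ambiguous when no interference fails.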
Another violation of SUTVA is from general equilibrium effects (Section 4.3.3). For example, maybe the treatment is a new agricultural technology hoping to increase cacao farmers’ earnings. If only one farmer gets this treatment (technology), then she benefits from increased production, selling more cacao at the current global price. But, if all farmers in the world get the technology, then the global cacao supply curve shifts and the price drops. Thus, each farmer’s untreated and treated potential outcomes (earnings) are affected by all other farmers’ treatment, which affects the market equilibrium price. Other general equilibrium effects could come through other markets. For example, a treatment affecting workers might affect the labor market as a whole (and thus wages). Or, subsidies for housing or education could affect supply and demand (and thus prices) in those markets.
There can be yet other ways SUTVA is violated, either from not having the same treatment or from interactions that violate “no interference.”
Discussion Question 4.3 (cash transfer spillovers). Consider the effect of income on food consumption (Y) in a rural village. Consider an “unconditional cash transfer” program (like GiveDirectly) that (potentially) gives the equivalent of $1000 to a treated individual. Describe different possible spillover effects that would violate SUTVA.
Sometimes one perspective of a treatment leads to SUTVA violations but another does not. In the classroom example, SUTVA was violated by spillover effects from treated students to untreated students in the same class. Alternatively, if each classroom is treated or untreated (i.e., all students treated, or all not), then there is less possibility of spillover. In principle, even entire schools could be assigned as treated or untreated, further reducing spillovers.
In other cases, you may actually want to learn about spillover effects as part of the overall effect of a policy. For example, if the student-level treatment is specifically for students with certain special needs, then we probably care about its effect on both the treated and untreated students. (Further, it would be impossible to treat everyone in a school, since the treatment is only appropriate for certain types of student.)
In deciding which perspective is best, it is helpful to think about the actual policy question: what is the potential policy that could actually be adopted in reality?
4.5 Average Treatment Effect
⇒ Kaplan video: Potential Outcomes and the ATE (again)
Although the full distribution of potential outcomes \((Y^U, Y^T)\) contains the most information, usually only certain summary features are studied. Although summary features like standard deviations and percentiles are interesting, we’ll focus on means.
4.5.1 Definition and Interpretation
The average treatment effect (ATE) is \(E(Y^T - Y^U)\). “Average” refers to the population mean, while “treatment effect” refers to \(Y^T - Y^U\). Thus, the ATE may be interpreted as the probability-weighted average (mean) of all possible individual treatment effects in the population. Another name for the ATE is the average causal effect (ACE), but I use ATE to emphasize that this concept is from the potential outcomes framework.
The ATE has another interpretation. Using linearity as in (2.21),
\[ E(Y^T - Y^U) = E(Y^T) - E(Y^U). \tag{4.3} \]
Here, \(E(Y^T)\) is the mean treated potential outcome, and \(E(Y^U)\) is the mean untreated potential outcome. Similar to Section 4.1, \(E(Y^T) - E(Y^U)\) is a mean difference, here between the treated and untreated potential outcome distributions. This could be rephrased as “the treatment effect on the mean outcome”: treatment causes the mean outcome to change from \(E(Y^U)\) to \(E(Y^T)\).
4.5.2 ATE Examples
Table 4.1 shows a numerical version of the charter school example. The four student “types” refer to the four possible values of \((Y^U, Y^T)\), and each type has its own probability. Given the probabilities, the mean untreated outcome \(E(Y^U)\), mean treated outcome \(E(Y^T)\), and ATE \(E(Y^T - Y^U)\) are computed using (2.16):
\[ E(Y^U) = (0.3)(0) + (0.3)(0) + (0.1)(1) + (0.3)(1) = 0.4, \tag{4.4} \]
\[ E(Y^T) = (0.3)(0) + (0.3)(1) + (0.1)(0) + (0.3)(1) = 0.6, \tag{4.5} \]
\[ E(Y^T - Y^U) = (0.3)(0) + (0.3)(1) + (0.1)(-1) + (0.3)(0) = 0.2. \tag{4.6} \]
To verify (4.3):
\[ E(Y^T - Y^U) = 0.2 = 0.6 - 0.4 = E(Y^T) - E(Y^U). \tag{4.7} \]
Table 4.1: Charter school example: population of potential outcomes and ATE.

Student type   Probability   Y^U   Y^T   Y^T − Y^U
1              0.3           0     0      0
2              0.3           0     1      1
3              0.1           1     0     −1
4              0.3           1     1      0
Mean                         0.4   0.6    0.2
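
As a quick check on the arithmetic, the probability-weighted means in Table 4.1 can be reproduced in R. This is only a minimal sketch: the variable names (prob, yU, yT, and so on) are my own choices for illustration, and the numbers are typed in directly from the table.

```r
# Population from Table 4.1 (charter school example); names are illustrative
prob <- c(0.3, 0.3, 0.1, 0.3)  # probability of each student type
yU   <- c(0, 0, 1, 1)          # untreated potential outcome by type
yT   <- c(0, 1, 0, 1)          # treated potential outcome by type

EYU <- sum(prob * yU)          # mean untreated outcome
EYT <- sum(prob * yT)          # mean treated outcome
ATE <- sum(prob * (yT - yU))   # probability-weighted mean of individual effects

print(c(EYU = EYU, EYT = EYT, ATE = ATE))
# Linearity: the two forms of the ATE agree (up to floating-point rounding)
print(isTRUE(all.equal(ATE, EYT - EYU)))
```

The weighted sum sum(prob * y) is simply the discrete-variable mean formula; no data or estimation are involved, only the population values.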
Table 4.2: Right-to-work example: population of potential outcomes and ATE.

Worker type   Probability   Y^U ($/yr)   Y^T ($/yr)   Y^T − Y^U ($/yr)
1             0.5           40000        41000          1000
2             0.2           40000        38000         −2000
3             0.2           50000        51000          1000
4             0.1           50000        47000         −3000
Mean                        43000        43000             0
Table 4.2 shows a numerical version of the right-to-work example. Each worker “type” corresponds to a different value of \((Y^U, Y^T)\), each type with its own probability. Given the probabilities, the mean untreated outcome \(E(Y^U)\), mean treated outcome \(E(Y^T)\), and ATE \(E(Y^T - Y^U)\) are, in dollars per year,
\[ E(Y^U) = (0.5)(40000) + (0.2)(40000) + (0.2)(50000) + (0.1)(50000) = 43000, \tag{4.8} \]
\[ E(Y^T) = (0.5)(41000) + (0.2)(38000) + (0.2)(51000) + (0.1)(47000) = 43000, \tag{4.9} \]
\[ E(Y^T - Y^U) = (0.5)(1000) + (0.2)(-2000) + (0.2)(1000) + (0.1)(-3000) = 0. \tag{4.10} \]
Again, to verify (4.3):
\[ E(Y^T - Y^U) = \$0/\text{yr} = \$43000/\text{yr} - \$43000/\text{yr} = E(Y^T) - E(Y^U). \tag{4.11} \]
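
The same calculation for the right-to-work numbers can be sketched in R. The names below (effect, share_gain, share_loss) are hypothetical labels introduced only for this illustration. Besides the ATE, the sketch adds up the probabilities of worker types with positive and negative individual effects, which shows how a zero mean can coexist with non-zero effects for every type.

```r
# Population from Table 4.2 (right-to-work example); names are illustrative
prob <- c(0.5, 0.2, 0.2, 0.1)             # probability of each worker type
yU   <- c(40000, 40000, 50000, 50000)     # untreated income ($/yr)
yT   <- c(41000, 38000, 51000, 47000)     # treated income ($/yr)

effect <- yT - yU                         # individual treatment effect by type
ATE    <- sum(prob * effect)              # average treatment effect ($/yr)

share_gain <- sum(prob[effect > 0])       # probability of a positive effect
share_loss <- sum(prob[effect < 0])       # probability of a negative effect

print(c(ATE = ATE, share_gain = share_gain, share_loss = share_loss))
```

With these numbers, types 1 and 3 (probability 0.7 in total) gain and types 2 and 4 (probability 0.3 in total) lose, even though the ATE is zero.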
4.5.3 Limitation of ATE
[Figure 4.2: Three distributions with the same mean. Vertical axis: density (0.0 to 0.4); horizontal axis: hourly wage ($12/hr, $15/hr, $18/hr).]
Figure 4.2 shows the ATE does not fully capture the effect of treatment on the distribution. The figure plots PDFs of three hourly wage distributions with identical means. Picking any two distributions to represent potential outcomes \(Y^U\) and \(Y^T\), \(E(Y^U) = E(Y^T)\), so the ATE is \(E(Y^T) - E(Y^U) = \$0/\text{hr}\). However, zero ATE does not mean zero effect: the distributions are all different. For example, their standard deviations differ, and one distribution is right-skewed, with a lower median wage. We may disagree about which is “better” or “worse,” but we can agree the differences are important.
This idea is also memorable in joke form, as retold by Hansen (2020, p. 29):
An economist was standing with one foot in a bucket of boiling water and the other foot in a bucket of ice. When asked how he felt, he replied, “On average I feel just fine.”
To address the limitations of the ATE, one approach is to examine effects on percentiles (“quantile treatment effects”), but these are beyond our scope.
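
To see the same-mean, different-distribution point numerically, here is a small simulation sketch in R. The three distributions are my own assumptions chosen for illustration (two normal distributions and one log-normal, all with mean $15/hr); they are not the distributions actually plotted in Figure 4.2.

```r
set.seed(1)      # for reproducibility
n <- 100000      # large sample so sample statistics are close to population values

# Three hypothetical hourly wage distributions, each with population mean 15
w_narrow <- rnorm(n, mean = 15, sd = 1)
w_wide   <- rnorm(n, mean = 15, sd = 3)
# Log-normal mean is exp(meanlog + sdlog^2 / 2), so pick meanlog to make it 15
sdlog    <- 0.5
w_skewed <- rlnorm(n, meanlog = log(15) - sdlog^2 / 2, sdlog = sdlog)

# Mean, standard deviation, and median of each distribution (one column each)
summ <- sapply(list(narrow = w_narrow, wide = w_wide, skewed = w_skewed),
               function(w) c(mean = mean(w), sd = sd(w), median = median(w)))
print(round(summ, 2))
```

The means should all be close to 15, while the standard deviations differ and the right-skewed distribution has a median below its mean. Comparing any two columns gives a mean difference near zero despite clearly different distributions.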
Practice 4.1 (unrepresentative ATE). Describe a population in which the ATE is zero but every individual is affected by the treatment (i.e., all treatment effects are non-zero). For simplicity, assume there are only two types of individual. For each type, state the probability, potential outcomes \(Y^U\) and \(Y^T\), and causal effect \(Y^T - Y^U\), which must be non-zero. Then compute the ATE to verify it’s zero.
4.6 ATE Identification
⇒ Kaplan video: ATE Identification
Generally, identification is a concept central to econometrics that appears throughout this textbook. A parameter is identified if it equals a summary feature of the population distribution of observable variables. Identification requires certain conditions, known as identifying assumptions.
Specifically, the ATE is identified when it equals a mean difference. (It may equal another summary feature in more complex settings not discussed here.) The required identifying assumptions are discussed later in this section. In practice, if the identifying assumptions are true, then the mean difference can be interpreted as the ATE, but if they are false, then it cannot.
4.6.1 Setup and Identification Question
For each individual, a single value is observed. If the individual was actually treated (in our universe), then treated potential outcome \(Y^T\) is observed; otherwise, \(Y^U\) is observed.
Consider actually treated individuals to be population B, and consider actually untreated individuals to be population A. The two populations are represented by random variables \(Y^B\) and \(Y^A\), respectively. For a population B individual, \(Y^B\) is always observable, with \(Y^B = Y^T\). Similarly, for a population A individual, \(Y^A\) is always observable, with \(Y^A = Y^U\). A random sample of \(Y^B\) can be taken from actually treated individuals, and a random sample of \(Y^A\) can be taken from actually untreated individuals. For example, \(Y^B\) could be the graduation outcome for a student who actually attended the charter school (in our universe), while \(Y^A\) is the outcome for a student who did not.
For the ATE, the question of identification is whether the ATE equals the mean difference between the actually treated and actually untreated populations. Mathematically, using the \(E(Y^T) - E(Y^U)\) form of the ATE from (4.3), the identification question is whether or not
\[ E(Y^T) - E(Y^U) = E(Y^B) - E(Y^A). \tag{4.12} \]
We know how to learn about the descriptive mean difference \(E(Y^B) - E(Y^A)\) from data, as in Section 4.1. If (4.12) holds, then this is equivalent to learning about the ATE.
Consider (4.12) in the charter school and right-to-work examples. For the charter school example, if the ATE is identified, then it equals the college graduation probability of charter school students minus the college graduation probability of conventional public school students, \(E(Y^B) - E(Y^A)\). (Recall from (2.9) that for binary \(Y\), \(E(Y) = P(Y = 1)\).) In that case, the ATE is estimated simply by comparing college graduation rates between the charter school and public school students. For the right-to-work example, if the ATE is identified, then it equals mean income in right-to-work states minus mean income in other states, \(E(Y^B) - E(Y^A)\). In that case, the ATE is estimated simply by comparing average income between right-to-work states and other states. However, if the ATE is not identified and (4.12) is false, then these comparisons do not estimate the ATE; i.e., the mean differences do not have a causal interpretation.
4.6.2 Randomization
Randomized experiments are often used to estimate the ATE. Ideally, in a randomized experiment, also called a randomized controlled trial (RCT), the experimenter can control who is treated and who is not (but see comments below). Mathematically, the experimenter gets to decide whether to observe \(Y^U\) or \(Y^T\) for each individual. “Randomized” means this decision is made without regard to the individual’s characteristics.
In practice, there are many complications; see Section 4.6.3 for some examples.
For intuition, consider the following experimental strategy. First, imagine we only want to estimate \(E(Y^T)\). We could take a random sample of individuals from the population and treat each one, allowing us to observe their \(Y^T\). That is, we have a random sample from the population distribution of \(Y^T\). As in Chapter 3, we can estimate \(E(Y^T)\) by the sample mean. Here, our \(Y^B\) is \(Y^T\), so \(E(Y^B) = E(Y^T)\). Second, we can repeat the process for a second random sample, but force everyone to be untreated. The key is the ability to force anyone to be either treated or untreated: this allows us to take random samples of \(Y^T\) and \(Y^U\). Although treatment may not seem “random,” it is assigned without consideration of any individual’s characteristics.
Section 6.6.2 contains more formal arguments for why randomization can help identify the ATE.
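
The intuition above can be illustrated with a small simulation in R. This is a sketch under assumptions of my own: a simulated population whose four types have the probabilities and potential outcomes of the charter school example (true ATE 0.2), and a coin-flip assignment rule that ignores individual characteristics. All variable names are hypothetical.

```r
set.seed(42)
n <- 100000

# Simulated population: each individual is one of four types, with the
# type probabilities and potential outcomes of the charter school example
type <- sample(1:4, size = n, replace = TRUE, prob = c(0.3, 0.3, 0.1, 0.3))
yU <- c(0, 0, 1, 1)[type]   # untreated potential outcome
yT <- c(0, 1, 0, 1)[type]   # treated potential outcome

# Randomized assignment: a coin flip unrelated to anyone's type
treated <- rbinom(n, size = 1, prob = 0.5)

# Only one potential outcome is observed for each individual
y <- ifelse(treated == 1, yT, yU)

mean_diff <- mean(y[treated == 1]) - mean(y[treated == 0])  # observable
ate_true  <- mean(yT - yU)     # not observable in real data; known here only by simulation

print(round(c(mean_diff = mean_diff, ate_true = ate_true), 3))
```

In real data only y and treated are available; the simulation lets us peek at both potential outcomes to confirm that the observable mean difference is close to the ATE when assignment is random.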
4.6.3 Reasons for Identification Failure
Generally (beyond only experiments), ATE identification fails when SUTVA fails (Section 4.4.3) or when treatment is not random.
Outside of experiments, random or “as good as random” treatment is rare. For example, in the right-to-work example, treatment is probably not random. Hopefully, state legislatures do indeed consider the characteristics of individuals when deciding whether or not to pass a right-to-work law; i.e., laws are not passed randomly. Specifically, legislatures may consider the distribution of \(Y^U\) when deciding whether or not to pass the law (which would switch everyone’s income from their \(Y^U\) to their \(Y^T\)). Further, just looking at a map, it is notable that (as of 2019) zero US states in the Northeast census region have right-to-work laws, whereas almost all states in the South census region have right-to-work laws (the exceptions being Delaware and Maryland, which are not really “Southern” culturally or politically). Thus, it seems likely that the treatment decisions were related to other policy decisions that would in turn affect the income distribution.
Even with randomized treatment assignment, treatment itself may not be random. For example, imagine you randomly assign individuals to attend a job training program, but some assigned individuals never attend. This is called non-compliance, i.e., not complying with the treatment assignment. This is a type of self-selection, meaning individuals decide which group to join. People who skip the program may also skip work regularly, which results in lower income. Thus, many low-income individuals who should have been in the treatment group (if we could force them) are now in the control group; i.e., they should have been \(Y^B\) but are now \(Y^A\). This decreases the control group’s average income and raises the treatment group’s average income, which falsely makes the treatment seem more effective than it is. Even if the training program has zero ATE, so \(E(Y^T) = E(Y^U)\), this non-compliance makes it look like the treatment has a positive effect because \(E(Y^A) < E(Y^U)\) and \(E(Y^B) > E(Y^T)\).
One way to avoid this incorrect conclusion is to change perspective: compare groups based on treatment assignment rather than actual treatment. In the above example, the “treatment” is defined as being assigned to attend the job training program, rather than actually attending the program. The resulting ATE is called the intention-to-treat effect because it measures the mean change in \(Y\) corresponding to the intention to treat (i.e., assignment to treatment). Sometimes this is more directly relevant for policy anyway, if the actual policy would not force people to be treated.
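
A simulation sketch in R may help make the non-compliance story concrete. Everything here is hypothetical: I assume 30% of people would skip training if assigned, that these “skippers” earn less on average, and that training has zero effect on anyone’s income (so the true ATE and intention-to-treat effect are both zero).

```r
set.seed(123)
n <- 100000

# Assumed: 30% would skip training if assigned, and skippers earn less
skipper <- rbinom(n, size = 1, prob = 0.3)
# Income in thousands of $/yr; identical with or without training (zero ATE)
income <- rnorm(n, mean = 30 - 10 * skipper, sd = 5)

assigned <- rbinom(n, size = 1, prob = 0.5)   # randomized assignment
attended <- assigned * (1 - skipper)          # actual treatment (self-selected)

# Comparing by actual treatment: biased upward by self-selection
diff_actual <- mean(income[attended == 1]) - mean(income[attended == 0])
# Comparing by assignment (intention to treat): close to the true zero
diff_itt <- mean(income[assigned == 1]) - mean(income[assigned == 0])

print(round(c(by_actual_treatment = diff_actual, by_assignment = diff_itt), 2))
```

The comparison by actual attendance shows a sizable positive difference even though training does nothing in this simulation, whereas the comparison by assignment stays near zero.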
Attrition is another problem that can arise even if SUTVA and random treatment are satisfied. Attrition refers to individuals dropping out of the study after it starts. For example, maybe everyone comes to the first job training, but then some people move to a different state and disappear from your data. People leaving randomly is fine, but non-random attrition is problematic. For example, maybe the training program is so good that people get higher-paying jobs in other states. You only see data for individuals who didn’t move, who generally have lower-paying jobs. Then, even though the training program worked really well, it doesn’t look like it in the data because you don’t see all the highest-earning treated individuals who moved.
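
Non-random attrition can be sketched the same way. The setup below is an assumption for illustration only: training raises everyone’s income by 5 (thousand dollars per year), assignment is random with full compliance, and treated individuals earning above 38 move away and drop out of the data.

```r
set.seed(2021)
n <- 100000

yU <- rnorm(n, mean = 30, sd = 5)   # untreated income, thousands of $/yr (hypothetical)
yT <- yU + 5                        # assumed: training raises everyone's income by 5

treated <- rbinom(n, size = 1, prob = 0.5)   # randomized, full compliance
y <- ifelse(treated == 1, yT, yU)

# Assumed attrition rule: treated individuals earning above 38 move away
stays <- !(treated == 1 & y > 38)

# Mean difference if nobody left versus among those still in the data
diff_full     <- mean(y[treated == 1]) - mean(y[treated == 0])
diff_observed <- mean(y[stays & treated == 1]) - mean(y[stays & treated == 0])

print(round(c(full_sample = diff_full, after_attrition = diff_observed), 2))
```

The full-sample difference is close to the true effect of 5, but the difference among those who stayed is noticeably smaller because the highest-earning treated individuals are missing.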
Other concerns are introduced later, especially in Section 12.3.
Discussion Question 4.4 (breakfast effect). Schools with a high enough percentage of low-income students are eligible for a federally-funded free breakfast program for all students. Although the program is not mandatory, all eligible schools choose to have it. You compute a 95% CI for the mean math test score of the “breakfast” schools minus the mean of the other schools, and it is \([-3.2, -1.7]\) points. (The test is out of 100 points; most scores are in the 60 to 100 range.) How do you interpret this result? Think about ATE identification, statistical uncertainty, and frequentist vs. Bayesian perspectives.
4.7 ATE Estimation and Inference
There is nothing new for estimation and inference. It is identical to Sections 4.1.2 and 4.1.3. Generally, the point of causal “identification” is not to propose a new statistical object, but rather to imbue an existing descriptive statistical object with causal meaning. Here, (4.12) gives the descriptive mean difference a causal interpretation (ATE). The interpretation does not affect how we estimate or quantify statistical uncertainty about the mean difference.
However, recall from Section 3.9.8 that conventional methods for quantifying uncertainty (Section 3.8) only quantify statistical uncertainty, not uncertainty about identification. For example, if identification fails, a 95% CI for the mean difference may only contain the ATE with 80% probability, or even 50%, or near 0%. There are some proposals for quantifying the sensitivity of results to violations of identification in various settings, but these are beyond our scope.
Empirical Exercises
Empirical Exercise EE4.1. You will analyze the effects of being assigned to a job training program, where assignment was randomized. The specific program was the National Supported Work Demonstration in the 1970s in the US. Data are originally from LaLonde (1986), via Wooldridge (2020). You will look at effects on earnings (re78) and unemployment (unem78), both overall and for different subgroups (e.g., married or not). The train variable indicates (randomized) assignment to job training if it equals 1, and 0 otherwise. For now, we focus on computing various estimates; in later chapters, we’ll think more critically about what could go wrong even with randomized assignment.
a. R only: run install.packages("wooldridge") to download and install that package (if you have not already).
b. Load the jtrain2 dataset.
R: load package wooldridge with command library(wooldridge) and a data frame variable named jtrain2 becomes available; the command ?jtrain2 then shows you details about the dataset.
Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse jtrain2, clear
c. R only: separate the data into “treatment” and “control” groups (depending on the value of train, the job training variable) with
    trt <- jtrain2[jtrain2$train==1, ]
    ctl <- jtrain2[jtrain2$train==0, ]
d. Estimate the mean 1978 earnings (in thousands of dollars) for the treatment group minus that of the control group, along with a 95% CI for the mean difference.
R:
    mean(trt$re78) - mean(ctl$re78)
    t.test(x=trt$re78, y=ctl$re78)
Stata: ttest re78, by(train) unequal (also estimates the mean difference)
e. R only: separate out the data for treated married individuals and untreated married individuals with
    trtmar1 <- trt[trt$married==1, ]
    ctlmar1 <- ctl[ctl$married==1, ]
f. Compute the mean difference estimate and 95% CI for the 1978 earnings outcome variable, comparing treated and untreated married individuals.
R:
    mean(trtmar1$re78) - mean(ctlmar1$re78)
    t.test(x=trtmar1$re78, y=ctlmar1$re78)
Stata: ttest re78 if married==1, by(train) unequal or alternatively bysort married: ttest re78, by(train) unequal
g. Repeat your above analysis in parts (c)–(f), but first create a variable where earnings are in dollars (instead of thousands of dollars).
R: jtrain2$re78USD <- 1000*jtrain2$re78
Stata: generate re78USD = re78*1000
h. Optional: repeat your analysis in parts (e) and (f) for unmarried (instead of married) individuals.
i. Optional: repeat your analysis in parts (d)–(f) but for unemployment (unem78) instead of earnings. For interpretation, note that unem78 equals 1 if unemployed all of 1978 and equals 0 otherwise, so the population mean is the probability of being unemployed all year (a value between 0 = 0% and 1 = 100%), and the sample average is the fraction of the sample thus unemployed. So a value like 0.14 means 14%, and a difference of 0.14 − 0.11 = 0.03 is a difference of 3 percentage points, etc.
Empirical Exercise EE4.2. You will analyze data from an “audit study” that attempts to measure the effect of race on receiving a job offer. The Urban Institute found pairs of seemingly equally qualified individuals (one black, one white) and had them interview for a variety of entry-level jobs in Washington, DC, in 1988. See Siegelman and Heckman (1993) for details and critique, and the raw data in their Table 5.1 (p. 195). In the data, each row (observation) corresponds to one job to which one pair applied. Value w=1 indicates that the white applicant in the pair got a job offer, while b=1 if the black applicant got an offer.
a. R only: run install.packages("wooldridge") to download and install that package (if you have not already).
b. Load the audit dataset.
R: load package wooldridge with command library(wooldridge) and a data frame variable named audit becomes available; the command ?audit then shows you details about the dataset.
Stata: run ssc install bcuse to ensure command bcuse is installed, and then load the dataset with bcuse audit, clear
c. Compute the difference (white minus black) in the sample fraction of job offers.
R: mean(audit$w) - mean(audit$b)
Stata: ttest w==b (which also computes a 95% CI)
d. Compute the sample mean of all the pairs’ white-minus-black difference. Note that w-b equals 1 if the white individual got a job offer but the black individual did not, equals −1 if the black but not white individual got an offer, and equals 0 if both or neither of the pair got an offer.
R: mean(audit$w - audit$b)
Stata: generate wminusb = w-b then ttest wminusb==0 (also computes 95% CI; see row labeled diff for both)
e. R only (since Stata already reported this in the row labeled diff): compute a 95% CI for the population mean difference with either t.test(x=audit$w, y=audit$b, paired=TRUE) or t.test(x=audit$w-audit$b)
Chapter 5
Midterm Exam 1
⇒ Kaplan video: Chapter Introduction
When I teach this class, the first midterm exam is this week. This “chapter” makes the chapter numbers match the week of the semester. The midterm covers Chapters 2–4, i.e., everything up till now except R/Stata coding.
Part II
Regression
Introduction
Part II concerns regression. Regression is the workhorse of empirical economics (and many other fields), for description, prediction, and causality alike.
Part II extends the concepts and methods of Part I to the regression setting. In the population, the concepts of description, prediction, and causality from Part I are extended to regression models. In the data, estimation and inference methods extend those of Part I.
More flexible regression is also considered, including different models, interpretation, and a glimpse of nonparametric regression and machine learning.
Chapter 6
Comparing Two Distributions by Regression
⇒ Kaplan video: Chapter Introduction
Depends on: Chapter 4 (which depends on Chapters 2 and 3)
Unit learning objectives for this chapter:
6.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
6.2. Describe different ways of thinking about two distributions, both mathematically and intuitively [TLO 3]
6.3. Describe, interpret, identify, and distinguish among different population models and their parameters and estimators [TLO 3]
6.4. Judge which interpretation of a regression slope is most appropriate in a real-world example [TLO 6]
6.5. Interpret logical relationships and form appropriate logical conclusions [TLO 2]
6.6. In R (or Stata): estimate the parameters in a simple regression model, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter:
• Conditional probability (Khan Academy)
• Basic joint, marginal, and conditional distributions (Khan Academy)
• James et al. (2013, §3.1)
• Covariance and correlation (Lambert video)
• Overlap assumption (Masten video)
• Correlation vs. causation (Masten video)
• Assumptions for randomized experiment validity (Masten video)
• Structural vs. causal/reduced form approach (Masten video)
• OLS computation (Masten video)
• Sections 2.1 (“Simple OLS Regression”) and 2.2 (“Coefficients, Fitted Values, and Residuals”) in Heiss (2016)
• Section 5.3 (“Regression When X is a Binary Variable”) in Hanck et al. (2018)
• R packages lmtest and sandwich (Zeileis, 2004; Zeileis and Hothorn, 2002)
Chapter 6 revisits Chapter 4 from the perspective of regression. The concepts of description, prediction, and causality are translated into regression language and regression models in the population. Estimation and quantifying uncertainty are also discussed.
The term regression has different meanings in different contexts (and by different people). In the population, it usually refers to how the mean of a random variable \(Y\) depends on the value of another random variable(s), as in Section 6.3. In the sample, as in Section 6.7, it usually refers to a particular estimation technique. But beware of other (or ambiguous) uses of the word “regression,” especially in online resources.
6.1 Logic
⇒ Kaplan video: Logic Terms Example
Some basic logic is useful for understanding certain parts of econometrics. Theoretically, logic helps you understand the relationships among different conditions, like assumptions for theorems. Practically, logic helps you interpret results.
The following may not be fully technically correct from a philosopher’s perspective (e.g., perhaps I conflate logical implication with the material conditional), but it suffices for econometrics.
6.1.1 Terminology
Many words and notations can refer to the same logical relationship. Let A and B be two statements that can be either true or false. For example, maybe A is “\(Y \ge 10\)” and B is “\(Y \ge 0\).” Or A is “this animal is a cat” and B is “this animal is a mammal.” The following ways of describing the logical relationship between A and B all have the same meaning.
1. If A is true, then B is true (often shortened: “if A, then B”).
2. \(A \Rightarrow B\)
3. A implies B.
4. \(B \Leftarrow A\)
5. B is implied by A.
6. B is true if A is true.
7. A is true only if B is true.
8. A is a sufficient condition for B (shorter: “A is sufficient for B”).
9. B is a necessary condition for A (shorter: “B is necessary for A”).
10. A is stronger than B.
11. B is weaker than A.
12. It is impossible for B to be false when A is true (but it is fine if both are true, or both are false, or A is false and B is true).
13. The truth table (T=true, F=false):

    A   B   A ⇒ B
    T   T   T
    T   F   F
    F   T   T
    F   F   T

14. The diagram (everything in A is also in B):
    [Diagram: region A drawn entirely inside region B.]
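
The truth table in item 13 can also be generated in R. This is a small sketch: it uses the fact that “A implies B” is false only when A is true and B is false, which is the same as “(not A) or B.” The data frame and column names are my own, and (as noted above) this treats implication as the material conditional.

```r
# All four combinations of truth values, ordered as in the truth table above
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

# "A implies B" is false only when A is true and B is false
tt <- data.frame(A = A, B = B, A_implies_B = !A | B)
print(tt)
```

Only the second row (A true, B false) has A_implies_B equal to FALSE, matching item 12 in the list.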
To state equivalence of A and B, opposite statements can be combined. Specifically, any of the following have the same meaning.
1. \(A \Leftrightarrow B\) (meaning both \(A \Rightarrow B\) and \(A \Leftarrow B\))
2. A is true if and only if B is true (meaning A is true if B is true, and A is true only if B is true).
3. B is true if and only if A is true.
4. A is necessary and sufficient for B.
5. B is necessary and sufficient for A.
6. A and B are equivalent.
7. It is impossible for A to be false when B is true, and impossible for A to be true when B is false.
8. The truth table (T=true, F=false):

    A   B   A ⇔ B
    T   T   T
    T   F   F
    F   T   F
    F   F   T
Variations of \(A \Rightarrow B\) have the following names. Read \(\neg A\) as “not A”: \(\neg A\) is false when A is true, and \(\neg A\) is true when A is false.
• \(\neg A \Rightarrow \neg B\) is the inverse of \(A \Rightarrow B\).
• \(B \Rightarrow A\) is the converse of \(A \Rightarrow B\).
• \(\neg B \Rightarrow \neg A\) is the contrapositive of \(A \Rightarrow B\).
The statement \(A \Rightarrow B\) is logically equivalent to its contrapositive. That is, statements “\(A \Rightarrow B\)” and “\(\neg B \Rightarrow \neg A\)” can be both true or both false, but it’s impossible for one to be true and the other false.
The statement \(A \Rightarrow B\) is not logically equivalent to either its inverse or converse. (The inverse and converse are equivalent to each other because the inverse is the contrapositive of the converse.)
For example, let A be “\(X \le 0\)” and let B be “\(X \le 10\).”
• \(A \Rightarrow B\): any number below 0 is also below 10.
• The contrapositive is \(X > 10 \Rightarrow X > 0\), which is also true: any number above 10 is also above 0.
• The inverse is \(X > 0 \Rightarrow X > 10\), which is false: e.g., if \(X = 5\), then \(X > 0\) but not \(X > 10\).
• The converse is \(X \le 10 \Rightarrow X \le 0\), also false: again, if \(X = 5\), then \(X \le 10\) but not \(X \le 0\).
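
These equivalences can be checked mechanically in R by comparing truth tables. The helper function implies() below is a hypothetical name defined only for this sketch; it encodes “p implies q” as “(not p) or q.” The second half applies the same helper to the numerical example, using three illustrative values of X.

```r
implies <- function(p, q) !p | q   # hypothetical helper: "p implies q"

# All four combinations of truth values for statements A and B
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

original       <- implies(A, B)
contrapositive <- implies(!B, !A)
inverse        <- implies(!A, !B)
converse       <- implies(B, A)

print(identical(original, contrapositive))  # same truth table?
print(identical(original, converse))        # same truth table?
print(identical(inverse, converse))         # same truth table?

# Numerical example: A is "X <= 0" and B is "X <= 10", at three values of X
x  <- c(-3, 5, 12)
Ax <- x <= 0
Bx <- x <= 10
print(all(implies(Ax, Bx)))     # original holds at every x
print(all(implies(!Bx, !Ax)))   # contrapositive holds at every x
print(implies(!Ax, !Bx))        # inverse, evaluated at each x
print(implies(Bx, Ax))          # converse, evaluated at each x
```

The inverse and converse both fail at X = 5, the counterexample used in the text.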
6.1.2 Theorems
Theorems all have the same logical structure: if assumption A is true, then conclusion B is true. Sometimes A and B have multiple parts, like A is really “A1 and A2.” The theorem’s practical use is: if we can verify that A is true, then we know B is also true.
What if we think A is false? Then B could be false, or it could be true. This may be seen most readily from the picture version of the A and B relationship in Section 6.1.1: we could be somewhere inside B but outside A (i.e., B true, A false), or we could be outside both (both false). That is, as in Section 6.1.1, the theorem \(A \Rightarrow B\) is not equivalent to its inverse.
Also from Section 6.1.1, a theorem is equivalent to its contrapositive. That is, if the theorem’s conclusion is false, then we know its assumption is false. (If the assumption has multiple parts, like “both A1 and A2 are true,” then being false means either A1 is false, or A2 is false, or both are false.)
6.1.3 Comparing Assumptions
To compare assumptions, the terms “stronger” and “weaker” are most commonly used. Let A1 and A2 denote different assumptions. Per Section 6.1.1, “A1 is stronger than A2” is equivalent to \(A1 \Rightarrow A2\), which is also equivalent to “A2 is weaker than A1.”
All else equal, it is more useful to have a theorem with weaker assumptions because it applies to more settings. That is, if \(A1 \Rightarrow A2\), then we prefer a theorem based on A2, the weaker assumption. A theorem based on A1 can only be used when A1 is true. In contrast, a theorem based on A2 can be used not only when A1 is true (because \(A1 \Rightarrow A2\)), but also sometimes when A1 is false (but A2 is still true).
For example, let assumption A1 be “a city is in Missouri,” and let assumption A2 be “a city is in the United States.” Consider the theorems \(A1 \Rightarrow B\) and \(A2 \Rightarrow B\). (The conclusion is irrelevant here, but to be concrete you could imagine B is “the city is in the northern hemisphere.”) Since Missouri is part of the United States, \(A1 \Rightarrow A2\); i.e., A1 is the stronger assumption, and A2 is the weaker assumption. We prefer the theorem based on the weaker assumption because it applies to more cities. For example, only the theorem \(A2 \Rightarrow B\) applies to Houston: A1 is false, but A2 is true. (And recall that when A1 is false, the theorem \(A1 \Rightarrow B\) does not conclude that B is false; it just says, “I don’t know if B is true or false,” i.e., it is useless.)
Practice 6.1 (median theorem logic). Consider the theorem, “If sampling is iid, then the sample median consistently estimates the population median.” Hint: draw a picture and/or write it as \(A \Rightarrow B\).
a) What does this tell us about consistency of the sample median when sampling is not iid?
b) What does this tell us about sampling when the sample median is not consistent?
Practice 6.2 (mean theorem logic). Consider the theorem, “If sampling is iid and the population mean is well-defined, then the sample mean consistently estimates the population mean.” Hint: there may be multiple possible pictures that show this relationship among A1 (iid), A2 (well-defined), and B (consistency).
a) What does this tell us about consistency of the sample mean when sampling is not iid?
b) What does this tell us about sampling when the sample mean is not consistent?
Discussion Question 6.1 (logic with feathers). Consider two theorems. Theorem 1 says, “If X is an adult eagle, then it has feathers.” Theorem 2 says, “If X is an adult bird, then it has feathers.”
a) Describe each theorem logically: what’s the assumption (A), what’s the conclusion (B), what’s the relationship?
b) State Theorem 1’s contrapositive; is it true?
c) Compare: does Theorem 1 or Theorem 2 have a stronger assumption? Why?
d) Compare: which theorem is more useful? (Which applies to more situations?)
6.2 Preliminaries
⇒ Kaplan video: Joint, Marginal, and Conditional Distributions
Before getting to regression, some simpler material may provide intuition. (If it is not familiar to you from a previous statistics class, then you may want to consult additional resources for a deeper understanding, or you may not.) In Section 6.2, there is no data; only the population is considered.
6.2.1 Population Mean Model in Error Form
To help understand the conditional mean model, we start with an unconditional mean model. That is, interest is in \(\mu_Y \equiv E(Y)\) for a single random variable \(Y\), as in Chapter 2.
There are two ways to write the unconditional mean “model.” Both look silly and over-complicated, but they help bridge Chapter 4 and Chapter 6. First, the mean can be written directly:
\[ E(Y) = \mu_Y. \tag{6.1} \]
Second, in terms of an error term \(U\), the error form of this model is
\[ Y = \mu_Y + U, \quad E(U) = 0. \tag{6.2} \]
Models (6.1) and (6.2) are equivalent. Taking the mean of both sides of (6.2), using the linearity property from (2.21),
\[ E(Y) = E(\mu_Y + U) = E(\mu_Y) + E(U) = \mu_Y + 0 = \mu_Y. \tag{6.3} \]
The error term \(U\) has a precise statistical definition and meaning, but no causal or economic meaning. Defining the mean error term as
\[ U \equiv Y - E(Y) \tag{6.4} \]
always implies
\[ E(U) = E[Y - E(Y)] = E(Y) - E(Y) = 0. \tag{6.5} \]
Thus, the property \(E(U) = 0\) in (6.2) is true essentially by definition, not an assumption that can be false. By analogy, this is like defining \(U\) to be an equilateral triangle, in which case the property “all angles are equal” is always true, not an additional assumption. However, \(U\) has no causal or economic meaning; it is simply the difference between an individual’s \(Y\) and the population mean \(E(Y)\).
The error form often facilitates theoretical analysis of estimators, but in practice the more direct model may be easier to interpret.
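
To make the “true by definition” point concrete, here is a short R sketch with a made-up discrete population. The three values and their probabilities are arbitrary assumptions; any other choice would work the same way.

```r
# Hypothetical discrete population: possible values of Y and their probabilities
y    <- c(0, 10, 20)
prob <- c(0.2, 0.5, 0.3)

muY <- sum(prob * y)   # population mean E(Y)
u   <- y - muY         # mean error term U = Y - E(Y), for each possible value
EU  <- sum(prob * u)   # E(U), using the same probabilities

print(c(muY = muY, EU = EU))
# Error form: adding U back to the mean recovers each value of Y
print(isTRUE(all.equal(muY + u, y)))
```

Whatever values and probabilities are used, E(U) comes out as zero (up to floating-point rounding), because U is constructed by subtracting the mean.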
6.2.2 Joint and Marginal Distributions
To understand regression, you must understand conditional distributions. To understand conditional distributions, it helps to understand joint distributions and marginal distributions.
The joint distribution is the distribution of values of \((X, Y)\) together, which can be any combination of the variable types in Section 2.3. For example, \(X\) could be categorical and \(Y\) continuous, or \(X\) and \(Y\) both discrete, or \(X\) continuous and \(Y\) ordinal, etc. Since there are so many combinations, they are not all enumerated here. Further, eventually we’ll focus on conditional distributions, in which case the variable type of \(X\) does not matter as much for interpretation. For regression, the focus is on numeric (discrete or continuous) \(X\) and \(Y\). Implicitly, this also applies to categorical variables that have been turned into dummy variables with the indicator function, like \(X = \mathbb{1}\{\text{cat}\}\) or \(Y = \mathbb{1}\{\text{employed}\}\).
For (X, Y) with non-continuous variable types, the joint distribution can be described by a PMF. Like before, the PMF states the probability of each possible value. The difference is that "possible values" are now pairs of values (x, y) instead of single values. For example, a possible value could be (−5, 7) or (cat, dog). (It is more difficult to gain intuition with continuous variable types, but the idea of a PDF can be extended to multiple variables.)
Each joint probability in the PMF can be written in multiple equivalent ways:
$$P((X, Y) = (x, y)) = P(X = x, Y = y) = P(X = x \text{ and } Y = y) \tag{6.6}$$
For example, consider the joint distribution of dummy variables for employment and marital status. Let Y = 1 if somebody is employed and Y = 0 if not. Let X = 1 if somebody is married and X = 0 if not. The joint distribution of employment and marital status describes the probabilities of each possible value of the vector (X, Y), i.e., the PMF of the vector (X, Y). There are four possible values: unmarried and unemployed (0, 0), unmarried and employed (0, 1), married and unemployed (1, 0), and married and employed (1, 1). Since these categories are mutually exclusive and exhaustive, the four probabilities must sum to 1 (i.e., 100%). Table 6.1 shows an example.
Table 6.1: Joint distribution of marital status (X) and employment status (Y)

                               Y = 0    Y = 1    Marginal for X (row sum)
X = 0                          0.10     0.10     0.20
X = 1                          0.20     0.60     0.80
Marginal for Y (column sum)    0.30     0.70     1.00
Table 6.1 shows both joint and marginal probabilities. Here, the joint probability values can be written as p_xy ≡ P(X = x, Y = y), or equivalently P((X, Y) = (x, y)). This is analogous to the scalar (one variable) PMF that described P(Y = y) for different values y, but replacing Y with (X, Y) and y with (x, y). The joint probabilities in the four interior cells are p_00 = 0.10 (i.e., 10%), p_01 = 0.10, p_10 = 0.20, and p_11 = 0.60. These sum to 1.
A marginal probability (or unconditional probability) considers just one of the random variables, ignoring the other. Specifically, the outer values in Table 6.1 show the marginal probabilities to be P(X = 0) = 0.20 (at the right end of the X = 0 row), P(X = 1) = 0.80, P(Y = 0) = 0.30 (at the bottom of the Y = 0 column), and P(Y = 1) = 0.70. These probabilities describe the marginal distribution of each random variable, or more specifically the marginal PMFs. That is, X by itself is a random variable with P(X = 0) = 0.20 and P(X = 1) = 0.80: the population probability of an individual being married is 0.8 (80%). Similarly, by itself Y is a random variable with P(Y = 0) = 0.30 and P(Y = 1) = 0.70.
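The row and column sums in Table 6.1 can be reproduced mechanically. The following is only a minimal Python sketch: the dictionary layout and the helper name `marginal` are illustrative assumptions, not notation from the text.

```python
# Joint PMF from Table 6.1: keys are (x, y) pairs, values are P(X = x, Y = y).
joint = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.60}


def marginal(joint_pmf, position):
    """Sum joint probabilities over the other variable.

    position=0 gives the marginal PMF of X; position=1 gives that of Y.
    (Hypothetical helper, named here only for illustration.)
    """
    pmf = {}
    for pair, prob in joint_pmf.items():
        value = pair[position]
        pmf[value] = pmf.get(value, 0.0) + prob
    return pmf


marginal_x = marginal(joint, 0)  # row sums of Table 6.1
marginal_y = marginal(joint, 1)  # column sums of Table 6.1

# Round only for display; floating-point sums can be off by about 1e-16.
print("P(X = x):", {x: round(p, 2) for x, p in marginal_x.items()})
print("P(Y = y):", {y: round(p, 2) for y, p in marginal_y.items()})
```

Storing the joint PMF as a dictionary keyed by (x, y) pairs mirrors the idea that the "possible values" of a joint distribution are pairs rather than single values.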
Discussion Question 6.2 (joint distribution and causality). Consider two binary random variables X and Y whose joint distribution is described by the probabilities P(X = 0, Y = 0) = 0.4, P(X = 0, Y = 1) = 0.1, P(X = 1, Y = 0) = 0.1, and P(X = 1, Y = 1) = 0.4. Note P(X = Y) = 0.8 > 0.2 = P(X ≠ Y). Hint: think about some concrete examples of X and Y (marital status, employment, rain, long commute time, etc.); to prove something is "possible" only requires a single example where it is true.
a) Explain why this joint distribution suggests some type of relationship between X and Y.
b) Given the joint distribution, is it possible that X has a causal effect on Y? Why/why not?
c) Given the joint distribution, is it possible that X does not have a causal effect on Y? Why/why not?
6.2.3 Conditional Distributions
For non-continuous variables, the conditional distribution consists of the conditional probabilities of all different possible values. The conditional distribution of Y given X = x consists of the conditional PMF, i.e., the conditional probability of each possible value y given X = x.
The conditional probability of one event (like Y = 1) given another event (like X = 1) considers only the times when the conditioning event (like X = 1) occurs, and then takes the proportion of those times that the first event (like Y = 1) occurs. Mathematically, the conditional probability P(Y = 1 | X = 1) can be read as "the probability that Y equals one conditional on X equal to one," or "the probability of Y being one given X equals one," or other variations. More generally, P(Y = y | X = x) is "the probability that Y equals y conditional on X equal to x."
For non-continuous variables, a conditional probability can be written in terms of joint and marginal probabilities. Specifically,
$$P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)} \tag{6.7}$$
(This doesn't apply to continuous X, since the denominator would be P(X = x) = 0.)
There is nothing mathematically special about the labels X and Y here. However, conventional regression notation corresponds to conditioning on the variable named X. To examine X conditional on Y, we could just switch the labels and then examine Y conditional on X.
For example, consider the probability of employment (Y = 1) conditional on being married (X = 1). Applying (6.7),
$$P(Y = 1 \mid X = 1) = \frac{P(Y = 1, X = 1)}{P(X = 1)} \tag{6.8}$$
The denominator is the proportion of the population that's married. The numerator is the proportion of the population that's both married and employed.
Examples
For intuition, you can imagine the population is actually 100 people rather than abstract probabilities. You may have unknowingly computed (sample) conditional probabilities in grade school, if you ever answered questions like, "What proportion of the boys in our class are wearing glasses?" or, "What proportion of students with black hair are wearing a sweater?"
In Table 6.1, multiplying values by 100 gives the number of people in each of the four cells: 10 people are unmarried and unemployed, with (X = 0, Y = 0); another 10 are unmarried but employed, with (X = 0, Y = 1); 20 people are married but not employed, with (X = 1, Y = 0); and 60 people are married and employed, (X = 1, Y = 1). Probabilities are proportions: e.g., P(X = 1, Y = 1) = 0.60, so the proportion of married and employed individuals in the 100-person population is 60/100 = 0.60 = 60%.
Table 6.2 shows the number of people with different values, parallel to Table 6.1.
Table 6.2: Counts of individuals by marital status (X) and employment status (Y)

                  not employed    employed    Marginal (sum)
not married       10              10          20
married           20              60          80
Marginal (sum)    30              70          100
Using Table 6.2 to continue with the 100-person population, the conditional probability P(Y = 1 | X = 1) asks: within the group of married individuals (X = 1), what proportion of them are employed? There are 60 married and employed individuals, and 20 married who are not employed, so 80 total. This 80 is 100 times the marginal probability P(X = 1) = 0.80. Out of those 80, 60 are employed. Thus, the proportion of married individuals who are employed is 60/80 = 0.75 = 75%. That is, to compute the conditional probability, we take the "joint" number of individuals who are both married and employed (both X = 1 and Y = 1), and divide by the "marginal" number of married individuals (X = 1). Similarly, the proportion of married individuals who are not employed is 20/80 = 0.25 = 25%. For the unmarried group (20 individuals total), the proportion who are employed is 10/20 = 0.5 = 50%, which is also the proportion who are not employed.
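The "joint divided by marginal" calculation in (6.7) can be written as a short function. This is a minimal Python sketch using the Table 6.1 probabilities; the function name `conditional_pmf_of_y` is a hypothetical label chosen for illustration.

```python
# Joint PMF from Table 6.1: keys are (x, y), values are P(X = x, Y = y).
joint = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.60}


def conditional_pmf_of_y(joint_pmf, x):
    """Return {y: P(Y = y | X = x)} using (6.7): joint divided by marginal."""
    # Marginal probability P(X = x): sum the row of the joint table.
    p_x = sum(prob for (xi, _), prob in joint_pmf.items() if xi == x)
    if p_x == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return {y: prob / p_x for (xi, y), prob in joint_pmf.items() if xi == x}


given_married = conditional_pmf_of_y(joint, 1)
given_unmarried = conditional_pmf_of_y(joint, 0)

# Round only for display.
print("P(Y = y | X = 1):", {y: round(p, 2) for y, p in given_married.items()})
print("P(Y = y | X = 0):", {y: round(p, 2) for y, p in given_unmarried.items()})
```

The explicit check for a zero denominator echoes the remark that (6.7) cannot be applied when P(X = x) = 0.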
6.2.4 Conditional Mean
The conditional mean is just the mean of a conditional distribution. Conditional on a particular value X = x, like X = 1, there is a conditional distribution of Y. The mean of that conditional distribution is written
$$E(Y \mid X = x) \tag{6.9}$$
To read (6.9) aloud, you could say "the conditional mean of Y given X = x" or "the mean of Y conditional on X = x."
Examples
From Table 6.1, we can compute a conditional mean. We already computed the conditional distribution of employment status (Y) conditional on being married (X = 1): P(Y = 1 | X = 1) = 0.75 and P(Y = 0 | X = 1) = 0.25. The mean of that conditional distribution is written E(Y | X = 1). We can use the usual expected value formula, plugging in conditional probabilities. For comparison, the unconditional and conditional (on X = 1) means of Y are, respectively,
$$\begin{align}
E(Y) &= (0) P(Y = 0) + (1) P(Y = 1) = (0)(0.3) + (1)(0.7) = 0 + 0.7 = 0.7 \tag{6.10} \\
E(Y \mid X = 1) &= (0) P(Y = 0 \mid X = 1) + (1) P(Y = 1 \mid X = 1) \notag \\
&= (0)(0.25) + (1)(0.75) = 0 + 0.75 = 0.75 \tag{6.11}
\end{align}$$
Since Y is binary (0 or 1), the (conditional) mean is the (conditional) probability of Y = 1: E(Y) = P(Y = 1) = 0.7 and E(Y | X = 1) = P(Y = 1 | X = 1) = 0.75.
Conditional means can be computed similarly for non-binary Y and X. For example, imagine Y is hours worked per week, which is either 0, 20, or 40, and X is years of education, which is either 11, 12, or 16. The conditional mean is
$$\begin{aligned}
E(Y \mid X = x) &= \sum_{j \in \{0, 20, 40\}} (j) P(Y = j \mid X = x) \\
&= (0) P(Y = 0 \mid X = x) + (20) P(Y = 20 \mid X = x) + (40) P(Y = 40 \mid X = x)
\end{aligned} \tag{6.12}$$
Table 6.3: Joint distribution of education (X) and weekly hours worked (Y)

          Y = 0    Y = 20    Y = 40
X = 11    0.10     0.05      0.05
X = 12    0.05     0.10      0.15
X = 16    0.10     0.10      0.30
Table 6.3 shows an example joint distribution of such an X and Y, from which conditional means can be computed. The values in the table are joint probabilities; e.g., the entry in the last column of the second row shows P(X = 12, Y = 40) = 0.15. Consider the conditional mean E(Y | X = 16). To apply (6.12) requires the conditional probabilities, which can be computed using (6.7). First, the marginal probability sums all entries in the row:
$$P(X = 16) = 0.10 + 0.10 + 0.30 = 0.5 \tag{6.13}$$
Second, plugging this into (6.7),
$$\begin{aligned}
P(Y = 20 \mid X = 16) &= \frac{P(Y = 20, X = 16)}{P(X = 16)} = \frac{0.10}{0.50} = 0.2 \\
P(Y = 40 \mid X = 16) &= \frac{P(Y = 40, X = 16)}{P(X = 16)} = \frac{0.30}{0.50} = 0.6
\end{aligned} \tag{6.14}$$
Third, plugging these into (6.12),
$$E(Y \mid X = 16) = 0 + (20)(0.2) + (40)(0.6) = 4 + 24 = 28 \tag{6.15}$$
As a sanity check, note that the probability of Y = 40 is higher than that of Y = 0, so it makes sense that the conditional mean is above 20. The specific value E(Y | X = 16) = 28 says that within the part of the population with X = 16 years of education, the mean weekly hours worked is 28.
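The three steps above (marginal, conditional probabilities, weighted sum) can be sketched in Python for every row of Table 6.3. The helper name `conditional_mean_of_y` is an assumption for illustration, and the values for X = 11 and X = 12 are not reported in the text; they are simply what the same formula produces from the table.

```python
# Joint PMF from Table 6.3: keys are (years of education, weekly hours).
joint = {
    (11, 0): 0.10, (11, 20): 0.05, (11, 40): 0.05,
    (12, 0): 0.05, (12, 20): 0.10, (12, 40): 0.15,
    (16, 0): 0.10, (16, 20): 0.10, (16, 40): 0.30,
}


def conditional_mean_of_y(joint_pmf, x):
    """Compute E(Y | X = x) by combining (6.7) and (6.12)."""
    # Step 1: marginal probability P(X = x), as in (6.13).
    p_x = sum(prob for (xi, _), prob in joint_pmf.items() if xi == x)
    # Steps 2 and 3: weight each y by P(Y = y, X = x) / P(X = x) and sum.
    return sum(y * prob for (xi, y), prob in joint_pmf.items() if xi == x) / p_x


for x in (11, 12, 16):
    print(f"E(Y | X = {x}) = {conditional_mean_of_y(joint, x):.2f}")
```

Dividing once at the end is algebraically the same as first forming each conditional probability and then summing, since P(X = x) is a common denominator.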
6.2.5 Comparison of Joint, Marginal, and Conditional Distributions
The joint distribution has all the possible information about the distribution of random variables (X, Y). Rearranging (6.7), each joint probability can be reconstructed by multiplying the appropriate conditional and marginal probabilities:
$$P(X = x, Y = y) = P(Y = y \mid X = x) P(X = x) \tag{6.16}$$
Thus, knowing the joint distribution has the same information as knowing both the conditional (of Y given X = x) and marginal (of X) distributions. However, the conditional distributions alone (without the marginals) contain less information than the joint distribution; i.e., there are multiple possible joint distributions that would be consistent with a single set of conditional distributions. Similarly, the marginal distributions of X and Y alone (without the conditionals) contain less information than the joint distribution.
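To see concretely why conditional distributions alone do not pin down the joint distribution, the sketch below applies (6.16) twice with the same conditional probabilities (those implied by Table 6.1) but two different marginals for X. The second marginal, P(X = 1) = 0.5, is a hypothetical value chosen only for illustration, as is the function name `joint_from`.

```python
# Conditional probabilities P(Y = 1 | X = x) implied by Table 6.1.
p_y1_given_x = {0: 0.50, 1: 0.75}


def joint_from(p_x1, cond):
    """Build a joint PMF via (6.16): P(X=x, Y=y) = P(Y=y | X=x) P(X=x)."""
    p_x = {0: 1.0 - p_x1, 1: p_x1}
    joint = {}
    for x in (0, 1):
        joint[(x, 1)] = cond[x] * p_x[x]
        joint[(x, 0)] = (1.0 - cond[x]) * p_x[x]
    return joint


table_6_1 = joint_from(0.80, p_y1_given_x)    # marginal of X from Table 6.1
alternative = joint_from(0.50, p_y1_given_x)  # hypothetical marginal of X

# Same conditionals, different joint distributions.
print({k: round(v, 3) for k, v in sorted(table_6_1.items())})
print({k: round(v, 3) for k, v in sorted(alternative.items())})
```

The first call recovers Table 6.1, while the second yields a different but equally valid joint PMF with identical conditional distributions of Y given X.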
Going into regression, we'll focus on the conditional means E(Y | X = x), which are summary features of conditional distributions, which in turn are summaries of the full joint distribution. That is, conditional means (and regression) only learn one particular feature about the population joint distribution of (X, Y). As discussed in Section 2.3.1, there is a tradeoff between learning more information and having a summary that is more easily understood and communicated.
6.2.6 Independence and Dependence
If random variables X and Y are independent, then they are completely unrelated, statistically speaking. Notationally, independence is usually written as $X \perp\!\!\!\perp Y$, which is equivalent to $Y \perp\!\!\!\perp X$.
Independence implies equality of marginal and conditional distributions. Mathematically, the marginal (unconditional) distribution of Y is the same as the conditional distribution of Y given X = x, for any x. Intuitively, if X is unrelated to Y, then knowing the value of X has no information about the value of Y.
This characterization of independence can be written in terms of a PMF or CDF. If Y is not continuous and thus has a PMF, then independence implies the marginal PMF equals the conditional PMF: for any possible y and x values,
$$Y \perp\!\!\!\perp X \implies P(Y = y) = P(Y = y \mid X = x) \tag{6.17}$$
If Y is not a nominal categorical variable and thus has a CDF, then for any possible y and x,
$$Y \perp\!\!\!\perp X \implies P(Y \le y) = P(Y \le y \mid X = x) \tag{6.18}$$
Consequently, independence implies equality of marginal and conditional means, known as mean independence. That is, for any possible x value,
$$Y \perp\!\!\!\perp X \implies E(Y) = E(Y \mid X = x) \tag{6.19}$$
Independence implies many other properties, too, like Cov(X, Y) = Corr(X, Y) = 0 and P(X = x, Y = y) = P(X = x) P(Y = y).
The opposite of independence is dependence. If any condition implied by independence does not hold, then the variables are dependent, written $X \not\perp\!\!\!\perp Y$. For example, if Corr(X, Y) ≠ 0, then $X \not\perp\!\!\!\perp Y$. Or, if E(Y | X = 1) ≠ E(Y | X = 0), then X and Y are neither independent nor mean independent.
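One of the listed implications, P(X = x, Y = y) = P(X = x) P(Y = y), gives a simple numerical check for dependence. The sketch below applies it to Table 6.1 and to a hypothetical joint PMF constructed as the product of the same marginals; the function name `satisfies_product_rule` and the second PMF are illustrative assumptions.

```python
def satisfies_product_rule(joint_pmf, tol=1e-12):
    """Check whether P(X=x, Y=y) = P(X=x) P(Y=y) for every (x, y) pair.

    Assumes joint_pmf has an entry for every combination of x and y.
    """
    xs = sorted({x for x, _ in joint_pmf})
    ys = sorted({y for _, y in joint_pmf})
    p_x = {x: sum(joint_pmf[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(joint_pmf[(x, y)] for x in xs) for y in ys}
    return all(
        abs(joint_pmf[(x, y)] - p_x[x] * p_y[y]) <= tol for x in xs for y in ys
    )


# Table 6.1: P(X=1, Y=1) = 0.60 but P(X=1) P(Y=1) = 0.80 * 0.70 = 0.56.
table_6_1 = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.60}

# Hypothetical joint PMF built as the product of the same marginals.
product_pmf = {(0, 0): 0.06, (0, 1): 0.14, (1, 0): 0.24, (1, 1): 0.56}

print("Table 6.1 passes product rule:", satisfies_product_rule(table_6_1))
print("Product PMF passes product rule:", satisfies_product_rule(product_pmf))
```

A single failed cell is enough to conclude dependence, which matches the statement that dependence follows whenever any condition implied by independence does not hold.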
6.3 Population Model: Conditional Expectation Function
⟹ Kaplan video: CEF (Binary X)
This and the following sections consider what we want to learn about the population, and how we can write it mathematically. There is no data, no estimation, no uncertainty.
A model describes the relationship between two (or more) variables, like education and income. If it describes how income changes with education, then income is usually written as Y and called the outcome variable, regressand, dependent variable, left-hand side variable, or response variable, while education is written as X and called the regressor, independent variable, right-hand side variable, predictor, covariate, or conditioning variable.
Like before, these variables are treated mathematically as random variables. The "population" is a joint probability distribution of the observable variables.
There are different models for different types of relationships between two variables. Section 6.3 models a statistical relationship, with interpretations for description or prediction, whereas Sections 6.4 and 6.5 model causal relationships. Sometimes the descriptive and causal models coincide, but generally they differ.
This section combines Sections 6.2.1 and 6.2.3 to get a conditional mean regression model.
In Sum: Conditional Expectation Function
Description: shows mean Y for each subpopulation with same X, E(Y | X = x)
Prediction: with quadratic loss, optimal prediction of Y given X = x is E(Y | X = x)
Causality: CEF difference sometimes has causal interpretation (Section 6.6)
6.3.1 Conditional Expectation Function
Using (6.9), let m(·) be the conditional expectation function (CEF) of Y given X:
$$m(x) \equiv E(Y \mid X = x) \tag{6.20}$$
That is, the CEF m(·) takes a value of x as input, like x = 1, and tells us the corresponding conditional mean of Y, like E(Y | X = 1) = 7.
It helps to remember what's random and what's non-random. The CEF m(·) is a non-random function, just as E(Y) is non-random. For any X = x, Y has a conditional distribution, whose mean is m(x), a non-random value. You can draw a graph of a CEF just like you graphed any other (non-random) function in high school. In contrast, m(X) is a random variable. That is, there are multiple possible values of m(X) because there are multiple possible values of X.
If X is binary, as in this chapter, then there are two conditional means of interest:
$$m(0) = E(Y \mid X = 0), \qquad m(1) = E(Y \mid X = 1) \tag{6.21}$$
There are two possible approaches. First, these two conditional means could be studied directly, similar to Chapter 4. That is, $Y^A$ has the distribution of Y given X = 0, and $Y^B$ has the distribution of Y given X = 1. Second, the conditional means can be captured in a CEF regression model.
For example, consider Table 6.1. From (6.11), m(1) ≡ E(Y | X = 1) = 0.75. Also,
$$\begin{aligned}
m(0) \equiv E(Y \mid X = 0) &= (0) P(Y = 0 \mid X = 0) + (1) P(Y = 1 \mid X = 0) \\
&= P(Y = 1 \mid X = 0) = \frac{P(Y = 1, X = 0)}{P(X = 0)} = \frac{0.1}{0.2} = 0.5
\end{aligned}$$
Thus, the CEF is m(0) = 0.5, m(1) = 0.75. Also from Table 6.1, the marginal distribution of X is P(X = 0) = 0.2, P(X = 1) = 0.8. Thus, m(X) is a random variable with
$$P(m(X) = 0.5) = P(X = 0) = 0.2, \qquad P(m(X) = 0.75) = P(X = 1) = 0.8 \tag{6.22}$$
6.3.2 CEF Error Term
Extending (6.4), the CEF error term is defined as
$$V \equiv Y - m(X) \tag{6.23}$$
i.e., the difference between an individual's actual outcome Y and the CEF evaluated at her X value, m(X). As always, other letters could be used besides V, like U or W; in other textbooks you may see u or e or ε. Since Y and X are random variables, so is V; e.g., P(V = 0) = P(Y − m(X) = 0).
For example, let X = 1 indicate a college degree (and X = 0 otherwise), and let Y be income. Then m(0) is the mean income among the no-college population, and m(1) is the mean income among college degree holders. If you are a successful tech company CEO who went to college (X = 1), then your Y is high above m(1), so your CEF error in (6.23) is very large and positive. Or, if you didn't go to college (X = 0) and make exactly the mean income for that group, your Y equals m(0), so your CEF error is V = 0. Or, if you went to a fancy college but decided to live off your parents' wealth and earn no income, then your Y = 0, so your V = Y − m(1) = 0 − m(1), very negative.
The CEF error has conditional mean zero. Extending (6.5), for any X = x,
$$\begin{aligned}
E(V \mid X = x) &= E(Y - m(X) \mid X = x) = E(Y \mid X = x) - E(m(X) \mid X = x) \\
&= m(x) - m(x) = 0
\end{aligned}$$
Equivalently,
$$E(V \mid X) = 0 \tag{6.24}$$
That is, E(V | X) is a random variable depending on X, but it equals zero for every possible value of X; or just imagine "E(V | X = x) = 0 for all x" every time you see "E(V | X) = 0."
As in Section 6.2.1, this conditional mean zero property is not an assumption: it is true by definition for any CEF error defined as in (6.23). By analogy, if we define V as a square, then it always has the property of having four equal sides and four equal angles; such properties are not additional assumptions.
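The "true by definition" claim can be checked numerically for the joint distribution in Table 6.1. The following Python sketch computes m(x) and then the conditional mean of V = Y − m(X) for each x; the function names are hypothetical and chosen only for illustration.

```python
# Joint PMF from Table 6.1: keys are (x, y), values are P(X = x, Y = y).
joint = {(0, 0): 0.10, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.60}


def cef(joint_pmf, x):
    """m(x) = E(Y | X = x), as in (6.20)."""
    p_x = sum(prob for (xi, _), prob in joint_pmf.items() if xi == x)
    return sum(y * prob for (xi, y), prob in joint_pmf.items() if xi == x) / p_x


def conditional_mean_of_cef_error(joint_pmf, x):
    """E(V | X = x), where V = Y - m(X) is the CEF error from (6.23)."""
    p_x = sum(prob for (xi, _), prob in joint_pmf.items() if xi == x)
    m_x = cef(joint_pmf, x)
    return sum(
        (y - m_x) * prob for (xi, y), prob in joint_pmf.items() if xi == x
    ) / p_x


for x in (0, 1):
    # Any nonzero digits printed here would be floating-point rounding only.
    print(f"m({x}) = {cef(joint, x):.2f}, "
          f"E(V | X = {x}) = {conditional_mean_of_cef_error(joint, x):.1e}")
```

Changing the probabilities in `joint` to any other valid joint PMF should leave E(V | X = x) at zero (up to rounding), which is the sense in which the property is not an assumption.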
6.3.3 CEF Model in Error Form
Given (6.23), extending (6.2), the CEF model in error form is
$$Y = m(X) + V, \qquad E(V \mid X) = 0 \tag{6.25}$$
The statement E(V | X) = 0 is equivalent to saying m(x) = E(Y | X = x), i.e., that m(·) is the CEF. Again, it is not an assumption about V; it is just stating what type of model this is. Equation (6.25) can apply to non-binary X, too, as in later chapters.
6.3.4 Linear CEF Model
With binary X, the model in (6.25) is equivalent to
$$\begin{align}
Y &= m(0) \mathbf{1}\{X = 0\} + m(1) \mathbf{1}\{X = 1\} + V \tag{6.26} \\
&= m(0)(1 - X) + m(1)(X) + V \notag \\
&= m(0) + [m(1) - m(0)]X + V \tag{6.27}
\end{align}$$
To double-check: when X = 0, then [m(1) − m(0)]X = 0, so Y = m(0) + V, as in the original (6.25). When X = 1, then m(0) + [m(1) − m(0)]X = m(0) + m(1) − m(0) = m(1), so Y = m(1) + V, also as in (6.25). Thus, (6.25) and (6.27) are equivalent for binary X.
The CEF model in (6.27) can be rewritten yet again to yield a more familiar, conventional structure. Following conventional notation, let β0 ≡ m(0) and β1 ≡ m(1) − m(0). Plugging these definitions into (6.27),
$$Y = \beta_0 + \beta_1 X + V, \qquad E(V \mid X) = 0 \tag{6.28}$$
In (6.28), β0 and β1 are called the parameters. Greek letters like β are commonly used to denote unknown parameters in a population model. In the frequentist framework, these are seen as unknown but fixed (non-random) values, whereas Y, X, and V are random variables. In (6.28) specifically, β0 is the intercept and β1 is the slope. Sometimes regression parameters are called coefficients: β1 is the slope coefficient, or the coefficient on X.
Model (6.28) is a linear CEF model. It is a "CEF" model because E(Y | X = x) = β0 + β1x, or E(Y | X) = β0 + β1X. The "linear" part is explained in Section 8.2.1; for now, it suffices to recall that a graph of β0 + β1x is a straight line.
Since X is binary, no assumptions were required to write (6.28) given (6.25). However, when X has more than two possible values, it is more complicated, as discussed in Chapter 7. For now, with binary X, the CEF model can always be written as in (6.28).
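As a worked instance, the parameter definitions can be applied to the conditional means already computed from Table 6.1, m(0) = 0.5 and m(1) = 0.75. This is only an illustration with those particular numbers:

```latex
\begin{aligned}
\beta_0 &\equiv m(0) = 0.5, \\
\beta_1 &\equiv m(1) - m(0) = 0.75 - 0.5 = 0.25, \\
E(Y \mid X = x) &= \beta_0 + \beta_1 x = 0.5 + 0.25x, \qquad x \in \{0, 1\}.
\end{aligned}
```

Plugging in x = 0 returns 0.5 and plugging in x = 1 returns 0.75, so the linear form reproduces both conditional means exactly, as it must for binary X.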
6.3.5 Interpretation: Description and Prediction
Practice 6.3 (regression parameter units). Let Y be salary measured in $/yr, and let X be the number of college degrees an individual has, either X = 0 or X = 1. In (6.28), what are the units of measure for β0 and β1, respectively?
To interpret (6.28), first consider the units of measure. The left-hand side is just Y, the outcome variable. Since they are equal, the right-hand side must have the same units. Thus, each of the three right-hand side terms must have the same units as Y:
1. β0 has the same units as Y.
2. β1X has the same units as Y, so the units of β1 are the units of Y divided by the units of X.
3. V has the same units as Y.
For example, if Y is measured in $/yr and X is the number of college degrees, then the units of β0 are $/yr and the units of β1 are ($/yr)/(degrees).
For description, the model in (6.28) is the CEF, a summary of the conditional distribution. As seen earlier, β0 = m(0) = E(Y | X = 0), the mean outcome among all individuals with X = 0. Also,
$$\beta_1 = m(1) - m(0) = E(Y \mid X = 1) - E(Y \mid X = 0) \tag{6.29}$$
is the difference between the mean outcome for the X = 1 subpopulation and the mean outcome for the X = 0 subpopulation.
A common phrase to describe such statistical (but maybe not causal) differences is "associated with." For example, if individuals with a college degree have a mean annual income that is $20,000/yr higher than the mean annual income of non-college individuals, then β1 = $20,000/yr, and you could say, "On average, having a college degree is associated with having a $20,000/yr higher annual income." This does not claim that going to college has such a causal effect on income, only a statistical association.
For prediction, the model in (6.28) is also helpful. Section 2.5.4 says the mean is the best predictor if the loss function is quadratic. This continues to be true conditional on X: the conditional mean of Y given X = x is the best predictor given quadratic loss. Formally, letting g(·) denote any possible guess (of Y as a function of X),
$$m(\cdot) = \arg\min_{g(\cdot)} E[(Y - g(X))^2] \tag{6.30}$$
In terms of the model parameters in (6.28), the best predictor of Y given X = 0 is β0, and the best predictor of Y given X = 1 is β0 + β1. Combining these, the best predictor of Y given X is β0 + β1X.
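The optimality claim in (6.30) can be illustrated numerically within one subpopulation. The Python sketch below uses the conditional distribution of Y given X = 1 from Table 6.1 and searches a grid of candidate guesses for the one with the smallest expected quadratic loss. The grid search is only an illustrative check, not a proof, and the names are assumptions.

```python
# Conditional PMF of Y given X = 1, from Table 6.1: P(Y = y | X = 1).
cond_pmf = {0: 0.25, 1: 0.75}


def expected_quadratic_loss(guess, pmf):
    """E[(Y - guess)^2] under the given (conditional) PMF of Y."""
    return sum(prob * (y - guess) ** 2 for y, prob in pmf.items())


# Candidate guesses 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
best_guess = min(grid, key=lambda g: expected_quadratic_loss(g, cond_pmf))

# The conditional mean m(1) = beta0 + beta1.
conditional_mean = sum(y * prob for y, prob in cond_pmf.items())

print("best guess on grid:", best_guess)
print("conditional mean  :", conditional_mean)
```

On this grid the minimizing guess coincides with the conditional mean, consistent with β0 + β1 being the best predictor of Y given X = 1 under quadratic loss.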
6.3.6 Interpretation with Values Besides 0 and 1
What if X has only two possible values, but they aren't 0 and 1? (Technically, such X is still "binary," but usually people mean 0 and 1 when they say "binary.")
For example, let Y again be income, and let X be education, but now measured in years. Instead of comparing individuals with a college degree to those without, imagine comparing individuals with X = 12 years of education to X = 13. By convention, "years of education" is measured starting in grade 1. In the U.S., the last year of high school is grade 12, so completing high school means X = 12; taking a year of college classes leads to X = 13.
The fundamental conditional means are m(12) = E(Y | X = 12) and m(13) = E(Y | X = 13), but β0 and β1 in (6.28) have different meanings than before. Even the units of β1 are different because the units of X have changed. Recall that the units of β1 are units of Y divided by units of X. In both cases, Y is measured in $/yr. Before, X was number of college degrees, so the units of β1 were ($/yr)/(degrees). Now, X is measured in years (of education), so the units of β1 are ($/yr)/(yr). By this alone, β1 must now have a different interpretation; it turns out β0 does, too.
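The parameters β0 and β1 from the CEF model in (6.28) can now be written in terms of m(12) and m(13). Writing the CEF as m(X) = β0 + β1X as in (6.28),
$$\begin{aligned}
m(13\text{ yr}) &= \beta_0 + (13\text{ yr})\beta_1, \qquad m(12\text{ yr}) = \beta_0 + (12\text{ yr})\beta_1 \\
m(13\text{ yr}) - m(12\text{ yr}) &= (\beta_0 - \beta_0) + (13\text{ yr} - 12\text{ yr})\beta_1 = (1\text{ yr})\beta_1 \\
\beta_1 &= \frac{m(13\text{ yr}) - m(12\text{ yr})}{13\text{ yr} - 12\text{ yr}} = [m(13\text{ yr}) - m(12\text{ yr})]/\text{yr} \\
\beta_0 &= m(12\text{ yr}) - (12\text{ yr})\beta_1 = m(12\text{ yr}) - 12[m(13\text{ yr}) - m(12\text{ yr})]
\end{aligned}$$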
The meaning of β1 is qualitatively similar to before: it is the difference in mean income between the high and low education subpopulations. However, instead of "per degree," the units are now "per year" (one year of college education).
In contrast, the interpretation of β0 is very different and not very meaningful. Before, β0 = m(0) was the mean of the low education subpopulation. Here, β0 takes the low education mean m(12) and then subtracts 12 times the mean difference, 12[m(13) − m(12)]. That is, β0 is trying to extrapolate from the means for X = 13 and X = 12 all the way down to X = 0. This is not the true mean income for individuals with zero years of education, m(0); only the means for X = 12 and X = 13 are known. Rather, it is just a guess of m(0) based on m(12) and m(13), and probably a very poor guess. Further, individuals with zero years of education may be very rare or even nonexistent in the larger population, in which case it is not even interesting to guess. So, while the slope β1 continues to have meaning, the intercept β0 may not, unless X = 0 is a possible and interesting value.
Adding another twist, what if instead of X = 12 and X = 13 (years of education), we compare X = 12 and X = 16? That is, instead of comparing high school (X = 12) to one year of college (X = 13), it is compared to a typical four-year college degree (X = 12 + 4 = 16), more similar to our initial inquiry. With m(X) = β0 + β1X as in (6.28),
$$\begin{aligned}
m(16\text{ yr}) &= \beta_0 + (16\text{ yr})\beta_1, \qquad m(12\text{ yr}) = \beta_0 + (12\text{ yr})\beta_1 \\
m(16\text{ yr}) - m(12\text{ yr}) &= (\beta_0 - \beta_0) + (16\text{ yr} - 12\text{ yr})\beta_1 = (4\text{ yr})\beta_1 \\
\beta_1 &= \frac{m(16\text{ yr}) - m(12\text{ yr})}{4\text{ yr}} = (1/4)[m(16\text{ yr}) - m(12\text{ yr})]/\text{yr} \\
\beta_0 &= m(12\text{ yr}) - (12\text{ yr})\beta_1 = m(12\text{ yr}) - 3[m(16\text{ yr}) - m(12\text{ yr})]
\end{aligned}$$
Again, β0 tries to extrapolate from m(16 yr) and m(12 yr) to guess m(0 yr). Again, such a guess is both inaccurate and irrelevant.
The slope parameter β1 is different than before but still meaningful. It takes the mean income difference m(16 yr) − m(12 yr) and then divides by 4 yr. This computes a per-year average difference. This idea is similar to describing a 400 km car trip that took 5 hr by saying the average speed was (400 km)/(5 hr) = 80 km/hr. This may be easier to interpret since we more commonly think of km/hr, but it doesn't mean that the speed was constant during the whole trip; e.g., it may have been slower in the first half due to traffic. Similarly, β1 does not mean each of the four college years is associated with the same increase in mean income. For example, the fourth year may have the biggest increase if the college degree (not the education itself) matters most.
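To make the extrapolation problem concrete, the sketch below solves for β0 and β1 from two conditional means. The income figures (30,000 and 50,000 $/yr) are hypothetical numbers invented purely for illustration, and the function name is likewise an assumption.

```python
def line_through_two_means(x_low, m_low, x_high, m_high):
    """Solve m(x) = beta0 + beta1 * x from two (x, m(x)) points."""
    slope = (m_high - m_low) / (x_high - x_low)   # units of Y per unit of X
    intercept = m_low - x_low * slope             # implied value at x = 0
    return intercept, slope


# Hypothetical mean incomes in $/yr (NOT from the text): m(12) and m(16).
m12, m16 = 30_000.0, 50_000.0

beta0, beta1 = line_through_two_means(12, m12, 16, m16)

print("beta1 =", beta1, "($/yr)/(yr)")   # per-year average difference
print("beta0 =", beta0, "$/yr")          # extrapolated 'guess' of m(0)
```

With these made-up numbers, the intercept comes out negative: the straight line extrapolated down to zero years of education implies a negative mean income, which illustrates why β0 need not be meaningful when X = 0 is far from the observed values.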
6.4 Population Model: Potential Outcomes
In contrast to a purely "statistical" model like the CEF model, we could imagine a causal model that shows causal relationships between variables. One way to do this is with potential outcomes, as introduced in Section 4.4. Again, let $Y^U$ and $Y^T$ denote the untreated and treated potential outcomes, respectively.
Here, the two observable variables are the observed outcome Y and the treatment indicator (or treatment dummy) X. That is, X = 1 if an individual was "treated" and X = 0 if not. As before, "treatment" is interpreted very broadly, including things like going to a charter school, a right-to-work law, a tax policy, or even personal characteristics.
The observed outcome is
$$Y = (Y^U)(1 - X) + (Y^T)(X) \tag{6.31}$$
Plugging in X = 0 yields $Y = Y^U$, whereas plugging in X = 1 yields $Y = Y^T$. So, we observe $Y = Y^U$ if the individual is untreated, and $Y = Y^T$ if treated.
Notationally, potential outcomes notation writes Y as a function of X. This is what (6.31) shows: the treated potential outcome is $Y(1) = Y^T$ and the untreated potential outcome is $Y(0) = Y^U$.
Equation (6.31) can be rearranged to look more like a linear model in error form. First,
$$Y = Y^U + X(Y^T - Y^U) \tag{6.32}$$
This is a random coefficient model with intercept $Y^U$ and slope $Y^T - Y^U$. The intercept and slope coefficients are "random" in that they can have different possible values for different individuals.
Second, the random coefficients can be turned into non-random coefficients by adding an error term. Define $\beta_0 \equiv E(Y^U)$ and $\beta_1 \equiv E(Y^T - Y^U)$, i.e., the ATE. Then (6.32) becomes
$$\begin{aligned}
Y &= [Y^U + \beta_0 - \beta_0] + X[Y^T - Y^U + \beta_1 - \beta_1] \\
&= \beta_0 + X\beta_1 + \overbrace{Y^U - \beta_0 + X[Y^T - Y^U - \beta_1]}^{\equiv U}
\end{aligned} \tag{6.33}$$
Equation (6.33) has the same structure as (6.28), but very different meaning. The parameter β1 is the ATE of X on Y: it has a causal meaning, not just statistical meaning. However, we don't know if E(U | X) = 0. The error term U might have such statistical properties, but the model itself does not define U in terms of statistical properties, but rather in terms of potential outcomes.
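A tiny made-up population can show how the error term in (6.33) may fail to have conditional mean zero. In the Python sketch below, the four individuals and their potential outcomes are entirely hypothetical; treatment is deliberately assigned to those with higher potential outcomes.

```python
# Hypothetical population of four equally likely individuals (NOT from the
# text): each tuple is (untreated outcome Y^U, treated outcome Y^T, X).
population = [
    (10.0, 12.0, 0),
    (10.0, 12.0, 0),
    (20.0, 30.0, 1),
    (20.0, 30.0, 1),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


beta0 = mean(yu for yu, _, _ in population)        # E(Y^U)
beta1 = mean(yt - yu for yu, yt, _ in population)  # ATE = E(Y^T - Y^U)


def observed(yu, yt, x):
    """Observed outcome, as in (6.31)."""
    return yu * (1 - x) + yt * x


def error_u(yu, yt, x):
    """Error term U implied by (6.33): Y - beta0 - beta1 * X."""
    return observed(yu, yt, x) - beta0 - beta1 * x


mean_u_given_x = {
    x: mean(error_u(yu, yt, xi) for yu, yt, xi in population if xi == x)
    for x in (0, 1)
}
cef_slope = (
    mean(observed(*p) for p in population if p[2] == 1)
    - mean(observed(*p) for p in population if p[2] == 0)
)

print("ATE (beta1):", beta1)
print("E(U | X = x):", mean_u_given_x)
print("CEF slope m(1) - m(0):", cef_slope)
```

In this constructed example E(U | X) differs across X, and the CEF slope m(1) − m(0) differs from the ATE, so the two models share a structure but not a meaning.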
6.5 Population Model: Structural
A structural model also captures causal relationships. The assumption is that the model itself does not change even when variable values and policies change. ("Policy" has a broad meaning here: policies of countries, firms, schools, etc., or even just personal decisions.) More specifically, if we want to assess the causal effect of a certain policy, then the structural model should be invariant to that particular policy. That is, the policy may change the population distribution of variables, but it cannot change the structural model itself; otherwise, the model is not useful.
6.5.1 Linear Structural Model
Consider the linear structural model
$$Y = \beta_0 + \beta_1 X + U \tag{6.34}$$
Unlike in a CEF, the structural model's β1 and U have economic and/or causal meaning by definition. In (6.34), β1 is called a structural parameter (as is β0). It has some economic or causal interpretation, like an elasticity or demand curve slope. Similarly, U is called the structural error term. This U can be interpreted as the aggregation of all other variables that causally determine Y. We can think about U economically, not just statistically. It's possible E(U | X) = 0, but usually not. Conversely, the CEF error Y − m(X) usually does not have causal or economic meaning.
Consider a structural model like (6.34) for the example where Y is income and X is a college degree dummy. Then, U contains everything else that helps determine a person's income: their occupation, their different skill levels (human capital), where they live (city/country), etc.
Warning on notation: I used V in the CEF model in (6.25) and U here to help you avoid confusion, but they are simply letters. I could have used U in both models, or V in both, or ε, or anything else. So, don't think V always means CEF and U means not CEF.
With only a single binary X, (6.33) and (6.34) seem very similar. Still, as seen in Section 6.6, the reduced form approach to identifying the ATE involves assumptions about potential outcomes, whereas the structural approach involves assumptions about X and U.
Superficially, it appears (6.34) claims the causal effect of X on Y is the constant β1, the same for everybody, but see Section 6.5.2.
Warning: if you see a model Y = β0 + β1X + U, make sure you know whether it's a CEF model or a structural model or yet another type of model (like in Chapter 7). The equation by itself only shows a linear relationship; it does not tell us the meaning of the parameters or the error term U. This is something to be very wary of when you look at econometric resources online or in other books: they may have models that look identical but are interpreted very differently.
Practice 6.4. Let X = 1 if an individual's body mass index (BMI) is 30 or greater (the technical definition of obesity) and X = 0 otherwise, and let Y denote hourly wage. Consider the model Y = δ0 + δ1X + W, where δ0 and δ1 are unknown, non-random parameters, and W is the unobserved error term. What is the interpretation of δ1 and W? Explain. (Hint: yes, this is a "trick" question with a very short answer.)
652 General Structural Model and ASE
The linear model is better for building intuition but to expand yourmind consider the structural model
Y = h(XU) (635)
where Y is the outcome X is a binary regressor U = (U1 U2 ) isa vector containing all causal determinants of Y besides X and h(middot)could be any (non-random) function
The linear structural model Y = β0 + β1X + U is a special caseof (635) If h(ab) = β0 + β1a+ g(b) then (635) is
Y = h(X U) = β0 + β1X +
U︷ ︸︸ ︷g(U)
For a single individual, the structural effect on Y of changing X = 0 to X = 1 is
\[ s(U) \equiv h(1, U) - h(0, U) \tag{6.36} \]
This is the causal effect on Y when X increases from 0 to 1, all else equal (ceteris paribus), i.e., holding the value of U fixed. Different individuals have different unobserved U, so they may have different structural effects of X on Y. For example, getting a college degree may increase income a lot for some individuals but not others.
It's very difficult to learn about s(U) since it depends on things we can't observe. Instead, we can try to learn about its mean. As usual, there is a tradeoff: the mean effect is not as informative and helpful for policy, but it is easier to learn about.
The average structural effect (ASE) is a weighted average of individual structural effects. The weights depend on the population distribution of U. Mathematically, this weighted average is the population mean
\[ \mathrm{ASE} \equiv E[s(U)] = E[h(1, U) - h(0, U)] = E[h(1, U)] - E[h(0, U)] \tag{6.37} \]
In this special case with binary X, the final expression looks like the ATE, where Y^U = h(0, U) and Y^T = h(1, U). That is, we can imagine two parallel universes where everything besides X (specifically U) is the same. But U may contain things that would violate SUTVA, like other people's X.
The ASE is sometimes called the average causal effect (e.g., Hansen, 2020, Def. 2.7), but so is the ATE, so I avoid the term “average causal effect” to prevent confusion.
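To make (6.36) and (6.37) concrete, here is a worked example under an assumed functional form. The specific h below, and the use of U1 and U2 as the only relevant components of U, are purely illustrative assumptions and not a model from the text.

```latex
\begin{align*}
h(x, U) &= \beta_0 + (\beta_1 + U_1)\,x + U_2
  && \text{(hypothetical structural function)} \\
s(U) &= h(1, U) - h(0, U) = \beta_1 + U_1
  && \text{(individual structural effect)} \\
\mathrm{ASE} &= E[s(U)] = \beta_1 + E(U_1)
  && \text{(average structural effect)}
\end{align*}
```

In this hypothetical case the individual effect varies with U1 while the ASE is a single number. If E(U1) = 0, then the ASE equals β1 even though individual effects differ across people.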
Discussion Question 6.3 (ES habits and final scores). Let Y be a student's final semester score in this class, 0 ≤ Y ≤ 100, and X = 1 if the student starts each exercise set well ahead of the due date (and X = 0 if not). Consider the structural model Y = a + bX + U and the CEF model Y = c + dX + V.
a) What does U represent? Give some specific examples of what U includes here. (Hint: imagine two students with the same X but different Y; what causes them to have different Y?)
b) Do you think E(U | X = 0) = E(U | X = 1)? Why/why not?
c) Do you think b = d, b < d, or b > d? Why?
Practice 6.5 (ES habits: parameters). In DQ 6.3, what would you guess are reasonable possible values of the parameters a, b, c, and d? Explain.
6.6 Identification
⇒ Kaplan video: Identification of College Effect on Earnings
This section focuses on identification of parameters with causal meaning. In particular, when does the slope parameter β1 in the CEF model also have a causal interpretation? With binary X, β1 is a (conditional) mean difference, so intuition is similar to Section 4.6, but many more details and formal results are provided here. As in Section 4.6, the conditions required for identification are called identifying assumptions.
In Sum: When Does a CEF Difference Have a Causal Interpretation?
• If X is “exogenous” (unrelated to other determinants of Y; Sections 6.6.1 and 6.6.3)
• If X is (as good as) randomized (unrelated to potential outcomes; Section 6.6.2)
6.6.1 Linear Structural Model
Under certain conditions (identifying assumptions), the structural slope β1 in (6.34) is identified, equal to the CEF slope γ1 in
\[ E(Y \mid X = x) = \gamma_0 + \gamma_1 x \tag{6.38} \]
This is equivalent to (6.28), just with γ to avoid confusion with (6.34).
Identifying Assumptions and Formal Results
Qualitatively, the structural slope is identified if X and U are “unrelated.” That is, the regressor X must be unrelated to the unobserved determinants of Y (that comprise U): U cannot be systematically higher or lower for certain X values. If true, then X is called exogenous. If not, then X is called endogenous. The precise mathematical condition for a regressor's exogeneity (or endogeneity) depends on the model.
Quantitatively, there are a few ways to describe “exogeneity” of X in (6.34). Specifically, each of Assumptions A6.1–A6.3 is a sufficient condition for β1 = γ1.
Assumption A6.1 (independent error). U is independent of X: X ⊥⊥ U.
Assumption A6.2 (mean independent error). U is mean independent of X: E(U | X) = E(U). For binary X, equivalently E(U | X = 0) = E(U | X = 1).
Assumption A6.3 (uncorrelated error). U is uncorrelated with X: Corr(U, X) = 0, or equivalently Cov(U, X) = 0.
Some of Assumptions A6.1–A6.3 are stronger than others. Independence is stronger than mean independence:
\[ \overbrace{X \perp\!\!\!\perp U}^{\text{A6.1}} \implies \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \tag{6.39} \]
In general, mean independence is stronger than zero correlation (which is equivalent to zero covariance):
\[ \overbrace{E(U \mid X) = E(U)}^{\text{A6.2}} \implies \overbrace{\mathrm{Cov}(U, X) = 0}^{\text{A6.3}} \tag{6.40} \]
If X has only two possible values, then mean independence is actually equivalent to (⟺) zero correlation.
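The equivalence for binary X can be checked directly. The following short derivation is a sketch for the special case where X is 0 or 1, writing p ≡ P(X = 1) (a symbol introduced here only for convenience) and assuming 0 < p < 1.

```latex
\begin{align*}
\mathrm{Cov}(U, X) &= E(UX) - E(U)\,E(X) \\
  &= p\,E(U \mid X = 1) - p\,\bigl[\,p\,E(U \mid X = 1) + (1 - p)\,E(U \mid X = 0)\,\bigr] \\
  &= p(1 - p)\,\bigl[\,E(U \mid X = 1) - E(U \mid X = 0)\,\bigr]
\end{align*}
```

Since p(1 − p) > 0, the covariance is zero exactly when the two conditional means are equal, which is A6.2 for binary X. With general values x1 and x2, the same argument applies after writing X = x1 + (x2 − x1)D for a binary D, which only rescales the covariance by (x2 − x1).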
Theorem 6.1 formally states the identification theorem. You do not need to write (or even fully understand) proofs for this class, but the proof may help deepen understanding and appreciation for some students.
Theorem 6.1 (linear structural identification). Consider the linear structural and CEF models in (6.34) and (6.38), respectively. Assume X has only two possible values, x1 and x2; as a special case, x1 = 0 and x2 = 1. If any one of Assumptions A6.1–A6.3 is true, then the structural slope is identified and equal to the CEF slope, i.e., β1 = γ1. If additionally E(U) = 0, then the structural intercept is also identified, with β0 = γ0.
Proof. It is sufficient to prove the result for A6.2 because it is weaker than A6.1 and equivalent to A6.3 (given only two possible X values). Starting from the structural model,
\begin{align*}
Y &= \beta_0 + \beta_1 X + U \\
  &= \beta_0 + \beta_1 X + U + \overbrace{E(U) - E(U)}^{=0} \\
  &= \overbrace{\beta_0 + E(U)}^{\gamma_0} + \overbrace{\beta_1}^{\gamma_1} X + \overbrace{U - E(U)}^{V}
\end{align*}
As labeled, the CEF intercept is γ0 = β0 + E(U) and the CEF slope is γ1 = β1 because V ≡ U − E(U) is a CEF error:
\[ E[U - E(U) \mid X] = \overbrace{E[U \mid X]}^{=E(U)\text{ by A6.2}} - E[E(U) \mid X] = E(U) - E(U) = 0 \tag{6.41} \]
Thus, mean independence implies γ1 = β1, and E(U) = 0 further implies γ0 = β0.
In Practice
In practice, Assumptions A6.1–A6.3 are usually difficult to justify. Recall the example where Y is income and X is having a college degree. Imagine U includes something called “ability” that includes all skills not gained directly from college (e.g., skills learned from a parent). To have β1 = γ1, U would have to satisfy E(U | X = 0) = E(U | X = 1), i.e., the non-college and college subpopulations have the same mean ability. Of course, there are many types of ability, but it seems likely that in general college graduates are higher ability. In fact, the most famous of Michael Spence's Nobel Prize-winning work¹ provides a more formal economic model of why the college graduates should have higher ability, i.e., why E(U | X = 1) > E(U | X = 0).
Returning to an even simpler example, let Y be commute time and X = 1 if people are carrying umbrellas, with X = 0 otherwise. Since the umbrellas themselves have no effect on Y, the structural β1 = 0. Since rain affects Y, rain is part of U, although U may also include traffic conditions and such. When X = 0, there is probably no rain, whereas when X = 1, there probably is rain; thus, E(U | X = 1) > E(U | X = 0). Again, the structural error U is clearly not a CEF error. Consequently, the CEF slope has only statistical meaning, not causal meaning. Indeed, here the CEF slope is the difference in mean commute time between days when people carry umbrellas and days they don't, which should be a substantial positive difference (longer commutes on days people carry umbrellas because those are rainy days). However, the causal effect of umbrellas is zero, so the CEF's slope is bigger than the structural β1.
If we could also observe weather conditions, then it might be plausible that the remaining parts of U are unrelated to X. This identification approach is considered in Chapters 9 and 10.
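The umbrella example can be illustrated with a small simulation. This is only a sketch: the probability of rain, the probabilities of carrying an umbrella, the 10-minute rain delay, and the noise level are all hypothetical numbers chosen for illustration, not values from the text.

```r
# Simulation sketch of the umbrella example (all numbers are hypothetical).
set.seed(2)
n <- 50000
rain <- rbinom(n, size = 1, prob = 0.3)      # 1 if it rains that day
# People mostly carry umbrellas when it rains (assumed probabilities)
X <- rbinom(n, size = 1, prob = ifelse(rain == 1, 0.9, 0.1))

beta0 <- 30   # structural intercept (minutes)
beta1 <- 0    # structural slope: umbrellas do not affect commute time
U <- 10 * rain + rnorm(n, mean = 0, sd = 5)  # rain is part of U
Y <- beta0 + beta1 * X + U                   # linear structural model

# CEF slope: difference in mean commute time, umbrella vs. no umbrella
cef_slope <- unname(coef(lm(Y ~ X))["X"])
# Sample analog of E(U | X = 1) - E(U | X = 0)
u_gap <- mean(U[X == 1]) - mean(U[X == 0])
# Same comparison using rainy days only, so rain is held fixed
slope_rainy <- unname(coef(lm(Y ~ X, subset = rain == 1))["X"])

print(c(cef_slope = cef_slope, u_gap = u_gap, slope_rainy = slope_rainy))
```

In this design the CEF slope equals β1 plus the gap in mean U across umbrella groups, so it is well above the structural β1 = 0. Comparing only rainy days removes most of that gap, which previews the idea of observing weather conditions.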
¹ See Spence (1973) or the very brief overview of his signaling model at https://en.wikipedia.org/wiki/Michael_Spence#Career
Discussion Question 6.4 (marriage and salary). Let X = 1 if married, and otherwise X = 0. Let Y be annual salary. Consider the structural model Y = β0 + β1X + U.
a) Explain why probably E(U | X = 1) ≠ E(U | X = 0), and say which you think is higher. (Hint: first think about what else is in U, i.e., what determines someone's salary, or think about variables that differ on average between married and unmarried individuals and whether any of those help determine salary.)
b) Does the average salary difference between married and unmarried individuals have a structural meaning? Why/why not?
6.6.2 Average Treatment Effect
Identification of the average treatment effect (ATE) was initially discussed in Section 4.3. Here, identifying assumptions and results are formally stated in regression notation.
Updating (4.12) to this chapter's notation, the ATE is identified when
\[ E(Y^T) - E(Y^U) = E(Y \mid X = 1) - E(Y \mid X = 0) \tag{6.42} \]
where Y^T and Y^U are (still) the potential treated and untreated outcomes, respectively. The important feature of (6.42) is that the right-hand side contains only observable variables, Y and X. Usually, with enough data we can learn the population joint probability distribution of (Y, X), which in turn determines the conditional means E(Y | X = 1) and E(Y | X = 0). If (6.42) is true, then learning about these conditional means (on the right-hand side) helps us learn the ATE (left-hand side).
Identifying Assumptions and Formal Results
Assumption A6.4 is SUTVA, as discussed in Section 4.4.
Assumption A6.5 is related to the discussion of randomized treatment in Section 4.3. Mathematically, the key is that randomization satisfies statistical independence between the treatment assignment and the individual's pair of potential outcomes: X ⊥⊥ (Y^U, Y^T).
Assumption A6.6 was not discussed before, but it is intuitive: if everybody (or nobody) is treated, then it's impossible to compare treated and untreated outcomes. For example, if P(X = 1) = 0, then nobody is treated, so we only observe Y = Y^U for everybody. We can learn about E(Y^U), but it's impossible to learn about E(Y^T) since Y^T is literally never observed. Although obvious here, a more general overlap assumption may not always hold in more complex models.
The following identifying assumptions combined together are sufficient but not necessary. That is, if they are all true, then the ATE is identified, but there may be other ways to identify the ATE even if they are violated.
The assumptions have various names. Assumption A6.4 is usually just called SUTVA, but the main part of it is often called no interference (or non-interference). Assumption A6.5 has many names: independence, ignorability, or unconfoundedness. The combination of A6.5 and A6.6 is called strong ignorability. For more detail, history, and discussion, see Imbens and Wooldridge (2007).
Assumption A6.4 (SUTVA). Everyone with X = 1 receives the same treatment, and one individual's treatment does not affect any other individual's potential outcomes.
Assumption A6.5 (unconfoundedness). Treatment is independent of the potential outcomes: X ⊥⊥ (Y^U, Y^T).
Assumption A6.6 (overlap). There is strictly positive probability of both treatment and non-treatment: 0 < P(X = 1) < 1.
Formally, identification of the ATE is shown as follows. The key is that A6.5 allows us to observe representative samples of both Y^U and Y^T. Mathematically, this independence assumption implies that the means of the potential outcomes do not statistically depend on the treatment X:
\[ E(Y^T) = E(Y^T \mid X = 1), \qquad E(Y^U) = E(Y^U \mid X = 0) \tag{6.43} \]
From (6.31), Y = Y^T when X = 1, and Y = Y^U when X = 0, so
\[ E(Y^T \mid X = 1) = E(Y \mid X = 1), \qquad E(Y^U \mid X = 0) = E(Y \mid X = 0) \tag{6.44} \]
Combining (6.43) and (6.44), this says that the population mean of the treated potential outcome, E(Y^T), equals the mean of the observed outcome in the treated population, E(Y | X = 1), which in Section 4.6 was E(Y^B). Since E(Y | X = 1) is a feature of the joint distribution of (Y, X), it is identified. Since E(Y^T) = E(Y | X = 1), it is also identified. Similarly, E(Y^U) = E(Y | X = 0) is identified, so E(Y^T) − E(Y^U) is identified.
Theorem 6.2 (ATE identification). Under A6.4–A6.6, the ATE is identified:
\[ E(Y^T - Y^U) = E(Y^T) - E(Y^U) = E(Y \mid X = 1) - E(Y \mid X = 0) \]
which is the slope β1 in the linear CEF model in (6.28).
Proof. The constructive proof of ATE identification links the unobservable ATE with the observable mean difference. Specifically,
\begin{align*}
\overbrace{E(Y^T - Y^U)}^{\text{use linearity}}
  &= \overbrace{E(Y^T) - E(Y^U)}^{\text{use (6.43)}} \\
  &= \overbrace{E(Y^T \mid X = 1)}^{\text{use (6.44)}} - \overbrace{E(Y^U \mid X = 0)}^{\text{use (6.44)}} \\
  &= E(Y \mid X = 1) - E(Y \mid X = 0)
\end{align*}
This equals the linear CEF slope coefficient, as shown in (6.29).
In Practice
Imagine a knee surgery treatment (X) to help arthritis, where Y is knee-specific pain (between 0 and 100). For each individual, we can imagine two parallel universes, identical except for whether the individual gets the treatment (surgery) or not. It is the same surgery for everybody, and naturally one person's surgery cannot affect another person's pain, so SUTVA is satisfied. Half of patients are randomly assigned the treatment, so X ⊥⊥ (Y^U, Y^T) and 0 < P(X = 1) < 1. Thus, Assumptions A6.4–A6.6 are all satisfied, and Theorem 6.2 says the ATE equals the CEF slope, which we can estimate by OLS.
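A simulation can illustrate Theorem 6.2 in this setting. The sketch below invents potential outcomes for a hypothetical population (the pain distributions and the self-selection rule are assumptions made only for illustration) and compares random assignment with self-selected treatment.

```r
# Simulation sketch: randomized vs. self-selected treatment (hypothetical numbers).
set.seed(3)
n <- 100000
# Potential outcomes: knee pain on a 0-100 scale, untreated and treated
YU <- pmin(100, pmax(0, rnorm(n, mean = 60, sd = 15)))
YT <- pmin(100, pmax(0, YU + rnorm(n, mean = -10, sd = 5)))
ATE <- mean(YT - YU)                 # known only because this is a simulation

# Randomized assignment: X independent of (YU, YT), with 0 < P(X = 1) < 1
X <- rbinom(n, size = 1, prob = 0.5)
Y <- ifelse(X == 1, YT, YU)          # observed outcome
slope_random <- unname(coef(lm(Y ~ X))["X"])

# Self-selection (violates A6.5): only people with severe untreated pain get surgery
X_sel <- as.integer(YU > 60)
Y_sel <- ifelse(X_sel == 1, YT, YU)
slope_selected <- unname(coef(lm(Y_sel ~ X_sel))["X_sel"])

print(c(ATE = ATE, slope_random = slope_random, slope_selected = slope_selected))
```

Under random assignment the OLS slope should be close to the ATE. Under the assumed self-selection rule the slope compares high-pain treated people with low-pain untreated people, so it can even have the wrong sign.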
Surgery seems like a very straightforward example, but there can still be problems. For example, maybe the people who volunteer to participate in the randomized experiment are not representative of the general population, e.g., maybe they are feeling very desperate because they have particularly severe arthritis. Another issue is that some people may be hurt by the treatment even if the overall ATE seems helpful. Perhaps the biggest issue in real life was that “surgery” was treated as a black box, without understanding the particular mechanism that reduced pain. It turned out that placebo (fake) surgery was equally effective.²
Consider Theorem 6.2 when X is rain and Y is commute time. In Columbia, MO, there is much less traffic in the “summer” (mid-May to mid-August) when most students are gone, meaning both Y^T and Y^U are lower. There is also more rain (X = 1). That is, X and (Y^U, Y^T) are related, violating Assumption A6.5. Intuitively, the problem is we'd see more short, rainy commutes in the summer and long, dry commutes during the academic year, which makes it seem like rain causes short commutes, but correlation does not imply causation.
Practice 6.6. Discuss the right-to-work example from Sections 4.3–4.5 in terms of Assumptions A6.4–A6.6.
6.6.3 Average Structural Effect
The ASE in (6.37) is identified if it equals the CEF slope. A sufficient condition for this is Assumption A6.7, which is qualitatively similar to Assumption A6.5.
Assumption A6.7 (independence). The unobservable determinants of Y are independent of X; e.g., in the notation of (6.35), U ⊥⊥ X.
Theorem 6.3 (ASE identification). Consider the general structural model in (6.35) and the ASE defined in (6.37). If Assumption A6.7 holds, then the ASE is identified and equal to the slope of the linear CEF in (6.28).
² https://doi.org/10.1056/NEJMoa013259
Proof. Using Y = h(X, U),
\begin{align*}
E(Y \mid X = 1) - E(Y \mid X = 0)
  &= E[h(X, U) \mid X = 1] - E[h(X, U) \mid X = 0] \\
  &= \overbrace{E[h(1, U) \mid X = 1]}^{\text{use A6.7}} - \overbrace{E[h(0, U) \mid X = 0]}^{\text{use A6.7}} \\
  &= E[h(1, U)] - E[h(0, U)]
\end{align*}
which equals the ASE as in (6.37).
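The role of Assumption A6.7 can be seen in a simulation with a nonlinear structural function. The function h below, the two components U1 and U2, and the rule generating the dependent regressor are hypothetical choices for illustration only.

```r
# Simulation sketch of ASE identification (hypothetical structural function).
set.seed(4)
n <- 200000
U1 <- rnorm(n)
U2 <- rnorm(n)
# Hypothetical h(x, U): the effect of x differs across individuals through U1
h <- function(x, u1, u2) 10 + (2 + u1)^2 * x + 3 * u2
ASE <- mean(h(1, U1, U2) - h(0, U1, U2))   # sample analog of E[s(U)]

# Case 1: X independent of U (A6.7 holds)
X_ind <- rbinom(n, size = 1, prob = 0.5)
Y_ind <- h(X_ind, U1, U2)
diff_ind <- mean(Y_ind[X_ind == 1]) - mean(Y_ind[X_ind == 0])

# Case 2: X depends on U1 (A6.7 fails)
X_dep <- as.integer(U1 + rnorm(n) > 0)
Y_dep <- h(X_dep, U1, U2)
diff_dep <- mean(Y_dep[X_dep == 1]) - mean(Y_dep[X_dep == 0])

print(c(ASE = ASE, diff_ind = diff_ind, diff_dep = diff_dep))
```

With independence the mean difference should be close to the ASE; when X is related to U1, the mean difference mixes the structural effect with differences in U between the two groups.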
6.7 Estimation: OLS
⇒ Kaplan video: OLS in R
This section considers estimation of the CEF model (6.28) when X has only two possible values. The interpretation (description, prediction, causality) does not matter for estimation.
One approach is to define Y^A as the X = 0 subpopulation and Y^B as the X = 1 subpopulation. Then, β0 = E(Y^A) and β1 = E(Y^B) − E(Y^A), so Sections 4.1.2 and 4.1.3 can be used to estimate E(Y^A) and E(Y^B).
Though simple, that approach does not generalize as well as ordinary least squares (OLS). The intuition behind the least squares approach comes from the characterization of the conditional mean as the best predictor of Y given X = x, with quadratic loss. The idea extends (3.11) for estimating the unconditional mean of Y. In the population, if E(Y | X = x) = β0 + β1x, then
\[ (\beta_0, \beta_1) = \arg\min_{b_0, b_1} E[L_2(Y, b_0 + b_1 X)] = \arg\min_{b_0, b_1} E[(Y - b_0 - b_1 X)^2] \tag{6.45} \]
where L2(y, g) is the quadratic loss function from (2.39). This shows that the CEF provides the best (with quadratic loss) predictor of Y given X. In the sample, replacing the population mean (E) with the sample mean ((1/n)∑_{i=1}^{n}), the minimization problem analogous to (6.45) is
\[ \text{OLS:} \quad (\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \tag{6.46} \]
The estimated CEF is thus
\[ \hat{m}(x) = \hat\beta_0 + \hat\beta_1 x \tag{6.47} \]
Extending Sections 3.3 and 3.4.2, the OLS regression estimator can be explained in terms of the empirical distribution. Instead of a single S representing Y, now we have (S_Y, S_X) representing (Y, X). If all the sample values (Yi, Xi) are unique, then the empirical distribution has P(S_Y = Yi, S_X = Xi) = 1/n for all i = 1, …, n. Thus, replacing the population mean in (6.45) with the sample average in (6.46) can be seen as using the empirical distribution. That is,
\[ (\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} E[(S_Y - b_0 - b_1 S_X)^2] = \arg\min_{b_0, b_1} \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \tag{6.48} \]
Notationally, as usual, the “hats” on β̂0, β̂1, and m̂(x) indicate that they are computed from the sample, whereas the true population values β0, β1, and m(x) lack hats.
The form of (6.46) explains the L and S in OLS. “Least” (L) refers to minimization, and “squares” (S) refers to squaring Yi − b0 − b1Xi. (Explaining the O is a special treat reserved for econ PhD students.)
Equation (6.46) can be described with the terms introduced around (3.12). Given any estimates (β̂0, β̂1), the fitted values are
\[ \hat{Y}_i \equiv \hat\beta_0 + \hat\beta_1 X_i = \hat{m}(X_i) \tag{6.49} \]
Given Ŷi, the residual is defined as
\[ \hat{U}_i \equiv Y_i - \hat{Y}_i = Y_i - \hat\beta_0 - \hat\beta_1 X_i \tag{6.50} \]
Consequently, (6.46) can be interpreted as saying that the OLS estimates (β̂0, β̂1) make the sum of squared residuals ∑_{i=1}^{n} Ûi² as small as possible.
The OLS estimator is consistent under fairly general conditions; see Section 7.7.2. In that case,
\[ \hat\beta_0 \xrightarrow{p} \beta_0, \quad \hat\beta_1 \xrightarrow{p} \beta_1, \quad \hat{m}(0) \xrightarrow{p} m(0), \quad \hat{m}(1) \xrightarrow{p} m(1) \tag{6.51} \]
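To connect the minimization in (6.46) with what software reports, the following sketch minimizes the sample mean of squared residuals numerically and compares the result with lm(). The ten toy observations and the helper name obj are assumptions made for illustration; in practice you would simply call lm().

```r
# Sketch: OLS as numerical minimization of the objective in (6.46), toy data.
Y <- c(1, 4, 0, 2, 3, 8, 7, 6, 9, 5)
X <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Sample objective: mean squared residual at candidate values b = (b0, b1)
obj <- function(b) mean((Y - b[1] - b[2] * X)^2)

# Numerical minimizer, starting from (0, 0)
num <- optim(par = c(0, 0), fn = obj, method = "BFGS")

# Closed-form OLS via lm()
b_lm <- unname(coef(lm(Y ~ X)))

# Fitted values and residuals as in (6.49) and (6.50)
Yhat <- b_lm[1] + b_lm[2] * X
Uhat <- Y - Yhat

print(rbind(optim = num$par, lm = b_lm))
print(sum(Uhat^2))
```

The two rows printed by rbind() should agree up to numerical tolerance, and any other candidate (b0, b1) plugged into obj() gives a value at least as large as obj(b_lm).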
6.7.1 Code
The following code runs OLS. It also computes the sample means of Yi for the group with Xi = 0 and the group with Xi = 1 separately. This verifies that with a single binary regressor, β̂0 is the sample average Yi for the Xi = 0 group, while β̂0 + β̂1 is the sample average Yi for the Xi = 1 group, so β̂1 is the difference.
df <- data.frame(Y=c(1,4,0,2,3, 8,7,6,9,5), X=c(0,0,0,0,0, 1,1,1,1,1))
ret <- lm(formula=Y~X, data=df)
coef(ret)
## (Intercept)           X
##           2           5
c( mean(df$Y[df$X==0]), mean(df$Y[df$X==1]) )
## [1] 2 7
6.8 Quantifying Uncertainty
⇒ Kaplan video: OLS in R (again)
The ways to quantify uncertainty in Section 3.8 also apply to β0 and β1 in the linear CEF model (6.28). The same interpretations and misinterpretations apply. In particular, these methods do not reflect uncertainty about identifying assumptions. For example, a CI that contains the CEF slope with 95% probability does not contain the structural slope with 95% probability if it is not identified; it could be only 80% or 50% or near 0%.
One new consideration is discussed in Section 6.8.1, followed by sample code in Section 6.8.2.
6.8.1 Heteroskedasticity
Different methods for quantifying uncertainty make different assumptions about the conditional variance. Whereas the conditional mean E(Y | X = x) is the mean of the conditional distribution of Y given X = x, the conditional variance
\[ \sigma_Y^2(x) \equiv \mathrm{Var}(Y \mid X = x) \tag{6.52} \]
is the variance of the conditional distribution of Y given X = x. The term homoskedasticity means σ²_Y(x) = σ²_Y, a constant not depending on x, whereas heteroskedasticity means σ²_Y(x) is not constant. Equivalently, we could write Y = β0 + β1X + U and consider the conditional variance of U, since Var(Y | X) = Var(U | X), so often homoskedasticity and heteroskedasticity are thought of as properties of the error term.
Always use methods that are robust to heteroskedasticity (or heteroskedasticity-robust). This means they're valid with homoskedasticity or heteroskedasticity, whereas other methods only work with homoskedasticity. Logically, the heteroskedasticity-robust methods have weaker assumptions, so they work more often. Besides, heteroskedasticity is very common in real economic data.
The term “robust” by itself is ambiguous. You should always ask: robust to what? Methods can be robust to heteroskedasticity, robust to clustered sampling, robust to measurement error, robust to infinite variance, etc.
Practice 6.7 (heteroskedasticity). Let Y = 1 if employed (and Y = 0 if not), and let X = 1 if female (and X = 0 if not). Explain why there is probably heteroskedasticity. (Hint: if p = P(Y = 1), then Var(Y) = p(1 − p). If px = P(Y = 1 | X = x), then what's Var(Y | X = x)?)
6.8.2 Code
Unfortunately, the default in R is to use homoskedasticity-based standard errors, so you have to make an extra effort to get heteroskedasticity-robust results. The code below does this. Since X is binary, essentially the same results can be obtained with a two-sample unpaired t-test with “unequal variances,” as shown.
The code below quantifies uncertainty about the CEF slope in a regression with a single binary regressor. Using a variety of methods, the code computes a standard error (SE), 95% confidence interval, and t-statistic and two-sided p-value for testing the null hypothesis H0: β1 = 0.
In the table of output at the very end, the first two rows assume homoskedasticity, whereas the remaining four rows do not. The first row is a two-sample t-test assuming equal variances; the second row is the default results based on lm() output. The third row is a two-sample t-test allowing for unequal variances. The remaining rows use more general regression-based methodology allowing for heteroskedasticity, based on the lmtest and sandwich packages in R (Zeileis, 2004; Zeileis and Hothorn, 2002). The first two rows are identical, and the following four rows are very similar to each other, but there is a big difference between the first two rows and the next four rows. This shows the (potentially) big difference between assuming homoskedasticity (as in the first two rows) and allowing for heteroskedasticity (as in the last four). There are multiple ways to allow for heteroskedasticity, like the HC0, HC1, and HC3 shown in the table. The differences are beyond our scope, but as the table suggests, the differences are often very small in practical terms.
In practice, you should use coeftest and coefci to allow for heteroskedasticity, like below.
library(lmtest); library(sandwich)
set.seed(112358)
n <- 1000
df <- data.frame(Y=c(rnorm(n=n/4,mean=0,sd=1),
                     rnorm(n=3*n/4,mean=0.2,sd=2)),
                 X=c(rep(0,n/4), rep(1,3*n/4)))
ret <- lm(formula=Y~X, data=df)
# Store results for slope in sl.out
rn <- c('t.test.eq','Homosk','t.test.uneq','HC0','HC1','HC3')
sl.out <- data.frame(row.names=rn, SE=rep(NA,6), CI.lower=NA,
                     CI.upper=NA, tstat=NA, pvalue=NA)
# HC0: original from Hal White (1980)
ret.VC0 <- vcovHC(ret, type='HC0')
sl.out['HC0','SE'] <- sqrt(ret.VC0[2,2])
# HC1: matches Stata default and two-sample t.test below
ret.VC1 <- vcovHC(ret, type='HC1')
# HC3: recommended/default (and larger SE than HC0, HC1)
ret.VC3 <- vcovHC(ret, type='HC3')
# Default homoskedastic results
sl.out['Homosk',c('SE','tstat','pvalue')] <-
  summary(ret)$coefficients['X',2:4]
# Heteroskedasticity-robust tests/p-values
sl.out['HC0',c(1,4,5)] <- coeftest(ret, vcov.=ret.VC0)['X',2:4]
sl.out['HC1',c(1,4,5)] <- coeftest(ret, vcov.=ret.VC1)['X',2:4]
sl.out['HC3',c(1,4,5)] <- coeftest(ret, vcov.=ret.VC3)['X',2:4]
# Heteroskedasticity-robust CIs (shortest to longest)
sl.out['HC0',2:3] <- coefci(ret, vcov. = ret.VC0)['X',]
sl.out['HC1',2:3] <- coefci(ret, vcov. = ret.VC1)['X',]
sl.out['HC3',2:3] <- coefci(ret, vcov. = ret.VC3)['X',]
sl.out['Homosk',2:3] <- confint(ret, level=0.95)['X',]
# For comparison: t.test() results for slope
t.sl <- t.test(x=df$Y[df$X==1], y=df$Y[df$X==0], mu=0, conf.level=0.95,
               alternative='two.sided', paired=FALSE, var.equal=FALSE)
sl.out['t.test.uneq',-1] <-
  c(t.sl$conf.int, t.sl$statistic, t.sl$p.value)
# For comparison: var.equal=TRUE
t2 <- t.test(x=df$Y[df$X==1], y=df$Y[df$X==0], mu=0, conf.level=0.95,
             alternative='two.sided', paired=FALSE, var.equal=TRUE)
sl.out['t.test.eq',-1] <- c(t2$conf.int, t2$statistic, t2$p.value)
# Back out each t.test SE from its CI: (upper - lower) / (2 * critical value)
sl.out['t.test.uneq',1] <-
  (t.sl$conf.int %*% c(-1,1)) / (2*qt(p=1-0.05/2,df=t.sl$parameter))
sl.out['t.test.eq',1] <-
  (t2$conf.int %*% c(-1,1)) / (2*qt(p=1-0.05/2,df=t2$parameter))
print(round(sl.out, digits=3))
##                SE CI.lower CI.upper tstat pvalue
## t.test.eq   0.128   -0.026    0.476  1.76  0.079
## Homosk      0.128   -0.026    0.476  1.76  0.079
## t.test.uneq 0.095    0.038    0.412  2.36  0.018
## HC0         0.095    0.039    0.412  2.37  0.018
## HC1         0.095    0.038    0.412  2.37  0.018
## HC3         0.095    0.038    0.412  2.36  0.018
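To see where the two kinds of t-test standard errors come from, the sketch below computes them by hand using only base R. The data-generating values mirror the design described above, but the variable names and the hand formulas (the standard pooled and unpooled two-sample formulas) are supplied here for illustration.

```r
# Sketch: equal-variance (pooled) vs. unequal-variance SE for a mean difference.
set.seed(112358)
n <- 1000
Y0 <- rnorm(n = n/4, mean = 0, sd = 1)       # outcomes for the X = 0 group
Y1 <- rnorm(n = 3*n/4, mean = 0.2, sd = 2)   # outcomes for the X = 1 group
n0 <- length(Y0)
n1 <- length(Y1)
diff_means <- mean(Y1) - mean(Y0)            # equals the OLS slope estimate

# Unequal variances: each group's variance is divided by its own sample size
se_uneq <- sqrt(var(Y1) / n1 + var(Y0) / n0)

# Equal variances: one pooled variance is applied to both groups
s2_pool <- ((n1 - 1) * var(Y1) + (n0 - 1) * var(Y0)) / (n1 + n0 - 2)
se_eq <- sqrt(s2_pool * (1 / n1 + 1 / n0))

# Compare with the t-statistics reported by t.test()
t_uneq <- t.test(x = Y1, y = Y0, var.equal = FALSE)
t_eq <- t.test(x = Y1, y = Y0, var.equal = TRUE)
print(c(se_eq = se_eq, se_uneq = se_uneq))
print(c(hand_uneq = diff_means / se_uneq, ttest_uneq = unname(t_uneq$statistic),
        hand_eq = diff_means / se_eq, ttest_eq = unname(t_eq$statistic)))
```

In this design the smaller group also has the smaller variance, so pooling overstates that group's contribution to the uncertainty and the equal-variance SE comes out larger. With a different design, the homoskedastic SE could instead be too small.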
Practice 6.8 (regression significance). Consider the setup of the “audit study” from Bertrand and Mullainathan (2004). Resumes were fabricated that were identical except for the name: Emily (suggesting a white female), Greg (white male), Lakisha (black female), or Jamal (black male). The resumes were then submitted to job openings, and it was recorded whether or not an in-person interview for the job was then offered. Here, let Y = 1 if an interview was offered and Y = 0 if not; let X = 1 if the name is “black” and X = 0 if not. Note that E(Y | X = x) = P(Y = 1 | X = x), i.e., the conditional probability of an interview. A regression of Y on X (including an intercept, as always) is run, and heteroskedasticity-robust standard errors are computed. Consider both economic significance and statistical significance in the following possible results. (Economic and statistical significance were introduced in Section 3.9.7.) Hint: to quickly assess statistical significance, |β̂1/SE| ≥ 2 means statistical significance at a 5% level, with higher values being more statistically significant.
a) Slope estimate β̂1 = 0.00001, SE = 0.000001
b) β̂1 = −0.1, SE = 0.1
c) β̂1 = −0.2, SE = 0.02
d) β̂1 = −0.01, SE = 0.01
Empirical Exercises
Empirical Exercise EE6.1. You will essentially replicate EE4.1, but with regression commands.
a. R only: load the needed packages and look at a description of the dataset:
library(wooldridge); library(sandwich); library(lmtest)
?jtrain2
b. Stata only: run ssc install bcuse if necessary, then load the data with
bcuse jtrain2 , nodesc clear
c. Run a regression of 1978 earnings (re78) on the job training assignment indicator (train).
R: ret <- lm(re78~train, data=jtrain2)
Stata: regress re78 train , vce(robust)
in which vce(robust) requests heteroskedasticity-robust standard errors.
d. R only (since already reported in Stata): output the estimates along with heteroskedasticity-robust standard errors and two-sided 95% confidence intervals with the code
coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
coefci( ret, vcov.=vcovHC(ret, type='HC1'))
where argument type='HC1' refers to one specific type (among multiple) of heteroskedasticity-robust standard error estimator (HC stands for “heteroskedasticity-consistent”).
e. R only: create a subset of the data including only married individuals, with code jt2mar1 <- jtrain2[jtrain2$married==1, ]
f. Run your previous analysis for the subset of married individuals.
R: replace data=jtrain2 with data=jt2mar1
Stata: regress re78 train if married==1 , vce(robust)
g. Repeat your analysis but for unmarried individuals.
h. Repeat your analysis on the full sample of individuals but for the outcome variable unem78 (1978 unemployment indicator) instead of re78 (and remember unemployment is bad, so a negative coefficient is good).
Chapter 7
Simple Linear Regression
⇒ Kaplan video: Chapter Introduction
Depends on: Chapter 6 (which depends on Chapters 2–4)
Unit learning objectives for this chapter
7.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
7.2. Interpret what a linear regression estimates, in multiple ways, mathematically and intuitively [TLOs 2 and 3]
7.3. Assess whether certain assumptions for linear regression seem true or not in real-world examples [TLOs 2 and 6]
7.4. In R (or Stata): estimate a simple linear regression along with measures of statistical uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter
• Regression as description (Masten video)
• James et al. (2013, §3.1)
• Sections 4.1–4.2 (“Simple Linear Regression” and “Estimating the Coefficients of the Linear Regression Model”) in Hanck et al. (2018)
• Sections 2.1 (“Simple OLS Regression”) and 2.2 (“Coefficients, Fitted Values, and Residuals”) in Heiss (2016) [repeated from Chapter 6]
Surprisingly, many critical issues arise with three (instead of two) possible X values. With two, the regression modeled conditional means, useful for description, prediction, and (sometimes) causality. However, with three (or more) X values, we may fail to model the conditional means. In simple cases, this can be solved with a more flexible model; in other cases, we need to reinterpret what OLS actually estimates in practice.
Generally, OLS estimates something called a linear projection. This can also be interpreted as a “best” linear approximation of the CEF (for description) or a “best” linear predictor of Y given X (for prediction). These interpretations are discussed, along with statistical properties of OLS as an estimator of the linear projection (not CEF).
7.1 Misspecification
⇒ Kaplan video: Misspecification of Linear CEF
Consider the linear population model
\[ Y = \beta_0 + \beta_1 X + U \tag{7.1} \]
where supposedly E(U | X) = 0, and this time X has three possible values: 0, 1, and 2.
Intuitively, you should worry already: there are now three conditional means but still only two parameters. That is, we want to learn the three values
\[ m(0) \equiv E(Y \mid X = 0), \quad m(1) \equiv E(Y \mid X = 1), \quad m(2) \equiv E(Y \mid X = 2) \]
but (7.1) has only two parameters, β0 and β1. That's like trying to serve three dinners on only two plates. That does not sound like enough flexibility.
Mathematically, the question is whether
\[ m(x) = \beta_0 + \beta_1 x, \quad x = 0, 1, 2 \]
i.e., whether m(x) is a straight line.
For example, let Y be income and let X be number of siblings. Maybe there is a big income gap between only children (X = 0) and individuals with one sibling (X = 1), but having a second sibling (X = 2) does not change much. To simplify this, let m(0) > m(1) = m(2). From m(1) and m(2) alone, the CEF appears flat (zero slope), in which case β0 = m(1) and β1 = 0 fits these two points. But from m(0) and m(1), the slope appears negative, β1 < 0, and the intercept is β0 = m(0). There is no (β0, β1) that can make β0 + β1x go through all three points m(0), m(1), and m(2) if m(0) > m(1) = m(2).
Figure 7.1 shows the impossibility of a linear CEF in this previous example. In the figure, m(0) = 60 (in thousands of $/yr), and m(1) = m(2) = 40. The line with β0 = m(0) and β1 = m(1) − m(0) < 0 fits the first two CEF values but not the third. The line with β0 = m(1) and β1 = 0 fits the second two CEF values but not the first. It is impossible to draw a straight line (β0 + β1x) through all three points on this CEF, as Euclid could tell us.
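The three CEF values from this example can be checked numerically. The sketch below evaluates the two candidate lines described above and also a least-squares line through the three points; giving the three points equal weight is an assumption made here for illustration, and the helper names line_01 and line_12 are hypothetical.

```r
# Sketch: no straight line passes through all three CEF values.
x <- c(0, 1, 2)
m <- c(60, 40, 40)     # m(0), m(1), m(2) in thousands of $/yr

# Line through m(0) and m(1): b0 = m(0), b1 = m(1) - m(0)
line_01 <- function(x) m[1] + (m[2] - m[1]) * x
# Line through m(1) and m(2): b0 = m(1), b1 = 0
line_12 <- function(x) m[2] + 0 * x
# Least-squares line, weighting the three points equally (an assumption)
fit <- lm(m ~ x)

print(rbind(cef = m,
            line_01 = line_01(x),
            line_12 = line_12(x),
            ls_line = unname(fitted(fit))))
print(resid(fit))      # nonzero residuals: the line misses every point
```

Each row of the printed table differs from the cef row in at least one column, which is the numerical version of the statement that a linear CEF is impossible here.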
Figure 7.1: Misspecification of linear CEF (horizontal axis: X = 0, 1, 2; vertical axis: m(X), with ticks from 0 to 60).
A wrong model is euphemistically termed misspecified. That is, the model assumes something that is not actually true. For the siblings and income example, the linear CEF model in (7.1) is misspecified. The model incorrectly assumes that the conditional mean of Y is linear in X (i.e., an affine function of X). Equivalently, it assumed m(1) − m(0) = m(2) − m(1), which is not true in the example. More specifically, this type of misspecification is called functional form misspecification, since it is the linear functional form that is wrong. That is, even though any values of (β0, β1) are allowed, β0 + β1x is always a straight-line function of x, so it has a linear functional form (the general “shape” of the function).
Practice 7.1 (misspecification). Investigate whether the problem with the sibling example was that X = 0 was a possible value (so that the intercept had to be β0 = m(0)), as follows. Consider the same example but with X = 1, 2, 3 instead of X = 0, 1, 2, so m(1) = 60, m(2) = m(3) = 40. Is it possible to write m(x) = β0 + β1x now? Why or why not?
It's technically possible that a CEF is linear, though extremely unlikely in practice. Continuing the example, if m(2) = 20 exactly, then m(x) = 60 − 20x, i.e., β0 = 60 and β1 = −20. In that case, the linear CEF is properly specified (or correctly specified). If instead m(2) = 20.001, the linear CEF model is misspecified. However, with such a small amount of misspecification, a linear model is a very good approximation.
Arguably, we must learn how best to cope with misspecification since we cannot truly avoid it. As Box (1979, p. 2) famously wrote, “All models are wrong, but some are useful.”¹ With reference to Box's quote, Section 8.3 essentially tries to maximize a model's usefulness by choosing the optimal amount of “how wrong” it is (misspecification).
7.2 Coping with Misspecification
There are two ways to cope with misspecification: change the model, or reinterpret it. The first way is now discussed for (7.1), while reinterpretation is detailed in Sections 7.3–7.5.
¹ See https://en.wikipedia.org/wiki/All_models_are_wrong for additional discussion, including the analogous quote about art from Pablo Picasso: “We all know that art is not truth. Art is a lie that makes us realize truth, at least the truth that is given us to understand. The artist must know the manner whereby to convince others of the truthfulness of his lies.”
7.2.1 Model of Three Values

To fix the misspecification, the model needs to be more flexible. Continuing with X = 0, 1, 2 for simplicity, there are three conditional means, so the model should have three parameters to be flexible enough to avoid misspecification.

One way to add another parameter is to use a dummy variable for each possible value of X. (See Section 2.3.2 to review dummy variables.) Recall the indicator function from (2.3). Here,

$$\mathbf{1}\{X = j\} = \begin{cases} 1 & \text{if } X = j \\ 0 & \text{otherwise} \end{cases}, \qquad j = 0, 1, 2. \tag{7.2}$$

Since only three values of X are possible, 1{X = 0} = 1 − 1{X = 1} − 1{X = 2}. Thus, extending (6.26),

$$\begin{aligned}
m(x) &= m(0)\,\mathbf{1}\{x = 0\} + m(1)\,\mathbf{1}\{x = 1\} + m(2)\,\mathbf{1}\{x = 2\} && (7.3) \\
&= m(0)\,[1 - \mathbf{1}\{x = 1\} - \mathbf{1}\{x = 2\}] + m(1)\,\mathbf{1}\{x = 1\} + m(2)\,\mathbf{1}\{x = 2\} \\
&= m(0) + [m(1) - m(0)]\,\mathbf{1}\{x = 1\} + [m(2) - m(0)]\,\mathbf{1}\{x = 2\} \\
&= \beta_0 + \beta_1\,\mathbf{1}\{x = 1\} + \beta_2\,\mathbf{1}\{x = 2\}, \\
&\qquad \beta_0 \equiv m(0), \quad \beta_1 \equiv m(1) - m(0), \quad \beta_2 \equiv m(2) - m(0). && (7.4)
\end{aligned}$$
Although the structure of (7.3) is easier to interpret, the structure of (7.4) is more common and can be interpreted as follows. The parameter β0 = m(0) is the conditional mean for some base category, X = 0. The other parameters show how other conditional means differ from this base category. Specifically, β1 = m(1) − m(0) is the conditional mean difference between the X = 1 and X = 0 subpopulations, and β2 = m(2) − m(0) is the conditional mean difference between the X = 2 and X = 0 subpopulations.

This interpretation can be applied to the income and siblings example. The parameter β0 is the population mean income among individuals with zero siblings. Zero siblings is the base category. Then, β1 is the difference in mean income between the 1-sibling and 0-sibling subpopulations. Earlier, m(0) = 60 (thousands of $/yr) and m(1) = 40, so β1 = m(1) − m(0) = −20. Finally, β2 is the mean income difference between the 2-sibling and 0-sibling (not 1-sibling) subpopulations: m(2) − m(0) = 40 − 60 = −20.
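The base-category parameterization can be checked numerically. The following is only a small sketch using the siblings numbers above; the helper name cef_dummy is made up for illustration and is not from the text.

```r
# Conditional means from the siblings example (thousands of $/yr)
m <- c(60, 40, 40)             # m(0), m(1), m(2)

# Parameters of the dummy-variable model, with base category X = 0
beta0 <- m[1]                  # m(0)
beta1 <- m[2] - m[1]           # m(1) - m(0)
beta2 <- m[3] - m[1]           # m(2) - m(0)

# CEF written with dummies: beta0 + beta1*1{x=1} + beta2*1{x=2}
# (cef_dummy is a hypothetical helper name)
cef_dummy <- function(x) beta0 + beta1 * (x == 1) + beta2 * (x == 2)

x <- 0:2
cbind(x = x, m = m, cef_dummy = cef_dummy(x))  # the last two columns agree
```

Because there are three parameters for three conditional means, the dummy-variable form reproduces every m(x) exactly, whatever the three values are.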
Discussion Question 7.1 (Facebook). Let X = 0, 1, 2 be the number of Facebook accounts somebody has, and let Y be hours of social media consumption per week.

a) Explain what it means for a CEF model E(Y | X = x) = β0 + β1x to be misspecified.

b) Describe a specific real-world reason to suspect misspecification in this example.
7.2.2 More Than Three Values

More generally, even if X has more than three possible values, dummy variables could be used similarly to avoid CEF misspecification. Extending (7.3), there can be a dummy variable for each possible value of X, and a corresponding parameter for each. Any such model allowing an arbitrarily different conditional mean of Y for each possible value of X is called fully saturated. A fully saturated CEF model cannot be misspecified. (But it may not have any causal meaning, and may be practically impossible to estimate.)
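As a rough illustration of a fully saturated model with more than three values, the sketch below simulates data with five possible X values and lets lm() build the dummies through factor(). The CEF values in m, the sample size, and the seed are all assumptions made up for this example.

```r
# Hypothetical CEF with five possible X values (assumed for illustration)
set.seed(42)
m <- c(60, 40, 40, 35, 50)            # m(0), ..., m(4)
n <- 1000
X <- sample(0:4, size = n, replace = TRUE)
Y <- m[X + 1] + rnorm(n)

# factor(X) creates one dummy per value of X except the base category X = 0
sat <- lm(Y ~ factor(X))

# Fitted conditional means: intercept, then intercept plus each dummy coefficient
fit_means <- unname(coef(sat)[1] + c(0, coef(sat)[-1]))

# Sample mean of Y within each X group
grp_means <- as.vector(tapply(Y, X, mean))

round(rbind(fit_means, grp_means), 3)  # the two rows coincide
```

The fitted values of the saturated regression are just the within-group sample means, which is why no functional form is being imposed.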
In more complex settings, it is impossible to fix misspecification completely. For example, if X could be any real number between 0 and 1, then an infinite number of parameters is required to model the conditional expectations for the infinite number of X values; this is impossible in practice.

In such settings where misspecification is unavoidable, how can we interpret the model and its parameters? There are three interpretations of a more general linear model that includes the linear CEF model as a special case. These are discussed next.
In Sum: Interpretations of What OLS Estimates

Linear projection: gets β0 + β1X “closest to” Y probabilistically (Section 7.3)

“Best” linear approximation (BLA) of CEF: “best” (smallest mean quadratic loss) approximation of E(Y | X) with linear form β0 + β1X (Section 7.4)

“Best” linear predictor (BLP): “best” (smallest mean quadratic loss) prediction of Y given X with linear form β0 + β1X (Section 7.5)
7.3 Linear Projection

⟹ Kaplan video: Linear Projection and “Best” vs. “Good”

The linear projection model is important because it is what OLS actually estimates. Two additional interpretations of the linear projection are described in Sections 7.4 and 7.5.
7.3.1 Geometric Intuition

You may have seen orthogonal projection in geometry or linear algebra. There is some shape (or vector space), and there is a point outside it. Projecting the point onto the shape consists of finding the point within the shape that is closest to the outside point.

Figure 7.2 illustrates projection. There is a large gray circle shape, and two points outside of it (small triangle, dot). The small triangle on the border of the large circle is the “closest” point to the outside small triangle, as measured by Euclidean distance. That is, the dashed line connecting the small triangles is just barely long enough to reach the gray circle from the outside triangle point; if it were any shorter, it could not reach any point in the gray circle. Similarly, the dot on the border of the gray shape is the projection of the outside dot onto the shape: of all the points in the gray space, it is closest to the outside dot (by Euclidean distance).

Figure 7.2: Orthogonal projection.
This idea can be written mathematically. Let d_E(w, z) denote the Euclidean distance between points w and z. Let S denote a shape, which is a set of points. Let y denote the outside point and p the projection. In Figure 7.2, the gray circle is S, the outside small triangle (or dot) is y, and the small triangle (or dot) on the circle’s border is p. The projection of point y onto shape S is the point inside S that’s closest to y, i.e., that minimizes the distance to y. Mathematically,

$$p = \arg\min_{s \in S} d_E(y, s). \tag{7.5}$$
7.3.2 Probabilistic Projection

Linear projection with random variables is the same idea, but with a different definition of distance and a different “shape” to search over.

Notationally, let LP(Y | 1, X) denote the linear projection (LP) of Y onto (1, X). The (1, X) specifies the “shape” that we search over: random variables that can be written as a + bX for constants a and b, i.e., linear combinations of (1, X). (Linear combinations and linearity are detailed in Section 8.2.1.) Without the 1, LP(Y | X) would only consider bX with no intercept.

The closest “point” inside the “shape” is usually written β0 + β1X. Mathematically, parallel to (7.5),

$$\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \arg\min_{a + bX} d(Y, a + bX) = \arg\min_{a + bX} \sqrt{E[(Y - a - bX)^2]}, \tag{7.6}$$

where Euclidean distance d_E(·, ·) has been replaced by a probabilistic “distance” measure,

$$d(A, B) \equiv \sqrt{E[(A - B)^2]}. \tag{7.7}$$

That is, linear projection gets β0 + β1X as “close” to Y as possible, in a probabilistic sense.
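To make the minimization concrete, the sketch below searches numerically for the (a, b) that minimizes the squared probabilistic distance in a small discrete population. The probabilities p, conditional means m (matching the siblings example), and conditional variance v are assumptions for illustration, and optim() is simply one convenient base-R minimizer, not a method from the text.

```r
# Assumed discrete population: X in {0, 1, 2}
x <- 0:2
p <- c(1, 2, 3) / 6     # P(X = 0), P(X = 1), P(X = 2)
m <- c(60, 40, 40)      # E(Y | X = x)
v <- 2                  # Var(Y | X = x), assumed constant

# Squared distance between Y and a + bX:
# E[(Y - a - bX)^2] = sum over x of P(X = x) * { v + [m(x) - a - b x]^2 }
dist2 <- function(ab) sum(p * (v + (m - ab[1] - ab[2] * x)^2))

fit <- optim(par = c(0, 0), fn = dist2, method = "BFGS")
fit$par           # numerical (beta0, beta1)
sqrt(fit$value)   # the minimized "distance" d(Y, beta0 + beta1 X)
```

Minimizing the squared distance gives the same (a, b) as minimizing the distance itself, since the square root is an increasing function. The conditional variance v shifts the minimized value but not the minimizer.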
7.3.3 Formulas and Interpretation

Some calculus (omitted) yields a formula for each linear projection coefficient (LPC), β0 and β1. In this special case with a single regressor X and an intercept,

$$\beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)}, \qquad \beta_0 = E(Y) - \beta_1 E(X). \tag{7.8}$$
Writing σ_Y² = Var(Y) and σ_X² = Var(X), β1 can be rewritten in terms of correlation:

$$\beta_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X^2}\,\frac{\sigma_Y}{\sigma_Y} = \frac{\mathrm{Cov}(Y, X)}{\sigma_X \sigma_Y}\,\frac{\sigma_Y}{\sigma_X} = \mathrm{Corr}(Y, X)\,\frac{\sigma_Y}{\sigma_X}. \tag{7.9}$$

Either version of the formula shows how the linear projection slope β1 is related to the linear dependence (covariance or correlation) between Y and X. Once the slope is determined, the intercept β0 simply moves the linear projection line up or down so that E(Y) = β0 + β1 E(X). That is, the linear projection always goes exactly through the point (x, y) = (E(X), E(Y)).
People often interpret the linear projection coefficients less precisely. For the slope, a common phrase is, “A one-unit increase in X is associated with a β1 change in Y.” The intercept is often not mentioned since β0 = E(Y) − β1 E(X) is not easy to interpret, except when the regressor has been demeaned so that E(X) = 0, in which case β0 = E(Y). In this case, β0 is called the “centercept” instead of intercept, but despite the better interpretation, it is rarely seen in economics.

For description, (7.8) shows that the LPCs summarize the joint probability distribution of (Y, X). The joint distribution of (Y, X) determines E(Y), E(X), Cov(Y, X), and Var(X), which then determine β0 and β1. Although a two-number summary of a complicated joint distribution is very convenient, clearly much information is lost in such a summary. Just as percentiles (quantiles) complement the mean in describing Y, quantile regression complements the LPCs in describing (Y, X), though it is beyond our scope.
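The covariance and correlation versions of the slope formula can be compared in a small worked example. The joint distribution below (probabilities p, conditional means m, conditional variances v) is assumed purely for illustration.

```r
# Assumed joint distribution of (Y, X), summarized by X's distribution
# and the conditional mean and variance of Y given X
x <- 0:2
p <- c(3, 2, 1) / 6      # P(X = x)
m <- c(60, 40, 40)       # E(Y | X = x)
v <- c(4, 4, 4)          # Var(Y | X = x)

EX    <- sum(p * x)
EY    <- sum(p * m)                       # law of iterated expectations
VarX  <- sum(p * x^2) - EX^2
VarY  <- sum(p * (v + m^2)) - EY^2        # E(Y^2) - [E(Y)]^2
CovYX <- sum(p * x * m) - EX * EY         # E(XY) = E[X m(X)]

beta1_cov <- CovYX / VarX                                     # covariance form
beta1_cor <- (CovYX / sqrt(VarX * VarY)) * sqrt(VarY / VarX)  # correlation form
beta0     <- EY - beta1_cov * EX

c(beta0 = beta0, beta1_cov = beta1_cov, beta1_cor = beta1_cor)
c(line_at_EX = beta0 + beta1_cov * EX, EY = EY)   # line passes through (E(X), E(Y))
```

Changing v changes Var(Y) and the correlation, but not the slope: σ_Y cancels in the correlation form.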
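Although you don’t need to know it for this class, the LPCs can be written in matrix form. This generalizes more easily. Define vector

$$\mathbf{X} \equiv (1, X)' = \begin{bmatrix} 1 \\ X \end{bmatrix}, \tag{7.10}$$

where (1, X)′ indicates the transpose of the row vector (1, X). Then,

$$\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = [E(\mathbf{X}\mathbf{X}')]^{-1} E(\mathbf{X}Y) = \begin{bmatrix} 1 & E(X) \\ E(X) & E(X^2) \end{bmatrix}^{-1} \begin{bmatrix} E(Y) \\ E(XY) \end{bmatrix}. \tag{7.11}$$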
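For readers curious about the matrix form, here is a minimal sketch that evaluates it with solve() and compares it with the scalar covariance-over-variance formula. The population (X uniform on 0, 1, 2 with linear CEF 60 − 20x) is an assumption for illustration.

```r
# Assumed population: X uniform on {0, 1, 2} with linear CEF m(x) = 60 - 20x
x <- 0:2
p <- rep(1, 3) / 3
m <- 60 - 20 * x

# Population moments appearing in the matrix formula
EX  <- sum(p * x)
EX2 <- sum(p * x^2)
EY  <- sum(p * m)
EXY <- sum(p * x * m)                         # E(XY) = E[X m(X)]

EXX <- matrix(c(1, EX, EX, EX2), nrow = 2)    # E(XX'), a 2 x 2 matrix
EXy <- c(EY, EXY)                             # E(XY) with vector X, a 2-vector

beta_matrix <- solve(EXX, EXy)                # [E(XX')]^{-1} E(XY)

# Scalar formulas for comparison
beta1 <- (EXY - EX * EY) / (EX2 - EX^2)
beta_formula <- c(EY - beta1 * EX, beta1)

rbind(beta_matrix, beta_formula)              # both rows: intercept, slope
```

Since the CEF is linear here, both versions return the CEF's own intercept and slope.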
7.3.4 Linear Projection Model in Error Form

Analogous to (6.25) for the CEF, the linear projection model can be written in error form. Analogous to defining the CEF error as Y − m(X), the linear projection error is defined as

$$U \equiv Y - \mathrm{LP}(Y \mid 1, X) = Y - (\beta_0 + \beta_1 X). \tag{7.12}$$

Notationally, as usual, there is nothing special about the letter U (or β, or even X or Y); e.g., it is mathematically equivalent to define V ≡ W − LP(W | 1, Z) and use (γ0, γ1) for the LPCs. Given the definition of U in (7.12), it is always true that E(U) = Cov(X, U) = 0. Thus, the model

$$Y = \beta_0 + \beta_1 X + U, \qquad E(U) = \mathrm{Cov}(X, U) = 0, \tag{7.13}$$

is equivalent to LP(Y | 1, X) = β0 + β1X.

As with the CEF, the meaning of LP(Y | 1, X) = β0 + β1X is more clearly explicit, but sometimes the error form in (7.13) is more convenient mathematically, as in Section 12.3.2.
7.4 “Best” Linear Approximation

⟹ Kaplan video: “Best” Linear Approximation

⟹ Kaplan video: Linear Projection and “Best” vs. “Good” (again)

7.4.1 Definition and Interpretation

For description, the linear projection can be interpreted as the best linear approximation (BLA) of the true CEF. “Best” here assumes quadratic loss, similar to how the mean E(Y) is the “best” predictor of Y with quadratic loss. “Linear” refers to a function of the form a + bX (see Section 8.2.1). Mathematically,

$$\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \overbrace{\arg\min_{a + bX} E\{[m(X) - (a + bX)]^2\}}^{\text{BLA}}, \qquad m(X) \equiv E(Y \mid X). \tag{7.14}$$

That is, among all possible a + bX, the linear projection β0 + β1X is the function of X that best approximates E(Y | X).
This implies that if the CEF is linear in X, then the linear projection equals the CEF. That is, if m(X) = β0 + β1X, then m(X) − (β0 + β1X) = 0. Since this term is squared in (7.14), zero is the smallest possible value, so β0 + β1X is the BLA.

Otherwise, the BLA treats more probable X as more important when trying to get the linear approximation “close” to the true CEF. The mean E[·] in (7.14) is a weighted average, with more weight on more probable X, so it is more important to make m(X) − (a + bX) close to zero for such X values.
7.4.2 Limitations

Unfortunately, “best” does not always mean “good.” Sometimes the CEF is so highly nonlinear that even the best linear approximation is still a very poor approximation. By analogy, “Among all cities in Missouri, St. Louis is closest to Kuwait” does not mean “St. Louis is close to Kuwait.” Here, Kuwait is the true CEF, Missouri is the set of all functions linear in X, and St. Louis is the BLA. Sometimes the best (closest) is still not good (not close).
The following example of a “bad” BLA is from Hansen (2020, §2.28). Let Y = X + X² with no error term, so m(x) = x + x², too. If X ∼ N(0, 1), then the BLA/LP turns out to be LP(Y | 1, X = x) = 1 + x. The function 1 + x is a bad approximation of x + x² (try graphing it).
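The claimed BLA can be verified from the known moments of a standard normal, and then graphed. This is only a sketch: the moment values are standard facts about N(0, 1), while the helper name plot_bla and the plotting range are choices made here for illustration.

```r
# Moments of X ~ N(0, 1): E(X) = 0, E(X^2) = 1, E(X^3) = 0, E(X^4) = 3
EX <- 0; EX2 <- 1; EX3 <- 0; EX4 <- 3

# With Y = X + X^2:
EY    <- EX + EX2            # E(Y)
EXY   <- EX2 + EX3           # E(XY) = E(X^2) + E(X^3)
VarX  <- EX2 - EX^2
CovYX <- EXY - EX * EY

beta1 <- CovYX / VarX
beta0 <- EY - beta1 * EX
c(beta0 = beta0, beta1 = beta1)

# Mean squared approximation error of the BLA:
# E{[m(X) - (1 + X)]^2} = E[(X^2 - 1)^2] = E(X^4) - 2 E(X^2) + 1
mse_bla <- EX4 - 2 * EX2 + 1
mse_bla

# Hypothetical helper to graph the CEF (solid) against the BLA (dashed)
plot_bla <- function() {
  curve(x + x^2, from = -3, to = 3, ylab = "y")
  abline(a = beta0, b = beta1, lty = 2)
}
```

Calling plot_bla() draws the comparison; the line touches the parabola only at x = −1 and x = 1 and misses badly in the tails.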
Further, the distribution of X can greatly affect the BLA of a nonlinear CEF. For example, Figure 7.1 shows two possible BLA lines for the same nonlinear CEF. One line is the BLA when the distribution of X satisfies P(X = 2) = 0. The other line is the BLA when P(X = 0) = 0. The two lines are very different.
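A small calculation shows how strongly the BLA can depend on the distribution of X. The sketch assumes the siblings CEF values m(0) = 60, m(1) = m(2) = 40 and equal probabilities on the two remaining X values; the function name lpc is made up for this example.

```r
# Population LPCs for a discrete X with values x, probabilities p, and CEF values m
# (lpc is a hypothetical helper name)
lpc <- function(x, p, m) {
  EX <- sum(p * x)
  EY <- sum(p * m)
  beta1 <- (sum(p * x * m) - EX * EY) / (sum(p * x^2) - EX^2)
  c(beta0 = EY - beta1 * EX, beta1 = beta1)
}

x <- 0:2
m <- c(60, 40, 40)                          # assumed CEF values (siblings example)

bla_no2 <- lpc(x, p = c(0.5, 0.5, 0), m)    # P(X = 2) = 0
bla_no0 <- lpc(x, p = c(0, 0.5, 0.5), m)    # P(X = 0) = 0
rbind(bla_no2, bla_no0)
```

With only two possible X values left, the BLA is simply the line through the two remaining CEF points, so the specific 50/50 split does not matter here.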
However, the BLA interpretation does at least assure us that when the CEF is approximately linear, the linear projection approximates the CEF well.

7.5 “Best” Linear Predictor

For prediction, the linear projection can be interpreted as the best linear predictor (BLP) of Y given X. As with the BLA, “best” assumes quadratic loss. “Linear” again refers to the form a + bX. As in (2.55), the optimal predictor minimizes mean quadratic loss. Mathematically,

$$\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X = \overbrace{\arg\min_{a + bX} E[L_2(Y, a + bX)]}^{\text{BLP}} = \arg\min_{a + bX} E\{[Y - (a + bX)]^2\}. \tag{7.15}$$

That is, among all possible a + bX, the linear projection β0 + β1X is precisely the function of X that “best” predicts Y given knowledge of X.
Mathematically, (7.15) is the same as (7.6) but without the square root. Although phrased differently, the linear projection goal of getting β0 + β1X “closest” to Y is essentially the same as prediction: we want a predictor β0 + β1X that is “closest” to Y.

Unfortunately, as with BLA, “best” does not mean “good.” However, also as with BLA, the CEF does not need to be exactly linear in order for the linear projection to make good predictions.

As in Section 2.5, “prediction” here is defined entirely within the population. It does not refer to using data to guess the future; there is no data here. Instead, the BLP is an ideal predictor: it is the (linear) predictor we would use if we fully knew everything about the population. The BLP is something we wish to learn. Fortunately, the BLP (and BLA and LP) is precisely what OLS estimates.
Discussion Question 7.2 (BLP). Let Y be income (thousands of dollars per year) and X be number of siblings. When X = 0, the mean Y is 60 and 50 ≤ Y ≤ 70. When X = 1, the mean Y is 40 and 30 ≤ Y ≤ 50. When X = 2, it’s the same as when X = 1: the mean Y is 40 and 30 ≤ Y ≤ 50. In a population with mostly X = 1 and X = 2, the BLP is LP(Y | 1, X) = 43 − 2X.

a) What Y does the BLP predict when X = 0?

b) Is the prediction from (a) good? Why/not?
7.6 Causality Under Misspecification

Some things can be said about causality under misspecification, but none as pleasing as the BLP for prediction or BLA for description. For example, if the structural error U satisfies the CEF error property E(U | X) = 0, then the structural function is the CEF, so the linear projection is also the best linear approximation of the structural function. Alternatively, if the structural model is linear, Y = β0 + β1X + U, and if Cov(X, U) = 0, then β1 equals the linear projection slope coefficient. However, the linear structural model may be misspecified, too. This is one motivation for “nonparametric” CEF estimation (Section 8.3).
7.7 OLS Estimation and Inference

⟹ Kaplan video: OLS in R

OLS estimation was initially discussed in Section 6.7, along with important terms like fitted values and residuals. Here, additional insights, statistical properties, and code are provided.

7.7.1 OLS Estimator Insights

For the OLS estimator, it is most important to know the statistical properties and R functions, but I can’t resist a couple comments on the nature of the estimator itself.

First, the “least squares” formulation of the OLS estimator from (6.46) mirrors the BLP definition in (7.15). That is, following the analogy principle (Section 3.3), replacing the population mean (E) in (7.15) with the sample mean ((1/n) Σᵢ₌₁ⁿ) yields (6.46). This reinforces that OLS fundamentally estimates the BLP, or equivalently the LP or BLA, not the CEF. The CEF equals the LP only in the very special case of a linear CEF.
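Second, the OLS estimator can be written parallel to the population linear projection coefficients in (7.8). Again replacing population mean with sample mean, and replacing population variance and covariance with sample variance and covariance, (7.8) turns into

$$\hat{\beta}_1 = \frac{\widehat{\mathrm{Cov}}(Y, X)}{\widehat{\mathrm{Var}}(X)} = \frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \bar{Y} \equiv \frac{1}{n}\sum_{i=1}^{n} Y_i, \quad \bar{X} \equiv \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{7.16}$$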
This matches the formulas from solving the minimization problem in (6.46) directly. This may seem surprising at first, but recall that the population formula in (7.8) came from solving the population minimization problem. For this reason, β̂1 may be called the sample analog of β1, and similarly for β̂0 and β0, just as the sample mean Ê(Y) = (1/n) Σᵢ₌₁ⁿ Yi is the sample analog of the population mean E(Y) (Sections 3.3 and 3.4).

Third, the OLS estimator essentially performs orthogonal projection in the linear algebra sense. The actual math is beyond our scope, but to get the fitted values Ŷi = β̂0 + β̂1Xi, the vector of Yi values is projected onto a certain subspace defined by the Xi values.
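The sample-analog formulas can be checked against lm() directly. The simulated data below (seed, sample size, and CEF values) are assumptions for illustration only.

```r
# Simulated sample (all settings assumed for illustration)
set.seed(1)
n <- 200
X <- sample(0:2, size = n, replace = TRUE)
Y <- c(60, 40, 40)[X + 1] + rnorm(n)

# Sample analogs of the population LPC formulas.
# cov() and var() divide by n - 1 rather than n, but the factor cancels in the ratio.
b1 <- cov(Y, X) / var(X)
b0 <- mean(Y) - b1 * mean(X)

# OLS from lm() for comparison
ols <- coef(lm(Y ~ X))

rbind(sample_analog = c(b0, b1), lm = unname(ols))
```

The two rows agree up to floating-point rounding, since both solve the same least-squares problem.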
7.7.2 Statistical Properties

The following statistical properties consider OLS as an estimator of the linear projection coefficients (LPCs). These properties hold true under very general assumptions. If the CEF is linear, then it equals the linear projection, so these properties would equally apply to CEF estimation. If the CEF is linear and additional assumptions hold such that the CEF slope identifies the ASE of X on Y (Section 6.5.2), then the following properties apply to ASE estimation.

However, as before, the measures of statistical uncertainty (like confidence intervals) say nothing about uncertainty in the identifying assumptions. The statistical uncertainty only captures uncertainty about the LPCs.
Assumptions

The following assumptions combined are sufficient for Theorems 7.1–7.3, but not necessary (using logical terms from Section 6.1).

Assumption A7.1 (iid sampling). Sampling of (Yi, Xi) is iid.

Assumption A7.2 (non-constant regressor). The regressor X is not a constant, i.e., there is no single value x such that P(X = x) = 1.

Assumption A7.3 (finite variances). The variances of Y and X are finite: Var(Y) < ∞, Var(X) < ∞. Or equivalently, the expected values of Y² and X² (i.e., second moments) are finite: E(Y²) < ∞, E(X²) < ∞.

Assumption A7.4 (finite fourth moments). The expected values of Y⁴ and X⁴ (i.e., fourth moments) are finite: E(Y⁴) < ∞, E(X⁴) < ∞.
Assumption A7.1 was discussed in Section 3.2 for Yi by itself. If we let vector Wi ≡ (Yi, Xi) be what’s observed about individual i, and vector Wk ≡ (Yk, Xk) be the observation for individual k, then the iid assumption is essentially the same as before: Wi ⊥⊥ Wk for i ≠ k (“independent”), and Wi and Wk have the same distribution (“identically distributed”). More specifically, “independent” means (Yi, Xi) ⊥⊥ (Yk, Xk) for i ≠ k, which implies Yi ⊥⊥ Yk, Xi ⊥⊥ Xk, Yi ⊥⊥ Xk, and Xi ⊥⊥ Yk, but implies nothing about (in)dependence between Xi and Yi (or Xk and Yk). “Identically distributed” says (Yi, Xi) and (Yk, Xk) have the same joint distribution, which implies the conditional and marginal distributions (and their features) are also identical. For example, E(Yi) = E(Yk), Var(Xi) = Var(Xk), E(Yi | Xi = x) = E(Yk | Xk = x), P(Yi ≤ 0 | Xi = x) = P(Yk ≤ 0 | Xk = x), etc. All this readily generalizes to multiple regressors, just redefining Wi ≡ (Yi, X1i, X2i, …).
There can be dependence among the elements of Wi. Specifically, the outcome Yi and regressor Xi may be correlated or otherwise dependent. The iid assumption does not restrict the relationship between Yi and Xi at all.

Assumptions A7.3 and A7.4 are similar, but A7.4 is stronger. That is, A7.4 ⟹ A7.3, i.e., E(Y⁴) < ∞ ⟹ E(Y²) < ∞, and similarly for X.

Assumptions A7.3 and A7.4 are usually true with economic data, but there are some exceptions. They are true for any variable whose absolute value is bounded, like |Y| ≤ b < ∞, because then E(Y⁴) ≤ E(b⁴) = b⁴ < ∞. For example, if X is age or education, then |X| ≤ 200, so E(X⁴) ≤ (200)⁴ < ∞.

Nonetheless, some economic variables may violate A7.4 or even A7.3. (Or, there are variables best modeled by distributions that violate these assumptions.) One example is stock returns or other asset returns. Whether to model such financial returns with finite or infinite variance is a matter of ongoing debate; e.g., see Grabchak and Samorodnitsky (2010) and references therein.

Assumption A7.2 is qualitatively similar to the overlap assumption (A6.6). They both say we must see different values of X in order to learn about a relationship involving X. They both seem obvious with a single X.

Conveniently, if A7.2 seems false in the data, then your statistical software will report an error or warning. So, don’t worry about A7.2 unless you get such a warning.
Theoretical Results

Theorem 7.1 (OLS consistency, 1 regressor). If A7.1–A7.3 are true, then the OLS intercept and slope estimators are consistent for the population linear projection intercept and slope.

Theorem 7.1 says that with enough data, the OLS coefficient estimators should be close to the true linear projection coefficients with high probability. Whether the linear projection is a CEF, or whether the slope has a causal interpretation, are questions of identification, not estimation. OLS estimates the linear projection and leaves further interpretation up to us.

Logically, Theorem 7.1 does not say that OLS is a bad estimator if sampling is not iid (the “inverse”), as discussed in Section 6.1. In fact, the iid assumption can be relaxed in certain ways: even with some “dependence” (instead of independence) or survey weights, OLS can still consistently estimate the population LPCs.
Theorem 7.2 (OLS approximate normality, 1 regressor). If A7.1, A7.2, and A7.4 are true, then the OLS intercept and slope estimators are asymptotically normal, i.e., with large n, approximately β̂0 ∼ N(β0, SE0²) and β̂1 ∼ N(β1, SE1²), where the true standard errors SE0 and SE1 are unknown but can be estimated, and are proportional to 1/√n.

Theorem 7.2 is practically useful for constructing confidence intervals, whose properties are in Theorem 7.3.

Theorem 7.3 (coverage probability, 1 regressor). If A7.1, A7.2, and A7.4 are true, then the heteroskedasticity-robust confidence intervals in Section 7.7.3 are asymptotically correct. That is, with large enough n, the coverage probability is approximately equal to the desired confidence level.
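The coverage claim can be explored with a small Monte Carlo sketch. Everything below is an assumption for illustration: the population (nonlinear CEF 60, 40, 40 with P(X = j) = (3 − j)/6, whose linear projection slope works out to −12), the heteroskedastic error, the seed, and the number of replications. The HC1 standard error is computed by hand for the simple-regression slope rather than with the packages used in the text, and a normal critical value is used.

```r
set.seed(7)
beta1_true <- -12        # population LP slope for the assumed population below
n    <- 500              # sample size per simulated dataset
reps <- 1000             # number of simulated datasets
cover <- logical(reps)

for (r in seq_len(reps)) {
  X <- sample(0:2, size = n, replace = TRUE, prob = c(3, 2, 1) / 6)
  Y <- c(60, 40, 40)[X + 1] + rnorm(n, sd = 1 + X)   # heteroskedastic noise

  # OLS slope and intercept
  xc <- X - mean(X)
  b1 <- sum(xc * Y) / sum(xc^2)
  b0 <- mean(Y) - b1 * mean(X)
  uhat <- Y - b0 - b1 * X                            # OLS residuals

  # HC1 robust standard error of the slope (simple regression with intercept)
  se1 <- sqrt(n / (n - 2) * sum(xc^2 * uhat^2)) / sum(xc^2)

  # Does the 95% CI contain the true LP slope?
  cover[r] <- abs(b1 - beta1_true) <= qnorm(0.975) * se1
}

mean(cover)   # simulated coverage probability; should be near 0.95
```

Note that the interval is checked against the linear projection slope, not any CEF difference: the CEF here is nonlinear, and the LPCs are what the theorems are about.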
7.7.3 Code

The following code is based on the example from Section 7.1. Each row in the final output shows the estimate β̂1 (in the column titled Estimate), along with the heteroskedasticity-robust standard error (column Std. Error), t-statistic for H0: β1 = 0 (column t value), and p-value for H0: β1 = 0 (column Pr(>|t|)), and 95% CI for β1 (lower endpoint in column 2.5 %, upper endpoint in column 97.5 %).

There are three randomly simulated datasets for which these quantities are estimated. All have X ∈ {0, 1, 2}. The first dataset comes from a linear CEF with m(x) = 60 − 20x, where P(X = j) = 1/3 for j = 0, 1, 2. The next two datasets have nonlinear CEF m(0) = 60, m(1) = m(2) = 40, but different distributions of X. The first distribution has P(X = j) = (3 − j)/6, while the second has P(X = j) = (j + 1)/6, for j = 0, 1, 2. As seen, the distribution of X affects the linear projection slope when the CEF is nonlinear, as discussed in Section 7.4.

Finally, dummy variables are used to estimate a properly specified nonlinear CEF, as in (7.4). Only the estimated coefficients are displayed below, using the coefficients() function. Specifically, the number under (Intercept) is the estimated intercept, the number under D1 is the estimated coefficient on D1, and the number under D2 is the estimated coefficient on D2.
library(lmtest); library(sandwich)  # for coeftest(), coefci(), vcovHC()
set.seed(112358)
n <- 500  # sample size
m012 <- c(60,40,20)  # m(0),m(1),m(2) (linear CEF)
df <- data.frame(X=sample(x=0:2, size=n, prob=c(1,1,1)/3, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1')  # heteroskedasticity-robust
CEF <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
# Now nonlinear CEF: LPC depends on X dist
set.seed(112358)
n <- 500; m012 <- c(60,40,40)
df <- data.frame(X=sample(x=0:2, size=n, prob=3:1/6, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1')
LP1 <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
set.seed(112358)
n <- 500; m012 <- c(60,40,40)
df <- data.frame(X=sample(x=0:2, size=n, prob=1:3/6, replace=TRUE),
                 U=rnorm(n))
df$Y <- rnorm(n=n, mean=m012[1+df$X]) + df$U
ret <- lm(formula=Y~X, data=df)
retVC1 <- vcovHC(ret, type='HC1')
LP2 <- c(coeftest(ret, vcov.=retVC1)['X',],
         coefci(ret, vcov.=retVC1)['X',])
tmp <- rbind(CEF, LP1, LP2)
round(x=tmp, digits=3)

##     Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
## CEF    -19.8      0.077  -257.1        0 -19.98 -19.67
## LP1    -12.3      0.310   -39.6        0 -12.91 -11.69
## LP2     -7.7      0.310   -24.8        0  -8.31  -7.09

# Use dummies to estimate nonlinear CEF
df$D0 <- (df$X==0)  # not used
df$D1 <- as.integer(df$X==1)  # D1=1 iff X=1
df$D2 <- as.integer(df$X==2)  # D2=1 iff X=2
ret <- lm(formula=Y~D1+D2, data=df)
coefficients(ret)

## (Intercept)          D1          D2
##        59.8       -19.8       -19.8
7.8 Simple Linear Regression

⟹ Kaplan video: OLS in R

The prior results are essentially the same when X has more than three possible values, too. There could even be an infinite number of possible values, e.g., if there is no upper bound for X, or if X could be any real (decimal) number between 0 and 1. Misspecification is likely. The linear projection, best linear approximation, and best linear predictor interpretations all still apply. OLS estimation and heteroskedasticity-robust standard errors and confidence intervals are computed the same way.

The main difference is that it is harder to use dummy variables to properly model a nonlinear CEF. If X has only four values, then it is not too difficult. But if X has hundreds or thousands of values, or an infinite number, then the dummy variable approach may fail. Chapter 8 addresses alternative ways to model a CEF that is not linear in X.
Practice 7.2 (linear fit). For each scatterplot in Figure 7.3, guess what the OLS estimated regression line looks like, i.e., the line β̂0 + β̂1X. (Hint: remember OLS minimizes the sum of the squares of the vertical distances from each point to the fit line.) You can also make your own puzzles in R: first make a scatterplot like
Y <- c(1,2,3,4,13); X <- c(1,2,3,4,5); plot(X,Y)
and then (after guessing) plot the OLS fit with abline(lm(Y~X)).

Figure 7.3: Scatterplots for Practice 7.2.
Practice 7.3 (regression units). Consider a regression of wage Y ($/hr) on “distance to nearest university” X. Let γ̂1 be the estimated slope when X is measured in miles, and let δ̂1 be the estimated slope when X is measured in kilometers, where 1 mi = 1.6 km.

a) What are the units of γ̂1? δ̂1?

b) Do you think γ̂1 = δ̂1, γ̂1 > δ̂1, or γ̂1 < δ̂1?

c) Can you come up with a formula relating γ̂1 and δ̂1? (Hint: what change in Y is associated with a 1.6 km increase in X, in terms of γ̂1? In terms of δ̂1?)
Discussion Question 7.3 (student-teacher ratio, simple regression). Let Y be the average math standardized test score (in units of points) for a school’s 5th-grade students. Let X be the 5th-grade student-teacher ratio (total number of 5th-grade students divided by total number of 5th-grade teachers, like the average class size), generally around 15 ≤ X ≤ 25. For schools i = 1, …, n, the values (Yi, Xi) are recorded. A linear regression is run to estimate β0 and β1 in the CEF model Y = β0 + β1X + V, E(V | X) = 0. Respond to any three of the following (for example, parts a, c, and e; or b, c, f; or d, e, f; etc.).

a) What are the units of β0 and β1?

b) What’s the interpretation of β0? What is it useful for?

c) Consider the estimate β̂1 = −2.28. What does this imply about the average score difference between 15-student classes and 25-student classes? Is it economically significant (Section 3.9.7)? (Hint: make additional assumptions about the scoring system/scale if you need to.)

d) Consider further that β̂1 has heteroskedasticity-robust standard error 0.8, so the p-value for H0: β1 = 0 is 0.004. Discuss the statistical significance (Section 3.8.4) of β̂1.

e) Describe one reason you doubt β1 has a causal interpretation.

f) Describe one reason you think the linear CEF model is misspecified.
Empirical Exercises
Empirical Exercise EE7.1. You will analyze data on colleges’ athletic success and number of applications. The data were collected by Patrick Tulloch for an economics term project, from various college and sports data records. As the R description says, “The ‘athletic success’ variables are for the year prior to the enrollment and academic data.”

a. Load the data (assuming you’ve already installed the R package or Stata command).

R: library(wooldridge)

Stata: bcuse athlet1, nodesc clear

b. Keep only data from 1993.

R: dat <- athlet1[athlet1$year==1993, ]

Stata: keep if year==1993

c. Create a new variable equal to the sum of bowl (football bowl game) and finfour (men’s basketball Final Four).

R: dat$bowl4 <- dat$bowl + dat$finfour

Stata: generate bowl4 = bowl + finfour

d. Display the number of observations with each possible value of bowl4 (0, 1, or 2).

R: table(dat$bowl4)

Stata: tabulate bowl4

e. Regress the number of applications (for admission) on the prior year’s athletic success.

R: ret <- lm(apps~bowl4, data=dat)

Stata: regress apps bowl4, vce(robust)

f. R only: save the fitted OLS values of Y for the three possible values of X (bowl4) with fit012 <- predict(ret, newdata=data.frame(bowl4=0:2)) and optionally add helpful labels with names(fit012) <- c('X=0','X=1','X=2')

g. Estimate and store the three CEF values.

R: mean(dat$apps[dat$bowl4==0]) to estimate m(0), and replace 0 with 1 to estimate m(1) and with 2 to estimate m(2); store these into a vector named m012 with m012 <- c( m0 , m1 , m2 ), where m0 is your code for estimating m(0), and similarly for m1 and m2.

Stata: bysort bowl4: egen CEF = mean(apps) to compute the sample mean of apps within each group of observations with the same value of bowl4, storing it into a new variable named CEF.
h. Plot the fitted OLS line against the estimated CEF points.

R: plot(x=0:2, y=m012) (to plot estimated CEF points) followed by abline(ret) (to plot the OLS fit line)

Stata: twoway scatter CEF bowl4 || lfit apps bowl4

i. Make the same plot, but adjust the line color and style, the title, the axis labels, and whatever else you’d like to adjust.

R: inside the plot() command, add argument main='...' to set the title, and similarly for xlab='...' and ylab='...' to set the x-axis and y-axis labels (where you replace all the ... with whatever names you want); inside the abline() function, add arguments col=2 to change the line’s color, lty=2 to change the line style, and lwd=3 to change the line width; again you can set whatever values you like.

Stata: twoway scatter CEF bowl4 || lfit apps bowl4, XXX but replace the XXX with options to change the graph’s appearance (all separated by spaces, not any more commas), like title("...") xtitle("...") ytitle("...") for the title and axis labels, and lcolor(red) lpattern(dash) for the line color and style; use whatever values you’d like.

j. Display the numerical values of the OLS fit and the estimated CEF.

R: rbind(m012, fit012)

Stata: collapse (mean) meanapps=apps, by(bowl4) followed by predict OLSfit, xb and list
Chapter 8

Nonlinear and Nonparametric Regression

⟹ Kaplan video: Chapter Introduction

Depends on: Chapter 7 (which depends on Chapters 2–4 and 6)

Unit learning objectives for this chapter

8.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]

8.2. Interpret the coefficients in various nonlinear regression models [TLOs 3 and 5]

8.3. Judge which model seems most appropriate, using both economic reasoning and statistical insights [TLO 6]

8.4. In R (or Stata): estimate nonlinear and nonparametric regression models along with measures of uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter

• Functional form misspecification (Lambert video)

• Log-log example (Lambert video)

• Overfitting (Lambert video)

• Sections 2.4 (“Nonlinearities,” including log models), 6.1.3 (“Logarithms”), and 6.1.4 (“Quadratics and Polynomials”) in Heiss (2016)

• Section 8.2 (“Nonlinear Functions of a Single Independent Variable”) in Hanck et al. (2018)

• Nonparametric regression: Chapter 7 (“Moving Beyond Linearity”) in James et al. (2013), including §7.5 (“Smoothing Splines”), and Chapter 5 (“Basis Expansions and Regularization”) in Hastie, Tibshirani, and Friedman (2009), including §5.4 (“Smoothing Splines”)
173
174CHAPTER 8 NONLINEAR ANDNONPARAMETRIC REGRESSION
bull Model selection Chapter 7 (ldquoModel Assessment and Selec-tionrdquo) in Hastie Tibshirani and Friedman (2009)
bull Biasndashvariance tradeoff James et al (2013 sect222) HastieTibshirani and Friedman (2009 sectsect295527273)
bull Part V (ldquoNonparametric Regressionrdquo) in Kaplan (2020)
bull R package splines
Having mastered regression with a linear functional form, we now consider nonlinear functions. First, nonlinear functions of X are allowed, and then nonparametric estimation and machine learning are introduced.
8.1 Log Transformation
Sometimes a simple regression model improves greatly by transforming Y or X or both. The most common transformation in economics is the natural logarithm function, which economists just call "log."
Three different log models are discussed below. A model with the familiar form $Y = \beta_0 + \beta_1 X + U$ could be called a "linear-linear" model (although it's just called a linear model), meaning both Y and X are in their original units, i.e., in levels. If Y is replaced by its log, $\ln(Y)$, it's called a log-linear model; if instead we have Y and $\ln(X)$, then it's linear-log; and if both are in logs, then log-log.
Here in Section 8.1, the distinction among causal, CEF, and linear projection models is unimportant. The interpretation of U is left ambiguous intentionally. Instead, emphasis is on the interpretation of $\beta_1$ in terms of units of measure.
8.1.1 Properties of the Natural Log Function
Basic Shape and Properties
The natural log function is peculiar, especially if you haven't taken calculus. It is written $\ln(\cdot)$, although often people will simply say "log" (without "natural") and write $\log(\cdot)$, since the natural log is the only one commonly used in economics; in R, the function is log().
The log function is the inverse of the exponential function: $\ln(\exp(x)) = x$, where $\exp(x)$ is the same as $e^x$. Consequently, if $e^x = M$, then $\ln(M) = \ln(e^x) = x$.
Figure 8.1 shows the log function, giving a general idea of its shape. However, two important features are unclear. First, as x gets closer and closer to 0, $\ln(x)$ decreases toward $-\infty$. Second, $\ln(x)$ keeps increasing to $\infty$ as x increases to $\infty$.
The log function has many properties, including the following.
1. $\ln(x)$ is only defined for $x > 0$.
2. $\ln(x)$ is strictly increasing: for any $x_2 > x_1 > 0$, $\ln(x_2) > \ln(x_1)$.
3. $\ln(x)$ increases more slowly with larger x: it is very steep for x near zero, but less and less steep (i.e., flatter) as x increases.
4. For any $x > 0$ and any b, $\ln(x^b) = b \ln(x)$.
5. For any $x_1 > 0$ and $x_2 > 0$, $\ln(x_1/x_2) = \ln(x_1) - \ln(x_2)$ and $\ln(x_1 x_2) = \ln(x_1) + \ln(x_2)$.
6. $\lim_{x \downarrow 0} \ln(x) = -\infty$ and $\lim_{x \to \infty} \ln(x) = \infty$.
[Figure 8.1: The (natural) log function $\ln(\cdot)$; horizontal axis x, vertical axis $\ln(x)$.]
Percentage Approximation
Near $x = 1$, $\ln(x)$ is approximately the same as the linear function $f(x) = x - 1$, i.e., $\ln(x) \approx x - 1$. Equivalently, letting $w \equiv x - 1$, if w is near zero, then $\ln(1 + w) \approx w$. For example, with $w = 0.01$, $\ln(1 + 0.01) = 0.00995$. Negative $w < 0$ is fine, too: $\ln(1 - 0.01) = -0.01005$. Even with $w = 0.1$, $\ln(1.1) = 0.0953$, not far from 0.1. The approximation is perfect at $w = 0$ since $\ln(1) = 0$ exactly, and it gets worse as w increases: $\ln(1.5) = 0.405$, not good.
The log function can approximate small percent changes. Consider $v_2 > v_1$: how much bigger is $v_2$? In percentage (of $v_1$) terms, $v_2$ is
$$100\left(\frac{v_2 - v_1}{v_1}\right)\% = 100\left(\frac{v_2}{v_1} - 1\right)\%$$
larger than $v_1$. For example, if $v_1 = 100$ and $v_2 = 102$, then $v_2/v_1 - 1 = 1.02 - 1 = 0.02$, so we'd say $v_2$ is $100(0.02)\% = 2\%$ larger than $v_1$. In other words, the increase (in level) of $v_2 - v_1 = 2$ is 2% of $v_1$. This 2% can be approximated by the log increase $\ln(v_2) - \ln(v_1)$. Let $p \equiv v_2/v_1 - 1$, like $p = 0.02$ in the example, so $v_2 = v_1(1 + p)$. Combining two properties above, if p is near zero, then
$$\ln(v_2) = \ln(v_1(1+p)) = \ln(v_1) + \ln(1+p) \approx \ln(v_1) + p \implies p \approx \ln(v_2) - \ln(v_1). \tag{8.1}$$
Put differently, a log difference of $p = \ln(v_2) - \ln(v_1)$ is approximately a 100p% change in level. In fact, the above math is identical when $v_2 < v_1$, so p can be positive or negative (increase or decrease). However, as before, the approximation is poor if p is larger, like 0.5.
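The quality of the approximation in (8.1) can be checked numerically. The following is only a minimal sketch in R; the values of v1 and v2 are arbitrary illustrations (not from any dataset), and the variable names are hypothetical.

```r
# Compare the exact percentage change with 100 times the log difference.
# All numbers below are illustrative assumptions.
v1 <- 100
v2 <- c(101, 102, 110, 150)                 # increasingly large changes from v1
exact.pct  <- 100 * (v2 / v1 - 1)           # exact percentage change
approx.pct <- 100 * (log(v2) - log(v1))     # log-difference approximation
print(round(cbind(v2, exact.pct, approx.pct), 2))
```

The two columns should be close for the small changes and visibly different for the change to 150, in line with the warning about larger p.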
8.1.2 The Log-Linear Model
Interpretation
A log-linear model specifies
$$\ln(Y) = \beta_0 + \beta_1 X + U. \tag{8.2}$$
Since X is in levels, the coefficient $\beta_1$ tells us about a one unit increase in X. Specifically, a one unit increase in X is associated with a $\beta_1$ change in $\ln(Y)$ (increase if $\beta_1 > 0$, decrease if $\beta_1 < 0$). Sometimes people call this a $\beta_1$ change in Y in log units.
If $\beta_1$ is close to zero, then (8.1) offers another interpretation: a one unit increase in X is associated with an approximate $100\beta_1\%$ change in Y. For example, if $\beta_1 = 0.02$, then a one unit increase in X is associated with approximately a 2% increase in Y.
Recall the difference between a percentage change and a percentage point change. For example, a 1% increase in Y means increasing to 1.01Y. "Percentage point" only applies when the units are already percentages, e.g., a 1 percentage point increase is changing from 10% to 11%, or from 67% to 68%.
However, even if $\beta_1$ is near zero, the approximation in (8.1) may be poor if we consider large changes in X. For example, if again $\beta_1 = 0.02$ but we consider a 50-unit increase in X, the increase in Y is poorly approximated by $100(\beta_1)(50)\% = 100\%$. A 100% increase would be from value $v_1$ to value $v_2 = 2v_1$. But if $\ln(v_2) - \ln(v_1) = 1$ (a change of 1 "log unit"), then $\ln(v_2/v_1) = 1$, meaning $v_2/v_1 = e \approx 2.72$, not $v_2/v_1 = 2$.
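The gap between the approximate and exact percentage changes in the log-linear model can be computed directly. This sketch assumes the slope value 0.02 from the example above and holds the error term fixed; the variable names are hypothetical.

```r
# Log-linear model: with U held fixed, a change of dx in X multiplies Y by
# exp(b1*dx), so the exact percentage change is 100*(exp(b1*dx) - 1),
# versus the approximation 100*b1*dx.
b1 <- 0.02                 # slope value from the text's example
dx <- c(1, 50)             # a one-unit and a 50-unit increase in X
approx.pct <- 100 * b1 * dx
exact.pct  <- 100 * (exp(b1 * dx) - 1)
print(cbind(dx, approx.pct, exact.pct))
```

For the one-unit change the two numbers nearly agree; for the 50-unit change the exact value corresponds to multiplying Y by e, not by 2.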
When to Use It
When does a log-linear model make sense? Sometimes scatterplots of the raw Y and X data suggest it. For example, maybe the relationship between Y and X looks nonlinear, but the relationship between ln(Y) and X looks approximately linear.
Sometimes, even before looking at data, the log-linear model makes more sense economically or intuitively. For example, with Y variables like income, it may seem more natural to model effects as (approximate) percentage changes in Y, like a 1% higher income instead of a $500/yr higher income. Further, the log-linear form derives from economic models of human capital where there is a multiplicative effect on wage. The most famous of these is the "Mincer equation" for earnings as a function of education (schooling) and experience, named after the log-linear model in Mincer (1974, Ch. 5, p. 84).
Issue with Prediction
Unfortunately, the log-linear model is not optimal for predicting Y, even if $E(U \mid X) = 0$. From (8.2), the CEF is
$$E(Y \mid X = x) = e^{\beta_0 + \beta_1 x} \, E(e^U \mid X = x).$$
It is easy to plug in $\hat\beta_0$ and $\hat\beta_1$, but difficult to estimate $E(e^U \mid X = x)$. We could simply ignore the difficult term, but $e^{\hat\beta_0 + \hat\beta_1 x}$ is generally not the best predictor of Y given $X = x$. There are alternatives, but they are beyond our scope.
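To see why the extra term cannot simply be ignored, consider one special case. The normality below is purely an illustrative assumption; it is not part of the log-linear model itself. If, given $X = x$, the error U is normal with mean zero and variance $\sigma^2$, then $e^U$ is lognormal, and:

```latex
E(e^U \mid X = x) = e^{\sigma^2/2} > 1
\quad\Longrightarrow\quad
E(Y \mid X = x) = e^{\beta_0 + \beta_1 x + \sigma^2/2} > e^{\beta_0 + \beta_1 x}
\quad \text{whenever } \sigma^2 > 0 .
```

In this special case, $e^{\beta_0 + \beta_1 x}$ is too low on average by the factor $e^{\sigma^2/2}$, even though $E(U \mid X) = 0$ holds.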
8.1.3 The Linear-Log Model
Interpretation
A linear-log model specifies
$$Y = \beta_0 + \beta_1 \ln(X) + U. \tag{8.3}$$
When X increases by one log unit, the corresponding change in Y is $\beta_1$, but one log unit is a very big change (more than doubling). To use the percentage approximation, a smaller change in X must be used. Specifically, an increase of X by 1% is associated with a change in Y of $\beta_1/100$ units. A 1% increase is a change from X to 1.01X, which is different than a 1 percentage point change in X.
Mathematically, the interpretation of $\beta_1$ can be seen in two steps. First, let $Z = \ln(X)$, and imagine a linear model with Y and Z: $Y = \beta_0 + \beta_1 Z + U$. Then, an increase in Z by 0.01 units corresponds to a change in Y of $(\beta_1)(0.01) = \beta_1/100$ units of Y. Second, from (8.1), an increase in $Z = \ln(X)$ by 0.01 is approximately a $(100)(0.01)\% = 1\%$ increase in X.
For larger changes, instead of using the percentage approximation, just plug in two values of X. For example, consider increasing $X = 40$ to $X = 60$ (a 50% increase). The associated change in Y is
$$\overbrace{\beta_0 + \beta_1 \ln(60)}^{X=60} - \overbrace{[\beta_0 + \beta_1 \ln(40)]}^{X=40} = (\beta_0 - \beta_0) + \beta_1[\ln(60) - \ln(40)] = \beta_1 \ln(60/40) = \beta_1 \ln(1.5) = 0.41\beta_1.$$
The same $0.41\beta_1$ change results for any 50% increase in X, regardless of starting value, because $\ln(1.5X) - \ln(X) = \ln(1.5X/X) = \ln(1.5) = 0.41$ log units. More generally, a change from $X = x_1$ to $X = x_2$ is associated with a change in Y of $\beta_1 \ln(x_2/x_1)$.
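The formula $\beta_1 \ln(x_2/x_1)$ is easy to evaluate in R. In this sketch the slope value b1 = 3 and the helper function change() are hypothetical, chosen only for illustration.

```r
# Linear-log model: change in Y when X moves from x1 to x2 is b1*log(x2/x1).
b1 <- 3                                    # hypothetical slope (assumption)
change <- function(x1, x2) b1 * log(x2 / x1)
print(change(40, 60))      # a 50% increase starting from X = 40
print(change(10, 15))      # a 50% increase starting from X = 10: same change
print(b1 / 100)            # approximate change in Y for a 1% increase in X
print(change(100, 101))    # exact change in Y for a 1% increase in X
```

The first two results should coincide, since only the ratio x2/x1 matters, and the last two should be close to each other.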
When to Use It
When does a linear-log model make sense? Sometimes the scatterplot of Y and X reveals a shape that looks like a log function: increasing steeply at first, then getting less and less steep, but without ever decreasing. (Or, switch "increasing" and "decreasing" if $\beta_1 < 0$.) That is, the relationship between Y and X looks nonlinear, but maybe plotting Y against $\ln(X)$ looks closer to linear. The log function's shape also helps model diminishing marginal benefits: the first unit of X helps increase Y a lot, but each additional unit of X helps less and less.
8.1.4 The Log-Log Model
Interpretation
A log-log model specifies
$$\ln(Y) = \beta_0 + \beta_1 \ln(X) + U. \tag{8.4}$$
A 1% increase in X is associated with an approximate $\beta_1\%$ change in Y. This percentage interpretation is particularly nice: $\beta_1$ represents an elasticity of Y with respect to X. But if the percentages are too large, then the approximation is poor.
When to Use It
When does a log-log model make sense? First, it's a simple way to get an elasticity interpretation. Second, a scatterplot of $\ln(Y)$ against $\ln(X)$ may look roughly linear. Third, if you suspect a power law type of relationship between Y and X: exponentiating both sides of (8.4) yields
$$\exp\{\ln(Y)\} = \exp\{\beta_0 + \beta_1 \ln(X) + U\} \implies Y = e^{\beta_0} \exp\{\ln(X^{\beta_1})\} e^U = e^{\beta_0} X^{\beta_1} e^U.$$
Issue with Prediction
As with the log-linear model, $e^{\beta_0} X^{\beta_1}$ is generally not the CEF because $E(e^U \mid X) = 1$ is not implied by $E(U \mid X) = 0$. Consequently, predicting Y as $e^{\hat\beta_0} X^{\hat\beta_1}$ is generally not optimal.
In Sum: Regression Models with Log Transformations
Log-linear: 1-unit ↑ X associated with approximate $100\beta_1\%$ change in Y
Linear-log: 1% ↑ X associated with approximate $\beta_1/100$-unit change in Y; more precisely, change from $x_1$ to $x_2$ associated with $\beta_1 \ln(x_2/x_1)$-unit change in Y
Log-log: 1% ↑ X associated with approximate $\beta_1\%$ change in Y (elasticity)
Discussion Question 8.1 (pollution and house price). Consider the relationship between the price of a house and the concentration of air pollution. Explain which type of model (linear, log-linear, linear-log, or log-log) you think would best fit, and why. (Hint: think especially about changes in levels vs. in logs.)
8.1.5 Warning: Model-Driven Results
⇒ Kaplan video: Warnings About Model-Driven Results
When choosing a model, beware self-fulfilling prophecy. Empirical results are driven by data, but also by your model's structure. For example, the function $\beta_0 + \beta_1 X$ specifies a constant ($\beta_1$) change for every unit increase in X; different datasets can lead to different estimated slopes ($\hat\beta_1$), but the slope will always be constant, regardless of the data. The linear-log model may seem more flexible than a linear model, but it is not: it still only has two parameters. It is just different, not more flexible. Consequently, the fitted linear-log model always shows a diminishing effect of X on Y as X increases. This pattern does not come from the data, but from the model itself, regardless of the data.
Figure 8.2, based on the comic at https://xkcd.com/2048, illustrates such self-fulfilling prophecy. Each graph shows the same scatterplot from the same data (the dots), but with a very different fitted model in each (the line). Clearly the differences do not come from the data, since it's the exact same data. All differences are entirely due to the model. The top-left shows the linear model, which by construction imposes a constant slope $\beta_1$. Below that is a log-linear model: the constant percentage increase of Y with each unit of X leads to exponential growth (hence the "exponential" label in the comic). The top-right shows the "tapering off" of the linear-log model. Although mostly beyond our scope, some comments on "model selection" are in Sections 8.3 and 15.2.
8.1.6 Code
[Figure 8.2: Same data, different models. Four panels showing the same scatterplot with a different fitted curve in each: Linear, Log-Linear, Linear-Log, Log-Log.]
Figure 8.2 is generated by the following code that compares linear, log-linear, linear-log, and log-log estimation given the same dataset. The four fitted functions are plotted on four copies of the same scatterplot in Figure 8.2, in homage to https://xkcd.com/2048. The results illustrate the concerns of Section 8.1.5.
    # 2x2 grid of panels, filled row by row
    par(family='serif', mar=c(3,3,1,1), mgp=c(2.1,0.8,0), mfrow=c(2,2))
    set.seed(112358)
    n <- 31
    # Simulated data: S-shaped CEF plus mean-zero noise
    X <- sort(runif(n=n, min=1, max=9))
    Y <- 1 + pnorm(q=X, mean=5, sd=1.5) +
      2*( rbeta(n=n, shape1=10-X, shape2=X) - (10-X)/10 )
    df <- data.frame(X=X, Y=Y)
    # Fit the four models by OLS
    retlinlin <- lm(Y~X, data=df)
    retloglin <- lm(log(Y)~X, data=df)
    retlinlog <- lm(Y~log(X), data=df)
    retloglog <- lm(log(Y)~log(X), data=df)
    XL <- YL <- ''
    # Same scatterplot four times, each with a different fitted curve
    plot(x=df$X, y=df$Y, type='p', pch=16, main='', xlab=XL, ylab=YL)
    lines(predict(retlinlin)~df$X, col=2)
    title('Linear', line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type='p', pch=16, main='', xlab=XL, ylab=YL)
    lines(predict(retlinlog)~df$X, col=2)
    title('Linear-Log', line=-1, adj=0.1)
    # For models with log(Y), exponentiate the fitted values to plot in levels
    plot(x=df$X, y=df$Y, type='p', pch=16, main='', xlab=XL, ylab=YL)
    lines(exp(predict(retloglin))~df$X, col=2)
    title('Log-Linear', line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type='p', pch=16, main='', xlab=XL, ylab=YL)
    lines(exp(predict(retloglog))~df$X, col=2)
    title('Log-Log', line=-1, adj=0.1)
8.2 Nonlinear-in-Variables Regression
Discussion Question 8.2 (nonlinear OVB). Imagine a structural model $Y = \beta_0 + \beta_1 X + \beta_2 X^2$, with no error term: X completely determines Y. To be more concrete, imagine $Y = 1 + X^2$ (i.e., $\beta_0 = 1$, $\beta_1 = 0$, $\beta_2 = 1$), with $0 \le X \le 5$. You run a linear-in-variables regression: OLS estimates the function $\gamma_0 + \gamma_1 X$.
a) Approximately what value would you expect $\gamma_1$ to be? (Hint: recall Sections 7.3–7.5.)
b) What does $\gamma_0 + \gamma_1 X$ suggest about the relationship between X and Y? What features are similar or different compared to the true $1 + X^2$? (Hint: draw a picture.)
Beyond replacing X with a single transformation of X like $\ln(X)$, we can replace X with a more complicated nonlinear function involving multiple terms and multiple parameters. OLS can still be used for estimation as long as the function is "linear-in-parameters" (Section 8.2.1). Again, the distinctions among causal, CEF, and linear projection models are not emphasized here.
There are two types of (non)linearity. They are often confused. Further, people often say "linear model" or "nonlinear model" without clarifying which type they mean.
8.2.1 Linearity
The root of "linearity" is linear combination. A linear combination is like a weighted sum. For example, a linear combination of A and B is anything with the form
$$w_1 A + w_2 B, \tag{8.5}$$
where $w_1$ and $w_2$ are weights that may take any value, including zero or even negative numbers. Linear combinations may involve more than two terms, like $w_1 A + w_2 B + w_3 C + w_4 D$. In some cases, instead of A, B, C, and D, we have something like $Y_1$, $Y_2$, $Y_3$, and $Y_4$, in which case the linear combination may be written in summation notation:
$$w_1 Y_1 + w_2 Y_2 + w_3 Y_3 + w_4 Y_4 = \sum_{i=1}^{4} w_i Y_i. \tag{8.6}$$
For example, the expected value formula for discrete random variables in (2.16) is a special case of a linear combination, where the linear combination weights are the probabilities of the different possible values. Also, the sample mean is a linear combination of observed $Y_i$ values, with weights $w_i = 1/n$.
A function is linear-in-parameters if it is a linear combination of the parameters. For example, $\beta_0 + \beta_1 X$ is linear-in-parameters because it is a linear combination of the parameters $\beta_0$ and $\beta_1$, with weights $w_1 = 1$ and $w_2 = X$:
$$w_1 \beta_0 + w_2 \beta_1 = (1)(\beta_0) + (X)(\beta_1) = \beta_0 + \beta_1 X.$$
The function $\beta_0 + \beta_1 X$ is also linear-in-variables. This is trickier to see since it is not actually a linear combination of X alone. (The function $\beta_0 + \beta_1 X$ is called an affine function of X, which means a linear function of X plus a constant.) Secretly, we have actually had a second regressor all along: $X_0 = 1$. Since this second regressor $X_0$ is just always 1, it has not been treated like a true regressor, but mathematically it is. Seen this way, the linear combination of $X_0$ and X has weights $w_1 = \beta_0$ and $w_2 = \beta_1$:
$$(w_1)(X_0) + (w_2)(X) = (\beta_0)(1) + (\beta_1)(X) = \beta_0 + \beta_1 X.$$
For this reason, in economics people often call $\beta_0 + \beta_1 X$ "linear in X" even though technically it is "affine in X" and "linear in $X_0$ and X."
These two types of linearity can apply specifically to CEFs or linear projections. For example, if the CEF is $E(Y \mid X = x) = \beta_0 + \beta_1 x$, then the CEF is linear-in-parameters and linear-in-variables. Regardless of the CEF, the linear projection of Y onto $(1, X)$ is $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$, which is always linear-in-parameters and linear-in-variables, by definition.
Confusingly, even if the models are written in error form, people still refer to them as "linear." For example, consider the CEF model $Y = \beta_0 + \beta_1 X + U$ with $E(U \mid X) = 0$. Despite the $+U$ at the end, sometimes people say this is linear-in-variables and linear-in-parameters, presumably because $E(Y \mid X) = \beta_0 + \beta_1 X$ indeed satisfies both types of linearity. Similarly, consider the linear projection model in error form, $Y = \beta_0 + \beta_1 X + U$ with $E(U) = E(XU) = 0$. Again, despite the $+U$ at the end, sometimes people say this is linear-in-variables and linear-in-parameters, presumably because $\mathrm{LP}(Y \mid 1, X) = \beta_0 + \beta_1 X$ indeed satisfies both types of linearity.
Even more confusingly, sometimes even a structural model of the form $Y = \beta_0 + \beta_1 X + U$ is called linear-in-parameters and linear-in-variables. In that case, there is no CEF or LP that is implicitly the linear function. Regardless, it is helpful to be aware of conventional terminology even if it's not the best, so you can understand others when they mention a "linear structural model."
8.2.2 Nonlinearity
Often, a quadratic term is added to a model to increase flexibility. Specifically,
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + U \tag{8.7}$$
is called a quadratic model since the right-hand side is a quadratic function of X (plus an error term). This is now nonlinear-in-variables because of the $X^2$ term. That is, $\beta_0 + \beta_1 X + \beta_2 X^2$ cannot be written as a linear combination of $X_0 = 1$ and X, so it is not linear-in-variables. However, (8.7) is still linear-in-parameters, with linear combination weights 1, X, and $X^2$:
$$(1)(\beta_0) + (X)(\beta_1) + (X^2)(\beta_2) = \beta_0 + \beta_1 X + \beta_2 X^2.$$
There are (infinitely) many other examples of functions that are linear-in-parameters but nonlinear-in-variables. For example,
$$\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4,$$
$$\beta_0 + \beta_1 \sin(X) + \beta_2 \cos(X),$$
$$\beta_0 + \beta_1 \ln(X) + \beta_2 \sqrt{X} + \beta_3 X^{1/3}.$$
Each can be written in terms of functions $f_j(\cdot)$ in the form
$$\sum_{j=0}^{J} \beta_j f_j(X). \tag{8.8}$$
For example, the polynomial example has $f_j(X) = X^j$ and $J = 4$.
A nonlinear-in-parameters model cannot be written as a linear combination of the parameters. For example, in the power law model
$$Y = \beta_0 X^{\beta_1} + U, \tag{8.9}$$
the term $\beta_0 X^{\beta_1}$ cannot be written as a linear combination of $\beta_0$ and $\beta_1$. Nonlinear-in-parameters models are not discussed further.
8.2.3 Estimation and Inference
OLS can estimate nonlinear-in-variables models as long as they are linear-in-parameters. As always, the OLS estimates are the parameter values that minimize the sum of squared residuals, solving the empirical analog of the optimal prediction problem (minimizing mean quadratic loss).
Inference on parameters is also the same. For example, the same R code to compute a confidence interval for $\beta_1$ earlier still works, and a confidence interval for $\beta_2$ can be computed the same way. The underlying code/math is very similar, too. However, confidence intervals for predicted values or predicted differences now involve multiple coefficients.
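As a rough illustration of why linearity in parameters is what matters for OLS, the sketch below fits a quadratic model with lm() and reproduces the same estimates from the least-squares normal equations, treating 1, X, and X^2 as three regressors. The data are simulated and all names are hypothetical; confint() here uses lm()'s default standard errors, which may differ from the inference methods used elsewhere in this text.

```r
# Simulated data (an assumption for illustration only)
set.seed(2)
n <- 100
X <- runif(n, min=0, max=5)
Y <- 1 + X^2 + rnorm(n)
# OLS fit of the quadratic (nonlinear-in-variables, linear-in-parameters) model
fit <- lm(Y ~ X + I(X^2))
# Same estimates from the normal equations with regressors 1, X, and X^2
Z <- cbind(1, X, X^2)
b.manual <- drop(solve(crossprod(Z), crossprod(Z, Y)))
print(cbind(lm=coef(fit), manual=b.manual))
# Confidence intervals for all three coefficients, computed the same way as before
print(confint(fit, level=0.95))
```

The two columns of estimates should agree up to numerical rounding, since both minimize the same sum of squared residuals.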
8.2.4 Parameter Interpretation
Unlike estimation and inference, which remain similar, interpretation of parameters changes greatly with nonlinear-in-variables models.
Insufficiency of Linear Coefficient
In (8.7), $\beta_1$ is no longer the change in Y associated with a unit increase in X. That is, when X increases, so does $X^2$, so both $\beta_1$ and $\beta_2$ are needed. Not only does the change in Y associated with a unit increase in X now depend on the initial value of X, but even the sign of the change (i.e., increase or decrease) may depend on X.
For example, consider the function $5X - X^2$, i.e., $\beta_0 = 0$, $\beta_1 = 5$, and $\beta_2 = -1$. Going from $X = 0$ to $X = 1$, the change is
$$[(5)(1) - 1^2] - [(5)(0) - 0^2] = 4 - 0 = 4.$$
Going from $X = 1$ to $X = 2$, the change is
$$[(5)(2) - 2^2] - [(5)(1) - 1^2] = 6 - 4 = 2,$$
still positive, but smaller. From $X = 2$ to $X = 3$,
$$[(5)(3) - 3^2] - [(5)(2) - 2^2] = 6 - 6 = 0,$$
no change at all. And from $X = 3$ to $X = 4$,
$$[(5)(4) - 4^2] - [(5)(3) - 3^2] = 4 - 6 = -2,$$
a negative change, i.e., a decrease. Even though $\beta_1 = 5$ is positive, sometimes an increase in X is associated with a decrease in Y. Not even the sign of $\beta_1$ (positive, negative) tells us anything.
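The four unit-step changes above can be reproduced in one line of R. This is just a sketch; the function name f and the vector step.changes are hypothetical.

```r
# Unit-step changes in the function 5X - X^2 (b0 = 0, b1 = 5, b2 = -1)
f <- function(x) 5 * x - x^2
x <- 0:4
step.changes <- diff(f(x))   # f(1)-f(0), f(2)-f(1), f(3)-f(2), f(4)-f(3)
print(step.changes)          # 4 2 0 -2
```

The sign of the change flips as X grows even though the coefficient on X is fixed at 5.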
Summarizing Nonlinear Functions
With only one X, the best summary is to plot the function (along with a scatterplot of data), like in Figure 8.2. As the saying goes, "A picture is worth a thousand words [or numbers]." However, if there are many different regressors (as in later chapters), pictures get confusing (trying to show slices of many-dimensional manifolds ...).
Another approach is to plug in changes of X that are relevant to policy or a particular economic question. For example, if Y is income, X is education, and we want to understand the value of the 12th year of education, then comparing $X = 12$ to $X = 11$ is relevant. With a quadratic model, the associated change in Y is
$$[\beta_0 + \beta_1(12) + \beta_2(12)^2] - [\beta_0 + \beta_1(11) + \beta_2(11)^2] = \beta_1(12 - 11) + \beta_2(12^2 - 11^2) = \beta_1 + 23\beta_2.$$
Generally, we could write
$$Y = f(X) + U, \tag{8.10}$$
where in the quadratic model $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2$. Plugging in OLS parameter estimates yields $\hat f(X)$, like $\hat f(X) = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ for the quadratic. We can graph this estimated function by evaluating it at many X values and drawing a line through them. Similarly, for a change from $X = x_1$ to $X = x_2$, the (estimated) associated change in Y is
$$\hat f(x_2) - \hat f(x_1).$$
In Sum: Interpreting and Summarizing Nonlinear Models
The $\beta_1 X$ term alone has no meaning.
Given (8.10), a change from $X = x_1$ to $X = x_2$ is associated with a change in Y of $f(x_2) - f(x_1)$, estimated by $\hat f(x_2) - \hat f(x_1)$.
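One way to compute such an estimated difference in R is sketched below for the quadratic model and the change from 11 to 12. The data are simulated, and the coefficient values used to generate them are arbitrary assumptions, not estimates from any real dataset.

```r
# Simulated data (illustrative assumption only)
set.seed(1)
n <- 200
X <- runif(n, min=8, max=18)                # hypothetical years of education
Y <- 20 - 2*X + 0.15*X^2 + rnorm(n)         # hypothetical quadratic CEF plus noise
fit <- lm(Y ~ X + I(X^2))
b <- coef(fit)                              # (intercept, coefficient on X, coefficient on X^2)
# Estimated change in Y from X = 11 to X = 12, computed two equivalent ways
d.formula <- unname(b[2] + 23 * b[3])       # slope estimate plus 23 times quadratic estimate
d.predict <- unname(diff(predict(fit, newdata=data.frame(X=c(11, 12)))))
print(c(d.formula, d.predict))
```

Using predict() with newdata generalizes more easily to other functional forms, since it does not require working out the algebra by hand.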
Practice 8.1 (quadratic example). You regress Y on X and $X^2$ and get the fitted function $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ with $\hat\beta_0 = 2$, $\hat\beta_1 = 4$, and $\hat\beta_2 = -2$.
a) What's the predicted value of Y when $X = 0$? $X = 1$? $X = 2$?
b) What's the predicted change in Y when X changes from 0 to 1? From 1 to 2?
Discussion Question 8.3 (nonlinear wage model interpretation). Let Y be wage ($/hr) and X years of education. Given a sample of data, you estimate $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 X^2$ with $\hat\beta_0 = 14.4$, $\hat\beta_1 = -1.6$, and $\hat\beta_2 = 0.1$.
a) Does $\hat\beta_1 < 0$ mean that more education is associated with lower wage? Why/not?
b) What does this estimated function suggest about the (descriptive) relationship between wage and education? (Hint: try plugging in salient values like $X = 12$ [high school] or $X = 16$ [college], or graph the whole function.)
8.2.5 Description, Prediction, and Causality
The interpretation of a nonlinear-in-variables model as causal, CEF, or linear projection is similar to linear-in-variables models. The main difference is that we may wish to clarify the word "linear" in linear projection, best linear approximation, and best linear predictor.
Description and Prediction
Consider the quadratic model from (8.7) when the true CEF is not quadratic. Then, the "linear" projection of Y onto $X_0 = 1$, X, and $X^2$ is defined the same way as in (7.6) before:
$$\begin{aligned} \mathrm{LP}(Y \mid 1, X, X^2) &= \beta_0 + \beta_1 X + \beta_2 X^2 \\ &= \arg\min_{a,b,c} d(Y, a + bX + cX^2) \\ &= \arg\min_{a,b,c} \sqrt{E[(Y - a - bX - cX^2)^2]}. \end{aligned} \tag{8.11}$$
These linear projection coefficients are what OLS estimates. This same function of X is again a "best" CEF approximation and "best" predictor of Y. Specifically, mirroring (7.14) and (7.15),
$$\begin{aligned} \mathrm{LP}(Y \mid 1, X, X^2) &= \beta_0 + \beta_1 X + \beta_2 X^2 \\ &= \overbrace{\arg\min_{a,b,c} E\{[E(Y \mid X) - (a + bX + cX^2)]^2\}}^{\text{BLA}} \\ &= \overbrace{\arg\min_{a,b,c} E\{[Y - (a + bX + cX^2)]^2\}}^{\text{BLP}}. \end{aligned} \tag{8.12}$$
As before, if the true CEF actually is quadratic, then these all equal the true CEF.
Structural Identification
The ASE is identified if the CEF is properly specified and independence (Assumption A6.7) holds. Then, the ASE on Y of changing X from $x_1$ to $x_2$ equals the difference in the CEF at those two points:
$$\mathrm{ASE}(x_1 \to x_2) = E(Y \mid X = x_2) - E(Y \mid X = x_1) \equiv m(x_2) - m(x_1). \tag{8.13}$$
If the CEF is correctly specified, then OLS can consistently estimate this CEF difference.
For example, if the true CEF is actually quadratic,
$$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2,$$
then regressing (with OLS) Y on 1, X, and $X^2$ yields consistent estimators of $\beta_0$, $\beta_1$, and $\beta_2$ under certain finite-moment and sampling assumptions (e.g., iid sampling and finite fourth moments of Y and X). Then, $\hat m(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2$, so a consistent ASE estimator is
$$\begin{aligned} \widehat{\mathrm{ASE}}(x_1 \to x_2) &= \hat m(x_2) - \hat m(x_1) \\ &= \hat\beta_0 + \hat\beta_1 x_2 + \hat\beta_2 x_2^2 - (\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_1^2) \\ &= \hat\beta_1(x_2 - x_1) + \hat\beta_2(x_2^2 - x_1^2). \end{aligned} \tag{8.14}$$
Alternatively, if the true structural model is $Y = h(X) + U$ and the structural error U satisfies $E(U \mid X) = 0$, then the structural function $h(\cdot)$ is also the CEF $m(\cdot)$. Thus, if $h(\cdot)$ is linear-in-parameters, then it can be estimated by OLS.
8.2.6 Code
[Figure 8.3: Same data, different models. Four panels showing the same scatterplot with a different fitted curve in each: Linear, Quadratic, Cubic, Trigonometric.]
Figure 8.3 is generated by the following code that fits the same data with four models: linear, quadratic, and cubic polynomials, and a trigonometric model with a sine and cosine term. Figure 8.3 shows four identical scatterplots with the four different fitted models. Note how the four fitted lines have very different qualitative features even though they use the same data. This illustrates the same concerns about model-driven results and "self-fulfilling prophecy" as in Section 8.1.5 and Figure 8.2.
    # 2x2 grid of panels, filled row by row
    par(family='serif', mar=c(3,3,1,1), mgp=c(2.1,0.8,0), mfrow=c(2,2))
    set.seed(112358)
    n <- 31
    # Simulated data: cubic CEF plus standard normal noise
    X <- sort(3*rbeta(n=n, shape1=1, shape2=1))
    df <- data.frame(X=X, Y=1+10*(X/2-0.5)^2*(X/2-0.5-1) + rnorm(n=n))
    # Fit four linear-in-parameters models by OLS
    retpoly1 <- lm(Y~X, data=df)
    retpoly2 <- lm(Y~X+I(X^2), data=df)
    retpoly3 <- lm(Y~X+I(X^2)+I(X^3), data=df)
    rettrig <- lm(Y~I(cos(2*pi*(X-0)/3))+I(sin(2*pi*(X-0)/3)), data=df)
    XL <- YL <- ''
    # Same scatterplot four times, each with a different fitted curve
    plot(x=df$X, y=df$Y, type='p', pch=16, xlab=XL, ylab=YL,
         main='', xlim=c(0,3))
    lines(predict(retpoly1)~df$X, col=2)
    title('Linear', line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type='p', pch=16, xlab=XL, ylab=YL,
         main='', xlim=c(0,3))
    lines(predict(retpoly2)~df$X, col=2)
    title('Quadratic', line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type='p', pch=16, xlab=XL, ylab=YL,
         main='', xlim=c(0,3))
    lines(predict(retpoly3)~df$X, col=2)
    title('Cubic', line=-1, adj=0.1)
    plot(x=df$X, y=df$Y, type='p', pch=16, xlab=XL, ylab=YL,
         main='', xlim=c(0,3))
    lines(predict(rettrig)~df$X, col=2)
    title('Trigonometric', line=-1, adj=0.1)
8.3 Nonparametric Regression
⇒ Kaplan video: Model Flexibility in Nonparametric Regression
In nonparametric regression, the functional form of the CEF $m(\cdot)$ is unknown. This is more general than nonlinear-in-variables regression, where $m(\cdot)$ is nonlinear but has a known functional form, like a cubic polynomial or log-linear model, in which only the coefficient values are unknown.
In principle, this allows a very flexible model for $m(\cdot)$, although in practice the (hopefully) optimal level of flexibility must be chosen somehow. There is no universal quantitative definition of "flexible," but the qualitative meaning is the same as the physical flexibility of a hose or cable: can it bend around sharply in many places to take whatever shape you wish (flexible), or can it only take on particular shapes? The number of parameters (terms) in a model is a general guide to how flexible the model is. For example, a model with 20 parameters is more flexible than a model with only 2 parameters.
Many machine learning methods are nonparametric CEF estimators. In machine learning, often prediction is emphasized over description and causality, but recall that the CEF is the best predictor of Y given X (under quadratic loss).
One view of nonparametric regression is that it is like nonlinear regression, but choosing the model with a formal statistical procedure instead of guessing. The steps are basically:
1. Choose a group of possible regression models.
2. Choose a way to evaluate models.
3. Evaluate the quality of each model, given the data.
4. Select the best (least bad) model.
5. Use the estimates from the selected model.
Steps 1–4 describe model selection, i.e., choosing which model to use for estimation. This is unavoidable. Sometimes model selection is informal, e.g., somebody just feels like using a quadratic model today. With nonparametric regression, usually Step 1 is done informally (but thoughtfully). For Step 2, there are many formal statistical evaluation procedures to choose from; this choice (of procedure) is also done informally but thoughtfully. Steps 3 and 4 are done by the chosen statistical procedure, using the data.
In R, usually Steps 1 and 2 require you to pick a particular R function (and certain arguments), and then the function computes Steps 3 and 4 (and Step 5) for you. Depending on the chosen model, Step 5 may be identical to Section 8.2.
Some intuitive ways to evaluate models are really bad. First, maximizing $R^2$ is bad. Whenever you add a term to your model, $R^2$ never decreases (and almost always increases), even if the model is worse (i.e., yields worse CEF estimates and predictions). Adjusted $R^2$ is better but still not designed for optimal model selection. Second, hypothesis testing is bad. Different significance levels yield different chosen models, and the answer to "which model is best?" never starts with "I controlled the type I error rate ..."
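The behavior of $R^2$ can be seen in a small simulation. This is only a sketch with made-up data: the true CEF is linear in X, and the added regressor junk is pure noise by construction, so adding it cannot improve the model in any meaningful sense.

```r
# Simulated data (illustrative assumption only)
set.seed(3)
n <- 50
X <- runif(n)
Y <- 1 + 2*X + rnorm(n)          # true CEF is linear in X
junk <- rnorm(n)                 # pure noise, unrelated to Y by construction
r2.small <- summary(lm(Y ~ X))$r.squared
r2.big   <- summary(lm(Y ~ X + junk))$r.squared
print(c(r2.small, r2.big))       # the second is not smaller than the first
```

So choosing whichever model has the larger $R^2$ would favor the model with the irrelevant regressor.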
The first difficulty in selecting a good CEF model is that $m(\cdot)$ could be very nonlinear. Imagine $Y = m(X)$ exactly. Even without any error term, we could get a bad estimate if we specify $m(x) = \beta_0 + \beta_1 x$ when really $m(\cdot)$ is not linear-in-variables. So, our model must be flexible enough to approximate the true $m(\cdot)$ well.
The second difficulty is distinguishing $m(X_i)$ from the CEF error $V_i \equiv Y_i - m(X_i)$ in the data. If we knew $Y_i = m(X_i)$, then we could learn $m(x)$ perfectly for all $x = X_i$. But in reality, we observe $Y_i = m(X_i) + V_i$. If $Y_i$ is big, we don't know if $m(X_i)$ is big or $V_i$ is big. You can think of $m(X_i)$ as the "signal" and $V_i$ as the "noise"; we want to distinguish the signal from the noise. If our model is too flexible, we risk overfitting: mistaking noise for signal. For example, perhaps the true $m(\cdot)$ is linear, but we estimate a very nonlinear function.
In practice, the key is balancing the two difficulties described above. If the model is too simple, it may fail to approximate the true CEF. If the model is too complex, it may lead to overfitting. The CEF estimate is bad in either case.
In more complex models, optimal model selection for prediction may not be optimal for causality. Historically, model selection has focused on prediction. Model selection for causal estimation is a cutting edge area of econometrics research.
The following code shows a particular example of nonparametric regression. Specifically, it uses something called a smoothing spline estimator, implemented in function smooth.spline() in R. The different estimates shown (thick red lines) correspond to different levels of flexibility of the model. The plots labeled "GCV" and "LOOCV" refer to formal model selection procedures provided through the smooth.spline() function automatically. The others show intentionally bad fits: one model is "Too flexible," the other is "Not flexible enough." Note that the same data is used for each estimate, as seen in the scatter plots. The thin black line is the true CEF.
[Figure 8.4: Smoothing spline estimates: same data, different amounts of flexibility. Four panels: GCV, LOOCV, Undersmoothed, Oversmoothed.]
Figure 8.4 shows the results from the following code.
    # 2x2 grid of panels, filled row by row
    par(family='serif', mar=c(3,3,1,1), mgp=c(2.1,0.8,0), mfrow=c(2,2))
    set.seed(112358)
    n <- 48
    # True CEF (S-shaped) and simulated data with bounded, mean-zero noise
    CEF <- function(x) { 1 + pnorm(12*(x-1/2)) }
    df <- data.frame(X=sort(runif(n)))
    df$Y <- CEF(df$X) + rbeta(n=n, shape1=2, shape2=2)*2-1
    rets <- list()
    titles <- c('GCV', 'LOOCV', 'Too flexible', 'Not flexible enough')
    rets[[1]] <- smooth.spline(x=df$X, y=df$Y, cv=FALSE) # GCV
    rets[[2]] <- smooth.spline(x=df$X, y=df$Y, cv=TRUE)  # LOOCV
    rets[[3]] <- smooth.spline(x=df$X, y=df$Y, df=n)     # very flexible
    rets[[4]] <- smooth.spline(x=df$X, y=df$Y, df=2)     # nearly linear
    xx <- seq(from=0, to=1, by=0.005)
    # Same scatterplot four times: true CEF (black) and spline estimate (red)
    for (ifig in 1:4) {
      plot(x=df$X, y=df$Y, type='p', pch=16, xlab='', ylab='',
           main='', xlim=0:1, ylim=0:1*3.04)
      lines(x=xx, y=CEF(xx), col=1)
      lines(predict(rets[[ifig]], x=xx), col=2)
      title(main=titles[ifig], line=-1, adj=0.1)
    }
Discussion Question 8.4 (model evaluation). In practice, why don't we just make graphs like in Figure 8.4 and see which fitted function looks best? (Hint: can we make such graphs in practice? If so, how can we agree on which "looks best"? What does "best" mean?)
Empirical Exercises
Empirical Exercise EE8.1. You will analyze data on law schools and their student outcomes, originally collected by Kelly Barnett for an economics term project. The idea is to compare median starting salaries of graduates from each law school with the school's cost. Of course, these are not causal estimates: does a Harvard Law graduate make a lot of money because Harvard is expensive, or because she's very skilled (enough to get into Harvard)? Since school cost is essentially a continuous variable, you will explore possible nonlinearity in the (statistical) relationship between cost and salary.
a. Load the data (assuming you've already installed that R package or Stata command).
R: library(wooldridge)
Stata: bcuse lawsch85 , nodesc clear
b. Stata only: make a graph with a local linear nonparametric CEF estimate (of salary given cost), a linear fit, and a quadratic fit, with command
lpoly salary cost , degree(1) n(100) addplot(lfit salary cost || qfit salary cost)
where n(100) simply specifies the number of CEF values to estimate and plot, lfit and qfit stand for linear fit and quadratic fit, and model selection is done with a "rule-of-thumb" formula that attempts to optimally balance variance and squared bias.
c. R only: make a data frame named df with only salary and cost variables, and only when both are observed, with
df <- data.frame(Y=lawsch85$salary, X=lawsch85$cost)
df <- df[!(is.na(df$Y) | is.na(df$X)), ]
where is.na() is TRUE if the entry is missing and FALSE if not.
d. R only: compute and store linear and quadratic (in variables) regressions with
retlm <- lm(Y~X, data=df)
retnl <- lm(Y~X+I(X^2), data=df)
e. R only: compute and store a nonparametric smoothing spline CEF estimate with GCV model selection, with command
retss <- smooth.spline(x=df$X, y=df$Y, cv=FALSE)
f. R only: specify a sequence of X values and compute CEF estimates at each value from each of the three models (linear, quadratic, nonparametric). Store the sequence as xx with
xx <- seq(from=min(df$X), to=max(df$X), length.out=100)
and then compute the estimates as
fitlm <- predict(retlm, newdata=data.frame(X=xx))
fitnl <- predict(retnl, newdata=data.frame(X=xx))
fitss <- predict(retss, x=xx)
g. R only: make a scatterplot of raw data with
plot(x=df$X, y=df$Y, xlab='Cost', ylab='Starting Salary')
h. R only: plot the three estimated CEFs as lines over the scatterplot with
lines(x=xx, y=fitlm, col=1, lty=1)
lines(x=xx, y=fitnl, col=2, lty=5)
lines(fitss, col=4, lty=3)
i. Optional: repeat your analysis but with the school's rank (variable rank) instead of cost.
j. Optional: repeat again but with log salary and log rank. Log salary is already in the dataset as variable lsalary (that's a lowercase L before salary).
R: df <- data.frame(Y=lawsch85$lsalary, X=log(lawsch85$rank))
Stata: generate lrank = log(rank) then use lrank and lsalary
Empirical Exercise EE8.2. You will analyze data on sleep and wages, originally from Biddle and Hamermesh (1990). Specifically, you'll estimate the CEF of daily hours of sleep conditional on hourly wage. For now, just drop missing values without worry, and focus on the linear, quadratic, and nonparametric estimation.
a. Load the data (assuming you've already installed that R package or Stata command).
R: library(wooldridge)
Stata: bcuse sleep75 , nodesc clear
b. R only: follow the same steps (identical code) as in EE8.1 through part (h), after setting up the data frame named df. Specifically, replace EE8.1(c) with
df <- data.frame(Y=sleep75$slpnaps/7/60, X=sleep75$hrwage)
df <- df[!(is.na(df$Y) | is.na(df$X)), ]
and then use the same code for all subsequent steps.
c. Stata only: generate a new variable that translates the total weekly minutes of sleep into average daily hours of sleep, with
generate sleephrsdaily = slpnaps/7/60
d. Stata only: graph linear, quadratic, and nonparametric (local linear) CEF estimates similar to EE8.1(b), with command
lpoly sleephrsdaily hrwage , degree(1) n(100) addplot(lfit sleephrsdaily hrwage || qfit sleephrsdaily hrwage)
e. Optional: repeat your analysis but instead of hrwage use totwrk as the conditioning variable (regressor); this is total minutes of work per week. (You could also adjust it to be average daily hours of work, to make it more comparable to the sleep variable you use.)
Empirical Exercise EE8.3. You will analyze data from the 1994–1995 men's college basketball season: scores and Las Vegas betting "spreads," originally collected by Scott Resnick. Before each game, people can bet on whether the score difference will be "over" or "under" the spread set by bookmakers in Las Vegas. (In the data, the "difference" is the favored team's score minus the other team's score, so the variable spread is always positive, but the actual score difference scrdiff can be negative if the favored team loses.) Basically, the bookmaker adjusts the spread so that half the bets are "over" and half "under," so regardless of the actual score outcome, half win and half lose (and the bookmaker always profits): the losers pay the winners, and the bookmaker keeps the transaction fees. (It's a little complicated since bets can be placed at different times and the spread can change over time, but we can imagine a simplified version where everyone bets at once and the spread is set so that half bet "over" and half "under.") See the Wikipedia entry at https://en.wikipedia.org/wiki/Spread_betting for more on spread betting. Consequently, the spread does not reflect the bookmaker's belief, but rather the aggregate beliefs of everybody betting on the game. The accuracy of such aggregate wisdom has spurred the creation of "prediction markets" for events beyond sports, like presidential elections, although there have been notable failures (e.g., 2016 US presidential election).¹ You will check whether the Las Vegas spread is indeed a good predictor of the actual score difference.
Technically, the above arguments suggest that given the spread, the median score difference should equal the spread, not the mean. But such an investigation would require "median regression" (a type of "quantile regression"), which is beyond our scope. Instead, you will investigate whether the spread is still a good predictor of the actual score difference with quadratic loss. Specifically, you can check if the OLS fit has intercept close to 0 and slope close to 1 (and whether those values are in the respective confidence intervals).
a. R only: load the needed packages (and install them before that if necessary) and look at a description of the dataset:
library(wooldridge); library(sandwich); library(lmtest)
?pntsprd
b. Stata only: load the data with bcuse pntsprd , nodesc clear (assuming bcuse already installed).
c. For each observation (each game), compute whether the actual score difference was over, under, or equal to the spread. In math and in the code below, the "sign" function (not to be confused with "sine") equals +1 for strictly positive values, −1 for strictly negative values, and 0 for zero.
R: overunder <- sign(pntsprd$scrdiff-pntsprd$spread)
Stata: generate overunder = sign(scrdiff - spread)
¹ See https://en.wikipedia.org/wiki/Prediction_market for more.
d. Display the frequency of over, under, and equal.
R: table(overunder, useNA='ifany')
Stata: tabulate overunder , missing
e. Regress the score difference on the spread.
R: ret <- lm(scrdiff~spread, data=pntsprd)
Stata: regress scrdiff spread , vce(robust)
f. R only (since already reported by Stata): display the point estimates and heteroskedasticity-robust 95% confidence intervals for the intercept and slope with
cbind(coeftest(ret, vcov.=vcovHC(ret, type='HC1'))[,1:2],
      coefci(ret, vcov.=vcovHC(ret, type='HC1')))
g. Plot nonparametric CEF fitted values against the line Y = X (intercept zero, slope one).
R: plot(smooth.spline(x=pntsprd$spread, y=pntsprd$scrdiff)) then abline(a=0, b=1, col=2)
Stata: lpoly scrdiff spread , degree(1) addplot(function y=x , range(spread)) noscatter
h. Optional: repeat your analysis in parts (e)–(g) but with the reverse regression: regress the spread on the score difference. (Is the slope still close to 1? Are you surprised? Consider games with the biggest possible score difference: should the spread be even bigger half the time?)
Chapter 9
Regression with Two Binary Regressors
⟹ Kaplan video: Chapter Introduction
Depends on: Chapters 6 and 7 (which depend on Chapters 2–4)
Unit learning objectives for this chapter:
9.1 Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
9.2 Assess whether there is bias from omitting a variable in a real-world example, including the direction of bias [TLOs 5 and 6]
9.3 Interpret (appropriately) the coefficients of a regression with two binary variables, mathematically and intuitively, for description, prediction, and causality [TLO 3]
9.4 Assess whether comparing changes in two groups over time can be interpreted causally, and interpret such differences appropriately [TLOs 2, 3, and 6]
9.5 In R (or Stata): estimate regression models with two binary variables, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter:
• ATT (Masten video)
• Potential outcomes and CATE (Masten video)
• OVB/confounders (Masten video)
• Conditional independence/unconfoundedness (Masten video)
• ATE/conditional independence example (Masten video)
• Difference-in-differences (Masten video)
• Parallel trends (Masten video)
• Diff-in-diff example: immigration and unemployment (Masten videos)
• Parallel trends example: immigration and unemployment (Masten videos)
• Diff-in-diff example: minimum wage (Masten video)
• Diff-in-diff example: posting calorie counts (Masten video)
• OVB example: test score and class size (Lambert video)
• OVB example: wages and education (Lambert video)
• Sections 3.3 ("Ceteris Paribus Interpretation and Omitted Variable Bias") and 6.1.5 ("Interaction Terms") in Heiss (2016)
• Section 13.2 ("Difference-in-Differences") in Heiss (2016)
• Collider bias examples: https://doi.org/10.1093/ije/dyp334
• Collider bias review (very detailed): https://doi.org/10.1146/annurev-soc-071913-043455
• Sections 6.1 ("Omitted Variable Bias") and 8.3 ("Interactions Between Independent Variables") in Hanck et al. (2018)
Perhaps surprisingly, there is a lot to think about with even just two binary regressors. Topics include (mis)specification of a CEF model, interaction between regressors as a type of nonlinearity, interpretation of regression coefficients, causality, estimation, and more.
9.1 Omitted Variable Bias
⟹ Kaplan video: Omitted Variable Bias
For causality, omitted variable bias (OVB) is a common problem in economics. More broadly, it is a common problem in any field that uses observational (non-experimental) data and has many variables that interact in complex ways. Generally, OVB arises because a variable outside our model is moving with X and causing Y to change, but our model assumes these changes are entirely from X.
9.1.1 An Allegory
Imagine a ghost (Q) that often accompanies a child (X), i.e., the ghost and child are often in the same place at the same time. The ghost always makes a huge mess (Y): spilling flour, knocking over chairs, drawing on walls, etc. The child's parents only observe the child and the mess; they do not observe the ghost. The parents note that when the child is in the kitchen, then there is often a mess in the kitchen, and when the child is in the bathroom, then there is often a mess in the bathroom, etc. Thus, they infer that the child (X) causes the mess (Y). However, we know that it only appears that way because:
GHOST1. the ghost (Q) often accompanies the child (X), and
GHOST2. the ghost (Q) causes a mess (Y).
The child is the regressor. The ghost is the omitted variable. The parents are economists who over-estimate how much mess the child causes. This phenomenon is OVB.
9.1.2 Formal Conditions
The ghost of OVB can be formalized as follows. Consider the structural model
$$Y = \beta_0 + \beta_1 X + \beta_2 Q + V \tag{9.1}$$
where Cov(X, V) = 0. If we don't observe Q, then instead we have the structural model
$$Y = \beta_0 + \beta_1 X + U, \quad U \equiv \beta_2 Q + V \tag{9.2}$$
Here, X is sometimes called the included regressor (included in the model, not omitted). If X is binary, then for OLS to estimate β1 requires E(U | X = 0) = E(U | X = 1): the average effect of the structural error term U must be the same for both X groups. For simplicity, imagine Q is also binary.
Condition GHOST1, "the ghost follows the child," means that we usually see Q = 1 when X = 1, and Q = 0 when X = 0. More generally, it means Q is correlated with X. This correlation does not need to have a causal interpretation. It does not matter why the ghost follows the child: maybe the ghost likes the child's company (or vice-versa), or maybe they just get hungry at the same time. It only matters that they tend to be in the same place: Q and X tend to have the same value. OVB can also occur if there is a negative correlation, e.g., if usually Q = 1 when X = 0, and Q = 0 when X = 1.
Condition GHOST2, "the ghost causes a mess," means that Q is a causal determinant of Y. In (9.1), this means β2 ≠ 0. Although in the example β2 > 0 (more mess), OVB can occur with β2 < 0, too. For example, maybe the child is really messy but the ghost cleans everything up; then, the parents would incorrectly think the child is not messy.
To summarize: for variable Q that is not included as a regressor (it is omitted from the model), it will cause OVB if both of the following conditions hold.
OVB1. Corr(Q, X) ≠ 0: the omitted variable is correlated with the included regressor.
OVB2. The omitted variable Q is a causal determinant of Y (not only through X).
The variable Q may be called an omitted variable or a confounder.
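The allegory can be sketched as a small simulation in R. All numbers here are assumptions picked for illustration: the child causes no mess at all (beta1 = 0), the ghost causes a lot (beta2 = 3), and the ghost is in the same place as the child 80% of the time.

```r
# Hypothetical ghost/child simulation; all parameter values are assumptions.
set.seed(2)
n <- 100000
Q <- rbinom(n, size=1, prob=0.5)                        # ghost present?
X <- rbinom(n, size=1, prob=ifelse(Q == 1, 0.8, 0.2))   # GHOST1: child tends to be where the ghost is
V <- rnorm(n)
beta0 <- 1; beta1 <- 0; beta2 <- 3                      # GHOST2: beta2 != 0; child itself is harmless
Y <- beta0 + beta1*X + beta2*Q + V                      # structural model as in (9.1)

coef_short <- coef(lm(Y ~ X))       # what the parents can estimate (Q omitted)
coef_long  <- coef(lm(Y ~ X + Q))   # infeasible in practice: Q included

print(coef_short)
print(coef_long)
```

With these assumed values, the regression that omits Q attributes a large mess to the child even though beta1 = 0, while the regression that includes Q recovers a slope on X near zero.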
Assessing OVB Conditions Empirically
If Q is observed in the data, then you can compare $\hat{\beta}_1$ (the estimated coefficient on X) when Q is included as a regressor to $\hat{\beta}_1$ when Q is omitted. If the estimates are meaningfully different (economically), then it may be best to include Q to reduce OVB. However, there are other types of variables that would also lead to a different $\hat{\beta}_1$ but are actually worse to include, so careful thought is required; see Section 9.6.
If Q is not observed in the data, then even Corr(Q, X) in OVB1 cannot be assessed empirically (i.e., using data).
Beware of "omitted variable" tests that are not concerned with this type of OVB. For example, Stata's ovtest implements the Ramsey test (RESET). Although the ov in ovtest indeed stands for "omitted variables," the Ramsey test only looks for (certain types of) nonlinearity, to see whether a polynomial model might be better than a linear model. That is, it is about nonlinearity in X (Section 8.2), not about a separate Q variable. Besides, as you learned in Section 8.3, hypothesis testing is a bad way to do model selection.
Example
For example, imagine we want to learn the effect of kindergarten classroom size on earnings as an adult. (This is inspired by Chetty, Friedman, Hilger, Saez, Schanzenbach, and Yagan (2011), who actually have randomized experimental data to answer this question.) Let Y denote the annual earnings of the individual at age 30. Let X = 1 if (as a child) the individual was in a kindergarten classroom with more than 24 students, and X = 0 otherwise. Imagine X is not randomized. We are curious whether we can just regress Y on X, or if there is OVB. Consider the following possible omitted variables.
First, consider Q to be somebody's first grade class size. (First grade is the year after kindergarten in the US.) As with X, Q = 1 if it is above 24 students, and Q = 0 otherwise. Since it seems like kindergarten class size has an effect on adult earnings (Y) according to Chetty et al. (2011), probably first grade class size does too, satisfying OVB2. If all students in the population are completely randomly assigned to classes each year (so Corr(X, Q) = 0), then OVB1 does not hold, so this Q would not cause OVB. However, students tend to stay in the same school, and some schools tend to have smaller class sizes than others, so OVB1 probably does hold. Since both OVB1 and OVB2 are true, there is OVB.
Second, consider Q as the number of cubbies (places to put clothes, backpacks, etc.) in somebody's kindergarten classroom. Presumably larger classes (X = 1) require more cubbies since there are more students, so Corr(Q, X) > 0, satisfying OVB1. However, I'd guess the number of cubbies does not have a causal effect on future earnings Y. That is, if we simply went into classrooms and added a few cubbies (without adding students), I don't think it would affect students' future earnings. Thus, OVB2 does not hold, and this Q does not cause OVB.
Third, consider Q = 1 if the kindergarten is in a high-income area, and Q = 0 otherwise. Areas with higher income are more likely to be able to afford more teachers to keep class sizes small. That is, it's more likely to see Q = 1 and X = 0, or Q = 0 and X = 1, so Corr(Q, X) < 0, satisfying OVB1. Also, Chetty, Hendren, and Katz (2016) provide evidence that growing up in a higher-income area has a positive causal effect on earnings as an adult (not only because of smaller kindergarten classes), meaning Q is a causal determinant of Y, satisfying OVB2. Thus, omitting this Q causes OVB.
In Sum: Possible Omitted Variables (Q) in Kindergarten Example
First grade class size: affects earnings (OVB2) and probably correlated with kindergarten class size (OVB1) if population includes multiple schools ⟹ OVB
Cubbies: more if more students (OVB1), but no causal effect on earnings (no OVB2) ⟹ no OVB
Neighborhood income: smaller classes if higher income (OVB1), and affects earnings (OVB2) ⟹ OVB
Discussion Question 9.1 (assessing OVB). Among public elementary schools (students mostly 5–11 years old) in California, let Y be the average standardized math test score among a school's 5th-graders, and let X be the school's student-teacher ratio for 5th-graders (like average number of students per class). Consider a simple regression of Y on X. For any two of the following variables, assess each OVB condition separately, and then decide whether you think it's a source of OVB.
a) School's parking lot area per student. (Remember: 5–11-year-olds don't have cars to park.)
b) Time of day of the test.
c) School's total spending per student (including books, facilities, etc.).
d) Percentage of English learners (non-native speakers) among a school's 5th-grade students.
9.1.3 Consequences
The practical problem of OVB is that we systematically over-estimate or under-estimate the true structural parameter. This consequence is quantified below.
Formulas
The following results are much more general than OVB with binary regressors. Beyond OVB, they quantify the consequences of any source of endogeneity that causes correlation between the regressor X and structural error term U. Other sources of endogeneity are discussed in Section 12.3. The results also apply to any discrete and continuous X.
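Given structural model Y = β0 + β1X + U, the OLS estimator of β1 has the property
$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} \tag{9.3}$$
or equivalently (in different notation),
$$\operatorname*{plim}_{n\to\infty} \hat{\beta}_1 = \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} \tag{9.4}$$
That is, for large samples (large n), the estimator $\hat{\beta}_1$ is close to the right-hand side expression in most randomly sampled datasets. (To review $\xrightarrow{p}$ and consistency, see Section 3.7.3.)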
Equations (9.3) and (9.4) show OVB is not solved by having lots of data. Unless Corr(X, U) = 0, the OLS estimator is not consistent for the structural β1.
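A minimal R sketch of this point, under an assumed data-generating process that is not from the text: a common component Z makes X and U correlated with Cov(X, U) = 1 and Var(X) = 2, so with an assumed structural slope of 2 the formula predicts a limit of 2 + 1/2 = 2.5.

```r
# Assumed data-generating process for illustration only.
set.seed(3)
beta1 <- 2                                   # assumed structural slope
sample_sizes <- c(100, 10000, 100000)
slopes <- sapply(sample_sizes, function(n) {
  Z <- rnorm(n)                              # common component
  X <- Z + rnorm(n)                          # Var(X) = 2
  U <- Z + rnorm(n)                          # Cov(X, U) = Var(Z) = 1
  Y <- 1 + beta1*X + U                       # structural model
  coef(lm(Y ~ X))[["X"]]                     # OLS slope estimate
})
plim_formula <- beta1 + 1/2                  # beta1 + Cov(X,U)/Var(X)

print(data.frame(n=sample_sizes, slope=slopes))
print(plim_formula)
```

As n grows, the OLS slope settles down, but at the value given by the right-hand side of the formula rather than at the structural slope.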
Rearranging (9.4), the asymptotic bias (as in (3.32)) is
$$\operatorname*{plim}_{n\to\infty} \hat{\beta}_1 - \beta_1 = \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} = \text{slope coefficient in } \mathrm{LP}(U \mid 1, X) \tag{9.5}$$
The characterization as a linear projection slope coefficient comes from replacing Y with U in (7.8). This can't be computed from data since U is unobserved, but it is helpful for thinking about the direction and magnitude of asymptotic bias.
Although technically this is "asymptotic bias" rather than "bias" (Section 3.7.1), the practical implication is the same. Although very different mathematically, we won't worry about such technicalities.
Direction of Asymptotic Bias
(Recall terms and definitions from Sections 3.7.1 and 3.7.3.)
The direction (+ or −) of the asymptotic bias in (9.5) depends on the sign (+ or −) of the slope in LP(U | 1, X). The sign of this slope is equivalent to the sign of Corr(X, U).
If Corr(X, U) > 0, then $\operatorname{plim}_{n\to\infty} \hat{\beta}_1 - \beta_1 > 0$. This is positive (upward) asymptotic bias, meaning we systematically estimate a value "above" the true β1. "Above" does not mean "bigger in magnitude": it could be that β1 = −9 and positive asymptotic bias causes $\operatorname{plim}_{n\to\infty} \hat{\beta}_1 = 0$. This is "positive" since 0 − (−9) > 0 (positive), but we might also say that we're estimating a "smaller" effect (in fact, zero effect), in the sense that |0| < |−9|. This can be confusing.
If Corr(X, U) < 0, then $\operatorname{plim}_{n\to\infty} \hat{\beta}_1 - \beta_1 < 0$, meaning negative (downward) asymptotic bias. Again confusing: negative asymptotic bias can actually make effects look bigger, e.g., if β1 = 0 and $\operatorname{plim}_{n\to\infty} \hat{\beta}_1 = -9$: the true effect is zero, but the negative asymptotic bias makes it appear like there is an effect.
Results in Terms of Q
For OVB specifically, the general results in terms of U can be translated to Q. As in (9.2), let U = β2Q + V with Cov(X, V) = 0. Then, using a linearity property of covariance,
$$\operatorname{Cov}(X,U) = \operatorname{Cov}(X, \beta_2 Q + V) = \beta_2 \operatorname{Cov}(X,Q) + \overbrace{\operatorname{Cov}(X,V)}^{=0} \tag{9.6}$$
Plugging this into the first expression in (9.3),
$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\operatorname{Cov}(X,U)}{\operatorname{Var}(X)} = \beta_1 + \beta_2 \frac{\operatorname{Cov}(X,Q)}{\operatorname{Var}(X)} = \beta_1 + \beta_2 \operatorname{Corr}(X,Q) \sqrt{\frac{\operatorname{Var}(Q)}{\operatorname{Var}(X)}} \tag{9.7}$$
Interestingly, similar to (9.5), Cov(X, Q)/Var(X) is the slope of the population linear projection of Q onto X (and an intercept), LP(Q | 1, X). So, the asymptotic bias is the product β2γ1, where β2 is the structural slope coefficient on Q in (9.1), and γ1 is the linear projection slope coefficient in LP(Q | 1, X) = γ0 + γ1X.
Equation (9.7) shows why both Conditions OVB1 and OVB2 are required for OVB. Condition OVB1 says Corr(X, Q) ≠ 0, while OVB2 says β2 ≠ 0. If either β2 = 0 or Corr(X, Q) = 0 in (9.7), then β2 Corr(X, Q) = 0 and the asymptotic bias disappears: $\hat{\beta}_1 \xrightarrow{p} \beta_1$.
The direction of asymptotic bias can also be interpreted in terms of Q. Using (9.7), the sign of the asymptotic bias is the sign of β2 Corr(X, Q). That is, if β2 Corr(X, Q) > 0, then there is positive (upward) asymptotic bias; if β2 Corr(X, Q) < 0, then there is negative (downward) asymptotic bias.
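The "product" form of the asymptotic bias has an exact in-sample counterpart that can be checked in R. The sketch below uses made-up parameter values (negative correlation between X and Q, positive effect of Q); the variable names are illustrative assumptions.

```r
# Illustrative check of the product form of OVB; parameter values are assumptions.
set.seed(4)
n <- 500
Q <- rbinom(n, size=1, prob=0.4)
X <- rbinom(n, size=1, prob=ifelse(Q == 1, 0.3, 0.7))   # X and Q negatively correlated
Y <- 1 - 2*X + 4*Q + rnorm(n)                           # assumed beta1 = -2, beta2 = 4

long   <- coef(lm(Y ~ X + Q))        # regression including Q
short  <- coef(lm(Y ~ X))            # regression omitting Q
gamma1 <- coef(lm(Q ~ X))[["X"]]     # slope from regressing Q on X (and an intercept)

# Short-regression slope = long slope on X + (long slope on Q) * gamma1
implied_short <- long[["X"]] + long[["Q"]] * gamma1
print(c(short=short[["X"]], implied=implied_short, gamma1=gamma1))
```

Because the estimated effect of Q is positive and gamma1 is negative here, the slope from the regression omitting Q sits below the slope from the regression including Q, matching the sign rule for β2 Corr(X, Q).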
Example
Consider the asymptotic bias direction in the example where X = 1 if the kindergarten class size is large and Q = 1 if the neighborhood income is high. Earlier, we thought probably Corr(X, Q) < 0 and β2 > 0. Thus, there is negative OVB since β2 Corr(X, Q) < 0. That is, if the true effect of class size on earnings is β1, then we systematically estimate something below β1.
Does this make the effect size (absolute value) appear bigger or smaller? Since smaller classes are better, average earnings (Y) are higher when X = 0 than when X = 1. This means a negative slope, β1 < 0. That is, the effect of changing from a smaller class (X = 0) to a larger class (X = 1) is lower future earnings (β1 < 0). Negative asymptotic bias means we estimate something even more negative: $\operatorname{plim}_{n\to\infty} \hat{\beta}_1 < \beta_1 < 0$. This makes the size of the effect appear larger than it really is: we estimate something farther away from zero.
Intuitively, this OVB direction makes sense. Individuals who had a small kindergarten class tend to have grown up in wealthier areas with lots of other advantages that also cause higher earnings. If we ascribe the entire mean earnings difference to kindergarten, then it falsely appears that kindergarten alone causes the big difference, when in reality many different forces were all working together in the same direction.
In Sum: OVB Assessment
1. Think of a specific variable Q
2. Assess OVB1: correlated with X?
3. Assess OVB2: causal effect on Y (separate from X effect)?
4. If both OVB1 and OVB2 ⟹ OVB
5. OVB direction: positive bias if Corr(X, Q) and effect of Q on Y are either both + or both −; otherwise negative bias
6. OVB magnitude: all else equal, larger (in absolute value) if i) larger effect of Q on Y, ii) larger Corr(X, Q), iii) larger Var(Q)/Var(X)
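Steps 5 and 6 can be turned into a tiny R helper for back-of-the-envelope reasoning. The function name ovb_asymptotic_bias and all input values are hypothetical; in practice β2 and Corr(X, Q) are usually unknown, so this only helps explore "what if" scenarios.

```r
# Hypothetical helper: asymptotic bias = beta2 * Corr(X,Q) * sqrt(Var(Q)/Var(X)).
ovb_asymptotic_bias <- function(beta2, corrXQ, varQ, varX) {
  beta2 * corrXQ * sqrt(varQ / varX)
}

# Made-up scenario: positive effect of Q, negative correlation with X
bias_negative <- ovb_asymptotic_bias(beta2=5, corrXQ=-0.3, varQ=0.25, varX=0.25)
# Same scenario but with the correlation sign flipped
bias_positive <- ovb_asymptotic_bias(beta2=5, corrXQ=0.3, varQ=0.25, varX=0.25)
# No OVB if either condition fails (beta2 = 0 or Corr(X,Q) = 0)
bias_no_ovb2 <- ovb_asymptotic_bias(beta2=0, corrXQ=0.3, varQ=0.25, varX=0.25)
bias_no_ovb1 <- ovb_asymptotic_bias(beta2=5, corrXQ=0, varQ=0.25, varX=0.25)

print(c(bias_negative, bias_positive, bias_no_ovb2, bias_no_ovb1))
print(sign(c(bias_negative, bias_positive)))   # direction of asymptotic bias
```

The design choice is simply to keep the three ingredients of the magnitude (effect of Q, correlation, variance ratio) as separate arguments so each can be varied on its own.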
Practice 9.1 (OVB: kindergarten). Consider the OVB example with earnings as an adult (Y), kindergarten classroom size (X), and childhood neighborhood income (Q). But, reverse the definition of X: let X = 1 for smaller classrooms (24 or fewer students) and X = 0 for larger classrooms. Say whether you think each of the following is positive or negative, and explain why: a) β1; b) Corr(X, Q); c) β2; and d) OVB. Also discuss: e) will our estimated effect $\hat{\beta}_1$ tend to be larger or smaller than the true effect β1, and why?
Discussion Question 9.2 (OVB: ES habits). Recall from DQ 6.3 the example with Y as a student's final semester score (0 ≤ Y ≤ 100), and X = 1 if a student starts the exercise sets well ahead of the deadline (and X = 0 otherwise).
a) What's one variable that might cause OVB? Explain why you think both OVB conditions are satisfied.
b) Which direction of asymptotic bias would your omitted variable cause? Explain.
9.1.4 OVB in Linear Projection
For linear projection (without causal interpretation), the OVB formula is actually the same as (9.7), just with β1 and β2 interpreted as linear projection coefficients rather than structural coefficients. Similar results for larger linear projection models are in Hansen (2020, §2.24), for example.
However, if we are interested in prediction, we don't care whether our $\hat{\beta}_1$ estimates a particular linear projection coefficient; we only care whether we can predict Y well. Of course, we don't want to omit Q if it's helpful for prediction, but we don't care about OVB itself. That is, OVB is only a problem for causality, not prediction.
9.2 Linear-in-Variables Model
The simplest CEF model with two binary variables is linear-in-variables (Section 8.2.1):
$$E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{9.8}$$
Misspecification
Unfortunately, (9.8) may be misspecified. Recall from Section 7.1 that misspecification arose when X had three values but the CEF model β0 + β1X had only two parameters. The case here is similar: (9.8) has only 3 parameters, but there are 4 possible values of (X1, X2). Specifically, (X1, X2) could equal (0, 0), (0, 1), (1, 0), or (1, 1). Consequently, there are four CEF values:
$$\begin{aligned}
m(0,0) &= E(Y \mid X_1 = 0, X_2 = 0), & m(0,1) &= E(Y \mid X_1 = 0, X_2 = 1), \\
m(1,0) &= E(Y \mid X_1 = 1, X_2 = 0), & m(1,1) &= E(Y \mid X_1 = 1, X_2 = 1)
\end{aligned} \tag{9.9}$$
To see the possible misspecification, we can write the βj regression coefficients in terms of the CEF values m(x1, x2). If (9.8) were true, then
\begin{align}
m(0,0) &= \beta_0 + (\beta_1)(0) + (\beta_2)(0) = \beta_0 \tag{9.10} \\
m(0,1) &= \beta_0 + (\beta_1)(0) + (\beta_2)(1) = \beta_0 + \beta_2 \tag{9.11} \\
m(1,0) &= \beta_0 + (\beta_1)(1) + (\beta_2)(0) = \beta_0 + \beta_1 \tag{9.12} \\
m(1,1) &= \beta_0 + (\beta_1)(1) + (\beta_2)(1) = \beta_0 + \beta_1 + \beta_2 \tag{9.13}
\end{align}
Consequently, β1 has two interpretations. It equals either (9.13) minus (9.11), or (9.12) minus (9.10):
$$\begin{aligned}
m(1,1) - m(0,1) &= (\beta_0 + \beta_1 + \beta_2) - (\beta_0 + \beta_2) = \beta_1 \\
m(1,0) - m(0,0) &= (\beta_0 + \beta_1) - \beta_0 = \beta_1
\end{aligned}$$
Thus, the model implicitly assumes m(1, 1) − m(0, 1) = m(1, 0) − m(0, 0), which may not be true of the real CEF. For example,
$$m(0,0) = 0,\ m(1,0) = 1,\ m(0,1) = 2,\ m(1,1) = 4 \implies m(1,1) - m(0,1) = 2,\ m(1,0) - m(0,0) = 1$$
Because m(1, 1) − m(0, 1) ≠ m(1, 0) − m(0, 0), the CEF model in (9.8) is misspecified (wrong). That is, there are no possible (β0, β1, β2) such that m(x1, x2) = β0 + β1x1 + β2x2.
As discussed in Chapter 7, if the CEF model is wrong, then OLS estimates the linear projection. Here, OLS estimates LP(Y | 1, X1, X2). However, this is not useful for causality, and the misspecification is easily fixed.
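The numerical example can be checked in R. This sketch feeds the four CEF values directly into lm() with one row per (X1, X2) combination, which implicitly assumes the four combinations are equally likely; that weighting is an assumption for illustration, not something stated in the text.

```r
# One row per (X1, X2) cell, using the CEF values from the example above.
cells <- data.frame(X1=c(0, 1, 0, 1),
                    X2=c(0, 0, 1, 1),
                    Y =c(0, 1, 2, 4))   # m(0,0), m(1,0), m(0,1), m(1,1)

fit <- lm(Y ~ X1 + X2, data=cells)      # linear-in-variables model
print(coef(fit))

# Compare the model's fitted values with the true CEF values
print(cbind(cells, fitted=fitted(fit), gap=cells$Y - fitted(fit)))
```

No choice of three coefficients reproduces all four CEF values, so every cell has a nonzero gap; the fitted slope on X1 lands between the two different X1 differences (1 and 2) rather than matching either one.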
More Consideration
Before we fix the misspecification, consider more carefully why (9.8) is usually misspecified. To be concrete, imagine Y is wage, X1 = 1 if an individual has a college degree (and X1 = 0 if not), and X2 = 1 if an individual has at least 10 years of work experience (and X2 = 0 if not). For simplicity, we'll call X1 "education" and X2 "experience." The quantity m(1, 1) − m(0, 1) compares the mean wage in the high-education, high-experience group (subpopulation) with the mean wage in the low-education, high-experience group. That is, within the high-experience subpopulation, it compares the mean wage of the high-education and low-education sub-sub-populations. The quantity m(1, 0) − m(0, 0) also compares mean wages across high and low education, but within the low-experience subpopulation. Thus, assuming m(1, 1) − m(0, 1) = m(1, 0) − m(0, 0) can be interpreted as assuming that the mean wage difference between high-education and low-education groups is identical within the high-experience subpopulation and within the low-experience subpopulation. This is a strong assumption that is probably not true in this example (or in most examples).
9.3 Fully Saturated Model
⟹ Kaplan video: Fully Saturated Model Interpretation
Misspecification is avoided by adding the interaction term X1X2:
$$E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \tag{9.14}$$
Mathematically, interaction terms often involve the product of two regressors, like X1X2 here. Economically, the interaction term allows the mean Y difference associated with X1 to depend on the value of X2. Similarly, it allows the mean Y difference associated with X2 to depend on the value of X1. For example, the mean wage difference associated with education can depend on the value of experience. More generally, interaction terms allow the change in Y associated with a unit increase in one regressor to depend on the value of another regressor.
The CEF model in (9.14) is also called fully saturated (Section 7.2.2) since it is flexible enough to allow a different CEF value for each value of (X1, X2). Logically, having the same number (four) of possible values of (X1, X2) as βj parameters is necessary but not sufficient for the model to be fully saturated.
Interpretation of the coefficients requires writing them in terms of different CEF values. First, similar to (9.10)–(9.13), each CEF value can be written in terms of the βj:
\begin{align}
m(x_1,x_2) &= \beta_0 + (\beta_1)(x_1) + (\beta_2)(x_2) + (\beta_3)(x_1)(x_2) \notag \\
m(0,0) &= \beta_0 + (\beta_1)(0) + (\beta_2)(0) + (\beta_3)(0)(0) = \beta_0 \tag{9.15} \\
m(0,1) &= \beta_0 + (\beta_1)(0) + (\beta_2)(1) + (\beta_3)(0)(1) = \beta_0 + \beta_2 \tag{9.16} \\
m(1,0) &= \beta_0 + (\beta_1)(1) + (\beta_2)(0) + (\beta_3)(1)(0) = \beta_0 + \beta_1 \tag{9.17} \\
m(1,1) &= \beta_0 + (\beta_1)(1) + (\beta_2)(1) + (\beta_3)(1)(1) = \beta_0 + \beta_1 + \beta_2 + \beta_3 \tag{9.18}
\end{align}
From (9.15)–(9.18) and their differences,
\begin{align}
\overbrace{\beta_0}^{(9.15)} &= m(0,0) \tag{9.19} \\
\beta_1 &= \overbrace{(\beta_0 + \beta_1) - \beta_0}^{(9.17) - (9.15)} = m(1,0) - m(0,0) \tag{9.20} \\
\beta_2 &= \overbrace{(\beta_0 + \beta_2) - \beta_0}^{(9.16) - (9.15)} = m(0,1) - m(0,0) \tag{9.21} \\
\beta_3 &= [\beta_2 + \beta_3] - [\beta_2] = \overbrace{[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_1)]}^{(9.18) - (9.17)} - \overbrace{[(\beta_0 + \beta_2) - (\beta_0)]}^{(9.16) - (9.15)} \notag \\
&= \overbrace{\overbrace{[m(1,1) - m(1,0)]}^{\text{difference}} - \overbrace{[m(0,1) - m(0,0)]}^{\text{difference}}}^{\text{difference-in-differences}} \tag{9.22} \\
&= [m(1,1) - m(0,1)] - [m(1,0) - m(0,0)] \tag{9.23} \\
&= \overbrace{[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2)]}^{(9.18) - (9.16)} - \overbrace{[(\beta_0 + \beta_1) - (\beta_0)]}^{(9.17) - (9.15)} \notag
\end{align}
Because of the difference-in-differences structure seen in (922)and (923) this model is sometimes called a difference-in-differencesmodel particularly when X2 represents time and X1 represents aldquotreatmentrdquo (see Section 97)
Using (919)ndash(923) the four βj in (914) have the following inter-pretations both in terms of the wage example (Y wage X1 educationX2 experience) and more generally
• $\beta_0 = m(0, 0)$ is the mean wage among low-education, low-experience individuals.
  More generally: $\beta_0$ is the mean $Y$ in the subpopulation with $X_1 = 0$ and $X_2 = 0$.
  Caution: generally $\beta_0 \neq E(Y)$.

• $\beta_1 = m(1, 0) - m(0, 0)$ is the mean wage difference between high-education and low-education individuals within the low-experience subpopulation.
  More generally: $\beta_1$ is the mean $Y$ difference between $X_1 = 1$ and $X_1 = 0$ individuals within the $X_2 = 0$ subpopulation.
  Caution: generally $\beta_1 \neq E(Y \mid X_1 = 1) - E(Y \mid X_1 = 0)$; it additionally conditions on $X_2 = 0$.

• $\beta_2 = m(0, 1) - m(0, 0)$ is the mean wage difference between high-experience and low-experience individuals within the low-education subpopulation.
  More generally: $\beta_2$ is the mean $Y$ difference between $X_2 = 1$ and $X_2 = 0$ individuals within the $X_1 = 0$ subpopulation.
  Caution: generally $\beta_2 \neq E(Y \mid X_2 = 1) - E(Y \mid X_2 = 0)$; it additionally conditions on $X_1 = 0$.

• $\beta_3 = [m(1, 1) - m(1, 0)] - [m(0, 1) - m(0, 0)]$ is the mean wage difference associated with experience in the high-education subpopulation, minus the mean wage difference associated with experience in the low-education subpopulation.
  More generally: $\beta_3$ is the mean $Y$ difference associated with $X_2$ in the $X_1 = 1$ subpopulation, minus the mean $Y$ difference associated with $X_2$ in the $X_1 = 0$ subpopulation.

• $\beta_3 = [m(1, 1) - m(0, 1)] - [m(1, 0) - m(0, 0)]$ is also the mean wage difference associated with education in the high-experience subpopulation, minus the mean wage difference associated with education in the low-experience subpopulation.
  More generally: $\beta_3$ is the mean $Y$ difference associated with $X_1$ in the $X_2 = 1$ subpopulation, minus the mean $Y$ difference associated with $X_1$ in the $X_2 = 0$ subpopulation.
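The $\beta_j$ interpretations can also be seen by considering the regression of $Y$ on $X_1$ when $X_2 = 0$, and separately when $X_2 = 1$. That is, plugging in $x_2 = 0$ first and then $x_2 = 1$ second,

\begin{align}
m(x_1, 0) &= \beta_0 + \beta_1 x_1 + (\beta_2)(0) + (\beta_3)(x_1)(0) = \beta_0 + \beta_1 x_1, \tag{9.24} \\
m(x_1, 1) &= \beta_0 + \beta_1 x_1 + (\beta_2)(1) + (\beta_3)(x_1)(1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1. \tag{9.25}
\end{align}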
That is, when changing from $X_2 = 0$ to $X_2 = 1$, the intercept changes by $\beta_2$ and the slope changes by $\beta_3$. These changes could be positive, negative, or zero. The interaction coefficient $\beta_3$ describes how the slope with respect to $X_1$ differs when $X_2 = 1$ versus $X_2 = 0$.

Equivalently, we could switch all the $X_1$ and $X_2$ and interpret $\beta_3$ as the difference between the slope with respect to $X_2$ when $X_1 = 1$ versus when $X_1 = 0$:

\begin{align}
m(0, x_2) &= \beta_0 + (\beta_1)(0) + \beta_2 x_2 + (\beta_3)(0)(x_2) = \beta_0 + \beta_2 x_2, \tag{9.26} \\
m(1, x_2) &= \beta_0 + (\beta_1)(1) + \beta_2 x_2 + (\beta_3)(1)(x_2) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) x_2. \tag{9.27}
\end{align}
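The mapping between the four CEF values and the four coefficients can be checked numerically. The following is a minimal sketch in R; the four cell means are illustrative assumptions (not estimates from any dataset), and the helper name `cef` is hypothetical.

```r
# Cell means m(x1, x2); values are illustrative assumptions, not estimates
m00 <- 10; m10 <- 15; m01 <- 16; m11 <- 25

# Coefficients from (9.19)-(9.21)
beta0 <- m00
beta1 <- m10 - m00
beta2 <- m01 - m00
# Two equivalent difference-in-differences forms, (9.22) and (9.23)
beta3_a <- (m11 - m10) - (m01 - m00)
beta3_b <- (m11 - m01) - (m10 - m00)

# CEF implied by the coefficients, as in (9.14)
cef <- function(x1, x2) beta0 + beta1 * x1 + beta2 * x2 + beta3_a * x1 * x2

# Slope with respect to x1 within each x2 subpopulation, as in (9.24)-(9.25)
slope_x2_0 <- cef(1, 0) - cef(0, 0)  # equals beta1
slope_x2_1 <- cef(1, 1) - cef(0, 1)  # equals beta1 + beta3

print(c(beta0 = beta0, beta1 = beta1, beta2 = beta2, beta3 = beta3_a))
print(c(slope_x2_0 = slope_x2_0, slope_x2_1 = slope_x2_1))
```

With these numbers the two slopes differ by exactly the interaction coefficient, which is the content of (9.24) and (9.25).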
Practice 9.2 (binary interaction). Let $Y$ be wage (\$/hr), $D_1 = 1$ if an individual has a college degree ($D_1 = 0$ if not), and $D_2 = 1$ if an individual has more than 15 years of experience (and $D_2 = 0$ if not). You have a sample of data and run OLS on the fully saturated model, yielding $\hat{Y} = 10 + 5D_1 + D_2 + 2D_1D_2$.

a) For the college-educated subpopulation, what is the estimated change in mean wage associated with changing from low to high experience?

b) Within the low-experience subpopulation, what's the estimated difference in mean wage between the college and no-college subpopulations?

c) How do you interpret the 2 (the coefficient on $D_1D_2$)?
9.4 Structural Identification by Exogeneity
Imagine $Y$ is determined by the structural model

\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + U. \tag{9.28}
\end{equation}
The qualitative condition for identification is the same as in Section 6.6.1. Specifically, if $U$ (which contains other causal determinants of $Y$) is unrelated to the regressors, then the structural parameters are identified. Recall that a regressor unrelated to $U$ is called exogenous; otherwise, it's endogenous.
Mathematically, one sufficient definition of "unrelated" here is "uncorrelated." If

\begin{equation}
\mathrm{Cov}(U, X_1) = \mathrm{Cov}(U, X_2) = \mathrm{Cov}(U, X_1 X_2) = 0, \tag{9.29}
\end{equation}

then $\beta_1$, $\beta_2$, and $\beta_3$ are the linear projection slope coefficients from $\mathrm{LP}(Y \mid 1, X_1, X_2, X_1 X_2)$. Other mathematical definitions of "unrelated" imply (9.29) and are thus sufficient for identification. For example, $U \perp\!\!\!\perp (X_1, X_2)$ logically implies (9.29), as does mean independence, $E(U \mid X_1, X_2) = E(U)$.
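The step from mean independence to (9.29) can be sketched with the law of iterated expectations. Here $g(X_1, X_2)$ is a placeholder introduced only for this illustration; it stands for any of $X_1$, $X_2$, or $X_1 X_2$.

```latex
\begin{align*}
\mathrm{Cov}\bigl(U, g(X_1, X_2)\bigr)
  &= E\bigl[U\, g(X_1, X_2)\bigr] - E(U)\, E\bigl[g(X_1, X_2)\bigr] \\
  &= E\bigl[E(U \mid X_1, X_2)\, g(X_1, X_2)\bigr] - E(U)\, E\bigl[g(X_1, X_2)\bigr] \\
  &= E(U)\, E\bigl[g(X_1, X_2)\bigr] - E(U)\, E\bigl[g(X_1, X_2)\bigr] = 0 .
\end{align*}
```

The second line conditions on $(X_1, X_2)$, and the third substitutes $E(U \mid X_1, X_2) = E(U)$. Full independence implies mean independence, so the same argument covers that case.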
If the structural $\beta_1$, $\beta_2$, and $\beta_3$ are also linear projection coefficients, then they can be estimated by OLS. That is, we can interpret the OLS-estimated slope coefficients as the structural parameters in (9.28).
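A small simulation may make the role of exogeneity concrete. This is a sketch under assumed values: the structural coefficients, the regressor probabilities, and the form of endogeneity (an unobserved term whose mean is higher when $X_1 = 1$) are all hypothetical choices for illustration, and `make_y` is a hypothetical helper.

```r
set.seed(1)
n <- 100000
X1 <- rbinom(n, 1, 0.5)
X2 <- rbinom(n, 1, 0.4)
b <- c(10, 5, 6, 4)  # hypothetical structural beta0, beta1, beta2, beta3

# Structural model (9.28) as a function of the unobserved term U
make_y <- function(U) b[1] + b[2] * X1 + b[3] * X2 + b[4] * X1 * X2 + U

# Case 1: U independent of (X1, X2), so (9.29) holds
U_exo <- rnorm(n)
est_exo <- coef(lm(make_y(U_exo) ~ X1 * X2))

# Case 2: U is higher on average when X1 = 1, so Cov(U, X1) is not zero
U_endo <- rnorm(n) + 2 * X1
est_endo <- coef(lm(make_y(U_endo) ~ X1 * X2))

print(round(rbind(true = b, exogenous = est_exo, endogenous = est_endo), 2))
```

In the exogenous case the OLS estimates should be close to the structural values. In the endogenous case the estimated coefficient on X1 should be close to 7 rather than 5, because OLS attributes the extra 2 in the mean of $U$ to $X_1$.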
9.5 Identification by Conditional Independence
By extending the independence assumption (A6.1 and A6.5), variants of the ASE and ATE can be identified. (Note: more details and examples are in the Spring 2020 edition.)
Consider the subpopulation with $X_2 = 1$, and whether the mean difference $E(Y \mid X_1 = 1, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 1)$ has a causal interpretation. This is equivalent to redefining the population as everybody with $X_2 = 1$, and asking if the mean difference $E(Y \mid X_1 = 1) - E(Y \mid X_1 = 0)$ has a causal interpretation. This question was studied in Section 6.6 for both structural and potential outcomes models.
The key identifying assumption from Section 6.6 was independence. In the structural model, this meant independence between the regressor $X_1$ and the unobserved determinants of $Y$. In the potential outcomes model, this meant independence between the treatment and the pair of potential outcomes.
Extending independence is conditional independence, which essentially assumes independence within each subpopulation ($X_2 = 1$ and $X_2 = 0$). The conditional independence assumption (CIA) has other names like unconfoundedness, selection on observables, and ignorability; see Imbens and Wooldridge (2007, p. 6) and references therein. Mathematically, both structural and potential outcomes versions of conditional independence are stated in Assumption A9.1.
Assumption A9.1 (conditional independence assumption, CIA). Let $Y = h(X_1, X_2, U)$ be the structural model. Conditional on the control variable $X_2$, the regressor of interest $X_1$ is independent of the vector of unobserved causal determinants $U$: $U \perp\!\!\!\perp X_1 \mid X_2$. Alternatively, in potential outcomes notation, where $Y^T$ and $Y^C$ are the treated and untreated potential outcomes, binary treatment $X_1$ is independent of the potential outcomes conditional on the control variable $X_2$: $(Y^C, Y^T) \perp\!\!\!\perp X_1 \mid X_2$. More generally, in either case, $X_2$ may be replaced by multiple control variables $X_2, X_3, X_4, \ldots$, and $X_1$ can be discrete (including binary) or continuous in the structural model.
Consequently, the ASE or ATE within subpopulation $X_2 = 1$ is identified and equal to the conditional mean difference $E(Y \mid X_1 = 1, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 1)$. Similarly, the ASE or ATE within subpopulation $X_2 = 0$ is identified and equal to the conditional mean difference $E(Y \mid X_1 = 1, X_2 = 0) - E(Y \mid X_1 = 0, X_2 = 0)$. Because these are causal effects within a subpopulation (not the full population), i.e., conditional on $X_2$, the ATE is sometimes called a conditional ATE, or the ASE a conditional ASE.
The (unconditional) ATE can be computed from the conditional ATEs. Specifically, the ATE is the mean conditional ATE. Writing the conditional ATE for $X_2 = 1$ as $\mathrm{CATE}(1)$, and similarly $\mathrm{CATE}(0)$ for $X_2 = 0$, then $\mathrm{CATE}(X_2)$ is a random variable. Specifically, $P(\mathrm{CATE}(X_2) = \mathrm{CATE}(1)) = P(X_2 = 1)$ and $P(\mathrm{CATE}(X_2) = \mathrm{CATE}(0)) = P(X_2 = 0)$. Thus, $\mathrm{CATE}(X_2)$ has a mean. Ultimately,

\begin{align*}
\mathrm{ATE} = E[Y^T - Y^C] &= E[E(Y^T - Y^C \mid X_2)] = E[\mathrm{CATE}(X_2)] \\
&= P(X_2 = 1)\,\mathrm{CATE}(1) + P(X_2 = 0)\,\mathrm{CATE}(0).
\end{align*}

Similar arguments apply to the conditional ASE.
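As a numerical illustration of the weighted-average formula, the sketch below uses hypothetical values for $P(X_2 = 1)$ and the two conditional ATEs. The simulation check additionally assumes, purely for simplicity, that every individual's effect equals the conditional ATE of their $X_2$ group.

```r
# Hypothetical inputs (assumptions for illustration)
p1 <- 0.3                       # P(X2 = 1)
cate <- c(x2_0 = 2, x2_1 = 5)   # CATE(0) and CATE(1)

# ATE as the probability-weighted mean of the conditional ATEs
ate <- p1 * cate[["x2_1"]] + (1 - p1) * cate[["x2_0"]]

# Check by simulation: draw X2, then give each individual the effect CATE(X2)
set.seed(2)
X2 <- rbinom(100000, 1, p1)
effect <- ifelse(X2 == 1, cate[["x2_1"]], cate[["x2_0"]])

print(c(formula = ate, simulated = mean(effect)))
```

The formula gives (0.3)(5) + (0.7)(2) = 2.9, and the simulated mean effect should be close to that value.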
9.6 Collider Bias
Although OVB shows the risk of omitting certain types of variables (confounders), other types of variables actually should be omitted; otherwise, there is a different type of (asymptotic) bias.
A collider or common outcome is a variable on which both $X$ and $Y$ have a causal effect. (Whereas a confounder has a causal effect on both $X$ and $Y$.) For example, imagine you want to learn the effect of a firm's ownership structure (say $X = 1$ for family-owned, $X = 0$ otherwise) on its research and development expenditure $Y$. Both $X$ and $Y$ affect the firm's performance $Z$, so $Z$ is a collider.
Including a collider as a regressor causes collider bias when estimating a causal relationship. This is not as intuitive as OVB, but consider the following example.¹
Imagine you're interested in the causal effect of eating falafel or salad on having the flu (which is zero effect), and you have a sample of 200 individuals. You randomly assigned 100 people to eat falafel for lunch and 100 salad; a few hours later, you test each for flu (assume there is no testing error). Let $Y = 1$ if somebody has the flu (otherwise $Y = 0$), and $X = 1$ if somebody ate falafel for lunch ($X = 0$ if salad). Let $Z = 1$ if the individual has a fever (otherwise $Z = 0$). Sadly, the salad had some romaine contaminated with E. coli, so 40% of those who ate salad got a fever from the E. coli, unrelated to whether or not they had the flu. Among individuals with flu, 90% have a fever but 10% don't.
Table 9.1: Counts in falafel/salad/flu example.

                 All              Fever            No fever
             Flu   No flu      Flu   No flu      Flu   No flu
  Falafel     50     50         45      0          5     50
  Salad       50     50         47     20          3     30
Table 9.1 shows the number of individuals in different categories. Overall, there is no relationship between lunch and flu, so the flu rate is the same in the falafel and salad groups. To make the numbers easier, the overall flu rate is 50% (100/200 overall, 50/100 in each group). Since nobody who ate falafel got E. coli, the only reason for fever is the flu, which has a 90% fever rate. Thus, among the 50 with flu who ate falafel, (50)(0.9) = 45 have a fever and 5 do not. This entirely explains the Falafel row. In the Salad row, given the statistical independence of flu (probability 0.5) and E. coli (probability 0.4), the probability of having neither is

\begin{align*}
P(\text{not flu and not E. coli}) &= P(\text{not flu})\, P(\text{not E. coli})
 = [1 - \overbrace{P(\text{flu})}^{0.5}]\,[1 - \overbrace{P(\text{E. coli})}^{0.4}] \\
 &= (0.5)(0.6) = 0.3,
\end{align*}
hence (100)(0.3) = 30 salad-eaters who have neither flu nor E. coli, and thus no fever. This explains the "No fever, No flu" entry of 30 in the Salad row. Similarly,

\begin{align*}
P(\text{flu, not E. coli}) &= (0.5)(0.6) = 0.3 \quad (\text{30 people}), \\
P(\text{flu, E. coli}) &= (0.5)(0.4) = 0.2 \quad (\text{20 people}), \\
P(\text{not flu, E. coli}) &= (0.5)(0.4) = 0.2 \quad (\text{20 people}).
\end{align*}

The "not flu and E. coli" are the 20 individuals who have a fever (from the E. coli) but not flu. The 20 with both flu and E. coli all have a fever due to E. coli. Among the 30 with flu but not E. coli, 90% have a fever, i.e., (30)(0.9) = 27 have a fever, so 3 do not. This 3 is the "No fever, Flu" entry in the Salad row. The 27 combine with the 20 who had both illnesses to make 47 who have both flu and a fever in the Salad row.

¹ Modified from https://doi.org/10.1093/ije/dyp334
If we regress $Y$ (flu) on $X$ (food), then we correctly estimate zero effect, but if we also use $Z$ (fever), then we incorrectly estimate a non-zero effect. If we only look at the "no fever" group, then there is (appropriately) zero difference: the flu rate for the falafel eaters is 5/55 = 1/11, identical to the 3/33 = 1/11 for the salad eaters. Mathematically, these "rates" are estimates of the conditional mean of the binary $Y$ flu variable, e.g., $5/55 = \hat{E}(Y \mid \text{falafel}, \text{no fever})$, recalling $E(Y) = P(Y = 1)$ for binary $Y$. However, if we also look at the "fever" group, the flu rate is much higher in the falafel group. In fact, the falafel group's flu rate is 45/45 = 100%, whereas the salad group's flu rate is only 47/(47 + 20) ≈ 70%, substantially lower. Mathematically,

\begin{equation}
\begin{aligned}
\overbrace{\hat{E}(Y \mid X = 1, Z = 0)}^{5/55} - \overbrace{\hat{E}(Y \mid X = 0, Z = 0)}^{3/33} &= 0, \\
\overbrace{\hat{E}(Y \mid X = 1, Z = 1)}^{45/45} - \overbrace{\hat{E}(Y \mid X = 0, Z = 1)}^{47/67} &= 0.30.
\end{aligned}
\tag{9.30}
\end{equation}

This suggests eating falafel causes flu, but this incorrect conclusion is entirely collider bias.
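The flu rates above can be recomputed directly from the counts in Table 9.1. The following R sketch enters those counts by hand; the data frame layout and the helper name `flu_rate` are choices made here for illustration.

```r
# Counts from Table 9.1 (X = 1 falafel, X = 0 salad; Z = 1 fever; Y = 1 flu)
tab <- data.frame(
  X     = c(1, 1, 1, 1, 0, 0, 0, 0),
  Z     = c(1, 1, 0, 0, 1, 1, 0, 0),
  Y     = c(1, 0, 1, 0, 1, 0, 1, 0),
  count = c(45, 0, 5, 50, 47, 20, 3, 30)
)

# Flu rate within a subset of cells: (people with flu) / (all people)
flu_rate <- function(rows) {
  sum(tab$count[rows] * tab$Y[rows]) / sum(tab$count[rows])
}

# Without conditioning on fever: falafel minus salad
overall_diff <- flu_rate(tab$X == 1) - flu_rate(tab$X == 0)
# Conditioning on the collider Z (fever)
nofever_diff <- flu_rate(tab$X == 1 & tab$Z == 0) - flu_rate(tab$X == 0 & tab$Z == 0)
fever_diff   <- flu_rate(tab$X == 1 & tab$Z == 1) - flu_rate(tab$X == 0 & tab$Z == 1)

print(round(c(overall = overall_diff, no_fever = nofever_diff, fever = fever_diff), 2))
```

The overall and no-fever differences are zero, while the fever-group difference is 1 − 47/67, roughly 0.30, matching (9.30).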
9.7 Causal Identification: Difference-in-Differences

⇒ Kaplan video: Diff-in-Diff Intuition
If $X_1$ is a treatment indicator and $X_2$ is a time period indicator, then the fully saturated model with two binary regressors is called a difference-in-differences (diff-in-diff) model. This is a special case of (9.14), whose coefficients were interpreted in Section 9.3.

Below, the parameter $\beta_3$ from (9.14) is shown to have a certain causal interpretation under certain conditions.

The general setup is that some individuals (or firms, or cities, etc.) were exposed to some "treatment" like a training program or law or other policy. The treatment wasn't randomized, but there's a group of untreated individuals whose outcomes can be used to form a counterfactual: what's the mean outcome of treated individuals in the parallel universe where they weren't treated?

Such setups are sometimes called natural experiments or quasi-experiments (see also Section 4.3.2). Since they weren't fully randomized experiments, it's invalid to simply compare treated and untreated outcomes, as seen in Section 9.7.1. However, there is enough randomness that a valid comparison can be found with some additional work (like diff-in-diff).
For example, maybe $Y$ is annual labor income, and we are interested in the effect of minimum wage. Imagine our city recently implemented a large minimum wage increase. The goal is to learn the effect of this particular minimum wage increase on $Y$ (income) for individuals in our city. Notationally, $X_1 = 1$ if the individual lives in our city (and $X_1 = 0$ otherwise), and $X_2 = 1$ if the observation is from the year after the minimum wage increase (and $X_2 = 0$ if before the increase).

Notationally, $X_1 = 1$ is the "treated group" and $X_1 = 0$ the "untreated group"; $X_2 = 0$ is the time period "before" treatment and $X_2 = 1$ is "after."
9.7.1 Bad Approaches
Discussion Question 9.3 (bad panel approach 1 for Mariel boatlift). Consider the basic setup from Card (1990). Due to a seemingly random/exogenous political decision, Cubans were temporarily permitted to immigrate to the US for a few months in 1980. About half settled in Miami, FL, while the other half went to live in other cities around the US. We could compare wages of native-born workers in Miami in 1979 (before boatlift) and 1981 (after). Explain why this change in average wage would not be a good estimate of the average treatment effect of the Mariel boatlift on native worker wage. (Hint: are 1979 Miami and 1981 Miami the same except for how many Cubans live there, or might something else have changed?)
Discussion Question 9.4 (bad panel approach 2 for Mariel boatlift). Consider the same setup as in DQ 9.3. But now compare 1981 wages of native workers in Miami and Houston, TX, a city that did not receive a large influx of Cuban immigrants in 1980. Explain why this difference (Miami minus Houston) in average wage would not be a good estimate of the average treatment effect of the Mariel boatlift on native worker wage. (Hint: are 1981 Miami and Houston the same except for how many Cubans live there, or might there be other differences between the cities that might cause omitted variable bias?)
Discussion Question 9.5 (bad panel approach 1 for fracking). Discussion Questions 9.5 and 9.6 are based loosely on the setting of Street (2018), who uses much better approaches. For counties in North Dakota, let $Y$ denote crime rate. Consider the average crime rate in counties that started fracking activity, before and after the fracking started. (Fracking was a new technology that allowed extraction of certain underground oil and natural gas reserves that were previously infeasible or unprofitable to extract.) Explain why this change in average crime rate would not be a good estimate of the average treatment effect of the fracking activity on crime rate.
Discussion Question 9.6 (bad panel approach 2 for fracking). Consider the same setup as in DQ 9.5, but now compare the "after" crime rates in North Dakota counties with fracking to those without fracking. Explain why this difference (fracking minus non-fracking) in average crime rate would not be a good estimate of the average treatment effect of fracking on crime rate.
Continuing the minimum wage example, one bad approach is to use only data from our city, before and after the minimum wage increase. That is, we could try to estimate $E(Y \mid X_2 = 1, X_1 = 1) - E(Y \mid X_2 = 0, X_1 = 1)$. However, coincidentally, there may have been a national (or global) recession right after the minimum wage law was passed. This may make everybody's income lower in the year after. It would look like the minimum wage hurt incomes, but really it was the recession. Alternatively, there may have been great national (macroeconomic) conditions that made incomes go up, which would make us incorrectly conclude that the law increased incomes greatly. There is almost always OVB with such before vs. after comparisons, which invalidates causal interpretation.
Another bad approach is to compare incomes in our city and another city in the year after our law passed. By using the other city as a sort of control group, we avoid the problem of misinterpreting macroeconomic changes as treatment effects. However, it's hard to know which other city to pick. We could pick one that has the same population, for example, but our city may still have much higher (or lower) income for reasons other than our minimum wage. For example, San Francisco and Columbus, OH, have very similar populations, but they have (and have for a while had) very different incomes.
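Both problems can be seen in a toy calculation. The sketch below uses entirely hypothetical mean incomes (in thousands of dollars): a true effect of +2, a recession that lowers both cities' incomes by 3, and a permanent gap of 5 between the two cities.

```r
# Hypothetical mean incomes (thousands of dollars); all values are assumptions
true_effect <- 2      # effect of the minimum wage increase in our city
recession   <- -3     # change over time common to both cities
m00 <- 20                              # other city, before
m01 <- m00 + recession                 # other city, after
m10 <- 25                              # our city, before
m11 <- m10 + recession + true_effect   # our city, after

before_after  <- m11 - m10   # mixes the effect with the recession
cross_section <- m11 - m01   # mixes the effect with the permanent city gap

print(c(true = true_effect, before_after = before_after,
        cross_section = cross_section))
```

With these numbers the before vs. after comparison gives −1 (the wrong sign) and the cross-city comparison gives 7 (far too large), even though the true effect is 2.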
9.7.2 Counterfactuals and Parallel Trends
The difference-in-differences idea is to combine the before vs. after comparison with the treated vs. untreated comparison.

Conceptually, the goal is to construct a counterfactual, like what our city's mean income would have been if there were not a minimum wage increase. Thinking of the potential outcomes framework, the counterfactual is the parallel universe where the treatment never happened.
The key identifying assumption is called parallel trends. Conceptually, in the running example, parallel trends says that without the minimum wage law, our city's mean income would have increased by exactly the same amount as the other city's mean income. Mathematically, with $m(x_1, x_2) \equiv E(Y \mid X_1 = x_1, X_2 = x_2)$, the other city's mean income increase (i.e., "after" minus "before") is

\begin{equation}
m(0, 1) - m(0, 0) = E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0). \tag{9.31}
\end{equation}
Parallel trends assumes that adding this increase to the "before" mean income in our city, $m(1, 0) = E(Y \mid X_1 = 1, X_2 = 0)$, gives us the counterfactual income for our city in the "after" time period.
Given parallel trends, we can learn about causality by comparing

\begin{equation}
\overbrace{E(Y \mid X_1 = 1, X_2 = 1)}^{\text{actual (our city after)}}
\quad\text{vs.}\quad
\overbrace{\underbrace{E(Y \mid X_1 = 1, X_2 = 0)}_{\text{our city before}}
 + \underbrace{E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0)}_{\text{increase in other city over time}}}^{\text{counterfactual}},
\tag{9.32}
\end{equation}

\begin{equation*}
\overbrace{m(1, 1)}^{\text{actual}} - \overbrace{\{m(1, 0) + [m(0, 1) - m(0, 0)]\}}^{\text{counterfactual}}
 = \overbrace{[m(1, 1) - m(1, 0)] - [m(0, 1) - m(0, 0)]}^{\beta_3\text{ in (9.14)}}.
\end{equation*}
Figure 9.1 visualizes this effect. We can think of constructing the counterfactual outcome and then subtracting it from the actual outcome $m(1, 1)$, or we can think of taking the before/after difference for our city, $m(1, 1) - m(1, 0)$, and subtracting off the before/after difference in the other city, $m(0, 1) - m(0, 0)$.
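[Figure 9.1: Difference-in-differences. Horizontal axis: before, after. The other city's mean moves from m(0,0) to m(0,1). The "treated" city's mean moves from m(1,0) to the actual m(1,1), which is compared with the counterfactual m(1,0) + [m(0,1) − m(0,0)]. Diff-in-diff = m(1,1) − {m(1,0) + [m(0,1) − m(0,0)]} = [m(1,1) − m(1,0)] − [m(0,1) − m(0,0)].]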
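The two ways of thinking about the diff-in-diff calculation can be written as a short R function. This is only a sketch: the function name `did` and the four group means passed to it are hypothetical.

```r
# Diff-in-diff from four group means m(x1, x2), under parallel trends
did <- function(m00, m01, m10, m11) {
  # Counterfactual: our city's "after" mean had it not been treated
  counterfactual <- m10 + (m01 - m00)
  c(counterfactual = counterfactual,
    actual_minus_counterfactual = m11 - counterfactual,
    diff_in_diff = (m11 - m10) - (m01 - m00))
}

# Hypothetical means: other city falls from 20 to 17, our city from 25 to 24
res <- did(m00 = 20, m01 = 17, m10 = 25, m11 = 24)
print(res)
```

Here the counterfactual is 25 + (17 − 20) = 22, so both calculations give 24 − 22 = 2.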
Discussion Question 9.7 (parallel trends skepticism). Consider US state traffic fatality (i.e., car accident death) rates ($Y$), where the year 1980 is "before" ($X_2 = 0$) and 1990 is "after" ($X_2 = 1$). Consider states that adopt a 0.08 blood alcohol content (BAC) limit law sometime between 1980 and 1990 ($X_1 = 1$), and states that never have such a law ($X_1 = 0$). Explain why you might doubt the parallel trends assumption. Hint 1: is a BAC law the only way states try to reduce fatal accidents? Hint 2: this is more difficult than simply thinking of an omitted variable that would cause OVB in a cross-sectional regression, because parallel trends allows certain types of such omitted variables.
9.7.3 Identification

Population Object of Interest: ATT
Most fundamentally, the difference-in-differences approach only learns the average treatment effect for the group that was actually treated (in our universe). This is called the average treatment effect on the treated (ATT) (or sometimes ATTE or ATET). Mathematically, ATE meant $E(Y^1 - Y^0)$, where $Y^1$ and $Y^0$ are the treated and untreated potential outcomes, respectively (previously $Y^T$ and $Y^C$). ATT is the same but for the subpopulation who was actually treated in our universe. Since $X_1 = 1$ if somebody is actually treated, the ATT is

\begin{equation}
\mathrm{ATT} \equiv E(Y^1 - Y^0 \mid X_1 = 1). \tag{9.33}
\end{equation}
It's possible but uncommon that ATT = ATE. For example, maybe there are different demographics in our city than the comparison city, or different levels of unionization, or different other labor laws, or different industry mix, so the minimum wage effect is different in our city ($X_1 = 1$) than elsewhere. (This is essentially a question of external validity; see Chapter 12.)
Identification of ATT

Parallel trends is sufficient to identify the counterfactual. In potential outcomes notation, "parallel trends" is

\begin{equation}
\begin{aligned}
& E(Y^0 \mid X_1 = 1, X_2 = 1) - E(Y^0 \mid X_1 = 1, X_2 = 0) \\
&\quad = E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0).
\end{aligned}
\tag{9.34}
\end{equation}
That is, the mean untreated potential outcome changes over time ($X_2 = 0$ to $X_2 = 1$) by the same amount in the treated ($X_1 = 1$) and untreated ($X_1 = 0$) groups. The term $E(Y^0 \mid X_1 = 1, X_2 = 1)$ is the counterfactual, like our city's mean wage in the "after" period in the parallel universe where minimum wage never increased. In the other three terms, $Y^0 = Y$, i.e., the untreated $Y^0$ is the observed $Y$. Only when $X_1 = X_2 = 1$ is the treated $Y^1$ observed, $Y = Y^1$. Thus, the counterfactual can be written uniquely in terms of the joint distribution of $(Y, X_1, X_2)$:
\begin{align}
E(Y^0 \mid X_1 = 1, X_2 = 1)
 &= E(Y^0 \mid X_1 = 1, X_2 = 0) + [E(Y^0 \mid X_1 = 0, X_2 = 1) - E(Y^0 \mid X_1 = 0, X_2 = 0)] \notag \\
 &= E(Y \mid X_1 = 1, X_2 = 0) + [E(Y \mid X_1 = 0, X_2 = 1) - E(Y \mid X_1 = 0, X_2 = 0)] \notag \\
 &= m(1, 0) + [m(0, 1) - m(0, 0)]. \tag{9.35}
\end{align}
Because the counterfactual is identified, so is the ATT. Specifically, the ATT equals $\beta_3$ in the fully saturated CEF model (9.14):
\begin{align*}
\mathrm{ATT} &= E(Y^1 - Y^0 \mid X_1 = 1, X_2 = 1) \\
 &= \overbrace{E(Y^1 \mid X_1 = 1, X_2 = 1)}^{Y^1 = Y\text{ since }X_1 = 1,\ X_2 = 1}
  - \overbrace{E(Y^0 \mid X_1 = 1, X_2 = 1)}^{\text{use counterfactual from (9.35)}} \\
 &= E(Y \mid X_1 = 1, X_2 = 1) - \overbrace{\{m(1, 0) + [m(0, 1) - m(0, 0)]\}}^{\text{counterfactual}} \\
 &= m(1, 1) - \{m(1, 0) + [m(0, 1) - m(0, 0)]\} \\
 &= [m(1, 1) - m(1, 0)] - [m(0, 1) - m(0, 0)] \\
 &= \beta_3 .
\end{align*}
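A simulation can illustrate the result that the interaction coefficient recovers the ATT (not the ATE) under parallel trends. Everything in this sketch is an assumption chosen for illustration: the sample size, the group gap (3), the common time trend (2), and mean treatment effects of 4 in the treated group and 1 in the untreated group. Each observation is treated as one individual in one period.

```r
set.seed(3)
n <- 200000
X1 <- rbinom(n, 1, 0.5)   # treated-group indicator
X2 <- rbinom(n, 1, 0.5)   # "after" period indicator

# Untreated potential outcome: group gap plus a common time trend,
# so parallel trends (9.34) holds by construction
Y0 <- 10 + 3 * X1 + 2 * X2 + rnorm(n)
# Treated potential outcome: mean effect 4 if X1 = 1, mean effect 1 if X1 = 0
Y1 <- Y0 + ifelse(X1 == 1, 4, 1) + rnorm(n, sd = 0.5)
# Observed outcome: Y1 only for the treated group in the "after" period
Y <- ifelse(X1 == 1 & X2 == 1, Y1, Y0)

b3  <- coef(lm(Y ~ X1 * X2))[["X1:X2"]]
att <- mean((Y1 - Y0)[X1 == 1 & X2 == 1])
ate <- mean(Y1 - Y0)
print(round(c(beta3_hat = b3, ATT = att, ATE = ate), 2))
```

The estimated interaction coefficient should be close to the ATT of 4, whereas the ATE in this design is about 2.5.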
Skepticism About Parallel Trends
In practice, the parallel trends condition may not hold, for various reasons. For example, maybe our city was experiencing fast wage growth whereas the comparison city was declining (maybe due to reliance on different industries). Maybe our city passed the minimum wage law partly because everybody's wages were increasing anyway. In that case, we can't tell whether our city's wages grew more than the other city's wages because of the minimum wage, or because of other factors (our industries were growing, theirs were declining, etc.).
Parallel trends is also a bit fragile since nonlinear functions of $Y$ change whether it's true or not. For example, if there are parallel trends when $Y$ is wage, then there are generally not parallel trends for log-wage, $\ln(Y)$. Similarly, if there are parallel log-wage trends, then the wage trends generally cannot be parallel.
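A small numerical example shows the fragility. The wages below are hypothetical, and to keep the arithmetic simple the sketch assumes everyone within a group and period earns the same wage, so that the mean of log-wage equals the log of the mean wage.

```r
# Hypothetical untreated wages (everyone in a cell assumed to earn the same)
w00 <- 10; w01 <- 16          # untreated group: before, after
w10 <- 15                     # treated group: before
w11_cf <- w10 + (w01 - w00)   # treated group's untreated "after" wage if
                              # trends are parallel in levels: 15 + 6 = 21

trend_levels <- c(untreated = w01 - w00, treated = w11_cf - w10)
trend_logs   <- c(untreated = log(w01) - log(w00),
                  treated   = log(w11_cf) - log(w10))

print(trend_levels)
print(round(trend_logs, 3))
```

The level trends are both 6, but the log trends are about 0.470 and 0.336, so parallel trends in wage does not carry over to log-wage in this example.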
In the data, you can try to see if parallel trends seems plausible, but it is not directly testable. Specifically, "pre-trend analysis" compares trends for a few periods before the treatment takes place. But even if the trends were parallel before, it does not mean for sure that the trends would have remained parallel after the treatment year. We can never know because the "trend" refers to the treated group's untreated potential outcomes, which by definition are not observed. So, there is no empirical test that can replace careful, critical thought.
9.7.4 Extensions

There are many interesting extensions of the basic diff-in-diff idea, although all are beyond our scope. For example, there are related models that allow additional regressors, or more time periods, or quantile treatment effects.
9.8 Estimation and Inference

⇒ Kaplan video: Difference-in-Differences Example
Since (9.14) is just a special case of a regression model, standard regression techniques and R functions can be used. For estimation, OLS consistently estimates each $\beta_j$ under fairly general conditions; remember to use sample/survey weights if they are available in the data. The same heteroskedasticity-robust methods from earlier (like Section 7.7.3) can be used to compute confidence intervals if sampling is iid.

The following code shows different R syntax to get the same coefficient estimates, with simulated data. The notation X1:X2 is the interaction term (or in the output, its coefficient). Heteroskedasticity-robust CIs are also reported.
    library(sandwich); library(lmtest)
    n <- 48
    set.seed(112358)
    m00 <- 10; m10 <- 15; m01 <- 16; m11 <- 25
    df <- data.frame(X1=c(rep(0,n/2),rep(1,n/2)),
                     X2=rep(rep(0:1,each=n/4),times=2))
    df$Y <- c(rep(m00,n/4),rep(m01,n/4),
              rep(m10,n/4),rep(m11,n/4) ) + rnorm(n)
    # Three equivalent estimates
    ret1 <- lm(Y~X1*X2, data=df)
    ret2 <- lm(Y~X1+X2+X1:X2, data=df)
    df$Xint <- df$X1*df$X2
    ret3 <- lm(Y~X1+X2+Xint, data=df)
    TrueBetas <- c(m00,m10-m00,m01-m00,(m11-m01)-(m10-m00))
    retmat <- rbind(coef(ret1),coef(ret2),coef(ret3),TrueBetas)
    rownames(retmat) <- c('est1','est2','est3','true')
    print(round(retmat, digits=2))

    ##      (Intercept)   X1   X2 X1:X2
    ## est1          10 5.17 6.12  3.73
    ## est2          10 5.17 6.12  3.73
    ## est3          10 5.17 6.12  3.73
    ## true          10 5.00 6.00  4.00

    round(coefci(ret1, vcov.=vcovHC(ret1,type='HC1')),digits=2)

    ##             2.5 % 97.5 %
    ## (Intercept)  9.30  10.77
    ## X1           4.31   6.04
    ## X2           4.97   7.27
    ## X1:X2        2.20   5.27
Empirical Exercises
Empirical Exercise EE9.1. You will analyze data on driving laws and fatal accident rates, originally from Freeman (2007). In particular, you'll compare weekend driving fatality (death) rates for states that adopted a 0.08 blood alcohol content (BAC) law and states that didn't, comparing rates before and after the law adoption. Standard errors can be smaller if the full dataset is used, but such methods are beyond our scope. Either way, the difference-in-differences approach is probably not identifying a treatment effect: probably states that adopted such laws also adopted other ways to discourage drunk driving, whether official laws or just changing cultural norms. This violates the parallel trends assumption.
a. R only: load the needed packages (and install them before that if necessary) and look at a description of the dataset:
    library(wooldridge); library(sandwich); library(lmtest)
    ?driving

b. Stata only: load the data with
    use http://faculty.missouri.edu/kaplandm/intro_text/driving , clear

c. Keep only years 1980 and 1990.
    R: df <- driving[driving$year==1980 | driving$year==1990, ]
    Stata: keep if year==1980 | year==1990

d. Create a dummy variable for the "after" period (year 1990).
    R: df$after <- (df$year==1990)
    Stata: generate after = (year==1990)

e. Create variable bac equal to 1 (or TRUE) if there's any BAC law that year.
    R: df$bac <- (df$bac08+df$bac10>=1)
    Stata: generate bac = (bac08 + bac10 >= 1)

f. Drop states that already had a BAC law in the "before" period (1980), leaving only states that never had the law or adopted it between 1980 and 1990.
    R: get a list of the states to drop, and then remove them:
        dropst <- unique(df$state[!df$after & df$bac])
        df <- df[!(df$state %in% dropst), ]
    Stata:
        generate dropflag = (!after & bac)
        bysort state: egen dropst = max(dropflag)
        drop if dropst
g. Create a treatment dummy equal to 1 for states that adopted a BAC law by 1990.
    R:
        treatst <- unique(df$state[df$bac])
        df$treat <- (df$state %in% treatst)
    Stata: bysort state: egen treat = max(bac)

h. Run a difference-in-differences regression with the intercept, "after" dummy, treatment dummy, and interaction term. Below, the * in R and the ## in Stata automatically generate the desired interaction term.
    R:
        ret <- lm(wkndfatrte~treat*after, data=df)
        coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
        coefci(  ret, vcov.=vcovHC(ret, type='HC1'))
    Stata: regress wkndfatrte treat##after , vce(robust)

i. To see how the OLS coefficient estimates relate to the conditional means (CEF estimates), compute the sample mean weekend driving fatality rate within each of the four groups defined by the time period and "treatment" status.
    R: (agg <- aggregate(wkndfatrte~treat*after, data=df, FUN=mean))
    Stata: tabulate treat after , summarize(wkndfatrte) means missing

j. Display the CEF-based replication of the OLS estimates.
    R: for the first three coefficient estimates,
        c(agg[1,3], agg[2,3]-agg[1,3], agg[3,3]-agg[1,3])
    and to show both (equivalent) ways to compute the interaction coefficient estimate,
        c((agg[4,3]-agg[3,3])-(agg[2,3]-agg[1,3]), (agg[4,3]-agg[2,3])-(agg[3,3]-agg[1,3]))
    Stata:
        collapse (mean) wkndfatrte , by(treat after)
        display wkndfatrte[1]
        display wkndfatrte[3]-wkndfatrte[1]
        display wkndfatrte[2]-wkndfatrte[1]
        display (wkndfatrte[4]-wkndfatrte[3])-(wkndfatrte[2]-wkndfatrte[1])

k. Optional: repeat part (h) but with a different outcome variable to replace wkndfatrte, like the weekend fatalities per 100 million miles driven (instead of population), or the total fatality rate (not just weekends), etc.

l. Optional: repeat parts (e)–(h) but replacing your bac treatment variable created in part (e) with a treatment dummy equal to 1 if perse (a different driving law) equals 1 (and equal to 0 otherwise).
Empirical Exercise EE9.2. You will analyze wage data for different types of individuals from the 1976 Current Population Survey (conducted by the US Census Bureau). Specifically, you'll look at dummy variables for nonwhite (race) and female, as well as their interaction. The results are clearly not causal, but the interaction term shows (descriptively) the difference in the white/nonwhite wage gap for females compared to non-females, or (equivalently) the difference in the female/non-female wage gap for nonwhites compared to whites.
a. R only: load the needed packages (and install them before that if necessary) and look at a description of the dataset:
    library(wooldridge); library(sandwich); library(lmtest)
    ?wage1

b. Stata only: load the data with
    bcuse wage1 , nodesc clear
  (assuming bcuse is already installed).

c. Display the group mean wage for the four groups defined by the nonwhite and female dummy variables.
    R: (agg <- aggregate(wage~nonwhite*female, data=wage1, FUN=mean))
    Stata: tabulate female nonwhite , summarize(wage) means missing

d. Run a "difference-in-differences" type of regression with the intercept, non-white dummy, female dummy, and interaction term.
    R:
        ret <- lm(wage~nonwhite*female, data=wage1)
        coeftest(ret, vcov.=vcovHC(ret, type='HC1'))
        coefci(  ret, vcov.=vcovHC(ret, type='HC1'))
    Stata: regress wage female##nonwhite , vce(robust)

e. Compute the OLS coefficient estimates manually from the four conditional means.
    R: store the conditional means with
        m00 <- agg$wage[1]; m10 <- agg$wage[2]; m01 <- agg$wage[3]; m11 <- agg$wage[4]
    and show that you can replicate the OLS estimates with
        rbind(coef(ret), c(m00, m10-m00, m01-m00, (m11-m01)-(m10-m00)) )
    and also note that
        c( (m11-m01) - (m10-m00), (m11-m10) - (m01-m00) )
    shows the equivalence of the two interpretations of the interaction term coefficient.
    Stata: collapse the dataset to just the four conditional means with
        collapse (mean) wage , by(female nonwhite)
    and then display the manually calculated coefficient estimates with
        display wage[1]
        display wage[3]-wage[1]
        display wage[2]-wage[1]
        display (wage[4]-wage[3])-(wage[2]-wage[1])
        display (wage[4]-wage[2])-(wage[3]-wage[1])

f. Optional: repeat part (d) but using south instead of female.

g. Optional: repeat part (d) again with any two dummy variables of your choice; you may use one from a previous analysis as long as it is combined with a different dummy. The dataset comes with many dummy variables already, like nonwhite, female, south (and other regions), servocc (and other occupational fields and industries), and married, or you can create your own. For example, you can generate a "more than high school education" dummy with R code
        wage1$gtHS <- (wage1$educ>12)
    or Stata command
        generate gtHS = (educ>12)
Chapter 10
Regression with Multiple Regressors
⇒ Kaplan video: Chapter Introduction
Depends on: Chapters 8 and 9 (which depend on Chapters 2–4, 6, and 7)
Unit learning objectives for this chapter
10.1. Define new vocabulary words (in bold), both mathematically and intuitively [TLO 1]
10.2. Assess in a real-world example whether there is bias from omitted variables and whether a linear model seems realistic [TLOs 2 and 6]
10.3. Describe and interpret models with multiple regressors, including those in which two variables interact [TLO 3]
10.4. Judge which assumptions seem true and which interpretation seems most appropriate for real-world regressions [TLOs 2 and 6]
10.5. In R (or Stata): estimate a regression with multiple variables, along with measures of uncertainty, and judge economic and statistical significance [TLO 7]
Optional resources for this chapter
• James et al. (2013, §3.2)
• Hastie, Tibshirani, and Friedman (2009, §§2.3.1, 2.4, 3.1–3.2)
• Linear projection (theory): Hansen (2020, §7)
• Average structural effects and their identification: Hansen (2020, §2.30)
• Regression example (Masten video)
• Perfect multicollinearity (Lambert video)
• Imperfect multicollinearity example (Lambert video)
• Dummy coefficients (Lambert video)
• Dummy interactions (Lambert video)
• Continuous interactions (Lambert video)
• Sections 3.1 ("Multiple Regression in Practice") and 6.1.5 ("Interaction Terms") in Heiss (2016)
• Section 4.4 ("Reporting Regression Results") in Heiss (2016)
• Section 8.3 ("Interactions Between Independent Variables") in Hanck et al. (2018)
Allowing multiple regressors opens a multitude of combinations, especially when combined with nonlinear functions like those in Chapter 8. Most of Chapter 10 focuses on the different functional forms themselves, with the different types of flexibility they do (and don't) allow. These discussions apply equally to descriptive, predictive, and causal models.
10.1 Omitted Variable Bias
One motivation for this chapter is that omitted variable bias (OVB; Section 9.1) can still be a problem even if we include two regressors. We may need to include three, or even 10 or 100, regressors to avoid OVB. But even with 100 regressors, OVB can still be a big problem.
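Consider OVB with the linear structural model
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + V. \tag{10.1} \]
For OLS to consistently estimate \(\beta_j\) for \(j = 1, 2, 3\) (the slope coefficients) requires \(\mathrm{Cov}(X_j, V) = 0\) for \(j = 1, 2, 3\). Imagine this is true, but \(X_3\) is omitted, so
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U, \quad U \equiv \beta_3 X_3 + V. \tag{10.2} \]
In (10.2), OLS consistency for \(\beta_1\) and \(\beta_2\) requires \(\mathrm{Cov}(X_1, U) = \mathrm{Cov}(X_2, U) = 0\). Since
\[ \mathrm{Cov}(X_j, U) = \beta_3 \,\mathrm{Cov}(X_j, X_3) + \mathrm{Cov}(X_j, V) \tag{10.3} \]
and \(\mathrm{Cov}(X_j, V) = 0\) by assumption, this requires either \(\beta_3 = 0\) (i.e., \(X_3\) is not a causal determinant of \(Y\)) or else \(\mathrm{Cov}(X_1, X_3) = \mathrm{Cov}(X_2, X_3) = 0\).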
There are other mathematical formulations, but they all make the point that even including 100 regressors is not sufficient to avoid OVB if there is still an important omitted variable. That is, even if (10.2) becomes
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \cdots + U, \quad U \equiv \gamma Q + V, \tag{10.4} \]
then we still have OVB if \(\gamma \neq 0\) and any \(\mathrm{Cov}(X_j, Q) \neq 0\). That is, there is OVB if both of the following conditions hold.
OVB1′ The omitted variable is correlated with an included regressor in (10.4): \(\mathrm{Corr}(X_j, Q) \neq 0\) for some \(j\).
OVB2′ The omitted variable \(Q\) is a causal determinant of \(Y\): in (10.4), \(\gamma \neq 0\).
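The two conditions can be illustrated with a small simulation. The following is only a sketch with made-up values (the coefficients, sample size, and the 0.5 loading of X1 on Q are all assumptions chosen for illustration, not from any dataset). Both conditions hold for X1 by construction, so omitting Q shifts the X1 coefficient; in this particular design the shift is γ Cov(X1, Q)/Var(X1) = 2(0.5)/1.25 = 0.8, because X2 is independent of both X1 and Q.

```r
# Hypothetical simulated data: every number below is an illustrative assumption
set.seed(1)
n  <- 100000
Q  <- rnorm(n)              # the variable that will be omitted
X1 <- 0.5*Q + rnorm(n)      # correlated with Q, so OVB1' holds for X1
X2 <- rnorm(n)              # independent of Q and X1
V  <- rnorm(n)
gamma <- 2                  # Q affects Y, so OVB2' holds
Y  <- 1 + 1*X1 + 1*X2 + gamma*Q + V

b_long  <- coef(lm(Y ~ X1 + X2 + Q))  # regression that includes Q
b_short <- coef(lm(Y ~ X1 + X2))      # regression that omits Q
# Compare slope estimates: X1 is biased in the short regression, X2 is not
rbind(long  = b_long[c('X1','X2')],
      short = b_short[c('X1','X2')])
```

Setting gamma to 0, or replacing 0.5*Q with 0 in the line defining X1, removes one of the two conditions, and the two rows should then roughly agree.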
Discussion Question 10.1 (OVB with multiple regressors). Consider the example of California schools, where \(Y\) is a school's average standardized math test score for 5th-graders, \(X_1\) is the 5th-grade student-teacher ratio, and \(X_2\) is the percentage of 5th-graders who are English learners (non-native speakers). Judge whether a school's total expenditures per student satisfies each of Conditions OVB1′ and OVB2′ for OVB.
10.2 Linear-in-Variables Model
⇒ Kaplan video: Wage Regression Example
10.2.1 Model and Coefficient Interpretation
The linear-in-variables model and discussion from Section 9.2 naturally generalize to non-binary and/or more than two regressors. With \(J\) regressors \(X_1, X_2, \ldots, X_J\),
\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_J X_J + U = \beta_0 + \sum_{j=1}^{J} \beta_j X_j + U \equiv g(X_1, \ldots, X_J) + U. \tag{10.5} \]
If \(U\) is a CEF error, then \(g(\cdot)\) represents the CEF. However, the following discussion is essentially the same if \(U\) is a linear projection error and \(g(\cdot)\) is the linear projection, or if the \(\beta_j\) have a causal interpretation.
Regardless of interpretation, the coefficient \(\beta_j\) shows how the function \(g(\cdot)\) changes when \(X_j\) increases by one unit. This is true whether \(X_j\) is binary, discrete, or continuous. For example, \(X_1\) only appears in the \(\beta_1 X_1\) term, so if we change from \(X_1 = x_1\) to \(X_1 = x_1 + 1\) (unit increase), that term changes from \(\beta_1 x_1\) to \(\beta_1 (x_1 + 1) = \beta_1 x_1 + \beta_1\), a change of \(\beta_1\). That is, for any starting values \(X_1 = x_1\), \(X_2 = x_2\), etc., a unit increase in \(X_1\) changes the function by
\[ \begin{aligned} & g(x_1 + 1, x_2, \ldots, x_J) - g(x_1, x_2, \ldots, x_J) \\ &= \Bigl[\beta_0 + \beta_1 (x_1 + 1) + \sum_{j=2}^{J} \beta_j x_j\Bigr] - \Bigl[\beta_0 + \beta_1 x_1 + \sum_{j=2}^{J} \beta_j x_j\Bigr] = \beta_1 (x_1 + 1 - x_1) = \beta_1. \end{aligned} \tag{10.6} \]
For example, if \(Y\) is wage in $/hr, \(X_1\) is years of education, and \(\beta_1\) = ($5/hr)/yr, then each additional year of education is associated with a $5/hr change in wage, regardless of the initial education level or other variables like experience.
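More generally, if \(X_1\) changes by \(\Delta_1\) units, then the function's value changes by \(\beta_1 \Delta_1\). Regardless of the starting values, if \(X_1\) changes from \(x_1\) to \(x_1 + \Delta_1\), then similar to (10.6),
\[ \begin{aligned} & g(x_1 + \Delta_1, x_2, \ldots, x_J) - g(x_1, x_2, \ldots, x_J) \\ &= \Bigl[\beta_0 + \beta_1 (x_1 + \Delta_1) + \sum_{j=2}^{J} \beta_j x_j\Bigr] - \Bigl[\beta_0 + \beta_1 x_1 + \sum_{j=2}^{J} \beta_j x_j\Bigr] = \beta_1 (x_1 + \Delta_1 - x_1) = \beta_1 \Delta_1. \end{aligned} \tag{10.7} \]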
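The same property can be checked numerically on an estimated linear-in-variables function. The sketch below uses simulated data with assumed coefficient values, and the helper g() is a hypothetical wrapper around predict() written for this illustration. It compares the change in the fitted function from two different starting points with the estimated slope times the change in X1.

```r
# Hypothetical simulated data: coefficient values are illustrative assumptions
set.seed(2)
n  <- 200
df <- data.frame(X1=runif(n), X2=runif(n))
df$Y <- 1 + 2*df$X1 + 3*df$X2 + rnorm(n)
fit <- lm(Y ~ X1 + X2, data=df)

# Estimated function g(x1,x2): the fitted value at the given regressor values
g <- function(x1, x2) predict(fit, newdata=data.frame(X1=x1, X2=x2))

Delta1 <- 0.5
d_a <- g(0 + Delta1, 0) - g(0, 0)   # change starting from (x1,x2) = (0,0)
d_b <- g(3 + Delta1, 7) - g(3, 7)   # change starting from (x1,x2) = (3,7)
# All three numbers agree (up to floating-point rounding)
c(d_a, d_b, coef(fit)[['X1']] * Delta1)
```

The agreement is a feature of the linear-in-variables functional form itself, not evidence that the true relationship has this property.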
10.2.2 Limitations
While pleasingly simple, these formulas may not be realistic. That is, the change in \(Y\) may depend on not only \(\Delta_1\) but also the starting value \(x_1\) or the other \(x_j\).
For example, let \(Y\) be wage, \(X_1\) years of experience, and \(X_2\) years of education. Due to diminishing marginal benefits, perhaps the first years of experience are associated with bigger increases in mean wage than later years of experience. The wage increase associated with the change from \(X_1 = 0\) to \(X_1 = 1\) is probably larger than the increase from \(X_1 = 40\) to \(X_1 = 41\), even though \(\Delta_1 = 1\) in both cases. Further, the change from \(X_1 = 0\) to \(X_1 = 1\) may be associated with a larger wage increase for highly educated individuals (large \(X_2\)) than for less-educated individuals. Mathematically, the change depending on the starting value of \(X_1\) implies some nonlinearity in \(X_1\), and the dependence on the value of \(X_2\) implies some sort of interaction term(s).
Nonlinear and nonparametric functions of a single variable are discussed in Sections 8.2 and 8.3; interactions are discussed in Sections 9.3 and 10.3. Nonparametric models with multiple regressors are beyond our scope.
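To make the two limitations concrete, here is a sketch with a purely hypothetical mean wage function m() (its functional form and numbers are invented for illustration, not estimated from data). It has diminishing returns to experience through a square root, and the return to experience grows with education.

```r
# Hypothetical mean wage function (invented for illustration only):
# x1 = years of experience, x2 = years of education
m <- function(x1, x2) 5 + (1 + 0.2*x2) * sqrt(x1)

# Same Delta_1 = 1, but different starting experience (education fixed at 12)
early <- m(1, 12)  - m(0, 12)    # change from 0 to 1 year of experience
late  <- m(41, 12) - m(40, 12)   # change from 40 to 41 years of experience

# Same change from 0 to 1 year of experience, but different education
lowEd  <- m(1, 12) - m(0, 12)
highEd <- m(1, 16) - m(0, 16)

c(early=early, late=late, lowEd=lowEd, highEd=highEd)
```

Under this assumed m(), the early change exceeds the late one and the high-education change exceeds the low-education one, which a linear-in-variables model cannot reproduce because it assigns the same change \(\beta_1\) in all four cases.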
10.2.3 Code
The following code shows a simple linear-in-variables regression with simulated data. In the output, the row labeled X1 shows results for the corresponding slope coefficient \(\beta_1\). Specifically, the output shows the OLS estimate \(\hat\beta_1\) (Estimate), the heteroskedasticity-robust standard error estimate (Std. Error), and a 95% confidence interval (lower endpoint under 2.5 %, upper endpoint under 97.5 %). The rows for the other regressors are read the same way.
    library(sandwich); library(lmtest)
    set.seed(112358)
    n <- 50
    CEF <- function(x1,x2,x3) 1*x1+2*x2+3*x3
    df <- data.frame(X1=runif(n), X2=runif(n), X3=runif(n))
    df$Y <- CEF(df$X1, df$X2, df$X3) + rnorm(n)
    ret <- lm(Y~X1+X2+X3, data=df)
    retVC1 <- vcovHC(ret, type='HC1')
    round(cbind(coeftest(ret, vcov. = retVC1)[,1:2],
                coefci(ret, vcov. = retVC1)), digits=2)

                Estimate Std. Error 2.5 % 97.5 %
    (Intercept)    -0.44       0.47 -1.39   0.52
    X1              0.75       0.40 -0.06   1.57
    X2              1.77       0.49  0.78   2.76
    X3              4.04       0.47  3.10   4.99
10.3 Interaction Terms
⇒ Kaplan video: Interaction Model
⇒ Kaplan video: Wage Regression Example (again)
To start, imagine there are two regressors, one of which is binary. To help us remember which is which, let \(D\) (for "dummy") be the binary regressor (\(D = 1\) or \(D = 0\)), and \(X\) the other regressor. Assume \(X\) is the regressor of interest.
10.3.1 Limitation of Linear-in-Variables Model
With a linear-in-variables model,
\[ Y = g(X, D) + U, \quad g(X, D) = \beta_0 + \beta_1 X + \beta_2 D. \tag{10.8} \]
A unit increase in \(X\) always changes the function \(g(X, D)\) by \(\beta_1\) units, regardless of the starting value of \(X\) or the value of \(D\). As discussed in Section 10.2, this is often unrealistic.
Since \(D\) has only two possible values, we can plug them each into \(g(X, D)\):
\[ g(X, 0) = \beta_0 + \beta_1 X, \tag{10.9} \]
\[ g(X, 1) = \beta_0 + \beta_1 X + (\beta_2)(1) = \overbrace{(\beta_0 + \beta_2)}^{\text{intercept}} + \beta_1 X. \tag{10.10} \]
These are two functions of \(X\): one when \(D = 0\), one when \(D = 1\). They have the same slope (\(\beta_1\)) but different intercepts (\(\beta_0\) and \(\beta_0 + \beta_2\)).
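The "same slope, different intercepts" structure can be seen directly in an estimated model. The sketch below again uses simulated data with assumed coefficient values. It builds the two fitted lines from the OLS estimates and checks that the vertical gap between them is the same at every value of X.

```r
# Hypothetical simulated data: coefficient values are illustrative assumptions
set.seed(3)
n <- 500
D <- rbinom(n, 1, 0.5)       # binary (dummy) regressor
X <- runif(n, 0, 10)         # regressor of interest
Y <- 1 + 2*X + 3*D + rnorm(n)
fit <- lm(Y ~ X + D)         # linear-in-variables model
b <- coef(fit)

# Intercept and slope of the fitted line for each value of D
line0 <- c(intercept=b[['(Intercept)']],            slope=b[['X']])
line1 <- c(intercept=b[['(Intercept)']] + b[['D']], slope=b[['X']])
rbind(D0=line0, D1=line1)

# Vertical gap between the two fitted lines at several X values:
# always equal to the estimated coefficient on D
xs  <- c(0, 5, 10)
gap <- predict(fit, newdata=data.frame(X=xs, D=1)) -
       predict(fit, newdata=data.frame(X=xs, D=0))
gap
```

Because the model contains no product of D and X, the two fitted lines are parallel by construction, whatever the data look like.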
10.3.2 Interpretation of Interaction Term
To allow both the intercept and slope to differ between \(g(X, 0)\) and \(g(X, 1)\), an interaction term can be used, specifically the product \(DX\). Mathematically, adding this term to (10.8),
\[ g(X, D) = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 DX. \tag{10.11} \]