1/59: Topic 1.2 – Extensions of the Linear Regression Model

Microeconometric Modeling

William Greene, Stern School of Business, New York University, New York NY USA

1.2 Extensions of the Linear Regression Model


Concepts

• Multiple Imputation
• Robust Covariance Matrices
• Bootstrap
• Maximum Likelihood
• Method of Moments
• Estimating Individual Outcomes

Models

• Linear Regression Model
• Quantile Regression
• Stochastic Frontier


Multiple Imputation for Missing Data


Imputed Covariance Matrix


Implementation

SAS, Stata: Create full data sets with imputed values inserted. M = 5 is the familiar standard number of imputed data sets. Data are replicated and redistributed.
• SAS: Standard procedure and code distributed.
• Stata: Elaborate imputation equations, M = 5.

NLOGIT:
• Create an internal map of the missing values and a set of engines for filling missing values.
• Loop through imputed data sets during estimation.
• M may be arbitrary; memory usage and data storage are independent of M.
• Data may be replicated.


Regression with Conventional Standard Errors


Robust Covariance Matrices

Robust standard errors, not estimates

Robust to: Heteroscedasticity

Not robust to (all considered later):
• Correlation across observations
• Individual unobserved heterogeneity
• Incorrect model specification

‘Robust inference’ means hypothesis tests and confidence intervals using robust covariance matrices

The White Estimator

$$\text{Est.Var}[\mathbf{b}] = (\mathbf{X}'\mathbf{X})^{-1}\left[\sum_i e_i^2\,\mathbf{x}_i\mathbf{x}_i'\right](\mathbf{X}'\mathbf{X})^{-1}$$
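The White estimator, Est.Var[b] = (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1, can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the function name `white_covariance` and the simulated heteroscedastic data are assumptions made for the example, and no finite-sample correction is applied.

```python
import numpy as np

def white_covariance(X, y):
    """Return OLS b, the conventional covariance, and the White covariance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b                          # OLS residuals
    # Conventional estimator: s^2 (X'X)^-1
    s2 = e @ e / (n - k)
    V_conv = s2 * XtX_inv
    # White estimator: (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1
    meat = (X * (e ** 2)[:, None]).T @ X
    V_white = XtX_inv @ meat @ XtX_inv
    return b, V_conv, V_white

# Hypothetical data: the disturbance standard deviation grows with x.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, x)

b, V_conv, V_white = white_covariance(X, y)
print("b:", b)
print("conventional s.e.:", np.sqrt(np.diag(V_conv)))
print("White s.e.:       ", np.sqrt(np.diag(V_white)))
```

The coefficient vector is the same in both cases; only the standard errors change, which is the sense of "robust standard errors, not estimates."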


A Robust Covariance Matrix

Uncorrected


Bootstrap Estimation of the Asymptotic Variance of an Estimator

Known form of asymptotic variance: Compute from known results

Unknown form, known generalities about properties: Use bootstrapping
• Root N consistency
• Sampling conditions amenable to central limit theorems
• Compute by resampling mechanism within the sample.


Bootstrapping Algorithm

1. Estimate parameters using full sample: b
2. Repeat R times: Draw n observations from the n, with replacement. Estimate β with b(r).
3. Estimate variance with

$$\mathbf{V} = \frac{1}{R}\sum_{r=1}^{R}[\mathbf{b}(r) - \mathbf{b}][\mathbf{b}(r) - \mathbf{b}]'$$

(Some use mean of replications instead of b. Advocated (without motivation) by original designers of the method.)
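The three steps of the bootstrapping algorithm can be sketched as follows for a least squares estimator. This is a minimal sketch under stated assumptions: the helper names `ols` and `bootstrap_variance`, the `center` option, and the simulated data are all hypothetical choices for illustration.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_variance(X, y, R=200, seed=0, center="full"):
    """Bootstrap estimate of the covariance matrix of the OLS estimator."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    b = ols(X, y)                               # Step 1: full-sample estimate
    reps = np.empty((R, k))
    for r in range(R):                          # Step 2: R replications
        idx = rng.integers(0, n, size=n)        # n draws with replacement
        reps[r] = ols(X[idx], y[idx])
    # Step 3: center on b (default) or on the mean of the replications
    c = b if center == "full" else reps.mean(axis=0)
    d = reps - c
    V = d.T @ d / R
    return b, V, reps

# Hypothetical data for illustration
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

b, V, reps = bootstrap_variance(X, y, R=200)
print("bootstrap s.e.:", np.sqrt(np.diag(V)))
```

Passing `center="mean"` gives the variant that uses the mean of the replications instead of b.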


Application: Correlation between Age and Education


Bootstrapped Regression


Bootstrap Replications


Bootstrapped Confidence Intervals

Estimate $\text{Norm}(\boldsymbol\beta) = (\beta_1^2 + \beta_2^2 + \beta_3^2 + \beta_4^2)^{1/2}$


Quantile Regression

$Q(y|\mathbf{x},\tau) = \boldsymbol\beta_\tau'\mathbf{x}$, $\tau$ = quantile
• Estimated by linear programming
• $Q(y|\mathbf{x},.50) = \boldsymbol\beta_{.50}'\mathbf{x}$: median regression
• Median regression estimated by LAD (estimates same parameters as mean regression if symmetric conditional distribution)

Why use quantile (median) regression?
• Semiparametric
• Robust to some extensions (heteroscedasticity?)
• Complete characterization of conditional distribution
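A quantile regression can be written as a linear program by splitting each residual into positive and negative parts and minimizing the weighted sum tau*u_plus + (1 - tau)*u_minus. The sketch below uses SciPy's `linprog`; the function name `quantile_regression`, the symbol `tau` for the quantile, and the simulated data are assumptions for illustration, and this dense formulation is only practical for small samples.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau=0.5):
    """Quantile regression coefficients via linear programming.

    Variables: [beta (k, free), u_plus (n, >= 0), u_minus (n, >= 0)]
    Constraint: X beta + u_plus - u_minus = y
    Objective:  tau * sum(u_plus) + (1 - tau) * sum(u_minus)
    """
    n, k = X.shape
    c = np.concatenate([np.zeros(k), tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:k]

# Hypothetical data with heavy-tailed disturbances
rng = np.random.default_rng(2)
n = 201
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)

for tau in (0.25, 0.50, 0.75):
    print(tau, quantile_regression(X, y, tau))
```

With tau = .50 the objective is proportional to the sum of absolute deviations, so the result is the LAD (median regression) estimator.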


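Asymptotic Theory Based Estimator of Variance of Q-REG

Model: $y_i = \boldsymbol\beta'\mathbf{x}_i + u_i$, $Q(y_i|\mathbf{x}_i,\tau) = \boldsymbol\beta'\mathbf{x}_i$, $Q[u_i|\mathbf{x}_i,\tau] = 0$

Residuals: $\hat u_i = y_i - \hat{\boldsymbol\beta}'\mathbf{x}_i$

Asymptotic Variance: $\frac{1}{N}\mathbf{A}^{-1}\mathbf{C}\mathbf{A}^{-1}$

$\mathbf{A} = E[f_u(0|\mathbf{x})\,\mathbf{x}\mathbf{x}']$, estimated by $\frac{1}{N}\sum_{i=1}^{N}\frac{1}{B}\,\mathbf{1}\left[|\hat u_i| \le \frac{B}{2}\right]\mathbf{x}_i\mathbf{x}_i'$

Bandwidth B can be Silverman's Rule of Thumb: $B = \frac{1.06}{N^{.2}}\,\text{Min}\left[s_u,\ \frac{Q(\hat u|.75) - Q(\hat u|.25)}{1.349}\right]$

$\mathbf{C} = \tau(1-\tau)E[\mathbf{x}\mathbf{x}']$, estimated by $\frac{\tau(1-\tau)}{N}\mathbf{X}'\mathbf{X}$

For $\tau$ = .5 and normally distributed u, this all simplifies to $\frac{\pi}{2}s_u^2(\mathbf{X}'\mathbf{X})^{-1}$.

But, this is an ideal application for bootstrapping.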
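The asymptotic variance (1/N) A^-1 C A^-1, with the kernel estimate of A and Silverman's rule of thumb for the bandwidth B, can be sketched as below. The function name `qreg_asymptotic_variance` and the test case are assumptions: to keep the example self-contained it uses a constant-only median regression, for which the estimate is simply the sample median, and compares the result with the normal-theory value (pi/2) s_u^2 (X'X)^-1.

```python
import numpy as np

def qreg_asymptotic_variance(X, u_hat, tau):
    """Estimate (1/N) A^-1 C A^-1 from quantile regression residuals u_hat."""
    n = X.shape[0]
    s_u = np.std(u_hat, ddof=1)
    iqr = np.quantile(u_hat, 0.75) - np.quantile(u_hat, 0.25)
    B = 1.06 * min(s_u, iqr / 1.349) / n ** 0.2      # Silverman's rule of thumb
    inside = (np.abs(u_hat) <= B / 2).astype(float)  # 1[|u_i| <= B/2]
    A = (X * (inside / B)[:, None]).T @ X / n        # estimates E[f_u(0|x) x x']
    C = tau * (1.0 - tau) * (X.T @ X) / n            # estimates tau(1-tau) E[x x']
    A_inv = np.linalg.inv(A)
    return A_inv @ C @ A_inv / n

# Hypothetical check: constant-only median regression with normal disturbances
rng = np.random.default_rng(3)
n = 5001
y = 3.0 + rng.normal(size=n)
X = np.ones((n, 1))
beta_hat = np.median(y)            # median regression on a constant
u_hat = y - beta_hat

V = qreg_asymptotic_variance(X, u_hat, tau=0.5)
normal_theory = (np.pi / 2) * np.var(u_hat, ddof=1) / n
print("kernel-based variance:", V[0, 0])
print("normal-theory value:  ", normal_theory)
```

The two numbers should be of similar size here because the simulated disturbances are normal; with other distributions the simplification does not apply.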

Estimated Variance for Quantile Regression


[Estimation results for quantiles .25, .50, .75]

Quantile Regressions


OLS vs. Least Absolute Deviations


Coefficient on MALE dummy variable in quantile regressions


A Production Function Model with Inefficiency: The Stochastic Frontier Model


Inefficiency in Production


Cost Inefficiency

y* = f(x)

C* = g(y*,w)

(Samuelson – Shephard duality results)

Cost inefficiency: If y < f(x), then C must be greater than g(y,w). Implies the idea of a cost frontier.

lnC = lng(y,w) + u, u > 0.


Corrected Ordinary Least Squares


COLS Cost Frontier


Stochastic Frontier Models

Motivation:
• Factors not under control of the firm
• Measurement error
• Differential rates of adoption of technology

Frontier is randomly placed by the whole collection of stochastic elements which might enter the model outside the control of the firm.

Aigner, Lovell, Schmidt (1977),

Meeusen, van den Broeck (1977),

Battese, Corra (1977)


The Stochastic Frontier Model

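$$y_i = f(\mathbf{x}_i)\,TE_i\,e^{v_i}$$

$$\ln y_i = \alpha + \boldsymbol\beta'\mathbf{x}_i + v_i - u_i = \alpha + \boldsymbol\beta'\mathbf{x}_i + \varepsilon_i.$$

ui > 0, but vi may take any value. A symmetric distribution, such as the normal distribution, is usually assumed for vi. Thus, the stochastic frontier is

α + β’xi + vi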

and, as before, ui represents the inefficiency.


Least Squares Estimation

Average inefficiency is embodied in the third moment of the disturbance εi = vi - ui.

So long as E[vi - ui] is constant, the OLS estimates of the slope parameters of the frontier function are unbiased and consistent. (The constant term estimates α - E[ui].) The average inefficiency present in the distribution is reflected in the asymmetry of the distribution, which can be estimated using the OLS residuals:

$$m_3 = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\varepsilon_i - \hat E[\varepsilon_i]\right)^3$$


Application to Spanish Dairy Farms

Input   Units                                                 Mean      Std. Dev.   Minimum    Maximum
Milk    Milk production (liters)                              131,108   92,539      14,110     727,281
Cows    # of milking cows                                     2.12      11.27       4.5        82.3
Labor   # man-equivalent units                                1.67      0.55        1.0        4.0
Land    Hectares of land devoted to pasture and crops         12.99     6.17        2.0        45.1
Feed    Total amount of feedstuffs fed to dairy cows (tons)   57,941    47,981      3,924.14   376,732

N = 247 farms, T = 6 years (1993-1998)


Example: Dairy Farms


The Normal-Half Normal Model

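$$\ln y_i = \alpha + \boldsymbol\beta'\mathbf{x}_i + v_i - u_i$$

Normal component: $v_i \sim N[0,\sigma_v^2]$; $f(v_i) = \frac{1}{\sigma_v}\phi\left(\frac{v_i}{\sigma_v}\right)$, $-\infty < v_i < \infty$.

Half normal component: $u_i = |U_i|$, $U_i \sim N[0,\sigma_u^2]$

Underlying normal: $f(U_i) = \frac{1}{\sigma_u}\phi\left(\frac{U_i}{\sigma_u}\right)$, $-\infty < U_i < \infty$

Half normal: $f(u_i) = \frac{1}{\Phi(0)}\frac{1}{\sigma_u}\phi\left(\frac{u_i}{\sigma_u}\right)$, $0 \le u_i < \infty$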


Skew Normal Variable


Estimation: Least Squares/MoM

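OLS estimator of β is consistent.

E[ui] = (2/π)^{1/2} σu, so OLS constant estimates α - (2/π)^{1/2} σu.

Second and third moments of OLS residuals estimate

$$m_2 = \frac{\pi-2}{\pi}\,\sigma_u^2 + \sigma_v^2 \quad\text{and}\quad m_3 = \sqrt{\frac{2}{\pi}}\left[1 - \frac{4}{\pi}\right]\sigma_u^3 < 0$$

Method of Moments: Use [a, b, m2, m3] to estimate [α, β, σu, σv].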
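The method of moments step can be sketched by solving the two moment equations, m3 = (2/pi)^(1/2) [1 - 4/pi] sigma_u^3 and m2 = [(pi - 2)/pi] sigma_u^2 + sigma_v^2, for sigma_u and sigma_v, and then shifting the OLS constant up by (2/pi)^(1/2) sigma_u. The function name `mom_half_normal` and the simulated production data are assumptions for illustration; the sketch assumes the first column of X is the constant.

```python
import numpy as np

def mom_half_normal(X, y):
    """Method of moments for the normal-half normal frontier (X[:, 0] = 1)."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS: [a, b]
    e = y - X @ coef                                 # residuals, mean zero
    m2 = np.mean(e ** 2)
    m3 = np.mean(e ** 3)
    if m3 >= 0:
        # Positive skewness: the moment equations have no admissible solution
        raise ValueError("OLS residuals are not negatively skewed")
    k3 = np.sqrt(2 / np.pi) * (1 - 4 / np.pi)        # negative constant
    sigma_u = (m3 / k3) ** (1 / 3)
    sigma_v2 = m2 - (np.pi - 2) / np.pi * sigma_u ** 2
    sigma_v = np.sqrt(max(sigma_v2, 0.0))
    alpha = coef[0] + np.sqrt(2 / np.pi) * sigma_u   # undo the shift by E[u]
    return alpha, coef[1:], sigma_u, sigma_v

# Hypothetical data: alpha = 1, beta = 0.5, sigma_u = 1, sigma_v = 0.5
rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
v = rng.normal(0.0, 0.5, n)
u = np.abs(rng.normal(0.0, 1.0, n))
y = 1.0 + 0.5 * x + v - u
X = np.column_stack([np.ones(n), x])

alpha, beta, sigma_u, sigma_v = mom_half_normal(X, y)
print(alpha, beta, sigma_u, sigma_v)
```

A large simulated sample is used because third moments are estimated imprecisely in small samples.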


Standard Form: The Skew Normal Distribution


Log Likelihood Function

Waldman (1982) result on skewness of OLS residuals: If the OLS residuals are positively skewed, rather than negatively, then OLS maximizes the log likelihood, and there is no evidence of inefficiency in the data.


Airlines Data – 256 Observations


Least Squares Regression


Alternative Models: Half Normal and Exponential


Other Models

• Many other parametric models
• Semiparametric and nonparametric: the recent outer reaches of the theoretical literature
• Other variations including heterogeneity in the frontier function and in the distribution of inefficiency

Normal-Exponential Likelihood

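$$\ln L(\alpha,\boldsymbol\beta,\sigma_v,\sigma_u;\ \text{data}) = \sum_{i=1}^{n}\left\{-\ln\sigma_u + \frac{1}{2}\left(\frac{\sigma_v^2}{\sigma_u^2}\right) + \ln\Phi\left[\frac{-\left((v_i - u_i) + \sigma_v^2/\sigma_u\right)}{\sigma_v}\right] + \frac{v_i - u_i}{\sigma_u}\right\}$$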
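The normal-exponential log likelihood can be evaluated directly from the composed disturbances e_i = v_i - u_i, with each term equal to -ln(sigma_u) + (1/2)(sigma_v/sigma_u)^2 + e_i/sigma_u + ln Phi[-(e_i + sigma_v^2/sigma_u)/sigma_v]. The sketch below is an illustration under assumptions: the function name `normal_exponential_loglik` is hypothetical, sigma_u is taken to be the mean of the exponential inefficiency term, and the parameter values are arbitrary. As a sanity check, the implied density of e is integrated numerically and should come out close to one.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def normal_exponential_loglik(eps, sigma_v, sigma_u):
    """Log likelihood for eps_i = v_i - u_i, v ~ N(0, sigma_v^2), u ~ Exp(mean sigma_u)."""
    eps = np.asarray(eps, dtype=float)
    terms = (-np.log(sigma_u)
             + 0.5 * (sigma_v / sigma_u) ** 2
             + eps / sigma_u
             + norm.logcdf(-(eps + sigma_v ** 2 / sigma_u) / sigma_v))
    return terms.sum()

# Arbitrary illustrative parameter values
sigma_v, sigma_u = 0.5, 1.0

# Sanity check: the density of eps implied by one term integrates to one
area, _ = quad(lambda e: np.exp(normal_exponential_loglik([e], sigma_v, sigma_u)),
               -np.inf, np.inf)
print("integral of density:", area)

# Log likelihood at the true parameters for simulated disturbances
rng = np.random.default_rng(5)
eps = rng.normal(0.0, sigma_v, 1000) - rng.exponential(sigma_u, 1000)
print("log likelihood:", normal_exponential_loglik(eps, sigma_v, sigma_u))
```

In estimation, e_i would be replaced by y_i minus the frontier function and the sum maximized over all parameters, for example with a general-purpose optimizer.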


A Test for Inefficiency?

Base test on σu = 0 <=> λ = 0

Standard test procedures:
• Likelihood ratio
• Wald
• Lagrange Multiplier

Nonstandard testing situation:
• Variance = 0 on the boundary of the parameter space
• Standard chi squared distribution does not apply.


Estimating ui

No direct estimate of ui

Data permit estimation of yi – β’xi. Can this be used?
• εi = yi – β’xi = vi – ui
• Indirect estimate of ui, using E[ui|vi – ui]
• This is E[ui|yi, xi]

vi – ui is estimable with ei = yi – b’xi.


Fundamental Tool - JLMS

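$$E[u_i|\varepsilon_i] = \frac{\sigma\lambda}{1+\lambda^2}\left[\mu_i + \frac{\phi(\mu_i)}{\Phi(\mu_i)}\right],\quad \mu_i = \frac{-\lambda\varepsilon_i}{\sigma}$$

where εi = vi – ui, σ² = σu² + σv², and λ = σu/σv.

We can insert our maximum likelihood estimates of all parameters.

Note: This estimates E[u|vi – ui], not ui.

$$\hat E[u_i|\hat\varepsilon_i] = \frac{\hat\sigma\hat\lambda}{1+\hat\lambda^2}\left[\hat\mu_i + \frac{\phi(\hat\mu_i)}{\Phi(\hat\mu_i)}\right],\quad \hat\mu_i = \frac{-\hat\lambda\,(y_i - \hat{\boldsymbol\beta}'\mathbf{x}_i)}{\hat\sigma}$$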
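The JLMS estimator for the normal-half normal model, E[u|e] = [sigma*lambda/(1 + lambda^2)] [mu + phi(mu)/Phi(mu)] with mu = -lambda*e/sigma, sigma^2 = sigma_u^2 + sigma_v^2 and lambda = sigma_u/sigma_v, can be computed in one vectorized function. This is a sketch with assumed names (`jlms`) and arbitrary parameter values standing in for the maximum likelihood estimates; in practice the residuals would come from the fitted frontier.

```python
import numpy as np
from scipy.stats import norm

def jlms(eps, sigma_u, sigma_v):
    """E[u_i | eps_i] for the normal-half normal model, eps_i = v_i - u_i."""
    eps = np.asarray(eps, dtype=float)
    sigma = np.sqrt(sigma_u ** 2 + sigma_v ** 2)
    lam = sigma_u / sigma_v
    mu = -lam * eps / sigma
    return sigma * lam / (1.0 + lam ** 2) * (mu + norm.pdf(mu) / norm.cdf(mu))

# Arbitrary values standing in for estimated parameters and residuals
sigma_u, sigma_v = 1.0, 0.5
residuals = np.array([-2.0, -1.0, 0.0, 1.0])

u_hat = jlms(residuals, sigma_u, sigma_v)
print("estimated inefficiency:", u_hat)
print("estimated efficiency exp(-u):", np.exp(-u_hat))
```

More negative residuals map to larger estimated inefficiency, and the estimate stays positive even when the residual is positive, since it is a conditional mean of a nonnegative variable rather than an estimate of ui itself.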


Application: Electricity Generation


Estimated Translog Production Frontiers


Inefficiency Estimates


Estimated Inefficiency Distribution


Estimated Efficiency


A Semiparametric Approach

Y = g(x,z) + v - u [Normal-Half Normal]
(1) Locally linear nonparametric regression estimates g(x,z)
(2) Use residuals from nonparametric regression to estimate variance parameters using MLE
(3) Use estimated variance parameters and residuals to estimate technical efficiency.


Airlines Application


Efficiency Distributions


Nonparametric Methods - DEA


DEA is done using linear programming


Methodological Problems with DEA

• Measurement error
• Outliers
• Specification errors
• The overall problem with the deterministic frontier approach


DEA and SFA: Same Answer?

Christensen and Greene data
• N = 123 minus 6 tiny firms
• X = capital, labor, fuel
• Y = millions of KWH

Cobb-Douglas Production Function vs. DEA


Comparing the Two Methods.
