introduction to panel data analysis - rdc - western...

48
Introduction to Panel Data Analysis Youngki Shin Department of Economics Email: [email protected] Statistics and Data Series at Western November 21, 2012 1 / 40

Upload: vankhue

Post on 20-Jun-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Introduction to Panel Data Analysis

Youngki ShinDepartment of Economics

Email: [email protected]

Statistics and Data Series at WesternNovember 21, 2012

1 / 40

Motivation

More observations mean more information.

More observations with a certain structure mean much moreinformation: pooled cross sections and panel data

How can we extract additional information from pooled cross sectionsor panel data?

2 / 40

Motivation

More observations mean more information.

More observations with a certain structure mean much moreinformation: pooled cross sections and panel data

How can we extract additional information from pooled cross sectionsor panel data?

2 / 40

Motivation

More observations mean more information.

More observations with a certain structure mean much moreinformation: pooled cross sections and panel data

How can we extract additional information from pooled cross sectionsor panel data?

2 / 40

Example

Effect of an Incinerator on Housing Prices

With cross-sectional data in 1981, we have

rprice = 101, 307.5− 30, 688.27nearinc

(3, 093.0) (5, 827.71)

n = 142

With another cross-sectional data in 1978 when there were no incinerator,we have

rprice = 82, 517.23− 18, 824.37nearinc

(2, 653.79) (4, 744.59)

n = 179

Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.

3 / 40

Example

Effect of an Incinerator on Housing Prices

With cross-sectional data in 1981, we have

rprice = 101, 307.5− 30, 688.27nearinc

(3, 093.0) (5, 827.71)

n = 142

With another cross-sectional data in 1978 when there were no incinerator,we have

rprice = 82, 517.23− 18, 824.37nearinc

(2, 653.79) (4, 744.59)

n = 179

Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.

3 / 40

Example

Effect of an Incinerator on Housing Prices

With cross-sectional data in 1981, we have

rprice = 101, 307.5− 30, 688.27nearinc

(3, 093.0) (5, 827.71)

n = 142

With another cross-sectional data in 1978 when there were no incinerator,we have

rprice = 82, 517.23− 18, 824.37nearinc

(2, 653.79) (4, 744.59)

n = 179

Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.

3 / 40

Outline

Data Structure

Policy Evaluation with Pooled Cross Sections

Three Approaches in Panel Data Estimation

First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator

Empirical Application: Smoking on Birth Outcomes

Concluding Remarks

4 / 40

Outline

Data Structure

Policy Evaluation with Pooled Cross Sections

Three Approaches in Panel Data Estimation

First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator

Empirical Application: Smoking on Birth Outcomes

Concluding Remarks

5 / 40

Data Structure (cont.)

A set of pooled cross sections is obtained by sampling randomly froma large population at different time points.

A (typical) panel data set follow the same individuals over time.

For example, consider that I sample three individuals from this roomat two time points:

Time Pooled Panel

t=1 John, Jane, Evelyn Eric, Andrew, Rachelt=2 Kyle, Justin, Lisa Eric, Andrew, Rachel

6 / 40

Data Structure (cont.)A Snapshot of Data

Table: Pooled Data

year rprice nearinc y81

1978 60000 0 01978 54000 1 01978 38000 1 0

......

...1981 82000 1 11981 52000 0 11981 97000 0 1

Table: Panel Data

id year inf unem

12 1950 7.3 3.512 1951 9.1 2.716 1950 5.3 5.416 1951 4.6 6.7...

......

...43 1950 7.1 4.243 1951 8.5 3.247 1950 6.7 5.447 1951 2.6 9.4

7 / 40

Data Structure (cont.)

There are also very useful panel structures other than theindividual-time combination.

1 Twins data: i is for twins id, and t is for the individual among thespecific twins. Control for unobserved generic factors.

2 School data: students sampled from many schools (or classrooms).Then, i is for school id, and t is for the student in school i .

8 / 40

Data Structure (cont.)

Examples of pooled cross sections:

Current Population Survey (CPS), USA

Examples of a panel data:

Labor Market Activity Survey (LMAS), CanadaPanel Study of Income Dynamics (PSID), USANational Longitudinal Survey of Youth (NLSY), USAA time series of provincial (or country) level data.ex) inflation and unemployment rate of 50 countries in 1950–2010.

It is usually easier to collect pooled cross sections than to do paneldata.

9 / 40

Outline

Data Structure

Policy Evaluation with Pooled Cross Sections

Three Approaches in Panel Data Estimation

First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator

Empirical Application: Smoking on Birth Outcomes

Concluding Remarks

10 / 40

Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator

Terminology:

Treatment Group: those who are affected by a policy (a treatment)Control Group: those who are not.

The object of policy evaluation is to measure the (mean) difference ofoutcomes between the treatment group and the control group. Thismeasure is also called the average treatment effect.

Consider that you are testing the effect of a new drug. How can youdesign the experiment? Randomization.

Recall the incinerator and housing prices example. Is randomizationpossible?

11 / 40

Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator

Consider an example of a drug test:

blprsi = β0 + β1treati + ui

If you randomized the control/treatment groups well, i.e.Cov(treati , ui ) = 0, then you can estimate the effect of the drug by asingle cross section.

In policy evaluation in social sciences, treati and ui are easilycorrelated:

log(wagei ) = β0 + β1jbtrni + ui

12 / 40

Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator

Pooled cross sections help us to evaluate the policy effect correctly bymeasuring the difference twice (before and after the policyimplementation.)

Recall the two regressions in the incinerator example:

rprice = γ0 + γ1nearinc + u in years 1978 and 1981

δ1 = γ1,81 − γ1,78=(rprice81,nr − rprice81,fr

)−(rprice78,nr − rprice78,fr

)If perfectly randomized, the second term is 0.

This estimator is called the Difference-in-Difference estimator.

13 / 40

Policy Evaluation with a Pooled Cross SectionDifference-in-Difference Estimator

The effect can be estimated just by a single regression with somedummy variable.

rprice = β0 + δ0y81 + β1nearinc + δ1y81 · nearinc + u

This result is not intuitive. Just follow the logic:

Before (y81 = 0) After (y81 = 1) After-Before

Control (nearinc = 0) β0 β0 + δ0 δ0Treatment (nearinc = 1) β0 + β1 β0 + δ0 + β1 + δ1 δ0 + δ1Treatment-Control β1 β1 + δ1 δ1

Therefore, δ1 in the above regression gives the same estimate of theDifference-in-Difference estimator.

14 / 40

Outline

Data Structure

Policy Evaluation with Pooled Cross Sections

Three Approaches in Panel Data Estimation

First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator

Empirical Application: Smoking on Birth Outcomes

Concluding Remarks

15 / 40

Panel Data and the First Difference (FD) Estimator

In panel data, we follow the same individual over time. This specificstructure enables us to conduct a better analysis.

Specifically, we can control for certain types of omitted variablescalled unobserved heterogeneity.

Let us think about some examples:

log(wageit) = β0 + δ0d2t + β1educit + ai + uit︸ ︷︷ ︸vit

Notation: now we have two subscripts, i and t.

Both ai and uit are unobservables called a fixed effect and anidiosyncratic error, respectively.

16 / 40

Panel Data and the First Difference (FD) Estimator

For simplicity, consider two periods model:

yit = β0 + δ0d2t + β1xit + ai + uit t = 1, 2.

The pooled OLS does not work well since ai is usually correlated withxit , i.e. Cov(vit , xit) 6= 0.

A simple solution is the First-Difference (FD) estimator.

yi2 = (β0 + δ0) + β1xi2 + ai + ui2 t = 2

yi1 = β0 + β1xi1 + ai + ui1 t = 1

Taking a difference gives

yi2 − yi1 = δ0 + β1 (xi2 − xi1) + (ui2 − ui1)

or

∆yi = δ0 + β1∆xi + ∆ui .

17 / 40

Panel Data and the First Difference (FD) Estimator

The (pooled) OLS works in the new regression,

∆yi = δ0 + β1∆xi + ∆ui , if

1 ∆ui and ∆xi are uncorrelated;2 ∆xi has some variation.

The second condition is violated if xit does not change over time:ex) gender, race, etc.. Then, ∆xi = 0.

Even in the wage equation example,

log(wageit) = β0 + δ0d2t + β1educit + ai + uit ,

Most working population do not increase the years of educ .

18 / 40

Panel Data and the First Difference (FD) EstimatorMore than Two Time Periods

When panel data contain more than two time periods, we can stillapply the FD estimator to control for unobserved heterogeneity.

The sufficient condition for the estimator to be valid is

Cov(xit , uis) = 0 for all t and s.

This condition is violated when1 Future regressors react to the past dependent variable (feedback);2 Regressors contain a lagged dependent variable;3 An important (i.e. related to xit) time-varying regressor is omitted.

Take differences with adjacent time periods and run the followingregression when t = 1, 2, and 3:

∆yit = α0 + α3d3t + β1∆xit + ∆uit for t = 2, 3.

19 / 40

Additional Remarks on FD Estimator

Due to the expansion over the time dimension, serial correlation mayarise.

Also, we cannot exclude the heteroskedasticity problem.

Since we use the OLS estimator, we can apply the White correction orthe HAC estimation method as before.

20 / 40

Fixed Effect Estimator

Consider a simple error component model again:

yit = β1xit + ai + uit , t = 1, . . . ,T and i = 1, . . . , n.

We assume that the idiosyncratic error uit is ‘innocuous’ in the sense:

E (uit |Xi) = 0 or E (uit |xit) = 0.

However, the individual fixed effect ai could be arbitrarily correlatedwith xit .

We have already known that the FD estimator cancels out theunobserved heterogeneity ai .

21 / 40

Fixed Effect Estimator

There is a different way to cancel out unobserved heterogeneity.

First, fix the individual i and take an average over time:

yi = β1xi + ai + ui .

where

yi =1

T

T∑t=1

yit , xi =1

T

T∑t=1

xit , and ui =1

T

T∑t=1

uit .

The point is

ai =1

T

T∑t=1

ai =1

TTai = ai .

22 / 40

Fixed Effect Estimator

Now, take a difference between two equations:

yit = β1xit + ai + uit , t = 1, 2, . . . ,T .

yi = β1xi + ai + ui .

Then, what we have is

yit − yi = β1(xit − xi ) + (uit − ui ), t = 1, 2, . . . ,T

or

yit = β1xit + uit , t = 1, 2, . . . ,T .

We may apply the pooled OLS on the last equation.

23 / 40

Fixed Effect Estimator

The FE estimator uses information from within group (i) variation:

yi1 = yi1 − yi

yi2 = yi2 − yi

...

yiT = yiT − yi

For this reason, the FE estimator is also called within estimator.

This can be readily extended to a multiple regression model:

yit = β1x1it + β2x2it + . . .+ βk xkit + uit

24 / 40

Fixed Effect EstimatorFD vs. FE

If T = 2, the FD estimator and the FE estimator are identical:

yi2 ≡ yi2 − yi = yi2 −(yi1 + yi2

2

)=

y1 − y22

≡ 1

2∆yi2.

Therefore,

yi2 = β1xit + uit

⇐⇒ 1

2∆yi2 = β1

1

2∆xi2 +

1

2∆ui2

⇐⇒ ∆yi2 = β1∆xi2 + ∆ui2

However, they are different in a finite sample if T > 2. Unless there isa unit root (or severe serial correlation) problem, you would better usethe FE estimator.

25 / 40

Random Effect Estimator

In the random effect model:

yit = β0 + β1xit + ai + uit ,

we assume that

Cov(xit , ai ) = 0.

Then, we come back to the ‘nice’ world where we don’t need tocancel out ai . Just use the pooled OLS?

No. There is a serial correlation problem.

26 / 40

Random Effect Estimator

In the random effect model:

yit = β0 + β1xit + ai + uit ,

we assume that

Cov(xit , ai ) = 0.

Then, we come back to the ‘nice’ world where we don’t need tocancel out ai . Just use the pooled OLS?

No. There is a serial correlation problem.

26 / 40

Random Effect EstimatorSerial Correlation in the RE model

We have two components in the error term:

vit = ai + uit

Suppose that uit is totally innocuous again:

Cov(ai , uit) = Cov(uit , uis) = 0 for t 6= s.

Now, we calculate Corr(vit , vis) and show that it is not zero:

Var(vit) = Var(ai + uit) = σ2a + σ2u

Cov(vit , vis) = E ((ai + uit)(ai + uis))

= E (a2i + aiuis + aiuit + uituis)

= E (a2i ) = σ2a

27 / 40

Random Effect EstimatorSerial Correlation in the RE model

Therefore,

Corr(vit , vis) =σ2a

σ2a + σ2u6= 0

Any inference based on the pooled OLS would be incorrect.

However, we know how to fix this problem. Do GLS!

We want to transform the original model into

yit = β0 + β1xit + vit

where vit does not have the serial correlation anymore.

28 / 40

Random Effect Estimator

We multiplied ρ and took a difference when there is a AR(1) serialcorrelation. In this case, we multiply

λ = 1−[

σ2uσ2u + Tσ2a

](1/2)and take a difference as

yit − λyi = β0(1− λ) + β1(xit − λxi ) + vit − λvi

We can show that vit(= vit − λvi ) is not serially correlated.

The λ should be estimated by λ.

This specific GLS estimator is called the Random Effect (RE)estimator.

29 / 40

Random Effect Estimator

The RE estimator is something between the pooled OLS and the FEestimator. Note that in Equation:

yit − λyi = β0(1− λ) + β1(xit − λxi ) + vit − λvi ,

it becomes the pooled OLS when λ = 0, and does the FE estimatorwhen λ = 1.

The λ is always between 0 and 1 in the RE model.

As T →∞, the FE and RE estimators are equivalent since λ→ 1.

30 / 40

Random Effect EstimatorRE vs. FE

If you believe that there is obvious endogenous fixed factor, ai , inyour model, you should use the FE estimator.

Otherwise, the RE estimator will tell you more: non time-varyingregressors, efficiency etc.

Keep in mind that the RE estimator is not even consistent ifCov(xit , ai ) 6= 0.

We can test whether Cov(xit , ai ) = 0 or not.

31 / 40

Random Effect EstimatorHausman Test

The idea of the Hausman test is simple. The null hypothesis is

H0 :Cov(xit , ai ) = 0

H1 :Cov(xit , ai ) 6= 0

Under H0, both RE and FE are consistent:

βREp→ β, βFE

p→ β.

Thus, we can expect that βRE ≈ βFE .

However, under H1, only βFE is consistent. Therefore, we reject H0 ifthe difference between βRE and βFE is large enough.

32 / 40

Outline

Data Structure

Policy Evaluation with Pooled Cross Sections

Three Approaches in Panel Data Estimation

First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator

Empirical Application: Smoking on Birth Outcomes

Concluding Remarks

33 / 40

Empirical Application: Smoking on Birth Outcomes

“Infants born to women who smoke during pregnancy have a lower averagebirthweight... Low birthweight is associated with increased risk forneonatal, perinatal, and infant morbidity and mortality.”

(Women and Smoking: A Report of the Surgeon General, 2001, requoted fromAbrevaya (2006))

34 / 40

Empirical Application: Smoking on Birth Outcomes

The direct medical costs: According to the estimates of Lewit et al.(1995), the low-birthweight (LBW) infants (less than 10% of births)account for more than 1/3 of health care costs during the first year of life.

The long-term costs:“Hack et al. (1995) find that LBW babies have developmental problems incognition, attention and neuromotor functioning that persist untiladolescence.” (Abrevaya (2006))

35 / 40

Empirical Application: Smoking on Birth OutcomesHow to Estimate

The OLS estimates would be biased into the negative direction due toendogeneity.

IV estimation?

Comparison between OLS and IV estimatesfrom Abrevaya (2006)

492 J. ABREVAYA

Figure 1. Previous estimates of smoking’s effect on birthweight

The remainder of the paper is outlined as follows. Section 2 describes the data and the matchingstrategies used to construct panel data sets for analysis. Section 3 considers the inconsistencycaused by incorrect matching. The direction of this inconsistency is the same as the omittedvariables bias, and a simple model suggests that the mismatched observations, ‘false matches’,have a disproportionate effect on the inconsistency. Section 4 reports the main empirical results.First, fixed effects estimates (based upon the matched panels of Section 2) are reported. Second,using a proxy for correct matching that is available in the earlier part of the sample, the effectof mismatching is empirically investigated. Third, an augmented model specification is used toallow for the possibility of heterogeneous smoking effects. Section 5 considers several possibleviolations of the strict exogeneity assumption made for fixed effects estimation. These potentialviolations include feedback effects (smoking influenced by a prior birth outcome), correlatedmaternal behaviour (smoking changes correlated with unobservable behavioural changes) andmisclassification of smoking status. Section 6 concludes.

2. DATA AND MATCHING STRATEGY

2.1. Federal Natality Data

The data used for this study come from the Natality Data Sets (released by the National Centerfor Health Statistics (NCHS)) from 1990 to 1998. The natality data are based upon birth recordsfrom every live birth that occurs in the United States and contain information on birth outcomes,maternal prenatal behaviour (including smoking behaviour for nearly all states) and demographicattributes. During the time period analysed, there are approximately 4 million births per year inthe natality data.

Unfortunately, the federal natality data does have limitations. There are no unique identifiers(e.g., social security number) for mothers, which makes construction of a panel data set non-trivial.Other (non-unique) identifiers that would make matching far easier (such as mother’s name and

Copyright 2006 John Wiley & Sons, Ltd. J. Appl. Econ. 21: 489–519 (2006)

36 / 40

Empirical Application: Smoking on Birth Outcomes

The fixed-effect (FE) estimation can be used if panel data areavailable.

Abrevaya (2006) constructed a pseudo panel data set and showedthat the FE estimate is smaller than that of the OLS.

yib = x ′ibβ + γsib + ci + uib

where i is Mom’s id and b is the order of a baby from Mom i .

The estimation results for γ by OLS and FE are −243.27(3.20)−144.04(4.75), respectively.

37 / 40

Concluding Remarks

Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.

Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.

If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

38 / 40

Concluding Remarks

Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.

Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.

If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

38 / 40

Concluding Remarks

Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.

Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.

If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

38 / 40

Concluding Remarks

Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.

Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.

If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

38 / 40

Concluding Remarks

Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.

Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.

If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.

Panel data are not restricted to the individual-time structure.

38 / 40

Stata Commands

Load the data set filename.dta.

First, we need to set an id variable and a time variable. Check therelevant variable names.

xtset id time.

Now type xtsum.

The command for the FE estimator isxtreg dep x1 x2 x3, . . ., fe

The command for the RE estimator isxtreg dep x1 x2 x3, . . ., re

39 / 40