panel data analysis

From Workshop at NSO during 22-26 September 2008

Panel Data Analysis

Outline
• What are panel data?
• Why use panel data?
• Handling panel data in Stata
• Describing panel data
• Within and between variation
• Unobservables
• Testing the FE and RE assumptions

What are panel data?

Panel data are a form of longitudinal data, involving regularly repeated observations on the same individuals.

Individuals may be people, households, firms, areas, etc.

Repeated observations may be at different time periods or on different units within clusters (e.g. workers within firms)

Why use panel data?

Repeated observations on individuals allow for the possibility of isolating the effects of unobserved differences between individuals

We can study dynamics

The ability to make causal inference is enhanced by temporal ordering

Some phenomena are inherently longitudinal (e.g. poverty persistence; unstable employment)

But don’t expect too much

Variation between people usually far exceeds variation over time for an individual

A panel with T waves doesn’t give T times the information of a cross-section

Variation over time may not exist for some important variables or may be inflated by measurement error

Some terminology

A balanced panel has the same number of time observations (T) on each of the n individuals

An unbalanced panel has a different number of time observations (Ti) on each individual

A compact panel covers only consecutive time periods for each individual; there are no “gaps”

Attrition is the process of drop-out of individuals from the panel, leading to an unbalanced and possibly non-compact panel

A short panel has a large number of individuals but few time observations on each (e.g. BHPS has 5,500 households and 15 waves)

A long panel has a long run of time observations on each individual, permitting separate time-series analysis for each

Handling panel data in Stata

For our purposes, the unit of analysis or case is either the person or household:
• If case = person, case contains information on the person’s state, perhaps at different dates
• If case = household, case contains info on some or all household members (cross-sectional only!)

The data can be organized in two ways:
• Wide form: data is sometimes supplied in this format
• Long form: usually most convenient & needed for most panel data commands in Stata

Wide file format

• One row per case
• Observations on a variable for different time periods (or dates) held in different columns
• Variable name identifies time (via prefix)

PID      awage          bwage          cwage
         (Wage at W1)   (Wage at W2)   (Wage at W3)
10001    7.2            7.5            7.7
10002    6.3            missing        6.3
10003    5.4            5.4            missing
…        …              …              …

Long file format

Potentially multiple rows per case, with:
• Observations on a variable for different time periods (or dates) held in extra rows for each individual
• Case-row identifier identifies time (e.g. PID, wave)

PID      wave   wage
10001    1      7.2
10001    2      7.5
10001    3      7.7
10002    1      6.3
10002    3      6.3
10003    1      5.4
10003    2      5.4
…        …      …
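As an illustration, the wide table above can be converted to the long table with reshape. This is a minimal sketch that types in the three example rows by hand; the helper variable name wavestr is hypothetical, and the mapping of the prefixes a, b, c to waves 1, 2, 3 is taken from the column labels in the wide example.

```stata
clear
input long pid awage bwage cwage
10001 7.2 7.5 7.7
10002 6.3 .   6.3
10003 5.4 5.4 .
end

* The wave letter is a prefix, so @ marks where it sits in the variable name
reshape long @wage, i(pid) j(wavestr) string

* Convert the wave letter (a, b, c) to a numeric wave (1, 2, 3)
generate byte wave = strpos("abc", wavestr)
drop wavestr

* Drop rows with no wage observation, as in the long-format table
drop if missing(wage)
sort pid wave
list, sepby(pid)
```

The result has seven rows, one per person-wave with an observed wage, matching the long-format example.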

Panel and time variables

Use “tsset” to tell Stata which are panel and time variables:

. tsset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 14, but with gaps

Note that “tsset” automatically sorts the data accordingly.
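One practical reason for declaring the panel structure is that Stata's time-series operators then work within individuals. The sketch below uses the small long-format example typed in by hand; the variable names wage_lag and wage_diff are hypothetical.

```stata
clear
input long pid byte wave float wage
10001 1 7.2
10001 2 7.5
10001 3 7.7
10002 1 6.3
10002 3 6.3
10003 1 5.4
10003 2 5.4
end

tsset pid wave

* L. and D. respect the panel structure: they never reach across individuals,
* and they return missing when the previous wave is absent (pid 10002, wave 3)
generate wage_lag  = L.wage
generate wage_diff = D.wage
list, sepby(pid)
```

Only three rows get a non-missing wage_diff here, because a change can only be computed when the same person is observed in two consecutive waves.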

Describing panel data

Ways of describing/summarizing panel data:
• Basic patterns of available cases
• Between- and within-group components of variation
• Transition tables

Some basic notation:
• yit is the “dependent variable” to be analysed
• i indexes the individual (pid), i = 1, 2, …, n
• t indexes the repeated observation / time period (wave), t = 1, 2, …, Ti

Dependent variable

yit may be:
• Continuous (e.g. wages);
• Mixed discrete/continuous (e.g. hours of work);
• Binary (e.g. employed/not employed);
• Ordered discrete (e.g. Likert scale for degree of happiness);
• Unordered discrete (e.g. occupation)

Describe patterns of panel data

. xtdes

     pid:  10002251, 10004491, ..., 1.497e+08                n =      16442
    wave:  1, 2, ..., 14                                     T =         14
           Delta(wave) = 1; (14-1)+1 = 14
           (pid*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         1       1       2       7      14      14      14

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+----------------
     4410     26.82   26.82 |  11111111111111
      995      6.05   32.87 |  1.............
      646      3.93   36.80 |  11............
      ...       ...     ... |  …
       35      0.21   84.69 |  .........111..
       33      0.20   84.89 |  1.1...........
     2485     15.11  100.00 | (other patterns)
 ---------------------------+----------------
    16442    100.00         |  XXXXXXXXXXXXXX
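The T_i summarized by xtdes can also be built by hand, which is sometimes useful for selecting a balanced or compact subsample. A minimal sketch on three made-up individuals; the variable names Ti, compact and tag are hypothetical.

```stata
clear
input long pid byte wave
10001 1
10001 2
10001 3
10002 1
10002 3
10003 1
10003 2
end

* Number of waves observed for each individual
bysort pid (wave): generate byte Ti = _N

* Compact = no gaps between the first and last observed wave
by pid: generate byte compact = (wave[_N] - wave[1] + 1 == Ti)

* One row per individual for summaries
egen byte tag = tag(pid)
tabulate Ti if tag
list pid Ti compact if tag
```

Here pid 10002 has Ti = 2 but is not compact, because wave 2 is missing between waves 1 and 3.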

Describe the pattern of panel data

. tabulate wave

       wave |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      9,912        7.97        7.97
          2 |      9,459        7.61       15.58
          3 |      9,024        7.26       22.84
          4 |      9,060        7.29       30.13
          5 |      8,827        7.10       37.23
          6 |      9,137        7.35       44.58
          7 |      9,118        7.33       51.91
          8 |      8,940        7.19       59.11
          9 |      8,820        7.09       66.20
         10 |      8,701        7.00       73.20
         11 |      8,590        6.91       80.11
         12 |      8,383        6.74       86.85
         13 |      8,264        6.65       93.50
         14 |      8,080        6.50      100.00
------------+-----------------------------------
      Total |    124,315      100.00

The number of observations generally declines across waves. This is consistent with attrition from the panel.

Between- and within-group variation

The Stata command xtsum summarizes within and between variation.

But it does not give an exact decomposition:
• It converts sums of squares to variances using different ‘degrees of freedom’, so they are not comparable
• It reports the square root (i.e. standard deviation) of these variances

The documentation is not very clear! But it is useful as a good approximation.

Between- and within-group variation: xtsum

. xtsum paygu

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
paygu    overall |  1224.762   1054.031   .0833333   72055.43 |     N =   67666
         between |             812.5707   8.666667      11323 |     n =   11149
         within  |             640.9227  -7782.167   64965.64 | T-bar = 6.06924

. display r(sd_w)
640.92268

. display r(sd)
1054.031

. display r(sd_w)^2 / r(sd)^2    // proportion of within variation
.36974691

. display r(sd_b)^2 / r(sd)^2    // proportion of between variation
.59431354

paygu (gross monthly earnings) varies more between people than it changes over time for the same people. This has implications for panel analysis, because we often rely on changes over time.
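To see why the two proportions need not sum to one, the between and within standard deviations can be rebuilt by hand on a tiny made-up panel. This is a sketch of how the xtsum quantities are understood to be formed: the between figure is the standard deviation of the person means (one value per person), and the within figure is the standard deviation of each observation's deviation from its person mean, recentred on the overall mean. The data and the variable names pay, pay_mean, pay_within and first are hypothetical.

```stata
clear
input long pid byte wave double pay
1 1 10
1 2 12
2 1 20
2 2 22
3 1 30
3 2 32
end
xtset pid wave

* Overall mean and standard deviation
quietly summarize pay
scalar m_overall  = r(mean)
scalar sd_overall = r(sd)

* Within: deviations from the person mean, recentred on the overall mean
bysort pid: egen double pay_mean = mean(pay)
generate double pay_within = pay - pay_mean + m_overall
quietly summarize pay_within
scalar sd_within = r(sd)

* Between: standard deviation of the person means, one value per person
egen byte first = tag(pid)
quietly summarize pay_mean if first
scalar sd_between = r(sd)

display sd_overall^2, sd_between^2, sd_within^2
xtsum pay
```

In this example the overall variance is 81.2, while the between and within variances are 100 and 1.2. The sums of squares do decompose exactly (400 + 6 = 406), but each is divided by a different number of degrees of freedom, so the resulting variances do not add up.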

Between- and within-group variation for discrete variables

xttab

Example: part-time work = 30 hours or less per week

. xttab pt

                  Overall             Between            Within
       pt |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
        0 |   48119     72.55      8820     79.78          83.77
        1 |   18204     27.45      5027     45.47          57.14
----------+-----------------------------------------------------
    Total |   66323    100.00     13847    125.24          74.10
                               (n = 11056)
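The Between columns count individuals who are ever observed in each state, which is why the percentages can sum to more than 100. The Within column gives, for those individuals, the share of their observations spent in that state. A minimal sketch on three made-up individuals shows the same quantities built by hand; the variable names ever_pt, share_pt and first are hypothetical.

```stata
clear
input long pid byte wave byte pt
1 1 0
1 2 0
1 3 0
1 4 0
2 1 0
2 2 0
2 3 1
2 4 1
3 1 1
3 2 1
3 3 1
3 4 1
end
xtset pid wave

* Between: is the person ever observed working part time?
bysort pid: egen byte ever_pt = max(pt)

* Within: share of the person's waves spent working part time
bysort pid: egen double share_pt = mean(pt)

egen byte first = tag(pid)
count if first & ever_pt == 1
summarize share_pt if first & ever_pt == 1

xttab pt
```

Two of the three individuals are ever part-time, and between them they spend 75% of their observed waves in part-time work. Individual 2 is counted under both pt = 0 and pt = 1 in the Between columns.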

Describing panel data: summary

Panel data involve 2 dimensions, group (typically individual) and time. We need to examine variation along each dimension to get a “feel” for the data.

To fully exploit panel data, we need enough within-group (cross-time) variation. We can evaluate the amount of within (and between) variation in different ways:
• Continuous variables: between and within standard deviation (and variance) using xtsum
• Categorical variables: between and within variation using xttab
• Binary variables: simple sequence description if not too many waves

Some basic identification problems

1. Unobservable variables
• Can we identify the impact of unobservables?
• Can we distinguish the impact of unobservables from the impact of time-invariant observables?

2. Age, cohort and time effects: can they be distinguished?
• Behavior may change with age
• Current behavior may be affected by experience in “formative years”
• Time may affect behavior through a changing social environment

Identification of unobservables (1)

Example: wage models based on human capital theory:

yit = ziα + xitβ + ui + εit

where i = 1…n, t = 1…Ti

yit = log wage
zi = observable time-invariant factors (e.g. sex, year of birth)
xit = observable time-varying factors (e.g. job tenure)
ui = unobservable “ability” (assumed not to change over time)
εit = “luck”

Can we identify the effect of ui if we can’t observe it?

Identification of unobservables (2)

The identification of the effect of ui rests on assumptions about the correlation structure of the compound residual vit:

vit = ui + εit

If individuals have been sampled at random, there is no correlation across different individuals:

cov(ui, uj) = 0
cov([εi1…εiT], [εj1…εjT]) = 0
for any two (different) sampled individuals i and j

But there may be some correlation over time for any individual:

cov(vis, vit) ≠ 0 for two different periods s ≠ t, since (taking ui to be uncorrelated with εit):

cov(vis, vit) = cov(ui + εis, ui + εit) = var(ui) + cov(εis, εit)

If we assume cov(εis, εit) = 0, then ui is the only source of correlation over time, so its variance can be identified from the correlation of the residuals.
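The last step can be written out as a short derivation. The symbols σ_u² = var(ui) and σ_ε² = var(εit) are introduced here for convenience and are not in the original slides; the sketch also assumes that ui and εit are uncorrelated and that εit has the same variance in every period.

```latex
\begin{aligned}
\operatorname{cov}(v_{is}, v_{it}) &= \sigma_u^2 \qquad (s \neq t) \\
\operatorname{var}(v_{it}) &= \sigma_u^2 + \sigma_\varepsilon^2 \\
\rho = \operatorname{corr}(v_{is}, v_{it}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
\end{aligned}
```

This ρ is the quantity that xtreg labels rho (fraction of variance due to u_i). For example, sigma_u = .48687179 and sigma_e = .28128391 give ρ ≈ 0.2370 / (0.2370 + 0.0791) ≈ 0.7497.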

Pooled regression for panel data

The “standard” panel data regression model is:yit = ziα + xitβ + ui + εit

We have observations indexed by t = 1…Ti, i = 1…n.

• A pooled regression of y on z and x using all the data together would assume that there is no correlation across individuals, nor across time periods for any individual

• This would ignore the individual effect ui, which generates correlation between the residuals (ui + εi1, …, ui + εiTi) for each individual i

• So pooled regression doesn’t make best use of the data

• Under favorable conditions (if ui is uncorrelated with zi and xit), pooled regression gives unbiased but inefficient results, with incorrect standard errors, t-ratios, etc.

• If ui is correlated with zi and xit , pooled regression is also biased
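The point about incorrect standard errors can be illustrated on simulated data. This is a sketch under stated assumptions, not part of the original workshop material: the data-generating process, the seed and all variable names are made up, ui is drawn independently of the regressor (the favorable case), and cluster-robust standard errors via vce(cluster pid) are used here as one common way of allowing for within-individual correlation.

```stata
clear
set seed 12345
set obs 200
generate long pid = _n
generate double u  = rnormal()     // individual effect, independent of x
generate double xi = rnormal()     // persistent component of the regressor
expand 5
bysort pid: generate byte wave = _n

generate double x = xi + rnormal()
generate double y = 1 + 0.5*x + u + rnormal()

* Pooled OLS treating all 1000 rows as independent
regress y x
scalar se_naive = _se[x]

* Same coefficients, but standard errors allow correlation within pid
regress y x, vce(cluster pid)
scalar se_cluster = _se[x]

display se_naive, se_cluster
```

The coefficient on x is the same in both regressions; only the standard errors differ. When both the regressor and the residual are persistent within individuals, as in this setup, the naive standard error will typically be the smaller of the two.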

Fixed effects or random effects? Concepts and interpretation

If individuals are randomly sampled from the population, then ui is random.

In practice, with randomly sampled data, the FE/RE choice is based on whether a further assumption holds: that ui is uncorrelated with the regressors, E(ui | zi, Xi) = 0. If it holds, RE is appropriate; if not, an estimator that uses only within-group variation (FE) is needed.

Testing the hypothesis of uncorrelated effects

The random effects estimator (and any estimator that uses between-group variation) is only unbiased if the following hypothesis is true:

H0: E(ui | zi, Xi) = 0

It is important to test H0. There are various equivalent ways of doing so, including:
• Hausman test: is the difference β̂W − β̂GLS large?
• Between-within comparison: is β̂W − β̂B large?
• Mundlak approach: estimate the model
  yit = ziα + xitβ + x̄iδ + ηi + εit
  by GLS and test H0: δ = 0

Here β̂W, β̂B and β̂GLS are the within-group, between-group and GLS (random effects) estimates of β, and x̄i is the individual mean of xit.
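The Mundlak approach can be sketched in Stata by adding the individual mean of each time-varying regressor to a random effects regression and testing its coefficient. The example below runs on simulated data in which the individual effect is correlated with x by construction, so H0 is false; the data-generating process, the seed and the variable names (u, x, xbar, y) are all hypothetical.

```stata
clear
set seed 2008
set obs 300
generate long pid = _n
generate double u = rnormal()             // unobserved individual effect
expand 6
bysort pid: generate byte wave = _n
xtset pid wave

generate double x = 0.8*u + rnormal()     // x is correlated with u
generate double y = 1 + 0.5*x + u + rnormal()

* Individual mean of the time-varying regressor
bysort pid: egen double xbar = mean(x)

* Within-group (FE) estimate, kept for comparison
xtreg y x, fe
scalar b_fe = _b[x]

* Mundlak regression: RE with the individual mean added
xtreg y x xbar, re
display b_fe, _b[x]

* H0: coefficient on xbar = 0 (effects uncorrelated with x)
test xbar
```

Once xbar is included, the random effects coefficient on x coincides with the within-group estimate, and the test on xbar plays the same role as the Hausman comparison.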

BHPS example: feasible GLS estimates

. xtreg lwage age cohort, re

Random-effects GLS regression                   Number of obs      =     59615
Group variable (i): pid                         Number of groups   =     10077

R-sq:  within  = 0.1296                         Obs per group: min =         1
       between = 0.0589                                        avg =       5.9
       overall = 0.0503                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(2)       =   7967.85
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0305788   .0003524    86.78   0.000     .0298882    .0312694
      cohort |   .0183379   .0004847    37.84   0.000      .017388    .0192879
       _cons |  -35.09007   .9586169   -36.60   0.000    -36.96892   -33.21121
-------------+----------------------------------------------------------------
     sigma_u |  .48687179
     sigma_e |  .28128391
         rho |  .74974873   (fraction of variance due to u_i)
------------------------------------------------------------------------------

BHPS example: within-group estimates

. xtreg lwage age cohort, fe

Fixed-effects (within) regression               Number of obs      =     59615
Group variable (i): pid                         Number of groups   =     10077

R-sq:  within  = 0.1296                         Obs per group: min =         1
       between = 0.0543                                        avg =       5.9
       overall = 0.0363                                        max =        14

                                                F(1,49537)         =   7377.78
corr(u_i, Xb)  = -0.4386                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0308941   .0003597    85.89   0.000     .0301892    .0315991
      cohort |  (dropped)
       _cons |   .8987139   .0135417    66.37   0.000     .8721721    .9252558
-------------+----------------------------------------------------------------
     sigma_u |  .57521051
     sigma_e |  .28128107
         rho |  .80702022   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(10076, 49537) =    18.00        Prob > F = 0.0000

Example: BHPS Hausman test

. hausman fixed random

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed        random       Difference          S.E.
-------------+----------------------------------------------------------------
         age |    .0308941     .0305788        .0003153        .0000722
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       19.08
                Prob>chi2 =      0.0000
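With a single coefficient the statistic reduces to (difference / S.E.)², and the rounded figures in the table give (.0003153 / .0000722)² ≈ 19.07, in line with the reported 19.08. The p-value of 0.0000 rejects H0, so the FE and RE estimates differ systematically. The names fixed and random in the hausman command refer to stored estimation results. The sketch below shows the full sequence of commands on simulated data, since the BHPS extract is not reproduced here; the data-generating process, the seed and the variable names are hypothetical.

```stata
clear
set seed 2008
set obs 300
generate long pid = _n
generate double u = rnormal()             // unobserved individual effect
expand 6
bysort pid: generate byte wave = _n
xtset pid wave

generate double x = 0.8*u + rnormal()     // x is correlated with u
generate double y = 1 + 0.5*x + u + rnormal()

* Step 1: within-group (FE) estimates, consistent under H0 and Ha
xtreg y x, fe
estimates store fixed

* Step 2: GLS (RE) estimates, efficient under H0 only
xtreg y x, re
estimates store random

* Step 3: compare the two sets of coefficients
hausman fixed random
```

The consistent estimator is listed first in the hausman command and the efficient one second, matching the order used in the BHPS example.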

Summary of random effects model

Unlike a cross-sectional model, the RE model allows for an unobserved, time-invariant individual effect.

The key assumption of the RE model is that the individual effect is uncorrelated with the regressors.

We can test the key zero-correlation assumption using a Hausman or Mundlak test.

RE is more efficient than FE because it uses between-group variation as well as within-group variation.
