The Stata Journal

Volume 13 Number 4 2013

A Stata Press publication
StataCorp LP
College Station, Texas


The Stata Journal

Editors

H. Joseph Newton

Department of Statistics

Texas A&M University

College Station, Texas

[email protected]

Nicholas J. Cox

Department of Geography

Durham University

Durham, UK

[email protected]

Associate Editors

Christopher F. Baum, Boston College

Nathaniel Beck, New York University

Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy

Maarten L. Buis, WZB, Germany

A. Colin Cameron, University of California–Davis

Mario A. Cleves, University of Arkansas for Medical Sciences

William D. Dupont, Vanderbilt University

Philip Ender, University of California–Los Angeles

David Epstein, Columbia University

Allan Gregory, Queen’s University

James Hardin, University of South Carolina

Ben Jann, University of Bern, Switzerland

Stephen Jenkins, London School of Economics and Political Science

Ulrich Kohler, University of Potsdam, Germany

Frauke Kreuter, Univ. of Maryland–College Park

Peter A. Lachenbruch, Oregon State University

Jens Lauritsen, Odense University Hospital

Stanley Lemeshow, Ohio State University

J. Scott Long, Indiana University

Roger Newson, Imperial College, London

Austin Nichols, Urban Institute, Washington DC

Marcello Pagano, Harvard School of Public Health

Sophia Rabe-Hesketh, Univ. of California–Berkeley

J. Patrick Royston, MRC Clinical Trials Unit, London

Philip Ryan, University of Adelaide

Mark E. Schaffer, Heriot-Watt Univ., Edinburgh

Jeroen Weesie, Utrecht University

Ian White, MRC Biostatistics Unit, Cambridge

Nicholas J. G. Winter, University of Virginia

Jeffrey Wooldridge, Michigan State University

Stata Press Editorial Manager

Lisa Gilmore

Stata Press Copy Editors

David Culwell and Deirdre Skaggs

The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository papers that link the use of Stata commands or programs to associated principles, such as those that will serve as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go "beyond the Stata manual" in explaining key features or uses of Stata that are of interest to intermediate or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users (e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could be of interest or usefulness to researchers, especially in fields that are of practical importance but are not often included in texts or other journals, such as the use of Stata in managing datasets, especially large datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata with topics such as extended examples of techniques and interpretation of results, simulations of statistical concepts, and overviews of subject areas.

The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behavioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch), Scopus, and Social Sciences Citation Index.

For more information on the Stata Journal, including information for authors, see the webpage

http://www.stata-journal.com


Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone 979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at

http://www.stata.com/bookstore/sj.html

Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.

                                        U.S. and Canada    Elsewhere

Printed & electronic
  1-year subscription                        $ 98            $138
  2-year subscription                        $165            $245
  3-year subscription                        $225            $345
  1-year student subscription                $ 75            $ 99
  1-year institutional subscription          $245            $285
  2-year institutional subscription          $445            $525
  3-year institutional subscription          $645            $765

Electronic only
  1-year subscription                        $ 75            $ 75
  2-year subscription                        $125            $125
  3-year subscription                        $165            $165
  1-year student subscription                $ 45            $ 45

Back issues of the Stata Journal may be ordered online at

http://www.stata.com/bookstore/sjj.html

Individual articles three or more years old may be accessed online without charge. More recent articles may be ordered online at

http://www.stata-journal.com/archives.html

The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.

Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX 77845, USA, or emailed to [email protected].

Copyright © 2013 by StataCorp LP

Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.

The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part, as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal. Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions. This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.

Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting files understand that such use is made without warranty of any kind, by either the Stata Journal, the author, or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special, incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote free communication among Stata users.

The Stata Journal, electronic version (ISSN 1536-8734), is a publication of Stata Press. Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.


Volume 13 Number 4 2013

The Stata Journal

Articles and Columns 669

The Stata Journal Editors' Prize 2013: Erik Thorlund Parner and Per Kragh Andersen . . . . . . . . . . 669

Attributable and unattributable risks and fractions and other scenario comparisons . . . . . . . . . . Roger B. Newson 672

Dealing with identifier variables in data management and analysis . . . . . . . . . . P. Wilner Jeanty 699

Stochastic frontier analysis using Stata . . . . . . . . . . F. Belotti, S. Daidone, G. Ilardi, and V. Atella 719

Flexible parametric illness-death models . . . . . . . . . . S. R. Hinchliffe, D. A. Scott, and P. C. Lambert 759

Implementation of a double-hurdle model . . . . . . . . . . B. García 776

Little's test of missing completely at random . . . . . . . . . . C. Li 795

Testing for zero inflation in count models: Bias correction for the Vuong test . . . . . . . . . . B. A. Desmarais and J. J. Harden 810

Parametric inference using structural break tests . . . . . . . . . . Z. L. Flynn and L. M. Magnusson 836

cmpute: A tool to generate or replace a variable . . . . . . . . . . P. Royston 862

group2: Generating the finest partition that is coarser than two given partitions . . . . . . . . . . C. H. Salas Pauliac 867

A score test for group comparisons in single-index models . . . . . . . . . . P. Guimaraes 876

Software Updates 884


The Stata Journal (2013) 13, Number 4, pp. 669–671

The Stata Journal Editors' Prize 2013: Erik Thorlund Parner and Per Kragh Andersen

1 Prize announcement

The editors of the Stata Journal are delighted to announce the award of the Editors' Prize for 2013 to Erik Thorlund Parner and Per Kragh Andersen. The aim of the prize is to reward contributions to the Stata community for one or more outstanding articles published in the Journal in the previous three calendar years. For the original announcement of the prize and its precise terms of reference, see Newton and Cox (2012), which is accessible at the following website: http://www.stata-journal.com/sjpdf.html?articlenum=gn0052. The prize recognizes the outstanding article on "Regression analysis of censored data using pseudo-observations" (Parner and Andersen 2010).

Erik Thorlund Parner was born in 1967 and grew up in Aarhus, Denmark. All of his degrees are from Aarhus University: a bachelor's degree in mathematics and theoretical statistics, a master of science in mathematical statistics in 1995, and a doctorate for a thesis on inference in semiparametric frailty models in 1997. In 2001, he became an associate professor of biostatistics, and in 2011, he became a professor at the Department of Public Health, Section for Biostatistics, Aarhus University.

Parner's main interest is time-to-event analysis, in particular, methods for multivariate time-to-event data, interval-censored data, and the pseudo-observation method for censored data. He also works with instrumental-variable analysis with application in general practice. In addition to methodological aspects, he has a major interest in the etiology of autism and time trends in autism prevalence. He is the coauthor of 8 methodological articles and 40 applied articles, including 2 articles in the Stata Journal. He became an associate editor for the Scandinavian Journal of Statistics in 2013.

© 2013 StataCorp LP   gn0058



Per Kragh Andersen was born in 1952 and grew up in the Copenhagen area of Denmark. After earning a bachelor's degree in mathematics and computer science and a master of science in mathematical statistics in 1978, he became a founding staff member of the Statistical Research Unit, which later grew into the Department of Biostatistics in the Faculty of Health Sciences at the University of Copenhagen. Andersen was awarded a doctorate in mathematical statistics in 1982 for a thesis on statistical models for covariates' influence on the intensity of a point process and a medical doctorate in 1997 (all degrees conferred by the University of Copenhagen). In 1985, he became an associate professor of biostatistics, and in 1998, he became a professor of biostatistics at the University of Copenhagen.

Andersen's main research interest is the analysis of survival and event history data, with some current focus on methodology for pseudo-observations. He is a coauthor (with Ø. Borgan, R. D. Gill, and N. Keiding) of the 1993 book entitled Statistical Models Based on Counting Processes. With Niels Keiding, he edited the survival analysis part of the Encyclopedia of Biostatistics, which in 2006 appeared separately as Survival and Event History Analysis. In 2010, he coauthored the book Regression with Linear Predictors with Lene Theil Skovgaard. He is an author or coauthor of more than 100 methodological and 180 applied articles and has served on the editorial boards of several journals, including Biometrics, Scandinavian Journal of Statistics, Statistics in Medicine, and Lifetime Data Analysis.

The article by Parner and Andersen (2010) draws upon a series of articles in which a method based on pseudovalues is proposed for direct regression modeling of the survival function, the restricted mean, and the cumulative incidence function in competing risks with right-censored data. The models, once the pseudovalues have been computed, can be fit using standard generalized estimating equation software. Stata procedures are presented for computing these pseudo-observations. An example from a bone marrow transplantation study is used to illustrate the method.

The method introduced by Parner and Andersen (2010) is increasingly being used. The driving force behind the article is the innovative work of Andersen and his various coworkers and junior colleagues for nearly a decade. The pseudovalues idea opens up whole new ways of doing survival analysis with right-censored time-to-event observations. For example, the pseudovalues method allows us to fit regression models based on survival probability estimates at the level of the individual rather than in groups; it facilitates the analysis of survival data in the time domain rather than the nearly universal probability or (relative) hazards domain. As of now, the article by Parner and Andersen (2010) is the only evident implementation of these ideas in Stata, and as such, it is critical for people who wish to use the new techniques in Stata. The three programs all incorporate Mata code for speed and efficiency and have been found very reliable across repeated uses in different types and sizes of datasets.



As editors, we are indebted to the awardees for biographical material and to a necessarily anonymous nominator for a most helpful appreciation of their work.

H. Joseph Newton and Nicholas J. Cox
Editors, Stata Journal

2 References

Andersen, P. K., Ø. Borgan, R. D. Gill, and N. Keiding. 1993. Statistical Models Based on Counting Processes. New York: Springer.

Andersen, P. K., and L. T. Skovgaard. 2010. Regression with Linear Predictors. New York: Springer.

Keiding, N., and P. K. Andersen, eds. 2006. Survival and Event History Analysis. Chichester, UK: Wiley.

Newton, H. J., and N. J. Cox. 2012. Announcement of the Stata Journal Editors' Prize 2012. Stata Journal 12: 1–2.

Parner, E. T., and P. K. Andersen. 2010. Regression analysis of censored data using pseudo-observations. Stata Journal 10: 408–422.


The Stata Journal (2013) 13, Number 4, pp. 672–698

Attributable and unattributable risks and fractions and other scenario comparisons

Roger B. Newson
National Heart and Lung Institute
Imperial College London
London, UK
[email protected]

Abstract. Scenarios are alternative versions of the same dataset with the same variables but different observations or values. Applied scientists frequently want to predict how much good an intervention will do by comparing outcomes from the same model between different scenarios. Alternatively, they may want to compare outcomes between different models applied to the same scenario, for instance, when standardizing statistics from different subpopulations to a common gender and age distribution. Standard Stata tools for scenario means and comparisons are margins and pwcompare. A suite of packages is presented for estimating scenario means and comparisons by using margins, together with normalizing and variance-stabilizing transformations implemented by using nlcom. margprev estimates marginal prevalences; marglmean estimates marginal arithmetic means; regpar estimates the difference between two marginal prevalences (the population attributable risk); punaf estimates the ratio between two marginal arithmetic means (the population unattributable fraction); and punafcc estimates a marginal mean between-scenario risk or hazard ratio for case–control or survival data (also known as a population unattributable fraction). The population unattributable fraction and its confidence limits are subtracted from 1 to estimate the population attributable fraction. Formulas and examples are presented, including an example from the Global Allergy and Asthma European Network.

Keywords: st0314, margprev, marglmean, regpar, punaf, punafcc, margins, nlcom, population, unattributable, attributable, risk, fraction, PAR, PAF, PUF, scenario, comparison, standardization

1 Introduction

Applied scientists, especially in the public health sector, usually want to know how much good they can do. In particular, they might want to estimate, from the available data, how much reduction they would see in a disease rate if everybody stopped smoking or if all children received a proposed vaccine. Alternatively, they might compare disease rates between different subpopulations, discover heterogeneity, and wonder whether that heterogeneity is caused by confounding factors, such as differences in the age distribution between different subpopulations. After all, if subpopulation A has a higher rate of a particular cancer than subpopulation B, then this might be due to something in the environment of subpopulation A, to which subpopulation B is not exposed, or it might be due to subpopulation A being mostly older than subpopulation B. If we could eliminate the second possibility by standardizing the disease rates to a standard age distribution, then we might have evidence for the first possibility. In both cases, we are comparing scenarios. In the first case, we are comparing two different scenarios, using data from the same sample. In the second case, we are comparing the same scenario, using data from two different samples, one from subpopulation A and one from subpopulation B.

© 2013 StataCorp LP   st0314

In statistics, scenarios are alternative versions of the same data matrix, with equivalent columns (variables) but with different rows (observations). Different scenarios have a one-to-one correspondence between the columns, so equivalent columns have the same variable names. However, different scenarios may or may not have a one-to-one correspondence between equivalent rows. If we use regression methods, then we might want to estimate the scenario means of an outcome variable Y under different scenarios defined by specifying values for particular X variables. The X variables that vary between scenarios are known as exposures, and the other X variables, which are invariant between scenarios, are known as concomitant variables.

A seminal reference for scenario means and comparisons in generalized linear models is Lane and Nelder (1982). However, an important case is the estimation of population attributable fractions after fitting a logistic regression model, which is given with different formulas for cohort studies and for case–control studies by Greenland and Drescher (1993). These formulas were implemented in Stata by Brady (1998), who introduced the Stata 5 aflogit command. This command is still downloadable by using the command findit aflogit. However, it does not support factor variable lists, and the Stata 5 code sometimes has problems with the long variable names used in subsequent Stata versions. Another special case of a scenario comparison is the population attributable risk (PAR), which is defined in Gordis (2000).

In Stata 11, a new command, margins, was added (see [R] margins). margins inputs a set of estimation results and a set of X variables and outputs scenario means for expressions involving predicted Y values under one or more scenarios. These scenario means are estimated with covariance matrices, so the user can calculate confidence intervals for them. In Stata 12, the commands contrast and pwcompare were added (see [R] contrast and [R] pwcompare), along with the pwcompare and pwcompare() options for margins (see [R] margins, pwcompare). These commands can be used to calculate confidence intervals for differences between scenario means. However, users frequently want to estimate scenario means and their differences and ratios by using normalizing and variance-stabilizing transformations to generate confidence limits in which the user can have confidence. This can be done by using nlcom (see [R] nlcom).

This article introduces a suite of programs that call margins and nlcom to calculate scenario prevalences and means, their differences, their ratios, and other comparison statistics. These statistics are known as marginal means, marginal prevalences, and attributable and unattributable risks and fractions. Section 2 describes the commands. Section 3 describes the methods and formulas used. Finally, section 4 gives practical examples of the use of these commands.


2 The margprev, marglmean, regpar, punaf, and punafcc commands

2.1 Syntax

margprev [if] [in] [weight] [, atspec(atspec) subpop(subspec)
      predict(pred_opt) vce(vcespec) noesample force iterate(#) eform
      level(#) post]

marglmean [if] [in] [weight] [, atspec(atspec) subpop(subspec)
      predict(pred_opt) vce(vcespec) noesample force iterate(#) eform
      level(#) post]

regpar [if] [in] [weight] [, atspec(atspec) atzero(atspec0) subpop(subspec)
      predict(pred_opt) vce(vcespec) noesample force iterate(#) level(#)
      post]

punaf [if] [in] [weight] [, atspec(atspec) atzero(atspec0) subpop(subspec)
      predict(pred_opt) vce(vcespec) noesample force iterate(#) eform
      level(#) post]

punafcc [if] [in] [weight] [, atspec(atspec) subpop(subspec) vce(vcespec)
      noesample force iterate(#) eform level(#) post]

where atspec and atspec0 are specifications recognized by the at() option of margins, subspec is a subpopulation specification of the form recognized by the subpop() option of margins, and vcespec is a variance-covariance specification of the form recognized by margins and must have one of the values

      delta | unconditional

pweights, aweights, fweights, and iweights are allowed and are handled as they are by margins.

2.2 Description

The margprev, marglmean, regpar, punaf, and punafcc commands are for use after the parameters of a regression model have been fit by using an estimation command. They estimate a range of scenario prevalences, means and mean risk ratios, and their between-scenario comparisons (differences and ratios). These are estimated with confidence limits derived by using normalizing and variance-stabilizing transformations to estimate the transformed parameters and their dispersion matrix. A difference between two scenario prevalences is known as a PAR, and a ratio between two scenario arithmetic means, or a mean between-scenario risk ratio or hazard ratio, is known as a population unattributable fraction (PUF). When a PUF is estimated, a confidence interval is also calculated, using end-point transformation, for the population attributable fraction (PAF), which is derived by subtracting the PUF from 1. Table 1 lists the five commands, the estimated parameters, and the transformations used.

Table 1. List of commands with estimated parameters and transformations used

Package     Estimated parameters                          Transformations

margprev    1 marginal prevalence                         Logit
marglmean   1 marginal arithmetic mean                    Log
regpar      2 marginal prevalences and their              Logit, Fisher's z
            difference (PAR)
punaf       2 marginal arithmetic means and their         Log
            ratio (PUF)
punafcc     1 mean between-scenario risk or hazard        Log
            ratio (PUF)

2.3 Options

atspec(atspec) is a specification allowed as a value of the at() option of margins (see [R] margins). This specification must identify a single scenario (denoted "Scenario 1" in the output), defined as a fantasy world in which a subset of the predictor variables in the model is set to values that may be different from their values in the real world. In the case of punafcc, which is intended for use with case–control or survival data, the specification is restricted and may set variables only to values (not to statistics). If atspec() is not specified, then its default value is atspec((asobserved) _all), implying that scenario 1 is the baseline scenario, represented by the predictor values actually present in the dataset currently in memory.

atzero(atspec0) is available for regpar and punaf only. It specifies a specification allowed as a value of the at() option of margins. This specification must identify a single baseline scenario (denoted "Scenario 0" in the output), defined as an alternative fantasy world in which a subset of predictors in the model is set to the values specified by atspec0. Scenario 0 will then be compared with the scenario specified by the atspec() option, scenario 1. If atzero() is not specified, then its default value is atzero((asobserved) _all), implying that scenario 0 is the baseline scenario, represented by the predictor values actually present in the dataset currently in memory.


subpop(subspec), predict(pred_opt), vce(vcespec), noesample, and force function as the options of the same names for margins. subpop() specifies a subpopulation; predict() specifies a predict option; vce() specifies the formula used for calculating the dispersion matrix of the estimated parameters; noesample specifies that the estimated statistics will not be restricted to the current estimation sample; and force specifies that the scenario means will still be estimated even if there are potential problems detected by margins. The predict() option is not currently available for punafcc, but it enables the use of the other four commands after a multiple-equation command. For instance, after mlogit, the option predict(outcome(2)) allows scenario prevalences to be estimated and compared for the second value of a multinomial outcome. (See [R] mlogit.)

iterate(#) has the same form and function as the option of the same name for nlcom (see [R] nlcom). iterate() specifies the number of iterations used by nlcom to find the optimal step size to calculate the numerical derivatives of the transformed scenario means and comparisons, with respect to the original scenario means calculated by margins.

eform specifies that the command will display an estimate, p-value, and confidence limits instead of the log estimate; see the help files for margprev, marglmean, punaf, and punafcc for complete descriptions.

level(#) specifies the percentage confidence level to be used in calculating the confidence intervals. If not specified, then level() is taken from the current value of the c-class value c(level), which is usually level(95).

post specifies that the command will post in e() the estimation results for estimating the transformed scenario means and any comparisons (differences or ratios). If post is not specified, then any existing estimation results are left in e(). Note that the estimation results posted are for the transformed parameters and not for the parameters themselves. This is done because the estimation results are intended to define symmetric confidence intervals for the transformed parameters, which can be back-transformed to define asymmetric confidence intervals for the untransformed parameters and for the PAF in the case of punaf and punafcc.
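As an illustration, a minimal sketch of retrieving posted results, using the model of section 4 (the exact layout of the posted matrices is best checked with ereturn list after running it):

. logit low i.race i.smoke, vce(robust)
. punaf, at(smoke=0) post
. ereturn list              // everything that punaf posted
. matrix list e(b)          // log scenario means and log PUF
. matrix list e(V)          // their dispersion matrix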


2.4 Stored results

margprev, marglmean, regpar, punaf, and punafcc store the following results in r():

Scalars
  r(N)              number of observations
  r(rank)           rank of r(V)
  r(N_sub)          subpopulation observations
  r(N_clust)        number of clusters
  r(N_psu)          number of samples, primary sampling units, survey data only
  r(N_strata)       number of strata, survey data only
  r(df_r)           variance degrees of freedom, survey data only
  r(N_poststrata)   number of post strata, survey data only
  r(k_margins)      number of terms in marginlist
  r(k_by)           number of subpopulations
  r(k_at)           number of at() options
  r(level)          confidence level

Macros
  r(atzero)         atzero() option (regpar and punaf only)
  r(atspec)         atspec() option

Matrices
  r(cimat)          matrix of asymmetric confidence intervals (not stored by marglmean)
  r(b)              vector of estimated transformed parameters
  r(V)              dispersion matrix for transformed estimated parameters

The matrix r(cimat) is not stored by marglmean. It contains asymmetric confidence intervals (one per row) for the untransformed marginal prevalence in the case of margprev, for the untransformed marginal prevalences and their untransformed difference (the PAR) in the case of regpar, and for the PAF (equal to 1 − PUF) in the case of punaf and punafcc. The matrices r(b) and r(V) contain the estimate and dispersion matrix, respectively, for the transformed parameters, as indicated in table 1.
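For example, the PAR and its asymmetric limits can be recovered from r(cimat) for further processing (a minimal sketch; the row and column order follows the description above, with one row per parameter and columns for the estimate and the lower and upper limits):

. regpar, at(smoke=0)
. matrix C = r(cimat)       // copy before another r-class command overwrites it
. matrix list C
. display "PAR = " C[3,1] ", 95% CI from " C[3,2] " to " C[3,3]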


If post is specified, then margprev, marglmean, regpar, punaf, and punafcc also store the following results in e():

Scalars
  e(N)              number of observations
  e(rank)           rank of e(V)
  e(N_sub)          subpopulation observations
  e(N_clust)        number of clusters
  e(N_psu)          number of samples, primary sampling units, survey data only
  e(N_strata)       number of strata, survey data only
  e(df_r)           variance degrees of freedom, survey data only
  e(N_poststrata)   number of post strata, survey data only
  e(k_margins)      number of terms in marginlist
  e(k_by)           number of subpopulations
  e(k_at)           number of at() options

Macros
  e(cmd)            command name
  e(predict)        program used to implement predict
  e(atzero)         atzero() option (regpar and punaf only)
  e(atspec)         atspec() option
  e(properties)     b V

Matrices
  e(cimat)          matrix of asymmetric confidence intervals (not stored by marglmean)
  e(b)              vector of estimated transformed parameters
  e(V)              dispersion matrix for transformed estimated parameters
  e(V_srs)          simple-random-sampling-without-replacement (co)variance, V_srswor, if svy
  e(V_srswr)        simple-random-sampling-with-replacement (co)variance, V_srswr, if svy and fpc()
  e(V_msp)          misspecification (co)variance, V_msp, if svy and available

Functions
  e(sample)         marks estimation sample

3 Methods and formulas

This section is highly technical. The casual reader might like to skip it and proceed to section 4 and possibly return to this section for reference later.

The methods used are a combination of those in margins and in nlcom. We denote by $\boldsymbol\theta$ the vector of parameters estimated by the most recent model fit and denote by $f(z, \boldsymbol\theta)$ the function of the covariate row vector $z$ and the parameter vector $\boldsymbol\theta$ whose mean we want to estimate. In general, we aim to estimate a population parameter of the form

$$p(\boldsymbol\theta) = \frac{1}{M_R} \sum_{j=1}^{M} R_j\, f(Z_j, \boldsymbol\theta) \tag{1}$$


where $Z_j$ is the value of the covariate vector in the $j$th member of the population of $M$ observations, $R_j$ is a binary variable identifying membership of the $j$th observation in a subpopulation (0 for nonmembers and 1 for members), and $M_R$ is the size of the subpopulation identified by the $R_j$, equal to

$$M_R = \sum_{j=1}^{M} R_j$$

(This population of $M$ observations may or may not be the population from which our data are sampled.)

We aim to estimate $p(\boldsymbol\theta)$ using the sample statistic

$$\hat{p} = \frac{1}{w_\cdot} \sum_{j=1}^{N} r_j w_j f\bigl(z_j, \hat{\boldsymbol\theta}\bigr) \tag{2}$$

where $N$ is the number of observations in the sample, $z_j$ is the vector of covariates in the $j$th observation in the sample, $\hat{\boldsymbol\theta}$ is the estimate of the parameter $\boldsymbol\theta$ derived from the sample, $r_j$ is a binary variable identifying membership of the $j$th observation in a subsample corresponding to the subpopulation identified by the $R_j$, $w_j$ is the weight for the $j$th observation in the sample, and

$$w_\cdot = \sum_{j=1}^{N} r_j w_j$$

is the sum of weights in the subsample. These weights are normally chosen so that (2) is a consistent estimate of the population parameter $p(\boldsymbol\theta)$ in (1).

3.1 Scenario means estimated

The margprev, marglmean, regpar, punaf, and punafcc commands all start by estimating one or two population scenario means of the form (1) by using one or two corresponding sample scenario means of the form (2). Here scenarios are defined as alternative versions of the population and sample datasets, identified by alternative versions of the covariate vectors $Z_j$ and $z_j$, respectively. The scenarios are denoted scenario 1 (used by all five commands) and scenario 0 (currently used only by regpar and punaf). We will denote by $Z_j^{(0)}$ and $Z_j^{(1)}$ the values of the covariate vector for the $j$th population observation in scenarios 0 and 1, respectively, and denote by $z_j^{(0)}$ and $z_j^{(1)}$ the values of the covariate vector for the $j$th sample observation in scenarios 0 and 1, respectively. (We will continue to denote by $Z_j$ and $z_j$ the real-world values of the covariate vectors for the $j$th population observation and for the $j$th sample observation, respectively. Furthermore, we will assume that a mathematical function exists, deriving $Z_j^{(i)}$ from $Z_j$ and deriving $z_j^{(i)}$ from $z_j$, for $i \in \{0, 1\}$.)


Each of the commands estimates 1 or 2 scenario means $p^{(i)}(\boldsymbol\theta)$ of functions $f^{(i)}(z, \boldsymbol\theta)$, using estimators $\hat{p}^{(i)}$, for scenario indices $i \in \{0, 1\}$, over subpopulations defined by subpopulation indicators $R_j$ as in (1), using subsample indicators $r_j$ as in (2). The subpopulations and subsamples are the same for both scenarios. Therefore, for scenario $i$, the population scenario mean of (1) becomes

$$p^{(i)}(\boldsymbol\theta) = \frac{1}{M_R} \sum_{j=1}^{M} R_j f^{(i)}(Z_j, \boldsymbol\theta) \tag{3}$$

and the corresponding estimator of (2) becomes

$$\hat{p}^{(i)} = \frac{1}{w_\cdot} \sum_{j=1}^{N} r_j w_j f^{(i)}\bigl(z_j, \hat{\boldsymbol\theta}\bigr) \tag{4}$$

The commands vary in the specification of the functions to be averaged and of the subpopulations over which these functions are to be averaged. The subpopulation is governed by the subpop() option, which functions as the option of the same name for margins (see [R] margins). For a population index $j$ from 1 to $M$, we will denote by $S_j$ the binary variable indicating membership of the $j$th population observation in the subpopulation specified by the subpop() option. Similarly, for a sample index $j$ from 1 to $N$, we will denote by $s_j$ the binary variable indicating membership of the $j$th sample observation in the subsample specified by the subpop() option.

In the case of the commands margprev, marglmean, regpar, and punaf, the right-hand sides of (3) and (4) are specified by

$$R_j = S_j, \qquad r_j = s_j, \qquad f^{(i)}(Z_j, \boldsymbol\theta) = \mu\bigl(Z_j^{(i)}, \boldsymbol\theta\bigr), \qquad f^{(i)}\bigl(z_j, \hat{\boldsymbol\theta}\bigr) = \mu\bigl(z_j^{(i)}, \hat{\boldsymbol\theta}\bigr)$$

where $\mu(z, \boldsymbol\theta)$ specifies the conditional arithmetic mean calculated by predict for the covariate vector $z$ and the parameter vector $\boldsymbol\theta$.

In the case of the punafcc command, used for case–control and survival data, the definitions are slightly more complicated and depend on whether the most recent estimation command is stcox or some other estimation command. We will define the truth value $T(x)$ of a numeric value $x$ to be 1 if $x$ is nonzero, 0 if $x$ is 0, and missing if $x$ is missing. For a population index $j$ from 1 to $M$, we will define $Y_j$ to be the failure indicator variable _d, generated by the command stset, if the most recent estimation command is stcox and to be the dependent variable given by the estimation result e(depvar) if the most recent estimation command is another estimation command. Similarly, for a sample index $j$ from 1 to $N$, we will define $y_j$ to be the failure indicator variable _d, generated by the command stset, if the most recent estimation command is stcox and to be the dependent variable given by the estimation result e(depvar) if the most recent estimation command is another estimation command. (See [ST] stcox for documentation of stcox and [ST] stset for documentation of stset.) We will also denote by $\boldsymbol\beta$ the column vector containing the subvector of the parameter vector $\boldsymbol\theta$ containing the coefficients corresponding to the covariates of the $z$ vector and denote by $\hat{\boldsymbol\beta}$ the column vector containing the corresponding subvector of the parameter-estimate vector $\hat{\boldsymbol\theta}$. The right-hand sides of (3) and (4) are then specified by

$$\begin{aligned}
R_j &= S_j\, T(Y_j) \\
r_j &= s_j\, T(y_j) \\
f^{(i)}(Z_j, \boldsymbol\theta) &= \exp\bigl\{\bigl(Z_j^{(i)} - Z_j\bigr)\boldsymbol\beta\bigr\} \\
f^{(i)}\bigl(z_j, \hat{\boldsymbol\theta}\bigr) &= \exp\bigl\{\bigl(z_j^{(i)} - z_j\bigr)\hat{\boldsymbol\beta}\bigr\}
\end{aligned}$$

This implies that (3) is the population mean risk ratio (or hazard ratio) between scenario $i$ and the real world for the "subsubpopulation" of cases (or failures) of the subpopulation specified by the subpop() option and that (4) is a corresponding sample mean risk ratio (or hazard ratio) for the "subsubsample" of cases (or failures) of the subsample specified by the subpop() option. A mean between-scenario ratio is a subtly different quantity from a ratio between scenario means; however, both of these quantities are known as population unattributable fractions and can be subtracted from 1 to give population attributable fractions.

In all the above equations, the margprev, marglmean, regpar, and punaf commands assume that predict specifies a conditional arithmetic mean, and the punafcc command assumes that the parameters of the model are log odds or hazard ratios, while the truth values of the dependent or failure variable indicate case status or failure. It is the user's responsibility to ensure that these assumptions are true.
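A minimal sketch of the survival-data usage, using one of Stata's shipped example survival datasets (the dataset and the at() value are illustrative assumptions, not taken from this article):

. webuse drugtr, clear       // drug-trial data, already stset
. stcox age i.drug           // Cox regression; parameters are log hazard ratios
. punafcc, at(drug=0) eform  // PAF among failures for a no-treatment scenario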

Dispersion-matrix estimates for the estimated scenario means (4) are calculated by using methods depending on the vce() option as discussed in [R] margins.

3.2 Symmetric confidence intervals for transformed parameters

Having estimated the scenario means and their sampling dispersion matrix by using margins, we then estimate the transformed parameters by using the normalizing and variance-stabilizing transformations specified in table 1. This is done by using nlcom, so we will use similar notation to nlcom (see [R] nlcom). We will denote by $H$ the number of transformed parameters that we want to estimate and denote the vector of transformed parameters by

$$g(\boldsymbol\theta) = \{g_1(\boldsymbol\theta), \ldots, g_H(\boldsymbol\theta)\}$$


The $g_h(\boldsymbol\theta)$ are functions of the originally estimated parameter vector $\boldsymbol\theta$ that are estimated by using the corresponding $g_h(\hat{\boldsymbol\theta})$. However, we will define them in terms of the scenario means (3) estimated by margins. Table 2 lists the transformed parameters estimated by each command and identified by their formulas and their commonly used parameter names. The logit and log transformations are standard normalizing and variance-stabilizing transformations for the prevalences of binary variables and for the arithmetic means of nonnegative-valued variables and their ratios, respectively. The hyperbolic arctangent arctanh(), also known as Fisher's z transform, was recommended by Edwardes (1995) for the general Somers' D parameter, which is discussed extensively in Newson (2006) and includes as a special case the difference between two proportions, exemplified in the scenario-comparison case by the PAR.

The nlcom command inputs the estimates and dispersion matrix for the scenario means $p^{(i)}(\boldsymbol\theta)$, generated by margins, and outputs the estimates and dispersion matrix for the $g_h(\boldsymbol\theta)$ by using numerically estimated derivatives of the transformed parameters with respect to the scenario means. The output estimates vector and dispersion matrix are stored in r(b) and r(V), respectively. If the user specifies the post option, then these matrices are also stored in e(b) and e(V), respectively. In either case, the matrices can be used in the same way to compute symmetric confidence intervals for the transformed parameters.
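To see this mechanism at work, here is a minimal sketch of what punaf effectively computes for the log PUF, calling margins and nlcom directly (the 1._at and 2._at coefficient names are an assumption about how margins labels its posted results; margins, coeflegend confirms them):

. logit low i.race i.smoke, vce(robust)
. margins, at((asobserved) _all) at(smoke=0) post   // scenario 0, then scenario 1
. nlcom (lnPUF: ln(_b[2._at]) - ln(_b[1._at]))      // log of the ratio of scenario means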

Table 2. Transformed parameters expressed as functions of scenario means

Package     Parameter formulas                              Parameter names

margprev    g_1(θ) = logit{p^(1)(θ)}                        Logit prevalence
marglmean   g_1(θ) = log{p^(1)(θ)}                          Log arithmetic mean
regpar      g_1(θ) = logit{p^(0)(θ)}                        Logit prevalence
            g_2(θ) = logit{p^(1)(θ)}                        Logit prevalence
            g_3(θ) = arctanh{p^(0)(θ) − p^(1)(θ)}           z-transformed PAR
punaf       g_1(θ) = log{p^(0)(θ)}                          Log arithmetic mean
            g_2(θ) = log{p^(1)(θ)}                          Log arithmetic mean
            g_3(θ) = log{p^(1)(θ) / p^(0)(θ)}               Log PUF
punafcc     g_1(θ) = log{p^(1)(θ)}                          Log PUF

3.3 Asymmetric confidence intervals for untransformed parameters

Generally, the user really wants to see confidence intervals for arithmetic means and their ratios or for prevalences and their differences instead of seeing confidence intervals for the transformed parameters of table 2. In the case of the logged parameters estimated by marglmean, punaf, and punafcc, the eform option allows the user to view the untransformed parameters and their confidence limits. However, in the case of margprev, the eform option displays the odds and not the prevalence, and the eform option is not available for regpar. Moreover, even in the case of the logged parameters of punaf and punafcc, the user wants to estimate the PAF instead of the PUF. To cater for these cases, the commands of the punaf suite (except for marglmean) also output a matrix of confidence intervals for the untransformed parameters of interest. This confidence interval matrix is stored in r(cimat) and is also automatically listed in the output. For each command, it has one row for each of the $K$ parameters $c_k(\boldsymbol\theta)$ for $k \in \{1, \ldots, K\}$ and three columns containing the estimates, lower confidence limits, and upper confidence limits, respectively, of these parameters. The confidence intervals in this matrix are asymmetric.

Table 3 lists the parameters whose asymmetric confidence intervals are listed and saved in the confidence interval matrix by the four commands that produce such a matrix. In each case, the command computes a confidence interval for the transformed parameter $g_h(\boldsymbol\theta)$, with estimates and lower and upper confidence limits corresponding to the confidence level specified by the level() option, which defaults to level(95). The estimate, lower confidence limit, and upper confidence limit for the untransformed parameter $c_k(\boldsymbol\theta)$ are then derived by transforming the estimate, lower confidence limit, and upper confidence limit, respectively, for the transformed parameter (in the case of margprev and regpar) or by transforming the estimate, upper confidence limit, and lower confidence limit, respectively, for the transformed parameter (in the case of punaf and punafcc).

Table 3. Untransformed parameters expressed as functions of transformed parameters

Package     Parameter formulas                 Parameter names

margprev    c_1(θ) = invlogit{g_1(θ)}          Scenario 1 prevalence
regpar      c_1(θ) = invlogit{g_1(θ)}          Scenario 0 prevalence
            c_2(θ) = invlogit{g_2(θ)}          Scenario 1 prevalence
            c_3(θ) = tanh{g_3(θ)}              PAR
punaf       c_1(θ) = 1 − exp{g_3(θ)}           PAF (cohort or cross-sectional)
punafcc     c_1(θ) = 1 − exp{g_1(θ)}           PAF (case–control or survival)


4 Examples

4.1 Scenario comparisons in the lbw data using regpar

lbw.dta was discussed by Hosmer, Lemeshow, and Klar (1988) and is posted on the Stata Press website. It has one observation for each of a sample of 189 pregnancies and data on the birthweight of the baby and on a list of predictive variables. The most interesting of these variables is probably the mother's smoking status during pregnancy, coded as the binary variable smoke, which is equal to 1 if the mother smoked during pregnancy and 0 otherwise. We will estimate scenario comparisons from a logistic regression model to predict the binary variable low, indicating that the baby's birthweight was below 2,500 grams.

After loading the lbw data, we fit a logistic model of low with respect to the exposure factor smoke and the confounding factor race (1 for white, 2 for black, or 3 for other):

. use http://www.stata-press.com/data/r12/lbw.dta
(Hosmer & Lemeshow data)

. logit low i.race i.smoke, or vce(robust)

Iteration 0:   log pseudolikelihood =   -117.336
Iteration 1:   log pseudolikelihood = -110.10441
Iteration 2:   log pseudolikelihood = -109.98749
Iteration 3:   log pseudolikelihood = -109.98736
Iteration 4:   log pseudolikelihood = -109.98736

Logistic regression                               Number of obs   =        189
                                                  Wald chi2(3)    =      14.30
                                                  Prob > chi2     =     0.0025
Log pseudolikelihood = -109.98736                 Pseudo R2       =     0.0626

------------------------------------------------------------------------------
             |               Robust
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
          2  |   2.956742   1.420439     2.26   0.024     1.153162    7.581175
          3  |   3.030001   1.187272     2.83   0.005     1.405753    6.530954
             |
     1.smoke |   3.052631    1.10296     3.09   0.002     1.503568    6.197631
       _cons |   .1587319   .0515235    -5.67   0.000     .0840173    .2998882
------------------------------------------------------------------------------

We see that maternal smoking triples the odds of low birthweight and that having a mother of either of the two nonwhite maternal races has a similar effect on the odds. However, few of the public really understand odds ratios. They might understand more easily the difference that might result if all mothers quit smoking before pregnancy, but their racial mix remained the same as in the real world. The regpar command can estimate this difference, using the stored estimation results:


. regpar, at(smoke=0)

Scenario 0: (asobserved) _all
Scenario 1: smoke=0

Symmetric confidence intervals for the logit proportions
under Scenario 0 and Scenario 1
and for the z-transformed population attributable risk (PAR)
Total number of observations used: 189

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Scenario_0 |   -.789997   .1519305    -5.20   0.000    -1.087775   -.4922187
  Scenario_1 |  -1.215955   .2051031    -5.93   0.000     -1.61795   -.8139606
         PAR |   .0837153   .0266196     3.14   0.002     .0315419    .1358887
------------------------------------------------------------------------------

Asymmetric 95% CIs for the untransformed proportions
under Scenario 0 and Scenario 1
and for the untransformed population attributable risk (PAR)

             |  Estimate    Minimum    Maximum
  Scenario_0 | .31216931  .25203743  .37937104
  Scenario_1 | .22864901  .16548776  .30704715
         PAR | .08352031  .03153146  .13505843

regpar starts its output by specifying scenarios 0 and 1 in the language of the at() option of margins. Scenario 0 is (asobserved) _all, implying that all covariates and factors are as observed in our real-world sample. Scenario 1 is smoke=0, implying that no mothers smoke, but (by default) the factor race is distributed as in our real-world sample. regpar then displays the logit proportions with low birthweight under scenarios 0 and 1 and the z transform of the difference between these proportions, known as the PAR, with their standard errors, z statistics, p-values, and symmetric confidence limits. Finally, it displays the more comprehensible asymmetric confidence intervals for the untransformed scenario proportions and for their difference. We see that in the real world (Scenario 0), 31.2% of babies are expected to have a low birthweight but that in the dream scenario where no mothers smoke and their races stay the same (Scenario 1), only 22.9% of babies are expected to have a low birthweight. The difference between these scenario percentages (PAR) is 8.4%, with confidence limits from 3.2% to 13.5%. The PAR can be interpreted as the proportion of all babies that have low birthweight because they were born in scenario 0 instead of in scenario 1.
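These back-transformations can be verified directly from the symmetric-interval estimates in the listing above (a quick check using Stata's built-in functions):

. display invlogit(-.789997)    //  .31216931, the scenario 0 prevalence
. display invlogit(-1.215955)   //  .22864901, the scenario 1 prevalence
. display tanh(.0837153)        //  .08352031, the PAR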

Alternatively, we might want to communicate our message to an audience of smoking mothers, who might want to know how much they could do for their children if only they quit smoking before pregnancy. To answer this, we might use regpar with a subpop() option to compute an exposed-population attributable risk for the subpopulation of smoking mothers:


. regpar, at(smoke=0) subpop(if smoke==1)

Scenario 0: (asobserved) _all
Scenario 1: smoke=0

Symmetric confidence intervals for the logit proportions
under Scenario 0 and Scenario 1
and for the z-transformed population attributable risk (PAR)
Total number of observations used: 189

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Scenario_0 |  -.3829923   .2373852    -1.61   0.107    -.8482587    .0822742
  Scenario_1 |  -1.436486   .2279922    -6.30   0.000    -1.883343   -.9896299
         PAR |   .2166422   .0707321     3.06   0.002     .0780098    .3552746
------------------------------------------------------------------------------

Asymmetric 95% CIs for the untransformed proportions
under Scenario 0 and Scenario 1
and for the untransformed population attributable risk (PAR)

             |  Estimate    Minimum    Maximum
  Scenario_0 | .40540541  .29979827  .52055695
  Scenario_1 | .19209003  .13200536  .27098519
         PAR | .21331537  .07785194  .34104503

This time, the option subpop(if smoke==1) restricts the prediction to the subpopulation of smoking mothers, but scenarios 0 and 1 are defined as before. Once again, regpar displays the incomprehensible symmetric confidence intervals for the transformed parameters followed by the asymmetric confidence intervals for the untransformed parameters, which are probably more easily explained to smoking mothers. We see that the children of smoking mothers have a 40.5% prevalence of low birthweight, which might be reduced to 19.2% if their mothers quit smoking before pregnancy, while their racial mix remained the same. The difference is 21.3% with confidence limits from 7.8% to 34.1%.

Another possibility is to compare our zero-smoking dream scenario not with the intermediate world in which we live but with the nightmare scenario where all mothers started smoking. This is done by using the atzero() option, which can be used to reset scenario 0, as follows:


. regpar, at(smoke=0) atzero(smoke=1)

Scenario 0: smoke=1
Scenario 1: smoke=0

Symmetric confidence intervals for the logit proportions
under Scenario 0 and Scenario 1
and for the z-transformed population attributable risk (PAR)
Total number of observations used: 189

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  Scenario_0 |  -.1697027   .2464163    -0.69   0.491    -.6526697    .3132642
  Scenario_1 |  -1.215955   .2051031    -5.93   0.000     -1.61795   -.8139606
         PAR |   .2331622   .0759652     3.07   0.002     .0842732    .3820512
------------------------------------------------------------------------------

Asymmetric 95% CIs for the untransformed proportions
under Scenario 0 and Scenario 1
and for the untransformed population attributable risk (PAR)

             |  Estimate    Minimum    Maximum
  Scenario_0 | .45767584  .34238817  .57768182
  Scenario_1 | .22864901  .16548776  .30704715
         PAR | .22902683  .08407429  .36448745

We see that scenario 0 is set by the atzero() option to smoke=1, while scenario 1 is still smoke=0. Once again, regpar displays the symmetric confidence intervals for the transformed parameters followed by the asymmetric confidence intervals for the untransformed parameters. We see that if all mothers smoked and the racial mix stayed the same, then 45.8% of children might have low birthweight. The dream scenario prevalence, where no mothers smoke and the racial mix stays the same, is still 22.9%, as before. The difference in prevalence between the nightmare scenario 0 and the dream scenario 1 is 22.9% with confidence limits from 8.4% to 36.4%.

regpar might be even more useful if we had a large number of confounders instead of the single confounder race. In that case, we might want to reduce the potentially infinite-dimensioned confounder space to a finite-dimensioned confounder space by defining a propensity score for smoking, as recommended by Rosenbaum and Rubin (1983). Such a propensity score might be defined by using a logistic regression model to regress smoke with respect to the multiple confounders, then by using predict to define the smoking propensity score for each subject as the predicted probability of smoking for that subject. We might then define a grouping variable for the propensity score by using xtile (see [D] pctile) and then use the propensity-group variable in a second logistic regression model with low as the outcome and with smoking exposure and smoking-propensity group as the predictors. A problem with using propensity scores or groups as covariates in a logistic regression model is that the conditional odds ratio with respect to exposure, adjusted for the propensity score, is not the same quantity as the conditional odds ratio with respect to exposure, adjusted for the original confounders. This is in contrast to conditional mean differences (including prevalence differences) between exposed and unexposed subjects, where the mean difference conditional on the propensity score is equal to the mean difference conditional on the original covariates. Austin et al. (2007) argue that if we use the propensity-adjusted odds ratio to estimate the confounder-adjusted odds ratio, then our estimate is likely to be biased toward the null hypothesis that the odds ratio is 1, leading to an underestimation of the magnitude of the exposure effect. This problem can arguably be solved by fitting a logistic regression of disease with respect to exposure propensity and exposure and then by using regpar to define the exposure effect as a difference in marginal disease prevalences between a nightmare scenario where exposure propensity stays the same and all subjects are exposed and a dream scenario where exposure propensity stays the same and all subjects are unexposed.
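A minimal sketch of this two-stage approach (the confounder names x1-x5 are placeholders, because the lbw data contain only the single confounder race):

. logit smoke x1-x5                        // propensity model for the exposure
. predict pscore, pr                       // propensity score: predicted Pr(smoke)
. xtile pgroup = pscore, nquantiles(5)     // propensity-score quintile groups
. logit low i.smoke i.pgroup, vce(robust)  // disease model: exposure and propensity group
. regpar, at(smoke=0) atzero(smoke=1)      // nightmare-versus-dream prevalence difference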

4.2 Scenario comparisons in the lbw data using punaf

Then again, we might want to estimate the possibility for disease prevention as a proportion of the total disease burden of low birthweight instead of as a proportion of all babies. This can be done by using punaf after the same logistic regression model as before. punaf compares scenario arithmetic means (including scenario prevalences) by using ratios instead of differences. These ratios, known as PUFs, can then be subtracted from 1 to obtain PAFs. As a simple example, we compare the smoking-free dream scenario to the real world once again:

. punaf, at(smoke=0) eform
Scenario 0: (asobserved) _all
Scenario 1: smoke=0
Confidence intervals for the means under Scenario 0 and Scenario 1
and for the population unattributable fraction (PUF)
Total number of observations used: 189

               Mean/Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

  Scenario_0     .3121693   .0326225   -11.14   0.000     .2543534    .3831271
  Scenario_1      .228649   .0361738    -9.33   0.000     .1676887    .3117704

         PUF     .7324519   .0818807    -2.79   0.005     .5883333     .911874

95% CI for the population attributable fraction (PAF)
                 Estimate      Minimum      Maximum

         PAF     .2675481    .08812601    .41166675

We see that the scenarios are the same as in our first example with regpar, and that the scenario means, computed by using punaf, are the same as the untransformed scenario prevalences computed by using regpar. The confidence limits are slightly different because they are computed by using the log transform instead of the logit transform. The PUF is the ratio between the scenario 1 mean and the scenario 0 mean and represents the fraction of the scenario 0 disease burden that would remain if the babies were born in scenario 1. (Note that the eform option ensures that we see confidence intervals for the scenario means and their ratio instead of for their logs.) Finally, punaf subtracts the PUF (and its lower and upper confidence limits) from 1 to obtain the PAF (and its lower and upper confidence limits) and displays these in the bottom line of output. We see that 26.8% of the disease burden of low birthweight might be eliminated by eliminating maternal smoking, assuming that the racial mix stays the same, with confidence limits from 8.8% to 41.2%.
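As a check, the bottom line of the output is simply 1 minus the displayed PUF:

. display 1 - .7324519
.2675481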


4.3 margprev and marglmean in the lbw data

We can also estimate marginal prevalences and means without comparing them between different scenarios. The margprev command can estimate marginal odds and the corresponding marginal prevalences from the current estimation results. For instance, the marginal odds and prevalence of low birthweight in a world of smoking mothers with the existing race distribution could be estimated as follows:

. margprev, at(smoke==1) eform
Scenario 1: smoke==1
Confidence interval for the marginal odds
under Scenario 1
Total number of observations used: 189

                     Odds   Std. Err.      z    P>|z|     [95% Conf. Interval]

  Scenario_1     .8439156   .2079545    -0.69   0.491     .5206539    1.367883

Asymmetric 95% CI for the untransformed marginal prevalence
under Scenario 1

                 Estimate      Minimum      Maximum
  Scenario_1    .45767584    .34238817    .57768182

This time, only scenario 1 is specified because there is no scenario 0. margprev displays first the marginal odds (not the marginal log odds, because eform has been specified) and then a confidence interval for the marginal prevalence, which is the same as the one calculated for the same nightmare scenario by regpar.
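As a cross-check, the untransformed prevalence is just the marginal odds rescaled to the probability scale, odds/(1 + odds); up to rounding of the displayed odds,

. display .8439156/(1 + .8439156)
.45767584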

The marglmean command can estimate general marginal means for general nonnegative variables, using the log transform to calculate confidence intervals. For instance, we might fit a gamma-family regression model for the nonnegative variable bwt, representing birthweight in grams, with respect to race and smoking status, as follows, using the glm command detailed in Hardin and Hilbe (2012):


. glm bwt i.race i.smoke, family(gamma) link(log) eform vce(robust)

Iteration 0:   log pseudolikelihood = -1698.0172
Iteration 1:   log pseudolikelihood = -1697.9741
Iteration 2:   log pseudolikelihood = -1697.9741

Generalized linear models                         No. of obs      =        189
Optimization     : ML                             Residual df     =        185
                                                  Scale parameter =   .0555296
Deviance         =  12.0823464                    (1/df) Deviance =     .06531
Pearson          = 10.27297009                    (1/df) Pearson  =   .0555296

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   18.01031
Log pseudolikelihood = -1697.974084               BIC             =  -957.6409

                             Robust
         bwt      exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]

        race
          2     .8594198     .042562   -3.06   0.002     .7799205    .9470227
          3      .863627    .0360104   -3.52   0.000      .795855    .9371702

     1.smoke    .8697043     .032986   -3.68   0.000     .8073975    .9368193
       _cons    3332.454    97.62645  276.88   0.000     3146.499    3529.398

The parameters are a baseline arithmetic mean _cons (in grams) for the babies of nonsmoking white mothers, two arithmetic mean ratios for the babies of black and miscellaneous-race mothers, and an arithmetic mean ratio for the babies of smoking mothers compared with the babies of nonsmoking mothers of the same race. We can now use marglmean to estimate the marginal arithmetic mean, with asymmetric confidence limits, that would be expected if all mothers smoked and the race distribution remained the same:

. marglmean, at(smoke=1) eform
Scenario 1: smoke=1
Asymmetric confidence interval for the marginal mean
under Scenario 1
Total number of observations used: 189

                     Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]

  Scenario_1     2702.087   80.18231   266.28   0.000     2549.416    2863.902

We see that the mean birthweight in this scenario would be 2,702 grams with confidence limits from 2,549 grams to 2,864 grams. We could also use punaf to estimate the ratio (or PUF) between this scenario mean and the scenario mean where no mothers smoked (not shown to save space).
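A sketch of that comparison, run after the glm fit above and assuming that punaf accepts the atzero() option in the same way as regpar, would be

. punaf, at(smoke=1) atzero(smoke=0) eform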

4.4 punafcc in case–control and survival data

The punafcc command calculates unattributable and attributable fractions for case–control and survival data. The unattributable fraction in this case is a mean between-scenario odds ratio for cases (if used after a logistic estimation) or a mean between-scenario hazard ratio for lifetimes that terminated from the cause of interest (if used after a Cox survival regression) instead of a ratio of scenario means. Currently, the only scenarios that can be compared in this way are scenario 1 and the world in which we sampled the data.

downs.dta is an example of a case–control study dataset, described and used in epitab (see [ST] epitab) to demonstrate the cci command. The data are from Rothman, Greenland, and Lash (2008) and represent a case–control study whose outcome variable is Down syndrome in infants, with maternal spermicide use as the exposure and maternal age group as a confounding factor. The dataset has eight observations and four variables. These variables are three binary key variables (case, exposed, and age) identifying the eight observations uniquely and indicating case status, exposure status, and maternal age at or above 35 years, respectively, and one integer variable (pop) containing frequency weights for the combination of case status, exposure status, and age group indicated by the three key variables.

We start by loading downs.dta and fitting a full logistic regression model, allowing age odds ratios and different exposure odds ratios for the two age groups:

. webuse downs, clear

. logit case i.age i.exposed i.age#i.exposed [fweight=pop], or vce(robust)

Iteration 0:   log pseudolikelihood = -85.885722
Iteration 1:   log pseudolikelihood = -82.752975
Iteration 2:   log pseudolikelihood = -81.552365
Iteration 3:   log pseudolikelihood = -81.451562
Iteration 4:   log pseudolikelihood = -81.451332
Iteration 5:   log pseudolikelihood = -81.451332

Logistic regression                               Number of obs   =       1270
                                                  Wald chi2(3)    =      11.64
                                                  Prob > chi2     =     0.0087
Log pseudolikelihood = -81.451332                 Pseudo R2       =     0.0516

                             Robust
        case   Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

       1.age     4.104651   2.775961     2.09   0.037     1.090465    15.45044
   1.exposed     3.394231   2.290446     1.81   0.070     .9043692    12.73905

 age#exposed
        1 1      1.689141   2.389726     0.37   0.711      .105541      27.034

       _cons     .0084986    .002846   -14.24   0.000     .0044086    .0163831

These odds ratios are not easy to interpret at first sight, especially the interaction odds ratio, which is a ratio of ratios. We might find it easier to understand the fractions of Down syndrome births unattributable and attributable to spermicide exposure. These can be estimated by using punafcc. It is probably a good idea to use the vce(unconditional) option because the covariates exposure status and maternal age will definitely be subject to sampling error if we sample cases and controls and then measure exposure status and maternal age.


. punafcc, at(exposed=0) eform vce(unconditional)
Scenario 0: (asobserved) _all
Scenario 1: exposed=0
Confidence interval for the population unattributable fraction (PUF)
Total number of observations used: 1270

                    Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

         PUF      .816142   .1181495    -1.40   0.160     .6145268    1.083903

95% CI for the population attributable fraction (PAF)
                 Estimate      Minimum      Maximum

         PAF    .18385804   -.08390349    .38547325

We see from the PUF that in a fantasy scenario where no mothers were exposed to spermicide, we might expect the rate of Down syndrome to be 81.6% of that observed in the population from which our cases and controls were sampled, with 95% confidence limits from 61.5% to 108.4%. This allows the possibility that spermicide use might even be slightly protective, at least at some maternal ages. The PAF is computed by subtracting the PUF from 1 and therefore has confidence limits from −8.4% to 38.5%. These limits are wide enough to include 0 and even a small range of negative values.

Similarly, we can estimate unattributable and attributable fractions in the Stanford heart transplant dataset stan3, with one observation per study subject per time interval, where the time interval can be a pretransplant interval (present for all subjects) or a posttransplant interval (present only for subjects who received a transplant). We will fit the Cox regression model used in stcox (see [ST] stcox), where death is regressed with respect to the quantitative covariates year (year of acceptance) and age (age in years at start) and the binary variables posttran (indicating that the interval is posttransplant) and surgery (indicating prior heart surgery on entry). We do not need to use stset, because this has already been done to the dataset.


. use http://www.stata-press.com/data/r12/stan3, clear
(Heart transplant data)

. stcox age posttran surg year, vce(robust)

         failure _d:  died
   analysis time _t:  t1
                 id:  id

Iteration 0:   log pseudolikelihood = -298.31514
Iteration 1:   log pseudolikelihood =  -289.7344
Iteration 2:   log pseudolikelihood = -289.53498
Iteration 3:   log pseudolikelihood = -289.53378
Iteration 4:   log pseudolikelihood = -289.53378
Refining estimates:
Iteration 0:   log pseudolikelihood = -289.53378

Cox regression -- Breslow method for ties

No. of subjects =          103                    Number of obs   =        172
No. of failures =           75
Time at risk    =      31938.1
                                                  Wald chi2(4)    =      19.68
Log pseudolikelihood = -289.53378                 Prob > chi2     =     0.0006

                               (Std. Err. adjusted for 103 clusters in id)

                             Robust
          _t   Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

         age     1.030224   .0148771     2.06   0.039     1.001474    1.059799
    posttran     .9787243   .2961736    -0.07   0.943     .5408498    1.771104
     surgery     .3738278   .1304912    -2.82   0.005     .1886013    .7409665
        year     .8873107   .0613176    -1.73   0.084     .7749139     1.01601

We see the hazard ratios associated with each binary or quantitative covariate, with Huber (or robust) confidence limits.

We might want to know the fractions of mortality attributable and unattributable to subjects not having prior surgery. That is, we might want to ask how much the death rate in the study might have decreased if all patients had received heart surgery prior to joining the study and if acceptance years, ages, and transplant history during the study had been the same as in the real world, and to ask how much hazard would have remained. This can be done by using punafcc with the option vce(unconditional) as before, because the covariate values of lifetimes that ended in death will be subject to sampling error, assuming that deaths do not occur by design.


. punafcc, at(surgery==1) eform vce(unconditional)
Scenario 0: (asobserved) _all
Scenario 1: surgery==1
Confidence interval for the population unattributable fraction (PUF)
Total number of observations used: 172

                    Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

         PUF     .4239216   .1317422    -2.76   0.006     .2305459    .7794955

95% CI for the population attributable fraction (PAF)
                 Estimate      Minimum      Maximum

         PAF     .5760784    .22050449    .76945406

From the PUF, we see that giving all the subjects prior surgery, and changing nothing else, might have reduced mortality to 42.4% of the level observed. When this PUF is subtracted from 100% to get a PAF, we conclude that 57.6% of the mortality observed is attributable to subjects not having prior surgery, with confidence limits from 22.1% to 76.9%.

The option vce(unconditional), recommended here for use with punafcc, requires that the user specify vce(robust) in the estimation command generating the parameter estimates. Also, the interpretation of the unattributable and attributable fractions requires the assumption that the association between the outcome and the exposure altered in the fantasy scenarios is indeed causal, meaning that the outcome will change as predicted if we intervene to change the exposure.

4.5 Standardization as out-of-sample prediction

We can also compare outcomes between different models applied to the same scenario instead of between the same model applied to different scenarios. For instance, in a multicenter study, we might fit a logistic regression model of disease with respect to gender and age to the data from a center, input a dataset specifying a standard distribution of gender and age, and use margprev to estimate the marginal prevalence expected if the logistic model is applied to that standard population. This is an example of out-of-sample prediction, and the five commands introduced here have a noesample option to make this possible; this option is similar to the one of the same name for margins.

The Global Allergy and Asthma European Network (GA2LEN) survey is part of a multiregional European study on asthma and allergy in Europe. Sensitivity to a range of allergens was measured on a subsample of subjects in each region, using skin prick tests. We wanted to compare sensitivity prevalences, standardized to a common age distribution, between 13 European regions. To do this, we fitted a logistic regression model for sensitivity to each allergen in each region, with respect to gender and age, and then used margprev to estimate a standardized sensitivity prevalence.


For instance, in the case of sensitivity to cat allergen in the United Kingdom, the logistic model (fit by using sampling probability weights) was as follows:

. logit spt_cat male fquesagec [pweight=sampwt5], or

Iteration 0:   log pseudolikelihood = -1030.8768
Iteration 1:   log pseudolikelihood = -977.80033
Iteration 2:   log pseudolikelihood = -973.41056
Iteration 3:   log pseudolikelihood = -973.39866
Iteration 4:   log pseudolikelihood = -973.39866

Logistic regression                               Number of obs   =        159
                                                  Wald chi2(2)    =       4.04
                                                  Prob > chi2     =     0.1328
Log pseudolikelihood = -973.39866                 Pseudo R2       =     0.0558

                             Robust
     spt_cat   Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

        male     2.527963   1.535882     1.53   0.127     .7684525    8.316188
   fquesagec     .6700974   .2209261    -1.21   0.225     .3511585    1.278712
       _cons     .0794547   .0300632    -6.69   0.000     .0378487    .1667967

The variables spt_cat and male are binary indicators of skin-prick sensitivity to cat allergen and male gender, and the variable fquesagec is a continuous age centered by subtracting 48 years and divided by 10 years to be expressed in decades over 48 years. Therefore, the parameter _cons is a baseline sensitivity odds for 48-year-old women; the parameter male is a male-gender odds ratio; and the parameter fquesagec is a per-decade odds ratio for age, assuming the effect of age on odds to be exponential. To derive a standardized prevalence from these parameters, we first load (and list) a new dataset with one observation per gender per age group and data on the numbers of individuals in that gender and age group in a European standard population:


. use estanpop, clear

. list male agemin agemax agemean fquesagec stanpop, abbr(32) sepby(male)

       male   agemin   agemax   agemean   fquesagec   stanpop

  1.      0       20       24        22        -2.6      7000
  2.      0       25       29        27        -2.1      7000
  3.      0       30       34        32        -1.6      7000
  4.      0       35       39        37        -1.1      7000
  5.      0       40       44        42         -.6      7000
  6.      0       45       49        47         -.1      7000
  7.      0       50       54        52          .4      7000
  8.      0       55       59        57          .9      6000
  9.      0       60       64        62         1.4      5000
 10.      0       65       69        67         1.9      4000
 11.      0       70       74        72         2.4      3000

 12.      1       20       24        22        -2.6      7000
 13.      1       25       29        27        -2.1      7000
 14.      1       30       34        32        -1.6      7000
 15.      1       35       39        37        -1.1      7000
 16.      1       40       44        42         -.6      7000
 17.      1       45       49        47         -.1      7000
 18.      1       50       54        52          .4      7000
 19.      1       55       59        57          .9      6000
 20.      1       60       64        62         1.4      5000
 21.      1       65       69        67         1.9      4000
 22.      1       70       74        72         2.4      3000

In this dataset, male indicates male gender; agemin, agemax, and agemean contain minimum, maximum, and mean ages in years; fquesagec contains the mean age in decades centered at 48 years; and stanpop contains the number of individuals with that gender and age group in the European standard population. We can now estimate the marginal odds and prevalence by applying our model to this dataset, using stanpop as a frequency-weight variable:

. margprev [fweight=stanpop], eform noesample
Scenario 1: (asobserved) _all
Confidence interval for the marginal odds
under Scenario 1
Total number of observations used: 134000

                     Odds   Std. Err.      z    P>|z|     [95% Conf. Interval]

  Scenario_1     .1782219     .07486    -4.11   0.000     .0782391    .4059742

Asymmetric 95% CI for the untransformed marginal prevalence
under Scenario 1

                 Estimate      Minimum      Maximum
  Scenario_1    .15126346    .07256191     .2887494

We see the marginal odds and the more comprehensible marginal prevalence of 15.1% (95% confidence interval: 7.3% to 28.9%). The marginal odds for this region (the United Kingdom) and the 12 others were entered into the Statistical Software Components parmhet package to compute heterogeneity statistics. The I2 statistic of Higgins and Thompson (2002) was 46.4%, with a p-value of 0.033, so there seems to be heterogeneity in cat allergy prevalence between European regions not attributable to heterogeneity in gender and age distribution.

5 Acknowledgments

I thank all at StataCorp for making the margins command available. In particular, I thank Kristin MacDonald, Jennifer Rolfes, Miguel Dorta, and Jeff S. Pitblado for handling the demanding and highly technical margins queries with which I routinely bombarded them during the development of the suite of commands described here. I also thank Dr. Mohammadreza Bozorgmanesh of the Research Institute for Endocrine Sciences, Tehran, Iran, for drawing my attention to the literature on the PAF, especially on extensions of the PAF to survival analyses. I also thank the administrators of the GA2LEN survey for allowing me to include the example of section 4.5 in this article. The GA2LEN conducted an epidemiological project from 2007 to 2009 to examine respiratory and allergic disease in Europe. This work involved 24 research centers in 15 countries across Europe and was funded by the European Union. My own work at Imperial College London is financed by the United Kingdom Department of Health.

6 References

Austin, P. C., P. Grootendorst, S.-L. T. Normand, and G. M. Anderson. 2007. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: A Monte Carlo study. Statistics in Medicine 26: 754–768.

Brady, A. R. 1998. sbe21: Adjusted population attributable fractions from logistic regression. Stata Technical Bulletin 42: 8–12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 137–143. College Station, TX: Stata Press.

Edwardes, M. D. deB. 1995. A confidence interval for Pr(X < Y) − Pr(X > Y) estimated from simple cluster samples. Biometrics 51: 571–578.

Gordis, L. 2000. Epidemiology. 2nd ed. Philadelphia: Saunders.

Greenland, S., and K. Drescher. 1993. Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics 49: 865–872.

Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd ed. College Station, TX: Stata Press.

Higgins, J. P. T., and S. G. Thompson. 2002. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21: 1539–1558.

Hosmer, D. W., S. Lemeshow, and J. Klar. 1988. Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometrical Journal 30: 911–924.


Lane, P. W., and J. A. Nelder. 1982. Analysis of covariance and standardization as instances of prediction. Biometrics 38: 613–621.

Newson, R. 2006. Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6: 309–334.

Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41–55.

Rothman, K. J., S. Greenland, and T. L. Lash. 2008. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins.

About the author

Roger B. Newson is a lecturer in medical statistics at Imperial College London, UK, working principally in asthma research. He wrote the margprev, marglmean, regpar, punaf, punafcc, and parmhet Stata packages.


The Stata Journal (2013) 13, Number 4, pp. 699–718

Dealing with identifier variables in data management and analysis

P. Wilner Jeanty
Kinder Institute for Urban Research
and
Hobby Center for the Study of Texas
Rice University
Houston, TX

[email protected]

Abstract. Identifier variables are prominent in most data files and, more often than not, are essential to fully use the information in a Stata dataset. However, rendering them in the proper format and relevant number of digits appropriate for data management and statistical analysis might pose unnerving challenges to inexperienced or even veteran Stata users. To lessen these challenges, I provide some useful tips and guard against some pitfalls by featuring two official Stata routines: the string() function and its elaborated wrapper, the tostring command. I illustrate how to use these two routines to address the difficulties caused by identifier variables in managing and analyzing data from private institutions and U.S. government agencies.

Keywords: dm0071, identifier variables, leading zeros, FIPS codes, U.S. Census Bureau, Bureau of Economic Analysis, USDA, cross-sectional data, panel data

1 Introduction

Identifier variables are essential to fully use the information in a Stata dataset. The most typical examples of identifier variables in databases provided to the public by private and governmental institutions in the United States are geographic identifiers, also known as federal information processing standards (FIPS) codes. Assigned by the National Institute of Standards and Technology, FIPS codes are crucially useful for joining geographic records or observations for different data files and databases originating from different sources.

These standardized codes identify U.S. geographic areas from aggregate levels such as states to finer levels such as census blocks. States have two-digit codes, and counties have three-digit codes; there are also codes for metropolitan areas, census tracts, and census block groups. In county-level databases maintained by various U.S. institutions, each county is identified by a five-digit FIPS code, the first two digits of which identify the state. Often, depending on the data management and analysis tasks at hand, the data analyst must generate a new identifier variable by concatenating two or more existing variables, extracting a certain number of digits from an existing variable, or converting an existing variable from numeric to string and vice versa.

© 2013 StataCorp LP dm0071


Centered on the premise that identifier variables are best stored in string format, this article features two official Stata routines—the string() function and its well-carved-out wrapper, the tostring command—to accomplish this task, which sometimes may pose unnerving challenges to inexperienced or even veteran Stata users. Cox (2002) provides a lucid discussion of these two routines and their uses but does not emphasize their relevance for dealing with identifier variables. While the gist of my article revolves around FIPS codes, the highlighted principles apply to most identifier variables. In what follows, after a succinct overview of string() and tostring, I illustrate how the two routines can be used in managing data from institutions such as the U.S. Department of Agriculture (USDA), the U.S. Census Bureau, and the Bureau of Economic Analysis, which provide data with FIPS codes to the public.

2 Overview of Stata’s tostring and string()

Stata provides two commands and two functions to convert data from numeric to string format: the commands tostring and decode and the functions string() and strofreal(). Because decode simply creates a new string variable named newvar based on the "encoded" numeric variable varname and its value labels, it is not useful here. The strofreal() function is essentially a synonym for string(). Thus the focus here is on the tostring command and the string() function, although I will occasionally use other commands or functions to help when necessary.

The syntax for the string() function is string(numvar, "%fmt"). The first argument, numvar, can be either the name of a variable stored in numeric format or simply a number. The second argument, "%fmt", is useful for formatting numbers, time, or dates. tostring relies heavily on string() to convert numeric variables into their string equivalents. To mimic the format argument in string(), tostring provides a format(%fmt) option with the default format being "%12.0g". Stata carries about 10 different formats that can be used with both string() and tostring (see [D] format, [D] destring, and [D] functions).

Used with no format options specified, both string() and tostring will dutifully convert variables from numeric to string format. However, when the expression is greater than seven digits, using the string() function will result in loss of information. With tostring, the loss arises when the variables to be converted hold values exceeding 10 digits, even though the default format is "%12.0g". To preempt loss of information in these instances, the user must specify the second argument of string() or the format() option of tostring. While the general format %g can be used to ward off the loss, it is totally inadequate for handling leading zeros. The fixed format %f must be used to instruct Stata that the leading zeros should be inserted before converting variables from numeric to string. This will become clearer with examples.
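As a minimal illustration (with arbitrary values), the fixed format pads each value with leading zeros, so two zero-padded pieces concatenate into a five-digit code:

. display string(1, "%02.0f") + string(1, "%03.0f")
01001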


3 Generating identifier variables by concatenation

A five-digit FIPS code is often needed for merging. Some governmental institutions, such as the USDA or some units within the U.S. Census Bureau, publish data with separate FIPS codes for states and counties; this makes it a little difficult to join their datasets with datasets from other agencies delivering databases with a five-digit FIPS code. If you want to create a five-digit FIPS code variable, then state and county FIPS codes (most often read in as numeric in Stata) must be concatenated with leading zeros put in place before concatenation.

The need for creating a new identifier variable in this way is commonplace in many data management tasks where numeric variables have to be transformed into string variables before string functions can carry out string manipulations. For instance, researchers downloading Small Area Income and Poverty Estimates (SAIPE) data for school districts and counties from the U.S. Census Bureau have to grapple with this issue unless they have access to at least Stata 12 and elect to download SAIPE data in Excel format. SAIPE data from 1989 to 2010 can be downloaded in Excel (.xls) format or in text (.txt) format. In both file types, the state FIPS codes contain leading zeros, while the county FIPS codes do not.1 Unless the data are in Excel format and the user is using Stata 12 or higher, when read into Stata, both variables are stored as numeric, and all leading zeros, if any, are dropped. Thus it pays to know how to obtain a five-digit FIPS code variable stored in string format by concatenating state and county FIPS codes stored in numeric format while retaining the leading zeros.

For illustration, I will use a dataset downloaded from the USDA's National Agricultural Statistics Service's website via Quick Stats.2 A fraction of the dataset is displayed below.

1. For more details on SAIPE, see http://www.census.gov/did/www/saipe/data/index.html.
2. The USDA's Quick Stats version 2.0 is the most comprehensive tool for accessing agricultural data published by the National Agricultural Statistics Service. Using Quick Stats, researchers can query the National Agricultural Statistics Service database on the basis of commodity, location, or time period; they then can visualize the data on a map, manipulate and export the results, or save a link for future use. For more on Quick Stats, go to http://www.nass.usda.gov/Quick_Stats/.


. use fixit_dat2

. list state stfips county cofips in 1/6

        state   stfips       county   cofips

  1.  ALABAMA        1      AUTAUGA        1
  2.  ALABAMA        1      BALDWIN        3
  3.  ARIZONA        4     GREENLEE       11
  4.  FLORIDA       12   PALM BEACH       99
  5.  GEORGIA       13       ECHOLS      101

  6.  GEORGIA       13    EFFINGHAM      103

Now suppose you wish to generate a variable, say, fips, containing five-digit FIPS codes for each county by concatenating the variables stfips and cofips. You would undoubtedly want the variable to be stored in a string format. As Cox (2002) forcefully points out, code or identifier variables are better held as strings for ease of data processing. Let us put all this into context: after the fips variable is created, the first five observations should look like 01001, 01003, 04011, 12099, and 13101. This entails adding leading zeros to the variables stfips and cofips and converting them from numeric to string before concatenation.

Using tostring and its format() option, we can conveniently convert the two numeric variables while inserting the appropriate number of leading zeros and finally perform the concatenation. Recall that quotes are not required when specifying a format with the format() option of a command or with the format command itself, unlike when specifying the second argument of string().

. tostring stfips, gen(stateid) format(%02.0f)
stateid generated as str2

. tostring cofips, generate(countyid) format(%03.0f)
countyid generated as str3

. generate fips = stateid + countyid

. list fips in 1/5

       fips

  1.  01001
  2.  01003
  3.  04011
  4.  12099
  5.  13101


We can also use the concat() function of the egen command to concatenate string variables because their values will remain unchanged at the time of concatenation.

. drop fips

. egen fips=concat(stateid countyid)

. list fips in 1/5

       fips

  1.  01001
  2.  01003
  3.  04011
  4.  12099
  5.  13101

egen's concat() function is also a wrapper of string() because it converts variables from numeric to string before concatenation. Similarly to tostring, egen's concat() function provides a format() option to accommodate the format argument in string(). However, you may not use concat() and its format() option to directly convert stfips and cofips from numeric to string, insert the leading zeros, and finally concatenate. The problem is that two different formats must be used. The format(%fmt) option of any command allows only one format to be specified.

Coding

. egen str5 fips=concat(stfips cofips), format(%05.0f)

would generate a 10-digit rather than a 5-digit variable.
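A one-line check of why (again with arbitrary values): the single %05.0f format zero-pads each argument to five digits before concatenation, giving 10 digits in all.

. display string(1, "%05.0f") + string(1, "%05.0f")
0000100001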

Using string() with a fixed format as the second argument, you can do everything in one line.

. drop fips

. generate str5 fips= string(stfips, "%02.0f") + string(cofips, "%03.0f")

. list fips in 1/5

       fips

  1.  01001
  2.  01003
  3.  04011
  4.  12099
  5.  13101

Largely because of its ability to deal with more than one variable at a time and to automatically determine which string type is needed, tostring provides some efficiency, one of the purposes for which it was written (Cox and Wernow 2000). This advantage decreases when the string variables to be created require you to specify different formats. Still, you will see tostring's efficiency on display in section 5.


Often we would rather let Stata decipher the string data type to be created. In that case, a variable of type str1, the most compact string type, would be created, and then the replace command would automatically lead to the promotion of the variable to the appropriate type:

. generate str1 fips= ""

. replace fips = string(stfips, "%02.0f") + string(cofips, "%03.0f")

Although identifier variables are better stored in string format, if you want the FIPS codes to be numeric, you can use the real() function to make them so, but at the cost of losing the leading zeros (see another issue with the real() function in section 4.2).

. generate fips1 = real(fips)

Thus far, we have focused attention on cases where FIPS codes are read in numeric format when insheeted, which is by far the most encountered situation. Nonetheless, it may very well be the case that the variable is stored in string format but the leading zeros are left out. For instance, you may have used the user-written command labcenswdi, which somehow returns FIPS codes in string format but leaves out the leading zeros (Jeanty 2011).

More concretely, suppose you are dealing with counties within only one state and you have a three-digit county FIPS code variable for that state but want a five-digit FIPS code for merging. Then you would need to concatenate the state FIPS with the county FIPS while preserving the leading zeros. Suppose the variable holds the three-digit county FIPS codes for the Texas counties and is called countyfips. How you will obtain a five-digit FIPS code variable will depend on the storage format of countyfips, which can be string or numeric.

Suppose the countyfips variable is stored as numeric (for some reason, county FIPS codes take on only odd numbers: 1, 3, 5, . . . or 001, 003, 005, . . .). To obtain a five-digit FIPS code, you code

. generate str5 fips="48" + string(countyfips, "%03.0f")

Now suppose countyfips is stored as a string. The leading zeros may or may not be in place. If the leading zeros are in place, you code

. generate str5 fips="48" + countyfips

Otherwise, you code

. generate str5 fips="48" + string(real(countyfips), "%03.0f")

Relatedly, imagine that the previously discussed variables stfips and cofips were read in string format without the leading zeros. Now suppose you wish to retain the leading zeros and keep the same variable names. You would easily code

. replace stfips=string(real(stfips), "%02.0f")

. replace cofips=string(real(cofips), "%03.0f")


Concatenating the two variables to create a five-digit FIPS code variable simply entails coding

. generate str5 fips=stfips+cofips

Or you can code everything in one line:

. generate str5 fips=string(real(stfips), "%02.0f") +
>     string(real(cofips), "%03.0f")

As you can see, the way to concatenate two existing variables to generate a new identifier variable depends largely on the storage format of the existing variables and the presence or absence of leading zeros. How about generating a new identifier variable by extracting a certain number of digits from an existing string or numeric variable? This is the subject of the next section.

4 Generating identifier variables by extraction

This section covers the creation of identifier variables by extracting numerical characters from either string or numeric variables.

4.1 The case of short identifier variables

Consider a dataset with the first few observations on the first four variables shown below.

. use fixit_dat, clear

. list in 1/5

                address          city   state         zip

  1.       PO Box 93527     Cleveland      OH   441043104
  2.       PO Box 14690    Hanoverton      OH       44423
  3.       PO Box 12583   Valley City      OH   442809327
  4.      PO Box 297098        Rogers      OH       44455
  5.   4724 Delbeach Rd        Medina      OH   442568489

Here zip is a numeric variable taking on zip codes in both five-digit and nine-digit formats. Based on the variable zip, imagine you want to create a numeric five-digit zip code for the purpose of merging with another dataset. Because of the lack in Stata of a function like substr() for numeric variables, the zip variable must be converted into its string equivalent before the five digits can be extracted. One way to proceed is to use the tostring command to do the conversion and then apply the substr() and real() functions successively, as follows:

. tostring zip, gen(cutzip)
cutzip generated as str9

. generate zip1=real(substr(cutzip,1,5))


. list zip1 in 1/5

       zip1

  1.  44104
  2.  44423
  3.  44280
  4.  44455
  5.  44256

However, a more concise way would be to use the string() function with "%9.0f" as the second argument.

. generate zip2=real(substr(string(zip,"%9.0f"),1,5))

. list zip2 in 1/5

       zip2

  1.  44104
  2.  44423
  3.  44280
  4.  44455
  5.  44256

With the string() function, however, missing values would have resulted for all the nine-digit values of the zip variable if we had not specified the format "%9.0f" as a second argument. In addition, using a format different from the fixed type (%f) would do us a disservice. For example, consider

. generate zip3=real(substr(string(zip),1,5))
(14 missing values generated)

. list zip3 in 1/5

       zip3

  1.      .
  2.  44423
  3.      .
  4.  44455
  5.      .

The default format of string() truncated all the nine-digit values by converting them to scientific format as powers of 10, which caused loss of information. Against this backdrop is specifying "%9.0f" as a second argument. If applying the real(), substr(), and string(n, s) functions renders the generated variable in scientific format, you can always fix it by using the format command as

format varname %fmt


However, in the case of long identifier variables, you may still be surprised by unexpected values even after applying the format command. This issue is taken up below.

4.2 The case of long identifier variables

Consider the 12-digit census block group identifier variable stfid:

. use blockdata, clear

. format stfid %18.0f

. list stfid in 1/6

              stfid

  1.   390998107003
  2.   390998108003
  3.   390998109001
  4.   390998109001
  5.   390998109001

Suppose you wish to extract an 11-digit census tract identifier variable from the 12-digit census block group FIPS code stfid for merging. To do so, you can easily apply the string() function with the format argument:

. generate tract_stid=substr(string(stfid,"%12.0f"),1,11)

. list tract_stid in 1/5

        tract_stid

  1.   39099810700
  2.   39099810800
  3.   39099810900
  4.   39099810900
  5.   39099810900

The extracted census tract identifier variable is stored in string format by construction. What if, for some reason, you want it to be numeric rather than string? Instinctively, you would code

. generate tract_numid=real(tract_stid)

. format tract_numid %12.0f


However, listing a few observations shows values countering expectations.

. list tract_numid in 1/6

       tract_numid

  1.   39099809792
  2.   39099809792
  3.   39099809792
  4.   39099809792
  5.   39099809792

The conversion process has anomalously resulted in loss of information. This is a flagrant example where tostring will refuse to process a conversion from numeric to string unless the force option is specified to explicitly convey the approval of information loss.

How about typing the following?

. replace tract_numid=real(substr(string(stfid,"%12.0f"),1,11))
(0 real changes made)

The problem, not surprisingly, remains unsolved. The primary reason is that by default, Stata creates float variables. With such a large number of digits, a float cannot hold every integer exactly and rounds to a sparse grid of representable values (Cox 2006). If you need a numeric variable of this size, the key is to specify the data storage type as double, because the identifier variable takes on values with more than nine digits (too large even for a long, which tops out at 2,147,483,620). Doubles have as many as 16 digits of accuracy.
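A one-line demonstration, using the first tract value from above (the float() function rounds its argument to float precision):

. display %12.0f float(39099810700)
 39099809792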

. drop tract_numid

. generate double tract_numid=real(substr(string(stfid,"%12.0f"),1,11))

. format tract_numid %12.0f

. list tract_numid in 1/5

       tract_numid

  1.   39099810700
  2.   39099810800
  3.   39099810900
  4.   39099810900
  5.   39099810900

Whereas the difficulty of storing a nine-digit identifier variable in numeric format can be sidestepped, Cox (2002) provides two compelling reasons why identifier variables must be stored in string format: precision and data size. Yet the need to export your data to the ArcGIS3 software package for mapping and spatial statistical analysis is another key reason why you might want to store your identifier variable in string format. Stata and ArcGIS communicate fairly well. Data exported using the outsheet command or the new Stata 12 export excel command are readily usable in ArcGIS with no intermediary steps or other software.

3. ArcGIS is a registered trademark of Environmental Systems Research Institute Inc.


. outsheet using yourfilename.csv, names nolabel comma

Identifier variables generated in Stata and stored in string format can be used to join your data with basemaps or shapefiles by using a matching key variable also in string format. However, it is important to refrain from outsheeting variables holding values in scientific format. Consider a variable with the value 268398565. Without proper formatting, this number will be stored and displayed as 2.684e+08. And if outsheeted as such, the value that will be rendered is 268400000, which results in loss of information. To overcome this conundrum, you must use a %g or %f format to properly format variables holding numbers in scientific format before outsheeting the data. Users of Stata 12 or higher can now use the export excel command, which automatically handles such problems (see [D] import excel). With export excel, there is no need to apply any formatting for the correct values to be rendered in the Excel dataset.4 To export data to ArcGIS using export excel, you code5

. export excel using yourfilename.xls, firstrow(variables) sheet("yoursheetname")
>     nolabel

Conversely, text files generated from exporting attribute tables in ArcGIS can be readily imported into Stata with no intermediary steps or other software by using the insheet command.

. insheet using arcgisfile.txt, names clear

Certainly, your preference for either string or numeric will depend on your purpose, but bear in mind that in certain circumstances, Stata processes string and numeric variables differently. For instance, when sorting a string variable, Stata will sort "15" before "5", but when sorting a numeric variable, it will sort "5" before "15". One implication for a string variable taking on U.S. state or county FIPS codes is that the usual sort order of the U.S. states or counties in a dataset will get messed up. To remedy this, you must insert the leading zeros. With the leading zeros in place, Stata will sort "05" before "15", preserving the usual sort order of U.S. states and counties.
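A minimal demonstration with made-up data:

. clear
. input str2 s n
  "15" 15
  "5"   5
  end
. sort s
. list s     // string sort: "15" precedes "5"
. generate s2 = string(n, "%02.0f")
. sort s2
. list s2    // zero-padded: "05" now precedes "15"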

5 Importing identifier variables from a spreadsheet

Because many private and governmental institutions provide data to the public in spreadsheets, I will now address how to import long and short identifier variables from a spreadsheet. I begin with an emphasis on insheeting data in .csv format using the insheet command, available in both Stata 12 and previous Stata versions. I then highlight the advantages of importing the same data in Excel format using the new Stata 12 import excel command, not available in previous Stata versions. The data for this section are taken from the U.S. Census Bureau and the Bureau of Economic Analysis.

4. Note that export excel does not carry the format of decimal numbers onto the Excel dataset.
5. If the Stata dataset contains missing values, those values will be converted to <Null> in ArcGIS.


5.1 Importing long identifier variables

The U.S. Census Bureau now requires census data users to download gazetteer files containing variables on land area, water area, and geographic coordinates about the areal units for which the data were downloaded. This illustration concerns a gazetteer file containing census tract identifier variables (such as FIPS codes), land area, water area, and latitude and longitude for all census tracts in Texas.6 A primary interest here is to import into Stata the data provided in .csv format to create a small dataset to merge with census demographic data. Yet it is ineffective for all string variables holding value characters to be converted to their numeric representations when they are imported.

At this time, there is no way to tell Stata to read in some variables as string and others as numeric when insheeting a dataset containing string variables with numerical contents. This inconvenience, far from being a downside per se, is offset by the great amount of user friendliness and freedom of action that Stata provides. If you want to keep identifier variables in string format, there are two ways to do so. The first way is to insert a new row and place some text on top of the identifier variable you want to remain as string after reading it into Stata. This can be done in Excel before insheeting the data. Then you will need to drop the inserted row once you are in Stata. The second way is to convert the numeric variables back to string format after loading the data into Stata. We will take the second route, which is a little bumpier. For now, let us bring in the data and list a few observations.

. insheet using Gaz_tracts_48_coord.txt, names clear
(8 vars, 5265 obs)

. list geoid aland_sqmi awater_sqmi intptlat intptlong in 1/5, abb(11)

           geoid   aland_sqmi   awater_sqmi   intptlat   intptlong

  1.   4.800e+10      186.606         3.037   31.97147   -95.55244
  2.   4.800e+10         6.39          .115   31.73464   -95.81571
  3.   4.800e+10       27.981         1.015       31.8   -95.91238
  4.   4.800e+10        8.896          .038   31.78781    -95.6419
  5.   4.800e+10        7.974          .128    31.7502   -95.66921

The geoid variable is displayed in scientific format for taking very large values where the number of digits is greater than seven. But do not make too much of this glitch; it can easily be undone by using the format command.

. format geoid %11.0f

6. The U.S. Census Bureau provides gazetteer files for counties and lower summary levels such as census tracts, block groups, and so on. For the geographic units of interest, these files contain data on land area, water area, and latitudes and longitudes in decimal degrees. See http://www.census.gov/geo/www/gazetteer/gazette.html. Interestingly, the vintage of the geography (that is, the FIPS code or geographic identifier) in the 2010 gazetteer files is the same as that in the 2010 Census data downloads. If you download data from American FactFinder 2, you might as well download one of these files should you need the corresponding latitudes, longitudes, and land area for your geographies.


Using the format command at this point is as important as specifying the format() option when tostring is invoked below. I now list a few observations to give you a sense of the values taken by the geoid variable.

. list geoid in 1/5

         geoid

  1.   48001950100
  2.   48001950401
  3.   48001950402
  4.   48001950500
  5.   48001950600

If you want to keep the identifier variable as numeric, you can stop right there. But the goal is to obtain a string identifier variable from geoid. There are two prominent ways to proceed: keep the same variable name or use a new name such as fips. In the first case, tostring can do it all in one call:

. tostring geoid, replace force format(%11.0f)
geoid was double now str11

Let us pause for a moment. Why do we need to specify three options here? The answer is obvious. Specifying the force option explicitly conveys agreement on conversion from numeric to string, even if the conversion is potentially irreversible. By specifying the replace option, you confirm in essence that your old variable should be replaced with a new one. The format() option—the most important one here—is to forestall any loss of information. The default format used by tostring, %12.0g, is ineffective at preventing the loss of information.

The line of code above works flawlessly, but Cox (2011) rightly recommends using the underlying function directly when tostring's force option has to be invoked. To perform the conversion, you could directly use the string() function, the workhorse of tostring. However, if you want to keep the same variable name, using the string() function requires some intermediary steps.

. insheet using Gaz_tracts_48_coord.txt, names clear
(8 vars, 5265 obs)

. generate ngeoid=string(geoid, "%11.0f")

. drop geoid

. rename ngeoid geoid

As seen before, tostring obviates those intermediary steps. On the other hand, should a different name be needed for the geoid variable, then tostring and the string() function will involve the same number of calls. Whether to use string() or tostring in this instance is essentially immaterial and depends on personal preference. For example, suppose you want the new name to be fips. To use string(), you code


. insheet using Gaz_tracts_48_coord.txt, names clear
(8 vars, 5265 obs)

. generate fips =string(geoid, "%11.0f")

. drop geoid

A similar call to tostring is

. insheet using Gaz_tracts_48_coord.txt, names clear
(8 vars, 5265 obs)

. tostring geoid, format(%11.0f) force gen(fips)
fips generated as str11

. drop geoid

Note here the use of the gen() option rather than replace.

5.2 Importing short identifier variables

This example shows another downside of insheeting a spreadsheet in .csv format: the deletion of the leading zeros in the insheeted dataset.7 This problem is inherent in identifier variables being automatically converted from string to numeric when a dataset is insheeted. In the case of short identifier variables, the reverse conversion is much simpler. Consider a simple five-digit county FIPS code downloaded with transfer payment data from the website of the Bureau of Economic Analysis.

. insheet using transfer2009_csv.csv, names clear
(3 vars, 3138 obs)

. list fips in 1/5

      fips

  1.  1001
  2.  1003
  3.  1005
  4.  1007
  5.  1009

7. This problem is commonly encountered by 2010 U.S. Census data users. See the following frequently asked question at https://ask.census.gov/faq.php?id=5000&faqld=1647: American FactFinder: How do I replace the leading zeros in my database compatible (.csv) download when opening the download in Microsoft Excel (that is, GEO ID2)?


Converting back to string format and inserting leading zeros with the same variable name entail a simple call to tostring with the replace option. Also needed is the format() option to retain the leading zeros.

. tostring fips, replace format("%05.0f")
fips was long now str5

. list fips in 1/5

       fips

  1.  01001
  2.  01003
  3.  01005
  4.  01007
  5.  01009

Notice that here I did not specify the force option; omitting it is harmless, because with the %05.0f format the conversion loses no information.

5.3 Using the new import excel command in Stata 12 and higher

Earlier, I asserted that there is no way to tell Stata to treat some variables with numerical content as string and others as numeric when insheeting a .csv data file. This is true in Stata 12 and under. Yet a startling difference exists between Stata 12 and previous Stata versions when it comes to importing spreadsheets.

Recall the four main problems encountered when using the insheet command. First, string variables with numerical contents get converted to numeric. Second, all leading zeros are dropped. Third, identifier variables with large values get converted to scientific format. Fourth, numerical variables with 1,000-separator commas become characters. Against these shortcomings stands the new import excel command in Stata 12 and higher. Interestingly, import excel leaves identifier variables utterly intact upon reading Excel files. import excel can decipher for itself whether a variable holding numerical characters is string or numeric. Thus, with the advent of import excel, there is no need to instruct Stata whether a variable with numerical characters should be stored as string or numeric.

Consider import excel on the same gazetteer and transfer data files used earlier but now saved in Excel rather than .csv or .txt format.

. import excel Gaz_tracts_48_coord.xls, firstrow case(lower) clear

. list geoid in 1/5

geoid

1. 48001950100
2. 48001950401
3. 48001950402
4. 48001950500
5. 48001950600


It does not matter whether you are dealing with long or short identifier variables.

. import excel transfer2009_xls, firstrow case(lower) clear

. list fips in 1/5

fips

1. 01001
2. 01003
3. 01005
4. 01007
5. 01009

In both examples, note the use of the case() option to convert the variable names to lowercase; import excel, by default, preserves the case.

If you are working with identifier variables and have access to at least Stata 12, then you should save your data in Excel format. Even better is if you have access to Excel 2007 or 2010, in which case it would be wise to save the data in .xlsx format to take advantage of the flexibility of the import excel command. If the data provider offers both .csv and Excel formats, you have access to both Stata 12 and Excel 2007 or 2010, and, most importantly, the two data files only differ by their formats, then choose the Excel format. As indicated earlier, SAIPE provides data in both .csv and .xls formats. At this time, Stata documentation lacks technical details with regard to the size of a dataset that can be insheeted via the insheet command. Documentation on import excel, on the other hand, is inordinately terse and self-contained (see [D] import excel). An .xls worksheet may contain as many as 65,536 rows and 256 columns. The string size limit is 255 characters. The size limits for an .xlsx worksheet are 1,048,576 rows by 16,384 columns with a string size limit of 32,767 characters. Thus it is worth saving your data under the .xlsx format whenever possible. If your Stata flavor can load more variables and observations than allowed by import excel, you always have the option of importing your data bit by bit and then piecing the bits together afterward.8

6 Creating your own identifier variables

Often and for a number of reasons, you might need to generate your own identifier variable, numeric or string. Even more important might be the need to create a unique identification number (ID) for elements within panels. In such a case, an ID is required not only for each group or panel but also for each element of each panel. Two key system variables in Stata, _n and _N, upon which I will expand later, prove indispensable. To create a unique ID for each observation in a cross-section dataset, you code

. generate uniqid=_n

The variable uniqid, stored in numeric format, will take on values ranging from 1 to _N, where _N is the current number of observations. There are many reasons why you

8. Type help limits to know the limits of your Stata flavor.


might want to keep uniqid numeric. For instance, several Stata commands require a numeric ID variable. You might also want data points to be equally spaced on the x axis when graphing, a job perfect for a numeric ID variable. But chances are that you might also want this variable to be stored in string format. Again the string() function is handy.

. generate uniqid=string(_n)

Because this is a string variable, adding the leading zeros makes it a little more appealing. Doing so, however, requires some thinking because the number of leading zeros to be inserted must be in accord with the string variable length or the number of characters it contains.

. local slen=length(string(_N))

. generate str`slen' uniqid=string(_n,"%0`slen'.0f")
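As a quick check (a minimal sketch, assuming a dataset of 5,265 observations such as the gazetteer file used earlier is in memory), the local slen evaluates to 4, so every value of uniqid is padded to four characters:

. display length(string(_N))
4

. assert length(uniqid)==4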

As you can see, adding the leading zeros hinges upon knowing a priori the number of observations in your dataset. In the case of a panel or repeated-observation dataset, you first need to create an ID for each group and then an ID for each observation within each group. You might have a variable, say, myidvar, taking on repeated values on which you want to base the group or panel ID numbers. If that is the case, a simple way to proceed is to code

. egen groupid=group(myidvar)

. by groupid, sort: generate obsid=_n

where myidvar is the variable containing the elements for which a unique ID is needed.

Nonetheless, the notion that you have in your dataset an existing variable on which you can base the groups might be untenable. Instead, all you might want to do is group observations using a unique group ID and a unique observation ID. To do this, you will first have to decide how many groups you need and the number of elements in each group. Once you decide, the egen command and its seq() function are all you need to get the job done. Suppose you have a dataset of N observations and you want to create n1 groups of n2 observations so that n1 × n2 = N. To assign an ID to the groups and the group elements, you code

. egen groupid=seq(), from(1) to(n1) block(n2)

. by groupid, sort: generate obsid=_n

These two variables, of course, will be numeric.
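For concreteness, here is a self-contained sketch with hypothetical sizes, n1 = 10 and n2 = 50, so that 10 × 50 = 500 observations are split into 10 groups of 50:

. clear

. set obs 500

. egen groupid=seq(), from(1) to(10) block(50)

. by groupid, sort: generate obsid=_n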

In a panel-data setting, an interesting yet unnerving task is to generate a string identifier variable where each member of a group carries the ID number of the group and where the leading zeros are in place for both groups and group members. To proceed, we will build on the foundation laid above.

. quietly tabulate groupid

. local lnf=length(string(`r(r)'))


. by groupid, sort: generate nb=_N

. quietly summarize nb

. local mln=length(string(`r(max)'))

. by groupid, sort: generate obsid=string(groupid, "%0`lnf'.0f") +
>     string(_n, "%0`mln'.0f")

It is worth understanding these lines of code. Beginning with the second line, the string() function feeds on one of the results, r(r), which is the number of rows or, in this case, groups returned by the tabulate command. length() counts how many digits that number contains. Because length() is essentially a string function, string() is used to convert the number from numeric to string. The third line counts how many elements are in each group. Assuming the number of elements is the same for each group, lines 4 and 5 count the number of digits contained in that number. Building on the insights gained from the previous sections, line 6 creates an identifier variable by concatenating group ID and member ID after inserting leading zeros in both. Admittedly, things might become obfuscated in an unbalanced panel-data setting.

Also noteworthy is the heavy reliance on the system variables _n and _N. _n acts like an observation counter or marker; Stata thinks of it as the observation number of the current observation. _N typically indicates the total number of observations in the current dataset, including missing ones, or the number of observations in the current by() group. It stands out as the subscript of the last observation in a dataset or in the current by() group. In essence, typing display _N will display the number of observations in the dataset currently loaded in memory. These two crucially important built-in variables deserve due acquaintance for serious data management tasks. For more details, see [U] 13.4 System variables (_variables) and [U] 13.7 Explicit subscripting.
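As a small illustration (a sketch assuming the groupid variable created above), _n and _N can flag the first and last observation within each group:

. by groupid, sort: generate byte first=(_n==1)

. by groupid, sort: generate byte last=(_n==_N)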

7 Checking identifier variables for duplicates

As stressed before, one of the many reasons for maintaining identifier variables is to merge datasets obtained from different sources: to link data to geographic information system databases such as geographic boundary files or topologically integrated geographic encoding and referencing line shapefiles representing geographic features such as roads, rivers, and nonvisible legal boundaries; selected point features such as hospitals; or selected areas such as parks.9 Before you embark on such an endeavor, it is good practice to ensure that your identifier variable is unique, be it in cross-sectional or panel-data settings. Note that duplicate FIPS code elements in a dataset are very unlikely.

9. Topologically integrated geographic encoding and referencing LINE shapefiles for 2010 are available for download from the U.S. Census Bureau's website at http://www.census.gov/cgi-bin/geo/shapefiles2010/main.


To check for duplicates using the variable myidvar in a cross-sectional dataset, you type

. duplicates list myidvar

If duplicates are present, Stata will display their occurrence. Otherwise, Stata will respond with the message 0 observations are duplicates, an indication that only a one-to-one or a one-to-many merge can be done using the key variable myidvar.

To check for duplicates within panels in a panel dataset, you invoke the duplicates command by listing the panel identifier and then the observation identifier. This idea can be extended to census tracts within counties or block groups within census tracts.

. duplicates list panelidvar obsidvar

Once identified, duplicates can be easily dropped. To drop all but the first occurrence of each group of duplicated observations, you code

. duplicates drop myidvar

or in a panel dataset,

. duplicates drop panelidvar obsidvar
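A complementary check is the isid command, which verifies that one or more variables uniquely identify the observations and exits with an error message when they do not:

. isid myidvar

. isid panelidvar obsidvar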

If you are a Stata user who cannot afford to be without data from U.S. government agencies, the string() function and its elaborated wrapper, the tostring command, deserve to be part of your repertoire.

8 Conclusion

Identifier variables are key to data management and analysis. Importing and exporting them from and to other statistical software packages and rendering them to the proper format and relevant number of digits might be at times challenging for inexperienced or even veteran Stata users. Arguably, the new import excel and export excel commands provided in Stata 12 or higher can relieve Stata users from most of the data management burdens inherent in importing or exporting identifier variables when the data format is of the type .xls or .xlsx. Many private and governmental institutions continue to deliver data to the public in .csv format. Dealing with identifier variables when the data are in this format involves many unwieldy challenges. Featuring Stata's string() function and its discreetly carved-out wrapper, the tostring command, this article addressed most, if not all, of those challenges. The principles highlighted here will enable better and more efficient management of data about U.S. geographic areas.

9 Acknowledgments

The author thanks an anonymous reviewer and Deborah Perez for their useful comments on this article.


10 References

Cox, N. J. 2002. Speaking Stata: On numbers and strings. Stata Journal 2: 314–329.

———. 2006. Stata tip 33: Sweet sixteen: Hexadecimal formats and precision problems. Stata Journal 6: 282–283.

———. 2011. Speaking Stata: Fun and fluency with functions. Stata Journal 11: 460–471.

Cox, N. J., and J. B. Wernow. 2000. dm80: Changing numeric variables to string. Stata Technical Bulletin 56: 8–12. Reprinted in Stata Technical Bulletin Reprints, vol. 10, pp. 24–28. College Station, TX: Stata Press.

Jeanty, P. W. 2011. Managing the U.S. Census 2000 and World Development Indicators databases for statistical analysis in Stata. Stata Journal 11: 589–604.

About the author

P. Wilner Jeanty is a research scientist for the Kinder Institute for Urban Research and the Hobby Center for the Study of Texas, Rice University. Jeanty has broad interests in urban and regional economics, environmental economics, development economics, and applied econometrics. His research applies statistical theory and modeling techniques from the fields of spatial, environmental, and regional economics, including the applications of spatial econometrics, geographic information systems, and econometrics of nonmarket valuation and panel data. Jeanty also has experience in survey design and implementation and in statistical software programming.


The Stata Journal (2013) 13, Number 4, pp. 719–758

Stochastic frontier analysis using Stata

Federico Belotti
Centre for Economic and International Studies
University of Rome Tor Vergata
Rome, Italy
[email protected]

Silvio Daidone
Centre for Health Economics
University of York
York, UK
[email protected]

Giuseppe Ilardi
Economic and Financial Statistics Department
Bank of Italy
Rome, Italy
[email protected]

Vincenzo Atella
Centre for Economic and International Studies
University of Rome Tor Vergata
Rome, Italy
[email protected]

Abstract. This article describes sfcross and sfpanel, two new Stata commands for the estimation of cross-sectional and panel-data stochastic frontier models. sfcross extends the capabilities of the frontier command by including additional models (Greene, 2003, Journal of Productivity Analysis 19: 179–190; Wang, 2002, Journal of Productivity Analysis 18: 241–253) and command functionalities, such as the possibility of managing complex survey data characteristics. Similarly, sfpanel allows one to fit a much wider range of time-varying inefficiency models compared with the xtfrontier command, including the model of Cornwell, Schmidt, and Sickles (1990, Journal of Econometrics 46: 185–200); the model of Lee and Schmidt (1993, in The Measurement of Productive Efficiency: Techniques and Applications), a production frontier model with flexible temporal variation in technical efficiency; the flexible model of Kumbhakar (1990, Journal of Econometrics 46: 201–211); the inefficiency effects model of Battese and Coelli (1995, Empirical Economics 20: 325–332); and the "true" fixed- and random-effects models of Greene (2005a, Journal of Econometrics 126: 269–303). A brief overview of the stochastic frontier literature, a description of the two commands and their options, and examples using simulated and real data are provided.

Keywords: st0315, sfcross, sfpanel, stochastic frontier analysis, production frontier, cost frontier, cross-sectional, panel data

1 Introduction

This article describes sfcross and sfpanel, two new Stata commands for the estimation of parametric stochastic frontier (SF) models using cross-sectional and panel data.

© 2013 StataCorp LP st0315


Since the publication of the seminal articles by Meeusen and van den Broeck (1977) and Aigner, Lovell, and Schmidt (1977), this class of models has become a popular tool for efficiency analysis; a stream of research has produced many reformulations and extensions of the original statistical models, generating a flourishing industry of empirical studies. An extended review of these models can be found in the recent survey by Greene (2012).

The SF model is motivated by the theoretical idea that no economic agent can exceed the ideal "frontier", and deviations from this extreme represent individual inefficiencies. From the statistical point of view, this idea has been implemented by specifying a regression model characterized by a composite error term in which the classical idiosyncratic disturbance, aiming at capturing measurement error and any other classical noise, is included with a one-sided disturbance that represents inefficiency.1 Whether cross-sectional or panel data, production or cost frontier, time invariant or varying inefficiency, parametric SF models are usually estimated by likelihood-based methods, and the main interest is on making inference about both frontier parameters and inefficiency.

The estimation of SF models is already possible using official Stata routines. However, the available commands cover a restricted range of models, especially in the panel-data case.

The sfcross command mirrors the frontier command's functionality but adds new features such as i) the estimation of normal-gamma models via simulated maximum likelihood (SML) (Greene 2003); ii) the estimation of the normal-truncated normal model proposed by Wang (2002), in which both the location and the scale parameters of the inefficiency distribution can be expressed as a function of exogenous covariates; and iii) the opportunity to manage complex survey data characteristics (via the svyset command).

As far as panel-data analysis is concerned, the Stata xtfrontier command allows the estimation of a normal-truncated normal model with time-invariant inefficiency (Battese and Coelli 1988), and a time-varying version named the "time decay" model (Battese and Coelli 1992). Our sfpanel command allows one to fit a wider range of time-varying inefficiency models, including the model of Cornwell, Schmidt, and Sickles (1990), that of Lee and Schmidt (1993), the flexible model of Kumbhakar (1990), the time decay and the inefficiency effects models of Battese and Coelli (1992, 1995), and the "true" fixed- (TFE) and random-effects (TRE) models of Greene (2005a). For the last two models, the command allows different distributional assumptions, providing the modeling of both inefficiency location and scale parameters. Furthermore, the command allows the estimation of the random-effects time-invariant inefficiency models of Pitt and Lee (1981) and Battese and Coelli (1988), as well as the fixed-effects version of the model of Schmidt and Sickles (1984), which is characterized by no distributional assumptions on the inefficiency term. In addition, because the main objective

1. The literature distinguishes between production and cost frontiers. The former represents the maximum amount of output that can be obtained from a given level of inputs, while the latter characterizes the minimum expenditure required to produce a bundle of outputs given the prices of the inputs used in its production.


of SF analysis is the estimation of inefficiency, we provide postestimation routines to compute both inefficiency and efficiency scores, as well as their confidence intervals (Jondrow et al. 1982; Battese and Coelli 1988; Horrace and Schmidt 1996). Finally, sfcross and sfpanel also allow the simultaneous modeling of heteroskedasticity in the idiosyncratic error term.

In the development of these new commands, we make extensive use of Mata to speed up the estimation process. We allow for the use of Stata factor variables, weighted estimation, constrained estimation, resampling-based variance estimation, and clustering. Moreover, by using Mata structures and libraries, we provide a very readable code that can be easily developed further by other Stata users. All of these features make the commands simple to use, extremely flexible, and fast; they also ensure the opportunity to fit state-of-the-art SF models.

Finally, we would like to emphasize that sfpanel offers the possibility of performing a constrained fixed-effects estimation, which is not yet available with xtreg. Moreover, the models of Cornwell, Schmidt, and Sickles (1990) and Lee and Schmidt (1993), although proposed in the SF literature, are linear panel-data models with time-varying fixed effects and thus potentially very useful in other contexts.

The article is organized as follows. In section 2, we present a brief review of the SF approach evolution, focusing on the models that can be estimated using the proposed commands. Sections 3 and 4 describe the syntax of sfcross and sfpanel, focusing on the main options. Sections 5 and 6 illustrate the two commands using simulated data and two empirical applications from the SF literature. Finally, section 7 offers some conclusions.

2 A review of stochastic frontier models

We begin our discussion with a general formulation of the SF cross-sectional model and then review extensions and improvements that have been proposed in the literature, focusing on those models that can be estimated using sfcross and sfpanel. Given the large number of estimators allowed by the two commands, we deliberately do not discuss the derivation of the corresponding criterion functions. We refer the reader to the cited works for details on the estimation of each model. A summary of all estimable models and their features is reported in table 1.

2.1 Cross-sectional models

Consider the following SF model,

yi = α + x′iβ + εi,  i = 1, . . . , N    (1)
εi = vi − ui    (2)
vi ∼ N(0, σ²v)    (3)
ui ∼ F    (4)


where yi represents the logarithm of the output (or cost) of the ith productive unit, xi is a vector of inputs (input prices and quantities in the case of a cost frontier), and β is the vector of technology parameters. The composed error term εi is the sum (or the difference) of a normally distributed disturbance, vi, representing measurement and specification error, and a one-sided disturbance, ui, representing inefficiency.2 Moreover, ui and vi are assumed to be independent of each other and independent and identically distributed across observations. The last assumption about the distribution F of the inefficiency term is needed to make the model estimable. Aigner, Lovell, and Schmidt (1977) assumed a half-normal distribution, that is, ui ∼ N+(0, σ²u), while Meeusen and van den Broeck (1977) opted for an exponential one, ui ∼ E(σu). Other commonly adopted distributions are the truncated normal (Stevenson 1980) and the gamma distributions (Greene 1980a,b, 2003).

The distributional assumption required for the identification of the inefficiency term implies that this model is usually fit by maximum likelihood (ML), even if modified ordinary least-squares or generalized method of moments estimators are possible (often inefficient) alternatives.3 In general, SF analysis is based on two sequential steps: in the first, estimates of the model parameters θ are obtained by maximizing the log-likelihood function ℓ(θ), where θ = (α, β′, σ²u, σ²v)′.4 In the second step, point estimates of inefficiency can be obtained through the mean (or the mode) of the conditional distribution f(ui|εi), where εi = yi − α − x′iβ.

The derivation of the likelihood function is based on the independence assumption between ui and vi. Because the composite model error εi is defined as εi = vi − ui, its probability density function is the convolution of the two component densities as

fε(εi) = ∫₀^{+∞} fu(ui) fv(εi + ui) dui    (5)

Hence, the log-likelihood function for a sample of n productive units is

ℓ(θ) = Σᵢ₌₁ⁿ log fε(εi|θ)

The marginalization of ui in (5) leads to a convenient closed-form expression only for the normal-half normal, normal-exponential, and normal-truncated normal models. In all other cases (for example, the normal-gamma model), numerical- or simulation-based techniques are necessary to approximate the integral in (5).

The second estimation step is necessary because the estimates of the model parameters allow for the computation of residuals ε but not for the inefficiency estimates.

2. In this section, we consider only production functions. However, the sign of the ui term in (2) is positive or negative depending on whether the frontier describes a cost or a production function, respectively.

3. Notice that when a distributional assumption on u is made, sfcross and sfpanel estimate model parameters by likelihood-based techniques.

4. Different model parameterizations are used in the SF literature as, for example, θ = (α, β′, σ², λ)′, where σ² = σ²u + σ²v and λ = σu/σv.


Because the main objective of SF analysis is the estimation of technical (or cost) efficiency, a strategy for disentangling this unobserved component from the compounded error is required. As mentioned before, the most well-known solutions to this problem, proposed by Jondrow et al. (1982) and Battese and Coelli (1988), exploit the conditional distribution of u given ε. Thus a point estimate of the inefficiencies can be obtained using the mean E(u|ε) (or the mode M[u|ε]) of this conditional distribution. Once point estimates of u are obtained, estimates of the technical (cost) efficiency can be derived as

Eff = exp(−u)

where u is either E(u|ε) or M(u|ε).5

2.2 Panel-data models

The availability of a richer set of information in panel data allows one to relax some of the assumptions previously imposed and to consider a more realistic characterization of the inefficiencies.

Pitt and Lee (1981) were the first to extend the model (1)–(4) to longitudinal data. They proposed the ML estimation of the following normal-half normal SF model:

yit = α + x′itβ + εit,  i = 1, . . . , N,  t = 1, . . . , Ti    (6)
εit = vit − ui
vit ∼ N(0, σ²v)
ui ∼ N+(0, σ²u)

The generalization of this model to the normal-truncated normal case has been proposed by Battese and Coelli (1988).6 As pointed out by Schmidt and Sickles (1984), the estimation of an SF model with time-invariant inefficiency can also be performed by adapting conventional fixed-effects estimation techniques, thereby allowing inefficiency to be correlated with the frontier regressors and avoiding distributional assumptions about ui. However, the time-invariant nature of the inefficiency term has been questioned, especially in the presence of empirical applications based on long-panel datasets. To relax this restriction, Cornwell, Schmidt, and Sickles (1990) have approached the problem by proposing the following SF model with individual-specific slope parameters,

yit = α + x′itβ + vit ± uit,  i = 1, . . . , N,  t = 1, . . . , Ti
uit = ωi + ωi1t + ωi2t²    (7)

in which the model parameters are estimated by extending the conventional fixed- and random-effects panel-data estimators. This quadratic specification allows for a unit-specific temporal pattern of inefficiency but requires the estimation of a large number of parameters (N × 3).

5. A general presentation of the postestimation procedures implemented in the sfcross and sfpanel routines is given by Kumbhakar and Lovell (2000) and Greene (2012), to whom we refer the reader for further details.

6. The normal-exponential model is another straightforward extension allowed by sfpanel.


Following a slightly different estimation strategy, Lee and Schmidt (1993) proposed an alternative specification in which the uit is specified as

uit = g(t) × ui

where g(t) is represented by a set of time dummy variables. This specification is more parsimonious than (7), and it does not impose any parametric form, but it is less flexible because it restricts the temporal pattern of uit to be the same for all productive units.7

Kumbhakar (1990) was the first to propose the ML estimation of a time-varying SF model in which g(t) is specified as

g(t) = {1 + exp(γt + δt²)}⁻¹

This model contains only two additional parameters to be estimated, γ and δ, and the hypothesis of time-invariant technical efficiency can be easily tested by setting γ = δ = 0. A similar model called "time decay" has been proposed by Battese and Coelli (1992) in which

g(t) = exp{−η(t − Ti)}

The common feature of all of these time-varying SF models is that the intercept α is the same across productive units, thus generating a misspecification bias in the presence of time-invariant unobservable factors, which are unrelated with the production process but affect the output. Therefore, the effect of these factors may be captured by the inefficiency term, producing biased results.

Greene (2005a) approached this issue through a time-varying SF normal-half normal model with unit-specific intercepts, obtained by replacing (6) by the following specification:

yit = αi + x′itβ + εit    (8)

Compared with previous models, this specification allows one to disentangle time-varying inefficiency from unit-specific time-invariant unobserved heterogeneity. For this reason, Greene (2005a) termed these models TFE or TRE according to the assumptions on the unobserved unit-specific heterogeneity. While the estimation of the TRE specification can be easily performed using simulation-based techniques, the ML estimation of the TFE variant requires the solution of two major issues related to the estimation of nonlinear panel-data models. The first is purely computational because of the large dimension of the parameter space. Nevertheless, Greene (2005a,b) showed that a maximum-likelihood dummy variable (MLDV) approach is computationally feasible also in the presence of a large number of nuisance parameters αi (N > 1000). The second, the so-called incidental parameters problem, is an inferential issue that arises

7. Han, Orea, and Schmidt (2005) and Ahn, Hoon Lee, and Schmidt (2001) propose estimating the models of Cornwell, Schmidt, and Sickles (1990) and Lee and Schmidt (1993), respectively, through a generalized method of moments (GMM) approach. They show that GMM is preferable because it is asymptotically efficient. Currently, sfpanel allows for the estimation of the models of Cornwell, Schmidt, and Sickles (1990) and Lee and Schmidt (1993) through modified least-squares dummy variables and iterative least-squares approaches, respectively. We leave for future updates the implementation of the GMM estimator.


when the number of units is relatively large compared with the length of the panel. In these cases, the unit-specific intercepts are inconsistently estimated as N → ∞ with fixed T because only Ti observations are used to estimate each unit-specific parameter (Neyman and Scott 1948; Lancaster 2002). As shown in Belotti and Ilardi (2012), because this inconsistency contaminates the variance parameters, which represent the key ingredients in the postestimation of inefficiencies, the MLDV approach appears to be appropriate only when the length of the panel is large enough (T ≥ 10).8

Although model (8) may appear to be the most flexible and parsimonious choice among the several existing time-varying specifications, one can argue that a portion of the time-invariant unobserved heterogeneity does belong to inefficiency, or these two components should not be disentangled at all. The sfpanel command provides options for the estimation of these two extremes: for the models of Schmidt and Sickles (1984), Pitt and Lee (1981), and Battese and Coelli (1988), in which all time-invariant unobserved heterogeneity is considered as inefficiency, and for the two "true" specifications, in which all time-invariant unobserved heterogeneity is ruled out from the inefficiency component. As pointed out by Greene (2005b), neither formulation is completely satisfactory a priori, and the choice should be driven by the features of the data at hand.9

Despite the usefulness of SF models in many contexts, a practical disclaimer is in order: in both cross-sectional and panel-data models, the identification through distributional assumptions of the two components u and v heavily depends on how the shape of their distributions is involved in defining the shape of the ε distribution. Identification problems may arise when either the shapes are very similar (as pointed out by Ritter and Simar [1997] in the case of small samples for the normal-gamma cross-sectional model), or one of the two components is responsible for most of the shape of the ε distribution. The latter is the case when the ratio between the inefficiency and measurement error variability (the so-called signal-to-noise ratio, σu/σv) is very small or very large. In these cases, the profile of the log likelihood becomes quite "flat", producing nontrivial numerical maximization problems.

2.3 Exogenous inefficiency determinants and heteroskedasticity

A very important issue in SF analysis is the inclusion in the model of exogenous variables that are supposed to affect the distribution of inefficiency. These variables, which usually are neither the inputs nor the outputs of the production process but nonetheless

8. A common approach to solve this problem is based on the elimination of the αi through a data transformation. The consistent estimation of the fixed-effects variant of Greene's (2005a) model is still an open research issue in SF literature. Promising solutions have been proposed by Chen, Schmidt, and Wang (2011) for a homoskedastic normal-half normal model and Belotti and Ilardi (2012) for a more flexible heteroskedastic specification in normal-half normal and normal-exponential models. We are currently working to update the sfpanel command along these directions.

9. A way to disentangle unobserved heterogeneity from inefficiency is to include explanatory variables that are correlated with inefficiency but not with the remaining heterogeneity. The use of (untestable) exclusion restrictions is a quite standard econometric technique to deal with identification issues.


affect the productive unit performance, have been incorporated in a variety of ways: i) they may shift the frontier function and the inefficiency distribution; ii) they may scale the frontier function and the inefficiency distribution; and iii) they may shift and scale the frontier function and the inefficiency distribution. Moreover, Kumbhakar and Lovell (2000) stress that in contrast with the linear regression model, in which the misspecification of the second moment of the errors distribution determines only efficiency losses, the presence of uncontrolled observable heterogeneity in ui and vi may affect the inference in SF models. Indeed, while neglected heteroskedasticity in vi does not produce any bias for the frontier's parameter estimates, it leads to biased inefficiency estimates, as we show in section 5.3.

In this section, we present the approaches that introduce heterogeneity in the location parameter of the inefficiency distribution and heteroskedasticity of the inefficiency as well as in the parameter of the idiosyncratic error term for the models implemented in the sfcross and sfpanel commands. Because these approaches can be easily extended to the panel-data context, we deliberately confine the review to the cross-sectional framework.

As pointed out by Greene (2012), researchers have often incorporated exogenous effects using a two-step approach. In the first step, estimates of inefficiency are obtained without controlling for these factors, while in the second, the estimated inefficiency scores are regressed (or otherwise associated) with them. Wang and Schmidt (2002) show that this approach leads to severely biased results; thus we shall focus only on model extensions based on simultaneous estimation.

A natural starting point for introducing exogenous variables in the inefficiency model is in the location of the distribution. The most well-known approaches are those suggested by Kumbhakar, Ghosh, and McGuckin (1991) and Huang and Liu (1994). They proposed to parameterize the mean of the pretruncated inefficiency distribution. Basically, model (1)–(3) can be completed with

ui ∼ N+(μi, σ²u)    (9)
μi = z′iψ

where ui is a realization from a truncated normal random variable, zi is a vector of exogenous variables (including a constant term), and ψ is the vector of unknown parameters to be estimated (the so-called inefficiency effects). One interesting feature of this approach is that the vector zi may include interactions with input variables, thus allowing one to test the hypothesis that inefficiency is neutral with respect to its impact on input usage.10

10. Battese and Coelli (1995) proposed a similar specification for panel data.


An alternative approach to analyzing the effect of exogenous determinants on inefficiency is obtained by scaling its distribution. Then a model that allows heteroskedasticity in ui and vi becomes a straightforward extension. For example, Caudill and Ford (1993), Caudill, Ford, and Gropper (1995), and Hadri (1999) proposed to parameterize the variance of the pretruncated inefficiency distribution in the following way:

ui ∼ N+(0, σ²ui)    (10)
σ²ui = exp(z′iψ)    (11)

Hadri (1999) extends this last specification by allowing the variance of the idiosyncratic term to be heteroskedastic so that (3) can be rewritten as

vi ∼ N(0, σ²vi)
σ²vi = exp(h′iφ)    (12)

where the variables in hi do not necessarily appear in zi.

As in Wang (2002), both sfcross and sfpanel allow one to combine (9) and (10) for the normal-truncated normal model. In postestimation, it is possible to compute nonmonotonic effects of the exogenous factors zi on ui. A different specification has been suggested by Wang and Schmidt (2002), in which both the location and variance parameters are scaled by the same positive (monotonic) function h(zi, ψ). Their model, ui = h(zi, ψ)u∗i with u∗i ∼ N+(μ, σ²), is equivalent to the assumption that ui ∼ N+{μh(zi, ψ), σ²h(zi, ψ)²}, in which the zi vector does not include a constant term.11

11. We are currently working to extend the sfcross command to normal-truncated normal models with scaling property (Wang and Schmidt 2002).


Table 1. A summary of sfcross and sfpanel estimation and postestimation capabilities

Cross-sectional models

Reference                               Argument of d()¹   F²   Est. method³
Aigner, Lovell, and Schmidt (1977)      hnormal            HN   ML
Meeusen and van den Broeck (1977)       exponential        E    ML
Stevenson (1980)                        tnormal            TN   ML
Greene (2003)                           gamma              G    SML

Panel-data models

Reference                               Argument of m()¹   F²   Est. method³
Schmidt and Sickles (1984)              fe                 -    W
Schmidt and Sickles (1984)              regls              -    GLS
Pitt and Lee (1981)                     pl81               HN   ML
Battese and Coelli (1988)               bc88               TN   ML
Cornwell, Schmidt, and Sickles (1990)   fecss              -    MW
Lee and Schmidt (1993)                  fels               -    ILS
Kumbhakar (1990)                        kumb90             HN   ML
Battese and Coelli (1992)               bc92               TN   ML
Battese and Coelli (1995)               bc95               TN   ML
Greene (2005a)                          tfe                E    MLDV
Greene (2005a)                          tfe                HN   MLDV
Greene (2005a)                          tfe                TN   MLDV
Greene (2005a)                          tre                E    SML
Greene (2005a)                          tre                HN   SML
Greene (2005a)                          tre                TN   SML

(The original table also marks, for each model, the availability of location effects in u, heteroskedasticity in u and in v, and the JLMS,⁴ BC,⁵ and CI⁶ postestimation statistics.)

1. The model() option is available in sfpanel. The distribution() option is available in sfcross.
2. Distribution F of u: HN = half normal, E = exponential, TN = truncated normal, G = gamma.
3. Estimation method: ML = maximum likelihood, SML = simulated maximum likelihood, GLS = generalized least squares, W = within group, MW = modified within group, ILS = iterative least squares, MLDV = maximum-likelihood dummy variable.
4. Inefficiency (and efficiency) estimation via the approach of Jondrow et al. (1982).
5. Efficiency estimation via the approach of Battese and Coelli (1988).
6. Confidence interval for inefficiencies via the approach of Horrace and Schmidt (1996).


3 The sfcross command

The new Stata command sfcross provides parametric ML estimators of SF models, where the default is the production frontier. The general syntax of this command is as follows:

sfcross depvar [indepvars] [if] [in] [weight] [, distribution(distname)
     emean(varlist_m [, noconstant]) usigma(varlist_u [, noconstant])
     vsigma(varlist_v [, noconstant]) svfrontier(listname) svemean(listname)
     svusigma(listname) svvsigma(listname) cost simtype(simtype)
     nsimulations(#) base(#) postscore posthessian]

This command and its panel analog, sfpanel, are written using the moptimize() suite of functions, the optimization engine used by ml, and share the same features of all Stata estimation commands, including access to the estimation results and options for the maximization process (see help maximize). Stata 11.2 is the earliest version that can be used to run the commands. aweight, fweight, iweight, and pweight are allowed (see help weight). sfcross supports the svy prefix (see help survey). The default is the normal-exponential model. Most options are similar to those of other Stata estimation commands. A full description of all available options is provided in the sfcross help file.
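As an illustration (a minimal sketch with hypothetical variable names, where lny is log output, lnk and lnl are log inputs, and z1 and z2 are exogenous inefficiency determinants), a normal-truncated normal frontier with inefficiency effects could be fit as

. sfcross lny lnk lnl, distribution(tnormal) emean(z1 z2) usigma(z1 z2)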

3.1 Main options for sfcross

distribution(distname) specifies the distribution for the inefficiency term as half normal (hnormal), truncated normal (tnormal), exponential (exponential), or gamma (gamma). The default is distribution(exponential).

emean(varlist_m [, noconstant]) may be used only with distribution(tnormal). With this option, sfcross specifies the mean of the truncated normal distribution in terms of a linear function of the covariates defined in varlist_m. Specifying noconstant suppresses the constant in this function.

usigma(varlist_u [, noconstant]) specifies that the technical inefficiency component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_u. Specifying noconstant suppresses the constant in this function.

vsigma(varlist_v [, noconstant]) specifies that the idiosyncratic error component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_v. Specifying noconstant suppresses the constant in this function.

svfrontier(listname) specifies a 1 × k vector of initial values for the coefficients of the frontier. The vector must have the same length as the parameter vector to be estimated.


svemean(listname) specifies a 1 × km vector of initial values for the coefficients of the conditional mean model. This can be specified only with distribution(tnormal).

svusigma(listname) specifies a 1 × ku vector of initial values for the coefficients of the technical inefficiency variance function.

svvsigma(listname) specifies a 1 × kv vector of initial values for the coefficients of the idiosyncratic error variance function.

cost specifies that sfcross fit a cost frontier model.

simtype(simtype) specifies the method for random draws when distribution(gamma) is specified. runiform generates uniformly distributed random variates; halton and genhalton create, respectively, Halton sequences and generalized Halton sequences where the base is expressed by the prime number in base(#). The default is simtype(runiform). See help mata halton() for more details on generating Halton sequences.

nsimulations(#) specifies the number of draws used when distribution(gamma) is specified. The default is nsimulations(250).

base(#) specifies the number, preferably a prime, used as a base for the generation of Halton sequences and generalized Halton sequences when distribution(gamma) is specified. The default is base(7). Note that Halton sequences based on large primes (# > 10) can be highly correlated, and their coverage may be worse than that of the pseudorandom uniform sequences.

postscore saves an observation-by-observation matrix of scores in the estimation results' list. Scores are defined as the derivative of the objective function with respect to the parameters. This option is not allowed when the size of the scores' matrix is greater than the Stata matrix limit; see [R] limits.

posthessian saves the Hessian matrix corresponding to the full set of coefficients in the estimation results' list.

3.2 Postestimation command after sfcross

After the estimation with sfcross, the predict command can be used to compute linear predictions, (in)efficiency, and score variables. Moreover, the sfcross postestimation command allows one to compute the (in)efficiency confidence interval through the option ci as well as nonmonotonic marginal effects in the manner of Wang (2002) using, when appropriate, the option marginal. The syntax of the command is the following,

predict [type] newvar [if] [in] [, statistic]

predict [type] {stub* | newvar_xb newvar_v newvar_u} [if] [in], scores

where statistic includes xb, stdp, u, m, jlms, bc, ci, and marginal.

Page 69: The Stata Journal · The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA. Address changes should be sent to the Stata Journal , StataCorp, 4905

F. Belotti, S. Daidone, G. Ilardi, and V. Atella 731

xb, the default, calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

u produces estimates of (technical or cost) inefficiency via E(u|ε) using the estimator of Jondrow et al. (1982).

m produces estimates of (technical or cost) inefficiency via M(u|ε), the mode of the conditional distribution of u|ε. This option is not allowed when the estimation is performed with the distribution(gamma) option.

jlms produces estimates of (technical or cost) efficiency via exp{−E(u|ε)}.

bc produces estimates of (technical or cost) efficiency via E{exp(−u|ε)}, the estimator of Battese and Coelli (1988). This option is not allowed when the estimation is performed with the distribution(gamma) option.

ci computes the confidence interval using the approach proposed by Horrace and Schmidt (1996). It can be used only when u or bc is specified. The default is level(95), or a 95% confidence interval. If the option level(#) is used in the previous estimation command, the confidence interval will be computed using the # level. This option creates two additional variables: newvar_LBcilevel and newvar_UBcilevel, the lower and the upper bound, respectively. This option is not allowed when the estimation is performed with the distribution(gamma) option.

marginal calculates the marginal effects of the exogenous determinants on E(u) and Var(u). The marginal effects are observation specific and are saved in the new variables varname_m_M and varname_u_V, the marginal effects on the mean and the variance of the inefficiency, respectively. varname_m and varname_u are the names of each exogenous determinant specified in options emean(varlist_m [, noconstant]) and usigma(varlist_u [, noconstant]). marginal can be used only if the estimation is performed with the distribution(tnormal) option. When they are both specified, varlist_m and varlist_u must contain the same variables in the same order. This option can be specified in two ways: i) together with u, m, jlms, or bc; and ii) alone without specifying newvar.

scores calculates score variables. When the argument of the option distribution() is hnormal, tnormal, or exponential, score variables are generated as the derivative of the objective function with respect to the parameters. When the argument of the option distribution() is gamma, they are generated as the derivative of the objective function with respect to the coefficients. This difference is due to the different moptimize() evaluator type used to implement the estimators (see help mata moptimize()).
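Continuing the hypothetical frontier fit above, for example, inefficiency and efficiency scores (with confidence bounds for the latter) can then be obtained as

. predict ineff, u

. predict eff, bc ci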


4 The sfpanel command

sfpanel allows for the estimation of SF panel-data models through ML and least-squares techniques. The general sfpanel syntax is the following:

sfpanel depvar [indepvars] [if] [in] [weight], model(modeltype) [options]

As for its cross-sectional counterpart, version 11.2 is the earliest version of Stata that can be used to run sfpanel. Similarly, all types of weights are allowed, but the declared weight variable must be constant within each unit of the panel. Moreover, the command does not support the svy prefix. The default model is the time-decay model of Battese and Coelli (1992). A description of the main command-specific estimation and postestimation options is provided below. A full description of all available options is provided in the sfpanel help file.

4.1 Main options for sfpanel

True fixed- and random-effects models (Greene 2005a,b)

distribution(distname) specifies the distribution for the inefficiency term as half normal (hnormal), truncated normal (tnormal), or exponential (exponential). The default is distribution(exponential).

emean(varlist_m [, noconstant]) may be used only with distribution(tnormal). With this option, sfpanel specifies the mean of the truncated normal distribution in terms of a linear function of the covariates defined in varlist_m. Specifying noconstant suppresses the constant in this function.

usigma(varlist_u [, noconstant]) specifies that the technical inefficiency component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_u. Specifying noconstant suppresses the constant in this function.

vsigma(varlist_v [, noconstant]) specifies that the idiosyncratic error component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_v. Specifying noconstant suppresses the constant in this function.

feshow allows the user to display estimates of individual fixed effects, along with structural parameters. Only for model(tfe).

simtype(simtype) specifies the method to generate random draws for the unit-specific random effects. runiform generates uniformly distributed random variates; halton and genhalton create, respectively, Halton sequences and generalized Halton sequences where the base is expressed by the prime number in base(#). The default is simtype(runiform). See help mata halton() for more details on generating Halton sequences. Only for model(tre).

nsimulations(#) specifies the number of draws used in the simulation. The default is nsimulations(250). Only for model(tre).


base(#) specifies the number, preferably a prime, used as a base for the generation of Halton sequences and generalized Halton sequences. The default is base(7). Note that Halton sequences based on large primes (# > 10) can be highly correlated, and their coverage may be worse than that of the pseudorandom uniform sequences. Only for model(tre).

ML random-effects time-varying inefficiency effects model (Battese and Coelli 1995)

emean(varlist_m [, noconstant]) fits the Battese and Coelli (1995) conditional mean model, in which the mean of the truncated normal distribution is expressed as a linear function of the covariates specified in varlist_m. Specifying noconstant suppresses the constant in this function.

usigma(varlist_u [, noconstant]) specifies that the technical inefficiency component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_u. Specifying noconstant suppresses the constant in this function.

vsigma(varlist_v [, noconstant]) specifies that the idiosyncratic error component is heteroskedastic, with the variance expressed as a function of the covariates defined in varlist_v. Specifying noconstant suppresses the constant in this function.

ML random-effects flexible time-varying efficiency model (Kumbhakar 1990)

bt(varlist_bt [, noconstant]) fits a model that allows a flexible specification of technical inefficiency, handling different types of time behavior, using the formulation uit = ui[1 + exp(varlist_bt)]⁻¹. Typically, explanatory variables in varlist_bt are represented by a polynomial in time. Specifying noconstant suppresses the constant in the function. The default includes a linear and a quadratic term in time without constant, as in Kumbhakar (1990).

4.2 Postestimation command after sfpanel

After the estimation with sfpanel, the predict command can be used to compute linear predictions, (in)efficiency, and score variables. Moreover, the sfpanel postestimation command allows one to compute the (in)efficiency confidence interval through the option ci as well as nonmonotonic marginal effects in the manner of Wang (2002) using, when appropriate, the option marginal. The syntax of the command is the following,

predict [type] newvar [if] [in] [, statistic]

predict [type] {stub* | newvar_xb newvar_v newvar_u} [if] [in], scores

where statistic includes xb, stdp, u, u0, m, jlms, bc, ci, marginal, and trunc(tlevel).

xb, the default, calculates the linear prediction.


stdp calculates the standard error of the linear prediction.

u produces estimates of (technical or cost) inefficiency via E(u|ε) using the estimator of Jondrow et al. (1982).

u0 produces estimates of (technical or cost) inefficiency via E(u|ε) using the estimator of Jondrow et al. (1982) when the random effect is zero. This statistic can only be specified when the estimation is performed with the model(tre) option.

m produces estimates of (technical or cost) inefficiency via M(u|ε), the mode of the conditional distribution of u|ε. This statistic is not allowed when the estimation is performed with the option model(fecss), model(fels), model(fe), or model(regls).

jlms produces estimates of (technical or cost) efficiency via exp{−E(u|ε)}.

bc produces estimates of (technical or cost) efficiency via E{exp(−u|ε)}, the estimator of Battese and Coelli (1988). This statistic is not allowed when the estimation is performed with the option model(fecss), model(fels), model(fe), or model(regls).

ci computes the confidence interval using the approach of Horrace and Schmidt (1996). This option can only be used with the u, jlms, and bc statistics but not when the estimation is performed with the option model(fels), model(bc92), model(kumb90), model(fecss), model(fe), or model(regls). The default is level(95), or a 95% confidence interval. If the option level(#) is used in the previous estimation command, the confidence interval will be computed using the # level. This option creates two additional variables: newvar_LBcilevel and newvar_UBcilevel, the lower and the upper bound, respectively.

marginal calculates the marginal effects of the exogenous determinants on E(u) and Var(u). The marginal effects are observation specific and are saved in the new variables varname_m_M and varname_u_V, the marginal effects on the unconditional mean and the variance of inefficiency, respectively. varname_m and varname_u are the names of each exogenous determinant specified in options emean(varlist_m [, noconstant]) and usigma(varlist_u [, noconstant]). marginal can only be used if estimation is performed with the model(bc95) option or if the inefficiency in model(tfe) or model(tre) is distribution(tnormal). When they are both specified, varlist_m and varlist_u must contain the same variables in the same order. This option can be specified in two ways: i) together with u, m, jlms, or bc; and ii) alone without specifying newvar.

trunc(tlevel) excludes from the inefficiency estimation the units whose effects are, at least at one time period, in the upper and bottom tlevel% range. trunc() can only be used if the estimation is performed with model(fe), model(regls), model(fecss), and model(fels).

scores calculates score variables. This option is not allowed when the estimation is performed with the option model(fecss), model(fels), model(fe), or model(regls). When the argument of the option model() is tfe or bc95, score variables are generated as the derivative of the objective function with respect to the parameters.


When the argument of the option model() is tre, bc88, bc92, kumb90, or pl81, they are generated as the derivative of the objective function with respect to the coefficients. This difference is due to the different moptimize() evaluator type used to implement the estimators (see help mata moptimize()).
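To make these statistics concrete, a hedged sketch of typical postestimation calls follows; the model and variable names are ours, and the confidence-bound suffixes follow the naming rule given under ci above.

. sfpanel y x1 x2, model(tre) distribution(exp)
. predict ineff, u ci
. predict eff, bc
. predict sc*, scores

Here ineff_LB95 and ineff_UB95 would hold the lower and upper confidence bounds, and the stub* form of the scores syntax creates one score variable per model equation.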

5 Examples with simulated data

In this section, we use simulated data to illustrate sfcross and sfpanel estimation capabilities, focusing on some of the models that cannot be estimated using official Stata routines.12

5.1 The normal-gamma SF production model

There is a large debate in the SF literature about the (non)identifiability of the normal-gamma cross-sectional model. Ritter and Simar (1997) pointed out that this model is difficult to distinguish from the normal-exponential one, and that the estimation of the shape parameter of the gamma distribution may require large sample sizes (up to several thousand observations). On the other hand, Greene (2003) argued that their result “was a matter of degree, not a definitive result”, and that the (non)identifiability of the true value of the shape parameter remains an empirical question. In this section, we illustrate the sfcross command by fitting a normal-gamma SF production model. We consider the following data-generating process (DGP),

yi = 1 + 0.3x1i + 0.7x2i + vi − ui, i = 1, . . . , N

vi ∼ N (0, 1)

ui ∼ Γ(2, 2)

where the inefficiency is gamma distributed with shape and scale parameters equal to 2, the idiosyncratic error is N(0, 1), and the two regressors, x1i and x2i, are normally distributed with 0 means and variances equal to 1 and 4, respectively. The sample size is set to 1,000 observations, a large size as noted by Ritter and Simar (1997) but in general not so large given the current availability of microdata. Let us begin by fitting the normal-exponential model using the following syntax:

12. We report the Mata code used for the DGP and the models' estimation syntax for each example in the sj_examples_simdata.do ancillary file.
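For readers without the ancillary file, a minimal do-file sketch of this DGP follows (our own Stata-level illustration rather than the authors' Mata code; the seed is arbitrary):

// sketch of the section 5.1 DGP: N = 1,000, gamma-distributed inefficiency
clear
set seed 10101
set obs 1000
generate x1 = rnormal(0, 1)      // variance 1
generate x2 = rnormal(0, 2)      // variance 4
generate v  = rnormal(0, 1)
generate u  = rgamma(2, 2)       // shape 2, scale 2
generate y  = 1 + 0.3*x1 + 0.7*x2 + v - u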


. sfcross y x1 x2, distribution(exp) nolog

Stoc. frontier normal/exponential model         Number of obs   =       1000
                                                Wald chi2(2)    =     419.88
Log likelihood = -2423.0869                     Prob > chi2     =     0.0000

           y       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Frontier
          x1    .3709605     .068792     5.39   0.000     .2361306    .5057904
          x2    .6810641    .0339945    20.03   0.000     .6144361     .747692
       _cons   -.1474677    .1131198    -1.30   0.192    -.3691784    .0742431

Usigma
       _cons    2.173649    .0957468    22.70   0.000     1.985989    2.361309

Vsigma
       _cons    .3827463    .1498911     2.55   0.011     .0889652    .6765274

     sigma_u    2.964844    .1419372    20.89   0.000     2.699305    3.256505
     sigma_v    1.210911    .0907524    13.34   0.000     1.045487     1.40251
      lambda    2.448441    .2058941    11.89   0.000     2.044895    2.851986

. estimates store exp

. predict uhat_exp, u

Note that the normal-exponential model is the sfcross default, so we might omit the option distribution(exponential).13 As can be seen, although there is only one equation to be estimated in the model, the command fits three of Mata's moptimize() equations (see [M-5] moptimize( )). Indeed, given that sfcross allows both the inefficiency and the idiosyncratic error to be heteroskedastic (see table 1), the output also reports variance parameters estimated in a transformed metric according to (11) and (12), respectively. In this example, the inefficiency is assumed to be homoskedastic, so sfcross estimates the coefficient of the constant term in (11) rather than directly estimating σu. To make the output easily interpretable, sfcross also displays the variance parameters in their natural metric.

As expected, the normal-exponential model produces biased results, especially for the frontier's constant term and the inefficiency scale parameter σu. We also run the predict command using the u option. In this way, inefficiency estimates are obtained through the approach of Jondrow et al. (1982). Because the inefficiencies are drawn from a gamma distribution, a better fit can be obtained using the following command:

13. The option nolog allows one to omit the display of the criterion function iteration log. sfcross and sfpanel allow one to use all maximize options available for ml estimation commands (see help maximize) and the additional options postscore and posthessian, which report the score and the Hessian as an e() vector and matrix, respectively.


. sfcross y x1 x2, distribution(gamma) nsim(50) simtype(genha) base(7) nolog

Stoc. frontier normal/gamma model               Number of obs   =       1000
                                                Wald chi2(2)    =     438.02
                                                Prob > chi2     =     0.0000
Log simulated-likelihood = -2419.0008
Number of Randomized Halton Sequences = 50
Base for Randomized Halton Sequences  = 7

           y       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

Frontier
          x1    .3809769    .0670487     5.68   0.000     .2495639    .5123899
          x2    .6877634    .0336088    20.46   0.000     .6218914    .7536354
       _cons    .9361791    .4121864     2.27   0.023     .1283087     1.74405

Usigma
       _cons     1.53519     .226486     6.78   0.000     1.091286    1.979094

Vsigma
       _cons   -.2734356     .333033    -0.82   0.412    -.9261682     .379297

     sigma_u    2.154578    .2439909     8.83   0.000     1.725717    2.690016
     sigma_v    .8722163    .1452384     6.01   0.000     .6293397    1.208825
      lambda    2.470234    .1969744    12.54   0.000     2.084171    2.856297
     g_shape    1.879186    .3845502     4.89   0.000     1.125482    2.632891

. estimates store gamma

. predict uhat_gamma, u

In the normal-gamma cross-sectional model, the parameters are estimated using simulated maximum likelihood (SML). A good approximation of the log-likelihood function requires an appropriate choice of the number of draws and of the way they are generated. In this example, we use generalized Halton sequences (simtype(genhalton)) with base equal to 7 (base(7)) and only 50 draws (nsim(50)). Indeed, a Halton sequence generally has a more uniform coverage than a sequence generated from pseudouniform random numbers. Moreover, as noted by Greene (2003), the computational efficiency, when compared with that of pseudouniform random draws, appears to be at least 10 to 1. Thus, in our example, approximately the same results can be obtained using 500 pseudouniform draws (see help mata halton()).14

14. For all models fit using SML, the default option of sfcross and sfpanel is simtype(uniform) with nsim(250). In our opinion, small values (for example, 50 for Halton sequences and 250 for pseudouniform random draws) are sufficient for exploratory work. On the other hand, larger values, on the order of several hundreds, are advisable for more precise results. We suggest using Halton sequences rather than pseudouniform random draws. However, as pointed out by Drukker and Gates (2006), “Halton sequences based on large primes (d > 10) can be highly correlated, and their coverage can be worse than that of the pseudorandom uniform sequences”.
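For comparison, the equivalent pseudouniform run implied by this 10-to-1 rule of thumb would be the following sketch, with the same data and options as above except for the draw settings:

. sfcross y x1 x2, distribution(gamma) nsim(500) simtype(uniform) nolog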


As expected, in this example, the parameters of the normal-gamma model are properly estimated. Furthermore, this model is preferable to the normal-exponential one, as corroborated by the following likelihood-ratio test.15

. lrtest exp gamma

Likelihood-ratio test                                 LR chi2(1)  =      8.17
(Assumption: exp nested in gamma)                     Prob > chi2 =    0.0043

Similar conclusions may be drawn by comparing the estimated mean inefficiencies with the true simulated one, even if the Spearman rank correlation with the latter is high and very similar for both uhat_gamma and uhat_exp.16

. summarize u uhat_gamma uhat_exp

    Variable         Obs        Mean    Std. Dev.        Min        Max

           u        1000    4.097398     2.91035   .0259262   19.90251
  uhat_gamma        1000    4.048885    2.839368   .4752663   20.27557
    uhat_exp        1000    2.964844     2.64064    .363516   18.95619

. spearman u uhat_gamma uhat_exp
(obs=1000)

                      u  uhat_g~a  uhat_exp

         u       1.0000
uhat_gamma       0.9141    1.0000
  uhat_exp       0.9145    0.9998    1.0000

5.2 Panel-data time-varying inefficiency models

Cornwell, Schmidt, and Sickles (1990) and Lee and Schmidt (1993) provide a fixed-effects treatment of models like those proposed by Kumbhakar (1990) and Battese and Coelli (1992). Currently, sfpanel allows for the estimation of the models of Cornwell, Schmidt, and Sickles (1990) and Lee and Schmidt (1993) by means of modified least-squares dummy variables and iterative least squares (ILS), respectively. An interesting aspect of these models is that although they have been proposed in the SF literature, they are actually linear panel-data models with time-varying fixed effects and thus potentially very useful in other contexts. However, their consistency requires white noise errors, and they are less efficient than the GMM estimator proposed by Ahn, Hoon Lee, and Schmidt (2001) and Han, Orea, and Schmidt (2005).

15. Notice that exp and gamma are the names of the exponential and gamma models' estimation results saved with the estimates store command.

16. In line with Ritter and Simar (1997), our simulation results indicate that in the normal-gamma model, a relatively large sample is needed to achieve a reasonable degree of precision in the estimates of inefficiency distribution parameters.


In this section, we report the main syntax to fit such models. We start by specifying the following stochastic production frontier translog model:

yit = uit + 0.2 x1it + 0.6 x2it + 0.6 x3it + 0.2 x1it² + 0.1 x2it² + 0.2 x3it²
      + 0.15 x1it x2it − 0.3 x1it x3it − 0.3 x2it x3it + vit

vit ∼ N(0, 0.25),   i = 1, . . . , n,   t = 1, . . . , T

As already mentioned, the main feature of these models is the absence of any distributional assumption about inefficiency. In this example, the DGP follows the Lee and Schmidt (1993) model, where ui = δi ξ. For each unit, the parameter δi is drawn from a uniform distribution in [0, √(12τ + 1) − 1] with τ = 0.8. The elements of the vector ξ = (ξ1, . . . , ξT) are equally spaced between −2 and 2. This setup implies a standard deviation of the inefficiency term σu ≈ 1.83.
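A do-file sketch of this DGP may help fix ideas. This is our own illustration: the text does not state how x1–x3 are drawn, so standard normals are assumed, the seed is arbitrary, and n = 500 with T = 5 matches the outputs below.

// sketch of the Lee and Schmidt (1993) DGP of this section
clear
set seed 10101
set obs 500
generate id = _n
generate delta = runiform()*(sqrt(12*0.8 + 1) - 1)   // delta_i ~ U[0, sqrt(12*tau+1)-1], tau = 0.8
expand 5
bysort id: generate time = _n
generate xi = -2 + (time - 1)                        // xi_t equally spaced on [-2, 2] with T = 5
generate u = delta*xi                                // u_it = delta_i * xi_t
generate v = rnormal(0, 0.5)                         // Var(v_it) = 0.25
forvalues j = 1/3 {
    generate x`j' = rnormal()                        // assumed regressor distribution
    generate x`j'_sq = x`j'^2
}
generate x1_x2 = x1*x2
generate x1_x3 = x1*x3
generate x2_x3 = x2*x3
generate y = u + 0.2*x1 + 0.6*x2 + 0.6*x3 + 0.2*x1_sq + 0.1*x2_sq + 0.2*x3_sq ///
    + 0.15*x1_x2 - 0.3*x1_x3 - 0.3*x2_x3 + v
xtset id time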

Once the sample is declared to be a panel (see help xtset), the models of Lee and Schmidt (1993) and Cornwell, Schmidt, and Sickles (1990) can be estimated using the following syntaxes:

. sfpanel y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, model(fels)

(output omitted )

. estimates store fels

. predict uhat_fels, u

. sfpanel y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, model(fecss)

(output omitted )

. estimates store fecss

. predict uhat_fecss, u

Notice that we use the predict command with the u option to postestimate inefficiency. As an additional source of comparison, we use the same simulated data to assess the behavior of the Schmidt and Sickles (1984) time-invariant inefficiency model. The fixed-effects version of this model can be fit using sfpanel and the official xtreg command. However, when the estimation is performed using sfpanel, the predict command with the option u can be used to obtain inefficiency estimates.17

. sfpanel y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, model(fe)

(output omitted )

. estimates store fess_sf

. predict uhat_fess, u

. xtreg y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, fe

(output omitted )

. estimates store fess_xt

Table 2 reports the estimation results from the three models. Unsurprisingly, both the frontier and variance parameters are well estimated in the ls93 and css90 models. This result shows that when the DGP follows the model by Lee and Schmidt (1993), the

17. Both xtreg and sfpanel also allow for the estimation of the random-effects version of this model through the feasible generalized least-squares approach.


estimator by Cornwell, Schmidt, and Sickles (1990) provides reliable results. On the other hand, because the data are generated from a time-varying model, variance estimates from the ss84 model show a substantial bias.

Table 2. Schmidt and Sickles (ss84), Cornwell, Schmidt, and Sickles (css90), and Lee and Schmidt (ls93) estimation results

                  ss84         css90        ls93

x1               0.254***     0.185***     0.171***
                (0.0695)     (0.0237)     (0.0230)
x2               0.626***     0.619***     0.611***
                (0.0354)     (0.0121)     (0.0117)
x3               0.602***     0.591***     0.596***
                (0.0220)     (0.0073)     (0.0075)
x1_sq            0.193***     0.204***     0.209***
                (0.0234)     (0.0078)     (0.0076)
x2_sq            0.099***     0.103***     0.101***
                (0.0080)     (0.0027)     (0.0026)
x3_sq            0.198***     0.201***     0.201***
                (0.0036)     (0.0012)     (0.0012)
x1_x2            0.149***     0.142***     0.145***
                (0.0198)     (0.0066)     (0.0064)
x1_x3           −0.293***    −0.295***    −0.295***
                (0.0130)     (0.0043)     (0.0043)
x2_x3           −0.306***    −0.300***    −0.301***
                (0.0076)     (0.0026)     (0.0025)
_cons           −0.050
                (0.0866)

σu               0.223        1.859        1.832
σv               2.096        0.499        0.497

We do not expect large differences with regard to inefficiency scores, given the similarities in terms of variance estimates between css90 and ls93. Note that for these models (including ss84), inefficiency scores are retrieved in postestimation, with the assumption that the best decision-making unit is fully efficient.18 As seen in the following summarize command, both css90 and ls93 average inefficiencies are close to the true values, while the Spearman rank correlations are almost equal to 1. As expected, the ss84 estimated inefficiencies are highly biased, and the corresponding units' ranking is completely unreliable.

18. This assumption involves calculating ui = α − αi with α = max_{i=1,...,n}(αi) in the case of time-invariant inefficiency models and uit = αt − αit with αt = max_{i=1,...,n}(αit) in the case of time-varying inefficiency models.
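The normalization in footnote 18 is easy to replicate by hand after xtreg, fe, which lacks an automatic inefficiency prediction. The following sketch (variable names ours) mirrors the time-invariant computation described in the footnote:

. xtreg y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, fe
. predict alpha_hat, u
. egen alpha_max = max(alpha_hat)
. generate uhat_manual = alpha_max - alpha_hat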


. summarize u uhat_fels uhat_fecss uhat_fess

    Variable         Obs        Mean    Std. Dev.        Min        Max

           u        2500    1.345894    1.265155          0   4.486315
   uhat_fels        2500    1.669869    1.419268          0   5.550043
  uhat_fecss        2500    2.214224    1.314245          0   6.500714
   uhat_fess        2500     .645184    .2232496          0    1.27254

. spearman u uhat_fels uhat_fecss uhat_fess
(obs=2500)

                      u   uhat_~ls  uhat~css  uhat~ess

          u      1.0000
  uhat_fels      0.9794    1.0000
 uhat_fecss      0.8974    0.9129    1.0000
  uhat_fess      0.0005    0.0092    0.1991    1.0000

Finally, we show additional features of sfpanel: i) the possibility of computing elasticities via the official lincom command; and ii) the possibility of performing a constrained fixed-effects estimation, which is not yet available with xtreg.

With respect to the former point, it is well known that parameters in a translog production frontier do not represent output elasticities. In particular, a linear combination of frontier parameters is needed for computing such elasticities. Moreover, to calculate output elasticities at means, we first need to compute and store the mean for each input variable using the following syntax:

. quietly summarize x1

. scalar x1m = r(mean)

. quietly summarize x2

. scalar x2m = r(mean)

. quietly summarize x3

. scalar x3m = r(mean)

Then the lincom command can be used to combine estimated frontier parameters using the following standard syntax:

. lincom x1 + x1_sq * x1m + x1_x2*x2m + x1_x3*x3m

( 1) x1 + 1.108946*x1_sq + 1.074533*x1_x2 + 1.05167*x1_x3 = 0

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .3203578 .05348 5.99 0.000 .2154752 .4252405

. lincom x2 + x2_sq * x2m + x1_x2*x1m + x2_x3*x3m

( 1) x2 + 1.074533*x2_sq + 1.108946*x1_x2 + 1.05167*x2_x3 = 0

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .5751999 .0254143 22.63 0.000 .5253585 .6250413

Page 80: The Stata Journal · The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA. Address changes should be sent to the Stata Journal , StataCorp, 4905

742 Stochastic frontier using Stata

. lincom x3 + x3_sq * x3m + x1_x3*x1m + x2_x3*x2m

( 1) x3 + 1.05167*x3_sq + 1.108946*x1_x3 + 1.074533*x2_x3 = 0

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .156379 .0158945 9.84 0.000 .1252075 .1875505

Finally, the constant returns to scale (CRS) hypothesis can be trivially tested by using the following syntax:

. lincom (x1 + x1_sq * x1m + x1_x2*x2m + x1_x3*x3m)
>     + (x2 + x2_sq * x2m + x1_x2*x1m + x2_x3*x3m)
>     + (x3 + x3_sq * x3m + x1_x3*x1m + x2_x3*x2m) - 1

( 1)  x1 + x2 + x3 + 1.108946*x1_sq + 1.074533*x2_sq + 1.05167*x3_sq +
      2.18348*x1_x2 + 2.160617*x1_x3 + 2.126204*x2_x3 = 1

y Coef. Std. Err. t P>|t| [95% Conf. Interval]

(1) .0519367 .0609852 0.85 0.395 -.0676648 .1715383

In this example, the CRS hypothesis cannot be rejected. To run a constrained fixed-effects estimation, we can define the required set of constraints to impose CRS through the official Stata command constraint using the following syntax:

. // Constraints definition

. constraint define 1 x1 + x2 + x3 = 1

. constraint define 2 x1_sq + x1_x2 + x1_x3 = 0

. constraint define 3 x2_sq + x1_x2 + x2_x3 = 0

. constraint define 4 x3_sq + x1_x3 + x2_x3 = 0

Then the constrained model can be estimated using sfpanel with the model(fe) and constraints(1 2 3 4) options.


. sfpanel y x1 x2 x3 x1_sq x2_sq x3_sq x1_x2 x1_x3 x2_x3, model(fe)
>     constraints(1 2 3 4)

Time-invariant fixed-effects model (LSDV)       Number of obs      =      2500
Group variable: id                              Number of groups   =       500
Time variable: time                             Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

 ( 1)  x1 + x2 + x3 = 1
 ( 2)  x1_sq + x1_x2 + x1_x3 = 0
 ( 3)  x2_sq + x1_x2 + x2_x3 = 0
 ( 4)  x3_sq + x1_x3 + x2_x3 = 0

           y       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

          x1    .3530365    .0851901     4.14   0.000     .1860671     .520006
          x2    .5092917    .0434568    11.72   0.000     .4241179    .5944655
          x3    .1376718    .0270375     5.09   0.000     .0846792    .1906644
       x1_sq   -.0343576    .0287476    -1.20   0.232    -.0907019    .0219868
       x2_sq    .1282553    .0098209    13.06   0.000     .1090067    .1475039
       x3_sq      .21594     .004442    48.61   0.000     .2072339    .2246461
       x1_x2    .0610211    .0242651     2.51   0.012     .0134624    .1085799
       x1_x3   -.0266635    .0159577    -1.67   0.095    -.0579401    .0046131
       x2_x3   -.1892764    .0092834   -20.39   0.000    -.2074716   -.1710813
       _cons    .2326412    .1062126     2.19   0.029     .0244682    .4408141

     sigma_u    .7140381
     sigma_v   2.5700643

The constrained frontier estimates are more biased than the unconstrained ones but are still not too far from the true values. This is an artifact of our DGP because the scale elasticity has been simulated without imposing CRS.

5.3 “True” fixed- and random-effects models

As already discussed in section 2.2, the “true” fixed- and random-effects models allow one to disentangle time-invariant heterogeneity from time-varying inefficiency. In this section, we present the main syntax and some of the options to fit such models. We start by specifying the following normal-exponential stochastic production frontier model,

yit = 1 + αi + 0.3x1it + 0.7x2it + vit − uit

vit ∼ N (0, 1) (13)

uit ∼ E (2) , i = 1, . . . , n, t = 1, . . . , T (14)

where the nuisance parameters αi (i = 1, . . . , n) are drawn from a N(0, θ²) with θ = 1.5. In the fixed-effects design (TFEDGP), the two regressors x1it and x2it are distributed for each unit according to a normal distribution centered in the corresponding unit effect αi with variances equal to 1 and 4, respectively. This design ensures correlation between regressors and individual effects, a typical scenario in which the fixed-effects specification represents the consistent choice.19

19. Notice that higher values of θ correspond to higher correlations between the regressors and the unit-specific effects.
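A do-file sketch of the TFEDGP may again help (our own illustration; the authors' generating code is in the ancillary file, and the seed is arbitrary). The panel dimensions anticipate those given below, 1,000 units over 10 periods:

// sketch of the TFEDGP: regressors correlated with the unit effects
clear
set seed 10101
set obs 1000
generate id = _n
generate alpha = rnormal(0, 1.5)      // unit effects, theta = 1.5
expand 10
bysort id: generate time = _n
generate x1_c = rnormal(alpha, 1)     // centered on alpha_i -> correlated with it
generate x2_c = rnormal(alpha, 2)     // variance 4
generate v = rnormal(0, 1)
generate u = rexponential(2)          // E(2) inefficiency
generate yf = 1 + alpha + 0.3*x1_c + 0.7*x2_c + v - u
xtset id time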


As far as the random-effects design is concerned (TREDGP), x1it and x2it are not correlated with the unit-specific effects and are distributed according to a normal distribution with 0 mean and variances equal to 1 and 4, respectively.

The generated sample consists of a balanced panel of 1,000 units observed for 10 periods for a total of 10,000 observations. Once the sample is declared as a panel, we fit the following models:

i) a normal-exponential TFE model on TFEDGP data (tfe1)20

. sfpanel yf x1_c x2_c, model(tfe) distribution(exp) rescale

(output omitted )

. estimates store tfe_c

. predict u_tfe_c, u

ii) a normal-exponential TRE model on TFEDGP data (tre1)

. sfpanel yf x1_c x2_c, model(tre) distribution(exp) nsim(50)
>     simtype(genhalton) base(7) rescale

(output omitted )

. estimates store tre_c

. predict u_tre_c, u

iii) a normal-exponential TRE model on TREDGP data (tre2)

. sfpanel yr x1_nc x2_nc, model(tre) distribution(exp) nsim(100)
>     simtype(genhalton) base(7) rescale

(output omitted )

. estimates store tre_nc

. predict u_tre_nc, u

. predict u0_tre_nc, u0

As shown in the first column of table 3, when the model is correctly specified, the frontier parameters are properly estimated. However, in this example, the MLDV estimator of σv is slightly biased by the incidental parameters problem even if the length of the panel is quite large.21 This problem does not seem to affect variance estimates in the tre1 model. In this case, the parameters are estimated using the SML technique assuming that the unobserved heterogeneity is distributed as N(0, θ²) (where θ represents the standard deviation of the unobserved heterogeneity) and that E(αi|x1it, x2it) = 0. Thus, because the estimates are obtained using the TFEDGP data, the frontier and θ parameter estimates are biased.

20. Note that yf, x1_c, and x2_c are the variables from the TFEDGP, while yr, x1_nc, and x2_nc are from the TREDGP.

21. See section 2.2 for a discussion of the MLDV estimator problems in the TFE model.


Table 3. TFE and TRE estimation results

             tfe1         tre1         tre2

x1_c        0.304***     0.776***
           (0.0164)     (0.0198)
x2_c        0.700***     0.811***
           (0.0081)     (0.0094)
x1_nc                                 0.295***
                                     (0.0176)
x2_nc                                 0.706***
                                     (0.0089)
_cons                    1.062***     1.090***
                        (0.0342)     (0.0540)

σu          2.075        2.035        2.023
σv          0.770        1.095        0.973
θ                        0.602        1.542

On the contrary, by fitting a correctly specified TRE model on TREDGP data (column tre2 in table 3), all parameters, including the frontier ones, are accurately estimated.

After each estimation, we use the predict command to obtain inefficiency estimates. As already mentioned, option u instructs the postestimation routine to compute inefficiencies through the estimator of Jondrow et al. (1982) (see help sfpanel postestimation). In the case of the TRE model, the predict command also allows for the option u0 to estimate inefficiencies assuming the random effects are zero. At this point, we can summarize the estimated inefficiencies to compare them with the actual values.

. summarize u u_tfe_c u_tre_c u_tre_nc u0_tre_nc

    Variable         Obs        Mean    Std. Dev.        Min        Max

           u       10000    2.004997     2.00852   .0003777   20.83139
     u_tfe_c       10000    2.075017    1.948148   .2008319   20.42197
     u_tre_c       10000    2.034946    1.818154   .2430926   18.76244
    u_tre_nc       10000    2.025002    1.831147   .2656734   19.98998
   u0_tre_nc       10000    2.200728    2.086419   .1338385   19.47738

. spearman u u_tfe_c u_tre_c u_tre_nc u0_tre_nc
(obs=10000)

                     u    u_tfe_c   u_tre_c  u_tre_nc  u0_tre~c

         u      1.0000
   u_tfe_c      0.7654    1.0000
   u_tre_c      0.7541    0.9291    1.0000
  u_tre_nc      0.7700    0.9925    0.9464    1.0000
 u0_tre_nc      0.6297    0.7313    0.8168    0.7965    1.0000


All the estimates of Jondrow et al. (1982) are very close to the true simulated ones (u). Actually, the estimated average inefficiency after a correctly specified TRE model shows a lower bias than the estimated average inefficiency after a correctly specified TFE model. This is due to the incidental parameters problem. Also note the good performance of the TRE model when it is fit on the TFEDGP data (u_tre_c).

Introducing heteroskedasticity

Finally, we deal with the problem of heteroskedasticity, a very important issue for applied research. For both TFE and TRE models, we compare the estimates obtained from a model that neglects heteroskedasticity with those obtained from a heteroskedastic one. To introduce heteroskedasticity, we replace equations (13)–(14) with the following,

vit ∼ N(0, σvit)
uit ∼ E(σuit)
σvit = exp{0.5(1 + 0.5 × zvit)}
σuit = exp{0.5(2 + 1 × zuit)}

where both inefficiency and idiosyncratic-error scale parameters are now a function of a constant term and of an exogenous covariate (zuit and zvit), drawn from a standard normal random variable. Note that because of the introduction of heteroskedasticity, we will deal with “average” σu and σv, which in our simulated sample are approximately 3.1 and 1.7, respectively. In this case, each observation has a different signal-to-noise ratio, which implies an average of about 1.9. We estimate four different models (a sketch of the heteroskedastic draws follows the model list):

i) a homoskedastic TFE model on heteroskedastic TFEDGP data (tfe1)

. sfpanel yf x1_c x2_c, model(tfe) distribution(exp) rescale

(output omitted )

. estimates store tfe_hom

. predict u_tfe_hom, u

ii) a heteroskedastic TFE model on heteroskedastic TFEDGP data (tfe2)

. sfpanel yf x1_c x2_c, model(tfe) distribution(exp) usigma(zu) vsigma(zv)

(output omitted )

. estimates store tfe_het

. predict u_tfe_het, u

iii) a homoskedastic TRE model on heteroskedastic TREDGP data (tre1)

. sfpanel yr x1_nc x2_nc, model(tre) distribution(exp) nsim(50)
>     simtype(genhalton) base(7) rescale

(output omitted )

. estimates store tre_hom

. predict u_tre_hom, u


iv) a heteroskedastic TRE model on heteroskedastic TREDGP data (tre2)

. sfpanel yr x1_nc x2_nc, model(tre) distribution(exp) usigma(zu) vsigma(zv)
>     nsim(50) simtype(genhalton) base(7) rescale

(output omitted )

. estimates store tre_het

. predict u_tre_het, u

. predict u0_tre_het, u0
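As announced before the model list, a sketch of how the heteroskedastic error draws could be generated follows (our own illustration; σ is treated as a standard deviation for v, and yf and yr would then be rebuilt with these draws):

// sketch of the heteroskedastic error draws replacing (13)-(14)
generate zv = rnormal()
generate zu = rnormal()
generate sigma_v = exp(0.5*(1 + 0.5*zv))
generate sigma_u = exp(0.5*(2 + 1*zu))
generate v_het = rnormal(0, sigma_v)       // N(0, sigma_vit)
generate u_het = rexponential(sigma_u)     // E(sigma_uit)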

Estimation results are reported in table 4. As expected, tfe1 variance estimates are biased by both the incidental parameters problem and the neglected heteroskedasticity in u and v. These estimates can be significantly improved by considering both sources of heteroskedasticity using the options usigma(varlist_u) and vsigma(varlist_v) (tfe2). Exactly the same argument applies in the TRE case (tre1 versus tre2) but without the incidental parameters problem.

Table 4. TFE and TRE estimation results (homoskedasticity versus heteroskedasticity)

             tfe1         tfe2         tre1         tre2

x1_c        0.324***     0.295***
           (0.0271)     (0.0245)
x2_c        0.723***     0.732***
           (0.0134)     (0.0121)
x1_nc                                 0.316***     0.310***
                                     (0.0290)     (0.0264)
x2_nc                                 0.681***     0.689***
                                     (0.0147)     (0.0135)
_cons                                 1.576***     1.113***
                                     (0.0637)     (0.0652)

σu          3.717        3.264        3.642        3.168
σv          1.185        1.402        1.526        1.693
θ                                     1.579        1.565

As we mentioned in section 2.3, neglecting heteroskedasticity in u and v leads to biased inefficiency estimates. This conclusion is confirmed by the summarize command.

. summarize u u_tfe_hom u_tfe_het u_tre_hom u_tre_het u0_tre_het

    Variable         Obs        Mean    Std. Dev.        Min        Max

           u       10000    3.091925    3.915396    .000169   52.20689
   u_tfe_hom       10000    3.717061    3.941147   .3442658   51.54804
   u_tfe_het       10000    3.271297    3.828366   .2642199   52.06564
   u_tre_hom       10000    3.641955    3.788298   .3739219   51.76109
   u_tre_het       10000    3.173224    3.709123   .3241621   51.83721
  u0_tre_het       10000      3.2855    3.844297   .1828969    54.2632


The average inefficiency is upward biased (by about 15%) for both TFE and TRE models in which heteroskedasticity has been neglected. A slightly better result is also obtained in terms of Spearman's rank correlation.

. spearman u u_tfe_hom u_tfe_het u_tre_hom u_tre_het u0_tre_het
(obs=10000)

                     u   u_tfe_~m  u_tfe_~t  u_tre_~m  u_tre_~t  u0_tre~t

          u     1.0000
  u_tfe_hom     0.7287    1.0000
  u_tfe_het     0.7536    0.9589    1.0000
  u_tre_hom     0.7380    0.9830    0.9531    1.0000
  u_tre_het     0.7623    0.9461    0.9835    0.9642    1.0000
 u0_tre_het     0.7039    0.8173    0.8455    0.8944    0.9121    1.0000

6 Empirical applications

In this section, we illustrate sfcross and sfpanel capabilities through two empirical applications from the SF literature. The first analyzes the cost inefficiency of Swiss railways using data from the Swiss Federal Office of Statistics on public transport companies; the second focuses on the technical inefficiency of Spanish dairy farms, using data from a voluntary record-keeping program.22

6.1 Swiss railways

This application is based on an unbalanced panel of 50 railway companies over the period 1985–1997, which resulted in 605 observations. We think that this application is interesting for at least two reasons: i) cost frontiers are much less common in the literature than production frontiers, given the lack of reliable cost and price data; and ii) the length of the panel makes this database quite unusual in the SF literature. A detailed description of the Swiss railway transport system and complete information on the variables used are available in Farsi, Filippini, and Greene (2005).

To estimate a Cobb–Douglas cost frontier, we impose linear homogeneity by normalizing total costs and prices through the price of energy. Therefore, the model can be written as

ln(TCit/Peit) = β0 + βY lnYit + βQ lnQit + βN lnNit + βPk ln(Pkit/Peit)
                + βPl ln(Plit/Peit) + Σ_{t=1986}^{1997} βt dyeart + uit + vit        (15)

where i and t are the subscripts denoting the railway company and year, respectively. As is common, uit is interpreted as a measure of cost inefficiency. Two output measures are included in the cost function: passenger output and freight output. Length of network

22. Both datasets are freely available from the webpage of Professor William Greene (http://people.stern.nyu.edu/wgreene/).


is included as an output characteristic. Further, we have price data for three inputs: capital, labor, and energy. All monetary values, including total costs, are in 1997 Swiss Francs (CHF). We have also included a set of time dummies, dyeart, to control for unobserved time-dependent variation in costs.

We consider three time-varying inefficiency specifications—the Kumbhakar (1990) model (kumb90), the Battese and Coelli (1992) model (bc92), and the Greene (2005a) random-effects model (tre)—and three time-invariant models. With respect to the latter group, we estimate the fixed-effects version of the Schmidt and Sickles (1984) model (ss84), the Pitt and Lee (1981) model (pl81), and the Battese and Coelli (1988) model (bc88). All models are fit assuming that the inefficiency is half-normally distributed—that is, all except bc88 and bc92, in which u ∼ N+(μ, σu²), and ss84, in which no distributional assumption is made. The choice of also including Greene's specification is driven by the multioutput technology that characterizes a railway company, for which unmeasured quality, captured by the random effects, may play an important role in the production process. Finally, as a benchmark, we fit a pooled cross-sectional model (pcs).
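As an illustration, a hedged sketch of one of these estimation calls follows. The variable names are hypothetical, and we assume the cost option flags the frontier as a cost rather than a production frontier:

. sfpanel lntc_pe lny lnq lnn lnpk_pe lnpl_pe dyear1986-dyear1997, model(bc92) cost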

Table 5 shows the results. Coefficient estimates of input prices and outputs are all significant across the seven models and have the expected signs (positive marginal costs and positive own-price elasticities). Looking at table 6, we further observe that the three time-invariant specifications provide inefficiency estimates that are highly correlated. Perhaps the most interesting result is that inefficiency scores obtained from the kumb90 and bc92 models are also highly correlated with those coming from time-invariant models (table 6 and figure 1). This is not surprising, because the two time-invariance hypotheses, H0: t = t2 = 0 in the kumb90 model and H0: η = 0 in the bc92 specification, cannot be rejected at the 5% level. Hence, we may conclude that there is evidence of time-invariant technical inefficiency in the Swiss railway transport system, at least for the study period.

Consistent with this result, we also find that the tre model provides inefficiency estimates that are essentially unrelated to those obtained from any of the other models. Moreover, because of a very low estimate of the inefficiency variance, the estimated signal-to-noise ratio, λ, is the lowest one. In our opinion, these results are driven by the peculiar time-varying inefficiency specification of this model. Indeed, when the inefficiency term is constant over time, the tre specification does not allow one to disentangle time-invariant unobserved heterogeneity from inefficiency. This interpretation is supported by the fact that the estimated standard deviation of the random effects (θ) dominates the inefficiency one (σu).


Table 5. Swiss railways estimation results (50 firms for a total of 605 observations)

            pcs        ss         pl81       bc88       kumb90     bc92       tre
            b/se       b/se       b/se       b/se       b/se       b/se       b/se

lnY         0.492***   0.114***   0.200***   0.199***   0.193***   0.199***   0.201***
           (0.015)    (0.032)    (0.034)    (0.033)    (0.033)    (0.033)    (0.026)
lnQ         0.030***   0.014*     0.021***   0.021***   0.020***   0.020***   0.028***
           (0.006)    (0.006)    (0.006)    (0.006)    (0.006)    (0.006)    (0.005)
lnN         0.393***   0.448***   0.485***   0.503***   0.477***   0.499***   0.583***
           (0.027)    (0.051)    (0.045)    (0.047)    (0.044)    (0.047)    (0.034)
lnpk        0.171***   0.318***   0.310***   0.311***   0.311***   0.313***   0.311***
           (0.032)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)
lnpl        0.592***   0.546***   0.548***   0.546***   0.538***   0.543***   0.560***
           (0.074)    (0.037)    (0.037)    (0.037)    (0.037)    (0.037)    (0.037)
dyear1986   0.009      0.010      0.009      0.009      0.015      0.008      0.015
           (0.056)    (0.015)    (0.015)    (0.015)    (0.015)    (0.015)    (0.015)
dyear1987   0.003      0.020      0.012      0.012      0.023      0.009      0.018
           (0.056)    (0.015)    (0.015)    (0.015)    (0.017)    (0.015)    (0.015)
dyear1988   0.010      0.039*     0.028      0.027      0.043*     0.023      0.034*
           (0.057)    (0.015)    (0.015)    (0.015)    (0.019)    (0.016)    (0.016)
dyear1989   0.036      0.065***   0.052***   0.052***   0.070***   0.046**    0.058***
           (0.057)    (0.016)    (0.016)    (0.016)    (0.021)    (0.016)    (0.016)
dyear1990   0.024      0.084***   0.068***   0.068***   0.086***   0.060***   0.074**
           (0.058)    (0.016)    (0.016)    (0.016)    (0.022)    (0.017)    (0.016)
dyear1991   0.030      0.098***   0.078***   0.078***   0.096***   0.069***   0.086***
           (0.058)    (0.017)    (0.018)    (0.017)    (0.024)    (0.019)    (0.018)
dyear1992   0.046      0.111***   0.094***   0.094***   0.109***   0.083***   0.101***
           (0.058)    (0.017)    (0.017)    (0.017)    (0.023)    (0.019)    (0.017)
dyear1993   0.015      0.100***   0.081***   0.081***   0.092***   0.069***   0.089***
           (0.057)    (0.017)    (0.017)    (0.017)    (0.023)    (0.020)    (0.017)
dyear1994  −0.001      0.082***   0.063***   0.063***   0.069**    0.049*     0.070***
           (0.056)    (0.017)    (0.017)    (0.017)    (0.022)    (0.020)    (0.017)
dyear1995   0.019      0.059***   0.048**    0.047**    0.045*     0.031      0.050**
           (0.057)    (0.016)    (0.016)    (0.016)    (0.022)    (0.021)    (0.016)
dyear1996   0.027      0.037*     0.028      0.027      0.018      0.010      0.017
           (0.057)    (0.017)    (0.016)    (0.016)    (0.022)    (0.022)    (0.017)
dyear1997   0.019      0.038*     0.030      0.029      0.009      0.009      0.029
           (0.060)    (0.018)    (0.017)    (0.017)    (0.023)    (0.024)    (0.017)
Constant   −8.310***  −2.682***  −4.895***  −4.929***  −4.626***  −4.871***  −4.894***
           (0.976)    (0.652)    (0.643)    (0.634)    (0.637)    (0.637)    (0.531)




t                                              0.023
                                              (0.015)
t2                                            −0.002
                                              (0.001)
η                                                         −0.002
                                                          (0.002)

λ           2.882      7.803     11.366      7.716     23.930      7.887      1.634
σ           0.464      0.560      0.807      0.551      1.682      0.562      0.098
σu          0.438      0.555      0.804      0.546      1.681      0.557      0.083
σv          0.152      0.071      0.071      0.071      0.070      0.071      0.051
θ                                                                             0.347

Estimated cost inefficiencies, uit

Mean        0.350      0.807      0.663      0.679      0.687      0.682      0.091
SD          0.233      0.550      0.429      0.425      0.445      0.428      0.076
Min         0.060      0.000      0.015      0.020      0.015      0.019      0.018
Max         1.134      2.507      2.006      1.991      2.124      2.031      0.629

Log likelihood  −116.572      -    595.159    596.523    597.649    597.285    595.516

Notes: Standard errors for ancillary parameters are not reported.

Table 6. Swiss railways correlation of inefficiency estimates

Variables     pcs      ss84     pl81     bc88    kumb90    bc92     tre

pcs          1.000
ss84         0.439    1.000
pl81         0.595    0.975    1.000
bc88         0.608    0.971    0.991    1.000
kumb90       0.573    0.984    0.990    0.998    1.000
bc92         0.603    0.974    0.991    1.000    0.998    1.000
tre         −0.140   −0.378   −0.405   −0.407   −0.400   −0.406    1.000

6.2 Spanish dairy farms

This application is based on a balanced panel of 247 dairy farms located in Northern Spain over a six-year period (1993–1998). This dataset is interesting because it represents what is generally available to researchers: short panel, information only on input


[Figure 1 comprises four scatterplots comparing inefficiency estimates across models: bc92 versus bc88, bc92 versus kumb90, pl81 versus kumb90, and ss versus bc92.]

Figure 1. Swiss railways inefficiency scatterplots

and output volumes, heterogeneity of output, and less than ideal proxies for inputs. The output variable is given by the liters of milk produced per year. This measure explains only partially the final output of this industry: milk can also be considered an intermediate input to produce dairy products. Furthermore, variables such as slaughtered animals should also be considered part of the final output.

The functional form employed in the empirical analysis is the following translog production function with time dummy variables to control for neutral technical change,

ln yit = β0 + Σ_{j=1}^{4} βj ln xjit + (1/2) Σ_{j=1}^{4} Σ_{k=1}^{4} βjk ln xjit ln xkit
         + Σ_{t=1993}^{1998} βt dyeart − uit + vit        (16)

where i and t are the subscripts denoting farm and year, respectively. Four inputs have been employed in the production frontier: number of milking cows (x1), number of man-equivalent units (x2), hectares of land devoted to pasture and crops (x3), and kilograms of feedstuffs fed to the dairy cows (x4). More details on these variables are available in Cuesta (2000) and Alvarez and Arias (2004).

We have fit three models with time-varying inefficiency: the normal-half normal Kumbhakar (1990) model (kumb90), a random-effects model, fit through the feasible generalized least-squares method; the Cornwell, Schmidt, and Sickles (1990) model (css90), fit through the modified least-squares dummy variable technique; and finally, the Lee and Schmidt (1993) model (ls93), fit through ILS. Note that the last two models are fit using approaches that do not allow intercept (β0) and time dummies (dyeart)


to be simultaneously included in the frontier equation. Finally, we also considered two models with time-invariant inefficiency, that is, the uit term reduced to ui in (16): the first was proposed by Schmidt and Sickles (1984) and estimated without any distributional assumption through the least-squares dummy variable approach (ss84); the second was proposed by Pitt and Lee (1981) and estimated through ML assuming a half-normal inefficiency (pl81).
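A hedged sketch of some of these fits follows (variable names hypothetical; css90 and ls93 omit the constant and time dummies, as just noted):

. sfpanel lny lnx1 lnx2 lnx3 lnx4 x11-x34 dyear1994-dyear1998, model(fe)
. sfpanel lny lnx1 lnx2 lnx3 lnx4 x11-x34, model(fecss)
. sfpanel lny lnx1 lnx2 lnx3 lnx4 x11-x34, model(fels)

Here model(fe) gives the ss84 estimator, model(fecss) the css90 estimator, and model(fels) the ls93 estimator, matching the mapping used in section 5.2.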

Table 7 reports the results of our exercise. There is a certain degree of similarity between the different models because both parameter significance and magnitudes are comparable. For the ss84, css90, and ls93 models, the most efficient firm in the sample for each period is considered as fully efficient; thus the smallest value of inefficiency is 0. On average and as expected, the css90 model shows a higher level of inefficiency, and its distribution also has more variability, while the other models seem to behave very similarly in this application. Finally, as we can see in table 8, linear correlations between inefficiencies are very high. This does not come as a surprise given the similarity of the estimated frontier parameters, and it suggests that in medium-short panels and in certain economic sectors and contexts, a time-invariant inefficiency specification is a valid solution.


Table 7. Spanish dairy farms estimation results (247 firms for a total of 1,482 obs.)

            ss84       css90      ls93       kumb90     pl81
            b/se       b/se       b/se       b/se       b/se

x1          0.642***   0.527***   0.641***   0.661***   0.660***
           (0.036)    (0.065)    (0.036)    (0.028)    (0.028)
x2          0.037*     0.043      0.037*     0.038**    0.041**
           (0.017)    (0.027)    (0.017)    (0.015)    (0.015)
x3          0.011      0.079      0.010      0.050**    0.049**
           (0.025)    (0.063)    (0.025)    (0.018)    (0.018)
x4          0.308***   0.226***   0.307***   0.351***   0.356***
           (0.020)    (0.035)    (0.020)    (0.018)    (0.017)
x11         0.135     −0.187      0.133      0.308      0.314
           (0.157)    (0.192)    (0.155)    (0.171)    (0.178)
x22        −0.002      0.060     −0.001     −0.111     −0.112
           (0.069)    (0.111)    (0.068)    (0.064)    (0.067)
x33        −0.242     −0.168     −0.243     −0.129     −0.131
           (0.188)    (0.317)    (0.187)    (0.119)    (0.115)
x44         0.105*    −0.125      0.105*     0.112*     0.118*
           (0.050)    (0.084)    (0.050)    (0.048)    (0.049)
x12        −0.010      0.059     −0.009     −0.060     −0.064
           (0.073)    (0.100)    (0.072)    (0.077)    (0.081)
x13         0.084     −0.114      0.085      0.088      0.091
           (0.102)    (0.158)    (0.101)    (0.090)    (0.090)
x14        −0.075      0.142     −0.074     −0.140     −0.146
           (0.083)    (0.132)    (0.082)    (0.084)    (0.088)
x23         0.001      0.067      0.002      0.020      0.011
           (0.050)    (0.107)    (0.050)    (0.049)    (0.050)
x24        −0.011     −0.062     −0.011      0.025      0.025
           (0.041)    (0.060)    (0.041)    (0.039)    (0.040)
x34        −0.012      0.110     −0.013     −0.015     −0.017
           (0.046)    (0.085)    (0.046)    (0.041)    (0.041)
dyear1994   0.035***      -          -       0.042***   0.027***
           (0.007)       -          -      (0.010)     (0.007)
dyear1995   0.062***      -          -       0.072***   0.048***
           (0.009)       -          -      (0.014)     (0.008)
dyear1996   0.072***      -          -       0.078***   0.052***
           (0.010)       -          -      (0.016)     (0.009)
dyear1997   0.075***      -          -       0.074***   0.051***
           (0.010)       -          -      (0.017)     (0.009)
dyear1998   0.092***      -          -       0.077***   0.064***
           (0.012)       -          -      (0.018)     (0.010)
Constant   11.512***      -          -      11.695***  11.711***
           (0.016)       -          -      (0.019)     (0.016)




t              -          -          -      −0.347        -
               -          -          -     (0.212)        -
t2             -          -          -       0.045        -
               -          -          -     (0.028)        -

λ           1.948      3.711      2.010      4.485      2.775
σ           0.168      0.237      0.171      0.356      0.230
σu          0.149      0.229      0.153      0.348      0.216
σv          0.077      0.062      0.076      0.077      0.078

Estimated technical inefficiencies, uit

Mean        0.315      0.531      0.316      0.182      0.179
SD          0.149      0.227      0.150      0.119      0.117
Min         0.000      0.000      0.000      0.008      0.009
Max         0.873      1.412      0.879      0.667      0.623

Log likelihood    -          -          -   1355.248   1351.826

Notes: Cluster-robust standard errors are in parentheses. Standard errors for ancillary parameters are not reported.

Table 8. Spanish dairy farms, correlation of inefficiency estimates

Variables    ss84     css90    ls93     kumb90   pl81

ss84        1.000
css90       0.868    1.000
ls93        1.000    0.871    1.000
kumb90      0.938    0.726    0.936    1.000
pl81        0.931    0.709    0.929    0.995    1.000

7 Concluding remarks

In this article, we introduced the new Stata commands sfcross and sfpanel, which implement an extensive array of SF models for cross-sectional and panel data. With respect to the available official Stata commands, frontier and xtfrontier, we add multiple features for estimating frontier parameters and for postestimating unit inefficiency and efficiency. In the development of the commands, we made extensive use of Mata's capabilities. By using Mata structures, we provide a very readable code that can be easily developed further by other Stata users.

We illustrated the commands' estimation capabilities through simulated data, focusing on some of the models that cannot be estimated using official Stata commands.


Finally, we illustrated the proposed routines using real datasets under different possible empirical scenarios: short versus long panels, cost versus production frontiers, homogeneous versus heterogeneous outputs.

8 Acknowledgments

We are grateful to David Drukker and all participants at the 2009 Italian Stata Users Group meeting for useful comments. We thank William Greene for valuable advice and discussions and for maintaining an excellent webpage and making available several databases, two of which we have extracted and used in the empirical applications.

9 References

Ahn, S. C., Y. Hoon Lee, and P. Schmidt. 2001. GMM estimation of linear panel data models with time-varying individual effects. Journal of Econometrics 101: 219–255.

Aigner, D. J., C. A. K. Lovell, and P. Schmidt. 1977. Formulation and estimation of stochastic frontier production function models. Journal of Econometrics 6: 21–37.

Alvarez, A., and C. Arias. 2004. Technical efficiency and farm size: A conditional analysis. Agricultural Economics 30: 241–250.

Battese, G. E., and T. J. Coelli. 1988. Prediction of firm-level technical efficiencies with a generalized frontier production function and panel data. Journal of Econometrics 38: 387–399.

———. 1992. Frontier production functions, technical efficiency and panel data: With application to paddy farmers in India. Journal of Productivity Analysis 3: 153–169.

———. 1995. A model for technical inefficiency effects in a stochastic frontier production function for panel data. Empirical Economics 20: 325–332.

Belotti, F., and G. Ilardi. 2012. Consistent estimation of the “true” fixed-effects stochastic frontier model. CEIS Tor Vergata: Research Paper Series.

Caudill, S. B., and J. M. Ford. 1993. Biases in frontier estimation due to heteroscedasticity. Economics Letters 41: 17–20.

Caudill, S. B., J. M. Ford, and D. M. Gropper. 1995. Frontier estimation and firm-specific inefficiency measures in the presence of heteroscedasticity. Journal of Business and Economic Statistics 13: 105–111.

Chen, Y.-Y., P. Schmidt, and H.-J. Wang. 2011. Consistent estimation of the fixed effects stochastic frontier model. http://www.economics.uwo.ca/newsletter/misc/2011/schmidt_nov11.pdf.

Cornwell, C., P. Schmidt, and R. C. Sickles. 1990. Production frontiers with cross-sectional and time-series variation in efficiency levels. Journal of Econometrics 46: 185–200.


Cuesta, R. A. 2000. A production model with firm-specific temporal variation in technical inefficiency: With application to Spanish dairy farms. Journal of Productivity Analysis 13: 139–158.

Drukker, D. M., and R. Gates. 2006. Generating Halton sequences using Mata. Stata Journal 6: 214–228.

Farsi, M., M. Filippini, and W. Greene. 2005. Efficiency measurement in network industries: Application to the Swiss railway companies. Journal of Regulatory Economics 28: 69–90.

Greene, W. 2005a. Reconsidering heterogeneity in panel data estimators of the stochastic frontier model. Journal of Econometrics 126: 269–303.

———. 2005b. Fixed and random effects in stochastic frontier models. Journal of Productivity Analysis 23: 7–32.

Greene, W. H. 1980a. Maximum likelihood estimation of econometric frontier functions. Journal of Econometrics 13: 27–56.

———. 1980b. On the estimation of a flexible frontier production model. Journal of Econometrics 13: 101–115.

———. 2003. Simulated likelihood estimation of the normal-gamma stochastic frontier function. Journal of Productivity Analysis 19: 179–190.

———. 2012. Econometric Analysis. 7th ed. Upper Saddle River, NJ: Prentice Hall.

Hadri, K. 1999. Estimation of a doubly heteroscedastic stochastic frontier cost function. Journal of Business and Economic Statistics 17: 359–363.

Han, C., L. Orea, and P. Schmidt. 2005. Estimation of a panel data model with parametric temporal variation in individual effects. Journal of Econometrics 126: 241–267.

Horrace, W. C., and P. Schmidt. 1996. Confidence statements for efficiency estimates from stochastic frontier models. Journal of Productivity Analysis 7: 257–282.

Huang, C. J., and J.-T. Liu. 1994. Estimation of a non-neutral stochastic frontier production function. Journal of Productivity Analysis 5: 171–180.

Jondrow, J., C. A. K. Lovell, I. S. Materov, and P. Schmidt. 1982. On the estimation of technical inefficiency in the stochastic frontier production function model. Journal of Econometrics 19: 233–238.

Kumbhakar, S. C. 1990. Production frontiers, panel data, and time-varying technical inefficiency. Journal of Econometrics 46: 201–211.

Kumbhakar, S. C., S. Ghosh, and J. T. McGuckin. 1991. A generalized production frontier approach for estimating determinants of inefficiency in U.S. dairy farms. Journal of Business and Economic Statistics 9: 279–286.


Kumbhakar, S. C., and C. A. K. Lovell. 2000. Stochastic Frontier Analysis. Cambridge: Cambridge University Press.

Lancaster, T. 2002. The incidental parameters problem since 1948. Journal of Econometrics 95: 391–414.

Lee, Y. H., and P. Schmidt. 1993. A production frontier model with flexible temporal variation in technical efficiency. In The Measurement of Productive Efficiency: Techniques and Applications, ed. H. O. Fried, C. A. Knox Lovell, and S. S. Schmidt, 237–255. New York: Oxford University Press.

Meeusen, W., and J. van den Broeck. 1977. Efficiency estimation from Cobb–Douglas production functions with composed error. International Economic Review 18: 435–444.

Neyman, J., and E. Scott. 1948. Consistent estimates based on partially consistent observations. Econometrica 16: 1–32.

Pitt, M. M., and L.-F. Lee. 1981. The measurement and sources of technical inefficiency in the Indonesian weaving industry. Journal of Development Economics 9: 43–64.

Ritter, C., and L. Simar. 1997. Pitfalls of normal-gamma stochastic frontier models. Journal of Productivity Analysis 8: 167–182.

Schmidt, P., and R. C. Sickles. 1984. Production frontiers and panel data. Journal of Business and Economic Statistics 2: 367–374.

Stevenson, R. E. 1980. Likelihood functions for generalized stochastic frontier estimation. Journal of Econometrics 13: 57–66.

Wang, H.-J. 2002. Heteroscedasticity and non-monotonic efficiency effects of a stochastic frontier model. Journal of Productivity Analysis 18: 241–253.

Wang, H.-J., and P. Schmidt. 2002. One-step and two-step estimation of the effects of exogenous variables on technical efficiency levels. Journal of Productivity Analysis 18: 129–144.

About the authors

Federico Belotti is a researcher at the Centre for Economics and International Studies (CEIS) of the University of Rome Tor Vergata.

Silvio Daidone is a research associate at the Centre for Health Economics (CHE) of the University of York and an economist at the Agricultural Development Economics Division of the Food and Agriculture Organization of the United Nations.

Giuseppe Ilardi is a researcher at the Economic and Financial Statistics Department of the Bank of Italy.

Vincenzo Atella is an associate professor at the University of Rome Tor Vergata.


The Stata Journal (2013) 13, Number 4, pp. 759–775

Flexible parametric illness-death models

Sally R. Hinchliffe
Department of Health Sciences
University of Leicester
Leicester, UK
[email protected]

David A. Scott
Oxford Outcomes Ltd
Oxford, UK
[email protected]

Paul C. Lambert
Department of Health Sciences
University of Leicester
Leicester, UK
and
Department of Medical Epidemiology and Biostatistics
Karolinska Institutet
Stockholm, Sweden
[email protected]

Abstract. It is usual in time-to-event data to have more than one event of interest, for example, time to death from different causes. Competing risks models can be applied in these situations where events are considered mutually exclusive absorbing states. That is, we have some initial state—for example, alive with a diagnosis of cancer—and we are interested in several different endpoints, all of which are final. However, the progression of disease will usually consist of one or more intermediary events that may alter the progression to an endpoint. These events are neither initial states nor absorbing states. Here we consider one of the simplest multistate models, the illness-death model. stpm2illd is a postestimation command used after fitting a flexible parametric survival model with stpm2 to estimate the probability of being in each of four states as a function of time. There is also the option to generate confidence intervals and transition hazard functions. The new command is illustrated through a simple example.

Keywords: st0316, illdprep, stpm2illd, survival analysis, multistate models, flexible parametric models

1 Introduction

It is usual in time-to-event data to have more than one event of interest, for example, time to death from different causes. If we treat these events as mutually exclusive endpoints where the occurrence of an event is final, then we can apply a competing risks model (Prentice et al. 1978; Colzani et al. 2011; Hinchliffe and Lambert 2013a,b). These endpoints are known as absorbing states, and we model the time to each of these from some initial state, for example, alive with a diagnosis of cancer. However, the progression of disease will usually consist of one or more intermediary events that may

© 2013 StataCorp LP st0316


alter the progression to an endpoint (Putter, Fiocco, and Geskus 2007). These events cannot be classified as initial states or absorbing states and so are known as transient states or intermediate states.

Illness-death models are a special case of multistate models, where individuals start out healthy and then may become ill and go on to die. In theory, some patients may recover from an illness and become healthy again (Andersen, Abildstrom, and Rosthøj 2002). This is known as a bidirectional illness-death model. We will consider only the unidirectional model as illustrated in figure 1.

The two main measures of interest for analyses of this type are the transition hazards and the probability of being in each state as a function of time. The transition hazards can inform us about the impact of risk factors on rates of illness and disease or mortality. Additionally, the probabilities of being in each state provide an absolute measure on which to base prognosis and clinical decisions (Koller et al. 2012). The purpose of this article is to explain how to set up the data using illdprep in a format that allows flexible parametric survival models (stpm2) to estimate transition hazards. Using the postestimation command stpm2illd, we can then obtain both the probability of being in each state as a function of time and the confidence intervals for each.

Figure 1. Unidirectional illness-death model


2 Methods

Figure 1 shows a graphical representation of a unidirectional illness-death model. The states are represented with a box and given a number from one to four. The transitions are represented by arrows going from one state to another. In total, there are three transitions labeled from one to three. We represent a transition from state i to j by i → j; therefore, the transition hazards are denoted on the diagram as α13, α12, and α24 (Putter, Fiocco, and Geskus 2007). If T denotes the time of reaching state j from state i, we denote the hazard rate (transition intensity) of the i → j transition by

\alpha_{ij}(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \qquad (1)

Currently, most applications of illness-death models involve the Cox model. However, we are interested in parametric estimates and so advocate the use of the flexible parametric survival model, first proposed by Royston and Parmar (2002). The approach uses restricted cubic spline functions to model the baseline log cumulative hazard. It has an advantage over other well-known models such as the Cox model in that it produces smooth predictions and can be extended to incorporate complex time-dependent effects, again through the use of restricted cubic splines. The Stata implementation of the model using stpm2 is described in detail elsewhere (Lambert and Royston 2009).
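For readers who have not used stpm2, a minimal call on already stset data might look as follows; the covariate and the 4 degrees of freedom for the baseline spline are illustrative choices, not taken from this article.

. stpm2 age, scale(hazard) df(4) eform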

The transition hazard rates in (1) can be obtained from the flexible parametric survival model. This could be done by fitting separate models for each of the three transitions, but this would not allow for shared parameters. It is possible to fit one model for all three transitions simultaneously by stacking the data so that each individual patient has up to three rows of data, dependent on how many transitions each patient is at risk of.

Table 1 shows four cancer patients of varying ages who are all at risk of both relapse of their cancer and death. Relapse can be considered an intermediary event, whereas death is final and thus an absorbing state. Patient 1, aged 44, is at risk of both relapse and death for 2.4 years until the patient relapses and goes on to die after 7.6 years. Patient 2, aged 68, is at risk of both relapse and death for 9 years until the patient dies and is no longer at risk of relapse. Patient 3, aged 52, is at risk of both relapse and death until the patient is censored at 6.1 years. Finally, patient 4, aged 38, is at risk of both relapse and death for 4.6 years until the patient relapses and is at risk of death until being censored at 13.8 years.

To model all three transitions simultaneously, we need to set up the data as shown in table 2. The data have been expanded so that each patient now has up to three rows of data. As shown in figure 1, transition 1 goes from alive and well to dead, transition 2 goes from alive and well to ill, and transition 3 goes from ill to dead. Patient 1 is at risk of both relapse (state 2) and death (state 3) for 2.4 years when the patient relapses. The patient is then at risk of death with relapse (state 4) from 2.4 years to 7.6 years, when he or she dies. Patient 2 is at risk of both relapse (state 2) and death (state 3) for 9 years until the patient dies and is no longer at risk of relapse. Because patient 2 never experienced a relapse, the patient is never at risk of experiencing state 4. Therefore, in


the expanded data, he or she has only two rows of data. Patient 3 is at risk of both relapse (state 2) and death (state 3) for 6.1 years when the patient is censored from the study. Again, because patient 3 never experienced a relapse, the patient is never at risk of experiencing transition 3 and thus has only two rows of data. Finally, patient 4 is at risk of both relapse (state 2) and death (state 3) for 4.6 years when he or she relapses. The patient is then at risk of death with relapse (state 4) from 4.6 years to 13.8 years when the patient is censored.

Table 1. Standard dataset with relapse and survival times (years) for four patients

ID   Age   Relapse time   Relapse indicator   Survival time   Death indicator
 1    44        2.4               1                 7.6              1
 2    68        9.0               0                 9.0              1
 3    52        6.1               0                 6.1              0
 4    38        4.6               1                13.8              0

Table 2. Expanded dataset with transition indicators and start and stop times (years) for four patients

ID   Age   Trans 1   Trans 2   Trans 3   Status   Start   Stop
 1    44      1         0         0        0        0      2.4
 1    44      0         1         0        1        0      2.4
 1    44      0         0         1        1       2.4     7.6
 2    68      1         0         0        1        0      9.0
 2    68      0         1         0        0        0      9.0
 3    52      1         0         0        0        0      6.1
 3    52      0         1         0        0        0      6.1
 4    38      1         0         0        0        0      4.6
 4    38      0         1         0        1        0      4.6
 4    38      0         0         1        0       4.6    13.8
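To make the expansion concrete, here is a minimal sketch of how table 2 could be built by hand from table 1; illdprep automates this (including the handling of tied event times). The names id, relapse, dead, relapsetime, and survtime are assumptions matching table 1, with relapsetime equal to the death or censoring time when no relapse occurs.

. expand 3
. bysort id: generate trans = _n
. drop if trans == 3 & relapse == 0     // third row exists only after a relapse
. generate trans1 = trans == 1          // alive and well -> dead
. generate trans2 = trans == 2          // alive and well -> ill
. generate trans3 = trans == 3          // ill -> dead
. generate start = cond(trans == 3, relapsetime, 0)
. generate stop = cond(trans == 3, survtime, relapsetime)
. generate status = cond(trans == 1, dead == 1 & relapse == 0,
>     cond(trans == 2, relapse, dead))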

The transition hazard rates can be transformed into the probability of being in each of the four states (state occupation probabilities) through the following relationships. Notice that as in the competing risks setting, there is not a one-to-one correspondence between the transition hazards and the transition probabilities: the latter is a function of multiple transition hazards.


The probability of being alive and well will depend on both the transition rate from alive to dead [α13(t)] and the transition rate from alive to relapse [α12(t)]. An individual needs to have survived both death (state 3) and illness (state 2) to remain in the state representing alive and well. This is essentially the survival probability where both death and illness are considered events.

P(\text{alive and well at time } t) = \exp\left[-\int_0^t \{\alpha_{13}(s) + \alpha_{12}(s)\}\, ds\right] \qquad (2)

When estimating the probability of being alive with illness, we have to consider not only the probability of getting ill but also the probability of remaining alive with the illness (that is, of not moving to state 4). The probability of being ill is a function of the transition hazard from alive (state 1) to ill (state 2) and the probability of being alive and well from (2). The probability of remaining alive with the illness (that is, staying in state 2) is the survival function for the transition from ill to death (transition 3 in figure 1).

P(\text{alive with illness at time } t) = \int_0^t P(\text{ill at time } s) \times P(\text{survive with illness from } s \text{ to } t)\, ds

= \int_0^t \alpha_{12}(s) \exp\left[-\int_0^s \{\alpha_{13}(u) + \alpha_{12}(u)\}\, du\right] \times \exp\left[-\int_s^t \alpha_{24}(u)\, du\right] ds \qquad (3)

The probability of dying without illness is a function of the transition hazard from alive (state 1) to dead (state 3) and the probability of being alive and well from (2).

P(\text{dead without illness at time } t) = \int_0^t \alpha_{13}(s) \exp\left[-\int_0^s \{\alpha_{13}(u) + \alpha_{12}(u)\}\, du\right] ds \qquad (4)

Finally, the probability of dying with illness can be estimated by subtracting the probability of being in each of the other three states from 1.

P(\text{dead with illness at time } t) = 1 - P(\text{alive and well at time } t) - P(\text{ill at time } t) - P(\text{dead without illness at time } t) \qquad (5)

To get the overall probability of death at time t, we add P(dead without illness at time t) and P(dead with illness at time t). Confidence intervals can be calculated for each of these probabilities using the delta method (Carstensen 2006; Lambert et al. 2010).
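As a rough numerical illustration of (2) and (4), the sketch below applies trapezoidal integration with Stata's integ command to transition hazards evaluated on a time grid; it assumes hazard variables h_trans1 (alive to dead) and h_trans2 (alive to ill) and a grid variable _newt, which are the default names produced by the stpm2illd command, described later, when its hazard option is used.

. sort _newt
. generate double h_leave1 = h_trans1 + h_trans2      // hazard of leaving state 1
. integ h_leave1 _newt, generate(H_leave1) trapezoid  // cumulative hazard
. generate double S_well = exp(-H_leave1)             // equation (2)
. generate double f13 = h_trans1*S_well               // integrand of (4)
. integ f13 _newt, generate(P_deadwell) trapezoid     // equation (4)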

Page 102: The Stata Journal · The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA. Address changes should be sent to the Stata Journal , StataCorp, 4905

764 Flexible parametric illness-death models

3 The illdprep command

The illdprep command is used before stset and stpm2 to set the data up in the format needed for illness-death models as shown in table 2 in section 2.

3.1 Syntax

illdprep, id(varlist) statevar(varlist) statetime(varlist) [status(varname) transname(varlist) addtime(real)]

3.2 Options

id(varlist) specifies the name of the ID variable in the dataset. Before the command is used, each ID number should have just one row of data. The command will expand the data so that each ID number will have up to three rows of data. id() is required.

statevar(varlist) specifies the names of the two event-indicator variables needed to split the data. As demonstrated in figure 1 and table 2, an indicator variable will be needed to specify whether a patient has become ill and whether a patient has died. Because death is a final absorbing state, this must come last in the varlist. So, for example, if we were interested in relapse and death and our event-indicator variables were relapse and dead, then we would specify statevar(relapse dead) in that order. statevar() is required.

statetime(varlist) specifies the names of the two event-time variables. The variables should be input in the order that corresponds to statevar(varlist). So if our event-time variables were relapsetime and survtime, then we would specify statetime(relapsetime survtime) in that order to correspond with the example given for statevar(varlist). statetime() is required.

status(varname) allows the user to specify the name of the newly generated status variable as shown in table 2.

transname(varlist) allows the user to specify the names of the newly generated transition indicators. The default for these is trans1, trans2, and trans3. The user must specify these in the order that corresponds with figure 1. varlist must contain three variable names.

addtime(real) specifies an amount to add to the death time when event times are tied. For example, if a patient both relapses and dies at the same time in the data, then the user could add 0.1 to the death time so that the stset command does not drop the third transition. The specified value will obviously depend on the time units in the data.
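As an illustration, a call that exercises every option might look like the following; the dataset variables anticipate the Rotterdam example of section 5, while the names given to status() and transname() are arbitrary.

. illdprep, id(pid) statevar(rfi osi) statetime(rf os) status(mystatus)
>     transname(t1 t2 t3) addtime(0.1)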


4 The stpm2illd command

The stpm2illd command is a postestimation command used after stpm2 to obtain the predictions given in (2), (3), (4), and (5) in section 2. The names specified in newvarlist coincide with the order of the transitions entered in the options.

4.1 Syntax

stpm2illd newvarlist, trans1(varname # [varname # ...]) trans2(varname # [varname # ...]) trans3(varname # [varname # ...]) [obs(integer) ci mint(real) maxt(real) timename(varname) hazard hazname(varlist) combine]

4.2 Options

trans1(varname # [varname # ...]) . . . trans3(varname # [varname # ...]) requests that the covariates specified by the listed varname be set to # when predicting the hazards for each transition. The transition numbers correspond to those in the diagram above. Therefore, trans1() relates to the transition from alive to dead, trans2() relates to the transition from alive to ill, and trans3() relates to the transition from ill to dead. trans1(), trans2(), and trans3() are required.

obs(integer) specifies the number of observations (of time) to predict for. The default is obs(1000). Observations are evenly spread between the minimum and maximum value of follow-up time. Note: Because the command uses numerical integration, if the number of specified observations is too small, then it may result in biased estimates.

ci calculates a 95% confidence interval for the probabilities of being in each state and stores the confidence limits in prob_newvar_lci and prob_newvar_uci.

mint(real) specifies the minimum value of follow-up time. The default is set as the minimum event time from stset.

maxt(real) specifies the maximum value of follow-up time. The default is set as the maximum event time from stset.

timename(varname) is the name given to the time variable used for predictions. The default is timename(_newt). Note that this is the variable for time that needs to be used when plotting curves for the transition hazards and probabilities.

hazard predicts the hazard function for each transition.

hazname(varlist) allows the user to specify the names for the transition hazards if the hazard option is chosen. These will then be stored in variables called h_var. The default is hazname(trans1 trans2 trans3), which causes the variables h_trans1, h_trans2, and h_trans3 to be created. varlist must contain three variable names.


combine allows the user to combine the probabilities of being in states 3 and 4 to give the overall probability of death. If this option is specified, then the user only needs to specify three names in newvarlist. The last name given in the list should correspond to the combined probability of states 3 and 4. So, for example, if we write alive ill dead in the newvarlist, then the probability of being in each state as a function of time will be stored as prob_alive, prob_ill, and prob_dead.
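To fix ideas, a call using several of these options might look like the sketch below; it assumes a fitted model whose only covariates are the three transition indicators, so each trans#() option simply switches its own indicator on.

. stpm2illd alive ill dead, trans1(trans1 1) trans2(trans2 1) trans3(trans3 1)
>     ci combine hazard

This would store prob_alive, prob_ill, and prob_dead (with their _lci and _uci limits) and the transition hazards h_trans1 to h_trans3, all evaluated on the _newt time grid.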

5 Example

The Rotterdam breast cancer data used in this example are taken from Royston and Lambert (2011). Download the data at http://www.stata-press.com/data/fpsaus.html. The data contain information on 2,982 patients with primary breast cancer. Both the time to relapse and the time to death are recorded.

We must first set up the data so that they are in the format required to use the stpm2 and stpm2illd commands.

. use rott2
(Rotterdam breast cancer data, truncated at 10 years)

. illdprep, id(pid) statevar(rfi osi) statetime(rf os) addtime(0.1)

Note that .1 has been added to os for one or more individuals as the addtime
option has been specified by the user. These individuals are indicated with
a value of 1 in the newly generated _check variable.

Note that one or more individuals have the rfi event at the same time as they
are censored for the rfi event. The program assumes that the individual
was not at risk of osi after the rfi time and therefore will not have a third
row in the data. These individuals are indicated with a value of 1 in the newly
generated _check2 variable. The user may wish to change this in the original
data and rerun the command.

The data have been expanded so that each patient has up to three rows of data as demonstrated in tables 1 and 2. Three indicator variables have been created for each of the three transitions (trans1, trans2, and trans3). A variable, trans, is also stored in the data and will be needed to obtain initial values in the stpm2 command. A further indicator variable called status has been created to summarize which of the three transitions each patient has experienced: 1 indicates that the patient has experienced the transition, and 0 indicates otherwise. The addtime() option has been specified to add 0.1 to the death time for any patients who relapse and die at the exact same time. The relapse and death times are in months from diagnosis; thus 0.1 is equivalent to approximately 3 days in this example. A _check variable has been generated to indicate which patients had this amount added to their death time. A warning has also been given for one or more patients who have a relapse and are censored for the death event at the same time. This means that for such a patient, the command has dropped the third row of data representing transition 3 because the patient was never actually at risk of death after relapse. Finally, the command has generated start and stop times to show when a patient enters and exits each state. These newly generated variables can be used to stset the data. We can then run the stpm2 command for all three transitions simultaneously.


. stset stop, enter(start) failure(status==1) scale(12)

     failure event:  status == 1
obs. time interval:  (0, stop]
 enter on or after:  time start
 exit on or before:  failure
    t for analysis:  time/12

     7471  total obs.
        0  exclusions

     7471  obs. remaining, representing
     2790  failures in single record/single failure data
 38398.57  total analysis time at risk, at risk from t = 0
                             earliest observed entry t = 0
                                  last observed exit t = 19.28268

. stpm2 trans1 trans2 trans3 age, scale(hazard) rcsbaseoff nocons dftvc(3)
>     tvc(trans1 trans2 trans3) initstrata(trans) eform
note: delayed entry models are being fitted

Iteration 0:   log likelihood = -5497.7319
Iteration 1:   log likelihood = -5495.6716
Iteration 2:   log likelihood = -5495.6418
Iteration 3:   log likelihood = -5495.6418

Log likelihood = -5495.6418                     Number of obs   =      7471

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
      trans1 |     .02331   .0028974   -30.24   0.000      .01827    .0297403
      trans2 |   .2455235   .0216091   -15.96   0.000     .206622     .291749
      trans3 |   .9442842   .1211267    -0.45   0.655    .7343719    1.214198
         age |   1.008449   .0015035     5.64   0.000    1.005507      1.0114
_rcs_trans11 |   3.537942   .3075088    14.54   0.000    2.983778    4.195029
_rcs_trans12 |   .9383132   .0507433    -1.18   0.239    .8439475     1.04323
_rcs_trans13 |   .9906213   .0352729    -0.26   0.791    .9238449    1.062224
_rcs_trans21 |   2.539793   .0574909    41.18   0.000    2.429576     2.65501
_rcs_trans22 |    1.29505    .024191    13.84   0.000    1.248494    1.343342
_rcs_trans23 |   .9669232   .0094508    -3.44   0.001    .9485762     .985625
_rcs_trans31 |   2.171531    .209309     8.04   0.000    1.797714     2.62308
_rcs_trans32 |   1.162727   .0698784     2.51   0.012    1.033527    1.308079
_rcs_trans33 |   .9826401      .0147    -1.17   0.242    .9542469    1.011878
------------------------------------------------------------------------------


Patients can be at risk of death with relapse only after they have experienced the relapse event; therefore, the time for this state is later than the time of origin. This means that a delayed entry model is fit, as indicated in the stpm2 output. By default, the stpm2 command obtains initial values from a Cox model. The initstrata() option in the command line allows for this Cox model to be stratified by the three transitions. By including the three transition indicators (trans1, trans2, and trans3) as both main effects and time-dependent effects (using the tvc() option), we have fit a stratified model with three separate baselines, one for each transition. For this reason, we have used the rcsbaseoff option together with the nocons option, which excludes the baseline hazard from the model. The hazard ratio (95% confidence interval) for age is 1.008449 (1.005507 to 1.0114). This means that all three transition rates increase by 0.8% with each yearly increase in age. By including age in the model in this way, we have assumed that the effect of age remains constant across all three transitions. This is unlikely to be the case.

By including interaction terms between age and the three transition indicators, we can estimate a different age effect for each transition.

. forvalues i=1/3 {
  2.     generate trans`i'age=trans`i'*age
  3. }

. stpm2 trans1 trans2 trans3 trans1age trans2age trans3age,
>     scale(hazard) rcsbaseoff nocons dftvc(2)
>     tvc(trans1 trans2 trans3) initstrata(trans) eform
note: delayed entry models are being fitted

Iteration 0:   log likelihood = -5369.4658
Iteration 1:   log likelihood = -5332.4523
Iteration 2:   log likelihood = -5330.8393
Iteration 3:   log likelihood = -5330.8192
Iteration 4:   log likelihood = -5330.8191

Log likelihood = -5330.8191                     Number of obs   =      7471

------------------------------------------------------------------------------
             |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
      trans1 |   8.91e-06   5.07e-06   -20.41   0.000    2.92e-06    .0000272
      trans2 |   .4515908   .0521128    -6.89   0.000    .3601785    .5662032
      trans3 |   1.181057   .1969131     1.00   0.318    .8518305    1.637527
   trans1age |   1.139042   .0089574    16.55   0.000    1.121621    1.156735
   trans2age |   .9974217   .0020578    -1.25   0.211    .9933966    1.001463
   trans3age |   1.006303   .0023563     2.68   0.007    1.001696    1.010932
_rcs_trans11 |   3.951158   .3388209    16.02   0.000    3.339888    4.674303
_rcs_trans12 |   .8822663   .0454067    -2.43   0.015    .7976121    .9759051
_rcs_trans21 |   2.493473   .0543812    41.89   0.000    2.389134    2.602369
_rcs_trans22 |   1.240989   .0179256    14.95   0.000    1.206348    1.276624
_rcs_trans31 |   1.939886   .1551909     8.28   0.000    1.658365    2.269198
_rcs_trans32 |   1.078697    .035531     2.30   0.021    1.011258    1.150633
------------------------------------------------------------------------------

The hazard ratio (95% confidence interval) for the age transition 1 interaction is 1.139042 (1.121621 to 1.156735), which suggests that the transition rate from alive to dead increases by approximately 14% with every yearly increase in age. The hazard ratio (95% confidence interval) for the age transition 2 interaction is 0.9974217 (0.9933966 to


1.001463), which suggests that the transition rate from alive to relapse decreases with age; however, this is not significant. Finally, the hazard ratio (95% confidence interval) for the age transition 3 interaction is 1.006303 (1.001696 to 1.010932), which suggests that for those who relapse, the transition rate from relapse to dead also increases with age.

Now that we have run stpm2, we can run the postestimation command stpm2illd to obtain the probability of being in each of the four states as demonstrated in figure 1. Because we have included age as a continuous variable, we need to choose a particular covariate pattern for which to make the predictions. We will run the stpm2illd command twice, once for age 65 and once for age 85.

. * Age 65 *

. stpm2illd alive65 relapse65 death65 relapsedeath65,
>     trans1(trans1 1 trans1age 65) trans2(trans2 1 trans2age 65)
>     trans3(trans3 1 trans3age 65) ci

. * Age 85 *

. stpm2illd alive85 relapse85 death85 relapsedeath85,
>     trans1(trans1 1 trans1age 85) trans2(trans2 1 trans2age 85)
>     trans3(trans3 1 trans3age 85) ci

The trans1() to trans3() options give the linear predictor for each of the three transitions for which we want the prediction. The commands have generated eight new variables containing the probabilities of being in each state. The predictions for age 65 are denoted with a 65 at the end of the variable name, and the predictions for age 85 are denoted with an 85. The eight probabilities are prob_alive65, prob_relapse65, prob_death65, prob_relapsedeath65, prob_alive85, prob_relapse85, prob_death85, and prob_relapsedeath85. Each of these variables has a corresponding upper and lower confidence bound, for example, prob_alive65_lci and prob_alive65_uci. These were created when the ci option was specified.


If we plot the probability of each state along with its confidence intervals against time for both age 65 and age 85, we can achieve plots as shown in figure 2.
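The article does not list the graph code, but a single panel of figure 2 could be drawn along the following lines; the rarea-plus-line overlay is one common way to plot a prediction with its confidence band, not necessarily how the published figure was made.

. twoway (rarea prob_alive65_lci prob_alive65_uci _newt, color(gs13))
>     (line prob_alive65 _newt),
>     title("Alive") xtitle("Time since diagnosis (years)")
>     ytitle("Probability of being in state")
>     ylabel(0(0.2)1, angle(0) format(%3.1f))
>     legend(order(2 "Probability" 1 "95% CI") rows(1)) scheme(sj)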

[Figure 2: for (a) age 65 and (b) age 85, the panels Alive, Relapse, Death, and Relapse then Death plot the probability of being in each state, with its 95% CI, against time since diagnosis (years), each on a 0.0 to 1.0 scale over 0 to 20 years.]

Figure 2. Probability of being alive and well, having a relapse, dying before relapse, or dying after relapse as a function of time since diagnosis (years) for those aged 65 and 85

Figure 2 shows that the probability of remaining alive and well is significantly lower for those aged 85 compared with those aged 65. By 15 years, the probability of being alive and well is almost 0 for those aged 85. As expected, the probability of dying before relapse is higher for those aged 85, with values reaching approximately 0.63 by 15 years compared with 0.15 for those aged 65.


The plot for the probability of relapse is different in shape from the other three plots. This is because relapse is a transient state; patients may enter the relapse state, but after some time, they may leave that state and go on to die. This gives the curve that peaks after about 3 or 4 years for both those aged 65 (probability approximately 0.2) and those aged 85 (probability approximately 0.18). The curve then begins to decrease as more patients with relapse go on to die. Finally, the probability of death for those that suffer a relapse is higher at age 65 (approximately 0.48) than at age 85 (approximately 0.34). This is due to the high number of deaths before relapse in those aged 85.

The model shown above assumes proportional hazards for the age transition interactions. In many epidemiological studies, the effect of age will be time dependent. We will now fit the flexible parametric survival model again and include time-dependent effects for the age transition interactions. This time, we want to obtain only one estimate for the overall probability of death, that is, to combine the probabilities of being in states 3 and 4 in figure 1. To do this, we need to use the combine option. When we use this option, we only need to specify three new variable names in the stpm2illd command line.


. stpm2 trans1 trans2 trans3 trans1age trans2age trans3age,
>     scale(hazard) rcsbaseoff nocons
>     dftvc(trans1age:2 trans2age:2 trans3age:2 3)
>     tvc(trans1 trans2 trans3 trans1age trans2age trans3age)
>     initstrata(trans) eform
note: delayed entry models are being fitted

Iteration 0:   log likelihood = -5324.6353
Iteration 1:   log likelihood = -5311.9706
Iteration 2:   log likelihood = -5310.9136
Iteration 3:   log likelihood = -5310.8708
Iteration 4:   log likelihood = -5310.8707

Log likelihood = -5310.8707                     Number of obs   =      7471

---------------------------------------------------------------------------------
                |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
xb              |
         trans1 |     .00001   6.98e-06   -16.52   0.000    2.56e-06    .0000392
         trans2 |   .4132457   .0503478    -7.25   0.000    .3254634    .5247042
         trans3 |   .7806882   .2184512    -0.88   0.376    .4511235    1.351014
      trans1age |   1.137403   .0109405    13.38   0.000     1.11616    1.159049
      trans2age |   .9989598    .002172    -0.48   0.632    .9947119    1.003226
      trans3age |   1.011249   .0048782     2.32   0.020    1.001733    1.020855
   _rcs_trans11 |   4.143021   2.836773     2.08   0.038    1.082654    15.85421
   _rcs_trans12 |    1.55668   .7534786     0.91   0.361    .6028274    4.019812
   _rcs_trans13 |   .9768487   .0402173    -0.57   0.569    .9011207    1.058941
   _rcs_trans21 |   3.084326   .2939969    11.82   0.000    2.558728     3.71789
   _rcs_trans22 |   1.552191   .1114611     6.12   0.000    1.348408    1.786772
   _rcs_trans23 |   .9740596   .0096225    -2.66   0.008    .9553812    .9931032
   _rcs_trans31 |   3.232405   .8243798     4.60   0.000    1.960824    5.328597
   _rcs_trans32 |    1.59504   .2284165     3.26   0.001    1.204692     2.11187
   _rcs_trans33 |    .987701   .0144313    -0.85   0.397    .9598174    1.016395
_rcs_trans1age1 |   .9992748   .0093111    -0.08   0.938     .981191    1.017692
_rcs_trans1age2 |   .9922872   .0064791    -1.19   0.236    .9796694    1.005068
_rcs_trans2age1 |   .9965224   .0016427    -2.11   0.035     .993308    .9997471
_rcs_trans2age2 |     .99681    .001214    -2.62   0.009    .9944334    .9991922
_rcs_trans3age1 |   .9937488   .0039773    -1.57   0.117    .9859838    1.001575
_rcs_trans3age2 |   .9949577   .0020061    -2.51   0.012    .9910336    .9988974
---------------------------------------------------------------------------------

. drop prob_alive65 prob_relapse65 prob_death65 prob_relapsedeath65
>     prob_alive85 prob_relapse85 prob_death85 prob_relapsedeath85

. * Age 65 *

. stpm2illd alive65 relapse65 death65, trans1(trans1 1 trans1age 65)
>     trans2(trans2 1 trans2age 65) trans3(trans3 1 trans3age 65) ci combine

. * Age 85 *

. stpm2illd alive85 relapse85 death85, trans1(trans1 1 trans1age 85)
>     trans2(trans2 1 trans2age 85) trans3(trans3 1 trans3age 85) ci combine

Notice that we have allowed different degrees of freedom for the age transition interactions (2 df) and the three separate transition baselines (3 df) by specifying this in the dftvc() option. We have also dropped the variables generated in the previous stpm2illd command. If users did not wish to do this, then they would have to specify different names for the probability variables when running the command again. Rather than graphing the probabilities of being in each state as separate line plots (as we did previously), we can display them by stacking the probabilities on top of one another. This produces a graph as shown in figure 3. To do this, we need to generate new


variables that sum up the probabilities. This is done for each of the two age predictions, 65 and 85. The code shown below is for those aged 85 only.

. generate tot1=prob_alive85
(6471 missing values generated)

. generate tot2=prob_alive85+prob_relapse85
(6471 missing values generated)

. generate tot3=prob_alive85+prob_relapse85+prob_death85
(6471 missing values generated)

. twoway (area tot3 _newt if _newt<=15, sort)
>     (area tot2 _newt if _newt<=15, sort) (area tot1 _newt if _newt<=15, sort),
>     legend(order(3 "Alive and well" 2 "Relapse" 1 "Dead") rows(1))
>     ylabel(0(0.2)1, angle(0) format(%3.1f))
>     xtitle("Time since diagnosis (years)") title("Age 85")
>     plotregion(margin(zero)) scheme(sj)

[Figure 3: stacked area plots of the probabilities Alive and well, Relapse, and Dead against time since diagnosis (years), on a 0.0 to 1.0 scale over 0 to 15 years, for (a) age 65 and (b) age 85.]

Figure 3. Stacked probability of being alive, having a relapse, and dying as a function of time since diagnosis (years) for those aged 65 and 85


As we showed previously in figure 2, the probability of remaining alive and well for those aged 85 decreases to almost 0 over the period of 15 years. The probability of being alive after relapse is highest between approximately 1 and 5 years since breast cancer diagnosis for those aged 85. It then starts to decrease as more patients die with relapse. For those aged 65, the probability of being alive after relapse remains fairly stable beyond 5 years. By 15 years, approximately 65% of those aged 65 and 98% of those aged 85 have died.

6 Conclusion

The new commands illdprep and stpm2illd, in conjunction with the existing command stpm2, provide a suite of programs that will enable users to estimate transition hazards and probabilities within an illness-death model framework using flexible parametric survival models. We hope that it will be a useful tool in medical research. The illness-death model is a very simple multistate model. Therefore, further developments are needed to fit more complex multistate models.

7 References

Andersen, P. K., S. Z. Abildstrom, and S. Rosthøj. 2002. Competing risks as a multi-state model. Statistical Methods in Medical Research 11: 203–215.

Carstensen, B. 2006. Demography and epidemiology: Practical use of the Lexis diagram in the computer age, or: Who needs the Cox-model anyway? Technical Report 06.2, Department of Biostatistics, University of Copenhagen. http://biostat.ku.dk/reports/2006/rr-06-2.pdf.

Colzani, E., A. Liljegren, A. L. Johansson, J. Adolfsson, H. Hellborg, P. F. Hall, and K. Czene. 2011. Prognosis of patients with breast cancer: Causes of death and effects of time since diagnosis, age, and tumor characteristics. Journal of Clinical Oncology 29: 4014–4021.

Hinchliffe, S. R., and P. C. Lambert. 2013a. Extending the flexible parametric survival model for competing risks. Stata Journal 13: 344–355.

———. 2013b. Flexible parametric modelling of cause-specific hazards to estimate cumulative incidence functions. BMC Medical Research Methodology 13: 13.

Koller, M. T., H. Raatz, E. W. Steyerberg, and M. Wolbers. 2012. Competing risks and the clinical community: irrelevance or ignorance? Statistics in Medicine 31: 1089–1097.

Lambert, P. C., P. W. Dickman, C. P. Nelson, and P. Royston. 2010. Estimating the crude probability of death due to cancer and other causes using relative survival models. Statistics in Medicine 29: 885–895.


Lambert, P. C., and P. Royston. 2009. Further development of flexible parametric models for survival analysis. Stata Journal 9: 265–290.

Prentice, R. L., J. D. Kalbfleisch, A. V. Peterson, Jr., N. Flournoy, V. T. Farewell, and N. E. Breslow. 1978. The analysis of failure times in the presence of competing risks. Biometrics 34: 541–554.

Putter, H., M. Fiocco, and R. B. Geskus. 2007. Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine 26: 2389–2430.

Royston, P., and P. C. Lambert. 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. College Station, TX: Stata Press.

Royston, P., and M. K. B. Parmar. 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21: 2175–2197.

About the authors

Sally Hinchliffe is a PhD student at the University of Leicester, UK. She is currently working on developing methodology for application in competing risks.

David Scott is Senior Director of Health Economics at Oxford Outcomes Ltd and an MSc student in Medical Statistics at the University of Leicester, UK, where he is undertaking his thesis on multistate modeling.

Paul Lambert is a professor of biostatistics at the University of Leicester, UK. His main interest is in the development and application of methods in population-based cancer research.


The Stata Journal (2013) 13, Number 4, pp. 776–794

Implementation of a double-hurdle model

Bruno García
The College of William and Mary
Williamsburg, VA
[email protected]

Abstract. Corner solution responses are frequently observed in the social sciences. One common approach to model phenomena that give rise to corner solution responses is to use the tobit model. If the decision to participate in the market is decoupled from the consumption amount decision, then the tobit model is inappropriate. In these cases, the double-hurdle model presented in Cragg (1971, Econometrica 39: 829–844) is an appropriate alternative to the tobit model. In this article, I introduce a command, dblhurdle, that fits the double-hurdle model. The implementation allows the errors of the participation decision and the amount decision to be correlated. The capabilities of predict after dblhurdle are also discussed.

Keywords: st0317, dblhurdle, tobit, Heckman, Cragg, double hurdle, hurdle

1 Introduction

Double-hurdle models are used with dependent variables that take on the endpoints of an interval with positive probability and that are continuously distributed over the interior of the interval. For example, you observe the amount of alcohol individuals consume over a fixed period of time. The distribution of the amounts will be roughly continuous over positive values, but there will be a "pile up" at zero, which is the corner solution to the consumption problem the individuals face; no individual can consume a negative amount of alcohol.

One common approach to modeling such situations is to use the tobit model. Suppose the dependent variable y is continuous over positive values, but Pr(y = 0) > 0 and Pr(y < 0) = 0. Letting Φ() denote a standard normal cumulative distribution function (CDF) and φ() denote a standard normal density function, recall that the log-likelihood function for the tobit model is

\log(L) = \sum_{y_i = 0} \log\left\{1 - \Phi\left(\frac{x_i\beta}{\sigma}\right)\right\} + \sum_{y_i > 0} \left[\log\left\{\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right\} - \log(\sigma)\right]

The functional form of the tobit model imposes a restriction on the underlying stochastic process: xiβ parameterizes both the conditional probability that yi = 0 and the conditional density associated with the magnitude of yi whenever yi > 0. Thus the tobit model cannot properly handle the situation where the effect of a covariate on the probability of participation Pr(yi > 0) and the effect of the same covariate on the amount of participation have different signs. For example, it might be the case that

© 2013 StataCorp LP st0317


attending AA meetings lowers the probability of engaging in the consumption of alcohol, but if alcohol is consumed, a high quantity of consumption is likely because of binge drinking. A similar situation can be seen in the work of Martínez-Espiñeira (2006), who examined a survey asking respondents to state a reasonable tax amount to protect coyotes by compensating farmers for livestock losses. Martínez-Espiñeira (2006) finds that "respondents who hunt stated support for significantly lower levels of tax than nonhunters. However, hunters are less likely to state a zero amount of tax".

2 The double-hurdle model

The consumer-choice example described below provides intuition about the structure in the double-hurdle model. The model is not limited to problems in this context and can also be applied in epidemiology and other applied biostatistical fields.

Suppose individuals make their consumption decisions in two steps. First, the individual determines whether he or she wants to participate in the market. This is called the participation decision. Then the individual determines an optimal consumption amount (which may be 0) given his or her circumstances. This is called the quantity decision. If yi represents the observed consumption amount of the individual, we can model it as

y_i = \begin{cases} x_i\beta + \varepsilon_i & \text{if } \min(x_i\beta + \varepsilon_i,\; z_i\gamma + u_i) > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \sim N(0, \Sigma), \quad
\Sigma = \begin{pmatrix} \sigma^2 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix}

Letting Ψ(x, y, ρ) denote the CDF of a bivariate normal with correlation ρ, the log-likelihood function for the double-hurdle model is

\log(L) = \sum_{y_i = 0} \log\left\{1 - \Psi\left(z_i\gamma,\ \frac{x_i\beta}{\sigma},\ \rho\right)\right\}
+ \sum_{y_i > 0} \left(\log\left[\Phi\left\{\frac{z_i\gamma + \frac{\rho}{\sigma}(y_i - x_i\beta)}{\sqrt{1 - \rho^2}}\right\}\right] - \log(\sigma) + \log\left\{\phi\left(\frac{y_i - x_i\beta}{\sigma}\right)\right\}\right)

The double-hurdle model can be reduced to the tobit model by setting ρ = 0 and taking the limit z_iγ → +∞.
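To see why (a short check, not part of the original text): with ρ = 0 the bivariate CDF factors into a product of univariate CDFs, and letting z_iγ → +∞ drives the participation terms to their limits,

\Psi\left(z_i\gamma,\ \frac{x_i\beta}{\sigma},\ 0\right) = \Phi(z_i\gamma)\,\Phi\left(\frac{x_i\beta}{\sigma}\right) \longrightarrow \Phi\left(\frac{x_i\beta}{\sigma}\right),
\qquad
\Phi\left\{\frac{z_i\gamma + \frac{0}{\sigma}(y_i - x_i\beta)}{\sqrt{1 - 0}}\right\} = \Phi(z_i\gamma) \longrightarrow 1,

so the double-hurdle log likelihood collapses term by term to the tobit log likelihood given in section 1.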

3 The dblhurdle command

The dblhurdle command implements the double-hurdle model, where the error terms of the participation equation and the quantity equation are jointly normal and may be correlated. Letting x_iβ + ε_i model the quantity equation and z_iγ + u_i model the participation equation, the command estimates β, γ, ρ, and σ, where σ² = Var(ε). We restrict Var(u) to equal 1; otherwise, the model is not identified.


3.1 Syntax

dblhurdle depvar [indepvars] [if] [in] [weight], {ll(#) | ul(#)} [peq(varlist[, noconstant]) ptobit noconstant constraints(numlist) vce(vcetype) level(#) correlation display options maximize options]

indepvars and peq() may contain factor variables; see [U] 11.4.3 Factor variables.

3.2 Options

ll(#) indicates a lower corner. Observations with depvar ≤ # are considered at the corner. One of ul(#) or ll(#) must be specified.

ul(#) indicates an upper corner. Observations with depvar ≥ # are considered at the corner. One of ul(#) or ll(#) must be specified.

peq(varlist[, noconstant]) specifies the set of regressors for the participation equation if these are different from those of the quantity equation.

ptobit specifies that the participation equation should consist of a constant only. This option cannot be specified with the peq() option.

noconstant; see [R] estimation options.

constraints(numlist) is used to specify any constraints the researcher may want to impose on the model.

vce(vcetype) specifies the type of standard error reported. vcetype may be oim (the default), robust, or cluster clustvar.

level(#); see [R] estimation options.

correlation displays the correlation between the error terms of the quantity equation and the participation equation. The covariance is not shown when this option is specified.

display options; see Reporting under [R] estimation options.

maximize options: technique(algorithm spec), iterate(#), [no]log, tolerance(#), ltolerance(#), nrtolerance(#), and from(init specs); see [R] maximize. These options are seldom used.


3.3 Stored results

dblhurdle stores the following in e():

Scalars
    e(N)             number of observations
    e(ll)            log likelihood
    e(converged)     1 if converged, 0 otherwise
    e(ulopt)         contents of ul()
    e(llopt)         contents of ll()

Macros
    e(cmd)           dblhurdle
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(title)         title in estimation output
    e(vce)           vcetype specified in vce()
    e(properties)    b V
    e(predict)       program used to implement predict
    e(marginsok)     predictions allowed by margins
    e(qvars)         variables in quantity equation
    e(pvars)         variables in participation equation

Matrices
    e(b)             coefficient vector
    e(Cns)           constraints matrix
    e(V)             variance–covariance matrix of the estimators

Functions
    e(sample)        marks estimation sample

4 Postestimation: predict

4.1 Syntax

predict [type] newvarname [if] [in] [, xb zb xbstdp zbstdp ppar ycond yexpected stepnum(#)]

4.2 Options

xb calculates the linear prediction for the quantity equation. This is the default option when no options are specified in addition to stepnum().

zb calculates the linear prediction for the participation equation.

xbstdp calculates the standard error of the linear prediction of the quantity equation, xb.

zbstdp calculates the standard error of the linear prediction of the participation equation, zb.

ppar is the probability of being away from the corner conditional on the covariates.

ycond is the expectation of the dependent variable conditional on the covariates and on the dependent variable being away from the corner.

yexpected is the expectation of the dependent variable conditional on the covariates.

stepnum(#) controls the number of steps to be taken for predictions that require integration (yexpected and ycond). More specifically, # will be the number of steps taken per unit of the smallest standard deviation of the normal distributions used in the prediction. The default is stepnum(10). You can fine-tune the value of this parameter by trial and error until increasing the parameter results in no or little change in the predicted value.
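For example, after dblhurdle, the three conditional quantities discussed above could be generated directly; the new variable names here are arbitrary.

. predict p_any, ppar
. predict y_cond, ycond
. predict y_exp, yexpected stepnum(50)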

5 Example

We illustrate the use of the dblhurdle command using smoke.dta from Wooldridge (2010).1

We begin our example by describing the dataset:

. use smoke

. describe

Contains data from smoke.dta
  obs:           807
 vars:            10                          15 Aug 2012 19:00
 size:        19,368

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
educ            float   %9.0g                 years of schooling
cigpric         float   %9.0g                 state cig. price, cents/pack
white           byte    %8.0g                 =1 if white
age             byte    %8.0g                 in years
income          int     %8.0g                 annual income, $
cigs            byte    %8.0g                 cigs. smoked per day
restaurn        byte    %8.0g                 =1 if rest. smk. restrictions
lincome         float   %9.0g                 log(income)
agesq           float   %9.0g                 age^2
lcigpric        float   %9.0g                 log(cigprice)
-------------------------------------------------------------------------------
Sorted by:

. misstable summarize
(variables nonmissing or string)

We will model the number of cigarettes smoked per day, so the dependent variable will be cigs. The explanatory variables we use are educ (number of years of schooling); the log of income; the log of the price of cigarettes in the individual's state; restaurn, which takes the value 1 if the individual's state has restrictions against smoking in restaurants and 0 otherwise; and we include the individual's age and the age squared. Not all variables will be included in both equations.

The fact that cigs (the dependent variable) is a byte should remind us that we are implicitly relaxing an assumption of the double-hurdle model. The hypothesized data-generating process generates values over a continuous range, but all the observed numbers of cigarettes are integers.

1. The data were downloaded from http://fmwww.bc.edu/ec-p/data/wooldridge/smoke.dta, and the variables were labeled according to http://fmwww.bc.edu/ec-p/data/wooldridge/smoke.des.


It is always good to check for any missing values; because we have no string variables, the output of misstable summarize ensures that there are no missing values.

The dependent variable should have a "corner" at zero because all nonsmokers will report smoking zero cigarettes per day. We verify this point by tabulating the dependent variable. This simple check is important because it might be the case that our data contain only smokers with positive entries in the variable cigs, in which case a truncated regression model would be more appropriate. We perform the simple check:

. tabulate cigs

      cigs. |
 smoked per |
        day |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        497       61.59       61.59
          1 |          7        0.87       62.45
          2 |          5        0.62       63.07
          3 |          5        0.62       63.69
          4 |          2        0.25       63.94
          5 |          7        0.87       64.81
          6 |          3        0.37       65.18
          7 |          2        0.25       65.43
          8 |          3        0.37       65.80
          9 |          2        0.25       66.05
         10 |         28        3.47       69.52
         11 |          2        0.25       69.76
         12 |          4        0.50       70.26
         13 |          2        0.25       70.51
         14 |          1        0.12       70.63
         15 |         23        2.85       73.48
         16 |          1        0.12       73.61
         18 |          3        0.37       73.98
         19 |          1        0.12       74.10
         20 |        101       12.52       86.62
         25 |          7        0.87       87.48
         28 |          3        0.37       87.86
         30 |         42        5.20       93.06
         33 |          1        0.12       93.18
         35 |          2        0.25       93.43
         40 |         37        4.58       98.02
         50 |          6        0.74       98.76
         55 |          1        0.12       98.88
         60 |          8        0.99       99.88
         80 |          1        0.12      100.00
------------+-----------------------------------
      Total |        807      100.00

The tabulation of the dependent variable reveals that about 60% of the individuals in the sample smoked 0 cigarettes. Strangely, we also see that individuals seem to smoke cigarettes in multiples of five, which may in part be due to a reporting heuristic used by individuals.


We estimate the parameters of a double-hurdle model by typing

. dblhurdle cigs educ restaurn lincome lcigpric, peq(educ c.age##c.age) ll(0)
> nolog

Double-Hurdle regression                        Number of obs   =       807

------------------------------------------------------------------------------
        cigs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
cigs         |
        educ |   4.373058   .8969167     4.88   0.000     2.615134    6.130983
    restaurn |  -6.629484   2.630784    -2.52   0.012    -11.78573   -1.473241
     lincome |   3.236915   1.534674     2.11   0.035     .2290102     6.24482
    lcigpric |  -2.376598   12.02945    -0.20   0.843    -25.95388    21.20068
       _cons |  -44.41139    50.5775    -0.88   0.380    -143.5415    54.71869
-------------+----------------------------------------------------------------
peq          |
        educ |  -.2053851   .0324439    -6.33   0.000    -.2689739   -.1417963
         age |   .0867284    .015593     5.56   0.000     .0561666    .1172901
             |
 c.age#c.age |  -.0010174   .0001755    -5.80   0.000    -.0013615   -.0006734
             |
       _cons |   1.093345   .4821582     2.27   0.023     .1483324    2.038358
-------------+----------------------------------------------------------------
      /sigma |   24.58939   2.904478                      18.89671    30.28206
 /covariance |  -20.70667   3.881986    -5.33   0.000    -28.31523   -13.09812
------------------------------------------------------------------------------

The command showcases some of the features implemented. We used factor variables to include both age and age squared.

The command displays the number of observations in the sample. It lacks a test against a benchmark model. Most estimation commands implement a test against a benchmark constant-only model. For the double-hurdle model, the choice of model to test against has been left to the user. This test can be carried out with standard postestimation tools.
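As one possibility, a likelihood-ratio test against a constant-only double hurdle (fit here with ptobit and no regressors; the stored-estimate names are arbitrary) could be carried out right after the model above:

. estimates store full
. dblhurdle cigs, ptobit ll(0)
. estimates store constonly
. lrtest full constonly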

The estimation table shows results for four equations. In the econometric sense, we estimated the parameters from two equations and two dependence parameters. The first equation displays the coefficients of the quantity equation, which is titled cigs after the dependent variable. The second equation, titled peq, which is short for participation equation, displays the coefficients of the participation equation. The third equation, titled /sigma, displays the estimated value of the standard deviation of the error term of the quantity equation. As mentioned, the analogous parameter of the participation equation is set to 1; otherwise, the model is not identified. The fourth equation, titled /covariance, displays the estimated value of the covariance between the error terms of the quantity equation and the participation equation. If the correlation option is specified, the correlation is displayed instead, and the equation title changes to /rho.

The results allow us to appreciate the strengths of the double-hurdle model. For example, the coefficient of educ has a positive value on the quantity equation, while the analogous coefficient in the participation equation has a negative value. This implies that more educated individuals will be less likely to smoke, but if they smoke, they will tend to smoke more than less educated individuals.


So a small increment in the number of years of schooling will positively affect the number of daily cigarettes smoked given that an individual is a smoker but negatively affect the probability that the individual is a smoker. Naturally, we may want to know which effect, if any, dominates. For nonlinear problems like this one, which effect dominates depends on the other characteristics of the individual. In these situations, researchers often calculate marginal effects. In our example, we illustrate how to compute the average marginal effect of the number of years of schooling (educ) on three different quantities of interest:

• The probability of smoking

• The expected number of cigarettes smoked given that you smoke

• The expected number of cigarettes smoked

Given the signs of the coefficients, we know that the average marginal effect of educ on the probability of smoking will be negative. We also expect that the average marginal effect of educ on the number of cigarettes smoked given that you are a smoker will be positive. The final quantity, the marginal effect of educ on the number of cigarettes smoked regardless of smoker status, is ambiguous.

To estimate these quantities, we use the predict() option in conjunction with the margins command. First, we calculate the average marginal effect of educ on the probability that the individual is a smoker by using the ppar option:

. margins, dydx(educ) predict(ppar)

Average marginal effects                        Number of obs   =       807
Model VCE    : OIM

Expression   : predict(ppar)
dy/dx w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0348973   .0052745    -6.62   0.000    -.0452352   -.0245595
------------------------------------------------------------------------------

Note that the effect is negative, as expected, and significant.

Next we compute the average marginal effect of education on the number of cigarettes smoked given that the individual is a smoker by using the ycond option. We will carry out this computation twice to illustrate the use of the stepnum() option.


. set r on
r; t=0.00 11:43:17

. margins, dydx(educ) predict(ycond)

Average marginal effects                        Number of obs   =       807
Model VCE    : OIM

Expression   : predict(ycond)
dy/dx w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |    .691684   .2795245     2.47   0.013      .143826    1.239542
------------------------------------------------------------------------------
r; t=120.64 11:45:17

. margins, dydx(educ) predict(ycond stepnum(100))

Average marginal effects                        Number of obs   =       807
Model VCE    : OIM

Expression   : predict(ycond stepnum(100))
dy/dx w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .6916836   .2795244     2.47   0.013     .1438259    1.239541
------------------------------------------------------------------------------
r; t=1153.86 12:04:31

. set r off

First, we calculate the average marginal effect with the default value of stepnum(), which is 10. We note that the effect is positive, as expected, and significant. We also note that when the calculation is repeated with a stepnum() of 100, we observe a change in the sixth decimal point, which in this context is meaningless, but it comes at the expense of a tenfold increase in run time. Hence, stepnum() should be used with caution. My advice is to tune it by using predict with the ycond option until the predicted values show little or no sensitivity to positive changes in stepnum().
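A minimal tuning loop along these lines might look as follows; the yc stub is an arbitrary name, and the loop simply reports the mean prediction for increasing values of stepnum() so you can see where it stabilizes.

. forvalues s = 10(20)90 {
  2.     quietly predict yc`s', ycond stepnum(`s')
  3.     quietly summarize yc`s'
  4.     display "stepnum(`s'): mean prediction = " r(mean)
  5. }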

Finally, we use the yexpected option of predict in margins to calculate the average marginal effect educ has on the number of cigarettes smoked per day regardless of the individual's smoker status:

. margins, dydx(educ) predict(yexpected)

Average marginal effects                        Number of obs   =       807
Model VCE    : OIM

Expression   : predict(yexpected)
dy/dx w.r.t. : educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.5487611   .1473763    -3.72   0.000    -.8376132   -.2599089
------------------------------------------------------------------------------


We note that the effect is negative and that it is statistically significant. Hence, on average, more education lowers the expected number of cigarettes an individual smokes.

6 Monte Carlo simulation

This section describes some Monte Carlo simulations used to investigate the finite-sample properties of the estimator. Point estimates of the parameters should be close to their true values, and the rejection rate of the true null hypothesis should be close to the nominal size of the test.

To this end, we perform a Monte Carlo simulation, and we look at three measures of performance:

• The mean of the estimated parameters should be close to their true values.

• The mean standard error of the estimated parameters over the repetitions should be close to the standard deviation of the point estimates.

• The rejection rate of hypothesis tests should be close to the nominal size of the test.

The first step consists of choosing the parameters of the model. The quantity equation was chosen to have one continuous covariate, one indicator variable, and an intercept. The variance of the error associated with this equation is equal to 1. The participation equation consists of a different continuous variable, indicator variable, and intercept. The error terms will be drawn so that they are independent. Thus the correlation between the error terms will be 0. We set an upper corner at 0. The data-generating process can be summarized as follows:

$$y = \begin{cases} \min(0,\ 2x_1 - d_1 + 0.5 + \varepsilon) & \text{if } x_2 - 2d_2 + 1 + u < 0 \\ 0 & \text{otherwise} \end{cases} \qquad \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \sim N(\mathbf{0}, \Sigma), \quad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

A dataset of 2,000 observations was created containing the covariates. The x’s were drawn from a standard normal distribution, and the d’s were drawn from a Bernoulli with p = 1/2. In the pseudocode below, we refer to this dataset as “base”.

Now we describe an iteration of the simulation:

1. Use “base”.

2. For each observation, draw (gen) ε from a standard normal.

3. For each observation, draw (gen) u from a standard normal.


4. For each observation, compute y according to the data-generating process presented above.

5. Fit the model, and save the values of interest with post.

The values of interest during each iteration are the point estimates of the parameters; the standard errors of the parameters; and, for each parameter, whether the 95% confidence interval around the estimated parameter excluded the true value of the parameter. At the conclusion of the simulation, we have a dataset of 10,000 observations, where each observation is a realization of the values of interest.
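
A minimal sketch of how one such iteration might be coded in Stata follows. The program name onerep, the dblhurdle option names peq() and ul(), and the coefficient names used with _b[] and _se[] are assumptions for illustration; the article does not show the actual simulation code, and simulate is used here in place of post for brevity.

    program define onerep, rclass
        use base, clear                      // step 1: the fixed covariates
        generate double eps = rnormal()      // step 2: draw epsilon
        generate double u   = rnormal()      // step 3: draw u
        * step 4: data-generating process with an upper corner at 0
        generate double y = 0
        replace y = min(0, 2*x1 - d1 + 0.5 + eps) if x2 - 2*d2 + 1 + u < 0
        * step 5: fit the model and return the values of interest
        dblhurdle y x1 d1, peq(x2 d2) ul(0)  // option names are assumptions
        return scalar b_x1  = _b[y:x1]
        return scalar se_x1 = _se[y:x1]
    end

    simulate b_x1=r(b_x1) se_x1=r(se_x1), reps(10000) seed(1234): onerep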

The following table summarizes the results. It shows the mean estimated coefficient, or “mean”; the standard deviation of the sample of estimated coefficients, or “std. dev.”; the mean estimated standard error, or “std. err.”; and the proportion of the time a test of size 0.05 rejected the true null hypothesis, denoted by “rej. rate”.

Table 1. Results of the simulation

  parameter    true value      mean    std. dev.   std. err.   rej. rate

  β_x1              2         2.0007     0.0563      0.0561      0.0524
  β_d1             −1        −1.0001     0.0860      0.0856      0.0507
  β_cons            0.5       0.5007     0.0881      0.0885      0.0497
  γ_x2              1         1.0095     0.0823      0.0811      0.0520
  γ_d2             −2        −2.0156     0.1424      0.1426      0.0486
  γ_cons            1         1.0068     0.0862      0.0863      0.0507
  sigma             1         0.9979     0.0364      0.0364      0.0542
  covariance        0         0.0016     0.1046      0.1036      0.0532

The results show that the statistical properties of the estimates are as desired. Other simulations were done to see how these results would change under extreme circumstances, such as correlations close to the extremes of −1 or 1. The results were qualitatively similar to those above for correlations as high as 0.95 and as low as −0.95. There were instances where the tests did not achieve their nominal size. Rather than being driven by the extreme values of the input parameters, these issues seem to be driven primarily by the proportion of observations at the corner. As this proportion gets close to either extreme (0 or 1), the attained size of a test of the covariance deviates from the nominal size. This becomes an issue once the proportion of observations at the corner is above 95% or below 5%.

The other parameters can also be affected by this, but for those parameters, this is more intuitive because it can be viewed through the lens of a small-sample problem. For example, if most of your observations are at the corner, you will have very little data to estimate the parameters associated with the quantity equation. Because the confidence intervals produced by maximum likelihood are normal only asymptotically, we cannot expect them to achieve their nominal size in small samples.


Figure 1 summarizes this information. Each scatterplot contains the observed rejection rate of a test of nominal size 0.05 on the vertical axis and the proportion of observations at the corner on the horizontal axis. Each point on the scatterplot represents a variation on the parameterization of the data-generating process presented above. I held the coefficients of the quantity and participation equations fixed, and I tried every combination of upper or lower corner; corner at −2, 0, 2; σ ∈ {0.2, 1, 10}; and ρ ∈ {−0.95, 0, 0.95}.

[Figure 1 here: eight scatterplots, one per parameter (x1, d1, cons, γ_x2, γ_d2, γ_cons, sigma, and covariance), each plotting the rejection rate (vertical axis, .05 to .95) against the proportion of observations at the corner (horizontal axis, .05 to .95); points with σ = 10 are marked with squares and all others with circles.]

Figure 1. Scatterplots showing rejection rate of a test of nominal size 0.05 and proportion of observations at the corner

Of these, only the parameterization where σ = 10 seems to induce a discrepancy between the nominal size of the test and the attained size of the test, particularly for γ_d2. Hence, I decided to mark those points with a square instead of a circle.

Notice that the nominal size is almost never achieved once you cross the 0.95 proportion (marked with a vertical line). Also notice that tests involving the gamma coefficients (those of the participation equation) also deviate from their nominal size (albeit less markedly) when the proportion of censored observations is low. This is most obvious for the γ_d2 coefficient.


A less intuitive issue occurs when the set of regressors in the participation equation is equal to the set of regressors of the quantity equation. In this case, the model is weakly identified, and the nominal sizes will differ from the true size of the test. To illustrate, we attempt to recover the parameters of the following data-generating process:

$$y = \begin{cases} \min(0,\ 2x_1 - d_1 + 0.5 + \varepsilon) & \text{if } 2x_1 - d_1 + 0.5 + u < 0 \\ 0 & \text{otherwise} \end{cases} \qquad \begin{pmatrix} \varepsilon \\ u \end{pmatrix} \sim N(\mathbf{0}, \Sigma), \quad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

The results, summarized in the following table, suggest that the point estimates can be trusted but that the size of the tests may deviate from the advertised values.

Table 2. Results of the data-generating process

  parameter    true value      mean    std. dev.   std. err.   rej. rate

  β_x1              2         2.0043     0.0907      0.0877      0.0711
  β_d1             −1        −1.0029     0.0925      0.0925      0.0535
  β_cons            0.5       0.5077     0.1618      0.1569      0.0762
  γ_x1              2         2.0625     0.2898      0.2699      0.0846
  γ_d1             −1        −1.0270     0.2216      0.2114      0.0560
  γ_cons            0.5       0.5417     0.2548      0.2447      0.0777
  sigma             1         1.0009     0.0331      0.0328      0.0534
  covariance        0         0.0374     0.2754      0.2541      0.1118

Figure 2 is analogous to figure 1. We note that tests on the covariance are particularly unreliable, that the distinction between the cases where σ ≠ 10 and σ = 10 seems not to matter, and that the rejection rate exceeds the nominal size of the test when the proportion of observations at the corner is around 0.9. However, when the proportion of observations at the corner is between 0.3 and 0.8, the sizes are mostly reliable, with the notable exception of tests of the covariance.


[Figure 2 here: eight scatterplots, one per parameter (x1, d1, cons, γ_x1, γ_d1, γ_cons, sigma, and covariance), each plotting the rejection rate (vertical axis, .05 to .95) against the proportion of observations at the corner (horizontal axis, .05 to .95); points with σ = 10 are marked with squares and all others with circles.]

Figure 2. Scatterplots showing rejection rate of a test of nominal size 0.05 and proportion of observations at the corner

7 Methods and formulas

7.1 Log likelihood

A variety of double-hurdle models were first proposed in Cragg (1971). Jones (1992) applies the double-hurdle model with correlation in the error terms to data on tobacco expenditures. Letting Φ(•) denote the standard normal CDF, φ(•) denote the standard normal density function, and Ψ(x, y, ρ) denote the CDF of the bivariate normal with correlation ρ, the log-likelihood function for the double-hurdle model with a lower corner at c is

$$a = \frac{z\gamma - c + \frac{\rho}{\sigma}(y - x\beta)}{\sqrt{1 - \rho^2}}, \qquad b = \frac{x\beta - c}{\sigma}$$

$$\log(L) = \sum_{y_i = c} \log\left\{1 - \Psi(z\gamma - c,\ b,\ \rho)\right\} + \sum_{y_i > c} \left[\log\left\{\Phi(a)\right\} - \log(\sigma) + \log\,\phi\!\left(\frac{y - x\beta}{\sigma}\right)\right]$$


If the upper corner is at c, then

$$\log(L) = \sum_{y_i = c} \log\left\{1 - \Psi(c - z\gamma,\ -b,\ \rho)\right\} + \sum_{y_i < c} \left[\log\left\{\Phi(-a)\right\} - \log(\sigma) + \log\,\phi\!\left(\frac{x\beta - y}{\sigma}\right)\right]$$

7.2 Choosing the initial point

The optimization routine optimize() requires an initial point from which to initialize the optimization algorithm. My choice of starting point is [β̂, γ̂, 0, 5]′, where β̂ contains the ordinary least-squares estimates of a regression of the dependent variable of the model on x, the variables in the quantity equation; γ̂ contains the ordinary least-squares estimates of a regression of the dependent variable of the model on z, the variables in the participation equation; and ρ and σ are chosen to be 0 and 5, respectively.

There is no guarantee that the initial point will be feasible. If the initial point is infeasible, the use of the from() option is recommended.
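
For example, a starting vector could be built by hand and passed to the command along these lines; this is a sketch, and the required parameter order as well as the peq() and ll() option names are assumptions based on the description above (only from() is documented here):

    . regress y x1 d1
    . matrix b = e(b)              // OLS start for the quantity equation
    . regress y x2 d2
    . matrix g = e(b)              // OLS start for the participation equation
    . matrix b0 = (b, g, 0, 5)     // append rho = 0 and sigma = 5
    . dblhurdle y x1 d1, peq(x2 d2) ll(0) from(b0, copy)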

7.3 First derivatives

The first derivatives of the log likelihood (if β_j is the constant, simply let x_j = 1, and likewise for γ_j) are given below. These were adapted from Jones and Yen (2000). Letting ψ(x, y, ρ) be the density of a bivariate normal with correlation ρ,


$$\Psi = \Psi(z\gamma - c,\ b,\ \rho), \qquad \psi = \psi(z\gamma - c,\ b,\ \rho)$$

$$\Phi_{12} = \Phi\left\{\frac{z\gamma - c - b\rho}{\sqrt{1 - \rho^2}}\right\}, \qquad \phi_{12} = \phi\left\{\frac{z\gamma - c - b\rho}{\sqrt{1 - \rho^2}}\right\}, \qquad \Phi_{21} = \Phi\left\{\frac{b - \rho(z\gamma - c)}{\sqrt{1 - \rho^2}}\right\}$$

$$\frac{d \log(L)}{d\beta_j} = \frac{x_j}{\sigma}\left[\sum_{y=c}\left\{\frac{\phi(b)\,\Phi_{12}}{\Psi - 1}\right\} + \sum_{y>c}\left\{\frac{-\rho\,\phi(a)}{\sqrt{1 - \rho^2}\,\Phi(a)} + \frac{y - x\beta}{\sigma}\right\}\right]$$

$$\frac{d \log(L)}{d\gamma_j} = z_j\left[\sum_{y=c}\left\{\frac{\phi(z\gamma - c)\,\Phi_{21}}{\Psi - 1}\right\} + \sum_{y>c}\left\{\frac{\phi(a)}{\sqrt{1 - \rho^2}\,\Phi(a)}\right\}\right]$$

$$\frac{d \log(L)}{d\sigma_{12}} = \frac{1}{\sigma}\left(\sum_{y=c}\frac{\psi}{\Psi - 1} + \sum_{y>c}\left[\left\{\frac{y - x\beta}{\sigma} + \frac{a\rho}{\sqrt{1 - \rho^2}}\right\}\frac{\phi(a)}{\Phi(a)\sqrt{1 - \rho^2}}\right]\right)$$

$$\frac{d \log(L)}{d\sigma} = \frac{1}{\sigma}\sum_{y=c}\left[b\left\{\frac{\Phi_{12}\,\phi(b)}{1 - \Psi}\right\} + \frac{\rho\,\psi}{1 - \Psi}\right] + \frac{1}{\sigma}\sum_{y>c}\left[\left(\frac{y - x\beta}{\sigma}\right)^2 - 1 + \left\{\frac{-\rho\,\phi(a)}{\Phi(a)\sqrt{1 - \rho^2}}\right\}\left\{\frac{2(y - x\beta)}{\sigma} + \frac{a\rho}{\sqrt{1 - \rho^2}}\right\}\right]$$

The implementation of the derivatives for an upper corner at c requires a few minor changes. First, the derivatives with respect to β_j and γ_j should be multiplied by −1. Finally, multiply a, b, zγ − c, and (y − xβ)/σ by −1.

7.4 Weights

The weighting schemes implemented for dblhurdle are frequency weights (fw), sampling weights (pw), and importance weights (iw). Recall that the likelihood function is summed over observations. To implement the weights, you need to multiply the ith term of the summation over observations by the weight of the ith observation. The frequency weights are only allowed to be positive integers.
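
In generic notation, with ℓ_i(θ) denoting the ith observation's log-likelihood contribution and w_i its weight, the weighted objective is simply

$$\log L_w(\theta) = \sum_{i=1}^{n} w_i\, \ell_i(\theta)$$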


When frequency weights are specified, the sample size is adjusted so that it is equal to the sum of the weights. The importance weights are allowed to be any real number. No sample-size adjustments are made when importance weights are specified. The sampling weights are like the importance weights, but a robust estimator of the variance is computed instead of the default oim estimator. No sample-size adjustment is made when sampling weights are specified, and the weights are not allowed to be negative.

Finally, analytic weights (aw) are not allowed. This command was written with the tobit command in mind. In that case, the aweights (normalized) divide the variance of the error term. In the case of dblhurdle, the rationale for dividing the variance by the normalized weights does not carry over well because we also have to estimate the covariance between the error terms.

7.5 Prediction

There are three options in the prediction program that require some explanation. The ppar option computes the probability of being away from the corner conditional on the covariates. Thus this option computes

$$\Pr(y > c \mid x, z) = \Phi\left(z\gamma - c,\ \frac{x\beta - c}{\sigma},\ \rho\right)$$

The option ycond computes the following expectation:

$$E(y \mid x, z, y > c) = \int_c^{+\infty} y\, f(y \mid u > c - z\gamma,\ \varepsilon > c - x\beta)\, dy$$

$$f(y \mid u > c - z\gamma,\ \varepsilon > c - x\beta) = \frac{\phi\left(\frac{y - x\beta}{\sigma}\right)\Phi\left\{\frac{z\gamma - c + \frac{\rho}{\sigma}(y - x\beta)}{\sqrt{1 - \rho^2}}\right\}}{\sigma\,\Phi\left(z\gamma - c,\ \frac{x\beta - c}{\sigma},\ \rho\right)}$$

Finally, the option yexpected computes the expected value of y conditional on x and z:

$$E(y \mid x, z) = c\,\{1 - \Pr(y > c \mid x, z)\} + \Pr(y > c \mid x, z)\, E(y \mid x, z, y > c)$$

Note that the options that involve integration are time consuming. Thus the option stepnum() was added to the prediction program to allow the user some control of the execution time for the integration. Letting n_s denote the stepnum(), the step size is chosen to be

$$\frac{\min\left(\sigma,\ \sqrt{1 - \rho^2}\right)}{n_s}$$

Execution is faster when the stepnum() is smaller, but the improved run time comes at a cost to accuracy. The default is stepnum(10).


When the corner is above, the expressions that change become

$$\Pr(y < c \mid x, z) = \Phi\left(c - z\gamma,\ \frac{c - x\beta}{\sigma},\ \rho\right)$$

$$E(y \mid x, z, y < c) = \int_{-\infty}^{c} y\, f(y \mid u < z\gamma - c,\ \varepsilon < x\beta - c)\, dy$$

$$f(y \mid u < z\gamma - c,\ \varepsilon < x\beta - c) = \frac{\phi\left(\frac{x\beta - y}{\sigma}\right)\Phi\left\{\frac{c - z\gamma - \frac{\rho}{\sigma}(y - x\beta)}{\sqrt{1 - \rho^2}}\right\}}{\sigma\,\Phi\left(c - z\gamma,\ \frac{c - x\beta}{\sigma},\ \rho\right)}$$

8 Conclusion

The double-hurdle model was an important contribution to the econometric toolkit used by researchers. I hope that readers will consider this model and, in particular, the dblhurdle command when their first instinct is to use the tobit model. The example presented in section 5 illustrates the flexibility of the model. It allows the researcher to break down the modeled quantity along two useful dimensions, the “quantity” dimension and the “participation” dimension.

The command presented in this article allows for only a single corner in the data. One desirable feature to add is the capability to handle dependent variables with two corners. Such variables are common (for example, 401(k) contributions), so this feature would certainly provide higher value to users.

9 Acknowledgments

I wrote this article and the command described therein during a summer internship at StataCorp. It was exciting to meet the individuals behind Stata. I thank David Drukker for his support and for the time he spent going over the intricate details of the models. I also thank Rafal Raciborski for all of his comments, suggestions, and tips. Any errors in my work are my own.

10 References

Cragg, J. G. 1971. Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica 39: 829–844.

Jones, A. M. 1992. A note on computation of the double-hurdle model with dependence with an application to tobacco expenditure. Bulletin of Economic Research 44: 67–74.

Jones, A. M., and S. T. Yen. 2000. A Box-Cox double-hurdle model. Manchester School 68: 203–221.

Martínez-Espiñeira, R. 2006. A Box-Cox double-hurdle model of wildlife valuation: The citizen’s perspective. Ecological Economics 58: 192–208.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

About the author

Bruno García is working toward a master’s degree in computational operations research at the College of William and Mary. He received his bachelor’s degree in applied mathematics and economics from Brown University.


The Stata Journal (2013) 13, Number 4, pp. 795–809

Little’s test of missing completely at random

Cheng Li
Northwestern University
Evanston, IL
[email protected]

Abstract. In missing-data analysis, Little’s test (1988, Journal of the American Statistical Association 83: 1198–1202) is useful for testing the assumption of missing completely at random for multivariate, partially observed quantitative data. I introduce the mcartest command, which implements Little’s missing completely at random test and its extension for testing the covariate-dependent missingness. The command also includes an option to perform the likelihood-ratio test with adjustment for unequal variances. I illustrate the use of mcartest through an example and evaluate the finite-sample performance of these tests in simulation studies.

Keywords: st0318, mcartest, CDM, MAR, MCAR, MNAR, chi-squared, missing data, missing-value patterns, multivariate, power

1 Introduction

Statistical inference based on incomplete data typically involves certain assumptions for the missing-data mechanism. The validity of these assumptions requires formal evaluation before any further analysis. For example, likelihood-based inference is valid only if the missing-data mechanism is ignorable (Rubin 1976), which usually relies on the missing at random (MAR) assumption. MAR assumes that the missingness of the data may depend on the observed data but is independent of the unobserved data. Therefore, testing MAR is in general impossible because it requires unavailable information about the missing data. Instead, the missing completely at random (MCAR) assumption assumes that the missingness of the data is independent of both the observed and the unobserved data, which is stronger than MAR and possible to test using only the observed data. When the missing-data mechanism depends on the unobserved data, data are missing not at random (MNAR). Although the likelihood inference only requires the MAR assumption, testing of MCAR is still of interest in real applications because many simple missing-data methods such as complete-case analysis are valid only under MCAR (see Little and Rubin [2002, chap. 3]; also see the blood-test example in section 4). Also, the maximum likelihood estimation for the multivariate normal model may be more sensitive to the distributional assumption when the data are not MCAR (Little 1988).

In this article, I present a new command, mcartest. It implements the χ² test of MCAR for multivariate quantitative data proposed by Little (1988), which tests whether a significant difference exists between the means of different missing-value patterns. The test statistic takes a form similar to the likelihood-ratio statistic for multivariate normal data and is asymptotically χ² distributed under the null hypothesis that there are no differences between the means of different missing-value patterns. Rejection of the null provides sufficient evidence to indicate that the data are not MCAR. The command also accommodates the testing of the covariate-dependent missingness (CDM) assumption, a straightforward extension of Little’s MCAR test when covariates are present. It also allows unequal variances between different missing-value patterns.

2 Methods and formulas

2.1 MCAR, MAR, MNAR, and CDM

First, I introduce the formal definitions of the four missing-data mechanisms. Suppose we have an independent and identically distributed sequence of p-dimensional vectors y_i = (y_{i1}, . . . , y_{ip})′, i = 1, 2, . . . , n, where n is the sample size, and Y = (y_1, . . . , y_n)′ is the n × p data matrix. Hereafter, we are mainly interested in testing whether Y is MCAR. Denote the observed entries and missing entries of Y as Y^o and Y^m, respectively. In some situations, we may also have completely observed q-dimensional covariates x. Let X be the n × q data matrix of covariate values. Let the p-dimensional vector r_i = (r_{i1}, . . . , r_{ip})′ denote the indicator of whether each component in vector y_i is observed; that is, r_{ik} = 1 if y_{ik} is observed, and r_{ik} = 0 if y_{ik} is missing, for i = 1, 2, . . . , n and k = 1, 2, . . . , p. The stacked matrix of r is R = (r_1, . . . , r_n)′. Then the MAR assumption is defined as

$$\Pr(R \mid Y^m, Y^o, X) = \Pr(R \mid Y^o, X) \tag{1}$$

In other words, the distribution of the missing indicators depends only on the observed data.

The stronger assumption of MCAR is defined as

$$\Pr(R \mid Y^m, Y^o, X) = \Pr(R) \tag{2}$$

which implies that the missing indicators are completely independent of both the missing data and the observed data. Note that here R is also independent of covariates X, as suggested by Little (1995). This means that under the MCAR assumption, the missingness should be totally independent of any observed variables. Instead, if R only depends on covariates X,

$$\Pr(R \mid Y^m, Y^o, X) = \Pr(R \mid X) \tag{3}$$

then Little (1995) suggests that (3) be referred to as CDM (Fitzmaurice et al. 2009, chap. 17), while the term “MCAR” is reserved for (2). It is worth noting that, according to the definition, CDM is a special case of MAR because covariates x are always fully observed. Finally, any missing-data mechanism that does not satisfy (1) is MNAR.


2.2 Test of MCAR

In Little’s test of MCAR (Little 1988), the data y_i (i = 1, 2, . . . , n) are modeled as p-dimensional multivariate normal with mean vector μ and covariance matrix Σ, with part of the components in the y_i’s missing. When normality is not satisfied, Little’s test still works in the asymptotic sense for quantitative random vectors y_i but is not suitable for categorical variables (Little 1988). We suppose that there are a total of J missing-value patterns among all y_i’s. For each pattern j, let o_j and m_j be the index sets of the observed components and the missing components, respectively, and let p_j = |o_j| be the number of observed components in pattern j. Furthermore, let μ_{o_j} and Σ_{o_j} be the p_j × 1-dimensional mean vector and the p_j × p_j covariance matrix of only the observed components for the jth missing pattern, and let ȳ_{o_j} (p_j × 1) be the observed sample average for the jth missing pattern. Finally, let I_j ⊆ {1, 2, . . . , n} be the index set of pattern j in the sample, and let n_j = |I_j|; then Σ_{j=1}^J n_j = n.

Little’s χ² test statistic for MCAR takes the following form:

$$d_0^2 = \sum_{j=1}^{J} n_j\left(\bar{y}_{o_j} - \mu_{o_j}\right)^\top \Sigma_{o_j}^{-1}\left(\bar{y}_{o_j} - \mu_{o_j}\right) \tag{4}$$

The idea is that if the data are MCAR, then conditional on the missing indicator r_i, the following null hypothesis holds,

$$H_0:\ y_{o,i} \mid r_i \sim N\left(\mu_{o_j}, \Sigma_{o_j}\right) \quad \text{if } i \in I_j,\ 1 \le j \le J \tag{5}$$

where μ_{o_j} is a subvector of the mean vector μ.

Instead, if (2) is not true, then conditional on the missing indicator r_i, the means of the observed y’s are expected to vary across different patterns, which implies

$$H_1:\ y_{o,i} \mid r_i \sim N\left(\nu_{o_j}, \Sigma_{o_j}\right) \quad \text{if } i \in I_j,\ 1 \le j \le J \tag{6}$$

where ν_{o_j}, j = 1, 2, . . . , J, are mean vectors of each pattern j and can be distinct. Rejecting (5) is sufficient for rejecting the MCAR assumption (2), but not necessary.

Little (1988) proves that the statistic (4) is the likelihood-ratio statistic for testing (5) against (6). If the normality assumption holds, then d₀² follows the χ² distribution with degrees of freedom (d.f.) = Σ_{j=1}^J p_j − p. If the y_i’s are not multivariate normal but have the same mean μ and covariance matrix Σ, then by the multivariate central limit theorem (see, for example, part (c) of the lemma in Little [1988]), under the null assumption of MCAR, d₀² follows the same χ² distribution asymptotically.

In practice, because μ and Σ are usually unknown, Little (1988) proposes to replace them with the unbiased estimators μ̂ and Σ̃ = nΣ̂/(n − 1), where μ̂ and Σ̂ are the maximum likelihood estimators based on the null hypothesis (5). Thus Σ_{o_j} in (4) is replaced by the submatrix Σ̃_{o_j} of Σ̃, which gives

$$d^2 = \sum_{j=1}^{J} n_j\left(\bar{y}_{o_j} - \widehat{\mu}_{o_j}\right)^\top \widetilde{\Sigma}_{o_j}^{-1}\left(\bar{y}_{o_j} - \widehat{\mu}_{o_j}\right) \tag{7}$$


Asymptotically, d² follows the χ² distribution with d.f. = Σ_{j=1}^J p_j − p, and (5) is rejected if d² > χ²_{d.f.}(1 − α), where α is the significance level. μ̂ and Σ̂ can be obtained from the expectation-maximization (EM) algorithm using the observed data Y^o (Little and Rubin 2002; Schafer 1997).

2.3 Test of CDM

A natural extension of Little’s test of MCAR is to test the CDM assumption (3) of yiconditional on xi when covariates xi’s are present. For simplicity, we assume that xi

contains the constant term 1 as one of its components. If y depends linearly on x, thenthe model becomes

y = Bx+ ε

where B is a p× q matrix of coefficients, and ε ∼ N(0,Σ). Under the homoskedasticityassumption, Σ does not depend on x. When we compare this with the model withoutcovariates, we see we need to replace every unconditional mean of y with the conditionalmean of y given x and test whether the coefficient matrix B varies among differentmissing patterns. The χ2 test statistic (4) now becomes

$$d_0^2 = \sum_{j=1}^{J}\sum_{i \in I_j}\left(\widetilde{B}_{o_j}\mathbf{x}_i - B_{o_j}\mathbf{x}_i\right)^\top \Sigma_{o_j}^{-1}\left(\widetilde{B}_{o_j}\mathbf{x}_i - B_{o_j}\mathbf{x}_i\right) = \sum_{j=1}^{J}\sum_{i \in I_j}\mathbf{x}_i^\top\left(\widetilde{B}_{o_j} - B_{o_j}\right)^\top \Sigma_{o_j}^{-1}\left(\widetilde{B}_{o_j} - B_{o_j}\right)\mathbf{x}_i \tag{8}$$

where B_{o_j} is a p_j × q submatrix of B whose rows correspond to the jth missing pattern, and B̃_{o_j} is the ordinary least-squares estimator of B_{o_j} using the observed data from pattern j. It is straightforward to see that d₀² in (4) is a special case of d₀² in (8) when x only contains the constant component 1.

Accordingly, we are now testing the null hypothesis

$$H_0:\ y_{o,i} \mid r_i, x_i \sim N\left(B_{o_j}x_i, \Sigma_{o_j}\right) \quad \text{if } i \in I_j,\ 1 \le j \le J \tag{9}$$

versus

$$H_1:\ y_{o,i} \mid r_i, x_i \sim N\left(D_{o_j}x_i, \Sigma_{o_j}\right) \quad \text{if } i \in I_j,\ 1 \le j \le J \tag{10}$$

where under H₁, the CDM assumption does not hold, and y_{o_j} = D_{o_j}x + ε for pattern j, with D_{o_j} potentially different among all patterns but with the error terms still sharing the same multivariate distribution N(0, Σ).

In practice, we replace B and Σ in (8) with the unbiased estimators B̂ and Σ̃ = nΣ̂/(n − q), where B̂ and Σ̂ are the maximum likelihood estimators using all data under H₀, and calculate

$$d^2 = \sum_{j=1}^{J}\sum_{i \in I_j}\mathbf{x}_i^\top\left(\widetilde{B}_{o_j} - \widehat{B}_{o_j}\right)^\top \widetilde{\Sigma}_{o_j}^{-1}\left(\widetilde{B}_{o_j} - \widehat{B}_{o_j}\right)\mathbf{x}_i \tag{11}$$


which asymptotically follows the χ² distribution with d.f. = q(Σ_{j=1}^J p_j − p), and (9) is rejected if d² > χ²_{d.f.}(1 − α), where α is the significance level. Again, when there are no covariates and x only contains the constant component 1 with q = 1, then d.f. = Σ_{j=1}^J p_j − p, which coincides with the d.f. in the test of MCAR.

2.4 Adjustment for unequal variances

As Little (1988) points out, one important limitation of d² in (7) and (11) is that the covariance matrix of observed y (or observed y conditional on x) is still the same for all missing-value patterns, even in the alternative hypotheses (6) and (10). This assumption may not be satisfied in general, especially when the number of missing patterns is large. Therefore, we can relax this limitation on covariance matrices and replace the alternative hypothesis with

$$H_1:\ y_{o,i} \mid r_i, x_i \sim N\left(D_{o_j}x_i, \Gamma_{o_j}\right) \quad \text{if } i \in I_j,\ 1 \le j \le J \tag{12}$$

where Γ_{o_j} contains distinct parameters for each missing pattern j. To test (9) against (12), we can derive the following likelihood-ratio statistic as in Little (1988),

$$d^2_{\text{aug}} = d^2 + \sum_{j=1}^{J} n_j\left\{\mathrm{tr}\left(S_{o_j}\widetilde{\Sigma}_{o_j}^{-1}\right) - p_j - \log|S_{o_j}| + \log|\widetilde{\Sigma}_{o_j}|\right\} \tag{13}$$

where d² is the same as in (7) without covariates or (11) with covariates, S_{o_j} is the estimated covariance matrix of residuals from the regression of observed y_{o_j} on x in pattern j, Σ̃_{o_j} is the same as in (7), and “aug” stands for “augmented” because more parameters need to be estimated for covariance matrices in the new test. Asymptotically, d²aug follows the χ² distribution with d.f. = q(Σ_{j=1}^J p_j − p) + Σ_{j=1}^J {p_j(p_j + 1)}/2 − {p(p + 1)}/2, and (5) or (9) is rejected if d²aug > χ²_{d.f.}(1 − α), where α is the significance level. This augmented test using d²aug tends to have higher power than the test using d² for large sample sizes, especially when the covariance structures of different missing-value patterns vary a lot, as shown later in our simulation results in section 5. On the other hand, d²aug may not be applicable if some patterns have sample sizes so small that n_j < p_j + q: S_{o_j} will then be singular, and hence log|S_{o_j}| in the expression of d²aug cannot be computed.

3 The mcartest command

3.1 Description

mcartest performs Little’s χ² test for the MCAR assumption and accommodates arbitrary missing-value patterns. depvars contains a list of variables with missing values to be tested. depvars requires at least two variables. indepvars contains a list of covariates. When indepvars are specified, mcartest tests the CDM assumption for depvars conditional on indepvars (see Little [1995]). The test statistic uses multivariate normal estimates from the EM algorithm (see [MI] mi impute mvn). The unequal option performs Little’s augmented χ² test, which allows unequal variances between missing-value patterns. See Little (1988) for details.

3.2 Syntax

Test for MCAR

    mcartest depvars [if] [in] [, noconstant unequal emoutput em_options]

Test for CDM

    mcartest depvars = indepvars [if] [in] [, noconstant unequal emoutput em_options]

3.3 Options

noconstant suppresses constant term.

unequal specifies that unequal variances between missing-value patterns be allowed. By default, the test assumes equal variances between different missing-value patterns.

emoutput specifies that intermediate output from EM estimation be displayed.

em_options specifies the options for the EM algorithm. See [MI] mi impute mvn (StataCorp 2013) for details.

3.4 Stored results

mcartest saves the following in r():

Scalars
    r(N)        number of observations
    r(N_S_em)   number of unique missing-value patterns
    r(chi2)     Little's χ² statistic
    r(df)       χ² d.f.
    r(p)        χ² p-value

4 Example

I illustrate the use of the mcartest command through an example. The fictional dataset used here contains the blood-test results from a study of obesity, with 371 observations and 11 variables: cholesterol level, triglyceride level, diastolic blood pressure, systolic blood pressure, age, gender, height, weight, exercise time in a week, alcohol, and smoking. Suppose the variables of interest are the first four, coded as chol, trig, diasbp, and sysbp, and the other seven are used as auxiliary variables, coded as age, female, height, weight, exercise, alcohol, and smoking. Descriptions of these variables are shown in table 1.


Table 1. Descriptions of the variables

  Name       Type         Description

  chol       Continuous   Cholesterol level
  trig       Continuous   Triglyceride level
  diasbp     Continuous   Diastolic blood pressure
  sysbp      Continuous   Systolic blood pressure
  age        Categorical  1 if 21–30, 2 if 31–40, 3 if 41–50, 4 if above 50
  female     Categorical  1 if female, 0 if male
  height     Continuous   Height in inches
  weight     Continuous   Weight in pounds
  exercise   Discrete     Exercise in hours per week
  alcohol    Categorical  1 if drinking alcohol, 0 if not
  smoking    Categorical  1 if smoking, 0 if not

After loading the data, we can check the missing-value patterns by using misstable.

. use bloodtest
(fictional blood test data)

. misstable summarize
                                                            Obs<.
                                             Unique
    Variable     Obs=.    Obs>.    Obs<.     values      Min        Max

        chol        90               281        265    187.73     224.57
        trig        70               301        280    103.22     136.21
      diasbp        34               337         24        66         90
       sysbp        73               298         32       106        138

. misstable pattern, freq

   Missing-value patterns
     (1 means complete)

               |   Pattern
     Frequency |  1  2  3  4
   ------------+-------------
           122 |  1  1  1  1
               |
            72 |  1  1  1  0
            70 |  1  0  1  1
            55 |  1  1  0  1
            34 |  0  1  1  1
            18 |  1  1  0  0
   ------------+-------------
           371

   Variables are (1) diasbp (2) trig (3) sysbp (4) chol


The results suggest that the dataset contains missing values in the first four variables, but all the other variables are completely observed. Of the 371 observations, 122 are complete, while over 2/3 of the observations contain missing values, with 6 missing-value patterns in total that are not monotone.

Now we can determine whether the data are MCAR using the mcartest command. Suppose in the beginning, we do not include any of the auxiliary variables in the analysis and only apply Little’s MCAR test to chol, trig, diasbp, and sysbp. We try both the regular MCAR test and the test with unequal variances.

. mcartest chol trig diasbp sysbp, emoutput nolog

Expectation-maximization estimation         Number obs       =       371
                                            Number missing   =       267
                                            Number patterns  =         6
Prior: uniform                              Obs per pattern: min =    18
                                                             avg = 61.83333
                                                             max =   122

Observed log likelihood = -2623.2645 at iteration 17

             |      chol       trig     diasbp      sysbp
    ---------+--------------------------------------------
    Coef     |
       _cons |  206.2264   120.5829    78.8161    121.196
    Sigma    |
        chol |  41.91012   22.33289   3.762825    3.48862
        trig |  22.33289   42.08035   6.622086   10.69249
      diasbp |  3.762825   6.622086   18.45518   14.37273
       sysbp |   3.48862   10.69249   14.37273   35.92427

Little's MCAR test

Number of obs        =       371
Chi-square distance  =   25.7412
Degrees of freedom   =        14
Prob > chi-square    =    0.0279

We specified the emoutput option to display the EM estimates and also suppressed the log by using the nolog option of em_options. If the EM algorithm does not converge, mcartest will generate a warning message in blue, similar to what mi impute mvn does. EM has converged in this test. The regular Little’s MCAR test gives a χ² distance of 25.74 with d.f. 14 and p-value 0.0279. The test provides evidence that the missing data in the four variables of interest are not MCAR under significance level 0.05.
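
As a check, the 14 degrees of freedom follow directly from the six patterns reported by misstable pattern, because d.f. = Σ_{j=1}^J p_j − p from section 2.2:

$$\text{d.f.} = (4 + 3 + 3 + 3 + 3 + 2) - 4 = 18 - 4 = 14$$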

We can also specify the unequal option to run the test with unequal variances.

. mcartest chol trig diasbp sysbp, unequal

Little's MCAR test with unequal variances

Number of obs        =       371
Chi-square distance  =   56.7101
Degrees of freedom   =        41
Prob > chi-square    =    0.0522


This test gives a χ² distance of 56.71 with d.f. 41 and p-value 0.0522. The p-value is only slightly larger than 0.05, indicating that although the evidence against MCAR is not strong, the power of the test could possibly be low. Both tests cast doubts on the MCAR assumption.
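
The 41 degrees of freedom can likewise be verified from the formula in section 2.4 with q = 1:

$$\text{d.f.} = 14 + \sum_{j=1}^{6}\frac{p_j(p_j + 1)}{2} - \frac{p(p + 1)}{2} = 14 + (10 + 6 + 6 + 6 + 6 + 3) - 10 = 41$$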

Next we add auxiliary variables as covariates into the test and test the CDM assumption. Note that age is grouped into four brackets and female has two groups, so we use the factor variables i.age and i.female in the test. We also specify the emoutput option to display the EM estimates of the linear regression coefficients.

. mcartest chol trig diasbp sysbp = weight i.age i.female, emoutput nolog

Expectation-maximization estimation         Number obs       =       371
                                            Number missing   =       267
                                            Number patterns  =         6
Prior: uniform                              Obs per pattern: min =    18
                                                             avg = 61.83333
                                                             max =   122

Observed log likelihood = -2477.8319 at iteration 24

             |      chol       trig     diasbp      sysbp
    ---------+--------------------------------------------
    Coef     |
      weight |  .0898433   .1155952   .0035606   .0315919
      1b.age |         0          0          0          0
       2.age | -.0790635   -.598354   .0120911  -.6006885
       3.age | -.3147961  -.6971391  -.4392923   -1.07614
       4.age | -2.220313  -2.172395   .4254206   -.582046
   0b.female |         0          0          0          0
    1.female |   2.10565  -4.386112  -4.315367  -2.971464
       _cons |  191.5976   103.5614   79.32499   117.3274
    Sigma    |
        chol |  38.04902   15.04927   2.537881   1.435059
        trig |  15.04927   21.60197  -.5490975   1.695223
      diasbp |  2.537881  -.5490975   14.83308   10.89443
       sysbp |  1.435059   1.695223   10.89443   32.07185

Little's CDM test

Number of obs        =       371
Chi-square distance  =   89.4992
Degrees of freedom   =        84
Prob > chi-square    =    0.3204

This CDM test gives a χ² distance of 89.50 with d.f. 84 and p-value 0.3204. We find that for this dataset, adding age, female, and weight as covariates can pass the CDM test. The EM outputs in the table give the EM estimates of the multivariate linear regression of chol, trig, diasbp, and sysbp on weight, age, and female, including the regression coefficients (Coef) and the covariance matrix of the errors (Sigma). For comparison, we also run the test with all seven auxiliary variables as covariates.
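
(As an aside, the 84 degrees of freedom again follow from section 2.3: here q = 6, counting the constant term, weight, the three non-base age indicators, and the female indicator, so d.f. = q(Σ_{j=1}^J p_j − p) = 6 × 14 = 84.)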


. mcartest chol trig diasbp sysbp = weight height exercise i.age i.female
> i.alcohol i.smoking

Little's CDM test

Number of obs        =       371
Chi-square distance  =  141.1465
Degrees of freedom   =       140
Prob > chi-square    =    0.4569

This CDM test gives a χ² distance of 141.15 with d.f. 140 and p-value 0.4569. Both CDM tests are highly nonsignificant, which implies that although chol, trig, diasbp, and sysbp are not MCAR, the missing-data mechanism can be reasonably viewed as CDM given the auxiliary variables age, female, and weight. Therefore, for this dataset, any analysis of chol, trig, diasbp, and sysbp using only the 122 completely observed samples without adjusting for the effect of the auxiliary variables is not valid because the MCAR assumption is violated. The means of these four variables are significantly different in the 122 completely observed samples and in the other samples that contain missing values. On the other hand, the plausible CDM assumption implies that the means of these four variables change linearly with the auxiliary variables. For example, the mean cholesterol level changes from case to case with linear dependence on the subject’s weight, age, and gender, and the linear regression coefficients are displayed in the foregoing output of EM estimates. Because CDM is a special case of MAR (as mentioned in section 2.1), this example also implies that simple methods such as complete-case analysis do not necessarily work under the more general MAR assumption.

As suggested by Little (1995), because in real applications no information about the covariates is known beforehand, it seems preferable to include all possible covariates in the model. However, including more covariates will increase the χ² d.f. considerably, as can be seen in this example, which could make the estimation less efficient and the test less powerful. Therefore, we need to balance the limited sample size against the number of covariates and choose the appropriate MCAR or CDM assumption for testing.

5 Simulation study

In this section, I evaluate the performance of Little’s χ² test of MCAR and CDM through simulation studies. In general, when the true missing-data mechanism is MCAR, the empirical rejection probability of Little’s test of MCAR fits well with the nominal significance level, with a stable performance even for small samples, different proportions of missing values, and different numbers of variables with missing values. This was found in Little (1988) and Kim and Bentler (2002) and confirmed by my own simulations, which are not included here. However, for Little’s test of CDM, the natural extension of the MCAR test, it remains unclear whether increasing the number of covariates has an impact on its finite-sample performance. I explored this by simulating the following model,


$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = B\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{q-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

where B is a p × (q − 1) matrix of all 1s; x_1, x_2, . . . , x_{q−1} are independent N(0, 1) variables; and the error terms follow

$$\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \sim N\left\{\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right\}$$

y_1 is MCAR with probability 0.5, and y_2 is always completely observed, yielding two missing-value patterns. y = (y_1, y_2)′ is tested for CDM with auxiliary variables (covariates) x = (x_1, . . . , x_{q−1})′. The number of covariates q − 1 (constant term not included) varies among 0, 1, 2, 5, 10, and 20, and the sample size increases from 100, 250, and 500 to 1,000. For each scenario, 10,000 Monte Carlo replications are used. Under the null hypothesis (9), d² in (11) asymptotically follows the χ² distribution with d.f. = q. At significance level α = 0.05, I report the empirical rejection probability of the CDM test in table 2. The Monte Carlo standard errors are displayed in the parentheses right after each rejection rate.
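
A sketch of how one replication of this design can be generated and tested in Stata, for the scenario with q − 1 = 2 covariates and a sample size of 250, is shown below; the variable names are mine, and drawnorm supplies the correlated errors.

    clear
    set obs 250
    generate x1 = rnormal()
    generate x2 = rnormal()
    matrix C = (1, .5 \ .5, 1)
    drawnorm e1 e2, cov(C)             // correlated error terms
    generate y1 = x1 + x2 + e1         // B is a matrix of all 1s
    generate y2 = x1 + x2 + e2
    replace y1 = . if runiform() < .5  // y1 is MCAR with probability 0.5
    mcartest y1 y2 = x1 x2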

Table 2. Empirical rejection rates of the CDM test with α = 0.05

  Covariates   χ² d.f.                     Sample size
                              100           250           500           1000

      0           1      0.051 (0.002) 0.043 (0.002) 0.050 (0.002) 0.048 (0.002)
      1           2      0.051 (0.002) 0.052 (0.002) 0.050 (0.002) 0.052 (0.002)
      2           3      0.044 (0.002) 0.049 (0.002) 0.049 (0.002) 0.048 (0.002)
      5           6      0.045 (0.002) 0.049 (0.002) 0.050 (0.002) 0.051 (0.002)
     10          11      0.036 (0.002) 0.045 (0.002) 0.046 (0.002) 0.047 (0.002)
     20          21      0.023 (0.001) 0.039 (0.002) 0.045 (0.002) 0.046 (0.002)

Table 2 shows that in this model, when the number of covariates is small, the empirical rejection rate of Little’s CDM test is sufficiently close to the nominal level 0.05 with a sample size of 100 or 250. However, as the number of covariates increases to 10 and 20, the empirical rejection rate is much lower than the nominal level 0.05 when the sample size is 100 or 250. Therefore, in small samples, the CDM test tends to be more conservative when the number of covariates is large.


It is also of interest to compare the performance of Little’s MCAR test statistic d² with that of the augmented test statistic, d²aug, when the covariance matrices vary among different missing-value patterns. I simulated the following simple model without covariates,

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim N\left\{\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right\}$$

where y_2 always remains complete through all observations, and y_1 is missing with probability 0.5 based on the missing mechanisms below. In principle, we can compare both the rejection probabilities when the null hypothesis (5) or (9) is satisfied by the true model and the power of these tests when the null is violated. The alternative hypothesis could be either (10) or (12) and will be covered by the five cases below. In the following, Φ(·) is the cumulative distribution function of the standard normal, and Φ⁻¹(·) is its inverse.

1. (MCAR) y1 is MCAR with probability 0.5.

2. (MAR) y1 is missing if and only if Φ−1(0.1) ≤ y2 ≤ 0 or y2 ≥ Φ−1(0.9).

3. (MAR) y1 is missing if and only if |y2| ≥ Φ−1(0.75).

4. (MNAR) y1 is missing if and only if Φ−1(0.2) ≤ y1 ≤ 0 or y1 ≥ Φ−1(0.8).

5. (MNAR) y1 is missing if and only if |y1| ≥ Φ−1(0.75).

Note that y_1 is missing with probability 0.5 in all five cases, yielding two missing-value patterns, and we always test the full vector y = (y_1, y_2)′. Therefore, the true missing-data mechanism of case 1 corresponds to MCAR. Case 2 and case 3 are MAR. Case 4 and case 5 are MNAR. The covariance structures of the two missing-value patterns are the same in cases 1, 2, and 4 by symmetry and different in cases 3 and 5. Under the null hypothesis (5), d² in (7) asymptotically follows the χ² distribution with d.f. = 1, and d²aug in (13) asymptotically follows the χ² distribution with d.f. = 2. I report the empirical rejection rates of both tests at significance level α = 0.05 using sample sizes 100, 250, 500, and 1,000 based on 10,000 Monte Carlo replications for each of the five missing-data mechanisms. The results are summarized in table 3. The Monte Carlo standard errors are displayed in the parentheses right after each rejection rate.
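
The missingness rules are easy to impose in Stata. For example, after drawing (y1, y2) for a replication, cases 3 and 5 amount to the following sketch, with each rule applied to its own fresh draw of the data:

    * case 3 (MAR): missingness driven by the observed y2
    replace y1 = . if abs(y2) >= invnormal(.75)

    * case 5 (MNAR): missingness driven by the unobserved y1 itself
    replace y1 = . if abs(y1) >= invnormal(.75)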


Table 3. Empirical rejection rates when α = 0.05 for d² and d²aug

  Missingness     Test                        Sample size
  of y1           statistic     100           250           500           1000

  Case 1 (MCAR)   d²        0.051 (0.002) 0.043 (0.002) 0.050 (0.002) 0.048 (0.002)
                  d²aug     0.053 (0.002) 0.048 (0.002) 0.050 (0.002) 0.050 (0.002)
  Case 2 (MAR)    d²        0.182 (0.004) 0.346 (0.005) 0.566 (0.005) 0.851 (0.004)
                  d²aug     0.184 (0.004) 0.303 (0.005) 0.490 (0.005) 0.780 (0.004)
  Case 3 (MAR)    d²        0.052 (0.002) 0.051 (0.002) 0.051 (0.002) 0.050 (0.002)
                  d²aug     1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)
  Case 4 (MNAR)   d²        0.363 (0.005) 0.728 (0.004) 0.953 (0.002) 0.999 (0.000)
                  d²aug     0.292 (0.005) 0.626 (0.005) 0.916 (0.003) 0.998 (0.000)
  Case 5 (MNAR)   d²        0.050 (0.002) 0.053 (0.002) 0.048 (0.002) 0.052 (0.002)
                  d²aug     0.261 (0.004) 0.572 (0.005) 0.882 (0.003) 0.996 (0.001)

We can compare the results from d² and d²aug in table 3. In case 1, where y_1 is MCAR, the empirical rejection rates for both d² and d²aug are close to the nominal level. Case 2 (MAR) and case 4 (MNAR) also behave similarly, though the power of d² seems to be slightly higher than that of d²aug. This is not surprising because in the true model, the covariance matrices of the two missing patterns are exactly the same, and d²aug is less efficient because it estimates two covariance matrices separately. However, in either case 3 (MAR), where y_1 is missing if |y_2| ≥ Φ⁻¹(0.75), or case 5 (MNAR), where y_1 is missing if |y_1| ≥ Φ⁻¹(0.75), the missing data and the observed data have the same mean zero but different variances. As a result, the empirical rejection rates from d² are very low, indicating weak power of Little’s test in these two situations. The power of d² does not improve significantly even if we increase the sample size to 1,000. Instead, after adjustment for unequal variances, d²aug has much higher power, and the power increases to 1 as the sample size increases from 100 to 1,000. This implies that d² may not be reliable when the difference between missing-value patterns does not lie in their means, while d²aug can overcome this weakness when the covariance structure varies significantly across different missing-value patterns.

Although the augmented test for unequal variances has better power in some situations, such as case 3 and case 5 of the model above, it may be too conservative with small sample sizes and complicated missing-value patterns. In the extreme case, according to (13), d²aug cannot be computed when some missing-value patterns contain too few observations. In the following, I simulate the same example from Little (1988) and compare the finite-sample performance of d² and d²aug with more complicated missing-value patterns. Little (1988) considers a multivariate normal model with four variables, y = (y_1, y_2, y_3, y_4)′, generated by

$$\begin{aligned}
y_1 &= z_1\\
y_2 &= z_1\sqrt{0.9} + z_2\sqrt{0.1}\\
y_3 &= z_1\sqrt{0.2} + z_2\sqrt{0.1} + z_3\sqrt{0.7}\\
y_4 &= -z_1\sqrt{0.6} + z_2\sqrt{0.25} + z_3\sqrt{0.1} + z_4\sqrt{0.05}
\end{aligned}$$


where z_1, z_2, z_3, and z_4 are independent standard normal random variables. We only observe y_1, y_2, y_3, and y_4 but not z_1, z_2, z_3, and z_4, and the missing-data mechanism of y_1, y_2, y_3, and y_4 is MCAR. For y = (y_1, y_2, y_3, y_4)′, Little (1988) considers seven missing-value patterns in total, which can be represented by the missing indicator vector r = 1111, 1110, 1100, 1101, 1001, 1011, and 1010. For example, r = 1110 means that y_1, y_2, and y_3 are observed and y_4 is missing. The proportions of the seven missing-value patterns in the sample are 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, and 0.1, respectively. We examine the empirical rejection rates of d² and d²aug using sample sizes 100, 250, 500, 1,000, and 2,000 based on 10,000 Monte Carlo replications. The results are summarized in table 4, and the Monte Carlo standard errors are displayed in the parentheses.
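
In Stata, these four variables can be generated directly from the formulas above (a sketch; the z's are drawn, used, and then discarded):

    set obs 1000
    forvalues k = 1/4 {
        generate z`k' = rnormal()
    }
    generate y1 = z1
    generate y2 = sqrt(.9)*z1 + sqrt(.1)*z2
    generate y3 = sqrt(.2)*z1 + sqrt(.1)*z2 + sqrt(.7)*z3
    generate y4 = -sqrt(.6)*z1 + sqrt(.25)*z2 + sqrt(.1)*z3 + sqrt(.05)*z4
    drop z1-z4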

Table 4. Empirical rejection rates when α = 0.05 for d² and d²aug

  Test                                  Sample size
  statistic     100           250           500           1000          2000

  d²        0.043 (0.002) 0.047 (0.002) 0.054 (0.002) 0.051 (0.002) 0.049 (0.002)
  d²aug     0.213 (0.004) 0.096 (0.003) 0.070 (0.003) 0.060 (0.002) 0.053 (0.002)

Given these seven missing-value patterns, the χ² d.f. for d² and d²aug are 15 and 42, respectively. The results in table 4 suggest that with too many parameters in the covariance matrices to estimate, the empirical rejection rates for d²aug are too conservative and get close to the nominal level 0.05 only when the sample size is 2,000. In comparison, d² has already achieved acceptable accuracy when the sample size is 250. This implies that d²aug may not perform as well as d² in small samples when the missing-value patterns become more complicated. Moreover, as pointed out in Little (1988), d²aug may be sensitive to departure from the normality assumption because d²aug involves the comparison of variances, while simulation results in Little (1988) suggest that d² is relatively robust to nonnormality of the data. Therefore, the augmented test works best for nearly multivariate normal data when the covariance structure differs significantly among missing-value patterns and a sufficient number of observations are available in each pattern.


6 Conclusion

In this article, I presented the mcartest command, which implements Little’s χ² test of the MCAR assumption or the CDM assumption. The methodology is mainly based on Little (1988) and can be extended to testing the CDM assumption when covariates are included in the test. The command also allows adjustment for unequal variances through the unequal option. I demonstrated how to use this command and the caveats of choosing covariates through an example. Finally, I examined the performance of the MCAR and CDM tests, compared the strengths and weaknesses of the regular test and the test with unequal variances by simulation, and provided some suggestions for how to use them in practice.

7 Acknowledgments

This work was done during my internship at StataCorp in the summer of 2012. I am grateful to Yulia Marchenko for her guidance and support. I also thank the reviewer for helpful comments that have substantially improved the article.

8 References

Fitzmaurice, G., M. Davidian, G. Verbeke, and G. Molenberghs. 2009. Handbooks of Modern Statistical Methods: Longitudinal Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.

Kim, K. H., and P. M. Bentler. 2002. Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika 67: 609–623.

Little, R. J. A. 1988. A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association 83: 1198–1202.

———. 1995. Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association 90: 1112–1121.

Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: Wiley.

Rubin, D. B. 1976. Inference and missing data. Biometrika 63: 581–592.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton, FL: Chapman & Hall/CRC.

StataCorp. 2013. Stata 13 Multiple-Imputation Reference Manual. College Station, TX: Stata Press.

About the author

Cheng Li is a PhD candidate in statistics at Northwestern University. His research is currently focused on Bayesian methods for high-dimensional problems.


The Stata Journal (2013) 13, Number 4, pp. 810–835

Testing for zero inflation in count models: Bias correction for the Vuong test

Bruce A. Desmarais
University of Massachusetts–Amherst
Amherst, MA
[email protected]

Jeffrey J. Harden
University of Colorado–Boulder
Boulder, CO
[email protected]

Abstract. The proportion of zeros in event-count processes may be inflated by an additional mechanism by which zeros are created. This has given rise to statistical models that accommodate zero inflation; these are available in Stata through the zip and zinb commands. The Vuong (1989, Econometrica 57: 307–333) test is regularly used to determine whether estimating a zero-inflation component is appropriate or whether a single-equation count model should be used. The use of the Vuong test in this case is complicated by the fact that zero-inflated models involve the estimation of several more parameters than the single-equation models. Although Vuong (1989, Econometrica 57: 307–333) suggested corrections to the test statistic to address the comparison of models with different numbers of parameters, Stata does not implement any such correction. The result is that the Vuong test used by Stata is biased toward supporting the model with a zero-inflation component, even when no zero inflation exists in the generative process. We provide new Stata commands for computing the Vuong statistic with corrections based on the Akaike and Bayesian (Schwarz) information criteria. In an extensive Monte Carlo study, we illustrate the bias inherent in using the uncorrected Vuong test, and we examine the relative merits of the Akaike and Schwarz corrections. Then, in an empirical example from international relations research, we show that errors in selecting an event-count model can have clear implications for substantive conclusions.

Keywords: st0319, zipcv, zinbcv, count models, Poisson, zero-inflated Poisson, negative binomial, zero-inflated negative binomial, Vuong test, AIC, BIC, zip, zinb

1 Introduction

In formulating statistical models, there is an inherent tension between reducing the data to a parsimonious and comprehensible summary and specifying a model that adequately captures the complexities in real data (Achen 2005). This balancing act is apparent in the modeling of event-count data with a seemingly disproportionate number of zeros. One way that this overabundance could arise is that the presence of zeros is inflated by an additional process besides the one that influences the counts that are greater than zero. Regression models for zero-inflated counts offer the benefit of accommodating multiple theories regarding the presence of zeros (Lambert 1992).

Underlying the choice between conventional count regression and zero-inflated modeling is the common tension between overfitting and successfully explaining empirical features of the data. A striking trait of many event-count datasets is the sheer proportion of zeros in the dependent variables. The field of international relations, which focuses to a great degree on events of an extreme and rare nature, is one in which this trait is highly prevalent. Data from international conflict and terrorism provide illustrative examples: 89% zeros in Clare (2007), 91% in Kisangani and Pickering (2007), and 97% in Neumayer and Plumper (2011). This characteristic raises an important theoretical and empirical question: Is there a process that inflates the probability of a zero case?

Much is at stake in the answer to this question. A “yes” amounts to more than just the addition of an explanatory variable—an entire process, in the form of another equation and several more parameters to estimate, is added to the model. Such an addition may be warranted if there is strong theoretical reason to expect two processes. For instance, Clare (2007) presents a theoretical differentiation of international dispute initiation and escalation. Data-driven, inductive assessments of the presence of multiple processes can be performed by formally testing whether the added complexity of the zero-inflated model improves significantly upon the fit of the standard count model. The validity of this test is critical. A false negative—choosing the standard model when the zero-inflated model should be used—directs attention away from a separate and striking component of the data-generating process (DGP). A false positive—incorrectly choosing the zero-inflated model—causes the erroneous complication of the model through the addition of an entire equation to the specification.

With this tension in mind, researchers commonly use the Vuong test (Vuong 1989) to determine whether the zero-inflated model fits the data statistically significantly better than count regression with a single equation (see Vogus and Welbourne [2003]; Anthony [2005]; Mondak and Sanders [2005]; Clare [2007]; Lee et al. [2007]; Zandersen, Termansen, and Jensen [2007]; Tiwari et al. [2009]; Nielsen et al. [2010]; Cavrini et al. [2012]; and Zhang et al. [2012]). In this article, we show that there are problems with the implementation of this test in Stata. In particular, Vuong (1989) demonstrates that bias ensues from comparing models with different parameters and suggests using an information criterion adjustment to correct this bias. The built-in Stata commands for zero-inflated count models, zip (zero-inflated Poisson (ZIP) regression) and zinb (zero-inflated negative binomial (ZINB) regression), do not implement a correction to the Vuong test statistic to account for the added parameters in the zero-inflated model. The result of having no such correction is that Stata’s computation of the Vuong test statistic is strongly biased in favor of the more complex model with a zero-inflation component, even when there is no zero inflation in the true DGP.

We address this problem here by providing new Stata commands, zipcv and zinbcv, which operate exactly like zip and zinb but add computations of the Vuong test with two different corrections suggested by Vuong (1989)—one based on the Akaike information criterion (AIC) (Akaike 1974) and one based on the Bayesian (Schwarz) information criterion (BIC) (Schwarz 1978). We show that these commands allow applied researchers to properly use the Vuong (1989) test to decide between standard and zero-inflated count models.
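As a preview of the syntax before the detailed example in section 4, the following minimal sketch shows how the new commands are called; y, x1, x2, and z1 are placeholder variable names, and the stored results are the ones documented for zinbcv in section 4.2.

    . zipcv y x1 x2, inflate(z1) vuong
    . display e(vuong)
    . display e(vuongAIC)
    . display e(vuongBIC)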


After reviewing zero-inflated models and the details of the commands, we illustrate the use of zinbcv with an example from research on international disputes. Clare (2007) shows evidence from a ZINB model that redemocratizing countries with long legacies of past democratic regimes are more likely to initiate international disputes, while those with long legacies of past authoritarian regimes follow more cautious foreign policies. We show that support for these assertions is conditional on the use of the ZINB model; under a standard negative binomial (NB) model, the length of the previous democratic regime exerts a small and statistically nonsignificant effect on the expected number of disputes initiated. Moreover, while the uncorrected Vuong test statistic from zinb selects the ZINB model (p < 0.05), the test statistic with the BIC correction in zinbcv selects the NB model (p < 0.05).

2 Zero-inflated count models

The class of zero-inflated count regression models, first proposed by Lambert (1992) as the ZIP model, is a mixture between a generalized linear model (GLM) for the dichotomous outcome that a count Y is equal to zero (such as logit, with covariates z and coefficients γ) and a conventional event-count GLM (such as a Poisson or NB regression with covariates x and coefficients β). The likelihood of a single observation is given by the following equation (Long 1997, 244),

$$l(y \mid x, z, \beta, \gamma) = P(z'\gamma)\, I(y = 0) + \{1 - P(z'\gamma)\}\, f(y \mid x'\beta)$$

where P is the cumulative distribution function used to specify the dichotomous outcome that y > 0, and f is the probability mass function corresponding to the chosen count model (for example, the Poisson distribution). Using the log link to parameterize f, we obtain the mean of y_i:

$$\mu_i = \{1 - P(z_i'\gamma)\}\exp(x_i'\beta)$$

There are several important properties of this model to note. First, the probability of a zero is governed by both the dichotomous and count equations in the model. Specifically,

$$\Pr(y_i = 0) = P(z_i'\gamma) + \{1 - P(z_i'\gamma)\}\, f(0 \mid x_i'\beta)$$

This is different from the popular hurdle models, first proposed by Mullahy (1986), in which the probability of a 0 is completely determined by a dichotomous GLM and the distribution of counts above 0 is governed by a count distribution truncated from below at 1. Second, the count regression f is not “nested” in the zero-inflated model, because the model does not reduce to f when γ = 0, in which case the probability of a 0 is inflated by 0.50. The main implication stemming from these properties is that it is necessary to compare the zero-inflated model with a simple count model using a test for nonnested models. The conventional likelihood-ratio test, Wald test, or Lagrange multiplier test cannot be used (Long 1997).
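To see the mixture at work, consider a hypothetical observation (our illustrative numbers, not from the original article) with a logit inflation equation evaluated at $z_i'\gamma = 0$, so $P = 0.5$ (the γ = 0 case just noted), and a Poisson count equation with mean $\exp(x_i'\beta) = 2$:

$$\Pr(y_i = 0) = 0.5 + 0.5\, e^{-2} \approx 0.568$$

Half of this probability is pure inflation; the Poisson process alone would produce a zero with probability only $e^{-2} \approx 0.135$.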


2.1 The Vuong test

The Vuong test is designed to compare two models (g1 and g2) fit to the same data by maximum likelihood. Specifically, it tests the null hypothesis that the two models fit the data equally well. The models need not be nested, nor does one of the models need to represent the correct specification. The specific metric of model fit is the Kullback–Leibler divergence (KLD) (Kullback and Leibler 1951) from the true model that generated the data (gt). The KLD is a measure of the distance between two probability distributions, which is the basis of many measures used for model comparison and selection, including the AIC (Akaike 1974), the Takeuchi information criterion (Konishi and Kitagawa 1996), the generalized information criterion (Konishi and Kitagawa 1996), and the cross-validated log likelihood (Smyth 2000). The KLD between models g and gt is denoted $D_{KL}(g_t \,\|\, g)$. The null hypothesis of the Vuong test is

$$H_0: D_{KL}(g_t \,\|\, g_1) = D_{KL}(g_t \,\|\, g_2)$$

The formula for $D_{KL}(g_t \,\|\, g)$, where $g_t$ and $g$ are both models for nonnegative integers (for example, counts), is defined as

$$\begin{aligned}
D_{KL}(g_t \,\|\, g) &= \sum_{y=0}^{\infty} \ln\left\{\frac{g_t(y)}{g(y)}\right\} g_t(y) \\
&= \sum_{y=0}^{\infty} \ln\{g_t(y)\}\, g_t(y) - \sum_{y=0}^{\infty} \ln\{g(y)\}\, g_t(y) \\
&= E_{g_t}[\ln\{g_t(y)\}] - E_{g_t}[\ln\{g(y)\}]
\end{aligned}$$
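As a concrete instance of this formula (our illustrative example, not from the original analysis), let $g_t$ be Poisson with mean 2 and $g$ be Poisson with mean 3. Then $\ln\{g_t(y)/g(y)\} = (3 - 2) + y\ln(2/3)$, and taking the expectation under $g_t$ gives

$$D_{KL}(g_t \,\|\, g) = (3 - 2) + 2\ln(2/3) \approx 0.189$$

a small but nonzero divergence; the KLD is zero only when the two models coincide.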

From this, the null hypothesis, H0, can be written as

$$\begin{aligned}
D_{KL}(g_t \,\|\, g_1) - D_{KL}(g_t \,\|\, g_2) &= 0 \\
(E_{g_t}[\ln\{g_t(y)\}] - E_{g_t}[\ln\{g_1(y)\}]) - (E_{g_t}[\ln\{g_t(y)\}] - E_{g_t}[\ln\{g_2(y)\}]) &= 0 \\
E_{g_t}[\ln\{g_1(y)\}] - E_{g_t}[\ln\{g_2(y)\}] &= 0 \qquad (1)
\end{aligned}$$

(1) is the difference in the expected values of the log likelihoods of g1 and g2 when their parameters are estimated on data generated from gt. Importantly, $E_{g_t}[\ln\{g_1(y)\}]$ and $E_{g_t}[\ln\{g_2(y)\}]$ are not formulated under the assumption that the same sample is used to estimate the parameters and evaluate the likelihoods. For a sample size N, the Vuong test is a difference of means test (that is, a paired z test) applied to the N individual log-likelihood contributions (of the N observations) to g1 and g2. In the context of testing for zero inflation, the Vuong test is a test for whether the mean observation-wise difference between the log-likelihood contribution to the zero-inflation model and the contribution to the standard count model is, on average, greater than zero. Let $\hat\beta$ be the estimate of β when the zero-inflation component is not included in the model, $\tilde\beta$ the estimate of β in the zero-inflation model, and $\tilde\gamma$ the estimate of γ. Let dl be a vector of length N, such that the ith element is the ith individual log-likelihood difference

$$dl_i = \ln\{l(y_i \mid x_i, z_i, \tilde\beta, \tilde\gamma)\} - \ln\{f(y_i \mid x_i'\hat\beta)\}$$


The Vuong test statistic is

$$\text{Vuong} = \left(s_{dl}\sqrt{N}\right)^{-1}\sum_{i=1}^{N} dl_i$$

where $s_{dl}$ is the standard deviation of dl. Because $(1/N)\sum_{i=1}^{N} dl_i$ is a consistent estimator of the quantity in (1), under H0, the Vuong test statistic is asymptotically normally distributed by the central limit theorem (Vuong 1989).

The estimated log likelihood is a consistent estimator of the KLD, which establishes the consistency and asymptotic normality of the Vuong test statistic. However, the estimated log likelihood is a biased estimator of the KLD, a result that motivated the derivation of the AIC (Akaike 1974) and numerous other model fit statistics. This means that the Vuong test statistic is a biased estimator of the differences in the average fit of the count model and zero-inflated count model. The bias in the estimated log likelihood as an estimator of the KLD arises from the fact that the same data are used to estimate both the parameters of the model (that is, coefficients and standard errors) and the average value of the log likelihood. This “double dipping” produces a positive bias in the in-sample log likelihood as an estimator of the KLD (Konishi and Kitagawa 1996). Intuitively, this bias arises because some of the random noise from the sample gets treated as nonrandom signal when estimating the KLD with the log likelihood.

It is generally intractable to derive the value of this bias in a finite sample, so model selection criteria use asymptotic corrections. For example, the AIC uses the correction p (the number of estimated parameters), which is equal to the asymptotic bias, given that gt is nested in the fit model. The bias is accentuated when g1 and g2 have a different number of parameters (p1 and p2, respectively), as is the case when comparing single-equation and zero-inflated count models. Vuong (1989) suggests adding an average difference in a selection-criterion-based correction factor to each dli to correct the bias.

For instance, if the correction factor is based on the AIC, the corrected difference in log likelihoods is given by

$$dl_i^c = dl_i + \frac{p_2 - p_1}{N} \qquad (2)$$

Vuong (1989) also provides the BIC correction as

$$dl_i^c = dl_i + (p_2 - p_1)\,\frac{\ln(N)}{2N} \qquad (3)$$
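These corrections are simple to compute by hand. The following do-file sketch (our illustration; y, x1, x2, and z1 are placeholder variable names, and the predict calls assume the equation-numbering conventions of Stata's multiple-equation estimators) builds the observation-wise differences dl_i for Poisson versus ZIP and applies (2) and (3). The zipcv and zinbcv commands described above perform these computations internally.

    * Fit the single-equation model; compute observation-wise log likelihoods
    quietly poisson y x1 x2
    predict double mu1, n                       // predicted Poisson mean
    generate double ll1 = -mu1 + y*ln(mu1) - lngamma(y + 1)
    scalar k1 = e(k)                            // p2: parameters, count-only model

    * Fit the zero-inflated model; compute its log-likelihood contributions
    quietly zip y x1 x2, inflate(z1)
    predict double xb2, xb equation(#1)         // count-equation linear predictor
    predict double zg2, xb equation(#2)         // inflation-equation linear predictor
    generate double pz  = invlogit(zg2)         // Pr(degenerate zero), logit link
    generate double mu2 = exp(xb2)
    generate double f2  = exp(-mu2)*mu2^y/exp(lngamma(y + 1))
    generate double ll2 = ln(cond(y == 0, pz + (1 - pz)*f2, (1 - pz)*f2))
    scalar k2 = e(k)                            // p1: parameters, zero-inflated model

    * Vuong z statistics; adding a constant to each dl_i shifts only the mean
    generate double dl = ll2 - ll1
    quietly summarize dl
    display "uncorrected z   = " r(mean)/(r(sd)/sqrt(r(N)))
    display "AIC-corrected z = " (r(mean) + (k1 - k2)/r(N))/(r(sd)/sqrt(r(N)))
    display "BIC-corrected z = " (r(mean) + (k1 - k2)*ln(r(N))/(2*r(N)))/(r(sd)/sqrt(r(N)))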

Another possibility would be to use an out-of-sample approach to computing the individual log-likelihood contributions, such as through leave-one-out cross-validation (for example, Smyth [2000]). Rendering the training and testing data independent of one another removes the optimistic bias of in-sample measures. We tested such an approach in the analysis described below and found minimal differences between it and the asymptotic corrections in (2) (AIC) and especially (3) (BIC). Not surprisingly, these differences were particularly small as N increased. Because the iterative nature of leave-one-out cross-validation produces considerable computational costs, we elected not to include it in the corrections to the Vuong test we examine here.


3 Monte Carlo simulations

Having outlined the basic premise of zero-inflated models and the Vuong (1989) test, we next examine via simulation the consequences of failing to correct the Vuong statistic when comparing standard and zero-inflation count models. The Stata commands zip and zinb offer the option of reporting a Vuong test statistic, which many researchers use (for example, Vogus and Welbourne [2003]; Anthony [2005]; Mondak and Sanders [2005]; Clare [2007]; Zandersen, Termansen, and Jensen [2007]). However, Stata’s documentation for the Vuong statistic does not mention which adjustment (AIC or BIC) is used. Stata’s technical support informed us that current versions of Stata do not include any adjustment. We then verified this by inspection of zip.ado and zinb.ado.1

We study the performance of the Vuong test in selecting between ZIP and Poisson models and ZINB and NB models. In the simulation study, we examine the consequences of two important dimensions for the performance of the uncorrected and corrected tests: first, the sample size, and second, the number of covariates in the inflation component of the model, both in generating the dependent variable and fitting the zero-inflated models. We use Stata’s example dataset fish.dta to parameterize the simulation study. Approximately 57% of the 250 observations in this dataset have a value of 0 on the dependent variable. To define parameters for the data simulated in the Monte Carlo study, we first fit zero-inflated models with count as the dependent variable and standardized versions of nofish, livebait, camper, persons, and child as independent variables in both the count and inflation components. The linear predictors in the count components of the models in the Poisson- and NB-based simulations, respectively, are

$$x'\beta = 0.734 - 0.384\,\text{nofish} + 0.376\,\text{livebait} + 0.264\,\text{camper} + 0.940\,\text{persons} - 1.004\,\text{child}$$

and

$$x'\beta = 0.515 - 0.127\,\text{nofish} + 0.504\,\text{livebait} + 0.106\,\text{camper} + 1.131\,\text{persons} - 1.013\,\text{child}$$

In the following equations, the inflation components contain a number of terms equal to the number of covariates included in the respective condition in the simulation study. The formulas are given below for the Poisson and negative binomial simulations, respectively.2

$$z'\gamma = -0.157 + 1.73\,\text{child} - 0.669\,\text{persons} - 0.443\,\text{camper} - 0.176\,\text{livebait} - 0.638\,\text{nofish}$$

1. Specifically, we verified this with zip.ado version 1.6.11 (6/6/2011) and zinb.ado version 1.7.12 (4/19/2012). Both of these were the current files as of 29 April 2013.

2. In the appendix, we present a replication of this Monte Carlo study using slightly different parameterizations and medpar.dta.


and

$$z'\gamma = -1.92 + 2.63\,\text{child} - 1.065\,\text{persons} - 1.23\,\text{camper} + 0.161\,\text{livebait} - 0.619\,\text{nofish}$$

The bias in the observed log likelihood as an estimator of the expected log likelihood, the resulting bias in the Vuong test statistic, and the bias corrections associated with AIC and BIC depend upon the sample size and the difference in the number of parameters in the two models under comparison (Konishi and Kitagawa 1996). Accordingly, the two conditions on which we focus are the sample size and the number of variables in the inflation component of the model. We examine sample sizes of 200, 500, and 3,000. In terms of the number of parameters, we vary the inflation component in two ways. First, we run simulations in which there is no zero inflation, drawing the outcomes from the Poisson and NB models, and study the performance of the three test variants when one, three, and five covariates are incorrectly included in the inflation component. We then run a second variant in which there is zero inflation and examine the performance of the tests when one, three, and five covariates are correctly included in the inflation component. Each of the 36 conditions (3 sample sizes × 3 covariate specifications × 2 inflation/no inflation × 2 distributions) is run for 1,000 iterations.3
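As a concrete illustration of one cell of this design, the sketch below performs a single no-inflation Poisson iteration with one covariate incorrectly included in the inflation component, using the count-equation coefficients reported above (our reconstruction of the design, not the authors' simulation code; webuse fish is assumed to load the example dataset named above).

    webuse fish, clear
    foreach v of varlist nofish livebait camper persons child {
        quietly summarize `v'
        quietly replace `v' = (`v' - r(mean))/r(sd)     // standardize, as in the text
    }
    generate double xb = 0.734 - 0.384*nofish + 0.376*livebait ///
        + 0.264*camper + 0.940*persons - 1.004*child
    generate ysim = rpoisson(exp(xb))                   // DGP with no zero inflation
    zipcv ysim nofish livebait camper persons child, inflate(child) vuong nolog
    display e(vuong) "  " e(vuongAIC) "  " e(vuongBIC)

Repeating this 1,000 times and tabulating the three z statistics against the ±1.65 cutoffs reproduces one panel of figure 1.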

Figures 1–2 present the results of the simulations under the condition of no zero inflation. The plots illustrate the results of hypothesis tests derived from the uncorrected, AIC-corrected, and BIC-corrected Vuong statistics. The graphs depict the distribution of significance test results based on the Vuong test comparing standard to zero-inflated count models with the respective correction. To demonstrate interpretation of the plots, we walk through the results conveyed in panel (a) of figure 1. Panel (a) gives the results for the simulations with a sample size of 200 and a single covariate incorrectly included in the inflation component of the ZIP model. The AIC-corrected test statistically significantly (at the 0.05 one-tailed level) selects the single-equation Poisson model around 67% of the time, supports the Poisson model (though not significantly) approximately 30% of the time, and supports the two-equation ZIP model (though not significantly) approximately 3% of the time. The BIC-corrected test statistically significantly selects the Poisson model about 97% of the time and supports the Poisson model (though not significantly) approximately 3% of the time. The Vuong test without a correction supports the Poisson model (though not significantly) approximately 45% of the time and supports the ZIP (though not significantly) approximately 55% of the time.

3. We performed all the computations presented in this section in Stata/SE 11.1 and Stata/IC 12.1.


[Figure 1 omitted: nine panels of horizontal bars (rows labeled AIC, BIC, and None; x axis labeled Cumulative Percent), one panel per combination of sample size (200; 500; 3,000) and number of inflation covariates (1; 3; 5). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 1. Monte Carlo results with Poisson simulations. The plots depict the distribution of significance test results based on the Vuong test comparing Poisson to ZIP models with the respective correction across varying sample sizes and numbers of covariates incorrectly included in the inflation component.


[Figure 2 omitted: nine panels of horizontal bars (rows labeled AIC, BIC, and None; x axis labeled Cumulative Percent), one panel per combination of sample size (200; 500; 3,000) and number of inflation covariates (1; 3; 5). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 2. Monte Carlo results with NB simulations. The plots depict the distribution of significance test results based on the Vuong test comparing NB to ZINB models with the respective correction across varying sample sizes and numbers of covariates incorrectly included in the inflation component.

When there is no zero inflation in the DGP, the BIC-corrected statistic performs the best, and the uncorrected statistic performs the worst. The BIC-corrected statistic is statistically significantly negative (p < 0.05, one tailed)—in favor of the single-equation model—in 95–100% of the iterations. In contrast, the uncorrected Vuong statistic is positive in more than 80% of the iterations and statistically significantly in favor of the zero-inflated NB model in 5–60% of the iterations. The poor performance of the uncorrected test depends heavily on sample size and the number of covariates included in the inflation component.4 However, it is particularly critical to note that not once in the simulation runs without zero inflation did the uncorrected test result in a statistically significant rejection of the zero-inflated model. The AIC-corrected test performs moderately better in the no-inflation condition. In 20–60% of iterations, the zero-inflated model is statistically significantly rejected, and the single-equation model is virtually never rejected. However, the degree to which the AIC favors the single-equation count model decreases with the number of covariates incorrectly included in the inflation component.

Figures 3 and 4 present results in which zero inflation is a component of the generative process. In the Poisson-based simulations, all the tests perform equally well, nearly always rejecting the single-equation model. However, when it comes to the NB-based simulations, the uncorrected Vuong statistic performs the best in selecting the correctly specified model—nearly always statistically significantly rejecting the single-equation model. The AIC-corrected test performs moderately well in the small-sample (N = 200) conditions, significantly favoring the zero-inflated model in 40–50% of the iterations and virtually always statistically significantly selecting the zero-inflation model in the larger sample-size conditions. The performance of the BIC-corrected statistic—performing the worst among the three when ZINB is the correct model—varies substantially across the sample-size and covariate conditions. The tendency for the BIC-corrected statistic to statistically significantly reject NB is inversely related to the number of covariates correctly included in the zero-inflation component and directly related to the sample size.

4. Specifically, the smaller the sample and the larger the number of covariates incorrectly included in the inflation component, the more likely the uncorrected test is to statistically significantly reject the single-equation model.


[Figure 3 omitted: nine panels of horizontal bars (rows labeled AIC, BIC, and None; x axis labeled Cumulative Percent), one panel per combination of sample size (200; 500; 3,000) and number of inflation covariates (1; 3; 5). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 3. Monte Carlo results with ZIP simulations. The plots depict the distribution of significance test results based on the Vuong test comparing Poisson to ZIP models with the respective correction across varying sample sizes and numbers of covariates correctly included in the inflation component.


[Figure 4 omitted: nine panels of horizontal bars (rows labeled AIC, BIC, and None; x axis labeled Cumulative Percent), one panel per combination of sample size (200; 500; 3,000) and number of inflation covariates (1; 3; 5). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 4. Monte Carlo results with ZINB simulations. The plots depict the distribution of significance test results based on the Vuong test comparing NB to ZINB models with the respective correction across varying sample sizes and numbers of covariates correctly included in the inflation component.

Our simulation study illustrates two important points regarding the use of the Vuong test for choosing between zero-inflated and single-equation count models. First, failure to correct for the additional parameters estimated in the zero-inflation model by using the uncorrected Vuong statistic results in a substantial tendency toward erroneously rejecting the single-equation model when there is no zero inflation in the generative process. In small to moderate sample sizes with five or more covariates included in the inflation component, this tendency can exceed 40%. Second, the AIC and BIC corrections exhibit their usual relative strengths. The BIC correction is better at conclusively supporting the more parsimonious single-equation model when it is appropriate, and the AIC is better at conclusively supporting the more extensively specified zero-inflated model when it is appropriate. Moreover, neither the AIC- nor the BIC-corrected test exhibits the extreme tendency toward statistical significance in the wrong direction that is exhibited by the uncorrected statistic when there is no zero inflation in the generative process. In larger samples, the BIC-corrected test appears to exhibit an advantage in that it performs very well both at rejecting the zero-inflated model when there is no zero inflation and at rejecting the single-equation model when zero inflation is present. In contrast, the AIC-corrected test does not perform well at rejecting the zero-inflation model when there is no zero inflation, even in our large-sample conditions.

4 Model selection in international relations

Having shown the problem with the uncorrected Vuong statistic with simulation and the corresponding improvements offered by the AIC or BIC correction, we now turn to their application to data from recent work using event-count models in international relations (Clare 2007).5 Our objectives here are to demonstrate that the different implementations of the Vuong test for zero inflation can produce considerably different results in an applied setting and to provide an illustration of our zinbcv command.6 In addition to illustrating the use of the new command, we show that the selection made by the test is critical to our understanding of important processes, such as conflict behavior.

4.1 The data

Clare (2007) examines the conflict behavior of democratizing regimes. He posits that redemocratizing states are more likely to initiate conflict, especially when there is a longer democratic history in the state. In contrast, he expects a stronger authoritarian legacy to correspond with less initiation of conflict. Clare’s (2007) primary theoretical claim is that leaders of democratizing states face varying degrees of threat of losing power to the old authoritarian regime because of failed foreign policy. Thus democratizing states have more freedom to maneuver in foreign policy decision making when the authoritarian legacy is weak and less freedom when it is strong.

Using nation-year as the unit of analysis for the period 1950–1990, Clare (2007) models the count of disputes initiated by a state in a given year as a function of several independent variables, including indicators for the regime type (see Clare [2007, 267]), and measures of the duration of the most recent authoritarian and democratic regimes. The core test of the theory comes through the interaction of redemocratization—an indicator for a state that is in the process of democratizing—and each of these two regime duration measures. Clare (2007) expects the interaction between redemocratization and duration of the most recent authoritarian regime to produce a negative coefficient, which indicates a drop in the expected number of disputes initiated when the past authoritarian legacy is longer. In contrast, he expects redemocratization × duration of the most recent democratic regime to produce a positive coefficient, which indicates an increase in the expected number of disputes initiated when the past democratic regime is longer.

5. The data for this example in Stata format are publicly available at the Journal of Peace Research replication data archive: http://www.prio.no/Journals/Journal/?x=2&content=replicationData#2007.

6. The zipcv command works in exactly the same way as zinbcv, so we only show the latter to conserve space.

4.2 Computing the Vuong test

Clare (2007) uses the ZINB model in estimation because 89% of the 3,955 cases in the data contain a 0 on the dependent variable.7 He also reports the uncorrected Vuong test statistic of 3.22, which corresponds to a statistically significant selection of the ZINB over the NB (p ≈ 0.001). However, this test statistic is problematic because it does not correct for additional parameters from the inflation equation. To obtain the corrected test statistics, as well as the results of the zinb routine, we use zinbcv in the exact same way as zinb:

7. He includes the same set of covariates in the count and inflation equations.


. use clare-monadic-replication.dta

. zinbcv init_count stable_dem1 stable_aut1 redem1 redem1_pautdur1
>     redem1_pdemdur1 growth prop_demsregion1 riots1,
>     inflate(stable_dem1 stable_aut1 redem1 redem1_pautdur1 redem1_pdemdur1
>     growth prop_demsregion1 riots1) vuong nolog

Zero-inflated negative binomial regression        Number of obs   =      3955
                                                  Nonzero obs     =       442
                                                  Zero obs        =      3513
Inflation model = logit                           LR chi2(8)      =     98.05
Log likelihood  = -1541.107                       Prob > chi2     =    0.0000

------------------------------------------------------------------------------------
      init_count |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+------------------------------------------------------------------
init_count       |
     stable_dem1 |   .0792226   .2257374     0.35   0.726    -.3632145    .5216598
     stable_aut1 |   .3562161   .2275137     1.57   0.117    -.0897026    .8021349
          redem1 |   .9875878   .7367568     1.34   0.180    -.4564291    2.431605
 redem1_pautdur1 |  -.1189129   .0615584    -1.93   0.053    -.2395651    .0017393
 redem1_pdemdur1 |   .0721211   .0428799     1.68   0.093     -.011922    .1561642
          growth |   .0000189   2.05e-06     9.22   0.000     .0000149     .000023
prop_demsregion1 |  -1.434321   .2373772    -6.04   0.000    -1.899572   -.9690703
          riots1 |   .0270679   .0126341     2.14   0.032     .0023056    .0518302
           _cons |  -1.671941   .2222414    -7.52   0.000    -2.107526   -1.236356
-----------------+------------------------------------------------------------------
inflate          |
     stable_dem1 |  -12.35688   595.2899    -0.02   0.983    -1179.104     1154.39
     stable_aut1 |   1.260302   1.375148     0.92   0.359    -1.434939    3.955542
          redem1 |   4.442412   5.332937     0.83   0.405    -6.009953    14.89478
 redem1_pautdur1 |  -.5272241   .5540954    -0.95   0.341    -1.613231    .5587829
 redem1_pdemdur1 |   .6353599   .4939444     1.29   0.198    -.3327533    1.603473
          growth |  -.0000544   .0000169    -3.22   0.001    -.0000876   -.0000213
prop_demsregion1 |   -16.1701   10.59818    -1.53   0.127    -36.94215    4.601959
          riots1 |  -.2118948   .1103174    -1.92   0.055     -.428113    .0043234
           _cons |  -.4104956   1.399768    -0.29   0.769     -3.15399    2.332999
-----------------+------------------------------------------------------------------
        /lnalpha |  -.4193847   .3544373    -1.18   0.237    -1.114069    .2752996
-----------------+------------------------------------------------------------------
           alpha |   .6574512   .2330252                      .3282207    1.316925
------------------------------------------------------------------------------------

Vuong test of zinb vs. standard negative binomial:  z =  3.22  Pr>z = 0.0006
                                                               Pr<z = 0.9994
                     with AIC (Akaike) correction:  z =  1.77  Pr>z = 0.0386
                                                               Pr<z = 0.9614
                    with BIC (Schwarz) correction:  z = -2.80  Pr>z = 0.9974
                                                               Pr<z = 0.0026

. display e(vuong)
3.219827

. display e(vuongAIC)
1.7674489

. display e(vuongBIC)
-2.7950052

This prints all the information users are accustomed to seeing with zinb, but it also includes the corrected versions of the Vuong statistic under the uncorrected version. Additionally, the exact values are stored in e(vuongAIC) and e(vuongBIC). In this case, the Vuong test statistic with the AIC correction is 1.77, which still corresponds to a statistically significant selection of the zero-inflated model (p ≈ 0.04). However, the Vuong test statistic with the BIC correction is −2.80, which represents a significant selection of the standard NB model (p ≈ 0.003). In short, there is considerable variation in the results from the Vuong test. However, given the results from our above simulation and the relatively large sample size, we place more weight on the BIC-corrected Vuong test and conclude that the more parsimonious NB is the appropriate model.

4.3 Implications of the results

Table 1 summarizes results from both the ZINB and NB models. Notice first that the original ZINB shows support for Clare’s (2007) hypotheses. In particular, the coefficient on redemocratization × duration of the most recent authoritarian regime is negative, the coefficient on redemocratization × duration of the most recent democratic regime is positive, and both are statistically significant. This indicates that in states that are transitioning to democracy, a longer legacy of authoritarian rule contributes to a decline in the number of disputes initiated, while a longer legacy of democratic rule corresponds with an increase in disputes. Additionally, these effects are substantively meaningful. As Clare (2007, 270–271) notes, an increase of 1 year in the duration of the previous authoritarian regime corresponds to an 11% drop in the expected number of disputes, while the same increase for democratic regimes produces an 8% increase in the expected number of disputes.
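These percentage changes follow directly from exponentiating the interaction coefficients (our arithmetic check, using the unrounded estimates from the zinbcv output above):

$$e^{-0.1189} \approx 0.888 \quad \text{(an 11% drop)}, \qquad e^{0.0721} \approx 1.075 \quad \text{(roughly an 8% increase)}$$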


Table 1. ZINB and NB results from Clare (2007)

Variable                                              ZINB          NB
---------------------------------------------------------------------------
Stable democracy                                      0.08        −0.09
                                                     (0.23)       (0.22)
Stable autocracy                                      0.36∗        0.14
                                                     (0.23)       (0.20)
Redemocratization                                     0.99         0.95∗
                                                     (0.74)       (0.43)
Redemocratization ×                                  −0.12∗       −0.10∗
  Duration of most recent authoritarian regime       (0.06)       (0.04)
Redemocratization ×                                   0.07∗        0.03
  Duration of most recent democratic regime          (0.04)       (0.03)
Economic growth                                       1.9e−5∗      2.1e−5∗
                                                     (2.1e−6)     (2.3e−6)
Riots                                                 0.03∗        0.04∗
                                                     (0.01)       (0.01)
Other democratic countries in the region             −1.43∗       −0.76∗
                                                     (0.24)       (0.21)
Intercept                                            −1.67∗       −2.02∗
                                                     (0.22)       (0.21)
---------------------------------------------------------------------------
N (zeros)                                       3,955 (3,514)  3,955 (3,514)
Vuong (uncorrected)                                   3.22∗
Vuong (AIC)                                           1.77∗
Vuong (BIC)                                          −2.80∗
---------------------------------------------------------------------------
Note: Cell entries report coefficient estimates with standard errors in parentheses for Clare’s (2007) original ZINB model estimates and a replication using NB. Positive Vuong test statistic values indicate a selection of the ZINB model, and negative values indicate a selection of the NB model. ∗ p < 0.05 (one-tailed).

However, note that the coefficients on each of these interaction terms decline in magnitude in the NB model, with redemocratization × duration of the most recent democratic regime dropping by more than half the value of the ZINB estimate. Furthermore, this latter coefficient is no longer statistically significant at the 0.05 level in the NB model. We assess the substantive implications of this in figure 5. Both graphs plot the expected number of disputes initiated by redemocratizing regimes on the y axis at the minimum and maximum values of duration of the most recent democratic regime (0 and 41 years, respectively). The third bar plots the difference between these two estimates. Panel (a) gives results from the ZINB model, and panel (b) shows NB results. Note that the difference is large (≈ 6 disputes) if ZINB is used but small (< 1 dispute) with the better-fitting NB. Thus at least half of the support for the original theory depends on using the ZINB model instead of the standard NB. This is problematic in light of the fact that the BIC-corrected Vuong test clearly supports the rejection of the ZINB model.

[Figure 5 omitted: two bar charts, (a) ZINB and (b) NB, each plotting the expected number of disputes at 0 years, at 41 years, and the difference between the two estimates.]

Figure 5. Change in the expected number of disputes initiated by redemocratizing states from the minimum (0 years) to maximum (41 years) observed value of duration of the most recent democratic regime. The difference is large (≈ 6 disputes) if the ZINB model is used but small (< 1 dispute) with the better-fitting standard NB.
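For readers who wish to approximate the NB quantities in panel (b), the following sketch (our reconstruction, not the authors' code) holds a redemocratizing state's authoritarian-duration interaction at zero while moving the democratic-duration interaction between its observed extremes, with all other covariates at their means:

    quietly nbreg init_count stable_dem1 stable_aut1 redem1 redem1_pautdur1 ///
        redem1_pdemdur1 growth prop_demsregion1 riots1
    margins, at(redem1 = 1 redem1_pautdur1 = 0 redem1_pdemdur1 = (0 41)) atmeans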


5 Conclusions

In formulating and evaluating statistical models of event-count processes, we encounter an inherent tension between developing a parsimonious summary of the data and accounting for meaningful empirical peculiarities. Because these processes are often defined on events that are relatively rare, such as dispute initiation, scholars regularly confront datasets with many zeros on the dependent variable. An important question stemming from this characteristic centers on whether some of these zeros arise because of an additional generative mechanism. If so, the proper inferential method is to fit a count model with a zero-inflation equation to account for the second process. This added complexity comes with a risk; statistically, specifying an inflation equation when one is not needed reduces the efficiency of the estimator and convolutes interpretation. Perhaps worse, theoretically, the inclusion of an additional equation in the model focuses researchers’ efforts on a potentially erroneous account of the process under study.

A common response to this tension is the use of Vuong’s (1989) nonnested model-selection procedure, which provides a test statistic that can be used to compare standard and zero-inflated count models fit to the same data. We show there are problems with the current implementation of this test in applied research. In particular, the Vuong test executed in Stata’s zip and zinb commands does not implement any correction for the added parameters estimated for the inflation equation, which leads to a test that favors the zero-inflated models even when there is no zero inflation in the generative process. We solve this problem with the zipcv and zinbcv commands. These commands include all the functionality of the zip and zinb commands that are currently in Stata but report the uncorrected, AIC-corrected, and BIC-corrected Vuong test statistics.

Finally, in a replication analysis, we apply the findings from the simulation studies to real data. Results show that the process of selecting between competing count models can have implications for substantive conclusions from results on international political processes. In the presence of nontrivial model dependence shown in the example, researchers need statistically sound criteria on which to make decisions. The suggestions given here provide such criteria for scholars in selecting between standard and zero-inflated count models.

6 References

Achen, C. H. 2005. Let’s put garbage-can regressions and garbage-can probits where they belong. Conflict Management and Peace Science 22: 327–339.

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716–723.

Anthony, D. 2005. Cooperation in microcredit borrowing groups: Identity, sanctions, and reciprocity in the production of collective goods. American Sociological Review 70: 496–515.

Cavrini, G., S. Broccoli, A. Puccini, and M. Zoli. 2012. EQ-5D as a predictor of mortality and hospitalization in elderly people. Quality of Life Research 21: 269–280.


Clare, J. 2007. Democratization and international conflict. Journal of Peace Research 44: 259–276.

Kisangani, E. F., and J. Pickering. 2007. Diverting with benevolent military force: Reducing risks and rising above strategic behavior. International Studies Quarterly 51: 277–299.

Konishi, S., and G. Kitagawa. 1996. Generalised information criteria in model selection. Biometrika 83: 875–890.

Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22: 79–86.

Lambert, D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.

Lee, Y.-G., J.-D. Lee, Y.-I. Song, and S.-J. Lee. 2007. An in-depth empirical analysis of patent citation counts using zero-inflated count data model: The case of KIST. Scientometrics 70: 27–39.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.

Mondak, J. J., and M. S. Sanders. 2005. The complexity of tolerance and intolerance judgments: A response to Gibson. Political Behavior 27: 325–337.

Mullahy, J. 1986. Specification and testing of some modified count data models. Journal of Econometrics 33: 341–365.

Neumayer, E., and T. Plumper. 2011. Foreign terror on Americans. Journal of Peace Research 48: 3–17.

Nielsen, S. E., G. McDermid, G. B. Stenhouse, and M. S. Boyce. 2010. Dynamic wildlife habitat models: Seasonal foods and mortality risk predict occupancy-abundance and habitat selection in grizzly bears. Biological Conservation 143: 1623–1634.

Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6: 461–464.

Smyth, P. 2000. Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 10: 63–72.

Tiwari, A., J. A. VanLeeuwen, I. R. Dohoo, G. P. Keefe, J. P. Haddad, H. M. Scott, and T. Whiting. 2009. Risk factors associated with Mycobacterium avium subspecies paratuberculosis seropositivity in Canadian dairy cows and herds. Preventive Veterinary Medicine 88: 32–41.

Vogus, T. J., and T. M. Welbourne. 2003. Structuring for high reliability: HR practices and mindful processes in reliability-seeking organizations. Journal of Organizational Behavior 24: 877–903.


Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57: 307–333.

Zandersen, M., M. Termansen, and F. S. Jensen. 2007. Testing benefits transfer of forest recreation values over a twenty-year time horizon. Land Economics 83: 412–440.

Zhang, X., Y. Lei, D. Cai, and F. Liu. 2012. Predicting tree recruitment with negative binomial mixture models. Forest Ecology and Management 270: 209–215.

About the authors

Bruce A. Desmarais is an assistant professor in the Department of Political Science at the University of Massachusetts–Amherst and a core faculty member in the Computational Social Science Initiative at the University of Massachusetts–Amherst.

Jeffrey J. Harden is an assistant professor in the Department of Political Science at the University of Colorado–Boulder.

A Appendix: Simulations using medpar.dta

Here we replicate our Monte Carlo simulation using Stata’s example dataset for the ztnb (zero-truncated NB) command, medpar.dta, to parameterize the simulation study. The settings in this replication of the simulation study differ in a few ways from the original version. First, the sample sizes differ slightly. Second, we use fewer variables in the count and inflation components such that the sets of variables in each equation are disjoint. Third, the dataset used to select parameter values has relatively fewer zeros.

Approximately 8% of the 1,495 observations in medpar.dta have a value of 1 on the dependent variable. We generate zeros by subtracting 1 from the dependent variable before proceeding with the simulation study. To define parameters in the simulation study, we first fit ZINB and ZIP models with los − 1 as the dependent variable and standardized versions of hmo, age, and type1 as independent variables in the count component and died, white, and age80 as independent variables in the inflation component. The linear predictors in the count components of the models from which the outcome is simulated in the Poisson- and NB-based simulations, respectively, are

$$x'\beta = 0.249 - 0.039\,\text{hmo} - 0.016\,\text{age} - 0.168\,\text{type1}$$

and

$$x'\beta = 2.22 - 0.035\,\text{hmo} - 0.0013\,\text{age} - 0.164\,\text{type1}$$

The inflation components in the following equations contain a number of terms equal to the number of covariates included in the respective condition in the simulation study. The formulas are again given for the Poisson and NB simulations, respectively.


$$z'\gamma = -2.94 + 1.12\,\text{died} + 0.190\,\text{white} - 0.096\,\text{age80}$$

and

$$z'\gamma = -15.5 + 10.0\,\text{died} + 0.315\,\text{white} - 0.101\,\text{age80}$$

In this version of the simulation study, we examine sample sizes of 300, 700, and 2,000. In terms of the number of parameters, we vary the inflation component in two ways. First, we run the simulation without zero inflation in the DGP and study the performance of the three test variants when one, two, and three covariates are incorrectly included in the inflation component. We then run a second variant in which there is zero inflation and examine the performance of the tests when one, two, and three covariates are correctly included in the inflation component. Each of the 36 conditions (3 sample sizes × 3 covariate specifications × 2 inflation/no inflation × 2 distributions) is run for 1,000 iterations.8

Figures 6–7 present the results of the simulations with medpar.dta. As in the simulations from section 3, when there is no zero inflation in the DGP, the BIC-corrected statistic performs the best, and the uncorrected statistic performs the worst. The BIC-corrected statistic is statistically significantly negative (p < 0.05, one tailed)—in favor of the single-equation model—in 95–100% of the iterations. In contrast, the uncorrected Vuong statistic is positive in approximately 80% or more of the iterations. Unlike the simulations in section 3, under this design, the uncorrected test is rarely statistically significant in favor of the zero-inflated model. However, just as in the previous simulations, not once did the uncorrected test result in a statistically significant rejection of the zero-inflated model. The AIC-corrected test performs moderately better in the no-inflation condition. In approximately 45–65% of the iterations, the zero-inflated model is statistically significantly rejected, and the single-equation model is virtually never rejected. However, as we found before, the degree to which the AIC favors the single-equation count model decreases with the number of covariates incorrectly included in the inflation component.

8. We performed all the computations presented in this section in Stata/SE 11.1 and Stata/IC 12.1.


[Figure 6 omitted: nine panels, one per combination of sample size (300; 700; 2,000) and number of inflation covariates (1; 2; 3). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 6. Monte Carlo results with Poisson simulations (medpar.dta). The plots depict the distribution of significance test results based on the Vuong test comparing Poisson to ZIP models with the respective correction across varying sample sizes and numbers of covariates incorrectly included in the inflation component.


[Figure 7 omitted: nine panels, one per combination of sample size (300; 700; 2,000) and number of inflation covariates (1; 2; 3). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 7. Monte Carlo results with NB simulations (medpar.dta). The plots depict the distribution of significance test results based on the Vuong test comparing NB to ZINB models with the respective correction across varying sample sizes and numbers of covariates incorrectly included in the inflation component.

Figures 8 and 9 present results in which zero inflation is a component of the generative process. In this condition, the uncorrected Vuong statistic performs the best in selecting the correctly specified model—nearly always statistically significantly rejecting the single-equation model. The AIC-corrected test performs fairly well in the small-sample (N = 300) conditions, significantly favoring the zero-inflated model in 60–80% of the iterations, but virtually always statistically significantly selects the zero-inflation model in the larger sample-size conditions. The performance of the BIC-corrected statistic, which performs the worst among the three statistics when zero inflation is the correct model, varies substantially across the sample-size and covariate conditions. As we found above, the tendency for the BIC-corrected statistic to statistically significantly reject the single-equation model is inversely related to the number of covariates correctly included in the zero-inflation component and directly related to the sample size.

[Figure 8 omitted: nine panels, one per combination of sample size (300; 700; 2,000) and number of inflation covariates (1; 2; 3). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 8. Monte Carlo results with ZIP simulations (medpar.dta). The plots depict the distribution of significance test results based on the Vuong test comparing Poisson to ZIP models with the respective correction across varying sample sizes and numbers of covariates correctly included in the inflation component.


[Figure 9 omitted: nine panels, one per combination of sample size (300; 700; 2,000) and number of inflation covariates (1; 2; 3). Legend: z ≤ −1.65; −1.65 < z ≤ 0; 0 < z ≤ 1.65; 1.65 < z.]

Figure 9. Monte Carlo results with ZINB simulations (medpar.dta). The plots depict the distribution of significance test results based on the Vuong test comparing NB to ZINB models with the respective correction across varying sample sizes and numbers of covariates correctly included in the inflation component.


The Stata Journal (2013) 13, Number 4, pp. 836–861

Parametric inference using structural break tests

Zachary L. Flynn
University of Wisconsin–Madison
Madison, WI
[email protected]

Leandro M. Magnusson
University of Western Australia
Crawley, Australia
[email protected]

Abstract. We present methods for testing hypotheses and estimating confidence sets for structural parameters of economic models in the presence of instabilities and breaks of unknown form. These methods constructively explore information generated by changes in the data-generating process to improve the inference of parameters that remain stable over time. The proposed methods are suitable for models cast in the generalized method of moments framework, which makes their application wide. Moreover, they are robust to the presence of weak instruments. The genstest command in Stata implements these methods to conduct hypothesis tests and to estimate confidence sets.

Keywords: st0320, genstest, condivreg, ivregress, ivreg2, gmm, qll, generalized method of moments, structural change, weak instruments, hypothesis testing, confidence sets

1 Introduction

We present methods for the inference of parameters in economic models in the presence of instabilities and breaks of unknown form. The main idea behind these methods is to constructively explore exogenous changes in the data-generating process to improve the inference about parameters that are assumed to be stable over time. For example, exogenous changes in the monetary policy induced by the Central Bank affect interest rates. However, if we are interested in parameters that characterize a production function technology, these parameters are not affected by monetary policy. These exogenous variations in the interest rate can be used to improve the inference of such technological parameters. The proposed methods are suitable for models cast in the generalized method of moments (GMM) framework, which makes their application wide. Moreover, they are robust to the presence of weak instruments; that is, we do not assume that the structural parameters are consistently estimated.

Estimation of economic models using GMM departs from a set of moment restrictions, usually derived from the economic theory. Two underlying assumptions of a GMM estimator are that the parameters are stable over time and that they can be consistently estimated using the empirical moment restrictions. These assumptions can be very strong in some economic models. Stock and Watson (1996) and Piehl et al. (2003) report evidence of parameter instability in, respectively, macroeconomic and microeconomic models (for models with identification failure, see Stock, Wright, and Yogo

© 2013 StataCorp LP st0320


[2002], Kleibergen and Mavroeidis [2009], and references therein). Therefore, we cannot rule out a priori the presence of instabilities and weak instruments.

Structural break tests are usually treated as diagnostic tests, conducted after estimation. These tests compare the null hypothesis of a stable model (constant parameter) against the alternative of an unstable model (time-varying parameter). Examples include the ones proposed by Andrews (1993) and Elliott and Müller (2006). Moreover, they assume that the parameters are consistently estimated under the null hypothesis. Instead, we perform hypothesis testing of the parameter of interest and then estimate confidence intervals for this parameter by inverting these tests, without estimating the parameter in the first place.

The proposed tests are a combination of two (asymptotically) independent statistics. One of them, the S test (see Stock and Wright [2000]), tests the validity of the moment condition. The second statistic tests the stability of such moments. We call these tests the "generalized S" (gen-S) tests because they are an extension of the S test for stable models. An important feature of the tests is that they are identification-robust tests; that is, they have the correct size even in the presence of weak instruments.

The genstest command performs these new methods. For a given structural parameter of interest θ, the tests test the simple hypothesis H0 : θ = θ0 against the alternative H1 : θ ≠ θ0, where θ0 is a hypothesized value of θ. The genstest command also generates 1 − α confidence intervals and sets by inverting the new statistical tests. There are no restrictions on how many parameters may be tested by genstest. However, genstest calculates confidence sets only up to two parameters because these are the most straightforward to graph (a simple while loop can generate a confidence set for any number of parameters).

In section 2, we show how changes in the first stage improve the inference of structural parameters in a linear instrumental variable (IV) model and describe the proposed methods. In section 3, we present the general algorithm for implementing the tests. In section 4, we discuss the syntax and options of the postestimation command genstest. Finally, in section 5, we provide examples of its use for performing hypothesis tests and constructing confidence intervals and sets.

2 Structural inference under instability

2.1 A stylized IV model

Consider the following simple limited-information IV model

y = Yθ + u
Y = ZΠ + v    (1)


where y is a T × 1 vector, Y is a T × kY matrix of explanatory variables, Z is a T × kZ matrix of instruments, and u and v are residuals. We are interested in testing the following assumption about the value of the structural parameter θ:

Hθ0 : θ = θ0 against Hθ1 : θ ≠ θ0

Postmultiplying the second equation in (1) by θ0 and subtracting from the first equation, we derive

y − Yθ0 = Zδ + e    (2)

where δ = Π(θ − θ0) and e = u + v(θ − θ0). Therefore, we can test the null hypothesis Hθ0 : θ = θ0 indirectly by testing the assumption

Hδ0 : δ = 0 against Hδ1 : δ ≠ 0

on the auxiliary (2). The principle of testing this null hypothesis by testing violations of the moment restrictions E{(1/T)∑_{t=1}^{T} Z′t(yt − Ytθ)} = 0 is from Anderson and Rubin (1949); the auxiliary regression representation in (2) is attributed to Chernozhukov and Hansen (2008). From now on, we will refer to this test simply as the S test, the extended version of the Anderson and Rubin (1949) test for GMM models proposed by Stock and Wright (2000).

The S test has the correct size even when the structural parameter θ is not identified; see Stock and Wright (2000). However, when Π ≈ 0, the S test will not reject Hδ0 when Hδ1 is true.1 If this is the case, confidence sets derived by inverting the S test are unbounded, giving no information about the location of θ.
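In this stylized model, the S test can be computed with standard commands via the auxiliary regression (2). The following is a minimal sketch, not the genstest implementation; the variable names y, Y, and z1–z4 and the value 0.5 are purely illustrative:

. local theta0 = 0.5                    // hypothesized value of theta
. generate double e0 = y - `theta0'*Y   // residual under H0: theta = theta0
. regress e0 z1 z2 z3 z4, robust        // auxiliary regression (2)
. testparm z1 z2 z3 z4                  // joint test of H0: delta = 0

A large joint statistic signals a violation of Hδ0 and hence evidence against Hθ0 : θ = θ0.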

In the representation (1), Π captures the strength of the instruments Zt and is assumed to be the unique solution of

E{Z′t(Yt − ZtΠ)} = 0 for all 0 < t ≤ T

Now, similar to Angrist and Krueger (1995), we will assume that the strength of the instruments might differ in two subsamples. For simplicity, order the observations such that

E{Z′t(Yt − ZtΠ1)} = 0 for all 0 < t ≤ tb
E{Z′t(Yt − ZtΠ2)} = 0 for all tb < t ≤ T

Partition Z = (Z′1 : Z′2)′, where Z1 and Z2 are tb × kZ and (T − tb) × kZ submatrices of Z containing observations of the first and second subsamples, respectively. Define Z̄1 = (Z′1 : 0′)′ and Z̄2 = (0′ : Z′2)′ so that Z = Z̄1 + Z̄2. The first-stage equation is rewritten as

Y = Z̄1Π1 + Z̄2Π2 + v

and the auxiliary regression becomes

y − Yθ0 = Z̄1δ1 + Z̄2δ2 + e    (3)

1. When Π = 0, the estimated value of δ will be close to 0, independent of whether ‖θ − θ0‖ > 0.


We restate the null hypothesis as

Hδb0 : δ1 = δ2 = 0 against Hδb1 : δ1 ≠ 0 or δ2 ≠ 0

Therefore, a single change in Π doubles the number of instruments for testing θ. Moreover, the S test applied to the auxiliary regression (3) would have more power to reject Hδb0 than the S test applied to (2) to reject Hδ0.
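As a hypothetical illustration with a known break date, the padded instruments Z̄1 and Z̄2 can be built by zeroing out each instrument outside its subsample and testing all coefficients jointly. In a do-file, reusing e0 and theta0 from the sketch above (the time index t and the break date in the local macro tb are illustrative):

generate double e0 = y - `theta0'*Y
forvalues j = 1/4 {
    generate double z`j'_1 = z`j' * (t <= `tb')   // first-subsample instrument
    generate double z`j'_2 = z`j' * (t >  `tb')   // second-subsample instrument
}
regress e0 z?_1 z?_2, robust
testparm z?_1 z?_2          // joint test of Hδb0: delta1 = delta2 = 0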

The above linear IV model shows that changes in the first-stage reduced-form parameter can improve the inference about the second-stage structural parameter θ, which remains constant over time or cross-section units. However, in practice, we may not know when the change occurs, the magnitude of the change, or the nature of the instability. In the following subsection, we present tests that do not require such knowledge.

2.2 The generalized S tests

Assume that from an economic model, we derived a moment condition of the form

E{Z′t u(Yt; θ, γ)} = 0 for all 0 < t ≤ T    (4)

where u(·; ·) is a one-dimensional real function indexed by the p-dimensional structural parameter vector θ and by the q-dimensional nuisance parameter vector γ, which is always treated as stable under the null hypothesis. Yt is a vector of random variables, and Zt is the 1 × kZ dimensional row vector of instruments. For simplicity, we denote u(Yt; θ, γ) as ut(θ, γ). We can consider ut(θ, γ) as the unobserved error term of a regression such that E{u(Yt; θ, γ)|Zt} = 0. For instance, in the previous section, ut(θ, γ) = yt − Ytθ. Further examples for cross-section models and time-series models are shown in section 5.

The moment restriction in (4) can be restated in terms of full-sample and stability restrictions as, respectively,

E{(1/T)∑_{t=1}^{T} Z′t ut(θ, γ)} = 0 and E{Z′t ut(θ, γ)} is stable over t    (5)

Usual GMM methods for estimation and inference use only the first kZ full-sample restrictions E{(1/T)∑_{t=1}^{T} Z′t ut(θ, γ)} = 0. Magnusson and Mavroeidis (2010a) propose tests for the vector of structural parameters θ that explore both restrictions. Testing the assumption Hθ0 : θ = θ0 against Hθ1 : θ ≠ θ0 can be indirectly conducted by testing both restrictions in (5), evaluated at θ0, against the alternative that at least one of these conditions is violated. The tests have the following general form:

gen-S(θ0) = gen-S̄(θ0; c) + {c̄/(1 + c̄)} S(θ0)    (6)

The first component of gen-S, the gen-S̄, tests the stability restrictions, while its second component, the S test, detects violations of the full-sample moment restrictions. The nonnegative scalars c and c̄ determine the weights that the investigator attaches to violations of the stability and full-sample restrictions under H1, respectively. In this framework, the S test is considered a test that sets no weight on the stability restrictions; that is, c = 0. Several possibilities exist for choosing gen-S̄. Here we present four such statistics, which are described in the next section. The proposed stability tests are closely related to the quasi-local-level (qLL) test derived in Elliott and Müller (2006) and to the average (ave-), exponential (exp-), and supremum (sup-) Wald tests derived in Andrews (1993) and Sowell (1996). Additionally, all the proposed stability tests are asymptotically independent from the S test.

The four gen-S tests implemented by the genstest command are denoted qLL-S, ave-S, exp-S, and sup-S. In deriving the four tests, we set c = c̄; that is, violations of the full-sample and stability moment restrictions are weighted equally. In particular, in the case of qLL-S, c = c̄ = 10. All the suggested tests have nontrivial power when instabilities are present under the alternative hypothesis. However, according to the weighted average power criteria, the qLL-S dominates the other tests if the instability of the moments follows a martingale difference sequence under H1, and the ave-S and exp-S dominate the other tests if a single break is assumed at an unknown date.2 Further details about the optimality properties of these tests are in Magnusson and Mavroeidis (2010a).

We can use the gen-S tests for estimating confidence intervals and sets. The 1 − α confidence interval (set) consists of the points θ̃ in the parameter space Θ for which the test of H0 : θ = θ̃ is not rejected at the α significance level. Once a grid of points in the parameter space is defined, we proceed by computing the tests at these points and selecting them accordingly.

The gen-S tests are asymptotically pivotal under H0. Although their limit distributions are not standard, critical values can be simulated. Included with the genstest command are critical-value tables up to the case where kZ = 20 for all suggested tests.

3 The algorithm for implementing the generalized S tests

Next we show the algorithms for computing the two components of the gen-S test in (6), starting with the S test.

2. Set c = c̄. The ave-S asymptotically power dominates the remaining tests as the common weight tends to 0; the exp-S dominates as it tends to +∞.


3.1 S test algorithm

The S test is obtained from the following steps:

1. Estimate the nuisance parameter vector γ under the null hypothesis Hθ0 : θ = θ0 using the following objective function,

   γ̂(θ0) ≡ argmin_γ u(θ0, γ)′ Z {Φ̂(θ0)}⁻¹ Z′ u(θ0, γ)    (7)

   where u(θ0, γ) is a T × 1 vector whose typical tth element is ut(θ0, γ), Z is a T × kZ matrix of instruments, and Φ̂(θ0) is an estimator of Φ(θ0, γ), the variance of Z′u(θ0, γ).

2. Substitute u(θ0, γ) by u{θ0, γ̂(θ0)} in the objective function in (7).

3. The S test for testing Hθ0 : θ = θ0 is

   S(θ0) = u{θ0, γ̂(θ0)}′ Z {Φ̂(θ0)}⁻¹ Z′ u{θ0, γ̂(θ0)}    (8)

   Under the null hypothesis, S(θ0) →d χ2(kZ−q), where χ2(kZ−q) is a chi-squared distribution with (kZ − q) degrees of freedom.
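As a minimal illustration, the quadratic form in (8) is easy to compute in Mata once the residual vector under the null is at hand. The sketch below is not the genstest implementation: it assumes homoskedastic errors (so Φ̂ = σ̂² Z′Z) and that the nuisance parameters have already been concentrated out of u0.

mata:
// S statistic of (8) for a T x 1 residual vector u0 and a T x kZ matrix Z,
// assuming homoskedasticity; compare with chi2(kZ - q) critical values
real scalar S_stat(real colvector u0, real matrix Z)
{
    real colvector g
    real matrix    Phi
    g   = Z'u0                         // empirical moment Z'u(theta0)
    Phi = (u0'u0 / rows(u0)) * (Z'Z)   // variance estimate under homoskedasticity
    return(g' * invsym(Phi) * g)
}
end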

The S test in (8) has two differences from the one proposed by Chernozhukov and Hansen (2008). First, the proposed S test encompasses models in which the residual term is a nonlinear function of the parameters (see examples 2 and 3 in section 5). Second, if the residual vector u(θ0, γ) is a linear function of the parameters, then we concentrate the nuisance γ using an oblique projection matrix instead of a linear projection matrix.3

Next we turn to gen-S̄, the stability part of the gen-S test.

3.2 The stability tests

There are two classes of tests that detect instabilities of the moments under the alternative assumption. The first class corresponds to the qLL-S̄, a test calibrated to detect small but persistent changes in the moments. In the second class, the tests are derived assuming that there is only a single break in the moment at an unknown date. They are the ave-S̄, the exp-S̄, and the sup-S̄.

3. When u(θ0, γ) = y − Yθ0 − Xγ, substituting γ by its ordinary least-squares estimate (X′X)⁻¹X′(y − Yθ0) is the same as premultiplying u(θ0, γ) by MX = I − X(X′X)⁻¹X′, the matrix that projects onto the orthogonal space spanned by the columns of X. Our method is equivalent to premultiplying u(θ0, γ) by the oblique projection matrix MΨX = I − X(X′ΨX)⁻¹X′Ψ, where Ψ = Z{Φ̂(θ0)}⁻¹Z′.


Persistent time-variation case

In the algorithm for computing the qLL-S test, we define the following T × T matrices and T × 1 vector: the first-difference operator D, with 1s on the main diagonal, −1s on the first subdiagonal, and 0s elsewhere; the cumulative product operator R, lower triangular with (i, j) element r^(i−j) for i ≥ j; and the vector r = (r, r², . . . , r^T)′, where r = 1 − (10/T). Let U(θ0) be the following T × k matrix,

U(θ0) = {u(θ0), . . . , u(θ0)}

where u(θ0) = u{θ0, γ̂(θ0)}. The qLL-S̄ statistic, which is the stability part of the qLL-S test, is obtained after taking the following steps:

1. First, compute the T × k matrix V(θ0) = {U(θ0) ⊙ Z} Φ̂(θ0)^(−1/2), where ⊙ denotes the direct product, and Φ̂(θ0)^(−1/2) is the symmetric square-root matrix of Φ̂(θ0)^(−1). Second, compute H(θ0) = R{DV(θ0)}, a T × k matrix.

2. Estimate the T × k matrix ŵ, the ordinary least-squares residuals of the regression

   H(θ0) = rB + w

   where B is a 1 × k row vector of parameters, and compute TSSRw = ∑_{i=1}^{k} ∑_{t=1}^{T} (ŵi,t)², the total sum of squared residuals of the above regression.

3. Compute the T × k matrix ε̂, the ordinary least-squares residuals of the regression

   V(θ0) = ιT C + ε

   where ιT is a T × 1 vector of 1s, and C is a 1 × k row vector of parameters. Calculate TSSRε = ∑_{i=1}^{k} ∑_{t=1}^{T} (ε̂i,t)², the total sum of squared residuals of the above regression.

4. The qLL-S̄ statistic under H0 is

   qLL-S̄(θ0) = TSSRε − r × TSSRw

   The asymptotic distribution of this statistic is a functional of a k-dimensional Ornstein–Uhlenbeck process.

5. The qLL-S test is defined as

   qLL-S(θ0) = qLL-S̄(θ0) + (10/11) S(θ0)
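The recursions implied by D and R make the statistic cheap to compute without forming any T × T matrix. The following Mata sketch is a direct transcription of steps 1–4 above (not the genstest code); it assumes u0 already has the nuisance parameters concentrated out and that Phi holds Φ̂(θ0):

mata:
real scalar qLL_Sbar(real colvector u0, real matrix Z, real matrix Phi)
{
    real scalar    T, r, t
    real matrix    V, H, w, eps
    real colvector rv

    T = rows(Z)
    r = 1 - 10/T
    V = (u0 :* Z) * matpowersym(invsym(Phi), 0.5)   // step 1: (U o Z) Phi^{-1/2}

    H = V                                           // H = R {D V}, computed
    for (t = 2; t <= T; t++) {                      // recursively
        H[t, .] = r :* H[t - 1, .] + V[t, .] - V[t - 1, .]
    }

    rv  = r :^ (1::T)                               // the vector (r, r^2, ..., r^T)'
    w   = H - rv * ((rv'H) / (rv'rv))               // step 2: residuals of H on r
    eps = V :- mean(V)                              // step 3: residuals of V on a constant
    return(sum(eps :^ 2) - r * sum(w :^ 2))         // step 4: TSSR_eps - r TSSR_w
}
end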


Single break, unknown break date

We take the following steps in computing the stability part of the ave-, exp-, and sup-S tests:

1. Specify an interval where a break in the moments is located. This interval must be defined as [tl, tu], where tl = [sT], tu = [(1 − s)T], s ∈ (0, 0.5), and [m] denotes the integer part of m.

2. For a possible break date j ∈ [tl, tu], let T1 = j and T2 = T − j. Partition Z as Z = (Z′1 : Z′2)′, where Z1 and Z2 are T1 × k and T2 × k submatrices of Z containing, respectively, observations before and after j. Similarly, partition u(θ0, γ) as u(θ0, γ) = {u1(θ0, γ)′ : u2(θ0, γ)′}′.

3. Estimate the nuisance parameter γ under the null hypothesis H0 : θ = θ0. Similar to the S test, this step consists of solving the following minimization problem,

   γ̂j(θ0) ≡ argmin_γ u1(θ0, γ)′ Z1 {Φ̂1(θ0)}⁻¹ Z′1 u1(θ0, γ) + u2(θ0, γ)′ Z2 {Φ̂2(θ0)}⁻¹ Z′2 u2(θ0, γ)    (9)

   where Φ̂1(θ0) and Φ̂2(θ0) are, respectively, estimators of Φ1(θ0, γ) and Φ2(θ0, γ), the variances of Z′1u1(θ0, γ) and Z′2u2(θ0, γ) under H0.

4. Substitute u1(θ0, γ) and u2(θ0, γ) with u1{θ0, γ̂j(θ0)} and u2{θ0, γ̂j(θ0)}, respectively, in the objective function in (9).

5. Compute the following modified S test assuming a break at date j:

   S(θ0; j) = u1{θ0, γ̂j(θ0)}′ Z1 {Φ̂1(θ0)}⁻¹ Z′1 u1{θ0, γ̂j(θ0)} + u2{θ0, γ̂j(θ0)}′ Z2 {Φ̂2(θ0)}⁻¹ Z′2 u2{θ0, γ̂j(θ0)}    (10)

   Define the following statistic:

   S̄(θ0; j) = S(θ0; j) − S(θ0)

6. Repeat steps 2 through 5 for each possible break date in [tl, tu].

7. The ave-S̄, exp-S̄, and sup-S̄ statistics are defined as

   ave-S̄(θ0) = {1/d(tl, tu)} ∑_{j=tl}^{tu} S̄(θ0; j)

   exp-S̄(θ0) = 2 log [{1/d(tl, tu)} ∑_{j=tl}^{tu} exp{(1/2) S̄(θ0; j)}]

   sup-S̄(θ0) = sup_{j∈[tl,tu]} S̄(θ0; j)


where d(tl, tu) = tu − tl + 1. One can show that the asymptotic distributions of the above tests are functionals of standard k-dimensional Brownian bridge processes on (0, 1).

8. The ave-, exp-, and sup-S tests are defined, respectively, as

   ave-S(θ0) = S(θ0) + ave-S̄(θ0)
   exp-S(θ0) = S(θ0) + exp-S̄(θ0)
   sup-S(θ0) = S(θ0) + sup-S̄(θ0)
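Once the sequence S̄(θ0; j) has been collected over the candidate break dates, steps 7 and 8 reduce to a few lines. A Mata sketch, where the column vector dS holds S̄(θ0; j) for j = tl, . . . , tu and S0 holds S(θ0) (names are illustrative):

mata:
// ave-, exp-, and sup-S tests from the break-date statistics in dS
real rowvector genS_breaks(real colvector dS, real scalar S0)
{
    real scalar aveSbar, expSbar, supSbar
    aveSbar = mean(dS)                        // ave-Sbar(theta0)
    expSbar = 2 * ln(mean(exp(0.5 :* dS)))    // exp-Sbar(theta0)
    supSbar = max(dS)                         // sup-Sbar(theta0)
    return((S0 + aveSbar, S0 + expSbar, S0 + supSbar))
}
end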

3.3 The estimation of nuisance parameters and variance–covariance matrix

The estimators γ̂(θ0) and γ̂j(θ0) in equations (7) and (9) can be the two-step or iterative GMM estimators. The first-step estimator for computing γ̂(θ0), necessary for estimating Φ̂, solves

min_γ u(θ0, γ)′ Z W Z′ u(θ0, γ)    (11)

where W is a square matrix (for example, the identity matrix or (Z′Z)⁻¹). Similarly, the first step for computing γ̂j(θ0) solves

min_γ u1(θ0, γ)′ Z1 W1 Z′1 u1(θ0, γ) + u2(θ0, γ)′ Z2 W2 Z′2 u2(θ0, γ)

where W1 and W2 are conformable quadratic matrices. For theoretical reasons, γ̂(θ0) and γ̂j(θ0) cannot be the first-step estimators of γ; see Stock and Wright (2000) and Caner (2007). Under the null assumption and under sequences of local alternatives, γ̂(θ0) and γ̂j(θ0) have the same probability limits.4 Hence, we can replace γ̂j(θ0) with γ̂(θ0) for computing the S(θ0; j) statistic in (10).

The estimation of the variance–covariance matrix Φ depends on the assumption about the asymptotic variance of T^(−1/2) ∑_{t=1}^{T} Z′t ut(θ, γ) evaluated at {θ0, γ̂(θ0)}. The genstest command is very flexible about the structure of the variance matrix, allowing for homoskedastic residuals; heteroskedastic residuals (including adjustment factors hc1, hc2, hc3, and hc4; see Davidson and MacKinnon [2003]); cluster residuals; and heteroskedastic autocorrelated residuals (including options for the kernel and number of lags when computing the autocorrelation terms). More details are in the following section.

The general form of the estimator Φ̂(θ0) in (7) used for computing the S and qLL-S tests is Z′Ω̂(θ0)Z, where Ω̂(θ0) is a T × T matrix whose elements are a function of the vector of the estimated residuals u{θ0, γ̂(θ0)}.

In the case of ave-, exp-, and sup-S, the estimators Φ̂1(θ0) and Φ̂2(θ0) in (10) can be represented by Z′i Ω̂i(θ0) Zi for i = 1, 2, where Ω̂i(θ0) is a Ti × Ti matrix whose elements are a function of ui{θ0, γ̂j(θ0)}. The asymptotic variances of T1^(−1/2) Z′1 u1{θ0, γ̂j(θ0)} and T2^(−1/2) Z′2 u2{θ0, γ̂j(θ0)} are the same as the asymptotic variance of T^(−1/2) Z′ u{θ0, γ̂(θ0)} (see Hall [2005]), so we can substitute Φ̂1(θ0) and Φ̂2(θ0) with Φ̂(θ0) in (10).

4. However, they have different limits under a fixed sequence of alternatives; see Andrews (1993).

4 The genstest command

The genstest command implements the above four gen-S tests in Stata and Mata. It may be invoked as a stand-alone command or as a postestimation command for gmm.

When genstest is used as a postestimation command, it will only use the gmm options that genstest implements (they are listed and described in this section). Any additional gmm options will be discarded. Furthermore, genstest performs the tests on only one residual expression, so the gmm estimation command should conform to that limitation.

Additionally, the command can estimate confidence intervals and sets (up to two parameters) based on these tests.5 These intervals (sets) are generated using a grid search over the points of the parameter space, retaining those that do not reject the null hypothesis of the test.
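For intuition, the grid inversion can be mimicked by hand with nested loops over candidate null values, keeping the points whose p-value exceeds α. A hypothetical do-file sketch for two parameters (the variable, instrument, and parameter names are illustrative; the stored result r(pqllS) is described in section 4.3):

forvalues i = 0/20 {
    forvalues j = 0/20 {
        local a0 = `i'/20
        local b0 = `j'/20
        quietly genstest (y - <a>*x1 - <b>*x2 - {g0}), ///
            inst(z1 z2 z3) null(`a0' `b0')
        if r(pqllS) >= 0.05 display "(`a0', `b0') is in the 95% confidence set"
    }
}

Adding a third loop extends the search to a three-parameter confidence set, the approach suggested in footnote 5.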

The genstest command requires at least Stata 10 because of the use of Mata's optimization functions. No additional packages are required beyond a standard Stata and Mata installation.

4.1 Syntax

The syntax for genstest was designed to be as similar to Stata 11's gmm command as possible. The syntax is defined as follows:

genstest [(residual)] [if] [weight] [, instruments(varlist[, noconstant])
    derivative(/name = [<]dexp[>]) twostep igmm init(numlist)
    null(numlist | last) test(namelist) sb stab winitial(iwtype)
    wmatrix(wmtype) center small trim(#) nuisS varS ci(ci options)]

residual is an expression defining ut(θ, γ), the error-term function used in the empirical moment (1/T)∑_{t=1}^{T} Z′t ut(θ0, γ), where Zt is the vector of instruments.

In the residual expression, enclosing a name inside < > indicates a parameter to be tested as the null hypothesis, while enclosing a name inside { } indicates a parameter to estimate. For example, in the following linear regression model,

y1,t = y2,t θ + xt γ + ut

5. Because genstest can perform hypothesis tests on any number of parameters, to generate a confidence set for a higher number of parameters, one only needs to use nested while loops.


where θ is the parameter to be tested, and γ is estimated under the null hypothesis. The regression residual expression is

(y1 - <theta>*y2 - {gamma}*x)

A constant is not added to the residual expression by default to keep the behavior of genstest similar to that of the gmm command. However, a constant is automatically included in the vector of instruments, Zt, unless noconstant is specified in the instruments() option.

In the same example, if the residual expression is specified as

(y1 - {theta}*y2 - {gamma}*x)

then both parameters are estimated. In this case, the S test will be the same as the overidentification restriction test of Hansen (1982), also known as the J test, and the ave-, exp-, and sup-S tests will be equivalent to the overidentification restriction tests proposed by Hall and Sen (1999), also known as the O test.

When running genstest as a postestimation command, one does not need to specify residual. In that case, genstest uses the residual given to gmm.

4.2 Testing options

instruments(varlist[, noconstant]) specifies the vector of instruments Zt. The optional noconstant indicates removal of a constant from the matrix of instruments.

derivative(/name = [<]dexp[>]) specifies the derivative of the residual function with respect to the parameter name. The functionality of this option requires entering all untested parameter derivatives; otherwise, derivatives in the optimization algorithm will be computed numerically. The use of this option is recommended when estimating confidence intervals and sets because it improves the performance of the optimization algorithm and the computational speed (see the ci() option). This option is specified as in gmm with the addition that the < >'s indicate the value of a parameter tested under the null hypothesis. If one is using genstest as a postestimation command, the derivatives passed to gmm will be used by genstest.

twostep requires that the two-step generalized method of moments estimator be used (this is the default).

igmm requires that the iterated generalized method of moments estimator be used.

init(numlist) sets the initial values in the optimization routine for estimating the nuisance parameters. The default choice is a vector of zeros. One should include this option if the algorithm for estimating untested parameters does not converge, if it converges to a local minimum, or if the residual expression is undefined at the zero vector.


null(numlist | last) tells genstest to test Hθ0 : θ = θ0, where θ0 is the hypothesized value of the parameter of interest θ. By default, genstest tests the null hypothesis that all parameters defined inside < > are equal to 0. If genstest is used as a postestimation command, then null(last) will set θ0 as the last estimate obtained after running gmm. The supplied numlist must be in the same order as the parameters appearing in the residual expression.

test(namelist) lists the names of the gmm parameters in the residual expression to be tested. This option is only applicable if genstest is being used as a postestimation command.

sb reports the ave-, exp-, and sup-S tests (the single-break tests). These tests are computationally more intensive than the qLL-S test and are therefore not computed by default.

stab reports the stability statistics (the S̄ tests).

winitial(iwtype) specifies the initial weighting matrix W in (11) for obtaining an inefficient estimate of γ. There are two options for this matrix: identity, which uses the identity matrix, and unadjusted, which sets (Z′Z)⁻¹ as the initial weight matrix. The default is winitial(unadjusted).

wmatrix(wmtype) allows the choice of the covariance matrix in (7). wmtype represents the user choice for the estimator type of the variance of T^(−1/2) ∑_{t=1}^{T} Z′t ut(θ0, γ).

The choices are the following:

unadjusted for the homoskedastic case.

robust, hc1, hc2, hc3, and hc4 for the heteroskedastic case, with hci, for i = 1, . . . , 4, denoting the residual adjustment options (see Davidson and MacKinnon [2003]). The default is wmatrix(hc1), which denotes multiplying the square of the residuals by {T/(T − kZ)}.

cluster clustvar for a cluster–robust covariance matrix having the cluster variable defined in clustvar.

hac kernel [lags] for the heteroskedastic and autocorrelated (HAC) robust covariance matrix. The kernel can be defined as

bartlett or nwest for the Bartlett (Newey–West) kernel;

parzen or gallant for the Parzen (Gallant) kernel; or

quadraticspectral or andrews for the quadratic spectral (Andrews) kernel.


When selecting the kernel, the user can choose the number of lags to compute the HAC estimate. lags may be one of the following:

optimal, if using the optimal selection algorithm of Newey and West (1994) (implemented using the same algorithm as in gmm).

automatic for setting the number of lags to the starting value of the optimal-lag selection algorithm (divided by 5):

Bartlett: 4 × (T/100)^(2/9)
Parzen: 4 × (T/100)^(4/25)
Quadratic spectral: 4 × (T/100)^(2/25)

number for any number specified by the user.

Technical note

The genstest default number of lags is automatic, which differs from the default number in the built-in Stata gmm function.

center indicates recentering the moment function when computing the HAC estimate of the variance.

small indicates using a small-sample adjustment when computing the HAC weight matrix.

Options: Single-break tests

The following options are for computation of the ave-, exp-, and sup-S tests.

trim(#) specifies the value of the trimming parameter s used to fix tl = [sT] and tu = [(1 − s)T] in step 1 of the algorithm for computing the single-break stability tests. The options are s = 0.05, 0.10, 0.15, and 0.20. The default is trim(0.15).

nuisS indicates the use of γ̂(θ0), the estimate of the nuisance parameter in (7), in place of γ̂j(θ0), derived from (9), when computing the split-sample tests.

varS specifies the use of Φ̂(θ0), the estimated variance of the moments using all observations, in place of both Φ̂1(θ0) and Φ̂2(θ0) when computing the split-sample tests.

The weight matrices for computing Φ̂1(θ0) and Φ̂2(θ0) are the same as the ones for winitial() and wmatrix(). For example, choosing winitial(unadjusted) and wmatrix(hac nwest automatic) implies that the initial weight matrices for estimating γ̂j(θ0) are (Z′1Z1)⁻¹ and (Z′2Z2)⁻¹ in the first-step problem of section 3.3 and that the HAC estimators use the Bartlett kernel with lags 4 × (T1/100)^(2/9) and 4 × (T2/100)^(2/9) for Φ̂1(θ0) and Φ̂2(θ0), respectively.


Options: Confidence interval and region

The genstest command has the option of estimating confidence intervals and sets, the latter up to two parameters, using a grid search algorithm (to estimate higher-dimensional sets, one can use a simple while loop because genstest can test any number of parameters). If a two-parameter confidence set is chosen, the result can be displayed in a twoway graph. The use of the option derivative() is recommended when estimating confidence intervals and sets. The options are the following:

ci(numlist[, ci options]) indicates that a confidence interval or set be estimated. numlist specifies the range of the grid search. For example,

ci(a b c d, ci options)

sets [a, b] and [c, d] as the grid search ranges of a confidence region for two parameters.

points(numlist) determines the number of equally spaced points for the grid search. The default is points(20) for confidence intervals and points(20 20) for confidence sets.

alpha(#) determines the 1 − α coverage probability of the interval or set. The default is alpha(0.05).

allpv tells genstest to return p-values for all points tested in the selected range. Therefore, if one wishes to examine the confidence interval (set) for a different significance level, there is no need to execute the command a second time.

autograph tells genstest to automatically graph the confidence region if two parameters are being tested. Whether or not this option is specified, the points necessary to plot the confidence region are stored in matrices.

Technical note

The estimation of confidence intervals and sets, which is based on a grid search process, is not performed by default, because it can be computationally intensive. To estimate confidence intervals and regions, we recommend running without the sb option (which would cause genstest to perform the split-sample tests) or using the sb option together with the nuisS and varS options.

Technical note

For confidence interval results, if allpv is not given, then genstest returns a matrix of values that pass the test alongside the resulting statistic. If allpv is specified, genstest saves a matrix containing the grid search values associated with their respective p-values (p-values and grid search points are reported in the first and the subsequent columns, respectively).


4.3 Stored results

genstest stores the following in r():

Scalars
    r(S)            S statistic
    r(aveS)         ave-S statistic
    r(expS)         exp-S statistic
    r(supS)         sup-S statistic
    r(qllS)         qLL-S statistic
    r(avestabS)     ave-S̄ statistic
    r(expstabS)     exp-S̄ statistic
    r(supstabS)     sup-S̄ statistic
    r(qllstabS)     qLL-S̄ statistic
    r(pS)           S statistic p-value
    r(paveS)        ave-S p-value
    r(pexpS)        exp-S p-value
    r(psupS)        sup-S statistic p-value
    r(pqllS)        qLL-S p-value
    r(pavestabS)    ave-S̄ p-value
    r(pexpstabS)    exp-S̄ p-value
    r(psupstabS)    sup-S̄ p-value
    r(pqllstabS)    qLL-S̄ p-value

Matrices
    r(Sci)          grid search points not rejected by the S test, or search points and their associated p-values (if allpv is specified)
    r(aveSci)       grid search points not rejected by the ave-S test, or search points and their associated p-values
    r(expSci)       grid search points not rejected by the exp-S test, or search points and their associated p-values
    r(supSci)       grid search points not rejected by the sup-S test, or search points and their associated p-values
    r(qllSci)       grid search points not rejected by the qLL-S test, or search points and their associated p-values
    r(avestabSci)   grid search points not rejected by the ave-S̄ test, or search points and their associated p-values
    r(expstabSci)   grid search points not rejected by the exp-S̄ test, or search points and their associated p-values
    r(supstabSci)   grid search points not rejected by the sup-S̄ test, or search points and their associated p-values
    r(qllstabSci)   grid search points not rejected by the qLL-S̄ test, or search points and their associated p-values
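For example, after a run with the ci(..., allpv) option, a stored matrix can be pulled into variables for custom plotting (a sketch; the matrix name A is arbitrary):

. return list
. matrix A = r(qllSci)       // column 1: p-values; remaining columns: grid points
. svmat double A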

5 Examples

We present three examples to illustrate the use of genstest. The first example is the regression model of married female labor supply presented in Mroz (1987). The second example is based on the Poisson regression model contained in the example section of the gmm command in Stata's Base Reference Manual: Release 11 (see StataCorp [2009]). In the first two examples, we assume that the observations are independent but not identically distributed. The third example presents inference about parameters of the new Keynesian Phillips curve (NKPC) model discussed in Sbordone (2005) and Magnusson and Mavroeidis (2010b). In this last example, residuals are assumed to be heteroskedastic and exhibit autocorrelation of unknown form.

5.1 Example 1: Instrumental variable regression—independent observations and heteroskedastic residuals

In studying the married female labor supply, Mroz (1987) suggests a regression of hours of work (hours) on the log of wages (lwage, the only endogenous variable), household income excluding the woman's wage (nwifeinc), years of education (educ), age (age), and the number of children less than six and greater than six years old (kidslt6 and kidsge6, respectively). The chosen excluded instruments are the actual labor market experience and its square (exper and expersq) and the father's and mother's years of education (fatheduc and motheduc). The data consist of 428 women in the labor force. The IV regression model may be summarized as follows:

hours = θ lwage + γ0 + γ1 nwifeinc + γ2 educ + γ3 age + γ4 kidslt6 + γ5 kidsge6 + u

lwage = ρ0 + ρ1 exper + ρ2 expersq + ρ3 fatheduc + ρ4 motheduc
      + ρ5 educ + ρ6 nwifeinc + ρ7 age + ρ8 kidslt6 + ρ9 kidsge6 + v

We examine the effect of lwage on hours of work. We sort the data by lwage because one might be concerned about whether this effect is constant across observations. Sorting the data does not affect the Wald and S tests. Independence among observations is assumed, but the distribution of the error term is heteroskedastic. Because genstest uses a weight matrix robust to heteroskedasticity by default (hc1), the option is omitted below.

. use http://www.stata.com/data/jwooldridge/eacsap/mroz.dta

. sort lwage

. gmm (hours - {theta}*lwage - {g0} - {g1}*educ - {g2}*nwifeinc - {g3}*age -
>     {g4}*kidslt6 - {g5}*kidsge6) if inlf==1, level(90)
>     inst(exper expersq fatheduc motheduc educ nwifeinc age kidslt6 kidsge6)

(output omitted)

GMM estimation

Number of parameters =  7
Number of moments    = 10
Initial weight matrix: Unadjusted              Number of obs   =        428
GMM weight matrix:     Robust

                         Robust
             Coef.    Std. Err.      z    P>|z|     [90% Conf. Interval]

  /theta   1223.656   456.8492    2.68   0.007      472.206    1975.106
     /g0   2287.937   522.7613    4.38   0.000     1428.071    3147.803
     /g1  -143.8503   52.84411   -2.72   0.006    -230.7712    -56.9295
     /g2  -8.466459   4.488806   -1.89   0.059    -15.84989   -1.083031
     /g3  -8.105428   8.896522   -0.91   0.362     -22.7389    6.528048
     /g4  -261.7084    177.988   -1.47   0.141    -554.4726    31.05575
     /g5  -56.63245   48.20149   -1.17   0.240    -135.9168    22.65194

Instruments for equation 1: exper expersq fatheduc motheduc educ nwifeinc
    age kidslt6 kidsge6 _cons


. genstest (hours - <theta>*lwage - {g0} - {g1}*educ - {g2}*nwifeinc - {g3}*age -
>     {g4}*kidslt6 - {g5}*kidsge6) if inlf==1,
>     inst(exper expersq fatheduc motheduc educ nwifeinc age kidslt6 kidsge6)
>     stab wmatrix(hc1) ci(-200 7000, points(60) alpha(0.10))

        Test       Statistic    P-value    CI (alpha=.1)

           S       26.316010      0.000    [880, 6280]
       qLL-S       68.829101      0.006    Rejected Grid
  qLL-stab-S       42.513092      0.632    [-80, 280]

Tested null hypothesis vector: <theta> = < 0.000>
Number of Instruments - Included Instruments: 10 - 6
Number of Observations: 428

The difference between the Wald and S confidence intervals for θ indicates the presence of weak instruments. It is interesting to observe that the S test confidence interval does not intersect the qLL-S̄ (qLL-stab-S) confidence interval. This results in an empty qLL-S confidence interval and clearly suggests that the effect of lwage on hours of work is not constant.

This example uses genstest as a stand-alone command. Because genstest uses the null that all parameters of interest are equal to 0 by default, the option null() is omitted. In such cases, there is no need to run gmm first (we do so to show the Wald confidence interval for comparison).

5.2 Example 2: Exponential regression with endogenous regressors

This example corresponds to [R] gmm examples 6, 7, and 8 on pages 591–595 in Stata's Base Reference Manual (see StataCorp [2009]). Cameron and Trivedi (2010) model doctor visits on the basis of the following factors: a patient's income (income), whether a patient has a chronic disease (chronic), whether a patient has private insurance (private), and gender. They use an exponential regression model. The dataset has demographic information on 4,412 patients. Taking income to be endogenous, one adds these additional instruments: age and the dummy variables hispanic and black. We subset the model to include only female patients (there are 2,082 observations). The components of the empirical moment are

u = docvis − exp(θ income + γ0 + γ1 chronic + γ2 private)
Z = (1, chronic, private, age, black, hispanic)

This Poisson regression model assumes that the residuals are heteroskedastic. Before performing inference using the genstest command, we sort the data according to income first and then age because of the possibility that lower- and higher-income groups have different income effects. We first estimate the parameters using the gmm command and then run genstest as a postestimation command.


. webuse docvisits

. sort income age

. local expr = "exp({theta}*income + {g0} + {g1}*chronic + {g2}*private)"

. gmm (docvis - `expr') if female==1, inst(age private chronic black hispanic)
>     deriv(/theta = -1*income*`expr') deriv(/g0 = -1*`expr')
>     deriv(/g1 = -1*chronic*`expr') deriv(/g2 = -1*private*`expr') level(99)

(output omitted)

GMM estimation

Number of parameters =  4
Number of moments    =  6
Initial weight matrix: Unadjusted              Number of obs   =       2082
GMM weight matrix:     Robust

                         Robust
             Coef.    Std. Err.      z    P>|z|     [99% Conf. Interval]

  /theta   .0178061   .0038447    4.63   0.000     .0079029    .0277094
     /g0   .0791596   .1278643    0.62   0.536     -.250197    .4085162
     /g1   .9640517    .076972   12.52   0.000     .7657849    1.162319
     /g2   .4172219   .1590262    2.62   0.009     .0075976    .8268461

Instruments for equation 1: age private chronic black hispanic _cons

. genstest, null(last) test(theta) stab sb varS nuisS ci(-0.05 0.05, points(20)
>     alpha(0.01))
convergence not achieved

        Test       Statistic    P-value    CI (alpha=.01)

           S        3.643981      0.303    [.01, .03]
       qLL-S        2.02e+02      0.001    [.005, .005]
       ave-S       22.440894      0.003    [.005, .015]
       exp-S       29.458495      0.001    [.005, .01]
       sup-S       35.183634      0.002    [0, .015]
  qLL-stab-S        1.98e+02      0.001    [0, .005]
  ave-stab-S       18.796914      0.001    [-.01, .01]
  exp-stab-S       25.814514      0.001    [-.005, .01]
  sup-stab-S       31.539654      0.001    [-.01, .015]

Tested null hypothesis vector: <theta> = < 0.018 >
Number of Instruments - Included Instruments: 6 - 3
Number of Observations: 2082

The null(last) option has genstest test the null hypothesis that θ is equal to its gmm estimated value. Here it is not necessary to specify the derivative() option in genstest again, because the command will use the derivative() expression of gmm and automatically consider theta as the only parameter to be tested. This command could also have been executed in the following way (assuming it is still being used as a postestimation command):

. local expr = "exp(<theta>*income + {g0} + {g1}*chronic+ {g2}*private)"

. genstest, null(last) test(theta) deriv(/g0 =-1*`expr´)> deriv(/g1 = -1*chronic*`expr´) deriv(/g2 = -1*private*`expr´)

We use the options varS and nuisS to reduce the computation time of the ave-, exp-, and sup-S tests. The p-values of the gen-S and gen-S̄ tests indicate that at the 1% significance level, we reject H0 : θ = θ̂, where θ̂ is the gmm estimate of θ. We also notice that the 99% gen-S confidence intervals for θ are shorter than the confidence interval reported by gmm. The reduction is due to imposing the stability restrictions, as indicated by the upper bounds of the gen-S confidence intervals.

5.3 Example 3: NKPC

The hybrid NKPC is defined by the following equation,

πt = γ + {1/(1 + ρ)} Et(πt+1) + {ρ/(1 + ρ)} πt−1 + [(1 − φ)²/{φ(1 + ρ)}] xt + εt

where πt is inflation, xt is labor share, and εt is a shock. The parameter ρ measures the degree of indexation to past inflation. The parameter φ is the probability that a firm will be unable to change its price in a given period (hence, 1/(1 − φ) is the average time over which a price is fixed).
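For instance, at the point estimate φ̂ ≈ 0.82 obtained below, the implied average time over which a price is fixed is 1/(1 − 0.82) ≈ 5.6 quarters.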

The empirical moment condition derived from the above model is

Z′t [Δπt − γ − {1/(1 + ρ)}(πt+1 − πt−1) − [(1 − φ)²/{φ(1 + ρ)}] xt]

where the term in square brackets is ut(θ, γ), Zt = (1, Δπt−1, Δπt−2, xt−1, xt−2, xt−3), and θ = (ρ, φ).

We illustrate first the estimation of confidence intervals for φ and ρ on the basis of the generalized S and S tests. We use quarterly data on inflation and labor share. Inflation is calculated from the gross domestic product deflator, while labor share is obtained from the Bureau of Labor Statistics and transformed according to the procedure used in Sbordone (2005). The data comprise information from 1959:2 to 2008:3.

The genstest command is used as a postestimation command. In estimating the confidence intervals, we restrict the grid search to be between 0 and 1, which corresponds to the range determined by the economic theory. All the options in genstest are set according to the gmm command options: a HAC weight matrix with the Bartlett kernel and the use of recentered moments for computing the HAC, with the number of lags selected according to the optimal method suggested by Newey and West (1994) (see [R] gmm in Stata 11).6

6. We set 1 as the initial value for estimating φ. The default value for estimating φ using gmm is 0, which will result in an error message.


. use nkpc_gmm

. local expr = "{g} - (1/(1 + {rho}))*(F.inf - L.inf) -
>     (((1 - {phi=1})^2)/({phi=1}*(1 + {rho})))*ls"

. generate time = q(1947q2) + _n - 1    // generate a quarterly series

. format time %tq

. tsset time
        time variable:  time, 1947q2 to 2008q3
                delta:  1 quarter

. generate dinf = inf - L.inf
(1 missing value generated)

. gmm (dinf - `expr') if time>=tq(1959q2) & time<=tq(2008q3),
>     inst(L.dinf L2.dinf L.ls L2.ls L3.ls) wmat(hac nwest optimal) center
warning: 1 missing value returned for equation 1 at initial values

(output omitted)

GMM estimation

Number of parameters =  3
Number of moments    =  6
Initial weight matrix: Unadjusted              Number of obs   =        197
GMM weight matrix:     HAC Bartlett 23
                       (lags chosen by Newey-West)

                          HAC
             Coef.    Std. Err.      z    P>|z|     [95% Conf. Interval]

      /g  -.0045429   .0049064   -0.93   0.354    -.0141594    .0050735
    /rho   .3862696   .1429788    2.70   0.007     .1060362    .6665029
    /phi   .8208458   .0739276   11.10   0.000     .6759503    .9657412

HAC standard errors based on Bartlett kernel with 23 lags.
(Lags chosen by Newey-West method.)

Instruments for equation 1: L.dinf L2.dinf L.ls L2.ls L3.ls _cons

. genstest, init(0 0.76) test(rho) null(last) ci(0.01 0.99, points(20)) stab sb
Note: using the nuisS and/or varS options will decrease computation time for
>     the single-break tests.

        Test       Statistic    P-value    CI (alpha=.05)

           S        3.020303      0.554    [.059, .794]
       qLL-S       35.709204      0.173    [.108, .696]
       ave-S       12.124914      0.248    [.157, .549]
       exp-S       16.133502      0.150    [.255, .5]
       sup-S       20.743887      0.192    [.206, .5]
  qLL-stab-S       32.688901      0.123    [.01, .941]
  ave-stab-S        9.104611      0.101    [.157, .451]
  exp-stab-S       13.113198      0.068    [.304, .451]
  sup-stab-S       17.723584      0.113    [.255, .451]

Tested null hypothesis vector: <rho> = < 0.386 >
Number of Instruments - Included Instruments: 6 - 2
Number of Observations: 197


. genstest, test(phi) init(0 0.38) null(last) ci(0.50 0.99, points(20)) stab sb
Note: using the nuisS and/or varS options will decrease computation time for
>     the single-break tests.

        Test       Statistic    P-value    CI (alpha=.05)

           S        2.712954      0.607    [.622, .99]
       qLL-S       32.978783      0.310    [.647, .99]
       ave-S       12.526473      0.218    [.72, .99]
       exp-S       18.453388      0.077    [.818, .99]
       sup-S       24.109671      0.082    [.818, .99]
  qLL-stab-S       30.265829      0.234    [.573, .99]
  ave-stab-S        9.813519      0.069    [.794, .99]
  exp-stab-S       15.740433      0.025    [.867, .99]
  sup-stab-S       21.396717      0.034    [.843, .99]

Tested null hypothesis vector: <phi> = < 0.821 >
Number of Instruments - Included Instruments: 6 - 2
Number of Observations: 197

The shrinkage in the gen-S confidence intervals relative to the S is due to the stability restrictions. This reduction is particularly remarkable for the exp-S test.

The next call of the genstest command illustrates how to test multiple parameters. In this call, we also set a grid search for estimating confidence regions for (ρ, φ) on the basis of the gen-S tests. The confidence regions are the collection of points (ρ0, φ0) ∈ (0, 1) × (0, 1) in the grid search that do not reject the null hypothesis H0 : (ρ, φ) = (ρ0, φ0).

. genstest, test(rho phi) null(last) init(0.0) derivative(/g=-1)
>     ci(0.01 .99 0.01 0.99, points(10 10)) stab sb
Note: using the nuisS and/or varS options will decrease computation time for
>     the single-break tests.

        Test       Statistic    P-value

           S        3.043952      0.693
       qLL-S       37.768561      0.136
       ave-S       12.948586      0.271
       exp-S       18.421490      0.110
       sup-S       23.887305      0.117
  qLL-stab-S       34.724609      0.068
  ave-stab-S        9.904634      0.066
  exp-stab-S       15.377538      0.029
  sup-stab-S       20.843353      0.042

Tested null hypothesis vector: <rho phi> = < 0.386 0.821 >
Number of Instruments - Included Instruments: 6 - 1
Number of Observations: 197


Example 3 takes longer to run than examples 1 and 2 because the single-break tests recursively estimate γ̂j(θ), Φ̂1(θ), and Φ̂2(θ) 138 times for each point in the grid search.7 Removing the sb option or adding nuisS and varS will reduce the computation time significantly.

The confidence sets for the joint hypothesis are shown below for the Wald and S tests. The graph for the Wald confidence set was created using built-in Stata commands (test and gmm). The other graph was created using the stored results of the genstest command. In the same graphs, we plot the confidence intervals of each parameter.

[Figure 1 panels: "Wald Confidence Region (95%)" and "S Confidence Region (95%)"; horizontal axis ρ from 0 to 1, vertical axis φ from 0 to 1]

Figure 1. Wald and S 95%-level confidence sets for φ and ρ in the NKPC. The forcing variable is the log of the labor share. Instruments: constant, two lags of Δπ, and three lags of xt. Period: 1960q1–2008q3.

7. It takes approximately three to four minutes to compute confidence intervals for ρ and φ, respectively. In the case of the confidence regions, it takes 16 minutes to test all 100 points of the defined grid search. The reported times are obtained after executing the code on a PC with an Intel(R) Core(TM) i7-2600 CPU 3.4 GHz processor and 4 GB of RAM, running Windows 7 and Stata/IC 12.1.


Comparing the S confidence set with the results of the Wald test above, we might suspect the presence of weak instruments: the confidence interval of ρ generated by the S test covers almost the entire parameter space.

Next we obtain the confidence intervals and regions using the ave-S, exp-S, sup-S, and qLL-S tests. The proposed tests, which are robust to weak instruments, generate smaller confidence intervals and regions than both the Wald and the S tests.

[Figure 2 panels: "ave-S Confidence Region (95%)", "exp-S Confidence Region (95%)", "sup-S Confidence Region (95%)", and "qLL-S Confidence Region (95%)"; horizontal axis ρ from 0 to 1, vertical axis φ from 0 to 1]

Figure 2. gen-S 95%-level confidence sets for φ and ρ in the NKPC. The forcing variable is the log of the labor share. Instruments: constant, two lags of Δπ, and three lags of xt. Period: 1960q1–2008q3.

We illustrate the importance of imposing the stability restrictions when performing inference in figure 3. The ave-S test is a combination of the S and ave-S̄ tests [see (6)]. In the S-test confidence region graph, the confidence interval for ρ covers almost the entire parameter space. The range of points in the parameter space that satisfies the stability restrictions is a small fraction of the range of points that satisfies the S test, as illustrated by the ave-S̄ confidence region graph. The same explanation extends to the reduction in the range of φ.


[Figure 3 panels: "S Confidence Region (95%)", "ave-S Confidence Region (95%)", and "ave-stab-S Confidence Region (95%)"; horizontal axis ρ from 0 to 1, vertical axis φ from 0 to 1]

Figure 3. S, ave-S, and ave-S̄ 95%-level confidence sets for φ and ρ in the NKPC. The forcing variable is the log of the labor share. Instruments: constant, two lags of Δπ, and three lags of xt. Period: 1960q1–2008q3.

This last example illustrates that in time-series applications, the generalized S tests improve inference of structural parameters by incorporating information about instabilities in the moment condition without imposing the identification restrictions implicitly assumed by computing the Wald test.

6 Acknowledgments

We would like to thank Mark Schaffer and Daniel Millimet for suggestions and comments. An anonymous referee helped us to improve the manuscript and the genstest code, and we are grateful for this. We also would like to thank the Tulane Research Experience for Undergraduates in Applied Microeconomics and Program Evaluation (Treu-Ampe) and the Provost's Fund for Faculty–Student Scholarly and Artistic Engagement for funding support.


7 References

Anderson, T. W., and H. Rubin. 1949. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20: 46–63.

Andrews, D. W. K. 1993. Tests for parameter instability and structural change with unknown change point. Econometrica 61: 821–856.

Angrist, J. D., and A. B. Krueger. 1995. Split-sample instrumental variables estimates of the return to schooling. Journal of Business and Economic Statistics 13: 225–235.

Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.

Caner, M. 2007. Boundedly pivotal structural change tests in continuous updating GMM with strong, weak identification and completely unidentified cases. Journal of Econometrics 137: 28–67.

Chernozhukov, V., and C. Hansen. 2008. The reduced form: A simple approach to inference with weak instruments. Economics Letters 100: 68–71.

Davidson, R., and J. G. MacKinnon. 2003. Econometric Theory and Methods. New York: Oxford University Press.

Elliott, G., and U. K. Müller. 2006. Efficient tests for general persistent time variation in regression coefficients. Review of Economic Studies 73: 907–940.

Hall, A. R. 2005. Generalized Method of Moments. Oxford: Oxford University Press.

Hall, A. R., and A. Sen. 1999. Structural stability testing in models estimated by generalized method of moments. Journal of Business and Economic Statistics 17: 335–348.

Hansen, L. P. 1982. Large sample properties of generalized method of moments estimators. Econometrica 50: 1029–1054.

Kleibergen, F., and S. Mavroeidis. 2009. Weak instrument robust tests in GMM and the new Keynesian Phillips curve. Journal of Business and Economic Statistics 27: 293–311.

Magnusson, L. M., and S. Mavroeidis. 2010a. Identification using stability restrictions. Working Paper 1116, Department of Economics, Tulane University. http://ideas.repec.org/p/tul/wpaper/1116.html.

———. 2010b. Identification-robust minimum distance estimation of the new Keynesian Phillips curve. Journal of Money, Credit and Banking 42: 465–481.

Mroz, T. A. 1987. The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55: 765–799.

Newey, W. K., and K. D. West. 1994. Automatic lag selection in covariance matrix estimation. Review of Economic Studies 61: 631–653.

Piehl, A. M., S. J. Cooper, A. A. Braga, and D. M. Kennedy. 2003. Testing for structural breaks in the evaluation of programs. Review of Economics and Statistics 85: 550–558.

Sbordone, A. M. 2005. Do expected future marginal costs drive inflation dynamics? Journal of Monetary Economics 52: 1183–1197.

Sowell, F. 1996. Optimal tests for parameter instability in the generalized method of moments framework. Econometrica 64: 1085–1107.

StataCorp. 2009. Stata: Release 11. Statistical Software. College Station, TX: StataCorp LP.

Stock, J. H., and M. W. Watson. 1996. Evidence on structural instability in macroeconomic time series relations. Journal of Business and Economic Statistics 14: 11–30.

Stock, J. H., and J. H. Wright. 2000. GMM with weak identification. Econometrica 68: 1055–1096.

Stock, J. H., J. H. Wright, and M. Yogo. 2002. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20: 518–529.

About the authors

Zachary Flynn is a PhD candidate in economics at the University of Wisconsin–Madison.

Leandro Magnusson is an assistant professor in the Department of Economics at the University of Western Australia.


The Stata Journal (2013) 13, Number 4, pp. 862–866

cmpute: A tool to generate or replace a variable

Patrick Royston
Hub for Trials Methodology Research

MRC Clinical Trials Unit at UCL

London, UK

[email protected]

Abstract. I provide a new programming tool, cmpute, to manage conveniently the creation of a new variable or the replacement of an existing variable interactively or within a Stata program.

Keywords: dm0072, cmpute, data management, create variable, replace variable, ado-file programming

1 Introduction

In Stata programs that I write, I am often faced with safely managing the creation of new variables to be stored in the workspace alongside user data. For example, I do not wish to overwrite existing user variables without warning. An obvious precaution is to include a replace option in the program so that the user can sanction overwriting a variable when appropriate. However, accurately handling the presence or absence of replace and the creation of a new variable is programmatically cumbersome.

In this short article, I describe a new tool, cmpute, to streamline the "regeneration" (creation or replacement) of a variable subject to certain sensible constraints. cmpute has some features in common with an earlier program, defv (Gleason 1997, 1999). However, the aims of defv are different. A key goal of defv is to enable the documentation of changes to an important variable by accumulating characteristics (as notes), possibly over many sessions with a particular dataset. My main goal with cmpute is to streamline the creation or replacement of variables within a Stata program. Although cmpute may also prove useful interactively, that is not my primary aim.

From its original release, Stata has separated the creation of new variables (done with generate) from the revision of the contents of existing variables (done with replace). Furthermore, while you can abbreviate generate all the way down to g if you wish (in practice, most people use gen), you cannot abbreviate replace. These decisions all flow from Stata's underlying philosophy of protecting your data and of making it as difficult as possible for you to change your data unless you spell out explicitly that this is your intention.

© 2013 StataCorp LP dm0072


In proposing to do what Stata's designers in their wisdom cast asunder, I am consciously favoring programmer convenience while also reducing any element of risk by protecting users against inadvertent changes to their data. (Note: If you specify the force option of cmpute, be aware that it means what it says. The effects of a force may be drastic.)

cmpute has a loose connection with the official command clonevar, which precisely reproduces the data and all other features of an existing variable in a new variable.

2 Example

Consider the following simple program:

program define mylog
    // Program to safely create a log transformation of a single variable
    version 12.1
    syntax varlist(min=1 max=1 numeric), generate(string) [replace]
    capture confirm var `generate'
    // `generate' does not exist; it's safe to create it and finish
    if c(rc) != 0 {
        generate `generate' = ln(`varlist')
        exit
    }
    // `generate' does exist; it must be handled correctly
    if "`replace'" == "replace" {
        replace `generate' = ln(`varlist')
    }
    else {
        display as error "`generate' already defined"
        error 110
    }
end

The program accepts a variable supplied in varlist and creates a new variable called string, whose name is passed in the option generate() and stored in a local macro called generate, containing the logarithmically transformed values of `varlist'. mylog replaces the contents of the variable string if it already exists—provided that the replace option is specified. If the replace option is not specified, an error message must be issued because we do not wish to wipe out the existing string without permission. The above program is not completely foolproof, but on the whole, it does a reasonable job of handling various possible inputs and the existence or otherwise of the variable string. There must be thousands of programs out there containing lines of code that do something similar. If more than one variable is to be handled, the code can get quite bulky (and ugly).
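To see the behavior in action, here is a short interactive sketch using the auto data shipped with Stata (the error output shown is exactly what the display and error lines in mylog produce):

. sysuse auto, clear
. mylog price, generate(lprice)
. mylog price, generate(lprice)
lprice already defined
r(110);
. mylog price, generate(lprice) replace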

We could replace chunks of code like that in mylog with a single call to the new program, cmpute. For example,

. cmpute logx = ln(x), replace

does essentially the same thing as

. mylog x, generate(logx) replace


Of course, cmpute is much more general; within Stata's limits, it can handle an arbitrarily complex expression after the = sign.
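For instance, a call such as the following (a sketch with hypothetical variables weight in kilograms and height in centimeters) combines a nontrivial expression, a storage type, and a label in one step:

. cmpute double bmi = weight/(height/100)^2, label("Body mass index")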

3 Syntax

The syntax of cmpute is as follows:

cmpute [type] {existing_var | newvar} = exp [if] [in] [, force label(string) replace]

3.1 Description

cmpute replaces an existing variable, existing_var, or creates a new variable, newvar, from an expression in exp. An error message occurs if an attempt is made to change existing_var without specifying replace. If type is specified, cmpute sets the storage type of existing_var or newvar to type (see also the force option). type must be one of byte, int, long, float, double, str#, or, in Stata 13 or higher, strL.

Note that cmpute leaves formats, value labels, and characteristics as they were, so a programmer wanting to alter any of those needs to make the changes separately.

Although cmpute is envisaged primarily as a programmer's tool, users may also find it convenient in interactive use as a shortcut to creating and labeling a new (or existing) variable in one step.

3.2 Options

force applies recast to force a change in the storage type of an existing_var to type. This option should be used with caution because it could result in loss of data. See help on recast for further information. force has no effect on a newvar.

label(string) labels the new or regenerated variable “string”.

replace replaces existing_var. Using cmpute with an existing variable but omitting replace raises an error message. replace has no effect on a newvar.

3.3 Examples: Interactive use

The examples given below are of interactive use. See section 4 to get an idea of cmpute's utility in programming.

. cmpute str6 make = substr(make, 1, 6), replace label("Make (trunc)")

. cmpute int gear_ratio = int(100 * gear_ratio), replace force

. cmpute logx = ln(x), label("log(x)")


4 Example: Programming use

Here is a simple program, an extension of mylog, that uses cmpute to manage the creation of new variables:

program define mylog2
    version 12.1
    syntax varlist(min=1 numeric) [if] [in], generate(string) [replace]
    marksample touse
    local nvar : word count `varlist'
    tokenize `varlist'
    forvalues i = 1 / `nvar' {
        cmpute double `generate'`i' = ln(``i'') if `touse', `replace' ///
            label("ln(``i'')")
    }
end

mylog2 log-transforms a list of variables in `varlist'. As you can see, the aim here is to implement an option whose syntax is generate(name). The option saves permanently a bunch of new or replaced variables whose names begin with name. If the replace option is omitted, the cmpute . . . line will raise an error if a variable called `generate'`i' already exists for some i. If replace is used, all such variables are silently overwritten.
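As a small illustration (a hypothetical run on the auto data; output omitted), the following call creates double variables ln1, ln2, and ln3 holding the logs of price, mpg, and weight, labeled "ln(price)", "ln(mpg)", and "ln(weight)":

. sysuse auto, clear
. mylog2 price mpg weight, generate(ln)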

I have requested that the log-transformed variables `generate'1, `generate'2, . . . , `generate'`nvar' be stored in double precision, and I have simultaneously labeled them meaningfully. The local macro ``i'' evaluates to the ith token (element) in `varlist', that is, to the ith variable name.

Note: I have written mylog2 such that if any member of varlist has a missing value in a given observation not due to the if and in qualifiers, that observation becomes missing in all the generated variables. The reason is that marksample has automatically incorporated missingness of members of varlist in the indicator temporary variable touse. I could easily change such behavior if that is not what is wanted. For example, the cmpute . . . line could instead be coded

cmpute double `generate'`i' = ln(``i'') `if' `in', `replace' label("ln(``i'')")

which would preserve all original values of variables in `varlist' except where filtered by either the if or the in qualifier or of course by an attempt to log transform a nonpositive value.

5 Summary

cmpute is meant as an interactive command or a programming tool. In a program, you often wish to create a new variable or replace an existing one, and you also have implemented a replace option to allow an existing variable to be overwritten. cmpute handles the necessary coding and (critically) the error checking in a single call. Doing this properly line by line within your program is cumbersome. cmpute also supports expressions via =exp and supports labeling and recasting a regenerated variable.


6 Acknowledgment

I am most grateful to Nick Cox for clarifying the original presentation of cmpute and for providing helpful comments on the manuscript, which have led me to significant improvements.

7 References

Gleason, J. R. 1997. dm50: Defining variables and recording their definitions. Stata Technical Bulletin 40: 9–10. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 48–49. College Station, TX: Stata Press.

———. 1999. dm50.1: Update to defv. Stata Technical Bulletin 51: 2. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 14–15. College Station, TX: Stata Press.

About the author

Patrick Royston is a medical statistician with more than 30 years of experience, with a strong interest in biostatistical methods and in statistical computing and algorithms. He works largely in methodological issues in the design and analysis of clinical trials and observational studies. He is currently focusing on alternative outcome measures in trials with a time-to-event outcome; on problems of model building and validation with survival data, including prognostic factor studies and treatment-covariate interactions; on parametric modeling of survival data; and on novel clinical trial designs.


The Stata Journal (2013) 13, Number 4, pp. 867–875

group2: Generating the finest partition that is coarser than two given partitions

Christian H. Salas Pauliac
Department of Public Policy

University of Chicago
Chicago, IL

[email protected]

Abstract. In this article, I develop a useful interpretation of the function group() based on partitions belonging to mathematical set theory, an interpretation that in turn engenders a related command here called group2. In the context of the partitioning of sets, while the function group() creates a variable that generates the coarsest partition that is finer than the finest partition generated by the variables used as arguments, the group2 command will create a variable that generates the finest partition that is coarser than the coarsest partition generated by the variables used as arguments. This latter operation has proven very useful in several problems of database management. An introduction of this new command in the context of mathematical partitions is provided, and two examples of its application are presented.

Keywords: dm0073, group2, partitions, group, egen

1 Introduction

The egen function group() generates a variable that takes numerical values that indicate "groups" of observations generated by the list of variables in the varlist. In this context, a group is understood to be observations that share the same value for every one of the variables in varlist. In other contexts, however, a group might be understood to be observations that share the same value for any one of the variables in varlist. An alternative interpretation of these two contexts is based on mathematical partitions, where the total number of observations in a database is understood to be a set and each variable is understood to be a partition whose cells are defined by the different values of each variable. Groups in the first context generate the coarsest partition that is finer than the finest partition generated by the variables in varlist; groups in the second context generate the finest partition that is coarser than the coarsest partition generated by the variables in varlist.

In this article, I will introduce a simple command called group2, which generates a variable that takes numerical values that indicate the groups as understood in the latter case. I will do so by relating both the group() function and the group2 command to mathematical set theory and by motivating the use of the new command with two real-life examples.

© 2013 StataCorp LP dm0073


2 A set-theory interpretation of the function group()

A brief overview of partitions in the context of set theory will come in handy. Let X be a set of elements. The set P of nonempty subsets A, B, . . . is a partition of X if and only if the following two conditions hold for any A, B ∈ P, where A ≠ B:

    ⋃ P = X
    A ∩ B = ∅

Subsets A, B, . . . of partition P are usually called cells of P. Simply, a partition of a set X is a fragmentation or grouping of the elements of X; different partitions will generate different fragmentation patterns. Let P1 and P2 be two different partitions of X. Partition P1 is said to be finer than partition P2 (and P2 coarser than P1) if every cell of P1 is a subset of some cell of P2. In other words, P1 is a further fragmentation of P2. For example,

    X = {a, b, c, d, e, f}
    P1 = {{a, b}, {c}, {d, e, f}}
    P2 = {{a, b, c}, {d, e, f}}

Sometimes, such a relation cannot be established between two partitions. For example, partitions P3 and P4 group the elements of X in a manner such that two cells of one partition overlap with more than one cell of the other partition. Two useful operations in this case are to find the coarsest partition that is finer than the finest of the two original partitions (call this P5) and to find the finest partition that is coarser than the coarsest of the two original partitions (call this P6). Intuitively, P5 cuts through the cell overlaps, generating a partition whose cells are contained in no more than one cell of the original partitions, and P6 combines the cell overlaps so that all the cells of the original partitions are contained in no more than one cell of P6. For example,

    P3 = {{a, b}, {c, d}, {e, f}}
    P4 = {{a}, {b, c}, {d}, {e, f}}
    P5 = {{a}, {b}, {c}, {d}, {e, f}}
    P6 = {{a, b, c, d}, {e, f}}

How does this framework relate to the Stata function group()? Consider the total number of observations in a database to be the set X. Each variable of the database can be interpreted to be a partition P of the set X, where the cells of the partition are defined by the different values of each variable. For example, if one of the variables in the database takes the values 0 and 1, this variable generates a partition of X consisting of two cells, one that contains all observations for which this variable is equal to 0 and one that contains all observations for which it is equal to 1.

When the function group() takes two or more variables as arguments, it generates the coarsest partition that is finer than the finest partition generated by these arguments.


The following example serves as illustration. Say that we are working with a database containing survey information on households and that we are interested in two variables: location, indicating whether it is an urban or a rural household, and gender, indicating whether the head of the household is male or female. Each of these variables generates a different partition of the total number of observations, each according to a different criterion. When using these two variables as arguments, the group() function will generate a numerical variable identifying four groups containing all the possible types of households using these two variables: urban–male, urban–female, rural–male, and rural–female.1 This new variable effectively generates the coarsest partition that is finer than the finest partition generated by the location and gender variables.

1. Actually, the group() function may generate two, three, or four groups depending on the nature of the original partitions. For example, if all urban households have a male head of the family and all rural households have a female head of the family, then the two partitions generated by these variables are equivalent, and the group() function will only replicate such partitions. On the other hand, if all urban households have a male head of the family but rural households have both male and female heads of household, then the group() function will generate three groups.

3 The group2 command

3.1 Syntax

The syntax for the group2 command is as follows:

group2 varlist

Exactly two variables must be specified in varlist. No options are allowed.

3.2 Description

If we consider the sample of observations in a dataset to be a particular set, the group2 command creates a variable that generates the finest partition (of this set) that is coarser than the coarsest partition generated by the two variables in varlist.

Exactly two variables must be specified in varlist. Both numerical and string variables are allowed. Missing values in varlist (either . or "") are treated as if each one were a unique value, thereby indicating separate partitions. If n variables are needed in varlist, where n > 2, the command needs to be applied n − 1 times, where the third variable will be run with the outcome variable from applying the command to the first two variables, and so forth.
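For example, with three variables, the chaining might look like the following sketch. I assume here, purely for illustration, that group2 stores its result in a new variable named group2 (as the column headings in tables 1–3 suggest), so the intermediate result is renamed before the second pass:

. group2 var1 var2
. rename group2 g12
. group2 g12 var3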

3.3 Remarks

The group2 command generates a variable that takes numerical values, each one indicating a different group of observations, just as the egen function group() does. However, while group() understands a group to be all observations that share the same value for every one of the variables in varlist, group2 understands a group to be all observations that share the same value for any one of the variables in varlist. In other words, group() creates a variable that generates the coarsest partition (of the set of observations) that is finer than the finest partition generated by the variables in varlist; group2 creates a variable that generates the finest partition (of the set of observations) that is coarser than the coarsest partition generated by the variables in varlist.

Table 1 compares the use of the group() function with the group2 command as applied to two variables (var1 and var2) of a fictitious sample.

Table 1. group() versus group2

var1   var2   group()   group2

 1      A        1         1
 1      B        2         1
 2      B        3         1
 2      C        4         1
 3      D        5         2
 3      D        5         2

4 Example

Several contexts in database management will require the finest partition that is coarser than the coarsest partition generated by two or more variables. The new command group2 will perform such an operation. Let us illustrate two practical applications of this command with two examples that in fact motivated the creation of the program.

4.1 Example 1: Generating identification from several sources

Let us imagine a database containing information on employed individuals (henceforth called workers), where each observation is a worker and each variable is a different characteristic of the worker. We know that several workers are duplicated in the database; that is, several observations may be referring to the same worker. This might happen in several contexts. One example is when appending several databases containing different subsets of a population and where the intersection of these subsets might not always be empty. Another example happens in network databases in a panel form, where each principal declares several network members and where one network member might be declared by more than one principal. In a long panel form where the principal is duplicated as many times as the number of network members it has, several network members might be repeated.


We wish to create a unique ID for each worker; note that this ID variable will generate a partition of the database where each cell will contain observations corresponding to the same individual. A simple way of doing this is to identify duplicates in terms of all variables. Unfortunately, if some of the variables that would be useful to identify the worker (for example, home address) have missing values or are likely to be misspelled for some of the observations, none (or very few) of the observations will be an exact duplicate of any other in terms of all the variables.

An alternative way of generating this ID is to take a subset of the variables, one that is certain to uniquely identify a worker, and spot all duplicated observations in terms of this subset. The ID created in this exercise will generate a partition where each cell will contain observations corresponding to the same worker in terms of this subset of variables. The problem with this strategy is that there might be different subsets of variables uniquely identifying a worker that might generate ID assignments not equivalent to one another. If all ID assignments are right, we need an operation that uses the information contained in every assignment to produce an overall unique identification.

Such an operation is performed by the group2 command. Each ID assignment generates a partition over the set of workers in the sample; we know that each cell in every partition contains observations corresponding to a unique individual. If partitions are not equivalent, then at least one cell of one partition will contain, be contained by, or cross one or more cells of another partition. For cells in this situation, one partition, Pa, will inform us that individuals contained in two different cells belonging to another partition, Pb, are the same worker. This will happen to several cells in every partition available. Therefore, the overall unique identification will emerge when all crossing cells are combined, which is precisely the finest partition that is coarser than the coarsest partition generated by the separate ID assignments.

Say we have four variables that provide information on workers: the name of the institution where the worker is employed (workplace), the worker's full name (name), the city where the worker lives (city), and the worker's home address (homeaddress). We know for certain that two subsets of these variables will uniquely identify every worker in the database: workplace–name and city–homeaddress. Using these two subsets, we generate the variables id1 and id2, which uniquely identify workers in terms of each subset of variables.2 Table 2 shows an extract of this database.

2. Variable id1, for example, may be generated by typing egen id1 = group(workplace name).


Table 2. Generating identification from several sources

n   workplace                                  name                   city     homeaddress    id1   id2   group()   group2
1   Arthur Guinness & Son                      William Gosset         Dublin   100 A Street    1     1       1         1
2   Arthur Guinness & Son                      William Gosset         Bublin   100 A Street    1     2       2         1
3   Federal Office for Intellectual Property   Albert Einstein        Bern     200 B Street    2     3       3         2
4   Federal Office for Intellectual Property   Albert Einstein        Bern     200 B Street    2     3       3         2
5   Federal Office for Intellectual Property   Albert Einstein        -        200 B Street    2     4       4         2
6   Guinness brewery                           William Sealy Gosset   Dublin   100 A Street    3     1       5         1
7   Guinness brewery                           William Sealy Gosset   Dublin   -               3     5       6         1


From table 2, it is clear that observations 1, 2, 6, and 7 are the same person, and observations 3, 4, and 5 are another person. The variable id1 successfully identifies the second individual, yet it fails to fully identify the first one, instead indicating that observations 1 and 2 are one person and 6 and 7 are another because both the workplace's name and the worker's name are spelled differently among those two subsets of observations. On the other hand, because city and homeaddress contain both misspellings and missing values, variable id2 is only able to identify that observations 1 and 6 are the same worker and that observations 3 and 4 are the same worker.

The group2 command combines the information provided from both id1 and id2 to generate a complete identification. First, because id1 shows that observations 1 and 2 are the same worker and id2 shows that observations 1 and 6 are the same worker, then group2 knows that observations 1, 2, and 6 are the same worker. Yet id1 also shows that observations 6 and 7 are the same worker; therefore, observations 1, 2, 6, and 7 must be the same worker. Second, id1 shows that 3, 4, and 5 are the same worker, and even though id2 shows that observation 5 is a different worker, group2 ignores this because any of the matches are sufficient for identification.

The group() function, when using id1 and id2 as arguments, is also shown for comparison.
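The transitive merging just described (obtained here with the single call group2 id1 id2) can also be pictured as a fixed-point computation. The following fragment is a minimal sketch of that idea, not the group2 source code: it repeatedly propagates the smallest group label within id1 cells and then within id2 cells until nothing changes. Unlike group2, the sketch re-sorts the data and does not give missing values the special treatment described in section 3.2:

generate long g = _n
local changed = 1
while `changed' {
    tempvar old
    quietly generate long `old' = g
    quietly bysort id1 (g): replace g = g[1]
    quietly bysort id2 (g): replace g = g[1]
    quietly count if g != `old'
    local changed = r(N)
    drop `old'
}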

4.2 Example 2: Network identification

Let us illustrate the use of the group2 command by looking at a different kind of problem: identification of network overlaps. Social networks are a major influence in today's economic activity, and promising research is taking place on this topic. A common challenge when managing these datasets is the identification of complete networks, where declarations of individual networks must be combined to calculate the size of the overall network.

Imagine a database that contains information on individuals, the places that they have visited in the last two days, and a variable indicating whether, before these two days, they were infected with a highly contagious virus. We are interested in knowing which places might have been left infected by the individuals carrying the virus and, therefore, which individuals were at risk of having been infected by visiting these places.

In this example, we understand a network to be different individuals connected through commonly visited places. Columns 1 to 3 of table 3 contain an extract of this database. These columns tell us that there are six individuals to be considered, that each individual visited three places in the last two days, and that only individual 1 was originally infected. From this small extract, we can identify the network manually: one of the places individual 1 visited, place C, was also visited by individual 2, so he or she is possibly infected; furthermore, individual 4 visited a place that individual 2 also visited, place D, so individual 4 may also be infected. In addition, we can see that individuals 3 and 6 both visited place H and that individual 5 visited three places no other individual visited.


In the presence of a larger dataset, the group2 command will carry out the identification automatically: the command will group all observations that either share the same individual id or share the same place id. The outcome variable of this identification has been placed in the fourth column of table 3.

Table 3. Network identification

individual id   place id   virus   group2

      1            A         1       1
      1            B         1       1
      1            C         1       1
      2            C         0       1
      2            D         0       1
      2            E         0       1
      3            F         0       3
      3            G         0       3
      3            H         0       3
      4            D         0       1
      4            I         0       1
      4            J         0       1
      5            K         0       2
      5            L         0       2
      5            M         0       2
      6            H         0       3
      6            N         0       3
      6            G         0       3
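For reference, the grouping in the last column would be produced by a single call along the lines of the following (assuming the first two columns are held in variables named individual_id and place_id; the names are hypothetical):

. group2 individual_id place_id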

5 Discussion

In this article, I presented an alternative interpretation of the group() function based on set theory, an interpretation that in turn engenders a related command, here called group2. While the group() function creates a variable that generates the coarsest partition that is finer than the finest partition generated by the variables used as arguments, the group2 command will create a variable that generates the finest partition that is coarser than the coarsest partition generated by the variables used as arguments. The operation performed by this new command has been very useful in a number of database management problems, and sharing it became a natural step.

A future contribution will be to allow for this command to use more than two variables in varlist; in the meantime, the command has to be applied n − 1 times for n variables. An additional contribution will be to add an option that allows the user to choose whether missing values are to be treated as different, unique values (the only current alternative) or as one particular value common to all missing values when partitioning the set of observations. To the extent that they are needed, other more sophisticated set operations might be translated to functions for database management.

About the author

Christian Salas Pauliac is studying for a PhD in Public Policy at the University of Chicago, Chicago, IL.


The Stata Journal (2013) 13, Number 4, pp. 876–883

A score test for group comparisons in single-index models

Paulo Guimaraes
Universidade do Porto
Porto, Portugal
[email protected]

Abstract. In this article, I derive a score test for the equality of one or more parameters across groups of observations following estimation of a single-index model. The test has a wide array of applications and nests Pearson's chi-squared test as a particular case. The postestimation command scoregrp implements the test and works with logit, logistic, probit, poisson, or regress (see [R] logit, [R] logistic, [R] probit, [R] poisson, and [R] regress). Finally, I show some applications of the test.

Keywords: st0321, scoregrp, score test, logit, logistic, probit, Poisson, regress

1 Introduction

In many practical situations after estimation of a regression model, there is interest in performing a test for equality of one or more parameters across groups of observations. The analysis of variance (ANOVA) and analysis of covariance models are probably the better-known examples, but many other situations fall under this general description. For example, one may want to implement a test to decide whether to include a factor variable or an interaction with a factor variable as a regressor. Other generic examples are tests of structural change when one wants to decide whether to impose a single model on the pooled data or estimate the model in each subsample separately. Yet another example is the situation where one wants to decide whether a panel-data estimator or even a mixed model is more appropriate. Some goodness-of-fit tests are also based on the comparison of estimated parameters across groups of observations (for example, the goodness-of-fit test for the logistic regression proposed by Tsiatis [1980]).

For models estimated by maximum likelihood, three asymptotically equivalent tests may be used for hypothesis testing: the likelihood-ratio test (LRT), the Wald test, or the score (or Lagrange multiplier) test. The score test has the advantage of requiring only estimation of the restricted model, that is, estimation of the model under the null hypothesis. This advantage is particularly relevant in situations when it becomes computationally expensive to estimate the unrestricted model. The score test has better small-sample properties than the Wald test (Boos 1992; Fears, Benichou, and Gail 1996) and is effective relative to the LRT (Godfrey 1981). However, in practice, the score test is rarely used because it lacks a general estimation command such as Stata's lrtest (see [R] lrtest) for the LRT or test (see [R] test) for the Wald test.

© 2013 StataCorp LP st0321


As we will see, a score test for the equality of parameters across groups following estimation of single-index models is easy to implement. Moreover, I will also show that for some particular situations, this test is identical to Pearson's chi-squared test applied to individual-level data. In the following section, I derive the test and show its relation with Pearson's chi-squared test. Next I present the Stata command scoregrp, which implements the test after estimation with logit, logistic, probit, poisson, or regress (see [R] logit, [R] logistic, [R] probit, [R] poisson, and [R] regress). Finally, I illustrate the use of scoregrp in some examples.

2 Score tests for group effects

2.1 The score test

Suppose that we have specified a probability model for a dependent variable Y and have a collection of n independent and identically distributed observations. Further, admit that the observations of Y may be classified into G mutually exclusive groups, each group with n_g observations and g = 1, . . . , G. Assume that for the ith observation of group g, the expected value of Y is a known function of μ_ig; that is, E(y_ig) = g(μ_ig). The index μ_ig is a linear combination of covariates; that is, μ_ig = x′_ig θ, where x_ig is a vector of the observed covariates for the ith observation on group g, and θ′ = [θ_1, θ_2, . . . , θ_k] is a k × 1 vector of unknown parameters associated with the x covariates.

If we let the known density function for Y be represented by f(y; θ), then we can write the likelihood function as

    L(\theta; \mathbf{Y}) = \prod_{g=1}^{G} \prod_{i=1}^{n_g} f(\theta; y_{ig})  \qquad (1)

where y_ig is the ith observation of Y on group g. The maximum likelihood estimates are the values of θ that maximize (1). They are obtained by solving the k equations that result from differentiating the logarithm of the likelihood function with respect to θ. Thus the maximum likelihood estimates θ̂ are those values of θ such that

    s(\theta) = \sum_{g=1}^{G} \sum_{i=1}^{n_g} s_{ig}(\theta) = \sum_{g=1}^{G} \sum_{i=1}^{n_g} \frac{\partial \ln f(\theta; y_{ig})}{\partial \mu} \, x_{ig} = 0  \qquad (2)

For θ̂ to be a maximum likelihood estimate, the matrix of second derivatives of the log-likelihood function, the Hessian matrix, evaluated at θ̂, must be negative definite. This matrix equals

    H = \sum_{g=1}^{G} \sum_{i=1}^{n_g} H_{ig} = \sum_{g=1}^{G} \sum_{i=1}^{n_g} \frac{\partial^2 \ln f(\theta; y_{ig})}{\partial \mu^2} \, x_{ig} x'_{ig}  \qquad (3)

The vectors s and s_ig have a dimension of k × 1. To refer to the element of the vector s_ig that is associated with a specific coefficient, say, coefficient θ_j, we will use the generic notation s_θj,ig. Similarly, H and H_ig are k × k matrices, and the notation h_θjθl,ig will refer to the specific element of matrix H_ig that corresponds to the coefficients θ_j and θ_l. At times, I will give a different interpretation to a subscripted matrix, but the intended meaning should be clear from the context.

Suppose now that one wants to test the equality of a subset of the parameters (say, a total of k1 parameters) across groups of observations. Without loss of generality, admit that θ′ = [α′, β′] and that α is a vector containing all the parameters to be tested. Our null hypothesis is then

    H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_G

Implementing Rao’s score test for this hypothesis leads to the statistic

    T = s(\vartheta)' \left\{ -H(\vartheta) \right\}^{-1} s(\vartheta)  \qquad (4)

where s(ϑ)′ is a score vector calculated with respect to all the coefficients implied by the alternative hypothesis but evaluated at the maximum likelihood solution obtained under the null hypothesis. Thus ϑ′ = [α′_1, α′_2, . . . , α′_G; β′] is the "expanded" set of coefficients that is consistent with the alternative hypothesis. Under the null hypothesis, the score test in (4) is asymptotically approximated by a chi-squared distribution with k1(G − 1) degrees of freedom. Partitioning the score vector and Hessian matrix in (4) with respect to the two sets of coefficients, α and β, we can rewrite (4) as

    T = -\begin{bmatrix} s_\alpha \\ s_\beta \end{bmatrix}' \begin{bmatrix} H_{\alpha\alpha} & H_{\alpha\beta} \\ H_{\beta\alpha} & H_{\beta\beta} \end{bmatrix}^{-1} \begin{bmatrix} s_\alpha \\ s_\beta \end{bmatrix}

The second set of score values evaluated at the restricted estimates is 0; thus s_β(ϑ) = 0. Hence, using the well-known result on the inverse of partitioned matrices, we can rewrite (4) as

    T = -s_\alpha' \left\{ H_{\alpha\alpha} - H_{\alpha\beta} H_{\beta\beta}^{-1} H_{\beta\alpha} \right\}^{-1} s_\alpha  \qquad (5)
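For reference, the step from the partitioned form to (5) uses the standard identity for the (1,1) block of the inverse of a partitioned nonsingular matrix,

    \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A - BD^{-1}C)^{-1} & \cdot \\ \cdot & \cdot \end{bmatrix}

together with s_β(ϑ) = 0, so that only the (1,1) block of the inverted Hessian survives in the quadratic form.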

The important thing to note is that all matrices in (5) are easily obtained following estimation of the restricted model. The matrix H_ββ is the Hessian matrix of the restricted model obtained by excluding the rows and columns corresponding to the parameters in α. The other matrices are obtained as partial sums of the observation-level components of the Hessian and gradient vectors shown in (2) and (3). The expression for the score test in (5) may be presented in an alternative way, which will prove useful in subsequent analysis. Using a known result on matrix identities (see, for example, Demidenko [2004, 651]), we can restate the test statistic in (5) as

    T = -s_\alpha' \left\{ H_{\alpha\alpha}^{-1} + H_{\alpha\alpha}^{-1} H_{\alpha\beta} \left( H_{\beta\beta} - H_{\beta\alpha} H_{\alpha\alpha}^{-1} H_{\alpha\beta} \right)^{-1} H_{\beta\alpha} H_{\alpha\alpha}^{-1} \right\} s_\alpha

or more succinctly as

    T = -s_\alpha' H_{\alpha\alpha}^{-1} s_\alpha + \Delta  \qquad (6)


where

    \Delta = -s_\alpha' H_{\alpha\alpha}^{-1} H_{\alpha\beta} \left( H_{\beta\beta} - H_{\beta\alpha} H_{\alpha\alpha}^{-1} H_{\alpha\beta} \right)^{-1} H_{\beta\alpha} H_{\alpha\alpha}^{-1} s_\alpha

The matrix H_αα is block diagonal; it is thus easily invertible regardless of the number of groups because it only requires the inversion of the diagonal matrices that have dimension k1. The other matrix that needs to be inverted has the dimension of β (that is, a dimension equal to the number of covariates not tested). In practical applications, it may be simpler to define a matrix G with dimensions n × G, where the gth column is a vector with elements that take the value 1 if the observation belongs to group g and 0 otherwise. Letting X be a matrix containing all covariates in the model and M be a diagonal matrix with generic element h_αα,ig, then we can write all the matrices that go into the formula for the test as H_αα = G′MG, H_αβ = G′MX, and H_ββ = X′MX.
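As a concrete illustration, the following Mata fragment is a minimal sketch of how (5) could be computed from these building blocks. It is not the scoregrp source code; it assumes that the n × G indicator matrix (Gm below), the n × k covariate matrix X, the column vector m holding the diagonal of M (generic element h_αα,ig), and the column vector s of observation-level scores s_α,ig have already been formed:

mata:
real scalar score_stat(real matrix Gm, real matrix X,
                       real colvector m, real colvector s)
{
    real matrix Haa, Hab, Hbb
    real colvector sa

    Haa = Gm' * (Gm :* m)     // H_alpha,alpha = G'MG
    Hab = Gm' * (X :* m)      // H_alpha,beta  = G'MX
    Hbb = X' * (X :* m)       // H_beta,beta   = X'MX
    sa  = Gm' * s             // group-level partial sums of the scores
    // statistic in (5); invsym() zeroes out redundant directions (for
    // example, when X contains a constant), acting as a generalized inverse
    return(-sa' * invsym(Haa - Hab * invsym(Hbb) * Hab') * sa)
}
end

For the group-specific-intercept case sketched here (k1 = 1), the resulting statistic would be referred to a chi-squared distribution with G − 1 degrees of freedom.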

2.2 Relationship with the Pearson χ2 statistic

To further explore the relation with the Pearson χ2 statistic, let us now consider the situation where one wants to test whether the constant of a regression model differs across groups. In this case, k1 = 1 and we can rewrite (6) as

    T = -\sum_{g=1}^{G} \frac{s_{\alpha,\bullet g}^{2}}{h_{\alpha\alpha,\bullet g}} + \Delta  \qquad (7)

where the symbol "•" is used to represent a summation across all elements of i. The above expression makes obvious the relationship between our test and Pearson's χ2 test. Without covariates, Δ = 0, and the score test for the equality of the intercept across groups of observations becomes the Pearson χ2 test.

Poisson regression

Consider a typical Poisson regression model with expected value

    \lambda_{ig} = \exp(\alpha + x'_{ig}\beta)

To implement the score test in the Poisson regression model, we need to note that the generic elements for the score vector are s_α,ig = y_ig − λ_ig and for the M matrix are h_αα,ig = −λ_ig. If the regression model has no covariates, then λ̂_ig = ȳ and Δ = 0. If we plug these values into (7), then we obtain the well-known Pearson χ2 test for count data:

    T = \sum_{g=1}^{G} \frac{n_g (\bar{y}_g - \bar{y})^2}{\bar{y}}


Logit regression

Consider now a typical logistic regression with binary dependent variable:

    \text{Prob}(y_{ig} = 1 \mid \mathbf{x}) = \Lambda_{ig} = \frac{\exp(\alpha + x'_{ig}\beta)}{1 + \exp(\alpha + x'_{ig}\beta)}

Now the generic elements for the score vector are s_α,ig = y_ig − Λ_ig and for the M matrix are h_αα,ig = −Λ_ig(1 − Λ_ig). If we let p denote the proportion of 1s in the total sample and let p_g denote the proportion of 1s in each subgroup, then in a model without covariates, the test simplifies to

    T = \sum_{g=1}^{G} \frac{n_g (p_g - p)^2}{p(1 - p)}  \qquad (8)

which is the known Pearson χ2 test for binary data.

Linear regression

Finally, let us consider a typical linear regression model such as

    y_{ig} = \alpha + x'_{ig}\beta + u_{ig}

where u_ig is normally independent and identically distributed with 0 expected value and variance equal to σ². The elements of the score vector are s_α,ig = (y_ig − ŷ_ig)/σ² and those of the Hessian are h_αα,ig = −σ⁻². Without covariates, ŷ_ig = ȳ and the test simplifies to

    T = \sum_{g=1}^{G} \frac{n_g (\bar{y}_g - \bar{y})^2}{\widehat{\sigma}^2}  \qquad (9)

which is identical to the one-way ANOVA formula. However, in this circumstance, the test will not produce the same result as the usual ANOVA because the test uses the maximum likelihood estimate of σ². As a curiosity, I note that when applied to binary data, (9) produces the same results as the Pearson test for binary data in (8).

3 The scoregrp command

The scoregrp command is a user-written command for Stata that implements the test described above after estimation with the commands logit, logistic, probit, poisson, or regress. It is partially implemented in Mata and requires installation of the user-written command matdelrc, programmed by Nicholas J. Cox. Because of the way scoregrp is programmed, the command should work well in situations when the number of groups is very large. Additionally, incorporating other single-index models into scoregrp should be a straightforward task requiring only the coding of the score and Hessian for the new models.


3.1 Syntax

The command has a very simple syntax:

scoregrp [indepvars], group(varname) [nocons]

The argument indepvars consists of a list of the variables whose coefficients we want to test. By default, it is assumed that the constant is included among indepvars, but we can exclude it with the nocons option.

3.2 Options

group(varname) specifies the variable that identifies the group. group() is required.

nocons specifies that the constant not be included among the coefficients to be tested.

4 Examples

To illustrate the use of scoregrp, let us use union.dta, downloaded from the Stata website. After reading in the data, we start by implementing Pearson's χ2 to test whether the proportion of unionized individuals remains constant over time.

. webuse union
(NLS Women 14-24 in 1968)

. tabulate union year, nofreq chi2
Pearson chi2(11) = 107.8144   Pr = 0.000

The same result is obtained if we run a logit regression without covariates and test for differences in the constant term across years.

. quietly logit union

. scoregrp, group(year)

Score test for logit regression

Test result is chi(11) = 107.8144 Pr = 0.0000

Next let us consider a logit regression with three covariates, age, grade, and black, and again test for differences in the constant term across years.

. quietly logit union age grade black

. scoregrp, group(year)

Score test for logit regression

Test result is chi(11) = 97.6887 Pr = 0.0000


The results clearly reject the null hypothesis, and thus we include yearly dummy variables in the logit regression. In the following, we check whether to include an interaction between the variables grade and black.

. quietly tabulate year, generate(y)

. quietly logit union age grade black y1-y11

. scoregrp grade, group(black) nocons

Score test for logit regression

Test result is chi(1) = 0.6684 Pr = 0.4136

The hypothesis that the coefficient on the interaction is 0 is not rejected. The following test is akin to a test of permanence of structure and compares whether all coefficients in the two subsamples defined by the variable south are identical.

. quietly logit union age grade black y1-y11

. scoregrp age grade black y1-y11, group(south)

Score test for logit regression

Test result is chi(15) = 789.5522 Pr = 0.0000

Finally, we use scoregrp to test whether one should account for unobserved heterogeneity across individuals.

. quietly logit union age grade black y1-y11

. scoregrp, group(idcode)

Score test for logit regression

Test result is chi(4433) = 1.46e+04 Pr = 0.0000

The results suggest that the data have substantial unobserved heterogeneity.

5 Conclusion

In this article, I derived a score test to check whether one or more coefficients differ across groups of observations following estimation of a single-index model. The user-written command scoregrp is a Stata implementation of the test. The present version of the command works after logit, logistic, probit, poisson, or regress and may be easily extended to other single-index models.

For many practical applications, scoregrp offers no computational advantage and can be slower than existing alternatives based on LRT or Wald tests. But with large datasets and particularly when the unrestricted model is complex (for example, a random-effects or mixed model), then scoregrp is likely to be the faster approach. Researchers may also want to use scoregrp in situations when an LRT or a Wald test is not an option. Consider the cases of panel-data estimators for logit and Poisson regression with fixed effects. In these cases, a Wald or an LRT test to check whether one should include the fixed effect is not possible, because the alternative model is estimated by conditional maximum likelihood. As shown earlier, implementation of the test with scoregrp is straightforward.


6 Acknowledgment

I thank Joao Santos Silva for helpful comments on a previous version of this article.

7 References

Boos, D. D. 1992. On generalized score tests. American Statistician 46: 327–333.

Demidenko, E. 2004. Mixed Models: Theory and Applications. New York: Wiley.

Fears, T. R., J. Benichou, and M. H. Gail. 1996. A reminder of the fallibility of the Wald statistic. American Statistician 50: 226–227.

Godfrey, L. G. 1981. On the invariance of the Lagrange multiplier test with respect to certain changes in the alternative hypothesis. Econometrica 49: 1443–1455.

Tsiatis, A. A. 1980. A note on a goodness-of-fit test for the logistic regression model. Biometrika 67: 250–251.

About the author

Paulo Guimaraes is an associate professor in the Department of Economics at the University of Porto in Portugal.


The Stata Journal (2013) 13, Number 4, p. 884

Software Updates

st0247_1: Respondent-driven sampling. M. Schonlau and E. Liebau. Stata Journal 12: 72–93.

In addition to the original respondent-driven sampling (RDS) estimator, the Volz–Heckathorn estimator is now implemented in the rds command. The Volz–Heckathorn estimator (for proportions) is

    \hat{p} = \frac{\sum_{i \in I} \frac{1}{d_i}}{\sum_{j \in S} \frac{1}{d_j}}

where S is the full sample, I is the subpopulation of interest, and d_i and d_j are the self-reported degrees, or number of "friends", of respondents i and j, respectively. The original estimator is sometimes called RDS I, and the Volz–Heckathorn estimator is sometimes called RDS II. When one uses bootstrapping with rds, bootstrap results are reported for both estimators. How to use bootstrapping with the rds command is explained in the rds help file.
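For intuition, the estimator could be reproduced by hand along the following lines (a sketch only; insub and degree are hypothetical variable names for membership in I and for the self-reported degree, and this bypasses rds itself):

. generate double invdeg = 1/degree
. quietly summarize invdeg
. scalar denom = r(sum)
. quietly summarize invdeg if insub
. scalar numer = r(sum)
. display numer/denom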

st0301_1: Fitting the generalized multinomial logit model in Stata. Y. Gu, A. R. Hole, and S. Knox. Stata Journal 13: 382–397.

A bug has been fixed in the gmnlpred command, which could lead to incorrectly calculated predicted probabilities when the dataset contained missing values.

© 2013 StataCorp LP up0042