regularized calibrated estimation of propensity scores with...

Biometrika (2020), 107, 1, pp. 137–158 doi: 10.1093/biomet/asz059Printed in Great Britain Advance Access publication 3 December 2019

Regularized calibrated estimation of propensity scores withmodel misspecification and high-dimensional data

By Z. TAN

Department of Statistics, Rutgers University, 110 Frelinghuysen Road, Piscataway,New Jersey 08854, U.S.A.

[email protected]

Summary

Propensity scores are widely used with inverse probability weighting to estimate treatmenteffects in observational studies. We study calibrated estimation as an alternative to maximum like-lihood estimation for fitting logistic propensity score models. We show that, with possible modelmisspecification, minimizing the expected calibration loss underlying the calibrated estimatorsinvolves reducing both the expected likelihood loss and a measure of relative errors between thelimiting and true propensity scores, which governs the mean squared errors of inverse probabilityweighted estimators. Furthermore, we derive a regularized calibrated estimator by minimizingthe calibration loss with a lasso penalty. We develop a Fisher scoring descent algorithm for com-puting the proposed estimator and provide a high-dimensional analysis of the resulting inverseprobability weighted estimators, leveraging the control of relative errors of propensity scores forcalibrated estimation. We present a simulation study and an empirical application to demonstratethe advantages of the proposed methods over maximum likelihood and its regularization. Themethods are implemented in the R package RCAL.

Some key words: Calibrated estimation; Causal inference; Fisher scoring; Inverse probability weighting; Lasso penalty;Model misspecification; Propensity score; Regularized M-estimation.

1. Introduction

Inverse probability weighted estimation is one of the main methods of using propensity scoresto estimate treatment effects in observational studies. In practice, propensity scores are unknownand typically estimated from observed data by fitting logistic regression using maximum likeli-hood. However, inverse probability weighted estimation may suffer from large weights, causedby propensity scores being estimated close to 0 for a few treated subjects or close to 1 for afew untreated subjects. It has been argued that such methods may perform poorly even when thepropensity score model appears to be nearly correct (e.g., Kang & Schafer, 2007).

As an alternative to maximum likelihood estimation, calibrated estimation has been proposedfor fitting propensity score models. The basic idea is to use a system of estimating equationssuch that the weighted averages of the covariates in the treated subsample are equal to the simpleaverages in the overall sample. Subsequently, the fitted propensity scores can be used as usualin inverse probability weighted estimators or extensions. Such ideas and similar methods havebeen studied, and sometimes independently (re)derived, in various contexts of causal inference,missing-data problems and survey sampling (e.g., Folsom, 1991; Tan, 2010; Graham et al., 2012;Hainmueller, 2012; Imai & Ratkovic, 2014; Kim & Haziza, 2014; Vermeulen & Vansteelandt,2015; Chan et al., 2016; Yiu & Su, 2018).

©c 2019 Biometrika Trust

Dow

nloaded from https://academ

ic.oup.com/biom

et/article-abstract/107/1/137/5658668 by Rutgers U

niversity Libraries/Technical Services user on 01 July 2020

138 Z. Tan

This work aims to address two questions. Previously, calibrated estimation was studied eitherthrough numerical experiments or by establishing properties such as local efficiency and doublerobustness. The first question is whether, in the presence of possible model misspecification,any advantage can be formally established for calibrated estimation over maximum likelihoodestimation when fitting propensity score models for inverse probability weighting, regardless ofoutcome regression models. In addition, calibrated estimation was previously analysed with thenumber of covariates p either fixed or growing slowly under strong enough smoothness conditions(e.g., Chan et al., 2016). The second question is how to extend and analyse calibrated estimationwhen the number of covariates is close to or greater than the sample size.

We develop theory and methods to address the foregoing questions, with a logistic propen-sity score model. First, we establish a new relationship between the loss functions underlyingcalibrated and maximum likelihood estimation. From this result, we show that minimizing theexpected calibration loss involves reducing both the expected likelihood loss and, separately, ameasure of relative errors of the target, or limiting, propensity score, which then governs themean squared errors of inverse probability weighted estimators. Such control of relative errorsof propensity scores cannot be achieved by minimizing the expected likelihood loss alone.

Second, we derive a regularized calibrated estimator by minimizing the calibration losswith a lasso penalty (Tibshirani, 1996), in which calibration equations are relaxed to box con-straints. We develop a novel algorithm for computing the proposed estimator, exploiting quadraticapproximation (Friedman et al., 2010), Fisher scoring (McCullagh & Nelder, 1989) and themajorization-minimization technique (Wu & Lange, 2010). We also perform a high-dimensionalanalysis of the regularized calibrated estimator and the resulting inverse probability weightedestimators of population means, allowing possible model misspecification. Compared with pre-vious results based on regularized maximum likelihood estimation (e.g., Belloni et al., 2014), ourresults are established under weaker conditions, by carefully leveraging the control of relativeerrors of propensity scores in calibrated estimation.

In a companion paper (Tan, 2019a), we build on the work on propensity score estimation anddevelop regularized calibrated estimation using both propensity score and outcome regressionmodels. Both the computational algorithm and the high-dimensional analysis developed hereare directly used in Tan (2019a). In addition, the theoretical comparison presented here betweencalibrated and maximum likelihood estimation of propensity scores is of independent interest.

2. Background: causal inference

Suppose that the observed data consist of independent and identically distributed observations{(Yi, Ti, Xi) : i = 1, . . . , n} of (Y , T , X ), where Y is an outcome variable, T is a treatment variabletaking value 0 or 1, and X = (X1, . . . , Xd) is a vector of measured covariates. In the potentialoutcomes framework for causal inference (Neyman, 1923; Rubin, 1974), let (Y 0, Y 1) be potentialoutcomes that would be observed under treatments 0 and 1, respectively. For consistency, assumethat Y is either Y 0 if T = 0 or Y 1 if T = 1, that is, Y = (1 − T )Y 0 + TY 1. There are twocausal parameters commonly of interest: the average treatment effect, defined as E(Y 1 − Y 0) =μ1 − μ0 with μt = E(Y t) (t = 0, 1), and the average treatment effect on the treated, definedas E(Y 1 − Y 0 | T = 1) = ν1 − ν0 with νt = E(Y t | T = 1) (t = 0, 1). For concreteness, wemainly discuss estimation of (μ0,μ1) until § 7, where we discuss (ν0, ν1).

For identification of (μ0,μ1), we make the following two assumptions throughout: (i) T ⊥Y 0 | X and T ⊥ Y 1 | X (Rubin, 1976); (ii) 0 < pr(T = 1 | X = x) < 1 for all x. There are twobroad approaches to estimating (μ0,μ1), depending on additional modelling assumptions on theoutcome regression function m∗(t, X ) = E(Y | T = t, X ) or the propensity score (Rosenbaum& Rubin, 1983) π∗(X ) = pr(T = 1 | X ); see Tan (2007) for further discussion.

Dow


ic.oup.com/biom



Calibrated estimation of propensity scores 139

For the propensity score approach, consider a regression model of the form

pr(T = 1 | X ) = π(X ; γ ) = �{γ Tf (X )}, (1)

where �(·) is an inverse link function, f (x) is a vector of known functions, and γ is avector of unknown parameters. Typically, logistic regression is used with π(X ; γ ) = [1 +exp{−γ Tf (X )}]−1. Let γML be the maximum likelihood estimator of γ , which for logisticregression minimizes the average negative loglikelihood

�ML(γ ) = E(log[1 + exp{γ Tf (X )}] − T γ Tf (X )

). (2)

Throughout, E(·) denotes the sample average. The fitted propensity score, πML(X ) = π(X ; γML),can be used in various ways to estimate (μ0,μ1). In particular, we focus on inverse probabilityweighting, which is central to the semiparametric theory of estimation in missing-data problems.Two commonly used inverse probability weighted estimators for μ1 are

μ1IPW(πML) = E

{TY π−1

ML(X )}, μ1

rIPW(πML) = μ1IPW

/E{T π−1

ML(X )}.

Similarly, two inverse probability weighted estimators for μ0 are defined by replacing T with1−T and πML(X )with 1− πML(X ). If model (1) is correctly specified, then the above estimatorsare consistent under standard regularity conditions as n → ∞ with the dimension of γ fixed.

3. Calibrated estimation

3.1. Preliminaries

For concreteness, we assume that propensity score model (1) is a logistic regression model:

pr(T = 1 | X ) = π(X ; γ ) = [1 + exp{−γ Tf (X )}]−1, (3)

where f (x) = {1, f1(x), . . . , fp(x)}T is a vector of known functions including a constant andγ = (γ0, γ1, . . . , γp)

T is a vector of unknown parameters. Let γ 1CAL be an estimator of γ that

solves

E[{Tπ−1(X ; γ )− 1}f (X )] = 0. (4)

The fitted propensity score is π1CAL(X ) = π(X ; γ 1

CAL). Thenμ1 can be estimated by μ1IPW(π

1CAL)

or equivalently μ1rIPW(π

1CAL), with πML(X ) replaced by π1

CAL(X ). These two estimators areidentical because E{T/π1

CAL(X )} = 1 by (4) with a constant included in f (X ).Similarly, let γ 0

CAL be an estimator of γ solving

E([(1 − T ){1 − π(X ; γ )}−1 − 1

]f (X )

) = 0, (5)

and let π0CAL(X ) = π(X ; γ 0

CAL). Then μ0 can be estimated by μ0IPW(π

0CAL) or equivalently

μ0rIPW(π

0CAL), with πML(X ) replaced by π0

CAL(X ), where the equivalence of these two estimatorsfollows from the fact that E[(1 − T )/{1 − π0

CAL(X )}] = 1 by (5) with a constant included inf (X ). See § 7 for remarks on the unusual fact that two different sets of fitted propensity scores,π1

CAL(X ) and π0CAL(X ), are used for estimating μ1 and μ0, respectively.

Following the survey literature (Folsom, 1991), (4) is referred to as the calibration equationfor the treated (i.e., treatment 1), because the inverse probability weighted average of f (Xi) overthe treated group {i : Ti = 1, i = 1, . . . , n} is calibrated to the average of f (Xi) over the entiresample including the treated and untreated. Similarly, (5) is referred to as the calibration equation

Dow


ic.oup.com/biom



140 Z. Tan

for the untreated (i.e., treatment 0). The resulting estimators γ 1CAL and γ 0

CAL are called calibratedestimators of γ , in contrast to the maximum likelihood estimator γML. The fitted values π1

CAL(X )and π0

CAL(X ) are also called calibrated propensity scores.Calibration equations have been proposed in various contexts. The idea seems to have been

explored first by Folsom (1991). The equations in (4) can be deduced from Tan (2010, § 4.4) witha degenerate propensity score model. Moreover, (4) has been used by Graham et al. (2012), Kim& Haziza (2014) and Vermeulen & Vansteelandt (2015) to develop locally efficient and doublyrobust estimators, and was obtained by Chan et al. (2016) in a dual formulation. As shown in § 7,(5) leads to the same estimator as in entropy balancing (Hainmueller, 2012).

3.2. Comparison between calibrated and maximum likelihood estimation

We compare the loss functions for calibrated and maximum likelihood estimation of propen-sity scores and the associated inverse probability weighted estimators in the limiting case. Thecalibrated estimator γ 1

CAL can be equivalently defined as a minimizer of the loss function

�CAL(γ ) = E[T exp{−γ Tf (X )} + (1 − T )γ Tf (X )

]. (6)

In fact, setting the gradient of �CAL(γ ) to zero is easily shown to yield the calibration equation(4) with logistic π(X ; γ ). Moreover, �CAL(γ ) is convex in γ and is strictly convex and boundedfrom below under a certain nonseparation condition, as discussed in the Supplementary Material.Previously, the loss function �CAL and similar ones were mainly used as a computational device(Tan, 2010, § 4.4; Vermeulen & Vansteelandt, 2015).

First, we establish an interesting relationship between the likelihood and calibration loss func-tions �ML(γ ) and �CAL(γ ). To allow for misspecification of model (3), we write �ML(γ ) =κML(γ

Tf ) and �CAL(γ ) = κCAL(γTf ), where for a function g(x),

κML(g) = E(log

[1 + exp{g(X )}] − Tg(X )

), (7)

κCAL(g) = E[T exp{−g(X )} + (1 − T )g(X )

]. (8)

Then κML(g∗) and κCAL(g∗) are well-defined for the true log odds ratio g∗(x) = log[π∗(x)/{1−π∗(x)}], even when model (3) is misspecified, i.e., when g∗(x) is not of the form γ Tf (x). It caneasily be shown that both κML(g) and κCAL(g) are convex in g. For two functions g(x) and g′(x),consider the Bregman divergences associated with κML and κCAL:

DML(g, g′) = κML(g)− κML(g′)− 〈∇κML(g

′), g − g′〉,DCAL(g, g′) = κCAL(g)− κCAL(g

′)− 〈∇κCAL(g′), g − g′〉,

where g is identified as a vector (g1, . . . , gn) with gi = g(Xi), 〈∇κCAL(g′), g − g′〉 =n−1 ∑n

i=1

[(∂/∂g′

i){Ti exp(−g′i) + (1 − Ti)g′

i} × (gi − g′i)

], and 〈∇κML(g′), g − g′〉 is simi-

larly defined. For two probabilities ρ ∈ (0, 1) and ρ′ ∈ (0, 1), the Kullback–Liebler divergenceis L(ρ, ρ′) = ρ′ log(ρ′/ρ) + (1 − ρ′) log{(1 − ρ′)/(1 − ρ)} � 0. In addition, let K(ρ, ρ′) =ρ′/ρ − 1 − log(ρ′/ρ) � 0, which is strictly convex in ρ′/ρ with a minimum of 0 at ρ′/ρ = 1.

Proposition 1. (i) For any functions g(x) and g′(x) and the corresponding functions π(x) =[1 + exp{−g(x)}]−1 and π ′(x) = [1 + exp{−g′(x)}]−1, we have that

DML(g, g′) = E[L{π(X ),π ′(X )}],

DCAL(g, g′) = E

(T

π ′(X )[K{π(X ),π ′(X )} + L{π(X ),π ′(X )}]).

Dow


ic.oup.com/biom



https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/asz059#supplementary-data


(ii) As a result, we have that for any fixed value γ ,

E{�ML(γ )− κML(g∗)} = E

[L{π(X ; γ ),π∗(X )}], (9)

E{�CAL(γ )− κCAL(g∗)} = E

[K{π(X ; γ ),π∗(X )} + L{π(X ; γ ),π∗(X )}]. (10)

To examine the implications of Proposition 1, we first briefly describe some results fromthe classical theory of estimation in misspecified models (White, 1982; Manski, 1988). Understandard regularity conditions, as n → ∞ while p is fixed, the maximum likelihood estimator γMLcan be shown to converge in probability to a target, or limiting, value γML, which is defined as aminimizer of the expected loss E{�ML(γ )} or, equivalently, the Kullback–Liebler divergence (9).Similarly, γ 1

CAL can be shown to converge in probability to a target value γ 1CAL, which is defined

as a minimizer of the expected loss E{�CAL(γ )} or, equivalently, the calibration divergence(10). If model (3) is correctly specified, then both γML and γ 1

CAL coincide with γ ∗ such thatπ(· ; γ ∗) = π∗(·). However, if model (3) is misspecified, then γML and γ 1

CAL are in generaldifferent, determined by the minimization of the corresponding divergence, (9) or (10).

Next, we demonstrate how the mean squared error of an inverse probability weighted estimator,μ1

IPW(γ ) = E{TY /π(X ; γ )}, can be bounded from above by the likelihood and calibrationdivergences (9) and (10). Consider the mean squared relative error

msre(γ ) = E[Q{π(X ; γ ),π∗(X )}] = E

[{π∗(X )π(X ; γ )

− 1}2

],

where Q(ρ, ρ′) = (ρ′/ρ − 1)2 for two probabilities ρ ∈ (0, 1) and ρ′ ∈ (0, 1). This measure ofrelative errors of propensity scores can be seen to directly govern the mean squared error of theestimator μ1

IPW(γ ), independently of the outcome variable Y .

Proposition 2. Suppose that E{(Y 1)2 | X } � c and π∗(X ) � δ almost surely for someconstants c > 0 and δ ∈ (0, 1).

(i) For any fixed value γ ,

E[{μ1

IPW(γ )− μ1}2] � 2c

nδ+ c

(1 + 2

nδ

)msre(γ ). (11)

(ii) If π(X ; γ ) � aπ∗(X ) almost surely for some constant a ∈ (0, 1/2], then

msre(γ ) � 5

3aE[K{π(X ; γ ),π∗(X )}] � 5

3aE{�CAL(γ )− κCAL(g

∗)}. (12)

The factor 5/(3a) in general cannot be improved up to a constant, independent of a.(iii) If π(X ; γ ) � b almost surely for some constant b ∈ (0, 1), then

msre(γ ) � 1

2b2 E[L{π(X ; γ ),π∗(X )}] = 1

2b2 E{�ML(γ )− κML(g∗)}. (13)

The factor 1/(2b2) in general cannot be improved up to a divisor of order log(b−1).

Setting γ to γ 1CAL in (11) and (12) shows that if π(X ; γ 1

CAL) � aπ∗(X ) almost surely, then

E[{μ1

IPW(γ1CAL)− μ1}2] � 2c

nδ+ 5c

3a

(1 + 2

nδ

)minγ

E{�CAL(γ )− κCAL(g∗)}. (14)

Dow


ic.oup.com/biom



142 Z. Tan

The minimum calibration divergence is inflated in (14) by a leading factor 5c/(3a), which remainsbounded from above as long as a is bounded away from zero, even when some limiting propensityscores π(X ; γ 1

CAL) are close to zero. In contrast, setting γ to the limiting value γML in (11) and(13) shows that if π(X ; γML) � b almost surely, then

E[{μ1

IPW(γML)− μ1}2] � 2c

nδ+ c

2b2

(1 + 2

nδ

)minγ

E{�ML(γ )− κML(g∗)}. (15)

The minimum likelihood divergence is inflated in (15) by the leading factor c/(2b2), which maybe arbitrarily large when some limiting propensity scores π(X ; γML) are close to zero. In general,which bound, (14) or (15), is smaller depends on both the divergences and the inflation factors,and the comparison between the bounds or the mean squared errors may vary in different data-generation settings. Nevertheless, the preceding discussion suggests that the mean squared errorof μ1

IPW(γ1CAL) can be more strongly controlled by the minimization of the calibration divergence

than that of μ1IPW(γML) by the minimization of the likelihood divergence, especially when some

of the limiting propensity scores are close to zero. This result is complementary to the statisticaltheory in § 4.3 on how inverse probability weighted estimators with fitted propensity scoresdeviate from μ1

IPW(γ1CAL) with limiting propensity scores. See (29) and the related discussion

thereafter.Technically, the difference between (12) and (13) can be attributed to the function

K{π(X ; γ ),π∗(X )} in the calibration divergence (10). On the one hand, (12) is deduced from thefollowing lemma, proved in the Supplementary Material. The calibration divergence controls themean squared relative error E[{π∗(X )/π(X ; γ )−1}2] effectively through E[K{π(X ; γ ),π∗(X )}].On the other hand, (13) is based on the inequality L(ρ, ρ′) � 2(ρ−ρ′)2 (Cover & Thomas, 1991,Lemma 12.6.1). The likelihood divergence mainly controls the mean squared absolute errorE[{π(X ; γ )−π∗(X )}2] rather than the mean squared relative error. Lemma 1 also shows that anupper bound on ρ′/ρ alone does not in general constrain the ratio Q(ρ, ρ′)/L(ρ, ρ′).

Lemma 1. For a constant a ∈ (0, 1/2], if any two probabilities ρ ∈ (0, 1) and ρ′ ∈ (0, 1)satisfy ρ � aρ′, then

Q(ρ, ρ′) � 5

3 aK(ρ, ρ′).

By comparison, supρ�aρ′ {Q(ρ, ρ′)/L(ρ, ρ′)} = ∞ for any constant a > 0.

For illustration, consider a simple setting adapted from the simulation study in § 5. Let X ∼N (0, 1) and π∗(X ) = {1 + exp(X )}−1. For W = f1(X ) = exp(X /2), the propensity scoremodel is π(X ; γ ) = {1 + exp(γ0 + γ1W )}−1, which is misspecified. Figure 1 shows the limitingpropensity scores π(· ; γML), π(· ; γ 1

CAL) and π(· ; γBAL) (Imai & Ratkovic, 2014), conditional onn = 400 design points (W1, . . . , Wn), labelled as covariate, where Wi = exp(Xi/2) with Xi beingthe i/401 quantile of N (0, 1). The values γML, γ 1

CAL and γBAL are computed by minimizing,respectively, �ML(γ ), �CAL(γ ) and �BAL(γ ) in (2), (6) and (36) with Ti replaced by π∗(Xi).If judged by absolute errors, that is, by |π(· ; γML) − π∗(·)| etc., the three propensity scoresare comparable and reasonably capture the main trend of the true propensity scores. However,substantial differences emerge when the propensity scores are compared in terms of relativeerrors, that is, |π∗(·)/π(· ; γML)− 1| etc. The calibrated propensity scores are the most accurate,the maximum likelihood propensity scores are the least accurate, and the balancing propensityscores are in between, especially in the right tail of W where the true propensity scores aresmall. If a true propensity score 0.05 is estimated by, for example, 0.005, then the relative error

Dow


ic.oup.com/biom






Covariate

Prob

abili

ty

0 1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0(a) (b)

Covariate

Rat

io

0 1 2 3 4

0

2

4

6

8

10

| | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |

Fig. 1. (a) Limiting propensity scores based on γML (dashed), γ 1CAL (dot-dashed) and γBAL (dotted), together with the

true propensity score (solid); (b) the ratios of the true propensity scores to those when a propensity score model ismisspecified.

is large even though the absolute error appears small. As suggested by (11), it is relative ratherthan absolute errors that are relevant in evaluating propensity scores used for inverse probabilityweighting.

4. Regularized calibrated estimation

4.1. Background

We propose a regularized calibrated estimator of γ in propensity score model (3). There are twomotivations. First, regularization is needed in the situation where γ 1

CAL does not exist because thecalibration loss (6) is unbounded from below; see Proposition S1 in the Supplementary Material.In our numerical study, nonconvergence is found for γ 1

CAL in 100% of repeated simulations withn = 200 and p = 50, as shown in the Supplementary Material. Second, regularization is alsoneeded in high-dimensional settings where the dimension of f (X ) is close to or greater than thesample size.

The regularized calibrated estimator, denoted by γ 1RCAL, is defined by minimizing the

calibration loss �CAL(γ ) with a lasso penalty (Tibshirani, 1996),

�RCAL(γ ) = �CAL(γ )+ λ‖γ1:p‖1, (16)

whereγ1:p = (γ1, . . . , γp)T excludingγ0, ‖·‖1 denotes the L1-norm such that ‖γ1:p‖1 = ∑p

j=1 |γj|,and λ � 0 is a tuning parameter. By the Karush–Kuhn–Tucker condition for minimization of(16), the fitted propensity score, π1

RCAL(X ) = π(X ; γ 1RCAL), satisfies

1

n

n∑i=1

Ti

π1RCAL(Xi)

= 1, (17)

1

n

∣∣∣∣∣n∑

i=1

Tifj(Xi)

π1RCAL(Xi)

−n∑

i=1

fj(Xi)

∣∣∣∣∣ � λ (j = 1, . . . , p), (18)

where equality holds in (18) for any j such that the jth estimate (γ 1RCAL)j is nonzero. The inverse

probability weights, 1/π1RCAL(Xi) with Ti = 1, still sum to the sample size n by (17), but the

weighted average of each covariate fj(Xi) over the treated group may differ from the overallaverage of fj(Xi) by no more than λ. In other words, introducing the lasso penalty to calibratedestimation yields a relaxation of the equalities (4) to box constraints (18).

Dow


ic.oup.com/biom






144 Z. Tan

When model (3) is logistic regression, a lasso-penalized maximum likelihood estimator, isobtained by minimizing

�RML(γ ) = �ML(γ )+ λ‖γ1:p‖1, (19)

where �ML(γ ) is the average negative loglikelihood in (2). Lasso-penalized maximum likelihoodestimation has been studied extensively for generalized linear models (e.g., Buhlmann & van deGeer, 2011). Such estimators have also been used in fitting propensity score models to estimatetreatment effects (Belloni et al., 2014; Farrell, 2015). We build on and extend previous work tohandle computation and analysis of γ 1

RCAL and inverse probability weighted estimation.

4.2. Computation

We present a Fisher scoring descent algorithm for computing the estimator γ 1RCAL, that is,

minimizing �RCAL(γ ) in (16) for any fixed λ. The basic idea of the algorithm is to iterativelyform a quadratic approximation to the calibration loss �CAL(γ ) in (6) and solve a lasso-penalizedweighted least squares problem, similarly to existing algorithms for lasso-penalized maximumlikelihood-based logistic regression (e.g., Friedman et al., 2010). However, a suitable quadraticapproximation is obtained here only after an additional step of replacing certain sample quanti-ties with model expectations. This idea is known as Fisher scoring and was previously used toderive the iterative reweighted least squares method for fitting generalized linear models withnoncanonical links, such as probit regression (McCullagh & Nelder, 1989).

The quadratic approximation directly from a Taylor expansion of �CAL(γ ) about currentestimates, denoted by γ , is

�CAL,Q1(γ ; γ ) = �CAL(γ )+ E([−T exp{−f T(X )γ } + 1 − T

]f T(X )(γ − γ )

+ 1

2(γ − γ )Tf T(X )

[T exp{−f T(X )γ }]f (X )(γ − γ )

). (20)

As suggested by the quadratic term, it is tempting to recast (20) as a weighted least squaresobjective function in the form (1/2)

∑ni=1 wi{Ti − f T(Xi)γ }2 with some working response Ti, free

ofγ , and weight wi = Ti exp{−f T(Xi)γ }. Instead, by Fisher scoring, we replace Ti exp{−f T(Xi)γ }with its expectation [1 + exp{f T(Xi)γ }]−1 under (3) with parameter γ , and obtain

�CAL,Q2(γ ; γ ) = �CAL(γ )+ E([−T exp{−f T(X )γ } + 1 − T

]f T(X )(γ − γ )

+ 1

2(γ − γ )Tf T(X )

[1 + exp{f T(Xi)γ }]−1f (X )(γ − γ )

), (21)

which is easily shown to be a weighted least squares objective function with covariate vectorf (Xi) and with working response and weight

Ti = f T(Xi)γ + Ti − π(Xi; γ )

π(Xi; γ ){1 − π(Xi; γ )} , (22)

wi = 1 − π(Xi; γ ), (23)

respectively. By comparison, in the iterative reweighted least squares algorithm for fitting logisticregression by maximum likelihood, the working response is the same as (22), but the weight isπ(Xi; γ ){1−π(Xi; γ )}. Therefore, observations are weighted more withπ(Xi; γ ) closer to 1/2 formaximum likelihood estimation, but with π(Xi; γ ) closer to 0 for calibrated estimation by (23).

Dow


ic.oup.com/biom




To reduce computational cost, we make use of the majorization-minimization technique (Wu &Lange, 2010), similarly to existing algorithms for logistic regression. In particular, a majorizingfunction of (21) at current estimates γ is, by the quadratic lower bound principle (Bohning & Lind-say, 1988), the quadratic function obtained by replacing the Hessian E[f T(X ){1−π(X ; γ )}f (X )]with E[f T(X )f (X )] in (21). The resulting quadratic function of γ , denoted by �CAL,Q3(γ ; γ ), canbe shown to be a weighted least squares objective function with working response and weight

Ti = f T(Xi)γ + Ti

π(Xi; γ )− 1, wi = 1,

respectively. A complication of Fisher scoring, i.e., transition from (20) to (21) is that, unlikea direct majorization of the quadratic approximation from a Taylor expansion, the function�CAL,Q3(γ ; γ ) may not be a majorizing function of �CAL(γ ), and hence minimization of�CAL, Q3(γ ; γ ) + λ‖γ1:p‖1 may not lead to a decrease of the objective function �RCAL(γ ) =�CAL(γ )+λ‖γ1:p‖1 from the current value �RCAL(γ ), as would be achieved by the majorization-minimization technique. However, the descent property, when occasionally violated, can berestored by incorporating a backtracking line search, because the direction found by minimizing�CAL, Q3(γ ; γ )+ λ‖γ1:p‖1 must be a descent direction for the objective function �RCAL(γ ).

Proposition 3. Let γ (1) |= γ be a minimizer of �CAL, Q2(γ ; γ )+λ‖γ1:p‖1 or of �CAL, Q3(γ ; γ )+λ‖γ1:p‖1, and let γ (t) = (1 − t)γ + tγ (1) for 0 � t � 1. Then any subgradient of �CAL(γ

(t))+λ‖γ (t)1:p‖1 at t = 0 is negative.

From the preceding discussions we develop the following algorithm.

Algorithm 1. Fisher scoring descent algorithm for minimizing (16).

(i) Set an initial value γ (0).(ii) Repeat the following updates for k = 1, 2, . . . until convergence to obtain γ 1

CAL:(ii1) Compute γ (k−1/2) = arg minγ �CAL, Q2(γ ; γ (k−1))+ λ‖γ1:p‖1

or γ (k−1/2) = arg minγ �CAL, Q3(γ ; γ (k−1))+ λ‖γ1:p‖1.(ii2) If �RCAL(γ

(k−1/2)) < �RCAL(γ(k−1)), then set γ (k) = γ (k−1/2); otherwise set

γ (k) = (1 − t)γ (k−1) + tγ (k−1/2) for some 0 < t < 1 through a backtrackingline search, such that �RCAL(γ

(k)) < �RCAL(γ(k−1)).

Various algorithms, for example coordinate descent as in Friedman et al. (2010), can be usedto solve the least squares lasso problem in step (ii2). Our implementation in the R (R Devel-opment Core Team, 2020) package RCAL (Tan, 2019b) employs the simple surrogate function�CAL, Q3(γ ; γ ) and then a variation of the active set algorithm of Osborne et al. (2000), whichenjoys a finite termination property. We need to compute only once and save the Cholesky decom-position of the Gram matrix defined from the vectors {fj(X1), . . . , fj(Xn)} for the active coordinatesγj in the active set algorithm.

4.3. High-dimensional analysis

In this subsection we provide a high-dimensional analysis of the regularized calibrated esti-mator γ 1

RCAL and the resulting estimator of μ1, allowing for misspecification of model (3). Ouranalysis of lasso-penalized M-estimators deals with the convergence of such estimators, γ 1

RCAL,to the target values, γ 1

CAL, under model misspecification, which is related to but distinct fromprevious results on excess prediction errors (e.g., Buhlmann & van de Geer, 2011); see the

Dow


ic.oup.com/biom



146 Z. Tan

Supplementary Material for further discussion. Moreover, our analysis of the inverse probabilityweighted estimator of μ1 based on γ 1

RCAL carefully exploits the results on the calibration loss in§ 3.2 to obtain convergence under weaker conditions than previously realized.

As discussed in § 3.2, for calibrated estimation with the loss �CAL(γ ), the target value of γ ,denoted by γ 1

CAL, is defined as a minimizer of the expected calibration loss

E{�CAL(γ )} = E[T exp{−γ Tf (X )} + (1 − T )γ Tf (X )

].

The resulting approximation of g∗ is g1CAL = (γ 1

CAL)Tf , which is in general different from g∗

in the presence of model misspecification. For our theoretical analysis of γ 1RCAL, the tuning

parameter in the lasso-penalized loss (16) is specified as λ = A0λ0 with a constant A0 > 1 and

λ0 = O(1)[log{(1 + p)/ε}/n]1/2,

where O(1) is a constant depending only on (B0, C0) from conditions (i) and (ii) in Proposition 4,and 0 < ε < 1 is a tail probability for the error bound. For example, taking ε = 1/(1 + p) givesλ0 = O(1){2 log(1 + p)/n}1/2, a familiar rate in high-dimensional analysis.

Our first result, Proposition 4, establishes the convergence of γ 1RCAL to γ 1

CAL in the L1-norm‖γ 1

RCAL − γ 1CAL‖1 and the symmetrized Bregman divergence between g1

RCAL = (γ 1RCAL)

Tf and

g1CAL = (γ 1

CAL)Tf . In fact, convergence is obtained in terms of D‡

CAL(g1RCAL, g1

CAL), where fortwo functions g = γ Tf and g′ = γ ′Tf ,

D‡CAL(g, g′) = DCAL(g, g′)+ DCAL(g

′, g)+ (A0 − 1)λ0‖γ − γ ′‖1.

See the Supplementary Material for a discussion of the technical conditions imposed and acomparison with related results in high-dimensional analysis, including those of Buhlmann &van de Geer (2011), Huang & Zhang (2012) and Negahban et al. (2012).

Proposition 4. Suppose that:

(i) g1CAL(X ) � B0 almost surely for a constant B0 ∈ R, i.e., π(X ; γ 1

CAL) is bounded frombelow by {1 + exp(−B0)}−1;

(ii) for some subset S ⊂ {0, 1, . . . , p} containing 0 and constants ν0 > 0 and ξ0 > 1, ifb = (b0, b1, . . . , bp) ∈ R

1+p satisfies∑

j �∈S |bj| � ξ0∑

j∈S |bj|, then ν20(

∑j∈S |bj|)2 �

|S|(bT�1CALb) where �1

CAL = E[f (X )T exp{−g1CAL(X )}f T(X )];

(iii) maxj=0, 1, ..., p |fj(X )| � C0 for a constant C0 > 0;(iv) |S|λ0 � η0 for a sufficiently small constant η0 > 0.

Then for a sufficiently large constant A0 depending only on ξ0, we have that with probability atleast 1 − 4ε,

D‡CAL(g

1RCAL, g1

CAL) � O(1){λ0

∑j �∈S

|γ 1CAL, j| + |S|λ2

0

},

where O(1) depends only on (A0, B0, ξ0, ν0, C0, η0).

From Proposition 4, the following slow and fast rates can be deduced. The two rates are ofdistinct interest, being valid under different assumptions. Taking S = {0} leads to a slow rate oforder λ0

∑pj=1 |γ 1

CAL, j|, where the corresponding compatibility assumption is explicitly satisfiedunder mild conditions.

Dow


ic.oup.com/biom








Corollary 1. Suppose that conditions (i), (iii) and (iv) in Proposition 4 hold with S = {0},and that either no linear combination of f1(X ), . . . , fp(X ) is close to being a constant or theψ2-weighted L2-norms of f1(X ), . . . , fp(X ) are bounded away from above by 1, as defined in theSupplementary Material, with ψ(T , g) = T exp(−g) + (1 − T )g. Then for a sufficiently largeconstant A0, we have that with probability at least 1 − 4ε,

D‡CAL(g

1RCAL, g1

CAL) � O(1)λ0

p∑j=1

|γ 1CAL, j|, (24)

where O(1) depends only on (A0, B0, C0, η0) and η3 or η4 from (S6) or (S7) in the SupplementaryMaterial.

Taking S = {0} ∪ {j : γ 1CAL, j |= 0, j = 1, . . . , p} yields a fast rate of order |S|λ2

0, albeitunder a compatibility condition on the linear dependency between f1(X ), . . . , fp(X ), which maybe violated when the number of covariates, p, is large.

Corollary 2. Suppose that conditions (i)–(iv) in Proposition 4 hold with S = {0} ∪ {j :γ 1

CAL, j |= 0, j = 1, . . . , p}. Then for a sufficiently large constant A0, we have that with probabilityat least 1 − 4ε,

D‡CAL(g

1RCAL, g1

CAL) � O(1)|S|λ20, (25)


We now examine implications of the preceding results and those in § 3.2 on inverse probabilityweighted estimation. Write π1

RCAL(X ) = π(X ; γ 1RCAL), the fitted propensity score based on

γ 1RCAL. Consider the resulting estimator in two equivalent forms due to (17),

μ1IPW(π

1RCAL) = μ1

rIPW(π1RCAL) = E

{TY

π1RCAL(X )

}.

Then a high-probability bound can be obtained on the difference between μ1IPW(π

1RCAL) and the

limiting version μ1IPW(π

1CAL) with π1

CAL(X ) = π(X ; γ 1CAL).

Proposition 5. (i) Suppose that the conditions in Corollary 1 hold and that∑p

j=1 |γ 1CAL, j| �

M1 for a constant M1 > 0. Then for a sufficiently large constant A0, we have that with probabilityat least 1 − 4ε,

∣∣μ1IPW(π

1RCAL)− μ1

IPW(π1CAL)

∣∣2 � O(1)λ0E

{TY 2

π1CAL(X )

}, (26)

where O(1) depends only on (A0, B0, C0, η0, M1) and η3 or η4 from (S6) or (S7) in theSupplementary Material.

(ii) Suppose that the conditions in Corollary 2 hold. Then for a sufficiently large constant A0,we have that with probability at least 1 − 4ε,

∣∣μ1IPW(π

1RCAL)− μ1

IPW(π1CAL)

∣∣2 � O(1)|S|λ20E

{TY 2

π1CAL(X )

}, (27)


Dow


ic.oup.com/biom









148 Z. Tan

A remarkable feature of Proposition 5 is that as λ0 → 0, the difference between μ1IPW(π

1RCAL)

and μ1IPW(π

1CAL) can be seen to converge in probability to zero, even when the L1-norm ‖γ 1

RCAL−γ 1

CAL‖1 may not converge to zero. In fact, the L1-norm‖γ 1RCAL−γ 1

CAL‖1 is in general only boundedfrom above in probability under the conditions for the slow rate in Corollary 1. The situationwith the fast rate in Corollary 2 is similar, but technically subtler: ‖γ 1

RCAL − γ 1CAL‖1 is generally

of order |S|λ0, which is only required to be sufficiently small, no greater than some positiveconstant η0, but need not be arbitrarily close to zero. The error bounds (26) and (27) are provedby carefully exploiting Proposition 1(i) and Lemma 1 on control of relative errors of propensityscores through the calibration loss �CAL; see the proofs in the Supplementary Material.

It is interesting to discuss how results related to Proposition 5 can be obtained for inverseprobability weighted estimators ofμ1 based on γRML, such as μ1

IPW(πRML). In previous analyses(e.g., Belloni et al., 2014, § 5), such results were obtained under additional conditions ensuringthat the L1-norm ‖γRML − γML‖1 converges in probability to zero. Alternatively, by mimickingour proof of Proposition 5, similar results can be shown for μ1

IPW(πRML) without requiringconvergence of ‖γRML − γML‖1 to zero. Analogously to (27), it can be shown under conditionscomparable to those in Corollary 2 that with high probability,

∣∣μ1IPW(πRML)− μ1

IPW(πML)∣∣2 � O(1)|S ′|λ2

0E

{TY 2

πML(X )2

}, (28)

where S ′ is a counterpart of S. There is, however, a subtle difference between (27) and(28). Write gRML = γ T

RMLf and gML = γ TMLf . As discussed in the Supplementary

Material, O(1)|S|λ20 in (27) is obtained as 5/(3a)DCAL(g1

RCAL, g1CAL), whereas O(1)|S ′|λ2

0 in(28) is obtained as 1/(2δ2a′2)DML(gRML, gML), where a > 0 and a′ > 0 are constants such thatπ1

RCAL(X )/π1CAL(X ) � a and πRML(X )/πML(X ) � a′ with high probabilities, and δ > 0 is

assumed to satisfy πML(X ) � δ, similar to the lower bound {1 + exp(−B0)}−1 in Proposition 4.This zoomed-in view shows a comparison reminiscent of that between (12) and (13). The like-lihood divergence DML(gRML, gML) can be inflated by a much larger factor for the error bound(28) than DCAL(g1

RCAL, g1CAL) for the error bound (27).

Finally, although Proposition 5 deals with convergence of μ1IPW(π

1RCAL) to μ1

IPW(π1CAL), which

may differ from the parameter of interest μ1, we point out that Propositions 2 and 5 are comple-mentary in providing support for the use of π1

RCAL in inverse probability weighted estimation ofμ1. The argument is based on the triangle inequality,

|μ1IPW(π

1RCAL)− μ1| � |μ1

IPW(π1CAL)− μ1| + |μ1

IPW(π1RCAL)− μ1

IPW(π1CAL)|. (29)

The first and second terms in (29) represent, respectively, limiting bias and sampling variation.On the one hand, as discussed after Proposition 2, minimization of the expected calibrationloss facilitates achieving a smaller first term, |μ1

IPW(π1CAL) − μ1|, than maximum likelihood

estimation. On the other hand, by the preceding discussion, using the calibration loss with lassopenalization makes it possible to achieve a smaller second term, |μ1

IPW(π1RCAL)− μ1

IPW(π1CAL)|,

than maximum likelihood estimation with lasso penalization in high-dimensional settings.

5. Simulation study

We conduct a simulation study extending the design of Kang & Schafer (2007) to high-dimensional, sparse settings. For p � 4, let X = (X1, . . . , Xp)

T be multivariate normal with mean

Dow


ic.oup.com/biom








0 and covariances cov(Xj, Xk) = 0 if 1 � j, k � 4 and cov(Xj, Xk) = 2−|j−k| if 5 � j � p or5 � k � p, and let T be a binary variable such that

pr(T = 1 | X ) = π∗(X ) = [1 + exp{X1 − 0.5X2 + 0.25X3 + 0.1X4}]−1, (30)

which depends only on the four covariates (X1, X2, X3, X4). Consider two versions of the logisticmodel (3) with the following regressors: (i) fj(X ) = Xj for j = 1, . . . , p; (ii) fj(X ) is a standardizedversion of Wj with sample mean 0 and sample variance 1, where W1 = exp(0.5X1), W2 =10 + {1 + exp(X1)}−1X2, W3 = (0.04X1X3 + 0.6)3, W4 = (X2 + X4 + 20)2 and, if p > 4,Wj = Xj for j = 5, . . . , p. Then model (3) is correctly specified in scenario (i), but misspecifiedin scenario (ii). For p = 4, model (3) in scenario (ii), although misspecified, appears adequateby conventional diagnostics, as shown in Kang & Schafer (2007). In addition, with an outcomevariable Y , this simulation setting with p = 4 has been widely used to study estimators forμ1 = E(Y 1) with observed data {(TiY 1

i , Ti, Xi) : i = 1, . . . , n}.We compare six estimators in the ratio form μ1

rIPW(π), labelled as follows:

(I) π is replaced by the true propensity score π∗;(II) π = E(T ) obtained from model (3) with only the intercept f ≡ 1;

(III) π = πML obtained by maximum likelihood, i.e., by minimizing (2);(IV) π = πRML obtained by lasso-penalized maximum likelihood, i.e., by minimizing (19);(V) π = π1

CAL obtained by calibrated estimation, i.e., by minimizing (6);(VI) π = π1

RCAL obtained by regularized calibrated estimation, i.e., by minimizing (16).

The functions (2) and (6) are minimized using the R package trust (Geyer, 2014); (19) and(16) are minimized using the R package RCAL (Tan, 2019b), based on Friedman et al. (2010) andour algorithm in § 4.2. The tuning parameter λ in (16) or (19) is determined using five-fold cross-validation based on the corresponding loss function. For k = 1, . . . , 5 let Ik be a random subsam-ple of size n/5 from {1, 2, . . . , n}. For a loss function �(γ ), for example �CAL(γ ) in (6), denote by�(γ ; I) the loss function obtained when the sample average E(·) is computed over only the sub-sample I. The five-fold cross-validation criterion is defined as cv5(λ) = (1/5)

∑5k=1 �(γ

(k)λ ; Ik),

where γ (k)λ is a minimizer of the penalized loss �(γ ; Ick )+λ‖γ1:p‖1 over the subsample Ic

k of size4n/5. Then λ is selected by minimizing cv5(λ) over the discrete set {λ∗/2j : j = 0, 1, . . . , 10},where for π0 = E(T ) the value λ∗ is computed as λ∗ = maxj=1, ..., p |E{(T − π0)fj(X )}| whenthe likelihood loss (2) is used, or as λ∗ = maxj=1, ..., p |E{(T/π0 − 1)fj(X )}| when the calibrationloss (6) is used. It can be shown that in either case, the penalized loss �(γ )+ λ‖γ1:p‖1 over theoriginal sample has a minimum at γ1:p = 0 for all λ � λ∗.

The performance of an inverse probability weighted estimator μ1rIPW(π) depends not only on

the closeness of π to π∗, but also on the outcome regression function m∗1(X ) = E(Y 1 | X )

and the error ε = Y 1 − m∗1(X ). Although such dependency can be exploited to achieve local

efficiency and double robustness by incorporating an outcome regression model, we are interestedin comparing inverse probability weighted estimators regardless of outcome regression models.Under unconfoundedness, it can be shown via conditioning on {(Ti, Xi) : i = 1, . . . , n} that

E{μ1rIPW(π)} = E{μ1

rIPW(π ; m∗1)},

var{μ1rIPW(π)} = var{μ1

rIPW(π ; m∗1)} + var{μ1

rIPW(π ; ε)},where μ1

rIPW(π ; h) = E{Th(X )/π(X )}/E{T/π(X )} for a function h(X ) and μ1rIPW(π ; ε) =

E{Tε/π(X )}/E{T/π(X )}. As a result, the mean squared error E[{μ1rIPW(π) − μ1}2] can

Dow


ic.oup.com/biom



150 Z. Tan

lin1

lin2

I II III IV V VI

I II III IV V VI

0.0

0.2

0.4

0.6

0.0

0.2

0.4

0.6

0.8

1.0

quad1

quad2

I II III IV V VI

I II III IV V VI

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

exp

noise

I II III IV V VI

I II III IV V VI

0.0

0.2

0.4

0.6

0.8

1.0

0.00

0.10

0.20

0.30

Fig. 2. Root mean squared errors of μ1rIPW(π ; h) and μ1

rIPW(π ; ε) for the estimators π labelled I–VI when the logisticmodel (3) is correctly specified, with p = 4 (�), 20 (+), 50 (•), 100 (×) or 200 (�) and with n = 200 (left), 400 (middle)or 800 (right) within the column for each estimator. The estimators πML and π1

CAL, i.e., III and V, are computed onlyfor p = 4, 20 and 50. The results for π1

CAL should be interpreted with caution for (p, n) = (20, 200) and (50, �400).

be decomposed as mse{μ1rIPW(π ; m∗

1)} + var{μ1rIPW(π ; ε)}, where mse{μ1

rIPW(π ; h)} =E([μ1

rIPW(π ; h)−E{h(X )}]2). We consider several configurations for m∗1(X ): (lin1) h(X ) = X1 +

0.5X2+0.5X3+0.5X4; (lin2) h(X ) = X1+2X2+2X3+2X4; (quad1) h(X ) = ∑4j=1{max(Xj, 0)}2;

(quad2) h(X ) = ∑4j=1{max(−Xj, 0)}2 and (exp) h(X ) = exp(

∑4j=1 Xj/2). The configuration lin1

is similar to that in Kang & Schafer (2007) up to a linear transformation.For model (3) correctly specified or misspecified, Fig. 2 or 3 displays, respectively, Monte

Carlo estimates of mse1/2{μ1

rIPW(π ; h)} with five choices of h(X ) and var1/2{μ1rIPW(π ; ε)} where

ε ∼ N (0, 1), for the six estimators π labelled I–VI above, from 1000 repeated simulations withn = 200, 400, 800 and p = 4, 20, 50, 100, 200. See the Supplementary Material for numericalvalues. The nonpenalized estimators πML and π1

CAL are computed only for p from 4 to 50. For(p, n) = (20, 200) or (50, �400), the estimator π1

CAL is obtained with nonconvergence declaredby the R package trust in 30–100% of simulations, indicating that the loss function �CAL(γ )

may not have a finite minumum. We summarize the findings as follows.First, for all the choices of (n, p) and h(X ) studied, the estimator π1

RCAL yields similar orsmaller mean squared errors than πRML, whether model (3) is correctly specified or misspecified.The advantage of π1

RCAL is substantial in the case of misspecified model (3).Second, for relatively small p, 50 or less, the estimator π1

CAL, in spite of the nonconvergenceissue mentioned above, consistently leads to smaller mean squared errors than πML, whethermodel (3) is correctly specified or misspecified. In the case of misspecified model (3), theperformance of πML deteriorates substantially, particularly for estimation associated with theconfigurations quad1 and exp for h(X ). A possible explanation is that h(X ) in these cases quicklyincreases as (X1, X3, X4) become large, which by (30) is the region where the propensity scoreπ∗(X ) becomes small. Even a small discrepancy, especially underestimation, between π and π∗for the few observations with T = 1 in this region can give rise to large errors for the estimatesμ1

rIPW(π ; h).

Dow


ic.oup.com/biom






lin1

lin2I II III IV V VI

0.0

0.2

0.4

0.6

I II III IV V VI0.0

0.2

0.4

0.6

0.8

1.0

quad1

quad2I II III IV V VI

0.0

0.2

0.4

0.6

0.8

1.0

I II III IV V VI0.0

0.2

0.4

0.6

exp

noiseI II III IV V VI

0.0

0.2

0.4

0.6

0.8

1.0

I II III IV V VI0.00

0.10

0.20

0.30

Fig. 3. Root mean squared errors of μ1rIPW(π ; h) and μ1

rIPW(π ; ε) plotted as in Fig. 2, for the estimators π labelledI–VI when the logistic model (3) is misspecified; the values are censored within the upper limit of the vertical axis

(dotted line).

Third, the regularized estimator πRML or π1RCAL yields smaller or slightly larger mean squared

errors than the corresponding nonregularized estimator πML or π1CAL with p � 50, except for

π1RCAL versus π1

CAL in the configuration lin1 for h(X ); see the Supplementary Material for adiscussion.

In the Supplementary Material we report various additional results, including the numberof samples with nonconvergence for γML and γ 1

CAL, the average numbers of nonzero coef-ficients obtained in γRML and γ 1

RCAL, and the mean squared errors related to estimation ofμ1 − μ0.

6. Application to a medical study

We present an empirical application of the proposed methods to a medical study describedin Connors et al. (1996) on the effects of right heart catheterization. The observational studywas of interest at a time when many physicians believed that the procedure led to betterpatient outcomes, but the benefit had not been demonstrated in any randomized clinical tri-als. The study of Connors et al. (1996) included n = 5735 critically ill patients admitted tofive medical centres. For each patient, the data consist of the treatment status T , defined as1 if the procedure was used within 24 hours of admission and 0 otherwise, the health out-come Y , defined as survival time up to 30 days, and a list of 75 covariates X specified bymedical specialists in critical care. In previous analyses using propensity scores, logistic regres-sion was employed either with main effects only (e.g., Hirano & Imbens, 2002; Vermeulen &Vansteelandt, 2015) or with interaction terms manually added (Tan, 2006) in the approach ofRosenbaum & Rubin (1984).

To capture possible dependency beyond main effects, we consider a logistic propensity scoremodel (3), where the vector f (X ) includes all main effects and two-way interactions of Xexcept those with numbers of nonzero values less than 46, i.e., 0.8% of the sample of 5735.

Dow


ic.oup.com/biom







152 Z. Tan

CA

L1

CA

L1

CA

L1

RMLj

jj

RC

AL

0 500 1000 1500 2000−0.4

−0.2

0.0

0.2

0.4

0 500 1000 1500 2000−0.2

−0.1

0.0

0.1

0.2

0 500 1000 1500 2000−0.2

−0.1

0.0

0.1

0.2

0.0 0.2 0.4 0.6 0.8 1.00.0

0.2

0.4

0.6

0.8

1.0

(a) (b)

(c) (d)

Fig. 4. Standardized calibration differences cal1(π ; fj) plotted against index j for the estimators (a) π = E(T ),

(b) πRML and (c) π1RCAL with λ selected by cross-validation; in each panel a vertical line is placed at the end of the

indices for 71 main effects, and two horizontal lines are placed at the maximum absolute standardized differences;crosses (×) mark the indices j corresponding to 188 nonzero estimates of γj for πRML and 32 nonzero estimates of γj

for π1RCAL. (d) Fitted propensity scores {πRML(Xi), π1

RCAL(Xi)} in the treated sample {i : Ti = 1, i = 1, . . . , n}.

The dimension of f (X ) is p = 1855, excluding the constant. All variables in f (X ) are stan-dardized to have sample mean 0 and sample variance 1. We apply the methods of regularizedmaximum likelihood and regularized calibrated estimation, labelled as rml and rcal, simi-larly to how we conducted the simulation study, with the lasso tuning parameter λ attemptedin a finer set {λ∗/2j/4 : j = 0, 1, . . . , 24}, where λ∗ is the value leading to a zero solutionγ1 = · · · = γp = 0.

To measure the effect of calibration in the treated sample for a function h(X ) using apropensity score estimate π , we use the standardized calibration difference cal

1(π ; h) =[μ1

rIPW(π ; h)−E{h(X )}]/V 1/2{h(X )}, where E(·) and V (·) denote the sample mean and variance,respectively, and μ1

rIPW(π ; h) is defined as μ1rIPW(π) with Y replaced by h(X ). For fj(X ) stan-

dardized with sample mean 0 and sample variance 1, cal1(π ; fj) reduces to μ1

rIPW(π ; fj). See, forexample,Austin & Stuart (2015, § 4.1.1) for a related statistic based on μ1

rIPW(π ; h)−μ0rIPW(π ; h)

for balance checking. Figure 4 displays the standardized calibration differences for all the vari-ables fj(X ) and the fitted propensity scores in the treated sample, obtained from the regularizedestimators πRML and π1

RCAL with the tuning parameter λ selected by five-fold cross-validationas in § 5.

From Fig. 4 it can be seen that the maximum absolute standardized differences are reduced from35% to about 10% based on the estimators πRML and π1

RCAL. However, the latter estimator π1RCAL

is obtained with a much smaller number, 32 versus 188, of nonzero estimates of coefficients γj.The corresponding standardized differences for these 32 nonzero coefficients precisely attain themaximum absolute value, 0.102, which is also the tuning parameter λ used for the lasso penaltyin (18). The fitted propensity scores π1

RCAL(Xi) in the treated sample are consistently larger orsmaller than πRML(Xi) when close to 0 or 1, respectively. As a result, the inverse probability

Dow


ic.oup.com/biom




# nonzero

max

⏐CA

L1 ⏐

0.00

0.10

0.20

0.30

Relative variance

max

⏐CA

L1 ⏐

0 50 100 150 200 250 0.0 0.2 0.4 0.6 0.8 1.00.00

0.10

0.20

0.30

(a) (b)

Fig. 5. Maximum absolute standardized differences, maxj |cal1(π ; fj)|, plotted against (a) the numbers of nonzero

estimates of (γ1, . . . , γp) and (b) the relative variances of the inverse probability weights in the treated sample, forπRML (circles) and π1

RCAL (crosses) as the tuning parameter λ varies for the lasso penalty; vertical lines indicate thevalues corresponding to λ selected by cross-validation.

weights 1/π1RCAL(Xi) tend to be less variable than 1/πRML(Xi), which is also confirmed in the

following discussion.Figure 5 shows how the maximum absolute standardized differences are related to the numbers

of nonzero estimates of γj and the relative variances of the inverse probability weights in thetreated sample as the tuning parameter λ varies. For a set of weights {wi : Ti = 1, i = 1, . . . , n},the relative variance is defined as

∑i:Ti=1(wi − w)2/{(n1 − 1)w2}, where w = ∑

i:Ti=1 wi/n1

and n1/n = E(T ). See Liu (2001, § 2.5.3) for a discussion on using the relative variance tomeasure the efficiency of a weighted sample. As seen from Fig. 5, in the process of reducingthe standardized differences, the estimator π1

RCAL is associated with a much smaller number ofnonzero coefficients γj, or greater sparsity, and with smaller relative variances of the inverseprobability weights, or greater efficiency, than πRML. These results demonstrate the advantagesof regularized calibrated estimation in high-dimensional settings.

Additional results are reported in the Supplementary Material, including results that parallelthose in Fig. 5 for the fitted propensity score π0

RCAL in the untreated sample, as well as theestimates of μ1, μ0 and μ1 − μ1 for the 30-day survival status indicating Y � 30.

7. Further discussion

7.1. Dual formulation

The regularized calibrated estimator γ 1RCAL can also be derived in a dual formulation. Let

w = {wi > 1 : Ti = 1, i = 1, . . . , n}. For fixed λ � 0, consider the following optimizationproblem:

minimize �DCAL(w) = ∑

1�i�n: Ti=1{(wi − 1) log(wi − 1)− (wi − 1)} (31)

subject to∑

1�i�n: Ti=1wi = n, (32)∣∣∑1�i�n: Ti=1wifj(Xi)− ∑n

i=1 fj(Xi)∣∣ � λ (j = 1, . . . , p). (33)

It can be shown directly via the Karush–Kuhn–Tucker condition that if γ 1RCAL minimizes the

penalized loss �RCAL(γ ) in (16), then the inverse probability weights

wi = {π1RCAL(Xi)}−1 = 1 + exp{−(γ 1

RCAL)Tf (Xi)} (1 � i � n : Ti = 1)

are a solution to (31)–(33). In the case of exact calibration (λ = 0), (31)–(33) can be obtained fromChan et al. (2016) with the particular distance measure �D

CAL(w); see also Zubizarreta (2015) for

Dow


ic.oup.com/biom





154 Z. Tan

a related method. Similarly, for the regularized likelihood estimator γRML minimizing �RML(γ )

in (19), it can be shown that the fitted propensity scores πi = πRML(Xi) (i = 1, . . . , n) solve thefollowing program with π = {0 < πi < 1 : i = 1, . . . , n}:

minimize �DML(π) = ∑n

i=1{(1 − πi) log(1 − πi)+ πi log(πi)}subject to

∑ni=1(Ti − πi) = 0,∣∣∑ni=1(Ti − πi)fj(Xi)

∣∣ � λ (j = 1, . . . , p). (34)

See Dudik et al. (2007) for general results relating box constraints such as (33) and (34) tothe lasso penalty in a conceptually similar context. These dual formulations help to shed lighton the differences between maximum likelihood and calibration estimation, which deal with,respectively, propensity scores in the probability scale and the scale of inverse probability weights.

We distinguish between two types of calibration estimators viewed in a dual formulation, whichin previous work usually involved exact constraints. The first type is survey calibration (Deville &Sarndal, 1992), in which calibration weights are constructed by minimizing a measure of distanceto the design weights, i.e., inverses of inclusion probabilities, subject to calibration equations.Similar ideas are used in Tan (2010, 2013) to derive improved doubly robust estimators, throughadjusting the inverses of fitted propensity scores to achieve calibration constraints, possiblydepending on a fitted outcome regression function. The second type of calibrated estimators,such as γ 1

CAL for γ or μ1(π1CAL) forμ1, are typically derived to deal with nonresponse or missing

data, in a similar manner to the first type with uniform design weights. But there is a subtledifference. The survey calibration weights (Deville & Sarndal, 1992) are expected to deviatefrom the design weights by Op(n−1/2), and are mainly used to reduce the asymptotic variancesof the resulting estimators of population quantities. The calibration weights of the second typecan be viewed as the inverses of fitted response probabilities or propensity scores from a modelimplied by the choice of distance measure, and are expected to behave as Op(1) to achieve biasreduction.

7.2. Estimation of average treatment effects

So far our theory and methods have focused mainly on estimation ofμ1, but they can be directlyextended to estimation of μ0, and hence the average treatment effect μ1 − μ0. As mentionedin § 3, for estimation of μ0 with model (3), the calibrated estimator of γ , denoted by γ 0

CAL, isdefined as a solution to (5). By exchanging T with 1−T and γ with −γ in (6), the correspondingloss function minimized by γ 0

CAL is

�0CAL(γ ) = E

[(1 − T ) exp{γ Tf (X )} − Tγ Tf (X )

].

For fixed λ � 0, the regularized calibrated estimator γ 0RCAL is defined as a minimizer of

�0RCAL(γ ) = �0

CAL(γ )+ λ‖γ1:p‖1.

The fitted propensity score, π0RCAL(X ) = π(X ; γ 0

RCAL), then satisfies (17) and (18) with Ti

replaced by 1 − Ti and π1RCAL(Xi) replaced by 1 − π0

RCAL(Xi). The resulting estimator of μ0 isμ0

IPW(π0RCAL) = μ0

rIPW(π0RCAL), and that of μ1 − μ0 is μ1

IPW(π1RCAL)− μ0

IPW(π0RCAL).

An interesting feature of our approach is that two different estimators of the propensity scoreare used when estimating μ0 and μ1. The estimators γ 0

RCAL and γ 1RCAL may in general have

different asymptotic limits when the propensity score model (3) is misspecified, even thoughtheir asymptotic limits coincide when model (3) is correctly specified. Such possible differences

Dow


ic.oup.com/biom




should not be of concern: the two estimators μ0IPW(π) and μ1

IPW(π) are decoupled, involving twodisjoint subsets of fitted propensity scores on the untreated {i : Ti = 0} and the treated {i : Ti = 1},respectively. In fact, separate estimation of propensity scores and inverse weights for the treatedand untreated samples can lead to more flexible approximations, and hence potentially less biasin the presence of model misspecification. Furthermore, whether substantial differences existbetween these separately fitted propensity scores can be used for diagnosis of the validity ofmodel (3). See Chan et al. (2016, § 2.3) for a related discussion.

7.3. Calibration or balancing

It is interesting to compare calibrated propensity scores with the covariate balancing propensityscores of Imai & Ratkovic (2014). For model (1), the covariate balancing estimator of γ , denotedby γBAL, is defined as a solution to

E

[{T

π(X ; γ )− 1 − T

1 − π(X ; γ )

}f (X )

]= 0. (35)

The same fitted propensity score πBAL(X ) = π(X ; γBAL) can be used in μ1IPW(πBAL) or

μ1rIPW(πBAL) for estimating μ1 and in μ0

IPW(πBAL) or μ0rIPW(πBAL) for estimating μ0. Equa-

tion (35) amounts to finding a single value γBAL such that the left-hand sides of (4) and (5)are equal, although they may each deviate from zero. For calibrated estimation, (4) and (5) aresatisfied separately by two estimators γ 1

CAL and γ 0CAL. An advantage of using the calibration

equations (4) and (5) is that for t = 0 or 1, μtIPW(π

tCAL), but not μt

IPW(πBAL), is doubly robust,i.e., remains consistent if either the propensity score model (1) or a linear outcome model iscorrect, E(Y t | X ) = αT

t f (X ) for a coefficient vector αt (Graham et al., 2012). We also point outthat with the logistic model (3), γBAL can be obtained by minimizing the loss function

�BAL(γ ) = �CAL(γ )+ �0CAL(γ ), (36)

which is still convex in γ . Our results developed for calibrated estimation and regularization canbe adapted to γBAL and its regularized version. See Fig. 1 for a comparison of limiting propensityscores in a simple example with model misspecification.

7.4. Estimation of average treatment effects on the treated

There is a simple extension of our approach to estimation of the average treatment effect on thetreated, ν1 − ν0, as defined in § 2. The parameter ν1 = E(Y 1 | T = 1) can be directly estimatedby E(TY )/E(T ). Two standard inverse probability weighted estimators for ν0 are

ν0IPW(πML) = E

{(1 − T )πML(X )Y

1 − πML(X )

} /E(T )

and ν0rIPW(πML), which is defined as ν0

IPW(πML) with E(T ) replaced by E[(1 − T )πML(X )/{1 −πML(X )}], where πML(X ) is the maximum likelihood fitted propensity score. To derive acalibrated estimator of γ , consider the set of calibration equations

E

[{(1 − T )π(X ; γ )

1 − π(X ; γ )− T

}f (X )

]= 0. (37)

Equation (37) is used in Imai & Ratkovic (2014) as a balancing equation in propensity scoreestimation for estimating ν1 −ν0. We point out two simple results which, though straightforwardas shown below, do not seem to have been discussed before.

Dow


ic.oup.com/biom



156 Z. Tan

First, (37) is equivalent to the calibration equations (5) for γ 0CAL when estimating μ0. This

follows from the simple identity (1−T )π(X ; γ )/{1−π(X ; γ )}−T = (1−T )/{1−π(X ; γ )}−1.Therefore the same set of fitted propensity scores, for example π0

RCAL(Xi) based on the regularizedestimator γ 0

RCAL, can be used for estimating μ0 by μ0IPW(π

0RCAL) and for estimating ν0 by

ν0IPW(π

0RCAL) = ν0

rIPW(π0RCAL), due to an equation similar to (17).

Second, for the logistic model (3), the estimator ν0IPW(π

0CAL) = ν0

rIPW(π0CAL) is identical to

the estimator of ν0 from entropy balancing (Hainmueller, 2012). In fact, ν0rIPW(π

0CAL) can be

rewritten as∑

i:Ti=0 wiYi where, for γ = γ 0CAL, γ1:p = (γ1, . . . , γp)

T and f1:p = (f1, . . . , fp)T,

wi = exp{−γ T1:pf1:p(Xi)}∑

i′:Ti′=0 exp{−γ T1:pf1:p(Xi′)} .

Equation (37) for γ 0CAL then implies that

∑i:Ti=0 wi = 1 and

∑i:Ti=0

wifj(Xi) = E{ (1−T )π(X ;γ )

1−π(X ;γ ) fj(X )}

E{ (1−T )π(X ;γ )

1−π(X ;γ )

} = E{Tfj(X )}E(T )

= 1

n1

∑i:Ti=1

fj(Xi)

for j = 1, . . . , p, where n1/n = E(T ). The weights {wi : Ti = 0, 1 � i � n} and the aboveconstraints are the same as in entropy balancing. The result can also be shown by comparingthe weighting scheme in Hainmueller (2012) and the program analogous to (31)–(33), but cor-responding to the calibration equations (5). From this connection, our regularized method alsoextends entropy balancing to allow box constraints similar to (18).

7.5. Augmented inverse probability weighting and related methods

The development in this article focuses on estimation of propensity scores to improve inverseprobability weighted estimation, without invoking any outcome regression model. On the otherhand, using both propensity score and outcome regression models may enhance efficiency androbustness and, particularly with high-dimensional data, enable valid confidence intervals to beconstructed under suitable conditions. In the following, we briefly discuss related work in high-dimensional, parametric settings. See Robins et al. (2017) for estimation based on higher-orderinfluence functions in nonparametric settings.

Belloni et al. (2014) and Farrell (2015) studied the augmented inverse probability weightedestimator (Robins et al., 1994), with propensity score and outcome regression models fitted byregularized maximum likelihood, and obtained valid confidence intervals when both the propen-sity score and the outcome regression models are correctly specified under suitable sparsityconditions. In a 2017 Cornell University technical report, Ning, Peng and Imai also used regular-ized maximum likelihood estimation of propensity scores and outcome regression functions, butapplied an inverse probability weighted estimator with propensity scores further adjusted throughcalibration equations similar to (4), depending on variable selection from the initial regularizedestimation. Similar results to those of Belloni et al. (2014) and Farrell (2015) were obtained.Athey et al. (2018) combined penalized estimation of a linear outcome model and constructionof balancing weights, similarly to Zubizarreta (2015), and obtained valid confidence intervals inthe situation where a linear outcome model is correctly specified. In contrast with these works,Tan (2019a) developed regularized calibrated estimation for outcome regression in conjunctionwith that for propensity scores as studied here to achieve model-assisted inference, that is, validconfidence intervals in cases where the propensity score model is correctly specified, but theoutcome regression model may be misspecified. With linear outcome models, the confidenceintervals are also doubly robust.

Dow


ic.oup.com/biom




Acknowledgement

The author thanks Cun-Hui Zhang for helpful discussions, as well as the editor, an associateeditor and a referee for constructive comments which have led to various improvements of thearticle. This research was supported in part by the Patient-Centred Outcomes Research Institute.

Supplementary material

Supplementary Material available at Biometrika online includes technical proofs and additionalnumerical results from the simulation study and empirical application.

References

Athey, S., Imbens, G. W. & Wager, S. (2018). Approximate residual balancing: Debiased inference of averagetreatment effects in high dimensions. J. R. Statist. Soc. B 80, 597–623.

Austin, P. C. & Stuart, E. A. (2015). Moving towards best practice when using inverse probability of treatmentweighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statist.Med. 34, 3661–79.

Belloni, A., Chernozhukov, V. & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81, 608–50.

Bohning, D. & Lindsay, B. G. (1988). Monotonicity of quadratic approximation algorithms. Ann. Inst. Statist. Math.40, 641–63.

Buhlmann, P. & van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications.New York: Springer.

Chan, K. C. G., Yam, S. C. P. & Zhang, Z. (2016). Globally efficient non-parametric inference of average treatmenteffects by empirical balancing calibration weighting. J. R. Statist. Soc. B 78, 673–700.

Connors, A. F., Speroff, T., Dawson, N. V., Thomas, C., Harrell Jr, F. E., Wagner, D., Desbiens, N., Goldman,

L., Wu, A. W., Califf, R. M. et al. (1996). The effectiveness of right heart catheterization in the initial care ofcritically ill patients. J. Am. Med. Assoc. 276, 889–97.

Cover, T. M. & Thomas, J. A. (1991). Elements of Information Theory. New York: Wiley.Deville, J. C. & Sarndal, C. E. (1992). Calibration estimators in survey sampling. J. Am. Statist. Assoc. 87, 376–82.Dudik, M., Phillips, S. J. & Schapire, R. E. (2007). Maximum entropy density estimation with generalized

regularization and an application to species distribution modeling. J. Mach. Learn. Res. 8, 1217–60.Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations.

J. Economet. 189, 1–23.Folsom, R. E. (1991). Exponential and logistic weight adjustments for sampling and nonresponse error reduction. In

Proc. Am. Statist. Assoc. Social Statist. Sect. Alexandria, Virginia: American Statistical Association, pp. 197–202.Friedman, J., Hastie, T. & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate

descent. J. Statist. Software 33, 1–22.Geyer, C. J. (2014). Trust: Trust Region Optimization. R package version 0.1-6, available at https://CRAN.

R-project.org/package=trust.Graham, B. S., de Xavier Pinto, C. C. & Egel, D. (2012). Inverse probability tilting for moment condition models

with missing data. Rev. Econ. Stud. 79, 1053–79.Hainmueller, J. (2012). Entropy balancing for causal effects: Multivariate reweighting method to produce balanced

samples in observational studies. Polit. Anal. 20, 25–46.Hirano, K. & Imbens, G. W. (2002). Estimation of causal effects using propensity score weighting: An application to

data on right heart catheterization. Health Serv. Out. Res. Methodol. 2, 259–78.Huang, J. & Zhang, C.-H. (2012). Estimation and selection via absolute penalized convex minimization and its

multistage adaptive applications. J. Mach. Learn. Res. 13, 1839–64.Imai, K. & Ratkovic, M. (2014). Covariate balancing propensity score. J. R. Statist. Soc. B 76, 243–63.Kang, J. D. Y. & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for

estimating a population mean from incomplete data (with Discussion). Statist. Sci. 22, 523–39.Kim, J. K. & Haziza, D. (2014). Doubly robust inference with missing data in survey sampling. Statist. Sinica 24,

375–94.Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer.Manski, C. F. (1988). Analog Estimation Methods in Econometrics. New York: Chapman & Hall.McCullagh, P. & Nelder, J. (1989). Generalized Linear Models. New York: Chapman & Hall, 2nd ed.Negahban, S. N., Ravikumar, P., Wainwright, M. J. & Yu, B. (2012). A unified framework for high-dimensional

analysis of M-estimators with decomposable regularizers. Statist. Sci. 27, 538–57.

Dow


ic.oup.com/biom





158 Z. Tan

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles, Section9. Statist. Sci. 5, 465–80.

Osborne, M., Presnell, B. & Turlach, B. (2000). A new approach to variable selection in least squares problem.IMA J. Numer. Anal. 20, 389–404.

R Development Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation forStatistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.

Robins, J. M., Li, L., Mukherjee, R., Tchetgen Tchetgen, E. & van der Vaart, A. (2017). Minimax estimation ofa functional on a structured high dimensional model. Ann. Statist. 45, 1951–87.

Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors arenot always observed. J. Am. Statist. Assoc. 89, 846–66.

Rosenbaum, P. R. & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causaleffects. Biometrika 70, 41–55.

Rosenbaum, P. R. & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on thepropensity score. J. Am. Statist. Assoc. 79, 516–24.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ.Psychol. 66, 688–701.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–90.Tan, Z. (2006). A distributional approach for causal inference using propensity scores. J. Am. Statist. Assoc. 101,

1619–37.Tan, Z. (2007). Comment: Understanding OR, PS, and DR. Statist. Sci. 22, 560–8.Tan, Z. (2010). Bounded, efficient, and doubly robust estimation with inverse weighting. Biometrika 97, 661–82.Tan, Z. (2013). Simple design-efficient calibration estimators for rejective and high-entropy sampling. Biometrika

100, 399–415.Tan, Z. (2019a). Model-assisted inference for treatment effects using regularized calibrated estimation with high-

dimensional data. Ann. Statist. to appear, arXiv: 1801.09817.Tan, Z. (2019b). RCAL: Regularized Calibrated Estimation. R package version 1.0, available at https://CRAN.R-

project.org/package=RCAL.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.Vermeulen, K. & Vansteelandt, S. (2015). Bias-reduced doubly robust estimation. J. Am. Statist. Assoc. 110,

1024–36.White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.Wu, T. T. & Lange, K. (2010). The MM alternative to EM. Statist. Sci. 25, 492–505.Yiu, S. & Su, L. (2018). Covariate association eliminating weights: A unified weighting framework for causal effect

estimation. Biometrika 105, 709–22.Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. J. Am.

Statist. Assoc. 110, 910–22.

[Received on 30 May 2018. Editorial decision on 15 May 2019]

Dow


ic.oup.com/biom



regularized calibrated estimation of propensity scores with...

Documents