The Augmented Synthetic Control Method∗
Eli Ben-Michael, Avi Feller, and Jesse Rothstein
UC Berkeley
November 2019
Abstract
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The “synthetic control” is a weighted average of control units that balances the treated unit’s pre-treatment outcomes as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pre-treatment outcomes is excellent. We propose Augmented SCM to extend SCM to settings where such pre-treatment fit is infeasible. Analogous to bias correction for inexact matching, Augmented SCM uses an outcome model to estimate the bias due to imperfect pre-treatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pre-treatment fit while minimizing extrapolation from the convex hull. We bound the bias of this approach under a linear factor model and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.
∗email: [email protected]. We thank Alberto Abadie, Josh Angrist, Matias Cattaneo, Alex D’Amour, Peng
Ding, Erin Hartman, Chad Hazlett, Steve Howard, Guido Imbens, Brian Jacob, Pat Kline, Caleb Miles, Luke Miratrix,
Sam Pimentel, Fredrik Savje, Jas Sekhon, Jake Soloff, Panos Toulis, Stefan Wager, Yiqing Xu, Alan Zaslavsky, and
Xiang Zhou for thoughtful comments and discussion, as well as seminar participants at Stanford, UC Berkeley, UNC,
the 2018 Atlantic Causal Inference Conference, COMPIE 2018, and the 2018 Polmeth Conference.
1 Introduction
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment
on a single unit in panel data settings with a modest number of control units and with many
pre-treatment periods (Abadie and Gardeazabal, 2003; Abadie et al., 2010, 2015). The idea is to
construct a weighted average of control units, known as a synthetic control, that matches the treated
unit’s pre-treatment outcomes. The estimated impact is then the difference in post-treatment
outcomes between the treated unit and the synthetic control. SCM has been widely applied — the
main SCM papers have over 4,000 citations — and has been called “arguably the most important
innovation in the policy evaluation literature in the last 15 years” (Athey and Imbens, 2017).
A critical feature of the original proposal is to use SCM only when the synthetic control’s pre-
treatment outcomes closely match the pre-treatment outcomes for the treated unit (Abadie et al.,
2015). When it is not possible to construct a synthetic control that fits pre-treatment outcomes
well, the original papers advise against using SCM.
We propose the augmented synthetic control method (ASCM) to extend SCM to settings where
excellent pre-treatment fit is not feasible. Analogous to bias correction for inexact matching (Rubin,
1973; Abadie and Imbens, 2011), ASCM uses an outcome model to estimate the bias due to imper-
fect pre-treatment fit and then de-biases the original SCM estimate. If pre-treatment fit is good,
the estimated bias will be small, and the SCM and ASCM estimates will be similar. Otherwise,
the estimates will diverge, relying more heavily on the outcome model than SCM weighting.
We focus on the special case in which the outcome model is a ridge-regularized linear regression
model, which we call Ridge ASCM. Like SCM, Ridge ASCM can be written as a weighted average
of the control unit outcomes. However, where SCM weights are always non-negative, Ridge ASCM
weights can be negative. This introduces a trade off: allowing for negative weights improves pre-
treatment fit relative to SCM alone, but at the cost of extrapolating outside the convex hull of
control units. We show that the regularization parameter in Ridge ASCM directly parameterizes
this trade-off by penalizing the distance from SCM weights, analogous to calibration in survey
sampling (Deville and Särndal, 1992). By contrast, ridge regression alone can also be written as a
(possibly negative) weighting estimator, but one that instead seeks the minimum variance — rather
than minimum extrapolation — solution.
To assess the bias of the Ridge ASCM estimator, we connect pre-treatment fit to bias under
the linear factor model often invoked in this setting (Abadie et al., 2010). For weights constrained
to be on the simplex, including SCM weights, we bound the possible bias in terms of a measure
of pre-treatment fit and an approximation error. Incorporating the ridge adjustment improves
pre-treatment fit, but may increase the approximation error if the ridge regression over-fits noise
in the pre-treatment outcomes. In practice this will require careful regularization: we describe a
cross-validation hyper-parameter selection procedure that builds off the in-time placebo check in
Abadie et al. (2015).
Next, we extend the Augmented SCM approach to incorporate auxiliary covariates other than
pre-treatment outcomes. The most immediate extension is to simply include these auxiliary co-
variates in parallel to the lagged outcomes in both the outcome and SCM models. In the special
case where the auxiliary covariates are low dimensional, we instead propose a two-step procedure:
first residualize the pre- and post-treatment outcomes on the auxiliary covariates and then fit
Ridge ASCM on the residualized outcomes; this extends a suggestion from Doudchenko and Im-
bens (2017). We show that this controls bias under a linear factor model with auxiliary covariates
and also show that recent proposals for SCM with an intercept shift (Doudchenko and Imbens,
2017; Ferman and Pinto, 2018) are special cases of this approach.
We demonstrate Augmented SCM by using it to examine the effect of an aggressive tax cut
on economic output in Kansas in 2012, finding a substantial negative effect. We then generate
simulation studies calibrated to this application. Overall, we find large gains from ASCM relative
to alternative estimators, especially under model mis-specification, in terms of both bias and root
mean squared error. We implement the proposed methodology in the augsynth package for R,
available at https://github.com/ebenmichael/augsynth.
The paper proceeds as follows. Section 1.1 briefly reviews related work. Section 2 introduces
notation and the SCM estimator. Section 3 proposes Augmented SCM and gives key results for
Ridge ASCM. Section 4 bounds the bias under a linear factor model, the standard setting for SCM.
Section 5 extends the ASCM framework to incorporate auxiliary covariates and alternative outcome
models. Section 6 reports on the application to the Kansas tax cuts as well as extensive simulation
studies. Finally, Section 7 discusses some possible directions for further research. The appendix
includes all of the proofs, as well as additional derivations and technical discussion.
1.1 Related work
SCM was introduced by Abadie and Gardeazabal (2003) and Abadie et al. (2010, 2015) and is the
subject of an extensive methodological literature; see Abadie (2019) and Samartsidis et al. (2019)
for recent reviews. We briefly highlight some relevant aspects of this literature.
A first set of papers adapts the original SCM proposal to allow for more robust estimation while
retaining the simplex constraint on the weights. Robbins et al. (2017); Doudchenko and Imbens
(2017); Abadie and L’Hour (2018); Minard and Waddell (2018) incorporate a penalty on the weights
into the SCM optimization problem, building on a suggestion in Abadie et al. (2015). Gobillon and
Magnac (2016) and Hazlett and Xu (2018) explore dimension reduction strategies and other data
transformations that can improve the performance of the subsequent estimator.
A second set of papers relaxes the constraint that the weights are on the simplex as well as
other generalizations. In particular, Doudchenko and Imbens (2017) relax the SCM restriction that
control unit weights be non-negative, arguing that there are many settings in which negative weights
would be desirable. Amjad et al. (2018) propose an interesting variant that combines negative
weights with a pre-processing step. Powell (2018) instead allows for extrapolation via a Frisch-
Waugh-Lovell-style projection, which similarly generalizes the typical SCM setting. Doudchenko
and Imbens (2017) and Ferman and Pinto (2018) both propose to incorporate an intercept into
the SCM problem, which we discuss in Section 5. There have also been several other proposals to
reduce bias in SCM, developed independently and contemporaneously with ours; see Abadie and
L’Hour (2018), Chernozhukov et al. (2018), and Arkhangelsky et al. (2019). In particular, Abadie
and L’Hour (2018) also propose bias correcting SCM using regression; Arkhangelsky et al. (2019)
propose the Synthetic Difference-in-Differences estimator which, as the authors show, can be seen
as a special case of our proposal with a constrained outcome regression.
Finally, there have been several recent proposals to use outcome modeling rather than SCM-
style weighting in this setting. These include the matrix completion method in Athey et al. (2017),
the generalized synthetic control method in Xu (2017), and the combined approaches in Hsiao et al.
(2018). We explore the performance of these methods, both in isolation and in combination with
SCM, in Section 6.2.
2 Overview of the Synthetic Control Method
2.1 Notation and setup
We consider the canonical SCM panel data setting with i = 1, . . . , N units observed for t = 1, . . . , T
time periods. Let Wi be an indicator that unit i is treated at time T0 < T ; units with Wi = 0 never
receive the treatment. We restrict our attention to the case where a single unit receives treatment,
and follow the convention that the first unit is treated, that is W1 = 1. The remaining N0 = N − 1
units are potential controls, often referred to as donor units in the SCM context.
The outcome is Y and is typically continuous. We adopt the potential outcomes frame-
work (Neyman, 1923; Rubin, 1974) and invoke SUTVA, which assumes a well-defined treatment
and excludes interference between units (Rubin, 1980). The potential outcomes for unit i in period
t under control and treatment are Yit(0) and Yit(1), respectively.1 Observed outcomes are:
$$Y_{it} = \begin{cases} Y_{it}(0) & \text{if } W_i = 0 \text{ or } t \le T_0 \\ Y_{it}(1) & \text{if } W_i = 1 \text{ and } t > T_0. \end{cases} \qquad (1)$$
To keep notation simple, we assume that there is only one post-treatment observation, T = T0 + 1,
though our results are easily extended to larger T . We use Yi to represent the single post-treatment
observation for unit i, dropping t from the subscript. We use Xit, for t ≤ T0, to represent pre-
¹To simplify the discussion, we assume that potential outcomes under both treatment and control exist for all units in all time periods t > T0. With some additional technical caveats, we could relax this assumption to only require that Yit(0) be well-defined for all units, without also requiring control units to have well-defined potential outcomes under treatment.
treatment outcomes, to emphasize that pre-treatment outcomes serve as covariates in SCM. With
some abuse of notation, we use X0· to represent the N0-by-T0 matrix of control unit pre-treatment
outcomes and Y0 for the N0-vector of control unit outcomes in period T . With only one treated
unit, Y1 is a scalar, and X1· is a T0-row vector of treated unit pre-treatment outcomes. The data
structure is then:
$$\begin{pmatrix}
Y_{11} & Y_{12} & \dots & Y_{1T_0} & Y_{1T} \\
Y_{21} & Y_{22} & \dots & Y_{2T_0} & Y_{2T} \\
\vdots & & & \vdots & \vdots \\
Y_{N1} & Y_{N2} & \dots & Y_{NT_0} & Y_{NT}
\end{pmatrix}
\equiv
\begin{pmatrix}
\underbrace{\begin{matrix} X_{11} & X_{12} & \dots & X_{1T_0} \\ X_{21} & X_{22} & \dots & X_{2T_0} \\ \vdots & & & \vdots \\ X_{N1} & X_{N2} & \dots & X_{NT_0} \end{matrix}}_{\text{pre-treatment outcomes}} & \begin{matrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{matrix}
\end{pmatrix}
\equiv
\begin{pmatrix} X_{1\cdot} & Y_1 \\ X_{0\cdot} & Y_0 \end{pmatrix}
\qquad (2)$$
The treatment effect of interest is τ = τ1T = Y1(1)− Y1(0) = Y1 − Y1(0).
2.2 Synthetic Control Method
The Synthetic Control Method imputes the missing potential outcome for the treated unit, Y1(0),
as a weighted average of the control outcomes, Y ′0γ (Abadie and Gardeazabal, 2003; Abadie et al.,
2010, 2015). Weights are chosen to balance pre-treatment outcomes and possibly other covariates.
We consider a version of SCM that chooses weights γ as a solution to the constrained optimization
problem:
$$\begin{aligned}
\min_{\gamma} \quad & \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \zeta \sum_{W_i = 0} f(\gamma_i) \\
\text{subject to} \quad & \sum_{W_i = 0} \gamma_i = 1, \qquad \gamma_i \ge 0 \;\; \text{for all } i : W_i = 0,
\end{aligned} \qquad (3)$$

where $\|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 \equiv \sum_{t=1}^{T_0} (X_{1t} - X_{0t}'\gamma)^2$ is the squared 2-norm on $\mathbb{R}^{T_0}$, and the constraints limit $\gamma$ to the simplex in $\mathbb{R}^{N_0}$.
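As an illustration, problem (3) with f ≡ 0 can be solved with an off-the-shelf constrained optimizer. The paper’s own implementation is the augsynth R package; the Python sketch below is only for intuition, and the donor pool, dimensions, and noise level are invented:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N0, T0 = 10, 8
X0 = rng.normal(size=(N0, T0))                         # donor-pool pre-treatment outcomes
X1 = X0[:3].mean(axis=0) + 0.1 * rng.normal(size=T0)   # treated unit, near the convex hull

def imbalance(gamma):
    """Squared 2-norm imbalance ||X1 - X0'gamma||^2, the objective in (3) with f = 0."""
    return np.sum((X1 - X0.T @ gamma) ** 2)

res = minimize(
    imbalance,
    x0=np.full(N0, 1.0 / N0),                          # start from uniform weights
    bounds=[(0.0, 1.0)] * N0,                          # gamma_i >= 0
    constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}],  # weights sum to 1
    method="SLSQP",
)
gamma_scm = res.x
```

The resulting weights lie on the simplex and fit the pre-treatment series at least as well as the uniform starting point.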
Equation (3) modifies the original SCM proposal in two key ways. First, this version, known
as the constrained regression formulation of SCM, follows the recent methodological literature and
focuses solely on balancing the lagged outcomes X (see Doudchenko and Imbens, 2017; Ferman
and Pinto, 2018). By contrast, the original SCM formulation also includes auxiliary covariates Z
and a weighted L2 norm. Our proposal naturally accommodates such covariates; we re-introduce
them in Section 5.
Second, Equation (3) penalizes the dispersion of the weights with hyperparameter ζ > 0, fol-
lowing a suggestion in Abadie et al. (2015). Possible penalization functions include an elastic net
penalty (Doudchenko and Imbens, 2017), an entropy penalty (Robbins et al., 2017), and a penalty
based on a measure of pairwise distance (Abadie and L’Hour, 2018). When the weights are con-
strained to be non-negative, as in the original SCM proposal and in (3), the particular choice of
dispersion penalty does not play a central role. However, if this constraint is relaxed, as we consider
below, regularizing the weights with a dispersion penalty is especially important. We discuss this
choice at length in the next section.
The SCM weights in Equation (3) optimize the pre-treatment fit, minimizing the imbalance
of pre-treatment outcomes between the treated unit and the weighted control mean. When these
weights achieve perfect pre-treatment fit, that is, when $X_{1t} - X_{0t}'\hat\gamma^{\text{scm}} = 0$ for all $t$, the resulting estimator has many attractive properties, including a bias bound established by Abadie et al. (2010). Achieving perfect (or nearly perfect) pre-treatment fit is not always feasible in practice,
however. By definition, weights that yield perfect fit exist if and only if the treated unit’s X1·
vector is inside the convex hull of the control units, X0·. As a result, the curse of dimensionality
applies for large T ; see Ferman and Pinto (2018) for a detailed discussion in the SCM setting.
Abadie et al. (2015) recommend against using SCM when “the pre-treatment fit is poor or the
number of pre-treatment periods is small.” Similarly, Abadie et al. (2010, 2015) argue that SCM should be used only if the setting passes extensive balance checks, such as “placebo in time”
checks. The conditional nature of SCM analysis therefore excludes many practical settings. Our
proposal enables the use of (a modified) SCM approach even when these balance checks fail.
3 Augmented SCM
3.1 Overview
Let $Y_i = m(X_i) + \varepsilon_i$ be a working outcome model, where $m(X_i)$ is a function of pre-treatment outcomes that we estimate with $\hat m(X_i)$, and $\varepsilon_i$ is an independent error term. Under this additive structure, the bias of any weighting estimator with weights $\gamma$ can be written as:

$$\text{bias} = Y_1(0) - \sum_{W_i = 0} \gamma_i Y_i = \Bigg( m(X_1) - \sum_{W_i = 0} \gamma_i m(X_i) \Bigg) + \mathbb{E}\Bigg[ \varepsilon_1 - \sum_{W_i = 0} \gamma_i \varepsilon_i \Bigg]. \qquad (4)$$
The first term here represents bias attributable to imbalance in $X$. Using the estimated $\hat m(X_i)$, we can then estimate this component of the bias:

$$\widehat{\text{bias}} = \hat m(X_1) - \sum_{W_i = 0} \gamma_i \hat m(X_i).$$
This leads naturally to a bias-corrected estimator for $Y_1(0)$:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \gamma_i Y_i + \Bigg( \hat m(X_1) - \sum_{W_i = 0} \gamma_i \hat m(X_i) \Bigg) \qquad (5)$$

$$\phantom{\hat Y_1^{\text{aug}}(0)} = \hat m(X_1) + \sum_{W_i = 0} \gamma_i \big( Y_i - \hat m(X_i) \big). \qquad (6)$$
When the weights γi are the SCM weights defined above, we refer to this estimator as Augmented
SCM (ASCM). Standard SCM is a special case, where m(·) is set to be a constant.
Equations (5) and (6), while equivalent, highlight two distinct motivations for ASCM. Equation (5) directly corrects the SCM estimate, $\sum_i \gamma_i Y_i$, by the estimated bias. This is analogous to bias
correction for inexact matching (Rubin, 1973; Abadie and Imbens, 2011). If the estimated bias is
small, then the SCM and ASCM estimates will be similar. By contrast, Equation (6) is analogous to
standard doubly robust estimators (Robins et al., 1994), which begin with the outcome model but
then re-weight to balance residuals. This is also comparable in form to the generalized regression
estimator in survey sampling (Cassel et al., 1976; Breidt and Opsomer, 2017), which has been
adapted to the causal inference setting by, among others, Athey et al. (2018) and Tan (2018). We
emphasize the bias correction interpretation and consider residual balancing further in Section 5.
Finally, we discuss a connection to inverse propensity score weighting in Appendix C.
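The algebraic equivalence of (5) and (6) holds for any fitted outcome model and any weights. A minimal numerical check, using simulated data, arbitrary simplex weights in place of SCM weights, and a plain OLS fit as a stand-in outcome model:

```python
import numpy as np

rng = np.random.default_rng(0)
N0, T0 = 12, 6
X0 = rng.normal(size=(N0, T0))          # control pre-treatment outcomes
X1 = rng.normal(size=T0)                # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)                # control post-treatment outcomes
gamma = rng.dirichlet(np.ones(N0))      # any simplex weights, e.g. SCM weights

# Any fitted outcome model m_hat; here an OLS fit of Y0 on X0, purely for illustration
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(N0), X0]), Y0, rcond=None)
m_hat = lambda x: beta[0] + x @ beta[1:]

# Equation (5): SCM-style estimate corrected by the estimated bias
est5 = gamma @ Y0 + (m_hat(X1) - gamma @ m_hat(X0))
# Equation (6): outcome-model prediction plus re-weighted residuals
est6 = m_hat(X1) + gamma @ (Y0 - m_hat(X0))
```

The two forms agree term by term, which is why the bias-correction and doubly robust readings describe the same estimator.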
3.2 Ridge ASCM
We focus our discussion on the special case where $\hat m(X_i)$ is a ridge-regularized linear model, and discuss the use of non-ridge outcome models with ASCM in Sections 5 and 6. The outcome model is $\hat m(X_i) = \hat\eta_0^r + X_{i\cdot}'\hat\eta^r$, where $\hat\eta_0^r$ and $\hat\eta^r$ are the coefficients of a ridge regression of control post-treatment outcomes $Y_0$ on centered pre-treatment outcomes $X_{0\cdot}$ with penalty hyper-parameter $\lambda^r$:

$$\{\hat\eta_0^r, \hat\eta^r\} = \arg\min_{\eta_0, \eta} \; \frac{1}{2} \sum_{W_i = 0} \big( Y_i - (\eta_0 + X_{i\cdot}'\eta) \big)^2 + \lambda^r \|\eta\|_2^2. \qquad (7)$$
With this estimate, the ridge regression specialization of (5) is:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} Y_i + \Bigg( X_{1\cdot} - \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} X_{i\cdot} \Bigg) \cdot \hat\eta^r. \qquad (8)$$
We refer to this as Ridge Augmented SCM (ridge ASCM).
In the remainder of this section, we first show that ridge ASCM is a weighting estimator
with potentially negative weights. We then show that allowing for negative weights improves
pre-treatment fit relative to SCM alone. Finally, we show that, unlike ridge regression alone, ridge
ASCM directly penalizes extrapolation from the SCM weights. We give additional results bounding
the variance of Ridge ASCM in Lemma 5 in the Appendix.
3.2.1 Ridge ASCM as a weighting estimator
Lemma 1 expresses ridge ASCM as a weighting estimator that adjusts the raw SCM weights to
achieve better covariate balance, while penalizing the magnitude of the adjustment (see Deville
et al., 1993; Breidt and Opsomer, 2017, for similar results in survey sampling).
Lemma 1. The ridge-augmented SCM estimator (8) is:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} Y_i + \Bigg( X_{1\cdot} - \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} X_{i\cdot} \Bigg) \cdot \hat\eta^r = \sum_{W_i = 0} \hat\gamma_i^{\text{aug}} Y_i, \qquad (9)$$

where

$$\hat\gamma_i^{\text{aug}} = \hat\gamma_i^{\text{scm}} + \big( X_{1\cdot} - X_{0\cdot}'\hat\gamma^{\text{scm}} \big)' \big( X_{0\cdot}'X_{0\cdot} + \lambda^r I_{T_0} \big)^{-1} X_{i\cdot}. \qquad (10)$$

Moreover, the ridge ASCM weights $\hat\gamma^{\text{aug}}$ are the solution to

$$\min_{\gamma \,:\, \sum_i \gamma_i = 1} \; \frac{1}{2\lambda^r} \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \frac{1}{2} \|\gamma - \hat\gamma^{\text{scm}}\|_2^2. \qquad (11)$$
Equation (11) is a special case of the penalized SCM problem in Equation (3), where $f(\gamma_i) = \frac{1}{2}(\gamma_i - \hat\gamma_i^{\text{scm}})^2$ and where we relax the restriction that weights are non-negative. Lemma 1 highlights
the interplay between the level of pre-treatment fit and the tuning parameter λr in determining the
size of the adjustment. When SCM yields good pre-treatment fit or λr is large, the adjustment term
is small and γaug will remain close to the SCM weights. Indeed, the ridge ASCM and SCM weights
are identical when SCM weights exactly balance the lagged outcomes. When SCM weights do not
achieve exact balance, the ridge ASCM solution will use negative weights to extrapolate from the
convex hull of the control units, with the amount of extrapolation controlled by λr. When SCM
has poor fit and λr is small, the adjustment is large and as a consequence, γaug will extrapolate
more from the support of the data.
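The closed form in Lemma 1 is easy to check numerically. The sketch below uses simulated data, arbitrary simplex weights as a stand-in for actual SCM weights, and centers on the control means to match the centered regression in (7); it is a sketch under those assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N0, T0 = 20, 15
X0 = rng.normal(size=(N0, T0))       # control pre-treatment outcomes
X1 = rng.normal(size=T0)             # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)             # control post-treatment outcomes
gamma_scm = rng.dirichlet(np.ones(N0))   # placeholder simplex weights
lam = 10.0

# Center on control means, matching the centered regression in (7)
Xbar = X0.mean(axis=0)
X0c, X1c = X0 - Xbar, X1 - Xbar

# Closed-form ridge ASCM weights, Equation (10)
M = X0c.T @ X0c + lam * np.eye(T0)
gamma_aug = gamma_scm + X0c @ np.linalg.solve(M, X1c - X0c.T @ gamma_scm)

# Ridge coefficients in the (X'X + lam I)^{-1} X'Y form used by the lemmas
eta = np.linalg.solve(M, X0c.T @ Y0)

est_weighting = gamma_aug @ Y0                                        # weighting form (9)
est_corrected = gamma_scm @ Y0 + (X1c - X0c.T @ gamma_scm) @ eta      # corrected form (8)
```

With centered columns the adjustment term sums to zero, so the augmented weights still sum to one, and the two forms of the estimator coincide.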
3.2.2 Ridge ASCM improves pre-treatment fit relative to SCM alone
The expression of ridge ASCM as a weighting estimator allows us to compare the pre-treatment
fits for ridge ASCM to that for SCM alone. Specifically, we show that the pre-treatment fit from
ridge ASCM is at least as good as — and generally better than — the pre-treatment fit from SCM
alone.
Lemma 2. Let $\frac{1}{\sqrt{N_0}} X_{0\cdot} = U D V'$ be the singular value decomposition of the matrix of control pre-intervention outcomes, where $r$ is the rank of $X_{0\cdot}$, $U \in \mathbb{R}^{N_0 \times r}$, $V \in \mathbb{R}^{T_0 \times r}$, and $D = \mathrm{diag}(d_j)$ is the diagonal matrix of singular values. Furthermore, let $\tilde X_i = V' X_i$ be the rotation of $X_i$ along the singular vectors of $X_{0\cdot}$. Then $\hat\gamma^{\text{aug}}$, the ridge ASCM weights with hyperparameter $\lambda^r = \lambda N_0$, satisfy:

$$\big\| \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{aug}} \big\|_2 = \Bigg\| \mathrm{diag}\bigg( \frac{\lambda}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2. \qquad (12)$$
There are two important cases to consider. First, as $\lambda \to 0$, Lemma 2 shows that the (possibly infeasible) ridge ASCM solution will have perfect pre-treatment fit. Second, as $\lambda \to \infty$ the pre-treatment
fit of ridge ASCM will approach that of SCM alone from below. Thus, except in corner cases, ridge
ASCM will achieve better pre-treatment fit than SCM alone, albeit at the price of possible negative
weights.2
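Equation (12) is an exact algebraic identity, which a quick simulation confirms. The data below are simulated, the SCM weights are replaced by arbitrary simplex placeholders, and for simplicity the pre-period matrix is taken as given (e.g., already centered):

```python
import numpy as np

rng = np.random.default_rng(1)
N0, T0 = 30, 12
X0 = rng.normal(size=(N0, T0))           # control pre-treatment outcomes
X1 = rng.normal(size=T0)                 # treated unit pre-treatment outcomes
gamma_scm = rng.dirichlet(np.ones(N0))   # placeholder simplex weights
lam = 2.0
lam_r = lam * N0                         # hyperparameter scaling in Lemma 2

# SVD of X0 / sqrt(N0): rows are units, columns are pre-treatment periods
U, d, Vt = np.linalg.svd(X0 / np.sqrt(N0), full_matrices=False)

# Ridge ASCM weights via the closed form (10)
resid = X1 - X0.T @ gamma_scm
gamma_aug = gamma_scm + X0 @ np.linalg.solve(X0.T @ X0 + lam_r * np.eye(T0), resid)

# Both sides of Equation (12), in the rotated coordinates Xtilde = V'X
lhs = np.linalg.norm(Vt @ (X1 - X0.T @ gamma_aug))
rhs = np.linalg.norm((lam / (d**2 + lam)) * (Vt @ resid))
```

The two norms match to floating-point precision: the ridge adjustment shrinks each rotated coordinate of the SCM imbalance by the factor λ/(d_j² + λ).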
3.2.3 Ridge ASCM directly penalizes extrapolation, unlike ridge regression alone
It is useful to compare ridge ASCM to a simple ridge regression estimator. This, too, can be written
as a weighting estimator with possibly negative weights. However, while ridge ASCM penalizes the
distance from SCM weights — and thus, extrapolation from the convex hull — ridge regression
instead penalizes the variance of the control unit weights.
Lemma 3. With $\hat\eta_0^r$ and $\hat\eta^r$ the solutions to (7), the ridge estimate can be written as a weighting estimator:

$$\hat Y_1^{r}(0) = \hat\eta_0^r + \hat\eta^{r\prime} X_{1\cdot} = \sum_{W_i = 0} \hat\gamma_i^{r} Y_i, \qquad (13)$$

where

$$\hat\gamma_i^{r} = \frac{1}{N_0} + \big( X_{1\cdot} - \bar X_0 \big)' \big( X_{0\cdot}'X_{0\cdot} + \lambda^r I_{T_0} \big)^{-1} X_{i\cdot}. \qquad (14)$$

Moreover, the ridge weights $\hat\gamma^{r}$ are the solution to

$$\min_{\gamma \,:\, \sum_i \gamma_i = 1} \; \frac{1}{2\lambda^r} \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \frac{1}{2} \bigg\| \gamma - \frac{1}{N_0} \bigg\|_2^2. \qquad (15)$$
Lemma 3 shows that ridge regression weights minimize imbalance while penalizing deviations from
uniformity.3 We can therefore view ridge regression as an adjustment to uniform weights (the
²For example, in the special case where the pre-treatment SCM error is orthogonal to the pre-intervention series of the control units, the ridge adjustment will not improve the fit. This is unlikely unless $T_0 \gg N_0$ (and impossible if $N_0 > T_0$ and $X_{0\cdot}$ is full column rank). Additionally, Lemma 2 shows that we can view ridge ASCM as a form of soft singular value thresholding, and so it is similar to the hard singular value thresholding procedure proposed by Amjad et al. (2018).
³This penalization is akin to “calibration” in survey sampling (see Deville and Särndal, 1992). The penalized SCM form (15) shows that the ridge weights are a special case of the elastic-net synthetic control estimator proposed by Doudchenko and Imbens (2017). When $\lambda = 0$, the ridge weights reduce to the OLS weights considered by Kline (2011) and Abadie et al. (2015); when $\lambda > 0$ they correspond to an $L_2$-penalized version of the Oaxaca–Blinder weights considered by Kline (2011). Lemma 3 also relates to the “vertical” and “horizontal” regressions considered by Athey et al. (2017) and Arkhangelsky et al. (2019); it implies that the two are equivalent in the special case without an intercept in the vertical regression.
minimum variance solution) to achieve better pre-treatment fit, where the regularization parameter
λr controls the trade-off between better fit and lower variance. By contrast, the ridge ASCM weights
in Lemma 1 penalize distance to SCM weights $\hat\gamma^{\text{scm}}$ rather than to uniform weights $1/N_0$. Penalizing
distance to SCM weights is natural in our setting since these weights are sparse and interpretable.
Moreover, recall that X ′0·γscm is the projection of X1 onto the convex hull of the control units.
Thus, ridge ASCM weights directly penalize extrapolation outside the convex hull, in contrast to
ridge regression, which allows for arbitrary extrapolation.
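Lemma 3's weighting representation can likewise be verified numerically. The sketch below fits the centered ridge regression directly on simulated data; note that the penalty is written in the $(X'X + \lambda I)^{-1}$ form appearing in (14), so constant-factor conventions in (7) may rescale λ:

```python
import numpy as np

rng = np.random.default_rng(2)
N0, T0 = 25, 10
X0 = rng.normal(size=(N0, T0))   # control pre-treatment outcomes
X1 = rng.normal(size=T0)         # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)         # control post-treatment outcomes
lam = 5.0

# Ridge regression on centered pre-treatment outcomes; the intercept is the control mean
Xbar = X0.mean(axis=0)
X0c = X0 - Xbar
M = X0c.T @ X0c + lam * np.eye(T0)
eta = np.linalg.solve(M, X0c.T @ Y0)
pred = Y0.mean() + (X1 - Xbar) @ eta       # ridge prediction for the treated unit

# The same prediction as a weighting estimator, Equation (14)
gamma_r = 1.0 / N0 + X0c @ np.linalg.solve(M, X1 - Xbar)
```

The ridge weights sum to one and reproduce the regression prediction exactly, so ridge regression really is an adjustment to uniform weights.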
Allowing for negative weights is an important departure from the original SCM proposal, which
constrains weights to be on the simplex. Abadie et al. (2010, 2015) argue that the simplex constraint
is important for avoiding extrapolation and for preserving interpretability. Indeed, when it is
possible to achieve excellent pre-treatment fit with SCM, that is, when $\|X_{1\cdot} - X_{0\cdot}'\hat\gamma^{\text{scm}}\|_2$ is very
small, we should prefer SCM weights over possibly negative weights: a slight balance improvement
is not worth the extrapolation and the loss of interpretability. In cases where this is not possible,
however, some degree of extrapolation is critical for improving pre-treatment fit.4 There are two
main challenges to allowing for extrapolation. First, extrapolation increases the potential for over-
fitting. In line with our discussion above, Doudchenko and Imbens (2017) argue that regularizing the
weights with a dispersion penalty — and choosing that dispersion penalty carefully — is especially
important when weights can be negative. Second, extrapolation comes at the price of greater model
dependence. We discuss practical model checks in Section 6.
4 Bias under a linear factor model
We now consider the bias of Augmented SCM under the canonical linear factor model considered
by Abadie et al. (2010). For weights constrained to be on the simplex, as in standard SCM, the bias
is bounded by the pre-treatment fit and an approximation error due to the potential for over-fitting
to noise. Incorporating the ridge adjustment can reduce bias through improved pre-treatment fit,
but may also increase the risk of bias due to over-fitting. In practice this will require careful
regularization: we describe a cross-validation hyper-parameter selection procedure that builds off
the in-time placebo check in Abadie et al. (2015).
4.1 Linear factor model and bias for general weights
Following the setup in Abadie et al. (2010), we assume that there are J unknown, latent time-
varying factors µt = {µjt}, j = 1, . . . , J , with maxjt |µjt| ≤ M , where J will typically be small
relative to N and T0. Each unit has a vector of unknown factor loadings φi ∈ RJ , which we consider
4See King and Zeng (2006) for a discussion of extrapolation in constructing counterfactuals. As they note: “If welearn that a counterfactual question involves extrapolation, we still might wish to proceed if the question is sufficientlyimportant, but we would be aware of how much more model dependent our answers would be.”
fixed. Control potential outcomes are weighted averages of these factors plus additive noise εit:
$$Y_{it}(0) = \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \varepsilon_{it} = \phi_i'\mu_t + \varepsilon_{it}. \qquad (16)$$
Slightly abusing notation, we collect the pre-intervention factors into a matrix $\mu \in \mathbb{R}^{T_0 \times J}$, where the $t$th row of $\mu$ contains the factor values at time $t$, $\mu_t'$. Following Bai (2009) and Xu (2017), we further assume that the factors are orthogonal and normalized, i.e. that $\frac{1}{T_0}\mu'\mu = I_J$. Finally, and crucially, we assume that treatment assignment $W_i$ is ignorable given a unit's factor loadings,

$$\mathbb{E}_{\varepsilon_T}\big[ Y_i(0) \mid \phi_i, W_i \big] = \mathbb{E}_{\varepsilon_T}\big[ Y_i(0) \mid \phi_i \big],$$

where the expectation is taken with respect to $\varepsilon_T = (\varepsilon_{1T}, \dots, \varepsilon_{NT})$. Lastly, we assume that the error terms $\varepsilon_{it}$ are independent (across units and over time) sub-Gaussian random variables with scale parameter $\sigma$.
Under this model, the error of a weighting estimator depends on a bias term due to imbalance in the latent $\phi_i$ and a noise term due to the noise at time $T$:

$$Y_1(0) - \sum_{W_i = 0} \gamma_i Y_i = \underbrace{\Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T}_{\text{bias from imbalance}} + \underbrace{\varepsilon_{1T} - \sum_{W_i = 0} \gamma_i \varepsilon_{iT}}_{\text{noise}}. \qquad (17)$$
The error can be controlled by two aspects of the weights. First, the noise term depends on the
variance of the weights ‖γ‖22 and is minimized by uniform weights. Second, an estimator that
exactly balances the latent φi will yield an unbiased estimate of Y1(0); without exact balance, bias
will be proportional to the level of imbalance in φi. Balancing the observed Xi, however, will not
necessarily balance the latent φi, and the relationship between imbalance in X and bias is less
direct. Following Abadie et al. (2010), we show in the appendix that, under Equation (16), the
bias of any weighting estimator with weights γ can be expressed as
$$\Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T = \frac{1}{T_0} \underbrace{\Bigg( \mu' X_1 - \sum_{W_i = 0} \gamma_i \mu' X_i \Bigg)}_{\text{imbalance in } X} \cdot \mu_T \; - \; \frac{1}{T_0} \underbrace{\Bigg( \mu' \varepsilon_1 - \sum_{W_i = 0} \gamma_i \mu' \varepsilon_i \Bigg)}_{\text{approximation error}} \cdot \mu_T. \qquad (18)$$
The first term is the imbalance of observed lagged outcomes and the second term is an approxima-
tion error arising from the latent factor structure.
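This decomposition is an algebraic identity under the factor model (16) and the normalization $\frac{1}{T_0}\mu'\mu = I_J$, which a quick simulation confirms. All quantities below are simulated, and the weights are arbitrary simplex weights:

```python
import numpy as np

rng = np.random.default_rng(3)
N0, T0, J = 40, 20, 2

# Factors normalized so that mu'mu = T0 * I_J, as assumed in the text
mu, _ = np.linalg.qr(rng.normal(size=(T0, J)))
mu *= np.sqrt(T0)

phi = rng.normal(size=(N0 + 1, J))             # factor loadings; unit 0 is "treated"
eps = 0.1 * rng.normal(size=(N0 + 1, T0))      # pre-treatment noise
X = phi @ mu.T + eps                           # pre-treatment outcomes under (16)
phi1, phi0, X1, X0 = phi[0], phi[1:], X[0], X[1:]
eps1, eps0 = eps[0], eps[1:]
gamma = rng.dirichlet(np.ones(N0))             # any simplex weights
mu_T = rng.normal(size=J)                      # factor values at time T

# Left-hand side of (18): bias from imbalance in the latent loadings
lhs = (phi1 - gamma @ phi0) @ mu_T
# Right-hand side: imbalance in X minus the approximation error
imbalance_term = (mu.T @ X1 - mu.T @ (X0.T @ gamma)) / T0
approx_term = (mu.T @ eps1 - mu.T @ (eps0.T @ gamma)) / T0
rhs = imbalance_term @ mu_T - approx_term @ mu_T
```

Since $\mu' X_i = T_0 \phi_i + \mu' \varepsilon_i$, the noise contributions cancel and the two sides agree exactly.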
4.2 Bounding the bias for weights on the simplex
We can bound the bias in Equation (18) for weights on the simplex using the triangle inequality
and accounting for the worst-case approximation error:
Theorem 1. Under the linear factor model (16) with sub-Gaussian noise, for weights $\gamma \in \Delta^{N_0}$ and for any $\delta > 0$,

$$\Bigg| \Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T \Bigg| \;\le\; \underbrace{\frac{J M^2}{\sqrt{T_0}} \big\| X_1 - X_{0\cdot}'\gamma \big\|_2}_{\text{imbalance in } X} \; + \; \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}} \Big( \sqrt{\log N_0} + 2\delta \Big)}_{\text{approximation error}} \qquad (19)$$

with probability at least $1 - 4e^{-\delta^2/2}$, where $\max_{jt} |\mu_{jt}| \le M$.
Theorem 1 contains two competing terms: the pre-treatment fit and the approximation error due
to balancing lagged outcomes rather than balancing the latent factors themselves. To understand
this theorem, we consider two special cases. First, in the special case with exact balance, considered
by Abadie et al. (2010), the first term of Equation (19) is zero and the bias is controlled by the
approximation error, which converges to zero in probability as T0 → ∞. Intuitively, the lagged
outcome for unit i at time t, Xit, is a noisy proxy for the index φ′iµt. Thus, as we observe more Xit
— and can exactly balance each one — we are better able to match on this index and, as a result,
on the underlying factor loadings. Without exact balance, Theorem 1 shows that a long pre-period
may not be enough to control bias.5 This emphasizes the critical role that conditioning on excellent
pre-treatment fit plays in ensuring that SCM yields reasonable estimates of the counterfactual.
4.3 Bounding the bias for Ridge Augmented SCM
We now consider the bias of ridge ASCM under a linear factor model, by relaxing the simplex
constraint in Theorem 1. This relaxation introduces an additional term in the bound due to an
increased risk of over-fitting.
Theorem 2. Under the linear factor model (16) with sub-Gaussian noise, for any $\delta > 0$, the ridge ASCM weights with hyperparameter $\lambda^r = \lambda N_0$ satisfy the bound

$$\Bigg| \Bigg( \phi_1 - \sum_{W_i = 0} \hat\gamma_i^{\text{aug}} \phi_i \Bigg) \cdot \mu_T \Bigg| \;\le\; \frac{J M^2}{\sqrt{T_0}} \Bigg( \underbrace{\Bigg\| \mathrm{diag}\bigg( \frac{\lambda}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2}_{\text{imbalance in } X} + \underbrace{\Bigg\| \mathrm{diag}\bigg( \frac{4 d_j \sigma}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2}_{\text{excess approximation error}} \Bigg) + \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}} \Big( \sqrt{\log N_0} + 2\delta \Big)}_{\text{SCM approximation error}} \qquad (20)$$

with probability at least $1 - 4e^{-\delta^2/2} - 2e^{-N_0(2 - \sqrt{\log 5})^2/2}$.

⁵Similar results exist for bias in other panel data settings. For instance, the approximation error in (19) is analogous to the so-called “Nickell (1981) bias” that arises in short panels. Separately, Ferman and Pinto (2018) argue that — even if weights exist that exactly balance the latent factor loadings — as $T_0 \to \infty$, SCM weights will converge to a solution that does not exactly balance the latent factors, leading to bias. Theorem 1 complements their asymptotic analysis with a finite-sample bound that holds for all weights on the simplex, and explicitly includes the level of imbalance in the lagged outcomes.
Comparing Theorems 1 and 2, we see that the term deriving from imbalance in X will be
smaller for ridge ASCM than for SCM alone. However, allowing for extrapolation through negative
weights creates the potential for additional approximation error in excess of the SCM approximation
error. Therefore, ridge augmentation affects the bias in two competing ways. In one direction,
augmentation may reduce the bias by improving the pre-treatment fit, as in Lemma 2. In the other
direction, augmentation may increase bias by over-fitting to the noisy lagged outcomes, increasing
the approximation error relative to SCM alone. The gains to augmentation will therefore depend
on the signal-to-noise ratio — i.e., the size of the largest singular values relative to σ — and the
level of pre-treatment fit along the principal directions of X0·. When the signal to noise ratio is high
and the pre-treatment fit is poor, the improvement in bias from augmentation can be substantial.
Comparing the two bias bounds also shows the importance of the simplex constraint in control-
ling the approximation error. In the worst case scenario, the SCM problem places all of the weight
on a single control unit and primarily balances noise rather than underlying factors. Constraining
weights to the simplex mitigates this risk by restricting the maximal weight to be one. As a
result, the worst-case approximation error in Theorem 1 scales only logarithmically with N0 while
decreasing with √T0. In contrast, allowing for negative weights creates the potential for worse
approximation error in Theorem 2; the soft penalty and tuning parameter λ take the place of a
hard constraint on the weights in controlling the approximation error.
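To make this trade-off concrete, the ridge-augmented weights have a closed form: starting from the SCM weights, ridge ASCM adds a correction that extrapolates to shrink the remaining imbalance by the factor λ/(d_j² + λ) along each principal direction, exactly the "imbalance in X" term above. The sketch below is illustrative Python/NumPy, not the paper's augsynth implementation; the function and variable names are ours.

```python
import numpy as np

def ridge_ascm_weights(X0, X1, gamma_scm, lam):
    """Ridge-augmented SCM weights (illustrative sketch).

    X0:        (N0, T0) control pre-treatment outcomes
    X1:        (T0,)    treated unit's pre-treatment outcomes
    gamma_scm: (N0,)    baseline SCM weights on the simplex
    lam:       ridge penalty lambda
    """
    T0 = X0.shape[1]
    # remaining imbalance in the lagged outcomes under the SCM weights
    imbalance = X1 - X0.T @ gamma_scm
    # linear correction that extrapolates outside the simplex to reduce it
    adjustment = X0 @ np.linalg.solve(X0.T @ X0 + lam * np.eye(T0), imbalance)
    return gamma_scm + adjustment
```

As λ → ∞ the correction vanishes and we recover SCM; smaller λ trades extrapolation for better pre-treatment fit, mirroring the two competing terms in the bound.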
Finally, the analyses for both Theorems 1 and 2 are restricted to a finite sample setting with
a fixed number of units (e.g., the 50 states), where we attempt to understand the bias in any
particular practitioner’s data series. The situation is different in a setting where units are sampled
from a (possibly infinite) population, and both N0 → ∞ and T0 → ∞. In this case, under
appropriate regularity conditions, the bias tends to zero since including more units improves the
pre-intervention fit faster than it increases the approximation error. Arkhangelsky et al. (2019) and
Ferman (2019) both consider this setting and show that the standard SCM estimator is consistent
as both N0 →∞ and T0 →∞. In addition, results in Arkhangelsky et al. (2019) imply that ridge
ASCM is consistent under slightly weaker assumptions than SCM alone.6
4.4 Hyper-parameter selection
Regularization plays a vital role in navigating the trade-off between improved pre-treatment fit
and increased approximation error. As λ → ∞, ridge ASCM will converge to standard SCM, and
the excess approximation error will be zero, yielding the same bias bound as in Theorem 1. At
the other extreme where λ = 0, we estimate a (potentially infeasible) un-regularized regression.
Pre-treatment fit will be perfect in this case, but the approximation error could be large.
In general, the bias bound in Theorem 2 will not be monotonic in λ and will be minimized by
some λ ∈ (0,∞); we must therefore select λ carefully in practice. In Appendix Figure D.1 we inspect
an empirical example of the bound as a function of λ. We show through simulation in Section
6.2 that the gains to pre-treatment fit through augmentation outweigh the potential for increased
approximation error in a range of practical settings.
We propose a cross-validation approach for selecting λ inspired by the in-time placebo check
proposed by Abadie et al. (2015). Let $\hat{Y}_{1(-t)} = \sum_{W_i = 0} \gamma^{aug}_{i(-t)} Y_{it}$ be the estimate of $Y_{1t}$ where time period $t$ is excluded from fitting the estimator in (10). Abadie et al. (2015) propose to compare the difference $Y_{1t} - \hat{Y}_{1(-t)}$ for some $t \le T_0$. We can extend this idea to compute the leave-one-out cross-validation MSE over time periods:

$$CV(\lambda) = \sum_{t=1}^{T_0} \left( Y_{1t} - \hat{Y}_{1(-t)} \right)^2. \tag{21}$$
We can then choose λ to minimize CV (λ) or follow a more conservative approach such as the
“one-standard-error” rule (Hastie et al., 2009). This proposal is similar to the leave-one-out cross
validation proposed by Doudchenko and Imbens (2017), who select hyperparameters by holding
out control units and minimizing the MSE of the control units in the post-treatment time T .
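The criterion in (21) can be sketched as follows. For brevity this illustration cross-validates plain ridge-regression weights rather than re-fitting the full ridge ASCM estimator at every held-out period; the function names are ours, and in practice the held-out prediction should come from the estimator in (10) itself.

```python
import numpy as np

def ridge_only_weights(X0, X1, lam):
    """Weights minimizing ||X1 - X0' gamma||^2 + lam * ||gamma||^2 (closed form)."""
    N0 = X0.shape[0]
    return np.linalg.solve(X0 @ X0.T + lam * np.eye(N0), X0 @ X1)

def cv_error(X0, X1, lam):
    """CV(lambda), Eq. (21): hold out each pre-period, re-fit on the rest,
    and accumulate the squared error of the held-out prediction."""
    T0 = X0.shape[1]
    total = 0.0
    for t in range(T0):
        keep = np.delete(np.arange(T0), t)
        gamma = ridge_only_weights(X0[:, keep], X1[keep], lam)
        total += (X1[t] - X0[:, t] @ gamma) ** 2
    return total

def select_lambda(X0, X1, grid):
    """Choose lambda on a grid by minimizing CV(lambda)."""
    return min(grid, key=lambda lam: cv_error(X0, X1, lam))
```

A more conservative choice, as in the text, would pick the largest λ on the grid whose CV error is within one standard error of the minimum.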
5 Auxiliary covariates and residualization
Thus far, we have focused exclusively on the lagged outcomes as predictors. We now turn to
incorporating auxiliary covariates into ASCM. The most immediate extension is to simply include
these auxiliary covariates in parallel to the lagged outcomes in both the outcome and SCM models.
6Arkhangelsky et al. (2019) show consistency of ASCM augmented with a linear model with a hard constraint on the regression coefficients, in particular for the constrained form of ridge regression. While the constrained form does not have a closed-form solution as a weighting estimator, there exists a choice of hyperparameter that is equivalent to the Lagrangian form we consider here.
In the special case where the auxiliary covariates are low dimensional, we instead propose a two-
step procedure, extending a suggestion from Doudchenko and Imbens (2017): first residualize the
pre- and post-treatment outcomes on the auxiliary covariates and then fit Ridge ASCM on the
residualized outcomes. We show that this controls bias under a linear factor model with auxiliary
covariates and also show that recent proposals for SCM with an intercept shift (Doudchenko and
Imbens, 2017; Ferman et al., 2017) are a special case of this approach.
To define relevant notation, let Zi ∈ RK denote the vector of covariates for unit i; Zi may
include summaries of lagged outcomes or time-varying covariates, such as the pre-treatment mean
X̄i, treated as unit-level measures. Let Z0· ∈ RN0×K denote the matrix of covariates for the donor
units. We assume that the covariates are centered, Z̄0· = 0.7
5.1 Incorporating auxiliary covariates into ASCM
We begin by describing the general approach where we include both X and Z in parallel into both
SCM and the augmentation, and then shift to the special case where the auxiliary covariates are
low dimensional. For the general case, we can fit SCM by minimizing a weighted average of the
imbalance in X and Z over non-negative weighting vectors:
$$\min_{\gamma \in \Delta^{N_0}} \; \theta_x \left\| X_1 - X_{0\cdot}'\gamma \right\|_2^2 + \theta_z \left\| Z_1 - Z_{0\cdot}'\gamma \right\|_2^2 + \zeta \sum_{W_i = 0} f(\gamma_i) \tag{22}$$
where θx and θz are researcher-selected weights that govern the relative priority placed on balancing
the two vectors, which need not be on the same scale. As above, we can augment these weights
with an outcome model m(Xi, Zi) that is a function of both the lagged outcomes and auxiliary
covariates. For example, a simple extension of ridge ASCM chooses m(X,Z) = η0 + X ′ηx + Z ′ηz
to be linear in X and Z, fit via ridge regression with different penalties for the two components:
$$\min_{\eta_0, \eta_x, \eta_z} \; \frac{1}{2} \sum_{W_i = 0} \left( Y_i - \left( \eta_0 + X_i'\eta_x + Z_i'\eta_z \right) \right)^2 + \lambda_x \|\eta_x\|_2^2 + \lambda_z \|\eta_z\|_2^2. \tag{23}$$
As with ridge ASCM, including auxiliary covariates into the ridge augmentation again yields a
weighting estimator that balances both the lagged outcomes and the auxiliary covariates.
In many settings, the dimension of Z is small relative to N (i.e., K � N), and it is feasible
to fit a regression model that regularizes the lagged outcome coefficients ηx but does not regu-
larize the auxiliary covariate coefficients ηz (i.e., set λz = 0). In Lemma 4 below, we show that
this is a weighting estimator that perfectly balances the auxiliary covariates Z. Importantly, the
corresponding pre-treatment fit in the lagged outcomes X only depends on the pre-treatment fit
7A complementary strategy is to consider using SCM to balance lower-dimensional summaries of the vector of lagged outcomes, such as the estimated pre-treatment average, trend, or factor loadings; see, for example, Gobillon and Magnac (2016) and Hazlett and Xu (2018).
on the residuals from regressions of the lagged outcomes on Z. We therefore propose a two-step
procedure where we first residualize the pre- and post-treatment outcomes on Z and then estimate
Ridge ASCM on the residualized outcomes.8 We propose this as our default approach for practice
in settings where SCM alone is not feasible.
Lemma 4. Let η̂x and η̂z be the solutions to (23) with λx = λr and λz = 0. For any weight vector
γ that sums to one, the ASCM estimator from Equation (6) with m̂(Xi, Zi) = X'iη̂x + Z'iη̂z is

$$\sum_{W_i = 0} \gamma_i Y_i + \left( X_1 - \sum_{W_i = 0} \gamma_i X_i \right)' \hat\eta_x + \left( Z_1 - \sum_{W_i = 0} \gamma_i Z_i \right)' \hat\eta_z = \sum_{W_i = 0} \gamma^{cov}_i Y_i, \tag{24}$$

where

$$\gamma^{cov}_i = \gamma_i + \left( \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right)' \left( \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda^r I \right)^{-1} \tilde{X}_i + \left( Z_1 - Z_{0\cdot}'\gamma \right)' \left( Z_{0\cdot}'Z_{0\cdot} \right)^{-1} Z_i, \tag{25}$$

and $\tilde{X}_i$ is the residual from a regression of pre-treatment outcomes on the control auxiliary
covariates:

$$\tilde{X}_i = X_i - Z_i' \left( Z_{0\cdot}'Z_{0\cdot} \right)^{-1} Z_{0\cdot}' X_{0\cdot}. \tag{26}$$

These weights exactly balance the auxiliary covariates, $Z_1 - Z_{0\cdot}'\gamma^{cov} = 0$; the imbalance in the
lagged outcomes is

$$\left\| X_1 - X_{0\cdot}'\gamma^{cov} \right\|_2 \le \left( \frac{\lambda^r}{\lambda^r + N_0 \tilde{d}_r^2} \right) \left\| \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right\|_2, \tag{27}$$

where $\tilde{d}_r$ is the minimal singular value of $\tilde{X}_{0\cdot}$.
Lemma 4 has two main implications for incorporating auxiliary covariates into ridge ASCM.
First, the weights γcov exactly balance the auxiliary covariates regardless of whether the SCM
weights alone also achieve exact balance. Thus, we can simply exclude Z from Equation (22) or,
equivalently, set θz = 0. Second, from Equation (27) we see that the pre-treatment fit on the lagged
outcomes X depends on how well the SCM weights balance the residualized lagged outcomes X̃.
This suggests modifying Equation (22) to balance X̃ rather than X, which leads to the two-step
procedure we outline above.
8This is equivalent to fitting the partitioned ridge regression (23) and fitting SCM weights to the residualized lagged outcomes. Doudchenko and Imbens (2017) also suggest a version of residualization that is equivalent to excluding the lagged outcomes X from the outcome regression.
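The two-step procedure and the weights in (25)–(26) can be sketched directly. This is illustrative Python/NumPy with our own function names, not the augsynth implementation; the exact-balance property $Z_1 - Z_{0\cdot}'\gamma^{cov} = 0$ holds because the covariate adjustment is un-penalized.

```python
import numpy as np

def residualize(X0, X1, Z0, Z1):
    """Residualize lagged outcomes on the covariates, Eq. (26),
    using coefficients fit on the control units."""
    B = np.linalg.solve(Z0.T @ Z0, Z0.T @ X0)   # (K, T0) regression coefficients
    return X0 - Z0 @ B, X1 - B.T @ Z1

def covariate_weights(X0, X1, Z0, Z1, gamma, lam):
    """gamma^cov from Eq. (25): ridge augmentation on the residualized
    outcomes plus an exact (un-penalized) adjustment for the covariates."""
    Xt0, Xt1 = residualize(X0, X1, Z0, Z1)
    T0 = X0.shape[1]
    ridge_adj = Xt0 @ np.linalg.solve(Xt0.T @ Xt0 + lam * np.eye(T0),
                                      Xt1 - Xt0.T @ gamma)
    z_adj = Z0 @ np.linalg.solve(Z0.T @ Z0, Z1 - Z0.T @ gamma)
    return gamma + ridge_adj + z_adj
```

Because the residualized outcomes are orthogonal to the control covariates by construction, the ridge term leaves the covariate balance untouched.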
5.2 Bias under a linear factor model with covariates
We can quantify the behavior of this two-step procedure in controlling bias under a more general
form of the linear factor model (16) with covariates:
$$Y_{it}(0) = \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \sum_{k=1}^{K} B_{tk} Z_{ik} + \varepsilon_{it} = \phi_i'\mu_t + B_t' Z_i + \varepsilon_{it}, \tag{28}$$
where Bt ∈ RK is a set of coefficients on the covariates (see Abadie et al., 2010; Botosaru and
Ferman, 2019, for additional discussion). Under this model the bias will be due to imbalance in both
the latent factor loadings φi and the covariates Zi. The auxiliary covariates are perfectly balanced
due to the residualization and we can bound the bias due to imbalance in φ using techniques from
Sections 3 and 4. For ease of exposition, we assume that the control covariates are standardized
and rotated, which can always be achieved by pre-processing. We present results for the simpler case
in which we fit SCM on the residualized pre-treatment outcomes rather than ridge ASCM (i.e. we
let λr →∞); the more general version follows immediately by applying Theorem 2.
Theorem 3. Under the linear factor model with covariates (28) with $\frac{1}{N_0} Z_{0\cdot}'Z_{0\cdot} = I_K$ and sub-Gaussian noise, for any δ > 0, γcov in Equation (25) with λr → ∞ satisfies the bound

$$\left| \left( \phi_1 - \sum_{W_i = 0} \gamma^{cov}_i \phi_i \right) \cdot \mu_T \right| \le \frac{4 J M^2}{\sqrt{T_0}} \left( \underbrace{\left\| \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right\|_2}_{\text{imbalance in } \tilde{X}} + \underbrace{\sigma \sqrt{\frac{K}{N_0}} \left\| Z_1 - Z_{0\cdot}'\gamma \right\|_2}_{\text{excess approximation error}} \right) + \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}}\left(\sqrt{\log N_0} + 2\delta\right)}_{\text{SCM approximation error}} \tag{29}$$

with probability at least $1 - 4 e^{-\delta^2/2} - 2 e^{-K N_0 \left(2 - \sqrt{\log 5}\right)^2/2}$.
Building on Lemma 4, Theorem 3 shows that, due to the additive, separable structure of the
auxiliary covariates in Equation (28), controlling the pre-treatment fit in the residualized lagged
outcomes X̃ partially controls the bias. As above, this justifies directly targeting fit in X̃ rather
than in the raw lagged outcomes X. Moreover, since the number of covariates K is small relative
to N0, the auxiliary covariates are measured without noise, and the residualized lagged outcomes
are orthogonal to the auxiliary covariates by construction, the excess approximation error due to
augmentation will be small.
As in Section 3.2.2, we can achieve better balance when we apply ridge ASCM to X̃ than when
we apply SCM alone. Because X̃ is orthogonal to Z by construction, this comes at no cost in
terms of imbalance in Z. However, the fundamental challenge of over-fitting to noise still remains,
and, as in the case without auxiliary covariates, selecting the tuning parameter remains important.
We again propose to follow the cross-validation approach in Section 4.4, here using the residualized
lagged outcomes X̃ rather than the raw lagged outcomes X.
5.3 Including the average lagged outcome as a covariate
An important special case of this two-step procedure is the de-meaned or intercept shift SCM,
which incorporates an intercept into the SCM optimization problem (Doudchenko and Imbens,
2017; Ferman and Pinto, 2018). The estimator of the treated unit's potential outcome under
control is then:

$$\hat{Y}^{int\,scm}_1(0) = \bar{X}_1 + \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - \bar{X}_i \right), \tag{30}$$

where $\bar{X}_i = \frac{1}{T_0}\sum_{t=1}^{T_0} X_{it}$ is the pre-treatment mean outcome for unit i and the weights $\gamma^{scm*}_i$ are
chosen to balance the residual outcomes $X_{it} - \bar{X}_i$ rather than the raw outcomes $X_{it}$. This estimator
is equivalent to setting $Z_i = \bar{X}_i$ and constraining ηz to equal one, thus estimating a fixed-effects
outcome model as $\hat{m}(X_i) = \bar{X}_i$. Alternatively, we can view this approach as augmenting SCM with
a panel regression on the vector of unit indicators.
The corresponding treatment effect estimate is then a weighted difference-in-differences estima-
tor:
$$\hat\tau^{de} = \left( Y_1 - \bar{X}_1 \right) - \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - \bar{X}_i \right) = \frac{1}{T_0} \sum_{t=1}^{T_0} \left[ \left( Y_1 - X_{1t} \right) - \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - X_{it} \right) \right]. \tag{31}$$
Replacing the SCM weights with uniform weights in Equation (31) yields the typical two-group
difference-in-differences approach. Replacing the SCM weights with traditional inverse propensity
score weights instead yields a variant of the semiparametric difference-in-differences estimator of
Abadie (2005), albeit using lagged outcomes as predictors rather than auxiliary covariates.
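The weighted difference-in-differences form in (31) can be sketched as follows; with uniform weights it collapses to the standard two-group difference-in-differences estimator. The function name is ours.

```python
import numpy as np

def demeaned_scm_att(Y1, Y0, X1, X0, gamma):
    """tau^de from Eq. (31): compare the treated and weighted-control outcomes,
    each measured relative to its own pre-treatment mean.

    Y1:    post-treatment outcome for the treated unit (scalar)
    Y0:    (N0,) post-treatment outcomes for the control units
    X1:    (T0,) treated unit's pre-treatment outcomes
    X0:    (N0, T0) control pre-treatment outcomes
    gamma: (N0,) weights; SCM weights give Eq. (31), uniform weights give DiD
    """
    return (Y1 - X1.mean()) - gamma @ (Y0 - X0.mean(axis=1))
```

Substituting inverse propensity score weights for gamma would instead give a variant of the Abadie (2005) semiparametric difference-in-differences estimator described above.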
In practice, the unit-specific pre-treatment average X̄i may be a poor estimate of the unit fixed
effect, due to the incidental parameter problem (Neyman and Scott, 1948). However, Ferman and
Pinto (2018) show that this combined approach has attractive theoretical properties in a range
of settings with large T0 and is often preferable to the usual two-group difference-in-differences
estimator; Arkhangelsky et al. (2019) come to a similar conclusion under a different asymptotic
framework. Thus, we also propose extending the two-step procedure above to include X̄i as a
default predictor.
6 Simulations and numerical illustrations
We now turn to an empirical illustration and simulations. First, we use our approach to examine
the effect of an aggressive tax cut on economic output in Kansas in 2012. We then generate
simulation studies calibrated to this application and assess the performance of different methods,
finding substantial gains from ASCM and related approaches overall.
6.1 Illustration: 2012 Kansas tax cuts
In 2010, Sam Brownback was elected governor of Kansas, having run on a platform emphasizing
tax cuts and deficit reduction (see Rickman and Wang, 2018, for further discussion and analysis).
Upon taking office, he implemented a substantial personal income tax cut, both lowering rates
and reducing credits and deductions. This is a valuable test of “supply side” models: Brownback
argued that the tax cuts would increase business activity in Kansas, generating economic growth
and additional tax revenues that would make up for the static revenue losses. Kansas’ subsequent
economic performance has not been impressive relative to its neighbors; however, potentially con-
founding factors include a drought and declines in the locally important aerospace industry. Finding
a credible control for Kansas is thus challenging, and SCM-type models offer a potential solution.
We estimate the effect of the tax cuts on log gross state product (GSP) per capita using the
second quarter of 2012 — when Brownback signed the tax cut bill into law — as the intervention
time. We use three primary estimators: (1) SCM alone fit on the entire vector of lagged outcomes,
(2) ridge ASCM, and (3) the two-step approach where we first regress the lagged outcomes on
predictors then fit ridge ASCM on the residuals as in Section 5.9 Figure 1, known as a “gap plot”,
shows the difference between Kansas and its synthetic control before and after the passage of the
tax cuts, plus or minus two jackknife standard errors.10 Appendix Figure D.2 additionally shows
the results for: de-meaned SCM; ridge regression alone; ridge ASCM with λ chosen to minimize
CV (λ); and the original SCM proposal incorporating auxiliary covariates (Abadie et al., 2010).
SCM achieves fairly poor pre-treatment fit; the synthetic control exceeds the treatment unit by
two to four percent in 2004-2005. This lack of pre-treatment fit should make us wary of the validity
of the SCM effect estimates, and suggests that there may be gains to augmentation. Augmenting
SCM with ridge regression indeed provides better pre-treatment fit. Adding auxiliary covariates,
balanced by OLS before fitting ridge ASCM to the residuals, improves fit even more; the cor-
responding imbalances are small in magnitude throughout the 1990-2011 period. The estimated
9The covariates we include are the pre-treatment averages of (1) log state and local revenue per capita, (2) log average weekly wages, (3) number of establishments per capita, (4) the employment level, and (5) log GSP per capita.
10We compute standard errors for the imputed counterfactual Y1(0). See Arkhangelsky et al. (2019) for a discussion of jackknife standard errors in the SCM and Augmented SCM setting. For the augmented estimators we select the hyperparameter λr as the largest λ within one standard error of the λ that minimizes the cross-validation placebo fit CV(λ); see Section 4.4. Appendix Figure D.1 plots CV(λ) and one standard error for the ridge ASCM estimator as an empirical analog of the bound in Theorem 2.
Figure 1: Point estimates ± two standard errors of the ATT for the effect of the tax cuts on log GSP per capita using SCM, ridge ASCM, and ridge ASCM with covariates.
treatment effect in this model is a nearly 5% decrease in GSP per capita that persists through the
end of our sample and is statistically significant at nearly every time point. This is a larger negative
effect than is indicated either by SCM (a 2% decrease) or ridge ASCM (a 3.8% decrease), neither
of which is consistently negative.
To check against over-fitting, Figure 2 shows in-time placebo estimates for the ridge ASCM
estimator with covariates with placebo treatment times in the second quarter of 2009, 2010, and
2011. With placebo treatment in 2009 and 2011 we estimate placebo effects that are near zero; with
placebo treatment in 2010 the point estimates of the placebo effects are large, but do not appear
meaningfully different from zero. Appendix Figures D.3 and D.4 show the corresponding placebo
estimates for SCM alone and ridge ASCM without covariates.
Figure 3a shows the covariate balance for the three estimators. While SCM and ridge ASCM
achieve excellent fit for the pre-treatment average log GSP per capita, neither estimator perfectly
balances the other covariates, most notably the average employment level across the quarters of the
pre-period. In contrast — and by design — including an un-penalized regression on the auxiliary
covariates perfectly balances them. Moreover, after residualizing out the covariates, SCM along
with the ridge augmentation on lagged outcomes is able to achieve very good pre-treatment fit on
the lagged outcomes as shown in Figure 1.
Figure 3b shows the weighting on the donor units for SCM and ridge ASCM. Here we see the
minimal extrapolation property of the ASCM weights. The SCM weights are zero for all but six
donor states. The ridge ASCM weights are similar but deviate slightly from the simplex, with only
one unit receiving a meaningful negative weight. In addition, the ridge ASCM weights retain some
Figure 2: Placebo point estimates ± two standard errors for ridge ASCM with covariates, with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure 3: (a) Covariate balance for SCM, ridge ASCM, and ASCM with covariates. (b) Donor unit weights for SCM and ridge ASCM.
of the interpretability of the SCM weights. For the donor units with positive SCM weight, ridge
ASCM places close to the same weight. For the majority of those with zero SCM weight, ridge
ASCM places a weight close to zero, with relatively few donor units receiving a large negative weight. By
contrast, Appendix Figure D.5 shows the weights from ridge regression alone: many of the weights
are negative and the weights are far from sparse.
6.2 Calibrated simulation studies
We now turn to simulation studies in which the true data generating process is known. To make
our simulation studies as realistic as possible, we conduct “calibrated” simulation studies (Kern
et al., 2016), based on estimates of linear factor models fit to the series of log GSP per capita we
study above (N = 50, T0 = 89). We use the Generalized Synthetic Control Method (Xu, 2017) to
estimate factor models with three latent factors for each application. We then simulate outcomes
using the distribution of estimated parameters and model selection into treatment as a function
of the latent factors; see Appendix A for additional details. We also present results from three
additional DGPs, each calibrated to estimates from the Kansas data: (1) the factor model with
quadruple the standard deviation of the noise term, (2) a unit and time fixed effects model, and
(3) an autoregressive model with 3 lags.
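A stylized version of one simulation draw can be sketched as follows: outcomes follow a three-factor model and selection into treatment depends on the latent loadings. The parameters here are illustrative placeholders, not the calibrated estimates described in Appendix A.

```python
import numpy as np

def simulate_factor_panel(N=50, T0=89, T_post=3, J=3, sigma=1.0, seed=0):
    """Draw a panel from Y_it(0) = phi_i' mu_t + eps_it with J latent factors.
    Illustrative parameters only; the paper calibrates these to the GSP data."""
    rng = np.random.default_rng(seed)
    T = T0 + T_post
    phi = rng.normal(size=(N, J))             # unit-specific factor loadings
    mu = rng.normal(size=(J, T))              # latent time-varying factors
    eps = sigma * rng.normal(size=(N, T))     # i.i.d. noise
    Y = phi @ mu + eps
    # model selection into treatment as a function of the latent loadings
    treated = int(np.argmax(phi @ rng.normal(size=J)))
    return Y, treated
```

Repeating such draws, applying each estimator, and averaging the absolute error over draws gives the bias summaries reported in Figure 4.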
First, we explore the role of augmentation using simple outcome models. For each DGP, we
consider five estimators: (1) SCM alone, (2) ridge regression alone, (3) ridge ASCM, (4) fixed effects
alone, and (5) SCM augmented with fixed effects (i.e., de-meaned SCM). Figure 4 shows the main
results. Each panel displays the average absolute bias for all five estimators across all simulations.
Appendix Figure D.6 shows the estimator RMSE for the same simulations.
There are several takeaways. First, as anticipated, SCM alone — without conditioning on
excellent pre-treatment fit — is badly biased across simulations. This underscores the importance
of the recommendation in Abadie et al. (2010, 2015) to use SCM only in settings with excellent
pre-treatment fit.11 Second, augmenting SCM with a ridge outcome regression reduces bias relative
to SCM in all four simulations. Under the baseline factor model and the fixed effect model, the
ridge augmentation greatly reduces bias — by more than 75% in the factor model simulation and
over 90% in the fixed effects simulation. However, in the AR(3) model, as well as in the factor
model with reduced signal to noise ratio, the gains to augmentation relative to SCM are more
limited. Third, ridge ASCM has slightly lower bias than ridge regression alone across all of the
simulation settings. Fourth, when the fixed effects estimator is incorrectly specified, combining
SCM with a fixed effects estimator has much lower bias than either method alone; even when the
fixed effects estimator is correctly specified the de-meaned SCM has similar bias to the (correctly
specified) fixed effects approach. Finally, Appendix Figure D.6 shows that in all simulations ASCM
11Abadie et al. (2010, 2015) also strongly recommend incorporating auxiliary covariates, weighted by their predictive power, into the procedure, noting that this is important for further reducing bias. For simplicity, the simulations do not include auxiliary covariates.
Figure 4: Overall absolute bias for (a) the factor model simulation, (b) the factor model simulation with quadruple the standard deviation, (c) the fixed effects simulation, and (d) the AR simulation. In panels (a) and (b), the outcome is scaled such that the residual standard deviation is one. The SCM estimates reported here are not restricted to simulation draws with excellent pre-treatment fit; Abadie et al. (2015) advise against using SCM in such settings.
has lower RMSE than SCM, as the large decrease in bias more than makes up for the slight increase
in variance.
We now consider how the benefits of augmentation relate to the quality of the original SCM
fit. Figure 5 shows the bias and root mean squared error (RMSE) as a function of λ for the two
factor model simulations, both unconditional on SCM fit and conditional on SCM fit being either
in the top half or the top quartile across simulations. As reference, the limit of ASCM as λ gets
large is the original SCM estimator, so the right-most points in the plots represent SCM. Several
key facts stand out. First, as expected, Augmented SCM can substantially reduce bias when SCM
is applied without regard to the quality of the pre-treatment fit. Second, restricting to simulations
with relatively good fit limits the gains from augmentation, although augmentation still reduces
bias even in the top quartile of pre-treatment fit. Third, it is possible to over-fit with ASCM, as
evident in much larger RMSE for ASCM with too-small λ. Fourth, with additional noise the risk
of over-fitting is greater, mitigating the gains to augmentation; conditional on good SCM fit the
RMSE is nearly monotonically decreasing with λ. Finally, when the SCM fit is good or the noise
is large, the bias is not monotonically increasing with λ and there is some optimal choice of λ that
Figure 5: Bias and RMSE of ridge ASCM as a function of λ under a linear factor model with moderate and high noise.
minimizes the bias. Appendix Figures D.7 and D.8 show the bias and RMSE of several estimators
conditioned on SCM fit in the top quintile and show similar results: conditioned on good SCM fit
the gains of augmentation are limited.
Next, we evaluate alternative outcome models for use in ASCM. For each DGP we consider
SCM augmented with (1) Lasso, (2) a random forest, (3) CausalImpact (Brodersen et al., 2015),
(4) matrix completion using MCPanel (Athey et al., 2017) and (5) fitting the factor model directly
with gsynth (Xu, 2017). We compare ASCM to the pure outcome models as well as pure SCM.
Figure 6 shows the absolute bias for these methods and Appendix Figure D.6 shows the RMSE.
We broadly see the same results as with ridge ASCM. In our simulations, augmenting SCM almost
always reduces the bias relative to SCM (unconditional on good pre-treatment fit) with some models
improving SCM more than others. Additionally, in nearly every case ASCM also has lower bias
than outcome modeling alone. Appendix Figures D.7 and D.8 show the bias and RMSE when SCM
fit is in the top quintile. As before, conditioned on good pre-treatment fit the gains to augmentation
with flexible outcome models are more limited, except with the oracle gsynth estimator.
Figure 6: Absolute bias for SCM, ridge, fixed effects, and several machine learning and panel data outcome models, and their augmented versions, using the same data generating processes as Figure 4.
Overall we find consistently good performance across data generating processes from SCM
augmented with a penalized regression model.12 Given this good performance, the relative simplicity
of penalized regression, and the small number of units and high-dimensional structure typical of
SCM settings, we recommend augmenting SCM with penalized regression in practice. In particular,
we suggest ridge regression; among its other benefits, ridge ASCM allows the practitioner to
diagnose the level of extrapolation that the outcome model adds.
7 Discussion
SCM is a widely popular approach for estimating policy impacts at the jurisdiction level, such as the
city or state. By design, however, the method is limited to settings where excellent pre-treatment
fit is possible. For settings when this is infeasible, we introduce Augmented SCM, which controls
pre-treatment fit while minimizing extrapolation. We show that this approach reduces bias under
12Arkhangelsky et al. (2019) also show good performance when augmenting SCM with non-negative least squares.
a linear factor model and propose several extensions, including to incorporate auxiliary covariates.
There are several directions for future work. The most immediate is to develop robust inferential
methods for settings where pre-treatment SCM fit is imperfect. We adopt here jackknife standard
errors proposed by Arkhangelsky et al. (2019), but these are justified only asymptotically; see also
Arkhangelsky and Imbens (2019), who discuss doubly robust identification and estimation in this
setting. We could also extend this to allow for sensitivity analysis that directly parameterizes
departures from, say, the linear factor model, as in recent approaches for sensitivity analysis for
balancing weights (Soriano et al., 2019).
A second area for future inquiry is the application of the ASCM framework to settings with
multiple treated units. For instance, there are different approaches in settings when all treated
units are treated at the same time: some papers propose to fit SCM separately for each treated
unit (e.g., Abadie and L’Hour, 2018), while others simply average the units together (e.g., Kreif
et al., 2016; Robbins et al., 2017). The situation is more complicated with staggered adoption,
when units take up the treatment at different times (e.g., Dube and Zipperer, 2015; Donohue et al.,
2017). We explore this extension in Ben-Michael et al. (2019).
A third potential extension is to more complex data structures, such as applications with mul-
tiple outcomes series for the same units (e.g., measures of both earnings and total employment in
minimum wage studies) or hierarchical data structures with outcome information at both the indi-
vidual and aggregate level (e.g., students within schools). Similarly, the theoretical results in this
paper are poorly suited to count outcomes. This is especially important in gun violence research
where the outcome of interest might be the number of homicides (Rudolph et al., 2015).
References
Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of EconomicStudies 72 (1), 1–19.
Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and methodologicalaspects. Journal of Economic Literature.
Abadie, A., A. Diamond, and J. Hainmueller (2010). Synthetic Control Methods for ComparativeCase Studies: Estimating the Effect of California’s Tobacco Control Program. Journal of theAmerican Statistical Association 105 (490), 493–505.
Abadie, A., A. Diamond, and J. Hainmueller (2015). Comparative Politics and the SyntheticControl Method. American Journal of Political Science 59 (2), 495–510.
Abadie, A. and J. Gardeazabal (2003). The Economic Costs of Conflict: A Case Study of theBasque Country. The American Economic Review 93 (1), 113–132.
Abadie, A. and G. W. Imbens (2011). Bias-corrected matching estimators for average treatmenteffects. Journal of Business & Economic Statistics 29 (1), 1–11.
Abadie, A. and J. L’Hour (2018). A penalized synthetic control estimator for disaggregated data.
Amjad, M., D. Shah, and D. Shen (2018). Robust synthetic control. The Journal of MachineLearning Research 19 (1), 802–852.
Arkhangelsky, D., S. Athey, D. A. Hirshberg, G. W. Imbens, and S. Wager (2019). Syntheticdifference in differences. arXiv preprint arXiv:1812.09970 .
Arkhangelsky, D. and G. W. Imbens (2019). Double-robust identification for causal panel datamodels. arXiv preprint arXiv:1909.09412 .
Athey, S., M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017). Matrix CompletionMethods for Causal Panel Data Models. arxiv 1710.10251 .
Athey, S. and G. W. Imbens (2017). The state of applied econometrics: Causality and policyevaluation. Journal of Economic Perspectives 31 (2), 3–32.
Athey, S., G. W. Imbens, and S. Wager (2018). Approximate Residual Balancing: De-BiasedInference of Average Treatment Effects in High Dimensions. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology).
Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica 77 (4), 1229–1279.
Ben-Michael, E., A. Feller, and J. Rothstein (2019). Synthetic control and weighted event studymodels with staggered adoption. Technical report. working paper.
Botosaru, I. and B. Ferman (2019). On the role of covariates in the synthetic control method. TheEconometrics Journal 22 (2), 117–130.
Breidt, F. J. and J. D. Opsomer (2017). Model-Assisted Survey Estimation with Modern PredictionTechniques. Statistical Science 32 (2), 190–205.
26
Brodersen, K. H., F. Gallusser, J. Koehler, N. Remy, and S. L. Scott (2015). Inferring Causal Impact using Bayesian Structural Time-Series Models. The Annals of Applied Statistics 9(1), 247–274.
Cassel, C. M., C.-E. Sarndal, and J. H. Wretman (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63(3), 615–620.
Chernozhukov, V., K. Wuthrich, and Y. Zhu (2018). Inference on average treatment effects in aggregate panel data settings. arXiv preprint arXiv:1812.10820.
Deville, J.-C. and C.-E. Sarndal (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., C.-E. Sarndal, and O. Sautory (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Donohue, J. J., A. Aneja, and K. D. Weber (2017). Right-to-carry laws and violent crime: A comprehensive assessment using panel data and a state-level synthetic control analysis. Technical report, National Bureau of Economic Research.
Doudchenko, N. and G. W. Imbens (2017). Difference-in-Differences and Synthetic Control Methods: A Synthesis. arXiv preprint arXiv:1610.07748.
Dube, A. and B. Zipperer (2015). Pooling multiple case studies using synthetic controls: An application to minimum wage policies.
Ferman, B. (2019). On the Properties of the Synthetic Control Estimator with Many Periods and Many Controls.
Ferman, B. and C. Pinto (2018). Synthetic controls with imperfect pre-treatment fit.
Ferman, B., C. Pinto, and V. Possebom (2017). Cherry picking with synthetic controls.
Gobillon, L. and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics 98(3), 535–551.
Hastie, T., J. Friedman, and R. Tibshirani (2009). The Elements of Statistical Learning. Springer Series in Statistics, New York.
Hazlett, C. and Y. Xu (2018). Trajectory balancing: A general reweighting approach to causal inference with time-series cross-sectional data.
Hsiao, C., Q. Zhou, et al. (2018). Panel parametric, semi-parametric and nonparametric construction of counterfactuals: California tobacco control revisited. Technical report.
Kern, H. L., E. A. Stuart, J. Hill, and D. P. Green (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness 9(1), 103–127.
King, G. and L. Zeng (2006). The dangers of extreme counterfactuals. Political Analysis 14(2), 131–159.
Kline, P. (2011). Oaxaca-Blinder as a reweighting estimator. American Economic Review 101, 532–537.
Kreif, N., R. Grieve, D. Hangartner, A. J. Turner, S. Nikolova, and M. Sutton (2016). Examination of the synthetic control method for evaluating health policies with multiple treated units. Health Economics 25(12), 1514–1528.
Minard, S. and G. R. Waddell (2018). Dispersion-weighted synthetic controls.
Neyman, J. and E. L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16(1), 1–32.
Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
Nickell, S. (1981). Biases in dynamic models with fixed effects. Econometrica: Journal of the Econometric Society, 1417–1426.
Powell, D. (2018). Imperfect synthetic controls: Did the Massachusetts health care reform save lives?
Rickman, D. S. and H. Wang (2018). Two tales of two US states: Regional fiscal austerity and economic performance. Regional Science and Urban Economics 68, 46–55.
Robbins, M., J. Saunders, and B. Kilmer (2017). A Framework for Synthetic Control Methods With High-Dimensional, Micro-Level Data: Evaluating a Neighborhood-Specific Crime Intervention. Journal of the American Statistical Association 112(517), 109–126.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
Rubin, D. B. (1973). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 185–203.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688.
Rubin, D. B. (1980). Comment on "Randomization analysis of experimental data: The Fisher randomization test". Journal of the American Statistical Association 75(371), 591–593.
Rudolph, K. E., E. A. Stuart, J. S. Vernick, and D. W. Webster (2015). Association between Connecticut's permit-to-purchase handgun law and homicides. American Journal of Public Health 105(8), e49–e54.
Samartsidis, P., S. R. Seaman, A. M. Presanis, M. Hickman, D. De Angelis, et al. (2019). Assessing the causal effect of binary interventions from observational panel data with few treated units. Statistical Science 34(3), 486–503.
Soriano, D., E. Ben-Michael, A. Feller, and S. Pimentel (2019). Sensitivity analysis for balancing weights. Technical report, working paper.
Tan, Z. (2017). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data.
Tan, Z. (2018). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. arXiv preprint arXiv:1801.09817.
Wainwright, M. (2018). High-dimensional statistics: A non-asymptotic viewpoint.
Wang, Y. and J. R. Zubizarreta (2018). Minimal Approximately Balancing Weights: Asymptotic Properties and Practical Considerations.
Xu, Y. (2017). Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models. Political Analysis 25, 57–76.
Zhao, Q. (2018). Covariate Balancing Propensity Score by Tailored Loss Functions. Annals of Statistics, forthcoming.
Zhao, Q. and D. Percival (2017). Entropy balancing is doubly robust. Journal of Causal Inference 5(1).
Zubizarreta, J. R. (2015). Stable Weights that Balance Covariates for Estimation With Incomplete Outcome Data. Journal of the American Statistical Association 110(511), 910–922.
A Simulation data generating process
We now describe the simulations in detail. We use the Generalized Synthetic Control Method (Xu, 2017) to fit the following linear factor model to the observed series of log GSP per capita (N = 50, T_0 = 89, T = 105), setting J = 3:

\[
Y_{it} = \alpha_i + \nu_t + \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \varepsilon_{it}. \tag{32}
\]
We then use these estimates as the basis for simulating data. Appendix Figure D.9 shows the estimated factors μ. We use the estimated time fixed effects ν and factors μ and then simulate data using Equation (32), drawing

\[
\alpha_i \sim N(\hat{\alpha}, \sigma_\alpha), \qquad \phi \sim N(0, \Sigma_\phi), \qquad \varepsilon_{it} \sim N(0, \sigma_\varepsilon),
\]

where \(\hat{\alpha}\) and σ_α are the estimated mean and standard deviation of the unit fixed effects, Σ_φ is the sample covariance of the estimated factor loadings, and σ_ε is the estimated residual standard deviation. We also simulate outcomes with quadruple the standard deviation, sd(ε_{it}) = 4σ_ε. We assume a sharp null of zero treatment effect in all DGPs and estimate the ATT at the final time point.
To model selection, we compute the (marginal) propensity scores as

\[
\pi_i = P(T_i = 1 \mid \alpha_i, \phi_i) = \operatorname{logit}^{-1}\Bigl\{\theta\Bigl(\alpha_i + \sum_j \phi_{ij}\Bigr)\Bigr\},
\]

where we set θ = 1/2 to re-scale the factors and fixed effects to have unit variance. Finally, we restrict each simulation to have a single treated unit and therefore normalize the selection probabilities as \(\pi_i / \sum_j \pi_j\).
We also consider an alternative data generating process that specializes the linear factor model to only include unit and time fixed effects:

\[
Y_{it}(0) = \alpha_i + \nu_t + \varepsilon_{it}.
\]

We calibrate this data generating process by fitting the fixed effects with gsynth and drawing new unit fixed effects from \(\alpha_i \sim N(\hat{\alpha}, \sigma_\alpha)\). We then model selection proportional to the fixed effect as above with θ = 3/2. Second, we generate data from an AR(3) model:

\[
Y_{it}(0) = \rho_0 + \sum_{j=1}^{3} \rho_j Y_{i(t-j)} + \varepsilon_{it},
\]

where we fit ρ_0, ρ to the observed series of log GSP per capita. We model selection as proportional to the last three outcomes, \(\pi_i = \operatorname{logit}^{-1}\{\theta(\sum_{j=1}^{3} Y_{i(T_0-j+1)})\}\), and set θ = 5/2. For this simulation we estimate the ATT at time T_0 + 1.
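The factor-model DGP above can be sketched numerically. This is a minimal illustration in Python (numpy); the quantities standing in for \(\hat{\alpha}\), σ_α, ν, μ, Σ_φ, and σ_ε are placeholders, whereas in the paper they are estimated from the log GSP series with gsynth (and the selection score is standardized to unit variance before scaling by θ).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, J = 50, 105, 3

# Placeholder stand-ins for the gsynth-estimated quantities.
alpha_hat, sigma_alpha, sigma_eps = 10.0, 0.1, 0.02
nu = np.cumsum(rng.normal(0.005, 0.01, size=T))   # time fixed effects
mu = rng.normal(size=(T, J))                      # latent factors
Sigma_phi = np.eye(J) * 0.05                      # loading covariance

# Draw unit-level parameters and simulate Y_it = alpha_i + nu_t + phi_i' mu_t + eps_it.
alpha = rng.normal(alpha_hat, sigma_alpha, size=N)
phi = rng.multivariate_normal(np.zeros(J), Sigma_phi, size=N)
eps = rng.normal(0.0, sigma_eps, size=(N, T))
Y = alpha[:, None] + nu[None, :] + phi @ mu.T + eps

# Selection: inverse-logit of theta * (fixed effect + summed loadings),
# normalized so the probabilities pick out a single treated unit.
theta = 0.5
score = theta * (alpha + phi.sum(axis=1))
pi = 1 / (1 + np.exp(-score))
pi = pi / pi.sum()
treated = rng.choice(N, p=pi)
```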
B Proofs
Proof of Lemmas 1 and 3. Recall that the lagged outcomes are centered by the control averages. Notice that

\[
\begin{aligned}
Y_1(0)^{\text{aug}} &= m(X_1) + \sum_{W_i=0} \gamma_i^{\text{scm}}\,(Y_{iT} - m(X_i)) \\
&= \eta_0 + \eta' X_1 + \sum_{W_i=0} \gamma_i^{\text{scm}}\,(Y_{iT} - \eta_0 - X_i'\eta) \\
&= \sum_{W_i=0} \bigl(\gamma_i^{\text{scm}} + (X_1 - X_{0\cdot}'\gamma^{\text{scm}})'(X_{0\cdot}'X_{0\cdot} + \lambda I_{T_0})^{-1} X_i\bigr)\, Y_{iT} \\
&= \sum_{W_i=0} \gamma_i^{\text{aug}}\, Y_{iT}.
\end{aligned} \tag{33}
\]
The expression for Y_1(0)^r follows.

We now prove that γ^r and γ^aug solve the weighting optimization problems (15) and (11), respectively. First, the Lagrangian dual to (15) is

\[
\min_{\alpha, \beta}\; \frac{1}{2} \sum_{W_i=0} \Bigl(\alpha + \beta' X_i + \frac{1}{N_0}\Bigr)^2 - (\alpha + \beta' X_1) + \frac{\lambda}{2}\|\beta\|_2^2, \tag{34}
\]

where we have used that the convex conjugate of \(\frac{1}{2}\bigl(x - \frac{1}{N_0}\bigr)^2\) is \(\frac{1}{2}\bigl(y + \frac{1}{N_0}\bigr)^2 - \frac{1}{2N_0^2}\). Solving for α we see that

\[
\sum_{W_i=0} (\alpha + \beta' X_i) + 1 = 1.
\]

Since the lagged outcomes are centered, this implies that α = 0. Now solving for β we see that

\[
X_{0\cdot}'\Bigl(\frac{1}{N_0}\mathbf{1} + X_{0\cdot}\beta\Bigr) + \lambda\beta = X_1,
\]

which implies that \(\beta = (X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_1\). Finally, the weights are the ridge weights

\[
\gamma_i = \frac{1}{N_0} + X_1'(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_i = \gamma_i^r.
\]
Similarly, the Lagrangian dual to (11) is

\[
\min_{\alpha, \beta}\; \frac{1}{2} \sum_{W_i=0} \bigl(\alpha + \beta' X_i + \gamma_i^{\text{scm}}\bigr)^2 - (\alpha + \beta' X_1) + \frac{\lambda}{2}\|\beta\|_2^2, \tag{35}
\]

where we have used that the convex conjugate of \(\frac{1}{2}(x - \gamma_i^{\text{scm}})^2\) is \(\frac{1}{2}(y + \gamma_i^{\text{scm}})^2 - \frac{1}{2}(\gamma_i^{\text{scm}})^2\). Solving for α we see that

\[
\sum_{W_i=0} (\alpha + \beta' X_i) + 1 = 1.
\]

Since the lagged outcomes are centered, this implies that α = 0. Now solving for β we see that

\[
X_{0\cdot}'\bigl(\gamma^{\text{scm}} + X_{0\cdot}\beta\bigr) + \lambda\beta = X_1,
\]

which implies that \(\beta = (X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\). Finally, the weights are the ridge ASCM weights

\[
\gamma_i = \gamma_i^{\text{scm}} + (X_1 - X_{0\cdot}'\gamma^{\text{scm}})'(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_i = \gamma_i^{\text{aug}}.
\]
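The equality in Equation (33) between the outcome-model form and the weighting form of ridge ASCM can be checked numerically. This is a sketch in Python with synthetic inputs; the simplex weights standing in for γ^scm are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N0, T0, lam = 40, 12, 2.0

X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)                  # center lagged outcomes by control averages
X1 = rng.normal(size=T0)
Y0 = rng.normal(size=N0)               # control outcomes Y_iT
g_scm = rng.dirichlet(np.ones(N0))     # arbitrary simplex stand-in for gamma^scm

# Weighting form: gamma^aug from Eq. (33)
A = np.linalg.inv(X0.T @ X0 + lam * np.eye(T0))
g_aug = g_scm + X0 @ A @ (X1 - X0.T @ g_scm)

# Outcome-model form: m(X_1) + sum_i gamma^scm_i (Y_iT - m(X_i)), with ridge m;
# the intercept cancels because the scm weights sum to one.
eta = A @ X0.T @ Y0
est_model = eta @ X1 + g_scm @ (Y0 - X0 @ eta)

# The two forms coincide
assert np.isclose(g_aug @ Y0, est_model)
```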
Proof of Lemma 2. Notice that

\[
\begin{aligned}
X_1 - X_{0\cdot}'\gamma^{\text{aug}} &= \bigl(I - X_{0\cdot}'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}\bigr)(X_1 - X_{0\cdot}'\gamma^{\text{scm}}) \\
&= N_0\lambda\,(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}}) \\
&= V \operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr) V'(X_1 - X_{0\cdot}'\gamma^{\text{scm}}).
\end{aligned}
\]

So since V is orthogonal,

\[
\|X_1 - X_{0\cdot}'\gamma^{\text{aug}}\|_2 = \Bigl\|\operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\]
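The shrinkage identity in Lemma 2 is easy to verify numerically. The sketch below works with the singular values s_j of the (centered) control matrix directly, rather than the paper's scaled d_j, which absorb a \(1/\sqrt{N_0}\) factor; with an unscaled ridge penalty λ the shrinkage factors are λ/(s_j² + λ).

```python
import numpy as np

rng = np.random.default_rng(2)
N0, T0, lam = 30, 8, 1.5
X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)
X1 = rng.normal(size=T0)
g_scm = rng.dirichlet(np.ones(N0))

A = np.linalg.inv(X0.T @ X0 + lam * np.eye(T0))
g_aug = g_scm + X0 @ A @ (X1 - X0.T @ g_scm)

# Residual imbalance is the SCM imbalance shrunk by lam (X0'X0 + lam I)^{-1}
lhs = X1 - X0.T @ g_aug
rhs = lam * A @ (X1 - X0.T @ g_scm)
assert np.allclose(lhs, rhs)

# Equivalently, coordinate j in the rotated basis V'x shrinks by lam/(s_j^2 + lam)
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
shrink = lam / (s**2 + lam)
assert np.allclose(Vt @ lhs, shrink * (Vt @ (X1 - X0.T @ g_scm)))
```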
Lemma 5. The ridge augmented SCM weights with hyperparameter N_0λ, γ^aug, satisfy

\[
\|\gamma^{\text{aug}}\|_2 \leq \|\gamma^{\text{scm}}\|_2 + \frac{1}{\sqrt{N_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2, \tag{36}
\]

with \(\tilde{X}_i = V'X_i\) as defined in Lemma 2.
Proof of Lemma 5. Notice that using the singular value decomposition and by the triangle inequality,

\[
\begin{aligned}
\|\gamma^{\text{aug}}\|_2 &= \bigl\|\gamma^{\text{scm}} + X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\bigr\|_2 \\
&= \Bigl\|\gamma^{\text{scm}} + U \operatorname{diag}\Bigl(\frac{\sqrt{N_0}\, d_j}{N_0 d_j^2 + N_0\lambda}\Bigr) V'(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2 \\
&\leq \|\gamma^{\text{scm}}\|_2 + \Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{(d_j^2 + \lambda)\sqrt{N_0}}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\end{aligned}
\]
Lemma 6. Under the linear factor model (16), for weights γ with \(\sum_i \gamma_i = 1\), the imbalance in the latent factor loadings is

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr) = \underbrace{\frac{1}{T_0}\,\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)}_{\text{imbalance}} - \underbrace{\frac{1}{T_0}\,\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)}_{\text{approx. error}}. \tag{37}
\]
Proof of Lemma 6. Collecting the factor loadings into a matrix \(\Phi \in \mathbb{R}^{N\times J}\), notice that

\[
\mu(\Phi_{1\cdot} - \Phi_{0\cdot}'\gamma) = X_1 - \varepsilon_1 - (X_{0\cdot} - \varepsilon_{0\cdot})'\gamma.
\]

Multiplying both sides by \((\mu'\mu)^{-1}\mu'\) and recalling that \(\frac{1}{T_0}\mu'\mu = I\), this gives

\[
\Phi_{1\cdot} - \Phi_{0\cdot}'\gamma = \frac{1}{T_0}\mu'(X_1 - X_{0\cdot}'\gamma) - \frac{1}{T_0}\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma).
\]

This gives the result.
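Lemma 6's decomposition can be confirmed on simulated data. In the sketch below the factors are orthonormalized so that \(\mu'\mu/T_0 = I\), as the lemma assumes; the weights γ are arbitrary simplex weights over the controls.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T0, J = 20, 60, 2

# Latent factors normalized so that mu'mu = T0 * I
Q, _ = np.linalg.qr(rng.normal(size=(T0, J)))
mu = np.sqrt(T0) * Q
mu_T = rng.normal(size=J)               # factor values at the post period

phi = rng.normal(size=(N, J))           # loadings; unit 0 is treated
eps = rng.normal(scale=0.1, size=(N, T0))
X = phi @ mu.T + eps                    # pre-period outcomes under the factor model

g = rng.dirichlet(np.ones(N - 1))       # weights on the N-1 control units

# Loading imbalance = observed imbalance term minus approximation-error term
lhs = mu_T @ (phi[0] - g @ phi[1:])
imb = mu_T @ mu.T @ (X[0] - g @ X[1:]) / T0
err = mu_T @ mu.T @ (eps[0] - g @ eps[1:]) / T0
assert np.isclose(lhs, imb - err)
```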
Lemma 7. For weights γ with \(\|\gamma\|_1 \leq 1\), ε_it sub-Gaussian with scale parameter σ and independent, and δ > 0,

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr| \leq 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigl(\frac{1}{\sqrt{N_1}} + 1\Bigr)\Bigr) \tag{38}
\]

with probability at least \(1 - 4e^{-T_0\delta^2/2}\).
Proof of Lemma 7. First, we break up the expression into two components with the triangle inequality:

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr| \leq \frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_1\bigr| + \frac{1}{T_0}\Bigl|\mu_T'\mu'\sum_{W_i=0}\varepsilon_i\gamma_i\Bigr|.
\]

We now deal with each term in turn. First, since the ε_it are sub-Gaussian and independent, \(\mu_T'\mu'\varepsilon_i = \sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\sqrt{\sum_{t=1}^{T_0}(\mu_T'\mu_{t\cdot})^2} \leq JM^2\sqrt{T_0}\). Thus, by the sub-Gaussian tail bound,

\[
P\Bigl(\frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_1\bigr| \geq \frac{\delta J M^2 \sigma}{\sqrt{N_1}}\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Next, by Holder's inequality,

\[
\frac{1}{T_0}\Bigl|\mu_T'\mu'\sum_{W_i=0}\varepsilon_i\gamma_i\Bigr| \leq \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty \|\gamma\|_1 \leq \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty.
\]

Note that this bound corresponds to the worst case where γ has a weight of 1 on the unit with the largest noise term correlated with the latent factors. Using the union bound to bound the maximum of sub-Gaussian variables, we see that

\[
P\Bigl(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty \geq 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Combining these two bounds gives the result.
Proof of Theorem 1. From Lemma 6 and using the triangle inequality, the imbalance in φ satisfies

\[
\Bigl|\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr)\Bigr| \leq \frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)\Bigr| + \frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr|.
\]

Recall that \(\mu_{tj} \leq M\), so \(\|\mu\|_2 \leq M\sqrt{JT_0}\). Again using Cauchy-Schwarz and \(\|\mu_T\|_2 \leq M\sqrt{J}\), we have the bound

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)\Bigr| \leq \frac{1}{T_0}\|\mu_T\|_2\|\mu\|_2\|X_1 - X_{0\cdot}'\gamma\|_2 \leq \frac{M^2 J}{\sqrt{T_0}}\|X_1 - X_{0\cdot}'\gamma\|_2.
\]

Combining this with Lemma 7 yields that

\[
\Bigl|\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr)\Bigr| \leq \frac{M^2 J}{\sqrt{T_0}}\|X_1 - X_{0\cdot}'\gamma\|_2 + 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigl(\frac{1}{\sqrt{N_1}} + 1\Bigr)\Bigr)
\]

with probability at least \(1 - 4e^{-T_0\delta^2/2}\). Specializing to the case where N_1 = 1 and setting \(\delta = \delta/\sqrt{T_0}\) gives the result.
Proof of Theorem 2. From Lemma 6 the bias for ridge ASCM is

\[
\begin{aligned}
\mu_T\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{aug}}\phi_i\Bigr) &= \frac{1}{T_0}\mu_T'\mu'\Bigl(X_1 - \sum_{W_i=0}\gamma_i^{\text{aug}} X_i - \varepsilon_1 + \sum_{W_i=0}\gamma_i^{\text{aug}}\varepsilon_i\Bigr) \\
&= \frac{1}{T_0}\mu_T'\mu'\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}} - \varepsilon_1 + \varepsilon_{0\cdot}'\gamma^{\text{scm}} - (X_{0\cdot} - \varepsilon_{0\cdot})'(\gamma^{\text{aug}} - \gamma^{\text{scm}})\bigr).
\end{aligned}
\]

Using Lemma 1, we can write the bias as

\[
\begin{aligned}
\text{bias} &= \frac{1}{T_0}\mu_T'\mu'\Bigl(\bigl(I - (X_{0\cdot} - \varepsilon_{0\cdot})'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigr)\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr) - \varepsilon_1 + \varepsilon_{0\cdot}'\gamma^{\text{scm}}\Bigr) \\
&= \underbrace{\frac{1}{T_0}\mu_T'\mu'\bigl(\lambda I + \varepsilon_{0\cdot}'X_{0\cdot}\bigr)(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)}_{(*)} - \frac{1}{T_0}\mu_T'\mu'\bigl(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{scm}}\bigr).
\end{aligned}
\]
Lemma 7 provides a bound for the second term, so we now inspect the first term. By the triangle inequality and Cauchy-Schwarz,

\[
|(*)| \leq \frac{1}{T_0}\|\mu\mu_T\|_2\,\lambda\bigl\|(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)\bigr\|_2 + \Bigl|\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)\Bigr|.
\]

Using Lemmas 2 and 5 we get that

\[
|(*)| \leq \frac{JM^2}{\sqrt{T_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2 + \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \frac{1}{\sqrt{N_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\]

Now we must bound the term \(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2\). From the proof of Lemma 7, \(\frac{1}{T_0}\sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\frac{JM^2}{\sqrt{T_0}}\). A standard discretization argument (Wainwright, 2018) allows us to bound the L2 norm:

\[
P\Bigl(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \geq 2JM^2\sigma\Bigl(\sqrt{\frac{N_0\log 5}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Replacing δ with \(\sqrt{\frac{N_0}{T_0}}\,(2 - \sqrt{\log 5})\) we see that

\[
\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \leq 4JM^2\sigma\sqrt{\frac{N_0}{T_0}}
\]

with probability at least \(1 - 2\exp\bigl(-\frac{N_0(2-\sqrt{\log 5})^2}{2}\bigr)\). Combining this bound with Lemma 7 and using the union bound gives the result.
Proof of Lemma 4. The regression parameters η_x and η_z in Equation (23) are

\[
\eta_x = (\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}\tilde{X}_{0\cdot}'Y_0 \quad \text{and} \quad \eta_z = (Z_{0\cdot}'Z_{0\cdot})^{-1}Z_{0\cdot}'(Y_0 - X_{0\cdot}\eta_x). \tag{39}
\]

Now notice that

\[
\begin{aligned}
Y_1(0)^{\text{cov}} &= \eta_x'X_1 + \eta_z'Z_1 + \sum_{W_i=0} (Y_i - \eta_x'X_i - \eta_z'Z_i)\gamma_i \\
&= \eta_x'(X_1 - X_{0\cdot}'\gamma) + \eta_z'(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= \eta_x'(X_1 - X_{0\cdot}'\gamma) - \eta_x'X_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= \eta_x'(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) + Y_0'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= Y_0'\bigl(\gamma + \tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) + Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma)\bigr).
\end{aligned} \tag{40}
\]

This gives the form of γ^cov. The imbalance in Z is

\[
\begin{aligned}
Z_1 - Z_{0\cdot}'\gamma^{\text{cov}} &= \bigl(I - Z_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}\bigr)(Z_1 - Z_{0\cdot}'\gamma) - Z_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) \\
&= 0,
\end{aligned} \tag{41}
\]

since \(Z_{0\cdot}'\tilde{X}_{0\cdot} = 0\). The pre-treatment fit is

\[
\begin{aligned}
X_1 - X_{0\cdot}'\gamma^{\text{cov}} &= (\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) - X_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) \\
&= \bigl(I - \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}\bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma),
\end{aligned} \tag{42}
\]

using that \(X_{0\cdot}'\tilde{X}_{0\cdot} = \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot}\). This gives the bound on the pre-treatment fit.
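Both conclusions of Lemma 4 (exact balance on the covariates Z and the shrunken pre-treatment fit in Equation (42)) can be verified numerically. This is a sketch with arbitrary base weights γ; X̃ denotes the lagged outcomes residualized on the covariates using the control-unit coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
N0, T0, K, lam = 50, 10, 3, 1.0

X0 = rng.normal(size=(N0, T0))
Z0 = rng.normal(size=(N0, K))
X1 = rng.normal(size=T0)
Z1 = rng.normal(size=K)
g = rng.dirichlet(np.ones(N0))               # arbitrary base weights

# Residualize lagged outcomes on the covariates (control coefficients)
G = np.linalg.solve(Z0.T @ Z0, Z0.T @ X0)    # K x T0
X0t = X0 - Z0 @ G                            # residualized controls
X1t = X1 - G.T @ Z1                          # residualized treated vector

A = np.linalg.inv(X0t.T @ X0t + lam * np.eye(T0))
g_cov = (g
         + X0t @ A @ (X1t - X0t.T @ g)
         + Z0 @ np.linalg.solve(Z0.T @ Z0, Z1 - Z0.T @ g))

# Covariates are balanced exactly...
assert np.allclose(Z1 - Z0.T @ g_cov, 0)

# ...and the residual pre-treatment fit matches Eq. (42)
lhs = X1 - X0.T @ g_cov
rhs = (np.eye(T0) - X0t.T @ X0t @ A) @ (X1t - X0t.T @ g)
assert np.allclose(lhs, rhs)
```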
Proof of Theorem 3. Building on Lemma 6, we can see that the bias due to imbalance in φ is equal to

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{cov}}\phi_i\Bigr) = \frac{1}{T_0}\mu_T'\mu'(X_1 - X_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu' B'(Z_1 - Z_{0\cdot}'\gamma^{\text{cov}}), \tag{43}
\]

where \(B = [B_1, \ldots, B_{T_0}] \in \mathbb{R}^{K\times T_0}\). By construction, γ^cov perfectly balances the covariates, and combined with Lemma 4, the bias due to imbalance in φ simplifies to

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{cov}}\phi_i\Bigr) = \frac{1}{T_0}\mu_T'\mu'(X_1 - X_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{cov}}).
\]

We now turn to bounding the noise terms. First, notice that

\[
\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'\gamma^{\text{cov}} = \frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'\gamma + \frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma).
\]

We have bounded the first term on the right-hand side in Lemma 7. To bound the second term, notice that \(\sum_{W_i=0}\sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}Z_{ik}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\sigma J M^2\sqrt{T_0}\,\|Z_{\cdot k}\|_2 = JM^2\sigma\sqrt{T_0 N_0}\). We can now bound the L2 norm of \(\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot} \in \mathbb{R}^K\):

\[
P\Bigl(\frac{1}{T_0}\|\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}\|_2 \geq 2JM^2\sigma\Bigl(\sqrt{\frac{N_0 K\log 5}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Replacing δ with \(\sqrt{\frac{KN_0}{T_0}}\,(2 - \sqrt{\log 5})\) and applying the Cauchy-Schwarz inequality, we see that

\[
\frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma)\bigr| \leq 4JM^2\sigma\sqrt{\frac{K}{T_0 N_0}}\,\|Z_1 - Z_{0\cdot}'\gamma^{\text{scm}}\|_2
\]

with probability at least \(1 - 2\exp\bigl(-\frac{KN_0(2-\sqrt{\log 5})^2}{2}\bigr)\). Combining with Lemma 4 and putting together the pieces with the union bound gives the result.
C Connection to balancing weights and IPW
We have motivated Augmented SCM via bias correction. An alternative motivation comes from the connection between SCM and inverse propensity score weighting (IPW).
First, notice that the SCM weights from the constrained optimization problem in Equation (3) are a form of approximate balancing weights (see, for example, Zubizarreta, 2015; Athey et al., 2018; Tan, 2017; Wang and Zubizarreta, 2018; Zhao, 2018). Unlike traditional inverse propensity score weights, which indirectly minimize covariate imbalance by estimating a propensity score model, balancing weights seek to directly minimize covariate imbalance, in this case L2 imbalance. Balancing weights have a Lagrangian dual formulation as inverse propensity score weights (see, for example, Zhao and Percival, 2017; Zhao, 2018). Extending these results to the SCM setting, the Lagrangian dual of the SCM optimization problem in Equation (3) has the form of a propensity score model. Importantly, as we discuss below, it is not always appropriate to interpret this model as a propensity score.
We first derive the Lagrangian dual for a general class of balancing weights problems, then specialize to the penalized SCM estimator (3):

\[
\begin{aligned}
\min_{\gamma} \quad & \underbrace{h_\zeta(X_{1\cdot} - X_{0\cdot}'\gamma)}_{\text{balance criterion}} + \underbrace{\sum_{W_i=0} f(\gamma_i)}_{\text{dispersion}} \\
\text{subject to} \quad & \sum_{W_i=0}\gamma_i = 1.
\end{aligned} \tag{44}
\]
This formulation generalizes Equation (3) in two ways. First, we remove the non-negativity constraint and note that this can be included by restricting the domain of the strongly convex dispersion penalty f. Examples include the re-centered L2 dispersion penalties for ridge regression and ridge ASCM, an entropy penalty (Robbins et al., 2017), and an elastic net penalty (Doudchenko and Imbens, 2017). Second, we generalize from the squared L2 norm to a general balance criterion h_ζ; another prominent example is an L∞ constraint (see, e.g., Zubizarreta, 2015; Athey et al., 2018).
Proposition 1. The Lagrangian dual to Equation (44) is

\[
\min_{\alpha,\beta}\; \underbrace{\sum_{W_i=0} f^*(\alpha + \beta'X_{i\cdot}) - (\alpha + \beta'X_{1\cdot})}_{\text{loss function}} + \underbrace{h_\zeta^*(\beta)}_{\text{regularization}}, \tag{45}
\]

where a convex, differentiable function g has convex conjugate \(g^*(y) \equiv \sup_{x\in\operatorname{dom}(g)}\{y'x - g(x)\}\). The solutions to the primal problem (44) are \(\gamma_i = f^{*\prime}(\alpha + \beta'X_i)\), where \(f^{*\prime}(\cdot)\) is the first derivative of the convex conjugate \(f^*(\cdot)\).
There is a large literature relating balancing weights to propensity score weights. This literature shows that the loss function in Equation (45) is an M-estimator for the propensity score and thus will be consistent for the propensity score parameters under large-N asymptotics. The dispersion measure f(·) determines the link function of the propensity score model, where the odds of treatment are \(\frac{\pi(x)}{1-\pi(x)} = f^{*\prime}(\alpha + \beta'x)\). Note that un-penalized SCM, which can yield multiple solutions, does not have a well-defined link function. We extend the duality to a general set of balance criteria so that Equation (45) is a regularized M-estimator of the propensity score parameters, where the balance criterion h_ζ(·) determines the type of regularization through its conjugate h_ζ^*(·). This formulation recovers the duality between entropy balancing and a logistic link (Zhao and Percival, 2017), Oaxaca-Blinder weights and a log-logistic link (Kline, 2011), and L∞ balance and L1 regularization (Wang and Zubizarreta, 2018). This more general formulation also suggests natural extensions of both SCM and ASCM beyond the L2 setting to other forms, especially L1 regularization.

Specializing Proposition 1 to a squared L2 balance criterion \(h_\zeta(x) = \frac{1}{2\zeta}\|x\|_2^2\), as in the penalized SCM problems, yields that the dual propensity score coefficients β are regularized by a ridge penalty. In the case of an entropy dispersion penalty, as Robbins et al. (2017) consider, the donor weights γ have the form of IPW weights with a logistic link function: the propensity score is \(\pi(x_i) = \operatorname{logit}^{-1}(\alpha + \beta'x_i)\), and the odds of treatment are \(\frac{\pi(x_i)}{1-\pi(x_i)} = \exp(\alpha + \beta'x_i) = \gamma_i\).

We emphasize that while Proposition 1 shows that the estimated weights have the IPW form, in SCM settings it may not always be appropriate to interpret the dual problem as a propensity score reflecting stochastic selection into treatment. For example, this interpretation would not be appropriate in some canonical SCM examples, such as the analysis of German reunification in Abadie et al. (2015).
Proof of Proposition 1. We can augment the optimization problem (44) with auxiliary variables ε, yielding:

\[
\begin{aligned}
\min_{\gamma, \varepsilon} \quad & h_\zeta(\varepsilon) + \sum_{W_i=0} f(\gamma_i) \\
\text{subject to} \quad & \varepsilon = X_{1\cdot} - X_{0\cdot}'\gamma \\
& \sum_{W_i=0}\gamma_i = 1.
\end{aligned} \tag{46}
\]

The Lagrangian is

\[
\mathcal{L}(\gamma, \varepsilon, \alpha, \beta) = \sum_{W_i=0} f(\gamma_i) + \alpha\Bigl(1 - \sum_{W_i=0}\gamma_i\Bigr) + h_\zeta(\varepsilon) + \beta'(X_{1\cdot} - X_{0\cdot}'\gamma - \varepsilon). \tag{47}
\]

The dual maximizes the objective

\[
\begin{aligned}
q(\alpha, \beta) &= \min_{\gamma, \varepsilon} \mathcal{L}(\gamma, \varepsilon, \alpha, \beta) \\
&= \sum_{W_i=0}\min_{\gamma_i}\{f(\gamma_i) - (\alpha + \beta'X_i)\gamma_i\} + \min_{\varepsilon}\{h_\zeta(\varepsilon) - \beta'\varepsilon\} + \alpha + \beta'X_{1\cdot} \\
&= -\sum_{W_i=0} f^*(\alpha + \beta'X_i) + \alpha + \beta'X_{1\cdot} - h_\zeta^*(\beta).
\end{aligned} \tag{48}
\]

By strong duality the general dual problem (45), which minimizes −q(α, β), is equivalent to the primal balancing weights problem. Given the α and β that minimize the Lagrangian dual objective −q(α, β), we recover the donor weights solution to (44) as

\[
\gamma_i = f^{*\prime}(\alpha + \beta'X_i). \tag{49}
\]
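For the re-centered L2 dispersion penalty \(f(\gamma) = \frac{1}{2}(\gamma - \frac{1}{N_0})^2\) we have \(f^{*\prime}(y) = y + \frac{1}{N_0}\), so minimizing the dual (34) numerically should recover the closed-form ridge weights through Equation (49). A sketch of this check using scipy on synthetic centered data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N0, T0, lam = 25, 6, 0.8
X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)                  # centered lagged outcomes
X1 = rng.normal(size=T0)

# Dual objective (34) for f(g) = (1/2)(g - 1/N0)^2
def dual(ab):
    a, b = ab[0], ab[1:]
    lin = a + X0 @ b
    return 0.5 * np.sum((lin + 1 / N0) ** 2) - (a + X1 @ b) + 0.5 * lam * b @ b

res = minimize(dual, np.zeros(1 + T0), method="BFGS")
a, b = res.x[0], res.x[1:]
g_dual = a + X0 @ b + 1 / N0           # primal weights via f*'(alpha + beta'X_i)

# Closed-form ridge weights: 1/N0 + X1'(X0'X0 + lam I)^{-1} X_i
g_ridge = 1 / N0 + X0 @ np.linalg.solve(X0.T @ X0 + lam * np.eye(T0), X1)

assert np.allclose(g_dual, g_ridge, atol=1e-4)
assert np.isclose(g_dual.sum(), 1.0, atol=1e-4)
```

Because the lagged outcomes are centered, the optimizer drives α to zero and β to \((X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}X_1\), exactly as in the proof of Lemmas 1 and 3.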
D Additional figures
Figure D.1: Cross-validation MSE and one standard error computed according to Equation (21). The minimal point and the maximum λ within one standard error of the minimum are highlighted. The dashed line shows the sum of the imbalance and excess approximation error terms in Theorem 2 as a function of λ, with the noise term fit via gsynth. The bound is scaled to the same scale as the MSE.
Figure D.2: Point estimates ± two standard errors of the ATT for the effect of the tax cuts on log GSP per capita using de-meaned SCM, ridge regression alone, ridge ASCM with λ chosen to minimize the cross-validated MSE, and the original SCM proposal with covariates (Abadie et al., 2010).
Figure D.3: Placebo point estimates ± two standard errors for SCM with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure D.4: Placebo point estimates ± two standard errors for ridge ASCM with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure D.5: Donor unit weights for SCM, ridge regression, and ridge ASCM.
Figure D.6: RMSE for different augmented and non-augmented estimators across outcome models.
Figure D.7: Bias for different augmented and non-augmented estimators across outcome models, conditioned on SCM fit in the top quintile.
Figure D.8: RMSE for different augmented and non-augmented estimators across outcome models, conditioned on SCM fit in the top quintile.
Figure D.9: Latent factors for calibrated simulation studies.