The Augmented Synthetic Control Method∗
Eli Ben-Michael, Avi Feller, and Jesse Rothstein
UC Berkeley
November 2019
Abstract
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The “synthetic control” is a weighted average of control units that balances the treated unit’s pre-treatment outcomes as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pre-treatment outcomes is excellent. We propose Augmented SCM to extend SCM to settings where such pre-treatment fit is infeasible. Analogous to bias correction for inexact matching, Augmented SCM uses an outcome model to estimate the bias due to imperfect pre-treatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pre-treatment fit while minimizing extrapolation from the convex hull. We bound the bias of this approach under a linear factor model and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.
∗email: [email protected]. We thank Alberto Abadie, Josh Angrist, Matias Cattaneo, Alex D’Amour, Peng
Ding, Erin Hartman, Chad Hazlett, Steve Howard, Guido Imbens, Brian Jacob, Pat Kline, Caleb Miles, Luke Miratrix,
Sam Pimentel, Fredrik Savje, Jas Sekhon, Jake Soloff, Panos Toulis, Stefan Wager, Yiqing Xu, Alan Zaslavsky, and
Xiang Zhou for thoughtful comments and discussion, as well as seminar participants at Stanford, UC Berkeley, UNC,
the 2018 Atlantic Causal Inference Conference, COMPIE 2018, and the 2018 Polmeth Conference.
1 Introduction
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment
on a single unit in panel data settings with a modest number of control units and with many
pre-treatment periods (Abadie and Gardeazabal, 2003; Abadie et al., 2010, 2015). The idea is to
construct a weighted average of control units, known as a synthetic control, that matches the treated
unit’s pre-treatment outcomes. The estimated impact is then the difference in post-treatment
outcomes between the treated unit and the synthetic control. SCM has been widely applied — the
main SCM papers have over 4,000 citations — and has been called “arguably the most important
innovation in the policy evaluation literature in the last 15 years” (Athey and Imbens, 2017).
A critical feature of the original proposal is to use SCM only when the synthetic control’s pre-
treatment outcomes closely match the pre-treatment outcomes for the treated unit (Abadie et al.,
2015). When it is not possible to construct a synthetic control that fits pre-treatment outcomes
well, the original papers advise against using SCM.
We propose the augmented synthetic control method (ASCM) to extend SCM to settings where
excellent pre-treatment fit is not feasible. Analogous to bias correction for inexact matching (Rubin,
1973; Abadie and Imbens, 2011), ASCM uses an outcome model to estimate the bias due to imper-
fect pre-treatment fit and then de-biases the original SCM estimate. If pre-treatment fit is good,
the estimated bias will be small, and the SCM and ASCM estimates will be similar. Otherwise,
the estimates will diverge, relying more heavily on the outcome model than SCM weighting.
We focus on the special case in which the outcome model is a ridge-regularized linear regression
model, which we call Ridge ASCM. Like SCM, Ridge ASCM can be written as a weighted average
of the control unit outcomes. However, where SCM weights are always non-negative, Ridge ASCM
weights can be negative. This introduces a trade off: allowing for negative weights improves pre-
treatment fit relative to SCM alone, but at the cost of extrapolating outside the convex hull of
control units. We show that the regularization parameter in Ridge ASCM directly parameterizes
this trade-off by penalizing the distance from SCM weights, analogous to calibration in survey
sampling (Deville and Särndal, 1992). By contrast, ridge regression alone can also be written as a
(possibly negative) weighting estimator, but one that instead seeks the minimum variance — rather
than minimum extrapolation — solution.
To assess the bias of the Ridge ASCM estimator, we connect pre-treatment fit to bias under
the linear factor model often invoked in this setting (Abadie et al., 2010). For weights constrained
to be on the simplex, including SCM weights, we bound the possible bias in terms of a measure
of pre-treatment fit and an approximation error. Incorporating the ridge adjustment improves
pre-treatment fit, but may increase the approximation error if the ridge regression over-fits noise
in the pre-treatment outcomes. In practice this will require careful regularization: we describe a
cross-validation hyper-parameter selection procedure that builds off the in-time placebo check in
Abadie et al. (2015).
Next, we extend the Augmented SCM approach to incorporate auxiliary covariates other than
pre-treatment outcomes. The most immediate extension is to simply include these auxiliary co-
variates in parallel to the lagged outcomes in both the outcome and SCM models. In the special
case where the auxiliary covariates are low dimensional, we instead propose a two-step procedure:
first residualize the pre- and post-treatment outcomes on the auxiliary covariates and then fit
Ridge ASCM on the residualized outcomes; this extends a suggestion from Doudchenko and Im-
bens (2017). We show that this controls bias under a linear factor model with auxiliary covariates
and also show that recent proposals for SCM with an intercept shift (Doudchenko and Imbens,
2017; Ferman and Pinto, 2018) are special cases of this approach.
We demonstrate Augmented SCM by using it to examine the effect of an aggressive tax cut
on economic output in Kansas in 2012, finding a substantial negative effect. We then generate
simulation studies calibrated to this application. Overall, we find large gains from ASCM relative
to alternative estimators, especially under model mis-specification, in terms of both bias and root
mean squared error. We implement the proposed methodology in the augsynth package for R,
available at https://github.com/ebenmichael/augsynth.
The paper proceeds as follows. Section 1.1 briefly reviews related work. Section 2 introduces
notation and the SCM estimator. Section 3 proposes Augmented SCM and gives key results for
Ridge ASCM. Section 4 bounds the bias under a linear factor model, the standard setting for SCM.
Section 5 extends the ASCM framework to incorporate auxiliary covariates and alternative outcome
models. Section 6 reports on the application to the Kansas tax cuts as well as extensive simulation
studies. Finally, Section 7 discusses some possible directions for further research. The appendix
includes all of the proofs, as well as additional derivations and technical discussion.
1.1 Related work
SCM was introduced by Abadie and Gardeazabal (2003) and Abadie et al. (2010, 2015) and is the
subject of an extensive methodological literature; see Abadie (2019) and Samartsidis et al. (2019)
for recent reviews. We briefly highlight some relevant aspects of this literature.
A first set of papers adapts the original SCM proposal to allow for more robust estimation while
retaining the simplex constraint on the weights. Robbins et al. (2017); Doudchenko and Imbens
(2017); Abadie and L’Hour (2018); Minard and Waddell (2018) incorporate a penalty on the weights
into the SCM optimization problem, building on a suggestion in Abadie et al. (2015). Gobillon and
Magnac (2016) and Hazlett and Xu (2018) explore dimension reduction strategies and other data
transformations that can improve the performance of the subsequent estimator.
A second set of papers relaxes the constraint that the weights are on the simplex as well as
other generalizations. In particular, Doudchenko and Imbens (2017) relax the SCM restriction that
control unit weights be non-negative, arguing that there are many settings in which negative weights
would be desirable. Amjad et al. (2018) propose an interesting variant that combines negative
weights with a pre-processing step. Powell (2018) instead allows for extrapolation via a Frisch-
Waugh-Lovell-style projection, which similarly generalizes the typical SCM setting. Doudchenko
and Imbens (2017) and Ferman and Pinto (2018) both propose to incorporate an intercept into
the SCM problem, which we discuss in Section 5. There have also been several other proposals to
reduce bias in SCM, developed independently and contemporaneously with ours; see Abadie and
L’Hour (2018), Chernozhukov et al. (2018), and Arkhangelsky et al. (2019). In particular, Abadie
and L’Hour (2018) also propose bias correcting SCM using regression; Arkhangelsky et al. (2019)
propose the Synthetic Difference-in-Differences estimator which, as the authors show, can be seen
as a special case of our proposal with a constrained outcome regression.
Finally, there have been several recent proposals to use outcome modeling rather than SCM-
style weighting in this setting. These include the matrix completion method in Athey et al. (2017),
the generalized synthetic control method in Xu (2017), and the combined approaches in Hsiao et al.
(2018). We explore the performance of these methods, both in isolation and in combination with
SCM, in Section 6.2.
2 Overview of the Synthetic Control Method
2.1 Notation and setup
We consider the canonical SCM panel data setting with i = 1, . . . , N units observed for t = 1, . . . , T
time periods. Let Wi be an indicator that unit i is treated at time T0 < T ; units with Wi = 0 never
receive the treatment. We restrict our attention to the case where a single unit receives treatment,
and follow the convention that the first unit is treated, that is W1 = 1. The remaining N0 = N − 1
units are potential controls, often referred to as donor units in the SCM context.
The outcome is Y and is typically continuous. We adopt the potential outcomes frame-
work (Neyman, 1923; Rubin, 1974) and invoke SUTVA, which assumes a well-defined treatment
and excludes interference between units (Rubin, 1980). The potential outcomes for unit i in period
t under control and treatment are Yit(0) and Yit(1), respectively.1 Observed outcomes are:
$$Y_{it} = \begin{cases} Y_{it}(0) & \text{if } W_i = 0 \text{ or } t \le T_0 \\ Y_{it}(1) & \text{if } W_i = 1 \text{ and } t > T_0. \end{cases} \qquad (1)$$
To keep notation simple, we assume that there is only one post-treatment observation, T = T0 + 1,
though our results are easily extended to larger T . We use Yi to represent the single post-treatment
observation for unit i, dropping t from the subscript. We use Xit, for t ≤ T0, to represent pre-
¹To simplify the discussion, we assume that potential outcomes under both treatment and control exist for all units in all time periods t > T0. With some additional technical caveats, we could relax this assumption to only require that Yit(0) be well-defined for all units, without also requiring control units to have well-defined potential outcomes under treatment.
treatment outcomes, to emphasize that pre-treatment outcomes serve as covariates in SCM. With
some abuse of notation, we use X0· to represent the N0-by-T0 matrix of control unit pre-treatment
outcomes and Y0 for the N0-vector of control unit outcomes in period T . With only one treated
unit, Y1 is a scalar, and X1· is a T0-row vector of treated unit pre-treatment outcomes. The data
structure is then:
$$\begin{pmatrix}
Y_{11} & Y_{12} & \dots & Y_{1T_0} & Y_{1T} \\
Y_{21} & Y_{22} & \dots & Y_{2T_0} & Y_{2T} \\
\vdots & & & \vdots & \vdots \\
Y_{N1} & Y_{N2} & \dots & Y_{NT_0} & Y_{NT}
\end{pmatrix}
\equiv
\begin{pmatrix}
\underbrace{\begin{matrix} X_{11} & X_{12} & \dots & X_{1T_0} \\ X_{21} & X_{22} & \dots & X_{2T_0} \\ \vdots & & & \vdots \\ X_{N1} & X_{N2} & \dots & X_{NT_0} \end{matrix}}_{\text{pre-treatment outcomes}} & \begin{matrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{matrix}
\end{pmatrix}
\equiv
\begin{pmatrix} X_{1\cdot} & Y_1 \\ X_{0\cdot} & Y_0 \end{pmatrix}
\qquad (2)$$
The treatment effect of interest is τ = τ1T = Y1(1)− Y1(0) = Y1 − Y1(0).
2.2 Synthetic Control Method
The Synthetic Control Method imputes the missing potential outcome for the treated unit, Y1(0),
as a weighted average of the control outcomes, Y ′0γ (Abadie and Gardeazabal, 2003; Abadie et al.,
2010, 2015). Weights are chosen to balance pre-treatment outcomes and possibly other covariates.
We consider a version of SCM that chooses weights γ as a solution to the constrained optimization
problem:
$$\begin{aligned}
\min_{\gamma} \quad & \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \zeta \sum_{W_i = 0} f(\gamma_i) \\
\text{subject to} \quad & \sum_{W_i = 0} \gamma_i = 1, \qquad \gamma_i \ge 0 \;\; \text{for all } i : W_i = 0,
\end{aligned} \qquad (3)$$

where $\|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 \equiv \sum_{t=1}^{T_0} (X_{1t} - X_{0t}'\gamma)^2$ is the squared 2-norm on $\mathbb{R}^{T_0}$, and the constraints limit $\gamma$ to the simplex in $\mathbb{R}^{N_0}$.
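As an illustration, problem (3) with f ≡ 0 can be solved with an off-the-shelf constrained optimizer. The paper’s own implementation is the augsynth R package; the Python sketch below is only for intuition, and the donor pool, dimensions, and noise level are invented:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N0, T0 = 10, 8
X0 = rng.normal(size=(N0, T0))                         # donor-pool pre-treatment outcomes
X1 = X0[:3].mean(axis=0) + 0.1 * rng.normal(size=T0)   # treated unit, near the convex hull

def imbalance(gamma):
    """Squared 2-norm imbalance ||X1 - X0'gamma||^2, the objective in (3) with f = 0."""
    return np.sum((X1 - X0.T @ gamma) ** 2)

res = minimize(
    imbalance,
    x0=np.full(N0, 1.0 / N0),                          # start from uniform weights
    bounds=[(0.0, 1.0)] * N0,                          # gamma_i >= 0
    constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}],  # weights sum to 1
    method="SLSQP",
)
gamma_scm = res.x
```

The resulting weights lie on the simplex and fit the pre-treatment series at least as well as the uniform starting point.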
Equation (3) modifies the original SCM proposal in two key ways. First, this version, known
as the constrained regression formulation of SCM, follows the recent methodological literature and
focuses solely on balancing the lagged outcomes X (see Doudchenko and Imbens, 2017; Ferman
and Pinto, 2018). By contrast, the original SCM formulation also includes auxiliary covariates Z
and a weighted L2 norm. Our proposal naturally accommodates such covariates; we re-introduce
them in Section 5.
Second, Equation (3) penalizes the dispersion of the weights with hyperparameter ζ > 0, fol-
lowing a suggestion in Abadie et al. (2015). Possible penalization functions include an elastic net
penalty (Doudchenko and Imbens, 2017), an entropy penalty (Robbins et al., 2017), and a penalty
based on a measure of pairwise distance (Abadie and L’Hour, 2018). When the weights are con-
strained to be non-negative, as in the original SCM proposal and in (3), the particular choice of
dispersion penalty does not play a central role. However, if this constraint is relaxed, as we consider
below, regularizing the weights with a dispersion penalty is especially important. We discuss this
choice at length in the next section.
The SCM weights in Equation (3) optimize the pre-treatment fit, minimizing the imbalance
of pre-treatment outcomes between the treated unit and the weighted control mean. When these
weights achieve perfect pre-treatment fit, that is, when $X_{1t} - X_{0t}'\hat\gamma^{\text{scm}} = 0$ for all $t$, the resulting estimator has many attractive properties, including a bias bound established by Abadie et al. (2010). Achieving perfect (or nearly perfect) pre-treatment fit is not always feasible in practice,
however. By definition, weights that yield perfect fit exist if and only if the treated unit’s X1·
vector is inside the convex hull of the control units, X0·. As a result, the curse of dimensionality
applies for large T ; see Ferman and Pinto (2018) for a detailed discussion in the SCM setting.
Abadie et al. (2015) recommend against using SCM when “the pre-treatment fit is poor or the
number of pre-treatment periods is small.” Similarly, Abadie et al. (2010, 2015) argue that SCM should be used only if the setting passes extensive balance checks, such as “placebo in time”
checks. The conditional nature of SCM analysis therefore excludes many practical settings. Our
proposal enables the use of (a modified) SCM approach even when these balance checks fail.
3 Augmented SCM
3.1 Overview
Let $Y_i = m(X_i) + \varepsilon_i$ be a working outcome model, where $m(X_i)$ is a function of pre-treatment outcomes that we estimate with $\hat m(X_i)$, and $\varepsilon_i$ is an independent error term. Under this additive structure, the bias of any weighting estimator with weights $\gamma$ can be written as:

$$\text{bias} = Y_1(0) - \sum_{W_i = 0} \gamma_i Y_i = \Bigg( m(X_1) - \sum_{W_i = 0} \gamma_i m(X_i) \Bigg) + \mathbb{E}\Bigg[ \varepsilon_1 - \sum_{W_i = 0} \gamma_i \varepsilon_i \Bigg]. \qquad (4)$$
The first term here represents bias attributable to imbalance in $X$. Using the estimated $\hat m(X_i)$, we can then estimate this component of the bias:

$$\widehat{\text{bias}} = \hat m(X_1) - \sum_{W_i = 0} \gamma_i \hat m(X_i).$$
This leads naturally to a bias-corrected estimator for $Y_1(0)$:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \gamma_i Y_i + \Bigg( \hat m(X_1) - \sum_{W_i = 0} \gamma_i \hat m(X_i) \Bigg) \qquad (5)$$

$$\phantom{\hat Y_1^{\text{aug}}(0)} = \hat m(X_1) + \sum_{W_i = 0} \gamma_i \big( Y_i - \hat m(X_i) \big). \qquad (6)$$
When the weights γi are the SCM weights defined above, we refer to this estimator as Augmented
SCM (ASCM). Standard SCM is a special case, where m(·) is set to be a constant.
Equations (5) and (6), while equivalent, highlight two distinct motivations for ASCM. Equation (5) directly corrects the SCM estimate, $\sum_i \gamma_i Y_i$, by the estimated bias. This is analogous to bias
correction for inexact matching (Rubin, 1973; Abadie and Imbens, 2011). If the estimated bias is
small, then the SCM and ASCM estimates will be similar. By contrast, Equation (6) is analogous to
standard doubly robust estimators (Robins et al., 1994), which begin with the outcome model but
then re-weight to balance residuals. This is also comparable in form to the generalized regression
estimator in survey sampling (Cassel et al., 1976; Breidt and Opsomer, 2017), which has been
adapted to the causal inference setting by, among others, Athey et al. (2018) and Tan (2018). We
emphasize the bias correction interpretation and consider residual balancing further in Section 5.
Finally, we discuss a connection to inverse propensity score weighting in Appendix C.
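The algebraic equivalence of (5) and (6) holds for any fitted outcome model and any weights. A minimal numerical check, using simulated data, arbitrary simplex weights in place of SCM weights, and a plain OLS fit as a stand-in outcome model:

```python
import numpy as np

rng = np.random.default_rng(0)
N0, T0 = 12, 6
X0 = rng.normal(size=(N0, T0))          # control pre-treatment outcomes
X1 = rng.normal(size=T0)                # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)                # control post-treatment outcomes
gamma = rng.dirichlet(np.ones(N0))      # any simplex weights, e.g. SCM weights

# Any fitted outcome model m_hat; here an OLS fit of Y0 on X0, purely for illustration
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(N0), X0]), Y0, rcond=None)
m_hat = lambda x: beta[0] + x @ beta[1:]

# Equation (5): SCM-style estimate corrected by the estimated bias
est5 = gamma @ Y0 + (m_hat(X1) - gamma @ m_hat(X0))
# Equation (6): outcome-model prediction plus re-weighted residuals
est6 = m_hat(X1) + gamma @ (Y0 - m_hat(X0))
```

The two forms agree term by term, which is why the bias-correction and doubly robust readings describe the same estimator.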
3.2 Ridge ASCM
We focus our discussion on the special case where $\hat m(X_i)$ is a ridge-regularized linear model, and discuss the use of non-ridge outcome models with ASCM in Sections 5 and 6. The outcome model is $\hat m(X_i) = \hat\eta_0^r + X_{i\cdot}'\hat\eta^r$, where $\hat\eta_0^r$ and $\hat\eta^r$ are the coefficients of a ridge regression of control post-treatment outcomes $Y_0$ on centered pre-treatment outcomes $X_{0\cdot}$ with penalty hyper-parameter $\lambda^r$:

$$\{\hat\eta_0^r, \hat\eta^r\} = \arg\min_{\eta_0, \eta} \; \frac{1}{2} \sum_{W_i = 0} \big( Y_i - (\eta_0 + X_{i\cdot}'\eta) \big)^2 + \lambda^r \|\eta\|_2^2. \qquad (7)$$
With this estimate, the ridge regression specialization of (5) is:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} Y_i + \Bigg( X_{1\cdot} - \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} X_{i\cdot} \Bigg) \cdot \hat\eta^r. \qquad (8)$$
We refer to this as Ridge Augmented SCM (ridge ASCM).
In the remainder of this section, we first show that ridge ASCM is a weighting estimator
with potentially negative weights. We then show that allowing for negative weights improves
pre-treatment fit relative to SCM alone. Finally, we show that, unlike ridge regression alone, ridge
ASCM directly penalizes extrapolation from the SCM weights. We give additional results bounding
the variance of Ridge ASCM in Lemma 5 in the Appendix.
3.2.1 Ridge ASCM as a weighting estimator
Lemma 1 expresses ridge ASCM as a weighting estimator that adjusts the raw SCM weights to
achieve better covariate balance, while penalizing the magnitude of the adjustment (see Deville
et al., 1993; Breidt and Opsomer, 2017, for similar results in survey sampling).
Lemma 1. The ridge-augmented SCM estimator (8) is:

$$\hat Y_1^{\text{aug}}(0) = \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} Y_i + \Bigg( X_{1\cdot} - \sum_{W_i = 0} \hat\gamma_i^{\text{scm}} X_{i\cdot} \Bigg) \cdot \hat\eta^r = \sum_{W_i = 0} \hat\gamma_i^{\text{aug}} Y_i, \qquad (9)$$

where

$$\hat\gamma_i^{\text{aug}} = \hat\gamma_i^{\text{scm}} + \big( X_{1\cdot} - X_{0\cdot}'\hat\gamma^{\text{scm}} \big)' \big( X_{0\cdot}'X_{0\cdot} + \lambda^r I_{T_0} \big)^{-1} X_{i\cdot}. \qquad (10)$$

Moreover, the ridge ASCM weights $\hat\gamma^{\text{aug}}$ are the solution to

$$\min_{\gamma \,:\, \sum_i \gamma_i = 1} \; \frac{1}{2\lambda^r} \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \frac{1}{2} \|\gamma - \hat\gamma^{\text{scm}}\|_2^2. \qquad (11)$$
Equation (11) is a special case of the penalized SCM problem in Equation (3), where $f(\gamma_i) = \frac{1}{2}(\gamma_i - \hat\gamma_i^{\text{scm}})^2$ and where we relax the restriction that weights are non-negative. Lemma 1 highlights
the interplay between the level of pre-treatment fit and the tuning parameter λr in determining the
size of the adjustment. When SCM yields good pre-treatment fit or λr is large, the adjustment term
is small and γaug will remain close to the SCM weights. Indeed, the ridge ASCM and SCM weights
are identical when SCM weights exactly balance the lagged outcomes. When SCM weights do not
achieve exact balance, the ridge ASCM solution will use negative weights to extrapolate from the
convex hull of the control units, with the amount of extrapolation controlled by λr. When SCM
has poor fit and λr is small, the adjustment is large and as a consequence, γaug will extrapolate
more from the support of the data.
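The closed form in Lemma 1 is easy to check numerically. The sketch below uses simulated data, arbitrary simplex weights as a stand-in for actual SCM weights, and centers on the control means to match the centered regression in (7); it is a sketch under those assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N0, T0 = 20, 15
X0 = rng.normal(size=(N0, T0))       # control pre-treatment outcomes
X1 = rng.normal(size=T0)             # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)             # control post-treatment outcomes
gamma_scm = rng.dirichlet(np.ones(N0))   # placeholder simplex weights
lam = 10.0

# Center on control means, matching the centered regression in (7)
Xbar = X0.mean(axis=0)
X0c, X1c = X0 - Xbar, X1 - Xbar

# Closed-form ridge ASCM weights, Equation (10)
M = X0c.T @ X0c + lam * np.eye(T0)
gamma_aug = gamma_scm + X0c @ np.linalg.solve(M, X1c - X0c.T @ gamma_scm)

# Ridge coefficients in the (X'X + lam I)^{-1} X'Y form used by the lemmas
eta = np.linalg.solve(M, X0c.T @ Y0)

est_weighting = gamma_aug @ Y0                                        # weighting form (9)
est_corrected = gamma_scm @ Y0 + (X1c - X0c.T @ gamma_scm) @ eta      # corrected form (8)
```

With centered columns the adjustment term sums to zero, so the augmented weights still sum to one, and the two forms of the estimator coincide.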
3.2.2 Ridge ASCM improves pre-treatment fit relative to SCM alone
The expression of ridge ASCM as a weighting estimator allows us to compare the pre-treatment
fits for ridge ASCM to that for SCM alone. Specifically, we show that the pre-treatment fit from
ridge ASCM is at least as good as — and generally better than — the pre-treatment fit from SCM
alone.
Lemma 2. Let $\frac{1}{\sqrt{N_0}} X_{0\cdot} = U D V'$ be the singular value decomposition of the matrix of control pre-intervention outcomes, where $r$ is the rank of $X_{0\cdot}$, $U \in \mathbb{R}^{N_0 \times r}$, $V \in \mathbb{R}^{T_0 \times r}$, and $D = \mathrm{diag}(d_j)$ is the diagonal matrix of singular values. Furthermore, let $\tilde X_i = V' X_i$ be the rotation of $X_i$ along the singular vectors of $X_{0\cdot}$. Then $\hat\gamma^{\text{aug}}$, the ridge ASCM weights with hyperparameter $\lambda^r = \lambda N_0$, satisfy:

$$\big\| \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{aug}} \big\|_2 = \Bigg\| \mathrm{diag}\bigg( \frac{\lambda}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2. \qquad (12)$$
There are two important cases to consider. First, as $\lambda \to 0$, Lemma 2 shows that the (possibly infeasible) ridge ASCM solution will have perfect pre-treatment fit. Second, as $\lambda \to \infty$ the pre-treatment
fit of ridge ASCM will approach that of SCM alone from below. Thus, except in corner cases, ridge
ASCM will achieve better pre-treatment fit than SCM alone, albeit at the price of possible negative
weights.2
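Equation (12) is an exact algebraic identity, which a quick simulation confirms. The data below are simulated, the SCM weights are replaced by arbitrary simplex placeholders, and for simplicity the pre-period matrix is taken as given (e.g., already centered):

```python
import numpy as np

rng = np.random.default_rng(1)
N0, T0 = 30, 12
X0 = rng.normal(size=(N0, T0))           # control pre-treatment outcomes
X1 = rng.normal(size=T0)                 # treated unit pre-treatment outcomes
gamma_scm = rng.dirichlet(np.ones(N0))   # placeholder simplex weights
lam = 2.0
lam_r = lam * N0                         # hyperparameter scaling in Lemma 2

# SVD of X0 / sqrt(N0): rows are units, columns are pre-treatment periods
U, d, Vt = np.linalg.svd(X0 / np.sqrt(N0), full_matrices=False)

# Ridge ASCM weights via the closed form (10)
resid = X1 - X0.T @ gamma_scm
gamma_aug = gamma_scm + X0 @ np.linalg.solve(X0.T @ X0 + lam_r * np.eye(T0), resid)

# Both sides of Equation (12), in the rotated coordinates Xtilde = V'X
lhs = np.linalg.norm(Vt @ (X1 - X0.T @ gamma_aug))
rhs = np.linalg.norm((lam / (d**2 + lam)) * (Vt @ resid))
```

The two norms match to floating-point precision: the ridge adjustment shrinks each rotated coordinate of the SCM imbalance by the factor λ/(d_j² + λ).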
3.2.3 Ridge ASCM directly penalizes extrapolation, unlike ridge regression alone
It is useful to compare ridge ASCM to a simple ridge regression estimator. This, too, can be written
as a weighting estimator with possibly negative weights. However, while ridge ASCM penalizes the
distance from SCM weights — and thus, extrapolation from the convex hull — ridge regression
instead penalizes the variance of the control unit weights.
Lemma 3. With $\hat\eta_0^r$ and $\hat\eta^r$ the solutions to (7), the ridge estimate can be written as a weighting estimator:

$$\hat Y_1^{r}(0) = \hat\eta_0^r + \hat\eta^{r\prime} X_{1\cdot} = \sum_{W_i = 0} \hat\gamma_i^{r} Y_i, \qquad (13)$$

where

$$\hat\gamma_i^{r} = \frac{1}{N_0} + \big( X_{1\cdot} - \bar X_0 \big)' \big( X_{0\cdot}'X_{0\cdot} + \lambda^r I_{T_0} \big)^{-1} X_{i\cdot}. \qquad (14)$$

Moreover, the ridge weights $\hat\gamma^{r}$ are the solution to

$$\min_{\gamma \,:\, \sum_i \gamma_i = 1} \; \frac{1}{2\lambda^r} \|X_{1\cdot} - X_{0\cdot}'\gamma\|_2^2 + \frac{1}{2} \bigg\| \gamma - \frac{1}{N_0} \bigg\|_2^2. \qquad (15)$$
Lemma 3 shows that ridge regression weights minimize imbalance while penalizing deviations from
uniformity.3 We can therefore view ridge regression as an adjustment to uniform weights (the
²For example, in the special case where the pre-treatment SCM error is orthogonal to the pre-intervention series of the control units, the ridge adjustment will not improve the fit. This is unlikely unless $T_0 \gg N_0$ (and impossible if $N_0 > T_0$ and $X_{0\cdot}$ is full column rank). Additionally, Lemma 2 shows that we can view ridge ASCM as a form of soft singular value thresholding, and so it is similar to the hard singular value thresholding procedure proposed by Amjad et al. (2018).
³This penalization is akin to “calibration” in survey sampling (see Deville and Särndal, 1992). The penalized SCM form (15) shows that the ridge weights are a special case of the elastic-net synthetic control estimator proposed by Doudchenko and Imbens (2017). When $\lambda = 0$, the ridge weights reduce to the OLS weights considered by Kline (2011) and Abadie et al. (2015); when $\lambda > 0$ they correspond to an $L_2$-penalized version of the Oaxaca–Blinder weights considered by Kline (2011). Lemma 3 also relates to the “vertical” and “horizontal” regressions considered by Athey et al. (2017) and Arkhangelsky et al. (2019); it implies that the two are equivalent in the special case without an intercept in the vertical regression.
minimum variance solution) to achieve better pre-treatment fit, where the regularization parameter
λr controls the trade-off between better fit and lower variance. By contrast, the ridge ASCM weights
in Lemma 1 penalize distance to SCM weights $\hat\gamma^{\text{scm}}$ rather than to uniform weights $1/N_0$. Penalizing
distance to SCM weights is natural in our setting since these weights are sparse and interpretable.
Moreover, recall that X ′0·γscm is the projection of X1 onto the convex hull of the control units.
Thus, ridge ASCM weights directly penalize extrapolation outside the convex hull, in contrast to
ridge regression, which allows for arbitrary extrapolation.
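Lemma 3's weighting representation can likewise be verified numerically. The sketch below fits the centered ridge regression directly on simulated data; note that the penalty is written in the $(X'X + \lambda I)^{-1}$ form appearing in (14), so constant-factor conventions in (7) may rescale λ:

```python
import numpy as np

rng = np.random.default_rng(2)
N0, T0 = 25, 10
X0 = rng.normal(size=(N0, T0))   # control pre-treatment outcomes
X1 = rng.normal(size=T0)         # treated unit pre-treatment outcomes
Y0 = rng.normal(size=N0)         # control post-treatment outcomes
lam = 5.0

# Ridge regression on centered pre-treatment outcomes; the intercept is the control mean
Xbar = X0.mean(axis=0)
X0c = X0 - Xbar
M = X0c.T @ X0c + lam * np.eye(T0)
eta = np.linalg.solve(M, X0c.T @ Y0)
pred = Y0.mean() + (X1 - Xbar) @ eta       # ridge prediction for the treated unit

# The same prediction as a weighting estimator, Equation (14)
gamma_r = 1.0 / N0 + X0c @ np.linalg.solve(M, X1 - Xbar)
```

The ridge weights sum to one and reproduce the regression prediction exactly, so ridge regression really is an adjustment to uniform weights.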
Allowing for negative weights is an important departure from the original SCM proposal, which
constrains weights to be on the simplex. Abadie et al. (2010, 2015) argue that the simplex constraint
is important for avoiding extrapolation and for preserving interpretability. Indeed, when it is
possible to achieve excellent pre-treatment fit with SCM, that is, when $\|X_{1\cdot} - X_{0\cdot}'\hat\gamma^{\text{scm}}\|_2$ is very
small, we should prefer SCM weights over possibly negative weights: a slight balance improvement
is not worth the extrapolation and the loss of interpretability. In cases where this is not possible,
however, some degree of extrapolation is critical for improving pre-treatment fit.4 There are two
main challenges to allowing for extrapolation. First, extrapolation increases the potential for over-
fitting. In line with our discussion above, Doudchenko and Imbens (2017) argue that regularizing the
weights with a dispersion penalty — and choosing that dispersion penalty carefully — is especially
important when weights can be negative. Second, extrapolation comes at the price of greater model
dependence. We discuss practical model checks in Section 6.
4 Bias under a linear factor model
We now consider the bias of Augmented SCM under the canonical linear factor model considered
by Abadie et al. (2010). For weights constrained to be on the simplex, as in standard SCM, the bias
is bounded by the pre-treatment fit and an approximation error due to the potential for over-fitting
to noise. Incorporating the ridge adjustment can reduce bias through improved pre-treatment fit,
but may also increase the risk of bias due to over-fitting. In practice this will require careful
regularization: we describe a cross-validation hyper-parameter selection procedure that builds off
the in-time placebo check in Abadie et al. (2015).
4.1 Linear factor model and bias for general weights
Following the setup in Abadie et al. (2010), we assume that there are J unknown, latent time-
varying factors µt = {µjt}, j = 1, . . . , J , with maxjt |µjt| ≤ M , where J will typically be small
relative to N and T0. Each unit has a vector of unknown factor loadings φi ∈ RJ , which we consider
4See King and Zeng (2006) for a discussion of extrapolation in constructing counterfactuals. As they note: “If welearn that a counterfactual question involves extrapolation, we still might wish to proceed if the question is sufficientlyimportant, but we would be aware of how much more model dependent our answers would be.”
fixed. Control potential outcomes are weighted averages of these factors plus additive noise εit:
$$Y_{it}(0) = \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \varepsilon_{it} = \phi_i'\mu_t + \varepsilon_{it}. \qquad (16)$$
Slightly abusing notation, we collect the pre-intervention factors into a matrix $\mu \in \mathbb{R}^{T_0 \times J}$, where the $t$th row of $\mu$ contains the factor values at time $t$, $\mu_t'$. Following Bai (2009) and Xu (2017), we further assume that the factors are orthogonal and normalized, i.e. that $\frac{1}{T_0}\mu'\mu = I_J$. Finally, and crucially, we assume that treatment assignment $W_i$ is ignorable given a unit's factor loadings,

$$\mathbb{E}_{\varepsilon_T}\big[ Y_i(0) \mid \phi_i, W_i \big] = \mathbb{E}_{\varepsilon_T}\big[ Y_i(0) \mid \phi_i \big],$$

where the expectation is taken with respect to $\varepsilon_T = (\varepsilon_{1T}, \dots, \varepsilon_{NT})$. Lastly, we assume that the error terms $\varepsilon_{it}$ are independent (across units and over time) sub-Gaussian random variables with scale parameter $\sigma$.
Under this model, the error of a weighting estimator depends on a bias term due to imbalance in the latent $\phi_i$ and a noise term due to the noise at time $T$:

$$Y_1(0) - \sum_{W_i = 0} \gamma_i Y_i = \underbrace{\Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T}_{\text{bias from imbalance}} + \underbrace{\varepsilon_{1T} - \sum_{W_i = 0} \gamma_i \varepsilon_{iT}}_{\text{noise}}. \qquad (17)$$
The error can be controlled by two aspects of the weights. First, the noise term depends on the
variance of the weights ‖γ‖22 and is minimized by uniform weights. Second, an estimator that
exactly balances the latent φi will yield an unbiased estimate of Y1(0); without exact balance, bias
will be proportional to the level of imbalance in φi. Balancing the observed Xi, however, will not
necessarily balance the latent φi, and the relationship between imbalance in X and bias is less
direct. Following Abadie et al. (2010), we show in the appendix that, under Equation (16), the
bias of any weighting estimator with weights γ can be expressed as
$$\Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T = \frac{1}{T_0} \underbrace{\Bigg( \mu' X_1 - \sum_{W_i = 0} \gamma_i \mu' X_i \Bigg)}_{\text{imbalance in } X} \cdot \mu_T \; - \; \frac{1}{T_0} \underbrace{\Bigg( \mu' \varepsilon_1 - \sum_{W_i = 0} \gamma_i \mu' \varepsilon_i \Bigg)}_{\text{approximation error}} \cdot \mu_T. \qquad (18)$$
The first term is the imbalance of observed lagged outcomes and the second term is an approxima-
tion error arising from the latent factor structure.
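This decomposition is an algebraic identity under the factor model (16) and the normalization $\frac{1}{T_0}\mu'\mu = I_J$, which a quick simulation confirms. All quantities below are simulated, and the weights are arbitrary simplex weights:

```python
import numpy as np

rng = np.random.default_rng(3)
N0, T0, J = 40, 20, 2

# Factors normalized so that mu'mu = T0 * I_J, as assumed in the text
mu, _ = np.linalg.qr(rng.normal(size=(T0, J)))
mu *= np.sqrt(T0)

phi = rng.normal(size=(N0 + 1, J))             # factor loadings; unit 0 is "treated"
eps = 0.1 * rng.normal(size=(N0 + 1, T0))      # pre-treatment noise
X = phi @ mu.T + eps                           # pre-treatment outcomes under (16)
phi1, phi0, X1, X0 = phi[0], phi[1:], X[0], X[1:]
eps1, eps0 = eps[0], eps[1:]
gamma = rng.dirichlet(np.ones(N0))             # any simplex weights
mu_T = rng.normal(size=J)                      # factor values at time T

# Left-hand side of (18): bias from imbalance in the latent loadings
lhs = (phi1 - gamma @ phi0) @ mu_T
# Right-hand side: imbalance in X minus the approximation error
imbalance_term = (mu.T @ X1 - mu.T @ (X0.T @ gamma)) / T0
approx_term = (mu.T @ eps1 - mu.T @ (eps0.T @ gamma)) / T0
rhs = imbalance_term @ mu_T - approx_term @ mu_T
```

Since $\mu' X_i = T_0 \phi_i + \mu' \varepsilon_i$, the noise contributions cancel and the two sides agree exactly.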
4.2 Bounding the bias for weights on the simplex
We can bound the bias in Equation (18) for weights on the simplex using the triangle inequality
and accounting for the worst-case approximation error:
Theorem 1. Under the linear factor model (16) with sub-Gaussian noise, for weights $\gamma \in \Delta^{N_0}$ and for any $\delta > 0$,

$$\Bigg| \Bigg( \phi_1 - \sum_{W_i = 0} \gamma_i \phi_i \Bigg) \cdot \mu_T \Bigg| \;\le\; \underbrace{\frac{J M^2}{\sqrt{T_0}} \big\| X_1 - X_{0\cdot}'\gamma \big\|_2}_{\text{imbalance in } X} \; + \; \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}} \Big( \sqrt{\log N_0} + 2\delta \Big)}_{\text{approximation error}} \qquad (19)$$

with probability at least $1 - 4e^{-\delta^2/2}$, where $\max_{jt} |\mu_{jt}| \le M$.
Theorem 1 contains two competing terms: the pre-treatment fit and the approximation error due
to balancing lagged outcomes rather than balancing the latent factors themselves. To understand
this theorem, we consider two special cases. First, in the special case with exact balance, considered
by Abadie et al. (2010), the first term of Equation (19) is zero and the bias is controlled by the
approximation error, which converges to zero in probability as T0 → ∞. Intuitively, the lagged
outcome for unit i at time t, Xit, is a noisy proxy for the index φ′iµt. Thus, as we observe more Xit
— and can exactly balance each one — we are better able to match on this index and, as a result,
on the underlying factor loadings. Without exact balance, Theorem 1 shows that a long pre-period
may not be enough to control bias.5 This emphasizes the critical role that conditioning on excellent
pre-treatment fit plays in ensuring that SCM yields reasonable estimates of the counterfactual.
4.3 Bounding the bias for Ridge Augmented SCM
We now consider the bias of ridge ASCM under a linear factor model, by relaxing the simplex
constraint in Theorem 1. This relaxation introduces an additional term in the bound due to an
increased risk of over-fitting.
Theorem 2. Under the linear factor model (16) with sub-Gaussian noise, for any $\delta > 0$, the ridge ASCM weights with hyperparameter $\lambda^r = \lambda N_0$ satisfy the bound

$$\Bigg| \Bigg( \phi_1 - \sum_{W_i = 0} \hat\gamma_i^{\text{aug}} \phi_i \Bigg) \cdot \mu_T \Bigg| \;\le\; \frac{J M^2}{\sqrt{T_0}} \Bigg( \underbrace{\Bigg\| \mathrm{diag}\bigg( \frac{\lambda}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2}_{\text{imbalance in } X} + \underbrace{\Bigg\| \mathrm{diag}\bigg( \frac{4 d_j \sigma}{d_j^2 + \lambda} \bigg) \big( \tilde X_1 - \tilde X_{0\cdot}'\hat\gamma^{\text{scm}} \big) \Bigg\|_2}_{\text{excess approximation error}} \Bigg) + \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}} \Big( \sqrt{\log N_0} + 2\delta \Big)}_{\text{SCM approximation error}} \qquad (20)$$

with probability at least $1 - 4e^{-\delta^2/2} - 2e^{-N_0(2 - \sqrt{\log 5})^2/2}$.

⁵Similar results exist for bias in other panel data settings. For instance, the approximation error in (19) is analogous to the so-called “Nickell (1981) bias” that arises in short panels. Separately, Ferman and Pinto (2018) argue that — even if weights exist that exactly balance the latent factor loadings — as $T_0 \to \infty$, SCM weights will converge to a solution that does not exactly balance the latent factors, leading to bias. Theorem 1 complements their asymptotic analysis with a finite-sample bound that holds for all weights on the simplex, and explicitly includes the level of imbalance in the lagged outcomes.
Comparing Theorems 1 and 2, we see that the term deriving from imbalance in X will be
smaller for ridge ASCM than for SCM alone. However, allowing for extrapolation through negative
weights creates the potential for additional approximation error in excess of the SCM approximation
error. Therefore, ridge augmentation affects the bias in two competing ways. In one direction,
augmentation may reduce the bias by improving the pre-treatment fit, as in Lemma 2. In the other
direction, augmentation may increase bias by over-fitting to the noisy lagged outcomes, increasing
the approximation error relative to SCM alone. The gains to augmentation will therefore depend
on the signal-to-noise ratio — i.e., the size of the largest singular values relative to σ — and the
level of pre-treatment fit along the principal directions of X0·. When the signal to noise ratio is high
and the pre-treatment fit is poor, the improvement in bias from augmentation can be substantial.
Comparing the two bias bounds also shows the importance of the simplex constraint in control-
ling the approximation error. In the worst case scenario, the SCM problem places all of the weight
on a single control unit and primarily balances noise rather than underlying factors. Constraining
weights to the simplex mitigates this risk by restricting the maximal weight to be one. As a
result, the worst-case approximation error in Theorem 1 scales only logarithmically with N0 while
decreasing with √T0. In contrast, allowing for negative weights creates the potential for worse
approximation error in Theorem 2; the soft penalty and tuning parameter λ take the place of a
hard constraint on the weights in controlling the approximation error.
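To make this trade-off concrete, the ridge-augmented weights have a closed form: starting from the SCM weights, ridge ASCM adds a correction that extrapolates to shrink the remaining imbalance by the factor λ/(d_j² + λ) along each principal direction, exactly the "imbalance in X" term above. The sketch below is illustrative Python/NumPy, not the paper's augsynth implementation; the function and variable names are ours.

```python
import numpy as np

def ridge_ascm_weights(X0, X1, gamma_scm, lam):
    """Ridge-augmented SCM weights (illustrative sketch).

    X0:        (N0, T0) control pre-treatment outcomes
    X1:        (T0,)    treated unit's pre-treatment outcomes
    gamma_scm: (N0,)    baseline SCM weights on the simplex
    lam:       ridge penalty lambda
    """
    T0 = X0.shape[1]
    # remaining imbalance in the lagged outcomes under the SCM weights
    imbalance = X1 - X0.T @ gamma_scm
    # linear correction that extrapolates outside the simplex to reduce it
    adjustment = X0 @ np.linalg.solve(X0.T @ X0 + lam * np.eye(T0), imbalance)
    return gamma_scm + adjustment
```

As λ → ∞ the correction vanishes and we recover SCM; smaller λ trades extrapolation for better pre-treatment fit, mirroring the two competing terms in the bound.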
Finally, the analyses for both Theorems 1 and 2 are restricted to a finite sample setting with
a fixed number of units (e.g., the 50 states), where we attempt to understand the bias in any
particular practitioner’s data series. The situation is different in a setting where units are sampled
from a (possibly infinite) population, and both N0 → ∞ and T0 → ∞. In this case, under
appropriate regularity conditions, the bias tends to zero since including more units improves the
pre-intervention fit faster than it increases the approximation error. Arkhangelsky et al. (2019) and
Ferman (2019) both consider this setting and show that the standard SCM estimator is consistent
as both N0 →∞ and T0 →∞. In addition, results in Arkhangelsky et al. (2019) imply that ridge
ASCM is consistent under slightly weaker assumptions than SCM alone.6
4.4 Hyper-parameter selection
Regularization plays a vital role in navigating the trade-off between improved pre-treatment fit
and increased approximation error. As λ → ∞, ridge ASCM will converge to standard SCM, and
the excess approximation error will be zero, yielding the same bias bound as in Theorem 1. At
the other extreme where λ = 0, we estimate a (potentially infeasible) un-regularized regression.
Pre-treatment fit will be perfect in this case, but the approximation error could be large.
In general, the bias bound in Theorem 2 will not be monotonic in λ and will be minimized by
some λ ∈ (0,∞); we must therefore select λ carefully in practice. In Appendix Figure D.1 we inspect
an empirical example of the bound as a function of λ. We show through simulation in Section
6.2 that the gains to pre-treatment fit through augmentation outweigh the potential for increased
approximation error in a range of practical settings.
We propose a cross-validation approach for selecting λ inspired by the in-time placebo check
proposed by Abadie et al. (2015). Let $\hat{Y}_{1(-t)} = \sum_{W_i = 0} \gamma^{aug}_{i(-t)} Y_{it}$ be the estimate of $Y_{1t}$ where time period $t$ is excluded from fitting the estimator in (10). Abadie et al. (2015) propose to compare the difference $Y_{1t} - \hat{Y}_{1(-t)}$ for some $t \le T_0$. We can extend this idea to compute the leave-one-out cross-validation MSE over time periods:

$$CV(\lambda) = \sum_{t=1}^{T_0} \left( Y_{1t} - \hat{Y}_{1(-t)} \right)^2. \tag{21}$$
We can then choose λ to minimize CV (λ) or follow a more conservative approach such as the
“one-standard-error” rule (Hastie et al., 2009). This proposal is similar to the leave-one-out cross
validation proposed by Doudchenko and Imbens (2017), who select hyperparameters by holding
out control units and minimizing the MSE of the control units in the post-treatment time T .
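The criterion in (21) can be sketched as follows. For brevity this illustration cross-validates plain ridge-regression weights rather than re-fitting the full ridge ASCM estimator at every held-out period; the function names are ours, and in practice the held-out prediction should come from the estimator in (10) itself.

```python
import numpy as np

def ridge_only_weights(X0, X1, lam):
    """Weights minimizing ||X1 - X0' gamma||^2 + lam * ||gamma||^2 (closed form)."""
    N0 = X0.shape[0]
    return np.linalg.solve(X0 @ X0.T + lam * np.eye(N0), X0 @ X1)

def cv_error(X0, X1, lam):
    """CV(lambda), Eq. (21): hold out each pre-period, re-fit on the rest,
    and accumulate the squared error of the held-out prediction."""
    T0 = X0.shape[1]
    total = 0.0
    for t in range(T0):
        keep = np.delete(np.arange(T0), t)
        gamma = ridge_only_weights(X0[:, keep], X1[keep], lam)
        total += (X1[t] - X0[:, t] @ gamma) ** 2
    return total

def select_lambda(X0, X1, grid):
    """Choose lambda on a grid by minimizing CV(lambda)."""
    return min(grid, key=lambda lam: cv_error(X0, X1, lam))
```

A more conservative choice, as in the text, would pick the largest λ on the grid whose CV error is within one standard error of the minimum.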
5 Auxiliary covariates and residualization
Thus far, we have focused exclusively on the lagged outcomes as predictors. We now turn to
incorporating auxiliary covariates into ASCM. The most immediate extension is to simply include
these auxiliary covariates in parallel to the lagged outcomes in both the outcome and SCM models.
6Arkhangelsky et al. (2019) show consistency of ASCM augmented with a linear model with a hard constraint on the regression coefficients, in particular for the constrained form of ridge regression. While the constrained form does not have a closed-form solution as a weighting estimator, there exists a choice of hyperparameter that is equivalent to the Lagrangian form we consider here.
In the special case where the auxiliary covariates are low dimensional, we instead propose a two-
step procedure, extending a suggestion from Doudchenko and Imbens (2017): first residualize the
pre- and post-treatment outcomes on the auxiliary covariates and then fit Ridge ASCM on the
residualized outcomes. We show that this controls bias under a linear factor model with auxiliary
covariates and also show that recent proposals for SCM with an intercept shift (Doudchenko and
Imbens, 2017; Ferman et al., 2017) are a special case of this approach.
To define relevant notation, let Zi ∈ RK denote the vector of covariates for unit i; Zi may
include summaries of lagged outcomes or time-varying covariates, such as the pre-treatment mean
X̄i, treated as unit-level measures. Let Z0· ∈ RN0×K denote the matrix of covariates for the donor
units. We assume that the covariates are centered, Z̄0· = 0.7
5.1 Incorporating auxiliary covariates into ASCM
We begin by describing the general approach where we include both X and Z in parallel into both
SCM and the augmentation, and then shift to the special case where the auxiliary covariates are
low dimensional. For the general case, we can fit SCM by minimizing a weighted average of the
imbalance in X and Z over non-negative weighting vectors:
$$\min_{\gamma \in \Delta^{N_0}} \; \theta_x \left\| X_1 - X_{0\cdot}'\gamma \right\|_2^2 + \theta_z \left\| Z_1 - Z_{0\cdot}'\gamma \right\|_2^2 + \zeta \sum_{W_i = 0} f(\gamma_i) \tag{22}$$
where θx and θz are researcher-selected weights that govern the relative priority placed on balancing
the two vectors, which need not be on the same scale. As above, we can augment these weights
with an outcome model m(Xi, Zi) that is a function of both the lagged outcomes and auxiliary
covariates. For example, a simple extension of ridge ASCM chooses m(X,Z) = η0 + X ′ηx + Z ′ηz
to be linear in X and Z, fit via ridge regression with different penalties for the two components:
$$\min_{\eta_0, \eta_x, \eta_z} \; \frac{1}{2} \sum_{W_i = 0} \left( Y_i - \left( \eta_0 + X_i'\eta_x + Z_i'\eta_z \right) \right)^2 + \lambda_x \|\eta_x\|_2^2 + \lambda_z \|\eta_z\|_2^2. \tag{23}$$
As with ridge ASCM, including auxiliary covariates into the ridge augmentation again yields a
weighting estimator that balances both the lagged outcomes and the auxiliary covariates.
In many settings, the dimension of Z is small relative to N (i.e., K � N), and it is feasible
to fit a regression model that regularizes the lagged outcome coefficients ηx but does not regu-
larize the auxiliary covariate coefficients ηz (i.e., set λz = 0). In Lemma 4 below, we show that
this is a weighting estimator that perfectly balances the auxiliary covariates Z. Importantly, the
corresponding pre-treatment fit in the lagged outcomes X only depends on the pre-treatment fit
7A complementary strategy is to consider using SCM to balance lower-dimensional summaries of the vector of lagged outcomes, such as the estimated pre-treatment average, trend, or factor loadings; see, for example, Gobillon and Magnac (2016) and Hazlett and Xu (2018).
on the residuals from regressions of the lagged outcomes on Z. We therefore propose a two-step
procedure where we first residualize the pre- and post-treatment outcomes on Z and then estimate
Ridge ASCM on the residualized outcomes.8 We propose this as our default approach for practice
in settings where SCM alone is not feasible.
Lemma 4. Let η̂x and η̂z be the solutions to (23) with λx = λr and λz = 0. For any weight vector
γ that sums to one, the ASCM estimator from Equation (6) with m̂(Xi, Zi) = X'iη̂x + Z'iη̂z is

$$\sum_{W_i = 0} \gamma_i Y_i + \left( X_1 - \sum_{W_i = 0} \gamma_i X_i \right)' \hat\eta_x + \left( Z_1 - \sum_{W_i = 0} \gamma_i Z_i \right)' \hat\eta_z = \sum_{W_i = 0} \gamma^{cov}_i Y_i, \tag{24}$$

where

$$\gamma^{cov}_i = \gamma_i + \left( \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right)' \left( \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda^r I \right)^{-1} \tilde{X}_i + \left( Z_1 - Z_{0\cdot}'\gamma \right)' \left( Z_{0\cdot}'Z_{0\cdot} \right)^{-1} Z_i, \tag{25}$$

and $\tilde{X}_i$ is the residual from a regression of pre-treatment outcomes on the control auxiliary
covariates:

$$\tilde{X}_i = X_i - Z_i' \left( Z_{0\cdot}'Z_{0\cdot} \right)^{-1} Z_{0\cdot}' X_{0\cdot}. \tag{26}$$

These weights exactly balance the auxiliary covariates, $Z_1 - Z_{0\cdot}'\gamma^{cov} = 0$; the imbalance in the
lagged outcomes is

$$\left\| X_1 - X_{0\cdot}'\gamma^{cov} \right\|_2 \le \left( \frac{\lambda^r}{\lambda^r + N_0 \tilde{d}_r^2} \right) \left\| \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right\|_2, \tag{27}$$

where $\tilde{d}_r$ is the minimal singular value of $\tilde{X}_{0\cdot}$.
Lemma 4 has two main implications for incorporating auxiliary covariates into ridge ASCM.
First, the weights γcov exactly balance the auxiliary covariates regardless of whether the SCM
weights alone also achieve exact balance. Thus, we can simply exclude Z from Equation (22) or,
equivalently, set θz = 0. Second, from Equation (27) we see that the pre-treatment fit on the lagged
outcomes X depends on how well the SCM weights balance the residualized lagged outcomes X̃.
This suggests modifying Equation (22) to balance X̃ rather than X, which leads to the two-step
procedure we outline above.
8This is equivalent to fitting the partitioned ridge regression (23) and fitting SCM weights to the residualized lagged outcomes. Doudchenko and Imbens (2017) also suggest a version of residualization that is equivalent to excluding the lagged outcomes X from the outcome regression.
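The two-step procedure and the weights in (25)–(26) can be sketched directly. This is illustrative Python/NumPy with our own function names, not the augsynth implementation; the exact-balance property $Z_1 - Z_{0\cdot}'\gamma^{cov} = 0$ holds because the covariate adjustment is un-penalized.

```python
import numpy as np

def residualize(X0, X1, Z0, Z1):
    """Residualize lagged outcomes on the covariates, Eq. (26),
    using coefficients fit on the control units."""
    B = np.linalg.solve(Z0.T @ Z0, Z0.T @ X0)   # (K, T0) regression coefficients
    return X0 - Z0 @ B, X1 - B.T @ Z1

def covariate_weights(X0, X1, Z0, Z1, gamma, lam):
    """gamma^cov from Eq. (25): ridge augmentation on the residualized
    outcomes plus an exact (un-penalized) adjustment for the covariates."""
    Xt0, Xt1 = residualize(X0, X1, Z0, Z1)
    T0 = X0.shape[1]
    ridge_adj = Xt0 @ np.linalg.solve(Xt0.T @ Xt0 + lam * np.eye(T0),
                                      Xt1 - Xt0.T @ gamma)
    z_adj = Z0 @ np.linalg.solve(Z0.T @ Z0, Z1 - Z0.T @ gamma)
    return gamma + ridge_adj + z_adj
```

Because the residualized outcomes are orthogonal to the control covariates by construction, the ridge term leaves the covariate balance untouched.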
5.2 Bias under a linear factor model with covariates
We can quantify the behavior of this two-step procedure in controlling bias under a more general
form of the linear factor model (16) with covariates:
$$Y_{it}(0) = \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \sum_{k=1}^{K} B_{tk} Z_{ik} + \varepsilon_{it} = \phi_i'\mu_t + B_t' Z_i + \varepsilon_{it}, \tag{28}$$
where Bt ∈ RK is a set of coefficients on the covariates (see Abadie et al., 2010; Botosaru and
Ferman, 2019, for additional discussion). Under this model the bias will be due to imbalance in both
the latent factor loadings φi and the covariates Zi. The auxiliary covariates are perfectly balanced
due to the residualization and we can bound the bias due to imbalance in φ using techniques from
Sections 3 and 4. For ease of exposition, we assume that the control covariates are standardized
and rotated, which can always be achieved by pre-processing. We present results for the simpler case
in which we fit SCM on the residualized pre-treatment outcomes rather than ridge ASCM (i.e. we
let λr →∞); the more general version follows immediately by applying Theorem 2.
Theorem 3. Under the linear factor model with covariates (28) with $\frac{1}{N_0} Z_{0\cdot}'Z_{0\cdot} = I_K$ and sub-Gaussian noise, for any δ > 0, γcov in Equation (25) with λr → ∞ satisfies the bound

$$\left| \left( \phi_1 - \sum_{W_i = 0} \gamma^{cov}_i \phi_i \right) \cdot \mu_T \right| \le \frac{4 J M^2}{\sqrt{T_0}} \left( \underbrace{\left\| \tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma \right\|_2}_{\text{imbalance in } \tilde{X}} + \underbrace{\sigma \sqrt{\frac{K}{N_0}} \left\| Z_1 - Z_{0\cdot}'\gamma \right\|_2}_{\text{excess approximation error}} \right) + \underbrace{\frac{2 J M^2 \sigma}{\sqrt{T_0}}\left(\sqrt{\log N_0} + 2\delta\right)}_{\text{SCM approximation error}} \tag{29}$$

with probability at least $1 - 4 e^{-\delta^2/2} - 2 e^{-K N_0 \left(2 - \sqrt{\log 5}\right)^2/2}$.
Building on Lemma 4, Theorem 3 shows that, due to the additive, separable structure of the
auxiliary covariates in Equation (28), controlling the pre-treatment fit in the residualized lagged
outcomes X̃ partially controls the bias. As above, this justifies directly targeting fit in X̃ rather
than in the raw lagged outcomes X. Moreover, since the number of covariates K is small relative
to N0, the auxiliary covariates are measured without noise, and the residualized lagged outcomes
are orthogonal to the auxiliary covariates by construction, the excess approximation error due to
augmentation will be small.
As in Section 3.2.2, we can achieve better balance when we apply ridge ASCM to X̃ than when
we apply SCM alone. Because X̃ is orthogonal to Z by construction, this comes at no cost in
terms of imbalance in Z. However, the fundamental challenge of over-fitting to noise still remains,
and, as in the case without auxiliary covariates, selecting the tuning parameter remains important.
We again propose to follow the cross-validation approach in Section 4.4, here using the residualized
lagged outcomes X̃ rather than the raw lagged outcomes X.
5.3 Including the average lagged outcome as a covariate
An important special case of this two-step procedure is the de-meaned or intercept shift SCM,
which incorporates an intercept into the SCM optimization problem (Doudchenko and Imbens,
2017; Ferman and Pinto, 2018). The estimator of the treated unit's potential outcome under
control is then:

$$\hat{Y}^{int\,scm}_1(0) = \bar{X}_1 + \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - \bar{X}_i \right), \tag{30}$$

where $\bar{X}_i = \frac{1}{T_0}\sum_{t=1}^{T_0} X_{it}$ is the pre-treatment mean outcome for unit i and the weights $\gamma^{scm*}_i$ are
chosen to balance the residual outcomes $X_{it} - \bar{X}_i$ rather than the raw outcomes $X_{it}$. This estimator
is equivalent to setting $Z_i = \bar{X}_i$ and constraining ηz to equal one, thus estimating a fixed-effects
outcome model as $\hat{m}(X_i) = \bar{X}_i$. Alternatively, we can view this approach as augmenting SCM with
a panel regression on the vector of unit indicators.
The corresponding treatment effect estimate is then a weighted difference-in-differences estima-
tor:
$$\hat\tau^{de} = \left( Y_1 - \bar{X}_1 \right) - \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - \bar{X}_i \right) = \frac{1}{T_0} \sum_{t=1}^{T_0} \left[ \left( Y_1 - X_{1t} \right) - \sum_{W_i = 0} \gamma^{scm*}_i \left( Y_i - X_{it} \right) \right]. \tag{31}$$
Replacing the SCM weights with uniform weights in Equation (31) yields the typical two-group
difference-in-differences approach. Replacing the SCM weights with traditional inverse propensity
score weights instead yields a variant of the semiparametric difference-in-differences estimator of
Abadie (2005), albeit using lagged outcomes as predictors rather than auxiliary covariates.
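The weighted difference-in-differences form in (31) can be sketched as follows; with uniform weights it collapses to the standard two-group difference-in-differences estimator. The function name is ours.

```python
import numpy as np

def demeaned_scm_att(Y1, Y0, X1, X0, gamma):
    """tau^de from Eq. (31): compare the treated and weighted-control outcomes,
    each measured relative to its own pre-treatment mean.

    Y1:    post-treatment outcome for the treated unit (scalar)
    Y0:    (N0,) post-treatment outcomes for the control units
    X1:    (T0,) treated unit's pre-treatment outcomes
    X0:    (N0, T0) control pre-treatment outcomes
    gamma: (N0,) weights; SCM weights give Eq. (31), uniform weights give DiD
    """
    return (Y1 - X1.mean()) - gamma @ (Y0 - X0.mean(axis=1))
```

Substituting inverse propensity score weights for gamma would instead give a variant of the Abadie (2005) semiparametric difference-in-differences estimator described above.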
In practice, the unit-specific pre-treatment average X̄i may be a poor estimate of the unit fixed
effect, due to the incidental parameter problem (Neyman and Scott, 1948). However, Ferman and
Pinto (2018) show that this combined approach has attractive theoretical properties in a range
of settings with large T0 and is often preferable to the usual two-group difference-in-differences
estimator; Arkhangelsky et al. (2019) come to a similar conclusion under a different asymptotic
framework. Thus, we also propose extending the two-step procedure above to include X̄i as a
default predictor.
6 Simulations and numerical illustrations
We now turn to an empirical illustration and simulations. First, we use our approach to examine
the effect of an aggressive tax cut on economic output in Kansas in 2012. We then generate
simulation studies calibrated to this application and assess the performance of different methods,
finding substantial gains from ASCM and related approaches overall.
6.1 Illustration: 2012 Kansas tax cuts
In 2010, Sam Brownback was elected governor of Kansas, having run on a platform emphasizing
tax cuts and deficit reduction (see Rickman and Wang, 2018, for further discussion and analysis).
Upon taking office, he implemented a substantial personal income tax cut, both lowering rates
and reducing credits and deductions. This is a valuable test of “supply side” models: Brownback
argued that the tax cuts would increase business activity in Kansas, generating economic growth
and additional tax revenues that would make up for the static revenue losses. Kansas’ subsequent
economic performance has not been impressive relative to its neighbors; however, potentially con-
founding factors include a drought and declines in the locally important aerospace industry. Finding
a credible control for Kansas is thus challenging, and SCM-type models offer a potential solution.
We estimate the effect of the tax cuts on log gross state product (GSP) per capita using the
second quarter of 2012 — when Brownback signed the tax cut bill into law — as the intervention
time. We use three primary estimators: (1) SCM alone fit on the entire vector of lagged outcomes,
(2) ridge ASCM, and (3) the two-step approach where we first regress the lagged outcomes on
predictors then fit ridge ASCM on the residuals as in Section 5.9 Figure 1, known as a “gap plot”,
shows the difference between Kansas and its synthetic control before and after the passage of the
tax cuts, plus or minus two jackknife standard errors.10 Appendix Figure D.2 additionally shows
the results for: de-meaned SCM; ridge regression alone; ridge ASCM with λ chosen to minimize
CV (λ); and the original SCM proposal incorporating auxiliary covariates (Abadie et al., 2010).
SCM achieves fairly poor pre-treatment fit; the synthetic control exceeds the treatment unit by
two to four percent in 2004-2005. This lack of pre-treatment fit should make us wary of the validity
of the SCM effect estimates, and suggests that there may be gains to augmentation. Augmenting
SCM with ridge regression indeed provides better pre-treatment fit. Adding auxiliary covariates,
balanced by OLS before fitting ridge ASCM to the residuals, improves fit even more; the cor-
responding imbalances are small in magnitude throughout the 1990-2011 period. The estimated
9The covariates we include are the pre-treatment averages of (1) log state and local revenue per capita, (2) log average weekly wages, (3) number of establishments per capita, (4) the employment level, and (5) log GSP per capita.
10We compute standard errors for the imputed counterfactual Y1(0). See Arkhangelsky et al. (2019) for a discussion of jackknife standard errors in the SCM and Augmented SCM setting. For the augmented estimators we select the hyperparameter λr as the largest λ within one standard error of the λ that minimizes the cross-validation placebo fit CV(λ); see Section 4.4. Appendix Figure D.1 plots CV(λ) and one standard error for the ridge ASCM estimator as an empirical analog of the bound in Theorem 2.
Figure 1: Point estimates ± two standard errors of the ATT for the effect of the tax cuts on log GSP per capita using SCM, ridge ASCM, and ridge ASCM with covariates.
treatment effect in this model is a nearly 5% decrease in GSP per capita that persists through the
end of our sample and is statistically significant at nearly every time point. This is a larger negative
effect than is indicated either by SCM (a 2% decrease) or ridge ASCM (a 3.8% decrease), neither
of which is consistently negative.
To check against over-fitting, Figure 2 shows in-time placebo estimates for the ridge ASCM
estimator with covariates with placebo treatment times in the second quarter of 2009, 2010, and
2011. With placebo treatment in 2009 and 2011 we estimate placebo effects that are near zero; with
placebo treatment in 2010 the point estimates of the placebo effects are large, but do not appear
meaningfully different from zero. Appendix Figures D.3 and D.4 show the corresponding placebo
estimates for SCM alone and ridge ASCM without covariates.
Figure 3a shows the covariate balance for the three estimators. While SCM and ridge ASCM
achieve excellent fit for the pre-treatment average log GSP per capita, neither estimator perfectly
balances the other covariates, most notably the average employment level across the quarters of the
pre-period. In contrast — and by design — including an un-penalized regression on the auxiliary
covariates perfectly balances them. Moreover, after residualizing out the covariates, SCM along
with the ridge augmentation on lagged outcomes is able to achieve very good pre-treatment fit on
the lagged outcomes as shown in Figure 1.
Figure 3b shows the weighting on the donor units for SCM and ridge ASCM. Here we see the
minimal extrapolation property of the ASCM weights. The SCM weights are zero for all but six
donor states. The ridge ASCM weights are similar but deviate slightly from the simplex, with only
one unit receiving a meaningful negative weight. In addition, the ridge ASCM weights retain some
Figure 2: Placebo point estimates ± two standard errors for ridge ASCM with covariates, with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure 3: (a) Covariate balance for SCM, ridge ASCM, and ASCM with covariates. (b) Donor unit weights for SCM and ridge ASCM.
of the interpretability of the SCM weights. For the donor units with positive SCM weight, ridge
ASCM places close to the same weight. For the majority of those with zero SCM weight, ridge
ASCM places a weight close to zero, with relatively few donor units receiving a large negative weight. By
contrast, Appendix Figure D.5 shows the weights from ridge regression alone: many of the weights
are negative and the weights are far from sparse.
6.2 Calibrated simulation studies
We now turn to simulation studies in which the true data generating process is known. To make
our simulation studies as realistic as possible, we conduct “calibrated” simulation studies (Kern
et al., 2016), based on estimates of linear factor models fit to the series of log GSP per capita we
study above (N = 50, T0 = 89). We use the Generalized Synthetic Control Method (Xu, 2017) to
estimate factor models with three latent factors for each application. We then simulate outcomes
using the distribution of estimated parameters and model selection into treatment as a function
of the latent factors; see Appendix A for additional details. We also present results from three
additional DGPs, each calibrated to estimates from the Kansas data: (1) the factor model with
quadruple the standard deviation of the noise term, (2) a unit and time fixed effects model, and
(3) an autoregressive model with 3 lags.
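A stylized version of one simulation draw can be sketched as follows: outcomes follow a three-factor model and selection into treatment depends on the latent loadings. The parameters here are illustrative placeholders, not the calibrated estimates described in Appendix A.

```python
import numpy as np

def simulate_factor_panel(N=50, T0=89, T_post=3, J=3, sigma=1.0, seed=0):
    """Draw a panel from Y_it(0) = phi_i' mu_t + eps_it with J latent factors.
    Illustrative parameters only; the paper calibrates these to the GSP data."""
    rng = np.random.default_rng(seed)
    T = T0 + T_post
    phi = rng.normal(size=(N, J))             # unit-specific factor loadings
    mu = rng.normal(size=(J, T))              # latent time-varying factors
    eps = sigma * rng.normal(size=(N, T))     # i.i.d. noise
    Y = phi @ mu + eps
    # model selection into treatment as a function of the latent loadings
    treated = int(np.argmax(phi @ rng.normal(size=J)))
    return Y, treated
```

Repeating such draws, applying each estimator, and averaging the absolute error over draws gives the bias summaries reported in Figure 4.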
First, we explore the role of augmentation using simple outcome models. For each DGP, we
consider five estimators: (1) SCM alone, (2) ridge regression alone, (3) ridge ASCM, (4) fixed effects
alone, and (5) SCM augmented with fixed effects (i.e., de-meaned SCM). Figure 4 shows the main
results. Each panel displays the average absolute bias for all five estimators across all simulations.
Appendix Figure D.6 shows the estimator RMSE for the same simulations.
There are several takeaways. First, as anticipated, SCM alone — without conditioning on
excellent pre-treatment fit — is badly biased across simulations. This underscores the importance
of the recommendation in Abadie et al. (2010, 2015) to use SCM only in settings with excellent
pre-treatment fit.11 Second, augmenting SCM with a ridge outcome regression reduces bias relative
to SCM in all four simulations. Under the baseline factor model and the fixed effect model, the
ridge augmentation greatly reduces bias — by more than 75% in the factor model simulation and
over 90% in the fixed effects simulation. However, in the AR(3) model, as well as in the factor
model with reduced signal to noise ratio, the gains to augmentation relative to SCM are more
limited. Third, ridge ASCM has slightly lower bias than ridge regression alone across all of the
simulation settings. Fourth, when the fixed effects estimator is incorrectly specified, combining
SCM with a fixed effects estimator has much lower bias than either method alone; even when the
fixed effects estimator is correctly specified the de-meaned SCM has similar bias to the (correctly
specified) fixed effects approach. Finally, Appendix Figure D.6 shows that in all simulations ASCM
11Abadie et al. (2010, 2015) also strongly recommend incorporating auxiliary covariates, weighted by their predictive power, into the procedure, noting that this is important for further reducing bias. For simplicity, the simulations do not include auxiliary covariates.
Figure 4: Overall absolute bias for (a) the factor model simulation, (b) the factor model simulation with quadruple the standard deviation, (c) the fixed effects simulation, and (d) the AR simulation. In panels (a) and (b), the outcome is scaled such that the residual standard deviation is one. The SCM estimates reported here are not restricted to simulation draws with excellent pre-treatment fit; Abadie et al. (2015) advise against using SCM in such settings.
has lower RMSE than SCM, as the large decrease in bias more than makes up for the slight increase
in variance.
We now consider how the benefits of augmentation relate to the quality of the original SCM
fit. Figure 5 shows the bias and root mean squared error (RMSE) as a function of λ for the two
factor model simulations, both unconditional on SCM fit and conditional on SCM fit being either
in the top half or the top quartile across simulations. As reference, the limit of ASCM as λ gets
large is the original SCM estimator, so the right-most points in the plots represent SCM. Several
key facts stand out. First, as expected, Augmented SCM can substantially reduce bias when SCM
is applied without regard to the quality of the pre-treatment fit. Second, restricting to simulations
with relatively good fit limits the gains from augmentation, although augmentation still reduces
bias even in the top quartile of pre-treatment fit. Third, it is possible to over-fit with ASCM, as
evident in much larger RMSE for ASCM with too-small λ. Fourth, with additional noise the risk
of over-fitting is greater, mitigating the gains to augmentation; conditional on good SCM fit the
RMSE is nearly monotonically decreasing with λ. Finally, when the SCM fit is good or the noise
is large, the bias is not monotonically increasing with λ and there is some optimal choice of λ that
Figure 5: Bias and RMSE of ridge ASCM as a function of λ under a linear factor model with moderate and high noise.
minimizes the bias. Appendix Figures D.7 and D.8 show the bias and RMSE of several estimators
conditioned on SCM fit in the top quintile and show similar results: conditioned on good SCM fit
the gains of augmentation are limited.
Next, we evaluate alternative outcome models for use in ASCM. For each DGP we consider
SCM augmented with (1) Lasso, (2) a random forest, (3) CausalImpact (Brodersen et al., 2015),
(4) matrix completion using MCPanel (Athey et al., 2017) and (5) fitting the factor model directly
with gsynth (Xu, 2017). We compare ASCM to the pure outcome models as well as pure SCM.
Figure 6 shows the absolute bias for these methods and Appendix Figure D.6 shows the RMSE.
We broadly see the same results as with ridge ASCM. In our simulations, augmenting SCM almost
always reduces the bias relative to SCM (unconditional on good pre-treatment fit) with some models
improving SCM more than others. Additionally, in nearly every case ASCM also has lower bias
than outcome modeling alone. Appendix Figures D.7 and D.8 show the bias and RMSE when SCM
fit is in the top quintile. As before, conditioned on good pre-treatment fit the gains to augmentation
with flexible outcome models are more limited, except with the oracle gsynth estimator.
Figure 6: Absolute bias for SCM, ridge, fixed effects, and several machine learning and panel data outcome models, and their augmented versions, using the same data generating processes as Figure 4.
Overall we find consistently good performance across data generating processes from SCM
augmented with a penalized regression model.12 Given this good performance, the relative simplicity
of penalized regression, and the small number of units and high-dimensional structure typical of
SCM settings, we recommend augmenting SCM with penalized regression in practice. In particular,
we suggest ridge regression; among its other benefits, ridge ASCM allows the practitioner to
diagnose the level of extrapolation that the outcome model adds.
7 Discussion
SCM is a widely popular approach for estimating policy impacts at the jurisdiction level, such as the
city or state. By design, however, the method is limited to settings where excellent pre-treatment
fit is possible. For settings when this is infeasible, we introduce Augmented SCM, which controls
pre-treatment fit while minimizing extrapolation. We show that this approach reduces bias under
12Arkhangelsky et al. (2019) also show good performance when augmenting SCM with non-negative least squares.
a linear factor model and propose several extensions, including to incorporate auxiliary covariates.
There are several directions for future work. The most immediate is to develop robust inferential
methods for settings where pre-treatment SCM fit is imperfect. We adopt here jackknife standard
errors proposed by Arkhangelsky et al. (2019), but these are justified only asymptotically; see also
Arkhangelsky and Imbens (2019), who discuss doubly robust identification and estimation in this
setting. We could also extend this to allow for sensitivity analysis that directly parameterizes
departures from, say, the linear factor model, as in recent approaches for sensitivity analysis for
balancing weights (Soriano et al., 2019).
A second area for future inquiry is the application of the ASCM framework to settings with
multiple treated units. For instance, there are different approaches in settings when all treated
units are treated at the same time: some papers propose to fit SCM separately for each treated
unit (e.g., Abadie and L’Hour, 2018), while others simply average the units together (e.g., Kreif
et al., 2016; Robbins et al., 2017). The situation is more complicated with staggered adoption,
when units take up the treatment at different times (e.g., Dube and Zipperer, 2015; Donohue et al.,
2017). We explore this extension in Ben-Michael et al. (2019).
A third potential extension is to more complex data structures, such as applications with mul-
tiple outcomes series for the same units (e.g., measures of both earnings and total employment in
minimum wage studies) or hierarchical data structures with outcome information at both the indi-
vidual and aggregate level (e.g., students within schools). Similarly, the theoretical results in this
paper are poorly suited to count outcomes. This is especially important in gun violence research
where the outcome of interest might be the number of homicides (Rudolph et al., 2015).
References
Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of EconomicStudies 72 (1), 1–19.
Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and methodologicalaspects. Journal of Economic Literature.
Abadie, A., A. Diamond, and J. Hainmueller (2010). Synthetic Control Methods for ComparativeCase Studies: Estimating the Effect of California’s Tobacco Control Program. Journal of theAmerican Statistical Association 105 (490), 493–505.
Abadie, A., A. Diamond, and J. Hainmueller (2015). Comparative Politics and the SyntheticControl Method. American Journal of Political Science 59 (2), 495–510.
Abadie, A. and J. Gardeazabal (2003). The Economic Costs of Conflict: A Case Study of theBasque Country. The American Economic Review 93 (1), 113–132.
Abadie, A. and G. W. Imbens (2011). Bias-corrected matching estimators for average treatmenteffects. Journal of Business & Economic Statistics 29 (1), 1–11.
Abadie, A. and J. L’Hour (2018). A penalized synthetic control estimator for disaggregated data.
Amjad, M., D. Shah, and D. Shen (2018). Robust synthetic control. The Journal of MachineLearning Research 19 (1), 802–852.
Arkhangelsky, D., S. Athey, D. A. Hirshberg, G. W. Imbens, and S. Wager (2019). Syntheticdifference in differences. arXiv preprint arXiv:1812.09970 .
Arkhangelsky, D. and G. W. Imbens (2019). Double-robust identification for causal panel datamodels. arXiv preprint arXiv:1909.09412 .
Athey, S., M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017). Matrix CompletionMethods for Causal Panel Data Models. arxiv 1710.10251 .
Athey, S. and G. W. Imbens (2017). The state of applied econometrics: Causality and policyevaluation. Journal of Economic Perspectives 31 (2), 3–32.
Athey, S., G. W. Imbens, and S. Wager (2018). Approximate Residual Balancing: De-BiasedInference of Average Treatment Effects in High Dimensions. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology).
Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica 77 (4), 1229–1279.
Ben-Michael, E., A. Feller, and J. Rothstein (2019). Synthetic control and weighted event studymodels with staggered adoption. Technical report. working paper.
Botosaru, I. and B. Ferman (2019). On the role of covariates in the synthetic control method. TheEconometrics Journal 22 (2), 117–130.
Breidt, F. J. and J. D. Opsomer (2017). Model-Assisted Survey Estimation with Modern PredictionTechniques. Statistical Science 32 (2), 190–205.
26
Brodersen, K. H., F. Gallusser, J. Koehler, N. Remy, and S. L. Scott (2015). Inferring Causal Impact using Bayesian Structural Time-Series Models. The Annals of Applied Statistics 9(1), 247–274.
Cassel, C. M., C.-E. Sarndal, and J. H. Wretman (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63(3), 615–620.
Chernozhukov, V., K. Wuthrich, and Y. Zhu (2018). Inference on average treatment effects in aggregate panel data settings. arXiv preprint arXiv:1812.10820.
Deville, J.-C. and C.-E. Sarndal (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association 87(418), 376–382.
Deville, J.-C., C.-E. Sarndal, and O. Sautory (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88(423), 1013–1020.
Donohue, J. J., A. Aneja, and K. D. Weber (2017). Right-to-carry laws and violent crime: A comprehensive assessment using panel data and a state-level synthetic control analysis. Technical report, National Bureau of Economic Research.
Doudchenko, N. and G. W. Imbens (2017). Difference-in-Differences and Synthetic Control Methods: A Synthesis. arXiv preprint arXiv:1610.07748.
Dube, A. and B. Zipperer (2015). Pooling multiple case studies using synthetic controls: An application to minimum wage policies.
Ferman, B. (2019). On the Properties of the Synthetic Control Estimator with Many Periods and Many Controls.
Ferman, B. and C. Pinto (2018). Synthetic controls with imperfect pre-treatment fit.
Ferman, B., C. Pinto, and V. Possebom (2017). Cherry picking with synthetic controls.
Gobillon, L. and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics 98(3), 535–551.
Hastie, T., J. Friedman, and R. Tibshirani (2009). The Elements of Statistical Learning. Springer Series in Statistics, New York.
Hazlett, C. and Y. Xu (2018). Trajectory balancing: A general reweighting approach to causal inference with time-series cross-sectional data.
Hsiao, C., Q. Zhou, et al. (2018). Panel parametric, semi-parametric and nonparametric construction of counterfactuals: California tobacco control revisited. Technical report.
Kern, H. L., E. A. Stuart, J. Hill, and D. P. Green (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness 9(1), 103–127.
King, G. and L. Zeng (2006). The dangers of extreme counterfactuals. Political Analysis 14(2), 131–159.
Kline, P. (2011). Oaxaca-Blinder as a reweighting estimator. American Economic Review 101, 532–537.
Kreif, N., R. Grieve, D. Hangartner, A. J. Turner, S. Nikolova, and M. Sutton (2016). Examination of the synthetic control method for evaluating health policies with multiple treated units. Health Economics 25(12), 1514–1528.
Minard, S. and G. R. Waddell (2018). Dispersion-weighted synthetic controls.
Neyman, J. and E. L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16(1), 1–32.
Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
Nickell, S. (1981). Biases in dynamic models with fixed effects. Econometrica: Journal of the Econometric Society, 1417–1426.
Powell, D. (2018). Imperfect synthetic controls: Did the Massachusetts health care reform save lives?
Rickman, D. S. and H. Wang (2018). Two tales of two US states: Regional fiscal austerity and economic performance. Regional Science and Urban Economics 68, 46–55.
Robbins, M., J. Saunders, and B. Kilmer (2017). A Framework for Synthetic Control Methods With High-Dimensional, Micro-Level Data: Evaluating a Neighborhood-Specific Crime Intervention. Journal of the American Statistical Association 112(517), 109–126.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
Rubin, D. B. (1973). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 185–203.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688.
Rubin, D. B. (1980). Comment on "Randomization analysis of experimental data: The Fisher randomization test". Journal of the American Statistical Association 75(371), 591–593.
Rudolph, K. E., E. A. Stuart, J. S. Vernick, and D. W. Webster (2015). Association between Connecticut's permit-to-purchase handgun law and homicides. American Journal of Public Health 105(8), e49–e54.
Samartsidis, P., S. R. Seaman, A. M. Presanis, M. Hickman, D. De Angelis, et al. (2019). Assessing the causal effect of binary interventions from observational panel data with few treated units. Statistical Science 34(3), 486–503.
Soriano, D., E. Ben-Michael, A. Feller, and S. Pimentel (2019). Sensitivity analysis for balancing weights. Technical report, working paper.
Tan, Z. (2017). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data.
Tan, Z. (2018). Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. arXiv preprint arXiv:1801.09817.
Wainwright, M. (2018). High-dimensional statistics: A non-asymptotic viewpoint.
Wang, Y. and J. R. Zubizarreta (2018). Minimal Approximately Balancing Weights: Asymptotic Properties and Practical Considerations.
Xu, Y. (2017). Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models. Political Analysis 25, 57–76.
Zhao, Q. (2018). Covariate Balancing Propensity Score by Tailored Loss Functions. Annals of Statistics, forthcoming.
Zhao, Q. and D. Percival (2017). Entropy balancing is doubly robust. Journal of Causal Inference 5(1).
Zubizarreta, J. R. (2015). Stable Weights that Balance Covariates for Estimation With Incomplete Outcome Data. Journal of the American Statistical Association 110(511), 910–922.
A Simulation data generating process
We now describe the simulations in detail. We use the Generalized Synthetic Control Method (Xu, 2017) to fit the following linear factor model to the observed series of log GSP per capita (N = 50, T_0 = 89, T = 105), setting J = 3:

\[
Y_{it} = \alpha_i + \nu_t + \sum_{j=1}^{J} \phi_{ij}\mu_{jt} + \varepsilon_{it}. \tag{32}
\]
We then use these estimates as the basis for simulating data. Appendix Figure D.9 shows the estimated factors μ. We use the estimated time fixed effects ν and factors μ and then simulate data using Equation (32), drawing

\[
\alpha_i \sim N(\hat{\alpha}, \sigma_\alpha), \qquad \phi \sim N(0, \Sigma_\phi), \qquad \varepsilon_{it} \sim N(0, \sigma_\varepsilon),
\]

where \(\hat{\alpha}\) and σ_α are the estimated mean and standard deviation of the unit fixed effects, Σ_φ is the sample covariance of the estimated factor loadings, and σ_ε is the estimated residual standard deviation. We also simulate outcomes with quadruple the standard deviation, sd(ε_{it}) = 4σ_ε. We assume a sharp null of zero treatment effect in all DGPs and estimate the ATT at the final time point.
To model selection, we compute the (marginal) propensity scores as

\[
\pi_i = P(T_i = 1 \mid \alpha_i, \phi_i) = \operatorname{logit}^{-1}\Bigl\{\theta\Bigl(\alpha_i + \sum_j \phi_{ij}\Bigr)\Bigr\},
\]

where we set θ = 1/2 to re-scale the factors and fixed effects to have unit variance. Finally, we restrict each simulation to have a single treated unit and therefore normalize the selection probabilities as \(\pi_i / \sum_j \pi_j\).
We also consider an alternative data generating process that specializes the linear factor model to only include unit and time fixed effects:

\[
Y_{it}(0) = \alpha_i + \nu_t + \varepsilon_{it}.
\]

We calibrate this data generating process by fitting the fixed effects with gsynth and drawing new unit fixed effects from \(\alpha_i \sim N(\hat{\alpha}, \sigma_\alpha)\). We then model selection proportional to the fixed effect as above with θ = 3/2. Second, we generate data from an AR(3) model:

\[
Y_{it}(0) = \rho_0 + \sum_{j=1}^{3} \rho_j Y_{i(t-j)} + \varepsilon_{it},
\]

where we fit ρ_0, ρ to the observed series of log GSP per capita. We model selection as proportional to the last three outcomes, \(\pi_i = \operatorname{logit}^{-1}\{\theta(\sum_{j=1}^{3} Y_{i(T_0-j+1)})\}\), and set θ = 5/2. For this simulation we estimate the ATT at time T_0 + 1.
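The factor-model DGP above can be sketched numerically. This is a minimal illustration in Python (numpy); the quantities standing in for \(\hat{\alpha}\), σ_α, ν, μ, Σ_φ, and σ_ε are placeholders, whereas in the paper they are estimated from the log GSP series with gsynth (and the selection score is standardized to unit variance before scaling by θ).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, J = 50, 105, 3

# Placeholder stand-ins for the gsynth-estimated quantities.
alpha_hat, sigma_alpha, sigma_eps = 10.0, 0.1, 0.02
nu = np.cumsum(rng.normal(0.005, 0.01, size=T))   # time fixed effects
mu = rng.normal(size=(T, J))                      # latent factors
Sigma_phi = np.eye(J) * 0.05                      # loading covariance

# Draw unit-level parameters and simulate Y_it = alpha_i + nu_t + phi_i' mu_t + eps_it.
alpha = rng.normal(alpha_hat, sigma_alpha, size=N)
phi = rng.multivariate_normal(np.zeros(J), Sigma_phi, size=N)
eps = rng.normal(0.0, sigma_eps, size=(N, T))
Y = alpha[:, None] + nu[None, :] + phi @ mu.T + eps

# Selection: inverse-logit of theta * (fixed effect + summed loadings),
# normalized so the probabilities pick out a single treated unit.
theta = 0.5
score = theta * (alpha + phi.sum(axis=1))
pi = 1 / (1 + np.exp(-score))
pi = pi / pi.sum()
treated = rng.choice(N, p=pi)
```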
B Proofs
Proof of Lemmas 1 and 3. Recall that the lagged outcomes are centered by the control averages. Notice that

\[
\begin{aligned}
Y_1(0)^{\text{aug}} &= m(X_1) + \sum_{W_i=0} \gamma_i^{\text{scm}}\,(Y_{iT} - m(X_i)) \\
&= \eta_0 + \eta' X_1 + \sum_{W_i=0} \gamma_i^{\text{scm}}\,(Y_{iT} - \eta_0 - X_i'\eta) \\
&= \sum_{W_i=0} \bigl(\gamma_i^{\text{scm}} + (X_1 - X_{0\cdot}'\gamma^{\text{scm}})'(X_{0\cdot}'X_{0\cdot} + \lambda I_{T_0})^{-1} X_i\bigr)\, Y_{iT} \\
&= \sum_{W_i=0} \gamma_i^{\text{aug}}\, Y_{iT}.
\end{aligned} \tag{33}
\]
The expression for Y_1(0)^r follows.

We now prove that γ^r and γ^aug solve the weighting optimization problems (15) and (11), respectively. First, the Lagrangian dual to (15) is

\[
\min_{\alpha, \beta}\; \frac{1}{2} \sum_{W_i=0} \Bigl(\alpha + \beta' X_i + \frac{1}{N_0}\Bigr)^2 - (\alpha + \beta' X_1) + \frac{\lambda}{2}\|\beta\|_2^2, \tag{34}
\]

where we have used that the convex conjugate of \(\frac{1}{2}\bigl(x - \frac{1}{N_0}\bigr)^2\) is \(\frac{1}{2}\bigl(y + \frac{1}{N_0}\bigr)^2 - \frac{1}{2N_0^2}\). Solving for α we see that

\[
\sum_{W_i=0} (\alpha + \beta' X_i) + 1 = 1.
\]

Since the lagged outcomes are centered, this implies that α = 0. Now solving for β we see that

\[
X_{0\cdot}'\Bigl(\frac{1}{N_0}\mathbf{1} + X_{0\cdot}\beta\Bigr) + \lambda\beta = X_1,
\]

which implies that \(\beta = (X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_1\). Finally, the weights are the ridge weights

\[
\gamma_i = \frac{1}{N_0} + X_1'(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_i = \gamma_i^r.
\]
Similarly, the Lagrangian dual to (11) is

\[
\min_{\alpha, \beta}\; \frac{1}{2} \sum_{W_i=0} \bigl(\alpha + \beta' X_i + \gamma_i^{\text{scm}}\bigr)^2 - (\alpha + \beta' X_1) + \frac{\lambda}{2}\|\beta\|_2^2, \tag{35}
\]

where we have used that the convex conjugate of \(\frac{1}{2}(x - \gamma_i^{\text{scm}})^2\) is \(\frac{1}{2}(y + \gamma_i^{\text{scm}})^2 - \frac{1}{2}(\gamma_i^{\text{scm}})^2\). Solving for α we see that

\[
\sum_{W_i=0} (\alpha + \beta' X_i) + 1 = 1.
\]

Since the lagged outcomes are centered, this implies that α = 0. Now solving for β we see that

\[
X_{0\cdot}'\bigl(\gamma^{\text{scm}} + X_{0\cdot}\beta\bigr) + \lambda\beta = X_1,
\]

which implies that \(\beta = (X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\). Finally, the weights are the ridge ASCM weights

\[
\gamma_i = \gamma_i^{\text{scm}} + (X_1 - X_{0\cdot}'\gamma^{\text{scm}})'(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1} X_i = \gamma_i^{\text{aug}}.
\]
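The equality in Equation (33) between the outcome-model form and the weighting form of ridge ASCM can be checked numerically. This is a sketch in Python with synthetic inputs; the simplex weights standing in for γ^scm are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N0, T0, lam = 40, 12, 2.0

X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)                  # center lagged outcomes by control averages
X1 = rng.normal(size=T0)
Y0 = rng.normal(size=N0)               # control outcomes Y_iT
g_scm = rng.dirichlet(np.ones(N0))     # arbitrary simplex stand-in for gamma^scm

# Weighting form: gamma^aug from Eq. (33)
A = np.linalg.inv(X0.T @ X0 + lam * np.eye(T0))
g_aug = g_scm + X0 @ A @ (X1 - X0.T @ g_scm)

# Outcome-model form: m(X_1) + sum_i gamma^scm_i (Y_iT - m(X_i)), with ridge m;
# the intercept cancels because the scm weights sum to one.
eta = A @ X0.T @ Y0
est_model = eta @ X1 + g_scm @ (Y0 - X0 @ eta)

# The two forms coincide
assert np.isclose(g_aug @ Y0, est_model)
```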
Proof of Lemma 2. Notice that

\[
\begin{aligned}
X_1 - X_{0\cdot}'\gamma^{\text{aug}} &= \bigl(I - X_{0\cdot}'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}\bigr)(X_1 - X_{0\cdot}'\gamma^{\text{scm}}) \\
&= N_0\lambda\,(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}}) \\
&= V \operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr) V'(X_1 - X_{0\cdot}'\gamma^{\text{scm}}).
\end{aligned}
\]

So since V is orthogonal,

\[
\|X_1 - X_{0\cdot}'\gamma^{\text{aug}}\|_2 = \Bigl\|\operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\]
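The shrinkage identity in Lemma 2 is easy to verify numerically. The sketch below works with the singular values s_j of the (centered) control matrix directly, rather than the paper's scaled d_j, which absorb a \(1/\sqrt{N_0}\) factor; with an unscaled ridge penalty λ the shrinkage factors are λ/(s_j² + λ).

```python
import numpy as np

rng = np.random.default_rng(2)
N0, T0, lam = 30, 8, 1.5
X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)
X1 = rng.normal(size=T0)
g_scm = rng.dirichlet(np.ones(N0))

A = np.linalg.inv(X0.T @ X0 + lam * np.eye(T0))
g_aug = g_scm + X0 @ A @ (X1 - X0.T @ g_scm)

# Residual imbalance is the SCM imbalance shrunk by lam (X0'X0 + lam I)^{-1}
lhs = X1 - X0.T @ g_aug
rhs = lam * A @ (X1 - X0.T @ g_scm)
assert np.allclose(lhs, rhs)

# Equivalently, coordinate j in the rotated basis V'x shrinks by lam/(s_j^2 + lam)
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
shrink = lam / (s**2 + lam)
assert np.allclose(Vt @ lhs, shrink * (Vt @ (X1 - X0.T @ g_scm)))
```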
Lemma 5. The ridge augmented SCM weights with hyperparameter N_0λ, γ^aug, satisfy

\[
\|\gamma^{\text{aug}}\|_2 \leq \|\gamma^{\text{scm}}\|_2 + \frac{1}{\sqrt{N_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2, \tag{36}
\]

with \(\tilde{X}_i = V'X_i\) as defined in Lemma 2.
Proof of Lemma 5. Notice that using the singular value decomposition and by the triangle inequality,

\[
\begin{aligned}
\|\gamma^{\text{aug}}\|_2 &= \bigl\|\gamma^{\text{scm}} + X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + N_0\lambda I)^{-1}(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\bigr\|_2 \\
&= \Bigl\|\gamma^{\text{scm}} + U \operatorname{diag}\Bigl(\frac{\sqrt{N_0}\, d_j}{N_0 d_j^2 + N_0\lambda}\Bigr) V'(X_1 - X_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2 \\
&\leq \|\gamma^{\text{scm}}\|_2 + \Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{(d_j^2 + \lambda)\sqrt{N_0}}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\end{aligned}
\]
Lemma 6. Under the linear factor model (16), for weights γ with \(\sum_i \gamma_i = 1\), the imbalance in the latent factor loadings is

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr) = \underbrace{\frac{1}{T_0}\,\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)}_{\text{imbalance}} - \underbrace{\frac{1}{T_0}\,\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)}_{\text{approx. error}}. \tag{37}
\]
Proof of Lemma 6. Collecting the factor loadings into a matrix \(\Phi \in \mathbb{R}^{N\times J}\), notice that

\[
\mu(\Phi_{1\cdot} - \Phi_{0\cdot}'\gamma) = X_1 - \varepsilon_1 - (X_{0\cdot} - \varepsilon_{0\cdot})'\gamma.
\]

Multiplying both sides by \((\mu'\mu)^{-1}\mu'\) and recalling that \(\frac{1}{T_0}\mu'\mu = I\), this gives

\[
\Phi_{1\cdot} - \Phi_{0\cdot}'\gamma = \frac{1}{T_0}\mu'(X_1 - X_{0\cdot}'\gamma) - \frac{1}{T_0}\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma).
\]

This gives the result.
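Lemma 6's decomposition can be confirmed on simulated data. In the sketch below the factors are orthonormalized so that \(\mu'\mu/T_0 = I\), as the lemma assumes; the weights γ are arbitrary simplex weights over the controls.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T0, J = 20, 60, 2

# Latent factors normalized so that mu'mu = T0 * I
Q, _ = np.linalg.qr(rng.normal(size=(T0, J)))
mu = np.sqrt(T0) * Q
mu_T = rng.normal(size=J)               # factor values at the post period

phi = rng.normal(size=(N, J))           # loadings; unit 0 is treated
eps = rng.normal(scale=0.1, size=(N, T0))
X = phi @ mu.T + eps                    # pre-period outcomes under the factor model

g = rng.dirichlet(np.ones(N - 1))       # weights on the N-1 control units

# Loading imbalance = observed imbalance term minus approximation-error term
lhs = mu_T @ (phi[0] - g @ phi[1:])
imb = mu_T @ mu.T @ (X[0] - g @ X[1:]) / T0
err = mu_T @ mu.T @ (eps[0] - g @ eps[1:]) / T0
assert np.isclose(lhs, imb - err)
```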
Lemma 7. For weights γ with \(\|\gamma\|_1 \leq 1\), ε_it sub-Gaussian with scale parameter σ and independent, and δ > 0,

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr| \leq 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigl(\frac{1}{\sqrt{N_1}} + 1\Bigr)\Bigr) \tag{38}
\]

with probability at least \(1 - 4e^{-T_0\delta^2/2}\).
Proof of Lemma 7. First, we break up the expression into two components with the triangle inequality:

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr| \leq \frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_1\bigr| + \frac{1}{T_0}\Bigl|\mu_T'\mu'\sum_{W_i=0}\varepsilon_i\gamma_i\Bigr|.
\]

We now deal with each term in turn. First, since the ε_it are sub-Gaussian and independent, \(\mu_T'\mu'\varepsilon_i = \sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\sqrt{\sum_{t=1}^{T_0}(\mu_T'\mu_{t\cdot})^2} \leq JM^2\sqrt{T_0}\). Thus, by the sub-Gaussian tail bound,

\[
P\Bigl(\frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_1\bigr| \geq \frac{\delta J M^2 \sigma}{\sqrt{N_1}}\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Next, by Holder's inequality,

\[
\frac{1}{T_0}\Bigl|\mu_T'\mu'\sum_{W_i=0}\varepsilon_i\gamma_i\Bigr| \leq \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty \|\gamma\|_1 \leq \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty.
\]

Note that this bound corresponds to the worst case where γ has a weight of 1 on the unit with the largest noise term correlated with the latent factors. Using the union bound to bound the maximum of sub-Gaussian variables, we see that

\[
P\Bigl(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_\infty \geq 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Combining these two bounds gives the result.
Proof of Theorem 1. From Lemma 6 and using the triangle inequality, the imbalance in φ satisfies

\[
\Bigl|\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr)\Bigr| \leq \frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)\Bigr| + \frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu'\varepsilon_1 - \sum_{W_i=0}\gamma_i\,\mu'\varepsilon_i\Bigr)\Bigr|.
\]

Recall that \(\mu_{tj} \leq M\), so \(\|\mu\|_2 \leq M\sqrt{JT_0}\). Again using Cauchy-Schwarz and \(\|\mu_T\|_2 \leq M\sqrt{J}\), we have the bound

\[
\frac{1}{T_0}\Bigl|\mu_{T\cdot}\Bigl(\mu' X_1 - \sum_{W_i=0}\gamma_i\,\mu' X_i\Bigr)\Bigr| \leq \frac{1}{T_0}\|\mu_T\|_2\|\mu\|_2\|X_1 - X_{0\cdot}'\gamma\|_2 \leq \frac{M^2 J}{\sqrt{T_0}}\|X_1 - X_{0\cdot}'\gamma\|_2.
\]

Combining this with Lemma 7 yields that

\[
\Bigl|\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i\phi_i\Bigr)\Bigr| \leq \frac{M^2 J}{\sqrt{T_0}}\|X_1 - X_{0\cdot}'\gamma\|_2 + 2JM^2\sigma\Bigl(\sqrt{\frac{\log N_0}{T_0}} + \delta\Bigl(\frac{1}{\sqrt{N_1}} + 1\Bigr)\Bigr)
\]

with probability at least \(1 - 4e^{-T_0\delta^2/2}\). Specializing to the case where N_1 = 1 and setting \(\delta = \delta/\sqrt{T_0}\) gives the result.
Proof of Theorem 2. From Lemma 6 the bias for ridge ASCM is

\[
\begin{aligned}
\mu_T\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{aug}}\phi_i\Bigr) &= \frac{1}{T_0}\mu_T'\mu'\Bigl(X_1 - \sum_{W_i=0}\gamma_i^{\text{aug}} X_i - \varepsilon_1 + \sum_{W_i=0}\gamma_i^{\text{aug}}\varepsilon_i\Bigr) \\
&= \frac{1}{T_0}\mu_T'\mu'\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}} - \varepsilon_1 + \varepsilon_{0\cdot}'\gamma^{\text{scm}} - (X_{0\cdot} - \varepsilon_{0\cdot})'(\gamma^{\text{aug}} - \gamma^{\text{scm}})\bigr).
\end{aligned}
\]

Using Lemma 1, we can write the bias as

\[
\begin{aligned}
\text{bias} &= \frac{1}{T_0}\mu_T'\mu'\Bigl(\bigl(I - (X_{0\cdot} - \varepsilon_{0\cdot})'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigr)\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr) - \varepsilon_1 + \varepsilon_{0\cdot}'\gamma^{\text{scm}}\Bigr) \\
&= \underbrace{\frac{1}{T_0}\mu_T'\mu'\bigl(\lambda I + \varepsilon_{0\cdot}'X_{0\cdot}\bigr)(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)}_{(*)} - \frac{1}{T_0}\mu_T'\mu'\bigl(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{scm}}\bigr).
\end{aligned}
\]
Lemma 7 provides a bound for the second term, so we now inspect the first term. By the triangle inequality and Cauchy-Schwarz,

\[
|(*)| \leq \frac{1}{T_0}\|\mu\mu_T\|_2\,\lambda\bigl\|(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)\bigr\|_2 + \Bigl|\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'X_{0\cdot}(X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}\bigl(X_1 - X_{0\cdot}'\gamma^{\text{scm}}\bigr)\Bigr|.
\]

Using Lemmas 2 and 5 we get that

\[
|(*)| \leq \frac{JM^2}{\sqrt{T_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{\lambda}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2 + \frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \frac{1}{\sqrt{N_0}}\Bigl\|\operatorname{diag}\Bigl(\frac{d_j}{d_j^2 + \lambda}\Bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma^{\text{scm}})\Bigr\|_2.
\]

Now we must bound the term \(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2\). From the proof of Lemma 7, \(\frac{1}{T_0}\sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\frac{JM^2}{\sqrt{T_0}}\). A standard discretization argument (Wainwright, 2018) allows us to bound the L2 norm:

\[
P\Bigl(\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \geq 2JM^2\sigma\Bigl(\sqrt{\frac{N_0\log 5}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Replacing δ with \(\sqrt{\frac{N_0}{T_0}}\,(2 - \sqrt{\log 5})\) we see that

\[
\frac{1}{T_0}\|\varepsilon_{0\cdot}\mu\mu_T\|_2 \leq 4JM^2\sigma\sqrt{\frac{N_0}{T_0}}
\]

with probability at least \(1 - 2\exp\bigl(-\frac{N_0(2-\sqrt{\log 5})^2}{2}\bigr)\). Combining this bound with Lemma 7 and using the union bound gives the result.
Proof of Lemma 4. The regression parameters η_x and η_z in Equation (23) are

\[
\eta_x = (\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}\tilde{X}_{0\cdot}'Y_0 \quad \text{and} \quad \eta_z = (Z_{0\cdot}'Z_{0\cdot})^{-1}Z_{0\cdot}'(Y_0 - X_{0\cdot}\eta_x). \tag{39}
\]

Now notice that

\[
\begin{aligned}
Y_1(0)^{\text{cov}} &= \eta_x'X_1 + \eta_z'Z_1 + \sum_{W_i=0} (Y_i - \eta_x'X_i - \eta_z'Z_i)\gamma_i \\
&= \eta_x'(X_1 - X_{0\cdot}'\gamma) + \eta_z'(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= \eta_x'(X_1 - X_{0\cdot}'\gamma) - \eta_x'X_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= \eta_x'(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) + Y_0'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma) + Y_0'\gamma \\
&= Y_0'\bigl(\gamma + \tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) + Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma)\bigr).
\end{aligned} \tag{40}
\]

This gives the form of γ^cov. The imbalance in Z is

\[
\begin{aligned}
Z_1 - Z_{0\cdot}'\gamma^{\text{cov}} &= \bigl(I - Z_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}\bigr)(Z_1 - Z_{0\cdot}'\gamma) - Z_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) \\
&= 0,
\end{aligned} \tag{41}
\]

since \(Z_{0\cdot}'\tilde{X}_{0\cdot} = 0\). The pre-treatment fit is

\[
\begin{aligned}
X_1 - X_{0\cdot}'\gamma^{\text{cov}} &= (\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) - X_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma) \\
&= \bigl(I - \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot}(\tilde{X}_{0\cdot}'\tilde{X}_{0\cdot} + \lambda_r I)^{-1}\bigr)(\tilde{X}_1 - \tilde{X}_{0\cdot}'\gamma),
\end{aligned} \tag{42}
\]

using that \(X_{0\cdot}'\tilde{X}_{0\cdot} = \tilde{X}_{0\cdot}'\tilde{X}_{0\cdot}\). This gives the bound on the pre-treatment fit.
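Both conclusions of Lemma 4 (exact balance on the covariates Z and the shrunken pre-treatment fit in Equation (42)) can be verified numerically. This is a sketch with arbitrary base weights γ; X̃ denotes the lagged outcomes residualized on the covariates using the control-unit coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
N0, T0, K, lam = 50, 10, 3, 1.0

X0 = rng.normal(size=(N0, T0))
Z0 = rng.normal(size=(N0, K))
X1 = rng.normal(size=T0)
Z1 = rng.normal(size=K)
g = rng.dirichlet(np.ones(N0))               # arbitrary base weights

# Residualize lagged outcomes on the covariates (control coefficients)
G = np.linalg.solve(Z0.T @ Z0, Z0.T @ X0)    # K x T0
X0t = X0 - Z0 @ G                            # residualized controls
X1t = X1 - G.T @ Z1                          # residualized treated vector

A = np.linalg.inv(X0t.T @ X0t + lam * np.eye(T0))
g_cov = (g
         + X0t @ A @ (X1t - X0t.T @ g)
         + Z0 @ np.linalg.solve(Z0.T @ Z0, Z1 - Z0.T @ g))

# Covariates are balanced exactly...
assert np.allclose(Z1 - Z0.T @ g_cov, 0)

# ...and the residual pre-treatment fit matches Eq. (42)
lhs = X1 - X0.T @ g_cov
rhs = (np.eye(T0) - X0t.T @ X0t @ A) @ (X1t - X0t.T @ g)
assert np.allclose(lhs, rhs)
```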
Proof of Theorem 3. Building on Lemma 6, we can see that the bias due to imbalance in φ is equal to

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{cov}}\phi_i\Bigr) = \frac{1}{T_0}\mu_T'\mu'(X_1 - X_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu' B'(Z_1 - Z_{0\cdot}'\gamma^{\text{cov}}), \tag{43}
\]

where \(B = [B_1, \ldots, B_{T_0}] \in \mathbb{R}^{K\times T_0}\). By construction, γ^cov perfectly balances the covariates, and combined with Lemma 4, the bias due to imbalance in φ simplifies to

\[
\mu_{T\cdot}\Bigl(\phi_1 - \sum_{W_i=0}\gamma_i^{\text{cov}}\phi_i\Bigr) = \frac{1}{T_0}\mu_T'\mu'(X_1 - X_{0\cdot}'\gamma^{\text{cov}}) - \frac{1}{T_0}\mu_T'\mu'(\varepsilon_1 - \varepsilon_{0\cdot}'\gamma^{\text{cov}}).
\]

We now turn to bounding the noise terms. First, notice that

\[
\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'\gamma^{\text{cov}} = \frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'\gamma + \frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma).
\]

We have bounded the first term on the right-hand side in Lemma 7. To bound the second term, notice that \(\sum_{W_i=0}\sum_{t=1}^{T_0}\mu_T'\mu_{t\cdot}Z_{ik}\varepsilon_{it}\) is sub-Gaussian with scale parameter \(\sigma J M^2\sqrt{T_0}\,\|Z_{\cdot k}\|_2 = JM^2\sigma\sqrt{T_0 N_0}\). We can now bound the L2 norm of \(\frac{1}{T_0}\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot} \in \mathbb{R}^K\):

\[
P\Bigl(\frac{1}{T_0}\|\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}\|_2 \geq 2JM^2\sigma\Bigl(\sqrt{\frac{N_0 K\log 5}{T_0}} + \delta\Bigr)\Bigr) \leq 2\exp\Bigl(-\frac{T_0\delta^2}{2}\Bigr).
\]

Replacing δ with \(\sqrt{\frac{KN_0}{T_0}}\,(2 - \sqrt{\log 5})\) and applying the Cauchy-Schwarz inequality, we see that

\[
\frac{1}{T_0}\bigl|\mu_T'\mu'\varepsilon_{0\cdot}'Z_{0\cdot}(Z_{0\cdot}'Z_{0\cdot})^{-1}(Z_1 - Z_{0\cdot}'\gamma)\bigr| \leq 4JM^2\sigma\sqrt{\frac{K}{T_0 N_0}}\,\|Z_1 - Z_{0\cdot}'\gamma^{\text{scm}}\|_2
\]

with probability at least \(1 - 2\exp\bigl(-\frac{KN_0(2-\sqrt{\log 5})^2}{2}\bigr)\). Combining with Lemma 4 and putting together the pieces with the union bound gives the result.
C Connection to balancing weights and IPW
We have motivated Augmented SCM via bias correction. An alternative motivation comes from the connection between SCM and inverse propensity score weighting (IPW).
First, notice that the SCM weights from the constrained optimization problem in Equation (3) are a form of approximate balancing weights (see, for example, Zubizarreta, 2015; Athey et al., 2018; Tan, 2017; Wang and Zubizarreta, 2018; Zhao, 2018). Unlike traditional inverse propensity score weights, which indirectly minimize covariate imbalance by estimating a propensity score model, balancing weights seek to directly minimize covariate imbalance, in this case L2 imbalance. Balancing weights have a Lagrangian dual formulation as inverse propensity score weights (see, for example, Zhao and Percival, 2017; Zhao, 2018). Extending these results to the SCM setting, the Lagrangian dual of the SCM optimization problem in Equation (3) has the form of a propensity score model. Importantly, as we discuss below, it is not always appropriate to interpret this model as a propensity score.
We first derive the Lagrangian dual for a general class of balancing weights problems, then specialize to the penalized SCM estimator (3):

\[
\begin{aligned}
\min_{\gamma} \quad & \underbrace{h_\zeta(X_{1\cdot} - X_{0\cdot}'\gamma)}_{\text{balance criterion}} + \underbrace{\sum_{W_i=0} f(\gamma_i)}_{\text{dispersion}} \\
\text{subject to} \quad & \sum_{W_i=0}\gamma_i = 1.
\end{aligned} \tag{44}
\]
This formulation generalizes Equation (3) in two ways. First, we remove the non-negativity constraint and note that this can be included by restricting the domain of the strongly convex dispersion penalty f. Examples include the re-centered L2 dispersion penalties for ridge regression and ridge ASCM, an entropy penalty (Robbins et al., 2017), and an elastic net penalty (Doudchenko and Imbens, 2017). Second, we generalize from the squared L2 norm to a general balance criterion h_ζ; another prominent example is an L∞ constraint (see, e.g., Zubizarreta, 2015; Athey et al., 2018).
Proposition 1. The Lagrangian dual to Equation (44) is

\[
\min_{\alpha,\beta}\; \underbrace{\sum_{W_i=0} f^*(\alpha + \beta'X_{i\cdot}) - (\alpha + \beta'X_{1\cdot})}_{\text{loss function}} + \underbrace{h_\zeta^*(\beta)}_{\text{regularization}}, \tag{45}
\]

where a convex, differentiable function g has convex conjugate \(g^*(y) \equiv \sup_{x\in\operatorname{dom}(g)}\{y'x - g(x)\}\). The solutions to the primal problem (44) are \(\gamma_i = f^{*\prime}(\alpha + \beta'X_i)\), where \(f^{*\prime}(\cdot)\) is the first derivative of the convex conjugate \(f^*(\cdot)\).
There is a large literature relating balancing weights to propensity score weights. This literature shows that the loss function in Equation (45) is an M-estimator for the propensity score and thus will be consistent for the propensity score parameters under large-N asymptotics. The dispersion measure f(·) determines the link function of the propensity score model, where the odds of treatment are \(\frac{\pi(x)}{1-\pi(x)} = f^{*\prime}(\alpha + \beta'x)\). Note that un-penalized SCM, which can yield multiple solutions, does not have a well-defined link function. We extend the duality to a general set of balance criteria so that Equation (45) is a regularized M-estimator of the propensity score parameters, where the balance criterion h_ζ(·) determines the type of regularization through its conjugate h_ζ^*(·). This formulation recovers the duality between entropy balancing and a logistic link (Zhao and Percival, 2017), Oaxaca-Blinder weights and a log-logistic link (Kline, 2011), and L∞ balance and L1 regularization (Wang and Zubizarreta, 2018). This more general formulation also suggests natural extensions of both SCM and ASCM beyond the L2 setting to other forms, especially L1 regularization.

Specializing Proposition 1 to a squared L2 balance criterion \(h_\zeta(x) = \frac{1}{2\zeta}\|x\|_2^2\), as in the penalized SCM problems, yields that the dual propensity score coefficients β are regularized by a ridge penalty. In the case of an entropy dispersion penalty, as Robbins et al. (2017) consider, the donor weights γ have the form of IPW weights with a logistic link function: the propensity score is \(\pi(x_i) = \operatorname{logit}^{-1}(\alpha + \beta'x_i)\), and the odds of treatment are \(\frac{\pi(x_i)}{1-\pi(x_i)} = \exp(\alpha + \beta'x_i) = \gamma_i\).

We emphasize that while Proposition 1 shows that the estimated weights have the IPW form, in SCM settings it may not always be appropriate to interpret the dual problem as a propensity score reflecting stochastic selection into treatment. For example, this interpretation would not be appropriate in some canonical SCM examples, such as the analysis of German reunification in Abadie et al. (2015).
Proof of Proposition 1. We can augment the optimization problem (44) with auxiliary variables ε, yielding:

\[
\begin{aligned}
\min_{\gamma, \varepsilon} \quad & h_\zeta(\varepsilon) + \sum_{W_i=0} f(\gamma_i) \\
\text{subject to} \quad & \varepsilon = X_{1\cdot} - X_{0\cdot}'\gamma \\
& \sum_{W_i=0}\gamma_i = 1.
\end{aligned} \tag{46}
\]

The Lagrangian is

\[
\mathcal{L}(\gamma, \varepsilon, \alpha, \beta) = \sum_{W_i=0} f(\gamma_i) + \alpha\Bigl(1 - \sum_{W_i=0}\gamma_i\Bigr) + h_\zeta(\varepsilon) + \beta'(X_{1\cdot} - X_{0\cdot}'\gamma - \varepsilon). \tag{47}
\]

The dual maximizes the objective

\[
\begin{aligned}
q(\alpha, \beta) &= \min_{\gamma, \varepsilon} \mathcal{L}(\gamma, \varepsilon, \alpha, \beta) \\
&= \sum_{W_i=0}\min_{\gamma_i}\{f(\gamma_i) - (\alpha + \beta'X_i)\gamma_i\} + \min_{\varepsilon}\{h_\zeta(\varepsilon) - \beta'\varepsilon\} + \alpha + \beta'X_{1\cdot} \\
&= -\sum_{W_i=0} f^*(\alpha + \beta'X_i) + \alpha + \beta'X_{1\cdot} - h_\zeta^*(\beta).
\end{aligned} \tag{48}
\]

By strong duality the general dual problem (45), which minimizes −q(α, β), is equivalent to the primal balancing weights problem. Given the α and β that minimize the Lagrangian dual objective −q(α, β), we recover the donor weights solution to (44) as

\[
\gamma_i = f^{*\prime}(\alpha + \beta'X_i). \tag{49}
\]
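For the re-centered L2 dispersion penalty \(f(\gamma) = \frac{1}{2}(\gamma - \frac{1}{N_0})^2\) we have \(f^{*\prime}(y) = y + \frac{1}{N_0}\), so minimizing the dual (34) numerically should recover the closed-form ridge weights through Equation (49). A sketch of this check using scipy on synthetic centered data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N0, T0, lam = 25, 6, 0.8
X0 = rng.normal(size=(N0, T0))
X0 -= X0.mean(axis=0)                  # centered lagged outcomes
X1 = rng.normal(size=T0)

# Dual objective (34) for f(g) = (1/2)(g - 1/N0)^2
def dual(ab):
    a, b = ab[0], ab[1:]
    lin = a + X0 @ b
    return 0.5 * np.sum((lin + 1 / N0) ** 2) - (a + X1 @ b) + 0.5 * lam * b @ b

res = minimize(dual, np.zeros(1 + T0), method="BFGS")
a, b = res.x[0], res.x[1:]
g_dual = a + X0 @ b + 1 / N0           # primal weights via f*'(alpha + beta'X_i)

# Closed-form ridge weights: 1/N0 + X1'(X0'X0 + lam I)^{-1} X_i
g_ridge = 1 / N0 + X0 @ np.linalg.solve(X0.T @ X0 + lam * np.eye(T0), X1)

assert np.allclose(g_dual, g_ridge, atol=1e-4)
assert np.isclose(g_dual.sum(), 1.0, atol=1e-4)
```

Because the lagged outcomes are centered, the optimizer drives α to zero and β to \((X_{0\cdot}'X_{0\cdot} + \lambda I)^{-1}X_1\), exactly as in the proof of Lemmas 1 and 3.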
D Additional figures
Figure D.1: Cross-validation MSE and one standard error computed according to Equation (21). The minimal point and the maximum λ within one standard error of the minimum are highlighted. The dashed line shows the sum of the imbalance and excess approximation error terms in Theorem 2 as a function of λ, with the noise term fit via gsynth. The bound is scaled to the same scale as the MSE.
Figure D.2: Point estimates ± two standard errors of the ATT for the effect of the tax cuts on log GSP per capita using de-meaned SCM, ridge regression alone, ridge ASCM with λ chosen to minimize the cross-validated MSE, and the original SCM proposal with covariates (Abadie et al., 2010).
Figure D.3: Placebo point estimates ± two standard errors for SCM with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure D.4: Placebo point estimates ± two standard errors for ridge ASCM with placebo treatment times in Q2 2009, 2010, and 2011. The scale begins in 2005 to highlight the placebo estimates.
Figure D.5: Donor unit weights for SCM, ridge regression, and ridge ASCM.
Figure D.6: RMSE for different augmented and non-augmented estimators across outcome models.
Figure D.7: Bias for different augmented and non-augmented estimators across outcome models, conditioned on SCM fit in the top quintile.
Figure D.8: RMSE for different augmented and non-augmented estimators across outcome models, conditioned on SCM fit in the top quintile.
Figure D.9: Latent factors for calibrated simulation studies.