arxiv:2109.05175v1 [stat.ml] 11 sep 2021

19
E STIMATION OF L OCAL A VERAGE T REATMENT E FFECT BY DATA C OMBINATION Kazuhiko Shinoda Graduate School of Economics, Keio University Takahiro Hoshino Faculty of Economics, Keio University RIKEN AIP ABSTRACT It is important to estimate the local average treatment effect (LATE) when compliance with a treat- ment assignment is incomplete. The previously proposed methods for LATE estimation required all relevant variables to be jointly observed in a single dataset; however, it is sometimes difficult or even impossible to collect such data in many real-world problems for technical or privacy reasons. We consider a novel problem setting in which LATE, as a function of covariates, is nonparametri- cally identified from the combination of separately observed datasets. For estimation, we show that the direct least squares method, which was originally developed for estimating the average treat- ment effect under complete compliance, is applicable to our setting. However, model selection and hyperparameter tuning for the direct least squares estimator can be unstable in practice since it is defined as a solution to the minimax problem. We then propose a weighted least squares estimator that enables simpler model selection by avoiding the minimax objective formulation. Unlike the in- verse probability weighted (IPW) estimator, the proposed estimator directly uses the pre-estimated weight without inversion, avoiding the problems caused by the IPW methods. We demonstrate the effectiveness of our method through experiments using synthetic and real-world datasets. 1 Introduction Estimating the causal effects of treatment on an outcome of interest is central to optimal decision making in many real-world problems, such as policymaking, epidemiology, education, and marketing [52, 59, 44, 57]. However, the identification and estimation of treatment effects usually relies on the untestable assumption referred to as uncon- foundedness, namely, independence between the treatment status and potential outcomes conditional on covariates [26]. Violations of unconfoundedness may occur not only in observational studies, but also in randomized controlled trials (RCTs) when compliance to the assigned treatment is not complete. For example, even if a coupon is distributed randomly to measure its effect on sales, the probability of using the coupon is likely to depend on the unobserved nature of the individuals. Moreover, noncompliance can also occur regardless of the individual’s intentions. In on- line advertisement placement, the probability of watching the ad depends on the bidding strategy of other companies because ads that are actually displayed are determined through the real-time-bidding even if one tries to randomly place the ad. In such cases, it is well known that the local average treatment effect (LATE) can be identified and esti- mated using the treatment assignment as an instrumental variable (IV) and conditions milder than unconfoundedness [25, 5, 19]. LATE is the treatment effect measured for the subpopulation of compliers, individuals who always follow the given assignment. We suppose that we cannot observe all relevant variables in a single dataset for technical or privacy reasons. For exam- ple, in online-to-offline (O2O) marketing, where treatments are implemented online and outcomes are observed offline, it is often difficult to match the records of the same individuals observed separately online and offline. Additionally, with the global anti-tracking movement gaining momentum, it may become more difficult to combine multiple pieces of information online as well. Although causal inference by data combination has been actively studied [46, 8], LATE estimation using multiple datasets has not received much attention despite its practical importance. We extend the problem setting considered in [62], where two different treatment regimes are available, to allow for the existence of noncompliance. In our setting, LATE can be nonparametrically identified as a function of covariates, and estimated from a combination of separately observed datasets that are easier to collect in practice. arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Upload: others

Post on 27-Oct-2021

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

ESTIMATION OF LOCAL AVERAGE TREATMENT EFFECT BYDATA COMBINATION

Kazuhiko ShinodaGraduate School of Economics, Keio University

Takahiro HoshinoFaculty of Economics, Keio University

RIKEN AIP

ABSTRACT

It is important to estimate the local average treatment effect (LATE) when compliance with a treat-ment assignment is incomplete. The previously proposed methods for LATE estimation requiredall relevant variables to be jointly observed in a single dataset; however, it is sometimes difficult oreven impossible to collect such data in many real-world problems for technical or privacy reasons.We consider a novel problem setting in which LATE, as a function of covariates, is nonparametri-cally identified from the combination of separately observed datasets. For estimation, we show thatthe direct least squares method, which was originally developed for estimating the average treat-ment effect under complete compliance, is applicable to our setting. However, model selection andhyperparameter tuning for the direct least squares estimator can be unstable in practice since it isdefined as a solution to the minimax problem. We then propose a weighted least squares estimatorthat enables simpler model selection by avoiding the minimax objective formulation. Unlike the in-verse probability weighted (IPW) estimator, the proposed estimator directly uses the pre-estimatedweight without inversion, avoiding the problems caused by the IPW methods. We demonstrate theeffectiveness of our method through experiments using synthetic and real-world datasets.

1 Introduction

Estimating the causal effects of treatment on an outcome of interest is central to optimal decision making in manyreal-world problems, such as policymaking, epidemiology, education, and marketing [52, 59, 44, 57]. However, theidentification and estimation of treatment effects usually relies on the untestable assumption referred to as uncon-foundedness, namely, independence between the treatment status and potential outcomes conditional on covariates[26]. Violations of unconfoundedness may occur not only in observational studies, but also in randomized controlledtrials (RCTs) when compliance to the assigned treatment is not complete. For example, even if a coupon is distributedrandomly to measure its effect on sales, the probability of using the coupon is likely to depend on the unobservednature of the individuals. Moreover, noncompliance can also occur regardless of the individual’s intentions. In on-line advertisement placement, the probability of watching the ad depends on the bidding strategy of other companiesbecause ads that are actually displayed are determined through the real-time-bidding even if one tries to randomlyplace the ad. In such cases, it is well known that the local average treatment effect (LATE) can be identified and esti-mated using the treatment assignment as an instrumental variable (IV) and conditions milder than unconfoundedness[25, 5, 19]. LATE is the treatment effect measured for the subpopulation of compliers, individuals who always followthe given assignment.

We suppose that we cannot observe all relevant variables in a single dataset for technical or privacy reasons. For exam-ple, in online-to-offline (O2O) marketing, where treatments are implemented online and outcomes are observed offline,it is often difficult to match the records of the same individuals observed separately online and offline. Additionally,with the global anti-tracking movement gaining momentum, it may become more difficult to combine multiple piecesof information online as well. Although causal inference by data combination has been actively studied [46, 8], LATEestimation using multiple datasets has not received much attention despite its practical importance. We extend theproblem setting considered in [62], where two different treatment regimes are available, to allow for the existence ofnoncompliance. In our setting, LATE can be nonparametrically identified as a function of covariates, and estimatedfrom a combination of separately observed datasets that are easier to collect in practice.

arX

iv:2

109.

0517

5v1

[st

at.M

L]

11

Sep

2021

Page 2: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

<latexit sha1_base64="vGWBIr2FdE2ChKxbeUef3q/17+c=">AAAChHichVG7SgNBFD1Z3/EVtRFsgkGxkDDrG0EJ2gg2iTEqqMjuOomDm91ldxKIIT8g2JrCSsFC7P0BG3/AIp8glgo2Ft7dLIiKeoeZOXPmnjtnuLpjCk8y1ogoLa1t7R2dXdHunt6+/tjA4JZnl1yD5wzbtN0dXfO4KSyek0KafMdxuVbUTb6tH6/699tl7nrCtjZlxeH7Ra1gibwwNElUdn1JPYglWJIFEf8J1BAkEEbajt1hD4ewYaCEIjgsSMImNHg0dqGCwSFuH1XiXEIiuOeoIUraEmVxytCIPaa1QKfdkLXo7Nf0ArVBr5g0XVLGMcYe2Q17YQ/slj2x919rVYMavpcK7XpTy52D/tPh7Nu/qiLtEkefqj89S+SxEHgV5N0JGP8XRlNfPqm/ZBc3xqrj7Io9k/9L1mD39AOr/GpcZ/jGxR9+dPJSo/ao35vxE2xNJdW55ExmJpFaCRvViRGMYoK6MY8U1pBGjqoXcIZz1JV2ZVKZVmabqUok1AzhSyjLH6HtkDg=</latexit>

K = 1<latexit sha1_base64="gIBM7qlqlJtE4YxdVqEkN8WGCFM=">AAAChHichVG7SgNBFD1ZNWp8JGoj2ASDYiHhxjeCErQRbKIxKkQJu+skLtnsLrubQAz5AcHWFFYKFmLvD9j4AxZ+glgq2Fh4s1kQDeodZubMmXvunOEqlq45LtFTQOro7Ap29/SG+voHBsORoeFdxyzbqsiopm7a+4rsCF0zRMbVXF3sW7aQS4ou9pTievN+ryJsRzONHbdqicOSXDC0vKbKLlPpzRXKRWIUJy+i7SDhgxj8SJmROxzgCCZUlFGCgAGXsQ4ZDo8sEiBYzB2ixpzNSPPuBeoIsbbMWYIzZGaLvBb4lPVZg8/Nmo6nVvkVnafNyigm6JFu6JUe6Jae6ePXWjWvRtNLlXelpRVWLnw6mn7/V1Xi3cXxl+pPzy7yWPK8auzd8pjmL9SWvnLSeE0vb0/UJumKXtj/JT3RPf/AqLyp11ti++IPPwp7qXN7Ej+b0Q52Z+KJhfjc1lwsueY3qgdjGMcUd2MRSWwghQxXL+AM52hIQWlampXmW6lSwNeM4FtIq5+fz5A3</latexit>

K = 0

<latexit sha1_base64="JWqlpPuNRkL0oxnW2GDT7uf0uPs=">AAACunicSyrIySwuMTC4ycjEzMLKxs7BycXNw8vHLyAoFFacX1qUnBqanJ+TXxSRlFicmpOZlxpaklmSkxpRUJSamJuUkxqelO0Mkg8vSy0qzszPCympLEiNzU1Mz8tMy0xOLAEKxQuYB2hExhvGVWsYatbqRMYbQFkucDEXuFhMUm51RC2EoxkvoGygZwAGCpgMQyhDmQEKAvIFtjHEMKQw5DMkM5Qy5DKkMuQxlADZOQyJDMVAGM1gyGDAUAAUi2WoBooVAVmZYPlUhloGLqDeUqCqVKCKRKBoNpBMB/KioaJ5QD7IzGKw7mSgLTlAXATUqcCganDVYKXBZ4MTBqsNXhr8wWlWNdgMkFsqgXQSRG9qQTx/l0Twd4K6coF0CUMGQhdeN5cwpDFYgN2aCXR7AVgE5ItkiP6yqumfg62CVKvVDBYZvAa6f6HBTYPDQB/klX1JXhqYGjQbj3uSgG6pBUaPIXpkYDLCjPQMzfRMAk2UHZygEcXBIM2gxKABjA1zBgcGD4YAhlCg6XMZDjOcYTjLZMOUxJTJlA1RysQI1SPMgAKYSgDstaI0</latexit>

P (Y(1)1 , Y

(1)0 , D

(1)1 , D

(1)0 , X(1))

<latexit sha1_base64="PQsjM1/6IIvkUfoxsf2E/30yL90=">AAACunicSyrIySwuMTC4ycjEzMLKxs7BycXNw8vHLyAoFFacX1qUnBqanJ+TXxSRlFicmpOZlxpaklmSkxpRUJSamJuUkxqelO0Mkg8vSy0qzszPCympLEiNzU1Mz8tMy0xOLAEKxQuYB2hExhvGVWsYaNbqRMYbQFkucDEXuFhMUm51RC2EoxkvoGygZwAGCpgMQyhDmQEKAvIFtjHEMKQw5DMkM5Qy5DKkMuQxlADZOQyJDMVAGM1gyGDAUAAUi2WoBooVAVmZYPlUhloGLqDeUqCqVKCKRKBoNpBMB/KioaJ5QD7IzGKw7mSgLTlAXATUqcCganDVYKXBZ4MTBqsNXhr8wWlWNdgMkFsqgXQSRG9qQTx/l0Twd4K6coF0CUMGQhdeN5cwpDFYgN2aCXR7AVgE5ItkiP6yqumfg62CVKvVDBYZvAa6f6HBTYPDQB/klX1JXhqYGjQbj3uSgG6pBUaPIXpkYDLCjPQMzfRMAk2UHZygEcXBIM2gxKABjA1zBgcGD4YAhlCg6XMZDjOcYTjLZMOUxJTJlA1RysQI1SPMgAKYSgDhoKIv</latexit>

P (Y(0)1 , Y

(0)0 , D

(0)1 , D

(0)0 , X(0))

<latexit sha1_base64="oba4aEbd28/ymahtUG16Q1FF/Og=">AAACnHichVHLSsNQED3G97vqRhFELIqClIkUFVeiXQgi1EdrRSUk8aqheZGkBS3FvT/gwpWCCxF05w+48Qdc+AniUsGNC6dpQFTUCTdz7rlz5p7LaK5p+AHRY41UW1ff0NjU3NLa1t7RGevqzvpOwdNFRndMx8tpqi9MwxaZwAhMkXM9oVqaKda1/HzlfL0oPN9w7LXgwBXblrpnG7uGrgZMKbG+9OiGIo9vKDSe4pzivKVZpVx5TInFKUFhDP4EcgTiiCLtxG6xhR040FGABQEbAWMTKnz+NiGD4DK3jRJzHiMjPBcoo4W1Ba4SXKEym+f/Hu82I9bmfaWnH6p1vsXk5bFyEMP0QJf0Qvd0RU/0/muvUtij4uWAs1bVClfpPO5dfftXZXEOsP+p+tNzgF1Mh14N9u6GTOUVelVfPDx5WZ1ZGS6N0Dk9s/8zeqQ7foFdfNUvlsXK6R9+NPZS5vHI34fxE2QnEvJkIrmcjM/ORYNqQj+GMMrTmMIsFpBGhrsf4QLXuJEGpJS0KC1VS6WaSNODLyFlPwCUlZfQ</latexit>

P (Y1, Y0, D1, D0, X)

<latexit sha1_base64="G6571JVVXJQGbuUJ388/BnbKi8w=">AAACunichVE9T9tQFD0Y2tK0lEAXJJaIiCpZomuECkIgobJ0DIFAVAKRbR7wFH9hv0Sibv4Af4Ch6tBKDFV3/gALA4wM/ATESCUWht7YlipAwLXsd+6591yfp2v6tgwV0UWP1tv34uWr/teZN28H3g1mh4ZXQq8VWKJqebYX1EwjFLZ0RVVJZYuaHwjDMW2xajYXuvXVtghC6bnLas8X646x7cotaRmKqUZ2qlz4shEV9GJnTv9WN52o1knSYt0Vu7mkSnernBYb2TyVKI7cQ6CnII80yl72CHVswoOFFhwIuFCMbRgI+VmDDoLP3Doi5gJGMq4LdJBhbYu7BHcYzDb5u83ZWsq6nHdnhrHa4r/Y/AaszGGczuk3XdMJ/aFLun10VhTP6HrZ49NMtMJvDO6PLN08q3L4VNj5r3rSs8IWpmOvkr37MdO9hZXo218PrpdmKuPRB/pFV+z/J13QMd/Abf+1DhdF5fsTfkz20uH16PeX8RCsTJT0j6XJxcn8/Kd0Uf0YxRgKvI0pzOMzyqjy9B84xinOtFnN1KTWTFq1nlTzHndCU/8ACLqjlg==</latexit>

P (Z(1) = 1|X(1)) 6= P (Z(0) = 1|X(0))

Treatment Assignment with

<latexit sha1_base64="da9RNYe7FOTnDo3ugWk1V8wYP+g=">AAAConichVFNLwNRFD0dX/VdbCQWGlWpRJo30iBWgoVYtapaaWlmxsPEfGVm2oRJlzb+gIUViYXY4g/Y+AMWfoJYkthYuJ1OIjS4k5l77nn33DkvV7Y01XEZewoJLa1t7R3hzq7unt6+/sjA4IZjVmyF5xRTM+2CLDlcUw2ec1VX4wXL5pIuazwvHyzVz/NVbjuqaay7hxbf0qU9Q91VFcklqhwZTSc2t72EOFmbWg5ySda9Qq1RTJYjMZZkfkSbgRiAGIJIm5E7lLADEwoq0MFhwCWsQYJDTxEiGCzituARZxNS/XOOGrpIW6EuTh0SsQf03aOqGLAG1fWZjq9W6C8avTYpo4izR3bFXtkDu2bP7OPXWZ4/o+7lkLLc0HKr3H8ynH3/V6VTdrH/pfrTs4tdzPleVfJu+Uz9FkpDXz06fc3Or8W9CXbBXsj/OXti93QDo/qmXGb42tkffmTyUqP1iD+X0Qw2ppPiTDKVScUWFoNFhTGCMSRoG7NYwArSyNH0Y1zjBrfCuLAqZIRso1UIBZohfAuh9Ami0pph</latexit>

P (Y (1), D(1), X(1))<latexit sha1_base64="JjvaJRNwC5B2gL+OR8Ml6HM8Uk8=">AAAConichVHNLgNhFD3GX/0XG4kFUaQSaW6lQawEC7FqVakozcz4MDF/mZk2YdKljRewsCKxEFu8gI0XsPAIYkliY+F2OokguJOZe+757rlzvlzF1jXXI3qsk+obGpuaIy2tbe0dnV3R7p5V1yo5qsiplm45eUV2ha6ZIudpni7ytiNkQ9HFmrI/Xz1fKwvH1SxzxTuwxaYh75rajqbKHlPF6EA6vr7lx2msMr4Q5oJi+PlKrRgrRmOUoCAGf4JkCGIII21Fb1HANiyoKMGAgAmPsQ4ZLj8bSIJgM7cJnzmHkRacC1TQytoSdwnukJnd5+8uVxsha3JdnekGapX/ovPrsHIQI/RAl/RC93RFT/T+6yw/mFH1csBZqWmFXew67su+/asyOHvY+1T96dnDDqYDrxp7twOmegu1pi8fnrxkZ5ZH/FE6p2f2f0aPdMc3MMuv6kVGLJ/+4UdhLxVeT/L7Mn6C1YlEcjKRyqRis3PhoiLoxxDivI0pzGIRaeR4+hGucI0baVhakjJSttYq1YWaXnwJqfABnE2aXg==</latexit>

P (Y (0), D(0), X(0))

Regime Assignment

Figure 1: Overview of the 2RD.

For the estimation, we show that the direct estimation method originally developed for estimating the average treatmenteffect (ATE) under the complete compliance [62] can be applied to the LATE estimation in our problem setting.However, their method has a practical issue in that model selection and hyperparameter tuning can be unstable owingto its minimax objective formulation. We then propose a weighted least squares estimator to avoid the minimaxobjective functional, and improve the stability in practice. Unlike the inverse probability weighted (IPW) estimator[60, 61, 50], which is often employed to estimate treatment effects, the proposed estimator directly uses the estimatedpropensity-score-difference (PSD) as a weight without inversion. Therefore, our proposed estimator can also avoidthe common issue in the IPW methods, that is, high variance at points with a propensity score extremely close to zero.

Section 2 introduces the notation and problem settings. We also present the identification of LATE from separatelyobserved samples. Section 3 reviews the statistical approaches for studying treatment effects when there is no singledataset containing all relevant variables. Section 4 applies some existing methods to our problem setting. Section 5develops a novel estimator that overcomes the drawbacks of the existing methods. Section 6 evaluates the performanceof the proposed method via experiments using synthetic and real-world datasets.

2 Problem Setting

Unlike the standard causal inference studies, we consider a setting where there are two different assignment regimes[62], which we term as the two-regime design (2RD). Figure 1 illustrates the concept of the 2RD. We define ourproblem using the potential outcome framework [48, 26]. Let K ∈ 0, 1 be a regime indicator, and we use asuperscript k = 0, 1 to specify the regime which the variables come from. Let Y (k) ∈ Y ⊂ R be an outcome ofinterest,D(k) ∈ 0, 1 be a binary treatment indicator, Z(k) ∈ 0, 1 be an assignment indicator andX(k) ∈ X ⊂ Rqxbe a qx-dimensional vector of covariates. D(k)

z is the potential treatment status realized only when Z(k) = z, and Y (k)d

is the potential outcome realized only when D(k) = d, where z, d ∈ 0, 1. Using this notation, we implicitly assumethat Z(k) does not have a direct effect on Y (k), but affects Y (k) indirectly through D(k). This condition, often referredto as the exclusion restriction, is necessary for Z(k) to be a valid IV. Let Y1, Y0, D1, D0 and X be the potentialvariables and covariates in the population of interest, and we suppose that the iid samples from P (X) can be obtainedas test samples.

2.1 Basic Setting

First, we make the following assumption on the relation between the two regimes.

Assumption 1.

1. P (Y(1)1 , Y

(1)0 , D

(1)1 , D

(1)0 ,X(1)) = P (Y

(0)1 , Y

(0)0 , D

(0)1 , D

(0)0 ,X(0)) = P (Y1, Y0, D1, D0,X).

2. P (Z(1) = 1|X(1) = x) 6= P (Z(0) = 1|X(0) = x) for any x ∈ X .

By Assumption 1.1, we suppose that the joint distribution of the potential variables and covariates in each assignmentregime is invariant to each other, and equal to the joint distribution in the population of interest. This means thatthe participants in each regime are random draws from the population of interest. Assumption 1.2 indicates that the

2

Page 3: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

assignment regimes are different to each other. It is possible to set P (Z(0) = 1|X(0)) = 0 as long as Assumption 1.2is satisfied. We term the design with P (Z(1) = 1|X(1)) > 0 for any X(1) and P (Z(0) = 1|X(0)) = 0 for any X(0)

as the one-experiment 2RD (1E2RD) since an experiment is conducted only for those with k = 1, and a natural statewithout any intervention is observed for those with k = 0. The 1E2RD extends the applicability of our setting since itis easier to implement than the general 2RD.

Additionally, we make the following assumption required for the identification of LATE. Here, A ⊥⊥ B|C means Aand B are conditionally independent given C.Assumption 2. For k = 0, 1,

1. Y (k)1 , Y

(k)0 , D

(k)1 , D

(k)0 ⊥⊥ Z(k)|X(k).

2. D(k) = Z(k)D(k)1 + (1− Z(k))D

(k)0 and Y (k) = D(k)Y

(k)1 + (1−D(k))Y

(k)0 .

3. P (D(k)1 ≥ D(k)

0 |X(k)) = 1.

4. P (D(k)1 = 1|X(k)) 6= P (D

(k)0 = 1|X(k)).

These are the direct extensions of the standard assumptions used for the LATE estimation [1, 19, 56, 42] to the2RD. However, we omit the condition referred to as positivity 0 < P (Z(k) = 1|X(k)) < 1 as it is unnecessaryin the 2RD. Assumption 2.1 states that Z(k) are randomly assigned within a subpopulation sharing the same levelof the covariates. The mean independence between the potential variables and the treatment assignment conditionalon covariates is sufficient for identifying LATE, but we maintain Assumption 2.1 as it is typical to assume the fullconditional independence in the causal inference literature. Assumption 2.2 relates the potential variables to theirrealized counterparts. Assumption 2.3 excludes defiers, those who never follow the given treatment assignment, fromour analysis. Assumption 2.4 is necessary for Z(k) to be a valid IV of D(k). Hereafter, we abuse the notation byomitting the superscript on the potential variables and covariates unless this causes confusion.

2.2 Identification

The parameter of our interest is LATE as a function of covariatesX , defined as

µ(X) := E[Y1 − Y0|D1 > D0,X].

It measures how covariatesX affect the average treatment effect within the subpopulation of compliers. The followingtheorem shows that µ(X) is nonparametrically identified in an unusual form in our setting.Theorem 1. Under Assumption 1 and 2,

µ(X) =E[Y (1)|X]− E[Y (0)|X]

E[D(1)|X]− E[D(0)|X]. (1)

The proof can be found in Appendix A. Notably, this form coincides with the identification result of ATE in [62], whichmeans that we can also estimate µ(X) from the same combination of datasets proposed in [62] under appropriate as-sumptions. This is a rather powerful result in practice sinceZ(k) is not required despite the presence of noncompliance.Moreover, it is obvious that p(k)d := P (D(k) = 1) and P (X|D(k) = 1) are sufficient to identify the denominator of

(1) since E[D(k)|X] can be identified as P (X|D(k)=1)p(k)d

P (X) by applying the Bayes’ theorem. The point is that we canidentify µ(X) as a combination of functions depending only on covariates X in our setting. This property allowsus to use a direct estimation method for µ(X). We can identify LATE as µ(X) = E[Y |Z=1,X]−E[Y |Z=0,X]

E[D|Z=1,X]−E[D|Z=0,X] underthe standard assumptions [1, 19, 56, 42], but we cannot rewrite it as a combination of functions depending only oncovariates.

Theorem 1 also suggests the usefulness of the 1E2RD. We can simplify (1) when we implement the 1E2RD underone-sided noncompliance, where individuals assigned to the control group never receive treatment, but they have achoice if assigned to the treatment group.

Corollary 1. Assume P (Z(0) = 1|X) = 0 (1E2RD) and P (D0 = 1|X) = 0 (one-sided noncompliance) in additionto Assumption 1 and 2. Then, E[Y (0)|X] = E[Y0|X] and

µ(X) =E[Y (1)|X]− E[Y (0)|X]

E[D(1)|X].

3

Page 4: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

This corollary shows that we can reduce the number of necessary datasets in the 1E2RD under one-sided noncompli-ance. This fact not only facilitates the data collection, but also benefits the estimation since the denominator is now justa propensity score, thus estimable accurately by, for example, the Positive-Unlabeled (PU) learning [16, 14, 15, 31]with the logistic loss. Although the one-sided noncompliance assumption seems too restrictive and impractical at firstglance, RCTs in practice often fall into this class for the following two reasons. The first reason is that there wouldbe individuals who do not follow the assignment when they must take some voluntary action to receive treatment.Second, it is easy to deny access to treatment for those in the control group in many RCT designs. Although one-sidednoncompliance is often associated with RCTs, some observational studies also fit with the framework [20, 30]. In thecase of one-sided noncompliance, we can consider our problem as the estimation of the average treatment effect onthe treated (ATT) since LATE is equal to ATT under one-sided noncompliance and the other standard assumptions[20, 13].

2.3 Data Collection Scheme

We assume that the joint samples of (Y (k), D(k), Z(k),X) are not available. By Theorem 1, the following separatedatasets and the estimate of p(k)d are sufficient for estimating µ(X):

x(k)di

n(k)di=1

iid∼ P (X|D(k) = 1), (y(k)i ,x(k)i )n(k)

i=1iid∼ P (Y (k),X),

for k = 0, 1. This setting is much easier to apply to real-world situations than the standard setting where the jointsamples are available from every individual.

We can interpret this data collection scheme in two ways. First, it can be regarded as the repeated cross-sectional(RCS) design [39, 7, 21, 46, 33] with k = 0, 1 representing the time points before and after the assignment regimeswitches, respectively. Generally, although panel data are advantageous for statistical analyses because they follow thesame individuals over multiple time periods, RCS data have some advantages over panel data. RCS data are easier andcheaper to collect in many cases. Consequently, they are often more representative of the population of interest andlarger in sample size than panel data. Additionally, we are more likely to be able to collect large and representativedata in our setting since we do not need to observe the outcome and the treatment status simultaneously. RCS datararely suffer from attrition and dropout, which often cause severe bias in panel data [46]. However, there is a concernregarding the validity of Assumption 1.1 when we use RCS data since the potential variables may change over time.We need some sort of side information to be confident about the use of RCS data in our setting as we cannot directlytest whether the potential variables remain unchanged. The second interpretation is to collect data with k = 0, 1 atthe same time by randomly splitting the population. If an experimenter is able to surely perform random splitting, thisapproach is favorable since we do not have to worry about the validity of Assumption 1.1. The implementation ofrandom splitting is easy in the case of RCTs.

One specific example of our setting is O2O marketing. To measure the effect of an online video ads on sales inphysical stores, we usually cannot link the data of those who watched the ads with that of those who shopped at thestores. However, it is relatively easy to separately collect data from those who watch the ads and the purchasers. In thiscase, data collection can be performed by either RCS design or random splitting. Another example is when estimatingthe effect of a treatment that takes a certain period of time to take effect. For instance, it is desirable to use panel datato estimate the effect of job training on future earnings. However, individuals may gradually drop out of the survey,and the probability of the attrition can depend on unobserved variables [22, 40]. Therefore, it is easier to collect RCSdata than to construct balanced panel data.

Remark on Assumption 1.1 We can still identify LATE even if we weaken Assumption 1.1 to its conditionalversion:

P (Y(1)1 , Y

(1)0 , D

(1)1 , D

(1)0 |X(1)) = P (Y

(0)1 , Y

(0)0 , D

(0)1 , D

(0)0 |X(0)) = P (Y1, Y0, D1, D0|X), (2)

which implies that P (X(1)) 6= P (X(0)) 6= P (X). Unfortunately, it is impossible to estimate µ(X) on P (X) under

our data collection scheme, where we have x(k)di

n(k)di=1 and p(k)d . If we have (d(k)i ,x

(k)di )n

(k)di=1

iid∼ P (D(k),X(k))instead, the estimation is possible by utilizing density ratios to adjust the distribution over which the expectation istaken. However, this limitation does not detract significantly from the value of our work since our setting is plausibleenough to apply to many real-world problems even with Assumption 1.1 as explained earlier. Note that the concernfor the time-varying potential variables when using RCS data does not vanish even if we adopt the condition (2) sincethe conditional distribution may also change over time. See Appendix B for a detailed discussion of the estimationunder the condition (2).

4

Page 5: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

3 Related Works

There have been many proposals for the estimation of µ(X) when the joint samples are available [35, 23, 1, 56, 43,42, 58]. These include estimation via the parametric specification of the local average response function E[Yd|D1 >D0,X] [1], doubly robust estimators [43, 42] and estimation with a binary outcome [58], to name a few. However,the estimation of µ(X) by data combination has rarely been considered in the literature.

Our problem setting is closely related to the two-sample IV (TSIV) estimation [6, 27, 45, 12, 10], where the outcome,IVs, and covariates are not jointly observed in a single dataset. Furthermore, the idea of using moments separatelyestimated from different datasets can be dated back to at least [32]. Although estimands in the TSIV estimation are notlimited to causal parameters, there have been some studies on the causal inference in the setting related to TSIV. Theyinclude ATE estimation from samples of (Y,Z,X) and (D,Z,X) with the existence of unmeasured confounders [54]and causal inference using samples from heterogeneous populations [64, 51]. Our setting differs from theirs in thatwe have to observe D only when D = 1, and we do not need Z at all. Particularly, it is of great practical benefit sincesamples of those who do not receive treatment are often rather difficult to observe. Although our setting requires tworegimes, it rather opens up the possibility of LATE estimation in the real world since we can set P (Z(0) = 1|X) = 0as mentioned earlier.

Another approach for the causal inference from separately observed samples is the partial identification [37, 55, 38],that is, deriving bounds for the treatment effects rather than imposing strong assumptions sufficient for the pointidentification. In [17, 18], the moment inequality proposed in [11] was applied to derive the sharp bounds for ATEwhen only samples of (Y,D) and (D,X) are available. The difficulty of applying their approach to the estimation ofa treatment effect conditional on covariates is that it requires conditional distribution functions or conditional quantilefunctions, which are usually difficult to estimate accurately. Moreover, it is sometimes difficult to derive informativebounds for treatment effects without imposing strong assumptions depending on the data generating process [38].

4 Existing Methods

Although the LATE estimation in our specific setting has not been studied before, some existing methods can beapplied. We discuss the advantages and disadvantages of these methods especially in terms of accuracy and modelselection. Since the true value of treatment effects is not observable by the fundamental problem of causal inference[24], model selection and hyperparameter tuning are substantial issues in practice [47, 4, 49].

4.1 Separate Estimation

A naive estimation method for µ(X) is to separately estimate the components and combine them as

µsep(x) =E[Y (1)|X = x]− E[Y (0)|X = x]

E[D(1)|X = x]− E[D(0)|X = x], (3)

where a hat denotes an estimator. E[D(k)|X] can be estimated by the PU learning [16, 14, 15, 31] with the logistic

loss by using x(k)di

n(k)di=1 as positive data, and the covariates in (y(1)i ,x

(1)i )n(1)

i=1 and (y(0)i ,x(0)i )n(0)

i=1 as unlabeleddata. Separate estimation is easy to implement, but usually does not provide a good estimate since it has four possiblesources of error. One advantage of the separate estimation is that the model selection can also be naively performed bychoosing the best model for each component. We can easily calculate the model selection criteria, such as the meansquared error (MSE) for each component from the separately observed datasets. However, the resulting µsep mayperform poorly because it does not necessarily minimize the model selection criterion in terms of the true µ [47].

We can alleviate the drawback of the separate estimation by directly estimating the numerator and denominator in (3).Let T := KD(1) − (1 −K)D(0) and U := KY (1) − (1 −K)Y (0) be the auxiliary variables for the notational andcomputational convenience, and assume P (K = 1|X) = 0.5. Then, we can rewrite the expression in Theorem 1as µ(X) = ν(X)/π(X), where ν(X) := E[U |X] and π(X) := E[T |X]. To estimate ν(X) and π(X), we canconstruct combined datasets

(ti,xti, rti)nti=1 :=

((−1)1−k,x

(k)di ,

p(k)d (n

(1)d + n

(0)d )

2n(k)d

)n(k)d ,1

i=1,k=0

,

(ui,xui, rui)nui=1 :=

((−1)1−ky

(k)i ,x

(k)i ,

n(1) + n(0)

2n(k)

)n(k),1

i=1,k=0

5

Page 6: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

from x(k)di

n(k)di=1 and (y(k)i ,x

(k)i )n(k)

i=1 , respectively, where nt := n(1)d + n

(0)d and nu := n(1) + n(0). We can

approximate the expectation of a product of any function of X and T or U by the simple sample average using theabove datasets since we have

1

nt

nt∑i=1

rtitif(xti) =p(1)d

2n(1)d

n(1)d∑i=1

f(x(1)di )− p

(0)d

2n(0)d

n(0)d∑i=1

f(x(0)di ) ≈ 1

2E[D(1)f(X)]− 1

2E[D(0)f(X)],

1

nu

nu∑i=1

ruiuif(xi) =1

2n(1)

n(1)∑i=1

yif(x(1)i )− 1

2n(0)

n(0)∑i=1

yif(x(0)i ) ≈ 1

2E[Y (1)f(X)]− 1

2E[Y (0)f(X)].

Although an estimator of ν(X) can be obtained by any regression method using the above combined dataset, we needa little twist to obtain an estimator of the PSD π(X). See Appendix C for the direct estimation methods for the PSD.

4.2 Direct Least Squares Estimation

We can apply the direct least squares (DLS) method [62] originally proposed for ATE estimation under completecompliance. Motivated by the drawbacks of the separate estimation, the DLS directly estimates µ(X) only in onestep, which is advantageous not only in performance but also in computational efficiency.

The following theorem corresponds to Theorem 1 in [62].Theorem 2. Assume ν ∈ L2, where L2 :=

f : X 7→ R

∣∣E[f(X)2] <∞

in addition to Assumption 1 and 2.Furthermore, define Hf (X) := π(X)f(X) − ν(X). Then, µ is equal to the solution of the following least squaresproblem:

minf∈L2

E[Hf (X)2

]. (4)

Theorem 2 immediately follows from Theorem 1. In what follows, we show that the minimizer of the prob-lem (4) can be estimated without going through the estimation of the conditional mean functions ν and π. Since(Hf (X)− g(X))

2 ≥ 0 for any square integrable function g ∈ L2, we have Hf (X)2 ≥ 2Hf (X)g(X)− g(X)2 byexpanding the square. The equality holds at g(X) = Hf (X) for any f , which maximizes 2Hf (X)g(X) − g(X)2

with respect to g. Hence,

µ(X) = argminf∈L2

maxg∈L2

J(f, g), (5)

where J(f, g) := E[2Hf (X)g(X)− g(X)2

]. Rewriting J(f, g) yields

J(f, g) = 2E[Tf(X)g(X)]− 2E[Ug(X)]− E[g(X)2]. (6)

While the objective functional in (4) requires the conditional mean estimators, this form can be estimated based on thesample averages as

J(f, g) =2

nt

nt∑i=1

rtitif(xti)g(xti)−2

nu

nu∑i=1

ruiuig(xui)−1

nu

nu∑i=1

ruig(xui)2. (7)

We can implement the DLS estimation of µ with an arbitrary regularization term Ω in practice:

µdls(x) = argminf∈F

maxg∈G

J(f, g) + Ω(f, g),

where F and G are model classes for f and g, respectively.

Although any model can be trained by optimizing the model parameters to minimize the above objective functional, apractically useful choice is a linear-in-parameter model. Set F = fα : x 7→ α>φ(x)|α ∈ Rqf andG = gβ : x 7→β>ψ(x)|β ∈ Rqg, where φ andψ are qf - and qg-dimensional basis functions, respectively. Using the `2-regularizer,we have

J(f, g) + Ω(f, g) = 2α>Aβ − 2β>b− β>Cβ + λfα>α+ λgβ

>β,

where

A :=1

nt

nt∑i=1

rtitiφ(xti)ψ(xti)>, b :=

1

nu

nu∑i=1

ruiuiψ(xui), C :=1

nu

nu∑i=1

ruiψ(xui)ψ(xui)>,

6

Page 7: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

and λf and λg are some positive constants. The advantage of this formulation with the linear-in-parameter models and`2-regularizer is that we have an analytical solution. The solution to the inner maximization is given by

β = (C + λgIqg )−1(A>α− b),

where Iqg denotes the qg × qg identity matrix. We can then obtain the DLS estimator of α by substituting β andsolving the outer minimization:

αdls =A(C + λgIqg )−1A> + λfIqf

−1A(C + λgIdg )−1b.

For the model selection, we can choose a model that minimizes J(f , g) evaluated on a validation set. However, aspointed out in [62], one cannot tell if J(f , g) is small because f is a good solution to the outer minimization, or gis a poor solution to the inner maximization. For this reason, µdls(x) := α>dlsφ(x) can be unstable, and one cannotbe confident about which model is the best. The increased dimensionality of the search space because of the need tosimultaneously select models for f and g can also make the model selection based on J(f , g) challenging.

5 Proposed Method

We propose a novel estimator that enables simpler model selection than the DLS estimation while maintaining theessence of direct estimation as much as possible. We avoid the minimax formulation of the objective functional as in(5) by estimating µ as a solution to the weighted least squares problem derived from the original problem (4). It canbe constructed based on the sample averages from the separately observed samples like the DLS estimator, except weneed to estimate π(X) as a weight. We term the proposed estimator as the directly weighted least squares (DWLS)estimator since the pre-estimated PSD directly appears without inversion in the objective unlike the IPW estimators[60, 61, 50].

Consider the following weighted least squares problem:

minf∈L2

E

[w(X)

π(X)Hf (X)2

]=: Q0(f |w), (8)

where w(X) is some weight depending onX . Rewriting the objective yields

Q0(f |w) = E[Tw(X)f(X)2]− 2E [Uw(X)f(X)] + E[S], (9)

where S := w(X)π(X) ν(X)2 is a constant and can therefore be safely ignored in the optimization. Choosing w = π

clearly reduces the problem (8) to (4). Therefore, we can plug-in the pre-estimated π(X) as a weight in practice tofind a minimizer of the problem (4). However, π(X) may not be the proper weight since it takes a negative value whenP (Z(1) = 1|X) < P (Z(0) = 1|X). We can estimate Q(f |π) := Q0(f |π) − E[S] based on the sample averagesusing the separately observed samples as

Q(f |π) =1

nt

nt∑i=1

rtitiπ(xti)f(xti)2 − 2

nu

nu∑i=1

ruiuiπ(xui)f(xui).

Using the linear-in-parameter model for f and the `2-regularizer as in the previous section, we have

Q(f |π) + Ω(f) = α>Aα− 2α>b+ λfα>α,

where

A :=1

nt

nt∑i=1

rtitiπ(xti)φ(xti)φ(xti)>, b :=

1

nu

nu∑i=1

ruiuiπ(xui)φ(xui).

The analytical solution to minf∈F Q(f |π) + Ω(f) can be obtained as µdpw(x) = α>dpwφ(x), where

αdpw =(A+ λfIqf

)−1b.

Although the DWLS objective is no longer the MSE, Q(f |π) is still a sufficient measure for model selection becauseµdwls minimizing this objective is also a good estimator in terms of the true MSE as long as π is sufficiently accurate.Since the DWLS estimation involves only a single minimization, the model selection is easy to perform compared tothe DLS estimation.

7

Page 8: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

Remark on the objective formulation We can consider the following least squares problem instead of (8):

minf∈L2

E

[π(X)

w(X)(f(X)− µ(X))

2

]=: Q′0(f |w). (10)

This objective functional can also be evaluated without the conditional mean functions through a transformation similarto (9):

Q′0(f |w) = E

[Tf(X)2

w(X)

]− 2E

[Uf(X)

w(X)

]+ E[S′],

where S′ := π(X)w(X)µ(X)2. We refer to the estimator based on this objective as the inverse weighted least squares

(IWLS) estimator. This estimator tends to be imprecise when the PSD is extremely close to zero, which is also acommon issue among the IPW estimators [60, 61, 50]. Although the performance of such estimators can be improvedby trimming small probabilities, determining the optimal threshold is nontrivial [34]. Moreover, selecting the thresholdis a more complicated task in the case of the IWLS since the weight can take negative values. At points where the PSDis close to zero, the absolute value of the weights increases while the sign errors are more likely to occur. As a result,a small estimation error in the PSD tends to have a large impact on the IWLS. On the other hand, the impact of theestimation error of the PSD on the DWLS estimator is limited since the absolute value of the weights is small whena sign error occurs in the PSD estimation. The weight in the proposed method is confined to [−0.5, 0.5], whereas theweight in the IWLS is not bounded at all.

6 Performance Evaluation

We test the performance of the proposed method with synthetic and real-world datasets.

6.1 Synthetic Experiment

Setup The proposed and other methods explained in the previous sections were compared on synthetic data with thedifferent shapes of µ and the covariates’ dimension qx. The data generating process used for the experiments is asfollows:

Y ′0 = ς(1>qxX) + (0.2D1 + 0.1D0)× 1>qxX, Y ′1 = Y ′0 + h(X, D1, D0),

Y0 = Y ′0 + ε0, Y1 = Y ′1 + ε1, X ∼ N(0qx ,Σqx),

P (Z(1) = 1|X) = ς(1 + 0.2× 1>qxX), P (Z(0) = 1|X) = 0,

P (D1 = 1|X) = ς(γ + 4 + 1>qxX), P (D0 = 1|X) = ς(γ + 1>qxX),

where ς(v) = (1+e−v)−1, γ is an experimental parameter for controlling the ratio of the extreme PSD, 1qx is a qx×1vector with all elements equal to 1, 0qx is a qx × 1 zero vector and Σqx is a qx × qx positive semidefinite matrix withdiagonal elements equal to 1 and the absolute value of nondiagonal elements less than 0.5. ε0 and ε1 are mean zerorandom noise with var(εd) = 0.5 for d = 0, 1 and cov(ε0, ε1) = 0.2. We implemented the experiments with threedifferent shapes of the treatment effect function h:

hcon(X, D1, D0) = 0.2 + 0.3D1 + 0.1D0,

hlin(X, D1, D0) = (0.1 + 0.15D1 + 0.05D0)× 1>qxX,

hlog(X, D1, D0) = ς((1 + 0.2D1 + 0.1D0)× 1>qxX).

Note that µ(X) = h(X, 1, 0). We separately generated four sets of the training samples x(k)di

n(k)di=1 , (y(k)i ,x

(k)i )n(k)

i=1

with n(k)d = n(k) = 10000 or 50000 for k = 0, 1, validation sets with the same sample size as the training samplesand a test set with 10000 samples.

We compared the proposed method (DWLS) with the IWLS, the separate estimation (SEP) and the DLS. The Gaussiankernels centered at 100 randomly chosen training samples were used for the basis functions φ and ψ. We also usedthe same kernel for the kernel ridge regression to estimate ν. The estimation of π was performed using the directestimation method explained in Appendix C. The value of π was trimmed to 0.15 to prevent the weight from beingtoo large for the SEP and IWLS. This trimming was not performed for the DWLS. The bandwidth h of the Gaussiankernel and the regularization parameter λ were selected based on the model selection criterion corresponding to eachestimator on validation sets. We used optuna [3] for the hyperparameter tuning with 100 searches, and we set thesearch space for h and λ to [1, 10] and [10−5, 105], respectively.

8

Page 9: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

Shape n qx DWLS IWLS SEP DLS1 0.65 ± 1.54 5.04 ± 16.5 1.80 ± 2.38 0.15 ± 0.20

10K 5 0.40 ± 0.91 1.30 ± 3.21 3.46 ± 2.71 0.93 ± 1.17

hcon10 0.65 ± 1.00 4.96 ± 14.4 4.18 ± 3.81 1.81 ± 2.821 0.22 ± 0.46 53.5 ± 372 1.04 ± 0.93 0.08 ± 0.13

50K 5 0.06 ± 0.07 0.35 ± 0.88 2.13 ± 2.00 0.41 ± 0.6710 0.10 ± 0.11 1.22 ± 6.07 3.52 ± 1.92 0.41 ± 0.521 0.09 ± 0.15 0.51 ± 1.25 0.25 ± 0.30 0.39 ± 0.30

10K 5 0.14 ± 0.08 1.32 ± 6.72 0.58 ± 0.41 0.88 ± 0.52

hlin10 0.61 ± 0.27 4.73 ± 22.4 1.39 ± 0.65 1.53 ± 0.751 0.02 ± 0.03 6.79 ± 53.1 0.11 ± 0.11 0.36 ± 0.33

50K 5 0.04 ± 0.03 0.12 ± 0.16 0.36 ± 0.24 1.04 ± 0.5310 0.32 ± 0.10 11.9 ± 112 0.76 ± 0.26 1.61 ± 0.891 0.19 ± 0.26 0.72 ± 1.82 0.30 ± 0.31 0.32 ± 0.29

10K 5 0.29 ± 0.23 12.1 ± 113 0.58 ± 0.40 0.97 ± 0.56

hlog10 0.72 ± 0.31 2.76 ± 7.65 1.11 ± 0.49 1.40 ± 0.721 0.06 ± 0.08 0.30 ± 0.88 0.22 ± 0.16 0.32 ± 0.30

50K 5 0.10 ± 0.04 0.18 ± 0.13 0.56 ± 0.31 0.75 ± 0.6210 0.30 ± 0.10 55.3 ± 550 0.86 ± 0.28 1.20 ± 0.99

Table 1: The mean and standard deviation of the MSE over 100 trials with γ = 0.The results are multiplied by 100 (constant), 10 (linear) and 10 (logistic), respec-tively. The bold face denotes the best and comparative results according to the two-sided Wilcoxon signed-rank test at the significance level of 5%.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Figure 2: Relation between the squared error(y-axis) and the PSD (x-axis) when µ(X) islinear, qx = 5 and n = 10000. Each pointis coloured according to the spatial density.

Results The mean and standard deviation of the MSE over 100 trials are summarized in Table 1. We conductedthe Wilcoxon signed-rank test since the IWLS was highly unstable, which makes it difficult to use the t-test. Here,we report only the results with γ = 0. The experiments with different values of γ yielded similar findings, and theirresults can be found in Appendix D. Although we did not trim the PSD for the DWLS estimator, its performancewas sufficiently stable to outperform the other estimators in almost all the cases. The DWLS had larger errors thanthe DLS only when µ was constant and qx = 1. This may be because the benefit of direct estimation outweighedthe difficulties of hyperparameter tuning of the DLS since the constant µ with qx = 1 is the easiest to learn. Thisresult supports the effectiveness of our approach: avoiding the minimax formulation but without the inverse PSD as aweight. However, the IWLS estimator often worked terribly poorly, showing the instability of the IPW methods evenwith the weight trimming. The IWLS might be affected by the small PSD most severely since it uses the inverse PSDboth in the estimation and hyperparameter tuning. The performance of the DLS is not as poor, but it has a larger MSEthan SEP in many cases, indicating that hyperparameter tuning did not work as well as the DWLS. Figure 2 shows therelation between the squared error of each estimator (x-axis) and the PSD (y-axis) for the linear µ(X) when qx = 5and n = 10000. The area with a dense plot is colored in light. The squared error of the DWLS shown in Figure 2a iskept very small except it is slightly larger when the PSD is close to zero. This result demonstrates the robustness ofour method against the near-zero PSD, whereas the performance of the other estimators is affected more severely bythe small PSD. We can observe the instability of the IWLS again in Figure 2b, which shows the sporadic high errorsover the entire PSD. The squared error of the SEP displays the pattern similar to that of the IWLS. In Figure 2d, thelight-colored area is at a high position, indicating that the hyperparameter tuning of the DLS does not work well.

6.2 Real-Data Analysis

Setup We used the dataset from the National Job Training Partnership Act (JTPA) study1. This is one of the largestRCT datasets for job training evaluations in the US with approximately 20000 participants, and it has been used inthe several previous studies on causal inference [9, 2, 13]. We treat this dataset as if it was collected from the 1E2RDunder one-sided noncompliance since the noncompliance structure was almost one-sided in the JTPA study. Only2% of the participants in the control group received the job training treatment, whereas approximately 40% in thetreatment group did not receive the treatment. We excluded the participants with Z = 0 and D = 1 from our analysisto focus on the one-sided noncompliance structure. We used the samples with Z = 0, 1 as K = 1 and the sampleswith Z = 0 as K = 0. Note that, therefore, this sampling procedure is with replacement. Assumption 1.1 is satisfiedwith this sampling since Z was assigned completely at random in the JTPA study.

The outcome of interest here is the total earnings over 30 months after a random assignment. The covariates used inour analysis included gender, age, earnings in the previous year, a dummy for the black, a dummy for Hispanic, a

1https://www.upjohn.org/data-tools/employment-research-data-center/national-jtpa-study

9

Page 10: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

DR DWLS IWLS SEP DLSRMSE - 252 ± 141 222 ± 145 280 ± 119 946 ± 663Mean 1778 1528 1559 1538 937Max 1837 1838 1924 2200 7966Min 1648 1013 977 919 -1814

Table 2: The results for performance evaluation using the JTPA dataset. The bold face denotes the best and comparative resultsaccording to the two-sided Wilcoxon signed-rank test at the significance level of 5%.

dummy for high school graduates, a dummy for the married and a dummy for whether a participant worked for at least12 weeks in the 12 months preceding the random assignment. We excluded the samples with earnings in the previousyear larger than 90 percentile to stabilize the estimation. We applied the Yeo-Johnson transformation [63] to the ageand earnings in the previous year since they are non-negative and thus have a highly skewed distribution. All variableswere normalized to have zero mean and unit standard deviation.

We evaluated the performance of each estimator based on the root mean squared error (RMSE) computed using a100-fold cross-validation. We used the doubly robust (DR) estimator [42] trained with all joint samples as the pseudotrue LATE since we do not know the true treatment effect for a real-world dataset. For hyperparameter tuning, wecalculated the model selection criterion corresponding to each estimator by 10-fold cross-validation within the trainingsets. Weight trimming was not performed since π(X) was far from zero for allX .

Results The results are summarized in Table 2. We report the mean, max, and min of µ(x) in addition to the RMSEto see the behavior of each estimator. We conducted the Wilcoxon signed-rank test by using 100 results producedwith 100-fold cross-validation. The IWLS had the smallest RMSE, but the difference between the DWLS and IWLSwas not statistically significant. The mean, max, and min of the DWLS and IWLS also indicate that they have similarperformance. The stability of the IWLS in contrast to the synthetic experiment is because there were no samples witha small PSD. The relatively good performance of the SEP should be because of the same reason. The performance ofthe DLS estimator was highly unstable, reflecting the difficulty of the model selection based on the DLS objective (7).

7 Conclusion

We proposed a novel problem setting for LATE estimation by data combination. Our setting is plausible enough toapply to many real-world problems. The leading examples include O2O marketing and when there is non-randomattrition in panel data. We developed a practically useful method for estimating LATE as a function of covariatesfrom the separately observed datasets. Our method overcomes the issue of the existing method in model selection andhyperparameter tuning by avoiding the minimax formulation of the objective functional. Furthermore, our method canalso avoid the common problem of the IPW methods since the PSD directly appears in our estimator without inver-sion. Our method displayed the promising performance in the experiments on the synthetic and real-world datasets.However, there are sometimes concerns on the homogeneity of the populations which the separately observed samplescome from. Extending the method proposed in this study to account for heterogeneous samples and time-varyingpotential variables is an important future direction to further increase the usefulness of the method in practice.

References

[1] A. Abadie. Semiparametric instrumental variable estimation of treatment response models. Journal of Econo-metrics, 113(2):231–263, 2003.

[2] A. Abadie, J. Angrist, and G. Imbens. Instrumental variables estimates of the effect of subsidized training on thequantiles of trainee earnings. Econometrica, 70(1):91–117, 2002.

[3] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimizationframework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery andData Mining, page 2623–2631. Association for Computing Machinery, 2019.

[4] A. Alaa and M. van der Schaar. Limits of estimating heterogeneous treatment effects: Guidelines for practicalalgorithm design. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages129–138. PMLR, 2018.

[5] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables.Journal of the American Statistical Association, 91(434):444–455, 1996.

10

Page 11: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

[6] J. D. Angrist and A. B. Krueger. The effect of age at school entry on educational attainment: An applicationof instrumental variables with moments from two samples. Journal of the American Statistical Association,87(418):328–336, 1992.

[7] S. Athey and G. W. Imbens. Identification and inference in nonlinear difference-in-differences models. Econo-metrica, 74(2):431–497, 2006.

[8] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academyof Sciences of the United States of America, 113(27):7345–7352, 2016.

[9] H. S. Bloom, L. L. Orr, S. H. Bell, G. Cave, F. Doolittle, W. Lin, and J. M. Bos. The benefits and costs of JTPATitle II-A programs: Key findings from the national job training partnership act study. The Journal of HumanResources, 32(3):549–576, 1997.

[10] M. Buchinsky, F. Li, and Z. Liao. Estimation and inference of semiparametric models using data from severalsources. Journal of Econometrics, 2021.

[11] S. Cambanis, G. Simons, and W. Stout. Inequalities for Ek(X,Y) when the marginals are fixed. Zeitschrift furWahrscheinlichkeitstheorie und verwandte Gebiete, 36(4):285–294, 1976.

[12] J. Choi, J. Gu, and S. Shen. Weak-instrument robust inference for two-sample instrumental variables regression.Journal of Applied Econometrics, 33(1):109–125, 2018.

[13] S. G. Donald, Y.-C. Hsu, and R. P. Lieli. Testing the unconfoundedness assumption via inverse probabilityweighted estimators of (L)ATT. Journal of Business & Economic Statistics, 32(3):395–415, 2014.

[14] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Advancesin Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[15] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. InProceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1386–1394. PMLR,2015.

[16] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14thACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 213–220. Associationfor Computing Machinery, 2008.

[17] Y. Fan, R. Sherman, and M. Shum. Identifying treatment effects under data combination. Econometrica,82(2):811–822, 2014.

[18] Y. Fan, R. Sherman, and M. Shum. Estimation and inference in an ecological inference model. Journal ofEconometric Methods, 5(1):547, 2016.

[19] M. Frolich. Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econo-metrics, 139(1):35–75, 2007.

[20] M. Frolich and B. Melly. Identification of treatment effects on the treated with One-Sided Non-Compliance.Econometric Reviews, 32(3):384–414, 2013.

[21] M. Guell and L. Hu. Estimating the probability of leaving unemployment using uncompleted spells from repeatedcross-section data. Journal of Econometrics, 133(1):307–341, 2006.

[22] K. Hirano, G. W. Imbens, G. Ridder, and D. B. Rubin. Combining panel data sets with attrition and refreshmentsamples. Econometrica, 69(6):1645–1659, 2001.

[23] K. Hirano, G. W. Imbens, D. B. Rubin, and X. H. Zhou. Assessing the effect of an influenza vaccine in anencouragement design. Biostatistics, 1(1):69–88, 2000.

[24] P. W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960,1986.

[25] G. W. Imbens and J. D. Angrist. Identification and estimation of local average treatment effects. Econometrica,62(2):467–475, 1994.

[26] G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. CambridgeUniversity Press, 2015.

[27] A. Inoue and G. Solon. Two-sample instrumental variables estimators. The Review of Economics and Statistics,92(3):557–561, 2010.

[28] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. The Journalof Machine Learning Research, 10:1391–1445, 2009.

11

Page 12: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

[29] M. Kato and T. Teshima. Non-negative bregman divergence minimization for deep direct density ratio estimation.In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 5320–5333.PMLR, 2021.

[30] E. H. Kennedy. Efficient nonparametric causal inference with missing exposure information. The InternationalJournal of Biostatistics, 16(1), 2020.

[31] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk esti-mator. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[32] A. Klevmarken. Missing variables and two-stage least-squares estimation from more than one data set. Technicalreport, IUI Working Paper, 1982.

[33] M. J. Lebo and C. Weber. An effective approach to the repeated cross-sectional design. American Journal ofPolitical Science, 59(1):242–258, 2015.

[34] B. K. Lee, J. Lessler, and E. A. Stuart. Weight trimming and propensity score weighting. PloS one, 6(3):e18174,2011.

[35] R. J. Little and L. H. Y. Yau. Statistical techniques for analyzing data from prevention trials: Treatment ofno-shows using Rubin’s causal model. Psychological Methods, 3(2):147–159, 1998.

[36] S. Liu, A. Takeda, T. Suzuki, and K. Fukumizu. Trimmed density ratio estimation. In Advances in NeuralInformation Processing Systems, volume 30. Curran Associates, Inc., 2017.

[37] C. F. Manski. Partial identification of probability distributions. Springer Science & Business Media, 2003.

[38] F. Molinari. Microeconometrics with partial identification. Handbook of econometrics, 7:355–486, 2020.

[39] E. Moretti. Estimating the social return to higher education: Evidence from longitudinal and repeated cross-sectional data. Journal of Econometrics, 121(1):175–212, 2004.

[40] A. Nevo. Using weights to adjust for sample selection when auxiliary information is available. Journal ofBusiness & Economic Statistics, 21(1):43–52, 2003.

[41] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio byconvex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[42] E. L. Ogburn, A. Rotnitzky, and J. M. Robins. Doubly robust estimation of the local average treatment effectcurve. Journal of the Royal Statistical Society. Series B, Statistical methodology, 77(2):373, 2015.

[43] R. Okui, D. S. Small, Z. Tan, and J. M. Robins. Doubly robust instrumental variable regression. Statistica Sinica,22(1), 2012.

[44] P. Oreopoulos. Estimating average and local average treatment effects of education when compulsory schoolinglaws really matter. American Economic Review, 96(1):152–175, 2006.

[45] D. Pacini and F. Windmeijer. Robust inference for the two-sample 2SLS estimator. Economics letters, 146:50–54,2016.

[46] G. Ridder and R. Moffitt. The econometrics of data combination. In J. J. Heckman and E. E. Leamer, editors,Handbook of Econometrics, volume 6, pages 5469–5547. Elsevier, 2007.

[47] C. A. Rolling and Y. Yang. Model selection for estimating treatment effects. Journal of the Royal StatisticalSociety: Series B: Statistical Methodology, pages 749–769, 2014.

[48] D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal ofEducational Psychology, 66(5):688, 1974.

[49] Y. Saito and S. Yasui. Counterfactual cross-validation: Stable model selection procedure for causal inferencemodels. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 8398–8407. PMLR, 2020.

[50] S. R. Seaman and I. R. White. Review of inverse probability weighting for dealing with missing data. StatisticalMethods in Medical Research, 22(3):278–295, 2013.

[51] H. Shu and Z. Tan. Improved methods for moment restriction models with data combination and an applicationto two-sample instrumental variable estimation. Canadian Journal of Statistics, 48(2):259–284, 2020.

[52] C. Skovron and R. Titiunik. A practical guide to regression discontinuity designs in political science. AmericanJournal of Political Science, 2015:1–36, 2015.

[53] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio estimation in machine learning. Cambridge UniversityPress, 2012.

12

Page 13: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

[54] B. Sun and W. Miao. On semiparametric instrumental variable estimation of average treatment effects throughdata fusion. Statistica Sinica, 2022. Forthcoming.

[55] E. Tamer. Partial identification in econometrics. Annual Review of Economics, 2(1):167–195, 2010.[56] Z. Tan. Regression and weighting methods for causal inference using instrumental variables. Journal of the

American Statistical Association, 101(476):1607–1618, 2006.[57] H. R. Varian. Causal inference in economics and marketing. Proceedings of the National Academy of Sciences

of the United States of America, 113(27):7310–7315, 2016.[58] L. Wang, Y. Zhang, T. S. Richardson, and J. M. Robins. Estimation of local treatment effects under the binary

instrumental variable model. Biometrika, 2021.[59] L. Wood, M. Egger, L. L. Gluud, K. F. Schulz, P. Juni, D. G. Altman, C. Gluud, R. M. Martin, A. J. Wood,

and J. A. Sterne. Empirical evidence of bias in treatment effect estimates in controlled trials with differentinterventions and outcomes: Meta-epidemiological study. BMJ, 336(7644):601–605, 2008.

[60] J. M. Wooldridge. Inverse probability weighted M-estimators for sample selection, attrition, and stratification.Portuguese Economic Journal, 1(2):117–139, 2002.

[61] J. M. Wooldridge. Inverse probability weighted estimation for general missing data problems. Journal of Econo-metrics, 141(2):1281–1301, 2007.

[62] I. Yamane, F. Yger, J. Atif, and M. Sugiyama. Uplift modeling from separate labels. In Advances in NeuralInformation Processing Systems, volume 31. Curran Associates, Inc., 2018.

[63] I. K. Yeo and R. A. Johnson. A new family of power transformations to improve normality or symmetry.Biometrika, 87(4):954–959, 2000.

[64] Q. Zhao, J. Wang, W. Spiller, J. Bowden, and D. S. Small. Two-Sample instrumental variable analyses usingheterogeneous samples. Statistical Science, 34(2):317–333, 2019.

13

Page 14: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

A Proof of Theorem 1

Proof. By Assumption 1.1, 2.1 and 2.2, we have

E[D(1)|X]− E[D(0)|X] = E[Z(1)D1 + (1− Z(1))D0|X]− E[Z(0)D1 + (1− Z(0))D0|X]

= E[(Z(1) − Z(0))(D1 −D0)|X]

= E[Z(1) − Z(0)|X]E[D1 −D0|X]. (11)

Similarly, we can show

E[Y (1)|X]− E[Y (0)|X] = E[D(1)Y1 + (1−D(1))Y0|X]− E[D(0)Y1 + (1−D(0))Y0|X]

= E[(D(1) −D(0))(Y1 − Y0)|X]

= E[(Z(1) − Z(0))(D1 −D0)(Y1 − Y0)|X]

= E[Z(1) − Z(0)|X]E[(D1 −D0)(Y1 − Y0)|X]

= E[Z(1) − Z(0)|X]P (D1 −D0 = 1|X)E[Y1 − Y0|D1 −D0 = 1X]. (12)

It holds that E[D1 − D0|X] = P (D1 − D0 = 1|X) since D1 − D0 ∈ 0, 1 by Assumption 2.3. Moreover,D1−D0 = 1 impliesD1 > D0 sinceD1 andD0 are binary. Note thatE[Z(1)−Z(0)|X] 6= 0 andE[D1−D0|X] 6= 0by Assumption 1.2 and 2.4, respectively. Combining these facts with (11) and (12) yields

E[Y (1)|X]− E[Y (0)|X]

E[D(1)|X]− E[D(0)|X]=E[Z(1) − Z(0)|X]E[D1 −D0|X]E[Y1 − Y0|D1 −D0 = 1X]

E[Z(1) − Z(0)|X]E[D1 −D0|X]

= E[Y1 − Y0|D1 −D0 = 1,X]

= E[Y1 − Y0|D1 > D0,X]

= µ(X).

B Direct Estimation of PSD

We cannot employ ordinary regression methods on (ti,xti, rti)nti=1 to obtain an estimator of π(X) since it does

not follow P (T |X) even when n(1)d = n(0)d . We have to develop an estimator which incorporates the covariates

of (ui,xui, rui)nui=1. Moreover, an estimator of π(X) should satisfy the constraint π(X) ∈ [−0.5, 0.5]. In the

following, we develop a direct estimation method which is basically an application of the least squares method forestimating the density ratio [28, 53].

In order to develop an estimator satisfying the constraint, consider the least squares estimation of π(X) + 0.5 ∈ [0, 1]by minimizing the following squared error:

E[(f(X)− π(X)− 0.5)2] = E[f(X)2]− 2E[Tf(X)]− E[f(X)] + E[(π(X) + 0.5)2].

The last term can be safely ignored in the optimization. We can approximate the first three terms as

1

nu

nu∑i=1

ruif(xui)2 − 2

nt

nt∑i=1

rtitif(xti)−1

nu

nu∑i=1

ruif(xui).

Using the linear-in-parameter model and the `2-regularizer, we have an analytical solution α>0.5+φ(x), where

α0.5+ =

(1

nu

nu∑i=1

ruiφ(xui)φ(xui)> + λfIqf

)−1(1

nt

nt∑i=1

rtitiφ(xti) +1

2nu

nu∑i=1

ruiφ(xui)

).

When we use a non-negative basis function such as the Gaussian kernel, we can modify α0.5+ as α0.5+ :=maxα0.5+,0qf to force α>0.5+φ(x) to be non-negative as well. Here, the max operator extracts larger val-ues element-wise. Similarly, we can construct an estimator of −π(X) + 0.5 ∈ [0, 1] as α>0.5−φ(x), whereα0.5− := maxα0.5−,0qf and

α0.5− =

(1

nu

nu∑i=1

ruiφ(xui)φ(xui)> + λfIqf

)−1(− 1

nt

nt∑i=1

rtitiφ(xti) +1

2nu

nu∑i=1

ruiφ(xui)

).

14

Page 15: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

Finally, we normalize an estimate of π(X) + 0.5 at a given point x as

(π + 0.5)(x) =α>0.5+φ(x)

(α>0.5+ + α>0.5−)φ(x),

since (π(X) + 0.5) + (−π(X) + 0.5) = 1. We can ensure (π + 0.5)(x) ≤ 1 for any x by this normalization.Therefore, we can obtain an estimator that respects the constraint by setting π(x) := (π + 0.5)(x)− 0.5.

Estimation in 1E2RD We can further tighten the restriction to π(X) ∈ [0, 0.5] in the 1E2RD since the PSD isnon-negative by Assumption 2.3 when P (Z(0) = 1|X) = 0. In this case, we can obtain the non-negative estimator ofπ(X) as α>πφ(x), where απ := maxαπ,0qf and

απ =

(1

nu

nu∑i=1

ruiφ(xui)φ(xui)> + λfIqf

)−1(1

nt

nt∑i=1

rtitiφ(xti)

).

Then, we can normalize the estimator so that it takes values no larger than 0.5 as

π(x) =α>πφ(x)

2(α>π + α>0.5−)φ(x),

since π(X) + (−π(X) + 0.5) = 0.5.

C Estimation of µ(X) when P (X(1)) 6= P (X(0)) 6= P (X)

We can identify µ(X) as the following even if we replace Assumption 1.1 by its conditional version (2):

µ(x) =E[Y (1)|X(1) = x]− E[Y (0)|X(0) = x]

E[D(1)|X(1) = x]− E[D(0)|X(0) = x].

However, directly applying the methods in the main text may result in large bias since P (X(1)) 6= P (X(0)) 6= P (X)in general. We need density ratios to adjust a distribution which the expectation is taken over. Unfortunately, we cannot

estimate µ(X) on P (X) with our data collection scheme, where we have x(k)di

n(k)di=1

iid∼ P (X(k)|D(k) = 1) and p(k)d .In this case, we usually do not have information on P (X|D(k) = 1), thus impossible to estimate the density-ratioP (X|D(k)=1)P (X(k)|D(k)=1)

, which is necessary for approximating the expectation over P (X|D(k) = 1) using x(k)di

n(k)di=1 .

If we can collect (d(k)i ,x(k)di )n

(k)di=1

iid∼ P (D(k),X(k)) instead of the pair x(k)di

n(k)di=1 and p(k)d , we can modify the

combined datasets explained in Section 4.1 to

(ti,xti, rti)nti=1 :=

((−1)1−kd

(k)i ,x

(k)di ,

n(1)d + n

(0)d

2n(k)d

ρk(x(k)di )

)n(k)d ,1

i=1,k=0

,

(ui,xui, rui)nui=1 :=

((−1)1−ky

(k)i ,x

(k)i ,

n(1) + n(0)

2n(k)ρk(x

(k)i )

)n(k),1

i=1,k=0

,

where ρk(X) := P (X)P (X(k))

for k = 0, 1, and estimate µ(X) on P (X) with the methods in the main text using theabove datasets. Various methods can be employed to estimate density ratios [53, 41, 36, 29].

Relation to the Standard Setting Consider the estimation of µ(X) using separate samples drawn from P (Y,Z,X)and P (D,Z,X) under the standard assumptions [1, 19, 56, 42]. Recalling that µ(X) can be identified as

µ(X) =E[Y |Z = 1,X]− E[Y |Z = 0,X]

E[D|Z = 1,X]− E[D|Z = 0,X],

we can see that it is essentially the same problem as our setting with the condition (2). We need P (X)P (X|Z=z) for z = 0, 1

in this case.

15

Page 16: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

D Additional Experimental Results

D.1 Relation between the Squared Error and the PSD

We present the graphs of the relation between the squared error and the PSD for all the cases. Similar to Figure 2,these figures show that the performance deterioration of the DWLS at points with the small PSD is limited comparedto the other estimators.

D.2 Sensitivity Analysis

In addition to the synthetic experiment in the main text, we conducted the experiments with γ = −1, 1 to investigate thechange in the performance of each estimator according to the ratio of the small PSD. Figure 21 shows the histogramsof the PSD with different values of γ. It is clear that the proportion of the small PSD increases as γ increases.

Table 3 and 4 presents the results with γ = 1,−1, respectively. In both cases, we obtained the results similar to thecase with γ = 0: the DWLS had the smallest MSE for all the cases except the constant µ with qx = 1. When γ = 1,the relative advantage of the DWLS over the other estimators decreased for all the cases. Since the DWLS directlyuses the PSD, the efficiency of the DWLS estimator declines as the proportion of the small PSD increases. In thiscase, the performance of the DWLS could be improved by the weight trimming so that samples with the near-zeroPSD have at least a certain impact on estimation. When γ = −1, the MSE decreased overall. The variance of theIWLS got smaller than in the case with the other value of γ, but it is still unstable compared to the other methods.

16

Page 17: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 3: Constant, qx = 1, n =10000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

Figure 4: Linear, qx = 1, n = 10000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

Figure 5: Logistic, qx = 1, n =10000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 6: Constant, qx = 5, n =10000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Figure 7: Linear, qx = 5, n = 10000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 8: Logistic, qx = 5, n =10000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 9: Constant, qx = 10, n =10000.

0.2 0.4a. DWLS

01234567

0.2 0.4b. IWLS

01234567

0.2 0.4c. SEP

01234567

0.2 0.4d. DLS

01234567

Figure 10: Linear, qx = 10, n =10000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Figure 11: Logistic, qx = 10, n =10000.

17

Page 18: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 12: Constant, qx = 1, n =50000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

Figure 13: Linear, qx = 1, n = 50000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 14: Logistic, qx = 1, n =50000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 15: Constant, qx = 5, n =50000.

0.2 0.4a. DWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4b. IWLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4c. SEP

0.0

0.5

1.0

1.5

2.0

2.5

3.0

0.2 0.4d. DLS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Figure 16: Linear, qx = 5, n = 50000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 17: Logistic, qx = 5, n =50000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 18: Constant, qx = 10, n =50000.

0.2 0.4a. DWLS

0

1

2

3

4

5

0.2 0.4b. IWLS

0

1

2

3

4

5

0.2 0.4c. SEP

0

1

2

3

4

5

0.2 0.4d. DLS

0

1

2

3

4

5

Figure 19: Linear, qx = 10, n =50000.

0.2 0.4a. DWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4b. IWLS

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4c. SEP

0.0

0.2

0.4

0.6

0.8

1.0

0.2 0.4d. DLS

0.0

0.2

0.4

0.6

0.8

1.0

Figure 20: Logistic, qx = 10, n =50000.

18

Page 19: arXiv:2109.05175v1 [stat.ML] 11 Sep 2021

Estimation of Local Average Treatment Effect by Data Combination

0.0 0.1 0.2 0.3 0.4 0.50

500

1000

1500

2000

(a) γ = −1

0.0 0.1 0.2 0.3 0.4 0.50

500

1000

1500

2000

(b) γ = 0

0.0 0.1 0.2 0.3 0.4 0.50

200

400

600

800

1000

1200

1400

1600

(c) γ = 1

Figure 21: Histograms of the PSD.

Shape n qx DWLS IWLS SEP DLS1 0.11 ± 0.32 404 ± 3482 0.55 ± 0.23 0.10 ± 0.46

10K 5 0.10 ± 0.23 0.79 ± 5.09 0.70 ± 0.32 0.23 ± 0.30

hcon10 0.09 ± 0.19 30.5 ± 225 0.71 ± 0.36 0.28 ± 0.301 0.05 ± 0.14 85.4 ± 623 0.56 ± 0.11 0.02 ± 0.04

50K 5 0.01 ± 0.02 0.07 ± 0.17 0.63 ± 0.13 0.05 ± 0.1010 0.02 ± 0.02 158 ± 1551 0.66 ± 0.19 0.11 ± 0.271 0.02 ± 0.08 578 ± 5723 0.03 ± 0.02 0.04 ± 0.03

10K 5 0.05 ± 0.05 0.24 ± 0.90 0.08 ± 0.04 0.14 ± 0.06

hlin10 0.15 ± 0.29 3.23 ± 19.8 0.15 ± 0.05 0.20 ± 0.091 0.01 ± 0.02 10.3 ± 50.2 0.02 ± 0.01 0.05 ± 0.04

50K 5 0.01 ± 0.01 0.03 ± 0.04 0.06 ± 0.02 0.13 ± 0.0610 0.05 ± 0.03 0.60 ± 2.94 0.10 ± 0.02 0.18 ± 0.101 0.04 ± 0.08 68.4 ± 573 0.13 ± 0.03 0.05 ± 0.04

10K 5 0.10 ± 0.08 0.52 ± 3.58 0.18 ± 0.04 0.20 ± 0.07

hlog10 0.17 ± 0.18 9.18 ± 63.1 0.23 ± 0.04 0.25 ± 0.071 0.02 ± 0.03 17.2 ± 94.5 0.13 ± 0.02 0.05 ± 0.04

50K 5 0.03 ± 0.02 0.05 ± 0.07 0.17 ± 0.02 0.12 ± 0.0710 0.08 ± 0.03 0.32 ± 2.12 0.21 ± 0.03 0.17 ± 0.08

Table 3: The mean and standard deviation of the MSE over 100 trials with γ = 1. The results are multiplied by 10 (constant), 1(linear) and 1 (logistic), respectively. The bold face denotes the best and comparative results according to the two-sided Wilcoxonsigned-rank test at the significance level of 5%.

Shape n qx DWLS IWLS SEP DLS1 0.30 ± 0.59 1.19 ± 3.72 0.76 ± 0.88 0.14 ± 0.30

10K 5 0.24 ± 0.29 0.56 ± 1.02 1.97 ± 1.89 0.66 ± 1.12

hcon10 0.43 ± 0.52 2280 ± 22666 2.18 ± 1.92 0.88 ± 1.081 0.11 ± 0.27 0.46 ± 1.33 0.40 ± 0.34 0.06 ± 0.11

50K 5 0.05 ± 0.07 0.16 ± 0.28 1.74 ± 0.96 0.18 ± 0.2610 0.09 ± 0.21 0.39 ± 1.13 2.06 ± 0.96 0.36 ± 0.461 0.06 ± 0.17 0.19 ± 0.44 0.10 ± 0.11 0.40 ± 0.29

10K 5 0.11 ± 0.09 0.26 ± 0.20 0.29 ± 0.17 0.75 ± 0.50

hlin10 0.50 ± 0.17 18.8 ± 178 1.02 ± 0.37 1.19 ± 0.701 0.01 ± 0.02 0.05 ± 0.15 0.05 ± 0.05 0.35 ± 0.31

50K 5 0.03 ± 0.02 0.06 ± 0.06 0.18 ± 0.12 0.90 ± 0.5210 0.25 ± 0.06 0.32 ± 0.24 0.63 ± 0.18 1.36 ± 0.861 0.07 ± 0.08 0.26 ± 0.60 0.15 ± 0.13 0.28 ± 0.29

10K 5 0.17 ± 0.12 0.35 ± 0.26 0.50 ± 0.28 0.56 ± 0.33

hlog10 0.47 ± 0.18 321 ± 3187 0.87 ± 0.33 0.85 ± 0.601 0.03 ± 0.04 0.07 ± 0.16 0.09 ± 0.06 0.32 ± 0.40

50K 5 0.06 ± 0.01 0.11 ± 0.06 0.35 ± 0.15 0.70 ± 0.4510 0.21 ± 0.05 0.32 ± 0.14 0.57 ± 0.16 0.78 ± 0.77

Table 4: The mean and standard deviation of the MSE over 100 trials with γ = −1. The results are multiplied by 100 (constant), 10(linear) and 10 (logistic), respectively. The bold face denotes the best and comparative results according to the two-sided Wilcoxonsigned-rank test at the significance level of 5%.

19