TRANSCRIPT
Machine Learning for Healthcare, HST.956, 6.S897
Lecture 15: Causal Inference, Part 2
David Sontag
Acknowledgement: adapted from slides by Uri Shalit (Technion)
Reminder: Potential Outcomes
• Each unit (individual) $x_i$ has two potential outcomes:
– $Y_0(x_i)$ is the potential outcome had the unit not been treated: "control outcome"
– $Y_1(x_i)$ is the potential outcome had the unit been treated: "treated outcome"
• Conditional average treatment effect for unit $i$: $CATE(x_i) = \mathbb{E}_{Y_1 \sim p(Y_1 \mid x_i)}[Y_1 \mid x_i] - \mathbb{E}_{Y_0 \sim p(Y_0 \mid x_i)}[Y_0 \mid x_i]$
• Average Treatment Effect: $ATE = \mathbb{E}_{x \sim p(x)}[CATE(x)]$
Two common approaches for counterfactual inference
• Covariate adjustment
• Propensity scores
[Figure: the covariates (features) $x_1, x_2, \ldots$ and the treatment $T$ are fed to a regression model $f(x, T)$, which outputs the outcome $y$.]
Covariate adjustment (reminder)
• Explicitly model the relationship between treatment, confounders, and outcome
• Under ignorability, $CATE(x) = \mathbb{E}[Y_1 \mid T = 1, x] - \mathbb{E}[Y_0 \mid T = 0, x]$, and $ATE = \mathbb{E}_{x \sim p(x)}[CATE(x)]$
• Fit a model $f(x, t) \approx \mathbb{E}[Y_t \mid T = t, x]$, then: $\widehat{CATE}(x_i) = f(x_i, 1) - f(x_i, 0)$.
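The recipe above (fit $f(x, t)$, then take $f(x_i, 1) - f(x_i, 0)$) can be sketched on a toy dataset. This is a minimal illustration, not the lecture's implementation: the data are hypothetical, the covariate is binary, and the "model" is simply the mean outcome within each $(x, t)$ cell, chosen so the example needs no libraries.

```python
from collections import defaultdict

# Hypothetical toy data: (x, t, y) with y = 10*x + 2*t, so the true effect is 2.
# Treatment is confounded: units with x = 1 are treated more often.
data = [(0, 1, 2.0), (0, 0, 0.0), (0, 0, 0.0), (0, 0, 0.0),
        (1, 1, 12.0), (1, 1, 12.0), (1, 1, 12.0), (1, 0, 10.0)]

def fit_stratified_means(rows):
    """Fit f(x, t) as the mean outcome among units sharing the same (x, t)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, t, y in rows:
        sums[(x, t)] += y
        counts[(x, t)] += 1
    return lambda x, t: sums[(x, t)] / counts[(x, t)]

f = fit_stratified_means(data)

# CATE estimate for every unit: f(x_i, 1) - f(x_i, 0), then average for the ATE.
cate_hat = [f(x, 1) - f(x, 0) for x, _, _ in data]
ate_hat = sum(cate_hat) / len(cate_hat)

# Naive comparison of group means, which ignores the confounder x.
treated = [y for _, t, y in data if t == 1]
control = [y for _, t, y in data if t == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)

print(ate_hat, naive)  # 2.0 7.0
```

On this toy data the adjusted estimate recovers the true effect of 2, while the unadjusted difference in group means is 7 because treated units mostly have $x = 1$.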
Covariate adjustment with linear models
• Assume that: $Y_t(x) = \beta x + \gamma \cdot t + \epsilon_t$, with $\mathbb{E}[\epsilon_t] = 0$
[Figure: causal graph over age ($x$), medication ($t$), and blood pressure ($y$).]
• Then: $CATE(x) := \mathbb{E}[Y_1(x) - Y_0(x)] = \mathbb{E}[(\beta x + \gamma + \epsilon_1) - (\beta x + \epsilon_0)] = \gamma$
• $ATE := \mathbb{E}_{p(x)}[CATE(x)] = \gamma$
• For causal inference, need to estimate $\gamma$ well, not $Y_t(x)$: identification, not prediction
• Major difference between ML and statistics
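As a rough illustration of estimating $\gamma$ under the linear assumption, the sketch below simulates the age / medication / blood pressure setting and solves the two-variable least-squares normal equations by hand. The coefficients, the age range, the treatment rule, and the noise level are all hypothetical choices made for this example.

```python
import random

rng = random.Random(0)
beta, gamma = 0.5, -5.0  # hypothetical true coefficients

rows = []
for _ in range(5000):
    x = rng.uniform(30, 80)                        # age
    t = 1 if rng.random() < (x - 20) / 80 else 0   # older patients are treated more often
    y = beta * x + gamma * t + rng.gauss(0, 1)     # blood pressure (arbitrary units)
    rows.append((x, t, y))

# Normal equations for y ~ b*x + g*t (no intercept, matching the assumed model).
sxx = sum(x * x for x, t, y in rows)
sxt = sum(x * t for x, t, y in rows)
stt = sum(t * t for x, t, y in rows)
sxy = sum(x * y for x, t, y in rows)
sty = sum(t * y for x, t, y in rows)
det = sxx * stt - sxt * sxt
beta_hat = (stt * sxy - sxt * sty) / det
gamma_hat = (sxx * sty - sxt * sxy) / det

# Unadjusted difference in group means, for comparison.
treated = [y for x, t, y in rows if t == 1]
control = [y for x, t, y in rows if t == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)

print(gamma_hat, naive)
```

Because age drives both treatment and outcome here, the unadjusted difference lands far from $\gamma$, whereas the coefficient on $t$ should come out close to it.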
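Covariate adjustment with linear models: what happens if the true model is not linear?
• True data generating process, $x \in \mathbb{R}$: $Y_t(x) = \beta x + \gamma \cdot t + \delta \cdot x^2$, so $ATE = \mathbb{E}[Y_1 - Y_0] = \gamma$
• Hypothesized model: $\hat{Y}_t(x) = \hat{\beta} x + \hat{\gamma} \cdot t$
• The least-squares fit of the hypothesized model then gives: $\hat{\gamma} = \gamma + \delta \, \frac{\mathbb{E}[xt]\,\mathbb{E}[x^3] - \mathbb{E}[x^2]\,\mathbb{E}[x^2 t]}{\mathbb{E}[xt]^2 - \mathbb{E}[x^2]\,\mathbb{E}[t^2]}$
• Depending on $\delta$, the error in $\hat{\gamma}$ can be made arbitrarily large or small!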
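A small numerical check of this point: the sketch below generates noise-free outcomes from the quadratic model, fits the misspecified linear model by least squares, and prints $\hat{\gamma}$ for several values of $\delta$. The six data points and the coefficients are hypothetical.

```python
def fit_gamma(xs, ts, ys):
    """Least-squares gamma_hat for the model y ~ b*x + g*t (no intercept)."""
    sxx = sum(x * x for x in xs)
    sxt = sum(x * t for x, t in zip(xs, ts))
    stt = sum(t * t for t in ts)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sty = sum(t * y for t, y in zip(ts, ys))
    return (sxx * sty - sxt * sxy) / (sxx * stt - sxt * sxt)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ts = [0, 0, 1, 0, 1, 1]   # treatment is more common at larger x
beta, gamma = 1.0, 2.0    # hypothetical true coefficients

def gamma_hat_for(delta):
    """Fit the linear model to outcomes generated with a delta * x^2 term."""
    ys = [beta * x + gamma * t + delta * x * x for x, t in zip(xs, ts)]
    return fit_gamma(xs, ts, ys)

for delta in (0.0, 1.0, -1.0, 10.0):
    print(delta, gamma_hat_for(delta))
```

With $\delta = 0$ the fit returns the true $\gamma = 2$; as $\delta$ grows in either direction the estimate moves away from it linearly, even though there is no noise in the data.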
Covariate adjustment with non-linear models
• Random forests and Bayesian trees: Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)
• Gaussian processes: Hoyer et al. (2009), Zigler et al. (2012)
• Neural networks: Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)
Example: Gaussian processes
[Figure: two scatter plots of outcome $y$ against covariate $x$ for treated and control units, each with fitted curves $Y_1(x)$ and $Y_0(x)$. Panel "GP-Independent": separate treated and control models. Panel "GP-Grouped": joint treated and control model. Figures: Vincent Dorie & Jennifer Hill]
Example: Neural networks
Shalit, Johansson, Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. ICML, 2017
[Figure: neural network layers map the covariates to a shared representation $\Phi$, from which the potential outcomes are predicted; the diagram also labels the intervention, the outcome, and the learning objective.]
Matching
• Find each unit's long-lost counterfactual identical twin, check up on his outcome
[Figure: Obama, had he gone to law school, next to Obama, had he gone to business school.]
• Used for estimating both ATE and CATE
Match to nearest neighbor from opposite group
[Figure: treated and control units plotted by age and Charlson comorbidity index, each matched to its nearest neighbor from the opposite group.]
1-NN Matching
• Let $d(\cdot, \cdot)$ be a metric between $x$'s
• For each $i$, define $j(i) = \operatorname{argmin}_{j \text{ s.t. } t_j \neq t_i} d(x_j, x_i)$; $j(i)$ is the nearest counterfactual neighbor of $i$
• $t_i = 1$, unit $i$ is treated: $\widehat{CATE}(x_i) = y_i - y_{j(i)}$
• $t_i = 0$, unit $i$ is control: $\widehat{CATE}(x_i) = y_{j(i)} - y_i$
• Both cases in one formula: $\widehat{CATE}(x_i) = (2 t_i - 1)(y_i - y_{j(i)})$
• $\widehat{ATE} = \frac{1}{n} \sum_{i=1}^{n} \widehat{CATE}(x_i)$
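The 1-NN matching estimator can be sketched directly from these definitions. The five units below are hypothetical (covariates are age and a comorbidity index), and the Euclidean distance from `math.dist` is an assumed choice of metric $d$.

```python
import math

# Hypothetical units: ((age, comorbidity index), treatment, outcome).
units = [((50, 2), 1, 12.0), ((60, 3), 1, 15.0),
         ((52, 2), 0, 10.0), ((61, 3), 0, 12.0), ((30, 0), 0, 5.0)]

def nearest_counterfactual(i, units, d=math.dist):
    """Index j(i) of the closest unit whose treatment differs from unit i's."""
    x_i, t_i, _ = units[i]
    candidates = [j for j, (_, t_j, _) in enumerate(units) if t_j != t_i]
    return min(candidates, key=lambda j: d(units[j][0], x_i))

def matching_cate(i, units):
    """CATE estimate (2*t_i - 1) * (y_i - y_j(i)) for unit i."""
    _, t_i, y_i = units[i]
    y_j = units[nearest_counterfactual(i, units)][2]
    return (2 * t_i - 1) * (y_i - y_j)

cates = [matching_cate(i, units) for i in range(len(units))]
ate_hat = sum(cates) / len(cates)

print(cates)    # [2.0, 3.0, 2.0, 3.0, 7.0]
print(ate_hat)  # 3.4
```

The last control unit (age 30) has no nearby treated unit, so it is matched to a 50-year-old and contributes the largest estimate; the result depends heavily on the metric and on how well the two groups overlap.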
Matching
• Interpretable, especially in small-sample regime
• Nonparametric
• Heavily reliant on the underlying metric
• Could be misled by features which don't affect the outcome
Covariate adjustment and matching
• Matching is equivalent to covariate adjustment with two 1-nearest neighbor classifiers: $\hat{Y}_1(x) = y_{NN_1(x)}$, $\hat{Y}_0(x) = y_{NN_0(x)}$, where $y_{NN_t(x)}$ is the outcome of the nearest neighbor of $x$ among units with treatment assignment $t \in \{0, 1\}$
• 1-NN matching is in general inconsistent, though only with small bias (Imbens 2004)
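The equivalence can be checked on a small example. In the sketch below, `nn_outcome` plays the role of the 1-nearest-neighbor predictor $\hat{Y}_t(x)$; the data and the Euclidean metric are hypothetical choices, and the check assumes no two units share identical covariates.

```python
import math

# Hypothetical units: (covariates, treatment, outcome).
units = [((50, 2), 1, 12.0), ((60, 3), 1, 15.0),
         ((52, 2), 0, 10.0), ((61, 3), 0, 12.0), ((30, 0), 0, 5.0)]

def nn_outcome(x, t, units):
    """1-NN prediction of Y_t(x): outcome of the closest unit with treatment t."""
    group = [(math.dist(x_k, x), y_k) for x_k, t_k, y_k in units if t_k == t]
    return min(group)[1]

# Covariate adjustment with the two 1-NN predictors.
cate_adjust = [nn_outcome(x, 1, units) - nn_outcome(x, 0, units)
               for x, _, _ in units]

# Matching: compare each unit's own outcome with its nearest opposite-group neighbor.
cate_match = [(2 * t - 1) * (y - nn_outcome(x, 1 - t, units))
              for x, t, y in units]

print(cate_adjust == cate_match)  # True
```

For a unit in the sample, its nearest neighbor within its own treatment group is itself, so the 1-NN prediction for the factual outcome is just $y_i$, which is why the two computations coincide.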
Two common approaches for counterfactual inference
• Covariate adjustment
• Propensity scores
Propensity scores
• Tool for estimating ATE
• Basic idea: turn observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling
Inverse propensity score re-weighting
[Figure: treated and control units plotted by $x_1$ = age and $x_2$ = Charlson comorbidity index.]
• The control and treated covariate distributions differ: $p(x \mid t = 0) \neq p(x \mid t = 1)$
• After re-weighting they match: $p(x \mid t = 0) \cdot w_0(x) \approx p(x \mid t = 1) \cdot w_1(x)$ (reweighted control vs. reweighted treated)
• Propensity score: $p(T = 1 \mid x)$, estimated using machine learning tools
• Samples re-weighted by the inverse propensity score of the treatment they received
Propensity scores: algorithm (inverse probability of treatment weighted estimator)
How to calculate ATE with propensity score for sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:
1. Use any ML method to estimate $\hat{p}(T = t \mid x)$
2. $\widehat{ATE} = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)} - \frac{1}{n} \sum_{i \text{ s.t. } t_i = 0} \frac{y_i}{\hat{p}(t_i = 0 \mid x_i)}$
Special case: a randomized trial, where $p(T = t \mid x) = 0.5$:
$\widehat{ATE} = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{0.5} - \frac{1}{n} \sum_{i \text{ s.t. } t_i = 0} \frac{y_i}{0.5} = \frac{2}{n} \sum_{i \text{ s.t. } t_i = 1} y_i - \frac{2}{n} \sum_{i \text{ s.t. } t_i = 0} y_i$
Each sum is over roughly $n/2$ terms, so this is approximately the difference between the treated and control sample means.
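The two steps of the estimator can be sketched as follows. The data are hypothetical, and step 1 ("any ML method") is replaced by the simplest possible estimate for a discrete covariate: the treated fraction within each value of $x$.

```python
from collections import Counter

# Hypothetical toy data: (x, t, y) with y = 10*x + 2*t, so the true ATE is 2.
data = [(0, 1, 2.0), (0, 0, 0.0), (0, 0, 0.0), (0, 0, 0.0),
        (1, 1, 12.0), (1, 1, 12.0), (1, 1, 12.0), (1, 0, 10.0)]

def fit_propensity(rows):
    """Step 1: estimate p(T = 1 | x) by the treated fraction within each x."""
    n_x = Counter(x for x, _, _ in rows)
    n_treated = Counter(x for x, t, _ in rows if t == 1)
    return lambda x: n_treated[x] / n_x[x]

def ipw_ate(rows, p_treated):
    """Step 2: inverse probability of treatment weighted estimate of the ATE."""
    n = len(rows)
    treated_sum = sum(y / p_treated(x) for x, t, y in rows if t == 1)
    control_sum = sum(y / (1 - p_treated(x)) for x, t, y in rows if t == 0)
    return treated_sum / n - control_sum / n

p_hat = fit_propensity(data)
ate_ipw = ipw_ate(data, p_hat)

# Pretending the data came from a randomized trial (p = 0.5 for everyone).
ate_if_randomized = ipw_ate(data, lambda x: 0.5)

print(ate_ipw, ate_if_randomized)  # 2.0 7.0
```

With the estimated propensities the weighted estimate recovers the true effect of 2. Plugging in 0.5 reduces the estimator to a difference of group means, which is only appropriate when treatment really was assigned at random; on this confounded data it gives 7.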
Propensity scores: derivation
• Recall average treatment effect: $\mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid x, T = 1] - \mathbb{E}[Y_0 \mid x, T = 0] \,\big]$
• We only have samples for: $\mathbb{E}_{x \sim p(x \mid T = 1)}\big[\, \mathbb{E}[Y_1 \mid x, T = 1] \,\big]$ and $\mathbb{E}_{x \sim p(x \mid T = 0)}\big[\, \mathbb{E}[Y_0 \mid x, T = 0] \,\big]$
• We need to turn $p(x \mid T = 1)$ into $p(x)$: $p(x \mid T = 1) \cdot \frac{p(T = 1)}{p(T = 1 \mid x)} = p(x)$, where $p(T = 1 \mid x)$ is the propensity score
• We need to turn $p(x \mid T = 0)$ into $p(x)$: $p(x \mid T = 0) \cdot \frac{p(T = 0)}{p(T = 0 \mid x)} = p(x)$
• We want: $\mathbb{E}_{x \sim p(x)}[Y_1(x)]$
• We know that: $p(x \mid T = 1) \cdot \frac{p(T = 1)}{p(T = 1 \mid x)} = p(x)$
• Thus: $\mathbb{E}_{x \sim p(x \mid T = 1)}\left[ \frac{p(T = 1)}{p(T = 1 \mid x)} Y_1(x) \right] = \mathbb{E}_{x \sim p(x)}[Y_1(x)]$
• We can approximate this empirically, with $n_1$ the number of treated units, as: $\frac{1}{n_1} \sum_{i \text{ s.t. } t_i = 1} \left[ \frac{n_1 / n}{\hat{p}(t_i = 1 \mid x_i)} y_i \right] = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)}$ (similarly for $t_i = 0$)
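The re-weighting identity behind this derivation can be verified exactly on a small discrete example. The joint distribution and the outcome values below are hypothetical; exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, T) over a binary covariate x.
joint = {(0, 0): F(3, 8), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(3, 8)}

p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}      # p(x)
p_t1 = joint[(0, 1)] + joint[(1, 1)]                          # p(T = 1)
p_x_given_t1 = {x: joint[(x, 1)] / p_t1 for x in (0, 1)}      # p(x | T = 1)
propensity = {x: joint[(x, 1)] / p_x[x] for x in (0, 1)}      # p(T = 1 | x)

# Re-weighting p(x | T = 1) by p(T = 1) / p(T = 1 | x) recovers p(x).
reweighted = {x: p_x_given_t1[x] * p_t1 / propensity[x] for x in (0, 1)}

# Hypothetical treated outcomes Y_1(x).
y1 = {0: F(2), 1: F(12)}

# Weighted expectation under p(x | T = 1) versus plain expectation under p(x).
lhs = sum(p_x_given_t1[x] * p_t1 / propensity[x] * y1[x] for x in (0, 1))
rhs = sum(p_x[x] * y1[x] for x in (0, 1))

print(reweighted == p_x, lhs, rhs)  # True 7 7
```

Note that the unweighted expectation of $Y_1$ over the treated population, $\frac{1}{4} \cdot 2 + \frac{3}{4} \cdot 12 = 9.5$, differs from the population value of 7; the weights are what correct for the treated group's skew toward $x = 1$.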
Problems with IPW
• Need to estimate propensity score (problem in all propensity score methods)
• If there's not much overlap, propensity scores become non-informative and easily mis-calibrated
• Weighting by inverse can create large variance and large errors for small propensity scores
– Exacerbated when more than two treatments
Many more ideas and methods
• Natural experiments & regression discontinuity
• Instrumental variables
Many more ideas and methods: Natural experiments
• Does stress during pregnancy affect later child development?
• Confounding: genetic, mother personality, economic factors…
• Natural experiment: the Cuban missile crisis of October 1962. Many people were afraid a nuclear war was about to break out.
• Compare children who were in utero during the crisis with children from immediately before and after
Many more ideas and methods: Instrumental variables
• Informally: a variable which affects treatment assignment but not the outcome (except through the treatment)
• Example: are private schools better than public schools?
• Confounding: different student population, different teacher population
• Can't force people to attend a particular school
• Can randomly give out vouchers to some children, giving them an opportunity to attend private schools
• The voucher assignment is the instrumental variable
Summary
• Two approaches to use machine learning for causal inference:
1. Predict outcome given features and treatment, then use resulting model to impute counterfactuals (covariate adjustment)
2. Predict treatment using features (propensity score), then use to reweight outcome or stratify the data
• Causal graphs are important for thinking through whether the problem is set up appropriately and whether assumptions hold