
Machine Learning for Healthcare HST.956, 6.S897

Lecture 15: Causal Inference Part 2

David Sontag

Acknowledgement: adapted from slides by Uri Shalit (Technion)

Reminder: Potential Outcomes

• Each unit (individual) $x_i$ has two potential outcomes:
  – $Y_0(x_i)$ is the potential outcome had the unit not been treated: "control outcome"
  – $Y_1(x_i)$ is the potential outcome had the unit been treated: "treated outcome"

• Conditional average treatment effect for unit $i$: $CATE(x_i) = \mathbb{E}_{Y_1 \sim p(Y_1 \mid x_i)}[Y_1 \mid x_i] - \mathbb{E}_{Y_0 \sim p(Y_0 \mid x_i)}[Y_0 \mid x_i]$

• Average Treatment Effect: $ATE = \mathbb{E}_{x \sim p(x)}[CATE(x)]$

Two common approaches for counterfactual inference

• Covariate adjustment
• Propensity scores

Covariate adjustment (reminder)

Explicitly model the relationship between treatment, confounders, and outcome:

[Figure: covariates (features) $x_1, x_2, x_3, \ldots$ and treatment $T$ are the inputs to a regression model $f(x, T)$, which predicts the outcome $y$.]

• Under ignorability, $CATE(x) = \mathbb{E}[Y_1 \mid T = 1, x] - \mathbb{E}[Y_0 \mid T = 0, x]$

• Fit a model $f(x, t) \approx \mathbb{E}[Y_t \mid T = t, x]$, then: $\widehat{CATE}(x_i) = f(x_i, 1) - f(x_i, 0)$.
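The fit-then-difference recipe above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the model class (least squares with an $x \cdot t$ interaction term) is one arbitrary choice of $f$, and the names `fit_outcome_model`, `predict`, and `cate_hat` are hypothetical helpers, not part of the lecture.

```python
import numpy as np

def fit_outcome_model(x, t, y):
    """Least-squares fit of f(x, t) = w0 + w1*x + w2*t + w3*x*t (an assumed model class)."""
    design = np.column_stack([np.ones_like(x), x, t, x * t])
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    return w

def predict(w, x, t_value):
    """Evaluate the fitted f(x, t) for every unit with treatment set to t_value."""
    t = np.full_like(x, t_value, dtype=float)
    return w[0] + w[1] * x + w[2] * t + w[3] * x * t

def cate_hat(w, x):
    """Covariate-adjustment estimate: f(x, 1) - f(x, 0)."""
    return predict(w, x, 1.0) - predict(w, x, 0.0)

# Synthetic observational data (hypothetical): treatment probability depends on x,
# and the true effect of treatment is 1 + 0.5 * x.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
y = 2.0 * x + t * (1.0 + 0.5 * x) + rng.normal(scale=0.1, size=n)

w = fit_outcome_model(x, t, y)
cate_estimates = cate_hat(w, x)        # one estimate per unit
ate_estimate = cate_estimates.mean()   # average over the empirical p(x)
print(f"ATE estimate: {ate_estimate:.3f}")
```

Because the assumed model class contains the true outcome function here, the estimates land close to the true effect; with a misspecified $f$ that is not guaranteed.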

Covariate adjustment with linear models

• Assume that: $Y_t(x) = \beta x + \gamma \cdot t + \epsilon_t$, with $\mathbb{E}[\epsilon_t] = 0$
  (here $Y_t(x)$ is blood pressure, $x$ is age, and $t$ is medication)

• Then: $CATE(x) := \mathbb{E}[Y_1(x) - Y_0(x)] = \mathbb{E}[(\beta x + \gamma + \epsilon_1) - (\beta x + \epsilon_0)] = \gamma$

• $ATE := \mathbb{E}_{p(x)}[CATE(x)] = \gamma$

• For causal inference, need to estimate $\gamma$ well, not $Y_t(x)$: identification, not prediction

• Major difference between ML and statistics



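What happens if true model is not linear?

• True data generating process, $x \in \mathbb{R}$: $Y_t(x) = \beta x + \gamma \cdot t + \delta \cdot x^2$, so $ATE = \mathbb{E}[Y_1 - Y_0] = \gamma$

• Hypothesized model: $\hat{Y}_t(x) = \hat{\beta} x + \hat{\gamma} \cdot t$

$\hat{\gamma} = \gamma + \delta \, \frac{\mathbb{E}[xt]\,\mathbb{E}[x^2] - \mathbb{E}[t^2]\,\mathbb{E}[x^2 t]}{\mathbb{E}[xt]^2 - \mathbb{E}[x^2]\,\mathbb{E}[t^2]}$

Depending on $\delta$, the estimate $\hat{\gamma}$ can be made to be arbitrarily large or small!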
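A small simulation can make the effect of misspecification concrete. The sketch below is illustrative only: the parameter values, the way treatment is assigned, and the helper name `fit_gamma_hat` are assumptions. It fits the hypothesized linear model by least squares to data generated from the quadratic process and reports $\hat{\gamma}$ for several values of $\delta$; it does not attempt to reproduce the closed-form expression.

```python
import numpy as np

def fit_gamma_hat(x, t, y):
    """Least-squares fit of the hypothesized model y ~ beta*x + gamma*t (no x^2 term)."""
    design = np.column_stack([x, t])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    return coef[1]  # coefficient on t

rng = np.random.default_rng(0)
n = 5000
beta, gamma = 1.0, 1.0  # hypothetical true parameters; the true ATE is gamma
x = rng.normal(size=n)
# Treatment is more likely for larger x, so x confounds treatment and outcome.
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)

gamma_hats = {}
for delta in (0.0, 1.0, 10.0):
    y = beta * x + gamma * t + delta * x**2  # true data generating process
    gamma_hats[delta] = fit_gamma_hat(x, t, y)
    print(f"delta={delta:5.1f}  gamma_hat={gamma_hats[delta]:8.3f}  (true gamma={gamma})")
```

With $\delta = 0$ the linear model is correct and $\hat{\gamma}$ recovers $\gamma$. For $\delta \neq 0$ the error in $\hat{\gamma}$ grows in proportion to $\delta$, since least squares is linear in the outcomes and $x$, $t$ are held fixed across the three fits.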

Covariate adjustment with non-linear models

• Random forests and Bayesian trees: Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)

• Gaussian processes: Hoyer et al. (2009), Zigler et al. (2012)

• Neural networks: Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)

Example: Gaussian processes

[Figure: two panels plotting the outcome $y$ against the covariate $x$ for treated and control units, with fitted curves $Y_1(x)$ and $Y_0(x)$. Panel "GP-Independent": separate treated and control models. Panel "GP-Grouped": joint treated and control model.]

Figures: Vincent Dorie & Jennifer Hill

Example: Neural networks

Shalit, Johansson, Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. ICML, 2017

[Figure: network architecture with labeled components: covariates, neural network layers, shared representation $\Phi$, predicted potential outcomes, intervention, outcome, and learning objective.]

Matching

• Find each unit's long-lost counterfactual identical twin, check up on his outcome

[Images: "Obama, had he gone to law school" and "Obama, had he gone to business school"]

• Used for estimating both ATE and CATE

Match to nearest neighbor from opposite group

[Figure: scatter plot of treated and control units; axes: age and Charlson comorbidity index.]

1-NN Matching

• Let $d(\cdot, \cdot)$ be a metric between $x$'s
• For each $i$, define $j(i) = \arg\min_{j \text{ s.t. } t_j \neq t_i} d(x_j, x_i)$
  $j(i)$ is the nearest counterfactual neighbor of $i$
• $t_i = 1$, unit $i$ is treated: $\widehat{CATE}(x_i) = y_i - y_{j(i)}$
• $t_i = 0$, unit $i$ is control: $\widehat{CATE}(x_i) = y_{j(i)} - y_i$
• In both cases: $\widehat{CATE}(x_i) = (2t_i - 1)(y_i - y_{j(i)})$
• $\widehat{ATE} = \frac{1}{n} \sum_{i=1}^{n} \widehat{CATE}(x_i)$
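The matching estimator above translates almost directly into array code. The following is a minimal sketch, assuming Euclidean distance as the metric $d$ (the slides leave the metric open) and a brute-force $O(n^2)$ distance matrix; the function name `one_nn_matching` and the toy data are hypothetical.

```python
import numpy as np

def one_nn_matching(x, t, y):
    """1-NN matching with Euclidean distance as the metric d (an assumed choice).

    x: (n, d) covariates, t: (n,) treatment in {0, 1}, y: (n,) observed outcomes.
    Returns (cate_hat, ate_hat). Requires at least one treated and one control unit.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t)
    y = np.asarray(y, dtype=float)
    # dist[i, j] = d(x_i, x_j) for every pair of units.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Rule out candidates j with the same treatment as i (this also rules out j = i).
    dist[t[:, None] == t[None, :]] = np.inf
    j = dist.argmin(axis=1)              # j(i): nearest neighbor in the opposite group
    cate_hat = (2 * t - 1) * (y - y[j])  # sign flips for control units
    return cate_hat, cate_hat.mean()

# Toy data (hypothetical): each treated unit has a nearby control with an outcome lower by 2.
x = np.array([[0.0], [1.0], [2.0], [0.1], [1.2], [2.1]])
t = np.array([1, 1, 1, 0, 0, 0])
y = np.array([5.0, 6.0, 7.0, 3.0, 4.0, 5.0])

cate_hat, ate_hat = one_nn_matching(x, t, y)
print(cate_hat, ate_hat)
```

On this toy data every unit is matched to a neighbor whose outcome differs by 2 in the treated-minus-control direction, so each $\widehat{CATE}(x_i)$ and the $\widehat{ATE}$ equal 2.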

Matching

• Interpretable, especially in small-sample regime
• Nonparametric
• Heavily reliant on the underlying metric
• Could be misled by features which don't affect the outcome

Covariate adjustment and matching

• Matching is equivalent to covariate adjustment with two 1-nearest neighbor classifiers: $\hat{Y}_1(x) = y_{NN_1(x)}$, $\hat{Y}_0(x) = y_{NN_0(x)}$, where $y_{NN_t(x)}$ is the outcome of the nearest neighbor of $x$ among units with treatment assignment $t \in \{0, 1\}$

• 1-NN matching is in general inconsistent, though only with small bias (Imbens 2004)

Two common approaches for counterfactual inference

• Covariate adjustment
• Propensity scores

Propensity scores

• Tool for estimating ATE
• Basic idea: turn observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling

Inverse propensity score re-weighting

[Figure: scatter plot of treated and control units; axes: $x_1$ = age, $x_2$ = Charlson comorbidity index.]

$p(x \mid t = 0) \neq p(x \mid t = 1)$ (control vs. treated)

$p(x \mid t = 0) \cdot w_0(x) \approx p(x \mid t = 1) \cdot w_1(x)$ (reweighted control vs. reweighted treated)

• Propensity score: $p(T = 1 \mid x)$, estimated using machine learning tools

• Samples re-weighted by the inverse propensity score of the treatment they received

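Propensity scores – algorithm
Inverse probability of treatment weighted estimator

How to calculate ATE with propensity score for sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:

1. Use any ML method to estimate $\hat{p}(T = t \mid x)$

2. $\widehat{ATE} = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)} - \frac{1}{n} \sum_{i \text{ s.t. } t_i = 0} \frac{y_i}{\hat{p}(t_i = 0 \mid x_i)}$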
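The two-step estimator can be sketched as follows. This is an illustration under assumptions, not a reference implementation: logistic regression fitted by Newton's method stands in for "any ML method", the data are synthetic, and the names `fit_propensity_logistic` and `ipw_ate` are hypothetical.

```python
import numpy as np

def fit_propensity_logistic(x, t, n_iter=25):
    """Step 1: estimate p(T=1 | x) with logistic regression fitted by Newton's method.

    Logistic regression is only one possible choice of propensity model.
    """
    design = np.column_stack([np.ones(len(t)), x])
    w = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ w))
        grad = design.T @ (p - t)                               # gradient of the negative log-likelihood
        hess = (design * (p * (1.0 - p))[:, None]).T @ design   # its Hessian
        w -= np.linalg.solve(hess + 1e-8 * np.eye(len(w)), grad)
    return 1.0 / (1.0 + np.exp(-design @ w))

def ipw_ate(y, t, p1_hat):
    """Step 2: (1/n) sum_{t_i=1} y_i / p_hat(1|x_i) - (1/n) sum_{t_i=0} y_i / p_hat(0|x_i)."""
    n = len(y)
    treated = t == 1
    control = t == 0
    treated_term = (y[treated] / p1_hat[treated]).sum() / n
    control_term = (y[control] / (1.0 - p1_hat[control])).sum() / n
    return treated_term - control_term

# Synthetic observational data (hypothetical): x drives both treatment and outcome; true ATE = 1.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
y = 2.0 * x + 1.0 * t + rng.normal(size=n)

p1_hat = fit_propensity_logistic(x, t)
ate_ipw = ipw_ate(y, t, p1_hat)
ate_naive = y[t == 1].mean() - y[t == 0].mean()  # ignores confounding
print(f"IPW: {ate_ipw:.3f}   naive difference in means: {ate_naive:.3f}")
```

In this setup the naive difference in means is biased well above the true effect because treated units tend to have larger $x$, while the re-weighted estimate should be close to 1.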



1. Randomized trial: $p(T = t \mid x) = 0.5$

2. $\widehat{ATE} = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{0.5} - \frac{1}{n} \sum_{i \text{ s.t. } t_i = 0} \frac{y_i}{0.5} = \frac{2}{n} \sum_{i \text{ s.t. } t_i = 1} y_i - \frac{2}{n} \sum_{i \text{ s.t. } t_i = 0} y_i$

Each sum is over $\sim n/2$ terms, so the estimator is approximately the difference between the mean treated outcome and the mean control outcome.

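Propensity scores - derivation

• Recall average treatment effect: $\mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid x, T = 1] - \mathbb{E}[Y_0 \mid x, T = 0] \,\big]$

• We only have samples for: $\mathbb{E}_{x \sim p(x \mid T=1)}\big[\, \mathbb{E}[Y_1 \mid x, T = 1] \,\big]$ and $\mathbb{E}_{x \sim p(x \mid T=0)}\big[\, \mathbb{E}[Y_0 \mid x, T = 0] \,\big]$

• We want: $\mathbb{E}_{x \sim p(x)}[Y_1(x)]$

• We know that: $p(x \mid T = 1) \cdot \frac{p(T = 1)}{p(T = 1 \mid x)} = p(x)$

• Thus: $\mathbb{E}_{x \sim p(x \mid T=1)}\left[ \frac{p(T = 1)}{p(T = 1 \mid x)} Y_1(x) \right] = \mathbb{E}_{x \sim p(x)}[Y_1(x)]$

• We can approximate this empirically as: $\frac{1}{n_1} \sum_{i \text{ s.t. } t_i = 1} \left[ \frac{n_1 / n}{\hat{p}(t_i = 1 \mid x_i)} y_i \right] = \frac{1}{n} \sum_{i \text{ s.t. } t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)}$ (similarly for $t_i = 0$)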
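The re-weighting identity in this derivation can be checked numerically on a small discrete example. All numbers below are made up for illustration: $x$ takes three values, and $Y_1(x)$ is treated as a deterministic function of $x$.

```python
import numpy as np

# Hypothetical discrete example: x takes three values.
p_x = np.array([0.5, 0.3, 0.2])            # p(x)
p_t1_given_x = np.array([0.2, 0.5, 0.9])   # propensity score p(T=1 | x)
y1 = np.array([1.0, 2.0, 4.0])             # Y_1(x), deterministic here for simplicity

p_t1 = (p_x * p_t1_given_x).sum()          # p(T=1), by marginalizing over x
p_x_given_t1 = p_x * p_t1_given_x / p_t1   # Bayes' rule: p(x | T=1)

target = (p_x * y1).sum()                  # what we want: E_{x~p(x)}[Y_1(x)]
treated_only = (p_x_given_t1 * y1).sum()   # what treated samples give: E_{x~p(x|T=1)}[Y_1(x)]
reweighted = (p_x_given_t1 * (p_t1 / p_t1_given_x) * y1).sum()  # re-weighted expectation

print(target, treated_only, reweighted)
```

Here `target` and `reweighted` both come out to 1.9, while the unweighted expectation over the treated population is about 2.60, because units with large $Y_1(x)$ are over-represented among the treated.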

Problems with IPW

• Need to estimate propensity score (problem in all propensity score methods)

• If there's not much overlap, propensity scores become non-informative and easily mis-calibrated

• Weighting by inverse can create large variance and large errors for small propensity scores
  – Exacerbated when more than two treatments
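The variance problem can be seen in a small simulation. The sketch below is an illustration under assumptions: the propensity score is taken as known and equal to a sigmoid of `slope * x`, so that a larger `slope` means less overlap, and the data-generating process and function names are hypothetical.

```python
import numpy as np

def ipw_ate_known_propensity(y, t, p1):
    """IPW estimate of the ATE using known propensity scores p1 = p(T=1 | x)."""
    return np.mean(t * y / p1) - np.mean((1 - t) * y / (1 - p1))

def simulate(slope, n=500, reps=200, seed=0):
    """Spread of the IPW estimate, and the largest weight seen, over repeated datasets.

    A larger `slope` pushes p(T=1 | x) = sigmoid(slope * x) toward 0 or 1,
    i.e. less overlap between treated and control units.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    max_weight = 0.0
    for _ in range(reps):
        x = rng.normal(size=n)
        p1 = 1.0 / (1.0 + np.exp(-slope * x))
        t = (rng.random(n) < p1).astype(float)
        y = x + t + rng.normal(scale=0.1, size=n)  # true ATE = 1
        estimates.append(ipw_ate_known_propensity(y, t, p1))
        # Weight each unit by the inverse propensity of the treatment it received.
        weights = np.where(t == 1, 1.0 / p1, 1.0 / (1.0 - p1))
        max_weight = max(max_weight, weights.max())
    return np.std(estimates), max_weight

std_good, w_good = simulate(slope=0.5)   # good overlap
std_poor, w_poor = simulate(slope=3.0)   # poor overlap
print(f"good overlap: std={std_good:.3f}, max weight={w_good:.1f}")
print(f"poor overlap: std={std_poor:.3f}, max weight={w_poor:.1f}")
```

With poor overlap a handful of units that received an unlikely treatment carry very large weights, and the estimate varies much more from dataset to dataset, even though the propensity scores are exact in this simulation.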

Many more ideas and methods

• Natural experiments & regression discontinuity

• Instrumental variables

Many more ideas and methods – Natural experiments

• Does stress during pregnancy affect later child development?

• Confounding: genetic, mother personality, economic factors…

• Natural experiment: the Cuban missile crisis of October 1962. Many people were afraid a nuclear war was about to break out.

• Compare children who were in utero during the crisis with children from immediately before and after

Many more ideas and methods – Instrumental variables

• Informally: a variable which affects treatment assignment but not the outcome

• Example: are private schools better than public schools?

• Confounding: different student population, different teacher population

• Can't force people to attend a particular school

• Can randomly give out vouchers to some children, giving them an opportunity to attend private schools

• The voucher assignment is the instrumental variable

Summary

• Two approaches to using machine learning for causal inference:
  1. Predict outcome given features and treatment, then use resulting model to impute counterfactuals (covariate adjustment)
  2. Predict treatment using features (propensity score), then use to reweight outcome or stratify the data

• Causal graphs are important for thinking through whether the problem is set up appropriately and whether assumptions hold
