


Review Article

Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis

Mark J. van der Laan (1) and Richard J. C. M. Starmans (2)

(1) University of California, Berkeley, 108 Haviland Hall, Berkeley, CA 94720-7360, USA
(2) Department of Computer Science, Utrecht University, The Netherlands

Correspondence should be addressed to Mark J. van der Laan; laan@berkeley.edu

Received 16 February 2014; Revised 9 July 2014; Accepted 10 July 2014; Published 10 September 2014

Academic Editor: Chin-Shang Li

Copyright © 2014 M. J. van der Laan and R. J. C. M. Starmans. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This outlook paper reviews the research of van der Laan's group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming at only relying on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical historical perspective on Targeted Learning, also relating it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.

1. Introduction

In Section 2 we start out with reviewing some basic statistical concepts such as data, probability distribution, statistical model, and target parameter, allowing us to define the field Targeted Learning, a subfield of statistics that develops data adaptive estimators of user supplied target parameters of data distributions based on high dimensional data under realistic assumptions (e.g., incorporating the state of the art in machine learning) while preserving statistical inference. This also allows us to clarify how Targeted Learning distinguishes itself from typical current practice in data analysis that relies on unrealistic assumptions, and to describe the key ingredients of targeted minimum loss based estimation (TMLE), a general tool to achieve the goals set out by Targeted Learning: a substitution estimator, construction of an initial estimator through super-learning, targeting of the initial estimator to achieve asymptotic linearity with known influence curve by solving the efficient influence curve estimating equation, and statistical inference in terms of a normal limiting distribution.

Targeted Learning resurrects the pillars of statistics, such as the facts that a model represents actual knowledge about the data generating experiment and that a target parameter represents the feature of the data generating distribution we want to learn from the data. In this manner, Targeted Learning defines a truth and sets a scientific standard for estimation procedures, while current practice typically defines a parameter as a coefficient in a misspecified parametric model (e.g., logistic linear regression, repeated measures generalized linear regression) or in small unrealistic semiparametric regression models (e.g., Cox proportional hazards regression), where different choices of such misspecified models yield different answers. This lack of truth in current practice, supported by statements such as "All models are wrong but some are useful," allows a user to make arbitrary choices even though these choices result in different answers to the same estimation problem. In fact, this lack of truth in current practice presents a fundamental drive behind the epidemic of false positives and lack of power to detect true positives our field is suffering from. In addition, this lack of truth makes many of us question the scientific integrity of the field

Hindawi Publishing Corporation, Advances in Statistics, Volume 2014, Article ID 502678, 19 pages. http://dx.doi.org/10.1155/2014/502678


we call statistics, and makes it impossible to teach statistics as a scientific discipline, even though the foundations of statistics, including a very rich theory, are purely scientific. That is, our field has suffered from a disconnect between the theory of statistics and the practice of statistics, while practice should be driven by relevant theory, and theoretical developments should be driven by practice. For example, a theorem establishing consistency and asymptotic normality of a maximum likelihood estimator for a parametric model that is known to be misspecified is not a relevant theorem for practice, since the true data generating distribution is not captured by this theorem.

Defining the statistical model to actually contain the true probability distribution has enormous implications for the development of valid estimators. For example, maximum likelihood estimators are now ill defined due to the curse of dimensionality of the model. In addition, even regularized maximum likelihood estimators are seriously flawed: a general problem with maximum likelihood based estimators is that the maximum likelihood criterion only cares about how well the density estimator fits the true density, resulting in a wrong trade-off for the actual target parameter of interest. From a practical perspective, when we use AIC, BIC, or cross-validated log-likelihood to select variables in our regression model, that procedure is ignorant of the specific feature of the data distribution we want to estimate. That is, in large statistical models it is immediately apparent that estimators need to be targeted towards their goal, just like a human being learns the answer to a specific question in a targeted manner; maximum likelihood based estimators fail to do that.

In Section 3 we review the roadmap for Targeted Learning of a causal quantity, involving: defining a causal model and causal quantity of interest; establishing an estimand of the data distribution that equals the desired causal quantity under additional causal assumptions; applying the pure statistical Targeted Learning of the relevant estimand, based on a statistical model compatible with the causal model but for sure containing the true data distribution; and careful interpretation of the results. In Section 4 we proceed with describing our proposed targeted minimum loss-based estimation (TMLE) template, which represents a concrete template for construction of targeted efficient substitution estimators that are not only asymptotically consistent, asymptotically normally distributed, and asymptotically efficient, but also tailored to have robust finite sample performance. Subsequently, in Section 5 we review some of our most important advances in Targeted Learning, demonstrating the remarkable power and flexibility of this TMLE methodology, and in Section 6 we describe future challenges and areas of research. In Section 7 we provide a historical philosophical perspective on Targeted Learning. Finally, in Section 8 we conclude with some remarks putting Targeted Learning in the context of the modern era of Big Data.

We refer to our papers and book on Targeted Learning for overviews of relevant parts of the literature that put our specific contributions within the field of Targeted Learning in the context of the current literature, thereby allowing us to focus on Targeted Learning itself in the current outlook paper.

2. Targeted Learning

Our research takes place in a subfield of statistics we named Targeted Learning [1, 2]. In statistics, the data (O_1, ..., O_n) on n units is viewed as a realization of a random variable, or equivalently, an outcome of a particular experiment, and thereby has a probability distribution P_0, often called the data distribution. For example, one might observe O_i = (W_i, A_i, Y_i) on a subject i, where W_i are baseline characteristics of the subject, A_i is a binary treatment or exposure the subject received, and Y_i is a binary outcome of interest, such as an indicator of death, i = 1, ..., n. Throughout this paper we will use this data structure to demonstrate the concepts and estimation procedures.
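This running data structure can be made concrete with a small simulation; the particular covariate distribution, treatment mechanism, and outcome probabilities below are hypothetical choices for illustration, not part of the paper:

```python
import math
import random

random.seed(1)

def draw_unit():
    """Draw one observation O = (W, A, Y): a baseline characteristic,
    a binary treatment, and a binary outcome (e.g., indicator of death)."""
    W = random.gauss(0.0, 1.0)                                       # baseline covariate
    A = 1 if random.random() < 1 / (1 + math.exp(-0.5 * W)) else 0   # treatment depends on W
    p_death = 1 / (1 + math.exp(-(-1.0 + 0.3 * A + 0.4 * W)))
    Y = 1 if random.random() < p_death else 0                        # binary outcome
    return (W, A, Y)

data = [draw_unit() for _ in range(1000)]  # O_1, ..., O_n with n = 1000
print(data[0])
```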

2.1. Statistical Model. A statistical model M is defined as a set of possible probability distributions for the data distribution and thus represents the available statistical knowledge about the true data distribution P_0. In Targeted Learning, this core definition of the statistical model is fully respected in the sense that one should define the statistical model to contain the true data distribution: P_0 ∈ M. So, contrary to the often conveniently used slogan "All models are wrong, but some are useful" and the erosion over time of the original true meaning of a statistical model throughout applied research, Targeted Learning defines the model for what it actually is [3]. If there is truly no statistical knowledge available, then the statistical model is defined as all data distributions. A possible statistical model is the model that assumes that (O_1, ..., O_n) are n independent and identically distributed random variables with completely unknown probability distribution P_0, representing the case that the sampling of the data involved repeating the same experiment independently. In our example, this would mean that we assume that (W_i, A_i, Y_i) are independent with a completely unspecified common probability distribution. For example, if W is 10-dimensional, while (A, Y) is two-dimensional, then P_0 is described by a 12-dimensional density, and this statistical model does not put any restrictions on this 12-dimensional density. One could factorize this density of O = (W, A, Y) as follows:

p_0(W, A, Y) = p_{W,0}(W) p_{A|W,0}(A | W) p_{Y|A,W,0}(Y | A, W),  (1)

where p_{W,0} is the density of the marginal distribution of W, p_{A|W,0} is the conditional density of A, given W, and p_{Y|A,W,0} is the conditional density of Y, given A, W. In this model each of these factors is unrestricted. On the other hand, suppose now that the data is generated by a randomized controlled trial in which we randomly assign treatment A ∈ {0, 1} with probability 0.5 to a subject. In that case, the conditional density of A, given W, is known, but the marginal distribution of the covariates and the conditional distribution of the outcome, given covariates and treatment, might still be unrestricted. Even in an observational study, one might know that treatment decisions were only based on a small subset of the available covariates W, so that it is known that p_{A|W,0}(1 | W) only depends on W through these few covariates.


In the case that death, Y = 1, represents a rare event, it might also be known that the probability of death P_{Y|A,W,0}(1 | A, W) lies between 0 and some small number (e.g., 0.03). This restriction should then be included in the model M.

In various applications, careful understanding of the experiment that generated the data might show that even these rather large statistical models, assuming the data generating experiment equals the independent repetition of a common experiment, are too small to be true: see [4-8] for models in which (O_1, ..., O_n) is a joint random variable described by a single experiment, which nonetheless involves a variety of conditional independence assumptions. That is, the typical statement that O_1, ..., O_n are independent and identically distributed (i.i.d.) might already represent a wrong statistical model. For example, in a community randomized trial it is often the case that the treatments are assigned by the following type of algorithm: based on the characteristics (W_1, ..., W_n), one first applies an algorithm that aims to split the n communities into n/2 pairs that are similar with respect to baseline characteristics; subsequently, one randomly assigns treatment and control within each pair. Clearly, even when the communities would have been randomly sampled from a target population of communities, the treatment assignment mechanism creates dependence, so that the data generating experiment cannot be described as an independent repetition of experiments; see [7] for a detailed presentation.
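The pair-matching assignment scheme described above can be sketched as follows; the number of communities and the single baseline characteristic are hypothetical:

```python
import random

random.seed(2)

n = 8  # number of communities (n even)
baseline = {j: random.uniform(0, 100) for j in range(n)}  # one baseline characteristic

# Step 1: split the n communities into n/2 pairs that are similar with
# respect to the baseline characteristic (here: sort, then pair neighbors).
ranked = sorted(baseline, key=baseline.get)
pairs = [(ranked[k], ranked[k + 1]) for k in range(0, n, 2)]

# Step 2: randomly assign treatment and control within each pair; the
# resulting treatment indicators are dependent across communities.
assignment = {}
for (j, k) in pairs:
    treated = j if random.random() < 0.5 else k
    for c in (j, k):
        assignment[c] = 1 if c == treated else 0

print(pairs)
print(assignment)
```

Exactly one community per pair is treated, so the n treatment indicators cannot be described as independent draws.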

In a study in which one observes a single community of n interconnected individuals, one might have that the outcome Y_i for subject i is not only affected by the subject's own past (W_i, A_i) but also by the covariates and treatments of the friends of subject i. Knowing the friends of each subject would now impose strong conditional independence assumptions on the density of the data (O_1, ..., O_n), but one cannot assume that the data is a result of n independent experiments; in fact, as in the community randomized trial example, such data sets have sample size 1, since the data can only be described as the result of a single experiment [8].

In group sequential randomized trials, one often may use a randomization probability for the next recruited ith subject that depends on the observed data of the previously recruited and observed subjects O_1, ..., O_{i-1}, which makes the treatment assignment A_i a function of O_1, ..., O_{i-1}. Even when the subjects are sampled randomly from a target population, this type of dependence between treatment A_i and the past data O_1, ..., O_{i-1} implies that the data is the result of a single large experiment (again, the sample size equals 1) [4-6].

Indeed, many realistic statistical models only involve independence and conditional independence assumptions and known bounds (e.g., it is known that the observed clinical outcome is bounded in [0, 1], or that the conditional probability of death is bounded between 0 and a small number). Either way, whether the data distribution is described by a sequence of independent (and possibly identical) experiments or by a single experiment satisfying a variety of conditional independence restrictions, parametric models, though representing common practice, are practically always invalid statistical models, since such knowledge about the data distribution is essentially never available.
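A minimal sketch of such a group sequential design, in which the randomization probability for subject i is a function of O_1, ..., O_{i-1}; the specific adaptive rule and outcome probabilities are hypothetical illustrations:

```python
import random

random.seed(3)

def adaptive_probability(history):
    """Randomization probability for the next subject as a function of the
    observed data of previously recruited subjects (hypothetical rule:
    shift toward the arm with the better observed mean outcome)."""
    treated = [y for (a, y) in history if a == 1]
    control = [y for (a, y) in history if a == 0]
    if not treated or not control:
        return 0.5
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    return min(0.9, max(0.1, 0.5 + 0.4 * diff))

history = []  # (A_i, Y_i) for the previously recruited subjects
for _ in range(100):
    p = adaptive_probability(history)  # A_i depends on O_1, ..., O_{i-1}
    A = 1 if random.random() < p else 0
    Y = 1 if random.random() < (0.3 + 0.2 * A) else 0
    history.append((A, Y))

print(len(history))
```

Because each A_i depends on the entire past, the n observations form one large experiment rather than n independent replicates.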

An important by-product of requiring that the statistical model be truthful is that one is forced to obtain as much knowledge about the experiment as possible before committing to a model, which is precisely the role a good statistician should play. On the other hand, if one commits to a parametric model, then why would one still bother trying to find out the truth about the data generating experiment?

2.2. Target Parameter. The target parameter is defined as a mapping Ψ : M → R that maps the data distribution into the desired finite dimensional feature of the data distribution one wants to learn from the data: ψ_0 = Ψ(P_0). This choice of target parameter requires careful thought, independent from the choice of statistical model, and is not a choice made out of convenience. The use of parametric or semiparametric models, such as the Cox proportional hazards model, is often accompanied with the implicit statement that the unknown coefficients represent the parameter of interest. Even in the unrealistic scenario that these small statistical models would be true, there is absolutely no reason why the very parametrization of the data distribution should correspond with the target parameter of interest. Instead, the statistical model M and the choice of target parameter Ψ : M → R are two completely separate choices, and by no means should one imply the other. That is, the statistical knowledge about the experiment that generated the data and defining what we hope to learn from the data are two important key steps in science that should not be conflated. The true target parameter value ψ_0 is obtained by applying the target parameter mapping Ψ to the true data distribution P_0 and represents the estimand of interest.

For example, if O_i = (W_i, A_i, Y_i) are independent and have common probability distribution P_0, then one might define the target parameter as an average of the conditional W-specific treatment effects:

ψ_0 = Ψ(P_0) = E_0 [ E_0(Y | A = 1, W) − E_0(Y | A = 0, W) ].  (2)

By using that Y is binary, this can also be written as follows:

ψ_0 = ∫_w { P_{Y|A,W,0}(1 | A = 1, W = w) − P_{Y|A,W,0}(1 | A = 0, W = w) } dP_{W,0}(w),  (3)

where P_{Y|A,W,0}(1 | A = a, W = w) denotes the true conditional probability of death, given treatment A = a and covariate W = w.
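When W takes finitely many values, the estimand above reduces to a finite sum that can be evaluated directly; the distribution below, with a single binary covariate, is a hypothetical illustration:

```python
# Hypothetical true distribution with binary W: the estimand is the finite
# sum over w of {P(Y=1 | A=1, W=w) - P(Y=1 | A=0, W=w)} * P(W=w).
p_W = {0: 0.6, 1: 0.4}                # P(W = w)
p_Y = {(1, 0): 0.50, (0, 0): 0.30,    # P(Y = 1 | A = a, W = w), keyed by (a, w)
       (1, 1): 0.70, (0, 1): 0.40}

psi_0 = sum((p_Y[(1, w)] - p_Y[(0, w)]) * p_W[w] for w in p_W)
print(psi_0)  # 0.6 * (0.50 - 0.30) + 0.4 * (0.70 - 0.40) = 0.24
```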

For example, suppose that the true conditional probability of death is given by some logistic function:

P_{Y|A,W,0}(1 | A, W) = 1 / (1 + exp(−f_0(A, W)))  (4)

for some function f_0 of treatment A and covariates W. The reader can plug in a possible form for f_0, such as f_0(a, w) = 0.3a + 0.2w_1 + 0.1w_1w_2 + w_1w_2w_3. Given this function f_0, the true value ψ_0 is computed by the above formula as follows:

ψ_0 = ∫_w ( 1 / (1 + exp(−f_0(1, w))) − 1 / (1 + exp(−f_0(0, w))) ) dP_{W,0}(w).  (5)
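The true value ψ_0 under such a logistic specification can be approximated by Monte Carlo integration over the covariate distribution; both the specific form of f_0 below and the standard normal distribution chosen for W are illustrative assumptions:

```python
import math
import random

random.seed(5)

def f0(a, w1, w2, w3):
    # One possible, purely illustrative, choice for the function f_0.
    return 0.3 * a + 0.2 * w1 + 0.1 * w1 * w2 + w1 * w2 * w3

def expit(x):
    return 1 / (1 + math.exp(-x))

# Approximate psi_0 = integral of {expit(f0(1, w)) - expit(f0(0, w))} dP_W(w)
# by Monte Carlo over draws of W = (W1, W2, W3) ~ assumed standard normals.
draws = 100_000
total = 0.0
for _ in range(draws):
    w = [random.gauss(0, 1) for _ in range(3)]
    total += expit(f0(1, *w)) - expit(f0(0, *w))
psi_0 = total / draws
print(round(psi_0, 3))
```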

This parameter ψ_0 has a clear statistical interpretation as the average of all the w-specific additive treatment effects P_{Y|A,W,0}(1 | A = 1, W = w) − P_{Y|A,W,0}(1 | A = 0, W = w).

2.3. The Important Role of Models Also Involving Nontestable Assumptions. However, this particular statistical estimand ψ_0 has an even richer interpretation if one is willing to make additional so called causal (nontestable) assumptions. Let us assume that W, A, Y are generated by a set of so called structural equations:

W = f_W(U_W),
A = f_A(W, U_A),
Y = f_Y(W, A, U_Y),  (6)

where U = (U_W, U_A, U_Y) are random inputs following a particular unknown probability distribution, while the functions f_W, f_A, f_Y deterministically map the realization of the random input U = u sequentially into a realization of W = f_W(u_W), A = f_A(W, u_A), Y = f_Y(W, A, u_Y). One might not make any assumptions about the form of these functions f_W, f_A, f_Y. In that case, these causal assumptions put no restrictions on the probability distribution of (W, A, Y), but, through these assumptions, we have parametrized P_0 by a choice of functions (f_W, f_A, f_Y) and a choice of distribution of U. Pearl [9] refers to such assumptions as a structural causal model for the distribution of (W, A, Y).

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing A = f_A(W, U_A) by our desired intervention on the intervention node A. For example, a static intervention A = 1 results in a new system of equations: W = f_W(U_W), A = 1, Y_1 = f_Y(W, 1, U_Y), where this new random variable Y_1 is called a counterfactual outcome or potential outcome corresponding with intervention A = 1. Similarly, one can define Y_0 = f_Y(W, 0, U_Y). Thus, Y_0 (Y_1) represents the outcome on the subject one would have seen if the subject had been assigned treatment A = 0 (A = 1). One might now define the causal effect of interest as E_0 Y_1 − E_0 Y_0, that is, the difference between the expected outcome of Y_1 and the expected outcome of Y_0. If one also assumes that A is independent of U_Y, given W, which is often referred to as the assumption of no unmeasured confounding or the randomization assumption, then it follows that ψ_0 = E_0 Y_1 − E_0 Y_0. That is, under the structural causal model including this no unmeasured confounding assumption, ψ_0 can not only be interpreted purely statistically as an average of conditional treatment effects, but it actually equals the marginal additive causal effect.
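These definitions can be checked in a small simulation: draw the random inputs U, evaluate the structural equations under the static interventions A = 1 and A = 0, and average Y_1 − Y_0. The particular structural equations below are hypothetical, and treatment is randomized, so the no unmeasured confounding assumption holds by construction:

```python
import math
import random

random.seed(6)

def expit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical structural equations W = f_W(U_W), A = f_A(W, U_A),
# Y = f_Y(W, A, U_Y); the specific functional forms are illustrative only.
def f_W(u_W):
    return u_W

def f_A(W, u_A):
    return 1 if u_A < 0.5 else 0  # randomized: A independent of U_Y, given W

def f_Y(W, A, u_Y):
    return 1 if u_Y < expit(-0.5 + 0.4 * A + 0.3 * W) else 0

n = 200_000
diff = 0.0
for _ in range(n):
    u_W, u_A, u_Y = random.gauss(0, 1), random.random(), random.random()
    W = f_W(u_W)
    A = f_A(W, u_A)        # observed treatment (not needed for the counterfactuals)
    Y1 = f_Y(W, 1, u_Y)    # outcome had the subject received A = 1
    Y0 = f_Y(W, 0, u_Y)    # outcome had the subject received A = 0
    diff += Y1 - Y0
print(round(diff / n, 3))  # Monte Carlo estimate of E0 Y1 - E0 Y0
```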

In general, causal models or, more generally, sets of nontestable assumptions can be used to define underlying target quantities of interest and corresponding statistical target parameters that equal this target quantity under these assumptions. Well known classes of such models are models for censored data, in which the observed data is represented as a many to one mapping on the full data of interest and censoring variable, and the target quantity is a parameter of the full data distribution. Similarly, causal inference models represent the observed data as a mapping on counterfactuals and the observed treatment (either explicitly, as in the Neyman-Rubin model, or implicitly, as in the Pearl structural causal models), and one defines the target quantity as a parameter of the distribution of the counterfactuals. One is now often concerned with providing sets of assumptions on the underlying distribution (i.e., of the full data) that allow identifiability of the target quantity from the observed data distribution (e.g., coarsening at random or the randomization assumption). These nontestable assumptions do not change the statistical model M and, as a consequence, once one has defined the relevant estimand ψ_0, do not affect the estimation problem either.

2.4. Estimation Problem. The estimation problem is defined by the statistical model (i.e., (O_1, ..., O_n) ∼ P_0 ∈ M) and the choice of target parameter (i.e., Ψ : M → R). Targeted Learning is now the field concerned with the development of estimators of the target parameter that are asymptotically consistent as the number of units n converges to infinity, and whose appropriately standardized version (e.g., √n (ψ_n − ψ_0)) converges in probability distribution to some limit probability distribution (e.g., a normal distribution), so that one can construct confidence intervals that, for large enough sample size n, contain the true value of the target parameter with a user supplied high probability. In the case that O_1, ..., O_n ∼_iid P_0, a common method for establishing asymptotic normality of an estimator is to demonstrate that the estimator minus truth can be approximated by an empirical mean of a function of O_i. Such an estimator is called asymptotically linear at P_0. Formally, an estimator ψ_n is asymptotically linear under i.i.d. sampling from P_0 if ψ_n − ψ_0 = (1/n) Σ_{i=1}^n IC(P_0)(O_i) + o_P(1/√n), where O → IC(P_0)(O) is the so called influence curve at P_0. In that case, the central limit theorem teaches us that √n (ψ_n − ψ_0) converges to a normal distribution N(0, σ²), with variance σ² = E_0 IC(P_0)(O)² defined as the variance of the influence curve. An asymptotic 0.95 confidence interval for ψ_0 is then given by ψ_n ± 1.96 σ_n/√n, where σ_n² is the sample variance of an estimate IC_n(O_i) of the true influence curve IC(P_0)(O_i), i = 1, ..., n.

The empirical mean of the influence curve IC(P_0) of an estimator represents the first order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so called functional delta-method for statistical inference based on functionals of the empirical distribution [10-12]. That is, the influence curve IC(P_0)(O) of an estimator, viewed as a mapping from the empirical


distribution P_n into the estimated value Ψ̂(P_n), is defined as the directional derivative at P_0 in the direction (P_{n=1} − P_0), where P_{n=1} is the empirical distribution at a single observation.
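For the simplest possible illustration, the sample mean of i.i.d. observations is asymptotically linear with influence curve IC(P_0)(O) = O − ψ_0, which gives the confidence interval recipe above; the exponential data distribution is a hypothetical choice:

```python
import math
import random

random.seed(7)

# i.i.d. sample; the true mean of this hypothetical distribution is 1.0.
n = 10_000
O = [random.expovariate(1.0) for _ in range(n)]

psi_n = sum(O) / n                      # estimator: the sample mean
IC = [o - psi_n for o in O]             # estimated influence curve IC_n(O_i) = O_i - psi_n
var_IC = sum(ic * ic for ic in IC) / n  # sample variance of the influence curve
se = math.sqrt(var_IC / n)

# Asymptotic 0.95 confidence interval: psi_n +/- 1.96 * sigma_n / sqrt(n).
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)
print(psi_n, ci)
```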

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model. Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so called tangent space generated by all scores of parametric submodels through P_0 [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance, by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, the so called targeted minimum loss based estimation discussed below, does not only result in asymptotically efficient estimators; the estimators also (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15-19], (2) focus on the construction of substitution estimators, which by definition also fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted, computer friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the n observations are independent and identically distributed, O_i ∼_iid P_0 ∈ M, so that Ψ : M → R can now be defined as a parameter on the common distribution of O_i; but each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator. Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part Q_0 = Q(P_0) of the data distribution P_0 (e.g., a factor of the likelihood), then a substitution estimator can be represented as Ψ(Q_n), where Q_n is an estimator of Q_0 that is contained in the parameter space {Q(P) : P ∈ M} implied by the statistical model M. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., it is a probability or a difference between two probabilities) as well as known bounds on the data distribution implied by the model M.

In our running example, we can define Q_0 = (Q_{W,0}, Q̄_0), where Q_{W,0} is the probability distribution of W under P_0, and Q̄_0(A, W) = E_0(Y | A, W) is the conditional mean of the outcome, given the treatment and covariates, and represent the target parameter

ψ_0 = Ψ(Q_0) = E_{Q_{W,0}} { Q̄_0(1, W) − Q̄_0(0, W) }  (7)

as a function of the conditional mean Q̄_0 and the probability distribution Q_{W,0} of W. The model M might restrict Q̄_0 to be between 0 and a small number δ < 1, but otherwise puts no restrictions on Q_0. A substitution estimator is now obtained by plugging in the empirical distribution Q_{W,n} for Q_{W,0} and a data adaptive estimator 0 < Q̄_n < δ of the regression Q̄_0:

ψ_n = Ψ(Q_n) = (1/n) Σ_{i=1}^n { Q̄_n(1, W_i) − Q̄_n(0, W_i) }.  (8)

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of $\psi_0$ could be defined as

  $\psi_n = \frac{1}{n} \sum_{i=1}^{n} \frac{2A_i - 1}{g_n(A_i \mid W_i)} Y_i$,  (9)

where $g_n(\cdot \mid W)$ is an estimator of the conditional probability of treatment $g_0(\cdot \mid W)$. This is clearly not a substitution estimator. In particular, if $g_n(A_i \mid W_i)$ is very small for some observations, this estimator might not be between $-1$ and $1$, and thus completely ignores known constraints.
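To make this contrast concrete, here is a minimal Python sketch (the toy data, the fits, and all function names are our own hypothetical illustrations, not from the paper): a substitution estimator built from any regression fit with values in $[0, 1]$ returns an effect estimate in $[-1, 1]$ by construction, while the inverse probability of treatment weighted estimator (9) can escape those bounds as soon as one estimated treatment probability is small.

```python
# Toy observations O_i = (W_i, A_i, Y_i), all binary.
data = [(0, 1, 1), (0, 0, 0), (1, 1, 1), (1, 0, 1), (1, 1, 0)]

def substitution_ate(data, q_bar):
    """Substitution (plug-in) estimator: average of q_bar(1, W) - q_bar(0, W)."""
    return sum(q_bar(1, w) - q_bar(0, w) for (w, _, _) in data) / len(data)

def iptw_ate(data, g):
    """Inverse probability of treatment weighted estimator: mean of (2A-1) Y / g(A | W)."""
    return sum((2 * a - 1) * y / g(a, w) for (w, a, y) in data) / len(data)

# A regression fit mapping into [0, 1]: the plug-in effect must lie in [-1, 1].
q_bar = lambda a, w: 0.9 if a == 1 else 0.2

# A treatment mechanism fit with a near-positivity violation at (a, w) = (1, 1).
g = lambda a, w: 0.01 if (a == 1 and w == 1) else 0.5
```

With these choices the plug-in estimate is 0.7, while the single weight $1/0.01$ drives the IPTW estimate to 20, far outside the logically possible range $[-1, 1]$.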

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter. The construction of targeted estimators of the target parameter requires construction of an estimator of infinite dimensional nuisance parameters: specifically, the initial estimator of the relevant part $Q_0$ of the data distribution in the TMLE, and the estimator of the nuisance parameter $g_0 = g(P_0)$ that is needed to target the fit of this relevant part in the TMLE. In our running example, we have $Q_0 = (Q_{W,0}, \bar{Q}_0)$, and $g_0$ is the conditional distribution of $A$, given $W$.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter. In order to optimize these estimators of the nuisance parameters $(Q_0, g_0)$, we use a so-called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter, and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector that

Advances in Statistics

compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded, and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example, we have that $\bar{Q}_0 = \arg\min_{\bar{Q}} P_0 L(\bar{Q})$, where $L(\bar{Q})(O) = (Y - \bar{Q}(A, W))^2$ is the squared error loss, or one can also use the log-likelihood loss $L(\bar{Q})(O) = -\{Y \log \bar{Q}(A, W) + (1 - Y) \log(1 - \bar{Q}(A, W))\}$. Usually, there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but probably should itself be data adaptively selected in a targeted manner.

The cross-validated risk of a candidate estimator of $\bar{Q}_0$ is then defined as the empirical mean over a validation sample of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so-called $V$-fold cross-validation, in which one first partitions the sample into $V$ subsets of equal size, and each of the $V$ subsets plays the role of a validation sample, while its complement of $V - 1$ subsets equals the corresponding training sample. Thus, $V$-fold cross-validation results in $V$ sample splits into a validation sample and corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for $P(Y = 1 \mid A, W)$. Different choices of such logistic linear regression working models result in different possible candidate estimators. So in this manner one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of $\bar{Q}_0$. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.

Similarly, one can define a super-learner of the conditional distribution of $A$, given $W$.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learner provides an important step in creating a robust estimator whose performance is not relying on being lucky, but on generating a rich library, so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.
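The cross-validation selector described above can be sketched in a few lines of Python. This is a deliberately simplified, discrete super-learner (it picks the single cross-validated-risk minimizer rather than the best weighted combination), and the two candidate estimators are hypothetical toy choices of ours, not from the paper:

```python
import random

def v_fold_splits(n, V):
    """Deterministically partition indices 0..n-1 into V validation folds."""
    idx = list(range(n))
    random.Random(0).shuffle(idx)
    return [idx[v::V] for v in range(V)]

def cv_risk(fit, data, V=5):
    """Cross-validated mean squared-error risk of one candidate estimator.

    `fit` maps a training list of (w, a, y) tuples to a prediction
    function (a, w) -> y-hat.
    """
    n = len(data)
    losses = []
    for fold in v_fold_splits(n, V):
        hold_out = set(fold)
        train = [data[i] for i in range(n) if i not in hold_out]
        pred = fit(train)
        losses += [(y - pred(a, w)) ** 2 for (w, a, y) in (data[i] for i in fold)]
    return sum(losses) / len(losses)

def discrete_super_learner(library, data, V=5):
    """Cross-validation selector: pick the candidate with the smallest
    cross-validated risk, then refit it on the full sample."""
    risks = {name: cv_risk(fit, data, V) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](data)

# Two toy candidate estimators of the regression E(Y | A, W).
def fit_overall_mean(train):
    m = sum(y for _, _, y in train) / len(train)
    return lambda a, w: m

def fit_treatment_arm_means(train):
    means = {}
    for arm in (0, 1):
        ys = [y for w, a, y in train if a == arm]
        means[arm] = sum(ys) / len(ys) if ys else 0.0
    return lambda a, w: means[a]
```

Enlarging the `library` dictionary with further candidates (splines, trees, and so on) leaves the selector untouched; the oracle inequality cited above is what guarantees that adding candidates can essentially only help asymptotically.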

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so-called (mean zero) efficient influence curve $D^*(P_0)(O)$, up till a second order term that is asymptotically negligible [13]. That is, an estimator $\psi_n$ is efficient if and only if it is asymptotically linear with influence curve $D(P_0)$ equal to the efficient influence curve $D^*(P_0)$:

  $\psi_n - \psi_0 = \frac{1}{n} \sum_{i=1}^{n} D^*(P_0)(O_i) + o_P\left(\frac{1}{\sqrt{n}}\right)$.  (10)

The efficient influence curve, also called the canonical gradient, is indeed defined as the canonical gradient of the pathwise derivative of the target parameter $\Psi : \mathcal{M} \rightarrow \mathbb{R}$. Specifically, one defines a rich family of one-dimensional submodels $\{P(\epsilon) : \epsilon\}$ through $P$ at $\epsilon = 0$, and one represents the pathwise derivative $(d/d\epsilon) \Psi(P(\epsilon))|_{\epsilon=0}$ as an inner product (in the Hilbert space $L_0^2(P)$ of functions of $O$ with mean zero and inner product $\langle h_1, h_2 \rangle = E_P h_1(O) h_2(O)$) of a so-called gradient $D(P)$ with the score $S(P)$ of the path $\{P(\epsilon) : \epsilon\}$. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through $P$, also called the tangent space at $P$, is now the canonical gradient $D^*(P)$ at $P$. Indeed, the canonical gradient can be computed as the projection of any given gradient $D(P)$ onto the tangent space in the Hilbert space $L_0^2(P)$. An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect $\Psi : \mathcal{M} \rightarrow \mathbb{R}$ is given by

  $D^*(P_0)(O) = \frac{2A - 1}{g_0(A \mid W)} \left(Y - \bar{Q}_0(A, W)\right) + \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0)$.  (11)

As noted earlier, the influence curve $IC(P_0)$ of an estimator $\psi_n$ also characterizes the limit variance $\sigma_0^2 = P_0 IC(P_0)^2$ of the mean zero normal limit distribution of $\sqrt{n}(\psi_n - \psi_0)$. This variance $\sigma_0^2$ can be estimated with $(1/n) \sum_{i=1}^{n} IC_n(O_i)^2$, where $IC_n$ is an estimator of the influence curve $IC(P_0)$. Efficiency theory teaches us that, for any regular asymptotically linear estimator, its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, $\sigma_{*0}^2 = P_0 D^*(P_0)^2$, which is also called the generalized Cramer-Rao lower bound. In our running example, the asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate $D_n^*(O_i)$ of $D^*(P_0)(O_i)$, obtained by plugging in the estimator $g_n$ of $g_0$ and the estimator $\bar{Q}_n$ of $\bar{Q}_0$, while $\Psi(Q_0)$ is replaced by $\Psi(Q_n) = (1/n) \sum_{i=1}^{n} (\bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i))$.

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of $O$ that depends on $P_0$ through $Q_0$ and a possible nuisance parameter $g_0$, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through $P_0$. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve, whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator $\Psi(Q_n^*)$, beyond $Q_n^*$ being an excellent estimator of $Q_0$ as achieved with super-learning, is that the estimator solves the so-called efficient influence curve equation $\sum_{i=1}^{n} D^*(Q_n^*, g_n)(O_i) = 0$ for a good estimator $g_n$ of $g_0$. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.
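For the running example, the efficient influence curve (11) and the corresponding Wald-type variance estimate can be written out directly. The sketch below is our own illustration (hypothetical names and a toy data set), assuming a randomized design with known $g_0(1 \mid W) = 1/2$ and a treatment-arm-mean fit of $\bar{Q}_0$; in this saturated special case the plug-in estimator happens to solve the efficient influence curve equation exactly, whereas a genuinely data-adaptive fit generally does not, which is precisely what the TMLE update of Section 4 repairs.

```python
import math

def eff_ic(o, q_bar, g, psi):
    """Efficient influence curve (11) for the additive treatment effect,
    evaluated at one observation o = (w, a, y)."""
    w, a, y = o
    h = (2 * a - 1) / g(a, w)          # "clever covariate" (2A-1)/g(A|W)
    return h * (y - q_bar(a, w)) + q_bar(1, w) - q_bar(0, w) - psi

def wald_ci(data, q_bar, g, psi, z=1.96):
    """Wald-type CI psi +/- z * sigma_n / sqrt(n), where sigma_n^2 is the
    sample mean of the squared estimated efficient influence curve."""
    n = len(data)
    var = sum(eff_ic(o, q_bar, g, psi) ** 2 for o in data) / n
    se = math.sqrt(var / n)
    return psi - z * se, psi + z * se
```

The first check one would run with such code is that the empirical mean of the estimated efficient influence curve is (near) zero at the plug-in estimate, which is exactly the efficient influence curve equation discussed above.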

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models $\mathcal{M}$ typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation $Pf \equiv \int f(o) dP(o)$ for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which $P_0 D^*(P) = \Psi(P_0) - \Psi(P)$, and, in general, as a consequence of $D^*(P)$ being a canonical gradient,

  $P_0 D^*(P) = \Psi(P_0) - \Psi(P) + R(P, P_0)$,  (12)

where $R(P, P_0)$ is a term involving second order differences $(P - P_0)^2$. This key property explains why solving $P_0 D^*(P) = 0$ targets $\Psi(P)$ to be close to $\Psi(P_0)$, and thus explains why solving $P_n D^*(Q_n, g_n) = 0$ targets $Q_n$ to fit $\Psi(Q_0)$.

In our running example, we have $R(P, P_0) = R_1(P, P_0) - R_0(P, P_0)$, where $R_a(P, P_0) = \int_w ((g - g_0)(a \mid w)/g(a \mid w)) (\bar{Q} - \bar{Q}_0)(a, w) dQ_{W,0}(w)$. So, in our example, the remainder $R(P, P_0)$ only involves a cross-product difference $(g - g_0)(\bar{Q} - \bar{Q}_0)$. In particular, the remainder equals zero if either $g = g_0$ or $\bar{Q} = \bar{Q}_0$, which is often referred to as double robustness of the efficient influence curve with respect to $(\bar{Q}, g)$ in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.

Due to this identity (12), an estimator $\tilde{P}_n$ that solves $P_n D^*(\tilde{P}_n) = 0$ and is in a local neighborhood of $P_0$, so that $R(\tilde{P}_n, P_0) = o_P(1/\sqrt{n})$, approximately solves $\Psi(\tilde{P}_n) - \Psi(P_0) \approx (P_n - P_0) D^*(\tilde{P}_n)$, where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining $P_n D^*(Q_n^*, g_n) = 0$ with (12) at $P = (Q_n^*, g_n)$ yields

  $\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n^*, g_n) + R_n$,  (13)

where $R_n$ is a second order term. Thus, if second order differences such as $(\bar{Q}_n^* - \bar{Q}_0)^2$, $(\bar{Q}_n^* - \bar{Q}_0)(g_n - g_0)$, and $(g_n - g_0)^2$ converge to zero at a rate faster than $1/\sqrt{n}$, then it follows that $R_n = o_P(1/\sqrt{n})$. To make this assumption as reasonable as possible, one should use super-learning for both $\bar{Q}_n^*$ and $g_n$. In addition, empirical process theory teaches us that $(P_n - P_0) D^*(Q_n^*, g_n) = (P_n - P_0) D^*(Q_0, g_0) + o_P(1/\sqrt{n})$ if $P_0 \{D^*(Q_n^*, g_n) - D^*(Q_0, g_0)\}^2$ converges to zero in probability as $n$ converges to infinity (a consistency condition), and if $D^*(Q_n^*, g_n)$ falls in a so-called Donsker class of functions $o \rightarrow f(o)$ [11]. An important Donsker class is the class of all $d$-variate real valued functions that have a uniform sectional variation norm bounded by some universal $M < \infty$; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this $M < \infty$. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as $n$ converges to infinity. So, under this Donsker class condition, $R_n = o_P(1/\sqrt{n})$, and the consistency condition, we have

  $\psi_n - \psi_0 = \frac{1}{n} \sum_{i=1}^{n} D^*(Q_0, g_0)(O_i) + o_P\left(\frac{1}{\sqrt{n}}\right)$.  (14)

That is, $\psi_n$ is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of $\Psi(Q_n^*)$ up till a second order term $R_n$, and even asymptotic efficiency if $R_n = o_P(1/\sqrt{n})$, under some weak regularity conditions.
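The practical content of (14), namely that the Wald interval $\psi_n \pm 1.96\,\sigma_n/\sqrt{n}$ based on the estimated efficient influence curve attains roughly its nominal coverage, can be checked in a small simulation. Everything below (the data generating law, the sample size, the function names) is our own illustrative choice, assuming a randomized design with known $g_0 = 1/2$, so that the saturated plug-in estimator is already asymptotically efficient:

```python
import math
import random

def simulate(n, rng):
    """Draw n observations (w, a, y) from a hypothetical randomized trial:
    W ~ Bern(0.5), A ~ Bern(0.5), P(Y=1 | A, W) = 0.2 + 0.3*A + 0.2*W,
    so the true additive effect is psi0 = 0.3."""
    data = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < 0.5)
        y = int(rng.random() < 0.2 + 0.3 * a + 0.2 * w)
        data.append((w, a, y))
    return data

def plug_in_with_ci(data):
    """Saturated plug-in estimate of the additive effect with a Wald CI
    based on the estimated efficient influence curve (g known to be 1/2)."""
    n = len(data)
    cells = {}
    for (w, a, y) in data:
        cells.setdefault((a, w), []).append(y)
    q = lambda a, w: sum(cells[(a, w)]) / len(cells[(a, w)])
    psi = sum(q(1, w) - q(0, w) for (w, _, _) in data) / n
    ic = [2 * (2 * a - 1) * (y - q(a, w)) + q(1, w) - q(0, w) - psi
          for (w, a, y) in data]
    se = math.sqrt(sum(d * d for d in ic) / n / n)
    return psi - 1.96 * se, psi + 1.96 * se

def coverage(reps=300, n=500, seed=3):
    """Fraction of seeded repetitions whose 95% interval covers psi0 = 0.3."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = plug_in_with_ci(simulate(n, rng))
        hits += lo <= 0.3 <= hi
    return hits / reps
```

With these settings the empirical coverage over the seeded repetitions should come out close to the nominal 0.95.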

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the roadmap for Targeted Learning. We have formulated a roadmap for Targeted Learning of a causal quantity that provides a transparent roadmap [2, 9, 21], involving the following steps:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [22–28] or the structural causal model [9]);

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution, under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true $P_0$;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ for the statistical model $\mathcal{M}$, for some underlying parameter space $\Theta$ and parameterization $\theta \rightarrow P_\theta$. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of $\theta_0$. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so-called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities: the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution $P_0$, say $Q_0 = Q(P_0)$, that can be represented as the minimizer of a criterion at the true data distribution $P_0$ over all candidate values $\{Q(P) : P \in \mathcal{M}\}$ for this part of the data distribution; we refer to this criterion as the risk $R_{P_0}(Q)$ of the candidate value $Q$.

Typically, the risk at a candidate parameter value $Q$ can be defined as the expectation under the data distribution of a loss function $(O, Q) \rightarrow L(Q)(O)$ that maps the unit data structure and the candidate parameter value into a real number: $R_{P_0}(Q) = E_{P_0} L(Q)(O)$. Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of $Q_0$ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve $(Q, g) \rightarrow D^*(Q(P), g(P))(O)$, identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution $P$, where this efficient influence curve does only depend on $P$ through $Q(P)$ and some nuisance parameter $g(P)$. Given an estimator $g_n$, one now defines a path $\{Q_{n,g_n}(\epsilon) : \epsilon\}$ with Euclidean parameter $\epsilon$ through the super-learner $Q_n$, whose score

  $\left. \frac{d}{d\epsilon} L(Q_{n,g_n}(\epsilon)) \right|_{\epsilon=0}$  (15)

at $\epsilon = 0$ spans the efficient influence curve $D^*(Q_n, g_n)$ at the initial estimator $(Q_n, g_n)$; this is called a least favorable parametric submodel through the super-learner.

In our running example, we have $Q = (\bar{Q}, Q_W)$, so that it suffices to construct a path through $\bar{Q}$ and $Q_W$ with corresponding loss functions and show that their scores span the efficient influence curve (11). We can define the path $\bar{Q}(\epsilon) = \bar{Q} + \epsilon C(g)$, where $C(g)(A, W) = (2A - 1)/g(A \mid W)$, and loss function $L(\bar{Q})(O) = -\{Y \log \bar{Q}(A, W) + (1 - Y) \log(1 - \bar{Q}(A, W))\}$. Note that

  $\left. \frac{d}{d\epsilon} L(\bar{Q}(\epsilon))(O) \right|_{\epsilon=0} = D_Y^*(\bar{Q}, g) = \frac{2A - 1}{g(A \mid W)} \left(Y - \bar{Q}(A, W)\right)$.  (16)

We also define the path $Q_W(\epsilon) = (1 + \epsilon D_W^*(Q)) Q_W$ with loss function $L(Q_W)(W) = -\log Q_W(W)$, where $D_W^*(Q)(O) = \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(Q)$. Note that

  $\left. \frac{d}{d\epsilon} L(Q_W(\epsilon)) \right|_{\epsilon=0} = D_W^*(Q)$.  (17)

Thus, if we define the sum loss function $L(Q) = L(\bar{Q}) + L(Q_W)$, then

  $\left. \frac{d}{d\epsilon} L(\bar{Q}(\epsilon_1), Q_W(\epsilon_2)) \right|_{\epsilon=0} = D^*(Q, g)$.  (18)

This proves that indeed these proposed paths through $\bar{Q}$ and $Q_W$ and corresponding loss functions span the efficient influence curve $D^*(Q, g) = D_W^*(Q) + D_Y^*(\bar{Q}, g)$ at $(Q, g)$, as required.

The dimension of $\epsilon$ can be selected to be equal to the dimension of the target parameter $\psi_0$, but, by creating extra components in $\epsilon$, one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an $\epsilon_1$ for the path through $\bar{Q}$ and a separate $\epsilon_2$ for the path through $Q_W$. In this case, the TMLE update $Q_n^*$ will solve two score equations, $P_n D_W^*(Q_n^*) = 0$ and $P_n D_Y^*(\bar{Q}_n^*, g_n) = 0$, and thus, in particular, $P_n D^*(Q_n^*, g_n) = 0$. In this example, the main benefit of using a bivariate $\epsilon = (\epsilon_1, \epsilon_2)$ is that the TMLE does not update $Q_{W,n}$ (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter $\epsilon$ of this path by minimizing the empirical risk $\epsilon \rightarrow P_n L(Q_{n,g_n}(\epsilon))$ along this path through the super-learner, resulting in an estimator $\epsilon_n$. This defines now an update of the super-learner fit, defined as $Q_n^1 = Q_{n,g_n}(\epsilon_n)$. This updating process is iterated till $\epsilon_n \approx 0$. The final update we will denote with $Q_n^*$, the TMLE of $Q_0$, and the target parameter mapping applied to $Q_n^*$ defines the TMLE of the target parameter $\psi_0$. This TMLE $Q_n^*$ solves the efficient influence curve equation $\sum_{i=1}^{n} D^*(Q_n^*, g_n)(O_i) = 0$, providing the basis, in combination with statistical properties of $(Q_n^*, g_n)$, for establishing that the TMLE $\Psi(Q_n^*)$ is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have $\epsilon_{1,n} = \arg\min_{\epsilon_1} P_n L(\bar{Q}_{n,g_n}^0(\epsilon_1))$, while $\epsilon_{2,n} = \arg\min_{\epsilon_2} P_n L(Q_{W,n}(\epsilon_2))$ equals zero. That is, the TMLE does not update $Q_{W,n}$, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case, $\bar{Q}_n^* = \bar{Q}_n^1$, since the convergence of the TMLE-algorithm occurs in one step, and, of course, $Q_{W,n}^* = Q_{W,n}$ is just the initial empirical distribution function of $W_1, \ldots, W_n$. The TMLE of $\psi_0$ is the substitution estimator $\Psi(Q_n^*)$.
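The targeting step for the running example can be sketched as follows. This is our own minimal implementation (all names hypothetical), assuming binary $Y$ and using the logistic fluctuation $\mathrm{logit}\,\bar{Q}_n(\epsilon) = \mathrm{logit}\,\bar{Q}_n + \epsilon H$ with clever covariate $H(A, W) = (2A - 1)/g(A \mid W)$, a common practical variant that keeps the update inside $(0, 1)$; its score at $\epsilon = 0$ spans the same efficient score term $D_Y^*$ as the path used in the exposition above:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def tmle_ate(data, q_bar, g, tol=1e-12, max_iter=100):
    """Targeted update of an initial regression fit q_bar (values in (0,1))
    for the additive treatment effect, given a treatment mechanism fit g.
    The fluctuation parameter eps is fit by Newton iterations on the
    log-likelihood; the updated fit then solves the efficient score
    equation sum_i h_i (y_i - Q*(a_i, w_i)) = 0."""
    h = lambda a, w: (2 * a - 1) / g(a, w)
    eps = 0.0
    for _ in range(max_iter):
        score = 0.0          # d/d(eps) of the log-likelihood
        info = 0.0           # minus its second derivative
        for (w, a, y) in data:
            p = expit(logit(q_bar(a, w)) + eps * h(a, w))
            score += h(a, w) * (y - p)
            info += h(a, w) ** 2 * p * (1 - p)
        if abs(score) < tol:
            break
        eps += score / info
    q_star = lambda a, w: expit(logit(q_bar(a, w)) + eps * h(a, w))
    # substitution estimator: plug the targeted fit into the parameter mapping
    psi = sum(q_star(1, w) - q_star(0, w) for (w, _, _) in data) / len(data)
    return psi, q_star
```

After convergence, the updated fit solves the efficient score equation $\sum_i H_i (Y_i - \bar{Q}_n^*(A_i, W_i)) = 0$, so the resulting substitution estimator inherits the asymptotic linearity and efficiency argued in Section 2.11.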

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller $Q_0$, defining a path and loss function with generalized score that spans the efficient influence curve, and running the corresponding iterative targeted minimum loss based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that $O_1, \ldots, O_n \sim_{iid} P_0$. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73], and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure $O = (L(0), A(0), \ldots, L(K), A(K), Y)$, where $L(0)$ are baseline covariates, $L(k)$ are time dependent covariates realized between intervention nodes $A(k-1)$ and $A(k)$, and $Y$ is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring", since the outcome $Y$ is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the postintervention distribution corresponding with a stochastic intervention $g^*$:

  $P_0^{g^*}(O) = \prod_{k=0}^{K+1} P_{0, L(k) \mid \bar{L}(k-1), \bar{A}(k-1)}\left(L(k) \mid \bar{L}(k-1), \bar{A}(k-1)\right) \prod_{k=0}^{K} g^*\left(A(k) \mid \bar{A}(k-1), \bar{L}(k)\right)$.  (19)

Note that this postintervention distribution is nothing else but the actual distribution of $O$, factorized according to the time-ordering, but with the true conditional distributions of $A(k)$, given $(\bar{L}(k), \bar{A}(k-1))$, replaced by the desired stochastic intervention. The statistical target parameter is thus $E_{P_0^{g^*}} Y$, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions $g^*$. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.
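As a concrete, entirely hypothetical illustration of (19), the sketch below evaluates the G-computation formula exactly for a binary two time point data structure $O = (L(0), A(0), L(1), A(1), Y)$ under a static intervention $(A(0), A(1)) = (a_0, a_1)$, by summing the product of the conditional laws over all histories:

```python
from itertools import product

# Toy conditional laws playing the role of the factors in (19), all binary.
p_l0 = {1: 0.5, 0: 0.5}                 # P(L(0) = l0)

def p_l1(l1, l0, a0):                   # P(L(1) = l1 | L(0), A(0))
    p = 0.2 + 0.3 * l0 + 0.2 * a0
    return p if l1 == 1 else 1 - p

def p_y(y, l0, a0, l1, a1):             # P(Y = y | L(0), A(0), L(1), A(1))
    p = 0.1 + 0.2 * l0 + 0.25 * a0 + 0.15 * l1 + 0.2 * a1
    return p if y == 1 else 1 - p

def g_comp_mean(a0, a1):
    """Mean outcome under the static intervention (A(0), A(1)) = (a0, a1):
    the G-computation formula with the treatment factors set to point mass,
    summed over all covariate histories (l0, l1)."""
    total = 0.0
    for l0, l1 in product((0, 1), repeat=2):
        weight = p_l0[l0] * p_l1(l1, l0, a0)
        total += weight * p_y(1, l0, a0, l1, a1)
    return total
```

For this toy law the always-treat mean exceeds the never-treat mean, as built into the coefficients of `p_y`; in practice these conditional laws are of course unknown and must be estimated, which is where the TMLE machinery above enters.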

In many data sets, one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example, one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter $\Psi(P) = E(E(Y \mid A = 1, W) - E(Y \mid A = 0, W))$ as in our running example, but with $A$ now being the SNP in question and $W$ being the variables one wants to control for, while $Y$ is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of $E(Y \mid A, W) - E(Y \mid A = 0, W)$ onto a linear model such as $\beta A$ and use $\beta$ as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable, one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate, and construct simultaneous confidence intervals for the stacked variable importance measure, based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference, taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄0(A, W) = P0(Y = 1 | A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator satisfying this constraint, and the least favorable submodel Q̄(ε) should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.
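One way to build such a bound into the fluctuation step can be sketched as follows (a minimal sketch, with `delta`, `q_init`, and `h` all hypothetical stand-ins): fluctuate on the logistic scale after rescaling by the bound, so that every update automatically respects the constraint.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

delta = 0.05  # hypothetical known bound on P(Y = 1 | A, W)

def fluctuate(q_init, h, eps):
    """Logistic fluctuation of q_init rescaled by delta: the update stays
    inside (0, delta) for every eps, so the submodel respects the bound."""
    scaled = np.clip(q_init / delta, 1e-9, 1.0 - 1e-9)
    return delta * expit(logit(scaled) + eps * h)

q_init = np.array([0.010, 0.030, 0.049])  # initial estimates, all below delta
h = np.array([1.5, -2.0, 0.7])            # hypothetical clever covariate values

updated = {eps: fluctuate(q_init, h, eps) for eps in (-2.0, 0.0, 3.0)}
```

At eps = 0 the submodel returns the initial estimator, as a submodel through the initial estimator must, while arbitrarily large fluctuations never exceed the known bound.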

Targeted Estimation of the Nuisance Parameter ḡ0 in TMLE. Even though an asymptotically consistent estimator of ḡ0 yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating ḡ0 but also with respect to how well the resulting TMLE fits ψ0. Consider our running example. Suppose that among the components of W there is a component W* that is an almost perfect predictor of the treatment A but has no effect on the outcome. Inclusion of such a covariate W* in the fit of ḡ0 makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for W*, but in most finite samples adjustment for W* will hurt the practical performance of TMLE, and effort should be put into variables that are stronger confounders than W*. We developed a method for building an estimator of ḡ0 that uses as criterion the change in fit between the initial estimator of Q̄0 and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as the sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits ḡ0 in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Advances in Statistics

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that Q̄ and ḡ are not too erratic functions of (A, W). This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since if we use an overfitted initial estimator there is little reason to think that the ε that maximizes the fit of the update of the initial estimator along the least favorable parametric submodel will still do a good job. Instead, one should use the ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.
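The honest, cross-validated choice of ε can be sketched as follows. This is an illustrative toy, not the authors' implementation: the fold-specific sample mean serves as a deliberately crude stand-in for an initial estimator fit on the training folds, and `h` stands in for the clever covariate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
y = rng.binomial(1, 0.3, size=n).astype(float)  # hypothetical binary outcome
h = rng.normal(size=n)                          # hypothetical clever covariate

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def log_loss(q, y):
    return -(y * np.log(q) + (1.0 - y) * np.log(1.0 - q))

folds = np.arange(n) % 5
eps_grid = np.linspace(-1.0, 1.0, 41)

def cv_risk(eps):
    """Cross-validated empirical mean of the loss at the eps-update, with the
    initial estimator fit on the training folds only."""
    risks = []
    for k in range(5):
        train, val = folds != k, folds == k
        q0 = np.clip(y[train].mean(), 0.01, 0.99)   # toy initial estimator
        q_eps = expit(logit(q0) + eps * h[val])     # update along the submodel
        risks.append(log_loss(q_eps, y[val]).mean())
    return float(np.mean(risks))

eps_star = eps_grid[int(np.argmin([cv_risk(e) for e in eps_grid]))]
```

Since `h` carries no information about `y` in this toy, the honest criterion keeps `eps_star` small, whereas maximizing training fit with an overfitted initial estimator could chase noise.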

Guaranteed Minimal Performance of TMLE. If the initial estimator is inconsistent but ḡ is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is no longer efficient. The desire for estimators to have a guarantee to beat certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of the Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true Q̄0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in Chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄ or ḡ is consistent. However, if one uses a data adaptive consistent estimator of ḡ0 (and thus with bias larger than 1/√n) and Q̄ is inconsistent, then the bias of the estimator of ḡ0 might directly map into a bias for the resulting TMLE of ψ0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating ḡ) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄ or ḡ is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLEs that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLEs now involve not only targeting Q̄ but also targeting ḡ, to guarantee that when Q̄ is misspecified the required smooth function of ḡ will behave as a TMLE, and that when ḡ is misspecified the required smooth functional of Q̄ is still asymptotically linear. The same method was used to develop an IPTW estimator that targets ḡ, so that the IPTW estimator is asymptotically linear with known influence curve even when the initial estimator of ḡ0 is a highly data adaptive estimator.

Super-Learning Based on a CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function evaluated at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that in our running example the treatment A is continuous and we are concerned with estimation of the dose-response curve a ↦ E₀ Q̄0(a, W), where Q̄0(a, W) = E₀(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses the CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
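For the simpler case of a known loss (squared error), the cross-validation selector that super-learner builds on can be sketched in a few lines; the data and the three-candidate library here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)  # hypothetical regression data

# A small library of candidate estimators: each maps training data
# to a prediction function.
def make_poly(degree):
    def fit(x_tr, y_tr):
        coef = np.polyfit(x_tr, y_tr, degree)
        return lambda x_new: np.polyval(coef, x_new)
    return fit

library = {
    "mean": lambda x_tr, y_tr: (lambda x_new, m=y_tr.mean(): np.full_like(x_new, m)),
    "linear": make_poly(1),
    "cubic": make_poly(3),
}

folds = np.arange(n) % 5

def cv_risk(fit):
    """V-fold cross-validated mean squared error of a candidate estimator."""
    risks = []
    for k in range(5):
        train, val = folds != k, folds == k
        pred = fit(x[train], y[train])
        risks.append(np.mean((y[val] - pred(x[val])) ** 2))
    return float(np.mean(risks))

risks = {name: cv_risk(fit) for name, fit in library.items()}
selected = min(risks, key=risks.get)
```

The section's point is that when the loss itself involves an unknown nuisance parameter, as for the dose-response risk above, this simple empirical mean of losses must be replaced by a (CV-)TMLE of the conditional risk.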

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle, and adapt to, any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, new challenges typically come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with the empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
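For reference, the common practice criticized above is the following influence-curve-based variance estimator (a minimal sketch; the influence-curve values and the point estimate are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
ic = rng.normal(scale=2.0, size=n)  # stand-in estimated influence-curve values
psi_hat = 0.25                      # stand-in TMLE point estimate

# Standard practice: the sample variance of the estimated influence curve,
# divided by n, gives the variance estimate; a Wald-type interval follows.
se = ic.std(ddof=1) / np.sqrt(n)
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
```

It is exactly this sample-mean-type plug-in, rather than a substitution (TMLE-type) estimator respecting the model constraints, that becomes anticonservative when influence curves are large under sparsity.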

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then there is naturally no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions and stationarity assumptions stating that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at the data before determining their choice of target parameter, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
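The sample-splitting alternative discussed above, which the CV-TMLE-based approach improves upon, can be made concrete in a few lines (the data and the parameter-generating rule are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 400, 6
shift = np.zeros(d)
shift[2] = 0.5                       # hypothetical signal in one coordinate
x = rng.normal(size=(n, d)) + shift  # hypothetical data matrix

half = n // 2
explore, confirm = x[:half], x[half:]

# Stage 1: mine the exploration half for an interesting target parameter,
# here the coordinate with the largest absolute sample mean.
j = int(np.argmax(np.abs(explore.mean(axis=0))))

# Stage 2: estimate the data-driven parameter on the untouched half, so the
# usual Wald interval remains valid for this data adaptive target parameter.
est = confirm[:, j].mean()
se = confirm[:, j].std(ddof=1) / np.sqrt(half)
ci = (est - 1.96 * se, est + 1.96 * se)
```

The price, as noted above, is that only half the sample supports the confidence interval, which is what motivates inference for data adaptive target parameters that reuses the full sample via cross-validation.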

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., the indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
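A minimal sketch of estimating such a rule follows; the data-generating process, the per-arm outcome regressions, and the rule itself are all illustrative stand-ins for the super-learner-based estimators discussed above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
w = rng.normal(size=n)                     # hypothetical baseline covariate
a = rng.binomial(1, 0.5, size=n)           # randomized binary treatment
# Hypothetical outcome: treatment helps only when w > 0.
y = 0.5 * a * (w > 0) + 0.1 * w + 0.2 * rng.normal(size=n)

# Fit a simple linear outcome regression within each treatment arm
# (a stand-in for arm-specific super-learner fits of E(Y | A = a, W)).
coef = {arm: np.polyfit(w[a == arm], y[a == arm], 1) for arm in (0, 1)}

def q_hat(arm, w_new):
    return np.polyval(coef[arm], w_new)

def rule(w_new):
    """Estimated individualized rule: treat when the predicted outcome
    under treatment exceeds the predicted outcome under control."""
    return (q_hat(1, w_new) > q_hat(0, w_new)).astype(int)

treat_frac = rule(w).mean()  # fraction of the sample the fitted rule treats
```

The mean outcome under this fitted rule is then itself a data adaptive target parameter, which is exactly the inferential problem addressed in [87].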

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the second order remainder term is o_P(1/√n). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q̄0 and ḡ0 converge to their targets converges to zero faster than 1/√n. The density estimation literature shows that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploit underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
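To make the remainder condition concrete, consider the treatment-specific mean ψ0 = E₀ Q̄0(1, W), writing Q̄0(A, W) = E₀(Y | A, W) for the outcome regression and g0(1 | W) = P0(A = 1 | W) for the treatment mechanism. The second order remainder then has the familiar double robust product form (a standard identity in this literature, stated up to sign conventions, rather than a result specific to this paper):

```latex
R_2\bigl(\hat{\bar{Q}},\hat{g}\bigr)
  = E_0\!\left[
      \frac{\hat{g}(1\mid W)-g_0(1\mid W)}{\hat{g}(1\mid W)}
      \Bigl(\hat{\bar{Q}}(1,W)-\bar{Q}_0(1,W)\Bigr)
    \right]
```

By the Cauchy-Schwarz inequality, R_2 = o_P(1/√n) as soon as the product of the L2 convergence rates of the two nuisance estimators is faster than n^{-1/2}, for example when both converge faster than n^{-1/4}; higher order influence functions aim to weaken exactly this requirement.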

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know whether the new data changes the inference about the target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that can update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started doing research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for an update is only a function of the size of the new chunk of data.
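The flavor of such constant-cost updating can be illustrated with the simplest possible example: a running mean and standard error updated chunk by chunk without revisiting old data. This is only an analogy for the online TMLE updates described above, not an online TMLE itself.

```python
import numpy as np

class OnlineEstimator:
    """Running mean with a Welford-style variance update: processing a new
    chunk costs O(chunk size), independent of how much data came before."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, chunk):
        for x in np.asarray(chunk, dtype=float):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def standard_error(self):
        return float(np.sqrt(self.m2 / (self.n - 1) / self.n))

est = OnlineEstimator()
for chunk in np.arange(10.0).reshape(5, 2):  # five arriving chunks of size 2
    est.update(chunk)                        # inference refreshed per chunk
```

After all chunks have arrived, the running mean and standard error agree with the batch computation on the full stream, but no old observation was ever revisited.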

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Research in progress on such issues as dependent data and data adaptive target parameters has also been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense, and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including the research question, assumptions and background knowledge, modeling, causal inference, and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to, or defined in terms of, (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graph and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation, in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective, the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers, and even advanced studies, which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only strengthened further, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, such as: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed here is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to the field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals; no inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; like a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of "The Future of Data Analysis": "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which was soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of parametric Fisherian statistics based on strong model assumptions, and it could well be stated that this heritage was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown true data-generating distribution. From a methodological point of view, it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept model is achieved. Then Targeted Learning involves a flexible data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learner algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression

to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do

justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
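The cross-validated weighting idea behind super learning described above can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the actual SuperLearner implementation: the two-candidate "library" (a constant-mean fit and a simple least-squares line) stands in for the diverse library of learners the text mentions, and all function names are invented for the example.

```python
# Toy sketch of super learning: out-of-fold (cross-validated) predictions for
# each candidate learner, then a convex weighting chosen to minimize the
# cross-validated squared-error loss.
import random

def fit_mean(xs, ys):
    """Candidate 1: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return lambda x, a=a, b=my - (sxy / sxx) * mx: a * x + b

def cv_predictions(xs, ys, fit, n_folds=5):
    """Out-of-fold predictions for one candidate learner."""
    n = len(ys)
    preds = [0.0] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(k, n, n_folds):          # the held-out fold
            preds[i] = predict(xs[i])
    return preds

random.seed(0)
xs = [random.uniform(0, 1) for _ in range(200)]
ys = [2 * x + random.gauss(0, 0.1) for x in xs]   # the truth is linear

z_mean = cv_predictions(xs, ys, fit_mean)
z_lin = cv_predictions(xs, ys, fit_linear)

def cv_loss(w):
    """Cross-validated MSE of the convex combination w*mean + (1-w)*linear."""
    return sum((y - (w * zm + (1 - w) * zl)) ** 2
               for y, zm, zl in zip(ys, z_mean, z_lin)) / len(ys)

w = min((i / 100 for i in range(101)), key=cv_loss)
weights = (w, 1 - w)   # (weight on mean fit, weight on linear fit)
print("super learner weights (mean, linear):", weights)
```

Because the simulated truth is linear, the cross-validated loss pushes essentially all the weight onto the linear candidate, which is exactly the behavior that makes the weighting less subjective than a human picking one technique.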

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field, often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
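A back-of-the-envelope computation makes the empty-strata point concrete. The sample size and covariate dimension below are illustrative assumptions, not numbers from the paper: with only 40 binary covariates there are already about 10^12 covariate strata, so even a large sample occupies a vanishing fraction of them, and the stratum-wise empirical mean is undefined almost everywhere.

```python
# Toy illustration of the empty-strata problem: count how many covariate
# strata are actually observed when the number of possible strata (2^d for
# d binary covariates) dwarfs the sample size.
import random

random.seed(1)
n, d = 100_000, 40            # illustrative sample size and covariate dimension
total_strata = 2 ** d         # ~1.1e12 possible covariate patterns

# Each observation's covariate pattern, collected as a set of distinct strata.
observed = {tuple(random.getrandbits(1) for _ in range(d)) for _ in range(n)}

print(f"observed strata: {len(observed)} of {total_strata}")
print(f"fraction of strata containing any data: {len(observed) / total_strata:.2e}")
```

Almost every draw lands in its own stratum, so roughly n of the 2^40 strata contain a single observation and the rest contain none; no within-stratum mean outcome can be computed there without smoothing.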

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing to not be restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in the causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data, whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves and state-of-the-art advances in weak convergence theory are more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.
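As a baseline for the influence-curve-based asymptotics discussed above, here is a minimal i.i.d. sketch of the Wald-type confidence interval that this theory generalizes to the dependent-data setting. It uses the simplest possible target parameter, the mean, whose efficient influence curve at the data distribution is IC(x) = x - E[X]; the data and numbers are invented for illustration.

```python
# Influence-curve-based standard error and 95% Wald-type confidence interval
# for the sample mean in the i.i.d. case: estimate the influence curve values,
# and use their sample variance to standardize the estimator.
import math
import random

random.seed(2)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]  # simulated i.i.d. draws

n = len(data)
est = sum(data) / n                       # plug-in estimator of the mean
ic = [x - est for x in data]              # estimated influence curve values
se = math.sqrt(sum(v * v for v in ic) / n) / math.sqrt(n)
ci = (est - 1.96 * se, est + 1.96 * se)   # Wald-type 95% confidence interval
print(f"estimate {est:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

For the mean this reduces to the familiar standard error sd/sqrt(n); the point of the influence-curve formulation is that the same recipe carries over to complex target parameters and, with the weak convergence theory mentioned above, to structured dependent data where the bootstrap is unavailable.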

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and scientific knowledge, which allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and J. Mark van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[983093983097] S Gruber and M J van der Laan ldquoConsistent causal effectestimation under dual misspeci1047297cation and implications orconounder selection procedurerdquo Statistical Methods in Medical Research 983090983088983089983090

[983094983088] M Petersen J Schwab S Gruber N Blaser M Schomaker andM Jvan derLaanldquoargetedminimum loss based estimation o marginal structural working modelsrdquo ech Rep 983091983089983090 University o Caliornia Berkeley Cali USA 983090983088983089983091

[983094983089] J Brooks M J van der Laan D E Singer and A S Goldquoargeted minimum loss-based estimation o causal effects inright-censored survival data with time-dependent covariateswararin stroke and death in atrial 1047297brillationrdquo Journal of Causal Inference vol 983089 no 983090 pp 983090983091983093ndash983090983093983092 983090983088983089983091

[983094983090] JBrooks M Jvan der Laan and A S Go ldquoargeted maximumlikelihood estimation or prediction calibrationrdquo International Journal of Biostatistics vol 983096 article 983091983088 no 983089 983090983088983089983090

[983094983091] S Sapp M J van der Laan and K Page ldquoargeted estimationo variable importance measures with interval-censored out-comesrdquo ech Rep 983091983088983095 University o Caliornia Berkeley CaliUSA 983090983088983089983091

[983094983092] R Neugebauer J A Schmittdiel and M J van der Laanldquoargeted learning in real-world comparative effectivenessresearch with time-varying interventionsrdquo ech RepHHSA983090983097983088983090983088983088983093983088983088983089983094I Te Agency or Healthcare Researchand Quality 983090983088983089983091

[983094983093] S D Lendle M S Subbaraman and M J van der LaanldquoIdenti1047297cation and efficient estimation o the natural directeffect among the untreatedrdquo Biometrics vol 983094983097 no 983090 pp 983091983089983088ndash983091983089983095 983090983088983089983091

[983094983094] S D Lendle B Fireman and M J van der Laan ldquoargetedmaximum likelihood estimation in saety analysisrdquo Journal of Clinical Epidemiology vol 983094983094 no 983096 pp S983097983089ndashS983097983096 983090983088983089983091

[983094983095] M S Subbaraman S Lendle M van der Laan L A Kaskutas

and J Ahern ldquoCravings as a mediator and moderator o drinking outcomes in the COMBINE studyrdquo Addiction vol 983089983088983096no 983089983088 pp 983089983095983091983095ndash983089983095983092983092 983090983088983089983091

[983094983096] S D Lendle B Fireman and M J van der Laan ldquoBalancingscore adjusted targeted minimum loss-based estimationrdquo 983090983088983089983091

[983094983097] W Zheng M L Petersen and M J van der Laan ldquoEstimatingthe effect o a community-based intervention with twocommu-nitiesrdquo Journal of Causal Inference vol 983089 no 983089 pp 983096983091ndash983089983088983094 983090983088983089983091

[983095983088] W Zheng and M J van der Laan ldquoargeted maximum likeli-hood estimation o natural direct effectsrdquo International Journal of Biostatistics vol 983096 no 983089 983090983088983089983090

[983095983089] W Zheng and M J van der Laan ldquoCausal mediation in asurvival setting with time-dependent mediatorsrdquo echnicalReport 983090983097983093 Division o Biostatistics University o CaliorniaBerkeley Cali USA 983090983088983089983090

[983095983090] M Carone M Petersen and M J van der Laan ldquoargetedminimum loss based estimation o a casual effect using intervalcensored time to event datardquo in Interval Censored ime to Event Data Methods and Applications D-G Chen J Sun and K E Peace Eds Chapman amp HallCRC New York NY USA 983090983088983089983090

[983095983091] D O Scharstein A Rotnitzky and J M Robins ldquoAdjustingor nonignorable drop-out using semiparametric nonresponsemodels (with discussion and rejoinder)rdquo Journal of the Ameri-can Statistical Association vol 983097983092 pp 983089983088983097983094ndash983089983089983090983088 983089983097983097983097

[983095983092] H Bang andJ M Robins ldquoDoubly robust estimation in missingdata and causal inerence modelsrdquo Biometrics vol 983094983089 no 983092 pp983097983094983090ndash983097983095983090 983090983088983088983093

[983095983093] A Chambaz N Pierre and M J van der Laan ldquoEstimation o a non-parametric variable importance measure o a continuousexposurerdquo Electronic Journal of Statistic vol 983094 pp 983089983088983093983097ndash983089983088983097983097983090983088983089983090

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of nonlinear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


Advances in Statistics

we call statistics and makes it impossible to teach statistics as a scientific discipline, even though the foundations of statistics, including a very rich theory, are purely scientific. That is, our field has suffered from a disconnect between the theory of statistics and the practice of statistics, while practice should be driven by relevant theory and theoretical developments should be driven by practice. For example, a theorem establishing consistency and asymptotic normality of a maximum likelihood estimator for a parametric model that is known to be misspecified is not a relevant theorem for practice, since the true data generating distribution is not captured by this theorem.

Defining the statistical model to actually contain the true probability distribution has enormous implications for the development of valid estimators. For example, maximum likelihood estimators are now ill defined due to the curse of dimensionality of the model. In addition, even regularized maximum likelihood estimators are seriously flawed: a general problem with maximum likelihood based estimators is that the maximum likelihood criterion only cares about how well the density estimator fits the true density, resulting in a wrong trade-off for the actual target parameter of interest. From a practical perspective, when we use AIC, BIC, or cross-validated log-likelihood to select variables in our regression model, then that procedure is ignorant of the specific feature of the data distribution we want to estimate. That is, in large statistical models it is immediately apparent that estimators need to be targeted towards their goal, just like a human being learns the answer to a specific question in a targeted manner, and maximum likelihood based estimators fail to do that.

In Section 3 we review the roadmap for Targeted Learning of a causal quantity, involving defining a causal model and causal quantity of interest, establishing an estimand of the data distribution that equals the desired causal quantity under additional causal assumptions, applying the purely statistical Targeted Learning of the relevant estimand based on a statistical model compatible with the causal model but for sure containing the true data distribution, and careful interpretation of the results. In Section 4 we proceed with describing our proposed targeted minimum loss based estimation (TMLE) template, which represents a concrete template for the construction of targeted efficient substitution estimators that are not only asymptotically consistent, asymptotically normally distributed, and asymptotically efficient, but also tailored to have robust finite sample performance. Subsequently, in Section 5 we review some of our most important advances in Targeted Learning, demonstrating the remarkable power and flexibility of this TMLE methodology, and in Section 6 we describe future challenges and areas of research. In Section 7 we provide a historical philosophical perspective on Targeted Learning. Finally, in Section 8 we conclude with some remarks putting Targeted Learning in the context of the modern era of Big Data.

We refer to our papers and book on Targeted Learning for overviews of relevant parts of the literature that put our specific contributions within the field of Targeted Learning in the context of the current literature, thereby allowing us to focus on Targeted Learning itself in the current outlook paper.

2. Targeted Learning

Our research takes place in a subfield of statistics we named Targeted Learning [1, 2]. In statistics, the data O_1, …, O_n on n units is viewed as a realization of a random variable, or, equivalently, an outcome of a particular experiment, and thereby has a probability distribution P_0, often called the data distribution. For example, one might observe O_i = (W_i, A_i, Y_i) on a subject, where W_i are baseline characteristics of the subject, A_i is a binary treatment or exposure the subject received, and Y_i is a binary outcome of interest, such as an indicator of death, i = 1, …, n. Throughout this paper we will use this data structure to demonstrate the concepts and estimation procedures.

2.1. Statistical Model. A statistical model M is defined as a set of possible probability distributions for the data distribution and thus represents the available statistical knowledge about the true data distribution P_0. In Targeted Learning, this core definition of the statistical model is fully respected in the sense that one should define the statistical model to contain the true data distribution: P_0 ∈ M. So, contrary to the often conveniently used slogan "All models are wrong, but some are useful" and the erosion over time of the original true meaning of a statistical model throughout applied research, Targeted Learning defines the model for what it actually is [3]. If there is truly no statistical knowledge available, then the statistical model is defined as all data distributions. A possible statistical model is the model that assumes that O_1, …, O_n are n independent and identically distributed random variables with completely unknown probability distribution P_0, representing the case that the sampling of the data involved repeating the same experiment independently. In our example, this would mean that we assume that (W_i, A_i, Y_i) are independent with a completely unspecified common probability distribution. For example, if W is 10-dimensional, while A and Y add two more dimensions, then P_0 is described by a 12-dimensional density, and this statistical model does not put any restrictions on this 12-dimensional density. One could factorize this density of O = (W, A, Y) as follows:

p_0(W, A, Y) = p_{W,0}(W) p_{A|W,0}(A | W) p_{Y|A,W,0}(Y | A, W),   (1)

where p_{W,0} is the density of the marginal distribution of W, p_{A|W,0} is the conditional density of A, given W, and p_{Y|A,W,0} is the conditional density of Y, given A, W. In this model each of these factors is unrestricted. On the other hand, suppose now that the data is generated by a randomized controlled trial in which we randomly assign treatment A ∈ {0, 1} with probability 0.5 to a subject. In that case, the conditional density of A, given W, is known, but the marginal distribution of the covariates and the conditional distribution of the outcome, given covariates and treatment, might still be unrestricted. Even in an observational study, one might know that treatment decisions were only based on a small subset of the available covariates W, so that it is known that p_{A|W,0}(1 | W) only depends on W through these few covariates. In the case that death Y = 1 represents a rare event, it might also be known that the probability of death p_{Y|A,W,0}(1 | A, W) is between 0 and some small number (e.g., 0.03). This restriction should then be included in the model M.
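As a concrete illustration of factorization (1), the following Python sketch (a minimal simulation with made-up coefficients, not taken from the paper) draws n i.i.d. copies of O = (W, A, Y) one factor at a time; in the randomized-trial variant the treatment mechanism p_{A|W}(1 | W) = 0.5 is known by design, and the rare-outcome restriction that the death probability stays below 0.03 is built into the last factor.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_data(n, randomized=True):
    """Sample O = (W, A, Y) factor by factor: p(W) p(A | W) p(Y | A, W)."""
    W = rng.normal(size=n)                    # marginal distribution of W
    if randomized:
        pA = np.full(n, 0.5)                  # RCT: p_{A|W}(1 | W) known by design
    else:
        pA = 1 / (1 + np.exp(-0.8 * W))       # observational: unknown mechanism
    A = rng.binomial(1, pA)
    # Rare outcome: conditional death probability bounded by 0.03,
    # a restriction the model M would encode.
    pY = 0.03 / (1 + np.exp(-(0.5 * A + W)))
    Y = rng.binomial(1, pY)
    return W, A, Y

W, A, Y = draw_data(100_000)
```

Sampling factor by factor mirrors the factorization exactly: knowledge about one factor (here the treatment mechanism) restricts that factor alone and leaves the other factors unrestricted.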

In various applications, careful understanding of the experiment that generated the data might show that even these rather large statistical models, assuming the data generating experiment equals the independent repetition of a common experiment, are too small to be true: see [4–8] for models in which O_1, …, O_n is a joint random variable described by a single experiment, which nonetheless involves a variety of conditional independence assumptions. That is, the typical statement that O_1, …, O_n are independent and identically distributed (i.i.d.) might already represent a wrong statistical model. For example, in a community randomized trial, it is often the case that the treatments are assigned by the following type of algorithm: based on the characteristics (W_1, …, W_n), one first applies an algorithm that aims to split the n communities in n/2 pairs that are similar with respect to baseline characteristics; subsequently, one randomly assigns treatment and control to each pair. Clearly, even when the communities would have been randomly sampled from a target population of communities, the treatment assignment mechanism creates dependence, so that the data generating experiment cannot be described as an independent repetition of experiments; see [7] for a detailed presentation.

In a study in which one observes a single community of n interconnected individuals, one might have that the outcome Y_i for subject i is not only affected by the subject's past (W_i, A_i) but also affected by the covariates and treatments of friends of subject i. Knowing the friends of each subject would now impose strong conditional independence assumptions on the density of the data (O_1, …, O_n), but one cannot assume that the data is a result of n independent experiments: in fact, as in the community randomized trial example, such data sets have sample size 1, since the data can only be described as the result of a single experiment [8].

In group sequential randomized trials, one often may use a randomization probability for the next recruited ith subject that depends on the observed data of the previously recruited and observed subjects O_1, …, O_{i−1}, which makes the treatment assignment A_i a function of O_1, …, O_{i−1}. Even when the subjects are sampled randomly from a target population, this type of dependence between treatment A_i and the past data O_1, …, O_{i−1} implies that the data is the result of a single large experiment (again, the sample size equals 1) [4–6].

Indeed, many realistic statistical models only involve independence and conditional independence assumptions and known bounds (e.g., it is known that the observed clinical outcome is bounded in [0, 1], or the conditional probability of death is bounded between 0 and a small number). Either way, whether the data distribution is described by a sequence of independent (and possibly identical) experiments or by a single experiment satisfying a variety of conditional independence restrictions, parametric models, though representing common practice, are practically always invalid statistical models, since such knowledge about the data distribution is essentially never available.

An important by-product of requiring that the statistical model be truthful is that one is forced to obtain as much knowledge as possible about the experiment before committing to a model, which is precisely the role a good statistician should play. On the other hand, if one commits to a parametric model, then why would one still bother trying to find out the truth about the data generating experiment?

2.2. Target Parameter. The target parameter is defined as a mapping Ψ : M → R that maps the data distribution into the desired finite dimensional feature of the data distribution one wants to learn from the data: ψ_0 = Ψ(P_0). This choice of target parameter requires careful thought, independent from the choice of statistical model, and is not a choice made out of convenience. The use of parametric or semiparametric models such as the Cox proportional hazards model is often accompanied with the implicit statement that the unknown coefficients represent the parameter of interest. Even in the unrealistic scenario that these small statistical models would be true, there is absolutely no reason why the very parametrization of the data distribution should correspond with the target parameter of interest. Instead, the statistical model M and the choice of target parameter Ψ : M → R are two completely separate choices, and by no means should one imply the other. That is, the statistical knowledge about the experiment that generated the data and defining what we hope to learn from the data are two important key steps in science that should not be convoluted. The true target parameter value ψ_0 is obtained by applying the target parameter mapping Ψ to the true data distribution P_0 and represents the estimand of interest.

For example, if O_i = (W_i, A_i, Y_i) are independent and have common probability distribution P_0, then one might define the target parameter as an average of the conditional W-specific treatment effects:

ψ_0 = Ψ(P_0) = E_0 [E_0(Y | A = 1, W) − E_0(Y | A = 0, W)].   (2)

By using that Y is binary, this can also be written as follows:

ψ_0 = ∑_w {p_{Y|A,W,0}(1 | A = 1, W = w) − p_{Y|A,W,0}(1 | A = 0, W = w)} p_{W,0}(w),   (3)

where p_{Y|A,W,0}(1 | A = a, W = w) denotes the true conditional probability of death, given treatment A = a and covariate W = w.
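For a covariate W with finitely many values, formula (3) can be evaluated directly. The following sketch uses a hypothetical table of conditional death probabilities and marginal covariate probabilities (our own numbers, purely for illustration):

```python
# Hypothetical discrete example: W takes values 0, 1, 2 with known marginal
# probabilities p_W(w), and p_{Y|A,W}(1 | a, w) is given by a table.
pW = {0: 0.5, 1: 0.3, 2: 0.2}
pY = {(1, 0): 0.30, (0, 0): 0.20,   # (a, w): p(Y = 1 | A = a, W = w)
      (1, 1): 0.40, (0, 1): 0.25,
      (1, 2): 0.50, (0, 2): 0.35}

# Formula (3): psi_0 = sum_w [p(1 | A=1, w) - p(1 | A=0, w)] p_W(w)
psi0 = sum((pY[(1, w)] - pY[(0, w)]) * pW[w] for w in pW)
# psi0 is 0.10*0.5 + 0.15*0.3 + 0.15*0.2, approximately 0.125
```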

For example, suppose that the true conditional probability of death is given by some logistic function:

p_{Y|A,W,0}(1 | A, W) = 1 / (1 + exp(−f_0(A, W)))   (4)

for some function f_0 of treatment A and covariates W. The reader can plug in a possible form for f_0, such as f_0(A, W) = 0.3 + 0.2W_1 + 0.1W_1W_2 + AW_1W_2W_3. Given this function f_0, the true value ψ_0 is computed by the above formula as follows:

ψ_0 = ∑_w (1 / (1 + exp(−f_0(1, w))) − 1 / (1 + exp(−f_0(0, w)))) p_{W,0}(w).   (5)

This parameter ψ_0 has a clear statistical interpretation as the average of all the w-specific additive treatment effects p_{Y|A,W,0}(1 | A = 1, W = w) − p_{Y|A,W,0}(1 | A = 0, W = w).

2.3. The Important Role of Models Also Involving Nontestable Assumptions. However, this particular statistical estimand ψ_0 has an even richer interpretation if one is willing to make additional, so-called causal (nontestable) assumptions. Let us assume that W, A, Y are generated by a set of so-called structural equations:

W = f_W(U_W),
A = f_A(W, U_A),
Y = f_Y(W, A, U_Y),   (6)

where U = (U_W, U_A, U_Y) are random inputs following a particular unknown probability distribution, while the functions f_W, f_A, f_Y deterministically map the realization of the random input U = u sequentially into a realization of W = f_W(u_W), A = f_A(W, u_A), Y = f_Y(W, A, u_Y). One might not make any assumptions about the form of these functions f_W, f_A, f_Y. In that case, these causal assumptions put no restrictions on the probability distribution of O = (W, A, Y), but through these assumptions we have parametrized P_0 by a choice of functions (f_W, f_A, f_Y) and a choice of distribution of U. Pearl [9] refers to such assumptions as a structural causal model for the distribution of O = (W, A, Y).

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing A = f_A(W, U_A) by our desired intervention on the intervention node A. For example, a static intervention A = 1 results in a new system of equations: W = f_W(U_W), A = 1, Y_1 = f_Y(W, 1, U_Y), where this new random variable Y_1 is called a counterfactual outcome or potential outcome corresponding with intervention A = 1. Similarly, one can define Y_0 = f_Y(W, 0, U_Y). Thus, Y_0 (Y_1) represents the outcome one would have seen on the subject if the subject had been assigned treatment A = 0 (A = 1). One might now define the causal effect of interest as E_0 Y_1 − E_0 Y_0, that is, the difference between the expected outcome of Y_1 and the expected outcome of Y_0. If one also assumes that A is independent of U_Y, given W, which is often referred to as the assumption of no unmeasured confounding or the randomization assumption, then it follows that ψ_0 = E_0 Y_1 − E_0 Y_0. That is, under the structural causal model including this no unmeasured confounding assumption, ψ_0 can not only be interpreted purely statistically as an average of conditional treatment effects, but it actually equals the marginal additive causal effect.
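The equality ψ_0 = E_0 Y_1 − E_0 Y_0 under randomization can be checked by simulation. The sketch below (hypothetical structural equations, our own choice of f_Y) generates the exogenous inputs U, evaluates the structural equations, and then re-evaluates f_Y with the intervention node set to 1 and to 0 while reusing the same U_Y:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000
# Hypothetical structural equations W = f_W(U_W), A = f_A(W, U_A),
# Y = f_Y(W, A, U_Y); A is randomized, so A is independent of U_Y given W.
U_W = rng.normal(size=n)
U_A = rng.uniform(size=n)
U_Y = rng.uniform(size=n)

W = U_W
A = (U_A < 0.5).astype(int)                       # randomized treatment
def f_Y(w, a, u):
    return (u < 1 / (1 + np.exp(-(0.5 * a + w)))).astype(int)
Y = f_Y(W, A, U_Y)

# Counterfactuals: intervene on A, reuse the same exogenous input U_Y.
Y1, Y0 = f_Y(W, 1, U_Y), f_Y(W, 0, U_Y)
causal = Y1.mean() - Y0.mean()                    # E Y_1 - E Y_0

# Statistical estimand (2), computed from the observed-data distribution.
Qbar0 = lambda a: 1 / (1 + np.exp(-(0.5 * a + W)))
psi0 = np.mean(Qbar0(1) - Qbar0(0))
assert abs(causal - psi0) < 0.01                  # agree under randomization
```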

In general, causal models, or, more generally, sets of nontestable assumptions, can be used to define underlying target quantities of interest and corresponding statistical target parameters that equal this target quantity under these assumptions. Well known classes of such models are models for censored data, in which the observed data is represented as a many to one mapping on the full data of interest and the censoring variable, and the target quantity is a parameter of the full data distribution. Similarly, causal inference models represent the observed data as a mapping on counterfactuals and the observed treatment (either explicitly, as in the Neyman–Rubin model, or implicitly, as in the Pearl structural causal models), and one defines the target quantity as a parameter of the distribution of the counterfactuals. One is now often concerned with providing sets of assumptions on the underlying distribution (i.e., of the full data) that allow identifiability of the target quantity from the observed data distribution (e.g., coarsening at random or the randomization assumption). These nontestable assumptions do not change the statistical model M and, as a consequence, once one has defined the relevant estimand ψ_0, do not affect the estimation problem either.

2.4. Estimation Problem. The estimation problem is defined by the statistical model (i.e., O_1, …, O_n ~ P_0 ∈ M) and the choice of target parameter (i.e., Ψ : M → R). Targeted Learning is now the field concerned with the development of estimators of the target parameter that are asymptotically consistent as the number of units n converges to infinity, and whose appropriately standardized version (e.g., √n(ψ_n − ψ_0)) converges in probability distribution to some limit probability distribution (e.g., normal distribution), so that one can construct confidence intervals that, for large enough sample size n, contain the true value of the target parameter with a user supplied high probability. In the case that O_1, …, O_n are i.i.d. with common distribution P_0, a common method for establishing asymptotic normality of an estimator is to demonstrate that the estimator minus the truth can be approximated by an empirical mean of a function of O_i. Such an estimator is called asymptotically linear at P_0. Formally, an estimator ψ_n is asymptotically linear under i.i.d. sampling from P_0 if ψ_n − ψ_0 = (1/n) ∑_{i=1}^n IC(P_0)(O_i) + o_P(1/√n), where O → IC(P_0)(O) is the so-called influence curve at P_0. In that case, the central limit theorem teaches us that √n(ψ_n − ψ_0) converges to a normal distribution N(0, σ²) with variance σ² = E_0 IC(P_0)(O)², defined as the variance of the influence curve. An asymptotic 0.95 confidence interval for ψ_0 is then given by ψ_n ± 1.96 σ_n/√n, where σ_n² is the sample variance of an estimate IC_n(O_i) of the true influence curve values IC(P_0)(O_i), i = 1, …, n.

The empirical mean of the influence curve IC(P_0) of an estimator represents the first order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals of the empirical distribution [10–12]. That is, the influence curve IC(P_0)(O) of an estimator, viewed as a mapping from the empirical distribution P_n into the estimated value Ψ̂(P_n), is defined as the directional derivative at P_0 in the direction (P_{n=1} − P_0), where P_{n=1} is the empirical distribution at a single observation O.
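For the simplest possible example, take ψ_0 = E_0 O for a univariate O: the sample mean is asymptotically linear with influence curve IC(P_0)(O) = O − ψ_0, and the recipe above gives the familiar Wald interval. A minimal sketch, with simulated data from an assumed exponential distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
O = rng.exponential(scale=2.0, size=n)       # hypothetical data; psi_0 = E[O] = 2

psi_n = O.mean()                             # asymptotically linear estimator
IC_n = O - psi_n                             # estimated influence curve values
se = np.sqrt(np.mean(IC_n ** 2) / n)         # sigma_n / sqrt(n)
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)  # asymptotic 0.95 confidence interval
```

The same template, an estimate plus and minus 1.96 times the estimated standard deviation of the influence curve over √n, carries over to the targeted estimators discussed below, with the efficient influence curve in the role of IC.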

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model. Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so-called tangent space generated by all scores of parametric submodels through P_0 [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance, by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, the so-called targeted minimum loss based estimation discussed below, does not only result in asymptotically efficient estimators; the estimators also (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15–19], (2) focus on the construction of substitution estimators that by definition also fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted computer friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the n observations are independent and identically distributed, O_i ~iid P_0 ∈ M, so that Ψ : M → R can now be defined as a parameter on the common distribution of O_i; but each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator. Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part Q_0 = Q(P_0) of the data distribution P_0 (e.g., a factor of the likelihood), then a substitution estimator can be represented as Ψ(Q_n), where Q_n is an estimator of Q_0 that is contained in the parameter space {Q(P) : P ∈ M} implied by the statistical model M. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., it is a probability or a difference between two probabilities) as well as known bounds on the data distribution implied by the model M.

In our running example, we can define Q_0 = (Q_{W,0}, Q̄_0), where Q_{W,0} is the probability distribution of W under P_0 and Q̄_0(A, W) = E_0(Y | A, W) is the conditional mean of the outcome, given the treatment and covariates, and represent the target parameter as

ψ_0 = Ψ(Q_0) = E_0 [Q̄_0(1, W) − Q̄_0(0, W)]   (7)

as a function of the conditional mean Q̄_0 and the probability distribution Q_{W,0} of W. The model M might restrict Q̄_0 to be between 0 and a small number δ < 1, but otherwise puts no restrictions on Q_0. A substitution estimator is now obtained by plugging in the empirical distribution Q_{W,n} for Q_{W,0} and a data adaptive estimator 0 < Q̄_n < δ of the regression Q̄_0:

ψ_n = Ψ(Q_n) = (1/n) ∑_{i=1}^n [Q̄_n(1, W_i) − Q̄_n(0, W_i)].   (8)

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of ψ_0 could be defined as

ψ_n^{IPTW} = (1/n) ∑_{i=1}^n ((2A_i − 1) / g_n(A_i | W_i)) Y_i,   (9)

where g_n(· | W) is an estimator of the conditional probability of treatment g_0(· | W). This is clearly not a substitution estimator. In particular, if g_n(A_i | W_i) is very small for some observations, this estimator might not be between −1 and 1 and thus completely ignores known constraints.
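The contrast between (8) and (9) is easy to see in simulation. The sketch below (our own data generating choices; for simplicity the true regression and treatment mechanism are plugged in rather than estimated) shows that the substitution estimator is an average of differences of probabilities and hence automatically respects the bounds [−1, 1], while the IPTW estimator has no such guarantee when g_n(A_i | W_i) gets small:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
W = rng.normal(size=n)
g0 = 1 / (1 + np.exp(-2.0 * W))              # treatment depends strongly on W
A = rng.binomial(1, g0)
Qbar0 = lambda a, w: 1 / (1 + np.exp(-(0.5 * a + w)))
Y = rng.binomial(1, Qbar0(A, W))

# Substitution estimator (8): average of Qbar(1, W_i) - Qbar(0, W_i);
# a difference of probabilities, so it lies in [-1, 1] by construction.
psi_sub = np.mean(Qbar0(1, W) - Qbar0(0, W))

# IPTW estimator (9): not a substitution estimator; the inverse weights
# 1 / g(A_i | W_i) can be huge, and the estimate can leave [-1, 1].
gA = np.where(A == 1, g0, 1 - g0)
psi_iptw = np.mean((2 * A - 1) / gA * Y)

assert -1 <= psi_sub <= 1                    # guaranteed; no analogue for IPTW
```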

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter. The construction of targeted estimators of the target parameter requires the construction of an estimator of infinite dimensional nuisance parameters: specifically, the initial estimator of the relevant part Q_0 of the data distribution in the TMLE, and the estimator of the nuisance parameter g_0 = g(P_0) that is needed to target the fit of this relevant part in the TMLE. In our running example, we have Q_0 = (Q_{W,0}, Q̄_0), and g_0 is the conditional distribution of A, given W.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter. In order to optimize these estimators of the nuisance parameters (Q_0, g_0), we use a so-called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector, which compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had an infinite validation sample available). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example, we have that Q̄_0 = arg min_{Q̄} E_0 L(Q̄)(O), where L(Q̄)(O) = (Y − Q̄(A, W))² is the squared error loss; one can also use the log-likelihood loss L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Usually, there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but should probably itself be data adaptively selected in a targeted manner. The cross-validated risk of a candidate estimator of Q̄_0 is then defined as the empirical mean, over a validation sample, of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so-called V-fold cross-validation, in which one first partitions the sample in V subsets of equal size, and each of the V subsets plays the role of a validation sample while its complement of V − 1 subsets equals the corresponding training sample. Thus, V-fold cross-validation results in V sample splits into a validation sample and corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for P(Y = 1 | A, W). Different choices of such logistic linear regression working models result in different possible candidate estimators. So, in this manner, one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of Q̄_0. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.

Similarly, one can define a super-learner of the conditional distribution of A, given W.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of the super-learner provides an important step in creating a robust estimator whose performance does not rely on being lucky, but on generating a library rich enough that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.
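A discrete super-learner (the cross-validation selector over a small library) can be sketched in a few lines. Here the library holds two hypothetical logistic working models that differ in their basis functions, the loss is squared error, and V = 5 folds are used; everything (data, models, coefficients) is our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, V = 2_000, 5
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * A + W + 0.3 * W ** 2))))

def fit_logistic(X, y, iters=200, lr=0.5):
    """Logistic regression by plain gradient ascent on the log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += lr * X.T @ (y - p) / len(y)
    return beta

# Library: two logistic working models with different basis functions.
designs = {
    "main-terms": lambda a, w: np.column_stack([np.ones_like(w), a, w]),
    "plus-W^2":  lambda a, w: np.column_stack([np.ones_like(w), a, w, w ** 2]),
}

folds = np.arange(n) % V                     # V-fold sample splits
cv_risk = {}
for name, design in designs.items():
    losses = []
    for v in range(V):
        train, val = folds != v, folds == v
        beta = fit_logistic(design(A[train], W[train]), Y[train])
        p = 1 / (1 + np.exp(-design(A[val], W[val]) @ beta))
        losses.append(np.mean((Y[val] - p) ** 2))   # squared error loss
    cv_risk[name] = np.mean(losses)

best = min(cv_risk, key=cv_risk.get)         # cross-validation selector
```

The full super-learner goes one step further and selects the best weighted combination of the library fits by the same cross-validated risk, which only enlarges the library.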

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so-called (mean zero) efficient influence curve D*(P_0)(O), up till a second order term that is asymptotically negligible [13]. That is, an estimator ψ_n is efficient if and only if it is asymptotically linear with influence curve IC(P_0) equal to the efficient influence curve D*(P_0):

ψ_n − ψ_0 = (1/n) ∑_{i=1}^n D*(P_0)(O_i) + o_P(1/√n).   (10)

The efficient influence curve is also called the canonical gradient and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter Ψ : M → R. Specifically, one defines a rich family of one-dimensional submodels {P(ε) : ε} through P at ε = 0, and one represents the pathwise derivative (d/dε)Ψ(P(ε))|_{ε=0} as an inner product P{D(P)S(P)} (the covariance operator in the Hilbert space L²_0(P) of functions of O with mean zero and inner product ⟨h_1, h_2⟩_P = ∫ h_1(o)h_2(o) dP(o)), where S(P) is the score of the path P(ε) and D(P) is a so called gradient. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through P, also called the tangent space at P, is the canonical gradient D*(P) at P. Indeed, the canonical gradient can be computed as the projection of any given gradient D(P) onto the tangent space in the Hilbert space L²_0(P). An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect Ψ : M → R is given by

D*(P_0)(O) = (2A − 1)/g_0(A | W) (Y − Q̄_0(A, W)) + Q̄_0(1, W) − Q̄_0(0, W) − Ψ(Q_0).   (11)
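As a sketch of how (11) is used in practice, the following illustrative code (not the authors' software; `Qbar` and `g1` denote user-supplied estimates of Q̄_0 and g_0(1 | W), and the function names are hypothetical) evaluates the efficient influence curve at given fits, forms the substitution estimator, and builds an influence-curve-based Wald-type confidence interval:

```python
import numpy as np

# Illustrative sketch: evaluate the efficient influence curve (11) at fits
# Qbar(a, W) of E(Y | A=a, W) and g1(W) of P(A=1 | W), form the plug-in
# estimator of the additive effect, and a Wald-type 95% CI based on the
# sample variance of the estimated influence curve.

def eic_ate(W, A, Y, Qbar, g1):
    """Return (plug-in psi, estimated D*(O_i) values)."""
    gA = np.where(A == 1, g1(W), 1 - g1(W))   # g(A | W)
    Q1, Q0, QA = Qbar(1, W), Qbar(0, W), Qbar(A, W)
    psi = np.mean(Q1 - Q0)                    # substitution estimator Psi(Q)
    Dstar = (2 * A - 1) / gA * (Y - QA) + Q1 - Q0 - psi
    return psi, Dstar

def wald_ci(psi, Dstar, z=1.96):
    """psi +/- z * sigma_n / sqrt(n), with sigma_n^2 = mean(D*^2)."""
    n = len(Dstar)
    se = np.sqrt(np.mean(Dstar ** 2) / n)
    return psi - z * se, psi + z * se
```
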

As noted earlier, the influence curve IC(P_0) of an estimator ψ_n also characterizes the limit variance σ_0² = P_0 IC(P_0)² of the mean zero normal limit distribution of √n(ψ_n − ψ_0). This variance σ_0² can be estimated with (1/n) Σ_{i=1}^n IC_n(O_i)², where IC_n is an estimator of the influence curve IC(P_0). Efficiency theory teaches us that, for any regular asymptotically linear estimator, its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, σ*_0² = P_0 D*(P_0)², which is also called the generalized Cramer-Rao lower bound. In our running example, the


Advances in Statistics

asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate D*_n(O_i) of D*(P_0)(O_i), obtained by plugging in the estimator g_n of g_0 and the estimator Q̄_n of Q̄_0, while Ψ(Q_0) is replaced by Ψ(Q_n) = (1/n) Σ_{i=1}^n (Q̄_n(1, W_i) − Q̄_n(0, W_i)).

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of O that depends on P_0 through Q_0 and a possible nuisance parameter g_0, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through P_0. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator Ψ(Q_n), beyond Q_n being an excellent estimator of Q_0 as achieved with super-learning, is that the estimator solves the so called efficient influence curve equation Σ_{i=1}^n D*(Q_n, g_n)(O_i) = 0 for a good estimator g_n of g_0. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models M typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation Pf ≡ ∫ f(o) dP(o) for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which P_0 D*(P) = Ψ(P_0) − Ψ(P), and, in general, as a consequence of D*(P) being a canonical gradient,

P_0 D*(P) = Ψ(P_0) − Ψ(P) + R(P, P_0),   (12)

where R(P, P_0) = R((Q̄, g), (Q̄_0, g_0)) is a term involving second order differences such as (Q̄ − Q̄_0)², (g − g_0)², and (Q̄ − Q̄_0)(g − g_0). This key property explains why solving P_0 D*(P) = 0 targets Ψ(P) to be close to Ψ(P_0) and thus explains why solving P_n D*(Q_n, g_n) = 0 targets Q_n to fit Ψ(Q_0).

In our running example, we have R(P, P_0) = R_1(P, P_0) − R_0(P, P_0), where R_a(P, P_0) = ∫ ((g − g_0)(a | w)/g(a | w)) (Q̄ − Q̄_0)(a, w) dP_0(w). So in our example the remainder R(P, P_0) only involves a cross-product difference (g − g_0)(Q̄ − Q̄_0). In particular, the remainder equals zero if either g = g_0 or Q̄ = Q̄_0, which is often referred to as double robustness of the efficient influence curve with respect to (Q̄, g) in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.

Due to this identity (12), an estimator P̃_n that solves P_n D*(P̃_n) = 0 and is in a local neighborhood of P_0, so that R(P̃_n, P_0) = o_P(1/√n), approximately satisfies Ψ(P̃_n) − Ψ(P_0) ≈ (P_n − P_0) D*(P̃_n), where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.
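The double robustness of the remainder can be illustrated with a small simulation. All data-generating choices below are hypothetical illustrations: with a deliberately misspecified Q̄ but the true g_0, the estimator that solves the efficient influence curve estimating equation still recovers ψ_0, while the plug-in estimator based on the same misspecified Q̄ does not.

```python
import numpy as np

# Simulation sketch of double robustness: the estimating-equation estimator
#   psi_n = P_n { (2A-1)/g(A|W) (Y - Qbar(A,W)) + Qbar(1,W) - Qbar(0,W) }
# stays consistent with a misspecified Qbar as long as g = g_0.

expit = lambda x: 1 / (1 + np.exp(-x))

rng = np.random.default_rng(7)
n = 50_000
W = rng.uniform(size=n)
g1 = expit(W - 0.5)                  # true treatment mechanism g_0(1 | W)
A = rng.binomial(1, g1)
Y = rng.binomial(1, expit(A + W))    # true Qbar_0(A, W) = expit(A + W)

# True ATE psi_0 = E_W[expit(1 + W) - expit(W)], by fine-grid integration.
wgrid = np.linspace(0, 1, 10_001)
psi0 = np.mean(expit(1 + wgrid) - expit(wgrid))

Qbad = np.full(n, Y.mean())          # misspecified Qbar: constant, ignores (A, W)
gA = np.where(A == 1, g1, 1 - g1)
psi_plugin = 0.0                     # plug-in with Qbad: Qbad(1,W) - Qbad(0,W) = 0
psi_dr = np.mean((2 * A - 1) / gA * (Y - Qbad))  # plus the plug-in term (zero here)

print(round(psi0, 3), round(psi_dr, 3), psi_plugin)
```

Here the plug-in with the misspecified Q̄ is badly biased, while the estimating-equation estimator is close to ψ_0, as the remainder analysis predicts.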

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining P_n D*(Q*_n, g_n) = 0 with (12) at P = (Q*_n, g_n) yields

Ψ(Q*_n) − Ψ(Q_0) = (P_n − P_0) D*(Q*_n, g_n) + R_n,   (13)

where R_n is a second order term. Thus, if second order differences such as (Q̄_n − Q̄_0)², (Q̄_n − Q̄_0)(g_n − g_0), and (g_n − g_0)² converge to zero at a rate faster than 1/√n, then it follows that R_n = o_P(1/√n). To make this assumption as reasonable as possible, one should use super-learning for both Q̄_n and g_n. In addition, empirical process theory teaches us that (P_n − P_0) D*(Q*_n, g_n) = (P_n − P_0) D*(Q_0, g_0) + o_P(1/√n) if P_0{D*(Q*_n, g_n) − D*(Q_0, g_0)}² converges to zero in probability as n converges to infinity (a consistency condition) and if D*(Q*_n, g_n) falls in a so called Donsker class of functions O → f(O) [11]. An important Donsker class is the class of all d-variate real valued functions that have a uniform sectional variation norm bounded by some universal M < ∞; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this M < ∞. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as n converges to infinity. So, under this Donsker class condition R_n = o_P(1/√n) and the consistency condition, we have

ψ_n − ψ_0 = (1/n) Σ_{i=1}^n D*(Q_0, g_0)(O_i) + o_P(1/√n).   (14)

That is, ψ_n is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of Ψ(Q*_n) up till a second order term R_n, and even asymptotic efficiency if R_n = o_P(1/√n), under some weak regularity conditions.

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the roadmap for Targeted Learning. We have formulated a transparent roadmap for Targeted Learning of a causal quantity [2, 9, 21], involving the following steps:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the


Neyman-Rubin-Robins counterfactual model [22–28]) or the structural causal model [9];

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution, under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true P_0;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization M = {P_θ : θ ∈ Θ} for the statistical model M, for some underlying parameter space Θ and parameterization θ → P_θ. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of θ_0. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution P_0, say Q_0 = Q(P_0), that can be represented as the minimizer of a criterion at the true data distribution P_0 over all candidate values {Q(P) : P ∈ M} for this part of the data distribution; we refer to this criterion as the risk R_{P_0}(Q) of the candidate value Q.

Typically, the risk at a candidate parameter value Q can be defined as the expectation under the data distribution of a loss function (O, Q) → L(Q)(O) that maps the unit data structure and the candidate parameter value into a real number: R_{P_0}(Q) = E_{P_0} L(Q)(O). Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of Q_0 as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve (O, P) → D*(Q(P), g(P))(O), identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution P, where this efficient influence curve does only depend on P through Q(P) and some nuisance parameter g(P). Given an estimator g_n, one now defines a path {Q_{n,g_n}(ε) : ε} with Euclidean parameter ε through the super-learner Q_n whose score

(d/dε) L(Q_{n,g_n}(ε))|_{ε=0}   (15)

at ε = 0 spans the efficient influence curve D*(Q_n, g_n) at the initial estimator (Q_n, g_n); this is called a least favorable parametric submodel through the super-learner.

In our running example, we have Q = (Q̄, Q_W), so that it suffices to construct a path through Q̄ and Q_W with corresponding loss functions and to show that their scores span the efficient influence curve (11). We can define the path Q̄_g(ε) on the logit scale, logit Q̄_g(ε) = logit Q̄ + εC(g), where C(g)(O) = (2A − 1)/g(A | W), and loss function L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Note that

(d/dε) L(Q̄_g(ε))(O)|_{ε=0} = D*_Y(Q̄, g) = (2A − 1)/g(A | W) (Y − Q̄(A, W)).   (16)

We also define the path Q_W(ε) = (1 + εD*_W(Q))Q_W, with loss function L(Q_W)(W) = −log Q_W(W), where D*_W(Q)(O) = Q̄(1, W) − Q̄(0, W) − Ψ(Q). Note that

(d/dε) L(Q_W(ε))|_{ε=0} = D*_W(Q).   (17)

Thus, if we define the sum loss function L(Q) = L(Q̄) + L(Q_W), then

(d/dε) L(Q_g(ε))|_{ε=0} = D*(Q, g).   (18)


This proves that indeed these proposed paths through Q̄ and Q_W and corresponding loss functions span the efficient influence curve D*(Q, g) = D*_W(Q) + D*_Y(Q̄, g) at (Q, g), as required.

The dimension of ε can be selected to be equal to the dimension of the target parameter ψ_0, but by creating extra components in ε one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an ε_1 for the path through Q̄ and a separate ε_2 for the path through Q_W. In this case, the TMLE update Q*_n will solve two score equations, P_n D*_W(Q*_n) = 0 and P_n D*_Y(Q̄*_n, g_n) = 0, and thus in particular P_n D*(Q*_n, g_n) = 0. In this example, the main benefit of using a bivariate ε = (ε_1, ε_2) is that the TMLE does not update Q_{W,n} (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter ε of this path by minimizing the empirical risk ε → P_n L(Q_{n,g_n}(ε)) along this path through the super-learner, resulting in an estimator ε_n. This defines an update of the super-learner fit, Q¹_n = Q_{n,g_n}(ε_n). This updating process is iterated till ε_n ≈ 0. The final update we denote with Q*_n, the TMLE of Q_0, and the target parameter mapping applied to Q*_n defines the TMLE of the target parameter ψ_0. This TMLE Q*_n solves the efficient influence curve equation Σ_{i=1}^n D*(Q*_n, g_n)(O_i) = 0, providing the basis, in combination with the statistical properties of (Q*_n, g_n), for establishing that the TMLE Ψ(Q*_n) is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have ε_{1,n} = arg min_{ε_1} P_n L(Q̄⁰_{n,g_n}(ε_1)), while ε_{2,n} = arg min_{ε_2} P_n L(Q_{W,n}(ε_2)) equals zero. That is, the TMLE does not update Q_{W,n}, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case Q̄*_n = Q̄¹_n, since the convergence of the TMLE-algorithm occurs in one step, and of course Q*_{W,n} = Q_{W,n} is just the initial empirical distribution function of W_1, …, W_n. The TMLE of ψ_0 is the substitution estimator Ψ(Q*_n).
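A minimal numerical sketch of this one-step TMLE for the running example (binary Y) follows. This is an illustration under stated assumptions, not the authors' implementation (see the tmle() R package for that): the initial fits and the function interface are hypothetical inputs, the logistic least favorable submodel uses the clever covariate H(A, W) = (2A − 1)/g(A | W), and ε is fit by a few Newton-Raphson steps for the logistic likelihood.

```python
import numpy as np

# Minimal TMLE sketch for the running example: fluctuate an initial Qbar_n
# on the logit scale with clever covariate H = (2A-1)/g(A|W), fit epsilon
# by maximum likelihood, and report the substitution estimator Psi(Q*_n).

expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

def tmle_ate(A, Y, QA, Q1, Q0, g1, iters=25):
    """QA, Q1, Q0: initial estimates Qbar_n(A,W), Qbar_n(1,W), Qbar_n(0,W);
    g1: estimate of g_0(1 | W). Returns (psi_star, estimated D* values)."""
    gA = np.where(A == 1, g1, 1 - g1)
    H = (2 * A - 1) / gA
    eps, off = 0.0, logit(np.clip(QA, 1e-6, 1 - 1e-6))
    for _ in range(iters):                      # Newton-Raphson for epsilon
        p = expit(off + eps * H)
        score = np.sum(H * (Y - p))
        info = np.sum(H ** 2 * p * (1 - p))
        eps += score / info
    # Updated fits: H(1, W) = 1/g1 and H(0, W) = -1/(1 - g1).
    Q1s = expit(logit(np.clip(Q1, 1e-6, 1 - 1e-6)) + eps / g1)
    Q0s = expit(logit(np.clip(Q0, 1e-6, 1 - 1e-6)) - eps / (1 - g1))
    QAs = expit(off + eps * H)
    psi = np.mean(Q1s - Q0s)
    Dstar = H * (Y - QAs) + Q1s - Q0s - psi
    return psi, Dstar
```

By construction the update solves the efficient influence curve equation, P_n D*(Q*_n, g_n) ≈ 0, which is exactly the targeting property discussed above.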

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameters as a parameter of a smaller Q_0, defining a path and loss function with generalized score that spans the efficient influence curve, and running the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that O_1, …, O_n are i.i.d. with distribution P_0. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73]; we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure O = (L(0), A(0), …, A(K), Y = L(K + 1)), where L(0) are baseline covariates, L(k) are time dependent covariates realized between intervention nodes A(k − 1) and A(k), and Y is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables; the desired intervention for the censoring variables is always "no censoring," since the outcome Y is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so called sequential randomization assumption, this target quantity is identified by the so called G-computation formula for the postintervention distribution corresponding with a stochastic intervention g*:

P_0^{g*}(O) = ∏_{k=0}^{K+1} P_{0,L(k)}(L(k) | L̄(k − 1), Ā(k − 1)) ∏_{k=0}^{K} g*_k(A(k) | Ā(k − 1), L̄(k)),   (19)

where L̄(k) = (L(0), …, L(k)) and Ā(k) = (A(0), …, A(k)).

Note that this postintervention distribution is nothing else but the actual distribution of O, factorized according to the time-ordering, but with the true conditional distributions of A(k), given its parents (Ā(k − 1), L̄(k)), replaced by the desired stochastic intervention. The statistical target parameter is thus E_{P_0^{g*}} Y, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions g*. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.
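To make formula (19) concrete, here is a small numerical sketch for two time points under the static intervention A(0) = A(1) = 1. The conditional distributions below are hypothetical illustrations, not taken from the paper; the post-intervention mean is computed once by exact enumeration of the G-computation formula and once by Monte Carlo sampling from the factorized post-intervention distribution.

```python
import numpy as np

# Sketch of the G-computation formula (19): binary L(0), L(1), Y and the
# static intervention A(0) = A(1) = 1, with hypothetical conditionals.

expit = lambda x: 1 / (1 + np.exp(-x))

pL0 = 0.5                                         # P(L(0) = 1)
pL1 = lambda l0, a0: expit(-0.5 + l0 + a0)        # P(L(1) = 1 | L(0), A(0))
pY = lambda l0, l1, a1: expit(-1 + l0 + l1 + a1)  # P(Y = 1 | past)

# Exact: sum over the four (L(0), L(1)) configurations under A(0)=A(1)=1.
exact = sum(
    (pL0 if l0 else 1 - pL0)
    * (pL1(l0, 1) if l1 else 1 - pL1(l0, 1))
    * pY(l0, l1, 1)
    for l0 in (0, 1)
    for l1 in (0, 1)
)

# Monte Carlo: sample directly from the post-intervention distribution,
# i.e., draw each L(k) from its conditional and fix each A(k) = 1.
rng = np.random.default_rng(11)
n = 200_000
L0 = rng.binomial(1, pL0, size=n)
L1 = rng.binomial(1, pL1(L0, 1))
Y = rng.binomial(1, pY(L0, L1, 1))
mc = Y.mean()
```

Both routes estimate the same estimand E_{P_0^{g*}} Y; the Monte Carlo route is the one that scales to many time points and stochastic interventions.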

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E_W{E(Y | A = 1, W) − E(Y | A = 0, W)} as in our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA and use β as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure, for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
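The simultaneous inference step can be sketched as follows. The function and its interface are hypothetical: given a stacked estimate and an n × d matrix of estimated influence curve values (one column per variable importance measure), it builds simultaneous Wald-type confidence intervals using a quantile of the maximum absolute Z-statistic, simulated from the estimated multivariate normal limit.

```python
import numpy as np

# Sketch: simultaneous 95% confidence intervals for a d-vector psi_n from
# the stacked influence curve matrix IC (n rows, d columns), based on a
# simulated max-|Z| quantile of the estimated multivariate normal limit.

def simultaneous_cis(psi, IC, level=0.95, draws=20_000, seed=0):
    n, d = IC.shape
    Sigma = np.cov(IC, rowvar=False) / n          # covariance of psi_n
    se = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(se, se)                  # limit correlation matrix
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(d), R, size=draws)
    q = np.quantile(np.max(np.abs(Z), axis=1), level)  # max-|Z| quantile
    return psi - q * se, psi + q * se, q
```

The simulated quantile q exceeds the marginal 1.96, which is exactly the widening that buys family wise coverage across the d variable importance measures.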

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE with additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄_0(A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator Q̄_n satisfying this constraint, and the least favorable submodel {Q̄_{n,g_n}(ε) : ε} should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter g_0 in TMLE. Even though an asymptotically consistent estimator of g_0 yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g_0, but also with respect to how well the resulting TMLE fits ψ_0. Consider our running example. Suppose that among the components of W there is a W_j that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate W_j in the fit of g_n makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for W_j, but in most finite samples adjustment for W_j in g_n will hurt the practical performance of TMLE, and effort should be put in variables that are stronger confounders than W_j. We developed a method for building an estimator g_n that uses as criterion the change in fit between the initial estimator of Q_0 and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g_0 in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so called Donsker class condition. For example, in our running example, it requires that Q̄_n and g_n are not too erratic functions of (A, W). This condition is not just theoretical; one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense: if we use an overfitted initial estimator, there is little reason to think that the ε_n that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the ε_n that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator Q̄_n is inconsistent but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user supplied estimators was formulated and implemented for double robust estimating equation based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator Q̄_n will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data adaptive consistent estimator of g_0 (and thus with bias larger than 1/√n) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ_0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets the fit of g_n, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of g_0 is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that in our running example A is continuous and we are concerned with estimation of the dose-response curve a → E0 E0(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE, and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
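The basic cross-validation selector that the oracle inequalities concern can be sketched as follows. This minimal version uses the standard squared-error loss (a known loss function), not the nuisance-parameter-dependent losses or CV-TMLE risk estimates discussed above, and the two candidate learners are hypothetical toys:

```python
import numpy as np

def cv_selector(candidates, X, Y, V=5, seed=0):
    """Discrete super-learner step: pick the candidate with the smallest
    V-fold cross-validated empirical risk (squared-error loss).
    Each candidate is a function fit(X_train, Y_train) -> predict."""
    folds = np.random.default_rng(seed).integers(0, V, size=len(Y))
    risks = []
    for fit in candidates:
        sse = 0.0
        for v in range(V):
            train, valid = folds != v, folds == v
            predict = fit(X[train], Y[train])   # fit on training folds only
            sse += np.sum((Y[valid] - predict(X[valid])) ** 2)
        risks.append(sse / len(Y))
    return int(np.argmin(risks)), risks

# Two toy candidates: an overall-mean fit and a linear fit.
rng = np.random.default_rng(1)
X = rng.normal(size=500)
Y = 2.0 * X + rng.normal(size=500)
mean_fit = lambda Xt, Yt: (lambda Xv, m=Yt.mean(): np.full(len(Xv), m))
line_fit = lambda Xt, Yt: (lambda Xv, b=np.polyfit(Xt, Yt, 1): np.polyval(b, Xv))
best, risks = cv_selector([mean_fit, line_fit], X, Y)
```

Here the linear candidate wins, as it should. The generalization in the text replaces the simple empirical mean of squared errors with a TMLE of the (nuisance-dependent) conditional risk.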

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle, and adapt to, any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.

Advances in Statistics

By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning we began to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
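The common practice criticized here (estimating the asymptotic variance by the empirical sample variance of the estimated influence curves) looks like this in code; the sample mean, whose influence curve is X − ψ0, serves as a toy example:

```python
import numpy as np

def ic_wald_ci(psi_hat, ic_values, z=1.96):
    """Wald-type interval from the empirical variance of the estimated
    influence curve. This is the standard plug-in; under sparsity the
    influence curve takes large values and this estimator is nonrobust."""
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values, ddof=1) / n)
    return psi_hat - z * se, psi_hat + z * se

# Toy example: the sample mean has estimated influence curve X - psi_hat.
X = np.random.default_rng(2).normal(loc=3.0, size=1000)
psi_hat = X.mean()
lo, hi = ic_wald_ci(psi_hat, X - psi_hat)
```

The proposal in the text is to replace this sample-variance step with a targeted substitution estimator of the variance itself.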

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors, due to conditional independence assumptions and stationarity assumptions that state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4-8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for allowing us to obtain inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.
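The sample-splitting recipe described above can be sketched directly. Everything here (the data-generating process, the mining rule that picks the covariate most correlated with the outcome, the Fisher z interval) is an illustrative assumption, not the CV-TMLE approach developed in the next paragraph:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
Y = X[:, 2] + rng.normal(size=2000)     # only covariate 2 is related to Y

half = len(Y) // 2

# Step 1: mine the first half to *choose* the target parameter
# (here: the correlation of the most promising covariate with Y).
cors = [abs(np.corrcoef(X[:half, k], Y[:half])[0, 1]) for k in range(5)]
j = int(np.argmax(cors))

# Step 2: honest inference for that data-chosen parameter on the
# held-out half, via the Fisher z-transform of the correlation.
r = np.corrcoef(X[half:, j], Y[half:])[0, 1]
z = np.arctanh(r)
se = 1.0 / np.sqrt(half - 3)            # large-sample Fisher z approximation
ci = np.tanh([z - 1.96 * se, z + 1.96 * se])
```

The price is visible in the code: only half the sample is left for estimation, which is exactly the sacrifice the text refers to.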

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., an indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
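For a binary treatment, the simplest plug-in version of such a rule is easy to state: treat exactly the subjects with positive predicted treatment benefit. The sketch below assumes a hypothetical, already fitted outcome regression `Qhat` (in Targeted Learning this would come from super-learning; here it is a toy closed form):

```python
import numpy as np

# Hypothetical fitted outcome regression Qhat(a, w) for binary treatment a.
Qhat = lambda a, w: (0.5 - w) * a + w   # toy: benefit of treatment is 0.5 - w

W = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
benefit = Qhat(1, W) - Qhat(0, W)       # predicted individual treatment effect
rule = (benefit > 0).astype(int)        # plug-in rule: treat if benefit > 0
print(rule)                             # [1 1 1 0 0]
```

The target parameter discussed in the text, the mean counterfactual outcome under the fitted rule, is then itself a data adaptive parameter requiring the targeted inference described above.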

Statistical Inference Based on Higher Order Inference. Another key assumption the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order) term is o_P(1/√n). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q0 and g0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and is suffering from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
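The bookkeeping idea behind such online updating can be illustrated with a toy estimator: keep sufficient statistics so that a new chunk of data updates the estimate (and its influence-curve-based variance) at a cost depending only on the chunk size. This is a minimal sketch, not an online TMLE:

```python
class OnlineMean:
    """Toy online estimator: running mean plus the variance of its
    influence curve, updated from each new chunk only, never
    revisiting old data."""

    def __init__(self):
        self.n = 0
        self.sum = 0.0
        self.sumsq = 0.0

    def update(self, chunk):
        # Cost of this call depends only on len(chunk).
        self.n += len(chunk)
        self.sum += sum(chunk)
        self.sumsq += sum(x * x for x in chunk)

    @property
    def estimate(self):
        return self.sum / self.n

    @property
    def variance_of_estimate(self):
        # Sample variance of IC(X) = X - mean, divided by n.
        mean = self.estimate
        return (self.sumsq - self.n * mean * mean) / (self.n - 1) / self.n

om = OnlineMean()
om.update([1.0, 2.0, 3.0])   # first chunk arrives
om.update([4.0, 5.0])        # later chunk updates the estimate in O(chunk)
```

For a TMLE the sufficient statistics are, of course, far richer than two running sums, but the design goal is the same: update cost proportional to the new chunk, not to the full database.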

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense, and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89-91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines like machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose at the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis, or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors of usually unequal importance that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way, by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity, and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview is probabilistic; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches

of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the inference experts, as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1978. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, and confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues". Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for the exploration, storage, and summary illustration of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance

of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage from The Future of Data Analysis: "for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings, Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's

erosion of models: the view that all models are wrong, the classical notion of truth is obsolete, and pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliche, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson, but understood the ultimate consequences of it. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who did realize that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level, to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on the research of efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is searched on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forest, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research, nor any other attempt to take into account the whole of reality or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be enough data so that careful design of studies and interpretation of data are not needed anymore.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error o a simple samplemean might be so small that there is no need or con1047297denceintervals one is ofen interested in much more complexstatistical target parameters For example consider theaverage treatment effect o our running example whichis not a very complex parameter relative to many other

parameters o interest such as an optimal individualizedtreatment rule Evaluation o the average treatment effectbased on a sample (ie substitution estimator obtained by plugging in the empirical distribution o the sample) wouldrequire computing the mean outcome or each possible strata

of treatment and covariates. Even with n = 10^12 observations,

most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
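To make the emptiness of the strata concrete, here is a minimal simulation; the covariate dimension, sample size, and outcome mechanism are our own illustrative assumptions, not quantities from the article:

```python
import random

random.seed(0)

d = 30        # number of binary covariates (illustrative assumption)
n = 10**5     # sample size, far below the 2**(d+1) possible strata

# Simulate observations O = (W, A, Y): W a d-bit covariate vector,
# A a binary treatment, Y a binary outcome.
data = []
for _ in range(n):
    w = tuple(random.randint(0, 1) for _ in range(d))
    a = random.randint(0, 1)
    y = int(random.random() < 0.1 + 0.2 * a)
    data.append((w, a, y))

# The pure empirical plug-in estimator needs the mean of Y within every
# (a, w) stratum; count how many strata are actually observed.
observed = {(a, w) for (w, a, _) in data}
total = 2 ** (d + 1)
print(f"observed strata: {len(observed)} of {total}")
# Almost every stratum is empty, so the empirical estimate of
# E(Y | A = a, W = w) is undefined for almost all w.
```

Here d = 30 already gives about 2.1 × 10^9 strata; scaling d up to 40 gives more than 10^12 strata, so even the n = 10^12 observations mentioned above would leave most strata empty, and smoothing across strata is unavoidable.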

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models,

and Targeted Learning.

The massive dimension of the data does make it appealing

to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, as discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection

of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases, it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e.,

the sample size is 1), so that asymptotic theory for estimators based on influence curves and the state of the art advances in weak convergence theory are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Advances in Statistics

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields such as computer science, statistics, probability theory, and scientific knowledge, which allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this so that

money can be spent in the best possible way; the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and corresponding education of our next generations, and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee, et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


case that death (Y = 1) represents a rare event, it might also be known that the probability of death, P_{Y|A,W}(1 | A, W), is bounded between 0 and some small number (e.g., 0.03). This restriction should then be included in the model M.

In various applications, careful understanding of the experiment that generated the data might show that even

these rather large statistical models, assuming the data generating experiment equals the independent repetition of a common experiment, are too small to be true; see [4–8] for models in which (O_1, …, O_n) is a joint random variable described by a single experiment, which nonetheless involves a variety of conditional independence assumptions. That is, the typical statement that O_1, …, O_n are independent and identically distributed (i.i.d.) might already represent a wrong statistical model. For example, in a community randomized trial it is often the case that the treatments are assigned by the following type of algorithm: based on the characteristics (W_1, …, W_n), one first applies an algorithm that aims to split the n communities into n/2 pairs that are similar with respect to baseline characteristics; subsequently, one randomly assigns treatment and control to each pair. Clearly, even when the communities would have been randomly sampled from a target population of communities, the treatment assignment mechanism creates dependence, so that the data generating experiment cannot be described as an independent repetition of experiments; see [7] for a detailed presentation.
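The pair-matched assignment scheme just described can be sketched as follows; the greedy sort-and-pair rule and the one-dimensional baseline characteristic are simplifying assumptions for illustration, not the matching algorithm of [7]:

```python
import random

random.seed(1)

def pair_matched_assignment(baseline):
    """Pair communities with similar baseline values, then randomize
    treatment/control within each pair (greedy pairing for illustration)."""
    n = len(baseline)
    assert n % 2 == 0
    # Sort community indices by the baseline covariate so that adjacent
    # communities in the sorted order are similar.
    order = sorted(range(n), key=lambda i: baseline[i])
    assignment = [None] * n
    for k in range(0, n, 2):
        i, j = order[k], order[k + 1]      # one matched pair
        a_i = random.randint(0, 1)         # coin flip within the pair
        assignment[i], assignment[j] = a_i, 1 - a_i
    return assignment

# Example: 6 communities with a one-dimensional baseline characteristic.
baseline = [0.9, 0.1, 0.5, 0.48, 0.12, 0.88]
A = pair_matched_assignment(baseline)
print(A)
print(sum(A))   # always n/2 = 3: exactly one treated community per pair
```

Because exactly one community in each pair is treated, the assignments are dependent: knowing one member's treatment determines the other's, so the n communities do not constitute n independent experiments.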

In a study in which one observes a single community of n interconnected individuals, one might have that the outcome Y_i for subject i is not only affected by the subject's past (W_i, A_i) but also by the covariates and treatments of friends of subject i. Knowing the friends of each subject would now impose strong conditional independence assumptions on the density of the data (O_1, …, O_n), but one cannot assume that the data is the result of n independent experiments; in fact, as in the community randomized trial example, such data sets have sample size 1, since the data can only be described as the result of a single experiment [8].
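As a toy illustration of how known friend sets translate into a factorized likelihood, consider the following sketch; the network, the outcome model, and its coefficients are hypothetical assumptions, not a model from the article:

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy network: friends[i] lists the friends of subject i (assumed known).
friends = {0: [1], 1: [0, 2], 2: [1]}
A = [1, 0, 1]   # treatments
Y = [1, 0, 0]   # binary outcomes

def p_outcome(i, beta=0.8):
    """Hypothetical model: P(Y_i = 1) depends on the subject's own
    treatment and the average treatment of the subject's friends."""
    exposure = A[i] + sum(A[j] for j in friends[i]) / max(1, len(friends[i]))
    return expit(beta * exposure - 0.5)

# Conditional independence given the network: the joint density of
# (Y_1, ..., Y_n) given the treatments factorizes over subjects, even
# though the observations themselves are not independent.
loglik = sum(
    math.log(p_outcome(i)) if Y[i] == 1 else math.log(1 - p_outcome(i))
    for i in range(len(Y))
)
print(round(loglik, 3))
```

The factorization is what makes estimation possible at all here: without the conditional independence assumptions encoded by the friend sets, the single observed vector (Y_1, …, Y_n) would carry no replication.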

In group sequential randomized trials, one often may use a randomization probability for the next recruited ith subject that depends on the observed data of the previously recruited and observed subjects O_1, …, O_{i−1}, which makes the treatment assignment A_i a function of O_1, …, O_{i−1}. Even when the subjects are sampled randomly from a target population, this type of dependence between treatment A_i

and the past data O_1, …, O_{i−1} implies that the data is the result of a single large experiment (again, the sample size equals 1)
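A toy version of such a group sequential design, in which each A_i is a function of O_1, …, O_{i−1}; the play-the-winner style rule and the outcome probabilities below are hypothetical illustrations, not the adaptive designs of [4–6]:

```python
import random

random.seed(2)

def adaptive_trial(n):
    """Sequentially assign A_i with a probability depending on O_1,...,O_{i-1}."""
    history = []   # (a, y) pairs for previously observed subjects
    for i in range(n):
        treated = [(a, y) for (a, y) in history if a == 1]
        control = [(a, y) for (a, y) in history if a == 0]
        # Hypothetical rule: lean toward the arm with the better observed
        # mean outcome so far (probability 0.5 before any data).
        if treated and control:
            m1 = sum(y for _, y in treated) / len(treated)
            m0 = sum(y for _, y in control) / len(control)
            p = min(0.9, max(0.1, 0.5 + 0.5 * (m1 - m0)))
        else:
            p = 0.5
        a = 1 if random.random() < p else 0
        y = 1 if random.random() < (0.6 if a == 1 else 0.4) else 0
        history.append((a, y))
    return history

trial = adaptive_trial(100)
print(len(trial))
```

Here the joint distribution of all 100 observations is generated by one sequential experiment: no subset of subjects is an i.i.d. sample, so resampling subjects (the bootstrap) has no valid interpretation, in line with the remark above.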

[4–6]. Indeed, many realistic statistical models only involve

independence and conditional independence assumptions and known bounds (e.g., it is known that the observed clinical outcome is bounded in [0, 1], or that the conditional probability of death is bounded between 0 and a small number). Either way, if the data distribution is described by a sequence of independent (and possibly identical) experiments, or by a single experiment satisfying a variety of conditional independence restrictions, parametric models, though representing common practice, are practically always invalid statistical models, since such knowledge about the data distribution is essentially never available.

An important by-product of requiring that the statistical model needs to be truthful is that one is forced to obtain as much knowledge about the experiment as possible before committing to a model, which is precisely the role a good statistician should play. On the other hand, if one commits to a parametric model, then why would one still bother trying to find out the truth about the data generating experiment?

2.2. Target Parameter. The target parameter is defined as a mapping Ψ: M → R that maps the data distribution into the desired finite dimensional feature of the data distribution one wants to learn from the data: ψ_0 = Ψ(P_0). This choice of target parameter requires careful thought, independent from the choice of statistical model, and is not a choice made out of convenience. The use of parametric or semiparametric models such as the Cox proportional hazards model is often accompanied with the implicit statement that the unknown coefficients represent the parameter of interest. Even in the unrealistic scenario that these small statistical models

would be true there is absolutely no reason why the very parametrization o the data distribution should correspondwith the target parameter o interest Instead the statisticalmodel M and the choice o target parameter Ψ M rarrR are two completely separate choices and by no means

one should imply the other Tat is the statistical knowledgeabout the experiment that generated the data and de1047297ningwhat we hope to learn rom the data are two importantkey steps in science that should not be convoluted Te truetarget parameter value 0 is obtained by applying the targetparameter mapping Ψ to the true data distribution 11039250 andrepresents the estimand o interest

For example, if O_i = (W_i, A_i, Y_i) are independent and have common probability distribution P₀, then one might define the target parameter as an average of the conditional W-specific treatment effects:

ψ₀ = Ψ(P₀) = E₀[ E₀(Y | A = 1, W) − E₀(Y | A = 0, W) ].   (2)

By using that Y is binary, this can also be written as follows:

ψ₀ = ∫_w { P_{Y|A,W,0}(1 | A = 1, W = w) − P_{Y|A,W,0}(1 | A = 0, W = w) } dP_{W,0}(w),   (3)

where P_{Y|A,W,0}(1 | A = a, W = w) denotes the true conditional probability of death, given treatment A = a and covariate W = w.

For example, suppose that the true conditional probability of death is given by some logistic function:

P_{Y|A,W,0}(1 | A, W) = 1 / (1 + exp(−f₀(A, W)))   (4)

for some function f₀ of treatment A and covariates W. The reader can plug in a possible form for f₀, such as f₀(A, W) = 0.3A + 0.2W₁ + 0.1W₁W₂ + W₂W₃. Given this function f₀, the true value ψ₀ is computed by the above formula as follows:

ψ₀ = ∫_w { 1/(1 + exp(−f₀(1, w))) − 1/(1 + exp(−f₀(0, w))) } dP_{W,0}(w).   (5)
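Formula (5) is straightforward to evaluate numerically once one commits to a concrete f₀ and covariate distribution. A minimal sketch follows; the function f0 and the choice of two independent Bernoulli(0.5) covariates are hypothetical illustrations, not choices made in the text.

```python
import math

# Hypothetical f0 (not the one in the text) over a binary treatment a
# and two binary covariates (w1, w2).
def f0(a, w1, w2):
    return 0.3 * a + 0.2 * w1 + 0.1 * w1 * w2

def qbar0(a, w1, w2):
    # Formula (4): P0(Y = 1 | A = a, W = w) = 1 / (1 + exp(-f0(a, w))).
    return 1.0 / (1.0 + math.exp(-f0(a, w1, w2)))

# Formula (5): average the w-specific additive effect over the support
# of W, with weight P_W0(w) = 1/4 at each of the four support points.
psi0 = sum(0.25 * (qbar0(1, w1, w2) - qbar0(0, w1, w2))
           for w1 in (0, 1) for w2 in (0, 1))
print(round(psi0, 4))  # → 0.0732
```

For richer covariate distributions the same sum becomes an integral, which one would approximate by Monte Carlo over draws of W.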

This parameter ψ₀ has a clear statistical interpretation as the average of all the w-specific additive treatment effects P_{Y|A,W,0}(1 | A = 1, W = w) − P_{Y|A,W,0}(1 | A = 0, W = w).

2.3. The Important Role of Models Also Involving Nontestable

Assumptions. However, this particular statistical estimand ψ₀ has an even richer interpretation if one is willing to make additional so called causal (nontestable) assumptions. Let us assume that W, A, Y are generated by a set of so called structural equations:

W = f_W(U_W),  A = f_A(W, U_A),  Y = f_Y(W, A, U_Y),   (6)

where U = (U_W, U_A, U_Y) are random inputs following a particular unknown probability distribution, while the functions f_W, f_A, f_Y deterministically map the realization of the random input U = u sequentially into a realization of W = f_W(u_W), A = f_A(W, u_A), Y = f_Y(W, A, u_Y). One might not make any assumptions about the form of these functions f_W, f_A, f_Y. In that case, these causal assumptions put no restrictions on the probability distribution of O = (W, A, Y), but through these assumptions we have parametrized P₀ by a choice of functions (f_W, f_A, f_Y) and a choice of distribution of U. Pearl [9] refers to such assumptions as a structural causal model for the distribution of O = (W, A, Y).

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing A = f_A(W, U_A) by our desired intervention on the intervention node A. For example, a static intervention A = 1 results in a new system of equations, W = f_W(U_W), A = 1, Y₁ = f_Y(W, 1, U_Y), where this new random variable Y₁ is called a counterfactual outcome or potential outcome corresponding with intervention A = 1. Similarly, one can define Y₀ = f_Y(W, 0, U_Y). Thus Y₀ (Y₁) represents the outcome on the subject one would have seen if the subject had been assigned treatment A = 0 (A = 1). One might now define the causal effect of interest as E₀Y₁ − E₀Y₀, that is, the difference between the expected outcome of Y₁ and the expected outcome of Y₀. If one also assumes that A is independent of U, given W, which is often referred to as the assumption of no unmeasured confounding or the randomization assumption, then it follows that ψ₀ = E₀Y₁ − E₀Y₀. That is, under the structural causal model including this no unmeasured confounding assumption, ψ₀ can not only be interpreted purely statistically as an average of conditional treatment effects, but it actually equals the marginal additive causal effect.
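The counterfactual construction can be made concrete with a small simulation. The structural equations and error distributions below are hypothetical stand-ins; the point is that intervening on A in the same system of equations, with the same draw of U_Y, produces the counterfactual pair (Y₀, Y₁) for each unit.

```python
import random

random.seed(1)

# Hypothetical structural equation for Y (illustrative, not from the
# text): Y = f_Y(W, A, U_Y) = 1{U_Y < 0.2 + 0.3*A + 0.2*W}.
def f_y(a, w, u_y):
    return 1 if u_y < 0.2 + 0.3 * a + 0.2 * w else 0

n = 100_000
diff_sum = 0
for _ in range(n):
    w = random.randint(0, 1)     # realization of W = f_W(U_W)
    u_y = random.random()        # realization of U_Y, shared by Y_0, Y_1
    # Static interventions A = 1 and A = 0 in the same system:
    diff_sum += f_y(1, w, u_y) - f_y(0, w, u_y)

ate = diff_sum / n               # Monte Carlo estimate of E0 Y1 - E0 Y0
print(round(ate, 2))             # close to 0.3 by construction
```

In this toy law the additive causal effect is exactly 0.3 for every w, so the Monte Carlo average recovers it up to simulation error.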

In general causal models or more generally sets o nontestable assumptions can be used to de1047297ne underlyingtarget quantities o interest and corresponding statisticaltarget parameters that equal this target quantity under theseassumptions Well known classes o such models are modelsor censored data in which the observed data is represented

as a many to one mapping on the ull data o interest andcensoring variable and the target quantity is a parameter o the ull data distribution Similarly causal inerence modelsrepresent the observed data as a mapping on counteractualsand the observed treatment (either explicitly as in theNeyman-Rubin model or implicitly as in the Pearl structuralcausal models) and one de1047297nes the target quantity as aparameter o the distribution o the counteractuals One isnow ofen concerned with providing sets o assumptions onthe underlying distribution (ie o the ull-data) that allow identi1047297ability o the target quantity rom the observed datadistribution (eg coarsening at random or randomizationassumption) Tese nontestable assumptions do not changethe statistical model M and as a consequence once one hasde1047297ned the relevant estimand0 do not affect the estimationproblem either

2.4. Estimation Problem. The estimation problem is defined by the statistical model (i.e., (O_1, ..., O_n) ~ P₀ ∈ M) and choice of target parameter (i.e., Ψ : M → ℝ^d). Targeted Learning is now the field concerned with the development of estimators of the target parameter that are asymptotically consistent as the number of units n converges to infinity and whose appropriately standardized version (e.g., √n(ψ_n − ψ₀)) converges in probability distribution to some limit probability distribution (e.g., normal distribution), so that one can construct confidence intervals that, for large enough sample size n, contain with a user supplied high probability the true value of the target parameter. In the case that O_1, ..., O_n are i.i.d. with distribution P₀, a common method for establishing asymptotic normality of an estimator is to demonstrate that the estimator minus truth can be approximated by an empirical mean of a function of O_i. Such an estimator is called asymptotically linear at P₀. Formally, an estimator ψ_n is asymptotically linear under i.i.d. sampling from P₀ if ψ_n − ψ₀ = (1/n) Σ_{i=1}^n IC(P₀)(O_i) + o_P(1/√n), where O → IC(P₀)(O) is the so called influence curve at P₀. In that case, the central limit theorem teaches us that √n(ψ_n − ψ₀) converges to a normal distribution N(0, σ²) with variance σ² = E₀ IC(P₀)(O)², defined as the variance of the influence curve. An asymptotic 0.95 confidence interval for ψ₀ is then given by ψ_n ± 1.96 σ_n/√n, where σ_n² is the sample variance of an estimate IC_n(O_i) of the true influence curve IC(P₀)(O_i), i = 1, ..., n.
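Turning estimated influence-curve values into such a Wald-type interval is a one-liner; a small sketch (the numbers fed in are placeholders):

```python
import math
import statistics

def wald_ci(psi_n, ic_values, z=1.96):
    """95% Wald interval psi_n ± z * sigma_n / sqrt(n), where sigma_n is
    the standard deviation of the estimated influence-curve values."""
    n = len(ic_values)
    sigma_n = statistics.pstdev(ic_values)   # sd about the (near-zero) mean
    half = z * sigma_n / math.sqrt(n)
    return psi_n - half, psi_n + half

# Hypothetical estimate and IC values (placeholders for illustration).
lo, hi = wald_ci(0.07, [0.5, -0.3, 0.1, -0.2, 0.4, -0.5])
print(lo < 0.07 < hi)  # → True
```

Since the influence curve has mean zero at P₀, its estimated values should be approximately centered, and `pstdev` (rather than a mean-corrected sample sd) is a reasonable estimate of σ.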

The empirical mean of the influence curve IC(P₀) of an estimator represents the first order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so called functional delta-method for statistical inference based on functionals of the empirical distribution [10–12]. That is, the influence curve IC(P₀)(O) of an estimator, viewed as a mapping from the empirical distribution P_n into the estimated value Ψ̂(P_n), is defined as the directional derivative at P₀ in the direction (P_{n=1} − P₀), where P_{n=1} is the empirical distribution at a single observation O.

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model. Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints, for shrinking neighborhoods around the true data distribution, implied by the statistical model, defined by the so called tangent space generated by all scores of parametric submodels through P₀ [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, so called targeted minimum loss based estimation discussed below, does not only result in asymptotically efficient estimators, but the estimators also (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15–19], (2) focus on the construction of substitution estimators that by definition also fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted, computer friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the n observations are independent and identically distributed, O_i ~ P₀ ∈ M, and Ψ : M → ℝ^d can now be defined as a parameter on the common distribution of O_i; but each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator. Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part Q₀ = Q(P₀) of the data distribution P₀ (e.g., factor of the likelihood), then a substitution estimator can be represented as Ψ(Q_n), where Q_n is an estimator of Q₀ that is contained in the parameter space {Q(P) : P ∈ M} implied by the statistical model M. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., it is a probability or difference between two probabilities) as well as known bounds on the data distribution implied by the model M.

In our running example, we can define Q₀ = (P_{W,0}, Q̄₀), where P_{W,0} is the probability distribution of W under P₀, and Q̄₀(A, W) = E₀(Y | A, W) is the conditional mean of the outcome, given the treatment and covariates, and represent the target parameter as

ψ₀ = Ψ(Q₀) = E₀[ Q̄₀(1, W) − Q̄₀(0, W) ],   (7)

a function of the conditional mean Q̄₀ and the probability distribution P_{W,0} of W. The model M might restrict Q̄₀ to be between 0 and a small number δ < 1, but otherwise puts no restrictions on Q₀. A substitution estimator is now obtained by plugging in the empirical distribution P_{W,n} for P_{W,0} and a data adaptive estimator 0 < Q̄_n < δ of the regression Q̄₀:

ψ_n = Ψ(Q̄_n, P_{W,n}) = (1/n) Σ_{i=1}^n { Q̄_n(1, W_i) − Q̄_n(0, W_i) }.   (8)

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of ψ₀ could be defined as

ψ_n^{IPTW} = (1/n) Σ_{i=1}^n (2A_i − 1) Y_i / g_n(A_i | W_i),   (9)

where g_n(· | W) is an estimator of the conditional probability of treatment g₀(· | W). This is clearly not a substitution estimator. In particular, if g_n(A_i | W_i) is very small for some observations, this estimator might not be between −1 and 1 and thus completely ignores known constraints.
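The contrast between (8) and (9) can be seen on a toy data set. In this sketch the fits Q̄_n and g_n are hypothetical stand-ins; with a near-positivity violation (g_n(1 | W) close to zero), the IPTW estimator escapes the known bounds while the substitution estimator cannot.

```python
# Tiny hypothetical data set of observations O_i = (W_i, A_i, Y_i).
data = [(0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]

def qbar_n(a, w):                 # assumed regression fit, values in [0, 1]
    return min(1.0, max(0.0, 0.2 + 0.5 * a + 0.1 * w))

def g_n(a, w):                    # assumed treatment-probability fit;
    p1 = 0.01 if w == 0 else 0.6  # near-positivity violation at w = 0
    return p1 if a == 1 else 1.0 - p1

n = len(data)
# Substitution estimator (8): a difference of two probabilities, so it is
# inside the known bounds [-1, 1] by construction.
psi_sub = sum(qbar_n(1, w) - qbar_n(0, w) for w, a, y in data) / n
# IPTW estimator (9): blows up when g_n(A_i | W_i) is tiny.
psi_iptw = sum((2 * a - 1) * y / g_n(a, w) for w, a, y in data) / n

print(-1 <= psi_sub <= 1, -1 <= psi_iptw <= 1)  # → True False
```

Here `psi_iptw` is dominated by the single observation with g_n(1 | 0) = 0.01 and lands far outside [−1, 1].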

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter. The construction of targeted estimators of the target parameter requires construction of an estimator of infinite dimensional nuisance parameters, specifically the initial estimator of the relevant part Q₀ of the data distribution in the TMLE, and the estimator of the nuisance parameter g₀ = g(P₀) that is needed to target the fit of this relevant part in the TMLE. In our running example, we have Q₀ = (P_{W,0}, Q̄₀), and g₀ is the conditional distribution of A, given W.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter. In order to optimize these estimators of the nuisance parameters (Q₀, g₀), we use a so called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector, which compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example, we have that Q̄₀ = arg min_{Q̄} E₀ L(Q̄)(O), where L(Q̄)(O) = (Y − Q̄(A, W))² is the squared error loss, or one can also use the log-likelihood loss L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Usually, there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but probably should itself be data adaptively selected in a targeted manner. The cross-validated risk of a candidate estimator of Q̄₀ is then defined as the empirical mean over a validation sample of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so called V-fold cross-validation, in which one first partitions the sample into V subsets of equal size, and each of the V subsets plays the role of a validation sample while its complement of V − 1 subsets equals the corresponding training sample. Thus V-fold cross-validation results in V sample splits into a validation sample and a corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for P₀(Y = 1 | A, W). Different choices of such logistic linear regression working models result in different possible candidate estimators. So in this manner one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of Q̄₀. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.
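A minimal sketch of V-fold cross-validated selection among a small library follows (a "discrete" super-learner that picks a single best candidate rather than a weighted combination). The data-generating law and the three closed-form candidate estimators are illustrative, not from the text.

```python
import random

random.seed(2)

def simulate(n):
    """Hypothetical (W, A, Y) data with binary W, A and Bernoulli Y."""
    out = []
    for _ in range(n):
        w = random.randint(0, 1)
        a = random.randint(0, 1)
        y = 1 if random.random() < 0.2 + 0.4 * a + 0.2 * w else 0
        out.append((w, a, y))
    return out

def fit_mean(train):                  # candidate 1: overall mean of Y
    m = sum(y for _, _, y in train) / len(train)
    return lambda a, w: m

def fit_by_a(train):                  # candidate 2: mean of Y within A strata
    means = {}
    for arm in (0, 1):
        ys = [y for _, a, y in train if a == arm]
        means[arm] = sum(ys) / len(ys) if ys else 0.5
    return lambda a, w: means[a]

def fit_by_aw(train):                 # candidate 3: mean within (A, W) cells
    means = {}
    for arm in (0, 1):
        for cov in (0, 1):
            ys = [y for w, a, y in train if a == arm and w == cov]
            means[arm, cov] = sum(ys) / len(ys) if ys else 0.5
    return lambda a, w: means[a, w]

def cv_risk(fitter, data, V=5):
    """Squared-error cross-validated risk over V validation folds."""
    risks = []
    for v in range(V):
        valid = data[v::V]
        train = [o for i, o in enumerate(data) if i % V != v]
        qbar = fitter(train)
        risks.append(sum((y - qbar(a, w)) ** 2 for w, a, y in valid) / len(valid))
    return sum(risks) / V

data = simulate(2000)
library = {"mean": fit_mean, "by_a": fit_by_a, "by_aw": fit_by_aw}
risks = {name: cv_risk(f, data) for name, f in library.items()}
best = min(risks, key=risks.get)
print(best)  # the richest (correctly specified) candidate should win here
```

The selected fitter would then be refit on the whole sample to produce the final Q̄_n, exactly as the text describes.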

Similarly, one can define a super-learner of the conditional distribution of A, given W.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learner provides an important step in creating a robust estimator whose performance is not relying on being lucky but on generating a rich library, so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so called (mean zero) efficient influence curve D*(P₀)(O), up till a second order term that is asymptotically negligible [13]. That is, an estimator ψ_n is efficient if and only if it is asymptotically linear with influence curve IC(P₀) equal to the efficient influence curve D*(P₀):

ψ_n − ψ₀ = (1/n) Σ_{i=1}^n D*(P₀)(O_i) + o_P(1/√n).   (10)

The efficient influence curve is also called the canonical gradient and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter Ψ : M → ℝ. Specifically, one defines a rich family of one-dimensional submodels {P(ε) : ε} through P at ε = 0, and one represents the pathwise derivative (d/dε)Ψ(P(ε))|_{ε=0} as an inner product ⟨D(P), S(P)⟩_P = E_P D(P)(O) S(P)(O) (the covariance operator in the Hilbert space of functions of O with mean zero and inner product ⟨h₁, h₂⟩_P = E_P h₁(O) h₂(O)), where S(P) is the score of the path {P(ε) : ε} and D(P) is a so called gradient. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through P, also called the tangent space at P, is now the canonical gradient D*(P) at P. Indeed, the canonical gradient can be computed as the projection of any given gradient D(P) onto the tangent space in the Hilbert space L²₀(P). An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect Ψ : M → ℝ is given by

D*(P₀)(O) = ((2A − 1)/g₀(A | W)) (Y − Q̄₀(A, W)) + Q̄₀(1, W) − Q̄₀(0, W) − Ψ(Q₀).   (11)
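Formula (11) translates directly into code. In the sketch below the nuisance fits, the data, and the plug-in ψ_n are hypothetical placeholders; the last line forms the sample mean of the squared influence-curve values, the variance estimate used for inference below.

```python
# Efficient influence curve (11) evaluated at fitted nuisances.
def eic(o, qbar, g, psi):
    w, a, y = o
    h = (2 * a - 1) / g(a, w)                 # the "clever covariate"
    return h * (y - qbar(a, w)) + qbar(1, w) - qbar(0, w) - psi

qbar = lambda a, w: 0.2 + 0.3 * a + 0.1 * w   # assumed regression fit
g = lambda a, w: 0.5                          # assumed treatment mechanism
data = [(0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]   # (W, A, Y) triples

psi_n = sum(qbar(1, w) - qbar(0, w) for w, a, y in data) / len(data)
ic_vals = [eic(o, qbar, g, psi_n) for o in data]
var_n = sum(v * v for v in ic_vals) / len(ic_vals)  # estimate of sigma^2
print(round(psi_n, 2), round(var_n, 2))  # → 0.3 0.54
```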

As noted earlier, the influence curve IC(P₀) of an estimator also characterizes the limit variance σ₀² = P₀ IC(P₀)² of the mean zero normal limit distribution of √n(ψ_n − ψ₀). This variance σ₀² can be estimated with (1/n) Σ_{i=1}^n IC_n(O_i)², where IC_n is an estimator of the influence curve IC(P₀). Efficiency theory teaches us that for any regular asymptotically linear estimator its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, σ²_{*,0} = P₀ D*(P₀)², which is also called the generalized Cramer-Rao lower bound. In our running example, the asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate D*_n(O_i) of D*(P₀)(O_i), obtained by plugging in the estimator g_n of g₀ and the estimator Q̄_n of Q̄₀, while Ψ(Q₀) is replaced by Ψ(Q_n) = (1/n) Σ_{i=1}^n (Q̄_n(1, W_i) − Q̄_n(0, W_i)).

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of O that depends on P₀ through Q₀ and a possible nuisance parameter g₀, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through P₀. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator Ψ(Q_n), beyond Q_n being an excellent estimator of Q₀ as achieved with super-learning, is that the estimator Q_n solves the so called efficient influence curve equation Σ_{i=1}^n D*(Q_n, g_n)(O_i) = 0 for a good estimator g_n of g₀. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models M typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation Pf ≡ ∫ f(o) dP(o) for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which P₀ D*(P) = Ψ(P₀) − Ψ(P), and, in general, as a consequence of D*(P) being a canonical gradient,

P₀ D*(P) = Ψ(P₀) − Ψ(P) + R(P, P₀),   (12)

where R(P, P₀) is a term involving second order differences (P − P₀)². This key property explains why solving P₀ D*(P) = 0 targets Ψ(P) to be close to Ψ(P₀), and thus why solving P_n D*(Q_n, g_n) = 0 targets Q_n to fit Ψ(Q₀).

In our running example, we have R(P, P₀) = R₁(P, P₀) − R₀(P, P₀), where R_a(P, P₀) = ∫_w ((g − g₀)(a | w)/g(a | w)) (Q̄ − Q̄₀)(a, w) dP_{W,0}(w). So in our example the remainder R(P, P₀) only involves a cross-product difference (g − g₀)(Q̄ − Q̄₀). In particular, the remainder equals zero if either g = g₀ or Q̄ = Q̄₀, which is often referred to as double robustness of the efficient influence curve with respect to (Q̄, g) in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.
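This double robustness can be checked by simulation. The sketch below uses the estimating-equation (augmented IPTW) estimator, which solves the efficient influence curve equation in closed form; the data-generating law is a hypothetical choice. Either a correct g or a correct Q̄ suffices for consistency.

```python
import random

random.seed(3)

# Hypothetical law: W ~ Bern(0.5), A | W ~ Bern(g0(W)),
# Y | A, W ~ Bern(0.2 + 0.3*A + 0.2*W), so the true psi_0 = 0.3.
def g0(w):
    return 0.7 if w == 1 else 0.3

def simulate(n):
    out = []
    for _ in range(n):
        w = random.randint(0, 1)
        a = 1 if random.random() < g0(w) else 0
        y = 1 if random.random() < 0.2 + 0.3 * a + 0.2 * w else 0
        out.append((w, a, y))
    return out

def aiptw(data, qbar, g):
    """psi_n solving (1/n) sum_i D*(qbar, g)(O_i) + psi_n-term = 0."""
    s = 0.0
    for w, a, y in data:
        h = (2 * a - 1) / (g(w) if a == 1 else 1.0 - g(w))
        s += h * (y - qbar(a, w)) + qbar(1, w) - qbar(0, w)
    return s / len(data)

data = simulate(100_000)
qbar_true = lambda a, w: 0.2 + 0.3 * a + 0.2 * w
qbar_bad = lambda a, w: 0.5              # badly misspecified regression
g_bad = lambda w: 0.5                    # misspecified: ignores W

est_g_right = aiptw(data, qbar_bad, g0)      # correct g, wrong Qbar
est_q_right = aiptw(data, qbar_true, g_bad)  # correct Qbar, wrong g
print(round(est_g_right, 2), round(est_q_right, 2))  # both near 0.3
```

Misspecifying both nuisances simultaneously would, of course, leave a bias of the cross-product form R(P, P₀) above.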

Due to this identity (12), an estimator P̃_n that solves P_n D*(P̃_n) = 0 and is in a local neighborhood of P₀, so that R(P̃_n, P₀) = o_P(1/√n), approximately solves Ψ(P̃_n) − Ψ(P₀) ≈ (P_n − P₀) D*(P̃_n), where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining P_n D*(Q*_n, g_n) = 0 with (12) at P = (Q*_n, g_n) yields

Ψ(Q*_n) − Ψ(Q₀) = (P_n − P₀) D*(Q*_n, g_n) + R_n,   (13)

where R_n is a second order term. Thus, if second order differences such as (Q̄*_n − Q̄₀)², (Q̄*_n − Q̄₀)(g_n − g₀), and (g_n − g₀)² converge to zero at a rate faster than 1/√n, then it follows that R_n = o_P(1/√n). To make this assumption as reasonable as possible, one should use super-learning for both Q̄_n and g_n. In addition, empirical process theory teaches us that (P_n − P₀) D*(Q*_n, g_n) = (P_n − P₀) D*(Q₀, g₀) + o_P(1/√n) if P₀{D*(Q*_n, g_n) − D*(Q₀, g₀)}² converges to zero in probability as n converges to infinity (a consistency condition) and if D*(Q*_n, g_n) falls in a so called Donsker class of functions O → f(O) [11]. An important Donsker class is the class of all d-variate real valued functions that have a uniform sectional variation norm that is bounded by some universal constant M < ∞; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this M < ∞. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as n converges to infinity. So, under this Donsker class condition, the consistency condition, and R_n = o_P(1/√n), we have

ψ_n − ψ₀ = (1/n) Σ_{i=1}^n D*(Q₀, g₀)(O_i) + o_P(1/√n).   (14)

That is, ψ_n is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of Ψ(Q*_n) up till a second order term R_n, and even asymptotic efficiency if R_n = o_P(1/√n), under some weak regularity conditions.

3. Road Map for Targeted Learning of a Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the road map for Targeted Learning. We have formulated a road map for Targeted Learning of a causal quantity that provides a transparent sequence of steps [2, 9, 21], involving the following:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [22–28] or the structural causal model [9]);

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true P₀;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization M = {P_θ : θ ∈ Θ} for the statistical model M, for some underlying parameter space Θ and parameterization θ → P_θ. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of θ₀. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution P₀, say Q₀ = Q(P₀), that can be represented as the minimizer of a criterion at the true data distribution P₀ over all candidate values {Q(P) : P ∈ M} for this part of the data distribution; we refer to this criterion as the risk R_{P₀}(Q) of the candidate value Q.

Typically, the risk at a candidate parameter value Q can be defined as the expectation under the data distribution of a loss function (O, Q) → L(Q)(O) that maps the unit data structure and the candidate parameter value into a real number: R_{P₀}(Q) = E₀ L(Q)(O). Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of Q₀ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve (Q(P), g(P)) → D*(Q(P), g(P))(O), identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution P, where this efficient influence curve only depends on P through Q(P) and some nuisance parameter g(P). Given an estimator (Q_n, g_n), one now defines a path {Q_{n,g_n}(ε) : ε} with Euclidean parameter ε through the super-learner Q_n whose score

(d/dε) L(Q_{n,g_n}(ε)) |_{ε=0}    (15)

at ε = 0 spans the efficient influence curve D*(Q_n, g_n) at the initial estimator (Q_n, g_n); this is called a least favorable parametric submodel through the super-learner.

In our running example, we have Q = (Q̄, Q_W), so that it suffices to construct a path through Q̄ and Q_W with corresponding loss functions and show that their scores span the efficient influence curve (11). We can define the path Q̄(ε) through Q̄ by logit Q̄(ε) = logit Q̄ + εC(g), where C(g)(O) = (2A − 1)/g(A | W), with loss function L(Q̄)(O) = −[Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))]. Note that

(d/dε) L(Q̄(ε))(O) |_{ε=0} = D*_Y(Q̄, g)(O) = ((2A − 1) / g(A | W)) (Y − Q̄(A, W)).    (16)

We also define the path Q_W(ε) = (1 + εD*_W(Q))Q_W with loss function L(Q_W)(W) = −log Q_W(W), where D*_W(Q)(O) = Q̄(1, W) − Q̄(0, W) − Ψ(Q). Note that

(d/dε) L(Q_W(ε)) |_{ε=0} = D*_W(Q).    (17)

Thus, if we define the sum loss function L(Q) = L(Q̄) + L(Q_W), then

(d/dε) [L(Q̄(ε)) + L(Q_W(ε))] |_{ε=0} = D*(Q, g).    (18)
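The score identity can be checked numerically. The sketch below takes the least favorable submodel through Q̄ on the logit scale, as is standard in the TMLE literature, and verifies by finite differences that the derivative of the log-likelihood loss at ε = 0 reproduces the Y-component of the efficient influence curve, up to the sign convention for the loss (which is immaterial, since the score need only span the efficient influence curve). All numerical values are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

# Arbitrary illustrative values for O = (W, A, Y) and candidate fits.
n = 50
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, 0.5, size=n)
Qbar = expit(0.3 + 0.5 * A - 0.2 * W)   # candidate Qbar(A, W)
g1 = expit(0.4 * W)                     # candidate g(1 | W)
gA = np.where(A == 1, g1, 1 - g1)       # g(A | W)
H = (2 * A - 1) / gA                    # clever covariate C(g)(O)

def loss(eps):
    """Log-likelihood loss along the least favorable submodel
    logit Qbar(eps) = logit Qbar + eps * H."""
    Qeps = expit(logit(Qbar) + eps * H)
    return -(Y * np.log(Qeps) + (1 - Y) * np.log(1 - Qeps))

# Finite-difference derivative of the loss at eps = 0 ...
h = 1e-6
score = (loss(h) - loss(-h)) / (2 * h)

# ... matches the Y-component of the efficient influence curve,
# D*_Y = H (Y - Qbar), up to the sign convention for the loss.
D_Y = H * (Y - Qbar)
assert np.allclose(score, -D_Y, atol=1e-4)
```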


Advances in Statistics

This proves that indeed these proposed paths through Q̄ and Q_W and corresponding loss functions span the efficient influence curve D*(Q, g) = D*_W(Q) + D*_Y(Q̄, g) at (Q, g), as required.

The dimension of ε can be selected to be equal to the dimension of the target parameter ψ_0, but by creating extra components in ε one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an ε_1 for the path through Q̄ and a separate ε_2 for the path through Q_W. In this case, the TMLE update Q*_n will solve two score equations, P_n D*_W(Q*_n) = 0 and P_n D*_Y(Q̄*_n, g_n) = 0, and thus in particular P_n D*(Q*_n, g_n) = 0. In this example the main benefit of using a bivariate ε = (ε_1, ε_2) is that the TMLE does not update Q_{W,n} (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter ε of this path by minimizing the empirical risk ε → P_n L(Q_{n,g_n}(ε)) along this path through the super-learner, resulting in an estimator ε_n. This now defines an update of the super-learner fit, defined as Q^1_n = Q_{n,g_n}(ε_n). This updating process is iterated till ε_n ≈ 0.

The final update we will denote with Q*_n, the TMLE of Q_0, and the target parameter mapping applied to Q*_n defines the TMLE of the target parameter ψ_0. This TMLE solves the efficient influence curve equation Σ_{i=1}^{n} D*(Q*_n, g_n)(O_i) = 0, providing the basis, in combination with statistical properties of (Q*_n, g_n), for establishing that the TMLE Ψ(Q*_n) is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have ε_{1,n} = arg min_ε P_n L(Q̄^0_n(ε)), while ε_{2,n} = arg min_ε P_n L(Q_{W,n}(ε_2)) equals zero. That is, the TMLE does not update Q_{W,n}, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case Q̄*_n = Q̄^1_n, since the convergence of the TMLE-algorithm occurs in one step, and, of course, Q*_{W,n} = Q_{W,n} is just the initial empirical distribution function of W_1, ..., W_n. The TMLE of ψ_0 is the substitution estimator Ψ(Q*_n).
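The full algorithm for the running example can be sketched in a few lines. This is an illustrative toy implementation, not the authors' software: the simulation, the plain gradient-descent logistic fits (standing in for super-learning), and all variable names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
expit = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

# Illustrative simulation of O = (W, A, Y); the true ATE is about 0.23.
n = 5000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.4 * W))
Y = rng.binomial(1, expit(-0.3 + A + 0.6 * W))

def fit_logistic(Xcols, y, lr=0.1, iters=4000):
    """Gradient-descent logistic regression (stand-in for super-learning)."""
    X = np.column_stack([np.ones(len(y))] + list(Xcols))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta -= lr * X.T @ (expit(X @ beta) - y) / len(y)
    return beta

# Step 1: initial estimators Qbar_n(A, W) and g_n(1 | W).
bQ = fit_logistic([A, W], Y)
bg = fit_logistic([W], A)
Q_AW = expit(bQ[0] + bQ[1] * A + bQ[2] * W)
Q_1W = expit(bQ[0] + bQ[1] + bQ[2] * W)
Q_0W = expit(bQ[0] + bQ[2] * W)
g1 = expit(bg[0] + bg[1] * W)

# Step 2: targeting. Fit eps in the least favorable submodel
# logit Qbar(eps) = logit Qbar + eps * H, with H = (2A - 1) / g(A | W).
H = (2 * A - 1) / np.where(A == 1, g1, 1 - g1)
eps = 0.0
for _ in range(200):  # solves the score equation P_n H (Y - Qbar(eps)) = 0
    eps -= 0.5 * np.mean(H * (expit(logit(Q_AW) + eps * H) - Y))
Q_star = expit(logit(Q_AW) + eps * H)
Q1_star = expit(logit(Q_1W) + eps / g1)          # H evaluated at A = 1
Q0_star = expit(logit(Q_0W) - eps / (1 - g1))    # H evaluated at A = 0

# Step 3: substitution estimator and influence-curve-based Wald CI.
psi = np.mean(Q1_star - Q0_star)
IC = H * (Y - Q_star) + Q1_star - Q0_star - psi
se = IC.std() / np.sqrt(n)
ci = (psi - 1.96 * se, psi + 1.96 * se)
```

On simulated data of this kind the substitution estimator Ψ(Q*_n) lands close to the true average treatment effect, and the sample variance of the estimated influence curve yields the Wald-type confidence interval described above.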

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller Q_0, defining a path and loss function with generalized score that spans the efficient influence curve, and implementing the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that O_1, ..., O_n are i.i.d. with distribution P_0. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73]; we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of these examples. For a general, comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure O = (L(0), A(0), ..., L(K), A(K), Y = L(K + 1)), where L(0) are baseline covariates, L(k) are time-dependent covariates realized between intervention nodes A(k − 1) and A(k), and Y is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring", since the outcome Y is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so called sequential randomization assumption, this target quantity is identified by the so called G-computation formula for the postintervention distribution corresponding with a stochastic intervention g*:

P^{g*}_0(O) = ∏_{k=0}^{K+1} P_{0,L(k)}(L(k) | L̄(k − 1), Ā(k − 1)) ∏_{k=0}^{K} g*_k(A(k) | Ā(k − 1), L̄(k)).    (19)

Note that this postintervention distribution is nothing else but the actual distribution of O factorized according to the time-ordering, but with the true conditional distributions of A(k), given its parents (L̄(k), Ā(k − 1)), replaced by the desired stochastic intervention g*. The statistical target parameter is thus E_{P^{g*}_0} Y, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust, efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions g*. Such robust, efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E[E(Y | A = 1, W) − E(Y | A = 0, W)] of our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA and use β as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure, with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then does one obtain valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and in response to that we have developed general TMLE with additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄_0(A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator Q̄_n satisfying this constraint, and the least favorable submodel {Q̄_{n,g_n}(ε) : ε} should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter g_0 in TMLE. Even though an asymptotically consistent estimator of g_0 yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g_0, but also with respect to how well the resulting TMLE fits ψ_0. Consider our running example. Suppose that among the components of W there is a variable that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate in the fit of g_n makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for it, but in most finite samples adjustment for this covariate in g_n will hurt the practical performance of TMLE, and effort should be put in variables that are stronger confounders. We developed a method for building an estimator g_n that uses as criterion the change in fit between the initial estimator of Q̄_0 and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g_0 in collaboration with the initial estimator Q̄_n [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so called Donsker class condition. For example, in our running example, it requires that Q̄_n and g_n are not too erratic functions of (A, W). This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the ε_n that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the fit of ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator Q̄_n is inconsistent, but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is no longer efficient. The desire for estimators to have a guarantee to beat certain user supplied estimators was formulated and implemented for double robust estimating equation based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user supplied estimator, even under heavy misspecification of the initial estimator Q̄_n [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator Q̄_n will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data adaptive consistent estimator of g_0 (and thus with a bias that converges to zero more slowly than 1/√n) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ_0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting Q̄_n but also targeting g_n to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets g_n, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of g_0 is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learning relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which are thus estimated in the cross-validated risk).

For example, suppose that, in our running example, A is continuous and we are concerned with estimation of the dose-response curve (E_0 Y_a : a), where E_0 Y_a = E_0 E_0(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, typically new challenges come up asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data, the experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions and stationarity assumptions, stating that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Inference. Another key assumption the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order) term R_n = o(1/√n). For example, in our running example, this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online data bases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimate without having to recompute the estimator from scratch, instead updating it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started doing research on such online TMLE that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections, the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling and causal inference, and validation, by anchoring these stages, or elements, of the research process in statistical theory. According to TMLE/SL all these elements should be related to, or defined in terms of, (properties of) the data-generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graph and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of, or approach to,
learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation, in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data-analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis, or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way, by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview is probabilistic; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals: no inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst
with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which were soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data-analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics based on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research on the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge, and it focuses on the parameter of interest, which is considered as a property of the, as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation
procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learner algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the values calculated by the candidate techniques, with the weights selected by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of influence-curve theory or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
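As a rough, self-contained sketch of this two-step procedure, the following code computes a targeted plug-in estimate of the average treatment effect on simulated data. The simulated data-generating mechanism, the deliberately crude initial estimator (a stand-in for the super learner), the assumption of a known treatment mechanism g, and the bisection solver are all illustrative choices of ours, not prescriptions of the methodology.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Illustrative simulated data: binary confounder W, treatment A with
# known mechanism g(W) = P(A = 1 | W), binary outcome Y.
random.seed(0)
n = 2000

def g(w):
    return 0.2 + 0.5 * w   # treated more often when W = 1

data = []
for _ in range(n):
    w = int(random.random() < 0.5)
    a = int(random.random() < g(w))
    y = int(random.random() < expit(-1.0 + a + 2.0 * w))
    data.append((w, a, y))

# Step 1 (stand-in for super learning): a deliberately crude initial
# estimator of Qbar(A, W) = E[Y | A, W] that ignores W entirely, so
# the targeting step has real work to do.
mean = {a0: sum(y for _, a, y in data if a == a0) /
            sum(1 for _, a, _ in data if a == a0) for a0 in (0, 1)}

def Qbar0(a, w):
    return mean[a]

# Step 2 (targeting): fluctuate the initial fit along the parametric
# submodel logit Qbar_eps = logit Qbar0 + eps * H, with H the "clever
# covariate" for the average treatment effect.
def H(a, w):
    return a / g(w) - (1 - a) / (1.0 - g(w))

def score(eps):
    # Empirical efficient-influence-curve estimating equation in eps.
    return sum(h * (y - expit(logit(Qbar0(a, w)) + eps * h))
               for (w, a, y) in data for h in [H(a, w)])

# score(eps) is strictly decreasing in eps: solve score(eps) = 0 by
# bisection.
lo, hi = -10.0, 10.0
for _ in range(80):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
eps = (lo + hi) / 2.0

def Qbar_star(a, w):
    return expit(logit(Qbar0(a, w)) + eps * H(a, w))

# Step 3: substitution (plug-in) estimator of the average treatment
# effect, averaging over the empirical distribution of W.
ate = sum(Qbar_star(1, w) - Qbar_star(0, w) for w, _, _ in data) / n
print("targeted ATE estimate:", round(ate, 3))
```

The targeting step solves the efficient-influence-curve estimating equation in the fluctuation parameter eps; the final estimate is again a substitution estimator, now with the bias-variance trade-off aimed at the target parameter rather than at the whole regression function.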

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field, often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be enough data so that careful design of studies and interpretation of data are not needed anymore. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation; and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
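The emptiness of the strata is easy to see numerically. In the following sketch, the covariate dimension and sample size are hypothetical round numbers of our own choosing, picked only to make the point:

```python
import random

# Hypothetical numbers for illustration: d binary covariates give 2^d
# covariate strata per treatment arm. Even a large sample then leaves
# almost every stratum empty, so the stratum-specific empirical mean
# of the outcome is undefined on most of the covariate space.
random.seed(1)
d, n = 30, 100_000                 # 2^30 ≈ 1.07e9 strata per arm

# Each observation's covariate pattern, encoded as a d-bit integer.
observed = {random.getrandbits(d) for _ in range(n)}

total = 2 ** d
print("strata with at least one observation:", len(observed))
print("fraction of strata ever observed:", len(observed) / total)
```

With 30 binary covariates there are about 10^9 strata per arm, so 10^5 observations leave all but a vanishing fraction of strata empty; a plug-in estimator built on stratum-specific means is undefined almost everywhere, which is exactly why smoothing via super learning is needed.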

Targeted Learning was developed in response to high dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, for target parameters defined as features of the data distribution instead of as coefficients in these parametric models, and for Targeted Learning.

The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state-of-the-art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super learners and TMLE.

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way; and the best possible way is not to give up on theoretical advances, but to make the theory relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 66, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.

[983090983096] J Robins ldquoA graphical approach to the identi1047297cation andestimation o causal parameters in mortality studies withsustained exposure periodsrdquo Journal of Chronic Diseases vol983092983088 supplement 983090 pp 983089983091983097Sndash983089983094983089S 983089983097983096983095

[983090983097] A Rotnitzky D Scharstein L Su and J Robins ldquoMethodsor conducting sensitivity analysis o trials with potentially nonignorable competing causes o censoringrdquo Biometrics vol983093983095 no 983089 pp 983089983088983091ndash983089983089983091 983090983088983088983089

[983091983088] J M Robins A Rotnitzky and D O Scharstein ldquoSensitivity analysis or se lection bias and unmeasured conounding in missing data and causal inerence modelsrdquo in Statistical Modelsin Epidemiology the Environment and Clinical rials IMAVolumes in Mathematics and Its Applications Springer BerlinGermany 983089983097983097983097

[983091983089] D O Scharstein A Rotnitzky and J Robins ldquoAdjustingor nonignorable drop-out using semiparametric nonresponsemodelsrdquo Journal of the American Statistical Association vol 983097983092no 983092983092983096 pp 983089983088983097983094ndash983089983089983092983094 983089983097983097983097

[983091983090] I Diaz and M J van der Laan ldquoSensitivity analysis orcausal inerence under unmeasured conounding and mea-

surement error problemsrdquo ech Rep Division o Biostatis-tics University o Caliornia Berkeley Cali USA 983090983088983089983090httpwwwbepresscomucbbiostatpaper983091983088983091

[983091983091] O Bembom and M J van der Laan ldquoA practical illustration o the im-portance o realistic individualized treatment rules incausal inerencerdquo Electronic Journal of Statistics vol 983089 pp 983093983095983092ndash983093983097983094 983090983088983088983095

[983091983092] S Rose and M J van der Laan ldquoSimple optimal weighting o cases and controls in case-control studiesrdquo Te International Journal of Biostatistics vol 983092 no 983089 983090983088983088983096

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," The International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee, et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," The International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," The International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," The International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," The International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," The International Journal of Biostatistics, vol. 6, no. 2, article 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


Advances in Statistics

[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," The International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," The International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, P. Neuvial, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).



m0(A, W) = β0 + β1 A + β2 W + β3 A W. Given this function m0, the true value ψ0 is computed by the above formula as follows:

\[
\psi_0 = \int_w \left( \frac{1}{1+\exp(-m_0(1,w))} - \frac{1}{1+\exp(-m_0(0,w))} \right) dQ_{W,0}(w). \qquad (5)
\]
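As a numerical illustration of formula (5), the sketch below evaluates ψ0 for one hypothetical choice of m0 and of the covariate distribution (a logistic-linear m0 with made-up coefficients and a standard normal W; neither is prescribed by the text), approximating the integral over dQ_{W,0} by a large Monte Carlo average:

```python
import numpy as np

# Hypothetical coefficients for the logistic form m0(A, W) = b0 + b1*A + b2*W.
b0, b1, b2 = -1.0, 0.8, 0.5

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def m0(a, w):
    return b0 + b1 * a + b2 * w

# Approximate the integral over dQ_{W,0} by averaging over a large sample of W,
# here taken to be standard normal purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000)
psi0 = np.mean(expit(m0(1, w)) - expit(m0(0, w)))
print(round(psi0, 3))
```

Any distribution of W could be substituted for the normal draw; ψ0 changes with it, which is exactly the sense in which the estimand averages the w-specific effects over Q_{W,0}.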

This parameter ψ0 has a clear statistical interpretation as the average over w of the w-specific additive treatment effects E0(Y | A = 1, W = w) − E0(Y | A = 0, W = w).

2.3. The Important Role of Models Also Involving Nontestable Assumptions. However, this particular statistical estimand ψ0 has an even richer interpretation if one is willing to make additional so-called causal (nontestable) assumptions. Let us assume that W, A, Y are generated by a set of so-called structural equations:

\[
W = f_W(U_W), \qquad A = f_A(W, U_A), \qquad Y = f_Y(W, A, U_Y), \qquad (6)
\]

where U = (U_W, U_A, U_Y) are random inputs following a particular unknown probability distribution, while the functions f_W, f_A, f_Y deterministically map the realization of the random input U = u sequentially into a realization of W = f_W(u_W), A = f_A(W, u_A), Y = f_Y(W, A, u_Y). One might not make any assumptions about the form of these functions f_W, f_A, f_Y. In that case, these causal assumptions put no restrictions on the probability distribution of O = (W, A, Y), but through these assumptions we have parametrized P0 by a choice of functions (f_W, f_A, f_Y) and a choice of distribution of U. Pearl [9] refers to such assumptions as a structural causal model for the distribution of (W, A, Y).

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing A = f_A(W, U_A) by our desired intervention on the intervention node A. For example, a static intervention A = 1 results in a new system of equations, W = f_W(U_W), Y_1 = f_Y(W, 1, U_Y), where this new random variable Y_1 is called a counterfactual outcome or potential outcome corresponding with intervention A = 1. Similarly, one can define Y_0 = f_Y(W, 0, U_Y). Thus Y_0 (Y_1) represents the outcome one would have seen on the subject if the subject had been assigned treatment A = 0 (A = 1). One might now define the causal effect of interest as E0 Y_1 − E0 Y_0, that is, the difference between the expected outcome of Y_1 and the expected outcome of Y_0. If one also assumes that A is independent of U_Y, given W, which is often referred to as the assumption of no unmeasured confounding or the randomization assumption, then it follows that ψ0 = E0 Y_1 − E0 Y_0. That is, under the structural causal model including this no-unmeasured-confounding assumption, ψ0 can not only be interpreted purely statistically as an average of conditional treatment effects, but it actually equals the marginal additive causal effect.
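The counterfactual construction above is easy to reproduce by simulation. The sketch below posits one hypothetical set of structural equations satisfying the randomization assumption (A depends on W but not on U_Y) and checks that the statistical estimand ψ0 agrees with the causal contrast E0 Y_1 − E0 Y_0 up to Monte Carlo error; all functional forms are illustrative, not taken from the text:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical structural equations W = f_W(U_W), A = f_A(W, U_A), Y = f_Y(W, A, U_Y);
# A depends on W only (not on U_Y), so the randomization assumption holds.
W = rng.normal(size=n)                                       # f_W
A = (rng.uniform(size=n) < expit(0.4 * W)).astype(float)     # f_A
U_Y = rng.uniform(size=n)
f_Y = lambda w, a, u: (u < expit(-1.0 + 0.8 * a + 0.5 * w)).astype(float)
Y = f_Y(W, A, U_Y)                                           # observed outcome

# Counterfactuals: rerun f_Y with A set to 1 and to 0, holding U_Y fixed.
Y1, Y0 = f_Y(W, 1.0, U_Y), f_Y(W, 0.0, U_Y)
causal_effect = Y1.mean() - Y0.mean()

# Statistical estimand: average of the w-specific differences in conditional
# means; here the truth E(Y | A = a, W) is known, so evaluate it directly.
psi0 = np.mean(expit(-1.0 + 0.8 + 0.5 * W) - expit(-1.0 + 0.5 * W))
print(abs(causal_effect - psi0) < 0.01)  # agree up to Monte Carlo error
```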

In general, causal models, or, more generally, sets of nontestable assumptions, can be used to define underlying target quantities of interest and corresponding statistical target parameters that equal this target quantity under these assumptions. Well-known classes of such models are models for censored data, in which the observed data is represented as a many-to-one mapping on the full data of interest and censoring variable, and the target quantity is a parameter of the full data distribution. Similarly, causal inference models represent the observed data as a mapping on counterfactuals and the observed treatment (either explicitly, as in the Neyman-Rubin model, or implicitly, as in the Pearl structural causal models), and one defines the target quantity as a parameter of the distribution of the counterfactuals. One is now often concerned with providing sets of assumptions on the underlying distribution (i.e., of the full data) that allow identifiability of the target quantity from the observed data distribution (e.g., coarsening at random or the randomization assumption). These nontestable assumptions do not change the statistical model M and, as a consequence, once one has defined the relevant estimand ψ0, do not affect the estimation problem either.

2.4. Estimation Problem. The estimation problem is defined by the statistical model (i.e., (O_1, …, O_n) ∼ P0 ∈ M) and the choice of target parameter (i.e., Ψ : M → ℝ). Targeted Learning is now the field concerned with the development of estimators of the target parameter that are asymptotically consistent as the number of units n converges to infinity and whose appropriately standardized version (e.g., √n(ψ_n − ψ0)) converges in probability distribution to some limit probability distribution (e.g., a normal distribution), so that one can construct confidence intervals that, for large enough sample size n, contain, with a user-supplied high probability, the true value of the target parameter. In the case that O_1, …, O_n are i.i.d. draws from P0, a common method for establishing asymptotic normality of an estimator ψ_n is to demonstrate that the estimator minus the truth can be approximated by an empirical mean of a function of O_i. Such an estimator is called asymptotically linear at P0. Formally, an estimator ψ_n is asymptotically linear under i.i.d. sampling from P0 if ψ_n − ψ0 = (1/n) Σ_{i=1}^n IC(P0)(O_i) + o_P(1/√n), where O ↦ IC(P0)(O) is the so-called influence curve at P0. In that case, the central limit theorem teaches us that √n(ψ_n − ψ0) converges to a normal distribution N(0, σ²) with variance σ² = E0 IC(P0)(O)², defined as the variance of the influence curve. An asymptotic 0.95 confidence interval for ψ0 is then given by ψ_n ± 1.96 σ_n/√n, where σ_n² is the sample variance of an estimate IC_n(O_i) of the true influence curve IC(P0)(O_i), i = 1, …, n.
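This confidence-interval recipe can be exercised on the simplest asymptotically linear estimator, the sample mean, whose influence curve is IC(P0)(O) = O − ψ0. The sketch below (with an arbitrary Bernoulli data-generating distribution, chosen only for illustration) checks by simulation that the resulting Wald-type interval attains roughly its nominal 0.95 coverage:

```python
import numpy as np

# The sample mean psi_n of i.i.d. draws is asymptotically linear with influence
# curve IC(P0)(O) = O - psi_0; the interval follows the recipe in the text:
# psi_n +/- 1.96 * sigma_n / sqrt(n), with sigma_n^2 the sample variance of the
# estimated influence curve IC_n(O_i) = O_i - psi_n.
rng = np.random.default_rng(2)
psi_true = 0.3
reps, n = 2000, 400
covered = 0
for _ in range(reps):
    O = rng.binomial(1, psi_true, size=n).astype(float)
    psi_n = O.mean()
    ic_n = O - psi_n                      # estimated influence curve values
    sigma_n = ic_n.std(ddof=1)
    half = 1.96 * sigma_n / np.sqrt(n)
    covered += (psi_n - half <= psi_true <= psi_n + half)
coverage = covered / reps
print(coverage)  # should be close to the nominal 0.95
```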

The empirical mean of the influence curve IC(P0) of an estimator ψ_n represents the first order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals of the empirical distribution [10–12]. That is, the influence curve IC(P0)(O) of an estimator ψ_n, viewed as a mapping from the empirical distribution P_n into the estimated value Ψ̂(P_n), is defined as the directional derivative at P0 in the direction (P_{n=1} − P0), where P_{n=1} is the empirical distribution at a single observation O.

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model. Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so-called tangent space generated by all scores of parametric submodels through P0 [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance, by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, so-called targeted minimum loss based estimation, discussed below, does not only result in asymptotically efficient estimators, but the estimators (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15–19], (2) focus on the construction of substitution estimators that by definition also fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted, computer friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above-mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the n observations are independent and identically distributed, O_i ∼ P0 ∈ M, so that Ψ : M → ℝ can now be defined as a parameter on the common distribution of O_i; but each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator. Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part Q0 = Q(P0) of the data distribution P0 (e.g., a factor of the likelihood), then a substitution estimator can be represented as Ψ(Q_n), where Q_n is an estimator of Q0 that is contained in the parameter space {Q(P) : P ∈ M} implied by the statistical model M. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., that it is a probability or a difference between two probabilities) as well as known bounds on the data distribution implied by the model M.

In our running example, we can define Q0 = (Q_{W,0}, Q̄0), where Q_{W,0} is the probability distribution of W under P0 and Q̄0(A, W) = E0(Y | A, W) is the conditional mean of the outcome, given the treatment and covariates, and represent the target parameter

\[
\psi_0 = \Psi(Q_0) = E_{Q_{W,0}} \left\{ \bar{Q}_0(1, W) - \bar{Q}_0(0, W) \right\} \qquad (7)
\]

as a function of the conditional mean Q̄0 and the probability distribution Q_{W,0} of W. The model M might restrict Q̄0 to be between 0 and a small number δ < 1, but otherwise puts no restrictions on Q0. A substitution estimator is now obtained by plugging in the empirical distribution Q_{W,n} of W_1, …, W_n for Q_{W,0}, and a data adaptive estimator 0 < Q̄_n < δ of the regression Q̄0:

\[
\psi_n = \Psi(Q_n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i) \right\}. \qquad (8)
\]
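A minimal sketch of the substitution estimator (8): simulate data of the running example's form (hypothetical coefficients), fit an initial estimator Q̄_n with a logistic working model (here a plain Newton-Raphson routine, used purely to keep the sketch self-contained), and plug it, together with the empirical distribution of W, into the target parameter mapping:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, y, steps=25):
    """Plain Newton-Raphson fit of a logistic regression (illustrative, not robust)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = expit(X @ beta)
        w = p * (1 - p) + 1e-9                 # IRLS weights
        H = X.T @ (X * w[:, None])             # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

# Simulated data with the running example's structure (coefficients hypothetical).
rng = np.random.default_rng(3)
n = 5000
W = rng.normal(size=n)
A = (rng.uniform(size=n) < expit(0.4 * W)).astype(float)
Y = (rng.uniform(size=n) < expit(-1.0 + 0.8 * A + 0.5 * W)).astype(float)

# Initial estimator Qbar_n: a correctly specified logistic working model, fit by ML.
X = np.column_stack([np.ones(n), A, W])
beta = fit_logistic(X, Y)
Qbar = lambda a, w: expit(beta[0] + beta[1] * a + beta[2] * w)

# Substitution estimator, equation (8): plug the empirical distribution of W
# and Qbar_n into the target parameter mapping.
psi_n = np.mean(Qbar(1.0, W) - Qbar(0.0, W))
print(round(psi_n, 3))
```

Because ψ_n is, by construction, a difference of two averaged probabilities, it necessarily lies in [−1, 1], illustrating the respect for global constraints discussed above.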

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of ψ0 could be defined as

\[
\psi_n = \frac{1}{n} \sum_{i=1}^{n} \frac{2A_i - 1}{g_n(A_i \mid W_i)} \, Y_i, \qquad (9)
\]

where g_n(· | W) is an estimator of the conditional probability of treatment g0(· | W). This is clearly not a substitution estimator. In particular, if g_n(A_i | W_i) is very small for some observations, this estimator might not be between −1 and 1, and it thus completely ignores known constraints.
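The bound violation mentioned above is easy to exhibit. In the toy computation below (made-up data and a deliberately poor estimate g_n with one near-zero treatment probability), the inverse probability of treatment estimator (9) lands far outside [−1, 1], the known range of the estimand:

```python
import numpy as np

# IPTW estimator of equation (9) on a toy data set, with a deliberately poor
# treatment-probability estimate g_n, to show the bound violation.
A = np.array([1.0, 1.0, 0.0, 0.0])
Y = np.array([1.0, 0.0, 0.0, 1.0])
g1 = np.array([0.01, 0.5, 0.5, 0.5])   # g_n(1 | W_i); first value is near zero
g = np.where(A == 1, g1, 1 - g1)       # g_n(A_i | W_i)
psi_iptw = np.mean((2 * A - 1) / g * Y)
print(psi_iptw)  # 24.5: far outside [-1, 1]
```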

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter. The construction of targeted estimators of the target parameter requires construction of an estimator of infinite dimensional nuisance parameters: specifically, the initial estimator of the relevant part Q0 of the data distribution in the TMLE, and the estimator of the nuisance parameter g0 = g(P0) that is needed to target the fit of this relevant part in the TMLE. In our running example, we have Q0 = (Q_{W,0}, Q̄0), and the nuisance parameter g0 is the conditional distribution of A, given W.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter. In order to optimize these estimators of the nuisance parameters (Q0, g0), we use a so-called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector, which compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded, and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size, as sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example, we have that Q̄0 = arg min_{Q̄} E0 L(Q̄)(O), where L(Q̄)(O) = (Y − Q̄(A, W))² is the squared error loss; alternatively, one can also use the log-likelihood loss L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Usually, there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but probably should itself be data adaptively selected in a targeted manner. The cross-validated risk of a candidate estimator of Q̄0 is then defined as the empirical mean, over a validation sample, of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and a training sample. A typical way to obtain such sample splits is so-called V-fold cross-validation, in which one first partitions the sample into V subsets of equal size, and each of the V subsets plays the role of a validation sample while its complement of V − 1 subsets equals the corresponding training sample. Thus V-fold cross-validation results in V sample splits into a validation sample and a corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for P0(Y = 1 | A, W). Different choices of such logistic linear regression working models result in different possible candidate estimators. So in this manner one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of Q̄0. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.
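The V-fold cross-validation selector can be sketched in a few lines. The example below implements a discrete super-learner: a library of three linear working models (illustrative only; the running example in the text uses logistic models, and the full super-learner also searches over weighted combinations of the library), with 5-fold cross-validated squared-error risk deciding the winner:

```python
import numpy as np

rng = np.random.default_rng(4)
n, V = 2000, 5
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n).astype(float)
Y = -1.0 + 0.8 * A + 0.5 * W**2 + rng.normal(size=n)  # truth is nonlinear in W

# Candidate estimators = least-squares fits on different design matrices.
designs = {
    "intercept":  lambda a, w: np.column_stack([np.ones(len(w))]),
    "A + W":      lambda a, w: np.column_stack([np.ones(len(w)), a, w]),
    "A + W + W2": lambda a, w: np.column_stack([np.ones(len(w)), a, w, w**2]),
}

folds = np.arange(n) % V          # V-fold sample splits
cv_risk = {}
for name, X_of in designs.items():
    sq_err = 0.0
    for v in range(V):
        tr, va = folds != v, folds == v
        coef, *_ = np.linalg.lstsq(X_of(A[tr], W[tr]), Y[tr], rcond=None)
        pred = X_of(A[va], W[va]) @ coef
        sq_err += np.sum((Y[va] - pred) ** 2)
    cv_risk[name] = sq_err / n    # cross-validated risk of this candidate

best = min(cv_risk, key=cv_risk.get)
print(best)
```

Under this data-generating truth the richest working model has the smallest cross-validated risk, so the selector picks it; refitting that candidate on the whole sample would give the final discrete super-learner fit.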

Similarly, one can define a super-learner of the conditional distribution of A, given W.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learner provides an important step in creating a robust estimator whose performance does not rely on being lucky, but on generating a rich enough library so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so-called (mean zero) efficient influence curve D*(P0)(O), up till a second order term that is asymptotically negligible [13]. That is, an estimator ψ_n is efficient if and only if it is asymptotically linear with influence curve IC(P0) equal to the efficient influence curve D*(P0):

\[
\psi_n - \psi_0 = \frac{1}{n} \sum_{i=1}^{n} D^{*}(P_0)(O_i) + o_P\!\left(\frac{1}{\sqrt{n}}\right). \qquad (10)
\]

The efficient influence curve, also called the canonical gradient, is indeed defined as the canonical gradient of the pathwise derivative of the target parameter Ψ : M → ℝ. Specifically, one defines a rich family of one-dimensional submodels {P(ε) : ε} through P at ε = 0, and one represents the pathwise derivative (d/dε)Ψ(P(ε))|_{ε=0} as an inner product E_P D(P)(O) S(P)(O), where S(P) is the score of the path {P(ε) : ε} and D(P) is a so-called gradient; here ⟨h_1, h_2⟩_P = E_P h_1(O) h_2(O) is the covariance inner product in the Hilbert space of functions of O with mean zero. The unique gradient that is also an element of the closure of the linear span of all scores generated by the family of one-dimensional submodels through P, also called the tangent space at P, is the canonical gradient D*(P) at P. Indeed, the canonical gradient can be computed as the projection of any given gradient D(P) onto the tangent space in the Hilbert space L²₀(P). An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect $\Psi : \mathcal{M} \to \mathbb{R}$ is given by
\[
D^*(P_0)(O) = \frac{2A-1}{g_0(A \mid W)}\left(Y - \bar{Q}_0(A, W)\right) + \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0). \tag{11}
\]
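The efficient influence curve above is straightforward to evaluate once plug-in estimates are available. The sketch below computes it together with the Wald-type confidence interval based on the second moment of the estimated curve; the plug-in values Qbar1, Qbar0, gA are assumed to be supplied by some initial fit:

```python
import numpy as np

def eif_ate(Y, A, Qbar1, Qbar0, gA, psi):
    """Efficient influence curve for the additive treatment effect, evaluated
    at plug-in values: Qbar1, Qbar0 are Qbar(1, W), Qbar(0, W) per subject,
    gA is g(A | W) at the observed treatment, psi is the plug-in Psi(Q)."""
    QbarA = np.where(A == 1, Qbar1, Qbar0)
    return (2 * A - 1) / gA * (Y - QbarA) + Qbar1 - Qbar0 - psi

def wald_ci(eic, psi, z=1.96):
    """95% Wald interval from the second moment of the estimated curve
    (the estimated influence curve has mean approximately zero)."""
    n = len(eic)
    se = np.sqrt(np.mean(eic ** 2) / n)
    return psi - z * se, psi + z * se
```

In a randomized trial with known $g_0(1 \mid W) = 0.5$, gA is simply 0.5 for every subject and no treatment-mechanism estimation is needed.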

As noted earlier, the influence curve $IC(P_0)$ of an estimator $\psi_n$ also characterizes the limit variance $\sigma_0^2 = P_0 IC(P_0)^2$ of the mean zero normal limit distribution of $\sqrt{n}(\psi_n - \psi_0)$. This variance $\sigma_0^2$ can be estimated with $(1/n)\sum_{i=1}^{n} IC_n(O_i)^2$, where $IC_n$ is an estimator of the influence curve $IC(P_0)$. Efficiency theory teaches us that, for any regular asymptotically linear estimator, its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, $\sigma^2_{*,0} = P_0 D^*(P_0)^2$, which is also called the generalized Cramer-Rao lower bound. In our running example, the


Advances in Statistics

asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate $D^*_n(O_i)$ of $D^*(P_0)(O_i)$, obtained by plugging in the estimator $g_n$ of $g_0$ and the estimator $\bar{Q}_n$ of $\bar{Q}_0$, while $\Psi(Q_0)$ is replaced by $\Psi(Q_n) = (1/n)\sum_{i=1}^{n}(\bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i))$.

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of $O$ that depends on $P_0$ through $Q_0$ and a possible nuisance parameter $g_0$, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through $P_0$. It is also called the efficient score. Thus, given the statistical model and the target parameter mapping, one can calculate the efficient influence curve, whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator $\Psi(Q_n^*)$, beyond $Q_n^*$ being an excellent estimator of $Q_0$ as achieved with super-learning, is that the estimator solves the so-called efficient influence curve equation $\sum_{i=1}^{n} D^*(Q_n^*, g_n)(O_i) = 0$ for a good estimator $g_n$ of $g_0$. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models $\mathcal{M}$ typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation $Pf \equiv \int f(o)\,dP(o)$ for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which $P_0 D^*(P) = \Psi(P_0) - \Psi(P)$, and, in general, as a consequence of $D^*(P)$ being a canonical gradient,

\[
P_0 D^*(P) = \Psi(P_0) - \Psi(P) + R(P, P_0), \tag{12}
\]
where $R(P, P_0) = R((Q, g), (Q_0, g_0))$ is a term involving second order differences $(Q - Q_0)^2$, $(Q - Q_0)(g - g_0)$, $(g - g_0)^2$. This key property explains why solving $P_0 D^*(P) = 0$ targets $\Psi(P)$ to be close to $\Psi(P_0)$, and thus explains why solving $P_n D^*(Q_n^*, g_n) = 0$ targets $\Psi(Q_n^*)$ to fit $\Psi(Q_0)$. In our running example, we have $R(P, P_0) = R_1(P, P_0) - R_0(P, P_0)$, where $R_a(P, P_0) = \int_w \big((g - g_0)(a \mid w)/g(a \mid w)\big)\big(\bar{Q} - \bar{Q}_0\big)(a, w)\,dP_0(w)$. So, in our example, the remainder $R(P, P_0)$ only involves a cross-product difference $(g - g_0)(\bar{Q} - \bar{Q}_0)$. In particular, the remainder equals zero if either $g = g_0$ or $\bar{Q} = \bar{Q}_0$, which is often referred to as double robustness of the efficient influence curve with respect to $(\bar{Q}, g)$ in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.
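The double robustness of the remainder can be verified exactly in a toy discrete data-generating distribution, where $P_0 D^*(P)$ and $\Psi(P_0) - \Psi(P)$ are finite sums. All distributions and misspecified fits below are illustrative choices, not quantities from the paper:

```python
# Toy discrete world: binary W with P0(W = 1) = 0.5, a true outcome
# regression Qbar0 and treatment mechanism g0, and misspecified fits.
pW = {0: 0.5, 1: 0.5}                       # P0(W = w)

def Qbar0(a, w):                            # true E0(Y | A = a, W = w)
    return 0.2 + 0.3 * a + 0.2 * w

def g0(a, w):                               # true P0(A = a | W = w)
    p1 = 0.3 + 0.2 * w
    return p1 if a == 1 else 1.0 - p1

def psi(Qbar):
    """Psi(Q) = E_W[Qbar(1, W) - Qbar(0, W)] under the true W-marginal."""
    return sum(pW[w] * (Qbar(1, w) - Qbar(0, w)) for w in (0, 1))

def P0_Dstar(Qbar, g):
    """Exact P0 D*(P), summing over (w, a) and using E0(Y | A, W) = Qbar0;
    the Qbar(1,W) - Qbar(0,W) - Psi(Q) part of D* has exact mean zero here."""
    return sum(
        pW[w] * g0(a, w) * (2 * a - 1) / g(a, w) * (Qbar0(a, w) - Qbar(a, w))
        for w in (0, 1) for a in (0, 1)
    )

def remainder_gap(Qbar, g):
    """R(P, P0) = P0 D*(P) - (Psi(P0) - Psi(P)); zero under double robustness."""
    return P0_Dstar(Qbar, g) - (psi(Qbar0) - psi(Qbar))

def Qbar_bad(a, w):                         # misspecified outcome regression
    return 0.5 + 0.1 * a

def g_bad(a, w):                            # misspecified treatment mechanism
    return 0.5
```

Here remainder_gap vanishes (up to rounding) when either the outcome regression or the treatment mechanism is correct, and is about 0.08 when both are misspecified.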

Due to this identity (12), an estimator $\hat{P}_n$ that solves $P_n D^*(\hat{P}_n) = 0$ and is in a local neighborhood of $P_0$, so that $R(\hat{P}_n, P_0) = o_P(1/\sqrt{n})$, approximately solves $\Psi(\hat{P}_n) - \Psi(P_0) \approx (P_n - P_0) D^*(\hat{P}_n)$, where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining $P_n D^*(Q_n^*, g_n) = 0$ with (12) at $P = (Q_n^*, g_n)$ yields
\[
\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n^*, g_n) + R_n, \tag{13}
\]
where $R_n$ is a second order term. Thus, if second order differences such as $(\bar{Q}_n^* - \bar{Q}_0)^2$, $(\bar{Q}_n^* - \bar{Q}_0)(g_n - g_0)$, and $(g_n - g_0)^2$ converge to zero at a rate faster than $1/\sqrt{n}$, then it follows that $R_n = o_P(1/\sqrt{n})$. To make this assumption as reasonable as possible, one should use super-learning for both $\bar{Q}_n$ and $g_n$.

In addition, empirical process theory teaches us that $(P_n - P_0) D^*(Q_n^*, g_n) = (P_n - P_0) D^*(Q_0, g_0) + o_P(1/\sqrt{n})$ if $P_0\{D^*(Q_n^*, g_n) - D^*(Q_0, g_0)\}^2$ converges to zero in probability as $n$ converges to infinity (a consistency condition) and if $D^*(Q_n^*, g_n)$ falls in a so-called Donsker class of functions $O \to f(O)$ [11]. An important Donsker class is the class of all $d$-variate real valued functions that have a uniform sectional variation norm bounded by some universal $M < \infty$; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this $M < \infty$. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as $n$ converges to infinity. So, under this Donsker class condition and the consistency condition, and with $R_n = o_P(1/\sqrt{n})$, we have
\[
\psi_n^* - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} D^*(Q_0, g_0)(O_i) + o_P\!\left(\frac{1}{\sqrt{n}}\right). \tag{14}
\]
That is, $\psi_n^*$ is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of $\Psi(Q_n^*)$ up till a second order term $R_n$, and even asymptotic efficiency if $R_n = o_P(1/\sqrt{n})$, under some weak regularity conditions.

3. Road Map for Targeted Learning of Causal Quantities or Other Underlying Full-Data Target Parameters

This is a good moment to review the roadmap for Targeted Learning. We have formulated a transparent roadmap for Targeted Learning of a causal quantity [2, 9, 21], involving the following steps:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [22-28] or the structural causal model [9]);

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true $P_0$;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29-32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ for the statistical model $\mathcal{M}$, for some underlying parameter space $\Theta$ and parameterization $\theta \to P_\theta$. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of $\theta_0$. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so-called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and is thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution $P_0$, say $Q_0 = Q(P_0)$, that can be represented as the minimizer of a criterion at the true data distribution $P_0$ over all candidate values $\{Q(P) : P \in \mathcal{M}\}$ for this part of the data distribution; we refer to this criterion as the risk $R_{P_0}(Q)$ of the candidate value $Q$.

Typically, the risk at a candidate parameter value $Q$ can be defined as the expectation under the data distribution of a loss function $(O, Q) \to L(Q)(O)$ that maps the unit data structure and the candidate parameter value into a real number: $R_{P_0}(Q) = E_0 L(Q)(O)$. Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of $Q_0$ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.
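The risk-as-expected-loss idea can be sketched as follows, with the two loss functions mentioned above and the empirical risk $P_n L(Q)$ used in loss-based selection. The data layout $O = (W, A, Y)$ follows the running example; the candidate functions passed in are hypothetical:

```python
import numpy as np

def squared_error_loss(Qbar, O):
    """L(Qbar)(O) for a conditional mean: (Y - Qbar(A, W))^2."""
    W, A, Y = O
    return (Y - Qbar(A, W)) ** 2

def neg_loglik_loss(Qbar, O):
    """L(Qbar)(O) for the conditional probability of a binary Y."""
    W, A, Y = O
    p = float(np.clip(Qbar(A, W), 1e-12, 1.0 - 1e-12))
    return -(Y * np.log(p) + (1.0 - Y) * np.log(1.0 - p))

def empirical_risk(loss, Qbar, data):
    """P_n L(Qbar): the sample-average risk used in loss-based selection."""
    return float(np.mean([loss(Qbar, O) for O in data]))
```

A candidate closer to the true conditional mean attains a lower empirical risk under either loss, which is exactly what the cross-validation selector of the super-learner exploits.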

Secondly, one computes the efficient influence curve $(O, P) \to D^*(Q(P), g(P))(O)$, identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution $P$, where this efficient influence curve only depends on $P$ through $Q(P)$ and some nuisance parameter $g(P)$. Given an estimator $(Q_n, g_n)$, one now defines a path $\{Q_{n,g_n}(\epsilon) : \epsilon\}$ with Euclidean parameter $\epsilon$ through the super-learner $Q_n$ whose score
\[
\frac{d}{d\epsilon} L\left(Q_{n,g_n}(\epsilon)\right)\Big|_{\epsilon=0} \tag{15}
\]
at $\epsilon = 0$ spans the efficient influence curve $D^*(Q_n, g_n)$ at the initial estimator $(Q_n, g_n)$; this is called a least favorable parametric submodel through the super-learner.

In our running example, we have $Q = (\bar{Q}, Q_W)$, so that it suffices to construct a path through $\bar{Q}$ and $Q_W$ with corresponding loss functions and show that their scores span the efficient influence curve (11). Since the outcome is binary, we can define the path on the logit scale, $\bar{Q}(\epsilon) = \operatorname{expit}(\operatorname{logit}\bar{Q} + \epsilon C(g))$, where $C(g)(O) = (2A - 1)/g(A \mid W)$, and loss function $L(\bar{Q})(O) = -Y \log \bar{Q}(A, W) - (1 - Y)\log(1 - \bar{Q}(A, W))$. Note that the score of this path is
\[
D^*_Y(\bar{Q}, g)(O) = \frac{2A - 1}{g(A \mid W)}\left(Y - \bar{Q}(A, W)\right). \tag{16}
\]

We also define the path $Q_W(\epsilon) = (1 + \epsilon D^*_W(Q))Q_W$, with loss function $L(Q_W)(W) = -\log Q_W(W)$, where $D^*_W(Q)(O) = \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(Q)$. Note that the score of this path is
\[
D^*_W(Q). \tag{17}
\]
Thus, if we define the sum loss function $L(Q) = L(\bar{Q}) + L(Q_W)$, then the score of the joint path $(\bar{Q}(\epsilon), Q_W(\epsilon))$ is
\[
D^*(Q, g). \tag{18}
\]


This proves that indeed these proposed paths through $\bar{Q}$ and $Q_W$ and corresponding loss functions span the efficient influence curve $D^*(Q, g) = D^*_W(Q) + D^*_Y(\bar{Q}, g)$ at $(Q, g)$, as required.
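The spanning claim above can be checked numerically: along a logistic fluctuation with clever covariate $H = (2A-1)/g(A \mid W)$ (the standard choice for a binary outcome), a central-difference derivative at $\epsilon = 0$ of the log-likelihood reproduces the residual term $H(Y - \bar{Q})$ of the efficient influence curve; the marginal term is produced by the path through the distribution of $W$:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def loglik(p, Y):
    """Log-likelihood of a binary Y under candidate conditional means p."""
    return Y * np.log(p) + (1.0 - Y) * np.log(1.0 - p)

def score_at_zero(Y, A, Qbar_AW, g_AW, eps=1e-6):
    """Central-difference derivative at epsilon = 0 of the log-likelihood
    along logit Qbar(eps) = logit Qbar + eps * H, with H = (2A-1)/g(A|W)."""
    H = (2 * A - 1) / g_AW
    def ll(e):
        return loglik(expit(logit(Qbar_AW) + e * H), Y)
    return (ll(eps) - ll(-eps)) / (2.0 * eps)
```

Evaluating this at any values of $(Y, A, \bar{Q}, g)$ recovers $H \cdot (Y - \bar{Q}(A, W))$ up to numerical-differentiation error.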

The dimension of $\epsilon$ can be selected to be equal to the dimension of the target parameter $\psi_0$, but by creating extra components in $\epsilon$ one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an $\epsilon_1$ for the path through $\bar{Q}$ and a separate $\epsilon_2$ for the path through $Q_W$. In this case, the TMLE update $Q_n^*$ will solve two score equations, $P_n D^*_W(Q_n^*) = 0$ and $P_n D^*_Y(\bar{Q}_n^*, g_n) = 0$, and thus in particular $P_n D^*(Q_n^*, g_n) = 0$. In this example, the main benefit of using a bivariate $\epsilon = (\epsilon_1, \epsilon_2)$ is that the TMLE does not update $Q_{W,n}$ (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter $\epsilon$ of this path by minimizing the empirical risk $\epsilon \to P_n L(Q_{n,g_n}(\epsilon))$ along this path through the super-learner, resulting in an estimator $\epsilon_n$. This now defines an update of the super-learner fit, defined as $Q_n^1 = Q_{n,g_n}(\epsilon_n)$. This updating process is iterated till $\epsilon_n \approx 0$. The final update we will denote with $Q_n^*$, the TMLE of $Q_0$, and the target parameter mapping applied to $Q_n^*$ defines the TMLE of the target parameter $\psi_0$. This TMLE $Q_n^*$ solves the efficient influence curve equation $\sum_{i=1}^{n} D^*(Q_n^*, g_n)(O_i) = 0$, providing the basis, in combination with statistical properties of $(Q_n^*, g_n)$, for establishing that the TMLE $\Psi(Q_n^*)$ is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have $\epsilon_{1,n} = \arg\min_{\epsilon} P_n L(\bar{Q}^0_{n,g_n}(\epsilon))$, while $\epsilon_{2,n} = \arg\min_{\epsilon_2} P_n L(Q_{W,n}(\epsilon_2))$ equals zero. That is, the TMLE does not update $Q_{W,n}$, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case $Q_n^* = Q_n^1$, since the convergence of the TMLE-algorithm occurs in one step, and, of course, $Q^*_{W,n} = Q_{W,n}$ is just the initial empirical distribution function of $W_1, \ldots, W_n$. The TMLE of $\psi_0$ is the substitution estimator $\Psi(Q_n^*)$.
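Putting the pieces together for the running example, a one-step TMLE of the additive treatment effect can be sketched as follows. The initial fits for $\bar{Q}$ and $g$ are taken as given (a real analysis would obtain them by super-learning), and a Newton-Raphson fit of $\epsilon$ stands in for the usual logistic regression with offset:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_ate(Y, A, QbarA, Qbar1, Qbar0, gA, g1, n_newton=100):
    """One targeting step for the additive treatment effect.
    QbarA, Qbar1, Qbar0: initial estimates of E(Y | A, W) at the observed A
    and at A = 1, 0; gA, g1: estimates of g(A | W) at the observed A and of
    g(1 | W). The caller supplies these initial fits."""
    H_A = (2 * A - 1) / gA                   # clever covariate at observed A
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)     # clever covariate at A = 1, 0
    eps = 0.0
    for _ in range(n_newton):                # MLE of eps along the submodel
        Qe = expit(logit(QbarA) + eps * H_A)
        score = np.sum(H_A * (Y - Qe))
        info = np.sum(H_A ** 2 * Qe * (1.0 - Qe))
        eps += score / info
    Q1s = expit(logit(Qbar1) + eps * H1)     # updated Qbar*(1, W)
    Q0s = expit(logit(Qbar0) + eps * H0)     # updated Qbar*(0, W)
    psi = float(np.mean(Q1s - Q0s))          # substitution estimator Psi(Q*)
    QsA = np.where(A == 1, Q1s, Q0s)
    eic = H_A * (Y - QsA) + Q1s - Q0s - psi  # estimated efficient IC
    return psi, eic
```

By construction, the returned influence-curve values average to (numerically) zero: the update solves the efficient influence curve equation, and, by double robustness, even a flat initial $\bar{Q}_n$ combined with a correct $g_n$ yields a consistent estimate.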

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller $Q_0$, defining a path and loss function with generalized score that spans the efficient influence curve, and running the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that $O_1, \ldots, O_n \sim_{iid} P_0$. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33-72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73]; we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to review these examples in detail. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure $O = (L(0), A(0), \ldots, L(K), A(K), Y = L(K+1))$, where $L(0)$ are baseline covariates, $L(k)$ are time dependent covariates realized between intervention nodes $A(k-1)$ and $A(k)$, and $Y$ is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring", since the outcome $Y$ is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the postintervention distribution corresponding with a stochastic intervention $g^*$:
\[
P_0^{g^*}(O) = \prod_{k=0}^{K+1} P_{0, L(k) \mid \bar{L}(k-1), \bar{A}(k-1)}\!\left(L(k) \mid \bar{L}(k-1), \bar{A}(k-1)\right) \prod_{k=0}^{K} g^*\!\left(A(k) \mid \bar{A}(k-1), \bar{L}(k)\right). \tag{19}
\]
Note that this postintervention distribution is nothing else but the actual distribution of $O$, factorized according to the time-ordering, but with the true conditional distributions of $A(k)$, given parents $(\bar{A}(k-1), \bar{L}(k))$, replaced by the desired stochastic intervention $g^*$. The statistical target parameter is thus $E_{P_0^{g^*}} Y$, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions $g^*$. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter $\Psi(P) = E[E(Y \mid A = 1, W) - E(Y \mid A = 0, W)]$ of our running example, but with $A$ now being the SNP in question and $W$ being the variables one wants to control for, while $Y$ is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of $E(Y \mid A, W) - E(Y \mid A = 0, W)$ onto a linear model such as $\beta A$ and use $\beta$ as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then does one obtain valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75-79].

Software has been developed in the form of general R packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to these, we have developed general TMLE with additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data are still sparse even though the sample size might be quite large. When the data are sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that $Y$ is the indicator of a rare event. In such cases it is often known that the probability of $Y = 1$, conditional on a treatment and covariate configuration, should not exceed a certain value $\delta > 0$; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that $\bar{Q}_0(A, W) < \delta$ for some known $\delta > 0$. A TMLE should now be based on an initial estimator $\bar{Q}_n$ satisfying this constraint, and the least favorable submodel $\{\bar{Q}_{n,g_n}(\epsilon) : \epsilon\}$ should also satisfy this constraint for each $\epsilon$, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of the Nuisance Parameter $g_0$ in TMLE. Even though an asymptotically consistent estimator $g_n$ of $g_0$ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating $g_0$, but also with respect to how well the resulting TMLE fits $\psi_0$. Consider our running example. Suppose that among the components of $W$ there is a $W_j$ that is an almost perfect predictor of $A$ but has no effect on the outcome $Y$. Inclusion of such a covariate $W_j$ in the fit of $g_n$ makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for $W_j$, but in most finite samples adjustment for $W_j$ in $g_n$ will hurt the practical performance of the TMLE, and effort should be put into variables that are stronger confounders than $W_j$. We developed a method for building an estimator $g_n$ that uses as criterion the change in fit between the initial estimator of $\bar{Q}_0$ and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits $g_0$ in collaboration with the initial estimator $\bar{Q}_n$ [2, 44-46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that $\bar{Q}_n$ and $g_n$ are not too erratic functions of $(W, A)$. This condition is not just theoretical; one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the $\epsilon$ that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the $\epsilon$ that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator $\bar{Q}_n$ is inconsistent, but $g_n$ is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user supplied estimators was formulated and implemented for double robust estimating equation based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user supplied estimator, even under heavy misspecification of the initial estimator $\bar{Q}_n$ [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator $\bar{Q}_n$ will be close to the true $\bar{Q}_0$, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of $\psi_0$. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either $\bar{Q}_n$ or $g_n$ is consistent. However, if one uses a data adaptive consistent estimator of $g_0$ (and thus with bias larger than $1/\sqrt{n}$) and $\bar{Q}_n$ is inconsistent, then the bias of $g_n$ might directly map into a bias for the resulting TMLE of $\psi_0$ of the same order. As a consequence, the TMLE might have a bias with respect to $\psi_0$ that is larger than $O(1/\sqrt{n})$, so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating $g_n$) to guarantee that the TMLE remains asymptotically linear with known influence curve when either $\bar{Q}_n$ or $g_n$ is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting $\bar{Q}_n$ but also targeting $g_n$, to guarantee that, when $\bar{Q}_n$ is misspecified, the required smooth function of $g_n$ will behave as a TMLE, and, if $g_n$ is misspecified, the required smooth functional of $\bar{Q}_n$ is still asymptotically linear. The same method was used to develop an IPW estimator that targets $g_n$, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of $g_0$ is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that in our running example $A$ is continuous and we are concerned with estimation of the dose-response curve $a \to E_0\bar{Q}_0(a, W)$, where $\bar{Q}_0(a, W) = E_0(Y \mid A = a, W)$. One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle, and adapt to, any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.

Advances in Statistics

By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator, such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of a TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
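The common practice being criticized can be made concrete. A minimal sketch, using the sample mean (whose influence curve is simply IC(Y) = Y - psi) as the estimator; the function name is ours:

```python
import numpy as np
from math import sqrt

def wald_ci(psi_hat, ic_values, z=1.96):
    """Wald-type confidence interval from estimated influence-curve values.

    The standard error is the sample standard deviation of the estimated
    influence curve divided by sqrt(n): exactly the sample-variance recipe
    the text warns can be anti-conservative under sparsity.
    """
    n = len(ic_values)
    se = ic_values.std(ddof=1) / sqrt(n)
    return psi_hat - z * se, psi_hat + z * se

# Illustration with the sample mean of a rare-ish binary outcome.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500).astype(float)
psi_hat = y.mean()
lo, hi = wald_ci(psi_hat, y - psi_hat)
```

When outcomes are rare or weights (and hence influence-curve values) are large, this plug-in of the empirical variance is exactly the nonrobust step the proposed TMLE of the asymptotic variance is meant to replace.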

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then there is naturally no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings, where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions, and stationarity assumptions stating that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4-8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining the choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for allowing us to obtain inference for a data-driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
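The sample-splitting recipe that current teaching prescribes can be sketched directly. A toy illustration, with an assumed data-adaptive parameter (the mean outcome in the group flagged by whichever binary column looks most predictive on the first half); all names here are ours:

```python
import numpy as np

def split_sample_inference(X, y, seed=0):
    """Sample-splitting sketch: half the data picks the question, the
    untouched half answers it with a valid confidence interval.

    The data-adaptive target parameter (an assumption for illustration) is
    the mean of y within the group defined by whichever binary column of X
    is most correlated with y on the first half.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    a, b = idx[:half], idx[half:]
    # Step 1: mine the first half for an interesting column.
    scores = [abs(np.corrcoef(X[a, j], y[a])[0, 1]) for j in range(X.shape[1])]
    j_star = int(np.argmax(scores))
    # Step 2: estimate the chosen parameter on the held-out half only.
    grp = y[b][X[b, j_star] == 1]
    est = grp.mean()
    se = grp.std(ddof=1) / np.sqrt(len(grp))
    return j_star, est, (est - 1.96 * se, est + 1.96 * se)
```

The price is visible in the code: only half the sample is left for estimation, which is exactly the sacrifice the CV-TMLE approach for data-adaptive target parameters is designed to avoid.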

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., an indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and that will heavily affect the true optimality of the fitted rules.
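As a minimal sketch of the plug-in idea in the simplest one-time-point setting (the setting and names are our assumptions, not the cited methodology itself): the estimated rule treats exactly when the estimated outcome regression favors treatment, and the rule's value is the regression evaluated at the rule's own choice.

```python
import numpy as np

def rule_value_plugin(qbar1, qbar0):
    """Plug-in sketch for a one-time-point rule: treat when the estimated
    outcome regression favors treatment; the value of the rule is the mean
    of the regression evaluated at the rule's choice. Valid confidence
    intervals for this data-adaptive quantity are what the cited work adds.
    """
    d = (np.asarray(qbar1) > np.asarray(qbar0)).astype(int)
    value = float(np.mean(np.where(d == 1, qbar1, qbar0)))
    return d, value
```

Because the rule d is itself fitted from the data, naive inference for this value is exactly the data-adaptive-parameter problem discussed above.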

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order) term is o_P(1/sqrt(n)). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q-bar0 and g0 converge to their targets goes to zero faster than 1/sqrt(n). The density estimation literature proves that if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
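In formula form, with notation reconstructed from the running example (Q-bar for the outcome regression, g for the treatment mechanism; the exact symbols are our assumption), and stated for the treatment-specific mean rather than the full average treatment effect:

```latex
% First-order expansion of a pathwise differentiable parameter \Psi at an
% estimate \hat{P} of the truth P_0, with efficient influence curve D^*:
\Psi(\hat{P}) - \Psi(P_0) = -\int D^*(\hat{P})\,dP_0 + R_2(\hat{P}, P_0),
% and, for the treatment-specific mean E_0\,\bar{Q}_0(1,W), the remainder
% is a cross-product of the two estimation errors:
R_2(\hat{P}, P_0) = \int
  \frac{\hat{g}(1\mid w) - g_0(1\mid w)}{\hat{g}(1\mid w)}
  \bigl(\hat{\bar{Q}}(1,w) - \bar{Q}_0(1,w)\bigr)\, dP_0(w),
% so R_2 = o_P(1/\sqrt{n}) as soon as the product of the rates of
% \hat{\bar{Q}} and \hat{g} is faster than n^{-1/2}, for example when each
% converges faster than n^{-1/4}.
```

The higher order theory sketched in the text replaces this first order expansion by a longer Taylor-type expansion, so that only a still-higher order remainder needs to be negligible.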

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
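The updating requirement, that the cost per update depends only on the size of the new chunk, can be illustrated with the simplest possible target. This is a toy sketch of the online idea (running mean and standard error via the Chan et al. merge formulas), not the online TMLE itself:

```python
import numpy as np

class OnlineEstimator:
    """Toy illustration of online updating: the state is updated from each
    new chunk alone; earlier data are never revisited.

    Here the target is just a mean and its standard error; an online TMLE
    would analogously update its working fits chunk by chunk.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        m = len(chunk)
        if m == 0:
            return self
        c_mean = chunk.mean()
        c_m2 = ((chunk - c_mean) ** 2).sum()
        delta = c_mean - self.mean
        total = self.n + m
        # Chan et al. merge formulas: combine the old summary with the new
        # chunk's summary, at a cost depending only on the chunk size.
        self.m2 += c_m2 + delta ** 2 * self.n * m / total
        self.mean += delta * m / total
        self.n = total
        return self

    def se(self):
        return (self.m2 / (self.n - 1)) ** 0.5 / self.n ** 0.5
```

After any sequence of updates, the state matches what a full recomputation on the pooled data would give; the open research question is achieving the analogous property for targeted, data-adaptive estimators.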

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89-91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, causal inference, and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graph and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of, or approach to,


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation, in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors of usually unequal importance that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1978. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962 in the famous opening passage from The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success must prevail in data analysis. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliche, it must yet be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated

with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of parametric Fisherian statistics based on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge, and it focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research, nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts of any concept of Data Science.
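The two-step procedure just described can be sketched for the running example (binary outcome, average treatment effect). This is a minimal illustration only: the argument names are ours, the initial fits qbar1, qbar0 (outcome regression) and g1 (treatment mechanism) would come from a super learner in a real analysis, and the one-parameter fluctuation is fit here with a hand-rolled Newton solver rather than standard GLM software.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_ate(y, a, qbar1, qbar0, g1, tol=1e-10, max_iter=100):
    """Targeting step sketch for the average treatment effect, binary Y.

    A single fluctuation parameter eps is fit by maximum likelihood along
    a logistic submodel through the initial fit, using the 'clever
    covariate' H; the updated fit is then plugged in, and a standard error
    follows from the estimated efficient influence curve.
    """
    qbar_a = np.where(a == 1, qbar1, qbar0)
    h_a = a / g1 - (1 - a) / (1 - g1)           # clever covariate at A
    h1, h0 = 1.0 / g1, -1.0 / (1.0 - g1)        # at A=1 and A=0
    off = logit(np.clip(qbar_a, 1e-9, 1 - 1e-9))
    eps = 0.0
    for _ in range(max_iter):                   # one-dim Newton for eps
        p = expit(off + eps * h_a)
        score = float(np.sum(h_a * (y - p)))
        info = float(np.sum(h_a ** 2 * p * (1 - p)))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    q1s = expit(logit(np.clip(qbar1, 1e-9, 1 - 1e-9)) + eps * h1)
    q0s = expit(logit(np.clip(qbar0, 1e-9, 1 - 1e-9)) + eps * h0)
    psi = float(np.mean(q1s - q0s))             # targeted plug-in estimate
    ic = h_a * (y - expit(off + eps * h_a)) + (q1s - q0s) - psi
    se = float(ic.std(ddof=1) / np.sqrt(len(y)))
    return psi, se
```

The update leaves the estimator a substitution estimator while solving the efficient influence curve estimating equation, which is what delivers the normal limiting distribution used for the confidence intervals.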

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation; and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
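The emptiness of the strata is simple arithmetic: with d binary covariates plus a binary treatment there are 2^(d+1) strata, which outgrows even a trillion observations once d passes roughly 40. A minimal sketch:

```python
# With d binary covariates and a binary treatment there are 2**(d + 1)
# strata; an upper bound on the fraction of strata that can be nonempty
# is n / 2**(d + 1), which collapses quickly as d grows.
n = 10 ** 12

def max_occupied_fraction(d, n=n):
    return min(1.0, n / 2 ** (d + 1))
```

So already at d = 40 fewer than half the strata can contain even a single observation, and at d = 60 essentially all of them are empty.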

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, for target parameters defined as features of the data distribution instead of coefficients in these parametric models, and for Targeted Learning.

The massive dimension of the data does make it appealing not to be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of building large databases that collect data on total populations is that the data might correspond with observing a single process, such as a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases, it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with the state of the art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Advances in Statistics

Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and on the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," The International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1-20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1-32, 2011, Working Paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," The International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working Paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113-156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97-128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545-597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," The International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351-371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373-395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465-480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688-701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9-12, pp. 1393-1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9-12, pp. 923-945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S-161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103-113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096-1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574-596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," The International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39-64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099-1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152-172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," The International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792-796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," The International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541-549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," The International Journal of Biostatistics, vol. 9, no. 2, pp. 149-160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," The International Journal of Biostatistics, vol. 9, no. 2, pp. 161-174, 2013.
[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171-192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: part I," The International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144-152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235-254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," The International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310-317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91-S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737-1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83-106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," The International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096-1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962-972, 2005.
[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059-1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439-456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117-156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335-421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


distribution P_n into the estimated value Ψ̂(P_n), is defined as the directional derivative at P_0 in the direction (P_{n=1} − P_0), where P_{n=1} is the empirical distribution at a single observation.

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model. Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so-called tangent space generated by all scores of parametric submodels through P_0 [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance, by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, the so-called targeted minimum loss based estimation discussed below, does not only result in asymptotically efficient estimators; the estimators also (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15-19], (2) focus on the construction of substitution estimators, which by definition fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted, computationally friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the n observations are independent and identically distributed, O_1, ..., O_n ~iid P_0 ∈ M, so that Ψ : M → R can now be defined as a parameter on the common distribution of the O_i; each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator. Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping on a part Q_0 = Q(P_0) of the data distribution P_0 (e.g., a factor of the likelihood), then a substitution estimator can be represented as Ψ(Q_n), where Q_n is an estimator of Q_0 that is contained in the parameter space {Q(P) : P ∈ M} implied by the statistical model M. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., that it is a probability or a difference between two probabilities) as well as known bounds on the data distribution implied by the model M.

In our running example, we can define Q_0 = (Q_{W,0}, Q̄_0), where Q_{W,0} is the probability distribution of W under P_0 and Q̄_0(A, W) = E_0(Y | A, W) is the conditional mean of the outcome given the treatment and covariates, and represent the target parameter as

ψ_0 = Ψ(Q_0) = E_{Q_{W,0}} [ Q̄_0(1, W) − Q̄_0(0, W) ],  (7)

a function of the conditional mean Q̄_0 and the probability distribution Q_{W,0} of W. The model M might restrict Q̄_0 to be between 0 and a small number δ < 1, but otherwise puts no restrictions on Q_0. A substitution estimator is now obtained by plugging in the empirical distribution Q_{W,n} for Q_{W,0} and a data adaptive estimator 0 < Q̄_n < δ of the regression Q̄_0:

ψ_n = Ψ(Q_n) = (1/n) Σ_{i=1}^n [ Q̄_n(1, W_i) − Q̄_n(0, W_i) ].  (8)

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment (IPTW) type estimator of ψ_0 could be defined as

ψ_n = (1/n) Σ_{i=1}^n (2A_i − 1) Y_i / g_n(A_i | W_i),  (9)

where g_n(· | W) is an estimator of the conditional probability of treatment g_0(· | W). This is clearly not a substitution estimator. In particular, if g_n(A_i | W_i) is very small for some observations, this estimator might not be between −1 and 1, and thus completely ignores known constraints.
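A minimal numerical sketch in Python can contrast the two estimator types. In this hypothetical simulation (the treatment mechanism, outcome regression, sample size, and seed are all illustrative choices; the true functions stand in for fitted estimators Q̄_n and g_n purely for simplicity), the plug-in estimator (8) averages differences of predicted probabilities and therefore always lies in [−1, 1], whereas the inverse weights in (9) are unbounded:

```python
import math
import random

def g0(w):
    """True treatment mechanism P(A = 1 | W = w); hypothetical choice."""
    return 1.0 / (1.0 + math.exp(-(-3.0 + 2.0 * w)))

def Qbar0(a, w):
    """True outcome regression E(Y | A = a, W = w); hypothetical choice."""
    return 1.0 / (1.0 + math.exp(-(-1.0 + a + w)))

# Simulate n i.i.d. observations O = (W, A, Y).
random.seed(1)
n = 5000
data = []
for _ in range(n):
    w = random.random()
    a = 1 if random.random() < g0(w) else 0
    y = 1 if random.random() < Qbar0(a, w) else 0
    data.append((w, a, y))

# Substitution estimator (8): an average of differences of probabilities,
# so the estimate is guaranteed to lie in [-1, 1].
psi_sub = sum(Qbar0(1, w) - Qbar0(0, w) for w, _, _ in data) / n

# IPTW estimator (9): the inverse weights 1 / g(A_i | W_i) are unbounded,
# and nothing constrains the estimate to the known range of a difference
# of two probabilities.
psi_iptw = sum((2 * a - 1) * y / (g0(w) if a == 1 else 1.0 - g0(w))
               for w, a, y in data) / n

print(f"substitution: {psi_sub:.3f}, IPTW: {psi_iptw:.3f}")
```

With near-positivity violations (g_n close to 0 for some strata), the IPTW terms explode while the substitution estimator remains a bounded plug-in average; this is the robustness property the text describes.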

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter. The construction of targeted estimators of the target parameter requires construction of an estimator of infinite dimensional nuisance parameters: specifically, the initial estimator of the relevant part Q_0 of the data distribution in the TMLE, and the estimator of the nuisance parameter g_0 = g(P_0) that is needed to target the fit of this relevant part in the TMLE. In our running example, we have Q_0 = (Q_{W,0}, Q̄_0), and g_0 is the conditional distribution of A given W.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter. In order to optimize these estimators of the nuisance parameters (Q0, g0), we use a so called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector that

Advances in Statistics

compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded, and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example we have that 0 = arg min0()() where () = ( minus (907317))2 is the squared error

loss or one can also use the log-likelihood loss ()() =minus log(907317)+(1minus) log(1minus(907317)) Usually there area variety o possible loss unctions one could use to de1047297ne the

super-learner the choice could be based on the dissimilarity implied by the loss unction [983089983093] but probably should itsel be data adaptively selected in a targeted manner Te cross-

validated risk of a candidate estimator of Q̄0 is then defined as the empirical mean, over a validation sample, of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so called V-fold cross-validation, in which one first partitions the sample in V subsets of equal size, and each of the V subsets plays the role of a validation sample while its complement of V − 1 subsets equals the corresponding training sample. Thus V-fold cross-validation results in V sample splits into a validation sample and corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for P0(Y = 1 | A, W). Different choices of such logistic linear regression working models result in different possible candidate estimators. So in this manner one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate Q̄n of Q̄0. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.
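The cross-validation selector described above, in its "discrete" form that picks the single best candidate rather than a weighted combination, can be sketched in a few lines. The toy library of polynomial fits below is purely illustrative; a real super-learner library would mix many algorithm types.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Toy library of candidate estimators: polynomial least squares of
# increasing degree.  Each entry maps training data to a prediction rule.
def make_poly(degree):
    def fit(xt, yt):
        coef = np.polyfit(xt, yt, degree)
        return lambda xv: np.polyval(coef, xv)
    return fit

library = {f"poly{d}": make_poly(d) for d in (0, 1, 3, 5)}

# V-fold cross-validated risk (squared error loss) of each candidate.
V = 5
folds = np.array_split(rng.permutation(n), V)
cv_risk = {}
for name, fit in library.items():
    losses = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[j] for j in range(V) if j != v])
        pred = fit(x[train], y[train])
        losses.append(np.mean((y[val] - pred(x[val])) ** 2))
    cv_risk[name] = float(np.mean(losses))

# The cross-validation selector: the candidate minimizing the CV risk.
best = min(cv_risk, key=cv_risk.get)
```

The selected candidate is then refitted on the whole sample; the weighted-combination super-learner replaces the discrete minimum by a cross-validated choice of convex weights over the library.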

Similarly, one can define a super-learner of the conditional distribution g0 of A, given W.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learning provides one important step in creating a robust estimator whose performance is not relying on being lucky, but on generating a rich library, so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so called (mean zero) efficient influence curve D*(P0)(O), up till a second order term that is asymptotically negligible [13]. That is, an estimator ψn is efficient if and only if it is asymptotically linear with influence curve D(P0) equal to the efficient influence curve D*(P0):

ψn − ψ0 = (1/n) Σ_{i=1}^n D*(P0)(Oi) + o_P(1/√n).  (10)

The efficient influence curve is also called the canonical gradient, and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter Ψ : M → ℝ. Specifically, one defines a rich family of one-dimensional submodels {P(ε) : ε} through P at ε = 0, and one represents the pathwise derivative (d/dε)Ψ(P(ε)) |_{ε=0} as an inner product P D(P) S(P) (the covariance operator in the Hilbert space of functions of O with mean zero and inner product ⟨h1, h2⟩ = P h1 h2), where S(P) is the score of the path {P(ε) : ε} and D(P) is a so called gradient. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through P, also called the tangent space at P, is now the canonical gradient D*(P) at P. Indeed, the canonical gradient can be computed as the projection of any given gradient D(P) onto the tangent space in the Hilbert space L²0(P). An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect Ψ : M → ℝ is given by

D*(P0)(O) = ((2A − 1)/g0(A | W)) (Y − Q̄0(A, W)) + Q̄0(1, W) − Q̄0(0, W) − Ψ(Q0).  (11)

As noted earlier, the influence curve IC(P0) of an estimator ψn also characterizes the limit variance σ²0 = P0 IC(P0)² of the mean zero normal limit distribution of √n(ψn − ψ0). This variance σ²0 can be estimated with (1/n) Σ_{i=1}^n ICn(Oi)², where ICn is an estimator of the influence curve IC(P0). Efficiency theory teaches us that for any regular asymptotically linear estimator, its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, σ²*,0 = P0 D*(P0)², which is also called the generalized Cramer-Rao lower bound. In our running example, the


asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate D*n(Oi) of D*(P0)(Oi), obtained by plugging in the estimator gn of g0 and the estimator Q̄n of Q̄0, and Ψ(Q0) is replaced by Ψ(Qn) = (1/n) Σ_{i=1}^n [Q̄n(1, Wi) − Q̄n(0, Wi)].
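This variance estimator immediately yields a Wald-type confidence interval, which can be sketched numerically as follows. The data-generating mechanism is hypothetical and, to keep the sketch short, the true Q̄0 and g0 stand in for fitted estimators.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
n = 2000
W = rng.normal(size=n)
g1 = expit(0.5 * W)                 # stands in for g_n(1 | W)
A = rng.binomial(1, g1)

def Qbar(a, w):                     # stands in for Qbar_n
    return expit(a + 0.3 * w)

Y = rng.binomial(1, Qbar(A, W))

# Substitution estimator and the estimated efficient influence curve
# D*(O) = (2A-1)/g(A|W) (Y - Qbar(A,W)) + Qbar(1,W) - Qbar(0,W) - psi.
psi = np.mean(Qbar(1, W) - Qbar(0, W))
gA = np.where(A == 1, g1, 1 - g1)
Dstar = (2 * A - 1) / gA * (Y - Qbar(A, W)) + Qbar(1, W) - Qbar(0, W) - psi

# Sample variance of the influence curve values gives the standard error
# and a Wald-type 95% confidence interval.
se = np.sqrt(np.var(Dstar) / n)
ci = (psi - 1.96 * se, psi + 1.96 * se)
```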

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of O that depends on P0 through Q0 and a possible nuisance parameter g0, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through P0. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator

Ψ(Q*n), beyond Q*n being an excellent estimator of Q0 as achieved with super-learning, is that the estimator solves the so called efficient influence curve equation Σ_{i=1}^n D*(Q*n, gn)(Oi) = 0, for a good estimator gn of g0. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models M typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter, one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation Pf ≡ ∫ f(o) dP(o) for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which P0 D*(P) = Ψ(P0) − Ψ(P), and, in general, as a consequence of D*(P) being a canonical gradient,

P0 D*(P) = Ψ(P0) − Ψ(P) + R(P, P0),  (12)

where R(P, P0) is a term involving second order differences in (P − P0). This key property explains why solving P0 D*(P) = 0 targets Ψ(P) to be close to Ψ(P0), and thus explains why solving Pn D*(Qn, gn) = 0 targets Qn to fit Ψ(Q0).

In our running example, we have R(P, P0) = R1(P, P0) − R0(P, P0), where

Ra(P, P0) = ∫ ((g − g0)(a | w)/g(a | w)) (Q̄ − Q̄0)(a, w) dQW,0(w).

So in our example the remainder R(P, P0) only involves a cross-product difference (g − g0)(Q̄ − Q̄0). In particular, the remainder equals zero if either g = g0 or Q̄ = Q̄0, which is often referred to as double robustness of the efficient influence curve with respect to (Q̄, g) in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.
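The identity (12) and the double robustness of its remainder can be verified numerically: with the true treatment mechanism g0 but a deliberately misspecified regression Q̄, the remainder vanishes, so P0 D*(Q̄, g0) should equal Ψ(Q0) − Ψ(Q̄) up to Monte Carlo error. The mechanism below is hypothetical and only meant to illustrate the identity.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(6)
m = 1_000_000
W = rng.normal(size=m)
g1 = expit(0.5 * W)                    # true treatment mechanism g0(1 | W)
A = rng.binomial(1, g1)

def Q0(a, w):
    return expit(a + 0.3 * w)          # true regression

def Qmis(a, w):
    return expit(0.5 * a) + 0 * w      # misspecified candidate, ignores W

Y = rng.binomial(1, Q0(A, W))

psi0 = np.mean(Q0(1, W) - Q0(0, W))
psi_mis = np.mean(Qmis(1, W) - Qmis(0, W))

# Evaluate D*(Qmis, g0) at the data and average: since g = g0, the
# remainder R(P, P0) vanishes, so the mean should equal psi0 - Psi(Qmis).
gA = np.where(A == 1, g1, 1 - g1)
Dstar = ((2 * A - 1) / gA * (Y - Qmis(A, W))
         + Qmis(1, W) - Qmis(0, W) - psi_mis)
lhs = Dstar.mean()
rhs = psi0 - psi_mis
```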

Due to this identity (12), an estimator P̃n that solves Pn D*(P̃n) = 0 and is in a local neighborhood of P0, so that R(P̃n, P0) = o_P(1/√n), approximately solves Ψ(P̃n) − Ψ(P0) ≈ (Pn − P0) D*(P̃n), where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining Pn D*(Q*n, gn) = 0 with (12) at P = (Q*n, gn) yields

Ψ(Q*n) − Ψ(Q0) = (Pn − P0) D*(Q*n, gn) + Rn,  (13)

where Rn is a second order term. Thus, if second order differences such as (Q̄*n − Q̄0)², (Q̄*n − Q̄0)(gn − g0), and (gn − g0)² converge to zero at a rate faster than 1/√n, then it follows that Rn = o_P(1/√n). To make this assumption as reasonable as possible, one should use super-learning for both Q̄*n and gn. In addition, empirical process theory teaches us that (Pn − P0) D*(Q*n, gn) = (Pn − P0) D*(Q0, g0) + o_P(1/√n) if P0{D*(Q*n, gn) − D*(Q0, g0)}² converges to zero in probability as n converges to infinity (a consistency condition), and if D*(Q*n, gn) falls in a so called Donsker class of functions O → f(O) [11]. An important Donsker class is the class of all d-variate real valued functions that have a uniform sectional variation norm that is bounded by some universal M < ∞; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this M < ∞. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as n converges to infinity. So, under this Donsker class condition, Rn = o_P(1/√n), and the consistency condition, we have

ψn − ψ0 = (1/n) Σ_{i=1}^n D*(Q0, g0)(Oi) + o_P(1/√n).  (14)

That is, ψn is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of Ψ(Q*n) up till a second order term Rn, and even asymptotic efficiency if Rn = o_P(1/√n), under some weak regularity conditions.
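Asymptotic linearity (14) is a statement about the sampling distribution: over repeated samples, the standard deviation of ψn should be close to sd(influence curve)/√n. A quick Monte Carlo illustration (hypothetical mechanism; for brevity the estimator plugs in the true Q̄0, so its influence curve is the W-component Q̄0(1, W) − Q̄0(0, W) − ψ0 of the efficient influence curve):

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(4)
n, reps = 500, 400

# Predicted standard error from the influence curve: sd(D_W)/sqrt(n),
# with D_W(W) = Qbar0(1, W) - Qbar0(0, W) - psi0, computed on one large
# reference sample (np.std centers, so subtracting psi0 is implicit).
Wref = rng.normal(size=200_000)
D_W = expit(1 + 0.3 * Wref) - expit(0.3 * Wref)
se_pred = np.std(D_W) / np.sqrt(n)

# Observed sampling standard deviation of the oracle plug-in estimator
# over repeated samples of size n.
psis = []
for _ in range(reps):
    W = rng.normal(size=n)
    psis.append(np.mean(expit(1 + 0.3 * W) - expit(0.3 * W)))
se_obs = np.std(psis)
ratio = se_obs / se_pred   # should be close to 1
```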

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the road map for Targeted Learning. We have formulated a road map for Targeted Learning of a causal quantity that provides a transparent road map [2, 9, 21], involving the following steps:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the


Neyman-Rubin-Robins counterfactual model [22–28] or the structural causal model [9]);

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution, under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true P0;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization M = {Pθ : θ ∈ Θ} for the statistical model M, for some underlying parameter space Θ and parameterization θ → Pθ. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of θ0. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution P0, say Q0 = Q(P0), that can be represented as the minimizer of a criterion at the true data distribution P0 over all candidate values {Q(P) : P ∈ M} for this part of the data distribution; we refer to this criterion as the risk R_{P0}(Q) of the candidate value Q.

Typically, the risk at a candidate parameter value Q can be defined as the expectation under the data distribution of a loss function (o, Q) → L(Q)(o) that maps the unit data structure and the candidate parameter value into a real number: R_{P0}(Q) = E_{P0} L(Q)(O). Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of Q0 as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve (Q, g) → D*(Q(P), g(P))(o), identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution P, where this efficient influence curve does only depend on P through Q(P) and some nuisance parameter g(P). Given an estimator gn, one now defines a path {Q̄n,gn(ε) : ε} with Euclidean parameter ε through the super-learner Q̄n, whose score

(d/dε) L(Q̄n,gn(ε)) |_{ε=0}  (15)

at ε = 0 spans the efficient influence curve D*(Qn, gn) at the initial estimator (Qn, gn); this is called a least favorable parametric submodel through the super-learner.

In our running example, we have Q = (Q̄, QW), so that it suffices to construct a path through Q̄ and QW with corresponding loss functions and show that their scores span the efficient influence curve (11). We can define the path logit Q̄(ε) = logit Q̄ + εC(g), where C(g)(O) = (2A − 1)/g(A | W), and loss function L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Note that

(d/dε) L(Q̄(ε)) |_{ε=0} = D*Y(Q̄, g) = ((2A − 1)/g(A | W)) (Y − Q̄(A, W)).  (16)

We also define the path QW(ε) = (1 + ε D*W(Q)) QW, with loss function L(QW)(W) = −log QW(W), where D*W(Q)(W) = Q̄(1, W) − Q̄(0, W) − Ψ(Q). Note that

(d/dε) L(QW(ε)) |_{ε=0} = D*W(Q).  (17)

Thus, if we define the sum loss function L(Q) = L(Q̄) + L(QW), then

(d/dε) L(Q(ε)) |_{ε=0} = D*(Q, g).  (18)


This proves that indeed these proposed paths through Q̄ and QW and corresponding loss functions span the efficient influence curve D*(Q, g) = D*W(Q) + D*Y(Q̄, g) at (Q, g), as required.

The dimension of ε can be selected to be equal to the dimension of the target parameter ψ0, but by creating extra components in ε one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an ε1 for the path through Q̄ and a separate ε2 for the path through QW. In this case, the TMLE update Q*n will solve two score equations, Pn D*W(Q*n) = 0 and Pn D*Y(Q̄*n, gn) = 0, and thus in particular Pn D*(Q*n, gn) = 0. In this example, the main benefit of using a bivariate ε = (ε1, ε2) is that the TMLE does not update QW,n (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter ε of this path by minimizing the empirical risk ε → Pn L(Q̄n,gn(ε)) along this path through the super-learner, resulting in an estimator εn. This defines now an update of the super-learner fit, defined as Q̄1n = Q̄n,gn(εn). This updating process is iterated till εn ≈ 0. The final update we will denote with Q̄*n, the TMLE of Q̄0, and the target parameter mapping applied to Q*n defines the TMLE of the target parameter ψ0. This TMLE Q*n solves the efficient influence curve equation Σ_{i=1}^n D*(Q*n, gn)(Oi) = 0, providing the basis, in combination with statistical properties of (Q̄*n, gn), for establishing that the TMLE Ψ(Q*n) is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have ε1n = arg min_ε Pn L(Q̄n,gn(ε)), while ε2n = arg min_{ε2} Pn L(QW,n(ε2)) equals zero. That is, the TMLE does not update QW,n, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case Q̄*n = Q̄1n, since the convergence of the TMLE-algorithm occurs in one step, and, of course, Q*W,n = QW,n is just the initial empirical distribution function of W1, ..., Wn. The TMLE of ψ0 is the substitution estimator Ψ(Q*n).
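The single-step TMLE of the running example can be sketched end to end. Everything below is an illustrative implementation under a hypothetical data-generating mechanism: the initial regression fit is deliberately poor (a constant), the true g0 stands in for gn, and the one-dimensional ε of the least favorable submodel is fitted by bisection on the quasi-log-likelihood score.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

rng = np.random.default_rng(3)
n = 2000
W = rng.normal(size=n)
g1 = expit(0.5 * W)                      # g0(1 | W); stands in for g_n
A = rng.binomial(1, g1)
Y = rng.binomial(1, expit(A + 0.3 * W))

# Deliberately poor initial fit: a constant that ignores (A, W).
Qn = np.full(n, np.clip(Y.mean(), 0.01, 0.99))

# Clever covariate H = (2A - 1)/g(A | W) of the least favorable submodel
# logit Qbar(eps) = logit Qbar + eps * H.
gA = np.where(A == 1, g1, 1 - g1)
H = (2 * A - 1) / gA

# Fit eps by maximum likelihood along the submodel.  The score
# sum_i H_i (Y_i - Qbar(eps)(A_i, W_i)) is decreasing in eps, so a
# simple bisection finds its root.
def score(eps):
    return np.sum(H * (Y - expit(logit(Qn) + eps * H)))

lo, hi = -10.0, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
eps = 0.5 * (lo + hi)

# Targeted update of both counterfactual predictions, then the plug-in.
Q1s = expit(logit(Qn) + eps / g1)        # Qbar*(1, W): H(1, W) = 1/g(1|W)
Q0s = expit(logit(Qn) - eps / (1 - g1))  # Qbar*(0, W): H(0, W) = -1/g(0|W)
psi_tmle = np.mean(Q1s - Q0s)

# The update solves the residual part of the efficient influence curve
# equation (numerically zero).
resid_score = score(eps)
```

After the update, the empirical mean of the Y-component of the efficient influence curve is numerically zero, which is exactly the property that distinguishes the TMLE from the initial super-learner fit; with a consistent g, the estimate is close to the true effect despite the misspecified initial regression (double robustness).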

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller Q0, defining a path and loss function with generalized score that spans the efficient influence curve, and the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that O1, ..., On ∼iid P0. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73], and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure O = (L(0), A(0), ..., A(K), Y = L(K + 1)), where L(0) are baseline covariates, L(k) are time dependent covariates realized between intervention nodes A(k − 1) and A(k), and Y is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring," since the outcome Y is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier, and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so called sequential randomization assumption, this target quantity is identified by the so called G-computation formula for the postintervention distribution corresponding with a stochastic intervention g*:

P0,g*(O) = ∏_{k=0}^{K+1} P0,L(k)|L̄(k−1),Ā(k−1)(L(k) | L̄(k − 1), Ā(k − 1)) ∏_{k=0}^{K} g*(A(k) | Ā(k − 1), L̄(k)).  (19)

Note that this postintervention distribution is nothing else but the actual distribution of O, factorized according to the time-ordering, but with the true conditional distributions of A(k), given (L̄(k), Ā(k − 1)), replaced by the desired stochastic intervention. The statistical target parameter is thus E_{P0,g*} Y, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions g*. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.
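For intuition, the G-computation formula (19) can be evaluated by forward simulation: draw each L(k) from its conditional distribution, but set each A(k) according to the intervention. The two time point structural equations below are hypothetical:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(5)
m = 200_000

# Two time point structure L(0) -> A(0) -> L(1) -> A(1) -> Y under the
# static intervention A(0) = A(1) = 1; all coefficients are hypothetical.
L0 = rng.normal(size=m)
A0 = np.ones(m)                               # intervened, not drawn
L1 = 0.5 * L0 + 0.5 * A0 + rng.normal(size=m)
A1 = np.ones(m)                               # intervened
pY = expit(0.3 * L0 + 0.4 * L1 + 0.5 * A1)    # P(Y = 1 | past)

# Mean outcome under the post-intervention distribution (19).
psi_gcomp = pY.mean()
```

Of course, this forward simulation requires knowing the true conditional distributions; the estimation problem addressed by TMLE is precisely to estimate this mean without that knowledge.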

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E[E(Y | A = 1, W) − E(Y | A = 0, W)] in our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA, and use β as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate, and construct simultaneous confidence intervals for the stacked variable importance measure, based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference, taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure, for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄0(A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator Q̄n satisfying this constraint, and the least favorable submodel {Q̄n,gn(ε) : ε} should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter g0 in TMLE. Even though an asymptotically consistent estimator gn of g0 yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g0, but also with respect to how well the resulting TMLE fits ψ0. Consider our running example. Suppose that, among the components of W, there is a Wj that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate Wj in the fit of gn makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for Wj, but in most finite samples adjustment for Wj in gn will hurt the practical performance of the TMLE, and effort should be put in variables that are stronger confounders than Wj. We developed a method for building an estimator gn that uses as criterion the change in fit between the initial estimator of Q̄0 and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g0 in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Advances in Statistics 11

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example, it requires that Q̄_n and g_n are not too erratic functions of (A, W). This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the ε that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.
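The difference between the two selectors can be sketched as follows: instead of maximizing the likelihood of the update on the full sample, ε is chosen by minimizing a cross-validated empirical loss. In this illustrative (hypothetical) fragment, Qbar_val holds out-of-fold initial estimates — observation i is predicted by an initial estimator fit on the folds not containing i — which is where the cross-validation enters, and a grid stands in for a one-dimensional optimizer.

```python
import numpy as np
from scipy.special import expit, logit

def cv_select_eps(Y, Qbar_val, H_val, eps_grid):
    """Pick the fluctuation parameter eps that minimizes the cross-validated
    empirical mean of the negative log-likelihood loss; Qbar_val must hold
    out-of-fold (cross-validated) initial estimates of P(Y = 1 | A, W)."""
    off = logit(np.clip(Qbar_val, 1e-6, 1 - 1e-6))
    losses = []
    for eps in eps_grid:
        q = np.clip(expit(off + eps * H_val), 1e-12, 1 - 1e-12)
        losses.append(-np.mean(Y * np.log(q) + (1 - Y) * np.log(1 - q)))
    return eps_grid[int(np.argmin(losses))]
```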

Guaranteed Minimal Performance of TMLE. If the initial estimator Q̄_n is inconsistent, but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is no longer efficient. The desire for estimators to have a guarantee of beating certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator Q̄_n will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data-adaptive consistent estimator of g_0 (and thus with bias larger than O(1/√n)) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias of the same order for the resulting TMLE of ψ_0. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent but we do not know which one [84]. So these enhancements of TMLE result in TMLEs that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLEs now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets g_n, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of g_0 is estimated with a highly data-adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learning relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk). For example, suppose that in our running example A is continuous and we are concerned with estimation of the dose-response curve (ψ_0(a) : a), where ψ_0(a) = E_0 E_0(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve ψ_0. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply the mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
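For orientation, the simplest version of super-learning with a known loss — the discrete super learner, which returns the candidate with the smallest V-fold cross-validated risk — can be sketched as below; in the setting just described, the CV-TMLE estimate of the conditional risk would take the place of the naive empirical mean squared error in the inner loop. All names are illustrative.

```python
import numpy as np

def discrete_super_learner(X, Y, learners, n_folds=5, seed=0):
    """Discrete super learner: each element of `learners` maps a training set
    (X, Y) to a prediction function; return the refit of the candidate with
    the smallest V-fold cross-validated mean squared error, plus the risks."""
    fold = np.random.default_rng(seed).integers(0, n_folds, size=len(Y))
    risks = np.zeros(len(learners))
    for k, fit in enumerate(learners):
        for v in range(n_folds):
            train, val = fold != v, fold == v
            pred = fit(X[train], Y[train])              # train on V - 1 folds
            risks[k] += np.sum((Y[val] - pred(X[val])) ** 2)
    risks /= len(Y)
    return learners[int(np.argmin(risks))](X, Y), risks

# two toy candidates: the overall mean, and ordinary least squares
def mean_learner(X, Y):
    return lambda Xnew: np.full(len(Xnew), Y.mean())

def ols_learner(X, Y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta
```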

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle/adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with the empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
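For concreteness, the standard practice criticized here — plugging the empirical sample variance of the estimated influence-curve values into a Wald interval — amounts to the following sketch (hypothetical names); the point of the paragraph is precisely that in sparse data this estimator can be badly anticonservative.

```python
import numpy as np

def ic_wald_ci(ic_values, psi_hat, z=1.96):
    """Wald-type confidence interval for a target parameter estimate psi_hat:
    the asymptotic variance of the estimator is approximated by the empirical
    sample variance of the estimated influence curve, divided by n."""
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values, ddof=1) / n)
    return psi_hat - z * se, psi_hat + z * se
```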

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions stating that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data-driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
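The sample-splitting baseline described above — whose cost in sample size motivates the CV-TMLE approach of [86] — can be sketched as follows; the particular parameter-generating and estimation steps below are illustrative stand-ins.

```python
import numpy as np

def split_sample_inference(data, choose_param, estimate, seed=0):
    """Use one half of the data to generate a target parameter, and the other,
    untouched half to estimate it with a confidence interval."""
    idx = np.random.default_rng(seed).permutation(len(data))
    half = len(data) // 2
    explore, confirm = data[idx[:half]], data[idx[half:]]
    param = choose_param(explore)      # data-driven choice of the parameter
    return estimate(confirm, param)    # honest inference on the held-out half

# illustrative choices: pick the coordinate with the largest mean, then
# report that coordinate's mean with a 95% Wald confidence interval
def pick_largest_mean(block):
    return int(np.argmax(block.mean(axis=0)))

def mean_with_ci(block, j):
    x = block[:, j]
    se = x.std(ddof=1) / np.sqrt(len(x))
    return j, x.mean(), (x.mean() - 1.96 * se, x.mean() + 1.96 * se)
```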

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule, or dynamic treatment regimen, and an optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
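As a toy illustration of the objects involved: for a binary treatment, the rule implied by an estimated outcome regression and its naive plug-in value can be computed as below (here the rule maximizes the expected outcome; for an outcome such as death one would minimize instead). This is only the plug-in, not the targeted, inference-ready estimator of [87]; names are hypothetical.

```python
import numpy as np

def plug_in_rule_value(Qbar, W):
    """Given an estimate Qbar(a, w) of E(Y | A = a, W = w), return the
    estimated rule d(W) = argmax_a Qbar(a, W) and the plug-in estimate of
    the mean outcome if everybody were treated according to d."""
    q0 = np.array([Qbar(0, w) for w in W])
    q1 = np.array([Qbar(1, w) for w in W])
    rule = (q1 > q0).astype(int)                   # treat when it looks beneficial
    value = np.mean(np.where(rule == 1, q1, q0))   # mean outcome under the rule
    return rule, value
```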

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second-order term) is o(1/√n). For example, in our running example, this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just like density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data change the inference about the target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for the update is only a function of the size of the new chunk of data.
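The computational idea — maintain sufficient statistics so that absorbing a new chunk costs work proportional to the chunk, not to the total sample — can be illustrated with the simplest possible target parameter, a population mean with a Wald interval; an online TMLE would analogously update its initial estimators and fluctuation from the new chunk only. The class below illustrates the bookkeeping, not TMLE itself.

```python
import numpy as np

class OnlineMean:
    """Running estimate of a population mean with a Wald confidence interval;
    update() touches only the new chunk, so its cost is O(len(chunk))."""

    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0  # count, sum, sum of squares

    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        self.n += chunk.size
        self.s += chunk.sum()
        self.ss += np.square(chunk).sum()

    @property
    def estimate(self):
        return self.s / self.n

    def ci(self, z=1.96):
        var = self.ss / self.n - self.estimate ** 2  # plug-in variance
        se = np.sqrt(var / self.n)
        return self.estimate - z * se, self.estimate + z * se
```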

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for a revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling and causal inference, and validation, by anchoring these stages, or elements, of the research process in statistical theory. According to TMLE/SL, all these elements should be related to, or defined in terms of, (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graph and counterfactual (potential outcome) frameworks, including the specification of a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of, or approach to, learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML-estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that a parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only strengthened further, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply the scientific method to their field. Although criticism that a mere chasing of low p values and a naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1978. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for the exploration, storage, and summary illustration of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].
In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which were soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also, the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics with strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation


Advances in Statistics

procedure that proceeds in two steps. First, an initial estimate is searched for on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forest, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the values calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data, nor full census research, nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
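The first stage of the two-step procedure described above can be sketched concretely. The following is a minimal, self-contained numpy illustration of super learning as a cross-validated convex combination of a tiny two-algorithm library; it is our own simplified sketch under stated assumptions, not the authors' SuperLearner software, and the names (`cv_predictions`, `super_learner`, `fit_mean`, `fit_linear`) are illustrative only. The second, targeting (TMLE) step is omitted.

```python
import numpy as np

def cv_predictions(fit, X, y, V=5, seed=0):
    """Cross-validated predictions: each observation is predicted by a model
    fitted on the folds that do not contain it."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, V)
    pred = np.empty(n)
    for v in range(V):
        test = folds[v]
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        pred[test] = model(X[test])
    return pred

# A toy library of candidate learners; each fit returns a predict function.
def fit_mean(X, y):
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_linear(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def super_learner(X, y, library, V=5):
    """Choose convex weights minimizing the cross-validated squared-error
    risk, then refit the library on the full sample."""
    Z = np.column_stack([cv_predictions(f, X, y, V) for f in library])
    # grid search over convex combinations (adequate for a 2-learner library)
    alphas = np.linspace(0.0, 1.0, 101)
    risks = [np.mean((y - (a * Z[:, 0] + (1 - a) * Z[:, 1])) ** 2)
             for a in alphas]
    a = alphas[int(np.argmin(risks))]
    fits = [f(X, y) for f in library]
    return lambda Xnew: a * fits[0](Xnew) + (1 - a) * fits[1](Xnew)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + rng.normal(size=200)     # truth: E(Y | X) = 2X
sl = super_learner(X, y, [fit_mean, fit_linear])
```

Here the cross-validated risk strongly favors the linear learner, so the weighted combination essentially reproduces the linear fit; with a richer library the same mechanism trades off many diverse algorithms.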

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth is that there will never be enough data so that careful design of studies and interpretation of data are not needed anymore.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
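A toy computation (our own illustration, not from the paper) makes the empty-strata point concrete: with only 30 binary covariates there are already about a billion covariate strata, so even a large sample occupies a vanishing fraction of the cells on which the pure plug-in stratum-mean estimator would have to be evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 30                       # sample size; number of binary covariates
W = rng.integers(0, 2, size=(n, d))      # covariates: 2**30 ≈ 1e9 possible strata
A = rng.integers(0, 2, size=n)           # a binary treatment doubles the cells

# Count the distinct (treatment, covariate) strata actually occupied.
keys = {(int(a), tuple(w)) for a, w in zip(A, W)}
total_cells = 2 ** (d + 1)

# The plug-in estimator needs a mean outcome in every cell, but the
# occupied fraction is at most n / 2**31, i.e. under 0.005% here.
frac_occupied = len(keys) / total_cells
```

With continuous covariates the situation is even more extreme, since essentially every observation sits in its own stratum; this is exactly why smoothing via super learning is unavoidable.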

Targeted Learning was developed in response to high-dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state-of-the-art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat/.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat/.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, P. Neuvial, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series 233, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


Advances in Statistics

compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one had available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded and that the number of algorithms in the library does not increase at a rate faster than a polynomial power in sample size as sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example we have that Q̄₀ = arg min_{Q̄} P₀ L(Q̄), where L(Q̄)(O) = (Y − Q̄(A, W))² is the squared error loss, or one can also use the log-likelihood loss L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Usually there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but probably it should itself be data adaptively selected in a targeted manner.

The cross-validated risk of a candidate estimator of Q̄₀ is then defined as the empirical mean, over a validation sample, of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so-called V-fold cross-validation, in which one first partitions the sample into V subsets of equal size; each of the V subsets plays the role of a validation sample, while its complement of V − 1 subsets equals the corresponding training sample. Thus V-fold cross-validation results in V sample splits into a validation sample and a corresponding training sample.

A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for P₀(Y = 1 | A, W). Different choices of such logistic linear regression working models result in different possible candidate estimators, so in this manner alone one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators, based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of Q̄₀. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.

Similarly, one can define a super-learner of the conditional distribution of A given W.

The super-learner's performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of super-learner provides an important step in creating a robust estimator whose performance does not rely on being lucky but on generating a rich library, so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.
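The library-plus-cross-validation selection scheme described above can be made concrete with a small sketch (a "discrete" super learner: select the candidate minimizing V-fold cross-validated risk, then refit it on the full sample). The two-candidate library, the simulated data, and all function names below are hypothetical illustrations, not the SuperLearner implementation:

```python
# Discrete super learner sketch: choose among candidate estimators by
# V-fold cross-validated squared-error risk, then refit on the full sample.
import random

def fit_mean(xs, ys):
    """Candidate 1: constant (sample mean) predictor."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate 2: simple least squares line y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def cv_risk(fitter, xs, ys, V=5):
    """V-fold cross-validated mean squared error of one candidate."""
    n = len(xs)
    risk = 0.0
    for v in range(V):
        valid = set(range(v, n, V))           # validation indices for fold v
        tr = [i for i in range(n) if i not in valid]
        f = fitter([xs[i] for i in tr], [ys[i] for i in tr])
        risk += sum((ys[i] - f(xs[i])) ** 2 for i in valid)
    return risk / n

def discrete_super_learner(library, xs, ys, V=5):
    """Select the candidate minimizing CV risk; refit it on the whole sample."""
    risks = {name: cv_risk(fit, xs, ys, V) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](xs, ys), risks

random.seed(1)
xs = [random.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + random.gauss(0.0, 0.5) for x in xs]
best, fit, risks = discrete_super_learner({"mean": fit_mean, "linear": fit_linear}, xs, ys)
print(best)   # linear
```

The full super learner additionally searches over weighted combinations of the library fits, as described in the text; the discrete version above is its simplest special case.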

2.9. Asymptotic Efficiency. An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of a so-called (mean zero) efficient influence curve D*(P₀)(O), up till a second order term that is asymptotically negligible [13]. That is, an estimator ψₙ is efficient if and only if it is asymptotically linear with influence curve D(P₀) equal to the efficient influence curve D*(P₀):

ψₙ − ψ₀ = (1/n) ∑_{i=1}^{n} D*(P₀)(Oᵢ) + o_P(1/√n).  (10)

The efficient influence curve is also called the canonical gradient and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter Ψ : M → ℝ. Specifically, one defines a rich family of one-dimensional submodels {P(ε) : ε} through P at ε = 0, and one represents the pathwise derivative (d/dε)Ψ(P(ε))|_{ε=0} as an inner product ⟨S(P), D(P)⟩ (the covariance operator in the Hilbert space L²₀(P) of functions of O with mean zero and inner product ⟨h₁, h₂⟩ = E_P h₁(O)h₂(O)), where S(P) is the score of the path {P(ε) : ε} and D(P) is a so-called gradient. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through P, also called the tangent space at P, is now the canonical gradient D*(P) at P. Indeed, the canonical gradient can be computed as the projection of any given gradient D(P) onto the tangent space in the Hilbert space L²₀(P). An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example it can be shown that the efficient influence curve of the additive treatment effect Ψ : M → ℝ is given by

D*(P₀)(O) = [(2A − 1)/g₀(A | W)] (Y − Q̄₀(A, W)) + Q̄₀(1, W) − Q̄₀(0, W) − Ψ(Q̄₀).  (11)

As noted earlier, the influence curve IC(P₀) of an estimator ψₙ also characterizes the limit variance σ₀² = P₀ IC(P₀)² of the mean zero normal limit distribution of √n(ψₙ − ψ₀). This variance σ₀² can be estimated with (1/n) ∑_{i=1}^{n} ICₙ(Oᵢ)², where ICₙ is an estimator of the influence curve IC(P₀). Efficiency theory teaches us that for any regular asymptotically linear estimator its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, σ*₀² = P₀ D*(P₀)², which is also called the generalized Cramer-Rao lower bound. In our running example the asymptotic variance of an efficient estimator is thus estimated with the sample variance of the estimates D*ₙ(Oᵢ) of D*(P₀)(Oᵢ), obtained by plugging in the estimator gₙ of g₀ and the estimator Q̄ₙ of Q̄₀, and replacing Ψ(Q̄₀) by Ψ(Q̄ₙ) = (1/n) ∑_{i=1}^{n} (Q̄ₙ(1, Wᵢ) − Q̄ₙ(0, Wᵢ)).

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation. The efficient influence curve is a function of O that depends on P₀ through Q̄₀ and a possible nuisance parameter g₀, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through P₀. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve, whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator Ψ(Qₙ*), beyond Qₙ* being an excellent estimator of Q₀ as achieved with super-learning, is that the estimator solves the so-called efficient influence curve equation ∑_{i=1}^{n} D*(Qₙ*, gₙ)(Oᵢ) = 0 for a good estimator gₙ of g₀. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models M typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score equation tailored for the target parameter. Using the notation Pf ≡ ∫ f(o) dP(o) for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which P₀ D*(P) = Ψ(P₀) − Ψ(P), and, in general, as a consequence of D*(P) being a canonical gradient,

P₀ D*(P) = Ψ(P₀) − Ψ(P) + R(P, P₀),  (12)

where R(P, P₀) is a term involving second order differences (Q̄ − Q̄₀)², (g − g₀)², (Q̄ − Q̄₀)(g − g₀). This key property explains why solving P₀ D*(P) = 0 targets Ψ(P) to be close to Ψ(P₀), and thus explains why solving Pₙ D*(Qₙ, gₙ) = 0 targets Qₙ to fit Ψ(Q₀).

In our running example we have R(P, P₀) = R₁(P, P₀) − R₀(P, P₀), where Rₐ(P, P₀) = ∫ [((g − g₀)(a | w))/g(a | w)] (Q̄ − Q̄₀)(a, w) dP₀(w). So in our example the remainder R(P, P₀) only involves a cross-product difference (g − g₀)(Q̄ − Q̄₀). In particular, the remainder equals zero if either Q̄ = Q̄₀ or g = g₀, which is often referred to as double robustness of the efficient influence curve with respect to (Q̄, g) in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.

Due to this identity (12), an estimator P̃ₙ that solves Pₙ D*(P̃ₙ) = 0 and is in a local neighborhood of P₀, so that R(P̃ₙ, P₀) = o_P(1/√n), approximately solves Ψ(P̃ₙ) − Ψ(P₀) ≈ (Pₙ − P₀) D*(P̃ₙ), where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.
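The double robustness of the remainder can be checked exactly on a small discrete distribution, where all expectations reduce to finite sums. All distributional choices below are hypothetical; the check verifies that the corrected value Ψ(P) + P₀ D*(P) recovers ψ₀ whenever either Q̄ = Q̄₀ or g = g₀:

```python
# Exact numerical check of double robustness: P0 D*(P) = Psi(P0) - Psi(P)
# when either the outcome model Qbar or the treatment model g is correct.
pW = {0: 0.5, 1: 0.5}                       # true marginal distribution of W
g0 = lambda w: 0.3 + 0.4 * w                # true P(A = 1 | W = w)
Q0 = lambda a, w: 0.1 + 0.5 * a + 0.2 * w   # true Qbar0(a, w) = E[Y | A=a, W=w]

def psi(Q):
    """Substitution parameter Psi(Q) = E_W[Q(1, W) - Q(0, W)]."""
    return sum(p * (Q(1, w) - Q(0, w)) for w, p in pW.items())

def P0_Dstar(Q, g):
    """P0 D*(Q, g): exact expectation of the efficient influence curve."""
    total = 0.0
    for w, p in pW.items():
        for a in (0, 1):
            pa0 = g0(w) if a == 1 else 1 - g0(w)   # true g0(a | w)
            ga = g(w) if a == 1 else 1 - g(w)      # working model g(a | w)
            # E[Y - Q(A, W) | A=a, W=w] equals Q0(a, w) - Q(a, w)
            total += p * pa0 * (2 * a - 1) / ga * (Q0(a, w) - Q(a, w))
        total += p * (Q(1, w) - Q(0, w))
    return total - psi(Q)

psi0 = psi(Q0)                               # true target value (0.5 here)
Qmis = lambda a, w: 0.4                      # badly misspecified outcome model
gmis = lambda w: 0.5                         # misspecified treatment model

v1 = psi(Qmis) + P0_Dstar(Qmis, g0)          # wrong Qbar, correct g
v2 = psi(Q0) + P0_Dstar(Q0, gmis)            # correct Qbar, wrong g
print(round(v1, 9), round(v2, 9))            # both recover psi0: 0.5 0.5
```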

2.11. Targeted Estimator Is Asymptotically Linear and Efficient. In fact, combining Pₙ D*(Qₙ*, gₙ) = 0 with (12) at P = (Qₙ*, gₙ) yields

Ψ(Qₙ*) − Ψ(Q₀) = (Pₙ − P₀) D*(Qₙ*, gₙ) + Rₙ,  (13)

where Rₙ is a second order term. Thus, if second order differences such as (Q̄ₙ − Q̄₀)², (Q̄ₙ − Q̄₀)(gₙ − g₀), and (gₙ − g₀)² converge to zero at a rate faster than 1/√n, then it follows that Rₙ = o_P(1/√n). To make this assumption as reasonable as possible one should use super-learning for both Q̄ₙ and gₙ. In addition, empirical process theory teaches us that (Pₙ − P₀) D*(Q̄ₙ, gₙ) = (Pₙ − P₀) D*(Q̄₀, g₀) + o_P(1/√n) if P₀{D*(Q̄ₙ, gₙ) − D*(Q̄₀, g₀)}² converges to zero in probability as n converges to infinity (a consistency condition) and if D*(Q̄ₙ, gₙ) falls in a so-called Donsker class of functions O → f(O) [11]. An important Donsker class is the class of all d-variate real valued functions that have a uniform sectional variation norm bounded by some universal M < ∞; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this M < ∞. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as n converges to infinity. So, under this Donsker class condition and the consistency condition, we have

ψₙ − ψ₀ = (1/n) ∑_{i=1}^{n} D*(Q̄₀, g₀)(Oᵢ) + o_P(1/√n).  (14)

That is, ψₙ is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of Ψ(Q̄ₙ) up till a second order term Rₙ, and even asymptotic efficiency if Rₙ = o_P(1/√n), under some weak regularity conditions.

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the roadmap for Targeted Learning. We have formulated a roadmap for Targeted Learning of a causal quantity that provides a transparent path [2, 9, 21], involving the following steps:

(i) defining a full-data model, such as a causal model, and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [22–28] or the structural causal model [9]);

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution, under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true P₀;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling: that is, one obtains a parameterization M = {P_θ : θ ∈ Θ} for the statistical model M, for some underlying parameter space Θ and parameterization θ → P_θ. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of θ₀. Under certain assumptions one establishes that the target quantity can be represented as a parameter of the data distribution, a so-called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and is thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, now provides an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution P₀, say Q₀ = Q(P₀), that can be represented as the minimizer of a criterion at the true data distribution P₀ over all candidate values {Q(P) : P ∈ M} for this part of the data distribution; we refer to this criterion as the risk R_{P₀}(Q) of the candidate value Q.

Typically, the risk at a candidate parameter value Q can be defined as the expectation, under the data distribution, of a loss function (O, Q) → L(Q)(O) that maps the unit data structure and the candidate parameter value into a real number: R_{P₀}(Q) = P₀ L(Q). Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of Q₀ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve (Q(P), G(P)) → D*(Q(P), G(P))(O), identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution P, where this efficient influence curve does only depend on P through Q(P) and some nuisance parameter G(P). Given an estimator (Qₙ, gₙ), one now defines a path {Q_{n,gₙ}(ε) : ε} with Euclidean parameter ε through the super-learner Qₙ whose score

(d/dε) L(Q_{n,gₙ}(ε)) |_{ε=0}  (15)

at ε = 0 spans the efficient influence curve D*(Qₙ, gₙ) at the initial estimator (Qₙ, gₙ); this is called a least favorable parametric submodel through the super-learner.

In our running example we have Q = (Q̄, Q_W), so that it suffices to construct a path through Q̄ and Q_W with corresponding loss functions and show that their scores span the efficient influence curve (11). We can define the path Q̄(ε) = Q̄ + εC(g), where C(g)(O) = (2A − 1)/g(A | W), and loss function L(Q̄)(O) = −{Y log Q̄(A, W) + (1 − Y) log(1 − Q̄(A, W))}. Note that

(d/dε) L(Q̄(ε))(O) |_{ε=0} = D*_Y(Q̄, g)(O) = [(2A − 1)/g(A | W)] (Y − Q̄(A, W)).  (16)

We also define the path Q_W(ε) = (1 + ε D*_W(Q)) Q_W, with loss function L(Q_W)(W) = −log Q_W(W), where D*_W(Q)(O) = Q̄(1, W) − Q̄(0, W) − Ψ(Q). Note that

(d/dε) L(Q_W(ε)) |_{ε=0} = D*_W(Q).  (17)

Thus, if we define the sum loss function L(Q) = L(Q̄) + L(Q_W), then

(d/dε) L(Q̄(ε), Q_W(ε)) |_{ε=0} = D*(Q, g).  (18)


This proves that indeed these proposed paths through Q̄ and Q_W and corresponding loss functions span the efficient influence curve D*(Q, g) = D*_W(Q) + D*_Y(Q̄, g) at (Q, g), as required.

The dimension of ε can be selected to be equal to the dimension of the target parameter ψ₀, but by creating extra components in ε one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example we can use an ε₁ for the path through Q̄ and a separate ε₂ for the path through Q_W. In this case the TMLE update Qₙ* will solve two score equations, Pₙ D*_W(Qₙ*) = 0 and Pₙ D*_Y(Q̄ₙ*, gₙ) = 0, and thus in particular Pₙ D*(Qₙ*, gₙ) = 0. In this example the main benefit of using a bivariate ε = (ε₁, ε₂) is that the TMLE does not update Q_{W,n} (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter ε of this path by minimizing the empirical risk ε → Pₙ L(Q_{n,gₙ}(ε)) along this path through the super-learner, resulting in an estimator εₙ. This now defines an update of the super-learner fit, defined as Q¹ₙ = Q_{n,gₙ}(εₙ). This updating process is iterated till εₙ ≈ 0. The final update we will denote with Qₙ*, the TMLE of Q₀, and the target parameter mapping applied to Qₙ* defines the TMLE of the target parameter ψ₀. This TMLE Qₙ* solves the efficient influence curve equation ∑_{i=1}^{n} D*(Qₙ*, gₙ)(Oᵢ) = 0, providing the basis, in combination with statistical properties of (Qₙ*, gₙ), for establishing that the TMLE Ψ(Qₙ*) is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example we have ε₁,ₙ = arg min_{ε₁} Pₙ L(Q̄ₙ(ε₁)), while ε₂,ₙ = arg min_{ε₂} Pₙ L(Q_{W,n}(ε₂)) equals zero. That is, the TMLE does not update Q_{W,n}, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case Q̄ₙ* = Q̄¹ₙ, since the convergence of the TMLE-algorithm occurs in one step, and of course Q*_{W,n} = Q_{W,n} is just the initial empirical distribution function of W₁, …, Wₙ. The TMLE of ψ₀ is the substitution estimator Ψ(Qₙ*).
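A numerical sketch of this one-step TMLE for the additive treatment effect follows. Note one deviation from the linear path written above: the sketch uses the logistic fluctuation logit Q̄(ε) = logit Q̄ + εH with clever covariate H(A, W) = (2A − 1)/g(A | W), a standard implementation choice that keeps the update inside (0, 1). The simulated data and the deliberately crude initial fit are hypothetical:

```python
# One-step TMLE sketch for psi = E[Qbar(1,W) - Qbar(0,W)] with binary Y,
# logistic fluctuation of the initial Qbar, epsilon fit by Newton-Raphson.
import math, random

def expit(x): return 1.0 / (1.0 + math.exp(-x))
def logit(p): return math.log(p / (1.0 - p))

def tmle_ate(data, Qbar, g):
    """Targeted update of Qbar followed by the substitution estimator."""
    H = [(2 * a - 1) / (g(w) if a == 1 else 1 - g(w)) for w, a, y in data]
    off = [logit(Qbar(a, w)) for w, a, y in data]
    eps = 0.0
    for _ in range(50):                     # Newton-Raphson for the MLE of eps
        q = [expit(o + eps * h) for o, h in zip(off, H)]
        score = sum(h * (y - qi) for h, qi, (w, a, y) in zip(H, q, data))
        info = sum(h * h * qi * (1 - qi) for h, qi in zip(H, q))
        eps += score / info
    def Qstar(a, w):
        h = (2 * a - 1) / (g(w) if a == 1 else 1 - g(w))
        return expit(logit(Qbar(a, w)) + eps * h)
    psi = sum(Qstar(1, w) - Qstar(0, w) for w, a, y in data) / len(data)
    return psi, Qstar

random.seed(2)
g = lambda w: 0.5                           # known randomization probability
Qbar = lambda a, w: 0.5                     # deliberately crude initial fit
data = []
for _ in range(500):
    w = random.random()
    a = 1 if random.random() < g(w) else 0
    y = 1 if random.random() < expit(-1.0 + 2.0 * a + w) else 0
    data.append((w, a, y))
psi, Qstar = tmle_ate(data, Qbar, g)
# the targeting step makes the fit solve sum_i D*_Y(Qstar, g)(O_i) = 0
resid = sum(((2 * a - 1) / (g(w) if a == 1 else 1 - g(w))) * (y - Qstar(a, w))
            for w, a, y in data)
print("score residual ~ 0:", abs(resid) < 1e-6)
```

Because g here is the known randomization probability, the targeting step alone repairs the crude initial Q̄ₙ, illustrating the double robustness discussed in Section 2.10; the W-part of the efficient influence curve equation holds automatically since ψₙ is the plug-in mean over the empirical distribution of W.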

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller Q₀, defining a path and loss function with generalized score that spans the efficient influence curve, and implementing the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that O₁, …, Oₙ ∼_iid P₀. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73], and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure O = (L(0), A(0), …, L(K), A(K), Y), where L(0) are baseline covariates, L(k) are time dependent covariates realized between intervention nodes A(k − 1) and A(k), and Y is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring", since the outcome Y is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the postintervention distribution corresponding with a stochastic intervention g*:

P₀,g*(O) = ∏_{k=0}^{K+1} P₀,L(k)|L̄(k−1),Ā(k−1)(L(k) | L̄(k − 1), Ā(k − 1)) ∏_{k=0}^{K} g*(A(k) | Ā(k − 1), L̄(k)).  (19)

Note that this postintervention distribution is nothing else but the actual distribution of O, factorized according to the time-ordering, but with the true conditional distributions of A(k), given its parents (L̄(k), Ā(k − 1)), replaced by the desired stochastic intervention. The statistical target parameter is thus E_{P₀,g*} Y, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand; and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions g*. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.
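For all-binary nodes the G-computation formula (19) can be evaluated exactly by enumerating covariate histories. The sketch below does this for a two-time-point structure O = (L(0), A(0), L(1), A(1), Y) under static interventions; the conditional distributions are hypothetical choices for illustration:

```python
# Exact G-computation for a two-time-point all-binary longitudinal structure:
# E[Y] under the postintervention distribution with A(0), A(1) set to (a0, a1).
from itertools import product

pL0 = lambda l0: 0.5                                           # P(L0 = l0)
def pL1(l1, l0, a0):                                           # P(L1 = 1 | L0, A0)
    p1 = 0.2 + 0.3 * a0 + 0.2 * l0
    return p1 if l1 == 1 else 1 - p1
def pY(y, l1, a1):                                             # P(Y = 1 | L1, A1)
    p1 = 0.1 + 0.4 * a1 + 0.2 * l1
    return p1 if y == 1 else 1 - p1

def gcomp_mean(a0, a1):
    """Sum over all covariate histories of y times the product of the true
    conditional L-distributions, with the A-nodes fixed at (a0, a1)."""
    total = 0.0
    for l0, l1, y in product((0, 1), repeat=3):
        total += y * pL0(l0) * pL1(l1, l0, a0) * pY(y, l1, a1)
    return total

ate = gcomp_mean(1, 1) - gcomp_mean(0, 0)
print(round(gcomp_mean(1, 1), 9), round(ate, 9))   # 0.62 0.46
```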

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example, one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E(Y | A = 1, W) − E(Y | A = 0, W) in our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA, and use β as the variable importance measure of interest [75]; but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure, with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure, based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference, taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
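The stacked-inference step can be sketched as follows: given the matrix of estimated influence curve values (one column per variable importance parameter), one forms a standard error per parameter and widens the z-cutoff to control the family wise error rate. For brevity a Bonferroni cutoff stands in for the exact quantile of the multivariate normal limit distribution; all inputs below are hypothetical:

```python
# Simultaneous Wald-style confidence intervals for a stacked vector of
# variable importance estimates, from the stacked influence curve matrix.
import math, random

def simultaneous_cis(psis, ic_matrix, z_cut):
    """psis: J estimates; ic_matrix[i][j]: influence curve of parameter j
    evaluated at observation i. Returns intervals psi_j +/- z_cut * se_j."""
    n = len(ic_matrix)
    cis = []
    for j, psi_j in enumerate(psis):
        var = sum(row[j] ** 2 for row in ic_matrix) / n
        se = math.sqrt(var / n)
        cis.append((psi_j - z_cut * se, psi_j + z_cut * se))
    return cis

random.seed(3)
n, J = 400, 5
psis = [0.1 * j for j in range(J)]          # stacked estimates (hypothetical)
ic = [[random.gauss(0.0, 1.0) for _ in range(J)] for _ in range(n)]
z_cut = 2.576                               # Bonferroni: per-parameter level 0.01
cis = simultaneous_cis(psis, ic, z_cut)
print(all(lo <= p <= hi for p, (lo, hi) in zip(psis, cis)))   # True
```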

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and in response to that we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄₀(A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator Q̄ₙ satisfying this constraint, and the least favorable submodel {Q̄ₙ,gₙ(ε) : ε} should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter g₀ in TMLE. Even though an asymptotically consistent estimator of g₀ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g₀, but also with respect to how well the resulting TMLE fits ψ₀. Consider our running example. Suppose that among the components of W there is a W_j that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate W_j in the fit of gₙ makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for W_j, but in most finite samples adjustment for W_j in gₙ will hurt the practical performance of the TMLE, and effort should be put in variables that are stronger confounders than W_j. We developed a method for building an estimator gₙ that uses as criterion the change in fit between the initial estimator of Q̄₀ and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g₀ in collaboration with the initial estimator Q̄ₙ [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Advances in Statistics

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that Q̄_n and g_n are not too erratic functions of W. This condition is not just theoretical, but one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense: if we use an overfitted initial estimator, there is little reason to think that the ε that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE (CV-TMLE), and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.
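A minimal sketch of the idea (our illustration, not the authors' software): instead of choosing the fluctuation parameter ε by the fit on the same sample used to build the initial estimator, choose it by minimizing a V-fold cross-validated empirical mean of the loss of the updated estimator. Here the initial estimator (an overall mean that ignores treatment), the clever direction H(a) = 2a - 1, and the grid over ε are all simplified stand-ins.

```python
import random

random.seed(2)
n, V = 2000, 5
# Outcome depends on binary treatment a with effect +/- 0.5 around zero.
data = []
for _ in range(n):
    a = random.random() < 0.5
    data.append((a, (0.5 if a else -0.5) + random.gauss(0.0, 1.0)))

H = lambda a: 1.0 if a else -1.0  # clever direction of the fluctuation

def cv_risk(eps):
    """V-fold cross-validated mean squared loss of the updated estimator."""
    total = 0.0
    for v in range(V):
        valid = data[v::V]
        train = [d for i, d in enumerate(data) if i % V != v]
        qbar = sum(y for _, y in train) / len(train)  # crude initial fit
        total += sum((y - (qbar + eps * H(a))) ** 2 for a, y in valid)
    return total / n

grid = [i / 50.0 for i in range(-50, 51)]  # candidate eps in -1.00 .. 1.00
eps_cv = min(grid, key=cv_risk)
print("cross-validated fluctuation:", eps_cv)
```

Because the loss of each update is always evaluated on held-out folds, an overfitted initial estimator cannot trick the selection of ε, which is the point of the CV-TMLE construction.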

Guaranteed Minimal Performance of TMLE. If the initial estimator Q̄_n is inconsistent but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data adaptive consistent estimator of g_0 (and thus with bias larger than 1/√n) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ_0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLEs that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLEs now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPTW estimator that targets the fit of g_0, so that the IPTW estimator is asymptotically linear with known influence curve, even when the initial estimator of g_0 is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function at the candidate estimator based on the training sample, averaged across different sample splits. We generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).
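The cross-validation selector referred to here can be sketched in a few lines (a schematic of ours, not the authors' software): each candidate estimator is fit on the training folds, its loss is averaged over the held-out folds, and the candidate with the smallest cross-validated risk is selected, i.e., a discrete super-learner.

```python
import random

random.seed(3)
# Simulated regression data: y is truly linear in x with slope 2.
data = [(x, 2.0 * x + random.gauss(0.0, 0.5))
        for x in [random.uniform(-1.0, 1.0) for _ in range(600)]]

def fit_mean(train):
    """Candidate 1: predict the training mean, ignoring x."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    """Candidate 2: closed-form simple least squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    return lambda x, a=my - b * mx, b=b: a + b * x

def cv_risk(fitter, V=5):
    """Cross-validated mean squared error of a candidate estimator."""
    total = 0.0
    for v in range(V):
        train = [d for i, d in enumerate(data) if i % V != v]
        f = fitter(train)
        total += sum((y - f(x)) ** 2 for x, y in data[v::V])
    return total / len(data)

library = {"mean": fit_mean, "linear": fit_linear}
risks = {name: cv_risk(f) for name, f in library.items()}
selected = min(risks, key=risks.get)
print(risks, "-> discrete super-learner picks:", selected)
```

The full super-learner goes one step further and fits a cross-validation-selected weighted combination of the library; the point of this subsection is that when the loss itself involves a nuisance parameter, the cross-validated risk in `cv_risk` must itself be replaced by a targeted (CV-TMLE) estimate.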

For example, suppose that in our running example A is continuous and we are concerned with estimation of the dose-response curve (ψ_0(a) : a), where ψ_0(a) = E_0 E_0(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve ψ_0. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning we began to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
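The standard practice criticized above is easy to state concretely. Below is a self-contained toy of our own construction (saturated stratum-mean fits, so everything is closed form): a plug-in estimate of the average treatment effect, the estimated efficient influence curve IC_i = H(A_i, W_i)(Y_i - Q̄(A_i, W_i)) + Q̄(1, W_i) - Q̄(0, W_i) - ψ̂ with H(a, w) = a/g(w) - (1 - a)/(1 - g(w)), and the Wald-type 95% interval built from the sample variance of the IC. In sparse data (g near 0 or 1) the H term explodes, which is exactly the nonrobustness the paragraph describes.

```python
import math
import random

random.seed(4)
n = 2000
rows = []
for _ in range(n):
    w = random.random() < 0.5
    g_w = 0.7 if w else 0.3                            # true treatment mechanism
    a = random.random() < g_w
    y = 1.0 * a + 1.0 * w + random.gauss(0.0, 0.5)     # true ATE = 1
    rows.append((w, a, y))

def stratum_fits(rows):
    """Saturated stratum means for outcome regression and propensity."""
    ysum, cnt, asum, wcnt = {}, {}, {}, {}
    for w, a, y in rows:
        ysum[(w, a)] = ysum.get((w, a), 0.0) + y
        cnt[(w, a)] = cnt.get((w, a), 0) + 1
        asum[w] = asum.get(w, 0) + int(a)
        wcnt[w] = wcnt.get(w, 0) + 1
    qbar = lambda a, w: ysum[(w, bool(a))] / cnt[(w, bool(a))]
    g = lambda w: min(0.99, max(0.01, asum[w] / wcnt[w]))
    return qbar, g

qbar, g = stratum_fits(rows)
# Plug-in (substitution) estimate of the average treatment effect.
psi = sum(qbar(1, w) - qbar(0, w) for w, _, _ in rows) / n

# Estimated efficient influence curve and the usual variance estimator.
ic = []
for w, a, y in rows:
    h = (1.0 / g(w)) if a else (-1.0 / (1.0 - g(w)))
    ic.append(h * (y - qbar(a, w)) + qbar(1, w) - qbar(0, w) - psi)
se = math.sqrt(sum(x * x for x in ic) / n / n)          # sd(IC) / sqrt(n)
lo, hi = psi - 1.96 * se, psi + 1.96 * se
print(f"ATE ~ {psi:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Here the fits are saturated and treatment is well supported, so the interval is reasonable; the subsection's point is that when strata are sparse the terms `1/g(w)` blow up, the sample variance of `ic` becomes unstable, and a targeted (substitution) estimator of this variance parameter is called for.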

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions, which state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p-values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for allowing us to obtain inference for a data-driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.
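The sample-splitting recipe described above, in our toy rendering: one half of the data is used to pick the question (here, which of two subgroups has the apparently larger mean) and only the other half is used to estimate that data-chosen parameter and form a confidence interval, so the selection step does not invalidate the inference. The price is visible in the code: only half the sample supports the estimate.

```python
import math
import random

random.seed(5)
n = 1000
# Each unit: subgroup label in {0, 1} and an outcome; group 1 has a higher mean.
data = [(g, (0.3 if g else 0.0) + random.gauss(0.0, 1.0))
        for g in [random.random() < 0.5 for _ in range(n)]]
explore, confirm = data[: n // 2], data[n // 2 :]

def group_stats(rows, g):
    ys = [y for gg, y in rows if gg == g]
    return sum(ys) / len(ys), ys

# Step 1: use the exploration half only to CHOOSE the target parameter.
chosen = max((0, 1), key=lambda g: group_stats(explore, g)[0])

# Step 2: estimate the chosen subgroup mean on the untouched half.
est, ys = group_stats(confirm, chosen)
se = math.sqrt(sum((y - est) ** 2 for y in ys) / (len(ys) - 1) / len(ys))
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"chosen subgroup {chosen}: mean ~ {est:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The CV-TMLE approach discussed next aims at the same validity guarantee without throwing half the data away, by letting every observation play both roles across cross-validation splits.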

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., an indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the second order remainder term is o_P(1/√n). For example, in our running example, this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and is suffering from lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it from scratch: instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLE that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
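The computational pattern wanted here (update an estimate from a new chunk alone, touching none of the old data) is shown below for the simplest possible "estimator": a mean and its standard error maintained through streaming sufficient statistics. This is our toy pattern, not the online TMLE itself; an online TMLE must additionally carry enough state to redo its targeting step, but the cost profile aimed at is the same: per-update work proportional to the chunk, not to the archive.

```python
import math

class OnlineMean:
    """Running mean and standard error via streaming sufficient statistics."""

    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0

    def update(self, chunk):
        # Cost depends only on len(chunk), never on self.n.
        self.n += len(chunk)
        self.s += sum(chunk)
        self.ss += sum(x * x for x in chunk)

    @property
    def mean(self):
        return self.s / self.n

    @property
    def stderr(self):
        var = (self.ss - self.n * self.mean ** 2) / (self.n - 1)
        return math.sqrt(max(var, 0.0) / self.n)

# Data arriving in five chunks of 1000 observations each.
stream = [[float(i % 7) for i in range(1000 * k, 1000 * (k + 1))]
          for k in range(5)]
est = OnlineMean()
for chunk in stream:
    est.update(chunk)

# The streaming answer agrees with the batch recomputation it avoids.
batch = [x for chunk in stream for x in chunk]
batch_mean = sum(batch) / len(batch)
print(est.mean, est.stderr)
```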

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graph and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory or approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose at the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective, the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis, or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way, by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed here is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to the field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of "the inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that, unmistakably, is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit, but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance

of confirmatory, classical statistics, but this looks, for the main part, like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which was soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliche, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with (in themselves sometimes hybrid) inferential methods. In current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned, and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics built on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, if only for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown true data-generating distribution. From a methodological point of view, it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible data-adaptive estimation procedure that proceeds in two steps. First, an initial estimate is searched on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression

to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research, or any other attempt to take into account the whole of reality or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
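To make the first of the two steps concrete, consider the following minimal, hypothetical sketch. It is not the actual super learner implementation: it shows only a discrete super learner, which uses V-fold cross-validation to select the best performer from a toy library of two candidate learners (a constant mean predictor and an ordinary least squares fit). The real super learner of [18] instead forms an optimal weighted combination over a much richer library.

```python
import numpy as np

def cv_risk(fit, predict, X, y, folds=5):
    """Cross-validated mean squared error of one candidate learner."""
    n = len(y)
    idx = np.arange(n) % folds  # deterministic fold assignment
    se = np.empty(n)
    for v in range(folds):
        train, test = idx != v, idx == v
        model = fit(X[train], y[train])
        se[test] = (y[test] - predict(model, X[test])) ** 2
    return se.mean()

# A tiny library of candidate learners, each a (fit, predict) pair.
library = {
    "mean": (lambda X, y: y.mean(),
             lambda m, X: np.full(len(X), m)),
    "ols":  (lambda X, y: np.linalg.lstsq(
                 np.column_stack([np.ones(len(X)), X]), y, rcond=None)[0],
             lambda b, X: np.column_stack([np.ones(len(X)), X]) @ b),
}

def discrete_super_learner(X, y):
    """Select the learner with smallest CV risk and refit it on all data."""
    risks = {name: cv_risk(f, p, X, y) for name, (f, p) in library.items()}
    best = min(risks, key=risks.get)
    f, p = library[best]
    model = f(X, y)
    return best, risks, lambda Xnew: p(model, Xnew)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)
best, risks, predict = discrete_super_learner(X, y)
# With a strongly linear signal, the OLS candidate wins the cross-validation.
```

The cross-validated selector inherits the performance of the best candidate in the library, which is the oracle property that motivates super learning; the second, targeting step described above would then update this initial fit.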

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field, often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and interpretation of data are not needed anymore. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
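A small simulation (hypothetical, with dimensions far below the n = 10^12 of the text) illustrates the point: with even a modest number of binary covariates, most treatment-covariate strata contain no observations at all, so the stratum-specific empirical means that the pure plug-in estimator needs are undefined.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10                      # 1,000 subjects, 10 binary covariates
W = rng.integers(0, 2, size=(n, d))  # covariate patterns
A = rng.integers(0, 2, size=n)       # binary treatment

# Each observed (treatment, covariate-pattern) combination is one stratum.
strata = {tuple(row) for row in np.column_stack([A, W])}
total = 2 ** (d + 1)                 # 2,048 possible strata

occupied = len(strata)
fraction_empty = 1 - occupied / total
# Since at most n = 1,000 of the 2,048 strata can be occupied, well over
# half are empty, and the stratum-specific mean outcome is undefined there.
```

Adding covariates makes this exponentially worse: with 30 binary covariates there are over 2 billion strata, essentially all empty at any realistic sample size.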

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning. The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases, it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state of the art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.
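To make concrete what "estimators based on influence curves" means, in the simplest i.i.d. case rather than the dependent-data setting just described, consider this hypothetical toy sketch: the sample mean has influence curve IC(X) = X - mu, so its standard error can be estimated as the sample standard deviation of the estimated influence curve values divided by the square root of n, recovering the classical formula. For a TMLE the same recipe applies, with the efficient influence curve of the target parameter in place of X - mu.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)  # simulated i.i.d. sample

psi_hat = x.mean()            # estimator of the target parameter (here, the mean)
ic_hat = x - psi_hat          # estimated influence curve values IC(X_i)
se = ic_hat.std(ddof=1) / np.sqrt(len(x))       # influence-curve-based standard error
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)  # Wald-type 95% confidence interval
```

The point of the influence-curve formulation is that it generalizes: whenever an estimator is asymptotically linear, averaging and taking the standard deviation of its (efficient) influence curve yields valid Wald-type inference, even for complex target parameters where no closed-form variance exists.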

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and scientific knowledge, that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: Probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1-20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1-32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113-156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97-128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545-597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351-371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373-395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465-480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688-701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9-12, pp. 1393-1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9-12, pp. 923-945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S-161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103-113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096-1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574-596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39-64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099-1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152-172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792-796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541-549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149-160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161-174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171-192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144-152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235-254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310-317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91-S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737-1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83-106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096-1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962-972, 2005.

[75] A. Chambaz, P. Neuvial, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059-1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439-456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117-156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335-421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


The asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate $D^*_n(O_i)$ of $D^*(P_0)(O_i)$, obtained by plugging in the estimator $Q_n$ of $Q_0$ and the estimator $g_n$ of $g_0$, and $\Psi(Q_0)$ is replaced by $\Psi(Q_n) = (1/n)\sum_{i=1}^n (\bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i))$.

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation

The efficient influence curve is a function of $O$ that depends on $P_0$ through $Q_0$ and a possible nuisance parameter $g_0$, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through $P_0$. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve, whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator $\Psi(Q_n)$, beyond $Q_n$ being an excellent estimator of $Q_0$ as achieved with super-learning, is that the estimator solves the so-called efficient influence curve equation $\sum_{i=1}^n D^*(Q_n, g_n)(O_i) = 0$ for a good estimator $g_n$ of $g_0$. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models $\mathcal{M}$ typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score tailored for the target parameter. Using the notation $Pf \equiv \int f(o)\,dP(o)$ for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which $P_0 D^*(P) = \Psi(P_0) - \Psi(P)$, and, in general, as a consequence of $D^*(P)$ being a canonical gradient,
\[ P_0 D^*(P) = \Psi(P_0) - \Psi(P) + R(P, P_0), \tag{12} \]
where $R(P, P_0) = O((P - P_0)^2)$ is a term involving second order differences $(P - P_0)^2$. This key property explains why solving $P_0 D^*(P) = 0$ targets $\Psi(P)$ to be close to $\Psi(P_0)$ and thus explains why solving $P_n D^*(Q_n, g_n) = 0$ targets $\Psi(Q_n)$ to fit $\Psi(Q_0)$.

In our running example we have $R(P, P_0) = R_1(P, P_0) - R_0(P, P_0)$, where
\[ R_a(P, P_0) = \int_w \frac{(g - g_0)(a \mid w)}{g(a \mid w)} \left(\bar{Q} - \bar{Q}_0\right)(a, w)\,dP_0(w). \]
So in our example the remainder $R(P, P_0)$ only involves a cross-product difference $(g - g_0)(\bar{Q} - \bar{Q}_0)$. In particular, the remainder equals zero if either $g = g_0$ or $\bar{Q} = \bar{Q}_0$, which is often referred to as double robustness of the efficient influence curve with respect to $(\bar{Q}, g)$ in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.

Due to this identity (12), an estimator $\tilde{P}_n$ that solves $P_n D^*(\tilde{P}_n) = 0$ and is in a local neighborhood of $P_0$, so that $R(\tilde{P}_n, P_0) = o_P(1/\sqrt{n})$, approximately solves $\Psi(\tilde{P}_n) - \Psi(P_0) \approx (P_n - P_0) D^*(\tilde{P}_n)$, where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.
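The double robustness of this remainder can be illustrated numerically. In the sketch below the outcome regression is deliberately misspecified while the true treatment mechanism is used; up to Monte Carlo error, the mean of the efficient influence curve then still equals $\Psi(P_0) - \Psi(P)$, that is, the remainder vanishes. All functional forms and names (`g0`, `Qbar0`, `Qbar_wrong`) are invented for this illustration and are not from the paper.

```python
# Monte Carlo check that E_0 D*(Qbar_wrong, g_0) = Psi(P_0) - Psi(P) when the
# treatment mechanism is correct, even though Qbar is misspecified.
# The data generating functions below are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

g0 = lambda W: 1 / (1 + np.exp(-0.4 * W))                          # true P(A = 1 | W)
Qbar0 = lambda A, W: 1 / (1 + np.exp(-(0.8 * A + 0.5 * W - 0.2)))  # true E(Y | A, W)
Qbar_wrong = lambda A, W: 0.3 + 0.1 * A + 0.0 * W                  # misspecified fit

W = rng.normal(size=n)
A = rng.binomial(1, g0(W))
Y = rng.binomial(1, Qbar0(A, W))

psi0 = np.mean(Qbar0(1, W) - Qbar0(0, W))            # Psi(P_0), by Monte Carlo
psi_wrong = np.mean(Qbar_wrong(1, W) - Qbar_wrong(0, W))

# Efficient influence curve D*(Qbar, g) for the ATE, at (Qbar_wrong, g_0):
H = (2 * A - 1) / np.where(A == 1, g0(W), 1 - g0(W))
Dstar = H * (Y - Qbar_wrong(A, W)) + Qbar_wrong(1, W) - Qbar_wrong(0, W) - psi_wrong

print(np.mean(Dstar), psi0 - psi_wrong)              # approximately equal
```

With a misspecified $g$ as well, the two quantities would differ by exactly the cross-product remainder described above.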

2.11. Targeted Estimator Is Asymptotically Linear and Efficient

In fact, combining $P_n D^*(Q^*_n, g_n) = 0$ with (12) at $P = (Q^*_n, g_n)$ yields
\[ \Psi(Q^*_n) - \Psi(Q_0) = (P_n - P_0) D^*\left(Q^*_n, g_n\right) + R_n, \tag{13} \]
where $R_n$ is a second order term. Thus, if second order differences such as $(\bar{Q}^*_n - \bar{Q}_0)^2$, $(\bar{Q}^*_n - \bar{Q}_0)(g_n - g_0)$, and $(g_n - g_0)^2$ converge to zero at a rate faster than $1/\sqrt{n}$, then it follows that $R_n = o_P(1/\sqrt{n})$. To make this assumption as reasonable as possible, one should use super-learning for both $\bar{Q}_n$ and $g_n$.

In addition, empirical process theory teaches us that $(P_n - P_0) D^*(Q^*_n, g_n) = (P_n - P_0) D^*(Q_0, g_0) + o_P(1/\sqrt{n})$ if $P_0\{D^*(Q^*_n, g_n) - D^*(Q_0, g_0)\}^2$ converges to zero in probability as $n$ converges to infinity (a consistency condition) and if $D^*(Q^*_n, g_n)$ falls in a so-called Donsker class of functions $o \to f(o)$ [11]. An important Donsker class is the class of all $d$-variate real valued functions that have a uniform sectional variation norm that is bounded by some universal $M < \infty$; that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this $M < \infty$. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as $n$ converges to infinity. So, under this Donsker class condition, $R_n = o_P(1/\sqrt{n})$, and the consistency condition, we have
\[ \psi^*_n - \psi_0 = \frac{1}{n}\sum_{i=1}^n D^*\left(Q_0, g_0\right)\left(O_i\right) + o_P\left(\frac{1}{\sqrt{n}}\right). \tag{14} \]
That is, $\psi^*_n$ is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object, it implies consistency of $\Psi(Q^*_n)$ up till a second order term $R_n$ and even asymptotic efficiency if $R_n = o_P(1/\sqrt{n})$, under some weak regularity conditions.
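Expansion (14), together with the variance estimator described at the start of this section, directly yields a Wald-type confidence interval. The sketch below uses simulated stand-in values; `Dstar_hat` and `psi_n` are hypothetical placeholders for the estimated influence curve values $D^*(Q^*_n, g_n)(O_i)$ and the TMLE, not quantities computed from real data.

```python
# Wald-type 95% confidence interval from the sample variance of estimated
# influence curve values, as suggested by (14); all inputs are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
Dstar_hat = rng.normal(scale=1.3, size=n)   # stand-in values D*(Q*_n, g_n)(O_i)
psi_n = 0.25                                # stand-in TMLE psi*_n

se = Dstar_hat.std(ddof=1) / np.sqrt(n)     # estimated standard error
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se) # 95% Wald interval
print(ci)
```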

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the road map for Targeted Learning. We have formulated a road map for Targeted Learning of a causal quantity that provides a transparent road map [2, 9, 21], involving the following steps:

(i) defining a full-data model such as a causal model and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the


Neyman-Rubin-Robins counterfactual model [22-28]) or the structural causal model [9];

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true $P_0$;

(v) a subroad map for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29-32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ for the statistical model $\mathcal{M}$ for some underlying parameter space $\Theta$ and parameterization $\theta \to P_\theta$. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of $\theta_0$. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so-called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest, but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping and is thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, provides now an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution $P_0$, say $Q_0 = Q(P_0)$, that can be represented as the minimizer of a criterion at the true data distribution $P_0$ over all candidate values $\{Q(P) : P \in \mathcal{M}\}$ for this part of the data distribution; we refer to this criterion as the risk $R_{P_0}(Q)$ of the candidate value $Q$.

Typically, the risk at a candidate parameter value $Q$ can be defined as the expectation under the data distribution of a loss function $(O, Q) \to L(Q)(O)$ that maps the unit data structure and the candidate parameter value into a real value number: $R_{P_0}(Q) = E_{P_0} L(Q)(O)$. Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of $Q_0$ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve

Secondly one computes the efficient in1047298uence curve

(1103925) rarr lowast((1103925)(1103925))() identi1047297ed by the canonicalgradient o the pathwise derivative o the target parametermapping along paths through a data distribution 1103925 wherethis efficient in1047298uence curve does only depend on 1103925 through(1103925) and some nuisance parameter(1103925) Given an estimator one now de1047297nes a path 1038389

() with Euclideanparameter through the super-learner whose score

8520081038389

()852009=0 (983089983093)

at = 0 spans the efficient in1047298uence curve lowast( ) atthe initial estimator ( ) this is called a least avorable

parametric submodel through the super-learnerIn our running example we have = (1103925) so

that it suffices to construct a path through and 1103925 withcorresponding loss unctions and show that their scores spanthe efficient in1047298uence curve (983089983089) We can de1047297ne the path() = +() where()() = (2minus1)( | 907317)andloss unction()() = minus log(907317)+(1minus) log(1minus(907317)) Note that

852008 ()852009 ()

=0

= lowast

852008852009 = 2minus 1 (907317) 852008minus (907317)852009

(983089983094)

We also de1047297ne the path 1103925() = (1 + lowast1103925(1103925))1103925

with loss unction (1103925) (907317) = minus log1103925(907317) wherelowast1103925()() = (1907317) minus (0907317) minus Ψ() Note that

10486161103925 ()1048617=0 =

lowast1103925 () (983089983095)

Tus i wede1047297ne the sum lossunction() = ()+(1103925)then

852008 () 1103925 ()852009

=0

= lowast ( ) (983089983096)


This proves that indeed these proposed paths through $\bar{Q}$ and $Q_W$ and corresponding loss functions span the efficient influence curve $D^*(Q, g) = D^*_W(Q) + D^*_Y(\bar{Q}, g)$ at $(Q, g)$, as required.

The dimension of $\epsilon$ can be selected to be equal to the dimension of the target parameter $\psi_0$, but, by creating extra components in $\epsilon$, one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an $\epsilon_1$ for the path through $\bar{Q}$ and a separate $\epsilon_2$ for the path through $Q_W$. In this case, the TMLE update $Q^*_n$ will solve two score equations, $P_n D^*_W(Q^*_n) = 0$ and $P_n D^*_Y(\bar{Q}^*_n, g_n) = 0$, and thus, in particular, $P_n D^*(Q^*_n, g_n) = 0$. In this example the main benefit of using a bivariate $\epsilon = (\epsilon_1, \epsilon_2)$ is that the TMLE does not update $Q_{W,n}$ (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter $\epsilon$ of this path by minimizing the empirical risk $\epsilon \to P_n L(Q_{n,g_n}(\epsilon))$ along this path through the super-learner, resulting in an estimator $\epsilon_n$. This defines now an update of the super-learner fit, defined as $Q^1_n = Q_{n,g_n}(\epsilon_n)$. This updating process is iterated till $\epsilon_n \approx 0$. The final update we will denote with $Q^*_n$, the TMLE of $Q_0$, and the target parameter mapping applied to $Q^*_n$ defines the TMLE of the target parameter $\psi_0$. This TMLE $Q^*_n$ solves the efficient influence curve equation $\sum_{i=1}^n D^*(Q^*_n, g_n)(O_i) = 0$, providing the basis, in combination with statistical properties of $(Q^*_n, g_n)$, for establishing that the TMLE $\Psi(Q^*_n)$ is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, $\epsilon_{1,n} = \arg\min_{\epsilon_1} P_n L(\bar{Q}^0_{n,g_n}(\epsilon_1))$, while $\epsilon_{2,n} = \arg\min_{\epsilon_2} P_n L(Q_{W,n}(\epsilon_2))$ equals zero. That is, the TMLE does not update $Q_{W,n}$, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case $\bar{Q}^*_n = \bar{Q}^1_n$, since the convergence of the TMLE-algorithm occurs in one step, and, of course, $Q^*_{W,n} = Q_{W,n}$ is just the initial empirical distribution function of $W_1, \ldots, W_n$. The TMLE of $\psi_0$ is the substitution estimator $\Psi(Q^*_n)$.
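The running example can be sketched end to end in a few lines of numpy. The sketch below assumes a known randomization probability $g_0 = 1/2$ (as in a randomized trial) and a deliberately crude initial outcome regression; it fluctuates on the logit scale with the clever covariate $(2A-1)/g(A \mid W)$, solves the score equation for $\epsilon$ by Newton's method, and then forms the substitution estimator. The data generating distribution and all names are invented for illustration; this is a minimal sketch, not the authors' software.

```python
# One-step TMLE sketch for the additive treatment effect of the running
# example, with known g and a misspecified initial outcome regression.
import numpy as np

expit = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda p: np.log(p / (1.0 - p))

rng = np.random.default_rng(2)
n = 5_000
W = rng.normal(size=n)
g0 = 0.5                                   # known randomization probability
A = rng.binomial(1, g0, size=n)
Y = rng.binomial(1, expit(A + W - 0.5))    # true Qbar_0(A, W) = expit(A + W - 0.5)

Qbar0_n = lambda a, w: np.full_like(w, Y.mean())   # crude initial fit, ignores (A, W)

H = (2 * A - 1) / np.where(A == 1, g0, 1 - g0)     # clever covariate (2A - 1)/g(A | W)

# Fluctuate on the logit scale; solve sum_i H_i (Y_i - Qbar_eps_i) = 0 by Newton.
eps, lin = 0.0, logit(Qbar0_n(A, W))
for _ in range(25):
    Qe = expit(lin + eps * H)
    eps += np.sum(H * (Y - Qe)) / np.sum(H**2 * Qe * (1 - Qe))

Q1s = expit(logit(Qbar0_n(1, W)) + eps / g0)        # targeted Qbar*(1, W)
Q0s = expit(logit(Qbar0_n(0, W)) - eps / (1 - g0))  # targeted Qbar*(0, W)
psi_tmle = np.mean(Q1s - Q0s)                       # substitution estimator Psi(Q*_n)

Qe = expit(lin + eps * H)
eic = H * (Y - Qe) + Q1s - Q0s - psi_tmle           # estimated efficient influence curve
print(psi_tmle, np.mean(eic))                       # mean of eic is ~0 by construction
```

Because $g_0$ is correct, the double robustness discussed in Section 2.10 makes the estimate consistent despite the misspecified initial fit, and the empirical mean of the estimated efficient influence curve is zero up to numerical precision, as claimed for the TMLE.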

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller $Q_0$, defining a path and loss function with generalized score that spans the efficient influence curve, and running the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that $O_1, \ldots, O_n \sim_{iid} P_0$. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33-63, 63-72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73], and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure $O = (L(0), A(0), \ldots, L(K), A(K), Y)$, where $L(0)$ are baseline covariates, $L(k)$ are time dependent covariates realized between intervention nodes $A(k-1)$ and $A(k)$, and $Y$ is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring," since the outcome $Y$ is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the postintervention distribution corresponding with a stochastic intervention $g^*$:
\[ P_0^{g^*}(O) = \prod_{k=0}^{K+1} P_{0, L(k) \mid \bar{L}(k-1), \bar{A}(k-1)}\left(L(k) \mid \bar{L}(k-1), \bar{A}(k-1)\right) \prod_{k=0}^{K} g^*\left(A(k) \mid \bar{A}(k-1), \bar{L}(k)\right). \tag{19} \]

Note that this postintervention distribution is nothing else but the actual distribution of $O$, factorized according to the time-ordering, but with the true conditional distributions of $A(k)$, given $(\bar{L}(k), \bar{A}(k-1))$, replaced by the desired stochastic intervention. The statistical target parameter is thus $E_{P_0^{g^*}} Y$, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions $g^*$. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.
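As an illustration of formula (19), the mean outcome under a static intervention can be evaluated by Monte Carlo: sample the $L(k)$ nodes from their conditional distributions (assumed known here) and set the treatment nodes to the intervened values. The two time-point linear Gaussian model below is an invented toy example, not from the paper.

```python
# Monte Carlo evaluation of E_{P_0^{g*}} Y under the static intervention
# A(0) = A(1) = 1, by sampling from the post-intervention distribution (19).
# The linear Gaussian toy model is invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

L0 = rng.normal(size=n)                      # baseline covariate L(0)
A0 = np.ones(n)                              # intervened node A(0) = 1
L1 = 0.5 * L0 + A0 + rng.normal(size=n)      # time-dependent covariate L(1)
A1 = np.ones(n)                              # intervened node A(1) = 1
Y = L1 + A1 + 0.3 * L0 + rng.normal(size=n)  # final outcome Y

psi = Y.mean()   # for this model, E[L1] + 1 + 0.3 E[L0] = 2 in closed form
print(psi)
```

In practice the conditional distributions of the $L(k)$ nodes are unknown and must be estimated, which is exactly where the robustness and efficiency issues discussed above arise.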

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter $\Psi(P) = E_W[E(Y \mid A = 1, W) - E(Y \mid A = 0, W)]$ of our running example, but with $A$ now being the SNP in question and $W$ being the variables one wants to control for, while $Y$ is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of $E(Y \mid A, W) - E(Y \mid A = 0, W)$ onto a linear model such as $\beta A$ and use $\beta$ as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75-79].
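The simultaneous inference step can be sketched as follows: given the estimated influence curve matrix of the stacked estimator, a Monte Carlo quantile of the maximal absolute component of the limiting multivariate normal gives the simultaneous cutoff, which replaces the marginal 1.96. The influence curve matrix `IC` and the estimates `est` below are simulated stand-ins, not real analysis output.

```python
# Simultaneous 95% confidence intervals for a stacked variable importance
# measure from its multivariate normal limit; all inputs are simulated.
import numpy as np

rng = np.random.default_rng(4)
n, d = 2_000, 10
chol = np.linalg.cholesky(0.5 * np.eye(d) + 0.5 * np.ones((d, d)))
IC = rng.normal(size=(n, d)) @ chol.T                # stand-in influence curves
est = rng.normal(scale=0.05, size=d)                 # stand-in stacked estimates

Sigma = np.cov(IC, rowvar=False)                     # estimated IC covariance
se = np.sqrt(np.diag(Sigma) / n)
sd = np.sqrt(np.diag(Sigma))
corr = Sigma / np.outer(sd, sd)

# Monte Carlo quantile of max_j |Z_j| for Z ~ N(0, corr): the simultaneous
# cutoff exceeds the marginal 1.96 whenever d > 1.
Z = rng.multivariate_normal(np.zeros(d), corr, size=100_000)
q = np.quantile(np.abs(Z).max(axis=1), 0.95)
lower, upper = est - q * se, est + q * se
print(q)
```

Unlike a Bonferroni correction, this cutoff exploits the estimated correlation between the components, so the intervals are no wider than necessary for family wise coverage.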

Software has been developed in the form of general R-packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse, even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that $Y$ is the indicator of a rare event. In such cases it is often known that the probability of $Y = 1$, conditional on a treatment and covariate configuration, should not exceed a certain value $\delta > 0$; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that $\bar{Q}_0(A, W) < \delta$ for some known $\delta > 0$. A TMLE should now be based on an initial estimator $\bar{Q}_n$ satisfying this constraint, and the least favorable submodel $\{\bar{Q}_{n,g_n}(\epsilon) : \epsilon\}$ should also satisfy this constraint for each $\epsilon$, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter $g_0$ in TMLE. Even though an asymptotically consistent estimator $g_n$ of $g_0$ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating $g_0$, but also with respect to how well the resulting TMLE fits $\psi_0$. Consider our running example. Suppose that, among the components of $W$, there is a $W_j$ that is an almost perfect predictor of $A$ but has no effect on the outcome $Y$. Inclusion of such a covariate $W_j$ in the fit of $g_n$ makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for $W_j$, but in most finite samples adjustment for $W_j$ in $g_n$ will hurt the practical performance of TMLE, and effort should be put in variables that are stronger confounders than $W_j$. We developed a method for building an estimator $g_n$ that uses as criterion the change in fit between the initial estimator of $Q_0$ and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits $g_0$ in collaboration with the initial estimator [2, 44-46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that $\bar{Q}_n$ and $g_n$ are not too erratic functions of $(A, W)$. This condition is not just theoretical, but one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the $\epsilon$ that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the fit of $\epsilon$ that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions, compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator $\bar{Q}_n$ is inconsistent, but $g_n$ is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user supplied estimators was formulated and implemented for double robust estimating equation based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true $Q_0$, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of $\psi_0$. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in Chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either $\bar{Q}_n$ or $g_n$ is consistent. However, if one uses a data adaptive consistent estimator of $g_0$ (and thus with bias larger than $1/\sqrt{n}$) and $\bar{Q}_n$ is inconsistent, then the bias of $g_n$ might directly map into a bias for the resulting TMLE of $\psi_0$ of the same order. As a consequence, the TMLE might have a bias with respect to $\psi_0$ that is larger than $O(1/\sqrt{n})$, so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating $g_n$) to guarantee that the TMLE remains asymptotically linear with known influence curve when either $\bar{Q}_n$ or $g_n$ is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting $\bar{Q}_n$ but also targeting $g_n$ to guarantee that, when $\bar{Q}_n$ is misspecified, the required smooth function of $g_n$ will behave as a TMLE, and, if $g_n$ is misspecified, the required smooth functional of $\bar{Q}_n$ is still asymptotically linear. The same method was used to develop an IPW estimator that targets $g_n$, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of $g_0$ is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function at the candidate estimator based on the training sample, averaged across different sample splits, where we generalized these results to loss functions that depend on an unknown nuisance parameter (which are thus estimated in the cross-validated risk).

For example, suppose that in our running example $A$ is continuous and we are concerned with estimation of the dose-response curve $a \to \psi_0(a)$, where $\psi_0(a) = E_0 E_0(Y \mid A = a, W)$. One might define the risk of a candidate dose response curve as a mean squared error with respect to the true curve $\psi_0$. However, this risk of a candidate curve is itself an unknown real valued target parameter. On the contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
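For the simplest case of a known loss function, the cross-validation selector underlying the super-learner can be sketched directly: compute the V-fold cross-validated empirical risk of each candidate estimator and select the minimizer. The candidate library and data generating process below are toy choices for illustration, not the paper's library.

```python
# A discrete super learner: select among candidate estimators by V-fold
# cross-validated empirical risk under squared error loss.
import numpy as np

rng = np.random.default_rng(5)
n, V = 600, 5
X = rng.uniform(-2, 2, size=n)
Y = 1.5 * X + rng.normal(scale=0.5, size=n)   # truth is linear in X

candidates = {
    "mean": lambda Xtr, Ytr: (lambda x: np.full_like(x, Ytr.mean())),
    "linear": lambda Xtr, Ytr: (lambda x, c=np.polyfit(Xtr, Ytr, 1): np.polyval(c, x)),
}

folds = np.arange(n) % V
cv_risk = {}
for name, fit in candidates.items():
    losses = []
    for v in range(V):
        train, valid = folds != v, folds == v
        pred = fit(X[train], Y[train])(X[valid])  # fit on training, predict held out
        losses.append((Y[valid] - pred) ** 2)
    cv_risk[name] = np.concatenate(losses).mean()

selected = min(cv_risk, key=cv_risk.get)      # cross-validation selector
print(selected, cv_risk)
```

The CV-TMLE extension described above replaces the simple empirical mean of the loss with a targeted estimate of the conditional risk, which is needed once the loss itself depends on an unknown nuisance parameter.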

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


Advances in Statistics

By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite-sample coverage of our confidence intervals, especially in sparse-data situations.
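The "common practice" estimator criticized above is easy to state in code. The sketch below is illustrative only: for the simplest target parameter, the mean psi = E[O], the efficient influence curve is IC(O) = O - psi, and the standard recipe takes the sample variance of the estimated influence curves to build a Wald-type confidence interval. It is exactly this kind of estimator that can underestimate the variance under sparsity, which is what the targeted (substitution) variance estimator discussed above aims to repair.

```python
import math
import random

def ic_wald_ci(obs, z=1.96):
    """Standard practice: estimate the asymptotic variance of an estimator
    by the empirical sample variance of its estimated influence curves.
    Target parameter: psi = E[O]; estimated IC: IC_i = O_i - psi_hat."""
    n = len(obs)
    psi_hat = sum(obs) / n
    ic = [o - psi_hat for o in obs]            # estimated influence curves
    var_ic = sum(v * v for v in ic) / n        # sample variance of the IC
    se = math.sqrt(var_ic / n)                 # SE of the estimator itself
    return psi_hat, (psi_hat - z * se, psi_hat + z * se)

rng = random.Random(0)
obs = [rng.gauss(1.0, 2.0) for _ in range(1000)]
psi_hat, (lo, hi) = ic_wald_ci(obs)
```

In well-behaved settings like this simulation the interval is reasonable; the text's point is that when the influence curves have rare, very large values, this plug-in sample variance becomes unreliable.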

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions and stationarity assumptions, which state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data-Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data-driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data-adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
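The sample-splitting recipe described above (the "current teaching" baseline, not the CV-TMLE of [86]) can be sketched as follows; the subgroup structure and effect sizes are invented for the example. Half of the data is spent choosing the target parameter, and the independent half is used for an honest estimate and confidence interval.

```python
import math
import random

def split_sample_inference(half1, half2, z=1.96):
    """Use half1 to *choose* a data-adaptive target parameter (here: the
    mean outcome of the best-looking subgroup) and the independent half2
    to estimate it and form a Wald confidence interval."""
    chosen = max(half1, key=lambda g: sum(half1[g]) / len(half1[g]))
    obs = half2[chosen]
    n = len(obs)
    est = sum(obs) / n
    se = math.sqrt(sum((o - est) ** 2 for o in obs) / n / n)
    return chosen, est, (est - z * se, est + z * se)

# Hypothetical subgroups with different true mean outcomes
rng = random.Random(3)
true_means = {"g1": 0.0, "g2": 0.5, "g3": 1.0}
half1 = {g: [rng.gauss(m, 1.0) for _ in range(200)]
         for g, m in true_means.items()}
half2 = {g: [rng.gauss(m, 1.0) for _ in range(200)]
         for g, m in true_means.items()}
chosen, est, (lo, hi) = split_sample_inference(half1, half2)
# The half2 estimate is unbiased for the chosen group's mean because
# the choice was made on independent data
```

The honesty of the second-half interval comes at the cost the text describes: only half the sample is available for estimation, which is the inefficiency that motivates CV-TMLE-based inference for data-adaptive parameters.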

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., an indicator of death or another health measurement). We started to address data-adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data-adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
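As a cartoon of this target parameter (not the super-learner-based estimator of [87]), the sketch below fits a crude individualized rule d(v) = argmax over a of the estimated E(Y | A = a, V = v), using stratified means on one sample, and then estimates the rule's value, the mean outcome under the rule, on an independent sample. The simulation, in which treatment helps exactly when v = 1, is invented for illustration, and the value estimator is only valid here because treatment is randomized.

```python
import random

def fit_rule(train):
    """Fit d(v) = argmax_a of the stratified sample mean of Y given (A=a, V=v)."""
    totals = {}
    for v, a, y in train:
        s, n = totals.get((v, a), (0.0, 0))
        totals[(v, a)] = (s + y, n + 1)
    means = {k: s / n for k, (s, n) in totals.items()}
    values_v = {v for v, _, _ in train}
    actions = {a for _, a, _ in train}
    return {v: max(actions, key=lambda a: means.get((v, a), float("-inf")))
            for v in values_v}

def rule_value(rule, test):
    """Estimate the mean outcome under the rule: with randomized treatment,
    average Y over test units whose observed treatment agrees with the rule."""
    agree = [y for v, a, y in test if rule[v] == a]
    return sum(agree) / len(agree)

# Simulation: treatment a = 1 helps when v = 1 and harms when v = 0
rng = random.Random(0)
def draw():
    v = rng.randint(0, 1)
    a = rng.randint(0, 1)                      # randomized treatment
    return v, a, (1.0 if a == v else -1.0) + rng.gauss(0.0, 1.0)

train = [draw() for _ in range(2000)]
test = [draw() for _ in range(2000)]
rule = fit_rule(train)                         # should recover d(v) = v
value = rule_value(rule, test)                 # true optimal value is 1.0
```

Because the rule itself is fitted from data, its value is precisely the kind of data-adaptive target parameter discussed in the previous paragraph.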

Statistical Inference Based on Higher-Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second-order term) satisfies R_n = o_P(1/sqrt(n)). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/sqrt(n). The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher-order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher-order influence functions under the assumption that the target parameter is higher-order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher-order pathwise differentiable target parameters whose bias is driven by the last term of the higher-order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher-order expansions (thus incorporating not only the first-order efficient influence function but also the higher-order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high-quality statistical procedures that are scalable. We have started doing research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
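The online-update idea can be illustrated with the simplest possible target parameter (a toy stand-in, not an online TMLE): the estimator below absorbs each new chunk with a number of computations proportional only to the chunk's size, yet always reproduces exactly the estimate and influence-curve-based standard error one would get by recomputing from scratch.

```python
import math

class OnlineMean:
    """Online estimator of psi = E[O] with an IC-based standard error.
    update() costs O(len(chunk)) regardless of how much data has already
    been absorbed; estimate() matches a from-scratch recomputation."""
    def __init__(self):
        self.n, self.total, self.total_sq = 0, 0.0, 0.0

    def update(self, chunk):
        # Only sufficient statistics are kept, so old data is never revisited
        self.n += len(chunk)
        self.total += sum(chunk)
        self.total_sq += sum(o * o for o in chunk)

    def estimate(self):
        psi = self.total / self.n
        var = self.total_sq / self.n - psi * psi   # variance of IC = O - psi
        return psi, math.sqrt(var / self.n)

est = OnlineMean()
for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0]):        # two arriving data chunks
    est.update(chunk)
psi, se = est.estimate()                           # psi == 3.0, the full-data mean
```

The research challenge described above is to achieve this same "sufficient statistics" behavior for estimators as complex as TMLE, where the update step interacts with super-learning and targeting.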

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data-adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data-generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data-analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs) but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism, that a mere chasing of low p values and naive use of parametric statistics did not do justice to specific characteristics of the sciences involved, emerged from the start of the application of statistics, the success story was immense. However, this rise of "the inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1978. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; like a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for the exploration, storage, and summary illustration of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are in many circumstances of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which was soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data-analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must yet be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated

with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature, and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics built on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view, it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data, nor full census research, nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
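The two-step procedure just described can be made concrete in a short, self-contained sketch for the average treatment effect E[Y_1] - E[Y_0] with binary Y, A, and W. All names and the simulation are illustrative: the super-learner step is replaced by a deliberately crude arm-mean initial fit (so the targeting step has visible confounding bias to correct), the treatment mechanism is estimated by stratified proportions, the fluctuation is the standard logistic submodel with the "clever covariate," and the standard error comes from the estimated influence curve, as described above.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def tmle_ate(data):
    """Minimal two-step TMLE of the average treatment effect for binary
    (W, A, Y): crude initial fit, logistic fluctuation, substitution
    estimator, and influence-curve-based standard error."""
    n = len(data)
    # Step 1 (stand-in for super learning): Qbar(a, w) = arm mean, ignoring W
    q_init = {}
    for a in (0, 1):
        ys = [y for w_, a_, y in data if a_ == a]
        q_init[a] = min(max(sum(ys) / len(ys), 0.01), 0.99)
    # Treatment mechanism g(w) = P(A=1 | W=w): stratified proportions
    g = {}
    for w in (0, 1):
        arms = [a for w_, a, _ in data if w_ == w]
        g[w] = min(max(sum(arms) / len(arms), 0.01), 0.99)
    # Clever covariate H(a, w) = a/g(w) - (1-a)/(1-g(w))
    def H(a, w):
        return a / g[w] - (1 - a) / (1 - g[w])
    # Step 2: fluctuate logit Qbar by eps * H, solving the score equation
    # sum_i H_i (Y_i - Qbar_eps_i) = 0 with Newton's method
    eps = 0.0
    for _ in range(50):
        score, deriv = 0.0, 0.0
        for w, a, y in data:
            q = expit(logit(q_init[a]) + eps * H(a, w))
            score += H(a, w) * (y - q)
            deriv -= H(a, w) ** 2 * q * (1 - q)
        eps -= score / deriv
    def q_star(a, w):
        return expit(logit(q_init[a]) + eps * H(a, w))
    # Substitution estimator and influence-curve-based standard error
    psi = sum(q_star(1, w) - q_star(0, w) for w, _, _ in data) / n
    ic = [H(a, w) * (y - q_star(a, w)) + q_star(1, w) - q_star(0, w) - psi
          for w, a, y in data]
    se = math.sqrt(sum(v * v for v in ic) / n / n)
    return psi, se

# Simulation with confounding: W raises both treatment probability and Y
rng = random.Random(0)
def draw():
    w = int(rng.random() < 0.5)
    a = int(rng.random() < (0.7 if w else 0.3))
    y = int(rng.random() < 0.1 + 0.2 * a + 0.5 * w)
    return w, a, y

data = [draw() for _ in range(5000)]
psi, se = tmle_ate(data)   # true ATE is 0.2; naive arm-mean difference is ~0.4
```

Because the treatment mechanism is estimated consistently, the targeting step removes the confounding bias that the misspecified initial fit would otherwise pass through, which is the double-robustness property discussed earlier in the paper.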

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation; and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.

Targeted Learning was developed in response to high-dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases, it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state of the art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Advances in Statistics

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," The International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," The International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," The International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect"," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," The International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," The International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," The International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," The International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," The International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," The International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," The International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," The International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery, the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


Neyman-Rubin-Robins counterfactual model [22–28]) or the structural causal model [9];

(ii) defining the target quantity of interest as a target parameter of the full-data distribution;

(iii) establishing identifiability of the target quantity from the observed data distribution, under possible additional assumptions that are not necessarily believed to be reasonable;

(iv) committing to the resulting estimand and the statistical model that is believed to contain the true P_0;

(v) a subroadmap for the TMLE, discussed below, to construct an asymptotically efficient substitution estimator of the statistical target parameter;

(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution, in order to construct a confidence interval;

(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling; that is, one obtains a parameterization M = {P_θ : θ ∈ Θ} for the statistical model M, for some underlying parameter space Θ and parameterization θ → P_θ. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of θ_0. Under certain assumptions, one establishes that the target quantity can be represented as a parameter of the data distribution, a so-called estimand; such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest; but, importantly, it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and is thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, now provides an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities; the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.
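The identifiability step can be illustrated with a small simulation sketch under a hypothetical randomized full-data experiment (all data-generating choices below are assumptions of the sketch): the causal quantity E(Y_1 − Y_0), defined only in the counterfactual world, coincides with the G-computation estimand, which is a feature of the observed data distribution alone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical full-data (counterfactual) experiment: covariate W and
# potential outcomes Y0, Y1 with a true average effect of 1.0.
n = 500_000
W = rng.normal(size=n)
Y0 = W + rng.normal(size=n)
Y1 = W + 1.0 + rng.normal(size=n)
causal_quantity = np.mean(Y1 - Y0)        # defined in the full-data world

# Observed data under randomization: A independent of (Y0, Y1).
A = rng.integers(0, 2, size=n)
Y = np.where(A == 1, Y1, Y0)

# Estimand (G-computation formula): E_W[ E(Y|A=1,W) - E(Y|A=0,W) ],
# evaluated here by a coarse stratification of W purely for illustration.
bins = np.digitize(W, np.quantile(W, np.linspace(0, 1, 21)[1:-1]))
est = 0.0
for b in np.unique(bins):
    m = bins == b
    est += m.mean() * (Y[m & (A == 1)].mean() - Y[m & (A == 0)].mean())

print(causal_quantity, est)   # both close to the true effect of 1.0
```

The estimand on the right-hand side never references Y_0 or Y_1; it is the identifiability result (here, randomization) that licenses its causal reading.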

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution P_0, say Q_0 = Q(P_0), that can be represented as the minimizer of a criterion at the true data distribution P_0 over all candidate values {Q(P) : P ∈ M} for this part of the data distribution; we refer to this criterion as the risk R_{P_0}(Q) of the candidate value Q.

Typically, the risk at a candidate parameter value Q can be defined as the expectation under the data distribution of a loss function (O, Q) → L(Q)(O) that maps the unit data structure and the candidate parameter value into a real number: R_{P_0}(Q) = E_0 L(Q)(O). Examples of loss functions are the squared error loss for a conditional mean and the log-likelihood loss for a (conditional) density. This representation of Q_0 as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.
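A minimal sketch of this loss-based view (the candidate fits below are hypothetical stand-ins for a super-learner library): the true conditional mean minimizes the expected squared error risk, so empirical risk on a large sample ranks candidates correctly.

```python
import numpy as np

rng = np.random.default_rng(2)

# The true conditional mean Q̄0(W) = E(Y | W) minimizes the risk
# R(Q) = E (Y - Q(W))**2, so the empirical risk (here on one large sample,
# in place of cross-validated risk) can be used to select among candidates.
n = 200_000
W = rng.uniform(-1, 1, size=n)
Y = np.sin(2 * W) + 0.3 * rng.normal(size=n)   # true mean: sin(2W)

candidates = {
    "truth":  lambda w: np.sin(2 * w),
    "linear": lambda w: 1.5 * w,        # misspecified parametric fit
    "zero":   lambda w: 0.0 * w,
}
risks = {name: np.mean((Y - f(W)) ** 2) for name, f in candidates.items()}
print(min(risks, key=risks.get))   # -> "truth"
```

Super-learning replaces the single-sample risk above with cross-validated risk, so that flexible candidates cannot win by overfitting.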

Secondly, one computes the efficient influence curve (Q(P), g(P)) → D*(Q(P), g(P))(O), identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution P, where this efficient influence curve does only depend on P through Q(P) and some nuisance parameter g(P). Given an estimator (Q_n, g_n), one now defines a path {Q_{n,ε} : ε} with Euclidean parameter ε through the super-learner Q_n whose score

d/dε L(Q_{n,ε}) |_{ε=0}     (15)

at ε = 0 spans the efficient influence curve D*(Q_n, g_n) at the initial estimator (Q_n, g_n); this is called a least favorable parametric submodel through the super-learner.

In our running example, we have Q = (Q̄, Q_W), so that it suffices to construct a path through Q̄ and Q_W with corresponding loss functions and show that their scores span the efficient influence curve (11). We can define the path Q̄(ε) = Q̄ + εC(g), where C(g)(A, W) = (2A − 1)/g(A | W), and loss function L(Q̄)(O) = −Y log Q̄(A, W) − (1 − Y) log(1 − Q̄(A, W)). Note that

d/dε L(Q̄(ε))(O) |_{ε=0} = D*_Y(Q̄, g) = (2A − 1)/g(A | W) (Y − Q̄(A, W)).     (16)

We also define the path Q_W(ε) = (1 + εD*_W(Q))Q_W, with loss function L(Q_W)(W) = −log Q_W(W), where D*_W(Q)(W) = Q̄(1, W) − Q̄(0, W) − Ψ(Q). Note that

d/dε L(Q_W(ε)) |_{ε=0} = D*_W(Q).     (17)

Thus, if we define the sum loss function L(Q) = L(Q̄) + L(Q_W), then

d/dε {L(Q̄(ε)) + L(Q_W(ε))} |_{ε=0} = D*(Q, g).     (18)
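The score identity in (16) can be checked numerically. The sketch below uses the logistic fluctuation logit Q̄(ε) = logit Q̄ + εH, with H(A, W) = (2A − 1)/g(A | W), which is the form commonly used in practice for a binary outcome (an assumption of this sketch, standing in for the path written generically above); a finite-difference derivative of the average log-likelihood at ε = 0 reproduces the empirical mean of D*_Y(Q̄, g) = H (Y − Q̄(A, W)). Data and the initial fit are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1_000
W = rng.normal(size=n)
g1 = 1 / (1 + np.exp(-0.5 * W))                  # treatment mechanism g(1|W)
A = rng.binomial(1, g1)
Qbar = 1 / (1 + np.exp(-(0.3 * W + 0.8 * A)))    # some initial fit Q̄(A,W)
Y = rng.binomial(1, Qbar)

H = (2 * A - 1) / np.where(A == 1, g1, 1 - g1)   # clever covariate

def loglik(eps):
    # average log-likelihood along the logistic fluctuation submodel
    q = 1 / (1 + np.exp(-(np.log(Qbar / (1 - Qbar)) + eps * H)))
    return np.mean(Y * np.log(q) + (1 - Y) * np.log(1 - q))

eps = 1e-6
score_fd = (loglik(eps) - loglik(-eps)) / (2 * eps)   # finite-difference score
score_ic = np.mean(H * (Y - Qbar))                    # mean of D*_Y(Q̄, g)
print(score_fd, score_ic)   # agree to numerical precision
```

The agreement is exactly the "score spans the efficient influence curve" requirement for the Y-component of (11).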

7232019 Targetted Learning

httpslidepdfcomreaderfulltargetted-learning 920

Advances in Statistics 9

This proves that indeed these proposed paths through Q̄ and Q_W and corresponding loss functions span the efficient influence curve D*(Q, g) = D*_W(Q) + D*_Y(Q̄, g) at (Q, g), as required.

The dimension of ε can be selected to be equal to the dimension of the target parameter ψ₀, but, by creating extra components in ε, one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example we can use an ε₁ for the path through Q̄ and a separate ε₂ for the path through Q_W. In this case, the TMLE update Q*_n will solve two score equations, P_n D*_W(Q*_n) = 0 and P_n D*_Y(Q̄*_n, g_n) = 0, and thus, in particular, P_n D*(Q*_n, g_n) = 0. In this example the main benefit of using a bivariate ε = (ε₁, ε₂) is that the TMLE does not update Q_{W,n} (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter ε of this path by minimizing the empirical risk ε → P_n L(Q_{n,g_n}(ε)) along this path through the super-learner, resulting in an estimator ε_n. This now defines an update of the super-learner fit, defined as Q¹_n = Q_{n,g_n}(ε_n). This updating process is iterated until ε_n ≈ 0. The final update we will denote by Q*_n, the TMLE of Q₀, and the target parameter mapping applied to Q*_n defines the TMLE of the target parameter ψ₀. This TMLE Q*_n solves the efficient influence curve equation ∑_{i=1}^n D*(Q*_n, g_n)(O_i) = 0, providing the basis, in combination with statistical properties of (Q*_n, g_n), for establishing that the TMLE Ψ(Q*_n) is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, ε_{1,n} = arg min_ε P_n L(Q̄⁰_{n,g_n}(ε)), while ε_{2,n} = arg min_{ε₂} P_n L(Q_{W,n}(ε₂)) equals zero. That is, the TMLE does not update Q_{W,n}, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case Q̄*_n = Q̄¹_n, since the convergence of the TMLE-algorithm occurs in one step, and, of course, Q*_{W,n} = Q_{W,n} is just the initial empirical distribution function of W₁, …, W_n. The TMLE of ψ₀ is the substitution estimator Ψ(Q*_n).
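In code, the one-step TMLE for this running example can be sketched as follows. This is an illustrative sketch, not the tmle() R package: it uses the common logistic fluctuation of the initial fit with clever covariate (2A − 1)/g(A | W), solving the score equation in ε by Newton iterations, and all function names are ours.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_ate(W, A, Y, Qbar, g1W):
    """One-step TMLE of E[Qbar(1,W) - Qbar(0,W)] for binary A, Y.

    Qbar(a, W): initial estimate of E[Y | A=a, W], values in (0, 1)
    g1W:        estimated P(A = 1 | W) for each unit
    """
    gAW = np.where(A == 1, g1W, 1 - g1W)
    H = (2 * A - 1) / gAW                    # clever covariate C(g)(O)
    logitQ = np.log(Qbar(A, W) / (1 - Qbar(A, W)))
    eps = 0.0
    for _ in range(25):                      # Newton steps for the MLE of eps
        Qe = expit(logitQ + eps * H)
        score = np.mean(H * (Y - Qe))        # empirical score along the submodel
        info = np.mean(H ** 2 * Qe * (1 - Qe))
        eps += score / info
        if abs(score) < 1e-10:
            break
    # targeted update of the fit at A=1 and A=0, then the substitution estimator
    Q1s = expit(np.log(Qbar(1, W) / (1 - Qbar(1, W))) + eps / g1W)
    Q0s = expit(np.log(Qbar(0, W) / (1 - Qbar(0, W))) - eps / (1 - g1W))
    return float(np.mean(Q1s - Q0s))
```

Because the marginal distribution of W is estimated by the empirical distribution, only the outcome regression is fluctuated, mirroring the single-step convergence described above.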

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller Q₀, defining a path and loss function with generalized score that spans the efficient influence curve, and applying the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that O₁, …, O_n ~iid P₀. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73], and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure O = (L(0), A(0), …, A(K), Y = L(K + 1)), where L(0) are baseline covariates, L(k) are time-dependent covariates realized between intervention nodes A(k − 1) and A(k), and Y is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring", since the outcome Y is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the post-intervention distribution corresponding with a stochastic intervention g*:

  P₀^{g*}(O) = ∏_{k=0}^{K+1} P₀(L(k) | L̄(k − 1), Ā(k − 1)) ∏_{k=0}^{K} g*_k(A(k) | Ā(k − 1), L̄(k)).    (19)

Note that this post-intervention distribution is nothing else but the actual distribution of O, factorized according to the time-ordering, but with the true conditional distributions of A(k), given (L̄(k), Ā(k − 1)), replaced by the desired stochastic intervention. The statistical target parameter is thus E_{P₀^{g*}} Y, that is, the mean outcome under this post-intervention distribution. A big challenge in the literature has been to develop robust, efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user-supplied class of stochastic interventions g*. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time-to-event outcomes, and incorporating right-censoring.
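To make the G-computation formula (19) concrete, the following toy sketch (our own illustrative data-generating mechanism, not taken from the paper) computes the post-intervention mean for a single time-dependent covariate by simulating the time-ordered factorization with the treatment nodes set to a static intervention (a₀, a₁):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcomp_mean(a0, a1, n_sim=200000, seed=0):
    """Mean outcome under the static intervention (A(0), A(1)) = (a0, a1)
    in a toy sequential model L(0) -> A(0) -> L(1) -> A(1) -> Y.

    We draw from the (here, known) conditional laws of L(0) and L(1) in
    time order, but set the intervention nodes to (a0, a1), exactly as
    the G-computation formula for the post-intervention distribution
    prescribes, and average the outcome probability by Monte Carlo."""
    rng = np.random.RandomState(seed)
    L0 = rng.uniform(size=n_sim)                          # baseline covariate
    # time-dependent covariate depends on the history (L(0), A(0) = a0)
    L1 = (rng.uniform(size=n_sim) < expit(L0 + a0 - 0.5)).astype(int)
    # outcome probability depends on the full parent history
    pY = expit(-1.0 + L0 + 0.5 * L1 + a0 + a1)
    return float(np.mean(pY))
```

In practice the conditional laws are unknown and must be estimated, which is precisely where the robust TMLE-based substitution estimators cited above come in; the sketch only illustrates the identification formula itself.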

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E(Y | A = 1, W) − E(Y | A = 0, W) of our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA and use β as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable, one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure, with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family-wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
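The simultaneous inference step described above can be sketched generically (illustrative code, not the authors' software): given the stacked estimates and an n × d matrix of estimated influence curve values, one replaces the marginal 1.96 cutoff by the simulated quantile of the maximum absolute coordinate of the multivariate normal limit distribution.

```python
import numpy as np

def simultaneous_cis(est, IC, level=0.95, n_draws=100000, seed=0):
    """Simultaneous confidence intervals for a stacked parameter vector.

    est: length-d vector of estimates; IC: n x d matrix of estimated
    influence curve values, so est is approximately N(truth, Sigma/n)
    with Sigma = Cov(IC)."""
    n, d = IC.shape
    Sigma = np.cov(IC, rowvar=False)
    se = np.sqrt(np.diag(Sigma) / n)
    # correlation matrix of the multivariate normal limit distribution
    sd = np.sqrt(np.diag(Sigma))
    corr = Sigma / np.outer(sd, sd)
    rng = np.random.RandomState(seed)
    Z = rng.multivariate_normal(np.zeros(d), corr, size=n_draws)
    q = np.quantile(np.max(np.abs(Z), axis=1), level)  # max-|Z| cutoff
    return est - q * se, est + q * se, q
```

The resulting cutoff q exceeds 1.96, so the simultaneous intervals are wider than the marginal Wald intervals, which is the price of family-wise error control.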

Software has been developed in the form of general R packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse, even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value C > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄₀(A, W) < C for some known C > 0. A TMLE should now be based on an initial estimator satisfying this constraint, and the least favorable submodel Q̄_{n,g_n}(ε) should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter g₀ in TMLE. Even though an asymptotically consistent estimator of g₀ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g₀, but also with respect to how well the resulting TMLE fits ψ₀. Consider our running example. Suppose that, among the components of W, there is a W_j that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate W_j in the fit of g makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for W_j, but in most finite samples adjustment for W_j in g will hurt the practical performance of TMLE, and effort should be put in variables that are stronger confounders than W_j. We developed a method for building an estimator g_n that uses as criterion the change in fit between the initial estimator of Q₀ and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g₀ in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example, it requires that Q̄_n and g_n are not too erratic functions of (W, A). This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the ε_n that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the fit of ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in the so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions, compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator is inconsistent, but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true Q₀, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ₀. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data adaptive consistent estimator of g₀ (and thus with bias larger than 1/√n) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ₀ of the same order. As a consequence, the TMLE might have a bias with respect to ψ₀ that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear, with known influence curve, when either Q̄_n or g_n is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets g_n, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of g₀ is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that, in our running example, A is continuous and we are concerned with estimation of the dose-response curve (a, ψ₀(a)), where ψ₀(a) = E₀E₀(Y | A = a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve ψ₀. However, this risk of a candidate curve is itself an unknown real-valued target parameter. In contrast to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning we began to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that, in sparse data situations, standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that these variance estimators are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
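For reference, the common practice criticized here, the empirical variance of the estimated influence curve, looks as follows (a generic sketch; the TMLE-based variance estimator under development replaces this plug-in of sample moments with a substitution estimator):

```python
import numpy as np
from statistics import NormalDist

def wald_ci(psi_hat, IC, level=0.95):
    """Influence-curve-based Wald interval: the asymptotic variance of the
    estimator is estimated by the empirical sample variance of the
    estimated influence curve values divided by n."""
    n = len(IC)
    se = np.std(IC, ddof=1) / np.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # 1.96 for level = 0.95
    return psi_hat - z * se, psi_hat + z * se
```

When influence curve values are heavy-tailed, as under sparsity, this sample variance can badly underestimate the true asymptotic variance, which is exactly the failure mode motivating the substitution-based alternative.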

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions that state that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and, by enforcing it, we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for allowing us to obtain inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach, one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference, in terms of confidence intervals, for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
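The sample-splitting baseline described above can be sketched generically (our own illustration; the CV-TMLE approach of [86] avoids sacrificing half of the sample in this way):

```python
import numpy as np
from statistics import NormalDist

def split_inference(X, select, estimate, seed=0):
    """Split the sample: one half generates the (data adaptive) target
    parameter, the other half estimates it with a Wald interval.

    select(D)      -> an index/description of the chosen parameter
    estimate(D, k) -> per-unit plug-in values whose mean estimates it"""
    n = len(X)
    idx = np.random.RandomState(seed).permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    k = select(X[half1])                 # look at data, pick the question
    vals = estimate(X[half2], k)         # answer it on independent data
    est = float(vals.mean())
    se = vals.std(ddof=1) / np.sqrt(len(vals))
    z = NormalDist().inv_cdf(0.975)
    return k, est, (est - z * se, est + z * se)
```

Because the selection half and the estimation half are independent, the confidence interval for the selected parameter is valid, but only half the data contributes to its width.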

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule, or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order) term R_n = o_P(1/√n). For example, in our running example, this means that the product of the rates at which the super-learner estimators of Q̄₀ and g₀ converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and is suffering from lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data, augmented with the new chunk of data, would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute the estimator; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLE that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
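A toy illustration of the flavor of such online updating (our own sketch; an online TMLE must also update the targeting step, not just sample moments): a running estimate and influence-curve-based standard error whose state is updated one chunk at a time, with cost depending only on the chunk size.

```python
import numpy as np

class OnlineMean:
    """Running estimate and standard error, updated one chunk at a time.

    The cost of update() depends only on the size of the new chunk;
    the old data is never revisited."""
    def __init__(self):
        self.n, self.mean, self.ss = 0, 0.0, 0.0  # ss = sum of squared deviations
    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        m, mu = len(chunk), chunk.mean()
        delta = mu - self.mean
        tot = self.n + m
        # pairwise (Chan et al.) update of mean and sum of squared deviations
        self.ss += ((chunk - mu) ** 2).sum() + delta ** 2 * self.n * m / tot
        self.mean += delta * m / tot
        self.n = tot
    def estimate(self):
        se = np.sqrt(self.ss / (self.n - 1) / self.n)
        return self.mean, se
```

After streaming through the data in chunks, the state reproduces exactly the estimate and standard error one would get from a single batch computation, which is the property one wants an online TMLE to approximate for general target parameters.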

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections, the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim, we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view, it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages, or elements, of the research process in statistical theory. According to TMLE/SL, all these elements should be related to, or defined in terms of, (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of, or approach to,


Advances in Statistics

learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data-analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only strengthened further, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, such as: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference depends more and more on probabilistic reasoning and that statistical analysis is not as well founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections to the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism emerged from the start that mere chasing of low p values and naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that is unmistakably now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that was soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, the classical notion of truth is obsolete, and pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data-analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the antithesis sketched very briefly here. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of parametric Fisherian statistics resting on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands on, and offers challenges to, both. For example, it sets high standards for data management, storage, and retrieval, and it has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap compelling, of course also for practical reasons.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to an annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that the model and the parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the candidate fits, with weights calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research, nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that the two, rather than being in contradiction, should be integrating parts of any concept of Data Science.
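The first (super learning) step of the two-step procedure described above can be illustrated with a toy sketch: compute cross-validated predictions for each candidate learner in a library, then pick the convex combination of candidates that minimizes the cross-validated squared-error risk. The Python below is our own minimal illustration under simplifying assumptions (a two-learner library and a grid search over the weight simplex); all function names are ours, and this is not the authors' SuperLearner implementation.

```python
import numpy as np

def cv_predictions(fit, X, y, folds=5, seed=0):
    """V-fold cross-validated predictions of one candidate learner."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    preds = np.empty(len(y))
    for v in range(folds):
        val = idx[v::folds]                # validation indices for fold v
        trn = np.setdiff1d(idx, val)       # remaining indices train the fold
        predict = fit(X[trn], y[trn])
        preds[val] = predict(X[val])
    return preds

# Two illustrative candidate learners; each returns a prediction function.
def fit_mean(X, y):
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_linear(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])            # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def super_learner_weights(X, y, fits, grid=101):
    """Convex weights minimizing cross-validated squared-error risk."""
    Z = np.column_stack([cv_predictions(f, X, y) for f in fits])
    best_w, best_risk = None, np.inf
    for a in np.linspace(0.0, 1.0, grid):                 # 2-learner simplex
        w = np.array([a, 1.0 - a])
        risk = np.mean((y - Z @ w) ** 2)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 1))
y = 2.0 * X[:, 0] + rng.normal(size=400)                  # truly linear signal
w = super_learner_weights(X, y, [fit_mean, fit_linear])
# With a linear signal, (nearly) all weight falls on the linear candidate.
```

In the full procedure this cross-validated fit is only the initial estimator; the second (targeting) step then fluctuates it along a parametric submodel to optimize the bias-variance trade-off for the target parameter.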

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as "Big Data." Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and we will really also need Targeted Learning for unbiased estimation and valid statistical inference.
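The emptiness of the strata is easy to demonstrate numerically. The sketch below is our own toy construction (variable names and data-generating choices are illustrative, not from the paper): with even a single continuous covariate, essentially no covariate value is observed under both treatment arms, so the stratum-specific means that the pure plug-in estimator needs are undefined without smoothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
W = rng.normal(size=n)                  # one continuous baseline covariate
A = rng.integers(0, 2, size=n)          # binary treatment
Y = A + W + rng.normal(size=n)          # outcome (true effect is 1)

# Group observations into empirical (covariate-value) strata and record
# which treatment arms were observed within each stratum.
arms_seen = {}
for a, w in zip(A, W):
    arms_seen.setdefault(w, set()).add(int(a))

both_arms = sum(len(s) == 2 for s in arms_seen.values())
# Continuous draws are (almost surely) all distinct, so each stratum holds a
# single observation: no stratum contains both arms, and the mean outcome
# under the *other* treatment within a stratum is simply not estimable.
```

With a million times more observations the picture does not change; only smoothing across strata (regression, i.e., super learning) makes the parameter estimable again.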

Targeted Learning was developed in response to high-dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large, semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing not to be restricted by an a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in the causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state-of-the-art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super learners and TMLE.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way; and the best possible way is not to give up on theoretical advances, but to insist that the theory be relevant to the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and on the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1-20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1-32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat/.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat/.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113-156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97-128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545-597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351-371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373-395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465-480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688-701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9-12, pp. 1393-1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9-12, pp. 923-945, 1987.

[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S-161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103-113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096-1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574-596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39-64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099-1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee, et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152-172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural

modelrdquo International Journal of Biostatistics vol 983094 no 983090 983090983088983089983088[983092983096] H Wang S Rose and M J van der Laan ldquoFinding quantitative

trait loci genes with collaborative targeted maximum likelihoodlearningrdquo Statistics amp Probability Letters vol 983096983089 no 983095 pp 983095983097983090ndash983095983097983094 983090983088983089983089

[983092983097] I D Munoz and M J van der Laan ldquoSuper learner basedconditional density estimation with application to marginalstructural modelsrdquo International Journal of Biostatistics vol 983095no 983089 article 983091983096 983090983088983089983089

[983093983088] I D Munoz and M van der Laan ldquoPopulation interventioncausal effects based on stochastic interventionsrdquo Biometrics vol983094983096 no 983090 pp 983093983092983089ndash983093983092983097 983090983088983089983090

[983093983089] I Diaz and M J van der Laan ldquoSensitivity analysis or causalinerence under unmeasured conounding and measurement

error problemsrdquo International Journal of Biostatistics vol 983097 no983090 pp 983089983092983097ndash983089983094983088 983090983088983089983091

[983093983090] I Diaz and M J van der Laan ldquoAssessing the causal effect o policies an example using stochastic interventionsrdquo Interna-tional Journal of Biostatistics vol 983097 no 983090 pp 983089983094983089ndash983089983095983092 983090983088983089983091

[983093983091] I Diaz and J Mark van der Laan ldquoargeted data adaptiveestimation o the causal dosemdashresponse curverdquo Journal of Causal Inference vol 983089 no 983090 pp 983089983095983089ndash983089983097983090 983090983088983089983091

[983093983092] O M Stitelman and M J van der Laan ldquoargeted maximumlikelihood estimation o effect modi1047297cation parameters insurvival analysisrdquo Te International Journal of Biostatistics vol983095 no 983089 article 983089983097 983090983088983089983089

[983093983093] M J vander Laan ldquoargeted maximum likelihood based causal

inerence PartIrdquo International Journalof Biostatistics vol 983094no983090 Art pages 983090983088983089983088

[983093983094] O M Stitelman and M J van der Laan ldquoargeted maximumlikelihood estimation o time-to-event parameters with time-dependent covariatesrdquo ech Rep Division o BiostatisticsUniversity o Caliornia Berkeley Cali USA 983090983088983089983089

[983093983095] M Schnitzer E Moodie M J van der Laan R Platt and MKlei ldquoModeling theimpact o hepatitis C viral clearanceon end-stage liver disease in an HIV co-inected cohort with argetedMaximum Likelihood Estimationrdquo Biometrics vol983095983088 no 983089pp983089983092983092ndash983089983093983090 983090983088983089983092

[983093983096] S Gruber and M J van der Laan ldquoargeted minimum lossbased estimator that outperorms a given estimatorrdquo Te Inter-national Journal of Biostatistics vol 983096 article 983089983089 no 983089 983090983088983089983090


Advances in Statistics

[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


This proves that indeed these proposed paths through $\bar{Q}$ and $Q_W$ and corresponding loss functions span the efficient influence curve $D^*(Q) = D^*_Y(Q) + D^*_W(Q)$ at $Q$, as required.

The dimension of $\epsilon$ can be selected to be equal to the dimension of the target parameter $\psi_0$, but by creating extra components in $\epsilon$ one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an $\epsilon_1$ for the path through $\bar{Q}$ and a separate $\epsilon_2$ for the path through $Q_W$. In this case, the TMLE update $Q_n^*$ will solve two score equations, $P_n D^*_W(Q_n^*) = 0$ and $P_n D^*_Y(Q_n^*) = 0$, and thus, in particular, $P_n D^*(Q_n^*) = 0$. In this example, the main benefit of using a bivariate $\epsilon = (\epsilon_1, \epsilon_2)$ is that the TMLE does not update $Q_{W,n}$ (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter $\epsilon$ of this path by minimizing the empirical risk $\epsilon \rightarrow P_n L(\bar{Q}_n(\epsilon))$ along this path through the super-learner, resulting in an estimator $\epsilon_n$. This now defines an update of the super-learner fit, $\bar{Q}_n^1 = \bar{Q}_n(\epsilon_n)$. This updating process is iterated until $\epsilon_n \approx 0$. We denote the final update by $\bar{Q}_n^*$, the TMLE of $\bar{Q}_0$, and the target parameter mapping applied to $Q_n^*$ defines the TMLE of the target parameter $\psi_0$. This TMLE $Q_n^*$ solves the efficient influence curve equation $\sum_{i=1}^{n} D^*(Q_n^*)(O_i) = 0$, providing the basis, in combination with the statistical properties of $(\bar{Q}_n^*, g_n)$, for establishing that the TMLE $\Psi(Q_n^*)$ is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have $\epsilon_{1,n} = \arg\min_{\epsilon} P_n L(\bar{Q}_n^0(\epsilon))$, while $\epsilon_{2,n} = \arg\min_{\epsilon_2} P_n L(Q_{W,n}(\epsilon_2))$ equals zero. That is, the TMLE does not update $Q_{W,n}$, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case, $\bar{Q}_n^* = \bar{Q}_n^1$, since the convergence of the TMLE algorithm occurs in one step, and of course $Q_{W,n}^* = Q_{W,n}$ is just the initial empirical distribution function of $W_1, \ldots, W_n$. The TMLE of $\psi_0$ is the substitution estimator $\Psi(Q_n^*)$.
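The running-example TMLE just described (a logistic fluctuation of an initial fit of $\bar{Q}_0$ along a least favorable submodel indexed by a "clever covariate," followed by the substitution estimator and influence-curve-based inference) can be sketched in a few lines. This is a minimal illustration under assumed inputs, not the authors' software: the constant initial estimator and function names are hypothetical, and a real analysis would use super-learning for the initial fits.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_ate(Y, A, Qbar, g):
    """TMLE sketch for psi0 = E[Qbar0(1, W) - Qbar0(0, W)] with binary Y.

    Qbar : dict with initial estimates of E(Y | A, W) at the observed A
           (key 'A') and at the set values A = 1 ('1') and A = 0 ('0').
    g    : array of estimates of g0(W) = P(A = 1 | W)."""
    n = len(Y)
    # Clever covariate spanning the efficient influence curve.
    H_A = A / g - (1 - A) / (1 - g)
    H1, H0 = 1.0 / g, -1.0 / (1 - g)
    logitQ = {k: np.log(v / (1 - v)) for k, v in Qbar.items()}
    # Fit the fluctuation parameter eps of the least favorable submodel
    # logit Qbar(eps) = logit Qbar + eps * H by Newton-Raphson on the
    # logistic likelihood score.
    eps = 0.0
    for _ in range(100):
        p = expit(logitQ['A'] + eps * H_A)
        step = np.sum(H_A * (Y - p)) / np.sum(H_A ** 2 * p * (1 - p))
        eps += step
        if abs(step) < 1e-10:
            break
    Q1 = expit(logitQ['1'] + eps * H1)
    Q0 = expit(logitQ['0'] + eps * H0)
    QA = expit(logitQ['A'] + eps * H_A)
    psi = np.mean(Q1 - Q0)                   # substitution estimator
    ic = H_A * (Y - QA) + (Q1 - Q0) - psi    # efficient influence curve
    se = np.sqrt(np.var(ic) / n)
    return psi, (psi - 1.96 * se, psi + 1.96 * se)
```

Because the score equation in $\epsilon$ is solved exactly, the estimator solves the efficient influence curve equation, so with a consistent $g_n$ it remains consistent even under a badly misspecified initial $\bar{Q}_n$ (here, a constant).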

5. Advances in Targeted Learning

As is apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter of a smaller $Q_0$, defining a path and loss function with generalized score that spans the efficient influence curve, and running the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that $O_1, \ldots, O_n \sim_{iid} P_0$. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding; see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73]; we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general, comprehensive book on Targeted Learning, which includes many of these applications of TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure $O = (L(0), A(0), \ldots, L(K), A(K), Y)$, where $L(0)$ are baseline covariates, $L(k)$ are time-dependent covariates realized between intervention nodes $A(k-1)$ and $A(k)$, and $Y$ is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always "no censoring," since the outcome $Y$ is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so-called sequential randomization assumption, this target quantity is identified by the so-called G-computation formula for the postintervention distribution corresponding with a stochastic intervention $g^*$:

$$p_0^{g^*}(o) = \prod_{k=0}^{K+1} p_{0,L(k) \mid \bar{L}(k-1), \bar{A}(k-1)}\bigl(l(k) \mid \bar{l}(k-1), \bar{a}(k-1)\bigr) \prod_{k=0}^{K} g^*\bigl(a(k) \mid \bar{a}(k-1), \bar{l}(k)\bigr). \tag{19}$$

Note that this postintervention distribution is nothing else but the actual distribution of $O$, factorized according to the time-ordering, but with the true conditional distributions of $A(k)$, given its parents $(\bar{A}(k-1), \bar{L}(k))$, replaced by the desired stochastic intervention. The statistical target parameter is thus $E_{P_0^{g^*}} Y$, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust, efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user-supplied class of stochastic interventions $g^*$. Such robust, efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time-to-event outcomes, and incorporating right-censoring.
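For intuition, the G-computation formula (19) can be evaluated directly on a small discrete toy distribution: multiply the conditional distributions of the covariates in time order and replace each treatment factor by the intervention. The distribution below is invented purely for illustration (it is not from the paper), with binary $L(0), L(1), Y$ and a static intervention setting both treatment nodes.

```python
def p_l0(l0):
    """P(L(0) = l0) for the baseline covariate."""
    return 0.4 if l0 == 1 else 0.6

def p_l1(l1, a0, l0):
    """P(L(1) = l1 | A(0) = a0, L(0) = l0)."""
    p1 = 0.3 + 0.3 * a0 + 0.2 * l0
    return p1 if l1 == 1 else 1.0 - p1

def p_y(a1, l1):
    """P(Y = 1 | A(1) = a1, L(1) = l1)."""
    return 0.2 + 0.4 * a1 + 0.2 * l1

def gcomp_mean(a0, a1):
    """E_{P^{g*}} Y under the static intervention (A(0), A(1)) = (a0, a1):
    formula (19) with the treatment factors replaced by point masses,
    summed over all covariate histories."""
    return sum(
        p_l0(l0) * p_l1(l1, a0, l0) * p_y(a1, l1)
        for l0 in (0, 1)
        for l1 in (0, 1)
    )
```

For example, `gcomp_mean(1, 1) - gcomp_mean(0, 0)` is the "always treat" versus "never treat" causal risk difference in this toy model; note that the observational treatment mechanism never enters the computation.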

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example,


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter $\Psi(P) = E\{E(Y \mid A = 1, W) - E(Y \mid A = 0, W)\}$ as in our running example, but with $A$ now being the SNP in question and $W$ being the variables one wants to control for, while $Y$ is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a

variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of $E(Y \mid A, W) - E(Y \mid A = 0, W)$ onto a linear model such as $\beta A$ and use $\beta$ as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family-wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference, taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that, if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
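The last step of this pipeline, turning a stacked vector of estimates and their influence curves into simultaneous confidence intervals, can be sketched as follows. This is a generic illustration under assumed inputs (a vector of estimates and an n-by-d matrix of estimated influence curves), not the authors' implementation; the joint cutoff is read off the estimated multivariate normal limit by simulation.

```python
import numpy as np

def simultaneous_cis(psi, ic, level=0.95, n_sim=50000, seed=0):
    """Simultaneous confidence intervals for a stacked vector of d
    estimates psi, given the n x d matrix ic of their estimated
    influence curves.  A single cutoff is the level-quantile of
    max_j |Z_j| under the estimated correlation of the estimates,
    which controls the family-wise error rate."""
    n, d = ic.shape
    cov = np.cov(ic, rowvar=False) / n     # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    corr = cov / np.outer(se, se)
    z = np.random.default_rng(seed).multivariate_normal(
        np.zeros(d), corr, size=n_sim)
    cutoff = np.quantile(np.abs(z).max(axis=1), level)
    return np.column_stack([psi - cutoff * se, psi + cutoff * se]), cutoff
```

For nearly independent estimates the cutoff approaches the Sidak-style value (about 2.39 for d = 3 at the 95% level), while for highly correlated estimates it shrinks back toward the marginal 1.96.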

Software has been developed in the form of general R packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real-world applications resulted in new challenges for the TMLE, and, in response to that, we have developed general TMLE with additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that $Y$ is the indicator of a rare event. In such cases it is often known that the probability of $Y = 1$, conditional on a treatment and covariate configuration, should not exceed a certain value $\delta > 0$; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that $\bar{Q}_0(A, W) < \delta$ for some known $\delta > 0$. A TMLE should now be based on an initial estimator $\bar{Q}_n$ satisfying this constraint, and the least favorable submodel $\bar{Q}_n(\epsilon)$ should also satisfy this constraint for each $\epsilon$, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.
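One way to build such a constrained submodel, in the spirit of (but not identical to) the construction in [80], is to fluctuate on the logit scale of $\bar{Q}/\delta$ rather than of $\bar{Q}$ itself; every update then stays inside $(0, \delta)$ by construction. A minimal sketch with hypothetical names:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fluctuate_bounded(Qbar, H, eps, delta):
    """Fluctuation submodel respecting the global constraint
    Qbar0(A, W) < delta: the fluctuation acts on logit(Qbar / delta),
    so delta * expit(...) lies in (0, delta) for every eps, and
    eps = 0 returns the initial estimator unchanged."""
    return delta * expit(logit(Qbar / delta) + eps * H)
```

At $\epsilon = 0$ the submodel passes through the initial estimator, and no value of $\epsilon$ or of the clever covariate $H$ can push the update past the bound.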

Targeted Estimation of Nuisance Parameter $g_0$ in TMLE. Even though an asymptotically consistent estimator of $g_0$ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating $g_0$, but also with respect to how well the resulting TMLE fits $\psi_0$. Consider our running example. Suppose that among the components of $W$ there is a $W_j$ that is an almost perfect predictor of $A$ but has no effect on the outcome $Y$. Inclusion of such a covariate $W_j$ in the fit of $g_n$ makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for $W_j$, but in most finite samples adjustment for $W_j$ in $g_n$ will hurt the practical performance of TMLE, and effort should be put in variables that are stronger confounders than $W_j$. We developed a method for building an estimator $g_n$ that uses as criterion the change in fit between the initial estimator of $\psi_0$ and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits $g_0$ in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).


Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that $\bar{Q}_n$ and $g_n$ are not too erratic functions of $(A, W)$. This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the $\epsilon$ that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the fit of $\epsilon$ that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE (CV-TMLE), and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator $\bar{Q}_n$ is inconsistent, but $g_n$ is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is no longer efficient. The desire for estimators with a guarantee to beat certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework, by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true $\bar{Q}_0$, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of $\psi_0$. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either $\bar{Q}_n$ or $g_n$ is consistent. However, if one uses a data adaptive consistent estimator of $g_0$ (and thus with bias larger than $1/\sqrt{n}$) and $\bar{Q}_n$ is inconsistent, then the bias of $g_n$ might directly map into a bias for the resulting TMLE of $\psi_0$ of the same order. As a consequence, the TMLE might have a bias with respect to $\psi_0$ that is larger than $O(1/\sqrt{n})$, so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating $g_n$) to guarantee that the TMLE remains asymptotically linear, with known influence curve, when either $\bar{Q}_n$ or $g_n$ is inconsistent but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting $\bar{Q}_n$ but also targeting $g_n$, to guarantee that, when $\bar{Q}_n$ is misspecified, the required smooth function of $g_n$ will behave as a TMLE, and, if $g_n$ is misspecified, the required smooth functional of $\bar{Q}_n$ is still asymptotically linear. The same method was used to develop an IPTW estimator that targets $g_n$, so that the IPTW estimator is asymptotically linear with known influence curve, even when the initial estimator of $g_0$ is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that in our running example $A$ is continuous and we are concerned with estimation of the dose-response curve $(\psi_0(a) : a)$, where $\psi_0(a) = E_0 \bar{Q}_0(a, W)$. One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve $\psi_0$. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions indexed by a nuisance parameter can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
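In its simplest discrete form (standard prediction, known squared-error loss, no nuisance parameter), the cross-validation selector that super-learning builds on looks as follows; the candidate learners and data here are invented for illustration, and the CV-TMLE refinement discussed above replaces the plain empirical risk below with a targeted estimate of the conditional risk.

```python
import numpy as np

def discrete_super_learner(X, Y, learners, V=5, seed=0):
    """Cross-validation selector: compute each candidate's V-fold
    cross-validated squared-error risk, pick the minimizer, and refit
    it on the full sample.  Each learner maps (X, Y) to a predict
    function."""
    folds = np.random.default_rng(seed).integers(0, V, size=len(Y))
    cv_risk = np.zeros(len(learners))
    for v in range(V):
        train, test = folds != v, folds == v
        for j, fit in enumerate(learners):
            pred = fit(X[train], Y[train])
            cv_risk[j] += np.sum((Y[test] - pred(X[test])) ** 2)
    best = int(np.argmin(cv_risk))
    return learners[best](X, Y), best

def mean_learner(X, Y):
    """Candidate 0: predict the sample mean of Y everywhere."""
    m = Y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def linear_learner(X, Y):
    """Candidate 1: ordinary least squares with an intercept."""
    Xd = np.column_stack([np.ones(len(Y)), X])
    beta = np.linalg.lstsq(Xd, Y, rcond=None)[0]
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta
```

The oracle inequality cited above says that the selected learner's risk is, up to a constant and a log factor, as good as that of the best candidate; the full super-learner additionally takes a best weighted combination of the candidates rather than a single winner.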

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


Advances in Statistics

By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
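The common practice criticized here, estimating the asymptotic variance by the empirical variance of the estimated influence curves, can be illustrated with the simplest possible case, the sample mean, whose influence curve is IC(O) = O - psi; the normal data below are an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
O = rng.normal(loc=3.0, scale=2.0, size=500)   # illustrative observations

# Target parameter: psi = E[O]; its influence curve is IC(O) = O - psi.
psi_hat = O.mean()
ic = O - psi_hat                                # estimated influence curve values

# Common practice: asymptotic variance = empirical variance of the IC,
# standard error = sqrt(var(IC)/n), Wald-type 95% confidence interval.
n = len(O)
se = np.sqrt(np.mean(ic ** 2) / n)
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(psi_hat, ci)
```

For the mean this works fine; the text's point is that for sparse-data target parameters the estimated influence curve values can be huge, and this sample-variance recipe (which is not a substitution estimator) then underestimates the variance.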

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions, and stationarity assumptions that state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4-8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining the target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p-values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for allowing us to obtain inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.
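The sample-splitting remedy described above can be sketched as follows: one half of the data chooses a data-driven target (here, as a hypothetical example, the mean of whichever coordinate looks largest), and the independent other half estimates it with an ordinary confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 5
X = rng.normal(size=(n, d))
X[:, 2] += 1.0                              # coordinate 2 truly has the largest mean

# Split the sample: one half is used only to *choose* the target parameter...
half = n // 2
explore, confirm = X[:half], X[half:]
j = int(np.argmax(explore.mean(axis=0)))    # data-driven target: mean of coordinate j

# ...and the independent other half to *estimate* it, with a valid Wald CI.
est = confirm[:, j].mean()
se = confirm[:, j].std(ddof=1) / np.sqrt(half)
ci = (est - 1.96 * se, est + 1.96 * se)
print(j, est, ci)
```

The price the text mentions is visible here: only half of the observations contribute to the confidence interval, which is what motivates the CV-TMLE approach to data adaptive target parameters discussed next.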

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or other health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
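As a deliberately simplified, single-time-point illustration of an individualized treatment rule (not the super-learner-based estimator with inference that the text refers to), one can fit an outcome regression and treat exactly when the predicted outcome under treatment exceeds that under control; the linear model and simulated data are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W = rng.normal(size=n)                      # baseline covariate
A = rng.integers(0, 2, size=n)              # randomized binary treatment
# Truth: treatment helps when W > 0 and harms when W < 0.
Y = W + A * W + rng.normal(scale=0.5, size=n)

# Fit a linear outcome regression E[Y | A, W] with an A*W interaction.
design = np.c_[np.ones(n), A, W, A * W]
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)

def predicted_outcome(a, w):
    return beta[0] + beta[1] * a + beta[2] * w + beta[3] * a * w

def rule(w):
    """Treat exactly when the predicted outcome under A=1 exceeds that under A=0."""
    return int(predicted_outcome(1, w) > predicted_outcome(0, w))

# The fitted rule should approximately recover "treat when W > 0".
print(rule(1.5), rule(-1.5))
```

The target parameter motivated in the text is then the mean outcome one would obtain by applying such a fitted rule to the whole population, which is itself data adaptive since the rule was learned from the data.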

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order term) is o_P(1/√n). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and is suffering from lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLE that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
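The online-updating principle, refreshing an estimate from the new chunk alone, at a cost that does not depend on the data already seen, can be illustrated with a running mean and variance (this is the classical streaming-moments update, used here only as an analogy, not an online TMLE):

```python
import numpy as np

class OnlineMean:
    """Running estimate updated from new data chunks only: the cost of an
    update depends on the chunk size, never on the total data seen so far."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations, for the variance

    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        m, cm, cv = len(chunk), chunk.mean(), chunk.var()
        delta = cm - self.mean
        tot = self.n + m
        # Standard pairwise/online combination of (n, mean, M2) statistics.
        self.mean += delta * m / tot
        self.m2 += m * cv + delta ** 2 * self.n * m / tot
        self.n = tot

    def se(self):
        return np.sqrt(self.m2 / self.n / self.n)   # sd/sqrt(n) of the mean

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=10_000)

est = OnlineMean()
for chunk in np.array_split(data, 100):    # data arrives in 100 chunks
    est.update(chunk)

# The online estimate agrees with recomputing from scratch.
print(abs(est.mean - data.mean()) < 1e-10)
```

An online TMLE faces the same bookkeeping question for a much richer object: which sufficient summaries must be carried forward so that the targeted estimate and its inference can be refreshed from the new chunk alone.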

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89-91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of/approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors of usually unequal importance that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the metalanguage into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals: no inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques, for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit, but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage from The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings, Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which were soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, the classical notion of truth is obsolete, and pragmatic criteria such as predictive success in data analysis must prevail. Also, the idea currently frequently uttered in the data analytical tradition that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must yet be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with in themselves sometimes hybrid inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who did realize that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics built on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on the research on the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do

justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus, Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
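The first step of the two-step procedure described in this section, a cross-validated weighted combination of the fits in a library of learners, can be outlined as follows; the two library members are illustrative stand-ins, and the TMLE fluctuation step is only indicated in a closing comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
W = rng.normal(size=n)
Y = np.sin(W) + rng.normal(scale=0.3, size=n)   # illustrative nonlinear truth

# Library of candidate learners (illustrative: a linear fit and a crude binned fit).
def linear_learner(Wtr, Ytr, Wnew):
    b, *_ = np.linalg.lstsq(np.c_[np.ones(len(Wtr)), Wtr], Ytr, rcond=None)
    return b[0] + b[1] * Wnew

def binned_learner(Wtr, Ytr, Wnew, edges=np.linspace(-3, 3, 13)):
    idx_tr, idx_new = np.digitize(Wtr, edges), np.digitize(Wnew, edges)
    means = {i: Ytr[idx_tr == i].mean() for i in np.unique(idx_tr)}
    return np.array([means.get(i, Ytr.mean()) for i in idx_new])

learners = [linear_learner, binned_learner]

# Step 1: out-of-fold (cross-validated) predictions of every library member.
folds = np.array_split(rng.permutation(n), 5)
Z = np.zeros((n, len(learners)))
for val in folds:
    tr = np.setdiff1d(np.arange(n), val)
    for j, learn in enumerate(learners):
        Z[val, j] = learn(W[tr], Y[tr], W[val])

# Choose convex weights minimizing cross-validated squared error (grid search).
grid = np.linspace(0, 1, 101)
cv_mse = [np.mean((Y - (a * Z[:, 0] + (1 - a) * Z[:, 1])) ** 2) for a in grid]
alpha = grid[int(np.argmin(cv_mse))]       # weight on the linear learner
print(alpha)
# Step 2 (not shown) would fluctuate this initial fit along a parametric
# submodel, i.e., the TMLE update targeted at the parameter of interest,
# before computing standard errors from the influence curve.
```

The weighting replaces a subjective choice of a single technique with a data-driven convex combination, which is the sense in which SL "uses a sort of weighted sum of the values calculated by means of cross-validation."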

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be enough data so that careful design of studies and interpretation of data are not needed anymore. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation; and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
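The emptiness of strata that leaves the pure empirical plug-in estimator undefined can be checked directly; the dimensions below are far smaller than n = 10^12 but make the same point, and the data-generating mechanism is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
W = rng.integers(0, 2, size=(n, d))        # d binary covariates -> 2**d strata
A = rng.integers(0, 2, size=n)             # binary treatment
Y = A + W[:, 0] + rng.normal(size=n)

# The pure empirical ATE would average, over covariate strata, the difference
# in mean outcome between treated and untreated units within each stratum.
strata = {tuple(w) for w in W}
occupied = len(strata)
total = 2 ** d

# Count occupied strata where a within-stratum treated or untreated mean
# is undefined (no treated or no untreated units in the stratum).
undefined = 0
for s in strata:
    mask = (W == np.array(s)).all(axis=1)
    if not (A[mask] == 1).any() or not (A[mask] == 0).any():
        undefined += 1

print(occupied, total, undefined)
```

With only 10 binary covariates and 1000 observations, a large fraction of the occupied strata already lack one of the two treatment arms, so the stratified plug-in difference is undefined there; this is the curse of dimensionality that smoothing via super learning, followed by targeting, is meant to address.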

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing not to be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and state of the art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Advances in Statistics 17

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields such as computer science, statistics, probability theory, and scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way; the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and corresponding education of our next generations, and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments which improved the paper substantially. This research was supported by an NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multifold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.
[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter Ψ(P) = E[E(Y | A = 1, W)] − E[E(Y | A = 0, W)] of our running example, but with A now being the SNP in question and W being the variables one wants to control for, while Y is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a

variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of E(Y | A, W) − E(Y | A = 0, W) onto a linear model such as βA, and use β as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure, with stacked influence curve, and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired family wise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure based on this multivariate normal limit distribution. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so-called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
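The last step, simultaneous confidence intervals from the multivariate normal limit, can be sketched as follows. This is an illustration of the general recipe, not code from the authors' software: the influence-curve matrix and point estimates are simulated placeholders, and the simultaneous cutoff is obtained by simulating the maximum absolute standardized deviation under the estimated correlation structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20  # n observations, p variable-importance parameters

# Hypothetical stacked influence-curve matrix: column j holds the estimated
# influence curve of the j-th variable-importance TMLE evaluated at each
# observation (simulated here; in practice it comes from the TMLE fits).
ic = rng.normal(size=(n, p))
psi_hat = rng.normal(scale=0.1, size=p)   # stacked point estimates (placeholder)
se = ic.std(axis=0, ddof=1) / np.sqrt(n)  # influence-curve-based standard errors

# Estimate the correlation of the standardized estimator and simulate the
# max-|Z| statistic to obtain a cutoff for simultaneous 95% intervals.
corr = np.corrcoef(ic, rowvar=False)
z = rng.multivariate_normal(np.zeros(p), corr, size=100_000)
q = float(np.quantile(np.abs(z).max(axis=1), 0.95))

lower, upper = psi_hat - q * se, psi_hat + q * se
```

The simultaneous cutoff q exceeds the marginal 1.96, which is exactly the price of covering all p parameters at once while exploiting their correlation.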

Software has been developed in the form of general R packages implementing super-learning and TMLE for general longitudinal data structures; these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and in response to that we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that TMLE has the flexibility and capability to enhance the finite sample performance of TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse, even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that Y is the indicator of a rare event. In such cases it is often known that the probability of Y = 1, conditional on a treatment and covariate configuration, should not exceed a certain value δ > 0; for example, the marginal prevalence is known, and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that Q̄₀(A, W) = P(Y = 1 | A, W) < δ for some known δ > 0. A TMLE should now be based on an initial estimator satisfying this constraint, and the least favorable submodel Q̄ₙ(ε) should also satisfy this constraint for each ε, so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.
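The role of the constraint in the targeting step can be illustrated with a toy sketch (this is not the authors' implementation from [80]; δ, the initial estimates, and the clever covariate are all made up): fluctuating on the logit of Q̄/δ guarantees that every update stays strictly below the known bound, whatever the fluctuation parameter.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

delta = 0.05  # assumed known upper bound on P(Y = 1 | A, W)

def fluctuate_bounded(Q_init, H, eps):
    """One targeting step applied on the logit of Q/delta: for any value of
    the fluctuation parameter eps, the update stays inside (0, delta)."""
    Q_scaled = np.clip(Q_init / delta, 1e-6, 1 - 1e-6)
    return delta * expit(logit(Q_scaled) + eps * H)

rng = np.random.default_rng(1)
Q0 = rng.uniform(0.001, 0.04, size=1_000)  # initial estimates under the bound
H = rng.normal(size=1_000)                 # clever covariate (placeholder)

Q1 = fluctuate_bounded(Q0, H, eps=0.5)
```

A fluctuation on the usual logit of Q̄ itself would only keep the update in (0, 1); working on the Q̄/δ scale is what turns the known bound into a property of the submodel.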

Targeted Estimation of Nuisance Parameter g_0 in TMLE. Even though an asymptotically consistent estimator of g_0 yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating g_0 but also with respect to how well the resulting TMLE fits ψ_0. Consider our running example. Suppose that among the components of W there is a component W_j that is an almost perfect predictor of A but has no effect on the outcome Y. Inclusion of such a covariate W_j in the fit of g_0 makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for W_j, but in most finite samples adjustment for W_j in g_n will hurt the practical performance of the TMLE, and effort should be put in variables that are stronger confounders than W_j. We developed a method for building an estimator g_n that uses as criterion the change in fit between the initial estimator of Q̄_0 and the updated estimator (i.e., the TMLE), and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits g_0 in collaboration with the initial estimator [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).
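The selection criterion can be sketched in miniature: among candidate fits of g_0 (for example, fits adjusting for nested covariate sets), pick the one whose targeting step most increases the fit of the updated Q̄. This sketch uses the standard clever-covariate logistic fluctuation; the actual C-TMLE of the cited references adds cross-validation and a staged forward selection, and all function names here are ours.

```python
import numpy as np

def loglik(y, p):
    """Average Bernoulli log-likelihood, the fit criterion."""
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fluctuate(y, a, qbar, g, steps=50):
    """Logistic fluctuation of qbar with clever covariate
    h = a/g - (1-a)/(1-g); the coefficient eps is fit by Newton
    iterations on the log-likelihood, with logit(qbar) as offset."""
    h = a / g - (1 - a) / (1 - g)
    offset = np.log(qbar / (1 - qbar))
    eps = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(offset + eps * h)))
        info = np.sum(h ** 2 * p * (1 - p))
        eps += np.sum(h * (y - p)) / max(info, 1e-12)
    return 1.0 / (1.0 + np.exp(-(offset + eps * h)))

def collaborative_choice(y, a, qbar, candidate_gs):
    """C-TMLE criterion in miniature: choose the candidate estimator
    of g0 whose updating step yields the largest increase in fit of
    the updated qbar."""
    return int(np.argmax([loglik(y, fluctuate(y, a, qbar, g))
                          for g in candidate_gs]))
```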


Advances in Statistics

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example, it requires that Q̄_n and g_n are not too erratic functions of (A, W). This condition is not just theoretical, but one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the ε that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.
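The honest-fit idea can be sketched as a grid search: score each candidate fluctuation parameter by the cross-validated empirical mean of the loss rather than by in-sample fit. This is a simplification under our own naming; in the actual CV-TMLE the initial estimator (here fixed in the offset) is refit on each training sample, and ε solves a cross-validated score equation rather than a grid search.

```python
import numpy as np

def cv_select_epsilon(y, h, offset, folds, eps_grid):
    """Pick the fluctuation parameter eps that minimizes the
    cross-validated empirical mean of the negative log-likelihood
    loss of the updated estimator (offset = logit of the initial
    estimate, h = clever covariate, folds = validation index sets)."""
    def val_loss(eps, idx):
        p = 1.0 / (1.0 + np.exp(-(offset[idx] + eps * h[idx])))
        p = np.clip(p, 1e-10, 1 - 1e-10)
        return -np.mean(y[idx] * np.log(p) + (1 - y[idx]) * np.log(1 - p))
    cv = [np.mean([val_loss(e, idx) for idx in folds]) for e in eps_grid]
    return float(eps_grid[int(np.argmin(cv))])
```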

Guaranteed Minimal Performance of TMLE. If the initial estimator Q̄_n is inconsistent, but g_n is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either Q̄_n or g_n is consistent. However, if one uses a data adaptive consistent estimator of g_0 (and thus with bias larger than 1/√n), and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ_0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLEs that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLEs now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets g_n, so that the IPW estimator is asymptotically linear with known influence curve, even when the initial estimator of g_0 is estimated with a highly data adaptive estimator.
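The property being preserved here can be written out. In a standard rendering (our notation, with D* the efficient influence curve of the target parameter), an estimator ψ_n is asymptotically linear when

```latex
\psi_n - \psi_0 \;=\; \frac{1}{n}\sum_{i=1}^{n} D^{*}(O_i) \;+\; o_P\!\left(n^{-1/2}\right),
```

so that √n(ψ_n − ψ_0) is asymptotically normal with variance Var D*(O), and a Wald-type 95% interval is ψ_n ± 1.96 σ_n/√n, with σ_n² the sample variance of the estimated influence curve values. The additional fluctuations described above are there precisely to preserve this expansion when one of the two nuisance estimators is misspecified.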

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities of the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that, in our running example, A is continuous and we are concerned with estimation of the dose-response curve (ψ_0(a) : a), where ψ_0(a) = E_0 Q̄_0(a, W). One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve ψ_0. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
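For contrast with the unknown-risk case above, the baseline discrete super-learner with a known loss can be sketched in a few lines: fit each candidate on training folds, score it by the cross-validated empirical risk, and keep the minimizer. Names and the squared-error loss are our choices for illustration; the section's point is that for parameters like a dose-response curve this known-loss recipe breaks down and CV-TMLE of the risk is used instead.

```python
import numpy as np

def cv_risk(candidate_fit, x, y, n_folds=5):
    """Cross-validated risk of a candidate estimator: fit on each
    training sample, average squared-error loss over the matching
    validation samples."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    risks = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        predict = candidate_fit(x[train], y[train])   # returns a predictor
        risks.append(np.mean((y[val] - predict(x[val])) ** 2))
    return float(np.mean(risks))

def discrete_super_learner(candidates, x, y):
    """Select the candidate with the smallest cross-validated risk."""
    risks = [cv_risk(c, x, y) for c in candidates]
    return candidates[int(np.argmin(risks))], risks
```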

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems.


By being honest in the formulation, typically new challenges come up that ask for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with an empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample mean type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that, in sparse data situations, standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
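The common practice criticized here can be sketched concretely for an IPW-style mean (our own toy rendering): estimate the asymptotic variance by the sample variance of the estimated influence curve values. When g(W) gets small, the influence curve values blow up, which is exactly the nonrobustness the paragraph describes.

```python
import numpy as np

def ic_variance_ci(y, a, g, psi_hat):
    """Standard influence-curve-based Wald interval for an IPW-style
    mean psi = E[A Y / g(W)]: the asymptotic variance is estimated by
    the sample variance of the estimated influence curve values.
    With small g (sparsity) the IC values are huge and this estimator
    becomes nonrobust, often giving overly optimistic intervals."""
    ic = a * y / g - psi_hat                  # estimated influence curve
    se = np.sqrt(np.var(ic, ddof=1) / len(y))
    return psi_hat - 1.96 * se, psi_hat + 1.96 * se
```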

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions stating that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining their choice of target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p-values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that, by enforcing it, we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
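The sample-splitting alternative mentioned above can be sketched in a toy setting (names and setup are ours, and this is the practice the CV-TMLE approach of [86] improves on): one half of the data picks the data driven target parameter, and only the other half is used to estimate it with a valid normal-approximation interval.

```python
import numpy as np

def split_sample_inference(groups, y, seed=0):
    """Sample-splitting inference for a data driven target parameter:
    the exploration half picks the parameter (the group with the
    larger mean), the confirmation half estimates it with a 95% CI.
    Half the sample size is sacrificed for looking at the data first."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    explore, confirm = idx[:half], idx[half:]
    # Step 1: data driven choice of target on the exploration half.
    chosen = max(set(groups), key=lambda g: y[explore][groups[explore] == g].mean())
    # Step 2: estimate the chosen group mean on the confirmation half only.
    yc = y[confirm][groups[confirm] == chosen]
    est = yc.mean()
    se = yc.std(ddof=1) / np.sqrt(len(yc))
    return chosen, est, (est - 1.96 * se, est + 1.96 * se)
```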

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Inference. Another key assumption the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order) term satisfies R_n = o_P(1/√n). For example, in our running example, this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
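In the running example, the assumption on the remainder takes the familiar product-of-rates form (our rendering, with ‖·‖ an L²(P₀)-type norm):

```latex
R_n = o_P\!\left(n^{-1/2}\right), \qquad \text{implied by} \qquad
\big\|\bar{Q}_n - \bar{Q}_0\big\| \cdot \big\|g_n - g_0\big\| = o_P\!\left(n^{-1/2}\right),
```

so it suffices that both nuisance estimators converge to their targets faster than n^{-1/4}. The higher order influence function program aims to relax precisely this requirement.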

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
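The computational idea can be sketched with a toy estimator (our own stand-in, not an online TMLE, which would update the initial estimator and targeting step incrementally): keep sufficient statistics so that a new chunk updates the estimate and its standard error in time proportional to the chunk size, never touching the old data.

```python
import numpy as np

class OnlineEstimator:
    """Online estimation of a mean and its standard error via
    sufficient statistics (n, sum, sum of squares): each update costs
    O(chunk) regardless of how much data has already been processed."""
    def __init__(self):
        self.n, self.sum, self.sumsq = 0, 0.0, 0.0

    def update(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        self.n += len(chunk)
        self.sum += chunk.sum()
        self.sumsq += (chunk ** 2).sum()

    def estimate(self):
        mean = self.sum / self.n
        var = (self.sumsq - self.n * mean ** 2) / (self.n - 1)
        return mean, np.sqrt(var / self.n)    # estimate and its SE
```

Feeding the same data in one chunk or many yields identical inference, which is the scalability property being asked of an online TMLE.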

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages, or elements, of the research process in statistical theory. According to TMLE/SL, all these elements should be related to, or defined in terms of, (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of, or approach to,


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose at the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way, by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p-values and naive use of parametric statistics did not do justice to specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of "the inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p-values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance

of confirmatory, classical statistics, but this looks, for the main part, like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are in many circumstances of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which were soon overshadowed by the "inferential" coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also, the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated

with inerential statistics at different stages o the researchprocess Secondly it could be argued that ukey did notso much undermine the revolution initiated by Galton andPearson but understood the ultimate consequences o it Itwas Galton who had shown that variation and change areintrinsic in nature and that we have to look or the deviant

the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage came slightly under pressure, hit by the successes of parametric Fisherian statistics built on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are raised to a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has a great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap compelling, if only for practical reasons.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually amplified, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view, it is a clear imperative that the model and the parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation
procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression

to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, super learning uses a weighted combination of the candidate estimators, with weights determined by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do

justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts of any concept of Data Science.
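The cross-validated weighting step at the heart of super learning can be sketched as follows. This is a deliberately minimal illustration, not the authors' implementation: it uses two hypothetical candidate learners (a constant mean predictor and ordinary least squares) and searches convex weights over a coarse grid, whereas real super-learner libraries hold many candidates and solve the weight optimization properly (e.g., by nonnegative least squares).

```python
import numpy as np

def fit_mean(X, y):
    # candidate 1: constant predictor (the sample mean)
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_ols(X, y):
    # candidate 2: ordinary least squares with an intercept
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def super_learner(X, y, learners, k=5, seed=0):
    """Convex, cross-validated combination of exactly two candidate learners."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    # "level-one" data: out-of-fold predictions of every candidate
    Z = np.zeros((n, len(learners)))
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        for j, fit in enumerate(learners):
            Z[fold, j] = fit(X[train], y[train])(X[fold])
    # pick convex weights minimizing the cross-validated squared error
    best_w, best_mse = None, np.inf
    for w0 in np.linspace(0.0, 1.0, 101):
        w = np.array([w0, 1.0 - w0])
        mse = np.mean((y - Z @ w) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    fits = [fit(X, y) for fit in learners]  # refit every candidate on all data
    return lambda Xnew: sum(w * f(Xnew) for w, f in zip(best_w, fits))
```

On data with a genuine linear signal the cross-validated weight concentrates on the OLS candidate; on pure-noise data it shifts toward the constant predictor, which is exactly the point of letting cross-validation, rather than human intervention, choose.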

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field, often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth

is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other

parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum

of treatment and covariates. Even with n = 10^12 observations,

most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and we will also need Targeted Learning for unbiased estimation and valid statistical inference.
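The empty-strata problem can be made concrete with a small simulation (our illustration, not from the original text): even at a scale far below the 10^12-observation thought experiment, most observed covariate strata fail to contain both treatment arms, so the stratum-specific means that the plug-in estimator needs are simply undefined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 8
W = rng.integers(0, 4, size=(n, d))  # 8 discrete covariates with 4 levels: 4**8 = 65,536 strata
A = rng.integers(0, 2, size=n)       # binary treatment

# The plug-in (substitution) ATE needs the empirical mean outcome under BOTH
# a = 0 and a = 1 within every observed covariate stratum w.
arms_seen = {}
for w, a in zip(map(tuple, W), A):
    arms_seen.setdefault(w, set()).add(int(a))

n_strata = len(arms_seen)
n_complete = sum(1 for arms in arms_seen.values() if arms == {0, 1})
print(f"{n_complete} of {n_strata} observed strata contain both treatment arms")
```

With these (hypothetical) settings only a small fraction of the observed strata contain both arms; with continuous covariates essentially none would, which is why smoothing across strata is unavoidable.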

Targeted Learning was developed in response to high-dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large, semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models,

and Targeted Learning.

The massive dimension of the data does make it appealing

to not necessarily be restricted by an a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection

of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e.,

the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state-of-the-art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.
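For i.i.d. data, the influence-curve route to inference mentioned above reduces to a simple Wald-type interval: estimate the influence curve IC(O_i) of the estimator and use sqrt(Var_n(IC)/n) as the standard error. The sketch below (our illustration, using the sample mean, whose influence curve is IC(O) = O − μ) shows the mechanics; for dependent data, as stressed above, this i.i.d. variance formula is exactly what breaks down.

```python
import numpy as np

def ic_wald_ci(ic_values, estimate, z=1.96):
    """95% Wald CI from estimated influence-curve values, assuming i.i.d. sampling."""
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values, ddof=1) / n)
    return estimate - z * se, estimate + z * se

# toy example: the sample mean; its estimated influence curve is IC(O) = O - mean
rng = np.random.default_rng(0)
O = rng.normal(loc=3.0, scale=2.0, size=400)
est = O.mean()
lo, hi = ic_wald_ci(O - est, est)
```

The same recipe applies to far more complex estimators, such as TMLE of an average treatment effect, once their (efficient) influence curve has been estimated; only the `ic_values` argument changes.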

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this so that

money can be spent in the best possible way, and the best possible way is not to give up on theoretical advances; rather, the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee, et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.
[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections, Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


Advances in Statistics

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so-called Donsker class condition. For example, in our running example it requires that the nuisance estimators of Q̄_0 and g_0 are not too erratic functions of (W, A). This condition is not just theoretical: one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since, if we use an overfitted initial estimator, there is little reason to think that the fluctuation ε that maximizes the fit of the update of the initial estimator along the least favorable parametric model will still do a good job. Instead, one should use the ε that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so-called cross-validated TMLE (CV-TMLE), and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.
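
The idea of scoring the fluctuation by an honest, cross-validated loss rather than by the in-sample fit can be sketched in a few lines. This is a toy illustration only: the data, the deliberately crude initial estimator, and the stand-in for the clever covariate are all invented, the fluctuation is linear, and the loss is squared error.

```python
import random
import statistics

random.seed(0)

# Toy data: outcome y depends linearly on a single covariate w.
data = []
for _ in range(200):
    w = random.uniform(0, 1)
    data.append((w, 2 * w + random.gauss(0, 0.3)))

def initial_fit(train):
    """Deliberately crude initial estimator: the training-sample mean of y."""
    mean_y = statistics.mean(y for _, y in train)
    return lambda w: mean_y

def clever_covariate(w):
    """Stand-in for the covariate spanning the least favorable submodel."""
    return w

def cv_risk(data, eps, folds=5):
    """Cross-validated mean squared loss of the eps-updated estimator."""
    fold_risks = []
    for v in range(folds):
        train = [d for i, d in enumerate(data) if i % folds != v]
        valid = [d for i, d in enumerate(data) if i % folds == v]
        q0 = initial_fit(train)
        fold_risks.append(statistics.mean(
            (y - (q0(w) + eps * clever_covariate(w))) ** 2 for w, y in valid))
    return statistics.mean(fold_risks)

# Select the fluctuation parameter by honest (cross-validated) risk,
# rather than by the fit on the same data used to build the estimator.
grid = [i / 10 for i in range(-10, 31)]
eps_cv = min(grid, key=lambda e: cv_risk(data, e))
```

In the actual CV-TMLE the update runs along the least favorable submodel through the initial fit and the loss is the one defining the targeted minimum loss based estimation; the sketch only shows the selection principle.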

Guaranteed Minimal Performance of TMLE. If the initial estimator of Q̄_0 is inconsistent but the estimator of the treatment mechanism g_0 is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is no longer efficient. The desire for estimators to have a guarantee of beating certain user-supplied estimators was formulated and implemented for double robust estimating-equation-based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user-supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true Q̄_0, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of ψ_0. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in Chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either the outcome-regression estimator Q̄_n or the treatment-mechanism estimator g_n is consistent. However, if one uses a data adaptive consistent estimator of g_0 (and thus with bias larger than 1/√n) and Q̄_n is inconsistent, then the bias of g_n might directly map into a bias for the resulting TMLE of ψ_0 of the same order. As a consequence, the TMLE might have a bias with respect to ψ_0 that is larger than O(1/√n), so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating g_n) to guarantee that the TMLE remains asymptotically linear with known influence curve when either Q̄_n or g_n is inconsistent but we do not know which one [84]. So these enhancements of TMLE result in TMLEs that are asymptotically linear under weaker conditions than a standard TMLE, just as the CV-TMLE removed a condition for asymptotic linearity. These TMLEs now involve not only targeting Q̄_n but also targeting g_n, to guarantee that, when Q̄_n is misspecified, the required smooth function of g_n will behave as a TMLE, and, if g_n is misspecified, the required smooth functional of Q̄_n is still asymptotically linear. The same method was used to develop an IPW estimator that targets g_n, so that the IPW estimator is asymptotically linear with known influence curve even when the initial estimator of g_0 is estimated with a highly data adaptive estimator.
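
The double robustness this paragraph relies on can be seen in a small simulation with the standard augmented IPW estimating-equation estimator of EY_1. This is a hedged sketch, not the TMLE itself; the data-generating design and the deliberately misspecified outcome regression are invented for illustration.

```python
import random
import statistics

random.seed(5)

def draw():
    """Toy observational unit: covariate w, treatment a, outcome y."""
    w = random.uniform(0, 1)
    g1 = 0.2 + 0.6 * w                  # true treatment mechanism P(A=1 | W=w)
    a = 1 if random.random() < g1 else 0
    y = (1.0 if a == 1 else 0.3) + random.gauss(0, 0.2)
    return w, a, y, g1

data = [draw() for _ in range(5000)]

# Augmented IPW estimator of E[Y_1] with a deliberately misspecified outcome
# regression (Qbar = 0 everywhere) but the true treatment mechanism g.
def qbar(w):
    return 0.0

psi = statistics.mean(qbar(w) + (a / g1) * (y - qbar(w))
                      for w, a, y, g1 in data)
# psi stays close to the true value E[Y_1] = 1.0 in this design despite the
# wrong Qbar, because the other nuisance parameter (g) is correct.
```

Swapping which nuisance is misspecified gives the mirror-image conclusion; the paragraph's point is that when the misspecified one is estimated very adaptively, extra targeting is needed to keep valid inference.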

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. Super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities for the cross-validation selector assumed that the cross-validated risk is simply an empirical mean, over the validation sample, of a loss function at the candidate estimator based on the training sample, averaged across different sample splits; we generalized these results to loss functions that depend on an unknown nuisance parameter (which thus has to be estimated in the cross-validated risk).

For example, suppose that in our running example the treatment A is continuous and we are concerned with estimation of the dose-response curve a ↦ E_0 Y_a, where E_0 Y_a = E_0 Q̄_0(a, W). One might define the risk of a candidate dose-response curve as its mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can now develop a super-learner that uses CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous-valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
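
For the simple case that the original oracle inequalities cover (a known squared-error loss, no nuisance parameter), the cross-validation selector behind super-learning reduces to picking the candidate with the smallest cross-validated risk. A minimal sketch with two invented candidate learners:

```python
import random
import statistics

random.seed(1)
data = []
for _ in range(200):
    x = random.uniform(-1, 1)
    data.append((x, 3 * x + random.gauss(0, 0.5)))

def fit_mean(train):
    """Candidate 1: constant predictor (the training mean)."""
    m = statistics.mean(y for _, y in train)
    return lambda x: m

def fit_linear(train):
    """Candidate 2: least squares line."""
    xs = [x for x, _ in train]
    mx = statistics.mean(xs)
    my = statistics.mean(y for _, y in train)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "linear": fit_linear}

def cv_risk(fitter, data, folds=5):
    """Empirical mean of squared-error loss over validation folds."""
    risks = []
    for v in range(folds):
        train = [d for i, d in enumerate(data) if i % folds != v]
        valid = [d for i, d in enumerate(data) if i % folds == v]
        f = fitter(train)
        risks.append(statistics.mean((y - f(x)) ** 2 for x, y in valid))
    return statistics.mean(risks)

# The cross-validation selector: the candidate with the smallest CV risk.
best_name = min(CANDIDATES, key=lambda k: cv_risk(CANDIDATES[k], data))
```

The paragraph's contribution is precisely the harder case in which `cv_risk` is not a simple empirical mean of a known loss but must itself be estimated with a CV-TMLE.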

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can now use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle/adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real-world problems.


By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data-generating experiment and the questions of interest, for possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and for input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with the empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
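
The common practice the paragraph critiques, estimating the asymptotic variance by the empirical variance of the estimated influence curve values, looks like this in the simplest case: the sample mean, whose influence curve at an observation o is IC(o) = o − ψ. The numbers are purely illustrative.

```python
import math
import random
import statistics

random.seed(2)

obs = [random.gauss(5.0, 2.0) for _ in range(400)]

psi_n = statistics.mean(obs)              # estimator of the target parameter
ic = [o - psi_n for o in obs]             # estimated influence curve values
var_ic = statistics.mean(v * v for v in ic)
se = math.sqrt(var_ic / len(obs))         # standard error = sd(IC) / sqrt(n)

# Wald-type 95% confidence interval based on the normal limit distribution.
ci = (psi_n - 1.96 * se, psi_n + 1.96 * se)
```

In sparse-data situations the estimated influence curve values can be huge, and this plain sample variance is not a substitution estimator, which is exactly what motivates a TMLE of the asymptotic variance itself.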

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions and stationarity assumptions stating that conditional distributions might be constant across time, or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at the data before determining the target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.
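
The classical sample-splitting remedy described above can be made concrete in a few lines. The two-group setup is invented: the first half of each group's data selects the parameter to report, and only the held-out half delivers the estimate and a valid confidence interval.

```python
import math
import random
import statistics

random.seed(3)

# Two candidate effects; the data decide which one becomes the target.
groups = {"A": [random.gauss(0.2, 1) for _ in range(300)],
          "B": [random.gauss(0.8, 1) for _ in range(300)]}

# First half of each group: choose the (data adaptive) target parameter.
chooser = {k: statistics.mean(v[:150]) for k, v in groups.items()}
target = max(chooser, key=chooser.get)

# Held-out half: estimate the chosen parameter with an honest 95% CI.
holdout = groups[target][150:]
est = statistics.mean(holdout)
se = statistics.stdev(holdout) / math.sqrt(len(holdout))
ci = (est - 1.96 * se, est + 1.96 * se)
```

The price is visible: only half the data supports the confidence interval, which is the sample-size sacrifice the paragraph mentions and which the CV-TMLE approach of [86] is designed to avoid.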

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference, in terms of confidence intervals, for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications that would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule, or dynamic treatment regimen, and an optimal treatment rule is defined as the rule that minimizes the mean of a certain outcome (e.g., an indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we address this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and that would heavily affect the true optimality of the fitted rules.
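
A plug-in version of the simplest case (one binary baseline covariate, a point treatment) illustrates what "the rule that optimizes the mean outcome" means. The data-generating design is invented, and no statistical inference is attempted here.

```python
import random
import statistics

random.seed(4)

def draw():
    """Toy unit: binary covariate v, binary treatment a, outcome y.
    The true optimal rule in this design: treat (a=1) when v=1, not when v=0."""
    v = random.randint(0, 1)
    a = random.randint(0, 1)
    effect = 1.0 if (a == 1 and v == 1) else (0.5 if (a == 0 and v == 0) else 0.0)
    return v, a, effect + random.gauss(0, 0.2)

data = [draw() for _ in range(2000)]

def mean_outcome(v, a):
    """Estimated conditional mean outcome E[Y | V=v, A=a] (stratum average)."""
    return statistics.mean(y for (vv, aa, y) in data if vv == v and aa == a)

# Plug-in estimate of the rule: per stratum, pick the treatment with the
# larger estimated conditional mean outcome.
rule = {v: max((0, 1), key=lambda a: mean_outcome(v, a)) for v in (0, 1)}
```

The target parameter discussed in the text is then the counterfactual mean outcome when everyone is treated according to `rule`, which itself needs targeted estimation and confidence intervals.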

Statistical Inference Based on Higher Order Inference. Another key assumption that the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second order term) R_n = o_P(1/√n). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/√n. The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploit underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
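
For the treatment-specific mean EY_1 of the running example, this second order remainder has the familiar double robust product form (standard TMLE notation, stated here as a sketch):

```latex
R_n = E_0\!\left[\frac{g_0(1\mid W) - g_n(1\mid W)}{g_n(1\mid W)}
      \bigl(\bar{Q}_0(1,W) - \bar{Q}_n(1,W)\bigr)\right].
```

By the Cauchy-Schwarz inequality, |R_n| is bounded by a constant times the product of the L^2(P_0) norms of g_n − g_0 and Q̄_n − Q̄_0, so R_n = o_P(1/√n) holds whenever this product of rates vanishes faster than n^{−1/2}, for example, when both nuisance estimators converge faster than n^{−1/4}.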

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know whether the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started doing research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for an update is only a function of the size of the new chunk of data.
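
The computational goal, updating an estimate from a new chunk without revisiting old data, can be illustrated for the simplest possible estimator: a running mean with a Welford-style variance accumulator. This is only an analogy for the online TMLE research described, not the method itself.

```python
import math

class OnlineMean:
    """Running estimate of a mean and its standard error, updated per chunk.

    Illustrates the online requirement: the cost of an update depends only on
    the size of the new chunk, never on all data seen so far.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations (Welford)

    def update(self, chunk):
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def standard_error(self):
        return math.sqrt(self.m2 / (self.n - 1) / self.n)

est = OnlineMean()
est.update([1.0, 2.0, 3.0])
est.update([4.0, 5.0])       # only the new chunk is touched
```

An online TMLE must achieve the same update pattern for a targeted substitution estimator while preserving its asymptotic linearity, which is the open trade-off named in the heading.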

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graph and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation, in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way, by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the metalanguage into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers, and even advanced studies, which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to the field. Although criticism that a mere chasing of low p values and a naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely any formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals: no inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; like a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success must prevail in data analysis. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with inferential methods that are in themselves sometimes hybrid. In the current empirical methodology, EDA is integrated

with inerential statistics at different stages o the researchprocess Secondly it could be argued that ukey did notso much undermine the revolution initiated by Galton andPearson but understood the ultimate consequences o it Itwas Galton who had shown that variation and change areintrinsic in nature and that we have to look or the deviant

the special or the peculiar It was Pearson who did realize thatthe constraints o the normal distribution (Laplace Quetelet)hadto be abandoned and who distinguished differentamilieso distributions as an alternative Galtonrsquos heritage was justslightly under pressure hit by the successes o the parametricFisherian statistics on strong model assumptions and it couldwell be stated that this was partially reinstated by ukey

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis, and in addition both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented problem at hand, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and it has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap compelling, if only for practical reasons.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In all these reflections the contradiction sketched briefly above often emerges, and in the popular literature the differences are usually amplified, leading to the annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered a property of the as yet unknown true data-generating distribution. From a methodological point of view, it is a clear imperative that the model and the parameter of interest be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation

Advances in Statistics

procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of their predictions, with the weights calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
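The first (super learning) stage just described can be miniaturized to show its mechanics. The sketch below is our own toy illustration, not the algorithm of the cited super learner literature: it uses only two candidate learners and a grid search over convex weights, whereas the actual super learner works with a large library and a constrained regression of the outcome on the cross-validated predictions. All names and dimensions here are hypothetical.

```python
import numpy as np

def cv_predictions(X, y, learners, k=5, seed=0):
    """Out-of-fold predictions of each candidate learner (V-fold cross-validation)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    Z = np.zeros((n, len(learners)))
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all indices outside this fold
        for j, learn in enumerate(learners):
            Z[fold, j] = learn(X[train], y[train])(X[fold])
    return Z

def mean_learner(X, y):
    """Candidate 1: predict the overall mean, ignoring covariates."""
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def ols_learner(X, y):
    """Candidate 2: ordinary least squares with an intercept."""
    design = lambda A: np.column_stack([np.ones(len(A)), A])
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xnew: design(Xnew) @ beta

def super_learn(X, y, learners, k=5):
    """Pick the convex combination of two learners minimizing CV squared error."""
    Z = cv_predictions(X, y, learners, k)
    grid = np.linspace(0.0, 1.0, 101)
    risks = [np.mean((y - (a * Z[:, 0] + (1 - a) * Z[:, 1])) ** 2) for a in grid]
    a = grid[int(np.argmin(risks))]              # weight on the first learner
    fits = [learn(X, y) for learn in learners]   # refit both on all data
    return (lambda Xnew: a * fits[0](Xnew) + (1 - a) * fits[1](Xnew)), a

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(size=200)         # linear truth: OLS should dominate
predict, w_mean = super_learn(X, y, [mean_learner, ols_learner])
print(w_mean)  # cross-validation puts almost no weight on the uninformative mean learner
```

Because the outcome is truly linear in the covariate, cross-validation assigns nearly all weight to the OLS candidate; with a nonlinear truth the weights would shift, which is exactly the point of letting the data choose.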

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and we will also need Targeted Learning for unbiased estimation and valid statistical inference.
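The emptiness of the strata is easy to check numerically. The snippet below is our own back-of-the-envelope illustration, with hypothetical dimensions not taken from the paper: a small simulation counts the occupied strata, and the last lines repeat the n = 10^12 arithmetic from the text.

```python
import numpy as np

# With d binary covariates plus a binary treatment there are 2**(d + 1)
# strata; the plug-in (empirical substitution) estimator of the
# stratum-specific mean outcome is undefined on every empty stratum.
rng = np.random.default_rng(0)
d, n = 20, 100_000                          # hypothetical dimensions
W = rng.integers(0, 2, size=(n, d + 1))     # covariates plus treatment indicator
occupied = len(np.unique(W, axis=0))        # strata actually seen in the sample
total = 2 ** (d + 1)
frac_occupied = occupied / total
print(frac_occupied)                        # only a few percent of strata occupied

# The arithmetic behind the claim in the text: even n = 10**12 observations
# can occupy at most n of the 2**51 strata generated by 50 binary covariates.
bound = min(10**12 / 2**51, 1.0)
print(bound)                                # under 0.05% of strata can be nonempty
```

With only 20 binary covariates and a hundred thousand observations, over 95% of strata are already empty; adding covariates makes the fraction of occupied strata vanish exponentially fast, which is why plug-in estimation without smoothing breaks down.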

Targeted Learning was developed in response to high-dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, for target parameters defined as features of the data distribution instead of as coefficients in these parametric models, and thus for Targeted Learning.

The massive dimension of the data does make it appealing not to be restricted by an a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of building large databases that collect data on total populations is that the data might correspond with observing a single process, such as a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as a random sample from some target population, since we sample all units of the target population. In these cases, it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state-of-the-art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way; the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working Paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working Paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and J. Mark van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (in Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (in Dutch).



Advances in Statistics

By being honest in the formulation, typically new challenges come up, asking for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data experiment, the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with the empirical sample variance of the estimated influence curves. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse-data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and naturally occurs even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
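As a concrete illustration of the standard (nonrobust) practice this paragraph criticizes, here is a minimal sketch of influence-curve-based inference: the asymptotic variance is estimated by the empirical sample variance of the estimated influence curve values, and a Wald-type confidence interval follows. The function name and the toy influence curve (that of a simple sample mean) are illustrative, not code from this paper.

```python
import numpy as np

def wald_ci(psi_hat, ic_values, z=1.96):
    """Wald-type 95% CI from estimated influence-curve values.

    This is the 'empirical sample variance of the estimated influence
    curves' estimator discussed in the text: fine in regular settings,
    nonrobust under sparsity, where a few huge IC values dominate.
    """
    ic_values = np.asarray(ic_values, dtype=float)
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values, ddof=1) / n)  # sample-variance estimator
    return psi_hat - z * se, psi_hat + z * se

# Toy example: for the sample mean, the influence curve is IC(Y) = Y - psi.
y = np.array([1.0, 2.0, 3.0, 4.0])
psi_hat = y.mean()
lo, hi = wald_ci(psi_hat, y - psi_hat)
```

Under sparsity (rare outcomes, near-positivity violations producing huge inverse weights), a few extreme IC values make `se` unstable, which is why the text proposes a substitution (TMLE-based) estimator of this variance instead.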

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then there is naturally no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments: the next experiment is only defined once one knows the data generated by the past experiments.

Therefore, we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes into many factors due to conditional independence assumptions and stationarity assumptions, which state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [4–8].

Data Adaptive Target Parameters. It is common practice that people first look at data before determining the target parameter they want to learn, even though it is taught that this is unacceptable practice, since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample: use one part of the sample to generate a target parameter, and use the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data-driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, also with this approach one has to pay a big price through the multiple testing adjustment, and one still needs to a priori list the target parameters.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that, when applied to the data, generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice, which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.
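The sample-splitting recipe mentioned above can be sketched in a few lines: one half of the data selects the target parameter, the other half estimates it and supplies a confidence interval. The simulated data, the selection rule (largest absolute correlation), and the Fisher-z interval are all illustrative assumptions, not the CV-TMLE procedure of [86].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                  # d candidate biomarkers
y = 2.0 * X[:, 2] + 0.5 * rng.normal(size=n)

# Split: the first half may be mined freely; the second half stays untouched.
X1, y1 = X[: n // 2], y[: n // 2]
X2, y2 = X[n // 2 :], y[n // 2 :]

# Stage 1 (exploration): data-driven target parameter, namely the correlation
# of the biomarker most strongly associated with y in the first half.
j = int(np.argmax([abs(np.corrcoef(X1[:, k], y1)[0, 1]) for k in range(d)]))

# Stage 2 (confirmation): estimate corr(X_j, y) on the held-out half;
# the Fisher z-transform gives an approximate 95% confidence interval.
r = float(np.corrcoef(X2[:, j], y2)[0, 1])
z, se = np.arctanh(r), 1.0 / np.sqrt(len(y2) - 3)
ci = (np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se))
```

The price the text points out is visible here: only half the sample is available for the confidence interval.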

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where the best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule or dynamic treatment regimen, and an


optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., an indicator of death or another health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and would heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Expansions. Another key assumption that the asymptotic efficiency or asymptotic linearity of the TMLE relies upon is that the second-order remainder term is o_P(1/sqrt(n)). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q-bar_0 and g_0 converge to their targets converges to zero at a faster rate than 1/sqrt(n). The density estimation literature proves that if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
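In the standard TMLE notation, the condition just described can be written explicitly (a sketch following the usual first-order expansion in this literature). Since the TMLE $P_n^*$ solves the efficient influence curve estimating equation, we have

```latex
P_n D^*(P_n^*) = 0, \qquad
\Psi(P_n^*) - \Psi(P_0) = (P_n - P_0)\, D^*(P_n^*) + R_2(P_n^*, P_0),
```

so asymptotic linearity holds when the second-order remainder satisfies $R_2(P_n^*, P_0) = o_P(1/\sqrt{n})$. For the treatment-specific mean $\Psi(P) = E_0 \bar{Q}(a, W)$ of the running example, the remainder is a cross-product of the two nuisance errors,

```latex
R_2(P, P_0) = E_0\!\left[
  \frac{\bigl(g(a \mid W) - g_0(a \mid W)\bigr)
        \bigl(\bar{Q}(a, W) - \bar{Q}_0(a, W)\bigr)}
       {g(a \mid W)} \right],
```

with the average treatment effect remainder being the difference of the $a = 1$ and $a = 0$ terms. This is $o_P(1/\sqrt{n})$ as soon as both $\bar{Q}_n$ and $g_n$ converge to their targets faster than $n^{-1/4}$, the "product of rates" condition stated above.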

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it: instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started doing research on such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
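The computational property asked for here, updating an estimate using only the new chunk of data, can be illustrated with the simplest possible case: an online sample mean with a running variance for a Wald interval. This is only an analogy for the online TMLE research described (a sample mean is a trivial estimator); the class and its interface are hypothetical.

```python
class OnlineEstimator:
    """Running mean with a Wald CI; update cost depends only on chunk size."""

    def __init__(self):
        self.n = 0
        self.sum = 0.0
        self.sumsq = 0.0

    def update(self, chunk):
        # O(len(chunk)) work; the old data is never revisited.
        for y in chunk:
            self.n += 1
            self.sum += y
            self.sumsq += y * y

    def estimate(self):
        mean = self.sum / self.n
        var = self.sumsq / self.n - mean * mean   # population variance
        se = (var / self.n) ** 0.5
        return mean, (mean - 1.96 * se, mean + 1.96 * se)
```

A stream of chunks is processed as `est.update(new_chunk)` followed by `est.estimate()`, refreshing the inference without recomputation over the accumulated database.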

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for a revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter, using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation, in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field, with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview is probabilistic; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply the scientific method to the field. Although criticism that a mere chasing of low p values and a naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of "the inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory classical statistics, but this looks for the main part like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage of The Future of Data Analysis: "for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also, the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in themselves sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of the parametric Fisherian statistics based on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and it has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, of course for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to an annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here. However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is sought on the basis of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super-learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
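The first stage described above can be sketched as a "discrete" super learner: each candidate's risk is estimated by cross-validation and the cross-validation selector picks the winner (the full super learner instead uses the best weighted combination of candidates). The two toy candidates and their `(fit, predict)` interface are illustrative assumptions, not an existing library API.

```python
import numpy as np

def discrete_super_learner(X, y, learners, n_folds=5):
    """Return (index of the CV-best candidate, that candidate refit on all data).

    learners: list of (fit, predict) pairs, where fit(X, y) returns a model
    and predict(model, X) returns predictions; risk is mean squared error.
    """
    idx = np.arange(len(y))
    risks = []
    for fit, predict in learners:
        fold_risks = []
        for test in np.array_split(idx, n_folds):
            train = np.setdiff1d(idx, test)          # held-out fold excluded
            model = fit(X[train], y[train])
            fold_risks.append(np.mean((y[test] - predict(model, X[test])) ** 2))
        risks.append(np.mean(fold_risks))            # cross-validated risk
    best = int(np.argmin(risks))                     # cross-validation selector
    fit, predict = learners[best]
    return best, fit(X, y)                           # refit winner on all data

# Two toy candidates: a constant (mean) predictor and a simple linear fit.
candidates = [
    (lambda X, y: y.mean(),
     lambda m, X: np.full(len(X), m)),
    (lambda X, y: np.polyfit(X[:, 0], y, 1),
     lambda m, X: np.polyval(m, X[:, 0])),
]
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0
best, model = discrete_super_learner(X, y, candidates)
```

On this noiseless linear data the cross-validated risk of the linear candidate is essentially zero, so the selector picks it; with real data the library would contain many diverse learners, as the text describes.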

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be enough data so that careful design of studies and interpretation of data is not needed anymore.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
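The emptiness of strata is easy to demonstrate numerically. The following sketch (hypothetical simulated data, with far fewer observations than the n = 10^12 mentioned above) counts how many treatment-by-covariate strata are ever observed:

```python
import random

random.seed(1)

# n observations with one binary treatment and d binary covariates give
# 2^(d+1) possible strata; count how many are ever observed.
n, d = 10_000, 30
total = 2 ** (d + 1)

observed = {
    tuple(random.randint(0, 1) for _ in range(d + 1))  # (treatment, covariates)
    for _ in range(n)
}

print(f"observed strata: {len(observed)} of {total} possible")
```

With 30 binary covariates there are over two billion strata, so the observed fraction is vanishingly small and the pure stratified (plug-in) estimator is undefined on almost every stratum.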

Targeted Learning was developed in response to high-dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state-of-the-art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.
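As a minimal iid illustration of what influence-curve-based inference looks like in its simplest form (a sketch for the sample mean, whose influence curve is just the observation minus the mean; in the dependent-data settings discussed here the influence curve and its variance must additionally account for the dependence structure, which is exactly why the cited weak convergence theory is needed):

```python
import math
import random

random.seed(2)

# For the sample mean, the efficient influence curve is IC(X) = X - mean,
# and the standard error is the sample sd of the IC divided by sqrt(n).
x = [random.gauss(3.0, 2.0) for _ in range(2_000)]
n = len(x)

est = sum(x) / n
ic = [xi - est for xi in x]                # estimated influence curve values
var_ic = sum(v * v for v in ic) / (n - 1)
se = math.sqrt(var_ic) / math.sqrt(n)
ci = (est - 1.96 * se, est + 1.96 * se)
print(f"estimate {est:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

The same template (estimate, plug in the estimated influence curve, read off a normal-approximation interval) carries over to far more complex target parameters once their influence curves are derived.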

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Advances in Statistics 17

Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference and truth: Probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and J. Mark van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.
[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


An optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or other health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.
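A plug-in version of this idea can be sketched on toy data (a hypothetical simulation; the outcome model, the per-arm linear regressions, and the evaluation grid are illustrative stand-ins for the super-learner-based approach of [87]):

```python
import random

random.seed(4)

# Simulated data: true death probability 0.5 - 0.3*(2A - 1)*W, so
# treatment A = 1 lowers risk exactly when W > 0.
n = 4_000
data = []
for _ in range(n):
    w = random.uniform(-1.0, 1.0)
    a = random.randint(0, 1)
    y = 1.0 if random.random() < 0.5 - 0.3 * (2 * a - 1) * w else 0.0
    data.append((w, a, y))

def linfit(pairs):
    """Least-squares line y = b0 + b1*w through (w, y) pairs."""
    m = len(pairs)
    mw = sum(w for w, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    b1 = (sum((w - mw) * (y - my) for w, y in pairs)
          / sum((w - mw) ** 2 for w, _ in pairs))
    return my - b1 * mw, b1

# Fit the outcome regression separately in each treatment arm.
fit = {arm: linfit([(w, y) for w, a, y in data if a == arm]) for arm in (0, 1)}

def predicted_risk(arm, w):
    b0, b1 = fit[arm]
    return b0 + b1 * w

# Plug-in rule on a grid: treat where predicted risk under A = 1 is lower.
grid = [i / 100.0 for i in range(-100, 101)]
rule = [1 if predicted_risk(1, w) < predicted_risk(0, w) else 0 for w in grid]
frac_pos = (sum(r for r, w in zip(rule, grid) if w > 0)
            / sum(1 for w in grid if w > 0))
print(f"fraction treated where W > 0: {frac_pos:.2f}")
```

The fitted rule essentially recovers "treat when W > 0"; note that honest confidence intervals for the mean outcome under such a data-fitted rule are exactly the data adaptive target parameter problem discussed in the text.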

Statistical Inference Based on Higher Order Inference. Another key assumption the asymptotic efficiency or asymptotic linearity of TMLE relies upon is that the remainder (second-order) term is o_P(1/sqrt(n)). For example, in our running example this means that the product of the rates at which the super-learner estimators of Q̄_0 and g_0 converge to their targets converges to zero at a faster rate than 1/sqrt(n). The density estimation literature proves that, if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploiting underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and is suffering from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
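The asymptotic linearity referred to here is usually written as a first-order expansion; in notation commonly used in the TMLE literature (a sketch, with psi the target parameter mapping, P_n* the targeted fit, P_n the empirical distribution, and D* the efficient influence curve, rather than this paper's running-example notation):

```latex
\Psi(P_n^{*}) - \Psi(P_0)
  \;=\; (P_n - P_0)\, D^{*}(P_0) \;+\; R_n,
\qquad R_n \;=\; o_P\!\left(1/\sqrt{n}\right).
```

When the second-order remainder R_n is negligible, the standardized estimator is asymptotically normal with variance Var(D*(P_0))/n; the higher order expansions above aim precisely at the regimes in which R_n is not negligible.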

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about the target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimator without having to recompute it; instead, one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We started doing research in such online TMLEs that preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for this update is only a function of the size of the new chunk of data.
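The chunk-wise updating idea can be demonstrated with the simplest possible estimator, the sample mean (a hypothetical sketch; a genuine online TMLE must also update the super-learner fits and the targeting step incrementally, which is the open research question described above):

```python
import random

random.seed(3)

class OnlineMean:
    """Running sample mean updated chunk by chunk, never revisiting old data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, chunk):
        m = len(chunk)
        chunk_mean = sum(chunk) / m
        # Weighted merge of the old estimate and the new chunk's mean;
        # the cost depends only on the chunk size, not on self.n.
        self.mean += (chunk_mean - self.mean) * m / (self.n + m)
        self.n += m

stream = [random.gauss(0.0, 1.0) for _ in range(10_000)]
om = OnlineMean()
for start in range(0, len(stream), 500):  # data arrives in chunks of 500
    om.update(stream[start:start + 500])

batch_mean = sum(stream) / len(stream)
print(om.mean, batch_mean)  # agree up to floating point rounding
```

The streamed estimate matches the batch recomputation while touching each observation only once, which is the scalability property the text asks of online estimators.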

7. Historical-Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for a revision of current data-analytic practice, and showed some recent advances and application areas. Also, research in progress on such issues as dependent data and data adaptive target parameters has been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89-91], where we have put the present state of statistical data analysis in a historical and philosophical perspective, with the purpose to clarify, understand, and account for the current situation in statistical data analysis and to relate the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that, rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling and causal inference, and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL, all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things, this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory or approach to


14 Advances in Statistics

learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data-analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view, for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview is probabilistic; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this paper is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to the field. Although criticism that a mere chasing of low p values and a naive use of parametric statistics did not do justice to specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of "the inference experts", as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and barely even formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; like a contemporary Sherlock Holmes, he must strive for signs and "clues". Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage from The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also, the idea, currently frequently uttered in the data-analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the antithesis sketched very briefly here. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliche, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in itself sometimes hybrid, inferential methods. In current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson, but understood the ultimate consequences of it. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who did realize that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of the parametric Fisherian statistics based on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, also for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to an annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Then Targeted Learning involves a flexible, data-adaptive estimation procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a sort of weighted sum of the values calculated by means of cross-validation. Based on these initial estimators, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts of any concept of Data Science.
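The first, super learning step described above can be sketched in miniature (a toy two-learner library, a constant mean fit and a simple least-squares line, with a grid of convex weights; the actual super learner uses a rich, user-supplied library and solves for the cross-validation-optimal weight vector):

```python
import random

# Candidate learners: each fit_* takes training data and returns a predictor.
def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Closed-form simple least squares: y = a + b * x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def super_learner(xs, ys, v=5):
    """Pick the convex combination of the two library fits that minimizes
    V-fold cross-validated squared error, then refit it on all the data."""
    n = len(xs)
    folds = [list(range(i, n, v)) for i in range(v)]
    grid = [i / 10 for i in range(11)]      # candidate weights on fit_mean
    risk = {w: 0.0 for w in grid}
    for fold in folds:
        holdout = set(fold)
        train = [i for i in range(n) if i not in holdout]
        f1 = fit_mean([xs[i] for i in train], [ys[i] for i in train])
        f2 = fit_linear([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:                      # evaluate on held-out points only
            p1, p2 = f1(xs[i]), f2(xs[i])
            for w in grid:
                risk[w] += (w * p1 + (1 - w) * p2 - ys[i]) ** 2
    w = min(grid, key=lambda g: risk[g])    # the cross-validation selector
    f1, f2 = fit_mean(xs, ys), fit_linear(xs, ys)
    return lambda x: w * f1(x) + (1 - w) * f2(x)

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 * x + random.gauss(0, 1) for x in xs]   # a truly linear signal
fit = super_learner(xs, ys)
```

Here cross-validation should put (nearly) all weight on the linear fit, so fit(5.0) is close to 10. In the second, targeting step, this initial fit would then be fluctuated along a parametric submodel to remove bias for the specific target parameter.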

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
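The emptiness of the strata is easy to check numerically; a toy sketch with only 20 binary covariates (so about 10^6 possible strata) and 10,000 observations, where the vast majority of strata receive no observations at all:

```python
import random

random.seed(1)
n, d = 10_000, 20      # 10,000 observations, 20 binary covariates
strata = {tuple(random.getrandbits(1) for _ in range(d)) for _ in range(n)}
total = 2 ** d         # 1,048,576 possible covariate strata
print(len(strata))     # roughly 10,000 occupied strata out of about 10^6
print(1 - len(strata) / total)   # fraction of empty strata: above 0.98
```

With realistic covariate dimensions the number of strata dwarfs any sample size, which is exactly why the pure empirical plug-in estimator is undefined and smoothing is unavoidable.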

Targeted Learning was developed in response to high dimensional data in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing not to be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state of the art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLEs.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," The International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1-20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1-32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," The International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113-156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97-128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545-597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," The International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351-371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373-395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465-480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688-701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9-12, pp. 1393-1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9-12, pp. 923-945, 1987.


[983090983096] J Robins ldquoA graphical approach to the identi1047297cation andestimation o causal parameters in mortality studies withsustained exposure periodsrdquo Journal of Chronic Diseases vol983092983088 supplement 983090 pp 983089983091983097Sndash983089983094983089S 983089983097983096983095

[983090983097] A Rotnitzky D Scharstein L Su and J Robins ldquoMethodsor conducting sensitivity analysis o trials with potentially nonignorable competing causes o censoringrdquo Biometrics vol983093983095 no 983089 pp 983089983088983091ndash983089983089983091 983090983088983088983089

[983091983088] J M Robins A Rotnitzky and D O Scharstein ldquoSensitivity analysis or se lection bias and unmeasured conounding in missing data and causal inerence modelsrdquo in Statistical Modelsin Epidemiology the Environment and Clinical rials IMAVolumes in Mathematics and Its Applications Springer BerlinGermany 983089983097983097983097

[983091983089] D O Scharstein A Rotnitzky and J Robins ldquoAdjustingor nonignorable drop-out using semiparametric nonresponsemodelsrdquo Journal of the American Statistical Association vol 983097983092no 983092983092983096 pp 983089983088983097983094ndash983089983089983092983094 983089983097983097983097

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data-analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective, the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors, of usually unequal importance, that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that the parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the metalanguage into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers, and even advanced studies, which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic, research methods and entire theories are probabilistic, if not the underlying worldview is probabilistic; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new, emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to close interaction with statistics. Today this trend has only further strengthened, and as a result there is a plethora of fields of application of statistics, ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded many branches of computer science; most noticeably, they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality, like: What is reality? Does it exist mind-independently? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth, and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result, these approaches dominate key issues and controversies in epistemology, such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often, scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning, and that statistical analysis is not as well-founded as might be expected, the issue addressed in this chapter is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to their field. Although criticism that a mere chasing of low p values and naive use of parametric statistics did not do justice to specific characteristics of the sciences involved emerged from the start of the application of statistics, the success story was immense. However, this rise of the "inference experts," as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but ununified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and even barely formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals: no inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey, the statistician is a detective; as a contemporary Sherlock Holmes, he must strive for signs and "clues." Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but this seems for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962, in the famous opening passage from The Future of Data Analysis: "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings, Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data-analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is an important aspect of the antithesis sketched very briefly here. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with, in itself sometimes hybrid, inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who did realize that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was only slightly under pressure, hit by the successes of the parametric Fisherian statistics based on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval, and it has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small high-dimensional datasets (microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap, if only for practical reasons, compelling.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually enhanced, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between both has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only the realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view, it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept "model" is achieved. Then, Targeted Learning involves a flexible, data-adaptive estimation


procedure that proceeds in two steps. First, an initial estimate is obtained of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques usually substantial, SL uses a weighted combination of the candidate fits, with weights selected by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated: the initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
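The two-step procedure sketched above can be made concrete in a few dozen lines. The following is a minimal, self-contained illustration, not the authors' SuperLearner/tmle software: the simulated data, the two-learner library, and all tuning choices (grid searches for the super learner weight and the fluctuation parameter) are hypothetical simplifications chosen only to show the flow from cross-validated initial fit, to clever-covariate targeting, to an influence-curve-based standard error for the average treatment effect.

```python
import numpy as np

expit = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda p: np.log(p / (1.0 - p))

def fit_logistic(X, y, iters=30):
    """Plain Newton-Raphson logistic regression; returns coefficients (intercept first)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = expit(X1 @ b)
        b += np.linalg.solve((X1.T * (p * (1 - p))) @ X1 + 1e-8 * np.eye(len(b)),
                             X1.T @ (y - p))
    return b

def predict(b, X):
    return expit(np.column_stack([np.ones(len(X)), X]) @ b)

# Simulated observational data: confounder W, treatment A, binary outcome Y.
rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * W))
Y = rng.binomial(1, expit(0.2 + A + 0.5 * W))
X = np.column_stack([A, W])

# Step 1: super-learner-style initial estimator of Qbar(A, W) = E[Y | A, W]:
# a convex combination of two candidate learners, weighted by CV squared error.
def learner_full(Xtr, ytr):   # logistic regression on (A, W)
    return fit_logistic(Xtr, ytr)
def learner_small(Xtr, ytr):  # deliberately misspecified: ignores the confounder W
    return np.concatenate([fit_logistic(Xtr[:, :1], ytr), [0.0]])

folds = np.arange(n) % 5
cv = np.zeros((n, 2))
for k in range(5):
    tr, te = folds != k, folds == k
    for j, learner in enumerate([learner_full, learner_small]):
        cv[te, j] = predict(learner(X[tr], Y[tr]), X[te])
grid = np.linspace(0, 1, 101)
alpha = grid[np.argmin([np.mean((Y - (a * cv[:, 0] + (1 - a) * cv[:, 1])) ** 2)
                        for a in grid])]
b_full, b_small = learner_full(X, Y), learner_small(X, Y)

def Qbar(a):
    Xa = np.column_stack([np.full(n, float(a)), W])
    return alpha * predict(b_full, Xa) + (1 - alpha) * predict(b_small, Xa)

Q1, Q0 = Qbar(1), Qbar(0)
QA = np.where(A == 1, Q1, Q0)

# Step 2: TMLE targeting step. Fit the fluctuation parameter eps of the logistic
# submodel logit(Q_eps) = logit(Q) + eps * H by maximum likelihood, where H is
# the "clever covariate" built from the treatment mechanism g(W) = P(A=1 | W).
g = predict(fit_logistic(W[:, None], A), W[:, None])
H = A / g - (1 - A) / (1 - g)
eps_grid = np.linspace(-1, 1, 2001)
ll = [np.sum(Y * np.log(expit(logit(QA) + e * H)) +
             (1 - Y) * np.log(1 - expit(logit(QA) + e * H))) for e in eps_grid]
eps = eps_grid[np.argmax(ll)]

Q1s = expit(logit(Q1) + eps / g)         # targeted E[Y | A=1, W]
Q0s = expit(logit(Q0) - eps / (1 - g))   # targeted E[Y | A=0, W]
ate = np.mean(Q1s - Q0s)                 # substitution (plug-in) estimator

# Influence-curve-based standard error for a Wald-type confidence interval.
IC = H * (Y - expit(logit(QA) + eps * H)) + (Q1s - Q0s) - ate
se = IC.std() / np.sqrt(n)
print(f"targeted ATE estimate: {ate:.3f} +/- {1.96 * se:.3f}")
```

The design choice mirrored here is the separation of concerns in the text: the super learner is judged purely by cross-validated prediction loss, while the one-dimensional fluctuation step is the only place where the specific target parameter (the ATE) enters.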

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

o start with lots o bad data are useless so one will needto respect the experiment that generated the data in order tocareully de1047297ne the target parameter and its interpretationand design o experiments is as important as ever so that thetarget parameters o interest can actually be learned

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and we will also need Targeted Learning for unbiased estimation and valid statistical inference.
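The emptiness of the strata is easy to verify numerically. The sketch below uses simulated data with hypothetical dimensions of our own choosing (30 binary covariates plus a binary treatment): the number of possible (treatment, covariate-pattern) cells, 2^31, dwarfs any realistic sample size, so almost all strata contain no observations and the stratum-specific empirical means are undefined.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000  # a large sample, yet tiny next to the number of strata
d = 30      # number of binary covariates (hypothetical dimension)

A = rng.integers(0, 2, size=n)       # binary treatment
W = rng.integers(0, 2, size=(n, d))  # binary covariates

# A stratum is one joint (treatment, covariate-pattern) cell: 2^(d+1) in total.
n_cells = 2 ** (d + 1)
observed = {tuple(row) for row in np.column_stack([A, W])}
print(f"nonempty strata: {len(observed)} of {n_cells}")
# Virtually every stratum is empty, so the stratum-specific empirical means
# (and hence the raw plug-in estimator) are undefined.
```

At most n of the 2^31 cells can be occupied, so the fraction of nonempty strata is below 5e-6 here; no increase of n that is physically plausible changes this picture once the covariate dimension grows modestly.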

Targeted Learning was developed in response to high dimensional data, in which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing to not be necessarily restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state of the art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.


Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," The International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," The International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," The International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.
[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," The International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, E. Karl, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," The International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," The International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," The International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," The International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," The International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.
[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," The International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," The International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.
[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of Big Data," Stator, vol. 2, no. 24, 2013 (Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and above all many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit, but rather a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA, Tukey endeavors to emphasize the importance of confirmatory classical statistics, but this looks for the main part a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962 in the famous opening passage from "The Future of Data Analysis": "For a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their 'dealing with fluctuations' aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. Data analysis is a larger and more varied field than inference, or allocation." Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques that were soon overshadowed by the "inferential" coup which marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution, and an exponent or forerunner of today's erosion of models: the view that all models are wrong, the classical notion of truth is obsolete, and pragmatic criteria such as predictive success in data analysis must prevail. Also the idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous, is an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey's heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with in itself sometimes hybrid inferential methods. In the current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson, but understood its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who did realize that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton's heritage was just slightly under pressure, hit by the successes of the parametric Fisherian statistics on strong model assumptions, and it could well be stated that this was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th-century dialectical German philosopher G. W. F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; moreover, both are raised to a higher level, to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented problem at hand, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands on, and offers challenges to, both. For example, it sets high standards for data management, storage, and retrieval, and it greatly influences research on the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (e.g., microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences depend heavily on machine learning algorithms and statistics makes bridging the gap compelling, if only for practical reasons.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually amplified, leading to annexation of Big Data by one of the two disciplines.

Of course, the gap between the two has many aspects, both philosophical and technical, that have been left out here.

However, it must be emphasized that, for the main part, Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only realistic background knowledge and focuses on the parameter of interest, which is considered a property of the as yet unknown, true data-generating distribution. From a methodological point of view, it is a clear imperative that the model and the parameter of interest be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved.

Advances in Statistics, 16

Then, Targeted Learning involves a flexible, data-adaptive estimation procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learner algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the candidate fits, with the weights calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated: the initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques.

This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data, nor full census research, nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that the two, rather than being in contradiction, should be integrating parts in any concept of Data Science.
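The two-stage procedure can be illustrated in miniature. The sketch below implements only the cross-validation logic of the first (super learning) stage, in its simplest "discrete" form, which refits the single candidate with the smallest cross-validated risk; the full super learner of [18] instead uses the best weighted (convex) combination of candidates, and the data and the two candidate learners here are purely illustrative, not the actual SL library.

```python
import numpy as np

def cv_risk(learner_fit, X, y, V=5):
    """V-fold cross-validated mean squared error of one candidate learner.
    learner_fit(X_train, y_train) must return a predict(X_new) function."""
    folds = np.array_split(np.arange(len(y)), V)
    risks = []
    for v in range(V):
        test = folds[v]
        train = np.concatenate([folds[j] for j in range(V) if j != v])
        predict = learner_fit(X[train], y[train])
        risks.append(np.mean((y[test] - predict(X[test])) ** 2))
    return np.mean(risks)

# Two illustrative candidate learners: a constant (mean) fit and OLS.
def fit_mean(X, y):
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_ols(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])  # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def discrete_super_learner(library, X, y, V=5):
    """Pick the candidate with the smallest CV risk; refit it on all data."""
    risks = {name: cv_risk(fit, X, y, V) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, library[best](X, y)

# Simulated data with a strongly linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
name, predict = discrete_super_learner({"mean": fit_mean, "ols": fit_ols}, X, y)
```

The second (targeting) stage, which fluctuates this initial fit along a parametric submodel to optimize the bias-variance trade-off for the target parameter, is deliberately omitted here; it depends on the chosen target parameter and its efficient influence curve.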

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field, often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high-dimensional data on a very large number of units. But the truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever, so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really we will also need Targeted Learning for unbiased estimation and valid statistical inference.
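The empty-strata problem is easy to make concrete. The toy computation below (illustrative numbers, not taken from the paper) draws a binary treatment and 30 binary covariates and counts the occupied (treatment, covariate-pattern) strata: even 100,000 observations can occupy at most 100,000 of the roughly 2.1 billion strata, so the stratified empirical mean is undefined almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100_000, 30               # sample size and number of binary covariates
A = rng.integers(0, 2, n)        # binary treatment
W = rng.integers(0, 2, (n, d))   # binary covariates

# Each (treatment, covariate-pattern) pair is one stratum; there are 2^(d+1).
total_strata = 2 ** (d + 1)
occupied = len({(int(a), w.tobytes()) for a, w in zip(A, W)})
frac = occupied / total_strata

# The plug-in estimator of the stratum-specific mean outcome is undefined in
# every empty stratum, i.e., in all but a vanishing fraction of strata, which
# is why smoothing across strata (super learning) is unavoidable.
```

Raising n helps far less than one might hope: each extra binary covariate doubles the number of strata, so the occupied fraction collapses exponentially in d.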

Targeted Learning was developed in response to high-dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large, semiparametric) models, for target parameters defined as features of the data distribution instead of as coefficients in these parametric models, and for Targeted Learning.

The massive dimension of the data does make it appealing not to be restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data-adaptive target parameters, discussed above, is a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, such as a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as a random sample from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, and the state-of-the-art advances in weak convergence theory, are more crucial than ever. That is, the state of the art in probability theory will only be more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super learners and TMLE.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way; and the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.
[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.
[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.
[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working Paper 258, http://biostats.bepress.com/ucbbiostat.
[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working Paper 258, http://www.bepress.com/ucbbiostat.
[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.
[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.
[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.
[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.
[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.
[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.
[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.
[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.
[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.
[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.
[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.
[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.
[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.
[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.
[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.
[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.
[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.
[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.
[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.
[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.
[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.
[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.
[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.
[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.
[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.
[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.
[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.
[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.
[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.
[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.
[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.
[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.
[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.
[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.
[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.
[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.
[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.
[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.
[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.
[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.
[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.
[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.
[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.
[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.
[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.
[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.
[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.
[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.
[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.
[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.
[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.
[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.
[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.
[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.
[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.
[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.
[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.
[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.
[75] A. Chambaz, P. Neuvial, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.
[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.
[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.
[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.
[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.
[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.
[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.
[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.
[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.
[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.
[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.
[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.
[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.
[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.
[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.
[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (in Dutch).
[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (in Dutch).


Advances in Statistics

procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques, ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective, and the variation in the results of the various techniques is usually substantial, the super learner (SL) uses a weighted combination of the candidate estimators, with the weights determined by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated: the initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of "influence-curve theory" or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher's unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data, nor full census research, nor any other attempt to take into account the whole of reality or a world encoded or encrypted in data, can compensate for this. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science, and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
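The two-stage procedure described above can be made concrete for the average treatment effect (ATE) of a binary treatment. The following Python sketch is illustrative only and not the authors' software: the simulated data, the variable names, and the deliberately misspecified initial fit (which ignores the confounder and stands in for the super-learner stage) are all assumptions of this example. The targeting step fits the fluctuation parameter of a logistic submodel by maximum likelihood using the "clever covariate" of the ATE, and the standard error is read off the estimated efficient influence curve.

```python
import numpy as np

rng = np.random.default_rng(0)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda p: np.log(p / (1.0 - p))

# Simulated observational data: binary confounder W, treatment A, outcome Y.
n = 20000
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, expit(-0.5 + 1.5 * W))      # W affects treatment...
Y = rng.binomial(1, expit(-1.0 + A + 1.2 * W))  # ...and outcome (confounding)

# Stage 1: initial estimate of Qbar(a, w) = E[Y | A = a, W = w].  In targeted
# learning this fit would come from the super learner; here we deliberately
# use a misspecified fit that ignores W, to show what targeting adds.
Q_init = np.array([Y[A == 0].mean(), Y[A == 1].mean()])  # indexed by a
QA, Q1, Q0 = Q_init[A], np.full(n, Q_init[1]), np.full(n, Q_init[0])

# Treatment mechanism g(w) = P(A = 1 | W = w), estimated empirically.
g1 = np.array([A[W == 0].mean(), A[W == 1].mean()])[W]

# Stage 2: targeting.  Fit the fluctuation parameter eps of the logistic
# submodel logit(Qbar_eps) = logit(Qbar) + eps * H by maximum likelihood,
# with H the "clever covariate" of the average treatment effect.
H = A / g1 - (1 - A) / (1 - g1)
offset, eps = logit(np.clip(QA, 1e-6, 1 - 1e-6)), 0.0
for _ in range(50):                      # scalar Newton-Raphson
    p = expit(offset + eps * H)
    score = np.sum(H * (Y - p))          # d/d(eps) of the log-likelihood
    eps += score / np.sum(H**2 * p * (1 - p))
    if abs(score) < 1e-10:
        break

# Updated fits and the substitution (plug-in) estimator of the ATE.
Q1s = expit(logit(np.clip(Q1, 1e-6, 1 - 1e-6)) + eps / g1)
Q0s = expit(logit(np.clip(Q0, 1e-6, 1 - 1e-6)) - eps / (1 - g1))
psi = np.mean(Q1s - Q0s)

# Inference from the estimated efficient influence curve.
IC = H * (Y - expit(offset + eps * H)) + Q1s - Q0s - psi
se = IC.std(ddof=1) / np.sqrt(n)
print(f"TMLE of the ATE: {psi:.3f} +/- {1.96 * se:.3f}")
```

Because the treatment mechanism is estimated consistently, the targeting step removes the confounding bias that the misspecified initial fit leaves behind, a small illustration of the double robustness of the procedure; the naive difference in arm means, Q_init[1] - Q_init[0], would be noticeably biased here.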

8. Concluding Remarks: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics; for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed. To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not even a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with n = 10^12 observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and we will also need Targeted Learning for unbiased estimation and valid statistical inference.
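How quickly the strata empty out can be quantified with a small simulation (the numbers below are illustrative and not from the paper): with d binary covariates and a binary treatment there are 2^(d+1) possible treatment-covariate combinations, so even a sample that looks enormous occupies only a vanishing fraction of them.

```python
import numpy as np

rng = np.random.default_rng(1)

# n observations of a binary treatment A and d binary covariates W.
n, d = 10**5, 30
A = rng.integers(0, 2, n, dtype=np.uint64)
W = rng.integers(0, 2, (n, d), dtype=np.uint64)

# Encode each (A, W) combination as a single integer stratum label.
strata = A.copy()
for j in range(d):
    strata = strata * 2 + W[:, j]

total = 2.0 ** (d + 1)             # number of possible strata
occupied = np.unique(strata).size  # strata actually seen in the sample
print(f"{occupied} of {total:.0e} possible strata are occupied "
      f"(fraction {occupied / total:.1e})")
# On every empty stratum the empirical mean of Y given (A, W) is undefined,
# so the pure plug-in estimator of the average treatment effect does not
# exist; smoothing (i.e., super learning) is unavoidable.
```

Already with d = 30, fewer than one stratum in ten thousand is occupied; at the dimensions of genomic or electronic-health-record data, the pure empirical plug-in estimator is hopeless at any realistic sample size.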

Targeted Learning was developed in response to high dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, for target parameters defined as features of the data distribution instead of coefficients in these parametric models, and thus for Targeted Learning.

The massive dimension of the data does make it appealing to not be restricted by a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, as discussed above, is a particularly important future area of research, providing an important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data is the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, the data cannot be represented as a random sample from some target population, since we sample all units of the target population. In these cases it is important to document the connections between the units, so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state of the art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data corresponds with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE that can be used.


Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and the scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1-20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1-32, 2011, Working Paper 258, http://biostats.bepress.com/ucbbiostat.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working Paper 258, http://www.bepress.com/ucbbiostat.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113-156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97-128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545-597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351-371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373-395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465-480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688-701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9-12, pp. 1393-1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9-12, pp. 923-945, 1987.


[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S-161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103-113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096-1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574-596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39-64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099-1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152-172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792-796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541-549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149-160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161-174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose-response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171-192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with targeted maximum likelihood estimation," Biometrics, vol. 70, no. 1, pp. 144-152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235-254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310-317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91-S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737-1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83-106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096-1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962-972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059-1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439-456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117-156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections Probability and Statistics, pp. 335-421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).

7232019 Targetted Learning

httpslidepdfcomreaderfulltargetted-learning 2020

Submit your manuscripts at

httpwwwhindawicom

Page 17: Targetted Learning

7232019 Targetted Learning

httpslidepdfcomreaderfulltargetted-learning 1720

Advances in Statistics

Clearly, Big Data does require the integration of different disciplines, fully respecting the advances made in the different fields, such as computer science, statistics, probability theory, and scientific knowledge, that allow us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this, so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The biggest mistake we can make in this Big Data era is to give up on deep statistical and probabilistic reasoning and theory, and the corresponding education of our next generations, and to somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments, which improved the paper substantially. This research was supported by NIH Grant 2R01AI074345.

References

[1] M. J. van der Laan and D. Rubin, "Targeted maximum likelihood learning," International Journal of Biostatistics, vol. 2, no. 1, 2006.

[2] M. J. van der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, New York, NY, USA, 2011.

[3] R. J. C. M. Starmans, "Models, inference, and truth: probabilistic reasoning in the information era," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., pp. 1–20, Springer, New York, NY, USA, 2011.

[4] M. J. van der Laan, "Estimation based on case-control designs with known prevalence probability," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[5] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," The International Journal of Biostatistics, vol. 7, no. 1, pp. 1–32, 2011, Working paper 258, http://biostats.bepress.com/ucbbiostat/.

[6] A. Chambaz and M. J. van der Laan, "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study," International Journal of Biostatistics, vol. 7, no. 1, article 33, 2011, Working paper 258, http://www.bepress.com/ucbbiostat/.

[7] M. J. van der Laan, L. B. Balzer, and M. L. Petersen, "Adaptive matching in randomized trials and observational studies," Journal of Statistical Research, vol. 46, no. 2, pp. 113–156, 2013.

[8] M. J. van der Laan, "Causal inference for networks," Tech. Rep. 300, University of California, Berkeley, Calif, USA, 2012.

[9] J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, Cambridge, NY, USA, 2nd edition, 2009.

[10] R. D. Gill, "Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1)," Scandinavian Journal of Statistics, vol. 16, pp. 97–128, 1989.

[11] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1996.

[12] R. D. Gill, M. J. van der Laan, and J. A. Wellner, "Inefficient estimators of the bivariate survival function for three models," Annales de l'Institut Henri Poincaré, vol. 31, no. 3, pp. 545–597, 1995.

[13] P. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, Springer, 1997.

[14] S. Gruber and M. J. van der Laan, "A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome," International Journal of Biostatistics, vol. 6, no. 1, article 26, 2010.

[15] M. J. van der Laan and S. Dudoit, "Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples," Technical Report, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2003.

[16] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan, "Oracle inequalities for multi-fold cross validation," Statistics and Decisions, vol. 24, no. 3, pp. 351–371, 2006.

[17] M. J. van der Laan, S. Dudoit, and A. W. van der Vaart, "The cross-validated adaptive epsilon-net estimator," Statistics & Decisions, vol. 24, no. 3, pp. 373–395, 2006.

[18] M. J. van der Laan, E. Polley, and A. Hubbard, "Super learner," Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, article 25, 2007.

[19] E. C. Polley, S. Rose, and M. J. van der Laan, "Super learning," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2012.

[20] M. J. van der Laan and J. M. Robins, Unified Methods for Censored Longitudinal Data and Causality, Springer, New York, NY, USA, 2003.

[21] M. L. Petersen and M. J. van der Laan, A General Roadmap for the Estimation of Causal Effects, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[22] J. Splawa-Neyman, "On the application of probability theory to agricultural experiments," Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.

[23] D. B. Rubin, "Estimating causal effects of treatments in randomized and non-randomized studies," Journal of Educational Psychology, vol. 64, pp. 688–701, 1974.

[24] D. B. Rubin, Matched Sampling for Causal Effects, Cambridge University Press, Cambridge, Mass, USA, 2006.

[25] P. W. Holland, "Statistics and causal inference," Journal of the American Statistical Association, vol. 81, no. 396, pp. 945–960, 1986.

[26] J. Robins, "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect," Mathematical Modelling, vol. 7, no. 9–12, pp. 1393–1512, 1986.

[27] J. M. Robins, "Addendum to 'A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect'," Computers & Mathematics with Applications, vol. 14, no. 9–12, pp. 923–945, 1987.

[28] J. Robins, "A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods," Journal of Chronic Diseases, vol. 40, supplement 2, pp. 139S–161S, 1987.

[29] A. Rotnitzky, D. Scharfstein, L. Su, and J. Robins, "Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring," Biometrics, vol. 57, no. 1, pp. 103–113, 2001.

[30] J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment and Clinical Trials, IMA Volumes in Mathematics and Its Applications, Springer, Berlin, Germany, 1999.

[31] D. O. Scharfstein, A. Rotnitzky, and J. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models," Journal of the American Statistical Association, vol. 94, no. 448, pp. 1096–1146, 1999.

[32] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012, http://www.bepress.com/ucbbiostat/paper303.

[33] O. Bembom and M. J. van der Laan, "A practical illustration of the importance of realistic individualized treatment rules in causal inference," Electronic Journal of Statistics, vol. 1, pp. 574–596, 2007.

[34] S. Rose and M. J. van der Laan, "Simple optimal weighting of cases and controls in case-control studies," The International Journal of Biostatistics, vol. 4, no. 1, 2008.

[35] S. Rose and M. J. van der Laan, "Why match? Investigating matched case-control study designs with causal effect estimation," The International Journal of Biostatistics, vol. 5, no. 1, article 1, 2009.

[36] S. Rose and M. J. van der Laan, "A targeted maximum likelihood estimator for two-stage designs," International Journal of Biostatistics, vol. 7, no. 1, 21 pages, 2011.

[37] K. L. Moore and M. J. van der Laan, "Application of time-to-event methods in the assessment of safety in clinical trials," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman and Hall, 2009.

[38] K. L. Moore and M. J. van der Laan, "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation," Statistics in Medicine, vol. 28, no. 1, pp. 39–64, 2009.

[39] K. L. Moore and M. J. van der Laan, "Increasing power in randomized trials with right censored outcomes through covariate adjustment," Journal of Biopharmaceutical Statistics, vol. 19, no. 6, pp. 1099–1131, 2009.

[40] O. Bembom, M. L. Petersen, S.-Y. Rhee et al., "Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection," Statistics in Medicine, vol. 28, pp. 152–172, 2009.

[41] R. Neugebauer, M. J. Silverberg, and M. J. van der Laan, "Observational study and individualized antiretroviral therapy initiation rules for reducing cancer incidence in HIV-infected patients," Tech. Rep. 272, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[42] E. C. Polley and M. J. van der Laan, "Predicting optimal treatment assignment based on prognostic factors in cancer patients," in Design, Summarization, Analysis & Interpretation of Clinical Trials with Time-to-Event Endpoints, K. E. Peace, Ed., Chapman & Hall, 2009.

[43] M. Rosenblum, S. G. Deeks, M. van der Laan, and D. R. Bangsberg, "The risk of virologic failure decreases with duration of HIV suppression at greater than 50% adherence to antiretroviral therapy," PLoS ONE, vol. 4, no. 9, Article ID e7196, 2009.

[44] M. J. van der Laan and S. Gruber, "Collaborative double robust targeted maximum likelihood estimation," The International Journal of Biostatistics, vol. 6, no. 1, article 17, 2010.

[45] O. M. Stitelman and M. J. van der Laan, "Collaborative targeted maximum likelihood for time to event data," Tech. Rep. 260, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2010.

[46] S. Gruber and M. J. van der Laan, "An application of collaborative targeted maximum likelihood estimation in causal inference and genomics," The International Journal of Biostatistics, vol. 6, no. 1, 2010.

[47] M. Rosenblum and M. J. van der Laan, "Targeted maximum likelihood estimation of the parameter of a marginal structural model," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[48] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning," Statistics & Probability Letters, vol. 81, no. 7, pp. 792–796, 2011.

[49] I. D. Munoz and M. J. van der Laan, "Super learner based conditional density estimation with application to marginal structural models," International Journal of Biostatistics, vol. 7, no. 1, article 38, 2011.

[50] I. D. Munoz and M. van der Laan, "Population intervention causal effects based on stochastic interventions," Biometrics, vol. 68, no. 2, pp. 541–549, 2012.

[51] I. Diaz and M. J. van der Laan, "Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems," International Journal of Biostatistics, vol. 9, no. 2, pp. 149–160, 2013.

[52] I. Diaz and M. J. van der Laan, "Assessing the causal effect of policies: an example using stochastic interventions," International Journal of Biostatistics, vol. 9, no. 2, pp. 161–174, 2013.

[53] I. Diaz and M. J. van der Laan, "Targeted data adaptive estimation of the causal dose–response curve," Journal of Causal Inference, vol. 1, no. 2, pp. 171–192, 2013.

[54] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of effect modification parameters in survival analysis," The International Journal of Biostatistics, vol. 7, no. 1, article 19, 2011.

[55] M. J. van der Laan, "Targeted maximum likelihood based causal inference: Part I," International Journal of Biostatistics, vol. 6, no. 2, 2010.

[56] O. M. Stitelman and M. J. van der Laan, "Targeted maximum likelihood estimation of time-to-event parameters with time-dependent covariates," Tech. Rep., Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[57] M. Schnitzer, E. Moodie, M. J. van der Laan, R. Platt, and M. Klein, "Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation," Biometrics, vol. 70, no. 1, pp. 144–152, 2014.

[58] S. Gruber and M. J. van der Laan, "Targeted minimum loss based estimator that outperforms a given estimator," The International Journal of Biostatistics, vol. 8, no. 1, article 11, 2012.


[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Technical Report 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).


[983093983096] S Gruber and M J van der Laan ldquoargeted minimum lossbased estimator that outperorms a given estimatorrdquo Te Inter-national Journal of Biostatistics vol 983096 article 983089983089 no 983089 983090983088983089983090


Advances in Statistics

[59] S. Gruber and M. J. van der Laan, "Consistent causal effect estimation under dual misspecification and implications for confounder selection procedure," Statistical Methods in Medical Research, 2012.

[60] M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. J. van der Laan, "Targeted minimum loss based estimation of marginal structural working models," Tech. Rep. 312, University of California, Berkeley, Calif, USA, 2013.

[61] J. Brooks, M. J. van der Laan, D. E. Singer, and A. S. Go, "Targeted minimum loss-based estimation of causal effects in right-censored survival data with time-dependent covariates: warfarin, stroke, and death in atrial fibrillation," Journal of Causal Inference, vol. 1, no. 2, pp. 235–254, 2013.

[62] J. Brooks, M. J. van der Laan, and A. S. Go, "Targeted maximum likelihood estimation for prediction calibration," International Journal of Biostatistics, vol. 8, no. 1, article 30, 2012.

[63] S. Sapp, M. J. van der Laan, and K. Page, "Targeted estimation of variable importance measures with interval-censored outcomes," Tech. Rep. 307, University of California, Berkeley, Calif, USA, 2013.

[64] R. Neugebauer, J. A. Schmittdiel, and M. J. van der Laan, "Targeted learning in real-world comparative effectiveness research with time-varying interventions," Tech. Rep. HHSA29020050016I, The Agency for Healthcare Research and Quality, 2013.

[65] S. D. Lendle, M. S. Subbaraman, and M. J. van der Laan, "Identification and efficient estimation of the natural direct effect among the untreated," Biometrics, vol. 69, no. 2, pp. 310–317, 2013.

[66] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Targeted maximum likelihood estimation in safety analysis," Journal of Clinical Epidemiology, vol. 66, no. 8, pp. S91–S98, 2013.

[67] M. S. Subbaraman, S. Lendle, M. van der Laan, L. A. Kaskutas, and J. Ahern, "Cravings as a mediator and moderator of drinking outcomes in the COMBINE study," Addiction, vol. 108, no. 10, pp. 1737–1744, 2013.

[68] S. D. Lendle, B. Fireman, and M. J. van der Laan, "Balancing score adjusted targeted minimum loss-based estimation," 2013.

[69] W. Zheng, M. L. Petersen, and M. J. van der Laan, "Estimating the effect of a community-based intervention with two communities," Journal of Causal Inference, vol. 1, no. 1, pp. 83–106, 2013.

[70] W. Zheng and M. J. van der Laan, "Targeted maximum likelihood estimation of natural direct effects," International Journal of Biostatistics, vol. 8, no. 1, 2012.

[71] W. Zheng and M. J. van der Laan, "Causal mediation in a survival setting with time-dependent mediators," Tech. Rep. 295, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[72] M. Carone, M. Petersen, and M. J. van der Laan, "Targeted minimum loss based estimation of a causal effect using interval censored time to event data," in Interval Censored Time to Event Data: Methods and Applications, D.-G. Chen, J. Sun, and K. E. Peace, Eds., Chapman & Hall/CRC, New York, NY, USA, 2012.

[73] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins, "Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion and rejoinder)," Journal of the American Statistical Association, vol. 94, pp. 1096–1120, 1999.

[74] H. Bang and J. M. Robins, "Doubly robust estimation in missing data and causal inference models," Biometrics, vol. 61, no. 4, pp. 962–972, 2005.

[75] A. Chambaz, N. Pierre, and M. J. van der Laan, "Estimation of a non-parametric variable importance measure of a continuous exposure," Electronic Journal of Statistics, vol. 6, pp. 1059–1099, 2012.

[76] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discovery: the search for a standard," UC Berkeley Working Paper Series, 2008, http://www.bepress.com/ucbbiostat/paper233.

[77] C. Tuglus and M. J. van der Laan, "Modified FDR controlling procedure for multi-stage analyses," Statistical Applications in Genetics and Molecular Biology, vol. 8, no. 1, article 12, 2009.

[78] C. Tuglus and M. J. van der Laan, "Targeted methods for biomarker discoveries," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 22, Springer, New York, NY, USA, 2011.

[79] H. Wang, S. Rose, and M. J. van der Laan, "Finding quantitative trait loci genes," in Targeted Learning: Causal Inference for Observational and Experimental Data, M. J. van der Laan and S. Rose, Eds., chapter 23, Springer, New York, NY, USA, 2011.

[80] L. B. Balzer and M. J. van der Laan, "Estimating effects on rare outcomes: knowledge is power," Tech. Rep. 310, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2013.

[81] W. Zheng and M. J. van der Laan, "Cross-validated targeted minimum loss based estimation," in Targeted Learning: Causal Inference for Observational and Experimental Studies, M. J. van der Laan and S. Rose, Eds., Springer, New York, NY, USA, 2011.

[82] A. Rotnitzky, Q. Lei, M. Sued, and J. M. Robins, "Improved double-robust estimation in missing data and causal inference models," Biometrika, vol. 99, no. 2, pp. 439–456, 2012.

[83] D. B. Rubin and M. J. van der Laan, "Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis," The International Journal of Biostatistics, vol. 4, no. 1, article 5, 2008.

[84] M. J. van der Laan, "Statistical inference when using data adaptive estimators of nuisance parameters," Tech. Rep. 302, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2012.

[85] M. J. van der Laan and M. L. Petersen, "Targeted learning," in Ensemble Machine Learning, pp. 117–156, Springer, New York, NY, USA, 2012.

[86] M. J. van der Laan, A. E. Hubbard, and S. Kherad, "Statistical inference for data adaptive target parameters," Tech. Rep. 314, University of California, Berkeley, Calif, USA, June 2013.

[87] M. J. van der Laan, "Targeted learning of an optimal dynamic treatment and statistical inference for its mean outcome," Tech. Rep. 317, University of California at Berkeley, 2013, to appear in Journal of Causal Inference.

[88] J. M. Robins, L. Li, E. Tchetgen, and A. W. van der Vaart, "Higher order influence functions and minimax estimation of non-linear functionals," in Essays in Honor of David A. Freedman, IMS Collections: Probability and Statistics, pp. 335–421, Springer, New York, NY, USA, 2008.

[89] S. Rose, R. J. C. M. Starmans, and M. J. van der Laan, "Targeted learning for causality and statistical analysis in medical research," Tech. Rep. 297, Division of Biostatistics, University of California, Berkeley, Calif, USA, 2011.

[90] R. J. C. M. Starmans, "Picasso, Hegel and the era of big data," Stator, vol. 2, no. 24, 2013 (Dutch).

[91] R. J. C. M. Starmans and M. J. van der Laan, "Inferential statistics versus machine learning: a prelude to reconciliation," Stator, vol. 2, no. 24, 2013 (Dutch).
