
Biometrics 70, 695–720, DOI: 10.1111/biom.12191, September 2014

Combining Biomarkers to Optimize Patient Treatment Recommendations

Chaeryon Kang,* Holly Janes, and Ying Huang

Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A.

∗email: [email protected]

Summary. Markers that predict treatment effect have the potential to improve patient outcomes. For example, the Oncotype DX® Recurrence Score® has some ability to predict the benefit of adjuvant chemotherapy over and above hormone therapy for the treatment of estrogen-receptor-positive breast cancer, facilitating the provision of chemotherapy to women most likely to benefit from it. Given that the score was originally developed for predicting outcome given hormone therapy alone, it is of interest to develop alternative combinations of the genes comprising the score that are optimized for treatment selection. However, most methodology for combining markers is useful when predicting outcome under a single treatment. We propose a method for combining markers for treatment selection which requires modeling the treatment effect as a function of markers. Multiple models of treatment effect are fit iteratively by upweighting or “boosting” subjects potentially misclassified according to treatment benefit at the previous stage. The boosting approach is compared to existing methods in a simulation study based on the change in expected outcome under marker-based treatment. The approach improves upon existing methods in some settings and has comparable performance in others. Our simulation study also provides insights as to the relative merits of the existing methods. Application of the boosting approach to the breast cancer data, using scaled versions of the original markers, produces marker combinations that may have improved performance for treatment selection.

Key words: Biomarker; Boosting; Model mis-specification; Treatment selection.

1. Introduction
Discovering and describing heterogeneity in treatment effects across patient subgroups has emerged as a key objective in clinical trials and drug development. If the treatment effect can be predicted given marker values such as biological measurements and clinical characteristics, providing patients and clinicians with these marker values can help them make more informed treatment decisions. For example, the Oncotype DX Recurrence Score is a leading marker for predicting the benefit of adjuvant chemotherapy over and above tamoxifen among breast cancer patients with estrogen receptor-positive (ER-positive) tumors (Albain et al., 2010a). The Recurrence Score is a proprietary combination of expression levels of 21 genes (16 cancer-related and 5 reference) measured in breast cancer tumor tissue, and is used to identify a subgroup of patients for whom the likelihood of benefitting from adjuvant chemotherapy is small. These patients can therefore avoid unnecessary and potentially toxic treatment.

There is a large literature on statistical methods for combining markers, but the vast majority of them have focused on combining markers for predicting outcome under a single treatment (e.g., Etzioni et al., 2003; Pepe, Cai, and Longton, 2005; Zhao et al., 2011). However, combinations of markers for risk prediction or classification under a single treatment are not optimized for treatment selection. Being at high risk for the outcome does not necessarily imply a larger benefit from a particular treatment (Henry and Hayes, 2006; Janes et al., 2011; Janes, Pepe, and Huang, in press). In particular, the Recurrence Score was originally developed for predicting the risk of disease recurrence or death given treatment with tamoxifen alone (Paik et al., 2004), and was later shown to have value for predicting chemotherapy benefit (Paik et al., 2004; Albain et al., 2010a, 2010b). Therefore, it is of interest to explore alternative combinations of gene expression measures that are optimized for treatment selection.

Statistical methods for combining markers for treatment selection are being developed (see Gunter, Zhu, and Murphy, 2007; Brinkley, Tsiatis, and Anstrom, 2010; Cai et al., 2011; Claggett et al., 2011; Foster, Taylor, and Ruberg, 2011; Gunter, Zhu, and Murphy, 2011a; Zhang et al., 2012; Zhao et al., 2012; Lu, Zhang, and Zeng, 2013). A simple approach uses generalized linear regression to model the expected disease outcome as a function of treatment and markers, including an interaction between each marker and treatment (Gunter et al., 2007; Cai et al., 2011; Lu et al., 2013; Janes et al., in press). This model is difficult to specify, particularly with multiple markers as in the breast cancer example, and hence an approach that is robust to model mis-specification is warranted. This is a key motivation for our approach to combining markers for treatment selection. We call our approach “boosting” since it is a natural generalization of the Adaboost (adaptive boosting) method used to predict disease outcome under a single treatment (Freund and Schapire, 1997; Friedman, Hastie, and Tibshirani, 2000).

Candidate approaches for combining markers should be compared with respect to a clinically relevant performance measure, and yet few of the existing studies have performed such comparisons. In a simulation study and in our analysis of the breast cancer data, we evaluate methods for combining markers using the cardinal measure of model performance: the improvement in expected outcome under marker-based treatment (Song and Pepe, 2004; Brinkley et al., 2010; Gunter, Zhu, and Murphy, 2011b; Zhang et al., 2012; Janes et al., 2014). To the best of our knowledge, only two other articles (Qian and Murphy, 2011; Zhang et al., 2012) have used this approach for evaluating new methodology.

The structure of the article is as follows. In Section 2, we introduce our approach to evaluating marker combinations for treatment selection and describe the boosting method. A simulation study used to evaluate the boosting approach in comparison to other candidate approaches is described in Section 3. Section 4 describes our application of the boosting approach to the breast cancer data. We conclude with a discussion of our findings and further research topics to pursue.

2. Methods

2.1. Context and Notation

Let D be a binary indicator of an adverse outcome following treatment, which we refer to as “disease.” In the breast cancer example, D indicates death or cancer recurrence within 5 years of study enrollment. We assume that D captures all the consequences of treatment, such as subsequent toxicity, morbidity, and mortality; more general settings are addressed in Section 5. Suppose that the task is to decide, for each individual patient, between two treatment options denoted by T, where we call T = 1 “treatment” and T = 0 “no treatment.” We assume that the default treatment strategy is to treat all patients. The marker, Y ∈ R^p, may be useful for identifying a subgroup of patients who can avoid treatment. This setup is motivated by the breast cancer context, wherein adjuvant chemotherapy in addition to hormone therapy (T = 1) is the standard of care and markers are used to identify women who can forego adjuvant chemotherapy (T = 0). The setting where T = 0 is the default and Y is used to identify a subgroup to treat can be handled by simply switching treatment labels (T = 0 for treatment and 1 for no treatment). We assume that the data {(D_i, T_i, Y_i)}, i = 1, ..., n, come from the ideal setting for evaluating treatment efficacy: a randomized clinical trial comparing T = 0 to T = 1, where Y is measured at baseline and D is a clinical outcome observed for all subjects.

2.2. Measures for Evaluating Marker Performance

Let Δ(Y) ≡ P(D = 1|T = 0, Y) − P(D = 1|T = 1, Y) denote the marker-specific treatment effect. Given marker values Y for all subjects, the treatment policy that minimizes the population disease rate is to recommend no treatment if φ(Y) = 1{Δ(Y) ≤ 0} = 1, where 1(·) is the indicator function (Vickers, Kattan, and Sargent, 2007; Brinkley et al., 2010; Zhang et al., 2012; Lu et al., 2013). In the breast cancer example, this policy would recommend hormone therapy alone to patients with negative treatment effects and adjuvant chemotherapy to patients with positive treatment effects. The function Δ(Y) is therefore the combination of markers that we seek, and φ(Y) is the associated treatment rule. Given data {D_i, T_i, Y_i} for i = 1, ..., n subjects, we estimate the marker-specific treatment effect by fitting a model for P(D = 1|T, Y), termed the “risk model,” and calculate Δ̂(Y) = P̂(D = 1|T = 0, Y) − P̂(D = 1|T = 1, Y) and φ̂(Y) = 1{Δ̂(Y) ≤ 0}.

We characterize the performance of an arbitrary estimated treatment rule φ̂(Y) by evaluating the benefit of marker-based treatment (Song and Pepe, 2004; Brinkley et al., 2010; Gunter et al., 2011b; Zhang et al., 2012; Janes et al., 2014). This is measured by the difference in the disease rate under marker-based treatment assignment versus the default strategy of providing treatment to all patients:

  θ{φ̂(Y)} ≡ P(D = 1|T = 1) − [P{D = 1|T = 0, φ̂(Y) = 1} P{φ̂(Y) = 1} + P{D = 1|T = 1, φ̂(Y) = 0} P{φ̂(Y) = 0}]
           = [P{D = 1|T = 1, φ̂(Y) = 1} − P{D = 1|T = 0, φ̂(Y) = 1}] × P{φ̂(Y) = 1}.

In the breast cancer example, θ denotes the reduction in the 5-year death or recurrence rate under marker-based treatment; in general, a higher value of θ indicates greater marker value. Using the standard empirical measure P_n(δ) ≡ Σ_{i=1}^{n} n^{−1} δ_i, θ is estimated empirically as follows:

  θ̂{φ̂(Y)} = [ P_n 1{D = 1, T = 1, φ̂(Y) = 1} / P_n 1{T = 1, φ̂(Y) = 1} − P_n 1{D = 1, T = 0, φ̂(Y) = 1} / P_n 1{T = 0, φ̂(Y) = 1} ] × P_n 1{φ̂(Y) = 1}.

Another important measure of the population performance of the marker is the rate of incorrect treatment recommendations, which we call the misclassification rate of treatment benefit, MCR_TB{φ̂(Y)} ≡ P{φ̂(Y) ≠ φ(Y)}, and estimate it empirically by P_n 1{φ̂(Y) ≠ φ(Y)}. Similar measures have been used by Foster et al. (2011) and Lu et al. (2013). Although this measure cannot be evaluated in practice since Δ(Y) is unknown, it can be evaluated in simulated data where Δ(Y) is known.
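To make these estimators concrete, here is a minimal R sketch (not from the original article); the vectors D, Trt, phi_hat, and phi_true are hypothetical inputs, with phi_hat = 1 denoting a recommendation of no treatment.

```r
## Minimal sketch of the empirical estimators above; D (0/1 outcome),
## Trt (0/1 treatment), and phi_hat / phi_true (0/1 rules, 1 = recommend
## no treatment) are hypothetical inputs.
theta_hat <- function(D, Trt, phi_hat) {
  p1 <- mean(D[Trt == 1 & phi_hat == 1])  # disease rate if treated, among phi_hat = 1
  p0 <- mean(D[Trt == 0 & phi_hat == 1])  # disease rate if untreated, among phi_hat = 1
  (p1 - p0) * mean(phi_hat == 1)          # reduction in population disease rate
}

## Misclassification rate of treatment benefit; computable only when the
## true rule phi_true is known, as in simulated data.
mcr_tb <- function(phi_hat, phi_true) mean(phi_hat != phi_true)
```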

2.3. The Boosting Method of Combining Markers for Treatment Selection

A simple approach for estimating Δ(Y) is to use a generalized linear model for the outcome, D, as a function of markers, Y, and treatment, T, including interactions between each marker and treatment. That is, to stipulate that

  h{P(D = 1|T, Y)} = η(T, Y),   (1)

where the linear predictor η(T, Y) = Ỹβ₁ + T Ỹβ₂, with Ỹ = (1, Yᵀ)ᵀ; β₁ and β₂ are (p + 1)-dimensional vectors of regression coefficients for the markers’ main effects and interactions with treatment, respectively; and h is a link function. The logit link is the most common choice for a binary outcome. This risk model, if correctly specified, produces the combination of markers, Δ(Y), with optimal performance, that is, θ{φ(Y)}. However, if the risk model is mis-specified, it will produce a biased estimate of treatment effect, resulting in a suboptimal combination of markers and rule for assigning treatment. With multiple markers, the likelihood of risk model mis-specification is increased. Our method seeks to improve upon logistic regression by providing an estimate of treatment effect, and a combination of markers, that is more robust to risk model mis-specification.
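As an illustration of this simple approach, a minimal R sketch follows, assuming a hypothetical data frame dat with outcome D, treatment Trt, and markers Y1–Y3; model (1) with a logit link is fit by glm() and converted into Δ̂(Y) and φ̂(Y).

```r
## Minimal sketch of model (1) with a logit link; the data frame `dat` with
## columns D, Trt, Y1, Y2, Y3 is a hypothetical example.
fit <- glm(D ~ (Y1 + Y2 + Y3) * Trt, family = binomial, data = dat)

## P-hat(D = 1 | T = t, Y) for t = 0, 1, and the resulting treatment rule
risk <- function(fit, dat, t) {
  predict(fit, newdata = transform(dat, Trt = t), type = "response")
}
delta_hat <- risk(fit, dat, 0) - risk(fit, dat, 1)
phi_hat   <- as.numeric(delta_hat <= 0)   # 1 = recommend no treatment
```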

To achieve this goal, we adopt the idea of Adaboost, which iteratively fits classifiers, at each stage assigning higher weights to subjects whose outcomes are misclassified at the previous stage in order to minimize classification error. Analogously, we repeatedly fit a “working model” for P(D = 1|T, Y), and at each stage give more weight to subjects who lie close to the decision boundary, Δ(Y) = 0, who have greater potential to be recommended the incorrect treatment. In other words, we extend Adaboost from the classification setting, where the outcome to be predicted is D, to the treatment selection setting, where the outcome is 1{Δ(Y) ≤ 0}. The added complexity is that 1{Δ(Y) ≤ 0} is not directly observable. Details of the boosting algorithm are given below.

Boosting algorithm

(1) With initial weight w_i^(0) = w^(0)(Y_i) for subject i, i = 1, ..., n, fit the working risk model and calculate P̂^(0)(D = 1|T = t, Y), t = 0, 1, and Δ̂^(0)(Y) = P̂^(0)(D = 1|T = 0, Y) − P̂^(0)(D = 1|T = 1, Y).

(2) Update weights according to w_i^(1) = w_i^*(1) / Σ_{i=1}^{n} w_i^*(1), where w_i^*(1) = min[w{Δ̂^(0)(Y_i)}, C_M], for w(u) decreasing in |u| and a specified maximum weight C_M. In our simulations, we use w{Δ̂^(0)(Y_i)} = |Δ̂^(0)(Y_i)|^(−1/3) and C_M = 500. This upweights subjects with small |Δ̂^(0)(Y)| and limits the maximum size of the weights.

(3) Re-fit the working model with updated weights w_i^(1) to obtain P̂^(1)(D = 1|T = t, Y), t = 0, 1, and Δ̂^(1)(Y_i) = P̂^(1)(D = 1|T = 0, Y_i) − P̂^(1)(D = 1|T = 1, Y_i) for all subjects.

(4) Repeat steps (2)–(3) until either a pre-specified convergence criterion is satisfied or a specified maximum number of iterations (M_max) is reached. In our simulations, we set M_max = 500 as an upper limit on the number of iterations that would be necessary.

(5) After the last iteration, denoted by M ≤ M_max, we have {P̂^(1)(D = 1|T = t, Y_i), ..., P̂^(M)(D = 1|T = t, Y_i)}, t = 0, 1, and {Δ̂^(1)(Y_i), ..., Δ̂^(M)(Y_i)} for i = 1, ..., n. The estimated disease rate and treatment effect for subject i are P̂(D_i = 1|T_i = t, Y_i) = M^(−1) Σ_{m=1}^{M} P̂^(m)(D_i = 1|T_i = t, Y_i) for t = 0, 1, and Δ̂(Y_i) = M^(−1) Σ_{m=1}^{M} Δ̂^(m)(Y_i), and the estimated treatment rule is φ̂(Y_i) = 1{Δ̂(Y_i) ≤ 0}.

(6) Given a new subject with covariate Y_0, say in an independent test data set, we apply the set of working risk models in (5) and calculate Δ̂^(m)(Y_0) for m = 1, ..., M. The estimated treatment effect is Δ̂(Y_0) = M^(−1) Σ_{m=1}^{M} Δ̂^(m)(Y_0) and φ̂(Y_0) = 1{Δ̂(Y_0) ≤ 0}.

We explore use of the linear logistic regression model (1) and a binary classification tree (Breiman et al., 1984) as working models. However, the boosting method applies to any arbitrary model for P(D = 1|T, Y). Choice of the weight function, w(u), maximum weight, C_M, and algorithm stopping rule are discussed in Web Appendix A. With the logistic working model, we stop the iterations when ‖β̂^(k) − β̂^(k−1)‖ ≤ 10^(−7), where β̂^(k) is the vector of estimated regression coefficients at the kth iteration, or when M = M_max; with the classification tree working model, we stop the iterations when M = M_max.
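A minimal R sketch of steps (1)–(5) with the linear logistic working model is given below; the weight function |Δ̂|^(−1/3), C_M = 500, M_max = 500, and the coefficient-convergence stopping rule follow the text, while the data layout (dat with columns D, Trt, Y1–Y3) is a hypothetical example.

```r
## Minimal sketch of the boosting algorithm with the linear logistic
## working model; `dat` with columns D, Trt, Y1, Y2, Y3 is hypothetical.
boost_glm <- function(dat, M_max = 500, C_M = 500, tol = 1e-7) {
  n <- nrow(dat)
  w <- rep(1 / n, n)                      # step (1): initial weights
  delta_sum <- 0; beta_old <- NULL
  for (m in seq_len(M_max)) {
    fit <- suppressWarnings(glm(D ~ (Y1 + Y2 + Y3) * Trt, family = binomial,
                                data = dat, weights = w))
    p0 <- predict(fit, transform(dat, Trt = 0), type = "response")
    p1 <- predict(fit, transform(dat, Trt = 1), type = "response")
    delta_sum <- delta_sum + (p0 - p1)    # accumulate Delta-hat^(m)(Y_i)
    # step (4): stop when the coefficient vector converges
    if (!is.null(beta_old) && sqrt(sum((coef(fit) - beta_old)^2)) <= tol) break
    beta_old <- coef(fit)
    # step (2): upweight subjects near the decision boundary Delta-hat = 0
    w_star <- pmin(abs(p0 - p1)^(-1 / 3), C_M)
    w <- w_star / sum(w_star)
  }
  delta_bar <- delta_sum / m              # step (5): average over iterations
  list(delta = delta_bar, phi = as.numeric(delta_bar <= 0))
}
```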

3. Simulation Study

A simulation study was performed to compare the boosting method to existing approaches for combining markers for treatment selection. The boosting method is compared to four comparator approaches: (1) using Adaboost (Friedman et al., 2000) to combine classification trees for predicting disease outcome under each treatment separately; (2) fitting a classification tree to both treatment groups including all marker-by-treatment interactions as predictors; (3) the classic logistic regression approach, which fits model (1) using maximum likelihood; and (4) the approaches of Zhang et al. (2012) that maximize the Inverse Probability Weighted (IPW) or Augmented Inverse Probability Weighted (AIPW) estimators of θ.

3.1. Comparator Methods for Combining Markers

3.1.1. Applying Adaboost separately to each treatment group. A natural approach is to use a risk model to combine markers to predict outcome under each treatment separately. We consider the Adaboost algorithm (Freund and Schapire, 1997; Friedman et al., 2000), which combines predictions across multiple binary classification trees (Breiman et al., 1984) (“base trees” (Hastie, Tibshirani, and Friedman, 2001)). Hereafter, this is referred to as the “Adaboost trees” method. Each base tree is built by assigning higher weights to subjects that are misclassified at the previous stage. The associated risk model for each treatment group is a function of individual markers and, potentially, interactions between markers. We use the method of Friedman et al. (2000) for estimating P(D = 1|T, Y). Adaboost trees is implemented by the R function ada (R package ada (Culp, Johnson, and Michailidis, 2012)) using the following default settings: exponential loss function, discrete boosting algorithm, and 500 base trees. Since Adaboost trees is a non-parametric approach, the obtained combination of markers is expected to be more robust than logistic regression. However, fitting a separate classifier to each treatment group may not yield the optimal marker combination for treatment selection.
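For illustration, the sketch below fits one Adaboost-trees risk model per treatment arm with the default settings listed above and combines the two into a treatment rule; the data layout is the hypothetical dat of the earlier sketches, with D coded 0/1.

```r
## Minimal sketch of "separate Adaboost": one Adaboost-trees risk model per
## treatment arm; `dat` (columns D, Trt, Y1-Y3) is hypothetical, D coded 0/1.
library(ada)

fit_arm <- function(arm) {
  ada(D ~ Y1 + Y2 + Y3, data = subset(dat, Trt == arm),
      loss = "exponential", type = "discrete", iter = 500)
}
fit0 <- fit_arm(0); fit1 <- fit_arm(1)

## Column 2 of the class-probability matrix corresponds to the class D = 1
p0 <- predict(fit0, newdata = dat, type = "probs")[, 2]  # P-hat(D=1 | T=0, Y)
p1 <- predict(fit1, newdata = dat, type = "probs")[, 2]  # P-hat(D=1 | T=1, Y)
phi_hat <- as.numeric(p0 - p1 <= 0)
```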

3.1.2. A single classification tree with marker-by-treatment interactions. An alternative nonparametric approach is to fit a single classification tree to both treatment groups including {T, TY₁, ..., TY_p, (1 − T)Y₁, ..., (1 − T)Y_p} as predictors. Using this classification tree, P(D = 1|T, Y) can be estimated using the empirical proportion of D = 1 observations in each terminal node. We use the R function rpart (R package rpart (Therneau, Atkinson, and Ripley, 2012)) with default settings: the minimal number of observations required to split a node is 20, the minimum number of observations in any terminal node is 7, and the maximum depth of the tree is 30. We do not prune the tree, in order to stabilize the probability estimates (Provost and Domingos, 2003; Chu et al., 2011), but these estimates are improved by averaging across multiple tree classifiers (Chu et al., 2011).
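A minimal sketch of this comparator follows, constructing the interaction predictors explicitly and reading P̂(D = 1|T = t, Y) off the fitted tree; the dat layout is hypothetical as before.

```r
## Minimal sketch of a single classification tree with marker-by-treatment
## interaction predictors, grown with rpart's defaults and left unpruned.
library(rpart)

with_interactions <- function(d, tt = d$Trt) {
  transform(d, Trt = tt,
            TY1 = tt * Y1, TY2 = tt * Y2, TY3 = tt * Y3,
            UY1 = (1 - tt) * Y1, UY2 = (1 - tt) * Y2, UY3 = (1 - tt) * Y3)
}
tree <- rpart(factor(D) ~ Trt + TY1 + TY2 + TY3 + UY1 + UY2 + UY3,
              data = with_interactions(dat), method = "class")

## P-hat(D = 1 | T = t, Y): terminal-node proportions of D = 1
risk_t <- function(tt) predict(tree, with_interactions(dat, tt), type = "prob")[, "1"]
phi_hat <- as.numeric(risk_t(0) - risk_t(1) <= 0)
```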

3.1.3. Maximizing the IPW or AIPW estimators of θ. Recently, Zhang et al. (2012) proposed an approach that finds a combination of markers by directly maximizing the mean outcome (in our context, minimizing the disease rate) under marker-based treatment. This is equivalent to maximizing θ̂, the estimated decrease in disease rate under marker-based treatment. Zhang et al. (2012) consider maximizing both IPW and AIPW estimators.

Briefly, let D(t) denote the potential disease outcome under treatment t. For an arbitrary treatment rule g: Y ↦ {0, 1} (in our context, g = 1 assigning no treatment), the goal is to estimate the optimal treatment rule defined by g_opt(Y) = arg min_{g ∈ G} E{D(g)} = 1{Δ(Y) ≤ 0}, where D(g) ≡ D(1){1 − g(Y)} + D(0)g(Y). Given a parametric working risk model P(D = 1|T, Y; β) parameterized by a finite-dimensional parameter β, let η = η(β) denote a scaled version of β satisfying ‖η‖ = 1, with ‖·‖ denoting the ℓ2-norm. Treatment rules in this class of risk models are written g(Y, η). The scaling is used to ensure that the solution η_opt ≡ arg min_η Q(η), where Q(η) ≡ E[D{g(Y, η)}], is unique. Specifically, η_opt is estimated by minimizing the IPW or AIPW estimators of Q(η) as follows:

  IPWE(η) = P_n { C_η D / π_c(Y; η, γ) },   (2)

  AIPWE(η) = P_n { C_η D / π_c(Y; η, γ) − [C_η − π_c(Y; η, γ)] m(Y; η, β) / π_c(Y; η, γ) },   (3)

where Ỹ = (1, Y); π(Y; γ) = P(T = 1|Y; γ) = e^(Ỹγ)/(1 + e^(Ỹγ)) is a known or estimated probability of treatment (the “propensity score”); π_c(Y; η, γ) = π(Y; γ)T + {1 − π(Y; γ)}(1 − T); C_η = T{1 − g(Y, η)} + (1 − T)g(Y, η) indicates whether the treatment received is consistent with that recommended by the rule g(Y, η); and m(Y; η, β) = P(D = 1|T = 1, Y; β){1 − g(Y, η)} + P(D = 1|T = 0, Y; β)g(Y, η) is the model-estimated disease rate under g(Y, η). In our randomized trial setting, the propensity score model is known by design. The IPW estimator (2) thus reduces to the empirical disease rate under marker-based treatment. The AIPW estimator (3) is more efficient in large samples. Minimizing (2) or (3) therefore yields the marker combination with the highest IPW or AIPW estimate of θ{φ̂(Y)} in the training data within the class of the working risk model. However, when the working model is mis-specified, this combination may perform poorly, and it is in this setting that the boosting approach may generate marker combinations with closer-to-optimal performance.

To implement the approach, we find η̂_opt that minimizes IPWE(η) or AIPWE(η) under the linear logistic working model (1), where η = β/‖β‖. Under this model, 1{Δ(Y) ≤ 0} is equivalent to 1{Ỹη ≤ 0}, and so the class of treatment rules is G_η = {g(Y; η) = 1{Ỹη ≤ 0}, ‖η‖ = 1, Ỹ = (1, Y₁, ..., Y_p)}. Following Zhang et al. (2012), the R function genoud (R package rgenoud (Mebane and Sekhon, 2011)) is utilized to minimize IPWE(η) (2) or AIPWE(η) (3) using the genetic algorithm (Sekhon and Mebane, 1998).
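The sketch below illustrates minimizing IPWE(η) (2) by genetic search for a randomized trial with known π = 1/2; it is a simplified, hypothetical rendering of the implementation described above, not the authors' code.

```r
## Minimal sketch of minimizing IPWE(eta) (2) via genetic search; pi = 1/2 is
## the known randomization probability, and `dat` is the hypothetical data
## frame used in earlier sketches.
library(rgenoud)

ipwe <- function(eta, dat, pi = 0.5) {
  eta  <- eta / sqrt(sum(eta^2))                  # enforce ||eta|| = 1
  Ytil <- cbind(1, dat$Y1, dat$Y2, dat$Y3)        # Y-tilde = (1, Y)
  g    <- as.numeric(Ytil %*% eta <= 0)           # rule: 1 = recommend no treatment
  C    <- dat$Trt * (1 - g) + (1 - dat$Trt) * g   # C_eta: received trt matches rule
  pic  <- pi * dat$Trt + (1 - pi) * (1 - dat$Trt) # pi_c under randomization
  mean(C * dat$D / pic)                           # empirical estimate of Q(eta)
}

## BFGS/gradient steps are disabled because the objective is non-smooth
opt <- genoud(ipwe, nvars = 4, max = FALSE, pop.size = 1000,
              BFGS = FALSE, gradient.check = FALSE, print.level = 0, dat = dat)
eta_hat <- opt$par / sqrt(sum(opt$par^2))
phi_hat <- as.numeric(cbind(1, dat$Y1, dat$Y2, dat$Y3) %*% eta_hat <= 0)
```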

3.2. Simulation Set-Up

We generate simulated data sets with 500 or 5000 observations in the following fashion. Binary treatment indicators T ~ Bernoulli(0.5). In most scenarios we generate three independent continuous markers Y1, Y2, and Y3 (Y = (Y1, Y2, Y3)), each following a standard normal distribution; exceptions are noted below. The binary outcome D ~ Bernoulli{P(D = 1|T, Y)}. The risk model P(D = 1|T, Y) varies among the seven scenarios as shown in Table 1 and described below. Figure 1 displays the distribution of Δ(Y) for each scenario. The linear logistic regression model (1) and the classification tree including {T, TY, (1 − T)Y} as predictors are used as working models for the boosting method.

Simulation scenarios

Scenario 1. The true risk model is linear logistic, where Y1, Y2, and Y3 have strong, intermediate, and weak interactions with treatment: logit P(D = 1|T, Y) = 0.3 + 0.2Y1 − 0.2Y2 − 0.2Y3 + T(−0.1 − 2Y1 − 0.7Y2 − 0.1Y3). The marker combination obtained by fitting the linear logistic working model with maximum likelihood estimation (MLE) is expected to achieve the best performance. However, it is of interest to determine the extent to which other methods produce comparable results. (A data-generating sketch for this scenario is given after Table 1 below.)

Scenario 2. The true risk model is the same as in Scenario 1, but now Y1 has high leverage points. Specifically, a random 2% of Y1 values are replaced with draws from a Uniform(8, 9) distribution. This scenario is used to compare the performance of the approaches that use the correct linear logistic working model in the context of high leverage observations.

Scenario 3. The true risk model is log{−log P(D = 1|T, Y)} = −0.7 − 0.2Y1 − 0.2Y2 + 0.1Y3 + T(0.1 + 2Y1 − Y2 − 0.3Y3), where Y1, Y2, and Y3 have strong, intermediate, and weak interactions with treatment. The linear predictor of the linear logistic working model is correct but the link function is incorrect. This scenario is used to compare the robustness of the boosting approach to other approaches in the context of minor working model mis-specification.

Scenario 4. The true risk model is log{−log P(D = 1|T, Y)} = 2 − 1.5Y1² − 1.5Y2² + 3Y1Y2 + T(−0.1 − Y1 + Y2). Y1 and Y2 follow a Uniform(−1.5, 1.5) distribution. The link function and main effects of the linear logistic working model are incorrectly specified, the latter due to omission of quadratic and marker-by-marker interaction terms, but the interaction terms with treatment are correct. This scenario is chosen for its similarity to the first scenario in Zhang et al. (2012), who found that, in a continuous outcome setting, maximizing the IPW or AIPW estimators of θ yielded substantial improvement over standard linear regression.

Scenario 5. The true risk model is logit P(D = 1|T, Y) = −0.1 − 0.2Y1 + 0.2Y2 − 0.1Y3 + Y1² + T(−0.5 − 2Y1 − Y2 − 0.1Y3 + 2Y1²), including a nonlinear main effect and interaction of Y1 with treatment. The linear logistic working model mis-specifies these Y1 effects, but the classification tree working model should be able to detect them.

Scenario 6. The true risk model is a logistic regression model including an interaction between Y1 and Y2, where Y1, Y2, and Y1Y2 have intermediate, intermediate, and strong interactions with treatment: logit P(D = 1|T, Y) = 0.1 − 0.2Y1 + 0.2Y2 − Y1Y2 + T(−0.5 − Y1 + Y2 + 3Y1Y2). The linear logistic working model does not include Y1Y2 and TY1Y2 interaction terms, whereas a classification tree working model does allow for them.

Scenario 7. The true risk model is the same linear logistic model as in Scenario 1 except for the presence of 2% outlying observations. Specifically, for a random 2% sample, Y1 is replaced with a draw from a Uniform(8, 9) distribution and D is replaced with 1 − D.

Table 1
True risk models and marker distributions for the seven simulation scenarios. The linear logistic regression model logit P(D = 1|T, Y) = Ỹβ₁ + T Ỹβ₂, with Ỹ = (1, Y), and the classification tree including {T, TY, (1 − T)Y} as predictors are evaluated as working models in all scenarios.

Scenarios 1–2 (the linear logistic working model is correctly specified):
  1. logit P(D = 1|T, Y) = 0.3 + 0.2Y1 − 0.2Y2 − 0.2Y3 + T(−0.1 − 2Y1 − 0.7Y2 − 0.1Y3); Y1, Y2, and Y3 are independent N(0, 1).
  2. Same risk model as Scenario 1; markers as in Scenario 1 except for 2% of high leverage observations where Y1 ~ Uniform(8, 9).

Scenario 3 (link function in the linear logistic working model is incorrectly specified):
  3. log{−log P(D = 1|T, Y)} = −0.7 − 0.2Y1 − 0.2Y2 + 0.1Y3 + T(0.1 + 2Y1 − Y2 − 0.3Y3); markers as in Scenario 1.

Scenario 4 (link function and main effects in the linear logistic working model are incorrectly specified):
  4. log{−log P(D = 1|T, Y)} = 2 − 1.5Y1² − 1.5Y2² + 3Y1Y2 + T(−0.1 − Y1 + Y2); Y1 and Y2 are independent Uniform(−1.5, 1.5).

Scenarios 5–6 (main effects and interactions are incorrectly specified in the linear logistic working model):
  5. logit P(D = 1|T, Y) = −0.1 − 0.2Y1 + 0.2Y2 − 0.1Y3 + Y1² + T(−0.5 − 2Y1 − Y2 − 0.1Y3 + 2Y1²); markers as in Scenario 1.
  6. logit P(D = 1|T, Y) = 0.1 − 0.2Y1 + 0.2Y2 − Y1Y2 + T(−0.5 − Y1 + Y2 + 3Y1Y2); markers as in Scenario 1.

Scenario 7 (linear logistic working model is mis-specified for outlying observations):
  7. P(D = 1|T, Y) = 1{Y1 < 8} (1 + e^(−η))^(−1) + 1{Y1 ≥ 8} {1 − (1 + e^(−η))^(−1)}, where η = 0.3 + 0.2Y1 − 0.2Y2 − 0.2Y3 + T(−0.1 − 2Y1 − 0.7Y2 − 0.1Y3); markers as in Scenario 2.
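As a concrete example, Scenario 1 data can be generated as in the following minimal sketch (the function name sim_scenario1 is illustrative).

```r
## Minimal sketch of the Scenario 1 data-generating mechanism; n is 500 or
## 5000 in the simulation study.
sim_scenario1 <- function(n) {
  Trt <- rbinom(n, 1, 0.5)                     # randomized treatment
  Y1 <- rnorm(n); Y2 <- rnorm(n); Y3 <- rnorm(n)
  lp <- 0.3 + 0.2 * Y1 - 0.2 * Y2 - 0.2 * Y3 +
        Trt * (-0.1 - 2 * Y1 - 0.7 * Y2 - 0.1 * Y3)
  D <- rbinom(n, 1, plogis(lp))                # logit P(D = 1 | T, Y) = lp
  data.frame(D, Trt, Y1, Y2, Y3)
}
dat <- sim_scenario1(500)
```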

For each scenario, 1000 data sets are generated and used as training data to build a prediction model and treatment assignment rule, φ̂(Y) = 1{Δ̂(Y) ≤ 0}. To avoid the overoptimism associated with fitting and evaluating the risk model using the same data, a single large independent test data set with n = 10⁵ observations is generated and used to evaluate the performance of the fitted treatment rule, θ{φ̂(Y)}. The mean and Monte Carlo standard deviation (SD) of θ̂{φ̂(Y)} and the mean estimated MCR_TB{φ̂(Y)} are reported. The performance of the true treatment rule, θ{φ(Y)}, is calculated as an average of θ̂{φ(Y)} over 100 Monte Carlo simulations, where each θ̂{φ(Y)} is obtained using n = 3 × 10⁷ observations.

3.3. Results of the Simulation Study

Tables 2 and 3 summarize the simulation results for sample sizes n = 500 and n = 5000, respectively. The performances of marker combinations obtained using the following methods are compared: logistic regression with maximum likelihood estimation (hereafter “linear logistic MLE”), the boosting method described in Section 2.3 with linear logistic working model (“linear logistic boosting”), maximizing the IPW or AIPW estimators of θ as proposed by Zhang et al. (2012) (“maximizing IPWE or AIPWE of θ”), a single classification tree with marker-by-treatment interactions (“single classification tree”), the boosting method with a classification tree working model including marker-by-treatment interactions (“classification tree boosting”), and applying Adaboost trees to each treatment group separately (“separate Adaboost”). For each scenario, the method(s) with the highest mean θ̂ are marked with an asterisk in the tables.

[Figure 1. Distribution of the marker-specific treatment effect, Δ(Y) = P(D = 1|T = 0, Y) − P(D = 1|T = 1, Y), shown as its CDF for each of the seven simulation scenarios. The proportion of individuals with negative treatment effects is indicated on the Y-axis, and θ = [P{D = 1|T = 1, φ(Y) = 1} − P{D = 1|T = 0, φ(Y) = 1}] × P{φ(Y) = 1}, measuring the impact of marker-based treatment assignment, is shown per panel: Scenario 1, θ = 0.1268 (proportion with Δ(Y) ≤ 0: 0.48); Scenario 2, θ = 0.1243 (0.47); Scenario 3, θ = 0.1341 (0.48); Scenario 4, θ = 0.0657 (0.54); Scenario 5, θ = 0.095 (0.59); Scenario 6, θ = 0.1393 (0.40); Scenario 7, θ = 0.1419 (0.49).]

Table 2
Results of the simulation study with sample size n = 500. Marker combinations obtained using the following methods are compared: (a) logistic regression with maximum likelihood estimation (MLE); (b) the boosting method of Section 2.3 with linear logistic working model; (c) maximizing the IPW estimator and (d) the AIPW estimator of θ as described by Zhang et al. (2012); (e) a single classification tree with interactions between markers and treatment; (f) the boosting method with classification tree working model; and (g) applying Adaboost trees to each treatment group separately. Mean and Monte Carlo standard deviation (SD) of θ̂ are shown, along with the mean misclassification rate for treatment benefit (MCR_TB). For each scenario, the method(s) with the highest mean θ̂ are marked with an asterisk.

                 (a)       (b)       (c)       (d)       (e)       (f)       (g)
Scenario 1, true θ = 0.1268 (linear logistic working model correctly specified)
  Mean θ̂        0.1199*   0.1195    0.1076    0.1134    0.0971    0.1083    0.0854
  SD of θ̂       0.00218   0.00258   0.01273   0.00728   0.01206   0.00649   0.00907
  Mean MCR_TB    0.0517    0.0555    0.1261    0.0996    0.2351    0.1294    0.2111
Scenario 2, true θ = 0.1243 (correctly specified; high leverage points)
  Mean θ̂        0.1232*   0.1229    0.1099    0.1158    0.1004    0.1104    0.0876
  SD of θ̂       0.01306   0.01156   0.01292   0.00760   0.01306   0.00740   0.00852
  Mean MCR_TB    0.0512    0.0526    0.1240    0.0973    0.2372    0.1269    0.2053
Scenario 3, true θ = 0.1341 (link function incorrectly specified)
  Mean θ̂        0.1302*   0.1299    0.1176    0.1234    0.1056    0.1162    0.1038
  SD of θ̂       0.00204   0.00224   0.01245   0.00712   0.01110   0.00654   0.00667
  Mean MCR_TB    0.0418    0.0444    0.1045    0.0824    0.2072    0.1124    0.1539
Scenario 4, true θ = 0.0657 (link function and main effects incorrectly specified)
  Mean θ̂        0.0574    0.0607*   0.0561    0.0567    0.0221    0.0378    0.0352
  SD of θ̂       0.01267   0.00986   0.01296   0.01482   0.01143   0.00867   0.00718
  Mean MCR_TB    0.1511    0.1206    0.1719    0.1653    0.5397    0.3251    0.5304
Scenario 5, true θ = 0.0950 (main effects and interactions incorrectly specified)
  Mean θ̂        0.0681    0.0702    0.0668    0.0694    0.0615    0.0735*   0.0590
  SD of θ̂       0.00703   0.00737   0.01303   0.01098   0.01359   0.00813   0.00854
  Mean MCR_TB    0.2667    0.2540    0.2588    0.2503    0.2737    0.1838    0.2478
Scenario 6, true θ = 0.1393 (main effects and interactions incorrectly specified)
  Mean θ̂        0.0236    0.0438    0.0498    0.0544    0.0978    0.1186*   0.1010
  SD of θ̂       0.01875   0.01276   0.01238   0.00893   0.01996   0.01057   0.00807
  Mean MCR_TB    0.3865    0.3542    0.3452    0.3330    0.2697    0.1762    0.2433
Scenario 7, true θ = 0.1419 (working model mis-specified for outlying observations)
  Mean θ̂        0.0879    0.1140    0.1099    0.1153*   0.1042    0.1151    0.0856
  SD of θ̂       0.02370   0.01183   0.01294   0.00816   0.01657   0.00944   0.00944
  Mean MCR_TB    0.2163    0.1207    0.1436    0.1198    0.2399    0.1394    0.2339

Table 3
Results of the simulation study with sample size n = 5000; methods (a)–(g), summaries, and asterisk convention as in Table 2.

                 (a)       (b)       (c)       (d)       (e)       (f)       (g)
Scenario 1, true θ = 0.1268
  Mean θ̂        0.1258*   0.1257    0.1198    0.1206    0.1114    0.1143    0.0990
  SD of θ̂       0.00038   0.00043   0.00216   0.00149   0.00523   0.00434   0.00237
  Mean MCR_TB    0.0165    0.0182    0.0527    0.0443    0.3222    0.2099    0.1664
Scenario 2, true θ = 0.1243
  Mean θ̂        0.1252*   0.1252*   0.1227    0.1236    0.1140    0.1165    0.1003
  SD of θ̂       0.00038   0.00043   0.00239   0.00170   0.00513   0.00473   0.00245
  Mean MCR_TB    0.0160    0.0173    0.0530    0.0430    0.3182    0.2251    0.1635
Scenario 3, true θ = 0.1341
  Mean θ̂        0.1316    0.1319*   0.1299    0.1308    0.1130    0.1215    0.1120
  SD of θ̂       0.00046   0.00046   0.00228   0.00151   0.00644   0.00328   0.00194
  Mean MCR_TB    0.0222    0.0188    0.0436    0.0355    0.2092    0.1148    0.1290
Scenario 4, true θ = 0.0657
  Mean θ̂        0.0640*   0.0640*   0.0625    0.0636    0.0259    0.0549    0.0436
  SD of θ̂       0.00041   0.00036   0.00218   0.00077   0.01029   0.00412   0.00270
  Mean MCR_TB    0.0633    0.0544    0.1252    0.1034    0.4949    0.2041    0.2694
Scenario 5, true θ = 0.0950
  Mean θ̂        0.0772    0.0804    0.0787    0.0797    0.0766    0.0839*   0.0703
  SD of θ̂       0.00210   0.00238   0.00280   0.00222   0.00458   0.00617   0.00231
  Mean MCR_TB    0.2485    0.2326    0.2008    0.1941    0.3287    0.1357    0.1829
Scenario 6, true θ = 0.1393
  Mean θ̂        0.0218    0.0496    0.0570    0.0582    0.1143    0.1290*   0.1200
  SD of θ̂       0.00834   0.00485   0.00403   0.00345   0.01494   0.00846   0.00247
  Mean MCR_TB    0.3851    0.3490    0.3460    0.3472    0.2212    0.1231    0.1804
Scenario 7, true θ = 0.1419
  Mean θ̂        0.1019    0.1215    0.1226    0.1236    0.1315    0.1355*   0.1179
  SD of θ̂       0.00824   0.00167   0.00247   0.00172   0.00537   0.00436   0.00246
  Mean MCR_TB    0.1684    0.0668    0.0742    0.0635    0.3285    0.1459    0.1632

When the linear logistic working model was correctly specified (Scenario 1), as expected, the combination of markers obtained using linear logistic MLE had the highest mean θ̂, smallest SD of θ̂, and smallest MCR_TB. Linear logistic boosting produced almost identical results, whereas all other methods produced modestly lower mean θ̂ and substantially higher SD of θ̂ and MCR_TB.

In the presence of high leverage points (Scenario 2), linear logistic MLE continued to produce the highest mean θ̂ and smallest MCR_TB. However, the SD of θ̂ was slightly lower with linear logistic boosting and substantially lower when maximizing the AIPWE of θ or employing classification tree boosting, even while the associated mean θ̂'s were close to optimal. This suggests that, as expected, linear logistic MLE yields variable estimates in the presence of high leverage points; this effect disappears with large n (Table 3). Another observation is that only linear logistic boosting produced MCR_TB near that of linear logistic MLE; all other methods produced substantially higher classification error.

Mis-specifying the link function of the logistic working model (Scenario 3) had minimal impact on θ̂, and both linear logistic MLE and linear logistic boosting produced nearly optimal mean θ̂ and similarly low SD of θ̂ and MCR_TB. All other methods yielded slightly lower mean θ̂ and substantially higher SD of θ̂ and MCR_TB. The superiority of the linear logistic regression methods persisted with larger n (Table 3). When both the link function and main effects were mis-specified (Scenario 4), methods with linear logistic working models produced similar mean θ̂ (close to the optimal value), but linear logistic boosting had some advantage in terms of lower SD of θ̂ and MCR_TB. Differences among methods were again smaller for larger n (Table 3).

Scenarios 5 and 6 explore substantial mis-specification of the linear logistic working model; the mean θ̂ for linear logistic MLE is far from the optimal value. In these scenarios, boosting improved upon linear logistic MLE. Classification tree boosting yielded the best performance, with the most dramatic improvement over logistic regression in the highly nonlinear setting of Scenario 6. These results persisted for large n (Table 3).

When the risk model mis-specification was due to outlying observations (Scenario 7), maximizing the AIPWE of θ and boosting provided marker combinations with improved performance over those generated by linear logistic MLE.

In summary, these simulation results demonstrate that the boosting method can improve upon existing methods for combining markers in certain settings. Under a substantially mis-specified working model, boosting can dramatically improve model performance. When the working model is mis-specified but not far from the true risk model, boosting may slightly improve performance. When high leverage points exist, boosting reduces variability without compromising mean performance. Boosting can perform better than direct maximization of the IPWE and AIPWE of θ under mild or substantial working model mis-specification. As expected, linear logistic boosting performs best with minor mis-specification of the logistic risk function, while classification tree boosting better captures nonlinear main effects and interactions with treatment.

4. Breast Cancer Data

The boosting method was then applied to the breast cancer data. The performance of the Oncotype DX Recurrence Score was most recently evaluated in the Southwest Oncology Group (SWOG) SS8814 trial (Albain et al., 2010a), which randomized women with node-positive, ER-positive breast cancer to tamoxifen plus adjuvant chemotherapy (cyclophosphamide, doxorubicin, and fluorouracil before or concurrent with tamoxifen) or tamoxifen alone. For 367 women (219 on tamoxifen plus adjuvant chemotherapy sequentially (T = 1) and 148 on tamoxifen alone (T = 0)), expression levels of 16 breast cancer-related and 5 reference genes were measured on tumor samples obtained at surgery (before adjuvant chemotherapy), and the Recurrence Score was calculated.


We use the SS8814 data to explore alternative combinations of the 16 breast cancer related genes that are optimized for treatment selection. In these data, there were 80 deaths or breast cancer recurrences by 5 years (35 given T = 0 and 45 given T = 1). There was little censoring; for nine subjects censored before 5 years, we assume D = 0. Because the data are not currently available for public use, we modified the gene values but preserved the basic underlying structure of the data. Specifically, we use scaled versions of the markers (mean centered with unit variance) and un-labeled genes. A modified version of the original Recurrence Score was used. Combinations of the following marker sets were considered for their potential to guide treatment decisions: (1) the modified risk score (MRS); (2) three genes, G1, G2, and G3, that showed evidence of marker-by-treatment interactions in a multivariate linear logistic regression model; and (3) two genes, G4 and G5, that exhibited a significant three-way interaction TG4G5 in a linear logistic regression model.

We implement the following approaches: linear logistic MLE, linear logistic boosting, maximization of the IPWE or AIPWE of θ described by Zhang et al. (2012), a single classification tree with marker-by-treatment interactions, and classification tree boosting. The tuning parameters M_max and w{Δ̂(Y)} varied across marker sets and were determined using cross-validation (see Web Appendix A); C_M was set to 500 (see Web Appendix A). To assess model performance, we calculate the apparent performance (θ̂{φ̂(Y)}) using the original (training) data and use the percentile bootstrap to calculate a 95% confidence interval. A bootstrap-bias-corrected estimate of model performance (θ̂_c{φ̂_b(Y)}) (Efron and Tibshirani, 1993) is also calculated, along with a 95% confidence interval obtained using the double bootstrap (see Web Appendix B).
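For illustration, a minimal sketch of a bootstrap bias correction in this spirit follows; it uses the standard optimism-style correction of Efron and Tibshirani (1993), which may differ in detail from the double-bootstrap procedure of Web Appendix B. The helper fit_rule(train, eval) (train a marker combination on one data set, return φ̂ evaluated on another) is hypothetical, and theta_hat() is from the earlier sketch.

```r
## Minimal sketch of bootstrap bias correction for theta-hat; `fit_rule`
## (train a combination, return the 0/1 rule phi-hat on `eval`) is a
## hypothetical building block, and theta_hat() is from the earlier sketch.
theta_bias_corrected <- function(dat, fit_rule, B = 200) {
  apparent <- theta_hat(dat$D, dat$Trt, fit_rule(train = dat, eval = dat))
  optimism <- mean(replicate(B, {
    boot <- dat[sample(nrow(dat), replace = TRUE), ]
    # apparent performance on the bootstrap sample, minus the performance of
    # the bootstrap-trained rule when evaluated on the original data
    theta_hat(boot$D, boot$Trt, fit_rule(train = boot, eval = boot)) -
      theta_hat(dat$D, dat$Trt, fit_rule(train = boot, eval = dat))
  }))
  apparent - optimism   # bias-corrected estimate of theta
}
```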

Performance measures of the various marker combinations are shown in Table 4. For every set of markers, maximizing the AIPWE of θ, linear logistic boosting, a single classification tree, or classification tree boosting yields a combination of markers with better performance (higher θ̂) than that obtained using linear logistic MLE. For example, for the models including G4, G5, and G4G5, classification tree boosting yields a marker combination associated with a 9% decrease in 5-year recurrence or death (95% CI: 8–18%), and the marker combination maximizing the AIPWE of θ yields a 3% decrease (95% CI: 2–11%). In contrast, the combination derived using linear logistic MLE yields a 0.3% decrease (95% CI: −2 to 8%). These new combinations of markers may have improved ability to identify a subgroup of women who can avoid adjuvant chemotherapy, in terms of providing a lower population rate of 5-year death or recurrence. For example, the best function of the MRS is estimated to yield a 5% reduction in 5-year death or recurrence (95% CI: 4–13%), while allowing 64% of women to avoid adjuvant chemotherapy.

Table 4
Performance of marker combinations in the breast cancer data. Models including the modified risk score (MRS); genes G1, G2, and G3; and genes G4, G5, and G4 × G5 are shown. The following methods are used to obtain marker combinations: (a) logistic regression with maximum likelihood estimation (MLE); (b) the boosting method with linear logistic working model; (c) maximizing the IPW estimator and (d) the AIPW estimator of θ as described by Zhang et al. (2012); (e) a single classification tree with interactions between markers and treatment; and (f) the boosting method with classification tree working model. For each method, the apparent model performance (θ̂{φ̂(Y)}) with bootstrap-based 95% confidence interval, and the bootstrap-bias-corrected estimate of model performance (θ̂_c{φ̂_b(Y)}) with bootstrap-based 95% confidence interval, are shown.

MRS
  θ̂{φ̂(Y)}      (a) 0.026 (−0.017, 0.068)   (b) 0.034 (−0.015, 0.072)   (c) −0.012 (−0.062, 0.073)
                 (d) 0.044 (0.005, 0.092)    (e) 0.052 (−0.011, 0.153)   (f) 0.074 (0.027, 0.159)
  θ̂_c{φ̂_b(Y)}  (a) 0.002 (−0.010, 0.060)   (b) 0.008 (−0.014, 0.062)   (c) −0.009 (−0.042, 0.072)
                 (d) 0.033 (−0.001, 0.087)   (e) 0.040 (0.025, 0.127)    (f) 0.051 (0.037, 0.133)

(G1, G2, G3)
  θ̂{φ̂(Y)}      (a) 0.052 (−0.012, 0.110)   (b) 0.060 (0.013, 0.121)    (c) 0.063 (−0.034, 0.115)
                 (d) 0.080 (0.053, 0.146)    (e) 0.075 (0.031, 0.169)    (f) 0.095 (0.074, 0.206)
  θ̂_c{φ̂_b(Y)}  (a) 0.037 (−0.009, 0.102)   (b) 0.048 (0.000, 0.117)    (c) 0.028 (−0.010, 0.086)
                 (d) 0.063 (0.000, 0.137)    (e) 0.054 (0.000, 0.132)    (f) 0.081 (0.000, 0.159)

(G4, G5, G4 × G5)
  θ̂{φ̂(Y)}      (a) 0.011 (−0.027, 0.095)   (b) 0.049 (−0.013, 0.102)   (c) 0.050 (−0.013, 0.094)
                 (d) 0.057 (0.028, 0.126)    (e) 0.053 (0.025, 0.159)    (f) 0.125 (0.087, 0.240)
  θ̂_c{φ̂_b(Y)}  (a) 0.003 (−0.016, 0.075)   (b) 0.015 (−0.013, 0.087)   (c) 0.022 (−0.018, 0.076)
                 (d) 0.034 (0.018, 0.110)    (e) 0.042 (0.032, 0.124)    (f) 0.089 (0.080, 0.175)

Observe in Table 5 that the differences in performance between models are due to a large proportion of subjects being differently classified according to treatment benefit using linear logistic MLE versus the other approaches. The results also suggest that the linear logistic model may not hold for the modified risk score, since maximizing the AIPWE of θ and classification tree boosting produce substantially higher θ̂ than linear logistic MLE.

Table 5
Estimated treatment rules in the breast cancer data. Rules from linear logistic MLE (rows) are cross-classified against rules from (b) linear logistic boosting, (c) maximizing the IPW estimator of θ, (d) maximizing the AIPW estimator of θ (Zhang et al., 2012), (e) a single classification tree with marker-by-treatment interactions, and (f) classification tree boosting; each method's pair of cells gives counts with Δ̂(Y) ≤ 0 and Δ̂(Y) > 0. Cells show the number (%) of subjects in the data set. Subjects with Δ̂(Y) ≤ 0 are recommended tamoxifen alone (T = 0) and those satisfying Δ̂(Y) > 0 are recommended adjuvant chemotherapy (T = 1).

MRS
  MLE Δ̂(Y) ≤ 0:  (b) 169 (46), 0 (0)    (c) 169 (46), 0 (0)    (d) 169 (46), 0 (0)    (e) 169 (46), 0 (0)    (f) 169 (46), 0 (0)    | total 169 (46)
  MLE Δ̂(Y) > 0:  (b) 9 (2), 189 (51)    (c) 155 (42), 43 (12)  (d) 17 (5), 181 (49)   (e) 107 (29), 91 (25)  (f) 67 (18), 131 (36)  | total 198 (54)
  Column totals:  (b) 178 (49), 189 (51) (c) 324 (88), 43 (12)  (d) 186 (51), 181 (49) (e) 276 (75), 91 (25)  (f) 236 (64), 131 (36) | 367 (100)

(G1, G2, G3)
  MLE Δ̂(Y) ≤ 0:  (b) 171 (47), 13 (4)   (c) 109 (30), 75 (20)  (d) 73 (20), 111 (30)  (e) 160 (44), 24 (7)   (f) 138 (38), 46 (13)  | total 184 (50)
  MLE Δ̂(Y) > 0:  (b) 12 (3), 171 (47)   (c) 19 (5), 164 (45)   (d) 13 (4), 170 (46)   (e) 103 (28), 80 (22)  (f) 79 (22), 104 (28)  | total 183 (50)
  Column totals:  (b) 183 (50), 184 (50) (c) 128 (35), 239 (65) (d) 86 (23), 281 (77)  (e) 263 (72), 104 (28) (f) 217 (59), 150 (41) | 367 (100)

(G4, G5, G4 × G5)
  MLE Δ̂(Y) ≤ 0:  (b) 88 (24), 26 (7)    (c) 50 (14), 64 (17)   (d) 98 (27), 16 (4)    (e) 98 (27), 16 (4)    (f) 42 (11), 72 (20)   | total 114 (31)
  MLE Δ̂(Y) > 0:  (b) 69 (19), 184 (50)  (c) 68 (19), 185 (50)  (d) 75 (20), 178 (49)  (e) 173 (47), 80 (22)  (f) 77 (21), 176 (48)  | total 253 (69)
  Column totals:  (b) 157 (43), 210 (57) (c) 118 (32), 249 (68) (d) 173 (47), 194 (53) (e) 271 (74), 96 (26)  (f) 119 (32), 248 (68) | 367 (100)

These results must be interpreted with caution, however, since even our bootstrap-bias-corrected estimates of model performance may be overoptimistic. With the small sample size, cross-validation did not produce satisfactory estimates of test data performance; results were highly dependent on the random seed used to split the data. Other bias correction approaches such as the 0.632 bootstrap method (Efron and Tibshirani, 1993) do not appear to apply to the measure θ. Obtaining sufficiently large data sets to validate marker combinations is a pervasive challenge for the treatment selection field.

5. Discussion

This article describes a novel application of boosting to combining markers for predicting treatment effect. The approach is intended to build in robustness to risk model mis-specification by averaging across risk models fit by iteratively upweighting subjects potentially misclassified according to treatment benefit at the previous stage. We evaluate the performance of the approach using clinically relevant measures and find several settings in which the boosting method results in combinations of markers that have closer-to-optimal performance than combinations derived using less-robust existing approaches. Specifically, boosting appears advantageous under substantial risk model mis-specification and in settings with high leverage points. Our analysis of the breast cancer data suggests that, in these data, boosting can yield new marker combinations that may have superior ability to identify women who do not benefit from adjuvant chemotherapy.

A simple approach to combining markers for treatment selection is to apply one of the plethora of methods available for combining markers for classification separately to each treatment group. As discussed by Claggett et al. (2011), however, the two best performing risk models for each treatment group do not necessarily produce the best model for treatment effect. This strategy risks missing markers that are strongly associated with treatment effect but which have modest main effects, and risks including markers which have strong main effects but modest interactions with treatment. For example, human epidermal growth factor receptor 2 (HER-2) is not considered a significant predictor of cancer recurrence in breast cancer patients, while it is an important predictor of the effects of some adjuvant chemotherapies and hormone therapies (Clark, 1995; Henry and Hayes, 2006). In our simulations, fitting a risk model to each treatment group separately tended to produce marker combinations with inferior performance compared to those that simultaneously considered both treatment groups, such as the novel boosting method.

When evaluating candidate approaches for combiningmarkers, it is important that methods be compared with re-spect to compelling and clinically relevant measures of modelperformance. Measures such as the frequency of correct vari-ables selected (Gunter et al., 2007; Lu et al., 2013), the areaunder the receiver operating characteristic curve (AUC) foreach treatment group (Claggett et al., 2011) and the meansquared error (MSE) of model coefficients (Lu et al., 2013) suf-fer from lack of clinical interpretation and do not characterizethe benefit of the marker combination. The rate of incorrecttreatment recommendation, MCRTB, is appealing and useful

Page 10: Combining Biomarkers to Optimize Patient Treatment …quefeng/publication/Biometrics... · 2018-01-18 · Biometrics 70, 695–720 DOI: 10.1111/biom.12191 September 2014 Combining

704 Biometrics, September 2014

Table

4Per

form

ance

ofm

ark

erco

mbi

nations

inth

ebr

east

cance

rdata

.M

odel

sin

cludin

gth

em

odifi

edri

sksc

ore

(MR

S);

genes

G1,G

2and

G3;and

genes

G4,G

5and

G4

×G

5are

shown.T

he

followin

gm

ethod

sare

use

dto

obt

ain

mark

erco

mbi

nations:

Log

istic

regr

essi

on

with

maxi

mum

like

lihoo

des

tim

ation

(MLE),

the

boost

ing

met

hod

with

linea

rlo

gist

icwork

ing

mod

el,m

axi

miz

ing

the

IPW

or

AIP

Wes

tim

ato

rof

θas

des

crib

edby

Zhang

etal.

(2012),

asi

ngl

ecl

ass

ifica

tion

tree

with

inte

ract

ions

betw

een

mark

ers

and

trea

tmen

t,and

the

boost

ing

met

hod

with

class

ifica

tion

tree

work

ing

mod

el.For

each

met

hod

the

appa

rentm

odel

perf

orm

ance

( θ{φ

(Y)})

and

corr

espo

ndin

gbo

ots

trap-b

ase

d95%

confiden

cein

terv

al;

and

boots

trap-b

ias-

corr

ecte

des

tim

ate

ofm

odel

perf

orm

ance

(θc{φ

b(Y

)})and

corr

espo

ndin

gbo

ots

trap-b

ase

d95%

confiden

cein

terv

al,

are

shown.

Work

ing

Cla

ssifi

cati

on

tree

model

Lin

ear

logis

tic

wit

hin

tera

ctio

ns

Max

θSin

gle

Mark

erFit

ting

MLE

Boost

ing

Boost

ing

set

(Y)

alg

ori

thm

(IP

WE

)(A

IPW

E)

tree

θ{φ

(Y)}

0.0

26

0.0

34

−0.0

12

0.0

44

0.0

52

0.0

74

MR

S95%

CI

(−0.0

17

,0.0

68)

(−0.0

15

,0.0

72)

(−0.0

62

,0.0

73)

(0.0

05

,0.0

92)

(−0.0

11

,0.1

53)

(0.0

27

,0.1

59)

θ c{φ

b(Y

)}0.0

02

0.0

08

−0.0

09

0.0

33

0.0

40

0.0

51

95%

CI

(−0.0

10

,0.0

60)

(−0.0

14

,0.0

62)

(−0.0

42

,0.0

72)

(−0.0

01

,0.0

87)

(0.0

25

,0.1

27)

(0.0

37

,0.1

33)

θ{ φ

(Y)}

0.0

52

0.0

60

0.0

63

0.0

80

0.0

75

0.0

95

(G1,G

2,G

3)

95%

CI

(−0.0

12

,0.1

10)

(0.0

13

,0.1

21)

(−0.0

34

,0.1

15)

(0.0

53

,0.1

46)

(0.0

31

,0.1

69)

(0.0

74,0.2

06)

θ c{φ

b(Y

)}0.0

37

0.0

48

0.0

28

0.0

63

0.0

54

0.0

81

95%

CI

(−0.0

09

,0.1

02)

(0.0

00

,0.1

17)

(−0.0

10

,0.0

86)

(0.0

00

,0.1

37)

(0.0

00

,0.1

32)

(0.0

00

,0.1

59)

θ{φ

(Y)}

0.0

11

0.0

49

0.0

50

0.0

57

0.0

53

0.1

25

(G4,G

5,G

G5)

95%

CI

(−0.0

27

,0.0

95)

(−0.0

13

,0.1

02)

(−0.0

13

,0.0

94)

(0.0

28

,0.1

26)

(0.0

25

,0.1

59)

(0.0

87

,0.2

40)

θ c{φ

b(Y

)}0.0

03

0.0

15

0.0

22

0.0

34

0.0

42

0.0

89

95%

CI

(−0.0

16

,0.0

75)

(−0.0

13

,0.0

87)

(−0.0

18

,0.0

76)

(0.0

18

,0.1

10)

(0.0

32

,0.1

24)

(0.0

80

,0.1

75)

Page 11: Combining Biomarkers to Optimize Patient Treatment …quefeng/publication/Biometrics... · 2018-01-18 · Biometrics 70, 695–720 DOI: 10.1111/biom.12191 September 2014 Combining

Combine Markers for Treatment Selection 705

Table 5
Estimated treatment rules in the breast cancer data obtained using the following methods: logistic regression with maximum likelihood estimation (MLE); the boosting method with linear logistic working model; maximizing the IPW or AIPW estimator of θ as described by Zhang et al. (2012); a single classification tree with interactions between markers and treatment; and the boosting method with classification tree working model. Cells show the number (%) of subjects in the data set cross-classified by the treatment rules from linear logistic MLE and the other marker combination approaches. Subjects with Δ̂(Y) ≤ 0 are recommended tamoxifen alone (T = 0) and those satisfying Δ̂(Y) > 0 are recommended adjuvant chemotherapy (T = 1). Within each cell below, the two entries are the counts classified as Δ̂(Y) ≤ 0 / Δ̂(Y) > 0 by the column method.

MRS (rows: linear logistic MLE)
                 Linear logistic       Max θ (IPWE)         Max θ (AIPWE)        Single tree          Tree boosting        Total
                 boosting
Δ̂(Y) ≤ 0        169 (46) / 0 (0)      169 (46) / 0 (0)     169 (46) / 0 (0)     169 (46) / 0 (0)     169 (46) / 0 (0)     169 (46)
Δ̂(Y) > 0        9 (2) / 189 (51)      155 (42) / 43 (12)   17 (5) / 181 (49)    107 (29) / 91 (25)   67 (18) / 131 (36)   198 (54)
Total            178 (49) / 189 (51)   324 (88) / 43 (12)   186 (51) / 181 (49)  276 (75) / 91 (25)   236 (64) / 131 (36)  367 (100)

(G1, G2, G3) (rows: linear logistic MLE)
Δ̂(Y) ≤ 0        171 (47) / 13 (4)     109 (30) / 75 (20)   73 (20) / 111 (30)   160 (44) / 24 (7)    138 (38) / 46 (13)   184 (50)
Δ̂(Y) > 0        12 (3) / 171 (47)     19 (5) / 164 (45)    13 (4) / 170 (46)    103 (28) / 80 (22)   79 (22) / 104 (28)   183 (50)
Total            183 (50) / 184 (50)   128 (35) / 239 (65)  86 (23) / 281 (77)   263 (72) / 104 (28)  217 (59) / 150 (41)  367 (100)

(G4, G5, G4 × G5) (rows: linear logistic MLE)
Δ̂(Y) ≤ 0        88 (24) / 26 (7)      50 (14) / 64 (17)    98 (27) / 16 (4)     98 (27) / 16 (4)     42 (11) / 72 (20)    114 (31)
Δ̂(Y) > 0        69 (19) / 184 (50)    68 (19) / 185 (50)   75 (20) / 178 (49)   173 (47) / 80 (22)   77 (21) / 176 (48)   253 (69)
Total            157 (43) / 210 (57)   118 (32) / 249 (68)  173 (47) / 194 (53)  271 (74) / 96 (26)   119 (32) / 248 (68)  367 (100)


for simulation studies evaluating new methods. The decrease in the disease rate under marker-based treatment, measured by θ, has clear relevance. This measure, or a variation on it, has been advocated in several recent articles on evaluating treatment selection markers (Song and Pepe, 2004; Gunter et al., 2007, 2011b; Brinkley et al., 2010; Qian and Murphy, 2011; Janes et al., 2011, 2014; Zhang et al., 2012). θ is composed of the proportion of subjects who are marker-negative and the treatment effect in the marker-negative subgroup. While these constituents inform about the nature of the markers' effect, neither can serve as the sole basis for comparing combinations of markers.

The relative performance of the different approaches to combining markers for treatment selection depends on the scale of the outcome. While many of the methods to date have focused on the continuous outcome setting, this article compares approaches given a binary outcome. In particular, we present results on the approach of maximizing the IPWE or AIPWE of θ compared to logistic regression MLE, whereas the original article (Zhang et al., 2012) focused on a continuous outcome and linear regression. In our simulation study, improving upon logistic regression proved difficult. Even under risk model mis-specification, maximizing the IPWE or AIPWE of θ resulted in only moderately higher mean θ in most scenarios. In Scenario 4, constructed to be similar to the first simulation scenario of Zhang et al. (2012), maximizing the IPWE or AIPWE of θ did not yield a marker combination with performance superior to that associated with logistic regression MLE. Based on these results, it appears more difficult to improve upon logistic regression for binary outcomes than it is to improve upon linear regression for continuous outcomes. Pepe et al. (2005) also found logistic regression to be remarkably robust in the classification context.

The boosting method described here warrants further research along several avenues. The method can be generalized naturally to settings where the outcome does not capture all consequences of treatment and therefore the optimal treatment rule is Δ(Y) ≤ δ for some δ > 0 (Vickers et al., 2007; Janes et al., 2014). Continuous outcomes and time-to-event outcomes could also be accommodated. Further investigation of the optimal weight function for the boosting method is of interest. The method could be extended to settings with marker values missing at random, multiple treatment options, or the observational study setting. Another challenge is variable selection in the treatment selection context. Application of boosting with a penalized regression working model is one potential approach that would accommodate high-dimensional markers.

6. Supplementary Materials

Web Appendix A, referenced in Sections 2.3 and 4; Web Appendix B, referenced in Section 4; and R code to perform the estimation are available with this paper at the Biometrics website on Wiley Online Library.

Acknowledgements

This work was funded by R01 CA152089, P30 CA015704, and R01 GM106177-01.

References

Albain, K., Barlow, W., Ravdin, P., Farrar, W., Burton, G., Ketchel, S., Cobau, C., Levine, E., Ingle, J., Pritchard, K., Lichter, A., Schneider, D., Abeloff, M., Henderson, I., Muss, H., Green, S., Lew, D., Livingston, R., Martino, S., and Osborne, C. (2010a). Adjuvant chemotherapy and timing of tamoxifen in postmenopausal patients with endocrine-responsive, node-positive breast cancer: A phase 3, open-label, randomised controlled trial. The Lancet 374, 2055–2063.

Albain, K., Barlow, W., Shak, S., Hortobagyi, G., Livingston, R., Yeh, I., Ravdin, P., Bugarini, R., Baehner, F., Davidson, N., Sledge, G., Winer, E., Hudis, C., Ingle, J., Perez, E., Pritchard, K., Shepherd, L., Gralow, J., Yoshizawa, C., Allred, D., Osborne, C., and Hayes, D. (2010b). Prognostic and predictive value of the 21-gene recurrence score assay in postmenopausal women with node-positive, oestrogen-receptor-positive breast cancer on chemotherapy: A retrospective analysis of a randomised trial. The Lancet Oncology 11, 55–65.

Breiman, L., Friedman, J., Stone, C., and Olshen, R. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth/Chapman & Hall/CRC.

Brinkley, J., Tsiatis, A., and Anstrom, K. (2010). A generalized estimator of the attributable benefit of an optimal treatment regime. Biometrics 66, 512–522.

Cai, T., Tian, L., Wong, P., and Wei, L. (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12, 270–282.

Chu, N., Ma, L., Liu, P., Hu, Y., and Zhou, M. (2011). A comparative analysis of methods for probability estimation tree. WSEAS Transactions on Computers 10, 71–80.

Claggett, B., Zhao, L., Tian, L., Castagno, D., and Wei, L. (2011). Estimating subject-specific treatment differences for risk-benefit assessment with competing risk event-time data. Harvard University Biostatistics Working Paper Series, 125.

Clark, G. (1995). Prognostic and predictive factors for breast cancer. Breast Cancer 2, 79–89.

Culp, M., Johnson, K., and Michailidis, G. (2012). ada: An R Package for Stochastic Boosting. R package version 2.0-3.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Vol. 57. London: CRC Press.

Etzioni, R., Kooperberg, C., Pepe, M., Smith, R., and Gann, P. (2003). Combining biomarkers to detect disease with application to prostate cancer. Biostatistics 4, 523–538.

Foster, J. C., Taylor, J. M. G., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30, 2867–2880.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics 28, 337–407.

Gunter, L., Zhu, J., and Murphy, S. (2007). Variable selection for optimal decision making. Artificial Intelligence in Medicine 4594, 149–154.

Gunter, L., Zhu, J., and Murphy, S. (2011a). Variable selection for qualitative interactions. Statistical Methodology 8, 42–55.

Gunter, L., Zhu, J., and Murphy, S. (2011b). Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. Journal of Biopharmaceutical Statistics 21, 1063–1078.


Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Vol. 1. New York: Springer Series in Statistics.

Henry, N. and Hayes, D. (2006). Uses and abuses of tumor markers in the diagnosis, monitoring, and treatment of primary and metastatic breast cancer. The Oncologist 11, 541–552.

Janes, H., Brown, M., Pepe, M., and Huang, Y. (in press). An approach to evaluating and comparing biomarkers for patient treatment selection. International Journal of Biostatistics.

Janes, H., Pepe, M., and Bossuyt, P. (2014). A framework for evaluating markers used to select patient treatment. Medical Decision Making 34, 159–167.

Janes, H., Pepe, M., Bossuyt, P., and Barlow, W. (2011). Measuring the performance of markers for guiding treatment decisions. Annals of Internal Medicine 154, 253.

Lu, W., Zhang, H., and Zeng, D. (2013). Variable selection for optimal treatment decision. Statistical Methods in Medical Research 22, 493–504.

Mebane, Jr., W. R. and Sekhon, J. S. (2011). Genetic optimization using derivatives: The rgenoud package for R. Journal of Statistical Software 42, 1–26.

Paik, S., Shak, S., Tang, G., Kim, C., Baker, J., Cronin, M., Baehner, F., Walker, M., Watson, D., Park, T., Hiller, W., Fisher, E., Wickerham, D., Bryant, J., and Wolmark, N. (2004). A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New England Journal of Medicine 351, 2817–2826.

Paik, S., Tang, G., Shak, S., Kim, C., Baker, J., Kim, W., Cronin, M., Baehner, F., Watson, D., Bryant, J., Costantino, J., Geyer Jr., C., Wickerham, D., and Wolmark, N. (2006). Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. Journal of Clinical Oncology 24, 3726–3734.

Pepe, M., Cai, T., and Longton, G. (2005). Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 62, 221–229.

Provost, F. and Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning 52, 199–215.

Qian, M. and Murphy, S. (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39, 1180–1210.

Sekhon, J. S. and Mebane, Jr., W. R. (1998). Genetic optimization using derivatives: Theory and application to nonlinear models. Political Analysis 7, 189–213.

Song, X. and Pepe, M. (2004). Evaluating markers for selecting a patient's treatment. Biometrics 60, 874–883.

Therneau, T., Atkinson, B., and Ripley, B. (2012). rpart: Recursive Partitioning. R package version 4.0-1.

Vickers, A., Kattan, M., and Sargent, D. (2007). Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials 8, 14.

Zhang, B., Tsiatis, A., Laber, E., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.

Zhao, X., Dai, W., Li, Y., and Tian, L. (2011). AUC-based biomarker ensemble with an application on gene scores predicting low bone mineral density. Bioinformatics 27, 3050–3055.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Received March 2013. Revised May 2013. Accepted December 2013.

Discussions

Eric B. Laber,* Anastasios A. Tsiatis, Marie Davidian, and Shannon T. Holloway

Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-8203, U.S.A.
*email: [email protected]

1. Introduction

We congratulate Kang, Janes, and Huang (hereafter KJH) on an interesting and powerful new method for estimating an optimal treatment rule, also referred to as an optimal treatment regime. Their proposed method relies on having a high-quality estimator for the regression of outcome on biomarkers and treatment, which the authors obtain using a novel boosting algorithm. Methods for constructing treatment rules/regimes that rely on outcome models are sometimes called indirect or regression-based methods because the treatment rule is inferred from the outcome model (Barto and Dietterich, 2004).

Regression-based methods are appealing because they can be used to make prognostic predictions as well as treatment recommendations. While it is common practice to use parametric or semiparametric models in regression-based approaches (Robins, 2004; Chakraborty and Moodie, 2013; Laber, Linn, and Stefanski, in press; Schulte et al., in press), there is growing interest in using nonparametric methods to avoid model misspecification (Zhao et al., 2011; Moodie, Dean, and Sun, 2013). In contrast, direct estimation methods, also known as policy-search methods, try to weaken or eliminate dependence on correct outcome models and instead attempt to search for the best treatment rule within a


pre-specified class of rules (Orellana, Rotnitzky, and Robins, 2010; Zhang et al., 2012a,b, 2013; Zhao et al., 2012). Direct estimation methods make fewer assumptions about the outcome model, which may make them more robust to model misspecification but potentially more variable.

We derive a direct estimation analog to the method of KJH, which we term value boosting. The method is based on recasting the problem of estimating an optimal treatment rule as a weighted classification problem (Zhang et al., 2012a; Zhao et al., 2012). We show how the method of KJH can be used with existing policy-search methods to construct a treatment rule that is interpretable, logistically feasible, parsimonious, or otherwise appealing.

2. Setup and Notation

We assume that the available data are (Di, Ti, Yi), i = 1, . . . , n, which comprise n independent and identically distributed copies of (D, T, Y), where Y ∈ R^p denotes biomarker information; T ∈ {0, 1} denotes treatment received; and D ∈ R denotes the outcome of interest, coded so that higher values are better, as is customary in the treatment regime literature. To match the development of KJH, we assume that the data are collected in a randomized clinical trial so that π(t|y) = P(T = t|Y = y) is known by design, where ε < π(t|Y) < 1 − ε for some ε > 0 with probability one. This setup includes the binary outcome considered by KJH as a special case with the roles of D = 1 and D = 0 interchanged.

A treatment rule g is a map from the domain of Y into that of T, so that a patient presenting with Y = y is recommended treatment g(y). Define the value of a rule g, denoted V(g), as the expected outcome if all patients are treated according to g. An optimal regime, g^opt, satisfies V(g^opt) ≥ V(g) for all other rules g under consideration. Under suitable conditions (e.g., Zhang et al., 2012b), V(g) = E[E{D|Y, T = g(Y)}]. Letting Q(y, t) = E(D|Y = y, T = t), g^opt(y) = arg max_t Q(y, t). The foregoing expression motivates estimating g^opt by first estimating the function Q(y, t) by regressing D on Y and T to obtain Q̂(y, t), and then obtaining ĝ_Reg(y) = arg max_t Q̂(y, t). We refer to approaches of this form as regression-based methods; the method proposed by KJH is of this type.

An alternative approach is to construct first an estimator for V(g), say V̂(g), and subsequently obtain ĝ_Val = arg max_{g ∈ G} V̂(g) for some suitable class of rules G. We refer to approaches of this form as policy-search methods. Policy-search and regression-based methods are not clearly delineated. For example, one could estimate V(g) by n^{-1} Σ_{i=1}^n Q̂{Y_i, g(Y_i)}, in which case ĝ_Val = ĝ_Reg provided that ĝ_Reg ∈ G. Let B̂_KJH(Y) be the estimator of Q(Y, 1) − Q(Y, 0) proposed by KJH. Because Q{Y, g(Y)} = Q(Y, 0) + g(Y)B(Y), the method of KJH can be viewed as a policy-search estimator of the form arg max_{g ∈ G} n^{-1} Σ_{i=1}^n g(Y_i)B̂_KJH(Y_i). However, the success of this approach relies on the correctness of the outcome model.

Generally, policy-search methods employ estimators for V(g) that do not rely heavily on correctness of an outcome model (Zhang et al., 2012b, 2013; Zhao et al., 2012). In the clinical trial setting considered here, both the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE) described by KJH are consistent estimators for V(g), pointwise in g, that do not require a consistent outcome model. We assume subsequently that the AIPWE is used to estimate V(g). Let P_n denote the empirical measure, so that P_n f(Z) = n^{-1} Σ_{i=1}^n f(Z_i). The AIPWE is

    V̂_AIPWE(g) = P_n [ 1_{T = g(Y)} D / π_c(Y; g) − {1_{T = g(Y)} − π_c(Y; g)} / π_c(Y; g) · Q̂{Y, g(Y)} ],    (1)

where 1_ν is one if ν is true and zero otherwise, π_c(Y; g) = π{g(Y)|Y}, and Q̂(y, t) is an estimator for Q(y, t). The AIPWE can be viewed as adding a mean zero but negatively correlated term to the IPWE to gain efficiency (Scharfstein, Rotnitzky, and Robins, 1999). Define

    Ĉ(D, Y, T) = T D / π(1|Y) − {T − π(1|Y)} / π(1|Y) · Q̂(Y, 1)
                 − (1 − T) D / {1 − π(1|Y)} − {T − π(1|Y)} / {1 − π(1|Y)} · Q̂(Y, 0),

and Ĉ_i = Ĉ(D_i, Y_i, T_i). Let Ŵ_i = |Ĉ_i| and Ẑ_i = 1_{Ĉ_i > 0}. Zhang et al. (2012a) showed that

    arg max_{g ∈ G} V̂_AIPWE(g) = arg min_{g ∈ G} P_n Ŵ 1_{Ẑ ≠ g(Y)},

and thus the policy-search objective function using (1) is equivalent to a weighted misclassification error with predictor-label pairs {(Y_i, Ẑ_i)}_{i=1}^n and weights {Ŵ_i}_{i=1}^n. Taking this approach, any classification algorithm that can incorporate weights can be used to estimate an optimal treatment regime. In the next section, we use this framework to develop a policy-search boosting algorithm for estimating g^opt.
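To make the weighted-classification construction concrete, the following minimal R sketch builds the pairs (Y_i, Ẑ_i) and weights Ŵ_i from a posited logistic working model for Q. The data frame dat (columns D, T, Y1, Y2, with binary D coded so that larger is better, as in the special case above) and the constant randomization probability pi1 are illustrative assumptions, not part of the discussion.

    make_class_data <- function(dat, pi1 = 0.5) {
      ## Working outcome model Q-hat: logistic regression of D on markers and T
      qfit <- glm(D ~ (Y1 + Y2) * T, family = binomial, data = dat)
      Q1 <- predict(qfit, transform(dat, T = 1), type = "response")  # Q-hat(Y, 1)
      Q0 <- predict(qfit, transform(dat, T = 0), type = "response")  # Q-hat(Y, 0)
      ## AIPW contrast C-hat(D, Y, T) as displayed above
      C <- dat$T * dat$D / pi1 - (dat$T - pi1) / pi1 * Q1 -
           (1 - dat$T) * dat$D / (1 - pi1) - (dat$T - pi1) / (1 - pi1) * Q0
      data.frame(dat, W = abs(C), Z = as.numeric(C > 0))   # weights and labels
    }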

3. Value Boosting

3.1. Policy-Search Boosting Algorithm

To apply boosting using this classification perspective, we require an algorithm that can accommodate weights. However, standard boosting algorithms assume unweighted data. To provide intuition and to match the development of KJH, we describe a simple boosting algorithm with a logistic regression model as the base classifier, although other choices are possible. To gain intuition, consider first a thought experiment. Suppose that each weight Ŵ_i is a rational number expressed as q_i/r_i, where q_i and r_i are positive integers. Let N denote the smallest common multiple of r_1, . . . , r_n, and create a new, augmented data set with q_iN/r_i copies of (Y_i, Ẑ_i). Let D_Aug = {(Y_ik, Ẑ_ik) : k = 1, . . . , q_iN/r_i, i = 1, . . . , n} denote this augmented data set, where Y_ik ≡ Y_i and Ẑ_ik ≡ Ẑ_i for i = 1, . . . , n. In this case

    V̂_AIPWE(g) ∝ Σ_{i=1}^n Σ_{k=1}^{q_iN/r_i} 1_{Ẑ_ik ≠ g(Y_ik)},

which is the (unweighted) misclassification rate of the rule g evaluated on the augmented data set. Thus, we can apply a boosting algorithm without modification to the augmented data set. The algorithm we consider is as follows:

1. Initialize weights ω_ik^(1) = 1, k = 1, . . . , Nq_i/r_i, i = 1, . . . , n.
2. Repeat for b = 1, . . . , B:
   (a) Fit a logistic regression model ĝ_n^(b) to D_Aug with weights ω_ik^(b), k = 1, . . . , q_iN/r_i, i = 1, . . . , n.
   (b) Compute m_ik^(b) = 1_{ĝ_n^(b)(Y_ik) ≠ Ẑ_ik} and the weighted misclassification error e^(b) = Σ_{i,k} ω_ik^(b) m_ik^(b) / Σ_{i,k} ω_ik^(b).
   (c) Set ω_ik^(b+1) = ω_ik^(b) {(1 − e^(b))/e^(b)}^{m_ik^(b)}.
3. The final boosted treatment rule is

    ĝ_n(Y) = 1 if Σ_{b=1}^B log{(1 − e^(b))/e^(b)} ĝ_n^(b)(Y) > (1/2) Σ_{b=1}^B log{(1 − e^(b))/e^(b)},
    and 0 otherwise.

This algorithm is not computationally feasible because D_Aug is potentially very large. However, it suggests a simple weighted procedure that can be applied to the original data. First, note that step 2(a) is equivalent to fitting the logistic regression model ĝ_n^(b) using the original data {(Y_i, Ẑ_i)}_{i=1}^n with weights ω_i^(b) = ω_i1^(b) Ŵ_i. Second, defining m_i^(b) = 1_{ĝ_n^(b)(Y_i) ≠ Ẑ_i}, the weighted misclassification error at iteration b of the boosting algorithm is

    e^(b) = Σ_{i,k} ω_ik^(b) m_ik^(b) / Σ_{i,k} ω_ik^(b)
          = Σ_{i=1}^n (Nq_i/r_i) ω_i1^(b) m_i^(b) / Σ_{i=1}^n (Nq_i/r_i) ω_i1^(b)
          = Σ_{i=1}^n Ŵ_i ω_i1^(b) m_i^(b) / Σ_{i=1}^n Ŵ_i ω_i1^(b).

This suggests the following algorithm, which we term value boosting.

This suggests the following algorithm, which we term valueboosting.

1. Initialize weights ω(1)i = n−1, i = 1, . . . , n.

2. Repeat for b = 1, . . . , B

(a) Fit a logistic regression model g(b)n to {(Yi, Zi)}n

i=1

with weights ω(b)i , i = 1, . . . , n.

(b) Compute m(b)i = 1

g(b)n (Yi) �=Zi

and weighted misclas-

sification error e(b) = ∑iω

(b)i Wim

(b)i /

∑iω

(b)i Wi.

(c) Set ω(b+1)i = ω

(b)i

{(1 − e(b))/e(b)

}mi.

3. The final boosted treatment rule is

gn(Y) =

⎧⎪⎨⎪⎩1 if

∑B

b=1log{(1 − e(b))/e(b)}g(b)

n (Y)

> 12

∑B

b=1log{(1 − e(b))/e(b)},

0 otherwise.
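For concreteness, the value boosting recursion above can be sketched in a few lines of R. This is a minimal illustration, not the authors' implementation: the helper make_class_data() from the earlier sketch, the marker names Y1 and Y2, and B = 100 are all assumptions.

    value_boost <- function(cls, B = 100) {
      ## cls: data frame from make_class_data(), with labels Z in {0, 1},
      ## nonnegative weights W, and (illustrative) marker columns Y1, Y2
      n <- nrow(cls)
      omega <- rep(1 / n, n)
      alpha <- numeric(B)
      fits <- vector("list", B)
      for (b in seq_len(B)) {
        fit <- suppressWarnings(
          glm(Z ~ Y1 + Y2, family = binomial, data = cls,
              weights = omega * cls$W))                    # step 2(a)
        g <- as.numeric(predict(fit, cls, type = "response") > 0.5)
        m <- as.numeric(g != cls$Z)                        # misclassified at stage b
        e <- sum(omega * cls$W * m) / sum(omega * cls$W)   # weighted error e^(b)
        e <- min(max(e, 1e-6), 1 - 1e-6)                   # guard against 0 or 1
        alpha[b] <- log((1 - e) / e)
        omega <- omega * ((1 - e) / e)^m                   # step 2(c): upweight errors
        fits[[b]] <- fit
      }
      ## Step 3: weighted vote of the B base rules, thresholded at half the total
      function(newdata) {
        votes <- sapply(fits, function(f)
          as.numeric(predict(f, newdata, type = "response") > 0.5))
        as.numeric(drop(votes %*% alpha) > sum(alpha) / 2)
      }
    }

The returned function recommends treatment 1 exactly when the alpha-weighted vote of the base classifiers exceeds half the total vote, matching step 3 above.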

The form of the above algorithm can be viewed as a version of the adacost algorithm (Fan et al., 1999) and is thus one of many potential boosting algorithms that could be used to construct a direct search analog of the algorithm proposed by KJH.

Table 1
Performance of value boosting and the method proposed by KJH in terms of mean θ based on 1000 Monte Carlo data sets, n = 500. Monte Carlo standard deviations are given in parentheses. Scenario indicators correspond to those defined in Section 3.2 of KJH.

Scenario   True θ   Value boosting   KJH
1          0.127    0.113 (0.007)    0.118 (0.004)
2          0.124    0.125 (0.007)    0.128 (0.004)
3          0.134    0.129 (0.007)    0.134 (0.003)
4          0.066    0.036 (0.016)    0.059 (0.014)
5          0.095    0.072 (0.012)    0.077 (0.007)
6          0.139    0.126 (0.005)    0.045 (0.015)
7          0.142    0.124 (0.011)    0.111 (0.010)

3.2. Boosting the Value or Boosting the Contrast

Value boosting focuses on iteratively improving the performance of a treatment regime in terms of the estimated value. The method of KJH iteratively attempts to improve an estimator of the contrast function B(y) = Q(y, 1) − Q(y, 0). Because the optimal treatment regime is g^opt(y) = 1_{B(y) > 0} and, for any treatment regime g, V(g) − E Q(Y, 0) = E g(Y)B(Y), the problems of estimating the contrast and maximizing the value are closely related. However, value boosting directly targets g^opt rather than B, which may be advantageous in settings where g^opt is markedly more parsimonious than B(y); because g^opt is a function of B, it is necessarily more parsimonious in some sense. An exaggerated example that illustrates this point is Q(y, t) = t(Σ_{j=1}^p y_j^2)y_1; in this case B(y) = (Σ_{j=1}^p y_j^2)y_1 is a complex nonlinear function of all p variables, whereas g^opt(y) = 1_{y_1 > 0} is a univariate linear threshold, as the small numerical check below confirms. However, the operating characteristics of policy-search and regression-based estimators have not been directly compared on examples of this type; we believe such a comparison would provide valuable information on the differences between these two classes of methods.
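A two-line numerical check of this example (illustrative only):

    set.seed(1)
    p <- 5
    Y <- matrix(rnorm(200 * p), 200, p)
    B <- rowSums(Y^2) * Y[, 1]               # B(y) = (sum_j y_j^2) y_1
    identical(as.numeric(B > 0), as.numeric(Y[, 1] > 0))   # TRUE: sign set by y_1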

4. Simulation Experiments

To examine the finite sample performance of the value boosting algorithm, we use the seven simulation scenarios considered by KJH with n = 500. We use a logistic working model of the form Q(y, t) = P(D = 1|T = t, Y = y) = expit(ỹᵀβ_0 + t ỹᵀβ_1), where ỹ = (1, yᵀ)ᵀ; β_0 and β_1 are unknown parameters; and expit(u) = e^u/(1 + e^u). We estimate Q(y, t) using maximum likelihood. Our base classifier is a logistic model of the form P(Z = 1|Y = y) = expit(ψ_0ᵀỹ + ψ_12 y_1 y_2 + ψ_13 y_1 y_3 + ψ_23 y_2 y_3 + ψ_11 y_1^2 + ψ_22 y_2^2 + ψ_33 y_3^2). We estimate the performance of value boosting and the method proposed by KJH using the mean θ value as proposed by KJH, computed using a test set of size 10^4 and 1000 Monte Carlo replications. Table 1 shows the estimated performance of each method. Value boosting is comparable with KJH in all scenarios except Scenario 6, where it shows a marked advantage. However, value boosting is more variable than KJH, which is anticipated


given that policy-search methods are typically more variable than regression-based methods. The simulations suggest that the boosting method proposed by KJH and value boosting may warrant further investigation.

5. Parsimony, Feasibility, and Interpretability

Boosting and other "black-box" estimation algorithms have a strong track record for predictive performance, especially on benchmark data sets. However, it is not apparent that such methods are suited to inform clinical practice, guidelines, or research. One of the strongest features of policy-search methods is that it is possible to control the class of potential treatment rules. Consequently, one can constrain the estimated treatment rule to be interpretable, low-cost, logistically feasible, and so on. However, boosting these methods to improve value may destroy some or all of the foregoing features.

Consider three approaches for estimating an optimal treatment strategy: (A1) optimize within a small but informative class of regimes; (A2) optimize within a large but difficult to interpret class; and (A3) optimize within a large but difficult class and then project the estimated regime onto a smaller, interpretable class. Approach (A1) is taken by Orellana et al. (2010), Zhang et al. (2012a,b), and Zhang et al. (2013). In addition to the benefits of interpretability and feasibility imposed on the estimated regime, in some contexts (A1) is appealing because the best regime within a pre-specified class of regimes is of independent interest. Approach (A2), taken by Zhao et al. (2011, 2012) and Moodie et al. (2013), uses the predictive power of machine-learning methods and reduces the risk of model misspecification. Approach (A3) attempts to maintain both interpretability and performance (e.g., Breiman and Shang, 1996); furthermore, the "residual" of the estimated treatment regime relative to its projection onto the smaller class may be informative about where and how the smaller class is not sufficiently expressive.

The trade-off between flexibility and interpretability is ubiquitous in statistical modeling. However, the importance of each component depends on the context in which the model is estimated. Estimation of an optimal treatment regime is often conducted as a secondary analysis aimed at generating scientific hypotheses and informing follow-up studies. In contrast, KJH seem to be primarily interested in constructing a high-quality treatment regime. Their boosting method seems ideally suited for this purpose.

Acknowledgement

This work was supported by NIH grant P01 CA142538.

References

Barto, A. and Dietterich, T. (2004). Reinforcement learning and its relation to supervised learning. In Handbook of Learning and Approximate Dynamic Programming, J. Si, A. G. Barto, W. B. Powell, and D. Wunsch (eds), 45–63. New York: Wiley.

Breiman, L. and Shang, N. (1996). Born again trees. Available at ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps.

Chakraborty, B. and Moodie, E. E. M. (2013). Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. New York: Springer.

Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K. (1999). AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99), 97–105, Bled, Slovenia, June 1999.

Laber, E., Linn, K., and Stefanski, L. (in press). Interactive model-building for Q-learning. Biometrika.

Moodie, E. E. M., Dean, N., and Sun, Y. R. (2013). Q-learning: Flexible learning about useful utilities. Statistics in Biosciences, 1–21.

Orellana, L., Rotnitzky, A., and Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal treatment regimes, part I: Main content. International Journal of Biostatistics 6, Article 8.

Robins, J. M. (2004). Optimal structured nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium on Biostatistics, D. Y. Lin and P. J. Heagerty (eds), 189–326. New York: Springer.

Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94, 1096–1120.

Schulte, P., Tsiatis, A., Laber, E., and Davidian, M. (in press). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statistical Science.

Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. (2012a). Estimating optimal treatment regimes from a classification perspective. Stat 1, 103–114.

Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012b). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.

Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100, 681–694.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Zhao, Y., Zeng, D., Socinski, M. A., and Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67, 1422–1433.

Lu Tian
Department of Health Research and Policy, Stanford University, Palo Alto, California 94305, U.S.A.
email: [email protected]

I would like to congratulate Kang, Janes, and Huang for their interesting article on a novel boosting algorithm for combining multiple biomarkers to optimize patient treatment recommendations. Their results show the potential of modern machine learning methods, especially when one cannot correctly specify a regression model for the complex underlying relationship of interest. Since the publication of the AdaBoosting procedure by Freund and Schapire (1997), there has been great interest in seeking an explanation for its excellent performance and in generalizing the method to various settings. For example, AdaBoosting can be viewed as a numerical algorithm performing forward stagewise regression with a set of weak classifiers to minimize a regularized exponential loss function. Along this line, several versions of boosting procedures have been proposed by changing the loss function to be minimized (Friedman, Hastie, and Tibshirani, 2000). Boosting is arguably one of the best off-the-shelf methods, and it is not a surprise that, when appropriately adapted, it can perform competitively well for differentiating patients with positive treatment benefit from those without. On the other hand, after carefully examining the proposed boosting algorithm in the article, I see some crucial departures from the classical version of boosting. Therefore, I will first discuss the differences between the proposed method and the conventional boosting algorithm and their potential implications. Second, I will propose simple alternatives to resolve some of the difficulties. To simplify the discussion, I will assume P(T = 1) = P(T = 0) = 0.5 throughout.

1. The Difference between the Proposal and Conventional Boosting

Suppose for the time being that the directions of the treatment effect, S_i = I{Δ(Y_i) ≤ 0}, i = 1, . . . , n, are observed; then the Real AdaBoosting (a generalized version of AdaBoosting that returns a class probability) proceeds as follows:

(1) Start with weights w_i^(0) = 1/n for subjects i = 1, . . . , n, and fit a working model to calculate p̂^(0)(y), the estimated probability P(S = 1|Y = y) based on the working model.

(2) For m = 1, . . . , M,
    • update the weights according to w_i^(m) = w_i^{*(m)} / Σ_{i=1}^n w_i^{*(m)}, where

        w_i^{*(m)} = w_i^{(m−1)} exp( −(1/2)(2S_i − 1) log[ p̂^{(m−1)}(y_i) / {1 − p̂^{(m−1)}(y_i)} ] )

      for i = 1, . . . , n.
    • Refit the working model with the updated weights w_i^(m) to obtain p̂^(m)(y), the updated estimated probability P(S = 1|Y = y).

(3) After the last iteration, the estimated treatment rule is

    φ̂(y) = I( Σ_{m=1}^M log[ p̂^(m)(y) / {1 − p̂^(m)(y)} ] ≥ 0 ).
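To fix ideas, here is a minimal R rendering of this idealized recursion, with a weighted logistic regression as the working model and the labels S in {0, 1} treated as observed; all names are illustrative (in practice S is unobserved, which is exactly the point of this discussion).

    real_adaboost <- function(S, Y, M = 50) {
      dat <- data.frame(S = S, Y)
      w <- rep(1 / length(S), length(S))
      score <- 0                                    # accumulates log{p/(1 - p)}
      for (m in seq_len(M)) {
        fit <- suppressWarnings(
          glm(S ~ ., family = binomial, data = dat, weights = w))  # working model
        p <- pmin(pmax(predict(fit, type = "response"), 1e-6), 1 - 1e-6)
        w <- w * exp(-0.5 * (2 * S - 1) * log(p / (1 - p)))  # step (2) reweighting
        w <- w / sum(w)
        score <- score + log(p / (1 - p))
      }
      as.numeric(score >= 0)   # step (3): estimated rule on the training points
    }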

Although the assumption that the S_i are all observed is artificial, it reveals an important message in comparing the proposed boosting algorithm with the ideal counterpart above. First, while the updated weight in the ideal Real AdaBoosting depends on the fitted probability, the old weight, and the true outcomes S_i, the weight in the proposed algorithm depends only on the fitted probability. Intuitively, the Real AdaBoosting always tries to upweight observations misclassified in the previous step to ensure that the new classification rule likely adds value to the existing ones, especially on the subgroup of observations for which the performance of the old classifiers is unsatisfactory. Since S_i is unobserved, the new algorithm upweights observations close to the estimated decision boundary, whose classification (optimal treatment recommendation) is relatively ambiguous. It is clear that there is a genuine difference in the reweighting scheme: following the same logic as the new algorithm, the Real AdaBoosting would upweight observations with estimated class probability close to 0.5, which may or may not be the same observations misclassified in the previous step. One certainly can imagine that if the initial working model is a poor approximation to the truth, then some misclassified observations could mistakenly have an estimated class probability close to 0 or 1 and receive less weight. Furthermore, even when the working model yields reasonable class probability estimates, misclassification that occurs near the decision boundary affects the value function θ the least. Therefore, one may wonder if the new proposal sometimes fails to upweight the "right" observations. Second, while the Real AdaBoosting can be viewed as an algorithm to minimize a regularized loss function, the proposed algorithm cannot easily be fit into the same framework. One consequence is that it is not clear whether the algorithm provides a consistent estimate of the optimal treatment rule I{Δ(Y) ≤ 0}. To gain some insight into the consistency issue, let us consider the convergence limit of P̂^(1)(D = 1|T = t, Y) in the proposed boosting procedure. The final estimator Δ̂(Y) should be close to the difference between the two limits after a sufficient number of iterations. When the simple logistic working model is used, the limit of P̂^(1)(D = 1|T = t, Y) is {1 + exp(−β_t′Y)}^{-1}, and β_t solves the estimating equation

    E[ Y w( | exp(β_1′Y)/{1 + exp(β_1′Y)} − exp(β_0′Y)/{1 + exp(β_0′Y)} | )
         × { P(D = 1|T = t, Y) − exp(β_t′Y)/{1 + exp(β_t′Y)} } ] = 0,   t = 0, 1,

as n → ∞, which implies that

    E[ Y w(|Δ̃(Y)|) {Δ(Y) − Δ̃(Y)} ] = 0,

where Δ̃(Y) = {1 + exp(−β_0′Y)}^{-1} − {1 + exp(−β_1′Y)}^{-1} can be viewed as a reasonable approximation to Δ(Y), especially in the region where β_0′Y ≈ β_1′Y. However, it seems that in general Δ̃(Y) is different from Δ(Y) unless the simple logistic working model is correctly specified. The extra weight function does not solve this problem. This is in contrast to the appealing "consistency" property of the original AdaBoosting procedure, which does not require the "weak" learners to be "strong" or "correct."

2. Alternative Boosting Procedures

From the discussion in the previous section, it seems that the new proposal may sacrifice some important components of the AdaBoosting procedure due to the simple fact that


the S_i, i = 1, . . . , n, are not observable in practice. This may cause concerns about its performance in settings beyond those investigated in the comprehensive simulation study. On the other hand, noticing the fact that

    E{(2D − 1)(2T − 1)|Y} = −Δ(Y),

one may treat the binary variable S̃_i = (2D_i − 1)(2T_i − 1) as a surrogate for the unobserved S_i = I{−Δ(Y_i) ≥ 0} (Signorovitch, 2007) and simply replace S_i by S̃_i in the Real AdaBoosting procedure described above. To justify this approach, one notes that AdaBoosting can be viewed as a numerical algorithm to minimize the empirical version of the loss function

    J(d) = E[exp{−S̃ d(Y)}]

in terms of d(·), a function of y. In this case, the minimizer of J(d) is

    d_1(y) = (1/2) log{ P(S̃ = 1|Y = y) / P(S̃ = −1|Y = y) } = (1/2) log{ (1 − Δ(y)) / (1 + Δ(y)) }.

It suggests that I{d_1(Y) ≥ 0}, the treatment assignment rule based on d_1(·), is equivalent to our target I{Δ(Y) ≤ 0}. Therefore, as a numerical algorithm to estimate I{d_1(Y) ≥ 0}, the output of AdaBoosting with S̃_i, 1 ≤ i ≤ n, as responses can be used to assign patients according to their individual treatment effects.

The second alternative is to estimate the optimal treatment rule by maximizing the mean outcome. As the authors advocated, the mean outcome is a clinically relevant and interpretable quantity with which to measure the value of a given treatment assignment rule. In the current setting, maximizing the mean outcome is equivalent to minimizing

    Σ_{i=1}^n D_i I{T_i ≠ g(Y_i)} = Σ_{i: D_i = 1} I{T_i ≠ g(Y_i)},

the misclassification error for predicting T_i among the subgroup of patients having outcome D_i = 1, with respect to the binary classification rule g : Y → {0, 1}. This again provides a very natural platform for applying the AdaBoosting algorithm. Specifically, one may use the AdaBoosting procedure to construct a classification rule that classifies a patient to his/her actual randomized treatment assignment based on Y, within the subgroup of patients with D = 1. To justify the validity of this boosting procedure, one only needs to note that the minimizer of the corresponding loss function J(d) = E[exp{−(2T − 1)d(Y)} | D = 1] is

    d_2(y) = (1/2) log{ P(T = 1|Y = y, D = 1) / P(T = 0|Y = y, D = 1) }
           = (1/2) log{ P(D = 1|T = 1, Y = y) / P(D = 1|T = 0, Y = y) }.

Table 1
The results of the simulation study with a sample size of n = 500 for Scenarios 4 and 6 from Table 2 of the article by Kang, Janes, and Huang. The results of the two new methods are summarized in the last two columns under the names "S̃" and "value-function."

                          Linear logistic     Max θ algorithm    Classification tree    Separate       New boosting
Scenario  True θ          MLE      Boosting   IPWE     AIPWE     Single tree  Boosting  AdaBoosting    S̃        Value-function
4         0.0657
          θ Mean          0.0574   0.0607     0.0561   0.0567    0.0221    0.0378       0.0352         0.0396   0.0351
          MCR Mean        0.1511   0.1206     0.1719   0.1653    0.5397    0.3251       0.5304         0.3270   0.3374
6         0.1393
          θ Mean          0.0236   0.0438     0.0498   0.0544    0.0978    0.1186       0.1010         0.1006   0.1011
          MCR Mean        0.3865   0.3542     0.3452   0.3330    0.2697    0.1762       0.2433         0.2290   0.2227


Thus, the d_2(·)-based treatment assignment rule satisfies I{d_2(Y) ≥ 0} = I{Δ(Y) ≤ 0}, and the corresponding boosting procedure provides a flexible algorithm to approximate the decision boundary of the optimal treatment rule.
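This second alternative is equally direct to sketch in R: restrict to patients with D = 1 and classify the received treatment (again, dat and the marker names are illustrative assumptions):

    library(ada)
    resp1 <- subset(dat, D == 1)                    # patients with the adverse outcome
    fit2 <- ada(factor(T) ~ Y1 + Y2, data = resp1)  # classify received treatment from Y
    ## Predicted class "1" estimates I{d2(Y) >= 0} = I{Delta(Y) <= 0}
    no_benefit <- predict(fit2, newdata = dat) == "1"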

I have selected the two most challenging scenarios (4 and 6) in the simulation study performed by Kang, Janes, and Huang to examine the relative performance of the above two proposals. For both cases, the "ada" function with default parameters in R is used to perform the boosting algorithm. The sample size of the training set is 500, and the final results are based on 1000 Monte Carlo replications (Table 1). It seems that the empirical performances of these two simple alternatives are comparable to those of the methods proposed by Kang, Janes, and Huang with a classification tree as the base learner.

3. Remarks

We statisticians can learn a great deal from the rapid development of machine learning techniques, which oftentimes offer robust performance for a broad range of problems. The authors convincingly demonstrated the power of a version of the generalized boosting method for estimating the optimal strategy for assigning treatment. On the other hand, the proposed boosting method differs from its conventional counterpart in several key aspects, and new explanations are needed to account for its good performance. Furthermore, there are alternative boosting procedures which, in my opinion, can circumvent the difficulties caused by the unobservable class labels S_i = I{Δ(Y_i) ≤ 0}, i = 1, . . . , n, in more natural ways. Further research in these directions is certainly warranted.

References

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics 28, 337–407.

Signorovitch, J. E. (2007). Identifying informative biological markers in high-dimensional genomic data and clinical trials. Ph.D. thesis, Harvard University.

Ying-Qi Zhao*
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53792, U.S.A.
*email: [email protected]

and

Michael R. Kosorok
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A.
Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A.

Summary. Kang, Janes, and Huang propose an interesting boosting method to combine biomarkers for treatment selection. The method requires modeling the treatment effects using markers. We discuss an alternative method, outcome weighted learning. This method sidesteps the need for modeling the outcomes, and thus can be more robust to model misspecification.

Key words: Boosting; Outcome weighted learning; Personalized medicine; Support vector machine.

1. Introduction

The problem of combining markers to optimize treatment selection has recently received significant attention among statistical researchers. We congratulate Kang, Janes, and Huang (in press; hereafter K.J.H.) on an elegant contribution to this important area. K.J.H. provide a novel application of boosting to find the best marker combination. In particular, with a binary outcome, the optimal treatment rule is determined by the conditional probability of having disease, that is, the risk, given the covariates and the treatment. If the risk models are correctly specified, the optimal rule can be deduced accordingly. While a generalized linear model is a simple and popular option, it may suffer from model misspecification.

The proposed method in K.J.H. achieves a measure of robustness to such model misspecification through the use of boosting, combined with iteratively reweighting each subject's potential misclassification based on treatment benefit in the previous iteration.

Although K.J.H. indicate that the purpose of the proposed method is to classify subjects according to the unobserved optimal treatment decision rule, the approach does not utilize a clear objective function for optimization. Risk modeling is required to estimate the optimal rules and to further update the weights at each step. As shown in the simulation results, the performance varies with different working models. An alternative approach, outcome weighted learning (OWL), is given


Table 1
Results of the simulation study with sample size n = 500. OWL-HL: outcome weighted learning with hinge loss, linear kernel; OWL-HG: outcome weighted learning with hinge loss, Gaussian kernel; OWL-E: outcome weighted learning with exponential loss; L-Boosting: linear logistic boosting; C-Boosting: classification tree boosting; Logit: logistic regression.

SN  True θ                OWL-HL   OWL-HG   OWL-E    L-Boosting  C-Boosting  Logit
1   0.1268  θ Mean        0.1225   0.1057   0.1020   0.1240      0.1149      0.1256
            θ SD          0.0025   0.0038   0.0034   0.0018      0.0040      0.0019
            MCRTB Mean    0.0640   0.1363   0.1620   0.0517      0.1128      0.0309
2   0.1243  θ Mean        0.1224   0.1129   0.0946   0.1248      0.0921      0.1241
            θ SD          0.0024   0.0052   0.0024   0.0017      0.0037      0.0017
            MCRTB Mean    0.0664   0.1160   0.1883   0.0481      0.1811      0.0545
3   0.1341  θ Mean        0.1256   0.1191   0.1010   0.1315      0.1154      0.1315
            θ SD          0.0026   0.0045   0.0031   0.0015      0.0051      0.0016
            MCRTB Mean    0.0865   0.1062   0.1461   0.0498      0.1307      0.0499
4   0.0657  θ Mean        0.0654   0.0061   0.0436   0.0657      0.0356      0.0657
            θ SD          0.0058   0.0070   0.0031   0.0031      0.0031      0.0040
            MCRTB Mean    0.0855   0.4638   0.2721   0.0202      0.2853      0.0489
5   0.0950  θ Mean        0.0798   0.0646   0.0783   0.0594      0.0517      0.0592
            θ SD          0.0046   0.0034   0.0035   0.0030      0.0037      0.0030
            MCRTB Mean    0.2049   0.2250   0.1810   0.3110      0.2708      0.3070
6   0.1393  θ Mean        0.0001   0.0931   0.0903   0.0313      0.0926      −0.0023
            θ SD          0.0049   0.0035   0.0035   0.0032      0.0034      0.0055
            MCRTB Mean    0.4010   0.2264   0.2267   0.3469      0.2408      0.4296
7   0.1419  θ Mean        0.1295   0.1217   0.1140   0.1312      0.1185      0.1313
            θ SD          0.0028   0.0056   0.0038   0.0015      0.0044      0.0015
            MCRTB Mean    0.0718   0.1123   0.1455   0.0468      0.1448      0.0468

in Zhao et al. (2012), which estimates the optimal treatment decision rule through a weighted classification procedure that incorporates outcome information. To be more specific, the optimization target, which directly leads to the optimal treatment decision rule, can be viewed as a weighted classification error where each subject is weighted proportionally to his or her clinical outcome. In the next section, we will briefly introduce the idea of OWL and modify it for the binary outcome setup. In Section 3, we present simulation studies comparing OWL with the boosting method proposed by K.J.H. We conclude with a brief discussion in Section 4.

2. Outcome Weighted Learning (OWL)

Using the same notation as K.J.H., we let D ∈ {0, 1} be the binary indicator of an adverse outcome, T indicate treatment (T = 1) or not (T = 0), and Y be the marker, which can be used to identify a subgroup. We assume that the data {(D_i, T_i, Y_i)}_{i=1}^n are from a randomized clinical trial. For an arbitrary treatment rule g : Y → {0, 1}, the expected benefit under g, that is, if g

were implemented in the whole population, can be written as(Qian and Murphy, 2011)

P

{I(D = 0)I(T = g(Y))

πc(Y)

}, (1)

where πc(Y) is the known probability of treatment, and theoptimal treatment rule gopt(Y) can be obtained by maximiz-

ing the above quantity. Equivalently, by minimizing

P

{I(D = 0)I(T �= g(Y))

πc(Y)

},

we can obtain that gopt(Y) = 1{�(Y) ≤ 0}, where �(Y) =P(D = 1|T = 0, Y) − P(D = 1|T = 1, Y). Indeed, this can beviewed as a weighted classification error in the setting wherewe wish to classify T in the responders group using the co-variate Y , with 1/πc(Y) as the weights. In particular, whenπc(Y) = 0.5, the problem falls into a regular classificationframework if we only consider responders. In this case, weclassify subjects according to their assigned treatments onlyamong those who received benefit from the assignment (i.e.,the responders).

Provided with the data, we could potentially minimize the empirical analog

    min_g P_n{ I(D = 0) I(T ≠ g(Y)) / π_c(Y) }    (2)

to estimate g^opt(Y), where P_n denotes the empirical average. Note that this is similar to the quantity IPWE(η) presented in Zhang et al. (2012). Due to the nonconvexity and discontinuity of the 0-1 loss, it is computationally difficult to minimize (2). We address this problem by using a convex surrogate loss function to replace the 0-1 loss, a common practice in the machine learning literature (Zhang, 2004; Bartlett, Jordan, and McAuliffe, 2006). In other words, instead of minimizing


Table 2
Results of the simulation study with sample size n = 5000. OWL-HL: outcome weighted learning with hinge loss, linear kernel; OWL-HG: outcome weighted learning with hinge loss, Gaussian kernel; OWL-E: outcome weighted learning with exponential loss; L-Boosting: linear logistic boosting; C-Boosting: classification tree boosting; Logit: logistic regression.

SN  True θ                OWL-HL   OWL-HG   OWL-E    L-Boosting  C-Boosting  Logit
1   0.1268  θ Mean        0.1228   0.1256   0.1218   0.1265      0.1177      0.1265
            θ SD          0.0025   0.0038   0.0034   0.0018      0.0040      0.0019
            MCRTB Mean    0.0776   0.0338   0.0709   0.0124      0.0961      0.0111
2   0.1243  θ Mean        0.1251   0.1246   0.1233   0.1264      0.1083      0.1264
            θ SD          0.0024   0.0052   0.0024   0.0017      0.0037      0.0017
            MCRTB Mean    0.0442   0.0459   0.0598   0.0162      0.1201      0.0181
3   0.1341  θ Mean        0.1319   0.1316   0.1284   0.1338      0.1204      0.1337
            θ SD          0.0026   0.0045   0.0031   0.0015      0.0051      0.0016
            MCRTB Mean    0.0408   0.0433   0.0628   0.0110      0.1095      0.0161
4   0.0657  θ Mean        0.0653   0.0634   0.0640   0.0652      0.0530      0.0652
            θ SD          0.0058   0.0070   0.0031   0.0031      0.0031      0.0040
            MCRTB Mean    0.0357   0.1515   0.1189   0.0090      0.2256      0.0035
5   0.0950  θ Mean        0.0792   0.0923   0.0893   0.0761      0.0750      0.0732
            θ SD          0.0046   0.0034   0.0035   0.0030      0.0037      0.0030
            MCRTB Mean    0.2165   0.0693   0.0878   0.2364      0.2215      0.2522
6   0.1393  θ Mean        0.0417   0.1315   0.1258   0.0457      0.1116      0.0123
            θ SD          0.0049   0.0035   0.0035   0.0032      0.0034      0.0055
            MCRTB Mean    0.3574   0.1081   0.1349   0.3490      0.2585      0.4039
7   0.1419  θ Mean        0.1294   0.1319   0.1288   0.1327      0.1218      0.1327
            θ SD          0.0028   0.0056   0.0038   0.0015      0.0044      0.0015
            MCRTB Mean    0.0928   0.0440   0.0789   0.0329      0.1094      0.0307

(2), we minimize

    min_{f ∈ F} P_n{ I(D = 0) φ{(2T − 1)f(Y)} / π_c(Y) } + λ_n‖f‖²,    (3)

where F is the functional space in which f resides, 2T − 1 is used to rescale T to reside in {−1, 1}, φ(t) is a convex surrogate loss function, and λ_n‖f‖² is a regularization term to avoid overfitting, with ‖·‖ denoting the norm of f in F and λ_n controlling the amount of penalization. The estimated treatment rule is ĝ(Y) = I(f̂(Y) > 0), where f̂ is the solution to (3). We can specify F to be a linear functional space if we are only interested in linear decision rules. We can also consider nonlinear functional spaces where treatment effects can potentially be complex and nonlinear. In the simulation section, we will examine the performance using two popular choices of φ(t): the hinge loss φ(t) = max(1 − t, 0) and the exponential loss φ(t) = exp(−t).
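To make (3) concrete, the empirical objective with the exponential loss and a linear rule f(y) = b_0 + y′b can be coded and handed to a generic optimizer. This is only an illustration of the display above, and all names are assumptions; optim is used purely as a generic minimizer.

    ## Empirical version of (3): exponential loss, linear f, ridge penalty
    owl_exp_obj <- function(par, D, T, Y, pi_c = 0.5, lambda = 0.01) {
      f <- drop(cbind(1, as.matrix(Y)) %*% par)       # f(Y) = b0 + Y %*% b
      mean((D == 0) * exp(-(2 * T - 1) * f) / pi_c) + lambda * sum(par[-1]^2)
    }
    ## e.g.: est   <- optim(rep(0, ncol(Y) + 1), owl_exp_obj, D = D, T = T, Y = Y)
    ##       g_hat <- as.numeric(drop(cbind(1, as.matrix(Y)) %*% est$par) > 0)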

Since being at high risk does not necessarily imply a larger benefit from treatment, we aim to find methods that are optimized for treatment selection (Kang, Janes, and Huang, in press). We point out that OWL directly targets the optimal decision rule; hence the covariate-treatment interaction effects are separated from the main effects. Indeed, it does not involve a modeling step for the risk and Δ(Y) as required by K.J.H. If the functional space F is correctly specified for the interaction effects, we can consistently estimate g^opt(Y) (Zhao et al., 2012). Specifically, using the exponential loss, we generalize Adaboost (Freund and Schapire, 1997; Friedman, Hastie, and Tibshirani, 2000) to select the optimal treatment. Rather than using the weight function w{Δ̂(Y)} updated at each step, we use I(D_i = 0) exp{−(2T_i − 1)f(Y_i)} as the weight for each observation, where the f(Y_i) are repeatedly updated at each iteration.

As a side note, the OWL method can be naturally generalized to continuous outcomes, which commonly occur in practice. For example, if we let R denote the continuous outcome, with larger values being more preferable, we only need to change I(D = 0) to R in (1). The subsequent derivation and computation follow accordingly.

3. Simulation Studies

We compare the OWL method with logistic regression and with the boosting methods (both linear logistic boosting and classification tree boosting) proposed by K.J.H. The OWL methods are implemented using the hinge loss and the exponential loss as the convex surrogates. We use the same simulation scenarios as presented in K.J.H. Since patients are equally randomized to T = 0 or 1, π_c(Y) can be dropped from the optimization objective (3). Thus, we can apply standard classification algorithms to the simulated data by considering only the responders with D = 0. Adaboost (Freund and Schapire, 1997) or the support vector machine (SVM) (Cortes and Vapnik, 1995) can be carried out for this subset of patients, by treating their assignments T ∈ {0, 1} as the class labels and the biomarkers Y as the predictors. Adaboost is implemented by the R function ada (R package ada; Culp, Johnson, and Michailidis, 2006) using the default settings with the exponential loss function. The SVM is implemented by the R function svm (R package e1071; Dimitriadou et al., 2008). Both linear and Gaussian kernels are used for comparison, yielding linear and nonlinear decision rules, respectively.
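A minimal R sketch of this implementation strategy (the data frame dat with columns D, T, Y1, Y2 is an illustrative assumption; the svm calls follow the e1071 package cited in the text):

    library(e1071)
    resp0 <- subset(dat, D == 0)                              # responders only
    fit_lin <- svm(factor(T) ~ Y1 + Y2, data = resp0, kernel = "linear")
    fit_rbf <- svm(factor(T) ~ Y1 + Y2, data = resp0, kernel = "radial")  # Gaussian
    ## Recommended treatment for a marker profile: the class the SVM predicts
    g_hat <- as.numeric(predict(fit_rbf, newdata = dat) == "1")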

For each scenario, 1000 data sets are generated as training data to build the treatment decision rule, ĝ(Y). A large independent test data set with n = 10^5 observations is generated to evaluate the performance of the obtained ĝ(Y) under different methods. The mean and Monte Carlo standard deviation (SD) of θ{ĝ(Y)} and the mean MCRTB{ĝ(Y)} are reported, where θ and MCRTB are defined in K.J.H. A method is marked in bold if it outperforms the other methods by producing the highest mean θ. Logistic regression performs best when the model is correctly specified, which is anticipated and has been noted in K.J.H. Linear logistic boosting performs similarly to logistic regression when the models are correct, and can improve on logistic regression to some extent when the models are misspecified. However, if the effects are nonlinear, classification tree boosting may be a better option. An appropriate working model is important in practice, when the underlying truth is masked. The OWL method has comparable performance when the treatment effects are linear. When the outcome models are complex with nonlinear treatment interactions (Scenarios 5 and 6), the OWL methods lead to better performance. In general, OWL using the hinge loss with Gaussian kernel has a favorable performance that is always close to optimal, especially when the sample size is large.

4. Discussion

In summary, the intuition underlying the proposed boosting method, classifying subjects according to their (unobserved) optimal treatment, holds much promise. However, in practice, attention must be paid to the selection of the working models and the weight function w, since they may impact the results to some extent. The OWL procedure discussed for comparison utilizes a machine learning approach to find the best treatment rule by optimizing a target function that directly reflects the overall benefit of the decision rule. When the outcome is binary, the method proposed in K.J.H. can have better performance with small sample sizes, given that OWL essentially only uses information from the responders. Also, if there is prior information on the relationship between outcomes and candidate biomarkers, the K.J.H. method can be preferable. On the other hand, due to its flexibility in handling covariate-treatment interaction effects, OWL can yield better results when the dimension of the covariate space is high or when the true model is fairly complex. For large sample sizes, OWL with hinge loss and Gaussian kernel performs nearly optimally in every scenario. Additionally, while OWL can also readily handle additional data types such as continuous outcomes, this is not yet the case for the K.J.H. method, and so it would be worthwhile to investigate the possibility of such a generalization.

References

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convex-ity, classification, and risk bounds. JASA 101, 138–156.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20, 273–297.

Culp, M., Johnson, K., and Michailides, G. (2006). ada: An R package for stochastic boosting. Journal of Statistical Software 17, 1–27.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., and Weingessel, A. (2008). Misc Functions of the Department of Statistics (e1071), TU Wien. R package, 1–5.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics 28, 337–374.

Kang, C., Janes, H., and Huang, Y. (in press). Combining biomarkers to optimize patient treatment recommendations. Biometrics.

Qian, M. and Murphy, S. A. (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39, 1180–1210.

Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.

Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32, 56–134.

Zhao, Y. Q., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Menggang Yu*

Department of Biostatistics & Medical Informatics, University of Wisconsin, Madison, Wisconsin, U.S.A.
*email: [email protected]

and

Quefeng Li

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey, U.S.A.

1. Introduction

Kang, Janes, and Huang (K.J.H.) proposed a new approach to an important problem: comparative treatment selection based on selected markers. Although correct prediction of outcomes under different treatments facilitates treatment selection, such prediction can be hard due to model mis-specification, especially when there are multiple markers involved.



Table 1
Performance comparison from various methods

                            Non-boosting methods           Boosting methods
                            Logistic    Single tree        Logistic    Tree        M-boost

Model 1: scenario 1 in K.J.H.; true θ = 0.1270
  MCRTB                     0.0368      0.2367             0.0415      0.1300      0.0632
  θ                         0.1255      0.1089             0.1252      0.1088      0.1155
  SD(θ)                     0.0025      0.0090             0.0027      0.0157      0.0024

Model 2: scenario 5 in K.J.H.; true θ = 0.0956
  MCRTB                     0.3264      0.2235             0.3255      0.2102      0.1246
  θ                         0.0535      0.0664             0.0537      0.0674      0.0876
  SD(θ)                     0.0053      0.0160             0.0048      0.0119      0.0036

Model 3: log{−log(1 − P(D = 1|T, Y))} = 0.1 − 0.2Y1 + 0.2Y2 − Y1Y2 + T(−0.5 − Y1 + Y2 + 3Y1Y2); true θ = 0.1575
  MCRTB                     0.3676      0.2057             0.3689      0.1891      0.1757
  θ                         0.0414      0.1280             0.0414      0.1259      0.1431
  SD(θ)                     0.0029      0.0158             0.0028      0.0240      0.0032

Model 4: Pr(D = 1|T, Y)/Pr(D = 0|T, Y) = |0.1 − 0.2Y1 + 0.2Y2 − Y1Y2 + T(−0.5 − Y1 + Y2 + 3Y1Y2)|; true θ = 0.2345
  MCRTB                     0.3020      0.2514             0.3083      0.2719      0.2544
  θ                         0.1755      0.1917             0.1743      0.1804      0.1882
  SD(θ)                     0.0096      0.0153             0.0089      0.0218      0.0091
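For reference, Model 3 can be simulated directly from the displayed link function. Below is a minimal R sketch; the marker distribution is not stated in this excerpt, so independent standard normal markers and equal randomization are assumed purely for illustration.

    n  <- 1000
    Y1 <- rnorm(n); Y2 <- rnorm(n)   # assumed marker distribution
    T  <- rbinom(n, 1, 0.5)          # assumed equal randomization
    eta <- 0.1 - 0.2*Y1 + 0.2*Y2 - Y1*Y2 + T*(-0.5 - Y1 + Y2 + 3*Y1*Y2)
    p <- 1 - exp(-exp(eta))          # inverts log{-log(1 - p)} = eta
    D <- rbinom(n, 1, p)             # binary outcome under Model 3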

On the other hand, treatment selection basically depends on the sign of the contrast function Δ(Y). Directly modeling this contrast function, or its sign, may lead to more parsimonious and robust results. For example, it can be shown that if the binary response D follows a single index model logit P(D = 1|T, Y) = h{l(Y) + g(TY′β)}, where h(·) and g(·) are increasing functions and l(·) is arbitrary, then for any given Y, Δ(Y) ≤ 0 if and only if Y′β ≥ 0. The treatment rule is therefore substantially simpler than the outcome model. This formulation of the problem as a classification problem nicely connects it with a vast array of available machine learning algorithms. For example, the procedure can easily be generalized to treatment selection among three or more treatments by utilizing multicategory boosting (Zou, Zhu, and Hastie, 2008; Schapire and Freund, 2012, Chapter 10).
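To spell out the equivalence (a step the text leaves implicit), write Δ(Y) = E(D|T = 0, Y) − E(D|T = 1, Y) as in the rejoinder. Since the logit, h, and g are all increasing,

    Δ(Y) ≤ 0  ⟺  P(D = 1|T = 0, Y) ≤ P(D = 1|T = 1, Y)
              ⟺  h{l(Y) + g(0)} ≤ h{l(Y) + g(Y′β)}
              ⟺  g(0) ≤ g(Y′β)  ⟺  Y′β ≥ 0,

so the optimal rule depends on Y only through the sign of the linear index Y′β.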

2. The Boosting Algorithm

Our main comments are related to the boosting algorithm used by K.J.H. This article seems to bring a new dimension to the classification problem. Even though we observe D, the target function is actually based on Δ(Y), a quantity not directly observable. So it is not a completely supervised learning problem. This brings challenges in the following aspects.

(1) Sensitivity of the working model

A basic principle of the boosting algorithm is to "boost" weak learners into an adequate learner. Boosting iteratively "trains" the weak classifiers using a weighting distribution. The weights are usually related to the weak learners' accuracy: misclassified examples gain weight and correctly classified examples lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. This scheme seems to suit the treatment selection problem very well. Because of the possibility of model mis-specification, the estimated Δ̂(Y) is a weak learner. This weak learner can then be boosted into a strong one with properly chosen weights based on some function of Δ̂(Y). However, different from the usual boosting setting, these weights may be incorrect. To conform with the principle of the boosting algorithm, one hopes that misclassified examples would still gain weight and correctly classified examples would lose weight. The working model therefore plays a dual role here. Mis-specification of the working model may be severe enough that this principle no longer holds.

(2) Thresholds of weights

K.J.H. control the influence of observations with tiny Δ̂(Y) by capping the weights at Cm. The choice of Cm was shown to have minimal influence on the results. With the weight function w{Δ(Y)} = |Δ(Y)|^(−1/3), the cap is effective only for observations with |Δ̂(Y)| < (1/Cm)^3. When Cm = 300, 500, and 1000, the thresholds are 3.7e−08, 8e−09, and 1e−09, respectively. These thresholds might be too small to screen out observations with small Δ̂(Y). To a certain degree, if Δ̂(Y) converges to Δ(Y), misclassifying the subjects with small Δ(Y) should have a negligible effect on estimating θ. We wonder if the algorithm may spend too much effort on these "non-important" but easily misclassified examples. In other words, perhaps more liberal thresholds could be explored. Alternatively, we wonder whether boosting algorithms that actually decrease the weight of repeatedly misclassified examples, for example boost-by-majority (Schapire and Freund, 2012, Chapter 13) and BrownBoost (Freund, 2001), may be adopted here. In the meantime, overfitting can be an issue with possibly misspecified working models. This may be reflected by "over-confident" votes or very large Δ̂(Y). In such cases, increasing the weights of these subjects can be helpful.
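As a quick arithmetic check of these cutoffs, the cap Cm on w{Δ(Y)} = |Δ(Y)|^(−1/3) binds exactly when |Δ(Y)| < Cm^(−3):

    # Effective thresholds implied by capping w = |Delta|^(-1/3) at Cm
    Cm <- c(300, 500, 1000)
    signif(Cm^(-3), 2)
    # [1] 3.7e-08 8.0e-09 1.0e-09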

(3) Target function

The target function used in the algorithm is 1{Δ(Y) ≤ 0}. The 0-1 loss function is neither continuous nor convex. Similar to the usual boosting setting, we wonder whether convex surrogate target functions can be used, as in Hastie, Tibshirani, and Friedman (2005, Chapter 10). In the original boosting setting, the surrogate functions were argued from statistical principles and led to some nice algorithms, including the well-known functional gradient descent algorithm (Friedman, Hastie, and Tibshirani, 2000; Friedman, 2001). Another possible target function is θ{φ(Y)} itself. This function, or equivalently E{P(T|Y)^(−1) 1{T = φ(Y)}D}, was used in Zhao et al. (2012) for individualized treatment selection based on a support vector machine framework. Different from 1{Δ(Y) ≤ 0}, E{P(T|Y)^(−1) 1{T = φ(Y)}D} is empirically estimable given φ(Y) when P(T|Y) is known, for example, in clinical trial settings. A corresponding boosting algorithm can then be designed to maximize E{P(T|Y)^(−1) 1{T = φ(Y)}D}.
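To make the estimability claim explicit (a standard inverse-probability-weighting identity, not a result specific to this discussion), conditioning on Y and averaging over the known randomization distribution P(T|Y) gives

    E{P(T|Y)^(−1) 1{T = φ(Y)}D}
      = E[ Σ_{t∈{0,1}} P(T = t|Y) P(T = t|Y)^(−1) 1{t = φ(Y)} E(D|T = t, Y) ]
      = E[ E{D|T = φ(Y), Y} ],

that is, the expected outcome if all subjects were treated according to φ. With P(T|Y) known, it can be estimated by the sample average of 1{T_i = φ(Y_i)}D_i / P(T_i|Y_i).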

3. Improvement Using the Component-Wise Boosting

In a small simulation study, we explored the sensitivity of the results to the working model. To reduce the undue influence of a multivariate working model, which is likely to be misspecified, we adopted BinomialBoosting with a component-wise generalized additive model as the "weak learner" (Buhlmann and Hothorn, 2007). More explicitly, BinomialBoosting iteratively approximates 2^(−1) logit P(D = 1|·) by

    f^[m](·) = ν Σ_{k=1}^{m} g^[k](·),

where ν is a step-length factor. At the mth iteration, g^[m](·) was given by the following procedure. Let X = (Y1, . . . , Yp, T, T × Y1, . . . , T × Yp) be the (2p + 1)-vector of biomarkers, treatment, and treatment-biomarker interactions, and let X^(j) be the jth component of X for 1 ≤ j ≤ 2p + 1. Denote by U_1^[m−1], . . . , U_n^[m−1] the current negative gradients of the binomial loss function (c.f. equation (3.1) of Buhlmann and Hothorn (2007)) corresponding to the n subjects in the training set. We fit U_1^[m−1], . . . , U_n^[m−1] against X_1^(j), . . . , X_n^(j) by a one-dimensional smoothing spline function h^(j)(·). Then g^[m](·) was chosen to be the best component-wise fit,

    g^[m](·) = h^(ĵ)(x^(ĵ)),  where ĵ = arg min_{1 ≤ j ≤ 2p+1} Σ_{i=1}^{n} {U_i^[m−1] − h^(j)(X_i^(j))}^2.

Therefore, simple smoothing splines based on each (univariate) biomarker, the treatment, and the treatment-biomarker interactions formed a pool of candidate weak learners. At each iteration, we chose the weak learner as the simple smoothing spline that best predicted the current negative gradient of the binomial loss function.

The intuition behind our choice of the component-wise boosting is twofold. First, the flexibility of the smoothing splines may overcome possible misspecification of the variables' functional forms. Note that, due to the ensemble nature of the boosting algorithm, we only need to get the contribution of each variable right. Second, fitting the gradients of the binomial loss function may reduce the undue influence of wrong working models on the weighting distributions. The component-wise boosting procedure is quite flexible and can also incorporate variable selection (Buhlmann and Hothorn, 2007). This may be attractive when the number of biomarkers is large.

We compared the above boosting method with the K.J.H. methods through four models. Models 1 and 2 were the same as Scenarios 1 and 5 in K.J.H. Model 3 used the log(−log) link function and had complexity between Scenarios 3 and 4 in K.J.H. Model 4 deviated severely from the logistic regression setting. For each model, we ran 100 simulations with n = 1000 in each simulation. We used the R package mboost to implement the above boosting algorithm. We chose each h^(j)(·) to be a P-spline with three degrees of freedom and a B-spline basis, using the option bbs in the mboost package, and used the package's default choices of ν and the number of iterations. The performance of the various methods is summarized in Table 1. Both MCRTB and θ were calculated based on test data sets of size 10,000. The best performers are highlighted. One thing we noticed from the K.J.H. simulation results was that the performance of the boosting methods can depend on the working and true models. The misclassification rates (MCRTB) were quite high when the true model was logistic and the boosting was based on classification trees (see Scenarios 1 and 2 in K.J.H.), or when the true model was not logistic and the boosting was based on logistic working models (see Scenarios 5 and 6 in K.J.H.). Without knowing the truth, it is therefore hard to determine which boosting algorithm to use. In comparison, the component-wise boosting seems to have improved all-around performance under various true models. A more comprehensive investigation is warranted.
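For concreteness, the following is a minimal R sketch of this component-wise fit with mboost for p = 2 markers. The data frame dat and the names Y1 and Y2 are illustrative; note also that a linear base learner (bols) is used here for the binary treatment, since a spline in T would be degenerate, which is a small deviation from the description above.

    library(mboost)  # Buhlmann and Hothorn (2007)

    # Treatment-biomarker interaction terms
    dat$TY1 <- dat$T * dat$Y1
    dat$TY2 <- dat$T * dat$Y2

    # Component-wise BinomialBoosting: each bbs() term is a P-spline base
    # learner (B-spline basis, df = 3); at each iteration mboost keeps the
    # single base learner that best fits the current negative gradient.
    fit <- gamboost(factor(D) ~ bbs(Y1, df = 3) + bbs(Y2, df = 3) + bols(T) +
                      bbs(TY1, df = 3) + bbs(TY2, df = 3),
                    data = dat, family = Binomial())

    # mboost's Binomial() fits 2^(-1) logit P(D = 1 | x), as in the display above
    f_hat <- predict(fit, type = "link")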

References

Buhlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science 22, 477–505.

Freund, Y. (2001). An adaptive version of the boost by majority algorithm. Machine Learning 43, 293–318.

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 1189–1232.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics 28, 337–374.

Hastie, T., Tibshirani, R., and Friedman, J. (2005). The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edition. New York, NY: Springer.

Schapire, R. and Freund, Y. (2012). Boosting: Foundations and Algorithms, 1st edition. Cambridge: The MIT Press.



Zhao, Y., Zeng, D., Rush, A., and Kosorok, M. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.

Zou, H., Zhu, J., and Hastie, T. (2008). New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics 2, 1290–1306.

Rejoinder

Chaeryon Kang, Holly Janes, and Ying Huang

We thank co-editor Jeremy M. G. Taylor for organizing this discussion and the discussants for their insightful comments and suggestions. In this rejoinder, we will address the broad points made by the individual discussants and draw connections between them.

We agree with Laber, Tsiatis, Davidian, and Holloway (hereafter LTDH) that a taxonomy of methodology for deriving treatment rules is useful for discussing the relative merits of the various approaches. LTDH classified statistical approaches for finding marker-based treatment rules into two classes based on the estimation method: "regression-based methods" obtain a rule by first modeling the outcome using a regression model, and "policy search methods" directly maximize a criterion of interest, for example, the expected outcome under marker-based treatment, in order to derive a treatment rule. Our boosting approach was characterized as a regression-based approach, whereas outcome weighted learning (OWL, Zhao et al., 2012), direct maximization of the expected outcome under marker-based treatment using the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE) (Zhang et al., 2012a,b), and modeling marker-by-treatment interactions through Q- and A-learning (e.g., Murphy, 2003; Zhao, Kosorok, and Zeng, 2009) were characterized as policy search methods.

We prefer to group methods using somewhat different labels. We call "policy search methods" those that yield a treatment rule. In contrast, "outcome prediction methods" yield a model for the expected outcome given marker and treatment, which can then be used to derive a treatment rule. Using this terminology, our boosting approach, OWL, direct maximization of the expected outcome under marker-based treatment, and the Q- and A-learning approaches are all examples of policy search methods: they yield treatment rules only and do not produce a model for the outcome. The methods differ in whether they are "direct," in that they search for treatment rules by directly maximizing a criterion of interest such as the expected outcome under marker-based treatment, or "indirect," in that they search for treatment rules by maximizing a criterion which is different from, but presumably related to, the criterion of interest. Our boosting method is an indirect approach. The first method proposed by Tian, which minimizes the rate at which subjects are misclassified according to treatment benefit (using a surrogate variable for this unobserved outcome), is also an indirect policy search method. This taxonomy is helpful, we believe, in that it makes plain that the approaches mentioned in our article and by the discussants are all policy search methods, except for the method suggested by Yu and Li (hereafter YL), which is an outcome modeling approach designed to be robust to model misspecification. These methods are therefore limited in that they are suitable only for identifying a treatment rule, and not for the more difficult task of predicting outcome given marker value and treatment assignment.

Several discussants proposed novel direct policy search approaches that also use boosting ideas. Several rely on the fact that maximizing the expected outcome under marker-based treatment can be reformulated as a classification problem with weights that are functions of the outcome (Zhao et al., 2012; Zhang et al., 2012a,b). Using this formulation, Zhao and Kosorok (hereafter ZK) and Tian proposed solving an approximation of the weighted classification problem and applied AdaBoost to improve weak classifiers, while LTDH proposed "value boosting," which allows more general weights such as those from the AIPWE. We agree that these methods have broad appeal and deserve in-depth investigation.

YL and Tian both raised questions about our proposed strategy of upweighting subjects with small estimated treatment effects, near the decision boundary, who are more likely to be incorrectly classified with respect to treatment benefit. They raised an interesting and fundamental question: should subjects who lie close to the decision boundary have more influence on the classifier? Or should subjects who lie far from the decision boundary, but whose incorrect treatment recommendations will have greater impact, have more influence? Many traditional classification methods, for example support vector machines and AdaBoost, have focused on subjects who are difficult to classify. In contrast, other recently developed boosting methods such as BrownBoost (Freund, 2001) focus on subjects whose estimated class labels are consistently correct across iterations, and give up on "noisy subjects" whose estimated class labels are consistently incorrect. We agree that, in the treatment selection context, boosting subjects whose estimated treatment effects are large is worth further investigation. We suspect that the optimal weighting strategy will depend on the particular setting, and will be affected by factors such as the distribution of the markers and their associations with the treatment effect.



Table 1
Results of the simulation study for the continuous outcome setting. Marker combinations obtained using linear regression with maximum likelihood estimation (Linear MLE) and the boosting method described in our article with a linear regression working model (Linear Boosting) are compared. Scenarios similar to 5 and 6 in our article are examined.^a For each scenario, 1000 training datasets (n = 500) and one test dataset (N = 10^5) were generated. The mean and Monte-Carlo standard deviation (SD) of θ are shown, along with the mean and SD of the misclassification rate for treatment benefit (MCRTB), calculated using test data.

                                 True     Linear MLE    Linear Boosting

Scenario 5
  θ            Mean              3.385    1.821         2.121
               SD                         0.365         0.427
  MCRTB        Mean                       0.352         0.303
               SD                         0.062         0.061

Scenario 6
  θ            Mean              3.049    2.396         2.477
               SD                         0.379         0.157
  MCRTB        Mean                       0.198         0.154
               SD                         0.054         0.025

^a In Scenario 5, the true risk model is D = 0.1 + 0.2Y1 − 0.2Y2 + 0.1Y3 − Y1^2 + T(0.5 + Y1 + 0.5Y2 + 0.1Y3 − Y1^2) + ε, and in Scenario 6, the true risk model is D = −0.1 − 0.2Y1 + 0.2Y2 + Y1Y2 + T(−0.5 − Y1 − 0.5Y2 + 2Y1Y2) + ε, given T and Y, where Y1, Y2, and Y3 are independent N(0, 1) and ε follows N(0, 1) independently of Y1, Y2, and Y3. The same weight function w{Δ(Y)} = |Δ(Y)|^(−1/3), maximum number of iterations (M = 500), and maximum weight (CM = 500) were used as in our article.


We agree with the point raised by ZK and Tian that the performance of our boosting approach depends on the choice of working model. In practice, prior biological knowledge and cross-validation techniques are useful for guiding the choice of working model. There appears to be similar sensitivity of the OWL method of ZK to the choice of "kernel" parameterizing the treatment rule boundary. Comparing the finite sample performance of the boosting and OWL methods to one another in simulations will be challenging, particularly under model misspecification, given that each requires specification of a different set of inputs.

One simple question raised by ZK is how to extend our boosting approach from the binary outcome setting, which is the focus of our article, to other types of outcomes such as continuous and count outcomes. The method extends naturally, as illustrated in Table 1. We let D ∈ R be a continuous outcome for which a smaller value is preferable, T be the treatment assignment (T = 0 or 1, where T = 1 is the default), and Y ∈ R^p be a set of markers. Denote by Δ(Y) = E(D|T = 0, Y) − E(D|T = 1, Y) the marker-specific treatment effect, by φ(Y) = 1{Δ(Y) ≤ 0} the optimal treatment rule, and by

    θ{φ(Y)} = E(D|T = 1) − [E{D|T = 1, φ(Y) = 0}P{φ(Y) = 0} + E{D|T = 0, φ(Y) = 1}P{φ(Y) = 1}]

the primary measure of its performance. Using the boosting approach with a linear regression working model can yield marker combinations with slightly higher θ and less misclassification of treatment benefit than using classical linear regression with maximum likelihood estimation. We note, however, that further investigation is needed to specify reasonable ranges for the tuning parameters of the boosting method in the continuous outcome setting.
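As a small sketch of how θ{φ(Y)} can be estimated empirically on randomized test data, directly from the display above: here D, T, and phi are vectors holding the outcome, the randomized treatment (1 = default), and the rule's recommendation indicator φ(Y) (phi = 1 recommends T = 0); a large test set with all four (T, φ) strata nonempty is assumed.

    # Plug-in estimate of theta{phi(Y)}: mean outcome under default treatment
    # minus mean outcome under marker-based treatment
    theta_hat <- function(D, T, phi) {
      mean(D[T == 1]) -
        (mean(D[T == 1 & phi == 0]) * mean(phi == 0) +
         mean(D[T == 0 & phi == 1]) * mean(phi == 1))
    }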

We conclude with two observations that emerge from this discussion. First, there is much to be gained from bringing researchers from different areas of statistics and biostatistics together around a single topic. This discussion highlights the connections between the fields of adaptive treatment regimes and of risk prediction and biomarker evaluation. Undoubtedly, it has brought relevant work in one field to the attention of researchers in another. With such interactions our science will surely improve. Second, there is tremendous value in reproducible research. We applaud the journal for encouraging us to publish our simulation code along with our article. With this code, the discussants were able to efficiently compare alternative approaches to ours using the same simulation scenarios, thus expediting the scientific process.

Acknowledgements

This work was funded by R01 CA152089, P30CA015704, and R01 GM106177-01.

References

Freund, Y. (2001). An adaptive version of the boost by majority algorithm. Machine Learning 43, 293–318.

Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B 65, 331–355.

Zhang, B., Tsiatis, A., Laber, E., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics 68, 1010–1018.

Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective. Stat 1, 103–114.

Zhao, Y., Kosorok, M. R., and Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28, 3294–3315.

Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.