
Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models

Daniel W. Apley and Jingyu Zhu, Northwestern University, USA

Summary. In many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, support vector regression, etc.) in this regard is their lack of interpretability or transparency. Partial dependence (PD) plots, which are the most popular approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. As an alternative to PD plots, we present a new visualization approach that we term accumulated local effects (ALE) plots, which do not require this unreliable extrapolation with correlated predictors. Moreover, ALE plots are far less computationally expensive than PD plots.

Keywords: Functional ANOVA; Marginal plots; Partial dependence plots; Supervised learning; Visualization

1. Introduction

With the proliferation of larger and richer data sets in many predictive modeling application domains, black box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, support vector regression, etc.) are increasingly used in place of more transparent linear and logistic regression models to capture nonlinear phenomena. However, one shortcoming of black box supervised learning models is that they are difficult to interpret in terms of understanding the effects of the predictor variables (aka predictors) on the predicted response. For many applications, understanding the effects of the predictors is critically important. This is obviously the case if the purpose of the predictive modeling is explanatory, such as identifying new disease risk factors from electronic medical record databases. Even if the purpose is purely predictive, understanding the effects of the predictors may still be quite important. If the effect of a predictor violates intuition (e.g., if it appears from the supervised learning model that the risk of experiencing a cardiac event decreases as patients age), then this is either an indication that the fitted model is unreliable or that a surprising new phenomenon has been discovered. In addition, predictive models must be transparent in many regulatory environments, e.g., to demonstrate to regulators that consumer credit risk models do not penalize credit applicants based on age, race, etc.

To be more concrete, suppose we have fit a supervised learning model for approximating $E[Y \mid \mathbf{X} = \mathbf{x}] \approx f(\mathbf{x})$, where $Y$ is a scalar response variable, $\mathbf{X} = (X_1, X_2, \dots, X_d)$ is a vector of $d$ predictors, and $f(\cdot)$ is the fitted model that predicts $Y$ (or the probability that $Y$ falls into a particular class, in the classification setting) as a function of $\mathbf{X}$. To simplify notation, we omit any $\hat{\ }$ symbol over $f$, with the understanding that it is fitted from data. The training data to which the model is fit consists of $n$ $(d+1)$-variate observations $\{y_i, \mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d}) : i = 1, 2, \dots, n\}$.

†Address for correspondence: Daniel W. Apley, Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL 60208, USA. E-mail: [email protected]

arXiv:1612.08468v2 [stat.ME] 19 Aug 2019


Fig. 1. Illustration of the differences between the computation of (a) $\hat{f}_{1,PD}(x_1)$ and (b) $\hat{f}_{1,M}(x_1)$ at $x_1 = 0.3$.

Throughout, we use upper case to denote a random variable and lower case to denote specific or observed values of the random variable.

Our objective is to visualize and understand the "main effects" dependence of $f(\mathbf{x}) = f(x_1, x_2, \dots, x_d)$ on each of the individual predictors, as well as the low-order "interaction effects" among pairs of predictors. Throughout the introduction we illustrate concepts for the simple $d = 2$ case. The most popular approach for visualizing the effects of the predictors is partial dependence (PD) plots, introduced by Friedman (2001). To understand the effect of one predictor (say $X_1$) on the predicted response, a PD plot is a plot of the function

$$f_{1,PD}(x_1) \equiv E[f(x_1, X_2)] = \int p_2(x_2)\, f(x_1, x_2)\, dx_2 \qquad (1)$$

versus $x_1$, where $p_2(\cdot)$ denotes the marginal distribution of $X_2$. We use $p(\cdot)$ to denote the full joint probability density of $\mathbf{X}$, and use $p_\cdot(\cdot)$, $p_{\cdot|\cdot}(\cdot|\cdot)$, and $p_{\cdot,\cdot}(\cdot,\cdot)$ to respectively denote the marginal, conditional, and joint probability density functions of various elements of $\mathbf{X}$, with the subscripts indicating which elements. An estimate of (1), calculated pointwise in $x_1$ for a range of $x_1$ values, is

$$\hat{f}_{1,PD}(x_1) \equiv \frac{1}{n} \sum_{i=1}^{n} f(x_1, x_{i,2}). \qquad (2)$$
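
To make the estimator in (2) concrete, the following R sketch computes it pointwise over a grid of $x_1$ values. It is only an illustration of the formula; the function name pd_main and the generic predict_fun wrapper are our own choices, not part of any package.

# PD estimate (Eq. 2): for each grid value x, replace column j of the training
# predictors by x and average the model's predictions over all n rows.
pd_main <- function(predict_fun, X, j, grid) {
  sapply(grid, function(x) {
    X_mod <- X
    X_mod[[j]] <- x                 # force predictor j to the fixed value x
    mean(predict_fun(X_mod))        # (1/n) * sum_i f(x, x_{i,2})
  })
}
# Example usage, assuming `fit` is any fitted model with a predict() method:
# grid <- seq(min(X$x1), max(X$x1), length.out = 50)
# plot(grid, pd_main(function(d) predict(fit, d), X, "x1", grid), type = "l")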

Figure 1(a) illustrates how $\hat{f}_{1,PD}(x_1)$ is computed at a specific value $x_1 = 0.3$ for a toy example with $n = 200$ observations of $(X_1, X_2)$ following a uniform distribution along the line segment $x_2 = x_1$ but with independent $N(0, 0.05^2)$ variables added to both predictors (see Hooker (2007) for a similar example demonstrating the adverse consequences of extrapolation in PD plots). Although we ignore the response variable for now, we return to this example in Section 4 and fit various models $f(\mathbf{x})$ to these data. The salient point in Figure 1(a), which illustrates the problem with PD plots, is that the integral in (1) is the weighted average of $f(x_1, X_2)$ as $X_2$ varies over its marginal distribution. This integral is over the entire vertical line segment in Figure 1(a) and requires rather severe extrapolation beyond the envelope of the training data. If one were to fit a simple parametric model of the correct form (e.g., $f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2$), then this extrapolation might be reliable. However, by nature of its flexibility, a nonparametric supervised learning model like a regression tree cannot be expected to extrapolate reliably. As we demonstrate later (see Figures 5-7), this renders the PD plot an unreliable indicator of the effect of $x_1$.


The extrapolation in Figure 1(a) that is required to calculate $\hat{f}_{1,PD}(x_1)$ occurs because the marginal density $p_2(x_2)$ is much less concentrated around the data than the conditional density $p_{2|1}(x_2|x_1)$, due to the strong dependence between $X_2$ and $X_1$. Marginal plots (M plots) are alternatives to PD plots that avoid such extrapolation by using the conditional density in place of the marginal density. As illustrated in Figure 1(b), an M plot of the effect of $X_1$ is a plot of the function

$$f_{1,M}(x_1) \equiv E[f(X_1, X_2) \mid X_1 = x_1] = \int p_{2|1}(x_2|x_1)\, f(x_1, x_2)\, dx_2 \qquad (3)$$

versus $x_1$. A crude estimate of $f_{1,M}(x_1)$ is

$$\hat{f}_{1,M}(x_1) \equiv \frac{1}{n(x_1)} \sum_{i \in N(x_1)} f(x_1, x_{i,2}), \qquad (4)$$

where $N(x_1) \subset \{1, 2, \dots, n\}$ is the subset of row indices $i$ for which $x_{i,1}$ falls into some small, appropriately selected neighborhood of $x_1$, and $n(x_1)$ is the number of observations in the neighborhood. Although more sophisticated kernel smoothing methods are typically used to estimate $f_{1,M}(x_1)$, we do not consider them here, because there is a more serious problem with using $f_{1,M}(x_1)$ to visualize the main effect of $X_1$ when $X_1$ and $X_2$ are dependent. Namely, using $f_{1,M}(x_1)$ is like regressing $Y$ onto $X_1$ while ignoring (i.e., marginalizing‡ over) the nuisance variable $X_2$. Consequently, if $Y$ depends on $X_1$ and $X_2$, $f_{1,M}(x_1)$ will reflect both of their effects, a consequence of the omitted variable bias phenomenon in regression.
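
For comparison with the PD sketch above, a crude neighborhood version of (4) can be coded as follows; the window half-width h and the name m_main are illustrative choices rather than anything prescribed in the paper.

# Crude M-plot estimate (Eq. 4): average the predictions only over the
# observations whose x_{i,1} lies in a small window N(x1) around x1.
m_main <- function(predict_fun, X, j, grid, h = 0.05) {
  sapply(grid, function(x) {
    idx <- which(abs(X[[j]] - x) <= h)       # neighborhood N(x1)
    if (length(idx) == 0) return(NA_real_)   # no data near x1
    X_nbr <- X[idx, , drop = FALSE]
    X_nbr[[j]] <- x                          # evaluate f(x1, x_{i,2}) for i in N(x1)
    mean(predict_fun(X_nbr))
  })
}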

The main objective of this paper is to introduce a new method of assessing the main and interaction effects of the predictors in black box supervised learning models that avoids the foregoing problems with PD plots and M plots. We refer to the approach as accumulated local effects (ALE) plots. For the case that $d = 2$ and $f(\cdot)$ is differentiable (the more general definition is deferred until Section 2), we define the ALE main effect of $X_1$ as

$$f_{1,ALE}(x_1) \equiv \int_{x_{min,1}}^{x_1} E[f^1(X_1, X_2) \mid X_1 = z_1]\, dz_1 - \text{constant} = \int_{x_{min,1}}^{x_1} \int p_{2|1}(x_2|z_1)\, f^1(z_1, x_2)\, dx_2\, dz_1 - \text{constant}, \qquad (5)$$

where $f^1(x_1, x_2) \equiv \partial f(x_1, x_2)/\partial x_1$ represents the local effect of $x_1$ on $f(\cdot)$ at $(x_1, x_2)$, and $x_{min,1}$ is some value chosen near the lower bound of the effective support of $p_1(\cdot)$, e.g., just below the smallest observation $\min\{x_{i,1} : i = 1, 2, \dots, n\}$. The choice of $x_{min,1}$ is not important, as it only affects the vertical translation of the ALE plot of $f_{1,ALE}(x_1)$ versus $x_1$, and the constant in (5) will be chosen to vertically center the plot.

The function $f_{1,ALE}(x_1)$ can be interpreted as the accumulated local effects of $x_1$ in the following sense. In (5), we calculate the local effect $f^1(x_1, x_2)$ of $x_1$ at $(x_1 = z_1, x_2)$, then average this local effect across all values of $x_2$ with weight $p_{2|1}(x_2|z_1)$, and then finally accumulate/integrate this averaged local effect over all values of $z_1$ up to $x_1$. As illustrated in Figure 2, when averaging the local effect $f^1(x_1, x_2)$ across $x_2$, the use of the conditional density $p_{2|1}(x_2|x_1)$, instead of the marginal density $p_2(x_2)$, avoids the extrapolation required in PD plots. The avoidance of extrapolation is similar to M plots, which also use the conditional density $p_{2|1}(x_2|x_1)$. However, by averaging (across $x_2$) and accumulating (up to $x_1$) the local effects via (5), as opposed to directly averaging $f(\cdot)$ via (3), ALE plots avoid the omitted nuisance variable bias that renders M plots of little use for assessing the main effects of the predictors. This relates closely to the use of paired differences to block out nuisance factors in more general statistical settings, which we discuss in Section 5.3.
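
As a quick sanity check of (5), consider the function used for the response in Example 1 below, $f(x_1, x_2) = x_1 + x_2^2$ (ignoring the centering constant):

$$f^1(x_1, x_2) = \frac{\partial}{\partial x_1}(x_1 + x_2^2) = 1 \quad\Rightarrow\quad f_{1,ALE}(x_1) = \int_{x_{min,1}}^{x_1} E[1 \mid X_1 = z_1]\, dz_1 - \text{constant} = x_1 - \text{constant},$$

no matter how strongly $X_1$ and $X_2$ are correlated, because the inner conditional average of a constant local effect is unaffected by the dependence. By the same argument with $f^2(x_1, x_2) = 2x_2$ and $E[f^2(X_1, X_2) \mid X_2 = z_2] = 2z_2$, the ALE main effect of $X_2$ is $x_2^2$ up to an additive constant. These are exactly the linear and quadratic effects recovered in Figure 6.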

‡Regarding the terminology, plots of an estimate of $f_{1,M}(x_1)$ are often referred to as "marginal plots", because ignoring $X_2$ in this manner is equivalent to working with the joint distribution of $(Y, X_1)$ after marginalizing across $X_2$. Unfortunately, plots of $f_{1,PD}(x_1)$ are also sometimes referred to as "marginal plots" (e.g., in the gbm package for fitting boosted trees in R), presumably because the integral in (1) is with respect to the marginal distribution $p_2(x_2)$. In this paper, marginal plots will refer to how we have defined them above.


Fig. 2. Illustration of the computation of $f_{1,ALE}(x_1)$ at $x_1 = 0.3$.


Methods also exist for visualizing the effects of predictors by plotting a collection of curves, rather than a single curve that represents some aggregate effect. Consider the effect of a single predictor $X_j$, and let $\mathbf{X}_{\setminus j}$ denote the other predictors. Conditioning plots (coplots) (Chambers and Hastie (1992); Cleveland (1993)), conditional response (CORE) plots (Cook (1995)), and individual conditional expectation (ICE) plots (Goldstein and Pitkin (2015)) plot quantities like $f(x_j, \mathbf{x}_{\setminus j})$ vs. $x_j$ for a collection of discrete values of $\mathbf{x}_{\setminus j}$ (CORE and ICE plots), or similarly they plot $E[f(x_j, \mathbf{X}_{\setminus j}) \mid \mathbf{X}_{\setminus j} \in S_k]$ vs. $x_j$ for each set $S_k$ in some partition $\{S_k : k = 1, 2, \dots\}$ of the space of $\mathbf{X}_{\setminus j}$. Such a collection of curves has more in common with interaction effect plots (as in Figure 10, later) than with main effect plots, for which one desires, by definition, a single aggregated curve.

The format of the remainder of the paper is as follows. In Section 2, we define the ALE main effects for individual predictors and the ALE second-order interaction effects for pairs of predictors. In Section 3 we present estimators of the ALE main and second-order interaction effects, which are conceptually straightforward and computationally efficient (much more efficient than PD plots), and we prove their consistency. We focus on main and second-order interaction effects and discuss general higher-order effects and their estimation in the appendices. In Section 4 we give examples that illustrate the ALE plots and, in particular, how they can produce correct results when PD plots are corrupted due to their reliance on extrapolation. In Section 5, we discuss interpretation of ALE plots and a number of their desirable properties and computational advantages, and we illustrate with a real data example. We also discuss their relation to functional ANOVA decompositions for dependent variables (e.g., Hooker (2007)) that have been developed to avoid the same extrapolation problem highlighted in Figure 1(a). ALE plots are far more computationally efficient and systematic to compute than functional ANOVA decompositions, and they yield a fundamentally different decomposition of $f(\mathbf{x})$ that is better suited for visualization of the effects of the predictors. Section 6 concludes the paper. We also provide as supplementary material an R package ALEPlot to implement ALE plots.


2. Definition of ALE Main and Second-Order Effects

In this section we define the ALE main effect functions for each predictor (Eq. (5) is a special case for $d = 2$ and differentiable $f(\cdot)$) and the ALE second-order effect functions for each pair of predictors. ALE plots are plots of estimates of these functions, and the estimators are defined in Section 3. We do not envision ALE plots being commonly used to visualize third- and higher-order effects, since higher-order effects are difficult to interpret and usually not as predominant as main and second-order effects. For this reason, and to simplify notation, we focus on main and second-order effects and relegate the definition of higher-order ALE effects to the appendices.

Throughout this section, we assume that $p$ has compact support $S$, and the support of $p_j$ is the interval $S_j = [x_{min,j}, x_{max,j}]$ for each $j \in \{1, 2, \dots, d\}$. For each $K = 1, 2, \dots$ and $j \in \{1, 2, \dots, d\}$, let $P_j^K \equiv \{z_{k,j}^K : k = 0, 1, \dots, K\}$ be a partition of $S_j$ into $K$ intervals with $z_{0,j}^K = x_{min,j}$ and $z_{K,j}^K = x_{max,j}$. Define $\delta_{j,K} \equiv \max\{|z_{k,j}^K - z_{k-1,j}^K| : k = 1, 2, \dots, K\}$, which represents the fineness of the partition. For any $x \in S_j$, define $k_j^K(x)$ to be the index of the interval of $P_j^K$ into which $x$ falls, i.e., $x \in (z_{k-1,j}^K, z_{k,j}^K]$ for $k = k_j^K(x)$. Let $\mathbf{X}_{\setminus j}$ denote the subset of $d - 1$ predictors excluding $X_j$, i.e., $\mathbf{X}_{\setminus j} = (X_k : k = 1, 2, \dots, d;\ k \neq j)$. The following definition is a generalization of (5) for a function $f(\cdot)$ that is not necessarily differentiable and for any $d \geq 2$. The generalization essentially replaces the derivative and integral in (5) with limiting forms of finite differences and summations, respectively.

Definition 1 (Uncentered ALE Main Effect). Consider any $j \in \{1, 2, \dots, d\}$, and suppose the sequence of partitions $\{P_j^K : K = 1, 2, \dots\}$ is such that $\lim_{K \to \infty} \delta_{j,K} = 0$. When $f(\cdot)$ and $p$ are such that the following limit exists and is independent of the particular sequence of partitions $\{P_j^K : K = 1, 2, \dots\}$ (see Theorem A.1 in Appendix A for sufficient conditions on the existence and uniqueness of the limit), we define the uncentered ALE main (aka first-order) effect function of $X_j$ as (for $x_j \in S_j$)

$$g_{j,ALE}(x_j) \equiv \lim_{K \to \infty} \sum_{k=1}^{k_j^K(x_j)} E\big[f(z_{k,j}^K, \mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K, \mathbf{X}_{\setminus j}) \,\big|\, X_j \in (z_{k-1,j}^K, z_{k,j}^K]\big]. \qquad (6)$$

The following theorem, the proof of which is in Appendix A, states that for differentiable $f(\cdot)$, the uncentered ALE main effect of $X_j$ in (6) has an equivalent but more revealing form that is analogous to (5).

Theorem 1 (Uncentered ALE Main Effect for differentiable $f(\cdot)$). Let $f^j(x_j, \mathbf{x}_{\setminus j}) \equiv \partial f(x_j, \mathbf{x}_{\setminus j})/\partial x_j$ denote the partial derivative of $f(\mathbf{x})$ with respect to $x_j$ when the derivative exists. In Definition 1, suppose

(i) $f(x_j, \mathbf{x}_{\setminus j})$ is differentiable in $x_j$ on $S$,

(ii) $f^j(x_j, \mathbf{x}_{\setminus j})$ is continuous in $(x_j, \mathbf{x}_{\setminus j})$ on $S$, and

(iii) $E[f^j(X_j, \mathbf{X}_{\setminus j}) \mid X_j = z_j]$ is continuous in $z_j$ on $S_j$.

Then, for each $x_j \in S_j$,

$$g_{j,ALE}(x_j) = \int_{x_{min,j}}^{x_j} E[f^j(X_j, \mathbf{X}_{\setminus j}) \mid X_j = z_j]\, dz_j. \qquad (7)$$

(End of Theorem 1)
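
For a non-additive illustration of (7), suppose $d = 2$, $f(x_1, x_2) = x_1 x_2$, and $(X_1, X_2)$ are standard normal with correlation coefficient $\rho$ (setting aside, for the sake of the illustration, the compact-support assumption of this section), so that $E[X_2 \mid X_1 = z_1] = \rho z_1$. Then $f^1(x_1, x_2) = x_2$ and

$$g_{1,ALE}(x_1) = \int_{x_{min,1}}^{x_1} E[X_2 \mid X_1 = z_1]\, dz_1 = \int_{x_{min,1}}^{x_1} \rho z_1\, dz_1 = \tfrac{1}{2}\rho x_1^2 + \text{constant},$$

and after the centering in (8) below (using $E[X_1^2] = 1$), $f_{1,ALE}(x_1) = \tfrac{1}{2}\rho(x_1^2 - 1)$, which agrees with the expressions given for this example in Section 5.3.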

The (centered) ALE main effect of $X_j$, denoted by $f_{j,ALE}(x_j)$, is defined the same as $g_{j,ALE}(x_j)$ but centered so that $f_{j,ALE}(X_j)$ has a mean of zero with respect to the marginal distribution of $X_j$. That is,

$$f_{j,ALE}(x_j) \equiv g_{j,ALE}(x_j) - E[g_{j,ALE}(X_j)] = g_{j,ALE}(x_j) - \int p_j(z_j)\, g_{j,ALE}(z_j)\, dz_j. \qquad (8)$$


Remark 1. The ALE plot function $f_{j,ALE}(x_j)$ attempts to quantify something quite similar to the PD plot function $f_{j,PD}(x_j)$ in (1) and can be interpreted in the same manner. For example, ALE plots and PD plots both have a desirable additive recovery property. That is, if $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ is additive, then both $f_{j,ALE}(x_j)$ and $f_{j,PD}(x_j)$ are equal to the desired true effect $f_j(x_j)$, up to an additive constant. Hence, a plot of $f_{j,ALE}(x_j)$ vs. $x_j$ correctly reveals the true effect of $X_j$ on $f$, no matter how black-box the function $f$ is. If second-order interaction effects are present in $f$, a similar additive recovery property holds for the ALE second-order interaction effects that we define next (see Section 5.3 for a more general additive recovery property that applies to interactions of any order). In spite of the similarities in the characteristics of $f$ that they are designed to extract, the differences in $f_{j,ALE}(x_j)$ and $f_{j,PD}(x_j)$ lead to very different methods of estimation. As will be demonstrated in the later sections, the ALE plot functions are estimated in a far more computationally efficient manner that also avoids the extrapolation problem that renders PD plots unreliable with highly correlated predictors.

We next define the ALE second-order effects. For each pair of indices $\{j, l\} \subseteq \{1, 2, \dots, d\}$, let $\mathbf{X}_{\setminus \{j,l\}}$ denote the subset of $d - 2$ predictors excluding $\{X_j, X_l\}$, i.e., $\mathbf{X}_{\setminus \{j,l\}} = (X_k : k = 1, 2, \dots, d;\ k \neq j;\ k \neq l)$. For general $f(\cdot)$, the uncentered ALE second-order effect of $(X_j, X_l)$ is defined similarly to (6), except that we replace the 1-D finite differences by 2-D second-order finite differences on the 2-D grid that is the Cartesian product of the 1-D partitions of $S_j$ and $S_l$, and the summation is over this 2-D grid.

Definition 2 (Uncentered ALE Second-Order Effect). Consider any pair $\{j, l\} \subseteq \{1, \dots, d\}$ and corresponding sequences of partitions $\{P_j^K : K = 1, 2, \dots\}$ and $\{P_l^K : K = 1, 2, \dots\}$ such that $\lim_{K \to \infty} \delta_{j,K} = \lim_{K \to \infty} \delta_{l,K} = 0$. When $f(\cdot)$ and $p$ are such that the following limit exists and is independent of the particular sequences of partitions, we define the uncentered ALE second-order effect function of $(X_j, X_l)$ as (for $(x_j, x_l) \in S_j \times S_l$)

$$h_{\{j,l\},ALE}(x_j, x_l) \equiv \lim_{K \to \infty} \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E\big[\Delta_f^{\{j,l\}}(K, k, m; \mathbf{X}_{\setminus \{j,l\}}) \,\big|\, X_j \in (z_{k-1,j}^K, z_{k,j}^K],\ X_l \in (z_{m-1,l}^K, z_{m,l}^K]\big], \qquad (9)$$

where

$$\Delta_f^{\{j,l\}}(K, k, m; \mathbf{x}_{\setminus \{j,l\}}) = \big[f(z_{k,j}^K, z_{m,l}^K, \mathbf{x}_{\setminus \{j,l\}}) - f(z_{k-1,j}^K, z_{m,l}^K, \mathbf{x}_{\setminus \{j,l\}})\big] - \big[f(z_{k,j}^K, z_{m-1,l}^K, \mathbf{x}_{\setminus \{j,l\}}) - f(z_{k-1,j}^K, z_{m-1,l}^K, \mathbf{x}_{\setminus \{j,l\}})\big] \qquad (10)$$

is the second-order finite difference of $f(\mathbf{x}) = f(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}})$ with respect to $(x_j, x_l)$ across cell $(z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K]$ of the 2-D grid that is the Cartesian product of $P_j^K$ and $P_l^K$.

Analogous to Theorem 1, Theorem 2 (proved in Appendix A) states that for differentiable $f(\cdot)$, the uncentered ALE second-order effect of $(X_j, X_l)$ in (9) has an equivalent integral form.

Theorem 2 (Uncentered ALE Second-Order Effect for differentiable $f(\cdot)$). Let $f^{\{j,l\}}(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}}) \equiv \partial^2 f(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}})/\partial x_j \partial x_l$ denote the second-order partial derivative of $f(\mathbf{x})$ with respect to $x_j$ and $x_l$ when the derivative exists. In Definition 2, suppose

(i) $f(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}})$ is differentiable in $(x_j, x_l)$ on $S$,

(ii) $f^{\{j,l\}}(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}})$ is continuous in $(x_j, x_l, \mathbf{x}_{\setminus \{j,l\}})$ on $S$, and

(iii) $E[f^{\{j,l\}}(X_j, X_l, \mathbf{X}_{\setminus \{j,l\}}) \mid X_j = z_j, X_l = z_l]$ is continuous in $(z_j, z_l)$ on $S_j \times S_l$.

Then, for each $(x_j, x_l) \in S_j \times S_l$,

$$h_{\{j,l\},ALE}(x_j, x_l) = \int_{x_{min,l}}^{x_l} \int_{x_{min,j}}^{x_j} E[f^{\{j,l\}}(X_j, X_l, \mathbf{X}_{\setminus \{j,l\}}) \mid X_j = z_j, X_l = z_l]\, dz_j\, dz_l. \qquad (11)$$


(End of Theorem 2)

The ALE second-order effect of $(X_j, X_l)$, denoted by $f_{\{j,l\},ALE}(x_j, x_l)$, is defined the same as $h_{\{j,l\},ALE}(x_j, x_l)$ but "doubly centered" so that $f_{\{j,l\},ALE}(X_j, X_l)$ has a mean of zero with respect to the marginal distribution of $(X_j, X_l)$ and so that the ALE main effects of $X_j$ and $X_l$ on $f_{\{j,l\},ALE}(X_j, X_l)$ are both zero. The latter centering is accomplished by subtracting from $h_{\{j,l\},ALE}(x_j, x_l)$ its uncentered ALE main effects via

$$\begin{aligned} g_{\{j,l\},ALE}(x_j, x_l) \equiv\ & h_{\{j,l\},ALE}(x_j, x_l) \\ & - \lim_{K \to \infty} \sum_{k=1}^{k_j^K(x_j)} E\big[h_{\{j,l\},ALE}(z_{k,j}^K, X_l) - h_{\{j,l\},ALE}(z_{k-1,j}^K, X_l) \,\big|\, X_j \in (z_{k-1,j}^K, z_{k,j}^K]\big] \\ & - \lim_{K \to \infty} \sum_{k=1}^{k_l^K(x_l)} E\big[h_{\{j,l\},ALE}(X_j, z_{k,l}^K) - h_{\{j,l\},ALE}(X_j, z_{k-1,l}^K) \,\big|\, X_l \in (z_{k-1,l}^K, z_{k,l}^K]\big]. \end{aligned} \qquad (12)$$

By Theorem 1, for differentiable $f$, (12) is equivalent to

$$\begin{aligned} g_{\{j,l\},ALE}(x_j, x_l) \equiv\ & h_{\{j,l\},ALE}(x_j, x_l) - \int_{x_{min,j}}^{x_j} E\Big[\frac{\partial h_{\{j,l\},ALE}(X_j, X_l)}{\partial X_j} \,\Big|\, X_j = z_j\Big]\, dz_j - \int_{x_{min,l}}^{x_l} E\Big[\frac{\partial h_{\{j,l\},ALE}(X_j, X_l)}{\partial X_l} \,\Big|\, X_l = z_l\Big]\, dz_l \\ =\ & h_{\{j,l\},ALE}(x_j, x_l) - \int_{x_{min,j}}^{x_j}\! \int p_{l|j}(z_l|z_j)\, \frac{\partial h_{\{j,l\},ALE}(z_j, z_l)}{\partial z_j}\, dz_l\, dz_j - \int_{x_{min,l}}^{x_l}\! \int p_{j|l}(z_j|z_l)\, \frac{\partial h_{\{j,l\},ALE}(z_j, z_l)}{\partial z_l}\, dz_j\, dz_l. \end{aligned} \qquad (13)$$

The final centering is accomplished by taking

$$f_{\{j,l\},ALE}(x_j, x_l) \equiv g_{\{j,l\},ALE}(x_j, x_l) - E[g_{\{j,l\},ALE}(X_j, X_l)] = g_{\{j,l\},ALE}(x_j, x_l) - \int\!\!\int p_{\{j,l\}}(z_j, z_l)\, g_{\{j,l\},ALE}(z_j, z_l)\, dz_j\, dz_l. \qquad (14)$$

It can be verified that $f_{\{j,l\},ALE}(x_j, x_l)$ is centered in the sense that the ALE main effects of $X_j$ and $X_l$ on $f_{\{j,l\},ALE}(x_j, x_l)$ are both zero (see Appendix C for a formal proof of a related but more general result).

If we define the zero-order effect for any function of $\mathbf{X}$ as its expected value with respect to $p$, then we can view the ALE first-order effect of $X_j$ as being obtained by first calculating its uncentered first-order effect (6), and then, for the resulting function, subtracting its zero-order effect. Likewise, the ALE second-order effect of $(X_j, X_l)$ is obtained by first calculating the uncentered second-order effect (9), then, for the resulting function, subtracting both of its first-order effects of $X_j$ and of $X_l$, and then, for this resulting function, subtracting its zero-order effect. The ALE higher-order effects are defined analogously in Appendix B. The uncentered higher-order effect is first calculated, and then all lower-order effects are sequentially calculated and subtracted one order at a time, until the final result has all lower-order effects that are identically zero.

Remark 2. In Appendix B we define ALE higher-order effect functions $f_{J,ALE}(\mathbf{x}_J)$ for $|J| > 2$, where $|J|$ denotes the cardinality of the set of predictor indices $J$. Appendix C shows that this leads to a functional-ANOVA-like decomposition of $f$ via

$$f(\mathbf{x}) = \sum_{j=1}^{d} f_{j,ALE}(x_j) + \sum_{j=1}^{d} \sum_{l=j+1}^{d} f_{\{j,l\},ALE}(x_j, x_l) + \sum_{J \subseteq \{1,2,\dots,d\},\, |J| \geq 3} f_{J,ALE}(\mathbf{x}_J).$$


This ALE decomposition has a certain orthogonality-like property, which we contrast with other functional ANOVA decompositions in Section 5.5.

Remark 3. The ALE function definitions in this section apply to predictor distributions $p_j$ that are continuous numerical with compact support. For discrete $p_j$, one could consider modifying (6) and (9) by using a fixed finite partition whose interval endpoints coincide with the support of $p_j$. We do not develop this, however, because our focus is on estimation and interpretation of the ALE effects, and the estimators in the following section are meaningful for either continuous or discrete $p_j$. In the case that $X_j$ is a nominal categorical predictor, one must decide how to order its categories prior to estimating its ALE effect (which requires differencing $f$ across neighboring categories of $X_j$). In Appendix E, we discuss a strategy for this that we have found to work well in practice.

3. Estimation of $f_{j,ALE}(x_j)$ and $f_{\{j,l\},ALE}(x_j, x_l)$

In Appendix D we briefly describe how to estimate the ALE higher-order effect $f_{J,ALE}(\mathbf{x}_J)$ for a general subset $J \subseteq \{1, 2, \dots, d\}$ of predictor indices. Our focus in this section is on estimating the first-order ($|J| = 1$) and second-order ($|J| = 2$) effects, since these are the most common and useful (i.e., interpretable).

As an overview, the estimate $\hat{f}_{J,ALE}$ is obtained by computing estimates of the quantities in Eqs. (6)-(14) for $J = j$ (a single index) or for $J = \{j, l\}$ (a pair of indices). For the estimates we (i) replace the sequence of partitions in (6) (or (9)) by some appropriate fixed partition of the sample range of $\{\mathbf{x}_{i,J} : i = 1, \dots, n\}$ and (ii) replace the conditional expectations in (6) (or (9)) by sample averages across $\{\mathbf{x}_{i,\setminus J} : i = 1, 2, \dots, n\}$, conditioned on $\mathbf{x}_{i,J}$ falling into the corresponding interval/cell of the partition. In the preceding, $\mathbf{x}_{i,J} = (x_{i,j} : j \in J)$ and $\mathbf{x}_{i,\setminus J} = (x_{i,j} : j = 1, 2, \dots, d;\ j \notin J)$ denote the $i$th observation of the subsets of predictors $\mathbf{X}_J$ and $\mathbf{X}_{\setminus J}$, respectively.

More specifically, for each $j \in \{1, 2, \dots, d\}$, let $\{N_j(k) = (z_{k-1,j}, z_{k,j}] : k = 1, 2, \dots, K\}$ be a sufficiently fine partition of the sample range of $\{x_{i,j} : i = 1, 2, \dots, n\}$ into $K$ intervals. Since the estimator is computed for a fixed $K$, we have omitted it as a superscript on the partition, with the understanding that the partition implicitly depends on $K$. In all of our examples later in the paper, we chose $z_{k,j}$ as the $k/K$ quantile of the empirical distribution of $\{x_{i,j} : i = 1, 2, \dots, n\}$, with $z_{0,j}$ chosen just below the smallest observation and $z_{K,j}$ chosen as the largest observation. Figure 3 illustrates the notation and concepts in computing the ALE main effect estimator $\hat{f}_{j,ALE}(x_j)$ for the first predictor $j = 1$ for the case of $d = 2$ predictors. For $k = 1, 2, \dots, K$, let $n_j(k)$ denote the number of training observations that fall into the $k$th interval $N_j(k)$, so that $\sum_{k=1}^{K} n_j(k) = n$. For a particular value $x$ of the predictor $x_j$, let $k_j(x)$ denote the index of the interval into which $x$ falls, i.e., $x \in (z_{k_j(x)-1,j}, z_{k_j(x),j}]$.

For general $d$, to estimate the main effect function $f_{j,ALE}(\cdot)$ of a predictor $X_j$, we first compute an estimate of the uncentered effect $g_{j,ALE}(\cdot)$ defined in (6), which is

$$\hat{g}_{j,ALE}(x) = \sum_{k=1}^{k_j(x)} \frac{1}{n_j(k)} \sum_{\{i:\, x_{i,j} \in N_j(k)\}} \big[f(z_{k,j}, \mathbf{x}_{i,\setminus j}) - f(z_{k-1,j}, \mathbf{x}_{i,\setminus j})\big] \qquad (15)$$

for each $x \in (z_{0,j}, z_{K,j}]$. Analogous to (8), the ALE main effect estimator $\hat{f}_{j,ALE}(\cdot)$ is then obtained by subtracting an estimate of $E[g_{j,ALE}(X_j)]$ from (15), i.e.,

$$\hat{f}_{j,ALE}(x) = \hat{g}_{j,ALE}(x) - \frac{1}{n} \sum_{i=1}^{n} \hat{g}_{j,ALE}(x_{i,j}) = \hat{g}_{j,ALE}(x) - \frac{1}{n} \sum_{k=1}^{K} n_j(k)\, \hat{g}_{j,ALE}(z_{k,j}). \qquad (16)$$
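
The paper's supplementary ALEPlot package implements (15)-(16); purely to make the formulas concrete, a self-contained R sketch of the main-effect estimator, using the quantile-based bin edges described above, might look as follows. The function name ale_main and its arguments are our own illustrative choices, not the package's interface.

# ALE main-effect estimate for predictor j (Eqs. 15-16).
# predict_fun(newdata) must return the fitted model's predictions f(.).
ale_main <- function(predict_fun, X, j, K = 40) {
  xj <- X[[j]]
  z <- unique(quantile(xj, probs = 0:K / K))   # bin edges z_{0,j}, ..., z_{K,j}
  z[1] <- z[1] - 1e-8                          # z_{0,j} just below the smallest observation
  K <- length(z) - 1                           # ties may reduce the number of bins
  k_idx <- cut(xj, breaks = z, labels = FALSE) # interval index k_j(x_{i,j})
  # Finite differences f(z_k, x_{i,\j}) - f(z_{k-1}, x_{i,\j}) for each observation
  X_hi <- X; X_hi[[j]] <- z[k_idx + 1]
  X_lo <- X; X_lo[[j]] <- z[k_idx]
  d_i <- predict_fun(X_hi) - predict_fun(X_lo)
  # Average within each interval (assumes every interval is non-empty, which the
  # quantile-based edges ensure up to ties), then accumulate as in Eq. (15)
  local_eff <- tapply(d_i, factor(k_idx, levels = 1:K), mean)
  g_hat <- cumsum(local_eff)
  n_k <- tabulate(k_idx, nbins = K)
  f_hat <- g_hat - sum(n_k * g_hat) / sum(n_k) # center as in Eq. (16)
  list(z = z[-1], ale = as.numeric(f_hat))     # effect evaluated at the right bin edges
}
# Example usage (hypothetical objects `fit` and `dat`):
# res <- ale_main(function(d) as.numeric(predict(fit, d)), dat[, c("x1", "x2")], "x1", K = 100)
# plot(res$z, res$ale, type = "l")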

To estimate the ALE second-order effect of a pair of predictors $(X_j, X_l)$, we partition the sample range of $\{(x_{i,j}, x_{i,l}) : i = 1, 2, \dots, n\}$ into a grid of $K^2$ rectangular cells obtained as the Cartesian product of the individual one-dimensional partitions.


Fig. 3. Illustration of the notation and concepts in computing the ALE main effect estimator $\hat{f}_{j,ALE}(x_j)$ for $j = 1$ with $d = 2$ predictors. The bullets are a scatterplot of $\{(x_{i,1}, x_{i,2}) : i = 1, 2, \dots, n\}$ for $n = 30$ training observations. The range of $\{x_{i,1} : i = 1, 2, \dots, n\}$ is partitioned into $K = 5$ intervals $\{N_1(k) = (z_{k-1,1}, z_{k,1}] : k = 1, 2, \dots, 5\}$ (in practice, $K$ should usually be chosen much larger than 5). The numbers of training observations falling into the 5 intervals are $n_1(1) = 4$, $n_1(2) = 6$, $n_1(3) = 6$, $n_1(4) = 5$, and $n_1(5) = 9$. The horizontal line segments shown in the $N_1(4)$ region are the segments across which the finite differences $f(z_{4,j}, \mathbf{x}_{i,\setminus j}) - f(z_{3,j}, \mathbf{x}_{i,\setminus j})$ are calculated and then averaged in the inner summand of Eq. (15) for $k = 4$ and $j = 1$.

Figure 4 illustrates the notation and concepts. Let $(k, m)$ (with $k$ and $m$ integers between 1 and $K$) denote the indices into the grid of rectangular cells, with $k$ corresponding to $x_j$ and $m$ corresponding to $x_l$. In analogy with $N_j(k)$ and $n_j(k)$ defined in the context of estimating $f_{j,ALE}(\cdot)$, let $N_{\{j,l\}}(k, m) = N_j(k) \times N_l(m) = (z_{k-1,j}, z_{k,j}] \times (z_{m-1,l}, z_{m,l}]$ denote the cell associated with indices $(k, m)$, and let $n_{\{j,l\}}(k, m)$ denote the number of training observations that fall into cell $N_{\{j,l\}}(k, m)$, so that $\sum_{k=1}^{K} \sum_{m=1}^{K} n_{\{j,l\}}(k, m) = n$.

To estimate $f_{\{j,l\},ALE}(x_j, x_l)$, we first estimate the uncentered effect $h_{\{j,l\},ALE}(x_j, x_l)$ defined in (9) by

$$\hat{h}_{\{j,l\},ALE}(x_j, x_l) = \sum_{k=1}^{k_j(x_j)} \sum_{m=1}^{k_l(x_l)} \frac{1}{n_{\{j,l\}}(k, m)} \sum_{\{i:\, \mathbf{x}_{i,\{j,l\}} \in N_{\{j,l\}}(k, m)\}} \Delta_f^{\{j,l\}}(K, k, m; \mathbf{x}_{i,\setminus \{j,l\}}) \qquad (17)$$

for each $(x_j, x_l) \in (z_{0,j}, z_{K,j}] \times (z_{0,l}, z_{K,l}]$. In (17), $\Delta_f^{\{j,l\}}(K, k, m; \mathbf{x}_{i,\setminus \{j,l\}})$ is the second-order finite difference defined in (10), evaluated at the $i$th observation $\mathbf{x}_{i,\setminus \{j,l\}}$, i.e.,

$$\Delta_f^{\{j,l\}}(K, k, m; \mathbf{x}_{i,\setminus \{j,l\}}) = \big[f(z_{k,j}, z_{m,l}, \mathbf{x}_{i,\setminus \{j,l\}}) - f(z_{k-1,j}, z_{m,l}, \mathbf{x}_{i,\setminus \{j,l\}})\big] - \big[f(z_{k,j}, z_{m-1,l}, \mathbf{x}_{i,\setminus \{j,l\}}) - f(z_{k-1,j}, z_{m-1,l}, \mathbf{x}_{i,\setminus \{j,l\}})\big]. \qquad (18)$$
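
In code, the inner average of (17) over one cell of the grid is just the mean of the second-order differences (18) over the observations in that cell. A small R sketch follows, with illustrative names; the bin-edge vectors zj and zl are assumed to be indexed so that zj[k + 1] is $z_{k,j}$.

# Mean second-order finite difference (Eq. 18) over the observations in cell (k, m),
# i.e., the inner summand of Eq. (17). `idx` indexes the rows with
# (x_{i,j}, x_{i,l}) falling in cell N_{j,l}(k, m).
cell_delta <- function(predict_fun, X, j, l, zj, zl, k, m, idx) {
  corner <- function(a, b) {     # evaluate f at a cell corner, holding x_{i,\{j,l\}} fixed
    Xc <- X[idx, , drop = FALSE]
    Xc[[j]] <- a
    Xc[[l]] <- b
    predict_fun(Xc)
  }
  mean((corner(zj[k + 1], zl[m + 1]) - corner(zj[k], zl[m + 1])) -
       (corner(zj[k + 1], zl[m])     - corner(zj[k], zl[m])))
}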

Analogous to (12), we next compute estimates of the ALE main effects of $X_j$ and $X_l$ for the function $\hat{h}_{\{j,l\},ALE}(x_j, x_l)$ and then subtract these from $\hat{h}_{\{j,l\},ALE}(x_j, x_l)$ to give an estimate of $g_{\{j,l\},ALE}(x_j, x_l)$:


$$\begin{aligned} \hat{g}_{\{j,l\},ALE}(x_j, x_l) =\ & \hat{h}_{\{j,l\},ALE}(x_j, x_l) - \sum_{k=1}^{k_j(x_j)} \frac{1}{n_j(k)} \sum_{\{i:\, x_{i,j} \in N_j(k)\}} \big[\hat{h}_{\{j,l\},ALE}(z_{k,j}, x_{i,l}) - \hat{h}_{\{j,l\},ALE}(z_{k-1,j}, x_{i,l})\big] \\ & - \sum_{m=1}^{k_l(x_l)} \frac{1}{n_l(m)} \sum_{\{i:\, x_{i,l} \in N_l(m)\}} \big[\hat{h}_{\{j,l\},ALE}(x_{i,j}, z_{m,l}) - \hat{h}_{\{j,l\},ALE}(x_{i,j}, z_{m-1,l})\big] \\ =\ & \hat{h}_{\{j,l\},ALE}(x_j, x_l) - \sum_{k=1}^{k_j(x_j)} \frac{1}{n_j(k)} \sum_{m=1}^{K} n_{\{j,l\}}(k, m) \big[\hat{h}_{\{j,l\},ALE}(z_{k,j}, z_{m,l}) - \hat{h}_{\{j,l\},ALE}(z_{k-1,j}, z_{m,l})\big] \\ & - \sum_{m=1}^{k_l(x_l)} \frac{1}{n_l(m)} \sum_{k=1}^{K} n_{\{j,l\}}(k, m) \big[\hat{h}_{\{j,l\},ALE}(z_{k,j}, z_{m,l}) - \hat{h}_{\{j,l\},ALE}(z_{k,j}, z_{m-1,l})\big]. \end{aligned} \qquad (19)$$

Fig. 4. Illustration of the notation used in computing the ALE second-order effect estimator $\hat{f}_{\{j,l\},ALE}(x_j, x_l)$ for $K = 5$. The ranges of $\{x_{i,j} : i = 1, 2, \dots, n\}$ and $\{x_{i,l} : i = 1, 2, \dots, n\}$ are each partitioned into 5 intervals, and their Cartesian product forms the grid of rectangular cells $\{N_{\{j,l\}}(k, m) = N_j(k) \times N_l(m) : k = 1, 2, \dots, 5;\ m = 1, 2, \dots, 5\}$. The cell with bold borders is the region $N_{\{j,l\}}(4, 3)$. The second-order finite differences $\Delta_f^{\{j,l\}}(K, k, m; \mathbf{x}_{i,\setminus \{j,l\}})$ in Eq. (18) for $(k, m) = (4, 3)$ are calculated across the corners of this cell. In the inner summation of Eq. (17), these differences are then averaged over the $n_{\{j,l\}}(4, 3) = 2$ observations in region $N_{\{j,l\}}(4, 3)$.

Finally, analogous to (14), we estimate $f_{\{j,l\},ALE}(x_j, x_l)$ by subtracting an estimate of $E[\hat{g}_{\{j,l\},ALE}(X_j, X_l)]$ from (19), which gives

$$\hat{f}_{\{j,l\},ALE}(x_j, x_l) = \hat{g}_{\{j,l\},ALE}(x_j, x_l) - \frac{1}{n} \sum_{i=1}^{n} \hat{g}_{\{j,l\},ALE}(x_{i,j}, x_{i,l}) = \hat{g}_{\{j,l\},ALE}(x_j, x_l) - \frac{1}{n} \sum_{k=1}^{K} \sum_{m=1}^{K} n_{\{j,l\}}(k, m)\, \hat{g}_{\{j,l\},ALE}(z_{k,j}, z_{m,l}). \qquad (20)$$

Theorems 3 and 4 in Appendix A show that, under mild conditions, (16) and (20) are consistent estimators of the ALE main effect (8) of $X_j$ and the ALE second-order effect (14) of $(X_j, X_l)$, respectively.

ALE plots are plots of the ALE effect estimates $\hat{f}_{j,ALE}(x_j)$ and $\hat{f}_{\{j,l\},ALE}(x_j, x_l)$ versus the predictors involved. ALE plots have substantial computational advantages over PD plots, which we discuss in Section 5.4. In addition, ALE plots can produce reliable estimates of the main and interaction effects in situations where PD plots break down, which we illustrate with examples in the next section, as well as an example on real data in Section 5.1.
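
In practice, one would not code these estimators by hand: the ALEPlot package supplied with the paper computes and plots them directly. The calls below sketch typical usage as we recall it; the exact argument names and defaults (in particular the pred.fun wrapper, which takes the fitted model and a data frame of new predictor values) are an assumption on our part and should be checked against the package documentation.

# library(ALEPlot)
# yhat <- function(X.model, newdata) as.numeric(predict(X.model, newdata))
# ALEPlot(X, fit, pred.fun = yhat, J = 1, K = 100)         # main effect of predictor 1
# ALEPlot(X, fit, pred.fun = yhat, J = c(3, 7), K = 100)   # second-order effect of (X3, X7)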

4. Toy Examples Illustrating when ALE Plots are Reliable but PD Plots Break Down

Fig. 5. Depiction of the first eight splits in the tree fitted to the Example 1 data. The left panel is a scatterplot of $x_2$ vs. $x_1$ showing splits corresponding to the truncated tree in the right panel.

Example 1. This example was introduced in Section 1. For this example, $d = 2$, $n = 200$, and $(X_1, X_2)$ follows a uniform distribution along a segment of the line $x_2 = x_1$ with independent $N(0, 0.05^2)$ variables added to both predictors. Figure 5 shows a scatter plot of $X_2$ vs. $X_1$. The true response was generated according to the noiseless model $Y = X_1 + X_2^2$ for the 200 training observations in Figure 5, to which we fit a tree using the tree package of R (Ripley (2015)). The tree was overgrown and then pruned back to have 100 leaf nodes, which was approximately the optimal number of leaf nodes according to a cross-validation error sum of squares criterion. Notice that the optimal size tree is relatively large, because the response here is a deterministic function $X_1 + X_2^2$ of the predictors with no response observation error. The first eight splits of the fitted tree $f(\mathbf{x})$ are also depicted in Figure 5.
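
A sketch of how a data set and tree of this kind might be generated and fit in R is given below; the seed and the exact control settings are our own illustrative choices (the paper reports only the package used, the response model, and the pruned size of roughly 100 leaves).

library(tree)
set.seed(1)                                   # illustrative seed, not from the paper
n  <- 200
t  <- runif(n)                                # uniform positions along the line x2 = x1
x1 <- t + rnorm(n, sd = 0.05)                 # add independent N(0, 0.05^2) noise to each predictor
x2 <- t + rnorm(n, sd = 0.05)
dat <- data.frame(x1, x2, y = x1 + x2^2)      # noiseless response of Example 1
big_tree <- tree(y ~ x1 + x2, data = dat,
                 control = tree.control(nobs = n, mincut = 1, minsize = 2, mindev = 0))
fit <- prune.tree(big_tree, best = 100)       # prune the overgrown tree back to ~100 leaves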


Fig. 6. For the tree model fitted to the Example 1 data, plots of $\hat{f}_{j,ALE}(x_j)$ (blue line with bullets), $\hat{f}_{j,PD}(x_j)$ (red dotted line), $\hat{f}_{j,M}(x_j)$ (black dashed line), and the true main effect of $X_j$ (black solid line) for (a) $j = 1$, for which the true effect of $X_1$ is linear, and (b) $j = 2$, for which the true effect of $X_2$ is quadratic. For both $j = 1$ and $j = 2$, $\hat{f}_{j,ALE}(x_j)$ is much more accurate than either $\hat{f}_{j,PD}(x_j)$ or $\hat{f}_{j,M}(x_j)$.

Figure 6 shows main effect PD plots, M plots, and ALE plots for the full 100-node fitted tree $f(\mathbf{x})$, calculated via (2), (4), and (15)-(16), respectively. For both $j = 1$ and $j = 2$, $\hat{f}_{j,ALE}(x_j)$ is much more accurate than either $\hat{f}_{j,PD}(x_j)$ or $\hat{f}_{j,M}(x_j)$. By inspection of the fitted tree in Figure 5, it is clear why $\hat{f}_{j,PD}(x_j)$ performs so poorly in this example. For small $x_1$ values like $x_1 \approx 0.2$, the PD plot estimate $\hat{f}_{1,PD}(x_1 \approx 0.2)$ is much higher than it should be, because it is based on the extrapolated values of $f(\mathbf{x})$ in the upper-left corner of the scatter plot in Figure 5, which were substantially overestimated due to the nature of the tree splits and the absence of any data in that region. For similar reasons, $\hat{f}_{2,PD}(x_2)$ for small $x_2$ values is substantially underestimated because of the extrapolation in the lower-right corner of the scatter plot in Figure 5. In contrast, by avoiding this extrapolation, $\hat{f}_{1,ALE}(x_1)$ and $\hat{f}_{2,ALE}(x_2)$ are estimated quite accurately and are quite close to the true linear (for $x_1$) and quadratic (for $x_2$) effects, as seen in Figures 6(a) and 6(b), respectively.

Also notice that the M plots in Figures 6(a) and 6(b) perform very poorly. As expected, because of the strong correlation between $X_1$ and $X_2$, $\hat{f}_{1,M}(x_1)$ and $\hat{f}_{2,M}(x_2)$ are quite close to each other and are each combinations of the true effects of $X_1$ and $X_2$. In the subsequent examples, we do not further consider M plots.

Example 2. This example is a modification of Example 1 having the same $d = 2$, $n = 200$, and $(X_1, X_2)$ following a uniform distribution along a segment of the line $x_2 = x_1$ with independent $N(0, 0.05^2)$ variables added to both predictors. However, the true response is now generated as noisy observations according to the model $Y = X_1 + X_2^2 + \varepsilon$ with $\varepsilon \sim N(0, 0.1^2)$, and we fit a neural network model instead of a tree. For the neural network, we used the nnet package of R (Venables and Ripley (2002)) with ten nodes in the single hidden layer, a linear output activation function, and a decay/regularization parameter of 0.0001, all of which were chosen as approximately optimal via multiple replicates of 10-fold cross-validation (the cross-validation $r^2$ for this model varied between 0.965 and 0.975, depending on the data set generated, which is quite close to the theoretical $r^2$ value of $1 - \mathrm{Var}(\varepsilon)/\mathrm{Var}(Y) = 0.972$).
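
The corresponding fit can be sketched as follows; the seed and the maxit setting are illustrative choices of ours, while the size and decay values are those reported in the text.

library(nnet)
set.seed(2)                                     # illustrative seed, not from the paper
t  <- runif(200)
x1 <- t + rnorm(200, sd = 0.05)
x2 <- t + rnorm(200, sd = 0.05)
y  <- x1 + x2^2 + rnorm(200, sd = 0.1)          # noisy response of Example 2
dat <- data.frame(y, x1, x2)
fit <- nnet(y ~ x1 + x2, data = dat, size = 10, # ten hidden nodes
            linout = TRUE,                      # linear output activation
            decay = 1e-4, maxit = 2000, trace = FALSE)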


Fig. 7. Comparison of (a) $\hat{f}_{1,ALE}(x_1)$, (b) $\hat{f}_{1,PD}(x_1)$, (c) $\hat{f}_{2,ALE}(x_2)$, and (d) $\hat{f}_{2,PD}(x_2)$ for neural network models fitted over 50 Monte Carlo replicates of the Example 2 data. In each panel, the black curve is the true effect function (linear for $X_1$ and quadratic for $X_2$), and the gray curves are the estimated effect functions over the 50 Monte Carlo replicates.

We repeated the procedure in a Monte Carlo simulation with 50 replicates, where on each replicate we generated a new training data set of 200 observations and refit the neural network model with the same tuning parameters mentioned above. The estimated main effect functions $\hat{f}_{j,ALE}(x_j)$ and $\hat{f}_{j,PD}(x_j)$ (for $j = 1, 2$) over all 50 replicates are shown in Figure 7. For this example too, $\hat{f}_{j,ALE}(x_j)$ is far superior to $\hat{f}_{j,PD}(x_j)$. On every replicate, $\hat{f}_{1,ALE}(x_1)$ and $\hat{f}_{2,ALE}(x_2)$ are quite close to the true linear and quadratic effects, respectively. In sharp contrast, $\hat{f}_{1,PD}(x_1)$ and $\hat{f}_{2,PD}(x_2)$ are so inaccurate on many replicates that they are of little use in understanding the true effects of $X_1$ and $X_2$.

5. Discussion

5.1. Illustration with a Bike-Sharing Real Data Example

We now show an example with a real, larger data set. The data are a compilation of the bike-sharing rental counts from the Capital Bikeshare system (Washington D.C., USA) over the two-year period 2011-2012, aggregated on an hourly basis, together with hourly weather and seasonal information over the same time period. The data file can be found at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset (Fanaee-T and Gama (2013)). There are $n = 17393$ cases/rows in the training data set, corresponding to 17393 hours of data. The response is the total number of bike rental counts in each hour. We use the following $d = 11$ predictors: year ($X_1$, categorical with 2 categories: 0 = 2011, 1 = 2012), month ($X_2$, treated as numerical: 1 = January, 2 = February, ..., 12 = December), hour ($X_3$, treated as numerical: $\{0, 1, \dots, 23\}$), holiday ($X_4$, categorical: 0 = non-holiday, 1 = holiday), weekday ($X_5$, treated as numerical: $\{0, 1, \dots, 6\}$ representing day of the week, with 0 = Sunday), workingday ($X_6$, categorical: 1 = neither weekend nor holiday, 0 = otherwise), weather situation ($X_7$, treated as numerical: $\{1, 2, 3, 4\}$, smaller values correspond to nicer weather situations), temp ($X_8$, numerical: temperature in Celsius), atemp ($X_9$, numerical: feeling temperature in Celsius), hum ($X_{10}$, numerical: humidity), windspeed ($X_{11}$, numerical: wind speed). We do not use date and season in the data file as predictors, since they are dependent on the other predictors. Notice that the set of predictors is highly correlated. For example, temperature and feeling temperature are highly correlated, and so are month and temperature.

We fit a neural network model using the R nnet package (Venables and Ripley (2002)) with 10 nodes in the single hidden layer (size = 10), a logistic output activation function (linout = FALSE), and a regularization parameter of 0.05 (decay = 0.05). These parameters are approximately optimal according to multiple replicates of 3-fold cross-validation (CV $r^2 \approx 0.90$). For the ALE and PD plots, the function $f(\mathbf{x})$ is the predicted hourly count of rental bikes from the fitted neural network model. Figure 8, Figure 9(a), and Figure 10 show ALE main and interaction effect plots for various predictors. We used $K = 100$ for both the main-effect plots and the interaction plots.
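
A sketch of this fit in R follows. The file and column names are those of the UCI hour.csv data set to the best of our knowledge, and the rescaling of the response is our own addition: with a logistic output unit (linout = FALSE) the fitted values lie in (0, 1), so the counts must be scaled to that range before fitting and scaled back for plotting.

library(nnet)
bike <- read.csv("hour.csv")                     # assumes the UCI file has been downloaded
vars <- c("yr", "mnth", "hr", "holiday", "weekday", "workingday",
          "weathersit", "temp", "atemp", "hum", "windspeed")
dat  <- bike[, c(vars, "cnt")]
scale_cnt <- max(dat$cnt)
dat$cnt   <- dat$cnt / scale_cnt                 # our choice: map counts into [0, 1]
fit <- nnet(cnt ~ ., data = dat, size = 10,      # ten hidden nodes
            linout = FALSE,                      # logistic output activation
            decay = 0.05, maxit = 500, trace = FALSE)
# Predicted hourly counts: scale_cnt * predict(fit, newdata)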

Fig. 8. For the bike-sharing example with $f(\mathbf{x})$ a neural network model for predicting hourly bike rental counts, ALE main-effect plots for the month ($X_2$, top left), hour-of-day ($X_3$, top right), weather situation ($X_7$, bottom left), and wind speed ($X_{11}$, bottom right) predictors. The zero-order effects have been included, i.e., the plots are of $f_{j,ALE}(x_j) + E[f(\mathbf{X})]$.

The ALE main-effect plots are shown for month ($X_2$), hour ($X_3$), weather situation ($X_7$), and wind speed ($X_{11}$) in Figure 8, and for feeling temperature ($X_9$) in Figure 9(a). All of the ALE main-effect plots provide clear interpretations of the (main) effects of the predictors.


Fig. 9. For the bike-sharing data example with neural network predicted counts for $f(\mathbf{x})$, ALE main-effect plot (left panel) and PD main-effect plot (right panel) for feeling temperature ($X_9$). Both plots include the zero-order effect $E[f(\mathbf{X})]$. The two plots differ substantially, and the ALE plot seems to agree more with intuition.

For the effect of month ($X_2$), the number of rentals is lowest in January and gradually increases month-by-month until it peaks in September-October (months 9-10), after which it declines in the winter months. For the effect of hour of day ($X_3$), the number of rentals increases until it first peaks at the morning rush hour around 8:00 am (hour 8), after which it decreases to moderate levels over the late morning and early afternoon hours, and then peaks again at the evening rush hour around 5:00-6:00 pm (hours 17-18). For the effect of weather situation ($X_7$), the number of rentals monotonically decreases as the weather situation worsens. Recall that a larger value of $X_7$ corresponds to worse weather conditions. For the effect of wind speed ($X_{11}$), the number of rentals also monotonically decreases as the wind speed increases. For the effect of atemp ($X_9$, in Figure 9(a)), the number of rentals steadily increases as atemp (i.e., feeling temperature) increases up until about 26 degrees Celsius (79 degrees Fahrenheit), after which it steadily decreases. This makes perfect sense, since a feeling temperature of 26 degrees Celsius might be considered nearly optimal for comfortably biking around a city (note that feeling temperature takes into account factors such as humidity and breeze, so the actual temperature would be somewhat lower), and feeling temperatures that are either substantially higher or lower than this will make bike rental less appealing for many people. In comparison, Figure 9(b) shows the PD main effect plot for feeling temperature, which is substantially different from the ALE main effect plot in Figure 9(a), even though the two are for the exact same fitted neural network model. The difference is due to the high correlation between feeling temperature and some of the other predictors, and the resulting extrapolation that makes PD plots unreliable. In this case, the PD plot indicates that the number of bike rentals monotonically increases as feeling temperature increases, even at feeling temperatures over 40 degrees Celsius (104 Fahrenheit). The ALE plot for feeling temperature in Figure 9(a), which indicates that bike rentals will decrease as feeling temperature increases beyond the comfortable range, is in much better agreement with common sense. In addition to providing better interpretability, the ALE plots are much faster to compute than the PD plots (see Section 5.4).

Figure 10 shows two versions of the ALE second-order interaction effect plot for the hour and weather situation predictors ($\{X_3, X_7\}$), without and with the main effects of $X_3$ and $X_7$ included. The latter (Figure 10(b)) plots $E[f(\mathbf{X})] + f_{3,ALE}(x_3) + f_{7,ALE}(x_7) + f_{\{3,7\},ALE}(x_3, x_7)$, whereas the former (Figure 10(a)) plots only $f_{\{3,7\},ALE}(x_3, x_7)$.


Fig. 10. ALE second-order interaction plots for the predictors hour ($X_3$) and weather situation ($X_7$) without (left panel) and with (right panel) the main effects of $X_3$ and $X_7$ included. The left panel plots $f_{\{3,7\},ALE}(x_3, x_7)$, and the right panel plots $E[f(\mathbf{X})] + f_{3,ALE}(x_3) + f_{7,ALE}(x_7) + f_{\{3,7\},ALE}(x_3, x_7)$. The numbers on the contours are the function values.

Generally speaking, the latter provides a clearer picture of the joint effects of two predictors, whereas the former allows the overall magnitude of the interaction effect to be more easily assessed. Our ALEPlot R package allows either to be plotted.

The interaction ALE plot in Figure 10 reveals an interesting relationship and is an important supplement to the main effects ALE plots for $X_3$ and $X_7$ in Figure 8. Consider the decrease in bike rental counts that is due to larger weather situation values (i.e., less pleasant weather). If there were no interactions, this decrease would be the same regardless of the hour or the levels of the other predictors. But from Figure 10(a), there is clearly a strong interaction between $X_3$ and $X_7$, since the contour values vary over a range of about 110 units (from $-50$ to $+60$), which is almost as large as the range for the main effect $f_{7,ALE}(x_7)$ in Figure 8. The effect of weather situation, which is a decrease in rentals as weather situation increases, is clearly amplified during the rush hour peaks, and in general at hours when the overall rental counts are expected to be higher. One must be careful interpreting interaction plots without the main effects included. From Figure 10(a), at some hours (e.g., around hour 0, which is midnight) $f_{\{3,7\},ALE}(x_3, x_7)$ increases as weather situation increases. However, when the main effects of $X_3$ and $X_7$ are included as in Figure 10(b), it is clear that increasing weather situation decreases bike rentals at any hour.

It makes sense that the effects of weather situation are amplified by the effects of hour (on which the overall bike rental counts depend heavily), and this example illustrates how visualizations like this can aid the model building process by suggesting modifications of the model that one might consider. For example, Figure 10(b) indicates that the effects of weather situation and/or hour on rental counts might be better modeled as multiplicative.

5.2. The Wrong and Right Ways to Interpret ALE Plots (and PD Plots)

This section provides a word of caution on how not to interpret ALE plots when the predictors are highly correlated, which also applies to interpreting PD plots. Reconsider Example 1, in which $X_1$ and $X_2$ are highly correlated (see Figure 5), and the ALE and PD plots are as in Figure 6. The wrong way to interpret the ALE plot is that it implies that if we fix (say) $x_1$ and then vary $x_2$ over its entire range, the response (of which $f(\mathbf{x})$ is a predictive model) is expected to vary as in Figure 6(b). And this interpretation is wrong even if $f(\mathbf{x})$ is truly additive in the predictors.


Fig. 11. Illustration of the right way to interpret ALE plots for an example in which $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ is additive with quadratic $f_j(x_j) = x_j^2$, and $K = 10$ equally-spaced bins over the support $[0, 1]$ were used. The left panel shows the local effects of $X_j$ within each bin (i.e., the summand in Eq. (6)). The local effects are local and require no extrapolation outside the data envelope. The right plot is of $g_{j,ALE}(x_j)$ and can be viewed as piecing together (or accumulating) the local effects in a manner that facilitates easier visualization of the underlying global effect.

Indeed, varying $x_2$ over its entire range with $x_1$ held fixed would take $\mathbf{x}$ far outside the envelope of the training data to which $f$ was fit, as can be seen in Figure 5. $f(\mathbf{x})$ is obviously unreliable for this level of extrapolation, which was the main motivation for ALE plots, and we have highly uncertain knowledge of the hypothetical values of the response far outside the data envelope.

However, ALE plots are still very useful if we interpret them correctly, and the correct interpretation is illustrated in Figure 11. In this toy example, suppose $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ is additive, the effect of $X_j$ is the quadratic function $f_j(x_j) = x_j^2$, and $X_j$ is highly correlated with the other predictors. The left panel of Figure 11 shows the local effects of $X_j$ within each bin, for $K = 10$ equally-spaced bins over the support $[0, 1]$. Here, the local effect within a bin is defined as the summand in the Eq. (6) definition of $g_{j,ALE}(x_j)$, which represents the average change in $f(\mathbf{X})$ as $X_j$ changes from the left endpoint to the right endpoint of the bin.

The local effects are exactly that, local, and require no extrapolation beyond the envelope of the data, since the changes in $f(\mathbf{X})$ are averaged across the conditional distribution of $\mathbf{X}_{\setminus j}$, given that $X_j$ falls in that bin. Consequently, if the bin widths are not too small and $f$ is not too noisy, a local effect plot like the one in the left panel of Figure 11 could be interpreted to reveal the effect of $X_j$ on $f$.

However, the effect of $X_j$ is much easier to visualize if we accumulate the local effects via Eq. (6) and plot $g_{j,ALE}(x_j)$ instead, as in the right panel of Figure 11. Aside from vertical centering, this is exactly the ALE plot, and it is best viewed as a way of piecing together the local effects in a manner that provides easier visualization of the underlying global effect of a predictor. The additive recovery property discussed in the next subsection provides even stronger justification for this manner of piecing together the local effects. Namely, if $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ is additive, the ALE plot manner of piecing together the local effects produces the correct global effect function $f_{j,ALE}(x_j) = f_j(x_j)$. Of course, one must still keep in mind that the global effect $f_j(x_j)$ may only hold when the set of predictors $\mathbf{x}$ jointly fall within the data envelope.


5.3. Paired Differencing and Additive Recovery

The ALE functions have an attractive additive recovery property mentioned in Remark 1. Suppose $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ is an additive function of the individual predictors. Then it is straightforward to show that the ALE main effects are $f_{j,ALE}(x_j) = f_j(x_j)$ for $j = 1, 2, \dots, d$, up to an additive constant. That is, the ALE effects recover the correct additive functions. More generally, the following result states that higher-order ALE effects $f_{J,ALE}(\mathbf{x}_J)$ have a similar additive recovery property.

Additive recovery for ALE plots. Suppose $f$ is of the form $f(\mathbf{x}) = \sum_{J \subseteq \{1,2,\dots,d\},\, |J| \le k} f_J(\mathbf{x}_J)$ for some $1 \le k \le d$. That is, $f$ has interactions of order $k$, but no higher-order interactions than that. Then for every subset $J$ with $|J| = k$, $f_{J,ALE}(\mathbf{x}_J) = f_J(\mathbf{x}_J) + \sum_{u \subset J} h_u(\mathbf{x}_u)$ for some functions $h_u(\mathbf{x}_u)$ that are of strictly lower order than $k$. In other words, for every $J$ with $|J| = k$, the ALE effect $f_{J,ALE}(\mathbf{x}_J)$ returns the correct $k$-order interaction $f_J(\mathbf{x}_J)$, since the presence of strictly-lower-order functions does not alter the interpretation of a $k$-order interaction.

The proof of this additive recovery property for ALE plots follows directly from the decomposition theorem in Appendix C. It also follows that if the functions $\{f_J(\mathbf{x}_J)\}$ in the expression for $f(\mathbf{x})$ are adjusted so that each has no lower-order ALE effects, then $f_{J,ALE}(\mathbf{x}_J) = f_J(\mathbf{x}_J)$ for each $J \subseteq \{1, 2, \dots, d\}$. Although PD plots have a similar additive recovery property (see below), M plots have no such property. For example, if $f(\mathbf{x}) = \sum_{j=1}^{d} f_j(x_j)$ and the predictors are dependent, then each $f_{j,M}(x_j)$ may be a combination of the main effects of many predictors. As discussed previously, this can be viewed as the omitted variable bias in regression, whereby a regression of $Y$ on (say) $X_1$, omitting a correlated nuisance variable $X_2$ on which $Y$ also depends, will bias the effect of $X_1$ on $Y$.

Fig. 12. Illustration of how, when estimating $f_{1,ALE}(x_1)$, the differences $f(z_{k,1}, x_{i,2}) - f(z_{k-1,1}, x_{i,2})$ and $f(z_{k',1}, x_{i,2}) - f(z_{k'-1,1}, x_{i,2})$ in (15) are paired differences that block out the nuisance variable $X_2$. Here, $k = k_1(0.3)$ and $k' = k_1(0.8)$.

The mechanism by which ALE plots avoid this omitted nuisance variable bias is illustrated in Figure 12, for the same example depicted in Figures 1, 2, and 5. First note that the M plot functions in this example are severely biased, because $f_{1,M}(x_1)$ averages the function $f(x_1, X_2)$ itself (as opposed to its derivative) with respect to the conditional distribution of $X_2 \mid X_1 = x_1$ (see (3) and (4)). For example, for $f(\mathbf{x}) = x_1 + x_2^2$ considered in Figure 6, the M plot averaged function is $f_{1,M}(x_1) = E[f(X_1, X_2) \mid X_1 = x_1] = x_1 + E[X_2^2 \mid X_1 = x_1] \neq x_1$, which is biased by the functional dependence of $f(\mathbf{x})$ on the correlated nuisance variable $X_2$. In contrast to averaging the function $f$ itself, the ALE effect $f_{1,ALE}(x_1)$ estimated via (15)–(16) averages only the local effect of $f$ represented by the paired differences $f(z_{k,1}, x_{i,2}) - f(z_{k-1,1}, x_{i,2})$ in (15). As illustrated in Figure 12, this paired differencing is what blocks out the effect of the correlated nuisance variable $X_2$. Continuing the $f(\mathbf{x}) = x_1 + x_2^2$ example, the paired differences $f(z_{k,1}, x_{i,2}) - f(z_{k-1,1}, x_{i,2}) = (z_{k,1} + x_{i,2}^2) - (z_{k-1,1} + x_{i,2}^2) = z_{k,1} - z_{k-1,1}$ for the ALE plot completely block out the effect of $X_2$, so that the accumulated local effect $\sum_{k=1}^{k_1(x_1)} [z_{k,1} - z_{k-1,1}] = x_1 + \text{constant}$ is correct.

Multiplicative recovery for ALE plots for independent subsets of predictors. Suppose the model is of the form $f(\mathbf{x}) = f_J(\mathbf{x}_J)\, f_{\setminus J}(\mathbf{x}_{\setminus J})$ for some $J \subset \{1, 2, \ldots, d\}$ with $\mathbf{X}_J$ independent of $\mathbf{X}_{\setminus J}$. In this case it is straightforward to show that the ALE $|J|$-order interaction effect of $\mathbf{X}_J$ is $f_{J,ALE}(\mathbf{x}_J) = f_J(\mathbf{x}_J)\, E[f_{\setminus J}(\mathbf{X}_{\setminus J})] + \sum_{u \subset J} h_u(\mathbf{x}_u)$ for some lower-order functions $h_u(\mathbf{x}_u)$. That is, the ALE $|J|$-order interaction effect $f_{J,ALE}(\mathbf{x}_J)$ recovers the correct function $f_J(\mathbf{x}_J)$, except for a multiplicative constant $E[f_{\setminus J}(\mathbf{X}_{\setminus J})]$ and the additive presence of strictly lower-order functions.
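For the main-effect case $J = \{j\}$, this multiplicative recovery can be seen directly from Theorem 1; the derivation below is a sketch, assuming $f$ is differentiable so that Theorem 1 applies:
$$g_{j,ALE}(x_j) = \int_{x_{min,j}}^{x_j} E[f_j'(z)\, f_{\setminus j}(\mathbf{X}_{\setminus j}) \mid X_j = z]\, dz = E[f_{\setminus j}(\mathbf{X}_{\setminus j})] \int_{x_{min,j}}^{x_j} f_j'(z)\, dz = E[f_{\setminus j}(\mathbf{X}_{\setminus j})]\,[f_j(x_j) - f_j(x_{min,j})],$$
where the second equality uses the assumed independence of $X_j$ and $\mathbf{X}_{\setminus j}$, so that the conditional expectation does not depend on $z$. Centering as in (8) then gives $f_{j,ALE}(x_j) = f_j(x_j)\, E[f_{\setminus j}(\mathbf{X}_{\setminus j})]$ up to an additive constant, i.e., the lower-order terms $h_u$ reduce to constants in this case.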

Comparison to PD plots. PD plots also have the same additive and multiplicative recovery properties just discussed. Moreover, for $f(\mathbf{x}) = f_J(\mathbf{x}_J)\, f_{\setminus J}(\mathbf{x}_{\setminus J})$, PD plots have multiplicative recovery (up to a multiplicative constant) even when $\mathbf{X}_J$ and $\mathbf{X}_{\setminus J}$ are dependent (Hastie, Tibshirani and Friedman, 2009). Although it is probably desirable to have multiplicative recovery when $\mathbf{X}_J$ and $\mathbf{X}_{\setminus J}$ are independent, it is unclear whether multiplicative recovery is even desirable if $\mathbf{X}_J$ and $\mathbf{X}_{\setminus J}$ are dependent.

For example, suppose $f(\mathbf{x}) = x_1 x_2$ with $X_1$ ($J = \{1\}$) and $X_2$ ($\setminus J = \{2\}$) standard normal random variables with correlation coefficient $\rho$. It is straightforward to show that $f_{\{1,2\},ALE}(x_1, x_2) = x_1 x_2 - \frac{1}{2}\rho(x_1^2 + x_2^2)$, $f_{1,ALE}(x_1) = \frac{1}{2}\rho(x_1^2 - 1)$, and $f_{2,ALE}(x_2) = \frac{1}{2}\rho(x_2^2 - 1)$, compared to $f_{\{1,2\},PD}(x_1, x_2) = x_1 x_2$, $f_{1,PD}(x_1) = 0$, and $f_{2,PD}(x_2) = 0$. Because of the strong interaction, it is essential to look at the second-order interaction effects in order to understand the functional dependence of $f(\cdot)$ on the predictors. Both $f_{\{1,2\},ALE}(x_1, x_2)$ and $f_{\{1,2\},PD}(x_1, x_2)$ correctly recover the interaction, up to lower-order functions of the individual predictors. Regarding the main effects, however, the picture is more ambiguous. First, it is important to note that with a strong interaction and dependent predictors, it is unclear whether the main effects are even meaningful. And it is equally unclear whether the PD main effect $f_{1,PD}(x_1) = 0$ is any more or less meaningful than the ALE main effect $f_{1,ALE}(x_1) = \frac{1}{2}\rho(x_1^2 - 1)$. If $X_1$ and $X_2$ were independent in this example, then it would probably be desirable to view the main effects of $X_1$ and $X_2$ as zero, and in this case $f_{1,PD}(x_1) = f_{1,ALE}(x_1) = 0$ would indeed be in agreement.

The situation is murkier with dependent predictors in this example. The local effect $\partial f(\mathbf{x})/\partial x_1 = x_2$ of $X_1$ depends strongly on the value of $x_2$, in that it is amplified for larger $|x_2|$ and changes sign if $x_2$ changes sign. Thus, if $\rho$ is large and positive, the local effect of $X_1$ is positive for $x_1 > 0$ and negative for $x_1 < 0$, which is the local effect of a quadratic relationship. In this case one might argue that the quadratic $f_{1,ALE}(x_1) = \frac{1}{2}\rho(x_1^2 - 1)$ is more revealing than $f_{1,PD}(x_1) = 0$. However, the debate is largely academic, because when strong interactions are present the lower-order effects should not be interpreted in a vacuum.
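These expressions are easy to check numerically. The sketch below reuses the hypothetical ale_main_uncentered() function from Section 5.2 (an illustration only, not the ALEPlot package): with correlated standard normal predictors and $f(\mathbf{x}) = x_1 x_2$, the estimated, centered ALE main effect of $X_1$ should track $\frac{1}{2}\rho(x_1^2 - 1)$.

    # Correlated standard normal predictors and the interaction model f(x) = x1*x2
    set.seed(1)
    n <- 10000; rho <- 0.8
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    X <- data.frame(x1 = x1, x2 = x2)
    pred_fun <- function(newdata) newdata$x1 * newdata$x2

    est <- ale_main_uncentered(X, pred_fun, j = "x1", K = 50)
    g_fun <- approxfun(est$x, est$ale)
    ale1 <- est$ale - mean(g_fun(X$x1))        # center by the sample mean of g(X1), as in (8)
    plot(est$x, ale1, type = "l", xlab = "x1", ylab = "ALE main effect of x1")
    lines(est$x, 0.5 * rho * (est$x^2 - 1), lty = 2)   # theoretical 0.5*rho*(x1^2 - 1)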

5.4. Computational Advantages of ALE Plots over PD Plots

For general supervised learning models $f(\mathbf{x})$, ALE plots have an enormous computational advantage over PD plots. Suppose we want to compute $f_{J,ALE}(\mathbf{x}_J)$ for one subset $J \subseteq \{1, 2, \ldots, d\}$ over a grid in the $\mathbf{x}_J$-space with $K$ discrete locations for each variable. Computation of $f_{J,ALE}(\mathbf{x}_J)$ over this grid requires a total of $2^{|J|} \times n$ evaluations of the supervised learning model $f(\mathbf{x})$ (see (15)–(20), or (D.1) for $|J| > 2$). In comparison, computation of $f_{J,PD}(\mathbf{x}_J)$ over this grid requires a total of $K^{|J|} \times n$ evaluations of $f(\mathbf{x})$. For example, for $K = 50$, PD main effects and second-order interaction effects require, respectively, 25 and 625 times more evaluations of $f(\mathbf{x})$ than the corresponding ALE effects. Moreover, as we discuss in Appendix E, the evaluations of $f(\mathbf{x})$ can be easily vectorized (in R, for example, by appropriately calling the predict function that is built into most supervised learning packages in R).

Also notice that the number of evaluations of $f(\mathbf{x})$ for ALE plots does not depend on $K$, which is convenient. As $n$ increases, the observations become denser, in which case we may want the fineness of the discretization to increase as well. If we choose $K^{|J|} \propto n$ (which results in the same average number of observations per cell as $n$ increases), then the number of evaluations of $f(\mathbf{x})$ is $O(n)$ for ALE plots versus $O(n^2)$ for PD plots.
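To make the evaluation counts concrete, the following PD main-effect sketch (hypothetical helper names, not the ALEPlot package) performs one full pass over the $n$ training rows for each of the $K$ grid values, i.e., $K \times n$ model evaluations, whereas the ALE main-effect sketch given in Section 5.2 calls the model only twice per observation:

    # PD main effect of predictor j on a K-point grid: K * n model evaluations
    pd_main <- function(X, pred_fun, j, K = 50) {
      grid <- quantile(X[[j]], probs = seq(0, 1, length.out = K), names = FALSE)
      sapply(grid, function(v) {
        Xv <- X
        Xv[[j]] <- v            # force X_j = v for every observation (may extrapolate)
        mean(pred_fun(Xv))      # average of n predictions for this grid value
      })
    }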

For the bike sharing example in Section 5.1, we implemented the ALE and PD plots using our R package ALEPlot on a Windows™ laptop with an Intel(R) Core(TM) i7-7600U CPU @ 2.80 GHz processor. The ALE main-effect plots and the ALE second-order interaction plots took less than 1 second each. In comparison, with the same $K = 100$, the PD main-effect plots took about 5 seconds each, and the PD interaction plots (not shown in Section 5.1) took about 8 minutes each. The PD plot computational expense is proportional to $K$ for main effects and to $K^2$ for second-order interaction effects, whereas the ALE plot computational expense is largely independent of $K$. The ALE interaction plots were orders of magnitude faster to compute than the PD interaction plots (less than 1 second vs. 8 minutes).

5.5. Relation to Functional ANOVA with Dependent Inputs

In the context of the closely related problem of functional ANOVA with dependent input (i.e., predictor) variables, the extrapolation issue that motivated ALE plots has been previously considered. Hooker (2007) proposed a functional ANOVA decomposition of $f(\mathbf{x})$ into component functions $\{f_{J,ANOVA}(\mathbf{x}_J) : J \subseteq \{1, 2, \ldots, d\}\}$ by adopting the Stone (1994) approach of using weighted integrals in the function approximation optimality criterion and in the component function orthogonality constraints. Hooker (2007) used $p_{\{1,2,\ldots,d\}}(\mathbf{x})$ as a weighting function, which indirectly avoids extrapolation of $f(\mathbf{x})$ in regions in which there are no training observations, because any such extrapolations are assigned little or no weight. The resulting ANOVA component functions are hierarchically orthogonal under the correlation inner product, in the sense that $f_{J,ANOVA}(\mathbf{X}_J)$ and $f_{u,ANOVA}(\mathbf{X}_u)$ are uncorrelated when $u \subset J$. However, $f_{J,ANOVA}(\mathbf{X}_J)$ and $f_{u,ANOVA}(\mathbf{X}_u)$ are not uncorrelated for general $u \neq J$.

In comparison, we show in Appendix C that the ALE decomposition $f(\mathbf{x}) = \sum_{J \subseteq \{1,2,\ldots,d\}} f_{J,ALE}(\mathbf{x}_J)$ mentioned in Remark 2 has the following orthogonality-like property. For each $J \subseteq \{1, 2, \ldots, d\}$, let $H_J(\cdot)$ denote the operator that maps a function $f$ to its ALE effect $f_{J,ALE}$, i.e., such that $f_{J,ALE} = H_J(f)$ (see Appendix B for details). The collection of operators $\{H_J : J \subseteq \{1, 2, \ldots, d\}\}$ behaves like a collection of operators that project onto orthogonal subspaces of an inner product space. Namely, if $\circ$ denotes the composite function operator, then $H_J \circ H_J(f) = H_J(f)$, and $H_{J'} \circ H_J(f) = 0$ for each $J' \subseteq \{1, 2, \ldots, d\}$ with $J' \neq J$.

In other words, the ALE $|J'|$-order effect of the predictors $\mathbf{X}_{J'}$ for the function $f_{J,ALE}$ is identically zero when $J' \neq J$, and the ALE $|J|$-order effect of the predictors $\mathbf{X}_J$ for the function $f_{J,ALE}$ is the same function $f_{J,ALE}$. For example, for any pair of predictors $\{X_j, X_l\}$, the ALE main effect of any predictor (including $X_j$ or $X_l$) for the function $f_{\{j,l\},ALE}(x_j, x_l)$ is identically zero. Thus each ALE second-order interaction effect function has ALE main effects that are all identically zero. Likewise, each ALE main effect function has ALE second-order interaction effects that are all identically zero. And for the function $f_{\{j,l\},ALE}(x_j, x_l)$, the ALE second-order interaction effect of any other pair of predictors is identically zero. Similarly, the ALE first- and second-order effects for any ALE third-order effect function are all identically zero, and vice versa.

For the purpose of visualizing the effects of the predictors on black box supervised learning models, the correlation orthogonality in other functional ANOVA decompositions may be less relevant and less useful than the ALE pseudo-orthogonality. As discussed in Roosen (1995), if the predictors are dependent, it may even be preferable to artificially impose a product $p_{\{1,2,\ldots,d\}}(\mathbf{x})$ in the functional ANOVA decomposition to avoid conflating direct and indirect effects of a predictor, and this will typically result in ANOVA component functions that are no longer uncorrelated. For example, suppose $f(\mathbf{x}) = x_1 + x_2$, and $X_1$ and $X_2$ are correlated. Any functional ANOVA decomposition that gives uncorrelated main effects will not give the correct main effects $f_1(x_1) = x_1$ and $f_2(x_2) = x_2$ that are needed to understand the true functional dependence of $f(\mathbf{x})$ on $x_1$ and $x_2$. In contrast, the ALE and PD main effect functions are the correct functions $f_1(x_1) = x_1$ and $f_2(x_2) = x_2$, up to an additive constant (see Section 5.3). Functional ANOVA can be coerced into producing the correct main effects $f_{1,ANOVA}(x_1) = x_1$ and $f_{2,ANOVA}(x_2) = x_2$ by artificially imposing a product distribution $p_{\{1,2\}}(\mathbf{x}) = p_1(x_1)\, p_2(x_2)$, but then the ANOVA component functions are no longer uncorrelated. Moreover, artificially imposing a product $p_{\{1,2,\ldots,d\}}(\mathbf{x})$ in functional ANOVA still leaves the extrapolation problem that plagues PD plots and that motivated ALE plots and the work of Hooker (2007).

In addition, practical implementation is far more cumbersome for functional ANOVA decompositions than for ALE (or PD) plots, for multiple reasons. First, $p_{\{1,2,\ldots,d\}}(\mathbf{x})$ must be estimated in the functional ANOVA approach of Hooker (2007), which is problematic in high dimensions. In contrast, the ALE effect estimators (15)–(20) involve summations over the training data but require no estimate of $p_{\{1,2,\ldots,d\}}(\mathbf{x})$. Second, each ALE plot effect function can be computed one at a time using straightforward and computationally efficient calculations that involve only finite differencing, averaging, and summing. In contrast, the functional ANOVA component functions must be computed simultaneously, which requires the solution of a complex system of equations. Follow-up work in Li and Rabitz (2012), Chastaing, Gamboa and Prieur (2012), and Rahman (2014) improved the solution techniques, sometimes restricting the component ANOVA functions to be expansions in basis functions such as polynomials and splines, but these are more restrictive (perhaps negating the benefits of fitting a black box supervised learning model in the first place), as well as more cumbersome and computationally expensive than ALE plots.

6. Conclusions

For visualizing the effects of the predictor variables in black box supervised learning models, PD plots are the most widely used method. The ALE plots that we have proposed in this paper are an alternative that has two important advantages over PD plots. First, by design, ALE plots avoid the extrapolation problem that can render PD plots unreliable when the predictors are highly correlated (see Figures 6, 7, and 9). Second, ALE plots are substantially less computationally expensive than PD plots, requiring only $2^{|J|} \times n$ evaluations of the supervised learning model $f(\mathbf{x})$ to compute each $f_{J,ALE}(\mathbf{x}_J)$, compared to $K^{|J|} \times n$ evaluations to compute each $f_{J,PD}(\mathbf{x}_J)$. In light of this, we suggest that ALE plots should be adopted as a standard visualization component in supervised learning software. We have also provided, as supplementary material, an R package ALEPlot to implement the ALE plots.

References

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Chastaing, G., Gamboa, F. and Prieur, C. (2012) Generalized Hoeffding-Sobol decomposition for dependent variables - application to sensitivity analysis. Electronic Journal of Statistics, 6, 2420–2448.

Cleveland, W. S. (1993) Visualizing Data. Summit, NJ: Hobart Press.

Cook, R. D. (1995) Graphics for studying the net effects of regression predictors. Statistica Sinica, 5, 689–708.

Fanaee-T, H. and Gama, J. (2013) Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 1–15.

Friedman, J. H. (2001) Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.


Goldstein, A., Kapelner, A., Bleich, J. and Pitkin, E. (2015) Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24, 44–65.

Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning. New York: Springer.

Hooker, G. (2007) Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16, 709–732.

Li, G. and Rabitz, H. (2012) General formulation of HDMR component functions with independent and correlated variables. Journal of Mathematical Chemistry, 50, 99–130.

Rahman, S. (2014) A generalized ANOVA dimensional decomposition for dependent probability measures. SIAM/ASA Journal on Uncertainty Quantification, 2, 670–697.

Ripley, B. D. (2015) tree: Classification and regression trees. R package version 1.0-36. URL: http://CRAN.R-project.org/package=tree.

Roosen, C. B. (1995) Visualization and Exploration of High-Dimensional Functions Using the Functional ANOVA Decomposition. Ph.D. thesis, Stanford University.

Stone, C. J. (1994) The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22, 118–171.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. New York: Springer.


Appendix A Statements and Proofs of Theorems A.1 and 1–4

Theorem A.1 (sufficient conditions for the limit in (6) to exist, independent of the sequence of partitions). Define the functions $h(u, z) \equiv E[f(u, \mathbf{X}_{\setminus j}) \mid X_j = z]$ and $h'(u, z) \equiv \partial h(u, z)/\partial u$ where the derivative exists, and suppose that $p$ and $f$ are such that:

(i) $f$ is bounded on $S$.

(ii) $h(\cdot, \cdot)$ is differentiable in its first argument everywhere on the set $\{(z, z) : z \in S_j \setminus J\}$, where the set of points $J = \{u_1, u_2, \ldots, u_M\} \subset S_j$ at which $h(\cdot, \cdot)$ is nondifferentiable (and possibly discontinuous) in its first argument is a finite collection. $J$ may be empty.

(iii) The differentiability of $h(\cdot, \cdot)$ with respect to its first argument is uniform on the set $\{(z, z) : z \in S_j \setminus J\}$, in the sense that $\forall \epsilon > 0$, $\exists \delta > 0$ such that
$$\left| h'(z, z) - \frac{h(u, z') - h(v, z')}{u - v} \right| < \epsilon$$
whenever $|u - v| < \delta$ and $(v, u] \cap J = \emptyset$ with $v < z \le u$ and $v < z' \le u$.

(iv) $h(\cdot, \cdot)$ is continuous in its second argument everywhere on the set $\{(z, z) : z \in S_j\}$, and the continuity is uniform in the sense that $\forall \epsilon > 0$, $\exists \delta > 0$ such that $|h(z, z) - h(z, z')| < \epsilon$ whenever $|z - z'| < \delta$.

(v) For each $l = 1, 2, \ldots, M$, the potential discontinuity of $h(u, u_l)$ in its first argument at $u = u_l$ is such that $\lim_{u \to u_l^-} h(u, u_l)$ exists and $\lim_{u \to u_l^+} h(u, u_l) = h(u_l, u_l)$ (i.e., all discontinuities are of the jump type and are right continuous).

Under these conditions, the limit in (6) exists independent of the sequence of partitions $\{P_j^K : K = 1, 2, \ldots\}$, and the limit is
$$g_{j,ALE}(x_j) = \int_{x_{min,j}}^{x_j} h(z)\, dz + \sum_{l=1,\, u_l \le x_j}^{M} J_l, \tag{A.1}$$
where $J_l \equiv \lim_{u \to u_l^+} h(u, u_l) - \lim_{u \to u_l^-} h(u, u_l)$ is the jump at $u_l$, and the single-argument function $h(z)$ is defined as $h'(z, z)$ on $S_j \setminus J$ and $0$ on $J$.

Proof. For a partition sequence $\{P_j^K\}$ satisfying $\lim_{K\to\infty}\delta_{j,K} = 0$ in Definition 1, let $\mathcal{K}_K = \{k \in \{1,2,\ldots,K\} : (z_{k-1,j}^K, z_{k,j}^K] \cap J \neq \emptyset\}$, i.e., $\mathcal{K}_K$ is the set of indices of the partition intervals in $P_j^K$ that contain one or more of the discontinuity points in $J$. The condition $\lim_{K\to\infty}\delta_{j,K} = 0$, together with conditions (ii) and (iii), implies that $\forall \epsilon > 0$, $\exists K_1$ such that $\forall K > K_1$ and $k \in \{1,2,\ldots,K\}\setminus\mathcal{K}_K$,
$$
\begin{aligned}
&\left| h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) - E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \right| \\
&\quad = \left| h'(z_{k,j}^K, z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) - \frac{\int_{z \in (z_{k-1,j}^K, z_{k,j}^K]} [h(z_{k,j}^K, z) - h(z_{k-1,j}^K, z)]\, dp_j(z)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \right| \\
&\quad \le \frac{z_{k,j}^K - z_{k-1,j}^K}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \int_{z \in (z_{k-1,j}^K, z_{k,j}^K]} \left| \frac{h(z_{k,j}^K, z) - h(z_{k-1,j}^K, z)}{z_{k,j}^K - z_{k-1,j}^K} - h'(z_{k,j}^K, z_{k,j}^K) \right| dp_j(z) \\
&\quad < \frac{z_{k,j}^K - z_{k-1,j}^K}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \int_{z \in (z_{k-1,j}^K, z_{k,j}^K]} \frac{\epsilon}{4(x_{max,j} - x_{min,j})}\, dp_j(z) = \frac{(z_{k,j}^K - z_{k-1,j}^K)\,\epsilon}{4(x_{max,j} - x_{min,j})},
\end{aligned} \tag{A.2}
$$
so that
$$
\begin{aligned}
&\left| \sum_{k=1,\, k\notin\mathcal{K}_K}^{k_j^K(x_j)} h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) - \sum_{k=1,\, k\notin\mathcal{K}_K}^{k_j^K(x_j)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \right| \\
&\quad \le \sum_{k=1,\, k\notin\mathcal{K}_K}^{k_j^K(x_j)} \left| h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) - E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \right| \\
&\quad < \sum_{k=1,\, k\notin\mathcal{K}_K}^{k_j^K(x_j)} \frac{(z_{k,j}^K - z_{k-1,j}^K)\,\epsilon}{4(x_{max,j} - x_{min,j})} \le \frac{\epsilon}{4}.
\end{aligned} \tag{A.3}
$$
Regarding the partition intervals $(z_{k-1,j}^K, z_{k,j}^K]$ that contain at least one of the nondifferentiable points in $J$ (i.e., $k \in \mathcal{K}_K$), since $|J| = M < \infty$, for sufficiently large $K$ each such interval contains only one discontinuity. Let $l = l(k,K)$ denote the index of the point $u_l \in (z_{k-1,j}^K, z_{k,j}^K]$ at which there is a discontinuity. Then, using conditions (iv) and (v) and the triangle inequality, $\forall \epsilon > 0$, $\exists K_2$ such that $\forall K > K_2$ and $k \in \mathcal{K}_K$,
$$
\begin{aligned}
&\left| J_l - E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \right| \\
&\quad = \left| \lim_{u\to u_l^+} h(u,u_l) - \lim_{u\to u_l^-} h(u,u_l) - \frac{\int_{z\in(z_{k-1,j}^K, z_{k,j}^K]} [h(z_{k,j}^K,z) - h(z_{k-1,j}^K,z)]\,dp_j(z)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \right| \\
&\quad \le \left| \lim_{u\to u_l^+} h(u,u_l) - h(z_{k,j}^K,u_l) \right| + \left| h(z_{k-1,j}^K,u_l) - \lim_{u\to u_l^-} h(u,u_l) \right| \\
&\qquad + \frac{\int_{z\in(z_{k-1,j}^K, z_{k,j}^K]} |h(z_{k,j}^K,u_l) - h(z_{k,j}^K,z)|\,dp_j(z)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} + \frac{\int_{z\in(z_{k-1,j}^K, z_{k,j}^K]} |h(z_{k-1,j}^K,u_l) - h(z_{k-1,j}^K,z)|\,dp_j(z)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \\
&\quad < \frac{\epsilon}{16M} + \frac{\epsilon}{16M} + \frac{\epsilon}{16M} + \frac{\epsilon}{16M} = \frac{\epsilon}{4M}.
\end{aligned} \tag{A.4}
$$
Consequently, $\forall K > K_2$,
$$\left| \sum_{k=1,\, k\in\mathcal{K}_K}^{k_j^K(x_j)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] - \sum_{l=1,\, u_l\le x_j}^{M} J_l \right| \le \sum_{k=1,\, k\in\mathcal{K}_K}^{k_j^K(x_j)} \left| E[\,\cdot\,] - J_{l(k,K)} \right| < \sum_{k=1,\, k\in\mathcal{K}_K}^{k_j^K(x_j)} \frac{\epsilon}{4M} \le \frac{\epsilon}{4}. \tag{A.5}$$
Now condition (iii) implies that $h(z)$ is uniformly continuous on $S_j\setminus J$. To see this, let $\epsilon$, $\delta$, $u$, $v$, and $z'$ be as in condition (iii). Applying condition (iii) twice, first with $z = v$ and second with $z = u$, implies that
$$|h(u) - h(v)| \le \left| h(u) - \frac{h(u,z') - h(v,z')}{u - v} \right| + \left| h(v) - \frac{h(u,z') - h(v,z')}{u - v} \right| < 2\epsilon.$$
In other words, $\forall \epsilon > 0$, $\exists \delta > 0$ such that $|h(u) - h(v)| < 2\epsilon$ whenever $|u - v| < \delta$ and $(v, u] \cap J = \emptyset$, which is uniform continuity on $S_j\setminus J$. This in turn implies that $h(z)$ is both bounded and Riemann integrable on $S_j$ (the latter because a bounded function on a closed interval is Riemann integrable if and only if it is continuous almost everywhere with respect to Lebesgue measure). By the definition of Riemann integrability, $\exists K_3$ such that $\forall K > K_3$,
$$\left| \int_{x_{min,j}}^{x_j} h(z)\,dz - \sum_{k=1}^{k_j^K(x_j)} h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) \right| < \frac{\epsilon}{4}. \tag{A.6}$$
In addition, since $|\mathcal{K}_K| \le |J| = M < \infty$ and since $h(z)$ is bounded on $S_j$, $\exists K_4$ such that $\forall K > K_4$,
$$\left| \sum_{k=1}^{k_j^K(x_j)} h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) - \sum_{k=1,\, k\notin\mathcal{K}_K}^{k_j^K(x_j)} h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) \right| = \left| \sum_{k=1,\, k\in\mathcal{K}_K}^{k_j^K(x_j)} h(z_{k,j}^K)(z_{k,j}^K - z_{k-1,j}^K) \right| < \frac{\epsilon}{4}. \tag{A.7}$$
Thus, combining (A.3), (A.5), (A.6), and (A.7) via the triangle inequality, $\forall K > \max\{K_1, K_2, K_3, K_4\}$,
$$\left| \sum_{k=1}^{k_j^K(x_j)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] - \left( \int_{x_{min,j}}^{x_j} h(z)\,dz + \sum_{l=1,\, u_l\le x_j}^{M} J_l \right) \right| < \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{4} = \epsilon.$$
Since $\epsilon$ was arbitrary, this shows that the limit in (6) exists and is equal to (A.1).

Theorem 1 (Uncentered ALE Main Effect for differentiable $f(\cdot)$). Let $f^j(x_j,\mathbf{x}_{\setminus j}) \equiv \partial f(x_j,\mathbf{x}_{\setminus j})/\partial x_j$ denote the partial derivative of $f(\mathbf{x})$ with respect to $x_j$ when the derivative exists. In Definition 1, assume that the limit in (6) exists independent of the sequence of partitions (e.g., if the conditions of Theorem A.1 are satisfied). If, in addition,

(i) $f(x_j,\mathbf{x}_{\setminus j})$ is differentiable in $x_j$ on $S$,

(ii) $f^j(x_j,\mathbf{x}_{\setminus j})$ is continuous in $(x_j,\mathbf{x}_{\setminus j})$ on $S$, and

(iii) $E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]$ is continuous in $z_j$ on $S_j$,

then for each $x_j \in S_j$,
$$g_{j,ALE}(x_j) = \int_{x_{min,j}}^{x_j} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]\, dz_j.$$

Proof. Let $\{P_j^K : K = 1, 2, \ldots\}$ denote any sequence of partitions in Definition 1. Since $E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]$ is continuous in $z_j$, it is Riemann integrable on $S_j$. By the definition of Riemann integrability, for any $\epsilon > 0$ there exists a $K_1(\epsilon)$ such that for all $K > K_1(\epsilon)$,
$$\left| \int_{x_{min,j}}^{x_j} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]\,dz_j - \sum_{k=1}^{k_j^K(x_j)} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K](z_{k,j}^K - z_{k-1,j}^K) \right| < \frac{\epsilon}{4}. \tag{A.8}$$
Notice that in (A.8), the upper limit $x_j$ in the integral does not necessarily coincide exactly with the upper limit $z_{k,j}^K$ in the Riemann sum when $k = k_j^K(x_j)$. However, $x_j$ and $z_{k,j}^K$ for $k = k_j^K(x_j)$ must become arbitrarily close as $K \to \infty$, so that (A.8) still holds. This follows because the continuity of $f^j(x_j,\mathbf{x}_{\setminus j})$ on the compact $S$ implies that $E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]$ is bounded, so that for $k = k_j^K(x_j)$ we have $\int_{x_j}^{z_{k,j}^K} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]\,dz_j \to 0$ as $K \to \infty$.

By Definition 1, there also exists a $K_2(\epsilon)$ such that for all $K > K_2(\epsilon)$,
$$\left| g_{j,ALE}(x_j) - \sum_{k=1}^{k_j^K(x_j)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \right| < \frac{\epsilon}{4}. \tag{A.9}$$
If we can show that the summations in (A.8) and (A.9) are within $\frac{\epsilon}{2}$ of each other for sufficiently large $K$, then combining this with (A.8) and (A.9) will imply that $| g_{j,ALE}(x_j) - \int_{x_{min,j}}^{x_j} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]\,dz_j | < \epsilon$ for sufficiently large $K$, and the proof will be complete. Towards this end, write the summand in (A.9) as
$$
\begin{aligned}
&E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \\
&\quad = \frac{1}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \int_{z_j \in (z_{k-1,j}^K, z_{k,j}^K]} \int_{\mathbf{x}_{\setminus j}} [f(z_{k,j}^K,\mathbf{x}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{x}_{\setminus j})]\, dp_{\setminus j|j}(\mathbf{x}_{\setminus j}|z_j)\, dp_j(z_j) \\
&\quad = \frac{z_{k,j}^K - z_{k-1,j}^K}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \int_{z_j \in (z_{k-1,j}^K, z_{k,j}^K]} \int_{\mathbf{x}_{\setminus j}} f^j(z(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j})\, dp_{\setminus j|j}(\mathbf{x}_{\setminus j}|z_j)\, dp_j(z_j),
\end{aligned} \tag{A.10}
$$
for some $z(k,K,\mathbf{x}_{\setminus j}) \in (z_{k-1,j}^K, z_{k,j}^K]$, where the last equality in (A.10) follows by the mean value theorem applied to $f(x_j,\mathbf{x}_{\setminus j})$ over the interval $x_j \in (z_{k-1,j}^K, z_{k,j}^K]$. In addition, by the continuity of $f^j(x_j,\mathbf{x}_{\setminus j})$ on $S$ (which implies uniform continuity, since $S$ is compact), there exists a $K_3(\epsilon)$ such that for all $K > K_3(\epsilon)$ and for all $k$, $\mathbf{x}_{\setminus j}$, and $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$,
$$\left| f^j(z(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j}) - f^j(z_j,\mathbf{x}_{\setminus j}) \right| < \frac{\epsilon}{4(x_{max,j} - x_{min,j})}. \tag{A.11}$$
And by the assumed continuity of $E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]$ on $S_j$ (which again implies uniform continuity on the compact $S_j$), there exists a $K_4(\epsilon)$ such that for all $K > K_4(\epsilon)$ and for all $k$ and $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$,
$$\left| E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j] - E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K] \right| < \frac{\epsilon}{4(x_{max,j} - x_{min,j})}. \tag{A.12}$$
Using (A.11) and (A.12), for all $K > \max\{K_3(\epsilon), K_4(\epsilon)\}$ and for all $k$ and $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$,
$$
\begin{aligned}
&\left| \int_{\mathbf{x}_{\setminus j}} f^j(z(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j})\, dp_{\setminus j|j}(\mathbf{x}_{\setminus j}|z_j) - E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K] \right| \\
&\quad \le \int_{\mathbf{x}_{\setminus j}} \left| f^j(z(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j}) - f^j(z_j,\mathbf{x}_{\setminus j}) \right| dp_{\setminus j|j}(\mathbf{x}_{\setminus j}|z_j) + \left| E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j] - E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K] \right| \\
&\quad \le \frac{\epsilon}{4(x_{max,j} - x_{min,j})} + \frac{\epsilon}{4(x_{max,j} - x_{min,j})} = \frac{\epsilon}{2(x_{max,j} - x_{min,j})}.
\end{aligned} \tag{A.13}
$$
Thus, using (A.10) and (A.13), for all $K > \max\{K_3(\epsilon), K_4(\epsilon)\}$ the difference between the summations in (A.8) and (A.9) satisfies
$$
\begin{aligned}
&\left| \sum_{k=1}^{k_j^K(x_j)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] - \sum_{k=1}^{k_j^K(x_j)} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K](z_{k,j}^K - z_{k-1,j}^K) \right| \\
&\quad \le \sum_{k=1}^{k_j^K(x_j)} (z_{k,j}^K - z_{k-1,j}^K)\, \frac{\int_{z_j \in (z_{k-1,j}^K, z_{k,j}^K]} \left| \int_{\mathbf{x}_{\setminus j}} f^j(z(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j})\, dp_{\setminus j|j}(\mathbf{x}_{\setminus j}|z_j) - E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_{k,j}^K] \right| dp_j(z_j)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \quad \text{(using (A.10))} \\
&\quad \le \sum_{k=1}^{k_j^K(x_j)} (z_{k,j}^K - z_{k-1,j}^K)\, \frac{\epsilon}{2(x_{max,j} - x_{min,j})} \quad \text{(using (A.13))} \quad \le \frac{\epsilon(x_{max,j} - x_{min,j})}{2(x_{max,j} - x_{min,j})} = \frac{\epsilon}{2}.
\end{aligned} \tag{A.14}
$$
Finally, combining (A.8), (A.9), and (A.14) via the triangle inequality, for all $K > \max\{K_1(\epsilon), K_2(\epsilon), K_3(\epsilon), K_4(\epsilon)\}$,
$$\left| g_{j,ALE}(x_j) - \int_{x_{min,j}}^{x_j} E[f^j(X_j,\mathbf{X}_{\setminus j}) \mid X_j = z_j]\,dz_j \right| \le \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{2} = \epsilon,$$
which completes the proof.
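As a concrete illustration of Theorem 1 (a sketch using the $f(\mathbf{x}) = x_1 + x_2^2$ example from Section 5.3, assuming the theorem's regularity conditions hold for that example):
$$g_{1,ALE}(x_1) = \int_{x_{min,1}}^{x_1} E\!\left[\frac{\partial f}{\partial x_1}(X_1, X_2) \,\middle|\, X_1 = z\right] dz = \int_{x_{min,1}}^{x_1} 1\, dz = x_1 - x_{min,1},$$
and similarly $g_{2,ALE}(x_2) = \int_{x_{min,2}}^{x_2} E[2 X_2 \mid X_2 = z]\,dz = x_2^2 - x_{min,2}^2$, so that after centering the ALE main effects are $x_1$ and $x_2^2$ up to additive constants, regardless of the dependence between $X_1$ and $X_2$.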

Theorem 2 (Uncentered ALE Second-Order Effect for differentiable $f(\cdot)$). Let $f^{\{j,l\}}(x_j, x_l, \mathbf{x}_{\setminus\{j,l\}}) \equiv \partial^2 f(x_j, x_l, \mathbf{x}_{\setminus\{j,l\}})/\partial x_j \partial x_l$ denote the second-order partial derivative of $f(\mathbf{x})$ with respect to $x_j$ and $x_l$ when the derivative exists. In Definition 2, assume that the limit in (9) exists independent of the sequences of partitions. If, in addition,

(i) $f(x_j, x_l, \mathbf{x}_{\setminus\{j,l\}})$ is differentiable in $(x_j, x_l)$ on $S$,

(ii) $f^{\{j,l\}}(x_j, x_l, \mathbf{x}_{\setminus\{j,l\}})$ is continuous in $(x_j, x_l, \mathbf{x}_{\setminus\{j,l\}})$ on $S$, and

(iii) $E[f^{\{j,l\}}(X_j, X_l, \mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]$ is continuous in $(z_j, z_l)$ on $S_j \times S_l$,

then for each $(x_j, x_l) \in S_j \times S_l$,
$$h_{\{j,l\},ALE}(x_j, x_l) = \int_{x_{min,l}}^{x_l} \int_{x_{min,j}}^{x_j} E[f^{\{j,l\}}(X_j, X_l, \mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\, dz_j\, dz_l.$$

Proof. This proof follows the same general course as the proof of Theorem 1. Let $\{P_j^K : K = 1, 2, \ldots\}$ and $\{P_l^K : K = 1, 2, \ldots\}$ denote any two sequences of partitions of $S_j$ and $S_l$ in Definition 2. Since $E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]$ is continuous in $(z_j, z_l)$, it is Riemann integrable on $S_j \times S_l$. By the definition of Riemann integrability of real-valued functions defined on rectangular domains, for any $\epsilon > 0$ there exists a $K_1(\epsilon)$ such that for all $K > K_1(\epsilon)$,
$$
\begin{aligned}
\Bigg| \int_{x_{min,l}}^{x_l} \int_{x_{min,j}}^{x_j} &E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\,dz_j\,dz_l \\
&- \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_{k,j}^K, X_l = z_{m,l}^K](z_{k,j}^K - z_{k-1,j}^K)(z_{m,l}^K - z_{m-1,l}^K) \Bigg| < \frac{\epsilon}{4}.
\end{aligned} \tag{A.15}
$$
By a similar argument as in Theorem 1, (A.15) holds even though the upper limits $x_j$ and $x_l$ in the double integral do not necessarily coincide exactly with the upper limits $z_{k,j}^K$ for $k = k_j^K(x_j)$ and $z_{m,l}^K$ for $m = k_l^K(x_l)$ in the Riemann sum. This follows because the continuity of $f^{\{j,l\}}(x_j,x_l,\mathbf{x}_{\setminus\{j,l\}})$ on the compact $S$ implies that $E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]$ is bounded, and because the two corresponding upper limits for each predictor become arbitrarily close as $K \to \infty$. Thus, for $k = k_j^K(x_j)$ and $m = k_l^K(x_l)$, it follows that
$$\int_{x_l}^{z_{m,l}^K} \int_{x_j}^{z_{k,j}^K} E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\,dz_j\,dz_l \to 0$$
as $K \to \infty$.

By Definition 2, there exists a $K_2(\epsilon)$ such that for all $K > K_2(\epsilon)$,
$$\left| h_{\{j,l\},ALE}(x_j,x_l) - \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]] \right| < \frac{\epsilon}{4}. \tag{A.16}$$
The remainder of the proof shows that the summations in (A.15) and (A.16) are within $\frac{\epsilon}{2}$ of each other for sufficiently large $K$, which, when combined with (A.15) and (A.16), proves the desired result that $|h_{\{j,l\},ALE}(x_j,x_l) - \int_{x_{min,l}}^{x_l} \int_{x_{min,j}}^{x_j} E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\,dz_j\,dz_l| < \epsilon$ for sufficiently large $K$.

The summand in (A.16) can be written as
$$
\begin{aligned}
&E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]] \\
&\quad = \frac{\int_{(z_j,z_l)} \int_{\mathbf{x}_{\setminus\{j,l\}}} \Delta_f^{\{j,l\}}(K,k,m;\mathbf{x}_{\setminus\{j,l\}})\, dp_{\setminus\{j,l\}|\{j,l\}}(\mathbf{x}_{\setminus\{j,l\}}|z_j,z_l)\, dp_{\{j,l\}}(z_j,z_l)}{p_{\{j,l\}}((z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K])} \\
&\quad = \frac{(z_{k,j}^K - z_{k-1,j}^K)(z_{m,l}^K - z_{m-1,l}^K)}{p_{\{j,l\}}((z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K])} \int_{(z_j,z_l)} \int_{\mathbf{x}_{\setminus\{j,l\}}} f^{\{j,l\}}(\mathbf{z}(k,m,K,\mathbf{x}_{\setminus\{j,l\}}),\mathbf{x}_{\setminus\{j,l\}})\, dp_{\setminus\{j,l\}|\{j,l\}}(\mathbf{x}_{\setminus\{j,l\}}|z_j,z_l)\, dp_{\{j,l\}}(z_j,z_l),
\end{aligned} \tag{A.17}
$$
where the inner double integral is over $(z_j,z_l) \in (z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K]$, for some $\mathbf{z}(k,m,K,\mathbf{x}_{\setminus\{j,l\}}) \equiv (z_1(k,K,\mathbf{x}_{\setminus j}), z_2(k,m,K,\mathbf{x}_{\setminus\{j,l\}})) \in (z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K]$. The last equality in (A.17) follows by applying the mean value theorem twice to $f(x_j,x_l,\mathbf{x}_{\setminus\{j,l\}})$ over the rectangle $(x_j,x_l) \in (z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K]$. Specifically, the first application of the mean value theorem gives
$$\frac{f(z_{k,j}^K,\mathbf{x}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{x}_{\setminus j})}{z_{k,j}^K - z_{k-1,j}^K} = f^j(z_1(k,K,\mathbf{x}_{\setminus j}),\mathbf{x}_{\setminus j}) \tag{A.18}$$
for some $z_1(k,K,\mathbf{x}_{\setminus j}) \in (z_{k-1,j}^K, z_{k,j}^K]$. A second application of the mean value theorem to (A.18) gives
$$\frac{\Delta_f^{\{j,l\}}(K,k,m;\mathbf{x}_{\setminus\{j,l\}})}{(z_{k,j}^K - z_{k-1,j}^K)(z_{m,l}^K - z_{m-1,l}^K)} = f^{\{j,l\}}(z_1(k,K,\mathbf{x}_{\setminus j}), z_2(k,m,K,\mathbf{x}_{\setminus\{j,l\}}),\mathbf{x}_{\setminus\{j,l\}})$$
for some $z_2(k,m,K,\mathbf{x}_{\setminus\{j,l\}}) \in (z_{m-1,l}^K, z_{m,l}^K]$.

Further, by the continuity of $f^{\{j,l\}}(x_j,x_l,\mathbf{x}_{\setminus\{j,l\}})$ (which implies uniform continuity on the compact $S$), there exists a $K_3(\epsilon)$ such that for all $K > K_3(\epsilon)$ and for all $k$, $m$, $\mathbf{x}_{\setminus\{j,l\}}$, $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$, and $z_l \in (z_{m-1,l}^K, z_{m,l}^K]$,
$$\left| f^{\{j,l\}}(\mathbf{z}(k,m,K,\mathbf{x}_{\setminus\{j,l\}}),\mathbf{x}_{\setminus\{j,l\}}) - f^{\{j,l\}}(z_j,z_l,\mathbf{x}_{\setminus\{j,l\}}) \right| < \frac{\epsilon}{4(x_{max,j} - x_{min,j})(x_{max,l} - x_{min,l})}. \tag{A.19}$$
Likewise, by the assumed continuity of $E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]$ on the compact set $S_j \times S_l$ (which implies uniform continuity on this set), there exists a $K_4(\epsilon)$ such that for all $K > K_4(\epsilon)$ and for all $k$, $m$, $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$, and $z_l \in (z_{m-1,l}^K, z_{m,l}^K]$,
$$\left| E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l] - E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_{k,j}^K, X_l = z_{m,l}^K] \right| < \frac{\epsilon}{4(x_{max,j} - x_{min,j})(x_{max,l} - x_{min,l})}. \tag{A.20}$$
Using (A.19) and (A.20), for all $K > \max\{K_3(\epsilon), K_4(\epsilon)\}$ and for all $k$, $m$, $z_j \in (z_{k-1,j}^K, z_{k,j}^K]$, and $z_l \in (z_{m-1,l}^K, z_{m,l}^K]$,
$$
\begin{aligned}
&\left| \int_{\mathbf{x}_{\setminus\{j,l\}}} f^{\{j,l\}}(\mathbf{z}(k,m,K,\mathbf{x}_{\setminus\{j,l\}}),\mathbf{x}_{\setminus\{j,l\}})\, dp_{\setminus\{j,l\}|\{j,l\}}(\mathbf{x}_{\setminus\{j,l\}}|z_j,z_l) - E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_{k,j}^K, X_l = z_{m,l}^K] \right| \\
&\quad \le \int_{\mathbf{x}_{\setminus\{j,l\}}} \left| f^{\{j,l\}}(\mathbf{z}(k,m,K,\mathbf{x}_{\setminus\{j,l\}}),\mathbf{x}_{\setminus\{j,l\}}) - f^{\{j,l\}}(z_j,z_l,\mathbf{x}_{\setminus\{j,l\}}) \right| dp_{\setminus\{j,l\}|\{j,l\}}(\mathbf{x}_{\setminus\{j,l\}}|z_j,z_l) \\
&\qquad + \left| E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l] - E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_{k,j}^K, X_l = z_{m,l}^K] \right| \\
&\quad \le \frac{\epsilon}{2(x_{max,j} - x_{min,j})(x_{max,l} - x_{min,l})}.
\end{aligned} \tag{A.21}
$$
Thus, using (A.17) and (A.21), for all $K > \max\{K_3(\epsilon), K_4(\epsilon)\}$ the difference between the summations in (A.15) and (A.16) satisfies
$$
\begin{aligned}
&\left| \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]] \right. \\
&\qquad \left. - \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_{k,j}^K, X_l = z_{m,l}^K](z_{k,j}^K - z_{k-1,j}^K)(z_{m,l}^K - z_{m-1,l}^K) \right| \\
&\quad \le \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} (z_{k,j}^K - z_{k-1,j}^K)(z_{m,l}^K - z_{m-1,l}^K)\, \frac{\epsilon}{2(x_{max,j} - x_{min,j})(x_{max,l} - x_{min,l})} \le \frac{\epsilon}{2}.
\end{aligned} \tag{A.22}
$$
Finally, combining (A.15), (A.16), and (A.22) via the triangle inequality, for all $K > \max\{K_1(\epsilon), K_2(\epsilon), K_3(\epsilon), K_4(\epsilon)\}$,
$$\left| h_{\{j,l\},ALE}(x_j,x_l) - \int_{x_{min,l}}^{x_l} \int_{x_{min,j}}^{x_j} E[f^{\{j,l\}}(X_j,X_l,\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\,dz_j\,dz_l \right| \le \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{2} = \epsilon,$$
which completes the proof.
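Analogously to the illustration following Theorem 1, for $f(\mathbf{x}) = x_1 x_2$ (a sketch under the same regularity assumptions), Theorem 2 gives the uncentered second-order ALE effect directly:
$$h_{\{1,2\},ALE}(x_1, x_2) = \int_{x_{min,2}}^{x_2} \int_{x_{min,1}}^{x_1} E\!\left[\frac{\partial^2 f}{\partial x_1 \partial x_2}(X_1, X_2) \,\middle|\, X_1 = z_1, X_2 = z_2\right] dz_1\, dz_2 = (x_1 - x_{min,1})(x_2 - x_{min,2}),$$
which equals $x_1 x_2$ plus additive functions of the individual predictors; the centering in (14) then adjusts for lower-order ALE effects to produce the $f_{\{1,2\},ALE}(x_1, x_2)$ reported for this example in Section 5.3.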

Theorem 3 (consistency of the ALE main effect estimator). Consider a sequence of partitions $\{P_j^K : K = 1, 2, \ldots\}$ as in Definition 1, such that $\delta_{j,K} \to 0$ as $K \to \infty$. Denote the estimator (15) of the uncentered ALE main effect of $X_j$ using partition $P_j^K$ and sample size $n$ by
$$g_{j,ALE,K,n}(x) \equiv \sum_{k=1}^{k_j^K(x)} \frac{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])\,[f(z_{k,j}^K, \mathbf{X}_{i,\setminus j}) - f(z_{k-1,j}^K, \mathbf{X}_{i,\setminus j})]}{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])}, \tag{A.23}$$
where $I(\cdot)$ denotes the indicator function of an event. Assume that $\{\mathbf{X}_i : i = 1, 2, \ldots\}$ is an i.i.d. sequence of random vectors drawn from $p(\cdot)$, and let $P$ denote the probability measure on the underlying probability space of this random sequence. Let
$$g_{j,ALE,K}(x) \equiv \sum_{k=1}^{k_j^K(x)} E[f(z_{k,j}^K, \mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K, \mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] \tag{A.24}$$
denote the finite-$K$ version of $g_{j,ALE}(x_j)$ in (6) using the same partition $P_j^K$ as the estimator. Then $g_{j,ALE,K,n}(x)$ is a strongly consistent estimator of $g_{j,ALE}(x)$ in the following sense. For each $x \in S_j$, each $\epsilon > 0$, and each $K$,
$$\lim_{n \to \infty} g_{j,ALE,K,n}(x) = g_{j,ALE,K}(x) \quad \text{a.s.-}P,$$
with $|g_{j,ALE,K}(x) - g_{j,ALE}(x)| < \epsilon$ if $K$ is sufficiently large. Moreover, there exists a sequence of sample sizes $\{n_K : K = 1, 2, \ldots\}$ such that, as $K \to \infty$, $g_{j,ALE,K,n_K}(x) \to g_{j,ALE}(x)$ both in probability and in mean.

Proof. Consider any fixed $K$ and $k \in \{1, 2, \ldots, K\}$. Applying the strong law of large numbers to both the numerator and the denominator in the summand of (A.23), as $n \to \infty$ the summand converges a.s.-$P$:
$$
\begin{aligned}
\frac{n^{-1}\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])\,[f(z_{k,j}^K,\mathbf{X}_{i,\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{i,\setminus j})]}{n^{-1}\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])}
&\xrightarrow{a.s.} \frac{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K])\,[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j})]]}{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K])]} \\
&= \frac{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K])\,E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j]]}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \\
&= \frac{\int_{z \in (z_{k-1,j}^K, z_{k,j}^K]} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j = z]\,dp_j(z)}{p_j((z_{k-1,j}^K, z_{k,j}^K])} \\
&= E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]].
\end{aligned}
$$
Consequently, since a finite sum of almost surely convergent random variables also converges almost surely, as $n \to \infty$,
$$g_{j,ALE,K,n}(x) \xrightarrow{a.s.} \sum_{k=1}^{k_j^K(x)} E[f(z_{k,j}^K,\mathbf{X}_{\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{\setminus j}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K]] = g_{j,ALE,K}(x). \tag{A.25}$$
By the definition of $g_{j,ALE}(x)$ in (6), for any $\epsilon > 0$ there exists some sufficiently large $K_1 = K_1(\epsilon)$ such that
$$|g_{j,ALE,K}(x) - g_{j,ALE}(x)| < \frac{\epsilon}{2} \quad \forall K > K_1. \tag{A.26}$$
Eqs. (A.25) and (A.26) together imply the strong consistency claim.

Because the almost-sure convergence in (A.25) implies convergence in probability as well, by the definition of convergence in probability we have that for each $\epsilon > 0$,
$$\lim_{n\to\infty} P\{|g_{j,ALE,K,n}(x) - g_{j,ALE,K}(x)| > \tfrac{\epsilon}{2}\} = 0, \tag{A.27}$$
so that for each $K$, each $\epsilon > 0$, and each $\delta > 0$ there exists an integer $n_K = n_K(\epsilon, \delta)$ such that for all $n \ge n_K$,
$$P\{|g_{j,ALE,K,n}(x) - g_{j,ALE,K}(x)| > \tfrac{\epsilon}{2}\} < \delta. \tag{A.28}$$
Thus, since $|g_{j,ALE,K,n_K}(x) - g_{j,ALE}(x)| \le |g_{j,ALE,K,n_K}(x) - g_{j,ALE,K}(x)| + |g_{j,ALE,K}(x) - g_{j,ALE}(x)|$, for $K > K_1$ in (A.26) we have
$$
\begin{aligned}
P\{|g_{j,ALE,K,n_K}(x) - g_{j,ALE}(x)| > \epsilon\}
&\le P\{|g_{j,ALE,K,n_K}(x) - g_{j,ALE,K}(x)| + |g_{j,ALE,K}(x) - g_{j,ALE}(x)| > \epsilon\} \\
&\le P\{|g_{j,ALE,K,n_K}(x) - g_{j,ALE,K}(x)| + \tfrac{\epsilon}{2} > \epsilon\} \\
&= P\{|g_{j,ALE,K,n_K}(x) - g_{j,ALE,K}(x)| > \tfrac{\epsilon}{2}\} < \delta \quad \text{(using (A.28))}.
\end{aligned} \tag{A.29}
$$
Since $\delta$ and $\epsilon$ were arbitrary, (A.29) implies that $g_{j,ALE,K,n_K}(x)$ converges in probability to $g_{j,ALE}(x)$ as $K \to \infty$, as claimed.

Finally, convergence in mean follows similarly to convergence in probability if, for fixed $K$, we can replace the convergence in probability (A.27) by the convergence in mean
$$\lim_{n\to\infty} E[|g_{j,ALE,K,n}(x) - g_{j,ALE,K}(x)|] = 0. \tag{A.30}$$
Because convergence in probability (A.27), along with uniform integrability of $g_{j,ALE,K,n}(x)$, implies convergence in mean (A.30), it suffices to show uniform (over all $n$) integrability of $g_{j,ALE,K,n}(x)$, which follows if we can show that $g_{j,ALE,K,n}(x)$ is bounded for all $n$ and for fixed $K$. The boundedness follows because $|f|$ is bounded (say by $M < \infty$), so that for every $n$,
$$|g_{j,ALE,K,n}(x)| \le \sum_{k=1}^{k_j^K(x)} \frac{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])\,|f(z_{k,j}^K,\mathbf{X}_{i,\setminus j}) - f(z_{k-1,j}^K,\mathbf{X}_{i,\setminus j})|}{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K])} \le 2MK, \tag{A.31}$$
which proves (A.30). Eq. (A.30) in turn implies that for each fixed $K$ and each $\epsilon > 0$, there exists an integer $n_K = n_K(\epsilon)$ such that for all $n \ge n_K$,
$$E[|g_{j,ALE,K,n}(x) - g_{j,ALE,K}(x)|] < \frac{\epsilon}{2}. \tag{A.32}$$
Thus, for $K > K_1$ in (A.26) we have
$$E[|g_{j,ALE,K,n_K}(x) - g_{j,ALE}(x)|] \le E[|g_{j,ALE,K,n_K}(x) - g_{j,ALE,K}(x)|] + E[|g_{j,ALE,K}(x) - g_{j,ALE}(x)|] < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{A.33}$$
Since $\epsilon$ was arbitrary, this proves the claimed convergence in mean
$$\lim_{K\to\infty} E[|g_{j,ALE,K,n_K}(x) - g_{j,ALE}(x)|] = 0.$$
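The consistency in Theorem 3 is also easy to observe numerically. The sketch below (reusing the hypothetical ale_main_uncentered() function from Section 5.2, with K fixed and n increasing) considers $f(\mathbf{x}) = x_1 x_2$ with $E[X_2 \mid X_1 = z] = 0.8z$, for which the population value of $g_{1,ALE}(1) - g_{1,ALE}(0)$ is $0.5 \times 0.8 \times (1 - 0) = 0.4$:

    # Fixed partition size K; the estimate stabilizes near 0.4 as n grows
    set.seed(2)
    f_int <- function(newdata) newdata$x1 * newdata$x2
    for (n in c(1000, 10000, 100000)) {
      x1 <- rnorm(n); x2 <- 0.8 * x1 + 0.6 * rnorm(n)
      est <- ale_main_uncentered(data.frame(x1 = x1, x2 = x2), f_int, "x1", K = 20)
      g <- approxfun(est$x, est$ale)   # interpolate the accumulated estimate
      cat("n =", n, "  g(1) - g(0) =", round(g(1) - g(0), 3), "\n")
    }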

Theorem 4 (consistency of the ALE second-order effect estimator). For $\{j,l\} \subseteq \{1, 2, \ldots, d\}$, consider two sequences of partitions $\{P_j^K : K = 1, 2, \ldots\}$ and $\{P_l^K : K = 1, 2, \ldots\}$ as in Definition 2, such that $\lim_{K\to\infty}\delta_{j,K} = \lim_{K\to\infty}\delta_{l,K} = 0$. Denote the estimator (17) of the uncentered ALE second-order effect of $(X_j, X_l)$ using partitions $P_j^K$ and $P_l^K$ and sample size $n$ by
$$h_{\{j,l\},ALE,K,n}(x_j,x_l) = \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} \frac{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])\,\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{i,\setminus\{j,l\}})}{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])}. \tag{A.34}$$
Let $\{\mathbf{X}_i : i = 1, 2, \ldots\}$ and $P(\cdot)$ be as in Theorem 3, and let
$$h_{\{j,l\},ALE,K}(x_j,x_l) \equiv \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]] \tag{A.35}$$
denote the finite-$K$ version of $h_{\{j,l\},ALE}(x_j,x_l)$ in (9) using the same partitions as the estimator. Assume $p_{\{j,l\}}(\cdot)$ is strictly positive everywhere in $S_j \times S_l$.

Then $h_{\{j,l\},ALE,K,n}(x_j,x_l)$ is a strongly consistent estimator of $h_{\{j,l\},ALE}(x_j,x_l)$ in the following sense. For each $(x_j,x_l) \in S_j \times S_l$, each $\epsilon > 0$, and each $K$,
$$\lim_{n\to\infty} h_{\{j,l\},ALE,K,n}(x_j,x_l) = h_{\{j,l\},ALE,K}(x_j,x_l) \quad \text{a.s.-}P,$$
with $|h_{\{j,l\},ALE,K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)| < \epsilon$ if $K$ is sufficiently large. In addition, there exists a sequence of sample sizes $\{n_K : K = 1, 2, \ldots\}$ such that, as $K \to \infty$, $h_{\{j,l\},ALE,K,n_K}(x_j,x_l) \to h_{\{j,l\},ALE}(x_j,x_l)$ both in probability and in mean for each $(x_j,x_l) \in S_j \times S_l$.

Proof. The proof is similar to that of Theorem 3. Consider any fixed $K$ and $k, m \in \{1, 2, \ldots, K\}$. Applying the strong law of large numbers to both the numerator and the denominator in the summand of (A.34), as $n \to \infty$ the summand converges a.s.-$P$:
$$
\begin{aligned}
&\frac{n^{-1}\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])\,\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{i,\setminus\{j,l\}})}{n^{-1}\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])} \\
&\quad \xrightarrow{a.s.} \frac{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K])\,\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}})]}{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K])]} \\
&\quad = \frac{E[I(X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K])\,E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j, X_l]]}{p_{\{j,l\}}((z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K])} \\
&\quad = \frac{\int_{(z_j,z_l) \in (z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K]} E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j = z_j, X_l = z_l]\,dp_{\{j,l\}}(z_j,z_l)}{p_{\{j,l\}}((z_{k-1,j}^K, z_{k,j}^K] \times (z_{m-1,l}^K, z_{m,l}^K])} \\
&\quad = E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]].
\end{aligned}
$$
Again, since a finite sum of almost surely convergent random variables converges almost surely, as $n \to \infty$,
$$h_{\{j,l\},ALE,K,n}(x_j,x_l) \xrightarrow{a.s.} \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} E[\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{\setminus\{j,l\}}) \mid X_j \in (z_{k-1,j}^K, z_{k,j}^K], X_l \in (z_{m-1,l}^K, z_{m,l}^K]] = h_{\{j,l\},ALE,K}(x_j,x_l). \tag{A.36}$$
By the definition of $h_{\{j,l\},ALE}(x_j,x_l)$ in (9), for any $\epsilon > 0$ there exists some sufficiently large $K_1 = K_1(\epsilon)$ such that
$$|h_{\{j,l\},ALE,K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)| < \frac{\epsilon}{2} \quad \forall K > K_1. \tag{A.37}$$
Eqs. (A.36) and (A.37) together imply the strong consistency claim.

Using the definition of convergence in probability, which is implied by the almost sure convergence in (A.36), we have for each $\epsilon > 0$,
$$\lim_{n\to\infty} P\{|h_{\{j,l\},ALE,K,n}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)| > \tfrac{\epsilon}{2}\} = 0. \tag{A.38}$$
Thus, for each $K$, each $\epsilon > 0$, and each $\delta > 0$ there exists an integer $n_K = n_K(\epsilon,\delta)$ such that for all $n \ge n_K$,
$$P\{|h_{\{j,l\},ALE,K,n}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)| > \tfrac{\epsilon}{2}\} < \delta. \tag{A.39}$$
Since $|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)| \le |h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)| + |h_{\{j,l\},ALE,K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)|$, for $K > K_1$ in (A.37) we have
$$
\begin{aligned}
P\{|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)| > \epsilon\}
&\le P\{|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)| + \tfrac{\epsilon}{2} > \epsilon\} \\
&= P\{|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)| > \tfrac{\epsilon}{2}\} < \delta \quad \text{(using (A.39))}.
\end{aligned} \tag{A.40}
$$
Since $\delta$ and $\epsilon$ were arbitrary, (A.40) implies that $h_{\{j,l\},ALE,K,n_K}(x_j,x_l)$ converges in probability to $h_{\{j,l\},ALE}(x_j,x_l)$ as $K \to \infty$, as claimed.

We now prove that for any fixed $K$, the following convergence in mean holds:
$$\lim_{n\to\infty} E[|h_{\{j,l\},ALE,K,n}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)|] = 0. \tag{A.41}$$
By the same arguments as in the proof of Theorem 3, it suffices to show uniform (over all $n$) integrability of $h_{\{j,l\},ALE,K,n}(x_j,x_l)$, which follows if we can show that $h_{\{j,l\},ALE,K,n}(x_j,x_l)$ is bounded for all $n$ and for any fixed $K$. The boundedness follows since $|f|$ is bounded by some $M < \infty$, so that for every $n$,
$$|h_{\{j,l\},ALE,K,n}(x_j,x_l)| \le \sum_{k=1}^{k_j^K(x_j)} \sum_{m=1}^{k_l^K(x_l)} \frac{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])\,|\Delta_f^{\{j,l\}}(K,k,m;\mathbf{X}_{i,\setminus\{j,l\}})|}{\sum_{i=1}^n I(X_{i,j} \in (z_{k-1,j}^K, z_{k,j}^K], X_{i,l} \in (z_{m-1,l}^K, z_{m,l}^K])} \le 4MK^2. \tag{A.42}$$
Thus, we have proven (A.41), which in turn implies that for each fixed $K$ and each $\epsilon > 0$, there exists an integer $n_K = n_K(\epsilon)$ such that for all $n \ge n_K$,
$$E[|h_{\{j,l\},ALE,K,n}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)|] < \frac{\epsilon}{2}. \tag{A.43}$$
Hence, for $K > K_1$ in (A.37), we have
$$E[|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)|] \le E[|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE,K}(x_j,x_l)|] + E[|h_{\{j,l\},ALE,K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)|] < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{A.44}$$
Since $\epsilon$ was arbitrary, this proves the claimed convergence in mean
$$\lim_{K\to\infty} E[|h_{\{j,l\},ALE,K,n_K}(x_j,x_l) - h_{\{j,l\},ALE}(x_j,x_l)|] = 0.$$

Appendix B ALE Plot Definition for Higher-Order Effects

Although we do not envision ALE plots being commonly used to visualize third- and higher-order effects, the notion of higher-order ALE effects is needed to derive the ALE decomposition theorem in Appendix C and the additive recovery properties discussed in Section 5.3. For this reason, and for completeness, we define the ALE higher-order effects and their sample estimators in this appendix and Appendix D, respectively. Here, we use the notation defined in Section 2 and make the same assumptions for $p_j$, $S_j$, and $P_j^K$ (for each $j \in \{1, \ldots, d\}$, $K = 1, 2, \ldots$) as in Definitions 1 and 2. Again, $\delta_{j,K}$ represents the fineness of the partition $P_j^K$, with $\lim_{K\to\infty}\delta_{j,K} = 0$ for each $j$, and $k_j^K(x)$ denotes the index of the interval of $P_j^K$ into which $x$ falls.

Consider any index subset $J \subseteq D \equiv \{1, 2, \ldots, d\}$ and the corresponding subset of predictors $\mathbf{X}_J = (X_j : j \in J)$ and their complement $\mathbf{X}_{\setminus J}$. In order to define the $|J|$-order ALE effect of $\mathbf{X}_J$, we first extend the definition of the uncentered ALE effects $g_{j,ALE}(x_j)$ and $h_{\{j,l\},ALE}(x_j, x_l)$ to general $J \subseteq D$, for which it will be convenient to introduce the following operator notation. Let $g : \mathbb{R}^d \to \mathbb{R}$ be any function and consider the operator $L_J$ that maps $g$ onto another function $L_J(g) : \mathbb{R}^d \to \mathbb{R}$ defined via
$$L_J(g)(\mathbf{x}_J) \equiv \lim_{K\to\infty} \sum_{\{\mathbf{k} : 1 \le k_j \le k_j^K(x_j),\, j \in J\}} E[\Delta_g^J(K, \mathbf{k}; \mathbf{X}_{\setminus J}) \mid X_j \in (z_{k_j-1,j}^K, z_{k_j,j}^K],\ \forall j \in J], \tag{B.1}$$
where the notation is as follows. For each $K$, the $\mathbf{x}_J$-space is partitioned into the $|J|$-dimensional grid of rectangular cells that is the Cartesian product of $\{P_j^K : j \in J\}$. The $|J|$-length vector $\mathbf{k} = (k_j : j \in J)$ (with each $k_j$ an integer between 1 and $K$) indicates a specific cell in this grid, i.e., cell-$\mathbf{k}$ is the Cartesian product of $\{(z_{k_j-1,j}^K, z_{k_j,j}^K] : j \in J\}$. $\Delta_g^J(K, \mathbf{k}; \mathbf{x}_{\setminus J})$ denotes the $|J|$-order finite difference of $g(\mathbf{x}) = g(\mathbf{x}_J, \mathbf{x}_{\setminus J})$ with respect to $\mathbf{x}_J = (x_j : j \in J)$ across cell-$\mathbf{k}$. For example, for $J = 1$ and $\mathbf{k} = k$, $\Delta_g^J(K, \mathbf{k}; \mathbf{x}_{\setminus J})$ is the difference $g(z_{k,1}^K, \mathbf{x}_{\setminus 1}) - g(z_{k-1,1}^K, \mathbf{x}_{\setminus 1})$. For $J = \{1, 3\}$ and $\mathbf{k} = (k, m)$, $\Delta_g^J(K, \mathbf{k}; \mathbf{x}_{\setminus J})$ is the difference of the difference $[g(z_{k,1}^K, z_{m,3}^K, \mathbf{x}_{\setminus\{1,3\}}) - g(z_{k-1,1}^K, z_{m,3}^K, \mathbf{x}_{\setminus\{1,3\}})] - [g(z_{k,1}^K, z_{m-1,3}^K, \mathbf{x}_{\setminus\{1,3\}}) - g(z_{k-1,1}^K, z_{m-1,3}^K, \mathbf{x}_{\setminus\{1,3\}})]$. For general $J$, $\Delta_g^J(K, \mathbf{k}; \mathbf{x}_{\setminus J})$ is the difference of the difference of the difference $\ldots$ ($|J|$ times).

Note that if we substitute $g = f$ in (B.1) for the special case $J = \{j, l\}$, (B.1) reduces to $h_{\{j,l\},ALE}(x_j, x_l)$ in (9). In (B.1) we have written $L_J(g)$ as a function of only $\mathbf{x}_J$ to make explicit the fact that it depends only on $\mathbf{x}_J$. However, it could be viewed as a function of $\mathbf{x} \in \mathbb{R}^d$, if we take it to be its extension from $\mathbb{R}^{|J|}$ to $\mathbb{R}^d$. For $J = \emptyset$ (the empty set of indices), $L_\emptyset(g)$ is defined as $E[g(\mathbf{X})] = \int p_D(\mathbf{x})\, g(\mathbf{x})\, d\mathbf{x}$, the marginal mean of $g(\mathbf{X})$.

We will define the ALE $|J|$-order effect of $\mathbf{X}_J$ on $f$, which we denote by $f_{J,ALE}$, as a centered version of the function $L_J(f)$, analogous to how $f_{j,ALE}(x_j)$ and $f_{\{j,l\},ALE}(x_j, x_l)$ were obtained by centering $g_{j,ALE}(x_j)$ and $h_{\{j,l\},ALE}(x_j, x_l)$ in Section 2. For general $J$, $L_J(f)$ consists of the desired $f_{J,ALE}$ plus lower-order effect functions that are related to $f$ evaluated at the lower boundaries of the rectangular summation region in (B.1). Broadly speaking, our strategy is to sequentially subtract the lower-order ALE effects from $L_J(f)$ to obtain $f_{J,ALE}$.

More formally, define HJ(f) : f → fJ,ALE as the operator that maps a function f to its|J |-order ALE effect fJ,ALE , and let the symbol ◦ denote the composition of two operators. For|J | = 0 (i.e., J = ∅), we define the zero-order ALE effect for f(·) as

f∅,ALE ≡ H∅(f) ≡ L∅(f), (B.2)

a constant that does not depend on x and that represents the marginal mean E[f(X)] of thefunction f(X). For 1 ≤ |J | < d, we define the |J |-order effect of XJ on f as

fJ,ALE(xJ) ≡ HJ(f)(xJ) ≡ [(I − L∅) ◦ (I −∑

v⊂J,|v|=1

Lv) ◦ (I −∑

v⊂J,|v|=2

Lv)◦

. . . ◦ (I −∑

v⊂J,|v|=|J |−1

Lv) ◦ LJ ](f)(xJ),(B.3)

where I denotes the identity operator, i.e., I(g) = g for a function g : Rd → R. The rightmostterm in the composite operator HJ defined in (B.3) is just LJ ; the next rightmost term (I −∑

v⊂J,|v|=J−1 Lv) serves to subtract all of the interactions effects of order |J | − 1 from the result

LJ(f) of the previous operation; the next rightmost term (I−∑

v⊂J,|v|=J−2 Lv) serves to subtract

all of the interaction terms of order |J | − 2 from the result (I −∑

v⊂J,|v|=J−1 Lv) ◦ LJ(f) of the

previous operation; and so on. In other words, proceeding from right-to-left, (B.3) iterativelysubtracts the effects of smaller and smaller order, until the final operator (I−L∅) is encountered,which subtracts the zero-order effect from the result of the previous operation. Collectively, thesecomposite operations serve to properly (in the sense of the decomposition theorem in AppendixC) subtract from LJ(f) all lower-order effects when forming fJ,ALE . Finally, for J = D, wedefine fD,ALE(x) as

f_{D,ALE}(x) ≡ H_D(f)(x) ≡ [ I − ∑_{v⊂D} H_v ](f)(x).     (B.4)
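As a concrete illustration of (B.3) (a worked instance of ours, spelled out only for readability), for J = {1, 2, 3} the composite operator is

H_{1,2,3} = (I − L_∅) ◦ (I − L_1 − L_2 − L_3) ◦ (I − L_{1,2} − L_{1,3} − L_{2,3}) ◦ L_{1,2,3},

so that L_{1,2,3}(f) is computed first, the three second-order terms are then subtracted from it, followed by the three first-order terms and, finally, the constant zero-order term.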


For the special cases J = j (|J| = 1) and J = {j, l} (|J| = 2), (B.3) reduces to f_{j,ALE}(x_j) and f_{{j,l},ALE}(x_j, x_l) from (8) and (14), respectively. That is, for J = j,

f_{j,ALE}(x_j) = [(I − L_∅) ◦ L_j](f)(x_j) = L_j(f)(x_j) − L_∅ ◦ L_j(f)(x_j)
             = L_j(f)(x_j) − E[L_j(f)(X_j)] = g_{j,ALE}(x_j) − E[g_{j,ALE}(X_j)],

which is the same as (8); and for J = {j, l},

f_{{j,l},ALE}(x_j, x_l) = [(I − L_∅) ◦ (I − L_j − L_l) ◦ L_{j,l}](f)(x_j, x_l)
  = (I − L_∅) ◦ [L_{j,l}(f)(x_j, x_l) − L_j ◦ L_{j,l}(f)(x_j) − L_l ◦ L_{j,l}(f)(x_l)]
  = L_{j,l}(f)(x_j, x_l) − L_j ◦ L_{j,l}(f)(x_j) − L_l ◦ L_{j,l}(f)(x_l)
    − E[L_{j,l}(f)(X_j, X_l) − L_j ◦ L_{j,l}(f)(X_j) − L_l ◦ L_{j,l}(f)(X_l)],

which is the same as (14).

Appendix C ALE Decomposition Theorem and Some Properties of L_J and H_J

We first state some properties of L_J and H_J, which will be used in the proof of the main result in this appendix. The main result is the ALE decomposition theorem, which states that, in some sense, the ALE plots are estimating the correct quantities.

Properties of L_J and H_J: For any two sets of indices u ⊆ D and J ⊆ D, we have:

(i) L_u ◦ L_u = L_u.

(ii) L_u ◦ L_J = 0 if u ⊈ J, i.e., if u contains at least one index that is not in J.

(iii) L_J is a linear operator, i.e., L_J(a_1 g_1 + a_2 g_2) = a_1 L_J(g_1) + a_2 L_J(g_2) for functions g_1 and g_2 in the domain of L_J and constants a_1 and a_2.

(iv) L_u ◦ H_J = 0, for u ≠ J.

(v) L_u ◦ H_u = L_u.

(vi) H_u ◦ H_J = 0, for u ≠ J.

(vii) H_u ◦ H_u = H_u.

The statement L_u ◦ L_J = 0 is an abbreviation for L_u ◦ L_J(g)(x) = 0 for all g and for all x, and likewise for similar statements. The preceding properties are mostly obvious by inspection of the definitions of L_J in (B.1) and H_J in (B.2)-(B.4). Property (i) follows because applying L_u to a function g(x_u) that does not depend on x_{\u} returns the same function g(x_u) plus lower-order functions of proper subsets of the elements of x_u. Hence, L_u ◦ L_u(g)(x_u) = L_u(g)(x_u) plus lower-order functions, but the sum of these lower-order functions must be identically zero because of the boundary conditions that L_u ◦ L_u(g)(x_u) = L_u(g)(x_u) = 0 when any element of x_u (say x_j) is at its lower boundary value x_{min,j} over the integration region in (B.1). Properties (ii) and (iii) are obvious. Regarding Property (iv), if u ≠ J, we must have either u ⊈ J or u ⊂ J. Property (iv) is obvious for u ⊈ J, i.e., if u contains at least one index that is not in J. For u ⊂ J, Property (iv) follows by noting that, when applying L_u to (B.3) from left to right, L_u ◦ (I − L_∅) ◦ . . . ◦ (I − ∑_{v⊂J, |v|=|u|−1} L_v) = L_u (by Properties (ii) and (iii)), so that L_u ◦ (I − L_∅) ◦ . . . ◦ (I − ∑_{v⊂J, |v|=|u|} L_v) = L_u ◦ (I − L_u − ∑_{v⊂J, |v|=|u|, v≠u} L_v) = L_u − L_u − 0 = 0 (by Properties (i), (ii), and (iii)). Property (v) follows similarly. Properties (vi) and (vii) follow immediately from Properties (iv) and (v), respectively.

The next theorem follows trivially from the above definitions and properties, although we formally state it for completeness.

ALE Decomposition Theorem: A function f(·) can be decomposed as f(x) = ∑_{J⊆D} f_{J,ALE}(x_J), where each ALE component function f_{J,ALE} represents the |J|-order effect of X_J on f(·) and is directly constructed via f_{J,ALE} = H_J(f). Moreover, the ALE


component functions have the following orthogonality-like property. For all J ⊆ D, we have H_J(f_{J,ALE}) = f_{J,ALE}, and H_u(f_{J,ALE}) = 0 for all u ⊆ D with u ≠ J. That is, for each J ⊆ D, the |J|-order effect of X_J on f_{J,ALE} is f_{J,ALE} itself, and all other effects on f_{J,ALE} are identically zero. The ALE decomposition is unique in that for any decomposition f = ∑_{J⊆D} f_J with {f_J : J ⊆ D} having this orthogonality-like property (i.e., with H_J(f_J) = f_J and H_u(f_J) = 0 for u ≠ J), it must be the case that f_J = f_{J,ALE}.

Proof. That f(·) can be decomposed as f(x) = ∑_{J⊆D} f_{J,ALE}(x_J) follows directly from the definition (B.4) of f_{D,ALE}. The orthogonality-like property follows directly from Properties (vi) and (vii). The uniqueness of the ALE decomposition follows directly from Property (iii): applying the linear operator H_J to both sides of f = ∑_{u⊆D} f_u and using the orthogonality-like property gives f_{J,ALE} = H_J(f) = H_J(f_J) = f_J.
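To make the theorem concrete for the smallest nontrivial case (a spelled-out example of ours, for d = 2 and D = {1, 2}), the decomposition reads

f(x_1, x_2) = f_{∅,ALE} + f_{1,ALE}(x_1) + f_{2,ALE}(x_2) + f_{{1,2},ALE}(x_1, x_2),

with f_{∅,ALE} = L_∅(f), f_{j,ALE} = [(I − L_∅) ◦ L_j](f) for j = 1, 2, and f_{{1,2},ALE} = [I − H_∅ − H_1 − H_2](f) from (B.4); the orthogonality-like property then states, for example, that H_1(f_{2,ALE}) = H_1 ◦ H_2(f) = 0 by Property (vi).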

Appendix D Estimation of f_{J,ALE}(x_J) for Higher-Order Effects

Estimation of f_{J,ALE} for |J| = 1 and |J| = 2 is described in Section 3. Here we describe estimation of f_{J,ALE} for general J ⊆ D ≡ {1, 2, . . . , d}. We compute the estimate f_{J,ALE} by computing estimates of the quantities in the composite expression (B.3) from right to left.

Although the notation necessary to formally define f_{J,ALE} for general J is tedious, the concept is straightforward: to estimate L_J(f)(x_J), we make the following replacements in (B.1). We replace the sequence of |J|-dimensional Cartesian product partitions in (B.1) by the single |J|-dimensional Cartesian product of some appropriate fixed partitions of the sample ranges of {x_{i,j} : i = 1, . . . , n} (for j ∈ J), and we replace the conditional expectation in (B.1) by the sample average across all {x_{i,\J} : i = 1, 2, . . . , n}, conditioned on x_{i,J} falling into the corresponding cell of the partition.

More precisely, for any function g : R^d → R, we estimate L_J(g)(x_J) for J ⊂ D via

L_J(g)(x_J) ≡ ∑_{k : 1 ≤ k_j ≤ k_j(x_j), j ∈ J} (1 / n_J(k)) ∑_{i : x_{i,J} ∈ N_J(k)} Δ^J_g(K, k; x_{i,\J}),     (D.1)

where the notation is as follows. The |J|-order finite difference Δ^J_g(K, k; x_{i,\J}) and the index vector k = (k_j : j ∈ J) of cell-k of the partition grid are the same as in Appendix B. Let {N_j(k) = (z_{k−1,j}, z_{k,j}] : k = 1, 2, . . . , K} denote a partition of the sample range of {x_{i,j} : i = 1, 2, . . . , n} as in Section 3. We partition the |J|-dimensional range of x_J into a grid of K^{|J|} rectangular cells obtained as the cross product of the individual one-dimensional partitions. Let N_J(k) denote cell-k of the grid, i.e., the Cartesian product of the intervals {(z_{k_j−1,j}, z_{k_j,j}] : j ∈ J}, and let n_J(k) denote the number of training observations that fall into N_J(k), so that the sum of n_J(k) over all K^{|J|} rectangles is ∑_{k : 1 ≤ k_j ≤ K, j ∈ J} n_J(k) = n. For each j ∈ D and each x, let k_j(x) denote the index of the X_j partition interval into which x falls, i.e., x ∈ N_j(k_j(x)).

The estimator f_{J,ALE} is obtained by substituting the estimator (D.1) for each term of the form L_u(g)(x_u) in (B.3). Eqs. (16) and (20) in Section 3 are special cases of f_{J,ALE} for |J| = 1 and |J| = 2.
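To make the mechanics concrete for the simplest case |J| = 1, the following R sketch computes the main-effect estimator from raw ingredients. This is our own condensed illustration rather than the ALEPlot implementation; the names ale_main and pred_fun, and the assumption that pred_fun(model, newdata) returns a numeric vector of predictions, are ours.

ale_main <- function(model, pred_fun, X, j, K = 40) {
  xj <- X[[j]]
  # quantile-based partition z_0 < z_1 < ... < z_K of the sample range of X_j
  z <- unique(quantile(xj, probs = seq(0, 1, length.out = K + 1), type = 1))
  K <- length(z) - 1
  # bin index k_j(x_{i,j}) of each observation (values at z_0 are assigned to bin 1)
  k <- pmin(pmax(findInterval(xj, z, left.open = TRUE), 1), K)
  # local effects: prediction at the upper minus prediction at the lower bin endpoint,
  # holding each observation's other predictors fixed
  X.hi <- X; X.hi[[j]] <- z[k + 1]
  X.lo <- X; X.lo[[j]] <- z[k]
  d.i <- pred_fun(model, X.hi) - pred_fun(model, X.lo)
  delta <- tapply(d.i, factor(k, levels = 1:K), mean)   # average within each bin
  delta[is.na(delta)] <- 0                               # empty bins contribute nothing
  g <- c(0, cumsum(delta))                               # accumulated effect at z_0, ..., z_K
  # center as in (8): subtract the sample mean of the accumulated effect at the x_{i,j}
  f <- g - mean(g[k + 1])
  list(x = as.numeric(z), f = as.numeric(f))
}

For instance, if fit is a fitted model whose predict method returns numeric predictions (fit being a hypothetical name), one could call ale_main(fit, function(object, newdata) predict(object, newdata), X, j = 1).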

Appendix E Some Implementation Details

Handling categorical predictors. It is often desirable to visualize the effect of any categorical predictor X_j on f(·), and our package ALEPlot includes such functionality. Recall that for the estimation of the ALE effects (for example, in (15) and (17)), we need to take the difference of f(·) across neighboring values of X_j. It is therefore important to come up with a reasonable ordering of the levels of X_j. The main consideration for a “reasonable” ordering is that function extrapolation outside the data envelope should be avoided as much as possible when neighboring values of x_j are plugged into f(x_j, x_{\j}).

To accomplish this, we order the levels of X_j based on how dissimilar the sample values {x_{i,\j} : i = 1, 2, . . . , n} are across the levels of X_j. More specifically, we set the number of bins K to the number of nonempty levels of X_j. We then calculate a K × K dissimilarity matrix, the (k, l)th component of which accumulates (over the other predictors X_{j′} with j′ ∈ {1, 2, . . . , d}\j) the


distances (e.g., the Kolmogorov-Smirnov distance for continuous predictors) between the univariate conditional distributions of {x_{i,j′} : i = 1, 2, . . . , n; x_{i,j} = level-k} and {x_{i,j′} : i = 1, 2, . . . , n; x_{i,j} = level-l}. We then order the levels of X_j according to the ordering of the first coordinate from applying multidimensional scaling (MDS) to the dissimilarity matrix.

Fig. 13. Illustration of the method of ordering the levels of a categorical predictor X_j (X_1, having levels “A”, “B”, . . ., “H”) when calculating its ALE main effect. The left panel is a scatter plot of X_{\j} (X_2, numerical) vs. X_1 in the original alphabetical ordering, along with the conditional distributions p_{2|1}(x_2|x_1) for x_1 = B, D, E. The right panel is the same but after reordering the levels of X_1 using the MDS method. The horizontal black arrows in the left panel indicate where severe extrapolation is required if the levels are not reordered.

Figure 13 uses a toy example to illustrate the logic of this ordering and how it helps to avoid function extrapolation. Here, there are d = 2 predictors X_1 (categorical with levels “A”, “B”, . . ., “H”) and X_2 (numerical), and we are interested in the main effect of X_1. A scatterplot of X_2 vs. X_1, along with several conditional distributions p_{2|1}(x_2|x_1) for x_1 = B, D, E, are shown in the left panel, where the levels of X_1 are alphabetically ordered. If this ordering is kept, we must evaluate f(x_1 = E, x_{i,2}) − f(x_1 = D, x_{i,2}) for many observations i that require extrapolating f outside the effective range of p_{2|1}(x_2|x_1 = D) and p_{2|1}(x_2|x_1 = E) (as indicated by the dark horizontal arrows in Figure 13). In contrast, the right panel of Figure 13 shows the same data after ordering the levels of X_1 according to MDS as described above. In this case, much less function extrapolation is required when we evaluate f(x_1, x_{i,2}) − f(x_1′, x_{i,2}) for x_1 and x_1′ in neighboring levels after reordering. Our ALEPlot package uses the method of ordering depicted in Figure 13, except that for d > 2 the distance between distributions is summed over all d − 1 components of X_{\j}.
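The following R sketch condenses this ordering step for the case in which all of the other predictors are numeric (the function name order_levels is ours, j is assumed to be a column index of the data frame X, and the ALEPlot implementation differs in its details, e.g., in how it handles categorical X_{j′}):

# Order the levels of the categorical predictor X[[j]] so that adjacent levels have
# similar conditional distributions of the remaining (here: numeric) predictors.
order_levels <- function(X, j) {
  levs <- sort(unique(as.character(X[[j]])))   # nonempty levels of X_j
  K <- length(levs)
  D <- matrix(0, K, K)                         # K x K dissimilarity matrix
  for (jp in setdiff(seq_along(X), j)) {       # accumulate over the other predictors
    for (k in 1:(K - 1)) for (l in (k + 1):K) {
      xk <- X[[jp]][X[[j]] == levs[k]]
      xl <- X[[jp]][X[[j]] == levs[l]]
      grid <- sort(unique(c(xk, xl)))
      ks <- max(abs(ecdf(xk)(grid) - ecdf(xl)(grid)))   # Kolmogorov-Smirnov distance
      D[k, l] <- D[k, l] + ks
      D[l, k] <- D[k, l]
    }
  }
  # one-dimensional multidimensional scaling; order the levels by the first coordinate
  levs[order(cmdscale(D, k = 1)[, 1])]
}

The reordered levels can then be treated as an ordered discretization of X_j when accumulating the local effects.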

Vectorization. It is important to note that computation of f_{J,ALE}(x_J) is easily vectorizable. In R, for example, to produce the 2^{|J|} × n evaluations of f(x) in (D.1) (or in (16) or (20) for ALE main and second-order effects), we can construct a predictor variable array with 2^{|J|} × n rows and d columns and then call the predict command (which is available for most supervised learning models) a single time. This is typically orders of magnitude faster than using a for loop to call the predict command 2^{|J|} × n times. Similarly, the averaging and summation operations in (D.1), (16), or (20) can be vectorized without the need for a for loop. Our R package ALEPlot uses this vectorization.
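As a minimal sketch of this point for |J| = 1 (reusing the illustrative objects X, j, z, k, model, and pred_fun from the ale_main sketch in Appendix D above, all of which are our own, and assuming pred_fun returns predictions in row order):

n <- nrow(X)
X.hi <- X; X.hi[[j]] <- z[k + 1]     # each row's X_j moved to its upper bin endpoint
X.lo <- X; X.lo[[j]] <- z[k]         # and to its lower bin endpoint

# Vectorized: a single predict call on a 2 * n row data frame
p <- pred_fun(model, rbind(X.hi, X.lo))
d.vec <- p[1:n] - p[n + (1:n)]

# Equivalent, but typically orders of magnitude slower: 2 * n separate predict calls
d.loop <- numeric(n)
for (i in 1:n) {
  d.loop[i] <- pred_fun(model, X.hi[i, , drop = FALSE]) -
               pred_fun(model, X.lo[i, , drop = FALSE])
}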


Dealing with empty cells in second-order interaction effect ALE plots. For a second-order interaction effect ALE plot with J = {j, l}, we discretize the (x_j, x_l) sample space into K^2 rectangular cells by taking the cross product of the quantile-based discretizations of the sample spaces of X_j and X_l individually. If (X_j, X_l) are correlated, this will often result in some cells that contain few or no observations, as illustrated in Fig. 14. In this case, it may be useful to add scatter plot bullets to the second-order interaction effect ALE plots to indicate the bivariate training sample values {(x_{i,j}, x_{i,l}) : i = 1, 2, . . . , n}. This would help to identify empty cells (i.e., cells with n_{j,l}(k, m) = 0) and, more generally, to identify regions of the (x_j, x_l)-space in which f_{J,ALE}(x_J) may not be reliable due to lack of data in that region. This is not necessary for main effect ALE plots (J = j), because our quantile-based discretization always results in the same number n/K of observations in each region. In our ALEPlot package, rather than adding scatter plot bullets on the ALE second-order effect plot, we allow black rectangles to be plotted to indicate the empty cells in the chosen partition grid.
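For instance, a plot of this kind might be produced along the following lines (a sketch with assumed names, not the ALEPlot plotting code itself: z.j and z.l are the K + 1 bin endpoints for X_j and X_l, f.hat is the K × K matrix of estimated second-order ALE values, and n.cell is the K × K matrix of cell counts):

image(z.j, z.l, f.hat, xlab = "x_j", ylab = "x_l")    # heat map of the estimated effect
empty <- which(n.cell == 0, arr.ind = TRUE)           # indices (k, m) of empty cells
rect(z.j[empty[, 1]], z.l[empty[, 2]],                # lower-left corners of empty cells
     z.j[empty[, 1] + 1], z.l[empty[, 2] + 1],        # upper-right corners
     col = "black")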

Denote the average local effect associated with any cell (k, m) (i.e., the summand in (17)) by

Δ(k, m) ≡ (1 / n_{j,l}(k, m)) ∑_{i : x_{i,{j,l}} ∈ N_{j,l}(k, m)} Δ^{\{j,l\}}_f(K, k, m; x_{i,\{j,l\}}),     (E.1)

where, for notational simplicity, we have omitted the dependence of Δ(k, m) on {j, l} and K. Note that the quantity in (E.1) is not defined for any cell (k, m) that is empty. On the surface, this may appear to cause a problem when calculating h_{{j,l},ALE}(x_j, x_l) via (17), since the outer two summations in (17) are over a rectangular array of cells. If an empty cell is within this array, we need to replace (E.1) by some appropriate value in order to allow the outer two summations in (17) to be calculated.

This problem of empty cells is fundamentally different from the extrapolation problem illustrated in Figure 1(a). If J ⊆ {1, 2, . . . , d} denotes the predictor indices, the former involves extrapolation in X_J, whereas the latter involves extrapolation in X_{\J}. For the latter, ALE plots themselves are an effective solution, as they are designed to avoid the extrapolation. However, for the former, there is no way around substituting some value for (E.1) for empty cells. For this, we recommend the following strategy, which we use in our ALEPlot package. For any empty cell, we replace (E.1) by the average of the Δ(k, m) values for the M nearest non-empty cells, weighted by the number of training observations in each of these non-empty cells. We take M = min{10, M_{0.1}}, where M_{0.1} is the smallest number of nearest non-empty cells that together contain at least 10 percent of the training observations.
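A condensed sketch of this imputation step is given below. It is our own illustration rather than the ALEPlot internals: Delta is assumed to be the K × K matrix of (E.1) values with NA in empty cells, n.cell the K × K matrix of cell counts, and "nearest" is measured here by Euclidean distance in cell-index units, which is an assumption.

fill_empty_cells <- function(Delta, n.cell) {
  empty    <- which(is.na(Delta), arr.ind = TRUE)    # empty cells (k*, m*)
  nonempty <- which(!is.na(Delta), arr.ind = TRUE)
  for (r in seq_len(nrow(empty))) {
    # distances from this empty cell to all non-empty cells, in cell-index units
    dd  <- sqrt((nonempty[, 1] - empty[r, 1])^2 + (nonempty[, 2] - empty[r, 2])^2)
    ord <- order(dd)
    counts <- n.cell[nonempty[ord, , drop = FALSE]]
    M0.1 <- which(cumsum(counts) >= 0.1 * sum(n.cell))[1]  # smallest M covering 10% of the data
    M    <- min(10, M0.1)
    near <- nonempty[ord[1:M], , drop = FALSE]
    # count-weighted average of the M nearest non-empty Delta(k, m) values
    Delta[empty[r, 1], empty[r, 2]] <- sum(Delta[near] * n.cell[near]) / sum(n.cell[near])
  }
  Delta
}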

The Δ(k, m) value that we substitute for empty cells is somewhat arbitrary, and substituting a different value will alter f_{{j,l},ALE}(x_j, x_l) in general. However, for empty cells that lie outside the convex hull of the bivariate training values {(x_{i,j}, x_{i,l}) : i = 1, . . . , n} (which is where empty cells are more likely to occur, as illustrated in Fig. 14), the following theorem implies that the Δ(k, m) values that we choose do not alter f_{{j,l},ALE}(x_j, x_l) where it matters, which is in cells that are not empty.

Theorem 5. Consider a supervised learning model f(x_1, x_2, . . . , x_d) and two predictors of interest X_j and X_l. Let C denote the convex hull of the bivariate training values {(x_{i,j}, x_{i,l}) : i = 1, . . . , n} and (k∗, m∗) denote the index of an empty cell outside C in a given partition of the (X_j, X_l) sample space into a 2-D grid (see Fig. 14). Suppose that we substitute some value Δ(k∗, m∗) for the quantity in (E.1) for the empty cell when calculating the functions h_{{j,l},ALE}, g_{{j,l},ALE}, and f_{{j,l},ALE} via (17), (19), and (20). We claim that the value chosen for Δ(k∗, m∗) is arbitrary and does not affect f_{{j,l},ALE} inside C (i.e., in the region for which we have data), in the sense that changing Δ(k∗, m∗) to Δ(k∗, m∗) + δ for any constant δ has no effect on f_{{j,l},ALE}(x_j, x_l) for (x_j, x_l) ∈ C.

Proof. To avoid obscuring the simple logic behind this claim with tedious arguments and cumbersome notation, we present only a sketch of the proof, and only for the case in which X_j and X_l


are negatively correlated and the sample data are as shown in Figure 14. Denote the lower-left corner of the empty cell (k∗, m∗) by (a, b), as depicted in Figure 14. Let h_{{j,l},ALE}, g_{{j,l},ALE}, and f_{{j,l},ALE} denote the functions computed in (17), (19), and (20) when we use the value Δ(k∗, m∗) for (E.1) for the empty cell; let h̃_{{j,l},ALE}, g̃_{{j,l},ALE}, and f̃_{{j,l},ALE} denote the corresponding functions computed when we use the value Δ(k∗, m∗) + δ instead of Δ(k∗, m∗); and let h_δ = h̃_{{j,l},ALE} − h_{{j,l},ALE}, g_δ = g̃_{{j,l},ALE} − g_{{j,l},ALE}, and f_δ = f̃_{{j,l},ALE} − f_{{j,l},ALE} denote their differences.

Fig. 14. Illustration of the ideas behind the proof of Theorem 5. All panels except the top-right show sample data for two negatively correlated predictors (X_j, X_l), the approximate convex hull C (shown as an ellipse) of the bivariate training sample, and the given partition of the space into a 2-D grid. The shaded rectangle with lower-left corner at (a, b) is an empty cell (indexed by (k∗, m∗)) that lies outside C. The heavy horizontal and vertical lines passing through the point (a, b) divide the grid into four quadrants. The top-left panel shows the values of h_δ in each quadrant. The top-right panel plots L_j(h_δ) and L_l(h_δ) as functions of x_j and x_l, respectively. The bottom-left and bottom-right panels show the values of g_δ and f_δ, respectively, in each quadrant.

The goal is to show that f_δ(x_j, x_l) = 0 for (x_j, x_l) ∈ C. Beginning with the uncentered ALE effect functions h_{{j,l},ALE} and h̃_{{j,l},ALE}, adding δ for cell (k∗, m∗) will only affect the accumulation of local effects in (17) in cells that are above and to the right of (k∗, m∗). Hence

h_δ(x_j, x_l) = { δ, if x_j > a and x_l > b; 0, otherwise } = 0 + δ · I(x_j > a, x_l > b),     (E.2)


which is shown in the top-left panel of Figure 14. To relate g_{{j,l},ALE} and g̃_{{j,l},ALE}, using the operator notation defined in Appendix D, (19) becomes

g̃_{{j,l},ALE} ≡ h̃_{{j,l},ALE} − L_j(h̃_{{j,l},ALE}) − L_l(h̃_{{j,l},ALE})
  = [h_{{j,l},ALE} − L_j(h_{{j,l},ALE}) − L_l(h_{{j,l},ALE})] + [h_δ − L_j(h_δ) − L_l(h_δ)]
  = g_{{j,l},ALE} + [h_δ − L_j(h_δ) − L_l(h_δ)]
  = g_{{j,l},ALE} + g_δ.     (E.3)

From (E.2), the uncentered main effects of h_δ are L_j(h_δ)(x_j) = 0 + δ · I(x_j > a) and L_l(h_δ)(x_l) = 0 + δ · I(x_l > b), which are the step functions shown in the top-right panel of Figure 14. Using this in (E.3) gives

g_δ(x_j, x_l) = h_δ(x_j, x_l) − L_j(h_δ)(x_j, x_l) − L_l(h_δ)(x_j, x_l) = −δ · I(x_j > a or x_l > b),     (E.4)

which is plotted in the bottom-left panel of Figure 14. Finally, using (E.4), the centered second-order ALE effect from (20) becomes

f̃_{{j,l},ALE}(x_j, x_l) ≡ g̃_{{j,l},ALE}(x_j, x_l) − L_∅(g̃_{{j,l},ALE})(x_j, x_l)
  = g_{{j,l},ALE}(x_j, x_l) + g_δ(x_j, x_l) − L_∅(g_{{j,l},ALE} + g_δ)(x_j, x_l)
  = g_{{j,l},ALE}(x_j, x_l) − L_∅(g_{{j,l},ALE})(x_j, x_l) + g_δ(x_j, x_l) − L_∅(g_δ)(x_j, x_l)
  = f_{{j,l},ALE}(x_j, x_l) − δ · I(x_j > a or x_l > b) − (−δ)
  = f_{{j,l},ALE}(x_j, x_l) + δ · I(x_j ≤ a, x_l ≤ b)
  = f_{{j,l},ALE}(x_j, x_l) + f_δ(x_j, x_l).

Here the step to the fourth line uses (E.4) together with L_∅(g_δ) = −δ, which holds because every training observation lies inside C and hence satisfies x_{i,j} > a or x_{i,l} > b; the step to the fifth line uses the identity −δ · I(x_j > a or x_l > b) + δ = δ · I(x_j ≤ a, x_l ≤ b). Hence, f_δ(x_j, x_l) = δ · I(x_j ≤ a, x_l ≤ b) = 0 for (x_j, x_l) ∈ C, which proves the claim.