
Série des Documents de Travail

n° 2017-72 Nonparametric imputation by data depth

P. MOZHAROVSKYI 1, J. JOSSE 2, F. HUSSON 3

Working papers do not reflect the position of CREST but only the views of the authors.

1 CREST-ENSAI, Université Bretagne Loire, e-mail: [email protected]
2 École Polytechnique, CMAP, e-mail: [email protected]
3 IRMAR, Applied Mathematics Unit, Agrocampus Ouest, e-mail: [email protected]


Nonparametric imputation by data depth

Pavlo Mozharovskyi∗

CREST, Ensai, Université Bretagne Loire

and

Julie Josse
École Polytechnique, CMAP

and

François Husson
IRMAR, Applied Mathematics Unit, Agrocampus Ouest, Rennes

December 14, 2017

Abstract

The presented methodology for single imputation of missing values borrows the idea of data depth, a measure of centrality defined for an arbitrary point of the space with respect to a probability distribution or a data cloud. It consists in iterative maximization of the depth of each observation with missing values, and can be employed with any properly defined statistical depth function. On each single iteration, imputation narrows down to the optimization of a quadratic, linear, or quasiconcave function, solved analytically, by linear programming, or by the Nelder-Mead method, respectively. Being able to grasp the underlying data topology, the procedure is distribution free, allows imputation close to the data, preserves prediction possibilities, different to local imputation methods (k-nearest neighbors, random forest), and has attractive robustness and asymptotic properties under elliptical symmetry. It is shown that its particular case, when using the Mahalanobis depth, has a direct connection to well-known treatments for the multivariate normal model, such as iterated regression or regularized PCA. The methodology is extended to multiple imputation for data stemming from an elliptically symmetric distribution. Simulation and real data studies positively contrast the procedure with existing popular alternatives. The method has been implemented as an R-package.

Keywords: Elliptical symmetry, Outliers, Tukey depth, Zonoid depth, Nonparametric imputation, Convex optimization.

∗The major part of this project was conducted during the postdoc of Pavlo Mozharovskyi at Agrocampus Ouest (Rennes), granted by the Centre Henri Lebesgue under the program PIA-ANR-11-LABX-0020-01.


1 Introduction

Following the seminal idea of Tukey (1974), the concept of data depth has developed into a powerful statistical methodology allowing description of data w.r.t. location, scale, and shape based on a multivariate ordering. Today, it finds numerous applications in multivariate data analysis (Liu et al., 1999), statistical quality control (Liu and Singh, 1993), classification (Jornsten, 2004, Lange et al., 2014), multivariate risk measurement (Cascos and Molchanov, 2007), robust linear programming (Bazovkin and Mosler, 2015), etc. Being able to exploit topological properties of the data in a nonparametric way, the statistical depth function further proves to be suited for imputing missing values while being connected to the state-of-the-art methods. We start with a brief background on imputation, followed by the proposal.

1.1 Background on missing values

The problem of missing values has existed since the earliest attempts to exploit data as a source of knowledge, as it lies intrinsically in the process of obtaining, recording, and preparing the data itself. To exploit all the information present in the data set, a statistical method may be adapted to missing values, but this requires a dedicated development for each estimator and inference of interest. A more universal way is to impute missing data first, and then apply the statistical method of interest to the completed data set (Little and Rubin, 2002). Lastly, multiple imputation has gained a lot of attention: for a data set containing missing values, a number of completed data sets is generated reflecting the uncertainty of the imputation process, which enables not only estimating the parameter of interest but also drawing an inference on it (Rubin, 1996).

Development of many statistical methods started with the natural normality assumption, and imputation is not an exception here. For a multivariate normal distribution, single imputation of the observations containing missing values can be performed by imputing the missing values with the conditional mean, where conditioning is w.r.t. the observed values. This procedure makes use of the expectation-maximization (EM) algorithm (Dempster et al., 1977) to estimate the mean and covariance matrix; see also Little and Rubin (2002). By that, imputation inherits sensitivity to outliers and to a (near) low-rank covariance matrix. To deal with these problems, numerous extensions of the EM framework have been developed. Another way is to directly assume a low rank of the underlying covariance, which is done by the methods based on principal component analysis (PCA), closely connected to the matrix completion literature, see Josse and Husson (2012), Hastie et al. (2015). These methods aim at denoising the data and suppose a special structure, knowledge (or estimation) of the rank of the data and of the shape of outliers, and the noise to be normal. Both groups of methods impute on low-dimensional affine subspaces and are thereby sensitive to even slight deviations from normality (or ellipticity in general) and ignore the geometry of the data. Extension of single imputation methods to more general densities consists in such nonparametric techniques as k-nearest neighbors (kNN) (see Troyanskaya et al., 2001, and references therein) and random forest (Stekhoven and Buhlmann, 2012) imputation. When properly tuned, these methods capture the locality of the data, but fail to extrapolate and thus can be inappropriate under the missing at random mechanism (MAR, Seaman et al., 2013), i.e. they exhibit in some sense superfluously local behavior.

2

Page 4: n° 2017-72 Nonparametric imputation by data depth P. …crest.science/RePEc/wpstorage/2017-72.pdf · 2018-01-02 · predict missing values, while random forest imputation performs


Figure 1: Bivariate normal distribution with 30% MCAR (left) and with MAR in the second coordinate for values > 3.5 (right); imputation using maximum zonoid depth (red), conditional mean imputation using EM estimates (blue), and random forest imputation (green).

1.2 Proposal

The current article fills the existing gap at the level of single imputation for general elliptically symmetric distribution families. We suggest a nonparametric framework that is able to grasp the data topology and, to some degree, even reflect finite-sample deviations from ellipticity. For the purpose of single imputation of missing values, we resort to the idea of the statistical depth function D(x|X), a measure of centrality of an arbitrary point x ∈ R^d w.r.t. a d-variate random vector X. Given such a measure, we propose to maximize it for a point having missing values, conditioned on its observed values. For each point containing missing values, the procedure is repeated iteratively to achieve stability of the solution. In this framework, the properties of the imputation are to a great extent defined by the chosen notion of data depth function (see Section 2). A nonparametric data depth allows for imputation close to the data geometry, still accounting for global features due to the center anchoring. This makes a distinction w.r.t. fully local imputation methods like kNN or random forest imputation. In addition, data depth allows for robust imputation both in the sense of outliers (i.e. disregarding outliers when imputing points closer to the center) and of the distribution (i.e. not masking outliers). Since it does not rely on a covariance matrix, depth-based imputation avoids the problems connected with its estimation.

We employ three depth functions: the Mahalanobis depth, which is a natural extension of the Mahalanobis distance and is included because of its connections to standard imputation frameworks (see Section 4); the zonoid depth, which imputes by the average of the maximal number of equally weighted observations; and the Tukey depth, which maximizes the infimum of the portion of points contained in a closed halfspace including the imputed point. For a well-defined imputation, these depth notions require two, one, and zero first moments of the underlying probability measure, respectively. We highlight the most important properties of the depth-based imputation below by means of examples.

Consider Figure 1 (left), where 150 points stemming from a bivariate normal distribution having mean µ1 = (1, 1)′ and covariance Σ1 = ((1, 1)′, (1, 4)′) are plotted, with 30% of the entries having been removed in both variables due to the missing completely at random (MCAR) mechanism.



Figure 2: Mixture of bivariate normal (425 points, 15% MCAR) and Cauchy (75 points) samples (left) and 1000 bivariate Cauchy distributed points with 15% MCAR (right). Imputation using maximum Tukey depth (red) and conditional mean imputation using EM estimates (blue).

Points with one missing entry are denoted by dotted lines. EM-based imputation is depicted in blue. The imputed points are conditional means and lie exactly on the estimated regression lines; their distribution versions are plotted as the two lines. Zonoid depth-based imputation, pictured in red, reflects the idea that the sample may not necessarily be normal, with this unsureness more stressed on the fringe of the data cloud, where imputed points deviate from the conditional mean towards the unconditional one. Indeed, away from the data center, imputation by data depth is closer to nonparametric imputation, which accounts for the local nature of the data. On the other hand, different to these methods, depth imputation preserves prediction ability. In Figure 1 (right), for the same sample, the first coordinate has been removed for the 14 points having a value > 3.5 in the second coordinate (MAR mechanism). The depth-based imputation manages to extrapolate to predict the missing values, while random forest imputation performs, as expected, rather poorly.

Being robust both in the sense of outliers and of heavy tails concerns the two following cases. First, if the data are polluted by outliers but coordinates are missing for a representative point, the outliers should not alter the imputation. In Figure 2 (left) we plot, zoomed in, 500 points: 425 of them stem from the same normal distribution as above, with 15% of entries MCAR; another 75 are outliers drawn from the Cauchy distribution with the same center and shape matrix and having no missing values. As expected, conditional mean imputation based on EM estimates (depicted in blue) imputes rather randomly. Depth-based imputation using the Tukey depth imputes in a robust way, close to the (distribution) regression lines, reflecting the geometry of the data. Second, if missing values belong to the polluting distribution (or if the entire data generating process is heavy-tailed), the imputation should reflect this heavy-tailedness so as not to mask a possible outlier. For 1000 points from the Cauchy distribution having 15% of values MCAR, imputation by Tukey depth and the EM-based one (for comparison) are shown in Figure 2 (right). The depth-based imputation respects the general ellipticity and imputes close to the distribution regression lines.

The rest of the paper is organized as follows. In Section 2 we state some important definitions concerning data depth and establish the notation. Section 3 presents the proposed methodology of the depth-based imputation, regards its theoretical properties, and suggests optimization techniques. Section 4 is devoted to the theoretical investigation of the special case of the Mahalanobis depth, bridging the proposed imputation technique with regression and PCA imputation. Section 5 provides a comparative simulation and real data study. Section 6 extends the proposed approach to multiple imputation. Section 7 gathers some useful remarks.

2 Notation and background on data depth

For a data set in R^d (an n × d matrix) X = {x_1, ..., x_n} = (X_obs, X_miss), we denote by X_obs and X_miss its observed and missing part, respectively, i.e. there exists an indicator n × d matrix M such that X = X_obs · (1_{n×d} − M) + X_miss · M in the sense of elementwise multiplication. For a point x ∈ R^d, we denote by miss(x) and obs(x) the sets of its coordinates containing missing, respectively observed, values, and by |miss(x)| and |obs(x)| the corresponding cardinalities, with |miss(x)| + |obs(x)| = d. Following the same logic, we write miss(i) and obs(i) for a point x_i ∈ X, or even just miss and obs if no confusion arises.

Under missing completely at random (MCAR) we understand the mechanism where each value of X has the same probability of being missing.

Being at the core of the theoretical results derived in Section 3, the elliptical distribution is defined as follows (see Fang et al. (1990), and Liu and Singh (1993) in the data depth context).

Definition 1. A random vector X ∈ R^d is distributed elliptically symmetric with strictly decreasing density, or equivalently as E(µ_X, Σ_X, F), if X is equal in distribution to µ_X + RΛU, with R ∈ R_+ being a nonnegative random variable stemming from F and possessing a strictly decreasing density, U uniformly distributed on S^{d−1}, µ_X ∈ R^d, and Σ_X = ΛΛ′.

Below we briefly state definitions covering the necessary material on data depth. Following the pioneering idea of Tukey (1974), a statistical data depth function is a mapping

R^d × M → [0, 1] : (x, P) ↦ D(x|P),

with M being a subset of M_0, the set of all probability measures on (R^d, B). It measures the closeness of x to the center of P. Further, for a given point x and a random vector X coming from the probability distribution P ∈ M, we denote the depth by D(x|X); its empirical version, computed w.r.t. a sample drawn from P, is denoted in the same way. Zuo and Serfling (2000) developed an axiomatic framework for depth, which requires a proper depth function to be affine invariant, maximal at P's center of symmetry, non-increasing on any ray starting from a deepest point, and to vanish at infinity. Additionally one may require quasiconcavity, a restriction which will prove to be useful computationally below. The upper-level set of the depth function at level α is called a depth(-trimmed) region D_α(X) = {x ∈ R^d : D(x|X) ≥ α}. For α ∈ [0, 1], depth regions form a family of nested set-valued statistics, which describes X w.r.t. location, scatter, and shape.

A number of depth notions have emerged during the last decades; below we give the definitions of the three used as a means of imputation in the present article.

Definition 2. The Mahalanobis (1936) depth of x ∈ R^d w.r.t. X is defined as

D^M(x|X) = (1 + (x − µ_X)′ Σ_X^{−1} (x − µ_X))^{−1},

where µ_X and Σ_X are location and shape parameters of X.


In the empirical version, µ_X and Σ_X are replaced by appropriate estimates of location and shape, which throughout the article are taken as the moment estimates

µ_X = (1/n) Σ_{i=1}^n x_i   and   Σ_X = (1/(n−1)) Σ_{i=1}^n (x_i − µ_X)(x_i − µ_X)′,

respectively.
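As a small illustration (our code, not part of the authors' R-package; the function name is our choice), the empirical Mahalanobis depth of a point w.r.t. a complete sample can be computed directly from these moment estimates:

# Empirical Mahalanobis depth of a point x w.r.t. the rows of a complete
# numeric matrix X, using the moment estimates of location and shape.
mahalanobis_depth <- function(x, X) {
  mu <- colMeans(X)                                # location estimate
  Sigma <- cov(X)                                  # shape estimate (denominator n - 1)
  d2 <- mahalanobis(x, center = mu, cov = Sigma)   # squared Mahalanobis distance
  1 / (1 + d2)
}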

Koshevoy and Mosler (1997) define a zonoid trimmed region, for α ∈ (0, 1], as

D^z_α(X) = { ∫_{R^d} x g(x) dP(x) : g : R^d → [0, 1/α] measurable and ∫_{R^d} g(x) dP(x) = 1 }

and for α = 0 as

D^z_0(X) = conv(supp(X)),

where supp(X) denotes the support of X and conv(A) denotes the smallest convex set containing A. Its empirical version can be defined as

D^{z(n)}_α(X) = { Σ_{i=1}^n λ_i x_i : Σ_{i=1}^n λ_i = 1, λ_i ≥ 0, αλ_i ≤ 1/n ∀ i ∈ {1, ..., n} }.

Definition 3. The zonoid depth of x w.r.t. X is defined as

D^z(x|X) = sup{α : x ∈ D^z_α(X)}   if x ∈ conv(supp(X)),   and 0 otherwise.

For a comprehensive reference on the zonoid depth the reader is referred to Mosler (2002). The zonoid depth tends to represent x as the average of the maximum number of equally weighted points, i.e. as a weighted mean, which opens a variety of connected methods including the entire class of weighted-mean depths, see Dyckerhoff and Mosler (2011).

Definition 4. Tukey (1974) depth of x w.r.t. X is defined as

DT (x|X) = inf{P (H) : H a closed halfspace, x ∈ H} .

In the empirical version, the probability is substituted by the portion of points of X lying in H. Exploiting solely the data geometry and optimizing over the indicator loss, the Tukey depth is fully nonparametric, highly robust, and does not require moment assumptions on X. For more information on the Tukey depth and the corresponding trimmed regions see Donoho and Gasko (1992) and Hallin et al. (2010).
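For illustration (our code, not the authors' implementation), the empirical Tukey depth can be approximated by minimizing the halfspace fraction over randomly drawn directions, in the spirit of the random-direction approximation of Dyckerhoff (2004) mentioned in Section 3.3; the number of directions n.dir is an arbitrary choice:

# Approximate empirical Tukey depth of a point x w.r.t. the rows of X by
# minimizing, over random directions u, the smaller of the two halfspace
# fractions determined by the hyperplane through x orthogonal to u.
tukey_depth_approx <- function(x, X, n.dir = 1000) {
  depth <- 1
  for (k in seq_len(n.dir)) {
    u <- rnorm(ncol(X))
    u <- u / sqrt(sum(u^2))                # direction on the unit sphere
    proj <- as.vector(X %*% u)             # projected sample
    xu <- sum(x * u)                       # projection of the point
    depth <- min(depth, mean(proj >= xu), mean(proj <= xu))
  }
  depth
}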

3 Imputation by depth maximization

3.1 Main idea

Let us start by regarding one of the simplest methods to impute missing values, the following iterative regression imputation scheme: (1) initialize the missing values arbitrarily, using mean imputation for instance; (2) impute the missing values in one variable by the values predicted by the regression model of this variable on the remaining variables taken as explanatory ones; (3) iterate through the variables containing missing values till convergence.
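For concreteness, a rough sketch of this iterative regression scheme in R (illustration only; names and defaults are our choices) could look as follows:

# Iterative regression imputation: mean initialization, then regress each
# incomplete variable on the others and replace its missing entries by the
# fitted values, looping until the imputed matrix stabilizes.
impute_iterative_regression <- function(X, max.iter = 50, eps = 1e-6) {
  M <- is.na(X)
  for (j in seq_len(ncol(X)))
    X[M[, j], j] <- mean(X[, j], na.rm = TRUE)               # (1) mean initialization
  for (it in seq_len(max.iter)) {
    X.old <- X
    for (j in which(colSums(M) > 0)) {                       # (2) regression imputation
      fit <- lm.fit(cbind(1, X[, -j, drop = FALSE]), X[, j])
      pred <- cbind(1, X[, -j, drop = FALSE]) %*% fit$coefficients
      X[M[, j], j] <- pred[M[, j]]
    }
    if (max(abs(X - X.old)) < eps) break                     # (3) iterate to convergence
  }
  X
}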

Most common imputation methods are based on similar approaches: Templ et al. (2011) use the same (robustified) iterative method, and Josse and Husson (2012) and Hastie et al. (2015) use iterative (thresholded) singular value decomposition to impute the missing entries, whereas Stekhoven and Buhlmann (2012) iteratively replace missing cells with values fitted by random forest. This scheme can even be traced back to the earliest solution for dealing with missing values, suggested by Healy and Westmacott (1956), and is at the root of EM-based algorithms (implemented e.g. by means of the SWEEP operator for normal data) to estimate parameters despite missing values (Little and Rubin, 2002).

In the procedure described above, at each step each point x_i having a missing value in coordinate j is imputed with y_{i,j}, the univariate conditional mean E[X_j | X_{{1,...,d}\{j}} = x_{i,{1,...,d}\{j}}], evaluated at the current estimates of µ_X and Σ_X. After convergence, each point x_i with missing values on miss(i) is imputed with the multivariate conditional mean

E[X_{miss(i)} | X_{obs(i)} = x_{i,obs(i)}] = µ_{X miss(i)} + Σ_{X miss(i),obs(i)} Σ_{X obs(i),obs(i)}^{−1} (x_{i,obs(i)} − µ_{X obs(i)}),    (1)

again evaluated at the current estimates of µ_X and Σ_X.

The last expression is the closed-form solution to

min_{z_miss(i) ∈ R^{|miss(i)|}, z_obs(i) = x_obs(i)}  d_Mah(z, µ_X | Σ_X)

with d_Mah(z, µ_X | Σ_X) = (z − µ_X)′ Σ_X^{−1} (z − µ_X) being the squared Mahalanobis distance from z to µ_X. Minimizing the Mahalanobis distance can be seen as maximizing a centrality measure, the Mahalanobis depth:

max_{z_miss(i) ∈ R^{|miss(i)|}, z_obs(i) = x_obs(i)}  D_Mah(z|X).

We generalize this principle to the notion of the statistical depth function. We propose a unified framework based on the statistical data depth function: having a sample X, impute a point x containing missing coordinates with the point y maximizing data depth conditioned on the observed values x_obs. This direct extension of the idea of conditional mean imputation to data depth can be expressed as

y = argmax_{z_miss ∈ R^{|miss|}, z_obs = x_obs}  D(z|X).    (2)

The use of equation (2) is limited to strictly quasiconcave, continuous, and nowhere vanishing depth notions, which is not the case for the intrinsically nonparametric depths best reflecting the data geometry. A solution to (2) can trivially be non-unique, as, fitting to the finite-sample data topology, the depth can be non-continuous. In addition, the value of the depth function may become zero immediately beyond the convex hull of the support of the distribution, which is just conv(X) for a finite sample. To circumvent these problems, we suggest to impute x having missing values with y:

y = ave( argmin_{u ∈ R^d, u_obs = x_obs} { ||u − v|| | v ∈ D_{α*}(X) } ),    (3)

where

α* = inf_{α ∈ (0,1)} { α | D_α(X) ∩ {z ∈ R^d | z_obs = x_obs} = ∅ }.    (4)

Given a sample X = (X_obs, X_miss) containing missing data, start with an arbitrary initialization of all missing values, say by imputing with the unconditional mean. Then, for each observation containing missing coordinates, impute them according to (3), and iterate. This imputation by iterative maximization of depth is summarized as Algorithm 1.


Algorithm 1 Single imputation
function IMPUTE.DEPTH.SINGLE(X)
  Y ← X
  µ ← µ_(obs)(X)                                      ▹ calculate mean ignoring missing values
  for i = 1 : n do
    if miss(i) ≠ ∅ then
      y_{i,miss(i)} ← µ_{miss(i)}                     ▹ impute with unconditional mean
  I ← 0
  repeat                                               ▹ iterate until convergence or maximal iteration
    I ← I + 1
    Z ← Y
    for i = 1 : n do
      if miss(i) ≠ ∅ then                              ▹ impute with maximum depth
        α* ← inf_{α ∈ (0,1)} { α | D_α(Y) ∩ {z ∈ R^d | z_obs = y_{i,obs(i)}} = ∅ }
        y_i ← ave( argmin_{u ∈ R^d, u_obs = y_{i,obs(i)}} { ||u − v|| | v ∈ D_{α*}(Y) } )
  until max_{i ∈ {1,...,n}, j ∈ {1,...,d}} |y_{i,j} − z_{i,j}| < ε or I = I_max
  return Y
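A schematic R version of this loop is sketched below (our code, not the authors' package). It simplifies steps (3)-(4): each incomplete row is moved to a depth-maximizing point found numerically with Nelder-Mead, the depth function being passed as an argument (e.g. one of the sketches above); for a single missing coordinate a one-dimensional optimizer would be preferable.

# Iterative single imputation by depth maximization (simplified sketch).
impute_depth_single <- function(X, depth.fun, max.iter = 25, eps = 1e-6) {
  M <- is.na(X)
  mu <- colMeans(X, na.rm = TRUE)
  for (j in seq_len(ncol(X))) X[M[, j], j] <- mu[j]    # start from the unconditional mean
  rows <- which(rowSums(M) > 0)
  for (it in seq_len(max.iter)) {
    X.old <- X
    for (i in rows) {
      miss <- which(M[i, ])
      obj <- function(z) {                              # negative depth as a function of
        y <- X[i, ]; y[miss] <- z                       # the missing coordinates only
        -depth.fun(y, X)
      }
      X[i, miss] <- optim(X[i, miss], obj, method = "Nelder-Mead")$par
    }
    if (max(abs(X - X.old)) < eps) break                # stopping criterion of Algorithm 1
  }
  X
}
# Example: imputed <- impute_depth_single(X.miss, mahalanobis_depth)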

After the stopping criterion has been reached, one can expect that, when employing the Tukey depth, for each point initially containing missing values it holds that

x_i = argmax_{z_obs = x_obs}  min_{u ∈ S^{d−1}}  |{k : x_k′ u ≥ z′ u, k ∈ {1, ..., n}}|.

So the imputation is performed due to a maximin principle based on criteria involving indicator functions, which implies robustness of the solution. When using the zonoid depth, each such x_i is imputed by the average of the maximum number of possibly most equally weighted points. W.r.t. X, this is a weighted-mean imputation, which has a connection to methods of local nature such as the kNN imputation, and allows for gaining additional insights into the data geometry by inspecting the optimal weights, constituting the Lagrange multipliers; see Section 3.3 for a detailed discussion. With the Mahalanobis depth, each x_i with missingness is imputed by the conditional mean (1) and thus lies in the single-output regression hyperplane of X_{·,j} on X_{·,{1,...,d}\{j}} for all j ∈ miss(i) and, in general, in the (d − |miss(i)|)-dimensional multiple-output regression subspace of X_{·,miss(i)} on X_{·,obs(i)}; such a subspace is obtained as the intersection of the single-output regression hyperplanes corresponding to the missing coordinates. Among others, this yields further properties and connections to the covariance determinant and to the regularized PCA imputation by Josse and Husson (2012), which is regarded in detail in Section 4.

3.2 Properties

Though being of a different nature, the suggested depth-based framework maintains some of the desirable properties of the EM-based imputation, since the imputation converges to the center of the conditional distribution in the settings specified in Theorems 1 and 2. Different to EM, we avoid the second and first moment assumptions due to the use of the zonoid and Tukey depths. Together with the Mahalanobis depth, the choice of depths becomes canonical in the sense that finiteness of the first two (Mahalanobis depth), one (zonoid depth), or no (Tukey depth) moments is required.


Theorem 1 shows that for any elliptical distribution, imputation of one point only convergesto the center of the conditional distribution when conditioning on the observed values.

Theorem 1 (One row consistency). Let X^(n) = (X^(n)_obs, x) be a sequence of data sets in R^d with d ≥ 2, consisting of a point x = (x_obs, x_miss) and of X^(n)_obs being n points sampled from an E(µ_X, Σ_X, F). Then for n → ∞, for the Tukey, zonoid (E possesses a finite 1st moment), and Mahalanobis (E possesses a finite 2nd moment) depths,

y_miss = µ_{X miss} + Σ_{X miss,obs} Σ_{X obs,obs}^{−1} (x_obs − µ_{X obs})

is a stationary point of Algorithm 1.

Theorem 1 is illustrated in Figure 3 for a bivariate sample stemming from the Cauchy distribution, where a point is imputed using the Tukey depth. The kernel density estimate of the imputed values over 10 000 repetitions resembles the Gaussian curve and approaches the target value being the center of the population conditional distribution.


Figure 3: Kernel density estimate (solid) and the best approximating Gaussian curve (dashed) over 10 000 repetitions of the imputation of a single point having one missing coordinate. Samples of size 100 (top, left), 200 (top, right), 500 (bottom, left), and 1000 (bottom, right) are drawn from the Cauchy distribution with location and scatter parameters µ1 and Σ1 from the introduction. The population conditional center given the observed value equals 3.

Theorem 2 states that if missing values constitute a portion of the sample but concentrate in a single variable, the imputed values converge to the center of the conditional distribution when conditioning on the observed values.


Theorem 2 (One column consistency). Let X^(n) = (X^(n)_obs, X^(n)_miss) be a sequence of data sets with X^(n)_miss only in coordinate j, following the MCAR mechanism, sampled from an E(µ_X, Σ_X, F). Then for n → ∞, for the Tukey, zonoid (E possesses a finite 1st moment), and Mahalanobis (E possesses a finite 2nd moment) depths, for all points having missing values

y_j = µ_{X j} + Σ_{X j,obs} Σ_{X obs,obs}^{−1} (x_obs − µ_{X obs})

with obs = {1, ..., d} \ {j} is a stationary point of Algorithm 1.

The statements of both theorems are valid for n → ∞. Nevertheless, for any ε > 0 there exists N such that for all n > N the ε-neighborhood of the corresponding limit from Theorems 1 and 2 contains a stationary point of Algorithm 1.

3.3 Optimization in a single iteration

Due to Algorithm 1, the only imputation step repeated iteratively consists in imputing a point x having missing coordinates miss(x) with y by maximizing its depth D(z|X) w.r.t. the completed data set X, conditioned on x_obs. Note that the function f(z_miss(x)) := D(z|X, z_obs(x) = x_obs) is quadratic for the Mahalanobis depth, continuous inside conv(X) for the zonoid depth, and stepwise discrete there for the Tukey depth, with a single solution to a convex maximization problem in all three cases. For a trivariate Gaussian sample, f(z_miss(x)) is depicted in Figure 4.

For the Mahalanobis depth, equation (2) has the closed form y_miss = µ_{X miss} + Σ_{X miss,obs} Σ_{X obs,obs}^{−1} (x_obs − µ_{X obs}). In case Σ_X is singular, one can work in the linear subspace spanned by the eigenvectors with positive eigenvalues. Defined this way, the Mahalanobis depth is sensitive to outliers, which can be compensated for by estimating µ_X and Σ_X in a robust way, using the minimum covariance determinant (MCD, see Rousseeuw and Van Driessen, 1999), say.
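In R, this closed-form step can be sketched as follows (our helper; mu and Sigma are any location and shape estimates of the completed data):

# Conditional-mean (maximum Mahalanobis depth) imputation of one row x
# containing NAs, given a location mu and a shape Sigma.
impute_mahalanobis_step <- function(x, mu, Sigma) {
  miss <- which(is.na(x))
  obs <- which(!is.na(x))
  x[miss] <- mu[miss] + Sigma[miss, obs, drop = FALSE] %*%
    solve(Sigma[obs, obs, drop = FALSE], x[obs] - mu[obs])
  x
}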

The zonoid depth is continuous inside the convex hull of the sample, and thus optimization w.r.t. (2) can be used directly. By construction, the zonoid depth can be computed as a linear program, see Mosler (2002) for details. To account for the missingness, a modification is necessary, which consists in removing the constraints corresponding to the point's missing coordinates:

min γ   s.t.   X_obs(x)′ λ = x_obs,    (5)
               λ′ 1_n = 1,
               γ 1_n − λ ≥ 0_n,
               λ ≥ 0_n.

Here X_obs(x) is the completed n × |obs(x)| data matrix containing only the columns corresponding to the non-missing coordinates of x, and 1_n (respectively 0_n) is a vector of ones (respectively zeros) of length n. Imputation is finally performed with the λ-weighted average

y_miss = X_miss(x)′ λ.

Investigating λ = (λ_1, ..., λ_n), the weights applied to the imputed point, may give additional insights concerning its positioning w.r.t. the sample. Thus, from (5) it is clear that, in case it is solvable, the number of positive weights |{i : λ_i > 0, i ∈ {1, ..., n}}| equals m + 1, 1 ≤ m + 1 ≤ n, where usually m of them are equal to the solution γ*, and one is ≥ 0 and ≤ γ*. This means that y_miss can be seen as the average of m points of the sample very slightly shifted by an (m + 1)st one. These m + 1 points constitute the part of the imputed sample "responsible" for the imputation of x_miss, and can be readily identified by their strictly positive weights λ_i.

Figure 4: A Gaussian sample consisting of 250 points and a hyperplane of two missing coordinates (top, left), and the function f(z_miss(x)) to be optimized on each single iteration of Algorithm 1, for the identified (smaller) rectangle, for the Tukey (top, right), zonoid (bottom, left), and Mahalanobis (bottom, right) depth.
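A minimal sketch of this step in R, assuming the lpSolve package (the authors' implementation may solve the program differently) and that x_obs lies inside the convex hull of the observed columns, could read:

# One zonoid-depth imputation step via the linear program (5).
# X is the completed data matrix, x the row to impute with NAs in miss(x).
library(lpSolve)
impute_zonoid_step <- function(x, X) {
  obs <- which(!is.na(x)); miss <- which(is.na(x)); n <- nrow(X)
  obj <- c(rep(0, n), 1)                                  # variables (lambda, gamma); minimize gamma
  A.eq <- cbind(t(X[, obs, drop = FALSE]), 0)             # X_obs(x)' lambda = x_obs
  A.sum <- c(rep(1, n), 0)                                # lambda' 1_n = 1
  A.gam <- cbind(-diag(n), 1)                             # gamma 1_n - lambda >= 0_n
  sol <- lp("min", obj, rbind(A.eq, A.sum, A.gam),
            c(rep("=", length(obs) + 1), rep(">=", n)),
            c(x[obs], 1, rep(0, n)))                      # lambda >= 0 holds by default
  lambda <- sol$solution[1:n]
  drop(t(X[, miss, drop = FALSE]) %*% lambda)             # lambda-weighted average for y_miss
}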

Due to the combinatorial nature of the discrete Tukey depth function, another approach is needed. We employ the Nelder-Mead downhill-simplex method, which is called 2d times with differing starting points; the results are averaged afterwards. Although this slightly deviates from (3), it works stably in practice and one can expect convergence to (3) for a continuous density. For the purity of the experiment, in the study of Section 5 we always compute the Tukey depth exactly following Dyckerhoff and Mozharovskyi (2016), although to avoid computational burden the approximation through random directions (Dyckerhoff, 2004) is recommended.

In the above paragraphs, we were considering the problem of solving (3) inside conv(X). The zonoid and Tukey depths equal 1/n on the boundary of conv(X) and 0 everywhere beyond it. While (3) deals with this situation, for a finite sample this means that points with missingness having a maximal value in at least one of the existing coordinates will never move from the initial imputation because they become vertices of conv(X). For a similar reason, other points to be imputed that lie exactly on the boundary of conv(X) will have suppressed dynamics. As such points are not numerous and thus would need quite a substantial deviation to influence the imputation quality, we impute them, during a few first iterations, using the spatial depth function (Vardi and Zhang, 2000), which is everywhere nonnegative. This resembles the idea of the so-called "outsider treatment" introduced by Lange et al. (2014). To make the spatial depth affine invariant, a covariance matrix is used, which is taken as the moment estimator for the zonoid depth and as the MCD estimator (with parameter 0.5) for the Tukey depth.

4 Special case: Mahalanobis depth

As announced in Section 3, the case of the Mahalanobis depth imputation is additionally interesting due to its relation to existing methods. In this section we consider two points: its connection to the minimization of the covariance determinant, and the correspondence of the final imputation to the iterative regression and the regularized PCA imputation. Proposition 1 gives a first insight into the change of the determinant of the sample covariance matrix by stating that it decreases when imputing a single point.

Proposition 1 (Imputation by conditional mean reduces the covariance determinant). Let X = {x_1, ..., x_n} be a sample in R^d with µ_X = 0 and invertible Σ_X. Further, for some k ∈ {1, ..., n}, let Y = {y_1, ..., y_n} with y_i = x_i for all i ∈ {1, ..., n} \ {k}, y_{k,miss(k)} = Σ_{miss(k),obs(k)} Σ_{obs(k),obs(k)}^{−1} x_{k,obs(k)}, and y_{k,obs(k)} = x_{k,obs(k)}, such that y_k ≠ x_k. Then |Σ_Y| < |Σ_X|.

Lemma 1 points out that the relationship in Proposition 1 is quadratic, and this fact is later used to prove point (2) of Theorem 3.

Lemma 1. Let X(y) = (x_1, ..., (x_{i,1}, ..., x_{i,|obs(i)|}, y′)′, ..., x_n)′ be an n × d matrix with Σ_X(y) invertible for all y ∈ R^{|miss(i)|}. Then |Σ_X(y)| is quadratic in y and is globally minimized at

y = µ_{X miss(i)}(y) + Σ_{X miss(i),obs(i)}(y) Σ_{X obs(i),obs(i)}^{−1}(y) ((x_{i,1}, ..., x_{i,|obs(i)|}) − µ_{X obs(i)}).

Lemma 1 brings an additional insight: it allows one to see the entire imputation (Algorithm 1 with the Mahalanobis depth) as a minimization of the covariance determinant. In this view, keeping all the points to be imputed but one fixed, the covariance determinant is a quadratic function of the missing coordinates of this one point, and thus is minimized at a single point only. Thus, imputing the points with missing coordinates one-by-one and iterating till convergence constitutes the block coordinate descent method, which numerically converges due to Proposition 2.7.1 of Bertsekas (1999) (as long as Σ_X is invertible).

Let X − µ_X = U Λ^{1/2} V′ be the singular value decomposition (SVD) of the centered X. Josse and Husson (2012) suggest the regularized PCA imputation where, after an initialization, each point x_i having missing values is imputed with y_i such that

y_{i,j} = Σ_{s=1}^{S} U_{i,s} (√λ_s − σ²/√λ_s) V_{j,s} + µ_{X j}

for all j ∈ miss(i) and y_{i,obs(i)} = x_{i,obs(i)}, with 1 ≤ S ≤ d and some 0 < σ² ≤ (1/(d−S)) Σ_{s=S+1}^{d} λ_s; the algorithm proceeds iteratively till convergence. This method has proved its high efficiency in practice due to sticking to the low-rank structure of importance and ignoring the noise. In addition, it can be seen as a truncated version of the iteratively applied extension of Stein's estimator by Efron and Morris (1972).

We consider here its special case when S = d and 0 < σ2 ≤ λd. Proposition 2 is theregularized PCA analog to Proposition 1.


Proposition 2 (Imputation by regularized PCA reduces the covariance determinant). Let X = {x_1, ..., x_n} be a sample in R^d with µ_X = 0 and invertible Σ_X. Further, for some k ∈ {1, ..., n}, let Y = {y_1, ..., y_n} with y_i = x_i for all i ∈ {1, ..., n} \ {k}, y_{k,l} = Σ_{s=1}^{d} U_{k,s} (√λ_s − σ²/√λ_s) V_{l,s} with 0 < σ² ≤ λ_d for all l ∈ miss(k), and y_{k,obs(k)} = x_{k,obs(k)}, such that y_k ≠ x_k. Then |Σ_Y| < |Σ_X|.

Below, we state the main result of this section: imputation using maximum Mahalanobis depth, iterative (multiple-output) regression, or regularized PCA with S = d equilibrate at the same imputed sample.

Theorem 3. Impute X = (X_miss, X_obs) in R^d with Y so that for each y_i ∈ Y with |miss(i)| > 0 it holds that y_{i,miss(i)} = argmax_{z_obs(i) = y_{i,obs(i)}} D_M(z|Y). Then for each such y_i it holds as well that:

• x_i is imputed with the conditional mean

  y_{miss(i)} = µ_{Y miss(i)} + Σ_{Y miss(i),obs(i)} Σ_{Y obs(i),obs(i)}^{−1} (x_{obs(i)} − µ_{Y obs(i)}),

  which is equivalent to single- and multiple-output regression;

• Y is a stationary point of |Σ_X(X_miss)|:

  ∂|Σ_X| / ∂X_miss (Y_miss) = 0;

• each missing coordinate j of x_i is imputed with the regularized PCA by Josse and Husson (2012) with any 0 < σ² ≤ λ_d:

  y_{i,j} = Σ_{s=1}^{d} U_{i,s} (√λ_s − σ²/√λ_s) V_{j,s} + µ_{Y j}.

The connection with the minimization of the covariance determinant expressed in the second point of Theorem 3 provides further insights. First, for a sample containing missing values, minimization of the covariance determinant corresponds to the minimization of the volume of the Mahalanobis depth lift mes(D_M(X)), defined as follows. Adding a real dimension to the central regions D_α(X) and multiplying them with their depth α, α ∈ [0, 1], gives the depth lift

D(X) = {(α, y) ∈ [0, 1] × R^d : y = αx, x ∈ D_α(X), α ∈ [0, 1]}.

The depth lift is a body in R^{d+1} that describes location and scatter of the distribution of X, and in general gives rise to an ordering of distributions in M (Mosler, 2013). Its specification for the Mahalanobis depth is D_M(X) = {(α, x) ∈ [0, 1] × R^d : (x − αµ_X)′ Σ_X^{−1} (x − αµ_X) ≤ α(1 − α)}, and it possesses the volume

mes(D_M(X)) = (π^{d/2} / Γ(d/2 + 1)) ∏_{s=1}^{⌊d/2⌋} ((2s + 2(d mod 2)) / (s + 1/2 + d mod 2)) (1 − (1 − π/8) I(d mod 2 ≠ 0)) √|Σ_X|.

The connection between the Mahalanobis depth lift volume and the covariance determinant suggests an extension to a general depth function by imputing X_miss with

argmin_{Y_miss ∈ R^{|X_miss|}, Y_obs = X_obs}  mes(D(Y)).    (6)


While being tractable for the Mahalanobis depth, (6) demands an enormous computational burden even for very moderate data sets, as a single evaluation of mes(D(Y)) amounts to

mes(D^{z(n)}(X)) = (1 / n^{d+1}) Σ_{{i_0,...,i_d} ⊂ {1,...,n}} |det((1, x_{i_0}′)′, ..., (1, x_{i_d}′)′)|

for the zonoid depth and to

mes(D^{T(n)}(X)) = Σ_{i=1}^{n α_max} ((i^{d+1} − (i − 1)^{d+1}) / ((d + 1) n^{d+1})) mes(D^{T(n)}_{i/n}(X))

for the Tukey depth, with α_max being the depth of the Tukey median w.r.t. the sample X. On the other hand, it constitutes a separate approach giving a solution different from the proposed one. For these reasons we leave it out of the scope of the present article.

5 Experimental study

5.1 Choice of competitors

The proposed methodology is represented by the three introduced imputation schemes based on the iterative maximization of the Tukey, zonoid, and Mahalanobis depths. Further, we include the same imputation scheme using the Mahalanobis depth based on MCD mean and covariance estimates, with the robustness parameter chosen in an optimal way due to knowledge of the simulation setting. Next, conditional mean imputation based on EM estimates of the mean and covariance matrix is taken. We also include two regularized PCA imputations assuming the data rank to be equal to 1 and to 2, respectively. Further, two nonparametric imputation methods are used, namely random forest with its default implementation in the R-package missForest and kNN imputation tuned as described in Algorithm 2 (Section 3) of Stekhoven and Buhlmann (2012), i.e. by choosing k from {1, ..., 15} minimizing the imputation error over 10 validation sets. Finally, for benchmarking purposes, mean and oracle (if possible) imputations are added.
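The kNN tuning just described can be sketched as follows (our code, assuming the kNN() imputer of the R-package VIM; the validation fraction p.val is an arbitrary choice): extra observed cells are masked, imputed, and the k minimizing the resulting error is retained.

# Choose k for kNN imputation by minimizing the imputation error on
# additionally masked cells, averaged over n.val validation sets.
library(VIM)
tune_knn_k <- function(X.miss, ks = 1:15, n.val = 10, p.val = 0.05) {
  err <- sapply(ks, function(k) {
    mean(replicate(n.val, {
      X.val <- as.matrix(X.miss)
      idx <- which(!is.na(X.val))
      hide <- sample(idx, ceiling(p.val * length(idx)))   # mask extra observed cells
      truth <- X.val[hide]
      X.val[hide] <- NA
      imp <- as.matrix(kNN(as.data.frame(X.val), k = k)[, seq_len(ncol(X.val))])
      sqrt(mean((imp[hide] - truth)^2))                   # RMSE on the masked cells
    }))
  })
  ks[which.min(err)]
}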

5.2 Simulated data

We start by exploring the MCAR mechanism applied to the family of elliptically symmetric Student-t distributions. Regarding Definition 1, we set the center µ2 = (1, 1, 1)′ and the shape Σ2 = ((1, 1, 1)′, (1, 4, 4)′, (1, 4, 8)′), and let F be the univariate Student-t distribution, ranging the number of degrees of freedom (d.f.) from the Gaussian to the Cauchy: t = ∞, 10, 5, 3, 2, 1. For each of the 1000 random simulations, we remove 25% of the values due to MCAR, and indicate the median and the median absolute deviation from the median (MAD, in parentheses) of the RMSE of each imputation method for a sample of size 100 points in Table 1. In each column of Table 1, we distinguish the best method in bold and the second best in italics, ignoring the oracle imputation. We set the robustness parameter of the MCD to 0.75, an optimal choice after several tries, also because imputed points concentrate in subspaces of lower dimension and this singularity hinders the execution of the MCD algorithm with lower parameter values already at the stage of initialization. The oracle imputes with the conditional mean using the population parameters µ2 and Σ2.

For the presented range of elliptical Student-t distributions, the behavior of the different imputation methods changes with the number of d.f., as does the general tendency of the leadership. For the Cauchy distribution, robust methods perform best: the Mahalanobis depth-based imputation using MCD estimates, closely followed by the one using the Tukey depth. For 2 d.f., when the first moment exists but the second does not, EM- and Tukey-depth-based imputation perform similarly, with a slight advantage of the Tukey depth in terms of the MAD. For larger numbers of d.f., when the two first moments exist, EM takes the leadership. It is followed by the group of the regularized PCA methods and the Mahalanobis- and zonoid-depth-based imputations.

Distr.        Gaussian         t 10             t 5              t 3              t 2              Cauchy
D_Tuk         1.675 (0.205)    1.871 (0.2445)   2.143 (0.3313)   2.636 (0.5775)   3.563 (1.09)     16.58 (13.71)
D_zon         1.609 (0.1893)   1.81 (0.2395)    2.089 (0.3331)   2.603 (0.5774)   3.73 (1.236)     19.48 (16.03)
D_Mah         1.613 (0.1851)   1.801 (0.2439)   2.079 (0.3306)   2.62 (0.5745)    3.738 (1.183)    19.64 (16.2)
D_Mah MCD.75  1.991 (0.291)    2.214 (0.3467)   2.462 (0.4323)   2.946 (0.6575)   3.989 (1.287)    16.03 (12.4)
EM            1.575 (0.1766)   1.755 (0.2379)   2.026 (0.3144)   2.516 (0.5537)   3.567 (1.146)    18.5 (15.46)
regPCA1       1.65 (0.1846)    1.836 (0.2512)   2.108 (0.3431)   2.593 (0.561)    3.692 (1.186)    18.22 (15.02)
regPCA2       1.613 (0.1856)   1.801 (0.2433)   2.08 (0.3307)    2.619 (0.5741)   3.738 (1.19)     19.61 (16.1)
kNN           1.732 (0.2066)   1.923 (0.2647)   2.235 (0.3812)   2.757 (0.5874)   3.798 (1.133)    17.59 (14.59)
RF            1.763 (0.2101)   1.96 (0.2759)    2.259 (0.3656)   2.79 (0.5856)    3.849 (1.19)     17.48 (14.33)
mean          2.053 (0.2345)   2.292 (0.2936)   2.612 (0.3896)   3.165 (0.6042)   4.341 (1.252)    20.32 (16.36)
oracle        1.536 (0.1772)   1.703 (0.2206)   1.949 (0.3044)   2.384 (0.5214)   3.175 (0.9555)   13.55 (10.71)

Table 1: Median and MAD (in parentheses) of the RMSE of the imputation for a sample of 100 points drawn from the family of elliptically symmetric Student-t distributions with parameters µ2 and Σ2, having 25% of missing values due to MCAR, over 1000 repetitions.

Please note that the Mahalanobis-depth and the regularized PCA imputation with the two-dimensional low-rank model perform in the same way (the tiny difference can be explained by the precision constant), see Theorem 3. Both nonparametric imputation methods perform rather poorly, being "unaware" of the ellipticity of the underlying distribution, and deliver (to a certain extent) reasonable results only for the case of the Cauchy distribution, where partial insensibility to the correlation between the variables can be seen as an advantage.

In the second simulation, we modify the above setting by adding 15% of outliers that stem from the Cauchy distribution with the same parameters µ2 and Σ2. The portion of missing values is kept at the same level of 25%, but the outlying observations do not contain missing values. The parameter of the MCD algorithm for the robust Mahalanobis-depth-based imputation is set to 0.85, i.e. exactly corresponding to the portion of non-contaminated data. The corresponding medians and MADs of the RMSE over 1000 repetitions are indicated in Table 2.

As expected, the best RMSEs are obtained by the robust imputation methods: Tukey depth and Mahalanobis depth with MCD estimates. Being restricted to a neighborhood, the nonparametric methods often impute based on non-outlying points, and thus deliver the second-best imputation group. The rest of the included imputation methods do not resist the pollution of the data and perform rather poorly.

Further, we explore the performance of the proposed methodology in a MAR setting. For this, first, we generate highly correlated Gaussian data by setting the mean to µ3 = (1, 1, 1)′ and the covariance matrix to Σ3 = ((1, 1.75, 2)′, (1.75, 4, 4)′, (2, 4, 8)′).

Distr.        Gaussian         t 10             t 5              t 3              t 2              Cauchy
D_Tuk         1.751 (0.2317)   1.942 (0.2976)   2.178 (0.3556)   2.635 (0.6029)   3.763 (1.17)     17.17 (13.27)
D_zon         1.86 (0.3181)    2.087 (0.4295)   2.333 (0.4924)   2.864 (0.7819)   4.082 (1.535)    20.43 (15.99)
D_Mah         1.945 (0.4299)   2.165 (0.5473)   2.421 (0.6026)   2.935 (0.8393)   4.136 (1.501)    20.27 (15.91)
D_Mah MCD.85  1.81 (0.239)     2.022 (0.3128)   2.231 (0.381)    2.664 (0.5877)   3.783 (1.224)    16.46 (12.94)
EM            1.896 (0.3987)   2.112 (0.5226)   2.376 (0.5715)   2.828 (0.7773)   4.036 (1.518)    19.01 (15.21)
regPCA1       1.958 (0.4495)   2.196 (0.5729)   2.398 (0.6035)   2.916 (0.8221)   4.09 (1.585)     19.81 (16.15)
regPCA2       1.945 (0.4328)   2.165 (0.5479)   2.421 (0.5985)   2.93 (0.8384)    4.14 (1.503)     20.53 (16.28)
kNN           1.859 (0.2602)   2.051 (0.3143)   2.315 (0.3809)   2.797 (0.6045)   3.955 (1.265)    18.96 (14.73)
RF            1.86 (0.2332)    2.047 (0.3043)   2.325 (0.3946)   2.838 (0.6228)   4.026 (1.354)    19.04 (14.62)
mean          2.23 (0.3304)    2.48 (0.4163)    2.766 (0.528)    3.34 (0.7721)    4.623 (1.561)    21.04 (15.56)
oracle        1.563 (0.1849)   1.733 (0.2266)   1.939 (0.2979)   2.356 (0.4946)   3.323 (1.04)     14.44 (11.33)

Table 2: Median and MAD (in parentheses) of the RMSE of the imputation for a sample of 100 points drawn from the family of elliptically symmetric Student-t distributions with parameters µ2 and Σ2, contaminated with 15% of outliers, having 25% of missing values due to MCAR on the non-contaminated data, over 1000 repetitions.

Second, we introduce missing values depending on the existing values according to the following scheme: the first variable has a missing value with probability 0.08 if the second variable is higher than the population mean, and with probability 0.7 if the second variable is lower than the population mean; for the third variable the corresponding probabilities constitute 0.48 and 0.24. This mechanism leads to a highly asymmetric pattern of 24% MAR values. The boxplots of the RMSEs of the considered imputation methods over 1000 repetitions are indicated in Figure 5.
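In R, this MAR mechanism can be sketched as follows (our code; mu2 denotes the population mean of the second variable, here equal to 1):

# Introduce MAR values into a three-column matrix X: missingness in variables
# 1 and 3 depends on whether variable 2 lies above or below its population mean.
make_mar <- function(X, mu2 = 1) {
  high <- X[, 2] > mu2
  X[runif(nrow(X)) < ifelse(high, 0.08, 0.7), 1] <- NA     # variable 1
  X[runif(nrow(X)) < ifelse(high, 0.48, 0.24), 3] <- NA    # variable 3
  X
}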

According to our expectations, the semi-parametric methods (EM- and regularized-PCA-based imputation, and thus the Mahalanobis-depth-based imputation as well) perform well and close to the oracle imputation. The better performance when considering the one-dimensional low-rank model for the regularized PCA can be explained by the high correlation. Though having no parametric knowledge, the zonoid-depth-based imputation also performs satisfactorily. Being unable to sufficiently capture the correlation, the nonparametric methods perform poorly. The even worse performance of the robust methods is explained by the fact of "throwing away" points that possibly contain valuable information.

Finally, we consider an extremely contaminated low-rank model. Namely, we fix a two-dimensional low-rank structure and add Cauchy-distributed noise. Then, we remove 20% of the values according to the MCAR mechanism. The resulting medians and MADs of the RMSE over 1000 repetitions are indicated in Table 3. We exclude the oracle imputation, as it is supposed to be always correct and thus cannot serve as a benchmark.


Figure 5: Root mean square errors for different imputation methods for a correlated three-dimensional normal sample with parameters µ3 and Σ3 of 100 points with missing values according to MAR, over 1000 repetitions.

This model can be seen as a stress-test of the regarded imputation methods. In general, as capturing any structure is rather meaningless in this setting (which is additionally confirmed by the high MADs), the performance of the methods is "proportional to the way they ignore" dependency information. For this reason, mean imputation as well as the nonparametric methods perform best. On the other hand, one should note that, accounting only for fundamental features of the data, the Tukey-depth- and zonoid-depth-based methods perform second best. This can also be said about the regularized PCA keeping the first principal component only. The rest of the methods try to reconstruct the data structure and, being distracted either by the low rank or by the heavy-tailed noise, show poor results.

              D_Tuk    D_zon    D_Mah    D_MahR0.75   EM       regPCA1   regPCA2   kNN      RF       mean
Median RMSE   0.4511   0.4536   0.4795   0.5621       0.4709   0.4533    0.4664    0.4409   0.4444   0.4430
MAD of RMSE   0.3313   0.3411   0.3628   0.4355       0.3595   0.3461    0.3554    0.3302   0.3389   0.3307

Table 3: Medians and MADs of the RMSE for a two-dimensional low-rank model in R^4 of 50 points with Cauchy-distributed noise and 20% of missing values according to MCAR, over 1000 repetitions.

5.3 Real data

In addition, we validate the proposed methodology on four real benchmark data sets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). For each data set, we consider part of the variables and observations only (picking one class) in order to exclude categorical variables and reduce the computational burden. The finally handled data sets are: Banknotes (n = 100, d = 3), Glass (n = 76, d = 3), Pima (n = 68, d = 4), and Blood Transfusion (n = 502, d = 3, Yeh et al., 2009). Full details on the experimental design are given in the reproducing sources at https://github.com/julierennes. We investigate the same set of imputation methods, set the parameter of the MCD to 85%, and exclude the oracle again as it would produce perfect imputation. The boxplots of the RMSE over 500 repetitions of imputation after removing 15% of the entries due to the MCAR mechanism are depicted in Figure 6. First, we should note that throughout the data sets, the zonoid-depth-based imputation stably delivers highly satisfactory results.


Figure 6: RMSEs for the Banknotes (top, left), Glass (top, right), Pima (bottom, left), and Blood Transfusion (bottom, right) data sets with 15% of MCAR values over 500 repetitions.

Visual inspection of the Banknotes data set shows that it rather constitutes a mixture of two components, and thus mean imputation as well as the one-dimensional regularized PCA show poor performance. Random forest imputation delivers satisfactory results by fitting to the local geometry of the data. On the other hand, the zonoid-depth-based imputation, searching for a compromise between local and global features, delivers the best results. The unsatisfactory performance of the kNN imputation can be explained by local inhomogeneities in the data (points form plenty of local clusters of different size), i.e. it is too local; this problem seems to be captured by the aggregation stage of the random forest. All methods imputing by the conditional mean (Mahalanobis-depth-based, EM-based, and regularized PCA imputation) perform reasonably as well, while imputing in two-dimensional affine subspaces. The Tukey-depth-based imputation captures the geometry of the data on one side but lacks information for robustness reasons (outliers do not seem to be a problem here), and thus performs similarly.

The Glass data turn out to be more challenging, as they highly deviate from ellipticity and part of the data lie sparse in the space but do not seem to be outlying (perhaps a separate component containing, among others, limit or censored cases). Thus, mean and robust Mahalanobis-depth- and Tukey-depth-based imputations perform poorly. An additional disadvantage for the Tukey depth is the presence of a large number of outsiders, which are imputed with maximum spatial depth using MCD estimates that are obviously non-optimal here. For the same reason, both (groups of) semi- and non-parametric methods do not outperform mean imputation that much. Accounting for the local geometry, random forest and zonoid-depth-based imputation perform slightly better, and deliver the best results.

Even more challenging is the Pima data set. Its variables are only weakly dependent on each other; some correlation can be observed between the third and fourth variables, but even this is weakened by the presence of an outlier in the third dimension, which lies on the mean level in dimensions two and four. Similarly to the highly contaminated low-rank model from Section 5.2, in this setting partial ignorance regarding dependency is an advantage. For this reason, mean and one-dimensional regularized PCA imputation perform best, but the depth-based imputation methods are included in the closely following group. Methods accounting for data ellipticity perform comparably because they are able to provide close to reasonable imputation at least in the two correlated dimensions.

The Blood Transfusion data visually resemble a tetrahedron being dispersed away from one of its vertices. Thus, mean imputation can be substantially improved upon. As the nonparametric methods disregard the dependency between the dimensions and the one-dimensional regularized PCA does not capture it sufficiently, they perform poorly. Better imputation is delivered by the depth- and EM-based methods, those capturing the correlation. As the data still deviate from ellipticity and only a few outliers are present, the MCD, throwing away 15% of the data, worsens the robust Mahalanobis-depth-based imputation; this effect is partially transferred to the Tukey-depth-based imputation via the outsider treatment. Due to the same deviation from ellipticity, the zonoid-depth-based imputation is further slightly advantageous.

6 Multiple imputation for the elliptical family

The generic framework developed above allows one to go beyond single imputation and enables statistical inference by means of multiple imputation (Little and Rubin, 2002). This approach consists in calculating the estimator of interest on a number of generated imputed data sets, with further aggregation. Drawing multiply imputed data sets is traditionally performed in two steps: the first step consists in reflecting the uncertainty of the parameters of the imputation model; the second step consists in imputing close to the underlying distribution. Imputation model uncertainty may be reflected using either a bootstrap or a Bayesian approach, see e.g. Schafer (1997), Efron (1994); in the multivariate normal setting, EM estimates can be used to draw from the conditional normal distribution. An alternative is to employ Markov chain Monte Carlo (MCMC) with normal conditional distributions, viz. multiple imputation by chained equations (MICE) by van Buuren (2012).

The depth-based single-imputation framework introduced above allows multiple imputation to be extended in a natural way to the more general elliptical setting. We start by showing how to reflect uncertainty due to the distribution, which delivers a so-called improper imputation (Section 6.1). Afterwards, using the bootstrap to reflect model uncertainty, we state the complete algorithm in Section 6.2.

6.1 Accounting for uncertainty due to distribution

When imputing an observation x in the case of multivariate normality, one can use EM estimates to draw x_miss from N(µ, Σ) conditioned on x_obs, for instance using the Schur complement (stochastic EM). The conditional distribution of an elliptically symmetric distribution (we assume absolute continuity and that the center and shape can be consistently estimated by µ and Σ) is elliptical as well, and can be derived by a proper transformation of µ, Σ, and the univariate (radial) density (Fang et al., 1990). Their estimation presents two complications: the incompleteness of the data and a one-sided, possibly heavy-tailed density. To overcome the first one, we design an MCMC scheme that allows estimators for complete data to be used. For the second issue, taking into account that multiple imputation requires drawing a point rather than estimating the density itself, we stay in the depth framework and proceed as follows. First, using the depth c.d.f., we draw the depth contour. Second, using µ and Σ, we draw the point containing missing values in the intersection of this contour with the hyperplane of the missing coordinates. The suggested procedure allows us to draw from the conditional distribution in a semiparametric way, as detailed below.
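As an illustration only, a minimal sketch of the Gaussian stochastic-EM step via the Schur complement is given below; it is not the implementation used later, and the function name draw.cond.normal is hypothetical. It assumes estimates mu and Sigma (e.g. from EM) and an observation x with missing index set miss.

```r
## Draw x_miss from N(mu, Sigma) conditioned on x_obs (stochastic EM step).
## mu, Sigma: mean vector and covariance matrix estimates; miss: indices of NAs in x.
draw.cond.normal <- function(x, miss, mu, Sigma) {
  obs  <- setdiff(seq_along(x), miss)
  S.oo <- Sigma[obs,  obs,  drop = FALSE]
  S.mo <- Sigma[miss, obs,  drop = FALSE]
  S.mm <- Sigma[miss, miss, drop = FALSE]
  ## Conditional mean and conditional covariance (Schur complement)
  mu.cond    <- mu[miss] + S.mo %*% solve(S.oo, x[obs] - mu[obs])
  Sigma.cond <- S.mm - S.mo %*% solve(S.oo, t(S.mo))
  ## One draw from the conditional normal distribution
  L <- chol(Sigma.cond)                      # upper triangular, t(L) %*% L = Sigma.cond
  x[miss] <- as.vector(mu.cond + t(L) %*% rnorm(length(miss)))
  x
}
```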

For an absolutely continuous elliptically symmetric distribution, any data depth satisfying the corresponding postulates from Mosler (2013) or Zuo and Serfling (2000) possesses the characterization property: it is a monotone function of the radial density, which in turn is a monotone function of the Mahalanobis distance. For the Mahalanobis depth, for instance, which is just a monotone transformation of the Mahalanobis distance, this relationship is the most intuitive.

In contrast to the normal case, the shape of each conditional distribution depends on the condition itself. Precisely, this heteroscedasticity is determined by the Mahalanobis distances of the point x (to be imputed) and of the conditional center µ* from the unconditional one µ. Given a complete data set X, µ* can be obtained due to (3). Further, let f_{D(X|X)} denote the density of the depth of a random vector X ∈ R^d w.r.t. itself. The depth of the imputing point y should then be drawn as a quantile Q uniformly on [0, F_{µ*}(D(µ*|X))], where

F_{µ*}(x) = ∫_0^x f_{D(X|X)}(z) · [ ( √( d_Mah.²(z) − d_Mah.²(D(µ*|X)) ) )^{|miss(x)|−1} / d_Mah.^{d−1}(z) ] · [ d_Mah.(z) / √( d_Mah.²(z) − d_Mah.²(D(µ*|X)) ) ] dz,        (7)

with d_Mah.(z) being the Mahalanobis distance to the center as a function of depth. Then, Q is simply projected back onto the support by α = F_{µ*}^{−1}(Q), corresponding to the surface of D_α(X), the trimmed region of depth α; see Figure 7 (left). The aim of transformation (7) is to normalize the volume (see the Appendix for the derivation). Any constant normalization factor can be omitted here, as F_{µ*} is used exclusively for drawing. The square root in the formula could be canceled out, but written this way the Mahalanobis distance is exploited as a function of the depth of the joint distribution only, which can be estimated from the data without further transformations. For instance, when using the Mahalanobis depth, whose value is D = 1/(1 + d_Mah.²), one can substitute d_Mah.(y) directly in equation (7) by √(1/y − 1).

Next, the point y ∈ ∂D_α(X) ∩ {z ∈ R^d | z_obs(x) = x_obs} (lying in the intersection of the region of depth α with the hyperplane of the missing values of x) is chosen at random. This is done by drawing u uniformly on S^{|miss(x)|−1} and transforming it by the conditional scatter matrix, obtaining u* ∈ R^d with u*_{miss(x)} = Λu (where ΛΛ′ = Σ_{miss(x),miss(x)} − Σ_{miss(x),obs(x)} Σ_{obs(x),obs(x)}^{−1} Σ_{obs(x),miss(x)}) and u*_{obs(x)} = 0. Such a u* is uniformly distributed on the conditional depth contour. Then x is imputed as y = µ* + βu*, where β is a scalar obtained as the positive solution of µ* + βu* ∈ ∂D_α(X) (e.g. the quadratic equation (µ* + βu* − µ)′Σ^{−1}(µ* + βu* − µ) = d_Mah.²(α) in the case of the Mahalanobis depth). See Figure 7 (right) for an illustration.
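A minimal sketch of this drawing step for the special case of the Mahalanobis depth (for which the contour of depth α is the ellipsoid with d_Mah.² = 1/α − 1) is given below; mu.star denotes the single-imputed point, alpha the drawn depth level, and the function name is ours, not that of the package. For other depths only the computation of β, i.e. the intersection with ∂D_α, changes.

```r
## Draw y on the intersection of the depth-alpha contour with the hyperplane of
## the missing coordinates of x (Mahalanobis-depth case).
draw.on.contour <- function(x, miss, mu.star, mu, Sigma, alpha) {
  obs  <- setdiff(seq_along(x), miss)
  S.mo <- Sigma[miss, obs, drop = FALSE]
  ## Conditional scatter (Schur complement) and a root Lambda with Lambda %*% t(Lambda) = S.cond
  S.cond <- Sigma[miss, miss, drop = FALSE] - S.mo %*% solve(Sigma[obs, obs, drop = FALSE], t(S.mo))
  Lambda <- t(chol(S.cond))
  ## u uniform on the unit sphere of dimension |miss| - 1, mapped to u.star in R^d
  u <- rnorm(length(miss)); u <- u / sqrt(sum(u^2))
  u.star <- numeric(length(x)); u.star[miss] <- Lambda %*% u
  ## beta solves (mu.star + beta*u.star - mu)' Sigma^{-1} (...) = 1/alpha - 1
  Si <- solve(Sigma)
  v  <- mu.star - mu
  a  <- drop(t(u.star) %*% Si %*% u.star)
  b  <- drop(t(u.star) %*% Si %*% v)
  cc <- drop(t(v) %*% Si %*% v) - (1 / alpha - 1)   # <= 0 since alpha <= D(mu*|X)
  beta <- (-b + sqrt(b^2 - a * cc)) / a             # positive root of the quadratic
  y <- x; y[miss] <- mu.star[miss] + beta * u.star[miss]
  y
}
```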


Figure 7: Illustration of an application of (7) to impute by drawing from the conditional distribution of an elliptical distribution. Drawing the depth α = F_{µ*}^{−1}(Q) via the depth c.d.f. F_{µ*} (left) and locating the corresponding imputed point y (right).

The proposed method allows imputation by conditional drawing from an elliptically symmetric, absolutely continuous distribution, and can also be employed for degenerate (flat) distributions by generalizing the scatter-matrix operations to the singular case. We briefly demonstrate its capabilities in the following simulation study. We generate 500 points from an elliptical Student-t distribution with 3 degrees of freedom, mean µ = (−1, −1, −1, −1)′, and structure matrix

Σ = ((0.5, 0.5, 1, 1)′, (0.5, 1, 1, 1)′, (1, 1, 4, 4)′, (1, 1, 4, 10)′),

introduce 30% of missing values under MCAR, and impute them, reflecting uncertainty due to the distribution only, by stochastic EM and by the proposed method. Medians of the univariate quantiles over 2000 repetitions for the initial distribution, stochastic EM, and the suggested algorithm are reported in Table 4. While stochastic EM, which generates noise from the normal model, fails to recover the quantiles, as expected, the proposed method reproduces them quite precisely, with only slight deviations in the tails of the distribution due to the (again expected) difficulty of reflecting the density shape there. Though such an outcome is expected, this greatly extends the scope of application compared to the deep-rooted Gaussian-based imputation.
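For concreteness, the data-generating part of this experiment can be sketched as follows (a sketch assuming the mvtnorm package; the imputation itself is performed by the package routines and by stochastic EM):

```r
## Generate 500 points from an elliptical Student-t (3 d.f.) sample and apply MCAR.
library(mvtnorm)
mu    <- c(-1, -1, -1, -1)
Sigma <- rbind(c(0.5, 0.5, 1,  1),
               c(0.5, 1,   1,  1),
               c(1,   1,   4,  4),
               c(1,   1,   4, 10))
set.seed(1)
X <- sweep(rmvt(500, sigma = Sigma, df = 3), 2, mu, "+")   # shift by the center mu
## MCAR: each cell is missing independently with probability 0.3
X.miss <- X
X.miss[matrix(runif(length(X)) < 0.3, nrow(X))] <- NA
```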

6.2 Inference for incomplete data

The imputation scheme designed in Section 6.1 accounts for uncertainty w.r.t. the distribution only, and thus cannot be used directly for deriving inference from incomplete data. To perform proper multiple imputation, we resort to the bootstrap approach to reflect the uncertainty due to the estimation of the underlying semiparametric model as well: to generate each imputed data set, we first draw a sequence of indices b = (b_1, ..., b_n) with b_i ∼ U{1, ..., n} for i = 1, ..., n, and then use this sequence to obtain a resample (with repetitions) X_{b,·} = (x_{b_1}, ..., x_{b_n}), which is used to perform the single imputation giving µ* and to estimate the shape Σ on each Monte Carlo iteration. The depth-based procedure for multiple imputation (called DMI) is described by Algorithm 2.


Quantile:        0.5      0.75     0.85     0.9      0.95     0.975    0.99     0.995
X1: complete  -1.0013  -0.4632  -0.1231   0.1446   0.6398   1.2017   2.0661   2.8253
    stoch. EM -1.0008  -0.4225  -0.0649   0.2114   0.6902   1.2022   1.9782   2.6593
    depth     -0.9996  -0.4643  -0.1232   0.1491   0.6509   1.2142   2.0827   2.8965
X2: complete  -0.9992  -0.2359   0.2416   0.6237   1.3252   2.1263   3.3277   4.3852
    stoch. EM -1.0018  -0.1675   0.3468   0.7468   1.4205   2.1355   3.1723   4.1049
    depth     -1.0008  -0.2318   0.2522   0.6411   1.3537   2.1627   3.3771   4.5018
X3: complete  -1.0030   0.5194   1.4815   2.2513   3.6353   5.2170   7.6330   9.8235
    stoch. EM -1.0038   0.6304   1.6537   2.4386   3.7903   5.2302   7.4294   9.2877
    depth     -1.0043   0.5228   1.4919   2.2660   3.6624   5.2602   7.7358   9.8991
X4: complete  -1.0134   1.3938   2.9121   4.1352   6.3271   8.8592  12.6258  16.1014
    stoch. EM -0.9968   1.6388   3.2870   4.5494   6.6477   8.8909  12.0789  14.9915
    depth     -1.0091   1.3986   2.9341   4.1559   6.3862   8.9471  12.7642  16.3143

Table 4: Median univariate quantiles of the imputed elliptical sample consisting of 500 points drawn from the Student-t distribution with 3 d.f., over 2000 repetitions.

Taking the single imputation as a starting point allows the Markov chain Monte Carlo to start closer to the stationary mode and thus reduces the burn-in period to a few iterations only.


Algorithm 2 Depth-based multiple imputation
 1: function IMPUTE.DEPTH.MULTIPLE(X, num.burnin, num.sets)
 2:   for m = 1 : num.sets do
 3:     Y^(m) ← IMPUTE.DEPTH.SINGLE(X)                              ▷ Start MCMC with a single imputation
 4:     b ← (b_1, ..., b_n) = (U(1, ..., n), ..., U(1, ..., n))     ▷ Draw bootstrap sequence
 5:     for k = 1 : (num.burnin + 1) do
 6:       Σ ← Σ(Y^(m)_{b,·})
 7:       Estimate f_{D(X|X)} using Y^(m)
 8:       for i = 1 : n do
 9:         if miss(i) ≠ ∅ then
10:           µ* ← IMPUTE.DEPTH.SINGLE(x_i, Y^(m)_{b,·})            ▷ Single-impute the point
11:           u ← U(S^{|miss(i)|−1})
12:           u*_{miss(i)} ← Λu                                     ▷ Calculate random direction
13:           u*_{obs(i)} ← 0
14:           Calculate F_{µ*}
15:           Q ← U([0, F_{µ*}(D(µ*))])                             ▷ Draw depth
16:           α ← F_{µ*}^{−1}(Q)
17:           β ← positive solution of µ* + βu* ∈ ∂D_α(Y^(m)_{b,·})
18:           y^(m)_{i,miss(i)} ← µ*_{miss(i)} + βu*_{miss(i)}      ▷ Impute one point
19:   return (Y^(1), ..., Y^(num.sets))

The function IMPUTE.DEPTH.SINGLE(x_i, Y^(m)_{b,·}) in Algorithm 2 corresponds to the obvious modification of Algorithm 1 for imputing the missing values of a single point x_i, calculating all depth estimates w.r.t. the complete data set Y^(m)_{b,·}. Λ is such that ΛΛ′ = Σ. The depth density f_{D(X|X)} can be obtained using any consistent estimate, while F_{µ*} can be computed by numerical integration. In general, Algorithm 2 sticks to the above notation and returns a number of multiply imputed data sets.
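As a rough illustration of these two estimation steps (not the package code), the sketch below kernel-estimates the depth density and tabulates F_{µ*} of (7) on a grid for the Mahalanobis depth, for which d_Mah.(z) = √(1/z − 1); all names (tabulate.F, depths, depth.mu.star, n.miss) are hypothetical.

```r
## Tabulate F_{mu*} of (7) on a grid for the Mahalanobis depth.
## depths: Mahalanobis depths of the rows of the (completed) data w.r.t. themselves;
## depth.mu.star: D(mu*|data); n.miss: number of missing coordinates; d: dimension.
tabulate.F <- function(depths, depth.mu.star, n.miss, d, grid.len = 1000) {
  f.D <- density(depths, from = 1e-6, to = max(depths))      # depth density estimate
  z   <- seq(1e-6, depth.mu.star, length.out = grid.len)     # integrate only up to D(mu*|data)
  f.z <- approx(f.D$x, f.D$y, xout = z, rule = 2)$y
  dM  <- sqrt(1 / z - 1)                                     # Mahalanobis distance as a function of depth
  dM0 <- sqrt(1 / depth.mu.star - 1)
  diff2 <- pmax(dM^2 - dM0^2, 1e-12)                         # guard the integrable endpoint singularity
  integrand <- f.z * dM^(2 - d) * diff2^((n.miss - 2) / 2)   # combined down-, up- and angle-scaling
  F.z <- cumsum(integrand) * diff(z)[1]                      # crude Riemann cumulative sum
  list(z = z, F = F.z)                                       # last value approximates F_{mu*}(D(mu*|data))
}
```

Drawing Q uniformly on [0, max(F)] and inverting by interpolation (e.g. approx(F, z, xout = Q)) then yields the depth level α.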

While the estimation of the depth density gives DMI a clear advantage over existing implementations, since it captures the joint distribution, its competitiveness in the Gaussian setting remains of interest. Thus, we explore the performance of DMI in estimating the coefficients of two regression models. The first one is Y = β′(1, X′)′ + ε with β = (0.5, 1)′, X ∼ N(1, 4), and ε ∼ N(0, 0.25); 30% of missing values are introduced under MCAR. We employ DMI and add, for comparison, multiple imputation by the R packages Amelia and mice (with Gaussian conditional distributions) with default settings. We generate 5 and 20 multiply imputed data sets. Based on a sample of 100 observations and over 1000 repetitions, Table 5 reports medians of the estimates of β, coverage by the 95% confidence interval constructed according to Rubin's rule, and the width of this interval.
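The pooling across imputed data sets follows Rubin's rule; a compact sketch of this step (our own helper, not taken from the compared packages) is given below, assuming est and se hold the m per-imputation point estimates and standard errors of one coefficient, e.g. obtained by lm on each completed data set.

```r
## Rubin's rule: pooled estimate, total variance, and confidence interval.
pool.rubin <- function(est, se, level = 0.95) {
  m     <- length(est)
  qbar  <- mean(est)                              # pooled point estimate
  W     <- mean(se^2)                             # within-imputation variance
  B     <- var(est)                               # between-imputation variance
  total <- W + (1 + 1/m) * B                      # total variance
  nu    <- (m - 1) * (1 + W / ((1 + 1/m) * B))^2  # Rubin's degrees of freedom
  half  <- qt(1 - (1 - level) / 2, df = nu) * sqrt(total)
  c(estimate = qbar, lower = qbar - half, upper = qbar + half)
}
```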

Note that both Amelia and mice are optimal in these settings. In the experiment, Amelia seems to undercover slightly, more markedly when considering 20 multiply imputed data sets. mice delivers very reasonable estimates. DMI suffers from slight overcoverage.


                        β0                         β1
               med     cov     width     med     cov     width
5 multiply imputed data sets
  Amelia      0.498   0.941   0.39      1.004   0.947   0.168
  mice        0.506   0.945   0.39      1       0.943   0.178
  DMI         0.501   0.966   0.484     0.998   0.971   0.212
20 multiply imputed data sets
  Amelia      0.496   0.937   0.351     1       0.931   0.156
  mice        0.499   0.957   0.371     0.997   0.946   0.166
  DMI         0.5     0.979   0.421     0.996   0.969   0.191

Table 5: Medians (med), 95% coverage due to Rubin's rule (cov), and width of the confidence intervals (width) for the first regression model, for a sample of 100 observations with 30% MCAR coordinates, based on 5 and 20 multiply imputed data sets, over 1000 repetitions.

The second regression model is Y = β′(1, X′)′ + ε with β = (0.5, 1, 3)′ and X ∼ N((1, 1)′, Σ_X) with Σ_X = ((1, 1)′, (1, 4)′). The remaining settings are kept. Beyond the missing values, the joint elliptical distribution of (X′, Y)′ poses additional difficulties due to the high correlation (≈ 0.988) between the second component of X and Y. The results are reported in Table 6.
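For completeness, the stated correlation follows from the model parameters: with Var(X_2) = 4, Cov(X_1, X_2) = 1, and ε independent of X,

Cov(X_2, Y) = Cov(X_2, X_1) + 3 Var(X_2) = 1 + 12 = 13,
Var(Y) = Var(X_1) + 9 Var(X_2) + 6 Cov(X_1, X_2) + Var(ε) = 1 + 36 + 6 + 0.25 = 43.25,

so that corr(X_2, Y) = 13 / √(4 · 43.25) ≈ 0.988.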

                      β0                        β1                        β2
             med     cov    width      med     cov    width      med     cov    width
5 multiply imputed data sets
  Amelia    0.5     0.946   0.536     1.005   0.939   0.438     2.999   0.94    0.226
  mice      0.525   0.984   1.464     1.063   0.975   1.476     2.92    0.976   0.88
  DMI       0.513   0.974   0.719     0.989   0.957   0.589     3       0.961   0.295
20 multiply imputed data sets
  Amelia    0.487   0.931   0.489     1.01    0.941   0.399     2.998   0.929   0.206
  mice      0.519   0.984   1.6       1.081   0.98    1.807     2.881   0.982   1.502
  DMI       0.504   0.971   0.613     0.989   0.979   0.519     3.003   0.97    0.26

Table 6: Medians (med), 95% coverage due to Rubin's rule (cov), and width of the confidence intervals (width) for the second regression model, for a sample of 100 observations with 30% MCAR coordinates, based on 5 and 20 multiply imputed data sets, over 1000 repetitions.

Amelia still undercovers slightly, again more pronouncedly for 20 multiply imputed data sets. mice delivers biased coefficients, due to its MCMC nature and the high correlation that causes instability in the regression models, and seriously overcovers (even visually, its 95% confidence intervals are substantially larger). DMI, on the other hand, retains the same behavior, suffering from slight overcoverage, but in general delivers reasonable results. It is worth noticing that, due to the estimation of the depth density, the application of Rubin's rule to DMI is not theoretically justified and requires further investigation (Reiter and Raghunathan, 2007). This could explain the slight overcoverage, which still maintains the confidence level and is preferable to undercoverage.


7 Conclusions

The proposed framework for imputation based on data depth fills the gap between the global imputation pursued by regression- and PCA-based methods and the local one represented by, e.g., random forest or kNN imputation. It reflects uncertainty about the distributional assumption by imputing close to the data geometry, is robust with respect to the distribution and to outliers, and preserves functionality under the MAR mechanism. When used with the Mahalanobis depth, exploiting data depth as a concept, the connection between iterative regression, regularized PCA, and minimum covariance determinant imputation has been established. The empirical study shows the efficiency of the suggested methodology for various elliptical distributions and for real data settings resembling ellipticity. Further, the method extends in a natural way to multiple imputation for the elliptical family, which enlarges the area of application of the existing tools for multiple imputation.

The methodology is generic, i.e. any reasonable notion of data depth can be employed, and this choice determines the properties of the imputation. According to the empirical study, the zonoid depth behaves well in general, and in the real-data settings in particular. On the other hand, the Tukey depth may be preferred if robustness is an issue. Further, the projection depth (Zuo and Serfling, 2000) is a proper choice if only a few points contain missing values in a data set that is substantially contaminated with outliers. This specific case is not included in the article, but projection-depth-based imputation can be found in the implementation. To reflect multimodality of the data, the suggested framework can be employed with localized depths, see e.g. Paindaveine and Van Bever (2013), where the localization parameter can be tuned by cross-validation based on the imputation quality (e.g. as was done for tuning the kNN imputation in Section 5).

A serious issue with data depths is their computational demand. For the purity of the experiment, all computations in Section 5 used exact algorithms, which is why the study is restricted to dimensions 3 and 4 and a few hundred points only. Even using approximate versions of data depths (which can also be found in the implementation) does not substantially increase the amount of data that can be handled. Imputing large data sets may become feasible with the expected developments in the field of data depth.

The methodology has been implemented as an R package. The source code of the package and the experiment-reproducing files can be downloaded from https://github.com/julierennes.

Appendix: Proofs

Proof of Theorem 1:
The proof for the Mahalanobis depth is obvious. Exploiting the results by Liu and Singh (1993), for the zonoid and Tukey depths one obtains:

For the zonoid depth, due to the absolute continuity of E, there exists N_ch such that x ∈ conv(X^(n)_obs) for all n > N_ch, and thus, due to the continuity of the zonoid depth in conv(X^(n)_obs), there exists D_{α(n)}(X^(n)_obs) such that y ∈ ∂D_{α(n)}(X^(n)_obs). As n → ∞, ∂D_{α(n)}(X^(n)_obs) → ∂D_α(E) a.s. for some α such that y ∈ ∂D_α(E), which is an ellipsoid described by µ, Σ, and a scaling constant.

For the Tukey depth, regard the sequence of regions D_{α(n)}(X^(n)_obs) with α(n) defined by (4). Due to the absolute continuity and ellipticity of E, as n → ∞, ∂D_{α(n)}(X^(n)_obs) → ∂D_α(E) a.s., an ellipsoid described by µ, Σ, and containing y. □


Proof of Theorem 2:
The proof for the Mahalanobis depth is obvious. For the zonoid and Tukey depths, due to consistency with the population version, we restrict ourselves to the population one. Let Y ∼ E and consider the transform Y ↦ Z = RΣ^{−1/2}(Y − µ), with R being a rotation operator such that, w.l.o.g., y ↦ z and z_i = 0 for all i = 2, ..., d, with d denoting the missing-value coordinate. We have to show that D(z|Z) > D(x|Z) for all x with x_i = z_i, i = 1, ..., d − 1. Let D(z|Z) = α, and regard the corresponding region D_α(Z). As both the zonoid and the Tukey depth satisfy the weak projection property, due to Statement 1 of Theorem 2 of Dyckerhoff (2004) it is sufficient to check univariate projections in the plane spanned by the axes e_1 and e_d, namely only angles in the range (0, π/2] between e_1 and the direction u ∈ S^2, taken counterclockwise, say. Then F_{u′Z}(x) = (1 − p_NA)F_1(x) + p_NA F_1(x / cos β) for β ∈ [0, π/2] with β = arccos(u′e_1), F_1(x) being the univariate marginal c.d.f. of Z assuming it has no missing values, and p_NA the proportion of missing values. F_{u′Z}(x) > F_1(x) = F_{e_1′Z}(x) for all β ∈ (0, π/2] and x > 0 ('<' for x < 0), and the last equality holds for the ray pointing at the imputed point. Due to monotonicity, the inequalities reverse for the quantile function, which, combined with the definitions of both depths, gives strictly D_α(u′Z) ⊂ D_α(e_1′Z) for all α ∈ (0, 1) and all β ∈ (0, π/2]. □

Proof of Proposition 1: Factorization of the determinant and of the inverse gives (using |A + aa′| = |A|(1 + a′A^{−1}a) and (A + aa′)^{−1} = A^{−1} − (A^{−1}a)(a′A^{−1}) / (1 + a′A^{−1}a), see Appendix A of Mardia et al. (1979); these identities are also used in the proofs of Lemma 1 and of Proposition 2):

|Y′Y| = |X′X + y_k y_k′ − x_k x_k′| = |X′X + y_k y_k′| (1 − x_k′(X′X + y_k y_k′)^{−1} x_k)
      = |X′X| (1 + y_k′(X′X)^{−1} y_k) (1 − x_k′(X′X)^{−1} x_k + (x_k′(X′X)^{−1} y_k y_k′(X′X)^{−1} x_k) / (1 + y_k′(X′X)^{−1} y_k))
      = |X′X| (1 + y_k′(X′X)^{−1} y_k − x_k′(X′X)^{−1} x_k − y_k′(X′X)^{−1} y_k x_k′(X′X)^{−1} x_k + x_k′(X′X)^{−1} y_k y_k′(X′X)^{−1} x_k).

Let z_k = x_k − y_k; then, due to the Mahalanobis orthogonality of y_k and z_k w.r.t. X (y_k′ Σ_X^{−1} z_k = 0), one has

d_M²(x_k, µ_X; Σ_X) = d_M²(y_k, µ_X; Σ_X) + d_M²(z_k, µ_X; Σ_X).

Using this, one obtains

|Y′Y| = |nΣ_X| (1 − (1/n) d_M²(z_k, µ_X; Σ_X) (1 + (1/n) d_M²(y_k, µ_X; Σ_X))).

As d_M²(x_k, µ_X; Σ_X) < n, one has

(1/n) d_M²(z_k, µ_X; Σ_X) (1 + (1/n) d_M²(y_k, µ_X; Σ_X)) < (1/n) d_M²(z_k, µ_X; Σ_X) (2 − (1/n) d_M²(z_k, µ_X; Σ_X)).

Further, one can rewrite the right-hand term as

(1/n) d_M²(z_k, µ_X; Σ_X) (2 − (1/n) d_M²(z_k, µ_X; Σ_X)) = (1/n²) g(d_M²(z_k, µ_X; Σ_X)),

with g(x) = −x² + 2nx, a function monotonically increasing from g(0) = 0 to g(n) = n², while d_M²(z_k, µ_X; Σ_X) > 0 as long as y_k ≠ x_k. From this it follows that |Y′Y| < |nΣ_X|, and thus |Σ_Y| < |Σ_X|. □


Proof of Lemma 1: W.l.o.g. we restrict ourselves to the case i = 1. Let Z be an n × d matrix with µ_Z = 0 and z_{1,miss(1)} = Σ_{Z miss(1),obs(1)} Σ_{Z obs(1),obs(1)}^{−1} z_{1,obs(1)}. Denote a = (0, ..., 0, y′)′ ∈ R^d and let Z̃ be the matrix Z with its first row z_1 replaced by z_1 + a; then

n Σ_{Z̃} = Z′Z − z_1 z_1′ + (z_1 + a)(z_1 + a)′ − (1/n) a a′.

As z_1′ Σ_Z^{−1} a = 0 due to Mahalanobis orthogonality, by simple algebra (analogous to that in the proof of Proposition 1)

|Σ_{Z̃}| = |Σ_Z| (1 + ((n − 1)/n²) a′ Σ_Z^{−1} a). □

Proof of Proposition 2: Factorization of the determinant and of the inverse gives

|Y′Y| = |X′X| ((1 − z_k′(X′X)^{−1} x_k)² + z_k′(X′X)^{−1} z_k (1 − x_k′(X′X)^{−1} x_k))

with z_k = x_k − y_k. Let M be a (d × d) matrix with elements M_{i,i} = 1 for i ∈ miss(k) and zero otherwise. Then z_k = σ² M V Λ^{−1/2} u_k, and thus, denoting u*_k = σ M V Λ^{−1/2} u_k,

z_k′(X′X)^{−1} z_k = (u*_k)′ V Λ^{−1/2} σ² Λ^{−1/2} V′ u*_k < (u*_k)′ u*_k = z_k′(X′X)^{−1} x_k,

because σ² < λ_d. In the same way one can show that z_k′(X′X)^{−1} x_k < x_k′(X′X)^{−1} x_k. Keeping

0 < z_k′(X′X)^{−1} z_k < z_k′(X′X)^{−1} x_k < x_k′(X′X)^{−1} x_k < 1,

clearly |Y′Y| < |X′X| ((1 − a)² + a(1 − x)) with a = z_k′(X′X)^{−1} x_k and x = x_k′(X′X)^{−1} x_k, and this last term is a function of a and x that is < 1 for all (a, x) ∈ (0, 1)². □

Proof of Theorem 3: The first point can be checked by elementary algebra. The second point follows from the coordinatewise application of Lemma 1. For the third point it suffices to prove the single-output regression case. The regularized PCA algorithm will have converged if

y_{id} = ∑_{s=1}^{d} u_{is} √λ_s v_{ds} = ∑_{s=1}^{d} u_{is} (√λ_s − σ²/√λ_s) v_{ds}

for any σ² ≤ λ_d. W.l.o.g. we prove that

y_d = Σ_{d,(1,...,d−1)} Σ_{(1,...,d−1),(1,...,d−1)}^{−1} y_{(1,...,d−1)}   ⟺   ∑_{i=1}^{d} u_i v_{di} / √λ_i = 0,

denoting Σ(Y) simply by Σ for the centered Y and an arbitrary point y. By matrix algebra,

y_d = Σ_{d,(1,...,d−1)} Σ_{(1,...,d−1),(1,...,d−1)}^{−1} y_{(1,...,d−1)} = −((Σ^{−1})_{dd})^{−1} (Σ^{−1})_{d,(1,...,d−1)} y_{(1,...,d−1)},

∑_{i=1}^{d} u_i √λ_i v_{di} = −( ∑_{i=1}^{d} v_{di}²/λ_i )^{−1} ( ∑_{i=1}^{d} v_{di} v_{1i}/λ_i , ∑_{i=1}^{d} v_{di} v_{2i}/λ_i , ..., ∑_{i=1}^{d} v_{di} v_{(d−1)i}/λ_i ) × ( ∑_{i=1}^{d} u_i √λ_i v_{1i} , ∑_{i=1}^{d} u_i √λ_i v_{2i} , ..., ∑_{i=1}^{d} u_i √λ_i v_{(d−1)i} )′.

After reordering terms one obtains

∑_{i=1}^{d} (u_i / √λ_i) ∑_{j=1}^{d} (v_{dj} / λ_j) ∑_{k=1}^{d} v_{ki} v_{kj} = 0.

Due to the orthogonality of V, d² − d terms from the two outer sums are zero. Gathering the nonzero terms, i.e. those with i = j only,

∑_{i=1}^{d} u_i √λ_i v_{di} / λ_i = ∑_{i=1}^{d} u_i v_{di} / √λ_i = 0. □

Derivation of (7): The integrated quantity is the conditional depth density, which can be obtained from the joint one by a volume transformation (denoting by d_Mah.(z, µ) the Mahalanobis distance between a point of depth z and µ):

f_{D((X|X_obs = x_obs)|X)}(z) = f_{D(X|X)}(z) · C · T_down(d_Mah.(z, µ)) · T_up(d_Mah.(z, µ*)) · T_angle(d_Mah.(z, µ), d_Mah.(z, µ*)).

Any constant C is neglected, as it is unimportant for drawing. The three terms correspond to descaling the density to dimension one (downscaling), re-scaling it to the dimension of the missing values (upscaling), and the linear change of its volume to the hyperplane of missingness (angle transformation).


Figure 8: Illustration of the derivation of (7).


T_down(d_Mah.(z, µ)) = d_Mah.^{1−d}(z, µ) = 1 / d_Mah.^{d−1}(z, µ).

T_up(d_Mah.(z, µ*)) = d_Mah.^{|miss(x)|−1}(z, µ*) = ( √( d_Mah.²(z, µ) − d_Mah.²(D(µ*|X), µ) ) )^{|miss(x)|−1}.

T_angle(d_Mah.(z, µ), d_Mah.(z, µ*)) = 1 / sin θ = 1 / ( d_Mah.(z, µ*) / d_Mah.(z, µ) ) = d_Mah.(z, µ) / √( d_Mah.²(z, µ) − d_Mah.²(D(µ*|X), µ) ).

T_down and T_up are illustrated in Figure 8 (left); for T_angle see Figure 8 (right). Letting d_Mah.(z, µ) = d_Mah.(z) to shorten notation gives (7).

References

Bazovkin, P. and Mosler, K. (2015), 'A general solution for robust linear programs with distortion risk constraints', Annals of Operations Research 229(1), 103–120.

Bertsekas, D. P. (1999), Nonlinear Programming. Second edition, Athena Scientific.

Cascos, I. and Molchanov, I. (2007), 'Multivariate risks and depth-trimmed regions', Finance and Stochastics 11(3), 373–397.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1–38.

Donoho, D. L. and Gasko, M. (1992), 'Breakdown properties of location estimates based on halfspace depth and projected outlyingness', The Annals of Statistics 20(4), 1803–1827.

Dyckerhoff, R. (2004), 'Data depths satisfying the projection property', Advances in Statistical Analysis 88(2), 163–190.

Dyckerhoff, R. and Mosler, K. (2011), 'Weighted-mean trimming of multivariate data', Journal of Multivariate Analysis 102, 405–421.

Dyckerhoff, R. and Mozharovskyi, P. (2016), 'Exact computation of the halfspace depth', Computational Statistics and Data Analysis 98, 19–30.

Efron, B. (1994), 'Missing data, imputation, and the bootstrap', Journal of the American Statistical Association 89(426), 463–475.

Efron, B. and Morris, C. (1972), 'Empirical Bayes on vector observations: An extension of Stein's method', Biometrika 59(2), 335–347.

Fang, K., Kotz, S. and Ng, K. (1990), Symmetric Multivariate and Related Distributions, Monographs on Statistics and Applied Probability, Chapman and Hall.

Hallin, M., Paindaveine, D. and Siman, M. (2010), 'Multivariate quantiles and multiple-output regression quantiles: From L1 optimization to halfspace depth', The Annals of Statistics 38(2), 635–669.

Hastie, T., Mazumder, R., Lee, D. J. and Zadeh, R. (2015), 'Matrix completion and low-rank SVD via fast alternating least squares', Journal of Machine Learning Research 16, 3367–3402.

Healy, M. and Westmacott, M. (1956), 'Missing values in experiments analysed on automatic computers', Journal of the Royal Statistical Society. Series C (Applied Statistics) 5(3), 203–206.

Jornsten, R. (2004), 'Clustering and classification based on the L1 data depth', Journal of Multivariate Analysis 90(1), 67–89. Special Issue on Multivariate Methods in Genomic Data Analysis.

Josse, J. and Husson, F. (2012), 'Handling missing values in exploratory multivariate data analysis methods', Journal de la Societe Francaise de Statistique 153(2), 79–99.

Koshevoy, G. and Mosler, K. (1997), 'Zonoid trimming for multivariate distributions', The Annals of Statistics 25(5), 1998–2017.

Lange, T., Mosler, K. and Mozharovskyi, P. (2014), 'Fast nonparametric classification based on data depth', Statistical Papers 55(1), 49–69.

Little, R. and Rubin, D. (2002), Statistical Analysis with Missing Data, Wiley Series in Probability and Mathematical Statistics, Wiley.

Liu, R. Y., Parelius, J. M. and Singh, K. (1999), 'Multivariate analysis by data depth: descriptive statistics, graphics and inference (with discussion and a rejoinder by Liu and Singh)', The Annals of Statistics 27(3), 783–858.

Liu, R. Y. and Singh, K. (1993), 'A quality index based on data depth and multivariate rank tests', Journal of the American Statistical Association 88(401), 252–260.

Mahalanobis, P. C. (1936), 'On the generalised distance in statistics', Proceedings of the National Institute of Sciences of India 2(1), 49–55.

Mardia, K., Kent, J. and Bibby, J. (1979), Multivariate Analysis, Probability and Mathematical Statistics, Academic Press.

Mosler, K. (2002), Multivariate Dispersion, Central Regions, and Depth: The Lift Zonoid Approach, Lecture Notes in Statistics, Springer, New York.

Mosler, K. (2013), Depth statistics, in C. Becker, R. Fried and S. Kuhnt, eds, 'Robustness and Complex Data Structures: Festschrift in Honour of Ursula Gather', Springer, Berlin, pp. 17–34.

Paindaveine, D. and Van Bever, G. (2013), 'From depth to local depth: A focus on centrality', Journal of the American Statistical Association 108(503), 1105–1119.

Reiter, J. P. and Raghunathan, T. E. (2007), 'The multiple adaptations of multiple imputation', Journal of the American Statistical Association 102(480), 1462–1471.

Rousseeuw, P. J. and Van Driessen, K. (1999), 'A fast algorithm for the minimum covariance determinant estimator', Technometrics 41(3), 212–223.

Rubin, D. B. (1996), 'Multiple imputation after 18+ years', Journal of the American Statistical Association 91(434), 473–489.

Schafer, J. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, CRC Press.

Seaman, S., Galati, J., Jackson, D. and Carlin, J. (2013), 'What is meant by missing at random?', Statistical Science 28(2), 257–268.

Stekhoven, D. J. and Buhlmann, P. (2012), 'MissForest – non-parametric missing value imputation for mixed-type data', Bioinformatics 28(1), 112–118.

Templ, M., Kowarik, A. and Filzmoser, P. (2011), 'Iterative stepwise regression imputation using standard and robust methods', Computational Statistics and Data Analysis 55(10), 2793–2806.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B. (2001), 'Missing value estimation methods for DNA microarrays', Bioinformatics 17(6), 520–525.

Tukey, J. W. (1974), Mathematics and the picturing of data, in R. D. James, ed., 'International Congress of Mathematicians 1974', Vol. 2, pp. 523–532.

van Buuren, S. (2012), Flexible Imputation of Missing Data, Chapman & Hall/CRC Interdisciplinary Statistics, Chapman and Hall/CRC.

Vardi, Y. and Zhang, C.-H. (2000), 'The multivariate L1-median and associated data depth', Proceedings of the National Academy of Sciences 97(4), 1423–1426.

Yeh, I.-C., Yang, K.-J. and Ting, T.-M. (2009), 'Knowledge discovery on RFM model using Bernoulli sequence', Expert Systems with Applications 36(3, Part 2), 5866–5871.

Zuo, Y. and Serfling, R. (2000), 'General notions of statistical depth function', The Annals of Statistics 28(2), 461–482.