low-rank model with covariates for count data with missing ...and focus on the estimation of x based...

37
Low-rank model with covariates for count data with missing values Genevi` eve Robin 1,2 , Julie Josse 1,2 , ´ Eric Moulines 1,2 Sylvain Sardy 3 1 CMAP, ´ Ecole Polytechnique 2 XPOP, INRIA 3 Department of Mathematics, Universit´ e de Gen` eve October 12, 2018 Abstract Count data are collected in many scientific and engineering tasks including image processing, single-cell RNA sequencing and ecologi- cal studies. Such data sets often contain missing values, for example because some ecological sites cannot be reached in a certain year. In addition, in many instances, side information is also available, for ex- ample covariates about ecological sites or species. Low-rank methods are popular to denoise and impute count data, and benefit from a sub- stantial theoretical background. Extensions accounting for covariates have been proposed, but to the best of our knowledge their theoretical and empirical properties have not been thoroughly studied, and few softwares are available for practitioners. We propose a complete methodology called LORI (Low-Rank Inter- action), including a Poisson model, an algorithm, and automatic se- lection of the regularization parameter, to analyze count tables with covariates. We also derive an upper bound on the estimation error. We provide a simulation study with synthetic data, revealing empir- ically that LORI improves on state of the art methods in terms of estimation and imputation of the missing values. We illustrate how the method can be interpreted through visual displays with the analy- sis of a well-know plant abundance data set, and show that the LORI 1

Upload: others

Post on 20-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

  • Low-rank model with covariates for count datawith missing values

    Geneviève Robin1,2, Julie Josse1,2, Éric Moulines1,2

    Sylvain Sardy3

    1CMAP, École Polytechnique2XPOP, INRIA

    3Department of Mathematics, Université de Genève

    October 12, 2018

    Abstract

    Count data are collected in many scientific and engineering tasksincluding image processing, single-cell RNA sequencing and ecologi-cal studies. Such data sets often contain missing values, for examplebecause some ecological sites cannot be reached in a certain year. Inaddition, in many instances, side information is also available, for ex-ample covariates about ecological sites or species. Low-rank methodsare popular to denoise and impute count data, and benefit from a sub-stantial theoretical background. Extensions accounting for covariateshave been proposed, but to the best of our knowledge their theoreticaland empirical properties have not been thoroughly studied, and fewsoftwares are available for practitioners.We propose a complete methodology called LORI (Low-Rank Inter-action), including a Poisson model, an algorithm, and automatic se-lection of the regularization parameter, to analyze count tables withcovariates. We also derive an upper bound on the estimation error.We provide a simulation study with synthetic data, revealing empir-ically that LORI improves on state of the art methods in terms ofestimation and imputation of the missing values. We illustrate howthe method can be interpreted through visual displays with the analy-sis of a well-know plant abundance data set, and show that the LORI

    1

  • outputs are consistent with known results. Finally we demonstrate therelevance of the methodology by analyzing a water-birds abundancetable from the French national agency for wildlife and hunting man-agement (ONCFS). The method is available in the R package lori onthe Comprehensive Archive Network (CRAN).

    1 Introduction

    Let Y be an n× p observation matrix of counts, R ∈ Rn×K1 and C ∈ Rp×K2be two matrices containing row and column covariates, respectively. In ourecological application in Section 6, rows of the contingency table representecological sites, and columns represent years. For (i, j) ∈ [n]× [p], Yij countsthe abundance of water-birds measured in site i during the year j. Therow feature Ri`, ` ∈ [K1] embeds geographical information about the site i(latitude, longitude, distance to coast, etc.) while the column feature Cj`,` ∈ [K2] codes meteorological characteristics of the year j (precipitation,etc.). In addition, some entries of Y are missing. For example, ecologicalsites are sometimes inaccessible because of meteorological or political condi-tions, and thus cannot be counted.

    Such count tables are often analyzed using low-rank models (1; 2; 3; 4; 5;6), imposing a low-rank structure to an underlying parameter matrix. Weassume a probabilistic framework with independent entries Yij following aPoisson model

    Yij ∼ P(eX∗ij), (i, j) ∈ [n]× [p], (1)

    and focus on the estimation of X∗ based on a low-rank assumption. Thegeneralized additive main effects and multiplicative interaction model, or row-column model (see, e.g., (2; 3)), assuming

    X∗ij = µ∗ + α∗i + β

    ∗j + Θ

    ∗ij, rank(Θ

    ∗) ≤ min(n− 1, p− 1), (2)

    is adapted to do so. In this model, µ∗ is an offset, the terms which only de-pend on the index of the row or column (α∗i and β

    ∗j ) are called main effects,

    and the terms which depend on both (here Θ∗ij) are called interactions (7,Section 4.1.2, p.87).

    A natural idea to incorporate covariates in this framework, is to expressthe row and column effects α∗i and β

    ∗j as regression terms on the covariates.

    2

  • In other words, for µ∗ ∈ R, α∗ ∈ RK1 , β∗ ∈ RK2 and Θ∗ ∈ Rn×p,

    X∗ij = µ∗+

    K1∑k=1

    Rikα∗k︸ ︷︷ ︸

    row effect

    +

    K2∑l=1

    Cilβ∗l︸ ︷︷ ︸

    column effect

    +Θ∗ij, rank(Θ∗) ≤ min(n−1, p−1). (3)

    Such an extension is useful in practice for two main reasons. First, estimatedcovariates coefficients (and in particular their signs) can be used to determinewhether the studied covariates have positive or negative effects on the counts;this is particularly useful in ecology, to check whether meteorological, geo-graphical or political conditions favor or endanger species. Second, when theproportion of missing values is large, which is often the case in bird monitor-ing, relying on (relevant) covariates can improve the imputation significantly.

    Models related to (3) have been considered for statistical ecology applica-tions in (8; 9). However, to the best of our knowledge, their theoretical andempirical properties have not been thoroughly studied. On the other hand,the literature on convex low-rank matrix estimation is abundant and benefitsfrom a substantial theoretical background, but few software with ready to usesolution are available for practitioners, and applications for count data out-side image analysis (10; 11; 12) and recommendation systems (13) have notbeen attempted. The scope of this paper is to develop a complete methodol-ogy for the inference of model (3), bridging the gap between convex low-rankmatrix completion and model-based count data analysis.

    After detailing related work in Section 1.1, we introduce in Section 2a general model which includes (3); we propose an estimation procedurethrough the minimization of a data fitting term penalized by the nuclear normof the matrix Θ, which acts as a convex relaxation of the rank constraint.Building up on existing results on nuclear norm regularized loss functions,we derive statistical guarantees in Section 2.1. In particular, we provide anupper bound for the Frobenius norm of the estimation error. In Section 3, wepropose an optimization algorithm, and an automatic selection of the regular-ization parameter. We provide a simulation study in Section 4 revealing thatLORI outperforms state-of-the-art methods when the proportion of missingvalues is large and the interactions are of significant order compared to themain effects. In Section 5, we show on plant abundance data with side infor-mation, how the results of our procedure can be interpreted through visual

    3

  • displays. In particular, the arising interpretation is consistent with knownresults from the original study (14). In Section 6, we use LORI to analyze awater-birds abundance data set from the French national agency for wildlifeand hunting management (ONCFS). The proofs of the statistical guaranteesare postponed to the appendix, and the method is available as an R package(15) called lori (LOw-Rank Interaction) on the Comprehensive R ArchiveNetwork (CRAN) at https://CRAN.R-project.org/package=lori.

    1.1 Related work

    Model (3) is closely related to other models previously suggested in the sta-tistical ecology literature to analyze count tables with row and column co-variates. For instance, (8) and (9) suggested the following model:

    X∗ij = µ∗ + α∗i + β

    ∗j + �RCRiCj, (4)

    with Ri, 1 ≤ i ≤ n a row trait and Cj, 1 ≤ j ≤ p a column trait. The inter-action between covariates is modeled by �RCRiCj, where �RC is an unknownparameter measuring the strength of the interaction between the two traits.The main difference with model (3) is that we incorporate the covariates inthe main effects rather than the interactions, which leads to different inter-pretations. In terms of estimation properties, the main advantage of (3) isthat, as long as K1 ≤ n and K2 ≤ p, we estimate less parameters. This is animportant point for us since in many applications we are interested in (seee.g. Section 6), a large proportion of entries is missing, limiting the amountof available data. Finally, model (4) was developed with the aim of testingsignificant associations between covariates, and its theoretical and empiricalestimation properties, as far as we know, were not studied.

    In the low-rank matrix completion literature, related approaches for countmatrix recovery and dimensionality reduction can be embedded within theframework of low-rank exponential family estimation (16; 17; 18; 19; 20) aswell as its Bayesian counterpart (21; 22). In terms of statistical guarantees,the theoretical performance of nuclear norm penalized estimators for Poissondenoising has been studied in (12), where the authors prove uniform boundson the empirical error risk. Estimation rates are also given in (23), whereoptimal bounds are proved for matrix completion in the exponential family.These two papers do not account for available covariates.

    4

    https://CRAN.R-project.org/package=lori

  • More recently, (24) developped a probabilistic PCA framework for theexponential family, where covariates can be included in the parameter space.(25) present a variety of low-rank problems including the generalized nuclearnorm penalty (26), that can be used to include row and column covariates.Similar estimation problems were also considered, e.g., in (27; 28). However,to the best of our knowledge, these papers did not provide statistical guar-antees and the practical advantages of such extensions compared to classicallow-rank methods have not been thoroughly studied.

    2 General model and estimation

    We now introduce a general version of the model described in the previoussection. First, we relax the Poisson model and replace it with the followingassumption on the distribution of Yij, (i, j) ∈ [n]× [p].

    H 1. The random variables Y = {Yi,j}(i,j)∈[n]×[p] are independent and thereexist γ > 0, σ− > 0 and σ+ 2 ,P : X ∈ Rn×p 7→ X −P⊥(X), X0 ⊂ {X ∈ Rn×p;P⊥(X) = 0} and T = {X ∈Rn×p;P(X) = 0}. We denote

    r = max ({rank(A) : A ∈ X0}) . (6)

    Consider the following decomposition:

    X∗ = X∗0 + Θ∗, where X∗0 ∈ X0 and Θ∗ ∈ T . (7)

    5

  • Denote, for m ≥ 1, 1m the vector of ones of length m. Model (3) is includedin (7) by setting S1 = {u ∈ Rn;1>nu = 0}, S2 = {v ∈ Rp;1>p v = 0}, and

    X0 =

    (µ+

    K1∑k=1

    Rikαk +

    K2∑k=2

    Cikβk

    )(i,j)∈[n]×[p]

    ;µ ∈ R, α ∈ RK1 , β ∈ RK2

    .The dimension of this subspace is at most 1 + K1 + K2 and the rank of amatrix in X0 is less than 3.

    We finally consider a setting with missing observations. Denote by Ω ⊂[n] × [p] the set of observed entries: (i, j) ∈ Ω if and only if Yij is observed.Define also the random variables (ωij) such that ωij = 1 if Yij is observedand ωij = 0 otherwise. We assume that (ωij) and Y are independent, anda Missing Completely At Random (MCAR) scenario (29) where (ωij) areindependent Bernoulli random variables. For (i, j) ∈ {1, . . . , n}×{1, . . . , p},we denote πij = P(ωij = 1). We assume the probability of observing anyentry is positive, i.e. there exists π > 0 such that

    min {πij : (i, j) ∈ [n]× [p]} = π > 0 . (8)

    For j ∈ [p], denote by π.j =∑n

    i=1 πij the probability of observing an elementin the j-th column. Similarly, for i ∈ [n], denote by πi. =

    ∑pj=1 πij the

    probability of observing an element in the i-th row. We define the followingupper bound:

    max ({πi. : i ∈ [n]} ∪ {π.j : j ∈ [p]}) ≤ β . (9)

    We can now define our data-fitting term:

    L(X) =∑

    (i,j)∈[n]×[p]

    ωij {−YijXij + exp(Xij)} . (10)

    Denote ‖ · ‖ the operator norm (the largest singular value), ‖ · ‖∞ the infinitynorm (the largest entry in absolute value) and ‖·‖∗ the nuclear norm (the sumof singular values). Our estimator of model (3), for a given regularizationparameter λ > 0, is the minimizer of the data-fitting term (10) penalized bythe nuclear norm of Θ:

    (X̂0, Θ̂) ∈ argmin L(X0 + Θ) + λ‖Θ‖∗,such that ‖X0 + Θ‖∞ ≤ γ, (i, j) ∈ [n]× [p]

    X0 ∈ X0, Θ ∈ T .(11)

    6

  • Denote by ∇L(X) =∑

    (i,j)∈[n]×[p] ωij {−Yij + exp(Xij)}Eij, where (Eij) arethe matrices of the canonical basis of Rn×p, the gradient of L at X. De-note also ∂2L/∂x2ij the second derivative of L with respect to the (i, j)-thcoordinate. Consider the following condition:

    H 2. The function L is strongly convex and smooth on [−γ − ε, γ + ε]n×pfor some ε > 0. There exist σ− > 0 and σ+ < ∞ such that for all X ∈[−γ − ε, γ + ε]n×p and (i, j) ∈ [n]× [p], σ2− ≤ ∂2L(X)/∂x2ij ≤ σ2+.

    2.1 Statistical guarantees

    We now derive an upper bound on the Frobenius estimation error of estima-tor (11). Let (�ij), (i, j) ∈ [n] × [p] be i.i.d. Rademacher random variablesindependent of Y and Ω and define

    ΣR =∑

    (i,j)∈[n]×[p]

    �ijωijEij. (12)

    Theorem 1. Assume H 1-2, and λ ≥ 2‖∇L(X∗)‖. Then for all n, p ≥ 1,with probability at least 1− 8(n+ p)−1,∥∥∥X∗ − X̂∥∥∥2

    F≤ Cπ2

    ([λ2

    σ4−+ (E‖ΣR‖)2γ2

    ](rank(Θ∗) + r) + log(n+ p)

    ),

    (13)where r, γ are defined in (6) and (11), and C is a numerical constant whosevalue can be found in the proof and which is independent of n, p and X∗.

    Proof. The proof is postponed to Section 8.1.

    We then control E‖ΣR‖, and compute a value of λ such that the conditionλ ≥ 2‖∇L(X∗)‖ holds with high probability. We will need the followingadditional assumption on the distribution of the counts:

    H 3. There exists δ > 0 such that for all (i, j) ∈ [n]× [p],E [exp(|Yij|/δ)] < +∞ .

    Define the following quantities, with C∗ a numerical constant defined inLemma 5 and r, β and γ defined in (6), (9) and (11) respectively:

    Φ1 = 48σ2+β log(n+ p),

    Φ2 = 36δ2(e− 1)2 log1

    (1 + 8δ2

    np

    βσ2−

    )log2(n+ p),

    Φ3 = 4C∗2 max(β2, log{min(n, p)}).

    (14)

    7

  • Theorem 2. Assume H 1-H 3 and set

    λ = max

    {4σ+

    √3β log(n+ p), 12δ(e− 1) log

    (1 + 8δ2

    np

    βσ2−

    )log(n+ p)

    }.

    Then with probability at least 1− 10(n+ p)−1,∥∥∥X∗ − X̂∥∥∥2F≤ Cπ2{(max(Φ1,Φ2) + Φ3) (rank(Θ∗) + r) + log(n+ p)} , (15)

    where C is a numerical constant independent of n, p and X∗.

    Proof. The proof is postponed to Section 8.2.

    We recover an upper bound of order rank(Θ∗)β/π2, which is classical inlow-rank matrix estimation and completion (30; 23), and equal to rank(Θ∗) max(n, p)/πwhen the sampling is almost uniform (c1π ≤ πij ≤ c2π). The additional termrβ/π2 accounts for explicit modeling of the covariates in the main effects. Theconstant term appearing in bound (15) grows linearly with the upper boundσ2+ and quadratically with the inverse of σ

    2−. This means that by relaxing

    Assumption 1 to allow var(Yij) to grow as fast as log(n + p) or decrease asfast as 1/ log(n+ p), we only lose a log-polynomial factor in bound (15).

    3 Algorithm and selection of λ

    3.1 Optimization algorithm

    In this section, we propose an algorithm to solve (11) for the model

    X∗ij = µ∗ +

    K1∑k=1

    Rikα∗k +

    K2∑l=1

    Cilβ∗l + Θ

    ∗ij

    using alternating minimization (31), which consists in updating µ, α, β and Θalternatively, each time along a descent direction. Note that, in the algorithmand the entire numerical section, we relax the constraint |µ+Ri,.α+Cj,.β +Θij| ≤ γ. Indeed, this constraint is mainly required to obtain statisticalguarantees, and we observed that in practice, for γ sufficiently large, thisconstraint is never reached. Denote

    F(µ, α, β,Θ) = L((µ+Ri,.α + Cj,.β)i,j + Θ),

    8

  • and∇ΘF the gradient of F with respect to Θ defined by (∇ΘF(µ, α, β,Θ))ij =−Yij + exp(µ + Ri,.α + Cj,.β + Θij) if ωij = 1 and (∇ΘF(µ, α, β,Θ))ij = 0otherwise. At every iteration we solve three sub-problems. The sub-problemin µ can be solved in closed form at each iteration; the updates in α andβ can be done simultaneously by estimating a Poisson generalized linearmodel (which can be done using standard algorithms implemented in avail-able libraries); the update in Θ is along the proximal gradient direction,with a step size tuned using backtracking line search. Denote by Dλ thesoft-thresholding operator of singular values at level λ (32, Section 2). Theprocedure is sketched in Algorithm 1.

    Algorithm 1: Alternating minimization for problem (11)

    1 Initialize µ[0], α[0], β[0], Θ[0]

    2 For t = 0, 1, . . . , T − 1• µ[t+1] ∈ argminF(µ, α[t], β[t],Θ[t]),

    • (α[t+1], β[t+1]) ∈ argminF(µ[t], α, β,Θ[t]),

    • τ = 1,

    • Θ[t+1] = Dλ[Θ[t] − τP{∇ΘF(µ[t], α[t], β[t],Θ[t])}]

    • While F(µ[t+1], α[t+1], β[t+1],Θ[t+1]) + λ‖Θ[t+1]‖∗ >F(µ[t+1], α[t+1], β[t+1],Θ[t]) + λ‖Θ[t]‖∗:

    – τ = τ/2

    – Θ[t+1] = Dλ[Θ[t] − τP{∇ΘF(µ[t], α[t], β[t],Θ[t])}].

    Output µ[T ], α[T ], β[T ], Θ[T ]

    Note that if K1 + K2 > |Ω|, with |Ω| denoting the cardinality of Ω, theupdate in α and β does not have a unique solution. However in our targetedapplications, typically K1 + K2 � |Ω|. In the package lori, we addition-ally implemented a warm-start strategy (33), which consists in solving (11)for a large value of λ, then sequentially decreasing λ and solving the newproblem using the previous estimate as a starting point. Thus, our imple-mentation solves (11) for the entire regularization path at once. Note that,even though our theoretical guarantees require a MCAR mechanism, the esti-mation method still holds when entries are missing at random ((29), Section1.3). Its imputation properties are illustrated in an ecological application inSection 6.

    9

  • 3.2 Automatic selection of λ

    A common way to select the regularization parameter is cross-validation,which consists in erasing a fraction of the observed cells in Y , estimatinga complete parameter matrix X̂ for a range of λ values, and choosing theparameter λ that minimizes the imputation error. This can be performeddirectly using LORI without modifying the code. Indeed, let (ω̃ij) denotethe weights in {0, 1} indicating which entries are observed after removingsome of them for cross-validation, and denote

    F̃(µ, α, β,Θ) =∑

    (i,j)∈[n]×[p]

    ω̃ij{−Yij(µ+Ri,.α+Cj,.β+Θij)+exp(µ+Ri,.α+Cj,.β+Θij)}.

    The optimization problem becomes

    (µ̂, α̂, β̂, Θ̂) ∈ argmin F̃(µ, α, β,Θ) + λ‖Θ‖∗,Θ ∈ T

    (16)

    which can be solved using the method described in Section 3.1 (see Algo-rithm 1). However, cross-validation is computationally costly. We suggestan alternative method, inspired by (34) and the work of (35) on quantile uni-versal thresholds. In Theorem 3 below, we define a null-thresholding statisticof estimator (11), a function of the data λ0(Y ) for which the estimated in-teraction matrix Θ̂λ0(Y ) is null, and the same estimate Θ̂λ = 0 is obtained forany λ ≥ λ0(Y ).

    Theorem 3 (Null-thresholding statistic). The estimated interaction matrixΘ̂λ for a regularization parameter λ is null if and only if λ ≥ λ0(Y ), whereλ0(Y ) is the null-thresholding statistic

    λ0(Y ) =∥∥∥∇L(X̂0)∥∥∥ , where X̂0 ∈ argminX∈X0 L(X). (17)

    Proof. The proof is postponed to Section 8.3.

    We propose a heuristic selection of λ based on this null-thresholdingstatistic λ0(Y ). To explain further the method, we first need to define thefollowing procedure:

    H0 : Θ∗ = 0 against the alternative H1 : Θ

    ∗ 6= 0, (18)

    10

  • which tests whether the parameter matrix X∗ can be explained only interms of linear combinations of the measured covariates. For a probabilityε ∈ (0, 1), consider the upper ε-quantile λε of the null-thresholding statis-tics, namely that satisfies PH0 (λ0(Y ) > λε) < ε. The test which consistsin comparing the statistics λ0(Y ) to λε is of level 1 − ε for (18). This canbe seen as an alternative to the χ2 test for independence which handles co-variates. However, we do not have access to the distribution under the nullPH0 (λ0(Y ) < λ), and we perform parametric bootstrap (36) to compute aproxy λ̌ε. In practice we recommend ε = 0.05 and use λQUT := λ̌.05; we referto it in what follows as quantile universal threshold (QUT). This selectionof the regularization parameter is essentially the universal threshold of (34)extended to our setting.

    4 Simulation study

    4.1 Estimation

    We simulate Y ∈ N300×30 under model (1)–(3), with R ∈ R500×3 and C ∈R300×4 drawn from multivariate Gaussian distributions with mean 0 andblock-diagonal covariance matrices ΣR and ΣC respectively. We set µ

    ∗ = 1,α∗ = (2, 0, 0), β∗ = (−2, 0, 0, 0) and Θ of rank 5. We compare the perfor-mance of LORI in terms of estimation of the regression coefficients α and β,and compare it to a standard Poisson Generalized Linear Model (GLM) es-timated with the glm function in R. We repeat the experiment 100 times fordecreasing values of the ratio τ = ‖Θ‖F/‖X0‖F , where X0 = (µ+Riα+Cjβ)ijis fixed. We look at the Root Mean Square Error (RMSE) for the estimationof X0; the results are given in Table 1, where we observe that LORI and thePoisson GLM are equivalent for τ = 0, and that LORI outperforms the GLMfor non-zero interactions, with a gap widening as τ increases.

    Second, we compare LORI to a convex low-rank matrix estimation pro-cedure with a Poisson loss function and where covariates are not modeled(e.g., (23)) in terms of the relative estimation error ‖X̂ −X∗‖F/‖X∗‖F (be-cause ‖X‖F varies with τ). We refer to this competitor as ”Poisson LRM”.Again, we reproduce the experiment 100 times for decreasing values of theratio τ = ‖Θ‖F/‖X0‖F . On Table 2, we observe that LORI achieves lowererrors than Poisson LRM, which is expected as we simulated under the LORI

    11

  • τ Mean of RMSE∗100 Standard deviation of RMSE∗100

    1LORI 52 1.6GLM 143 29

    0.5LORI 17 0.9GLM 22 1.0

    0.25LORI 4.3 0.37GLM 5.3 0.33

    0.1LORI 1.7 0.34GLM 2.6 0.34

    0LORI 0.88 0.2GLM 0.87 0.19

    Table 1: Estimation error (RMSE) of regression coefficients√‖α̂− α∗‖22 + ‖β̂ − β∗‖22 of LORI and a Poisson GLM, for decreasing

    values of τ = ‖Θ‖F/‖X0‖F .

    model. As τ decreases – i.e. the size of the main effects increases relative tothe interactions – both errors decrease as well, and the gap between LORIand the Poisson LRM widens, indicating that modeling covariates explicitlyimproves the estimation.

    4.2 Imputation

    Using the same simulation scheme, we now compare LORI in terms of miss-ing values imputation to Correspondence Analysis (CA) (1), the equivalentof PCA for discrete data, and which shares similarities with log-linear models(6). We also compare LORI to Trends & Indices for Monitoring data (TRIM)(37), a method based on a Poisson log-linear model used to impute bird abun-dance data. To do so we erase an increasing proportion of entries in the dataand impute them using LORI, CA and TRIM, replicating the experiment100 times. We also impute the missing values using the column means, as abaseline referred to as ”MOY”. We observe on Figure 1 that LORI performsbest, which is expected as we simulate under the LORI model. Moreover,the gap widens as the percentage of missing values increases. In particular,the error of TRIM for 80% of missing entries is not represented because themethod fails (we use the default parameters of the R package rtrim).

    12

  • τ Mean of Relative RMSE∗100 Standard deviation of Relative RMSE∗100

    1LORI 82 1.8

    Poisson LRM 95 4.4

    0.5LORI 40 0.25

    Poisson LRM 50 0.17

    0.25LORI 24 0.12

    Poisson LRM 40 0.12

    0.1LORI 11 0.081

    Poisson LRM 36 1.5

    0LORI 4.3 0.1

    Poisson LRM 34 1.2

    Table 2: Estimation error (Relative RMSE) of parameter matrix ‖X̂ −X∗‖F/‖X∗‖F of LORI and a Poisson GLM, for decreasing values of τ =‖Θ‖F/‖X0‖F .

    LORI CA TRIM MOY

    100

    200

    300

    400

    500

    600

    700

    Imputation error 20% NA

    (a) 20% of missing entries

    LORI CA TRIM MOY

    020

    040

    060

    080

    0

    Imputation error 40% NA

    (b) 40% of missing entries

    LORI CA TRIM MOY

    100

    200

    300

    400

    500

    600

    700

    Imputation error 60% NA

    (c) 60% of missing entries

    LORI CA MOY

    100

    200

    300

    400

    500

    Imputation error 80% NA

    (d) 80% of missing entries

    Figure 1: Average imputation error∑

    (i,j)∈Ω(Yij − Ŷij)2/|Ω|.13

  • 5 Analysis of the Aravo data

    The Aravo data set (14) counts the abundance of 82 species of alpine plantsin 75 sites in France; covariates about the environments and species are alsoavailable. We focus on 8 species traits providing physical information aboutplants (height, spread, etc.), and 4 environmental variables giving geograph-ical and meteorological information about sites. We apply our method LORIafter scaling the covariates and tuning the regularization parameter with theQUT method. This results in estimates for the main effects of the envi-ronment characteristics α and of the species traits β, and of the interactionmatrix Θ.

    Aspect Slope PhysD Snow0.04 0.07 -0.02 -0.07

    Table 3: Main effect of the Aravo environment characteristics estimated withLORI. The regularization parameter is tuned using QUT.

    Height Spread Angle Area Thick SLA Nmass Seed0.09 -0.24 -0.18 -0.20 -0.11 -0.17 0.18 -0.12

    Table 4: Main effect of the Aravo species traits estimated with LORI. Theregularization parameter is tuned using QUT.

    The main effects of environment characteristics are given in Table 3 andthe main effects of the species traits in Table 4. First we observe that over-all, species traits have larger effects than environment characteristics on theobserved abundances. In particular, the mass-based leaf nitrogen content(Nmass) has a large positive effect, which seems to indicate that plants witha large Nmass tend to be more abundant across all environments. On theother hand, the maximum lateral spread of clonal plants (Spread), area ofsingle leaf (Area) leaf elevation angle estimated at the middle of the lamina(Angle) and specific leaf area (SLA) have large negative effects on the abun-dances.

    The estimated rank of the interaction matrix Θ̂ (number of singular val-ues above 10−6) is 2. The environments (rows) and species (columns) can

    14

  • −0.02 0.00 0.02 0.04 0.06 0.08

    −0.0

    3−0

    .01

    0.00

    0.01

    0.02

    0.03

    0.04

    2D Display plot of interaction directions

    Dim 1 (93.73%)

    Dim

    2 (6

    .27%

    )

    AR07

    AR71

    AR26

    AR54AR60

    AR70AR25AR27

    AR72

    AR01AR23 AR13 AR39

    AR40

    AR03

    AR20

    AR42AR44

    AR41

    AR17

    Alop.alpi

    Care.rosa

    Care.foet

    Care.rupe

    Fest.quad

    Kobr.myosPoa.alpi

    Alch.pent

    Bart.alpi

    Leuc.alpi Geum.mont

    Omal.supi

    Leon.pyre

    Minu.sedo

    Plan.alpi

    Pote.aure

    Ranu.kuep

    Sagi.glab

    Sali.herb

    Sibb.proc

    Figure 2: Display of the two first dimensions of interaction estimated withLORI. Environments are represented with blue points and species with redtriangles.

    be visualized on a biplot (38, Section 2.5), where rows and columns are rep-resented simultaneously in a normalized Euclidean space. In such plots, thedimensions of the Euclidean space are given by the principal directions of Θ̂,scaled by the square root of the singular values of Θ̂. Figure 2 shows sucha display, which can be interpreted in terms of distance between points: aspecies and an environment that are close interact highly, and two species ortwo environments that are close have similar profiles. Justifications for such adistance interpretation can be found in (38, Section 2.5) or (6, Section 2). Wecan then look at the relations between the known traits and the interactiondirections of Θ̂. Figure 3a shows that the two first directions of interactionare correlated with the species covariates; the correlation is particularly highfor the Nmass and SLA variables. Thus, on Figure 2, the two directions sepa-rate the plants with large SLA and Nmass (top right corner) from those withsmall SLA and Nmass (bottom left corner). Then, Figure 3a shows that thedirections of interaction are also correlated with the environment covariates,and particularly with the mean snowmelt date (Snow). Thus, on Figure 2,the two directions separate the late melting environments (top right corner)from the early melting environments (bottom left corner). Combining the

    15

  • −1.0 −0.5 0.0 0.5 1.0

    −1.

    0−

    0.5

    0.0

    0.5

    1.0

    Dim.1

    Dim

    .2

    HeightSpread

    Angle

    Area Thick

    SLA

    N_mass

    Seed

    (a) Environment covariates

    −1.0 −0.5 0.0 0.5 1.0

    −1.

    0−

    0.5

    0.0

    0.5

    1.0

    Dim.1

    Dim

    .2 Aspect

    Slope

    PhysD

    Snow

    (b) Species traits

    Figure 3: Correlation between the two first dimensions of interaction and thecovariates (the covariates are not used in the estimation).

    16

  • Agricultural surface Latitude Dist. to town Dist. to coast Surface0.09 -0.21 0.09 -0.48 0.20

    Table 5: Main effect of the sites characteristics estimated with LORI. Theregularization parameter is tuned using QUT.

    interpretation of Figure 2, Figure 3a and Figure 3b, we deduce that plantswith large Nmass and SLA interact highly with late melting sites (large valueof Snow). This was in fact the main result obtained in the original study(14) (see, e.g., the summary of findings in the abstract), which advocates thegood properties of LORI in terms of interpretation.

    6 Using covariates to impute ecological data

    The water-birds data count the abundance of migratory water-birds in 785wetland sites (across the 5 countries in North Africa), between 1990 and 2017(39). One of the objectives is to assess the effect of time on species abun-dances, to monitor the populations and assess wetlands conservation policies.Ornithologists have also recorded side information concerning the sites andyears, which may influence the counts. For instance, meteorological anoma-lies, latitude and longitude. The count table contains a large amount ofmissing entries (70%), but the covariate matrices which contain respectively6 covariates about the 785 sites and 8 covariates about the 28 years, are fullyobserved. Our method allows to take advantage of the available covariatesto provide interpretation for spatio-temporal patterns. As a by-product, itproduces an imputed contingency table.

    Tables 5 and 6 show the estimated main effects of some of the sites andyears characteristics. Sites with large latitudes are associated to smallercounts, as well as sites which are far from the coast. Sites which are locatedfar from towns, and sites with large water surfaces are associated to largercounts. The four year covariates given in Table 6 concern meteorologicalanomalies. The associated coefficients are all negative, indicating that moreimportant anomalies are associated to smaller abundances.

    The sites and years can also be displayed using the same visual tools asdescribed in Section 5. Figure 4a and 4b show the correlations between the

    17

  • Spring N/O Spring N/E Winter S/O Winter S/E-0.05 -0.03 -0.05 -0.03

    Table 6: Main effect of the years characteristics estimated with LORI.

    covariates and the directions of interaction. On the two-dimensional displayon Figure 4c, the first dimension is correlated with geographical characteris-tics, and the second dimension with meteorological anomalies. We observe avery clear temporal gradient along the second dimension, indicating that overtime, meteorological anomalies increase (in the sense of a summary anomalyvariable embodied by the second direction). One of the site (378) lays out ofthe point cloud, and corresponds to a site with very large surface.

    LORI also returns counts estimates, which can be used to compute an es-timation of the total yearly abundances (i.e. counts estimates summed acrosssites). To better assess the temporal trend, one can decompose the estimatedcounts into three factors corresponding to the site effects, year effects andinteractions respectively. Indeed, for (i, j) ∈ {1, . . . , n}× {1, . . . , p}, one canwrite

    exp(X̂ij) = exp(µ̂) exp(Ri,.α̂) exp(Cj,.β̂) exp(Θ̂ij).

    Figure 5 shows the last three factors of this decomposition separately.

    On Figure 5a we see that most sites have multiplicative effects around1 on count scale. One site (site 378, large red point) stands out; again, itcorresponds to an extremely large site (6000km2, 5 times larger than the sec-ond, 300 times larger than the mean). In this respect, the row effects act asnormalization factors accounting for surface. We also observe tenuous levelsalong the x axis, corresponding to sites of different countries. On Figure 5bwe observe a decreasing temporal trend. This means that, all other thingsbeing equal, later years tend to produce smaller abundances. As illustratedin Figure 4b, this temporal trend can be associated with the effects of mete-orological anomalies. Note that the temporal effects (top right) are smallerin amplitude than the spatial effects (top left). Indeed, more variability isobserved between sites in a given year, than between years for a given site.

    Finally, looking at the interaction matrix on Figure 5c, we see that theinteraction is mainly driven by a few sites which interact more or less highly

    18

  • −1.0 −0.5 0.0 0.5 1.0

    −1.

    0−

    0.5

    0.0

    0.5

    1.0

    Dim.1

    Dim

    .2

    LatitudeDist_towns (m)

    Dist_coast (m)

    Surfaces_eau (km²)

    (a) Correlation between sites charac-teristics and the two first directionsof interaction.

    −1.0 −0.5 0.0 0.5 1.0

    −1.

    0−

    0.5

    0.0

    0.5

    1.0

    Dim.1

    Dim

    .2

    Spring N/0

    Spring N/E

    Winter S/0Winter S/ESpring N/0

    Spring N/EWinter S/0Winter S/E

    (b) Correlation between year charac-teristics and the two first directionsof interaction.

    −0.4 −0.2 0.0 0.2 0.4

    −0.2

    0.0

    0.2

    0.4

    CA factor map

    Dim 1 (30.42%)

    Dim

    2 (1

    8.19

    %)

    1 23456

    7 89101112131415161718192021222324252627

    282930313233343536

    37383940414243444546

    47

    48495051525354

    55

    56

    5758596061

    6263

    64

    65666768697071727374

    75

    7677

    78

    79

    80

    81

    82

    838485

    86

    87888990919293949596979899100101102103104105106107108

    109110111112113114115

    116117118119120121122123124125126127128

    129130131132133134135136137138139140141142143144145146147148149150

    151

    152153154155156157158159160161162

    163164165166167168

    169170171172173

    174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207

    208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327

    328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373

    374375

    376

    377378 379

    380381382383384385386387388389

    390391392

    393394395396397398399400401402403

    404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465

    466467468469470471472473474475476477478479480481482483484485486487488489490491492

    493

    494495496497498499500501502

    503504

    505506507508

    509510511512513514515

    516

    517518519520521522523524525526527

    528529530531532533534535536537538539540541542543544

    545546547548549550551552553554555

    556557558559560

    561562563564565

    566567

    568

    569

    570571572573574

    575576

    577578579580581 582583

    584585

    586

    587588589590591592593594595596

    597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648

    649650651652653654655656657658659660661662663664665

    666

    667668669670671672

    673

    674675

    676

    677678679680681682683684685

    686687688689690691692

    693

    694695696

    697

    698

    699700

    701702

    703

    704705

    706707708

    709710

    711712713714715

    716

    717718719720721722

    723

    724725

    726

    727728729730

    731

    732733734735736

    737

    738739740741

    742743

    744

    745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785

    1990199119921993

    1994

    1995

    1996

    19971998

    19992000

    20012002

    2003

    2004

    200520062007

    20082009

    20102011

    2012

    2013

    2014

    2015

    2016

    2017

    (c) Display of the two first dimensions of interaction estimatedwith LORI. Environments are represented with blue points andyears with red triangles.

    Figure 4: Visual display of LORI results for the water-birds data.

    with every year. In particular the site 582 presents large interactions withevery year. It corresponds to a national park in Morocco, the most abundantsite with 4 million birds in total (twice as much as the second most abundant,120 more than the mean). Here, the large abundance is not explained by the

    19

  • geographical covariates available (but maybe by other unmeasured factorssuch as protection legislations), and thus the extreme abundance is capturedin the interaction rather than the main effects. This profile was also visibleon Figure 4c where the site 582 lies amongst the cloud of years, indicatinglarge interactions through small Euclidean distances.The site 704 also presents large interactions, but they are not constantthroughout the years. It corresponds to Ichkeul national park in Tunisia,which is a major site for most species; the abundances are very large inIchkeul compared to other sites. However during several years including2007, bad weather conditions prevented ornithologist to correctly count thebirds, thus reported counts are significantly lower than expected. This ex-plains the drop in the interaction in 2007 for Ichkeul, corresponding to anoutlier behavior. Again, such a profile could not be highlighted withoutmodeling interactions. This illustrates one of the advantages of LORI forsuch bird abundance data compared to state-of-the-art methods such as (37)which do not model interactions. In particular, in most cases the interactionterms absorb outlying values (small or large), and indirectly account for theover-dispersion which is known to occur in birds abundance data.

    7 Discussion

    We conclude by discussing some opportunities for further research. To selectcovariates, we could penalize the main effects with an `1 penalty on α and β.It may be also of interest to consider other sparsity inducing penalties. Inparticular, penalizing the Poisson log-likelihood by the absolute values of thecoefficients of the interaction matrix Θ could possibly lead to solutions wheresome interactions are driven to 0 and a small number of large interactionsare selected. Secondly, it would be useful to develop a multiple imputationprocedure based on LORI, to provide confidence regions for the estimatedparameters. The properties of the thresholding test, which can be seen as analternative to a chi-squared test for independence with covariates, also meritfurther investigation. In particular, the power could be assessed. Finally, wecould also explore whether our model could be extended to more complexmodels such as the zero-inflated negative binomial models.

    20

  • Acknowledgments

    The authors thank Trevor Hastie, Edgar Dobriban, Olga Klopp, Kevin Bleakeyand Stéphane Dray for their very helpful comments on this manuscript. Wewould like to thank all the organizations involved in the Mediterranean Wa-terbirds Network (MNW) who provided the waterbird data set used in thisarticle: the GREPOM/BirdLife Morocco, the Direction Générale des Forêts(Algeria), the AAO/BirdLife Tunisia, the Libyan Society for Birds, the Egyp-tian Environment Affairs Agency, the national agency for wildlife and hunt-ing management (ONCFS, France) and the Research Institute of the Tourdu Valat (France). We are also grateful to all the field observers who partic-ipated in the North African IWC thus making this data set so rich, and toPierre Defos du Rau and Laura Dami for their help.

    8 Proofs

    8.1 Proof of Theorem 1

    We will first derive an upper bound for∑

    (i,j)∈Ω(X̂ij − X∗ij)2, then control‖X̂ − X∗‖2F by ‖X̂ − X∗‖2F ≤

    ∑(i,j)∈Ω(X̂ij − X∗ij)2 + D, with D a residual

    term defined later on. By definition of X̂ = X̂0+Θ̂, L(X̂)+λ‖Θ̂‖∗ ≤ L(X∗)+λ‖Θ∗‖∗. Using the strong convexity of L and substracting 〈∇L(X∗), X̂−X∗〉on both sides of this inequality, we obtain

    σ2−∑

    (i,j)∈Ω(X̂ij −X∗ij)2

    2≤ −〈∇L(X∗), X̂ −X∗〉︸ ︷︷ ︸

    I

    +λ(‖Θ∗‖∗ − ‖Θ̂‖∗)︸ ︷︷ ︸II

    . (19)

    We will bound separately the two terms on the right hand side of (19).

    Given a matrix X ∈ Rn×p, we denote S1(X) (resp. S2(X)) the span ofleft (resp. right) singular vectors of X. Let P⊥S1(X) (resp. P

    ⊥S2(X)) be the

    orthogonal projector in Rn on S1(X)⊥ (resp. in Rp on S2(X)⊥). We definethe projection operator in Rn×p P⊥X : X̃ 7→ P⊥S1(X)X̃P

    ⊥S2(X), and PX : X̃ 7→

    X̃ − P⊥S1(X)X̃P⊥S2(X). We use the following Lemma, proved in (23, Lemma

    16).

    Lemma 1. For all M and M ′ in Rn×p,

    21

  • (i) ‖M + P⊥M(M)‖∗ = ‖M‖∗ + ‖P⊥M(M)‖∗,

    (ii) ‖M‖∗ − ‖M ′‖∗ ≤ ‖PM(M −M ′)‖∗ − ‖P⊥M(M −M ′)‖∗,

    (iii) ‖PM(M −M ′)‖∗ ≤√

    2rk(M)‖M −M ′‖F .

    Using |〈∇L(X∗), X̂ − X∗〉| ≤ ‖X̂ − X∗‖∗‖∇L(X∗)‖ and the triangularinequality gives that

    I ≤ ‖∇L(X∗)‖(‖PΘ∗(Θ̂−Θ∗)‖∗ + ‖P⊥Θ∗(Θ̂−Θ∗)‖∗ + ‖X̂0 −X∗0‖∗

    ). (20)

    Then, Lemma 1 (ii) applied to Θ̂ and Θ∗, results in

    II ≤ λ(‖PΘ∗(Θ̂−Θ∗)‖∗ − ‖P⊥Θ∗(Θ̂−Θ∗)‖∗

    ). (21)

    Plugging inequalities (20) and (21) in (19) we obtain

    σ2−∑

    (i,j)∈Ω

    (X̂ij −X∗ij)2 ≤ 2(λ+ ‖∇L(X∗)‖)∥∥∥PΘ∗(Θ̂−Θ∗)∥∥∥

    + 2(‖∇L(X∗)‖ − λ)‖P⊥Θ∗(Θ̂−Θ∗)‖∗ + 2‖∇L(X∗)‖‖X̂0 −X∗0‖∗. (22)

    We now use the condition λ ≥ 2‖∇L(X∗)‖ in (22):

    σ2−∑

    (i,j)∈Ω

    (X̂ij −X∗ij)2 ≤ 3λ‖PΘ∗(Θ̂−Θ∗)‖∗ + λ‖X̂0 −X∗0‖∗. (23)

    Then, rk(X̂0 − X∗0 ) ≤ r and ‖X̂0 − X∗0‖F ≤ ‖X̂ − X∗‖F imply that ‖X̂0 −X∗0‖∗ ≤

    √r‖X∗− X̂‖F , which together with Lemma 1 (iii) and ‖Θ̂−Θ∗0‖F ≤

    ‖X̂ −X∗‖F yields

    σ2−∑

    (i,j)∈[n]×[p]

    ωij(X̂ij −X∗ij)2 ≤ λ(

    3√

    2 rank(Θ∗) +√r)‖X̂ −X∗‖F . (24)

    We now derive the upper bound ‖X̂−X∗‖2F ≤∑

    (i,j)∈[n]×[p] ωij(X̂ij−X∗ij)2+D.Define η = 72 log(n+ p)/(π log(6/5)),

    Σ(ω,X) =∑

    (i,j)∈[n]×[p]

    ωijX2ij (25)

    22

  • and the set

    C(η, ρ) ={X ∈ Rn×p; ‖X‖∞ ≤ 1, ‖X‖∗ ≤

    √ρ‖X‖F ,E [Σ(ω,X)] > η

    }.(26)

    We start by showing in the following Lemma that whenever X̂−X∗ belongs toC(η, ρ) (for ρ and D defined later on), a restricted strong convexity propertyof the form ‖X̂ −X∗‖2F ≤

    ∑(i,j)∈[n]×[p] ωij(X̂ij −X∗ij)2 + D holds. Define

    ς = 96π−1[ρ(E‖ΣR‖)2 + 8]. (27)

    Lemma 2. Let η = 72 log(n + p)/(π log(6/5)) and ρ > 0. With probabilityat least 1− 8(n+ p)−1, for all X ∈ C(η, ρ) we get

    |Σ(ω,X)− E [Σ(ω,X)]| ≤ E [Σ(ω,X)]2

    + ς,

    with ΣR defined in (12).

    Proof. Consider the event

    B =

    {sup

    X∈C(η,ρ)

    [|Σ(ω,X)− E [Σ(ω,X)]| − 1

    2E [Σ(ω,X)]

    ]> ς

    }.

    Define also for l ∈ N∗

    Sl ={X ∈ C(η, ρ);κl−1η < E [Σ(ω,X)] < κlη

    },

    for κ = 6/5 and η = 72 log(n+ p)/(π log(6/5)). On B, there exist l ≥ 1 andX ∈ C(η, ρ) such that X ∈ C(η, ρ)

    ⋂Sl, and

    |Σ(ω,X)− E [Σ(ω,X)]| > 12E [Σ(ω,X)]+ ς >

    1

    2κl−1η+ ς =

    5

    12κlη+ ς. (28)

    For T > 0, define the set

    C(η, ρ, T ) = {X ∈ C(η, ρ),E [Σ(ω,X)] ≤ T}

    and the event

    Bl =

    {sup

    X∈C(η,ρ,κlη)|Σ(ω,X)− E [Σ(ω,X)]| > 5

    12κlη + ς

    }.

    23

  • It follows from (28) that B ⊂⋃+∞l=1 Bl; thus, it is enough to estimate the

    probability of the events Bl, l ∈ N, and then apply the union bound. Suchan estimation is given in the following Lemma, adapted from (40) (see Lemma10). Define

    ZT = supX∈C(η,ρ,T )

    |Σ(ω,X)− E [Σ(ω,X)]| . (29)

    Lemma 3. Under the assumptions of Theorem 1,

    P(ZT ≥

    5

    12T + ς

    )≤ 4e−πT/72, (30)

    where ς is defined in (27).

    Proof. We use the following Talagrand’s concentration inequality and a sym-metrization argument. Recall the statement of Talagrand’s concentrationinequality. Let f : [−1, 1]m 7→ R a convex Lipschitz function with Lips-chitz constant L, Ξ1, . . . ,Ξm be independent random variables taking valuesin [−1, 1], and Z := f(Ξ1, . . . ,Ξm). Then, for any t ≥ 0, P(|Z − E[Z]| ≥16L + t) ≤ 4e−t2/2L2 . For x = (xij), (i, j) ∈ [n]× [p], we apply this result tothe function

    f(x) = supX∈C(η,ρ,T )

    ∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (xij − πij)X2ij

    ∣∣∣∣∣∣ ,

    24

  • which is Lipschitz with Lipschitz constant√π−1T :

    |f(x11, . . . , xnp)− f(z11, . . . , znp)|

    =

    ∣∣∣∣∣∣ supX∈C(η,ρ,T )∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (xij − πij)X2ij

    ∣∣∣∣∣∣− supX∈C(η,ρ,T )∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (zij − πij)X2ij

    ∣∣∣∣∣∣∣∣∣∣∣∣

    ≤ supX∈C(η,ρ,T )

    ∣∣∣∣∣∣∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (xij − πij)X2ij

    ∣∣∣∣∣∣−∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (zij − πij)X2ij

    ∣∣∣∣∣∣∣∣∣∣∣∣

    ≤ supX∈C(η,ρ,T )

    ∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (xij − πij)X2ij −∑

    (i,j)∈[n]×[p]

    (zij − πij)X2ij

    ∣∣∣∣∣∣≤ sup

    X∈C(η,ρ,T )

    ∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    (xij − zij)X2ij

    ∣∣∣∣∣∣≤ sup

    X∈C(η,ρ,T )

    √(i, j) ∈

    ∑[n]×[p]

    π−1ij (xij − zij)2√ ∑

    (i,j)∈[n]×[p]

    πijX4ij

    ≤ supX∈C(η,ρ,T )

    √π−1√ ∑

    (i,j)∈[n]×[p]

    (xij − zij)2√ ∑

    (i,j)∈[n]×[p]

    πijX2ij

    ≤√π−1T

    √ ∑(i,j)∈[n]×[p]

    (xij − zij)2,

    where we have used ||a|−|b|| ≤ |a−b|,‖X‖∞ ≤ 1 and E [Σ(ω,X)] ≤ T . Thus,Talagrand’s inequality and the identity

    √π−1T ≤ T/(2× 96) + 96/(2π) give

    P(ZT ≥ E(ZT ) + 768π−1 +

    1

    12T + t

    )≤ 4e−t2π/2T .

    Taking t = T/6 we get

    P(ZT ≥ E(ZT ) + 768π−1 +

    3

    12T

    )≤ 4e−πT/72. (31)

    Now we bound the expectation E[ZT ] using a symmetrization argument (41,Section 7.2). Let (�ij) be an i.i.d. Rademacher sequence. We have

    E(ZT ) ≤ 2E

    supX∈C(η,ρ,T )

    ∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    �ijωijX2ij

    ∣∣∣∣∣∣ , (32)

    25

  • Then, the contraction inequality (see (42), Theorem 2.2) yields

    E(ZT ) ≤ 8E

    supX∈C(η,ρ,T )

    ∣∣∣∣∣∣∑

    (i,j)∈[n]×[p]

    �ijωijXij

    ∣∣∣∣∣∣ = 8E( sup

    X∈C(η,ρ,T )|〈ΣR, X〉|

    ),

    where ΣR is defined in (12). For X ∈ C(η, ρ, T ) we have that ‖X‖∗ ≤√ρπ−1T . Then by duality between the nuclear and operator norms we obtain

    E(ZT ) ≤ 8E

    sup‖X‖∗≤

    √ρπ−1T

    |〈ΣR, X〉|

    ≤ 8√ρπ−1TE‖ΣR‖.Combined with (31) and using 8

    √ρπ−1TE‖ΣR‖ ≤ T2×3 +

    3×82ρπ−12

    (E‖ΣR‖)2we finally obtain (30) using the definition of ς in (27).

    Lemma 3 implies that

    P(B) ≤+∞∑l=1

    P(Bl) ≤ 4+∞∑l=1

    exp(−πκlη/72) ≤ 8/(n+ p),

    which concludes the proof.

    Case 1 If∑

    (i,j)∈[n]×[p] πij(X̂ij −X∗ij)2 ≤ η, then ‖X̂ −X∗‖22 ≤ η/π and theresult of Theorem 1 (15) is proved.

    Case 2 If∑

    (i,j)∈[n]×[p] πij(X̂ij−X∗ij)2 > η. Let us show that (X̂−X∗)/2γ ∈C(η, 64 rank(X∗)). Using (22), σ2−

    ∑(i,j)∈Ω(X̂ij−X∗ij)2 ≥ 0 and ‖∇L(X∗)‖ ≤

    λ/2, we obtain that

    ‖P⊥Θ∗(Θ̂−Θ∗)‖∗ ≤ 3‖PΘ∗(Θ̂−Θ∗)‖∗ + ‖X̂0 −X∗0‖∗.

    On the other hand,

    ‖X̂ −X∗‖∗ ≤ ‖P⊥Θ∗(Θ̂−Θ∗)‖∗ + ‖PΘ∗(Θ̂−Θ∗)‖∗ + ‖X̂0 −X∗0‖∗≤ 4‖PΘ∗(Θ̂−Θ∗)‖∗ + 2‖X̂0 −X∗0‖∗≤ 2√

    2 rank(Θ∗)‖Θ̂−Θ∗‖F + 2√r‖X̂0 −X∗0‖F

    ≤√

    64 rank(X∗)‖X̂ −X∗‖F .

    26

  • Thus, Lemma 2 implies that with probability at least 1− 8(n+ p)−1,

    ∑(i,j)∈[n]×[p]

    ωij(X̂ij−X∗ij)2 ≥E[∑

    (i,j)∈[n]×[p] ωij(X̂ij −X∗ij)2]2

    −384γ2π−1[64 rank(X∗)(E‖ΣR‖)2+8].

    (33)Combining (33) and (24) we obtain

    π‖X̂ −X∗‖2F2

    −384γ2[64 rank(X∗)(E‖ΣR‖)2 + 8]

    π≤ λσ2−

    (3√

    2 rank(Θ∗) +√r)‖X̂−X∗‖F .

    Finally, using the identity ab ≤ a2 + b2/4 and rank(X∗) ≤ rank(Θ∗) + r weobtain

    ‖X̂ −X∗‖2F ≤(

    192λ2

    π2σ4−+

    24576γ2(E‖ΣR‖)2

    π2

    )[rk(Θ∗) + r] +

    6144

    π2. (34)

    8.2 Proof of Theorem 2

    Theorem 2 derives from Theorem 1 and combining the two following steps:1) computing a value of λ such that the condition λ ≥ 2‖∇L(X∗)‖ holdswith high probability and 2) controlling E‖ΣR‖. Let us start with 1). Definethe random matrices Zij = ωij(−Yij + exp(X∗ij))Eij and the quantity

    σ2Z = max

    (1

    np

    ∥∥∥∥∥n∑i=1

    p∑j=1

    E[ZijZ

    >ij

    ]∥∥∥∥∥ , 1np∥∥∥∥∥

    n∑i=1

    p∑j=1

    E[Z>ijZij

    ]∥∥∥∥∥). (35)

    Lemma 4. Under the assumptions of Theorem 2,

    σ2−β

    np≤ σ2Z ≤

    σ2+β

    np. (36)

    Proof. For all (i, j) ∈ [n] × [p], ZijZTij = ωij(−Yij + exp(X∗ij))2EijE>ij , andE[ZijZ>ij ] = E[ωij]E[(−Yij + exp(X∗ij))2]EijE>ij , which is a diagonal matrixwith 0 everywhere except on the i-th element of its diagonal, where its valueis E[ωij]E[(−Yij + exp(X∗ij))2]. Thus,∑

    (i,j)∈[n]×[p]

    E[ZijZ>ij ]

    27

  • is also a diagonal matrix, and the i-th element of its diagonal is∑p

    j=1 E[ωij]E[(−Yij+exp(X∗ij))

    2]. We obtain that

    1

    np

    ∥∥∥∥∥∥∑

    (i,j)∈[n]×[p]

    E[ZijZ>ij ]

    ∥∥∥∥∥∥ = 1np maxi∈[n]p∑j=1

    E[ωij]E[(−Yij + exp(X∗ij))2].

    Using E[Yij] = exp(X∗ij) and σ2− ≤ var(Yij) ≤ σ2+, we obtain:

    σ2−np

    maxi∈[n]

    p∑j=1

    E[ωij] ≤1

    np

    ∥∥∥∥∥∥∑

    (i,j)∈[n]×[p]

    E[ZijZ>ij ]

    ∥∥∥∥∥∥ ≤ σ2+

    npmaxi∈[n]

    p∑j=1

    E[ωij]. (37)

    Using the same arguments, we also obtain

    σ2−np

    maxj∈[p]

    n∑i=1

    E[ωij] ≤1

    np

    ∥∥∥∥∥∥∑

    (i,j)∈[n]×[p]

    E[Z>ijZij]

    ∥∥∥∥∥∥ ≤ σ2+

    npmaxj∈[p]

    n∑i=1

    E[ωij]. (38)

    Combining (37) and (37), we obtain that

    σ2−np

    max

    {maxi∈[n]

    p∑j=1

    E[ωij],maxj∈[p]

    n∑i=1

    E[ωij]

    }≤ σ2Z ≤

    σ2+np

    max

    {maxi∈[n]

    p∑j=1

    E[ωij],maxj∈[p]

    n∑i=1

    E[ωij]

    },

    which concludes the proof.

    Note that E [Zij] = 0 for all (i, j) ∈ [n]×[p] and∇L(X∗) =∑n

    i=1

    ∑pj=1 Zij.

    We use an extension of Theorem 4 in (43) to rectangular matrices via self-adjoint dilation (cf., for example, 2.6 in (44)). Let Ξ1, . . . ,Ξm be m indepen-dent (n× p)-matrices satisfying E[Ξi] = 0 and

    inf{K > 0 : E[exp(‖Ξi‖/K)] ≤ e} < M

    for some constant M and for all i ∈ {1, . . . ,m}. Define

    σ2 = max

    (1

    m

    ∥∥∥∥∥m∑i=1

    E(ΞiΞ

    Ti

    )∥∥∥∥∥ , 1m∥∥∥∥∥

    m∑i=1

    E(ΞTi Ξi

    )∥∥∥∥∥),

    28

  • and Ū = M log(1 + 2M2

    σ2). Then, for tŪ ≤ 2(e− 1)σ2m,

    P

    {∥∥∥∥∥ 1mm∑i=1

    Ξi

    ∥∥∥∥∥ ≥ t}≤ 2(n+ p) exp

    {− t

    2

    4mσ2 + 2Ū t/3

    }and for tŪ > 2(e− 1)σ2m,

    P

    {∥∥∥∥∥ 1mm∑i=1

    Ξi

    ∥∥∥∥∥ ≥ t}≤ 2(n+ p) exp

    {− t

    (e− 1)Ū

    }.

    Under Assumption 3 we may apply this result with m = np, (Ξ1, . . . ,Ξm) =(Z11, . . . , Znp), M = 2δ, σ

    2 = σ2Z and Ū = 2δ log(1 + 8δ2/σ2Z). Taking

    t ≥ max{

    2σZ√

    3np log(n+ p), 6δ(e− 1) log(1 + 8δ2/σ2Z) log(n+ p)}

    and using Lemma 4, we get that with probability at least 1− (n+ p)−1,

    ‖∇L(X∗)‖ ≤ max{

    2σ+ (3β log(n+ p))1/2 , 6δ(e− 1) log{1 + 8δ2np/(βσ2−)} log(n+ p)

    }.

    Thus, taking λ as in Theorem 2 ensures that λ ≥ 2‖∇L(X∗)‖ with proba-bility at least 1− (n+ p)−1.

    We now control E‖ΣR‖ with the following lemma.

    Lemma 5. There exists an absolute constant C∗ such that the two followinginequality holds

    E[‖ΣR‖] ≤ C∗{√

    β +√

    logm}.

    Proof. We use an extension to rectangular matrices via self-adjoint dilationof Corollary 3.3 in (45).

    Proposition 1. Let A be an n× p rectangular matrix with Aij independentcentered bounded random variables. then, there exists a universal constantC∗ such that

    E[‖A‖] ≤ C∗{σ1 ∨ σ2 + σ∗

    √log(n ∧ p)

    },

    σ1 = maxi

    √∑j

    E[A2ij], σ2 = maxj

    √∑i

    E[A2ij], σ∗ = maxi,j|Aij|.

    29

  • Applying Proposition 1 to ΣR with σ1 ∨ σ2 ≤√β/|Ω| and σ∗ ≤ 1 we

    obtainE[‖ΣR‖] ≤ C∗

    {√β +

    √log(n ∧ p)

    }.

    Combining 1) and 2) with (34) and a union bound argument, we obtainthe result of Theorem 2.

    8.3 Proof of Theorem 3

    In what follows we denote for X0 ∈ X0 and Θ ∈ T Fλ(X0,Θ) = L(X0 +Θ)+λ‖Θ‖∗. We establish below that λ0(Y ) defined in (17) is equal to

    λ0(Y ) = minλ

    0 ∈ ∂Θ{Fλ(X̂0,Θ) + χT (Θ)} |Θ=0,

    where for K ⊂ Rn×p ,χK(X) is the characteristic function of the set K,equal to 0 on K and +∞ elsewhere, and X̂0 = argmin

    X∈X0L(X) (see (17)). The

    subdifferential of the objective function Fλ with respect to Θ is given by

    ∂ΘFλ(X̂0, 0) = ∇L(X̂0 + Θ) |Θ=0 +λ∂Θ ‖Θ‖∗ |Θ=0 +∂ΘχT (Θ) |Θ=0 .

    0 ∈ ∂ΘχT (Θ) |Θ=0. Lemma 6 ensures that 0 ∈ ∂Fλ(Θ) |Θ=0 if and only if

    0 ∈{∇L(X̂0) + λW ; ‖PT (W )‖ ≤ 1

    }.

    This is equivalent to λ ≥∥∥∥PT (∇L(X̂0))∥∥∥ . Additionally, at the optimum X̂0,

    we have PT (∇L(X̂0)) = ∇L(X̂0), which concludes the proof.

    Lemma 6. Let g : T → R+ be the function defined by g(A) = ‖A‖∗ forA ∈ T . ∂g(0) = {W ∈ Rn×p, ‖PT (W )‖ ≤ 1}.

    Proof. By definition of the subdifferential we need to prove that for all W ∈Rn×p, ‖PT (W )‖ < 1, and for all B ∈ T , g(B) ≥ g(0) + 〈W,B − 0〉. FirstB ∈ T implies 〈W,B〉 = 〈PT (W ), B〉, therefore ‖PT (W )‖ ≤ 1 is a sufficientcondition for W ∈ ∂g(0). Now assume ‖PT (W )‖ > 1 and let PT (W ) =UΣV T , where U and V are orthogonal matrices of left and right singularvectors, and Σ11 = ‖PT (W )‖ > 1. Let us define B = UΣ̃V T , Σ̃11 = 1 and

    30

  • Σ̃ij = 0 elsewhere; note that with this definition B ∈ T . We have g(B) = 1and 〈PT (W ), B〉 = Σ11 > g(B). Therefore ‖PT (W )‖ > 1 ⇒ W /∈ ∂g(0),from which we conclude

    ∂g(0) ={W ∈ Rn×p, ‖PT (W )‖ < 1

    }.

    References

    [1] M. Greenacre, Theory and Applications of Correspondence Analysis,Acadamic Press, 1984.

    [2] L. A. Goodman, The analysis of cross-classified data having orderedand/or unordered categories: association models, correlation models,and asymmetry models for contingency tables with or without missingentries, Annals of Statistics 13 (1985) 10–69.

    [3] A. de Falguerolles, Chapter 35 - log-bilinear biplots in action,in: J. Blasius, M. Greenacre (Eds.), Visualization of Categor-ical Data, Academic Press, San Diego, 1998, pp. 527 – 539.doi:https://doi.org/10.1016/B978-012299045-8/50039-5.URL https://www.sciencedirect.com/science/article/pii/B9780122990458500395

    [4] R. Christensen, Log-Linear Models, Springer-Verlag, New York., 2010.

    [5] J. Gower, S. Lubbe, N. le Roux, Understanding Biplots, John Wiley &Sons, 2011.

    [6] W. Fithian, J. Josse, Multiple correspondence analysis and the multi-logit bilinear model, Journal of Multivariate Analysis 157 (2017) 87 –102. doi:https://doi.org/10.1016/j.jmva.2017.02.009.URL http://www.sciencedirect.com/science/article/pii/S0047259X1730115X

    [7] M. Kateri, Contingency Table Analysis, Springer New York, 2014.

    [8] A. M. Brown, D. I. Warton, N. R. Andrew, M. Binns, G. Cassis, H. Gibb,The fourth-corner solution – using predictive models to understand how

    31

    https://www.sciencedirect.com/science/article/pii/B9780122990458500395http://dx.doi.org/https://doi.org/10.1016/B978-012299045-8/50039-5https://www.sciencedirect.com/science/article/pii/B9780122990458500395https://www.sciencedirect.com/science/article/pii/B9780122990458500395http://www.sciencedirect.com/science/article/pii/S0047259X1730115Xhttp://www.sciencedirect.com/science/article/pii/S0047259X1730115Xhttp://dx.doi.org/https://doi.org/10.1016/j.jmva.2017.02.009http://www.sciencedirect.com/science/article/pii/S0047259X1730115Xhttp://www.sciencedirect.com/science/article/pii/S0047259X1730115Xhttp://dx.doi.org/10.1111/2041-210X.12163http://dx.doi.org/10.1111/2041-210X.12163

  • species traits interact with the environment, Methods in Ecology andEvolution 5 (4) (2014) 344–352. doi:10.1111/2041-210X.12163.URL http://dx.doi.org/10.1111/2041-210X.12163

    [9] C. J. ter Braak, P. Peres-Neto, S. Dray, A critical issue in model-basedinference for studying trait-based community assembly and a solution,PeerJ 5 (2017) e2885. doi:10.7717/peerj.2885.URL https://doi.org/10.7717/peerj.2885

    [10] F. Luisier, T. Blu, M. Unser, Image denoising in mixed poisson-gaussiannoise, IEEE Transactions on Image Processing 20 (3) (2011) 696–708.

    [11] J. Salmon, Z. Harmany, C. Deledalle, R. Willett, Poisson noise reductionwith non-local pca, Journal of Mathematical Imaging and Vision 48 (2)(2014) 279–294.

    [12] Y. Cao, Y. Xie, Poisson matrix recovery and completion, IEEE Trans-actions on Signal Processing 64 (6) (2016) 1609–1620. doi:10.1109/TSP.2015.2500192.

    [13] P. Gopalan, F. J. R. Ruiz, R. Ranganath, D. M. Blei, Bayesian non-parametric poisson factorization for recommendation systems, in: InAISTATS, 2014, pp. 275–283.

    [14] P. Choler, Consistent shifts in alpine plant traits along a mesotopo-graphical gradient, Arctic, Antarctic, and Alpine Research 37 (4) (2005)444–453. doi:10.1214/12-AOS986.URL http://dx.doi.org/10.1657/1523-0430(2005)037[0444:CSIAPT]2.0.CO;2

    [15] R Core Team, R: A Language and Environment for Statistical Comput-ing, R Foundation for Statistical Computing, Vienna, Austria (2016).URL https://www.R-project.org/

    [16] M. Collins, S. Dasgupta, R. E. Schapire, A generalization of principalcomponent analysis to the exponential family, in: Proceedings of the14th International Conference on Neural Information Processing Sys-tems: Natural and Synthetic, NIPS’01, MIT Press, Cambridge, MA,USA, 2001, pp. 617–624.URL http://dl.acm.org/citation.cfm?id=2980539.2980620

    32

    http://dx.doi.org/10.1111/2041-210X.12163http://dx.doi.org/10.1111/2041-210X.12163http://dx.doi.org/10.1111/2041-210X.12163http://dx.doi.org/10.1111/2041-210X.12163https://doi.org/10.7717/peerj.2885https://doi.org/10.7717/peerj.2885http://dx.doi.org/10.7717/peerj.2885https://doi.org/10.7717/peerj.2885http://dx.doi.org/10.1109/TSP.2015.2500192http://dx.doi.org/10.1109/TSP.2015.2500192http://dx.doi.org/10.1657/1523-0430(2005)037[0444:CSIAPT]2.0.CO;2http://dx.doi.org/10.1657/1523-0430(2005)037[0444:CSIAPT]2.0.CO;2http://dx.doi.org/10.1214/12-AOS986http://dx.doi.org/10.1657/1523-0430(2005)037[0444:CSIAPT]2.0.CO;2http://dx.doi.org/10.1657/1523-0430(2005)037[0444:CSIAPT]2.0.CO;2https://www.R-project.org/https://www.R-project.org/https://www.R-project.org/http://dl.acm.org/citation.cfm?id=2980539.2980620http://dl.acm.org/citation.cfm?id=2980539.2980620http://dl.acm.org/citation.cfm?id=2980539.2980620

  • [17] J. de Leeuw, Principal component analysis of binary data by iteratedsingular value decomposition, Computational Statistics and Data Anal-ysis 50 (1) (2006) 21–39.

    [18] J. Li, D. Tao, Simple exponential family PCA, IEEE Transactions onNeural Networks and Learning Systems 24 (3) (2013) 485–497.

    [19] J. Josse, S. Wager, Bootstrap-based regularization for low-rank matrixestimation, Journal of Machine Learning Research 17 (124) (2016) 1–29.

    [20] L. Liu, E. Dobriban, A. Singer, epca: High dimensional exponentialfamily pca, arXiv:1611.05550.

    [21] S. Mohamed, Z. Ghahramani, K. A. Heller, Bayesian exponentialfamily pca, in: D. Koller, D. Schuurmans, Y. Bengio, L. Bottou(Eds.), Advances in Neural Information Processing Systems 21, CurranAssociates, Inc., 2009, pp. 1089–1096.URL http://papers.nips.cc/paper/3532-bayesian-exponential-family-pca.pdf

    [22] P. Gopalan, F. Ruiz, R. Ranganath, D. Blei, Bayesian nonparametricpoisson factorization for recommendation systems, In AISTATS (2014)275–283.

    [23] J. Lafond, Low rank matrix completion with exponential family noise,Journal of Machine Learning Research: Workshop and Conference Pro-ceedings 40 (2015) 1–18.

    [24] J. Chiquet, M. Mariadassous, S. Robin, Variational inference for proba-bilistic poisson pca, Annals of Applied Statistics (to appear).

    [25] W. Fithian, R. Mazumder, Flexible low-rank statistical modeling withmissing data and side information, Statist. Sci. 33 (2) (2018) 238–260.doi:10.1214/18-STS642.URL https://doi.org/10.1214/18-STS642

    [26] R. Angst, C. Zach, M. Pollefeys, The generalized trace-norm and itsapplication to structure-from-motion problems, in: Proceedings of the2011 International Conference on Computer Vision, ICCV ’11, IEEEComputer Society, Washington, DC, USA, 2011, pp. 2502–2509. doi:

    33

    http://papers.nips.cc/paper/3532-bayesian-exponential-family-pca.pdfhttp://papers.nips.cc/paper/3532-bayesian-exponential-family-pca.pdfhttp://papers.nips.cc/paper/3532-bayesian-exponential-family-pca.pdfhttp://papers.nips.cc/paper/3532-bayesian-exponential-family-pca.pdfhttps://doi.org/10.1214/18-STS642https://doi.org/10.1214/18-STS642http://dx.doi.org/10.1214/18-STS642https://doi.org/10.1214/18-STS642http://dx.doi.org/10.1109/ICCV.2011.6126536http://dx.doi.org/10.1109/ICCV.2011.6126536http://dx.doi.org/10.1109/ICCV.2011.6126536http://dx.doi.org/10.1109/ICCV.2011.6126536

  • 10.1109/ICCV.2011.6126536.URL http://dx.doi.org/10.1109/ICCV.2011.6126536

    [27] D. Agarwal, B.-C. Chen, Regression-based latent factor models, in: Pro-ceedings of the 15th ACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, KDD ’09, ACM, New York, NY, USA,2009, pp. 19–28. doi:10.1145/1557019.1557029.URL http://doi.acm.org/10.1145/1557019.1557029

    [28] J. Abernethy, F. Bach, T. Evgeniou, J.-P. Vert, A new approach tocollaborative filtering: Operator estimation with spectral regularization,J. Mach. Learn. Res. 10 (2009) 803–826.URL http://dl.acm.org/citation.cfm?id=1577069.1577098

    [29] R. J. A. Little, D. B. Rubin, Statistical Analysis with Missing Data,John Wiley & Sons series in probability and statistics, New-York, 1987,2002.

    [30] O. Klopp, Noisy low-rank matrix completion with general sampling dis-tribution, Bernoulli 20 (1) (2014) 282–303.

    [31] I. Csiszár, G. Tusnády, Information Geometry and Alternating mini-mization procedures, Statistics and Decisions Supplement Issue 1.

    [32] J.-F. Cai, E. J. Candès, Z. Shen, A singular value thresholding algorithmfor matrix completion, SIAM Journal on Optimization 20 (4) (2010)1956–1982.

    [33] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, Pathwise coordinateoptimization, Ann. Appl. Stat. 1 (2) (2007) 302–332. doi:10.1214/07-AOAS131.URL http://dx.doi.org/10.1214/07-AOAS131

    [34] D. L. Donoho, I. M. Johnstone, Ideal spatial adaptation via waveletshrinkage, Biometrika 81 (1994) 425–455.

    [35] C. Giacobino, S. Sardy, J. Diaz Rodriguez, N. Hengardner, Quantileuniversal threshold, Electronic Journal of Statistics 11 (2017) 4701–4722.

    [36] B. Efron, Bootstrap methods: Another look at the jackknife, Ann.Statist. 7 (1) (1979) 1–26. doi:10.1214/aos/1176344552.URL https://doi.org/10.1214/aos/1176344552

    34

    http://dx.doi.org/10.1109/ICCV.2011.6126536http://dx.doi.org/10.1109/ICCV.2011.6126536http://dx.doi.org/10.1109/ICCV.2011.6126536http://doi.acm.org/10.1145/1557019.1557029http://dx.doi.org/10.1145/1557019.1557029http://doi.acm.org/10.1145/1557019.1557029http://dl.acm.org/citation.cfm?id=1577069.1577098http://dl.acm.org/citation.cfm?id=1577069.1577098http://dl.acm.org/citation.cfm?id=1577069.1577098http://dx.doi.org/10.1214/07-AOAS131http://dx.doi.org/10.1214/07-AOAS131http://dx.doi.org/10.1214/07-AOAS131http://dx.doi.org/10.1214/07-AOAS131http://dx.doi.org/10.1214/07-AOAS131https://doi.org/10.1214/aos/1176344552http://dx.doi.org/10.1214/aos/1176344552https://doi.org/10.1214/aos/1176344552

  • [37] J. Pannekoek, A. van Strien, Trim 3 manual (trends & indices for mon-itoring data), Statistics Netherlands.

    [38] M. de Rooij, W. J. Heiser, Graphical representations and odds ratios in adistance-association model for the analysis of cross-classified data, Psy-chometrika 70 (1) (2005) 99–122. doi:10.1007/s11336-000-0848-1.URL https://doi.org/10.1007/s11336-000-0848-1

    [39] M. Sayoud, H. Salhi, B. Chalabi, A. Allali, M. Dakki, A. Qninba,M. E. Agbani, H. Azafzaf, C. Feltrup-Azafzaf, H. Dlensi, N. Hamouda,W. A. L. Ibrahim, H. Asran, A. A. Elnoor, H. Ibrahim, K. Etayeb,E. Bouras, W. Bashaimam, A. Berbash, C. Deschamps, J. Mondain-Monval, A. Brochet, S. Véran, P. D. du Rau, The first coordinatedtrans-north african mid-winter waterbird census: The contribution ofthe international waterbird census to the conservation of waterbirds andwetlands at a biogeographical level, Biological Conservation 206 (2017)11 – 20. doi:https://doi.org/10.1016/j.biocon.2016.12.005.URL http://www.sciencedirect.com/science/article/pii/S0006320716309788

    [40] O. Klopp, Matrix completion by singular value thresholding: sharpbounds, Electronic journal of statistics 9 (2) (2015) 2348–2369.URL https://hal.archives-ouvertes.fr/hal-01111757

    [41] M. Ledoux, The concentration of measure phenomenon, Vol. 89 of Math-ematical Surveyx and Monographs, American Mathematical Society,Providence, 2001.

    [42] V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization andSparse Recovery, Springer, 2011.

    [43] V. Koltchinskii, A remark on low rank matrix recovery and noncommu-tative Bernstein type inequalities, Vol. Volume 9 of Collections, Instituteof Mathematical Statistics, Beachwood, Ohio, USA, 2013, pp. 213–226.doi:10.1214/12-IMSCOLL915.URL https://doi.org/10.1214/12-IMSCOLL915

    [44] J. A. Tropp, User-friendly tail bounds for sums of random matrices,Foundations of Computational Mathematics 12 (4) (2012) 389–434.

    35

    https://doi.org/10.1007/s11336-000-0848-1https://doi.org/10.1007/s11336-000-0848-1http://dx.doi.org/10.1007/s11336-000-0848-1https://doi.org/10.1007/s11336-000-0848-1http://www.sciencedirect.com/science/article/pii/S0006320716309788http://www.sciencedirect.com/science/article/pii/S0006320716309788http://www.sciencedirect.com/science/article/pii/S0006320716309788http://www.sciencedirect.com/science/article/pii/S0006320716309788http://dx.doi.org/https://doi.org/10.1016/j.biocon.2016.12.005http://www.sciencedirect.com/science/article/pii/S0006320716309788http://www.sciencedirect.com/science/article/pii/S0006320716309788https://hal.archives-ouvertes.fr/hal-01111757https://hal.archives-ouvertes.fr/hal-01111757https://hal.archives-ouvertes.fr/hal-01111757https://doi.org/10.1214/12-IMSCOLL915https://doi.org/10.1214/12-IMSCOLL915http://dx.doi.org/10.1214/12-IMSCOLL915https://doi.org/10.1214/12-IMSCOLL915

  • [45] A. S. Bandeira, R. van Handel, Sharp nonasymptotic bounds on thenorm of random matrices with independent entries, Ann. Probab. 44 (4)(2016) 2479–2506. doi:10.1214/15-AOP1025.URL https://doi.org/10.1214/15-AOP1025

    36

    https://doi.org/10.1214/15-AOP1025https://doi.org/10.1214/15-AOP1025http://dx.doi.org/10.1214/15-AOP1025https://doi.org/10.1214/15-AOP1025

  • 100

    200

    300

    200 400 600

    Site

    Site

    effe

    cts

    (a) Total site effects in count scale(exp(Ri,α̂), for 1 ≤ i ≤ n). One sitehas exp(Ri,α̂) ≥ exp(5) (large redpoint). Horizontal line: exp(Ri,α) =1.

    0.9

    1.0

    1.1

    1.2

    1990 2000 2010

    Year

    Year

    s ef

    fect

    s

    (b) Total year effects in count scale(exp(Cj,β̂), for 1 ≤ j ≤ p). Blue line:loess (standard deviation in gray).

    Site 40 Site 54 Site 568 Site 584 Site 687 Site 703 Site 719

    1990

    1994

    1998

    2002

    2006

    2010

    2014

    0

    20

    40

    60

    80

    (c) Interaction in count scale (exp(Θ̂ij)) for 30 sites (to improvethe display, the rest of the sites take value extremely close to 1).

    Figure 5: Decomposition of the estimated counts into multiplicative siteeffects (top left), year effects (top right) and interactions (bottom).

    37

    IntroductionRelated work

    General model and estimationStatistical guarantees

    Algorithm and selection of Optimization algorithmAutomatic selection of

    Simulation studyEstimation Imputation

    Analysis of the Aravo dataUsing covariates to impute ecological dataDiscussionProofsProof of th:global-bound-1Proof of th:global-boundProof of prop:null-lambda