
arXiv:1803.07054v1 [math.ST] 19 Mar 2018

Optimal Link Prediction with Matrix Logistic Regression

Nicolai Baldin∗ and Quentin Berthet†

University of Cambridge

∗Supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 647812).
†Supported by an Isaac Newton Trust Early Career Support Scheme and by The Alan Turing Institute under the EPSRC grant EP/N510129/1.

Abstract. We consider the problem of link prediction, based on partial observation of a large network, and on side information associated to its vertices. The generative model is formulated as a matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius-norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any (randomised) algorithm computable in polynomial time under a computational complexity assumption.

Primary 62J02, 62C20; secondary 68Q17.
Key words and phrases: Link prediction, Logistic regression, Computational lower bounds, Planted clique.

INTRODUCTION

In the field of network analysis, the task of link prediction consists in predicting the presence or absence of edges in a large graph, based on the observations of some of its edges, and on side information. Network analysis has become a growing inspiration for statistical problems. Indeed, one of the main characteristics of datasets in the modern scientific landscape is not only their growing size, but also their increasing complexity. Most phenomena now studied in the natural and social sciences concern not only isolated and independent variables, but also their interactions and connections.

The fundamental problem of link prediction is therefore naturally linked with statistical estimation: the objective is to understand, through a generative model, why different vertices are connected or not, and to generalise these observations to the rest of the graph.

Most statistical problems based on graphs are unsupervised: the graph itself is the sole data, there is no side information, and the objective is to recover an unknown structure in the generative model. Examples include the planted clique problem (Alon and Sudakov, 1998; Kučera, 1995); the stochastic block model (Holland et al., 1983), see Abbe (2017) for a recent survey of a very active line of work (Abbe and Sandon, 2015; Banks et al., 2016; Decelle et al., 2011; Massoulié, 2014; Mossel et al., 2013, 2015); the Ising blockmodel (Berthet et al., 2016); random geometric graphs, see Penrose (2003) for an introduction and (Bubeck et al., 2014; Devroye et al., 2011) for recent developments in statistics; or metric-based learning (Bellet et al., 2014; Chen et al., 2009) and ordinal embeddings (Jain et al., 2016).

In supervised regression problems on the other hand, the focus is on understanding a fundamental mechanism, formalized as the link between two variables. The objective is to learn how an explanatory variable X allows one to predict a response Y, i.e. to find the unknown function f that best approximates the relationship Y ≈ f(X). This statistical framework is often applied to the observation of a phenomenon measured by Y (e.g. of a natural or social nature), given known information X: the principle is to understand said phenomenon, to explain the relationship between the variables by estimating the function f (Hoff et al., 2002; Holland and Leinhardt, 1981).

We follow this approach here: our goal is to learn how known characteristics of each agent (represented by a node) in the network induce a greater or smaller chance of connection, to understand the mechanism of formation of the graph. We propose a model for supervised link prediction, using the principle of regression for inference on graphs. For each vertex, we are given side information, a vector of observations X ∈ Rd. Given observations Xi, Xj about nodes i and j of a network, we aim to understand how these two explanatory variables are related to the probability of connection between the two corresponding vertices, such that P(Y(i,j) = 1) = f(Xi,Xj), by estimating f within a high-dimensional class based on logistic regression. Besides this high-dimensional parametric modelling, various fully non-parametric statistical frameworks have been exploited in the literature, see, for example, Gao et al. (2015); Wolfe and Olhede (2013) for graphon estimation, Biau and Bleakly (2008); Papa et al. (2016) for graph reconstruction and Bickel and Chen (2009) for modularity analysis.

Link prediction can be useful in any application where data can be gathered about the nodes of a network. One of the most obvious motivations is in social networks, in order to model social interactions. With access to side information about each member of a social network, the objective is to understand the mechanisms of connection between members: shared interests, differences in artistic tastes or political opinion (Wasserman and Faust, 1994). This can also be applied to citation networks, or in the natural sciences to biological networks of interactions between molecules or proteins (Madeira and Oliveira, 2004; Yu et al., 2008). The key assumption in this model is that the network is a consequence of the information, but not necessarily based on similarity: it is possible to model more complex interactions, e.g. where opposites attract.

The focus on a high-dimensional setting is another aspect of this work that is also motivated by modern applications of statistics: data is often collected without discernment and the ambient dimension d can be much larger than the sample size. This setting is common in regression problems: the underlying model is often actually very simple, to reflect the fact that only a small number of measured parameters are relevant to the problem at hand, and that the intrinsic dimension is much smaller. This is usually handled through an assumption on the rank, sparsity, or regularity of a parameter. Here this needs to be adapted to a model with two covariates (explanatory variables), and a structural assumption is made in order to reflect this nature of our problem. We therefore decide to tackle link prediction by modelling it as matrix logistic regression. We study a generative model for which P(Y(i,j) = 1) = σ(X⊤i Θ⋆Xj), where σ is the sigmoid function, and Θ⋆ is the unknown matrix to estimate. It is a simple way to model how the variables interact, by a quadratic affinity function and a sigmoid function. In order to model realistic situations with partial observations, we assume that Y(i,j) is only observed for a subset of all the couples (i, j), denoted by Ω.

To convey the general idea of a simple dependency on Xi and Xj, we make structural assumptions on the rank and sparsity of Θ⋆. This reflects that the affinity X⊤i Θ⋆Xj is a function of the projections u⊤ℓ X for the vectors Xi and Xj, for a small number of orthogonal vectors, that have themselves a small number of non-zero coefficients (sparsity assumption). In order to impose that the inverse problem is well-posed, we also make a restricted conditioning assumption on Θ⋆, inspired by the restricted isometry property (RIP). These conditions are discussed in Section 1. We talk of link prediction as this is the legacy name, but we focus almost entirely on the problem of estimating Θ⋆.

The classical techniques of likelihood maximization can lead to computationally intractable optimization problems. We show that in this problem, as well as others, this is a fundamental difficulty, not a weakness of one particular estimation technique; statistical and computational complexities are intertwined.

Our contribution: This work is organized in the following manner. We give a formal description of the problem in Section 1, as well as a discussion of our assumptions and links with related work. Section 2 collects our main statistical results. We propose an estimator Θ̂ based on the penalised maximum likelihood approach and analyse its performance in Section 2.1 in terms of non-asymptotic rate of estimation. We show that it attains the minimax rate of estimation over simultaneously block-sparse and low-rank matrices Θ⋆, but is not computationally tractable. In Section 2.2, we provide a convex relaxation of the problem, which is in essence the Lasso estimator applied to a vectorised version of the problem. The link prediction task is covered in Section 2.3. A matching minimax lower bound for the rate of estimation is given in Section 2.4. Furthermore, we show in Section 3 that the minimax rate cannot be attained by a (randomised) polynomial-time algorithm, and we identify a corresponding computational lower bound. The proof of this bound is based on a reduction scheme from the so-called dense subgraph detection problem. Technical proofs are deferred to the appendix. Our findings are depicted in Figure 1.

Notation: For any positive integer n, we denote by [n] the set {1, . . . , n} and by [[n]] the set of couples of [n], of cardinality (n choose 2). We denote by R the set of real numbers and by Sd the set of real symmetric matrices of size d. For a matrix A ∈ Sd, we denote by ‖A‖F its Frobenius norm, defined by

‖A‖2F = ∑i,j∈[d] A2ij .


[Figure 1 marks three rates on a common scale: the Lasso rate (k2/N) log d, the computationally hard region between the minimax rate and k2/N, and the minimax rate kr/N + (k/N) log(de/k).]

Figure 1: The computational and statistical boundaries for estimation and prediction in the matrix logistic regression model. Here k denotes the sparsity of Θ⋆ and r its rank, while N is the number of observed edges in the network.

We extend this definition for B ∈ Sn and any subset Ω ⊆ [[n]] to the semi-norm ‖B‖F,Ω defined by

‖B‖2F,Ω = ∑i,j : (i,j)∈Ω B2ij .

The corresponding bilinear form, playing the role of an inner product of two matrices B1, B2 ∈ Sn, is denoted by 〈〈B1, B2〉〉F,Ω. For a matrix B ∈ Sd, we also make use of the following matrix norms and pseudo-norms for p, q ∈ [0,∞), with ‖B‖p,q = ‖(‖B1∗‖p, . . . , ‖Bd∗‖p)‖q, where Bi∗ denotes the ith row of B, and ‖B‖∞ = maxi,j∈[d] |Bij|. For asymptotic bounds, we shall write f(x) ≲ g(x) if f(x) is bounded by a constant multiple of g(x).

1. PROBLEM DESCRIPTION

1.1 Generative model

For a set of vertices V = [n] and explanatory variables Xi ∈ Rd associated to each i ∈ V, a random graph G = (V,E) is generated by the following model. For all i, j ∈ V, variables Xi, Xj ∈ Rd and an unknown matrix Θ⋆ ∈ Sd, an edge connects the two vertices i and j independently of the others according to the distribution

(1.1)  P((i, j) ∈ E) = σ(X⊤i Θ⋆Xj) = 1 / (1 + exp(−X⊤i Θ⋆Xj)) .

Here we denote by σ the sigmoid, or logistic function.

Definition 1. We denote by πij : Sd → [0, 1] the function mapping a matrix Θ ∈ Sd to the probability in (1.1). Let Σ ∈ Sn with Σij = X⊤i ΘXj denote the so-called affinity matrix. In particular, we then have πij(Θ) = σ(Σij).

Our observation consists of the explanatory variables Xi and of the observation of a subset of the graph. Formally, for a subset Ω ⊆ [[n]], we observe an adjacency vector Y indexed by Ω that satisfies, for all (i, j) ∈ Ω, Y(i,j) = 1 if and only if (i, j) ∈ E (and 0 otherwise). We thus have

(1.2)  Y(i,j) ∼ Bernoulli(πij(Θ⋆)) ,  (i, j) ∈ Ω .

The joint data distribution is denoted by PΘ⋆ and is thus completely specified by πij(Θ⋆), (i, j) ∈ Ω. For ease of notation, we write N = |Ω|, representing the effective sample size. Our objective is to estimate the parameter matrix Θ⋆, based on the observations Y ∈ RN and on known explanatory variables X ∈ Rd×n.
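To make the generative model concrete, the following Python sketch (not part of the paper; numpy is assumed, and the sizes, the dense toy choice of Θ⋆ and the fully observed Ω are illustrative) simulates covariates, affinities and an adjacency vector according to (1.1) and (1.2).

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    rng = np.random.default_rng(0)
    n, d = 50, 20                        # vertices and feature dimension (illustrative)
    X = rng.normal(size=(d, n))          # column X[:, i] is the covariate vector of vertex i

    # toy symmetric parameter matrix Theta_star (dense here, for simplicity)
    A = rng.normal(size=(d, d)) / d
    Theta_star = (A + A.T) / 2

    # observed pairs Omega: here all couples (i, j) with i < j, i.e. the whole graph
    Omega = [(i, j) for i in range(n) for j in range(i + 1, n)]
    N = len(Omega)                       # effective sample size

    # affinities Sigma_ij = X_i^T Theta_star X_j and edge probabilities, model (1.1)
    Sigma = X.T @ Theta_star @ X
    probs = sigmoid(np.array([Sigma[i, j] for (i, j) in Omega]))

    # adjacency vector Y indexed by Omega, model (1.2)
    Y = rng.binomial(1, probs)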


This problem can be reformulated as a classical logistic regression problem. Indeed, writing vec(A) ∈ Rd2 for the vectorized form of a matrix A ∈ Sd, we have that

(1.3)  X⊤i Θ⋆Xj = Tr(XjX⊤i Θ⋆) = 〈vec(XjX⊤i ), vec(Θ⋆)〉 .

The vector of observations Y ∈ RN therefore follows a logistic distribution with explanatory design matrix DΩ ∈ RN×d2 such that DΩ,(i,j) = vec(XjX⊤i ) and predictor vec(Θ⋆) ∈ Rd2. We focus on the matrix formulation of this problem, and consider directly matrix logistic regression in order to simplify the notation of the explanatory variables and our model assumptions on Θ⋆, that are specific to matrices.
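The identity (1.3) is easy to verify numerically; the short check below (our illustration, with arbitrary dimensions) builds one row of the design matrix DΩ as vec(XjX⊤i ) and compares the two sides.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 5
    Xi, Xj = rng.normal(size=d), rng.normal(size=d)
    A = rng.normal(size=(d, d))
    Theta = (A + A.T) / 2                      # a symmetric parameter matrix

    lhs = Xi @ Theta @ Xj                      # X_i^T Theta X_j
    row = np.outer(Xj, Xi).ravel()             # vec(X_j X_i^T), one row of D_Omega
    rhs = row @ Theta.ravel()                  # <vec(X_j X_i^T), vec(Theta)>
    assert np.isclose(lhs, rhs)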

1.2 Comparison with other models

This model can be compared to other settings in the statistical and learning literature.

Generalised linear model. As discussed above in the remark to (1.3), this is an example of a logistic regression model. We focus in this work on the case where the matrix Θ⋆ is block-sparse. The problem of sparse generalised linear models, and sparse logistic regression in particular, has been extensively studied (see, e.g. Abramovich and Grinshtein, 2016; Bach, 2010; Bunea, 2008; Meier et al., 2008; Rigollet, 2012; van de Geer, 2008, and references therein). Our work focuses on the more restricted case of a block-sparse and low-rank parameter, establishing interesting statistical and computational phenomena in this setting.

Graphon model. The graphon model is a model of a random graph in which the explanatory variables associated with the vertices in the graph are unknown. It has recently become popular in the statistical community, see Gao et al. (2015); Klopp and Tsybakov (2015); Wolfe and Olhede (2013); Zhang et al. (2015). Typically, an objective of statistical inference is a link function which belongs to either a parametric or non-parametric class of functions. Interestingly, the minimax lower bound for the classes of Hölder-continuous functions, obtained in Gao et al. (2015), has not been attained by any polynomial-time algorithm.

Trace regression models. The modelling assumption (1.1) of the present paper is in fact very close to the trace regression model, as follows from the representation (1.3). Thus, the block-sparsity and low-rank structures are preserved and can well be studied by means of techniques developed for trace regression. We refer to Fan et al. (2016); Koltchinskii et al. (2011); Negahban and Wainwright (2011); Rohde and Tsybakov (2011) for recent developments in the linear trace regression model, and Fan et al. (2017) for the generalised trace regression model. However, computational lower bounds have not been studied there either, and many existing minimax optimal estimators cannot be computed in polynomial time.

Metric learning. In the task of metric learning, observations depend on an unknown geometric representation V1, . . . , Vn of the variables in a Euclidean space of low dimension. The goal is to estimate this representation (up to a rigid transformation), based on noisy observations of 〈Vi, Vj〉 in the form of random evaluations of similarity. Formally, our framework also recovers the task of metric learning by taking Xi = ei and Θ⋆ an unknown positive semidefinite matrix of small rank (here V⊤V ), since

〈Vi, Vj〉 = 〈V ei, V ej〉 = e⊤i V⊤V ej .

We refer to (Bellet et al., 2014; Chen et al., 2009) and references therein for a comprehensive survey of metric learning methods.

1.3 Parameter space

The unknown predictor matrix Θ⋆ describes the relationship between the observed features Xi and the probabilities of connection πij(Θ⋆) = σ(X⊤i Θ⋆Xj) following Definition 1. We focus on the high-dimensional setting where d2 ≫ N: the number of features for each vertex in the graph, and number of free parameters, is much greater than the total number of observations. In order to counter the curse of dimensionality, we make the assumption that the function (Xi, Xj) ↦ πij depends only on a small subset S of size k of all the coefficients of the explanatory variables. This translates to a block-sparsity assumption on Θ⋆: the coefficients Θ⋆ij are only nonzero for i and j in S. Furthermore, we assume that the rank of the matrix Θ⋆ can be smaller than the size of the block. Formally, we define the following parameter spaces

Pk,r(M) = { Θ ∈ Sd : ‖Θ‖1,1 < M , ‖Θ‖0,0 ≤ k , and rank(Θ) ≤ r } ,

for the coefficient-wise ℓ1 norm ‖ · ‖1,1 on Sd and integers k, r ∈ [d]. We also denote P(M) = Pd,d(M) for convenience.

Remark 2. The bounds on block-sparsity and rank in our parameter space are structural bounds: we consider the case where the matrix Θ⋆ can be concisely described in terms of the number of parameters. This is motivated by considering the spectral decomposition of the real symmetric matrix Θ⋆ as

Θ⋆ = ∑ℓ=1,...,r λℓ uℓu⊤ℓ .

The affinity Σij = X⊤i Θ⋆Xj between vertices i and j is therefore only a function of the projections of Xi and Xj along the axes uℓ, i.e.

Σij = X⊤i Θ⋆Xj = ∑ℓ=1,...,r λℓ (u⊤ℓ Xi)(u⊤ℓ Xj) .

Assuming that there are only a few of these directions uℓ with non-zero impact on the affinity motivates the low-rank assumption, while assuming that there are only few relevant coefficients of Xi, Xj that influence the affinity corresponds to a sparsity assumption on the uℓ, or block sparsity of Θ⋆. The effect of these projections on the affinity is weighted by the λℓ. By allowing for negative eigenvalues, we allow our model to go beyond a geometric description, where close or similar Xs are more likely to be connected. This can be used to model interactions where opposites attract.

The assumption of simultaneously sparse and low-rank matrices arises naturally in many applications in statistics and machine learning and has attracted considerable recent attention. Various regularisation techniques have been developed for estimation, variable and rank selection in multivariate regression problems (see, e.g. Bunea et al., 2012; Richard et al., 2012, and the references therein).

1.4 Explanatory variables

As mentioned above, this problem is different from tasks such as metric learning, where the objective is to estimate the Xi with no side information. Here they are seen as covariates, allowing us to infer from the observation on the graph the predictor variable Θ⋆. For this task to be even possible in a high-dimensional setting, we settle the identifiability issue by making the following variant of a classical assumption on X ∈ Rd×n.

Definition 3 (Block isometry property). For a matrix X ∈ Rd×n and an integer s ∈ [d], we define ∆Ω,s(X) ∈ (0, 1) as the smallest positive real such that

N(1 − ∆Ω,s(X)) ‖B‖2F ≤ ‖X⊤B X‖2F,Ω ≤ N(1 + ∆Ω,s(X)) ‖B‖2F ,

for all matrices B ∈ Sd that satisfy the block-sparsity assumption ‖B‖0,0 ≤ s.

Definition 4 (Restricted isometry properties). For a matrix A ∈ Rn×p and an integer s ∈ [p], δs(A) ∈ (0, 1) is the smallest positive real such that

n(1 − δs(A)) ‖v‖22 ≤ ‖Av‖22 ≤ n(1 + δs(A)) ‖v‖22 ,

for all s-sparse vectors, i.e. satisfying ‖v‖0 ≤ s.

When p = d2 is a square, we define δB,s(A) as the smallest positive real such that

n(1 − δB,s(A)) ‖v‖22 ≤ ‖Av‖22 ≤ n(1 + δB,s(A)) ‖v‖22 ,

for all vectors such that v = vec(B), where B satisfies the block-sparsity assumption ‖B‖0,0 ≤ s.

The first definition is due to Candès and Tao (2005), with restriction to sparse vectors. It can be extended in general, as here, to other types of restrictions (see, e.g. Traonmilin and Gribonval, 2015). Since the restriction on the vectors in the second definition (s-by-s block-sparsity) is more restrictive than in the first one (sparsity), δB,s is smaller than δs2. These different measures of restricted isometry are related, as shown in the following lemma.

Lemma 5. For a matrix X ∈ Rd×n, let DΩ ∈ RN×d2 be defined row-wise by DΩ,(i,j) = vec(XjX⊤i ) for all (i, j) ∈ Ω. It holds that

∆Ω,s(X) = δB,s(DΩ) .

Proof. This is a direct consequence of the definition of DΩ, which yields ‖X⊤B X‖2F,Ω = ‖DΩ vec(B)‖22, and ‖vec(B)‖22 = ‖B‖2F.
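Computing ∆Ω,s(X) exactly requires maximising the distortion over all s-by-s supports, which is combinatorial. The Monte Carlo probe below (our sketch, assuming numpy; the function name and the sampling scheme are ours) samples random block-sparse B and therefore only returns a lower bound on the constant.

    import numpy as np

    def block_isometry_lower_bound(X, Omega, s, n_trials=2000, seed=0):
        """Samples random s-by-s block-sparse symmetric B and records the worst observed
        distortion of ||X^T B X||_{F,Omega}^2 / (N ||B||_F^2) from 1; a lower bound on
        Delta_{Omega,s}(X), since the exact constant is a maximum over all supports."""
        rng = np.random.default_rng(seed)
        d, N = X.shape[0], len(Omega)
        rows = np.array([i for (i, j) in Omega])
        cols = np.array([j for (i, j) in Omega])
        worst = 0.0
        for _ in range(n_trials):
            S = rng.choice(d, size=s, replace=False)
            A = rng.normal(size=(s, s))
            B = np.zeros((d, d))
            B[np.ix_(S, S)] = (A + A.T) / 2
            Sigma = X.T @ B @ X
            ratio = np.sum(Sigma[rows, cols] ** 2) / (N * np.sum(B ** 2))
            worst = max(worst, abs(ratio - 1.0))
        return worst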


The assumptions above guarantee that the matrix Θ⋆ can be recovered from observations of the affinities, settling the well-posedness of this part of the inverse problem. However, we do not directly observe these affinities, but their image through the sigmoid function. We must therefore further impose the following assumption on the design matrix X that yields constraints on the probabilities πij and in essence governs the identifiability of Θ⋆.

Assumption 6. There exists a constant M > 0 such that for all Θ in the class P(M) we have max(i,j)∈Ω |X⊤i ΘXj| < M.

In particular, under this assumption the constant

(1.4)  L(M) := σ′(M) = σ(M)(1 − σ(M)) ,

is bounded away from zero, and we have

(1.5)  infΘ∈P(M) σ′(X⊤i ΘXj) ≥ L(M) > 0 ,

for all (i, j) ∈ Ω. Assuming that L(M) always depends on the same M, we sometimes write simply L.

Remark 7. Assumption 6 is necessary for the identifiability of Θ⋆: if X⊤i Θ⋆Xj can be arbitrarily large in magnitude, πij = σ(X⊤i Θ⋆Xj) can be arbitrarily close to 0 or 1. Since our observations only depend on Θ⋆ through its image πij, this could lead to a very large estimation error on Θ⋆ even with a small estimation error on the πij.

Remark 8. This assumption has already appeared in the literature on high-dimensional estimation, see Abramovich and Grinshtein (2016); van de Geer (2008). Similarly to Bach (2010), Assumption 6 can be shown to be redundant for minimax optimal prediction, because the log-likelihood function in the matrix logistic regression model satisfies the so-called self-concordance property. Our analysis to follow can be combined with an analysis similar to Bach (2010) to get rid of the assumption for minimax optimal prediction.

Proposition 9. The identifiability assumption max(i,j)∈Ω |X⊤i ΘXj| < M is guaranteed for all Θ ∈ P(M) and design matrices X satisfying either of the following:

• ‖XjX⊤i ‖∞ ≤ 1 for all (i, j) ∈ Ω,
• ‖Θ‖2F < M1 for some M1 > 0 and the block isometry property.
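For the first condition of Proposition 9, a one-line justification (our addition, combining the representation (1.3) with a Hölder-type inequality between ‖·‖∞ and ‖·‖1,1) is the following:

    \[
      |X_i^\top \Theta X_j|
        = \big|\langle\langle X_j X_i^\top,\ \Theta \rangle\rangle_F\big|
        \le \|X_j X_i^\top\|_\infty \,\|\Theta\|_{1,1}
        < M ,
      \qquad (i,j)\in\Omega,\ \ \Theta\in\mathcal{P}(M).
    \]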

1.4.1 Random designs. For random designs, we require the block isometry property to hold with high probability. Then the results in this article carry over directly and thus we do not discuss it in full detail. It is well known that for sparse linear models with the dimension of a target vector p and sparsity k, the classical restricted isometry property holds for some classes of random matrices with i.i.d. entries, including sub-Gaussian and Bernoulli matrices, see Mendelson et al. (2008), provided that n ≳ k log(p/k), and i.i.d. subexponential random matrices, see Adamczak et al. (2011), provided that n ≳ k log2(p/k).


In the same spirit, the design matrices with independent entries following sub-Gaussian, subexponential or Bernoulli distributions can be shown to satisfy the block isometry property, cf. Wang et al. (2016a), provided that the number of observed edges in the network satisfies N ≳ k2 log2(d/k) for sub-Gaussian and subexponential designs and N ≳ k2 log(d/k) for Bernoulli designs.

2. MATRIX LOGISTIC REGRESSION

The log-likelihood for this problem is

ℓY (Θ) = −∑(i,j)∈Ω ξ(−s(i,j)X⊤i ΘXj) ,

where s(i,j) = 2Y(i,j) − 1 is a sign variable that depends on the observations Y and ξ : x ↦ log(1 + ex) is the softplus function, convex on R. As a consequence, the negative log-likelihood −ℓY is a convex function of Θ. Denoting by ℓ the expectation IEΘ⋆[ℓY ], we recall the classical expressions, for all Θ ∈ Sd,

ℓ(Θ) = ℓ(Θ⋆) − ∑(i,j)∈Ω KL(πij(Θ⋆), πij(Θ)) = ℓ(Θ⋆) − KL(PΘ⋆ ,PΘ) ,

where we recall πij(Θ) = σ(X⊤i ΘXj), and

ℓY (Θ) = ℓ(Θ) + 〈〈∇ζ,Θ〉〉F ,

where ζ is the stochastic component of the log-likelihood, with constant gradient ∇ζ ∈ Rd×d given by ∇ζ = ∑(i,j)∈Ω (Y(i,j) − πij(Θ⋆))XjX⊤i , which is a sum of independent centered random variables.
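A direct numerical transcription of these quantities (our sketch, assuming numpy, with covariates stored as columns of X as above; written for clarity rather than speed) could look as follows.

    import numpy as np

    def log_likelihood(Theta, X, Omega, Y):
        """l_Y(Theta): sum over observed pairs of -log(1 + exp(-s * eta)) with
        eta = X_i^T Theta X_j and s = 2Y - 1, as in the display above."""
        eta = np.array([X[:, i] @ Theta @ X[:, j] for (i, j) in Omega])
        s = 2.0 * np.asarray(Y) - 1.0
        return -np.sum(np.logaddexp(0.0, -s * eta))

    def score(Theta, X, Omega, Y):
        """The matrix sum over Omega of (Y_(i,j) - sigma(eta_ij)) X_j X_i^T; evaluated
        at Theta = Theta_star this is the gradient of the stochastic component zeta."""
        eta = np.array([X[:, i] @ Theta @ X[:, j] for (i, j) in Omega])
        resid = np.asarray(Y) - 1.0 / (1.0 + np.exp(-eta))
        G = np.zeros_like(Theta)
        for r, (i, j) in zip(resid, Omega):
            G += r * np.outer(X[:, j], X[:, i])
        return G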

2.1 Penalized logistic loss

In a classical setting where d is fixed and N grows, the maximiser of ℓY (the maximum likelihood estimator) is an accurate estimator of Θ⋆, provided that it is possible to identify Θ from PΘ (i.e. if the Xi are well conditioned). We are here in a high-dimensional setting where d2 ≫ N, and this approach is not directly possible. Our parameter space indicates that the intrinsic dimension of our problem is truly much lower in terms of rank and block-sparsity. Our assumption on the conditioning of the Xi is tailored to this structural assumption. In the same spirit, we also modify our estimator in order to promote the selection of elements of low rank and block-sparsity. Following the ideas of Birgé and Massart (2007) and Abramovich and Grinshtein (2016), we define the following penalized maximum likelihood estimator

(2.1)  Θ̂ ∈ argminΘ∈P(M) { −ℓY (Θ) + p(Θ) } ,

with a penalty p defined as

(2.2)  p(Θ) = g(rank(Θ), ‖Θ‖0,0) ,  and  g(R,K) = cKR + cK log(de/K) ,

where c > 0 is a universal constant, to be specified further. The proof of the following theorem is based on Dudley's integral argument combined with Bousquet's inequality and is deferred to the Appendix.
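To make explicit why (2.1) is combinatorial, the toy sketch below (our illustration, assuming numpy and scipy; it ignores the rank constraint and the ‖·‖1,1 bound, fixes the block size k, and all names and constants are arbitrary) enumerates every support of size k and fits the block-restricted MLE on each. Its cost grows like the number of supports, which is exactly the obstruction addressed in Section 2.2.

    import itertools
    import numpy as np
    from scipy.optimize import minimize

    def neg_log_lik_block(theta_block, S, X, Omega, Y):
        """Negative log-likelihood of a symmetric Theta supported on the index set S."""
        k = len(S)
        B = theta_block.reshape(k, k)
        B = (B + B.T) / 2
        XS = X[S, :]
        eta = np.array([XS[:, i] @ B @ XS[:, j] for (i, j) in Omega])
        s = 2.0 * np.asarray(Y) - 1.0
        return np.sum(np.logaddexp(0.0, -s * eta))

    def penalised_mle_bruteforce(X, Omega, Y, k, c=1.0):
        """Toy version of (2.1) with r = k: exhaustive search over supports of size k."""
        d = X.shape[0]
        # penalty g(R, K) with R = K = k; constant over supports of a fixed size,
        # it would matter when comparing models with different (k, r)
        penalty = c * k * k + c * k * np.log(d * np.e / k)
        best, best_obj = None, np.inf
        # comb(d, k) models are visited: exponential in k
        for S in itertools.combinations(range(d), k):
            S = list(S)
            res = minimize(neg_log_lik_block, np.zeros(k * k),
                           args=(S, X, Omega, Y), method="L-BFGS-B")
            if res.fun + penalty < best_obj:
                best_obj = res.fun + penalty
                B = res.x.reshape(k, k)
                best = np.zeros((d, d))
                best[np.ix_(S, S)] = (B + B.T) / 2
        return best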


Theorem 10. Assume the design matrix X satisfies max(i,j)∈Ω |X⊤i Θ⋆Xj| < M for some M > 0 and all Θ⋆ in a given class, and the penalty term p(Θ) satisfies (2.2) with constants c ≥ c1/L, c1 > 1, and L given in (1.4). Then for the penalised MLE estimator Θ̂, the following non-asymptotic upper bound on the expectation of the Kullback-Leibler divergence between the measures PΘ⋆ and PΘ̂ holds:

(2.3)  supΘ⋆∈Pk,r(M) (1/N) IE[KL(PΘ⋆ ,PΘ̂)] ≤ C1 kr/N + C1 (k/N) log(de/k) ,

where C1 > 3c is some universal constant, for all k = 1, . . . , d and r = 1, . . . , k.

Remark 11. Random designs with i.i.d. entries following sub-Gaussian, Bernoulli and subexponential distributions discussed in Section 1.4.1 yield the same rate as well. It can formally be shown using standard conditioning arguments (see, e.g. Nickl and van de Geer, 2013).

Corollary 12. Assume the design matrix X satisfies the block isometry property from Definition 3 and max(i,j)∈Ω |X⊤i Θ⋆Xj| < M for some M > 0 and all Θ⋆ in a given class, and the penalty term p(Θ) is as in Theorem 10. Then for the penalised MLE estimator Θ̂, the following non-asymptotic upper bound on the rate of estimation holds:

supΘ⋆∈Pk,r(M) IE[‖Θ̂ − Θ⋆‖2F] ≤ C1 / (L(M)(1 − ∆Ω,2k(X))) · ( kr/N + (k/N) log(de/k) ) ,

where C1 > 3c is some universal constant, for all k = 1, . . . , d and r = 1, . . . , k.

Let us define rank-constrained maximum likelihood estimators with bounded block size as

Θ̂k,r ∈ argminΘ∈Pk,r(M) −ℓY (Θ) .

It is intuitively clear that without imposing any regularisation on the likelihood function, the maximum likelihood approach selects the most complex model. In fact, the following result holds.

Theorem 13. Assume the design matrix X satisfies the block isometry property from Definition 3 and max(i,j)∈Ω |X⊤i Θ⋆Xj| < M for some M > 0 and all Θ⋆ in a given class. Then for the maximum likelihood estimator Θ̂k,r, the following non-asymptotic upper bound on the rate of estimation holds:

supΘ⋆∈Pk,r(M) IE[‖Θ̂k,r − Θ⋆‖2F] ≤ C3 / (L(M)(1 − ∆Ω,2k(X))) · ( kr/N + (k/N) log(de/k) ) ,

for all k = 1, . . . , d and r = 1, . . . , k and some constant C3 > 0.

Remark 14. The penalty (2.2) belongs to the class of so-called minimal penalties, cf. Birgé and Massart (2007). In particular, a naive MLE approach with p(Θ) = 0 in (2.1) yields a suboptimal estimator, as follows from Theorem 13.


2.2 Convex relaxation

In practice, computation of the estimator (2.1) is often infeasible. In essence, in order to compute it, we need to compare the likelihood functions over all possible subspaces Pk,r(M). Sophisticated step-wise model selection procedures allow one to reduce the number of analysed models. However, they are not feasible in a high-dimensional setting either. We here consider the following estimator

(2.4)  Θ̂Lasso = argminΘ∈Sd { −ℓY (Θ) + λ‖Θ‖1,1 } ,

with λ > 0 to be chosen further, which is equivalent to the logistic Lasso on vec(Θ). Using standard arguments, cf. Example 1 in van de Geer (2008), combined with the block isometry property, the following result immediately follows.
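A minimal sketch of (2.4), assuming scikit-learn and numpy are available: the design is vectorised as in (1.3) and an ℓ1-penalised logistic regression is fitted. The parameterisation of the penalty (roughly C = 1/λ), the solver choice and the final symmetrisation are implementation details of ours, not taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def logistic_lasso(X, Omega, Y, lam):
        """Logistic Lasso on vec(Theta): rows of the design are vec(X_j X_i^T)."""
        d = X.shape[0]
        D = np.stack([np.outer(X[:, j], X[:, i]).ravel() for (i, j) in Omega])
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear",
                                 fit_intercept=False, max_iter=1000)
        clf.fit(D, np.asarray(Y))                 # Y must contain both labels 0 and 1
        Theta_hat = clf.coef_.reshape(d, d)
        return (Theta_hat + Theta_hat.T) / 2      # symmetrise, since Theta lies in S_d

    # e.g. lam of order sqrt(log d), in the spirit of Theorem 15 (constant set to 1 here)
    # Theta_lasso = logistic_lasso(X, Omega, Y, lam=np.sqrt(np.log(X.shape[0])))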

Theorem 15. Assume the design matrix X satisfies the block isometry property from Definition 3 and max(i,j)∈Ω |X⊤i Θ⋆Xj| < M for some M > 0 and all Θ⋆ in a given class. Then for λ = C4√(log d), where C4 > 0 is an appropriate universal constant, the estimator (2.4) satisfies

(2.5)  supΘ⋆∈Pk,r(M) IE[‖Θ̂Lasso − Θ⋆‖2F] ≤ C5 / (L(M)(1 − ∆Ω,2k(X))) · (k2/N) log d ,

for all k = 1, . . . , d and r = 1, . . . , k and some universal constant C5 > 0.

As one could expect, the upper bound on the rate of estimation of our feasible estimator is independent of the true rank r. It is natural, when dealing with a low-rank and block-sparse objective matrix, to combine the nuclear penalty with either the (2, 1)-norm penalty or the (1, 1)-norm penalty of a matrix, cf. Bunea et al. (2012); Giraud (2011); Koltchinskii et al. (2011); Richard et al. (2012). In our setting, it can be easily shown that combining the (1, 1)-norm penalty and the nuclear penalty yields the same rate of estimation (k2/N) log d. This appears to be inevitable in view of a computational lower bound, obtained in Section 3, which is independent of the rank as well. In particular, these findings partially answer a question posed in Section 6.4.4 in Giraud (2014).

2.3 Prediction

In applications, as new users join the network, we are interested in predicting the probabilities of the links between them and the existing users. It is natural to measure the prediction error of an estimator Θ̂ by IE[∑(i,j)∈Ω (πij(Θ̂) − πij(Θ⋆))2], which is controlled according to the following result using the smoothness of the logistic function σ.

Theorem 16. Under Assumption 6, we have the following rate for estimating the matrix of probabilities Σ⋆ = X⊤Θ⋆X ∈ Rn×n with the estimator Σ̂ = X⊤Θ̂X ∈ Rn×n:

supΘ⋆∈Pk,r(M) (1/(2N)) IE[‖Σ̂ − Σ⋆‖2F,Ω] ≤ (C1/L(M)) · ( kr/N + (k/N) log(de/k) ) ,

with the constant C1 from (2.3). The rate is minimax optimal, i.e. a minimax lower bound of the same asymptotic order holds for the prediction error of estimating the matrix of probabilities Σ⋆ = X⊤Θ⋆X ∈ Rn×n.


2.4 Information-theoretic lower bounds

The following result demonstrates that the minimax lower bound on the rate of estimation matches the upper bound in Theorem 10, implying that the rate of estimation is minimax optimal.

Theorem 17. Let the design matrix X satisfy the block isometry property. Then for estimating Θ⋆ ∈ Pk,r(M) in the matrix logistic regression model, the following lower bound on the rate of estimation holds:

infΘ̂ supΘ⋆∈Pk,r(M) IE[‖Θ̂ − Θ⋆‖2F] ≥ C2 / (1 + ∆Ω,2k(X)) · ( kr/N + (k/N) log(de/k) ) ,

where the constant C2 > 0 is independent of d, k, r and the infimum extends over all estimators Θ̂.

Remark 18. Lower bounds of the same order hold for the expectation of the Kullback-Leibler divergence between the measures PΘ⋆ and PΘ̂ and for the prediction error of estimating the matrix of probabilities Σ⋆ = X⊤Θ⋆X ∈ Rn×n.

3. COMPUTATIONAL LOWER BOUNDS

In this section, we investigate whether the lower bound in Theorem 17 can be achieved by an estimator computable in polynomial time. The fastest rate of estimation attained by a (randomised) polynomial-time algorithm in the worst-case scenario is usually referred to as a computational lower bound. Recently, the gap between computational and statistical lower bounds has attracted a lot of attention in the statistical community. We refer to Berthet and Rigollet (2013a,b); Chen (2015); Chen and Xu (2016); Gao et al. (2017); Hajek et al. (2015); Ma and Wu (2015); Wang et al. (2016b); Zhang and Dong (2017) for computational lower bounds in high-dimensional statistics based on the planted clique problem (see below), Berthet and Ellenberg (2015) for the use of hardness of learning parity with noise, Oymak et al. (2015) for denoising of sparse and low-rank matrices, Agarwal (2012) for computational trade-offs in statistical learning, Zhang et al. (2014) for worst-case lower bounds for sparse estimators in linear regression, Bruer et al. (2015); Chandrasekaran and Jordan (2013) for another approach to computational trade-offs in statistical problems, and Berthet and Chandrasekaran (2016); Berthet and Perchet (2017) on the management of these trade-offs. In order to establish a computational lower bound for block-sparse matrix logistic regression, we exploit a reduction scheme from Berthet and Rigollet (2013a): we show that detecting a subspace of Pk,r(M) can be computationally as hard as solving the dense subgraph detection problem.

3.1 The dense subgraph detection problem

Although our work is related to the study of graphs, we recall for absolute clarity the following notions from graph theory. A graph G = (V,E) is a non-empty set V of vertices, together with a set E of distinct unordered pairs {i, j} with i, j ∈ V, i ≠ j. Each element {i, j} of E is an edge and joins i to j. The vertices of an edge are called its endpoints. We consider only undirected graphs with neither loops nor multiple edges. A graph is called complete if every pair of distinct vertices is connected. A graph G′ = (V′, E′) is a subgraph of a graph G = (V,E) if V′ ⊆ V and E′ ⊆ E. A subgraph C is called a clique if it is complete. The problem of detecting a maximum clique, or all cliques, the so-called Clique problem, in a given graph is known to be NP-complete, cf. Karp (1972).

The Planted Clique problem, motivated as an average-case version of the Clique problem, can be formalised as a decision problem over random graphs, parametrised by the number of vertices n and the size of the subgraph k. Let Gn denote the collection of all graphs with n vertices and G(n, 1/2) denote the distribution of Erdős-Rényi random graphs, uniform on Gn, where each edge is drawn independently at random with probability 1/2. For any k ∈ {1, . . . , n} and q ∈ (1/2, 1], let G(n, 1/2, k, q) be a distribution on Gn constructed by first picking k vertices independently at random and connecting all edges in-between with probability q, and then joining each remaining pair of distinct vertices by an edge independently at random with probability 1/2. Formally, the Planted Clique problem refers to the hypothesis testing problem of

(3.1)  H0 : A ∼ G(n, 1/2) vs. H1 : A ∼ G(n, 1/2, k, 1) ,

based on observing an adjacency matrix A ∈ Rn×n of a random graph drawn from either G(n, 1/2) or G(n, 1/2, k, 1).
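A sampler for the distributions appearing in (3.1), and in the dense subgraph relaxation discussed below, can be written in a few lines; this sketch assumes numpy and the function name is ours.

    import numpy as np

    def sample_dense_subgraph(n, k, q, rng=None):
        """Adjacency matrix from G(n, 1/2, k, q): a random k-subset of vertices is
        internally connected with probability q, every other pair with probability 1/2.
        q = 1 gives the planted clique model; q = 1/2 reduces to the null G(n, 1/2)."""
        rng = np.random.default_rng() if rng is None else rng
        planted = rng.choice(n, size=k, replace=False)
        mask = np.zeros(n, dtype=bool)
        mask[planted] = True
        inside = np.outer(mask, mask)                                # pairs within the subset
        upper = np.triu(rng.random((n, n)) < 0.5, k=1).astype(int)   # background edges
        upper_inside = (np.triu(rng.random((n, n)) < q, k=1) & inside).astype(int)
        upper = np.where(np.triu(inside, k=1), upper_inside, upper)
        return upper + upper.T, planted                              # symmetric, zero diagonal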

One of the main properties of the Erdős-Rényi random graph was studied in Erdős and Rényi (1959), as well as in Grimmett and McDiarmid (1975), who in particular proved that the size of the largest clique in G(n, 1/2) is asymptotically close to 2 log2 n almost surely. On the other hand, Alon et al. (1998) proposed a spectral method that for k > c√n detects a planted clique with high probability in polynomial time. Hence the most intriguing regime for k is

(3.2)  2 log2 n ≤ k ≤ c√n .

The conjecture that no polynomial-time algorithm exists for distinguishing between hypotheses in (3.1) in the regime (3.2) with probability tending to 1 as n → ∞ is the famous Planted Clique conjecture in complexity theory. Its variations have been used extensively as computational hardness assumptions in statistical problems, see Berthet and Rigollet (2013b); Cai and Wu (2018); Gao et al. (2017); Wang et al. (2016b).

The Planted Clique problem can be reduced to the so-called dense subgraph detection problem of testing the null hypothesis in (3.1) against the alternative H1 : A ∼ G(n, 1/2, k, q), where q ∈ (1/2, 1]. This is clearly a computationally harder problem. In this paper, we assume the following variation of the Planted Clique conjecture, which is used to establish a computational lower bound in the matrix logistic regression model.

Conjecture 19 (The dense subgraph detection conjecture). For any sequence k = kn such that k ≤ nβ for some 0 < β < 1/2, and any q ∈ (1/2, 1], there is no (randomised) polynomial-time algorithm that can correctly identify the dense subgraph with probability tending to 1 as n → ∞, i.e. for any sequence of (randomised) polynomial-time tests (ψn : Gn → {0, 1})n, we have

lim infn→∞ { P0(ψn(A) = 1) + P1(ψn(A) = 0) } ≥ 1/3 .


3.2 Reduction to the dense subgraph detection problem and a computational lower bound

Consider the vectors of explanatory variables Xi = N1/4ei, i = 1, . . . , n, and assume without loss of generality that the observed set of edges Ω in the matrix logistic regression model consists of the interactions of the n nodes Xi, i.e. it holds N = |Ω| = (n choose 2). It follows from the matrix logistic regression modelling assumption (1.1) that the Erdős-Rényi graph G(n, 1/2) corresponds to a random graph associated with the matrix Θ0 = 0 ∈ Rd×d. Let Gl(k) be a subset of Pk,1(M) with a fixed support l of the block. In addition, let G^{αN}_k ⊂ Pk,1(M) be a subset consisting of the matrices Θl ∈ Gl(k), l = 1, . . . ,K, K = (n choose k), such that all elements in the block of a matrix Θl equal some αN = α/√N > 0, see Figure 2. Then we have

P((i, j) ∈ E | Xi, Xj) = 1/(1 + exp(−X⊤i ΘXj)) = 1/(1 + e−α) ,

for all Θ ∈ G^{αN}_k. Therefore, the testing problem

(3.3)  H0 : Y ∼ PΘ0 vs. H1 : Y ∼ PΘ, Θ ∈ G^{αN}_k ,

where Y ∈ {0, 1}N is the adjacency vector of binary responses in the matrix logistic regression model, is reduced to the dense subgraph detection problem with q = 1/(1 + e−α). This reduction scheme suggests that the computational lower bound for separating the hypotheses in the dense subgraph detection problem mimics the computational lower bound for separating the hypotheses in (3.3) in the matrix logistic regression model. The following theorem exploits this fact in order to establish a computational lower bound of order k2/N for estimating the matrix Θ⋆ ∈ Pk,r(M).

Theorem 20. Let Fk be any class of matrices containing G^{αN}_k ∪ {Θ0} from the reduction scheme. Let c > 0 be a positive constant and f(k, d,N) be a real-valued function satisfying f(k, d,N) ≤ ck2/N for k = kn < nβ, 0 < β < 1/2, and a sequence d = dn, for all n > m0 ∈ IN. If Conjecture 19 holds, then for some design X that fulfils the block isometry property from Definition 3 there is no estimator of Θ⋆ ∈ Fk that attains the rate f(k, d,N) for the Frobenius-norm risk and can be evaluated using a (randomised) polynomial-time algorithm, i.e. for any estimator Θ̂ computable in polynomial time, there exists a sequence (k, d,N) = (kn, dn, N) such that

(3.4)  (1/f(k, d,N)) supΘ⋆∈Fk IE[‖Θ̂ − Θ⋆‖2F] → ∞ ,

as n → ∞. Similarly, for any estimator Θ̂ computable in polynomial time, there exists a sequence (k, d,N) = (kn, dn, N) such that

(3.5)  (1/f(k, d,N)) supΘ⋆∈Fk (1/N) IE[‖Σ̂ − Σ⋆‖2F,Ω] → ∞ ,

for the prediction error of estimating Σ⋆ = X⊤Θ⋆X.


Figure 2: The construction of matrices G^{αN}_k used in the reduction scheme. [The figure depicts a d × d matrix Θ ∈ G^{αN}_k whose kn × kn block has all entries equal to αN.]

Remark 21. Thus the computational lower bound for estimating the matrix Θ⋆ in the matrix logistic regression model is of order k2/N, compared to the minimax rate of estimation of order kr/N + (k/N) log(de/k) and the rate of estimation (k2/N) log d for the Lasso estimator Θ̂Lasso, cf. Figure 1. Hence the computational gap is most noticeable for matrices of rank 1. Furthermore, as a simple consequence of this result, the corresponding computational lower bound for the prediction risk of estimating Σ⋆ = X⊤Θ⋆X is k2/N as well.

Proof. We here provide a proof of the computational lower bound on the prediction error (3.5) for convenience. The bound on the estimation error (3.4) is straightforward to show by utilizing the block isometry property. Assume that there exists a hypothetical estimator Θ̂ computable in polynomial time that attains the rate f(k, d,N) for the prediction error, i.e. such that it holds that

lim supn→∞ (1/f(k, d,N)) supΘ⋆∈Fk (1/N) IE[‖X⊤(Θ̂ − Θ⋆)X‖2F,Ω] ≤ b < ∞ ,

for all sequences (k, d,N) = (kn, dn, N) and a constant b. Then by Markov's inequality, we have

(3.6)  (1/N) ‖X⊤(Θ̂ − Θ⋆)X‖2F,Ω ≤ uf(k, d,N) ,

for some numeric constant u > 0 with probability 1 − b/u for all Θ⋆ ∈ Fk. Following the reduction scheme, we consider the design vectors Xi = N1/4ei, i = 1, . . . , n and the subset of edges Ω, such that

(3.7)  (1/N) ‖X⊤(Θ̂ − Θ⋆)X‖2F,Ω = ∑(i,j)∈Ω (Θ̂ij − Θ⋆ij)2 = ‖Θ̂ − Θ⋆‖2F,Ω ,

for any Θ⋆ ∈ G^{αN}_k. Note that the design vectors Xi = N1/4ei, i = 1, . . . , n clearly satisfy Assumption 6. Thus, in order to separate the hypotheses

(3.8)  H0 : Y ∼ P0 vs. H1 : Y ∼ PΘ, Θ ∈ G^{αN}_k ,

it is natural to employ the following test

(3.9)  ψ(Y ) = 1(‖Θ̂‖F,Ω ≥ τd,k(u)) ,

where τ2d,k(u) = uf(k, d,N). The type I error of this test is controlled automatically due to (3.6) and (3.7), P0(ψ = 1) ≤ b/u. For the type II error, we obtain

supΘ∈G^{αN}_k PΘ(ψ = 0) = supΘ∈G^{αN}_k PΘ(‖Θ̂‖F,Ω < τd,k(u)) ≤ supΘ∈G^{αN}_k PΘ(‖Θ̂ − Θ‖2F,Ω > ‖Θ‖2F,Ω − τ2d,k(u)) ≤ b/u ,

provided that

k(k − 1)α2N/2 ≥ 2τ2d,k(u) = 2uf(k, d,N) ,

which is true in the regime k ≤ nβ, β < 1/2, and α2 ≥ 4u/c (hence α2N ≥ 4u/(cN)) by the definition of the function f(k, d,N). Putting the pieces together, we obtain

lim supn→∞ { P0(ψ(Y ) = 1) + supΘ∈G^{αN}_k PΘ(ψ(Y ) = 0) } ≤ 2b/u < 1/3 ,

for u > 6b. Hence, the test (3.9) separates the hypotheses (3.8). This contradicts Conjecture 19 and implies (3.5).

4. CONCLUDING REMARKS

Our results shed further light on the emerging topic of statistical and computational trade-offs in high-dimensional estimation. The matrix logistic regression model is very natural to study the connection between statistical accuracy and computational efficiency, as the model is based on the study of a generative model for random graphs. It is also an extension of lower bounds for all statistical procedures to a model with covariates, the first of its kind.

Our findings suggest that block-sparsity is a limiting model selection criterion for polynomial-time estimation in the logistic regression model. That is, imposing further structure, like an additional low-rank constraint, and thus reducing the number of studied models yields an expected gain in the minimax rate, but that gain can never be achieved by a polynomial-time algorithm. In this setting, this implies that with a larger parameter space, while the statistical rates might be worse, they might be closer to those that are computationally achievable. As an illustration, both efficient and minimax optimal estimation is possible for estimating sparse vectors in the high-dimensional linear regression model, see, for example, SLOPE for achieving the exact minimax rate in Bellec et al. (2018); Bogdan et al. (2015), extending upon previous results on the Dantzig selector and Lasso in Bickel et al. (2009); Candès and Tao (2007).

The logistic regression is also a representative of a large class of generalised linear models. Furthermore, the proof of the minimax lower bound on the rate of estimation in Theorem 17 can be extended to all generalised linear models. The combinatorial estimator (2.1) can well be used to achieve the minimax rate. The computational lower bound then becomes a delicate issue. A more sophisticated reduction scheme is needed to relate the dense subgraph detection problem to an appropriate testing problem for a generalised linear model. Approaching this question might require notions of noise discretisation and Le Cam equivalence studied in Ma and Wu (2015).

An interesting question is whether it is possible to adapt polynomial-time algorithms available for detecting a dense subgraph to estimating the target matrix in the logistic regression model in all sparsity regimes. A common idea behind those algorithms is to search for a dense subgraph over the vertices of high degree and thus substantially reduce the number of compared models of subgraphs. The network we observe in the logistic regression model is generated by a sparse matrix. We may still observe a fully connected network which is generated by a small block in the target matrix. Therefore, it is not yet clear how to adapt algorithms for dense subgraph detection to submatrix detection. It remains an open question to establish whether these results can be extended to any design matrix, and all parameter regimes.

REFERENCES

Abbe, E. (2017). Community detection and stochastic block models: recent developments. arXiv:1703.10146.

Abbe, E. and Sandon, C. (2015). Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv:1512.09080.

Abramovich, F. and Grinshtein, V. (2016). Model selection and minimax estimation in generalized linear models. IEEE Transactions on Information Theory 62 3721–3730.

Adamczak, R., Litvak, A., Pajor, A. and Tomczak-Jaegermann, N. (2011). Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation 34 61–88.

Agarwal, A. (2012). Computational Trade-offs in Statistical Learning. Thesis, University of California, Berkeley. http://research.microsoft.com/en-us/um/people/alekha/thesismain.pdf

Alon, N., Krivelevich, M. and Sudakov, B. (1998). Finding a large hidden clique in a random graph. Random Structures and Algorithms 13 457–466.

Alon, N. K. M. and Sudakov, B. (1998). Finding a large hidden clique in a random graph. In Proceedings of the Eighth International Conference “Random Structures and Algorithms” (Poznan, 1997), vol. 13.

Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics 4 384–414.

Banks, J., Moore, C., Neeman, J. and Netrapalli, P. (2016). Information-theoretic thresholds for community detection in sparse networks. arXiv:1601.02658.

Bellec, P., Lecué, G. and Tsybakov, A. (2018). Slope meets Lasso: improved oracle bounds and optimality. Annals of Statistics (to appear). arXiv:1605.08651.

Bellet, A., Habrard, A. and Sebban, M. (2014). A survey on metric learning for feature vectors and structured data.

Berthet, Q. and Chandrasekaran, V. (2016). Resource allocation for statistical estimation. Proceedings of the IEEE 104 115–125.

Berthet, Q. and Ellenberg, J. (2015). Detection of planted solutions for flat satisfiability problems.


Berthet, Q. and Perchet, V. (2017). Fast rates for bandit optimization with upper-confidence Frank-Wolfe. NIPS 2017, to appear.

Berthet, Q. and Rigollet, P. (2013a). Complexity theoretic lower bounds for sparse principal component detection. In Conference on Learning Theory.

Berthet, Q. and Rigollet, P. (2013b). Optimal detection of sparse principal components in high dimension. The Annals of Statistics 41 1780–1815.

Berthet, Q., Rigollet, P. and Srivastava, P. (2016). Exact recovery in the Ising blockmodel.

Biau, G. and Bleakly, K. (2008). Statistical inference on graphs. Statistics and Decisions 24 209–232.

Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences 106 21068–21073.

Bickel, P. J., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 1705–1732. URL https://doi.org/10.1214/08-AOS620

Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields 138 33–73.

Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. (2015). SLOPE: adaptive variable selection via convex optimization. The Annals of Applied Statistics 9 1103.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique 334 495–500.

Bruer, J. J., Tropp, J. A., Cevher, V. and Becker, S. R. (2015). Designing statistical estimators that balance sample size, risk, and computational cost. IEEE Journal of Selected Topics in Signal Processing 9 612–624.

Bubeck, S., Ding, J., Eldan, R. and Racz, M. (2014). Testing for high-dimensional geometry in random graphs. URL https://arxiv.org/abs/1411.5713

Bunea, F. (2008). Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electronic Journal of Statistics 2 1153–1194.

Bunea, F., She, Y. and Wegkamp, M. (2012). Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics 40 2359–2388.

Cai, T. and Wu, Y. (2018). Statistical and computational limits for sparse matrix detection. arXiv preprint arXiv:1801.00518.


Candès, E. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57 2342–2359.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351. URL https://doi.org/10.1214/009053606000001523

Candès, E. J. and Tao, T. (2005). Decoding by Linear Programming. IEEE Trans. Information Theory 51 4203–4215.

Chandrasekaran, V. and Jordan, M. I. (2013). Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences.

Chen, Y. (2015). Incoherence-optimal matrix completion. IEEE Transactions on Information Theory 61 2909–2923.

Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A. and Cazzanti, L. (2009). Similarity-based classification: Concepts and algorithms. J. Mach. Learn. Res. 10 747–776.

Chen, Y. and Xu, J. (2016). Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research 17 1–57. URL http://jmlr.org/papers/v17/14-330.html

Decelle, A., Krzakala, F., Moore, C. and Zdeborova, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E 84 066106.

Devroye, L., Gyorgy, A., Lugosi, G. and Udina, F. (2011). High-dimensional random geometric graphs and their clique number. Electronic Communications in Probability 16 2481–2508.

Dudley, R. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1 290–330.

Erdős, P. and Rényi, A. (1959). On random graphs. Publicationes Mathematicae 6 290–297.

Fan, J., Gong, W. and Zhu, Z. (2017). Generalized high-dimensional trace regression via nuclear norm regularization. arXiv preprint arXiv:1710.08083.

Fan, J., Wang, W. and Zhu, Z. (2016). Robust low-rank matrix recovery. arXiv preprint arXiv:1603.08315.

Gao, C., Lu, Y. and Zhou, H. (2015). Rate-optimal graphon estimation. The Annals of Statistics 43 2624–2652.

Gao, C., Ma, Z. and Zhou, H. (2017). Sparse CCA: adaptive estimation and computational barriers. arXiv preprint arXiv:1409.8565.


Gine, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

Giraud, C. (2011). Low rank multivariate regression. Electronic Journal of Statistics 5 775–799.

Giraud, C. (2014). Introduction to high-dimensional statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Taylor & Francis.

Grimmett, G. and McDiarmid, C. (1975). On colouring random graphs. In Mathematical Proceedings of the Cambridge Philosophical Society, vol. 77. Cambridge Univ Press.

Hajek, B., Wu, Y. and Xu, J. (2015). Computational lower bounds for community detection on random graphs. In Proceedings of The 28th Conference on Learning Theory, vol. 40 of Proceedings of Machine Learning Research.

Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97 1090–1098.

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks 5 109–137.

Holland, P. W. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association 76 33–50.

Jain, L., Jamieson, K. and Nowak, R. (2016). Finite sample prediction and recovery bounds for ordinal embedding.

Karp, R. (1972). Reducibility among combinatorial problems. Springer.

Klopp, O., Tsybakov, A. and Verzelen, N. (2015). Oracle inequalities for network models and sparse graphon estimation. Annals of Statistics (to appear). arXiv:1507.04118.

Koltchinskii, V., Lounici, K. and Tsybakov, A. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39 2302–2329.

Kučera, L. (1995). Expected complexity of graph partitioning problems. Discrete Appl. Math. 57 193–212. Combinatorial optimization 1992 (CO92) (Oxford).

Ma, Z. and Wu, Y. (2015). Computational barriers in minimax submatrix detection. The Annals of Statistics 43 1089–1116.

Madeira, S. C. and Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics 1 24–45.


Massart, P. (2007). Concentration inequalities and model selection, vol. 6. Springer.

Massoulié, L. (2014). Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing. ACM.

Meier, L., van de Geer, S. and Buhlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 53–71.

Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2008). Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation 28 277–289.

Mossel, E., Neeman, J. and Sly, A. (2013). A proof of the block model threshold conjecture. arXiv:1311.4115.

Mossel, E., Neeman, J. and Sly, A. (2015). Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields 162 431–461.

Negahban, S. and Wainwright, M. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics 39 1069–1097.

Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.

Oymak, S., Jalali, A., Fazel, M., Eldar, Y. and Hassibi, B. (2015). Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory 61 2886–2908.

Papa, G., Bellet, A. and Clemencon, S. (2016). On graph reconstruction via empirical risk minimization: Fast learning rates and scalability. In Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett, eds.). Curran Associates, Inc., 694–702.

Penrose, M. (2003). Random Geometric Graphs. Oxford University Press.

Richard, E., Savalle, P. and Vayatis, N. (2012). Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).

Rigollet, P. (2012). Kullback-Leibler aggregation and misspecified generalized linear models. The Annals of Statistics 40 639–665.

Rohde, A. and Tsybakov, A. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics 39 887–930.

Traonmilin, Y. and Gribonval, R. (2015). Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all. URL https://arxiv.org/abs/1510.00504


Tsybakov, A. (2008). Introduction to Nonparametric Estimation. 1st ed. Springer.

van de Geer, S. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics 36 614–645.

Wang, T., Berthet, Q. and Plan, Y. (2016a). Average-case hardness of RIP certification. In Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS'16, Curran Associates Inc., USA. URL http://dl.acm.org/citation.cfm?id=3157382.3157525

Wang, T., Berthet, Q. and Samworth, R. (2016b). Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics 44 1896–1930.

Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences, Cambridge University Press.

Wolfe, P. and Olhede, S. (2013). Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936.

Yu, H., Braun, P. and Yildirim, M. (2008). High quality binary protein interaction map of the yeast interactome network. Science (New York, NY) 322 104–110.

Zhang, A. and Dong, X. (2017). Tensor SVD: Statistical and computational limits. arXiv preprint arXiv:1703.02724.

Zhang, Y., Levina, E. and Zhu, J. (2015). Estimating network edge probabilities by neighborhood smoothing. arXiv preprint arXiv:1509.08588.

Zhang, Y., Wainwright, M. and Jordan, M. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression 35.


APPENDIX A: PROOFS

A.1 Some geometric properties of the likelihood

Let us recall the stochastic component of the likelihood function
\[
\zeta(\Theta) = \ell_Y(\Theta) - \ell(\Theta) = \sum_{(i,j)\in\Omega} \bigl( Y_{(i,j)} - \pi_{ij}(\Theta^\star) \bigr)\, X_i^\top \Theta X_j ,
\]

which is a linear function in $\Theta$. The deviation of the gradient $\nabla\zeta$ of the stochastic component is governed by the deviation of the independent Bernoulli random variables $\varepsilon_{i,j} = Y_{(i,j)} - \mathbb{E}[Y_{(i,j)}] = Y_{(i,j)} - \pi_{ij}(\Theta^\star)$, $(i,j)\in\Omega$. Let us introduce an upper triangular matrix $E_\Omega = (\varepsilon_{i,j})_{(i,j)\in\Omega}$ with zeros on the complement set $\Omega^c$. In this notation, we have $\zeta(\Theta) = \langle\langle \nabla\zeta, \Theta \rangle\rangle_F$, with
\[
\nabla\zeta = \sum_{(i,j)\in\Omega} \varepsilon_{i,j}\, X_j X_i^\top = X E_\Omega X^\top \in \mathbb{R}^{d\times d} .
\]
In particular, $\nabla\zeta$ is sub-Gaussian with parameter $\sum_{(i,j)\in\Omega} \| X_j X_i^\top \|_F^2 / 4 = \| X^\top X \|_{F,\Omega}^2 / 4$, i.e. it holds for the moment generating function of $\langle\langle \nabla\zeta, B \rangle\rangle_F$, for any $B \in \mathbb{R}^{d\times d}$ and $\sigma^2 = 1/4$,
\[
\varphi_{\langle\langle \nabla\zeta, B \rangle\rangle_F}(t)
:= \mathbb{E}\bigl[ \exp\bigl( t \langle\langle \nabla\zeta, B \rangle\rangle_F \bigr) \bigr]
= \prod_{(i,j)\in\Omega} \mathbb{E}\bigl[ \exp\bigl( t\, \varepsilon_{i,j} \langle\langle X_j X_i^\top, B \rangle\rangle_F \bigr) \bigr]
\le \prod_{(i,j)\in\Omega} \exp\bigl( t^2 \sigma^2 \langle\langle X_j X_i^\top, B \rangle\rangle_F^2 / 2 \bigr)
= \exp\bigl( t^2 \sigma^2 \| X^\top B X \|_{F,\Omega}^2 / 2 \bigr) . \tag{A.1}
\]
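As an aside (not part of the original argument), the bound (A.1) is simply Hoeffding's lemma applied to the centred Bernoulli variables $\varepsilon_{i,j}$, which are sub-Gaussian with variance proxy $\sigma^2 = 1/4$. The following short Python sketch, over an arbitrary grid of success probabilities and arguments, illustrates this elementary fact numerically.

import numpy as np

def bernoulli_mgf(t, p):
    # E[exp(t * eps)] for eps = Y - p with Y ~ Bernoulli(p)
    return p * np.exp(t * (1.0 - p)) + (1.0 - p) * np.exp(-t * p)

sigma2 = 0.25  # variance proxy used in (A.1)
worst_gap = np.inf
for p in np.linspace(0.01, 0.99, 99):
    for t in np.linspace(-10.0, 10.0, 400):  # grid avoids t = 0 exactly
        gap = t ** 2 * sigma2 / 2.0 - np.log(bernoulli_mgf(t, p))
        worst_gap = min(worst_gap, gap)

# Hoeffding's lemma: log E[exp(t*eps)] <= t^2 * sigma^2 / 2, so the gap stays >= 0
print("smallest value of t^2/8 - log MGF over the grid:", worst_gap)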

We shall frequently be using versions of the following inequality, which is based on the fact that $\nabla\ell(\Theta^\star) = 0 \in \mathbb{R}^{d\times d}$, the Taylor expansion and (1.5), and holds for any $\Theta \in \mathcal{P}(M)$,
\[
\ell(\Theta^\star) - \ell(\Theta)
= \frac{1}{2} \sum_{(i,j)\in\Omega} \sigma'(X_i^\top \Theta_0 X_j)\, \langle\langle X_j X_i^\top, \Theta^\star - \Theta \rangle\rangle_F^2
\ge \frac{L}{2} \sum_{(i,j)\in\Omega} \langle\langle X_j X_i^\top, \Theta^\star - \Theta \rangle\rangle_F^2
= \frac{L}{2}\, \| X^\top(\Theta^\star - \Theta) X \|_{F,\Omega}^2 , \tag{A.2}
\]
where $\Theta_0 \in [\Theta, \Theta^\star]$ element-wise. Furthermore, using that $\sup_{t\in\mathbb{R}} \sigma'(t) \le 1/4$, we obtain for all $\Theta \in \mathcal{P}(M)$
\[
\ell(\Theta^\star) - \ell(\Theta) \le \frac{1}{8}\, \| X^\top(\Theta^\star - \Theta) X \|_{F,\Omega}^2 .
\]
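For intuition only (a purely numerical sketch in the one-dimensional case, with an arbitrary choice of the bound $M$; it plays no role in the proofs), the two-sided curvature bounds above can be checked for the scalar logistic model, where $L = \min_{|t|\le M} \sigma'(t) = \sigma(M)(1-\sigma(M))$.

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def pop_loglik(theta, theta_star):
    # population log-likelihood of one Bernoulli(sigma(theta_star)) observation
    return sigma(theta_star) * theta - np.log(1.0 + np.exp(theta))

M = 2.0                              # arbitrary bound on the parameters
L = sigma(M) * (1.0 - sigma(M))      # minimum of sigma'(t) over [-M, M]
grid = np.linspace(-M, M, 81)

ok = True
for theta_star in grid:
    for theta in grid:
        gap = pop_loglik(theta_star, theta_star) - pop_loglik(theta, theta_star)
        sq = (theta_star - theta) ** 2
        ok = ok and (L / 2 * sq - 1e-12 <= gap <= sq / 8 + 1e-12)

print("curvature bounds (L/2)*sq <= gap <= sq/8 hold on the grid:", ok)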

We shall also be using the bounds
\[
\max_{(i,j)\in\Omega} \bigl( \varepsilon_{i,j}\, X_i^\top(\Theta - \Theta^\star) X_j \bigr) \le \| X^\top(\Theta - \Theta^\star) X \|_{F,\Omega} \quad \text{a.s.} , \tag{A.3}
\]
\[
\operatorname{Var}\bigl( \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle_F \bigr) \le \frac{1}{4}\, \| X^\top(\Theta - \Theta^\star) X \|_{F,\Omega}^2 . \tag{A.4}
\]


A.2 Entropy bounds for some classes of matrices

Recall that an $\varepsilon$-net of a bounded subset $\mathcal{K}$ of some metric space with a metric $\rho$ is a collection $K_1, \ldots, K_{N_\varepsilon} \in \mathcal{K}$ such that for each $K \in \mathcal{K}$ there exists $i \in \{1, \ldots, N_\varepsilon\}$ such that $\rho(K, K_i) \le \varepsilon$. The $\varepsilon$-covering number $N(\varepsilon, \mathcal{K}, \rho)$ is the cardinality of the smallest $\varepsilon$-net. The $\varepsilon$-entropy of the class $\mathcal{K}$ is defined by $H(\varepsilon, \mathcal{K}, \rho) = \log_2 N(\varepsilon, \mathcal{K}, \rho)$. The following statement is adapted from Lemma 3.1 in Candes and Plan (2011).
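As an illustration of these definitions only (the point cloud, metric and radius below are arbitrary), a greedy construction produces an $\varepsilon$-net of a finite set, and the number of selected centres is an upper bound on the covering number $N(\varepsilon, \mathcal{K}, \rho)$.

import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(size=(500, 2))   # a bounded (finite) subset of R^2
eps = 0.1

centres = []
for p in points:
    # add p as a new centre only if it is not already eps-covered
    if all(np.linalg.norm(p - c) > eps for c in centres):
        centres.append(p)

# every point now lies within eps of some centre, so the greedy net size
# upper bounds the covering number N(eps, ., Euclidean metric)
print("greedy eps-net size:", len(centres))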

Lemma 22. Let $T_0 := \{\Theta \in \mathbb{R}^{k\times k} : \operatorname{rank}(\Theta) \le r,\ \|\Theta\|_F \le 1\}$. Then it holds for any $\varepsilon > 0$ that
\[
H(\varepsilon, T_0, \|\cdot\|_F) \le \bigl( (2k+1)r + 1 \bigr) \log\Bigl( \frac{9}{\varepsilon} \Bigr) .
\]
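For intuition only, and not as a substitute for the argument in Candes and Plan (2011): writing $\Theta = U\Sigma V^\top$ with $U, V \in \mathbb{R}^{k\times r}$ and $\Sigma$ diagonal, the three factors carry roughly $kr + r + kr = (2k+1)r$ free parameters, and combining $\varepsilon/3$-nets for the factors gives, up to the precise handling of the norms on each factor,
\[
N(\varepsilon, T_0, \|\cdot\|_F)
\;\le\; N\bigl(\tfrac{\varepsilon}{3}, \{U\}\bigr)\, N\bigl(\tfrac{\varepsilon}{3}, \{\Sigma\}\bigr)\, N\bigl(\tfrac{\varepsilon}{3}, \{V\}\bigr)
\;\lesssim\; \Bigl(\frac{9}{\varepsilon}\Bigr)^{kr} \Bigl(\frac{9}{\varepsilon}\Bigr)^{r} \Bigl(\frac{9}{\varepsilon}\Bigr)^{kr}
\;=\; \Bigl(\frac{9}{\varepsilon}\Bigr)^{(2k+1)r} .
\]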

A.3 Proof of Theorem 10 and Theorem 16

It suffices to show the following uniform deviation inequality
\[
\sup_{\Theta^\star \in \mathcal{P}_{k,r}(M)} \mathbb{P}_{\Theta^\star}\Bigl( \ell(\Theta^\star) - \ell(\widehat\Theta) + p(\widehat\Theta) > 2p(\Theta^\star) + R_t^2 \Bigr) \le e^{-cR_t} , \tag{A.5}
\]
for any $R_t > 0$ and some numeric constant $c > 0$. Indeed, taking $R_t^2 = p(\Theta^\star)$, it then follows that $\ell(\Theta^\star) - \ell(\widehat\Theta) \le 3p(\Theta^\star)$ uniformly for all $\Theta^\star$ in the considered class with probability at least $1 - e^{-c\sqrt{p(\Theta^\star)}}$. The upper bound (2.3) of Theorem 10 then follows directly by integrating the deviation inequality (A.5), while the upper bound on the prediction error in Theorem 16 further follows using (A.2) and the smoothness of the logistic function, $\sup_{t\in\mathbb{R}} \sigma'(t) \le 1/4$. Define
\[
\tau^2(\Theta;\Theta^\star) := \ell(\Theta^\star) - \ell(\Theta) + p(\Theta) , \qquad
\mathcal{G}_R(\Theta^\star) := \{\Theta : \tau(\Theta;\Theta^\star) \le R\} . \tag{A.6}
\]

The inequality (A.5) clearly holds on the event $\{\tau^2(\widehat\Theta;\Theta^\star) \le 2p(\Theta^\star)\}$. In view of $\ell_Y(\widehat\Theta) - p(\widehat\Theta) \ge \ell_Y(\Theta^\star) - p(\Theta^\star)$, we have on the complement
\[
\langle\langle E_\Omega, X^\top(\widehat\Theta - \Theta^\star) X \rangle\rangle
\ge \ell(\Theta^\star) - \ell(\widehat\Theta) + p(\widehat\Theta) - p(\Theta^\star)
\ge \frac{1}{2}\, \tau^2(\widehat\Theta;\Theta^\star) .
\]
Therefore, for any $\Theta^\star \in \mathcal{P}_{k,r}(M)$, we have
\[
\mathbb{P}_{\Theta^\star}\Bigl( \tau^2(\widehat\Theta;\Theta^\star) > 2p(\Theta^\star) + R_t^2 \Bigr)
\le \mathbb{P}_{\Theta^\star}\Bigl( \sup_{\tau(\Theta;\Theta^\star) \ge R_t} \frac{\langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle}{\tau^2(\Theta;\Theta^\star)} \ge \frac{1}{2} \Bigr) .
\]

We now apply the so-called "peeling device" (or "slicing", as it is sometimes called in the literature). The idea is to slice the set $\{\tau(\Theta;\Theta^\star) \ge R_t\}$ into pieces on which the penalty term $p(\Theta)$ is fixed and the term $\ell(\Theta^\star) - \ell(\Theta)$ is bounded. It follows that

\[
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\tau(\Theta;\Theta^\star) \ge R_t} \frac{\langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle}{\tau^2(\Theta;\Theta^\star)} \ge \frac{1}{2} \Bigr)
\le \sum_{K=1}^{d} \sum_{R=1}^{K} \sum_{s=1}^{\infty}
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\substack{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star) \\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R}} \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle \ge \frac{1}{8}\, 2^{2s} R_t^2 \Bigr) . \tag{A.7}
\]


On the set $\{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star),\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R\}$, it holds by the definitions (A.6) that
\[
\ell(\Theta^\star) - \ell(\Theta) \le 2^{2s} R_t^2 - p(K,R) ,
\]
and therefore, using (A.2), this implies
\[
\| X^\top(\Theta^\star - \Theta) X \|_{F,\Omega} \le Z(K,R,s) , \qquad
Z^2(K,R,s) = \frac{2}{L}\bigl( 2^{2s} R_t^2 - p(K,R) \bigr) . \tag{A.8}
\]

Let us fix the location of the block, that is, the support of a matrix $\Theta' \in G_1 := \{\Theta \in \mathbb{R}^{d\times d} : k(\Theta)=K,\ \operatorname{rank}(\Theta)=R\}$ belongs to the upper-left block of size $K \times K$. Then, following the lines of the proof of Lemma 22 and using the singular value decomposition, we derive
\[
H\bigl( \varepsilon, \{ X^\top \Theta' X : \Theta' \in G_1,\ \| X^\top \Theta' X \|_{F,\Omega} \le B \}, \|\cdot\|_{F,\Omega} \bigr)
\le \bigl( (2K+1)R + 1 \bigr) \log\Bigl( \frac{9B}{\varepsilon} \Bigr) .
\]
Consequently, for the set $T := \{ X^\top(\Theta - \Theta^\star) X : \Theta \in \mathbb{R}^{d\times d},\ \operatorname{rank}(\Theta)=R,\ k(\Theta)=K,\ \| X^\top(\Theta^\star - \Theta) X \|_{F,\Omega} \le Z(K,R,s) \}$, we obtain
\[
H(\varepsilon, T, \|\cdot\|_{F,\Omega}) \le \bigl( (2K+1)R + 1 \bigr) \log\Bigl( \frac{9Z(K,R,s)}{\varepsilon} \Bigr) + K \log\Bigl( \frac{de}{K} \Bigr) .
\]

Denote $t(K,R) := \sqrt{KR} + \sqrt{K \log(de/K)}$. By Dudley's entropy integral bound, see Dudley (1967) and Gine and Nickl (2016) for a more recent reference, we then have
\[
\mathbb{E}\Bigl[ \sup_{\substack{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star) \\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R}} \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle \Bigr]
\le C' \int_0^{Z(K,R,s)} \sqrt{H(\varepsilon, T, \|\cdot\|_{F,\Omega})}\, d\varepsilon
\]
\[
\le C'' \sqrt{KR} \int_0^{9Z(K,R,s)} \sqrt{\log\Bigl( \frac{9Z(K,R,s)}{\varepsilon} \Bigr)}\, d\varepsilon + 9 C'' Z(K,R,s) \sqrt{K \log\Bigl( \frac{de}{K} \Bigr)}
\le C Z(K,R,s)\, t(K,R) ,
\]

for some universal constant $C > 0$. Furthermore, by Bousquet's version of Talagrand's inequality, see Theorem 23 below, in view of the bounds (A.3) and (A.4), we have for all $u > 0$
\[
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\substack{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star) \\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R}} \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle
\ge C Z(K,R,s)\, t(K,R)
+ \sqrt{ \Bigl( \frac{1}{2} Z^2(K,R,s) + 4C Z^2(K,R,s)\, t(K,R) \Bigr) u }
+ \frac{Z(K,R,s)\, u}{3} \Bigr) \le e^{-u} .
\]

Taking $u(K,R,s) := L^{1/2} Z(K,R,s) + L^{-1/2} t(K,R) + 2\log d$ and using the inequalities $\sqrt{c_1 + c_2} \le \sqrt{c_1} + \sqrt{c_2}$ and $\sqrt{c_1 c_2} \le \frac{1}{2}\bigl( c_1 \varepsilon + \frac{c_2}{\varepsilon} \bigr)$, which hold for any $c_1, c_2, \varepsilon > 0$, we obtain
\[
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\substack{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star) \\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R}} \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle
\ge \frac{L}{16} Z^2(K,R,s) + \frac{C_1^2}{L}\, t^2(K,R) \Bigr) \le e^{-u(K,R,s)} ,
\]


for some numeric constant $C_1 > 0$. Plugging this back into (A.7) and using (A.8), we obtain
\[
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\substack{\Theta \in \mathcal{G}_{2^s R_t}(\Theta^\star) \\ k(\Theta)=K,\ \operatorname{rank}(\Theta)=R}} \langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle \ge \frac{1}{8}\, 2^{2s} R_t^2 \Bigr) \le e^{-u(K,R,s)} ,
\]
for some numeric constant $C_2 > 0$, provided that
\[
\frac{L}{16} Z^2(K,R,s) + \frac{C_1^2}{L}\, t^2(K,R) \le \frac{1}{8}\, 2^{2s} R_t^2 = \frac{L}{16} Z^2(K,R,s) + \frac{1}{8}\, p(K,R) , \tag{A.9}
\]
which is satisfied for $p(K,R) \ge (C_1^2/L)\, t^2(K,R)$. Therefore,

\[
\mathbb{P}_{\Theta^\star}\Bigl( \sup_{\tau(\Theta;\Theta^\star) \ge R_t} \frac{\langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle}{\tau^2(\Theta;\Theta^\star)} \ge \frac{1}{2} \Bigr)
\le \sum_{K=1}^{d} \sum_{R=1}^{K} \sum_{s=1}^{\infty} e^{-u(K,R,s)} \le e^{-cR_t} ,
\]
for some numeric constant $c > 0$, using (A.9), which concludes the proof.

The following prominent result is due to Bousquet (2002).

Theorem 23 (Bousquet's version of Talagrand's inequality). Let $(B, \mathcal{B})$ be a measurable space and let $\varepsilon_1, \ldots, \varepsilon_n$ be independent $B$-valued random variables. Let $\mathcal{F}$ be a countable set of measurable real-valued functions on $B$ such that $f(\varepsilon_i) \le b < \infty$ a.s. and $\mathbb{E} f(\varepsilon_i) = 0$ for all $i = 1, \ldots, n$, $f \in \mathcal{F}$. Let
\[
S := \sup_{f\in\mathcal{F}} \sum_{i=1}^{n} f(\varepsilon_i) , \qquad
v := \sup_{f\in\mathcal{F}} \sum_{i=1}^{n} \mathbb{E}[f^2(\varepsilon_i)] .
\]
Then for all $u > 0$, it holds that
\[
\mathbb{P}\Bigl( S - \mathbb{E}[S] \ge \sqrt{2(v + 2b\,\mathbb{E}[S])u} + \frac{bu}{3} \Bigr) \le e^{-u} . \tag{A.10}
\]
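The following Monte Carlo sketch (illustrative only; the class of coordinate projections, the sample sizes and the replication count are arbitrary choices, and $\mathbb{E}[S]$ is replaced by an empirical estimate) checks the tail bound (A.10) on a simple example with Rademacher coordinates.

import numpy as np

rng = np.random.default_rng(1)
n, m, n_rep = 100, 20, 4000     # n variables, m coordinate projections, replications
b = 1.0                         # f_j(eps_i) = eps_{i,j} in {-1, +1}: centred, bounded by b
v = float(n)                    # sup_j sum_i E[eps_{i,j}^2] = n

# S = max_j sum_i eps_{i,j}, drawn n_rep times independently
eps = rng.integers(0, 2, size=(n_rep, n, m)) * 2 - 1
S = eps.sum(axis=1).max(axis=1)
ES = S.mean()                   # Monte Carlo proxy for E[S]

for u in (0.5, 1.0, 2.0):
    threshold = ES + np.sqrt(2 * (v + 2 * b * ES) * u) + b * u / 3
    tail = (S >= threshold).mean()
    print(f"u={u}: empirical tail {tail:.4f} <= exp(-u) = {np.exp(-u):.4f}")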

A.4 Proof of Theorem 13

For the MLE $\widehat\Theta_{k,r}$, it clearly holds that $\ell_Y(\widehat\Theta_{k,r}) \ge \ell_Y(\Theta^\star)$, implying
\[
\ell(\Theta^\star) - \ell(\widehat\Theta_{k,r}) \le \langle\langle E_\Omega, X^\top(\widehat\Theta_{k,r} - \Theta^\star) X \rangle\rangle .
\]
Furthermore, in view of (A.2), we derive
\[
\frac{L}{2}\, \| X^\top(\widehat\Theta_{k,r} - \Theta^\star) X \|_{F,\Omega}
\le \frac{\langle\langle E_\Omega, X^\top(\widehat\Theta_{k,r} - \Theta^\star) X \rangle\rangle}{\| X^\top(\widehat\Theta_{k,r} - \Theta^\star) X \|_{F,\Omega}} \tag{A.11}
\]
\[
\le \sup_{\Theta \in \mathcal{P}_{k,r}(M)} \frac{\langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle}{\| X^\top(\Theta - \Theta^\star) X \|_{F,\Omega}} . \tag{A.12}
\]
Following the lines of Section A.3, by Dudley's integral we next obtain
\[
\mathbb{E}\Bigl( \sup_{\Theta \in \mathcal{P}_{k,r}(M)} \frac{\langle\langle E_\Omega, X^\top(\Theta - \Theta^\star) X \rangle\rangle}{\| X^\top(\Theta - \Theta^\star) X \|_{F,\Omega}} \Bigr)
\le c\sqrt{kr} + c\sqrt{k \log\Bigl( \frac{de}{k} \Bigr)} ,
\]
for some universal constant $c > 0$. Plugging this bound back into (A.12) and using the block isometry property yields the desired assertion.


A.5 Proof of Theorem 17

Proof. The proof is split into two parts. First, we show a lower bound of the order $kr$ and then a lower bound of the order $k\log(de/k)$. The simple inequality $(a+b)/2 \le \max\{a, b\}$ for all $a, b > 0$ then completes the proof. Both parts of the proof exploit a version of Fano's inequality given in Proposition 24 below, cf. Section 2.7.1 in Tsybakov (2008).

1. A bound $kr$. The proof of this bound is similar to the proof of a minimax lower bound for estimating a low-rank matrix in the trace-norm regression model given in Theorem 5 in Koltchinskii et al. (2011). For the sake of completeness, we provide the details here. Consider a subclass of matrices
\[
C = \bigl\{ A \in \mathbb{R}^{k\times r} : a_{i,j} \in \{0, \alpha_N\},\ 1 \le i \le k,\ 1 \le j \le r \bigr\} , \qquad
\alpha_N^2 = \frac{\gamma \log 2}{1 + \Delta_{\Omega,2k}(X)} \cdot \frac{r}{2kN} ,
\]
where $\gamma > 0$ is a positive constant, $\Delta_{\Omega,2k}(X) > 0$ is the block isometry constant from Definition 3 and $\lfloor x \rfloor$ denotes the integer part of $x$. Further define
\[
B(C) = \Bigl\{ \tfrac{1}{2}(\bar A + \bar A^\top) : \bar A = (A|\cdots|A|O) \in \mathbb{R}^{k\times k},\ A \in C \Bigr\} ,
\]
where $O$ denotes the $k \times (k - r\lfloor k/r \rfloor)$ zero matrix. By construction, any matrix $\Theta \in B(C)$ is symmetric, has rank at most $r$ with entries bounded by $\alpha_N$. Applying a standard version of the Varshamov-Gilbert lemma, see Lemma 2.9 in Tsybakov (2008), there exists a subset $\mathcal{B} \subset B(C)$ of cardinality $\operatorname{card}(\mathcal{B}) \ge 2^{kr/16} + 1$ such that
\[
\frac{kr}{16} \Bigl( \frac{\alpha_N}{2} \Bigr)^2 \Bigl\lfloor \frac{k}{r} \Bigr\rfloor \le \| \Theta_u - \Theta_v \|_F^2 \le k^2 \alpha_N^2 ,
\]

for all $\Theta_u, \Theta_v \in \mathcal{B}$. Thus $\mathcal{B}$ is a $2\delta$-separated set in the Frobenius metric with $\delta^2 = \frac{kr}{64} \bigl( \frac{\alpha_N}{2} \bigr)^2 \bigl\lfloor \frac{k}{r} \bigr\rfloor$. The Kullback-Leibler divergence between the measures $\mathbb{P}_{\Theta_u}$ and $\mathbb{P}_{\Theta_v}$, $\Theta_u, \Theta_v \in \mathcal{B}$, $u \ne v$, is upper bounded as
\[
\mathrm{KL}(\mathbb{P}_{\Theta_u}, \mathbb{P}_{\Theta_v})
= \mathbb{E}_{\mathbb{P}_{\Theta_u}}[\ell_Y(\Theta_u)] - \mathbb{E}_{\mathbb{P}_{\Theta_u}}[\ell_Y(\Theta_v)]
\le \frac{1}{8} \sum_{(i,j)\in\Omega} \langle\langle X_j X_i^\top, \Theta_u - \Theta_v \rangle\rangle_F^2
\le \frac{1 + \Delta_{\Omega,2k}(X)}{8}\, k^2 \alpha_N^2 N .
\]
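The display above reduces, for each $(i,j) \in \Omega$, to the elementary fact that two Bernoulli laws parametrised through the logistic link satisfy $\mathrm{KL}(\mathrm{Ber}(\sigma(a)), \mathrm{Ber}(\sigma(b))) \le (a-b)^2/8$; independence over $\Omega$ then gives the sum. A small numerical check of the scalar inequality (illustrative only, over an arbitrary grid):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

grid = np.linspace(-4.0, 4.0, 81)
worst = -np.inf
for a in grid:
    for b in grid:
        if a != b:
            excess = kl_bernoulli(sigma(a), sigma(b)) - (a - b) ** 2 / 8.0
            worst = max(worst, excess)

print("max of KL(Ber(sigma(a)), Ber(sigma(b))) - (a-b)^2/8 (should be <= 0):", worst)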

Taking $\gamma > 0$ small enough, we obtain
\[
\frac{1 + \Delta_{\Omega,2k}(X)}{8}\, k^2 \alpha_N^2 N + \log 2
= \frac{kr}{16}\gamma \log 2 + \log 2
= \log\bigl( 2^{\frac{kr}{16}\gamma + 1} \bigr)
< \log\bigl( 2^{kr/16} + 1 \bigr) ,
\]

which, in view of Proposition 24, yields the desired lower bound.

2. A bound $k\log(de/k)$. Let $K = \binom{d}{k}$ and consider the set $G_k^{\alpha_N} \subset \mathcal{P}_{k,1}(M)$ from the reduction scheme in Section 3.2 with
\[
\alpha_N^2 = \frac{4\gamma \log 2}{kN(1 + \Delta_{\Omega,2k}(X))}\, \log\Bigl( \frac{de}{k} \Bigr) ,
\]
where $\gamma > 0$ is a positive constant. Using simple calculations, we then have $(2k-1)\alpha_N^2 \le \| \Theta_u - \Theta_v \|_F^2 \le 2k^2\alpha_N^2$ for all $\Theta_u, \Theta_v \in G_k^{\alpha_N}$, $u \ne v$. Furthermore, according to Lemma 25 to follow, there exists a subset $G_k^{\alpha_N,0} \subset G_k^{\alpha_N}$ such that
\[
c_0 k^2 \alpha_N^2 \le \| \Theta_u - \Theta_v \|_F^2 \le 2k^2 \alpha_N^2 ,
\]


and of cardinality $\operatorname{card}(G_k^{\alpha_N,0}) \ge 2^{\rho k \log(de/k)} + 1$ for some $\rho > 0$ depending on a constant $c_0 > 0$ and independent of $k$ and $d$. Thus $G_k^{\alpha_N,0}$ is a $2\delta$-separated set in the Frobenius metric with $\delta^2 = c_0 k^2 \alpha_N^2 / 4$. The Kullback-Leibler divergence between the measures $\mathbb{P}_{\Theta_u}$ and $\mathbb{P}_{\Theta_v}$, $\Theta_u, \Theta_v \in G_k^{\alpha_N,0}$, $u \ne v$, is upper bounded as
\[
\mathrm{KL}(\mathbb{P}_{\Theta_u}, \mathbb{P}_{\Theta_v})
= \mathbb{E}_{\mathbb{P}_{\Theta_u}}[\ell_Y(\Theta_u)] - \mathbb{E}_{\mathbb{P}_{\Theta_u}}[\ell_Y(\Theta_v)]
\le \frac{1}{8} \sum_{(i,j)\in\Omega} \langle\langle X_j X_i^\top, \Theta_u - \Theta_v \rangle\rangle_F^2
\le \frac{1 + \Delta_{\Omega,2k}(X)}{4}\, k^2 \alpha_N^2 N ,
\]
for all $u \ne v$ and $\Delta_{\Omega,2k}(X) > 0$ from Definition 3. As in the first part of the proof, taking $\gamma > 0$ small enough, we obtain
\[
\frac{1 + \Delta_{\Omega,2k}(X)}{4}\, k^2 \alpha_N^2 N + \log 2
= k\gamma \log(2) \log\Bigl( \frac{de}{k} \Bigr) + \log 2
= \log\bigl( 2^{k\gamma \log(de/k) + 1} \bigr)
< \log\bigl( 2^{\rho k \log(de/k)} + 1 \bigr) .
\]

The desired lower bound then follows from Proposition 24.

Proposition 24 (Fano's method). Let $\Theta_1, \ldots, \Theta_J$ be a $2\delta$-separated set in $\mathbb{R}^{d\times d}$ in the Frobenius metric, meaning that $\|\Theta_k - \Theta_l\|_F \ge 2\delta$ for all elements $\Theta_k, \Theta_l$, $l \ne k$, in the set. Then for any increasing and measurable function $F : [0,\infty) \to [0,\infty)$, the minimax risk is lower bounded as
\[
\inf_{\widehat\Theta} \sup_{\Theta} \mathbb{E}_{\mathbb{P}_\Theta}\bigl[ F(\|\widehat\Theta - \Theta\|_F) \bigr]
\ge F(\delta) \Bigl( 1 - \frac{\sum_{u,v} \mathrm{KL}(\mathbb{P}_{\Theta_u}, \mathbb{P}_{\Theta_v})/J^2 + \log 2}{\log J} \Bigr) .
\]
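A minimal numerical sketch of how Proposition 24 is used in the proof above (the values of $J$, the average Kullback-Leibler divergence and $\delta$ below are hypothetical placeholders, chosen so that the divergence is a small fraction of $\log J$, as guaranteed by the choice of $\gamma$):

import numpy as np

def fano_lower_bound(delta, J, avg_kl):
    # lower bound on inf_hat sup E[ ||hat - Theta||_F^2 ] over a 2*delta-separated set
    return delta ** 2 * (1.0 - (avg_kl + np.log(2)) / np.log(J))

J = 2 ** 60 + 1                 # e.g. card >= 2^{kr/16} + 1 with kr = 960
avg_kl = 0.25 * np.log(J)       # KL kept at a fraction of log J via gamma
delta = 1.0                     # separation radius of the packing set
print("Fano lower bound on the squared-Frobenius risk:", fano_lower_bound(delta, J, avg_kl))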

Lemma 25 (Variant of the Varshamov-Gilbert lemma). Let $G \subset \mathcal{P}_{k,1}(M)$ be a set of $\{0,1\}^{d\times d}$ symmetric block-sparse matrices with the size of the block $k$, where $k \le \alpha\beta d$ for some $\alpha, \beta \in (0,1)$. Denote by $K = \binom{d}{k}$ the cardinality of $G$ and by $\rho_H(E, E') = \sum_{i,j} \mathbf{1}(E_{i,j} \ne E'_{i,j})$ the Hamming distance between two matrices $E, E' \in G$. Then there exists a subset $G_0 = \{E^{(0)}, \ldots, E^{(J)}\} \subset G$ of cardinality
\[
\log J := \log(\operatorname{card}(G_0)) \ge \rho k \log\Bigl( \frac{de}{k} \Bigr) ,
\]

where $\rho = \frac{\alpha}{-\log(\alpha\beta)}\,(-\log\beta + \beta - 1)$, such that

\[
\rho_H(E^{(k)}, E^{(l)}) \ge c k^2 ,
\]
for all $k \ne l$, where $c = 2(1 - \alpha^2) \in (0, 2)$.

Proof. Let $E^{(0)} = 0_{k\times k}$, $D = ck^2$, and construct the set $\mathcal{E}_1 = \{E \in G : \rho_H(E^{(0)}, E) > D\}$. Next, pick any $E^{(1)} \in \mathcal{E}_1$ and proceed iteratively, so that for a matrix $E^{(j)} \in \mathcal{E}_j$ we construct the set
\[
\mathcal{E}_{j+1} = \{E \in \mathcal{E}_j : \rho_H(E^{(j)}, E) > D\} .
\]
Let $J$ denote the last index $j$ for which $\mathcal{E}_j \ne \emptyset$. It remains to bound the cardinality $J$ of the constructed set $G_0 = \{E^{(0)}, \ldots, E^{(J)}\}$. For this, we consider the cardinality $n_j$ of the subset $\mathcal{E}_j \setminus \mathcal{E}_{j+1}$:
\[
n_j := \#\{\mathcal{E}_j \setminus \mathcal{E}_{j+1}\} \le \#\{E \in G : \rho_H(E^{(j)}, E) \le D\} .
\]


For all $E, E' \in G$, we have
\[
\rho_H(E, E') = 2\bigl( k^2 - (k-m)^2 \bigr) ,
\]
where $m \in [0, k]$ corresponds to the number of distinct columns of $E$ (or $E'$). Solving this quadratic equation for $\rho_H(E, E') = D = ck^2$, we obtain
\[
m_D = k\bigl( 1 - \sqrt{1 - c/2} \bigr)
\]
for the maximum number of distinct columns of a block-sparse matrix $E$ (and $E'$) such that $\rho_H(E, E') \le D = ck^2$ for $c \in [0, 2]$. For instance, in order to get the distance $2k^2$ between matrices, i.e. $c = 2$, we need to shift all the $k$ columns (and consequently rows), so that the number of distinct columns of a matrix is $m = k$; in order to get the minimal possible distance $4k - 2$, i.e. $c = (4k-2)/k^2$, we need to shift only one column and the corresponding row, i.e. $m = 1$. Therefore, for $n_j$ above, we have

\[
n_j \le \#\{E \in G : \rho_H(E^{(j)}, E) \le D\}
= \sum_{i=0}^{m_D} \binom{k}{i} \binom{d-k}{i}
= \sum_{i=k-m_D}^{k} \binom{k}{i} \binom{d-k}{k-i} .
\]

Together with the evident equality $\sum_{j=0}^{J} n_j = K = \binom{d}{k}$, this implies
\[
\sum_{i=k-m_D}^{k} \binom{k}{i} \binom{d-k}{k-i} \Big/ \binom{d}{k} \ge \frac{1}{J+1} .
\]

Note that taking $m_D = k$, which as we have seen corresponds to $c = 2$, yields the trivial bound $J + 1 \ge 1$ by Vandermonde's convolution. Furthermore, the expression on the left-hand side of the last display is exactly the probability $\mathbb{P}(X \ge k - m_D) = \mathbb{P}(X \ge k\alpha)$ for $\alpha = \sqrt{1 - c/2}$, where the variable $X$ follows the hypergeometric distribution $H(d, k, k/d)$. The rest of the proof is based on applying Chernoff's inequality and follows the scheme of the proof of Lemma 4.10 in Massart (2007).
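Both combinatorial facts used in this proof can be checked numerically; the sketch below (with arbitrary $d$, $k$ and $c$, and relying on scipy's hypergeometric distribution, a tooling choice rather than part of the argument) verifies the Hamming-distance identity for overlapping block supports and evaluates the tail $\mathbb{P}(X \ge k - m_D)$.

import numpy as np
from scipy.stats import hypergeom

d, k, c = 60, 8, 1.0

# Hamming identity: block supports S, S_new of size k sharing k - m indices
S = np.arange(k)
for m in range(k + 1):
    S_new = np.concatenate([S[: k - m], np.arange(k, k + m)])  # replace m indices
    E = np.zeros((d, d)); E[np.ix_(S, S)] = 1.0
    E_new = np.zeros((d, d)); E_new[np.ix_(S_new, S_new)] = 1.0
    assert np.sum(E != E_new) == 2 * (k ** 2 - (k - m) ** 2)

# tail of the hypergeometric overlap of S and a uniformly drawn support S_new
m_D = k * (1.0 - np.sqrt(1.0 - c / 2.0))
X = hypergeom(d, k, k)              # population d, k marked entries, k draws
tail = X.sf(np.ceil(k - m_D) - 1)   # P(X >= k - m_D)
print("P(X >= k - m_D) =", tail)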

Nicolai Baldin
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
Wilberforce Road
Cambridge, CB3 0WB, UK
([email protected])

Quentin Berthet
Department of Pure Mathematics and Mathematical Statistics, University of Cambridge
Wilberforce Road
Cambridge, CB3 0WB, UK
([email protected])