
Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems

Antonio A. Ginart ∗† Maxim Naumov ‡ Dheevatsa Mudigere ‡ Jiyan Yang ‡ James Zou ∗

November 9, 2020 (arXiv:1909.11810v2 [cs.LG])

Abstract
Embedding representations power machine intelligence in many applications, including recommendation systems, but they are space intensive — potentially occupying hundreds of gigabytes in large-scale settings. To help manage this outsized memory consumption, we explore mixed dimension embeddings, an embedding layer architecture in which a particular embedding vector's dimension scales with its query frequency. Through theoretical analysis and systematic experiments, we demonstrate that using mixed dimensions can drastically reduce the memory usage, while maintaining and even improving the ML performance. Empirically, we show that the proposed mixed dimension layers improve accuracy by 0.1% using half as many parameters or maintain it using 16× fewer parameters for the click-through rate prediction task on the Criteo Kaggle dataset.

1 Introduction
Embedding representations power state-of-the-art applications in diverse domains, including computer vision [1, 2], natural language processing [3, 4, 5], and recommendation systems [6, 7, 8]. It is standard practice to embed objects into R^d at a fixed uniform dimension (UD) d. When the embedding dimension d is too low, the downstream statistical performance suffers [9]. When d is high and the number of objects to represent is large, memory consumption becomes an issue. For example, in recommendation models, the embedding layer can make up more than 99.9% of the memory it takes to store the model, and in large-scale settings, it could consume hundreds of gigabytes or even terabytes [7, 10]. Therefore, finding innovative embedding representations that use fewer parameters while preserving statistical performance of the downstream model is an important problem.

Object frequencies are often heavily skewed in real-world applications. For instance, for the full MovieLens dataset¹, the top 10% of users receive as many queries as the remaining 90%, and the top 1% of items receive as many queries as the remaining 99%. To an even greater extent, on the Criteo Kaggle dataset² the top 0.0003% of indices receive as many queries as the remaining ∼32 million.

To leverage the heterogeneous object popularity in recommendation, we propose mixed dimension (MD) embedding layers, in which the dimension of a particular object's embedding scales with that object's popularity rather than remaining uniform. Our case studies and theoretical analysis demonstrate that MD embeddings work well for two reasons. First, by scaling embedding dimensions based on corresponding sample sizes, MD embeddings learn representations that do not underfit popular embeddings or overfit rare embeddings. Second, due to skewed popularity in deployment (i.e., test sets), MD embeddings minimize popularity-weighted loss by efficiently allocating parameters without wasting them on rare objects.

∗ Stanford University. † Project initiated while at an internship with Facebook. ‡ Facebook.
¹ https://grouplens.org/datasets/movielens/
² https://www.kaggle.com/c/criteo-display-ad-challenge

In Section 3, we introduce the proposed architecture for the embedding layer. In Section 4, we theoretically investigate MD embeddings. Our theoretical framework splits embedding-based recommendation systems into either the data-limited regime or the memory-limited regime, depending on the parameter budget and sample size. Using our framework, we prove mathematical guarantees that demonstrate that when the frequency of categorical values is sufficiently skewed, MD embeddings are both better at matrix recovery and incur lower reconstruction distortion than UD embeddings. Based on our theory, we propose a simple power-law scheme for MD sizing that performs well empirically. Our method is faster to train while requiring far less tuning than other non-uniform embedding layers, such as expensive neural architecture searches. In Section 5, we demonstrate that MD embeddings improve parameter-efficiency in click-through rate (CTR) prediction tasks. On the Criteo Kaggle dataset, mixed embeddings improve accuracy by 0.1% using half as many parameters and maintain accuracy using 16× fewer parameters. They also train over 2× faster on a GPU.

Summary of Contributions:
1. We propose an MD embedding layer for recommendation systems and provide a novel, mathematical method for sizing the dimension of features with variable popularity that is fast to train, easy to tune, and performs well empirically.
2. With matrix completion and factorization models, we prove that with sufficient popularity skew, mixed dimension embeddings incur lower distortion when memory-limited and generalize better when data-limited.
3. Of importance to our application is the memory-limited regime, for which we derive the optimal feature dimension. This dimension depends only on the feature's popularity and the singular-value spectrum of the pairwise interactions.

2 Background & Related WorksThis work applies MD embeddings to CTR prediction,a cornerstone task for modern recommendation engines.CTR prediction can more generally be viewed as contex-tual collaborative filtering (CF). We assume familiaritywith these tasks and relevant algorithms and briefly reviewthem here. In canonical CF tasks, user ratings or interac-tions for particular items are observed. One of the simplest

One of the simplest treatments of CF is matrix completion (with the understanding that the goal of CF is to generate recommendations to users rather than just infer the unknown matrix).

CTR Prediction. The canonical CF task can be generalized to a large number of recommendation tasks incorporating various types of additional contextual information. CTR prediction tasks can be interpreted as context-based CF. Compared to canonical CF, these tasks also include a large amount of context that can be used to better predict user and item interaction events. These contextual features are expressed through sets of indices (categorical) and floating point values (continuous). These features can represent arbitrary details about the context of an on-click or personalization event. The i-th categorical feature can be represented by an index x_i ∈ {1,...,n_i} for i = 1,...,κ. In addition to the κ categorical features, we also have s scalar features, together producing a dense feature vector x′ ∈ R^s. Thus, given some (x_1,...,x_κ, x′) ∈ ([n_1]×...×[n_κ]) × R^s, we would like to predict y ∈ {0,1}, which denotes an on-click event in response to a particular personalized context (more generally, we might allow y ∈ R).

Matrix and Tensor Completion. As mentioned above, matrix completion is a common model for CF tasks. We assume the reader is familiar with this literature; refer to Appendix C for references. We briefly review the setting here. Let M be an unknown target matrix, generally thought to represent the preferences or engagement of n users over m items. Let Ω ⊂ [n]×[m] denote a sample of indices. Our observation is the set of 3-tuples {(i,j,M_ij) : (i,j) ∈ Ω}, which can be thought of as a training set, and the goal is to predict the ratings for the unobserved indices Ω′ = [n]×[m] \ Ω. Matrix completion is effectively a CF problem in which the observed sample Ω follows a probabilistic sampling model (to be described later on). We say the target matrix M is completed or recovered if the recovery algorithm A, given the observed entries, returns A(Ω) = M. We are generally interested in the probability of this recovery event, Pr[M = A(Ω)], for an algorithm A under a sampling model for Ω and an assumed low-rank structure on M. More generically, tensor completion generalizes to contextual CF tasks and subsumes matrix completion as a special case. We briefly review this here, following a setting similar to [11, 12]. For tensor completion, the goal is recovering T ∈ R^{n_1×...×n_κ}, where T_{x_1,...,x_κ} ∈ {0,1} denotes whether, given context (x_3,...,x_κ), user x_1 engages with item x_2. We assume T has a low pairwise interaction rank r, meaning T can be factored into κ matrices M^(i) ∈ R^{n_i×r} for i ∈ [κ] as follows:

T_{x_1,...,x_κ} = Σ_{(i,j)∈[κ]×[κ]} ⟨M^(i)_{x_i}, M^(j)_{x_j}⟩.

For simplicity and brevity, we focus the theory in this work on the matrix completion setting, but our theoretical framework also translates neatly to the tensor setting (we discuss this further in Section 4).

Representation Learning. Embedding-based approaches, such as matrix factorization (MF) [13, 14, 15] or neural collaborative filtering (NCF) [16, 17, 18, 19], are among the most computationally efficient solutions to CF and matrix completion. The simplest model for canonical CF is MF, which treats embeddings as factorizing the target matrix directly and comes with nice theoretical guarantees. Direct optimization over embedding layers is a non-convex problem. However, due to specific problem structure, many simple algorithms, including first-order methods, have been proven to find global optima (under mild assumptions) [20, 21, 22]. Neural collaborative filtering (NCF) models, which make use of non-linear neural processing, are more flexible options. While NCF models often lack the theoretical guarantees offered by MF, they usually perform mildly better on real-world datasets. In CTR prediction, it is common to use embedding layers to represent categorical features and have various neural layers to process the embeddings.

Deep Learning for CTR. We use the state-of-the-art deep learning recommendation model (DLRM) [23] as an off-the-shelf deep model. Generally, these deep models are trained via empirical risk minimization (ERM) and back-propagation. For a given model f_θ (parameterized by θ), the standard practice is to represent categorical features with some indexed embedding layer E. The ERM objective is then

min_{θ,E} Σ_{i∈D} ℓ( f_θ(x′_i, E[(x_1,...,x_κ)_i]), y_i ),

where the sum is over all data points ((x_1,...,x_κ, x′)_i, y_i) in the dataset and the loss function ℓ is taken to be cross entropy for our purposes. Usually, each categorical feature has its own independent embedding matrix: E[(x_1,...,x_κ)_i] = (E^(1)[x_1],...,E^(κ)[x_κ]).
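As a concrete illustration of this setup (a minimal sketch, not the DLRM architecture of [23]; all cardinalities, layer widths, and the toy batch are assumptions for illustration), the following PyTorch code wires one embedding table per categorical feature into a small model trained with a binary cross-entropy loss:

import torch
import torch.nn as nn

class SimpleCTRModel(nn.Module):
    """Minimal ERM setup: one uniform-dimension embedding table per categorical
    feature, concatenated with the dense features and fed to a small MLP.
    Cardinalities, widths, and the toy batch below are illustrative."""
    def __init__(self, cardinalities, num_dense, dim=16):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(n_i, dim) for n_i in cardinalities)
        self.mlp = nn.Sequential(
            nn.Linear(dim * len(cardinalities) + num_dense, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cat_idx, dense):
        # cat_idx: (batch, kappa) integer indices; dense: (batch, s) floats.
        embs = [table(cat_idx[:, i]) for i, table in enumerate(self.tables)]
        return self.mlp(torch.cat(embs + [dense], dim=1)).squeeze(1)

model = SimpleCTRModel(cardinalities=[1000, 50, 200], num_dense=2)
cat_idx = torch.randint(0, 50, (8, 3))            # toy batch: indices valid for every table
dense, labels = torch.randn(8, 2), torch.randint(0, 2, (8,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(model(cat_idx, dense), labels)
loss.backward()                                   # gradients flow into both E and theta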

Various deep CTR prediction models have been developed over recent years, including but not limited to [6, 24, 25, 23, 26, 27]. These models share characteristics, and all of them are powered by memory-intensive embedding layers that utterly dwarf the rest of the model. The trade-off between the size of the embedding layer and the statistical performance of the model seems unavoidable for this class of recommendation algorithms.

Non-uniform Embeddings. Recent works have proposed similar but distinct non-uniform embedding architectures, particularly for the natural language processing (NLP) domain [28, 29]. There are substantial differences between those methods and ours. We emphasize that neither of those methods would work out-of-the-box for CTR prediction. Critically, in contrast to NLP, CTR embeddings encode categorical values for individual features, and thus come with feature-level structure that should not be ignored when architecting the embedding layer. In NLP, embeddings represent words and can be thought of as a single large bag of values, in contrast to representing the various categorical values a particular feature can take on. We discover that ignoring this feature-level structure in the embedding layer adversely affects performance, dropping accuracy by >1%. For this reason, the sorted blocking technique introduced in [29] is not effective in CTR prediction. Additionally, embedding layers in NLP are pre-trained on unsupervised language corpora, unlike in RecSys, which means that clustering and factoring the embeddings as in [28] prior to training is not feasible in CTR prediction.


Neural architecture search (NAS) for RecSys embedding layers is proposed in [30]. A3C, a generic reinforcement learning algorithm, is used to architect the embedding layer. However, NAS is computationally expensive. Thus, resorting to NAS is only necessary when one lacks a principled methodology for determining good embedding sizes. In this work, we show that the architecture search over non-uniform embedding layers can be distilled into tuning a single hyper-parameter (requiring minimal overhead). Indeed, our theory, backed by experimental support, shows that determining good embedding sizes (for RecSys) is as easy as tuning a learning rate and does not require the heavy machinery of NAS. This enormous simplification in model search is only possible due to our theoretical framework.

Furthermore, in contrast to previous works, we theoretically analyze our method. Moreover, past methods do not even empirically validate the speculated mechanisms by which their methods work. For example, in [29], the authors claim their proposed architecture "reduces the capacity for less frequent words with the benefit of reducing overfitting to rare words." While the proposed method works well on benchmarks, the claim that the method reduces overfitting is not supported. As shown in [9, 31], embedding overfitting depends critically on the training algorithm and even the model atop the embeddings. In fact, when training is properly tuned, embeddings are quite resilient to overfitting, which means the claim made in [29] is far from self-evident. It is more accurate to view rare embeddings as wasting parameters rather than overfitting them.

3 Mixed Dimension Embedding Layer
We proceed to introduce notation and define the proposed mixed dimension (MD) architecture. The MD embedding layer E consists of k blocks and is defined by 2k matrices: E = (E^(1),...,E^(k), P^(1),...,P^(k)), with E^(i) ∈ R^{n_i×d_i} and P^(i) ∈ R^{d_i×d} for i = 1,...,k. Together, E^(i) and P^(i) form the i-th block. In total, there are n = Σ_{i=1}^k n_i embedding vectors in the layer. We always treat embedding vectors as row vectors. The E^(i) can be interpreted as the MD embeddings, and the P^(i) are projections that lift the embeddings into a base dimension d such that d ≥ d_i. The entire layer can be thought of as a single n×d block matrix for which the i-th block is factored at dimension d_i.

Forward propagation for an MD embedding layer is performed by indexing an embedding vector and then projecting it: for example, the ℓ-th vector in the first block is computed as E^(1)_ℓ P^(1). Backward propagation through this layer updates the matrices E^(i) and P^(i) accordingly during training. Importantly, for many standard loss functions, only the embedding vectors queried on the forward pass have non-zero gradients on the backward pass. In large-scale settings, using sparse updates for the embedding tables is essentially mandatory for efficient training. Thus, contrary to popular belief in academic circles, large-scale embeddings are often trained without norm-based regularization, since this would require an update to every single row at each step, even if not queried.

Downstream models based on an MD embedding layer should be sized with respect to d. If d_i = d for any block, the projection P^(i) is not needed and may be replaced with an identity mapping. We illustrate this, along with the general matrix architecture of a two-block MD embedding layer, in Fig. 1a. The parameter budget (total area) consumed by UD and MD embedding layers is the same, but the parameters are allocated unevenly to different indices in the MD embeddings. For MD embedding layers, there are two primary architectural decisions to make: (i) blocking: how to block the n total embedding indices into k blocks, and (ii) sizing: how to size the embedding dimensions d = (d_1,...,d_k). For large-scale CTR, blocking embeddings by feature is both natural and effective. We defer a discussion of embedding sizing until after our theoretical analysis; assigning the dimensions for the MD layer under a given parameter budget is a key aspect of our framework, discussed in Section 4. We focus here primarily on a blocking scheme for CTR prediction tasks that require embedding tables for κ categorical features. Generally, κ is on the order of 10 to 100. Standard embedding layers allocate κ UD embedding matrices to these κ features. For CTR prediction, it is both simple and natural to inherit the block structure as delineated by the task itself. We use k ← κ embedding blocks in the MD layer, E^(i) ∈ R^{n_i×d_i}, one for each of the κ categorical features. The value range of each categorical feature defines the row count n_i of the corresponding block of the MD layer. The re-indexing can trivially be stored in a simple length-k offset vector, requiring negligible computation and memory. For the Criteo dataset, there are κ = 26 distinct categorical features, meaning we produce an MD embedding layer with 26 blocks. To get meaningful and useful blocks, this blocking scheme depends on the fact that our task has a large number of contextual features κ, with value ranges varying from order 10 to order 10 million. Thus, even if feature values are roughly uniformly popular within each feature, the large variation in value ranges leads to a significantly skewed popularity.

We stress that this is a significant distinction from embedding applications in NLP. When using word embeddings, one cannot block the mixed layer by feature. This inherent structure in the pCTR setting is simply absent from the NLP setting. Thus, one needs to resort to complex blocking and sizing schemes, such as those proposed in [28, 29]. Furthermore, we found that accuracy significantly drops if the layer is not blocked by feature. We hypothesize that embedding projections encode feature-level semantic information when blocking by feature.
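A minimal PyTorch sketch of such a feature-blocked MD layer is given below; it is an illustrative re-implementation of the definitions above, not the authors' open-source DLRM code referenced in Section 5, and the cardinalities and dimensions in the example are assumptions:

import torch
import torch.nn as nn

class MixedDimEmbedding(nn.Module):
    """One block per categorical feature: block i stores E^(i) in R^{n_i x d_i}
    and a projection P^(i) in R^{d_i x d} lifting its vectors to the base
    dimension d; blocks with d_i == d skip the projection (identity)."""
    def __init__(self, cardinalities, dims, base_dim):
        super().__init__()
        assert all(d_i <= base_dim for d_i in dims)
        self.embs = nn.ModuleList(nn.Embedding(n_i, d_i)
                                  for n_i, d_i in zip(cardinalities, dims))
        self.projs = nn.ModuleList(
            nn.Identity() if d_i == base_dim else nn.Linear(d_i, base_dim, bias=False)
            for d_i in dims)

    def forward(self, cat_idx):
        # cat_idx: (batch, kappa) indices; returns a list of (batch, d) embeddings.
        return [proj(emb(cat_idx[:, i]))
                for i, (emb, proj) in enumerate(zip(self.embs, self.projs))]

# Example: three features whose value ranges differ by orders of magnitude.
layer = MixedDimEmbedding(cardinalities=[100_000, 1_000, 10],
                          dims=[4, 16, 32], base_dim=32)
vecs = layer(torch.tensor([[123, 45, 6], [7, 8, 9]]))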

4 Theoretical Framework
In this section we present our theoretical framework, which leverages matrix completion and factorization as mathematical models. We split our analysis into two regimes, data-limited and memory-limited, and prove relevant theorems in both regimes concerning the benefits of MD over UD embeddings (all proofs are given in Appendix C). Our analysis enables us to mathematically understand the learning phenomena as well as inform engineering practice by guiding the assignment of embedding dimensions in MD layers.

Figure 1: Fig. 1a (left): Matrix architecture for UD and MD embedding layers. Fig. 1b (right): Embedding parameters for 26 categorical features allocated at different temperatures (α) for the same parameter budget on the Criteo dataset. Higher temperatures result in higher dimensions for popular embeddings, and α = 0 is UD. See Section 4 and Alg. 1 for more details concerning the assignment scheme. The dimensions are rounded to powers of 2. Base dimension d corresponds to index 1.

Recall that M ∈ R^{n×m}, for n ≥ m, is an unknown target matrix. Without loss of generality, we assume M has a block structure, comprised of blocks M^(i,j) ∈ R^{n_i×m_j} for 1 ≤ i ≤ k_V and 1 ≤ j ≤ k_W:

M = [ M^(11)     ...  M^(1 k_W)
         ⋮        ⋱       ⋮
      M^(k_V 1)  ...  M^(k_V k_W) ]

When indexing M, we use subscripts, as in M_kl, to denote the kl-th scalar entry of M, and superscripts in parentheses, such as M^(i,j), to denote the ij-th block of M (we often omit the comma in the superscript). As in CF, we have a sample of observed entries {(k,l,M_kl) : (k,l) ∈ Ω}. We can treat Ω as a set-valued random variable. It is common practice to study a Bernoulli sampling model [32, 33, 34, 35, 21], in which each entry is revealed independently with some small probability. Below we define a similar model, adapted for the proposed block-wise setting.

Definition 4.1. The block-wise Bernoulli sampling model is defined as follows. Let Π ∈ R^{k_V×k_W} be a probability matrix. Let N denote the expected number of observed indices in a training sample. Let i(·) and j(·) map each index of M to the index of the block to which it belongs. Each index (k,l) is independently sampled as follows: Pr[(k,l) ∈ Ω] = N · Π_{i(k),j(l)} / (n_{i(k)} m_{j(l)}). We refer to Π_ij/(n_i m_j) as the sampling rate of the ij-th block. The minimal marginal sampling rate is defined as ε = min{ min_i (1/n_i) Σ_j Π_ij , min_j (1/m_j) Σ_i Π_ij }. Finally, a block matrix M is rank additive if r = Σ_{i=1}^{k_V} Σ_{j=1}^{k_W} r_ij, where r = rank(M) and r_ij = rank(M^(ij)).

Thus, the probability of observing any particular entry depends only on the block to which that entry belongs, and it is also uniform within any given block. Also, the notion of rank additivity used above is slightly less general than the one in [36] but is sufficient for our purposes. It is a mild assumption when M has few blocks k = k_V k_W relative to m and r_ij ≪ m. In fact, block-wise rank additivity holds with high probability for random matrix models meeting these criteria.

Connection to Tensor Completion. Given both the block-wise structure of M and the MD embeddings, it is straightforward to apply MD here. The goal is to train the layer E to represent M; more specifically, we would like E^(i)P^(i)(E^(j)P^(j))^T = M^(ij). We can train E as usual, using SGD and back-propagation.

One potential objection to our model is that it only addresses the matrix case, which cannot capture additional contextual features. In response, we point out the following direct connection between our block-wise model and the tensor setting (such as in [11]). Recall that under the assumed pairwise interaction rank r for T, we can factor T into κ matrices, M^(1),...,M^(κ). We can adapt our model to the tensor case by exploiting the block structure and treating M as a block pairwise interaction matrix rather than a preference matrix. Let k_V = k_W = κ and let each block represent an interaction matrix: M^(ij) = M^(i)(M^(j))^T. Hence, M is symmetric, and with this simple construction, the factors of T are represented as the blocks of M. The remaining distinction is that in the tensor case, the algorithm only observes sums of elements selected from the blocks of M instead of observing the entries of M directly. This minor distinction is addressed in both [11] and [12], and with appropriate care to details, the two observation structures are largely equivalent. For simplicity and brevity, we discuss only the matrix case here, while keeping in mind this natural extension to the tensor case. When thinking about the categorical features in CTR prediction, this construction is precisely the one we use to block by features, as described in Section 3.
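For concreteness, a small numpy sketch of this construction (with illustrative sizes and rank) builds the symmetric block matrix M from factors M^(1),...,M^(κ):

import numpy as np

rng = np.random.default_rng(0)
r = 4                                           # pairwise interaction rank
sizes = [50, 30, 20]                            # n_1, ..., n_kappa (illustrative)
factors = [rng.standard_normal((n_i, r)) for n_i in sizes]   # M^(1), ..., M^(kappa)

# Block pairwise interaction matrix: the (i, j)-th block is M^(i) (M^(j))^T.
M = np.block([[Mi @ Mj.T for Mj in factors] for Mi in factors])
assert M.shape == (sum(sizes), sum(sizes)) and np.allclose(M, M.T)   # symmetric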

Data-Limited and Memory-Limited Regimes. In contextual recommendation engines, there are two primary bottlenecks. In the data-limited regime, the model does not have enough samples to accurately recover the preference (or pairwise interaction) matrix. In the memory-limited regime, the model has sufficient data to recover the preference (or pairwise interaction) matrix but not enough space for the parameters that comprise the embedding layer (i.e., the matrix/tensor factors). Roughly speaking, the data-limited regime occurs when the number of samples N = o(nr log n), in which case our forthcoming theory illustrates that we must turn to MD embeddings to recover any matrices at all. The memory-limited regime occurs when the space constraint is o(nr), which means we must find a way to use fewer parameters in the embedding layer than are naively required. These two regimes are not mutually exclusive. We leave the difficult case of analyzing both regimes simultaneously for future work. For large-scale CTR prediction (with potentially order 10^9 contextual features), the constant generation of data results in systems that fall into the memory-limited regime.

4.1 Generalization in the Data-Limited Regime
We use matrix completion under block-wise Bernoulli sampling as a model to understand generalization in the data-limited regime. We will show that when the training samples are sufficiently skewed, MD embeddings can recover many more matrices than UD embeddings. Here, we use recovery of the matrix as a proxy for good generalization. We can assume that we easily have enough space to accommodate all needed parameters. Although we focus on the exact completion case for simplicity and brevity, it is well understood how to extend these results to the setting of approximate completion for approximately low-rank matrices. First, we provide an asymptotic lower bound on the sample complexity for matrix recovery under popularity skew by any popularity-agnostic algorithm, including UD factorization. Then, we show that MD factorization achieves lower sample complexity for recovery. Exact non-asymptotic results can be found in Appendix C. Broadly speaking, we refer to popularity-agnostic algorithms as those that ignore popularity-based structure and rely only on low-rank structure in recovery. We formalize this notion in Appendix C. Under uniform sampling, all popularity-agnostic algorithms need Θ(rn log n) samples to recover the matrix [33]. Thm. 4.1 shows that popularity-agnostic algorithms can require many more samples when popularity skew is high.

Theorem 4.1. Let M be a target matrix following the block-wise Bernoulli sampling model. Suppose N = o((r/ε) n log n), where ε is the minimal marginal sampling rate. Then any algorithm that does not take popularity into account will have asymptotically zero probability of recovering M. There exist popularity-agnostic algorithms for which N = Θ((r/ε) n log n) is sufficient for high-probability recovery.

Thus, popularity-agnostic methods necessarily pay an additional 1/ε factor in samples for recovery. In contrast to this negative result, we show that MD embeddings, trained by SGD, can complete M with fewer samples. Note that we assume the block and rank structure of M is known. Thm. 4.2 is a completion guarantee for MD factorization.

Theorem 4.2. Let M be a rank additive target matrix following the block-wise Bernoulli sampling model. Let C be some non-asymptotic constant independent of n. If N ≥ C (max_ij r_ij/Π_ij) n log n, then block-wise mixed dimension factorization with SGD recovers M with high probability.

Thm. 4.1 states that with popularity-agnostic methods, completion of the matrix M is bottlenecked by the block with lowest popularity. When popularity skew is large, the 1/ε factor greatly increases the sample size necessary to complete M. In contrast, as shown in Thm. 4.2, MD factorization gets a significant reprieve if low-popularity blocks are also treated as low-rank, implying a small max_ij r_ij/Π_ij ≪ r/ε. The key insight is that it is impossible to complete rare rows at the same rank as popular rows. MD embeddings can successfully accommodate both rows.

Two-block example. The presented theoretical results are more easily interpreted for the special case of a block matrix consisting of two vertically stacked matrices, M = [M^(1); M^(2)]. Let us assume block-wise popularity sampling with Π_1 = 1−ε and Π_2 = ε for small 0 < ε < 1/2, so that we can interpret M^(1) as the popular block and M^(2) as the rare block. Also, assume that rank additivity is preserved and r = r_1 + r_2.

Notice that, based on Thm. 4.1, we cannot complete M with popularity-agnostic methods if N = o((r/ε) n log n). Based on Thm. 4.2, we successfully complete M with MD factorization for N = Θ(max{r_1/(1−ε), r_2/ε} n log n). For illustrative purposes, assume that r_2 is a small constant and r_1 ≈ 1/ε is significantly larger. Then, popularity-agnostic algorithms suffer a quadratic 1/ε² penalty in sample complexity due to popularity skew, whereas MD factorization only pays a 1/ε factor.

It is instructive to understand why UD embeddings fail to recover the rank-r matrix M under popularity skew. For argument's sake, let d denote a potential uniform embedding dimension. Suppose we have Θ(rn log n) samples and we set the UD to r: d ← r. When sampling is skewed, M^(2) will be too sparsely covered to reveal r degrees of freedom, since it only generates an ε fraction of the observations. Thus, the r-dimensional embeddings would overfit the under-constrained M^(2) block as a result. Alternatively, if we set d ← εr, so as to match the sample size over the sub-matrix M^(2), then our εr-dimensional embeddings would be unable to fit the larger training sample over the M^(1) block. Namely, we would now have too many samples coming from a rank-r matrix, resulting in an over-constrained problem that is infeasible with εr-dimensional embeddings. By using MD embeddings, we avoid this problem by simultaneously fitting popular and rare blocks.

4.2 Space Efficiency in the Memory-Limited Regime
To study resource-constrained training and deployment, we assume the target matrix is completed by an oracle. This abstracts away generalization, because learning the ground truth becomes straightforward with sufficient data. We are instead constrained by a small parameter budget B. The goal is to optimally allocate parameters to embedding vectors such that we minimize the expected reconstruction error over a non-uniformly sampled test set. We show that under mild assumptions this dimension allocation problem is a convex program and is amenable to closed-form solutions. We prove that the optimal dimension for a given embedding scales with that embedding's popularity in a manner that depends deeply on the spectral properties of the target matrix.

For most applications, popularity skew is present not only at training time but also at test time. Thus, in order to maximize expected out-of-sample performance in the face of memory constraints, it is necessary to account for it. In this setting, the natural loss metric to study is the popularity-weighted MSE, defined as follows.

Definition 4.2 (Popularity-Weighted MSE). Let Π be a block-wise probability matrix. Let (k,l) be a test coordinate sampled according to Π. The popularity-weighted MSE for a prediction M̂ is given by

L_Π(M̂, M) = E_{k,l} |M̂_kl − M_kl|² = Σ_{i,j} ( Π_ij / (n_i m_j) ) ||M̂^(ij) − M^(ij)||²_F.

Let us now assume that the target matrix is given and that we are trying to optimally size MD embeddings with respect to popularity-weighted MSE reconstruction. We assume a small parameter budget, so that we cannot factor the target matrix exactly. We formulate this optimization as a convex program with linear constraints (we treat the dimensions as continuous; this is a convex relaxation of a hard problem, see Appendix C for more details).

Definition 4.3 (Convex program for optimizing variable-dimension embeddings).

min_{d_w, d_v} ( min_{W,V} L_Π(M, W V^T) )   subject to   Σ_i n_i (d_w)_i + Σ_j m_j (d_v)_j ≤ B,

where d_w ∈ R^{k_V} and d_v ∈ R^{k_W} denote the dimensions of the embedding blocks of W and V, respectively.

We can obtain a solution using first-order conditions and Lagrange multipliers. Our analysis reveals that the optimal dimension-to-popularity scaling is the functional inverse of the spectral decay of the target matrix (see Thm. 4.3).

Theorem 4.3. Let M be a rank additive block matrix with block-wise spectral (singular value) decay given by σ_ij(·). Then the optimal block-wise embedding dimensions are given by (d*_w)_i = Σ_j d*_ij and (d*_v)_j = Σ_i d*_ij, where d*_ij = σ_ij^{−1}( √( λ (n_i + m_j)(n_i m_j) Π_ij^{−1} ) ) and λ is such that Σ_ij (n_i + m_j) d*_ij = B.

When we have a closed-form expression for the spectral decay of M, we can give a closed-form solution in terms of that expression. We now give the optimal dimensions for the simple two-block example under power spectral decay.

Definition 4.4 (Power spectral decay). A matrix with spectral norm ρ exhibits a singular value spectrum with power spectral decay β > 0 if the k-th singular value is given by σ(k) = ρ k^{−β}.

Corollary 4.1. For the two-block matrix M, with each block exhibiting power spectral (singular value) decay, d*_1 ∝ (1−ε)^{1/(2β)} and d*_2 ∝ ε^{1/(2β)}.

As expected, the optimal dimension for an embedding vector should scale with its popularity ε. The particular scaling rule that emerges is the inverse of the spectral decay of the target matrix. For example, when the spectrum follows a power law, optimal dimensions scale as a fractional power of popularity. The generic version of Corollary 4.1 can be found in Appendix C.
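As an illustrative sketch (not code from the paper), the following numpy routine instantiates Thm. 4.3 under the power-law decay of Definition 4.4: it inverts σ(k) = ρ k^{−β} at the threshold √(λ(n_i+m_j)(n_i m_j)Π_ij^{−1}) and bisects over the multiplier λ until the budget Σ_ij (n_i+m_j) d*_ij is met:

import numpy as np

def optimal_dims_power_law(n, m, Pi, rho, beta, budget):
    """Sketch of Thm. 4.3 under power spectral decay sigma(k) = rho * k**(-beta):
    d*_ij = sigma^{-1}( sqrt(lambda * (n_i + m_j) * n_i * m_j / Pi_ij) ),
    with lambda found by bisection so that sum_ij (n_i + m_j) * d*_ij meets the
    budget. Dimensions are returned as floats (the convex relaxation)."""
    n, m = np.asarray(n, float), np.asarray(m, float)
    Pi = np.asarray(Pi, float)
    rows_plus_cols = n[:, None] + m[None, :]

    def dims(lam):
        s = np.sqrt(lam * rows_plus_cols * n[:, None] * m[None, :] / Pi)
        return (rho / s) ** (1.0 / beta)          # invert the power-law spectrum

    lo, hi = 1e-12, 1e12                          # bracket for lambda
    for _ in range(200):                          # geometric bisection; budget use is monotone in lambda
        mid = np.sqrt(lo * hi)
        if np.sum(rows_plus_cols * dims(mid)) > budget:
            lo = mid                              # over budget: lambda too small
        else:
            hi = mid
    return dims(hi)

# Two-block example from the text: a popular and a rare block over the same columns.
d_star = optimal_dims_power_law(n=[1000, 1000], m=[500], Pi=[[0.9], [0.1]],
                                rho=10.0, beta=1.0, budget=50_000)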

4.3 Large-scale CTR Prediction Systems in the Memory-Limited Regime
As mentioned earlier, labeled training data is easy to acquire in most large-scale CTR prediction systems. The pipeline is self-supervised: one can observe user engagement (or lack thereof) with personalized content. Thus, other than for newly added users/features, it is fair to assume that one has enough data to adequately learn embedding representations. The embedding layer's memory footprint ends up being the primary bottleneck. In this situation, the results of Thm. 4.3 yield appropriate guidelines for the optimal dimension of each embedding vector. The primary difficulty is that one needs to know the spectrum of the target matrix to know the optimal dimension. Unfortunately, this is unavoidable. One solution is to train an enormous table with an enormous dataset, thereby obtaining the spectrum, and then factoring (or re-training from scratch) at the optimal size. While the deployed embeddings for inference would be space efficient, the clear drawback is the enormous resource usage during training. However, the cost of obtaining the spectrum should not force us to throw away all prior mathematical knowledge and resort to NAS or RL to find good mixed embedding dimensions. We can still leverage the insight of our theoretical framework as follows. Most spectral decays, whether they are flat or steep, can be decently well fit by a power law (so much so that there is a large literature dedicated to falsifying power laws even when the numerical fit appears reasonable [37]). Generally, varying the temperature adequately captures the trend of the decay. By a priori assuming that the spectral decay is approximately a power law, we can still encode the mathematical knowledge we have into the architecture. We then only have to tune one hyper-parameter over a narrow range, requiring the exploration of only a small number of values.

Power-law Sizing Scheme. Here we define the block-level probability vector p. Let p be a k-dimensional probability vector (recalling that k is the number of blocks in the embedding layer E) such that p_i = (1/n_i) Σ_j Π_ij is the average probability that a row in the i-th block is queried. When blocks exactly correspond to features (as in our CTR task), p_i = 1/n_i.


Algorithm 1: Popularity-Based Dimension Sizing
Input: baseline dimension d and fixed temperature 0 ≤ α ≤ 1
Input: probability vector p
Output: dimension assignment vector d
  λ ← d · ||p||_∞^{−α}    ▷ compute scalar scaling factor
  d ← λ · p^α             ▷ component-wise exponent

Alg. 1 shows the scheme. We use a temperature parameter α > 0 to control the degree to which popularity influences the embedding dimension (as a simplification over using the decay parameter β). As α increases, so does the popularity-based skew in the embedding dimensions. At α = 0, the embedding dimensions are all uniform. At α = 1, the embedding dimensions are proportional to their popularity. In principle, one could pick α > 1, but since it is natural to assume that there are diminishing returns to the embedding dimension of a feature, such a choice should perform poorly. In Fig. 1b, we show the results of applying Alg. 1 to the Criteo dataset.
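The scheme of Alg. 1 amounts to two numpy operations; the sketch below also applies the power-of-2 rounding used for Fig. 1b, and the cardinalities in the usage example are illustrative:

import numpy as np

def popularity_based_dims(p, base_dim, alpha, round_pow2=True):
    """Alg. 1: lambda = base_dim * ||p||_inf**(-alpha); d = lambda * p**alpha.
    With blocks corresponding to features (as in the CTR task), p_i = 1 / n_i.
    Optionally round each dimension to a power of 2, as done for Fig. 1b."""
    p = np.asarray(p, dtype=float)
    lam = base_dim * np.max(p) ** (-alpha)   # scalar scaling factor
    d = lam * p ** alpha                     # component-wise exponent
    if round_pow2:
        d = 2.0 ** np.round(np.log2(d))
    return np.maximum(d, 1).astype(int)

# Illustrative: three features with cardinalities 10M, 10k, and 10.
n = np.array([10_000_000, 10_000, 10])
dims = popularity_based_dims(p=1.0 / n, base_dim=32, alpha=0.3)   # -> [1, 4, 32]

Note that the most popular block always receives the base dimension d, since λ is normalized by ||p||_∞.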

5 Experiments
We proceed to our experiments with MD embeddings. We begin by presenting our results on the Criteo CTR prediction dataset. Our implementation is available as part of an open-source project: https://github.com/facebookresearch/dlrm.

Results on CTR Prediction. We explore the trade-off between memory and statistical performance for MD embeddings, parameterized by varying temperatures (α). We measure memory in terms of the number of 32-bit floating point parameters in the model. Since different embedding base dimensions imply different widths in the first hidden layers of the downstream deep model, for fairness our parameter counts include both the embedding layer and all model parameters (recall that the parameter count is overwhelmingly dominated by embeddings). We report statistical performance in terms of cross entropy loss. Accuracy is reported in Appendix B. DLRM with uniform d = 32 embeddings obtains an accuracy of ∼79%, close to state-of-the-art for this dataset [23]. We sweep parameter budgets from a lower bound given by one parameter per embedding vector (d = 1) to an upper bound given by the memory limit of a 16 GB GPU. The parameters are allocated to embeddings according to the α-parameterized rule proposed in Alg. 1. For models with UD embedding layers (α = 0), the number of parameters directly controls the embedding dimension because the number of parameters per embedding is fixed. On the other hand, for MD layers (α > 0) we have different dimensions for each embedding.

Our results are reported in Fig. 2. In Fig. 2a we report learning curves for the DLRM model. We show that MD embeddings with α = 0.3 (green line) produce a learning curve on par with that of d = 32 UD embeddings (blue line) using a total parameter count equivalent to d = 4 UD embeddings (orange line). In Fig. 2b we report rate-distortion curves for varying temperature (α). We can see that using MD embedding layers improves the memory-performance frontier at each parameter budget. The optimal temperature (α) is dependent on the parameter budget, with higher temperatures leading to lower loss for smaller budgets. Embeddings with α = 0.4 obtain performance on par with uniform embeddings using 16× fewer parameters. Embeddings with α = 0.2 modestly outperform UD embeddings by 0.1% accuracy using half as many parameters. For RecSys, small improvements in accuracy are still important when volume is sufficiently large. The std. dev. estimates indicate that this gain is significant.

Since we increase the base dimension of our embeddings at higher temperatures (α), the downstream deep model must process higher-dimension features at the input layer. The computation required to compute inner products or other embedding interactions downstream increases with embedding dimension. It is natural to wonder if our method increases training time in exchange for memory reduction. Fortunately, MD embeddings not only improve prediction quality for a given parameter budget, but also for a given training time! We observe that MD embeddings can train over 2× faster than UD embeddings (see Fig. 2c) at a given test loss. This is possible because, for a given loss, the MD layer uses far fewer parameters than a UD layer. The faster training we observe is likely due to superior bandwidth efficiency and caching on the GPU, enabled by the reduced memory footprint. We run all of our models on a single GPU with identical hyperparameters (including batch size) across all α. See Appendix B for more details.

5.1 Case Study: Collaborative Filtering
The simplicity of CF with only users and items makes it easy to study the effects of MD and obtain insight. Matrix factorization (MF) is a typical embedding-based algorithm for CF. A baseline MF model for CF tasks offers the most distilled setting for investigating MD embeddings. On the other hand, we seek to understand the effect of non-linear neural processing layers in our model, so we also include simple NCF models. Because we only have two features in this case study (users and items), we slightly modify our blocking approach. To summarize: we sort users and items by empirical popularity, and then we apply mixed dimensions within individual embedding matrices by partitioning them into sub-matrices. For the full details of our case study protocol, including the blocking scheme, please see the Appendix.
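A sketch of that blocking step (an illustrative reading of the protocol, not the exact one from the appendix): sort the rows of a user or item embedding table by empirical query frequency and split them into k contiguous blocks, whose average popularities can then be fed to Alg. 1:

import numpy as np

def popularity_blocks(counts, k):
    """Sort item (or user) indices by empirical query frequency and split them
    into k contiguous blocks of (nearly) equal size. Returns, for each block,
    the original indices it contains and its average query probability."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(-counts)                  # most popular first
    probs = counts / counts.sum()
    return [(idx, probs[idx].mean()) for idx in np.array_split(order, k)]

# Illustrative: 1000 items with a skewed (Zipf-like) empirical query distribution.
rng = np.random.default_rng(0)
blocks = popularity_blocks(rng.zipf(a=1.5, size=1000), k=8)
p = np.array([avg_p for _, avg_p in blocks])     # feed into Alg. 1 to size each block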

Rate-distortion curves for varying α using MF and NCF models are reported in Figs. 3a and 3d, respectively. The CF rate-distortion curves display similar trends to the CTR ones. MD embeddings substantially outperform UD embeddings, especially at low parameter budgets.

Figure 2: CTR prediction results for MD embeddings on the Criteo dataset using DLRM. Fig. 2a (left): Learning curves for selected embedding architectures. Fig. 2b (center): Loss vs. # params for varying α. Fig. 2c (right): Train time vs. loss for varying α.

Figure 3: MF (top row; a-c) and NCF (bottom row; d-f) for collaborative filtering on the MovieLens dataset. Figs. 3a & 3d (left): MSE vs. # params for varying α. Figs. 3b & 3e (center): MSE vs. # params for popular and rare items at varying α. Figs. 3c & 3f (right): Generalization for popular (red) and rare (blue) items.

Popular vs. Rare Items. We partition item embeddings (since they are often the focal point of recommendation) into the one-third most and least popular items (by empirical training frequency). We report these popularity-partitioned rate-distortion curves for varying α in Figs. 3b and 3e. The dashed lines correspond to test samples that contain the one-third least popular items. The solid lines correspond to test samples that contain the one-third most popular items. These results show that a non-zero temperature (α) improves test loss on popular items and performs on par with or slightly better than UD embeddings on rare items.

Generalization performance. We also report generalization curves for the popular and rare items in Figs. 3c and 3f. In these curves we fix a parameter budget (2^21), vary the temperature (α), and plot training and test loss for both popular and rare item partitions. The results suggest that allocating more parameters to popular embeddings instead of rare items can improve performance on popular items without hurting the performance on rare items. Allocating more parameters to popular items by increasing the temperature (α) decreases both training and test loss. On the other hand, allocating more parameters to rare items by decreasing the temperature (α) decreases only training loss without helping test performance, indicating that the additional parameters on the rare embeddings are wasteful and should be allocated to popular embeddings. The key insight from this case study is that uniformly allocating parameters to embeddings, agnostic to the popularity of said embeddings, is inefficient. A majority of parameters are severely underutilized on rare embeddings, whereas popular embeddings could still benefit from increased representational capacity.

To summarize our experimental results, we have found that MD embeddings greatly improve both parameter efficiency and training time. We have found systematic and compelling empirical evidence that MD embeddings achieve this by leveraging skew in generalization and by optimizing for skew at deployment.

6 Conclusion
We have seen that MD embeddings obtain higher accuracy with fewer parameters while training much faster than UD embeddings. Our theoretical framework is the first to mathematically explain how this occurs, and our experiments are the first to validate the phenomena. Future work can explore applications of our theory to small-data regimes and incorporate an analysis of non-linear neural processing on top of the embeddings.


References
[1] Björn Barz and Joachim Denzler. Hierarchy-based image embeddings for semantic image retrieval. In Proc. IEEE Winter Conf. Applications of Computer Vision, pages 638–647. IEEE, 2019.

[2] Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, and David Forsyth. Learning type-aware embeddings for fashion compatibility. In Proc. European Conf. Computer Vision, 2018.

[3] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[4] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In Proc. 27th Int. Conf. Computational Linguistics, pages 1638–1649, 2018.

[5] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.

[7] Jongsoo Park et al. Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications. CoRR, abs/1811.09886, 2018.

[8] Xiaorui Wu, Hong Xu, Honglin Zhang, Huaming Chen, and Jian Wang. Saec: Similarity-aware embedding compression in recommendation systems. CoRR, abs/1903.00103, 2019.

[9] Zi Yin and Yuanyuan Shen. On the dimensionality of word embedding. In Proc. Advances in Neural Information Processing Systems, 2018.

[10] Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. Practice on long sequential user behavior modeling for click-through rate prediction. CoRR, abs/1905.09248, 2019.

[11] Shouyuan Chen, Michael R. Lyu, Irwin King, and Zenglin Xu. Exact and stable recovery of pairwise interaction tensors. In Advances in Neural Information Processing Systems, pages 1691–1699, 2013.

[12] Hongyang Zhang, Vatsal Sharan, Moses Charikar, and Yingyu Liang. Recovery guarantees for quadratic tensors with sparse observations. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[13] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16:3367–3402, 2015.

[14] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.

[15] Steffen Rendle, Li Zhang, and Yehuda Koren. On the difficulty of evaluating baselines: A study on recommender systems. CoRR, abs/1905.01395, 2019.

[16] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. CoRR, abs/1907.06902, 2019.

[17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf. World Wide Web, pages 173–182, 2017.

[18] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems. In IJCAI, pages 3203–3209, 2017.

[19] Jicong Fan and Jieyu Cheng. Matrix completion by deep matrix factorization. Neural Networks, 98:34–41, 2018.

[20] Moritz Hardt. Understanding alternating minimization for matrix completion. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 651–660. IEEE, 2014.

[21] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[22] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proc. 45th Annual ACM Symposium on Theory of Computing. ACM, 2013.

[23] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, et al. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019.

[24] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. CoRR, abs/1703.04247, 2017.

[25] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proc. 24th Int. Conf. Knowledge Discovery & Data Mining, pages 1754–1763. ACM, 2018.

[26] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. Deep interest evolution network for click-through rate prediction. CoRR, abs/1809.03672, 2018.

[27] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proc. 24th Int. Conf. Knowledge Discovery & Data Mining, pages 1059–1068. ACM, 2018.

[28] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.

[29] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.

[30] Manas R. Joglekar, Cong Li, Mei Chen, Taibai Xu, Xiaoming Wang, Jay K. Adams, Pranav Khaitan, Jiahui Liu, and Quoc V. Le. Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2387–2397, 2020.

[31] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7411–7422, 2019.

[32] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[33] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proc. IEEE, 98(6):925–936, 2010.

[34] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.

[35] Emmanuel J. Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

[36] Jerzy K. Baksalary, Peter Semrl, and George P. H. Styan. A note on rank additivity and range additivity. Linear Algebra and its Applications, 237:489–498, 1996.

[37] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.

[38] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[39] Fabián P. Lousame and Eduardo Sánchez. A taxonomy of collaborative-based recommender systems. In Web Personalization in Intelligent Environments, pages 81–117. Springer, 2009.

[40] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[41] Sriram Vishwanath. Information theoretic bounds for low-rank matrix completion. In 2010 IEEE Int. Symposium on Information Theory, pages 1508–1512. IEEE, 2010.

[42] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.

[43] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.

[44] Raghu Meka, Prateek Jain, and Inderjit S Dhillon. Matrix completionfrom power-law distributed samples. In Advances in neural informationprocessing systems, 2009.

[45] Guangcan Liu, Qingshan Liu, and Xiaotong Yuan. A new theory for matrixcompletion. In Advances in Neural Information Processing Systems, 2017.

[46] Martin Andrews. Compressing word embeddings. CoRR, abs/1511.06397,2015.

[47] Prasad Bhavana, Vikas Kumar, and Vineet Padmanabhan. Block basedsingular value decomposition approach to matrix factorization for recom-mender systems. Pattern Recognition Letters, abs/1907.07410, 2019.

[48] Prasanna Sattigeri and Jayaraman J. Thiagarajan. Sparsifying word rep-resentations for deep unordered sentence modeling. In Proc. 1st Workshopon Representation Learning for NLP, pages 206–214.

[49] Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. Sparse wordembeddings using l1 regularized online learning. In Proc. 25th Int. JointConf. Artificial Intelligence, 2016.

[50] JoseMAlvarez andMathieu Salzmann. Compression-aware training of deepnetworks. In Proc. Advances in Neural Information Processing Systems,pages 856–867, 2017.

[51] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis:Finding sparse, trainable neural networks. CoRR, abs/1803.03635, 2018.

[52] Maxim Naumov, Utku Diril, Jongsoo Park, Benjamin Ray, Jedrzej Jablon-ski, and Andrew Tulloch. On periodic functions as regularizers for quanti-zation of neural networks. CoRR, abs/1811.09862, 2018.

[53] Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantizationfor training and inference of neural networks. In Proc. European Conf.Computer Vision, pages 580–595, 2018.

[54] Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, DongLin, Lichan Hong, and Ed H Chi. Learning multi-granular quantizedembeddings for large-vocab categorical features in recommender systems.In Companion Proceedings of the Web Conference 2020, 2020.

[55] Raphael Shu and Hideki Nakayama. Compressing word embeddings viadeep compositional code learning. CoRR, abs/1711.01068, 2017.

[56] Julien Tissier, Amaury Habrard, and Christophe Gravier. Near-losslessbinarization of word embeddings. CoRR, abs/1803.09065, 2018.

[57] Joan Serrà and Alexandros Karatzoglou. Getting deep recommendersfit: Bloom embeddings for sparse binary input/output networks. InProceedings of the Eleventh ACM Conference on Recommender Systems,pages 279–287, 2017.

[58] Jiaxi Tang and Ke Wang. Ranking distillation: Learning compact rankingmodels with high performance for recommender system. In Proceedings ofthe 24th ACM SIGKDD International Conference on Knowledge Discovery& Data Mining, pages 2289–2298, 2018.

[59] Josh Attenberg, Kilian Weinberger, Anirban Dasgupta, Alex Smola, andMartin Zinkevich. Collaborative email-spam filtering with the hashingtrick. In Proc. 6th Conf. Email and Anti-Spam, 2009.

[60] Jinyang Gao, Beng Chin Ooi, Yanyan Shen, and Wang-Chien Lee. Cuckoofeature hashing: Dynamic weight sharing for sparse analytics. In IJCAI,pages 2135–2141, 2018.

[61] Alexandros Karatzoglou, Alex Smola, and Markus Weimer. Collaborativefiltering on a budget. In Proc. 13th Int. Conf. Artificial Intelligence andStatistics, pages 389–396, 2010.

[62] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan V.Oseledets. Tensorized embedding layers for efficient model compression.CoRR, abs/1901.10787, 2019.

[63] Kaiyu Shi and Kai Yu. Structured word embedding for low memory neuralnetwork language model. In Proc. Interspeech, 2018.

[64] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury,Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, LucaAntiga, et al. Pytorch: An imperative style, high-performance deeplearning library. In Advances in Neural Information Processing Systems,pages 8024–8035, 2019.

[65] Nvidia. Tesla V100 GPU architecture white paper. 2017.[66] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: His-

tory and context. ACM Transactions on Interactive Intelligent Systems,5, 2015.

[67] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence ofAdam and beyond. In Proc. Int. Conf. Learning Representations, 2018.

[68] Florian Strub, Romaric Gaudel, and Jérémie Mary. Hybrid recommendersystem based on autoencoders. In Proceedings of the 1st Workshop on DeepLearning for Recommender Systems, pages 11–16, 2016.

[69] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of trainingdeep feedforward neural networks. In Proc. 13th Int. Conf. ArtificialIntelligence and Statistics, pages 249–256, 2010.

[70] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delvingdeep into rectifiers: Surpassing human-level performance on imagenetclassification. In Proceedings of the IEEE international conference oncomputer vision, pages 1026–1034, 2015.

[71] Michael Mitzenmacher and Eli Upfal. Probability and computing: Ran-domization and probabilistic techniques in algorithms and data analysis.Cambridge university press, 2017.

[72] Brian Dawkins. Siobhan’s problem: the coupon collector revisited. TheAmerican Statistician, 45(1):76–82, 1991.

[73] Richard MKarp. Reducibility among combinatorial problems. InComplex-ity of computer computations, pages 85–103. Springer, 1972.

[74] Ivan Markovsky. Structured low-rank approximation and its applications.Automatica, 44(4):891–909, 2008.

[75] David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming,volume 2. Springer.

[76] Richard Courant and Fritz John. Introduction to calculus and analysis I.Springer Science & Business Media, 2012.


A Supplementary Material

We review and discuss various supplementary materials.

Collaborative Filtering & Matrix Completion CF tasks come in many variations and have a large body of scientific literature, with good reviews of classical algorithms and approaches provided in [38, 39]. Related to CF is the simplified matrix completion problem [33, 34, 32, 13, 40, 41, 21, 42].

Non-uniform popularity Non-uniform and deterministic sampling have been discussed in the matrix completion literature [43, 44, 45], but only insofar as how to correct for popularity so as to improve statistical recovery performance, or to build theoretical guarantees for completion under deterministic or non-uniform sampling.

Memory-Efficient Embeddings The techniques to decrease the memory consumed by embedding tables can be roughly split into two high-level classes: (i) compression algorithms and (ii) compressed architectures. Simple offline compression algorithms include post-training quantization, pruning or low-rank SVD [46, 47, 48, 49]. Online compression algorithms, including quantization-aware training, gradual pruning, and periodic regularization, are somewhat less popular in practice because of the additional complications they add to already intricate deep model training methods [50, 51, 52, 53, 54]. Model distillation techniques [55, 56, 57, 58] are another form of compression and can have online and offline variants. On the other hand, compressed architectures have the advantage of reducing memory requirements not only at inference time, but also at training time. This is the approach followed by hashing-based and tensor factorization methods [59, 60, 61, 62, 63].

Popularity Histograms The key idea of the MD embeddings is to address the skew in the popularity of objects appearing in a dataset. To illustrate it, we present popularity histograms of accesses across all features for the Criteo Kaggle dataset in Fig. 4a and individually for the users and items features of the MovieLens dataset in Fig. 4c and 4b, respectively.

Forward Propagation with Mixed Dimension Embeddings Forward propagation for an MD embedding layer can be summarized in Alg. 2. The steps involved in this algorithm are differentiable, therefore we can perform backward propagation through this layer and update matrices E^(i) and P^(i) accordingly. We note that Alg. 2 may be generalized to support multi-hot lookups, where embedding vectors corresponding to z query indices are fetched and reduced by a differentiable operator, such as add, multiply or concatenation.

Algorithm 2 Forward Propagation
Input: Index x ∈ [n]
Output: Embedding vector e_x
i ← 0 and t ← 0
while t + n_i < x do                          ▷ Find offset t and sub-block i
    t ← t + n_i
    i ← i + 1
end while
e_x ← E^(i)[x − t] P^(i)                      ▷ Construct an embedding vector
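For concreteness, the lookup in Alg. 2 can be written as a small PyTorch module. The following is only a minimal sketch, not the implementation used in the experiments; the class name, the per-block parameter lists, and the toy sizes are illustrative, and indices are 0-based rather than the 1-based convention of Alg. 2.

import torch
import torch.nn as nn

class MixedDimEmbedding(nn.Module):
    """Sketch of a mixed dimension embedding layer (cf. Alg. 2).

    block_sizes[i] = n_i rows in block i, block_dims[i] = d_i, and every
    block is projected up to a common base dimension d by P^(i).
    """
    def __init__(self, block_sizes, block_dims, base_dim):
        super().__init__()
        self.E = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(n, d))
             for n, d in zip(block_sizes, block_dims)])
        self.P = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(d, base_dim))
             for d in block_dims])

    def forward(self, x: int) -> torch.Tensor:
        # Find offset t and sub-block i (0-based here; Alg. 2 is 1-based).
        i, t = 0, 0
        while t + self.E[i].shape[0] <= x:
            t += self.E[i].shape[0]
            i += 1
        # e_x = E^(i)[x - t] P^(i): look up at dimension d_i, project to d.
        return self.E[i][x - t] @ self.P[i]

# Example: three blocks of decreasing dimension sharing base dimension 16.
layer = MixedDimEmbedding(block_sizes=[100, 1000, 10000],
                          block_dims=[16, 8, 4], base_dim=16)
e = layer(500)     # index 500 falls in block 1 and is embedded at dimension 8
print(e.shape)     # torch.Size([16])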

B Experimental Details

In this section we provide detailed protocols for the experiments in Section 5 of the main text. All code is implemented using the PyTorch [64] library. All algorithms are run as single-GPU jobs on Nvidia V100s [65]. All confidence intervals reported are standard deviations obtained from 5 replicates per experiment with different random seeds. In Table 1 we summarize the datasets and models used.

B.1 Collaborative Filtering Recall that CF tasks require a different blocking scheme than the one presented in the main text because we only have 2 categorical features. These features have corresponding embedding matrices W ∈ R^{n×d} and V ∈ R^{m×d}, for users and items, respectively. To size the MD embedding layer we apply MDs within individual embedding matrices by partitioning them. We block W and V separately. First, we sort and re-index the rows based on empirical row-wise frequency: i < i' ⟹ f_i ≥ f_{i'}, where f_i is the frequency with which a user or item appears in the training data³. Then, we partition each embedding matrix into k blocks such that the sum of frequencies in each block is equal. For each of the k blocks, the total popularity (i.e. the probability that a random query will include a row of that block) is constant (see Alg. 3). Then, for a given frequency vector f the k-equipartition is unique and is simple to compute. In our experiments, we saw that setting k anywhere in the [8, 16] range is sufficient to observe the effects induced by MDs, with diminishing effect beyond these thresholds.

Algorithm 3 Partitioning Scheme for CF
Input: Desired number of blocks k
Input: Row-wise frequencies vector f
Output: Offsets vector t
f ← sort(f)                                   ▷ Sort by row-wise frequency
Find offsets t_i such that Σ_{l=t_i}^{t_{i+1}} f_l = Σ_{l=t_j}^{t_{j+1}} f_l for all i, j
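One simple way to compute such offsets is a greedy cut on the cumulative popularity mass, sketched below in PyTorch; the function name is illustrative and the result is only approximately an equipartition on finite data.

import torch

def equipartition_offsets(freqs: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch of Alg. 3: split rows (sorted by frequency) into k blocks of
    roughly equal total popularity mass. Returns k+1 offsets t_0, ..., t_k."""
    f, _ = torch.sort(freqs, descending=True)        # sort by row-wise frequency
    mass = torch.cumsum(f, dim=0) / f.sum()          # cumulative popularity mass
    targets = torch.arange(1, k, dtype=mass.dtype) / k
    cuts = torch.searchsorted(mass, targets) + 1     # first row past each 1/k of mass
    return torch.cat([torch.tensor([0]), cuts, torch.tensor([len(f)])])

# Toy example with a heavily skewed frequency vector and k = 3 blocks:
freqs = torch.tensor([1000., 400., 200., 100., 50., 25., 12., 6., 3., 2.])
print(equipartition_offsets(freqs, k=3))             # tensor([ 0,  1,  2, 10])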

In our experiments we use the full 27M MovieLens dataset [66]. We train at learning rate 10^-2, found in the same manner as for CTR prediction. For consistency with CTR prediction, we also used the Amsgrad optimizer [67]. We train for 100 epochs, taking the model with the lowest validation loss. We terminate early if the model does not improve for 5 epochs in a row. We use a batch size of 2^15 in order to speed up training. We did not observe significant differences between this and smaller batch sizes. Our reported performance, in terms of MSE, is comparable to that reported elsewhere in the literature [68].

For initialization we use the uniform Xavier initialization (for consistency with CTR prediction). Also, for the NCF model, we use a 3-layer MLP with hidden layer dimensions 128-128-32. We used default LeakyReLU activations.
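The corresponding tower and optimizer settings can be sketched as follows in PyTorch; this is only an illustrative fragment (the input dimension and the rest of the NCF model are omitted), not the exact training code.

import torch
import torch.nn as nn

def make_mlp(in_dim: int, hidden=(128, 128, 32)) -> nn.Sequential:
    """3-layer MLP tower with LeakyReLU activations and uniform Xavier init."""
    layers, d = [], in_dim
    for h in hidden:
        linear = nn.Linear(d, h)
        nn.init.xavier_uniform_(linear.weight)   # uniform Xavier initialization [69]
        layers += [linear, nn.LeakyReLU()]
        d = h
    return nn.Sequential(*layers)

mlp = make_mlp(in_dim=64)   # illustrative input width
# Amsgrad [67] with the CF learning rate of 1e-2 described above.
opt = torch.optim.Adam(mlp.parameters(), lr=1e-2, amsgrad=True)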

³ Sorting and indexing can be done quickly on a single node as well as in distributed settings.


Figure 4: Popularity skew in real-world datasets. (a) Histogram of accesses for Criteo. (b) Histogram of accesses for items. (c) Histogram of accesses for users.

Figure 5: Effect of the number of equipartitions on CF. (a) MD embeddings. (b) UD embeddings.

Dataset          Task    # Samples    # Categories    Models Used
MovieLens        CF      27M          330K            MF, NCF
Criteo Kaggle    CTR     40M          32M             DLRM

Table 1: Datasets, tasks and models

For the item embedding partitioning (in Fig. 3), we partitioned the item embeddings by empirical popularity mass. This means that the top-third item embeddings represent on the order of 10^3 items, whereas the bottom-third item embeddings represent on the order of 10^5 items. Thus, the top-third and bottom-third partitions have the same empirical popularity mass, not the same number of items.

B.2 CTR Prediction From the perspective of this work, using MD embedding layers on real CTR prediction data with modern deep recommendation models is an important experiment that shows how MD embedding layers might scale to real-world recommendation engines.

In our experiments we use the state-of-the-art DLRM [23] and the Criteo Kaggle dataset. We determined the learning rate, optimizer, batch size, and initialization scheme by doing a simple grid search sweep on uniform dimension embeddings of dimension 32. Ultimately, we used Amsgrad with a learning rate of 10^-3, a batch size of 2^12, and a uniform Xavier initialization for all CTR experiments reported in the main text. As is customary in CTR prediction tasks, we train for only one epoch [23].

For learning rates, we tried λ ∈ {10^-2, 10^-3, 10^-4, 10^-5}; 10^-3 was best. We tried the Amsgrad, Adam, Adagrad, and SGD optimizers; Amsgrad was best.

For batch size, we tried powers of 2 from 2^5 to 2^12. The batch size hardly affected the memory footprint on the GPU, since the majority of memory is spent on the embedding layer, which is only sparsely queried on each forward pass in the batch. We also found that performance was largely invariant to batch size, so we selected a batch size of 2^12.

For initialization schemes, we tried uniform Xavier [69], uniform He (fan-in) and uniform He (fan-out) [70]. We initialized all neural network parameters, including embedding matrices, according to the same scheme. We found that He fan-in resulted in severe training instability. Xavier outperformed He fan-out by a considerable margin (Fig. 6).
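As a sketch, the three schemes map onto standard PyTorch initializers; the helper below applies one scheme to every weight matrix, embeddings included, as described above (the model in the usage line is a stand-in, not the actual DLRM).

import torch
import torch.nn as nn

def init_all(model: nn.Module, scheme: str = "xavier"):
    """Apply one initialization scheme to every weight matrix, embeddings included."""
    for p in model.parameters():
        if p.dim() < 2:
            continue                                      # leave biases at defaults
        if scheme == "xavier":
            nn.init.xavier_uniform_(p)                    # uniform Xavier [69]
        elif scheme == "he_fan_in":
            nn.init.kaiming_uniform_(p, mode="fan_in")    # uniform He (fan-in) [70]
        elif scheme == "he_fan_out":
            nn.init.kaiming_uniform_(p, mode="fan_out")   # uniform He (fan-out) [70]

model = nn.Sequential(nn.Embedding(1000, 32), nn.Linear(32, 1))   # stand-in model
init_all(model, scheme="xavier")    # the setting used for the reported CTR results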

We also report the accuracy for our CTR prediction experiments (Fig. 7). The accuracy curves mirror the cross-entropy loss curves reported in the main text.

Figure 6: Xavier and He (fan-out) initializations for DLRM with UD embeddings at various parameter budgets.

Figure 7: Accuracy on CTR prediction for varying temperature (α).

Figure 8: Training time vs. parameter counts for varying temperature (legend same as Fig. 7).

Regularization We note that, in many instances, embeddings for collaborative filtering tasks are trained with some type of norm-based regularization on the embedding vectors. While this particular form of regularization works well for small-scale embeddings, it is non-trivial to scale it to large-scale embeddings. This is because the gradient update when using a regularization term should back-propagate to every embedding vector in the layer, rather than just those queried in the forward pass. Because this technique is not feasible in large-scale CTR prediction, we explicitly do not use embedding regularization in our collaborative filtering task either. Furthermore, we note that, from the perspective of popularity skew, embedding norm regularization actually implicitly penalizes rare embeddings more than popular ones, since a larger fraction of training updates contain only the norm penalty signal for rare embeddings than for popular ones. This is an interesting connection that could be explored in future work, but it does not achieve the stated goal of parameter reduction.

C Theoretical Details

We proceed to present proofs of theorems as well as additional results that did not fit into the main text of the paper.

C.1 Block-wise MD Factorization The first point to address is the specifics of the SGD variant used to solve the MD factorization. We adopt the particular variation of the SGD algorithm for matrix factorization proposed in [21] and assume that the block structure is known or has been estimated. We show that, under the rank additivity assumption, it can be applied to factor the blocks of the matrix M independently and yet construct a rank-r approximation for it. Note that in this scenario the projections do not need to be free parameters (see Alg. 4).

Algorithm 4 Block-wise Mixed Dimension Factorization
Input: Partially masked target block matrix P_Ω(M).
Input: The blocks M^(ij) with block-wise rank r_ij.
Output: MD embeddings W, V.
W^(i,j), V^(i,j) ← SGD(P_Ω(M^(ij)))                         ▷ Factor each block
for 1 ≤ i ≤ k_W do
    W^(i) ← [W^(i,1), ..., W^(i,k_V)] ∈ R^{n_i × d_i^w}     ▷ Assemble W
    d_i^w ← Σ_{j=1}^{k_V} r_ij                              ▷ Construct projection P
    s_i^w ← Σ_{l=1}^{i−1} Σ_{j=1}^{k_V} r_lj
    t_i^w ← Σ_{l=i+1}^{k_W} Σ_{j=1}^{k_V} r_lj
    P_W^(i) ← [0_{d_i^w × s_i^w}, I_{d_i^w × d_i^w}, 0_{d_i^w × t_i^w}] ∈ R^{d_i^w × r}
end for
for 1 ≤ j ≤ k_V do
    V^(j) ← [V^(1,j), ..., V^(k_W,j)] ∈ R^{m_j × d_j^v}     ▷ Assemble V
    d_j^v ← Σ_{i=1}^{k_W} r_ij                              ▷ Construct projection P
    s_ij ← Σ_{l=1}^{i−1} Σ_{l'=1}^{k_V} r_{ll'} + Σ_{l=1}^{j−1} r_il
    t_ij ← Σ_{l=j+1}^{k_V} r_il + Σ_{l=i+1}^{k_W} Σ_{l'=1}^{k_V} r_{ll'}
    P_V^(j) ← [ 0_{r_1j × s_1j}, I_{r_1j × r_1j}, 0_{r_1j × t_1j} ;
                ... ;
                0_{r_ij × s_ij}, I_{r_ij × r_ij}, 0_{r_ij × t_ij} ;
                ... ;
                0_{r_kWj × s_kWj}, I_{r_kWj × r_kWj}, 0_{r_kWj × t_kWj} ] ∈ R^{d_j^v × r}
end for
W ← (W^(1), ..., W^(k_W), P_W^(1), ..., P_W^(k_W))
V ← (V^(1), ..., V^(k_V), P_V^(1), ..., P_V^(k_V))

Two block example We illustrate block-wise MD factorization on a 2n × m two-block rank-r matrix

M = \begin{bmatrix} M^{(1)} \\ M^{(2)} \end{bmatrix}
  = \begin{bmatrix} W^{(1)} V^{(1)T} \\ W^{(2)} V^{(2)T} \end{bmatrix}
  = \begin{bmatrix} W^{(1)} P_W^{(1)} \\ W^{(2)} P_W^{(2)} \end{bmatrix}
    \begin{bmatrix} V^{(1)T} \\ V^{(2)T} \end{bmatrix},

with W^(1) ∈ R^{n×r_1}, V^(1) ∈ R^{m×r_1}, W^(2) ∈ R^{n×r_2}, V^(2) ∈ R^{m×r_2} obtained by Alg. 4, projections P_W^(1) = [I, 0] ∈ R^{r_1×r} and P_W^(2) = [0, I] ∈ R^{r_2×r} defined by construction, and I being an identity matrix.

Four block example We extend the example to a 2n × 2m four-block rank-r matrix below

M = \begin{bmatrix} M^{(11)} & M^{(12)} \\ M^{(21)} & M^{(22)} \end{bmatrix}
  = \begin{bmatrix} W^{(1)} P_W^{(1)} \\ W^{(2)} P_W^{(2)} \end{bmatrix}
    \begin{bmatrix} (V^{(1)} P_V^{(1)})^T & (V^{(2)} P_V^{(2)})^T \end{bmatrix},

where each rank-r_ij block factors as M^(ij) = W^(ij) V^(ij)T and

W^(1) = [W^(11), W^(12)] ∈ R^{n×(r_11+r_12)}
W^(2) = [W^(21), W^(22)] ∈ R^{n×(r_21+r_22)}
V^(1) = [V^(11), V^(21)] ∈ R^{m×(r_11+r_21)}
V^(2) = [V^(12), V^(22)] ∈ R^{m×(r_12+r_22)}

were obtained by Alg. 4, while the projections

P_W^(1) = \begin{bmatrix} I & 0 & 0 & 0 \\ 0 & I & 0 & 0 \end{bmatrix} ∈ R^{(r_11+r_12)×r}
P_W^(2) = \begin{bmatrix} 0 & 0 & I & 0 \\ 0 & 0 & 0 & I \end{bmatrix} ∈ R^{(r_21+r_22)×r}
P_V^(1) = \begin{bmatrix} I & 0 & 0 & 0 \\ 0 & 0 & I & 0 \end{bmatrix} ∈ R^{(r_11+r_21)×r}
P_V^(2) = \begin{bmatrix} 0 & I & 0 & 0 \\ 0 & 0 & 0 & I \end{bmatrix} ∈ R^{(r_12+r_22)×r}

are defined by construction. Note the expanded terms

W^(1) P_W^(1) = [W^(11), W^(12), 0, 0]
W^(2) P_W^(2) = [0, 0, W^(21), W^(22)]
V^(1) P_V^(1) = [V^(11), 0, V^(21), 0]
V^(2) P_V^(2) = [0, V^(12), 0, V^(22)]

Intuitively, in a case with more blocks, the projections generalize this pattern of "sliding" the block elements of W and "interleaving" the block elements of V as defined in Alg. 4. A numerical sketch of this construction is given below.
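The following PyTorch fragment is an illustrative sketch (random toy blocks, arbitrary sizes and ranks) that assembles the four-block projections exactly as above and checks that the mixed dimension factors reproduce a rank-additive M; it is not the training-time implementation.

import torch

torch.manual_seed(0)
n, m = 6, 5
r = [[2, 1], [1, 3]]                          # block-wise ranks r_ij
rtot = sum(sum(row) for row in r)             # r = r11 + r12 + r21 + r22

# Rank-additive target: M^(ij) = W^(ij) V^(ij)^T for every block.
Wb = [[torch.randn(n, r[i][j]) for j in range(2)] for i in range(2)]
Vb = [[torch.randn(m, r[i][j]) for j in range(2)] for i in range(2)]
M = torch.cat([torch.cat([Wb[i][j] @ Vb[i][j].T for j in range(2)], dim=1)
               for i in range(2)], dim=0)

def slot(i, j):
    """Column offset of the (i, j) slot in the order [r11, r12, r21, r22] (2x2 case)."""
    return sum(r[0]) * i + sum(r[i][:j])

# Row embedding blocks W^(i) and their "sliding" projections P_W^(i).
W_parts = []
for i in range(2):
    Wi = torch.cat([Wb[i][0], Wb[i][1]], dim=1)            # W^(i) = [W^(i1), W^(i2)]
    PWi = torch.zeros(Wi.shape[1], rtot)
    PWi[:, slot(i, 0):slot(i, 0) + Wi.shape[1]] = torch.eye(Wi.shape[1])
    W_parts.append(Wi @ PWi)

# Column embedding blocks V^(j) and their "interleaving" projections P_V^(j).
V_parts = []
for j in range(2):
    Vj = torch.cat([Vb[0][j], Vb[1][j]], dim=1)            # V^(j) = [V^(1j), V^(2j)]
    PVj = torch.zeros(Vj.shape[1], rtot)
    row = 0
    for i in range(2):
        PVj[row:row + r[i][j], slot(i, j):slot(i, j) + r[i][j]] = torch.eye(r[i][j])
        row += r[i][j]
    V_parts.append(Vj @ PVj)

W_full, V_full = torch.cat(W_parts, dim=0), torch.cat(V_parts, dim=0)
print(torch.allclose(W_full @ V_full.T, M, atol=1e-5))     # True: W V^T recovers M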

C.2 Data-Limited Regime We cover additional details about the data-limited regime, as well as provide proofs for the associated theorems, in this section.

Incoherence The notion of incoherence is central to matrix completion [33, 32, 34, 35, 21, 40]. Throughout this work, we implicitly assume that M is µ-incoherent. Note that our asymptotic notation hides this dependence. For many standard random models for M, µ scales like √(r log n) [40], but here we simply take µ-incoherence as an assumption on M. Note that all matrices are incoherent for some µ ∈ [1, max{n, m}/r] [21].

Definition C.1. Incoherence. Let M = USV^T be the singular value decomposition of a rank-r matrix M ∈ R^{n×m}. Matrix M is µ-incoherent if ||U_i||_2^2 ≤ µr/n for all 1 ≤ i ≤ n and ||V_j||_2^2 ≤ µr/m for all 1 ≤ j ≤ m.

Low-sample Sub-matrix We denote by M_ε the low-sample sub-matrix of blocks corresponding to the minimum marginal sampling rate ε. Concretely, M_ε = [M^(i_ε 1), ..., M^(i_ε k_V)] ∈ R^{n_{i_ε}×m} if ε_W ≤ ε_V and M_ε = [M^(1 j_ε), ..., M^(k_W j_ε)] ∈ R^{n×m_{j_ε}} otherwise. For convenience, we also define n_ε = n_{i_ε} if ε_W ≤ ε_V and n_ε = m_{j_ε} otherwise. We refer to n_ε as the size of the low-sample sub-matrix (since the other dimension is inherited from the size of M itself).

Popularity-agnostic algorithms (including UD matrix factorization) are those that can be seen as empirically matching M at the observed indices under a given rank constraint, or any relaxation thereof, without taking advantage of popularity. MD factorization imposes additional popularity-based constraints. These additional constraints become essential to completion when the popularity is significantly skewed.

Definition C.2. Popularity-Agnostic Algorithm. Let f(θ) be some arbitrary but fixed model with parameters θ that outputs an attempted reconstruction M̂ of the matrix M. For a given rank r*, let S̄ be any set of matrices such that for S = {M̂ | rank(M̂) = r*} we have S ⊆ S̄. An algorithm A is popularity-agnostic if it outputs the solution to an optimization characterized by a Lagrangian of the form L(θ, λ) = ||M − M̂||_Ω^2 + λ 1[M̂ ∉ S̄], where the indicator function 1[x] = 1 if x = True and 0 otherwise.

C.2.1 Popularity-Agnostic Completion Bounds It is standard to impose a low-rank structure in the context of matrix completion. We are interested in understanding how and when popularity-based structure can improve recovery. While UD embeddings impose a low-rank structure, at a given rank we can interpret our MD embeddings as imposing additional popularity-based constraints on the matrix reconstruction. While our MD embeddings maintain a particular rank, they do so with fewer parameters, thereby imposing an additional popularity-based restriction on the space of admissible solution matrices.

Non-asymptotic Upper Bound We first give a simple lower bound on the sample complexity of popularity-agnostic algorithms. The bound is straightforward, based on the fact that without additional problem structure, in order to complete a matrix at rank r, one needs at least r observations per row and column, even for the least popular row or column. The global reconstruction effort will always be thwarted by the least popular user or item. The bound below is non-asymptotic, holding for any problem size. The theorem implies that popularity-agnostic algorithms pay steep recovery penalties depending on the least likely row/column to sample. If you want to exactly recover a matrix, you can only do as well as your most difficult row/column.

Theorem C.1. Fix some 0 < δ < 1/2. Let ε be the minimum marginal sampling rate and let n_ε be the size of the low-sample sub-matrix. Suppose the number of samples N ≤ (r/ε)(1 − δ). Then, no popularity-agnostic algorithm can recover M with probability greater than exp(−r n_ε δ² / (3(1 − δ))).

Proof. Let ψ be any one of the n_ε vectors in the low-probability sub-matrix M_ε (ψ is of length n or m, depending on whether M_ε is a block-wise row or column). Let X_ψ be a random variable denoting the number of observations corresponding to ψ under block-wise Bernoulli sampling. Since X_ψ is a sum of independent Bernoulli variables, we have E[X_ψ] = Nε by linearity of expectation. Furthermore, we require at least r observations for each row and column in M in order to achieve exact recovery at rank r. To see this directly, assume that an oracle completes all the embeddings except the row or column in question. Then, each observation immediately removes only one degree of freedom, since it defines an inner product with a known vector. It will be impossible to complete the final row or column with fewer than r observations, because popularity-agnostic algorithms provide no further constraints beyond the low-rank structure. Given that we need r observations per vector, we can see that E[X_ψ] = Nε ≤ (1 − δ)r < r. We can therefore use the Chernoff tail bound [71] to bound Pr[X_ψ ≥ r] from above: by an application of the multiplicative Chernoff bound, Pr[X_ψ ≥ r] ≤ exp(−(r/3) · δ²/(1 − δ)). To complete the proof, notice that our argument extends to each of the n_ε vectors in M_ε independently. Since all of these vectors require r observations in order to complete M_ε, we obtain the final probability by computing a product over the probabilities that each of the n_ε vectors obtains at least r observations.
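For completeness, one way to make the Chernoff step explicit (a reconstruction of the omitted arithmetic in the theorem's parametrization, not necessarily the authors' exact derivation): with µ := E[X_ψ] = Nε ≤ (1 − δ)r, set 1 + η := r/µ and apply the standard multiplicative bound.

\begin{align*}
\Pr[X_\psi \ge r] = \Pr\big[X_\psi \ge (1+\eta)\mu\big]
  &\le \exp\Big(-\frac{\eta^2\mu}{2+\eta}\Big)
   = \exp\Big(-\frac{r\,(1-\mu/r)^2}{1+\mu/r}\Big) \\
  &\le \exp\Big(-\frac{r\delta^2}{2-\delta}\Big)
   \le \exp\Big(-\frac{r\delta^2}{3(1-\delta)}\Big),
\end{align*}
% since \mu/r \le 1-\delta and 2-\delta \le 3(1-\delta) whenever \delta \le 1/2.
% The n_\varepsilon vectors of M_\varepsilon occupy disjoint entries, so these events are
% independent and the per-vector bounds multiply, yielding the stated bound
% \exp(-r\,n_\varepsilon\,\delta^2/(3(1-\delta))).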

Asymptotic Upper Bound We also provide a stronger asymptotic lower bound for exact completion, based on the results of [32, 40]. This lower bound assumes that the matrix size n increases while keeping the sampling rate constant. It includes an additional O(log n) factor, due to the well-known coupon collector effect [72].

Since M is still a block matrix, we assume that, asymptotically, each individual block becomes large while Π is held constant. More concretely, we assume each block scales at the same rate as the entire matrix: n_i = Θ(n) for all i and m_j = Θ(n) for all j. In principle, we could also support an asymptotic number of blocks, as long as the number of blocks grows slowly enough compared to the size of each block. Other numerical constants, such as the condition number and incoherence, are taken to be non-asymptotic. Note that we do not require the rank additivity assumption for this to hold.

Theorem C.2. Let M be a target matrix following the block-wise Bernoulli sampling model. Let ε be the minimum marginal sampling rate. Suppose N = o((r/ε) n log n). Then any popularity-agnostic algorithm will have arbitrarily small probability of recovery, asymptotically.

Proof. Θ(nr log n) observations are necessary for exact completion at a given probability in the asymptotic setting [32, 40]. This is because Θ(nr log n) observations are necessary to obtain r observations per row and column. Each vector in the low-sample sub-matrix also requires r observations. Since the number of samples landing in the low-sample sub-matrix concentrates around Nε, this number must be of order Θ(rn log n) in order to have a chance of reconstruction.

C.2.2 Completion Guarantees for Mixed Dimension Embeddings In [21] it was shown that various non-convex optimization algorithms, including SGD, could exactly complete the unknown matrix under the Bernoulli sampling model. For convenience, the theorem is reproduced below. The details of the SGD implementation, such as step sizes and other parameters, can be found in [21].

Theorem C.3. (Sun and Luo, 2016) Let α = n/m ≥ 1, let κ be the condition number of M, and let µ be the incoherence of M. If the (expected) number of samples satisfies N ≥ C_0 n r κ² α max{µ log n, √α µ² r⁶ κ⁴}, then SGD completes M with probability greater than 1 − 2/n⁴.

We can use Thm. C.3 and Alg. 4 to construct a guarantee for mixed dimension block-wise factorization, as follows.

Corollary C.1. Let M be a target matrix following the block-wise Bernoulli sampling model. Let C_0 be a universal constant, n_ij = min{n_i, m_j}, and

N*_ij = C_0 Π_ij^{-1} n_ij r_ij κ_ij² α_ij max{µ_ij log n_ij, √α_ij µ_ij² r_ij⁶ κ_ij⁴},

where κ_ij and µ_ij are the condition number and the incoherence of the ij-th block of M, and α_ij = max{n_i, m_j}/min{n_i, m_j} ≥ 1. If N ≥ max_ij N*_ij, then block-wise MD factorization completes the rank additive block matrix M with probability greater than 1 − Σ_ij 2/n_ij⁴.

Proof. Recall the construction used in Alg. 4. First, we complete each block individually. We apply Thm. C.3 to each block independently to guarantee its completion at rank r_ij with probability at least 1 − 2/n_ij⁴. We then use the block-wise embeddings to construct MD embeddings W, V as described in Alg. 4. If W^(ij) V^(ij)T = M^(ij) for all i, j, then W V^T = M. Thus, we need only a union bound on the failure probabilities from Thm. C.3 to complete the proof.

Note that Corollary C.1 implies Thm. 4.2, as it is a non-asymptotic version thereof. Namely, recall that n_ij = Θ(n) for all i, j. Furthermore, letting C = C_0 max_ij(Π_ij^{-1} r_ij κ_ij² α_ij µ_ij) recovers Corollary 4.2.

C.3 Memory-Limited Regime Now, we turn our attention to the allocation implications of non-uniformly sampled test sets. To abstract away training, we assume an oracle reveals the target matrix. Recall that our challenge is a small parameter budget: our embeddings can only use B parameters. The question is what dimensions d_w and d_v each embedding block should get to minimize our popularity-weighted reconstruction loss (under MSE). Before proceeding, we pause to define some useful matrices from the block-wise MD factorization (Alg. 4).

Definition C.3. In the context of block-wise MD factorization (Alg. 4) we refer to the matrices (W^(ij), V^(ij)) as the ij-th block-wise embeddings. We refer to matrix W^(i) as the i-th row embedding block and to matrix V^(j) as the j-th column embedding block.


Note that, generally speaking, embedding tables (W, V) naturally inherit an embedding block structure based on the block structure of M. For example, standard UD embeddings partition such that the top block-wise row M^(1) = [M^(11), ..., M^(1k_V)] only depends on W^(1). Thus, embedding blocks exist independently of the usage of block-wise MD factorization. On the other hand, block-wise embeddings (W^(ij), V^(ij)) are a distinct byproduct of block-wise MD factorization.

C.3.1 Optimization over Embedding Dimensions We assume the block structure and block-wise probability matrix are given; the variables over which the optimization takes place are 1) the dimensions of the embedding blocks (d_w, d_v), such that W^(i) ∈ R^{n_i×(d_w)_i} and V^(j) ∈ R^{m_j×(d_v)_j}, and 2) the embedding blocks themselves, W^(i) for 1 ≤ i ≤ k_W and V^(j) for 1 ≤ j ≤ k_V. Note that when the embedding block dimensions are uniform, such that (d_w)_i = (d_v)_j for all i and j, this is equivalent to direct optimization over the embedding matrices W, V (i.e. matrix factorization). Recall that L_Π is the popularity-weighted MSE. When d_w and d_v are treated as integers, this optimization is NP-hard in general, since even integral feasibility under linear constraints is known to be NP-hard [73]. Instead, we study a continuous relaxation that results in a convex program. The resultant convex program is far simpler and yields a closed-form solution given the spectrum of the target matrix. We proceed to define another quantity of interest for our discussion, a spectral (singular value) decay. In order to save space in the main text, we do not introduce the spectral decay rule g there, but we imply it when referring to the spectrum directly. After the upcoming definition, we restate and prove Thm. 4.3 from the main text.

Definition C.4. A spectral decay is a mapping from [0, r] to R_+ that describes the singular value scree plot of a matrix. Let σ_k be the k-th singular value of a matrix, with σ_k ≥ σ_{k+1} for k = 1, ..., r. With any singular value spectrum we associate a spectral decay rule, a piece-wise step function, and its functional inverse: g(x) = σ_k for k − 1 ≤ x < k and g^{-1}(x) = k for σ_{k+1} < x ≤ σ_k, respectively.

Theorem C.4. The optimal block-wise embedding dimensions for the convex relaxation of the variable dimension embedding optimization under a parameter budget are given by

d^*_{ij} = g_{ij}^{-1}\Big(\sqrt{\lambda\,(n_i + m_j)(n_i m_j)\,\Pi_{ij}^{-1}}\Big),

where g_ij^{-1} is the functional inverse of the spectral decay of block M^(ij).

Proof. The optimization is formulated as

\min_{d_w, d_v} \; \min_{W, V} \; L_\Pi(M, W V^T) \quad \text{s.t.} \quad \sum_i n_i (d_w)_i + \sum_j m_j (d_v)_j \le B.

Under the relaxation, we treat this as a continuous optimization. Let (k, l) be a test coordinate sampled according to Π. If rank additivity holds, we can equivalently write

= \min_{d} \; \min_{W, V} \; \mathbb{E}_{(k,l) \in [n] \times [m]} \big|M_{kl} - W_k V_l^T\big|^2 \quad \text{s.t.} \quad \sum_{ij} (n_i + m_j)\, d_{ij} \le B,

where M_kl is the kl-th element of M, (d_w)_i = Σ_j d_ij and (d_v)_j = Σ_i d_ij. The d_w, d_v refer to the embedding block dimensions, whereas the d_ij refer to the block-wise embedding dimensions (cf. Definition C.3). We may ignore the parameters in the projections, since they are not free parameters (and they also contribute a negligible number of parameters to the total count). The block-wise Bernoulli sampling model then lets us rewrite the expectation block by block, weighted by the popularity distribution. As shorthand notation, let B̃ := Σ_{ij} (n_i + m_j) d_ij.

= \min_{d} \; \min_{W, V} \; \sum_{ij} \frac{\Pi_{ij}}{n_i m_j} \big\|M^{(ij)} - W^{(ij)} V^{(ij)T}\big\|_F^2 \quad \text{s.t.} \quad \tilde{B} \le B.

Since the constraints remain the same, we omit them for the time being. Letting σ_k^(ij) be the singular values of block M^(ij) and using the low-rank approximation theorem [74], we obtain

= \min_{d} \; \sum_{ij} \frac{\Pi_{ij}}{n_i m_j} \sum_{k = d_{ij}+1}^{r_{ij}} \big(\sigma_k^{(ij)}\big)^2 \quad \text{s.t.} \quad \tilde{B} \le B.

Letting g_ij be the spectral decay rule for each block and noticing that by construction \sum_{k=1}^{r} \sigma_k = \int_0^r g(k)\,dk (and likewise \sum_{k=1}^{r} \sigma_k^2 = \int_0^r g^2(k)\,dk), we obtain

= \min_{d} \; \sum_{ij} \frac{\Pi_{ij}}{n_i m_j} \Big( \int_0^{r_{ij}} g_{ij}^2(k)\,dk - \int_0^{d_{ij}} g_{ij}^2(k)\,dk \Big) \quad \text{s.t.} \quad \tilde{B} \le B

= \min_{d} \; \sum_{ij} \frac{\Pi_{ij}}{n_i m_j} \Big( \big\|M^{(ij)}\big\|_F^2 - \int_0^{d_{ij}} g_{ij}^2(k)\,dk \Big) \quad \text{s.t.} \quad \tilde{B} \le B.

Observe that the objective is convex. To see this, note that each g is decreasing, since the spectral decay is decreasing. Thus, g² is also decreasing. The negative integral of a decreasing function is convex. Finally, since the objective is a sum of functions that are convex along one variable and constant along the rest, the entire optimization is convex (and well-posed under the linear constraint, which is guaranteed to be active). Thus we can solve the optimization with first-order conditions [75]. The corresponding Lagrangian can be written as

L = \sum_{ij} \frac{\Pi_{ij}}{n_i m_j} \Big( \big\|M^{(ij)}\big\|_F^2 - \int_0^{d_{ij}} g_{ij}^2(k)\,dk \Big) + \lambda \Big( \sum_{ij} (n_i + m_j)\, d_{ij} - B \Big).

Note that M^(ij) does not depend on d_ij. Also, note that we can use the fundamental theorem of calculus, \partial_x \int_0^x g(t)\,dt = g(x) [76]. Then, using Lagrange multipliers [75] we can write


\partial_{d_{ij}} L = -\frac{\Pi_{ij}}{n_i m_j}\, g_{ij}^2(d_{ij}) + \lambda\, (n_i + m_j).

Finally, using the first-order conditions \partial_{d_{ij}} L = 0 for all i, j, we obtain g_ij²(d_ij) = λ(n_i + m_j)(n_i m_j) Π_ij^{-1}. Solving for d_ij by taking the functional inverse of g_ij completes the proof. We conclude:

d^*_{ij} = g_{ij}^{-1}\Big( \sqrt{\lambda\,(n_i + m_j)(n_i m_j)\,\Pi_{ij}^{-1}} \Big).

For specific spectral decay rules, we may give closed-form solutions, as done in the main text for power-law decay. We can also analyze the performance gap between uniform and MD embeddings with respect to the optimization objective.

Corollary C.2. The performance gap compared to UD embeddings is

\sum_{ij} \Pi_{ij} \bigg( \mathbb{1}\Big[d^*_{ij} > \tfrac{B}{n+m}\Big] \sum_{k = B/(n+m)}^{d^*_{ij}} \big(\sigma_k^{(ij)}\big)^2 \;-\; \mathbb{1}\Big[d^*_{ij} < \tfrac{B}{n+m}\Big] \sum_{k = d^*_{ij}}^{B/(n+m)} \big(\sigma_k^{(ij)}\big)^2 \bigg).

Proof. Follows directly from plugging the optimal d* into the objective.

We can explain the intuition for Corollary C.2 as follows. The first term counts the spectral mass gained back by allocating more parameters to frequent embeddings. B/(n+m) is the embedding dimension under a uniform parameter allotment. When d*_ij is greater than this, we are increasing the dimension, which enables that embedding to recover more of the spectrum. This occurs when Π_ij is large. On the other hand, the trade-off is that lower-dimension embeddings recover less of the spectrum when Π_ij is small, which is the penalty incurred by the second term.

Corollary C.3. When M exhibits a block-wise power spectral decay, this becomes

d^*_{ij} = \lambda\, \zeta_{ij}\, \Pi_{ij}^{1/(2\beta)}, \qquad \text{where } \zeta_{ij} = \big((n_i + m_j)(n_i m_j)\,\mu\big)^{-1/(2\beta)} \text{ and } \lambda = \frac{B}{\sum_{ij} (n_i + m_j)\,\zeta_{ij}\,\Pi_{ij}^{1/(2\beta)}}.

Proof. Follows directly by substituting the power spectral decay rule for σ(·).

C.3.2 Approximation Gap for Convex Relaxation Note that we can bound the approximation gap of the proposed relaxation by simply rounding down each d_ij to the nearest integer, which ensures the feasibility of the assignment. The absolute approximation error is then less than Σ_ij Π_ij · g_ij²(d_ij). For most applications, this quantity is small for a good MD assignment, since either the probability term Π_ij is small, or the spectrum at d_ij is small. For example, in typical use cases, the embedding dimensions may be on the order of 10–100; rounding down to the nearest integer would thus represent a loss of about 10^-1% of the spectral mass.
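To make the allocation rule concrete, the sketch below computes block-wise dimensions that scale as Π_ij^{1/(2β)}, normalizes them so that the total parameter count matches a budget B, and rounds down as discussed above. It is only an illustration: the block-size factor ζ_ij from Corollary C.3 is omitted for readability, caps at a maximum dimension are not enforced, and all names and numbers are hypothetical.

import torch

def md_dims(probs, n_rows, m_cols, budget, beta=0.5):
    """Block-wise dimensions d_ij proportional to Π_ij^(1/(2β)), scaled so that
    Σ_ij (n_i + m_j) d_ij ≈ B, then rounded down (cf. Sec. C.3.2)."""
    probs = probs / probs.sum()                      # block-wise probability matrix Π
    sizes = n_rows[:, None] + m_cols[None, :]        # (n_i + m_j) per block
    raw = probs ** (1.0 / (2.0 * beta))              # popularity factor Π_ij^{1/(2β)}
    scale = budget / (sizes * raw).sum()             # enforce the parameter budget
    return torch.floor(scale * raw)

# Toy example: 3 row blocks x 2 column blocks with skewed popularity.
probs  = torch.tensor([[0.50, 0.20], [0.15, 0.10], [0.03, 0.02]])
n_rows = torch.tensor([1e3, 1e4, 1e5])
m_cols = torch.tensor([2e3, 2e4])
print(md_dims(probs, n_rows, m_cols, budget=5e6))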