
Language Modeling with Power Low Rank Ensembles

Ankur P. Parikh, School of Computer Science, Carnegie Mellon University, [email protected]

Avneesh Saluja, Electrical & Computer Engineering, Carnegie Mellon University, [email protected]

Chris Dyer, School of Computer Science, Carnegie Mellon University, [email protected]

Eric P. Xing, School of Computer Science, Carnegie Mellon University, [email protected]

    Abstract

We present power low rank ensembles (PLRE), a flexible framework for n-gram language modeling where ensembles of low rank matrices and tensors are used to obtain smoothed probability estimates of words in context. Our method can be understood as a generalization of n-gram modeling to non-integer n, and includes standard techniques such as absolute discounting and Kneser-Ney smoothing as special cases. PLRE training is efficient and our approach outperforms state-of-the-art modified Kneser-Ney baselines in terms of perplexity on large corpora as well as on BLEU score in a downstream machine translation task.

    1 Introduction

Language modeling is the task of estimating the probability of sequences of words in a language and is an important component in, among other applications, automatic speech recognition (Rabiner and Juang, 1993) and machine translation (Koehn, 2010). The predominant approach to language modeling is the n-gram model, wherein the probability of a word sequence $P(w_1, \ldots, w_\ell)$ is decomposed using the chain rule, and then a Markov assumption is made: $P(w_1, \ldots, w_\ell) \approx \prod_{i=1}^{\ell} P(w_i \mid w_{i-n+1}^{i-1})$. While this assumption substantially reduces the modeling complexity, parameter estimation remains a major challenge. Due to the power-law nature of language (Zipf, 1949), the maximum likelihood estimator massively overestimates the probability of rare events and assigns zero probability to legitimate word sequences that happen not to have been observed in the training data (Manning and Schütze, 1999).

Many smoothing techniques have been proposed to address the estimation challenge. These reassign probability mass (generally from overestimated events) to unseen word sequences, whose probabilities are estimated by interpolating with or backing off to lower order n-gram models (Chen and Goodman, 1999).

Somewhat surprisingly, these widely used smoothing techniques differ substantially from techniques for coping with data sparsity in other domains, such as collaborative filtering (Koren et al., 2009; Su and Khoshgoftaar, 2009) or matrix completion (Candès and Recht, 2009; Cai et al., 2010). In these areas, low rank approaches based on matrix factorization play a central role (Lee and Seung, 2001; Salakhutdinov and Mnih, 2008; Mackey et al., 2011). For example, in recommender systems, a key challenge is dealing with the sparsity of ratings from a single user, since typical users will have rated only a few items. By projecting the low rank representation of a user's (sparse) preferences into the original space, an estimate of ratings for new items is obtained. These methods are attractive due to their computational efficiency and mathematical well-foundedness.

In this paper, we introduce power low rank ensembles (PLRE), in which low rank tensors are used to produce smoothed estimates for n-gram probabilities. Ideally, we would like the low rank structures to discover semantic and syntactic relatedness among words and n-grams, which are used to produce smoothed estimates for word sequence probabilities. In contrast to the few previous low rank language modeling approaches, PLRE is not orthogonal to n-gram models, but rather a general framework where existing n-gram smoothing methods such as Kneser-Ney smoothing are special cases. A key insight is that PLRE does not compute low rank approximations of the original joint count matrices (in the case of bigrams) or tensors, i.e. multi-way arrays (in the case of 3-grams and above), but instead altered quantities of these counts based on an element-wise power operation, similar to how some smoothing methods modify their lower order distributions.

Moreover, PLRE has two key aspects that lead to easy scalability for large corpora and vocabularies. First, since it utilizes the original n-grams, the ranks required for the low rank matrices and tensors tend to remain tractable (e.g. around 100 for a vocabulary size $V \approx 1 \times 10^6$), leading to fast training times. This differentiates our approach from other methods that leverage an underlying latent space such as neural networks (Bengio et al., 2003; Mnih and Hinton, 2007; Mikolov et al., 2010) or soft-class models (Saul and Pereira, 1997), where the underlying dimension is required to be quite large to obtain good performance. Moreover, at test time, the probability of a sequence can be queried in time $O(\kappa_{max})$, where $\kappa_{max}$ is the maximum rank of the low rank matrices/tensors used. While this is larger than Kneser-Ney's virtually constant query time, it is substantially faster than conditional exponential family models (Chen and Rosenfeld, 2000; Chen, 2009; Nelakanti et al., 2013) and neural networks, which require $O(V)$ for exact computation of the normalization constant. See Section 7 for a more detailed discussion of related work.

Outline: We first review existing n-gram smoothing methods (§2) and then present the intuition behind the key components of our technique: rank (§3.1) and power (§3.2). We then show how these can be interpolated into an ensemble (§4). In the experimental evaluation on English and Russian corpora (§5), we find that PLRE outperforms Kneser-Ney smoothing and all its variants, as well as class-based language models. We also include a comparison to the log-bilinear neural language model (Mnih and Hinton, 2007) and evaluate performance on a downstream machine translation task (§6) where our method achieves consistent improvements in BLEU.

    2 Discount-based Smoothing

We first provide background on absolute discounting (Ney et al., 1994) and Kneser-Ney smoothing (Kneser and Ney, 1995), two common n-gram smoothing methods. Both methods can be formulated as back-off or interpolated models; we describe the latter here since that is the basis of our low rank approach.

    2.1 Notation

Let $c(w)$ be the count of word $w$, and similarly $c(w, w_{i-1})$ for the joint count of words $w$ and $w_{i-1}$. For shorthand we will define $w_i^j$ to denote the word sequence $\{w_i, w_{i+1}, \ldots, w_{j-1}, w_j\}$. Let $\hat{P}(w_i)$ refer to the maximum likelihood estimate (MLE) of the probability of word $w_i$, and similarly $\hat{P}(w_i \mid w_{i-1})$ for the probability conditioned on a history, or more generally, $\hat{P}(w_i \mid w_{i-n+1}^{i-1})$.

Let $N_{-}(w_i) := |\{w : c(w_i, w) > 0\}|$ be the number of distinct words that appear before $w_i$. More generally, let $N_{-}(w_{i-n+1}^{i}) = |\{w : c(w_{i-n+1}^{i}, w) > 0\}|$. Similarly, let $N_{+}(w_{i-n+1}^{i-1}) = |\{w : c(w, w_{i-n+1}^{i-1}) > 0\}|$. $V$ denotes the vocabulary size.
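
To ground the notation, here is a minimal sketch (ours, not the paper's; the toy corpus and function name are illustrative) of how $c(\cdot)$, the joint counts, $N_{-}(\cdot)$, and $N_{+}(\cdot)$ can be collected for the bigram case:

```python
from collections import Counter

def bigram_statistics(tokens):
    """Collect c(w), c(w_{i-1}, w_i), N_-(w) and N_+(h) from a token list."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))   # joint counts, keyed (history, word)
    n_minus, n_plus = Counter(), Counter()
    for (hist, word) in bigram:                 # iterate over distinct bigram types
        n_minus[word] += 1                      # distinct words appearing before `word`
        n_plus[hist] += 1                       # distinct words appearing after `hist`
    return unigram, bigram, n_minus, n_plus

tokens = "the cat sat on the mat the cat ran".split()
unigram, bigram, n_minus, n_plus = bigram_statistics(tokens)
print(unigram["the"], bigram[("the", "cat")], n_minus["cat"], n_plus["the"])  # 3 2 1 2
```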

    2.2 Absolute Discounting

Absolute discounting works on the idea of interpolating higher order n-gram models with lower-order n-gram models. However, first some probability mass must be "subtracted" from the higher order n-grams so that the leftover probability can be allocated to the lower order n-grams. More specifically, define the following discounted conditional probability:

$$\hat{P}_D(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_i, w_{i-n+1}^{i-1}) - D, 0\}}{c(w_{i-n+1}^{i-1})}$$

Then absolute discounting $P_{abs}(\cdot)$ uses the following (recursive) equation:

$$P_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \hat{P}_D(w_i \mid w_{i-n+1}^{i-1}) + \gamma(w_{i-n+1}^{i-1}) \, P_{abs}(w_i \mid w_{i-n+2}^{i-1})$$

where $\gamma(w_{i-n+1}^{i-1})$ is the leftover weight (due to the discounting) that is chosen so that the conditional distribution sums to one: $\gamma(w_{i-n+1}^{i-1}) = \frac{D}{c(w_{i-n+1}^{i-1})} N_{+}(w_{i-n+1}^{i-1})$. For the base case, we set $P_{abs}(w_i) = \hat{P}(w_i)$.

Discontinuity: Note that if $c(w_{i-n+1}^{i-1}) = 0$, then $\gamma(w_{i-n+1}^{i-1}) = \frac{0}{0}$, in which case $\gamma(w_{i-n+1}^{i-1})$ is set to 1. We will see that this discontinuity appears in PLRE as well.
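
For the bigram case ($n = 2$), the recursion bottoms out immediately in the MLE unigram, so the whole model fits in a few lines. A rough sketch under that assumption (ours; the toy corpus, the fixed discount $D = 0.75$, and all names are illustrative):

```python
from collections import Counter

def absolute_discount_bigram(tokens, D=0.75):
    """Interpolated absolute discounting for bigrams:
    P_abs(w | h) = max(c(h, w) - D, 0) / c(h) + gamma(h) * P_mle(w),
    where gamma(h) = D * N_+(h) / c(h)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    hist_count = Counter(tokens[:-1])            # c(h) as a history (last token excluded)
    n_plus = Counter()
    for (h, w) in bigram:
        n_plus[h] += 1                           # number of distinct continuations of h
    total = sum(unigram.values())

    def prob(w, h):
        p_unigram = unigram[w] / total           # base case: P_abs(w) = MLE unigram
        if hist_count[h] == 0:                   # gamma(h) = 0/0 -> set to 1 (discontinuity)
            return p_unigram
        gamma = D * n_plus[h] / hist_count[h]
        return max(bigram[(h, w)] - D, 0) / hist_count[h] + gamma * p_unigram

    return prob

p = absolute_discount_bigram("the cat sat on the mat the cat ran".split())
print(p("cat", "the"), p("sat", "the"))          # seen vs. unseen continuation of "the"
```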

2.3 Kneser Ney Smoothing

Ideally, the smoothed probability should preserve the observed unigram distribution:

$$\hat{P}(w_i) = \sum_{w_{i-n+1}^{i-1}} P_{sm}(w_i \mid w_{i-n+1}^{i-1}) \, \hat{P}(w_{i-n+1}^{i-1}) \quad (1)$$

where $P_{sm}(w_i \mid w_{i-n+1}^{i-1})$ is the smoothed conditional probability that a model outputs. Unfortunately, absolute discounting does not satisfy this property, since it exclusively uses the unaltered MLE unigram model as its lower order model. In practice, the lower order distribution is only utilized when we are unsure about the higher order distribution (i.e., when $\gamma(\cdot)$ is large). Therefore, the unigram model should be altered to condition on this fact.

This is the inspiration behind Kneser-Ney (KN) smoothing, an elegant algorithm with robust performance in n-gram language modeling. KN smoothing defines alternate probabilities $P^{alt}(\cdot)$:

$$P^{alt}_D(w_i \mid w_{i-n'+1}^{i-1}) = \begin{cases} \hat{P}_D(w_i \mid w_{i-n'+1}^{i-1}), & \text{if } n' = n \\[4pt] \dfrac{\max\{N_{-}(w_{i-n'+1}^{i}) - D, 0\}}{\sum_{w_i} N_{-}(w_{i-n'+1}^{i})}, & \text{if } n' < n \end{cases}$$

The base case for unigrams reduces to $P^{alt}(w_i) = \frac{N_{-}(w_i)}{\sum_{w_i} N_{-}(w_i)}$. Intuitively, $P^{alt}(w_i)$ is proportional to the number of unique words that precede $w_i$. Thus, words that appear in many different contexts will be given higher weight than words that consistently appear after only a few contexts. These alternate distributions are then used with absolute discounting:

$$P_{kn}(w_i \mid w_{i-n+1}^{i-1}) = P^{alt}_D(w_i \mid w_{i-n+1}^{i-1}) + \gamma(w_{i-n+1}^{i-1}) \, P_{kn}(w_i \mid w_{i-n+2}^{i-1}) \quad (2)$$

where we set $P_{kn}(w_i) = P^{alt}(w_i)$. By definition, KN smoothing satisfies the marginal constraint in Eq. 1 (Kneser and Ney, 1995).
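
The KN base case depends only on how many distinct words precede each word type. A small self-contained sketch of that unigram (again ours, on a toy corpus):

```python
from collections import Counter

def kn_unigram(tokens):
    """KN base case: P_alt(w) = N_-(w) / sum_w' N_-(w'),
    i.e. proportional to the number of distinct words that precede w."""
    n_minus = Counter()
    for hist, word in set(zip(tokens, tokens[1:])):   # distinct bigram types only
        n_minus[word] += 1
    total = sum(n_minus.values())
    return {w: n_minus[w] / total for w in n_minus}

print(kn_unigram("the cat sat on the mat the cat ran".split()))
# "the" receives the highest weight because it is preceded by the most distinct words.
```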

    3 Power Low Rank Ensembles

In n-gram smoothing methods, if a bigram count $c(w_i, w_{i-1})$ is zero, the unigram probabilities are used, which is equivalent to assuming that $w_i$ and $w_{i-1}$ are independent (and similarly for general $n$). However, in this situation, instead of backing off to a 1-gram, we may like to back off to a "1.5-gram" or more generally an order between 1 and 2 that captures a coarser level of dependence between $w_i$ and $w_{i-1}$ and does not assume full independence.

Inspired by this intuition, our strategy is to construct an ensemble of matrices and tensors that not only consists of MLE-based count information, but also contains quantities that represent levels of dependence in-between the various orders in the model. We call these combinations power low rank ensembles (PLRE), and they can be thought of as n-gram models with non-integer n. Our approach can be recursively formulated as:

$$P_{plre}(w_i \mid w_{i-n+1}^{i-1}) = P^{alt}_{D_0}(w_i \mid w_{i-n+1}^{i-1}) + \gamma_0(w_{i-n+1}^{i-1}) \Big( Z_{D_1}(w_i \mid w_{i-n+1}^{i-1}) + \cdots + \gamma_{\eta-1}(w_{i-n+1}^{i-1}) \big( Z_{D_\eta}(w_i \mid w_{i-n+1}^{i-1}) + \gamma_\eta(w_{i-n+1}^{i-1}) \, P_{plre}(w_i \mid w_{i-n+2}^{i-1}) \big) \cdots \Big) \quad (3)$$

where $Z_1, \ldots, Z_\eta$ are conditional probability matrices that represent the intermediate n-gram orders¹ and $D$ is a discount function (specified in §4).

This formulation begs answers to a few critical questions. How to construct matrices that represent conditional probabilities for intermediate n? How to transform them in a way that generalizes the altered lower order distributions in KN smoothing? How to combine these matrices such that the marginal constraint in Eq. 1 still holds? The following propose solutions to these three queries:

1. Rank (Section 3.1): Rank gives us a concrete measurement of the dependence between $w_i$ and $w_{i-1}$. By constructing low rank approximations of the bigram count matrix and higher-order count tensors, we obtain matrices that represent coarser dependencies, with a rank one approximation implying that the variables are independent.

2. Power (Section 3.2): In KN smoothing, the lower order distributions are not the original counts but rather altered estimates. We propose a continuous generalization of this alteration by taking the element-wise power of the counts.

3. Creating the Ensemble (Section 4): Lastly, PLRE also defines a way to interpolate the specifically constructed intermediate n-gram matrices. Unfortunately, a constant discount, as presented in Section 2, will not in general preserve the lower order marginal constraint (Eq. 1). We propose a generalized discounting scheme to ensure the constraint holds.

¹With a slight abuse of notation, let $Z_{D_j}$ be shorthand for $Z_{j, D_j}$.

    3.1 Rank

We first show how rank can be utilized to construct quantities between an n-gram and an (n-1)-gram. In general, we think of an n-gram as an nth order tensor, i.e. a multi-way array with n indices $\{i_1, \ldots, i_n\}$. (A vector is a tensor of order 1, a matrix is a tensor of order 2, etc.) Computing a special rank one approximation of slices of this tensor produces the (n-1)-gram. Thus, taking rank $\kappa$ approximations in this fashion allows us to represent dependencies between an n-gram and (n-1)-gram.

Consider the bigram count matrix $B$ with $N$ counts which has rank $V$. Note that $\hat{P}(w_i \mid w_{i-1}) = \frac{B(w_i, w_{i-1})}{\sum_w B(w, w_{i-1})}$. Additionally, $B$ can be considered a random variable that is the result of sampling $N$ tuples of $(w_i, w_{i-1})$ and agglomerating them into a count matrix. Assuming $w_i$ and $w_{i-1}$ are independent, the expected value (with respect to the empirical distribution) $E[B] = N P(w_i) P(w_{i-1})$, which can be rewritten as being proportional to the outer product of the unigram probability vector with itself, and is thus rank one.

This observation extends to higher order n-grams as well. Let $C^n$ be the nth order tensor where $C^n(w_i, \ldots, w_{i-n+1}) = c(w_i, \ldots, w_{i-n+1})$. Furthermore, denote $C^n(:, \tilde{w}_{i-n+2}^{i-1}, :)$ to be the $V \times V$ matrix slice of $C^n$ where $w_{i-n+2}, \ldots, w_{i-1}$ are held fixed to a particular sequence $\tilde{w}_{i-n+2}, \ldots, \tilde{w}_{i-1}$. Then if $w_i$ is conditionally independent of $w_{i-n+1}$ given $w_{i-n+2}^{i-1}$, then $E[C^n(:, \tilde{w}_{i-n+2}^{i-1}, :)]$ is rank one $\forall \tilde{w}_{i-n+2}^{i-1}$.

However, it is rare that these matrices are actually rank one, either due to sampling variance or the fact that $w_i$ and $w_{i-1}$ are not independent. What we would really like to say is that the best rank one approximation $B^{(1)}$ (under some norm) of $B$ is $\propto \hat{P}(w_i)\hat{P}(w_{i-1})$. While this statement is not true under the $\ell_2$ norm, it is true under generalized KL divergence (Lee and Seung, 2001):

$$gKL(A \| B) = \sum_{ij} \left( A_{ij} \log\left(\frac{A_{ij}}{B_{ij}}\right) - A_{ij} + B_{ij} \right)$$

In particular, generalized KL divergence preserves row and column sums: if $M^{(\kappa)}$ is the best rank $\kappa$ approximation of $M$ under gKL, then the row sums and column sums of $M^{(\kappa)}$ and $M$ are equal (Ho and Van Dooren, 2008). Leveraging this property, it is straightforward to prove the following lemma:

Lemma 1. Let $B^{(\kappa)}$ be the best rank $\kappa$ approximation of $B$ under gKL. Then $B^{(1)} \propto \hat{P}(w_i)\hat{P}(w_{i-1})$ and $\forall w_{i-1}$ s.t. $c(w_{i-1}) \neq 0$:

$$\hat{P}(w_i) = \frac{B^{(1)}(w_i, w_{i-1})}{\sum_w B^{(1)}(w, w_{i-1})}$$

For more general $n$, let $C^{n,(\kappa)}_{i-1,\ldots,i-n+2}$ be the best rank $\kappa$ approximation of $C^n(:, \tilde{w}_{i-n+2}^{i-1}, :)$ under gKL. Then similarly, $\forall w_{i-n+1}^{i-1}$ s.t. $c(w_{i-n+1}^{i-1}) > 0$:

$$\hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-n+2}) = \frac{C^{n,(1)}_{i-1,\ldots,i-n+2}(w_i, w_{i-n+1}^{i-1})}{\sum_w C^{n,(1)}_{i-1,\ldots,i-n+2}(w, w_{i-n+1}^{i-1})} \quad (4)$$

Thus, by selecting $1 < \kappa < V$, we obtain count matrices and tensors between n and (n-1)-grams. The condition that $c(w_{i-n+1}^{i-1}) > 0$ corresponds to the discontinuity discussed in §2.2.
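
Computing these best rank-$\kappa$ approximations under gKL can be done with the multiplicative updates of Lee and Seung (2001) for KL-divergence NMF, whose fixed points preserve row and column sums (Ho and Van Dooren, 2008). A minimal numpy sketch of this building block (our own rendering, not the authors' code; the iteration count and initialization are arbitrary):

```python
import numpy as np

def gkl_low_rank(M, rank, iters=500, eps=1e-12):
    """Rank-`rank` approximation of a nonnegative matrix M under generalized KL,
    via Lee & Seung multiplicative updates. Returns factors L, R with M ~ L @ R."""
    rng = np.random.default_rng(0)
    n, m = M.shape
    L = rng.random((n, rank)) + eps
    R = rng.random((rank, m)) + eps
    for _ in range(iters):
        A = L @ R
        L *= ((M / (A + eps)) @ R.T) / (R.sum(axis=1) + eps)
        A = L @ R
        R *= (L.T @ (M / (A + eps))) / (L.sum(axis=0)[:, None] + eps)
    return L, R

# Rank-1 check of Lemma 1 on a small bigram count matrix: row and column sums
# are preserved, so B1 is proportional to the outer product of the unigram marginals.
B = np.array([[1.0, 2.0, 1.0],
              [0.0, 5.0, 0.0],
              [2.0, 0.0, 0.0]])
L, R = gkl_low_rank(B, rank=1)
B1 = L @ R
print(np.round(B1.sum(axis=0), 3), B.sum(axis=0))   # column sums (per-history normalizers)
print(np.round(B1.sum(axis=1), 3), B.sum(axis=1))   # row sums
```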

3.2 Power

Since KN smoothing alters the lower order distributions instead of simply using the MLE, varying the rank is not sufficient in order to generalize this suite of techniques. Thus, PLRE computes low rank approximations of altered count matrices. Consider taking the elementwise power $\rho$ of the bigram count matrix, which is denoted by $B^{\cdot\rho}$. For example, the observed bigram count matrix and associated row sum:

$$B^{\cdot 1} = \begin{pmatrix} 1.0 & 2.0 & 1.0 \\ 0 & 5.0 & 0 \\ 2.0 & 0 & 0 \end{pmatrix} \xrightarrow{\text{row sum}} \begin{pmatrix} 4.0 \\ 5.0 \\ 2.0 \end{pmatrix}$$

As expected, the row sum is equal to the unigram counts (which we denote as $u$). Now consider $B^{\cdot 0.5}$:

$$B^{\cdot 0.5} = \begin{pmatrix} 1.0 & 1.4 & 1.0 \\ 0 & 2.2 & 0 \\ 1.4 & 0 & 0 \end{pmatrix} \xrightarrow{\text{row sum}} \begin{pmatrix} 3.4 \\ 2.2 \\ 1.4 \end{pmatrix}$$

Note how the row sum vector has been altered. In particular, since $w_1$ (corresponding to the first row) has a more diverse history than $w_2$, it has a higher row sum (compared to in $u$ where $w_2$ has the higher row sum). Lastly, consider the case when $\rho = 0$:

$$B^{\cdot 0} = \begin{pmatrix} 1.0 & 1.0 & 1.0 \\ 0 & 1.0 & 0 \\ 1.0 & 0 & 0 \end{pmatrix} \xrightarrow{\text{row sum}} \begin{pmatrix} 3.0 \\ 1.0 \\ 1.0 \end{pmatrix}$$

The row sum is now the number of unique words that precede $w_i$ (since $B^{\cdot 0}$ is binary) and is thus equal to the (unnormalized) Kneser-Ney unigram. This idea also generalizes to higher order n-grams and leads us to the following lemma:

Lemma 2. Let $B^{(\rho,\kappa)}$ be the best rank $\kappa$ approximation of $B^{\cdot\rho}$ under gKL. Then $\forall w_{i-1}$ s.t. $c(w_{i-1}) \neq 0$:

$$P^{alt}(w_i) = \frac{B^{(0,1)}(w_i, w_{i-1})}{\sum_w B^{(0,1)}(w, w_{i-1})}$$

For more general $n$, let $C^{n,(\rho,\kappa)}_{i-1,\ldots,i-n+2}$ be the best rank $\kappa$ approximation of $C^{n,(\cdot\rho)}(:, \tilde{w}_{i-n+2}^{i-1}, :)$ under gKL. Similarly, $\forall w_{i-n+1}^{i-1}$ s.t. $c(w_{i-n+1}^{i-1}) > 0$:

$$P^{alt}(w_i \mid w_{i-1}, \ldots, w_{i-n+2}) = \frac{C^{n,(0,1)}_{i-1,\ldots,i-n+2}(w_i, w_{i-n+1}^{i-1})}{\sum_w C^{n,(0,1)}_{i-1,\ldots,i-n+2}(w, w_{i-n+1}^{i-1})} \quad (5)$$
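
The effect of the power on the row sums can be verified directly on the 3×3 example from above (a small numpy sketch of our own):

```python
import numpy as np

B = np.array([[1.0, 2.0, 1.0],
              [0.0, 5.0, 0.0],
              [2.0, 0.0, 0.0]])

for rho in (1.0, 0.5, 0.0):
    B_rho = np.where(B > 0, B ** rho, 0.0)   # element-wise power, keeping zeros at zero
    print(rho, B_rho.sum(axis=1))
# rho = 1.0 recovers the unigram counts (4, 5, 2); rho = 0.0 gives the number of
# distinct predecessors (3, 1, 1), i.e. the unnormalized Kneser-Ney unigram;
# rho = 0.5 interpolates between the two, matching the (3.4, 2.2, 1.4) shown above.
```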

    4 Creating the Ensemble

Recall our overall formulation in Eq. 3; a naive solution would be to set $Z_1, \ldots, Z_\eta$ to low rank approximations of the count matrices/tensors under varying powers, and then interpolate through constant absolute discounting. Unfortunately, the marginal constraint in Eq. 1 will generally not hold if this strategy is used. Therefore, we propose a generalized discounting scheme where each non-zero n-gram count is associated with a different discount $D_j(w_i, w_{i-n'+1}^{i-1})$. The low rank approximations are then computed on the discounted matrices, leaving the marginal constraint intact.

For clarity of exposition, we focus on the special case where $n = 2$ with only one low rank matrix before stating our general algorithm:

$$P_{plre}(w_i \mid w_{i-1}) = \hat{P}_{D_0}(w_i \mid w_{i-1}) + \gamma_0(w_{i-1}) \Big( Z_{D_1}(w_i \mid w_{i-1}) + \gamma_1(w_{i-1}) \, P^{alt}(w_i) \Big) \quad (6)$$

Our goal is to compute $D_0, D_1$ and $Z_1$ so that the following lower order marginal constraint holds:

$$\hat{P}(w_i) = \sum_{w_{i-1}} P_{plre}(w_i \mid w_{i-1}) \hat{P}(w_{i-1}) \quad (7)$$

Our solution can be thought of as a two-step procedure where we compute the discounts $D_0, D_1$ (and the $\gamma(w_{i-1})$ weights as a by-product), followed by the low rank quantity $Z_1$. First, we construct the following intermediate ensemble of powered, but full rank terms. Let $Y^{\rho_j}$ be the matrix such that $Y^{\rho_j}(w_i, w_{i-1}) := c(w_i, w_{i-1})^{\rho_j}$. Then define

$$P_{pwr}(w_i \mid w_{i-1}) := Y^{(\rho_0=1)}_{D_0}(w_i \mid w_{i-1}) + \gamma_0(w_{i-1}) \Big( Y^{(\rho_1)}_{D_1}(w_i \mid w_{i-1}) + \gamma_1(w_{i-1}) \, Y^{(\rho_2=0)}(w_i \mid w_{i-1}) \Big) \quad (8)$$

where with a little abuse of notation:

$$Y^{\rho_j}_{D_j}(w_i \mid w_{i-1}) = \frac{c(w_i, w_{i-1})^{\rho_j} - D_j(w_i, w_{i-1})}{\sum_{w_i} c(w_i, w_{i-1})^{\rho_j}}$$

Note that $P^{alt}(w_i)$ has been replaced with $Y^{(\rho_2=0)}(w_i \mid w_{i-1})$, based on Lemma 2, and will equal $P^{alt}(w_i)$ once the low rank approximation is taken (as discussed in §4.2).

Since we have only combined terms of different power (but all full rank), it is natural to choose the discounts so that the result remains unchanged, i.e., $P_{pwr}(w_i \mid w_{i-1}) = \hat{P}(w_i \mid w_{i-1})$, since the low rank approximation (not the power) will implement smoothing. Enforcing this constraint gives rise to a set of linear equations that can be solved (in closed form) to obtain the discounts, as we now show below.

4.1 Step 1: Computing the Discounts

To ensure the constraint that $P_{pwr}(w_i \mid w_{i-1}) = \hat{P}(w_i \mid w_{i-1})$, it is sufficient to enforce the following two local constraints:

$$Y^{(\rho_j)}(w_i \mid w_{i-1}) = Y^{(\rho_j)}_{D_j}(w_i \mid w_{i-1}) + \gamma_j(w_{i-1}) \, Y^{(\rho_{j+1})}(w_i \mid w_{i-1}) \quad \text{for } j = 0, 1 \quad (9)$$

This allows each $D_j$ to be solved for independently of the other $\{D_{j'}\}_{j' \neq j}$. Let $c_{i,i-1} = c(w_i, w_{i-1})$, $c^j_{i,i-1} = c(w_i, w_{i-1})^{\rho_j}$, and $d^j_{i,i-1} = D_j(w_i, w_{i-1})$. Expanding Eq. 9 yields that $\forall w_i, w_{i-1}$:

$$\frac{c^j_{i,i-1}}{\sum_i c^j_{i,i-1}} = \frac{c^j_{i,i-1} - d^j_{i,i-1}}{\sum_i c^j_{i,i-1}} + \left( \frac{\sum_i d^j_{i,i-1}}{\sum_i c^j_{i,i-1}} \right) \frac{c^{j+1}_{i,i-1}}{\sum_i c^{j+1}_{i,i-1}} \quad (10)$$

which can be rewritten as:

$$-d^j_{i,i-1} + \left( \sum_i d^j_{i,i-1} \right) \frac{c^{j+1}_{i,i-1}}{\sum_i c^{j+1}_{i,i-1}} = 0 \quad (11)$$

Note that Eq. 11 decouples across $w_{i-1}$ since the only $d^j_{i,i-1}$ terms that are dependent are the ones that share the preceding context $w_{i-1}$.

It is straightforward to see that setting $d^j_{i,i-1}$ proportional to $c^{j+1}_{i,i-1}$ satisfies Eq. 11. Furthermore, it can be shown that all solutions are of this form (i.e., the linear system has a null space of exactly one). Moreover, we are interested in a particular subset of solutions where a single parameter $d^*$ (independent of $w_{i-1}$) controls the scaling, as indicated by the following lemma:

Lemma 3. Assume that $\rho_j \geq \rho_{j+1}$. Choose any $0 \leq d^* \leq 1$. Set $d^j_{i,i-1} = d^* c^{j+1}_{i,i-1}$ $\forall i, j$. The resulting discounts satisfy Eq. 11 as well as the inequality constraints $0 \leq d^j_{i,i-1} \leq c^j_{i,i-1}$. Furthermore, the leftover weight $\gamma_j$ takes the form:

$$\gamma_j(w_{i-1}) = \frac{\sum_i d^j_{i,i-1}}{\sum_i c^j_{i,i-1}} = \frac{d^* \sum_i c^{j+1}_{i,i-1}}{\sum_i c^j_{i,i-1}}$$

Proof. Clearly this choice of $d^j_{i,i-1}$ satisfies Eq. 11. The largest possible value of $d^j_{i,i-1}$ is $c^{j+1}_{i,i-1}$, and $\rho_j \geq \rho_{j+1}$ implies $c^j_{i,i-1} \geq c^{j+1}_{i,i-1}$. Thus the inequality constraints are met. It is then easy to verify that $\gamma_j$ takes the above form.

The above lemma generalizes to longer contexts (i.e. $n > 2$) as shown in Algorithm 1. Note that if $\rho_j = \rho_{j+1}$ then Algorithm 1 is equivalent to scaling the counts, e.g. deleted-interpolation/Jelinek-Mercer smoothing (Jelinek and Mercer, 1980). On the other hand, when $\rho_{j+1} = 0$, Algorithm 1 is equal to the absolute discounting that is used in Kneser-Ney. Thus, depending on $\rho_{j+1}$, our method generalizes different types of interpolation schemes to construct an ensemble so that the marginal constraint is satisfied.

Algorithm 1 Compute D
In: Count tensor $C^n$, powers $\rho_j, \rho_{j+1}$ such that $\rho_j \geq \rho_{j+1}$, and parameter $d^*$.
Out: Discount $D_j$ for powered counts $C^{n,(\rho_j)}$ and associated leftover weight $\gamma_j$.
1: Set $D_j(w_i, w_{i-n+1}^{i-1}) = d^* c(w_i, w_{i-n+1}^{i-1})^{\rho_{j+1}}$.
2: Set
$$\gamma_j(w_{i-n+1}^{i-1}) = \frac{d^* \sum_{w_i} c(w_i, w_{i-n+1}^{i-1})^{\rho_{j+1}}}{\sum_{w_i} c(w_i, w_{i-n+1}^{i-1})^{\rho_j}}$$

Algorithm 2 Compute Z
In: Count tensor $C^n$, power $\rho$, discounts $D$, rank $\kappa$.
Out: Discounted low rank conditional probability table $Z^{(\rho,\kappa)}_D(w_i \mid w_{i-n+1}^{i-1})$ (represented implicitly).
1: Compute powered counts $C^{n,(\cdot\rho)}$.
2: Compute denominators $\sum_{w_i} c(w_i, w_{i-n+1}^{i-1})^{\rho}$ $\forall w_{i-n+1}^{i-1}$ s.t. $c(w_{i-n+1}^{i-1}) > 0$.
3: Compute discounted powered counts $C^{n,(\cdot\rho)}_D = C^{n,(\cdot\rho)} - D$.
4: For each slice $M_{\tilde{w}_{i-n+2}^{i-1}} := C^{n,(\cdot\rho)}_D(:, \tilde{w}_{i-n+2}^{i-1}, :)$ compute
$$M^{(\kappa)} := \arg\min_{A \geq 0: \, \mathrm{rank}(A)=\kappa} \| M_{\tilde{w}_{i-n+2}^{i-1}} - A \|_{KL}$$
(stored implicitly as $M^{(\kappa)} = LR$). Set $Z^{(\rho,\kappa)}_D(:, \tilde{w}_{i-n+2}^{i-1}, :) = M^{(\kappa)}$.
5: Note that
$$Z^{(\rho,\kappa)}_D(w_i \mid w_{i-n+1}^{i-1}) = \frac{Z^{(\rho,\kappa)}_D(w_i, w_{i-n+1}^{i-1})}{\sum_{w_i} c(w_i, w_{i-n+1}^{i-1})^{\rho}}$$
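
For the bigram case ($n = 2$) the two algorithms reduce to a handful of array operations around the KL-NMF routine sketched in §3.1. The following is our own schematic rendering under that assumption (column convention: rows index $w_i$, columns index the history $w_{i-1}$); it is not the authors' released implementation:

```python
import numpy as np

def gkl_low_rank(M, rank, iters=500, eps=1e-12):
    """Rank-`rank` gKL approximation (same multiplicative updates as the Section 3.1 sketch)."""
    rng = np.random.default_rng(0)
    L = rng.random((M.shape[0], rank)) + eps
    R = rng.random((rank, M.shape[1])) + eps
    for _ in range(iters):
        L *= ((M / (L @ R + eps)) @ R.T) / (R.sum(axis=1) + eps)
        R *= (L.T @ (M / (L @ R + eps))) / (L.sum(axis=0)[:, None] + eps)
    return L, R

def powered(C, rho):
    """Element-wise power of a count matrix, keeping zero counts at zero."""
    return np.where(C > 0, C ** rho, 0.0)

def algorithm1_bigram(C, rho_j, rho_next, d_star):
    """Algorithm 1: per-entry discount D_j = d* c^{rho_{j+1}} and per-history
    leftover weight gamma_j (histories are columns, so sums run over axis 0)."""
    D_j = d_star * powered(C, rho_next)
    gamma_j = D_j.sum(axis=0) / np.maximum(powered(C, rho_j).sum(axis=0), 1e-12)
    return D_j, gamma_j

def algorithm2_bigram(C, rho, D, rank):
    """Algorithm 2: low rank (gKL) approximation of the discounted powered counts,
    normalized by the precomputed powered-count denominators."""
    C_rho = powered(C, rho)
    denom = np.maximum(C_rho.sum(axis=0), 1e-12)   # per-history normalizers
    L, R = gkl_low_rank(C_rho - D, rank)
    return (L @ R) / denom                         # Z_D^{(rho, kappa)}(w_i | w_{i-1})

C = np.array([[1.0, 2.0, 1.0], [0.0, 5.0, 0.0], [2.0, 0.0, 0.0]])
D1, gamma1 = algorithm1_bigram(C, rho_j=0.5, rho_next=0.0, d_star=0.75)
Z1 = algorithm2_bigram(C, rho=0.5, D=D1, rank=1)
print(np.round(Z1, 3), np.round(gamma1, 3))
```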

4.2 Step 2: Computing Low Rank Quantities

The next step is to compute low rank approximations of $Y^{(\rho_j)}_{D_j}$ to obtain $Z_{D_j}$ such that the intermediate marginal constraint in Eq. 7 is preserved. This constraint trivially holds for the intermediate ensemble $P_{pwr}(w_i \mid w_{i-1})$ due to how the discounts were derived in §4.1. For our running bigram example, define $Z^{(\rho_j,\kappa_j)}_{D_j}$ to be the best rank $\kappa_j$ approximation to $Y^{(\rho_j)}_{D_j}$ according to gKL and let

$$Z^{(\rho_j,\kappa_j)}_{D_j}(w_i \mid w_{i-1}) = \frac{Z^{(\rho_j,\kappa_j)}_{D_j}(w_i, w_{i-1})}{\sum_{w_i} c(w_i, w_{i-1})^{\rho_j}}$$

Note that $Z^{(\rho_j,\kappa_j)}_{D_j}(w_i \mid w_{i-1})$ is a valid (discounted) conditional probability since gKL preserves row/column sums, so the denominator remains unchanged under the low rank approximation. Then, using the fact that $Z^{(0,1)}(w_i \mid w_{i-1}) = P^{alt}(w_i)$ (Lemma 2), we can embellish Eq. 6 as

$$P_{plre}(w_i \mid w_{i-1}) = \hat{P}_{D_0}(w_i \mid w_{i-1}) + \gamma_0(w_{i-1}) \Big( Z^{(\rho_1,\kappa_1)}_{D_1}(w_i \mid w_{i-1}) + \gamma_1(w_{i-1}) \, P^{alt}(w_i) \Big)$$

Leveraging the form of the discounts and the row/column sum preserving property of gKL, we then have the following lemma (the proof is in the supplementary material):

Lemma 4. Let $P_{plre}(w_i \mid w_{i-1})$ indicate the PLRE smoothed conditional probability as computed by Eq. 6 and Algorithms 1 and 2. Then, the marginal constraint in Eq. 7 holds.

    4.3 More general algorithm

In general, the principles outlined in the previous sections hold for higher order n-grams. Assume that the discounts are computed according to Algorithm 1 with parameter $d^*$ and $Z^{(\rho_j,\kappa_j)}_{D_j}$ is computed according to Algorithm 2. Note that, as shown in Algorithm 2, for higher order n-grams, the $Z^{(\rho_j,\kappa_j)}_{D_j}$ are created by taking low rank approximations of slices of the (powered) count tensors (see Lemma 2 for intuition). Eq. 3 can now be embellished:

$$P_{plre}(w_i \mid w_{i-n+1}^{i-1}) = P^{alt}_{D_0}(w_i \mid w_{i-n+1}^{i-1}) + \gamma_0(w_{i-n+1}^{i-1}) \Big( Z^{(\rho_1,\kappa_1)}_{D_1}(w_i \mid w_{i-n+1}^{i-1}) + \cdots + \gamma_{\eta-1}(w_{i-n+1}^{i-1}) \big( Z^{(\rho_\eta,\kappa_\eta)}_{D_\eta}(w_i \mid w_{i-n+1}^{i-1}) + \gamma_\eta(w_{i-n+1}^{i-1}) \, P_{plre}(w_i \mid w_{i-n+2}^{i-1}) \big) \cdots \Big) \quad (12)$$

Lemma 4 also applies in this case and is given in Theorem 1 in the supplementary material.

    4.4 Links with KN Smoothing

In this section, we explicitly show the relationship between PLRE and KN smoothing. Rewriting Eq. 12 in the following form:

$$P_{plre}(w_i \mid w_{i-n+1}^{i-1}) = P^{terms}_{plre}(w_i \mid w_{i-n+1}^{i-1}) + \gamma_{0:\eta}(w_{i-n+1}^{i-1}) \, P_{plre}(w_i \mid w_{i-n+2}^{i-1}) \quad (13)$$

where $P^{terms}_{plre}(w_i \mid w_{i-n+1}^{i-1})$ contains the terms in Eq. 12 except the last, and $\gamma_{0:\eta}(w_{i-n+1}^{i-1}) = \prod_{h=0}^{\eta} \gamma_h(w_{i-n+1}^{i-1})$, we can leverage the form of the discount and the fact that $\rho_{\eta+1} = 0$²:

$$\gamma_{0:\eta}(w_{i-n+1}^{i-1}) = \frac{(d^*)^{\eta+1} N_{+}(w_{i-n+1}^{i-1})}{c(w_{i-n+1}^{i-1})}$$

²For the derivation, see the proof of Lemma 4 in the supplementary material.

With this form of $\gamma(\cdot)$, Eq. 13 is remarkably similar to KN smoothing (Eq. 2) if KN's discount parameter $D$ is chosen to equal $(d^*)^{\eta+1}$. The difference is that $P^{alt}(\cdot)$ has been replaced with the alternate estimate $P^{terms}_{plre}(w_i \mid w_{i-n+1}^{i-1})$, which has been enriched via the low rank structure. Since these alternate estimates were constructed via our ensemble strategy, they contain both very fine-grained dependencies (the original n-grams) as well as coarser dependencies (the lower rank n-grams), and this is thus fundamentally different than simply taking a single matrix/tensor decomposition of the trigram/bigram matrices.

Moreover, it provides a natural way of setting $d^*$ based on the Good-Turing (GT) estimates employed by KN smoothing. In particular, we can set $d^*$ to be the $(\eta+1)$-th root of the KN discount $D$ that can be estimated via the GT estimates.
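
As a concrete (hedged) illustration of that last point: if one uses the common count-of-counts estimate $D = n_1/(n_1 + 2n_2)$ for the KN discount (Ney et al., 1994), where $n_1$ and $n_2$ are the numbers of n-gram types occurring exactly once and twice, then $d^*$ is just its $(\eta+1)$-th root. The choice of this particular formula is our assumption; the paper only states that GT estimates are used.

```python
from collections import Counter

def kn_discount_from_counts(ngram_counts):
    """Count-of-counts (Good-Turing style) discount D = n1 / (n1 + 2 * n2)."""
    freq_of_freq = Counter(ngram_counts.values())
    n1, n2 = freq_of_freq[1], freq_of_freq[2]
    return n1 / (n1 + 2 * n2)

def d_star_from_kn_discount(D, eta):
    """PLRE choice from Section 4.4: d* is the (eta+1)-th root of the KN discount D."""
    return D ** (1.0 / (eta + 1))

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
D = kn_discount_from_counts(bigram_counts)           # 6 singletons, 1 doubleton -> 0.75
print(D, d_star_from_kn_discount(D, eta=1))          # d* = sqrt(0.75) for one low rank matrix
```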

4.5 Computational Considerations

PLRE scales well even as the order n increases. To compute a low rank bigram, one low rank approximation of a $V \times V$ matrix is required. For the low rank trigram, we need to compute a low rank approximation of each slice $C^{n,(\cdot\rho)}_D(:, \tilde{w}_{i-1}, :)$ $\forall \tilde{w}_{i-1}$. While this may seem daunting at first, in practice the size of each slice (number of non-zero rows/columns) is usually much, much smaller than $V$, keeping the computation tractable.

Similarly, PLRE also evaluates conditional probabilities at evaluation time efficiently. As shown in Algorithm 2, the normalizer can be precomputed on the sparse powered matrix/tensor. As a result our test complexity is $O(\sum_{i=1}^{\eta_{total}} \kappa_i)$ where $\eta_{total}$ is the total number of matrices/tensors in the ensemble. While this is larger than Kneser-Ney's practically constant complexity of $O(n)$, it is much faster than other recent methods for language modeling such as neural networks and conditional exponential family models where exact computation of the normalizing constant costs $O(V)$.
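
Since each $Z$ is stored implicitly as two factors $LR$ and the denominators are precomputed, a query touches only one row of $L$ and one column of $R$ per ensemble member. A schematic sketch of a single such lookup (ours; the integer word/history ids and variable names are purely illustrative):

```python
import numpy as np

def query_low_rank_term(L, R, denom, word_id, hist_id):
    """One ensemble member's contribution: Z(w | h) = (L[w, :] @ R[:, h]) / denom[h].
    Cost is O(kappa): a length-kappa inner product plus one normalizer lookup."""
    return float(L[word_id, :] @ R[:, hist_id]) / denom[hist_id]

rng = np.random.default_rng(0)
L, R = rng.random((4, 2)), rng.random((2, 4))       # toy factors: V = 4, kappa = 2
denom = (L @ R).sum(axis=0)                          # per-history normalizers, precomputed offline
print(query_low_rank_term(L, R, denom, word_id=1, hist_id=2))
```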

    5 Experiments

To evaluate PLRE, we compared its performance on English and Russian corpora with several variants of KN smoothing, class-based models, and the log-bilinear neural language model (Mnih and Hinton, 2007). We evaluated with perplexity in most of our experiments, but also provide results evaluated with BLEU (Papineni et al., 2002) on a downstream machine translation (MT) task. We have made the code for our approach publicly available³.

To build the hard class-based LMs, we utilized mkcls⁴, a tool to train word classes that uses the maximum likelihood criterion (Och, 1995) for classing. We subsequently trained trigram class language models on these classes (corresponding to 2nd-order HMMs) using SRILM (Stolcke, 2002), with KN-smoothing for the class transition probabilities. SRILM was also used for the baseline KN-smoothed models.

For our MT evaluation, we built a hierarchical phrase translation (Chiang, 2007) system using cdec (Dyer et al., 2010). The KN-smoothed models in the MT experiments were compiled using KenLM (Heafield, 2011).

5.1 Datasets

For the perplexity experiments, we evaluated our proposed approach on 4 datasets, 2 in English and 2 in Russian. In all cases, the singletons were replaced with "<unk>" tokens in the training corpus, and any word not in the vocabulary was replaced with this token during evaluation. There is a general dearth of evaluation on large-scale corpora in morphologically rich languages such as Russian, and thus we have made the processed Large-Russian corpus available for comparison³.

• Small-English: APNews corpus (Bengio et al., 2003): Train - 14 million words, Dev - 963,000, Test - 963,000. Vocabulary - 18,000 types.
• Small-Russian: Subset of Russian news commentary data from the 2013 WMT translation task⁵: Train - 3.5 million words, Dev - 400,000, Test - 400,000. Vocabulary - 77,000 types.
• Large-English: English Gigaword: Training - 837 million words, Dev - 8.7 million, Test - 8.7 million. Vocabulary - 836,980 types.
• Large-Russian: Monolingual data from the WMT 2013 task: Training - 521 million words, Validation - 50,000, Test - 50,000. Vocabulary - 1.3 million types.

³http://www.cs.cmu.edu/~apparikh/plre.html
⁴http://code.google.com/p/giza-pp/
⁵http://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz

For the MT evaluation, we used the parallel data from the WMT 2013 shared task, excluding the Common Crawl corpus data. The newstest2012 and newstest2013 evaluation sets were used as the development and test sets respectively.

5.2 Small Corpora

For the class-based baseline LMs, the number of classes was selected from {32, 64, 128, 256, 512, 1024} (Small-English) and {512, 1024} (Small-Russian). We could not go higher due to the computationally laborious process of hard clustering. For Kneser-Ney, we explore four different variants: back-off (BO-KN), interpolated (int-KN), modified back-off (BO-MKN), and modified interpolated (int-MKN). Good-Turing estimates were used for discounts. All models trained on the small corpora are of order 3 (trigrams).

For PLRE, we used one low rank bigram and one low rank trigram in addition to the MLE n-gram estimates. The powers of the intermediate matrices/tensors were fixed to be 0.5 and the discounts were set to be square roots of the Good-Turing estimates (as explained in §4.4). The ranks were tuned on the development set. For Small-English, the ranges were {1e-3, 5e-3} (as a fraction of the vocabulary size) for both the low rank bigram and low rank trigram models. For Small-Russian the ranges were {5e-4, 1e-3} for both the low rank bigram and the low rank trigram models.

The results are shown in Table 1. The best class-based LM is reported, but is not competitive with the KN baselines. PLRE outperforms all of the baselines comfortably. Moreover, PLRE's performance over the baselines is highlighted in Russian. With larger vocabulary sizes, the low rank approach is more effective as it can capture linguistic similarities between rare and common words.

Next we discuss how the maximum n-gram order affects performance. Figure 1 shows the relative percentage improvement of our approach over int-MKN as the order is increased from 2 to 4 for both methods. The Small-English dataset has a rather small vocabulary compared to the number of tokens, leading to lower data sparsity in the bigram. Thus the PLRE improvement is small for order = 2, but more substantial for order = 3. On the other hand, for the Small-Russian dataset, the vocabulary size is much larger and consequently the bigram counts are sparser.

Dataset              class-1024(3)  BO-KN(3)  int-KN(3)  BO-MKN(3)  int-MKN(3)  PLRE(3)
Small-English Dev    115.64         99.20     99.73      99.95      95.63       91.18
Small-English Test   119.70         103.86    104.56     104.55     100.07      95.15
Small-Russian Dev    286.38         281.29    265.71     287.19     263.25      241.66
Small-Russian Test   284.09         277.74    262.02     283.70     260.19      238.96

Table 1: Perplexity results on small corpora for all methods.

[Figure 1: Relative percentage improvement of PLRE over int-MKN as the maximum n-gram order for both methods is increased. Panels: Small-Russian and Small-English.]

This leads to similar improvements for all orders (which are larger than that for Small-English).

On both these datasets, we also experimented with tuning the discounts for int-MKN to see if the baseline could be improved with more careful choices of discounts. However, this achieved only marginal gains (reducing the perplexity to 98.94 on the Small-English test set and 259.0 on the Small-Russian test set).

Comparison to LBL (Mnih and Hinton, 2007): Mnih and Hinton (2007) evaluate on the Small-English dataset (but remove end markers and concatenate the sentences). They obtain perplexities of 117.0 and 107.8 using contexts of size 5 and 10 respectively. With this preprocessing, a 4-gram (context 3) PLRE achieves 108.4 perplexity.

5.3 Large Corpora

Results on the larger corpora for the top 2 performing methods, "PLRE" and "int-MKN", are presented in Table 2. Due to the larger training size, we use 4-gram models in these experiments. However, including the low rank 4-gram tensor provided little gain and therefore the 4-gram PLRE only has additional low rank bigram and low rank trigram matrices/tensors. As above, ranks were tuned on the development set. For Large-English, the ranges were {1e-4, 5e-4, 1e-3} (as a fraction of the vocabulary size) for both the low rank bigram and low rank trigram models.

Dataset              int-MKN(4)      PLRE(4)
Large-English Dev    73.21           71.21
Large-English Test   77.90 ± 0.203   75.66 ± 0.189
Large-Russian Dev    326.9           297.11
Large-Russian Test   289.63 ± 6.82   264.59 ± 5.839

Table 2: Mean perplexity results on large corpora, with standard deviation.

Dataset          PLRE Training Time
Small-English    3.96 min (order 3) / 8.3 min (order 4)
Small-Russian    4.0 min (order 3) / 4.75 min (order 4)
Large-English    3.2 hrs (order 4)
Large-Russian    8.3 hrs (order 4)

Table 3: PLRE training times for a fixed parameter setting⁶. 8 Intel Xeon CPUs were used.

Method          BLEU
int-MKN(4)      17.63 ± 0.11
PLRE(4)         17.79 ± 0.07
Smallest Diff   PLRE +0.05
Largest Diff    PLRE +0.29

Table 4: Results on English-Russian translation task (mean ± stdev). See text for details.

For Large-Russian the ranges were {1e-5, 5e-5, 1e-4} for both the low rank bigram and the low rank trigram models. For statistical validity, 10 test sets of size equal to the original test set were generated by randomly sampling sentences with replacement from the original test set. Our method outperforms "int-MKN" with gains similar to those on the smaller datasets. As shown in Table 3, our method obtains fast training times even for large datasets.

    6 Machine Translation Task

Table 4 presents results for the MT task, translating from English to Russian⁷. We used MIRA (Chiang et al., 2008) to learn the feature weights. To control for the randomness in MIRA, we avoid retuning when switching LMs - the set of feature weights obtained using int-MKN is the same, only the language model changes.

⁶As described earlier, only the ranks need to be tuned, so only 2-3 low rank bigrams and 2-3 low rank trigrams need to be computed (and combined depending on the setting).

⁷The best score at WMT 2013 was 19.9 (Bojar et al., 2013).

The procedure is repeated 10 times to control for optimizer instability (Clark et al., 2011). Unlike other recent approaches where an additional feature weight is tuned for the proposed model and used in conjunction with KN smoothing (Vaswani et al., 2013), our aim is to show the improvements that PLRE provides as a substitute for KN. On average, PLRE outperforms the KN baseline by 0.16 BLEU, and this improvement is consistent in that PLRE never gets a worse BLEU score.

7 Related Work

Recent attempts to revisit the language modeling problem have largely come from two directions: Bayesian nonparametrics and neural networks. Teh (2006) and Goldwater et al. (2006) discovered the connection between interpolated Kneser-Ney and the hierarchical Pitman-Yor process. These have led to generalizations that account for domain effects (Wood and Teh, 2009) and unbounded contexts (Wood et al., 2009).

The idea of using neural networks for language modeling is not new (Miikkulainen and Dyer, 1991), but recent efforts (Mnih and Hinton, 2007; Mikolov et al., 2010) have achieved impressive performance. These methods can be quite expensive to train and query (especially as the vocabulary size increases). Techniques such as noise contrastive estimation (Gutmann and Hyvärinen, 2012; Mnih and Teh, 2012; Vaswani et al., 2013), subsampling (Xu et al., 2011), or careful engineering approaches for maximum entropy LMs (which can also be applied to neural networks) (Wu and Khudanpur, 2000) have improved training of these models, but querying the probability of the next word given its context still requires explicitly normalizing over the vocabulary, which is expensive for big corpora or in languages with a large number of word types. Mnih and Teh (2012) and Vaswani et al. (2013) propose setting the normalization constant to 1, but this is approximate and thus can only be used for downstream evaluation, not for perplexity computation. An alternate technique is to use word-classing (Goodman, 2001; Mikolov et al., 2011), which can reduce the cost of exact normalization to $O(\sqrt{V})$. In contrast, our approach is much more scalable, since it is trivially parallelized in training and does not require explicit normalization during evaluation.

There are a few low rank approaches (Saul and Pereira, 1997; Bellegarda, 2000; Hutchinson et al., 2011), but they are only effective in restricted settings (e.g. small training sets, or corpora divided into documents) and do not generally perform comparably to state-of-the-art models. Roark et al. (2013) also use the idea of marginal constraints for re-estimating back-off parameters for heavily-pruned language models, whereas we use this concept to estimate n-gram specific discounts.

8 Conclusion

We presented power low rank ensembles, a technique that generalizes existing n-gram smoothing techniques to non-integer n. By using ensembles of sparse as well as low rank matrices and tensors, our method captures both the fine-grained and coarse structures in word sequences. Our discounting strategy preserves the marginal constraint and thus generalizes Kneser-Ney, and under slight changes can also extend other smoothing methods such as deleted-interpolation/Jelinek-Mercer smoothing. Experimentally, PLRE convincingly outperforms Kneser-Ney smoothing as well as class-based baselines.

    Acknowledgements

This work was supported by NSF IIS1218282, NSF IIS1218749, NSF IIS1111142, NIH R01GM093156, the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533, the NSF Graduate Research Fellowship Program under Grant No. 0946825 (NSF Fellowship to APP), and a grant from Ebay Inc. (to AS).

References

Jerome R. Bellegarda. 2000. Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March.
Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.
Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.
Emmanuel J. Candès and Benjamin Recht. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772.
Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393.
Stanley F. Chen and Ronald Rosenfeld. 2000. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50.
Stanley F. Chen. 2009. Shrinking exponential language models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 468–476, Stroudsburg, PA, USA. Association for Computational Linguistics.
David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 224–233. Association for Computational Linguistics.
David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, June.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181.
Chris Dyer, Jonathan Weese, Hendra Setiawan, Adam Lopez, Ferhan Ture, Vladimir Eidelman, Juri Ganitkevitch, Phil Blunsom, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demonstrations, pages 7–12. Association for Computational Linguistics.
Sharon Goldwater, Thomas Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18.
Joshua Goodman. 2001. Classes for fast maximum entropy training. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), volume 1, pages 561–564. IEEE.
Michael Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom, July.
Ngoc-Diep Ho and Paul Van Dooren. 2008. Non-negative matrix factorization with fixed row and column sums. Linear Algebra and its Applications, 429(5):1020–1025.
Brian Hutchinson, Mari Ostendorf, and Maryam Fazel. 2011. Low rank language models for small training sets. IEEE Signal Processing Letters, 18(9):489–492.
Frederick Jelinek and Robert Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1, pages 181–184. IEEE.
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37.
Daniel D. Lee and H. Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562.
Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. 2011. Divide-and-conquer matrix factorization. arXiv preprint arXiv:1107.0789.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Risto Miikkulainen and Michael G. Dyer. 1991. Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15:343–399.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), pages 1045–1048. International Speech Communication Association.
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. IEEE.
Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648. ACM.
Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the International Conference on Machine Learning.
Anil Kumar Nelakanti, Cedric Archambeau, Julien Mairal, Francis Bach, and Guillaume Bouchard. 2013. Structured penalties for log-linear language models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 233–243, Seattle, Washington, USA, October. Association for Computational Linguistics.
Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8:1–38.
Franz Josef Och. 1995. Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung. Studienarbeit, University of Erlangen.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Lawrence Rabiner and Biing-Hwang Juang. 1993. Fundamentals of Speech Recognition. Prentice-Hall.
Brian Roark, Cyril Allauzen, and Michael Riley. 2013. Smoothed marginal distribution constraints for language modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 43–52.
Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880–887. ACM.
Lawrence Saul and Fernando Pereira. 1997. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89. Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.
Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4.
Yee Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.
Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, Washington, USA, October. Association for Computational Linguistics.
Frank Wood and Yee Whye Teh. 2009. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, pages 607–614.
Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. 2009. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1129–1136. ACM.
Jun Wu and Sanjeev Khudanpur. 2000. Efficient training methods for maximum entropy language modeling. In Proceedings of Interspeech, pages 114–118.
Puyang Xu, Asela Gunawardana, and Sanjeev Khudanpur. 2011. Efficient subsampling for training complex language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1128–1136, Stroudsburg, PA, USA. Association for Computational Linguistics.
George Zipf. 1949. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.