
arXiv:1104.2203v1 [stat.ME] 12 Apr 2011

Statistical Science
2010, Vol. 25, No. 4, 492–505
DOI: 10.1214/08-STS264
© Institute of Mathematical Statistics, 2010

The MM Alternative to EM

Tong Tong Wu and Kenneth Lange

Abstract. The EM algorithm is a special case of a more general algorithm called the MM algorithm. Specific MM algorithms often have nothing to do with missing data. The first M step of an MM algorithm creates a surrogate function that is optimized in the second M step. In minimization, MM stands for majorize–minimize; in maximization, it stands for minorize–maximize. This two-step process always drives the objective function in the right direction. Construction of MM algorithms relies on recognizing and manipulating inequalities rather than calculating conditional expectations. This survey walks the reader through the construction of several specific MM algorithms. The potential of the MM algorithm in solving high-dimensional optimization and estimation problems is its most attractive feature. Our applications to random graph models, discriminant analysis and image restoration showcase this ability.

Key words and phrases: Iterative majorization, maximum likelihood, inequalities, penalization.

    1. INTRODUCTION

This survey paper tells a tale of two algorithms born in the same year. We celebrate the christening of the EM algorithm by Dempster, Laird and Rubin (1977) for good reasons. The EM algorithm is one of the workhorses of computational statistics with literally thousands of applications. Its value was almost immediately recognized by the international statistics community. The more general MM algorithm languished in obscurity for years. Although in 1970 the numerical analysts Ortega and Rheinboldt (1970) allude to the MM principle in the context of line search methods, the first statistical application occurs in two papers (de Leeuw, 1977; de Leeuw and Heiser, 1977) of de Leeuw and Heiser in 1977 on multidimensional scaling. One can argue that the unfortunate neglect of the de Leeuw and Heiser papers has retarded the growth of computational statistics. The purpose of the present paper is to draw attention to the MM algorithm and highlight some of its interesting applications.

Tong Tong Wu is Assistant Professor, Department of Epidemiology and Biostatistics, University of Maryland, College Park, Maryland 20742, USA. Kenneth Lange is Professor, Departments of Biomathematics, Human Genetics and Statistics, University of California, Los Angeles, California 90095-1766, USA. E-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 4, 492–505. This reprint differs from the original in pagination and typographic detail.

Neither the EM nor the MM algorithm is a concrete algorithm. They are both principles for creating algorithms. The MM principle is based on the notion of (tangent) majorization. A function $g(\theta \mid \theta^n)$ is said to majorize a function $f(\theta)$ provided

$$f(\theta^n) = g(\theta^n \mid \theta^n), \qquad (1)$$
$$f(\theta) \le g(\theta \mid \theta^n), \qquad \theta \ne \theta^n.$$

In other words, the surface $g(\theta \mid \theta^n)$ lies above the surface $f(\theta)$ and is tangent to it at the point $\theta = \theta^n$. Here $\theta^n$ represents the current iterate in a search of the surface $f(\theta)$. The function $g(\theta \mid \theta^n)$ minorizes $f(\theta)$ if $-g(\theta \mid \theta^n)$ majorizes $-f(\theta)$. Readers should take heed that the term majorization is used in a different sense in the theory of convex functions (Marshall and Olkin, 1979).

In the minimization version of the MM algorithm, we minimize the surrogate majorizing function

$g(\theta \mid \theta^n)$ rather than the actual function $f(\theta)$. If $\theta^{n+1}$ denotes the minimum of the surrogate $g(\theta \mid \theta^n)$, then one can show that the MM procedure forces $f(\theta)$ downhill. Indeed, the relations

$$f(\theta^{n+1}) \le g(\theta^{n+1} \mid \theta^n) \le g(\theta^n \mid \theta^n) = f(\theta^n) \qquad (2)$$

follow directly from the definition of $\theta^{n+1}$ and the majorization conditions (1). The descent property (2) lends the MM algorithm remarkable numerical stability. Strictly speaking, it depends only on decreasing the surrogate function $g(\theta \mid \theta^n)$, not on minimizing it. This fact has practical consequences when the minimum of $g(\theta \mid \theta^n)$ cannot be found exactly. In the maximization version of the MM algorithm, we maximize the surrogate minorizing function $g(\theta \mid \theta^n)$. Thus, the acronym MM does double duty, serving as an abbreviation of both pairs majorize–minimize and minorize–maximize. The earlier, less memorable name iterative majorization for the MM algorithm unfortunately suggests that the principle is limited to minimization.
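To make the two M steps concrete, here is a minimal Python sketch (ours, not taken from the paper) of a majorize–minimize loop for the toy problem of minimizing $f(\theta) = \sum_i |x_i - \theta|$. It uses the standard quadratic majorizer $|u| \le u^2/(2|u^n|) + |u^n|/2$, valid for $u^n \ne 0$; the small eps safeguard against division by zero is our own device.

```python
import numpy as np

def mm_median(x, theta0=0.0, n_iter=50, eps=1e-8):
    """Majorize-minimize loop for f(theta) = sum_i |x_i - theta|.

    The majorizer |u| <= u^2/(2|u_n|) + |u_n|/2 turns each step into a
    weighted least-squares problem whose minimizer is a weighted mean."""
    theta = theta0
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x - theta) + eps)   # weights built from the current iterate
        theta = np.sum(w * x) / np.sum(w)     # minimize the quadratic surrogate
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
print(mm_median(x))   # converges to the sample median, 3.0
```

Because each surrogate is minimized exactly, the descent property (2) guarantees that the objective never increases along the iterates.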

The EM algorithm is actually a special case of the MM algorithm. If $f(\theta)$ is the log-likelihood of the observed data, and $Q(\theta \mid \theta^n)$ is the function created in the E step, then the minorization

$$f(\theta) \ge Q(\theta \mid \theta^n) + f(\theta^n) - Q(\theta^n \mid \theta^n)$$

is the key to the EM algorithm. Maximizing $Q(\theta \mid \theta^n)$ with respect to $\theta$ drives $f(\theta)$ uphill. The proof of the EM minorization relies on the nonnegativity of the Kullback–Leibler divergence of two conditional probability densities. The divergence inequality in turn depends on Jensen's inequality and the concavity of the function $\ln x$ (Hunter and Lange, 2004; Lange, 2004).

In our opinion, the MM principle is easier to state and grasp than the EM principle. It requires neither a likelihood model nor a missing data framework. In some cases, existing EM algorithms can be derived more easily by isolating a key majorization or minorization. In other cases, it is quicker and more transparent to postulate the complete data and calculate the conditional expectations required by the E step of the EM algorithm. Many problems involving the multivariate normal distribution fall into this latter category. Finally, EM and MM algorithms constructed for the same problem can differ. Our second example illustrates this point. Which algorithm is preferred is then a matter of reliability in finding the global optimum, ease of implementation, speed of convergence and computational complexity.

This is not the first survey paper on the MM algorithm and probably will not be the last. The previous articles (Becker, Yang and Lange, 1997; de Leeuw, 1994; Heiser, 1995; Hunter and Lange, 2004; Lange, Hunter and Yang, 2000) state the general principle, sketch various methods of majorization and present a variety of new and old applications. Prior to these survey papers, the MM principle surfaced in robust regression (Huber, 1981), correspondence analysis (Heiser, 1987), the quadratic lower bound principle (Böhning and Lindsay, 1988), alternating least squares applications (Bijleveld and de Leeuw, 1991; Kiers, 2002; Kiers and Ten Berge, 1992; Takane, Young and de Leeuw, 1977), medical imaging (De Pierro, 1995; Lange and Fessler, 1994) and convex programming (Lange, 1994). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange, 2000), survival analysis (Hunter and Lange, 2002), nonnegative matrix factorization (Eldén, 2007; Lee and Seung, 1999, 2001; Pauca, Piper and Plemmons, 2006), paired and multiple comparisons (Hunter, 2004), variable selection (Hunter and Li, 2005), DNA sequence analysis (Sabatti and Lange, 2002) and discriminant analysis (Groenen, Nalbantov and Bioch, 2006; Lange and Wu, 2008).

The primary purpose of this paper is to present MM algorithms not featured in previous surveys. Some of these algorithms are novel, and some are minor variations on previous themes. Except for our first two examples in Sections 2 and 3, it is unclear whether any of the algorithms can be derived from a missing data perspective. This fact alone distinguishes them from standard EM fare. In digesting the examples, readers should notice how the MM algorithm interdigitates with other algorithms such as block relaxation and Newton's method. Classroom expositions of computational statistics leave the impression that different optimization algorithms act in isolation. In reality, some of the best algorithms are hybrids. The examples also stress penalized estimation and high-dimensional problems that challenge traditional algorithms such as scoring and Newton's method. Such problems are apt to dominate computational statistics and data mining for some time to come. The MM principle offers a foothold in the unforgiving terrain of large data sets and high-dimensional models.

Two theoretical skills are necessary for constructing new MM algorithms. One is a good knowledge of statistical models. Another is proficiency with inequalities. Most inequalities are manifestations of convexity. The single richest source of minorizations is the supporting hyperplane inequality

$$f(x) \ge f(y) + df(y)(x - y)$$

satisfied by a convex function $f(x)$ at each point $y$ of its domain. Here $df(y)$ is the row vector of partial derivatives of $f(x)$ at $y$.
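As a worked one-dimensional instance (our gloss), applying this inequality to the convex function $f(x) = -\ln x$ at the point $y$ gives the bound on the logarithm that is used explicitly in Section 2 and echoed in later sections:

$$-\ln x \ge -\ln y - \frac{1}{y}(x - y), \qquad \text{equivalently} \qquad \ln x \le \ln y + \frac{x - y}{y}.$$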

The quadratic lower bound principle of Böhning and Lindsay (1988) propels majorization when the objective function has bounded curvature. Let $d^2f(x)$ be the second differential (Hessian) of the objective function $f(x)$, and suppose $B$ is a positive definite matrix such that $B - d^2f(x)$ is positive semidefinite for all arguments $x$. Then we have the majorization

$$f(x) = f(y) + df(y)(x - y) + \tfrac{1}{2}(x - y)^t d^2f(z)(x - y)$$
$$\phantom{f(x)} \le f(y) + df(y)(x - y) + \tfrac{1}{2}(x - y)^t B (x - y),$$

where $z$ falls on the line segment between $x$ and $y$. Minimization of the quadratic surrogate is straightforward. In the unconstrained case, it involves inversion of the matrix $B$, but this can be done once, in contrast to the repeated matrix inversions of Newton's method. Other relevant majorizations and minorizations will be mentioned as needed. Readers wondering where to start in brushing up on inequalities are urged to consult the elementary exposition (Steele, 2004). The more advanced texts (Boyd and Vandenberghe, 2004; Lange, 2004) are also useful for statisticians.
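As an illustration of the quadratic lower bound principle (ours; the logistic model is not one of this paper's worked examples), the negative log-likelihood of logistic regression has Hessian $X^tWX$ with diagonal weights bounded by 1/4, so $B = \tfrac{1}{4}X^tX$ satisfies the curvature condition. The resulting MM algorithm factors $B$ only once:

```python
import numpy as np

def mm_logistic(X, y, n_iter=200):
    """Quadratic lower bound MM for logistic regression (illustrative sketch).

    f(beta) = sum_i [log(1 + exp(x_i' beta)) - y_i x_i' beta] has Hessian
    X' W X with 0 < w_ii <= 1/4, so B = X'X / 4 dominates the curvature.
    Each MM step minimizes the quadratic surrogate: beta - B^{-1} grad f(beta)."""
    n, p = X.shape
    B = X.T @ X / 4.0
    L = np.linalg.cholesky(B)                 # factor B once and reuse it
    beta = np.zeros(p)
    for _ in range(n_iter):
        probs = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (probs - y)              # gradient of f at the current beta
        beta -= np.linalg.solve(L.T, np.linalg.solve(L, grad))
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_beta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
print(mm_logistic(X, y))
```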

Finally, let us stress that neither EM nor MM is a panacea. Optimization is as much art as science. There is no universal algorithm of choice, and a good deal of experimentation is often required to choose among EM, MM, scoring, Newton's method, quasi-Newton methods, conjugate gradient, and other more exotic algorithms. The simplicity of MM algorithms usually argues in their favor. Balanced against this advantage is the sad fact that many MM algorithms exhibit excruciatingly slow rates of convergence. Section 8 derives the theoretical criterion governing the rate of convergence of an MM algorithm. Fortunately, MM algorithms are readily amenable to acceleration. For the sake of brevity, we will omit a detailed development of acceleration and other important topics. Our discussion in Section 9 will take these up and point out pertinent references.

2. ESTIMATION WITH THE MULTIVARIATE t

The multivariate t-distribution has density

$$f(x) = \frac{\Gamma\!\left(\frac{\nu+p}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{p/2}\,|\Omega|^{1/2}} \left[1 + \frac{1}{\nu}(x - \mu)^t\Omega^{-1}(x - \mu)\right]^{-(\nu+p)/2}$$

for all $x \in R^p$. Here $\mu$ is the mean vector, $\Omega$ is the positive definite scale matrix and $\nu > 0$ is the degrees of freedom. Let $x_1, \ldots, x_m$ be a random sample from $f(x)$. To estimate $\mu$ and $\Omega$ for $\nu$ fixed, the well-known EM algorithm (Lange, Little and Taylor, 1989; Little and Rubin, 2002) iterates according to

$$\mu^{n+1} = \frac{1}{s^n}\sum_{i=1}^m w_i^n x_i, \qquad (3)$$

$$\Omega^{n+1} = \frac{1}{m}\sum_{i=1}^m w_i^n (x_i - \mu^{n+1})(x_i - \mu^{n+1})^t, \qquad (4)$$

where $s^n = \sum_{i=1}^m w_i^n$ is the sum of the case weights

$$w_i^n = \frac{\nu + p}{\nu + d_i^n}, \qquad d_i^n = (x_i - \mu^n)^t(\Omega^n)^{-1}(x_i - \mu^n).$$

The derivation of the EM algorithm hinges on the representation of the t-density as a hidden mixture of multivariate normal densities.
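A brief Python sketch of the iteration (3)–(4) with $\nu$ held fixed; the simulated data and variable names below are our own choices.

```python
import numpy as np

def t_location_scale(X, nu, n_iter=100):
    """Updates (3)-(4) for the multivariate t location mu and scale Omega,
    with the degrees of freedom nu held fixed."""
    m, p = X.shape
    mu = X.mean(axis=0)
    Omega = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        R = X - mu
        d = np.einsum("ij,jk,ik->i", R, np.linalg.inv(Omega), R)   # Mahalanobis distances d_i
        w = (nu + p) / (nu + d)                                     # case weights w_i
        mu = (w[:, None] * X).sum(axis=0) / w.sum()                 # update (3)
        R = X - mu
        Omega = (w[:, None, None] * np.einsum("ij,ik->ijk", R, R)).sum(axis=0) / m   # update (4)
    return mu, Omega

rng = np.random.default_rng(1)
nu, p, m = 5.0, 3, 500
Z = rng.normal(size=(m, p))
g = rng.chisquare(nu, size=m) / nu
X = 1.0 + Z / np.sqrt(g)[:, None]        # t_nu sample with mu = (1,...,1) and Omega = I
mu_hat, Omega_hat = t_location_scale(X, nu)
print(mu_hat)
```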

Derivation of the same algorithm from the MM perspective ignores the missing data and exploits the concavity of the function $\ln x$. Thus, the supporting hyperplane inequality

$$\ln x \le \ln y + \frac{x - y}{y}$$

implies the minorization

$$-\tfrac{1}{2}\ln|\Omega| - \tfrac{\nu+p}{2}\ln[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)]$$
$$\ge -\tfrac{1}{2}\ln|\Omega| - \tfrac{\nu+p}{2}\left\{\ln\frac{\nu+p}{w_i^n} + \frac{\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu) - (\nu+p)/w_i^n}{(\nu+p)/w_i^n}\right\}$$
$$= -\tfrac{1}{2}\ln|\Omega| - \tfrac{w_i^n}{2}[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)] + c_i^n$$

for case $i$, where $c_i^n$ is a constant that depends on neither $\mu$ nor $\Omega$. Summing over the different cases produces the overall surrogate. Derivation of the updates (3) and (4) reduces to standard manipulations with the multivariate normal (Lange, 2004).

Kent, Tyler and Vardi (1994) suggest an alternative algorithm that replaces the EM update (4) for $\Omega$ by

$$\Omega^{n+1} = \frac{1}{s^n}\sum_{i=1}^m w_i^n(x_i - \mu^{n+1})(x_i - \mu^{n+1})^t. \qquad (5)$$

Meng and van Dyk (1997) justify this modest amendment by expanding the parameter space to include a working parameter that is tweaked to produce faster convergence. It is interesting that a trivial variation of our minorization produces the Kent, Tyler and Vardi (1994) algorithm. We simply combine the two log terms and minorize via

$$-\tfrac{1}{2}\ln|\Omega| - \tfrac{\nu+p}{2}\ln[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)]$$
$$= -\tfrac{\nu+p}{2}\ln\{|\Omega|^a[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)]\}$$
$$\ge -\frac{w_i^n}{2|\Omega^n|^a}\,|\Omega|^a[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)] + c_i^n,$$

with working parameter $a = 1/(\nu + p)$.

For readers wanting the full story, we now indicate briefly how the second step of the MM algorithm is derived. This revolves around maximizing the surrogate function

$$-\sum_{i=1}^m w_i^n\,|\Omega|^a[\nu + (x_i - \mu)^t\Omega^{-1}(x_i - \mu)]$$

with respect to $\mu$ and $\Omega$. Regardless of the value of $\Omega$, one should choose $\mu$ as the weighted mean (3). If we let $R$ be the square root of $\Omega^{n+1}$ as defined by (5) and substitute $\mu^{n+1}$ in the surrogate, then the refined surrogate function can be expressed as

$$-s^n|\Omega|^a[\nu + \mathrm{tr}(\Omega^{-1}R^2)] = -s^n|R^{-1}\Omega R^{-1}|^a[\nu + \mathrm{tr}(R\Omega^{-1}R)]\,|R|^{2a}.$$

To show that $\Omega = R^2$ maximizes the surrogate, let $\lambda_1, \ldots, \lambda_p$ denote the eigenvalues of the positive definite matrix $R^{-1}\Omega R^{-1}$. This allows us to express the surrogate as a negative multiple of the function

$$h(\lambda) = \nu\prod_{j=1}^p \lambda_j^a + \prod_{j=1}^p \lambda_j^a\sum_{j=1}^p \lambda_j^{-1}.$$

The choice $\lambda = \mathbf{1}$ corresponds to $\Omega = R^2$ and yields the value $h(\mathbf{1}) = \nu + p$. The identity $\Omega = R^2$ can now be proved by showing that $\nu + p$ is a lower bound for $h(\lambda)$. Setting $\lambda_j = e^{\gamma_j}$, a simple rearrangement of the bounding inequality shows that it suffices to prove the alternative inequality

$$e^{-\frac{1}{\nu+p}\sum_{j=1}^p \gamma_j} \le \frac{\nu}{\nu+p}\,e^0 + \frac{1}{\nu+p}\sum_{j=1}^p e^{-\gamma_j},$$

which is a direct consequence of the convexity of $e^x$.

    3. GROUPED EXPONENTIAL DATA

The EM algorithm for estimating the intensity of grouped exponential data is well known (Dempster, Laird and Rubin, 1977; McLachlan and Krishnan, 1997; Meilijson, 1989). In this setting the complete data corresponds to a random sample $x_1, \ldots, x_m$ from an exponential density with intensity $\lambda$. The observed data conforms to a sequence of thresholds $t_1 < t_2 < \cdots < t_m$. It is convenient to append the threshold $t_0 = 0$ to this list and to let $c_i$ record the number of values that fall within the interval $(t_i, t_{i+1}]$. The exceptional count $c_m$ represents the number of right-censored values. One can derive a novel MM algorithm by close examination of the log-likelihood

$$L(\lambda) = c_0\ln(1 - e^{-\lambda t_1}) + \sum_{i=1}^{m-1} c_i\ln(e^{-\lambda t_i} - e^{-\lambda t_{i+1}}) - c_m\lambda t_m$$
$$\phantom{L(\lambda)} = -\sum_{i=0}^{m-1} c_i\lambda t_{i+1} - c_m\lambda t_m + \sum_{i=0}^{m-1} c_i\ln(e^{\lambda d_i} - 1),$$

where $d_i = t_{i+1} - t_i$.

The above partial linearization of the log-likelihood $L(\lambda)$ focuses our attention on the remaining nonlinear parts of $L(\lambda)$ determined by the function $f(\lambda) = \ln(e^{\lambda d} - 1)$. The derivatives

$$f'(\lambda) = \frac{e^{\lambda d}d}{e^{\lambda d} - 1}, \qquad f''(\lambda) = -\frac{e^{\lambda d}d^2}{(e^{\lambda d} - 1)^2}$$


indicate that $f(\lambda)$ is increasing and concave. It is impossible to minorize $f(\lambda)$ by a linear function, so we turn to the quadratic lower bound principle. Hence, in the second-order Taylor expansion

$$f(\lambda) = f(\lambda^n) + f'(\lambda^n)(\lambda - \lambda^n) + \tfrac{1}{2}f''(\tilde\lambda)(\lambda - \lambda^n)^2,$$

with $\tilde\lambda$ between $\lambda$ and $\lambda^n$, we seek to bound $f''(\tilde\lambda)$ from below. One can easily check that $f''(\lambda)$ is increasing on $(0, \infty)$ and tends to $-\infty$ as $\lambda$ approaches 0. To avoid this troublesome limit, we restrict $\lambda$ to the interval $(\tfrac{1}{2}\lambda^n, \infty)$ and substitute $f''(\tfrac{1}{2}\lambda^n)$ for $f''(\tilde\lambda)$. Minorizing the nonlinear part of $L(\lambda)$ term by term now gives a quadratic minorizer $q(\lambda)$ of $L(\lambda)$. Because the coefficient of $\lambda^2$ in $q(\lambda)$ is negative, the restricted maximum $\lambda^{n+1}$ of $q(\lambda)$ occurs at the boundary $\tfrac{1}{2}\lambda^n$ whenever the unrestricted maximum occurs to the left of $\tfrac{1}{2}\lambda^n$. In symbols, the MM update reduces to

$$\lambda^{n+1} = \max\left\{\tfrac{1}{2}\lambda^n,\; \lambda^n + \frac{\sum_{i=0}^{m-1} c_i(v_i^n - t_{i+1}) - c_m t_m}{\sum_{i=0}^{m-1} c_i w_i^n}\right\},$$

where

$$v_i^n = \frac{e^{\lambda^n d_i}d_i}{e^{\lambda^n d_i} - 1}, \qquad w_i^n = \frac{e^{\lambda^n d_i/2}d_i^2/4}{(e^{\lambda^n d_i/2} - 1)^2}.$$

Table 1 compares the MM algorithm and the traditional EM algorithm on the toy example of Meilijson (1989). Here we have m = 3 thresholds at 1, 3 and 10 and assign proportions 0.185, 0.266, 0.410 and 0.139 to the four ordinal groups. It is clear that the MM algorithm hits its lower bound on iterations 1 and 2. Although its local rate of convergence appears slightly better than that of the EM algorithm, the differences are minor. The purpose of this exercise is more to illustrate the quadratic lower bound principle in deriving MM algorithms.

Table 1
Comparison of MM and EM on grouped exponential data

        MM algorithm              EM algorithm
 n      λ^n       L(λ^n)          λ^n       L(λ^n)
 0      1.00000   -3.00991        1.00000   -3.00991
 1      0.50000   -1.75014        0.27082   -1.34637
 2      0.25000   -1.32698        0.21113   -1.30591
 3      0.18924   -1.30528        0.20102   -1.30443
 4      0.19762   -1.30438        0.19904   -1.30437
 5      0.19848   -1.30437        0.19864   -1.30437
 6      0.19853   -1.30437        0.19856   -1.30437
 7      0.19854   -1.30437        0.19854   -1.30437
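The Python sketch below (ours) implements this MM update on the Meilijson toy data. Because the numerator and denominator of the update both scale linearly in the counts, the group proportions can stand in for the counts $c_i$, and the iterates should reproduce the MM column of Table 1 up to rounding.

```python
import numpy as np

def grouped_exp_mm(c, t, lam0=1.0, n_iter=8):
    """MM algorithm for the intensity of grouped exponential data.

    c: counts (or proportions) for the intervals (t_i, t_{i+1}], with the last
       entry the right-censored count; t: thresholds t_1 < ... < t_m."""
    t = np.concatenate([[0.0], np.asarray(t, dtype=float)])   # prepend t_0 = 0
    d = np.diff(t)                                             # d_i = t_{i+1} - t_i
    c = np.asarray(c, dtype=float)
    c_obs, c_cens = c[:-1], c[-1]
    lam = lam0
    for n in range(n_iter):
        e = np.exp(lam * d)
        v = e * d / (e - 1.0)
        eh = np.exp(lam * d / 2.0)
        w = eh * d**2 / 4.0 / (eh - 1.0) ** 2
        num = np.sum(c_obs * (v - t[1:])) - c_cens * t[-1]
        lam = max(0.5 * lam, lam + num / np.sum(c_obs * w))
        # log-likelihood, printed to monitor the ascent property
        L = np.sum(c_obs * np.log(np.exp(-lam * t[:-1]) - np.exp(-lam * t[1:]))) - c_cens * lam * t[-1]
        print(n + 1, round(lam, 5), round(L, 5))
    return lam

grouped_exp_mm([0.185, 0.266, 0.410, 0.139], [1.0, 3.0, 10.0])
```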

    4. POWER SERIES DISTRIBUTIONS

A family of discrete density functions $p_k(\theta)$ defined on $\{0, 1, \ldots\}$ and indexed by a parameter $\theta > 0$ is said to be a power series family provided for all $k$

$$p_k(\theta) = \frac{c_k\theta^k}{q(\theta)}, \qquad (6)$$

where $c_k \ge 0$ and $q(\theta) = \sum_{k=0}^\infty c_k\theta^k$ is the appropriate normalizing constant (Rao, 1973). The binomial, negative binomial, Poisson and logarithmic families are examples. Zero-truncated versions of these families also qualify. Fisher scoring is the traditional approach to maximum likelihood estimation with a power series family. If $x_1, \ldots, x_m$ is a random sample from the discrete density (6), then the log-likelihood

$$L(\theta) = \sum_{i=1}^m x_i\ln\theta - m\ln q(\theta)$$

has score $s(\theta)$ and expected information $J(\theta)$

$$s(\theta) = \frac{1}{\theta}\sum_{i=1}^m x_i - m\frac{q'(\theta)}{q(\theta)}, \qquad J(\theta) = \frac{m\sigma^2(\theta)}{\theta^2},$$

where $\sigma^2(\theta)$ is the variance of a single realization.

Functional iteration provides an alternative to scoring. It is clear that the maximum likelihood estimate $\hat\theta$ is a root of the equation

$$\bar{x} = \theta\frac{q'(\theta)}{q(\theta)}, \qquad (7)$$

where $\bar{x}$ is the sample mean. This result suggests the iteration scheme

$$\theta^{n+1} = \frac{\bar{x}\,q(\theta^n)}{q'(\theta^n)} = M(\theta^n) \qquad (8)$$

and raises two obvious questions. First, is the algorithm (8) an MM algorithm? Second, is it likely to converge to $\hat\theta$ even in the absence of such a guarantee? Local convergence hinges on the derivative condition $|M'(\hat\theta)| < 1$. When this condition holds, the map $\theta^{n+1} = M(\theta^n)$ is locally contractive near the fixed point $\hat\theta$. It turns out that

$$M'(\hat\theta) = 1 - \frac{\sigma^2(\hat\theta)}{\mu(\hat\theta)},$$

where

$$\mu(\theta) = \theta\frac{q'(\theta)}{q(\theta)}$$

is the mean of a single realization $X$. Thus, convergence depends on the ratio of the variance to the mean. To prove these assertions it is helpful to differentiate $q(\theta)$. The first derivative delivers the mean and the second derivative the second factorial moment

$$E[X(X-1)] = \frac{\theta^2 q''(\theta)}{q(\theta)}.$$

If one substitutes these into the obvious expression for $M'(\theta)$ and invokes equality (7) at $\hat\theta$, then the moment form of $M'(\hat\theta)$ emerges.

To address the question of whether functional iteration is an MM algorithm, we make the assumption that $q(\theta)$ is log-concave. This condition holds for the binomial and Poisson distributions but not for the negative binomial and logarithmic distributions. The concavity of $\ln q(\theta)$ entails the minorization

$$L(\theta) \ge \sum_{i=1}^m x_i\ln\theta - m\ln q(\theta^n) - m[\ln q(\theta^n)]'(\theta - \theta^n)$$
$$\phantom{L(\theta)} = \sum_{i=1}^m x_i\ln\theta - m\ln q(\theta^n) - m\frac{q'(\theta^n)}{q(\theta^n)}(\theta - \theta^n).$$

Setting the derivative of this surrogate function equal to 0 leads to the MM update (8). One can demonstrate that log-concavity implies $\sigma^2(\theta) \le \mu(\theta)$. The local contraction condition $|M'(\hat\theta)| < 1$ is consistent with the looser criterion $\sigma^2(\theta) < 2\mu(\theta)$. Thus, there is room for a viable local algorithm that fails to have the ascent property.

The truncated Poisson density has normalizing function $q(\theta) = e^\theta - 1$. The second derivative test shows that $q(\theta)$ is log-concave. Table 2 records the well-behaved MM iterates (8) for the choices $\bar{x} = 2$ and $m = 10$. The geometric density counting failures until a success has normalizing function $q(\theta) = (1 - \theta)^{-1}$, which is log-convex rather than log-concave. The iteration function is now $M(\theta) = \bar{x}(1 - \theta)$. Since $M'(\theta) = -\bar{x}$, the algorithm diverges for $\bar{x} > 1$. Finally, the discrete logarithmic density has normalizing constant $q(\theta) = -\ln(1 - \theta)$, which is also log-convex rather than log-concave. The choices $\bar{x} = 2$ and $m = 10$ lead to the iterates in Table 3. Although the algorithm (8) converges for the logarithmic density, it cannot be an MM algorithm because the log-likelihood experiences a decline at its first iteration.

Table 2
Performance of the algorithm (8) for truncated Poisson data

 n    θ^n       L(θ^n)        n    θ^n       L(θ^n)
 0    1.00000   -5.41325      7    1.59161   -4.34467
 1    1.26424   -4.63379      8    1.59280   -4.34466
 2    1.43509   -4.40703      9    1.59329   -4.34466
 3    1.52381   -4.35635     10    1.59349   -4.34466
 4    1.56424   -4.34670     11    1.59357   -4.34466
 5    1.58151   -4.34501     12    1.59360   -4.34466
 6    1.58867   -4.34472     13    1.59362   -4.34466

Table 3
Performance of the algorithm (8) for logarithmic data

 n    θ^n       L(θ^n)        n    θ^n       L(θ^n)
 0    0.99000  -15.47280      9    0.71470   -8.98294
 1    0.09210  -24.32767     10    0.71565   -8.98293
 2    0.17545  -18.35307     11    0.71517   -8.98293
 3    0.31814  -13.30624     12    0.71542   -8.98293
 4    0.52221   -9.96349     13    0.71529   -8.98293
 5    0.70578   -8.98560     14    0.71535   -8.98293
 6    0.71991   -8.98355     15    0.71532   -8.98293
 7    0.71291   -8.98310     16    0.71534   -8.98293
 8    0.71655   -8.98297     17    0.71533   -8.98293
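A small Python sketch (ours) of the functional iteration (8). The first call reproduces the truncated Poisson iterates of Table 2; the second covers the logarithmic case of Table 3, where the initial decline in the log-likelihood is visible.

```python
import numpy as np

def power_series_iteration(q, qprime, xbar, theta0, m=10, n_iter=13):
    """Functional iteration (8): theta_{n+1} = xbar * q(theta_n) / q'(theta_n).

    Also prints L(theta) = m*xbar*ln(theta) - m*ln q(theta) so the presence or
    absence of the ascent property can be checked by eye."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = xbar * q(theta) / qprime(theta)
        L = m * xbar * np.log(theta) - m * np.log(q(theta))
        print(n, round(theta, 5), round(L, 5))
    return theta

# zero-truncated Poisson, q(theta) = e^theta - 1 (log-concave): Table 2
power_series_iteration(lambda t: np.exp(t) - 1.0, np.exp, xbar=2.0, theta0=1.0)

# logarithmic, q(theta) = -ln(1 - theta) (log-convex): Table 3
power_series_iteration(lambda t: -np.log(1.0 - t), lambda t: 1.0 / (1.0 - t),
                       xbar=2.0, theta0=0.99, n_iter=17)
```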

One of the morals of this example is that many natural algorithms only satisfy the descent or ascent property in special circumstances. This is not necessarily a disaster, but without such a guarantee, safeguards must usually be instituted to prevent iterates from going astray. Proof of the descent or ascent property almost always starts with majorization or minorization. Because so much of statistical inference revolves around log-likelihoods, log-convexity and log-concavity are possibly more important than ordinary convexity and concavity in constructing MM algorithms.

There are a variety of criteria that help in checking log-concavity. Besides the obvious second derivative test, one should keep in mind the closure properties of the collection of log-concave functions on a given domain (Bergstrom and Bagnoli, 2005; Boyd and Vandenberghe, 2004). For example, the collection is closed under the formation of products and positive powers. Any positive concave function is log-concave. If $f(x) > 0$ for all $x$, then $f(x)$ is log-concave. In some cases, integration preserves log-concavity. If $f(x)$ is log-concave, then $\int_a^x f(y)\,dy$ and $\int_x^b f(y)\,dy$ are log-concave. When $f(x, y)$ is jointly log-concave in $(x, y)$, $\int f(x, y)\,dy$ is log-concave in $x$. As a special case, the convolution of two log-concave functions is log-concave. One of the more useful recent tests for log-concavity pertains to power series (Anderson, Vamanamurthy and Vuorinen, 2007). Suppose $f(x) = \sum_{k=0}^\infty a_kx^k$ has radius of convergence $r$ around the origin. If the coefficients $a_k$ are positive and the ratio $(k+1)a_{k+1}/a_k$ is decreasing in $k$, then $f(x)$ is log-concave on $(-r, r)$. This result also holds for finite series $f(x) = \sum_{k=0}^m a_kx^k$. In minorization, log-convexity plays the linearizing role of log-concavity. The closure properties of the set of log-convex functions are equally impressive (Boyd and Vandenberghe, 2004).

5. A RANDOM GRAPH MODEL

Random graphs provide interesting models of connectivity in genetics and internet node ranking. Here we consider the random graph model of Blitzstein, Chatterjee and Diaconis (2008). Their model assigns a nonnegative propensity $p_i$ to each node $i$. An edge between nodes $i$ and $j$ then forms independently with probability $p_ip_j/(1 + p_ip_j)$. The most obvious statistical question in the model is how to estimate the $p_i$ from data. Once this is done, we can rank nodes by their estimated propensities.

If $E$ denotes the edge set of the graph, then the log-likelihood can be written as

$$L(p) = \sum_{\{i,j\}\in E}[\ln p_i + \ln p_j] - \sum_{\{i,j\}}\ln(1 + p_ip_j). \qquad (9)$$

Here $\{i, j\}$ denotes a generic unordered pair. The logarithms $\ln(1 + p_ip_j)$ are the bothersome terms in the log-likelihood. We will minorize each of these by exploiting the convexity of the function $-\ln(1 + x)$. Application of the supporting hyperplane inequality yields

$$-\ln(1 + p_ip_j) \ge -\ln(1 + p_i^np_j^n) - \frac{1}{1 + p_i^np_j^n}(p_ip_j - p_i^np_j^n)$$

and eliminates the logarithm. Note that equality holds when $p_i = p_i^n$ for all $i$. This minorization is not quite good enough to separate parameters, however. Separation can be achieved by invoking the second minorizing inequality

$$-p_ip_j \ge -\frac{1}{2}\left(\frac{p_j^n}{p_i^n}p_i^2 + \frac{p_i^n}{p_j^n}p_j^2\right).$$

Note again that equality holds when all $p_i = p_i^n$. These considerations imply that up to a constant $L(p)$ is minorized by the function

$$g(p \mid p^n) = \sum_{\{i,j\}\in E}[\ln p_i + \ln p_j] - \sum_{\{i,j\}}\frac{1}{1 + p_i^np_j^n}\cdot\frac{1}{2}\left(\frac{p_j^n}{p_i^n}p_i^2 + \frac{p_i^n}{p_j^n}p_j^2\right).$$


Table 4
Convergence of the MM random graph algorithm

  n    p_0^n     p_{m/2}^n   p_m^n     L(p^n)
  0    0.00100   0.48240     0.95613   -40572252.7109
  1    0.00000   0.48281     0.97251   -40565250.8333
  2    0.00000   0.48220     0.98274   -40562587.5350
  3    0.00000   0.48151     0.98950   -40561497.1411
  4    0.00000   0.48093     0.99408   -40561038.9534
  5    0.00000   0.48050     0.99720   -40560843.3998
 10    0.00000   0.47963     1.00299   -40560695.6515
 15    0.00000   0.47950     1.00387   -40560693.1245
 20    0.00000   0.47948     1.00400   -40560693.0770
 25    0.00000   0.47948     1.00403   -40560693.0761
 30    0.00000   0.47948     1.00403   -40560693.0761
 35    0.00000   0.47948     1.00403   -40560693.0764

The fact that $g(p \mid p^n)$ separates parameters allows us to compute $p_i^{n+1}$ by setting the derivative of $g(p \mid p^n)$ with respect to $p_i$ equal to 0. Thus, we must solve

$$0 = \sum_{\{i,j\}\in E}\frac{1}{p_i} - \sum_{j\ne i}\frac{1}{1 + p_i^np_j^n}\cdot\frac{p_j^n}{p_i^n}\,p_i.$$

If $d_i = \sum_{\{i,j\}\in E} 1$ denotes the degree of node $i$, then the positive square root

$$p_i^{n+1} = \left[\frac{p_i^n d_i}{\sum_{j\ne i} p_j^n/(1 + p_i^np_j^n)}\right]^{1/2} \qquad (10)$$

is the pertinent solution. Blitzstein, Chatterjee and Diaconis (2008) derive a different and possibly more effective algorithm by a contraction mapping argument.

The MM update (10) is not particularly intuitive, but it does have the virtue of algebraic simplicity. When $d_i = 0$, it also makes the sensible choice $p_i^{n+1} = 0$. As a check on our derivation, observe that a stationary point of the log-likelihood satisfies

$$0 = \frac{d_i}{p_i} - \sum_{j\ne i}\frac{p_j}{1 + p_ip_j},$$

which is just a rearranged version of the update (10) with iteration superscripts suppressed.
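A compact Python sketch (ours) of the update (10) on a simulated graph far smaller than the 10,000-node experiment described below; the uniform starting value of 0.5 for every propensity is an arbitrary choice.

```python
import numpy as np

def mm_random_graph(A, p0, n_iter=100):
    """MM update (10) for node propensities in the random graph model,
    where A is a symmetric 0/1 adjacency matrix with zero diagonal."""
    deg = A.sum(axis=1)                          # degrees d_i
    p = p0.copy()
    for _ in range(n_iter):
        denom = (p[None, :] / (1.0 + np.outer(p, p))).sum(axis=1) \
                - p / (1.0 + p * p)              # sum over j != i of p_j / (1 + p_i p_j)
        p = np.sqrt(p * deg / denom)             # update (10); yields 0 when d_i = 0
    return p

rng = np.random.default_rng(2)
m = 200
p_true = (np.arange(1, m + 1) - 0.5) / m
prob = np.outer(p_true, p_true) / (1.0 + np.outer(p_true, p_true))
A = (rng.uniform(size=(m, m)) < prob).astype(float)
A = np.triu(A, 1); A = A + A.T                   # symmetrize, no self-loops
p_hat = mm_random_graph(A, p0=np.full(m, 0.5))
print(np.abs(p_true - p_hat).mean())             # average absolute estimation error
```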

The MM algorithm just derived carries with it certain guarantees. It is certain to increase the log-likelihood at every iteration, and if its maximum value is attained at a unique point, then it will also converge to that point. It is straightforward to prove that the log-likelihood is concave under the reparameterization $p_i = e^{q_i}$. The requirement of two successive minorizations in our derivation gives us pause because if minorization is not tight, then convergence is slow. On the other hand, if the number of nodes is large, then competing algorithms such as Newton's method entail large matrix inversions and are very expensive.

As a test case for the MM algorithm, we generated a random graph on m = 10,000 nodes with a propensity $p_i$ for node $i$ of $(i - \tfrac{1}{2})/m$. To derive appropriate starting values for the propensities, we estimated a common background propensity $q$ by setting $q^2/(1 + q^2)$ equal to the ratio of observed edges to possible edges and solving for $q$. This background propensity was then used to estimate each $p_i$ by setting $p_iq/(1 + p_iq)$ equal to $d_i/m$ and solving for $p_i$. Table 4 displays the components $p_0^n$, $p_{m/2}^n$ and $p_m^n$ of the parameter vector $p^n$ at iteration $n$. The log-likelihood actually fails the ascent test in the last iteration because its rightmost digits are beyond machine precision. Despite this minor flaw, the algorithm performs impressively on this relatively large and decidedly nonsparse problem. As an indication of the quality of the final estimate $\hat p$, the maximum error $\max_i |(i - \tfrac{1}{2})/m - \hat p_i|$ was 0.0825 and the average absolute error $\frac{1}{m}\sum_i |(i - \tfrac{1}{2})/m - \hat p_i|$ was 0.0104.

6. DISCRIMINANT ANALYSIS

Discriminant analysis is another attractive application. In discriminant analysis with two categories, each case $i$ is characterized by a feature vector $z_i$ and a category membership indicator $y_i$ taking the values $-1$ or $1$. In the machine learning approach to discriminant analysis (Schölkopf and Smola, 2002; Vapnik, 1995), the hinge loss function $[1 - y_i(\alpha + z_i^t\beta)]_+$ plays a prominent role. Here $(u)_+$ is shorthand for the convex function $\max\{u, 0\}$. Just as in ordinary regression, we can penalize the overall loss

$$g(\theta) = \sum_{i=1}^n [1 - y_i(\alpha + z_i^t\beta)]_+, \qquad \theta = (\alpha, \beta),$$

by imposing a lasso or ridge penalty (Hastie, Tibshirani and Friedman, 2001). Note that the linear regression function $h_i(\theta) = \alpha + z_i^t\beta$ predicts either $-1$ or $1$. If $y_i = 1$ and $h_i(\theta)$ overpredicts in the sense that $h_i(\theta) > 1$, then there is no loss. Similarly, if $y_i = -1$ and $h_i(\theta)$ underpredicts in the sense that $h_i(\theta) < -1$, then there is no loss.


Table 5
Empirical examples from the UCI machine learning repository

                              Hinge loss                    VDA
Data set (cases, features)    Iters   Error    Time         Iters   Error    Time
Diabetes (768, 8)               44    0.2266   0.063          11    0.2240   0.015
SPECT (80, 22)                 326    0.2000   0.359           7    0.1750   0.000
Tic-tac-toe (958, 9)           274    0.0167   0.578          26    0.0167   0.062
Ionosphere (351, 33)           483    0.0513   2.984          42    0.0570   0.266

7. IMAGE RESTORATION

Suppose a photograph is divided into pixels and $y_{ij}$ is the digitized intensity for pixel $(i, j)$. Some of the $y_{ij}$ are missing or corrupted. Smoothing pixel values can give a visually improved image (Bioucas-Dias, Figueiredo and Oliveira, 2006; Liao et al., 2002). Correction of pixels subject to minor corruption is termed denoising; correction of missing or grossly distorted values is termed inpainting. Let $S$ be the set of pixels with acceptable values. We can restore the photograph by minimizing the criterion

$$\sum_{(i,j)\in S}(y_{ij} - \mu_{ij})^2 + \lambda\sum_i\sum_j\sum_{(k,l)\in N_{ij}}\|\mu_{ij} - \mu_{kl}\|_{TV},$$

where $N_{ij}$ denotes the pixels neighboring pixel $(i, j)$, $\|x\|_{TV} = \sqrt{x^2 + \epsilon}$ is the total variation norm with $\epsilon > 0$ small, and $\lambda > 0$ is a tuning constant. Let $\mu_{ij}^n$ be the current iterate. The total variation penalties are majorized using

$$\|x\|_{TV} \le \|x^n\|_{TV} + \frac{1}{2\|x^n\|_{TV}}[x^2 - (x^n)^2]$$

based on the concavity of the function $\sqrt{t + \epsilon}$. These maneuvers construct a simple surrogate function expressible as a weighted sum of squares. Other roughness penalties are possible. For instance, the scaled sum of squares $\lambda\sum_i\sum_j\sum_{(k,l)\in N_{ij}}(\mu_{ij} - \mu_{kl})^2$ is plausible. Unfortunately, this choice tends to deter the formation of image edges. The total variation alternative is preferred in practice because it is gentler while remaining continuously differentiable.

If the pixels are defined on a rectangular grid, then we can divide them into two blocks in a checkerboard fashion, with the red checkerboard squares falling into one block and the black checkerboard squares into the other block. Within a block, the least squares problems generated by the surrogate function are parameter separated and hence trivial to solve. Thus, it makes sense to alternate the updates of the blocks. Within a block we update $\mu_{ij}$ via

$$\mu_{ij}^{n+1} = \frac{2y_{ij} + \lambda\sum_{(k,l)\in N_{ij}}\mu_{kl}^n/\|\mu_{ij}^n - \mu_{kl}^n\|_{TV}}{2 + \lambda\sum_{(k,l)\in N_{ij}}1/\|\mu_{ij}^n - \mu_{kl}^n\|_{TV}}$$

for $(i, j) \in S$ or via

$$\mu_{ij}^{n+1} = \frac{\sum_{(k,l)\in N_{ij}}\mu_{kl}^n/\|\mu_{ij}^n - \mu_{kl}^n\|_{TV}}{\sum_{(k,l)\in N_{ij}}1/\|\mu_{ij}^n - \mu_{kl}^n\|_{TV}}$$

for $(i, j) \notin S$. Here each interior pixel $(i, j)$ has four neighbors. If the singularity constant $\epsilon$ is too small or if the tuning constant $\lambda$ is too large, then small residuals generate very large weights. When this pitfall is avoided, the described algorithm is apt to be superior to the fused lasso algorithm of Friedman, Hastie and Tibshirani (2007).

We applied the total variation algorithm to the standard image of the model Lenna. Figure 1 shows the original 256 × 256 image with pixel values digitized on a gray scale from 0 to 255. To the right of the original image is a version corrupted by Gaussian noise (mean 0 and standard deviation 10) and a scratch on the shoulder. The images are restored with $\lambda$ values of 10, 15, 20 and 25 and an $\epsilon$ value of 1. Although we tend to prefer the restoration on the right in the second row, this is a matter of judgment. Variations in $\lambda$ clearly control the balance between image smoothness and loss of detail.

Fig. 1. Restoration of the Lenna photograph. Top row left: the original image; top row right: image corrupted by Gaussian noise (mean 0 and standard deviation 10) and a scratch; second row left: restored image with $\lambda$ = 10; second row right: restored image with $\lambda$ = 15; third row left: restored image with $\lambda$ = 20; third row right: restored image with $\lambda$ = 25. The same value $\epsilon$ = 1 is used throughout.

8. LOCAL CONVERGENCE OF MM ALGORITHMS

Many MM and EM algorithms exhibit a slow rate of convergence. How can one predict the speed of convergence of an MM algorithm and choose between competing algorithms? Consider an MM map $M(\theta)$ for minimizing the objective function $f(\theta)$ via the surrogate function $g(\theta \mid \theta^n)$. According to a theorem of Ortega (1990), the local rate of convergence of the sequence $\theta^{n+1} = M(\theta^n)$ is determined by the spectral radius $\rho$ of the differential $dM(\theta^\infty)$ at the minimum point $\theta^\infty$ of $f(\theta)$. Well-known calculations (Dempster, Laird and Rubin, 1977; Lange, 1995a) demonstrate that

$$dM(\theta^\infty) = I - d^2g(\theta^\infty \mid \theta^\infty)^{-1}d^2f(\theta^\infty).$$

Hence, the eigenvalue equation $dM(\theta^\infty)v = \lambda v$ can be rewritten as

$$d^2g(\theta^\infty \mid \theta^\infty)v - d^2f(\theta^\infty)v = \lambda\, d^2g(\theta^\infty \mid \theta^\infty)v.$$

Taking the inner product of this with $v$, we can solve for $\lambda$ in the form

$$\lambda = 1 - \frac{v^t d^2f(\theta^\infty)v}{v^t d^2g(\theta^\infty \mid \theta^\infty)v}.$$

Extension of this line of reasoning shows that the spectral radius satisfies

$$\rho = 1 - \min_{v\ne 0}\frac{v^t d^2f(\theta^\infty)v}{v^t d^2g(\theta^\infty \mid \theta^\infty)v}.$$

Thus, the rate of convergence of the MM iterates is determined by how well $d^2g(\theta^\infty \mid \theta^\infty)$ approximates $d^2f(\theta^\infty)$. In practice, the surrogate function $g(\theta \mid \theta^n)$ should hug $f(\theta)$ as tightly as possible for $\theta$ close to $\theta^n$.
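The rate $\rho$ is easy to evaluate numerically when both Hessians are available. Continuing the quadratic lower bound logistic illustration from the introduction (again our example, not one from the paper), $\rho$ is the largest eigenvalue in absolute value of $I - B^{-1}d^2f(\theta^\infty)$:

```python
import numpy as np

# Numerical spectral radius rho for the quadratic-lower-bound logistic sketch:
# d2g = B = X'X/4 (constant) and d2f = X'WX with W = diag(p(1-p)) at the optimum.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([0.5, 1.0, -1.0]))))

B = X.T @ X / 4.0
beta = np.zeros(X.shape[1])
for _ in range(500):                                   # run the MM map to near convergence
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta -= np.linalg.solve(B, X.T @ (p - y))

p = 1.0 / (1.0 + np.exp(-X @ beta))
H = X.T @ (X * (p * (1.0 - p))[:, None])               # d2f at the minimum
dM = np.eye(len(beta)) - np.linalg.solve(B, H)          # differential of the MM map
print(np.max(np.abs(np.linalg.eigvals(dM))))            # linear rate rho; smaller is faster
```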

Meng and van Dyk (1997) use this Rayleigh quotient characterization of the spectral radius to prove that the Kent et al. multivariate t algorithm is faster than the original multivariate t algorithm. In essence, they show that the second differential $d^2g(\theta^\infty \mid \theta^\infty)$ is uniformly more positive definite for the alternative algorithm. de Leeuw and Lange (2009) make substantial progress in designing optimal quadratic surrogates. For most other MM algorithms, however, such theoretical calculations are too hard to carry out, and one must rely on numerical experimentation to determine the rate of convergence. The uncertainties about rates of convergence are reminiscent of the uncertainties surrounding MCMC methods. This should not deter us from constructing MM algorithms. On large-scale problems, many traditional algorithms are simply infeasible. If we can construct an MM algorithm, then there is always the chance of accelerating it. We take up this topic briefly in the discussion. Finally, let us stress that the number of iterations until convergence is not the sole determinant of algorithm speed. Computational complexity per iteration also comes into play. On this basis, a standard MM algorithm for transmission tomography is superior to a plausible but different EM algorithm (Lange, 2004).

    9. DISCUSSION

Perhaps the best evidence of the pervasive influence of the EM algorithm is the sheer number of citations garnered by the Dempster et al. paper. As of April 2008, Google Scholar lists 11,232 citations. By contrast, Google Scholar lists 58 citations for the de Leeuw paper and 47 citations for the de Leeuw and Heiser paper. If our contention about the relative importance of the EM and MM algorithms is true, how can one account for this disparity? Several reasons come to mind. One is the venue of publication. The Journal of the Royal Statistical Society, Series B, is one of the most widely read journals in statistics. The de Leeuw and Heiser papers are buried in hard-to-access conference proceedings. Another reason is the prestige of the authors. Four of the five authors of the three papers, Nan Laird, Donald Rubin, Jan de Leeuw and Willem Heiser, were quite junior in 1977. On the other hand, Arthur Dempster was a major figure in statistics and well established at Harvard, the most famous American university. Besides these extrinsic differences, the papers have intrinsic differences that account for the better reception of the Dempster et al. paper. Its most striking advantage is the breadth of its subject matter. Dempster et al. were able to unify different branches of computational statistics under the banner of a clearly enunciated general principle. de Leeuw and Heiser stuck to multidimensional scaling. Their work and extensions are well summarized by Borg and Groenen (1997).

The EM algorithm immediately appealed to the stochastic intuition of statisticians, who are good at calculating the conditional expectations required by the E step. The MM algorithm relies on inequalities and does not play as well to the strengths of statisticians. Partly for this reason the MM algorithm had difficulty breaking out of the vast but placid backwater of social science applications where it started. It remained sequestered there for years, nurtured by several highly productive Dutch statisticians with less clout than their American and British colleagues.

Our emphasis on concrete applications neglects some issues of considerable theoretical and practical importance. The most prominent of these are global convergence analysis, computation of asymptotic standard errors, acceleration, and approximate solution of the optimization step (second M) of the MM algorithm. Let us address each of these in turn.

Virtually all of the convergence results announced by Dempster et al. (1977) and corrected by Wu (1983) and Boyles (1983) carry over to the MM algorithm. The known theory, both local and global, is summarized in the references (Lange, 2004; Vaida, 2005). As anticipated, the best results hold in the presence of convexity or concavity. The SEM algorithm of Meng and Rubin (1991) for computation of asymptotic standard errors also carries over to the MM algorithm (Hunter, 2004). Numerical differentiation of the score function is a viable competitor, particularly if the score can be evaluated analytically. The simplest form of acceleration is step doubling (de Leeuw and Heiser, 1980; Lange and Fessler, 1994). This maneuver replaces the point delivered by an algorithm map $\theta^{n+1} = M(\theta^n)$ by the new point $\theta^n + 2[M(\theta^n) - \theta^n]$, as sketched below. Step doubling usually halves the number of iterations until convergence in an MM algorithm. More effective forms of acceleration are possible using matrix polynomial extrapolation (Varadhan and Roland, 2008) and quasi-Newton and conjugate gradient elaborations of the MM algorithm (Jamshidian and Jennrich, 1997; Lange, 1995b). Finally, if the optimization step of an MM algorithm cannot be accomplished analytically, it is possible to fall back on the MM gradient algorithm (Hunter and Lange, 2004; Lange, 1995a). Here one substitutes one step of Newton's method for full optimization of the surrogate function $g(\theta \mid \theta^n)$ with respect to $\theta$. Fortunately, this approximate algorithm has exactly the same rate of convergence as the original MM algorithm. It also preserves the descent or ascent property of the MM algorithm close to the optimal point.
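A tiny illustration (ours) of step doubling applied to the truncated Poisson map of Section 4; the unaccelerated and doubled iterates are printed side by side.

```python
import numpy as np

def M(theta, xbar=2.0):
    """MM map (8) for the zero-truncated Poisson: xbar * q(theta) / q'(theta)."""
    return xbar * (np.exp(theta) - 1.0) / np.exp(theta)

theta_plain, theta_doubled = 1.0, 1.0
for n in range(1, 8):
    theta_plain = M(theta_plain)                                              # ordinary MM step
    theta_doubled = theta_doubled + 2.0 * (M(theta_doubled) - theta_doubled)  # step doubling
    print(n, round(theta_plain, 5), round(theta_doubled, 5))
# both sequences approach 1.59362; the doubled sequence typically gets there sooner
```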

The reader may be left wondering whether EM or MM provides a clearer path to the derivation of new algorithms. In the absence of a likelihood function, it is difficult for EM to work its magic. Even so, criteria such as least squares can involve hidden likelihoods. Perhaps the best reply is that we are asking the wrong question. After all, one man's mathematical meat is often another man's mathematical poison. A better question is whether MM broadens the possibilities for devising new algorithms. In our view, the answer to the second question is a resounding yes. Our last four examples illustrate this point. Of course, it may be possible to derive one or more of these algorithms from the EM perspective, but we have not been clever enough to do so.

In highlighting the more general MM algorithm, we intend no disrespect to the pioneers of the EM algorithm. If the fog of obscurity lifts from the MM algorithm, it will not detract from their achievements. It may, however, propel the ambitious plans for data mining underway in the 21st century. Even with the expected advances in computer hardware, the statistics community still needs to concentrate on effective algorithms. The MM principle is poised to claim a share of the credit in this enterprise. Statisticians with a numerical bent are well advised to add it to their toolkits.

    ACKNOWLEDGMENT

Research supported in part by USPHS Grants GM53275 and MH59490 to KL.

    REFERENCES

Anderson, G. D., Vamanamurthy, M. K. and Vuorinen, M. (2007). Generalized convexity and inequalities. J. Math. Anal. Appl. 335 1294–1308. MR2346906
Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Becker, M. P., Yang, I. and Lange, K. (1997). EM algorithms without missing data. Stat. Methods Med. Res. 6 38–54.
Bergstrom, T. C. and Bagnoli, M. (2005). Log-concave probability and its applications. Econom. Theory 26 445–469. MR2213177
Bijleveld, C. C. J. H. and de Leeuw, J. (1991). Fitting longitudinal reduced-rank regression models by alternating least squares. Psychometrika 56 433–447. MR1131768
Bioucas-Dias, J. M., Figueiredo, M. A. T. and Oliveira, J. P. (2006). Total variation-based image deconvolution: A majorization–minimization approach. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings 861–864.
Blitzstein, J., Chatterjee, S. and Diaconis, P. (2008). A new algorithm for high dimensional maximum likelihood estimation. Technical report.
Böhning, D. and Lindsay, B. G. (1988). Monotonicity of quadratic approximation algorithms. Ann. Inst. Statist. Math. 40 641–663. MR0996690
Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and Applications. Springer, New York. MR1424243
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press. MR2061575
Boyles, R. A. (1983). On the convergence of the EM algorithm. J. Roy. Statist. Soc. Ser. B 45 47–50. MR0701075
de Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics (J. R. Barra, F. Brodeau, G. Romier and B. Van Cutsem, eds.) 133–145. North-Holland, Amsterdam. MR0478483
de Leeuw, J. (1994). Block relaxation algorithms in statistics. In Information Systems and Data Analysis (H.-H. Bock, W. Lenski and M. M. Richter, eds.). Springer, Berlin.
de Leeuw, J. and Heiser, W. J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. In Geometric Representations of Relational Data (J. C. Lingoes, E. Roskam and I. Borg, eds.). Mathesis Press, Ann Arbor, MI.
de Leeuw, J. and Heiser, W. J. (1980). Multidimensional scaling with restrictions on the configuration. In Multivariate Analysis V: Proceedings of the Fifth International Symposium on Multivariate Analysis (P. R. Krishnaiah, ed.) 501–522. North-Holland, Amsterdam. MR0566359
de Leeuw, J. and Lange, K. (2009). Sharp quadratic majorization in one dimension. Comput. Statist. Data Anal. 53 2471–2484.
De Pierro, A. R. (1995). A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography. IEEE Trans. Med. Imaging 14 132–137.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
Eldén, L. (2007). Matrix Methods in Data Mining and Pattern Recognition. SIAM, Philadelphia. MR2314399
Friedman, J., Hastie, T. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1 302–332.
Groenen, P. J. F., Nalbantov, G. and Bioch, J. C. (2006). Nonlinear support vector machines through iterative majorization and I-splines. In Studies in Classification, Data Analysis and Knowledge Organization (H. J. Lenz and R. Decker, eds.) 149–161. Springer, Berlin.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York. MR1851606
Heiser, W. J. (1987). Correspondence analysis with least absolute residuals. Comput. Statist. Data Anal. 5 337–356.


Heiser, W. J. (1995). Convergent computing by iterative majorization: Theory and applications in multidimensional data analysis. In Recent Advances in Descriptive Multivariate Analysis (W. J. Krzanowski, ed.). Clarendon Press, Oxford. MR1380319
Huber, P. J. (1981). Robust Statistics. Wiley, New York. MR0606374
Hunter, D. R. (2004). MM algorithms for generalized Bradley–Terry models. Ann. Statist. 32 384–406. MR2051012
Hunter, D. R. and Lange, K. (2000). Quantile regression via an MM algorithm. J. Comput. Graph. Statist. 9 60–77. MR1819866
Hunter, D. R. and Lange, K. (2002). Computing estimates in the proportional odds model. Ann. Inst. Statist. Math. 54 155–168. MR1893548
Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist. 58 30–37. MR2055509
Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617–1642. MR2166557
Jamshidian, M. and Jennrich, R. I. (1997). Quasi-Newton acceleration of the EM algorithm. J. Roy. Statist. Soc. Ser. B 59 569–587. MR1452026
Kent, J. T., Tyler, D. E. and Vardi, Y. (1994). A curious likelihood identity for the multivariate t-distribution. Comm. Statist. Simulation Comput. 23 441–453. MR1279675
Kiers, H. A. L. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Comput. Statist. Data Anal. 41 157–170. MR1973762
Kiers, H. A. L. and Ten Berge, J. M. F. (1992). Minimization of a class of matrix trace functions by means of refined majorization. Psychometrika 57 371–382. MR1183194
Lange, K. (1994). An adaptive barrier method for convex programming. Methods Appl. Anal. 1 392–402. MR1317019
Lange, K. (1995a). A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Statist. Soc. Ser. B 57 425–437. MR1323348
Lange, K. (1995b). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica 5 1–18. MR1329286
Lange, K. (2004). Optimization. Springer, New York. MR2072899
Lange, K. and Fessler, J. A. (1994). Globally convergent algorithms for maximum a posteriori transmission tomography. IEEE Trans. Image Process. 4 1430–1438.
Lange, K., Hunter, D. R. and Yang, I. (2000). Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist. 9 1–20. MR1819865
Lange, K., Little, R. J. A. and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Amer. Statist. Assoc. 84 881–896. MR1134486
Lange, K. and Wu, T. T. (2008). An MM algorithm for multicategory vertex discriminant analysis. J. Comput. Graph. Statist. 17 1–18.
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788–791.
Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Adv. Neural Inform. Process. Syst. 13 556–562.
Liao, W. H., Huang, S. C., Lange, K. and Bergsneider, M. (2002). Use of MM algorithm for regularization of parametric images in dynamic PET. In Brain Imaging Using PET (M. Senda, Y. Kimura and P. Herscovitch, eds.). Academic Press, New York.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley, New York. MR1925014
Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. Academic Press, San Diego. MR0552278
McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York. MR1417721
Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. Roy. Statist. Soc. Ser. B 51 127–138. MR0984999
Meng, X. L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance–covariance matrices: The SEM algorithm. J. Amer. Statist. Assoc. 86 899–909.
Meng, X. L. and van Dyk, D. (1997). The EM algorithm – an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B 59 511–567. MR1452025
Ortega, J. M. (1990). Numerical Analysis: A Second Course. SIAM, Philadelphia. MR1037261
Ortega, J. M. and Rheinboldt, W. C. (1970). Iterative Solutions of Nonlinear Equations in Several Variables. Academic Press, New York. MR0273810
Pauca, V. P., Piper, J. and Plemmons, R. J. (2006). Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl. 416 29–47. MR2232918
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York. MR0346957
Sabatti, C. and Lange, K. (2002). Genomewide motif identification using a dictionary model. Proceedings IEEE 90 1803–1810.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.
Steele, J. M. (2004). The Cauchy–Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge Univ. Press and Math. Assoc. Amer., Washington, DC. MR2062704
Takane, Y., Young, F. W. and de Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika 42 7–67.
Vaida, F. (2005). Parameter convergence for EM and MM algorithms. Statist. Sinica 15 831–840. MR2233916
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. MR1367965
Varadhan, R. and Roland, C. (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist. 35 335–353. MR2418745
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103. MR0684867
