

A Highly Scalable Parallel Algorithm for Isotropic Total Variation Models

Jie Wang JIE.WANG.USTC@ASU.EDU
Arizona State University, Tempe, AZ 85287 USA

Qingyang Li QINGYANG.LI@ASU.EDU
Arizona State University, Tempe, AZ 85287 USA

Sen Yang SENYANG@ASU.EDU
Arizona State University, Tempe, AZ 85287 USA

Wei Fan DAVID.FANWEI@HUAWEI.COM
Huawei Noah's Ark Lab, Hong Kong, China

Peter Wonka PETER.WONKA@ASU.EDU
Arizona State University, Tempe, AZ 85287 USA

Jieping Ye JIEPING.YE@ASU.EDU
Arizona State University, Tempe, AZ 85287 USA

Abstract

The total variation (TV) models are among the most popular and successful tools in signal processing. However, due to the complex nature of the TV term, they remain challenging to solve, especially for large-scale problems. The state-of-the-art algorithms based on the alternating direction method of multipliers (ADMM) often require solving large linear systems, which greatly limits their practical use on large-scale problems. In this paper, we propose a highly scalable parallel algorithm for the TV models, based on a novel decomposition strategy of the problem domain. As a result, the TV models can be decoupled into a set of small and independent subproblems, which admit closed form solutions. This makes our approach particularly suitable for parallel implementation. Our algorithm is guaranteed to converge to its global minimum. With N variables and n_p processes, the time complexity is O(N/(ε n_p)) to reach an ε-optimal solution. Extensive experiments demonstrate that our approach outperforms the existing state-of-the-art algorithms, especially in dealing with high resolution, mega-size images.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

The total variation (TV) models have found great success in a wide range of applications, including but not limited to image denoising, image deblurring and image reconstruction (Barbero & Sra, 2011; Candes et al., 2006; Casas et al., 1999; He et al., 2005; Kim et al., 2010; Lustig et al., 2005; Osher et al., 2005; Vert & Bleakley, 2010; Vogel & Oman, 1998). Rudin et al. (1992) first introduced the TV model to remove noise in a given image by penalizing the TV term, motivated by the fact that a noisy signal generally implies high total variation. Given an image Y ∈ ℝ^{m×n}, the discrete version of the ROF model is:

$$\min_X \; \frac{1}{2}\|X - Y\|_F^2 + \lambda\|X\|_{TV}, \qquad (1)$$

where ‖·‖_F and ‖·‖_TV are the Frobenius norm and the TV norm respectively, λ is a positive parameter, and X ∈ ℝ^{m×n} is the image to be recovered. In this paper, we consider the isotropic TV norm:

$$\|X\|_{TV} = \sum_{i=1}^{m}\sum_{j=1}^{n}\|D_{i,j}X\|_2, \qquad (2)$$

where D_{i,j}X = ((D_1X)_{i,j}, (D_2X)_{i,j})^T is the discretized gradient at pixel (i, j). For simplicity, we use the forward difference to define D_{i,j}X, i.e.,

$$(D_1X)_{i,j} = \begin{cases} x_{i+1,j} - x_{i,j} & \text{if } i < m \\ 0 & \text{if } i = m, \end{cases} \qquad (3)$$

$$(D_2X)_{i,j} = \begin{cases} x_{i,j+1} - x_{i,j} & \text{if } j < n \\ 0 & \text{if } j = n. \end{cases} \qquad (4)$$
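To make the definitions concrete, the snippet below evaluates the isotropic TV norm of Eq. (2) with the forward differences (3)-(4). It is a minimal numpy sketch; the function name tv_isotropic is ours, not the paper's.

```python
import numpy as np

def tv_isotropic(X):
    """Isotropic TV norm of Eq. (2), using the forward differences (3)-(4)."""
    d1 = np.zeros_like(X, dtype=float)
    d1[:-1, :] = X[1:, :] - X[:-1, :]   # (D1 X)_{i,j}; last row stays 0
    d2 = np.zeros_like(X, dtype=float)
    d2[:, :-1] = X[:, 1:] - X[:, :-1]   # (D2 X)_{i,j}; last column stays 0
    return np.sum(np.sqrt(d1 ** 2 + d2 ** 2))
```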


It is worthwhile to mention that there are fast algorithms for the "anisotropic" TV model, where ‖X‖_TV is defined by $\sum_{i=1}^{m}\sum_{j=1}^{n}\|D_{i,j}X\|_1$. However, those algorithms, which are usually based on graph cuts or maximum flow, are not applicable to isotropic TV models. As pointed out by Goldfarb & Yin (2009) and Duan & Tai (2012), unless the x_{i,j} are binary variables, the isotropic TV term is not graph representable. Moreover, Wahlberg et al. (2012) propose an ADMM method to solve the 1D TV model. Recently, Condat (2013) showed that the 1D TV denoising model admits a closed form solution and thus can be solved by a very fast noniterative algorithm. Using this result, Yang et al. (2013) propose an efficient ADMM algorithm for the multidimensional TV model. However, this method is only applicable to the anisotropic TV model.

Although the formulation in problem (1) is simple, it is computationally challenging to solve due to the complex nature of the TV norm. With the emergence of high resolution, mega-sized images and signals, it is desirable to solve problem (1) with large-scale data. Existing algorithms usually demand high computational cost and memory usage, which prevents their use on large-scale problems. Simpler methods require moderate computational effort and less memory, but usually converge very slowly. These difficulties have motivated many research efforts to better balance computational cost against convergence rate. For example, Beck & Teboulle (2009a) propose fast gradient-based algorithms which combine an acceleration of the dual approach with a fast iterative shrinkage algorithm (FISTA) (Beck & Teboulle, 2009b); Goldstein & Osher (2009) develop a split Bregman method; Chambolle & Pock (2011) apply a primal-dual method to solve (1). As another popular application of TV based models, Magnetic Resonance (MR) image reconstruction has received great attention recently. Ma et al. (2008) propose an operator-splitting algorithm (TVCMRI) for MR image reconstruction. Huang et al. (2011) propose another efficient algorithm named FCSA, which utilizes the ideas of composite splitting (Combettes & Pesquet, 2011) and the acceleration technique introduced by FISTA.

In this paper, we propose a fast Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011; Chen et al., 2012; Chan et al., 2011; Esser et al., 2010; Esser, 2009) based algorithm, termed FAD, to solve the TV models. FAD is based on a novel decomposition strategy of the problem domain. As a result, the TV model in (1) can be decoupled into a set of small and independent subproblems. Each subproblem involves at most three variables and admits a closed form solution. Thus, all of the subproblems can be solved efficiently in parallel. We implement our highly scalable parallel algorithm via OpenMP and MPI. Extensive experiments on synthetic and real data demonstrate the scalability and efficiency of the proposed FAD.

Notations: Let D_{i,j} be the discretized gradient at pixel (i, j), and ‖·‖ be the ℓ2 norm. For a matrix A, [A]_{i,j} is its (i, j)th entry. The inner product between matrices is defined as ⟨A, B⟩ = Σ_{i,j}[A]_{i,j}[B]_{i,j}. Given a set C, we use C°, relint C, and ∂C to denote the interior points, relative interior and boundary points of C respectively. For any real number c, let d := ⌊c⌋ be the largest integer such that d ≤ c, and mod(a, b) := a − ⌊a/b⌋ b. Let f : ℝⁿ → ℝ ∪ {+∞} be a closed proper convex function. The proximal operator of f is defined by:

$$\mathrm{prox}_f(x) = \arg\min_y \left( f(y) + \frac{1}{2}\|y - x\|^2 \right).$$
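As a concrete instance of this definition, the proximal operator of f(y) = ρ|y| is the well-known soft-thresholding operator. A one-line sketch (our own illustration, not from the paper):

```python
import numpy as np

def prox_abs(x, rho):
    """prox of f(y) = rho*|y|, i.e. soft-thresholding, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)
```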

2. Optimization Methods

In this section, we present the proposed FAD in detail. We first briefly review ADMM in Section 2.1. In Section 2.2, we present our novel decomposition strategy to partition the variables in (1). The proposed FAD and its convergence property are presented in Sections 2.3 and 2.4, respectively.

2.1. Review of ADMM

ADMM makes use of the decomposability of dual ascent while retaining the good convergence properties of the method of multipliers. Consider the problem:

$$\min_{x,z} \; \{ f(x) + g(z) : x - z = 0 \}. \qquad (5)$$

We assume both f(·) and g(·) are convex. The augmented Lagrangian is given by:

$$L_\gamma(x, z; \theta) = f(x) + g(z) + \frac{\gamma}{2}\|x - (z - \theta)\|^2, \qquad (6)$$

where θ and γ > 0 are the scaled augmented Lagrangian multiplier (Boyd et al., 2011) and the penalty parameter respectively. ADMM attempts to solve problem (5) by iteratively minimizing L_γ(x, z; θ) over x and z respectively, and updating θ accordingly (Boyd et al., 2011). Please refer to the supplement for the details of ADMM.
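For reference, a minimal Python sketch of this scaled-form iteration follows. Here prox_f and prox_g stand for the proximal operators of f/γ and g/γ, and all names are ours rather than the paper's.

```python
import numpy as np

def admm(prox_f, prox_g, shape, gamma=10.0, max_iter=500, eps=1e-4):
    """Scaled ADMM for min f(x) + g(z) s.t. x = z (problem (5))."""
    x = np.zeros(shape)
    z = np.zeros(shape)
    theta = np.zeros(shape)
    for _ in range(max_iter):
        x = prox_f(z - theta)                   # minimize L_gamma over x
        z_old, z = z, prox_g(x + theta)         # minimize L_gamma over z
        theta = theta + (x - z)                 # scaled dual update
        r = np.linalg.norm(x - z)               # primal residual
        s = gamma * np.linalg.norm(z - z_old)   # dual residual
        if r < eps and s < eps:
            break
    return z
```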

2.2. The Decomposition Strategy

We first give an example to illustrate our idea in Section 2.2.1, and then consider the general case in Section 2.2.2.

2.2.1. A SIMPLE EXAMPLE

Let us first consider a simple case as an example. Suppose we have an image of 4 × 4 pixels as illustrated in Fig. 1. The TV term ‖X‖_TV can be written as:

$$\|X\|_{TV} = \sum_{i=1}^{4}\sum_{j=1}^{4}\|D_{i,j}\| \qquad (7)$$
$$= \sum_{i=1}^{3}\sum_{j=1}^{3}\sqrt{(x_{i+1,j} - x_{i,j})^2 + (x_{i,j+1} - x_{i,j})^2} + \sum_{i=1}^{3}|x_{i+1,4} - x_{i,4}| + \sum_{j=1}^{3}|x_{4,j+1} - x_{4,j}|.$$

Figure 1. Illustration of the decomposition.

From the right hand side (RHS) of Eq. (7), we can see that none of the terms is separable from the others, since most of the variables appear in several terms. Our goal is to decompose ‖X‖_TV into several parts such that each term is separable from all the others. As shown in Fig. 1, we decompose ‖X‖_TV into three disjoint parts, i.e.,

$$\|X\|_{TV} = \sum_{k=1}^{3}\|X\|_{TV_k}. \qquad (8)$$

Each part consists of the total variation norms ‖D_{i,j}‖ associated with the big solid nodes. For example, there are six big solid red nodes in Fig. 1(a): (1,1), (2,2), (3,3), (4,4), (1,4) and (4,1). Thus, the first part ‖X‖_{TV_1} is:

$$\|X\|_{TV_1} = \sum_{k=1}^{4}\|D_{k,k}\| + \|D_{1,4}\| + \|D_{4,1}\| \qquad (9)$$
$$= \sum_{k=1}^{3}\sqrt{(x_{k+1,k} - x_{k,k})^2 + (x_{k,k+1} - x_{k,k})^2} + |x_{2,4} - x_{1,4}| + |x_{4,2} - x_{4,1}|.$$

Recall that ‖D_{4,4}‖ = 0. From the RHS of Eq. (9), we can see that each variable appears in only one term, which implies that every term is separable from all the others. Similarly, from Fig. 1(b) and Fig. 1(c), ‖X‖_{TV_2} and ‖X‖_{TV_3} are given by

$$\|X\|_{TV_2} = \sum_{k=1}^{3}\|D_{k,k+1}\| + \sum_{k=1}^{2}\|D_{k+2,k}\|, \qquad (10)$$
$$\|X\|_{TV_3} = \sum_{k=1}^{3}\|D_{k+1,k}\| + \sum_{k=1}^{2}\|D_{k,k+2}\|. \qquad (11)$$

Clearly, for ‖X‖_{TV_2}, each term on the RHS of Eq. (10) is separable from all the others, and the same property holds for ‖X‖_{TV_3}.

2.2.2. THE GENERAL CASE

Suppose we have an image of m × n pixels. Let D = {(i, j) : i = 1, 2, ..., m, j = 1, 2, ..., n}. Then we can divide D into three non-overlapping subsets:

$$\mathcal{D}_k = \{(i, j) \in \mathcal{D} : \mathrm{mod}(j - i, 3) = k - 1\}, \quad k = 1, 2, 3.$$

Thus, ‖X‖_TV can be written as:

$$\|X\|_{TV} = \sum_{k=1}^{3}\|X\|_{TV_k} = \sum_{k=1}^{3}\sum_{(i,j)\in\mathcal{D}_k}\|D_{i,j}\|. \qquad (12)$$

We can see that for each $\|X\|_{TV_k} = \sum_{(i,j)\in\mathcal{D}_k}\|D_{i,j}\|$ (k = 1, 2, 3), every term ‖D_{i,j}‖, (i, j) ∈ D_k, is separable from all the other terms ‖D_{i′,j′}‖ with (i′, j′) ∈ D_k \ (i, j). For details, please refer to Proposition 1 in the supplement.

Remark 1. Recall that the isotropic TV norm introduced in Eq. (2) is the summation of the ℓ2 norm of the discretized gradient over all pixels. The proposed decomposition strategy decomposes the image into 3 non-overlapping sets of pixels and writes the TV norm as the summation of three terms as in Eq. (12). Each term is the summation of the ℓ2 norm of the discretized gradient over one group of pixels. Notice that each $\|X\|_{TV_k} = \sum_{(i,j)\in\mathcal{D}_k}\|D_{i,j}\|$ also involves pixel-level variables [X]_{i′,j′} that are shared across the three groups (though not within a group).
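The grouping is easy to materialize in code. Below is a small numpy sketch (0-based indices, which leave the (j − i) mod 3 pattern unchanged up to a relabeling of the groups; the helper name is ours):

```python
import numpy as np

def pixel_groups(m, n):
    """Boolean masks for D_k = {(i,j) : mod(j - i, 3) = k - 1}, k = 1, 2, 3."""
    i, j = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
    return [(j - i) % 3 == k for k in range(3)]

masks = pixel_groups(4, 4)
# The three groups tile the image without overlap:
assert np.all(masks[0] | masks[1] | masks[2])
assert not np.any(masks[0] & masks[1])
```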

2.3. ADMM for TV Based Models

Using the decomposition strategy presented in the last section, problem (1) can be rewritten as:

$$\min_X \; \frac{1}{2}\|X - Y\|_F^2 + \lambda\sum_{k=1}^{3}\|X\|_{TV_k}. \qquad (13)$$

To use ADMM, (13) can be reformulated as:

$$\min_{Z;\,X_1,X_2,X_3} \; \frac{1}{2}\|Z - Y\|_F^2 + \lambda\sum_{k=1}^{3}\|X_k\|_{TV_k} \qquad (14)$$
$$\text{s.t. } X_k = Z, \quad k = 1, 2, 3,$$

with global variable Z and local variables X_k, k = 1, 2, 3. The augmented Lagrangian of problem (14) is:

$$L_\gamma(Z; X_1, X_2, X_3; \Theta_1, \Theta_2, \Theta_3) = \frac{1}{2}\|Z - Y\|_F^2 + \lambda\sum_{k=1}^{3}\|X_k\|_{TV_k} + \frac{\gamma}{2}\sum_{k=1}^{3}\|X_k - (Z - \Theta_k)\|_F^2. \qquad (15)$$

Denote X_1, X_2, X_3 and Θ_1, Θ_2, Θ_3 by X and Θ respectively. Thus, L_γ in Eq. (15) can be simplified to L_γ(Z; X; Θ). We apply ADMM to solve (14). In each iteration, we first minimize L_γ(Z; X; Θ) with respect to Z and X, and then update the dual variable Θ. Assume we are at the tth step, i.e., Z^t, X^t and Θ^t are known; we compute Z^{t+1}, X^{t+1} and Θ^{t+1} as follows:

S1. Minimize L_γ(Z^t; X; Θ^t) with respect to X

Recall that X refers to X_k, k = 1, 2, 3. Thus, we need to minimize L_γ(Z^t; X; Θ^t) with respect to X_k, k = 1, 2, 3. Since X_k^{t+1} = argmin_{X_k} L_γ(Z^t; X; Θ^t), we have

$$X_k^{t+1} = \arg\min_{X_k} \; \lambda\|X_k\|_{TV_k} + \frac{\gamma}{2}\|X_k - (Z^t - \Theta_k^t)\|_F^2 \qquad (16)$$
$$= \arg\min_{X_k} \; \frac{1}{2}\|X_k - V_k^t\|_F^2 + \frac{\lambda}{\gamma}\|X_k\|_{TV_k},$$

where V_k^t = Z^t − Θ_k^t. Note that the first term on the RHS of (16) is separable. Due to the decomposition strategy in Section 2.2, the second term on the RHS of (16) is separable as well. Therefore, problem (16) can be decoupled into a set of subproblems, each involving at most three variables. Intuitively, this is because each subproblem involves a given pixel and the pixels above and to the right, due to the forward difference. Suppose pixel (p, q) ∈ D_k. We have the following cases.

Case 1. Image interior (if p < m, q < n). Let us define

$$u = ([X_k]_{p,q+1}, [X_k]_{p,q}, [X_k]_{p+1,q})^T, \qquad (17)$$
$$w = ([V_k^t]_{p,q+1}, [V_k^t]_{p,q}, [V_k^t]_{p+1,q})^T.$$


As we discussed in Section 2.2, u is only involved in ‖D_{p,q}X_k‖, not in any other term of ‖X_k‖_{TV_k}. Thus, the following subproblem can be decoupled from problem (16):

$$\min_u \; \frac{1}{2}\|u - w\|^2 + \frac{\lambda}{\gamma}\sqrt{(u_1 - u_2)^2 + (u_3 - u_2)^2}. \qquad (18)$$

Case 2. Top boundary (if p = m, q < n). Let

$$u = ([X_k]_{p,q+1}, [X_k]_{p,q})^T, \quad w = ([V_k^t]_{p,q+1}, [V_k^t]_{p,q})^T. \qquad (19)$$

Then, for the same reason as in Case 1, we have the following subproblem which can be decoupled from (16):

$$\min_u \; \frac{1}{2}\|u - w\|^2 + \frac{\lambda}{\gamma}|u_2 - u_1|. \qquad (20)$$

Consider the right boundary (p < m, q = n). Let

$$u = ([X_k]_{p+1,q}, [X_k]_{p,q})^T, \quad w = ([V_k^t]_{p+1,q}, [V_k^t]_{p,q})^T. \qquad (21)$$

We get the same subproblem as (20).

Case 3. Top right corner (if p = m and q = n). We have ‖D_{p,q}X_k‖ = 0 by the definitions in (3) and (4). Thus,

$$\min_{[X_k]_{p,q}} \; \frac{1}{2}\left([X_k]_{p,q} - [V_k^t]_{p,q}\right)^2 \;\Rightarrow\; [X_k^{t+1}]_{p,q} = [V_k^t]_{p,q}. \qquad (22)$$

We solve problems (18) and (20) in Section 3.

Remark 2. [X_k]_{p,q} will be updated by one of the above three cases if group k contains pixel (p, q), the pixel below, or the pixel to the left.

Case 4. By Remark 2, if none of pixel (p, q), the pixel below, or the pixel to the left belongs to group k, [X_k]_{p,q} will not be updated. Take Fig. 1(a) as an example; this is the case for pixel (1, 3). Clearly, the variable [X_k]_{p,q} does not appear in the second term on the RHS of (16), i.e., (λ/γ)‖X_k‖_{TV_k}. Thus we only need to solve:

$$\min_{[X_k]_{p,q}} \; \frac{1}{2}\left([X_k]_{p,q} - [V_k^t]_{p,q}\right)^2 \;\Rightarrow\; [X_k^{t+1}]_{p,q} = [V_k^t]_{p,q}. \qquad (23)$$

Remark 3. S1 involves two levels of decomposition. The first level comes from the fact that problem (16) involves minimizing over the three sets of local variables X_k, k = 1, 2, 3, which are independent of each other. Thus, we can solve (16) for X_k, k = 1, 2, 3, in parallel. The second level is due to the four cases in S1. Specifically, for each k ∈ {1, 2, 3}, problem (16) breaks into a set of subproblems involving at most three variables [problems (18) and (20)]. Recall that, in Section 2.2, we decompose the image into three groups of non-overlapping pixels. We emphasize that these refer to different levels of decomposition.
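Assembling the four cases, one sweep of the X-update for a single group can be sketched as below. The dispatcher relies on the per-case solvers solve_interior_subproblem (problem (18)) and solve_boundary_subproblem (problem (20)) sketched in Section 3 further down; all function names are ours, indices are 0-based (the paper's i = m row is the last row here), and this is an illustration rather than the authors' implementation. In-place updates are safe because, by Proposition 1, the subproblems within a group touch disjoint variables.

```python
import numpy as np

def prox_tvk(V, k, rho):
    """One pass of S1 for group k: solves Eq. (16) pixel by pixel."""
    m, n = V.shape
    X = V.astype(float).copy()                 # Cases 3 and 4: copy through
    for p in range(m):
        for q in range(n):
            if (q - p) % 3 != k:               # only pixels (p, q) in group k
                continue
            if p < m - 1 and q < n - 1:        # Case 1: interior, Eq. (18)
                w = np.array([V[p, q + 1], V[p, q], V[p + 1, q]], dtype=float)
                X[p, q + 1], X[p, q], X[p + 1, q] = solve_interior_subproblem(w, rho)
            elif p == m - 1 and q < n - 1:     # Case 2: Eq. (20)
                X[p, q + 1], X[p, q] = solve_boundary_subproblem(V[p, q + 1], V[p, q], rho)
            elif p < m - 1 and q == n - 1:     # Case 2: right boundary, Eq. (21)
                X[p + 1, q], X[p, q] = solve_boundary_subproblem(V[p + 1, q], V[p, q], rho)
    return X
```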

S2. Minimize L_γ(Z; X^{t+1}; Θ^t) with respect to Z

Since Z^{t+1} = argmin_Z L_γ(Z; X^{t+1}; Θ^t), we have

$$\left.\frac{\partial L_\gamma(Z; X^{t+1}; \Theta^t)}{\partial Z}\right|_{Z=Z^{t+1}} = 0 \qquad (24)$$
$$\Rightarrow\; Z^{t+1} - Y - \gamma\sum_{k=1}^{3}(X_k^{t+1} - Z^{t+1} + \Theta_k^t) = 0$$
$$\Rightarrow\; Z^{t+1} = \frac{Y + \gamma\sum_{k=1}^{3}(\Theta_k^t + X_k^{t+1})}{1 + 3\gamma}. \qquad (25)$$

Notice that, in this step, we update the global variable Z, and thus the replicated variables X_k, k = 1, 2, 3, are coordinated across the three pixel groups.

S3. Update the (scaled) dual variable Θ^t

The dual variables Θ_k^t, k = 1, 2, 3, are updated by

$$\Theta_k^{t+1} = \Theta_k^t + (X_k^{t+1} - Z^{t+1}), \qquad (26)$$

which is equivalent to a dual ascent step (Boyd et al., 2011).

Our method is summarized in Algorithm 1.

Algorithm 1 FAD (Fast ADMM for TV Models)
Input: Y, λ, γ > 0
Initialize Z^0 = X_k^0 := Y, Θ_k^0 := 0, k = 1, 2, 3
for t = 0 to T do
  for k = 1 to 3 do
    V_k^t := Z^t − Θ_k^t
    X_k^{t+1} := argmin_{X_k} (1/2)‖X_k − V_k^t‖_F^2 + (λ/γ)‖X_k‖_{TV_k}
  end for
  Z^{t+1} := (Y + γ Σ_{k=1}^{3} (Θ_k^t + X_k^{t+1})) / (1 + 3γ)
  for k = 1 to 3 do
    Θ_k^{t+1} := Θ_k^t + (X_k^{t+1} − Z^{t+1})
  end for
end for
Return: Z^T
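A direct transcription of Algorithm 1 in Python might look as follows. This serial sketch reuses the prox_tvk dispatcher from Section 2.3 above; the two inner loops over k (and the per-pixel subproblems inside prox_tvk) are the parts that parallelize.

```python
import numpy as np

def fad(Y, lam, gamma=10.0, T=200):
    """Sketch of Algorithm 1 (FAD); relies on prox_tvk defined earlier."""
    Z = Y.astype(float).copy()
    X = [Z.copy() for _ in range(3)]
    Theta = [np.zeros_like(Z) for _ in range(3)]
    for _ in range(T):
        for k in range(3):                     # independent: run in parallel
            X[k] = prox_tvk(Z - Theta[k], k, lam / gamma)
        Z = (Y + gamma * sum(Theta[k] + X[k] for k in range(3))) / (1.0 + 3.0 * gamma)
        for k in range(3):
            Theta[k] = Theta[k] + X[k] - Z     # scaled dual update, Eq. (26)
    return Z
```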

Remark 4. Recall that T is the number of iterations when Algorithm 1 terminates. We use the primal and dual residuals (Boyd et al., 2011) to specify the stopping criterion: if both residuals are less than a given parameter ε, the algorithm stops. In this paper, we set ε = 10^{-4}. In addition, it is known that a larger γ tends to result in a smaller primal residual but a larger dual residual. We keep γ fixed (say, at 10) in this paper. For schemes that vary γ, we refer the reader to (Boyd et al., 2011).

2.4. Convergence Analysis

The convergence properties of ADMM for solving the standard form in (5) have been extensively explored by (He & Yuan, 2012; Boyd et al., 2011; Monteiro & Svaiter, 2010; Eckstein & Bertsekas, 1992; Glowinski & Le Tallec, 1989). To establish the convergence properties of Algorithm 1, we can reformulate problem (14) as (5), and the resulting formulation satisfies the conditions required for convergence. Moreover, the convergence rate of Algorithm 1 can be shown to be O(1/k) by following the procedure in (He & Yuan, 2012). For details, please refer to the supplement.

3. Solution to the Subproblems

We solve problems (18) and (20) in Sections 3.1 and 3.2 respectively. In Section 3.3, we give a brief complexity analysis of Algorithm 1.


3.1. Solution to Subproblem (18)

We focus on the following subproblem:

$$\min_u \; \frac{1}{2}\|u - w\|^2 + \rho\sqrt{(u_1 - u_2)^2 + (u_3 - u_2)^2},$$

where ρ = λ/γ ≥ 0. Let G ∈ ℝ^{2×3} be defined by

$$G = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{pmatrix}. \qquad (27)$$

The dual problem of (18) is:

$$\min_s \; \left\{ \frac{1}{2}\|w - \rho G^T s\|^2 : s_1^2 + s_2^2 \le 1 \right\}. \qquad (28)$$

Assume u* and s* are the optimal solutions of (18) and (28). By the KKT conditions, we have:

$$u^* = w - \rho G^T s^*, \qquad (29)$$

$$\begin{pmatrix} s_1^* \\ s_2^* \end{pmatrix} \in \begin{cases} \dfrac{(u_2^* - u_1^*,\; u_3^* - u_2^*)^T}{\|(u_2^* - u_1^*,\; u_3^* - u_2^*)^T\|}, & \text{if } (u_2^* - u_1^*,\; u_3^* - u_2^*)^T \ne 0, \\[2mm] v, \; \|v\| \le 1, & \text{if } (u_2^* - u_1^*,\; u_3^* - u_2^*)^T = 0. \end{cases} \qquad (30)$$

Note that problem (28) is smooth and convex. Thus it is easier to solve than the nonsmooth problem (18). We can recover u* from s* according to Eq. (29). We rewrite (28) as:

$$\min_s \; \left\{ \frac{\rho^2}{2}\left\|\frac{w}{\rho} - G^T s\right\|^2 : s_1^2 + s_2^2 \le 1 \right\}. \qquad (31)$$

Let ℬ = {t : t = G^T s, s ∈ B}, where B ⊂ ℝ² is the unit disc. Clearly, ℬ ⊂ ℝ³ lies in a two dimensional linear manifold. We can see that problem (31) is in fact a projection problem. Let

$$t^*(\rho) = \arg\min_{t\in\mathcal{B}} \left\|\frac{w}{\rho} - t\right\|^2 = P_{\mathcal{B}}(w/\rho);$$

then we have t*(ρ) = G^T s*(ρ). There is a one-to-one linear mapping between t*(ρ) and s*(ρ), since G^T is linear and has full column rank. Consequently, t*(ρ) ∈ relint ℬ implies s*(ρ) ∈ B° and t*(ρ) ∈ ∂ℬ implies s*(ρ) ∈ ∂B, and vice versa. Since the projection operator P_ℬ(w/ρ) is nonexpansive (Bertsekas, 2003), t*(ρ) varies continuously with ρ. When ρ is big enough, t*(ρ) must belong to relint ℬ and thus s*(ρ) ∈ B° (consider the extreme case t*(∞) = 0 ∈ relint ℬ). Therefore there must exist a ρ_max such that ρ > ρ_max implies s*(ρ) ∈ B°, i.e., ‖s*(ρ)‖ < 1. By Eq. (30), ‖s*(ρ)‖ < 1 implies that all components of u*(ρ) are equal. The following theorem gives the explicit expression for ρ_max and for u* when ρ ≥ ρ_max.

Theorem 1. Let ρ_max = ‖(GG^T)^{-1}Gw‖ and let I be the identity matrix. If ρ ≥ ρ_max, the optimal solution of problem (18) is given by

$$u^* = \left[ I - G^T(GG^T)^{-1}G \right] w. \qquad (32)$$

Remark 5. The expression for ρ_max implies that if w belongs to the null space of G, i.e., all three components of w are equal, then ρ_max = 0. In this case, the solution to problem (18) is u* = w [note that u* is the projection of w onto the null space of G by Eq. (32)], which is independent of ρ. To avoid this trivial case, we always assume w is not a constant signal. Therefore, we only consider ρ_max > 0.

We solve for u∗ with ρ < ρmax in the following theorem.

Theorem 2. Let G = UΣV^T be the singular value decomposition of G. If ρ < ρ_max, the optimal solution to problem (18) is given by

$$u^* = \left[ I - G^T\left(GG^T + \frac{\alpha}{\rho^2}I\right)^{-1}G \right] w, \qquad (33)$$

where α > 0 solves the following quartic equation:

$$\alpha^4 + c_3\alpha^3 + c_2\alpha^2 + c_1\alpha + c_0 = 0. \qquad (34)$$

The parameters are given by

$$c_0 = \rho^8\sigma_1^4\sigma_2^4 - \rho^6(\bar{w}_1^2\sigma_2^4 + \bar{w}_2^2\sigma_1^4),$$
$$c_1 = 2\rho^6(\sigma_1^2\sigma_2^4 + \sigma_1^4\sigma_2^2) - 2\rho^4(\bar{w}_1^2\sigma_2^2 + \bar{w}_2^2\sigma_1^2),$$
$$c_2 = \rho^4(\sigma_1^4 + \sigma_2^4 + 4\sigma_1^2\sigma_2^2) - \rho^2(\bar{w}_1^2 + \bar{w}_2^2),$$
$$c_3 = 2\rho^2(\sigma_1^2 + \sigma_2^2),$$

where w̄ = ΣV^T w, σ_1^2 = 3, and σ_2^2 = 1. Eq. (34) has a unique positive root if ρ < ρ_max.

Remark 6. There are numerous efficient algorithms¹ to solve Eq. (34). Closed form solutions of quartic equations are also available², but they are computationally expensive and numerically unstable. In this paper, we use Newton's method to find the nonnegative root of Eq. (34), which never fails in practice. For a discussion of the root finding techniques, we refer the reader to the supplement.

¹ http://en.wikipedia.org/wiki/Quartic_function
² We found instances where the closed form solution cannot give correct answers due to the limits of machine precision.
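In code, Theorems 1 and 2 give the following solver for subproblem (18). This is a minimal numpy sketch (names are ours), and, as a simplification of the paper's approach, it runs Newton's method directly on φ(α) = ‖v(α)‖² − 1 (the form derived in the supplement) rather than on the expanded quartic; the two formulations have the same positive root.

```python
import numpy as np

G = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])     # the difference operator of Eq. (27)

def solve_interior_subproblem(w, rho, tol=1e-12, max_iter=50):
    """Solve min_u 0.5||u - w||^2 + rho*||(u1-u2, u3-u2)|| via Theorems 1-2."""
    GGt = G @ G.T
    rho_max = np.linalg.norm(np.linalg.solve(GGt, G @ w))
    if rho >= rho_max:                # Theorem 1: project w onto null(G)
        return np.full(3, np.mean(w))
    # Theorem 2: find alpha > 0 with ||v(alpha)|| = 1, where
    # v_i = rho * wbar_i / (rho^2 * sigma_i^2 + alpha) and wbar = Sigma V^T w.
    _, s, Vt = np.linalg.svd(G)       # singular values: sqrt(3) and 1
    wbar = s * (Vt[:2] @ w)
    sig2 = s ** 2
    alpha = 0.0                       # phi is convex and decreasing with
    for _ in range(max_iter):         # phi(0) > 0, so Newton from 0 converges
        denom = rho ** 2 * sig2 + alpha
        phi = np.sum((rho * wbar / denom) ** 2) - 1.0
        if abs(phi) < tol:
            break
        alpha -= phi / (-2.0 * np.sum(rho ** 2 * wbar ** 2 / denom ** 3))
    # Eq. (33): u* = [I - G^T (G G^T + (alpha/rho^2) I)^{-1} G] w
    return w - G.T @ np.linalg.solve(GGt + (alpha / rho ** 2) * np.eye(2), G @ w)
```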

3.2. Solution to Subproblem (20)

Recall that problem (20) takes the form:

$$\min_u \; h(u) = \frac{1}{2}\|u - w\|^2 + \rho|u_2 - u_1|,$$

where ρ = λ/γ. We have the following theorem.

Theorem 3. Let u* be the optimal solution of (20). Then

$$u^* = \begin{cases} (w_1 - \rho,\; w_2 + \rho)^T, & \text{if } w_1 > w_2 + 2\rho, \\ (w_1 + \rho,\; w_2 - \rho)^T, & \text{if } w_1 < w_2 - 2\rho, \\ \left(\frac{w_1 + w_2}{2},\; \frac{w_1 + w_2}{2}\right)^T, & \text{otherwise.} \end{cases} \qquad (35)$$

Notice that the result in Theorem 3 is a small variation on the standard soft thresholding operator.
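Theorem 3 translates directly into code; a tiny sketch (the name is ours), used by the prox_tvk dispatcher of Section 2.3:

```python
def solve_boundary_subproblem(w1, w2, rho):
    """Closed-form solution of problem (20), per Eq. (35)."""
    if w1 > w2 + 2.0 * rho:
        return w1 - rho, w2 + rho
    if w1 < w2 - 2.0 * rho:
        return w1 + rho, w2 - rho
    mid = 0.5 * (w1 + w2)     # otherwise both entries collapse to the average
    return mid, mid
```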

Remark 7. Problems (18) and (20) can be viewed as proximal operators of particular functions. Take problem (18) for example. Let f(u) = ‖Gu‖. Then the optimal solution u* of problem (18) can be expressed as:

$$u^* = \mathrm{prox}_{\rho f}(w).$$

Moreover, the Fenchel conjugate (Ruszczynski, 2006) of ρf is given by:

$$(\rho f)^*(v) = \sup_u \left( u^T v - \rho f(u) \right) = \delta_{\rho\mathcal{B}}(v),$$

where δ_{ρℬ}(v) = 0 if v ∈ ρℬ and δ_{ρℬ}(v) = ∞ otherwise; that is, δ_{ρℬ}(v) is the indicator function of ρℬ. Thus,

$$\mathrm{prox}_{\delta_{\rho\mathcal{B}}}(w) = \arg\min_u \; \frac{1}{2}\|u - w\|^2 + \delta_{\rho\mathcal{B}}(u) = P_{\rho\mathcal{B}}(w).$$

By Moreau decomposition (Rockafellar, 1970), we have

$$\mathrm{prox}_{\rho f}(w) = w - \mathrm{prox}_{(\rho f)^*}(w) \;\Rightarrow\; u^* = w - \rho G^T s^*,$$

which is the same as Eq. (29). By a similar analysis as in Section 3.1, we can also solve (18) this way. For more details, please refer to Parikh & Boyd (2013); Pustelnik et al. (2011).

3.3. Complexity Analysis of Algorithm 1

Each iteration of Algorithm 1 involves updates of the primal variables Z, X and the dual variables Θ. We update Z and Θ element-wise by simple arithmetic operations, so the complexity is O(N), where N = m × n is the number of variables/pixels. The update of X involves solving a set of small and independent subproblems. Because each subproblem involves at most three variables, the complexity of updating X is proportional to N. Thus, the complexity of each iteration of Algorithm 1 is O(N). From Section 2.4, to obtain an ε-optimal solution, Algorithm 1 needs O(1/ε) steps (He & Yuan, 2012). Thus, the complexity of Algorithm 1 to achieve an ε-optimal solution is O(N/ε).

Moreover, the discussion above reveals an appealing feature of Algorithm 1: every step can be executed in parallel. Therefore, with n_p processors, we can run FAD simultaneously on all of them. Each processor solves a subset of the subproblems (18) and (20) to update X, and updates a small piece of Z and Θ. We call the parallel version of FAD "pFAD". The complexity of pFAD is O(N/(ε n_p)). Clearly, the more processors we have, the more efficient pFAD will be. This will be demonstrated in our experiments.

4. Implementation of pFAD via MPI

By the decomposition strategy in Section 2.2, the TV model in (1) can be decoupled into a set of small and independent subproblems as in (18) and (20). To fully exploit this feature, we implement pFAD via MPI so that pFAD can run in parallel on large systems of interconnected computer clusters. One of the key advantages of pFAD is that the communication overhead across processors is very low: if we divide the image into several blocks, only the data on the "boundaries" need to be exchanged among "adjacent" blocks (please refer to Fig. 1).
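To illustrate the communication pattern (this is our own toy sketch under an assumed row-block partition, not the authors' code), a minimal mpi4py example of a one-row halo exchange between neighboring ranks:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1200                 # toy image width (assumed)
rows = n // size         # rows owned by this rank (toy partition)

# Local block plus one "ghost" row above and below for the finite differences.
local = np.zeros((rows + 2, n))
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# After each X-update, only the two boundary rows travel between neighbors.
comm.Sendrecv(local[1], dest=up, recvbuf=local[-1], source=down)
comm.Sendrecv(local[-2], dest=down, recvbuf=local[0], source=up)
```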

Fig. 2 shows an example of the synthetic images used for testing. Each image Y ∈ ℝ^{n×n} has 8 randomly generated blocks. We add Gaussian noise N(0, 0.2²) to generate the noisy image Y. The intensities are scaled to [0, 1]. Fig. 3(a) shows the speedup of pFAD over FAD as we increase the number of processors. The number of processors c is set to 1, 2, 4, 9, 16, 25, 100, 400. For each specific number of processors, we fix λ = 0.35 and vary n, the side length of the test images, from 100 to 5000 with a step size of 100 (we observed similar patterns under different parameters). For each n, we perform 10 trials and record the average performance. Then we average the performance across all different n to get the speedup for each specific number of processors. Recall that Algorithm 1 involves updating the primal variables Z and X, and the dual variables Θ. Except for X, the updates of Z and Θ are element-wise and only require a few simple arithmetic operations. Thus, the dominant part of Algorithm 1 is the update of X, since it requires solving a large number of quartic equations. In Fig. 3(a), we report the speedup of updating X and of the total computational time. We can see that the speedup of updating X approximately equals the number of processors. For 400 processors, the speedup is about 377 times. Fig. 3(a) also indicates that the speedup of the total computational time is about 357 times for 400 processors, since the cost of updating Z and Θ is very low.

Figure 2. Example of the synthetic images used to evaluate the performance of MFISTA, SplitBregman and FAD/pFAD. (a) Original image; (b) noisy image.

Comparison with Other Algorithms. We compare FAD and its parallel variant pFAD with several state-of-the-art competitors, including MFISTA³ (Beck & Teboulle, 2009a) and the SplitBregman method⁴ (Goldstein & Osher, 2009). Note that it is not straightforward to implement MFISTA or SplitBregman in parallel, since both of them involve manipulating or solving large-scale linear systems. Thus, we implement pFAD with OpenMP for this experiment and test all of the algorithms on a server with 4 quad-core (16 processors in total) Intel Xeon 2.93GHz CPUs and 65GB memory. To thoroughly evaluate the performance of the aforementioned methods, we test them on images whose size varies from 100 × 100 to 5000 × 5000.

³ iew3.technion.ac.il/~becka/papers/tv_fista.zip
⁴ tag7.web.rice.edu/Split_Bregman.html

Figure 3. Efficiency evaluation. Fig. 3(a) shows the speedup of pFAD with respect to FAD using multiple processors. Fig. 3(b) and Fig. 3(c) present the efficiency comparison of FAD, pFAD, SplitBregman and MFISTA in terms of computational time.

We compare the performance of the above algorithms by varying n and λ, respectively. We first fix λ = 0.35 and vary the side length n of the images from 100 to 5000 with a step size of 100. Then we fix n = 1000 and vary λ from 0.15 to 0.9 with a step size of 0.05. For both settings, we perform 10 trials and report the average performance in Fig. 3(b) and Fig. 3(c), respectively. For a fair comparison, in each setting, we first run FAD and record the objective function value when it stops. Then we run all the other algorithms until their objective function values are no larger than that of FAD (for FAD/pFAD, we set ε = 10^{-4} and γ = 10).

Fig. 3(b) indicates that the performance of FAD is comparable to the SplitBregman method and about 6 times faster than MFISTA. pFAD further improves the efficiency of FAD by about one order of magnitude. Thus, pFAD is about 60 and 10 times faster than the MFISTA and SplitBregman methods respectively. Fig. 3(c) further demonstrates the efficiency of pFAD. We can observe that pFAD and FAD become slightly faster as λ increases. From Theorems 1 and 2, when ρ ≥ ρ_max (λ is proportional to ρ since γ is fixed), problem (18) admits a closed form solution; otherwise, we have to solve the quartic equation (34). Overall, FAD/pFAD is more stable than SplitBregman and MFISTA.

5. Applications

We apply pFAD to the problems of image denoising, deblurring and reconstruction, and present the results below.

5.1. Image Denoising

We apply our methods to solve the most basic TV model, i.e., the image denoising problem. The original image⁵ is a high resolution photograph of the moon with 5100 × 4768 pixels. The noisy image is obtained by adding Gaussian noise N(0, 0.2²). The regularization parameter λ is set to 0.2. To give a better view, we show the results over a sub-image with 512 × 512 pixels, whose upper left corner is at (1800, 2200). The original image in Fig. 4(a) is about 13MB. The denoising process costs pFAD, SplitBregman and MFISTA 46.87, 445.66 and 2695.28 seconds respectively. MFISTA takes about 45 minutes. SplitBregman is much more efficient than MFISTA, but still spends about 7 minutes. Our approach, pFAD, needs only tens of seconds, and thus performs the best.

⁵ http://blog.astrophotographytargets.com/2012/03/

Figure 4. Image denoising. (a) Original image; (b) subimage; (c) noisy subimage; (d) denoised result.

5.2. Image Deblurring

The TV-based image deblurring model is:

$$\min_X \; \|\mathcal{B}(X) - Y\|_F^2 + \lambda\|X\|_{TV}, \qquad (36)$$

where X ∈ ℝ^{m×n} is the image to be recovered, ℬ : ℝ^{m×n} → ℝ^{m×n} is a linear transformation encoding a certain blurring operation, and Y is the observed noisy and blurred image. Problem (36) can be efficiently solved via FISTA (Beck & Teboulle, 2009b). The key step is to solve the proximal operator of the TV term, which is equivalent to problem (1). Therefore, we can apply FAD or pFAD to the proximal operator to speed up the computation. The original image "text" in Fig. 5(a) is taken from the Matlab image processing toolbox. We then apply a 7 × 7 Gaussian filter with standard deviation 5 to get the blurred image. Fig. 5(b) is obtained by adding Gaussian noise N(0, 0.02²) to the blurred image. The regularization parameter λ of pFAD is set to 0.001. Fig. 5(c) shows the deblurred image via FISTA-pFAD.


Figure 5. Image deblurring via FISTA-pFAD.

Table 1. Computational time (in seconds) for MRI reconstruction.

               CARDIAC   BRAIN   CHEST   ARTERY
FCSA-MFISTA     0.682     0.607   0.633   0.712
FCSA-pFAD       0.037     0.033   0.041   0.031

5.3. Image Reconstruction

Magnetic Resonance Imaging (MRI) is a powerful tool in medical diagnosis because of its excellent performance in detecting soft tissue changes. However, a typical MRI examination procedure may take up to 45 minutes. Noting the sparse nature of signals in a transformed domain, recent advances in compressive sensing (CS) theory (Candes et al., 2006; Donoho, 2006) show that it is possible to accurately reconstruct signals from a limited number of undersampled measurements. Therefore, the duration of an MRI examination can be significantly reduced. A popular formulation of the image reconstruction problem takes the form (Lustig et al., 2007; Trzasko et al., 2007):

$$\min_X \; \frac{1}{2}\|\mathcal{F}(X) - Y\|^2 + \lambda_1\|\mathcal{W}(X)\|_1 + \lambda_2\|X\|_{TV}, \qquad (37)$$

where X is the signal to be recovered, Y is the observed undersampled data, and 𝓕 and 𝓦 are partial Fourier and wavelet transformations respectively. Huang et al. (2011) proposed an efficient algorithm called FCSA⁶ to solve (37). One of the key steps of FCSA is to solve the proximal operator of the TV term by MFISTA. To further improve the efficiency of FCSA, we use pFAD to replace MFISTA.

We compare the performance of FCSA-MFISTA and FCSA-pFAD on four MR images: cardiac, brain, chest and artery. We use the default settings of the FCSA package. The sample ratio is about 20%, and Gaussian noise N(0, 0.01²) is added to the undersampled data to generate Y. For a fair comparison, we first run FCSA-pFAD and record the objective function values returned by pFAD. Then we run FCSA-MFISTA and make sure it achieves the same precision level as FCSA-pFAD at each iteration. We run both FCSA-pFAD and FCSA-MFISTA for 50 steps. Fig. 6 presents the results obtained by FCSA-MFISTA and FCSA-pFAD, which are almost identical to each other. Because both FCSA-MFISTA and FCSA-pFAD solve the same problem and achieve the same accuracy level, the resulting reconstructions and SNR are expected to be the same. The average time for the 50 iterations of the two methods is reported in Table 1, which indicates that FCSA-pFAD is about 20 times faster than FCSA-MFISTA.

⁶ ranger.uta.edu/~huang/R_CSMRI.htm

Figure 6. MRI reconstruction. The left column shows the original images ((a) Cardiac, (b) Brain, (c) Chest, (d) Artery). The images in the middle and right columns are reconstructed by FCSA-MFISTA and FCSA-pFAD respectively. The SNRs for Cardiac, Brain, Chest and Artery are 17.56, 20.35, 16.06 and 23.70 respectively.

6. Conclusion

In this paper, we propose a fast alternating direction method for isotropic TV models. Our method is based on a novel decomposition strategy for the problem domain. Based on the decomposition, the TV based models can be decoupled into a set of small and independent subproblems, which can be solved efficiently in parallel. Unlike existing ADMM based approaches, such as the SplitBregman method, one appealing feature of our method is that no large-scale linear system is involved. Our empirical evaluation demonstrates the very low communication overhead and high scalability of the proposed algorithm. One of our future directions is to extend our approach to higher dimensional problems, such as video and functional MRI denoising.


References

Barbero, A. and Sra, S. Fast Newton-type methods for total variation regularization. In ICML, 2011.

Beck, A. and Teboulle, M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18:2419-2434, 2009a.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183-202, 2009b.

Bertsekas, D. P. Convex Analysis and Optimization. Athena Scientific, 2003.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122, 2011.

Candes, E., Romberg, J., and Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52:489-509, 2006.

Casas, E., Kunisch, K., and Pola, C. Regularization by functions of bounded variation and applications to image enhancement. Applied Mathematics & Optimization, 40:229-257, 1999.

Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40:120-145, 2011.

Chan, R. H., Yang, J. F., and Yuan, X. M. Alternating direction method for image inpainting in wavelet domains. SIAM Journal on Imaging Sciences, 4:807-826, 2011.

Chen, C., He, B., and Yuan, X. Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis, 32:227-245, 2012.

Combettes, P. and Pesquet, J. Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2011.

Condat, L. A direct algorithm for 1-D total variation denoising. IEEE Signal Processing Letters, 20:1054-1057, 2013.

Donoho, D. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289-1306, 2006.

Duan, Y. and Tai, X. Domain decomposition methods with graph cuts algorithms for total variation minimization. Adv. Comput. Math., 36:175-199, 2012.

Eckstein, J. and Bertsekas, D. P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293-318, 1992.

Esser, E. Applications of Lagrangian-based alternating direction methods and connections to split Bregman. Technical report, UCLA CAM, 2009.

Esser, E., Zhang, X. Q., and Chan, T. F. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imag. Sci., 3:1015-1046, 2010.

Glowinski, R. and Le Tallec, P. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM Studies in Applied Mathematics, 1989.

Goldfarb, D. and Yin, W. Parametric maximum flow algorithms for fast total variation minimization. SIAM J. Sci. Comput., 31:3712-3743, 2009.

Goldstein, T. and Osher, S. The split Bregman method for L1-regularized problems. SIAM J. Imag. Sci., 2:323-342, 2009.

He, B. and Yuan, X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal., 50:700-709, 2012.

He, L., Chang, T. C., and Osher, S. MR image reconstruction from sparse radial samples by using iterative refinement procedures. In 13th Annual Meeting of ISMRM, 2005.

Huang, J., Zhang, S., and Metaxas, D. Efficient MR image reconstruction for compressed MR imaging. Medical Image Analysis, 15:670-679, 2011.

Kim, D., Sra, S., and Dhillon, I. A scalable trust-region algorithm with application to mixed-norm regression. In ICML, 2010.

Lustig, M., Lee, J. H., Donoho, D. L., and Pauly, J. M. Faster imaging with randomly perturbed undersampled spirals and l1 reconstruction. In 13th Scientific Meeting of the ISMRM, 2005.

Lustig, M., Donoho, D., and Pauly, J. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med., 58:1182-1195, 2007.

Ma, S., Yin, W., Zhang, Y., and Chakraborty, A. An efficient algorithm for compressed MR imaging using total variation and wavelets. In CVPR, 2008.

Monteiro, R. and Svaiter, B. F. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. Manuscript, 2010.

Osher, S., Burger, M., Goldfarb, D., Xu, J., and Yin, W. An iterative regularization method for total variation-based image restoration. Multiscale Model. Simul., 4:460-489, 2005.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1:123-231, 2013.

Pustelnik, N., Chaux, C., and Pesquet, J.-C. Parallel proximal algorithm for image restoration using hybrid regularization. IEEE Transactions on Image Processing, 20:2450-2462, 2011.

Rockafellar, R. Convex Analysis. Princeton University Press, 1970.

Rudin, L. I., Osher, S. J., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Phys. D, 60:259-268, 1992.

Ruszczynski, A. Nonlinear Optimization. Princeton University Press, 2006.

Trzasko, J., Manduca, A., and Borisch, E. Sparse MRI reconstruction via multiscale L0-continuation. In 14th IEEE/SP Workshop on Statistical Signal Processing, 2007.

Vert, J. P. and Bleakley, K. Fast detection of multiple change-points shared by many signals using group LARS. In NIPS, 2010.

Vogel, C. and Oman, M. Fast, robust total variation-based reconstruction of noisy, blurred images. IEEE Trans. Image Process., 7:813-824, 1998.

Wahlberg, B., Boyd, S., Annergren, M., and Wang, Y. An ADMM algorithm for a class of total variation regularized estimation problems. In IFAC Symposium on System Identification, 2012.

Yang, S., Wang, J., Fan, W., Zhang, X., Wonka, P., and Ye, J. An efficient ADMM algorithm for multidimensional anisotropic total variation regularization problems. In SIGKDD, 2013.


A Highly Scalable Parallel Algorithm for Isotropic Total Variation Models (Supplemental Material)

In this supplement, we present all the necessary details we mentioned in the main text.

1. Algorithm of ADMM

Denote by T the number of iterations required to meet the stopping criterion. The update rule in each iteration is given by Algorithm 1. For more details, please refer to Boyd et al. (2011).

Algorithm 1 ADMM
Input: z^0, θ^0, γ > 0
for t = 0 to T do
  x^{t+1} := argmin_x L_γ(x, z^t; θ^t)   (1)
  z^{t+1} := argmin_z L_γ(x^{t+1}, z; θ^t)   (2)
  θ^{t+1} := θ^t + (x^{t+1} − z^{t+1})   (3)
end for
Return: x^T, z^T

2. Convergence Analysis

The standard form of ADMM is

$$\min_{x,z} \; f(x) + g(z) \qquad (4)$$
$$\text{s.t. } x - z = 0.$$

Algorithm FAD finds the solution of the following constrained convex optimization problem:

$$\min_{Z;\,X_1,X_2,X_3} \; \frac{1}{2}\|Z - Y\|_F^2 + \lambda\sum_{k=1}^{3}\|X_k\|_{TV_k} \qquad (5)$$
$$\text{s.t. } X_k = Z, \quad k = 1, 2, 3.$$

The convergence properties of ADMM for the standard form (4) have been extensively explored by He & Yuan (2012); Boyd et al. (2011); Eckstein & Bertsekas (1992). Therefore, to establish the convergence properties of Algorithm FAD, we only need to reformulate problem (5) as (4) and check that the resulting formulation satisfies the conditions required for convergence.

Let us denote z := (Z^T, Z^T, Z^T)^T, x := (X_1^T, X_2^T, X_3^T)^T, and T_1 := (I_m, 0, 0), T_2 := (0, I_m, 0), T_3 := (0, 0, I_m), where I_m ∈ ℝ^{m×m} is the identity matrix, and 𝓘 := (I_m, I_m, I_m)^T ∈ ℝ^{3m×m}. Then problem (5) can be rewritten as:

$$\min_{x,z} \; f(x) + g(z) \qquad (6)$$
$$\text{s.t. } x - z = 0,$$

where

$$f(x) = \lambda\sum_{k=1}^{3}\|T_k x\|_{TV_k}, \qquad g(z) = \frac{1}{6}\|z - \mathcal{I}Y\|_F^2.$$

Clearly, problem (6) is exactly of the same form as (4). Since f and g are proper closed convex functions, the convergence properties of Algorithm FAD can be readily established following He & Yuan (2012); Boyd et al. (2011); Eckstein & Bertsekas (1992). The convergence rate of Algorithm FAD can be shown to be O(1/k) by following the procedure in He & Yuan (2012).

3. Dual Formulation

Recall that we need to solve

$$\min_u \; \frac{1}{2}\|u - w\|_2^2 + \rho\sqrt{(u_1 - u_2)^2 + (u_3 - u_2)^2}, \qquad (7)$$

where ρ = λ/γ.

3.1. Derivation of the Dual Formulation

Without loss of generality, assume w ∈ ℝ^{2n+1}; we solve the following optimization problem:

$$\min_u \; \frac{1}{2}\|u - w\|_2^2 + \rho\sum_{k=1}^{n}\sqrt{(u_{2k-1} - u_{2k})^2 + (u_{2k+1} - u_{2k})^2}. \qquad (8)$$

Clearly, problem (8) reduces to problem (7) when n = 1. Let z ∈ ℝ^{2n} be defined as:

$$z_i = u_{i+1} - u_i, \quad i = 1, 2, \ldots, 2n.$$

Then problem (8) is equivalent to the following constrained optimization problem:

$$\min_u \; \frac{1}{2}\|u - w\|_2^2 + \rho\sum_{k=1}^{n}\sqrt{z_{2k-1}^2 + z_{2k}^2} \qquad (9)$$
$$\text{s.t. } z_i = u_{i+1} - u_i, \quad i = 1, 2, \ldots, 2n.$$

Let G ∈ ℝ^{2n×(2n+1)} be defined as

$$g_{i,j} = \begin{cases} 1 & \text{if } j = i + 1 \\ -1 & \text{if } j = i \\ 0 & \text{otherwise;} \end{cases} \qquad (10)$$

then we have z = Gu.

By introducing the scaled dual variable ρs ∈ ℝ^{2n}, the Lagrangian of problem (8) can be written as:

$$L(u, z; s) = \frac{1}{2}\|u - w\|_2^2 + \rho\sum_{k=1}^{n}\sqrt{z_{2k-1}^2 + z_{2k}^2} + \rho\langle s, Gu - z\rangle, \qquad (11)$$

and the dual function is

$$g(s) = \min_{u,z} L(u, z; s). \qquad (12)$$

Let

$$f_1(u) = \frac{1}{2}\|u - w\|_2^2 + \rho\langle s, Gu\rangle$$

and

$$f_2(z) = \rho\sum_{k=1}^{n}\sqrt{z_{2k-1}^2 + z_{2k}^2} - \rho\langle s, z\rangle.$$

We can see that

$$L(u, z; s) = f_1(u) + f_2(z)$$

and

$$g(s) = \min_u f_1(u) + \min_z f_2(z). \qquad (13)$$


Let us first solve the optimization problem min_u f_1(u). Since ∇f_1(u) = u − w + ρG^T s, by setting ∇f_1(u) = 0, we have

$$u^* = \arg\min_u f_1(u) = w - \rho G^T s \qquad (14)$$

and

$$f_1(u^*) = \min_u f_1(u) = -\frac{\rho^2}{2}\|G^T s\|_2^2 + \rho\langle G^T s, w\rangle = -\frac{1}{2}\|w - \rho G^T s\|_2^2 + \frac{1}{2}\|w\|_2^2.$$

To solve the second optimization problem min_z f_2(z), let us set some conventions. For an arbitrary vector v, let [v]_i^j = (v_i, v_{i+1}, \ldots, v_j)^T. Then f_2(z) can be written as:

$$f_2(z) = \rho\sum_{k=1}^{n}\|[z]_{2k-1}^{2k}\|_2 - \rho\langle s, z\rangle, \qquad (15)$$

and thus the subdifferential satisfies

$$[\partial f_2(z)]_{2k-1}^{2k} \ni \begin{cases} \rho\left\{ \dfrac{[z]_{2k-1}^{2k}}{\|[z]_{2k-1}^{2k}\|_2} - [s]_{2k-1}^{2k} \right\}, & \text{if } [z]_{2k-1}^{2k} \ne 0, \\[2mm] \rho\left\{ v_k - [s]_{2k-1}^{2k} \right\}, \; \|v_k\|_2 \le 1, & \text{if } [z]_{2k-1}^{2k} = 0. \end{cases} \qquad (16)$$

For each k ∈ {1, ..., n}, setting $[\partial f_2(z)]_{2k-1}^{2k} \ni 0$ requires

$$\|[s]_{2k-1}^{2k}\|_2 \le 1, \qquad (17)$$

and then

$$f_2(z^*) = \min_z f_2(z) = 0,$$

where z* = argmin_z f_2(z).

Therefore, the dual function g(s) is:

$$g(s) = -\frac{1}{2}\|w - \rho G^T s\|_2^2 + \frac{1}{2}\|w\|_2^2. \qquad (18)$$

Altogether, the dual problem of problem (8) is equivalent to the following optimization problem:

$$\min_s \; \frac{1}{2}\|w - \rho G^T s\|_2^2 \qquad (19)$$
$$\text{s.t. } \|[s]_{2k-1}^{2k}\|_2 \le 1, \quad k = 1, 2, \ldots, n.$$

3.2. KKT Conditions

The KKT conditions can be obtained directly from the last section. Assume u*, z* and s* are the optimal solutions to problems (8) and (19) respectively. We summarize the KKT conditions as follows:

$$w = u^* + \rho G^T s^*, \qquad (20)$$

$$[s^*]_{2k-1}^{2k} \in \begin{cases} \dfrac{[z^*]_{2k-1}^{2k}}{\|[z^*]_{2k-1}^{2k}\|_2}, & \text{if } [z^*]_{2k-1}^{2k} \ne 0, \\[2mm] v_k, \; \|v_k\|_2 \le 1, & \text{if } [z^*]_{2k-1}^{2k} = 0. \end{cases} \qquad (21)$$

Because z* = Gu*, condition (21) can be written more explicitly as:

$$[s^*]_{2k-1}^{2k} \in \begin{cases} \dfrac{(u_{2k} - u_{2k-1},\; u_{2k+1} - u_{2k})^T}{\|(u_{2k} - u_{2k-1},\; u_{2k+1} - u_{2k})^T\|_2}, & \text{if } (u_{2k} - u_{2k-1},\; u_{2k+1} - u_{2k})^T \ne 0, \\[2mm] v_k, \; \|v_k\|_2 \le 1, & \text{if } (u_{2k} - u_{2k-1},\; u_{2k+1} - u_{2k})^T = 0. \end{cases} \qquad (22)$$

4. Proposition 1

Proposition 1. Given an image with m × n pixels, if we divide the image domain into three non-overlapping subsets

$$\mathcal{D}_k := \{(i, j) \in \mathcal{D} : \mathrm{mod}(j - i, 3) = k - 1\}, \qquad (23)$$

where k = 1, 2, 3, and we define

$$\|X\|_{TV}^{(k)} := \sum_{(i,j)\in\mathcal{D}_k}\|D_{i,j}\|_2, \quad k = 1, 2, 3,$$

then for each (i, j) ∈ D_k, ‖D_{i,j}‖_2 is separable from every other ‖D_{i′,j′}‖_2, where (i′, j′) ∈ D_k \ (i, j).


Proof. To simplify the proof, let us extend D to an infinite set, i.e.,

$$\mathcal{D} = \{(i, j) : i = -\infty, \ldots, -1, 0, 1, \ldots, \infty, \; j = -\infty, \ldots, -1, 0, 1, \ldots, \infty\}.$$

If the conclusion of Proposition 1 is true for the extended D, then it also holds when D is a finite set. Therefore let us prove Proposition 1 for the former case.

Without loss of generality, let us consider D_1. Every term ‖D_{i,j}‖_2 in ‖X‖_{TV}^{(1)} involves three variables, x_{i,j}, x_{i+1,j} and x_{i,j+1}, and the pixels (i, j), (i+1, j) and (i, j+1) all belong to D (since we extended D to an infinite set).

Consider the variable x_{i,j}. Apart from ‖D_{i,j}‖_2, it appears only in ‖D_{i-1,j}‖_2 and ‖D_{i,j-1}‖_2. Since (i, j) ∈ D_1, we have mod(j − i, 3) = 0. Therefore it follows that

$$\mathrm{mod}(j - (i - 1), 3) = 1 \quad \text{and} \quad \mathrm{mod}((j - 1) - i, 3) = 2.$$

According to the definitions of D_2 and D_3, we can see that (i−1, j) ∈ D_2 and (i, j−1) ∈ D_3. In other words, except for ‖D_{i,j}‖_2, the variable x_{i,j} does not appear in any other term ‖D_{i′,j′}‖_2 with (i′, j′) ∈ D_1. For the same reason, the variables x_{i+1,j} and x_{i,j+1} appear only in ‖D_{i,j}‖_2 among all ‖D_{i″,j″}‖_2 with (i″, j″) ∈ D_1.

Therefore ‖D_{i,j}‖_2 is separable from all the other terms in ‖X‖_{TV}^{(1)}, which completes the proof.

5. Proof of Theorem 1

Proof. Denote the objective function of the dual problem by f(s). Suppose ρ is large enough that t*(ρ) ∈ relint ℬ and thus s*(ρ) ∈ B°. Then we have

$$\nabla f(s^*) = -\rho G(w - \rho G^T s^*) = 0, \qquad (24)$$

which is equivalent to

$$(GG^T)^{-1}Gw = \rho s^*. \qquad (25)$$

Therefore, we have

$$\|(GG^T)^{-1}Gw\|_2 = \rho\|s^*\|_2.$$

It follows that

$$\rho > \|(GG^T)^{-1}Gw\|_2 \;\Rightarrow\; \|s^*\|_2 < 1.$$

Combining the KKT condition u* = w − ρG^T s* with Eq. (25), u* is:

$$u^* = w - G^T(GG^T)^{-1}Gw,$$

which is the result in the statement. Note that u* is independent of ρ.

Recall that s*(ρ) varies continuously with ρ. By the KKT condition u* = w − ρG^T s*, u* is also continuous with respect to ρ. Thus

$$u^*(\rho_{max}) = \lim_{\rho \downarrow \rho_{max}} u^*(\rho) = w - G^T(GG^T)^{-1}Gw,$$

which completes the proof.

6. Proof of Theorem 2

Proof. Let s* be the optimal solution to the dual problem. The KKT conditions are:

$$-\rho G(w - \rho G^T s^*) + \alpha s^* = 0, \qquad (26)$$
$$\alpha\left( (s_1^*)^2 + (s_2^*)^2 - 1 \right) = 0, \qquad (27)$$

where α ≥ 0 is the Lagrange multiplier and Eq. (27) is the complementarity condition.

By rearranging the terms, Eq. (26) becomes

$$\rho(\rho^2 GG^T + \alpha I)^{-1}Gw = s^*.$$


It follows that

$$\|\rho(\rho^2 GG^T + \alpha I)^{-1}Gw\|_2^2 = \|s^*\|_2^2 \qquad (28)$$
$$\Rightarrow\; \left\| \rho\,\frac{1}{\rho^2}\left( U\Sigma^2 U^T + \frac{\alpha}{\rho^2}UU^T \right)^{-1}Gw \right\|_2^2 = \|s^*\|_2^2$$
$$\Rightarrow\; \left\| U\left( \Sigma^2 + \frac{\alpha}{\rho^2}I \right)^{-1}\Sigma V^T w \right\|_2^2 = \rho^2\|s^*\|_2^2$$
$$\Rightarrow\; \left\| \left( \Sigma^2 + \frac{\alpha}{\rho^2}I \right)^{-1}\bar{w} \right\|_2^2 = \rho^2\|s^*\|_2^2,$$

where

$$\Sigma^2 = \Sigma\Sigma^T = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \quad \text{and} \quad \bar{w} = \Sigma V^T w.$$

Eq. (28) is equivalent to

$$\left( \frac{\rho\bar{w}_1}{\rho^2\sigma_1^2 + \alpha} \right)^2 + \left( \frac{\rho\bar{w}_2}{\rho^2\sigma_2^2 + \alpha} \right)^2 = \|s^*\|_2^2. \qquad (29)$$

Let v ∈ ℝ², where

$$v_1 = \frac{\rho\bar{w}_1}{\rho^2\sigma_1^2 + t} \quad \text{and} \quad v_2 = \frac{\rho\bar{w}_2}{\rho^2\sigma_2^2 + t}.$$

Clearly, v(t) is a plane curve parameterized by t ≥ 0, and it is easy to see that

$$\lim_{t\to\infty}\|v(t)\|_2^2 = 0 \quad \text{and} \quad \frac{d\|v(t)\|_2^2}{dt} < 0, \; \forall t \ge 0.$$

Therefore, by Eq. (29), we can see that:

1) If ‖v(0)‖_2^2 < 1, then ‖s*‖_2^2 = ‖v(α)‖_2^2 < 1. The complementarity condition Eq. (27) then implies α = 0, so Eq. (26) reduces to Eq. (25), which has been solved in Theorem 1.

It is worthwhile to note that

$$\|v(0)\|_2^2 = \|(\rho\Sigma^2)^{-1}\bar{w}\|_2^2 = \frac{1}{\rho^2}\|\Sigma^{-1}V^T w\|_2^2 = \frac{1}{\rho^2}\|U\Sigma^{-1}V^T w\|_2^2 = \frac{1}{\rho^2}\|(GG^T)^{-1}Gw\|_2^2.$$

Therefore, from ‖v(0)‖_2^2 < 1, we can derive ρ > ‖(GG^T)^{-1}Gw‖_2 = ρ_max, which is consistent with the results in Theorem 1.

2) If ‖v(0)‖_2^2 > 1, we must have α > 0, since ‖v(α)‖_2 = ‖s*‖_2 ≤ 1 [recall the constraint of problem (28) in the main text]. The complementarity condition then implies ‖s*‖_2 = 1. Clearly, the curve v(t), t ≥ 0, must intersect the unit circle at exactly one point, so Eq. (29) has a unique positive root. Furthermore, from ‖v(0)‖_2^2 > 1, we can conclude that ρ < ‖(GG^T)^{-1}Gw‖_2 = ρ_max.

From the above, ρ < ρ_max implies ‖s*‖_2 = 1. Rearranging the terms in Eq. (29) then yields Eq. (30), which completes the proof.

In the proof of Theorem 2, we omitted the case ‖v(0)‖_2 = 1. By the same technique, ‖v(0)‖_2 = 1 implies ρ = ρ_max, which has been considered in Theorem 1. In fact, via the techniques in Theorem 2, the case ρ = ρ_max is equivalent to Eq. (30) having a unique root at 0.

7. Discussion of the Root Finding Technique

We need to solve the following quartic equation:

$$\alpha^4 + c_3\alpha^3 + c_2\alpha^2 + c_1\alpha + c_0 = 0, \qquad (30)$$

where

$$c_0 = \rho^8\sigma_1^4\sigma_2^4 - \rho^6(\bar{w}_1^2\sigma_2^4 + \bar{w}_2^2\sigma_1^4),$$
$$c_1 = 2\rho^6(\sigma_1^2\sigma_2^4 + \sigma_1^4\sigma_2^2) - 2\rho^4(\bar{w}_1^2\sigma_2^2 + \bar{w}_2^2\sigma_1^2),$$
$$c_2 = \rho^4(\sigma_1^4 + \sigma_2^4 + 4\sigma_1^2\sigma_2^2) - \rho^2(\bar{w}_1^2 + \bar{w}_2^2),$$
$$c_3 = 2\rho^2(\sigma_1^2 + \sigma_2^2).$$

In Theorem 2, we have already shown that Eq. (30) has a unique positive root when ρ < ρ_max; we can find this positive root efficiently by Newton's method. If we denote by p(α) the quartic function in Eq. (30), it is worthwhile to note that p(0) = c_0 < 0 if ρ < ρ_max. Therefore the initial point α_0 can be chosen such that p(α_0) > 0 (such an α_0 always exists since p(α) → ∞ as α → ∞). Should Newton's method fail, e.g., if there is an ᾱ between the root and α_0 such that p(ᾱ) > 0 and p′(ᾱ) = 0, we switch to the more stable solver via the eigendecomposition of a 4 × 4 companion matrix (Edelman & Murakami, 1995). However, Newton's method never failed in our experiments, and a few steps (fewer than 10) usually yield a very accurate solution (|p(α_t)| < 10^{-18}, where α_t is the value returned by Newton's method after t iterations).
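As a quick numerical sanity check of this claim (our own toy example, with assumed values ρ = 0.5 and w̄ = (2, −1)^T, for which ρ < ρ_max), one can confirm that the quartic has exactly one positive real root:

```python
import numpy as np

rho = 0.5
s1sq, s2sq = 3.0, 1.0                  # sigma_1^2, sigma_2^2
w1, w2 = 2.0, -1.0                     # entries of wbar (toy values)

c3 = 2 * rho**2 * (s1sq + s2sq)
c2 = rho**4 * (s1sq**2 + s2sq**2 + 4 * s1sq * s2sq) - rho**2 * (w1**2 + w2**2)
c1 = 2 * rho**6 * (s1sq * s2sq**2 + s1sq**2 * s2sq) - 2 * rho**4 * (w1**2 * s2sq + w2**2 * s1sq)
c0 = rho**8 * s1sq**2 * s2sq**2 - rho**6 * (w1**2 * s2sq**2 + w2**2 * s1sq**2)

roots = np.roots([1.0, c3, c2, c1, c0])
positive = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0]
print(positive)                        # exactly one positive root
```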

8. Proof of Theorem 3

We want to solve:

$$\min_u \; h(u) = \frac{1}{2}\|u - w\|_2^2 + \rho|u_2 - u_1|, \qquad (31)$$

where ρ = λ/γ.

Proof. u* is the optimal solution of problem (31) if and only if 0 ∈ ∂h(u*), i.e.,

$$u^* = w + v\rho g, \qquad (32)$$

where g = (1, −1)^T and

$$v \in \begin{cases} \{1\}, & \text{if } u_2^* > u_1^*, \\ \{-1\}, & \text{if } u_2^* < u_1^*, \\ [-1, 1], & \text{if } u_2^* = u_1^*. \end{cases} \qquad (33)$$

Componentwise, (32) reads:

$$u_1^* = w_1 + v\rho \quad \text{and} \quad u_2^* = w_2 - v\rho.$$

Since |v| ≤ 1, if w_1 − ρ > w_2 + ρ, then u_1^* = w_1 + vρ ≥ w_1 − ρ > w_2 + ρ ≥ w_2 − vρ = u_2^*, i.e., u_1^* > u_2^*. By (33), v must then be −1, and therefore

$$u_1^* = w_1 - \rho \quad \text{and} \quad u_2^* = w_2 + \rho.$$

Similarly, when w_1 + ρ < w_2 − ρ, u_1^* must be smaller than u_2^*; by (33), v must be 1, and thus

$$u_1^* = w_1 + \rho \quad \text{and} \quad u_2^* = w_2 - \rho.$$

Otherwise, we have

$$w_2 - 2\rho \le w_1 \le w_2 + 2\rho,$$

which implies

$$-1 \le \frac{w_2 - w_1}{2\rho} \le 1.$$

Therefore, if we set v = (w_2 − w_1)/(2ρ), we have

$$w_1 + v\rho = w_2 - v\rho,$$

i.e.,

$$u_1^* = u_2^* = \frac{w_1 + w_2}{2}.$$

The results above can be summarized as:

$$u^* = \begin{cases} (w_1 - \rho,\; w_2 + \rho)^T, & \text{if } w_1 > w_2 + 2\rho, \\ (w_1 + \rho,\; w_2 - \rho)^T, & \text{if } w_1 < w_2 - 2\rho, \\ \left(\frac{w_1 + w_2}{2},\; \frac{w_1 + w_2}{2}\right)^T, & \text{otherwise,} \end{cases} \qquad (34)$$

which completes the proof.


References

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122, 2011.

Eckstein, J. and Bertsekas, D. P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293-318, 1992.

Edelman, A. and Murakami, H. Polynomial roots from companion matrix eigenvalues. Math. Comp., 64:763-776, 1995.

He, B. and Yuan, X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal., 50:700-709, 2012.