20141204.journal club


TRANSCRIPT

Page 1: 20141204.journal club

Journal Club paper introduction: "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation"

Aharon, M. et al., Nov. 2006, IEEE Trans. Signal Processing 54(11), 4311-4321

2014/12/04 [email protected]

Page 2: 20141204.journal club

Goal / what we did

• Let's try to understand Sparse Coding [Olshausen & Field 94] in the context of the K-means method

[Slide figure: a signal written as a weighted sum of dictionary atoms, with example coefficients 0.1, 0.3, 0.2]

Make as many of the coefficients zero as possible

Find D and x

Page 3: 20141204.journal club

A neural-network-style interpretation of Sparse Coding

[Slide figure: example coefficients 0.1, 0.3, 0.2]

Make as many of the coefficients zero as possible

With many zero elements, the transmission energy is small

Page 4: 20141204.journal club

Sparse Coding [Olshausen & Field 94]

• Wavelet-like representations
• V1 simple-cell-like feature extraction


[Slide figures: the input image Y (a natural image) and the elements of the dictionary D]

Page 5: 20141204.journal club

Formulation

Page 6: 20141204.journal club

Sparse Coding Formulation

4316 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 11, NOVEMBER 2006

Fig. 1. The K-means algorithm.

The representation MSE per example $y_i$ is defined as $e_i^2 = \|y_i - C x_i\|_2^2$, and the overall MSE is

$E = \sum_{i=1}^{N} e_i^2 = \|Y - C X\|_F^2.$ (15)

The VQ training problem is to find a codebook $C$ that minimizes the error $E$, subject to the limited structure of $X$, whose columns must be taken from the trivial basis:

$\min_{C,X} \|Y - C X\|_F^2$ subject to $\forall i,\ x_i = e_k$ for some $k$. (16)

The K-means algorithm is an iterative method used for designing the optimal codebook for VQ [39]. In each iteration there are two stages: one for sparse coding that essentially evaluates $X$ and one for updating the codebook. Fig. 1 gives a more detailed description of these steps.

The sparse coding stage assumes a known codebook and computes a feasible $X$ that minimizes the value of (16). Similarly, the dictionary update stage fixes $X$ as known and seeks an update of the codebook so as to minimize (16). Clearly, at each iteration, either a reduction or no change in the MSE is ensured. Furthermore, at each such stage, the minimization step is optimal under the assumptions. As the MSE is bounded from below by zero, and the algorithm ensures a monotonic decrease of the MSE, convergence to at least a local minimum solution is guaranteed. Note that we have deliberately chosen not to discuss stopping rules for the above-described algorithm, since those vary a lot but are quite easy to handle [39].

B. K-SVD—Generalizing the K-Means

The sparse representation problem can be viewed as a generalization of the VQ objective (16), in which we allow each input signal to be represented by a linear combination of codewords, which we now call dictionary elements. Therefore the coefficients vector is now allowed more than one nonzero entry, and these can have arbitrary values. For this case, the minimization corresponding to (16) is that of searching the best possible dictionary for the sparse representation of the example set:

$\min_{D,X} \|Y - D X\|_F^2$ subject to $\forall i,\ \|x_i\|_0 \le T_0$. (17)

A similar objective could alternatively be met by considering

$\min_{D,X} \sum_i \|x_i\|_0$ subject to $\|Y - D X\|_F^2 \le \epsilon$, (18)

for a fixed value $\epsilon$. In this paper, we mainly discuss the first problem (17), although the treatment is very similar.

In our algorithm, we minimize the expression in (17) iteratively. First, we fix $D$ and aim to find the best coefficient matrix $X$ that can be found. As finding the truly optimal $X$ is impossible, we use an approximation pursuit method. Any such algorithm can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries $T_0$.

Once the sparse coding task is done, a second stage is performed to search for a better dictionary. This process updates one column at a time, fixing all columns in $D$ except one, $d_k$, and finding a new column and new values for its coefficients that best reduce the MSE. This is markedly different from all the K-means generalizations that were described in Section III. All those methods freeze $X$ while finding a better $D$. Our approach is different, as we change the columns of $D$ sequentially and allow changing the relevant coefficients. In a sense, this approach is a more direct generalization of the K-means algorithm because it updates each column separately, as done in K-means. One may argue that in K-means the nonzero entries in $X$ are fixed during the improvement of the codebook, but as we shall see next, this is true because in the K-means (and the gain-shape VQ), the column update problems are decoupled, whereas in the more general setting, this should not be the case.

The process of updating only one column of $D$ at a time is a problem having a straightforward solution based on the singular value decomposition (SVD). Furthermore, allowing a change in the coefficient values while updating the dictionary columns accelerates convergence, since the subsequent column updates will be based on more relevant coefficients. The overall effect is very much in line with the leap from gradient descent to Gauss–Seidel methods in optimization.

Here one might be tempted to suggest skipping the step of sparse coding and using only updates of columns in $D$, along with their coefficients, applied in a cyclic fashion, again and again. This, however, will not work well, as the support of the representations will never be changed, and such an algorithm will necessarily fall into a local minimum trap.

C. K-SVD—Detailed Description

We shall now discuss the K-SVD in detail. Recall that our objective function is

$\min_{D,X} \|Y - D X\|_F^2$ subject to $\forall i,\ \|x_i\|_0 \le T_0$. (19)

[Slide figure: inputs Y = {y_i}_{i=1}^N, dictionary D with atoms d_k, coefficient matrix X]

Being NP-hard, this problem is genuinely difficult to solve directly.
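As a concrete reading of this objective, here is a minimal numpy sketch; the dimensions, the toy dictionary D, and the sparsity level T0 are made-up illustration values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, N, T0 = 20, 50, 100, 3        # signal dim, atoms, signals, max nonzeros (toy values)

D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)      # dictionary atoms (columns) normalized to unit norm

X = np.zeros((K, N))
for i in range(N):                  # toy coefficients: at most T0 nonzeros per column
    support = rng.choice(K, size=T0, replace=False)
    X[support, i] = rng.standard_normal(T0)
Y = D @ X                           # signals Y = {y_i} generated from the model

objective = np.linalg.norm(Y - D @ X, 'fro') ** 2          # ||Y - D X||_F^2
constraint_ok = (np.count_nonzero(X, axis=0) <= T0).all()  # ||x_i||_0 <= T0 for every i
print(objective, constraint_ok)
```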

Page 7: 20141204.journal club

MOD: related to [Olshausen & Field 94]

1. Fix the dictionary D (sparse coding phase): estimate the coefficients X with OMP/FOCUSS

2. Fix the coefficients X and update the dictionary (dictionary update phase)

4314 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 11, NOVEMBER 2006

[40], [41]. This method follows more closely the K-means outline, with a sparse coding stage that uses either OMP or FOCUSS followed by an update of the dictionary. The main contribution of the MOD method is its simple way of updating the dictionary. Assuming that the sparse coding for each example is known, we define the errors $e_i = y_i - D x_i$. The overall representation mean square error is given by

$E = \|[e_1, e_2, \dots, e_N]\|_F^2 = \|Y - D X\|_F^2.$ (10)

Here we have concatenated all the examples $y_i$ as columns of the matrix $Y$ and similarly gathered the representation coefficient vectors $x_i$ to build the matrix $X$. The notation $\|\cdot\|_F$ stands for the Frobenius norm, defined as $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$.

Assuming that $X$ is fixed, we can seek an update to $D$ such that the above error is minimized. Taking the derivative of (10) with respect to $D$, we obtain the relation $(Y - D X) X^T = 0$, leading to

$D = Y X^T (X X^T)^{-1}.$ (11)

MOD is closely related to the work by Olshausen and Field, with improvements both in the sparse coding and the dictionary update stages. Whereas the work in [23], [24], and [22] applies a steepest descent to evaluate the coefficients, those are evaluated much more efficiently with either OMP or FOCUSS. Similarly, in updating the dictionary, the update relation given in (11) is the best that can be achieved for fixed $X$. The iterative steepest descent update in (9) is far slower. Interestingly, in both stages of the algorithm, the difference is in deploying a second-order (Newtonian) update instead of a first-order one. Looking closely at the update relation in (9), it could be written as

$D^{(n+1)} = D^{(n)} + \eta\,(Y - D^{(n)} X) X^T.$ (12)

Using infinitely many iterations of this sort, and using a small enough $\eta$, this leads to a steady-state outcome that is exactly the MOD update matrix (11). Thus, while the MOD method assumes known coefficients at each iteration and derives the best possible dictionary, the ML method by Olshausen and Field only gets closer to this best current solution and then turns to calculate the coefficients. Note, however, that in both methods a normalization of the dictionary columns is required and done.

D. Maximum A-Posteriori Probability Approach

The same researchers that conceived the MOD method also suggested a MAP probability setting for the training of dictionaries, attempting to merge the efficiency of the MOD with a natural way to take into account preferences in the recovered dictionary. In [37], [41], [42], and [44], a probabilistic point of view is adopted, very similar to the ML methods discussed above. However, rather than working with the likelihood function $P(Y \mid D)$, the posterior $P(D \mid Y)$ is used. Using Bayes' rule, we have $P(D \mid Y) \propto P(Y \mid D) P(D)$, and thus we can use the likelihood expression as before and add a prior on the dictionary as a new ingredient.

These works considered several priors and proposed corresponding update formulas for the dictionary. The efficiency of the MOD in these methods is manifested in the efficient sparse coding, which is carried out with FOCUSS. The proposed algorithms in this family deliberately avoid a direct minimization with respect to $D$ as in MOD, due to the prohibitive matrix inversion required. Instead, iterative gradient descent is used.

When no prior is chosen, the update formula is the very one used by Olshausen and Field, as in (9). A prior that constrains $D$ to have a unit Frobenius norm leads to the update formula

(13)

As can be seen, the first two terms are the same as in (9). The last term compensates for deviations from the constraint. This case allows different columns in $D$ to have different norm values. As a consequence, columns with small norm values tend to be underused, as the coefficients they need are larger and as such more penalized.

This led to the second prior choice, constraining the columns of $D$ to have a unit $\ell_2$-norm. The new update equation formed is given by

(14)

where the relevant vector is the $j$th column of the matrix in (14). Compared to the MOD, this line of work provides slower training algorithms. Simulations reported in [37], [41], [42], [44] on synthetic and real image data seem to provide encouraging results.

E. Unions of Orthonormal Bases

The very recent work reported in [45] considers a dictionary composed as a union of orthonormal bases,

$D = [D_1, D_2, \dots, D_L],$

where the $D_i$ are orthonormal matrices. Such a dictionary structure is quite restrictive, but its updating may potentially be made more efficient.

The coefficients of the sparse representations can be decomposed to $L$ pieces, each referring to a different orthonormal basis. Thus

$X = [X_1^T, X_2^T, \dots, X_L^T]^T,$

where $X_i$ is the matrix containing the coefficients of the orthonormal dictionary $D_i$.

One of the major advantages of the union of orthonormal bases is the relative simplicity of the pursuit algorithm needed for the sparse coding stage. The coefficients are found using the block coordinate relaxation algorithm [46]. This is an appealing way to solve for $X$ as a sequence of simple shrinkage steps, such that at each stage $X_i$ is computed while keeping all the other pieces of $X$ fixed. Thus, this evaluation amounts to a simple shrinkage.

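For reference, the MOD dictionary update in (11) is a single least-squares step. A minimal numpy sketch, assuming Y and X are arrays shaped as in the formulation above (a generic illustration, not the authors' code):

```python
import numpy as np

def mod_update(Y, X):
    """MOD dictionary update: D = Y X^T (X X^T)^{-1}, the optimum for fixed X,
    followed by normalization of the dictionary columns."""
    D = Y @ np.linalg.pinv(X)                               # pinv also covers a singular X X^T
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # keep atoms at unit norm
    return D
```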

Page 8: 20141204.journal club

K-means method: Clustering Algorithm


Sparse Coding → Codebook Update → Sparse Coding → …

Page 9: 20141204.journal club

K-means method: Formulation

[Slide figure: input Y = {y_i}_{i=1}^N, codebook C with codewords d_k, coefficient matrix X]

Doesn't this look just like sparse coding?


Page 10: 20141204.journal club

K-SVD method: strategy

1. Sparse coding stage: fix the dictionary D and choose the vectors used in the representation
– OMP: an approximate solver for the l0-norm minimization

2. Codebook update stage: update the dictionary
– Update the dictionary D and the coefficients X simultaneously, via an SVD-based update (see the sketch below)
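A compact sketch of this two-stage loop. It uses scikit-learn's OMP for the sparse coding stage and a plain SVD for each atom update; the initialization and stopping details are simplified assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, K, T0, n_iter=10, seed=0):
    """Simplified K-SVD: alternate OMP sparse coding and per-atom SVD updates."""
    rng = np.random.default_rng(seed)
    n, N = Y.shape
    D = Y[:, rng.choice(N, size=K, replace=False)].astype(float)  # init atoms from data columns
    D /= np.linalg.norm(D, axis=0) + 1e-12

    for _ in range(n_iter):
        # 1) Sparse coding stage: fix D, estimate X column by column with OMP.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=T0)               # shape (K, N)

        # 2) Codebook update stage: update each atom d_k and its nonzero coefficients.
        for k in range(K):
            omega = np.flatnonzero(X[k, :])                       # signals that use atom k
            if omega.size == 0:
                continue                                          # unused atom: skip
            X[k, omega] = 0.0                                     # remove atom k's contribution
            E_R = Y[:, omega] - D @ X[:, omega]                   # restricted error E_k^R
            U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
            D[:, k] = U[:, 0]                                     # d_k <- u_1 (unit norm)
            X[k, omega] = s[0] * Vt[0, :]                         # x_R^k <- sigma_1 v_1^T
    return D, X
```

For example, on the synthetic Y from the earlier sketch one could call D_hat, X_hat = ksvd(Y, K=50, T0=3).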

Page 11: 20141204.journal club

AHARON et al.: K-SVD: AN ALGORITHM FOR DESIGNING OVERCOMPLETE DICTIONARIES 4317

Let us first consider the sparse coding stage, where we assume that $D$ is fixed, and consider the above optimization problem as a search for sparse representations with coefficients summarized in the matrix $X$. The penalty term can be rewritten as

$\|Y - D X\|_F^2 = \sum_{i=1}^{N} \|y_i - D x_i\|_2^2.$

Therefore the problem posed in (19) can be decoupled to $N$ distinct problems of the form

$\min_{x_i} \|y_i - D x_i\|_2^2$ subject to $\|x_i\|_0 \le T_0$, for $i = 1, 2, \dots, N$. (20)

This problem is adequately addressed by the pursuit algorithms discussed in Section II, and we have seen that if $T_0$ is small enough, their solution is a good approximation to the ideal one that is numerically infeasible to compute.

We now turn to the second, and slightly more involved, process of updating the dictionary together with the nonzero coefficients. Assume that both $D$ and $X$ are fixed and we put in question only one column in the dictionary, $d_k$, and the coefficients that correspond to it, the $k$th row in $X$, denoted as $x_T^k$ (this is not the vector $x_k$, which is the $k$th column in $X$). Returning to the objective function (19), the penalty term can be rewritten as

$\|Y - D X\|_F^2 = \left\|Y - \sum_{j=1}^{K} d_j x_T^j\right\|_F^2 = \left\|\left(Y - \sum_{j \ne k} d_j x_T^j\right) - d_k x_T^k\right\|_F^2 = \|E_k - d_k x_T^k\|_F^2.$ (21)

We have decomposed the multiplication $D X$ to the sum of $K$ rank-1 matrices. Among those, $K-1$ terms are assumed fixed, and one—the $k$th—remains in question. The matrix $E_k$ stands for the error for all the examples when the $k$th atom is removed. Note the resemblance between this error and the one defined in [45].

Here, it would be tempting to suggest the use of the SVD to find alternative $d_k$ and $x_T^k$. The SVD finds the closest rank-1 matrix (in Frobenius norm) that approximates $E_k$, and this will effectively minimize the error as defined in (21). However, such a step will be a mistake, because the new vector $x_T^k$ is very likely to be filled, since in such an update of $d_k$ we do not enforce the sparsity constraint.

A remedy to the above problem, however, is simple and also quite intuitive. Define $\omega_k$ as the group of indices pointing to examples that use the atom $d_k$, i.e., those where $x_T^k(i)$ is nonzero. Thus

$\omega_k = \{\, i \mid 1 \le i \le N,\ x_T^k(i) \ne 0 \,\}.$ (22)

Define $\Omega_k$ as a matrix of size $N \times |\omega_k|$, with ones on the $(\omega_k(i), i)$th entries and zeros elsewhere. When multiplying $x_T^k \Omega_k$, this shrinks the row vector by discarding the zero entries, resulting with the row vector $x_R^k$ of length $|\omega_k|$. Similarly, the multiplication $Y \Omega_k$ creates a matrix of size $n \times |\omega_k|$ that includes the subset of the examples that are currently using the $d_k$ atom. The same effect happens with $E_k \Omega_k$, implying a selection of error columns that correspond to examples that use the atom $d_k$.

Fig. 2. The K-SVD algorithm.

With this notation, we may now return to (21) and suggest minimization with respect to both $d_k$ and $x_R^k$, but this time force the solution of $x_T^k$ to have the same support as the original $x_T^k$. This is equivalent to the minimization of

$\|E_k \Omega_k - d_k x_T^k \Omega_k\|_F^2 = \|E_k^R - d_k x_R^k\|_F^2,$ (23)

and this time it can be done directly via SVD. Taking the restricted matrix $E_k^R$, SVD decomposes it to $E_k^R = U \Delta V^T$. We define the solution for $d_k$ as the first column of $U$, and the coefficient vector $x_R^k$ as the first column of $V$ multiplied by $\Delta(1,1)$. Note that, in this solution, we necessarily have that i) the columns of $D$ remain normalized and ii) the support of all representations either stays the same or gets smaller by possible nulling of terms.

We shall call this algorithm "K-SVD" to parallel the name K-means. While K-means applies $K$ computations of means to update the codebook, K-SVD obtains the updated dictionary by $K$ SVD computations, each determining one column. A full description of the algorithm is given in Fig. 2.

In the K-SVD algorithm, we sweep through the columns and use always the most updated coefficients as they emerge from preceding SVD steps. Parallel versions of this algorithm can also be considered, where all updates of the previous dictionary are done based on the same $X$. Experiments show that while this version also converges, it yields an inferior solution and typically requires more than four times the number of iterations.

Page 12: 20141204.journal club

Understanding the K-SVD method

[Slide figure: inputs Y = {y_i}_{i=1}^N, dictionary D with atoms d_k, coefficient matrix X with rows x_T^k]

$\|Y - DX\|_F^2 = \left\| Y - \sum_{j=1}^{K} d_j x_T^j \right\|_F^2 = \left\| \left( Y - \sum_{j \ne k} d_j x_T^j \right) - d_k x_T^k \right\|_F^2 = \| E_k - d_k x_T^k \|_F^2, \qquad E_k = Y - \sum_{j \ne k} d_j x_T^j$

Using the singular value decomposition (SVD), find the best approximation of $E_k$ and update $d_k$ and $x_T^k$.

How much of an effect does pulling out the d_k component have?
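The error matrix E_k on this slide (what remains of Y once atom d_k's contribution is removed) takes a couple of lines of numpy; a sketch only, using the array conventions of the earlier snippets:

```python
import numpy as np

def atom_error(Y, D, X, k):
    """E_k = Y - sum_{j != k} d_j x_T^j : residual when atom k is pulled out of the model."""
    return Y - D @ X + np.outer(D[:, k], X[k, :])   # add back the rank-1 term d_k x_T^k
```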

Page 13: 20141204.journal club

Singular value decomposition (SVD) of a matrix

• Any m × n matrix A can be decomposed as A = U Σ V^T, where
  – U: an m × m orthogonal matrix
  – V: an n × n orthogonal matrix
  – Σ: an m × n diagonal matrix, Σ = diag(σ1, σ2, …, σn) with σ1 ≥ σ2 ≥ … ≥ σr > σr+1 = … = σn = 0

• Using these, A can be written as a sum of rank-1 terms:

  A = σ1 u1 v1^T + σ2 u2 v2^T + … + σr ur vr^T
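The decomposition above, checked numerically with numpy on a small made-up matrix (a sketch, not tied to the paper's data):

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)        # any m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U diag(s) V^T, s sorted descending

# A as a sum of rank-1 terms sigma_i u_i v_i^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, A_rebuilt)

# Best rank-1 approximation in Frobenius norm: keep only the largest singular value
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```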

Page 14: 20141204.journal club

Preprocessing for the SVD

• Extract only the nonzero elements of the coefficients

$x_R^k = x_T^k \Omega_k$   (x_T^k: 1 × N, Ω_k: N × |ω_k|, so x_R^k: 1 × |ω_k|)

$E_k^R = E_k \Omega_k$   (E_k: n × N, Ω_k: N × |ω_k|, so E_k^R: n × |ω_k|)

so the problem $\| E_k - d_k x_T^k \|_F^2$ is replaced by its restricted version, $\| E_k^R - d_k x_R^k \|_F^2$.

Page 15: 20141204.journal club

How the SVD is used in the K-SVD method

Take the SVD of $E_k^R$:

$E_k^R = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_n u_n v_n^T$

To minimize $\| E_k^R - d_k x_R^k \|_F^2$, keep only the leading rank-1 term:

$d_k \leftarrow u_1, \qquad x_R^k \leftarrow \sigma_1 v_1^T$
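Putting the last two slides together, a single K-SVD atom update can be sketched as follows (same array conventions as before; D and X are modified in place; a sketch, not the authors' implementation):

```python
import numpy as np

def update_atom(Y, D, X, k):
    """One K-SVD atom update: restrict E_k to the signals that use atom k,
    take its best rank-1 approximation by SVD, and write back d_k and x_R^k."""
    omega = np.flatnonzero(X[k, :])                 # omega_k: columns where x_T^k is nonzero
    if omega.size == 0:
        return                                      # atom unused; nothing to update
    E_k = Y - D @ X + np.outer(D[:, k], X[k, :])    # error with atom k removed
    E_R = E_k[:, omega]                             # E_k^R = E_k Omega_k
    U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
    D[:, k] = U[:, 0]                               # d_k <- u_1 (stays unit norm)
    X[k, omega] = s[0] * Vt[0, :]                   # x_R^k <- sigma_1 v_1^T; support unchanged
```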

Page 16: 20141204.journal club

Numerical experiments

Page 17: 20141204.journal club

Reconstruction evaluation on synthetic data

• Reconstruction experiment
– Dictionary size: 20 × 50
– Input: superpositions of 3 dictionary elements plus additive noise; 1500 signals of dimension 20
– Horizontal axis: additive noise level
– Vertical axis: number of successful recoveries over 50 random-data trials

AHARON et al.: K-SVD: AN ALGORITHM FOR DESIGNING OVERCOMPLETE DICTIONARIES 4319

Fig. 3. Synthetic results: for each of the tested algorithms and for each noise level, 50 trials were conducted and their results sorted. The graph labels represent the mean number of detected atoms (out of 50) over the ordered tests in groups of ten experiments.

and initializing in the same way. We executed the MOD algorithm for a total number of 80 iterations. We also executed the MAP-based algorithm of Kreutz-Delgado et al. [37] (footnote 2). This algorithm was executed as is, therefore using FOCUSS as its decomposition method. Here, again, a maximum of 80 iterations were allowed.

D. Results

The computed dictionary was compared against the known generating dictionary. This comparison was done by sweeping through the columns of the generating dictionary and finding the closest column (in distance) in the computed dictionary, measuring the distance via

$1 - |d_i^T \tilde{d}_i|,$ (25)

where $d_i$ is a generating dictionary atom and $\tilde{d}_i$ is its corresponding element in the recovered dictionary. A distance less than 0.01 was considered a success. All trials were repeated 50 times, and the number of successes in each trial was computed. Fig. 3 displays the results for the three algorithms for noise levels of 10, 20, and 30 dB and for the noiseless case.

We should note that for different dictionary size (e.g., 20 × 30) and with more executed iterations, the MAP-based algorithm improves and gets closer to the K-SVD detection rates.

VI. APPLICATIONS TO IMAGE PROCESSING—PRELIMINARY RESULTS

We carried out several experiments on natural image data, trying to show the practicality of the proposed algorithm and the general sparse coding theme. We should emphasize that our tests here come only to prove the concept of using such dictionaries with sparse representations. Further work is required to fully deploy the proposed techniques in large-scale image-processing applications.

Footnote 2: The authors of [37] have generously shared their software with us.

Fig. 4. A collection of 500 random blocks that were used for training, sorted by their variance.

Fig. 5. (a) The learned dictionary. Its elements are sorted in an ascending order of their variance and stretched to maximal range for display purposes. (b) The overcomplete separable Haar dictionary and (c) the overcomplete DCT dictionary are used for comparison.

Fig. 6. The root mean square error for 594 new blocks with missing pixels using the learned dictionary, overcomplete Haar dictionary, and overcomplete DCT dictionary.

1) Training Data: The training data were constructed as a set of 11 000 examples of block patches of size 8 × 8 pixels, taken from a database of face images (in various locations). A random collection of 500 such blocks, sorted by their variance, is presented in Fig. 4.

2) Removal of the DC: Working with real image data, we preferred that all dictionary elements except one have a zero mean. The same measure was practiced in previous work [23]. For this purpose, the first dictionary element, denoted as the DC, was set to include a constant value in all its entries and was not changed afterwards. The DC takes part in all representations, and as a result, all other dictionary elements remain with zero mean during all iterations.

3) Running the K-SVD: We applied the K-SVD, training a dictionary of size 64 × 441. The choice came from our attempt to compare the outcome to the overcomplete Haar dictionary of the same size (see the following section). The coefficients were computed using OMP with a fixed number of coef-
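A sketch of how the synthetic data of this experiment can be generated and scored. The sizes (20 × 50 dictionary, 3 atoms per signal, 1500 signals) are from the slide; the exact noise scaling and the success threshold follow the description around (25) and may differ from the authors' setup in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, N, T0 = 20, 50, 1500, 3                 # dictionary 20x50, 3 atoms per signal, 1500 signals

D_true = rng.standard_normal((n, K))
D_true /= np.linalg.norm(D_true, axis=0)      # generating dictionary with unit-norm atoms

X_true = np.zeros((K, N))
for i in range(N):
    support = rng.choice(K, size=T0, replace=False)
    X_true[support, i] = rng.standard_normal(T0)

Y_clean = D_true @ X_true
snr_db = 20                                   # one of the tested noise levels (10/20/30 dB)
noise = rng.standard_normal(Y_clean.shape)
noise *= np.linalg.norm(Y_clean) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
Y = Y_clean + noise                           # noisy training signals

def detected_atoms(D_true, D_learned, thresh=0.01):
    """Count generating atoms matched by some learned atom, using 1 - |d^T d_hat| < thresh
    (both dictionaries are assumed to have unit-norm columns)."""
    sims = np.abs(D_true.T @ D_learned)
    return int(np.sum(1.0 - sims.max(axis=1) < thresh))
```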

Page 18: 20141204.journal club

K-SVD dictionary from natural images

• Dictionary learning from natural images
– Dictionary size: (8×8) × 441
– Input: 11,000 natural-image patches of size 8×8
– 80 iterations


Example input patches

Page 19: 20141204.journal club

Bases learned from natural images


[Slide figures: the dictionary learned by K-SVD and the overcomplete DCT dictionary]

Page 20: 20141204.journal club

Application: Image inpainting

4320 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 11, NOVEMBER 2006

Fig. 7. The corrupted image (left) with the missing pixels marked as points and the reconstructed results by the learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary, respectively. The different rows are for 50% and 70% of missing pixels.

ficients, where the maximal number of coefficients is ten. Note that better performance can be obtained by switching to FOCUSS. We concentrated on OMP because of its simplicity and fast execution. The trained dictionary is presented in Fig. 5(a).

4) Comparison Dictionaries: The trained dictionary was compared with the overcomplete Haar dictionary, which includes separable basis functions, having steps of various sizes and in all locations (a total of 441 elements). In addition, we build an overcomplete separable version of the DCT dictionary by sampling the cosine wave in different frequencies to result in a total of 441 elements. The overcomplete Haar dictionary and the overcomplete DCT dictionary are presented in Fig. 5(b) and (c), respectively.

5) Applications: We used the K-SVD results, denoted here as the learned dictionary, for two different applications on images. All tests were performed on one face image which was not included in the training set. The first application is filling in missing pixels: we deleted random pixels in the image and filled their values using the various dictionaries' decomposition. We then tested the compression potential of the learned dictionary decomposition and derived a rate-distortion graph. We hereafter describe those experiments in more detail.

A. Filling In Missing Pixels

We chose one random full face image, which consists of 594 nonoverlapping blocks (none of which were used for training). For each block, the following procedure was conducted for missing-pixel fractions in the range {0.2, 0.9}.

1) A fraction of the pixels in each block, in random locations, were deleted (set to zero).

2) The coefficients of the corrupted block under the learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary were found using OMP with an error bound defined via a vector of all ones (footnote 3), allowing an error of 5 gray-values in 8-bit images. All projections in the OMP algorithm included only the noncorrupted pixels, and for this purpose, the dictionary elements were normalized so that the noncorrupted indexes in each dictionary element have a unit norm. The resulting coefficient vector of the block is denoted $\hat{x}$.

Footnote 3: The input image is scaled to the dynamic range [0,1].

3) The reconstructed block was chosen as $D\hat{x}$.

4) The reconstruction error was set to the root mean square error over the block (where 64 is the number of pixels in each block).

The mean reconstruction errors (for all blocks and all corruption rates) were computed and are displayed in Fig. 6. Two corrupted images and their reconstructions are shown in Fig. 7. As can be seen, higher quality recovery is obtained using the learned dictionary.

B. Compression

A compression comparison was conducted between the overcomplete learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary (as explained before), all of size 64 × 441. In addition, we compared to the regular (unitary) DCT dictionary (used by the JPEG algorithm). The resulting rate-distortion graph is presented in Fig. 8. In this compression test, the face image was partitioned (again) into 594 disjoint 8 × 8 blocks. All blocks were coded at various rates (bits-per-pixel values), and the peak SNR (PSNR) was measured. Denoting by $e^2$ the mean squared error between the original image and the coded image (combined from all the coded blocks),

$\mathrm{PSNR} = 10 \log_{10}(1/e^2).$ (26)

In each test we set an error goal and fixed the number of bits per coefficient. For each such pair of parameters, all blocks were coded in order to achieve the desired error goal, and the coefficients were quantized to the desired number of bits (uniform quantization, using upper and lower bounds for each coefficient in each dictionary based on the training set coefficients). For the
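Under the footnote's assumption that images are scaled to the dynamic range [0, 1], the PSNR of (26) can be computed as in this small sketch:

```python
import numpy as np

def psnr(original, coded):
    """PSNR of (26) for images scaled to [0, 1]: 10 log10(1 / e^2), with e^2 the MSE."""
    e2 = np.mean((np.asarray(original, float) - np.asarray(coded, float)) ** 2)
    return 10.0 * np.log10(1.0 / e2)
```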

Page 21: 20141204.journal club

Summary

• Proposed K-SVD, a dictionary learning method for overcomplete bases

• K-SVD is a generalization of K-means (a somewhat forced one)
• The bases produced by K-SVD are useful for solving problems such as inpainting