Blind Compressed Sensing Using Sparsifying Transforms
Saiprasad Ravishankar and Yoram Bresler
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
May 29, 2015
Key Topics of Talk
Non-adaptive Compressed Sensing (CS)
Synthesis dictionary learning-based blind compressed sensing
Transform learning vs. dictionary learning
Transform learning-based blind compressed sensing
Application to magnetic resonance imaging (MRI)
Transform learning-based MRI (TLMRI)
Conclusions
Compressed Sensing (CS)
CS enables accurate recovery of images from far fewer measurements than the number of unknowns, provided that:
the image is sparse in a transform domain or dictionary, and
the measurement procedure is incoherent with the transform.
Reconstruction is non-linear and expensive.
Reconstruction problem (NP-hard):

$$\min_{x} \; \underbrace{\|Ax - y\|_2^2}_{\text{data fidelity}} \; + \; \lambda \underbrace{\|\Psi x\|_0}_{\text{regularizer}} \qquad (1)$$
x ∈ C^P: image as a vector; y ∈ C^m: measurements; A ∈ C^{m×P}: sensing matrix (m < P); Ψ: transform (wavelets, contourlets, total variation). The ℓ0 "norm" counts non-zeros.
Iterative algorithms for CS reconstruction are usually expensive.
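For concreteness, a minimal sketch (not from the talk) of one standard iterative solver for the ℓ1-relaxed version of (1): ISTA, under the simplifying assumption that Ψ is unitary so the problem can be posed over transform coefficients z = Ψx. The function name and step-size choice are illustrative.

```python
# Hedged sketch: ISTA for the l1-relaxed version of (1), assuming a
# unitary sparsifying transform Psi (so x = Psi^H z). Illustrative only.
import numpy as np

def ista_cs(A, Psi, y, lam, n_iters=200):
    M = A @ Psi.conj().T                         # effective sensing matrix on coefficients
    step = 1.0 / (2 * np.linalg.norm(M, 2)**2)   # step < 1/L, L = Lipschitz const. of gradient
    z = np.zeros(M.shape[1], dtype=complex)
    for _ in range(n_iters):
        grad = 2 * M.conj().T @ (M @ z - y)      # gradient of ||Mz - y||_2^2
        u = z - step * grad
        # complex soft-thresholding: proximal operator of lam * ||.||_1
        mag = np.maximum(np.abs(u) - step * lam, 0)
        z = mag * np.exp(1j * np.angle(u))
    return Psi.conj().T @ z                      # image estimate x = Psi^H z
```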
Application: Compressed Sensing MRI (CSMRI)
Data: samples in k-space (the spatial Fourier transform of the object), acquired sequentially.
Acquisition rate limited by MR physics and by physiological constraints on RF energy deposition.
CS accelerates the data acquisition in MRI.
CSMRI with non-adaptive transforms or dictionaries is limited to 2.5-3 fold undersampling [Ma et al. ’08].
Two directions to improve CSMRI:
better or adaptive sparse modeling
better choice of sampling pattern (Fu) [EMBC, 2011]
Fig. from Lustig et al. ’07
Synthesis Model for Sparse Representation
Given a signal y ∈ R^n and a dictionary D ∈ R^{n×K}, we assume y = Dx with ‖x‖0 ≪ K ⇒ a union-of-subspaces model.
Real-world signals are modeled as y = Dx + e, where e is a deviation term.
Given D and sparsity level s, the synthesis sparse coding problem is

$$\hat{x} = \arg\min_{x} \|y - Dx\|_2^2 \quad \text{s.t.} \quad \|x\|_0 \le s$$
This problem is NP-hard.
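In practice, greedy heuristics such as orthogonal matching pursuit (OMP) are the usual workhorse for this NP-hard problem. A minimal sketch (illustrative, assuming unit-norm columns of D; not the talk's algorithm):

```python
# Minimal orthogonal matching pursuit (OMP) sketch for
# min_x ||y - Dx||_2 s.t. ||x||_0 <= s. Assumes unit-norm columns of D.
import numpy as np

def omp(D, y, s):
    residual = y.astype(complex)
    support = []
    for _ in range(s):
        # greedily pick the atom most correlated with the residual
        j = int(np.argmax(np.abs(D.conj().T @ residual)))
        support.append(j)
        # least-squares re-fit of the coefficients on the current support
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x = np.zeros(D.shape[1], dtype=complex)
    x[support] = coeffs
    return x
```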
Synthesis Dictionary Learning
The DL problem (NP-hard):

$$\min_{D,B} \sum_{j=1}^{N} \|R_j x - D b_j\|_2^2 \quad \text{s.t.} \quad \|d_k\|_2 = 1 \ \forall k, \ \|b_j\|_0 \le s \ \forall j \qquad (2)$$

R_j x ∈ C^n: √n × √n patch, indexed by its location in the image.
R_j ∈ C^{n×P} extracts the patch.
D ∈ C^{n×K}: patch-based dictionary.
b_j ∈ C^K: sparse code, R_j x ≈ D b_j.
s: sparsity level; B = [b_1 | b_2 | ... | b_N].
DL minimizes the fitting error of all patches using sparse representations w.r.t. D.
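To make the patch operators R_j concrete, here is a deliberately naive sketch (an assumption for illustration; efficient implementations use strided views rather than loops) that collects all maximally overlapping patches as columns of an n × N matrix:

```python
# Naive sketch of the patch operators R_j: collect all overlapping
# sqrt(n) x sqrt(n) patches of an image as the columns of an n x N matrix.
import numpy as np

def extract_patches(image, root_n):
    H, W = image.shape
    cols = []
    for i in range(H - root_n + 1):
        for j in range(W - root_n + 1):
            # R_j x: vectorize the patch at location (i, j)
            cols.append(image[i:i+root_n, j:j+root_n].reshape(-1))
    return np.stack(cols, axis=1)   # n x N, one column per patch
```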
Synthesis-based Blind Compressed Sensing (BCS)
$$(P0)\quad \min_{x,D,B} \; \underbrace{\sum_{j=1}^{N} \|R_j x - D b_j\|_2^2}_{\text{sparse fitting regularizer}} + \nu \underbrace{\|Ax - y\|_2^2}_{\text{data fidelity}} \quad \text{s.t.}\ \|d_k\|_2 = 1\ \forall k,\ \|b_j\|_0 \le s\ \forall j.$$
B ∈ C^{K×N}: matrix that has the sparse codes b_j as its columns.
(P0) learns D ∈ C^{n×K} and reconstructs x from only the undersampled y ⇒ a dictionary adaptive to the underlying image.
(P0) is NP-hard, and non-convex even if the ℓ0 "norm" is relaxed to ℓ1.
DLMRI¹ solves (P0) for MRI and works better than non-adaptive CS.
Synthesis BCS algorithms have no guarantees and are expensive.
1 [Ravishankar & Bresler ’11]
2D Random Sampling - 6 fold undersampling
[Figures] LDP² reconstruction (22 dB) with LDP error magnitude; DLMRI reconstruction (32 dB) with DLMRI error magnitude (error color scale 0–0.3).
MRI data from Miki Lustig. ² [Lustig et al. ’07]
Alternative: Sparsifying Transform Model
Given a signal y ∈ R^n and a transform W ∈ R^{m×n}, we model Wy = x + η with ‖x‖0 ≪ m, where η is an error term.
Natural signals are approximately sparse in wavelets, DCT.
Given W and sparsity s, transform sparse coding is

$$\hat{x} = \arg\min_{x} \|Wy - x\|_2^2 \quad \text{s.t.} \quad \|x\|_0 \le s$$

x̂ = H_s(Wy) is computed exactly by thresholding Wy to its s largest-magnitude elements. Sparse coding is cheap. The signal is recovered as W†x̂.
Sparsifying transforms exploited for compression (JPEG2000), etc.
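This exact sparse coding step takes only a few lines; a minimal sketch (the top-s selection via argsort is one of several equivalent choices):

```python
# Transform sparse coding: x_hat = H_s(Wy) keeps only the s
# largest-magnitude entries of Wy; recovery uses the pseudo-inverse.
import numpy as np

def transform_sparse_code(W, y, s):
    z = W @ y
    x_hat = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]      # indices of the s largest magnitudes
    x_hat[keep] = z[keep]
    y_hat = np.linalg.pinv(W) @ x_hat      # signal estimate W^dagger x_hat
    return x_hat, y_hat
```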
Alternative: Sparsifying Transform Learning
Square Transform Models
Unstructured transform learning [IEEE TSP, 2013 & 2015]
Doubly sparse transform learning [IEEE TIP, 2013]
Online learning for Big Data [IEEE JSTSP, 2015]
Convex formulations for transform learning [ICASSP, 2014]
Overcomplete Transform Models
Unstructured overcomplete transform learning [ICASSP, 2013]
Learning structured overcomplete transforms with block cosparsity (OCTOBOS) [IJCV, 2014]
Applications: sparse representation, image & video denoising, classification, blind compressed sensing (BCS) for imaging.
Square Transform Learning Formulation
$$(P1)\quad \min_{W,B} \; \underbrace{\sum_{j=1}^{N} \|W R_j x - b_j\|_2^2}_{\text{sparsification error}} + \lambda \underbrace{\left(0.5\,\|W\|_F^2 - \log|\det W|\right)}_{\text{regularizer}} \quad \text{s.t.}\ \|b_j\|_0 \le s\ \forall j$$
Sparsification error: measures the deviation of the data in the transform domain from perfect sparsity.
The regularizer enables complete control over the conditioning & scaling of W.
If ∃ (W, B) such that the condition number κ(W) = 1, W R_j x = b_j, and ‖b_j‖0 ≤ s ∀ j, then it is globally identifiable by solving (P1).
(P1) favors both a low sparsification error and good conditioning.
The solution to (P1) is unitary as λ → ∞.
Transform-based Blind Compressed Sensing (BCS)
$$(P2)\quad \min_{x,W,B} \; \underbrace{\sum_{j=1}^{N} \|W R_j x - b_j\|_2^2}_{\text{sparsification error}} + \nu \underbrace{\|Ax - y\|_2^2}_{\text{data fidelity}} + \lambda \underbrace{v(W)}_{\text{regularizer}} \quad \text{s.t.}\ \sum_{j=1}^{N} \|b_j\|_0 \le s,\ \|x\|_2 \le C.$$
(P2) learns W ∈ C^{n×n} and reconstructs x from only the undersampled y ⇒ a transform adaptive to the underlying image.
v(W) ≜ −log|det W| + 0.5‖W‖²_F controls the scaling and condition number κ of W.
‖x‖2 ≤ C is an energy/range constraint, with C > 0.
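As a small numerical illustration (an assumption-labeled sketch, not from the talk), v(W) can be evaluated via a log-determinant; it blows up as W approaches singularity and grows with the Frobenius norm, so penalizing it steers W toward well-conditioned, well-scaled transforms:

```python
# Evaluate v(W) = -log|det W| + 0.5 * ||W||_F^2 (illustrative sketch).
import numpy as np

def v(W):
    _, logabsdet = np.linalg.slogdet(W)   # log|det W|, stable also for complex W
    return -logabsdet + 0.5 * np.linalg.norm(W, 'fro')**2

# e.g., v(np.eye(4)) = 2.0, while v(1e-3 * np.eye(4)) is roughly 27.6:
# the nearly singular (badly scaled) matrix is penalized far more heavily.
```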
Transform BCS: Identifiability & Uniqueness
Proposition 1
Let x ∈ C^p, and let y = Ax with A ∈ C^{m×p}. Suppose
‖x‖2 ≤ C,
W ∈ C^{n×n} is a unitary transform,
∑_{j=1}^{N} ‖W R_j x‖0 ≤ s.
Further, let B denote the matrix that has W R_j x as its columns. Then (x, W, B) is a global minimizer of Problem (P2), i.e., it is identifiable by solving (P2).

Given a minimizer (x, W, B) of (P2), (x, ΘW, ΘB) is another equivalent minimizer for every Θ such that Θ^H Θ = I and ∑_j ‖Θb_j‖0 ≤ s. The optimal x is invariant to such transformations of (W, B).
When W is constrained to be doubly sparse and unitary, uniqueness can be guaranteed under additional (e.g., spark) conditions.
Alternative Transform BCS Formulations
$$(P3)\quad \min_{x,W,B} \; \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 \quad \text{s.t.}\ W^H W = I,\ \sum_{j=1}^{N} \|b_j\|_0 \le s,\ \|x\|_2 \le C.$$

(P3) is also a unitary synthesis dictionary-based BCS problem, with W^H the synthesis dictionary.

$$(P4)\quad \min_{x,W,B} \; \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 + \lambda\, v(W) + \eta^2 \sum_{j=1}^{N} \|b_j\|_0 \quad \text{s.t.}\ \|x\|_2 \le C.$$
Block Coordinate Descent (BCD) Algorithm for (P2)
(P2) is solved by alternating between updating W, B, and x.
Alternate a few times between the W and B updates before performing an image update.
Sparse Coding Step solves (P2) for B with fixed x, W:

$$\min_{B} \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 \quad \text{s.t.}\ \sum_{j=1}^{N} \|b_j\|_0 \le s. \qquad (3)$$

Cheap solution: let Z ∈ C^{n×N} be the matrix with W R_j x as its columns. The solution B̂ = H_s(Z) is computed exactly by zeroing out all but the s largest-magnitude coefficients in Z.
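The global hard-thresholding H_s(Z) is a few lines; a minimal sketch (argpartition is one way to find the top-s magnitudes):

```python
# Sparse coding step (3): B_hat = H_s(Z) zeros out all but the s
# largest-magnitude entries of Z, jointly over the whole matrix.
import numpy as np

def hard_threshold(Z, s):
    B = np.zeros_like(Z)
    flat_idx = np.argpartition(np.abs(Z), -s, axis=None)[-s:]
    idx = np.unravel_index(flat_idx, Z.shape)
    B[idx] = Z[idx]
    return B
```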
BCD Algorithm for (P2)
Transform Update Step solves (P2) for W with fixed x, B:

$$\min_{W} \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + 0.5\,\lambda \|W\|_F^2 - \lambda \log|\det W| \qquad (4)$$

Let X ∈ C^{n×N} be the matrix with R_j x as its columns.
Closed-form solution:

$$\hat{W} = 0.5\, R \left( \Sigma + \left( \Sigma^2 + 2\lambda I \right)^{\frac{1}{2}} \right) V^H L^{-1} \qquad (5)$$

where XX^H + 0.5λI = LL^H, and L^{-1}XB^H has a full SVD of VΣR^H.
The solution is unique if and only if XB^H is non-singular.
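A direct transcription of (5) in code, as a sketch (it assumes XX^H + 0.5λI is positive definite, which holds whenever λ > 0):

```python
# Transform update step: closed-form minimizer (5) of problem (4).
import numpy as np

def transform_update(X, B, lam):
    n = X.shape[0]
    # factor X X^H + 0.5*lam*I = L L^H (positive definite for lam > 0)
    L = np.linalg.cholesky(X @ X.conj().T + 0.5 * lam * np.eye(n))
    Linv = np.linalg.inv(L)
    # full SVD: L^{-1} X B^H = V Sigma R^H
    V, sig, RH = np.linalg.svd(Linv @ X @ B.conj().T)
    R = RH.conj().T
    middle = np.diag(0.5 * (sig + np.sqrt(sig**2 + 2 * lam)))
    return R @ middle @ V.conj().T @ Linv      # W_hat of Eq. (5)
```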
BCD Algorithm for (P2)
Image Update Step solves (P2) for x with fixed W, B:

$$\min_{x} \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 \quad \text{s.t.}\ \|x\|_2 \le C. \qquad (6)$$

Least squares problem with an ℓ2 norm constraint.
The solution is unique as long as the set of overlapping patches covers all image pixels.
Solve the least-squares Lagrangian formulation:

$$\min_{x} \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 + \mu \left( \|x\|_2^2 - C \right) \qquad (7)$$

The optimal multiplier μ ∈ R₊ is the smallest real such that ‖x̂‖2 ≤ C; μ and x̂ can be found cheaply.
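For a fixed μ, (7) is an unconstrained least-squares problem whose normal equations can be solved matrix-free, e.g. by conjugate gradients. A hedged sketch (the explicit list R_list of patch matrices is purely illustrative; practical code applies patch extraction directly and exploits the structure of A):

```python
# Image update sketch: solve the normal equations of (7) for fixed mu,
#   (sum_j R_j^H W^H W R_j + nu A^H A + mu I) x = sum_j R_j^H W^H b_j + nu A^H y,
# by conjugate gradients on the Hermitian positive definite system.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def image_update(W, A, B, y, R_list, nu, mu):
    P = A.shape[1]

    def apply_normal(x):
        out = mu * x + nu * (A.conj().T @ (A @ x))
        for Rj in R_list:                       # accumulate the patch terms
            out = out + Rj.conj().T @ (W.conj().T @ (W @ (Rj @ x)))
        return out

    rhs = nu * (A.conj().T @ y)
    for Rj, bj in zip(R_list, B.T):             # columns of B are the codes b_j
        rhs = rhs + Rj.conj().T @ (W.conj().T @ bj)

    op = LinearOperator((P, P), matvec=apply_normal, dtype=complex)
    x_hat, info = cg(op, rhs)
    return x_hat
```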
BCS Convergence Guarantees - Notations
Define the barrier function ψ_s(B) as

$$\psi_s(B) = \begin{cases} 0, & \sum_{j=1}^{N} \|b_j\|_0 \le s \\ +\infty, & \text{else} \end{cases}$$

χ_C(x) is the barrier function corresponding to ‖x‖2 ≤ C.
(P2) is equivalent to the problem of minimizing the objective

$$g(W, B, x) = \sum_{j=1}^{N} \|W R_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 + \lambda\, v(W) + \psi_s(B) + \chi_C(x)$$
For H ∈ C^{p×q}, ρ_j(H) denotes the magnitude of the j-th largest element (by magnitude) of H.
X ∈ C^{n×N} denotes the matrix with R_j x, 1 ≤ j ≤ N, as its columns.
Transform BCS Convergence Guarantees
Theorem 1
For the sequence {W^t, B^t, x^t} generated by the BCD algorithm with initialization (W^0, B^0, x^0), we have:

{g(W^t, B^t, x^t)} → g* = g*(W^0, B^0, x^0).
{W^t, B^t, x^t} is bounded, and all its accumulation points are equivalent, i.e., they achieve the same value g* of the objective.
‖x^t − x^{t−1}‖₂ → 0 as t → ∞.
Every accumulation point (W, B, x) is a critical point of g satisfying the following partial global optimality conditions:

$$x \in \arg\min_{x} \; g(W, B, x) \qquad (8)$$
$$W \in \arg\min_{W} \; g(W, B, x), \qquad B \in \arg\min_{B} \; g(W, B, x) \qquad (9)$$
Transform BCS Convergence Guarantees
Theorem 2
Each accumulation point (W, B, x) of {W^t, B^t, x^t} also satisfies the following partial local optimality conditions:

$$g(W + \Delta W,\, B + \Delta B,\, x) \ge g(W, B, x) = g^* \qquad (10)$$
$$g(W,\, B + \Delta B,\, x + \Delta x) \ge g(W, B, x) = g^* \qquad (11)$$

The conditions each hold for all Δx ∈ C^p, all ΔW ∈ C^{n×n} satisfying ‖ΔW‖_F ≤ ε for some ε = ε(W) > 0, and all ΔB ∈ C^{n×N} in R1 ∪ R2:

R1. The half-space Re(tr{(WX − B)ΔB^H}) ≤ 0.
R2. The local region defined by ‖ΔB‖_∞ < ρ_s(WX).

Furthermore, if ‖WX‖0 ≤ s, then ΔB can be arbitrary.
Global Convergence Guarantees
Proposition 2
For each initialization, the iterate sequence in the BCD algorithm converges to an equivalence class (same objective values) of critical points of the objective that are also partial global/local minimizers.
Proposition 3
The BCD algorithm is globally convergent to a subset of the set of critical points of the objective. The subset includes all (W, B, x) that are at least partial global and partial local minimizers.
Computational Advantages of Transform BCS
Cost per iteration of transform BCS: O(p⁴NL)
N overlapping patches of size p × p; W ∈ C^{n×n}, n ≜ p².
L: number of inner alternations between the transform update & sparse coding.
Cost per iteration of the synthesis BCS method DLMRI³: O(p⁶NJ)
D ∈ C^{n×K}, n ≜ p², K ∝ n, sparsity s ∝ n.
J: number of inner iterations of dictionary learning using K-SVD⁴.
In practice, transform BCS converges quickly and is much cheaper for large p.
In 3D or 4D imaging, n = p³ or p⁴, and the gain in computation is about a factor of n in order.
³ [Ravishankar & Bresler ’11] ⁴ [Aharon et al. ’06]
TLMRI Convergence - 4x Undersampling (s = 3.4%)
[Figures] Reference image and sampling mask; objective function vs. iteration number; ‖x^t − x^{t−1}‖₂ vs. iteration number t.
Convergence & Learning - 4x Undersampling (s = 3.4%)
[Figures] Zero-filling reconstruction (28.94 dB) and zero-filling error (color scale 0–0.2); TLMRI reconstruction (32.66 dB); real (top) and imaginary (bottom) parts of the learnt 36 × 36 W.
Comparison (PSNR & Runtime) to Recent Methods
PSNR (dB) and average runtime:

Sampling scheme  | Undersampling | Zero-filling | LDP⁵ | PBDWS⁶ | DLMRI⁷ | PANO⁸ | TLMRI
2D random        | 4x            | 25.3         | 30.3 | 32.6   | 32.91  | 32.2  | 33.04
2D random        | 7x            | 25.3         | 27.3 | 31.3   | 31.46  | 30.2  | 31.81
Cartesian        | 4x            | 28.9         | 30.2 | 32.0   | 32.46  | 31.6  | 32.64
Cartesian        | 7x            | 27.9         | 25.5 | 30.1   | 30.72  | 30.4  | 31.04
Avg. runtime (s) |               |              | 251  | 794    | 2051   | 664   | 211
TLMRI is up to 5.5 dB better than LDP, which uses wavelets + TV.
TLMRI provides up to 1 dB improvement in PSNR over the PBDWS method, which uses redundant wavelets and trained patch-based geometric directions, and is up to 1.6 dB better than the non-local PANO method.
It is up to 0.35 dB better than DLMRI, which learns a 4x overcomplete dictionary.
TLMRI is 10x faster than DLMRI, and 4x faster than the PBDWS method.
TLMRI provides the best reconstructions, and is the fastest.
5 [Lustig et al. ’07] 6 [Ning et al. ’13] 7 [Ravishankar & Bresler ’11] 8 [Qu et al. ’14]
Example - 2D random 5x Undersampling
[Figures] Reference; DLMRI reconstruction (28.54 dB); TLMRI reconstruction (30.47 dB); sampling mask; DLMRI error; TLMRI error (error color scale 0–0.25).
Conclusions
We introduced a transform-based BCS framework.
The proposed BCS algorithms have a low computational cost.
We provided novel convergence guarantees for the algorithms that do not require any restrictive assumptions.
For CSMRI, the proposed approach outperforms leading image reconstruction methods while being much faster.
Future work: convergence of the algorithm to a global minimizer & convergence rate analysis.
Thank you! Questions??
Convergence Guarantees - Definitions
Definition 1
Let φ : R^q → (−∞, +∞] be a proper function and let z ∈ dom φ. The Fréchet sub-differential of the function φ at z is the following set:

$$\hat{\partial}\varphi(z) \triangleq \left\{ h \in \mathbb{R}^q : \liminf_{b \to z,\, b \neq z} \frac{1}{\|b - z\|} \big( \varphi(b) - \varphi(z) - \langle b - z,\, h \rangle \big) \ge 0 \right\} \qquad (12)$$

If z ∉ dom φ, then ∂̂φ(z) ≜ ∅. The sub-differential of φ at z is defined as

$$\partial\varphi(z) \triangleq \left\{ h \in \mathbb{R}^q : \exists\, z^k \to z,\ \varphi(z^k) \to \varphi(z),\ h^k \in \hat{\partial}\varphi(z^k) \to h \right\}. \qquad (13)$$
Lemma 1
A necessary condition for z ∈ R^q to be a minimizer of the function φ : R^q → (−∞, +∞] is that z is a critical point of φ, i.e., 0 ∈ ∂φ(z). If φ is a convex function, this condition is also sufficient.