

Compressed sensing, low-rank recovery & co: lecture notes

Master 2 Mathématiques et Applications - Spécialité Statistique & Spécialité Algorithmes et Apprentissage

Sorbonne Université

Claire Boyer & Maxime Sangnier

2020


Acknowledgments

We would like to warmly thank Guillaume Lecué for his help and Gabriel Peyré for his useful toolboxes and pedagogical resources.

This course is mainly based on

• A mathematical introduction to compressive sensing, by Simon Foucart and Holger Rauhut,

• Guillaume Lecué's lecture notes on compressed sensing,

• Mathematics of sparsity by Emmanuel Candès,

• Pascal Bianchi, Olivier Fercoq and Anne Sabourin's lecture notes on optimization,

• Vandenberghe's lecture notes on Optimization methods for large-scale systems,

• Clarice Poon's lecture notes.


Chapter 1

Introduction to CS as an inverse problem: sparsity and measurements

Contents

1.1 Introduction to CS
    1.1.1 A first regularization
    1.1.2 A new paradigm
1.2 Introduction to sparsity
1.3 Is the world sparse? Some orthogonal transforms
    1.3.1 Sparse approximation in a basis
    1.3.2 Fourier tools
    1.3.3 Wavelets
1.4 CS formalization
    1.4.1 Noiseless recovery
    1.4.2 Noisy recovery
1.5 Nearly-sparse signals: stable recovery
1.6 Robust and stable recovery
1.7 CS applications
    1.7.1 CS and photography
    1.7.2 CS and Fourier measurements
    1.7.3 CS and orthogonal (incoherent) transforms
    1.7.4 CS and MRI
    1.7.5 CS and face recognition
    1.7.6 CS and finance
1.8 Minimal number of measurements
    1.8.1 Recovery of all sparse vectors
    1.8.2 Recovery of individual sparse vectors
1.9 Preliminaries: NP-hard problem

A big part of this chapter is taken from Mallat's book, A wavelet tour of signal processing, and Foucart and Rauhut's book, A mathematical introduction to compressive sensing.

In data science, one key element is the data representation, which is chosen in order to exploit some structure. In this chapter, we will present some famous transforms that will be used either to derive theoretical results in the following chapters or during practical work.

1.1 Introduction to CS

Compressed sensing can be translated into French as "acquisition comprimée". This means that acquisition and compression of a signal are performed simultaneously. Before CS, data compression was a step separate from acquisition. For instance, in photography, the first step is to take a picture, i.e. to expose the photo-sensitive plate of the camera to the scene light. This data is stored in the RAW format: the scene is discretized into pixels, and the color levels of each pixel are stored in a big matrix. For modern cameras, storing a photograph in RAW format uses a few MB of memory. This is the first step: acquisition. The second step is generally done by a CPU (on the device, or on a computer): it consists in applying a compression algorithm to the RAW file. This step is crucial, as it can divide the storage size by a factor of up to 25, meaning only hundreds of KB for a compressed image instead of a few MB for the RAW format. For instance, JPEG is a classical format for compressed photographs, using a particular algorithm (a blockwise transformation) in a cosine basis followed by a quantization of the obtained coefficients. Compression is useful only if it is possible to get back the original image quickly and efficiently, i.e. without too much loss (the human eye is not supposed to see the difference between the initial and the compressed images). This is the case of JPEG compression.

The main drawbacks of this two-step approach are that: (i) it requires the storage of n data points, whereas n is typically huge in applications; (ii) we have to compute the n coordinates of the signal in the compression basis, whereas only a small proportion of them will be useful.

Compressed sensing is a setting in which both steps are performed simultaneously. One then has to pick "clever" measurements of the signal of interest, and to be able to reconstruct the entire signal from a small number of measurements.

The notion of measurement is crucial in CS. Theoretical research on constructing efficient measurements is still very active. We will only consider linear measurements. This means that if $x$ is the signal that we want to recover, the measurements can be written as follows:

$$y_i = \langle a_i, x \rangle, \quad 1 \le i \le m, \qquad (1.1)$$


Figure 1.1

Figure 1.2

where

• i is the measurement number

• ai is the i-th measurement vector

• yi is the i-th observation.

Denote by $m$ the number of measurements. In this course, only finite-dimensional objects will be considered (while an infinite-dimensional theory exists). Then, suppose that $x \in \mathbb{R}^n$ or $\mathbb{C}^n$. One can rewrite the measurements as follows:

$$y = Ax = \begin{pmatrix} a_1^* \\ \vdots \\ a_m^* \end{pmatrix} x, \qquad (1.2)$$

with

• $y \in \mathbb{C}^m$ the vector of measurements,

• $A \in \mathbb{C}^{m \times n}$ the sensing matrix whose rows are the $a_i^*$'s,

• $x \in \mathbb{R}^n$ the signal to reconstruct.

The fewer the measurements, the happier the world (for storage/acquisition cost reasons), so we will always have

$$m \ll n,$$


Figure 1.3

that is, the number of measurements is much smaller than the dimension of the ambient space. This is a classical setting of high-dimensional statistics: the number of observations $m$ is much smaller than the dimension $n$ of the parameter space.

Now, we want to reconstruct $x \in \mathbb{R}^n$ from $y$ using $A$. From an algebraic viewpoint, we want to solve an under-determined system of linear equations, see Figure 1.3. Hence, there is an infinite number of solutions unless stronger assumptions are made on $x$.

1.1.1 A first regularization

One could propose to pick the solution that minimizes the energy of the signal, i.e.

$$\min_{x \in \mathbb{R}^n} \|x\|_2 \quad \text{such that} \quad y = Ax. \qquad (1.3)$$

It has the nice advantage of leading to a closed-form solution. Denoting by $A^\dagger$ the pseudo-inverse of $A$,

$$\hat{x} = A^\dagger y.$$

However, in practice, such a regularization does not lead to satisfactory results. This example has been treated in Notebook 71 (with two methods: Fermat's rule and the Lagrangian method). This has to be known.
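To make the discussion concrete, here is a minimal numerical sketch of this first regularization, assuming NumPy is available; the dimensions, the sparsity level and the Gaussian sensing matrix are arbitrary illustrative choices, not the setting prescribed by the notes. It computes the minimum-energy solution $A^\dagger y$ of (1.3) and shows that it is neither sparse nor close to the true signal.

import numpy as np

rng = np.random.default_rng(0)
n, m, s = 200, 50, 5              # ambient dimension, number of measurements, sparsity

# s-sparse ground-truth signal
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.standard_normal(s)

# random Gaussian sensing matrix and noiseless measurements y = Ax
A = rng.standard_normal((m, n))
y = A @ x

# minimum-energy solution of (1.3): x_l2 = A^+ y (pseudo-inverse)
x_l2 = np.linalg.pinv(A) @ y

print("residual ||A x_l2 - y||_2:", np.linalg.norm(A @ x_l2 - y))              # ~ 0
print("error    ||x_l2 - x||_2  :", np.linalg.norm(x_l2 - x))                  # large
print("non-zero entries of x_l2 :", np.count_nonzero(np.abs(x_l2) > 1e-8))     # ~ n, not sparse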

1.1.2 Another paradigm

The main idea in CS is that if one wants to sense only a few linear measurements and still be able to exactly reconstruct the object of interest, then the object should not be very complicated: it should live in a subspace of small dimension compared to the ambient dimension. That is why a small number of measurements can be sufficient to reconstruct it: we should observe a quantity of measurements relative to the intrinsic information in the signal. This simple feature of the signal is called sparsity. It means


that $x$ is essentially made of zeros, except for a few coordinates. The number of non-zero coefficients is then small compared to the ambient dimension $n$.

1.2 Introduction to sparsity

In compressed sensing and beyond, a key principle is the sparsity of the signal of interest.

Definition 1.1. We say that a signal $x \in \mathbb{R}^n$ is $s$-sparse if the number of non-zero coefficients of $x$ is at most $s$:

$$|\{i \in \{1, \dots, n\} : x_i \neq 0\}| \le s.$$

The complexity, or "intrinsic" information content, of a compressible/sparse signal is much smaller than its signal length, i.e. $s \ll n$. It means that even if the ambient dimension is $n$, the signal $x$ lives in a subspace of low dimension. To represent data, one will naturally choose the sparsest representation possible, to have the simplest representation.

Without any further information on the signal $x$, we assume that it is $s$-sparse in the canonical basis, see for instance Fig. 1.4. Real-world signals do not always look like Fig. 1.4. As empirically observed, many real-world signals are compressible in the sense that they are well approximated by sparse signals, often after an appropriate change of basis (we will see that the JPEG, MPEG, or MP3 technologies rely on such ideas, and work very well in practice). Then one could say that a signal is $s$-sparse in the basis $F = (f_i)_{1 \le i \le n}$, meaning that the signal can be represented as

$$\sum_{i=1}^{n} \theta_i f_i,$$

where $\theta$ is a sparse vector of coordinates in the basis $F$. In this case, we do not want to reconstruct the initial signal $x$ but its representation coefficients $(\theta_i)_{1 \le i \le n}$, which form a sparse vector. This amounts to replacing the sensing matrix $A$ by $AF^*$, where $F = (f_1, \dots, f_n)^*$.
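As a small illustration of this change of sensing matrix (a sketch under arbitrary choices: NumPy, a random orthonormal basis obtained from a QR factorization, a Gaussian A), one can check numerically that measuring $x = F^*\theta$ with $A$ is the same as measuring the sparse vector $\theta$ with the sensing matrix $AF^*$:

import numpy as np

rng = np.random.default_rng(1)
n, m, s = 128, 40, 4

# an arbitrary orthonormal basis: the rows of F are the basis vectors f_i
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
F = Q.T

# theta is sparse in the basis F; the signal x = F^* theta is dense in the canonical basis
theta = np.zeros(n)
theta[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
x = F.T @ theta                    # F^* = F^T here since F is real

A = rng.standard_normal((m, n))
y = A @ x                          # measurements of x ...
B = A @ F.T                        # ... equal measurements of theta with sensing matrix A F^*
print(np.allclose(y, B @ theta))   # True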

Remark 1.2. In this course, we will always consider linear transforms, which is why formally the object of interest will always be a vector of $\mathbb{R}^n$ (or $\mathbb{C}^n$). But one should keep in mind that this $x$ can physically stand for an image or whatever the reader may imagine. For instance, if $x$ stands for an image of size $\sqrt{n} \times \sqrt{n}$, it can always be vectorized into a vector of $\mathbb{R}^n$. This convention will simplify notation in the course notes. Of course, one should not be fond of vectorizing every signal in practical work!

For instance, JPEG relies on the sparsity of images in the discrete cosine basis or a wavelet basis and achieves compression by only storing the largest discrete cosine or wavelet coefficients. The other coefficients are simply set to zero.


Figure 1.4 – An example of a 512 × 512 image sparse in the canonical/Dirac basis: only 10% of the coefficients are non-zero. Apart from astronomical or microscopy purposes, this kind of image has a limited interest...

1.3 Is the world sparse? Some orthogonal transforms

An orthogonal basis is a dictionary of minimal size that can yield a sparse representation if it is designed to concentrate the signal energy over a set of few vectors. This set gives a geometric signal description. Efficient signal compression and noise-reduction algorithms are then implemented with diagonal operators computed with fast algorithms. But this is not always optimal. In natural languages, a richer dictionary helps to build shorter and more precise sentences. Similarly, dictionaries of vectors that are larger than bases are needed to build sparse representations of complex signals. But choosing among them is difficult and requires more complex algorithms. Sparse representations in redundant dictionaries can improve pattern recognition, compression, and noise reduction, but also the resolution of new inverse problems. This includes super-resolution, source separation, and compressive sensing. This first chapter is a sparse book representation, providing the story line and the main ideas. It gives a sense of orientation for choosing a path to travel. In this chapter we will focus on orthogonal transforms, but in practice redundant dictionaries have been proven to be very useful.

1.3.1 Sparse approximation in a basis

In the beginning of this chapter, the discrete approach has been taken: $x$ lives in a finite-dimensional space. For some applications, one can be interested in an infinite-dimensional signal, and its finite-dimensional version becomes its approximation. For ease of notation, we will present orthogonal transforms massively used in applications in an infinite-dimensional setting.


Linear vs. non-linear approximation

Given an orthonormal basis $\mathcal{B} = \{\varphi_k\}_{k \in \mathbb{N}}$ of $L^2([0,1]^d)$ (where $d = 1$ for a signal and $d = 2$ for an image), for $f \in L^2([0,1]^d)$ one can write $f = \sum_k \langle f, \varphi_k \rangle \varphi_k$. To approximate $f$, one wants to fix a finite set $I_K$ such that $|I_K| = K$:

$$f_{[K]} = \sum_{k \in I_K} \langle f, \varphi_k \rangle \varphi_k.$$

There are two ways to construct such a set IK :

• Linear approximation: the set $I_K$ does not depend on $f$, i.e. $I_K = \{K \text{ low frequencies}\}$. The error

$$\epsilon_\ell(K, f) = \|f - f_{[K]}\|_2^2 = \sum_{k = K+1}^{+\infty} |\langle f, \varphi_k \rangle|^2$$

decreases quickly when $K$ increases if the coefficient amplitudes $|\langle f, \varphi_k \rangle|^2$ have a fast decay when the index $k$ increases. The dimension $K$ has to be adjusted to the desired approximation error.

Theorem 1.3. Let $\{g_k\}_{k \in \mathbb{N}}$ be an orthonormal basis of $L^2[0,1]$. Suppose that for some $s > 1/2$,

$$\sum_{k=0}^{\infty} k^{2s} |\langle f, g_k \rangle|^2 < \infty.$$

Then, there exist $A, B > 0$ such that

$$A \sum_{k=0}^{\infty} k^{2s} |\langle f, g_k \rangle|^2 \le \sum_{K=0}^{\infty} K^{2s-1} \epsilon_\ell(K, f) \le B \sum_{k=0}^{\infty} k^{2s} |\langle f, g_k \rangle|^2.$$

It follows that $\epsilon_\ell(K, f) = o(K^{-2s})$.

This theorem establishes that the linear approximation error of $f$ in a basis $\mathcal{B}$ decays faster than $K^{-2s}$ if $f$ belongs to

$$W_{\mathcal{B}, s} = \Big\{ f \in L^2[0,1], \ \sum_{m=0}^{\infty} m^{2s} |\langle f, g_m \rangle|^2 < \infty \Big\}.$$

In the following sections, we will see that in the case of Fourier or wavelet bases, this is in fact a Sobolev space.

Linear approximations reduce the space dimensionality but can introduce important errors when reducing the resolution if the signal is not uniformly regular. This requires defining an irregular sampling adapted to the local signal regularity, see the next paragraph.


Figure 1.5 – Comparison between linear and non-linear approximations of an image. (*)

(*) from Gabriel Peyré’s slides.

• Non-linear approximation: it consists in minimizing $\|f - f_{[K]}\|$ for a given $K$, i.e. $I_K = \{k \ ; \ |\langle f, \varphi_k \rangle| > T\}$ and then $K = |I_K|$. Therefore, nonlinear approximations operate in two stages. First, a linear operator approximates the signal $f$ with $n$ samples written $(f_m)_{1 \le m \le n}$. Then, a nonlinear approximation of $(f_m)$ is computed to reduce the $n$ coefficients to $K \ll n$ coefficients in a sparse representation. To obtain a sparse representation with a nonlinear approximation, we have to choose an orthonormal basis which concentrates the signal energy as much as possible over few coefficients. To do so, one can study the non-linear approximation error:

$$\epsilon_n(K, f) := \sum_{k \notin I_K} |\langle f, g_k \rangle|^2.$$

Hard-thresholding. A common non-linear approximation is called hard-thresholding; it consists in giving the best $K$-term (non-linear) approximation in the basis $\mathcal{B}$:

$$f_{[K]} = \sum_{|\langle f, \varphi_k \rangle| > T} \langle f, \varphi_k \rangle \varphi_k = \sum_{k} H_T(\langle f, \varphi_k \rangle) \varphi_k,$$

where

$$H_T(x) = \begin{cases} x & \text{if } |x| > T, \\ 0 & \text{if } |x| \le T. \end{cases}$$

The approximation error decay is usually polynomial:

$$\|f - f_{[K]}\|^2 \le C_f K^{-\alpha},$$


Figure 1.6 – Log-log plot of the approximation error for 4 different images using wavelets. (*)

(*) from Gabriel Peyré’s slides.

where (i) $\alpha$ depends on the signal class (regular, piecewise smooth, etc.); (ii) $C_f$ depends on $f$ (norm of the signal within its class). We can draw a log-log plot (see Figure 1.6) to display the affine profile of the approximation error, since

$$\log(\|f - f_{[K]}\|^2) = \text{cst} - \alpha \log(K).$$
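The hard-thresholding and best $K$-term approximation operators described above are straightforward to implement. The following sketch (assuming NumPy; the synthetic coefficients with polynomial decay are only an illustration) computes the non-linear approximation error $\epsilon_n(K, f)$ for a few values of $K$.

import numpy as np

def hard_threshold(c, T):
    """Keep the coefficients with |c_k| > T and set the others to zero."""
    return np.where(np.abs(c) > T, c, 0.0)

def best_K_term(c, K):
    """Best K-term approximation: keep the K largest coefficients in magnitude."""
    out = np.zeros_like(c)
    idx = np.argsort(np.abs(c))[-K:]
    out[idx] = c[idx]
    return out

# synthetic coefficients with polynomial decay (arbitrary illustrative choice)
rng = np.random.default_rng(0)
c = rng.standard_normal(1000) / (1.0 + np.arange(1000)) ** 1.5

for K in (10, 50, 200):
    T = np.sort(np.abs(c))[-(K + 1)]          # threshold between the K-th and (K+1)-th largest
    f_K = hard_threshold(c, T)                # keeps exactly the K largest coefficients here
    eps_n = np.linalg.norm(c - f_K) ** 2      # non-linear approximation error
    print(f"K = {K:4d}   kept = {np.count_nonzero(f_K)}   eps_n(K, f) = {eps_n:.3e}")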

We will present some orthogonal transforms that achieve good approximations, but let's kill the suspense by looking at Figure 1.7.

Computational motivations

Signal coefficients are computed from the $n$ input sample values $(f_k)_{1 \le k \le n}$ with an orthogonal change of basis that takes $n^2$ operations in unstructured bases.

Fourier and wavelet bases are the journey's starting point. They decompose signals over oscillatory waveforms that reveal many signal properties and provide a path to sparse representations. Discretized signals often have a very large size $n \ge 10^6$, and thus can only be processed by fast algorithms, typically implemented with $O(n \log n)$ operations and memory. This illustrates the strong connection between well-structured mathematical tools and fast algorithms.


Figure 1.7 – Approximation comparison in different bases. (*)

(*) from Gabriel Peyré’s slides.

1.3.2 Fourier tools

The Fourier transform is everywhere in physics and mathematics because it diagonalizes time-invariant convolution operators.

From classical Fourier analysis, we know that $\big\{ \frac{1}{\sqrt{2B\pi}} e^{i B^{-1} k \cdot} : k \in \mathbb{Z} \big\}$ is an orthonormal basis of $L^2([-B\pi, B\pi])$. So, given any $f \in L^2([-B\pi, B\pi])$,

$$f(x) = \frac{1}{2B\pi} \sum_{k \in \mathbb{Z}} \hat{f}(kB^{-1}) \, e^{i k B^{-1} x}. \qquad (1.4)$$

Recall that the Fourier transform of $f \in L^1(\mathbb{R})$ is defined by

$$\hat{f}(\omega) = \int_{\mathbb{R}} f(x) e^{-ix\omega} \, dx, \quad \omega \in \mathbb{R},$$

and this definition can be extended to $L^2(\mathbb{R})$ since $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ is dense in $L^2(\mathbb{R})$. A direct consequence of (1.4) is the celebrated Shannon-Nyquist-Whittaker sampling theorem:

Theorem 1.4. Suppose $f$ is piecewise smooth and continuous and $\hat{f}(\omega) = 0$ for all $|\omega| > B\pi$. Then,

$$f(x) = \sum_{k \in \mathbb{Z}} f\Big(\frac{k}{B}\Big) \, \varphi\Big(x - \frac{k}{B}\Big),$$

where $\varphi(x) = \frac{\sin(\pi B x)}{\pi B x}$. We also have that

$$f_K = \sum_{|k| \le K} f\Big(\frac{k}{B}\Big) \, \varphi\Big(\cdot - \frac{k}{B}\Big) \to f \quad \text{in } L^\infty(\mathbb{R}).$$
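A small numerical sanity check of the sampling theorem (a sketch, assuming NumPy; the band limit B, the test signal and the truncation level K are arbitrary choices): a band-limited signal is reconstructed from its samples $f(k/B)$ by truncated Shannon interpolation with $\varphi(x) = \sin(\pi B x)/(\pi B x)$.

import numpy as np

B = 4.0                                         # band limit: supp(f^) in [-B*pi, B*pi]
f = lambda x: np.sinc(B * (x - 0.3))            # np.sinc(u) = sin(pi u)/(pi u): band-limited test signal

t = np.linspace(-5.0, 5.0, 2001)                # evaluation grid
K = 200                                         # truncation level of the interpolation formula

# truncated Shannon interpolation: sum_k f(k/B) * phi(t - k/B) with phi(x) = sinc(B x)
f_rec = sum(f(k / B) * np.sinc(B * (t - k / B)) for k in range(-K, K + 1))

print("max reconstruction error:", np.max(np.abs(f_rec - f(t))))   # small (truncation error only)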

Some examples

• Diracs are often used in calculations. A Dirac $\delta$ associates to a function its value at $t = 0$. Since $e^{-i\omega 0} = 1$, it seems reasonable to define its Fourier transform by

$$\hat{\delta}(\omega) = \int_{\mathbb{R}} \delta(t) e^{-i\omega t} \, dt = 1. \qquad (1.5)$$

This formula is justified mathematically by the extension of the Fourier transform to tempered distributions.

• The indicator function $f = 1_{[-T, T]}$ is discontinuous at $t = \pm T$. Hence its Fourier transform is not integrable:

$$\hat{f}(\omega) = \frac{2 \sin(T\omega)}{\omega}.$$

• A Gaussian $f(t) = \exp(-t^2)$ is a function with fast asymptotic decay (it belongs to the Schwartz space). Its Fourier transform is also a Gaussian:

$$\hat{f}(\omega) = \sqrt{\pi} \exp(-\omega^2/4).$$

(The computation can be made by solving the ODE $2\hat{f}' + \omega \hat{f} = 0$.) Well normalized, the Gaussian function is a fixed point of the Fourier transform.

• A translated Dirac $\delta_\tau(t) = \delta(t - \tau)$ has a Fourier transform calculated by evaluating $e^{-i\omega t}$ at $t = \tau$, so

$$\hat{\delta}_\tau(\omega) = e^{-i\omega\tau}.$$

When $f$ is defined only on an interval, say $[0,1]$, then the Fourier transform becomes a decomposition in the Fourier orthonormal basis $\{e^{i 2\pi k t}\}_{k \in \mathbb{Z}}$ of $L^2[0,1]$. If $f$ is uniformly regular, then its Fourier coefficients also have a fast decay when the frequency $2\pi k$ increases, so it can be easily approximated with few low-frequency Fourier coefficients. The Fourier transform therefore defines a sparse representation of uniformly regular functions.

Fourier and regularity

The global regularity of a signal $f$ depends on the decay of $|\hat{f}(\omega)|$ when the frequency $\omega$ increases. If $\hat{f} \in L^1(\mathbb{R})$, then the Fourier inversion formula ensures that $f$ is continuous and bounded:

$$|f(t)| \le \frac{1}{2\pi} \int_{\mathbb{R}} |\hat{f}(\omega) e^{it\omega}| \, d\omega = \frac{1}{2\pi} \int_{\mathbb{R}} |\hat{f}(\omega)| \, d\omega < \infty. \qquad (1.6)$$


Theorem 1.5. A function $f$ is bounded and $p$ times continuously differentiable with bounded derivatives if

$$\int_{\mathbb{R}} |\hat{f}(\omega)| (1 + |\omega|^p) \, d\omega < \infty.$$

Proof. Use the time-derivative property of the Fourier transform (Figure A.1).

This result proves that if a constant $K$ and $\varepsilon > 0$ exist such that

$$|\hat{f}(\omega)| \le \frac{K}{1 + |\omega|^{p+1+\varepsilon}}, \quad \text{then } f \in C^p.$$

In words, the more regular $f(t)$, the faster the decay of the sinusoidal wave amplitude $|\hat{f}(\omega)|$ when the frequency $\omega$ increases.

Let us define the Sobolev spaces.

Definition 1.6 (Sobolev spaces). Let $\Omega \subseteq \mathbb{R}^N$. The Sobolev space $W^{s,p}(\Omega)$ is the space of functions $f \in L^p(\Omega)$ such that for every multi-index $\alpha = (\alpha_j)_{j=1}^{N} \in \mathbb{N}_0^N$ with $|\alpha| = \alpha_1 + \ldots + \alpha_N \le s$, the weak derivative $D^\alpha f \in L^p(\Omega)$ exists, i.e.

$$\int f \, \partial^\alpha \varphi = (-1)^{|\alpha|} \int D^\alpha f \, \varphi, \quad \forall \varphi \in C^\infty_c(\Omega).$$

In the above, $\partial^\alpha \varphi = \frac{\partial^{|\alpha|} \varphi}{\partial x_1^{\alpha_1} \cdots \partial x_N^{\alpha_N}}$.

$W^{s,p}(\Omega)$ is a Banach space equipped with the norm

$$\|f\|_{s,p} = \Big( \sum_{|\alpha| \le s} \int |D^\alpha f|^p \Big)^{1/p}.$$

One can show (see the appendix) the following result on linear approximation in the Fourier representation.

Theorem 1.7 (Fourier linear approximation for Sobolev spaces). Let $f \in L^2([0,1])$ with $\mathrm{Supp}(f) \subseteq (0,1)$. Then $f \in W^{s,2}[0,1]$ if and only if

$$\sum_{k=1}^{\infty} k^{2s} \, \frac{\epsilon_\ell(k, f)}{k} < \infty. \qquad (1.7)$$

So, $\epsilon_\ell(K, f) = o(K^{-2s})$.

The previous results also prove that if $\hat{f}$ has compact support, then $f \in C^\infty$. The decay of $|\hat{f}(\omega)|$ depends on the worst singular behavior of $f$.

For example, $f = 1_{[-T, T]}$ is discontinuous at $t = \pm T$, so $|\hat{f}(\omega)|$ decays like $|\omega|^{-1}$, meaning that one cannot expect a faster decay than $N^{-1}$. In this case, it could also be important to know that $f(t)$ is regular for $t \neq \pm T$. This information cannot be derived from the decay of $|\hat{f}(\omega)|$. To characterize the local regularity of a signal $f$, it is necessary to decompose it over waveforms that are sufficiently localized in time, as opposed to sinusoidal waves.

As long as we are satisfied with linear time-invariant operators or uniformly regular signals, the Fourier transform provides simple answers to most questions. Its richness makes it suitable for a wide range of applications such as signal transmission or stationary signal processing. However, to represent irregular/discontinuous objects, the Fourier transform becomes a cumbersome tool that requires many coefficients to represent a localized event.

Limitations of Fourier representation for real signals

Although the Shannon-Nyquist-Whittaker theorem provides a discrete representation of functions and describes how one may approximate $f$ with finitely many values, interpolation with $\varphi$ is rarely used in practice due to its slow decay. Furthermore, Fourier representations have the drawback of requiring many samples or coefficients to represent localized events. More precisely, the support of the functions $e^{ikB^{-1}\cdot}$ is the entire real line, so changing $f$ locally will result in a change of all its coefficients $\hat{f}(kB^{-1})$.

2D Fourier transform

The Fourier transform in $\mathbb{R}^n$ is a straightforward extension of the one-dimensional Fourier transform. The two-dimensional case is briefly reviewed for image processing applications. The Fourier transform of a two-dimensional integrable function $f \in L^1(\mathbb{R}^2)$ is

$$\hat{f}(\omega_1, \omega_2) = \iint f(x_1, x_2) \exp\big(-i(\omega_1 x_1 + \omega_2 x_2)\big) \, dx_1 \, dx_2.$$

In polar coordinates, $\exp\big(i(\omega_1 x_1 + \omega_2 x_2)\big)$ can be rewritten as

$$\exp\big(i(\omega_1 x_1 + \omega_2 x_2)\big) = \exp\big(i\xi(x_1 \cos\theta + x_2 \sin\theta)\big)$$

with $\xi = \sqrt{\omega_1^2 + \omega_2^2}$. It is a plane wave that propagates in the direction $\theta$ and oscillates at frequency $\xi$. The properties of the two-dimensional Fourier transform are essentially the same as in one dimension.

Discrete setting

Over discrete signals, the Fourier transform is a decomposition in a discrete orthogonal Fourier basis $\{e^{i 2\pi k j / n}\}_{0 \le k < n}$ of $\mathbb{C}^n$, which has properties similar to those of the Fourier transform on functions. For a signal $f$ of $n$ points, a direct calculation of the $n$ discrete Fourier sums

$$\hat{f}[k] = \sum_{\ell=0}^{n-1} f[\ell] \, e^{-i \frac{2\pi k \ell}{n}}, \quad \text{for } k = 1, \dots, n,$$

requires $n^2$ operations.


Its embedded structure leads to fast Fourier transform (FFT) algorithms, which compute the discrete Fourier coefficients with $O(n \log n)$ operations instead of $n^2$. The FFT algorithm is a cornerstone of discrete signal processing.
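The gain is easy to observe numerically; the following sketch (assuming NumPy) compares the direct $O(n^2)$ evaluation of the discrete Fourier sums with the $O(n \log n)$ FFT, here through numpy.fft.fft.

import numpy as np

n = 1024
rng = np.random.default_rng(0)
f = rng.standard_normal(n)

# direct O(n^2) computation of the discrete Fourier sums
k = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(k, k) / n)   # W[k, l] = e^{-2i pi k l / n}
F_direct = W @ f

# O(n log n) FFT
F_fft = np.fft.fft(f)

print(np.allclose(F_direct, F_fft))            # True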

Conclusion on the Fourier transform

The Fourier transform for this course?

What would be signals sparsely encoded by the Fourier transform? A signal x admits a sparse representation in the Fourier domain if x is a combination of a few sinusoids. The Fourier transform will be efficient to represent objects that are very localized in the frequency domain. Unfortunately, a "real" signal cannot be very localized in the frequency domain (take an image with sharp transitions). So a "real" signal cannot be sparsely encoded by the Fourier transform. However, knowing the features of the Fourier transform can be very useful, since this transform is widespread in applications (and it sometimes models the acquisition process, cf. MRI).

1.3.3 Wavelets

A big question is then:

no matter how complicated the signal is, can we describe it by an object living in a small-dimensional space? (Q1)

This is a typical question in CS, and more generally in high-dimensional statistics. For instance, can a natural image be encoded by a few non-zero coefficients in a certain basis?

Let us look at the image of Barbara in Figure 1.8. In this picture, there are a lot of textures that could be efficiently encoded by the Fourier transform: stripes are typical sinusoids. However, other parts of the image are continuous chunks with sharp edges. So to encode Barbara efficiently, one has to be able to encode (spatial) frequencies (corresponding to the texture part) and spatial objects (corresponding to the cartoon part). In the 1980s and 1990s, research brought answers to Question (Q1). The main idea is to construct bases that are localized both in time (or space) and in frequency (contrary to the Fourier basis, which is only localized in frequency). Engineers and researchers managed to propose such bases, in which the coefficients of real photographs are "nearly" sparse.

One can construct orthonormal bases $(\varphi_j)_{1 \le j \le n}$ such that any signal $f$ can be written as

$$f = \sum_{i=1}^{n} \langle f, \varphi_i \rangle \varphi_i.$$


Figure 1.8 – Barbara

The coefficients $(\langle f, \varphi_i \rangle)_i$ are mostly small (relative to the few coefficients that carry the main information of the signal). If we keep the $s$ largest coefficients in the representation of $f$, i.e. if we hard-threshold the coefficients $(\langle f, \varphi_i \rangle)_i$, then the sparse approximation of $f$,

$$f_{[s]} = \sum_{i \in I_s} \langle f, \varphi_i \rangle \varphi_i,$$

is generally indistinguishable from $f$. The sparsity assumption is then approximately satisfied, and if we can design stable recovery procedures for sparse signals, the compressed sensing approach can be applied to such signals.

Such bases are built from wavelets.

Definition 1.8. We say that a function $\psi \in L^2(\mathbb{R})$ is a wavelet for $L^2(\mathbb{R})$ if

$$\big\{ \psi_{j,k} := 2^{j/2} \, \psi(2^j \cdot - k) : j, k \in \mathbb{Z} \big\}$$

forms an orthonormal basis of $L^2(\mathbb{R})$.

Such a structure of the basis, as in Definition 1.8, makes it possible to analyse signal structures of very different sizes.

More details on how to construct a wavelet basis can be found in the appendix; this is beyond the scope of the present lecture and is only given for the reader's interest.

A brief history

In 1910, Haar constructed the first wavelet basis (although it was not known as such), by choosing

$$\psi = 1_{[-1, -1/2)} - 1_{[-1/2, 0)};$$

he showed that

$$\big\{ \psi_{j,k} := 2^{j/2} \, \psi(2^j \cdot - k) : j, k \in \mathbb{Z} \big\}$$

forms an orthonormal basis of $L^2(\mathbb{R})$. The basis functions are compactly supported, and large coefficients occur only at sharp signal transitions (discontinuities). In 1980, Strömberg discovered a piecewise linear wavelet which yields better approximation properties for smooth functions. Unaware of this result, Meyer tried to prove that there does not exist a regular wavelet which generates an orthonormal basis. Instead of proving this, his attempt led to the construction of an entire family of orthonormal wavelet bases which are infinitely continuously differentiable. The work of Meyer led to a flurry of research on wavelets throughout the late 1980s and 1990s. In the following sections, we shall study the systematic approach of constructing orthonormal wavelet bases via multiresolution analysis, which was established by Meyer and Mallat.

How to choose a wavelet?

Most applications of wavelet bases exploit their ability to efficiently approximate particular classes of functions with few nonzero wavelet coefficients. Therefore, we would like a wavelet such that most of the coefficients $\langle f, \psi_{j,n} \rangle \approx 0$. There are generally 3 desirable qualities:

• decay

• vanishing moments

• smoothness

Definition 1.9 (Vanishing moments). We say that $\psi$ has $p$ vanishing moments if $\int t^k \psi(t) \, dt = 0$ for all $k = 0, \dots, p - 1$.

Note that if $\psi$ has $p$ vanishing moments, then $\langle f, \psi \rangle = 0$ whenever $f$ is a polynomial of degree at most $p - 1$. In general, if $f$ has very few discontinuities and is smooth between the discontinuities, then one may want to choose a wavelet with many vanishing moments. On the other hand, as the density of the singularities increases, one may wish to find a wavelet with smaller support at the cost of reducing the number of vanishing moments.

Although the size of the support and the number of vanishing moments are not directly linked, one can show that an orthogonal wavelet with $p$ vanishing moments necessarily has a support of size at least $2p - 1$.

We will now formally state that smoothness of a wavelet in fact implies the vanishing moments property.

Proposition 1.10. Suppose that $\psi$ is an orthonormal wavelet. For $l \in \mathbb{N}$, assume that

• $\psi \in C^l$,

• $\psi^{(s)}$ is bounded on $\mathbb{R}$ for $s = 0, \dots, l$,

• $|\psi(x)| \le C / (1 + |x|)^\alpha$ for some $\alpha > l + 1$.

Then $\psi$ has $l + 1$ vanishing moments.

How to construct a wavelet? One way to do it

Definition 1.11. A multiresolution analysis (MRA) consists of a sequence of closed subspaces $V_j$ of $L^2(\mathbb{R})$, with $j \in \mathbb{Z}$, satisfying the following.

(I) $V_j \subset V_{j+1}$ for all $j \in \mathbb{Z}$.

(II) For all $j \in \mathbb{Z}$, $f \in V_j$ if and only if $f(2\,\cdot) \in V_{j+1}$.

(III) $\lim_{j \to -\infty} V_j = \bigcap_{j \in \mathbb{Z}} V_j = \{0\}$.

(IV) $\lim_{j \to +\infty} V_j = \overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R})$.

(V) There exists $\varphi \in V_0$ such that $\{\varphi(\cdot - k), k \in \mathbb{Z}\}$ is an orthonormal basis for $V_0$.

The function $\varphi$ in (V) is called a scaling function for the MRA.

Note that condition (II) implies that $\{\varphi_{j,k}, k \in \mathbb{Z}\}$ is an orthonormal basis for $V_j$.

Construction from an MRA. Let $W_0$ be the orthogonal complement of $V_0$ in $V_1$, so that

$$V_1 = V_0 \oplus W_0.$$

If we dilate elements of $W_0$ by $2^j$ (for $\psi \in W_0$, consider $\psi(2^j \cdot)$), we get $W_j$ such that

$$V_{j+1} = V_j \oplus W_j, \quad \forall j \in \mathbb{Z}.$$

Since $V_j \to \{0\}$ as $j \to -\infty$, we have that

$$V_{j+1} = V_j \oplus W_j = V_{j-1} \oplus W_{j-1} \oplus W_j = \bigoplus_{l = -\infty}^{j} W_l.$$

Also, since $V_j \to L^2(\mathbb{R})$ as $j \to +\infty$,

$$L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j.$$

If we can find $\psi \in W_0$ such that $\{\psi_{0,k}\}_{k \in \mathbb{Z}}$ is an orthonormal basis of $W_0$, then $\{\psi_{j,k}\}_{k \in \mathbb{Z}}$ is an orthonormal basis of $W_j$ (by (II)). This implies that

$$\{\psi_{j,k}, \ j, k \in \mathbb{Z}\}$$

is an orthonormal basis of $L^2(\mathbb{R})$.


The low-pass filter

Definition 1.12. For a given scaling function $\varphi$, one can define the low-pass filter as the $2\pi$-periodic function $m_0(\xi) = \sum_k \alpha_k e^{ik\xi}$, element of $L^2(\mathbb{T})$, such that

$$\hat{\varphi}(2\xi) = \sum_k \alpha_k \hat{\varphi}(\xi) e^{ik\xi} =: m_0(\xi) \hat{\varphi}(\xi).$$

Why this definition? Note that

$$\frac{1}{2} \varphi\Big(\frac{\cdot}{2}\Big) \in V_{-1} \subset V_0.$$

By (V),

$$\frac{1}{2} \varphi\Big(\frac{\cdot}{2}\Big) = \sum_k \alpha_k \varphi(\cdot + k),$$

where

$$\alpha_k = \frac{1}{2} \int \varphi\Big(\frac{x}{2}\Big) \varphi(x + k) \, dx, \qquad \sum_k |\alpha_k|^2 < \infty.$$

By applying the Fourier transform,

$$\hat{\varphi}(2\xi) = \sum_k \alpha_k \hat{\varphi}(\xi) e^{ik\xi} =: m_0(\xi) \hat{\varphi}(\xi).$$

Construction from an MRA. One can show that given any MRA and scaling function (see the appendix for more details), we can always construct an orthonormal wavelet by

$$\hat{\psi}(\xi) = e^{i\xi/2} \, \overline{m_0(\xi/2 + \pi)} \, \hat{\varphi}(\xi/2).$$

Recall also that

$$\hat{\varphi}(2\xi) = \hat{\varphi}(\xi) m_0(\xi), \qquad m_0(\xi) = \sum_{k \in \mathbb{Z}} \alpha_k e^{ik\xi}.$$

So,

$$\hat{\psi}(2\xi) = e^{i\xi} \hat{\varphi}(\xi) \sum_{k \in \mathbb{Z}} \overline{\alpha_k} \, e^{-ik\xi} (-1)^k \ \Rightarrow \ \hat{\psi}(\xi) = \hat{\varphi}(\xi/2) \sum_{k \in \mathbb{Z}} \overline{\alpha_k} \, e^{-i(k-1)\xi/2} (-1)^k,$$

and by taking the inverse Fourier transform,

$$\psi(x) = 2 \sum_{k \in \mathbb{Z}} (-1)^k \overline{\alpha_k} \, \varphi(2x - (k - 1)).$$


Examples of wavelets

• The Haar wavelet. Let $V_j$ be the set of functions in $L^2$ which are constant on $[n 2^{-j}, (n+1) 2^{-j})$ for $n \in \mathbb{Z}$. Then $\{V_j\}_{j \in \mathbb{Z}}$ is an MRA with scaling function $\varphi = 1_{[-1, 0)}$. The corresponding low-pass filter is $m_0 = \frac{1}{2}(1 + e^{i\xi})$. Since

$$\hat{\varphi}(\xi) = \frac{1 - e^{i\xi}}{-i\xi} = e^{i\xi/2} \, \frac{\sin(\xi/2)}{\xi/2},$$

the wavelet satisfies

$$\hat{\psi}(\xi) = e^{i\xi/2} \, \frac{(1 - e^{-i\xi/2})(1 - e^{i\xi/2})}{-i\xi} = i e^{i\xi/2} \, \frac{\sin^2(\xi/4)}{\xi/4}.$$

This is the Fourier transform of $\psi = 1_{[-1, -1/2)} - 1_{[-1/2, 0)}$.

• The Shannon wavelet. Let $V_j$ be the set of functions in $L^2(\mathbb{R})$ whose Fourier transform has support contained in $[-2^j \pi, 2^j \pi]$. Then,

  – $\varphi(x) = \sin(\pi x)/(\pi x)$ is such that $\{\varphi_{0,n}\}_{n \in \mathbb{Z}}$ is an orthonormal basis of $V_0$. One can verify that this is a scaling function for the MRA $\{V_j\}_{j \in \mathbb{Z}}$.

  – Recall that the low-pass filter $m_0$ satisfies
$$\hat{\varphi}(2\xi) = 1_{[-\pi, \pi]}(2\xi) = \hat{\varphi}(\xi) m_0(\xi) = 1_{[-\pi, \pi]}(\xi) m_0(\xi).$$
So $m_0 = 1_{[-\pi/2, \pi/2]}$.

  – $\hat{\psi}(\xi) = 1_{[-2\pi, -\pi] \cup [\pi, 2\pi]}(\xi) \exp(i\xi/2)$, and so
$$\psi(x) = -2 \, \frac{\sin(2\pi x) + \cos(\pi x)}{\pi(2x + 1)}.$$
Since $\hat{\psi}$ has compact support, $\psi$ is $C^\infty$, but observe that it decays slowly in time. In particular, $|\psi(x)|$ decays like $|x|^{-1}$ since $\hat{\psi}$ is discontinuous at $\pm\pi$ and $\pm 2\pi$.

• Meyer's wavelet. The scaling function of Meyer's wavelet is defined in the Fourier domain by

$$\hat{\varphi}(\xi) = \begin{cases} 1 & |\xi| \le 2\pi/3, \\ \cos\big[\tfrac{\pi}{2} \nu\big(\tfrac{3}{2\pi} |\xi| - 1\big)\big] & 2\pi/3 \le |\xi| \le 4\pi/3, \\ 0 & \text{otherwise}, \end{cases}$$

where $\nu \in C^k$ or $C^\infty$ is a monotone function which satisfies $\nu(x) = 0$ for all $x \le 0$, $\nu(x) = 1$ for $x \ge 1$, and $\nu(x) + \nu(1 - x) = 1$ for all $x \in \mathbb{R}$. One can show that the associated wavelet is such that

$$\hat{\psi}(\xi) = \begin{cases} 0 & |\xi| \le 2\pi/3, \\ e^{i\xi/2} \sin\big[\tfrac{\pi}{2} \nu\big(\tfrac{3}{2\pi} |\xi| - 1\big)\big] & 2\pi/3 \le |\xi| \le 4\pi/3, \\ e^{i\xi/2} \cos\big[\tfrac{\pi}{2} \nu\big(\tfrac{3}{4\pi} |\xi| - 1\big)\big] & 4\pi/3 \le |\xi| \le 8\pi/3, \\ 0 & \text{otherwise}. \end{cases}$$


  – Since $\hat{\varphi}$ and $\hat{\psi}$ have compact support, $\varphi$ and $\psi$ are $C^\infty$.

  – The smoother transition in $\hat{\varphi}$ (compare with that of the Shannon scaling function) results in faster decay in time. If $\nu \in C^\infty$, then for all $N$ there exists $A_N$ such that
$$|\psi(x)| \le \frac{A_N}{(1 + |x|)^N}, \qquad |\varphi(x)| \le \frac{A_N}{(1 + |x|)^N}.$$
We remark however that although there is fast asymptotic decay, the constant $A_N$ grows with $N$, and in practice the numerical decay of $\psi$ may be slow.

  – Note that $\hat{\psi}(0) = 0$ and $\frac{d^n}{d\xi^n} \hat{\psi}(0) = 0$ for all $n \in \mathbb{N}$. Since, given $g(t) = (-it)^n f(t)$, one has $\hat{g}(\xi) = \frac{d^n}{d\xi^n} \hat{f}(\xi)$, it follows that
$$\hat{g}(0) = (-i)^n \int t^n f(t) \, dt = \hat{f}^{(n)}(0).$$
Going back to $\psi$, we see that
$$\int t^n \psi(t) \, dt = 0, \quad \forall n \in \mathbb{N}.$$

Linear approximation with wavelets

The decay of wavelet coefficients also characterizes Sobolev spaces, as pointed out by the following result.

Proposition 1.13. Let $s \in (0, q)$, where $q$ is the number of vanishing moments of the wavelet. A function $f \in L^2([0,1])$ is in $W^{s,2}([0,1])$ if and only if

$$\sum_{k=1}^{\infty} k^{2s} \, \frac{\epsilon_\ell(k, f)}{k} < \infty.$$

Hence, $\epsilon_\ell(K, f) = o(K^{-2s})$.

Non-linear approximation with wavelets

Theorem 1.14. If $f$ has discontinuities on $[0,1]$ and is uniformly Lipschitz-$\alpha$ between these discontinuities, with $\alpha \in (1/2, q)$ where $q$ is the number of vanishing moments of the considered wavelet, then

$$\epsilon_\ell(K, f) = O(\|f\|_{C^\alpha} K^{-1}) \quad \text{and} \quad \epsilon_n(K, f) = O(\|f\|_{C^\alpha} K^{-2\alpha}).$$

On a computational side

Again, on the computational side, the wavelet transform can be computed very quickly, using $O(n \log_2 n)$ operations.
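As an illustration of such a fast transform, here is a minimal one-dimensional orthonormal Haar transform (a sketch, assuming NumPy and a signal length that is a power of two; the piecewise-constant test signal is an arbitrary choice). A piecewise-constant signal produces very few non-zero detail coefficients, in line with the approximation results above.

import numpy as np

def haar_transform(f):
    """Orthonormal discrete Haar transform of a vector of length 2^J,
    computed level by level (pairwise averages and differences)."""
    f = np.asarray(f, dtype=float).copy()
    coeffs = []
    while len(f) > 1:
        avg = (f[0::2] + f[1::2]) / np.sqrt(2)     # approximation (scaling) coefficients
        det = (f[0::2] - f[1::2]) / np.sqrt(2)     # detail (wavelet) coefficients
        coeffs.append(det)
        f = avg
    coeffs.append(f)                               # coarsest approximation coefficient
    return coeffs

# a piecewise-constant signal: only the details crossing a jump are non-zero
f = np.concatenate([np.ones(256), 3 * np.ones(256), -np.ones(512)])
c = haar_transform(f)
nnz = sum(np.count_nonzero(np.abs(d) > 1e-12) for d in c)
print("non-zero Haar coefficients:", nnz, "out of", len(f))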


Figure 1.9 – Wavelet coefficients of an image, sorted in descending order. The majority of the coefficients are very small compared to the first 4000 ones.

Why you should like wavelets

1. Wavelet bases are great to approximate piecewise regular signals, especially compared to the Fourier basis!

2. The wavelet transform is a nice theoretical and numerical tool: it can be implemented very efficiently on a computer!

2D wavelet transform

To obtain a basis in 2D, one can take the Kronecker product of the 1D basis with itself: the 2D basis obtained is then separable (the 2D wavelet transforms implemented in practice are often not separable). In a discrete setting, this means that for all $(x, y) \in \{1, \dots, n\}^2$:

$$f[x, y] = \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \hat{f}[i_1, i_2] \, \psi_{i_1}(x) \, \psi_{i_2}(y),$$

where

$$\hat{f}[i_1, i_2] = \sum_{x = 1}^{n} \sum_{y = 1}^{n} f[x, y] \, \psi_{i_1}(x) \, \psi_{i_2}(y).$$

So if we proceed to hard-thresholding, the main information contained in the signal is preserved, and hence any "natural" image can be sparsely represented in a wavelet basis.
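The following sketch illustrates this 2D compression pipeline with the PyWavelets package (assumed to be available; the Daubechies-4 wavelet, the decomposition level, the synthetic test image and the number K of kept coefficients are arbitrary illustrative choices): transform, hard-threshold, inverse transform.

import numpy as np
import pywt   # PyWavelets, assumed to be installed

# a synthetic piecewise-smooth 512 x 512 "image"
u, v = np.meshgrid(np.linspace(0, 1, 512), np.linspace(0, 1, 512))
img = np.where((u - 0.5) ** 2 + (v - 0.5) ** 2 < 0.1, 1.0, 0.0) + 0.3 * np.sin(8 * u)

# 2D wavelet transform, hard-thresholding of the coefficients, inverse transform
coeffs = pywt.wavedec2(img, 'db4', level=4)
arr, slices = pywt.coeffs_to_array(coeffs)

K = 5000                                       # number of coefficients kept (~2% of 512*512)
T = np.sort(np.abs(arr).ravel())[-K]           # threshold selecting the K largest coefficients
arr_thr = np.where(np.abs(arr) >= T, arr, 0.0)

img_K = pywt.waverec2(pywt.array_to_coeffs(arr_thr, slices, output_format='wavedec2'), 'db4')
rel_err = np.linalg.norm(img_K - img) / np.linalg.norm(img)
print(f"kept {K} of {img.size} coefficients, relative error = {rel_err:.3f}")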


1.4 CS formalization

1.4.1 Noiseless recovery

Now that we are more familiar with measurements and sparsity, we can formalize the compressed sensing setting. The sparsity assumption often arises naturally for real signals. Most of the time, the measurements cannot be chosen by the statistician: they are imposed by the physics of the acquisition. However, if the statistician is free to choose the sensing vectors, i.e. the $a_i^*$'s, this is called the "plan d'expérience" (design of experiments) setting, and a measurement-selection strategy then has to be defined.

Whatever the constraints are, the goal in CS is to reconstruct a sparse signal using a small number of measurements.

Construct a minimal number of sensing vectors $(a_1, \dots, a_m)$ such that any $s$-sparse vector can be efficiently recovered from $y = Ax$ and $A \in \mathbb{R}^{m \times n}$. (P1)

We shall define what efficiently means. For the moment, a valid reconstruction of x will be one given by an algorithm that can be implemented in finite time and with reasonable computational cost. Some algorithms will turn out to be unrealistic (because NP-hard), while others are very efficient (linear programming or semi-definite programming).

Given sensing vectors $(a_1, \dots, a_m)$, construct a recovery algorithm for any $s$-sparse vector from $y = Ax$ and $A \in \mathbb{R}^{m \times n}$, i.e. find a map $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ which can be efficiently implemented and such that, for every $s$-sparse vector $x$, $\Delta(Ax) = x$. (P2)

We will see later what efficiently implemented means, in particular for $\ell_0$ and $\ell_1$ minimization. What is asked in (P2) is very demanding: we want a single sensing matrix $A$ to allow the reconstruction of any $s$-sparse vector. We will see that this is possible for certain sensing matrices, but one can also consider the reconstruction of a fixed but arbitrary $s$-sparse signal $x$. This is the difference between the uniform and non-uniform approaches.

Remark 1.15. CS is an example of a high-dimensional statistics problem because the number of measurements/observations is much smaller than the dimension $n$ of the object to estimate. Intuitively, the dimension of an $s$-sparse signal in $\mathbb{R}^n$ is $s$ and not $n$ (this would be exactly the case if we knew the true support of $x$). Then, if one wants to estimate an $s$-dimensional object, the minimal number of measurements should be $s$. We would like the same bound on $m$ for CS, which gives an intuition for the expected number of measurements in CS theory.


1.4.2 Noisy recovery

In many applications, data contain noise. We then observe a signal through linear projections, with an additive noise. One can write the measurements as follows:

$$y_i = \langle a_i, x \rangle + \xi_i, \quad i = 1, \dots, m,$$

where $\xi_i$ is the noise component of the $i$-th observation. In statistics, it is commonplace to model the noise as a family of random variables, generally i.i.d. Usually, a centered Gaussian distribution is considered, and in matrix notation

$$y = Ax + \xi,$$

with $\xi$ the noise vector.

Definition 1.16. We say that a reconstruction algorithm $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ is robust to noise of order $s$ if for every $s$-sparse vector $x$ and every noise vector $\xi$, one gets

$$\|x - \Delta(Ax + \xi)\|_2 \le c \, \|\xi\|_2,$$

where $c$ is an absolute constant.

The norm $\|\cdot\|$ may be left unspecified and adapted to the context: we would then talk about robust $\ell_2$ or robust $\ell_\infty$ recovery.

If there is no noise, then the recovery will be exact, since $\Delta(Ax) = x$. If the measurements are corrupted by noise, we want to ensure that the reconstruction error on $x$ will be at most of the order of the noise (here measured by its $\ell_2$-norm). In many applications, robustness to noise is quite crucial, because even if the data are noisy, one can hope that the noise magnitude will be low compared to that of the signal. In statistics (and in signal processing), the ratio $\|x\|_2 / \|\xi\|_2$ is called the Signal-to-Noise Ratio (SNR); in signal processing, the logarithm of this ratio is usually considered. It measures the feasibility of the denoising problem: if the SNR is small, then the signal is drowned in noise and its estimation will be difficult.
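A minimal simulation of the noisy measurement model (a sketch assuming NumPy; the dimensions, the sparsity and the noise level sigma are arbitrary choices) is given below; it computes the SNR as defined above, both as a ratio and in decibels.

import numpy as np

rng = np.random.default_rng(0)
n, m, s = 200, 80, 5
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n))

sigma = 0.05
xi = sigma * rng.standard_normal(m)            # i.i.d. centered Gaussian noise
y = A @ x + xi                                 # noisy measurements

snr = np.linalg.norm(x) / np.linalg.norm(xi)   # signal-to-noise ratio as in the text
print("SNR:", snr, " (", 20 * np.log10(snr), "dB )")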

1.5 Nearly-sparse signals: stable recovery

The signals of interest are generally not exactly sparse, but close to an $s$-sparse signal. We would like to ensure that recovery algorithms handle this case well: this is called stability.

Definition 1.17. We say that a recovery algorithm $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ is stable of order $s$ when, for every vector $x \in \mathbb{R}^n$,

$$\|x - \Delta(Ax)\|_2 \le c \, \frac{\min_{z \in \Sigma_s} \|x - z\|_1}{\sqrt{s}},$$

with $c$ an absolute constant.


Look closely at the choice of norms: the reconstruction error is measured in $\ell_2$-norm whereas the approximation error is measured in $\ell_1$-norm. We could also have put an $\ell_2$-norm in the approximation error term; however, one can show that if there exists an algorithm such that $\|x - \Delta(Ax)\|_2 \le c \min_{z \in \Sigma_s} \|x - z\|_2$ (even for $s = 1$), then necessarily $m \ge Cn$, where $C$ is a constant depending only on $c$. Therefore we cannot have stability with the $\ell_2$-norm in both terms without paying the price of a number of measurements at least proportional to the ambient dimension, which we do not want.

Note also that if $x$ is $s$-sparse, then the recovery will be exact. The approximation error satisfies

$$\sigma_1(x) = \min_{z \in \Sigma_s} \|x - z\|_1 = \|x - x_s\|_1,$$

where $x_s$ is obtained by keeping the $s$ largest entries (in absolute value) of $x$ and setting the others to zero (hard-thresholding). Indeed,

$$\sigma_1(x) = \min_{\substack{z \in \mathbb{R}^s \\ J_s \subset \{1, \dots, n\}, \, |J_s| = s}} \ \sum_{i=1}^{n} |x_i - z_i| \, 1_{i \in J_s} + \sum_{i=1}^{n} |x_i| \, 1_{i \notin J_s}.$$

The best $\ell_1$-approximation is thus a truncated version of $x$ where only the largest entries (in absolute value) of $x$ are kept. Therefore, if $x$ is not exactly sparse but close to an $s$-sparse signal, the approximation error $\sigma_1(x)$ will be small, and the reconstruction $\Delta(Ax)$ will be close to $x$ if $\Delta$ is stable.

As for the robustness property, the estimation error should be of the order of the error of the ideal model (without noise and with exactly $s$-sparse signals). If we are not in the ideal case, the price to pay should be of the order of the approximation error.

Furthermore, "good" algorithms should group both properties.

1.6 Robust and stable recovery

Definition 1.18. We say that an algorithm $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ is stable and robust of order $s$ if for all $x \in \mathbb{R}^n$ and $\xi \in \mathbb{R}^m$,

$$\|x - \Delta(Ax + \xi)\|_2 \le c \left( \frac{\min_{z \in \Sigma_s} \|x - z\|_1}{\sqrt{s}} + \|\xi\|_2 \right).$$

Remark 1.19. Remark on Lasso or Dantzig selector?

1.7 CS applications

1.7.1 CS and photography

As described previously, cameras typically operate in the two-step acquisition-compression setting. Then what would a camera based on compressed sensing look like?


Figure 1.10 – Example of a compressed measurement for photography

Figure 1.11 – Example of compressed measurements for photography

The goal is to construct a small number of measurements allowing the reconstruction of the captured scene. Instead of saving the color level of each pixel, we are going to make linear measurements of these color levels. This amounts to taking local means of the color levels, according to resolution and brightness, and then forming a weighted sum of these. In Figure 1.10 we give an example of one measurement of a scene.

In the classical approach, the acquisition step consists in measuring the scene with a sensing vector whose components are all zero except for one pixel. There are as many measurements as pixels, which explains the large size of RAW files.

For a CS camera, we will use several measurements of the type of Figure 1.10. Then m measurements as in Figure 1.11 are stored.

This is the principle of the single-pixel camera of Rice University: the measurements are real values which can be recorded using a photo plate with only one light-sensitive point (while usual photo plates contain millions of light-sensitive points).

The construction of the measurements is based on micro-mirrors, which let more or less light pass according to their orientation. The sum of all the color levels is then sent to the unique photo-sensitive detector. A key point in the construction of the measurements is randomness.

Can it work? Look at Figure 1.14. We still have to develop the theory to know that these kinds of measurements will be efficient. However, you can already guess where sparsity comes into play.

Exercise 1.20. Try to recast the single-pixel camera into a formal CS problem.


Figure 1.12 – Single-pixel camera principle.

Figure 1.13 – Single-pixel camera mirrors.

Figure 1.14 – Single-pixel camera reconstruction.


1.7.2 CS and Fourier measurements

Given a signal $x \in \mathbb{C}^n$ and a subset of frequencies $I \subset \{1, \dots, n\}$, it is not possible in general to reconstruct $x$ from its partial Fourier coefficients (since the Fourier transform is a bijection of $\mathbb{C}^n$). However, if $x$ is $s$-sparse, this becomes a CS example: we want to reconstruct a structured signal (in the canonical basis) from incomplete information in the Fourier domain.

Let $F \in \mathbb{C}^{n \times n}$ be the Fourier matrix (the discrete Fourier transform):

$$F = (F_{k\ell})_{1 \le k, \ell \le n}, \qquad F_{k,\ell} = \frac{1}{\sqrt{n}} \exp\left( \frac{-2i\pi (k-1)(\ell-1)}{n} \right).$$

Denoting by $(f_1, \dots, f_n)$ the row vectors of $F$, one observes $y_i = \langle f_i, x \rangle$ for $i \in I$. The sensing matrix is obtained by extracting the rows indexed by $I$ from $F$. The CS problem can then be recast as: knowing $m$ selected Fourier coefficients of a signal $x \in \mathbb{C}^n$ whose support is of size $s \ll n$, what are the conditions on $s, m, n$ for exact recovery? We will show that a random selection of frequencies leads to a good CS sensing matrix (good in the sense that the required number of measurements $m$ is of order $s$ up to logarithmic factors).
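In matrix form, the sensing operator of this example is obtained by extracting rows of the discrete Fourier matrix; here is a minimal construction (a sketch assuming NumPy; the dimensions, the sparsity and the uniformly random choice of frequencies are illustrative).

import numpy as np

n, m = 256, 40
rng = np.random.default_rng(0)

# full discrete Fourier matrix F_{k,l} = exp(-2i pi (k-1)(l-1)/n) / sqrt(n)
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

# random selection of m frequencies: the sensing matrix keeps the corresponding rows of F
I = rng.choice(n, size=m, replace=False)
A = F[I, :]

x = np.zeros(n)
x[rng.choice(n, size=8, replace=False)] = rng.standard_normal(8)
y = A @ x                          # m observed Fourier coefficients of the sparse signal x
print(A.shape, y.shape)            # (40, 256) (40,)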

1.7.3 CS and orthogonal (incoherent) transforms

The Fourier matrix could be replaced by any orthogonal matrix $\Phi$ (so that $\Phi$ is an isometry) such that its entries satisfy $|\Phi_{k\ell}| \le C/\sqrt{n}$ (with $C$ an absolute constant). If so, denoting by $(e_1, \dots, e_n)$ the vectors of the canonical basis of $\mathbb{R}^n$,

$$|\langle \Phi_{k:}, e_\ell \rangle|^2 \le \frac{C^2}{n}.$$

If such a condition is satisfied, we say that the bases $(\Phi_{k:})_k$ and $(e_\ell)_\ell$ are incoherent. In this sense, the Fourier and the canonical bases are maximally incoherent. Incoherence extends the duality between time and frequency: a signal localized in time will be spread out in the frequency domain. Here, a sparse object (in signal processing, we say "localized in time") in a basis or in a dictionary should be spread out in the measurement space. In particular, measurement vectors should have a "spread" representation in the sparsity basis.

One can formally define the notion of coherence. Let $\mathcal{B}_\Phi = (\phi_1, \dots, \phi_n)$ and $\mathcal{B}_\Psi = (\psi_1, \dots, \psi_n)$ be two bases of $\mathbb{R}^n$. The signals of interest are sparse in $\mathcal{B}_\Phi$, which is thus a "good" basis choice: this is the representation basis. $\mathcal{B}_\Psi$ will be used for the measurements. We choose only $m$ vectors in the measurement basis $\mathcal{B}_\Psi$.

Exercise 1.21. How do you write the sensing matrix A of the CS problem?

Definition 1.22. Let $\mathcal{B}_\Phi = (\phi_1, \dots, \phi_n)$ and $\mathcal{B}_\Psi = (\psi_1, \dots, \psi_n)$ be two bases of $\mathbb{R}^n$. The coherence between these two bases is defined as follows:

$$\mu(\mathcal{B}_\Phi, \mathcal{B}_\Psi) = n \, \max_{k, \ell} |\langle \phi_k, \psi_\ell \rangle|^2.$$


Figure 1.15 – MRI: on the left, a brain image. On the right, an MRI scanner.

Exercise 1.23. Rewrite the coherence as a function of $a_i = \phi_i^*$, $i = 1, \dots, n$.

Exercise 1.24. Show that $\mu \in [1, n]$.

We will say that the two bases are maximally incoherent if $\mu = 1$; this is the case for the Fourier and the canonical bases. As we expect that in this case the number of measurements should be of the order of $s$, one can predict that, in general, the bound on the number of measurements will be of the order of $s\mu$ (see the Candès-Romberg article).
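The coherence of Definition 1.22 can be evaluated directly from the Gram matrix of the two bases; the sketch below (assuming NumPy; the dimension n = 64 is arbitrary, and the helper coherence() is ours, not a library function) recovers the two extreme cases mentioned above: mu = 1 for the Fourier/canonical pair and mu = n for a basis compared with itself.

import numpy as np

def coherence(Phi, Psi):
    """mu(B_Phi, B_Psi) = n * max_{k,l} |<phi_k, psi_l>|^2, the bases being given
    column-wise in Phi and Psi (assumed orthonormal)."""
    n = Phi.shape[0]
    G = Phi.conj().T @ Psi                     # Gram matrix of all inner products
    return n * np.max(np.abs(G)) ** 2

n = 64
I = np.eye(n)                                                                    # canonical basis
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)  # Fourier basis

print(coherence(I, F))        # 1.0 : maximally incoherent
print(coherence(I, I))        # n   : maximally coherent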

1.7.4 CS and MRI

The physics of the acquisition in MRI leads to measurements in the Fourier domain: we observe partial Fourier coefficients of the image to reconstruct. However, a brain image is not sparse in the canonical basis.

Exercise 1.25. Write the sensing matrix A when MRI is viewed as a CS problem.

1.7.5 CS and face recognition

Face recognition can be written as a CS problem if the chosen approach is sparse representation-based classification (SRC) (for more information, see the book of Yonina Eldar and Gitta Kutyniok, chap. 12). We have access to data of the form $D = \{(\phi_j, \ell_j), 1 \le j \le n\}$:

• $\phi_j \in \mathbb{R}^m$ is a vector representation of the $j$-th image (possibly vectorized, or it may contain particular descriptors of the image, such as invariants).

• $\ell_j \in \{1, \dots, C\}$ is the label, i.e. the number of the person represented in the $j$-th image.


Figure 1.16 – Construction of the sensing matrix: shots with different angles

The same person can be represented several times in $D$ (with different angles, different lighting). We are given a new image $y \in \mathbb{R}^m$ (or a new descriptor of an image), and we want to find a label in $\{1, \dots, C\}$ corresponding to the person represented in the image $y$. This is a multi-class classification problem.

In recent papers, researchers have observed that if the number of images of the same person in the dictionary is high, then these pictures should live in a linear subspace of small dimension inside the big $m$-dimensional space. For a certain image representation $\phi$, this small dimension can be taken to be 9.

Then, if the dictionary contains enough information, one can hope that for a new image $y$ of the $i$-th person (to be identified),

$$y \sim \sum_{j : \ell_j = i} \phi_j x_j.$$

One can write

$$y = \Phi x + \xi,$$

where

• $\Phi = [\Phi_1 | \dots | \Phi_C]$ with $\Phi_i = [\phi_j : \ell_j = i]$,

• $x$ can also be decomposed into blocks,
$$x = [0^* | \dots | x_i^* | \dots | 0^*]^*,$$
where $x_i^*$ contains the components of $x$ multiplying the columns of $\Phi_i$.

This setting then fits the CS framework: $x$ is sparse (and even block-sparse). If the reconstruction of $x$ has non-zero components in the $i$-th block, we can infer that the new image $y$ corresponds to the $i$-th person.

The sensing matrix depends on the observations stored in the database $D$. Either this database is given and we cannot design it, or one can construct it as in Figure 1.17.


Figure 1.17 – Construction of the sensing matrix: shots with different lighting

Figure 1.18 – Financial data from the Bloomberg website. The sensing matrix is constructed using the stock values over time.

1.7.6 CS and finance

Let us suppose that we have access to the value of a portfolio of stocks (or other financial assets) in real time, every minute: $y_1, y_2, \dots$. One can suppose that this portfolio keeps the same structure over weeks. We also have access to the (instantaneous) stock values of this portfolio on Bloomberg and Reuters, see Figure 1.18.

The $j$-th stock has the value $x_{ij}$ at time $i$. The data have the following form:

• $t = 1$: portfolio value $y_1$, stock prices $(x_{1j})_{1 \le j \le n}$,

• $t = 2$: portfolio value $y_2$, stock prices $(x_{2j})_{1 \le j \le n}$, ...

• $t = m$: portfolio value $y_m$, stock prices $(x_{mj})_{1 \le j \le n}$.

We want to know the portfolio structure: which and how many assets does the portfolio include? One can write the problem in a CS way. Let $y = (y_j)_{1 \le j \le m}$ be the portfolio values and $A = (x_{ij})_{1 \le i \le m, 1 \le j \le n}$ the matrix of stock values. We want to find $\theta \in \mathbb{R}^n$ such that $y = A\theta$.


1.8 Minimal number of measurements

In this section, we examine the question of the minimal number of linear measurements needed to reconstruct $s$-sparse vectors from these measurements, regardless of the practicality of the reconstruction scheme. This question can in fact take two meanings, depending on whether we require that the measurement scheme allows for the reconstruction of all $s$-sparse vectors $x \in \mathbb{C}^n$ simultaneously, or whether we require that, given an $s$-sparse vector $x \in \mathbb{C}^n$, the measurement scheme allows for the reconstruction of this specific vector. While the second scenario seems unnatural at first sight because the vector $x$ is unknown a priori, it will become important later when aiming at recovery guarantees where the matrix $A$ is chosen at random and the sparse vector $x$ is fixed (so-called nonuniform recovery guarantees). The minimal number $m$ of measurements depends on the setting considered, namely, it equals $2s$ in the first case and $s + 1$ in the second case. However, one can show that if we also require the reconstruction scheme to be stable, then the minimal number of required measurements additionally involves a factor of $\ln(n/s)$, so that recovery will never be stable with only $2s$ measurements.

Before separating the two settings discussed above, it is worth pointing out the equivalence of the following properties for a given sparsity $s$, matrix $A \in \mathbb{C}^{m \times n}$, and $s$-sparse $x \in \mathbb{C}^n$:

1. The vector $x$ is the unique $s$-sparse solution of $Az = y$ with $y = Ax$, that is, $\{z \in \mathbb{C}^n : Az = Ax, \|z\|_0 \le s\} = \{x\}$.

2. The vector $x$ can be reconstructed as the unique solution of

$$\min_{z \in \mathbb{C}^n} \|z\|_0 \quad \text{s.t.} \quad y = Az.$$

Indeed, if an $s$-sparse $x \in \mathbb{C}^n$ is the unique $s$-sparse solution of $Az = y$ with $y = Ax$, then a solution $x^\sharp$ of the $\ell_0$-minimization problem is $s$-sparse and satisfies $Ax^\sharp = y$, so that $x = x^\sharp$. This shows $1 \Rightarrow 2$. The implication $2 \Rightarrow 1$ is clear.

1.8.1 Recovery of all sparse vectors

Theorem 1.26. Given $A \in \mathbb{C}^{m \times n}$, the following properties are equivalent:

(a) Every $s$-sparse vector $x \in \mathbb{C}^n$ is the unique $s$-sparse solution of $Az = Ax$, that is, if $Ax = Az$ and both $x$ and $z$ are $s$-sparse, then $x = z$.

(b) The null space $\ker A$ does not contain any $2s$-sparse vector other than the zero vector, that is,
$$\ker(A) \cap \{z \in \mathbb{C}^n : \|z\|_0 \le 2s\} = \{0\}.$$

(c) For every $S \subset [n]$ with $|S| \le 2s$, the submatrix $A_S$ is injective as a map from $\mathbb{C}^S$ to $\mathbb{C}^m$.

(d) Every set of $2s$ columns of $A$ is linearly independent.


Proof. (b)$\Rightarrow$(a). Let $x$ and $z$ be $s$-sparse with $Ax = Az$. Then $x - z$ is $2s$-sparse and $A(x - z) = 0$. If the kernel does not contain any $2s$-sparse vector different from the zero vector, then $x = z$.

(a)$\Rightarrow$(b). Conversely, assume that for every $s$-sparse vector $x \in \mathbb{C}^n$, we have $\{z \in \mathbb{C}^n : Az = Ax,\ \|z\|_0 \le s\} = \{x\}$. Let $v \in \ker(A)$ be $2s$-sparse. We can write $v = x - z$ for $s$-sparse vectors $x, z$ with $\operatorname{supp}(x) \cap \operatorname{supp}(z) = \emptyset$. Then $Ax = Az$, and by assumption $x = z$. Since the supports of $x$ and $z$ are disjoint, it follows that $x = z = 0$ and $v = 0$.

For the equivalence of (b), (c), and (d), we observe that for a $2s$-sparse vector $v$ with $S = \operatorname{supp}(v)$, we have $Av = A_S v_S$. Noting that $S = \operatorname{supp}(v)$ ranges through all possible subsets of $[n]$ of cardinality $|S| \le 2s$ when $v$ ranges through all possible $2s$-sparse vectors completes the proof by basic linear algebra.

We observe, in particular, that if it is possible to reconstruct every $s$-sparse vector $x \in \mathbb{C}^n$ from the knowledge of its measurement vector $y = Ax \in \mathbb{C}^m$, then (a) holds and consequently so does (d). This implies $\operatorname{rank}(A) \ge 2s$. We also have $\operatorname{rank}(A) \le m$, because the rank is at most equal to the number of rows. Therefore, the number of measurements needed to reconstruct every $s$-sparse vector always satisfies
$$m \ge 2s.$$

We are now going to see that $m = 2s$ measurements suffice to reconstruct every $s$-sparse vector, at least in theory.

Theorem 1.27. For any integer $n \ge 2s$, there exists a measurement matrix $A \in \mathbb{C}^{m \times n}$ with $m = 2s$ rows such that every $s$-sparse vector $x \in \mathbb{C}^n$ can be recovered from its measurement vector $y = Ax \in \mathbb{C}^m$ as a solution of the $\ell_0$-minimization problem.

Proof. The idea is to construct a Vandermonde matrix: pick $n$ distinct nodes $t_1, \dots, t_n$ and set $A_{ij} = t_j^{\,i-1}$ for $1 \le i \le 2s$ and $1 \le j \le n$. Any $2s$ columns of $A$ then form a square Vandermonde matrix with distinct nodes, which is invertible, so condition (d) of Theorem 1.26 holds and the conclusion follows from the equivalence stated before Theorem 1.26.
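A quick numerical check of this construction (a minimal sketch for toy sizes only; the exhaustive verification over all column subsets is of course not practical for large $n$):

```python
import itertools
import numpy as np

n, s = 10, 2
m = 2 * s

# Vandermonde measurement matrix with distinct nodes: A[i, j] = t_j ** i,
# for i = 0, ..., 2s - 1 and j = 0, ..., n - 1.
t = np.linspace(-1.0, 1.0, n)
A = np.vander(t, N=m, increasing=True).T      # shape (2s, n)

# Condition (d) of Theorem 1.26: every set of 2s columns is linearly
# independent, since each 2s x 2s submatrix is itself a Vandermonde matrix
# built from distinct nodes, hence invertible.
ok = all(np.linalg.matrix_rank(A[:, list(S)]) == m
         for S in itertools.combinations(range(n), m))
print(ok)   # expected: True
```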

Theorem 1.28. For any $n \ge 2s$, there exists a practical procedure for the reconstruction of every $s$-sparse vector from its first $m = 2s$ discrete Fourier measurements.
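The procedure alluded to is a Prony-type method. The notes do not spell it out, so the sketch below is our own reconstruction of the standard argument: the first $2s$ Fourier coefficients determine a degree-$s$ polynomial whose roots locate the support, after which the amplitudes follow from a least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 64, 4

# An s-sparse signal and its first 2s discrete Fourier measurements
# y_k = sum_j x_j exp(-2i pi jk / n), k = 0, ..., 2s - 1.
x = np.zeros(n, dtype=complex)
support = rng.choice(n, s, replace=False)
x[support] = rng.standard_normal(s) + 1j * rng.standard_normal(s)
y = np.fft.fft(x)[: 2 * s]

# Find p(z) = z^s + c_{s-1} z^{s-1} + ... + c_0 vanishing at the nodes
# exp(-2i pi j / n), j in supp(x): the relations
# sum_{l=0}^{s} c_l y_{k+l} = 0, k = 0, ..., s - 1 (with c_s = 1)
# form a square Hankel system in (c_0, ..., c_{s-1}).
H = np.array([[y[k + l] for l in range(s)] for k in range(s)])
c = np.linalg.solve(H, -y[s:2 * s])
p = np.concatenate([c, [1.0]])                 # coefficients c_0, ..., c_s

# The support is where p vanishes on the grid of n-th roots of unity.
nodes = np.exp(-2j * np.pi * np.arange(n) / n)
S_hat = np.sort(np.argsort(np.abs(np.polyval(p[::-1], nodes)))[:s])

# Amplitudes: least squares on the 2s Fourier equations restricted to S_hat.
F = np.exp(-2j * np.pi * np.outer(np.arange(2 * s), S_hat) / n)
coeffs, *_ = np.linalg.lstsq(F, y, rcond=None)
x_hat = np.zeros(n, dtype=complex)
x_hat[S_hat] = coeffs

print(np.allclose(x_hat, x))   # expected: True
```

This exact recovery relies on $x$ being exactly $s$-sparse and the measurements being noiseless: the Hankel system is invertible because it factors as $V \operatorname{diag}(x_S) V^{\mathsf{T}}$ with $V$ an invertible Vandermonde matrix. The method is, however, unstable in the presence of noise, which is precisely why stability will require more measurements.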

1.8.2 Recovery of individual sparse vectors

Theorem 1.29. For any $n \ge s + 1$, given an $s$-sparse vector $x \in \mathbb{C}^n$, there exists a measurement matrix $A \in \mathbb{C}^{m \times n}$ with $m = s + 1$ rows such that the vector $x$ can be reconstructed from its measurement vector $y = Ax \in \mathbb{C}^m$ as a solution of the $\ell_0$-minimization problem.

Proof. Left to the reader.

1.9 Preliminaries: an NP-hard problem

Reconstructing an $s$-sparse vector $x \in \mathbb{C}^n$ from its measurement vector $y \in \mathbb{C}^m$ amounts to solving the $\ell_0$-minimization problem
$$\min_{z \in \mathbb{C}^n} \|z\|_0 \quad \text{s.t.} \quad Az = y. \tag{$P_0$}$$


Since a minimizer has sparsity at most $s$, the straightforward approach for finding it consists in solving every rectangular system $A_S u = y$, or rather every square system $(A_S)^* A_S u = (A_S)^* y$, for $u \in \mathbb{C}^S$, where $S$ runs through all the possible subsets of $[n]$ of size $s$. However, since the number $\binom{n}{s}$ of these subsets is prohibitively large, such a straightforward approach is completely impractical. By way of illustration, for the small problem sizes $n = 1000$ and $s = 10$, we would have to solve
$$\binom{1000}{10} \ge \Big(\frac{1000}{10}\Big)^{10} = 10^{20}$$
linear systems of size $10 \times 10$. Even if each such system could be solved in $10^{-10}$ seconds, the time required to solve $(P_0)$ with this approach would still be $10^{10}$ seconds, i.e., more than 300 years.
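For concreteness, here is a direct transcription of this exhaustive search (a sketch for toy sizes only; the function name and tolerance are ours), which makes the $\binom{n}{s}$ blow-up plain:

```python
import itertools
import numpy as np

def l0_brute_force(A, y, s, tol=1e-10):
    """Exhaustive l0 recovery: solve the least-squares system on every
    candidate support of size s and return the first exact solution found."""
    m, n = A.shape
    for S in itertools.combinations(range(n), s):
        A_S = A[:, list(S)]
        # Equivalent to solving the square system (A_S)* A_S u = (A_S)* y.
        u, *_ = np.linalg.lstsq(A_S, y, rcond=None)
        if np.linalg.norm(A_S @ u - y) <= tol:
            z = np.zeros(n, dtype=A.dtype)
            z[list(S)] = u
            return z
    return None
```

Already for $n$ in the hundreds and moderate $s$, the loop above runs over an astronomical number of supports; the NP-hardness result below indicates that, unless P $=$ NP, no algorithm can avoid a super-polynomial cost in general.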

We are going to show that solving $(P_0)$ is in fact intractable for any possible approach. Precisely, for any fixed $\eta \ge 0$, we are going to show that the more general problem
$$\min_{z \in \mathbb{C}^n} \|z\|_0 \quad \text{s.t.} \quad \|Az - y\|_2 \le \eta$$
is NP-hard.

We start by introducing the necessary terminology from computational complexity.

First, a polynomial-time algorithm is an algorithm performing its task in a number of steps bounded by a polynomial expression in the size of the input. Next, let us describe, in a rather informal way, a few classes of decision problems:

• The class P of P-problems consists of all decision problems for which there exists a polynomial-time algorithm finding a solution.

• The class NP of NP-problems consists of all decision problems for which there exists a polynomial-time algorithm certifying a solution. Note that the class P is clearly contained in the class NP.

• The class NP-hard of NP-hard problems consists of all problems (not necessarily decision problems) for which a solving algorithm could be transformed in polynomial time into a solving algorithm for any NP-problem. Roughly speaking, this is the class of problems at least as hard as any NP-problem. Note that the class NP-hard is not contained in the class NP.

• The class NP-complete of NP-complete problems consists of all problems that are both NP and NP-hard; in other words, it consists of all the NP-problems that are at least as hard as any other NP-problem.

It is a common belief that P is strictly contained in NP, that is to say, that there are problems for which potential solutions can be certified, but for which a solution cannot be found in polynomial time. However, this remains a major open question to this day. There is a vast catalog of NP-complete problems, perhaps the most famous of which is the traveling salesman problem. The one we are going to use is exact cover by 3-sets.

Exact cover by 3-sets problem. Given a collection $\{C_i,\ i \in [n]\}$ of 3-element subsets of $[m]$, does there exist an exact cover (a partition) of $[m]$, i.e., a set $J \subset [n]$ such that $\cup_{j \in J} C_j = [m]$ and $C_j \cap C_{j'} = \emptyset$ for $j \ne j'$?


Taking for granted that this problem is NP-complete, one can prove the followingresult.

Theorem 1.30. For any $\eta \ge 0$, the $\ell_0$-minimization problem
$$\min_{z \in \mathbb{C}^n} \|z\|_0 \quad \text{s.t.} \quad \|Az - y\|_2 \le \eta$$
for general $A \in \mathbb{C}^{m \times n}$ and $y \in \mathbb{C}^m$ is NP-hard.
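The proof is not reproduced here; for intuition only, here is our summary of the standard reduction in the case $\eta = 0$ (the general case requires an additional perturbation argument). Given 3-element subsets $C_1, \dots, C_n$ of $[m]$, set $A \in \{0,1\}^{m \times n}$ with $A_{ij} = 1$ if $i \in C_j$ and $A_{ij} = 0$ otherwise, and take $y = (1, \dots, 1)^\top \in \mathbb{C}^m$. Every $z$ with $Az = y$ must involve, for each row $i$, at least one $j \in \operatorname{supp}(z)$ with $i \in C_j$; since each column of $A$ has exactly three nonzero entries, this forces $\|z\|_0 \ge m/3$, and a feasible $z$ with $\|z\|_0 \le m/3$ exists if and only if $\{C_j,\ j \in \operatorname{supp}(z)\}$ is an exact cover (take $z$ to be the indicator of $J$). Hence an algorithm solving $(P_0)$ in polynomial time would decide exact cover by 3-sets in polynomial time.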

In compressive sensing, we will rather consider special choices of $A$ and choose $y = Ax$ for some sparse $x$. We will see that a variety of tractable algorithms will then provably recover $x$ from $y$ and thereby solve $(P_0)$ for such specifically designed matrices $A$. However, to emphasize this point once more, such algorithms will not successfully solve the $\ell_0$-minimization problem for all possible choices of $A$ and $y$, due to NP-hardness. A selection of tractable algorithms will be introduced later.
