
Weighted ℓ1-Analysis Minimization and Stochastic Gradient Descent for Low Rank Matrix Recovery

Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University in fulfilment of the requirements for the degree of Doctor of Natural Sciences, submitted by

Jonathan Fell, M.Sc., from Daun

Reviewers: Prof. Dr. Holger Rauhut, Prof. Dr. Hartmut Führ

Date of the oral examination: 18.12.2020

This dissertation is available online on the websites of the university library.


Contents

List of Figures

1 Introduction
1.1 Weighted Sparse Recovery
1.2 ℓ1-Minimization of Analysis Coefficients
1.3 Low Rank Matrix Recovery and Stochastic Gradient Descent
1.4 Outline
1.5 Acknowledgements

2 Preliminaries
2.1 Weighted Sparse Approximations
2.2 Weighted Compressed Sensing
2.3 Low Rank Matrix Recovery and the Phase Retrieval Problem
2.4 Notation

3 Weighted ℓ1-Minimization and Weighted Iterative Hard Thresholding
3.1 Introduction
3.2 Weighted RIP
3.3 Wavelet Bases and Preconditioning
3.4 WIHT
3.5 Weighted Sparse Recovery in Besov Spaces
3.6 Numerical illustrations
3.7 Conclusion

4 ℓ1-Minimization of Analysis Coefficients
4.1 Introduction
4.2 Null Space Properties for Bounded Orthonormal Systems
4.3 Shearlet-Wavelet ℓ1-Analysis Minimization
4.4 Numerical illustrations
4.4.1 TV-Minimization of Real-World Data
4.5 Conclusion

5 Low Rank Matrix Recovery via Stochastic Gradient Descent
5.1 Introduction
5.2 Notation
5.3 Initialization
5.4 The averaged flow
5.5 Deviation via Hanson-Wright and Generic Chaining
5.6 Numerical Addendum
5.7 Mini-Batch Stochastic Gradient Descent
5.8 Conclusion

6 Appendix
6.1 Tools from probability theory
6.2 Besov Spaces, Smoothness Spaces, wavelets and shearlets
6.2.1 Shearlet Frames
6.3 A brief note on Sparsity Equivalence

Symbols
Index
Bibliography
Eidesstattliche Erklärung


List of Figures

3.1 Graphical illustrations for WIHT and NWIHT and sampling locations
3.2 Subsampling patterns in 2D Fourier domain
3.3 Graphical illustrations for WIHT and NWIHT with line-sampling
3.4 Graphical illustrations for WIHT and NWIHT with preconditioned sampling

4.1 Numerical illustrations for shearlet minimization
4.2 Numerical illustration for Haar and shearlet minimization with deterministic sampling
4.3 Numerical illustration for Haar and shearlet minimization with random sampling
4.4 Numerical illustration for Haar and shearlet minimization of a non-sparse image with deterministic sampling
4.5 Numerical illustration for Haar and shearlet minimization of a non-sparse image with random sampling
4.6 Numerical illustration for TV minimization
4.7 Numerical illustration for TV, Haar and shearlet minimization for the preconditioned probability measure
4.8 Numerical illustration for TV, Haar and shearlet minimization for the preconditioned probability measure
4.9 Numerical illustration for real-world data

5.1 Occasionally, the starting point computed via spectral initialization is so close to the actual minimum that Algorithm 3 converges in a few steps. Here Z ∈ R^{25×15} of rank 2 and Z_0 computed from m = 16094 measurements.
5.2 Comparison of the error d(Z_i, Z) for different step-sizes
5.3 Error d(Z_i, Z) of SGD using µ_i ∼ 1/i^{3/4}
5.4 Error of stochastic gradient descent measured in d(·, ·)
5.5 Reconstruction error for Z ∈ R^{25×25} for m = ⌈rkn²κ²(Z) log(n)⌉ measurements
5.6 Convergence of SGD for Z ∈ R^{25×25} for m = ⌈rknκ²(Z) log(n)⌉ measurements
5.7 Convergence of SGD for Z ∈ R^{25×25} for m = ⌈r³knκ²(Z) log(n)⌉ measurements
5.8 Deviations
5.9 Deviations for rank(Z) = 2
5.10 Reconstruction for Z ∈ R^{25×25} for m = ⌈rkn log(n)⌉ measurements for randomized SGD
5.11 Mini-batch reconstruction error for Z ∈ R^{25×25} for M = 10⌈rknκ(Z) log(n)⌉ measurements and m = 100

6.1 Tilings of 2D Fourier domain for several classes of atomic decompositions
6.2 Essential Fourier support for α-molecules [37, 61]


Abstract

This dissertation focuses on two topics. First, two extensions of the usual approach of Compressed Sensing are discussed. Compressed Sensing aims at solving an underdetermined linear equation by exploiting the assumption that the original signal comes from a low-complexity class of signals. Usually, this is expressed by the notion of sparsity, that is to say that the signals possess only few significant coefficients in a basis of the underlying vector space. The first part of this dissertation extends the notion of sparsity by employing additional weights to emphasize certain parts of the signal. An algorithm is developed that allows for the reconstruction of such a signal from an underdetermined linear equation by using those weights. In the following chapter, the idea of extending prior methods is pursued further: sparsity of the signal in a basis of the vector space is exchanged for sparsity in a more general representation system. These systems are, for this work, so-called frames, redundant generating systems of vector spaces that allow for the characterization of certain classes of signals by the decay of their frame coefficients. This is combined with the weighting process from the first part of this work to extend these results to infinite-dimensional vector spaces.

The second main aspect of this dissertation is the reconstruction of low-rank matrices from quadratic measurements which are moreover oblivious to orthogonal transformations. The low-rank assumption is the matrix analog of sparsity in the case of the reconstruction of vectorized signals. An algorithm is developed and presented that reconstructs the original matrix up to a global orthogonal transformation. The fact that the measurements are quadratic instead of linear, as they were in the former chapters, is an additional challenge for the analysis.

One main focus of all the results presented here is the size of the equation systems, i.e., the ratio of the number of measurements needed to ensure approximation or reconstruction to the dimension of the underlying vector space. Moreover, numerical examples are presented to demonstrate the applicability of the methods developed here. Especially in the first part, the weighted minimization of analysis coefficients, the algorithms that were developed are applied to real-world data from Computed Tomography to show that this approach can be beneficial for real-world data.


Zusammenfassung

This thesis has two focal points. First, the usual methods known from the field of Compressed Sensing are extended. The basic idea of these methods, in which an underdetermined linear system of equations is to be solved, is always that the original signals to be reconstructed possess a structure of much lower dimension than the space they come from. This is usually expressed by requiring that these signals are sparse, i.e., that they have only few significant coefficients in a basis of the underlying vector space. In the first part of this thesis, this notion of sparsity is extended in such a way that, by means of weights, certain parts of the original data are emphasized more strongly than others. In particular, an algorithm is developed that makes it possible to recover such signals from underdetermined linear measurements. In the subsequent chapter, this idea is pursued further and extended by requiring that the signals have few significant coefficients not in a vector space basis but in a far more general representation system. These systems are here so-called frames, redundant generating systems of vector spaces that allow certain classes of signals to be characterized by the decay behaviour of their frame coefficients. We combine this with the weighting from the first part of the thesis, with the goal of extending the results to infinite-dimensional vector spaces.

The second main aspect of this dissertation is the reconstruction of low-rank matrices from quadratic measurements that are moreover blind to orthogonal transformations. The assumption that the original matrices have low rank is here the analog of sparsity for vectors. An algorithm is developed and presented that reconstructs these matrices up to an orthogonal transformation. The fact that the measurements in this case are quadratic rather than linear, as in the previous chapters, poses an additional challenge.

For all results presented, a main focus lies on the size of the systems of equations to be solved, that is, the ratio of the required number of measurements to the dimension of the underlying signal space. Furthermore, numerical examples are presented throughout to demonstrate the practical applicability of the developed methods. In particular for the first part, signal reconstruction based on weighted coefficients in redundant systems, i.e., weighted minimization of frame coefficients, the algorithmic methods are applied to measurements from computed tomography scanners, showing that this approach also offers advantages when applied to real-world data.


Chapter 1

Introduction

This dissertation focuses on two different but related topics from the field of Compressive Sensing. First, we consider the question of how to recover an unknown vector x ∈ R^d from noisy linear measurements y = Φx + e, where Φ ∈ R^{m×d} is underdetermined, i.e., m ≪ d, if we have some prior knowledge about the original data x. Secondly, we consider the retrieval of a low rank matrix Z ∈ R^{n×k} from quadratic measurements. Although both problems differ in the form of the data as well as the measurements, their solutions always hinge on the fact that the original datum belongs to a subset of low complexity of the signal space.

Since the first works on sparse approximation and sparse recovery, including but not limited to [13, 29, 41], have been published, tremendous efforts have been undertaken to widen the scope of sparse reconstruction and approximation methods. These efforts have become more urgent since the practical impact of these methods on real-world problems has become apparent. They have, for example, many applications in the fields of Magnetic Resonance Imaging (MRI) [67], Computed Tomography (CT) [6, 75], radar imaging [4, 30, 83] and many more.

We present two strategies to approach the problem of sparse recovery. Firstly, we use weighted minimization for signal recovery via a hard thresholding algorithm, where we prove convergence based on the popular Restricted Isometry Property (RIP). Secondly, we use the Null Space Property to show that ℓ1-minimization applied to the analysis coefficients of a signal, rather than to the signal itself, also yields proper recovery.

The second topic is inspired by the Phase Retrieval Problem, i.e., the recovery of a signal whose phase is lost in the measurement process. We extend the Phase Retrieval Problem to the field of low rank matrix recovery, which has also found many applications, among them matrix completion [16] and object detection [88], just to name a few.

1.1 Weighted Sparse Recovery

The usual assumption underlying Compressive Sensing, the idea that the original x ∈ R^d has a sparse representation in some basis of R^d, allows for the stable and robust recovery of data from (noisy) linear measurements y = Φx + e under some assumptions. One prominent tool in Compressive Sensing is the RIP, which requires that a measurement matrix Φ ∈ R^{m×d} obeys the inequalities

\[ (1-\delta_s)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1+\delta_s)\|x\|_2^2 \]

for all s-sparse x ∈ R^d. It has been shown that Gaussian random matrices or properly subsampled orthonormal operators possess the RIP with high probability. In Remarks 101 and 102 we present some matrix classes and the conditions under which they possess the RIP. These recitals are far from complete; for more information we refer to [34]. In short, once a measurement matrix possesses the RIP with a suitably small constant δ_s, we have stable and robust recovery of x from noisy linear measurements y = Φx + e via ℓ1-minimization

\[ \min_{z\in\mathbb{R}^d} \|z\|_1 \quad \text{subject to} \quad \|\Phi z - y\|_2 \le \|e\|_2. \]
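For illustration only, a problem of this basis pursuit denoising form can be handed to a generic convex solver. The following minimal Python sketch assumes the cvxpy package is available; the toy sizes, noise level and variable names are our own and this is not a solver used in the thesis.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, m, s = 100, 40, 5
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
Phi = rng.standard_normal((m, d)) / np.sqrt(m)      # underdetermined measurement matrix
e = 0.01 * rng.standard_normal(m)
y = Phi @ x_true + e                                # noisy linear measurements

# basis pursuit denoising: minimize ||z||_1 subject to ||Phi z - y||_2 <= ||e||_2
z = cp.Variable(d)
prob = cp.Problem(cp.Minimize(cp.norm1(z)),
                  [cp.norm(Phi @ z - y, 2) <= np.linalg.norm(e)])
prob.solve()
print("reconstruction error:", np.linalg.norm(z.value - x_true))
```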

If there is additional prior knowledge about the signal class, it is often possible to employ a weighted sparsity prior which emphasizes certain parts of the signal over others. One example of this is the recovery of functions belonging to a Besov space B^{p,q}_s(R), which can be characterized by weighted sequence spaces; see Section 6.2 for more details. This entails reconstruction via weighted basis pursuit denoising

\[ \min_{z\in\mathbb{R}^d} \|z\|_{\omega,1} \quad \text{subject to} \quad \|\Phi z - y\|_2 \le \|e\|_2 \tag{$\omega$-BPDN} \]

which uses the idea that the original vector x is ω-weighted sparse. Here, a vector x ∈ R^d is weighted s-sparse if $\|x\|_{\omega,0} = \sum_{x_\gamma \neq 0} \omega_\gamma^2 \le s$. In this work we restrict ourselves to the case that all the weights are ≥ 1.

However, ω-BPDN is not an algorithm per se but rather a problem for which it is known that the solution is a weighted s-sparse solution of ‖Φx − y‖_2 ≤ ‖e‖_2. In order to actually compute such a solution, we propose Weighted Iterative Hard Thresholding, a weighted variant of the Iterative Hard Thresholding algorithm. Since we minimize a weighted variant of the ℓ1-norm, we need a properly adjusted RIP, namely the ω-RIP

\[ (1-\delta_{s,\omega})\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1+\delta_{s,\omega})\|x\|_2^2, \]

which now has to hold for all weighted s-sparse x ∈ R^d.

In [80] the authors showed that for random subsamples of an orthonormal matrix the resulting measurement matrix Φ possesses the weighted RIP with high probability under a suitable condition on the number of measurements in terms of the weighted sparsity. A first attempt at implementing a greedy algorithm for weighted sparse recovery was undertaken in [48], where the resulting algorithm suffered from the fact that the thresholding step was computationally infeasible. We circumvent this issue using approximation results from [80] and propose Weighted Iterative Hard Thresholding (WIHT), Algorithm 1, as an algorithm for weighted sparse recovery. We prove that once the constant δ_{14s,ω} is sufficiently small, WIHT produces a sequence (x^n)_{n∈N} from noisy measurements y = Φx + e that satisfies

\[ \|x - x^n\|_{\omega,p} \le C s^{\frac{1}{p}-1}\|x\|_{\omega,1} + D s^{\frac{1}{p}-\frac{1}{2}}\|e\|_2 + \rho^n s^{\frac{1}{p}-\frac{1}{2}}\|x\|_2 \]

for some universal constants C,D and 0 < ρ < 1.
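For illustration only, the following Python sketch shows the basic iteration behind a weighted iterative hard thresholding scheme of this type: a gradient step on ‖Φx − y‖²_2 followed by a weighted quasi-best s-term thresholding (made precise in Definition 2 below). The helper names, the conservative step size and the toy example are our own choices; this is not a reproduction of Algorithm 1.

```python
import numpy as np

def weighted_hard_threshold(z, w, s):
    """Keep the entries with largest |z_j| / w_j while the accumulated
    weight budget (sum of w_j^2 over kept indices) stays below s."""
    order = np.argsort(-np.abs(z) / w)
    out, budget = np.zeros_like(z), 0.0
    for j in order:
        if budget + w[j] ** 2 > s:
            break
        out[j] = z[j]
        budget += w[j] ** 2
    return out

def wiht_sketch(Phi, y, w, s, n_iter=300):
    """Illustrative weighted IHT: gradient step followed by weighted thresholding."""
    mu = 1.0 / np.linalg.norm(Phi, 2) ** 2      # conservative step size
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = weighted_hard_threshold(x + mu * Phi.T @ (y - Phi @ x), w, s)
    return x

# toy example with unit weights and a very sparse signal
rng = np.random.default_rng(0)
d, m = 80, 48
w = np.ones(d)
x_true = np.zeros(d)
x_true[rng.choice(d, 3, replace=False)] = rng.standard_normal(3)
Phi = rng.standard_normal((m, d)) / np.sqrt(m)
x_rec = wiht_sketch(Phi, Phi @ x_true, w, s=6.0)
print("reconstruction error:", np.linalg.norm(x_rec - x_true))
```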

We extend this result to the infinite-dimensional realm by means of Besov spaces [44, 94, 96] which, as mentioned before, can be characterized as weighted sequence spaces via wavelet decompositions, see [3, 79]. If a function f ∈ B^{p,q}_l(R) with compact support is sampled at finitely many locations in the Fourier domain, selected uniformly at random according to an adapted probability measure, the so-called preconditioning measure, our reconstruction f^♯ achieves a recovery rate of

\[ \|f - f^\sharp\|_{\omega,1} \le C s^{1-\frac{1}{q}} \|f\|_{v,q} \]

for 0 < q < 1 and a second sequence of weights v which corresponds to a Besov space B^{q,q}_{l'}(R).

1.2 ℓ1-Minimization of Analysis Coefficients

In practical applications, data may not exhibit sparsity in a basis of the signal space but only under a more general transformation. This entails reconstruction of the original datum x ∈ R^d from noisy linear measurements y = Φx + e via

\[ \min_{z\in\mathbb{R}^d} \|\Omega z\|_1 \quad \text{subject to} \quad \|\Phi z - y\|_2 \le \|e\|_2, \]

where Ω is some sparsifying transform appropriate for the signal class. In this thesis we focus on the case where Ω ∈ R^{n×d} is a frame, i.e., there exist A, B > 0 such that A‖z‖_2^2 ≤ ‖Ωz‖_2^2 ≤ B‖z‖_2^2 for all z ∈ R^d.

The approach we present here refers to analysis sparsity, which assumes the analysis coefficients Ωx of x to be sparse. Another strategy, called synthesis sparsity, is to use the sparsity of a signal in a dictionary (ψ_λ)_{λ∈Λ} ⊂ R^d, that is to say $x = \sum_{\lambda\in\Lambda} c_\lambda \psi_\lambda^*$, where the sequence of coefficients (c_λ)_{λ∈Λ} is sparse. Setting Ω to be the matrix whose rows are the ψ_λ^t, this entails the minimization problem

\[ \min_{c\in\mathbb{R}^n} \|c\|_1 \quad \text{subject to} \quad \|\Phi\Omega^* c - y\|_2 \le \|e\|_2, \]

which is a variant of the usual ℓ1-minimization problem. The ℓ1-analysis minimization approach aims at solving

\[ \min_{x\in\mathbb{R}^d} \|\Omega x\|_1 \quad \text{subject to} \quad \|\Phi x - y\|_2 \le \|e\|_2. \]

We are interested in recovering analysis-sparse signals from random samples of their Fourier transform. While [53] provided recovery guarantees for the special case of analysis sparsity with respect to a Parseval frame, satisfying results for general frames and weighted analysis sparsity remained open. The authors of [53] assume that Φ is a random subsampling of an orthonormal operator, which is the setting we will discuss in Chapter 4 as well.

In [49], the authors provided stable and robust recovery results for ℓ1-analysis minimization for Gaussian measurements by means of the Null Space Property, which is the strategy we will follow in our analysis as well. A matrix Φ possesses the Ω-s null space property if for all S ⊂ [n] with ♯S ≤ s we have, for all x ∈ ker(Φ) \ {0},

\[ \|(\Omega x)_S\|_1 < \|(\Omega x)_{\overline{S}}\|_1, \]

which is to say that no ℓ1-analysis s-sparse part of x, i.e., no s-sparse part of Ωx, comes from an x that lies within the kernel of Φ. While the null space property allows for the reconstruction of vectors that have an s-sparse sequence of frame coefficients, there are also extensions that allow for stable and robust recovery results, as was shown in [49]. We extend these properties by introducing weights in the reconstruction process once more, so that we actually examine

\[ \min_{z\in\mathbb{R}^d} \|\Omega z\|_{\omega,1} \quad \text{subject to} \quad \|\Phi z - y\|_2 \le \|e\|_2. \tag{1.1} \]

Our analysis is based on methods from [19, Chapter 5] and probabilistic results from [9]. Firstly, we show that, given enough measurements, a random subsampling of an orthonormal operator possesses the weighted robust null space property of order s with respect to the frame Ω with high probability. This allows for a robust reconstruction result.

Again we extend this theory to the recovery of functions from samples at randomly chosen locations in the Fourier domain. As a frame we choose the frame of shearlet functions, of which a plethora of versions are available, such as a tight frame of shearlet functions, a frame of compactly supported shearlet functions, the widely-used bandlimited shearlets and many more [26, 52, 56, 58, 59]. The fact that a sequence of shearlet coefficients is the minimizer of the weighted basis pursuit minimization problem implies that the sequence of shearlet coefficients belongs to a weighted sequence space. For a certain type of weights these weighted sequence spaces give rise to the shearlet smoothness spaces S^{p,q}_l(R²) [63] (which bear a striking resemblance to the shearlet coorbit spaces, see [26, 51, 77], as we will discuss in the Appendix) in the same way the weighted sequence spaces in the first chapter give rise to the Besov spaces. As in Chapter 4 we use an ansatz space of functions whose wavelet coefficients belong to a weighted sequence space, hence the original function belongs to a Besov space B^{p,q}_l(R²). Since we are operating with Besov spaces and shearlet smoothness spaces at the same time, we employ the embeddings, valid for 0 < p ≤ ∞, 0 < q < ∞ and β ∈ R,

\[ B^{p,q}_{\beta+1/q}(\mathbb{R}^2) \hookrightarrow S^{p,q}_{\beta}(\mathbb{R}^2) \quad\text{and}\quad S^{p,q}_{\beta}(\mathbb{R}^2) \hookrightarrow B^{p,q}_{\beta-l}(\mathbb{R}^2), \]

where $l = \max\big(1, \tfrac{1}{p}\big) - \min\big(1, \tfrac{1}{q}\big)$. We start by assuming that the original function $f = \sum_{\gamma\in\Gamma} c_\gamma \varphi_\gamma$ belongs to a space generated by finitely many wavelets $\varphi_\gamma$. The shearlet coefficients take the form $\langle \psi_\lambda, f\rangle = \sum_{\gamma\in\Gamma} c_\gamma \langle \psi_\lambda, \varphi_\gamma\rangle$ and we take Fourier measurements $y_i = f(x_i) = \sum_{\gamma\in\Gamma} c_\gamma \varphi_\gamma(x_i)$. For the reconstruction we use the weighted ℓ1-minimization problem (1.1), where

\[ \Omega_{\lambda,\gamma} = \langle \psi_\lambda, \varphi_\gamma\rangle \quad\text{and}\quad \Phi_{i,\gamma} = \varphi_\gamma(x_i), \qquad i = 1,\dots,m, \]

for λ ∈ Λ and γ ∈ Γ, where Λ and Γ are finite index sets. We present a framework for the measurement and reconstruction process and provide the means to determine index sets Λ and Γ such that Ω constitutes a frame. The main focus of our efforts here lies on the question when Ω as above constitutes a frame for the ansatz space. While we present our own analysis specifically for shearlets and wavelets, there is a more general approach to this question. The idea that the entries of $(\langle \psi_\lambda, \varphi_\gamma\rangle)_{\lambda\in\Lambda,\,\gamma\in\Gamma}$ become negligible once the indices γ and λ are 'far apart' is specified mathematically in [36] and employs the notion of α-molecules. There, the authors develop a notion of distance for parameters from a general parameter space and a measure for the similarity between different representation systems. This theory allows for the transfer of sparsity-based methods between different representation systems within a vector space, i.e., the idea of sparsity equivalence. We will give some details on this in the appendix in Section 6.3.

Our main recovery result, Theorem 59, shows that, given a sufficient number of measurements, with high probability the reconstruction f^♯ of f ∈ S^{p,p}_l(R²) roughly obeys

\[ \|f - f^\sharp\|_2 \lesssim s^{1-1/p}\,\|f\|_{S^{p,p}_{l+1}}. \]

Here, the optimal recovery rate would be $\|f - f^\sharp\|_2 \lesssim s^{1/2-1/p}\,\|f\|_{S^{p,p}_{l+1}}$.

Moreover, we discuss the impact of these findings on the reconstruction of computed tomography imagery. We present numerical examples both for model data and for real-world measurements. Especially the latter, the application of sparsity-based recovery methods to real-world CT data, has not been researched thoroughly while offering a large opportunity for future research.

1.3 Low Rank Matrix Recovery and Stochastic Gradient Descent

The third part of this dissertation draws its inspiration from two different fields of research. Firstly, we turn to the theory of low-rank matrix recovery instead of purely sparsity-based methods. The assumption that a matrix Z ∈ R^{n×k} is of low rank is the analog of the sparsity assumption for the reconstruction of vectors. For a measurement operator A : R^{n×k} → R^m,

which is not necessarily linear, and measurements y = A(Z) ∈ R^m, we aim at solving

\[ \min_{Z\in\mathbb{R}^{n\times k}} \operatorname{rank}(Z) \quad \text{subject to} \quad \|\mathcal{A}(Z) - y\|_2 \le \varepsilon, \tag{1.2} \]

which is NP-hard in general. In case A is linear, a common strategy is to apply convex relaxation and solve instead

\[ \min_{Z\in\mathbb{R}^{n\times k}} \|Z\|_* \quad \text{subject to} \quad \|\mathcal{A}(Z) - y\|_2 \le \varepsilon, \tag{1.3} \]

which can be understood as ℓ1-minimization applied to the singular values of Z. For more information on low-rank matrix recovery, we refer to [17, 40, 47, 97], where a plethora of different matrix recovery problems are discussed, such as matrix completion or object detection.

As a measurement operator we consider $\mathcal{A}(Z) = \big(\operatorname{tr}(Z^t A_i Z)\big)_{i=1,\dots,m}$, where the A_i ∈ R^{n×n} are random matrices distributed according to a Gaussian Orthogonal Ensemble (GOE), see Definition 67. Such an A is quadratic, in contrast to the usual CS theory. Here we draw inspiration from the phase retrieval problem, see [15, 22, 32, 74, 76]. This problem consists of recovering a given z ∈ R^n from quadratic measurements |⟨z, a_i⟩|², i = 1, …, m, where a_i ∈ R^n. This has many applications in physics, e.g., X-ray crystallography [45, 86], quantum mechanics [24, 82] and diffraction imaging [10, 20]. There are many algorithms to solve this phase problem, such as the Fienup algorithm [32] or the Gerchberg-Saxton algorithm [76]. Those two algorithms, however, are only applicable to problems where the linear operator A is a Fourier transform. In the popular AltMinPhase algorithm, the size of the original data x ∈ C^n determines the maximal number of iterations. A more recent approach, in case the a_i are i.i.d. Gaussian vectors, is the Wirtinger Flow algorithm proposed in [15], which uses a gradient descent scheme to recover x from the measurements |Ax|², where A ∈ C^{m×n} is the matrix with rows a_i^* and the absolute value is applied entry-wise. The principle behind Wirtinger Flow is to lift the problem of recovering a vector in dimension n from quadratic measurements to recovering a matrix of dimension n² from linear measurements by rewriting $|Ax|^2 = |\langle Ax, Ax\rangle| = |\operatorname{tr}(A^*A xx^*)| = |\operatorname{tr}(A^*A X)|$, where X ∈ R^{n×n} is a symmetric, positive semi-definite (PSD) rank-one matrix. Since X is a PSD matrix, this problem can be solved via PSD schemes such as interior point methods [2].

Note that Z and ZU, where U ∈ R^{k×k} is an orthonormal matrix, produce the same measurements A(Z) = A(ZU), which is why Z can only be retrieved up to an orthonormal transformation, in the same way as the phase retrieval problem can only be solved up to a global phase factor.

In [98], the authors approached this problem by finding the solution of

\[ \min_{Z\in\mathbb{R}^{n\times k}} \frac{1}{4m} \sum_{i=1}^m \big(y_i - \operatorname{tr}(Z^t A_i Z)\big)^2 \]


via gradient descent, which is the 'matrix' variant of the Wirtinger Flow algorithm mentioned above. Starting at an initial Z_0 ∈ R^{n×k}, which is computed from the spectral decomposition of

\[ \frac{1}{m}\sum_{k=1}^m y_k A_k A_k^t, \]

they prove linear convergence of their algorithm. A drawback of this method is that at each iteration k the gradient of the function above, i.e.,

\[ \sum_{i=1}^m \big(\operatorname{tr}(Z_k^t A_i Z_k) - y_i\big) A_i Z_k, \]

has to be evaluated at the current iterate Z_k, which is computationally expensive. We try to circumvent this issue by applying stochastic gradient descent (SGD), which entails the update scheme

\[ Z_{i+1} = Z_i - \mu_i \big(\operatorname{tr}(Z_i^t A_i Z_i) - y_i\big) A_i Z_i. \]

Unfortunately, we are only able to provide a partial convergence analysis for this particular algorithm. However, we will provide numerical evidence for the convergence of stochastic gradient descent as well as for other, closely related algorithms.

Moreover, we will prove that the Mini-Batch Stochastic Gradient Descent Algorithm 6 actually converges under some additional assumptions. This algorithm takes the quadratic measurements $\mathcal{A}(Z) = \big(\operatorname{tr}(Z^t A_i Z)\big)_{i=1,\dots,M}$ and samples Θ ⊂ [M] with ♯Θ = m at random, whereafter one gradient descent step is applied to the cumulative function $f_\Theta(Z) = \frac{1}{4m}\sum_{j\in\Theta}\big(\operatorname{tr}(Z^t A_j Z) - y_j\big)^2$, that is,

\[ Z_{i+1} \leftarrow Z_i - \mu_i \nabla_Z f_\Theta(Z_i) = Z_i - \frac{\mu_i}{m} \sum_{j\in\Theta} \big(\operatorname{tr}((Z_i)^t A_j Z_i) - y_j\big) A_j Z_i. \]

This algorithm sits between the gradient descent algorithm proposed in [98], where at each iteration all the measurements are taken into account, and the SGD approach we take, where at each iteration just one single measurement is used. For the Mini-Batch Stochastic Gradient Descent Algorithm we show linear convergence if the number of measurements m exceeds Cκ(Z)²r³n² log(n) for a global constant C. While this is surely not optimal, it remains an open question how to improve this result.
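For illustration only, the following Python sketch implements one version of the mini-batch update just described for the quadratic measurement map A(Z) = (tr(Zᵗ A_i Z))_i with symmetric random matrices A_i. The batch size, step size, problem sizes and the initialization near the truth are our own illustrative choices (the thesis uses a spectral initialization and carefully chosen step sizes); this is not Algorithm 6.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3                               # Z in R^{n x k}; here the rank equals k for simplicity
Z_true = rng.standard_normal((n, k))

def sym(rng, n):
    """Symmetric Gaussian matrix (GOE-style stand-in for the A_i)."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / 2.0

M = 2000
A = [sym(rng, n) for _ in range(M)]
y = np.array([np.trace(Z_true.T @ Ai @ Z_true) for Ai in A])

def batch_gradient(Z, batch):
    """Gradient of f_Theta(Z) = 1/(4m) sum_j (tr(Z^T A_j Z) - y_j)^2 for symmetric A_j."""
    g = np.zeros_like(Z)
    for j in batch:
        g += (np.trace(Z.T @ A[j] @ Z) - y[j]) * (A[j] @ Z)
    return g / len(batch)

Z = Z_true + 0.1 * rng.standard_normal((n, k))   # illustrative start near the truth
m, mu = 100, 1e-3
for _ in range(50):
    batch = rng.choice(M, size=m, replace=False)
    Z -= mu * batch_gradient(Z, batch)

# compare up to an orthogonal transformation via the Gram matrices Z Z^T
print("relative error of Z Z^T:",
      np.linalg.norm(Z @ Z.T - Z_true @ Z_true.T) / np.linalg.norm(Z_true @ Z_true.T))
```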

1.4 Outline

In the next chapter we state the most important definitions as well as some background results for the subsequent chapters. Chapter 3 discusses weighted ℓ1-minimization and the Weighted Iterative Hard Thresholding algorithm. Moreover, we present the extension of these results to the recovery of functions in Besov spaces from a finite number of Fourier measurements. Afterwards, in Chapter 4, we present our results for ℓ1-analysis minimization and the extension to shearlet-wavelet minimization. The fifth chapter discusses the problem of low rank matrix recovery via stochastic gradient descent. All of these chapters begin with an additional introduction into the respective topic which is more exhaustive than the one above. Moreover, we provide numerical examples and proofs of concept at the end of each chapter. Lastly, Chapter 6 summarizes results used throughout this dissertation, especially several tools from probability theory, wavelet ONBs, Besov spaces and shearlet smoothness spaces. We also include some background theory which is not essential to the understanding of our results, particularly with regard to decomposition spaces and smoothness spaces.

1.5 Acknowledgements

First and foremost, I am grateful for the help, guidance and support of my adviser, Prof. Dr. Holger Rauhut, and I am thankful for the opportunity to take on my doctorate and for my position at his chair. Moreover, my sincere thanks go to all of my colleagues, the group members at the Chair for the Mathematics of Information Processing. Furthermore, I would like to express my gratitude towards CIRM for the fantastic winter school I visited there in 2014 and the Hausdorff Center in Bonn for the admission to the trimester program 'Mathematics of Signal Processing'. Lastly, I would like to thank my wonderful, loving wife for her unceasing and unwavering emotional support.

Est modus in rebus


Chapter 2

Preliminaries

In this chapter we provide a general overview of the related fields which are of importance for the understanding of this work. Also, we introduce the basic notation, concepts and fundamental theorems which provide the basis for our inquiries. For further background results, the gentle reader is referred to Chapter 6.

2.1 Weighted Sparse Approximations

Compressed Sensing exemplifies the idea that many types of data which are retrieved from some linear measurements Φx = y ∈ R^m, where Φ ∈ R^{m×d}, x ∈ R^d, actually contain only a small amount of information, i.e., there are only s ≪ d entries of x which are non-zero. This is expressed by the notion of sparsity, which is to say that

\[ \|x\|_0 := \sharp\{i \in [d] : x_i \neq 0\} \le s. \]

This idea was first explored by Donoho in [29]. The set of all such vectors is denoted by Σ_s := {x ∈ R^d : ‖x‖_0 ≤ s}. This work will employ weighted vector spaces in large parts; for a sequence ω := (ω_λ)_{λ∈Λ} of weights which obey ω_λ ≥ 1, for some countable index set Λ, we define the following expressions for 0 ≤ p ≤ 2:

\[ \|x\|_{\omega,p} := \Big(\sum_{\lambda\in\Lambda} |x_\lambda|^p \omega_\lambda^{2-p}\Big)^{1/p}, \qquad \|x\|_{0,\omega} := \sum_{x_\lambda \neq 0} |\omega_\lambda|^2. \tag{2.1} \]

Though the choice of 2 − p in the exponent of the weights may seem somewhat uncommon, it allows for the Stechkin-type estimate (2.5), which is a key ingredient in the proof of our main recovery result. The expressions from Equation (2.1) are pseudo-norms for 0 < p < 1 and proper norms for 1 ≤ p ≤ 2.
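To make the weighted (quasi-)norms from Equation (2.1) concrete, the following short Python sketch computes ‖x‖_{ω,p} and the weighted support size ‖x‖_{0,ω} for a given weight sequence; the function names and the toy data are purely illustrative and not part of the thesis.

```python
import numpy as np

def weighted_norm(x, w, p):
    """Weighted (quasi-)norm ||x||_{w,p} = (sum |x_j|^p w_j^(2-p))^(1/p) as in (2.1)."""
    return (np.sum(np.abs(x) ** p * w ** (2 - p))) ** (1.0 / p)

def weighted_support_size(x, w):
    """Weighted 'l0-norm' ||x||_{0,w}: sum of w_j^2 over the support of x."""
    return np.sum(w[x != 0] ** 2)

x = np.array([0.0, 3.0, 0.0, -1.5, 0.2])
w = np.array([1.0, 1.0, 2.0, 1.5, 1.0])   # weights >= 1
print(weighted_norm(x, w, 1.0))           # weighted l1-norm
print(weighted_support_size(x, w))        # weighted sparsity of x
```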


Proposition 1. For 1 ≤ p ≤ 2 the map ‖·‖_{ω,p} is a norm and fulfills the Hölder inequality

\[ \sum_{j\in[N]} |x_j y_j| \le \|x\|_{\omega,p}\|y\|_{\omega,q} \quad \text{for } x, y \in \mathbb{C}^N \]

for 1 < p, q < ∞ with $\frac{1}{p} + \frac{1}{q} = 1$.

We include a proof of this proposition to highlight the similarity between the proofs of the unweighted inequalities, e.g., the triangle or the reverse triangle inequality, and the corresponding weighted versions.

Proof. We start out by noting that only the triangle inequality has to be proven for ‖·‖_{ω,p} to be a norm; all other properties are immediately clear. We start with the proof of the Hölder inequality. For positive A, B and 1 < p, q < ∞ with $\frac1p + \frac1q = 1$, Young's inequality states

\[ AB \le \frac{A^p}{p} + \frac{B^q}{q}. \]

So let 0 ≠ x, y ∈ C^N and set $A_j := \frac{x_j\, \omega_j^{\frac{2-p}{p}}}{\|x\|_{\omega,p}}$ and $B_j := \frac{y_j\, \omega_j^{\frac{2-q}{q}}}{\|y\|_{\omega,q}}$. Since $\frac{2-p}{p} + \frac{2-q}{q} = 2\big(\frac1p + \frac1q\big) - 2 = 0$, we have

\[ \frac{1}{\|x\|_{\omega,p}\|y\|_{\omega,q}} \sum_{j\in[N]} |x_j y_j| = \sum_{j\in[N]} A_j B_j \le \sum_{j\in[N]} \frac{A_j^p}{p} + \frac{B_j^q}{q} = \frac{1}{p\|x\|_{\omega,p}^p} \sum_{j\in[N]} |x_j|^p \omega_j^{2-p} + \frac{1}{q\|y\|_{\omega,q}^q} \sum_{j\in[N]} |y_j|^q \omega_j^{2-q} = \frac{1}{p} + \frac{1}{q} = 1, \]

hence the desired inequality. Therefore

\[ \|x+y\|_{\omega,p}^p = \sum_{j\in[N]} |x_j + y_j|^p \omega_j^{2-p} \le \sum_{j\in[N]} |x_j|\,|x_j+y_j|^{p-1}\omega_j^{2-p} + \sum_{j\in[N]} |y_j|\,|x_j+y_j|^{p-1}\omega_j^{2-p} \le \big(\|x\|_{\omega,p} + \|y\|_{\omega,p}\big) \Big\| \big(|x_i+y_i|^{p-1}\omega_i^{2-p}\big)_{i\in[N]} \Big\|_{\omega,\frac{p}{p-1}} \]

where we used the unweighted Hölder-inequality in the last step. Now we evaluate

\[ \Big\|\big(|x_i+y_i|^{p-1}\omega_i^{2-p}\big)_{i\in[N]}\Big\|_{\omega,\frac{p}{p-1}} = \Bigg(\sum_{j\in[N]} |x_j+y_j|^{(p-1)\frac{p}{p-1}}\, \omega_j^{(2-p)\frac{p}{p-1} + 2 - \frac{p}{p-1}}\Bigg)^{\frac{p-1}{p}} = \Bigg(\sum_{j\in[N]} |x_j+y_j|^p \omega_j^{2-p}\Bigg)^{\frac{p-1}{p}} = \|x+y\|_{\omega,p}^{p-1}. \]

These are the desired inequalities.


We denote the (weighted) sequence spaces as

\[ \ell^p_\omega(\Lambda) := \ell^p_\omega := \{x = (x_\lambda)_{\lambda\in\Lambda} : \|x\|_{\omega,p} < \infty\} \tag{2.2} \]

for some countable index set Λ, p ≥ 0 and ‖·‖_{ω,p} as in Equation (2.1). For some S ⊂ Λ we write ♯S for the unweighted cardinality of S and $\omega(S) := \sum_{\lambda\in S} \omega_\lambda^2$ for the weighted cardinality.

In applications, sparsity in a strict sense might be too strong a demand on the underlying data, which is why we rather consider compressible vectors, which are those x ∈ R^d such that the ℓ^p-error of best s-term approximation

\[ \sigma_s(x)_{\omega,p} := \inf\{\|x - z\|_{\omega,p} : z \in \Sigma_s\} \tag{2.3} \]

is small compared to ‖x‖_{ω,p}. As it turns out, the minimizer of σ_s(x)_{ω,p}, where we use the weighted (quasi-)norm ‖·‖_{ω,p} from Equation (2.1), is NP-hard to compute in general, as we will show in Lemma 23. Therefore we approximate this quantity using the weighted quasi-best s-term approximation.

Definition 2 (Weighted quasi-best s-term approximation). For a signal x ∈ ℓ^p_ω, let (v_j)_{j∈Λ} be the non-increasing rearrangement of the sequence (|x_j|ω_j^{-1})_{j∈Λ}, namely $v_j = |x_{\sigma(j)}|\,\omega_{\sigma(j)}^{-1}$, where σ ∈ Sym(Λ) is a permutation of the set Λ such that v_j ≥ v_{j+1} for all j ∈ Λ. Let k be maximal such that $\sum_{j\le k} \omega_{\sigma(j)}^2 \le s$. Then the set S := {j ∈ Λ : σ(j) ≤ k} is the set of indices which corresponds to the weighted quasi-best s-term approximation x_S. Accordingly, we define the error of the weighted quasi-best s-term approximation to be

\[ \tilde\sigma_s(x)_{\omega,p} = \|x - x_S\|_{\omega,p}. \tag{2.4} \]
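For illustration only, a minimal Python sketch of the weighted quasi-best s-term approximation from Definition 2: sort the entries by |x_j|/ω_j in non-increasing order and keep a prefix of indices as long as the accumulated weight ω(S) = Σ ω_j² stays within the budget s. The helper names and the toy data are our own and not from the thesis.

```python
import numpy as np

def quasi_best_weighted_approx(x, w, s):
    """Greedy weighted quasi-best s-term approximation (cf. Definition 2).

    Keeps the entries with largest |x_j| / w_j as long as the weighted
    cardinality (sum of w_j^2 over the kept indices) does not exceed s.
    """
    order = np.argsort(-np.abs(x) / w)        # non-increasing rearrangement
    keep, budget = [], 0.0
    for j in order:
        if budget + w[j] ** 2 > s:
            break
        keep.append(j)
        budget += w[j] ** 2
    x_S = np.zeros_like(x)
    x_S[keep] = x[keep]
    return x_S, keep

x = np.array([4.0, -0.5, 2.0, 0.1, -3.0])
w = np.array([1.0, 1.0, 2.0, 1.0, 1.5])
x_S, S = quasi_best_weighted_approx(x, w, s=4.0)
print(S, x_S)                                 # kept indices and the approximation
print(np.linalg.norm(x - x_S, 1))             # (unweighted) approximation error
```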

These concepts can also be carried over to the domain of frame coefficients. In this context, however, we have no need of additional definitions since the concept of ℓ1-analysis sparsity is that the frame coefficients Ωx of a given datum x ∈ R^d are s-sparse (or, as it is sometimes referred to, x is (d − s)-cosparse [49]). The best s-term approximation of these coefficients simply is σ_s(Ωx)_{ω,p}. The minimizer z whose analysis coefficients realize the minimum, i.e., the z ∈ R^d that minimizes ‖Ωz − Ωx‖_{ω,p}, is usually hard to compute and not necessarily unique due to the overcompleteness of the frame.

The weighted quasi-best s-term approximation from Definition 2, the error of the weighted best s-term approximation and the ‖·‖_{ω,p}-norm are related to each other via the crucial Stechkin estimate:

Theorem 3 (Stechkin-type estimate, [34, Prop. 2.3], [80, Lemma 3.1, Theorem 3.2]). For any x ∈ C^d and weights ω_λ ≥ 1 with ‖ω‖²_∞ < s, the following inequalities hold for all 0 ≤ p < q ≤ 2:

\[ \sigma_s(x)_{\omega,q} \le \tilde\sigma_s(x)_{\omega,q} \le (s - \|\omega\|_\infty^2)^{\frac{1}{q}-\frac{1}{p}} \|x\|_{\omega,p} \tag{2.5} \]

and

\[ \tilde\sigma_{3s}(x)_{\omega,q} \le \sigma_s(x)_{\omega,q}, \]

where σ̃_s denotes the error of the weighted quasi-best s-term approximation from (2.4).

This is the weighted version of the classical Stechkin estimate

\[ \sigma_s(x)_q \le s^{\frac{1}{q}-\frac{1}{p}} \|x\|_p, \qquad p < q \le \infty. \tag{2.6} \]

The Stechkin-type estimate (2.5) is of crucial importance for the recovery results for our algorithm and includes the factor (s − ‖ω‖²_∞)^{1/q−1/p} for p < q ≤ 2. As mentioned previously, we will assume s ≥ 2‖ω‖²_∞ for the remainder of this chapter and in Chapter 4, so that $(s - \|\omega\|_\infty^2)^{\frac1q-\frac1p} \le 2^{\frac1p-\frac1q}\, s^{\frac1q-\frac1p}$.

2.2 Weighted Compressed Sensing

As a first approach, it seems meaningful to solve the following optimization problem to retrieve an s-sparse signal x ∈ R^d from y = Φx for a known measurement operator Φ ∈ R^{m×d}:

min ‖z‖0,ω subject to Φz = y.

Unfortunately, this problem is known to be NP-hard, as is shown in [34, Section 2.3] (as is the minimization of ‖x‖_{ω,p} for p < 1, see [34, Exercise 2.10]), and therefore one applies convex relaxation instead and solves

min ‖z‖ω,1 subject to Φz = y (ω-BP)

or, if we assume noisy measurements, i.e., y = Φx + e,

min ‖z‖ω,1 subject to ‖Φz − y‖ω,2 ≤ η (ω-BPDN)

where η ≥ ‖e‖_2 is an estimate of the noise e = y − Φx. It would also be possible to employ a norm ‖x‖_{ω,p} with p > 1 instead of the 1-norm, but this has been shown to yield non-sparse solutions, see [34, Exercise 3.1]. Among the first to exploit this idea, without any weights, were Candès, Romberg and Tao in [18]. A thorough introduction to Compressed Sensing can be found in [34], from which all results in this section originate in their unweighted version unless explicitly stated otherwise.

One way to generalize the minimization problems ω-BP and ω-BPDN is ℓ1-analysis minimization, which will be discussed in further detail in Chapter 4. Here we take a frame Ω ∈ R^{n×d} with n ≥ d, i.e., a matrix such that there are constants A, B > 0 with

\[ A\|x\|_2^2 \le \|\Omega x\|_2^2 \le B\|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^d. \]

In this setting our frame Ω consists of vectors ψ_j ∈ C^d, j = 1, …, n, for some n ≥ d, and for abbreviation we write

\[ \Omega = \begin{pmatrix} \psi_1^t \\ \vdots \\ \psi_n^t \end{pmatrix} \in \mathbb{R}^{n\times d}, \]

which gives ‖x‖_2 ≍ ‖Ωx‖_2. The possible overcompleteness of the frame leaves some room to design a frame system in an appropriate way to accentuate certain prominent features present in every instance of a given class of signals. The basic idea is then that, instead of being sparse in the standard basis or any orthonormal basis, a vector might have sparse frame coefficients Ωx for a frame Ω ∈ C^{n×d} where n ≥ d, i.e., x is the minimizer of

min ‖Ωz‖ω,1 subject to Φz = y (Ω-BP)

or, if we assume noisy measurements, i.e., y = Φx + e,

min ‖Ωz‖ω,1 subject to ‖Φz − y‖2 ≤ η. (Ω-BPDN)

More details as well as the related ℓ1-synthesis minimization will be given in Chapter 4. The error of the weighted quasi-best s-term approximation σ̃_s(Ωx)_{ω,q} from Definition 2 is used to evaluate the quality of recovery for the two different types of recovery which are considered.

Remark 4 (Uniform and nonuniform recovery). Considering the recovery problem of retrieving an s-sparse x ∈ C^d from linear measurements y = Φx, one distinguishes between two kinds of results:

• A uniform recovery result guarantees that, with a certain probability depending on the matrix Φ, all s-sparse x ∈ C^d are recovered.

• A nonuniform recovery result states that an arbitrary but fixed s-sparse x ∈ C^d is recovered with a certain probability.

Uniform recovery obviously implies nonuniform recovery, while the converse does not hold. Nonuniform results often have less strict assumptions than uniform results and may depend on the structure of the signal x. This work will mostly be concerned with uniform results.

By now, the most well-known way to ensure proper uniform reconstruction of s-sparse vectors via basis pursuit or basis pursuit denoising, namely solving Ω-BP or Ω-BPDN with Ω = I, centers on proving the restricted isometry property (RIP) of order s for the sensing matrix Φ. This entails showing that an inequality of the type

\[ (1-\delta_s)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1+\delta_s)\|x\|_2^2 \tag{RIP} \]

holds for all (weighted) s-sparse x ∈ R^N with a preferably small restricted isometry constant δ_s ∈ (0, 1), see, e.g., [34, Chapter 6]. This δ_s can also be expressed as

\[ \delta_s = \max_{S\subset[d],\,\omega(S)\le s} \|\Phi_S^*\Phi_S - \mathrm{Id}\|_{2\to 2}. \]

The smaller δ_s is, the closer Φ acts to an isometry on the set of (ω-)s-sparse vectors. In that sense it is a generalization of the coherence of a matrix Φ = (φ_1|…|φ_n), where the φ_i are the columns of Φ:

\[ \mu := \max_{i\neq j\in[d]} |\langle \phi_i, \phi_j\rangle|. \tag{2.7} \]
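For illustration only, the coherence (2.7) can be computed directly from the Gram matrix of the (here ℓ2-normalized) columns; a small Python sketch with our own helper name, not from the thesis:

```python
import numpy as np

def coherence(Phi):
    """Coherence max_{i != j} |<phi_i, phi_j>| of the l2-normalized columns of Phi."""
    cols = Phi / np.linalg.norm(Phi, axis=0)   # normalize columns
    G = np.abs(cols.T @ cols)                  # pairwise inner products
    np.fill_diagonal(G, 0.0)                   # ignore the diagonal terms <phi_i, phi_i>
    return G.max()

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 128))
print(coherence(Phi))                          # typically well below 1 for random Phi
```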

The coherence µ and the Restricted Isometry Property are closely linked, as [34, Theorem 5.3] shows. One fundamental theorem relating the unweighted RIP to basis pursuit is the following:

Theorem 5. [34, Theorem 6.9] Suppose that the 2s-th restricted isometry constant of Φ satisfies δ_{2s} < 1/3. Then every s-sparse vector x ∈ C^d is reconstructed via ω-BP with ω ≡ 1.

For the weighted counterpart of Basis Pursuit, ω-BP and its robust extension ω-BPDNthere is a corresponding result.

Theorem 6. [80, Theorem 5.2] Fix δ, γ ∈ (0, 1) and let (φ_j)_{1≤j≤N} be an orthonormal system. Consider weights ω_j ≥ ‖φ_j‖_∞, let s ≥ 2‖ω‖²_∞ and fix

\[ m \ge C\delta^{-2} s \max\{\log^3(s)\log(N),\, \log(1/\gamma)\}. \]

Suppose that x_1, …, x_m ∈ R^d are drawn independently from the orthogonalization measure associated to (φ_j)_{j∈Λ}. Then, with probability exceeding 1 − γ, the normalized sampling matrix A ∈ C^{m×N} with entries $A_{l,k} = \frac{1}{\sqrt{m}}\phi_k(x_l)$ satisfies the weighted restricted isometry property of order s with δ_{ω,s} ≤ δ.

A necessary and sufficient condition for the recovery of sparse vectors via ω-BP or ω-BPDN is the (weighted) null space property. This is the general approach we follow in Chapters 3 and 4. Recovery via basis pursuit is equivalent to the sensing matrix having this property, see [34, Chapter 4], which makes it weaker than the RIP, which is only a sufficient condition for exact recovery of s-sparse vectors. First, we define the null space properties properly.

Definition 7. Let Φ ∈ Rm×d be a sensing matrix and Ω ∈ Rn×d a frame.


(i) Φ is said to possess the weighted null space property of order k with respect to Ω if for all S ⊂ [n] with ♯S < n − k

\[ \|\Omega_S v\|_{\omega,1} < \|\Omega_{\overline{S}} v\|_{\omega,1} \quad \text{for all } v \in \ker(\Phi)\setminus\{0\}. \tag{NSP} \]

(ii) Φ is said to possess the weighted stable null space property of order k with respect to Ω

with constant θ < 1 if for all S ⊂ [n] with ♯S < n − k

\[ \|\Omega_S v\|_{\omega,1} < \theta\,\|\Omega_{\overline{S}} v\|_{\omega,1} \quad \text{for all } v \in \ker(\Phi)\setminus\{0\}. \tag{SNSP} \]

(iii) Φ is said to possess the weighted ℓ_q-robust null space property of order k with respect to Ω with constants 0 < θ < 1 and τ > 0 if for all S ⊂ [n] with ♯S ≤ n − k

\[ \|\Omega_S v\|_{\omega,q} < \frac{\theta}{s^{1-\frac{1}{q}}}\,\|\Omega_{\overline{S}} v\|_{\omega,1} + \tau\|\Phi v\|_2 \quad \text{for all } v \in \mathbb{R}^d. \tag{RNSP} \]

Assuming that Φ possesses the weighted null space property of order k with respect to Ω, every nonzero signal x^♭ within the kernel of Φ contributes to the ℓ1-norm in Ω-BP, since its frame coefficients would otherwise contravene the NSP.

Note that the robust ℓ1-RNSP implies the SNSP, and this in turn the NSP. In the following, we are only concerned with the respective NSP over R^N since it is equivalent to the complex NSP, see [34, Theorem 4.7]. Moreover, it is well known that the ℓ_q-RNSP implies the ℓ_r-RNSP for q ≥ r, hence it suffices to prove the ℓ2-RNSP. The case Ω = Id and ω ≡ 1 yields the commonly known respective null space properties. We summarize the important results in the following theorem.

Theorem 8. Let Φ ∈ R^{m×d} be a sensing matrix and Ω ∈ R^{n×d} a frame. Then the following hold:

(i) [49, Theorem 7]: If Φ possesses the NSP of order k with respect to a frame Ω, we have reconstruction via Ω-BP for all (n − k)-cosparse x ∈ R^d.

(ii) [49, Theorem 8]: The stable weighted ℓ2-NSP implies recovery via Ω-BP in the case where Ωx is not s-sparse but compressible. The reconstruction x^♯ obeys the error bound

\[ \|x - x^\sharp\|_2 \le \frac{C}{\sqrt{s}}\,\sigma_s(\Omega x)_{\omega,1}, \]

where the constant C only depends on the parameter θ and the frame bounds.

(iii) [33, Theorem 5]: The robust weighted ℓ2-NSP implies recovery in the perturbed case via Ω-BPDN. The reconstruction x^♯ obeys the error bound

\[ \|x - x^\sharp\|_1 \le C\,\sigma_s(\Omega x)_{\omega,1} + D\eta, \]

where η is a bound on the ℓ2-norm of the error vector and the constants C, D only depend on the parameters θ, τ and the frame bounds, see [49, Theorem 10].

We provide a proof of the last point of this theorem since we can highlight the differences between the weighted and unweighted case here. Beforehand, we give a full statement of our assumptions in the following lemma.

Lemma 9. Let Φ ∈ R^{m×d} be a sensing matrix and Ω ∈ R^{n×d} a frame such that Φ possesses the weighted ℓ1-robust null space property of order k with respect to Ω with constants 0 < θ < 1 and τ > 0, and assume s > 2‖ω‖_∞. Then x^♯ = argmin{‖Ωz‖_{ω,1} : ‖Φz − y‖_2 ≤ η} obeys

\[ \|\Omega(x - x^\sharp)\|_{\omega,p} \le \frac{C}{s^{1-1/p}}\,\sigma_s(\Omega x)_{\omega,1} + D\eta \]

for 1 ≤ p ≤ 2.

Proof. We set s = n − k for this proof. Let g = x − x^♯ and S ⊂ [n] with ♯S ≤ n − k = s. Then

\[ \|\Omega_S g\|_{\omega,1} \le \rho\,\|\Omega_{\overline{S}} g\|_{\omega,1} + \tau\|\Phi g\|_2 \]

by virtue of the NSP. Then we write

\[ \|\Omega x\|_{\omega,1} = \|\Omega_S x\|_{\omega,1} + \|\Omega_{\overline{S}} x\|_{\omega,1} \le \|\Omega_S g\|_{\omega,1} + \|\Omega_S x^\sharp\|_{\omega,1} + \|\Omega_{\overline{S}} x\|_{\omega,1}, \qquad \|\Omega_{\overline{S}} g\|_{\omega,1} \le \|\Omega_{\overline{S}} x\|_{\omega,1} + \|\Omega_{\overline{S}} x^\sharp\|_{\omega,1}. \]

Summing these two inequalities and rearranging yields

\[ \|\Omega_{\overline{S}} g\|_{\omega,1} \le \|\Omega_S g\|_{\omega,1} + \|\Omega x^\sharp\|_{\omega,1} - \|\Omega x\|_{\omega,1} + 2\|\Omega_{\overline{S}} x\|_{\omega,1}. \]

This combined with the inequality from the NSP gives

\[ \|\Omega_{\overline{S}} g\|_{\omega,1} \le \frac{1}{1-\rho}\Big(\underbrace{\|\Omega x^\sharp\|_{\omega,1} - \|\Omega x\|_{\omega,1}}_{\le 0} + 2\|\Omega_{\overline{S}} x\|_{\omega,1} + \tau\|\Phi g\|_2\Big). \]

This then gives

\[ \|\Omega g\|_{\omega,1} = \|\Omega_S g\|_{\omega,1} + \|\Omega_{\overline{S}} g\|_{\omega,1} \le (1+\rho)\|\Omega_{\overline{S}} g\|_{\omega,1} + \tau\|\Phi g\|_2 \le \frac{1+\rho}{1-\rho}\,2\|\Omega_{\overline{S}} x\|_{\omega,1} + \frac{2\tau}{1-\rho}\|\Phi g\|_2. \]

We have that ♯S ≤ ω(S) since the weights obey ω_i ≥ 1. If now T ⊂ [N] is the set that realizes the error of the weighted best ℓ1-term approximation of x, that is, σ_s(Ωx)_{ω,1} = ‖Ω_{T̄}x‖_{ω,1}, then ♯T ≤ ω(T) ≤ s = ♯S and therefore the NSP also holds for the set T. Accordingly,

\[ \|\Omega g\|_{\omega,1} \le \frac{1+\rho}{1-\rho}\,2\sigma_s(\Omega x)_{\omega,1} + \frac{2\tau}{1-\rho}\|\Phi g\|_2. \tag{2.8} \]


Moreover, since g = x − x^♯ we have ‖Φg‖_2 ≤ ‖Φx − y‖_2 + ‖Φx^♯ − y‖_2 ≤ 2η. Additionally, the Stechkin estimate (2.5) yields σ_s(Ωx)_{ω,q} ≤ (s − ‖ω‖_∞)^{1/q−1/p}‖Ωx‖_{ω,p} for 0 < p < q ≤ 2, and therefore

\[ \|\Omega g\|_{\omega,q} \le \|\Omega_S g\|_{\omega,q} + \|\Omega_{\overline{S}} g\|_{\omega,q} \le \frac{1}{(s-\|\omega\|_\infty)^{1-1/q}}\|\Omega g\|_{\omega,1} + \frac{\rho}{s^{1-1/q}}\|\Omega_{\overline{S}} g\|_{\omega,1} + \tau\|\Phi g\|_2 \le \frac{1+\rho}{(s-\|\omega\|_\infty)^{1-1/q}}\|\Omega g\|_{\omega,1} + \tau\|\Phi g\|_2. \]

Plugging (2.8) into the last inequality yields the desired result since s > 2‖ω‖∞.

Unfortunately, there are no deterministic ways to construct a sensing matrix Φ which satisfies the RIP or one of the NSPs in the optimal parameter regime. Moreover, it has been proven to be NP-hard to compute δ_s and the constant θ from the SNSP inequality, see [92]. Having said that, in [13] it was already noticed that one can verify that certain types of random matrices possess the RIP or the NSP with high probability, which is why most of the analysis of Compressed Sensing algorithms is focused on these matrices; see the appendix for a short summary on this or [34] for an elaborate introduction.

2.3 Low Rank Matrix Recovery and the Phase Retrieval Problem

The phase retrieval problem differs from the aforementioned reconstruction problems in the sense that precise reconstruction in this particular case is only possible up to a global phase. Consider a collection of measurement vectors a_i ∈ C^d, i = 1, …, m. We take measurements of the form y_i = |⟨x, a_i⟩|² or y = |Ax|², where the a_i are the rows of A, for some original data vector x ∈ C^d. The first problem one encounters is that for any φ ∈ [0, 2π) the vector e^{iφ}x also satisfies the measurements. Secondly, x appears quadratically within these measurements and therefore linear methods of reconstruction fail here. This, however, can be addressed by rewriting $|\langle x, a_i\rangle|^2 = \operatorname{tr}(a_i a_i^* x x^*)$, which is now a linear problem in a matrix space of dimension d × d. Thus, the problem can be restated as

\[ \text{find } X \succeq 0, \ \operatorname{rank}(X) = 1, \ \operatorname{tr}(a_k a_k^* X) = y_k \ \text{ for } k = 1,\dots,m, \tag{2.9} \]

where, as usual, X ⪰ 0 signifies that X is a positive semi-definite matrix. Since this problem is intractable for basically the same reasons ℓ0-minimization is, the rank constraint is replaced by the trace:

\[ \min \operatorname{tr}(X) \quad \text{subject to} \quad X \succeq 0, \ \operatorname{tr}(a_k a_k^* X) = y_k \ \text{ for } k = 1,\dots,m. \tag{2.10} \]
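For illustration only, the trace-minimization relaxation (2.10) can be solved for small real-valued instances with a generic semidefinite solver. The following Python sketch assumes the cvxpy package with an SDP-capable backend (such as the bundled SCS solver); sizes and names are our own and this is not an algorithm studied in the thesis.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, m = 8, 40
x_true = rng.standard_normal(d)
a = rng.standard_normal((m, d))
y = (a @ x_true) ** 2                          # y_k = |<a_k, x_true>|^2

X = cp.Variable((d, d), PSD=True)              # lifted variable X ~ x x^T
constraints = [cp.sum(cp.multiply(np.outer(a[k], a[k]), X)) == y[k] for k in range(m)]
prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
prob.solve()

# extract a candidate x (up to sign) from the leading eigenvector of the solution
eigvals, eigvecs = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(eigvals[-1], 0.0)) * eigvecs[:, -1]
print(min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true)))
```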

One prominent algorithm for tackling this problem is the Wirtinger Flow (WF) [15], a gradient descent approach employing an initialization based on the spectral decomposition of the matrix

\[ \frac{1}{m}\sum_{k=1}^m y_k a_k a_k^*. \]
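As a small illustration of this spectral initialization (variable names and sizes are our own), the following Python snippet builds the matrix (1/m) Σ_k y_k a_k a_k^* for real Gaussian measurement vectors and compares its leading eigenvector with the true signal direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 2000
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

a = rng.standard_normal((m, d))            # real Gaussian measurement vectors
y = (a @ x) ** 2                           # phaseless measurements |<a_k, x>|^2

# spectral initialization matrix (1/m) sum_k y_k a_k a_k^T
Y = (a.T * y) @ a / m
eigvals, eigvecs = np.linalg.eigh(Y)
v = eigvecs[:, -1]                         # leading eigenvector

# v approximates x up to sign; for real Gaussians E[Y] = ||x||^2 I + 2 x x^T
print("correlation |<v, x>| =", abs(v @ x))
```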

2.4 Notation

Usually, f, g, h denote functions, n, m, k integers or multi-indices, and N, d ∈ N are dimensions. The isometric dilation of a function f ∈ L²(R^d) by a regular matrix A ∈ R^{d×d} is denoted by $D_A f(x) := \sqrt{\det(A)}\, f(Ax)$, the isotropic translation by t ∈ R^d by $T_t f(x) := f(x-t)$, and the modulation by ξ ∈ R^d is $M_\xi f(x) = e^{2\pi i\langle \xi, x\rangle} f(x)$. For any Schwartz function f ∈ S(R^d), i.e., a C^∞ function that decays as fast as (1 + |x|)^{−N}, x → ∞, for arbitrary N ∈ N, its Fourier transform is

\[ \hat f(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-i\langle \xi, x\rangle}\, dx, \tag{2.11} \]

which extends to L²(R^d) and L¹(R^d) since S(R^d) ⊂ L²(R^d), L¹(R^d) is dense and the Fourier transform is a bounded, linear operator. The Fourier transformation, either on functions f ∈ L²(R^d) or on vectors x ∈ C^d, is denoted by F and the Haar transformation (see Definition 46 for the discrete version) by H. The standard inner product on L²(R^d) is denoted by ⟨·,·⟩.

For x ∈ R^d and S ⊂ [d] we write x_S for the vector which equals x for all entries indexed by S and is 0 otherwise. Accordingly, for a matrix A ∈ C^{n×n} we write A_S for the matrix that has the same rows as A for the rows indexed by S and is 0 otherwise, so that A_S x = (Ax)_S.


Chapter 3

Weighted ℓ1-Minimization and Weighted Iterative Hard Thresholding

In this chapter we consider the weighted ℓ1-minimization problem for sparse signal recovery, i.e., how to reconstruct a sparse x ∈ R^d from samples y = Φx for Φ ∈ C^{m×d} via

min ‖z‖ω,1 subject to Φz = y

for weights ω_i ≥ 1, i = 1, …, d. We extend this theory to the recovery of functions f ∈ B^{p,q}_l, where the latter is a Besov space (see Chapter 6, especially Theorem 106 and Definition 6.17), from finitely many Fourier samples which are drawn according to a probability measure that induces a certain type of weights.

3.1 Introduction

The ℓ^p_ω-spaces for the weighted norm given by (2.1) can be utilized to define various weighted function spaces; see Section 6.2 for the examples important to this work. In order to do so, let ν be a probability measure on some domain D and (φ_j)_{j∈Λ} a system of complex-valued functions on D that are orthonormal with respect to ν, i.e.,

\[ \langle \phi_j, \phi_k\rangle := \int_D \phi_j(t)\overline{\phi_k(t)}\, d\nu(t) = \delta_{j,k} \quad \text{for all } j, k \in \Lambda. \]

Then, the `pω-spaces give rise to the definition of the function spaces

\[ S_{\omega,p} := \Big\{ f = \sum_{j\in\Lambda} x_j \phi_j : \|f\|_{\omega,p} := \|x\|_{\omega,p} < \infty \Big\} \]


for 0 ≤ p ≤ 2. Setting the (quasi-)norm ‖f‖_{ω,p} to be the ℓ^p_ω-norm of the coefficients is intrinsic since in the case p = 2 and ω ≡ 1 we have S_{ω,p} = L²(R^d) for some wavelet basis (φ_j)_{j∈Λ} of L²(R^d). Also, in this case

\[ \|f\|_2^2 = \langle f, f\rangle = \sum_{j,k\in\Lambda} x_j \overline{x_k}\, \langle \phi_j, \phi_k\rangle = \sum_{j\in\Lambda} |x_j|^2 \]

holds for all functions f ∈ L²(R^d). Identifying a function f ∈ S_{ω,p} with its coefficients in the basis (φ_j)_{j∈Λ}, we can transfer the terminology of weighted sparse approximations from R^d over to S_{ω,p}. Since ‖f‖_{ω,p} = ‖x‖_{ω,p} for a function $f = \sum_{j\in\Lambda} x_j\phi_j \in S_{\omega,p}$, the best weighted s-term approximation of such an f is the function

\[ f_S := \sum_{j\in S} x_j \phi_j \]

for S ⊂ Λ, where x_S realizes the best weighted s-term approximation of the coefficients (x_j)_{j∈Λ} as defined by Equation (2.4), i.e., σ_s(x)_{ω,p} = ‖x_{Λ\setminus S}‖_{ω,p}. Then, the error of the best s-term approximation of f is defined as

\[ \sigma_s(f)_{\omega,p} := \sigma_s(x)_{\omega,p}. \]

Our aim is to construct an s-sparse f ∈ S_{ω,p}, or equivalently the corresponding coefficient vector (x_j)_{j∈Λ}, that realizes the best s-term approximation of such a function f. To this end we propose the greedy thresholding algorithm Weighted Iterative Hard Thresholding (WIHT), based on the Iterative Hard Thresholding algorithm (see [34, p. 70]).

Weighted sparsity with regard to L²(R) can be understood as the idea that certain atoms are more prevalent in signals belonging to a certain class. This has been observed for the amplitude spectra of photographs of natural scenes in [85, 93], where the authors suggest a power-law decay of the form

\[ \text{amplitude} \propto \text{frequency}^{-\alpha}, \]

where α ∈ [0.8, 1.2] for natural images in [93] and α ≈ 1.8 in [85]. Moreover, it is suggested that the preference for certain atoms within a given dictionary might prevent overfitting.

Assuming now that t_1, …, t_m are drawn from R according to the probability measure ν, we want to reconstruct f ∈ S_{ω,p} from samples $y_i = f(t_i) = \sum_{\lambda\in\Lambda} x_\lambda \phi_\lambda(t_i)$. By setting Φ_{i,λ} := φ_λ(t_i) for i ∈ [m], λ ∈ Λ, this is a linear equation y = Φx which we aim to solve via weighted basis pursuit (ω-BP),

min ‖z‖ω,1 subject to Φz = y

which is the ’frameless’ variant of Ω-BP. In [80], the following result was shown for ω-BP.

Theorem 10. [80, Theorem 1.1] Suppose (ϕ_λ)_{λ∈Λ} is an orthonormal system of finite size ♯Λ = N, and consider weights (ω_λ)_{λ∈Λ} with ω_λ ≥ ‖ϕ_λ‖_∞. Then, for s ≥ 2‖ω‖²_∞, fix a number of samples

\[ m \ge c_0\, s \log^3(s)\log(N) \]

and suppose that t_l, l = 1, …, m, are sampling points drawn i.i.d. from the orthogonalization measure associated to (ϕ_λ)_{λ∈Λ}. Then with probability exceeding 1 − N^{−log³(s)}, the following holds for all functions $f = \sum_{\lambda\in\Lambda} c_\lambda \varphi_\lambda$: given samples y_l = f(t_l), l = 1, …, m, let c^♯ be the solution of ω-BP and let $f^\sharp = \sum_{\lambda\in\Lambda} c_\lambda^\sharp \varphi_\lambda$. Then the following error rates are satisfied:

\[ \|f - f^\sharp\|_\infty \le \|f - f^\sharp\|_{\omega,1} \le c_1 \sigma_s(f)_{\omega,1} \quad \text{and} \quad \|f - f^\sharp\|_2 \le c_2 \sigma_s(f)_{\omega,1}/\sqrt{s}. \]

The constants c0, c1 and c2 are universal, independent of everything else.
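As an illustration of the sampling setup in Theorem 10 with one concrete bounded orthonormal system, the following Python sketch builds the sampling matrix Φ_{i,λ} = φ_λ(t_i) for the trigonometric system φ_λ(t) = e^{2πiλt} on [0, 1) (orthonormal with respect to the uniform measure, so ‖φ_λ‖_∞ = 1) together with an example weight sequence ω_λ ≥ 1. The sizes and the polynomial weight choice are our own; solving ω-BP itself would additionally require a convex solver and is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 65, 30                                   # number of basis functions and samples
lam = np.arange(-(N // 2), N // 2 + 1)          # frequencies lambda = -32, ..., 32
w = np.maximum(1.0, np.abs(lam)) ** 0.5         # example polynomially growing weights >= 1

t = rng.uniform(0.0, 1.0, size=m)               # i.i.d. samples from the uniform
                                                # (orthogonalization) measure on [0, 1)
Phi = np.exp(2j * np.pi * t[:, None] * lam[None, :])   # Phi[i, l] = phi_lambda(t_i)

# samples of a function f = sum_lambda x_lambda phi_lambda with few active frequencies
x = np.zeros(N, dtype=complex)
x[[N // 2, N // 2 + 3, N // 2 - 5]] = [1.0, 0.5j, -0.25]
y = Phi @ x

print(Phi.shape, np.linalg.norm(y))
print("weighted l1-norm of the coefficients:", np.sum(np.abs(x) * w))
```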

The authors also prove a non-uniform version of the theorem above for infinite-dimensional function spaces.

Theorem 11. [80, Theorem 1.2] Suppose (ϕ_λ)_{λ∈Λ} is an orthonormal system and consider weights (ω_λ)_{λ∈Λ}. For a fixed parameter s, let Λ_s := {λ ∈ Λ : ω_λ ≤ s/2} and N := ♯Λ_s. Then fix a number of samples

\[ m \ge c_0\, s\log^3(s)\log(N). \]

Consider a fixed function $f = \sum_{\lambda\in\Lambda} c_\lambda \varphi_\lambda$ of finite ‖·‖_{ω,1}-norm and suppose that t_l, l = 1, …, m, are sampling points drawn i.i.d. from the orthogonalization measure associated to (ϕ_λ)_{λ∈Λ}. Let Φ ∈ C^{m×N} be the sampling matrix with entries Φ_{l,λ} = ϕ_λ(t_l) for λ ∈ Λ_s. Let η, ε > 0 be such that η ≤ ‖f − f_{Λ_s}‖_{ω,1} ≤ η(1 + ε). From samples y_l = f(t_l), l = 1, …, m, let c^♯ be the solution of

\[ \min_{z\in\mathbb{C}^{\Lambda_s}} \|z\|_{\omega,1} \quad \text{subject to} \quad \|\Phi z - y\|_2 \le \sqrt{m}\,\eta \]

and set $f^\sharp = \sum_{\lambda\in\Lambda_s} c_\lambda^\sharp \varphi_\lambda$. Then with probability exceeding 1 − N^{−log³(s)}, the following error rates hold:

\[ \|f - f^\sharp\|_\infty \le \|f - f^\sharp\|_{\omega,1} \le c_1 \sigma_s(f)_{\omega,1} \quad \text{and} \quad \|f - f^\sharp\|_2 \le c_2 \sigma_s(f)_{\omega,1}/\sqrt{s}. \]

The constants c0, c1 and c2 are universal, independent of everything else.

Another approach which incorporates previously known structure apparent in the signalclass is sparsity in levels, see [65].

Definition 12. For r ∈ N, let M = (M1, . . . ,Mr) where 1 ≤ M1 < Mr = N and s =

(s1, . . . , sr) where sk ≤ Mk − Mk−1 for k = 1, . . . , r and M0 = 0. A vector x ∈ CN is

27

Page 30: Weighted l1-Analysis minimization and stochastic gradient

(s,M)-sparse in levels if

]supp(x) ∩ Mk, . . . ,Mk+1 ≤ sk, k = 1, . . . , r.

To this notion the authors associate a custom Restricted Isometry Property, the RestrictedIsometry Property in Levels (RIPL) , basically the usual RIP which in this particular caseis assumed to hold for all (s,M)-sparse vectors instead of the usual s-sparse vectors. Sincethe sparsity is only considered on certain subsets of [d], they propose a sampling schemethat attributes this idea, which is called a (d,m)-multilevel sampling scheme : Let 1 =

d1, d2, . . . , dr−1, dr = d and mk ≤ dk − dk−1 the number of measurements taken at levelk = 1, . . . , r. They show the following nonuniform recovery result for the 1D Fourier/Haarwavelet system. Here, F is the discrete Fourier transform of a vector x ∈ Cd for an evendimension d

Fx(t) =1

d

d∑j=1

xje2πi(j−1)t/d, t ∈ R

and for abbreviation the authors write F ∈ Cd×d for the corresponding unitary matrix of thetransform, that is Fx = (Fx(t))

d2

t=− d2 +1. Moreover, the discrete Haar transform is H ∈ Cd×d,

where the columns are the orthonormal Haar basis vectors of Cd. For more details on theHaar basis and the related Haar frame, see Definition 46.

Theorem 13. [65, Corollary 2.11] Let N = 2r for some r ≥ 1, 0 < ε < exp(−1) andx ∈ CN . For each k = 1, . . . , r suppose that mk Fourier samples are drawn uniformly atrandom from the band Wk = 2k−1 + 1, . . . , 2k where

mk &

sk +∑l 6=k

2−|k−l|/2sl

log(sε

)log(d) (3.1)

for some sk ≤ 2k−1 − 1 and s =∑r sr. Let K = maxk=1,...,r

dk−dk−1

mk

and suppose

y = PΩFx + e with an error ‖e‖2 ≤√Kη for some η ≥ 0. Here PΩ is the subsampling

operator restricting y to the entries of Fx that are actually sampled by the (d,m)-subsamplingscheme. Then with probability exceeding 1− ε any minimizer x] of

min ‖H∗z‖1 subject to ‖PΩFx− y‖2 ≤√Kη

satisfies

‖x− x]‖2 . σs,M (H∗x) + (1 + L√s)η

where L = 1 +

√log(1/ε)

log(d) , and σs,M is the error of the best (s,M)-term approximation,

28

Page 31: Weighted l1-Analysis minimization and stochastic gradient

σs,M (x) = min‖x− x′‖ : x′ is (s,M)-sparse.

The occurrence of the sparsities sl for l 6= k in the estimate (3.1) is due to the use of thelocal coherence µk,l(d,M) which for an isometry U ∈ Cd×d is defined as

µk,l(d,M)(U) = max|Ui,j |2 : i = dk−1, . . . , dk, j = Ml−1, . . . ,Ml

which is a measure of the impact the entries xj , j = Ml−1, . . . ,Ml at level l have on themeasurements at level k 6= l. In the 1 D-Fourier/Haar-result we have Mk = 2k.The sparsity in levels approach is closely related to the weighted sparsity approach althoughthere is no precise result linking the two. One striking difference, however, is that in the for-mer it is possible to forbid the reconstruction coefficients at a certain level Ml−1+1, . . . ,Mlby setting the respective sparsity sl = 0. This could only be achieved via weights if we wereto set the respective weights ωλ =∞, λ ∈ Ml−1 + 1, . . . ,Ml.

One algorithmic approach to reconstruct the original signal vector via ω-BPDN was pro-posed by Jo in [48] which hinges on computing a vector realizing σω,s(x)1 which is infeasible.However, the idea to apply Iterative Hard Thresholding (IHT) to the reconstruction task in-spired our own approach. Henceforth, we propose Weighted Iterative Hard Thresholding, seeAlgorithm 1 in section 3.4 later in this chapter as an alternative approach to solve ω-BPDN.Our main result for the finite-dimensional setting reads as follows:

Lemma 14. Suppose the 14sth order weighted restricted isometry constant of Φ ∈ Cm×N

satisfies δ14s,ω <1√3. Then the sequence (xn)n∈N defined via WIHT for measurements y =

Φx+e, see Algorithm 1, is bounded and every accumulation point x] of this sequence satisfies

‖x− x#‖ω,p ≤ Cs1p−1σs(x)ω,1 +Ds

1p−

12 ‖e‖2.

Later we will extend this theory to infinite dimensional function spaces which serve asa model for image reconstruction where we will employ Besov spaces Bp,ql (R), see Equation(6.14), or Bp,ql (D) for some bounded domain D ⊂ R, see Definition 6.17. Those can be definedvia sequence spaces given by weighted norms as in Equation (2.1). Our main theorem withregard to function reconstruction is Theorem 36.

Theorem 15. Let q ∈ (0, 1) and ω, v be sequences of weights satisfying ωj & 2κj and

vj ≥ ωq

2−qj . For given s ∈ N set

Js = max

j ∈ N0 : ωjv

q−2q

j ≥ s12−

1q

.

Fix a number m of samples with

m ≥ cs log3(s) log(N).

29

Page 32: Weighted l1-Analysis minimization and stochastic gradient

Suppose the sampling points x1, . . . , xm are drawn independently according to the precondi-tioning measure P, see Section 3.3 for more details. Then with probability exceeding 1 −e− log3(s) log(N the following holds for all f in the Besov space Bq,q

κ 2−qq + 1

2

(R):

For measurements yi = f(xi) for i = 1, . . . ,m the function f ] recovered via WIHT satisfies

‖f − f ]‖ω,1 ≤ Cs1− 1q ‖f‖v,q and ‖f − f ]‖2 ≤ C ′s

12−

1q ‖f‖v,q.

3.2 Weighted RIP

In the case of unweighted sparse recovery, the Restricted Isometry Property, see DefinitionRIP is a common tool for the analysis of reconstruction algorithms, as we have elaboratedin Chapter 2. Consequently, there should be a similar concept for weighted sparse recovery,which would be the ω-RIP, see [80], where for additional details.

Definition 16. For A ∈ Cm×N , s ≥ 1 and a weight sequence ω, the ω-restricted isometryconstant δω,s associated to A is the smallest positive number for which

(1− δω,s)‖x‖22 ≤ ‖Ax‖22 ≤ (1 + δω,s)‖x‖22

holds for all x ∈ CN obeying ‖x‖ω,0 ≤ s. ♦

For weights ω ≡ 1 the ω-RIP equals the RIP. In the more general case, with weightsωj ≥ 1, the ω-RIP is a weaker assumption on a matrix since it requires the matrix A to actas a near-isometry on a smaller set.

Theorem 17. [80, Theorem 5.2] Fix δ, γ ∈ (0, 1) and let (φj)1≤j≤N be an orthonormalsystem. Consider weights ωj ≥ ‖φj‖∞, let s ≥ 2‖ω‖2∞and fix

m ≥ Cδ−2smaxlog3(s) log(N), log(1/γ).

Suppose that x1, . . . , xm ∈ Rd are drawn independently from the orthogonalization measureassociated to (φj)j∈Λ. Then, with probability exceeding 1−γ, the normalized sampling matrixA ∈ Cm×N with entries Al,k = 1√

mφk(xl) satisfies the weighted restricted isometry property

of order s with δω,s ≤ δ.

Note that with K = maxj∈Λ ‖φj‖∞ and ωj = K this result reduces to the previouslyknown result for the standard RIP for bounded orthonormal systems which demands thenumber of measurements to be m & K2sδ−2 ln4(N), see [78, Theorem 12.31] for furtherdetails.

3.3 Wavelet Bases and Preconditioning

A first approach for signal recovery using preconditioning methods was made by Krahmer andWard in [55, Theorems 3.1, 3.2]. They, however, encounter the obstacle that their probability

30

Page 33: Weighted l1-Analysis minimization and stochastic gradient

measure cannot be extended to the whole real line or Rd, an obstacle we circumvent with apreconditioning function ϑ taking weights ω into account.In [55] Krahmer and Ward discussed recovery conditions for signals x ∈ CN using the discreteHaar-wavelet basis. Here, a generalized result will be proven using arbitrary wavelets andweighted function spaces. From now on we will consider an orthonormal wavelet φ ∈ Ckc (R),the space of compactly supported, k-times differentiable functions, as proposed for exampleby Daubechies in [27].We take a compactly supported mother wavelet for L2(R), namely φM with supp(φM ) ⊂[−M,M ], and the corresponding supported father wavelet, or scaling function, φF withsupp(φF ) ⊂ [0, 2M − 1] and set

φj,k(x) =

φF (x− k) if j = 0, k ∈ Z

2j2φM (2jx− k) if j > 0, k ∈ Z.

(3.2)

Then the system (φj,k)j≥0,k∈Z is known to form an orthonormal basis of L2(R). This only isa special case of the dyadic wavelet system presented in Definition 52.

Our aim is to reconstruct f from a finite set of samples f(x1), . . . , f(xm) at samplingpoints x1, . . . , xm which are drawn independently at random in frequency domain. We optfor using m-times differentiable wavelets since they have rapid decay in the Fourier domain,which will be relevant later on.In fact since φ ∈ Ck, we have that

|φ(ξ)| = O((1 + |ξ|)−k).

The following technical lemma will prove useful for many purposes later on. For conve-nience, we set

D(φ)p := max‖φF ‖p, ‖φM‖p for 0 < p ≤ ∞. (3.3)

Also, we will mostly omit the subscripts M and F since for all intents and purposes bothfunctions have the same properties such as smoothness or compact support.

Lemma 18. Suppose φ ∈ C2c (R). Then we have

|φj,k(x)| ≤

2−

j2D(φ)∞ if |x| < 2j

√D(∂2φ)∞

D(φ)∞,

C(κ)2κj

|x|12

+κelse.

with C(κ) = D(∂2φ)∞

(D(φ)∞

D(∂2φ)∞

) 32−κ2

for every κ ∈ R.

Note that, in order for the bounding function on the right to remain integrable, we musthave κ > 0.

31

Page 34: Weighted l1-Analysis minimization and stochastic gradient

Proof. Since φ ∈ C2c (R), we have that φ ∈ L2(R)∩C2(R) and therefore, the function φ fulfills

|φ(x)| = 1

|x2||∂2φ(x)| (3.4)

for every x ∈ R \ 0, owing to our normalization of the Fourier transform. Next, using basicproperties of the Fourier transform we observe

|φj,k(x)| = |D2jTkφ(x)| = |D2−jM−kφ(x)| = |D2−j φ(x)| = 2−j2 |φ(2−jx)|.

Then (3.4) yields

|φ(x)| ≤ min

D(φ)∞,

D(∂2φ)∞|x|2

which, for α ∈ R leads to

|φj,k(x)| ≤

2−j2D(φ)∞ if |x| < 2j

√D(∂2φ)∞

D(φ)∞,

232 jD(∂2φ)∞|x|2 else.

=

2−j2D(φ)∞ if |x| < 2j

√D(∂2φ)∞

D(φ)∞,

232 j

D(∂2φ)∞|x|α|x|2−α else.

Now setting α > 0 we lower bound the term |x|α by(

2j√

D(∂2φ)∞

D(φ)∞

)αin the second case to

receive

|φj,k(x)| ≤

2−

j2D(φ)∞ if |x| < 2j

√D(∂2φ)∞

D(φ)∞

2( 32−α)j

(D(φ)∞

D(∂2φ)∞

)α2 D(∂2φ)∞|x|2−α else.

Setting κ = 32 − α we finally obtain

|φj,k(x)| ≤

2−

j2D(φ)∞ if |x| < 2j

√D(∂2φ)∞

D(φ)∞

C(κ)2κj

|x|12

+κelse

with C(κ) = D(∂2φ)∞

(D(φ)∞

D(∂2φ)∞

) 32−κ2

.

Here we encounter the first occurrence of weights ωj = 2κj which are similar to the2dimension·smoothness·scale−type weights for Besov spaces from the characterization (6.14).Next we establish a probability measure P on Rd in order to be able to draw the samplingpoints x1, . . . , xm at random. Moreover a properly modified version of our wavelet systemshould still consist an ONB of L2(R, dP). This will be done by multiplying the normalized

32

Page 35: Weighted l1-Analysis minimization and stochastic gradient

Lebesgue-measure dλ with the square of the inverse of a function ϑ > 0 such that∫R

1

ϑ2(x)dx = 1. (3.5)

The function ϑ is a so-called preconditioning-function. Therefore, our basis functions will beof the form ϑφj,k : j, k ∈ Z. This is an orthonormal system associated to the precondi-tioned measure

〈ϑφj,k, ϑφl,n〉 dP =

∫Rφ(x)φj,k(x)ϑ(x)φl,n(x)

dx

ϑ2(x)=

∫Rφj,k(x)ϑl,n(x)dx = 〈φj,k, φl,n〉,

i.e., dν = dxϑ2(x) . This setting yields measurements

yi = f(xi)φ(xi) =∑

(j,k)∈Γs

cj,kφj,k(xi)ϑ(xi) +∑

(j,k)/∈Γs

cj,kφj,k(xi)ϑ(xi), 1 ≤ i ≤ m (3.6)

which can be obtained from the samples f(x1), . . . , f(xm). Here, we already partitionedthe whole set of indices into Γs and Γcs, the first one being finite so that we can create alinear system of equations from Definition 3.6. This system takes the form y = Ax+ e withA ∈ Cm×]Γs where A(j,k),i = φj,k(xi)ϑ(xi) and ei =

∑(j,k)/∈Γs

cj,kφj,k(xi)ϑ(xi), where thelatter will be treated as noise or a measurement error. The design of the sampling set Γs

will be discussed later and depends one the function system and the form of ϑ. Our goalis to to obtain a theorem similar to [80, Theorem 1.2], which bounds the quality of thereconstruction from samples y using `2-restricted `1-minimization, i.e., ω-BPDN, in termsof the best weighted s-term approximation, with high probability. The function ϑ needs toobey several requirements:

(i) ϑ > 0, in particular ϑ,

(ii)∫R

1ϑ2(x) dx = 1,

(iii) maxx∈Rϑ(x)φj,k(x) should decay fast or at least not increase too fast for j →∞.

The last point leads to the idea that ϑ should attain the pointwise inverse values of φj,k’uniformly’ over (j, k) ∈ Γs, since we know from [34, p. 368], owing to orthonormality, thatthe maximum in (iii) can not be smaller than 1. Hence we start by defining ϑ to be

ϑ(x) := max

1

D(φ)∞,|x| 12 +κ

C(κ)

. (3.7)

This ϑ has to be normalized properly. Since Lemma 18 already alluded to choosing κ > 0

that, we set κ > 0 from now on.

33

Page 36: Weighted l1-Analysis minimization and stochastic gradient

Lemma 19. We have∫R

1

ϑ2(x)dx = 2

(√CφD(φ)2

∞ +D(∂2φ)2

∞2κ

C3−6κ

4

φ

)

with Cφ = D(∂2φ)∞

D(φ)∞.

Proof. A short calculation reveals that Cφ is chosen such that√Cφ =

(C(κ)

D(φ)∞

) 21+2κ

holds.

From the definition we can write ϑ as

ϑ(x) =

1

D(φ)∞if |x| ≤

√Cφ

|x| 1+2κ2 C(κ)−1 otherwise

with C(κ) as defined above. Therefore we have

∫R

1

ϑ2(x)dx = 2

∫ 2√Cφ

0

D(φ)2∞ dx+ 2

∫ ∞2√Cφ

C(κ)2|x|−(1+2κ) dx

=√Cφ2 ·D(φ)2

∞2 +2

C(κ)22κ(

2√Cφ

)−2κ

.

Now we are going to evaluate the maximum of x 7→ ϑ(x)φj,k(x). To this end, we employall the assumptions made throughout this section, such as κ > 0 and φ ∈ C2(R).

Lemma 20. With our definitions of the wavelet system φj,k from (3.2) and the properlynormalized version φ of ϑ, we have

maxx∈Rd

ϑ(x)|φj,k(x)| ≤ C max2−j2 , 2κj. (3.8)

Proof. For abbreviation we define

tmin := min

2j√Cφ, 2

√Cφ

and tmax := max

2j√Cφ, 2

√Cφ

.

Then, by Lemma 18

ϑ(x)|φj,k(x)| ≤

2−j2 if |x| ≤ tmin

C1+2κ

4

φ 2+κj |x|− 1+2κ2 if tmin < |x| ≤ tmax and if tmin = 2j

√Cφ

2−j2C− 1+2κ

4

φ |x| 1+2κ2 if tmin < |x| ≤ tmax and if tmax = 2j

√Cφ

2κj if tmax ≤ |x|

.

The functions occurring on the right-hand side are either constant, strictly increasing orstrictly decreasing, thus we can evaluate the respective maximum at the endpoints of the

34

Page 37: Weighted l1-Analysis minimization and stochastic gradient

respective intervals:

ϑ(x)|φj,k(x)| ≤ max

2−j2 if |x| ≤ tmin

2κj−j1+2κ

2 if tmin = 2j√Cφ

2−j2 +j 1+2κ

2 if tmax = 2j√Cφ

2κj if tmax ≤ |x|

= max

2−j2

2−j2 if tmin = 2j

√Cφ

2κj if tmaxx = 2j√Cφ

2κj

.

Altogether, we obtain

ϑ(x)|φj,k(x)| ≤ C

2−j2

2κj.

with C = 2

(√Cφ‖φ‖2∞ +

‖∂2φ‖2∞2κ C

3−6κ4

φ

).

Remark 21. This result again suggests to set ωj = C ′2κj . In accordance to Section 6.2, wenow see that the parameter κ, which naturally turns up here in the weights, measures the”smoothness” of the underlying function space. Since C(κ) and C

3−6κ4

φ scale mild in κ as longas κ is set reasonably small, say κ = 1

2 , one has just to pick such a κ and thus have

ωj = O(2κj

2−p )

for the weights.

3.4 WIHT

In [48], Jo proposed an iterative greedy algorithm for weighted sparse recovery which basicallyis a version of Iterative Hard Thresholding (IHT), see [34, p. 70]. The proposed algorithmhinges on finding the vector minimizing min‖z‖0,ω≤s ‖z − x‖2 where x is the given iteratewithin one step of IHT. This means that instead of projecting onto the set of s-sparse vectorsin the thresholding step at each iteration, the projection is onto the set of weighted-s-sparsevectors. To compute this projection all possible support sets must be checked and as it turnsout that his approach is NP-hard. In order to show this NP-hardness, we define the weightedthresholding operator

Hω,s(x) := argmin‖z‖0,ω≤s

‖z − x‖2

and consider the subset sum problem, which is known to be NP-complete:

Definition 22. Subset Sum ProblemLet s ∈ N and S ⊂ N be finite. The subset sum problem consists of the question ”Is there asubset S′ ⊂ S such that

∑i∈S′ i = s ? ” ♦

35

Page 38: Weighted l1-Analysis minimization and stochastic gradient

This problem is a reformulation of the well-known 3-SAT-problem, one of Karp’s 21 NP-complete problems [50].

Lemma 23. Applying Hs,ω(x) consists of solving a NP-hard problem.

Proof. Let (s, S) be an instance of the subset sum problem and set ω ∈ RS to be the vectorcontaining the square roots of elements of S, that is ωi =

√i for i ∈ S. Then, the subset

sum problem is solvable if and only if there is a vector x ∈ CS satisfying ‖x‖0,ω = s. Everyvector x satisfying ‖x‖0,ω =

∑xi 6=0 ω

2i ≤ s serves as a selector for candidate solutions for

the subset sum problem since a set S′ := i : xi 6= 0 is at least not too large to consist asolution. Therefore, if we apply Hω,s to the vector ω, we obtain a vector y := Hω,s(ω) whoseentries only consist either of 0’s or the entries of ω: obviously, if a nonzero yi is not equal toωi, we can set this yi to ωi since this does not change ‖y‖0,ω but would reduce the summand|yi − ωi| in ‖y − ω‖2. Now there are two possibilities with regard to ‖y‖0,ω which can becomputed in linear time:First, ‖y‖0,ω = s. In this case, the set S′ = supp(y) is indeed a solution to the subset sumproblem. If, on the other hand, ‖y‖0,ω < s then there is no set S′ that solves the subset sumproblem. To see this, assume that in this case there is an index set T ⊂ S which is a solutionto the subset sum problem. The vector y′ which is defined as

y′i :=

√ωi if i ∈ T

0 else

clearly obeys

‖y′‖0,ω, =∑i∈T

ω2i =

∑i∈T

i = s > ‖y‖0,ω =∑yi 6=0

i

and thus

‖y − ω‖22 =∑yi=0

ω2i =

∑yi=0

i =∑i∈S

i−∑yi 6=0

i >∑i∈S

i−∑i∈T

i =∑y′i 6=0

ω2i = ‖y′ − ω‖22

contradicting the minimality of ‖y − ω‖2. Thus, the subset sum problem can in polynomialtime be reduced to computing Hs,ω(ω).

Therefore, we are in need of a different thresholding operator for our algorithm.

Definition 24. For x ∈ CΛ let π be a permutation of the index set Λ realizing a non-increasing rearrangement of

(|xi|ω−1

i

)i∈Λ

and S := π(j) : 1 ≤ j ≤ k where k is themaximal integer such that

∑kj=1 ω

2π(j) ≤ s holds. Then we set

Hs(x)ω := xS .

36

Page 39: Weighted l1-Analysis minimization and stochastic gradient

The first important result, which will also yield reconstruction guarantees for our algo-rithm, is the following Lemma.

Lemma 25. [80, Lemma 3.1] Let s ≥ ‖ω‖2∞. Then σ3s(x)ω,p ≤ σs(x)ω,p.

Now we are able to state the following algorithm:

Data: Measurement matrix Φ ∈ Cm×N , measurement vector y ∈ Cm, weightsω ∈ RN , sparsity s, update parameter µi ∈ C.

Result: The weighted 3s-sparse vector xn

initialize x0 = 0, i = 0.;repeat

xi+1 := H3s(xi + µiΦ

∗(y − Φxi))ω andi := i+ 1

until Iterate until stopping criterion is met at i = n;Algorithm 1: Weighted Iterative Hard Thresholding (WIHT)

One possible choice for the parameter µi some constant value, usually µi ≡ 1, in whichcase the algorithm resembles the classic IHT and is simply referred to as weighted iterativehard thresholding (WIHT).In [7], Blumensath and Davis proposed a different normalization factor for IHT: Let gi :=

Φ∗(y−Φxi) which is the gradient of x 7→ − 12‖y−Φx‖22 at the current iteration xi and Si its

support set. Then

µi :=(giSi)

tgiSi(giSi)

tΦ∗SiΦSigiSi

(3.9)

as long as µi <‖xn+1−xn‖22‖Φ(xn+1−xn)‖22

, otherwise µi is divided by a power of 2 such that it obeys thebound. A comparison of the performance of this normalized version of WIHT, NWIHT toWIHT and IHT can be found in the section on numerical examples, Section 3.6 at the endof this chapter.We now prove several convergence results for WIHT.

Proposition 26. Suppose the 7sth weighted restricted isometry constant of the matrix Φ ∈Cm×N satisfies

δω,7s <1

2.

Then, for every vector x ∈ CN with ‖x‖ω,0 < s the sequence (xn)n∈N defined by Algorithm 1with any of the aforementioned choices of µi and y = Φx converges to x.

Before we turn to the proof of this result, we include a weighted version of the standardresult for the RIP constants [34, Lemma 6.16] which will be essential in proving of Proposition26.

37

Page 40: Weighted l1-Analysis minimization and stochastic gradient

Lemma 27. Given u, v ∈ CN and an index set S ⊂ [N ] the inequalities

|〈u, (Id− Φ∗Φ)v〉| ≤ δω,t‖u‖2‖v‖2 if ω(supp(u) ∪ supp(v)) ≤ t

‖((Id− Φ∗Φ)v)S‖2 ≤ δω,t‖v‖2 if ω(S ∪ supp(v)) ≤ t

hold.

The proof of this lemma is literally the same as in [34] only with δt replaced by δω,t since‖ · ‖ω,2 = ‖ · ‖2 for our norm as defined by the Stetchkin-type estimate (2.5).

Proof of Proposition 26. This proof is a weighted variant of the proof of [34, Theorem 6.15].We only need to address the fact that the vector xi = Hs(x

i−1 + Φ∗(y−Φxi−1)) is weighted-3s-sparse instead of s-sparse. For convenience, we provide a complete proof.It suffices to prove the existence of a constant 0 ≤ ρ < 1 such that

‖xn+1 − x‖2 ≤ ρ‖xn − x‖2 for all n ≥ 0.

Then the claim follows by induction. We define the vector un via

un := xn + Φ∗(y − Φxn) = xn + Φ∗Φ(x− xn),

and thus xn+1 = H3s(un)ω. Now, Lemma 25 yields the inequality

‖xn+1 − un‖ω,p = σp(un)ω,3s ≤ σp(un)s,ω ≤ ‖x− un‖ω,p,

and therefore un is at least as good approximated by xn+1 as by the s-sparse vector x.Choosing p = 2 and squaring then yields

‖un − xn+1‖22 ≤ ‖un − x‖22.

Expanding ‖un − xn+1‖22 = ‖(un − x)− (xn+1 − x)‖22 and rearranging gives

‖xn+1 − x‖22 ≤ 2Re〈un − x, xn+1 − x〉.

Then we conclude with Lemma 27 that

Re〈un − x, xn+1 − x〉 = Re〈(Id− Φ∗Φ)(xn − x), xn+1 − x〉

≤ δω,7s‖xn − x‖2‖xn+1 − x‖2

where we used that supp(xn − x) ∪ supp(xn+1 − x) = supp(xn − x) ∪ supp(xn+1) andω(supp(xn − x)) ≤ 4s and ω(supp(xn+1)) ≤ 3s. If ‖xn+1 − x‖2 > 0, we derive

‖xn+1 − x‖2 ≤ 2δω,7s‖xn − x‖2

38

Page 41: Weighted l1-Analysis minimization and stochastic gradient

which also holds if ‖xn+1−x‖2 = 0. Thus the desired inequality holds with ρ = 2δω,7s < 1.

More often than not measurements are neither exactly sparse vectors nor noiseless. Toaddress this issue, we now turn to a stable and robust version of Proposition 26.

Theorem 28. Suppose the 7sth weighted restricted isometry constant of the matrix Φ ∈Cm×N satisfies

δω,7s <1√3.

Then, for x ∈ CN and S ⊂ [N ] with ω(S) ≤ s the sequence (xn)n∈N defined in Algorithm 1with y = Φx+ e satisfies

‖xn − xS‖2 ≤ ρn‖x0 − xS‖2 + τ‖ΦxS + e‖2

where ρ =√

3δω,7s < 1 and τ ≤ 2.18(1− ρ).

The proof requires an auxiliary lemma whose proof is essentially the same as its un-weighted counterpart [34, Lemma 6.20].

Lemma 29. Given y ∈ Cm and S ⊂ [N ] with ω(S) ≤ s, we have

‖(Φ∗y)S‖2 ≤√

1 + δω,s‖y‖2.

Proof of Theorem 28. As before, this proof strongly resembles the proof of [34, Theorem6.18]. For x ∈ CN , e ∈ Cm and S ⊂ [N ] with ω(S) ≤ 3s we want to prove

‖xn+1 − xS‖2 ≤ ρ‖xn − xS‖2 + (1− ρ)τ‖ΦxS + e‖2

for n ≥ 0, then the claim follows by induction. We set Sn := supp(xn) ⊂ [N ] for n ≥ 0 andchoose S′ ⊂ [N ], which realizes the best weighted s-term approximation xS of x, and T ⊂ N ,which yields the weighted quasi-best 3s-term approximation for z.Lemma 25 shows that for every p ∈ (0, 2) and every R ⊂ [N ] with ω(R) ≤ s

‖zT ‖pω,p ≤ ‖zS′‖

pω,p ≤ ‖zR‖

pω,p

or equivalently

‖zR‖pω,p ≤ ‖zS′‖pω,p ≤ ‖zT ‖pω,p.

Since (xn+Φ∗(y−Φx))Sn+1 is the weighted quasi-best 3s approximation to un = xn+Φ∗(y−Φx) we have

‖(xn + Φ∗(y − Φxn))S‖22 ≤ ‖(xn + Φ∗(y − Φxn))Sn+1‖22.

39

Page 42: Weighted l1-Analysis minimization and stochastic gradient

Deleting the contribution on the intersection S ∩ Sn+1, this reads

‖(xn + Φ∗(y − Φxn))S\Sn+1‖22 ≤ ‖(xn + Φ∗(y − Φxn))Sn+1\S‖22 (3.10)

= ‖(xn − xS + Φ∗(y − Φxn))Sn+1\S‖22.

The left-hand side of (3.10) satisfies

‖(xn+Φ∗(y − Φxn))S\Sn+1‖2 = ‖(xS − xn+1 + xn − xS + Φ∗(y − Φxn))S\Sn+1‖2≥ ‖(xS − xn+1)S\Sn+1‖2 − ‖(xn − xS + Φ∗(y − Φxn))S\Sn+1‖2.

We denote by S4Sn+1 := (S \ Sn+1) ∪ (Sn+1 \ S) the symmetric difference of S and Sn+1

and conclude that

‖(xS − xn+1)S\Sn+1‖2 ≤ ‖xn − xS + (Φ∗(x− Φxn))S\Sn+1‖2 + ‖xn − xS + (Φ∗(x− Φxn))Sn+1\S‖2≤√

2‖xn − xS + (Φ∗(x− Φxn))S4Sn+1‖2.

Thus we have

‖xn+1 − xS‖22 = ‖(xn+1 − xS)Sn+1‖22 + ‖(xn+1 − xS)Sn+1‖22

= ‖(xn − xS + Φ∗(y − Φxn))Sn+1‖22 + ‖(xn+1 − xS)S\Sn+1‖22.

Putting these inequalities together we obtain

‖xn+1 − xS‖22 ≤ ‖(xn − xS + Φ∗(y − Φxn))Sn+1‖22+ 2‖(xn − xS + Φ∗(y − Φxn))S4Sn+1‖22≤ 3‖(xn − xS + Φ∗(y − Φxn))Sn+1∪S‖22.

Now we write y = Φx+ e = ΦxS + e′ with e′ = ΦxS + e and recall that

ω(S ∪ Sn+1 ∪ supp(xn − xS)) ≤ ω(S ∪ Sn+1 ∪ Sn) ≤ 7s,

because xn and xn+1 are weighted 3s-sparse and ‖x‖0,ω ≤ s. Now we use Lemma 29 toconclude

‖xn+1 − xS‖2 ≤√

3‖(xn + xS + Φ∗Φ(xS − xn) + Φ∗e′)S∪Sn+1‖2≤√

3 [‖((Id− Φ∗Φ)(xn − xS))S∪Sn+1‖2 + ‖(Φ∗e′)S∪Sn+1‖2]

≤√

3[δω,7s‖xn − xS‖2 +

√1 + δω,4s‖e′‖2

].

This is the inequality in question. By our assumptions ρ =√

3δω,7s is smaller than 1 andmoreover (1− ρ)τ ≤

√3√

1 + δω,4s ≤√

3 +√

3.

If x ∈ CN is ω-s-sparse, then Proposition 26 or Theorem 28 guarantee exact recovery of

40

Page 43: Weighted l1-Analysis minimization and stochastic gradient

x from the measurements y via WIHT under the weighted RIP. In practical applications,however, the vector x in question is often not sparse but rather compressible which is to saythat σs(x)ω,p is small compared to ‖x‖ω,p for p ≤ 1. In this case, Theorem 28 just yieldsboundedness of the sequence (xn)n∈N and thereby the existence of some cluster points oneof which we name x]. These points obey

‖x− x]‖2 ≤ τ‖ΦxS + e‖2.

Though Theorem 28 states convergence or at least the existence of some cluster points, itstill lacks an estimate for the convergence rate in terms of the best s-term error σs(x)ω,p.

Theorem 30. Let Φ ∈ Cm×N with ω-RIP δs,ω < 1 and further let x, x′ ∈ CN with ‖x‖ω,0 ≤θs for some sparsity s ∈ N and a constant θ > 0. Furthermore, assume there exists τ > 0

such that

‖xT − x′‖2 ≤ τ‖ΦxT + e‖2 + ξ (3.11)

for some constant ξ ≥ 0, where T ⊂ [N ] realizing the weighted quasi-best 2s-term approxima-tion, in other terms the subset of [N ] of maximal size such that∑

j∈Tω2j ≤ 2s

and |xi|ω−1i ≥ |xj |ω−1

j for all i ∈ T and j ∈ T . Then we have

‖x− x′‖ω,p ≤ s1p−1

[21− 1

p + 2τ(2 + θ)1p−

12

√1 + δs,ω

]σs(x)ω,1 + ((2 + θ)s)

1p−

12 (τ‖e‖2 + ξ)

Proof. As before, this proof is similar to the proof of [34, Lemma 6.23]. In order to show thedesired inequality , we employ the triangle inequality to obtain

‖x− x′‖ω,p ≤ ‖xT ‖ω,p + ‖xT − x′‖ω,p.

Additionally to the triangle inequality, the `pω-norms with exponent 2 − p as in (2.1) obeysthe Hölder-inequality: ∑

j∈[N ]

|xjyj | ≤ ‖x‖ω,p‖y‖ω,q with x, y ∈ CN (3.12)

for 1 < p, q ≤ 2 with 1p + 1

q = 1.

Firstly, we consider the quantity ‖xT ‖ω,p and bound it in terms of σs(x)ω,q . SinceT is the set that realizes the weighted quasi-best 2s-term approximation of x it foremostcontains S1 ⊂ [N ], the set realizing the weighted quasi-best s-term approximation. Thereforeω(T \ S1) ≥ s, with equality at least in the unweighed case. Then, the quasi-best weighteds-term approximation of xS is the restriction of x to some set S2 ⊂ T \ S1 with the s second

41

Page 44: Weighted l1-Analysis minimization and stochastic gradient

weighted-largest values according to the non-increasing rearrangement of (|xi|ω−1i )i∈[N ] which

moreover fulfills ω(S2) ≤ s. Now T \ (S1 ∪ S2) can be non-empty, hence

‖xT ‖ω,p ≤ σs(xS1)ω,p = ‖xS1\S2

‖ω,p ≤(s− ‖ω‖2∞

) 1p−1 ‖xS1

‖ω,1 =(s− ‖ω‖2∞

) 1p−1

σs(x)ω,1

(3.13)

where we used the following Theorem.

Theorem 31. [80, Theorem 3.2] For p ≤ q ≤ 2, s ≥ 2‖ω‖2∞ and x ∈ CN we have

σs(x)ω,q ≤ (s− ‖ω‖2∞)1q−

1p ‖x‖ω,p.

Secondly, we bound the remaining term ‖xT − x′‖ω,p by a multiple of σs(x)ω,q. To thisend, we estimate the `pω-norm in terms of the `2-norm.

Proposition 32. For 1 ≤ p < q ≤ 2 and x ∈ CN we have

‖x‖ω,p ≤ ‖x‖1p−

1q

0,ω ‖x‖ω,q.

Proof. We have

‖x‖pω,p =∑j∈[N ]

|xj |pω2−pj =

∑j∈[N ]

(|xj |ω1p−1

j )pωj

∑j∈[N ]

(|xj |ω1p−1

j )qω2− qpj

pq ∑j∈supp(x)

ωqq−p+2− q

q−pj

q−pq

= ‖x‖pω,qω(supp(x))1− pq ,

where we invoked the weighted version of the Hölder-inequality (3.12) with p′ = qp > 1 and

q′ = qq−p .

The assumption ‖x‖0,ω ≤ 2s together with Proposition 32 and the assumption (3.11)yields

‖x′ − xT ‖ω,p ≤ ((2 + θ)s)1p−

12 ‖x′ − xT ‖2 ≤ ((2 + θ)s)

1p−

12 (τ‖AxT + e‖2 + ξ) .

Next, we form a partition S1, S2, . . . ⊂ [N ] of T inductively: S1 and S2 form a partition ofT , where S1 contains the indices of the larger entries of (|xj |ω−1

j )j∈T and S2 is the maximalsubset of T fulfilling s−‖ω‖2∞ ≤ ω(S2) ≤ s and containing the indices for the smaller entriesof (|xj |ω−1

j )j∈T\S1. That is, we collect the indices of the smaller entries of (|xj |ω−1

j )j∈T ina set of maximal feasible size, namely S2 and collect the remaining indices in S1. Now thesets S3, . . . are, similar to S2, the collection of sets fulfilling the following properties:

(i) Si consists of the indices of the maximal values of (|xj |ω−1j )j∈[N ]\S1∪...∪Si−1

(ii) s− ‖ω‖2∞ ≤ ω(Si) ≤ s and there are no larger sets with this property.

42

Page 45: Weighted l1-Analysis minimization and stochastic gradient

We conclude that

‖AxT + e‖2 ≤∑k≥3

‖AxSk‖2 + ‖e‖2 ≤∑k≥3

√1 + δs,ω‖xSk‖2 + ‖e‖2. (3.14)

By definition of the Si we have |xj |ω−1j ≤ |xk|ω−1

k for every j ∈ Si and k ∈ Si−1, i ≥ 2.Setting

αk :=

∑j∈Si

ω2j

−1

ω2k for k ∈ Si

we have∑k∈Si αk = 1 for all i. It then follows for j ∈ Si that

|xj |ω−1j ≤

∑k∈Si−1

αk|xk|ω−1k ≤

(s− ‖ω‖2∞

)−1 ‖xSi−1‖ω,1

and by our general assumption s ≥ 2‖ω‖2∞

‖xSi‖2 =

√∑j∈Si

|x2j | ≤

√√√√∑j∈Si

(‖xSi−1

‖ω,1s− ‖ω‖2∞

ωj

)2

≤√s

s− ‖ω‖2∞‖xSi−1‖ω,1 ≤

2√s‖xSi−1‖ω,1

and by (3.14)

‖AxT + e‖2 ≤2√

1 + δs,ω√s

∑i≥2

‖xSi‖ω,1 + ‖e‖2 ≤2√

1 + δs,ω√s

‖xS1‖ω,1 + ‖e‖2

=2√

1 + δs,ω√s

σs(x)ω,1 + ‖e‖2.

Using the inequality (3.13) we arrive at

‖x− x′‖ω,p ≤ ‖xT ‖ω,p + ‖x′ − xT ‖ω,p ≤ σs(xS)ω,p + ((2 + θ)s)1p−

12 ‖x′ − xT ‖2

≤ σs(xS)ω,p + ((2 + θ)s)1p−

12 τ‖AxT + e‖2 + ((2 + θ)s)

1p−

12 ξ

≤ (s− ‖ω‖2∞)1p−1σs(x)ω,1 + τ((2 + θ)s)

1p−

12

2√

1 + δs,ω√s

σs(x)ω,1

+ ((2 + θ)s)1p−

12 (τ‖e‖2 + ξ)

≤(

2

s

) 1p−1

σs(x)ω,1 + τ((2 + θ)s)1p−

12

√1 + δs,ω

2√sσs(x)ω,1

+ ((2 + θ)s)1p−

12 (τ‖e‖2 + ξ)

= s1p−1

[21− 1

p + τ(2 + θ)1p−

12

√1 + δs,ω2

]σs(x)ω,1 + ((2 + θ)s)

1p−

12 (τ‖e‖2 + ξ) .

This is the desired result.

43

Page 46: Weighted l1-Analysis minimization and stochastic gradient

Now we gathered enough preliminary findings to combine them in one conclusive recon-struction result.

Lemma 33. Suppose the 14sth order weighted restricted isometry constant of Φ ∈ Cm×N

satisfies δ14s,ω < 1√3. Then, for all x ∈ CN and e ∈ Cm the sequence (xn)n∈N defined via

WIHT with y = Φx+ e and x0 = 0 and s replaced by 2s satisfies for every n ∈ N

‖x− xn‖ω,p ≤ Cs1p−1σs(x)ω,1 +Ds

1p−

12 ‖e‖2 + ρns

1p−

12 ‖x‖2

where the constants C,D > 0 and 0 < ρ < 1 only depend on δ14s,ω. Every accumulationpoint x] of the sequence (xn)n∈N defined via WIHT, see Algorithm 1, satisfies

‖x− x#‖ω,p ≤ Cs1p−1σs(x)ω,1 +Ds

1p−

12 ‖e‖2.

Proof. Under the given assumptions, Lemma 28 yields the existence of some 0 < ρ < 1 andτ > 0 depending only on δ14s,ω such that

‖xT − xn‖2 ≤ τ‖AxT + e‖2 + ρn‖xT ‖2,

where T realizes the weighted quasi-best 2s approximation to x for every n ∈ N. ThenTheorem 30 with x′ = xn and ξ = ρn‖xT ‖2 ≤ ρn‖x‖2 implies for any 1 ≤ ρ ≤ 2

‖x− xn‖ω,p ≤ s1p−1Cσs(x)ω,1 +Ds

1p−

12 (τ‖e‖2 + ρn‖x‖2).

3.5 Weighted Sparse Recovery in Besov Spaces

As mentioned repeatedly before, weighted reconstruction methods lend themselves to func-tion retrieval in Besov spaces. Here, we aim at the reconstruction of an unknown f ∈ Bp,ql (R)

with supp(f) ⊂ [−1, 1] from a finite number of samples f(ξ1), . . . , f(ξm). Besov spaces canbe characterized using wavelet expansions where the series of wavelet coefficients belongs toan appropriate sequence space. The notation as well as a short introduction into the theoryof Besov spaces can be found in Section 6.2. For a thorough and detailed survey of the theoryof Besov spaces we refer to [31].

As outlined before, we employ the preconditioned wavelet system(F−1

(ϑφj,k

))j∈N,k∈Z

which forms an ONB of L2 (R, dP) where dP is the preconditioned probability measuredP = dx

ϑ2(x) with dx being the Lebesgue measure. The basis functions additionally obey

‖φj,kϑ‖∞ . 2κj where κ > 0 is a parameter within the definition of the preconditioningfunction ϑ, see equation (3.7). If we set ωγ = ωj,k = 2j

p(l−1/2)2−p , then according to Theorem

108 and Theorem 107 this is precisely the criterion for a function to belong to the Besov

44

Page 47: Weighted l1-Analysis minimization and stochastic gradient

space Bp,pl (R).Therefore it is reasonable to employ weights ωj,k := ωj = 2κj with κ = p(l−1/2)

2−p . To establisha suitable Ansatz space for the reconstruction of functions f ∈ Bp,pl (R), we invoke Theorems108 and 107 which states the characterization of Besov space Bp,pl (R) by means of a Waveletdecomposition. We summarize our settings for the remainder of this chapter in the followingdefinition.

Definition 34. In this section, we will employ the following definitions repeatedly.

• LetM ∈ N , then there exists a father wavelet φF such that supp(φF ) ⊂ [0, 2M−1] anda corresponding mother Wavelet φM such that supp(φM ) ⊂ [−M,M ], see for examplethe construction in [27]. If we choose M , the assumptions of Theorem 107 are satisfiedso that we can characterize the Besov space Bp,ql (R) via our wavelet basis. We set

φj,k =

φF (x− k) if j = 0

2j2φM (2jx− k) elsewise.

(3.15)

Then, for Γ = N0 × Z, the system φγ : γ ∈ Γ consists an ONB of L2(R).

• We define

Γ :=⋃j∈N0

j × Pj and Γs :=

Js⋃j=0

j × Pj , (3.16)

where Js is the maximal scale to be used in our computations and will be determinedlater and where

P0 := −2M, . . . , 1

Pj := −2j −M, . . . , 2j +M.

• Let p, q ≥ 1 and l > 0. We use the fact that f ∈ Bp,ql (R) if and only if

‖f‖Bp,ql (R) :=

∑j∈N0

2j(q2 +lq−p)

(∑k∈Z|〈f, φj,k〉|p

)p/q1/q

<∞. (3.17)

For γ = (j, k) ∈ Γ we set ωγ = ωj = 2jpl−1/22−p . Since we only consider p = q in the

following, we have that f ∈ Bp,pl (R) if and only if

‖f‖Bp,ql (R) :=∥∥∥(〈f, φγ〉)γ∈Γ

∥∥∥ω,p

:=

∑γ∈Γ

ω2−pγ |〈f, φγ〉|p

1/p

<∞

see Theorems 106 and 107. Since we will assume supp(f) ⊂ [−1, 1] for the remainderof this chapter, we only need to take into account those translations with k ∈ Pj at

45

Page 48: Weighted l1-Analysis minimization and stochastic gradient

each scale j in (3.17), that is

‖f‖Bp,pl (R) =

∑j∈N0

∑k∈Pj

ω2−pj |〈f, φj,k〉|p

1/p

=

∑j∈N0

2jpl−1/22−p

∑k∈Pj

|〈f, φj,k〉|p1/p

.

Here, the Pj take the role of the A[−1,1]j from Theorem 108.

• We set

Sω,p =

f =∑γ∈Γ

cγφγ : ‖c‖ω,p =: ‖f‖ω,p <∞

= Bp,pl (R).

• From now on, let f ∈ Sω,p and f0 =∑

(j,k)∈Γscj,kφj,k be a finite approximation of f

with Γs to be determined later in accordance to the maximal scale Js. Since we assumesupp(f) ⊂ [−1, 1] for f ∈ Bp,pl (R) we have that cj,k = 〈f, φj,k〉 = 0 as soon as k /∈ Pj .

• Lastly, let xi : i = 1, . . . ,m be sampled from R according to the preconditioned prob-ability measure dx

ϑ2(x) and let f ] be the reconstruction from the samples ϑ(xi)f(xi) :

1 ≤ i ≤ m via WIHT.

In order to prove our first recovery theorem, we only need one more auxiliary lemma.

Lemma 35. [80, Lemma 6.3] For a weight ω and 0 < q < 1, set α = 2q − 1. Then

‖x‖ωα,1 ≤ ‖x‖ω,p.

Now everything is in place for the proof of this chapter’s main theorem. Keep in mindthat since the maximal scale Js is finite, so is the index set Γs we use for approximation.

Theorem 36. Let q ∈ (0, 1) and ω be a sequence of weights satisfying ωj ≥ ‖φj,kϑ‖ ≥ 2κj

and (vj)j∈N0 be a sequence of weights such that vj = ωβj ≥ ωq

2−qj for some β > 1. Let for

given s ∈ N

Js = max

j ∈ N0 : ωjv

q−2q

j ≥ s12−

1q

.

Fix a number m of samples with

m ≥ c0smax

log3(s) log(]Γs), log

(1

γ

). (3.18)

Suppose the sampling points x1, . . . , xm are drawn independently according to the randommeasure dP. Then with probability exceeding 1−γ the following holds for all f ∈ Bq,q

(βκ 2−qq + 1

2 )(R):

46

Page 49: Weighted l1-Analysis minimization and stochastic gradient

Let yi = f(xi)φ(xi) for i = 1, . . . ,m and let c] ∈ C]Γs be an accumulation point of WIHTinitialized with c0 = 0. Furthermore, set f ] =

∑(j,k)∈Γs

c]j,kϕj,k. Then

‖f − f ]‖ω,1 ≤ Cs1− 1q ‖f‖v,q

‖f − f ]‖2 ≤ C ′s12−

1q ‖f‖v,q.

Since vj = ωβj ≥ ωj we have that ‖f‖ω,q ≤ ‖f‖v,q where we have that Sv,q = Bq,qβ(κ 2−q

q + 12 )

(R)

and Sω,q = Bq,qκ 2−q

q + 12

(R). As mentioned in Section 3.5, the expression ‖f‖v,q is a quasi-norm

on Bq,qβ(κ 2−q

q + 12 )

where, as we will explain in the forthcoming proof, we set vj = ωβj .

Proof. For f =∑γ∈Γ cγφγ we set the finite approximation f0 :=

∑γ∈Γs

cγφγ and then wecan estimate the error via the triangle inequality

‖f − f ]‖ω,p ≤ ‖f − f0‖ω,p + ‖f0 − f ]‖ω,p

for any given p ∈ [1, 2]. Then, we examine the two terms on the right-hand-side separately.Upon suitable normalization, for the randomly chosen sampling points x1, . . . , xm, the mea-surements take the form

yi = f(xi)ϑ(xi) =∑

(j,k)∈Γs

cj,kφj,k(xi)ϑ(xi) +∑

(j,k)/∈Γs

cj,kφj,k(xi)ϑ(xi) (3.19)

which is the linear equation system y = Φc + e with Φ(j,k),i = φj,k(xi)ϑ(xi) for (j, k) ∈ Γs

and 1 ≤ i ≤ m where

ei =∑

(j,k)/∈Γs

cj,kφj,k(xi)ϑ(xi). (3.20)

Let f ] be the reconstruction of f from the samples y by WIHT, see Algorithm 1. ThenLemma 33, applied to the rescaled system 1√

my = 1√

m(Φx+ e) yields a reconstruction error

of

‖f0 − f ]‖ω,p ≤ C1s1p−1σs(f0)ω,1 +

C2s1p−

12

√m‖e‖2

for p ∈ [1, 2]. Moreover

‖e‖22 =∑i∈[m]

∣∣∣∣∣∣∑

(j,k)/∈Γs

cj,kφj,k(xi)ϑ(xi)

∣∣∣∣∣∣2

≤∑i∈[m]

∑(j,k)/∈Γs

|cj,k|‖φj,kϑ‖∞

2

≤∑i∈[m]

‖f − f0‖2ω,1 = m‖f − f0‖2ω,1

47

Page 50: Weighted l1-Analysis minimization and stochastic gradient

so that ‖e‖2 ≤√m‖f − f0‖ω,1. Then we estimate

‖f − f0‖ω,1 =∑

(j,k)/∈Γs

|cj,k|ωj ≤ maxωjv−αj : j /∈ Γs∑

(j,k)/∈Γs

|cj,k|vαj

≤ maxωjv−αj : j /∈ Γs‖f‖vα,1. (3.21)

We abbreviate ρ := maxωjv−αj : j > Js. Then Lemma 35 yields for 0 < q < 1 andα = 2

q − 1 > 1

‖f‖vα,1 ≤ ‖f‖v,q.

The Stechkin-type estimate from Theorem 31 states

σs(f0)ω,1 ≤ cs1− 1q ‖f‖ω,q

for q ∈ (0, 1). In summary we obtain

‖f0 − f ]‖ω,p ≤ c1s1p−

1q ‖f‖ω,q + c2s

1p−

12 ρ‖f‖v,q. (3.22)

In the following we distinguish two cases, namely p = 1 and p = 2. For p = 1 we have

‖f − f0‖ω,1 ≤ ρ‖f − f0‖vα,1 ≤ ρ‖f‖vα,1 ≤ ρ‖f‖v,q

where we used the same estimate as in (3.21). Therefore, the reconstruction error amountsto

‖f − f ]‖ω,1 ≤ ‖f − f0‖ω,1 + ‖f0 − f ]‖ω,1 ≤ (1 + c2√s)ρ‖f‖v,q + c1s

1− 1q ‖f‖ω,q.

Now we determine Γs and weights vj such that ρ = maxωjvq−2q

j : j > Js is bounded bys

12−

1q . Then

‖f − f ]‖ω,1 ≤ (1 + c2√s)ρ‖f‖v,q + c1s

1− 1q ‖f‖ω,q ≤ 2c2

√sρ‖f‖v,q + c1s

1− 1q ‖f‖v,q

≤ 2c2s1− 1

q ‖f‖v,q + c1s1− 1

q ‖f‖ω,q = Cs1− 1q ‖f‖v,q.

In the statement of our theorem we already have set

Js := max

j ∈ N0 : ωjv

q−2q

j ≥ s12−

1q

. (3.23)

Setting vj := ωβj ≥ ωj for some β > 1, so that Sv,q = Bq,q(βκ 2−q

q + 12 )

(R) = Bq,qβl−β/2+1/2(R)

again is a Besov space, leads to ωjvq−2

2j = 2jκ(1+β q−2

2 ) where (3.23) requires for j > Js that

ωjvq−2

2j < s

q−22q

48

Page 51: Weighted l1-Analysis minimization and stochastic gradient

holds, which it does if and only if

j ≤ log2(s)

2− qβ(2− q)− q

.

This can only be true if β is strictly larger than q2−q and therefore Js := b log2(s)

2κ2−q

β(2−q)−q c.Then Γs is finite, which is essential for actual computational purposes. Since vj = ωβj ≥ ωj

for all j ∈ N0, ‖f‖ω,q ≤ ‖f‖v,q follows immediately. In summary, under these conditions

‖f − f ]‖ω,1 ≤ Cs1− 1q ‖f‖v,q.

For p = 2 we again invoke inequality (3.22) so that we just need to consider the approximationerror ‖f − f0‖2.We calculate

‖f − f0‖2 =

√ ∑(j,k)/∈Γs

|cj,k|2 ≤∑

(j,k)/∈Γs

|cj,k| = ‖(cj,k)(j,k)/∈Γs‖1 = ‖f − f0‖1

≤ ‖f − f0‖ω,1

since ωj ≥ 1. Therefore

‖f − f0‖2 ≤ ‖f − f0‖ω,1 ≤ ρ‖f‖v,q.

With our previous definition of Js we arrive at

‖f − f ]‖2 ≤ ρ‖f‖v,q + c1s12−

1q ‖f‖ω,q + c2s

12−

1q ρ‖f‖v,q

≤ Cs12−

1q ‖f‖v,q.

This is the desired estimate

The reason for using the weighted q-quasi-norm instead of the weighted 1-norm is that inour setting we can expect ‖f‖ω,q to be small since we assume the coefficient vector of f tobe sparse or compressible with regard to the weighted `pω-norm.

For computational purposes, it is important to have an upper bound for the size of Γs

since this will give a lower bound on the number of required measurements from (3.18).Naturally, it also is the size of the original signal according to (3.19). Therefore this quantityenters the requirements on the size of the system in Theorem 36 logarithmically.

Corollary 37. With Js as in (3.23) and if β ≥ max

1, 22−q

we have

]Γs ≤ 2M

(⌊log2(s)

2− qβ(2− q)− q

⌋+ 1

)+ 2s

12κ

which roughly amounts to ]Γs ≤ Cs2−q

2(β(2−q)−q)κ . In this context, M was determined by the size

49

Page 52: Weighted l1-Analysis minimization and stochastic gradient

of the support of our wavelet system, i.e., supp(φM ) ⊂ [−M,M ] and supp(φF ) ⊂ [0, 2M −1].

Proof. By definition

Γs =

Js⋃j=0

j × Pj

and ]Pj = 2M + 2j+1 for j ≥ 0. Furthermore, with ωj = 2κj and vj = ωβj , the j ≤ Js arethose fulfilling j ≤ log2(s)

2κ2−q

β(2−q)−q . Then it remains to compute

]Γs =

Js∑j=0

]Pj = 2M (Js + 1) + 2

Js∑j=0

2j

= 22Js+1 − 1

2− 1+ 2M (Js + 1) ≤ 2

log2(s)2κ

2−qβ(2−q)−q+1 − 1 + 2M (Js + 1)

= 2M (Js + 1) + 2 · 2log2(s2−q

2(β(2−q)−q)κ ) − 1 = 2M (Js + 1) + 2s2−q

2(β(2−q)−q)κ − 1.

Now the estimate follows immediately since we have q < 1 which gives 2 > 22−q which in

hindsight yields q−2(q+β(q−2))κ <

12κ . In general, s

2−q2(β(2−q)−q)κ grows faster in s than Js even for

small values of κ, henceforth we obtain approximately

]Γs ≤ Cs2−q

2(β(2−q)−q)κ .

Since Theorem 36 aims at sparse recovery, we demand s N = ]Γs ≤ Cs2−q

2(β(2−q)−q)κ ,therefore only κ < 2−q

2(β(2−q)−q) is interesting in this context. The resulting lower bound forthe number of measurements (3.18) in Theorem 36 is

m ≥ c0s

κlog4

2(s).

There is yet another way to bound the approximation error ei =∑

(j,k)/∈Γscj,kφj,k(xi)ϑ(xi)

if the full index set Λ is not finite, via a probabilistic approach, see [80, Theorem 1.2] for com-parison.Since our sampling points x1, . . . , xn are drawn i.i.d. from the orthogonalization measure

dxϑ2(x) = dP where dx is the Lebesgue measure on R, the random variables e2

i , as defined in(3.20), are independent and distributed identically and we have that

E(|ei|2

)=

∑(j,k)/∈Γs

|cj,k|2.

The definition of the finite index set Γs ⊂ N0 × Z as in (3.16) where now

Js = maxj ∈ No : ω2

j ≥s

2

50

Page 53: Weighted l1-Analysis minimization and stochastic gradient

gives

∑(j,k)/∈Γs

|cj,k|2 ≤2

s

∑(j,k)/∈Γs

|cj,k|2ω2j ≤

2

s

∑(j,k)/∈Γs

|cj,k|ωj

2

=2

s‖f − f0‖2ω,1.

Moreover, we have

|ei| ≤∑

(j,k)/∈Γs

|cj,k||ψj,k(xi)φ(xi)| ≤∑

(j,k)/∈Γs

|cj,k|ωj = ‖f − f0‖ω,1.

The variance of the mean-zero-variable |ei|2 − E|ei|2 is bounded by

E(|ei|2 − E|ei|2

)2 ≤ E|ei|4 ≤ ‖f − f0‖2ω,1E|ei|2 ≤2

s‖f − f0‖4ω,1

Now we apply Bernstein’s inequality [34, see Chapter 7.5] to obtain

P

∣∣∣∣∣∣ 1

m

m∑i=1

|ei|2 −∑

(j,k)/∈Γs

|cj,k|2∣∣∣∣∣∣ ≥ θ

≤ 2 exp

(− mθ2/2

2‖f − f0‖4ω,1/s+ θ‖f − f0‖2ω,1/3

)(3.24)

which, by setting θ = 3s‖f − f0‖2ω,1, reads

P

∣∣∣∣∣∣ 1

m

m∑i=1

|ei|2 −∑

(j,k)/∈Γs

|cj,k|2∣∣∣∣∣∣ ≥ 3

s‖f − f0‖2ω,1

≤ 2 exp

(−3m

2s

)

For a number of measurements m fulfilling m ≥ c0θ s log4

2(s) = c′s log32(s) log(N) we obtain

P

(1

m

m∑i=1

|ei|2 ≥1

s‖f − f0‖2ω,1

)≤ N− log3

2(s).

Accordingly 1√m‖e‖2 ≤ ‖f−f0‖ω,1√

swith high probability. Therefore, if we reuse some of the

estimates from the proof of Theorem 36 and 1 ≤ p ≤ 2

‖f − f ]‖ω,p ≤ ‖f − f0‖ω,p + ‖f0 − f ]‖ω,pleq‖f − f0‖ω,1 + C1s1p−1σs(f0)ω,1 +

C2s1p−

12

√m‖e‖2

≤ ‖f − f0‖ω,1 + C1s1p−1σs(f0)ω,1 + C2s

1p−1‖f − f0‖ω,1

=(

1 + C2s1p−1)‖f − f0‖ω,1 + C1s

1p−1σs(f0)ω,1

≤(

1 + C2s1p−1)ρ‖f‖v,q + C1s

1p−

1q ‖f‖ω,q

≤(s

12−

1p +

C2√s

+ C1

)s

1p−

1q ‖f‖v,q

51

Page 54: Weighted l1-Analysis minimization and stochastic gradient

for 0 < q < 1 and p ∈ [1, 2], where we used that ρ = maxωjvq−2q

j : j > Js ≤ s12−

1q .

We summarize these estimates into a single non-uniform recovery theorem.

Theorem 38. Let q ∈ (0, 1), ωj ≥ 1 be a sequence of weights and Γs ⊂ N0 × Z an index setand fix a number m of samples with

m ≥ c′smax

log3(s) log(]Γs), log

(1

γ

)(3.25)

Then the following holds with probability exceeding 1 − γ: Consider a fixed function f =∑j≥0

∑k∈Z cj,kφj,k ∈ Sω,q = Bq,q

κ 2−qq + 1

2

(R) with its support contained in [−1, 1]. Suppose thesampling points x1, . . . , xm are drawn independently from Fourier domain according to theprobability measure dx

ϑ2(x) where dx denotes the Lebesgue measure on R.Let yi = 1√

mf(xi)ϑ(xi) and Φ the m× ]Γs sampling matrix with entries

Φ(j,k),i = 1√mφj,k(xi)ϑ(xi) for (j, k) ∈ Γs and let c] ∈ C]Γs be an accumulation point of

the sequence (cn)n∈N computed via WIHT initialized at c0 = 0. Furthermore, set f =∑(j,k)∈Γs

cj,kφj,k. Then

‖f − f ]‖ω,p ≤ Cs1p−

1q σs(f)ω,q

‖f − f ]‖2 ≤ C ′s12−

1q σs(f)ω,q.

This, however, is a non-uniform recovery result because Bernstein’s inequality in (3.24)only accounts for this single set of randomly chosen sampling points x1, . . . , xm. It should benoted that the number of measurements from (3.25) guarantees for the ω-RIP to hold withhigh probability.

The estimates from Theorem 38 imply

‖f − f ]‖ω,p ≤ Cs1p−

1q ‖f‖ω,q = Cs

1p−

1q ‖f‖Bq,ql (R)

‖f − f ]‖2 ≤ C ′s12−

1q ‖f‖ω,q = C ′s

12−

1q ‖f‖Bq,ql (R).

for l = κ 2−q2 + 1

2by means of the weighted Stechkin-inequality in Lemma 25.

Remark 39. For image or volume reconstruction, these ideas and principles can be extendedto the d-dimensional setting, i.e., f ∈ Bp,ql (Rd). Taking the norm as proposed in Theorem 106or the pseudo-norm from Theorem 107 as well as the respective wavelet-ONBs and choosingp = q as in the in the proofs of Theorems 36 and 38, suggests weights ω2−q

j = 2‖j‖1q(l−d/2) =

2‖j‖1κ(2−q) for j ∈ Nd0, with κ = q(l−d/2)2−q , now also dependent on the dimension d, for l ≥ 0.

The occurrence of the ambient dimension d in the exponent of the weights can be obtainedby redoing the computations in the proofs of Lemmas 18 and 20 in the d-dimensional case,resulting in the estimate ‖ψj,kφ‖∞ . max2−

jd2 , 2κdj, where every computation is done with

a tensorized wavelet ONB, i.e., separately for each dimension. Then Theorems 36 and 38hold basically unchanged in formulation, although the size of the involved index set Γs mightvary in size. From the estimate on the maximal scale Js in (3.23) we recall the definition of

52

Page 55: Weighted l1-Analysis minimization and stochastic gradient

the maximal scale Js appearing in the recovery results, which now leads to the inequality

‖j‖1 ≤log2(s)

2− qβ(2− q)− q

.

Plugging in our definition of κ > 0, implicitly assuming l > d2 , we obtain

‖j‖1 ≤log2(s)(2− q)2

q(2l − d)(β(2− q)− q)

which deteriorates when d approachs 2l. So now we can take Js = b log2(s)(2−q)2

q(2l−d)(β(2−q)−q)c.Moreover, under the assumption that for all functions f we consider, supp(f) ⊂ [−1, 1]d,the cardinality of the sets of translations Pj scales like (2M + 2‖j‖1+1)d where supp(ϕM ) ⊂[−M,M ]d and since ]j ∈ Nd0 : ‖j‖1 ≤ r ≤ (r+1)d

d!

]Γs =∑‖j‖1≤Js

]Pj =∑‖j‖1≤Js

(2M + 2‖j‖1+1)d =

Js∑j=0

(j + 1)d

d!(2M + 2j+1)d

=1

d!

Js∑j=0

[(j + 1)d(2M + 2j+1)d =1

d!

Js∑j=0

d∑k=0

(j + 1)d(d

k

)(2M)n−k2(j+1)k

=1

d!

d∑k=0

(d

k

)(2M)n−k

Js∑j=0

(j + 1)d2(j+1)k

≤ 1

d!

d∑k=0

(d

k

)(2M)n−k(Js + 1)d2k(Js+1)+1

=2(Js + 1)d

d!

d∑k=0

(d

k

)(2M)n−k2k(Js+1) =

2(Js + 1)d

d!

(2M + 2Js+1

)d.

Accordingly roughly about m & sd log3(s) log(JsM2Js

d!

)= sd log3(s)

[log(JsMd!

)+ Js

]mea-

surements are needed in order to obtain the ω-RIP with high probability in the multidimen-sional case according to Theorem 10.

3.6 Numerical illustrations

In this section we will demonstrate how Theorem 36 and Remark 39 give rise to recovery andapproximation results especially with regards to computerized tomography.Firstly we consider the 1-D setting where Figure 3.1 shows reconstructions of the function

t : R→ R, x 7→

−x if x < 0√x if x ≥ 0

53

Page 56: Weighted l1-Analysis minimization and stochastic gradient

for several choices of κ ∈ (0, 1) where κ occurs in the preconditioning function which yieldsour probability measure: we sample x1, . . . , xm according to the measure dP = dx

ϑ2(x) whereϑ(x) . (1 + |x|) 1

2 +κ, see equation (3.7) for more details. Our numerical experiments em-ployed the implementation of the one-dimensional db2 wavelet in Matlab and 4096 = 212

locations where the function was interpolated. Consequently, Js = 12 and we used m = 1000

measurements in Fourier domain. The error rate was computed as ‖x−x]‖2

‖x‖2 where x ∈ R4096

is the original function evaluated at its interpolation points, i.e., xi = t(−2048 + i) and x]

the reconstruction.

(a) κ = 0.1, error rate ≈ 20% (b) κ = 0.2, error rate ≈ 18% (c) κ = 0.3, error rate ≈ 19%

(d) κ = 0.4, error rate ≈ 12% (e) κ = 0.5, error rate ≈ 4% (f) κ = 0.6, error rate ≈ 23%

(g) κ = 0.7, error rate ≈ 4% (h) κ = 0.8, error rate ≈ 3% (i) κ = 0.9, error rate ≈ 100%

(j) Original function (k) Histogram of sampling loca-tions for κ = 0.1

(l) Histogram of sampling loca-tions for κ = 0.8

Figure 3.1: The function was sampled at the N = 4096 integers in [−2048, 2047] and thensubsampled in Fourier domain at m = 1000 locations chosen at random according to dP =

dxϑ2(x) .

The most notable phenomenon occurs as κ converges to 1, as can be observed from Figure3.1i where reconstruction quality drops significantly. The choice of the parameter κ within

54

Page 57: Weighted l1-Analysis minimization and stochastic gradient

the probability measure plays a decisive role, as can be observed form Figures 3.4a and 3.4b.Remark 39 offers the possibility to extend the results into higher dimensions. The nat-

ural connection of Fourier subsampling, as applied throughout this paper, to Computer To-mography (CT) via the Radon transform and the Fourier Slice Theorem leads to effective,sparsity-based reconstruction methods.

Definition 40 (Radon transform). Let f ∈ L2(R2) and r ∈ R,θ ∈ S1 := x ∈ R2 : ‖x‖2 = 1. Then the Radon transform R(f) of f is defined as

R(f)(r, θ) := Rθ(f)(r) :=

∫〈θ,x〉=r

f(x) dx. (3.26)

Note, that Rθ(f) is a function in a single one-dimensional variable. ♦

Theorem 41 (Fourier Slice Theorem). [12, Theorem 1] Let f ∈ L2(R2) and r ∈ R,θ =

(cos(ϕ)sin(ϕ)

)∈ S1 for ϕ ∈ [0, 2π). Then

Rθ(f)(r) = f(rθ). (3.27)

Thus, a function can be reconstructed via the Fourier transform from sampling it alongradial lines. This observation led Candes, Tao and Romberg to their famous discovery ofsparsity-based reconstruction techniques [18].

(a) Subsampling pattern for 20lines with randomly chosen an-gles.

(b) Subsampling pattern for20 deterministically chosenlines.

(c) Subsampling pattern for κ =0.75 and m = 10175 (≈ 16%subsampling rate).

Figure 3.2: Subsampling patterns in 2D Fourier domain.

Their computational method, however, differed from the theoretical approach: the an-gles, according to which the radial lines in numerical experiments were chosen equispaced in[0, 2π), see Figure 3.2b. However, this actually did not impair the quality of reconstruction.Our approach includes sampling along equally distributed radial lines but additionally optsfor choosing the angles at random, as can be seen in 3.2a and moreover employs the precondi-tioned probability measure dP = dx

ϑ2(x) , where the preconditioning function ϑ is proportionalto x 7→ (1 + ‖x‖2)1+2κ with κ > 0, as suggested by Theorem 36 and Remark 39, see Figure

55

Page 58: Weighted l1-Analysis minimization and stochastic gradient

3.2c. Again we employed the implementation of the db2 wavelet, now for the 2-dimensionalcase and m = 10175 measurements. Here, the relative error was computed as ‖x−x

]‖F‖x‖F where

x ∈ R256×256 is the original image, x] the reconstruction and ‖ · ‖F the Frobenius norm.Comparing Figures 3.3b and 3.3a shows that sampling along lines generally yields the

expectable angular artifacts. Here we encounter the known phenomenon that in practicalapplications the deterministic sampling approach yields the best results.

(a) Reconstruction from sam-pling along angles randomlychosen with equal probability.Error is ≈ 22.4%.

(b) Reconstruction from sam-pling along deterministicallychosen, equally distributedangles.Error is ≈ 15.9%.

(c) Reconstruction from sam-pling randomly according to thecustom probability distributionas depicted in Figure 3.2c. Er-ror is ≈ 17%.

Figure 3.3: Reconstruction for the Phantom test image with κ = 1 and sampling along 50lines. For the reconstruction seen in Figure 3.3c, the number of sampling locations was chosensuch that it equals the number of sampling points for 50 lines.

3.7 Conclusion

In this chapter we proposed a working algorithm, namely Weighted Iterative Hard Thresh-olding WIHT, see Algorithm 1, that can reconstruct ω-s-compressible data from noisy mea-surements as well as the framework for the analysis of this algorithm (see Section 3.3). Weobtained convergence results as well as approximation rates and extended the theory into theinfinite dimensional setting via Besov spaces. Lastly, we provided numerical evidence thatour methods are actually applicable to reconstruction task.

56

Page 59: Weighted l1-Analysis minimization and stochastic gradient

(a) Reconstruction with κ = 0.3.Error is ≈ 33.7%.

(b) Reconstruction with κ = 0.4.Error is ≈ 19 %.

(c) Reconstruction with κ = 0.3.Error is = 16.79%.

(d) Reconstruction with κ = 0.4.Error is = 16.97%.

Figure 3.4: Reconstruction for the Phantom test image with different values for κ < 1 andsampling according to the random subsampling pattern depicted in Figure 3.2c. The numberof sampling points was chosen such that at amounted for as much data as sampling along70 radial lines for the reconstruction in Figure 3.4a and 3.4b and 110 radial lines for Figures3.4c and 3.4d.

57

Page 60: Weighted l1-Analysis minimization and stochastic gradient

Chapter 4

`1-Minimization of AnalysisCoefficients

In this chapter we consider an extension of the usual Compressed Sensing approach where,instead of classical Basis Pursuit, we want to reconstruct x ∈ Cd from linear measurementsy = Φx ∈ Cm by solving the Weighted Analysis Basis Pursuit minimization problem Ω-BP

minz∈Cd

‖Ωz‖ω,1 subject to Φz = y

for an operator Ω. Hereafter, we consider the case that Ω ∈ Cn×d is a frame, i.e., a matrixsuch that

A‖x‖22 ≤ ‖Ωx‖22 ≤ B‖x‖22 holds for all x ∈ Cd

with constants A,B > 0. This approach needs a slightly different analysis since in Ω-BP and the robust version Ω-BPDN the sparsity of Ωx rather than of x itself play a role.In the following, we will develop this analysis and present reconstruction results includingapproximation rates. Moreover, we will extend the theory to infinite dimensional functionspaces via decomposition spaces.

4.1 Introduction

In this chapter we provide recovery results for the `1ω-analysis minimization problems Ω-BPand Ω-BPDN where from now on Ω ∈ Rn×d, n ≥ d denotes an arbitrary frame. Minimizingthe `1ω-norm of the analysis coefficients Ωx of a signal x ∈ Rd is usually referred to asanalysis sparsity since the key ingredient for recovery is the assumed sparsity of the sequenceof analysis coefficients Ωx of a signal x ∈ Rd for some frame Ω ∈ Rn×d. The synthesisapproach, which puts the emphasis on the synthesis coefficients z ∈ Rn itself, aims at the

58

Page 61: Weighted l1-Analysis minimization and stochastic gradient

minimization of

min ‖z‖ω,1 subject to ΦΩ†z = y.

In case that Ω ∈ Cd×d is an orthonormal matrix, this is equivalent to the case of `1ω-analysisminimization

min ‖Ωz‖ω,1 subject to Φz = y.

or the standard Basis Pursuit for a matrix ΦΩ† = ΦΩt. If Ω ∈ Cn×d is a overcompletesystem, i.e., a redundant frame, both approaches differ: Let ψ1, . . . ψd ∈ Rn be the columnsof Ω and ψ†1, . . . , ψ

†n the rows of a dual frame Ω†. Analysis sparsity is the assumption that a

signal x only has a sparse representation under some transformation, in our case it only hasfew significant frame coefficients Ωx, i.e., Ωx is s-sparse. Thus, it can be expressed efficientlyin the dictionary ψ†1, . . . , ψ

†n:

x =

n∑i=1

〈x, ψi〉ψ†i .

Synthesis sparsity, on the other hand, assumes that the signal x = Ω†z has a sparse coefficientvector z in a dictionary which is not necessarily linked to the original signal x by the frameΩ.The latter approach, however, will not be pursued further in this work. Though it can beimplemented easily, employing one of the plethora of `1ω-minimization algorithms once a dualframe Ω† is known, it has several shortcomings: Most notably, the redundancy of the frameitself allows for several dual frames, hence the minimizer is highly non-unique. Moreover,even in the seemingly simple case of a Parseval frames, computation of the dual frame mightcome at high computational cost. In case of Ω† being the canonical dual frame (Ω∗Ω)−1Ω∗,we would need to invert Ω∗Ω ∈ Rd×d which is computational to high a demand.Reconstruction results and error estimates in cases where Φ is a subgaussian random matrixor a random subsampling of an orthonormal matrix frequently employ the Restricted IsometryProperty (RIP), requiring that

(1− δs)‖x‖22 ≤ ‖Φx‖22 ≤ (1 + δs)‖x‖22 for all s-sparse x. (4.1)

The utilization of the ‖ · ‖2-norm here leads to the fact that results obtained in the standardbasis are quickly transferable to any type of orthonormal basis due to the invariance of the`2-norm under operations of the orthogonal group O(d), that is to say that the defininginequality (4.1) also holds for any operator Φ · Υ instead of just Φ for some orthonormalΥ ∈ Cd×d. Also, Gaussian random matrices which are often used for random sampling (seeRemark 101) remain Gaussian under orthogonal transformations. This opens up the ques-tion whether similar approaches also yield fruitful results for operators Ω instead of Υ which

59

Page 62: Weighted l1-Analysis minimization and stochastic gradient

behave, roughly speaking, mostly as orthonormal operators do.This question was answered positively for Parseval frames Ω, i.e., ‖x‖2 = ‖Ωx‖2, and ran-domly subsampled orthonormal matrices Φ with uniformly bounded entries by Krahmer,Needell and Ward in [53]. There, the authors considered the minimization problem

min ‖D∗z‖ω,1 subject to Φz = y.

for a Parseval frame D∗ ∈ Rn×d. They employ what they named the D-RIP, which wouldbe the Ω∗-RIP in our terms, which reads

(1− δs)‖Dx‖22 ≤ ‖Φx‖22 ≤ (1 + δs)‖Dx‖ (4.2)

for all signals x which are both sparse in the standard basis as well as in the dictionary D,i.e., ‖x‖0 and ‖Dx‖0 are both small. As can be seen from the difference in notation, theyfollow the path of synthesis sparsity. Moreover, they exploit localization factor of a frame,

ηD,s := sup‖Dz‖2=1,‖z‖0≤s

‖D∗Dz‖1√s

.

With this at hand, their main result reads as follows:

Theorem 42. [53, Theorem 1.2] Fix a sparsity level s < n . Let D ∈ Cn×d be a Parsevalframe, i.e., DD∗ is the identity, with columns δ1, . . . , δn, and let Φ = ϕ1, . . . , ϕd be anorthonormal basis of Cd which is incoherent to D in the sense that

supi∈[d]

supj∈[n]

|〈ϕi, δj〉| ≤ K

for some constant K ≥ 1.Consider the localization factor η = ηD,s and construct Φ ∈ Cm×d by sampling vectors fromΦ i.i.d. uniformly at random. If m ≥ CsK2η2 log3(sη2) log(d), then with probability at least

1 − d− log(2s) the rescaled matrix√

dm Φ exhibits uniform recovery guarantees for `1-analysis

minimization Ω-BP. That is, for every signal f , the solution f ] of the minimization problem

f ] = argmin‖D∗f‖1 : ‖Φf − y‖2

with y =

√dm Φf + e for some noise e ∈ CN with ‖e‖2 ≤ η satisfies

‖f − f ]‖2 ≤ C1η + C2‖D∗f − (D∗f)s‖1√

s,

where (D∗f)s denotes the vector obtained by extracting the largest s entries from D∗f andsetting all the others to 0. The constants C1, C2 are absolute and independent of the signalf .

60

Page 63: Weighted l1-Analysis minimization and stochastic gradient

This theorem raises several questions some of which will be addressed in the following.First of all, the use of a Parseval frame is very restrictive, especially in applications. Moreover,the utilization of the localization factor calls for the preservation of sparsity under the actionof the frame D∗ = Ω, hence both the underlying signal as well as its frame coefficients oughtto be sparse.To overcome these issues, instead of showing some kind of Restricted Isometry Property,we consider a different quality of a system which also yields reconstruction results, namelythe Null Space Property, see Definition 7. For the purpose of function reconstruction, weconsider Fourier sampling, i.e., the samples are taken at i.i.d. random locations in Fourierdomain. For a thorough introduction to Null space Properties, we again refer the readerto [34, Chapter 4, Theorems 4.4, 4.14, 4.20]. Our variant of the Null Space Properties willadditionally be refined to apply to analysis sparsity.

Results similar to Theorem 42 have been shown for sampling matrices that satisfy theconditions of the Johnson-Lindenstrauss lemma such as matrices with independent subgaus-sian entries (see Remark 101 for an exemplary list of random matrix classes). Also, [54] showsthat matrices that satisfy the classical RIP will also satisfy the Ω†-RIP with high probabilityafter randomizing the column signs. This is due to the fact that any matrix that satisfiesthe classical RIP also satisfies the conditions of the Johnson-Lindenstrauss lemma with highprobability after a randomization of the column signs.

Another aspect of `1-analysis minimization, namely the minimization of the analysiscoefficients of a signal as in Ω-BP, employs the concept of cosparsity, that is a signal x ∈ Cp

is s-p-cosparse if p − ‖Ωx‖0 = s for a frame Ω ∈ Cn×d. The idea behind this is inspired bythe notion that a s-sparse vector x lies in the union of subspaces

x ∈ Σs :=⋃

M⊂[N ],]M=s

ei : i ∈M,

where ei denotes the i-th canonical standard basis vector. Similarly, since an s-p-cosparsevector x must be orthogonal to at least s elements of the frame, hence belong to the set

x ∈⋃

Λ⊂[p],]Λ=s

ψi : i ∈ Λ⊥.

In this setting, Rauhut and Kabanava showed the following theorem for uniform recoveryfrom Gaussian measurements.

Theorem. [49, Theorem 11] Let Φ ∈ Rm×d be a Gaussian random matrix, 0 < ρ, ε < 1

and τ > 1. If (roughly)

m ≥ 2(1 + (1 + ρ−1)2)τ2B

(τ − 1)2as

(√ln(eps

)+

√a ln(ε−1)

bs(1 + (1 + ρ−1)2)

)2

,

then with probability at least 1− ε for every x ∈ Rd and perturbed measurements y = Φx+ e

61

Page 64: Weighted l1-Analysis minimization and stochastic gradient

with noise ‖e‖2 ≤ η a minimizer x] of Ω-BPDN approximates x with `2-error

‖x− x]‖2 ≤2(1 + ρ)2

√a(1− ρ)

σs(Ωx)1√s

+2τ√

2B(3 + ρ)√Am(1− ρ)

η.

While their approach is technically similar to ours, they employ methods which can onlybe applied to Gaussian measurement operators, most notably the Gaussian width of a setsimilar to our T ∗ ∩ ρSd−1, see (4.4). Using Gordon’s ’Escape through the Mesh’-theorem,they obtain a bound for the corresponding deviation and manage to prove their main result.Unfortunately, this method is not applicable to our line of work, where we take samples froma subsampled orthonormal system. We take finite index sets Γ,Λ and assume our signalc ∈ RΓ to be the coefficient sequence of a function f =

∑γ∈Γ cγϕλ in some finite dimensional

function space SΓ which is a subset of L2(Rd). Then we choose x1, . . . , xm ∈ Rd independentat random according to some probability measure P on Rd (e.g., the properly preconditionedmeasure from Chapter 3) for all f ∈ SΓ and take measurements yi = f(xi) =

∑γ∈Γ cγϕγ . A

frame for such a function space is a collection (ψλ)λ∈Λ such that

A‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2 ≤ B‖f‖22.

The λ-th frame coefficient can be written as

〈f, ψλ〉 =∑γ∈Γ

cγ〈ϕγ , ψλ〉.

Setting Φi,γ = ϕγ(xi) so that y = Φc the corresponding frame matrix reads Ωλ,γ = 〈ϕγ , ψλ〉:

A‖f‖2 ≤ ‖Ωc‖22 ≤ B‖f‖2 and ‖Ωc‖2 =

√∑λ∈Λ

|〈f, ψλ〉|2.

It is not clear a priori whether all the frame functions ψλ we consider belong to SΓ for allλ ∈ Λ. This, however, does not pose a problem as long as the frame inequality

A‖f‖2 ≤∑λ∈λ

|〈f, ψλ〉|2 ≤ B‖f‖2

holds for all f ∈ SΓ. Such a system ψλ : λ ∈ λ is called a Pseudoframe of SΓ and iswell-defined since we assume all of the calculations to take place within L2(Rd). For anintroduction into the theory of Pseudoframes see [66].

This enables us to prove the following theorem.

Theorem 43. Let (φγ)γ∈Γ be a finite collection of functions which are pairwise orthonormalwith respect to the probability measure P and (ψλ)λ∈Λ be a finite Pseudoframe for the linearspan of the system φγγ∈Γ. Let ϑ be a preconditioning function, ωγ := ‖φγϑ‖∞ consist asequence of weights and (xi)i=1,...,m be sampled from Rd according to dP := dx

ϑ(x)2 where dx

62

Page 65: Weighted l1-Analysis minimization and stochastic gradient

denotes the Lebesgue measure on Rd. Let the frame be Ω := (〈ψλ, φγ〉)λ∈Λγ∈Γ

and set N = ]Λ.

Then Φ :=(φγ(xi)ϑ(xi)

)i=1,...,mγ∈Γ

possesses the weighted Ω robust null space property of order

s with constants τ, θ > 0 provided

m ≥ C0C2Ω log(eN) log

(4τ2)

log (CΩ)

with probability exceeding 1 − C1 exp (log(N) log (CΩ)) for some universal constants C0, C1,

where ωmax := maxγ∈Γ ωγ and CΩ = 2(θ+1)τωmax

√sb‖Ω†‖1→1

θ .

As before, this result can be extended to function spaces of infinite dimension, this timeby means of shearlet smoothness spaces Sp,ql (R2). These are function spaces consisting offunctions which can be characterized by the sequences of frame coefficients, i.e., f ∈ Sp,qlif and only if the frame coefficients (〈f, ψλ〉)λ∈Λ belongs to a weighted sequence space, seeSection 6.2.1 for further details.

Theorem 44. For 0 < p < 1 let f ∈ Sp,pl (R2) be compactly supported and l > 1p −

12 .

Let ϑ a preconditioning function as described in Remark 49 and x1, . . . , xm with m as belowbe drawn from R2 according to 1

ϑ2 for a preconditioning function ϑ which yields weightsωγ = ‖ϕ

γϑ‖∞ = c2‖j‖1(l−1) and choose a l′ > l such that 2‖j‖∞(l−l′) ≤ s1/p−1 for γ =

(j, k) ∈ Γ = N20 × Z2. Let y =

(1√mf(xi)ϑ(xi)

)i∈[m]

be the measurements of f .

Let Γ ⊂ Γ be finite such that ωγ ≤ s1−1/p for all γ ∈ Γ. Choose Λ ⊂ Λ to be finite such thatψλλ∈Λ forms a Pseudoframe for the C-span of ϕγγ∈Γ.If

m ≥ C0‖ω‖2∞sB‖Ω†‖21→1 log(e]Λ) log(‖ω‖∞√sB‖Ω†‖1→1)

then ω-BPDN yields a reconstruction f ] of f from y with error

‖f − f ]‖2 . s1−1/p‖f‖Sp,pmaxl+1,l′

with probability exceeding 1−C1 exp(− log(e]Λ) log(‖ω‖∞

√sB‖Ω†‖1→1

)for some universal

constants C0, C1.

We finish this introductory section with a small survey of related work:In [1], Alberti and Santacesaria extend Ω-BP to the infinite-dimensional setting in the contextof PDEs by providing a dual certificate for the `1-minimization. Thus their result providesrecovery guarantees for a single Ω-s-sparse vector instead of the whole ensemble of these.Moreover, they make use of the aforementioned localization factor and introduce the balanc-ing property, which basically ensures that the requirements of [34, Theorem 4.33] are satisfied.In the present work, we consider a setting similar to [49], except that instead of a Gaussianmeasurement matrix we examine a random subsampling of an orthogonal operator.The analysis-sparsity based model and corresponding reconstruction methods were intro-duced systematically in recent work by Nam et al. in [72]. Nevertheless, we note that it

63

Page 66: Weighted l1-Analysis minimization and stochastic gradient

appeared also in earlier works, see e.g., [14]. In particular, the popular method of total vari-ation minimization [21, 73] in image processing is loosely related to analysis based sparsitywith respect to a difference operator which is a discretized version of the ∇-operator. It canbe derived from [23, Lemma 8.1], see also [55, Proposition 7.1] and [73], that the `2-norm ofthe Haar frame coefficients Ωx behave as the TV-norm of the signal x. In this context, theTV-operator of an image x ∈ CN×N is defined as the `1-norm of the directional derivativesof x:

Definition 45. For z ∈ Cd×d we define its directional derivatives as

zxi,j = zi+1,j − zi,j for i ∈ [d− 1], j ∈ [d] and

zyi,j = zi,j+1 − zi,j for i ∈ [d], j ∈ [d− 1].

Then the discrete gradient ∇ of z is defined as ∇z ∈ Cd×d×2 with (∇z)i,j,1 = zxi,j and(∇z)i,j,2 = zyi,j for i, j ∈ [d] where the missing entries are padded with 0s. The total variationsemi-norm then is

‖z‖TV = ‖∇z‖1.

The corresponding frame is the bivariate Haar frame of which there are two versions.

Definition 46. Given d ∈ N, the system of a Haar frames is obtained from the discrete,univariate Haar wavelet basis of C2d

h0 := 2−d2 (1, . . . , 1)t

h0,0 := h := 2−d2 (1, . . . , 1,−1, . . . ,−1)t

hl,k(j) := h(2lj − k)

2l−d

2 if k2d−l ≤ j < k2d−l + 2d−l−1

−2l−d

2 if k2d−l + 2d−l−1 ≤ j < k2d−l + 2d−l

0 else

where the indices 2lj − k of h(2lj − k) are taken modulo d. The tight Haar frame is thenobtained by the union of this basis and its circular shift by 1 index mod 2d, whereas for itsnon-tight counterpart the vectors are shifted by 2d−l−1 mod 2d and the copy of the constantvector is omitted. The multivariate Haar basis and the corresponding tight or non-tightframes are the constructed by appropriate tensorization.

We write H for the matrix with the transposed of the tight Haar frame vectors vectorsas rows so that w = Hx is the series of Haar frame coefficients of a signal x ∈ Cd. ♦

While they are not linked directly, the discrete gradient and the Haar system are relatedas the following proposition shows.

64

Page 67: Weighted l1-Analysis minimization and stochastic gradient

Proposition 47. [55, Proposition 7.1] Let x ∈ Cd×d have mean zero and suppose itsbivariate Haar transform w is ordered such that wk is the entry with the kth largest magnitude.Then there is a universal constant C > 0 such that for all k ≥ 1

|wk| ≤ C‖x‖TV

k.

An estimate of the number of Gaussian measurements for successful recovery via TotalVariation Minimization has recently been obtained in [68].

4.2 Null Space Properties for Bounded Orthonormal Sys-

tems

As mentioned before, instead of the Ω-RIP our approach employs the respective Null SpaceProperties from Definition 7. For each type of NSP we define the cone of vectors in Rd

violating the respective requirement, namely

CNSPs := v ∈ Rd : there is S ⊂ Λ, ]S < s such that ‖ΩSv‖ω,1 ≥ ‖ΩSv‖ω,1

and

CSNSPs := v ∈ Rd : there is S ⊂ Λ, ]S < s such that ‖ΩSv‖2 ≥

θ√s‖ΩSv‖ω,1.

For reasons which will become clear later we do not require a cone designated to the robustNSP. We will omit the superscript NSP or SNSP and write Cs since every result for CSNSP

s

implies the likewise result for CNSPs if we set θ = 1.

Then, the respective NSP is equivalent to ker Φ ∩ Cs = ∅ or equivalently

Sd−1 ∩ ker Φ ∩ Cs = ∅ (4.3)

where Sd−1 denotes the unit sphere in Rd. But since we are working with frames we aregoing to examine C′s := ΩCs ⊂ Rn. We then have for v = Ωy ∈ C′s ∩ ΩSd−1 with y ∈ Sd−1

‖v‖ω,1 = ‖Ωy‖ω,1 = ‖ΩSy‖ω,1 + ‖ΩSy‖ω,1 ≤√s‖ΩSy‖2 +

√s

θ‖ΩSy‖2

=√s

(1 +

1

θ

)‖ΩSy‖2 ≤

√sB

(1 +

1

θ

).

Now set 1ρ :=

(1 + 1

θ

)√sB. Henceforth we have

C′s ∩ ΩSd−1 ⊂ ΩSd−1 ∩ 1

ρBnω,1

65

Page 68: Weighted l1-Analysis minimization and stochastic gradient

where Bnω,1 is the unit ball in the ω-1-Norm in Rn. Then we have the respective NSP if

ΩSd−1 ∩ Ω ker(Φ) ∩ 1

ρBnω,1 = ∅.

or equivalently

ΩρSd−1 ∩ Ω ker(Φ) ∩Bnω,1 = ∅.

which holds true as soon as

infy∈T∗∩ρSd−1

‖Φy‖22 > 0

where T ∗ :=y ∈ Rd : Ωy ∈ Bmω,1

. The reason we did not define a cone accounting for the

violation of the RNSP is that it suffices to employ the same cones as before, i.e., a matrix Φ

possesses the Ω−RNSP with constant τ as soon as

infy∈T∗∩ρSd−1

‖Φy‖22 ≥ρ2

τ2. (4.4)

Here the factor ρ cancels out but we nevertheless keep it for computational purposes.The bound from (4.4) states that for vectors y ∈ Rd such that ‖Φy‖22 ≤

ρ2

τ2 ‖y‖2 we alreadyhave

‖ΩSy‖2 ≤θ√s‖ΩSy‖ω,1 for every S ⊂ [d], ]S ≤ s.

If now y ∈ Rd is such that ‖Φy‖22 ≥ρ2

τ2 ‖y‖2, then

‖ΩSy‖2 ≤ ‖Ωy‖2 ≤ B‖y‖2 ≤ Bτ2

ρ2‖Φy‖22

since Ω is a frame with upper bound B. In summary

‖ΩSy‖2 ≤θ√s‖ΩSy‖ω,1 +B

τ2

ρ2‖Φy‖22.

Instead of assessing the infimum of a stochastic process directly, we analyze the deviation

supy∈T∗∩ρSN−1

∣∣‖Φy‖22 − E‖Φy‖22∣∣ .

To this end we employ [9, Theorem 2.1].

Theorem 48. There exist two universal constants c0, c1 such that the following requirementshold: Let X = (Xn)n∈[N ] be a collection of random variables, where Xn : Ω → C on a

66

Page 69: Weighted l1-Analysis minimization and stochastic gradient

probability space (Ω,P,Σ), such that

‖X‖L∞(Ω) := maxn∈[N ]

‖Xn‖L∞(Ω) <∞.

Let ε, δ > 0 and let ($1, . . . , $m) ∈ Ωm be a sample of size m drawn uniformly at randomaccording to P⊗mΩ such that

m ≥ c01

εδlog(eN) log(1/δ2) log(1/ε2). (4.5)

Then the following holds with probability exceeding 1− c1 exp(log(N) log(δ)):

for all t ∈ CN :

∣∣∣∣∣ 1

m

m∑i=1

|〈X($i), t〉|2 − E|〈X, t〉|2∣∣∣∣∣ ≤ εE|〈X, t〉|2 + δ‖t‖21‖X‖2L∞(Ω).

Note that since ωλ ≥ 1 for all λ ∈ Λ this also entails∣∣∣∣∣ 1

m

m∑i=1

|〈X($i), t〉|2 − E|〈X, t〉|2∣∣∣∣∣ ≤ εE|〈X, t〉|2 + δ‖t‖2ω,1‖X‖2L∞(Ω)

for all t ∈ CN .

Remark 49. Our use of weights in the minimization and reconstruction problem Ω-BPDN forfunction spaces naturally leads to a certain sampling strategy where we sample according to acustom probability measure which is the preconditioned Lebesgue measure in Rd as describedin Section 3.3 in the 1-dimensional case. In order to extend this to the multivariate case, wefix some strictly positive functions ϑi : R→ R, 1 ≤ i ≤ m which fulfill∫

R

1

ϑ2i (x)

dx = 1, 1 ≤ i ≤ m

then 1ϑ2 :=

⊗mi=1

1ϑ2iis a density function. We draw x1, . . . xm ∈ Rd independently accord-

ing to the probability measure defined by the tensorized density 1ϑ2 and precondition our

measurements yi = ϑ(xi)f(xi) =∑γ∈Γ cγ φγ(xi)ϑ(xi). Note that if the family (φγ)γ∈γ is

orthonormal with respect to the Lebesgue measure dx, then

E〈y, y〉 = Em∑i=1

yiyi =

m∑i=1

E[f(xi)ϑ(xi)

f(xi)ϑ(xi)

]=

1

m

m∑i=1

∫Rdf(x)ϑ(x)

f(x)ϑ(x)

dx

ϑ2(x)

=1

m

m∑i=1

〈f, f〉L2(Rd) = 〈f, f〉L2(Rd)

and

‖y‖1 =

m∑i=1

|ϑ(xi)||f(xi)| ≤m∑i=1

∑γ∈Γ

|cγ ||φγ(xi)||ϑ(xi)| ≤ m∑γ∈Γ

|cγ | ‖φγϑ‖∞︸ ︷︷ ︸:=ωγ

= m∥∥∥(cγ)γ∈Γ

∥∥∥ω,1

.

67

Page 70: Weighted l1-Analysis minimization and stochastic gradient

Thus, we obtain a natural connection between preconditioning and `1ω-minimization. Hence-forth, our sampling matrix has entries Φi,γ = φγ(xi)ϑ(xi), 1 ≤ i ≤ m, γ ∈ Γ.

Theorem 50. Let (φγ)γ∈Γ be a finite collection of pairwise orthonormal functions, (ψλ)λ∈Λ

be a Pseudoframe for the span of the system φγγ∈Γ and set N := ]Λ. Let ϑ be a precon-ditioning function, ωγ := ‖φγϑ‖∞ consist a sequence of weights and (xi)i=1,...,m be sampledindependently from Rd according to dP := dx

ϑ(x)2 where dx denotes the Lebesgue measure on

Rd. Then Φ :=(φγ(xi)ϑ(xi)

)i=1,...,mγ∈Γ

possesses the weighted Ω robust null space property of

order s with constants τ, ϑ > 0 for the frame matrix

Ω := (〈ψλ, φγ〉)λ∈Λγ∈Γ

provided

m ≥ C064(θ + 1)2τ4ω2

maxsB‖Ω†‖21→1

θ2log(eN) log (2τ) log

(2(θ + 1)τωmax

√sB‖Ω†‖1→1

θ

)

with probability exceeding

1− C1 exp

(2 log(eN) log

(2(θ + 1)τωmax

√sB‖Ω†‖1→1

θ

))

for some universal constants C0, C1, where ωmax := maxγ∈Γ ωγ .

Proof. Let x1, . . . , xm ∈ RN be drawn i.i.d. from Rd according to the density function 1ϑ2

and Ω consist a Pseudoframe for Cd. As before we define T ∗ :=v ∈ Cd : Ωv ∈ BNω,1

and

we set Xi =(φγ(xi)ϑ(xi)

)γ∈Γ

. It suffices to show that

supy∈ρSN−1∩T∗

∣∣‖Φy‖22 − E‖Φy‖22∣∣ ≤ ρ2

2τ2.

We have

1

mE‖Φy‖22 = E|〈Xi, y〉|2 =

∫Rd

∑γ∈Γ

yγ φγ(x)ϑ(x)

2

1

ϑ2(x)dx =

∫Rd

F∑γ∈Γ

yγφγ

(x)

2

dx

=

∫Rd

∑γ∈Γ

yγφγ(x)

2

dx = ‖y‖2

then the claim follows as once we show

supy∈ρSN−1∩T∗

∣∣∣∣ 1

m‖Φy‖22 − ρ2

∣∣∣∣ ≤ ρ2

2τ2.

68

Page 71: Weighted l1-Analysis minimization and stochastic gradient

We employ Theorem 48, which yields the bound

supy∈ρSN−1∩T∗

∣∣∣∣ 1

m‖Φy‖22 − ρ2

∣∣∣∣ ≤ ερ2 + δ‖y‖21‖X‖2∞ ≤ ερ2 + δ‖Ω†‖21→1‖Ωy‖2ω,1‖X‖2∞

where ‖X‖∞ = maxγ∈Γ‖φγϑ‖∞ = max

γ∈Γωγ =: ωmax. Therefore, our claim holds if we choose

ε =1

4τ2, and δ =

1

4

ωmaxτ‖Ω†‖1→1

)2

.

According to the number of measurements demanded by (4.5), our claim holds as soon as

m ≥ C016τ4ω2

max

ρ2log(eN) log

(4τ2)

log

(4τ2ω2

max

ρ2

)

with probability exceeding 1− C1 exp(

log(eN) log(

4τ2ω2max

ρ2

)).

Recalling that ρ = θ(θ+1)

√sB

reformulates these inequalities into

m ≥ C064(θ + 1)2τ4ω2

maxsb‖Ω†‖21→1

θ2log(eN) log (2τ) log

(2(θ + 1)τωmax

√sb‖Ω†‖1→1

θ

)and

P ≥ 1− C1 exp

(2 log(eN) log

(2(θ + 1)τωmax

√sb‖Ω†‖1→1

θ

)).

The operator norm ‖Ω†‖1→1 is the maximum 1√N

maxγ∈Γ ‖Ω†φγ‖1 since φγ : γ ∈ Γ waschosen as the discrete Fourier basis. Accordingly, this serves as a proxy for the incoherencebetween the frame Ω which is employed in `1 instead of `2 and the Fourier basis, in the sameway [53] uses the local coherence, see [53, Theorem 1.2 and equation (1.2) there] to measurethe interaction between the frame Ω and the basis.Originally, there was another proof for a similar result for the weightless case, based onmethods presented in [19, Chapter 5] in which the incoherence between Ω† and the sensingmatrix Φ presented itself in a more direct way.

Theorem 51. Let Φ ∈ Rm×N with m ≤ N be a sensing matrix with its rows drawn at randomfrom an orthonormal basis φ1, . . . , φN of RN . Furthermore, let Ω ∈ Rp×N with p > N be aframe with frame bounds 0 < A ≤ B and

‖Ω†Φ‖∞ ≤K√N, (4.6)

where Φ† denotes a dual frame. If

m

ln3(m)&

sB

(θ(1− δ))2ln(p)

69

Page 72: Weighted l1-Analysis minimization and stochastic gradient

then with probability exceeding 1−C exp(−c mδθ

2

K2sB

), for some constants C, c > 0, the matrix

Φ possess the stable NSP of order s for the frame Ω and the respective robust NSP withτ =

√Nmδ .

In the case of Ω being a Parseval-frame, meaning that both frame bound equal 1, theassumption (4.6) can be replaced by the usual incoherence assumption

maxj∈[p]

maxi∈[m]

|〈φi, ψj〉| ≤K√N.

This is even closer to [53, Theorem 1.2] except for the fact that the latter result was onlyformulated for the case that Ω is Parseval. We omit the lengthy proof of Theorem 51 herefor the sake of reading convenience since it does not provide new insights.

4.3 Shearlet-Wavelet `1-Analysis Minimization

Now we extend Theorem 50 to the infinite dimensional realm, i.e., to suitable subspaces ofL2(R2). To this end we employ the smoothness space Sp,qβ (R2) defined in Section 6.2 inDefinition 118 and fix a wavelet ONB (ϕ

γ)γ∈Γ of L2(R2) and the system of shearlets (ψλ)λ∈Λ

as described in Definition 52. It is well-known that the smoothness of a function f can beassessed via its wavelet coefficients, a feature shearlets also possess. Moreover, both systemscan be employed to approximate functions in several function spaces, see [43, Chapters 8for general theory and Chapter 9 for approximation in Besov spaces] for the wavelet caseand [39, Section 5] for the shearlet case. One noticeable difference between wavelets andshearlets is that the former consist a basis of L2(R2), or L2(Rd) for d ∈ N given a suitableconstruction of the wavelets, while the latter form a frame. Moreover, shearlets featureanisotropic scaling and are designed especially with regard to the two-dimensional case offunction approximation in L2(R2) which is why we restrict our further discussion to thisspace.

The shearlet Smoothness Spaces Sp,qβ (R2) serve as the shearlet counterpart of the Besovspaces Bp,ql (R2) which can be characterized by wavelets in the sense that f ∈ Bp,ql (R2) if andonly if the sequence of wavelet coefficients of f belongs to a certain sequence space. The sameis true for the shearlet smoothness spaces, that is f ∈ Sp,qβ (R2) if and only if the shearletcoefficients of f belong to a accustomed weighted sequence space, see Section 6.2 for moredetails. To state the different families of functions for future reference, we give the followingsummarizing definition:

Definition 52. In this work, we employ an ONB and a frame of L2(R2) which are definedas follows:

•ϕj,k : j ∈ N0, k ∈ Z2

is a dyadic wavelet system of L2(R) where ϕ0,k(x) = ϕF (x−k)

are translations of the father wavelet ϕF and ϕj,k(x) = 2j/2ϕM (2jx−k) are translationsand dilations of the mother wavelet ϕM . It has been shown repeatedly that the system

70

Page 73: Weighted l1-Analysis minimization and stochastic gradient

ϕj,k : j ∈ N0, k ∈ Z consists an ONB of L2(R) and via tensorization we obtain awavelet-ONB of L2(R2) where we set

ϕj,k

(x) = ϕj1,k1(x1)ϕj2,k2

(x2) for x ∈ R2, k ∈ Z2 and j ∈ N20. (4.7)

We abbreviate Γ := N20 × Z2 and ϕ

γ= ϕ

j,kfor γ = (j, k) ∈ Γ.

• We use a shearlet system for L2(R2) where we set the basic dilation matrix to be

A = A1 :=

(2 0

0√

2

)and A2 :=

(√2 0

0 2

), compare [63, eqn. 3.13]. The shearing

matrices are B = B1 =

(1 1

0 1

)and B2 = RB1 where R =

(0 1

1 0

). This matrix is

used to switch between the coordinate axis for the cone-adapted shearlets. Then theshearlets functions are defined as

ψa,s,t(x) = 232aψ (BsAax− t) and ψha,s,t(x) = 2

32aψ (RBsAax− t)

Any proof related to the shearlet system will only discuss the case for the horizontalcone C :=

x ∈ R2 :

∣∣∣x2

x1

∣∣∣ ≤ 1

since the proof for the case of the vertical cone willnot reveal new insights as it is the same with regard to the techniques employed. Thewhole shearlet system consists of

– coarse-scale-shearlets ψ−1,t : t ∈ Z2 where ψ−1,t(x) = W (x − t) are the trans-lations of a window function with supp(W ) ⊂

[− 1

8 ,18

]×[− 1

16 ,116

],

– interior shearlets ψha,s,t : h ∈ 1, 2, a ∈ N0, s ∈ Z, |s| ≤ 2a, t ∈ Z2.

Some authors additionally employ a system of boundary shearlets ψa,s,t : a ≥ 0, l =

±2a, t ∈ Z2 which covers the region where the two interior shearlet systems meet. Thisadditional system allows for the construction of a Parseval frame of shearlets which weare not particularly concerned with.

The whole set of indices for the shearlet system is denoted by

Λ =

(a, s, t, h), (a, s, t), (−1, t) : a ∈ N0, s ∈ Z, |s| ≤ 2a, t ∈ Z2, h ∈ 1, 2.

The system ψλ : λ ∈ Λ consists a frame for L2(R2), see [63].

• With the notation as described above, we have for the Besov space Bp,pl , so for p = q,the case we are interested in, the identity

‖f‖Bp,pl

∑j∈N2

0

∑k∈Z2

2‖j‖1p(l−1)|〈f, ϕj,k〉|p1/p

.

To avoid confusion, we will refer to the Besov spaces as Bp,ql (R2) from now on, whetherthey are characterized by wavelets or other types of decomposition.

71

Page 74: Weighted l1-Analysis minimization and stochastic gradient

There is another way of wavelet basis for L2(Rd) for a general d ∈ N as described in [63]where the authors employ a different construction that only needs a single dilationparameter. Since we not going to use this setting, we will not divulge into furtherdetails on it, only shortly state the characterization results here: In this setting, thesmoothness space Sp,qβ (R2) with 1 ≤ p, q ≤ ∞ associated to the wavelet decompositioncan be expressed as the set of those functions f ∈ L2(R2) for which

‖f‖Sp,qβ :=

∑j∈Z

2jq(( βδ + 12−

1p )

(∑k∈Z2

|〈f, ϕj,k〉|p)q/p1/q

(4.8)

is finite, see [63, p. 7]. The quantity δ > 0 is to chose such that the weight ωj =

ωj,k ≈ |supp(ϕ)| 1δ . These spaces are the Besov spaces Bp,qβ/δ(R2). Then this quantity

is equivalent to

‖f‖Bp,ql :=

∑j∈N2

0

2‖j‖1(q+lq−2p)

(∑k∈Z2

|〈f, ϕj,k〉|p)q/p1/q

as can be seen in Theorem 106.

• By [63, Section 4.5], the shearlet smoothness space are (quasi-)normed spaces with0 ≤ p, q ≤ ∞ characterized by the expression

‖f‖Sp,qβ :=

∑h=1,2

∑a∈_0,|s|≤2a

2aq(β+3(1/2−1/p))

(∑t∈Z2

|〈f, ψ(h)a,s,t〉|p

)q/p1/p

or for p = q

‖f‖Sp,pβ :=

(∑λ∈Λ

2ap(β+3(1/2−1/p))|〈f, ψ(h)a,s,t〉|p

)1/p

where λ = (a, s, t, h).

The most important fact of this far too short introduction is from [63, Theorem 3.1]:

Theorem 53. The system of shearlets from Definition 52 is a Parseval frame for L2(R2). Inaddition, the elements of this system are C∞ and compactly supported in the Fourier domain.

We define an approximation space A := spanϕγ

: γ ∈ Γ ⊂ Sp,qβ (R2) for some finite

Γ ⊂ Γ which will be concretized later. Then we express f ∈ Sp,qβ (R2) as

f =∑γ∈Γ

cγϕγ +∑γ /∈Γ

cγϕγ

72

Page 75: Weighted l1-Analysis minimization and stochastic gradient

where the second summand is treated as an approximation error. Note that f ∈ Sp,qβ (R2)

is equivalent to (cγ)γ∈Γ belonging to a weighted sequence space as described in (4.8) or inSection 6.2.1.

Technically speaking, the shearlets may not belong to A which is why they do not consista frame for that space. Since the wavelets belong to A which itself is a subspace of theshearlet smoothness space Sp,qβ (R2), the system of shearlets indexed by Λ ⊂ Λ which obeysthe criterion from the following lemma form a so-called Pseudoframe for A, e.g., there existpositive constants A,B such that for all f ∈ A

A‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2 ≤ B‖f‖22

holds, see [66] for an introduction to Pseudoframes. As far as this work is concerned, onlythe inequality above is of importance.

Lemma 54. Suppose (ψλ)λ∈Λ is a frame for L2(Rd) with frame bounds A,B > 0. Then(ψλ)λ∈Λ is a Pseudoframe for A as soon as∑

γ∈Γ

∑λ/∈Λ

|〈ϕγ, ψλ〉|2 < A

with frame bounds A−∑λ/∈Λ

∑γ∈Γ

|〈ψλ, ϕγ〉|2

and B.

Proof. Let f ∈ A. Since (ψλ)λ∈Λ is a frame for all of L2, we have

A‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2 =∑λ∈Λ

|〈f, ψλ〉|2 +∑λ/∈Λ

|〈f, ψλ〉|2.

We expand the second summand and obtain

∑λ/∈Λ

|〈f, ψλ〉|2 =∑λ/∈Λ

∣∣∣∣∣∣∑γ∈Γ

〈f, ϕγ〉〈ϕ

γ, ψλ〉

∣∣∣∣∣∣2

≤∑λ/∈Λ

∑γ∈Γ

|〈f, ϕγ〉|2

︸ ︷︷ ︸=‖f‖22

∑γ∈Γ

|〈ϕγ, ψλ〉|2

via the Cauchy-Schwarz inequality. Rearranging yieldsA−∑λ/∈Λ

∑γ∈Γ

|〈ψλ, ϕγ〉|2

‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2.

73

Page 76: Weighted l1-Analysis minimization and stochastic gradient

Lastly, ∑λ∈Λ

|〈f, ψλ〉|2 ≤∑λ∈Λ

|〈f, ψλ〉|2 ≤ B‖f‖22

by virtue of the frame property of (ψλ)λ∈Λ.

Since Γ is finite, we may as well require∑λ/∈Λ |〈ψγ , ϕλ〉|

2 to be small enough uniformlyover Γ, i.e., if

∑λ/∈Λ |〈ψγ , ϕλ〉|

2 ≤ ε for all γ ∈ Γ we have∑λ/∈Λ

∑γ∈Γ |〈ψλ, ϕγ〉|

2 ≤ ]Γε.The requiring ε < A

]Γfor an ε small enough allows for the system (ψλ)λ∈Λ to consist a

Pseudoframe for A. Moreover, since∑λ/∈Λ

|〈ψλ, ϕγ〉|2 ≤

∑λ∈Λ

|〈ψλ, ϕγ〉|2 ≤ B‖ϕ

γ‖22 = B‖ϕ‖22

the quantity∑γ∈Γ

∑λ/∈Λ |〈ϕγ , ψλ〉|

2 is always finite.In light of Lemma 54 it seems well advised to examine decay properties of (|〈ϕ

λ, ψγ〉|)λ∈Λ,γ∈Γ

in further detail.

Lemma 55. For a family of wavelets (ϕj,k

)j∈N20,k∈Z2 ⊂ CN (R2) and shearlets ψλ : λ ∈ Λ ⊂

CN (R2) as given by Definition 52 we have that

〈ϕj,k, ψλ〉 = O

(2‖j‖1−

14a+mina,j1,j2−2 minj1,j2

((1 + 2mina,j1,j2

∣∣|D−jk| − |A−aS−st|∣∣))−N)(4.9)

where λ = (a, s, t, h) ∈ Λ.

Note that the same result for the window function W follows from setting a, s = 0 in theestimate above. For the proof, we need one technical lemma in the more general setting ofRn from [35].

Lemma 56. Let a, b > 0, M,N ≥ max(n,min(a, b)) and t,m ∈ Rn. Then we have∫Rn

an

(1 + a|x− t|)Nbn

(1 + b|x−m|)Mdx ≤ C0

min(a, b)

(1 + min(a, b)|t−m|)min(M,N)

In [35] a slightly different version is proven and the proof of this lemma is only contained inthe one-volume-version of the book (i.e., "Classical and modern Fourier Analysis") as opposedto the split version (i.e., "Classical Fourier Analysis" and “Modern Fourier Analysis").

Proof of Lemma 55. For this proof we take ψ to be the shearlet adapted for the horizontalcone since the proof is literally the same for the shearlet for the vertical cone and the band-limited window function W , since we assume that our wavelet ϕ and the shearlet ψ both

74

Page 77: Weighted l1-Analysis minimization and stochastic gradient

decay like 11+‖xα‖ , x→∞ for large values of α > 1. Setting Dj :=

(2j1 0

0 2j2

)we have

∣∣∣〈ϕj,k, ψa,s,t〉

∣∣∣ ≤ 2‖j‖1+ 32a

∫R2

|φ(Djx− k)ψ (BsAax− t)| dx

≤ C2‖j‖1+ 34a

∫R2

(1 + |Djx− k|)−N (1 + |BsAax− t|)−N dx

≤ C2‖j‖1+ 34a

∫R2

(1 + 2minj1,j2|x−D−jk|)−N (1 + 2a/2|x−A−aS−st|)−N dx

≤ C2‖j‖1+ 34a

2mina,j1,j22−2 minj1,j22−a(1 + 2mina,j1,j2|2−minj1,j2k −A−aS−st|

)Nwhere we employed Lemma 56 in the last line. Moreover, we used the inequality σmin(A)|x| ≤|Ax| in the third line, where σmin(A) denotes the smallest eigenvalue of a matrix A. Invokingthe triangle inequality we obtain∣∣∣〈ϕ

j,k, ψa,s,t〉

∣∣∣ ≤ C2‖j‖1−14a+mina,j1,j2−2 minj1,j2

((1 + 2mina,j1,j2|D−jk −A−aS−st|

))−N≤ C2‖j‖1−

14a+mina,j1,j2−2 minj1,j2

((1 + 2mina,j1,j2

∣∣|D−jk| − |A−aS−st|∣∣))−N .The constant C only depends on N and the volume of the 2-dimensional unit sphere.

Lemma 55 basically is a comparison of the two parameterizations, in which sense it re-sembles the α-scaled index distance from [36, Definition 2.16], see Definition 126 in Section6.2.

There is another approach yielding a result much more tailored to our need and moreoverdetermining the exact constants that show up in the estimates:

Lemma 57. Let Λ ⊂ Λ and Γ ⊂ Γ both be finite and set A := spanϕγ

: γ ∈ Γ. Weabbreviate (a, s, t, h) = λ ∈ Λ and (j, k) = γ ∈ Γ. Then (ψλ)λ∈Λ consists a frame for A if23+jmax(2el−4)−2(ec−el)(2amax−1) + 22+2(el−1)jmax−(4el−2)(2amax−1) ≤ A

4 where

• jmax is the maximal wavelet dilation involved, i.e., there is no((

jj′

), k)∈ Γ with j >

jmax or j′ > jmax,

• amax is the maximal shearlet dilation, i.e., there is no (a, s, t, h) ∈ Λ where a > amax,

• el is the smoothness parameter of the wavelet functions ϕM , ϕF , i.e., ϕM , ϕF ∈ Cec(R2)

and

• ec is the number of vanishing moments of ϕM , ϕF , i.e.,∫ϕMx

m dx = 0 =

∫ϕFx

m dx for m = 1, . . . , ec

75

Page 78: Weighted l1-Analysis minimization and stochastic gradient

which is equivalent to the fact that the first M derivatives of ϕM and ϕF are 0 at theorigin.

If some statement is true for both ϕM and ϕF we just to state it for ϕ

Proof. First of all we notice that the construction of the shearlets that we employ, as isstated in [63], leads to the union of the Fourier domain support of all shearlets involved tobe contained in a ‖ · ‖∞-ball which scales like the maximal shearlet dilation involved that isamax := maxa ≥ 0 : (a, s, t, h) ∈ Λ:⋃

λ∈Λ

supp(ψλ) ⊂ [−22amax−1, 22amax−1]2 := [−R,R]2.

Note that the frame functions have compact support in Fourier domain, as described by(6.26) due to the construction of the cone adapted shearlet system as laid out in Section6.2.1. For abbreviation we set PRf := χR2\[−R,R]2f the restriction of a function to valueslarger than R. Now let f ∈ A and we estimate

∑λ/∈Λ |〈f, ψλ〉|

2 according to Lemma 54:∑λ/∈Λ

|〈f, ψλ〉|2 =∑λ/∈Λ

|〈f , ψλ〉|2 =∑λ/∈Λ

|〈PRf , ψλ〉|2 ≤ ‖PRf‖22.

Our goal is to establish that ‖PRf‖22 ≤ ε‖f‖22 for all f ∈ A since this implies

A‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2 +∑λ/∈Λ

|〈f, ψλ〉|2 ≤∑λ∈Λ

|〈f, ψλ〉|2 + ε‖f‖22

which yields

(A− ε)‖f‖22 ≤∑λ∈Λ

|〈f, ψλ〉|2

which in turn is the desired lower frame bound.

We will employ two estimates for the Fourier transform of our wavelets which, as in ourcase, have fast polynomial decay towards ∞ and 0:Firstly |ϕ(x)| ≤ C|x|−el , x → ∞ for the smoothness parameter el ∈ N, el > 2 for largevalues of x, where C can be calculated by standard Fourier analysis taking into account thederivatives of ϕ.Secondly we have ec vanishing moments which gives |ϕ(x)| ≤ C|x|ec , x→ 0 for ec ∈ N, ec ≥ 2

where the constant C can be estimated by Taylor expansion.We express ‖PRf‖22 in wavelets and estimate

‖PRf‖22 =

∥∥∥∥∥∥∑γ∈Γ

cγPRϕγ

∥∥∥∥∥∥2

2

=∑γ,γ′∈Γ

|cγcγ′ ||〈PRϕγ , PRϕγ′〉| ≤∑γ,γ′∈Γ

|cγcγ′ |‖PRϕγ‖2‖PRϕγ′‖2.

76

Page 79: Weighted l1-Analysis minimization and stochastic gradient

Now, we have for γ = (j, k) ∈ Γ

‖PRϕγ‖22 ≤ 2−2(j1+j2)

∫‖x‖∞≥R

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx

= 2−2(j1+j2)+2

∫‖x‖∞≥R,xi≥0,i=1,2

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx

= 2−2(j1+j2)+2

[∫ ∞R

∫ R

0

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2

+

∫ R

0

∫ ∞R

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2

+

∫ ∞R

∫ ∞R

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2

]

The first summand can be estimated to∫ R

0

∫ ∞R

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2 ≤ C22(j2el−j1ec)

(∫ R

0

x2ec dx ·∫ ∞R

x−2el dx

)

= C22(j2el−j1ec) R2(ec−el)

(2ec + 1)(2e2 − 1)

and the second similarly∫ ∞R

∫ R

0

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2 ≤ C22(j1el−j2ec) R2(ec−el)

(2ec + 1)(2e2 − 1).

The third summand is bounded by∫ ∞R

∫ ∞R

|ϕ(2−j1x1)|2|ϕ(2−j2x2)|2 dx1 dx2 ≤ C22el(j1+j2)

(∫ ∞R

|x|−2el dx)2

= C22el(j1+j2)

(2el − 1)2R4el−2.

In summary

‖PRϕγ‖22 ≤ C2−2(j1+j2)+2C

(22(j2el−j1ec) + 22(j1el−j2ec)

) R2(ec−el)

(2ec + 1)(2e2 − 1)

+ 2−2(j1+j2)+2C22el(j1+j2)

(2el − 1)2R4el−2

≤ C R2(ec−el)

(2ec + 1)(2e2 − 1)

(2−2(j1+j2)+2+2(j2el−j1ec) + 2−2(j1+j2)+2+2(j1el−j2ec)

)+

22+2(el−1)(j1+j2)

(2el − 1)2R4el−2

In order to consist a Pseudoframe for A the ψλ : λ ∈ Λ, the Fourier support of the ψλ

77

Page 80: Weighted l1-Analysis minimization and stochastic gradient

concerned has to cover the essential part of the wavelets’ Fourier support. Since ϕ is smooth,its Fourier transform decays fast towards ∞ and since it possesses ec vanishing moments wehave polynomial decay towards 0. Thus, the contributing part of the ϕ

γ’s support resembles

a donut around the origin which is is spread out over Fourier plane with increasing scale,at least for those tensorized ϕ

γwhere the Father wavelet is not concerned. However, these

also have compact support in time domain.. Therefore, in order to cover the area in questionwith a cube [−R,R]2 in Fourier domain properly, it is necessary to invoke enough dilationsto ensure 2j ≥ R for a sufficient number of γ = (j, k) ∈ Γ.

Remember that R = 22amax−1 and therefore

‖PRϕγ‖22 .

(2−2(j1+j2)+2+2(j2el−j1ec)−2(ec−el)(2amax−1) + 2−2(j1+j2)+2+2(j1el−j2ec)−2(ec−el)(2amax−1)

)+ 22+2(el−1)(j1+j2)−(4el−2)(2amax−1)

We take the maximal dilation parameter in Γ, jmax as defined in the theorem, and estimate

‖PRϕγ‖22 . 23+jmax(2el−4)−2(ec−el)(2amax−1) + 22+2(el−1)jmax−(4el−2)(2amax−1).

If we now choose our wavelets in such a way that ec, el are such that ec−el > 0 and 4el−2 > 0,the right-hand side of the last estimate is monotonically decreasing in amax so that choosingϕ such that ec − el > 0 and 4el − 2 > 0 are comparably large and setting amax large enoughas well, we can obtain an inequality of the form

‖PRϕγ‖22 . 23+jmax(2el−4)−2(ec−el)(2amax−1) + 22+2(el−1)jmax−(4el−2)(2amax−1) ≤ ε.

Hence

‖PRf‖22 ≤ C∑

(j,k),(j′,k′)∈Γ

|cj,kcj′,k′ |[23+jmax(2el−4)−2(ec−el)(2amax−1)

+ 22+2(el−1)jmax−(4el−2)(2amax−1)]

≤ C[23+jmax(2el−4)−2(ec−el)(2amax−1) + 22+2(el−1)jmax−(4el−2)(2amax−1)

]‖f‖22

If now we choose amax large enough, taking into account the constant c and maximal waveletdilation jmax involved

C23+jmax(2el−4)−2(ec−el)(2amax−1) + 22+2(el−1)jmax−(4el−2)(2amax−1) ≤ A

4

which gives 3A4 ‖f‖

22 ≤

∑λ∈Λ |〈f, ψλ〉|

2 which is the desired inequality.

Note that this proof relies on the structure of the Fourier support of the shearlets ψλ.Also, result may improve but not suffer if all shearlets with dilations smaller or equal to amax

are taken to form Λ.

78

Page 81: Weighted l1-Analysis minimization and stochastic gradient

Remark 58. Until now we only have shown a condition on the respective scales of the waveletsand shearlets. Yet in order to actually construct a frame of shearlets for a finite-dimensionalapproximation space spanned by wavelets we also need to concern ourselves with the trans-lations. From the definition of the shearlet functions we recall that the shearing parameterobeys |s| ≤ 2a which is why it is bounded as well by our bound on the scales.Since we are going to prove our main result for a compactly supported f ∈ Sp,pl , i.e.,supp(f) ⊂[−K,K]2 we only need a finite amount of translations of a compactly supported wavelets ateach scale, i.e.,

f =∑j≥0

∑k∈Z2

‖k‖≤L

〈f, ϕj,k〉ϕj,k

for some appropriate L > 0. Now Theorem 57 gives Sp,pl → Bp,pl−1/p+1. Therefore we knowthat the wavelet coefficients (〈f, ϕ

j,k〉)(j,k)∈Γ belong to a weighted sequence space `pω with

weights ω′γ = 2‖j‖1(l−1) p2−p , γ = (j, k) ∈ Γ and thus decay fast and accordingly we can restrict

ourselves to an index set of the form

Γ :=γ = (j, k) ∈ Γ : ωγ ≤ s1−1/p, χ[−K,K]2ϕγ 6= 0

,

which is finite once l > 1. It remains to show that a finite number of shearlet translationssuffices at each scale so that the number of shearlet indices involved actually stays finite.As we already outlined, the shearlets have polynomial decay towards ∞, i.e., |ψa,s,t|(x) ≤2

3a4 (1 + ‖BsAax− t‖)−sD , x→∞ for some sD ∈ N, sD > 1. We estimate∑λ/∈Λ

|〈f, ψλ〉|2 =∑λ/∈Λ

|〈χ[−K,K]2f, ψλ〉|2 =∑λ/∈Λ

|〈f, χ[−K,K]2ψλ〉|2 ≤∑λ/∈Λ

‖f‖22‖χ[−K,K]2ψλ‖22

and now it remains to estimate

‖χ[−K,K]2ψa,s,t‖22 = 23a2

∫ K

−K

∫ K

−K|ψ (BsAax− t)| dx

≤ 23a4

∫ K

−K

∫ K

−K(1 + ‖(BsAax− t)‖1)

−sd dx

≤ 23a4

∫ K

−K

∫ K

−K(1 + |‖(BsAax)‖1 − ‖t‖1|)

−sd dx

Now we have

2a|s− 1|‖x‖1 = min22a, 2a|s− 1|‖x‖1 = min‖z‖1=1

‖BsAaz‖1‖x‖1 ≤ ‖AaBsx‖1 ≤ ‖AaBs‖1‖x‖1

= 22a‖x‖1

79

Page 82: Weighted l1-Analysis minimization and stochastic gradient

which is why

2a+1|s− 1|K ≤ ‖BsAax‖1 ≤ 22a+1K for all x ∈ [−K,K]2.

Accordingly,

‖χ[−K,K]2ψa,s,t‖22 ≤ 23a4 +2K2

(1 + ‖t‖1 − 22a+1K

)−sD.

Now we assume that we need to take into account far more translations of the shearlet thanthe of the wavelet hence it is reasonable to assume that these translations t ∈ Z2 have a normexceeding any other constant in this context. Therefore,

∑λ/∈Λ

|〈f, ψλ〉|2 ≤ ‖f‖22amax∑a=0

2a∑s=−2a

∑t∈Z2

‖t‖1≥tmin

23a4 +2K2

(1 + ‖t‖1 − 22a+1K

)−sD≤ ‖f‖22K2

amax∑a=0

25a4 +3

∑t∈Z2

‖t‖1≥tmin

(1 + ‖t‖1 − 22a+1K

)−sD= ‖f‖22K2

amax∑a=0

25a4 +3

∞∑p=tmin

∑‖t‖1=p

(1 + ‖t‖1 − 22a+1K

)−sD= ‖f‖22K2

amax∑a=0

25a4 +3

∞∑p=tmin

p(1 + p− 22a+1K

)−sD≤ ‖f‖22K2amax2

5amax4 +3

∞∑p=tmin

p(1 + p− 22amax+1K

)−sDwhere we used that

]t ∈ Z2 : ‖t‖1 = p = ](t1, t2) ∈ Z2 : t1 = −p, . . . , p, t2 = ±(p− |t1|) = 2(2p− 1) + 2 = 4p.

Since the series∑∞n=0

1nr converges for r > 1, this expression is not only finite but moreover

the remainder sum∑∞p=tmin

p(1 + p− 22amax+1K

)−sD gets arbitrarily small once sD ≥ 2 ifwe set tmin appropriately. Now we invoke Lemma 57 where we already have seen that underthe assumptions of that Lemma, we have that

‖PRf‖22 ≤A

4‖f‖22.

In summary for such a tmin and given that the conditions of Lemma 57 are satisfied,∑λ/∈Λ

|〈f, ψλ〉|2 ≤∑λ/∈Λ

‖PRf‖22‖χ[−K,K]2ψλ‖22

≤ A

4‖f‖2

∑λ/∈Λ

‖χ[−K,K]2ψλ‖22

80

Page 83: Weighted l1-Analysis minimization and stochastic gradient

≤ A

4‖f‖22K2amax2

5amax4 +3

∞∑p=tmin

p(1 + p− 22amax+1K

)−sDwe now set tmin accordingly so that

K2amax25amax

4 +3∞∑

p=tmin

p(1 + p− 22amax+1K

)−sD ≤ 2

whereby we obtain the inequality

∑λ/∈Λ

|〈f, ψλ〉|2 ≤A

2‖f‖22

and consequently

A

2‖f‖22 ≤

∑λ∈Λ

|〈f, ψλ〉|2

which is the desired inequality. As mentioned repeatedly, the (ψλ)λ∈Λ do not necessarily allbelong to A := spanϕ

γ: γ ∈ Γ which is why they technically do not consist a frame for A

but rather a Pseudoframe, see once again [66] for more details.

Now that we finished the proofs of the preparatory results, we can combine those intoa summarizing theorem. Let ϕ

γ, ψλ be as given by Definition 52, i.e., a dyadic wavelet ϕ

γ

and a shearlet ψλ with γ ∈ Γ and λ ∈ Λ. Let f ∈ Sp,pl (R2) be compactly supported so thatit can be approximated well by a finite linear combination of shearlets. Since we have theembedding Sp,pl (R2) → Bp,pl+1/p(R

2) from (6.27), we know that f ∈ Bp,pl+1/p(R2) as well and

therefore can be approximated well by wavelets ϕγ. Moreover, if p ≤ 1 the coefficient series(

〈f, ϕγ〉)γ∈Γ

are a compressible sequence, i.e., are dominated by a few significant coefficients.

Accordingly, we assume that f ∈ A := spanϕγ

: γ ∈ Γ for a finite Γ ⊂ Γ.

Now Lemma 57 and Remark 58 yield a subset Λ ⊂ Λ such that (ψλ)λ∈Λ is a Pseudoframeof A. If we express f =

∑γ∈Γ cγϕγ we want to reconstruct f from a finite amount of

Fourier measurements f(xi) =∑λ∈Λ cγϕγ , i = 1, . . . ,m where the xi are sampled according

to a preconditioned probability measure dP = dxϑ2(x) as described in Chapter 3. Here dx

denotes the Lebesgue measure on R2. The resulting linear equation system is y = Φc whereyi = f(xi)ϑ(xi) and Φi,γ = ϑ(xi)ϕγ(xi). Since (ψλ)λ∈Λ is a frame of A,

〈f, ψλ〉 =∑γ∈Γ

cγ〈ϕγ , ψλ〉

the the matrix

Ω =(〈ϕγ, ψλ〉

)λ∈Λ,γ∈Γ

(4.10)

81

Page 84: Weighted l1-Analysis minimization and stochastic gradient

consist of a frame C]Γ. Thus, in order to solve Φc = y, we employ ω-BPDN

min ‖Ωx‖ω,1 subject to ‖Φx− y‖2 ≤ η

with some estimate on the approximation error η.The next theorem employs two series of weights:

• (ωγ)γ∈Γ the weights ωγ = 2‖j‖1(l−1) p2−p for (j, k) ∈ Γ associated to the Besov spaces,

i.e., f =∑γ∈Γ cγϕγ ∈ B

p,pl (R2) if and only if ‖c‖ω,p < ∞. These are uses mainly for

the embeddings between shearlet smoothness spaces and Besov spaces.

• (ω′λ)λ∈Λ the weights ω′λ = 2a(l+3(1/2−1/p)) p2−p associated to the shearlet smoothness

spaces i.e., f ∈ Sp,pl (R2) if and only if (〈f, ψλ〉)λ∈Λ ∈ lpω′(Λ). These are the weightsthat are used for Ω-BPDN.

Theorem 59. For 0 < p < 1 let f ∈ Sp,pl (R2) be compactly supported and l > 1. Letϑ a preconditioning function as described in Remark 49 and x1, . . . , xm with m as belowbe drawn from R2 according to 1

ϑ2 for a preconditioning function ϑ which yields weightsωγ = ‖ϕ

γϑ‖∞ = c2‖j‖1(l−1) and choose a l′ > l such that 2‖j‖∞(l−l′) ≤ s1/p−1 for γ =

(j, k) ∈ Γ = N20 × Z2. Let y =

(1√mf(xi)ϑ(xi)

)i∈[m]

be the measurements of f .

Let Γ ⊂ Γ be finite such that ωγ ≤ s1−1/p for all γ ∈ Γ. Choose Λ ⊂ Λ to be finite such thatψλλ∈Λ forms a Pseudoframe for the C-span of ϕγγ∈Γ and assume s ≥ 2

∥∥(ω′λ)λ∈Λ

∥∥2

∞.If

m ≥ C0‖ω‖2∞sB‖Ω†‖21→1 log(e]Γ) log(‖ω‖∞√sB‖Ω†‖1→1)

then ω-BPDN yields a reconstruction f ] of f from y with error

‖f − f ]‖2 . s1−1/p‖f‖Sp,pmaxl+1,l′

with probability exceeding 1−C1 exp(− log(e]Γ) log(‖ω‖∞

√sB‖Ω†‖1→1

)for some universal

constants C0, C1.

Proof. Proposition 120 states that f ∈ Sp,pl (R2) also belongs to the Besov space Bp,pl+1/p−1(R2)

for l > 0, where the latter will serve as a tool for calculations and approximations.Since f is compactly supported, say supp(f) = K ⊂ R2, only a finite number of its waveletcoefficients are non-zero at every scale j ∈ N2

0 since the wavelets have compact support aswell. We set

Γ = γ = (j, k) ∈ Γ : ωγ = 2‖j‖1(l−1) ≤ s1−1/p, χKϕj,k 6= 0

which is finite since l > 1.We express f in wavelets, namely f =

∑γ∈Γ cγϕγ =

∑γ∈Γ cγϕγ+

∑γ /∈Γ cγϕγ . Then Remark

82

Page 85: Weighted l1-Analysis minimization and stochastic gradient

58 gives us a Pseudoframe (ψλ)λ∈Λ for A := spanϕγ

: γ ∈ Γ as soon as we choose theparameters appropriately.In turn, weighted analysis minimization Ω-BPDN takes the form

min ‖ (〈g, ψλ〉)λ∈Λ ‖ω′,1 subject to

∥∥∥∥∥y −(

1√mg(xi)

)i=1,...,m

∥∥∥∥∥ < η, i = 1, . . . ,m (4.11)

with samples yi = f(xi)θ(xi) and the error is η = ‖e‖2 where ei = 1√m

∑γ /∈Γ cγϕγ(xi)ϑ(xi)

contains the error of the finite approximation f :=∑γ∈Γ cγϕγ when compared to the original

f .Since (ψλ)λ∈Λ is a Pseudoframe for A with bounds A

2 and B we have that for any coefficientvector c ∈ R]Γ

A

2‖c‖22 =

A

2

∥∥∥∥∥∥∑γ∈Γ

cγϕγ

∥∥∥∥∥∥2

2

≤∑λ∈Λ

∣∣∣∣∣∣⟨∑γ∈Γ

cγϕγ , ψλ

⟩∣∣∣∣∣∣2

=∑λ∈Λ

∣∣∣∣∣∣∑γ∈Γ

⟨ϕγ, ψλ

⟩∣∣∣∣∣∣2

︸ ︷︷ ︸=‖Ωc‖22

≤ B

∥∥∥∥∥∥∑γ∈Γ

cγϕγ

∥∥∥∥∥∥2

2

= B‖c‖22.

Thus, the rows of the matrix Ω =(〈ϕγ, ψλ〉

)λ∈Λ,γ∈Γ

form a frame of R]Λ.

The condition on the number of measurements ensures via Theorem 50 that the samplingmatrix Φ =

(1√mϕγ(xi)ϑ(xi)

)i∈[m],γ∈Γ

possesses the weighted Ω robust Null space property

of order s with high probability which ensures robust reconstruction via Remark 8. Thus,with high probability, (4.11) yields a function f# =

∑γ∈Γ c

#ϕ#γ∈ Sp,pl (R2) which obeys

‖f − f#‖2 ≤C√sσs(f)ω′,1 +Dη ≤ C√

sσ3s(f)ω′,1 +Dη

where σs(g)1 is the quasi-best weighted s-term approximation of a function g ∈ L2(R2)

measured in frame coefficients as defined in Definition 2: For frame coefficients (〈g, ψλ〉)λ∈Λ

and weights (ω′λ)λ∈Λ let (υi)i=1,...]Λ be the non-increasing rearrangement of (|〈g, ψλ〉|ω′λ)λ∈Λ

and τ : Λ → 1, . . . ]Λ the reordering. Let k ≤ ]Λ be maximal such that∑i≤k ω

′2τ−1(i) ≤ s,

then S := τ−1(i) : i ≤ k is the set realizing the quasi-best weighted s-term approximationof g with the corresponding error of the quasi-best weighted s-term approximation measuredin frame coefficients

σs(g)ω′,p :=∥∥∥(〈g, ψλ〉)λ∈Λ\S

∥∥∥ω′,p

.

We have σs(g)ω′,p ≤ σ3s(g)ω′,p where the former is the error of the best weighted s-term

83

Page 86: Weighted l1-Analysis minimization and stochastic gradient

approximation

σs(g)ω′,p = minS⊂Λω(S)≤s

∥∥∥(〈g, ψλ〉)λ∈Λ\S

∥∥∥ω′,p

.

Since f as a finite linear combination of the(ϕγ

)γ∈Γ

, it belongs to every

Bp,pl′ (R2) → Sp,pl′−1/p(R2) no matter what l′ > 0, 0 < p ≤ 1. Moreover, we know that∥∥∥∥(〈f , ψλ〉)

λ∈Λ

∥∥∥∥1

≤∥∥∥∥(〈f , ψλ〉)

λ∈Λ

∥∥∥∥ω′,1

= ‖f‖S1,1l

(R2) is finite. This, by using the embed-

dings between shearlet smoothness spaces and Besov spaces (6.27) twice, entails

σ3s(f)ω′,1 ≤ Cs1−1/p

∥∥∥∥(〈f , ψλ〉)λ∈Λ

∥∥∥∥ω′,p

≤ C ′s1−1/p‖f‖Sp,pl ≤ C ′s1−1/p‖f‖Bp,pl+1/p

≤ C ′s1−1/p‖f‖Bp,pl+1/p

≤ C ′s1−1/p‖f‖Sp,pl+1,

where we also used s ≥ 2∥∥(ω′λ)λ∈Λ

∥∥2

∞ and the Stechkin-type estimate for frame coefficients(2.5).

The error of approximation is

‖f − f‖2 =

√∑γ /∈Γ

|cγ |2 ≤∑γ /∈Γ

|cγ | =∑γ /∈Γ

|cγ |ωγωγ≤ s1−1/p

∑γ /∈Γ

|cγ |ωγ = s1−1/p‖f − f‖B1,1l

≤ s1−1/p‖f‖B1,1l≤ s1−1/p‖f‖S1,1

l. (4.12)

We deal with the error η by noting that

η2 = ‖e‖22 =1

m

m∑i=1

∣∣∣∣∣∣∑γ /∈Γ

cγϕγ(xi)θ(xi)

∣∣∣∣∣∣2

≤ 1

m

m∑i=1

∑γ /∈Γ

|cγ |‖ϕγθ‖∞

2

≤ 1

m

m∑i=1

∑γ /∈Γ

|cγ |ωγ

2

= ‖f − f‖2B1,1l

.

Let jmax again be the largest scale that still occurs in γ = (j, k) ∈ Γ as given by thedefinitions in Lemma 57, i.e., there is no

((jj′

), k)∈ Γ with j > jmax or j′ > jmax. The

inequality defining Γ, that is 2‖j‖1(l−1) ≤ s1−1/p, yields jmax ≥ (1/p−1) log2(s)2(l−1) if l > 1. Then

we estimate

η ≤∑γ /∈Γ

|cγ |ωγ ≤

∑γ /∈Γ

|cγ |pωpγ

1/p

=

∑γ /∈Γ

|cγ |p2‖j‖1p(l−1)

1/p

≤ 22jmax(l−l′)

∑γ /∈Γ

|cγ |p2‖j‖1p(l′−1)

1/p

= 22jmax(l−l′)∥∥∥(cγ)γ /∈Γ

∥∥∥ω,p

84

Page 87: Weighted l1-Analysis minimization and stochastic gradient

= 22jmax(l−l′)‖f − f‖Bp,pl′≤ 22jmax(l−l′)‖f‖Bp,p

l′

≤ C22jmax(l−l′)‖f‖Sp,pl′−1/p+1

≤ C22jmax(l−l′)‖f‖Sp,pl′

for l < l′. If we choose l′ large enough so that 22jmax(l−l′) ≤ s1−1/p we obtain

η ≤ Cs1−1/p‖f‖Sp,pl′. (4.13)

Taking both errors (4.12) and (4.13) into account we arrive at

‖f − f ]‖2 ≤ ‖f − f‖2 + ‖f − f ]‖2 ≤s1−1/p‖f‖S1,1

l′+ C ′s1−1/p‖f‖Sp,pl+1

+ Cs1−1/p‖f‖Sp,pl

for p ≤ 1. Altogether, using the embedding result for shearlet smoothness spaces and Besovspaces from (6.27) and recalling that l′ > l, we obtain

‖f − f ]‖2 ≤ (1 + C + C ′)s1−1/p‖f‖Sp,pmaxl+1,l′

.

Remark 60. Firstly we want to point out that our choice of the preconditioning function inTheorem 59 leads to weights which are the ones naturally occurring in the context of Besovspaces. Assuming that ϕ is at least twice differentiable, Lemma 20 shows that there existssuch a preconditioning function ϑ which has at most polynomial growth. For the case ofshearlet-wavelet-minimization we may use a properly normalized version of the tensorizationof the preconditioning function from Chapter 3 namely

ϑ(x) = C max1, C ′|x1|12 +κmax1, C ′′|x2|

12 +κ.

The conditions on the index set Γ demand that 22jmax(l−l′) ≤ s1−1/p and 22jmax(l−1) <

s1−1/p. Since l′ > l and once l < 1, these give the bound jmax >(

1p − 1

)log2(s)

2 max

11−l ,

1l−l′

on the magnitude of the scales contained in Γ.

As outlined in the proof of Theorem 59, we choose Γ to be

Γ = γ = (j, k) ∈ Γ : ωγ = 2‖j‖1(l−1) ≤ s1−1/p, χKϕj,k 6= 0.

Now we choose amax ∈ N0 to be the maximal dilation of the shearlet system such that

C23+jmax(2el−4)−2(ec−el)(amax−1) + 22+2(el−1)jmax−(4el−2)(amax−1) ≤ A

4

where el, ec are as in Lemma 57 and A is the lower frame bound of the shearlet frame.

85

Page 88: Weighted l1-Analysis minimization and stochastic gradient

Moreover, we choose tmin ∈ Z such that

Kamax25amax

4 +3∞∑

p=tmin

p(1 + p− 22amax+1K

)−sd< 2

where sD is as chosen in Remark 58. Then, in order to constitute a frame for the linear spanof ϕ

γ: γ ∈ Γ, we choose the index set of the shearlets to be

Λ :=λ = (a, s, t, h) ∈ Λ : h ∈ 1, 2, a ≤ amax, |s| ≤ 2a, t ∈ Z2 such that ‖t‖2 ≤ tmin

.

4.4 Numerical illustrations

(a) 20 equiangular lines inFourier domain

(b) 20 angular lines inFourier domain sampled atrandom

(c) Sampling pattern ac-cording to the precondi-tioned measure dP = dx

ϑ2(x)

with κ = 0.1 and 10% sub-sampling.

(d) Sampling pattern ac-cording to the precondi-tioned measure dP = dx

ϑ2(x)

with κ = 2 and 10% sub-sampling.

Figure 4.1

Theorems 43 and 59 lend them-selves to an easy implementation ofthe measurement process by sim-ply sampling the 2D-Fourier trans-form of an image at prescribedlocations as already described inSection 3.6. Accordingly, everypoint (r, θ) in polar coordinates inFourier space corresponds to theintensity measured by a sensor atthe angle θ and the distance r tothe center of the sample as statedby the Fourier Slice Theorem 41.

We sampled the measurementsfor the examples from the fft ofthe original data uniformly at ran-dom according to the precondi-tioned probability measure givenby dP = dx

ϑ2(x) with ϑ as in Re-mark 60 where dx is the Lebesguemeasure on R2 (see figures 4.1c and4.1d). To stay within the realmof Computer Tomography, we alsosampled along randomly chosen angular lines (see 4.1b) or along equispaced angular lines(see figure 4.1a). Our sampling strategy in this numerical section does not employ a waveletdecomposition of the Fourier transform of the original data as we just measured at the loca-

86

Page 89: Weighted l1-Analysis minimization and stochastic gradient

tions in Fourier space in pixel basis.

Two types of frames are employed: Firstly, a shearlet frame [5], which, due to its im-plementation, happens to be tight and normalized. Secondly, we choose two closely relatedHaar frames, which are obtained from the discrete, univariate Haar wavelet basis

h0 := 2−d2 (1, . . . , 1)t

h0,0 := h := 2−d2 (1, . . . , 1,−1, . . . ,−1)t

hl,k(j) := h(2lj − k)

2l−d

2 if k2d−l ≤ j < k2d−l + 2d−l−1

−2l−d

2 if k2d−l + 2d−l−1 ≤ j < k2d−l + 2d−l

0 else

where the indices are taken modulo d and d itself is even. The tight Haar frame is thenobtained by the union of this basis and its circular shift by 1 index mod 2d, whereas for itsnon-tight counterpart the vectors are shifted by 2d−l−1 mod 2d and the copy of the constantvector is ignored.Minimization of the Haar coefficients of a function resembles the well-known TV-minimizationapproach since the magnitude of the frame coefficients measures the differences of neighboringpixels, see [73, Proposition 8] or the introductory part of this chapter for more details.The error on the captions is always the relative error ‖x−x

]‖‖x‖ where x] is the reconstruction

of the original x from measurements y = Φx with Φ being the Fourier subsampling operator.

(a) Result for the non-tightHaar frame. Error is ≈ 14.4%.

(b) Result for the tight Haarframe. Error is ≈ 17.2%.

(c) Result for the shearlet frame.Error is ≈ 26%.

Figure 4.2: Reconstruction for the Phantom test image by sampling along 40 deterministicallychosen lines.

While the error of the reconstruction via any of the two Haar frames is basically the same,we observe that the phantom test image, which is designed to be used in TV minimization-examples (compare Figures 4.2a and 4.2a to Figures 4.2c or 4.3b and 4.3a to 4.3c), is re-constructed with a larger error by the shearlet frame, whereas for the Barbara image thedifference is rather due to numerical errors.For comparison, we include examples obtained via TV-minimization in figure 4.6.

87

Page 90: Weighted l1-Analysis minimization and stochastic gradient

(a) Result for the non-tightHaar frame. Error is ≈ 5.2%.

(b) Result for the tight Haarframe. Error is ≈ 6.1%.

(c) Result for the shearlet frame.Error is ≈ 17.1%.

Figure 4.3: Reconstruction for the Phantom test image by sampling along 40 randomly chosenlines.

(a) Result for the non-tightHaar frame. Error is ≈ 24.1%.

(b) Result for the tight Haarframe. Error is ≈ 23%.

(c) Result for the shearlet frame.Error is ≈ 21.1%.

Figure 4.4: Reconstruction for the Barbara test image by sampling along 30 deterministicallychosen lines.

(a) Result for the non-tightHaar frame. Error is ≈ 19%.

(b) Result for the tight Haarframe. Error is ≈ 19%.

(c) Result for the shearlet frame.Error is ≈ 18.9%.

Figure 4.5: Reconstruction for the Barbara test image by sampling along 30 randomly chosenlines.

Until now, all the examples used data which was sampled along angular lines in Fourierdomain which is the sampling scheme given by the Fourier Slice Theorem 41 and the Radon

88

Page 91: Weighted l1-Analysis minimization and stochastic gradient

(a) Result for TV-minimization sam-pling along 30 randomly chosen lines.Error is ≈ 19.2%.

(b) Result for TV-minimization withsampling along 30 deterministically cho-sen lines.Error is ≈ 21.3%.

(c) Result for TV-minimization sam-pling along 40 randomly chosen lines.Error is ≈ 5.1%.

(d) Result for TV-minimization withsampling along 40 deterministically cho-sen lines. Error is ≈ 22%.

Figure 4.6: Numerical illustration for TV minimization.

Transform from Definition 40 since these show the applicability of our methods to the recov-ery of images from subsampled Computer Tomography data. However, the reconstructionTheorem 59 employs a different type of sampling where the locations in Fourier domain arechosen i.i.d. according to the preconditioned probability measure as described in Remark60 and depicted in figures 4.1c and 4.1d at the beginning of this section. Figures 4.7 and4.8 show reconstructions for this type of sampling. In figures 4.7a and 4.8a we see linearartifacts which are caused by the non-tight Haar frame used in Ω-BPDN whereas figures 4.7band 4.8b exhibit the angular artifacts caused by the shearlet frame. Reconstruction via TVMinimization and via the minimization of Haar frame coefficients result in similar error rates,as figures 4.7a, 4.8a and 4.7c, 4.8c show. This is already suggested by Theorem 47 whichgives an estimate for the magnitude of the Haar wavelet coefficients by the TV-operator.

89

Page 92: Weighted l1-Analysis minimization and stochastic gradient

(a) Result for the non-tightHaar frame. Error is ≈ 59%.

(b) Result for the shearletframe. Error is ≈ 51%.

(c) Result for Minimization. Er-ror is ≈ 57%.

Figure 4.7: Reconstruction for the Phantom test image by sampling according to the precon-ditioned probability measure. The number of sampling locations was chosen in such a waythat it equaled sampling along 10 randomly chosen lines.

(a) Result for the non-tightHaar frame. Error is ≈ 4%.

(b) Result for the shearletframe. Error is ≈ 15%.

(c) Result for Minimization. Er-ror is ≈ 4%.

Figure 4.8: Reconstruction for the Phantom test image by sampling according to the precon-ditioned probability measure. The number of sampling locations was chosen in such a waythat it equaled sampling along 50 randomly chosen lines.

4.4.1 TV-Minimization of Real-World Data

Finally the combined results of Chapters 3 and 4 are applied to real-world data. A largedatabase of X-Ray images from a CT machine is the SophieBeads-Dataset [89] which also in-cludes the operator that actually modeled the tomography process in our experiments. TheTFOCS optimization package [25] for Matlab provided the optimization algorithms for thisexamples.

The implementation of the shearlet decomposition which was used in previous examplesand was implemented during a session at a Winter School at CIRM, [5], used a dyadic de-composition which is why the depth of the decomposition is the maximal power of two inthe factorization of the format of the image. Unfortunately, the original data as in figure4.9b had a format of 1564× 1564 pixels after preprocessing which is divisible by 2 just once.While the image could have been padded with 0s or cut down to the next power of 2, the

90

Page 93: Weighted l1-Analysis minimization and stochastic gradient

former approach would enlarge the image to a size which demanded far more computationtime while the latter idea would delete roughly 5

9 of the data. The same issue occurred forthe Haar wavelet decomposition. For these two reasons, we restricted our numerical exper-iments to TV Minimization. Moreover, the implementation of the Total Variation operatorwas considerably faster than the shearlet and Haar wavelet decompositions.

Although we hereby deviate from the main topic of this chapter, we include these examplesfor two reasons: First and foremost, we are not aware that one of the aforementioned methodswas ever applied to real-world data so these examples provide an insight in the applicability ofthese methods for the reconstruction of images from Computer Tomography measurements.Secondly, the implementation of the algorithms that are applicable to real-world data wastremendously labor-intensive and this effort should be presented somewhere in this work. Asit turns out, while the theory of Computer Tomography as modeled by the Radon Transformis understood quite well, there is a wide gap between theory and reality. Sophisticated andcomputational expensive preprocessing is necessary to apply TV Minimization to data froma Tomography machine.

As the comparison between figures 4.9a, 4.9c and 4.9d, 4.9e, 4.9f shows, the results of thisimplementation of TV Minimization do not have the quality of those obtained from classicalreconstruction methods. However, for the images shown in figure 4.9a and 4.9c all of theavailable data was used. Moreover, for the images 4.9d, 4.9e, and 4.9f there was no post-processing. This is due to the fact that further efforts in this direction were not undertakensince these images were already a large deviation from the main focus of this work.

91

Page 94: Weighted l1-Analysis minimization and stochastic gradient

(a) Reconstruction via standard Filtered Backpro-jection

(b) Original X-Ray from the SophieBeads dataset

(c) Reconstruction via Conjugate Gradient LeastSquares method from the SophieBeads GitHubrepository

(d) Reconstruction via TV minimization with ap-prox. 20% subsampling

(e) Reconstruction via TV minimization with ap-prox. 80% subsampling

(f) Reconstruction via TV minimization with ap-prox. 90% subsampling

Figure 4.9: Numerical illustration for real world data from the SophieBeads GitHub reposi-tory.

92

Page 95: Weighted l1-Analysis minimization and stochastic gradient

4.5 Conclusion

In this chapter we have shown that `1ω-analysis minimization is a meaningful approach to theproblem of data that is sparse when expressed as frame coefficients. Moreover, we introducedthe framework for the analysis of Ω-BPDN which is, in this chapter, the Null Space Prop-erty extended to the setting of frames (see Definition 7). This was then extended to infinitedimensional function spaces by means of sparse shearlet expansions and shearlet smoothnessspaces and offers a possibility for future inquiry: As described in more detail in Section6.3 or in [36], it is possible to carry approximation results obtained for one representationsystem over to other systems under several assumptions. This is called sparsity equivalenceand it is an open question whether these allow for results like Theorem 59 to be extended toother frame systems like curvelets or contourlets. Moreover, using the techniques and resultsof [36] for α-molecules it might be able to show reconstruction theorems like Theorem 59for different but related function system and the question arises in which types of functionspaces we arrive at with which convergence results.We finished this chapter with a series of numerical examples some of which were from real-world data. Here we saw the success of TV-minimization which is loosely related to the HaarFrame for which we also presented reconstruction examples. It remains to be seen whichframes provide best results for which type of real-world data. Additionally, the implementa-tion of our methods to address real-world problems remains a proof-of-concept. A version ofthese methods implemented in a way which matches the current industrial standard in speedand precision has yet to be developed.

93

Page 96: Weighted l1-Analysis minimization and stochastic gradient

Chapter 5

Low Rank Matrix Recovery viaStochastic Gradient Descent

In this chapter, we consider the problem of reconstructing a matrix Z ∈ Rn×k from m

quadratic measurements of the form bi = tr(ZtAiZ), i = 1, . . . ,m for random matricesAi ∈ Rn×n. Instead of a sparsity prior, we assume that Z has a low rank and try to solve

minZ∈Rn×k

1

4m

m∑i=1

(bi − tr(ZtAiZ))2

via stochastic gradient descent. Since the measurements are oblivious to orthogonal trans-formations of the matrix Z, the quality of any reconstruction can only be measured up toa global orthogonal factor. This, and the fact that the measurements are quadratic insteadof linear, force us to pursue a strategy different to those commonly followed in CompressiveSensing Theory and apply ideas inspired by the related Phase Retrieval Problem.

5.1 Introduction

The Phase Retrieval Problem consists of reconstructing an unknown feature vector x ∈ Cn

from quadratic measurements bi = |〈ai, x〉|2, i = 1, . . . , r, where the ai ∈ Rn are a collectionof sampling vectors. We set A to be the matrix with rows ati and abbreviate b = |Ax|2 wherethe absolute value is applied entry-wise.

Phase Retrieval is one type of non-convex quadratic problem many of which occur inphysics, i.e., in X-ray crystallography. Generally it is referred to as the phase problem andaddresses the issue that in lots of applications only the intensity, or amplitude, of an elec-tromagnetic wave can be measured but not the phase, other quantities such as polarizationnotwithstanding. This is due to the fact that these types of measurement techniques rely onthe diffraction pattern of an electromagnetic wave which does not retain the phase.

This problem was, amongst others, considered in [15] and can be seen as the rank-one

94

Page 97: Weighted l1-Analysis minimization and stochastic gradient

counterpart of the task at hand. There are several algorithms to approach this problem:In case that A is a Fourier transform the Gerchberg-Saxton algorithm [76] and the moregeneral Fienup algorithm [32] have long been established. The former, Gerchberg-Saxton,works in two steps: firstly, the modulus of Azt at the current iterate zt is adjusted to matchthe modulus of the data y

zt+1 = b Azt|Azt|

where denotes entry-wise multiplication. Afterwards

zt+1 = argminz∈Cn‖zt+1 −Az‖2

is computed by least-squares methods. If some prior knowledge of x is available such as xbeing an element of some subspace or convex set, an additional projection step can be appliedadditionally after this. A refinement of this is AltMinPhase [74]: The signal y = |Ax|2 isbroken up into chunks of roughly equal size, i.e., y1 = |A1x|2 ∈ Rm1 , . . . , yl = |Alx|2 ∈ Rml

so that∑imi = m and mi ≈ mj for all i, j ≤ l and the Gerchberg-Saxton algorithm is

applied to these chunks consecutively:

zt+1 = |At+1x| At+1zt|At+1zt|

, zt+1 = argminz∈Cn‖zt+1 −At+1z‖2.

If now the first block, used for initialization, contains m1 ≥ cn log3(n) samples and theconsecutive block consist each ofmi ≥ c′n log(n), i = 2, . . . , l samples, then the error ‖zt−x‖2is halved in each step with high probability depending on the random matrix A. However,the size of the original data x ∈ Cn determines the maximum number of steps as the samplesare only used once in this algorithm in contrast to Gerchberg-Saxon.

The authors of [15] suggested a gradient descent scheme they called Wirtinger Flow whichaims at minimizing the function

f(z) =1

2m

m∑i=1

l(bi, |〈ai, z〉|) (5.1)

for the loss function l(x, y) = (x − y)2. For an initial z0 ∈ Rn their algorithm computesiterations

zt+1 = zt + µt∇f(zt) (5.2)

for a step-size parameter µt ≥ 0. Minimizing a non-convex function such as this is a chal-lenging problem since it is NP-hard in general. In [71], the authors give an example of areal-valued polynomial of degree 4 for which it is NP-hard to show that the limit of a se-quence of iterations produced by the gradient descent algorithm (or any other given point)actually is a local minimum. However, Candes et.al., manage to obtain the following theorem

95

Page 98: Weighted l1-Analysis minimization and stochastic gradient

in [15] for their algorithm:

Theorem 61. [15, Theorem 3.3] Let x ∈ Cn and b = |Ax|2 ∈ Rm be m quadratic sampleswith m ≥ c0n log(n) where c0 is a sufficiently large numerical constant. Take a constant step-size µ for the gradient descent algorithm 5.2 applied to the target function for the WirtingerFlow (5.1) and assume µ ≤ c1

n for some fixed constant c1. Then there is an event of probabilityat least 1 − 13e−γn −me1.5m − 8

n2 such that on this event, starting from an initial solutionz0 that obeys

dist(z0, x) := minφ∈[0,2π]

‖z − eiφx‖2 ≤1

8‖x‖ (5.3)

we have that the sequence (zi)i = 1, . . . ,m computed by the Wirtinger Flow Algorithm 5.2suffices

dist(zi, x) ≤ 1

8

(1− µ

4

)i/2‖x‖.

The authors of [15] construct a starting point z0 in a similar manner as we do with thespectral initialization, see Definition 66, that satisfies (5.3) with the probability stated above.

The log-factor in the bound for the measurements was later removed in [22,97].Note that we have |〈ai, x〉|2 = 〈ai, x〉〈x, ai〉 = a∗i xx

∗ai = tr(aia∗i xx

∗). One key aspect ofthe problem becomes obvious here, namely that any limit of the sequence (zi)i=1,...,m canonly be accurate up to a global sign.

We aim at extending the Phase Retrieval Problem to the matrix setting: For a collectionof random matrices A1, . . . , Am ∈ Rn×n we define the operator

A : Rn×n → Rm, Z 7→(tr(AiX)

)1≤i≤m (5.4)

and take measurements y = A(ZZt) with Z ∈ Rn×k of rank r. If we set k = 1 we are in thesetting of the Phase Retrieval problem as described at the beginning of this chapter.

For a matrix X ∈ Rn×n, we write X 0 if it is positive semi-definite, i.e., xtXx ≥ 0

holds for all x ∈ Rn and X 0 if it is positive definite, i.e., xtXx > 0 holds for all x ∈ Rn.Given the measurements b = A(ZZt) for some random matrices Ai, we aim at reconstructingthe original Z. Since the matrix ZZt is symmetric and positive definite, we may also solveb = A(X) for such a matrix X and decompose X = Z∗Zt∗ via singular value decomposition.

Since the trace is cyclic we have that for i = 1, . . . ,m and measurements bi = tr(AiZZt) =

tr(ZtAiZ) and ZZt ∈ Rn×n has rank r < n for the rank r matrix Z. In the light of the usualcompressed sensing approach, it seems natural to try and solve

minX0

symmetric

rank(X) subject to A(X) = b,

and obtain a solution Z via the Schur decomposition of the minimizer Xmin = UDU∗ whereD ∈ [0,∞)n×n is a diagonal matrix and U ∈ O(n) orthogonal. Then we set D ∈ Rn×k to be

96

Page 99: Weighted l1-Analysis minimization and stochastic gradient

the matrix with entries di,i =√di,i for i = 1, . . . , k and 0 everywhere else. Then Z = UD

solves the minimization problem. Such Z is of course only unique up to an additional orthog-onal matrix since for such an O ∈ O(n) bi = tr(ZtAiZ) = tr((ZO)tAiZO). Additionally,note that the operator A from (5.4) is quadratic and most research in this area so far hasbeen carried out for linear operators A : Rn×k → Rm which is why we discuss these resultsin the introductory part of this chapter.

Rank minimization, even for linear operators, is NP hard as [69, Theorem 3.1] shows (thisalso follows from the fact that `0-minimization for vectors is already NP-hard). We approachthis problem in the same way as sparse recovery: Considering rank(X) = ‖(σ1(X), . . . , σr(X))t‖0,where the σi(X) are the singular values of X, the `0-operator in this minimization problemis relaxed to the according `1-norm, a method that is called Nuclear Norm minimization:

minX0

symmetric

r∑i=1

σi(X) = ‖X‖∗ subject to A(X) = y. (NNM)

A theoretical analysis of this approach is made in [81] where the authors propose a RIPfor linear operators Φ : Rn×k → Rm which demands for all X ∈ Rn×k of at most rank r thatthe inequality

(1− δr(Φ))‖X‖F ≤ ‖Φ(X)‖2 ≤ (1 + δr(Φ))‖X‖F (5.5)

holds for a constant δr := δr(Φ) > 0.The authors of [81] show the following recovery theorem for NNM:

Theorem 62. [81, Theorem 3.3] Suppose r ≥ 1 and δ5r < 110 . Then any minimizer X∗ of

NNM equals the original data X0 with Φ(X0) = b.

The estimate on δr was improved to δr < 13 in [11] and a robust recovery theorem for

noisy measurements was shown in [70].The square-less matrix RIP (5.5) is not the matrix counterpart of the RIP for vectors as

we know it, but [17] and [46] propose the inequality

(1− δr(Φ))‖X‖2F ≤ ‖Φ(X)‖22 ≤ (1 + δr(Φ))‖X‖2F (5.6)

to hold for all matrices X of rank at most r. These two definitions are, however, equivalent.There are several sampling ensembles available for which it has been proven that either oneof the RIPs holds with high probability. The next theorem might serve as an example forsuch a result.

Theorem 63. [81, Theorem 4.2] Fix 0 < δ < 1. If Φ is a nearly isometric measurementoperator (see the definition of ’nearly isometric random variable’ in [81, Definition 4.1] fordetails), then for every 1 ≤ r ≤ m there exists constants c0, c1 depending only on δ such thatwith probability exceeding 1− ec1p we have δr(Φ) ≤ δ whenever p ≥ c0r(m+ n) log(mn)

In the following, we will not use either of the RIPs.

97

Page 100: Weighted l1-Analysis minimization and stochastic gradient

Other proposed algorithms which exploit this link between low rank matrix recoveryand `1-minimization of vectors include Singular Value Projection [46] and alternating leastsquares [47]. The former a gradient descent scheme to solve the robust version of the affinerank minimization problem

min Ψ(X) :=1

2‖Φ(X)− b‖22 subject to X ∈ C(r) = X ∈ Rn×k : rank(X) ≤ r

for an arbitrary affine transformation Φ which entails an iteration scheme

Xt+1 = Pr(Xt − ηt∇Ψ(Xt)) = Pr(Xt − ηtΦ(Φ∗(Xt)− b)) (SVDRARM)

where the projection Pr : Rn×n → C(r) is computed via singular value decomposition (SVD).This is a iterative hard thresholding scheme and thus ideationally related to Algorithm 1,WIHT. The authors of [47] show at the following theorem:

Theorem 64. [46, Theorem 1.2] Suppose the restricted isometry constant of the linearoperator Φn×kR → Rm satisfies δ2r < 1

3 for δ2r as in the squared RIP inequality (5.6) andb = Φ(X0) + e for a rank-r-matrix X0 ∈ Rn×k and an error vector e ∈ Rm. Then theprojected gradient descent algorithm SVDRARM with stepsize ηt = 1

1+δ2routputs a matrix

X∗ of rank at most r such that ‖Φ(X∗) − b‖22 ≤ C‖e‖22 + ε and ‖X0 −X∗‖2F ≤C‖e‖22+ε

1−δ2r for

some ε ≥ 0 in at most⌈log(

‖b‖222(C‖e‖22+ε)

)⌉iterations for some universal constant C.

A different approach is Alternate Minimization For Matrix Sensing or AltMinSense inshort [47] where the target matrix X0 is written as a product X0 = U0V

t0 of two matrices

U0 ∈ Rk×r and V0 ∈ Rn×r. If the rank r is smaller than m or n, this reduces the complexityof the problem. Then in order to solve

min ‖Φ(UV t)− b‖22, U ∈ Rk×r, V ∈ Rn×r

the authors propose the following alternating algorithm:

Data: A, binitialize U0 ∈ Rk×r to be the top-r left singular vectors of 1

m

∑iA

ibi;for j = 0, . . . ,M − 1 do

V j+1 = argminV ∈Rn×r‖Φ(U jV t)‖22U j+1 = argminU∈Rk×r‖Φ(U(V (j+1))t)‖22

endResult: Reconstruction X∗ = UM (VM )t

Algorithm 2: AltMinSenseThen UT (V T )t obeys the following recovery guarantee:

Theorem 65. [46, Theorem 2.2] LetM = U ′Σ′V′t be a rank-r matrix with non-zero singular

values σ′1 ≥ σ′s ≥ . . . ≥ σ′r. Also, let the linear measurement operator Φ( · ) : Rn×k →Rm satisfy the 2r-RIP from the squared RIP inequality (5.6) with constant δ2r < κ(M)

100r

98

Page 101: Weighted l1-Analysis minimization and stochastic gradient

where κ(M) =σ′1(M)σ′r(M) is the condition number. Then the AltMinSense Algorithm 2, produces

matrices UT , V T which for all T > 2 log(‖M‖Fε

)satisfy

‖M − UT (V T )t‖F ≤ ε.

Generally we denote the condition number of a matrix M ∈ Rn×k by κ(M) =σ′1(M)σ′r(M) .

The initialization of AltMinSense 2 is akin to the spectral initialization which we employin our analysis:

Definition 66 (Spectral Initialization). Let Ai ∈ Rn×n be the sampling matrices and bi =

tr(ZtAiZ) for i = 1, . . . ,m. Then, withW := 1m

∑mi=1 biA

i and (λi, wi)ni=1 are its eigenvalues

and eigenvectors ordered by decreasing magnitude of the eigenvalues, we set our initializationmatrix to be

Z0 :=[z0

1 | . . . |z0k

]where z0

i :=

√|λi|2wi.

In [98] the authors considered the function

f(X) =1

4m

m∑i=1

(tr(ZtAiZ)− bi

)2, (5.7)

where bi = tr(ZtAiZ) for i.i.d. random matrices Ai, i = 1, . . . ,m which is similar to thefunction f(Z,Ai) = 1

4

(tr(ZtAiZ)− bi

)2 we examine. Our results demand a special kind ofrandomized sampling scheme where the measurement matrices Ai are distributed in the formof a Gaussian Orthogonal Ensemble:

Definition 67 (Gaussian Orthogonal Ensemble). Let A = (ai,j)i,j∈[n] ∈ Rn×n be a randommatrix. A is distributed as an Gaussian Orthogonal Ensemble (GOE) if A is symmetric, theentries ai,j are independent for all (i, j) with i ≥ j and obey ai,j ∼ N (0, 1) if i 6= j andaii ∼ N (0, 2) . ♦

The authors of [98] show that O(r3n log(n)) measurements suffice for recovery of X∗ =

ZZt via gradient descent applied to the cumulative target function (5.7) with constant stepsize with high probability. Their gradient descent scheme is designed closely along the linesof Candes et. al. in [15] with A as in (5.4).

The major drawback of the approach as outlined in [98] approach is that in every step ofthe algorithm the gradient

∇f(Z) =1

m

m∑i=1

(tr(ZtAiZ)− bi

)AiZ

has to be computed. This comes at a high computational cost since at every iteration 2m

matrix products have to be computed. We try to prevent this by using stochastic gradient

99

Page 102: Weighted l1-Analysis minimization and stochastic gradient

descent. For abbreviation, we set our target function f(Z,A) := (tr(ZitAiZi) − bi)2, wherebi = tr(ZitAiZi). This amounts to the following algorithm:

Data: Random Matrices Ai, i = 1, . . . ,m, measurements yResult: Reconstruction x]

initialize Z0 ∈ Rn×k via Spectral initialization (see Definition 66);while i ≤ m do

Zi+1 ← Zi − µi∇Zf(Zi, Ai)

i = i+ 1end

Algorithm 3: Stochastic Gradient Descent

Note that an additional projection step using

Ps : Rn×k → C(r), X 7→ argminrank(C)≤r

‖X − C‖F (5.8)

may be included into the stochastic gradient descent algorithm 3 to increase performance.This was done in [46] for an arbitrary affine transformation Φ as mentioned during thediscussion of SVDRARM above. For the cumulative target function f this would amount tothe minimization problem

min1

4m

m∑i=1

(tr(AiX)− bi

)2 subject to X ∈ C(r) := Y ∈ Rn×n : rank(Y ) ≤ r (5.9)

where the minimizer X∗ is decomposed into the product X∗ = Z∗Z∗t. In every iterationof the gradient descent the orthogonal projection onto C(r), Ps : Rn×m → C(r) ⊂ Rn×m, isapplied to the iterate. Although C(r) is not convex, the projection can be computed quicklyvia a singular value decomposition (SVD) by pruning an iterate to the s largest singularvalues as done in SVDRARM.

The main theorem for the gradient descent algorithm proposed in [98] reads as follows:

Theorem 68. [98, Theorem 2] If, for some universal constant c0 > 0, m ≥ c0κ(ZZt)r3n log(n)

then with high probability the spectral initialization Z0 satisfies d(Z0, Z) ≤ σr√

34 where σr

denotes the r-th singular value of Z. Moreover, there exists an universal constant c1 suchthat when using a constant step size µ > 0 with µ ≤ c1

κ(ZZt)n‖Z‖2Fand starting at Z0, the

gradient descent algorithm produces a series (Zl)l≥1 that satisfies

d(Zl, Z) ≤ σr√

3

4

(1− µ

12κ(ZZt)r

)l/2

with high probability.

As we will see later in this chapter, we are not able to provide a full proof of the conver-gence of the stochastic gradient descent Algorithm 3 due to the lack of a deviation inequality

100

Page 103: Weighted l1-Analysis minimization and stochastic gradient

for the gradient of the target function. Although we will conceive such a deviation inequalityin Section 5.5, i.e., show that ‖∇f(Z,A) − ∇Ef(Z,A)‖F ≤ C‖Z − Z‖, the quantity C willdepend on Z, Z to an extent that does not allow for a proof of convergence. We will, however,provide numerical evidence that our algorithm actually converges in Section 5.6 where we willalso examine the problematic deviation numerically. Moreover, we state another algorithm,Mini-Batch Stochastic Gradient Descent, Algorithm 6, and prove its convergence in Section5.7.

5.2 Notation

We write O(n) for the group of orthogonal Rn×n matrices. Since the trace is invariantunder such orthogonal transformations, two matrices Z,ZU which differ by an orthogo-nal matrix U ∈ O(k) lead to the same result: A(ZU,A) =

(tr(U tZtAiZU)

)i=,...,m

=(tr(ZtAiZ)

)i=1,...,m

= A(Z,A). For X ∈ Rn×k, we write X = (x1| . . . |xk) = (xi,j)i∈[n],j∈[k]

where the xi ∈ Rn are the columns of X. Usually, we write σ1(X), . . . , σr(X) for the singularvalues of X in decreasing order, where r = rank(X). Since our computations mostly involvethe singular values of Z, we abbreviate σi := σi(Z).For matrix differentiation we employ the following notation: For a function g : Rn×k → Rwe write

dgdX

(X) := ∇Xg(X) :=

(∂g

∂xi,j(X)

)i∈[n]j∈[k]

.

Occasionally we abbreviate our target function as

f(Z,A) =1

4(tr(ZtAZ)− tr(ZtAZ))2

and its expectation as f(Z) = Ef(Z,A) and likewise for its derivatives, i.e., ∇f(Z) =

E∇f(Z,A).

Definition 69. Let Z ∈ Rn×k.

• Throughout this chapter, r denotes the rank of the original Z and σ1 ≥ . . . ≥ σr itssingular values.

• Since the Frobenius norm is invariant under orthogonal transformations we set

O := Z∗ ∈ Rn×k : Z∗ = ZU, U ∈ O(k) orthogonal

as the right action orbit of Z under the orthogonal group O(k).

• We measure the distance between Z, Z ∈ Rn×k via

d(Z, Z) := minU∈O(k)

‖Z − ZU‖F

101

Page 104: Weighted l1-Analysis minimization and stochastic gradient

the minimal distance of Z to the orbit of Z under the right multiplication of O(k).

• For Z ∈ Rn×k we write the Z∗ ∈ O that is closest to Z onto O as

Z := argminZ∗∈O‖Z − Z∗‖F .

• We generally abbreviate the residual H = Z−Z and for an iterate Zi of the stochasticgradient descent algorithm 3 we write Hi = Zi − Zi.

• Our target function is f(Z,A) = 14 (tr(ZtAZ)−tr(ZtAz))2 and we sometimes abbreviate

f(Z) = Ef(Z,A).

5.3 Initialization

The quality of the initialization is of crucial importance within our analysis of our mainalgorithm 3 as it is used within the proof of our main theorem 78. Inspired by the initializationin [15] we use the same starting point as [98] which was inspired by the former paper.

The idea behind spectral initialization is that for y = tr(ZtAZ)

(EyA)i,j = Etr(ZtAZ)Ai,j =

n∑s=1

∑p,q

Ezp,sap,q zq,sai,j = 2

n∑s=1

zi,szj,s = 2(ZZt)i,j

where zm is the m-th column of Z. Accordingly the eigenvalues of 1m

∑mi=1 biA

i approximatethe eigenvalues of ZZt. This is the idea behind the next result:

Theorem 70. Let 0 < δ < σr and suppose that for any u ∈ Rn we have∥∥∥∥∥ 1

m

m∑i=1

(utAiu)Ai − 2uut

∥∥∥∥∥ ≤ δ

r‖u‖22.

Then

d(Z0, Z) ≤

√rδ2

(√

2− 1)σ2r

where r ∈ N is the rank of the original Z ∈ Rn×k and Z0 is the starting point of the algorithmobtained via Spectral initialization, see Definition 66.

This is the version of [98, Theorem 5] that is actually proven in the appendix of [98,Appendix E, p. 23, 24]. Since we will use this Theorem over and over again, we assume fromnow on that generally 0 < δ < σr whenever we apply this Theorem. In each situation wherewe choose δ, we ensure that this specific δ will obey the bound δ < σr.

102

Page 105: Weighted l1-Analysis minimization and stochastic gradient

The assumptions of Theorem 70 are fulfilled with high probability, as the next theoremshows.

Theorem 71. [98, Theorem 6] If the number of samples m ≥ 42

min

δ2

r2σ41, δ

rσ21

n log(n), then,

with probability exceeding 1 − mCe−ρn − 2n2 the following holds for all u ∈ Rn that satisfy

‖u‖ ≤ σ1: ∥∥∥∥∥ 1

m

m∑i=1

(utAiu)Ai − 2uut

∥∥∥∥∥ ≤ δ

k, (5.10)

where C and ρ are universal constants.

Figure 5.1: Occasionally, the startingpoint computed via spectral initial-ization is so close to the actual min-imum that Algorithm 3 converges ina few steps. Here Z ∈ R25×15 of rank2 and Z0 computed from m = 16094measurements.

5.4 The averaged flow

In order to prove convergence of this section’s main algorithm (Algorithm 3), we first needto establish that the following local curvature condition is fulfilled for suitable α, β > 0:

〈∇ZEf(Z,A), Z − Z〉 ≥ α‖Z − Z‖2F + β‖(Z − Z)tZ‖2F

for all Z ∈ Rn×k and Z as in Definition 69. The intuition behind this property is that thegradient of f at some iterate Z is, more or less, aligned with the residual which of course isthe working principle of gradient descent schemes. Another important property is the localsmoothness condition, which for γ, δ > 0 reads as follows:

‖∇ZEf(Z,A)‖2F ≤ γ‖Z − Z‖2F + δ‖(Z − Z)tZ‖2F .

Here, the idea is that the gradient is not too large compared to the difference vectorH = Z−Zin order to avoid overshooting the minimizer.These two conditions ensure that a third regularity condition is fulfilled, which is essentialfor the proof of convergence of our main Algorithm 3:

〈∇ZEf(Z,A), Z − Z〉 ≥ τ‖Z − Z‖2F + ρ‖∇ZEf(Z,A)‖2F with parameters τ, ρ > 0.

103

Page 106: Weighted l1-Analysis minimization and stochastic gradient

The strategy of the proof is as follows: First, we show that Ef(Z,A) = 14E(tr(ZtAZ) −

tr(ZtAZ))2 for a GOE-distributed random matrix A (see Definition 67) fulfills the localcurvature (5.11) and local smoothness condition (5.12) and thus the regularity condition aswell, thereby ensuring linear convergence of the gradient descent algorithm. Since Algorithm 4can not employ the expectations∇Ef(Z,A) we need to account for the difference∇Zf(Z,A)−∇ZEf(Z,A). Therefore, in a next step we prove a deviation inequality for the gradient, i.e.,an inequality of the type

‖∇Zf(Z,A)− E∇Zf(Z,A)‖F < ε

so that the results for the averaged flow carry over to the computable counterpart.Before providing the proofs for the actual results, we compute some of the ingredients

which will be used repeatedly.

Lemma 72. Let A ∈ Rn×n be distributed as a GOE (see Definition 67), X,Y, V,W ∈ Rn×k

andf : Rn×k → R, Z 7→ 1

4(tr(ZtAZ)− tr(ZtAZ))2.

Then the following equations hold:

1. ∇Zf(Z,A) = (tr(ZtAZ)− b)AZ,

2. Etr(XtAY )A = XY t + Y Xt,

3. ∇ZEf(Z,A) = 2(ZZt − ZZt)Z,

4. and

E[tr(V tAW )tr(XtAY )

]= 〈V tW,XtY 〉F + 〈V tW,Y tX〉F= 〈V tW,XtY + Y tX〉F ≤ 2‖V tW‖F ‖XtY ‖F .

Proof. 1. Since ∇tr(ZtAZ) = (A+At)Z and A is symmetric, the chain rule yields

∇Zf(Z,A) =1

2(tr(ZtAZ)− b)(A+At)Z = (tr(ZtAZ)− b)AZ.

2. We have for p, q ∈ [n]

(Etr(XtAY )A)

)p,q

= Ek∑i=1

∑j,l

xj,iaj,lyl,iap,q

= 1p 6=qk∑i=1

xp,iyq,i +

k∑i=1

xq,iyp,i + 2 · 1p=qk∑i=1

xp,iyp,i

=

k∑i=1

xp,iyq,i +

k∑i=1

xq,iyp,i = (XY t)p,q + (XY t)q,p

104

Page 107: Weighted l1-Analysis minimization and stochastic gradient

3. First we have that ∇Zf(Z,A) = (tr(ZtAZ) − b)AZ = tr(ZtAZ)AZ − tr(ZtAZ)AZ.Then 2 yields

Etr(ZtAZ)AZ =[Etr(ZtAZ)A

]Z = 2ZZtZ and

Etr(ZtAZ)AZ =[Etr(ZtAZ)A

]Z = 2ZZtZ.

4. We calculate

E[tr(V tAW )tr(XtAY )

]= E

k∑l=1

n∑i,j=1

xi,lai,jyj,l

k∑r=1

n∑p,q=1

vp,rap,qwq,r

i=pj=q=

k∑r,l=1

n∑i,j=1

xi,lyj,lvi,rwj,r

i=qj=p

+

k∑r,l=1

n∑i,j=1

xi,lyj,lvj,rwi,r

=

n∑i,j=1

(WV t)i,j(XYt)i,j +

n∑i,j=1

(WV t)i,j(Y Xt)i,j

and the last inequality of the statement is obtained via the Cauchy-Schwarz inequality.

The key ingredient in the proof of convergence of the stochastic gradient descent Algorithm3 reads as follows:

Definition 73. A function g : Rn×k → R satisfies the regularity condition R(ε, α, β) if thereexist α, β > 0 such that for every Z ∈ Rn×k Z ∈ Rn×k with d(Z, Z) = ‖Z−Z‖F = ‖H‖F < ε

〈∇g(Z), Z − Z〉 ≥ 1

α‖Z − Z‖2F +

1

β‖∇g(Z)‖2F . (R(ε, α, β))

holds. ♦

This is the regularity condition R(ε, α, β) from [98]. In a similar fashion, we adopt thetwo conditions in Definition 74 below which are based on their namesakes from [98] sincethose ensure that a function g satisfies the regularity condition R(ε, α, β).

In [98], the authors prove that the local smoothness and local curvature conditions holdfor the cumulative function f(Z) = 1

4m

∑mi=1(tr(ZtAiZ) − bi)

2 rather than the expecta-tions Ef(Z,A),E∇f(Z,A) as we will do, see especially [98, Lemmas 5, 6]. Our approach issomewhat different: Instead of directly showing that the function f(Z,A) fulfills the localcurvature and smoothness conditions, we prove these for Ef(Z,A) and show an additionaldeviation inequality for ∇f(Z,A)−E∇f(Z,A) afterwards. So while we employ the same lineof thinking, which was also used by [15] to address the phase retrieval problem, our methodsdiffer.

Definition 74. A function g : Rn×k → R satisfies the local curvature condition if there exists

105

Page 108: Weighted l1-Analysis minimization and stochastic gradient

a constant C1 > 0 such that for all Z ∈ Rn×k with d(Z, Z) = ‖Z − Z‖F = ‖H‖F < ε

〈∇g(Z), Z − Z〉 ≥ C1‖Z − Z‖2F + ‖(Z − Z)>Z‖2F (5.11)

holds. The function g satisfies the local smoothness condition if there exist constants C2, C3 >

0 such that for all Z ∈ Rn×k with d(Z, Z) = ‖Z − Z‖F = ‖H‖F < ε

‖∇g(Z)‖2F ≤ C2‖Z − Z‖2F + C3‖(Z − Z)>Z‖2F (5.12)

holds. ♦

We will show that our target function f(Z,Ai) = (tr(ZtAiZ) − tr(Zt)AiZ)2 obeys twoinequalities which are slightly more general than the local curvature condition and the localsmoothness condition.

Lemma 75. The target function f(Z) = Ef(Z,A) = Etr(ZtAZ) − y, where y = tr(ZtAZ)

obeys the local curvature condition with ε = 2√5σr and generally obeys

〈∇Ef(Z,A), H〉 ≥(σ2r −

5

2‖H‖2F

)‖H‖2F + ‖HtZ‖2F (5.13)

where, as usual, H := Z − Z.

Proof. We express ∇f(Z) = Ef(Z,A) in a convenient way in terms of H

∇f(Z) = E(tr(ZtAZ)− y)AZ = E[tr(ZtAZ)− tr(ZtAZ)AZ] = E[tr(ZtAZ)− tr(U tZtAZU)AZ]

= E[tr(ZtAZ)− tr(ZtAZ)AZ] = E[(tr(ZtAZ + Z

tAZ − ZtAZ − ZtAZ)A(H + Z)]

= E[tr((Z + Z)tA(Z − Z))A(H + Z)] = E[[tr(HtAH) + 2tr(ZtAH)]A(H + Z)].

This entails

〈∇f(Z), H〉 = tr(Ht[E[tr(HtAH) + 2tr(Z

tAH)]A(H + Z)]

)= E[tr(HtAH)2]︸ ︷︷ ︸

=a2

+ 2E[tr(ZtAH)2]]︸ ︷︷ ︸

=b2

+3E[tr(ZtAH)tr(HtAH)]]

≥ a2 + b2 − 3√E[tr(HtAH)2]

√E[tr(Z

tAH)2]

= a2 + b2 − 3√2

√E[tr(HtAH)2]︸ ︷︷ ︸

=a

√2E[tr(Z

tAH)2]︸ ︷︷ ︸

=b

=

(b− 3

2√

2a

)2

− a2

8

≥(b2

2− 9

8a2

)− a2

8=b2

2− 5

4a2 = E[tr(Z

tAH)2]− 5

4E[tr(HtAH)2]

where we used the inequality

(c− d)2 ≥ c2

2− d2 ⇔ c2

2− 2cd+ 2d2 =

(√2d− c√

2

)2

≥ 0

106

Page 109: Weighted l1-Analysis minimization and stochastic gradient

in the last estimate. We denote the ith column of H by hi and obtain

E[tr(HtAH)2] = E

[k∑i=1

htiAhitr(HtAH)

]=

k∑i=1

htiE[tr(HtAH)A

]hi

and for the expectation we have that

E[tr(HtAH)A

]k,l

= 2

k∑i=1

Hk,iHl,i = 2HHtk,l

with calculations similar to those in the proof of Lemma 72. Therefore

E[tr(HtAH)2] = 2

k∑i=1

htiHHthi = 2tr(HtHHtH) = 2‖HHt‖2F .

Similarly

Etr(HtAZ)2 =

k∑i=1

htiE[tr(HtAZ)A]zi

and therefore

E[tr(HtAZ)A]p,q = (HZt)p,q + (ZHt)p,q

hence, since by [98, Lemma 7] we have that tr(HtZHtZ) = ‖HtZ‖2F ,

Etr(HtAZ)2 =

k∑i=1

htiHZtzi +

k∑i=1

htiZHtzi = tr(HtHZ

tZ) + tr(HtZHtZ)

= tr(ZHt(ZHt)t) + tr(HtZHtZ) = ‖ZHt‖2F + ‖HtZ‖2F≥ σ2

r‖Ht‖2F + ‖HtZ‖2F ≥ 2σ2r‖H‖2F .

Now we calculate in a similar fashion

E[tr(ZtAH)tr(HtAH)]] = E

k∑i=1

∑p,q

Zp,iap,qHq,i

k∑j=1

∑r,s

Hr,jar,sHs,j

=

k∑i,j=1

∑p,q,r,s

EZp,iap,qHq,iHr,jar,sHs,j

p=rq=s=p=sq=r

2

k∑i,j=1

∑p,q

Zp,iHq,iHp,jHq,j

= 2

k∑i,j=1

〈Zi, Hj〉〈Hi, Hj〉 = 2〈ZtH,HtH〉F .

In summary

〈∇f(Z), H〉 = Etr(HtAH)2 + 2Etr(ZtAH) + 3E[tr(Z

tAH)tr(HtAH)]]

107

Page 110: Weighted l1-Analysis minimization and stochastic gradient

= 2(‖ZHt‖2F + ‖HtZ‖2F + ‖HHt‖2F + 3〈ZtH,HtH〉F

)= 2‖HHt‖2F + 2‖ZHt‖2F + ‖HtZ‖2F + 6〈ZtH,HtH〉F

≥ Etr(ZtAH)2 − 5

4Etr(HtAH)2

≥(σ2r‖Ht‖2F + ‖HtZ‖2F −

5

2‖HHt‖2F

)≥(σ2r −

5

2‖H‖2F

)‖H‖2F + ‖HtZ‖2F .

In our analysis of the convergence of the stochastic gradient descent Algorithm 3 we willuse the estimate (5.13) to better keep track of the constants involved.

Lemma 76. The target function f obeys the local smoothness condition for every ε ≤ ‖Z‖Fwith constants C3 = 4C2 = 64‖Z‖F and generally obeys

‖∇Ef(Z,A)‖2F8(‖H‖2F + ‖Z‖2F )

≤ ‖H‖4F + 4‖HtZ‖2F . (5.14)

Proof. In order to bound ‖∇f(Z)‖2F = max‖U‖2F=1 |〈U,∇f(Z)〉|2 we expand

|〈U,∇f(Z)〉|2 =[E(tr(HtAH) + 2tr(HtAZ)(tr(U tAH) + tr(U tAZ))

]2≤ 4

[Etr(HtAH)tr(U tAH)

]2+ 4

[Etr(HtAH)tr(U tAZ)

]2+ 16

[Etr(HtAZ)tr(U tAH)

]2+ 16

[Etr(HtAZ)tr(U tAZ)

]2where we used that (a+ b+ c+ d)2 ≤ 4(a2 + b2 + c2 + d2) for any a, b, c, d ∈ R. Each of thesummands above will be bounded using the identity for the expectation of a product of twotraces from Lemma 72 and ‖U‖F ≤ 1:

[Etr(HtAH)tr(U tAH)

]2 ≤ 2‖HtH‖2F ‖H‖2F ,[Etr(HtAH)tr(U tAZ)

]2 ≤ 2‖HtH‖2F ‖Z‖2F ,[Etr(HtAZ)tr(U tAH)

]2 ≤ 2‖HtZ‖2F ‖H‖2F ,[Etr(HtAZ)tr(U tAZ)

]2 ≤ 2‖HtZ‖2F ‖Z‖2F ,

and thus

‖∇f(Z)‖2F ≤ 8(‖H‖2F + ‖Z‖2F )(‖H‖4F + 4‖HtZ‖2F ),

where we used ‖HtH‖F ≤ ‖H‖2F repeatedly. This entails

‖∇f(Z)‖2F8(‖H‖2F + ‖Z‖2F )

≤ ‖H‖4F + 4‖HtZ‖2F .

108

Page 111: Weighted l1-Analysis minimization and stochastic gradient

The local smoothness condition (5.12) and the local curvature condition (5.11) can becombined in the proof of the main regularity condition R(ε, α, β) for f .

Lemma 77. The target function f obeys the regularity condition R(ε, α, β) with ε = 1√11σr

and generally obeys

〈∇Ef(Z,A), H〉 ≥(σ2r −

11

4‖H‖2F

)‖H‖2F +

‖∇Ef(Z,A)‖2F32(‖H‖2F + ‖Z‖2F )

. (5.15)

Proof. By Lemmas 75 and 76

〈∇ZEf(Z,A), H〉 ≥(σ2r −

5

2‖H‖2F

)‖H‖2F + ‖HtZ‖2F

≥(σ2r −

11

4‖H‖2F

)‖H‖2F +

‖∇ZEf(Z,A)‖2F32(‖H‖2F + ‖Z‖2F )

.

Setting ε = 1√11σr < ‖Z‖F yields constants α = 4

3σ2r

and β = 32(‖Z‖2F + ε

)for the

regularity condition R(ε, α, β).

Unfortunately, as we will outline in Section 5.5, we are not able to provide a full proof ofconvergence of the stochastic gradient descent Algorithm 3. We will split the task in severalcomponents. For some of them we achieve results which we will outline hereafter. In Section5.6 we will establish and provide numerical evidence that the open parts seem to be true. Inthe final section of this chapter, we will present a closely related algorithm for which we willgive a full proof of convergence.

We set Hi = Zi − Zi to obtain

‖Hi+1‖2F = ‖Zi+1 − Zi+1‖2F ≤ ‖Zi+1 − Zi‖2F =∥∥∥Zi − µ∇f(Zi, Ai)− Zi

∥∥∥2

F

=∥∥∥Zi − µ∇Ef(Zi, Ai) + µ∇Ef(Zi, Ai)− µ∇f(Zi, Ai)− Zi

∥∥∥2

F

= ‖Hi − µ∇Ef(Zi, Ai)‖2F + µ2‖∇f(Zi, Ai)−∇Ef(Zi, Ai)‖2F+ 2µ

⟨Hi − µ∇Ef(Zi, Ai),∇f(Zi, Ai)−∇Ef(Zi, Ai)

⟩≤ ‖Hi − µ∇Ef(Zi, Ai)‖2F + µ2‖∇f(Zi, Ai)−∇Ef(Zi, Ai)‖2F (5.16)

+ 2µ‖Hi − µ∇Ef(Zi, Ai)‖F ‖∇f(Zi, Ai)−∇Ef(Zi, Ai)‖F .

The first term of (5.16) can be treated with the regularity condition R(ε, α, β) to obtain thefollowing result:

Theorem 78. With the notation used above, let the stepsize parameter µ be bounded as

µ ≤ 1

16(‖H0‖2F + ‖Z‖2F )

109

Page 112: Weighted l1-Analysis minimization and stochastic gradient

Let (Zi)i≥0 be the sequence generated by the Stochastic Gradient Descent Algorithm 3, thatis

Zi+1 = Zi − µ∇f(Zi, Ai), i = 0, . . . ,m− 1.

Then we have

‖Hi − µ∇Ef(Zi, Ai)‖2F ≤(

1− 2µ

(σ2r −

11

4‖Hi‖2F

))‖Hi‖2F . (5.17)

Proof. We expand

‖Hi − µ∇Ef(Zi, Ai)‖2F =∥∥∥Zi − µ∇Ef(Zi, Ai)− Zi

∥∥∥2

F

= ‖Hi‖2F + µ2‖∇Ef(Zi, Ai)‖2F − 2µ⟨∇Ef(Zi, Ai), Hi

⟩≤ ‖Hi‖2F + µ2‖∇Ef(Zi, Ai)‖2F

− 2µ

((σ2r −

11

4‖Hi‖2F

)‖Hi‖2F +

‖∇Ef(Zi, Ai)‖2F32(‖Hi‖2F + ‖Zi‖2F )

)

=

(1− 2µ

(σ2r −

11

4‖Hi‖2F

))‖Hi‖2F

+ µ‖∇Ef(Zi, Ai)‖2F

(µ− 1

16(‖Hi‖2F + ‖Zi‖2F )

).

Since ‖Zi‖F = ‖Z‖F and µ ≤ (16(‖H0‖2F + ‖Z‖2F ))−1 the second summand is non-positiveso that

‖Hi − µ∇Ef(Zi, Ai)‖2F ≤(

1− 2µ

(σ2r −

11

4‖Hi‖2F

))‖Hi‖2F .

Since Theorem 78 only covers the left-hand side summand of (5.16) we are still in need ofa deviation inequality for the gradient ∇f(Zi, Ai)−∇Ef(Zi, Ai). We will see several versionsof deviation inequalities like this in sections 5.5, 5.6 and 5.7 which is why we will outline theargument to be made there here beforehand.

Lemma 79. Assume the demands of Theorem 78 are fulfilled and moreover assume thatthere is an ε > 0 such that

‖∇f(Z,A)−∇Ef(Z,A)‖F ≤ ε‖Z − Z‖F , for all Z ∈ Rn×k

for GOE distributed random matrices A ∈ Rn×n. Moreover, assume that

ε < σ2r and µ <

2σ2r

ε(1 + ε).

110

Page 113: Weighted l1-Analysis minimization and stochastic gradient

Then

‖Hi+1‖2F ≤[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

− µε2

2− ε)]i‖H0‖2F .

Proof. Theorem 78 provides the estimate

‖Hi − µ∇Ef(Zi, Ai)‖2F ≤(

1− 2µ

(σ2r −

11

4‖Hi‖2F

))︸ ︷︷ ︸

=F (‖Hi‖F ,µ)

‖Hi‖2F .

Note that the expression F (a, b) := 1−2b(σ2r − 11

4 a)is monotonically increasing in a. Firstly,

we only consider i = 0 and show ‖H1‖2F < ‖H0‖2F , then convergence follows by induction.Here, we employ a result from Section 5.3, where Theorem 70 yields the estimate

‖H0‖2F ≤rδ2

(√

2− 1)σ2r

.

Theorem 78 and (5.16) imply

‖H1‖2F ≤ F (‖H0‖F , µ)‖H0‖2F + µ2‖∇f(Z,A)−∇Ef(Z,A)‖2F+ 2µ‖∇f(Z,A)−∇Ef(Z,A)‖F ‖H0‖F ≤ F (‖H0‖F , µ)‖H0‖2F + µ2ε2‖H0‖2F + 2µε‖H0‖2F

≤(F (‖H0‖F , µ) + µ2ε2 + 2µε

)‖H0‖2F ≤

(F

(rδ2

(√

2− 1)σ2r

, µ

)+ µ2ε2 + 2µε

)‖H0‖2F

=

[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

− µε2

2− ε)]‖H0‖2F

and if now

σ2r −

11rδ2

4(√

2− 1)σ2r

− µε2

2− ε > 0⇔ δ <

√4(√

2− 1)σ2r

11r

(σ2r −

µε2

2− ε)

(5.18)

we have that [1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

− µε2

2− ε)]

< 1

which yields ‖H1‖F < ‖H0‖F . Note that since ε < σ2r and µ <

2σ2r

ε(1+ε) we have that

σ2r −

µε2

2 − ε > 0 and thus δ > 0. Then F (‖H1‖F , µ) + µε2 + 2ε < F (‖H0‖F , µ) + µε2 + 2ε

and similarly via induction

‖Hi+1‖2F ≤(F (‖Hi‖F , µ) + µε2 + 2ε

)‖Hi‖2F ≤

(F (‖H0‖, µ) + µε2 + 2ε

)‖Hi‖2F

≤(F (‖H0‖, µ) + µε2 + 2ε

)i ‖H0‖2F

≤[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

− µε2

2− ε)]i‖H0‖2F

111

Page 114: Weighted l1-Analysis minimization and stochastic gradient

which is the claim.

The bound on the stepsize µ ≤[16(‖H0‖2F + ‖Z‖2F )

]−1

from Theorem 78 depends on Zand thus is impossible to verify so that a proper µ can not be determined. The algorithm isusually run with

µ = c

[kδ2

(√

2− 1)σr(Z0)2+ ‖Z0‖2F

]−1

for some small constant c > 0 and δ can be chosen according to (5.18) where we replaceσr = σr(Z) by σr(Z0)

Now it remains to determine the number of measurements necessary to ensure convergenceof our stochastic gradient descent Algorithm 3.

Remark 80. Theorem 71 provides a lower bound for the number of measurements

m ≥ 42

min

δ2

r2σ41, δrσ2

1

n log(n)

where according to our analysis δ needs to be bounded by

δ <

√4(√

2− 1)σ2r

11r(σ2r − µε2 − 2ε)

Since σr ≤ σ1 and r, k ≥ 1 we have that δrσ2

1< 1 and δ

rσ21. σ2

r

σ21r

3/2 then the number ofmeasurements is bounded below by

m ≥ Cr3/2κ(Z)2n log(n)

where κ(Z) = σ1

σris, as before, the condition number of the matrix Z.

5.5 Deviation via Hanson-Wright and Generic Chaining

The proof of convergence of the Stochastic Gradient Descent Algorithm 3 hinges on theconditions of Lemma 79 being satisfied. In particular we require a deviation inequality of theform

‖∇Zf(Z,A)−∇Zf(Z)‖F = ‖∇Zf(Z,A)− E∇Zf(Z,A)‖F < C‖Z − Z‖F

to hold with high probability on the GOE matrix A.As a disclaimer it should be mentioned that while we will show how to derive such an

inequality, our result seems to demand far too much measurements for an applicable recoveryresult. We include the proof nonetheless hoping that future research into this matter willbenefit from it.

112

Page 115: Weighted l1-Analysis minimization and stochastic gradient

Our main result reads as follows:

Theorem 81. Let Z ∈ Rn×k. The target function f(Y,A) : Rn×k → R, Z 7→ 14 (tr(Y tAY )−

tr(ZtAZ))2, where A ∈ Rn×n is a random matrix sampled as a GOE, obeys

P[‖∇Zf(Z,A)−∇ZEf(Z,A)‖F ≥ CZ,Knk(1 + t)

]≤ 2e−nkt (5.19)

for t ≥ 0 where K ≥ maxi,j ‖ai,j‖ψ2and CZ,K = σ1(Z)K

c ‖ZZt − ZZt‖F .

Remark 82. Since the initial Z0 is obtained by spectral initialization, see Definition 66, eachZi depends stochastically on all the measurement matrices Ai. Therefore, the deviationinequality (5.19) does not hold for the iteration matrices Zi. If, however, we were to use adifferent initial point, i.e., Z0 = 0 ∈ Rn×k or an initial guess by an oracle, we would havestochastic independence of Ai and Zi since the latter is only dependent to the Aj , j < i. Inthis case, using a union bound on the estimate from Theorem 81 we have that with probabilityexceeding 1− 2me−nkt

‖∇Zf(Zi, Ai)−∇ZEf(Zi, Ai)‖F ≤ CZi,Knk(1 + t), i = 1 . . . ,m (5.20)

holds where, according to Theorem, 81

CZ,K =σ1(Z)K

c‖MZZt −MZZt‖2.

The absolute constants c,K do not have an impact on our findings. Recalling that weabbreviate σi(Z) =: σi, i = 1, . . . r, we evaluate

‖MZZt −MZZt‖2 = ‖ZZt − ZZt‖F ≤ ‖HZt

+ ZtH +HHt‖F ≤

(2σ1‖H‖F + ‖H‖2F

)= (2σ1 + ‖H‖F ) ‖H‖F

which thus yields

P

(‖∇f(Z,A)−∇Ef(Z,A)‖F ≥ Cnk(1 + t)σ1(Z) (2σ1 + ‖H‖F ) ‖H‖F

)≤ 2e−nkt.

Applying a union bound we have that with probability exceeding 1− 2me−nkt for all Zi, i =

1 . . . ,m

‖∇f(Zi, Ai)−∇Ef(Zi, Ai)‖F ≤ Cnk(1 + t)σ1(Zi)(2σ1 + ‖Hi‖F

)‖Hi‖F . (5.21)

We see that the bound on the deviation is of order ‖Zi‖3F which is no surprise since theGradient ∇f(Zi, Ai) = (tr(ZitAiZi)− tr(ZtAiZ))AiZi is of the same order.

Now we turn back to the proof of Theorem 78 where by the approximation estimate (5.17)we obtained a convergence result for the stochastic gradient descent Algorithm 3 which lacked

113

Page 116: Weighted l1-Analysis minimization and stochastic gradient

a deviation inequality the type of which Theorem 81 provides, e.g., the inequalities (5.19)and (5.20) respectively.

Lemma 83. Let δ <√

2(√

2−1)3r σ2

r and the assumptions from Theorem 78 hold and assume

‖∇f(Z,A)−∇Ef(Z,A)‖F ≤ Cnk(1 + t)σ1(Z) (2σ1 + ‖H‖F ) ‖H‖F

holds for all Z ∈ Rn×k. Set

G(‖Hi‖F ) :=

(1− 2µ

(σ2r −

11

4‖Hi‖

))+ µ2C2(nk)2(‖Hi‖F + ‖Z‖F )2(1 + t)2

(2σ1 + ‖Hi‖F

)2+ 2µCnk(1 + t)(‖Hi‖F + ‖Z)‖F

(2σ1 + ‖Hi‖F

)and assume G(‖H0‖2F ) < 1. Then

d(Zi, Z)2 ≤ G(‖H0‖F )i‖H0‖2F .

Remark 84. The convergence result from Lemma 83 hinges on the fact that G(‖H0‖2F ) < 1.Originally, we have ‖H0‖2F ≤ rδ2

(√

2−1)σ2r

with probability at least 1−mCe−ρn − 2n2 for some

absolute constants C, ρ by Theorem 70. Using Z0 = H0 + Z interferes with the deviationinequality for the gradient (5.19) as we mentioned beforehand since then Zi stochasticallydepends on each of the Ai. However, it is safe to assume that the estimate ‖H0‖2F ≤ rδ2

(√

2−1)σ2r

is the best we can hope for for any kind of initialization which is why we will use it as anestimate for the quality of the initial matrix Z0.

According to Lemma 79, we have convergence as soon as

0 <µ

σ2r −

11rδ2

4(√

2− 1)σ2r

− µC2(nk)2(1 + t)2(‖H‖F + ‖Z‖F )2

(σ1 +

√rδ2

(√

2− 1)σ2r

)2

− 2Cnk(1 + t)(‖H‖F + ‖Z‖F )

(σ1 +

√rδ2

(√

2− 1)σ2r

)]‖H0‖2F

Even if µ is small enough so that, for the time being, we may neglect the summand µC2(nk)2(1+

t)2(‖H‖F + ‖Z‖F )2(σ1 +

√rδ2

(√

2−1)σ2r

)2

, the inequality above can only hold if

σ2r ≥

11rδ2

4(√

2− 1)σ2r

+ 2Cnk(1 + t)(‖H‖F + ‖Z‖F )

(σ1 +

√rδ2

(√

2− 1)σ2r

)

which even for small values of δ, ‖H‖F and the constant C implies

σ2r > 2Cnkσ1‖Z‖F ≥ 2Cnkσ2

1

114

Page 117: Weighted l1-Analysis minimization and stochastic gradient

which already does not hold for the rank 2 matrix

Z =

1 0 0 . . .

0√Cnk 0 . . .

0 0 0 . . ....

......

. . .

if Cnk < 1 and if Cnk ≥ 1 by Z =

1 0 0 . . .

0 1 0 . . .

0 0 0 . . ....

......

. . .

.

In summary, while we are able to show a deviation inequality for the gradient in Theorem 81,this inequality needs to be improved either by a refinement of our strategy or a completelynew approach to yield a convergence result for the stochastic gradient descent Algorithm3 like outlined in Lemma 79. Although we did not find such an inequality nor a suitablealternative approach, we will present numerical evidence in Section 5.6 that on the one handAlgorithm 3 does converge and that on the other hand a better version of the deviationinequality exists.

Since it may benefit future research into this topic, we provide a detailed proof of thedeviation inequality (5.19). Our strategy employs the Hanson-Wright Inequality:

Theorem 85. [Hanson-Wright-Inequality [84, Theorem 1.1]] Let (X1, . . . , Xn) ∈ Rn be arandom vector with independent, centered, subgaussian components components Xi, that iseach if the Xi satisfies EXi = 0 and

‖Xi‖ψ2:= inf

s > 0 : E exp

(Xi

s

2

≤ 2

)≤ K. (5.22)

Let M ∈ Rn×n, then for every t ≥ 0

P(∣∣XtMX − EXtMX

∣∣ > t)≤ 2 exp

(−c

t2

K4‖M‖2F,

t

K2‖M‖2

).

Note that ‖ · ‖ψ2as in Condition (5.22) is a norm and finiteness of ‖Xi‖ψ2

is equivalentto Xi being a subgaussian random variable.

Remark 86. From now on we denote by vec(A) the column vector formed with the entries ofthe upper triangle part of the matrix A , i.e.,

vec(A)1 = A1,1, vec(A)2 = A1,2, vec(A)3 = A2,2, . . . .

We require this vectorization due to the formulation of Theorem 85 where what is denotedby X in the Hanson-Wright-Theorem takes the role of the GOE matrices A from Definition67. These, by our requirements, are symmetric and therefore the entries in the upper andlower triangle are not independent but redundant. Firstly, we rewrite

P [‖∇Zf(Z,A)−∇Zf(Z)‖F ≥ t] = P

[sup‖U‖F≤1

|〈∇Zf(Z,A), U〉F − 〈∇Zf(Z), U〉F | ≥ t

]

115

Page 118: Weighted l1-Analysis minimization and stochastic gradient

= P

[sup‖U‖F≤1

∣∣vec(A)tPU,Zvec(A)− Evec(A)tPU,Zvec(A)∣∣ ≥ t]

where PU,Z is the matrix realizing the identity

vec(A)tPU,Zvec(A) = 〈∇Zf(Z,A), U〉F = tr(ZtAZ − ZtAZ)〈AZ,U〉F= tr(ZZtA− ZZtA)tr(ZtAU) = tr((ZZt − ZZt)A)tr(UZtA)

where we used the cyclicality of the trace. Now we expand

tr((ZZt − ZZt)A)tr(UZtA) =

n∑i=1

n∑j=1

(ZZt − ZZt)i,jAj,in∑p=1

n∑q=1

(UZt)p,qAq,p

= vec(A)t(MZZt −MZZt)MtZUtvec(A)

where for a matrix M ∈ Rn×k we setMM to be the vectorization of

M ′i,j :=

Mi,j +Mj,i if i 6= j

Mi,i.(5.23)

This way, vec(A) only has independent, centered Gaussian entries of variance 1 or 2. Thuswe can now apply the Hanson-Wright inequality from Theorem 85.The matrix PU,Z = (MZZt −MZZt)M

tZUt is the product of two vectors and has rank 1

accordingly. Therefore ‖PU,Z‖2 = ‖PU,Z‖F = ‖MZZt −MZZt‖2 · ‖MZUt‖2. In summary

P [|〈∇Zf(Z,A), U〉F − 〈∇Zf(Z), U〉F | ≥ t]

= P[∣∣vec(A)tPU,Zvec(A)− Evec(A)tPU,Zvec(A)

∣∣ ≥ t]≤ 2 exp

(−cmin

t

K‖PU,Z‖F,

(t

K‖PU,Z‖F

)2)

=

2 exp (−tcU,Z,K) if t ≥ 1cU,Z,K

2 exp(−t2c2U,Z,K

)if t ≤ 1

cU,Z,K

where cU,Z,K := cK‖PU,Z‖F . We further estimate

‖MZUt‖2 =

∥∥∥∥1

2(ZU t + UZt)

∥∥∥∥F

≤ ‖ZU t‖F ≤ σ1(Z)‖U‖F

and thereforecU,Z,K ≥

c

K‖MZZt −MZZt‖σ1(Z)‖U‖F

for an appropriate cZ,K which still may depend on Z. We name XU = 〈∇Zf(Z,A), U〉F −〈∇ZEf(Z,A), U〉F which is a centered Gaussian process with U ∈ U := U ∈ Rn×k : ‖U‖F ≤

116

Page 119: Weighted l1-Analysis minimization and stochastic gradient

1. Since XU −XV = XU−V = 2XT for some T ∈ U we have

P(|XU −XV | ≥ t) ≤ 2 exp

(−cmin

1

cZ,K‖U − V ‖Ft,

(1

cZ,K‖U − V ‖Ft

)2)

(5.24)

with cZ,K :=K‖MZZt−MZZt‖σ1(Z)

c . The behavior in (5.24) is called a mixed tail.

Now we have maneuvered ourselves in the position to employ [91, Theorems 2.2.28] whichstate the following:

Theorem 87. Let T be a set endowed with two distances d1, d2. Consider a centered process(Xt)t∈T which satisfies

P (|Xs −Xt| ≥ u) ≤ 2 exp

(−min

u2

d2(s, t)2,

u

d1(s, t)

), for all s, t ∈ T, u ≥ 0. (5.25)

Then we have for all u1, u2 ≥ 0

P(

sups,t∈T

|Xs −Xt| ≥ L (γ1(T, d1) + γ2(T, d2) + u1∆(U , d1) + u2∆(U , d2))

)(5.26)

≤ L exp(−minu1, u22).

As Dirksen points out in [28], the γα functional obeys the generalized Dudley inequality(6.3) which reads

γα(T, d) ≤ Cα∫ ∞

0

(log (N (T, d, u)))1/α du.

Now we assembled all the necessary tools for the proof of Theorem 81

Proof of Theorem 81. In Theorem 87, d1 and d2 are arbitrary distances on T and for ourpurposes we have d1 = d2 = cZ,K‖ · ‖F , the metric induced by the Frobenius norm. SinceU = U ∈ Rn×k : ‖U‖F ≤ 1 we can estimate

• γ1(U , cZ,K‖ · ‖F ) via [34, Prop. C.3] which states N (U , ‖ · ‖F , t) ≤(1 + 2

t

)nk andtherefore

γ1(U , cZ,K‖ · ‖F ) ≤ C∫ ∞

0

logN (U , cZ,K‖ · ‖F , t)dt = C

∫ ∞0

logN(U , ‖ · ‖F ,

t

cZ,K

)dt

≤ n2

∫ cZ,K

0

log

(1 +

2cZ,Kt

)dt = cZ,Knk

∫ 1

0

log

(1 +

2

t

)dt

= cZ,K log

(27

4

)nk

• and γ2(U , cZ,K‖ · ‖F ) to be

γ2 (U , cZ,K‖ · ‖F ) ≤ C∫ ∞

0

√logN (U , cZ,K‖ · ‖F , t) dt

117

Page 120: Weighted l1-Analysis minimization and stochastic gradient

=√nk

∫ cZ,K

0

√log

(1 +

2cZ,Kt

)dt = 1.352

√nkcz,k

• and lastly ∆cZ,K‖·‖F (U) = maxU,V ∈U cZ,K‖U − V ‖F = 2cZ,K .

In summary, we obtain for u2 =√u1 =

√u > 0

P[‖∇Zf(Z,A)−∇ZEf(Z,A)‖F ≥ C

(1.352

√nkcZ,K + cZ,K log

(27

4

)nk

)+c(2ucZ,K +

√u2cZ,K)

]≤ 2e−u (5.27)

via Theorem 87, which is to say

P

[‖∇Zf(Z,A)−∇ZEf(Z,A)‖F ≥ CZ,Knk(1 + t)

]≤ 2e−tnk.

Finally, we can provide a proof for Lemma 83 where we follow the strategy as outlined inLemma 79.

Proof of Lemma 83. Using the estimate from Theorem 78 and ‖H0‖2F ≤ rδ2

(√

2−1)σ2r

from The-orem 70, we expand

d(Z1, Z)2 =‖H1‖2F = ‖Z0 − µ∇f(Z0, A0)− Z0‖2F=‖Z0 − µE∇f(Z0, A0)− Z0‖2F + µ2‖∇f(Z0, A0)− E∇f(Z0, A0)‖2F

+ 2µ⟨Z0 − µE∇f(Z0, A0)− Z0,∇f(Z0, A0)− E∇f(Z0, A0)

⟩F

≤‖Z0 − µE∇f(Z0, A0)− Z0‖2F + µ2‖∇f(Z0, A0)− E∇f(Z0, A0)‖2F+ 2µ‖Z0 − µE∇f(Z0, A0)− Z0‖F ‖∇f(Z0, A0)− E∇f(Z0, A0)‖F

≤(

1− 2µ

(σ2r −

11rδ

4(√

2− 1)σ2r

))‖H0‖2F

+ µ2C2(nk)2σ1(Z0)2(1 + t)2(2σ1 + ‖H0‖F

)2 ‖H0‖2F

+ 2µ

√1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)Cnk(1 + t)σ1(Z0)

(2σ1 + ‖H0‖F

)‖H0‖2F

≤[(

1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

))+ µ2C2(nk)2(1 + t)2σ1(Z0)2

(2σ1 + ‖H0‖F

)2+ 2µCnk(1 + t)σ1(Z0)

(2σ1 + ‖H0‖F

)]‖H0‖2F

≤[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)

+ µ2C2(nk)2(1 + t)2σ1(Z0)2

(2σ1 +

√rδ2

(√

2− 1)σ2r

)2

118

Page 121: Weighted l1-Analysis minimization and stochastic gradient

+ 2µCnk(1 + t)σ1(Z0)

(2σ1 +

√rδ2

(√

2− 1)σ2r

)]‖H0‖2F

≤[(

1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)+ µ2C2(nk)2(1 + t)2σ1(Z0)2 (2σ1 + σr)

2

+ 2µCnk(1 + t)σ1(Z0) (2σ1 + σr))]‖H0‖2F

where we used δ <√

2(√

2−1)3r σ2

r which implies ‖H0‖2F ≤√

23σr in the last step as demanded

in the lemma. Moreover, by the definition of Zi+1 as stated in Algorithm 3, we have that

d(Zi+1, Z)2 =‖Hi+1‖2F = ‖Zi+1 − Zi+1‖2F ≤ ‖Zi+1 − Zi‖2F = ‖Zi − µ∇f(Zi, Ai)− Zi‖2F=|Zi − µE∇f(Zi, Ai)− Zi‖2F + µ2‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖2F

+ 2µ⟨Zi − µE∇f(Zi, Ai)− Zi,∇f(Zi, Ai)− E∇f(Zi, Ai)

⟩F

≤‖Zi − µE∇f(Zi, Ai)− Zi‖2F + µ2‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖2F+ 2µ‖Zi − µE∇f(Zi, Ai)− Zi‖F ‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖F

≤(

1− 2µ

(σ2r −

11

4‖Hi‖

))‖Hi‖2F

+ µ2C2(nk)2σ1(Zi)2(1 + t)2(2σ1 + ‖Hi‖F

)2 ‖Hi‖2F

+ 2µ

√1− 2µ

(σ2r −

11

4‖Hi‖

)Cnk(1 + t)σ1(Zi)

(2σ1 + ‖Hi‖F

)‖Hi‖2F

=

[(1− 2µ

(σ2r −

11

4‖Hi‖

))+ µ2C2(nk)2σ1(Zi)2(1 + t)2

(2σ1 + ‖Hi‖F

)2+2µCnk(1 + t)σ1(Zi)

(2σ1 + ‖Hi‖F

) ]‖Hi‖2F

One should note that σ1(Zi) ≤ ‖Zi‖F ≤ ‖Hi + Z‖F ≤ ‖Hi‖F + ‖Z‖F and if we set

G(‖Hi‖F ) :=

(1− 2µ

(σ2r −

11

4‖Hi‖

))+ µ2C2(nk)2(‖Hi‖F + ‖Z‖F )2(1 + t)2

(2σ1 + ‖Hi‖F

)2+ 2µCnk(1 + t)(‖Hi‖F + ‖Z)‖F

(2σ1 + ‖Hi‖F

)we receive d(Zi, Z)2 ≤ G(‖Hi‖F )‖Hi‖2F and F is strictly monotonically increasing. Then byinduction

d(Zi, Z)2 ≤ G(‖H0‖F )i‖H0‖2F

119

Page 122: Weighted l1-Analysis minimization and stochastic gradient

5.6 Numerical Addendum

In this section we demonstrate numerically that the idea to apply stochastic gradient descentto this low rank recovery problem is working in practice. To this end, we implemented severalversions of the Stochastic Gradient Descent Algorithm 3 in Matlab and computed examplesfor different values n, k, r and Z ∈ Rn×k of rank r. The number m of measurements is thenumber of iterations as well, since the Stochastic Gradient Descent Algorithm 3 uses eachmeasurement exactly once. The matrices are normalized so that ‖H0‖F = 1 for comparison.Later in this chapter, we discuss the performance of an algorithm that reuses measurementsbut we restrict ourselves to numerical examples, since the analysis of this algorithm needs atotally different stochastic model.

All the computations for low rank matrix recovery employ a step-size µi ∼ 1i , i ∈ [m]

which proved the most useful in numerical experiments as can be seen from Figure 5.2 wherewe depict the error at each iteration plotted against the number of iterations of our Algorithm3 for step-sizes µi ∼ 1

iα , α ∈

0, 34 , 1, 2, 3

and µi ∼ 2−i. We are not aware of any results

concerning a proper analysis of the appropriate step-size.

A noticeable phenomenon occurs when µi ∼ 1i3/4

,: Either the algorithm converges af-ter very few steps compared to the maximum possible number of iterations which is m =

dr3κ(ZZt)kn log(n)e, indicating that the initialization is already close to the minimizer, orthe algorithm diverged as depicted in Figure 5.3.

In the light of Theorem 62 the number of measurements required by recovery resultswhich employed the matrix RIP (5.6), we use m = d(n+ k)κ2(Z)r log(n)e measurements forthe first set of examples in Figures 5.4b and 5.4a.

From Figures 5.4a and 5.4b we see that even a comparably small increase in the rankr of Z leads to a massive drop in reconstruction quality. Due to the large computationalcost of the calculations involved, all further examples are conducted using an initial matrixZ ∈ R25×25.

If instead we employ m = drkκ2(Z)n log(n)e measurements, as suggested by Theorem 71,we obtain the results in Figure 5.5. Again, the reconstruction quality deteriorates massivelywith increasing rank.

While Figure 5.5 shows convergence of the algorithm, the number of measurements takeninto account surely is too small to yield reconstruction. To examine this further, we conductedanother reconstruction trial, this time with m = drkn2κ2(Z) log(n)e. Due to the implicatedlarge number of measurements m, we were only able to compute examples up to rank 7 whichalready needed m = 17251163 measurements for matrices of size 25× 25.

Since the success of reconstruction depends on the rank, we also computed trials form = dr3knκ2(Z) log(n)e measurements, see Figure 5.7. Here, not even r = 7 could becomputed for lack of RAM in this case.

The drastic difference in comparison to the recovery results for the cumulative targetfunction f(Z) = 1

m

∑mi=1(tr(ZtAiZ) − tr(ZtAiZ)) which was studied in [98] can be caused

120

Page 123: Weighted l1-Analysis minimization and stochastic gradient

(a) Error for µi ∼ 1i, µi ∼ 1

i2and µi ∼ 1

i3.

(b) Error for µi ∼ 1√iand µi ∼ 1

Figure 5.2: Comparison of error d(Zi, Z) for Algorithm 3 for Z ∈ R25×25 using differentstep-sizes. The error d(Zi, Z) is plotted against the number of iterations.

by the additional error due to the deviation term as described in Remark 84. However, whileour findings do not provide a full recovery result, see Remark 84, we have convergence in our

121

Page 124: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.3: Error d(Zi, Z) for several trials of Algorithm 3 for Z ∈ R25×25 using µi ∼ 1i3/4 .

The error d(Zi, Z) is plotted against the number of iterations.

numerical experiments. To close this gap, we conducted a numerical survey of the deviation

122

Page 125: Weighted l1-Analysis minimization and stochastic gradient

(a) Error d(Zi, Z) of Algorithm 3 for Z ∈ R100×50 of rank 2, 4, 6,and 8.

(b) Error d(Zi, Z) of Algorithm 3 for Z ∈ R100×100 of rank 2, 4, 6,and 8.

Figure 5.4: Error d(Zi, Z) for stochastic gradient descent. The error d(Zi, Z) is plottedagainst the number of iterations..

where the quotients

‖∇f(Z,A)− E∇f(Z,A)‖Fd(Z, Z)

were computed for matrices Z, Z ∈ Rn×k where n ∈ 100, 200 . . . , 2000 and

123

Page 126: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.5: Convergence rates of algorithm 3 for Z ∈ R25×25 of rank 2, 3, 4, 5, 6, 7 and 8for m = drknκ2(Z) log(n)e measurements with the x-axis scaled logarithmically. Numericalfindings from [98] suggest that this number of measurements suffices for gradient descentapplied to f(Z) = 1

4m

∑i(tr(Z

tAiZt) − bi)2 to yield reconstruction of X∗ = ZZt with highprobability.

k ∈ 100, 150, . . . , 1000 for matrices Z of full rank and Z of either rank 2, 4, 6 or 8. For eachvalue of (n, k) the respective quotient was computed for 100 GOE-distributed matrices andthe resulting values were averaged, resulting in the plot from Figure 5.8a.

We only include the graph for rank(Z) = 2. The plots for ranks 4, 6 and 8 look exactlythe same. The findings from Figure 5.8 suggest that

E‖∇f(Z,A)− E∇f(Z,A)‖F ≤3

8

√nk d(Z, Z) (5.28)

for a Z ∈ Rn×k which is a huge improvement on the deviation inequality (5.20) which suggestsa scaling with a factor nk (see 5.21) instead of

√nk as in (5.28). Let D ∈ R20×10 be the

matrix where the entries are averaged deviations, i.e.,

Di,j =1

100

100∑l=1

‖∇f(Zl, Al)− E∇f(Zl, Al)‖F for Zl ∈ Rni×kj ,

where ni = i · 100, kj = j · 100 for i ∈ [20], j ∈ [10], then D is the matrix plotted in Figure5.8. If D′ ∈ R20×10 is the corresponding matrix with entires 3

8

√nikj d(Zl, Z), then our

124

Page 127: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.6: Error d(Zi, Z) of algorithm 3 for Z ∈ R25×25 of rank 2, 3, 4, 5, 6 and 7 form = drkn2κ2(Z) log(n)e measurements with the x-axis scaled logarithmically.

experiments give

‖D −D′‖F‖D′‖F

= 0.0012

unanimously of the rank of Z.A probability estimate follows from 5.28 via Markov’s inequality

P(‖∇f(Z,A)− E∇f(Z,A)‖F ≥ t

)≤ 3√nk d(Z, Z)

8tfor Z ∈ Rn×k. (5.29)

As mentioned in the beginning of this chapter, the stochastic gradient descent Algorithm3 may be refined by an additional rank restriction step:

Data: Random Matrices Ai, i = 1, . . . ,m, measurements yResult: Reconstruction x]

initialize Z0 ∈ Rn×k(see Theorem 70);while i ≤ m do

Zi+1 ← Pr(Zi − ηi∇Zf(Zi, Ai)

)i = i+ 1

endAlgorithm 4: Rank Restricted Stochastic Gradient Descent

The according operation Pr(Z) = argminrank(Z′)≤r‖Z ′ − Z‖F can be computed fast via

125

Page 128: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.7: Convergence of algorithm 3 for Z ∈ R25×25 of rank 2, 3, 4, 5 and 6 for m =dr3knκ2(Z) log(n)e measurements with the x-axis scaled logarithmically. The authors of [98]showed that this number of measurements suffices for gradient descent applied to f(Z) =

14m

∑i(tr(Z

tAiZt)− bi)2 to yield reconstruction of X∗ = ZZt with high probability.

(a) Deviation ‖∇f(Z,A)−E∇f(Z,A)‖Fd(Z−Z)

(b) Graph of (n, k) 7→ 38

√nk

Figure 5.8: Deviations for n ∈ 100, 200 . . . , 2000 and k ∈ 100, 150, . . . , 1000 for matricesZ of full rank and rank(Z) = 2.

SVD and is therefore easily implemented. Again, we conducted a numerical trial like the oneleading to Figure 5.8 and obtained Figure 5.9.

Except for minor erratic fluctuations for comparably small values of n, k, it can be assumedthat the relative deviation

‖∇f(Z,A)−∇Ef(Z,A)‖Fd(Z, Z)

126

Page 129: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.9: Deviations for n ∈ 100, 200 . . . , 2000 and k ∈ 100, 150, . . . , 1000 for rank(Z) =2 for Algorithm 4.

is constant for matrices Z, Z of rank 2. Even taking all the values into account, this compu-tation suggests that

E‖∇f(Z,A)− E∇f(Z,A)‖F ≤1

4d(Z, Z) (5.30)

for a Z ∈ Rn×k which again is a improvement on the deviation inequality (5.20) obtained viaHanson-Weight Inequality and Generic Chaining. This indicates via Markov’s inequality

P(‖∇f(Z,A)− E∇f(Z,A)‖F ≥ t

)≤ d(Z, Z)

4tfor Z ∈ Rn×k. (5.31)

Remark 88. Since the numerical findings from this section suggest that some sort of deviationinequality for the gradient of our target function f actually holds, we may outline what thisalludes to for a general convergence result:We assume

P(‖∇f(Z,A)− E∇f(Z,A)‖F ≥ t‖Z − Z‖F

)≤ 3√nk

8t(5.32)

which is likely to hold as demonstrated within our numerical findings.As shown in Lemma 79 the approximation rate for stochastic gradient descent in (5.17) is

‖Hi − µ∇Ef(Zi, Ai)‖F ≤

√1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)i‖H0‖F .

127

Page 130: Weighted l1-Analysis minimization and stochastic gradient

This was used in Remark 84 to estimate

d(Z1, Z)2 ≤‖Z0 − µE∇f(Z0, A0)− Z0‖2F + µ2‖∇f(Z0, A0)− E∇f(Z0, A0)‖2F+ 2µ‖Z0 − µE∇f(Z0, A0)− Z0‖F ‖∇f(Z0, A0)− E∇f(Z0, A0)‖F

≤(

1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

))‖H0‖2F + µ2t2‖H0‖2F

+ 2µt

√1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)‖H0‖2F

≤[(

1− µ(

2σ2r −

11rδ2

2(√

2− 1)σ2r

))+ µ2t2 + 2µt

]‖H0‖2F

=

[1− µ

(2σ2

r −11rδ2

2(√

2− 1)σ2r

− 2µt2 − 2t

)]‖H0‖2F .

Now if

2σ2r −

11rδ2

2(√

2− 1)σ2r

− µt2 − t > 0. (5.33)

we have ‖H1‖F < ‖H0‖F . The estimate from Theorem 78 yields

d(Zi+1, Z)2 =‖Zi+1 − Zi+1‖2F ≤ ‖Zi+1 − Zi‖2F= ‖Zi − µE∇f(Zi, Ai)− Zi‖2F + µ2‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖2F+ 2µ‖Zi − µE∇f(Zi, Ai)− Zi‖F ‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖F

≤(

1− 2µ

(σ2r −

11

4‖Hi‖2F

))‖Hi‖2F + µ2t2‖Hi‖2F

+ 2µt

√1− 2µ

(σ2r −

11

4‖Hi‖2F

)‖Hi‖2F

≤[(

1− µ(

2σ2r −

11

2‖Hi‖2F

))+ µ2t2 + 2µt

]‖H0‖2F

=

[1− µ

(2σ2

r −11

2‖Hi‖2F − 2µt2 − 2t

)]‖H0‖2F

and since ‖H1‖F < ‖H0‖F we have[1− µ

(2σ2

r −11

2‖H1‖2F − 2µt2 − 2t

)]<

[1− µ

(2σ2

r −11

2‖H0‖2F − 2µt2 − 2t

)].

By induction ‖Hi+1‖F < ‖Hi‖F and[1− µ

(2σ2

r −11

2‖Hi+1‖2F − 2µt2 − 2t

)]<

[1− µ

(2σ2

r −11

2‖Hi‖2F − 2µt2 − 2t

)].

128

Page 131: Weighted l1-Analysis minimization and stochastic gradient

Accordingly

‖Hi+1‖2F ≤[1− µ

(2σ2

r −11

2‖Hi‖2F − 2µt2 − 2t

)]i‖H0‖2F

as soon as (5.33) holds.

We set δ =

√2(√

2−1)11r σ2

r which yields ‖H0‖2F ≤ 211σ

2r via Theorem 70. Then it remains

to choose t such that

σ2r − µt2 − t > 0.

to guarantee convergence of the main algorithm 3. Now we abbreviate tmax := maxt, t2,then this inequality holds as soon as

tmax <σ2r

µ+ 1.

The probability estimate (5.32) is only non-trivial if t ≥ 38

√nk which implies 3µ

8

√nk ≤ σ2

r

which is why this result is not applicable to a general matrix Z ∈ Rn×k.

In summary, our findings strongly suggest that our Algorithm 3 not only converges butmoreover that a suitable deviation inequality might be available.

Another approach, which we will not develop further here, would be to reuse sampleswithin the iterations. This allows for the number of iterations to exceed the number ofmeasurements and may improve the quality of the reconstruction.

Data: Random Matrices Ai, i = 1, . . . ,m, measurements b ∈ Rm, precision εResult: Reconstruction x]

initialize Z0 ∈ Rn×k (see Definition 66);while d(Zi, Z) > ε do

choose ς ∈ [m] at randomZi+1 ← Zi − µi∇Zf(Zi, Aς) = Zi − µi(tr((Zi)tAςZi)− bς)AςZi

i = i+ 1end

Algorithm 5: Randomized Stochastic Gradient Descent

Figure 5.10 shows convergence plots for algorithm 5. We restrict ourselves to m =

drkn log(n)e and 10 times as many iterations since in this case the RAM suffices to com-pute these examples up to rank 8. Since the full analysis of this algorithm would requireus to deal with dependencies, we will not inquire this approach further since this may bequite challenging since the Zi are no longer stochastically independent of the measurementmatrices Aj .

129

Page 132: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.10: Convergence of the Randomized Stochastic Gradient Descent Algorithm 5 forZ ∈ R25×25 of rank 2, 3, 4, 5, 6, 7 and 8 for m = drkn log(n)e measurements and 10 times asmany iterations with the x-axis scaled logarithmically. The error d(Zi, Z) is plotted againstthe number of iterations

5.7 Mini-Batch Stochastic Gradient Descent

The approach from [98], where gradient descent is applied to the cumulative target functionf(Z) = 1

4m

∑mi=1(tr(ZtAiZ)−bi)2 is one end of the spectrum of possible choices of target func-

tions, where our approach to apply gradient descent to f(Z,A) = 14 (tr(ZtAiZ)− tr(ZtAZ))2

is located on the other end. The target function f takes each of the m measurements ateach iteration while our algorithm 3 only takes one single measurement into account at anyiteration.

Suppose now that we have a numberM ∈ N of measurements, wheremM , we considerthe mini-batch function fΘ(Z) which for some Θ ⊂ [M ] with ]Θ = m is defined by

fΘ(Z) =1

4m

∑i∈Θ

(tr(ZtAiZ)− bi)2. (5.34)

We propose the following algorithm:

130

Page 133: Weighted l1-Analysis minimization and stochastic gradient

Data: Random Matrices Ai, i = 1, . . . ,M , measurements b ∈ RM , batch size m,precision ε

Result: Reconstruction x]

initialize Z0 ∈ Rn×k (see Definition 91);while d(Zi, Z) > ε do

choose Θ ⊂ [M ], ]Θ = m uniformly at randomZi+1 ← Zi − µi∇ZfΘ(Zi, Ai) = Zi − µi

m

∑j∈Θ(tr((Zj)tAjZj)− bj)AjZj

i = i+ 1end

Algorithm 6: Mini Batch Stochastic Gradient Descent

Firstly we show a deviation inequality for the mini-batch target function (5.34) in orderto prove convergence of Algorithm 6.

Lemma 89. Let Ai ∈ Rn×n, i ∈ Θ be random matrices be distributed as an GOE for Θ ⊂[M ], ]Θ = m and assume for all u ∈ Rn we have that∥∥∥∥∥ 1

m

∑i∈Θ

(utAiu)Ai − 2uut

∥∥∥∥∥ ≤ δ

r‖u‖22 (5.35)

for Θ ⊂ [M ] with ]Θ = m, then

‖∇ZfΘ(Z)− E∇ZfΘ(Z)‖2F ≤δ√n

4r‖H‖F

(‖H‖2F + 3‖H‖F ‖Z‖F + 2‖Z‖2F

). (5.36)

Proof. We use

‖∇ZfΘ(Z)− E∇ZfΘ(Z)‖F = sup‖U‖F=1

〈U,∇ZfΘ(Z)− E∇ZfΘ(Z)〉

and expand

〈U,∇ZfΘ(Z)− E∇ZfΘ(Z)〉

=

⟨U,

1

4m

∑i∈Θ

(tr(ZtAiZ)AiZ − tr(ZtAiZ)AiZ − 2(ZZt − ZZt)Z)

=

⟨U,

1

4m

∑i∈Θ

[tr(ZtAiZ)Ai − tr(Z

tAiZ)Ai − 2(ZZt − ZZt)

]Z

=

⟨U,

1

4m

∑i∈Θ

[(tr(HtAiH)Ai − 2HHt

)+(

tr(Ai(HZt

+ ZHt))Ai − 2(HZt

+ ZHt))]

(H + Z)⟩

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(HtAiH)Ai − 2HHt

∥∥∥∥∥F

‖H‖F +

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(HtAiH)Ai − 2HHt

∥∥∥∥∥F

‖Z‖F

131

Page 134: Weighted l1-Analysis minimization and stochastic gradient

+

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(Ai(HZt

+ ZHt))Ai − 2(HZt

+ ZHt)

∥∥∥∥∥F

‖H‖F

+

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(Ai(HZt

+ ZHt))Ai − 2(HZt

+ ZHt)

∥∥∥∥∥F

‖Z‖F

and since HZt+ZHt is symmetric we can write it in the formWDW t, where D is a diagonal

matrix where dj,j = ±1, j = 1, . . . , n via the singular value decomposition. Let w1, . . . wn bethe columns of the matrix W ∈ Cn×n. Then∥∥∥∥∥ 1

4m

∑i∈Θ

tr(Ai(HZt

+ ZHt))Ai − 2(HZt

+ ZHt)

∥∥∥∥∥F

=

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(AiWW t)Ai − 2WW t

∥∥∥∥∥F

=

∥∥∥∥∥∥ 1

4m

∑i∈Θ

n∑j=1

dj,jwtjA

iwj − dj,j2wjwtj

∥∥∥∥∥∥F

≤√n

4

n∑j=1

∥∥∥∥∥ 1

m

∑i∈Θ

wtjAiwj − 2wjw

tj

∥∥∥∥∥ (5.37)

≤ δ√n

4r

n∑j=1

‖wj‖2 =δ√n

4r‖W‖2F =

δ√n

4rtr(WW ∗) =

δ√n

4rtr(HZ

t+ ZHt)

≤ δ√n

4r‖H‖F ‖Z‖F

where in (5.37) we used the estimate ‖ · ‖F ≤√n‖ · ‖2. Note that ‖Z‖F = ‖Z‖F is constant.

If we replace W by H in the computations above we obtain∥∥∥∥∥ 1

4m

∑i∈Θ

tr(HtAiH)Ai −HHt

∥∥∥∥∥F

≤ δ√n

4r‖H‖2F .

In summary

‖∇ZfΘ(Z)− E∇ZfΘ(Z)‖F ≤δ√n

4r‖H‖F

(‖H‖2F + 3‖H‖F ‖Z‖F + 2‖Z‖2F

).

Inequality (5.36) differs from the numeric results in Section 5.6 inasmuch as it does notinclude a factor that scales as

√nk as the deviation inequality (5.28) does.

Corollary 90. Under the conditions of Lemma 89 and assuming that 2k ≤ n we have that

‖∇ZfΘ(Z)− E∇ZfΘ(Z)‖2F ≤δ√

2k

4r‖H‖F

(‖H‖2F + 3‖H‖F ‖Z‖F + 2‖Z‖2F

). (5.38)

Proof. In (5.37) we used the inequality ‖X‖F ≤√

rank(X)‖X‖2 for a general matrix X ∈Rn×k. If now 2k < n then HZ

t+ZHt is of rank 2k at most and accordingly can be decom-

posed into WDW t for the diagonal matrix D ∈ −1, 12k and W ∈ [0,∞)n×2k. Henceforth

132

Page 135: Weighted l1-Analysis minimization and stochastic gradient

we can estimate∥∥∥∥∥ 1

4m

∑i∈Θ

tr(Ai(HZt

+ ZHt))Ai − 2(HZt

+ ZHt)

∥∥∥∥∥F

=

∥∥∥∥∥ 1

4m

∑i∈Θ

tr(AiWW t)Ai − 2WW t

∥∥∥∥∥F

=

∥∥∥∥∥∥ 1

4m

∑i∈Θ

2k∑j=1

dj,jwtjA

iwj − dj,j2wjwtj

∥∥∥∥∥∥F

≤√

2k

4

2k∑j=1

∥∥∥∥∥ 1

m

∑i∈Θ

wtjAiwj − 2wjw

tj

∥∥∥∥∥≤ δ√n

4r

2k∑j=1

‖wj‖2 =δ√

2k

4r‖W‖2F =

δ√

2k

4rtr(WW ∗) =

δ√

2k

4rtr(HZ

t+ ZHt)

≤ δ√

2k

4r‖H‖F ‖Z‖F .

This also holds for ∥∥∥∥∥ 1

4m

∑i∈Θ

tr(HtAiH)Ai −HHt

∥∥∥∥∥F

≤ δ√

2k

4r‖H‖2F

in a similar manner, which yields the claim.

Since the condition that gives an estimate for the quality of the initial Z0 as given by(5.10) or the current mini-batch version from (5.35) also gives an estimate for ‖H0‖F as inSection 5.4, we adapt our definition of Z0 to make a proper summary in the end of thissection a bit easier.

Definition 91. Let Ai ∈ Rn×n, i = 1, . . . ,M be the GOE sampling matrices and bi =

tr(ZtAiZ) for i ≤ M . Then, if W := 1M

∑Mi=1 biA

i and (λi, wi)i∈[M ] are its eigenvalues andeigenvectors ordered by decreasing magnitude of the eigenvalues, we set our initializationmatrix to be

Z0 :=[z0

1 | . . . |z0k

]where z0

i :=

√|λi|2wi.

Then, as in Theorem 70, ‖H0‖F ≤√

rδ2

(√

2−1)σ2r

. The bound on ‖H0‖F enables us to showthe following theorem.

Theorem 92. With the notation from Section 5.4, let the stepsize parameter

µ ≤ 1

8(‖H0‖2F + ‖Z‖2F )and maxδ, δ2 < σ2

r

11r4(√

2−1)σ2r

+ µ9nσ4

1

16 +3√nσ2

1

2

Let (Zi)i≥0 be the sequence generated by the Mini-Batch Stochastic Gradient Descent Al-gorithm 6 and suppose that the initial condition (5.35) holds. Then Mini-Batch Stochastic

133

Page 136: Weighted l1-Analysis minimization and stochastic gradient

Gradient Descent 6 converges as

d(Zi+1, Z) ≤[1− µ

(2σ2

r −11rδ2

2(√

2− 1)σ2r

− µ9δ2nσ41

8− 3δ√nσ2

1

)]i/2‖H0‖2F (5.39)

Proof. Firstly, we show that fΘ obeys the regularity condition (R(ε, α, β)). To this end wenotice that

EfΘ(Z) =1

m

∑i∈Θ

Ef(Z,Ai) = f(Z) = Ef(Z,A).

and thus Lemmas 75, 76 and consequently Lemma 77 remain true for Ef(Z,A) replaced byEfΘ(Z,A). We expand

d(Z1, Z)2 =‖H1‖2F = ‖Z1 − Z1‖F ≤ ‖Z1 − Z0‖F ≤ ‖Z0 − µ∇fΘ(Z0)− Z0‖2F=‖Z0 − µE∇fΘ(Z0)− Z0‖2F + µ2‖∇Θf(Z0)− E∇Θf(Z0)‖2F

+ 2µ⟨Z0 − µE∇Θf(Z0)− Z0

,∇Θf(Z0)− E∇Θf(Z0)⟩F

≤‖Z0 − E∇Θf(Z0)− Z0‖2F + µ2‖∇Θf(Z0)− E∇Θf(Z0)‖2F+ 2µ‖Z0 − E∇Θf(Z0)− Z0‖F ‖∇Θf(Z0)− E∇Θf(Z0)‖F .

We consider ‖Z0 − E∇Θf(Z0)− Z0‖F first:∥∥∥Z0 − µE∇fΘ(Z0)− Z0∥∥∥2

F= ‖H0‖2F + µ2‖E∇fΘ(Z0)‖2F − 2µ

⟨E∇fΘ(Z0), H0

⟩≤ ‖H0‖2F + µ2‖E∇fΘ(Z0)‖2F − 2µ

((σ2r −

11

4‖H0‖2F

)‖H0‖2F +

‖E∇fΘ(Z0)‖2F16(‖H0‖2F + ‖Z‖2F )

)=

(1− 2µ

(σ2r −

11

4‖H0‖2F

))‖H0‖2F + µ‖E∇fΘ(Z0)‖2F

(µ− 1

8(‖H0‖2F + ‖Z‖2F )

).

Since µ ≤ (8(‖H0‖2F + ‖Z‖2F ))−1, the second summand is non-positive so that

∥∥∥Z0 − µE∇fΘ(Z0)− Z0∥∥∥2

F≤(

1− 2µ

(σ2r −

11

4‖H0‖2F

))︸ ︷︷ ︸

=:F (‖H0‖F )

‖H0‖2F .

Note that the expression F (a) =(1− 2µ

(σ2r − 11

4 a))

is strictly monotonically increasing ina. Theorem 70 now gives the estimate

‖H0‖2F ≤rδ2

(√

2− 1)σ2r

.

Therefore we have that

‖H1‖2F ≤ F (‖H0‖F )‖H0‖2F ≤ F(

rδ2

(√

2− 1)σ2r

)‖H0‖2F

134

Page 137: Weighted l1-Analysis minimization and stochastic gradient

=

[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)]‖H0‖2F .

This entails via Lemma 89

‖H1‖2F ≤ ‖Z0 − E∇Θf(Z0)− Z0‖2F + µ2‖∇Θf(Z0)− E∇Θf(Z0)‖2F+ 2µ‖Z0 − E∇Θf(Z0)− Z0‖F ‖∇Θf(Z0)− E∇Θf(Z0)‖F

≤[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)]‖H0‖2F

+ µ2 δ2n

16r2‖H0‖2F

(‖H0‖2F + 3‖H0‖F ‖Z

0‖F + 2‖Z0‖2F)2

+ 2µ

√1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)‖H0‖2F

δ√n

4r

(‖H0‖2F + 3‖H0‖F ‖Z

0‖F + 2‖Z0‖2F).

Now we have ‖Z0‖F = ‖Z‖F and since it is save to assume that ‖H0‖2F ≤ rδ2

(√

2−1)σ2r

≤ ‖Z‖2F ,

i.e., δ < σr‖Z‖√r(√

2−1)for an appropriately chosen δ, and thus

δ√n

4r

(‖H0‖2F + 3‖H0‖F ‖Z‖F + 2‖Z‖2F

)≤ 3δ

√n

4r‖Z‖2F ≤

3δ√nσ2

1

4

which gives

‖H1‖2F ≤[1− 2µ

(σ2r −

11rδ2

4(√

2− 1)σ2r

)]‖H0‖2F + µ2 9δ2nσ4

1

16‖H0‖2F + µ

3δ√nσ2

1

2‖H0‖2F

=

[1− µ

(2σ2

r −11rδ2

2(√

2− 1)σ2r

− µ9δ2nσ41

8− 3δ√nσ2

1

)]︸ ︷︷ ︸

:=Q(δ)

‖H0‖2F . (5.40)

Note that the function Q(δ) is strictly monotonically increasing in δ for δ > 0. Accordingly,we obtain ‖H1‖F < ‖H0‖F , and henceforth convergence by induction, once

σ2r >

11rδ2

4(√

2− 1)σ2r

+ µ9δ2nσ4

1

16+

3δ√nσ2

1

2. (5.41)

We set δmax := maxδ, δ2 and then inequality (5.41) holds as soon as

σ2r

11r4(√

2−1)σ2r

+ µ9nσ4

1

16 +3√nσ2

1

2

> δmax.

Given such a δmax we employ Theorem 78 which yields

‖Hi+1‖2F = ‖Zi+1 − Zi+1‖2F ≤ ‖Zi+1 − Zi‖2F = ‖Zi − µ∇f(Zi, Ai)− Zi‖2F= ‖Hi − µ∇Ef(Zi, Ai)‖2F + µ2‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖2F+ 2µ

⟨Hi − µ∇Ef(Zi, Ai),∇f(Zi, Ai)− E∇f(Zi, Ai)

⟩F

135

Page 138: Weighted l1-Analysis minimization and stochastic gradient

≤ ‖Hi − µ∇Ef(Zi, Ai)‖2F + µ2‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖2F+ 2µ‖Hi − µ∇Ef(Zi, Ai)‖F ‖∇f(Zi, Ai)− E∇f(Zi, Ai)‖F

≤(

1− 2µ(σr −11

4‖Hi‖2F )

)‖Hi‖2F

+ µ2 δ2

16σ4r

‖Hi‖2F(‖Hi‖2F + 3‖Hi‖F ‖Z‖F + 2‖Z‖F 2

)2

+ 2µδ

4σ2r

‖Hi‖2F(‖Hi‖2F + 3‖Hi‖F ‖Z‖F + 2‖Z‖F 2

)= J(‖Hi‖F )‖Hi‖2F

where

J(‖Hi‖F ) =

(1− 2µ(σr −

11

4‖Hi‖2F )

)+ µ2 δ2

16σ4r

(‖Hi‖2F + 3‖Hi‖F ‖Z‖F + 2‖Z‖F 2

)2

+ 2µδ

4σ2r

(‖Hi‖2F + 3‖Hi‖F ‖Z‖F + 2‖Z‖F 2

)which is strictly monotonically increasing in ‖Hi‖F . Since we have shown that ‖H1‖F <

‖H0‖F we have J(‖H1‖F ) < J(‖H0‖F ) and thus via induction

‖Hi+1‖2F ≤ J(‖Hi‖F )‖Hi‖2F < J(‖H0‖F )‖Hi‖2F < J(‖H0‖F )i‖H0‖2F .

We also have seen in (5.40) that J(‖H0‖F ) ≤ Q(δ) where we use the estimate on the qualityof the initial Z0 from Theorem 70 and accordingly

‖Hi‖2F ≤[1− µ

(2σ2

r −11rδ2

2(√

2− 1)σ2r

− µ9δ2nσ41

8− 3δ√nσ2

1

)]i‖H0‖2F .

Remark 93. In the proof of Theorem 92 it would have been possible to choose the parameter δfor the estimate on ‖H0‖F and the deviation inequality differently but we abstained from thisidea. For the algorithm to converge, the condition for the quality of the spectral initialization(5.10) has to hold separately for each of the sets Θ employed in the Mini Batch StochasticGradient Descent Algorithm 6. Per Theorem 71 we know that since ]Θ = m this holds withprobability exceeding 1−mCe−ρn− 2

n2 for a single, isolated Θ. Then we would need a unionbound over all Θ ⊂ [M ] : ]Θ = m where there are

(Mm

)many of those. For one single of

these Θ ⊂ [M ], ]Θ = m, we need as per Theorem 71 at least

m ≥ 42

min

δ2

r2σ41, δrσ2

1

n log(n)

measurements. Since we operate under the general assumption that δ < σr we immediately

136

Page 139: Weighted l1-Analysis minimization and stochastic gradient

have δmax < rσ21 and therefore

min

δ2max

r2σ41

,δmax

rσ21

=δ2max

r2σ41

σ2r

11r4(√

2−1)σ2r

+µ9nσ4

116 +

3√nσ2

12

r2σ41

which roughly demands at least

m > Cκ(Z)2r3n2 log(n)

measurements which is still more measurements that what we can expect as we have seennumerically in Section 5.6, especially in Figures 5.5, 5.7 and 5.6. Also, once we set m = M

we consider the Wirtinger Flow algorithm from [98] where the authors showed that m >

Cκ(Z)k3rn log(n) measurements suffice to guarantee convergence.In case that 2k < n, i.e., the additional assumptions of Corollary 90 are fulfilled, we only

need to demand

δmax <σ2r

11r4(√

2−1)σ2r

+ µ9kσ4

1

8 +3√

2kσ21

2

which yields a rough bound of

m > Cκ(Z)2r3nk log(n)

for the number of measurements.

Although we did not provide a full proof of convergence for this algorithm, we want tosummarize our findings in a single theorem.

Theorem 94. Let 0 < δ < σr where the latter is the smallest singular value of the rankr matrix Z ∈ Rn×k and let Ai ∈ Rn×n, i = 1, . . . ,M be sampling matrices distributed as aGOE, see Definition 67. Take measurements bi = tr(ZtAiZ), i = 1, . . . ,M . Moreover, let

maxδ, δ2 < σ2r

11r4(√

2−1)σ2r

+ µ9nσ4

1

16 +3√nσ2

1

2

.

Then, once (5.35) is fulfilled for all Θ ⊂ [M ] of size ]Θ = m, which demands at least

m > Cκ(Z)2r3n2 log(n) (5.42)

we have that the series of iterates Zi produced by the Mini Batch Stochastic Gradient DescentAlgorithm 6 starting with Z0 obtained via Spectral initialization, see Definition 91, convergesas

d(Zi+1, Z)2 ≤[1− µ

(2σ2

r −11rδ2

2(√

2− 1)σ2r

− µ9δ2nσ41

8− 3δ√nσ2

1

)]id(Z0, Z). (5.43)

137

Page 140: Weighted l1-Analysis minimization and stochastic gradient

Figure 5.11: Convergence of the mini-batch stochastic gradient descent algorithm 6 for Z ∈R25×25 of rank 2, 3, 4, 5, 6, 7 and 8 forM = 10drknκ(Z) log(n)e measurements and m = 100with the x-axis scaled logarithmically.

As before, we want to provide some numerical evidence of the applicability of our methods.Therefore, we computed examples for the Mini Batch Stochastic Gradient Descent Algorithm6 for for Z ∈ R25×25 as in Section 5.6 and for M = 10drknκ(Z) log(n)e and m = 100. Thesample sizeM was not chosen as in Theorem 94 since using the actualM & κ(Z)2r3n2 log(n)

would have demanded far too much RAM. Moreover, at least for a small rank we alreadyobtain convergence with M = 10drknκ(Z) log(n)e measurements

Comparing Figures 5.5, 5.6, 5.7 and 5.10 from the last Section 5.6 with Figure 5.11 whichdepicts the Convergence of the Mini Batch Stochastic Gradient Descent Algorithm 6, one cannot fail to notice that the curve depicting the reconstruction quality appears much smoother.This may be due to the averaging effect of fΘ(Z) compared to f(Z,A) on the random matricesAi, i = 1, . . . ,m.

5.8 Conclusion

In this chapter we have shown that stochastic gradient descent is a viable strategy for thetask of low rank matrix recovery for quadratic measurements that are oblivious to orthogonaltransformation. We also highlighted the problems with regard to a deviation inequality inRemark 84 but provided suitable numerical evidence that this issue can be resolved. Thedeviation inequality we obtained numerically (5.32) is not strong enough to guarantee recov-

138

Page 141: Weighted l1-Analysis minimization and stochastic gradient

ery for all matrices z ∈ Rn×k, see Remark 88. Additionally, there is no proper probabilisticmodel for the analysis of either the Randomized Stochastic Gradient Descent Algorithm 5where samples are employed several times during iterations or the Mini Batch StochasticGradient Descent Algorithm 6 which we have seen in the last section. The analysis of astochastic model for this approach is a topic for future inquiry.Although we have shown that mini-batch sampling is a way out of this problem, it remainsquestionable whether the deviation inequality obtained in Lemma 89 can be improved. More-over is the number of measurements needed to ensure proper quality of the spectral initial-ization (5.35) surely far from optimal.An open line of work yet uncovered is the question of robustness, i.e., how reconstruc-tion from noisy measurements of the form bi = tr(ZAiZ) + ei can be carried out via ei-ther stochastic gradient descent or gradient descent applied to a cumulative target functionf(Z) = 1

4m

∑i(tr(Z

tAiZ)− bi)2.

139

Page 142: Weighted l1-Analysis minimization and stochastic gradient

Chapter 6

Appendix

In this chapter we provide some general background information of the related but muchwider area of Compressed Sensing as well as introduction in fields closely related to thisdissertation such as Besov and smoothness spaces, high-dimensional probability theory andframe theory.

6.1 Tools from probability theory

Recovery results in Compressed Sensing heavily rely on probability theory and the complexityof sets such as

Σω,s := x ∈ Rd : ‖x‖ω,1 ≤ s, ‖x‖2 ≤ 1

or even its preimage under the frame Ω, that is x ∈ Rd : ‖Ωx‖ω,1 ≤ s, ‖Ωx‖2 ≤ 1.In this context, we are interested in deviation inequalities of the type

supx∈Σω,s

‖(Φ∗Φ− Id)x‖2 ≤ δ (6.1)

since this is the same as Φ possessing a restricted isometry constant δω,s ≤ δ. Φ usually is arandom matrix, like those described below in Remark 101, we are interested in the probabilitythat the event in (6.1) occurs. For most of the matrices from Remark 101 we have EΦx = x

if the matrix Φ is rescaled properly. Therefore, results often invoke deviation inequalitieswhich give bounds for deviations P(|X − EX| ≥ t), i.e.,

P(∣∣‖Φ‖22 − E‖Φx‖22

∣∣ ≥ δ) = P(∣∣‖Φ‖22 − ‖x‖22∣∣ ≥ δ)

which then should be generalized over the set of s-sparse vectors

P

(sup

x∈Σω,s

∣∣‖Φx‖22 − ‖x‖22∣∣ ≥ δ).

140

Page 143: Weighted l1-Analysis minimization and stochastic gradient

One such way to measure the complexity is the Gaussian Width

Definition 95 (Gaussian Width). For T ⊂ Rd the Gaussian width of T is

l(T ) := E supx∈T〈g, x〉

where g ∈ Rd is a standard Gaussian random vector. ♦

Since the Gaussian Width it restricted to the case of Gaussian measurements, we needanother measure of complexity which can be applied to the case of Fourier subsampling asused in Chapters 3 and 4. Therefore, we assess the complexity of a set by means of the γαfunctional:

Definition 96 (γα-functional). Let (M,d) be a metric space, T ⊂M and α > 0. Then theγα-functional is defined as

γα(T, d) = infT

supt∈T

∑n>0

2n/αd(t, Tn) (6.2)

where the infimum is taken over all admissible sequences of subsets of T , i.e., all sequencesT = (Tn)n∈N ⊂M that obey ]T0 = 1, ]Tn ≤ 22n and moreover d(t, Tn) := infs∈Tn d(t, s). ♦

The infimum over all admissible sequences is extremely hard to calculate in general whichis why Dudley’s inequality and the Dudley entropy integral are usually employed alongsidethe γα-functional.

Theorem 97 (Dudley’s inequality). In the setting described above

γα(T, d) ≤ Cα∫ ∞

0

(log (N (T, d, u)))1/α du. (6.3)

The case α = 2 is proven in [90] while the others, as Dirksen points out in [28], are similar.There are several approaches to this problem such as the Bowling Scheme [95], the Golf-

ing Scheme [40], Gordon’s Escape through the Mesh [34, Theorem 9.21] which only coversGaussian measurement matrices, Mendelson’s Small Ball Method [64] just to name a few.Since it would exceed the scope of this work by far, we will abstain from delving into furtherdetail, but for a thorough introduction we refer to [34, Sections 7, 8, 9] and the sources citedthere.

Accordingly, we restrict ourselves to the theorems which are used in the preceding chap-ters. As mentioned before, the complexity of subsets of metric spaces is an important toolfor the analysis of the RIP and NSP. From now on let (M,d) be a metric space and T ⊂M .Since Dudley’s theorem uses the covering number, we also want to introduce this quantityalongside the related packing number which are both essential to assess a set’s complexity.

Definition 98. The covering number Let (M,d) be a metric space, T ⊂M , then N (T, d, t)

is defined as the smallest integer N such that T is covered by the union of balls Bt(xi) :=

141

Page 144: Weighted l1-Analysis minimization and stochastic gradient

x ∈M : d(x, xi) ≤ t, xi ∈ T, i ∈ [N ]:

T ⊂N⋃i=1

Bt(xi). (6.4)

The set of centers xi : i ∈ [N ] is called a t-covering. Occasionally, authors only demandxi ∈ X instead of xi ∈ T .

The packing number P(T, d, t) is defined as the largest integer P such that there arepoints xi ∈ T, i ∈ [P] which are t-separated, that is d(xi, xk) > t for different i, k ∈ [P]. ♦

There are several important properties of these quantities which are proven in [34, LemmaC.2, Proposition C.3]:

Lemma 99. Let t, α > 0, then the following hold:

• N (T, αd, t) = N (T, d, t/α).

• Both quantities are linked via P(T, d, 2t) ≤ N (T, d, t) ≤ P(T, d, t).

• if d = ‖ · ‖ is a norm on Rd and T is a subset of the closed union ball B = x ∈ Rd :

‖x‖ ≤ 1 then

N (T, d, t) ≤ P(T, d, t) ≤(

1 +2

t

)d. (6.5)

As mentioned beforehand, reconstruction results for sparse recovery, i.e., solving Φx = y

by finding a minimizer of Ω-BPDN

min ‖x‖1 subject to Φx = y

use properties like the Null Space Properties from Definition 7 or the Restricted IsometryProperty from Definition RIP. Until now, there is no known way to construct matrices deter-ministically which fulfill either of these properties which is why the analysis of such recoveryor sparse approximation results usually employs random matrices.

Definition 100. A matrix Φ ∈ Cm×d is a random matrix if every entry Φi,j of Φ is a randomvariable. ♦

Remark 101. Several families of random matrices are examined in the context of CompressedSensing. Some of the more prominent families are the following:

• If all the entries of Φ take values ±1 with equal probability independently of each other,i.e., these are Rademacher variables, then Φ is called a Bernoulli random matrix.

• If all the entries of Φ are independent standard Gaussian random variables, then Φ iscalled a Gaussian random matrix.

142

Page 145: Weighted l1-Analysis minimization and stochastic gradient

• If all the entries of Φ satisfy

EΦi,j = 0 and P(|Φi,j | ≥ t) ≤ βe−κt for all t > 0, i ∈ [m], j ∈ [d] (6.6)

which is to say that each entry is a mean-zero subgaussian random variable, then Φ

is called a subgaussian random matrix. Equivalently one may define Φj,k to possess asubgaussian parameter c:

E[exp(θΦj,k)] ≤ exp(cθ2) for all θ ∈ R, j ∈ [m], k ∈ [d]. (6.7)

• Let D ⊂ RN be endowed with a probability measure ν and Φ := ϕ1, . . . , ϕd be acollection of complex-valued functions which are orthonormal with respect to ν, i.e.

∫Dϕi(t)ϕj(t) dν(t) =

0 if i 6= j

1 if i = j.

If

‖ϕj‖∞ := supt∈D|ϕj(t)| ≤ K for all j ∈ [d]

then Φ is called a bounded orthonormal system (BOS) with constant K. Thus, weconsider functions on D which take the form f =

∑di=1 ciϕi. Now take sampling points

t1, . . . , tm ∈ D and obtain measurements

yj =

m∑i=1

ciϕi(tj), j ∈ [m]

and therefore consider the sampling matrix Φi,j = ϕi(tj), i ∈ [d], j ∈ [m] which yieldsa linear equation y = Φc where y = (y1, . . . , ym)> and c = (c1, . . . , cd)

>. Examplesinclude

– Trigonometric polynomials, where D = [−1, 1]d and for k ∈ Zd we set

ϕk(t) = exp (iπ〈k, t〉)

and ν is the Lebesgue measure on D normalized by 12d. Here we have K = 1

and usually the index k is chosen from a set −l, . . . , l, resulting in multivariatetrigonometric polynomials of order ld. Furthermore, this can be generalized tocharacters of compact abelian groups where the normalized Haar measure is thecorresponding probability measure.

– Discrete orthonormal systems where we consider a unitary matrix U ∈ Cd×d withnormalized columns

√duk, k ∈ [d] which form an orthonormal system with re-

spect to the discrete uniform measure on [d] as given by ν(P ) = ]Pd for P ⊂ [d].

143

Page 146: Weighted l1-Analysis minimization and stochastic gradient

In that case ϕj(k) = uk,j and the sampling set T ⊂ [d] has independently anduniformly chosen entries t1, . . . , tm. This yields a random matrix Φ by selectingrows independently and uniformly at random from the rows of

√dU , which is to

say Φ = RT√dU , where RT is the restriction to the set T ⊂ [d], ]T = m, i.e.,

(RT z)l = ttl for l ∈ [m]. Note that it is possible that one row is selected morethan once, i.e., T contains an index several times. There is a different probabilitymodel where T is selected uniformly at random from all subsets of [d] of cardinal-ity m but these two models need a different analysis.One prominent instance of this example class is the partial discrete Fourier trans-form where U = F and

Fj,k :=1√d

exp(2πi(j − 1)(k − 1)), j, k ∈ [d]

and x = Fx is called the discrete Fourier transform of x. Since the Fast FourierTransform fft allows for fast computation of x, needing only O(d log(d)) com-putations, Φ = RTF also comes with a fast matrix-vector-multiplication. Theadjoint F∗ can also be computed equally fast.

For the sake of completeness, we include a short list of uniform and non-uniform recoveryguarantees for several of these matrix classes, most of which are from [34]:

Remark 102. This remark is intended to serve as a cursory survey of well-established recoveryresults which is why we restrict ourselves to the unweighted and frame-less regime. Similarresults also hold in the weighted or `1-analysis or `1-synthesis context prospectively and someof those have been shown in Chapters 3 and 4.

Let x ∈ Cd be s-sparse, ε > 0 and Φ ∈ Cm×d be a random matrix.

• [34, Theorem 9.16, Remark 9.17] If Φ is subgaussian with parameter c as in the momentbound (6.7), then, with probability at least 1− ε, x is the unique minimizer of ω-BP if

m ≥ 4cs ln

(2d

ε

). (6.8)

• [34, Theorem 9.2] If Φ is subgaussian with parameters β, κ as in the probability bound(6.6), then there is a constant C only depending on those two parameters such thatthe restricted isometry constant of 1√

mΦ satisfies δs ≤ δ with probability exceeding

1− 2 exp(−δ2m/(2C)) provided

m ≥ 2Cs

δ2ln

(ed

s

). (6.9)

• [34, Corollary 9.34]. Let Φ be a Gaussian matrix, then with probability 1 − ε every

144

Page 147: Weighted l1-Analysis minimization and stochastic gradient

z ∈ C is approximated by a minimizer of Ω-BP (with Ω = Id for now) if

m > 2s(1 + ρ−1)2 ln

(eN

s

)for a small ratio of N

s (the precise condition is far more complicated without offeringmuch more insight). The reconstruction quality can be estimated as

‖z − z]‖1 ≤2(1 + ρ)

1− ρσs(z)1. (6.10)

• [34, Theorem 12.31] Let Φ be a random sampling matrix associated to a boundedorthonormal system. Then 1√

mΦ possesses a RIP-constant of at most δ with probability

1− d− ln(d) as soon as

m ≥ CK2δ−1s ln4(d). (6.11)

• [34, Theorem 12.20] Let Φ be a random sampling matrix associated to a boundedorthonormal system. Then x is the unique minimizer of ω-BP with probability at least1− ε once

m ≥ CK2s ln(d) ln

(1

ε

)(6.12)

holds true.

• [34, Theorem 12.22] For arbitrary z ∈ Cd and Φ a random sampling matrix associatedto a bounded orthonormal system with bound K > 0, let y = Φz+e and z] the solutionof ω-BPDN with η ≥ ‖e‖2√

m. If

m ≥ CK2s ln(d) ln

(1

ε

)(6.13)

then with probability exceeding 1− ε the reconstruction satisfies

‖z − z]‖2 ≤ C1σs(x)1 + C2

√sη.

The constants C,C1 and C2 are universal in all the statements above but may differ dependingon the theorem.

145

Page 148: Weighted l1-Analysis minimization and stochastic gradient

6.2 Besov Spaces, Smoothness Spaces, wavelets and shear-

lets

Chapters 3 and 4 are also concerned to extend the theory of sparse recovery to subspaces ofL2(Rd), the space of square-integrable functions on Rd. One important class of approximationor interpolation spaces for L2(Rd) are the Besov spaces Bp,ql (Rd).

Definition 103 (Besov Space). [96, Definition 2.4] We denote by S ′(Rd) the space oftempered distributions, the dual of the Schwartz class S(Rd).

• Let θ : R→ R be compactly supported in[−1,− 1

2

]∪[

12 , 1]such that∑

j∈Zθ(2−jx) ≡ 1 almost everywhere in x ∈ R

and for x ∈ Rd ν(x) =∏di=1 θ(xi). Then such a ν is called a tensorized dyadic decom-

position of unity.

• The Besov spaces Bp,ql (Rd) with 0 < p, q <∞ and l > 0 are defined as

Bp,ql (Rd) :=

f ∈ S′(Rd) : ‖f‖l,p,q :=

∑v∈Nd0

2lq‖v‖1‖F−1[νjFf ]‖qp

1/q

<∞

(6.14)

where (νv)v∈Nd0 is a tensorized dyadic decomposition of unity.

This definition is equivalent to its classical counterpart via the modulus of continuity :

Definition 104. Let 1 ≤ p, q < ∞, ∆hf(x) = f(x − h) − f(x) and define the modulus ofcontinuity by ω2

p(f, t) = sup|h|≤t ‖∆2hf‖p. Let n ∈ N and s = n+ a for some 0 < a ≤ 1. The

the Besov space Bp,qs (R) is the set of all weakly differentiable functions f ∈ Wn,p(R) fromthe Sobolev space W p,n(R) such that

∫ ∞0

∣∣∣∣∣ω2p(f (n), t)

ta

∣∣∣∣∣qdtt<∞. (6.15)

This is a normed space equipped with the norm

‖f‖′Bp,qs (R) =

(‖f‖qWn,p(R) +

∫ ∞0

∣∣∣∣∣ω2p(f (n), t)

ta

∣∣∣∣∣qdtt

)1/q

.

Here, f (n) is the n-th weak derivative of f . ♦

146

Page 149: Weighted l1-Analysis minimization and stochastic gradient

Inequality (6.15) alludes to the notion of fractional smoothness that is linked to Besovspaces: While f in the Sobolev space Wn,p is n-times differentiable almost everywhere,finiteness of (6.15) implies thatfn additionally is ’a-times’ differentiable.

Coorbit space theory offers the possibility to characterize Besov spaces by means ofwavelets. We only will give the most basic definitions and for a thorough introduction werefer the reader to [31,77,79].

Definition 105. [79, Lemma A.2] Suppose that we have a multi-resolution analysis in L2(R)

with scaling function ϕ0 and associated wavelet ϕ1. Let E = 0, 1d, c = (c1, . . . , cd) ∈ E,and ϕc = ⊗di=1ϕ

ci . Then the system

ϕ0k

:= ϕ0(· − k) : k ∈ Zd∪

ϕcj,k

(x) := 2‖j‖1d/2d∏i=1

ϕci(2jixi − ki) : c ∈ E \ 0, j ∈ Nd0, k ∈ Zd

consists an orthonormal basis of L2(Rd). ♦

Theorem 106. Let 1 ≤ p, q <∞, l ≥ 1 and bp,ql be the space of sequences λ = (λj,k)j∈Ndo,k∈Zd

for which

‖λ‖p,q,l := ‖λ(0,·)‖q +

∑j∈Nd0

2‖j‖1(dq2 +lq−pd)‖λ(j,·)‖qp

1q

<∞. (6.16)

Then we have that f ∈ Bp,ql (Rd) if it can be represented as

f =∑c∈E

∑k∈Zd

λc0,kϕc0,k

+∑

c∈E\0

∑j∈Nd0

∑k∈Zd

λcj,kϕcj,k

and the sequences λc := (λcj,k)j∈N0,k∈Zd belong to bp,ql for every c ∈ E. Here, λ0j,k = 0 if

j > 0.Conversely, if f ∈ Bp,ql (Rd), then the sequences λc belong to bp,ql for all c ∈ E where

λcj,k := 〈f, ϕcj,k〉, j ∈ N0, k ∈ Zd.

Moreover, summing over all c ∈ E in (6.16) yields a norm on Bp,ql (Rd):

f ∈ Bp,ql (Rd)⇔∑c∈E‖λc(0,·)‖q +

∑c∈E\0

∑j∈Nd0

2‖j‖1(dq2 +lq−pd)‖λ(j,·)‖qp

1q

<∞

Proof. This result is a combination of [79, Theorem 5.8 and equation 5.13] invoking Definition105. Note that in the case of Besov spaces, the according sequence spaces are simply weightedsequence spaces of the type `pω ⊗ `

qω′ and the finite sequences are dense in those spaces.

This theorem was extended in [3] to also include the cases 0 < p, q ≤ ∞.

147

Page 150: Weighted l1-Analysis minimization and stochastic gradient

Theorem 107. Let 0 < p, q ≤ ∞, l > 0 and r ∈ N such that r > maxl, 2d

p + d2 − l

.

Assume that the wavelet functions ϕ ∈ Cr(Rd). Then Theorem 106 holds unchanged informulation. Moreover, if we set λcj,k := 〈f, ϕc

j,k〉 for c ∈ E, j ∈ Nd0, k ∈ Zd an f ∈ Bp,ql (Rd),

the expression

‖f‖Bp,ql (Rd) :=∑c∈E‖λc(0,·)‖q +

∑c∈E\0

∑j∈Nd0

2‖j‖1(dq2 +lq−pd)‖λ(j,·)‖qp

1q

is a norm for 1 ≤ p, q ≤ ∞ with the usual modification for p = ∞ or q = ∞, and apseudo-norm for 0 < p, q < 1.

Proof. This is a direct consequence of [3, Theorem 8].

An important application of Compressed Sensing based reconstruction techniques is imagerestoration in Magnetic Resonance Imaging (MRI) or Computerized Tomography (CT). Thelatter is discussed at length in Chapters 3 and 4. In this context, Besov spaces will takethe role of model classes for such images. Since images and pictures can easily be modeledby functions which are only supported on a bounded domain D ⊂ Rd, we restrict ourselvesmostly to these Besov spaces:

Bp,ql (D) :=f ∈ D′(D) : there is g ∈ Bp,ql (Rd) such that g|D ≡ f

(6.17)

where D′(D) is the set of distributions of the set of test functions on D. It is known thatBp,ql (D) is a (quasi)normed space with the with the according (quasi)norm ‖f‖Bp,ql (D) :=

infg|D≡f ‖g‖Bp,ql , where the infimum is taken over all g ∈ Bp,ql (Rd). Again, these Besovspaces can be characterized via wavelets.

Theorem 108. For j ∈ Nd0, k ∈ Zd, let 1 ≤ p, q <∞ and Qj,k :=∏di=1 2−j [ki − 1, ki + 1] ⊃

supp(ϕcj,k) and for j ∈ Nd0 set ADj =k ∈ Zd : Qj,k ∩ D 6= ∅

. Define

bp,ql (D) :=λ = (λj,k)j∈Nd0 ,k∈ADj

⊂ C : ‖λ‖bp,ql (D) <∞

(6.18)

where

‖λ‖bp,ql (D) :=

∑j∈Nd0

2‖j‖1(dq2 +lq−pd)

∑k∈ADj

|λj,k|pq/p

1/q

. (6.19)

Then we have that f ∈ Bp,ql (D) if it can be represented as

f =∑c∈E

∑k∈Zd

λc0,kϕc0,k

+∑

c∈E\0

∑j∈Nd0

∑k∈Zd

λcj,kϕcj,k.

and the sequences λc := (λcj,k)j∈Nd0 ,k∈Zd belong to bp,ql (D) for every c ∈ E. Here, λ0j,k = 0 if

148

Page 151: Weighted l1-Analysis minimization and stochastic gradient

j ∈ Nd.Conversely, if f ∈ Bp,ql (D), then the sequences λc belong to bp,ql (D) for all c ∈ E where

λcj,k := 〈f, ϕcj,k〉, j ∈ Nd0, k ∈ Zd.

Moreover, (6.19) is a norm on Bp,ql (D).

Again, the proof is a combination of [79, Theorem 5.8, and equation 5.13] invoking Defi-nition 105.

Accordingly, ‖f‖Bp,ql (D) :=∑c∈E ‖λc‖bp,ql (D) with λc = (λcj,k)j∈N0,k∈Zd as above is what

will be employed from now on as the norm of a Besov space for the analysis. Since we

restrict ourselves to the diagonal p = q, the weights are chosen as ωj := 2‖j‖1p( d

2+l−1)

2−p forsome parameters l and p.

On R there is a different, more concise characterization result

Theorem 109. [43, Corollary 9.1] Let the father wavelet φF fulfill

∑k∈Z|φF (ξ + 2πk)|2 = 1 and φF (ξ) = ϕF

2

)m0

2

)

almost everywhere, where m0 is a 2π-periodic L2(0, 2π) function. Moreover, the followinghold:

• There exists a bounded, non-increasing function Φ such that∫Φ(|x|) dx <∞ and |φF (u)| ≤ Φ(|u|)

almost everywhere

• and ∫Φ(|x|)|x|N+1 dx <∞

almost everywhere.

Then, for 1 ≤ p, q <∞, f ∈ Bp,ql (R) if and only if

‖f‖p,q,l := ‖λF ‖p +

∑j≥n

(2j(l+1/2−1/p)‖λM,j‖p

)q1/q

<∞ (6.20)

where

λF,k = 〈f, TkϕF 〉 and λM,j,k = 〈f, TkD2jϕM 〉

for the mother wavelet ϕM .

149

Page 152: Weighted l1-Analysis minimization and stochastic gradient

This result is used in Chapter 3 where we are interested in the diagonal p = q where thecharacterization reads

f ∈ Bp,pl (R)⇔ ‖λF ‖p +

∑j≥0

2j(lp+p/2−1)‖λM,j‖pp

1/p

<∞ (6.21)

in the notation of Theorem 109.

Moreover, since there are smooth embeddings Bp,ql (Rd) → L2(Rd), convergence resultsobtained in the weighted sequence spaces extend to the Hilbert space L2(Rd).Smoothness in this context can be understood as a measure of well-behavedness. Thesmoother a function is, the higher approximation rates can be achieved using methods ofconvex optimization such as Ω-BP since the weights increase faster with the index j.

6.2.1 Shearlet Frames

Wavelets are predominantly famous for their ability to detect irregularities or singularities infunctions while simultaneously consisting a generating system for L2(Rd). Since for detectionor approximation purposes the former features are more helpful, efforts to conceive newfunction systems which feature these properties predominantly have been going on for years.The latest addition to the family of “-lets”, including curvelets, contourlets and ridgeletsamongst others, are shearlets see [60,62,63] of which, by now, again plethora of variants areavailable.Shearlets are known to detect curvatures and irregularities of functions as shown in [56], aswell as to possess the nearly optimal best s-term approximation rate for cartoon-like functions,see [42,57] and the defining (6.36) later in this chapter. Also, both the aforementioned sourcescontain a thorough introduction into the theory of shearlets. Since they are understood bestin the two-dimensional setting and we only employ them for image reconstruction, we restrictourselves to this case.As [63, Prop. 4.3] suggests, Smoothness spaces offer a suitable, easy way to employ thesparsifying qualities of shearlets for the reconstruction of functions which are expressed bytheir coefficients in a wavelet basis. For a proper introduction to the theory and setting thereader is referred to [63] of which we want to summarize the most important tools here:

Remark 110. Every frame (ψλ)λ∈Λ ⊂ H for some countable index set Λ and Hilbert spaceH induces two operators,

• the analysis operator T : H → `2(Λ), f 7→ (〈f, ψλ〉)λ∈Λ and the

• synthesis operator T ∗ : `2(Λ) → H, (cλ)λ∈Λ 7→∑λ∈Λ cλψλ which is the adjoint of the

analysis operator.

Since A‖f‖2H ≤ ‖Tf‖`2(Λ), T is invertible so that there is a family of functions(ψλ

)λ∈Λ

such

150

Page 153: Weighted l1-Analysis minimization and stochastic gradient

that

f =∑λ∈Λ

〈f, ψλ〉ψλ.

The two operators are concatenated to the frame operator S = T ∗T , i.e.

Sf =∑λ∈Λ

〈f, ψλ〉ψλ (6.22)

and therefore 〈Sf, f〉 =∑λ∈λ |〈f, ψλ〉|2. As a consequence

A‖f‖2 ≤ 〈Sf, f〉 ≤ B‖f‖2.

Accordingly, S is a positive and self-adjoint operator. The elements of the canonical dualframe are the defined as ψ†λ := S−1ψλ. If we no take g =

∑λ∈λ〈f, ψλ〉ψ

†λ we obtain

g =∑λ∈λ

〈f, ψλ〉S−1ψλ = S−1∑λ∈λ

〈f, ψλ〉ψλ = S−1Sf = f = SS−1f

=∑λ∈Λ

〈S−1f, ψλ〉ψλ =∑λ∈Λ

〈f, S−1ψλ〉ψλ =∑λ∈Λ

〈f, ψ†λ〉ψλ.

In summary, f =∑λ∈Λ〈f, ψλ〉ψ

†λ. There are examples of frames with several dual frames.

A frame is called a Parseval frame if ‖f‖2 =∑λ∈Λ |〈f, ψλ〉|2 and in this case ψ†λ = ψλ.

Now we go forth to construct several frames.

Definition 111 (Admissible Covering). A family Qλλ∈Λ of measurable subsets Qλ ⊂ Rd

is called an admissible covering if Rd = ∪λ∈ΛQλ and supλ∈Λ ]µ ∈ Λ : Qλ ∩Qµ 6= ∅ <∞.♦

The quantity n0 := supλ∈Λ ]µ ∈ Λ : Qλ ∩ Qµ 6= ∅ is occasionally called the height ofthe covering.

Definition 112 (Bounded Partition of Unity). Given an admissible covering Qλλ∈Λ, acorresponding bounded partition of unity (BAPU) is a family of functions (ϕλ)λ∈Λ such that

• supp(ϕλ) ⊂ Qλ

•∑λ∈Λ ϕλ ≡ 1

• supλ∈Λ |Qλ|1/p−1‖F−1ϕλ‖p <∞ for all p ∈ (0, 1].

Remark 113. The last point of Definition 112 ensures that the functions φλ are boundedFourier multipliers on band-limited functions. Accordingly, we define the multiplier Tλ as-

151

Page 154: Weighted l1-Analysis minimization and stochastic gradient

sociated to ϕλ as

Tλ : f 7→ F−1(ϕλFf)

and compute

‖Tλf‖pp = ‖F−1(ϕλ) ∗ F−1(f)‖pp ≤ |Qλ − supp(f)|1−p∫Rd

∫Rd|F−1ϕλ(x)||f(y − x)| dx dy

= |Qλ − supp(f)|1−p∫Rd|F−1ϕλ(x)|

∫Rd

∫Rd|f(y − x)| dy dx

= |Qλ − supp(f)|1−p‖F−1φλ‖pp‖f‖pp

where for two sets A,B the difference is A−B := a− b : a ∈ A, b ∈ B.

Definition 114 (Decomposition spaces). Let Q := Qλλ∈Λ be an admissible covering and(ϕλ)λ∈Λ a corresponding BAPU. Let Y be a solid sequence (quasi-)Banach space on Λ,meaning that whenever |aλ| ≤ |bλ| holds for all λ ∈ Λ we have ‖a‖Y ≤ ‖b‖Y , such that thespace of finite sequences over Λ, `0(Λ) is dense in Y . Then we define the decomposition spaceD(Q, Lp, Y ) as the set of f ∈ S ′ satisfying

‖f‖D(Q,Lp,Y ) :=∥∥‖ϕλf‖pλ∈Λ

∥∥Y<∞. (6.23)

Let Q = Qi : i ∈ I be an admissible covering and ω : Rd → R a strictly positivefunction. ω is called Q-moderate if there exists a constant C independent of i ∈ I such thatfor all x, y ∈ Qi we have ω(x) ≤ Cω(y).For a (quasi-)Banach sequence space Y over an index set Λ and a sequence of weights v =

(vλ)λ∈Λ we define Yv := (yλ)λ∈Λ : (vλyλ)λ∈Λ ∈ Y . A sequence of weights (vi)i∈I is calledQ-moderate if it is derived from a Q-moderate weight function, i.e., vi = ω(xi), i ∈ I for somexi ∈ Qi. As is turns out, this definition is, to some extent, independent of the actual choiceof the BAPU [63, Theorem 2.1], which is not entirely obvious. Moreover, it is desirable thatthe decomposition space is stable under minor geometric modifications of the covering Q. Tothis end, we need to give some basic definitions and properties of decomposition spaces.

Definition 115. Let Qλλ∈Λ be an admissible covering of Rd, (ϕλ)λ∈Λ be a correspondingBAPU and M ⊂ Λ. We define

• M := λ ∈ Λ : there is some µ ∈ M such that Qµ ∩ Qλ 6= ∅. Inductively, we define

M0 := M and Mn :=˜Mn−1. For singletons λ ⊂ Λ we sometimes may omit the curly

brackets, i.e., a = a.

• Qλ := ∪µ∈ΛQµ and ψλ :=∑µ∈Λ ψµ

• Let P := Pθθ∈Θ another admissible covering of Rd. Q is called subordinate to P, iffor every λ ∈ Λ there is some θ ∈ Θ such that Qλ ⊂ Pθ. Q is called almost subordinate

152

Page 155: Weighted l1-Analysis minimization and stochastic gradient

to P, written Q ≤ P, if there exists an k ∈ N0 such that Q is subordinate to Pθkθ∈Θ.

If P ≤ Q and Q ≤ P then the two coverings are called equivalent, written Q ∼ P.

• A strictly positive function w : Rd → R+ is called Q-moderate, if there is some absoluteconstant C > 0 such that w(x) ≤ w(y) for all x, y ∈ Qλ and any given λ ∈ Λ.

• A strictly positive Q-moderate weight on Λ is a sequence vλ := w(xλ) for λ ∈ Λ, wherexλ ∈ Qλ for all λ ∈ Λ and w is a Q-moderate function.

• For a solid (quasi-)Banach space Y we define its weighted version Yv := (dλ)λ∈Λ :

(vλdλ)λ∈Λ ∈ Y .

• A solid (quasi-)Banach sequence space on Λ is called symmetric if it is invariant underpermutations ρ : Λ→ Λ.

Now we turn to a far more useful but yet specialized version of admissible coverings.

Definition 116. Let P ⊂ Q ⊂ Rd be bounded and open, with P being compactly containedin Q and let T = Aλ ·+cλ : λ ∈ Λ be a family of invertible affine transformations of Rd,i.e., Aλ ∈ GLd(R) and cλ ∈ Rd for λ ∈ Λ. We define Tx := Ax + c for T = A ·+c ∈ T andQT := TQ. Then QT T∈T is called a structured admissible covering and T a structuredfamily of affine transformations if

• PT T∈T and QT T∈T are admissible coverings,

• there is some constant K ≥ 0 such that whenever AkQ+ ck ∩Ak′Q+ ck′ 6= ∅ we havethat ‖A−1

k′ Ak‖`∞(Rd×d) ≤ K.

We abbreviate |T | = |det(A)|. ♦

[8] showed that from a structured admissible covering alone one can generate severalfamilies of functions with plethora of applications:

Theorem 117. [8, Prop. 3.1] Let P ⊂ Q ⊂ Rd be bounded and open, with P beingcompactly contained in Q and let T = Aλ · +cλ : λ ∈ Λ be a structured family of affinetransformations then there exist

• a BAPU (ψT )T∈T ⊂ S(Rd) corresponding to the covering QT T∈T

• a squared BAPU, namely a system (ϕT )T∈T ⊂ S(Rd) such that

– supp(ϕT ) ⊂ QT for all T ∈ T

–∑T∈T ϕ

2T ≡ 1

– supT∈T |QT |1/p−1‖F−1ϕT ‖p <∞ for all p ∈ (0, 1].

153

Page 156: Weighted l1-Analysis minimization and stochastic gradient

• a pair of systems γT T∈T , γT T∈T ⊂ S(Rd) where γT = Φ(T−1 ·) for a fixed Φ ∈S(Rd) such that

– supp(γT ), supp(γT ) ⊂ QT for all T ∈ T

–∑T∈T γT (ξ)γT (ξ) = 1 for all ξ ∈ Rd.

– supT∈T |QT |1/p−1‖F−1γT ‖p <∞ for all p ∈ (0, 1].

– supT∈T |QT |1/p−1‖F−1γT ‖p <∞ for all p ∈ (0, 1].

Smoothness spaces are a special class of decomposition spaces.

Definition 118. Let a structured admissible covering Q be generated by a family of affinetransformations T and ω be a Q-moderate weight function. Define the weights vω,β :=(ω(bT )β

)AT ·+bT∈T

. Then the Smoothness space Sp,qβ (T , ω) is given by

Sp,qβ (T , ω) := D(Q, Lp, (lq)vω,β ). (6.24)

These definitions enable us to construct tight frames of L2(Rd):Consider a structured admissible covering QT T∈T and let Ka be the cube in Rd containingQ aligned to the coordinate axis of side-length 2a. Then we set

en,T (ξ) := (2a)−d/2√|T |χKa(T−1ξ) exp

(iπ

a〈n, T−1ξ〉

)for n ∈ Zd, T ∈ T .

and define

ηn,T := ϕT en,t for all n ∈ Zd, T ∈ T . (6.25)

This system has plethora of useful properties which we summarize in the next theorem.

Theorem 119. [8] With the notation from above we have

• [8, Prop. 3.4] The system F(T ) := ηn,T : n ∈ Zd, T ∈ T is a Parseval frame forL2(Rd).

• [8, eqn. 3.7] For every multi-index β ∈ Nd we have∣∣∣∂βξ η(ξ)

∣∣∣ ≤ Cβ(2a)d2 |T | 12χQ(ξ) for

every β ∈ Nd.

• [8, Prop. 4.3] If we normalize ηn,T in Lp, namely setting ηpn,T := |T |1/2−1/pηn,T wecan characterize the decomposition spaces for 0 < p ≤ ∞ via its normalized framecoefficients

‖f‖D(Q,Lp,Yv)

∥∥∥∥∥∥∥∑n∈Zd

|〈f, ηpn,T 〉|p

1/pT∈T

∥∥∥∥∥∥∥Yv

154

Page 157: Weighted l1-Analysis minimization and stochastic gradient

In summary, one can obtain a Parseval frame of L2(Rd) from a suitably chosen family ofaffine transformations and sets P ⊂ Q ⊂ Rd as in Definition 116.

For our own purposes, we construct a Parseval Frame from shearlets similar to the versionof [63]; we choose a fundamental function ϕ ∈ C∞(R) such that ϕ(ξ) ∈ [0, 1] and bothsupp(ϕ) ⊂

[− 1

8 ,18

]and ϕ ≡ 1 on

[− 1

16 ,116

]. Then

Ψ(ξ) = ϕ(ξ1)ϕ(ξ2)

resulting in a shearlet function ψ : R2 → R which is compactly supported in Fourier domain.The concise definitions and properties are in [63, Equations 3.7 - 3.12]. The only difference

we make is that we choose the basic dilation matrix to be A = A1 :=

(2 0

0√

2

)and A2 :=(√

2 0

0 2

), compare [63, eqn. 3.13]. The shearing matrices remain B = B1 =

(1 1

0 1

)and

B2 = RB1 where R =

(0 1

1 0

). This matrix is used to switch between the coordinate axis

for the cone-adapted shearlets. With this in mind we will only treat the case of the horizontalcone as is usually done in the literature. After suitable renormalization of all the functionsinvolved, yields a Fourier support for the shearlet functions ψa,s,t = T−tDBsAjψ of the form

supp(ψa,s,t) := Σa,s :=

ξ ∈ R2 : ξ ∈ [−2a−1, 2a−1]2 \ [−2a−3, 2a−3]2,

∣∣∣∣ξ2ξ1 − 2−a/2s

∣∣∣∣ ≤ 2−a/2.

(6.26)

This approach is cone-adapted, meaning the the sets

Ch :=

ξ ∈ R2 : |x1| ≥ 1,

ξ2ξ1∈ [0, 1]

and Cv := C :=

ξ ∈ R2 : |x2| ≥ 1,

ξ2ξ2∈ [0, 1]

are treated separately, which is the most useful approach when it comes to actual computa-tions.

As in [63, Prop. 4.1] it can be calculated that these from a structured admissible coveringwhen using the sets and affine transformations as follows:

• Let A1, A2, B1, B2 be as above and define Ta,s,m = BsAa · −t for a ≥ 0, s ∈ Z, t ∈ Z2.

• We use the same trapezoids P and Q as in [63, p. 10] with only the modification thatour vertices are adjusted to our changes in the dilation matrices, thus doubling everyentry from every vertex, i.e., our P = V ∪V − ⊂ R2 and Q = U ∪U− ⊂ R2 each consistof two disjoint sets where the exponent − indicates mirroring the respective set withrespect to the origin: V − = −V and U, V are trapezoids with vertices

–(

14 ,

14

), (1, 1) , (1,−1) ,

(14 ,−

14

)for V and

–(

18 ,

38

),(

98 ,

118

),(

98 ,−

38

),(

98 ,−

118

)for U.

155

Page 158: Weighted l1-Analysis minimization and stochastic gradient

This setting yields the following: First and foremost, we obtain a structured admissiblecovering of R2 which yields a Parseval frame from a BAPU via the results mentioned inabove. Moreover, we can construct a Parseval frame of shearlets with the support Σj,k. Ourshearlet Parseval frame constitutes of the following subsystems:

• The coarse-scale shearletsW−1,t = Φ(· − t) : t ∈ Z2

where Φ is given via Φ(ξ) =

ψ(ξ1)φ(ξ2) and ϕ ∈ [0, 1] and its support is contained in[− 1

8 ,18

]and it is ≡ 1 on[

− 116 ,

116

].

• The interior shearletsψa,s,t,h : a ≥ 0, |s| < 2a, t ∈ Z2, h = 1, 2

which have for h =

1 their Fourier-support contained in Σj,k and for h = 2 in RΣj,k

For purely theoretical purposes it suffices to consider only one system of interior shearletswhere we let a range over the whole of the integers since this already yields a covering ofR \

(0y

): y ∈ R

.

These systems, however, are not sufficient to yield a Parseval frame for L2(R2); theboundary between the cones Ch and Cv, namely the lines where either ξ1 = ξ2 or ξ1 = −ξ2cause issues which is why we have to introduce a boundary shearlet which is an amalgam ofthe horizontal and vertical shearlets.

One way of interpolating between shearlet Smoothness spaces and Besov spaces is thefollowing Proposition from [63]:

Proposition 120. [63, Section 4.5] For 0 < p ≤ ∞, 0 < q <∞ and β ∈ R we have

Bp,qβ+1/q(R2) → Sp,qβ (R2)

and

Sp,qβ (R2) → Bp,qβ−l(R2)

where l = max(

1, 1p

)−min

(1, 1

q

).

In case 0 < p = q ≤ 1 we have l = 1p − 1.

Note that since `p → `q for 0 ≤ r ≤ p ≤ q we have for 0 ≤ p ≤ 1 and 0 ≤ l ≤ l′

‖f‖Bq,ql ≤ ‖f‖Bp,pl ≤ ‖f‖Bp,pl′≤ C‖f‖Sp,p

l′−1/p+1≤ C‖f‖Sp,p

l′≤ C‖f‖Sr,r

l′≤ C‖f‖Br,r

l′+1/r

(6.27)

where C > 0 is an absolute constant.

Remark 121. The construction of a shearlet frame for L2(R2) by means of decompositionspaces was outlined above: First, we decide on a set T of affine transformations which, fora suitably chosen Q ⊂ R2, generates a structured admissible covering Q = QT : T =

(A · +c) ∈ T . This yields a BAPU via Theorem 117 and even more families of functions ofwhich we will make no use here. From this BAPU we obtain a Parseval frame which charac-terizes via equation (6.25) the according decomposition space, see Theorem 119. Applying

156

Page 159: Weighted l1-Analysis minimization and stochastic gradient

this machinery to the cone-adapted ’wedges’ of the shearlet covering, the shearlet frame isgenerated. This frame, as was pointed out in Definition 52 and proven in [63, Section 4.5],characterizes the shearlet Smoothness space using a weighted sequence space.At this stage we should point out that the shearlets Dilation Matrices, as we defined them,actually consist a subgroups of Gl2(R):

S :=

Ma,s :=

(a√as

0√a

): a ∈ (0,∞), s ∈ R

and the set Sv containing the transposed of the elements of S. This offers another approachto shearlets through coorbit space theory. Multiplication within the full shearlet group SoR2

is carried out in such a way that it is consistent with regard to the dilation and translationof functions: Let ψ : R2 → R and Ma,s · +t,Ma′,s′ · +t′ ∈ S oR2 then

DM−1a,sTtDM−1

a′,s′Tt′ψ(x) =

1√det(Ma,sMa′,s′)

ψ(M−1a,s (M−1

a′,s′(x− t′)− t)

)=

1√det(Ma,sMa′,s′)

ψ(M−1a,sM

−1a′,s′(x− (t′ +Ma′,s′t))

)=

1√det(Ma,sMa′,s′)

ψ((Ma′,s′Ma,s)

−1(x− (t′ +Ma′,s′t)))

= DM−1

aa′,√as+s′

Tt′+Ma′,s′ tψ(x).

Actually, instead of the square root any other positive exponent can be used in the definition of the M_{a,s}. The affine group S ⋉ R² has the Haar measure da/a³ ds dt, and the discrete set of affine transformations, i.e., the counterpart of T from Definition 116, is T = S ⋉ Z². So, in a sense, the shearlet smoothness space for the horizontal cone can be characterized by the voice transform

V_ψ f : S ⋉ R² → R, (a, s, t) ↦ ⟨f, ψ_{a,s,t}⟩

in the following sense, which is the subject of coorbit space theory: Consider a group G with left Haar measure µ and the spaces L^∞(G), L¹(G) or M(G), where the latter is the space of complex Radon measures on G. Take Q to be a compact neighborhood of the identity element e ∈ G; then we define the control function by

K(F, Q, B) : G → R, x ↦ ‖(L_x χ_Q)F‖_B   (6.28)

where L_x f(y) = f(x^{−1}y) is the left translation. Then we say that F is locally contained in B, i.e., F ∈ B_loc, B being one of the spaces mentioned above, if Fχ_K ∈ B for every compact K ⊂ G. Now take a solid quasi-Banach space Y of functions on G which contains the characteristic functions of all compact subsets of G. The Wiener amalgam space W(B, Y) is defined as

W(B, Y) := W(B, Y, Q) := {F ∈ B_loc : K(F, Q, B) ∈ Y}   with   ‖F | W(B, Y, Q)‖ := ‖K(F, Q, B)‖_Y.   (6.29)

Now we take an irreducible unitary representation π of G on some Hilbert space H and consider the voice transform V_g f(x) := ⟨f, π(x)g⟩ for x ∈ G and f, g ∈ H. Occasionally, g is called the window function. In terms of shearlet smoothness spaces this would read as π(x)g = π(a, s, t)ψ = ψ_{a,s,t} for x = M_{a,s}·+t ∈ S ⋉ Z². Let v ≥ 1 be a submultiplicative weight function and fix an element g ∈ H such that V_g g ∈ L¹_v(G), assuming the latter space is non-trivial. Then we define

H¹_v := {f ∈ H : V_g f ∈ L¹_v}   with norm   ‖f‖_{H¹_v} := ‖V_g f‖_{L¹_v}

and its anti-dual (H¹_v)^¬, i.e., the space of all bounded, conjugate-linear functionals on H¹_v. Then the voice transform extends to the anti-dual via

V_g f(x) = f(π(x)g),   f ∈ (H¹_v)^¬, x ∈ G.

Finally, for a suitable weight v and a window g we can define the coorbit space Co W(L^∞, Y) as

Co W(L^∞, Y) := { f ∈ (H¹_v)^¬ : V_g f ∈ W(L^∞, Y) }.   (6.30)

For more information on what 'suitable' weights and windows are, we refer to [77, Section 4]. One particular feature of coorbit spaces is that they can be characterized by a family of functions in very much the same sense as decomposition spaces can be characterized by a BAPU. To elaborate on that, for a compact neighborhood U ⊂ G of e we define that a set X = {x_i : i ∈ I} is U-dense and well spread if it is

• U-dense, i.e., G = ⋃_{i∈I} x_i U, and

• relatively separated, i.e., for every compact K ⊂ G the supremum sup_{j∈I} #{i ∈ I : x_i K ∩ x_j K ≠ ∅} ≤ C_K is bounded by a constant depending only on K.

For such a neighborhood U and sequence X we define the sequence space

Y_d := { (λ_i)_{i∈I} : ‖ Σ_{i∈I} |λ_i| χ_{x_i U} ‖_Y < ∞ }.

Then [77, Theorem 5.5] states the following: For an appropriate window g there exist a compact neighborhood U of e and a U-dense, well-spread set X such that the family of functions (π(x_i)g)_{i∈I} is an atomic decomposition of Co W(L^∞, Y), which is to say that there is a sequence (λ_i)_{i∈I} of bounded linear functionals on (H¹_v)^¬ such that

• f = Σ_{i∈I} λ_i(f) π(x_i)g for all f ∈ Co W(L^∞, Y), with convergence in the weak-∗ topology of (H¹_v)^¬ provided the finite sequences are dense in Y_d, and

• f ∈ (H¹_v)^¬ if and only if (λ_i(f))_{i∈I} ∈ Y_d, with

‖(λ_i(f))_{i∈I}‖_{Y_d} ≍ ‖f‖_{Co W(L^∞,Y)}.

Thus we see a striking resemblance between coorbit spaces and decomposition spaces: what a structured admissible covering is for the decomposition space, the U-dense, relatively separated set is for the coorbit space. Both types of spaces hinge on a proper set U or Q and characterize a function space by the behavior of the functions localized on (x_i U)_{i∈I} or (Q_T)_{T∈T}, respectively, which are only allowed to intersect non-trivially finitely many times. In this way they give rise to weighted sequence spaces that characterize the corresponding function spaces. The main difference is that coorbit spaces always come with the structure of the underlying group, whereas smoothness spaces can be defined for an arbitrary set T of affine transformations. This makes decomposition spaces more suitable for characterization results like those for the cone-adapted shearlet systems, where for each of the cones C_h and C_v one system of interior shearlets was constructed. It would have been possible to obtain a similar characterization result employing the whole shearlet group S and discretizing it later on; only that here we would have to cover the whole of R² in the Fourier domain with one single family {x_i U}_{i∈I}. This would have led to setting U to one of the trapezoids P or Q from Definition 6.2.1 on page 155, with which we would need to cover the whole of R². That is only possible if the scaling parameter a in the discrete subset of S is unrestricted: this was done in [26, Proposition 4.3], where the authors defined a set

[1/√2, √2) × [−1/2, 1/2) × [−1/2, 1/2)²

and an index set

Λ := { (ε2^a, 2^{a/2}s, S_{s2^{a/2}} A_{2^a} t) : a, s ∈ Z, t ∈ Z², ε ∈ {−1, 1} }.

With this at hand, they showed that f belongs to a shearlet coorbit space SC_{ω,p} if and only if the function can be expressed as f = Σ_{λ∈Λ} c_λ ψ_λ for a proper shearlet function ψ, which is similar to the shearlet used in this work (see [26, Lemma 4.6, Theorem 4.7] for further details), if and only if the sequence (c_λ)_{λ∈Λ} ∈ ℓ^p_ω, where the weight function is

ω(a, s, t) = ω(a, s) = (1/|a| + |a|)^m (1/|a| + |a| + |s|)^n   for some m, n > 0.

In summary, both types of spaces, shearlet smoothness spaces and shearlet coorbit spaces, have much in common; however, smoothness spaces offer a bit more flexibility, which is useful for approximating a given function and for embedding results like Proposition 120.


6.3 A brief note on Sparsity Equivalence

Lemmas 57 and especially Lemma 55 allude to the idea that a function with a sparse wavelet representation has a sparse representation in the shearlet system as well, and vice versa. This property is known as sparsity equivalence and was described in more detail in [36] for functions over R². Most of the results and definitions in this section are from [36], except where explicitly stated otherwise.

Definition 122 (Sparsity equivalence). Let 0 < p < 1 and let (m_λ)_{λ∈Λ}, (p_δ)_{δ∈∆} be frames for a Hilbert space H ⊂ L²(R²). Then (m_λ)_{λ∈Λ} and (p_δ)_{δ∈∆} are sparsity equivalent in ℓ^p if

‖ (⟨m_λ, p_δ⟩)_{λ∈Λ, δ∈∆} ‖_{ℓ^p→ℓ^p} < ∞.
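On a finite section of the cross-Gramian this quantity can be evaluated directly: for 0 < p ≤ 1 the p-triangle inequality shows that the ℓ^p→ℓ^p quasi-norm of a matrix is attained at a standard unit vector and hence equals the largest ℓ^p quasi-norm of its columns. The following minimal sketch uses this fact; the random test matrix and the chosen p are purely illustrative stand-ins, not the cross-Gramian of any concrete pair of frames from this work.

import numpy as np

def lp_operator_quasinorm(G, p):
    """l^p -> l^p quasi-norm of a finite matrix for 0 < p <= 1:
    the supremum over the unit l^p ball is attained at a unit vector e_j,
    so the quasi-norm equals the largest column l^p quasi-norm."""
    assert 0 < p <= 1
    column_norms = (np.abs(G) ** p).sum(axis=0) ** (1.0 / p)
    return column_norms.max()

# Illustrative stand-in for a finite section of the cross-Gramian (<m_lambda, p_delta>).
rng = np.random.default_rng(0)
G = rng.standard_normal((200, 200)) / 200.0
print(lp_operator_quasinorm(G, p=0.5))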

As it turns out, not only are the wavelet and the shearlet system as laid out in Remark 58 equivalent with regard to sparse representations of certain types of functions, but many of the other renowned representation systems such as ridgelets or (α-)curvelets (which also possess a kind of (anisotropic) scaling much like shearlets do, but do not have the shearing property) also possess this quality. The key concept behind the theory is the idea that all of these systems are related families of molecules on a shared parameter space.

Definition 123 (Parameterizations). We set the parameter space to

P := R⁺ × (−π/2, π/2] × R²   (6.31)

so that each (s, θ, t) ∈ P consists of a scale s, an orientation θ and a location t. Then a parameterization is a pair (Λ, Φ_Λ), where Λ is an index set and

Φ_Λ : Λ → P, λ ↦ (s_λ, θ_λ, t_λ)   (6.32)

assigns to each index λ a parameter (s_λ, θ_λ, t_λ). ♦

Since the directional parameter is an angle, the proper rotation is given by the matrix

R_θ := [ cos(θ), −sin(θ) ; sin(θ), cos(θ) ]

and the scaling via the matrix

A_{α,s} := [ s, 0 ; 0, s^α ]

for some s > 0.

Definition 124 (α-Molecules, [36, Definition 2.6]). Let α ∈ [0, 1], L, M, N₁, N₂ ∈ N ∪ {∞} and let (Λ, Φ_Λ) be a parameterization. A family of functions (m_λ)_{λ∈Λ} ⊂ L²(R²) is called a system of α-molecules with respect to the parameterization (Λ, Φ_Λ) of order (L, M, N₁, N₂) if it can be written as

m_λ(x) = s_λ^{(1+α)/2} g^{(λ)}(A_{α,s_λ} R_{θ_λ}(x − t_λ))   (6.33)

such that for all |ρ| ≤ L

|∂^ρ ĝ^{(λ)}(ξ)| ≲ min{1, s_λ^{−1} + |ξ₁| + s_λ^{−(1−α)}|ξ₂|}^M (1 + |ξ|²)^{−N₁/2} (1 + ξ₂²)^{−N₂/2},   (6.34)

where the implicit constant in the last inequality is uniform over λ ∈ Λ. If one or several of the parameters are equal to infinity, the corresponding quantity can be arbitrarily large. ♦
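To make (6.33) concrete, the following minimal sketch evaluates a single molecule m_λ on a grid for given (s_λ, θ_λ, t_λ); the Gaussian generator g is a purely illustrative choice and, of course, does not satisfy the decay conditions of (6.34).

import numpy as np

def molecule(generator, alpha, s, theta, t, grid_x, grid_y):
    """Evaluate m_lambda(x) = s^((1+alpha)/2) * g(A_{alpha,s} R_theta (x - t)) on a grid."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    A = np.diag([s, s ** alpha])
    pts = np.stack([grid_x - t[0], grid_y - t[1]], axis=-1)   # shape (..., 2)
    warped = pts @ (A @ R).T                                  # apply A_{alpha,s} R_theta pointwise
    return s ** ((1 + alpha) / 2) * generator(warped[..., 0], warped[..., 1])

# Illustrative generator: an (unnormalized) Gaussian bump.
g = lambda u, v: np.exp(-(u ** 2 + v ** 2))
xx, yy = np.meshgrid(np.linspace(-1, 1, 256), np.linspace(-1, 1, 256))
m = molecule(g, alpha=0.5, s=4.0, theta=np.pi / 6, t=(0.1, -0.2), grid_x=xx, grid_y=yy)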

[Figure 6.1: Tilings of the 2D Fourier domain for several classes of atomic decompositions [37, 87]. (a) Tiling of the Fourier domain of a bandlimited wavelet system. (b) Tiling of the Fourier domain of the prototypical system of α-molecules. (c) Tiling of the Fourier domain of the system of bandlimited shearlets.]

All '-lets' from Section 6.2.1 hinge on a certain tiling of the Fourier domain; for the shearlet system we employ, this tessellation has been described in Remark 58 and Figure 6.1. It basically remains true for highly smooth wavelets ϕ_{j,k}, since the fast decay of ϕ_{j,k} towards ∞ implies that the essential part of their support is centered somewhere around the location parameter k. This is imitated by the construction of the α-molecules by means of the demand on the functions g^{(λ)} in (6.34). If we employ this definition and express ξ = (r cos(φ), r sin(φ))ᵀ in polar coordinates, we obtain via (6.33)

|m̂_λ(ξ)| ≲ s_λ^{−(1+α)/2} min{1, s_λ^{−1}(1 + r)}^M (1 + r² min{s_λ^{−α}, s_λ^{−1}}²)^{−N₁/2} (1 + (s_λ^{−α} r sin(φ + θ_λ))²)^{−N₂/2}.

This basically states that the essential contribution to the Fourier support of the m_λ is contained in two wedges on opposite sides of the origin in direction of the angle θ_λ.

Consequently, the cone-adapted shearlet system as defined in Definition 52 is a system of α-molecules:

Theorem 125 ([36, Prop. 2.15]). The cone-adapted shearlet system of Definition 52 is a system of 1/2-molecules of order (L, M − L, ∞, ∞), where M is the number of vanishing moments of the generator W of the coarse-scale shearlets and L ∈ {0, . . . , M}, with respect to the parameterization

Φ_S : Λ → P, (a, s, t, h) ↦ ( 2^a, (h − 1)π/2 + arctan(−s2^{−a(1−α)}), B_h^{−s2^{−a(1−α)}} A_h^{−a} t ).

[Figure 6.2: Essential Fourier support of α-molecules [37, 61].]

As mentioned at the beginning of this section, the underlying principle of α-molecules is partly inspired by the idea that all these types of molecules behave similarly when it comes to sparse approximations. To compare two systems of such molecules, one must analyze the cross-Gramian in the same way as Lemmas 55 and 57 do. It is, however, useful to have a notion of similarity between two families of molecules and their accompanying parameterizations.

Definition 126. Let α ∈ [0, 1] and let (Λ, Φ_Λ), (∆, Φ_∆) be two parameterizations. The α-scaled index distance ω_α is defined as follows: For λ ∈ Λ and δ ∈ ∆ set Φ_Λ(λ) =: (s_λ, θ_λ, t_λ) ∈ P and Φ_∆(δ) =: (s_δ, θ_δ, t_δ) ∈ P and define

ω_α(λ, δ) := max{ s_λ/s_δ, s_δ/s_λ } (1 + d_α(λ, δ)),   (6.35)

where, for s₀ = min{s_λ, s_δ} and e_λ = R_{−θ_λ}e₁ = (cos(θ_λ), −sin(θ_λ))ᵀ,

d_α(λ, δ) := s₀^{2(1−α)}|θ_λ − θ_δ|² + s₀^{2α}|t_λ − t_δ|² + s₀²/(1 + s₀^{2(1−α)}|θ_λ − θ_δ|²) · |⟨e_λ, t_λ − t_δ⟩|².

While this distance also depends on the two parameterizations, we exclude them from the notation in order not to overload it. ♦
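A direct transcription of (6.35) into code reads as follows; this is a minimal sketch, and the pair of indices evaluated at the end is purely illustrative.

import numpy as np

def omega_alpha(lam, delta, alpha):
    """alpha-scaled index distance of Definition 126.
    lam, delta: triples (s, theta, t) with scale s > 0, angle theta, 2D location t."""
    s_l, th_l, t_l = lam[0], lam[1], np.asarray(lam[2], dtype=float)
    s_d, th_d, t_d = delta[0], delta[1], np.asarray(delta[2], dtype=float)
    s0 = min(s_l, s_d)
    e_l = np.array([np.cos(th_l), -np.sin(th_l)])   # e_lambda = R_{-theta_lambda} e_1
    dtheta2 = (th_l - th_d) ** 2
    dt = t_l - t_d
    d = (s0 ** (2 * (1 - alpha)) * dtheta2
         + s0 ** (2 * alpha) * np.dot(dt, dt)
         + s0 ** 2 / (1 + s0 ** (2 * (1 - alpha)) * dtheta2) * np.dot(e_l, dt) ** 2)
    return max(s_l / s_d, s_d / s_l) * (1 + d)

# Illustrative pair of indices at scales 4 and 8:
print(omega_alpha((4.0, 0.1, (0.0, 0.0)), (8.0, 0.3, (0.25, -0.5)), alpha=0.5))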

This notion of distance only compares indices of parameterizations for one and the same α, since the aim of [36] was to compare approximation results and approximation rates for molecules of the same 'class'. The main result in this context states that, given suitable smoothness and a sufficient number of vanishing moments of the atom g in the defining expressions (6.33) and (6.34), two systems of α-molecules are nearly orthogonal with respect to the α-scaled index distance:

Theorem 127 ([36, Theorem 2.17]). Let α ∈ [0, 1] and let (m_λ)_{λ∈Λ} and (p_δ)_{δ∈∆} be two systems of α-molecules of order (L, M, N₁, N₂). Furthermore, we assume that there is some constant c > 0 such that s_λ ≥ c, s_δ ≥ c for all λ ∈ Λ and δ ∈ ∆, where (s_λ, θ_λ, t_λ) := Φ_Λ(λ) and (s_δ, θ_δ, t_δ) := Φ_∆(δ), and some N ∈ N such that

L ≥ 2N,  M > 3N − (3 − α)/2,  N₁ ≥ N + (1 + α)/2  and  N₂ ≥ 2N.

Then

|⟨m_λ, p_δ⟩| ≲ ω_α(λ, δ)^{−N}   for all λ ∈ Λ, δ ∈ ∆.

This theorem then gives rise to the following definition:

Definition 128 ((α, k)-Consistency, [36, Definition 3.3]). Let α ∈ [0, 1] and k > 0. Two parameterizations (Λ, Φ_Λ) and (∆, Φ_∆) are (α, k)-consistent if

sup_{λ∈Λ} Σ_{δ∈∆} ω_α(λ, δ)^{−k} < ∞   and   sup_{δ∈∆} Σ_{λ∈Λ} ω_α(λ, δ)^{−k} < ∞.
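On finite truncations of the two index sets, the suprema in Definition 128 can be inspected numerically; boundedness of such truncated sums is of course only a plausibility check, not a proof of consistency. The sketch below is generic in the distance function; in practice one would pass the omega_alpha function from the sketch after Definition 126, the crude lambda used here being a purely illustrative placeholder.

import numpy as np

def consistency_sums(Lambda, Delta, dist, k):
    """Truncated versions of the two suprema in Definition 128:
    sup over lambda of sum over delta of dist(lambda, delta)^(-k), and vice versa."""
    W = np.array([[dist(lam, delta) ** (-k) for delta in Delta] for lam in Lambda])
    return W.sum(axis=1).max(), W.sum(axis=0).max()

# Illustrative finite index sets of triples (scale, angle, 2D location).
scales = [2.0 ** j for j in range(8)]
Lambda = [(s, 0.0, (0.0, 0.0)) for s in scales]
Delta = [(s, 0.2, (0.5, 0.5)) for s in scales]
crude_dist = lambda lam, delta: max(lam[0] / delta[0], delta[0] / lam[0]) * (1 + abs(lam[1] - delta[1]))
print(consistency_sums(Lambda, Delta, crude_dist, k=2.0))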

This definition is the last notion needed to state the following theorem, which gives a first positive result on the existence of sparsity equivalent frames. It employs the concept of (α, k)-consistency in an interpolation result to give an easily checkable criterion for sparsity equivalence.

Theorem 129 ([36, Theorem 3.4]). Let α ∈ [0, 1], k > 0 and 0 < p < 1. Let (m_λ)_{λ∈Λ} and (p_δ)_{δ∈∆} be two frames of α-molecules of order (L, M, N₁, N₂) with (α, k)-consistent parameterizations (Λ, Φ_Λ) and (∆, Φ_∆) satisfying

L ≥ 2k/p,  M > 3k/p − (3 − α)/2,  N₁ ≥ k/p + (1 + α)/2  and  N₂ ≥ 2k/p.

Then (m_λ)_{λ∈Λ} and (p_δ)_{δ∈∆} are sparsity equivalent in ℓ^p.

This result is then used in [36] to transfer approximation results for α-Curvelets (see, e.g., [36, Section 2.1.1]) to any other frame of α-molecules which, in combination with the frame of α-Curvelets, satisfies the conditions of Theorem 129. One class of model functions often considered are the cartoon-like functions with parameter β ∈ (1, 2],

E^β(R²) := { f ∈ L²(R²) : f = f₀ + f₁ · χ_B },   (6.36)

where f₀, f₁ ∈ C²₀([0, 1]²) and B ⊂ [0, 1]² is a Jordan domain with a regular closed piecewise smooth C^β-curve as boundary. We consider best-s-term approximations: if f_s is the best approximation of f using only s ∈ N atoms of any kind, i.e., wavelets, α-molecules and the like, the best achievable approximation rate is

‖f − f_s‖₂² = O(s^{−β})   for s → ∞,

as [29] shows for a slightly more restricted system. In [59, Theorem 1.4] the authors show that compactly supported shearlets realize a best-s-term approximation rate for f ∈ E²(R²) of

‖f − f_s‖₂² = O(s^{−2} log³(s))   for s → ∞.

Moreover, setting α = 1/β, [38, Theorems 5.6, 5.10, 5.12 or 5.13] show that α-Curvelets and α-shearlets almost realize this approximation rate, i.e.,

‖f − f_s‖₂² ≤ Cs^{−β+ε}   for arbitrary ε > 0, for s → ∞.

α-shearlets are defined the same way as the shearlets in Definition 52, only with the parameter 1/2 in the shearing matrix replaced by α. Then, via Theorem 129, which is [38, Theorem 5.6], Grohs et al. establish that the same approximation rate also holds for all α-molecules that are sparsity equivalent to the α-Curvelets, e.g., α-shearlets.
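These decay rates can be observed empirically. The following minimal sketch (assuming NumPy and PyWavelets are available; the disc indicator is a crude stand-in for a cartoon-like function) computes ‖f − f_s‖₂² of the best s-term wavelet approximation for several values of s. Note that for wavelets the guaranteed rate on cartoon-like functions is only of order s^{−1}, so the sketch illustrates the methodology rather than the optimal shearlet rate.

import numpy as np
import pywt

# Crude cartoon-like test image: a disc indicator on [0, 1]^2.
n = 256
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
f = ((x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.1).astype(float)

coeffs = pywt.wavedec2(f, "db4")
arr, slices = pywt.coeffs_to_array(coeffs)

for s in [100, 400, 1600, 6400]:
    # Keep the s largest coefficients in magnitude (ties may keep a few more).
    thresh = np.sort(np.abs(arr).ravel())[-s]
    arr_s = np.where(np.abs(arr) >= thresh, arr, 0.0)
    f_s = pywt.waverec2(pywt.array_to_coeffs(arr_s, slices, output_format="wavedec2"), "db4")
    print(s, np.sum((f - f_s[:n, :n]) ** 2))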

Now these results do not exactly apply to the situation we laid out in Section 6.2, as [36] uses the notion of the α-scaled distance, whereas we analyzed the cross-Gramian directly.

It might be fruitful to inquire further to what extent results like Theorem 129 allow for recovery results along the lines of Theorem 59. In the proof of Theorem 59 an Ansatz space is constructed using a proper decay of the cross-Gramian of the shearlet frame and the wavelet basis, so that the shearlet system becomes a pseudoframe for the Ansatz space, see Lemma 55.


Symbols

In the following, let S ⊂ T be sets, (X, µ) a measurable space, G, H be groups and H be a Hilbert space.

[n]   The set {1, . . . , n}
S^c or T \ S   The complement of S in T
L^p(X, µ)   The space of functions f on (X, µ) such that |f|^p is µ-integrable
L^p_ω(X, µ)   The space of functions f on (X, µ) such that (|f|ω)^p is µ-integrable
ℓ^p(Λ)   The set of sequences (x_λ)_{λ∈Λ} in C such that Σ_{λ∈Λ} |x_λ|^p is finite
ℓ^p_ω   The set of sequences (x_λ)_{λ∈Λ} in C such that Σ_{λ∈Λ} |x_λ|^p ω_λ^{2−p} or Σ_{λ∈Λ} |x_λ|^p ω_λ^p is finite, depending on the context
χ_T   The indicator function of T: χ_T(x) := 1 if x ∈ T, 0 otherwise
⟨·, ·⟩   The scalar product on a Hilbert space H
δ_{j,k}   The Kronecker symbol: δ_{j,k} = χ_{{j}}(k) = 1 if and only if j = k and 0 otherwise
B^β_0(D)   The β-times differentiable functions f on the domain D with f|_{∂D} ≡ 0
S(R^d)   The Schwartz class on R^d
S′(R^d)   The set of tempered distributions, the dual of the Schwartz class on R^d
B^{p,q}_l(D)   The Besov space of order l, p, q on a domain D, see pages 148, 149
D_A f(x)   D_A f(x) := √(det(A)) f(Ax), the dilation of a function f : R^d → R^k by a matrix A ∈ GL_d(R)
T_t f(x)   T_t f(x) := f(x − t), the translation of a function f : R^d → R^k by a vector t ∈ R^d
M_ξ f(x)   M_ξ f(x) = e^{2πi⟨ξ,x⟩} f(x), the modulation of a function f : R^d → R^k by a vector ξ ∈ R^d
O(n)   The orthogonal group of order n, i.e., the matrices Q ∈ R^{n×n} that fulfill QᵀQ = QQᵀ = Id
‖·‖_F   The Frobenius norm ‖X‖_F = √(Σ_{i,j} |x_{i,j}|²) = √(tr(XᵀX)) = √(Σ_{i=1}^r σ_i²), where r = rank(X) and σ_1, . . . , σ_r are the singular values of X
⟨A, B⟩_F   The Frobenius inner product ⟨A, B⟩_F = tr(AᵀB) for matrices A, B ∈ R^{n×k}
κ(Z)   The condition number of a matrix, i.e., the ratio of the largest and the smallest singular value
d(Z, Z̃)   The distance up to orthonormal transformation of two matrices Z, Z̃ ∈ R^{n×k}: d(Z, Z̃) := min_{U∈O(k)} ‖Z − Z̃U‖_F
N(T, d, t)   The covering number of a set T, see Definition 98
P(T, d, t)   The packing number of a set T, see Definition 98
γ_α(T, d)   The γ_α functional, see Definition 96
∆_d(T)   ∆_d(T) = max_{t∈T} d(t), the diameter of a set T according to the metric d
∆_h f(x)   ∆_h f(x) = f(x − h) − f(x)
ω²_p(f, t)   The modulus of continuity ω²_p(f, t) = sup_{|h|≤t} ‖∆²_h f‖_p with p ∈ [1,∞)
S   The shearlet group, see Remark 121
G ⋉ H   The outer semidirect product of the groups G, H
V_ψ f   The voice transform V_ψ f : G → C, g ↦ ⟨f, π(g)ψ⟩, where π is a unitary representation of G on H
V^¬   The anti-dual of V, i.e., the space of all bounded, conjugate-linear functionals on V
Co W(L^∞, Y)   A coorbit space, see (6.30)
x ⊙ y   Pointwise multiplication of two vectors, i.e., (x ⊙ y)_i = x_i · y_i
1_A(x)   The indicator function of a set A, i.e., 1_A(x) = 1 if and only if x ∈ A and 0 otherwise
S¹   The unit circle S¹ = {x ∈ R² : x₁² + x₂² = 1}


Index

D-RIP, 60
S_{ω,p}, 25
U-density, 158
Ω-BP, 19
Ω-BPDN, 19
ℓ^p_ω-norms, 15
ℓ^p_ω-spaces, 17
γ_α-functional, 141
Q-moderate covering, 152
Q-moderate weight, 152
ω-BP, 18
ω-BPDN, 18
ψ_2-norm, 115
σ_s(x)_{ω,p}, 17
H_s(x)_ω, 36
Admissible Covering, 151
    Structured, 153
Algorithm
    Ω-BP, 19
    Ω-BPDN, 19
    ω-BP, 18
    Mini Batch SGD, 131
    Randomized SGD, 129
    RARM, 100
    AltMinPhase, 95
    AltMinSense, 98
    Basis Pursuit, 18
    Basis Pursuit Denoizing, 18
    Gerchberg-Saxton, 95
    Nuclear Norm Minimization, 97
    Phase Lift, 23
    Rank Projection SGD, 125
    Robust Affine Rank Minimization, 98
    SVDRARM, 98
    WIHT, 37
    Wirtinger Flow, 23
AltMinSense, 98
Analysis Operator, 150
Anti Dual, 158
BAPU, 151
Besov space, 146
Bowling Scheme, 141
Canonical Dual Frame, 151
Coherence, 20
Compressed sensing, 15
Compressibility, 17, 41
Control Function, 157
Cosparsity, 17
Covering Number, 142
Dilation, 24
directional derivative, 64
Dual Frame, 151
Dudley Entropy Integral, 141
Dudley's Inequality, 141
Escape through the Mesh, 141
Fourier multipliers, 151
Fourier Transform, 24
Frame
    bivariate Haar, 64
    univariate Haar, 64
    Parseval, 151
    shearlet, 155
Frame operator, 151
Frames, 150
Gaussian Orthogonal Ensemble (GOE), 99
Gaussian Width, 141
General Weighted Function Spaces, 25
Golfing Scheme, 141
Haar basis, 64
Hanson-Wright Inequality, 115
Height of covering, 151
Local Coherence, 29
Localization Factor, 60
mini-batch
    function, 130
    sampling, 130
Mixed Tail, 117
Modulation, 24
Modulus of continuity, 146
Multilevel Sampling Scheme, 28
Null Space Property, 20
    robust, 21
    stable, 21
NWIHT, 37
orthogonalization measure, 25
Packing Number, 142
Parseval Frame, 151
Partition of unity, 146
Phase Lift, 23
Phase Retrieval, 23
Preconditioning function, 33
Pseudoframe, 62
Radon Transform, 55
Random matrix, 142
    Bounded Orthonormal Systems, 143
    Discrete, 143
    Gaussian, 142
    Subgaussian, 143
RARM, 98
Regularity Condition, 105
    R(τ, ρ), 103
    local curvature, 106
    local smoothness, 106
relatively separated, 158
Restricted isometry in Levels, 28
RIP, 20
    weighted, 30
RIPL, 28
shearlet
    Frame, 155
    Group, 157
    System, 70
        Boundary, 71
        Coarse Scale, 71
        Interior, 71
Small Ball Method, 141
Spaces
    Wiener Amalgam, 158
    Besov, 146
    Besov, bounded domain, 148
    Coorbit, 157, 158
    Decomposition, 152
        shearlet, 72
        Smoothness, 154
        wavelet, 71
    shearlet Smoothness, 72
sparsity, 15
    set of sparse vectors, 15
Sparsity in levels, 27
Spectral Initialization, 99
Stechkin estimate, 17
    p ∈ (0, 2], 18
    p ∈ (0, ∞), 18
Synthesis Operator, 150, 151
Translation, 24
TV semi-norm, 64
Vanishing Moment, 75
Voice Transform, 157
wavelet ONB, 147
Weighted Function Spaces, 25
WIHT, 37
Window function, 158

Page 171: Weighted l1-Analysis minimization and stochastic gradient

Bibliography

[1] Alberti, Giovanni S and Santacesaria, Matteo. Infinite dimensional compressed sensing fromanisotropic measurements and applications to inverse problems in PDE. Applied and Compu-tational Harmonic Analysis, 2019.

[2] Alizadeh, Farid. Interior point methods in semidefinite programming with applications to com-binatorial optimization. SIAM journal on Optimization, 5(1):13–51, 1995.

[3] Almeida, Alexandre. Wavelet bases in generalized Besov spaces. Journal of mathematicalanalysis and applications, 304(1):198–211, 2005.

[4] Alonso, Mariví Tello and López-Dekker, Paco and Mallorquí, Jordi J. A novel strategy for radarimaging based on compressive sensing. IEEE Transactions on Geoscience and Remote Sensing,48(12):4285–4295, 2010.

[5] Axel Obermaier. Winter School "Computational Harmonic Analysis - with Applications toSignal and Image Processing", 2014.

[6] Bian, Junguo and Siewerdsen, Jeffrey H and Han, Xiao and Sidky, Emil Y and Prince, Jerry Land Pelizzari, Charles A and Pan, Xiaochuan. Evaluation of sparse-view reconstruction fromflat-panel-detector cone-beam CT. Physics in Medicine & Biology, 55(22):6575, 2010.

[7] Blumensath, Thomas and Davies, Michael E. Normalized iterative hard thresholding: Guaran-teed stability and performance. Selected Topics in Signal Processing, IEEE Journal of, 4(2):298–309, 2010.

[8] Borup, Lasse and Nielsen, Morten. Frame decomposition of decomposition spaces. Journal ofFourier Analysis and Applications, 13(1):39–70, 2007.

[9] Brugiapaglia, Simone and Dirksen, Sjoerd and Jung, Hans Christian and Rauhut, Holger. Sparserecovery in bounded Riesz systems with applications to numerical methods for PDEs. arXivpreprint arXiv:2005.06994, 2020.

[10] Bunk, Oliver and Diaz, Ana and Pfeiffer, Franz and David, Christian and Schmitt, Bernd andSatapathy, Dillip K and Van Der Veen, J Friso. Diffractive imaging for periodic samples: retriev-ing one-dimensional concentration profiles across microfluidic channels. Acta CrystallographicaSection A: Foundations of Crystallography, 63(4):306–314, 2007.

[11] Cai, T Tony and Zhang, Anru. Sharp RIP bound for sparse signal and low-rank matrix recovery.Applied and Computational Harmonic Analysis, 35(1):74–93, 2013.

[12] Candes, Emmanuel. Applied Fourier Analysis and Elements of Modern Signal Processing. http://statweb.stanford.edu/~candes/teaching/math262/Lectures/Lecture09.pdf, Course Ma-terial, University of Standford, February 2016. [Online; accessed December 21, 2020].


[13] Candes, Emmanuel and Romberg, Justin. l1-magic: Recovery of sparse signals via con-vex programming; Matlab Library Documentation. https://statweb.stanford.edu/~candes/software/l1magic/downloads/l1magic.pdf, 2005. [Online; accessed December 21, 2020].

[14] Candes, Emmanuel J and Eldar, Yonina C and Needell, Deanna and Randall, Paige. Compressedsensing with coherent and redundant dictionaries. Applied and Computational Harmonic Anal-ysis, 31(1):59–73, 2011.

[15] Candes, Emmanuel J and Li, Xiaodong and Soltanolkotabi, Mahdi. Phase retrieval via Wirtingerflow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007,2015.

[16] Candes, Emmanuel J and Plan, Yaniv. Matrix completion with noise. Proceedings of the IEEE,98(6):925–936, 2010.

[17] Candes, Emmanuel J and Plan, Yaniv. Tight oracle inequalities for low-rank matrix recoveryfrom a minimal number of noisy random measurements. IEEE Transactions on InformationTheory, 57(4):2342–2359, 2011.

[18] Candès, Emmanuel J and Romberg, Justin and Tao, Terence. Robust uncertainty principles:Exact signal reconstruction from highly incomplete frequency information. IEEE Transactionson information theory, 52(2):489–509, 2006.

[19] Chafaı, Djalil and Guédon, Olivier and Lecué, Guillaume and Pajor, Alain. Interactions be-tween compressed sensing, random matrices, and high dimensional geometry. https://djalil.chafai.net/docs/LAMABOOK/lamabook-draft.pdf, 2012. [Online; accessed December 21, 2020].

[20] Chai, Anwei and Moscoso, Miguel and Papanicolaou, George. Array imaging using intensity-only measurements. Inverse Problems, 27(1):015005, 2010.

[21] Chan, Tony F and Shen, Jianhong Jackie. Image processing and analysis: variational, PDE,wavelet, and stochastic methods. Siam, 2005.

[22] Chen, Jinghui and Wang, Lingxiao and Zhang, Xiao and Gu, Quanquan. Robust Wirtingerflow for phase retrieval with arbitrary corruption. arXiv preprint arXiv:1704.06256, 2017.

[23] Cohen, Albert and DeVore, Ronald and Petrushev, Pencho and Xu, Hong. Nonlinear approxi-mation and the space BV. American Journal of Mathematics, 121(3):587–628, 1999.

[24] Corbett, John V. The Pauli problem, state reconstruction and quantum-real numbers. Reportson Mathematical Physics, 1(57):53–68, 2006.

[25] CVX Research, Inc. TFOCS-1.3.1. http://cvxr.com/tfocs/, 2013. [Online; accessed January14, 2020].

[26] Dahlke, Stephan and Kutyniok, Gitta and Steidl, Gabriele and Teschke, Gerd. Shearlet coorbit spaces and associated Banach frames. Applied and Computational Harmonic Analysis, 27(2):195–214, 2009.

[27] Daubechies, Ingrid and others. Ten lectures on wavelets, volume 61. SIAM, 1992.

[28] Dirksen, Sjoerd and others. Tail bounds via generic chaining. Electronic Journal of Probability,20, 2015.

[29] Donoho, David Leigh. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17(3):353–382, 2001.


[30] Ender, Joachim HG. On compressive sensing applied to radar. Signal Processing, 90(5):1402–1414, 2010.

[31] Felix Voigtlaender. Embedding Theorems for Decomposition Spaces with Applications to WaveletCoorbit Spaces. Doctoral Thesis, RWTH Aachen University, 2015.

[32] Fienup, James R. Reconstruction of an object from the modulus of its Fourier transform. Opticsletters, 3(1):27–29, 1978.

[33] Foucart, Simon. Stability and robustness of l1-minimizations with Weibull matrices and redun-dant dictionaries. Linear Algebra and its Applications, 441:4–21, 2014.

[34] Foucart, Simon and Rauhut, Holger. A mathematical introduction to compressive sensing.Springer, 2013.

[35] Grafakos, Loukas. Classical and Modern Fourier Analysis, volume 1. Pearson / Prentice-Hall,2004.

[36] Grohs, Philipp and Keiper, Sandra and Kutyniok, Gitta and Schaefer, Martin. Alpha molecules: curvelets, shearlets, ridgelets, and beyond. In Wavelets and Sparsity XV, volume 8858, page 885804. International Society for Optics and Photonics, 2013.

[37] Grohs, Philipp and Keiper, Sandra and Kutyniok, Gitta and Schäfer, Martin. Alpha Molecules. arXiv preprint arXiv:1407.4424, 2014.

[38] Grohs, Philipp and Keiper, Sandra and Kutyniok, Gitta and Schäfer, Martin. α-Molecules. Applied and Computational Harmonic Analysis, 41(1):297–336, 2016.

[39] Grohs, Philipp and Kutyniok, Gitta and Ma, Jackie and Petersen, Philipp and Raslan, Mones.Anisotropic multiscale systems on bounded domains. Advances in Computational Mathematics,46:1–33, 2020.

[40] Gross, David. Recovering low-rank matrices from few coefficients in any basis. IEEE Transac-tions on Information Theory, 57(3):1548–1566, 2011.

[41] Guo, Kanghui and Kutyniok, Gitta and Labate, Demetrio. Sparse multidimensional representa-tions using anisotropic dilation and shear operators. Wavelets und Splines (Athens, GA, 2005),G. Chen und MJ Lai, eds., Nashboro Press, Nashville, TN, pages 189–201, 2006.

[42] Guo, Kanghui and Labate, Demetrio. Optimally sparse multidimensional representation usingshearlets. SIAM journal on mathematical analysis, 39(1):298–318, 2007.

[43] Härdle, Wolfgang and Kerkyacharian, Gerard and Picard, Dominique and Tsybakov, Alexander.Wavelets, approximation, and statistical applications, volume 129. Springer Science & BusinessMedia, 2012.

[44] Haroske, Dorothee D and Schneider, Cornelia. Besov spaces with positive smoothness on Rn,embeddings and growth envelopes. Journal of Approximation Theory, 161(2):723–747, 2009.

[45] Hauptman, Herbert A. The phase problem of x-ray crystallography. Reports on Progress inPhysics, 54(11):1427, 1991.

[46] Jain, Prateek and Meka, Raghu and Dhillon, Inderjit S. Guaranteed rank minimization viasingular value projection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.


[47] Jain, Prateek and Netrapalli, Praneeth and Sanghavi, Sujay. Low-rank matrix completion usingalternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theoryof computing, pages 665–674, 2013.

[48] Jo, Jason. Iterative Hard Thresholding for Weighted Sparse Approximation. arXiv preprintarXiv:1312.3582, 2013.

[49] Kabanava, Maryia and Rauhut, Holger. Analysis 1-recovery with frames and Gaussian mea-surements. Acta Applicandae Mathematicae, 140(1):173–195, 2015.

[50] Karp, Richard M. Reducibility among combinatorial problems. In Complexity of computercomputations, pages 85–103. Springer, 1972.

[51] Kempka, Henning and Schäfer, Martin and Ullrich, Tino. General coorbit space theory for quasi-Banach spaces and inhomogeneous function spaces with variable smoothness and integrability.Journal of Fourier Analysis and Applications, 23(6):1348–1407, 2017.

[52] Kittipoom, Pisamai and Kutyniok, Gitta and Lim, Wang-Q. Construction of compactly sup-ported shearlet frames. Constructive Approximation, 35(1):21–72, 2012.

[53] Krahmer, Felix and Needell, Deanna and Ward, Rachel. Compressive sensing with redun-dant dictionaries and structured measurements. SIAM Journal on Mathematical Analysis,47(6):4606–4629, 2015.

[54] Krahmer, Felix and Ward, Rachel. New and improved Johnson–Lindenstrauss embeddings viathe restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281,2011.

[55] Krahmer, Felix and Ward, Rachel. Stable and robust sampling strategies for compressive imag-ing. IEEE transactions on image processing, 23(2):612–622, 2013.

[56] Kutyniok, Gitta and Labate, Demetrio. Resolution of the wavefront set using continuous shear-lets. Transactions of the American Mathematical Society, 361(5):2719–2754, 2009.

[57] Kutyniok, Gitta and Labate, Demetrio. Shearlets: Multiscale analysis for multivariate data.Springer Science & Business Media, 2012.

[58] Kutyniok, Gitta and Lemvig, Jakob and Lim, Wang-Q. Compactly supported shearlets. InApproximation Theory XIII: San Antonio 2010, pages 163–186. Springer, 2012.

[59] Kutyniok, Gitta and Lim, Wang-Q. Compactly supported shearlets are optimally sparse. Journal of Approximation Theory, 163(11):1564–1589, 2011.

[60] Kutyniok, Gitta and Lim, Wang-Q. Dualizable shearlet frames and sparse approximation.Constructive Approximation, 44(1):53–86, 2016.

[61] Kutyniok, Gitta and Lim, Wang-Q and Reisenhofer, Rafael. ShearLab 3D. https://orms.mfo.de/project?id=346. [Online; accessed December 20, 2020].

[62] Labate, Demetrio and Lim, Wang-Q and Kutyniok, Gitta and Weiss, Guido. Sparse multidimen-sional representation using shearlets. In Wavelets XI, volume 5914, page 59140U. InternationalSociety for Optics and Photonics, 2005.

[63] Labate, Demetrio and Mantovani, Lucia and Negi, Pooran. Shearlet smoothness spaces. Journal of Fourier Analysis and Applications, 19(3):577–611, 2013.

[64] Lecué, Guillaume and Mendelson, Shahar and others. Regularization and the small-ball methodi: sparse recovery. The Annals of Statistics, 46(2):611–641, 2018.


[65] Li, Chen and Adcock, Ben. Compressed sensing with local structure: uniform recovery guaran-tees for the sparsity in levels class. Applied and Computational Harmonic Analysis, 46(3):453–477, 2019.

[66] Li, Shidong and Ogawa, Hidemitsu. Pseudoframes for subspaces with applications. Journal ofFourier Analysis and Applications, 10(4):409–431, 2004.

[67] Lustig, Michael and Donoho, David and Pauly, John M. Sparse MRI: The application of com-pressed sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journalof the International Society for Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.

[68] Maryia Kabanava and Holger Rauhut and Hui Zhang. Robust analysis l1-recovery from Gaus-sian measurements and total variation minimization. CoRR, abs/1407.7402, 2014.

[69] Meka, Raghu and Jain, Prateek and Caramanis, Constantine and Dhillon, Inderjit S. Rank min-imization via online learning. In Proceedings of the 25th International Conference on Machinelearning, pages 656–663. ACM, 2008.

[70] Mohan, Karthik and Fazel, Maryam. New restricted isometry results for noisy low-rank recovery.In 2010 IEEE International Symposium on Information Theory, pages 1573–1577. IEEE, 2010.

[71] Murty, Katta G and Kabadi, Santosh N. Some NP-complete problems in quadratic and nonlinearprogramming. Mathematical Programming , 39:117–129, 1987.

[72] Nam, Sangnam and Davies, Mike E and Elad, Michael and Gribonval, Rémi. The cosparseanalysis model and algorithms. Applied and Computational Harmonic Analysis, 34(1):30–56,2013.

[73] Needell, Deanna and Ward, Rachel. Stable image reconstruction using total variation mini-mization. SIAM Journal on Imaging Sciences, 6(2):1035–1058, 2013.

[74] Netrapalli, Praneeth and Jain, Prateek and Sanghavi, Sujay. Phase retrieval using alternatingminimization. In Advances in Neural Information Processing Systems, pages 2796–2804, 2013.

[75] Niu, Shanzhou and Gao, Yang and Bian, Zhaoying and Huang, Jing and Chen, Wufan and Yu,Gaohang and Liang, Zhengrong and Ma, Jianhua. Sparse-view X-ray CT reconstruction viatotal generalized variation regularization. Physics in Medicine & Biology, 59(12):2997, 2014.

[76] R. W. Gerchberg and W. O. Saxton. A practical algorithm for the determination of phase fromimage and diffraction plane pictures. Optik 35, pp. 237, 1972.

[77] Rauhut, Holger. Coorbit space theory for quasi-Banach spaces. Studia Mathematica, 180(3):237–253, 2007.

[78] Rauhut, Holger. Compressive sensing and structured random matrices. Theoretical foundationsand numerical methods for sparse recovery, 9:1–92, 2010.

[79] Rauhut, Holger and Ullrich, Tino. Generalized coorbit space theory and inhomogeneous functionspaces of Besov–Lizorkin–Triebel type. Journal of Functional Analysis, 260(11):3299–3362,2011.

[80] Rauhut, Holger and Ward, Rachel. Interpolation via weighted 1 minimization. Applied andComputational Harmonic Analysis, 40(2):321–351, 2016.

[81] Recht, Benjamin and Fazel, Maryam and Parrilo, Pablo A. Guaranteed minimum-rank solutionsof linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.


[82] Reichenbach, Hans. Philosophic foundations of quantum mechanics. Courier Corporation, 1998.

[83] Rossi, Marco and Haimovich, Alexander M and Eldar, Yonina C. Spatial compressive sensingfor MIMO radar. IEEE Transactions on Signal Processing, 62(2):419–430, 2013.

[84] Rudelson, Mark and Vershynin, Roman and others. Hanson-Wright inequality and sub-gaussianconcentration. Electronic Communications in Probability, 18, 2013.

[85] Ruderman, Daniel L and Bialek, William. Statistics of natural images: Scaling in the woods.In Advances in neural information processing systems, pages 551–558, 1994.

[86] Sayre, David. X-ray crystallography: The past and present of the phase problem. StructuralChemistry, 13(1):81–96, 2002.

[87] Schäfer, Martin. The framework of α-molecules: Theory and Applications. Doctoral Thesis, Technische Universität Berlin, Berlin, 2018.

[88] Shen, Xiaohui and Wu, Ying. A unified approach to salient object detection via low rank matrixrecovery. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 853–860. IEEE, 2012.

[89] Sophia Bethany Coban. SophiaBeads-datasets. https://github.com/Sophilyplum/sophiabeads-datasets, 2015. [Online; accessed January 14, 2020].

[90] Talagrand, Michel. The generic chaining: upper and lower bounds of stochastic processes.Springer Science & Business Media, 2006.

[91] Talagrand, Michel. Upper and lower bounds for stochastic processes: modern methods andclassical problems, volume 60. Springer Science & Business Media, 2014.

[92] Tillmann, Andreas M and Pfetsch, Marc E. The computational complexity of the restrictedisometry property, the nullspace property, and related concepts in compressed sensing. IEEETransactions on Information Theory, 60(2):1248–1259, 2013.

[93] Tolhurst, DJ and Tadmor, Y and Chao, Tang. Amplitude spectra of natural images. Ophthalmicand Physiological Optics, 12(2):229–232, 1992.

[94] Triebel, Hans. Theory of function spaces. Birkhäuser Verlag, Basel, Boston, Berlin, 1992.

[95] Tropp, Joel A. Convex recovery of a structured signal from independent random linear mea-surements. In Sampling Theory, a Renaissance, pages 67–101. Springer, 2015.

[96] Vybíral, Jan. On sharp embeddings of Besov and Triebel-Lizorkin spaces in the subcritical case.Proceedings of the American Mathematical Society, 138(1):141–146, 2010.

[97] Zhang, Huishuai and Liang, Yingbin. Reshaped wirtinger flow for solving quadratic system ofequations. In Advances in Neural Information Processing Systems, pages 2622–2630, 2016.

[98] Zheng, Qinqing and Lafferty, John. A convergent gradient descent algorithm for rank minimiza-tion and semidefinite programming from random linear measurements. In Advances in NeuralInformation Processing Systems, pages 109–117, 2015.


Statutory Declaration (Eidesstattliche Erklärung)

Jonathan Fell hereby declares that this dissertation and the contents presented therein are his own and were produced independently, as the result of his own original research.

I hereby declare in lieu of an oath:

1. This work was produced entirely or mainly during my time as a doctoral candidate at this faculty and university;

2. Where any part of this dissertation has previously been used for an academic degree or any other qualification at this or any other institution, this has been clearly indicated;

3. Wherever my own previous publications or publications of third parties were consulted, these have been clearly attributed;

4. Where my own or third-party publications have been quoted, the source has always been given. With the exception of such quotations, this dissertation is entirely my own work;

5. All substantial sources of support have been acknowledged;

6. Wherever a part of this dissertation is based on collaboration with others, I have clearly indicated what was contributed by others and what I contributed myself;

7. No part of this work was published prior to its submission.

Jonathan Fell, Düsseldorf, 28 December 2020
